Chapter 37 Model Selection and Out-of-Sample Validation
Sometimes, we use a regression model to describe a multivariate relationship, and this requires only that we provide an adequate fit to the data at hand. Should we want to use the model for prediction, then a more appropriate standard requires us to fit a model that does each of the following three things:
- [fits well] Provides an adequate fit to the data at hand
- [parsimonious] Includes only those predictors which actually have detectable predictive value
- [predicts well out of sample] Does a better job predicting our outcome in new data than other alternative models
We’ll spend considerable time in 432 studying how best to validate a regression model, but for now, we’ll focus on a pair of issues:
- Given a set of predictors, how should we let the computer help us understand which subsets of those predictors are most worthy of our additional attention?
- Given a pair of models to compare, what tools should we use to determine which one better predicts new data?
37.1 Using the WCGS Data to predict Cholesterol Level
To address these issues, I’ll look at one of our old examples: the wcgs data (Western Collaborative Group Study), described in Section 13. We’ll try to predict the variable chol on the basis of some subset of the following five predictors: age, bmi, sbp, dbp and smoke.
The steps are:
- Check the wcgs data for missing or out-of-range values in the variables under study.
- Separate the wcgs data into a test sample of 500 observations selected at random, and a model development (training) sample consisting of the remaining 2654 observations.
- Using only the model development sample, run a stepwise regression algorithm, working off of the kitchen sink model with backwards selection, to identify a reasonable candidate for a model. Call this model A.
- Develop a second potential model (called model B) for the model development data by eliminating the least clearly helpful predictor in model A.
- Use the AIC to compare model A to model B in the development sample.
- Finally, moving forward to the holdout sample, compare the quality of predictions made by model A to those made by model B in terms of two of the many possible criteria, [i] mean squared prediction error and [ii] mean absolute prediction error, to see if either model (A or B) is clearly superior in terms of out-of-sample prediction.
37.2 Separating the Data into a Training and a Test Sample
There are several ways to partition the data into training (model development) and test (model checking) samples. For example, we could develop separate data frames for a holdout sample of 500 randomly selected subjects (wcgs.test), and then use the remainder as the model development sample (wcgs.dev). Remember to set a seed so that you can replicate the selection.
set.seed(4311); wcgs.test <- wcgs %>% sample_n(500)
## hold out exactly 500 randomly selected observations
wcgs.dev <- anti_join(wcgs, wcgs.test, by = "id")
## model development sample - 2654 observations
wcgs.test
# A tibble: 500 x 22
id age agec height weight lnwght wghtcat bmi sbp lnsbp dbp
<int> <int> <fct> <int> <int> <dbl> <fct> <dbl> <int> <dbl> <int>
1 11424 40 35-40 72 202 5.31 > 200 27.4 138 4.93 84
2 12255 40 35-40 68 165 5.11 140-170 25.1 120 4.79 80
3 11100 44 41-45 72 196 5.28 170-200 26.6 118 4.77 80
4 10162 45 41-45 68 136 4.91 < 140 20.7 122 4.80 78
5 10236 51 51-55 65 159 5.07 140-170 26.5 106 4.66 80
6 19077 52 51-55 70 152 5.02 140-170 21.8 124 4.82 86
7 22064 49 46-50 71 152 5.02 140-170 21.2 124 4.82 76
8 21026 48 46-50 70 170 5.14 140-170 24.4 130 4.87 86
9 12356 42 41-45 74 190 5.25 170-200 24.4 144 4.97 78
10 11232 39 35-40 73 182 5.20 170-200 24.0 116 4.75 76
# ... with 490 more rows, and 11 more variables: chol <int>, behpat <fct>,
# dibpat <fct>, smoke <fct>, ncigs <int>, arcus <int>, chd69 <fct>,
# typchd69 <int>, time169 <int>, t1 <dbl>, uni <dbl>
dim(wcgs.dev)
[1] 2654 22
37.2.1 Using a specified fraction of the data in the test sample
If we’d wanted to select 20% of the data for our test sample, we could have instead used the sample_frac and anti_join commands. For the wcgs data, which has a unique id variable that identifies each subject, we’d have…
set.seed(43199); wcgs.train80 <- wcgs %>% sample_frac(0.80)
wcgs.test20 <- anti_join(wcgs, wcgs.train80, by="id")
dim(wcgs.train80)
[1] 2523 22
dim(wcgs.test20)
[1] 631 22
Given a large sample size (at least 500 observations in the full data set), I would usually think about holding out somewhere between 15% and 25% of the data in this manner.
37.3 Stepwise Regression to Select Predictors
We next use the wcgs.dev (development) sample to run a stepwise procedure, beginning with the kitchen sink model that includes all potential predictors.
37.3.1 The Kitchen Sink Model
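The summary output shown below can be reproduced with code along these lines; the object name model.kitchensink is taken from the AIC and BIC comparisons later in this chapter, since the original code chunk is not displayed here.
model.kitchensink <- lm(chol ~ age + bmi + sbp + dbp + smoke, data = wcgs.dev)
summary(model.kitchensink)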
Call:
lm(formula = chol ~ age + bmi + sbp + dbp + smoke, data = wcgs.dev)
Residuals:
Min 1Q Median 3Q Max
-130.8 -28.4 -3.0 25.1 418.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 135.3671 11.4050 11.87 < 2e-16 ***
age 0.5555 0.1520 3.65 0.00026 ***
bmi 0.7772 0.3493 2.23 0.02615 *
sbp 0.1307 0.0885 1.48 0.13977
dbp 0.2963 0.1388 2.14 0.03285 *
smokeYes 10.3565 1.6967 6.10 1.2e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 42.8 on 2640 degrees of freedom
(8 observations deleted due to missingness)
Multiple R-squared: 0.0352, Adjusted R-squared: 0.0333
F-statistic: 19.2 on 5 and 2640 DF, p-value: <2e-16
37.3.2 Stepwise (Backward Elimination) Procedure
Start: AIC=19892
chol ~ age + bmi + sbp + dbp + smoke
Df Sum of Sq RSS AIC
<none> 4847288 19892
- sbp 1 4006 4851294 19892
- dbp 1 8370 4855658 19894
- bmi 1 9092 4856380 19895
- age 1 24516 4871804 19903
- smoke 1 68405 4915693 19927
Call:
lm(formula = chol ~ age + bmi + sbp + dbp + smoke, data = wcgs.dev)
Coefficients:
(Intercept) age bmi sbp dbp
135.367 0.555 0.777 0.131 0.296
smokeYes
10.357
The backward elimination procedure finds that no single removal would lower the AIC: dropping sbp would leave the AIC essentially unchanged (both values round to 19892), while dropping any other predictor would raise it, so the procedure stops and retains all five predictors. Since sbp contributes so little, we will also consider the four-predictor model containing age, bmi, dbp and smoke.
37.3.3 Three Candidate Models
For purposes of this exercise, we’ll call this four-predictor model model.A and compare it to a three-predictor model with age, dbp and smoke, which we’ll call model.B. Together with the kitchen sink model, these are our three candidate models.
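Although the fitting code does not appear until Section 37.5, the comparisons that follow assume that Models A and B have been fit to the development sample, for example:
model.A <- lm(chol ~ age + bmi + dbp + smoke, data = wcgs.dev)
model.B <- lm(chol ~ age + dbp + smoke, data = wcgs.dev)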
37.4 AIC, ANOVA and BIC to assess Candidate Models
The stepwise regression output specifies the AIC value for each model, but we can also look at other characteristics, like the ANOVA table comparing the various models, or the Bayesian Information Criterion, abbreviated BIC.
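The AIC table below can be produced with the AIC function applied to the three fitted models. (Note that step reports AIC on a different scale, omitting an additive constant, so its values are not directly comparable to these.)
AIC(model.kitchensink, model.A, model.B)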
df AIC
model.kitchensink 7 27403
model.A 6 27403
model.B 5 27406
AIC suggests Model A over Model B (smaller AIC values are better), and sees essentially no difference between Model A and the kitchen sink model.
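The ANOVA comparison of the three nested models shown next can be obtained with a call like:
anova(model.kitchensink, model.A, model.B)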
Analysis of Variance Table
Model 1: chol ~ age + bmi + sbp + dbp + smoke
Model 2: chol ~ age + bmi + dbp + smoke
Model 3: chol ~ age + dbp + smoke
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2640 4847288
2 2641 4851294 -1 -4006 2.18 0.14
3 2642 4861218 -1 -9923 5.40 0.02 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA comparison also suggests model A, for the following reasons:
- The p value of 0.14 indicates that moving from what we’ve called the kitchen sink model (model 1 in the ANOVA output) to what we’ve called model A (model 2 in the ANOVA output) does not have a statistically significant impact on predictive value.
- On the other hand, the p value of 0.02 indicates that moving from what we’ve called model A (model 2 in the ANOVA output) to what we’ve called model B (model 3 in the ANOVA output) does have a statistically significant impact on predictive value.
- Because these models are nested (model B is a proper subset of model A which is a proper subset of the kitchen sink) we can make these ANOVA comparisons directly.
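Similarly, the BIC table below can be produced with the BIC function:
BIC(model.kitchensink, model.A, model.B)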
df BIC
model.kitchensink 7 27444
model.A 6 27438
model.B 5 27436
BIC disagrees, and prefers model B, since its BIC is smaller. The penalty for fitting additional predictors in BIC varies with the number of observations, and so (especially with larger samples) we can get meaningfully different AIC and BIC selections.
37.5 Comparing Models in the Test Sample (MSPE, MAPE)
Finally, we’ll use our two candidate models (Model A and Model B) to predict the results in our holdout sample of 500 observations, to see which model performs better in these new data. (Remember that our holdout sample was not used to identify or fit Models A or B.)
To do this, we first carefully specify the two models being compared:
model.A <- lm(chol ~ age + bmi + dbp + smoke, data=wcgs.dev)
model.B <- lm(chol ~ age + dbp + smoke, data=wcgs.dev)
Next, we use predict to make predictions for the test data:
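A minimal sketch of that step (the names pred.A and pred.B are mine, since the original chunk is not shown):
pred.A <- predict(model.A, newdata = wcgs.test)   ## Model A predictions for the 500 test subjects
pred.B <- predict(model.B, newdata = wcgs.test)   ## Model B predictions for the 500 test subjects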
Just to fix ideas, here are the first few predictions for Model A…
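Assuming the Model A predictions are stored in pred.A as sketched above, rounding the first few to whole numbers gives the display below.
round(head(pred.A))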
1 2 3 4 5 6
231 217 231 215 225 224
We can compare these to the first few observed values of chol in the test sample.
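For instance:
head(wcgs.test$chol)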
[1] 215 198 175 208 348 315
Next, calculate errors (observed value minus predicted value) for each model:
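Continuing with the pred.A and pred.B names assumed above, a sketch of this calculation is:
err.A <- wcgs.test$chol - pred.A   ## observed minus predicted cholesterol, Model A
err.B <- wcgs.test$chol - pred.B   ## observed minus predicted cholesterol, Model B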
Again, just to be sure we understand, we look at the first few errors for Model A.
1 2 3 4 5 6
-16.36 -19.04 -56.22 -7.47 123.48 90.97
Next, we calculate the absolute errors (as |observed - predicted|) from each model in turn:
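For example:
abserr.A <- abs(err.A)   ## absolute prediction errors, Model A
abserr.B <- abs(err.B)   ## absolute prediction errors, Model B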
Let’s look at the first few absolute errors for Model A.
1 2 3 4 5 6
16.36 19.04 56.22 7.47 123.48 90.97
Next, we calculate the squared prediction errors from each model in turn:
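For example:
sqerr.A <- err.A^2   ## squared prediction errors, Model A
sqerr.B <- err.B^2   ## squared prediction errors, Model B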
And again, we look at the first few squared errors for Model A.
1 2 3 4 5 6
267.7 362.7 3160.8 55.8 15246.8 8275.4
To obtain our two key summaries, the mean absolute prediction error and the mean squared prediction error, I just use the summary function.
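With the names assumed above, the four summaries that follow would come from calls like these, presumably in this order: absolute errors for Model A, absolute errors for Model B, squared errors for Model A, and squared errors for Model B.
summary(abserr.A)
summary(abserr.B)
summary(sqerr.A)
summary(sqerr.B)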
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.1 10.9 27.2 32.3 48.4 148.2 4
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0 11.1 27.4 32.5 48.1 147.0 4
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 119 740 1740 2340 21963 4
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 123 749 1743 2318 21613 4
Model | MAPE | MSPE | Maximum Abs. Error |
---|---|---|---|
A (age + bmi + dbp + smoke) | 32.3 | 1740 | 148.2 |
B (age + dbp + smoke) | 32.5 | 1743 | 147.0 |
Note that smaller values on these metrics are better. Here, MAPE and MSPE each (barely) favor Model A over Model B, while the maximum absolute prediction error (barely) favors Model B. But really, there’s little to choose from. The NAs you see above refer to patients in the wcgs data with missing values on one or more of the variables included in our kitchen sink model. I absolutely should have identified that problem at the beginning, and either omitted those subjects or done some imputation back at the start. I’ll show that in the next section.
So, based on the test sample results, we might very slightly favor Model A, but really, there’s no meaningful difference between the two models’ out-of-sample predictions.