Chapter 43 Model Selection and Out-of-Sample Validation
Sometimes, we use a regression model to describe a multivariate relationship, and all that requires is an adequate fit to the data at hand. If instead we want to use the model for prediction, a more appropriate standard requires us to fit a model that does each of the following three things:
- [fits well] Provides an adequate fit to the data at hand
- [parsimonious] Includes only those predictors which actually have detectable predictive value
- [predicts well out of sample] Does a better job predicting our outcome in new data than other alternative models
We’ll spend considerable time in 432 studying how best to validate a regression model, but for now, we’ll focus on a pair of issues:
- Given a set of predictors, how should we let the computer help us understand which subsets of those predictors are most worthy of our additional attention?
- Given a pair of models to compare, what tools should we use to determine which one better predicts new data?
43.1 Using the WCGS Data to predict Cholesterol Level
To address these issues, I'll look at one of our old examples: the wcgs data (Western Collaborative Group Study), described in Section 13. We'll try to predict the variable chol on the basis of some subset of the following five predictors: age, bmi, sbp, dbp and smoke.
The steps are:
- Check the wcgs data for missing or out-of-range values in the variables under study.
- Separate the wcgs data into a test sample of 500 observations selected at random, and a model development (training) sample consisting of the remaining 2654 observations.
- Using only the model development sample, run a stepwise regression algorithm, using backwards selection from the kitchen sink model, to identify a reasonable candidate model. Call this model A.
- Develop a second potential model (called model B) for the model development data by eliminating the least clearly helpful predictor in model A.
- Use the AIC to compare model A to model B in the development sample.
- Finally, moving forward to the holdout sample, compare the quality of predictions made by model A to those made by model B in terms of two of the many possible criteria, [i] mean squared prediction error and [ii] mean absolute prediction error, to see if either model (A or B) is clearly superior in terms of out-of-sample prediction.
43.2 Separating the Data into a Training and a Test Sample
There are several ways to partition the data into training (model development) and test (model checking) samples. For example, we could develop separate data frames for a holdout sample of 500 randomly selected subjects (wcgs.test), and then use the remainder as the model development sample (wcgs.dev). Remember to set a seed so that you can replicate the selection.
set.seed(4311); wcgs.test <- wcgs %>% sample_n(500)
## hold out exactly 500 randomly selected observations
wcgs.dev <- anti_join(wcgs, wcgs.test, "id")
## model development sample - 2654 observations
wcgs.test
# A tibble: 500 x 22
id age agec height weight lnwght wghtcat bmi sbp lnsbp dbp
<int> <int> <fct> <int> <int> <dbl> <fct> <dbl> <int> <dbl> <int>
1 13215 43 41-45 69 165 5.11 140-170 24.4 110 4.70 70
2 21100 44 41-45 67 158 5.06 140-170 24.7 112 4.72 66
3 21141 52 51-55 70 180 5.19 170-200 25.8 136 4.91 86
4 3747 43 41-45 69 155 5.04 140-170 22.9 120 4.79 76
5 12256 54 51-55 68 143 4.96 140-170 21.7 116 4.75 68
6 11220 39 35-40 75 195 5.27 170-200 24.4 124 4.82 74
7 3079 40 35-40 69 153 5.03 140-170 22.6 120 4.79 76
8 13445 45 41-45 74 165 5.11 140-170 21.2 106 4.66 66
9 21064 50 46-50 68 166 5.11 140-170 25.2 128 4.85 78
10 3228 39 35-40 72 172 5.15 170-200 23.3 114 4.74 78
# ... with 490 more rows, and 11 more variables: chol <int>, behpat <fct>,
# dibpat <fct>, smoke <fct>, ncigs <int>, arcus <int>, chd69 <fct>,
# typchd69 <int>, time169 <int>, t1 <dbl>, uni <dbl>
dim(wcgs.dev)
[1] 2654 22
43.2.1 Using a specified fraction of the data in the test sample
If we'd wanted to select 20% of the data for our test sample, we could have instead used the sample_frac and anti_join commands. For the wcgs data, which has a unique id variable that identifies each subject, we'd have…
set.seed(43199); wcgs.train80 <- wcgs %>% sample_frac(0.80)
wcgs.test20 <- anti_join(wcgs, wcgs.train80, by="id")
dim(wcgs.train80)
[1] 2523 22
dim(wcgs.test20)
[1] 631 22
Given a large sample size (at least 500 observations in the full data set), I would usually think about holding out somewhere between 15% and 25% of the data in this manner.
43.3 Stepwise Regression to Select Predictors
We next use the wcgs.dev (model development) sample to run a stepwise procedure, beginning with the kitchen sink model, which includes all potential predictors.
43.3.1 The Kitchen Sink Model
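The summary output shown below can be reproduced with a call along these lines. This is a sketch, since the original code chunk isn't displayed, but the formula matches the Call line below and the name model.kitchensink matches the AIC and BIC tables in Section 43.4.
model.kitchensink <- lm(chol ~ age + bmi + sbp + dbp + smoke, data = wcgs.dev)
summary(model.kitchensink)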
Call:
lm(formula = chol ~ age + bmi + sbp + dbp + smoke, data = wcgs.dev)
Residuals:
Min 1Q Median 3Q Max
-129.9 -28.2 -2.4 25.4 418.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 132.1154 11.4386 11.55 < 2e-16 ***
age 0.4895 0.1542 3.17 0.0015 **
bmi 0.8527 0.3493 2.44 0.0147 *
sbp 0.0824 0.0878 0.94 0.3481
dbp 0.4323 0.1395 3.10 0.0020 **
smokeYes 9.6617 1.7040 5.67 1.6e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 43 on 2640 degrees of freedom
(8 observations deleted due to missingness)
Multiple R-squared: 0.0358, Adjusted R-squared: 0.034
F-statistic: 19.6 on 5 and 2640 DF, p-value: <2e-16
43.3.2 Stepwise (Backward Elimination) Procedure
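The backward elimination trace below is the kind of output R's step function prints; a minimal sketch of the call, assuming we start from the kitchen sink fit above:
step(model.kitchensink, direction = "backward")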
Start: AIC=19912
chol ~ age + bmi + sbp + dbp + smoke
Df Sum of Sq RSS AIC
- sbp 1 1629 4885694 19911
<none> 4884065 19912
- bmi 1 11023 4895088 19916
- dbp 1 17766 4901831 19919
- age 1 18633 4902698 19920
- smoke 1 59480 4943545 19942
Step: AIC=19911
chol ~ age + bmi + dbp + smoke
Df Sum of Sq RSS AIC
<none> 4885694 19911
- bmi 1 11404 4897098 19915
- age 1 19708 4905402 19919
- dbp 1 58934 4944628 19940
- smoke 1 62605 4948300 19942
Call:
lm(formula = chol ~ age + bmi + dbp + smoke, data = wcgs.dev)
Coefficients:
(Intercept) age bmi dbp smokeYes
133.766 0.502 0.867 0.529 9.846
The stepwise process first eliminates sbp from the model, then sees no further improvement in AIC from removing any of the remaining predictors, so it lands on a four-predictor model with age, bmi, dbp and smoke.
43.3.3 Three Candidate Models
For purposes of this exercise, we'll call this four-predictor model model.A and compare it to a three-predictor model with age, dbp and smoke, which we'll call model.B. Together with the kitchen sink model, these give us three candidate models to compare.
43.4 AIC, ANOVA and BIC to assess Candidate Models
The stepwise regression output specifies the AIC value for each model, but we can also look at other characteristics, like the ANOVA table comparing the various models, or the Bayesian Information Criterion, abbreviated BIC.
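The AIC table below can be produced with a call along these lines (a sketch). Note that AIC() includes additive constants that the stepwise procedure's extractAIC drops, so these values are larger than the AICs shown in the stepwise trace, even though the comparisons between models agree.
AIC(model.kitchensink, model.A, model.B)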
df AIC
model.kitchensink 7 27423
model.A 6 27422
model.B 5 27426
AIC suggests model A (since it has the smallest AIC of these choices).
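The nested-model comparison below looks like the output of R's anova function applied to the three fits, largest model first; again, a sketch.
anova(model.kitchensink, model.A, model.B)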
Analysis of Variance Table
Model 1: chol ~ age + bmi + sbp + dbp + smoke
Model 2: chol ~ age + bmi + dbp + smoke
Model 3: chol ~ age + dbp + smoke
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2640 4884065
2 2641 4885694 -1 -1629 0.88 0.348
3 2642 4897098 -1 -11404 6.16 0.013 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA comparison also suggests model A, for the following reasons:
- The p value of 0.35 indicates that moving from what we’ve called the kitchen sink model (model 1 in the ANOVA output) to what we’ve called model A (model 2 in the ANOVA output) does not have a statistically significant impact on predictive value.
- On the other hand, the p value of 0.013 indicates that moving from what we’ve called model A (model 2 in the ANOVA output) to what we’ve called model B (model 3 in the ANOVA output) does have a statistically significant impact on predictive value.
- Because these models are nested (model B is a proper subset of model A which is a proper subset of the kitchen sink) we can make these ANOVA comparisons directly.
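As with AIC, the BIC table below can be obtained with a call like this (a sketch):
BIC(model.kitchensink, model.A, model.B)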
df BIC
model.kitchensink 7 27464
model.A 6 27457
model.B 5 27455
BIC disagrees, and prefers model B, since its BIC is smaller. The penalty for fitting additional predictors in BIC varies with the number of observations, and so (especially with larger samples) we can get meaningfully different AIC and BIC selections.
43.5 Comparing Models in the Test Sample (MSPE, MAPE)
Finally, we’ll use our two candidate models (Model A and Model B) to predict the results in our holdout sample of 500 observations to see which model performs better in these new data (remember that our holdout sample was not used to identify or fit Models A or B.)
To do this, we first carefully specify the two models being compared:
model.A <- lm(chol ~ age + bmi + dbp + smoke, data=wcgs.dev)
model.B <- lm(chol ~ age + dbp + smoke, data=wcgs.dev)
Next, use predict to make predictions for the test data:
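Here is a sketch of that step; the object names pred.A and pred.B are my choices for illustration, not necessarily the names used originally.
pred.A <- predict(model.A, newdata = wcgs.test)
pred.B <- predict(model.B, newdata = wcgs.test)
## first few Model A predictions, rounded, as shown below
head(round(pred.A))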
Just to fix ideas, here are the first few predictions for Model A…
1 2 3 4 5 6
223 212 238 225 216 214
We can compare these to the first few observed values of chol in the test sample.
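One way to pull those observed values (a sketch):
head(wcgs.test$chol)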
[1] 206 182 264 283 252 204
Next, calculate errors (observed value minus predicted value) for each model:
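Continuing the sketch with the hypothetical pred.A and pred.B objects above:
err.A <- wcgs.test$chol - pred.A
err.B <- wcgs.test$chol - pred.B
## first few Model A errors, as shown below
head(round(err.A, 2))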
Again, just to be sure we understand, we look at the first few errors for Model A.
1 2 3 4 5 6
-17.34 -30.21 26.41 57.76 36.32 -9.61
Next, we calculate the absolute errors (as |observed - predicted|) from each model in turn:
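Again as a sketch:
abserr.A <- abs(err.A)
abserr.B <- abs(err.B)
## first few Model A absolute errors, as shown below
head(round(abserr.A, 2))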
Let’s look at the first few absolute errors for Model A.
1 2 3 4 5 6
17.34 30.21 26.41 57.76 36.32 9.61
Next, we calculate the squared prediction errors from each model in turn:
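And, continuing the sketch:
sqerr.A <- err.A^2
sqerr.B <- err.B^2
## first few Model A squared errors, as shown below
head(round(sqerr.A, 1))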
And again, we look at the first few squared errors for Model A.
1 2 3 4 5 6
300.7 912.5 697.6 3336.7 1319.2 92.4
To obtain our two key summaries, mean absolute prediction error and mean squared prediction error, I just use the summary function.
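With the hypothetical objects defined above, the four summaries below would come from calls like these: absolute errors first, then squared errors, with Model A before Model B in each pair.
summary(abserr.A)
## absolute prediction errors, Model A
summary(abserr.B)
## absolute prediction errors, Model B
summary(sqerr.A)
## squared prediction errors, Model A
summary(sqerr.B)
## squared prediction errors, Model B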
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.1 13.1 26.7 32.4 45.9 125.2 4
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.2 14.1 26.5 32.4 44.8 122.9 4
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 173 713 1669 2107 15678 4
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 197 702 1668 2008 15102 4
Model | MAPE | MSPE | Maximum Abs. Error
---|---|---|---
A (age + bmi + dbp + smoke) | 32.39 | 1669 | 125
B (age + dbp + smoke) | 32.45 | 1668 | 123
Note that smaller values on these metrics are better, so that MAPE (barely) selects Model A over Model B, and MSPE (barely) selects Model B over Model A. We also sometimes look at the maximum absolute error, and here we see that Model B is slightly favored again. But really, there's little to choose from. The NAs you see above refer to patients in the wcgs data with missing values on one or more of the variables included in our kitchen sink model. I absolutely should have identified that problem at the beginning, and either omitted those subjects or done some imputation back at the start. I'll show that in the next section.
So, based on the test sample results, we might slightly favor Model B, but really, there’s no meaningful difference.