Chapter 43 Model Selection and Out-of-Sample Validation
Sometimes, we use a regression model to describe a multivariate relationship, and all that requires is an adequate fit to the data at hand. If instead we want to use the model for prediction, a more appropriate standard requires us to fit a model that does each of the following three things:
- [fits well] Provides an adequate fit to the data at hand
- [parsimonious] Includes only those predictors which actually have detectable predictive value
- [predicts well out of sample] Does a better job predicting our outcome in new data than other alternative models
We’ll spend considerable time in 432 studying how best to validate a regression model, but for now, we’ll focus on a pair of issues:
- Given a set of predictors, how should we let the computer help us understand which subsets of those predictors are most worthy of our additional attention?
- Given a pair of models to compare, what tools should we use to determine which one better predicts new data?
43.1 Using the WCGS Data to predict Cholesterol Level
To address these issues, I'll look at one of our old examples: the wcgs data (Western Collaborative Group Study), described in Section 13. We'll try to predict the variable chol on the basis of some subset of the following five predictors: age, bmi, sbp, dbp and smoke.
The steps are:
- Check the wcgs data for missing or out-of-range values in the variables under study.
- Separate the wcgs data into a test sample of 500 observations selected at random, and a model development (training) sample consisting of the remaining 2654 observations.
- Using only the model development sample, run a stepwise regression algorithm, using backwards selection from the kitchen sink model, to identify a reasonable candidate model. Call this model A.
- Develop a second potential model (called model B) for the model development data by eliminating the least clearly helpful predictor in model A.
- Use the AIC to compare model A to model B in the development sample.
- Finally, moving forward to the holdout sample, compare the quality of predictions made by model A to those made by model B in terms of two of the many possible criteria, [i] mean squared prediction error and [ii] mean absolute prediction error, to see if either model (A or B) is clearly superior in terms of out-of-sample prediction.
43.2 Separating the Data into a Training and a Test Sample
There are several ways to partition the data into training (model development) and test (model checking) samples. For example, we could develop separate data frames for a holdout sample of 500 randomly selected subjects (wcgs.test), and then use the remainder as the model development sample (wcgs.dev). Remember to set a seed so that you can replicate the selection.
set.seed(4311); wcgs.test <- wcgs %>% sample_n(500)
## hold out exactly 500 randomly selected observations
wcgs.dev <- anti_join(wcgs, wcgs.test, "id")
## model development sample - 2654 observations
wcgs.test
# A tibble: 500 x 22
id age agec height weight lnwght wghtcat bmi sbp lnsbp dbp
<int> <int> <fct> <int> <int> <dbl> <fct> <dbl> <int> <dbl> <int>
1 13215 43 41-45 69 165 5.11 140-170 24.4 110 4.70 70
2 21100 44 41-45 67 158 5.06 140-170 24.7 112 4.72 66
3 21141 52 51-55 70 180 5.19 170-200 25.8 136 4.91 86
4 3747 43 41-45 69 155 5.04 140-170 22.9 120 4.79 76
5 12256 54 51-55 68 143 4.96 140-170 21.7 116 4.75 68
6 11220 39 35-40 75 195 5.27 170-200 24.4 124 4.82 74
7 3079 40 35-40 69 153 5.03 140-170 22.6 120 4.79 76
8 13445 45 41-45 74 165 5.11 140-170 21.2 106 4.66 66
9 21064 50 46-50 68 166 5.11 140-170 25.2 128 4.85 78
10 3228 39 35-40 72 172 5.15 170-200 23.3 114 4.74 78
# ... with 490 more rows, and 11 more variables: chol <int>, behpat <fct>,
# dibpat <fct>, smoke <fct>, ncigs <int>, arcus <int>, chd69 <fct>,
# typchd69 <int>, time169 <int>, t1 <dbl>, uni <dbl>
dim(wcgs.dev)
[1] 2654 22
43.2.1 Using a specified fraction of the data in the test sample
If we'd wanted to select 20% of the data for our test sample, we could have instead used the sample_frac and anti_join commands. For the wcgs data, which has a unique id variable that identifies each subject, we'd have…
set.seed(43199); wcgs.train80 <- wcgs %>% sample_frac(0.80)
wcgs.test20 <- anti_join(wcgs, wcgs.train80, by="id")
dim(wcgs.train80)
[1] 2523 22
dim(wcgs.test20)
[1] 631 22
Given a large sample size (at least 500 observations in the full data set), I would usually think about holding out somewhere between 15% and 25% of the data in this manner.
43.3 Stepwise Regression to Select Predictors
We next use the wcgs.dev (model development) sample to run a stepwise procedure, beginning with the kitchen sink model, which includes all potential predictors.
43.3.1 The Kitchen Sink Model
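The summary output shown below can be reproduced with a call along these lines. This is a sketch, since the original code chunk isn't displayed, but the formula matches the Call line below and the name model.kitchensink matches the AIC and BIC tables in Section 43.4.
model.kitchensink <- lm(chol ~ age + bmi + sbp + dbp + smoke, data = wcgs.dev)
summary(model.kitchensink)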
Call:
lm(formula = chol ~ age + bmi + sbp + dbp + smoke, data = wcgs.dev)
Residuals:
Min 1Q Median 3Q Max
-129.9 -28.2 -2.4 25.4 418.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 132.1154 11.4386 11.55 < 2e-16 ***
age 0.4895 0.1542 3.17 0.0015 **
bmi 0.8527 0.3493 2.44 0.0147 *
sbp 0.0824 0.0878 0.94 0.3481
dbp 0.4323 0.1395 3.10 0.0020 **
smokeYes 9.6617 1.7040 5.67 1.6e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 43 on 2640 degrees of freedom
(8 observations deleted due to missingness)
Multiple R-squared: 0.0358, Adjusted R-squared: 0.034
F-statistic: 19.6 on 5 and 2640 DF, p-value: <2e-16
43.3.2 Stepwise (Backward Elimination) Procedure
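The backward elimination trace below is the kind of output R's step function prints; a minimal sketch of the call, assuming we start from the kitchen sink fit above:
step(model.kitchensink, direction = "backward")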
Start: AIC=19912
chol ~ age + bmi + sbp + dbp + smoke
Df Sum of Sq RSS AIC
- sbp 1 1629 4885694 19911
<none> 4884065 19912
- bmi 1 11023 4895088 19916
- dbp 1 17766 4901831 19919
- age 1 18633 4902698 19920
- smoke 1 59480 4943545 19942
Step: AIC=19911
chol ~ age + bmi + dbp + smoke
Df Sum of Sq RSS AIC
<none> 4885694 19911
- bmi 1 11404 4897098 19915
- age 1 19708 4905402 19919
- dbp 1 58934 4944628 19940
- smoke 1 62605 4948300 19942
Call:
lm(formula = chol ~ age + bmi + dbp + smoke, data = wcgs.dev)
Coefficients:
(Intercept) age bmi dbp smokeYes
133.766 0.502 0.867 0.529 9.846
The stepwise process first eliminates sbp from the model, then sees no further improvement in AIC from removing any of the remaining predictors, so it lands on a four-predictor model with age, bmi, dbp and smoke.
43.3.3 Three Candidate Models
For purposes of this exercise, we'll call this four-predictor model model.A and compare it to a three-predictor model with age, dbp and smoke, which we'll call model.B. Together with the kitchen sink model, these give us three candidate models to compare.
43.4 AIC, ANOVA and BIC to assess Candidate Models
The stepwise regression output specifies the AIC value for each model, but we can also look at other characteristics, like the ANOVA table comparing the various models, or the Bayesian Information Criterion, abbreviated BIC.
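The AIC table below can be produced with a call along these lines (a sketch). Note that AIC() includes additive constants that the stepwise procedure's extractAIC drops, so these values are larger than the AICs shown in the stepwise trace, even though the comparisons between models agree.
AIC(model.kitchensink, model.A, model.B)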
df AIC
model.kitchensink 7 27423
model.A 6 27422
model.B 5 27426
AIC suggests model A (since it has the smallest AIC of these choices).
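The nested-model comparison below looks like the output of R's anova function applied to the three fits, largest model first; again, a sketch.
anova(model.kitchensink, model.A, model.B)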
Analysis of Variance Table
Model 1: chol ~ age + bmi + sbp + dbp + smoke
Model 2: chol ~ age + bmi + dbp + smoke
Model 3: chol ~ age + dbp + smoke
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2640 4884065
2 2641 4885694 -1 -1629 0.88 0.348
3 2642 4897098 -1 -11404 6.16 0.013 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA comparison also suggests model A, for the following reasons:
- The p value of 0.35 indicates that moving from what we’ve called the kitchen sink model (model 1 in the ANOVA output) to what we’ve called model A (model 2 in the ANOVA output) does not have a statistically significant impact on predictive value.
- On the other hand, the p value of 0.013 indicates that moving from what we’ve called model A (model 2 in the ANOVA output) to what we’ve called model B (model 3 in the ANOVA output) does have a statistically significant impact on predictive value.
- Because these models are nested (model B is a proper subset of model A which is a proper subset of the kitchen sink) we can make these ANOVA comparisons directly.
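As with AIC, the BIC table below can be obtained with a call like this (a sketch):
BIC(model.kitchensink, model.A, model.B)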
df BIC
model.kitchensink 7 27464
model.A 6 27457
model.B 5 27455
BIC disagrees, and prefers model B, since its BIC is smaller. The penalty for fitting additional predictors in BIC varies with the number of observations, and so (especially with larger samples) we can get meaningfully different AIC and BIC selections.
43.5 Comparing Models in the Test Sample (MSPE, MAPE)
Finally, we’ll use our two candidate models (Model A and Model B) to predict the results in our holdout sample of 500 observations to see which model performs better in these new data (remember that our holdout sample was not used to identify or fit Models A or B.)
To do this, we first carefully specify the two models being compared:
model.A <- lm(chol ~ age + bmi + dbp + smoke, data=wcgs.dev)
model.B <- lm(chol ~ age + dbp + smoke, data=wcgs.dev)
Next, use predict to make predictions for the test data:
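Here is a sketch of that step; the object names pred.A and pred.B are my choices for illustration, not necessarily the names used originally.
pred.A <- predict(model.A, newdata = wcgs.test)
pred.B <- predict(model.B, newdata = wcgs.test)
## first few Model A predictions, rounded, as shown below
head(round(pred.A))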
Just to fix ideas, here are the first few predictions for Model A…
1 2 3 4 5 6
223 212 238 225 216 214
We can compare these to the first few observed values of chol in the test sample.
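One way to pull those observed values (a sketch):
head(wcgs.test$chol)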
[1] 206 182 264 283 252 204
Next, calculate errors (observed value minus predicted value) for each model:
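Continuing the sketch with the hypothetical pred.A and pred.B objects above:
err.A <- wcgs.test$chol - pred.A
err.B <- wcgs.test$chol - pred.B
## first few Model A errors, as shown below
head(round(err.A, 2))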
Again, just to be sure we understand, we look at the first few errors for Model A.
1 2 3 4 5 6
-17.34 -30.21 26.41 57.76 36.32 -9.61
Next, we calculate the absolute errors (as |observed - predicted|) from each model in turn:
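Again as a sketch:
abserr.A <- abs(err.A)
abserr.B <- abs(err.B)
## first few Model A absolute errors, as shown below
head(round(abserr.A, 2))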
Let’s look at the first few absolute errors for Model A.
1 2 3 4 5 6
17.34 30.21 26.41 57.76 36.32 9.61
Next, we calculate the squared prediction errors from each model in turn:
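And, continuing the sketch:
sqerr.A <- err.A^2
sqerr.B <- err.B^2
## first few Model A squared errors, as shown below
head(round(sqerr.A, 1))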
And again, we look at the first few squared errors for Model A.
1 2 3 4 5 6
300.7 912.5 697.6 3336.7 1319.2 92.4
To obtain our two key summaries, mean absolute prediction error and mean squared prediction error, I just use the summary function.
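With the hypothetical objects defined above, the four summaries below would come from calls like these: absolute errors first, then squared errors, with Model A before Model B in each pair.
summary(abserr.A)
## absolute prediction errors, Model A
summary(abserr.B)
## absolute prediction errors, Model B
summary(sqerr.A)
## squared prediction errors, Model A
summary(sqerr.B)
## squared prediction errors, Model B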
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.1 13.1 26.7 32.4 45.9 125.2 4
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.2 14.1 26.5 32.4 44.8 122.9 4
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 173 713 1669 2107 15678 4
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 197 702 1668 2008 15102 4
Model | MAPE | MSPE | Maximum Abs. Error
---|---|---|---
A (age + bmi + dbp + smoke) | 32.39 | 1669 | 125
B (age + dbp + smoke) | 32.45 | 1668 | 123
Note that smaller values on these metrics are better, so that MAPE (barely) selects Model A over Model B, and MSPE (barely) selects Model B over Model A. We also sometimes look at the maximum absolute error, and here we see that Model B is slightly favored again. But really, there's little to choose from. The NAs you see above refer to patients in the wcgs data with missing values on one or more of the variables included in our kitchen sink model. I absolutely should have identified that problem at the beginning, and either omitted those subjects or done some imputation back at the start. I'll show that in the next section.
So, based on the test sample results, we might slightly favor Model B, but really, there’s no meaningful difference.