Chapter 37 Model Selection and Out-of-Sample Validation
Sometimes, we use a regression model to describe a multivariate relationship, and this requires only that we provide an adequate fit to the data at hand. Should we want to use the model for prediction, then a more appropriate standard requires us to fit a model that does each of the following three things:
- [fits well] Provides an adequate fit to the data at hand
- [parsimonious] Includes only those predictors which actually have detectable predictive value
- [predicts well out of sample] Does a better job predicting our outcome in new data than other alternative models
We’ll spend considerable time in 432 studying how best to validate a regression model, but for now, we’ll focus on a pair of issues:
- Given a set of predictors, how should we let the computer help us understand which subsets of those predictors are most worthy of our additional attention?
- Given a pair of models to compare, what tools should we use to determine which one better predicts new data?
37.1 Using the WCGS Data to predict Cholesterol Level
To address these issues, I’ll look at one of our old examples: the wcgs data (Western Collaborative Group Study), described in Section 13. We’ll try to predict the variable chol on the basis of some subset of the following five predictors: age, bmi, sbp, dbp and smoke.
The steps are:
- Check the wcgs data for missing or out-of-range values in the variables under study.
- Separate the wcgs data into a test sample of 500 observations selected at random, and a model development (training) sample consisting of the remaining 2654 observations.
- Using only the model development sample, run a stepwise regression algorithm, working off of the kitchen sink model with backwards selection, to identify a reasonable candidate for a model. Call this model A.
- Develop a second potential model (called model B) for the model development data by eliminating the least clearly helpful predictor in model A.
- Use the AIC to compare model A to model B in the development sample.
- Finally, moving forward to the holdout sample, compare the quality of predictions made by model A to those made by model B in terms of two of the many possible criteria, [i] mean squared prediction error and [ii] mean absolute prediction error, to see if either model (A or B) is clearly superior in terms of out-of-sample prediction.
37.2 Separating the Data into a Training and a Test Sample
There are several ways to partition the data into training (model development) and test (model checking) samples. For example, we could develop separate data frames for a holdout sample of 500 randomly selected subjects (wcgs.test), and then use the remainder as the model development sample (wcgs.dev). Remember to set a seed so that you can replicate the selection.
set.seed(4311); wcgs.test <- wcgs %>% sample_n(500)
## hold out exactly 500 randomly selected observations
wcgs.dev <- anti_join(wcgs, wcgs.test, by = "id")
## model development sample - 2654 observations
wcgs.test
# A tibble: 500 x 22
id age agec height weight lnwght wghtcat bmi sbp lnsbp dbp
<int> <int> <fct> <int> <int> <dbl> <fct> <dbl> <int> <dbl> <int>
1 11424 40 35-40 72 202 5.31 > 200 27.4 138 4.93 84
2 12255 40 35-40 68 165 5.11 140-170 25.1 120 4.79 80
3 11100 44 41-45 72 196 5.28 170-200 26.6 118 4.77 80
4 10162 45 41-45 68 136 4.91 < 140 20.7 122 4.80 78
5 10236 51 51-55 65 159 5.07 140-170 26.5 106 4.66 80
6 19077 52 51-55 70 152 5.02 140-170 21.8 124 4.82 86
7 22064 49 46-50 71 152 5.02 140-170 21.2 124 4.82 76
8 21026 48 46-50 70 170 5.14 140-170 24.4 130 4.87 86
9 12356 42 41-45 74 190 5.25 170-200 24.4 144 4.97 78
10 11232 39 35-40 73 182 5.20 170-200 24.0 116 4.75 76
# ... with 490 more rows, and 11 more variables: chol <int>, behpat <fct>,
# dibpat <fct>, smoke <fct>, ncigs <int>, arcus <int>, chd69 <fct>,
# typchd69 <int>, time169 <int>, t1 <dbl>, uni <dbl>
dim(wcgs.dev)
[1] 2654 22
37.2.1 Using a specified fraction of the data in the test sample
If we’d wanted to select 20% of the data for our test sample, we could have instead used the sample_frac and anti_join commands. For the wcgs data, which has a unique id variable that identifies each subject, we’d have…
set.seed(43199); wcgs.train80 <- wcgs %>% sample_frac(0.80)
wcgs.test20 <- anti_join(wcgs, wcgs.train80, by="id")
dim(wcgs.train80)
[1] 2523 22
dim(wcgs.test20)
[1] 631 22
Given a large sample size (at least 500 observations in the full data set), I would usually think about holding out somewhere between 15% and 25% of the data in this manner.
37.3 Stepwise Regression to Select Predictors
We next use the wcgs.dev (development) sample to run a stepwise procedure, beginning with the kitchen sink model that includes all potential predictors.
37.3.1 The Kitchen Sink Model
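The summary output shown below can be reproduced with code along these lines; the object name model.kitchensink is taken from the AIC and BIC comparisons later in this chapter, since the original code chunk is not displayed here.
model.kitchensink <- lm(chol ~ age + bmi + sbp + dbp + smoke, data = wcgs.dev)
summary(model.kitchensink)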
Call:
lm(formula = chol ~ age + bmi + sbp + dbp + smoke, data = wcgs.dev)
Residuals:
Min 1Q Median 3Q Max
-130.8 -28.4 -3.0 25.1 418.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 135.3671 11.4050 11.87 < 2e-16 ***
age 0.5555 0.1520 3.65 0.00026 ***
bmi 0.7772 0.3493 2.23 0.02615 *
sbp 0.1307 0.0885 1.48 0.13977
dbp 0.2963 0.1388 2.14 0.03285 *
smokeYes 10.3565 1.6967 6.10 1.2e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 42.8 on 2640 degrees of freedom
(8 observations deleted due to missingness)
Multiple R-squared: 0.0352, Adjusted R-squared: 0.0333
F-statistic: 19.2 on 5 and 2640 DF, p-value: <2e-16
37.3.2 Stepwise (Backward Elimination) Procedure
Start: AIC=19892
chol ~ age + bmi + sbp + dbp + smoke
Df Sum of Sq RSS AIC
<none> 4847288 19892
- sbp 1 4006 4851294 19892
- dbp 1 8370 4855658 19894
- bmi 1 9092 4856380 19895
- age 1 24516 4871804 19903
- smoke 1 68405 4915693 19927
Call:
lm(formula = chol ~ age + bmi + sbp + dbp + smoke, data = wcgs.dev)
Coefficients:
(Intercept) age bmi sbp dbp
135.367 0.555 0.777 0.131 0.296
smokeYes
10.357
The backward elimination procedure finds that no single removal would lower the AIC: dropping sbp would leave the AIC essentially unchanged (both values round to 19892), while dropping any other predictor would raise it, so the procedure stops and retains all five predictors. Since sbp contributes so little, we will also consider the four-predictor model containing age, bmi, dbp and smoke.
37.3.3 Three Candidate Models
For purposes of this exercise, we’ll call this four-predictor model model.A and compare it to a three-predictor model with age, dbp and smoke, which we’ll call model.B. Together with the kitchen sink model, these are our three candidate models.
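Although the fitting code does not appear until Section 37.5, the comparisons that follow assume that Models A and B have been fit to the development sample, for example:
model.A <- lm(chol ~ age + bmi + dbp + smoke, data = wcgs.dev)
model.B <- lm(chol ~ age + dbp + smoke, data = wcgs.dev)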
37.4 AIC, ANOVA and BIC to assess Candidate Models
The stepwise regression output specifies the AIC value for each model, but we can also look at other characteristics, like the ANOVA table comparing the various models, or the Bayesian Information Criterion, abbreviated BIC.
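The AIC table below can be produced with the AIC function applied to the three fitted models. (Note that step reports AIC on a different scale, omitting an additive constant, so its values are not directly comparable to these.)
AIC(model.kitchensink, model.A, model.B)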
df AIC
model.kitchensink 7 27403
model.A 6 27403
model.B 5 27406
AIC suggests Model A over Model B (smaller AIC values are better), and sees essentially no difference between Model A and the kitchen sink model.
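The ANOVA comparison of the three nested models shown next can be obtained with a call like:
anova(model.kitchensink, model.A, model.B)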
Analysis of Variance Table
Model 1: chol ~ age + bmi + sbp + dbp + smoke
Model 2: chol ~ age + bmi + dbp + smoke
Model 3: chol ~ age + dbp + smoke
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2640 4847288
2 2641 4851294 -1 -4006 2.18 0.14
3 2642 4861218 -1 -9923 5.40 0.02 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA comparison also suggests model A, for the following reasons:
- The p value of 0.14 indicates that moving from what we’ve called the kitchen sink model (model 1 in the ANOVA output) to what we’ve called model A (model 2 in the ANOVA output) does not have a statistically significant impact on predictive value.
- On the other hand, the p value of 0.02 indicates that moving from what we’ve called model A (model 2 in the ANOVA output) to what we’ve called model B (model 3 in the ANOVA output) does have a statistically significant impact on predictive value.
- Because these models are nested (model B is a proper subset of model A which is a proper subset of the kitchen sink) we can make these ANOVA comparisons directly.
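Similarly, the BIC table below can be produced with the BIC function:
BIC(model.kitchensink, model.A, model.B)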
df BIC
model.kitchensink 7 27444
model.A 6 27438
model.B 5 27436
BIC disagrees, and prefers model B, since its BIC is smaller. The penalty for fitting additional predictors in BIC varies with the number of observations, and so (especially with larger samples) we can get meaningfully different AIC and BIC selections.
37.5 Comparing Models in the Test Sample (MSPE, MAPE)
Finally, we’ll use our two candidate models (Model A and Model B) to predict the results in our holdout sample of 500 observations, to see which model performs better in these new data. (Remember that our holdout sample was not used to identify or fit Models A or B.)
To do this, we first carefully specify the two models being compared:
model.A <- lm(chol ~ age + bmi + dbp + smoke, data=wcgs.dev)
model.B <- lm(chol ~ age + dbp + smoke, data=wcgs.dev)
Next, we use predict to make predictions for the test data:
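A minimal sketch of that step (the names pred.A and pred.B are mine, since the original chunk is not shown):
pred.A <- predict(model.A, newdata = wcgs.test)   ## Model A predictions for the 500 test subjects
pred.B <- predict(model.B, newdata = wcgs.test)   ## Model B predictions for the 500 test subjects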
Just to fix ideas, here are the first few predictions for Model A…
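Assuming the Model A predictions are stored in pred.A as sketched above, rounding the first few to whole numbers gives the display below.
round(head(pred.A))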
1 2 3 4 5 6
231 217 231 215 225 224
We can compare these to the first few observed values of chol in the test sample.
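For instance:
head(wcgs.test$chol)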
[1] 215 198 175 208 348 315
Next, calculate errors (observed value minus predicted value) for each model:
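Continuing with the pred.A and pred.B names assumed above, a sketch of this calculation is:
err.A <- wcgs.test$chol - pred.A   ## observed minus predicted cholesterol, Model A
err.B <- wcgs.test$chol - pred.B   ## observed minus predicted cholesterol, Model B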
Again, just to be sure we understand, we look at the first few errors for Model A.
1 2 3 4 5 6
-16.36 -19.04 -56.22 -7.47 123.48 90.97
Next, we calculate the absolute errors (as |observed - predicted|) from each model in turn:
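For example:
abserr.A <- abs(err.A)   ## absolute prediction errors, Model A
abserr.B <- abs(err.B)   ## absolute prediction errors, Model B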
Let’s look at the first few absolute errors for Model A.
1 2 3 4 5 6
16.36 19.04 56.22 7.47 123.48 90.97
Next, we calculate the squared prediction errors from each model in turn:
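For example:
sqerr.A <- err.A^2   ## squared prediction errors, Model A
sqerr.B <- err.B^2   ## squared prediction errors, Model B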
And again, we look at the first few squared errors for Model A.
1 2 3 4 5 6
267.7 362.7 3160.8 55.8 15246.8 8275.4
To obtain our two key summaries, the mean absolute prediction error and the mean squared prediction error, I just use the summary function.
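With the names assumed above, the four summaries that follow would come from calls like these, presumably in this order: absolute errors for Model A, absolute errors for Model B, squared errors for Model A, and squared errors for Model B.
summary(abserr.A)
summary(abserr.B)
summary(sqerr.A)
summary(sqerr.B)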
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.1 10.9 27.2 32.3 48.4 148.2 4
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0 11.1 27.4 32.5 48.1 147.0 4
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 119 740 1740 2340 21963 4
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 123 749 1743 2318 21613 4
Model | MAPE | MSPE | Maximum Abs. Error |
---|---|---|---|
A (age + bmi + dbp + smoke) | 32.3 | 1740 | 148.2 |
B (age + dbp + smoke) | 32.5 | 1743 | 147.0 |
Note that smaller values on these metrics are better. Here, MAPE and MSPE each (barely) favor Model A over Model B, while the maximum absolute prediction error (barely) favors Model B. But really, there’s little to choose from. The NAs you see above refer to patients in the wcgs data with missing values on one or more of the variables included in our kitchen sink model. I absolutely should have identified that problem at the beginning, and either omitted those subjects or done some imputation back at the start. I’ll show that in the next section.
So, based on the test sample results, we might very slightly favor Model A, but really, there’s no meaningful difference between the two models’ out-of-sample predictions.