Chapter 7 Stepwise Variable Selection
7.1 Strategy for Model Selection
Ramsey and Schafer (2002) suggest that a strategy for dealing with many potential explanatory variables should include the following elements:
- Identify the key objectives.
- Screen the available variables, deciding on a list that is sensitive to the objectives and excludes obvious redundancies.
- Perform exploratory analysis, examining graphical displays and correlation coefficients.
- Perform transformations, as necessary.
- Examine a residual plot after fitting a rich model, performing further transformations and considering outliers.
- Find a suitable subset of the predictors, exerting enough control over any semi-automated selection procedure to be sensitive to the questions of interest.
- Proceed with the analysis, using the selected explanatory variables.
The Two Key Aspects of Model Selection are:
- Evaluating each potential subset of predictor variables
- Deciding on the collection of potential subsets
7.1.1 How Do We Choose Potential Subsets of Predictors?
Choosing potential subsets of predictor variables usually involves either:
- Stepwise approaches
- All possible subset (or best possible subset) searches
Note that the use of any variable selection procedure changes the properties of …
- the estimated coefficients, which are biased, and
- the associated tests and confidence intervals, which are overly optimistic.
Leeb and Pötscher (2005) summarize the key issues:
- Regardless of sample size, the model selection step typically has a dramatic effect on the sampling properties of the estimators that cannot be ignored. In particular, the sampling properties of post-model-selection estimators are typically significantly different from the nominal distributions that arise if a fixed model is supposed.
- As a consequence, use of inference procedures that do not take into account the model selection step (e.g. using standard t-intervals as if the selected model has been given prior to the statistical analysis) can be highly misleading.
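The second point is easy to demonstrate by simulation. In the hypothetical sketch below (not from the chapter), every candidate predictor is pure noise, yet picking the candidate most correlated with the outcome and then testing it with an ordinary t-test rejects the (true) null far more often than the nominal 5%:

```r
# Hedged sketch: post-selection optimism with pure-noise predictors.
# All settings (n_sims, n, p, the seed) are illustrative choices.
set.seed(432)
n_sims <- 500; n <- 50; p <- 20
reject <- replicate(n_sims, {
  x <- matrix(rnorm(n * p), n, p)
  y <- rnorm(n)                      # y is unrelated to every column of x
  best <- which.max(abs(cor(x, y)))  # the "model selection" step
  fit <- lm(y ~ x[, best])
  summary(fit)$coefficients[2, 4] < 0.05  # naive t-test on the selected term
})
mean(reject)   # far above the nominal 0.05
```

The naive t-interval behaves as if `x[, best]` had been chosen before seeing the data, which is exactly the mistake Leeb and Pötscher describe.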
7.2 A “Kitchen Sink” Model (Model c5_prost_ks)
Suppose that we now consider a model for the `prost` data we have been working with, which includes main effects (and, in this case, only the main effects) of all eight candidate predictors for `lpsa`, as follows.
c5_prost_ks <- lm(lpsa ~ lcavol + lweight + age + bph_f + svi_f +
lcp + gleason_f + pgg45, data = prost)
tidy(c5_prost_ks)
# A tibble: 11 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.170 0.931 0.182 0.856
2 lcavol 0.544 0.0880 6.19 0.0000000201
3 lweight 0.702 0.203 3.46 0.000846
4 age -0.0239 0.0111 -2.15 0.0341
5 bph_fMedium 0.364 0.183 1.99 0.0493
6 bph_fHigh 0.249 0.196 1.27 0.208
7 svi_fYes 0.711 0.242 2.94 0.00424
8 lcp -0.119 0.0895 -1.33 0.186
9 gleason_f7 0.221 0.343 0.643 0.522
10 gleason_f6 -0.0531 0.430 -0.123 0.902
11 pgg45 0.00398 0.00415 0.961 0.339
glance(c5_prost_ks)
# A tibble: 1 x 11
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.679 0.642 0.691 18.2 2.37e-17 11 -95.9 216. 247.
# ... with 2 more variables: deviance <dbl>, df.residual <int>
We’ll often refer to this (all predictors on board) approach as a “kitchen sink” model[This refers to the English idiom “… everything but the kitchen sink” which describes, essentially, everything imaginable. A “kitchen sink regression” is often used as a pejorative term, since no special skill or insight is required to identify it, given a list of potential predictors. For more, yes, there is a Wikipedia page.].
7.3 Sequential Variable Selection: Stepwise Approaches
- Forward Selection
- We begin with a constant mean and then add potential predictors one at a time according to some criterion (R defaults to minimizing the Akaike Information Criterion) until no further addition significantly improves the fit.
- Each categorical factor variable is represented in the regression model as a set of indicator variables. In the absence of a good reason to do something else, the set is added to the model as a single unit, and R does this automatically.
- Backwards Elimination
- Start with the “kitchen sink” model and then delete potential predictors one at a time.
- Backwards Elimination is less likely than Forward Selection to omit negatively confounded sets of variables, though all stepwise procedures have problems.
- Stepwise Regression can also be done by combining these methods.
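In R, the combined approach just changes the `direction` argument of `step`. A minimal self-contained sketch, using the built-in `mtcars` data and an illustrative candidate list (chosen only so the example runs on its own, not part of the prost analysis):

```r
# Stepwise selection allowing both additions and deletions at each step.
m_start <- lm(mpg ~ 1, data = mtcars)
m_both <- step(m_start,
               scope = ~ wt + hp + disp + qsec,  # candidate predictors
               direction = "both",  # each step may add OR drop a term
               trace = 0)           # suppress the step-by-step log
formula(m_both)
```

With `direction = "both"`, a term added early can be dropped later if other additions make it redundant.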
7.3.1 The Big Problems with Stepwise Regression
There is no reason to assume that a single best model can be found.
- The use of forward selection, or backwards elimination, or stepwise regression including both procedures, will NOT always find the same model.
- It also appears to be essentially useless to try different stepwise methods to look for agreement.
Users of stepwise regression frequently place all of their attention on the particular explanatory variables included in the resulting model, when there’s no reason (in most cases) to assume that model is in any way optimal.
Despite all of its problems, let’s use stepwise regression to help predict `lpsa` given a subset of the eight predictors in `c5_prost_ks`.
7.4 Forward Selection with the `step` function
- Specify the null model (intercept only)
- Specify the variables R should consider as predictors (in the `scope` element of the `step` function)
- Specify forward selection only
- R defaults to using AIC as its stepwise criterion
with(prost,
step(lm(lpsa ~ 1),
scope=(~ lcavol + lweight + age + bph_f + svi_f +
lcp + gleason_f + pgg45),
direction="forward"))
Start: AIC=28.84
lpsa ~ 1
Df Sum of Sq RSS AIC
+ lcavol 1 69.003 58.915 -44.366
+ svi_f 1 41.011 86.907 -6.658
+ lcp 1 38.528 89.389 -3.926
+ gleason_f 2 30.121 97.796 6.793
+ lweight 1 24.019 103.899 10.665
+ pgg45 1 22.814 105.103 11.783
+ age 1 3.679 124.239 28.007
<none> 127.918 28.838
+ bph_f 2 4.681 123.237 29.221
Step: AIC=-44.37
lpsa ~ lcavol
Df Sum of Sq RSS AIC
+ lweight 1 7.1726 51.742 -54.958
+ svi_f 1 5.2375 53.677 -51.397
+ bph_f 2 3.2994 55.615 -45.956
+ pgg45 1 1.6980 57.217 -45.203
+ gleason_f 2 2.7834 56.131 -45.061
<none> 58.915 -44.366
+ lcp 1 0.6562 58.259 -43.452
+ age 1 0.0025 58.912 -42.370
Step: AIC=-54.96
lpsa ~ lcavol + lweight
Df Sum of Sq RSS AIC
+ svi_f 1 5.1737 46.568 -63.177
+ pgg45 1 1.8158 49.926 -56.424
+ gleason_f 2 2.6770 49.065 -56.111
<none> 51.742 -54.958
+ lcp 1 0.8187 50.923 -54.506
+ age 1 0.6456 51.097 -54.176
+ bph_f 2 1.4583 50.284 -53.731
Step: AIC=-63.18
lpsa ~ lcavol + lweight + svi_f
Df Sum of Sq RSS AIC
<none> 46.568 -63.177
+ gleason_f 2 1.60467 44.964 -62.579
+ age 1 0.62301 45.945 -62.484
+ bph_f 2 1.50046 45.068 -62.354
+ pgg45 1 0.50069 46.068 -62.226
+ lcp 1 0.06937 46.499 -61.322
Call:
lm(formula = lpsa ~ lcavol + lweight + svi_f)
Coefficients:
(Intercept) lcavol lweight svi_fYes
-0.7772 0.5259 0.6618 0.6657
The resulting model, arrived at after three forward selection steps, includes `lcavol`, `lweight` and `svi_f`.
model.fs <- lm(lpsa ~ lcavol + lweight + svi_f,
data=prost)
summary(model.fs)$adj.r.squared
[1] 0.6242063
extractAIC(model.fs)
[1] 4.00000 -63.17744
The adjusted R2 value for this model is 0.624, and the AIC value used by the stepwise procedure is -63.18, on 4 effective degrees of freedom.
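For a linear model, the AIC that `step` and `extractAIC` report is computed only up to an additive constant (which is why these values differ from the AIC reported by `glance` or `AIC` for the same models):

$$\mathrm{AIC} = n \log\left(\frac{\mathrm{RSS}}{n}\right) + 2\,\mathrm{edf}$$

Here, with $n = 97$, $\mathrm{RSS} = 46.568$ and $\mathrm{edf} = 4$ (the number of estimated coefficients), we get $97 \log(46.568/97) + 2(4) \approx -63.18$, matching the value above.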
7.5 Backward Elimination using the `step` function
In this case, the backward elimination approach, using reduction in AIC as its criterion, does not come to the same conclusion about the “best” model as forward selection did.
with(prost,
step(lm(lpsa ~ lcavol + lweight + age + bph_f +
svi_f + lcp + gleason_f + pgg45),
direction="backward"))
Start: AIC=-61.4
lpsa ~ lcavol + lweight + age + bph_f + svi_f + lcp + gleason_f +
pgg45
Df Sum of Sq RSS AIC
- gleason_f 2 1.1832 42.240 -62.639
- pgg45 1 0.4409 41.498 -62.359
- lcp 1 0.8492 41.906 -61.409
<none> 41.057 -61.395
- bph_f 2 2.0299 43.087 -60.714
- age 1 2.2129 43.270 -58.303
- svi_f 1 4.1207 45.178 -54.118
- lweight 1 5.7123 46.769 -50.760
- lcavol 1 18.2738 59.331 -27.683
Step: AIC=-62.64
lpsa ~ lcavol + lweight + age + bph_f + svi_f + lcp + pgg45
Df Sum of Sq RSS AIC
- lcp 1 0.8470 43.087 -62.713
<none> 42.240 -62.639
- pgg45 1 1.2029 43.443 -61.916
- bph_f 2 2.2515 44.492 -61.602
- age 1 2.0730 44.313 -59.992
- svi_f 1 4.6431 46.884 -54.523
- lweight 1 5.5988 47.839 -52.566
- lcavol 1 21.4956 63.736 -24.736
Step: AIC=-62.71
lpsa ~ lcavol + lweight + age + bph_f + svi_f + pgg45
Df Sum of Sq RSS AIC
- pgg45 1 0.5860 43.673 -63.403
<none> 43.087 -62.713
- bph_f 2 2.0214 45.109 -62.266
- age 1 1.7101 44.798 -60.938
- svi_f 1 3.7964 46.884 -56.523
- lweight 1 5.6462 48.734 -52.769
- lcavol 1 22.5152 65.603 -23.936
Step: AIC=-63.4
lpsa ~ lcavol + lweight + age + bph_f + svi_f
Df Sum of Sq RSS AIC
<none> 43.673 -63.403
- bph_f 2 2.2720 45.945 -62.484
- age 1 1.3945 45.068 -62.354
- svi_f 1 5.2747 48.948 -54.343
- lweight 1 5.3319 49.005 -54.230
- lcavol 1 25.5538 69.227 -20.720
Call:
lm(formula = lpsa ~ lcavol + lweight + age + bph_f + svi_f)
Coefficients:
(Intercept) lcavol lweight age bph_fMedium
0.14329 0.54022 0.67283 -0.01819 0.37607
bph_fHigh svi_fYes
0.27216 0.68174
The backwards elimination approach in this case lands on a model with five inputs (one of which, `bph_f`, is represented by two indicators), eliminating only `gleason_f`, `pgg45` and `lcp`.
7.6 Allen-Cady Modified Backward Elimination
Ranking candidate predictors by importance in advance of backwards elimination can help avoid false positives, while reducing model size. See Vittinghoff et al. (2012), Section 10.3 for more details.
- First, force into the model any predictors of primary interest, and any confounders necessary for face validity of the final model.
- “Some variables in the hypothesized causal model may be such well-established causal antecedents of the outcome that it makes sense to include them, essentially to establish the face validity of the model and without regard to the strength or statistical significance of their associations with the primary predictor and outcome …”
- Rank the remaining candidate predictors in order of importance.
- Starting from an initial model with all candidate predictors included, delete predictors in order of ascending importance until reaching the first variable that meets a criterion to stay in the model, and then stop.
Only the remaining variable hypothesized to be least important is eligible for removal at each step. When we are willing to do this sorting before collecting (or analyzing) the data, then we can do Allen-Cady backwards elimination using the drop1
command in R.
7.6.1 Demonstration of the Allen-Cady approach
Suppose, for the moment, that we decided to fit a model for the log of `psa` using the eight candidates `lcavol + lweight + svi_f + age + bph_f + gleason_f + lcp + pgg45`, and we decided (before we saw the data) that we would:
- force the `gleason_f` variable to be in the model, due to prior information about its importance, and
- then rate the importance of the other variables as `lcavol` (most important), then `svi_f`, then `age`, then `bph_f`, then `lweight` and `lcp`, followed by `pgg45` (least important).
Step 1. Fit the full model, then see if removing `pgg45` improves AIC…
with(prost, drop1(lm(lpsa ~ gleason_f + lcavol + svi_f +
age + bph_f + lweight + lcp + pgg45),
scope = (~ pgg45)))
Single term deletions
Model:
lpsa ~ gleason_f + lcavol + svi_f + age + bph_f + lweight + lcp +
pgg45
Df Sum of Sq RSS AIC
<none> 41.057 -61.395
pgg45 1 0.44085 41.498 -62.359
Since -62.4 is smaller (i.e. more negative) than -61.4, we delete `pgg45` and move on to assess whether we can remove the variable we deemed next least important (`lcp`).
Step 2. Let’s see if removing `lcp` from this model improves AIC…
with(prost, drop1(lm(lpsa ~ gleason_f + lcavol + svi_f +
age + bph_f + lweight + lcp),
scope = (~ lcp)))
Single term deletions
Model:
lpsa ~ gleason_f + lcavol + svi_f + age + bph_f + lweight + lcp
Df Sum of Sq RSS AIC
<none> 41.498 -62.359
lcp 1 0.56767 42.066 -63.041
Again, since -63.0 is smaller than -62.4, we delete `lcp` and next assess whether we should delete `lweight`.
Step 3. Does removing `lweight` from this model improve AIC?
with(prost, drop1(lm(lpsa ~ gleason_f + lcavol + svi_f +
age + bph_f + lweight),
scope = (~ lweight)))
Single term deletions
Model:
lpsa ~ gleason_f + lcavol + svi_f + age + bph_f + lweight
Df Sum of Sq RSS AIC
<none> 42.066 -63.041
lweight 1 5.678 47.744 -52.760
Since the AIC for the model after the removal of `lweight` is larger (i.e. less negative), we stop, and declare our final model by the Allen-Cady approach to include `gleason_f`, `lcavol`, `svi_f`, `age`, `bph_f` and `lweight`.
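Running `drop1` by hand at each step generalizes naturally to a small loop. The function below is a hypothetical sketch (`allen_cady` is a made-up helper, not a standard R function): given a character vector of forced-in terms and the remaining candidates ranked from most to least important, it deletes the current least-important term whenever doing so lowers AIC, and stops at the first term whose deletion would raise AIC.

```r
# Hedged sketch of automating the Allen-Cady loop with base R's drop1.
allen_cady <- function(outcome, forced, ranked, data) {
  preds <- c(forced, ranked)       # ranked: most important first
  while (length(preds) > length(forced)) {
    candidate <- preds[length(preds)]         # least important remaining
    fit <- lm(reformulate(preds, response = outcome), data = data)
    d <- drop1(fit, scope = reformulate(candidate))
    # drop1's table: row 1 is <none>, row 2 is the proposed deletion
    if (d$AIC[2] < d$AIC[1]) {
      preds <- preds[-length(preds)]          # deletion lowers AIC: drop it
    } else break                              # deletion hurts AIC: stop
  }
  reformulate(preds, response = outcome)
}

# With the chapter's setup, this would mirror Steps 1-3 above:
# allen_cady("lpsa", forced = "gleason_f",
#            ranked = c("lcavol", "svi_f", "age", "bph_f",
#                       "lweight", "lcp", "pgg45"),
#            data = prost)
```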
7.7 Summarizing the Results
| Method | Suggested Predictors |
|---|---|
| Forward selection | `lcavol`, `lweight`, `svi_f` |
| Backwards elimination | `lcavol`, `lweight`, `svi_f`, `age`, `bph_f` |
| Allen-Cady approach | `lcavol`, `lweight`, `svi_f`, `age`, `bph_f`, `gleason_f` |
7.7.1 In-Sample Testing and Summaries
Since these models are nested in each other, let’s look at the summary statistics (like R2, and AIC) and also run an ANOVA-based comparison of these nested models to each other, to the model with the intercept alone, and to the kitchen sink model with all available predictors. (Note that, as fit below, `prost_m_bw` includes `gleason_f`, which backwards elimination actually removed, so it matches the Allen-Cady selection, and `prost_m_ac` adds `lcp`; the models are arranged this way so that each one nests inside the next.)
prost_m_int <- lm(lpsa ~ 1, data = prost)
prost_m_fw <- lm(lpsa ~ lcavol + lweight + svi_f, data = prost)
prost_m_bw <- lm(lpsa ~ lcavol + lweight + svi_f +
age + bph_f + gleason_f, data = prost)
prost_m_ac <- lm(lpsa ~ lcavol + lweight + svi_f +
age + bph_f + gleason_f + lcp, data = prost)
prost_m_ks <- lm(lpsa ~ lcavol + lweight + svi_f +
age + bph_f + gleason_f + lcp + pgg45, data = prost)
7.7.1.1 Model Fit Summaries (in-sample) from glance
Here are the models, at a `glance` from the `broom` package.
prost_sum <- bind_rows(glance(prost_m_int), glance(prost_m_fw),
glance(prost_m_bw), glance(prost_m_ac),
glance(prost_m_ks)) %>% round(., 3)
prost_sum$names <- c("intercept", "lcavol + lweight + svi",
                     "... + age + bph + gleason", "... + lcp", "... + pgg45")
prost_sum <- prost_sum %>%
select(names, r.squared, adj.r.squared, AIC, BIC, sigma, df, df.residual)
prost_sum
# A tibble: 5 x 8
names r.squared adj.r.squared AIC BIC sigma df df.residual
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 intercept 0 0 306. 311. 1.15 1 96
2 lcavol + lwe~ 0.636 0.624 214. 227. 0.708 4 93
3 ... + age + ~ 0.671 0.641 214. 240. 0.691 9 88
4 ... + lcp 0.676 0.642 215. 243. 0.691 10 87
5 ... + pgg45 0.679 0.642 216. 247. 0.691 11 86
From these summaries, it looks like:
- the adjusted R2 is essentially indistinguishable between the three largest models, but a bit less strong with the three-predictor (4 df) model, and
- the AIC and BIC are (slightly) better (lower) with the three-predictor model (4 df) than any other.
So we might be motivated by these summaries to select any of the three models we’re studying closely here.
7.7.1.2 Model Testing via ANOVA (in-sample)
To obtain ANOVA-based test results, we’ll run…
anova(prost_m_int, prost_m_fw, prost_m_bw, prost_m_ac, prost_m_ks)
Analysis of Variance Table
Model 1: lpsa ~ 1
Model 2: lpsa ~ lcavol + lweight + svi_f
Model 3: lpsa ~ lcavol + lweight + svi_f + age + bph_f + gleason_f
Model 4: lpsa ~ lcavol + lweight + svi_f + age + bph_f + gleason_f + lcp
Model 5: lpsa ~ lcavol + lweight + svi_f + age + bph_f + gleason_f + lcp +
pgg45
Res.Df RSS Df Sum of Sq F Pr(>F)
1 96 127.918
2 93 46.568 3 81.349 56.7991 <2e-16 ***
3 88 42.066 5 4.503 1.8863 0.1050
4 87 41.498 1 0.568 1.1891 0.2786
5 86 41.057 1 0.441 0.9234 0.3393
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What conclusions can we draw on the basis of these ANOVA tests?
- There is a statistically significant improvement in predictive value for Model 2 (the forward selection result) as compared to Model 1 (the intercept only).
- The ANOVA test comparing Model 5 (the kitchen sink) to Model 4 shows no statistically significant improvement in predictive value.
- Neither does the ANOVA test comparing Model 3 to Model 2, or Model 4 to Model 3.
This suggests that, if we are willing to let the ANOVA test decide our best model, then that would be the model produced by forward selection, with predictors `lcavol`, `lweight` and `svi_f`. But we haven’t validated the models.
- If the purpose of the model is to predict new data, some sort of out-of-sample or cross-validation approach will be necessary, and
- Even if our goal isn’t prediction but merely description of the current data, we would still want to build diagnostic plots to check regression assumptions in each model, and
- There is no reason to assume in advance that any of these models is in fact correct, or that any one of these stepwise approaches is necessarily better than any other, and
- The mere act of running a stepwise regression model, as we’ll see, can increase the bias in our findings if we accept the results at face value.
So we’ll need some ways to validate the results once we complete the selection process.
7.7.2 Validating the Results of the Various Models
We can use a 5-fold cross-validation approach to assess the predictions made by our potential models and then compare them. Let’s compare our three models:
- the three-predictor model obtained by forward selection, including `lcavol`, `lweight`, and `svi_f`,
- the five-predictor model obtained by backwards elimination, including `lcavol`, `lweight`, `svi_f`, and also `age` and `bph_f`, and
- the six-predictor model obtained by the Allen-Cady approach, adding `gleason_f` to the previous model.
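In these summaries, with $y_i$ the observed `lpsa` value in a held-out fold and $\hat{y}_i$ the corresponding cross-validated prediction, the two error measures are:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

RMSE penalizes large prediction errors more heavily than MAE does, which is why the two measures can rank models differently.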
Here’s the 5-fold validation work (and resulting RMSE and MAE estimates) for the three-predictor model.
set.seed(43201012)
prost3_models <- prost %>%
crossv_kfold(k = 5) %>%
mutate(model = map(train, ~ lm(lpsa ~ lcavol + lweight +
svi_f, data = .)))
prost3_preds <- prost3_models %>%
unnest(map2(model, test, ~ augment(.x, newdata = .y)))
prost3_preds %>%
summarize(RMSE_prost3 = sqrt(mean((lpsa - .fitted) ^2)),
MAE_prost3 = mean(abs(lpsa - .fitted)))
# A tibble: 1 x 2
RMSE_prost3 MAE_prost3
<dbl> <dbl>
1 0.745 0.587
Now, we’ll generate the RMSE and MAE estimates for the five-predictor model.
set.seed(43206879)
prost5_models <- prost %>%
crossv_kfold(k = 5) %>%
mutate(model = map(train, ~ lm(lpsa ~ lcavol + lweight +
svi_f + age + bph_f, data = .)))
prost5_preds <- prost5_models %>%
unnest(map2(model, test, ~ augment(.x, newdata = .y)))
prost5_preds %>%
summarize(RMSE_prost5 = sqrt(mean((lpsa - .fitted) ^2)),
MAE_prost5 = mean(abs(lpsa - .fitted)))
# A tibble: 1 x 2
RMSE_prost5 MAE_prost5
<dbl> <dbl>
1 0.750 0.581
And at last, we’ll generate the RMSE and MAE estimates for the six-predictor model.
set.seed(43236198)
prost6_models <- prost %>%
crossv_kfold(k = 5) %>%
mutate(model = map(train, ~ lm(lpsa ~ lcavol + lweight +
svi_f + age + bph_f + gleason_f, data = .)))
prost6_preds <- prost6_models %>%
unnest(map2(model, test, ~ augment(.x, newdata = .y)))
prost6_preds %>%
summarize(RMSE_prost6 = sqrt(mean((lpsa - .fitted) ^2)),
MAE_prost6 = mean(abs(lpsa - .fitted)))
# A tibble: 1 x 2
RMSE_prost6 MAE_prost6
<dbl> <dbl>
1 0.720 0.551
It appears that the six-predictor model does better than either of the other two approaches, with smaller RMSE and MAE. The three-predictor model does slightly better in terms of root mean square prediction error and slightly worse in terms of mean absolute prediction error than the five-predictor model.
OK. A mixed bag, with different conclusions depending on which summary we want to look at. But of course, stepwise regression isn’t the only way to do variable selection. Let’s consider a broader range of potential predictor sets.
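One caveat about the comparison above: each model was cross-validated on its own random partition (the three seeds differ), so part of the RMSE/MAE gap could be fold-to-fold noise. A hedged sketch of a fairer version re-uses one set of folds for every model; `cv_rmse` here is a made-up helper, not part of any package.

```r
library(modelr)   # crossv_kfold
library(purrr)    # map, map2_df
library(broom)    # augment

# Cross-validated RMSE for a model formula, on a FIXED set of folds,
# so every model sees exactly the same train/test splits.
cv_rmse <- function(formula, folds) {
  outcome <- all.vars(formula)[1]              # response variable name
  models <- map(folds$train, ~ lm(formula, data = .))
  preds  <- map2_df(models, folds$test,
                    ~ augment(.x, newdata = as.data.frame(.y)))
  sqrt(mean((preds[[outcome]] - preds$.fitted)^2))
}

# With the prost data loaded, one shared partition serves all three models:
# set.seed(432)
# folds <- crossv_kfold(prost, k = 5)
# cv_rmse(lpsa ~ lcavol + lweight + svi_f, folds)
# cv_rmse(lpsa ~ lcavol + lweight + svi_f + age + bph_f, folds)
# cv_rmse(lpsa ~ lcavol + lweight + svi_f + age + bph_f + gleason_f, folds)
```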
References
Ramsey, Fred L., and Daniel W. Schafer. 2002. The Statistical Sleuth: A Course in Methods of Data Analysis. Second Edition. Pacific Grove, CA: Duxbury.
Leeb, Hannes, and Benedikt M. Pötscher. 2005. “Model Selection and Inference: Facts and Fiction.” Econometric Theory 21(1): 21–59. https://www.jstor.org/stable/3533623.
Vittinghoff, Eric, David V. Glidden, Stephen C. Shiboski, and Charles E. McCulloch. 2012. Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models. Second Edition. Springer-Verlag, Inc. http://www.biostat.ucsf.edu/vgsm/.