Chapter 7 Stepwise Variable Selection
7.1 Strategy for Model Selection
Ramsey and Schafer (2002) suggest that a strategy for dealing with many potential explanatory variables should include the following elements:
- Identify the key objectives.
- Screen the available variables, deciding on a list that is sensitive to the objectives and excludes obvious redundancies.
- Perform exploratory analysis, examining graphical displays and correlation coefficients.
- Perform transformations, as necessary.
- Examine a residual plot after fitting a rich model, performing further transformations and considering outliers.
- Find a suitable subset of the predictors, exerting enough control over any semi-automated selection procedure to be sensitive to the questions of interest.
- Proceed with the analysis, using the selected explanatory variables.
The Two Key Aspects of Model Selection are:
- Evaluating each potential subset of predictor variables
- Deciding on the collection of potential subsets
7.1.1 How Do We Choose Potential Subsets of Predictors?
Choosing potential subsets of predictor variables usually involves either:
- Stepwise approaches
- All possible subset (or best possible subset) searches
Note that the use of any variable selection procedure changes the properties of …
- the estimated coefficients, which are biased, and
- the associated tests and confidence intervals, which are overly optimistic.
Leeb and Pötscher (2005) summarize the key issues:
- Regardless of sample size, the model selection step typically has a dramatic effect on the sampling properties of the estimators that cannot be ignored. In particular, the sampling properties of post-model-selection estimators are typically significantly different from the nominal distributions that arise if a fixed model is supposed.
- As a consequence, use of inference procedures that do not take into account the model selection step (e.g. using standard t-intervals as if the selected model has been given prior to the statistical analysis) can be highly misleading.
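The second point is easy to demonstrate by simulation. In the hypothetical sketch below (not from the chapter), every candidate predictor is pure noise, yet picking the candidate most correlated with the outcome and then testing it with an ordinary t-test rejects the (true) null far more often than the nominal 5%:

```r
# Hedged sketch: post-selection optimism with pure-noise predictors.
# All settings (n_sims, n, p, the seed) are illustrative choices.
set.seed(432)
n_sims <- 500; n <- 50; p <- 20
reject <- replicate(n_sims, {
  x <- matrix(rnorm(n * p), n, p)
  y <- rnorm(n)                      # y is unrelated to every column of x
  best <- which.max(abs(cor(x, y)))  # the "model selection" step
  fit <- lm(y ~ x[, best])
  summary(fit)$coefficients[2, 4] < 0.05  # naive t-test on the selected term
})
mean(reject)   # far above the nominal 0.05
```

The naive t-interval behaves as if `x[, best]` had been chosen before seeing the data, which is exactly the mistake Leeb and Pötscher describe.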
7.2 A “Kitchen Sink” Model (Model c5_prost_ks)
Suppose that we now consider a model for the `prost` data we have been working with, which includes main effects (and, in this case, only the main effects) of all eight candidate predictors for `lpsa`, as follows.
c5_prost_ks <- lm(lpsa ~ lcavol + lweight + age + bph_f + svi_f +
lcp + gleason_f + pgg45, data = prost)
tidy(c5_prost_ks)
# A tibble: 11 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.170 0.931 0.182 0.856
2 lcavol 0.544 0.0880 6.19 0.0000000201
3 lweight 0.702 0.203 3.46 0.000846
4 age -0.0239 0.0111 -2.15 0.0341
5 bph_fMedium 0.364 0.183 1.99 0.0493
6 bph_fHigh 0.249 0.196 1.27 0.208
7 svi_fYes 0.711 0.242 2.94 0.00424
8 lcp -0.119 0.0895 -1.33 0.186
9 gleason_f7 0.221 0.343 0.643 0.522
10 gleason_f6 -0.0531 0.430 -0.123 0.902
11 pgg45 0.00398 0.00415 0.961 0.339
glance(c5_prost_ks)
# A tibble: 1 x 11
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.679 0.642 0.691 18.2 2.37e-17 11 -95.9 216. 247.
# ... with 2 more variables: deviance <dbl>, df.residual <int>
We’ll often refer to this (all predictors on board) approach as a “kitchen sink” model[This refers to the English idiom “… everything but the kitchen sink” which describes, essentially, everything imaginable. A “kitchen sink regression” is often used as a pejorative term, since no special skill or insight is required to identify it, given a list of potential predictors. For more, yes, there is a Wikipedia page.].
7.3 Sequential Variable Selection: Stepwise Approaches
- Forward Selection
- We begin with a constant mean and then add potential predictors one at a time according to some criterion (R defaults to minimizing the Akaike Information Criterion) until no further addition significantly improves the fit.
- Each categorical factor variable is represented in the regression model as a set of indicator variables. In the absence of a good reason to do something else, the set is added to the model as a single unit, and R does this automatically.
- Backwards Elimination
- Start with the “kitchen sink” model and then delete potential predictors one at a time.
- Backwards Elimination is less likely than Forward Selection to omit negatively confounded sets of variables, though all stepwise procedures have problems.
- Stepwise Regression can also be done by combining these methods.
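In R, the combined approach just changes the `direction` argument of `step`. A minimal self-contained sketch, using the built-in `mtcars` data and an illustrative candidate list (chosen only so the example runs on its own, not part of the prost analysis):

```r
# Stepwise selection allowing both additions and deletions at each step.
m_start <- lm(mpg ~ 1, data = mtcars)
m_both <- step(m_start,
               scope = ~ wt + hp + disp + qsec,  # candidate predictors
               direction = "both",  # each step may add OR drop a term
               trace = 0)           # suppress the step-by-step log
formula(m_both)
```

With `direction = "both"`, a term added early can be dropped later if other additions make it redundant.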
7.3.1 The Big Problems with Stepwise Regression
There is no reason to assume that a single best model can be found.
- The use of forward selection, or backwards elimination, or stepwise regression including both procedures, will NOT always find the same model.
- It also appears to be essentially useless to try different stepwise methods to look for agreement.
Users of stepwise regression frequently place all of their attention on the particular explanatory variables included in the resulting model, when there’s no reason (in most cases) to assume that model is in any way optimal.
Despite all of its problems, let’s use stepwise regression to help predict `lpsa` given a subset of the eight predictors in `c5_prost_ks`.
7.4 Forward Selection with the `step` function
- Specify the null model (intercept only)
- Specify the variables R should consider as predictors (in the `scope` element of the `step` function)
- Specify forward selection only
- R defaults to using AIC as its stepwise criterion
with(prost,
step(lm(lpsa ~ 1),
scope=(~ lcavol + lweight + age + bph_f + svi_f +
lcp + gleason_f + pgg45),
direction="forward"))
Start: AIC=28.84
lpsa ~ 1
Df Sum of Sq RSS AIC
+ lcavol 1 69.003 58.915 -44.366
+ svi_f 1 41.011 86.907 -6.658
+ lcp 1 38.528 89.389 -3.926
+ gleason_f 2 30.121 97.796 6.793
+ lweight 1 24.019 103.899 10.665
+ pgg45 1 22.814 105.103 11.783
+ age 1 3.679 124.239 28.007
<none> 127.918 28.838
+ bph_f 2 4.681 123.237 29.221
Step: AIC=-44.37
lpsa ~ lcavol
Df Sum of Sq RSS AIC
+ lweight 1 7.1726 51.742 -54.958
+ svi_f 1 5.2375 53.677 -51.397
+ bph_f 2 3.2994 55.615 -45.956
+ pgg45 1 1.6980 57.217 -45.203
+ gleason_f 2 2.7834 56.131 -45.061
<none> 58.915 -44.366
+ lcp 1 0.6562 58.259 -43.452
+ age 1 0.0025 58.912 -42.370
Step: AIC=-54.96
lpsa ~ lcavol + lweight
Df Sum of Sq RSS AIC
+ svi_f 1 5.1737 46.568 -63.177
+ pgg45 1 1.8158 49.926 -56.424
+ gleason_f 2 2.6770 49.065 -56.111
<none> 51.742 -54.958
+ lcp 1 0.8187 50.923 -54.506
+ age 1 0.6456 51.097 -54.176
+ bph_f 2 1.4583 50.284 -53.731
Step: AIC=-63.18
lpsa ~ lcavol + lweight + svi_f
Df Sum of Sq RSS AIC
<none> 46.568 -63.177
+ gleason_f 2 1.60467 44.964 -62.579
+ age 1 0.62301 45.945 -62.484
+ bph_f 2 1.50046 45.068 -62.354
+ pgg45 1 0.50069 46.068 -62.226
+ lcp 1 0.06937 46.499 -61.322
Call:
lm(formula = lpsa ~ lcavol + lweight + svi_f)
Coefficients:
(Intercept) lcavol lweight svi_fYes
-0.7772 0.5259 0.6618 0.6657
The resulting model, arrived at after three forward selection steps, includes `lcavol`, `lweight` and `svi_f`.
model.fs <- lm(lpsa ~ lcavol + lweight + svi_f,
data=prost)
summary(model.fs)$adj.r.squared
[1] 0.6242063
extractAIC(model.fs)
[1] 4.00000 -63.17744
The adjusted R2 value for this model is 0.624, and the AIC value used by the stepwise procedure is -63.18, on 4 effective degrees of freedom.
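For a linear model, the AIC that `step` and `extractAIC` report is computed only up to an additive constant (which is why these values differ from the AIC reported by `glance` or `AIC` for the same models):

$$\mathrm{AIC} = n \log\left(\frac{\mathrm{RSS}}{n}\right) + 2\,\mathrm{edf}$$

Here, with $n = 97$, $\mathrm{RSS} = 46.568$ and $\mathrm{edf} = 4$ (the number of estimated coefficients), we get $97 \log(46.568/97) + 2(4) \approx -63.18$, matching the value above.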
7.5 Backward Elimination using the `step` function
In this case, the backward elimination approach, using reduction in AIC as its criterion, does not come to the same conclusion about the “best” model as forward selection did.
with(prost,
step(lm(lpsa ~ lcavol + lweight + age + bph_f +
svi_f + lcp + gleason_f + pgg45),
direction="backward"))
Start: AIC=-61.4
lpsa ~ lcavol + lweight + age + bph_f + svi_f + lcp + gleason_f +
pgg45
Df Sum of Sq RSS AIC
- gleason_f 2 1.1832 42.240 -62.639
- pgg45 1 0.4409 41.498 -62.359
- lcp 1 0.8492 41.906 -61.409
<none> 41.057 -61.395
- bph_f 2 2.0299 43.087 -60.714
- age 1 2.2129 43.270 -58.303
- svi_f 1 4.1207 45.178 -54.118
- lweight 1 5.7123 46.769 -50.760
- lcavol 1 18.2738 59.331 -27.683
Step: AIC=-62.64
lpsa ~ lcavol + lweight + age + bph_f + svi_f + lcp + pgg45
Df Sum of Sq RSS AIC
- lcp 1 0.8470 43.087 -62.713
<none> 42.240 -62.639
- pgg45 1 1.2029 43.443 -61.916
- bph_f 2 2.2515 44.492 -61.602
- age 1 2.0730 44.313 -59.992
- svi_f 1 4.6431 46.884 -54.523
- lweight 1 5.5988 47.839 -52.566
- lcavol 1 21.4956 63.736 -24.736
Step: AIC=-62.71
lpsa ~ lcavol + lweight + age + bph_f + svi_f + pgg45
Df Sum of Sq RSS AIC
- pgg45 1 0.5860 43.673 -63.403
<none> 43.087 -62.713
- bph_f 2 2.0214 45.109 -62.266
- age 1 1.7101 44.798 -60.938
- svi_f 1 3.7964 46.884 -56.523
- lweight 1 5.6462 48.734 -52.769
- lcavol 1 22.5152 65.603 -23.936
Step: AIC=-63.4
lpsa ~ lcavol + lweight + age + bph_f + svi_f
Df Sum of Sq RSS AIC
<none> 43.673 -63.403
- bph_f 2 2.2720 45.945 -62.484
- age 1 1.3945 45.068 -62.354
- svi_f 1 5.2747 48.948 -54.343
- lweight 1 5.3319 49.005 -54.230
- lcavol 1 25.5538 69.227 -20.720
Call:
lm(formula = lpsa ~ lcavol + lweight + age + bph_f + svi_f)
Coefficients:
(Intercept) lcavol lweight age bph_fMedium
0.14329 0.54022 0.67283 -0.01819 0.37607
bph_fHigh svi_fYes
0.27216 0.68174
The backwards elimination approach in this case lands on a model with five inputs (one of which, `bph_f`, is represented by two indicators), eliminating only `gleason_f`, `pgg45` and `lcp`.
7.6 Allen-Cady Modified Backward Elimination
Ranking candidate predictors by importance in advance of backwards elimination can help avoid false positives, while reducing model size. See Vittinghoff et al. (2012), Section 10.3 for more details.
- First, force into the model any predictors of primary interest, and any confounders necessary for face validity of the final model.
- “Some variables in the hypothesized causal model may be such well-established causal antecedents of the outcome that it makes sense to include them, essentially to establish the face validity of the model and without regard to the strength or statistical significance of their associations with the primary predictor and outcome …”
- Rank the remaining candidate predictors in order of importance.
- Starting from an initial model with all candidate predictors included, delete predictors in order of ascending importance until reaching the first variable that meets a criterion to stay in the model, and then stop.
Only the remaining variable hypothesized to be least important is eligible for removal at each step. When we are willing to do this sorting before collecting (or analyzing) the data, then we can do Allen-Cady backwards elimination using the drop1
command in R.
7.6.1 Demonstration of the Allen-Cady approach
Suppose, for the moment, that we decided to fit a model for the log of `psa` using the eight candidates `lcavol + lweight + svi_f + age + bph_f + gleason_f + lcp + pgg45`, and we decided (before we saw the data) that we would:
- force the `gleason_f` variable to be in the model, due to prior information about its importance, and
- then rate the importance of the other variables as `lcavol` (most important), then `svi_f`, then `age`, then `bph_f`, then `lweight` and `lcp`, followed by `pgg45` (least important).
Step 1. Fit the full model, then see if removing `pgg45` improves AIC…
with(prost, drop1(lm(lpsa ~ gleason_f + lcavol + svi_f +
age + bph_f + lweight + lcp + pgg45),
scope = (~ pgg45)))
Single term deletions
Model:
lpsa ~ gleason_f + lcavol + svi_f + age + bph_f + lweight + lcp +
pgg45
Df Sum of Sq RSS AIC
<none> 41.057 -61.395
pgg45 1 0.44085 41.498 -62.359
Since -62.4 is smaller (i.e. more negative) than -61.4, we delete `pgg45` and move on to assess whether we can remove the variable we deemed next least important (`lcp`).
Step 2. Let’s see if removing `lcp` from this model improves AIC…
with(prost, drop1(lm(lpsa ~ gleason_f + lcavol + svi_f +
age + bph_f + lweight + lcp),
scope = (~ lcp)))
Single term deletions
Model:
lpsa ~ gleason_f + lcavol + svi_f + age + bph_f + lweight + lcp
Df Sum of Sq RSS AIC
<none> 41.498 -62.359
lcp 1 0.56767 42.066 -63.041
Again, since -63.0 is smaller than -62.4, we delete `lcp` and next assess whether we should delete `lweight`.
Step 3. Does removing `lweight` from this model improve AIC?
with(prost, drop1(lm(lpsa ~ gleason_f + lcavol + svi_f +
age + bph_f + lweight),
scope = (~ lweight)))
Single term deletions
Model:
lpsa ~ gleason_f + lcavol + svi_f + age + bph_f + lweight
Df Sum of Sq RSS AIC
<none> 42.066 -63.041
lweight 1 5.678 47.744 -52.760
Since the AIC for the model after the removal of `lweight` is larger (i.e. less negative), we stop, and declare our final model by the Allen-Cady approach to include `gleason_f`, `lcavol`, `svi_f`, `age`, `bph_f` and `lweight`.
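Running `drop1` by hand at each step generalizes naturally to a small loop. The function below is a hypothetical sketch (`allen_cady` is a made-up helper, not a standard R function): given a character vector of forced-in terms and the remaining candidates ranked from most to least important, it deletes the current least-important term whenever doing so lowers AIC, and stops at the first term whose deletion would raise AIC.

```r
# Hedged sketch of automating the Allen-Cady loop with base R's drop1.
allen_cady <- function(outcome, forced, ranked, data) {
  preds <- c(forced, ranked)       # ranked: most important first
  while (length(preds) > length(forced)) {
    candidate <- preds[length(preds)]         # least important remaining
    fit <- lm(reformulate(preds, response = outcome), data = data)
    d <- drop1(fit, scope = reformulate(candidate))
    # drop1's table: row 1 is <none>, row 2 is the proposed deletion
    if (d$AIC[2] < d$AIC[1]) {
      preds <- preds[-length(preds)]          # deletion lowers AIC: drop it
    } else break                              # deletion hurts AIC: stop
  }
  reformulate(preds, response = outcome)
}

# With the chapter's setup, this would mirror Steps 1-3 above:
# allen_cady("lpsa", forced = "gleason_f",
#            ranked = c("lcavol", "svi_f", "age", "bph_f",
#                       "lweight", "lcp", "pgg45"),
#            data = prost)
```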
7.7 Summarizing the Results
| Method | Suggested Predictors |
|---|---|
| Forward selection | `lcavol`, `lweight`, `svi_f` |
| Backwards elimination | `lcavol`, `lweight`, `svi_f`, `age`, `bph_f` |
| Allen-Cady approach | `lcavol`, `lweight`, `svi_f`, `age`, `bph_f`, `gleason_f` |
7.7.1 In-Sample Testing and Summaries
Since these models are nested in each other, let’s look at the summary statistics (like R2, and AIC) and also run an ANOVA-based comparison of these nested models to each other, to the model with the intercept alone, and to the kitchen sink model with all available predictors. (Note that, as fit below, `prost_m_bw` includes `gleason_f`, which backwards elimination actually removed, so it matches the Allen-Cady selection, and `prost_m_ac` adds `lcp`; the models are arranged this way so that each one nests inside the next.)
prost_m_int <- lm(lpsa ~ 1, data = prost)
prost_m_fw <- lm(lpsa ~ lcavol + lweight + svi_f, data = prost)
prost_m_bw <- lm(lpsa ~ lcavol + lweight + svi_f +
age + bph_f + gleason_f, data = prost)
prost_m_ac <- lm(lpsa ~ lcavol + lweight + svi_f +
age + bph_f + gleason_f + lcp, data = prost)
prost_m_ks <- lm(lpsa ~ lcavol + lweight + svi_f +
age + bph_f + gleason_f + lcp + pgg45, data = prost)
7.7.1.1 Model Fit Summaries (in-sample) from glance
Here are the models, at a `glance` from the `broom` package.
prost_sum <- bind_rows(glance(prost_m_int), glance(prost_m_fw),
glance(prost_m_bw), glance(prost_m_ac),
glance(prost_m_ks)) %>% round(., 3)
prost_sum$names <- c("intercept", "lcavol + lweight + svi",
                     "... + age + bph + gleason", "... + lcp", "... + pgg45")
prost_sum <- prost_sum %>%
select(names, r.squared, adj.r.squared, AIC, BIC, sigma, df, df.residual)
prost_sum
# A tibble: 5 x 8
names r.squared adj.r.squared AIC BIC sigma df df.residual
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 intercept 0 0 306. 311. 1.15 1 96
2 lcavol + lwe~ 0.636 0.624 214. 227. 0.708 4 93
3 ... + age + ~ 0.671 0.641 214. 240. 0.691 9 88
4 ... + lcp 0.676 0.642 215. 243. 0.691 10 87
5 ... + pgg45 0.679 0.642 216. 247. 0.691 11 86
From these summaries, it looks like:
- the adjusted R2 is essentially indistinguishable between the three largest models, but a bit less strong with the three-predictor (4 df) model, and
- the AIC and BIC are (slightly) better (lower) with the three-predictor model (4 df) than any other.
So we might be motivated by these summaries to select any of the three models we’re studying closely here.
7.7.1.2 Model Testing via ANOVA (in-sample)
To obtain ANOVA-based test results, we’ll run…
anova(prost_m_int, prost_m_fw, prost_m_bw, prost_m_ac, prost_m_ks)
Analysis of Variance Table
Model 1: lpsa ~ 1
Model 2: lpsa ~ lcavol + lweight + svi_f
Model 3: lpsa ~ lcavol + lweight + svi_f + age + bph_f + gleason_f
Model 4: lpsa ~ lcavol + lweight + svi_f + age + bph_f + gleason_f + lcp
Model 5: lpsa ~ lcavol + lweight + svi_f + age + bph_f + gleason_f + lcp +
pgg45
Res.Df RSS Df Sum of Sq F Pr(>F)
1 96 127.918
2 93 46.568 3 81.349 56.7991 <2e-16 ***
3 88 42.066 5 4.503 1.8863 0.1050
4 87 41.498 1 0.568 1.1891 0.2786
5 86 41.057 1 0.441 0.9234 0.3393
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What conclusions can we draw on the basis of these ANOVA tests?
- There is a statistically significant improvement in predictive value for Model 2 (the forward selection result) as compared to Model 1 (the intercept only).
- The ANOVA test comparing Model 5 (the kitchen sink) to Model 4 shows no statistically significant improvement in predictive value.
- Neither does the ANOVA test comparing Model 3 to Model 2, or Model 4 to Model 3.
This suggests that, if we are willing to let the ANOVA test decide our best model, then that would be the model produced by forward selection, with predictors `lcavol`, `lweight` and `svi_f`. But we haven’t validated the models.
- If the purpose of the model is to predict new data, some sort of out-of-sample or cross-validation approach will be necessary, and
- Even if our goal isn’t prediction but merely description of the current data, we would still want to build diagnostic plots to check regression assumptions in each model, and
- There is no reason to assume in advance that any of these models is in fact correct, or that any one of these stepwise approaches is necessarily better than any other, and
- The mere act of running a stepwise regression model, as we’ll see, can increase the bias in our findings if we accept the results at face value.
So we’ll need some ways to validate the results once we complete the selection process.
7.7.2 Validating the Results of the Various Models
We can use a 5-fold cross-validation approach to assess the predictions made by our potential models and then compare them. Let’s compare our three models:
- the three-predictor model obtained by forward selection, including `lcavol`, `lweight`, and `svi_f`,
- the five-predictor model obtained by backwards elimination, including `lcavol`, `lweight`, `svi_f`, and also `age` and `bph_f`, and
- the six-predictor model obtained by the Allen-Cady approach, adding `gleason_f` to the previous model.
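In these summaries, with $y_i$ the observed `lpsa` value in a held-out fold and $\hat{y}_i$ the corresponding cross-validated prediction, the two error measures are:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

RMSE penalizes large prediction errors more heavily than MAE does, which is why the two measures can rank models differently.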
Here’s the 5-fold validation work (and resulting RMSE and MAE estimates) for the three-predictor model.
set.seed(43201012)
prost3_models <- prost %>%
crossv_kfold(k = 5) %>%
mutate(model = map(train, ~ lm(lpsa ~ lcavol + lweight +
svi_f, data = .)))
prost3_preds <- prost3_models %>%
unnest(map2(model, test, ~ augment(.x, newdata = .y)))
prost3_preds %>%
summarize(RMSE_prost3 = sqrt(mean((lpsa - .fitted) ^2)),
MAE_prost3 = mean(abs(lpsa - .fitted)))
# A tibble: 1 x 2
RMSE_prost3 MAE_prost3
<dbl> <dbl>
1 0.745 0.587
Now, we’ll generate the RMSE and MAE estimates for the five-predictor model.
set.seed(43206879)
prost5_models <- prost %>%
crossv_kfold(k = 5) %>%
mutate(model = map(train, ~ lm(lpsa ~ lcavol + lweight +
svi_f + age + bph_f, data = .)))
prost5_preds <- prost5_models %>%
unnest(map2(model, test, ~ augment(.x, newdata = .y)))
prost5_preds %>%
summarize(RMSE_prost5 = sqrt(mean((lpsa - .fitted) ^2)),
MAE_prost5 = mean(abs(lpsa - .fitted)))
# A tibble: 1 x 2
RMSE_prost5 MAE_prost5
<dbl> <dbl>
1 0.750 0.581
And at last, we’ll generate the RMSE and MAE estimates for the six-predictor model.
set.seed(43236198)
prost6_models <- prost %>%
crossv_kfold(k = 5) %>%
mutate(model = map(train, ~ lm(lpsa ~ lcavol + lweight +
svi_f + age + bph_f + gleason_f, data = .)))
prost6_preds <- prost6_models %>%
unnest(map2(model, test, ~ augment(.x, newdata = .y)))
prost6_preds %>%
summarize(RMSE_prost6 = sqrt(mean((lpsa - .fitted) ^2)),
MAE_prost6 = mean(abs(lpsa - .fitted)))
# A tibble: 1 x 2
RMSE_prost6 MAE_prost6
<dbl> <dbl>
1 0.720 0.551
It appears that the six-predictor model does better than either of the other two approaches, with smaller RMSE and MAE. The three-predictor model does slightly better in terms of root mean square prediction error and slightly worse in terms of mean absolute prediction error than the five-predictor model.
OK. A mixed bag, with different conclusions depending on which summary we want to look at. But of course, stepwise regression isn’t the only way to do variable selection. Let’s consider a broader range of potential predictor sets.
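One caveat about the comparison above: each model was cross-validated on its own random partition (the three seeds differ), so part of the RMSE/MAE gap could be fold-to-fold noise. A hedged sketch of a fairer version re-uses one set of folds for every model; `cv_rmse` here is a made-up helper, not part of any package.

```r
library(modelr)   # crossv_kfold
library(purrr)    # map, map2_df
library(broom)    # augment

# Cross-validated RMSE for a model formula, on a FIXED set of folds,
# so every model sees exactly the same train/test splits.
cv_rmse <- function(formula, folds) {
  outcome <- all.vars(formula)[1]              # response variable name
  models <- map(folds$train, ~ lm(formula, data = .))
  preds  <- map2_df(models, folds$test,
                    ~ augment(.x, newdata = as.data.frame(.y)))
  sqrt(mean((preds[[outcome]] - preds$.fitted)^2))
}

# With the prost data loaded, one shared partition serves all three models:
# set.seed(432)
# folds <- crossv_kfold(prost, k = 5)
# cv_rmse(lpsa ~ lcavol + lweight + svi_f, folds)
# cv_rmse(lpsa ~ lcavol + lweight + svi_f + age + bph_f, folds)
# cv_rmse(lpsa ~ lcavol + lweight + svi_f + age + bph_f + gleason_f, folds)
```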
References
Ramsey, Fred L., and Daniel W. Schafer. 2002. The Statistical Sleuth: A Course in Methods of Data Analysis. Second Edition. Pacific Grove, CA: Duxbury.
Leeb, Hannes, and Benedikt M. Pötscher. 2005. “Model Selection and Inference: Facts and Fiction.” Econometric Theory 21(1): 21–59. https://www.jstor.org/stable/3533623.
Vittinghoff, Eric, David V. Glidden, Stephen C. Shiboski, and Charles E. McCulloch. 2012. Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models. Second Edition. Springer-Verlag, Inc. http://www.biostat.ucsf.edu/vgsm/.