Chapter 12 “Best Subsets” Variable Selection in our Prostate Cancer Study

A second approach to model selection involves fitting all possible subset models and identifying the ones that look best according to some meaningful criterion, ideally one that rewards models including enough variables to describe the response appropriately without lots of redundant or unnecessary terms.

Here’s the set of predictors we’re considering for our modeling of lpsa, our outcome. Note that five of the eight predictors are quantitative, and the remaining three (bph, svi and gleason) are categorical. A little cleaning gives us the following results.
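Here is a sketch of cleaning steps consistent with the summary below, assuming the raw data arrive in a data frame called prost (the exact recipe in the original code may differ):

library(tidyverse)

# derived variables matching the summary that follows (assumed construction)
prost_c12 <- prost %>%
    mutate(svi_f = fct_recode(svi, No = "0", Yes = "1"),
           gleason_f = fct_relevel(gleason, "> 7", "7"),
           bph_f = fct_relevel(bph, "Low", "Medium"),
           lcavol_c = lcavol - mean(lcavol),   # centered lcavol
           cavol = exp(lcavol),                # back to raw cancer volume
           psa = exp(lpsa))                    # back to raw PSA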

     lcavol           lweight           age            bph     svi   
 Min.   :-1.3471   Min.   :2.375   Min.   :41.00   High  :29   0:76  
 1st Qu.: 0.5128   1st Qu.:3.376   1st Qu.:60.00   Medium:25   1:21  
 Median : 1.4469   Median :3.623   Median :65.00   Low   :43         
 Mean   : 1.3500   Mean   :3.629   Mean   :63.87                     
 3rd Qu.: 2.1270   3rd Qu.:3.876   3rd Qu.:68.00                     
 Max.   : 3.8210   Max.   :4.780   Max.   :79.00                     
      lcp          gleason      pgg45        svi_f    gleason_f    bph_f   
 Min.   :-1.3863   6  :35   Min.   :  0.00   No :76   > 7: 6    Low   :43  
 1st Qu.:-1.3863   7  :56   1st Qu.:  0.00   Yes:21   7  :56    Medium:25  
 Median :-0.7985   > 7: 6   Median : 15.00            6  :35    High  :29  
 Mean   :-0.1794            Mean   : 24.38                                 
 3rd Qu.: 1.1787            3rd Qu.: 40.00                                 
 Max.   : 2.9042            Max.   :100.00                                 
    lcavol_c            cavol             psa        
 Min.   :-2.69708   Min.   : 0.260   Min.   :  0.65  
 1st Qu.:-0.83719   1st Qu.: 1.670   1st Qu.:  5.65  
 Median : 0.09691   Median : 4.250   Median : 13.35  
 Mean   : 0.00000   Mean   : 7.001   Mean   : 23.74  
 3rd Qu.: 0.77703   3rd Qu.: 8.390   3rd Qu.: 21.25  
 Max.   : 2.47099   Max.   :45.650   Max.   :265.85  

12.1 Key Summaries We’ll Use to Evaluate Potential Models (in-sample)

  1. Adjusted \(R^2\), which we try to maximize.
  2. Bayesian Information Criterion (BIC), which we try to minimize.
  3. Mallows’ \(C_p\) statistic, which we also try to minimize, and which is closely related to AIC in this setting.

Choosing between AIC and BIC can be challenging.

For model selection purposes, there is no clear choice between AIC and BIC. Given a family of models, including the true model, the probability that BIC will select the correct model approaches one as the sample size n approaches infinity - thus BIC is asymptotically consistent, which AIC is not. [But, for practical purposes,] BIC often chooses models that are too simple [relative to AIC] because of its heavy penalty on complexity.

  • Source: Hastie, Tibshirani, and Friedman (2001), page 208.

Several useful tools for running “all subsets” or “best subsets” regression comparisons are developed in R’s leaps package.

12.2 Using regsubsets in the leaps package

We can use the leaps package to examine all possible subsets of the candidate predictors for lpsa in our prost study. The leaps package isn’t particularly friendly to the tidyverse; in particular, we cannot have any character variables in our predictor set. We specify our “kitchen sink” model, and apply the regsubsets function from leaps, which identifies the best subsets.

To start, we’ll ask R to find the one best subset (with 1 predictor variable [in addition to the intercept], then with 2 predictors, and then with each of 3, 4, … 8 predictor variables) according to an exhaustive search without forcing any of the variables to be in or out.

  • Use the nvmax argument within the regsubsets function to limit the number of regression inputs to a maximum.
  • Use the nbest argument to identify how many subsets you want to identify for each predictor count.
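Here is a sketch of the call, assuming the cleaned data live in prost_c12; the object names rs_ks and rs_summ are my inventions:

library(leaps)

# exhaustive search: the one best subset at each size from 1 to 8 inputs
rs_ks <- regsubsets(lpsa ~ lcavol + lweight + age + bph + svi +
                        lcp + gleason + pgg45,
                    data = prost_c12, nvmax = 8, nbest = 1)
rs_summ <- summary(rs_ks)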

12.2.1 Identifying the models with which and outmat

The summary of the regsubsets output provides lots of information we’ll need. To see which predictors appear in each selected model, we can print the which component of that summary:

  (Intercept) lcavol lweight   age bphMedium bphLow  svi1   lcp gleason7
1        TRUE   TRUE   FALSE FALSE     FALSE  FALSE FALSE FALSE    FALSE
2        TRUE   TRUE    TRUE FALSE     FALSE  FALSE FALSE FALSE    FALSE
3        TRUE   TRUE    TRUE FALSE     FALSE  FALSE  TRUE FALSE    FALSE
4        TRUE   TRUE    TRUE FALSE     FALSE  FALSE  TRUE FALSE     TRUE
5        TRUE   TRUE    TRUE  TRUE     FALSE   TRUE  TRUE FALSE    FALSE
6        TRUE   TRUE    TRUE  TRUE     FALSE   TRUE  TRUE FALSE     TRUE
7        TRUE   TRUE    TRUE  TRUE     FALSE   TRUE  TRUE  TRUE     TRUE
8        TRUE   TRUE    TRUE  TRUE     FALSE   TRUE  TRUE  TRUE     TRUE
  gleason> 7 pgg45
1      FALSE FALSE
2      FALSE FALSE
3      FALSE FALSE
4      FALSE FALSE
5      FALSE FALSE
6      FALSE FALSE
7      FALSE FALSE
8      FALSE  TRUE

The outmat component gives another version of this, formatted for printing:

         lcavol lweight age bphMedium bphLow svi1 lcp gleason7 gleason> 7 pgg45
1  ( 1 ) "*"    " "     " " " "       " "    " "  " " " "      " "        " "  
2  ( 1 ) "*"    "*"     " " " "       " "    " "  " " " "      " "        " "  
3  ( 1 ) "*"    "*"     " " " "       " "    "*"  " " " "      " "        " "  
4  ( 1 ) "*"    "*"     " " " "       " "    "*"  " " "*"      " "        " "  
5  ( 1 ) "*"    "*"     "*" " "       "*"    "*"  " " " "      " "        " "  
6  ( 1 ) "*"    "*"     "*" " "       "*"    "*"  " " "*"      " "        " "  
7  ( 1 ) "*"    "*"     "*" " "       "*"    "*"  "*" "*"      " "        " "  
8  ( 1 ) "*"    "*"     "*" " "       "*"    "*"  "*" "*"      " "        "*"  


  • the best one-predictor model used lcavol
  • the best two-predictor model used lcavol and lweight
  • the best three-predictor model used lcavol, lweight and svi
  • the best four-predictor model added an indicator variable for gleason7
  • the best five-predictor model added an indicator for bphLow and age to model 3
  • the best six-predictor model added the gleason7 indicator to model 5,
  • the best seven-predictor model added lcp to model 6,
  • and the eight-input model added pgg45 to model 7.

These “best subsets” models are not hierarchical: that would require each model’s predictors to be a subset of those in the next larger model, but (for example) the four-predictor model includes gleason7, which the five-predictor model drops.

  • To identify potentially attractive candidate models, we can either tabulate key summaries of model fit (adjusted \(R^2\), Mallows’ \(C_p\) and BIC) or plot them using ggplot2. I’ll show both approaches.

12.2.2 Obtaining Fit Quality Statistics

We’ll first store the summary statistics and winning models at each input count into a tibble called rs_winners. We’re doing this to facilitate plotting with ggplot2.
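One way to build that tibble (a sketch, assuming the rs_summ object stored above; the original construction may differ):

rs_winners <- bind_cols(
    tibble(inputs = 1:8,
           adjr2 = rs_summ$adjr2,
           cp = rs_summ$cp,
           bic = rs_summ$bic),
    as_tibble(rs_summ$which),   # which inputs appear in each winning subset
    tibble(r2 = rs_summ$rsq,
           rss = rs_summ$rss)
)

rs_winners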

# A tibble: 8 x 17
  inputs adjr2    cp   bic `(Intercept)` lcavol lweight age   bphMedium bphLow
   <int> <dbl> <dbl> <dbl> <lgl>         <lgl>  <lgl>   <lgl> <lgl>     <lgl> 
1      1 0.535 30.4  -66.1 TRUE          TRUE   FALSE   FALSE FALSE     FALSE 
2      2 0.587 17.4  -74.1 TRUE          TRUE   TRUE    FALSE FALSE     FALSE 
3      3 0.624  8.54 -79.7 TRUE          TRUE   TRUE    FALSE FALSE     FALSE 
4      4 0.633  7.31 -78.4 TRUE          TRUE   TRUE    FALSE FALSE     FALSE 
5      5 0.639  6.73 -76.5 TRUE          TRUE   TRUE    TRUE  FALSE     TRUE  
6      6 0.647  5.64 -75.3 TRUE          TRUE   TRUE    TRUE  FALSE     TRUE  
7      7 0.646  6.90 -71.5 TRUE          TRUE   TRUE    TRUE  FALSE     TRUE  
8      8 0.648  7.34 -68.7 TRUE          TRUE   TRUE    TRUE  FALSE     TRUE  
# ... with 7 more variables: svi1 <lgl>, lcp <lgl>, gleason7 <lgl>, `gleason>
#   7` <lgl>, pgg45 <lgl>, r2 <dbl>, rss <dbl>

12.3 Plotting the Fit Statistics

12.3.1 Plots of \(R^2\) and RSS for each model

Remember that raw \(R^2\) is greedy: adding a predictor can never decrease it, so it will always favor the largest available set of predictors. The residual sum of squares, if plotted, has the same problem in reverse: the minimum RSS will always belong to the largest model (when the models are subsets of one another).

So \(R^2\) and the residual sum of squares won’t be of much help to us. Instead, we’ll focus on the three other measures we mentioned earlier.

  1. Adjusted \(R^2\) which we try to maximize,
  2. Mallows’ \(C_p\) which we try to minimize, and
  3. The Bayes Information Criterion (BIC) which we also try to minimize.

12.3.3 Mallows’ \(C_p\) for our subsets

The \(C_p\) statistic focuses directly on the tradeoff between bias (due to excluding important predictors from the model) and extra variance (due to including too many unimportant predictors in the model).

If N is the sample size, and we select \(p\) regression predictors from a set of \(K\) (where \(p < K\)), then the \(C_p\) statistic is

\(C_p = \frac{SSE_p}{MSE_K} - N + 2p\)


  • \(SSE_p\) is the sum of squares for error (residual) in the model with \(p\) predictors
  • \(MSE_K\) is the residual mean square after regression in the model with all \(K\) predictors

As it turns out, \(C_p\) is just measuring the particular model’s lack of fit and then adding a penalty for the number of terms in the model: the lack of fit is measured as \(\frac{SSE_p}{MSE_K} = (N - p)\frac{MSE_p}{MSE_K}\), and the penalty is \(2p - N\).

  • If a model has no meaningful lack of fit (i.e. no substantial bias) then the expected value of \(C_p\) is roughly \(p\).
  • Otherwise, the expectation is \(p\) plus a positive bias term.
  • In general, we want to see smaller values of \(C_p\).
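Here is a sketch of the \(C_p\) plot, assuming the rs_winners tibble built earlier:

# Mallows' Cp for each winning subset; smaller is better
ggplot(rs_winners, aes(x = inputs, y = cp)) +
    geom_line() +
    geom_point(size = 2) +
    geom_text(aes(label = round(cp, 1)), vjust = -1) +
    scale_x_continuous(breaks = 1:8) +
    labs(x = "Number of predictors (inputs)", y = "Mallows' Cp",
         title = "Mallows' Cp for the best subsets models")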

12.3.4 BIC for our subsets

We might consider several information criteria:

  • the Bayesian Information Criterion, called BIC
  • the Akaike Information Criterion (used by R’s default stepwise approaches), called AIC
  • a corrected version of AIC due to Hurvich and Tsai (1989), called \(AIC_c\) or aic.c

Each of these indicates better models by getting smaller. Since the \(C_p\) and AIC results will lead to the same model, I’ll focus on plotting the BIC.

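A sketch of the corresponding BIC plot, again assuming rs_winners, with the minimizing subset size highlighted:

# BIC for each winning subset, marking the minimum
ggplot(rs_winners, aes(x = inputs, y = bic)) +
    geom_line() +
    geom_point(aes(color = (bic == min(bic))), size = 3,
               show.legend = FALSE) +
    scale_x_continuous(breaks = 1:8) +
    labs(x = "Number of predictors (inputs)", y = "BIC",
         title = "BIC for the best subsets models")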

We could, if necessary, also calculate the uncorrected AIC for each model, but we won’t make direct use of it: for a linear regression model, it provides no new information beyond what the \(C_p\) statistic already captures. If you want the uncorrected AIC for a given model, you can use the extractAIC function.
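For example (a sketch: I’m assuming the one- and two-input winners were refit with lm as m1 and m2), extractAIC returns the equivalent degrees of freedom, then the AIC:

# assumed refits of the one- and two-input winning subsets
m1 <- lm(lpsa ~ lcavol, data = prost_c12)
m2 <- lm(lpsa ~ lcavol + lweight, data = prost_c12)

extractAIC(m1)
extractAIC(m2)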

[1]   2.00000 -44.36603
[1]   3.00000 -54.95846

Note that:

  • these results are fairly comparable to the bias-corrected AIC we built above, and
  • the extractAIC and AIC functions look like they give very different results (below, AIC applied to the same two models), but they really don’t.
[1] 232.908
[1] 222.3156

But notice that the differences in AIC are the same, either way, comparing these two models:

[1] -1.00000 10.59243
[1] 10.59243
  • AIC is only defined up to an additive constant.
  • Since the difference between two models is the same whether we use AIC or extractAIC, it doesn’t actually matter which one we use, so long as we use the same one consistently.

12.3.5 My Usual Presentation

Usually, I just plot the three useful things (adjusted \(R^2\), \(C_p\) and \(BIC\)) when I’m working with best subsets.

12.4 Which Subsets Look Best? (Tabulation)
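One way to tabulate the winning subset size under each criterion (a sketch, assuming the rs_winners tibble from above):

# which number of inputs wins on each in-sample criterion?
rs_winners %>%
    summarize(AdjR2 = which.max(adjr2),
              Cp = which.min(cp),
              BIC = which.min(bic))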

# A tibble: 1 x 3
  AdjR2    Cp   BIC
  <int> <int> <int>
1     8     6     3

Our candidate models appear to be models 3 (which minimizes BIC) and 6 (which minimizes \(C_p\)). Adjusted \(R^2\) is maximized by model 8, but models 6, 7 and 8 have essentially identical adjusted \(R^2\) values, so we’ll carry forward model 7 as our largest candidate.

12.4.2 Rerunning our candidate models

Note that the models are nested: model 3’s predictors are a subset of model 6’s, which in turn are a subset of model 7’s. We’ll also include an intercept-only model, and the full model containing all predictors we considered initially, just to show you what those would look like.
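Here is a sketch of those refits. I’m assuming indicator columns were created for the selected dummy variables; the names bph_low and gleason_7, and the object names other than model3, model6 and model7, are my inventions:

# indicators for the selected dummy variables (assumed names)
prost_c12 <- prost_c12 %>%
    mutate(bph_low = as.numeric(bph == "Low"),
           gleason_7 = as.numeric(gleason == "7"))

model_int <- lm(lpsa ~ 1, data = prost_c12)   # intercept-only
model3 <- lm(lpsa ~ lcavol + lweight + svi, data = prost_c12)
model6 <- lm(lpsa ~ lcavol + lweight + age + bph_low + svi + gleason_7,
             data = prost_c12)
model7 <- lm(lpsa ~ lcavol + lweight + age + bph_low + svi + lcp +
                 gleason_7, data = prost_c12)
model_ks <- lm(lpsa ~ lcavol + lweight + age + bph + svi + lcp +
                   gleason + pgg45, data = prost_c12)  # kitchen sink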

12.5 Validation of Candidate Models

12.5.1 10-fold Cross-Validation of model3

Model 3 uses lcavol, lweight and svi to predict the lpsa outcome. Let’s do 10-fold cross-validation of this modeling approach, and calculate key summary measures of fit, specifically the \(R^2\), RMSE and MAE across the validation samples. We’ll use the tools from the caret package we’ve seen in prior work.
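Here is a sketch of the setup; the seed and object name are my choices, so your resampling splits will differ:

library(caret)

# 10-fold cross-validation of the model 3 specification
set.seed(432)
model3_cv <- train(lpsa ~ lcavol + lweight + svi, data = prost_c12,
                   method = "lm",
                   trControl = trainControl(method = "cv", number = 10))
model3_cv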

Linear Regression 

97 samples
 3 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 88, 85, 87, 88, 88, 87, ... 
Resampling results:

  RMSE       Rsquared   MAE      
  0.7268315  0.6468265  0.5794489

Tuning parameter 'intercept' was held constant at a value of TRUE

12.5.2 10-fold Cross-Validation of model6

As above, we’ll do 10-fold cross-validation of model6.

Linear Regression 

97 samples
 6 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 86, 88, 88, 88, 87, 87, ... 
Resampling results:

  RMSE       Rsquared   MAE      
  0.7036155  0.6759678  0.5533581

Tuning parameter 'intercept' was held constant at a value of TRUE

This shows a lower RMSE, higher R-squared and lower MAE than we saw in Model 3, so Model 6 looks like the better choice. How about in comparison to Model 7?

12.5.3 10-fold Cross-Validation of model7

As above, we’ll do 10-fold cross-validation of model7.

Linear Regression 

97 samples
 7 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 88, 87, 86, 89, 88, 86, ... 
Resampling results:

  RMSE       Rsquared   MAE      
  0.6824712  0.6903994  0.5337871

Tuning parameter 'intercept' was held constant at a value of TRUE

Model 7 looks better still, with a smaller RMSE and MAE than Model 6, and a higher R-squared. Model 7 performs best in our validation.

12.6 What about Interaction Terms?

Suppose we consider for a moment a much smaller and less realistic problem. We want to use best subsets to identify a model for lpsa out of a set of three predictors, specifically lcavol, age and svi, but now we also want to consider the interaction of svi with lcavol as a potential addition. Remember that svi is a factor with levels 0 and 1. We could simply add a numerical product term to our model, as follows.
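Here is a sketch: the construction of svixlcavol and the object name rs_int are my assumptions, but the regsubsets call matches the output below.

# numerical product term: svi (as 0/1) times lcavol (assumed construction)
prost_c12 <- prost_c12 %>%
    mutate(svixlcavol = as.numeric(svi == "1") * lcavol)

rs_int <- regsubsets(lpsa ~ lcavol + age + svi + svixlcavol,
                     data = prost_c12, nvmax = 4, nbest = 1)
summary(rs_int)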

Subset selection object
Call: regsubsets.formula(lpsa ~ lcavol + age + svi + svixlcavol, data = prost_c12, 
    nvmax = 4, nbest = 1)
4 Variables  (and intercept)
           Forced in Forced out
lcavol         FALSE      FALSE
age            FALSE      FALSE
svi1           FALSE      FALSE
svixlcavol     FALSE      FALSE
1 subsets of each size up to 4
Selection Algorithm: exhaustive
         lcavol age svi1 svixlcavol
1  ( 1 ) " "    " " " "  "*"       
2  ( 1 ) "*"    " " "*"  " "       
3  ( 1 ) "*"    " " "*"  "*"       
4  ( 1 ) "*"    "*" "*"  "*"       

In this case, best subsets identifies the interaction term as an attractive predictor at the very beginning, before including either of the main effects associated with it. That’s a problem: a model containing svi × lcavol without lcavol and svi is very hard to interpret.

To resolve this, we could either:

  1. Consider interaction terms outside of best subsets, and only after the selection of main effects.
  2. Use another approach to deal with variable selection for interaction terms.

Best subsets is a very limited tool. While it’s a better choice for some problems than a stepwise regression, it suffers from some of the same problems, primarily that the selected models are overly optimistic about their in-sample measures of fit quality.

Some other tools, including ridge regression, the lasso and elastic net approaches, are sometimes more appealing.

You may be wondering if there is a “best subsets” approach which can be applied to generalized linear models. There is, in fact, a bestglm package in R which can be helpful.

But, for now, it’s time to move away from model selection and on to the very important consideration of non-linearity in the predictors.


Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. 2001. The Elements of Statistical Learning. New York: Springer.

Hurvich, Clifford M., and Chih-Ling Tsai. 1989. “Regression and Time Series Model Selection in Small Samples.” Biometrika 76: 297–307.