Chapter 15 Other Variable Selection Strategies

15.1 Why not use stepwise procedures?

  1. The \(R^2\) for a model selected in a stepwise manner is biased upward.
  2. The coefficient estimates and standard errors are biased.
  3. The \(p\) values for the individual-variable t tests are too small.
  4. In stepwise analyses of prediction models, the final model represented noise 20-74% of the time.
  5. In stepwise analyses, the final model usually contained fewer than half of the actual number of real predictors.
  6. It is not logical that a population regression coefficient would be exactly zero just because its estimate was not statistically significant.

This last comment applies to things like our “best subsets” approach as well as standard stepwise procedures.

Sander Greenland’s comments on parsimony and stepwise approaches to model selection are worth considering:

  • Stepwise variable selection on confounders leaves important confounders uncontrolled.
  • Shrinkage approaches (like ridge regression and the lasso) are far superior to variable selection.
  • Variable selection does more damage to confidence interval widths than to point estimates.

If we are seriously concerned about overfitting (winding up with a model that doesn’t perform well on new data), then stepwise approaches generally don’t help.

Vittinghoff et al. (2012) suggest four strategies for minimizing the chance of overfitting:

  1. Pre-specify well-motivated predictors and how to model them.
  2. Eliminate predictors without using the outcome.
  3. Use the outcome, but cross-validate the target measure of prediction error.
  4. Use the outcome, and shrink the coefficient estimates.

The best subsets methods we have studied either include a variable or drop it from the model. Often, this choice is based on only a tiny difference in the quality of the fit to the data.

  • Harrell (2001): not reasonable to assume that a population regression coefficient would be exactly zero just because it failed to meet a criterion for significance.
  • Brad Efron has suggested that a stepwise approach is “overly greedy, impulsively eliminating covariates which are correlated with other covariates.”

So, what’s the alternative?

15.2 Ridge Regression

Ridge regression provides a smoother transition between useful and not-so-useful predictors, which is obtained by constraining the overall size of the regression coefficients.

Ridge regression assumes that the regression coefficients (after normalization) should not be very large. This is reasonable to assume when you have lots of predictors and you believe many of them have some effect on the outcome.

Pros:

  1. Some nice statistical properties
  2. Can be calculated using only standard least squares approaches, so it’s been around for a while.
  3. Available in the MASS package.

Ridge regression takes the sum of the squared estimated standardized regression coefficients and constrains that sum to be no larger than some value \(k\):

\[ \sum \hat{\beta_j}^2 \leq k. \]

The value \(k\) is one of several available measures of the amount of shrinkage, but the main one used in the MASS package is a value \(\lambda\). As \(\lambda\) increases, the amount of shrinkage goes up, and \(k\) goes down.
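For a given \(\lambda\), this constrained problem is equivalent to minimizing a penalized residual sum of squares (written here for standardized predictors and a centered outcome, so there is no intercept term):

\[ \text{minimize} \quad \sum_{i=1}^{n} \left( y_i - \sum_{j} x_{ij} \beta_j \right)^2 + \lambda \sum_{j} \beta_j^2 \]

Larger values of \(\lambda\) put more weight on the penalty term, which pushes the coefficients, and therefore \(k\), toward zero.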

15.2.1 Assessing a Ridge Regression Approach

We’ll look at a plot produced by the lm.ridge function for a ridge regression for the prostate cancer study we worked on when studying Stepwise Regression and Best Subsets methods earlier.

  • Several (here 101) different values for \(\lambda\), our shrinkage parameter, will be tested.
  • Results are plotted so that we see the coefficients across the various (standardized) predictors.
    • Each selection of a \(\lambda\) value implies a different vector of estimated coefficients across the predictors we are studying.
    • The idea is to pick a value of \(\lambda\) for which the coefficients seem relatively stable.

Usually, you need to use trial and error to decide the range of \(\lambda\) to be tested. Here, 0:100 means going from 0 (no shrinkage) to 100 in steps of 1.
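A sketch of the sort of call involved is shown below. It assumes, as in the earlier chapters on this study, that the outcome is prost$lpsa and that the eight candidate predictors have been collected into a matrix named preds; those object names are placeholders rather than anything fixed by the text.

```r
library(MASS)

# fit ridge regressions across 101 candidate shrinkage values: lambda = 0, 1, ..., 100
ridge_fit <- lm.ridge(prost$lpsa ~ preds, lambda = 0:100)

# plot the standardized coefficients against lambda, looking for a region
# where the coefficients appear relatively stable
plot(ridge_fit)
```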

15.2.2 The lm.ridge plot - where do coefficients stabilize?

Does \(\lambda = 20\) seem like a stable spot here?

The coefficients at \(\lambda = 20\) can be determined from the lm.ridge output. These are fully standardized coefficients: in these models, the original predictors are centered by their means and scaled by their standard deviations, and the outcome has also been centered.
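One way to pull those values out of the stored fit, assuming it is called ridge_fit as in the sketch above (the $coef component of an lm.ridge object holds one column of scaled coefficients for each value of \(\lambda\)):

```r
# scaled (standardized) coefficients at lambda = 20, rounded for display
round(ridge_fit$coef[, ridge_fit$lambda == 20], 3)
```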

   predslcavol   predslweight       predsage     predsbph_f     predssvi_f 
         0.482          0.248         -0.091          0.097          0.252 
      predslcp predsgleason_f     predspgg45 
         0.009         -0.099          0.061 

Was an intercept used?

[1] 1

Yes, it was. There is an automated way to pick \(\lambda\). Use the select function in the MASS package:
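A sketch of the call, again assuming the fitted lm.ridge object is named ridge_fit:

```r
# HKB and L-W estimates of lambda, plus the lambda that minimizes
# the generalized cross-validation (GCV) criterion
MASS::select(ridge_fit)
```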

modified HKB estimator is 4.210238 
modified L-W estimator is 3.32223 
smallest value of GCV  at 6 

I’ll use GCV (generalized cross-validation) to select \(\lambda = 6\) instead.

   predslcavol   predslweight       predsage     predsbph_f     predssvi_f 
    0.58911149     0.26773757    -0.13715070     0.11862949     0.29491008 
      predslcp predsgleason_f     predspgg45 
   -0.09389545    -0.10477578     0.07250609 

15.2.3 Ridge Regression: The Bottom Line

The main problem with ridge regression is that it only shrinks the coefficient estimates; it never sets any of them to zero, so every candidate predictor stays in the model, which limits its usefulness in practical settings. Even so, it is worth knowing about for two reasons:

  1. It’s been easy to do ridge regression for many years, so you see it occasionally in the literature.
  2. It leads to the lasso, which combines the benefits of shrinking regression coefficients with the ability to select some variables for elimination from the predictor pool.

15.3 The Lasso

The lasso works by taking the sum of the absolute values of the estimated standardized regression coefficients and constraining that sum to be no larger than some value \(k\):

\[ \sum |\hat{\beta_j}| \leq k. \]

This looks like a minor change, but it’s not.
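As with ridge regression, the constrained problem has an equivalent penalized form; the only difference is that the penalty is now on the sum of absolute values rather than the sum of squares:

\[ \text{minimize} \quad \sum_{i=1}^{n} \left( y_i - \sum_{j} x_{ij} \beta_j \right)^2 + \lambda \sum_{j} |\beta_j| \]

It is this change in the shape of the penalty that allows some coefficients to be shrunk all the way to zero.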

15.3.1 Consequences of the Lasso Approach

  1. In ridge regression, while the individual coefficients shrink and sometimes approach zero, they seldom reach zero exactly, so no predictors are actually excluded from the model. With the lasso, some coefficients do reach zero, and those predictors drop out of the model.
    • So the lasso leads to more parsimonious models than does ridge regression.
    • Ridge regression is a method of shrinkage but not model selection. The lasso accomplishes both tasks.
  2. If \(k\) is chosen to be too small, then the model may not capture important characteristics of the data. If \(k\) is too large, the model may over-fit the data in the sample and thus not represent the population of interest accurately.
  3. The lasso is far more difficult computationally than ridge regression (the problem requires an algorithm called least angle regression published in 2004), although R has a library (lars) which can do the calculations pretty efficiently.

The name lasso is an acronym for “least absolute shrinkage and selection operator,” but it also evokes the image of a cowboy using a rope to pull cattle from the herd, much as we will pull predictors from a model.

15.3.2 How The Lasso Works

The lars package lets us compute the lasso coefficient estimates and do cross-validation to determine the appropriate amount of shrinkage. The main tool is a pair of graphs.

  1. The first plot shows how the coefficients change as we move from constraining all of them to zero (complete shrinkage) toward no constraint at all (ordinary least squares), indicating which variables are included in the model at each point.
  2. The second plot suggests where on the first plot we should look for a good model choice, according to a cross-validation approach.

  • The y axis shows standardized regression coefficients.
    • The lars package standardizes all variables so the shrinkage doesn’t penalize some coefficients because of their scale.
  • The x-axis is labeled |beta|/max|beta|.
    • This ranges from 0 to 1.
    • 0 means that the sum of the \(|\hat{\beta_j}|\) is zero (completely shrunk)
    • 1 means we are at the (unbiased) ordinary least squares estimates, with no shrinkage.

The lasso graph starts at constraining all of the coefficients to zero, and then moves toward ordinary least squares.

Identifiers for the predictors (numbers) are shown to the right of the graph.

The vertical lines in the lasso plot mark the steps where the set of variables in the model changes (a variable enters or, occasionally, leaves), and in fact these are the only points that are actually shown in the default lasso graph. The labels at the top of the graph tell you how many predictors are in the model at that stage.
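A sketch of how the fit, the path plot, and the summary shown below might be produced; lasso1 is a placeholder name, and preds is again assumed to be the matrix of candidate predictors:

```r
library(lars)

# fit the full lasso path for lpsa on the candidate predictors
lasso1 <- lars(x = preds, y = prost$lpsa, type = "lasso")

# path plot: standardized coefficients vs. |beta| / max|beta|
plot(lasso1)

# Df, residual sum of squares, and Mallows' Cp at each step along the path
summary(lasso1)
```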

LARS/LASSO
Call: lars(x = preds, y = prost$lpsa, type = "lasso")
  Df     Rss       Cp
0  1 127.918 168.1835
1  2  76.392  64.1722
2  3  70.247  53.5293
3  4  50.598  15.1017
4  5  49.065  13.9485
5  6  48.550  14.8898
6  7  46.284  12.2276
7  8  44.002   9.5308
8  9  42.772   9.0000

Based on the \(C_p\) statistics, it looks like the improvements continue throughout, and don’t really finish happening until we get pretty close to the full model with 9 df.

15.3.3 Cross-Validation with the Lasso

Normally, cross-validation methods are used to determine how much shrinkage should be used. We’ll use the cv.lars function.

  • 10-fold (K = 10) cross-validation
    • the data are randomly divided into 10 groups.
    • Nine groups are used to predict the remaining group for each group in turn.
    • Overall prediction performance is computed, and the machine calculates a cross-validation criterion (mean squared error) and standard error for that criterion.
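A sketch of the cross-validation call follows; the seed value and the name lassocv are arbitrary placeholders:

```r
set.seed(432)   # the folds are chosen at random, so set a seed for reproducibility

# 10-fold cross-validation along the lasso path; this also draws the CV plot
lassocv <- cv.lars(x = preds, y = prost$lpsa, K = 10)
```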

The cross-validation plot is the second lasso plot.

We’re looking to minimize cross-validated mean squared error in this plot, which doesn’t seem to happen until the fraction gets very close to 1.

15.3.4 What value of the key fraction minimizes cross-validated MSE?
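The fraction can be pulled out of the stored cv.lars results; a sketch, reusing the placeholder names from above:

```r
# fraction (|beta| / max|beta|) at which the cross-validated MSE is smallest
frac.best <- lassocv$index[which.min(lassocv$cv)]
frac.best
```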

[1] 1

The cross-validation plot suggests we use a fraction of nearly 1.0, suggesting that all of the predictors will be kept in, based on the top LASSO plot.

15.3.5 Coefficients for the Model Identified by the Cross-Validation
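Coefficients at any chosen fraction along the path can be obtained directly from the lars fit; a sketch, again using the placeholder names lasso1 and frac.best:

```r
# lasso coefficients (on the original predictor scale) at the cross-validated fraction
round(coef(lasso1, s = frac.best, mode = "fraction"), 4)
```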

   lcavol   lweight       age     bph_f     svi_f       lcp gleason_f     pgg45 
   0.5550    0.6420   -0.0220    0.1552    0.7812   -0.1198   -0.1846    0.0031 

So the model suggested by the lasso still includes all eight of these predictors.

15.3.6 Obtaining Fitted Values from Lasso
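One way to ask for fitted values is to use predict on the lars object at the fraction chosen by cross-validation, as in the sketch below (lasso1, preds, and frac.best are the placeholder names used earlier); the result is a list whose $fit component holds the fitted values.

```r
lasso_fits <- predict(lasso1, newx = preds, s = frac.best,
                      type = "fit", mode = "fraction")
lasso_fits
```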

$s
[1] 1

$fraction
[1] 1

$mode
[1] "fraction"

$fit
 [1] 0.7981479 0.7452123 0.5047527 0.6051627 1.6984386 0.8321833 1.8281390
 [8] 2.1333745 1.2493589 1.2626371 1.4669434 0.7736890 2.0754442 1.9125642
[15] 2.1526264 1.8103388 1.2670402 2.3921331 1.3259257 1.7692117 1.9759115
[22] 2.7420125 1.1640356 2.4797587 1.8024473 1.9100031 2.0164183 1.7846865
[29] 1.9707330 2.1562795 2.0359132 1.8658364 1.6927577 1.3572705 1.0450002
[36] 2.9154581 2.1796083 1.0436497 3.8894710 1.7956357 2.1016088 2.3261270
[43] 2.0788225 2.5248165 2.4462645 2.5880361 4.0702259 2.6868568 2.7106992
[50] 1.9547992 2.4558606 2.9843790 2.1921094 3.0522425 3.3516038 2.9824834
[57] 1.7591653 2.3487167 2.2871618 2.6256177 2.3046839 3.5581817 2.9732302
[64] 3.6777383 2.5104085 2.6597765 2.9467963 3.0901247 1.5062634 3.0294526
[71] 3.2912806 2.1092781 2.8890738 3.4988268 3.6925330 3.6049797 3.2758657
[78] 3.4559699 3.4011748 3.1824772 2.0470770 2.8945197 3.6756633 3.3286276
[85] 2.6958068 3.8336368 2.9955258 3.0589058 4.2901466 3.1001571 3.3809665
[92] 4.0343586 3.8332468 4.7608240 3.6275110 4.1408446 4.0937866

15.3.7 Complete Set of Fitted Values from the Lasso

 [1] 0.798 0.745 0.505 0.605 1.698 0.832 1.828 2.133 1.249 1.263 1.467 0.774
[13] 2.075 1.913 2.153 1.810 1.267 2.392 1.326 1.769 1.976 2.742 1.164 2.480
[25] 1.802 1.910 2.016 1.785 1.971 2.156 2.036 1.866 1.693 1.357 1.045 2.915
[37] 2.180 1.044 3.889 1.796 2.102 2.326 2.079 2.525 2.446 2.588 4.070 2.687
[49] 2.711 1.955 2.456 2.984 2.192 3.052 3.352 2.982 1.759 2.349 2.287 2.626
[61] 2.305 3.558 2.973 3.678 2.510 2.660 2.947 3.090 1.506 3.029 3.291 2.109
[73] 2.889 3.499 3.693 3.605 3.276 3.456 3.401 3.182 2.047 2.895 3.676 3.329
[85] 2.696 3.834 2.996 3.059 4.290 3.100 3.381 4.034 3.833 4.761 3.628 4.141
[97] 4.094

To assess the quality of these predictions, we might plot them against the observed values of our outcome (lpsa), or we might look at residuals vs. these fitted values.
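A sketch of one such plot, assuming ggplot2 is loaded and reusing the placeholder lasso_fits object from above:

```r
library(ggplot2)

# gather observed lpsa values and lasso fitted values into one data frame
lasso_check <- data.frame(observed = prost$lpsa,
                          fitted = lasso_fits$fit)
lasso_check$residual <- lasso_check$observed - lasso_check$fitted

# observed vs. fitted, with a loess smooth and a dashed line of equality
ggplot(lasso_check, aes(x = fitted, y = observed)) +
  geom_point() +
  geom_smooth(method = "loess") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed")
```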

`geom_smooth()` using formula 'y ~ x'

15.3.8 When is the Lasso Most Useful?

As Faraway (2015) suggests, the lasso is particularly useful when we believe the effects are sparse, in the sense that we believe that few of the many predictors we are evaluating have a meaningful effect.

Consider, for instance, the analysis of gene expression data, where we have good reason to believe that only a small number of genes have an influence on our response of interest.

Another example is medical claims data, where thousands of available codes might apply to the people included in a large analysis relating health care costs to outcomes, but we expect only a small fraction of those codes to matter.

15.4 Applying the Lasso to the pollution data

Let’s consider the lasso approach in application to the pollution data we’ve seen previously. Recall that we have 60 observations on an outcome, y, and 15 predictors, labeled x1 through x15.
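The workflow is the same as before; a sketch, assuming the response is stored in pollution$y and that the 15 predictors have been gathered into a matrix again named preds (placeholder names that match the Call shown in the output below):

```r
# lasso path, path plot, and Cp summary for the pollution data
lasso_poll <- lars(x = preds, y = pollution$y, type = "lasso")
plot(lasso_poll)
summary(lasso_poll)
```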

LARS/LASSO
Call: lars(x = preds, y = pollution$y, type = "lasso")
   Df    Rss       Cp
0   1 228311 129.1367
1   2 185419  95.9802
2   3 149370  68.4323
3   4 143812  65.8764
4   5  92077  25.4713
5   6  83531  20.4668
6   7  69532  10.9922
7   8  67682  11.4760
8   9  60689   7.7445
9  10  60167   9.3163
10 11  59609  10.8588
11 12  58287  11.7757
12 13  57266  12.9383
13 14  56744  14.5107
14 13  56159  12.0311
15 14  55238  13.2765
16 15  53847  14.1361
17 16  53681  16.0000

Based on the \(C_p\) statistics, it looks like the big improvements occur somewhere around the move from 6 to 7 df. Let’s look at the cross-validation.

Here it looks like the minimum cross-validated MSE occurs somewhere between a fraction of 0.2 and 0.4.

[1] 0.3636364

It looks like a model with 6-8 predictors will be the most useful. The cross-validated coefficients are as follows:

     x1      x2      x3      x4      x5      x6      x7      x8      x9     x10 
  1.511  -1.194  -1.278   0.000   0.000 -10.394  -0.519   0.004   3.992  -0.016 
    x11     x12     x13     x14     x15 
  0.000   0.000   0.000   0.229   0.000 

Note that by this cross-validated lasso selection, not only are the coefficients for the 8 variables remaining in the model shrunken, but variables x4, x5, x10, x11, x12, x13 and x15 are all dropped from the model, and variable x8 nearly is, as well.

 [1]  933.179  918.621  921.473  987.781 1051.082 1067.800  912.441  916.116
 [9]  950.767  926.550  997.107 1018.063  977.723  955.038  931.881  891.788
[17]  931.634  867.682  973.719  941.645  879.011  906.369  974.116  919.592
[25]  932.388  957.082  912.097  925.046  873.603  984.542 1043.996  913.444
[33]  938.098  884.809  990.042  932.188 1014.197  970.005 1004.448  984.370
[41]  896.646  918.426  933.802 1005.297  910.259  976.554  829.222  908.848
[49]  824.828  896.169  908.915  918.541  926.658  917.542  991.639  878.248
[57]  942.784  914.382  961.495  949.723

Here’s a plot of the actual pollution y values against these fitted values.

And now, here’s a plot of residuals vs. fitted values.

`geom_smooth()` using formula 'y ~ x'

References

Faraway, Julian J. 2015. Linear Models with R. Second Edition. Boca Raton, FL: CRC Press.

Harrell, Frank E. 2001. Regression Modeling Strategies. New York: Springer.

Vittinghoff, Eric, David V. Glidden, Stephen C. Shiboski, and Charles E. McCulloch. 2012. Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models. Second Edition. Springer-Verlag, Inc. http://www.biostat.ucsf.edu/vgsm/.