knitr::opts_chunk$set(comment = NA)
library(janitor)
library(broom)
library(car)
library(gt)
library(Hmisc)
library(MASS)
library(mosaic)
library(glmnet)
library(conflicted)
library(patchwork)
library(tidyverse)
conflicts_prefer(dplyr::select)
theme_set(theme_bw())
32 NEW!! A Few LASSO Ideas
32.1 R Setup Used Here
32.1.1 Data Load
pollution <- read_csv("data/pollution.csv", show_col_types = FALSE)
32.2 The pollution data
Consider again the pollution data set we developed in Chapter 13, which contains 15 independent variables and a measure of mortality, describing 60 US metropolitan areas in 1959-1961.
pollution
# A tibble: 60 × 16
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 13 49 68 7 3.36 12.2 90.7 2702 3 51.9 9.7 105 32
2 28 32 81 7 3.27 12.1 81 3665 7.5 51.6 13.2 4 2
3 10 55 70 7.3 3.11 12.1 88.9 3033 5.9 51 14 144 66
4 43 32 74 10.1 3.38 9.5 79.2 3214 2.9 43.7 12 11 7
5 25 12 73 9.2 3.28 12.1 83.1 2095 2 51.9 9.8 20 11
6 35 46 85 7.1 3.22 11.8 79.9 1441 14.8 51.2 16.1 1 1
7 60 67 82 10 2.98 11.5 88.6 4657 13.5 47.3 22.4 3 1
8 11 53 68 9.2 2.99 12.1 90.6 4700 7.8 48.9 12.3 648 319
9 31 24 72 9 3.37 10.9 82.8 3226 5.1 45.2 12.3 5 3
10 15 30 73 8.2 3.15 12.2 84.2 4824 4.7 53.1 12.7 17 8
# ℹ 50 more rows
# ℹ 3 more variables: x14 <dbl>, x15 <dbl>, y <dbl>
Here’s a codebook:
Variable | Description |
---|---|
y | Total Age Adjusted Mortality Rate |
x1 | Mean annual precipitation in inches |
x2 | Mean January temperature in degrees Fahrenheit |
x3 | Mean July temperature in degrees Fahrenheit |
x4 | Percent of 1960 SMSA population that is 65 years of age or over |
x5 | Population per household, 1960 SMSA |
x6 | Median school years completed for those over 25 in 1960 SMSA |
x7 | Percent of housing units that are sound and with all facilities |
x8 | Population per square mile in urbanized area in 1960 |
x9 | Percent of 1960 urbanized area population that is non-white |
x10 | Percent employment in white-collar occupations in 1960 urbanized area |
x11 | Percent of families with income under $3,000 in 1960 urbanized area |
x12 | Relative pollution potential of hydrocarbons, HC |
x13 | Relative pollution potential of oxides of nitrogen, NOx |
x14 | Relative pollution potential of sulfur dioxide, SO2 |
x15 | Percent relative humidity, annual average at 1 p.m. |
32.3 Should We Rescale any Predictors?
Let’s get some basic summary statistics for our candidate predictors. When we do, we see that variable x8 takes values roughly 100 times larger than those of our other predictors.
df_stats(~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 +
           x10 + x11 + x12 + x13 + x14 + x15, data = pollution)
response min Q1 median Q3 max mean sd
1 x1 10.00 32.750 38.000 43.250 60.00 37.366667 9.9846775
2 x2 12.00 27.000 31.500 40.000 67.00 33.983333 10.1688985
3 x3 63.00 72.000 74.000 77.250 85.00 74.583333 4.7631768
4 x4 5.60 7.675 9.000 9.700 11.80 8.798333 1.4645520
5 x5 2.92 3.210 3.265 3.360 3.53 3.263167 0.1352523
6 x6 9.00 10.400 11.050 11.500 12.30 10.973333 0.8452994
7 x7 66.80 78.375 81.150 83.600 90.70 80.913333 5.1413731
8 x8 1441.00 3104.250 3567.000 4519.750 9699.00 3876.050000 1454.1023607
9 x9 0.80 4.950 10.400 15.650 38.50 11.870000 8.9211480
10 x10 33.80 43.250 45.500 49.525 59.70 46.081667 4.6130431
11 x11 9.40 12.000 13.200 15.150 26.40 14.373333 4.1600956
12 x12 1.00 7.000 14.500 30.250 648.00 37.850000 91.9776732
13 x13 1.00 4.000 9.000 23.750 319.00 22.650000 46.3332896
14 x14 1.00 11.000 30.000 69.000 278.00 53.766667 63.3904678
15 x15 38.00 55.000 57.000 60.000 73.00 57.666667 5.3699309
n missing
1 60 0
2 60 0
3 60 0
4 60 0
5 60 0
6 60 0
7 60 0
8 60 0
9 60 0
10 60 0
11 60 0
12 60 0
13 60 0
14 60 0
15 60 0
Let’s dampen the size of that x8 variable down a little to make our coefficient comparisons easier later, by dividing all of the x8 values by 100.
pollution <- pollution |>
  mutate(x8 = x8/100)
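As a quick check (this snippet is an added illustration, not part of the original analysis), the rescaled x8 values should now run from about 14.4 to about 97, matching the summary above after division by 100.

# added check: rescaled x8 should now range from about 14.4 to about 97
favstats(~ x8, data = pollution)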
32.4 A Kitchen Sink Model
We’ll begin by fitting the obviously underpowered model with all 15 main effects used to predict our outcome.
mod_sink <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 +
                 x10 + x11 + x12 + x13 + x14 + x15,
               data = pollution)
32.4.1 Considering an Outcome Transformation
boxcox(mod_sink)
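If you want a number to go with the plot, one option (a small added sketch, not code from the original chapter) is to rerun boxcox() without plotting and pull out the \(\lambda\) value that maximizes the profile log-likelihood; a value near \(-1\) supports the inverse transformation we adopt below.

# added sketch: find the lambda that maximizes the Box-Cox log-likelihood
bc <- boxcox(mod_sink, plotit = FALSE)
bc$x[which.max(bc$y)]   # a value near -1 points toward using 1/y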
OK. In light of this Box-Cox plot, let’s consider taking the inverse of our outcome here. I’ll take that inverse and then standardize the result using the scale() function to both subtract the mean of the transformed outcome and divide by its standard deviation, so that our new outcome, which I’ll call out_std, has mean 0, standard deviation 1, and a shape similar to that of the inverse of our \(y\).
pollution <- pollution |>
  mutate(y_inverse = 1/y,
         out_std = scale(1/y, center = TRUE, scale = TRUE))

p1 <- ggplot(pollution, aes(x = y_inverse)) + geom_density()
p2 <- ggplot(pollution, aes(x = out_std)) + geom_density()

p1 / p2
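As a quick confirmation (an added check, not in the original text), out_std should come back with mean 0 and standard deviation 1:

# added check: out_std should have mean 0 and SD 1 after scale()
mean(pollution$out_std)
sd(pollution$out_std)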
OK. So now, I’ll build a revised kitchen sink model to use this new outcome.
mod_sink2 <- lm(out_std ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 +
                  x10 + x11 + x12 + x13 + x14 + x15, data = pollution)

tidy(mod_sink2, conf.int = TRUE, conf.level = 0.9) |>
select(term, estimate, std.error, conf.low, conf.high, p.value) |>
gt() |> fmt_number(decimals = 3)
term | estimate | std.error | conf.low | conf.high | p.value |
---|---|---|---|---|---|
(Intercept) | −15.899 | 6.886 | −27.469 | −4.328 | 0.026 |
x1 | −0.035 | 0.015 | −0.059 | −0.011 | 0.020 |
x2 | 0.036 | 0.017 | 0.007 | 0.066 | 0.044 |
x3 | 0.055 | 0.030 | 0.005 | 0.105 | 0.074 |
x4 | 0.176 | 0.134 | −0.049 | 0.400 | 0.196 |
x5 | 2.184 | 1.099 | 0.338 | 4.031 | 0.053 |
x6 | 0.225 | 0.187 | −0.089 | 0.539 | 0.234 |
x7 | 0.016 | 0.028 | −0.031 | 0.063 | 0.573 |
x8 | −0.006 | 0.006 | −0.017 | 0.005 | 0.351 |
x9 | −0.070 | 0.021 | −0.105 | −0.035 | 0.002 |
x10 | 0.012 | 0.026 | −0.031 | 0.056 | 0.636 |
x11 | 0.002 | 0.051 | −0.084 | 0.087 | 0.974 |
x12 | 0.011 | 0.008 | −0.002 | 0.024 | 0.160 |
x13 | −0.022 | 0.016 | −0.049 | 0.004 | 0.169 |
x14 | −0.001 | 0.002 | −0.005 | 0.003 | 0.549 |
x15 | 0.001 | 0.018 | −0.029 | 0.032 | 0.936 |
Does this new model show a strong fit to the data?
glance(mod_sink2) |> select(r.squared:p.value, df, df.residual, nobs) |>
gt() |> fmt_number(r.squared:p.value, decimals = 3)
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual | nobs |
---|---|---|---|---|---|---|---|
0.774 | 0.698 | 0.550 | 10.070 | 0.000 | 15 | 44 | 60 |
glance(mod_sink2) |> select(logLik, AIC, BIC, deviance) |> gt() |>
fmt_number(decimals = 2)
logLik | AIC | BIC | deviance |
---|---|---|---|
−39.96 | 113.92 | 149.52 | 13.31 |
32.4.2 How much collinearity are we dealing with?
vif(mod_sink2)
x1 x2 x3 x4 x5 x6 x7
4.113888 6.143551 3.967774 7.470045 4.307618 4.860538 3.994781
x8 x9 x10 x11 x12 x13 x14
1.658281 6.779599 2.841582 8.717068 98.639935 104.982405 4.228929
x15
1.907092
Clearly, we have some enormous collinearity to deal with, given that we have many VIFs above 5, and two (for x12 and x13) near or above 100.
So a reduction in the size of the model seems appealing for multiple reasons.
32.5 Using the LASSO to suggest a smaller model
To begin, we will create a data matrix for our predictors, as follows:
pred_x <- model.matrix(mod_sink2)
Next, we create a matrix of our outcome.
out_y <- pollution |> select(out_std) |> as.matrix()
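As a quick check (added for illustration), the predictor matrix should have 60 rows and 16 columns (the intercept column created by model.matrix() plus our 15 predictors), while the outcome matrix should be 60 by 1.

# added check of the matrices we just built
dim(pred_x)   # expect 60 rows, 16 columns (intercept + 15 predictors)
dim(out_y)    # expect 60 rows, 1 column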
The LASSO involves both a cross-validation step and a fitting step. Here’s the code we’ll use in this case:
set.seed(123456)
cv_poll1 <- cv.glmnet(pred_x, out_y, type.measure = "mse", nfolds = 10)

mod_las1 <- glmnet(pred_x, out_y, alpha = 1, lambda = cv_poll1$lambda.min)
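If you want to see what the cross-validation step actually selected (an added illustration, not part of the original code), the cv.glmnet object stores the candidate penalties directly:

# added illustration: inspect the cross-validated penalty choices
cv_poll1$lambda.min   # penalty minimizing cross-validated MSE (used above)
cv_poll1$lambda.1se   # a more conservative (larger) penalty choice
plot(cv_poll1)        # cross-validation error across the lambda path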
Now, let’s look at what the LASSO does. As we can see from the tidied output below, some predictors are dropped from the model, while others have their coefficients shrunk towards zero as compared to what we saw in the “kitchen sink” model.
tidy(mod_las1) |> gt()
term | step | estimate | lambda | dev.ratio |
---|---|---|---|---|
(Intercept) | 1 | -2.345226117 | 0.03783146 | 0.7335261 |
x1 | 1 | -0.026289494 | 0.03783146 | 0.7335261 |
x2 | 1 | 0.020872173 | 0.03783146 | 0.7335261 |
x3 | 1 | 0.017666053 | 0.03783146 | 0.7335261 |
x6 | 1 | 0.101973313 | 0.03783146 | 0.7335261 |
x7 | 1 | 0.012515424 | 0.03783146 | 0.7335261 |
x8 | 1 | -0.006425945 | 0.03783146 | 0.7335261 |
x9 | 1 | -0.060226425 | 0.03783146 | 0.7335261 |
x10 | 1 | 0.007342706 | 0.03783146 | 0.7335261 |
x14 | 1 | -0.003819796 | 0.03783146 | 0.7335261 |
This new LASSO model includes only 9 of the original 15 predictors.
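Another way to see which predictors survived (a sketch using standard glmnet tools, added here) is to print the sparse coefficient matrix, where dropped predictors appear as dots:

# added illustration: sparse coefficients from the LASSO fit
coef(mod_las1)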
32.6 Would the 9-predictor model be a big improvement?
Suppose we fit a new model inspired by this LASSO. It’s still just a linear model, with no shrinkage, here.
mod_3 <- lm(out_std ~ x1 + x2 + x3 + x6 + x7 + x8 + x9 + x10 + x14,
            data = pollution)
vif(mod_3)
x1 x2 x3 x6 x7 x8 x9 x10
1.850485 1.548039 1.867603 3.899194 2.318288 1.479122 2.281879 2.333836
x14
1.609159
Well, the collinearity is certainly much improved.
tidy(mod_3, conf.int = TRUE, conf.level = 0.9) |>
select(term, estimate, std.error, conf.low, conf.high, p.value) |>
gt() |> fmt_number(decimals = 3)
term | estimate | std.error | conf.low | conf.high | p.value |
---|---|---|---|---|---|
(Intercept) | −4.122 | 2.254 | −7.900 | −0.344 | 0.073 |
x1 | −0.031 | 0.010 | −0.048 | −0.015 | 0.002 |
x2 | 0.024 | 0.009 | 0.010 | 0.039 | 0.007 |
x3 | 0.040 | 0.020 | 0.005 | 0.074 | 0.058 |
x6 | 0.057 | 0.166 | −0.222 | 0.335 | 0.734 |
x7 | 0.021 | 0.021 | −0.014 | 0.056 | 0.328 |
x8 | −0.009 | 0.006 | −0.019 | 0.001 | 0.129 |
x9 | −0.069 | 0.012 | −0.090 | −0.049 | 0.000 |
x10 | 0.013 | 0.024 | −0.026 | 0.053 | 0.581 |
x14 | −0.004 | 0.001 | −0.006 | −0.002 | 0.007 |
Does this new model show a strong fit to the data?
glance(mod_3) |> select(r.squared:p.value, df, df.residual, nobs) |>
gt() |> fmt_number(r.squared:p.value, decimals = 3)
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual | nobs |
---|---|---|---|---|---|---|---|
0.747 | 0.702 | 0.546 | 16.413 | 0.000 | 9 | 50 | 60 |
glance(mod_3) |> select(logLik, AIC, BIC, deviance) |> gt() |>
fmt_number(decimals = 2)
logLik | AIC | BIC | deviance |
---|---|---|---|
−43.39 | 108.78 | 131.81 | 14.92 |
Finally, here’s a set of plots for regression diagnostics. How do things look?
par(mfrow = c(2,2))
plot(mod_3)
32.7 Using Stepwise Regression to suggest a smaller model
mod_4 <- step(mod_sink2)
Start: AIC=-58.35
out_std ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 +
x11 + x12 + x13 + x14 + x15
Df Sum of Sq RSS AIC
- x11 1 0.0003 13.310 -60.351
- x15 1 0.0020 13.311 -60.344
- x10 1 0.0689 13.378 -60.043
- x7 1 0.0976 13.407 -59.914
- x14 1 0.1102 13.420 -59.858
- x8 1 0.2693 13.579 -59.151
- x6 1 0.4401 13.749 -58.401
<none> 13.309 -58.353
- x4 1 0.5226 13.832 -58.042
- x13 1 0.5906 13.900 -57.747
- x12 1 0.6186 13.928 -57.627
- x3 1 1.0145 14.324 -55.945
- x5 1 1.1956 14.505 -55.191
- x2 1 1.3063 14.616 -54.735
- x1 1 1.7574 15.067 -52.911
- x9 1 3.4141 16.723 -46.652
Step: AIC=-60.35
out_std ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 +
x12 + x13 + x14 + x15
Df Sum of Sq RSS AIC
- x15 1 0.0017 13.311 -62.343
- x10 1 0.0696 13.379 -62.038
- x14 1 0.1119 13.422 -61.849
- x7 1 0.1411 13.451 -61.719
- x8 1 0.2752 13.585 -61.123
- x6 1 0.4398 13.749 -60.401
<none> 13.310 -60.351
- x13 1 0.5905 13.900 -59.747
- x4 1 0.6048 13.915 -59.685
- x12 1 0.6202 13.930 -59.618
- x3 1 1.0419 14.352 -57.829
- x5 1 1.2132 14.523 -57.117
- x1 1 1.8672 15.177 -54.474
- x2 1 2.3995 15.709 -52.406
- x9 1 4.5934 17.903 -44.562
Step: AIC=-62.34
out_std ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 +
x12 + x13 + x14
Df Sum of Sq RSS AIC
- x10 1 0.0680 13.380 -64.037
- x14 1 0.1102 13.422 -63.849
- x7 1 0.1398 13.451 -63.717
- x8 1 0.2800 13.591 -63.094
<none> 13.311 -62.343
- x6 1 0.4538 13.765 -62.332
- x4 1 0.6066 13.918 -61.669
- x13 1 0.6080 13.919 -61.664
- x12 1 0.6251 13.937 -61.590
- x5 1 1.2142 14.526 -59.106
- x3 1 1.5567 14.868 -57.707
- x1 1 1.8661 15.178 -56.472
- x2 1 2.7077 16.019 -53.234
- x9 1 4.6990 18.010 -46.204
Step: AIC=-64.04
out_std ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x12 +
x13 + x14
Df Sum of Sq RSS AIC
- x14 1 0.0750 13.454 -65.702
- x7 1 0.1158 13.495 -65.520
- x8 1 0.2459 13.625 -64.945
<none> 13.380 -64.037
- x4 1 0.6708 14.050 -63.102
- x12 1 0.7497 14.129 -62.766
- x13 1 0.7575 14.137 -62.733
- x6 1 1.1206 14.500 -61.212
- x5 1 1.1650 14.544 -61.028
- x3 1 1.6471 15.027 -59.071
- x1 1 1.9708 15.350 -57.793
- x2 1 2.8845 16.264 -54.323
- x9 1 4.6459 18.025 -48.154
Step: AIC=-65.7
out_std ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x12 +
x13
Df Sum of Sq RSS AIC
- x7 1 0.0736 13.528 -67.375
- x8 1 0.2931 13.748 -66.409
<none> 13.454 -65.702
- x4 1 0.7633 14.218 -64.391
- x5 1 1.3276 14.782 -62.056
- x6 1 1.5095 14.964 -61.322
- x3 1 1.6936 15.148 -60.588
- x1 1 1.9047 15.359 -59.758
- x12 1 2.2433 15.698 -58.450
- x13 1 2.5554 16.010 -57.269
- x2 1 3.2927 16.747 -54.567
- x9 1 4.9296 18.384 -48.972
Step: AIC=-67.37
out_std ~ x1 + x2 + x3 + x4 + x5 + x6 + x8 + x9 + x12 + x13
Df Sum of Sq RSS AIC
- x8 1 0.2268 13.755 -68.377
<none> 13.528 -67.375
- x4 1 0.6973 14.225 -66.359
- x5 1 1.2978 14.826 -63.878
- x3 1 1.6230 15.151 -62.576
- x1 1 1.8418 15.370 -61.716
- x6 1 2.1350 15.663 -60.582
- x12 1 2.2355 15.764 -60.199
- x13 1 2.5168 16.045 -59.137
- x2 1 3.3522 16.880 -56.092
- x9 1 6.1854 19.713 -46.783
Step: AIC=-68.38
out_std ~ x1 + x2 + x3 + x4 + x5 + x6 + x9 + x12 + x13
Df Sum of Sq RSS AIC
<none> 13.755 -68.377
- x4 1 0.8670 14.622 -66.710
- x3 1 1.6985 15.453 -63.391
- x1 1 1.8777 15.633 -62.700
- x5 1 1.9839 15.739 -62.293
- x12 1 2.4622 16.217 -60.497
- x13 1 2.8317 16.587 -59.145
- x6 1 3.0868 16.842 -58.229
- x2 1 4.0486 17.803 -54.897
- x9 1 6.3106 20.065 -47.721
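Note that step() is using the AIC penalty (k = 2) here. If you preferred a BIC-flavored search instead (an aside added for illustration, using the hypothetical name mod_4_bic), you could supply k = log(n):

# added aside: a BIC-style stepwise search (hypothetical alternative)
mod_4_bic <- step(mod_sink2, k = log(nrow(pollution)), trace = 0)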
Here’s a summary of the fitted model after stepwise regression, which suggests a different set of 9 predictors.
tidy(mod_4, conf.int = TRUE, conf.level = 0.9) |>
select(term, estimate, std.error, conf.low, conf.high, p.value) |>
gt() |> fmt_number(decimals = 3)
term | estimate | std.error | conf.low | conf.high | p.value |
---|---|---|---|---|---|
(Intercept) | −17.332 | 5.257 | −26.142 | −8.522 | 0.002 |
x1 | −0.034 | 0.013 | −0.057 | −0.012 | 0.012 |
x2 | 0.042 | 0.011 | 0.024 | 0.060 | 0.000 |
x3 | 0.055 | 0.022 | 0.018 | 0.092 | 0.016 |
x4 | 0.200 | 0.113 | 0.011 | 0.388 | 0.082 |
x5 | 2.530 | 0.942 | 0.951 | 4.109 | 0.010 |
x6 | 0.373 | 0.111 | 0.186 | 0.559 | 0.002 |
x9 | −0.073 | 0.015 | −0.099 | −0.048 | 0.000 |
x12 | 0.015 | 0.005 | 0.007 | 0.023 | 0.004 |
x13 | −0.031 | 0.010 | −0.047 | −0.015 | 0.002 |
glance(mod_4) |> select(r.squared:p.value, df, df.residual, nobs) |>
gt() |> fmt_number(r.squared:p.value, decimals = 3)
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual | nobs |
---|---|---|---|---|---|---|---|
0.767 | 0.725 | 0.524 | 18.274 | 0.000 | 9 | 50 | 60 |
glance(mod_4) |> select(logLik, AIC, BIC, deviance) |> gt() |>
fmt_number(decimals = 2)
logLik | AIC | BIC | deviance |
---|---|---|---|
−40.95 | 103.90 | 126.93 | 13.75 |
How is the collinearity in this model?
vif(mod_4)
x1 x2 x3 x4 x5 x6 x9 x12
3.724589 2.666789 2.359891 5.826631 3.483085 1.898922 3.981753 45.943057
x13
42.500579
That looks more troubling to me, at least as compared to mod_3. How about the residual plots? Do those for mod_4 below look meaningfully different from the ones we built for our LASSO-inspired model mod_3?
par(mfrow = c(2,2))
plot(mod_4)
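The bottom-right panel of the diagnostic plots flags one observation with a large Cook’s distance; a quick added check (not in the original) identifies it:

# added check: largest Cook's distances in mod_4
sort(cooks.distance(mod_4), decreasing = TRUE) |> head(3)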
We now seem to have a point with a pretty substantial Cook’s distance, specifically the point from row 8 of the data.
pollution |> slice(8) |> select(x1:x12)
# A tibble: 1 × 12
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 11 53 68 9.2 2.99 12.1 90.6 47 7.8 48.9 12.3 648
pollution |> slice(8) |> select(x13:x15, y, y_inverse, out_std)
# A tibble: 1 × 6
x13 x14 x15 y y_inverse out_std[,1]
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 319 130 47 862. 0.00116 1.30
So what is unusual about Row 8? Well, it has an especially large value of x12
compared to the rest of the data.
describe(pollution$x12)
pollution$x12
n missing distinct Info Mean Gmd .05 .10
60 0 34 0.998 37.85 52.03 3.95 5.00
.25 .50 .75 .90 .95
7.00 14.50 30.25 56.90 106.95
lowest : 1 3 4 5 6, highest: 88 105 144 311 648
That might be part of the problem, especially since stepwise regression retains variable x12, whereas our LASSO-inspired model (mod_3) does not.