21  Multiple Regression and Transformations

This is a DRAFT version of this Chapter.

This is a sketchy draft with some useful content. I’ll remove this notice when I post a version of this Chapter that is essentially finished, but that won’t be until summer 2025.

21.1 R setup for this chapter

Note

Appendix A lists all R packages used in this book, and also provides R session information. Appendix B describes the 431-Love.R script, and demonstrates its use.

21.2 Plasma Retinol and Beta-Carotene Dataset

Note

Appendix C provides further guidance on pulling data from other systems into R, while Appendix D gives more information (including download links) for all data sets used in this book.

The data in this study come from an unpublished study related to Nierenberg et al. (1989). From Thérèse Stukel, one of the authors of that study, and posted to Statlib as well as Frank Harrell’s repository

Observational studies have suggested that low dietary intake or low plasma concentrations of retinol, beta-carotene, or other carotenoids might be associated with increased risk of developing certain types of cancer. However, relatively few studies have investigated the determinants of plasma concentrations of these micronutrients. We designed a cross-sectional study to investigate the relationship between personal characteristics and dietary factors, and plasma concentrations of retinol, beta-carotene and other carotenoids. Study subjects (N = 315) were patients who had an elective surgical procedure during a three-year period to biopsy or remove a lesion of the lung, colon, breast, skin, ovary or uterus that was found to be non-cancerous. Plasma concentrations of the micronutrients varied widely from subject to subject.

21.2.1 Data Ingest

plasma <- read_csv("data/plasma.csv", show_col_types = FALSE) |>
  mutate(across(where(is.character), as_factor)) |>
  mutate(P_ID = as.character(P_ID)) |>
  janitor::clean_names()

plasma
# A tibble: 315 × 15
   p_id    age sex    smokstat   bmi vituse       calories   fat fiber alcohol
   <chr> <dbl> <fct>  <fct>    <dbl> <fct>           <dbl> <dbl> <dbl>   <dbl>
 1 1        64 Female Former    21.5 Fairly_Often    1299.  57     6.3     0  
 2 2        76 Female Never     23.9 Fairly_Often    1032.  50.1  15.8     0  
 3 3        38 Female Former    20.0 Not_Often       2372.  83.6  19.1    14.1
 4 4        40 Female Former    25.1 Never           2450.  97.5  26.5     0.5
 5 5        72 Female Never     21.0 Fairly_Often    1952.  82.6  16.2     0  
 6 6        40 Female Former    27.5 Never           1367.  56     9.6     1.3
 7 7        65 Female Never     22.0 Not_Often       2214.  52    28.7     0  
 8 8        58 Female Never     28.8 Fairly_Often    1596.  63.4  10.9     0  
 9 9        35 Female Never     23.1 Never           1800.  57.8  20.3     0.6
10 10       55 Female Former    35.0 Never           1264.  39.6  15.5     0  
# ℹ 305 more rows
# ℹ 5 more variables: cholesterol <dbl>, betadiet <dbl>, retdiet <dbl>,
#   betaplasma <dbl>, retplasma <dbl>

21.2.2 Variable Descriptions

Variable Description
p_id Subject identification code
age Age (years)
sex Sex (Male, Female)
smokstat Smoking status (Never, Former, Current Smoker)
bmi Body mass index (weight/(height^2))
vituse Vitamin Use (Yes, fairly often; Yes, not often; No)
calories Number of calories consumed per day.
fat Grams of fat consumed per day.
fiber Grams of fiber consumed per day.
alcohol Number of alcoholic drinks consumed per week.
cholesterol Cholesterol consumed (mg per day).
betadiet Dietary beta-carotene consumed (mcg per day).
retdiet Dietary retinol consumed (mcg per day)
betaplasma Plasma beta-carotene (ng/ml)
retplasma Plasma Retinol (ng/ml)
n_miss(plasma)
[1] 0

21.2.3 Data Summary; Outlier Removal

describe_distribution(plasma)
Variable    |    Mean |      SD |     IQR |             Range | Skewness | Kurtosis |   n | n_Missing
-----------------------------------------------------------------------------------------------------
age         |   50.15 |   14.58 |   24.00 |    [19.00, 83.00] |     0.30 |    -0.87 | 315 |         0
bmi         |   26.16 |    6.01 |    7.16 |    [16.33, 50.40] |     1.38 |     2.01 | 315 |         0
calories    | 1796.65 |  680.35 |  772.60 | [445.20, 6662.20] |     1.75 |     8.13 | 315 |         0
fat         |   77.03 |   33.83 |   41.40 |   [14.40, 235.90] |     1.10 |     2.02 | 315 |         0
fiber       |   12.79 |    5.33 |    6.50 |     [3.10, 36.80] |     1.15 |     2.48 | 315 |         0
alcohol     |    3.28 |   12.32 |    3.20 |    [0.00, 203.00] |    13.82 |   221.33 | 315 |         0
cholesterol |  242.46 |  131.99 |  154.00 |   [37.70, 900.70] |     1.48 |     3.41 | 315 |         0
betadiet    | 2185.60 | 1473.89 | 1749.00 | [214.00, 9642.00] |     1.61 |     3.47 | 315 |         0
retdiet     |  832.71 |  589.29 |  568.00 |  [30.00, 6901.00] |     4.47 |    38.07 | 315 |         0
betaplasma  |  189.89 |  183.00 |  142.00 |   [0.00, 1415.00] |     3.56 |    17.21 | 315 |         0
retplasma   |  602.79 |  208.90 |  253.00 | [179.00, 1727.00] |     1.31 |     4.02 | 315 |         0

That alcohol value of 203 drinks per week is concerning.

plasma |> count(alcohol) |> tail()
# A tibble: 6 × 2
  alcohol     n
    <dbl> <int>
1    18.2     1
2    20       1
3    21       1
4    22       1
5    35       2
6   203       1

The next highest value is 35 drinks per week. I’m going to make the decision to remove this observation from the data, for the remainder of this chapter.

plasma <- plasma |>
  filter(alcohol < 200)

plasma |> reframe(lovedist(alcohol)) |> kable(digits = 2)
n miss mean sd med mad min q25 q75 max
314 0 2.64 4.95 0.3 0.44 0 0 3.2 35

21.3 Predicting Plasma Beta-Carotene

Having removed the subject with an unrealistic alcohol value, our first main task is to predict betaplasma using (a subset of) most of the other variables in the data. Let’s look at that outcome.

p1 <- ggplot(plasma, aes(x = betaplasma)) +
  geom_histogram(bins = 15, fill = "blue", col = "white") +
  labs(
    x = "Plasma beta-carotene (ng/ml)",
    title = "Histogram"
  ) 

p2 <- ggplot(plasma, aes(sample = betaplasma)) +
  geom_qq() +
  geom_qq_line(col = "red") +
  labs(
    y = "Plasma beta-carotene",
    x = "Standard Normal Distribution",
    title = "Normal Q-Q plot"
  ) 

p3 <- ggplot(plasma, aes(x = betaplasma, y = "")) +
  geom_violin(fill = "cornsilk") +
  geom_boxplot(width = 0.2) +
  stat_summary(
    fun = mean, geom = "point",
    shape = 16, col = "red"
  ) +
  labs(
    y = "", x = "Plasma beta-carotene",
    title = "Boxplot with Violin"
  ) 

p1 + (p2 / p3 + plot_layout(heights = c(2, 1))) +
  plot_annotation(
    title = "Plasma Beta-Carotene in ng/ml"
  )

We see a fairly pronounced skew to the right here.

plasma |> reframe(lovedist(betaplasma))
# A tibble: 1 × 10
      n  miss  mean    sd   med   mad   min   q25   q75   max
  <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1   314     0  190.  183.   140  88.2     0  89.5  230.  1415

Note that we have a minimum value of 0 here. To use the Box-Cox method, we need to have values of our outcome that are strictly positive. The simplest solution to this problem (of seeing one or more zero values but no negative values) is to simply add one to each value of betaplasma before considering transformations.

fit0 <- lm((betaplasma + 1) ~ age + sex + smokstat + bmi + vituse +
  calories + fat + fiber + alcohol + cholesterol +
  betadiet, data = plasma)

boxCox(fit0)

We’ll apply a logarithmic transformation to (betaplasma + 1) as our revised outcome for modeling.

p1 <- ggplot(plasma, aes(x = betaplasma)) +
  geom_histogram(bins = 20, fill = "coral", col = "blue")

p2 <- ggplot(plasma, aes(x = log(betaplasma + 1))) +
  geom_histogram(bins = 20, fill = "aquamarine", col = "blue")

p1 + p2

Most of our modeling of this transformed outcome is going to be affected, to some degree, by the outlying 0 value. Let’s take a closer look at that:

plasma |> count(betaplasma) |> head()
# A tibble: 6 × 2
  betaplasma     n
       <dbl> <int>
1          0     1
2         14     1
3         16     1
4         19     1
5         21     1
6         22     1

A value of 0 ng/ml for plasma beta-carotene also seems at best implausible. I’m going to go ahead and remove that subject from our study now, as well, and reframe our decision about a transformation accordingly.

plasma <- plasma |> filter(betaplasma > 0)

21.3.1 Final Analytic Data (- 2 observations)

So now, having removed two observations:

  • one for having an implausible level of betaplasma of 0 ng/ml, and
  • one for having an implausible level of alcohol of 203 drinks per week,

we consider again the issue of transforming our outcome, and a Box-Cox plot, but now, thanks to dropping one of our outliers, we have a strictly positive set of betaplasma values to work with, so we no longer need to add 1 to that outcome.

fit00 <- lm(betaplasma ~ 
              age + sex + smokstat + bmi + vituse +
              calories + fat + fiber + alcohol + 
              cholesterol + betadiet, 
            data = plasma)

boxCox(fit00)

Again, we’re strongly encouraged to fit a logarithm of the outcome, by virtue of the estimated \(\lambda = 0\).

21.3.2 Initial OLS Fit

plasma <- plasma |> 
  mutate(logbeta = log(betaplasma))

fit1 <- lm(logbeta ~ age + sex + smokstat + bmi + 
             vituse + calories + fat + fiber + alcohol + 
             cholesterol + betadiet, 
           data = plasma)
model_parameters(fit1)
Parameter          | Coefficient |       SE |         95% CI | t(299) |      p
------------------------------------------------------------------------------
(Intercept)        |        5.48 |     0.28 | [ 4.93,  6.04] |  19.57 | < .001
age                |    5.24e-03 | 2.94e-03 | [ 0.00,  0.01] |   1.78 | 0.076 
sex [Male]         |       -0.23 |     0.13 | [-0.48,  0.02] |  -1.85 | 0.066 
smokstat [Never]   |        0.09 |     0.08 | [-0.08,  0.25] |   1.02 | 0.308 
smokstat [Current] |       -0.21 |     0.12 | [-0.45,  0.04] |  -1.66 | 0.098 
bmi                |       -0.03 | 6.48e-03 | [-0.04, -0.02] |  -4.87 | < .001
vituse [Not_Often] |       -0.03 |     0.10 | [-0.22,  0.17] |  -0.26 | 0.797 
vituse [Never]     |       -0.30 |     0.09 | [-0.48, -0.12] |  -3.25 | 0.001 
calories           |   -2.19e-04 | 2.01e-04 | [ 0.00,  0.00] |  -1.09 | 0.278 
fat                |    1.40e-03 | 3.18e-03 | [ 0.00,  0.01] |   0.44 | 0.659 
fiber              |        0.03 |     0.01 | [ 0.01,  0.05] |   2.78 | 0.006 
alcohol            |    5.25e-03 | 8.76e-03 | [-0.01,  0.02] |   0.60 | 0.549 
cholesterol        |   -2.70e-04 | 4.38e-04 | [ 0.00,  0.00] |  -0.62 | 0.538 
betadiet           |    4.69e-05 | 2.95e-05 | [ 0.00,  0.00] |   1.59 | 0.113 

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.
model_performance(fit1)
# Indices of model performance

AIC     |    AICc |     BIC |    R2 | R2 (adj.) |  RMSE | Sigma
---------------------------------------------------------------
646.464 | 648.080 | 702.657 | 0.249 |     0.216 | 0.648 | 0.663
check_model(fit1)

21.3.3 Best Subsets Suggestion?

k <- ols_step_best_subset(fit1, 
                          max_order = 11,
                          metric = "cp")
k
                                 Best Subsets Regression                                  
------------------------------------------------------------------------------------------
Model Index    Predictors
------------------------------------------------------------------------------------------
     1         bmi                                                                         
     2         bmi vituse                                                                  
     3         bmi vituse fiber                                                            
     4         bmi vituse calories fiber                                                   
     5         smokstat bmi vituse calories fiber                                          
     6         smokstat bmi vituse calories fiber betadiet                                 
     7         age sex smokstat bmi vituse calories fiber                                  
     8         age sex smokstat bmi vituse calories fiber betadiet                         
     9         age sex smokstat bmi vituse calories fiber cholesterol betadiet             
    10         age sex smokstat bmi vituse calories fiber alcohol cholesterol betadiet     
    11         age sex smokstat bmi vituse calories fat fiber alcohol cholesterol betadiet 
------------------------------------------------------------------------------------------

                                                    Subsets Regression Summary                                                     
-----------------------------------------------------------------------------------------------------------------------------------
                       Adj.        Pred                                                                                             
Model    R-Square    R-Square    R-Square     C(p)        AIC         SBIC         SBC         MSEP       FPE       HSP       APC  
-----------------------------------------------------------------------------------------------------------------------------------
  1        0.0790      0.0760      0.0672    57.5387    686.2096    -202.6728    697.4483    162.0541    0.5211    0.0017    0.9329 
  2        0.1327      0.1243       0.111    40.1581    671.3979    -219.4340    690.1289    153.0952    0.4954    0.0016    0.8841 
  3        0.1708      0.1600      0.1438    26.9872    659.3289    -231.3437    681.8061    146.8413    0.4767    0.0015    0.8507 
  4        0.2062      0.1933       0.175    14.8858    647.6592    -242.6879    673.8826    141.0242    0.4592    0.0015    0.8195 
  5        0.2254      0.2077      0.1839    11.2438    643.9939    -248.0769    677.7098    138.0622    0.4525    0.0015    0.8048 
  6        0.2332      0.2130      0.1862    10.1707    642.8577    -249.0400    680.3197    137.1339    0.4508    0.0014    0.8019 
  7        0.2403      0.2177      0.1883     9.3375    641.9383    -249.7554    683.1465    136.3077    0.4495    0.0014    0.7996 
  8        0.2466      0.2217       0.189     8.8230    641.3242    -250.1441    686.2786    135.6201    0.4487    0.0014    0.7980 
  9        0.2476      0.2201      0.1847    10.4380    642.9220    -248.4318    691.6227    135.8945    0.4510    0.0014    0.8021 
 10        0.2482      0.2181        0.18    12.1951    644.6680    -246.5767    697.1149    136.2354    0.4535    0.0015    0.8066 
 11        0.2487      0.2160      0.1738    14.0000    646.4639    -244.6723    702.6569    136.6003    0.4561    0.0015    0.8112 
-----------------------------------------------------------------------------------------------------------------------------------
AIC: Akaike Information Criteria 
 SBIC: Sawa's Bayesian Information Criteria 
 SBC: Schwarz Bayesian Criteria 
 MSEP: Estimated error of prediction, assuming multivariate normality 
 FPE: Final Prediction Error 
 HSP: Hocking's Sp 
 APC: Amemiya Prediction Criteria 
plot(k)

I’m going to look more closely at the four-predictor (suggested by SBC) and eight-predictor (suggested by \(C_p\)) subsets1

21.3.4 Four-Predictor Model

fit4 <- lm(logbeta ~ bmi + vituse + calories + fiber,
           data = plasma)
compare_models(fit1, fit4)
Parameter          |                     fit1 |                     fit4
------------------------------------------------------------------------
(Intercept)        |      5.48 ( 4.93,  6.04) |      5.79 ( 5.37,  6.21)
bmi                |     -0.03 (-0.04, -0.02) |     -0.03 (-0.04, -0.02)
vituse [Not_Often] |     -0.03 (-0.22,  0.17) |     -0.07 (-0.26,  0.12)
vituse [Never]     |     -0.30 (-0.48, -0.12) |     -0.36 (-0.53, -0.18)
calories           | -2.19e-04 ( 0.00,  0.00) | -2.66e-04 ( 0.00,  0.00)
fiber              |      0.03 ( 0.01,  0.05) |      0.04 ( 0.03,  0.06)
smokstat [Current] |     -0.21 (-0.45,  0.04) |                         
age                |  5.24e-03 ( 0.00,  0.01) |                         
sex [Male]         |     -0.23 (-0.48,  0.02) |                         
smokstat [Never]   |      0.09 (-0.08,  0.25) |                         
cholesterol        | -2.70e-04 ( 0.00,  0.00) |                         
fat                |  1.40e-03 ( 0.00,  0.01) |                         
alcohol            |  5.25e-03 (-0.01,  0.02) |                         
betadiet           |  4.69e-05 ( 0.00,  0.00) |                         
------------------------------------------------------------------------
Observations       |                      313 |                      313
model_parameters(fit4, ci = 0.95)
Parameter          | Coefficient |       SE |         95% CI | t(307) |      p
------------------------------------------------------------------------------
(Intercept)        |        5.79 |     0.21 | [ 5.37,  6.21] |  27.17 | < .001
bmi                |       -0.03 | 6.38e-03 | [-0.04, -0.02] |  -4.63 | < .001
vituse [Not_Often] |       -0.07 |     0.10 | [-0.26,  0.12] |  -0.69 | 0.493 
vituse [Never]     |       -0.36 |     0.09 | [-0.53, -0.18] |  -3.99 | < .001
calories           |   -2.66e-04 | 7.19e-05 | [ 0.00,  0.00] |  -3.70 | < .001
fiber              |        0.04 | 8.41e-03 | [ 0.03,  0.06] |   5.21 | < .001

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.
model_performance(fit4)
# Indices of model performance

AIC     |    AICc |     BIC |    R2 | R2 (adj.) |  RMSE | Sigma
---------------------------------------------------------------
647.659 | 648.026 | 673.883 | 0.206 |     0.193 | 0.666 | 0.672
check_model(fit4)

21.3.5 Eight-Predictor Model

fit8 <- lm(logbeta ~ age + sex + smokstat + bmi + 
             vituse + calories + fiber + betadiet, 
           data = plasma)
compare_models(fit1, fit8)
Parameter          |                     fit1 |                     fit8
------------------------------------------------------------------------
(Intercept)        |      5.48 ( 4.93,  6.04) |      5.49 ( 4.95,  6.04)
age                |  5.24e-03 ( 0.00,  0.01) |  5.40e-03 ( 0.00,  0.01)
sex [Male]         |     -0.23 (-0.48,  0.02) |     -0.23 (-0.48,  0.01)
smokstat [Never]   |      0.09 (-0.08,  0.25) |      0.08 (-0.09,  0.24)
smokstat [Current] |     -0.21 (-0.45,  0.04) |     -0.21 (-0.46,  0.03)
bmi                |     -0.03 (-0.04, -0.02) |     -0.03 (-0.05, -0.02)
vituse [Not_Often] |     -0.03 (-0.22,  0.17) |     -0.02 (-0.21,  0.17)
vituse [Never]     |     -0.30 (-0.48, -0.12) |     -0.29 (-0.46, -0.11)
calories           | -2.19e-04 ( 0.00,  0.00) | -1.70e-04 ( 0.00,  0.00)
fiber              |      0.03 ( 0.01,  0.05) |      0.03 ( 0.01,  0.05)
betadiet           |  4.69e-05 ( 0.00,  0.00) |  4.65e-05 ( 0.00,  0.00)
cholesterol        | -2.70e-04 ( 0.00,  0.00) |                         
fat                |  1.40e-03 ( 0.00,  0.01) |                         
alcohol            |  5.25e-03 (-0.01,  0.02) |                         
------------------------------------------------------------------------
Observations       |                      313 |                      313
model_parameters(fit8)
Parameter          | Coefficient |       SE |         95% CI | t(302) |      p
------------------------------------------------------------------------------
(Intercept)        |        5.49 |     0.28 | [ 4.95,  6.04] |  19.90 | < .001
age                |    5.40e-03 | 2.90e-03 | [ 0.00,  0.01] |   1.86 | 0.063 
sex [Male]         |       -0.23 |     0.12 | [-0.48,  0.01] |  -1.92 | 0.056 
smokstat [Never]   |        0.08 |     0.08 | [-0.09,  0.24] |   0.94 | 0.350 
smokstat [Current] |       -0.21 |     0.12 | [-0.46,  0.03] |  -1.75 | 0.081 
bmi                |       -0.03 | 6.35e-03 | [-0.05, -0.02] |  -5.13 | < .001
vituse [Not_Often] |       -0.02 |     0.10 | [-0.21,  0.17] |  -0.24 | 0.814 
vituse [Never]     |       -0.29 |     0.09 | [-0.46, -0.11] |  -3.21 | 0.001 
calories           |   -1.70e-04 | 7.71e-05 | [ 0.00,  0.00] |  -2.21 | 0.028 
fiber              |        0.03 | 9.36e-03 | [ 0.01,  0.05] |   3.13 | 0.002 
betadiet           |    4.65e-05 | 2.92e-05 | [ 0.00,  0.00] |   1.59 | 0.113 

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.
model_performance(fit8)
# Indices of model performance

AIC     |    AICc |     BIC |    R2 | R2 (adj.) |  RMSE | Sigma
---------------------------------------------------------------
641.324 | 642.364 | 686.279 | 0.247 |     0.222 | 0.649 | 0.660
check_model(fit8)

21.3.6 Our Three Models So Far

compare_performance(fit1, fit4, fit8)
# Comparison of Model Performance Indices

Name | Model | AIC (weights) | AICc (weights) | BIC (weights) |    R2 | R2 (adj.) |  RMSE | Sigma
-------------------------------------------------------------------------------------------------
fit1 |    lm | 646.5 (0.068) |  648.1 (0.051) | 702.7 (<.001) | 0.249 |     0.216 | 0.648 | 0.663
fit4 |    lm | 647.7 (0.038) |  648.0 (0.053) | 673.9 (0.998) | 0.206 |     0.193 | 0.666 | 0.672
fit8 |    lm | 641.3 (0.894) |  642.4 (0.896) | 686.3 (0.002) | 0.247 |     0.222 | 0.649 | 0.660
plot(compare_performance(fit1, fit4, fit8))

set.seed(431)
performance_accuracy(fit1, method = "cv", k = 5)
# Accuracy of Model Predictions

Accuracy (95% CI): 42.42% [32.42%, 52.44%]
Method: Correlation between observed and predicted
set.seed(431)
performance_accuracy(fit4, method = "cv", k = 5)
# Accuracy of Model Predictions

Accuracy (95% CI): 42.98% [37.50%, 52.45%]
Method: Correlation between observed and predicted
set.seed(431)
performance_accuracy(fit8, method = "cv", k = 5)
# Accuracy of Model Predictions

Accuracy (95% CI): 44.05% [32.92%, 57.60%]
Method: Correlation between observed and predicted
set.seed(431)
performance_cv(fit1)
# Cross-validation performance (30% holdout method)

MSE  | RMSE |   R2
------------------
0.56 | 0.75 | 0.19
set.seed(431)
performance_cv(fit4)
# Cross-validation performance (30% holdout method)

MSE  | RMSE |   R2
------------------
0.56 | 0.75 | 0.19
set.seed(431)
performance_cv(fit8)
# Cross-validation performance (30% holdout method)

MSE  | RMSE |   R2
------------------
0.55 | 0.74 | 0.21

21.4 Using the Lasso

Need to discuss glmnet package, and point to:

and add some of the comments I used in https://thomaselove.github.io/432-notes/lasso.html#using-the-lasso-to-suggest-a-smaller-model

pred_x <- model.matrix(fit1)
out_y <- plasma |> select(logbeta) |> as.matrix()

set.seed(431)
lasso <- cv.glmnet(x = pred_x, y = out_y,
                   type.measure = "mse",
                   alpha = 1, family = "gaussian",
                   nlambda = 200, nfolds = 10)
plot(lasso)

log(lasso$lambda.min)
[1] -3.921033
predict(lasso, type = "coef", s = "lambda.min")
15 x 1 sparse Matrix of class "dgCMatrix"
                   lambda.min
(Intercept)      5.423141e+00
(Intercept)      .           
age              4.601321e-03
sexMale         -1.703956e-01
smokstatNever    6.368383e-02
smokstatCurrent -1.874663e-01
bmi             -2.918528e-02
vituseNot_Often  .           
vituseNever     -2.486722e-01
calories        -5.125006e-05
fat             -4.158782e-04
fiber            2.216516e-02
alcohol          .           
cholesterol     -3.158420e-04
betadiet         4.038103e-05

We can use the lasso to identify a set of predictors to drop. Here, the lasso only drops the alcohol variable from our initial set of predictors, since we aren’t willing to drop just one level of a multi-categorical predictor like vituse.

21.4.1 No “alcohol” model

fit10 <- lm(logbeta ~ age + sex + smokstat + bmi + 
             vituse + calories + fat + fiber + 
             cholesterol + betadiet, 
           data = plasma)
compare_models(fit1, fit10)
Parameter          |                     fit1 |                    fit10
------------------------------------------------------------------------
(Intercept)        |      5.48 ( 4.93,  6.04) |      5.49 ( 4.94,  6.04)
age                |  5.24e-03 ( 0.00,  0.01) |  5.35e-03 ( 0.00,  0.01)
sex [Male]         |     -0.23 (-0.48,  0.02) |     -0.22 (-0.46,  0.02)
smokstat [Never]   |      0.09 (-0.08,  0.25) |      0.08 (-0.09,  0.24)
smokstat [Current] |     -0.21 (-0.45,  0.04) |     -0.21 (-0.45,  0.03)
bmi                |     -0.03 (-0.04, -0.02) |     -0.03 (-0.04, -0.02)
vituse [Not_Often] |     -0.03 (-0.22,  0.17) |     -0.02 (-0.21,  0.17)
vituse [Never]     |     -0.30 (-0.48, -0.12) |     -0.29 (-0.47, -0.11)
calories           | -2.19e-04 ( 0.00,  0.00) | -1.73e-04 ( 0.00,  0.00)
fat                |  1.40e-03 ( 0.00,  0.01) |  8.53e-04 (-0.01,  0.01)
fiber              |      0.03 ( 0.01,  0.05) |      0.03 ( 0.01,  0.05)
cholesterol        | -2.70e-04 ( 0.00,  0.00) | -2.90e-04 ( 0.00,  0.00)
betadiet           |  4.69e-05 ( 0.00,  0.00) |  4.81e-05 ( 0.00,  0.00)
alcohol            |  5.25e-03 (-0.01,  0.02) |                         
------------------------------------------------------------------------
Observations       |                      313 |                      313
model_parameters(fit10)
Parameter          | Coefficient |       SE |         95% CI | t(300) |      p
------------------------------------------------------------------------------
(Intercept)        |        5.49 |     0.28 | [ 4.94,  6.04] |  19.66 | < .001
age                |    5.35e-03 | 2.94e-03 | [ 0.00,  0.01] |   1.82 | 0.069 
sex [Male]         |       -0.22 |     0.12 | [-0.46,  0.02] |  -1.77 | 0.078 
smokstat [Never]   |        0.08 |     0.08 | [-0.09,  0.24] |   0.95 | 0.343 
smokstat [Current] |       -0.21 |     0.12 | [-0.45,  0.03] |  -1.72 | 0.087 
bmi                |       -0.03 | 6.40e-03 | [-0.04, -0.02] |  -5.02 | < .001
vituse [Not_Often] |       -0.02 |     0.10 | [-0.21,  0.17] |  -0.21 | 0.834 
vituse [Never]     |       -0.29 |     0.09 | [-0.47, -0.11] |  -3.20 | 0.002 
calories           |   -1.73e-04 | 1.86e-04 | [ 0.00,  0.00] |  -0.93 | 0.353 
fat                |    8.53e-04 | 3.04e-03 | [-0.01,  0.01] |   0.28 | 0.779 
fiber              |        0.03 |     0.01 | [ 0.01,  0.05] |   2.72 | 0.007 
cholesterol        |   -2.90e-04 | 4.36e-04 | [ 0.00,  0.00] |  -0.66 | 0.507 
betadiet           |    4.81e-05 | 2.94e-05 | [ 0.00,  0.00] |   1.64 | 0.103 

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.
model_performance(fit10)
# Indices of model performance

AIC     |    AICc |     BIC |    R2 | R2 (adj.) |  RMSE | Sigma
---------------------------------------------------------------
644.840 | 646.249 | 697.287 | 0.248 |     0.218 | 0.648 | 0.662
check_model(fit10)

21.4.2 Comparing the Four Models

compare_performance(fit1, fit4, fit8, fit10)
# Comparison of Model Performance Indices

Name  | Model | AIC (weights) | AICc (weights) | BIC (weights) |    R2 | R2 (adj.) |  RMSE | Sigma
--------------------------------------------------------------------------------------------------
fit1  |    lm | 646.5 (0.059) |  648.1 (0.046) | 702.7 (<.001) | 0.249 |     0.216 | 0.648 | 0.663
fit4  |    lm | 647.7 (0.033) |  648.0 (0.047) | 673.9 (0.998) | 0.206 |     0.193 | 0.666 | 0.672
fit8  |    lm | 641.3 (0.775) |  642.4 (0.794) | 686.3 (0.002) | 0.247 |     0.222 | 0.649 | 0.660
fit10 |    lm | 644.8 (0.134) |  646.2 (0.114) | 697.3 (<.001) | 0.248 |     0.218 | 0.648 | 0.662
plot(compare_performance(fit1, fit4, fit8, fit10))

Within our sample, it looks like fit8 is the strongest of these models.

How about its cross-validation performance?

set.seed(431)
performance_accuracy(fit10, method = "cv", k = 5)
# Accuracy of Model Predictions

Accuracy (95% CI): 42.65% [32.47%, 52.20%]
Method: Correlation between observed and predicted
set.seed(431)
performance_cv(fit10)
# Cross-validation performance (30% holdout method)

MSE  | RMSE |  R2
-----------------
0.55 | 0.74 | 0.2

21.4.3 Table of Cross-validated Performance

Model Accuracy and 95% CI MSE\(_{30}\) RMSE\(_{30}\) \(R^2_{30}\)
fit1 .4242 (.3242, .5244) 0.56 0.75 0.19
fit10 .4265 (.3247, .5220) 0.55 0.74 0.20
fit8 .4405 (.3292, .5760) 0.56 0.75 0.19
fit4 .4298 (.3750, .5245) 0.55 0.74 0.21

It appears that fit8 also performs well in cross-validation, in terms of accuracy. Model fit4 may be a bit better in the measures assessed here using the 30% holdout sample approach. Still, given the strong in-sample performance of fit8, I think I would be inclined to select that model.

21.5 Predicting Plasma Retinol

fit_ret0 <- lm(retplasma ~ age + sex + smokstat + bmi + vituse +
  calories + fat + fiber + alcohol + cholesterol +
  retdiet, data = plasma)

boxCox(fit_ret0)

It appears we might want to be working with the logarithm of retplasma as our outcome.

plasma <- plasma |>
  mutate(logret = log(retplasma))

fit_ret1 <- lm(logret ~ age + sex + smokstat + bmi + 
                 vituse + calories + fat + fiber + 
                 alcohol + cholesterol + retdiet, 
               data = plasma)
model_parameters(fit_ret1, ci = 0.95)
Parameter          | Coefficient |       SE |        95% CI | t(299) |      p
-----------------------------------------------------------------------------
(Intercept)        |        6.17 |     0.14 | [ 5.90, 6.44] |  45.37 | < .001
age                |    4.98e-03 | 1.43e-03 | [ 0.00, 0.01] |   3.48 | < .001
sex [Male]         |        0.06 |     0.06 | [-0.06, 0.18] |   1.00 | 0.316 
smokstat [Never]   |       -0.07 |     0.04 | [-0.15, 0.01] |  -1.74 | 0.083 
smokstat [Current] |       -0.09 |     0.06 | [-0.20, 0.03] |  -1.43 | 0.153 
bmi                |    1.38e-03 | 3.14e-03 | [ 0.00, 0.01] |   0.44 | 0.659 
vituse [Not_Often] |    2.27e-03 |     0.05 | [-0.09, 0.09] |   0.05 | 0.962 
vituse [Never]     |       -0.04 |     0.04 | [-0.13, 0.05] |  -0.86 | 0.390 
calories           |    1.51e-04 | 9.81e-05 | [ 0.00, 0.00] |   1.54 | 0.124 
fat                |   -2.93e-03 | 1.54e-03 | [-0.01, 0.00] |  -1.90 | 0.058 
fiber              |   -8.58e-03 | 5.04e-03 | [-0.02, 0.00] |  -1.70 | 0.090 
alcohol            |        0.01 | 4.25e-03 | [ 0.00, 0.02] |   2.54 | 0.012 
cholesterol        |   -7.67e-05 | 2.16e-04 | [ 0.00, 0.00] |  -0.36 | 0.723 
retdiet            |    1.99e-06 | 3.57e-05 | [ 0.00, 0.00] |   0.06 | 0.956 

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.
model_performance(fit_ret1)
# Indices of model performance

AIC     |    AICc |     BIC |    R2 | R2 (adj.) |  RMSE | Sigma
---------------------------------------------------------------
193.645 | 195.262 | 249.839 | 0.125 |     0.087 | 0.314 | 0.322
check_model(fit_ret1)

Here, the use of a stepwise approach suggests a 6-predictor subset, including age, smokstat, calories, fat, fiber and alcohol:

fit_ret6 <- select_parameters(fit_ret1)

compare_models(fit_ret1, fit_ret6)
Parameter          |                fit_ret1 |                fit_ret6
----------------------------------------------------------------------
(Intercept)        |      6.17 ( 5.90, 6.44) |      6.17 ( 5.97, 6.36)
age                |  4.98e-03 ( 0.00, 0.01) |  5.36e-03 ( 0.00, 0.01)
smokstat [Never]   |     -0.07 (-0.15, 0.01) |     -0.07 (-0.15, 0.01)
smokstat [Current] |     -0.09 (-0.20, 0.03) |     -0.09 (-0.21, 0.02)
calories           |  1.51e-04 ( 0.00, 0.00) |  1.65e-04 ( 0.00, 0.00)
fat                | -2.93e-03 (-0.01, 0.00) | -3.18e-03 (-0.01, 0.00)
fiber              | -8.58e-03 (-0.02, 0.00) | -8.95e-03 (-0.02, 0.00)
alcohol            |      0.01 ( 0.00, 0.02) |      0.01 ( 0.00, 0.02)
cholesterol        | -7.67e-05 ( 0.00, 0.00) |                        
bmi                |  1.38e-03 ( 0.00, 0.01) |                        
sex [Male]         |      0.06 (-0.06, 0.18) |                        
vituse [Never]     |     -0.04 (-0.13, 0.05) |                        
retdiet            |  1.99e-06 ( 0.00, 0.00) |                        
vituse [Not_Often] |  2.27e-03 (-0.09, 0.09) |                        
----------------------------------------------------------------------
Observations       |                     313 |                     313

I’ll leave for the reader the exercise of evaluating this particular subset, and perhaps developing other potential choices before making additional modeling decisions. One might also consider returning to the data the observation with an unreasonably low betaplasma, since that variable isn’t involved in our prediction of retplasma.

21.6 For More Information


  1. If we run select_parameters(fit1), that stepwise procedure suggests the 8-predictor subset identified by best subsets, as it turns out.↩︎