12  Analysis of Covariance with the SMART data

In this chapter, we’ll work with the smart_cle1_sh data file that we developed in Chapter 7.

12.1 R Setup Used Here

knitr::opts_chunk$set(comment = NA)

library(janitor)
library(broom)
library(mosaic)
library(tidyverse) 

theme_set(theme_bw())

12.1.1 Data Load

smart_cle1_sh <- readRDS("data/smart_cle1_sh.Rds")

12.2 A New Small Study: Predicting BMI

We’ll begin by investigating the problem of predicting bmi, at first with just three regression inputs: smoke100, female and physhealth, in our smart_cle1_sh data set.

  • The outcome of interest is bmi.
  • Inputs to the regression model are:
    • smoke100 = 1 if the subject has smoked at least 100 cigarettes in their lifetime, and 0 otherwise
    • female = 1 if the subject is female, and 0 if they are male
    • physhealth = number of poor physical health days in past 30 (treated as quantitative)

12.2.1 Does smoke100 predict bmi well?

12.2.1.1 Graphical Assessment

ggplot(smart_cle1_sh, aes(x = smoke100, y = bmi)) +
    geom_point()

Not so helpful. We should probably specify that smoke100 is a factor, and try another plotting approach.

ggplot(smart_cle1_sh, aes(x = factor(smoke100), y = bmi)) +
    geom_boxplot()

The median BMI looks a little higher for those who have smoked 100 cigarettes. Let’s see if a model reflects that.

12.3 mod1: A simple t-test model

mod1 <- lm(bmi ~ smoke100, data = smart_cle1_sh)
mod1

Call:
lm(formula = bmi ~ smoke100, data = smart_cle1_sh)

Coefficients:
(Intercept)     smoke100  
    27.9390       0.8722  
summary(mod1)

Call:
lm(formula = bmi ~ smoke100, data = smart_cle1_sh)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.511  -4.019  -0.870   2.841  41.749 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  27.9390     0.2594 107.722   <2e-16 ***
smoke100      0.8722     0.3781   2.307   0.0213 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.353 on 1131 degrees of freedom
Multiple R-squared:  0.004682,  Adjusted R-squared:  0.003802 
F-statistic: 5.321 on 1 and 1131 DF,  p-value: 0.02125
confint(mod1)
                 2.5 %    97.5 %
(Intercept) 27.4301016 28.447875
smoke100     0.1303021  1.614195
tidy(mod1, conf.int = TRUE, conf.level = 0.95)
# A tibble: 2 × 7
  term        estimate std.error statistic p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 (Intercept)   27.9       0.259    108.    0        27.4       28.4 
2 smoke100       0.872     0.378      2.31  0.0213    0.130      1.61
glance(mod1)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1   0.00468       0.00380  6.35      5.32  0.0213     1 -3702. 7409. 7424.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

The model suggests, based on these 1133 subjects, that

  • our best prediction for non-smokers is BMI = 27.94 kg/m2, and
  • our best prediction for those who have smoked 100 cigarettes is BMI = 27.94 + 0.87 = 28.81 kg/m2.
  • the mean difference between smokers and non-smokers is +0.87 kg/m2 in BMI
  • a 95% confidence (uncertainty) interval for that mean smoker - non-smoker difference in BMI ranges from 0.13 to 1.61
  • the model accounts for 0.47% of the variation in BMI, so knowing the respondent’s tobacco status does very little to reduce the size of the prediction errors as compared to an intercept-only model that would predict the overall mean BMI for each of our subjects (a quick sketch of that comparison model appears after this list).
  • the model makes some enormous errors, with one subject being predicted to have a BMI 42 points lower than his/her actual BMI.
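For comparison, here is that intercept-only model, which simply predicts the overall mean bmi for everyone. A quick sketch:

lm(bmi ~ 1, data = smart_cle1_sh)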

Note that this simple regression model just reproduces the two-sample t test with a pooled variance estimate.

t.test(bmi ~ smoke100, var.equal = TRUE, data = smart_cle1_sh)

    Two Sample t-test

data:  bmi by smoke100
t = -2.3066, df = 1131, p-value = 0.02125
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -1.6141946 -0.1303021
sample estimates:
mean in group 0 mean in group 1 
       27.93899        28.81124 

12.4 mod2: Adding another predictor (two-way ANOVA without interaction)

When we add in the information about female (sex) to our original model, we might first picture the data. We could look at separate histograms,

ggplot(smart_cle1_sh, aes(x = bmi)) +
    geom_histogram(bins = 30) +
    facet_grid(female ~ smoke100, labeller = label_both)

or maybe boxplots?

ggplot(smart_cle1_sh, aes(x = factor(female), y = bmi)) +
    geom_boxplot() +
    facet_wrap(~ smoke100, labeller = label_both)

ggplot(smart_cle1_sh, aes(x = female, y = bmi)) +
    geom_point(size = 3, alpha = 0.2) +
    theme_bw() +
    facet_wrap(~ smoke100, labeller = label_both)

OK. Let’s try fitting a model.

mod2 <- lm(bmi ~ female + smoke100, data = smart_cle1_sh)
mod2

Call:
lm(formula = bmi ~ female + smoke100, data = smart_cle1_sh)

Coefficients:
(Intercept)       female     smoke100  
    28.0265      -0.1342       0.8555  

This new model yields just four distinct predicted values:

  • bmi = 28.0265 if the subject is male and has not smoked 100 cigarettes (so female = 0 and smoke100 = 0)
  • bmi = 28.0265 - 0.1342 = 27.8923 if the subject is female and has not smoked 100 cigarettes (female = 1 and smoke100 = 0)
  • bmi = 28.0265 + 0.8555 = 28.8820 if the subject is male and has smoked 100 cigarettes (so female = 0 and smoke100 = 1), and, finally
  • bmi = 28.0265 - 0.1342 + 0.8555 = 28.7478 if the subject is female and has smoked 100 cigarettes (so both female and smoke100 = 1).

Another way to put this is that for those who have not smoked 100 cigarettes, the model is:

  • bmi = 28.0265 - 0.1342 female

and for those who have smoked 100 cigarettes, the model is:

  • bmi = 28.8820 - 0.1342 female

Only the intercept of the bmi-female model changes depending on smoke100.
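We can verify these four predicted values by applying predict to a small grid of inputs (a quick check, using a helper grid I’ll call new_grid):

new_grid <- expand.grid(female = c(0, 1), smoke100 = c(0, 1))
cbind(new_grid, fit = predict(mod2, newdata = new_grid))  # the four distinct fits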

summary(mod2)

Call:
lm(formula = bmi ~ female + smoke100, data = smart_cle1_sh)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.448  -3.972  -0.823   2.774  41.678 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  28.0265     0.3620  77.411   <2e-16 ***
female       -0.1342     0.3875  -0.346   0.7291    
smoke100      0.8555     0.3814   2.243   0.0251 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.356 on 1130 degrees of freedom
Multiple R-squared:  0.004788,  Adjusted R-squared:  0.003027 
F-statistic: 2.718 on 2 and 1130 DF,  p-value: 0.06642
confint(mod2)
                 2.5 %     97.5 %
(Intercept) 27.3161140 28.7368281
female      -0.8944773  0.6259881
smoke100     0.1072974  1.6037825

The slope of smoke100 has a confidence interval that is entirely above zero, suggesting that having smoked 100 cigarettes is associated with an increase in bmi. The confidence interval for the female slope, however, includes zero, so these data provide little evidence of an association between sex and bmi after accounting for smoke100.

The \(R^2\) value suggests that about 0.48% of the variation in bmi is accounted for by this ANOVA model.

In fact, this regression (on two binary indicator variables) is simply a two-way ANOVA model without an interaction term.

anova(mod2)
Analysis of Variance Table

Response: bmi
            Df Sum Sq Mean Sq F value  Pr(>F)  
female       1     16  16.301  0.4036 0.52538  
smoke100     1    203 203.296  5.0330 0.02506 *
Residuals 1130  45644  40.393                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

12.5 mod3: Adding the interaction term (two-way ANOVA with interaction)

Suppose we want to let the effect of female vary depending on the smoke100 status. Then we need to incorporate an interaction term in our model.

mod3 <- lm(bmi ~ female * smoke100, data = smart_cle1_sh)
mod3

Call:
lm(formula = bmi ~ female * smoke100, data = smart_cle1_sh)

Coefficients:
    (Intercept)           female         smoke100  female:smoke100  
        28.2690          -0.5064           0.4119           0.7536  

So, for example, for a male who has smoked 100 cigarettes, this model predicts

  • bmi = 28.269 - 0.506 (0) + 0.412 (1) + 0.754 (0)(1) = 28.269 + 0.412 = 28.681

And for a female who has smoked 100 cigarettes, the model predicts

  • bmi = 28.269 - 0.506 (1) + 0.412 (1) + 0.754 (1)(1) = 28.269 - 0.506 + 0.412 + 0.754 = 28.929

For those who have not smoked 100 cigarettes, the model is:

  • bmi = 28.269 - 0.506 female

But for those who have smoked 100 cigarettes, the model is:

  • bmi = (28.269 + 0.412) + (-0.506 + 0.754) female, or ...
  • bmi = 28.681 - 0.248 female

Now, both the slope and the intercept of the bmi-female model change depending on smoke100.
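Because mod3 includes the interaction, it is saturated for these two binary inputs, so its four predicted values are exactly the four observed cell means. A quick check (sketch):

smart_cle1_sh |>
    group_by(female, smoke100) |>
    summarise(mean_bmi = mean(bmi), .groups = "drop")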

summary(mod3)

Call:
lm(formula = bmi ~ female * smoke100, data = smart_cle1_sh)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.628  -3.938  -0.829   2.759  41.879 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      28.2690     0.4396  64.301   <2e-16 ***
female           -0.5064     0.5446  -0.930    0.353    
smoke100          0.4119     0.5946   0.693    0.489    
female:smoke100   0.7536     0.7750   0.972    0.331    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.356 on 1129 degrees of freedom
Multiple R-squared:  0.005621,  Adjusted R-squared:  0.002979 
F-statistic: 2.127 on 3 and 1129 DF,  p-value: 0.09507
confint(mod3)
                     2.5 %     97.5 %
(Intercept)     27.4063783 29.1315563
female          -1.5749026  0.5621793
smoke100        -0.7547605  1.5786121
female:smoke100 -0.7670239  2.2742178

In fact, this regression (on two binary indicator variables and a product term) is simply a two-way ANOVA model with an interaction term.

anova(mod3)
Analysis of Variance Table

Response: bmi
                  Df Sum Sq Mean Sq F value  Pr(>F)  
female             1     16  16.301  0.4035 0.52539  
smoke100           1    203 203.296  5.0327 0.02507 *
female:smoke100    1     38  38.194  0.9455 0.33107  
Residuals       1129  45606  40.395                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adding the interaction term doesn’t change things very much here. Its uncertainty interval includes zero, and the overall model still accounts for less than 1% of the variation in bmi.

12.6 mod4: Using smoke100 and physhealth in a model for bmi

ggplot(smart_cle1_sh, aes(x = physhealth, y = bmi, color = factor(smoke100))) +
    geom_point() + 
    guides(col = "none") +
    geom_smooth(method = "lm", se = FALSE) +
    facet_wrap(~ smoke100, labeller = label_both) 
`geom_smooth()` using formula = 'y ~ x'

Does the difference in the slopes of bmi on physhealth between those who have and haven’t smoked at least 100 cigarettes appear to be substantial and important?

mod4 <- lm(bmi ~ smoke100 * physhealth, data = smart_cle1_sh)

summary(mod4)

Call:
lm(formula = bmi ~ smoke100 * physhealth, data = smart_cle1_sh)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.011  -3.811  -0.559   2.609  38.249 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         27.49077    0.28369  96.904  < 2e-16 ***
smoke100             0.57017    0.41911   1.360 0.173965    
physhealth           0.10840    0.02995   3.619 0.000308 ***
smoke100:physhealth  0.03326    0.04093   0.813 0.416561    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.252 on 1129 degrees of freedom
Multiple R-squared:  0.03782,   Adjusted R-squared:  0.03526 
F-statistic: 14.79 on 3 and 1129 DF,  p-value: 1.886e-09

Does it seem as though the addition of physhealth has improved our model substantially over a model with smoke100 alone (which, you recall, was mod1)?

Since the mod4 model contains the mod1 model’s predictors as a subset and the outcome is the same for each model, we consider the models nested and have some extra tools available to compare them.

  • I might start by looking at the basic summaries for each model.
glance(mod4)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic       p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>         <dbl> <dbl>  <dbl> <dbl> <dbl>
1    0.0378        0.0353  6.25      14.8 0.00000000189     3 -3682. 7375. 7400.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(mod1)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1   0.00468       0.00380  6.35      5.32  0.0213     1 -3702. 7409. 7424.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
  • The \(R^2\) is much larger for the model with physhealth, but still very tiny.
  • Smaller AIC and smaller BIC statistics are more desirable. Here, mod4 has meaningfully smaller values of each, so mod4 looks better on these criteria, too.
  • We might also consider a significance test by looking at an ANOVA model comparison. This is only appropriate because mod1 is nested in mod4.
anova(mod4, mod1)
Analysis of Variance Table

Model 1: bmi ~ smoke100 * physhealth
Model 2: bmi ~ smoke100
  Res.Df   RSS Df Sum of Sq     F    Pr(>F)    
1   1129 44129                                 
2   1131 45649 -2   -1519.7 19.44 5.005e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The addition of the physhealth terms appears to be a statistically detectable improvement, though not an especially meaningful one in practical terms.

12.7 Making Predictions with a Linear Regression Model

Recall model 4, which yields predictions for body mass index on the basis of the main effects of having smoked (smoke100) and days of poor physical health (physhealth) and their interaction.

mod4

Call:
lm(formula = bmi ~ smoke100 * physhealth, data = smart_cle1_sh)

Coefficients:
        (Intercept)             smoke100           physhealth  
           27.49077              0.57017              0.10840  
smoke100:physhealth  
            0.03326  

12.7.1 Fitting an Individual Prediction and 95% Prediction Interval

What do we predict for the bmi of a subject who has smoked at least 100 cigarettes in their life and had 8 poor physical health days in the past 30?

new1 <- tibble(smoke100 = 1, physhealth = 8)
predict(mod4, newdata = new1, interval = "prediction", level = 0.95)
       fit      lwr      upr
1 29.19423 16.91508 41.47338

The predicted bmi for this new subject is shown above. The prediction interval shows the bounds of a 95% uncertainty interval for a predicted bmi for an individual smoker¹ who has 8 days of poor physical health out of the past 30. From the predict function applied to a linear model, we can get the prediction intervals for any new data points in this manner.

12.7.2 Confidence Interval for an Average Prediction

  • What do we predict for the average body mass index of a population of subjects who are smokers and have physhealth = 8?
predict(mod4, newdata = new1, interval = "confidence", level = 0.95)
       fit      lwr      upr
1 29.19423 28.64267 29.74579
  • How does this result compare to the prediction interval?
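One way to see the comparison is to compute the widths of the two intervals directly, reusing new1 from above. A quick sketch:

pi_4 <- predict(mod4, newdata = new1, interval = "prediction", level = 0.95)
ci_4 <- predict(mod4, newdata = new1, interval = "confidence", level = 0.95)
pi_4[, "upr"] - pi_4[, "lwr"]  # width of the prediction interval (about 24.6)
ci_4[, "upr"] - ci_4[, "lwr"]  # width of the confidence interval (about 1.1)

The prediction interval for one new individual is far wider than the confidence interval for the average of a population of such individuals, because it must also account for individual-to-individual variation around that average.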

12.7.3 Fitting Multiple Individual Predictions to New Data

  • How does our prediction change for a respondent if they instead have 7 or 9 poor physical health days? What if they do or don’t have a history of smoking?
c8_new2 <- tibble(subjectid = 1001:1006, 
                  smoke100 = c(1, 1, 1, 0, 0, 0), 
                  physhealth = c(7, 8, 9, 7, 8, 9))
pred2 <- predict(mod4, newdata = c8_new2, interval = "prediction", level = 0.95) |> 
    as_tibble()

result2 <- bind_cols(c8_new2, pred2)
result2
# A tibble: 6 × 6
  subjectid smoke100 physhealth   fit   lwr   upr
      <int>    <dbl>      <dbl> <dbl> <dbl> <dbl>
1      1001        1          7  29.1  16.8  41.3
2      1002        1          8  29.2  16.9  41.5
3      1003        1          9  29.3  17.1  41.6
4      1004        0          7  28.2  16.0  40.5
5      1005        0          8  28.4  16.1  40.6
6      1006        0          9  28.5  16.2  40.7

The result2 tibble contains predictions for each scenario.

  • Which has a bigger impact on these predictions and prediction intervals? A one-category change in smoke100 or a one-day change in physhealth?
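We can quantify this using mod4’s coefficients (a quick sketch, using a helper vector I’ll call b4):

b4 <- coef(mod4)
b4["smoke100"] + 8 * b4["smoke100:physhealth"]  # smoker vs. non-smoker gap at physhealth = 8 (about 0.84)
b4["physhealth"]                                # change per poor physical health day, non-smokers (about 0.11)
b4["physhealth"] + b4["smoke100:physhealth"]    # change per day, smokers (about 0.14)

Either change moves the point predictions by less than 1 kg/m2, which is tiny compared to the roughly 24-point width of the prediction intervals.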

12.8 Centering the model

Our model mod4 estimates four coefficients (the intercept, physhealth, smoke100 and their interaction) but uses just two inputs (smoke100 and physhealth.) If we center the quantitative input physhealth before building the model, we get more interpretable main effects; the interaction coefficient itself is unchanged.

smart_cle1_sh_c <- smart_cle1_sh |>
    mutate(physhealth_c = physhealth - mean(physhealth))

mod4_c <- lm(bmi ~ smoke100 * physhealth_c, data = smart_cle1_sh_c)

summary(mod4_c)

Call:
lm(formula = bmi ~ smoke100 * physhealth_c, data = smart_cle1_sh_c)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.011  -3.811  -0.559   2.609  38.249 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)           27.99821    0.25576 109.471  < 2e-16 ***
smoke100               0.72589    0.37288   1.947 0.051818 .  
physhealth_c           0.10840    0.02995   3.619 0.000308 ***
smoke100:physhealth_c  0.03326    0.04093   0.813 0.416561    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.252 on 1129 degrees of freedom
Multiple R-squared:  0.03782,   Adjusted R-squared:  0.03526 
F-statistic: 14.79 on 3 and 1129 DF,  p-value: 1.886e-09

What has changed as compared to the original mod4?

  • Our original model was bmi = 27.5 + 0.57 smoke100 + 0.11 physhealth + 0.03 smoke100 x physhealth
  • Our new model is bmi = 28.0 + 0.73 smoke100 + 0.11 centered physhealth + 0.03 smoke100 x centered physhealth.

So our new model on centered data is:

  • 28.0 + 0.11 centered physhealth_c for non-smokers, and
  • (28.0 + 0.73) + (0.11 + 0.03) centered physhealth_c, or 28.73 + 0.14 centered physhealth_c for smokers.

In our new (centered physhealth_c) model,

  • the main effect of smoke100 now corresponds to a predictive difference (smoker - non-smoker) in bmi with physhealth at its mean value, 4.68 days,
  • the intercept term is now the predicted bmi for a non-smoker with an average physhealth, and
  • the product term corresponds to the change in the slope of centered physhealth_c on bmi for a smoker rather than a non-smoker, while
  • the residual standard deviation and the R-squared values remain unchanged from the model before centering.
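Centering changes the coefficients, but not the fits. A quick check (sketch):

all.equal(fitted(mod4), fitted(mod4_c))  # should be TRUE: identical predictions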

12.8.1 Plot of Model 4 on Centered physhealth: mod4_c

ggplot(smart_cle1_sh_c, aes(x = physhealth_c, y = bmi, group = smoke100, col = factor(smoke100))) +
    geom_point(alpha = 0.5, size = 2) +
    geom_smooth(method = "lm", se = FALSE, formula = y ~ x) +
    guides(color = "none") +
    labs(x = "Poor Physical Health Days, centered", y = "Body Mass Index",
         title = "Model `mod4` on centered data") +
    facet_wrap(~ smoke100, labeller = label_both)

12.9 Rescaling an input by subtracting the mean and dividing by 2 standard deviations

Centering helped us interpret the main effects in the regression, but it still leaves a scaling problem.

  • The smoke100 coefficient estimate is much larger than that of physhealth, but this is misleading, considering that we are comparing the complete change in one variable (smoking or not) to a 1-day change in physhealth.
  • Gelman and Hill (2007) recommend all continuous predictors be scaled by dividing by 2 standard deviations, so that:
    • a 1-unit change in the rescaled predictor corresponds to a change from 1 standard deviation below the mean, to 1 standard deviation above.
    • an unscaled binary (1/0) predictor with 50% probability of occurring will be exactly comparable to a rescaled continuous predictor done in this way.
smart_cle1_sh_rescale <- smart_cle1_sh |>
    mutate(physhealth_z = (physhealth - mean(physhealth))/(2*sd(physhealth)))

12.9.1 Refitting model mod4 to the rescaled data

mod4_z <- lm(bmi ~ smoke100 * physhealth_z, data = smart_cle1_sh_rescale)

summary(mod4_z)

Call:
lm(formula = bmi ~ smoke100 * physhealth_z, data = smart_cle1_sh_rescale)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.011  -3.811  -0.559   2.609  38.249 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            27.9982     0.2558 109.471  < 2e-16 ***
smoke100                0.7259     0.3729   1.947 0.051818 .  
physhealth_z            1.9774     0.5463   3.619 0.000308 ***
smoke100:physhealth_z   0.6068     0.7467   0.813 0.416561    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.252 on 1129 degrees of freedom
Multiple R-squared:  0.03782,   Adjusted R-squared:  0.03526 
F-statistic: 14.79 on 3 and 1129 DF,  p-value: 1.886e-09

12.9.2 Interpreting the model on rescaled data

What has changed as compared to the original mod4?

  • Our original model was bmi = 27.5 + 0.57 smoke100 + 0.11 physhealth + 0.03 smoke100 x physhealth
  • Our model on centered physhealth was bmi = 28.0 + 0.73 smoke100 + 0.11 centered physhealth + 0.03 smoke100 x centered physhealth.
  • Our new model on rescaled physhealth is bmi = 28.0 + 0.73 smoke100 + 1.98 rescaled physhealth + 0.61 smoke100 x rescaled physhealth

So our rescaled model is:

  • 28.0 + 1.98 rescaled physhealth_z for non-smokers, and
  • (28.0 + 0.73) + (1.98 + 0.61) rescaled physhealth_z, or 28.73 + 2.59 rescaled physhealth_z for smokers.

In this new rescaled (physhealth_z) model, then,

  • the main effect of smoke100, 0.73, still corresponds to a predictive difference (smoker - non-smoker) in bmi with physhealth at its mean value, 4.68 days,
  • the intercept term is still the predicted bmi for a non-smoker with an average physhealth count, and
  • the residual standard deviation and the R-squared values remain unchanged,

as before, but now we also have that:

  • the coefficient of physhealth_z indicates the predictive difference in bmi associated with a change in physhealth of 2 standard deviations (from one standard deviation below the mean of 4.68 to one standard deviation above 4.68.)
    • Since the standard deviation of physhealth is 9.12 (see below), this 2 standard deviation change spans a massive range of potential physhealth values: one standard deviation below the mean would be negative (so effectively 0 days), and one standard deviation above the mean is 4.68 + 9.12 = 13.8 days.
favstats(~ physhealth, data = smart_cle1_sh)
 min Q1 median Q3 max     mean       sd    n missing
   0  0      0  4  30 4.681377 9.120899 1133       0
  • the coefficient of the product term (0.61) corresponds to the change in the coefficient of physhealth_z for smokers as compared to non-smokers.

12.9.3 Plot of model on rescaled data

ggplot(smart_cle1_sh_rescale, aes(x = physhealth_z, y = bmi, 
                              group = smoke100, col = factor(smoke100))) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm", se = FALSE, formula = y ~ x, linewidth = 1.5) +
    scale_color_discrete(name = "Is subject a smoker?") +
    labs(x = "Poor Physical Health Days, standardized (2 sd)", y = "Body Mass Index",
         title = "Model `mod4_z` on rescaled data") 

There’s very little visible difference here as compared to the plot on centered data: rescaling changes the units on the horizontal axis, but not the fit.

12.10 mod5: What if we add more variables?

We can boost our \(R^2\) a bit, to nearly 5%, by adding in two new variables: an indicator of whether the subject used the internet in the past 30 days (internet30), and the subject’s average number of alcoholic drinks consumed per week (drinks_wk).

mod5 <- lm(bmi ~ smoke100 + female + physhealth + internet30 + drinks_wk,
         data = smart_cle1_sh)
summary(mod5)

Call:
lm(formula = bmi ~ smoke100 + female + physhealth + internet30 + 
    drinks_wk, data = smart_cle1_sh)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.358  -3.846  -0.657   2.534  38.049 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 27.52400    0.56076  49.083  < 2e-16 ***
smoke100     0.82654    0.37739   2.190  0.02872 *  
female      -0.43272    0.38510  -1.124  0.26140    
physhealth   0.12469    0.02074   6.012 2.47e-09 ***
internet30   0.44287    0.48830   0.907  0.36462    
drinks_wk   -0.10193    0.03352  -3.041  0.00241 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.231 on 1127 degrees of freedom
Multiple R-squared:  0.04582,   Adjusted R-squared:  0.04159 
F-statistic: 10.82 on 5 and 1127 DF,  p-value: 3.48e-10
  1. Here’s the ANOVA for this model. What can we study with this?
anova(mod5)
Analysis of Variance Table

Response: bmi
             Df Sum Sq Mean Sq F value    Pr(>F)    
smoke100      1    215  214.75  5.5304  0.018861 *  
female        1      5    4.85  0.1249  0.723878    
physhealth    1   1508 1508.08 38.8372 6.497e-10 ***
internet30    1     15   14.69  0.3783  0.538650    
drinks_wk     1    359  359.05  9.2466  0.002414 ** 
Residuals  1127  43762   38.83                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  2. Consider the revised output below. Now what can we study?
anova(lm(bmi ~ smoke100 + internet30 + drinks_wk + female + physhealth,
         data = smart_cle1_sh))
Analysis of Variance Table

Response: bmi
             Df Sum Sq Mean Sq F value    Pr(>F)    
smoke100      1    215  214.75  5.5304 0.0188606 *  
internet30    1      8    7.81  0.2010 0.6539723    
drinks_wk     1    444  443.79 11.4288 0.0007479 ***
female        1     32   31.58  0.8132 0.3673566    
physhealth    1   1403 1403.49 36.1438 2.472e-09 ***
Residuals  1127  43762   38.83                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  3. What does the output below let us conclude?
anova(lm(bmi ~ smoke100 + internet30 + drinks_wk + female + physhealth, 
         data = smart_cle1_sh),
      lm(bmi ~ smoke100 + female + drinks_wk, 
         data = smart_cle1_sh))
Analysis of Variance Table

Model 1: bmi ~ smoke100 + internet30 + drinks_wk + female + physhealth
Model 2: bmi ~ smoke100 + female + drinks_wk
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1   1127 43762                                  
2   1129 45166 -2   -1403.7 18.075 1.877e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  4. What does it mean for the models to be “nested”?

12.11 mod6: Would adding self-reported health help?

And we can do even a bit better than that by adding in a multi-categorical measure: self-reported general health.

mod6 <- lm(bmi ~ female + smoke100 + physhealth + internet30 + drinks_wk + genhealth,
         data = smart_cle1_sh)
summary(mod6)

Call:
lm(formula = bmi ~ female + smoke100 + physhealth + internet30 + 
    drinks_wk + genhealth, data = smart_cle1_sh)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.216  -3.659  -0.736   2.669  36.810 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         25.20736    0.71106  35.450  < 2e-16 ***
female              -0.31949    0.37667  -0.848   0.3965    
smoke100             0.45866    0.37214   1.232   0.2180    
physhealth           0.04353    0.02506   1.737   0.0827 .  
internet30           0.93270    0.48273   1.932   0.0536 .  
drinks_wk           -0.07712    0.03294  -2.341   0.0194 *  
genhealth2_VeryGood  1.21169    0.56838   2.132   0.0332 *  
genhealth3_Good      3.22783    0.58009   5.564 3.29e-08 ***
genhealth4_Fair      4.14497    0.73284   5.656 1.96e-08 ***
genhealth5_Poor      5.86335    1.09253   5.367 9.73e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.089 on 1123 degrees of freedom
Multiple R-squared:  0.09206,   Adjusted R-squared:  0.08478 
F-statistic: 12.65 on 9 and 1123 DF,  p-value: < 2.2e-16
  1. If Harry and Marty have the same values of female, smoke100, physhealth, internet30 and drinks_wk, but Harry rates his health as Good, and Marty rates his as Fair, then what is the difference in the predictions? Who is predicted to have a larger BMI, and by how much? (A sketch of the calculation follows the plot below.)

  2. What does this normal probability plot of the residuals suggest?

plot(mod6, which = 2)
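For the Harry and Marty question above, all of the other inputs match, so the predicted difference involves only the two genhealth coefficients. One way to compute it directly (a sketch):

coef(mod6)["genhealth4_Fair"] - coef(mod6)["genhealth3_Good"]  # Marty (Fair) minus Harry (Good)

Marty is predicted to have a bmi about 0.92 kg/m2 larger than Harry.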

12.12 Key Regression Assumptions for Building Effective Prediction Models

  1. Validity - the data you are analyzing should map to the research question you are trying to answer.
    • The outcome should accurately reflect the phenomenon of interest.
    • The model should include all relevant predictors. (It can be difficult to decide which predictors are necessary, and what to do with predictors that have large standard errors.)
    • The model should generalize to all of the cases to which it will be applied.
    • Can the available data answer our question reliably?
  2. Additivity and linearity - the most important assumption of a regression model is that its deterministic component is a linear function of the predictors. We often consider transformations of the inputs or the outcome in this setting.
  3. Independence of errors - errors from the prediction line are independent of each other
  4. Equal variance of errors - if this is violated, we can more efficiently estimate parameters using weighted least squares approaches, where each point is weighted inversely proportional to its variance, but this doesn’t affect the coefficients much, if at all.
  5. Normality of errors - not generally important for estimating the regression line

12.12.1 Checking Assumptions in model mod6

  1. How does the assumption of linearity behind this model look?
plot(mod6, which = 1)

We see no strong signs of serious non-linearity here. There’s no obvious curve in the plot, for example. We may have a problem with increasing variance as we move to the right.

  2. What can we conclude from the plot below?
plot(mod6, which = 5)

This plot can help us identify points with large standardized residuals, large leverage values, and large influence on the model (as indicated by large values of Cook’s distance.) In this case, I see no signs of any points used in the model with especially large influence, although there are some poorly fitted points (with especially large standardized residuals.)

We might want to identify the point listed here as 961, which appears to have an enormous standardized residual. To do so, we can use the slice function from dplyr.

smart_cle1_sh |> slice(961) |> select(SEQNO)
# A tibble: 1 × 1
       SEQNO
       <dbl>
1 2017000961

Now we know exactly which subject we’re talking about.

  3. What other residual plots are available with plot and how do we interpret them?
plot(mod6, which = 2)

This plot is simply a Normal Q-Q plot of the standardized residuals from our model. We’re looking here for serious problems with the assumption of Normality.

plot(mod6, which = 3)

This is a scale-location plot, which plots the square root of the absolute standardized residuals against the fitted values. It is designed to help us spot non-constant variance: heteroscedasticity appears as a trend in this plot, rather than as a fan shape.

plot(mod6, which = 4)

Finally, this is an index plot of the Cook’s distance values, allowing us to identify points with particularly large influence on the fitted model. Remember that a Cook’s distance of 0.5 (or perhaps even 1.0) is a reasonable boundary for a substantially influential point.
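If we wanted to list the most influential points by Cook’s distance directly, a sketch:

# rows of smart_cle1_sh with the three largest Cook's distances for mod6
sort(cooks.distance(mod6), decreasing = TRUE) |> head(3)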


  ¹ I’ll now refer to those who have smoked at least 100 cigarettes in their life as smokers, and those who have not as non-smokers, to save some space.