17  Multiple Imputation and Linear Regression

17.1 R Setup Used Here

knitr::opts_chunk$set(comment = NA)

library(janitor)
library(broom)
library(car)
library(knitr)
library(mosaic)
library(mice)
library(rms)
library(naniar)
library(tidyverse) 

theme_set(theme_bw())

17.2 Data Load

In this chapter, we’ll return to the smart_ohio file based on data from BRFSS 2017 that we built and cleaned back in Chapter 6.

smart_ohio <- readRDS("data/smart_ohio.Rds")

17.3 Developing a smart_16 data set

We’re going to look at a selection of variables from this tibble, focusing on subjects who have been told they have diabetes and who also provided a response to our physhealth (Number of Days Physical Health Not Good) variable. That item asks “Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?” We’ll build two models: in this chapter, a linear model for physhealth, and in the next chapter, a logistic regression describing whether or not the subject’s physhealth response was at least 1.

smart_16 <- smart_ohio |>
    filter(dm_status == "Diabetes") |>
    filter(complete.cases(physhealth)) |>
    mutate(bad_phys = ifelse(physhealth > 0, 1, 0),
           comor = hx_mi + hx_chd + hx_stroke + hx_asthma +
               hx_skinc + hx_otherc + hx_copd + hx_arthr) |>
    select(SEQNO, mmsa, physhealth, bad_phys, age_imp, smoke100,
           comor, hx_depress, bmi, activity)

The variables included in this smart_16 tibble are:

  • SEQNO: respondent identification number (all begin with 2016)
  • mmsa: the metropolitan or micropolitan statistical area (MMSA) in which the respondent lives
  • physhealth: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?
  • bad_phys: is physhealth 1 or more? (1 = yes, 0 = no)
  • age_imp: age in years (imputed from age categories)
  • smoke100: have you smoked at least 100 cigarettes in your life? (1 = yes, 0 = no)
  • hx_depress: has a doctor, nurse, or other health professional ever told you that you have a depressive disorder, including depression, major depression, dysthymia, or minor depression? (1 = yes, 0 = no)
  • bmi: body mass index, in kg/m2
  • activity: physical activity level (Highly Active, Active, Insufficiently Active, Inactive)
  • comor: sum of 8 potential groups of comorbidities (see below)

The comor variable is the sum of the following 8 variables, each measured on a 1 = Yes, 0 = No scale. Each item begins with “Has a doctor, nurse, or other health professional ever told you that you had …”

  • hx_mi: a heart attack, also called a myocardial infarction?
  • hx_chd: angina or coronary heart disease?
  • hx_stroke: a stroke?
  • hx_asthma: asthma?
  • hx_skinc: skin cancer?
  • hx_otherc: any other types of cancer?
  • hx_copd: Chronic Obstructive Pulmonary Disease or COPD, emphysema or chronic bronchitis?
  • hx_arthr: some form of arthritis, rheumatoid arthritis, gout, lupus, or fibromyalgia?

smart_16 |> tabyl(comor)
 comor   n     percent valid_percent
     0 224 0.211920530   0.221782178
     1 315 0.298013245   0.311881188
     2 228 0.215704825   0.225742574
     3 130 0.122989593   0.128712871
     4  72 0.068117313   0.071287129
     5  29 0.027436140   0.028712871
     6   9 0.008514664   0.008910891
     7   3 0.002838221   0.002970297
    NA  47 0.044465468            NA

17.3.1 Any missing values?

We have 1057 observations (rows) in the smart_16 data set, of whom 860 have complete data on all variables.

dim(smart_16)
[1] 1057   10
n_case_complete(smart_16)
[1] 860

Which variables are missing?

miss_var_summary(smart_16)
# A tibble: 10 × 3
   variable   n_miss pct_miss
   <chr>       <int>    <num>
 1 activity       85    8.04 
 2 bmi            84    7.95 
 3 comor          47    4.45 
 4 smoke100       24    2.27 
 5 age_imp        12    1.14 
 6 hx_depress      3    0.284
 7 SEQNO           0    0    
 8 mmsa            0    0    
 9 physhealth      0    0    
10 bad_phys        0    0    

Note that our outcomes (physhealth and the derived bad_phys) have no missing values here, by design. We will use multiple imputation to account appropriately for the missing values in our predictors.
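
Before imputing, it can help to look at the missingness patterns directly. Here is a small optional sketch (an addition to the original analysis) using the naniar and mice packages we have already loaded; the rotate.names argument assumes a reasonably recent version of mice.

# bar chart of missing-value counts, by variable (naniar)
gg_miss_var(smart_16)

# matrix view of the joint missingness patterns across variables (mice)
md.pattern(smart_16, rotate.names = TRUE)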

17.4 Obtaining a Simple Imputation with mice

The mice package provides several approaches we can use for imputation in building models of all kinds. Here, we’ll use it just to obtain a single set of imputed results that we can apply to “complete” our data for the purposes of thinking about (a) transforming our outcome and (b) considering the addition of non-linear predictor terms.

# requires library(mice)

set.seed(432)

# create small data set including only variables to
# be used in building the imputation model

sm16 <- smart_16 |> 
    select(physhealth, activity, age_imp, bmi, comor, 
           hx_depress, smoke100)

smart_16_mice1 <- mice(sm16, m = 1)

 iter imp variable
  1   1  activity  age_imp  bmi  comor  hx_depress  smoke100
  2   1  activity  age_imp  bmi  comor  hx_depress  smoke100
  3   1  activity  age_imp  bmi  comor  hx_depress  smoke100
  4   1  activity  age_imp  bmi  comor  hx_depress  smoke100
  5   1  activity  age_imp  bmi  comor  hx_depress  smoke100
smart_16_imp1 <- mice::complete(smart_16_mice1)

n_case_miss(smart_16_imp1)
[1] 0

And now we’ll use this completed smart_16_imp1 data set (the product of just a single imputation) to help us address the next two issues.

17.5 Linear Regression: Considering a Transformation of the Outcome

A plausible strategy here is to identify an outcome transformation only after we have accounted, at least in a simple way, for missing predictor values. So, in this section, we’ll work with the singly imputed data set, smart_16_imp1, that we just created.

Recall that our outcome here, physhealth can take the value 0, and is thus not strictly positive.

favstats(~ physhealth, data = smart_16_imp1)
 min Q1 median Q3 max     mean       sd    n missing
   0  0      2 20  30 9.227058 11.92676 1057       0

So, if we want to investigate a potential transformation with a Box-Cox plot, we’ll have to add a small value to each physhealth value. We’ll add 1, so that the range of potential values is now from 1-31.

smart_16_imp1 <- smart_16_imp1 |>
  mutate(phplus1 = physhealth + 1)

test_model <- lm(phplus1 ~ age_imp + comor + smoke100 + 
                   hx_depress + bmi + activity, data = smart_16_imp1)

boxCox(test_model)

It looks like the logarithm is a reasonable transformation in this setting. So we’ll create a new outcome, the natural logarithm of (physhealth + 1), which we’ll call phys_tr to remind us that a transformation is involved, and that we’ll eventually need to back it out to make predictions on the original scale. We’ll build this new variable in both our original smart_16 data set and in the simply imputed data set we’re using for these early stages.

smart_16_imp1 <- smart_16_imp1 |>
    mutate(phys_tr = log(physhealth + 1))

smart_16 <- smart_16 |>
    mutate(phys_tr = log(physhealth + 1))

So we have phys_tr = log(physhealth + 1), where log refers to the natural (base \(e\)) logarithm.

We can also specify our back-transformation to the original physhealth values from our new phys_tr as physhealth = exp(phys_tr) - 1.
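
As a quick sanity check (an addition to the original text), we can verify that this back-transformation recovers the original physhealth values.

# back out of the transformation; the largest discrepancy should be zero
# (up to floating-point rounding)
smart_16 |>
    mutate(physhealth_back = exp(phys_tr) - 1) |>
    summarise(largest_discrepancy = max(abs(physhealth_back - physhealth)))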

17.6 Linear Regression: Considering Non-Linearity in the Predictors

Consider the following Spearman \(\rho^2\) plot.

plot(spearman2(phys_tr ~ age_imp + comor + smoke100 + 
           hx_depress + bmi + activity, data = smart_16_imp1))

After our single imputation, we have the same N value in all rows of this plot, which is what we want to see. In considering potential non-linear terms, it appears that comor, hx_depress, and perhaps activity are worthy of increased attention. We’ll make a couple of arbitrary choices: we’ll add a raw cubic polynomial to represent the comor information, and we’ll add an interaction term between hx_depress and activity.

17.7 “Main Effects” Linear Regression with lm on the Complete Cases

Recall that we have 860 complete cases in our smart_16 data, out of 1057 observations in total. A model using only the complete cases should thus drop the remaining 197 subjects. Let’s see if a main effects only model for our newly transformed phys_tr outcome does in fact do this.

m_1cc <- 
    lm(phys_tr ~ age_imp + comor + smoke100 + 
           hx_depress + bmi + activity, data = smart_16)

summary(m_1cc)

Call:
lm(formula = phys_tr ~ age_imp + comor + smoke100 + hx_depress + 
    bmi + activity, data = smart_16)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0801 -1.0389 -0.2918  1.1029  2.8478 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    0.581959   0.370847   1.569  0.11696    
age_imp                       -0.007043   0.003813  -1.847  0.06511 .  
comor                          0.301773   0.033105   9.116  < 2e-16 ***
smoke100                       0.099038   0.090280   1.097  0.27295    
hx_depress                     0.471949   0.104232   4.528 6.81e-06 ***
bmi                            0.016375   0.006295   2.601  0.00945 ** 
activityActive                -0.229927   0.154912  -1.484  0.13812    
activityInsufficiently_Active -0.116998   0.139440  -0.839  0.40168    
activityInactive               0.256118   0.115266   2.222  0.02655 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.303 on 851 degrees of freedom
  (197 observations deleted due to missingness)
Multiple R-squared:  0.1806,    Adjusted R-squared:  0.1729 
F-statistic: 23.45 on 8 and 851 DF,  p-value: < 2.2e-16

Note that the appropriate number of observations (197) is listed as “deleted due to missingness.”
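
As a quick check (again, an addition to the original text), we can confirm the sample size the model actually used.

# rows used by the complete-case fit, and rows dropped
nobs(m_1cc)                    # should be 860
nrow(smart_16) - nobs(m_1cc)   # should be 197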

17.7.1 Quality of Fit Statistics

glance(m_1cc) |>
    select(r.squared, adj.r.squared, sigma, AIC, BIC) |>
    kable(digits = c(3, 3, 2, 1, 1))
r.squared adj.r.squared sigma AIC BIC
0.181 0.173 1.3 2906.3 2953.8

17.7.2 Interpreting Effect Sizes

tidy(m_1cc, conf.int = TRUE) |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    kable(digits = 3)
term estimate std.error conf.low conf.high
(Intercept) 0.582 0.371 -0.146 1.310
age_imp -0.007 0.004 -0.015 0.000
comor 0.302 0.033 0.237 0.367
smoke100 0.099 0.090 -0.078 0.276
hx_depress 0.472 0.104 0.267 0.677
bmi 0.016 0.006 0.004 0.029
activityActive -0.230 0.155 -0.534 0.074
activityInsufficiently_Active -0.117 0.139 -0.391 0.157
activityInactive 0.256 0.115 0.030 0.482

We’ll interpret three of the predictors here to demonstrate ideas: comor, hx_depress and activity.

  • If we have two subjects with the same values of age_imp, smoke100, hx_depress, bmi, and activity, but Harry has a comor score that is one point higher than Sally’s, then the model predicts that Harry’s transformed outcome (specifically the natural logarithm of (his physhealth days + 1)) will be 0.302 higher than Sally’s, with a 95% confidence interval around that estimate ranging from (0.237, 0.367).
  • If we have two subjects with the same values of age_imp, comor, smoke100, bmi, and activity, but Harry has a history of depression (hx_depress = 1) while Sally does not have such a history (so Sally’s hx_depress = 0), then the model predicts that Harry’s transformed outcome (specifically the natural logarithm of (his physhealth days + 1)) will be 0.472 higher than Sally’s, with a 95% confidence interval around that estimate ranging from (0.267, 0.677).
  • The activity variable has four categories as indicated in the table below. The model uses the “Highly_Active” category as the reference group.

smart_16_imp1 |> tabyl(activity)
              activity   n   percent
         Highly_Active 252 0.2384106
                Active 135 0.1277200
 Insufficiently_Active 193 0.1825922
              Inactive 477 0.4512772
  • From the tidied set of coefficients, we can describe the activity effects as follows.
    • If Sally is “Highly Active” and Harry is “Active” but they otherwise have the same values of all predictors, then our prediction is that Harry’s transformed outcome (specifically the natural logarithm of (his physhealth days + 1)) will be 0.230 lower than Sally’s, with a 95% confidence interval around that estimate ranging from (0.534 lower than Sally’s to 0.074 higher than Sally’s).
    • If instead Harry is “Insufficiently Active” but nothing else changes, then our prediction is that Harry’s transformed outcome will be 0.117 lower than Sally’s, with a 95% confidence interval around that estimate ranging from (0.391 lower to 0.157 higher than Sally’s.)
    • If instead Harry is “Inactive” but nothing else changes, then our prediction is that Harry’s transformed outcome will be 0.256 higher than Sally’s, with a 95% confidence interval around that estimate ranging from (0.030 to 0.482 higher than Sally’s.)
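
Since the model is fit on the log(physhealth + 1) scale, each coefficient can also be read, if we prefer, as a multiplicative effect on (physhealth + 1). This is an optional extra, not something we rely on in the rest of the chapter.

# e.g., each additional comorbidity multiplies (physhealth + 1) by about
# exp(0.302), or roughly 1.35, holding the other predictors constant
exp(coef(m_1cc)["comor"])
exp(confint(m_1cc)["comor", ])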

17.7.3 Making Predictions with the Model

Let’s describe two subjects, and use this model (and the ones that follow) to predict their physhealth values.

  • Sheena is age 50, has 2 comorbidities, has smoked 100 cigarettes in her life, has no history of depression, a BMI of 25, and is Highly Active.
  • Jacob is age 65, has 4 comorbidities, has never smoked, has a history of depression, a BMI of 32 and is Inactive.

We’ll first build predictions for Sheena and Jacob (with 95% prediction intervals) for phys_tr.

new2 <- tibble(
    name = c("Sheena", "Jacob"),
    age_imp = c(50, 65),
    comor = c(2, 4),
    smoke100 = c(1, 0),
    hx_depress = c(0, 1),
    bmi = c(25, 32),
    activity = c("Highly_Active", "Inactive")
)

preds_m_1cc <- predict(m_1cc, newdata = new2, 
                       interval = "prediction")

preds_m_1cc
       fit      lwr      upr
1 1.341778 -1.22937 3.912925
2 2.583336  0.01399 5.152681

The model makes predictions for our transformed outcome, phys_tr. Now, we need to back-transform the predictions and the confidence intervals to build predictions for physhealth.

preds_m_1cc <- preds_m_1cc |>
    as_tibble() |>
    mutate(names = c("Sheena", "Jacob"),
           pred_physhealth = exp(fit) - 1,
           conf_low = exp(lwr) - 1,
           conf_high = exp(upr) - 1) |>
    select(names, pred_physhealth, conf_low, conf_high, 
           everything())

preds_m_1cc |> kable(digits = 3)
names pred_physhealth conf_low conf_high fit lwr upr
Sheena 2.826 -0.708 49.045 1.342 -1.229 3.913
Jacob 12.241 0.014 171.894 2.583 0.014 5.153

17.8 “Augmented” Linear Regression with lm on the Complete Cases

Now, we’ll add the non-linear terms we discussed earlier. We’ll add a (raw) cubic polynomial to represent the comor information, and we’ll add an interaction term between hx_depress and activity.

m_2cc <- 
    lm(phys_tr ~ age_imp + pol(comor, 3) + smoke100 + 
           bmi + hx_depress*activity, data = smart_16)

summary(m_2cc)

Call:
lm(formula = phys_tr ~ age_imp + pol(comor, 3) + smoke100 + bmi + 
    hx_depress * activity, data = smart_16)

Residuals:
   Min     1Q Median     3Q    Max 
-2.907 -1.063 -0.267  1.143  2.924 

Coefficients:
                                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                                0.514823   0.376203   1.368  0.17153    
age_imp                                   -0.008100   0.003865  -2.096  0.03640 *  
pol(comor, 3)comor                         0.634274   0.160630   3.949 8.51e-05 ***
pol(comor, 3)comor^2                      -0.130626   0.073525  -1.777  0.07599 .  
pol(comor, 3)comor^3                       0.012508   0.008977   1.393  0.16386    
smoke100                                   0.089345   0.090336   0.989  0.32294    
bmi                                        0.015203   0.006315   2.408  0.01627 *  
hx_depress                                 0.647054   0.229696   2.817  0.00496 ** 
activityActive                            -0.202196   0.172300  -1.174  0.24092    
activityInsufficiently_Active             -0.005815   0.166221  -0.035  0.97210    
activityInactive                           0.290380   0.132198   2.197  0.02832 *  
hx_depress:activityActive                 -0.124836   0.395415  -0.316  0.75230    
hx_depress:activityInsufficiently_Active  -0.376355   0.310160  -1.213  0.22531    
hx_depress:activityInactive               -0.172952   0.267427  -0.647  0.51798    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.301 on 846 degrees of freedom
  (197 observations deleted due to missingness)
Multiple R-squared:  0.187, Adjusted R-squared:  0.1745 
F-statistic: 14.97 on 13 and 846 DF,  p-value: < 2.2e-16

Note again that the appropriate number of observations (197) is listed as “deleted due to missingness.”

17.8.1 Quality of Fit Statistics

glance(m_2cc) |>
    select(r.squared, adj.r.squared, sigma, AIC, BIC) |>
    kable(digits = c(3, 3, 2, 1, 1))
r.squared adj.r.squared sigma AIC BIC
0.187 0.175 1.3 2909.5 2980.9

17.8.2 ANOVA assessing the impact of the non-linear terms

anova(m_1cc, m_2cc)
Analysis of Variance Table

Model 1: phys_tr ~ age_imp + comor + smoke100 + hx_depress + bmi + activity
Model 2: phys_tr ~ age_imp + pol(comor, 3) + smoke100 + bmi + hx_depress * 
    activity
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    851 1444.0                           
2    846 1432.8  5    11.265 1.3303  0.249

The difference between the models doesn’t meet the standard for statistical detectability at our usual \(\alpha\) levels.
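
We reach the same conclusion if we compare the two fits with AIC and BIC directly (a small addition here; these values match the glance() summaries above, and both criteria favor the main effects model).

AIC(m_1cc, m_2cc)
BIC(m_1cc, m_2cc)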

17.8.3 Interpreting Effect Sizes

tidy(m_2cc, conf.int = TRUE) |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    kable(digits = 3)
term estimate std.error conf.low conf.high
(Intercept) 0.515 0.376 -0.224 1.253
age_imp -0.008 0.004 -0.016 -0.001
pol(comor, 3)comor 0.634 0.161 0.319 0.950
pol(comor, 3)comor^2 -0.131 0.074 -0.275 0.014
pol(comor, 3)comor^3 0.013 0.009 -0.005 0.030
smoke100 0.089 0.090 -0.088 0.267
bmi 0.015 0.006 0.003 0.028
hx_depress 0.647 0.230 0.196 1.098
activityActive -0.202 0.172 -0.540 0.136
activityInsufficiently_Active -0.006 0.166 -0.332 0.320
activityInactive 0.290 0.132 0.031 0.550
hx_depress:activityActive -0.125 0.395 -0.901 0.651
hx_depress:activityInsufficiently_Active -0.376 0.310 -0.985 0.232
hx_depress:activityInactive -0.173 0.267 -0.698 0.352

Let’s focus first on interpreting the interaction terms between hx_depress and activity.

Assume first that we have a set of subjects with the same values of age_imp, smoke100, bmi, and comor.

  • Arnold has hx_depress = 1 and is Inactive
  • Betty has hx_depress = 1 and is Insufficiently Active
  • Carlos has hx_depress = 1 and is Active
  • Debbie has hx_depress = 1 and is Highly Active
  • Eamon has hx_depress = 0 and is Inactive
  • Florence has hx_depress = 0 and is Insufficiently Active
  • Garry has hx_depress = 0 and is Active
  • Harry has hx_depress = 0 and is Highly Active

So the model can essentially be used to compare each of the first seven people on that list to Harry (who has the reference levels of both hx_depress and activity). Let’s compare Arnold to Harry.

For instance, as compared to Harry, Arnold is expected to have a transformed outcome (specifically the natural logarithm of (his physhealth days + 1)) that is:

  • 0.647 higher because Arnold’s hx_depress = 1, and
  • 0.29 higher still because Arnold’s activity is “Inactive”, and
  • 0.173 lower because of the combination (see the hx_depress:activityInactive row)

So, in total, we expect Arnold’s transformed outcome to be 0.647 + 0.29 + (-0.173), or 0.764 higher than Harry’s.

If we want to compare Arnold to, for instance, Betty, we first calculate Betty’s difference from Harry, and then compare the two differences.

As compared to Harry, Betty is expected to have a transformed outcome (specifically the natural logarithm of (her physhealth days + 1)) that is:

  • 0.647 higher because Betty’s hx_depress = 1, and
  • 0.006 lower still because Betty’s activity is “Insufficiently Active”, and
  • 0.376 lower because of the combination (see the hx_depress:activityInsufficiently_Active row)

So, in total, we expect Betty’s transformed outcome to be 0.647 + (-0.006) + (-0.376), or 0.265 higher than Harry’s.

And thus we can compare Betty and Arnold directly.

  • Arnold is predicted to have an outcome that is 0.764 higher than Harry’s.
  • Betty is predicted to have an outcome that is 0.265 higher than Harry’s.
  • And so Arnold’s predicted outcome (phys_tr) is 0.499 larger than Betty’s.
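
If we’d rather not do that arithmetic by hand, we can pull the pieces straight from the fitted coefficients. This sketch is an addition for illustration; it simply reproduces the sums above.

b <- coef(m_2cc)

# Arnold vs. Harry: depression main effect + Inactive main effect + their interaction
arnold_vs_harry <- b["hx_depress"] + b["activityInactive"] +
    b["hx_depress:activityInactive"]

# Betty vs. Harry: depression main effect + Insufficiently Active main effect + interaction
betty_vs_harry <- b["hx_depress"] + b["activityInsufficiently_Active"] +
    b["hx_depress:activityInsufficiently_Active"]

arnold_vs_harry                   # about 0.764
betty_vs_harry                    # about 0.265
arnold_vs_harry - betty_vs_harry  # about 0.499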

Now, suppose we want to look at our cubic polynomial in comor.

  • Suppose Harry and Sally have the same values for all other predictors in the model, but Harry has 1 comorbidity where Sally has none. Then the three terms in the model related to comor will be 1 for Harry and 0 for Sally, and the interpretation becomes pretty straightforward.
  • But suppose instead that nothing has changed except Harry has 2 comorbidities and Sally has just 1. The size of this Harry - Sally difference is no longer the same as in the 1 vs. 0 comparison, because the comor variable enters the model in a non-linear way. This is an area where fitting the model using ols can be helpful, because of the ability to generate plots (of effects, nomograms, etc.) that show this non-linearity clearly.

Suppose, for instance, that Harry and Sally share the following values for the other predictors: each is age 40, has never smoked, has no history of depression, has a BMI of 30, and is Highly Active.

  • Now, if Harry has 1 comorbidity and Sally has none, the predicted phys_tr values for Harry and Sally are as indicated below.

hands1 <- tibble(
    name = c("Harry", "Sally"),
    age_imp = c(40, 40),
    comor = c(1, 0),
    smoke100 = c(0, 0),
    hx_depress = c(0, 0),
    bmi = c(30, 30),
    activity = c("Highly_Active", "Highly_Active")
)

predict(m_2cc, newdata = hands1)
        1         2 
1.1630840 0.6469282 

But if Harry has 2 comorbidities and Sally 1, the predictions are:

hands2 <- tibble(
    name = c("Harry", "Sally"),
    age_imp = c(40, 40),
    comor = c(2, 1), # only thing that changes
    smoke100 = c(0, 0),
    hx_depress = c(0, 0),
    bmi = c(30, 30),
    activity = c("Highly_Active", "Highly_Active")
)

predict(m_2cc, newdata = hands2)
       1        2 
1.493035 1.163084 

Note that the difference in predictions between Harry and Sally is much smaller now than it was previously.
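
To see exactly what the cubic is doing, we can compute the comor contribution to the linear predictor directly from the fitted coefficients (again, a sketch added for illustration).

b <- coef(m_2cc)

# contribution of the comor cubic to the linear predictor at a given comorbidity count
comor_part <- function(x) {
    b["pol(comor, 3)comor"] * x +
        b["pol(comor, 3)comor^2"] * x^2 +
        b["pol(comor, 3)comor^3"] * x^3
}

comor_part(1) - comor_part(0)  # about 0.516, matching the first Harry - Sally gap
comor_part(2) - comor_part(1)  # about 0.330, matching the second, smaller gap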

17.8.4 Making Predictions with the Model

As before, we’ll use the new model to predict physhealth values for Sheena and Jacob.

  • Sheena is age 50, has 2 comorbidities, has smoked 100 cigarettes in her life, has no history of depression, a BMI of 25, and is Highly Active.
  • Jacob is age 65, has 4 comorbidities, has never smoked, has a history of depression, a BMI of 32 and is Inactive.

We’ll first build predictions for Sheena and Jacob (with 95% prediction intervals) for phys_tr.

new2 <- tibble(
    name = c("Sheena", "Jacob"),
    age_imp = c(50, 65),
    comor = c(2, 4),
    smoke100 = c(1, 0),
    hx_depress = c(0, 1),
    bmi = c(25, 32),
    activity = c("Highly_Active", "Inactive")
)

preds_m_2cc <- predict(m_2cc, newdata = new2, 
                       interval = "prediction")

preds_m_2cc
       fit         lwr      upr
1 1.425362 -1.14707613 3.997801
2 2.486907 -0.08635658 5.060171

Now, we need to back-transform the predictions and the confidence intervals that describe phys_tr to build predictions for physhealth.

preds_m_2cc <- preds_m_2cc |>
    as_tibble() |>
    mutate(names = c("Sheena", "Jacob"),
           pred_physhealth = exp(fit) - 1,
           conf_low = exp(lwr) - 1,
           conf_high = exp(upr) - 1) |>
    select(names, pred_physhealth, conf_low, conf_high, 
           everything())

preds_m_2cc |> kable(digits = 3)
names pred_physhealth conf_low conf_high fit lwr upr
Sheena 3.159 -0.682 53.478 1.425 -1.147 3.998
Jacob 11.024 -0.083 156.617 2.487 -0.086 5.060

17.9 Using mice to perform Multiple Imputation

Let’s focus on the main effects model, and look at the impact of performing multiple imputation to account for the missing data. Recall that in our smart_16 data, the variable with the most missingness is activity, and even that is missing less than 10% of the time. So we’ll try a set of 10 imputations, using the default settings in the mice package.

# requires library(mice)

set.seed(432)

# create small data set including only variables to
# be used in building the imputation model

sm16 <- smart_16 |> 
    select(physhealth, phys_tr, activity, age_imp, bmi, comor, 
           hx_depress, smoke100)

smart_16_mice10 <- mice(sm16, m = 10)

 iter imp variable
  1   1  activity  age_imp  bmi  comor  hx_depress  smoke100
  1   2  activity  age_imp  bmi  comor  hx_depress  smoke100
  1   3  activity  age_imp  bmi  comor  hx_depress  smoke100
  1   4  activity  age_imp  bmi  comor  hx_depress  smoke100
  1   5  activity  age_imp  bmi  comor  hx_depress  smoke100
  1   6  activity  age_imp  bmi  comor  hx_depress  smoke100
  1   7  activity  age_imp  bmi  comor  hx_depress  smoke100
  1   8  activity  age_imp  bmi  comor  hx_depress  smoke100
  1   9  activity  age_imp  bmi  comor  hx_depress  smoke100
  1   10  activity  age_imp  bmi  comor  hx_depress  smoke100
  2   1  activity  age_imp  bmi  comor  hx_depress  smoke100
  2   2  activity  age_imp  bmi  comor  hx_depress  smoke100
  2   3  activity  age_imp  bmi  comor  hx_depress  smoke100
  2   4  activity  age_imp  bmi  comor  hx_depress  smoke100
  2   5  activity  age_imp  bmi  comor  hx_depress  smoke100
  2   6  activity  age_imp  bmi  comor  hx_depress  smoke100
  2   7  activity  age_imp  bmi  comor  hx_depress  smoke100
  2   8  activity  age_imp  bmi  comor  hx_depress  smoke100
  2   9  activity  age_imp  bmi  comor  hx_depress  smoke100
  2   10  activity  age_imp  bmi  comor  hx_depress  smoke100
  3   1  activity  age_imp  bmi  comor  hx_depress  smoke100
  3   2  activity  age_imp  bmi  comor  hx_depress  smoke100
  3   3  activity  age_imp  bmi  comor  hx_depress  smoke100
  3   4  activity  age_imp  bmi  comor  hx_depress  smoke100
  3   5  activity  age_imp  bmi  comor  hx_depress  smoke100
  3   6  activity  age_imp  bmi  comor  hx_depress  smoke100
  3   7  activity  age_imp  bmi  comor  hx_depress  smoke100
  3   8  activity  age_imp  bmi  comor  hx_depress  smoke100
  3   9  activity  age_imp  bmi  comor  hx_depress  smoke100
  3   10  activity  age_imp  bmi  comor  hx_depress  smoke100
  4   1  activity  age_imp  bmi  comor  hx_depress  smoke100
  4   2  activity  age_imp  bmi  comor  hx_depress  smoke100
  4   3  activity  age_imp  bmi  comor  hx_depress  smoke100
  4   4  activity  age_imp  bmi  comor  hx_depress  smoke100
  4   5  activity  age_imp  bmi  comor  hx_depress  smoke100
  4   6  activity  age_imp  bmi  comor  hx_depress  smoke100
  4   7  activity  age_imp  bmi  comor  hx_depress  smoke100
  4   8  activity  age_imp  bmi  comor  hx_depress  smoke100
  4   9  activity  age_imp  bmi  comor  hx_depress  smoke100
  4   10  activity  age_imp  bmi  comor  hx_depress  smoke100
  5   1  activity  age_imp  bmi  comor  hx_depress  smoke100
  5   2  activity  age_imp  bmi  comor  hx_depress  smoke100
  5   3  activity  age_imp  bmi  comor  hx_depress  smoke100
  5   4  activity  age_imp  bmi  comor  hx_depress  smoke100
  5   5  activity  age_imp  bmi  comor  hx_depress  smoke100
  5   6  activity  age_imp  bmi  comor  hx_depress  smoke100
  5   7  activity  age_imp  bmi  comor  hx_depress  smoke100
  5   8  activity  age_imp  bmi  comor  hx_depress  smoke100
  5   9  activity  age_imp  bmi  comor  hx_depress  smoke100
  5   10  activity  age_imp  bmi  comor  hx_depress  smoke100
summary(smart_16_mice10)
Class: mids
Number of multiple imputations:  10 
Imputation methods:
physhealth    phys_tr   activity    age_imp        bmi      comor hx_depress 
        ""         ""  "polyreg"      "pmm"      "pmm"      "pmm"      "pmm" 
  smoke100 
     "pmm" 
PredictorMatrix:
           physhealth phys_tr activity age_imp bmi comor hx_depress smoke100
physhealth          0       1        1       1   1     1          1        1
phys_tr             1       0        1       1   1     1          1        1
activity            1       1        0       1   1     1          1        1
age_imp             1       1        1       0   1     1          1        1
bmi                 1       1        1       1   0     1          1        1
comor               1       1        1       1   1     0          1        1

17.10 Running the Linear Regression in lm with Multiple Imputation

Next, we’ll run the linear model (main effects) on each of the 10 imputed data sets.

m10_mods <- 
    with(smart_16_mice10, lm(phys_tr ~ age_imp + comor + 
                                 smoke100 + hx_depress + 
                                 bmi + activity))

summary(m10_mods)
# A tibble: 90 × 6
   term                          estimate std.error statistic  p.value  nobs
   <chr>                            <dbl>     <dbl>     <dbl>    <dbl> <int>
 1 (Intercept)                    0.317     0.326       0.971 3.32e- 1  1057
 2 age_imp                       -0.00489   0.00334    -1.47  1.43e- 1  1057
 3 comor                          0.313     0.0295     10.6   4.72e-25  1057
 4 smoke100                       0.135     0.0799      1.69  9.22e- 2  1057
 5 hx_depress                     0.500     0.0929      5.38  9.14e- 8  1057
 6 bmi                            0.0187    0.00564     3.31  9.64e- 4  1057
 7 activityActive                -0.202     0.138      -1.46  1.44e- 1  1057
 8 activityInsufficiently_Active -0.0695    0.124      -0.561 5.75e- 1  1057
 9 activityInactive               0.262     0.103       2.54  1.11e- 2  1057
10 (Intercept)                    0.363     0.332       1.10  2.74e- 1  1057
# ℹ 80 more rows

Then, we’ll pool results across the 10 imputations.

m10_pool <- pool(m10_mods)
summary(m10_pool, conf.int = TRUE) |>
    select(-statistic, -df) |>
    kable(digits = 3)
term estimate std.error p.value 2.5 % 97.5 %
(Intercept) 0.444 0.342 0.194 -0.227 1.114
age_imp -0.005 0.003 0.128 -0.012 0.002
comor 0.309 0.031 0.000 0.249 0.369
smoke100 0.114 0.083 0.171 -0.049 0.278
hx_depress 0.512 0.094 0.000 0.327 0.696
bmi 0.016 0.006 0.009 0.004 0.027
activityActive -0.204 0.140 0.146 -0.479 0.071
activityInsufficiently_Active -0.044 0.129 0.735 -0.298 0.210
activityInactive 0.260 0.106 0.014 0.052 0.469

And we can compare these results to the complete case analysis we completed earlier.

tidy(m_1cc, conf.int = TRUE) |>
    select(term, estimate, std.error, p.value, conf.low, conf.high) |>
    kable(digits = 3)
term estimate std.error p.value conf.low conf.high
(Intercept) 0.582 0.371 0.117 -0.146 1.310
age_imp -0.007 0.004 0.065 -0.015 0.000
comor 0.302 0.033 0.000 0.237 0.367
smoke100 0.099 0.090 0.273 -0.078 0.276
hx_depress 0.472 0.104 0.000 0.267 0.677
bmi 0.016 0.006 0.009 0.004 0.029
activityActive -0.230 0.155 0.138 -0.534 0.074
activityInsufficiently_Active -0.117 0.139 0.402 -0.391 0.157
activityInactive 0.256 0.115 0.027 0.030 0.482

Note that there are some sizeable differences here, although nothing enormous.
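
If it helps, we can also line the two sets of estimates up in a single table. Here is a minimal sketch (an addition, which assumes the pooled summary has term and estimate columns, as shown above).

bind_rows(
    summary(m10_pool, conf.int = TRUE) |>
        as_tibble() |>
        mutate(term = as.character(term), approach = "multiple imputation") |>
        select(term, approach, estimate),
    tidy(m_1cc) |>
        mutate(approach = "complete cases") |>
        select(term, approach, estimate)
) |>
    pivot_wider(names_from = approach, values_from = estimate) |>
    kable(digits = 3)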

If we want the pooled \(R^2\) or pooled adjusted \(R^2\) after imputation, R will provide it (and a 95% confidence interval around the estimate) with …

pool.r.squared(m10_mods)
          est     lo 95     hi 95        fmi
R^2 0.1912561 0.1482819 0.2369623 0.08061427
pool.r.squared(m10_mods, adjusted = TRUE)
              est     lo 95     hi 95        fmi
adj R^2 0.1850807 0.1425132 0.2305277 0.08312639

We can see the fraction of missing information about each coefficient due to non-response (fmi) and other details with the following code…

m10_pool
Class: mipo    m = 10 
                           term  m    estimate         ubar            b
1                   (Intercept) 10  0.44377168 1.078194e-01 8.002900e-03
2                       age_imp 10 -0.00522810 1.123668e-05 4.824290e-07
3                         comor 10  0.30871888 8.801039e-04 5.466082e-05
4                      smoke100 10  0.11415718 6.474388e-03 4.130673e-04
5                    hx_depress 10  0.51155722 8.669413e-03 1.582684e-04
6                           bmi 10  0.01576150 3.182381e-05 3.425191e-06
7                activityActive 10 -0.20412627 1.914250e-02 4.851040e-04
8 activityInsufficiently_Active 10 -0.04383739 1.565925e-02 9.855183e-04
9              activityInactive 10  0.26046070 1.069627e-02 5.023887e-04
             t dfcom       df        riv     lambda        fmi
1 1.166226e-01  1048 599.8172 0.08164758 0.07548446 0.07855177
2 1.176735e-05  1048 814.9042 0.04722675 0.04509697 0.04743197
3 9.402308e-04  1048 677.6361 0.06831796 0.06394909 0.06669961
4 6.928762e-03  1048 666.2487 0.07018023 0.06557795 0.06837041
5 8.843508e-03  1048 982.0512 0.02008154 0.01968621 0.02167660
6 3.559152e-05  1048 432.0874 0.11839279 0.10585976 0.10996992
7 1.967611e-02  1048 939.5064 0.02787591 0.02711992 0.02918437
8 1.674332e-02  1048 672.0473 0.06922874 0.06474643 0.06751736
9 1.124890e-02  1048 785.1905 0.05166545 0.04912727 0.05154007

17.11 Fit the Multiple Imputation Model with aregImpute

Here, we’ll use aregImpute to deal with missing values through multiple imputation, and use the ols function in the rms package to fit the model.

The first step is to fit the multiple imputation model. We’ll use n.impute = 10 imputations, with B = 10 bootstrap samples for the predictive mean matching, and we’ll fit both linear models and models with restricted cubic splines with 3 knots (nk = c(0, 3)), allowing the target variable to have a non-linear transformation when nk is 3, via tlinear = FALSE.

set.seed(43201602)
dd <- datadist(smart_16)
options(datadist = "dd")

fit16_imp <- 
    aregImpute(~ phys_tr + age_imp + comor + smoke100 + 
                   hx_depress + bmi + activity,
               nk = c(0, 3), tlinear = FALSE, 
               data = smart_16, B = 10, n.impute = 10)
Iteration 1 
Iteration 2 
Iteration 3 
Iteration 4 
Iteration 5 
Iteration 6 
Iteration 7 
Iteration 8 
Iteration 9 
Iteration 10 
Iteration 11 
Iteration 12 
Iteration 13 

Here are the results of that imputation model.

fit16_imp

Multiple Imputation using Bootstrap and PMM

aregImpute(formula = ~phys_tr + age_imp + comor + smoke100 + 
    hx_depress + bmi + activity, data = smart_16, n.impute = 10, 
    nk = c(0, 3), tlinear = FALSE, B = 10)

n: 1057     p: 7    Imputations: 10     nk: 0 

Number of NAs:
   phys_tr    age_imp      comor   smoke100 hx_depress        bmi   activity 
         0         12         47         24          3         84         85 

           type d.f.
phys_tr       s    1
age_imp       s    1
comor         s    1
smoke100      l    1
hx_depress    l    1
bmi           s    1
activity      c    3

R-squares for Predicting Non-Missing Values for Each Variable
Using Last Imputations of Predictors
   age_imp      comor   smoke100 hx_depress        bmi   activity 
     0.224      0.206      0.059      0.167      0.169      0.057 

Resampling results for determining the complexity of imputation models

Variable being imputed: age_imp 
                                          nk=0   nk=3
Bootstrap bias-corrected R^2             0.186  0.215
10-fold cross-validated  R^2             0.211  0.215
Bootstrap bias-corrected mean   |error|  9.108 10.894
10-fold cross-validated  mean   |error| 65.169 10.919
Bootstrap bias-corrected median |error|  7.290  8.784
10-fold cross-validated  median |error| 66.006  8.613

Variable being imputed: comor 
                                         nk=0  nk=3
Bootstrap bias-corrected R^2            0.183 0.182
10-fold cross-validated  R^2            0.184 0.193
Bootstrap bias-corrected mean   |error| 0.987 1.184
10-fold cross-validated  mean   |error| 1.759 1.171
Bootstrap bias-corrected median |error| 0.828 0.910
10-fold cross-validated  median |error| 1.574 0.892

Variable being imputed: smoke100 
                                          nk=0   nk=3
Bootstrap bias-corrected R^2            0.0224 0.0187
10-fold cross-validated  R^2            0.0358 0.0217
Bootstrap bias-corrected mean   |error| 0.4853 0.4866
10-fold cross-validated  mean   |error| 0.9462 0.9561
Bootstrap bias-corrected median |error| 0.4788 0.4772
10-fold cross-validated  median |error| 0.8479 0.8706

Variable being imputed: hx_depress 
                                         nk=0  nk=3
Bootstrap bias-corrected R^2            0.157 0.138
10-fold cross-validated  R^2            0.147 0.148
Bootstrap bias-corrected mean   |error| 0.355 0.360
10-fold cross-validated  mean   |error| 0.801 0.783
Bootstrap bias-corrected median |error| 0.333 0.337
10-fold cross-validated  median |error| 0.711 0.673

Variable being imputed: bmi 
                                          nk=0  nk=3
Bootstrap bias-corrected R^2             0.125 0.122
10-fold cross-validated  R^2             0.134 0.133
Bootstrap bias-corrected mean   |error|  5.221 6.822
10-fold cross-validated  mean   |error| 32.458 6.884
Bootstrap bias-corrected median |error|  4.178 5.782
10-fold cross-validated  median |error| 31.330 5.932

Variable being imputed: activity 
                                          nk=0   nk=3
Bootstrap bias-corrected R^2            0.0312 0.0275
10-fold cross-validated  R^2            0.0450 0.0402
Bootstrap bias-corrected mean   |error| 1.8774 1.8884
10-fold cross-validated  mean   |error| 1.1121 1.1043
Bootstrap bias-corrected median |error| 2.0000 2.0000
10-fold cross-validated  median |error| 1.0000 1.0000
par(mfrow = c(3,2))
plot(fit16_imp)

par(mfrow = c(1,1))

The plot helps us see where the imputations are happening.

17.12 Fit Linear Regression using ols and fit.mult.impute

m16_imp <- 
    fit.mult.impute(phys_tr ~ age_imp + comor + smoke100 +
                        hx_depress + bmi + activity,
                    fitter = ols, xtrans = fit16_imp,
                    data = smart_16, fitargs=list(x=TRUE,y=TRUE))

Wald Statistic Information

Variance Inflation Factors Due to Imputation:

                     Intercept                        age_imp 
                          1.03                           1.01 
                         comor                       smoke100 
                          1.03                           1.06 
                    hx_depress                            bmi 
                          1.02                           1.06 
               activity=Active activity=Insufficiently_Active 
                          1.19                           1.14 
             activity=Inactive 
                          1.23 

Fraction of Missing Information:

                     Intercept                        age_imp 
                          0.03                           0.01 
                         comor                       smoke100 
                          0.03                           0.06 
                    hx_depress                            bmi 
                          0.02                           0.06 
               activity=Active activity=Insufficiently_Active 
                          0.16                           0.13 
             activity=Inactive 
                          0.19 

d.f. for t-distribution for Tests of Single Coefficients:

                     Intercept                        age_imp 
                       8176.67                       45410.80 
                         comor                       smoke100 
                      13030.27                        2670.64 
                    hx_depress                            bmi 
                      28199.30                        2935.89 
               activity=Active activity=Insufficiently_Active 
                        354.62                         571.56 
             activity=Inactive 
                        258.42 

The following fit components were averaged over the 10 model fits:

  fitted.values stats linear.predictors 

17.12.1 Summaries and Coefficients

Here are the results:

m16_imp
Linear Regression Model

fit.mult.impute(formula = phys_tr ~ age_imp + comor + smoke100 + 
    hx_depress + bmi + activity, fitter = ols, xtrans = fit16_imp, 
    data = smart_16, fitargs = list(x = TRUE, y = TRUE))

                Model Likelihood    Discrimination    
                      Ratio Test           Indexes    
Obs    1057    LR chi2    219.94    R2       0.188    
sigma1.2881    d.f.            8    R2 adj   0.182    
d.f.   1048    Pr(> chi2) 0.0000    g        0.687    

Residuals

    Min      1Q  Median      3Q     Max 
-3.0621 -1.0327 -0.2878  1.1104  2.8018 

                               Coef    S.E.   t     Pr(>|t|)
Intercept                       0.4052 0.3352  1.21 0.2271  
age_imp                        -0.0049 0.0034 -1.46 0.1437  
comor                           0.3078 0.0302 10.20 <0.0001 
smoke100                        0.1259 0.0830  1.52 0.1296  
hx_depress                      0.5120 0.0940  5.45 <0.0001 
bmi                             0.0164 0.0058  2.83 0.0048  
activity=Active                -0.1773 0.1513 -1.17 0.2416  
activity=Insufficiently_Active -0.0396 0.1342 -0.30 0.7680  
activity=Inactive               0.2401 0.1144  2.10 0.0360  

17.12.2 Effect Sizes

We can plot and summarize the effect sizes using the usual ols tools:

summary(m16_imp)
             Effects              Response : phys_tr 

 Factor                                    Low   High  Diff. Effect    S.E.     Lower 0.95 Upper 0.95
 age_imp                                   57.00 73.00 16.00 -0.079163 0.054107 -0.185330   0.027008
 comor                                      1.00  2.00  1.00  0.307790 0.030171  0.248590   0.366990
 smoke100                                   0.00  1.00  1.00  0.125890 0.083000 -0.036973   0.288760
 hx_depress                                 0.00  1.00  1.00  0.511980 0.094007  0.327520   0.696450
 bmi                                       27.29 36.65  9.36  0.153530 0.054322  0.046932   0.260120
 activity - Highly_Active:Inactive          4.00  1.00    NA -0.240070 0.114350 -0.464450  -0.015686
 activity - Active:Inactive                 4.00  2.00    NA -0.417320 0.137640 -0.687400  -0.147250
 activity - Insufficiently_Active:Inactive  4.00  3.00    NA -0.279650 0.115000 -0.505310  -0.054002
plot(summary(m16_imp))

17.12.3 Making Predictions with this Model

Once again, let’s describe our two subjects, and use this model to predict their physhealth values.

  • Sheena is age 50, has 2 comorbidities, has smoked 100 cigarettes in her life, has no history of depression, a BMI of 25, and is Highly Active.
  • Jacob is age 65, has 4 comorbidities, has never smoked, has a history of depression, a BMI of 32 and is Inactive.

new2 <- tibble(
    name = c("Sheena", "Jacob"),
    age_imp = c(50, 65),
    comor = c(2, 4),
    smoke100 = c(1, 0),
    hx_depress = c(0, 1),
    bmi = c(25, 32),
    activity = c("Highly_Active", "Inactive")
)

preds_m_16imp <- predict(m16_imp, 
                         newdata = data.frame(new2))

preds_m_16imp
       1        2 
1.309306 2.591649 
preds_m_16imp <- tibble(value = preds_m_16imp) |>
    mutate(names = c("Sheena", "Jacob"),
           pred_physhealth = exp(value) - 1) |>
    select(names, pred_physhealth)

preds_m_16imp |> kable(digits = 3)
names pred_physhealth
Sheena 2.704
Jacob 12.352

17.12.4 Nomogram

We can also develop a nomogram, if we like. As a special touch, we’ll add a scale at the bottom that back-transforms the predicted phys_tr into predicted physhealth days.

plot(nomogram(m16_imp, 
              fun = list(function(x) exp(x) - 1),
              funlabel = "Predicted physhealth days",
              fun.at = seq(0, 30, 3)))

We can see the big role of comor and hx_depress in this model.

17.12.5 Validating Summary Statistics

We can cross-validate summary measures, like \(R^2\), using validate().

validate(m16_imp)
          index.orig training   test optimism index.corrected  n
R-square      0.1867   0.1984 0.1793   0.0192          0.1676 40
MSE           1.6472   1.6168 1.6623  -0.0455          1.6927 40
g             0.6876   0.7031 0.6749   0.0282          0.6594 40
Intercept     0.0000   0.0000 0.0677  -0.0677          0.0677 40
Slope         1.0000   1.0000 0.9636   0.0364          0.9636 40