Chapter 14 Multiple Imputation and Linear Regression
14.1 Developing a smart_16
data set
In this chapter, we’ll return to the smart_ohio
file based on data from BRFSS 2017 that we built and cleaned back at the start of these Notes.
<- readRDS(here("data", "smart_ohio.Rds")) smart_ohio
We’re going to look at a selection of variables from this tibble, among subjects who have been told they have diabetes, and who also provided a response to our physhealth
(Number of Days Physical Health Not Good) variable, which asks “Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?” We’ll build two models. In this chapter, we’ll look at a linear model for physhealth
and in the next chapter, we’ll look at a logistic regression describing whether or not the subject’s physhealth
response was at least 1.
<- smart_ohio %>%
smart_16 filter(dm_status == "Diabetes") %>%
filter(complete.cases(physhealth)) %>%
mutate(bad_phys = ifelse(physhealth > 0, 1, 0),
comor = hx_mi + hx_chd + hx_stroke + hx_asthma +
+ hx_otherc + hx_copd + hx_arthr) %>%
hx_skinc select(SEQNO, mmsa, physhealth, bad_phys, age_imp, smoke100,
comor, hx_depress, bmi, activity)
The variables included in this smart_16
tibble are:
Variable | Description |
---|---|
SEQNO |
respondent identification number (all begin with 2016) |
mmsa |
|
physhealth |
Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? |
bad_phys |
Is physhealth 1 or more? |
age_imp |
Age in years (imputed from age categories) |
smoke100 |
Have you smoked at least 100 cigarettes in your life? (1 = yes, 0 = no) |
hx_depress |
Has a doctor, nurse, or other health professional ever told you that you have a depressive disorder, including depression, major depression, dysthymia, or minor depression? |
bmi |
Body mass index, in kg/m2 |
activity |
Physical activity (Highly Active, Active, Insufficiently Active, Inactive) |
comor |
Sum of 8 potential groups of comorbidities (see below) |
The comor
variable is the sum of the following 8 variables, each of which is measured on a 1 = Yes, 0 = No scale, and begin with “Has a doctor, nurse, or other health professional ever told you that you had …”
hx_mi
: a heart attack, also called a myocardial infarction?hx_chd
: angina or coronary heart disease?hx_stroke
: a stroke?hx_asthma
: asthma?hx_skinc
: skin cancer?hx_otherc
: any other types of cancer?hx_copd
: Chronic Obstructive Pulmonary Disease or COPD, emphysema or chronic bronchitis?hx_arthr
: some form of arthritis, rheumatoid arthritis, gout, lupus, or fibromyalgia?
%>% tabyl(comor) smart_16
comor n percent valid_percent
0 224 0.211920530 0.221782178
1 315 0.298013245 0.311881188
2 228 0.215704825 0.225742574
3 130 0.122989593 0.128712871
4 72 0.068117313 0.071287129
5 29 0.027436140 0.028712871
6 9 0.008514664 0.008910891
7 3 0.002838221 0.002970297
NA 47 0.044465468 NA
14.1.1 Any missing values?
We have 1057 observations (rows) in the smart_16
data set, of whom 860 have complete data on all variables.
dim(smart_16)
[1] 1057 10
n_case_complete(smart_16)
[1] 860
Which variables are missing?
miss_var_summary(smart_16)
# A tibble: 10 x 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 activity 85 8.04
2 bmi 84 7.95
3 comor 47 4.45
4 smoke100 24 2.27
5 age_imp 12 1.14
6 hx_depress 3 0.284
7 SEQNO 0 0
8 mmsa 0 0
9 physhealth 0 0
10 bad_phys 0 0
Note that our outcomes (physhealth
and the derived bad_phys
) have no missing values here, by design. We will be performing multiple imputation to account appropriately for missingness in the predictors with missing values.
14.2 Obtaining a Simple Imputation with mice
The mice
package provides several approaches we can use for imputation in building models of all kinds. Here, we’ll use it just to obtain a single set of imputed results that we can apply to “complete” our data for the purposes of thinking about (a) transforming our outcome and (b) considering the addition of non-linear predictor terms.
# requires library(mice)
set.seed(432)
# create small data set including only variables to
# be used in building the imputation model
<- smart_16 %>%
sm16 select(physhealth, activity, age_imp, bmi, comor,
hx_depress, smoke100)
<- mice(sm16, m = 1) smart_16_mice1
iter imp variable
1 1 activity age_imp bmi comor hx_depress smoke100
2 1 activity age_imp bmi comor hx_depress smoke100
3 1 activity age_imp bmi comor hx_depress smoke100
4 1 activity age_imp bmi comor hx_depress smoke100
5 1 activity age_imp bmi comor hx_depress smoke100
<- mice::complete(smart_16_mice1)
smart_16_imp1
n_case_miss(smart_16_imp1)
[1] 0
And now we’ll use this completed smart_16_imp1
data set (the product of just a single imputation) to help us address the next two issues.
14.3 Linear Regression: Considering a Transformation of the Outcome
A plausible strategy here would be to try to identify an outcome transformation only after some accounting for missing predictor values, perhaps through a simple imputation approach. However, to keep things simple here, I’ll just use the complete cases in this section.
Recall that our outcome here, physhealth
can take the value 0, and is thus not strictly positive.
::favstats(~ physhealth, data = smart_16_imp1) mosaic
min Q1 median Q3 max mean sd n missing
0 0 2 20 30 9.227058 11.92676 1057 0
So, if we want to investigate a potential transformation with a Box-Cox plot, we’ll have to add a small value to each physhealth
value. We’ll add 1, so that the range of potential values is now from 1-31.
# requires library(car)
%$%
smart_16_imp1 boxCox((physhealth + 1) ~ age_imp + comor + smoke100 +
+ bmi + activity) hx_depress
It looks like the logarithm is a reasonable transformation in this setting. So we’ll create a new outcome, that is the natural logarithm of (physhealth
+ 1), which we’ll call phys_tr
to remind us that a transformation is involved that we’ll eventually need to back out of to make predictions. We’ll build this new variable in both our original smart_16
data set and in the simply imputed data set we’re using for just these early stages.
<- smart_16_imp1 %>%
smart_16_imp1 mutate(phys_tr = log(physhealth + 1))
<- smart_16 %>%
smart_16 mutate(phys_tr = log(physhealth + 1))
So we have phys_tr
= log(physhealth
+ 1)
- where we are referring above to the natural (base \(e\) logarithm).
We can also specify our back-transformation to the original physhealth
values from our new phys_tr
as physhealth
= exp(phys_tr
) - 1.
14.4 Linear Regression: Considering Non-Linearity in the Predictors
Consider the following Spearman \(\rho^2\) plot.
# requires rms package
# (technically Hmisc, which is loaded by rms)
%$%
smart_16_imp1 plot(spearman2(phys_tr ~ age_imp + comor + smoke100 +
+ bmi + activity)) hx_depress
After our single imputation, we have the same N
value in all rows of this plot, which is what we want to see. It appears that in considering potential non-linear terms, comor
and hx_depress
and perhaps activity
are worthy of increased attention. I’ll make a couple of arbitrary choices, to add a raw cubic polynomial to represent the comor
information, and we’ll add an interaction term between hx_depress
and activity
.
14.5 “Main Effects” Linear Regression with lm
on the Complete Cases
Recall that we have 860 complete cases in our smart_16
data, out of a total of 1057 observations in total. A model using only the complete cases should thus drop the remaining 197 subjects. Let’s see if a main effects only model for our newly transformed phys_tr
outcome does in fact do this.
<- smart_16 %$%
m_1cc lm(phys_tr ~ age_imp + comor + smoke100 +
+ bmi + activity)
hx_depress
summary(m_1cc)
Call:
lm(formula = phys_tr ~ age_imp + comor + smoke100 + hx_depress +
bmi + activity)
Residuals:
Min 1Q Median 3Q Max
-3.0801 -1.0389 -0.2918 1.1029 2.8478
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.581959 0.370847 1.569 0.11696
age_imp -0.007043 0.003813 -1.847 0.06511 .
comor 0.301773 0.033105 9.116 < 2e-16 ***
smoke100 0.099038 0.090280 1.097 0.27295
hx_depress 0.471949 0.104232 4.528 6.81e-06 ***
bmi 0.016375 0.006295 2.601 0.00945 **
activityActive -0.229927 0.154912 -1.484 0.13812
activityInsufficiently_Active -0.116998 0.139440 -0.839 0.40168
activityInactive 0.256118 0.115266 2.222 0.02655 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.303 on 851 degrees of freedom
(197 observations deleted due to missingness)
Multiple R-squared: 0.1806, Adjusted R-squared: 0.1729
F-statistic: 23.45 on 8 and 851 DF, p-value: < 2.2e-16
Note that the appropriate number of observations are listed as “deleted due to missingness.”
14.5.1 Quality of Fit Statistics
glance(m_1cc) %>%
select(r.squared, adj.r.squared, sigma, AIC, BIC) %>%
kable(digits = c(3, 3, 2, 1, 1))
r.squared | adj.r.squared | sigma | AIC | BIC |
---|---|---|---|---|
0.181 | 0.173 | 1.3 | 2906.3 | 2953.8 |
14.5.2 Interpreting Effect Sizes
tidy(m_1cc, conf.int = TRUE) %>%
select(term, estimate, std.error, conf.low, conf.high) %>%
kable(digits = 3)
term | estimate | std.error | conf.low | conf.high |
---|---|---|---|---|
(Intercept) | 0.582 | 0.371 | -0.146 | 1.310 |
age_imp | -0.007 | 0.004 | -0.015 | 0.000 |
comor | 0.302 | 0.033 | 0.237 | 0.367 |
smoke100 | 0.099 | 0.090 | -0.078 | 0.276 |
hx_depress | 0.472 | 0.104 | 0.267 | 0.677 |
bmi | 0.016 | 0.006 | 0.004 | 0.029 |
activityActive | -0.230 | 0.155 | -0.534 | 0.074 |
activityInsufficiently_Active | -0.117 | 0.139 | -0.391 | 0.157 |
activityInactive | 0.256 | 0.115 | 0.030 | 0.482 |
We’ll interpret three of the predictors here to demonstrate ideas: comor
, hx_depress
and activity
.
If we have two subjects with the same values of
age_imp
,smoke100
,hx_depress
,bmi
, andactivity
, but Harry has acomor
score that is one point higher than Sally’s, then the model predicts that Harry’s transformed outcome (specifically the natural logarithm of (hisphyshealth
days + 1)) will be 0.302 higher than Sally’s, with a 95% confidence interval around that estimate ranging from (round(a$conf.low,3)
,round(a$conf.high,3)
).If we have two subjects with the same values of
age_imp
,comor
,smoke100
,bmi
, andactivity
, but Harry has a history of depression (hx_depress
= 1) while Sally does not have such a history (so Sally’shx_depress
= 0), then the model predicts that Harry’s transformed outcome (specifically the natural logarithm of (hisphyshealth
days + 1)) will be 0.472 higher than Sally’s, with a 95% confidence interval around that estimate ranging from (round(a$conf.low,3)
,round(a$conf.high,3)
).The
activity
variable has four categories as indicated in the table below. The model uses the “Highly_Active” category as the reference group.
%>% tabyl(activity) smart_16_imp1
activity n percent
Highly_Active 252 0.2384106
Active 135 0.1277200
Insufficiently_Active 193 0.1825922
Inactive 477 0.4512772
- From the tidied set of coefficients, we can describe the
activity
effects as follows.- If Sally is “Highly Active” and Harry is “Active” but they otherwise have the same values of all predictors, then our prediction is that Harry’s transformed outcome (specifically the natural logarithm of (his
physhealth
days + 1)) will be 0.23 lower than Sally’s, with a 95% confidence interval around that estimate ranging from (round(a$conf.low,3)
,round(a$conf.high,3)
). - If instead Harry is “Insufficiently Active” but nothing else changes, then our prediction is that Harry’s transformed outcome will be 0.117 lower than Sally’s, with a 95% confidence interval around that estimate ranging from (
round(a2$conf.low,3)
,round(a2$conf.high,3)
). - If instead Harry is “Inactive” but nothing else changes, then our prediction is that Harry’s transformed outcome will be -0.117 higher than Sally’s, with a 95% confidence interval around that estimate ranging from (
round(a2$conf.low,3)
,round(a2$conf.high,3)
).
- If Sally is “Highly Active” and Harry is “Active” but they otherwise have the same values of all predictors, then our prediction is that Harry’s transformed outcome (specifically the natural logarithm of (his
14.5.3 Making Predictions with the Model
Let’s describe two subjects, and use this model (and the ones that follow) to predict their physhealth
values.
- Sheena is age 50, has 2 comorbidities, has smoked 100 cigarettes in her life, has no history of depression, a BMI of 25, and is Highly Active.
- Jacob is age 65, has 4 comorbidities, has never smoked, has a history of depression, a BMI of 32 and is Inactive.
We’ll first build predictions for Sheena and Jacob (with 95% prediction intervals) for phys_tr
.
<- tibble(
new2 name = c("Sheena", "Jacob"),
age_imp = c(50, 65),
comor = c(2, 4),
smoke100 = c(1, 0),
hx_depress = c(0, 1),
bmi = c(25, 32),
activity = c("Highly_Active", "Inactive")
)
<- predict(m_1cc, newdata = new2,
preds_m_1cc interval = "prediction")
preds_m_1cc
fit lwr upr
1 1.341778 -1.22937 3.912925
2 2.583336 0.01399 5.152681
The model makes predictions for our transformed outcome, phys_tr
. Now, we need to back-transform the predictions and the confidence intervals to build predictions for physhealth
.
<- preds_m_1cc %>%
preds_m_1cc tbl_df() %>%
mutate(names = c("Sheena", "Jacob"),
pred_physhealth = exp(fit) - 1,
conf_low = exp(lwr) - 1,
conf_high = exp(upr) - 1) %>%
select(names, pred_physhealth, conf_low, conf_high,
everything())
%>% kable(digits = 3) preds_m_1cc
names | pred_physhealth | conf_low | conf_high | fit | lwr | upr |
---|---|---|---|---|---|---|
Sheena | 2.826 | -0.708 | 49.045 | 1.342 | -1.229 | 3.913 |
Jacob | 12.241 | 0.014 | 171.894 | 2.583 | 0.014 | 5.153 |
14.6 “Augmented” Linear Regression with lm
on the Complete Cases
Now, we’ll add the non-linear terms we discussed earlier. We’ll add a (raw) cubic polynomial to represent the comor
information, and we’ll add an interaction term between hx_depress
and activity
.
# requires rms package (and co-loading Hmisc)
<- smart_16 %$%
m_2cc lm(phys_tr ~ age_imp + pol(comor, 3) + smoke100 +
+ hx_depress*activity)
bmi
summary(m_2cc)
Call:
lm(formula = phys_tr ~ age_imp + pol(comor, 3) + smoke100 + bmi +
hx_depress * activity)
Residuals:
Min 1Q Median 3Q Max
-2.907 -1.063 -0.267 1.143 2.924
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.514823 0.376203 1.368 0.17153
age_imp -0.008100 0.003865 -2.096 0.03640
pol(comor, 3)comor 0.634274 0.160630 3.949 8.51e-05
pol(comor, 3)comor^2 -0.130626 0.073525 -1.777 0.07599
pol(comor, 3)comor^3 0.012508 0.008977 1.393 0.16386
smoke100 0.089345 0.090336 0.989 0.32294
bmi 0.015203 0.006315 2.408 0.01627
hx_depress 0.647054 0.229696 2.817 0.00496
activityActive -0.202196 0.172300 -1.174 0.24092
activityInsufficiently_Active -0.005815 0.166221 -0.035 0.97210
activityInactive 0.290380 0.132198 2.197 0.02832
hx_depress:activityActive -0.124836 0.395415 -0.316 0.75230
hx_depress:activityInsufficiently_Active -0.376355 0.310160 -1.213 0.22531
hx_depress:activityInactive -0.172952 0.267427 -0.647 0.51798
(Intercept)
age_imp *
pol(comor, 3)comor ***
pol(comor, 3)comor^2 .
pol(comor, 3)comor^3
smoke100
bmi *
hx_depress **
activityActive
activityInsufficiently_Active
activityInactive *
hx_depress:activityActive
hx_depress:activityInsufficiently_Active
hx_depress:activityInactive
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.301 on 846 degrees of freedom
(197 observations deleted due to missingness)
Multiple R-squared: 0.187, Adjusted R-squared: 0.1745
F-statistic: 14.97 on 13 and 846 DF, p-value: < 2.2e-16
Note again that the appropriate number of observations are listed as “deleted due to missingness.”
14.6.1 Quality of Fit Statistics
glance(m_2cc) %>%
select(r.squared, adj.r.squared, sigma, AIC, BIC) %>%
kable(digits = c(3, 3, 2, 1, 1))
r.squared | adj.r.squared | sigma | AIC | BIC |
---|---|---|---|---|
0.187 | 0.175 | 1.3 | 2909.5 | 2980.9 |
14.6.2 ANOVA assessing the impact of the non-linear terms
anova(m_1cc, m_2cc)
Analysis of Variance Table
Model 1: phys_tr ~ age_imp + comor + smoke100 + hx_depress + bmi + activity
Model 2: phys_tr ~ age_imp + pol(comor, 3) + smoke100 + bmi + hx_depress *
activity
Res.Df RSS Df Sum of Sq F Pr(>F)
1 851 1444.0
2 846 1432.8 5 11.265 1.3303 0.249
The difference between the models doesn’t meet the standard for statistical detectabilty at our usual \(\alpha\) levels.
14.6.3 Interpreting Effect Sizes
tidy(m_2cc, conf.int = TRUE) %>%
select(term, estimate, std.error, conf.low, conf.high) %>%
kable(digits = 3)
term | estimate | std.error | conf.low | conf.high |
---|---|---|---|---|
(Intercept) | 0.515 | 0.376 | -0.224 | 1.253 |
age_imp | -0.008 | 0.004 | -0.016 | -0.001 |
pol(comor, 3)comor | 0.634 | 0.161 | 0.319 | 0.950 |
pol(comor, 3)comor^2 | -0.131 | 0.074 | -0.275 | 0.014 |
pol(comor, 3)comor^3 | 0.013 | 0.009 | -0.005 | 0.030 |
smoke100 | 0.089 | 0.090 | -0.088 | 0.267 |
bmi | 0.015 | 0.006 | 0.003 | 0.028 |
hx_depress | 0.647 | 0.230 | 0.196 | 1.098 |
activityActive | -0.202 | 0.172 | -0.540 | 0.136 |
activityInsufficiently_Active | -0.006 | 0.166 | -0.332 | 0.320 |
activityInactive | 0.290 | 0.132 | 0.031 | 0.550 |
hx_depress:activityActive | -0.125 | 0.395 | -0.901 | 0.651 |
hx_depress:activityInsufficiently_Active | -0.376 | 0.310 | -0.985 | 0.232 |
hx_depress:activityInactive | -0.173 | 0.267 | -0.698 | 0.352 |
Let’s focus first on interpreting the interaction terms between hx_depress
and activity
.
Assume first that we have a set of subjects with the same values of age_imp
, smoke100
, bmi
, and comor
.
- Arnold has
hx_depress
= 1 and is Inactive - Betty has
hx_depress
= 1 and is Insufficiently Active - Carlos has
hx_depress
= 1 and is Active - Debbie has
hx_depress
= 1 and is Highly Active - Eamon has
hx_depress
= 0 and is Inactive - Florence has
hx_depress
= 0 and is Insufficiently Active - Garry has
hx_depress
= 0 and is Active - Harry has
hx_depress
= 0 and is Highly Active
So the model, essentially can be used to compare each of the first seven people on that list to Harry (who has the reference levels of both hx_depress
and activity
.) Let’s compare Arnold to Harry.
For instance, as compared to Harry, Arnold is expected to have a transformed outcome (specifically the natural logarithm of (his physhealth
days + 1)) that is:
- 0.647 higher because Arnold’s
hx_depress
= 1, and - 0.29 higher still because Arnold’s
activity
is “Inactive,” and - 0.173 lower because of the combination (see the `hx_depress:activityInactive” row)
So, in total, we expect Arnold’s transformed outcome to be 0.647 + 0.29 + (-0.173), or 0.764 higher than Harry’s.
If we want to compare Arnold to, for instance, Betty, we first calculate Betty’s difference from Harry, and then compare the two differences.
As compared to Harry, Betty is expected to have a transformed outcome (specifically the natural logarithm of (her physhealth
days + 1)) that is:
- 0.647 higher because Betty’s
hx_depress
= 1, and - 0.006 lower still because Betty’s
activity
is “Insufficiently Active,” and - 0.376 lower because of the combination (see the `hx_depress:activityInsufficiently_Active” row)
So, in total, we expect Betty’s transformed outcome to be 0.647 + (-0.006) + (-0.376), or 0.265 higher than Harry’s.
And thus we can compare Betty and Arnold directly.
- Arnold is predicted to have an outcome that is 0.764 higher than Harry’s.
- Betty is predicted to have an outcome that is 0.265 higher than Harry’s.
- And so Arnold’s predicted outcome (
phys_tr
) is 0.499 larger than Betty’s.
Now, suppose we want to look at our cubic polynomial in comor
.
- Suppose Harry and Sally have the same values for all other predictors in the model, but Harry has 1 comorbidity where Sally has none. Then the three terms in the model related to
comor
will be 1 for Harry and 0 for Sally, and the interpretation becomes pretty straightforward. - But suppose instead that nothing has changed except Harry has 2 comorbidities and Sally has just 1. The size of the impact of this Harry - Sally difference is far larger in this situation, because the
comor
variable enters the model in a non-linear way. This is an area where fitting the model usingols
can be helpful because of the ability to generate plots (of effects, nomograms, etc.) that can show this non-linearity in a clear way.
Suppose for instance, that Harry and Sally share the following values for the other predictors: each is age 40, has never smoked, has no history of depression, a BMI of 30 and is Highly Active.
- Now, if Harry has 1 comorbidity and Sally has none, the predicted
phys_tr
values for Harry and Sally are as indicated below.
<- tibble(
hands1 name = c("Harry", "Sally"),
age_imp = c(40, 40),
comor = c(1, 0),
smoke100 = c(0, 0),
hx_depress = c(0, 0),
bmi = c(30, 30),
activity = c("Highly_Active", "Highly_Active")
)
predict(m_2cc, newdata = hands1)
1 2
1.1630840 0.6469282
But if Harry has 2 comorbidities and Sally 1, the predictions are:
<- tibble(
hands2 name = c("Harry", "Sally"),
age_imp = c(40, 40),
comor = c(2, 1), # only thing that changes
smoke100 = c(0, 0),
hx_depress = c(0, 0),
bmi = c(30, 30),
activity = c("Highly_Active", "Highly_Active")
)
predict(m_2cc, newdata = hands2)
1 2
1.493035 1.163084
Note that the difference in predictions between Harry and Sally is much smaller now than it was previously.
14.6.4 Making Predictions with the Model
As before, we’ll use the new model to predict physhealth
values for Sheena and Jacob.
- Sheena is age 50, has 2 comorbidities, has smoked 100 cigarettes in her life, has no history of depression, a BMI of 25, and is Highly Active.
- Jacob is age 65, has 4 comorbidities, has never smoked, has a history of depression, a BMI of 32 and is Inactive.
We’ll first build predictions for Sheena and Jacob (with 95% prediction intervals) for phys_tr
.
<- tibble(
new2 name = c("Sheena", "Jacob"),
age_imp = c(50, 65),
comor = c(2, 4),
smoke100 = c(1, 0),
hx_depress = c(0, 1),
bmi = c(25, 32),
activity = c("Highly_Active", "Inactive")
)
<- predict(m_2cc, newdata = new2,
preds_m_2cc interval = "prediction")
preds_m_2cc
fit lwr upr
1 1.425362 -1.14707613 3.997801
2 2.486907 -0.08635658 5.060171
Now, we need to back-transform the predictions and the confidence intervals that describe phys_tr
to build predictions for physhealth
.
<- preds_m_2cc %>%
preds_m_2cc tbl_df() %>%
mutate(names = c("Sheena", "Jacob"),
pred_physhealth = exp(fit) - 1,
conf_low = exp(lwr) - 1,
conf_high = exp(upr) - 1) %>%
select(names, pred_physhealth, conf_low, conf_high,
everything())
%>% kable(digits = 3) preds_m_2cc
names | pred_physhealth | conf_low | conf_high | fit | lwr | upr |
---|---|---|---|---|---|---|
Sheena | 3.159 | -0.682 | 53.478 | 1.425 | -1.147 | 3.998 |
Jacob | 11.024 | -0.083 | 156.617 | 2.487 | -0.086 | 5.060 |
14.7 Using mice
to perform Multiple Imputation
Let’s focus on the main effects model, and look at the impact of performing multiple imputation to account for the missing data. Recall that in our smart_16
data, the most “missingness” is shown in the activity
variable, which is still missing less than 10% of the time. So we’ll try a set of 10 imputations, using the default settings in the mice
package.
# requires library(mice)
set.seed(432)
# create small data set including only variables to
# be used in building the imputation model
<- smart_16 %>%
sm16 select(physhealth, phys_tr, activity, age_imp, bmi, comor,
hx_depress, smoke100)
<- mice(sm16, m = 10) smart_16_mice10
iter imp variable
1 1 activity age_imp bmi comor hx_depress smoke100
1 2 activity age_imp bmi comor hx_depress smoke100
1 3 activity age_imp bmi comor hx_depress smoke100
1 4 activity age_imp bmi comor hx_depress smoke100
1 5 activity age_imp bmi comor hx_depress smoke100
1 6 activity age_imp bmi comor hx_depress smoke100
1 7 activity age_imp bmi comor hx_depress smoke100
1 8 activity age_imp bmi comor hx_depress smoke100
1 9 activity age_imp bmi comor hx_depress smoke100
1 10 activity age_imp bmi comor hx_depress smoke100
2 1 activity age_imp bmi comor hx_depress smoke100
2 2 activity age_imp bmi comor hx_depress smoke100
2 3 activity age_imp bmi comor hx_depress smoke100
2 4 activity age_imp bmi comor hx_depress smoke100
2 5 activity age_imp bmi comor hx_depress smoke100
2 6 activity age_imp bmi comor hx_depress smoke100
2 7 activity age_imp bmi comor hx_depress smoke100
2 8 activity age_imp bmi comor hx_depress smoke100
2 9 activity age_imp bmi comor hx_depress smoke100
2 10 activity age_imp bmi comor hx_depress smoke100
3 1 activity age_imp bmi comor hx_depress smoke100
3 2 activity age_imp bmi comor hx_depress smoke100
3 3 activity age_imp bmi comor hx_depress smoke100
3 4 activity age_imp bmi comor hx_depress smoke100
3 5 activity age_imp bmi comor hx_depress smoke100
3 6 activity age_imp bmi comor hx_depress smoke100
3 7 activity age_imp bmi comor hx_depress smoke100
3 8 activity age_imp bmi comor hx_depress smoke100
3 9 activity age_imp bmi comor hx_depress smoke100
3 10 activity age_imp bmi comor hx_depress smoke100
4 1 activity age_imp bmi comor hx_depress smoke100
4 2 activity age_imp bmi comor hx_depress smoke100
4 3 activity age_imp bmi comor hx_depress smoke100
4 4 activity age_imp bmi comor hx_depress smoke100
4 5 activity age_imp bmi comor hx_depress smoke100
4 6 activity age_imp bmi comor hx_depress smoke100
4 7 activity age_imp bmi comor hx_depress smoke100
4 8 activity age_imp bmi comor hx_depress smoke100
4 9 activity age_imp bmi comor hx_depress smoke100
4 10 activity age_imp bmi comor hx_depress smoke100
5 1 activity age_imp bmi comor hx_depress smoke100
5 2 activity age_imp bmi comor hx_depress smoke100
5 3 activity age_imp bmi comor hx_depress smoke100
5 4 activity age_imp bmi comor hx_depress smoke100
5 5 activity age_imp bmi comor hx_depress smoke100
5 6 activity age_imp bmi comor hx_depress smoke100
5 7 activity age_imp bmi comor hx_depress smoke100
5 8 activity age_imp bmi comor hx_depress smoke100
5 9 activity age_imp bmi comor hx_depress smoke100
5 10 activity age_imp bmi comor hx_depress smoke100
summary(smart_16_mice10)
Class: mids
Number of multiple imputations: 10
Imputation methods:
physhealth phys_tr activity age_imp bmi comor hx_depress
"" "" "polyreg" "pmm" "pmm" "pmm" "pmm"
smoke100
"pmm"
PredictorMatrix:
physhealth phys_tr activity age_imp bmi comor hx_depress smoke100
physhealth 0 1 1 1 1 1 1 1
phys_tr 1 0 1 1 1 1 1 1
activity 1 1 0 1 1 1 1 1
age_imp 1 1 1 0 1 1 1 1
bmi 1 1 1 1 0 1 1 1
comor 1 1 1 1 1 0 1 1
14.8 Running the Linear Regression in lm
with Multiple Imputation
Next, we’ll run the linear model (main effects) on each of the 10 imputed data sets.
<-
m10_mods with(smart_16_mice10, lm(phys_tr ~ age_imp + comor +
+ hx_depress +
smoke100 + activity))
bmi
summary(m10_mods)
# A tibble: 90 x 6
term estimate std.error statistic p.value nobs
<chr> <dbl> <dbl> <dbl> <dbl> <int>
1 (Intercept) 0.317 0.326 0.971 3.32e- 1 1057
2 age_imp -0.00489 0.00334 -1.47 1.43e- 1 1057
3 comor 0.313 0.0295 10.6 4.72e-25 1057
4 smoke100 0.135 0.0799 1.69 9.22e- 2 1057
5 hx_depress 0.500 0.0929 5.38 9.14e- 8 1057
6 bmi 0.0187 0.00564 3.31 9.64e- 4 1057
7 activityActive -0.202 0.138 -1.46 1.44e- 1 1057
8 activityInsufficiently_Active -0.0695 0.124 -0.561 5.75e- 1 1057
9 activityInactive 0.262 0.103 2.54 1.11e- 2 1057
10 (Intercept) 0.363 0.332 1.10 2.74e- 1 1057
# ... with 80 more rows
Then, we’ll pool results across the 10 imputations
<- pool(m10_mods)
m10_pool summary(m10_pool, conf.int = TRUE) %>%
select(-statistic, -df) %>%
kable(digits = 3)
term | estimate | std.error | p.value | 2.5 % | 97.5 % |
---|---|---|---|---|---|
(Intercept) | 0.444 | 0.342 | 0.194 | -0.227 | 1.114 |
age_imp | -0.005 | 0.003 | 0.128 | -0.012 | 0.002 |
comor | 0.309 | 0.031 | 0.000 | 0.249 | 0.369 |
smoke100 | 0.114 | 0.083 | 0.171 | -0.049 | 0.278 |
hx_depress | 0.512 | 0.094 | 0.000 | 0.327 | 0.696 |
bmi | 0.016 | 0.006 | 0.009 | 0.004 | 0.027 |
activityActive | -0.204 | 0.140 | 0.146 | -0.479 | 0.071 |
activityInsufficiently_Active | -0.044 | 0.129 | 0.735 | -0.298 | 0.210 |
activityInactive | 0.260 | 0.106 | 0.014 | 0.052 | 0.469 |
And we can compare these results to the complete case analysis we completed earlier.
tidy(m_1cc, conf.int = TRUE) %>%
select(term, estimate, std.error, p.value, conf.low, conf.high) %>%
kable(digits = 3)
term | estimate | std.error | p.value | conf.low | conf.high |
---|---|---|---|---|---|
(Intercept) | 0.582 | 0.371 | 0.117 | -0.146 | 1.310 |
age_imp | -0.007 | 0.004 | 0.065 | -0.015 | 0.000 |
comor | 0.302 | 0.033 | 0.000 | 0.237 | 0.367 |
smoke100 | 0.099 | 0.090 | 0.273 | -0.078 | 0.276 |
hx_depress | 0.472 | 0.104 | 0.000 | 0.267 | 0.677 |
bmi | 0.016 | 0.006 | 0.009 | 0.004 | 0.029 |
activityActive | -0.230 | 0.155 | 0.138 | -0.534 | 0.074 |
activityInsufficiently_Active | -0.117 | 0.139 | 0.402 | -0.391 | 0.157 |
activityInactive | 0.256 | 0.115 | 0.027 | 0.030 | 0.482 |
Note that there are some sizeable differences here, although nothing enormous.
If we want the pooled \(R^2\) or pooled adjusted \(R^2\) after imputation, R will provide it (and a 95% confidence interval around the estimate) with …
pool.r.squared(m10_mods)
est lo 95 hi 95 fmi
R^2 0.1912561 0.1482819 0.2369623 0.08061427
pool.r.squared(m10_mods, adjusted = TRUE)
est lo 95 hi 95 fmi
adj R^2 0.1850807 0.1425132 0.2305277 0.08312639
We can see the fraction of missing information about each coefficient due to non-response (fmi
) and other details with the following code…
m10_pool
Class: mipo m = 10
term m estimate ubar b
1 (Intercept) 10 0.44377168 1.078194e-01 8.002900e-03
2 age_imp 10 -0.00522810 1.123668e-05 4.824290e-07
3 comor 10 0.30871888 8.801039e-04 5.466082e-05
4 smoke100 10 0.11415718 6.474388e-03 4.130673e-04
5 hx_depress 10 0.51155722 8.669413e-03 1.582684e-04
6 bmi 10 0.01576150 3.182381e-05 3.425191e-06
7 activityActive 10 -0.20412627 1.914250e-02 4.851040e-04
8 activityInsufficiently_Active 10 -0.04383739 1.565925e-02 9.855183e-04
9 activityInactive 10 0.26046070 1.069627e-02 5.023887e-04
t dfcom df riv lambda fmi
1 1.166226e-01 1048 599.8172 0.08164758 0.07548446 0.07855177
2 1.176735e-05 1048 814.9042 0.04722675 0.04509697 0.04743197
3 9.402308e-04 1048 677.6361 0.06831796 0.06394909 0.06669961
4 6.928762e-03 1048 666.2487 0.07018023 0.06557795 0.06837041
5 8.843508e-03 1048 982.0512 0.02008154 0.01968621 0.02167660
6 3.559152e-05 1048 432.0874 0.11839279 0.10585976 0.10996992
7 1.967611e-02 1048 939.5064 0.02787591 0.02711992 0.02918437
8 1.674332e-02 1048 672.0473 0.06922874 0.06474643 0.06751736
9 1.124890e-02 1048 785.1905 0.05166545 0.04912727 0.05154007
14.9 Fit the Multiple Imputation Model with aregImpute
Here, we’ll use aregImpute
to deal with missing values through multiple imputation, and use the ols
function in the rms
package to fit the model.
The first step is to fit the multiple imputation model. We’ll use n.impute
= 10 imputations, with B
= 10 bootstrap samples for the preditive mean matching, and fit both linear models and models with restricted cubic splines with 3 knots (nk = c(0, 3)
) allowing the target variable to have a non-linear transformation when nk
is 3, via tlinear = FALSE
.
set.seed(43201602)
<- datadist(smart_16)
dd options(datadist = "dd")
<-
fit16_imp aregImpute(~ phys_tr + age_imp + comor + smoke100 +
+ bmi + activity,
hx_depress nk = c(0, 3), tlinear = FALSE,
data = smart_16, B = 10, n.impute = 10)
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Iteration 6
Iteration 7
Iteration 8
Iteration 9
Iteration 10
Iteration 11
Iteration 12
Iteration 13
Here are the results of that imputation model.
fit16_imp
Multiple Imputation using Bootstrap and PMM
aregImpute(formula = ~phys_tr + age_imp + comor + smoke100 +
hx_depress + bmi + activity, data = smart_16, n.impute = 10,
nk = c(0, 3), tlinear = FALSE, B = 10)
n: 1057 p: 7 Imputations: 10 nk: 0
Number of NAs:
phys_tr age_imp comor smoke100 hx_depress bmi activity
0 12 47 24 3 84 85
type d.f.
phys_tr s 1
age_imp s 1
comor s 1
smoke100 l 1
hx_depress l 1
bmi s 1
activity c 3
R-squares for Predicting Non-Missing Values for Each Variable
Using Last Imputations of Predictors
age_imp comor smoke100 hx_depress bmi activity
0.224 0.206 0.059 0.167 0.169 0.057
Resampling results for determining the complexity of imputation models
Variable being imputed: age_imp
nk=0 nk=3
Bootstrap bias-corrected R^2 0.186 0.215
10-fold cross-validated R^2 0.211 0.215
Bootstrap bias-corrected mean |error| 9.108 10.894
10-fold cross-validated mean |error| 65.169 10.919
Bootstrap bias-corrected median |error| 7.290 8.784
10-fold cross-validated median |error| 66.006 8.613
Variable being imputed: comor
nk=0 nk=3
Bootstrap bias-corrected R^2 0.183 0.182
10-fold cross-validated R^2 0.184 0.193
Bootstrap bias-corrected mean |error| 0.987 1.184
10-fold cross-validated mean |error| 1.759 1.171
Bootstrap bias-corrected median |error| 0.828 0.910
10-fold cross-validated median |error| 1.574 0.892
Variable being imputed: smoke100
nk=0 nk=3
Bootstrap bias-corrected R^2 0.0224 0.0187
10-fold cross-validated R^2 0.0358 0.0217
Bootstrap bias-corrected mean |error| 0.4853 0.4866
10-fold cross-validated mean |error| 0.9462 0.9561
Bootstrap bias-corrected median |error| 0.4788 0.4772
10-fold cross-validated median |error| 0.8479 0.8706
Variable being imputed: hx_depress
nk=0 nk=3
Bootstrap bias-corrected R^2 0.157 0.138
10-fold cross-validated R^2 0.147 0.148
Bootstrap bias-corrected mean |error| 0.355 0.360
10-fold cross-validated mean |error| 0.801 0.783
Bootstrap bias-corrected median |error| 0.333 0.337
10-fold cross-validated median |error| 0.711 0.673
Variable being imputed: bmi
nk=0 nk=3
Bootstrap bias-corrected R^2 0.125 0.122
10-fold cross-validated R^2 0.134 0.133
Bootstrap bias-corrected mean |error| 5.221 6.822
10-fold cross-validated mean |error| 32.458 6.884
Bootstrap bias-corrected median |error| 4.178 5.782
10-fold cross-validated median |error| 31.330 5.932
Variable being imputed: activity
nk=0 nk=3
Bootstrap bias-corrected R^2 0.0312 0.0275
10-fold cross-validated R^2 0.0450 0.0402
Bootstrap bias-corrected mean |error| 1.8774 1.8884
10-fold cross-validated mean |error| 1.1121 1.1043
Bootstrap bias-corrected median |error| 2.0000 2.0000
10-fold cross-validated median |error| 1.0000 1.0000
par(mfrow = c(3,2))
plot(fit16_imp)
par(mfrow = c(1,1))
The plot helps us see where the imputations are happening.
14.10 Fit Linear Regression using ols
and fit.mult.impute
<-
m16_imp fit.mult.impute(phys_tr ~ age_imp + comor + smoke100 +
+ bmi + activity,
hx_depress fitter = ols, xtrans = fit16_imp,
data = smart_16, x = TRUE, y = TRUE)
Variance Inflation Factors Due to Imputation:
Intercept age_imp
1.03 1.01
comor smoke100
1.03 1.06
hx_depress bmi
1.02 1.06
activity=Active activity=Insufficiently_Active
1.19 1.14
activity=Inactive
1.23
Rate of Missing Information:
Intercept age_imp
0.03 0.01
comor smoke100
0.03 0.06
hx_depress bmi
0.02 0.06
activity=Active activity=Insufficiently_Active
0.16 0.13
activity=Inactive
0.19
d.f. for t-distribution for Tests of Single Coefficients:
Intercept age_imp
8176.67 45410.80
comor smoke100
13030.27 2670.64
hx_depress bmi
28199.30 2935.89
activity=Active activity=Insufficiently_Active
354.62 571.56
activity=Inactive
258.42
The following fit components were averaged over the 10 model fits:
fitted.values stats linear.predictors
14.10.1 Summaries and Coefficients
Here are the results:
m16_imp
Linear Regression Model
fit.mult.impute(formula = phys_tr ~ age_imp + comor + smoke100 +
hx_depress + bmi + activity, fitter = ols, xtrans = fit16_imp,
data = smart_16, x = TRUE, y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 1057 LR chi2 219.94 R2 0.188
sigma1.2881 d.f. 8 R2 adj 0.182
d.f. 1048 Pr(> chi2) 0.0000 g 0.687
Residuals
Min 1Q Median 3Q Max
-3.0621 -1.0327 -0.2878 1.1104 2.8018
Coef S.E. t Pr(>|t|)
Intercept 0.4052 0.3352 1.21 0.2271
age_imp -0.0049 0.0034 -1.46 0.1437
comor 0.3078 0.0302 10.20 <0.0001
smoke100 0.1259 0.0830 1.52 0.1296
hx_depress 0.5120 0.0940 5.45 <0.0001
bmi 0.0164 0.0058 2.83 0.0048
activity=Active -0.1773 0.1513 -1.17 0.2416
activity=Insufficiently_Active -0.0396 0.1342 -0.30 0.7680
activity=Inactive 0.2401 0.1144 2.10 0.0360
14.10.2 Effect Sizes
We can plot and summarize the effect sizes using the usual ols
tools:
summary(m16_imp)
Effects Response : phys_tr
Factor Low High Diff. Effect S.E.
age_imp 57.00 73.00 16.00 -0.079163 0.054107
comor 1.00 2.00 1.00 0.307790 0.030171
smoke100 0.00 1.00 1.00 0.125890 0.083000
hx_depress 0.00 1.00 1.00 0.511980 0.094007
bmi 27.29 36.65 9.36 0.153530 0.054322
activity - Highly_Active:Inactive 4.00 1.00 NA -0.240070 0.114350
activity - Active:Inactive 4.00 2.00 NA -0.417320 0.137640
activity - Insufficiently_Active:Inactive 4.00 3.00 NA -0.279650 0.115000
Lower 0.95 Upper 0.95
-0.185330 0.027008
0.248590 0.366990
-0.036973 0.288760
0.327520 0.696450
0.046932 0.260120
-0.464450 -0.015686
-0.687400 -0.147250
-0.505310 -0.054002
plot(summary(m16_imp))
14.10.3 Making Predictions with this Model
Once again, let’s make predictions for our two subjects, and use this model (and the ones that follow) to predict their physhealth
values.
- Sheena is age 50, has 2 comorbidities, has smoked 100 cigarettes in her life, has no history of depression, a BMI of 25, and is Highly Active.
- Jacob is age 65, has 4 comorbidities, has never smoked, has a history of depression, a BMI of 32 and is Inactive.
<- tibble(
new2 name = c("Sheena", "Jacob"),
age_imp = c(50, 65),
comor = c(2, 4),
smoke100 = c(1, 0),
hx_depress = c(0, 1),
bmi = c(25, 32),
activity = c("Highly_Active", "Inactive")
)
<- predict(m16_imp,
preds_m_16imp newdata = data.frame(new2))
preds_m_16imp
1 2
1.309306 2.591649
<- preds_m_16imp %>%
preds_m_16imp tbl_df() %>%
mutate(names = c("Sheena", "Jacob"),
pred_physhealth = exp(value) - 1) %>%
select(names, pred_physhealth)
%>% kable(digits = 3) preds_m_16imp
names | pred_physhealth |
---|---|
Sheena | 2.704 |
Jacob | 12.352 |
14.10.4 Nomogram
We can also develop a nomogram, if we like. As a special touch, we’ll add a prediction at the bottom which back-transforms out of the predicted phys_tr
back to the physhealth
days.
plot(nomogram(m16_imp,
fun = list(function(x) exp(x) - 1),
funlabel = "Predicted physhealth days",
fun.at = seq(0, 30, 3)))
We can see the big role of comor
and hx_depress
in this model.
14.10.5 Validating Summary Statistics
We can cross-validate summary measures, like \(R^2\)…
validate(m16_imp)
index.orig training test optimism index.corrected n
R-square 0.1867 0.1984 0.1793 0.0192 0.1676 40
MSE 1.6472 1.6168 1.6623 -0.0455 1.6927 40
g 0.6876 0.7031 0.6749 0.0282 0.6594 40
Intercept 0.0000 0.0000 0.0677 -0.0677 0.0677 40
Slope 1.0000 1.0000 0.9636 0.0364 0.9636 40