Chapter 44 Dealing with Missing Data

What we should have done at the start of our wcgs analysis was to identify any issues with missing values, and to check whether any variables show unreasonably large or small values.

44.1 Identifying Missingness

If you just want a count of missing values in each variable, you can combine colSums and is.na.
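
A minimal check along these lines (wcgs.s is the tibble summarized throughout this chapter) produces the counts below:

colSums(is.na(wcgs.s))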

   id  chol   age   bmi   sbp   dbp smoke 
    0    12     0     0     0     0     0 

The mice package provides the md.pattern function to identify missingness patterns.
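
A call along the lines of this sketch yields the pattern table below:

library(mice)
md.pattern(wcgs.s)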

     id age bmi sbp dbp smoke chol   
3142  1   1   1   1   1     1    1  0
12    1   1   1   1   1     1    0  1
      0   0   0   0   0     0   12 12

Given the relatively small set of variables we’re studying here, I would run the describe function from the Hmisc package on each variable (perhaps skipping id), to identify missing values and, through range checks, potentially unrealistic ones.
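
The output header below suggests the call was essentially this sketch:

library(Hmisc)
describe(wcgs.s[-1])  # drop the first column, id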

wcgs.s[-1] 

 6  Variables      3154  Observations
---------------------------------------------------------------------------
chol 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    3142       12      237        1    226.4    47.99    161.1    175.0 
     .25      .50      .75      .90      .95 
   197.2    223.0    253.0    280.0    302.0 

lowest : 103 110 111 112 113, highest: 386 390 400 414 645
---------------------------------------------------------------------------
age 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    3154        0       21    0.996    46.28    6.256       39       40 
     .25      .50      .75      .90      .95 
      42       45       50       55       57 

lowest : 39 40 41 42 43, highest: 55 56 57 58 59
---------------------------------------------------------------------------
bmi 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    3154        0      679        1    24.52    2.803    20.59    21.52 
     .25      .50      .75      .90      .95 
   22.96    24.39    25.84    27.45    28.73 

lowest : 11.2 15.7 16.9 17.2 17.2, highest: 36.0 37.2 37.2 37.7 38.9
---------------------------------------------------------------------------
sbp 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    3154        0       62    0.996    128.6    16.25      110      112 
     .25      .50      .75      .90      .95 
     120      126      136      148      156 

lowest :  98 100 102 104 106, highest: 200 208 210 212 230
---------------------------------------------------------------------------
dbp 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    3154        0       42    0.992    82.02    10.51       68       70 
     .25      .50      .75      .90      .95 
      76       80       86       94      100 

lowest :  58  60  62  64  66, highest: 125 129 130 136 150
---------------------------------------------------------------------------
smoke 
       n  missing distinct 
    3154        0        2 
                      
Value         No   Yes
Frequency   1652  1502
Proportion 0.524 0.476
---------------------------------------------------------------------------

No values fall outside the range of plausibility, but in any case, we see that we have 12 missing chol values in the full data set. What are our options?

  • We might choose to omit those rows before creating our test and training samples.
    • If we’re building a model where chol is the outcome, I would omit those cases, as (for 431, at least) I won’t impute outcomes. Omitting incomplete cases is also R’s default behavior when fitting a model.
  • We might choose to impute missing values after we partition.

44.2 Complete Case Analysis: A model for chol

Suppose we want to build a model for chol using age, body-mass index and systolic blood pressure, using the entire wcgs.s data frame, except those subjects with a missing chol value. Note that dropping those incomplete cases is the default approach R takes, regardless of whether chol is used as an outcome or a predictor.
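
A sketch of the fit (model1 is the name this chapter refers back to later):

model1 <- lm(chol ~ age + bmi + sbp, data = wcgs.s)
summary(model1)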


Call:
lm(formula = chol ~ age + bmi + sbp, data = wcgs.s)

Residuals:
   Min     1Q Median     3Q    Max 
-125.4  -28.6   -3.0   25.9  409.4 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  146.871     10.320   14.23  < 2e-16 ***
age            0.561      0.141    3.98  7.2e-05 ***
bmi            0.681      0.312    2.18    0.029 *  
sbp            0.287      0.054    5.31  1.2e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 43 on 3138 degrees of freedom
  (12 observations deleted due to missingness)
Multiple R-squared:  0.0214,    Adjusted R-squared:  0.0205 
F-statistic: 22.9 on 3 and 3138 DF,  p-value: 1.16e-14

The notification under the residual standard error indicates that 12 observations (the 12 with missing chol, since no other variables in this group have missing data) were dropped before the model was fit. You can also tell this from the degrees of freedom.

  • We have four coefficients to fit (intercept and slopes of age, bmi and sbp), so we have (4 - 1) = 3 numerator degrees of freedom.
  • The ANOVA F-statistic report also specifies 3138 denominator degrees of freedom.
  • Total df is always one less than the total number of observations used in the model.
  • That suggests we are using (3 + 3138) + 1 = 3142 subjects in this model.
  • But wcgs.s actually contains 3,154 people, so 12 subjects were omitted, as the quick check below confirms.
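
Checking with base R’s dim:

dim(wcgs.s)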
[1] 3154    7

44.2.1 Removing all subjects with NA, via na.omit

So, if we want to fit a complete case analysis, we don’t technically have to do anything except specify the model. We can also make this explicit by first pruning from our tibble, via na.omit, all subjects with missing values on any variable.
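
A sketch of that pruning (the object name wcgs.s.CC is hypothetical, and restricting to the five variables of interest is an assumption consistent with the 3142 x 5 dimensions shown below):

wcgs.s.CC <- na.omit(wcgs.s[, c("id", "chol", "age", "bmi", "sbp")])
dim(wcgs.s.CC)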

[1] 3142    5

44.2.2 Removing subjects with NA on one particular variable

If we wanted to specify that only subjects with missing chol should be removed from the data set (in case there were missing values in other variables), we could use the following approach.

Here, I’ll work with a data set (wcgs.s3) that contains 12 missing chol values and 2 missing arcus values, and build a new set (wcgs.s3noNAchol) that retains the subjects who have a chol value, regardless of whether they are also missing arcus, but excludes the 12 subjects with missing chol data.
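
A sketch of the setup (building wcgs.s3 from the full wcgs data is an assumption; the md.pattern tables below summarize the two data sets):

wcgs.s3 <- wcgs[, c("id", "arcus", "chol")]  # assumed construction
md.pattern(wcgs.s3)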

     id arcus chol   
3140  1     1    1  0
12    1     1    0  1
2     1     0    1  1
      0     2   12 14
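
Then we can drop the rows with a missing chol (a base R sketch; dplyr’s filter(!is.na(chol)) would work equally well):

wcgs.s3noNAchol <- wcgs.s3[!is.na(wcgs.s3$chol), ]
md.pattern(wcgs.s3noNAchol)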

     id chol arcus  
3140  1    1     1 0
2     1    1     0 1
      0    0     2 2

44.3 Using Multiple Imputation to fit our Regression Model

Following the approach we outlined in the R-20 document, we’ll try fitting the model while imputing the 12 missing cholesterol values 100 times, creating 100 new “complete” data sets.
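
A sketch of the imputation step (the object name chol.mi matches the intercept-only example at the end of this chapter; the seed is an arbitrary choice, added here for reproducibility):

library(mice)
set.seed(431)  # arbitrary seed, assumed
chol.mi <- mice(wcgs.s, m = 100, printFlag = FALSE)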

44.3.1 Examining a Single Imputed Data Set

We could look at any one of our 100 imputations in some detail, perhaps the 12th such imputation, as follows:
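
Here mice’s complete function extracts a single filled-in data set, which we then describe:

imp12 <- complete(chol.mi, 12)  # the 12th imputed data set
describe(imp12$chol)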

imp12$chol 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    3154        0      237        1    226.4    47.93    161.7    175.0 
     .25      .50      .75      .90      .95 
   198.0    223.0    253.0    280.0    302.0 

lowest : 103 110 111 112 113, highest: 386 390 400 414 645

Working with a particular imputation might allow us to produce plots of the imputed data, or of diagnostics after a regression model, but we’ll focus our interpretation of regression results (in 431) on estimates of coefficients and of quantities like R^2. For that, we want to use all 100 imputations, in a pooled approach.

44.3.2 Fitting a Pooled Regression Model across the Imputations

Next, we’ll fit a set of pooled regression model estimates across all 100 imputations, producing the following results. We’ll even estimate both forms of R^2.
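
A sketch of that work, using mice’s with, pool and pool.r.squared functions (the object names are choices; model.1mi is the name used in the comparison that follows):

model.1mi <- with(chol.mi, lm(chol ~ age + bmi + sbp))
model.1mi.pooled <- pool(model.1mi)
model.1mi.pooled                            # shows ubar, b, t, riv, lambda, fmi
summary(model.1mi.pooled)                   # pooled coefficient estimates
pool.r.squared(model.1mi)                   # R^2
pool.r.squared(model.1mi, adjusted = TRUE)  # adjusted R^2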

Class: mipo    m = 100 
            estimate     ubar        b        t dfcom   df     riv  lambda
(Intercept)  147.029 1.06e+02 7.47e-01 1.07e+02  3150 3121 0.00713 0.00708
age            0.564 1.97e-02 1.22e-04 1.99e-02  3150 3125 0.00624 0.00620
bmi            0.687 9.69e-02 5.83e-04 9.75e-02  3150 3125 0.00608 0.00604
sbp            0.283 2.87e-03 4.84e-05 2.92e-03  3150 3068 0.01703 0.01675
                fmi
(Intercept) 0.00772
age         0.00684
bmi         0.00668
sbp         0.01739
            estimate std.error statistic   df p.value
(Intercept)   147.03     10.32     14.24 3121    0.00
age             0.56      0.14      4.00 3125    0.00
bmi             0.69      0.31      2.20 3125    0.03
sbp             0.28      0.05      5.23 3068    0.00
       est  lo 95  hi 95 fmi
R^2 0.0213 0.0124 0.0325 NaN
           est  lo 95  hi 95 fmi
adj R^2 0.0204 0.0117 0.0313 NaN
  • fmi in the output above contains the fraction of missing information.
  • lambda in that output is the proportion of the total variance that is attributable to the missing data.

So, how do these estimates after imputation compare to our complete case analysis originally developed as model1, and re-summarized below?
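
For reference, this repeats summary(model1):

summary(model1)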


Call:
lm(formula = chol ~ age + bmi + sbp, data = wcgs.s)

Residuals:
   Min     1Q Median     3Q    Max 
-125.4  -28.6   -3.0   25.9  409.4 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  146.871     10.320   14.23  < 2e-16 ***
age            0.561      0.141    3.98  7.2e-05 ***
bmi            0.681      0.312    2.18    0.029 *  
sbp            0.287      0.054    5.31  1.2e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 43 on 3138 degrees of freedom
  (12 observations deleted due to missingness)
Multiple R-squared:  0.0214,    Adjusted R-squared:  0.0205 
F-statistic: 22.9 on 3 and 3138 DF,  p-value: 1.16e-14

44.4 Comparing Two Models After Imputation with pool.compare

Suppose we want to assess whether a model for chol with age, bmi and sbp is statistically significantly more effective than a model with age alone. As long as the models we want to compare are nested (so that one model is a proper subset of the other), this can be done as follows, producing a Wald test.
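
A sketch of the comparison, assuming mice’s pool.compare (deprecated in favor of D1 in recent mice releases); its component names correspond to the pieces of output that follow:

model.2mi <- with(chol.mi, lm(chol ~ age))
comp <- pool.compare(model.1mi, model.2mi, method = "wald")
comp$qbar1   # pooled coefficients, larger model
comp$qbar0   # pooled coefficients, smaller model
comp$Dm      # Wald test statistic
comp$df1     # numerator degrees of freedom
comp$df2     # denominator degrees of freedom
comp$pvalue  # p value for the Wald test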

(Intercept)         age         bmi         sbp 
    147.029       0.564       0.687       0.283 
(Intercept)         age 
    193.922       0.701 
     [,1]
[1,] 16.2
[1] 2
[1] 1457052
         [,1]
[1,] 9.44e-08

Practically, a significant value of the Wald test here suggests that the difference in predictive value between model.1mi and model.2mi is statistically significant, and thus, that you need the larger model. A non-significant Wald test would be consistent with a situation where you could use the model with age alone.

Here, we clearly want to retain bmi and sbp as compared to dropping them both from the model. This isn’t a surprise, as we saw previously that both bmi and sbp had apparent predictive value (by the t test) even after all other variables were already in the model.

To create the overall F test, we could compare our model to an intercept-only model, via model.intmi <- with(chol.mi, lm(chol ~ 1)), but we’ll skip that here.