Chapter 50 Missing Data Mechanisms and Simple Imputation

Almost all serious statistical analyses have to deal with missing data. Data values that are missing are indicated in R, and to R, by the symbol NA.

50.1 A Toy Example

In the following tiny data set called sbp_example, we have four variables for a set of 15 subjects. In addition to a subject id, we have:

  • the treatment this subject received (A, B or C are the treatments),
  • an indicator (1 = yes, 0 = no) of whether the subject has diabetes,
  • the subject’s systolic blood pressure at baseline
  • the subject’s systolic blood pressure after the application of the treatment

Attaching package: 'simputation'
The following object is masked from 'package:Hmisc':

    impute
# A tibble: 15 x 5
   subject treat diabetes sbp.before sbp.after
     <int> <fct>    <dbl>      <dbl>     <dbl>
 1     101 A            1        120       105
 2     102 B            0        145       135
 3     103 C            0        150       150
 4     104 A            1         NA       120
 5     105 C           NA        155       135
 6     106 A            1         NA       115
 7     107 A            0        135       160
 8     108 <NA>         1         NA       150
 9     109 B           NA        115       130
10     110 C            1        170       155
11     111 A            0        150       140
12     112 B            0        145       140
13     113 C            1        140       150
14     114 A            1        160       135
15     115 B           NA        135       120

50.1.1 How many missing values do we have in each column?

   subject      treat   diabetes sbp.before  sbp.after 
         0          1          3          3          0 

We are missing one treat, 3 diabetes and 3 sbp.before values.

50.1.2 What is the pattern of missing data?

  subject sbp.after treat diabetes sbp.before  
9       1         1     1        1          1 0
2       1         1     1        1          0 1
3       1         1     1        0          1 1
1       1         1     0        1          0 2
        0         0     1        3          3 7

We have nine subjects with complete data, three subjects with missing diabetes (only), two subjects with missing sbp.before (only), and 1 subject with missing treat and sbp.before.

50.1.3 How can we identify the subjects with missing data?

# A tibble: 6 x 5
  subject treat diabetes sbp.before sbp.after
    <int> <fct>    <dbl>      <dbl>     <dbl>
1     104 A            1         NA       120
2     105 C           NA        155       135
3     106 A            1         NA       115
4     108 <NA>         1         NA       150
5     109 B           NA        115       130
6     115 B           NA        135       120

50.2 Missing-data mechanisms

My source for this description of mechanisms is Chapter 25 of Gelman and Hill (2007), and that chapter is available at this link.

  1. MCAR = Missingness completely at random. A variable is missing completely at random if the probability of missingness is the same for all units, for example, if for each subject, we decide whether to collect the diabetes status by rolling a die and refusing to answer if a “6” shows up. If data are missing completely at random, then throwing out cases with missing data does not bias your inferences.
  2. Missingness that depends only on observed predictors. A more general assumption, called missing at random or MAR, is that the probability a variable is missing depends only on available information. Here, we would have to be willing to assume that the probability of nonresponse to diabetes depends only on the other, fully recorded variables in the data. It is often reasonable to model this process as a logistic regression, where the outcome variable equals 1 for observed cases and 0 for missing. When an outcome variable is missing at random, it is acceptable to exclude the missing cases (that is, to treat them as NA), as long as the regression controls for all the variables that affect the probability of missingness.
  3. Missingness that depends on unobserved predictors. Missingness is no longer “at random” if it depends on information that has not been recorded and this information also predicts the missing values. If a particular treatment causes discomfort, a patient is more likely to drop out of the study. This missingness is not at random (unless “discomfort” is measured and observed for all patients). If missingness is not at random, it must be explicitly modeled, or else you must accept some bias in your inferences.
  4. Missingness that depends on the missing value itself. Finally, a particularly difficult situation arises when the probability of missingness depends on the (potentially missing) variable itself. For example, suppose that people with higher earnings are less likely to reveal them.

Essentially, situations 3 and 4 are referred to collectively as non-random missingness, and cause more trouble for us than 1 and 2.

50.3 Options for Dealing with Missingness

There are several available methods for dealing with missing data that are MCAR or MAR, but they basically boil down to:

  • Complete Case (or Available Case) analyses
  • Single Imputation
  • Multiple Imputation

50.4 Complete Case (and Available Case) analyses

In Complete Case analyses, rows containing NA values are omitted from the data before analyses commence. This is the default approach for many statistical software packages, and may introduce unpredictable bias and fail to include some useful, often hard-won information.

  • A complete case analysis can be appropriate when the number of missing observations is not large, and the missing pattern is either MCAR (missing completely at random) or MAR (missing at random.)
  • Two problems arise with complete-case analysis:
    1. If the units with missing values differ systematically from the completely observed cases, this could bias the complete-case analysis.
    2. If many variables are included in a model, there may be very few complete cases, so that most of the data would be discarded for the sake of a straightforward analysis.
  • A related approach is available-case analysis where different aspects of a problem are studied with different subsets of the data, perhaps identified on the basis of what is missing in them.

50.5 Single Imputation

In single imputation analyses, NA values are estimated/replaced one time with one particular data value for the purpose of obtaining more complete samples, at the expense of creating some potential bias in the eventual conclusions or obtaining slightly less accurate estimates than would be available if there were no missing values in the data.

  • A single imputation can be just a replacement with the mean or median (for a quantity) or the mode (for a categorical variable.) However, such an approach, though easy to understand, underestimates variance and ignores the relationship of missing values to other variables.
  • Single imputation can also be done using a variety of models to try to capture information about the NA values that are available in other variables within the data set.
  • The simputation package can help us execute single imputations using a wide variety of techniques, within the pipe approach used by the tidyverse. Another approach I have used in the past is the mice package, which can also perform single imputations.

50.6 Multiple Imputation

Multiple imputation, where NA values are repeatedly estimated/replaced with multiple data values, for the purpose of obtaining mode complete samples and capturing details of the variation inherent in the fact that the data have missingness, so as to obtain more accurate estimates than are possible with single imputation.

  • We’ll postpone the discussion of multiple imputation for a while.

50.7 Building a Complete Case Analysis

We can drop all of the missing values from a data set with drop_na or with na.omit or by filtering for complete.cases. Any of these approaches produces the same result - a new data set with 9 rows (after dropping the six subjects with any NA values) and 5 columns.

50.8 Single Imputation with the Mean or Mode

The most straightforward approach to single imputation is to impute a single summary of the variable, such as the mean, median or mode.

Skim summary statistics
 n obs: 15 
 n variables: 5 

-- Variable type:factor ---------------------------------------------------
 variable missing complete  n n_unique              top_counts ordered
    treat       1       14 15        3 A: 6, B: 4, C: 4, NA: 1   FALSE

-- Variable type:integer --------------------------------------------------
 variable missing complete  n mean   sd  p0   p25 p50   p75 p100     hist
  subject       0       15 15  108 4.47 101 104.5 108 111.5  115 <U+2587><U+2587><U+2587><U+2587><U+2583><U+2587><U+2587><U+2587>

-- Variable type:numeric --------------------------------------------------
   variable missing complete  n   mean    sd  p0 p25 p50    p75 p100
   diabetes       3       12 15   0.58  0.51   0   0   1   1       1
  sbp.after       0       15 15 136    15.83 105 125 135 150     160
 sbp.before       3       12 15 143.33 15.72 115 135 145 151.25  170
     hist
 <U+2586><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2587>
 <U+2582><U+2582><U+2585><U+2582><U+2587><U+2585><U+2587><U+2585>
 <U+2585><U+2581><U+2585><U+2582><U+2585><U+2587><U+2582><U+2582>

Here, suppose we decide to impute

  • sbp.before with the mean (143.33) among non-missing values,
  • diabetes with its median (1) among non-missing values, and
  • treat with its most common value, or mode (A)
# A tibble: 15 x 5
   subject treat diabetes sbp.before sbp.after
     <int> <fct>    <dbl>      <dbl>     <dbl>
 1     101 A            1       120        105
 2     102 B            0       145        135
 3     103 C            0       150        150
 4     104 A            1       143.       120
 5     105 C            1       155        135
 6     106 A            1       143.       115
 7     107 A            0       135        160
 8     108 A            1       143.       150
 9     109 B            1       115        130
10     110 C            1       170        155
11     111 A            0       150        140
12     112 B            0       145        140
13     113 C            1       140        150
14     114 A            1       160        135
15     115 B            1       135        120

We could accomplish the same thing with, for example:

50.9 Doing Single Imputation with simputation

Single imputation is a potentially appropriate method when missingness can be assumed to be either completely at random (MCAR) or dependent only on observed predictors (MAR). We’ll use the simputation package to accomplish it.

50.9.1 Mirroring Our Prior Approach (imputing means/medians/modes)

Suppose we want to mirror what we did above, simply impute the mean for sbp.before and the median for diabetes again.

# A tibble: 15 x 5
   subject treat diabetes sbp.before sbp.after
 *   <int> <fct>    <dbl>      <dbl>     <dbl>
 1     101 A            1       120        105
 2     102 B            0       145        135
 3     103 C            0       150        150
 4     104 A            1       143.       120
 5     105 C            1       155        135
 6     106 A            1       143.       115
 7     107 A            0       135        160
 8     108 A            1       143.       150
 9     109 B            1       115        130
10     110 C            1       170        155
11     111 A            0       150        140
12     112 B            0       145        140
13     113 C            1       140        150
14     114 A            1       160        135
15     115 B            1       135        120

50.9.2 Using a model to impute sbp.before and diabetes

Suppose we wanted to use:

  • a robust linear model to predict sbp.before missing values, on the basis of sbp.after and diabetes status, and
  • a predictive mean matching approach to predict diabetes status, on the basis of sbp.after, and
  • a decision tree approach to predict treat status, using all other variables in the data
# A tibble: 15 x 5
   subject treat diabetes sbp.before sbp.after
 *   <int> <fct>    <dbl>      <dbl>     <dbl>
 1     101 A            1       120        105
 2     102 B            0       145        135
 3     103 C            0       150        150
 4     104 A            1       139.       120
 5     105 C            1       155        135
 6     106 A            1       136.       115
 7     107 A            0       135        160
 8     108 A            1       155.       150
 9     109 B            1       115        130
10     110 C            1       170        155
11     111 A            0       150        140
12     112 B            0       145        140
13     113 C            1       140        150
14     114 A            1       160        135
15     115 B            1       135        120

Details on the many available methods in simputation are provided in its manual. These include:

  • impute_cart uses a Classification and Regression Tree approach for numerical or categorical data. There is also an impute_rf command which uses Random Forests for imputation.
  • impute_pmm is one of several “hot deck” options for imputation, this one is predictive mean matching, which can be used with numeric data (only). Missing values are first imputed using a predictive model. Next, these predictions are replaced with the observed values which are nearest to the prediction. Other imputation options in this group include random hot deck, sequential hot deck and k-nearest neighbor imputation.
  • impute_rlm is one of several regression imputation methods, including linear models, robust linear models (which use what is called M-estimation to impute numerical variables) and lasso/elastic net/ridge regression models.

The simputation package can also do EM-based multivariate imputation, and multivariate random forest imputation, and several other approaches.

References

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel-Hierarchical Models. New York: Cambridge University Press. http://www.stat.columbia.edu/~gelman/arm/.