11  Analysis of Covariance

11.1 R Setup Used Here

knitr::opts_chunk$set(comment = NA)

library(janitor)
library(broom)
library(mosaic)
library(tidyverse) 

theme_set(theme_bw())

11.1.1 Data Load

emphysema <- read_csv("data/emphysema.csv", show_col_types = FALSE) 

11.2 An Emphysema Study

My source for this example is Riffenburgh (2006), section 18.4. Serum theophylline levels (in mg/dl) were measured in 16 patients with emphysema at baseline, then 5 days later (at the end of a course of antibiotics) and then at 10 days after baseline. Clinicians anticipate that the antibiotic will increase the theophylline level. The data are stored in the emphysema.csv data file, and note that the age for patient 5 is not available.

11.2.1 Codebook

Variable Description
patient ID code
age patient’s age in years
sex patient’s sex (F or M)
st_base patient’s serum theophylline at baseline (mg/dl)
st_day5 patient’s serum theophylline at day 5 (mg/dl)
st_day10 patient’s serum theophylline at day 10 (mg/dl)

We’re going to look at the change from baseline to day 5 as our outcome of interest, since the clinical expectation is that the antibiotic (azithromycin) will increase theophylline levels.

emphysema <- emphysema |> 
    mutate(st_delta = st_day5 - st_base)

emphysema
# A tibble: 16 × 7
   patient   age sex   st_base st_day5 st_day10 st_delta
     <dbl> <dbl> <chr>   <dbl>   <dbl>    <dbl>    <dbl>
 1       1    61 F        14.1     2.3     10.3  -11.8  
 2       2    70 F         7.2     5.4      7.3   -1.8  
 3       3    65 M        14.2    11.9     11.3   -2.3  
 4       4    65 M        10.3    10.7     13.8    0.400
 5       5    NA M         9.9    10.7     11.7    0.800
 6       6    76 M         5.2     6.8      4.2    1.6  
 7       7    72 M        10.4    14.6     14.1    4.2  
 8       8    69 F        10.5     7.2      5.4   -3.3  
 9       9    66 M         5       5        5.1    0    
10      10    62 M         8.6     8.1      7.4   -0.5  
11      11    65 F        16.6    14.9     13     -1.70 
12      12    71 M        16.4    18.6     17.1    2.20 
13      13    51 F        12.2    11       12.3   -1.2  
14      14    71 M         6.6     3.7      4.5   -2.9  
15      15    64 F        15.4    15.2     13.6   -0.200
16      16    50 M        10.2    10.8     11.2    0.600

11.3 Does sex affect the mean change in theophylline?

favstats(~ st_delta, data = emphysema)
   min     Q1 median   Q3 max     mean       sd  n missing
 -11.8 -1.925  -0.35 0.65 4.2 -0.99375 3.484149 16       0
favstats(st_delta ~ sex, data = emphysema)
  sex   min     Q1 median     Q3  max      mean       sd  n missing
1   F -11.8 -2.925  -1.75 -1.325 -0.2 -3.333333 4.267864  6       0
2   M  -2.9 -0.375   0.50  1.400  4.2  0.410000 2.067446 10       0

Overall, the mean change in theophylline during the course of the antibiotic is -0.99, but this is -3.33 for female patients and 0.41 for male patients.

A one-way ANOVA model looks like this:

anova(lm(st_delta ~ sex, data = emphysema))
Analysis of Variance Table

Response: st_delta
          Df  Sum Sq Mean Sq F value  Pr(>F)  
sex        1  52.547  52.547  5.6789 0.03189 *
Residuals 14 129.542   9.253                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA F test finds a fairly large difference between the mean st_delta among males and the mean st_delta among females. But is there more to the story?

11.4 Is there an association between age and sex in this study?

favstats(age ~ sex, data = emphysema)
  sex min    Q1 median Q3 max     mean       sd n missing
1   F  51 61.75   64.5 68  70 63.33333 6.889606 6       0
2   M  50 65.00   66.0 71  76 66.44444 7.568208 9       1

But we note that the male patients are also older than the female patients, on average (mean age for males is 66.4, for females 63.3)

  • Does the fact that male patients are older affect change in theophylline level?
  • And how should we deal with the one missing age value (in a male patient)?

11.5 Adding a quantitative covariate, age, to the model

We could fit an ANOVA model to predict st_delta using sex and age directly, but only if we categorized age into two or more groups. Because age is not categorical, we cannot include it in an ANOVA. But if age is an influence, and we don’t adjust for it, it may well bias the outcome of our initial ANOVA. With a quantitative variable like age, we will need a method called ANCOVA, for analysis of covariance.

11.5.1 The ANCOVA model

ANCOVA in this case is just an ANOVA model with our outcome (st_delta) adjusted for a continuous covariate, called age. For the moment, we’ll ignore the one subject with missing age and simply fit the regression model with sex and age.

summary(lm(st_delta ~ sex + age, data = emphysema))

Call:
lm(formula = st_delta ~ sex + age, data = emphysema)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3352 -0.4789  0.6948  1.5580  3.5202 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) -6.90266    7.92948  -0.871   0.4011  
sexM         3.52466    1.75815   2.005   0.0681 .
age          0.05636    0.12343   0.457   0.6561  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.255 on 12 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.2882,    Adjusted R-squared:  0.1696 
F-statistic:  2.43 on 2 and 12 DF,  p-value: 0.13

This model assumes that the slope of the regression line between st_delta and age is the same for both sexes.

Note that the model yields st_delta = -6.9 + 3.52 (sex = male) + 0.056 age, or

  • st_delta = -6.9 + 0.056 age for female patients, and
  • st_delta = (-6.9 + 3.52) + 0.056 age = -3.38 + 0.056 age for male patients.

Note that we can test this assumption of equal slopes by fitting an alternative model (with a product term between sex and age) that doesn’t require the assumption, and we’ll do that later.

11.5.2 The ANCOVA Table

First, though, we’ll look at the ANCOVA table.

anova(lm(st_delta ~ sex + age, data = emphysema))
Analysis of Variance Table

Response: st_delta
          Df  Sum Sq Mean Sq F value  Pr(>F)  
sex        1  49.284  49.284  4.6507 0.05203 .
age        1   2.209   2.209  0.2085 0.65612  
Residuals 12 127.164  10.597                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When we tested sex without accounting for age, we found a p value of 0.032, which is less than our usual cutpoint of 0.05. But when we adjusted for age, we find that sex’s p value rises, even though age doesn’t seem to have a particularly strong influence on st_delta by itself, according to the ANCOVA table.

11.6 Rerunning the ANCOVA model after simple imputation

We could have imputed the missing age value for patient 5, rather than just deleting that patient. Suppose we do the simplest potentially reasonable thing to do: insert the mean age in where the NA value currently exists.

emph_imp <- replace_na(emphysema, list(age = mean(emphysema$age, na.rm = TRUE)))

emph_imp
# A tibble: 16 × 7
   patient   age sex   st_base st_day5 st_day10 st_delta
     <dbl> <dbl> <chr>   <dbl>   <dbl>    <dbl>    <dbl>
 1       1  61   F        14.1     2.3     10.3  -11.8  
 2       2  70   F         7.2     5.4      7.3   -1.8  
 3       3  65   M        14.2    11.9     11.3   -2.3  
 4       4  65   M        10.3    10.7     13.8    0.400
 5       5  65.2 M         9.9    10.7     11.7    0.800
 6       6  76   M         5.2     6.8      4.2    1.6  
 7       7  72   M        10.4    14.6     14.1    4.2  
 8       8  69   F        10.5     7.2      5.4   -3.3  
 9       9  66   M         5       5        5.1    0    
10      10  62   M         8.6     8.1      7.4   -0.5  
11      11  65   F        16.6    14.9     13     -1.70 
12      12  71   M        16.4    18.6     17.1    2.20 
13      13  51   F        12.2    11       12.3   -1.2  
14      14  71   M         6.6     3.7      4.5   -2.9  
15      15  64   F        15.4    15.2     13.6   -0.200
16      16  50   M        10.2    10.8     11.2    0.600

More on simple imputation and missing data is coming soon.

For now, we can rerun the ANCOVA model on this new data set, after imputation…

anova(lm(st_delta ~ sex + age, data = emph_imp))
Analysis of Variance Table

Response: st_delta
          Df  Sum Sq Mean Sq F value  Pr(>F)  
sex        1  52.547  52.547  5.3623 0.03755 *
age        1   2.151   2.151  0.2195 0.64721  
Residuals 13 127.392   9.799                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When we do this, we see that now the sex variable returns to a p value below 0.05. Our complete case analysis (which omitted patient 5) gives us a different result than the ANCOVA based on the data after mean imputation.

11.7 Looking at a factor-covariate interaction

Let’s run a model including the interaction (product) term between age and sex, which implies that the slope of age on our outcome (st_delta) depends on the patient’s sex. We’ll use the imputed data again. Here is the new ANCOVA table, which suggests that the interaction of age and sex is small (because it accounts for only a small amount of the total Sum of Squares) with a p value of 0.91.

anova(lm(st_delta ~ sex * age, data = emph_imp))
Analysis of Variance Table

Response: st_delta
          Df  Sum Sq Mean Sq F value  Pr(>F)  
sex        1  52.547  52.547  4.9549 0.04594 *
age        1   2.151   2.151  0.2028 0.66051  
sex:age    1   0.130   0.130  0.0123 0.91355  
Residuals 12 127.261  10.605                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the interaction term isn’t accounting for much variation, we probably don’t need it here. But let’s look at its interpretation anyway, just to fix ideas. To do that, we’ll need the coefficients from the underlying regression model.

tidy(lm(st_delta ~ sex * age, data = emph_imp))
# A tibble: 4 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)  -5.65      13.5      -0.420   0.682
2 sexM          1.72      16.8       0.102   0.920
3 age           0.0365     0.211     0.173   0.866
4 sexM:age      0.0289     0.260     0.111   0.914

Our ANCOVA model for st_delta incorporating the age x sex product term is -5.65 + 1.72 (sex = M) + 0.037 age + 0.029 (sex = M)(age). So that means:

  • our model for females is st_delta = -5.65 + 0.037 age
  • our model for males is st_delta = (-5.65 + 1.72) + (0.037 + 0.029) age, or -3.93 + 0.066 age

but, again, our conclusion from the ANCOVA table is that this increase in complexity (letting both the slope and intercept vary by sex) doesn’t add much in the way of predictive value for our st_delta outcome.

11.8 Centering the Covariate to Facilitate ANCOVA Interpretation

When developing an ANCOVA model, we will often center or even center and rescale the covariate to facilitate interpretation of the product term. In this case, let’s center age and rescale it by dividing by two standard deviations.

favstats(~ age, data = emph_imp)
 min   Q1 median    Q3 max mean       sd  n missing
  50 63.5   65.1 70.25  76 65.2 6.978061 16       0

Note that in our imputed data, the mean age is 65.2 and the standard deviation of age is 7 years.

So we build the rescaled age variable that I’ll call age_z, and then use it to refit our model.

emph_imp <- emph_imp |>
    mutate(age_z = (age - mean(age))/ (2 * sd(age)))

anova(lm(st_delta ~ sex * age_z, data = emph_imp))
Analysis of Variance Table

Response: st_delta
          Df  Sum Sq Mean Sq F value  Pr(>F)  
sex        1  52.547  52.547  4.9549 0.04594 *
age_z      1   2.151   2.151  0.2028 0.66051  
sex:age_z  1   0.130   0.130  0.0123 0.91355  
Residuals 12 127.261  10.605                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tidy(lm(st_delta ~ sex * age_z, data = emph_imp))
# A tibble: 4 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)   -3.27       1.39    -2.35   0.0364
2 sexM           3.60       1.74     2.08   0.0601
3 age_z          0.510      2.95     0.173  0.866 
4 sexM:age_z     0.403      3.63     0.111  0.914 

Comparing the two models, we have:

  • (unscaled): st_delta = -5.65 + 1.72 (sex = M) + 0.037 age + 0.029 (sex = M) x (age)
  • (rescaled): st_delta = -3.27 + 3.60 (sex = M) + 0.510 rescaled age_z + 0.402 (sex = M) x (rescaled age_z)

In essence, the rescaled model on age_z is:

  • st_delta = -3.27 + 0.510 age_z for female subjects, and
  • st_delta = (-3.27 + 3.60) + (0.510 + 0.402) age_z = 0.33 + 0.912 age_z for male subjects

Interpreting the centered, rescaled model, we have:

  • no change in the ANOVA results or R-squared or residual standard deviation compared to the uncentered, unscaled model, but
  • the intercept (-3.27) now represents the st_delta for a female of average age,
  • the sex slope (3.60) represents the (male - female) difference in predicted st_delta for a person of average age,
  • the age_z slope (0.510) represents the difference in predicted st_delta for a female one standard deviation older than the mean age as compared to a female one standard deviation younger than the mean age, and
  • the product term’s slope (0.402) represents the male - female difference in the slope of age_z, so that if you add the age_z slope (0.510) and the interaction slope (0.402) you see the difference in predicted st_delta for a male one standard deviation older than the mean age as compared to a male one standard deviation younger than the mean age.