Chapter 4 Analysis of Covariance
4.1 An Emphysema Study
My source for this example is Riffenburgh (2006), section 18.4. Serum theophylline levels (in mg/dl) were measured in 16 patients with emphysema at baseline, then 5 days later (at the end of a course of antibiotics) and then at 10 days after baseline. Clinicians anticipate that the antibiotic will increase the theophylline level. The data are stored in the emphysema.csv
data file, and note that the age for patient 5 is not available.
4.1.1 Codebook
Variable | Description |
---|---|
patient |
ID code |
age |
patient’s age in years |
sex |
patient’s sex (F or M) |
st_base |
patient’s serum theophylline at baseline (mg/dl) |
st_day5 |
patient’s serum theophylline at day 5 (mg/dl) |
st_day10 |
patient’s serum theophylline at day 10 (mg/dl) |
We’re going to look at the change from baseline to day 5 as our outcome of interest, since the clinical expectation is that the antibiotic (azithromycin) will increase theophylline levels.
emphysema <- emphysema %>%
mutate(st_delta = st_day5 - st_base)
emphysema
# A tibble: 16 x 7
patient age sex st_base st_day5 st_day10 st_delta
<int> <int> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 61 F 14.1 2.3 10.3 -11.8
2 2 70 F 7.2 5.4 7.3 -1.8
3 3 65 M 14.2 11.9 11.3 -2.30
4 4 65 M 10.3 10.7 13.8 0.400
5 5 NA M 9.9 10.7 11.7 0.800
6 6 76 M 5.2 6.8 4.2 1.60
7 7 72 M 10.4 14.6 14.1 4.20
8 8 69 F 10.5 7.2 5.4 -3.3
9 9 66 M 5 5 5.1 0
10 10 62 M 8.6 8.1 7.4 -0.5
11 11 65 F 16.6 14.9 13 -1.7
12 12 71 M 16.4 18.6 17.1 2.2
13 13 51 F 12.2 11 12.3 -1.20
14 14 71 M 6.6 3.7 4.5 -2.90
15 15 64 F 15.4 15.2 13.6 -0.2
16 16 50 M 10.2 10.8 11.2 0.6
4.2 Does sex
affect the mean change in theophylline?
emphysema %>% skim(st_delta)
Skim summary statistics
n obs: 16
n variables: 7
-- Variable type:numeric ----------------------------------------------------------
variable missing complete n mean sd p0 p25 p50 p75 p100
st_delta 0 16 16 -0.99 3.48 -11.8 -1.92 -0.35 0.65 4.2
emphysema %>% group_by(sex) %>% skim(st_delta)
Skim summary statistics
n obs: 16
n variables: 7
group variables: sex
-- Variable type:numeric ----------------------------------------------------------
sex variable missing complete n mean sd p0 p25 p50 p75 p100
F st_delta 0 6 6 -3.33 4.27 -11.8 -2.92 -1.75 -1.32 -0.2
M st_delta 0 10 10 0.41 2.07 -2.9 -0.38 0.5 1.4 4.2
Overall, the mean change in theophylline during the course of the antibiotic is -0.99, but this is -3.33 for female patients and 0.41 for male patients.
A one-way ANOVA model looks like this:
anova(lm(st_delta ~ sex, data = emphysema))
Analysis of Variance Table
Response: st_delta
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 52.547 52.547 5.6789 0.03189 *
Residuals 14 129.542 9.253
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA F test finds a statistically significant difference between the mean st_delta
among males and the mean st_delta
among females. But is there more to the story?
4.3 Is there an association between age
and sex
in this study?
emphysema %>% group_by(sex) %>% skim(age)
Skim summary statistics
n obs: 16
n variables: 7
group variables: sex
-- Variable type:integer ----------------------------------------------------------
sex variable missing complete n mean sd p0 p25 p50 p75 p100
F age 0 6 6 63.33 6.89 51 61.75 64.5 68 70
M age 1 9 10 66.44 7.57 50 65 66 71 76
But we note that the male patients are also older than the female patients, on average (mean age for males is 66.4, for females 63.3)
- Does the fact that male patients are older affect change in theophylline level?
- And how should we deal with the one missing
age
value (in a male patient)?
4.4 Adding a quantitative covariate, age
, to the model
We could fit an ANOVA model to predict st_delta
using sex
and age
directly, but only if we categorized age
into two or more groups. Because age
is not categorical, we cannot include it in an ANOVA. But if age is an influence, and we don’t adjust for it, it may well bias the outcome of our initial ANOVA. With a quantitative variable like age
, we will need a method called ANCOVA, for analysis of covariance.
4.4.1 The ANCOVA model
ANCOVA in this case is just an ANOVA model with our outcome (st_delta
) adjusted for a continuous covariate, called age
. For the moment, we’ll ignore the one subject with missing age
and simply fit the regression model with sex
and age
.
summary(lm(st_delta ~ sex + age, data = emphysema))
Call:
lm(formula = st_delta ~ sex + age, data = emphysema)
Residuals:
Min 1Q Median 3Q Max
-8.3352 -0.4789 0.6948 1.5580 3.5202
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.90266 7.92948 -0.871 0.4011
sexM 3.52466 1.75815 2.005 0.0681 .
age 0.05636 0.12343 0.457 0.6561
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.255 on 12 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.2882, Adjusted R-squared: 0.1696
F-statistic: 2.43 on 2 and 12 DF, p-value: 0.13
This model assumes that the slope of the regression line between st_delta
and age
is the same for both sexes.
Note that the model yields st_delta
= -6.9 + 3.52 (sex
= male) + 0.056 age
, or
st_delta
= -6.9 + 0.056age
for female patients, andst_delta
= (-6.9 + 3.52) + 0.056age
= -3.38 + 0.056age
for male patients.
Note that we can test this assumption of equal slopes by fitting an alternative model (with a product term between sex
and age
) that doesn’t require the assumption, and we’ll do that later.
4.4.2 The ANCOVA Table
First, though, we’ll look at the ANCOVA table.
anova(lm(st_delta ~ sex + age, data = emphysema))
Analysis of Variance Table
Response: st_delta
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 49.284 49.284 4.6507 0.05203 .
age 1 2.209 2.209 0.2085 0.65612
Residuals 12 127.164 10.597
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
When we tested sex
without accounting for age
, we found a p value of 0.032, which is less than our usual cutpoint of 0.05. But when we adjusted for age
, we find that sex
loses significance, even though age
is not a significant influence on st_delta
by itself, according to the ANCOVA table.
4.5 Rerunning the ANCOVA model after simple imputation
We could have imputed the missing age
value for patient 5, rather than just deleting that patient. Suppose we do the simplest potentially reasonable thing to do: insert the mean age
in where the NA value currently exists.
emph_imp <- replace_na(emphysema, list(age = mean(emphysema$age, na.rm = TRUE)))
emph_imp
# A tibble: 16 x 7
patient age sex st_base st_day5 st_day10 st_delta
<int> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 61 F 14.1 2.3 10.3 -11.8
2 2 70 F 7.2 5.4 7.3 -1.8
3 3 65 M 14.2 11.9 11.3 -2.30
4 4 65 M 10.3 10.7 13.8 0.400
5 5 65.2 M 9.9 10.7 11.7 0.800
6 6 76 M 5.2 6.8 4.2 1.60
7 7 72 M 10.4 14.6 14.1 4.20
8 8 69 F 10.5 7.2 5.4 -3.3
9 9 66 M 5 5 5.1 0
10 10 62 M 8.6 8.1 7.4 -0.5
11 11 65 F 16.6 14.9 13 -1.7
12 12 71 M 16.4 18.6 17.1 2.2
13 13 51 F 12.2 11 12.3 -1.20
14 14 71 M 6.6 3.7 4.5 -2.90
15 15 64 F 15.4 15.2 13.6 -0.2
16 16 50 M 10.2 10.8 11.2 0.6
More on simple imputation and missing data is coming soon.
For now, we can rerun the ANCOVA model on this new data set, after imputation…
anova(lm(st_delta ~ sex + age, data = emph_imp))
Analysis of Variance Table
Response: st_delta
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 52.547 52.547 5.3623 0.03755 *
age 1 2.151 2.151 0.2195 0.64721
Residuals 13 127.392 9.799
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
When we do this, we see that now the sex
variable returns to a p value below 0.05. Our complete case analysis (which omitted patient 5) gives us a different result than the ANCOVA based on the data after mean imputation.
4.6 Looking at a factor-covariate interaction
Let’s run a model including the interaction (product) term between age
and sex
, which implies that the slope of age
on our outcome (st_delta
) depends on the patient’s sex. We’ll use the imputed data again. Here is the new ANCOVA table, which suggests that the interaction of age
and sex
is small (because it accounts for only a small amount of the total Sum of Squares) and not significant (p = 0.91).
anova(lm(st_delta ~ sex * age, data = emph_imp))
Analysis of Variance Table
Response: st_delta
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 52.547 52.547 4.9549 0.04594 *
age 1 2.151 2.151 0.2028 0.66051
sex:age 1 0.130 0.130 0.0123 0.91355
Residuals 12 127.261 10.605
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the interaction term is neither substantial nor significant, we probably don’t need it here. But let’s look at its interpretation anyway, just to fix ideas. To do that, we’ll need the coefficients from the underlying regression model.
tidy(lm(st_delta ~ sex * age, data = emph_imp))
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -5.65 13.5 -0.420 0.682
2 sexM 1.72 16.8 0.102 0.920
3 age 0.0365 0.211 0.173 0.866
4 sexM:age 0.0289 0.260 0.111 0.914
Our ANCOVA model for st_delta
incorporating the age
x sex
product term is -5.65 + 1.72 (sex = M) + 0.037 age + 0.029 (sex = M)(age). So that means:
- our model for females is
st_delta
= -5.65 + 0.037age
- our model for males is
st_delta
= (-5.65 + 1.72) + (0.037 + 0.029)age
, or -3.93 + 0.066age
but, again, our conclusion from the ANCOVA table is that this increase in complexity (letting both the slope and intercept vary by sex
) doesn’t add much in the way of predictive value for our st_delta
outcome.
4.7 Centering the Covariate to Facilitate ANCOVA Interpretation
When developing an ANCOVA model, we will often center or even center and rescale the covariate to facilitate interpretation of the product term. In this case, let’s center age
and rescale it by dividing by two standard deviations.
emph_imp %>% skim(age)
Skim summary statistics
n obs: 16
n variables: 7
-- Variable type:numeric ----------------------------------------------------------
variable missing complete n mean sd p0 p25 p50 p75 p100
age 0 16 16 65.2 6.98 50 63.5 65.1 70.25 76
Note that in our imputed data, the mean age
is 65.2 and the standard deviation of age
is 7 years.
So we build the rescaled age
variable that I’ll call age_z
, and then use it to refit our model.
emph_imp <- emph_imp %>%
mutate(age_z = (age - mean(age))/ (2 * sd(age)))
anova(lm(st_delta ~ sex * age_z, data = emph_imp))
Analysis of Variance Table
Response: st_delta
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 52.547 52.547 4.9549 0.04594 *
age_z 1 2.151 2.151 0.2028 0.66051
sex:age_z 1 0.130 0.130 0.0123 0.91355
Residuals 12 127.261 10.605
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tidy(lm(st_delta ~ sex * age_z, data = emph_imp))
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -3.27 1.39 -2.35 0.0364
2 sexM 3.60 1.74 2.08 0.0601
3 age_z 0.510 2.95 0.173 0.866
4 sexM:age_z 0.403 3.63 0.111 0.914
Comparing the two models, we have:
- (unscaled):
st_delta
= -5.65 + 1.72 (sex
= M) + 0.037age
+ 0.029 (sex
= M) x (age
) - (rescaled):
st_delta
= -3.27 + 3.60 (sex
= M) + 0.510 rescaledage_z
+ 0.402 (sex
= M) x (rescaledage_z
)
In essence, the rescaled model on age_z
is:
st_delta
= -3.27 + 0.510age_z
for female subjects, andst_delta
= (-3.27 + 3.60) + (0.510 + 0.402)age_z
= 0.33 + 0.912age_z
for male subjects
Interpreting the centered, rescaled model, we have:
- no change in the ANOVA results or R-squared or residual standard deviation compared to the uncentered, unscaled model, but
- the intercept (-3.27) now represents the
st_delta
for a female of average age, - the
sex
slope (3.60) represents the (male - female) difference in predictedst_delta
for a person of average age, - the
age_z
slope (0.510) represents the difference in predictedst_delta
for a female one standard deviation older than the mean age as compared to a female one standard deviation younger than the mean age, and - the product term’s slope (0.402) represents the male - female difference in the slope of
age_z
, so that if you add theage_z
slope (0.510) and the interaction slope (0.402) you see the difference in predictedst_delta
for a male one standard deviation older than the mean age as compared to a male one standard deviation younger than the mean age.
References
Riffenburgh, Robert H. 2006. Statistics in Medicine. Second Edition. Burlington, MA: Elsevier Academic Press.