# Chapter 7 Analysis of Covariance

## 7.1 An Emphysema Study

My source for this example is Riffenburgh (2006), section 18.4. Serum theophylline levels (in mg/dl) were measured in 16 patients with emphysema at baseline, then 5 days later (at the end of a course of antibiotics) and then at 10 days after baseline. Clinicians anticipate that the antibiotic will increase the theophylline level. The data are stored in the `emphysema.csv`

data file, and note that the age for patient 5 is not available.

### 7.1.1 Codebook

Variable | Description |
---|---|

`patient` |
ID code |

`age` |
patient’s age in years |

`sex` |
patient’s sex (F or M) |

`st_base` |
patient’s serum theophylline at baseline (mg/dl) |

`st_day5` |
patient’s serum theophylline at day 5 (mg/dl) |

`st_day10` |
patient’s serum theophylline at day 10 (mg/dl) |

We’re going to look at the change from baseline to day 5 as our outcome of interest, since the clinical expectation is that the antibiotic (azithromycin) will increase theophylline levels.

```
# A tibble: 16 x 7
patient age sex st_base st_day5 st_day10 st_delta
<int> <int> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 61 F 14.1 2.3 10.3 -11.8
2 2 70 F 7.2 5.4 7.3 -1.8
3 3 65 M 14.2 11.9 11.3 -2.30
4 4 65 M 10.3 10.7 13.8 0.400
5 5 NA M 9.9 10.7 11.7 0.800
6 6 76 M 5.2 6.8 4.2 1.60
7 7 72 M 10.4 14.6 14.1 4.20
8 8 69 F 10.5 7.2 5.4 -3.3
9 9 66 M 5 5 5.1 0
10 10 62 M 8.6 8.1 7.4 -0.5
11 11 65 F 16.6 14.9 13 -1.7
12 12 71 M 16.4 18.6 17.1 2.2
13 13 51 F 12.2 11 12.3 -1.20
14 14 71 M 6.6 3.7 4.5 -2.90
15 15 64 F 15.4 15.2 13.6 -0.2
16 16 50 M 10.2 10.8 11.2 0.6
```

## 7.2 Does `sex`

affect the mean change in theophylline?

```
min Q1 median Q3 max mean sd n missing
-11.8 -1.925 -0.35 0.65 4.2 -0.99375 3.484149 16 0
```

```
sex min Q1 median Q3 max mean sd n missing
1 F -11.8 -2.925 -1.75 -1.325 -0.2 -3.333333 4.267864 6 0
2 M -2.9 -0.375 0.50 1.400 4.2 0.410000 2.067446 10 0
```

Overall, the mean change in theophylline during the course of the antibiotic is -0.99, but this is -3.33 for female patients and 0.41 for male patients.

A one-way ANOVA model looks like this:

```
Analysis of Variance Table
Response: st_delta
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 52.547 52.547 5.6789 0.03189 *
Residuals 14 129.542 9.253
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The ANOVA F test finds a statistically significant difference between the mean `st_delta`

among males and the mean `st_delta`

among females. But is there more to the story?

## 7.3 Is there an association between `age`

and `sex`

in this study?

```
sex min Q1 median Q3 max mean sd n missing
1 F 51 61.75 64.5 68 70 63.33333 6.889606 6 0
2 M 50 65.00 66.0 71 76 66.44444 7.568208 9 1
```

But we note that the male patients are also older than the female patients, on average (mean age for males is 66.4, for females 63.3)

- Does the fact that male patients are older affect change in theophylline level?
- And how should we deal with the one missing
`age`

value (in a male patient)?

## 7.4 Adding a quantitative covariate, `age`

, to the model

We could fit an ANOVA model to predict `st_delta`

using `sex`

and `age`

directly, but only if we categorized `age`

into two or more groups. Because `age`

is not categorical, we cannot include it in an ANOVA. But if age is an influence, and we don’t adjust for it, it may well bias the outcome of our initial ANOVA. With a quantitative variable like `age`

, we will need a method called ANCOVA, for **analysis of covariance**.

### 7.4.1 The ANCOVA model

ANCOVA in this case is just an ANOVA model with our outcome (`st_delta`

) adjusted for a continuous covariate, called `age`

. For the moment, we’ll ignore the one subject with missing `age`

and simply fit the regression model with `sex`

and `age`

.

```
Call:
lm(formula = st_delta ~ sex + age, data = emphysema)
Residuals:
Min 1Q Median 3Q Max
-8.3352 -0.4789 0.6948 1.5580 3.5202
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.90266 7.92948 -0.871 0.4011
sexM 3.52466 1.75815 2.005 0.0681 .
age 0.05636 0.12343 0.457 0.6561
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.255 on 12 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.2882, Adjusted R-squared: 0.1696
F-statistic: 2.43 on 2 and 12 DF, p-value: 0.13
```

This model assumes that the slope of the regression line between `st_delta`

and `age`

is the same for both sexes.

Note that the model yields `st_delta`

= -6.9 + 3.52 (`sex`

= male) + 0.056 `age`

, or

`st_delta`

= -6.9 + 0.056`age`

for female patients, and`st_delta`

= (-6.9 + 3.52) + 0.056`age`

= -3.38 + 0.056`age`

for male patients.

Note that we can test this assumption of equal slopes by fitting an alternative model (with a product term between `sex`

and `age`

) that doesn’t require the assumption, and we’ll do that later.

### 7.4.2 The ANCOVA Table

First, though, we’ll look at the ANCOVA table.

```
Analysis of Variance Table
Response: st_delta
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 49.284 49.284 4.6507 0.05203 .
age 1 2.209 2.209 0.2085 0.65612
Residuals 12 127.164 10.597
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

When we tested `sex`

without accounting for `age`

, we found a *p* value of 0.032, which is less than our usual cutpoint of 0.05. But when we adjusted for `age`

, we find that `sex`

loses significance, even though `age`

is not a significant influence on `st_delta`

by itself, according to the ANCOVA table.

## 7.5 Rerunning the ANCOVA model after simple imputation

We could have *imputed* the missing `age`

value for patient 5, rather than just deleting that patient. Suppose we do the simplest potentially reasonable thing to do: insert the mean `age`

in where the NA value currently exists.

```
# A tibble: 16 x 7
patient age sex st_base st_day5 st_day10 st_delta
<int> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 61 F 14.1 2.3 10.3 -11.8
2 2 70 F 7.2 5.4 7.3 -1.8
3 3 65 M 14.2 11.9 11.3 -2.30
4 4 65 M 10.3 10.7 13.8 0.400
5 5 65.2 M 9.9 10.7 11.7 0.800
6 6 76 M 5.2 6.8 4.2 1.60
7 7 72 M 10.4 14.6 14.1 4.20
8 8 69 F 10.5 7.2 5.4 -3.3
9 9 66 M 5 5 5.1 0
10 10 62 M 8.6 8.1 7.4 -0.5
11 11 65 F 16.6 14.9 13 -1.7
12 12 71 M 16.4 18.6 17.1 2.2
13 13 51 F 12.2 11 12.3 -1.20
14 14 71 M 6.6 3.7 4.5 -2.90
15 15 64 F 15.4 15.2 13.6 -0.2
16 16 50 M 10.2 10.8 11.2 0.6
```

More on simple imputation and missing data is coming soon.

For now, we can rerun the ANCOVA model on this new data set, after imputation…

```
Analysis of Variance Table
Response: st_delta
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 52.547 52.547 5.3623 0.03755 *
age 1 2.151 2.151 0.2195 0.64721
Residuals 13 127.392 9.799
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

When we do this, we see that now the `sex`

variable returns to a *p* value below 0.05. Our complete case analysis (which omitted patient 5) gives us a different result than the ANCOVA based on the data after mean imputation.

## 7.6 Looking at a factor-covariate interaction

Let’s run a model including the interaction (product) term between `age`

and `sex`

, which implies that the slope of `age`

on our outcome (`st_delta`

) depends on the patient’s sex. We’ll use the imputed data again. Here is the new ANCOVA table, which suggests that the interaction of `age`

and `sex`

is small (because it accounts for only a small amount of the total Sum of Squares) and not significant (p = 0.91).

```
Analysis of Variance Table
Response: st_delta
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 52.547 52.547 4.9549 0.04594 *
age 1 2.151 2.151 0.2028 0.66051
sex:age 1 0.130 0.130 0.0123 0.91355
Residuals 12 127.261 10.605
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Since the interaction term is neither substantial nor significant, we probably don’t need it here. But let’s look at its interpretation anyway, just to fix ideas. To do that, we’ll need the coefficients from the underlying regression model.

```
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -5.65 13.5 -0.420 0.682
2 sexM 1.72 16.8 0.102 0.920
3 age 0.0365 0.211 0.173 0.866
4 sexM:age 0.0289 0.260 0.111 0.914
```

Our ANCOVA model for `st_delta`

incorporating the `age`

x `sex`

product term is -5.65 + 1.72 (sex = M) + 0.037 age + 0.029 (sex = M)(age). So that means:

- our model for females is
`st_delta`

= -5.65 + 0.037`age`

- our model for males is
`st_delta`

= (-5.65 + 1.72) + (0.037 + 0.029)`age`

, or -3.93 + 0.066`age`

but, again, our conclusion from the ANCOVA table is that this increase in complexity (letting both the slope and intercept vary by `sex`

) doesn’t add much in the way of predictive value for our `st_delta`

outcome.

## 7.7 Centering the Covariate to Facilitate ANCOVA Interpretation

When developing an ANCOVA model, we will often **center** or even **center and rescale** the covariate to facilitate interpretation of the product term. In this case, let’s center `age`

and rescale it by dividing by two standard deviations.

```
min Q1 median Q3 max mean sd n missing
50 63.5 65.1 70.25 76 65.2 6.978061 16 0
```

Note that in our imputed data, the mean `age`

is 65.2 and the standard deviation of `age`

is 7 years.

So we build the rescaled `age`

variable that I’ll call `age_z`

, and then use it to refit our model.

```
emph_imp <- emph_imp %>%
mutate(age_z = (age - mean(age))/ (2 * sd(age)))
anova(lm(st_delta ~ sex * age_z, data = emph_imp))
```

```
Analysis of Variance Table
Response: st_delta
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 52.547 52.547 4.9549 0.04594 *
age_z 1 2.151 2.151 0.2028 0.66051
sex:age_z 1 0.130 0.130 0.0123 0.91355
Residuals 12 127.261 10.605
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

```
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -3.27 1.39 -2.35 0.0364
2 sexM 3.60 1.74 2.08 0.0601
3 age_z 0.510 2.95 0.173 0.866
4 sexM:age_z 0.403 3.63 0.111 0.914
```

Comparing the two models, we have:

- (unscaled):
`st_delta`

= -5.65 + 1.72 (`sex`

= M) + 0.037`age`

+ 0.029 (`sex`

= M) x (`age`

) - (rescaled):
`st_delta`

= -3.27 + 3.60 (`sex`

= M) + 0.510 rescaled`age_z`

+ 0.402 (`sex`

= M) x (rescaled`age_z`

)

In essence, the rescaled model on `age_z`

is:

`st_delta`

= -3.27 + 0.510`age_z`

for female subjects, and`st_delta`

= (-3.27 + 3.60) + (0.510 + 0.402)`age_z`

= 0.33 + 0.912`age_z`

for male subjects

Interpreting the centered, rescaled model, we have:

- no change in the ANOVA results or R-squared or residual standard deviation compared to the uncentered, unscaled model, but
- the intercept (-3.27) now represents the
`st_delta`

for a female of average age, - the
`sex`

slope (3.60) represents the (male - female) difference in predicted`st_delta`

for a person of average age, - the
`age_z`

slope (0.510) represents the difference in predicted`st_delta`

for a female one standard deviation older than the mean age as compared to a female one standard deviation younger than the mean age, and - the product term’s slope (0.402) represents the male - female difference in the slope of
`age_z`

, so that if you add the`age_z`

slope (0.510) and the interaction slope (0.402) you see the difference in predicted`st_delta`

for a male one standard deviation older than the mean age as compared to a male one standard deviation younger than the mean age.

### References

Riffenburgh, Robert H. 2006. *Statistics in Medicine*. Second Edition. Burlington, MA: Elsevier Academic Press.