Chapter 23 Comparing Means/Quantities using Two Paired Samples

Here, we’ll consider the problem of building a confidence interval to describe the difference in population means (or medians) based on a comparison of two samples of quantitative data, gathered using a matched pairs design. Specifically, we’ll use as our example the Lead in the Blood of Children study, described in Chapter 22.

Recall that in that study, we measured blood lead content, in mg/dl, for 33 matched pairs of children, one of which was exposed (had a parent working in a battery factory) and the other of which was control (no parent in the battery factory, but matched to the exposed child by age, exposure to traffic and neighborhood). We then created a variable called lead_diff which contained the (exposed - control) differences within each pair.

bloodlead

# A tibble: 33 x 4
   pair  exposed control lead_diff
   <fct>   <dbl>   <dbl>     <dbl>
 1 P01        38      16        22
 2 P02        23      18         5
 3 P03        41      18        23
 4 P04        18      24        -6
 5 P05        37      19        18
 6 P06        36      11        25
 7 P07        23      10        13
 8 P08        62      15        47
 9 P09        31      16        15
10 P10        34      18        16
# ... with 23 more rows
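
As a minimal sketch of how a column like lead_diff could be computed, the base R code below uses only the ten pairs printed above (the other 23 pairs are not reproduced here). The data frame name `bl10` is a placeholder introduced for this illustration, not an object from the study.

```r
# First ten matched pairs, copied from the printed tibble above.
# `bl10` is a hypothetical name; the full data set has 33 pairs.
bl10 <- data.frame(
  pair    = sprintf("P%02d", 1:10),
  exposed = c(38, 23, 41, 18, 37, 36, 23, 62, 31, 34),
  control = c(16, 18, 18, 24, 19, 11, 10, 15, 16, 18)
)

# Paired difference within each pair: exposed minus control
bl10$lead_diff <- bl10$exposed - bl10$control

bl10$lead_diff
```

The resulting values agree with the lead_diff column shown for pairs P01 through P10.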

23.0.1 Matched Pairs vs. Two Independent Samples

These data were NOT obtained from two independent samples, but rather from matched pairs.

We only have matched pairs if each individual observation in the “treatment” group is matched to one and only one observation in the “control” group by the way in which the data were gathered. Paired (or matched) data can arise in several ways.
- The most common is a “pre-post” study where subjects are measured both before and after an exposure happens.
- In observational studies, we often match up subjects who did and did not receive an exposure so as to account for differences on things like age, sex, race and other covariates. This, of course, is what happens in the Lead in the Blood of Children study from Chapter 22.
If the data are from paired samples, we should (and in fact must) form paired differences, with no subject left unpaired.
- If we cannot line up the data comparing two samples of quantitative data so that the links between the individual “treated” and “control” observations to form matched pairs are evident, then the data are not paired.
- If the sample sizes were different, we’d know we have independent samples, because matched pairs requires that each subject in the “treated” group be matched to a single, unique member of the “control” group, and thus that we have exactly as many “treated” as “control” subjects.
- But having as many subjects in one treatment group as the other (which is called a balanced design) is only necessary, and not sufficient, for us to conclude that matched pairs are used.

As Bock, Velleman, and De Veaux (2004) suggest,

… if you know the data are paired, you can take advantage of that fact - in fact, you must take advantage of it. … You must decide whether the data are paired from understanding how they were collected and what they mean. … There is no test to determine whether the data are paired.

23.1 Estimating the Population Mean of the Paired Differences

There are two main approaches used frequently to estimate the population mean of paired differences.

Estimation using the t distribution (and assuming at least an approximately Normal distribution for the paired differences)
Estimation using the bootstrap (which doesn’t require the Normal assumption)

In addition, when the data don’t follow a symmetric distribution, we might use the bootstrap to estimate an alternate statistic, like the median. In other settings, a rank-based alternative called the Wilcoxon signed rank test is available to estimate a pseudo-median. All of these approaches mirror what we did with a single sample, back in Chapter 17.

23.2 t-based CI for Population Mean of Paired Differences, \(\mu_d\).

In R, there are at least five different methods for obtaining the t-based confidence interval for the population difference in means between paired samples. They are all mathematically identical. The key idea is to calculate the paired differences (exposed - control, for example) in each pair, and then treat the result as if it were a single sample and apply the methods discussed in Chapter 17.
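
To make the "treat the differences as a single sample" idea concrete, here is a rough sketch of the interval \(\bar{d} \pm t_{\alpha/2, n-1} \, s_d / \sqrt{n}\) computed by hand in base R. The lead_diff values are typed in from the listing of bloodlead$lead_diff that appears in the sign test section of this chapter. The object names (`dbar`, `se`, `ci_by_hand`) are introduced only for this illustration.

```r
# The 33 paired differences (exposed - control), typed in from the
# listing of bloodlead$lead_diff shown in the sign test section
lead_diff <- c(22, 5, 23, -6, 18, 25, 13, 47, 15, 16, 6, 1, 2, 7, 0, 4, -9,
               -3, 36, 25, 1, 16, 42, 30, 25, 23, 32, 17, 9, -3, 60, 14, 14)

n    <- length(lead_diff)        # number of pairs
dbar <- mean(lead_diff)          # sample mean of the paired differences
se   <- sd(lead_diff) / sqrt(n)  # standard error of that mean

conf_level <- 0.90
t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)  # two-sided critical value

ci_by_hand <- c(lower = dbar - t_crit * se, upper = dbar + t_crit * se)
round(ci_by_hand, 2)
```

Changing `conf_level` to 0.95 gives the corresponding 95% interval from the same three ingredients.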

23.2.1 Method 1

We can use the single-sample approach, applied to the variable containing the paired differences. Let’s build a 90% two-sided confidence interval for the population mean of the difference in blood lead content across all possible pairs of an exposed (parent works in a lead-based industry) and a control (parent does not) child, \(\mu_d\).

tt1 <- bloodlead %$% t.test(lead_diff, conf.level = 0.90, 
                            alt = "two.sided")

tt1


    One Sample t-test

data:  lead_diff
t = 6, df = 32, p-value = 2e-06
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
 11.3 20.6
sample estimates:
mean of x 
       16

tidy(tt1) %>% knitr::kable(digits = 2)

estimate	statistic	p.value	parameter	conf.low	conf.high	method	alternative
16	5.78	0	32	11.3	20.6	One Sample t-test	two.sided

The 90% confidence interval is (11.29, 20.65) according to this t-based procedure. An appropriate interpretation of the 90% two-sided confidence interval would be:

(11.29, 20.65) milligrams per deciliter is a 90% two-sided confidence interval for the population mean difference in blood lead content between exposed and control children.
Our point estimate for the true population difference in mean blood lead content is 15.97 mg/dl. The values in the interval (11.29, 20.65) mg/dl represent a reasonable range of estimates for the true population difference in mean blood lead content, and we are 90% confident that this method of creating a confidence interval will produce a result containing the true population mean difference.
Were we to draw 100 samples of 33 matched pairs from the population described by this sample, and use each such sample to produce a confidence interval in this manner, approximately 90 of those confidence intervals would cover the true population mean difference in blood lead content levels.
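
The repeated-sampling interpretation can be illustrated with a small simulation. This is only a sketch under loudly stated assumptions: the population of paired differences is taken to be Normal with mean 16 and standard deviation 16. Those values are invented for the demonstration and are not estimates endorsed by the study, and the seed is arbitrary.

```r
set.seed(431)  # arbitrary seed, used only so the sketch is reproducible

true_mean <- 16    # ASSUMED population mean of the paired differences
true_sd   <- 16    # ASSUMED population standard deviation
n_pairs   <- 33    # same number of pairs as in the study
n_sims    <- 2000  # number of simulated studies

# For each simulated study, build a 90% t-based CI and record
# whether it covers the assumed true mean
covered <- replicate(n_sims, {
  d  <- rnorm(n_pairs, mean = true_mean, sd = true_sd)
  ci <- t.test(d, conf.level = 0.90)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

coverage <- mean(covered)  # proportion of intervals covering the truth
coverage
```

Under these assumptions the proportion of covering intervals should land close to 0.90, which is what the "approximately 90 of 100" statement describes.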

23.2.2 Method 2

Or, we can apply the single-sample approach to a calculated difference in blood lead content between the exposed and control groups. Here, we’ll get a 95% two-sided confidence interval for \(\mu_d\), instead of the 90% interval we obtained above.

tt2 <- bloodlead %$% t.test(exposed - control, 
       conf.level = 0.95, alt = "two.sided")

tt2


    One Sample t-test

data:  exposed - control
t = 6, df = 32, p-value = 2e-06
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 10.3 21.6
sample estimates:
mean of x 
       16

tidy(tt2) %>% knitr::kable(digits = 2)

estimate	statistic	p.value	parameter	conf.low	conf.high	method	alternative
16	5.78	0	32	10.3	21.6	One Sample t-test	two.sided

23.2.3 Method 3

Or, we can provide R with two separate samples (exposed and control) and specify that the samples are paired. Here, we’ll get a 99% one-sided confidence interval (lower bound) for \(\mu_d\), the population mean difference in blood lead content.

tt3 <- bloodlead %$% t.test(exposed, control, conf.level = 0.99,
       paired = TRUE, alt = "greater")

tt3


    Paired t-test

data:  exposed and control
t = 6, df = 32, p-value = 1e-06
alternative hypothesis: true difference in means is greater than 0
99 percent confidence interval:
 9.21  Inf
sample estimates:
mean of the differences 
                     16

tidy(tt3) %>% knitr::kable(digits = 2)

estimate	statistic	p.value	parameter	conf.low	conf.high	method	alternative
16	5.78	0	32	9.21	Inf	Paired t-test	greater

Again, the three different methods using t.test for paired samples will all produce identical results if we feed them the same confidence level and type of interval (two-sided, greater than or less than).
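
As a quick check of that claim, the sketch below runs all three t.test specifications at one common confidence level. It uses only the ten pairs printed at the start of the chapter, so the numbers will not match the full 33-pair results; the point is only that the three intervals coincide.

```r
# First ten pairs only, copied from the printed tibble (not the full data)
exposed <- c(38, 23, 41, 18, 37, 36, 23, 62, 31, 34)
control <- c(16, 18, 18, 24, 19, 11, 10, 15, 16, 18)
lead_diff <- exposed - control

# Same confidence level and same alternative in every call
ci_1 <- t.test(lead_diff, conf.level = 0.95)$conf.int          # Method 1
ci_2 <- t.test(exposed - control, conf.level = 0.95)$conf.int  # Method 2
ci_3 <- t.test(exposed, control, paired = TRUE,
               conf.level = 0.95)$conf.int                     # Method 3

same_12 <- isTRUE(all.equal(as.numeric(ci_1), as.numeric(ci_2)))
same_13 <- isTRUE(all.equal(as.numeric(ci_1), as.numeric(ci_3)))
c(same_12, same_13)
```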

23.2.4 Method 4

As we saw in Chapter 22, we can also use an intercept-only linear regression model to estimate the population mean of the paired differences with a two-sided confidence interval, by creating a variable containing those paired differences.

model_lead <- lm(lead_diff ~ 1, data = bloodlead)

tidy(model_lead, conf.int = TRUE, conf.level = 0.95)

# A tibble: 1 x 7
  term        estimate std.error statistic    p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>      <dbl>    <dbl>     <dbl>
1 (Intercept)     16.0      2.76      5.78 0.00000204     10.3      21.6
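
The agreement between the intercept-only model and the t procedure can be checked directly. The sketch below uses base R's confint() rather than tidy(), and types in the 33 differences from the listing of bloodlead$lead_diff in the sign test section. The names `fit`, `ci_lm` and `ci_t` are introduced just for this illustration.

```r
lead_diff <- c(22, 5, 23, -6, 18, 25, 13, 47, 15, 16, 6, 1, 2, 7, 0, 4, -9,
               -3, 36, 25, 1, 16, 42, 30, 25, 23, 32, 17, 9, -3, 60, 14, 14)

# Intercept-only model: the intercept is the mean of the differences
fit   <- lm(lead_diff ~ 1)
ci_lm <- confint(fit, level = 0.95)

# One-sample t interval on the same differences
ci_t <- t.test(lead_diff, conf.level = 0.95)$conf.int

round(as.numeric(ci_lm), 2)
isTRUE(all.equal(as.numeric(ci_lm), as.numeric(ci_t)))
```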

23.2.5 Method 5

As we also saw in Chapter 22, if we have the data in a longer format, with a variable identifying the matched pairs, we can use a different specification for a linear model to obtain the same estimate.

model2_lead <- lm(lead_level ~ status + factor(pair), data = bloodlead_longer)

tidy(model2_lead, conf.int = TRUE, conf.level = 0.95) %>%
    filter(term == "statusexposed")

# A tibble: 1 x 7
  term          estimate std.error statistic    p.value conf.low conf.high
  <chr>            <dbl>     <dbl>     <dbl>      <dbl>    <dbl>     <dbl>
1 statusexposed     16.0      2.76      5.78 0.00000204     10.3      21.6
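
The construction of bloodlead_longer is not shown in this chapter. The sketch below is one assumed layout (one row per child, with columns named pair, status and lead_level to match the model formula above), built in base R from only the ten pairs printed at the start of the chapter. The name `long10` is hypothetical, and the numbers will differ from the full 33-pair results.

```r
# First ten pairs only, copied from the printed tibble (not the full data)
exposed <- c(38, 23, 41, 18, 37, 36, 23, 62, 31, 34)
control <- c(16, 18, 18, 24, 19, 11, 10, 15, 16, 18)

# ASSUMED long layout: one row per child, two rows per pair
long10 <- data.frame(
  pair       = factor(rep(sprintf("P%02d", 1:10), times = 2)),
  status     = factor(rep(c("control", "exposed"), each = 10),
                      levels = c("control", "exposed")),
  lead_level = c(control, exposed)
)

# Exposure effect, adjusting for pair as a blocking factor
fit_long <- lm(lead_level ~ status + pair, data = long10)
est_long <- unname(coef(fit_long)["statusexposed"])
ci_long  <- confint(fit_long, level = 0.95)["statusexposed", ]

# Paired t interval on the same ten pairs, for comparison
ci_pair <- t.test(exposed, control, paired = TRUE,
                  conf.level = 0.95)$conf.int

est_long
isTRUE(all.equal(as.numeric(ci_long), as.numeric(ci_pair)))
```

With every pair contributing exactly one exposed and one control child, the coefficient on status equals the mean paired difference, and its interval matches the paired t interval.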

23.2.6 Assumptions

If we are building a confidence interval based on a sample of observations drawn from a population, then we must pay close attention to the assumptions of those procedures. The confidence interval procedure for the population mean paired difference \(\mu_d\) using the t distribution assumes that:

We want to estimate the population mean paired difference \(\mu_d\).
We have drawn a sample of paired differences at random from the population of interest.
The sampled paired differences are drawn from the population set of paired differences independently and have identical distributions.
The population follows a Normal distribution. At the very least, the sample itself is approximately Normal.

23.3 Bootstrap CI for mean difference using paired samples

The same bootstrap approach is used for paired differences as for a single sample. We again use the smean.cl.boot() function in the Hmisc package to obtain bootstrap confidence intervals for the population mean, \(\mu_d\), of the paired differences in blood lead content.

set.seed(431555)
bloodlead %$% Hmisc::smean.cl.boot(lead_diff, B = 1000,
                                   conf.int = 0.95)

 Mean Lower Upper 
 16.0  10.8  21.5

Note that in this case, the confidence interval for the difference in means is a bit less wide than the 95% confidence interval generated by the t test, which was (10.34, 21.59). It’s common for the bootstrap to produce a narrower range (i.e. an apparently more precise estimate) for the population mean, but it’s not automatic that the endpoints from the bootstrap will be inside those provided by the t test, either.

For example, this bootstrap CI isn’t contained within the t-based interval, since its upper bound (21.7) exceeds that of the t-based interval (21.59):

set.seed(431002)
bloodlead %$% Hmisc::smean.cl.boot(lead_diff, B = 1000,
                                   conf.int = 0.95)

 Mean Lower Upper 
 16.0  10.8  21.7

This demonstration aside, the appropriate thing to do when applying the bootstrap to specify a confidence interval is select a seed and the number (B = 1,000 or 10,000, usually) of desired bootstrap replications, then run the bootstrap just once and move on, rather than repeating the process multiple times looking for a particular result.
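
For readers who want to see the resampling idea itself, here is a minimal percentile bootstrap for the mean written in base R. It is a sketch of the general technique, not a replacement for smean.cl.boot(), and because the random draws differ it will not reproduce the exact endpoints shown above. The seed and the object names are arbitrary choices for this illustration.

```r
lead_diff <- c(22, 5, 23, -6, 18, 25, 13, 47, 15, 16, 6, 1, 2, 7, 0, 4, -9,
               -3, 36, 25, 1, 16, 42, 30, 25, 23, 32, 17, 9, -3, 60, 14, 14)

set.seed(431)  # arbitrary seed; choose one, run once, and move on
B <- 1000      # number of bootstrap replications

# Resample the differences with replacement and record each resample's mean
boot_means <- replicate(B, mean(sample(lead_diff, replace = TRUE)))

# Percentile interval: middle 95% of the bootstrap means
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))
boot_ci
```

Replacing mean() with median() inside replicate() gives the analogous sketch for the population median.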

23.3.1 Assumptions

The bootstrap confidence interval procedure for the population mean (or median) of a set of paired differences assumes that:

We want to estimate the population mean \(\mu_d\) of the paired differences (or the population median).
We have drawn a sample of observations at random from the population of interest.
The sampled observations are drawn from the population of paired differences independently and have identical distributions.
We are willing to put up with the fact that different people (not using the same random seed) will get somewhat different confidence interval estimates using the same data.

As we’ve seen, a major part of the bootstrap’s appeal is the ability to relax some assumptions.

23.4 Wilcoxon Signed Rank-based CI for paired samples

We could also use the Wilcoxon signed rank procedure to generate a CI for the pseudo-median of the paired differences.

wt <- bloodlead %$% wilcox.test(lead_diff, conf.int = TRUE,
                                conf.level = 0.90, 
                                exact = FALSE)
wt


    Wilcoxon signed rank test with continuity correction

data:  lead_diff
V = 499, p-value = 1e-05
alternative hypothesis: true location is not equal to 0
90 percent confidence interval:
 11.0 20.5
sample estimates:
(pseudo)median 
          15.5

tidy(wt)

# A tibble: 1 x 7
  estimate statistic  p.value conf.low conf.high method         alternative
     <dbl>     <dbl>    <dbl>    <dbl>     <dbl> <chr>          <chr>      
1     15.5       499  1.15e-5     11.0      20.5 Wilcoxon sign~ two.sided

As in the one sample case, we can revise this code slightly to specify a different confidence level, or gather a one-sided rather than a two-sided confidence interval.
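
As a rough illustration of what a pseudo-median is, the sketch below computes the median of all pairwise (Walsh) averages of the differences, which is the usual textbook definition of the sample pseudo-median. This is an assumption-laden illustration: wilcox.test() may handle zero differences and ties differently and uses a Normal approximation when exact = FALSE, so its reported estimate need not match this value exactly.

```r
lead_diff <- c(22, 5, 23, -6, 18, 25, 13, 47, 15, 16, 6, 1, 2, 7, 0, 4, -9,
               -3, 36, 25, 1, 16, 42, 30, 25, 23, 32, 17, 9, -3, 60, 14, 14)

# Walsh averages: (d_i + d_j) / 2 for every i <= j
walsh <- outer(lead_diff, lead_diff, "+") / 2
walsh <- walsh[upper.tri(walsh, diag = TRUE)]

length(walsh)                   # n * (n + 1) / 2 averages
pseudo_median <- median(walsh)  # sample pseudo-median
pseudo_median
```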

23.4.1 Assumptions

The Wilcoxon signed rank confidence interval procedure in working with paired differences assumes that:

We want to estimate the population pseudo-median of the paired differences.
We have drawn a sample of observations at random from the population of paired differences of interest.
The sampled observations are drawn from the population of paired differences independently and have identical distributions.
The population follows a symmetric distribution. At the very least, the sample itself shows no substantial skew, so that the sample pseudo-median is a reasonable estimate for the population median.

23.5 Choosing a Confidence Interval Approach

Suppose we want to find a confidence interval for the mean of a population, \(\mu\), or, the population mean difference \(\mu_{d}\) between two populations based on matched pairs.

If we are willing to assume that the population distribution is Normal,
- we usually use a t-based CI.
If we are unwilling to assume that the population is Normal,
- we can use a bootstrap procedure to get a CI for the population mean, or even the median.
- if we are nonetheless willing to assume the population is symmetric, we can consider a Wilcoxon signed rank procedure to get a CI for the (pseudo-)median, rather than the mean.

The two methods you’ll use most often are the bootstrap (especially if the data don’t appear to be at least pretty well fit by a Normal model) and the t-based confidence intervals (if the data do appear to fit a Normal model reasonably well).

23.6 Conclusions for the `bloodlead` study

Using any of these procedures, we would conclude that the null hypothesis (that the true mean of the paired differences is 0 mg/dl) is not tenable, and that it should be rejected at the 10% significance level. The smaller the p value, the stronger is the evidence that the null hypothesis is incorrect, and in this case, we have some fairly tiny p values.

Procedure	Comparing …	90% CI for \(\mu_{Exposed - Control}\)
Paired t	Means	11.3, 20.6
Wilcoxon signed rank	Pseudo-medians	11, 20.5
Bootstrap CI	Means	11.6, 20.6

Note that one-sided or one-tailed hypothesis testing procedures work the same way for paired samples as they did for a single sample in Chapter 17.

23.7 The Sign test

The sign test is something we’ve skipped in our discussion so far. It is a test for consistent differences between pairs of observations, and so it addresses the same kind of question as the paired t, Wilcoxon signed rank and bootstrap procedures for paired samples. It has the advantage that it is relatively easy to calculate by hand, and that it doesn’t require the paired differences to follow a Normal distribution. In fact, it will even work if the data are substantially skewed.

Calculate the paired difference for each pair, and drop those with difference = 0.
Let \(N\) be the number of pairs that remain, so there are 2N data points.
Let \(W\), the test statistic, be the number of pairs (out of N) in which the difference is positive.
Assuming that \(H_0\) is true, then \(W\) follows a binomial distribution with probability 0.5 on \(N\) trials.

For example, consider our data on blood lead content:

bloodlead$lead_diff

 [1] 22  5 23 -6 18 25 13 47 15 16  6  1  2  7  0  4 -9 -3 36 25  1 16 42
[24] 30 25 23 32 17  9 -3 60 14 14

Difference	# of Pairs
Greater than zero	28
Equal to zero	1
Less than zero	4

So we have \(N\) = 32 pairs, with \(W\) = 28 that are positive. We then use the binom.test approach in R:

binom.test(x = 28, n = 32, p = 0.5, 
           alternative = "two.sided")


    Exact binomial test

data:  28 and 32
number of successes = 28, number of trials = 32, p-value = 2e-05
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.710 0.965
sample estimates:
probability of success 
                 0.875

A one-tailed test can be obtained by substituting in “less” or “greater” as the alternative of interest.
The confidence interval provided doesn’t relate back to our original population means, of course. It’s just a confidence interval for the probability that, within a matched pair of children, the exposed child’s blood lead content is greater than the control child’s.
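
The counting steps of the sign test can also be done in code rather than by hand. The sketch below tallies the signs of the differences listed above, then obtains the two-sided p value both from binom.test() and directly from the binomial distribution. The object names are introduced only for this illustration.

```r
lead_diff <- c(22, 5, 23, -6, 18, 25, 13, 47, 15, 16, 6, 1, 2, 7, 0, 4, -9,
               -3, 36, 25, 1, 16, 42, 30, 25, 23, 32, 17, 9, -3, 60, 14, 14)

# Step 1: drop pairs whose difference is exactly zero
nonzero <- lead_diff[lead_diff != 0]

# Steps 2 and 3: N pairs remain, W of them have a positive difference
N <- length(nonzero)
W <- sum(nonzero > 0)

# Step 4: under H0, W ~ Binomial(N, 0.5)
sign_test <- binom.test(x = W, n = N, p = 0.5, alternative = "two.sided")

# The same two-sided p value from the binomial tail, doubled
# (valid here because the null distribution is symmetric and W > N / 2)
p_by_hand <- 2 * pbinom(W - 1, size = N, prob = 0.5, lower.tail = FALSE)

c(N = N, W = W)
c(binom.test = sign_test$p.value, by_hand = p_by_hand)
```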

23.8 A More Complete Decision Support Tool: Comparing Means

Are these paired or independent samples?
If paired samples, then are the paired differences approximately Normally distributed?
1. If yes, then a paired t test or confidence interval is likely the best choice.
2. If no, is the main concern outliers (with generally symmetric data), or skew?
  1. If the paired differences appear to be generally symmetric but with substantial outliers, a Wilcoxon signed rank test is an appropriate choice, as is a bootstrap confidence interval for the population mean of the paired differences.
  2. If the paired differences appear to be seriously skewed, then we’ll usually build a bootstrap confidence interval. A sign test is another reasonable possibility, but it doesn’t provide a confidence interval for the population mean of the paired differences.
If independent, is each sample Normally distributed?
1. No –> use Wilcoxon-Mann-Whitney rank sum test or bootstrap via bootdif.
2. Yes –> are sample sizes equal?
  1. Balanced Design (equal sample sizes) - use pooled t test
  2. Unbalanced Design - use Welch test
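
The branching above can be written down as a small helper function, which some readers find easier to follow than nested lists. This is purely an illustrative sketch: the function `suggest_procedure()` and its arguments are hypothetical names invented here, and the judgments about Normality, symmetry and skew still have to come from looking at the data.

```r
# Hypothetical helper that mirrors the decision tool above.
#   paired        : TRUE for paired samples, FALSE for independent samples
#   approx_normal : are the paired differences (or each sample) roughly Normal?
#   symmetric     : paired, non-Normal case only: symmetric apart from outliers?
#   balanced      : independent, Normal case only: equal sample sizes?
suggest_procedure <- function(paired, approx_normal,
                              symmetric = NA, balanced = NA) {
  if (paired) {
    if (approx_normal) return("paired t")
    if (isTRUE(symmetric)) return("Wilcoxon signed rank or bootstrap")
    return("bootstrap (or sign test)")
  }
  if (!approx_normal) return("Wilcoxon-Mann-Whitney rank sum or bootstrap")
  if (isTRUE(balanced)) "pooled t" else "Welch t"
}

suggest_procedure(paired = TRUE, approx_normal = FALSE, symmetric = FALSE)
suggest_procedure(paired = FALSE, approx_normal = TRUE, balanced = FALSE)
```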

23.9 Paired (Dependent) vs. Independent Samples

One area that consistently trips students up in this course is the thought process involved in distinguishing studies comparing means that should be analyzed using dependent (i.e. paired or matched) samples and those which should be analyzed using independent samples. A dependent samples analysis uses additional information about the sample to pair/match subjects receiving the various exposures. That additional information is not part of an independent samples analysis (unpaired testing situation.) The reasons to do this are to (a) increase statistical power, and/or (b) reduce the effect of confounding. Here are a few thoughts on the subject.

In the design of experiments, blocking is the term often used for the process of arranging subjects into groups (blocks) that are similar to one another. Typically, a blocking factor is a source of variability that is not of primary interest to the researcher. An example of a blocking factor might be the sex of a patient; by blocking on sex, this source of variability is controlled for, thus leading to greater accuracy.

If the sample sizes are not balanced (not equal), the samples must be treated as independent, since there would be no way to precisely link all subjects. So, if we have 10 subjects receiving exposure A and 12 subjects receiving exposure B, a dependent samples analysis (such as a paired t test) is not correct.
The key element is a meaningful link between each observation in one exposure group and a specific observation in the other exposure group. Given a balanced design, the most common strategy indicating dependent samples involves two or more repeated measures on the same subjects. For example, if we are comparing outcomes before and after the application of an exposure, and we have, say, 20 subjects who provide us data both before and after the exposure, then the comparison of results before and after exposure should use a dependent samples analysis. The link between the subjects is the subject itself - each exposed subject serves as its own control.
The second most common strategy indicating dependent samples involves deliberate matching of subjects receiving the two exposures. A matched set of observations (often a pair, but it could be a trio or quartet, etc.) is determined using baseline information and then (if a pair is involved) one subject receives exposure A while the other member of the pair receives exposure B, so that by calculating the paired difference, we learn about the effect of the exposure, while controlling for the variables made similar across the two subjects by the matching process.
In order for a dependent samples analysis to be used, we need (a) a link between each observation across the exposure groups based on the way the data were collected, and (b) a consistent measure (with the same units of measurement) so that paired differences can be calculated and interpreted sensibly.
If the samples are collected to facilitate a dependent samples analysis, the correlation of the outcome measurements across the groups will often be moderately strong and positive. If that’s the case, then the use of a dependent samples analysis will reduce the effect of baseline differences between the exposure groups, and thus provide a more precise estimate. But even if the correlation is quite small, a dependent samples analysis should provide a more powerful estimate of the impact of the exposure on the outcome than would an independent samples analysis with the same number of observations.
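
The role of the correlation follows from the variance of a difference: for outcomes \(A\) and \(B\) measured within a pair, \(\mathrm{Var}(A - B) = \mathrm{Var}(A) + \mathrm{Var}(B) - 2\,\mathrm{Cov}(A, B)\), so a positive covariance shrinks the variance of the paired differences. The sketch below simply verifies that identity numerically on the ten pairs printed at the start of the chapter; it makes no claim about the size of the correlation in the full data.

```r
# First ten pairs only, copied from the printed tibble (not the full data)
exposed <- c(38, 23, 41, 18, 37, 36, 23, 62, 31, 34)
control <- c(16, 18, 18, 24, 19, 11, 10, 15, 16, 18)

# Variance of the paired differences, computed directly
var_direct <- var(exposed - control)

# The same quantity from the variance-of-a-difference identity
var_identity <- var(exposed) + var(control) - 2 * cov(exposed, control)

c(direct = var_direct, identity = var_identity)
```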

23.9.1 Three “Tricky” Examples

1. Suppose we take a convenience sample of 200 patients from the population of patients who complete a blood test in April 2017 including a check of triglycerides, and who have a triglyceride level in the high category (200 to 499 mg/dl). Next, we select a patient at random from this group of 200 patients, and then identify another patient from the group of 200 who is the same age (to within 2 years) and also the same sex. We then randomly assign our intervention to one of these two patients and usual care without our intervention to the other patient. We then set these two patients aside and return to our original sample, repeating the process until we cannot find any more patients in the same age range and of the same sex. This generates a total of 77 patients who receive the intervention and 77 who do not. If we are trying to assess the effect of our intervention on triglyceride level in October 2017 using this sample of 154 people, should we use dependent (paired) or independent samples?
2. Suppose we take a convenience sample of 77 patients from the population of patients who complete a blood test in April 2017 including a check of triglycerides, and who have a triglyceride level in the high category (200 to 499 mg/dl). Next, we take a convenience sample of 77 patients from the population of patients who complete a blood test in May 2017 including a check of triglycerides, and who have a triglyceride level in the high category (200 to 499 mg/dl). We flip a coin to determine whether the intervention will be given to each of the 77 patients from April 2017 (if the coin comes up “HEADS”) or instead to each of the 77 patients from May 2017 (if the coin comes up “TAILS”). Then, we assign our intervention to the patients seen in the month specified by the coin and assign usual care without our intervention to the patients seen in the other month. If we are trying to assess the effect of our intervention on triglyceride level in October 2017 using this sample of 154 people, should we use dependent (paired) or independent samples?
3. Suppose we take a convenience sample of 200 patients from the population of patients who complete a blood test in April 2017 including a check of triglycerides, and who have a triglyceride level in the high category (200 to 499 mg/dl). We measure each patient again in October 2017, checking their triglyceride level. But in between, we take the first 77 of the patients in a randomly sorted list and assign them to our intervention (which takes place from June through September 2017) and take an additional group of 77 patients from the remaining part of the list and assign them to usual care without our intervention over the same time period. If we are trying to assess the effect of our intervention on each individual’s change in triglyceride level (from April to October) using this sample of 154 people, should we use dependent (paired) or independent samples?

23.9.2 Answers for the Three “Tricky” Examples

Answer for 1. Our first task is to identify the outcome and the exposure groups. Here, we are comparing the distribution of our outcome (triglyceride level in October) across two exposures: (a) receiving the intervention and (b) not receiving the intervention. We have a sample of 77 patients receiving the intervention, and a different sample of 77 patients receiving usual care. Each of the 77 subjects receiving the intervention is matched (on age and sex) to a specific subject not receiving the intervention. So, we can calculate paired differences by taking the triglyceride level for the exposed member of each pair and subtracting the triglyceride level for the usual care member of that same pair. Thus our comparison of the exposure groups should be accomplished using a dependent samples analysis, such as a paired t test.

Answer for 2. Again, we begin by identifying the outcome (triglyceride level in October) and the exposure groups. Here, we compare two exposures: (a) receiving the intervention and (b) receiving usual care. We have a sample of 77 patients receiving the intervention, and a different sample of 77 patients receiving usual care. But there is no pairing or matching involved. Nothing in the way that the data were collected implies that, for example, patient 1 in the intervention group is linked to any particular subject in the usual care group. So we need to analyze the data using independent samples.

Answer for 3. Once again, we identify the outcome (now it is the within-subject change in triglyceride level from April to October) and the exposure groups. Here again, we compare two exposures: (a) receiving the intervention and (b) receiving usual care. We have a sample of 77 patients receiving the intervention, and a different sample of 77 patients receiving usual care. But again, there is no pairing or matching between the patients receiving the intervention and the patients receiving usual care. While each outcome value is a difference (or change) in triglyceride levels, nothing in the way that the data were collected implies that, for example, patient 1 in the intervention group is linked to any particular subject in the usual care group. So, again, we need to analyze the data using independent samples.

For more background and fundamental material, you might consider the Wikipedia pages on Paired Difference Test and on Blocking (statistics).

References

Bock, David E., Paul F. Velleman, and Richard D. De Veaux. 2004. Stats: Modeling the World. Boston MA: Pearson Addison-Wesley.