Chapter 22 A Paired Sample Study: Lead in the Blood of Children

One of the best ways to eliminate a source of variation and the errors of interpretation associated with it is through the use of matched pairs. Each subject in one group is matched as closely as possible by a subject in the other group. If a 45-year-old African-American male with hypertension is given a [treatment designed to lower their blood pressure], then we give a second, similarly built 45-year old African-American male with hypertension a placebo.

Good (2005), section 5.2.4

22.1 The Lead in the Blood of Children Study

Morton et al. (1982) studied the absorption of lead into the blood of children. This was a matched-sample study, where the exposed group of interest contained 33 children of parents who worked in a battery manufacturing factory (where lead was used) in the state of Oklahoma. Specifically, each child with a lead-exposed parent was matched to another child of the same age, exposure to traffic, and living in the same neighborhood whose parents did not work in lead-related industries. So the complete study had 66 children, arranged in 33 matched pairs. The outcome of interest, gathered from a sample of whole blood from each of the children, was lead content, measured in mg/dl.

One motivation for doing this study is captured in the Abstract from Morton et al. (1982).

It has been repeatedly reported that children of employees in a lead-related industry are at increased risk of lead absorption because of the high levels of lead found in the household dust of these workers.

The data are available in several places, including Table 5 of Pruzek and Helmreich (2009), in the BloodLead data set within the PairedData package in R, but we also make them available in the bloodlead.csv file. A table of the first few pairs of observations (blood lead levels for one child exposed to lead and the matched control) is shown below.

bloodlead

# A tibble: 33 x 3
   pair  exposed control
   <chr>   <dbl>   <dbl>
 1 P01        38      16
 2 P02        23      18
 3 P03        41      18
 4 P04        18      24
 5 P05        37      19
 6 P06        36      11
 7 P07        23      10
 8 P08        62      15
 9 P09        31      16
10 P10        34      18
# ... with 23 more rows

In each pair, one child was exposed (to having a parent working in the factory) and the other was not.
Otherwise, though, each child was very similar to its matched partner.
The data under exposed and control are the blood lead content, in mg/dl.

Our primary goal will be to estimate the difference in lead content between the exposed and control children, and then use that sample estimate to make inferences about the difference in lead content between the population of all children like those in the exposed group and the population of all children like those in the control group.

22.1.1 Our Key Questions for a Paired Samples Comparison

What is the population under study?

All pairs of children living in Oklahoma near the factory in question, in which one had a parent working in a factory that exposed them to lead, and the other did not.

What is the sample? Is it representative of the population?

The sample consists of 33 pairs of one exposed and one control child.
This is a case-control study, where the children were carefully enrolled to meet the design criteria. Absent any other information, we’re likely to assume that there is no serious bias associated with these pairs, and that assuming they represent the population effectively (and perhaps the broader population of kids whose parents work in lead-based industries more generally) may well be at least as reasonable as assuming they don’t.

Who are the subjects / individuals within the sample?

Each of our 33 pairs of children includes one exposed child and one unexposed (control) child.

What data are available on each individual?

The blood lead content, as measured in mg/dl of whole blood.

22.1.2 Lead Study Caveats

Note that the children were not randomly selected from general populations of kids whose parents did and did not work in lead-based industries.

To make inferences to those populations, we must make strong assumptions to believe, for instance, that the sample of exposed children is as representative as a random sample of children with similar exposures across the world would be.
The researchers did have a detailed theory about how the exposed children might be at increased risk of lead absorption, and in fact as part of the study gathered additional information about whether a possible explanation might be related to the quality of hygiene of the parents (all of them were fathers, actually) who worked in the factory.
This is an observational study, so that the estimation of a causal effect between parental work in a lead-based industry and children’s blood lead content can be made, without substantial (and perhaps heroic) assumptions.

22.2 Exploratory Data Analysis for Paired Samples

We’ll begin by adjusting the data in two ways.

We’d like that first variable (pair) to be a factor rather than a character type in R, because we want to be able to summarize it more effectively. So we’ll make that change.
Also, we’d like to calculate the difference in lead content between the exposed and the control children in each pair, and we’ll save that within-pair difference in a variable called lead_diff. We’ll take lead_diff = exposed - control so that positive values indicate increased lead in the exposed child.

bloodlead_original <- bloodlead

bloodlead <- bloodlead_original %>%
    mutate(pair = factor(pair),
           lead_diff = exposed - control)

bloodlead

# A tibble: 33 x 4
   pair  exposed control lead_diff
   <fct>   <dbl>   <dbl>     <dbl>
 1 P01        38      16        22
 2 P02        23      18         5
 3 P03        41      18        23
 4 P04        18      24        -6
 5 P05        37      19        18
 6 P06        36      11        25
 7 P07        23      10        13
 8 P08        62      15        47
 9 P09        31      16        15
10 P10        34      18        16
# ... with 23 more rows

22.2.1 The Paired Differences

To begin, we focus on lead_diff for our exploratory work, which is the exposed - control difference in lead content within each of the 33 pairs. So, we’ll have 33 observations, as compared to the 462 in the serum zinc data, but most of the same tools are still helpful.

res_lead <- mosaic::favstats(~ lead_diff, data = bloodlead)
bin_w1 <- 5 # specify binwidth

p1 <- ggplot(bloodlead, aes(x = lead_diff)) +
    geom_histogram(binwidth = bin_w1, 
                   fill = "firebrick", 
                   col = "white") +
    theme_light() +
    stat_function(
        fun = function(x) dnorm(x, mean = res_lead$mean, 
                                sd = res_lead$sd) * 
            res_lead$n * bin_w1,
        col = "blue") +
labs(x = "Diff. in Lead Content (mg/dl)", y = "")

p2 <- ggplot(bloodlead, aes(sample = lead_diff)) +
    geom_qq(col = "firebrick") + 
    geom_qq_line(col = "black") +
    theme_light() +
    labs(y = "Diff. in Lead Content (mg/dl)")

p3 <- ggplot(bloodlead, aes(x = "", y = lead_diff)) +
    geom_violin() +
    geom_boxplot(width = 0.5, fill = "firebrick", 
                 outlier.color = "firebrick") +
    theme_light() +
    coord_flip() +
    labs(x = "", y = "Diff. in Lead Content (mg/dl)")

p1 + p2 - p3 + plot_layout(ncol = 1, height = c(3, 1)) +
    plot_annotation(title = "Difference in Blood Lead Content (mg/dl) for 33 Pairs of Children")

Note that in all of this work, I plotted the paired differences. One obvious way to tell if you have paired samples is that you can pair every single subject from one exposure group to a unique subject in the other exposure group. Everyone has to be paired, so the sample sizes will always be the same in the two groups.

22.2.2 Numerical Summaries

mosaic::favstats(~ lead_diff, data = bloodlead)

 min Q1 median Q3 max mean   sd  n missing
  -9  4     15 25  60   16 15.9 33       0

bloodlead %$% skew1(lead_diff)

[1] 0.0611

22.2.3 Impact of Matching - Scatterplot and Correlation

Here, the data are paired by the study through matching on neighborhood, age and exposure to traffic. Each individual child’s outcome value is part of a pair with the outcome value for his/her matching partner. We can see this pairing in several ways, perhaps by drawing a scatterplot of the pairs.

ggplot(bloodlead, aes(x = control, y = exposed)) +
    geom_point(size = 2) + 
    geom_smooth(method = "lm", se = FALSE) +
    geom_text(x = 20, y = 65, col = "blue", 
             label = 
                 paste("Pearson r = ",
                       round(bloodlead %$% 
                                 cor(control, exposed),2))) +
    labs(title = "Paired Samples in Blood Lead study",
         x = "Blood Lead Content (mg/dl) in Control Child",
         y = "Blood Lead Content (mg/dl) in Exposed Child")

Each point here represents a pair of observations, one from a control child, and one from the matched exposed child. If there is a strong linear relationship (usually with a positive slope, thus positive correlation) between the paired outcomes, then the pairing will be more helpful in terms of improving statistical power of the estimates we build than if there is a weak relationship.

The stronger the Pearson correlation coefficient, the more helpful pairing will be.
Here, a straight line model using the control child’s blood lead content accounts for about 3.2% of the variation in blood lead content in the exposed child.
As it turns out, pairing will have only a modest impact here on the inferences we draw in the study. We still will treat the data as paired, despite this.

22.3 Looking at Separate Samples: Using `pivot_longer`

For the purpose of estimating the difference between the exposed and control children, the summaries of the paired differences are what we’ll need.

In some settings, however, we might also look at a boxplot, or violin plot, or ridgeline plot that showed the distributions of exposed and control children separately. But we will run into trouble because one variable (blood lead content) is spread across multiple columns (control and exposed.) The solution is to “pivot” the tibble from its current format to build a new, tidy tibble. Because the data aren’t tidied here, so that we have one row for each subject and one column for each variable, we have to do some work to get them in that form for our usual plotting strategy to work well.

pivot_longer() “lengthens” the data, increasing the number of rows and decreasing the number of columns.
pivot_wider() performs the inverse of that transformation, “widening” the data.

In our original bloodlead data, if we drop the lead_diff addition we made, we have wide data, with each row representing two different subjects.

head(bloodlead_original, 3)

# A tibble: 3 x 3
  pair  exposed control
  <chr>   <dbl>   <dbl>
1 P01        38      16
2 P02        23      18
3 P03        41      18

And what we want to accomplish is to have one row for each subject, instead of one row for each pair of subjects. So we want to make the data longer.

bloodlead_longer <- bloodlead_original %>%
    pivot_longer(
        cols = -c(pair),
        names_to = "status",
        values_to = "lead_level")

bloodlead_longer

# A tibble: 66 x 3
   pair  status  lead_level
   <chr> <chr>        <dbl>
 1 P01   exposed         38
 2 P01   control         16
 3 P02   exposed         23
 4 P02   control         18
 5 P03   exposed         41
 6 P03   control         18
 7 P04   exposed         18
 8 P04   control         24
 9 P05   exposed         37
10 P05   control         19
# ... with 56 more rows

For more on this approach (in this case, we’re making the data “longer” and its opposite would be be making the data “wider”), visit the Tidy data chapter in Grolemund and Wickham (2019) and the tidyr repository on Github at https://github.com/tidyverse/tidyr.

And now, we can plot as usual to compare the two samples.

First, we’ll look at a boxplot, showing all of the data.

ggplot(bloodlead_longer, aes(x = status, y = lead_level)) +
    geom_violin() +
    geom_boxplot(aes(fill = status), width = 0.2) +
    scale_fill_viridis_d(begin = 0.5) +
    guides(fill = FALSE) + 
    coord_flip() +
    labs(title = "Boxplot of Lead Content in Exposed and Control kids") + 
    theme_bw()

We’ll also look at a ridgeline plot, because Dr. Love likes them, even though they’re really more useful when we’re comparing more than two samples.

ggplot(bloodlead_longer, aes(x = lead_level, y = status, fill = status)) +
    ggridges::geom_density_ridges(scale = 0.9) +
    guides(fill = FALSE) + 
    labs(title = "Lead Content in Exposed and Control kids") +
    ggridges::theme_ridges()

Picking joint bandwidth of 4.01

Both the center and the spread of the distribution are substantially larger in the exposed group than in the controls. Of course, numerical summaries show these patterns, too.

mosaic::favstats(lead_level ~ status, data = bloodlead_longer)

   status min Q1 median Q3 max mean    sd  n missing
1 control   7 13     16 19  25 15.9  4.54 33       0
2 exposed  10 21     34 39  73 31.8 14.41 33       0

22.4 Estimating the Difference in Means with Paired Samples

Suppose we want to estimate the difference in the mean blood level across the population of children represented by the sample taken in this study. To do so, we must take advantage of the matched samples design, and complete our estimation on the paired differences, treating them as if they were a single sample of data.

One way to accomplish this is simply to run the usual intercept-only linear regression model on the paired differences.

model_lead <- lm(lead_diff ~ 1, data = bloodlead)

tidy(model_lead, conf.int = TRUE, conf.level = 0.90)

# A tibble: 1 x 7
  term        estimate std.error statistic    p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>      <dbl>    <dbl>     <dbl>
1 (Intercept)     16.0      2.76      5.78 0.00000204     11.3      20.6

Our point estimate for the difference (exposed - control) in lead levels is 15.97 mg/dl, and our 90% confidence interval is (11.29, 20.65) mg/dl.

22.4.1 Paired Data in Longer Format?

If we had the data in “longer” format, as in bloodlead_longer, with the pairs identified by the pair variable, then we could obtained the same confidence interval using:

model2_lead <- lm(lead_level ~ status + factor(pair), data = bloodlead_longer)

tidy(model2_lead, conf.int = TRUE, conf.level = 0.90)

# A tibble: 34 x 7
   term           estimate std.error statistic   p.value conf.low conf.high
   <chr>             <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
 1 (Intercept)       19.0       8.05    2.36     2.44e-2     5.38     32.7 
 2 statusexposed     16.0       2.76    5.78     2.04e-6    11.3      20.6 
 3 factor(pair)P~    -6.50     11.2    -0.579    5.66e-1   -25.5      12.5 
 4 factor(pair)P~     2.5      11.2     0.223    8.25e-1   -16.5      21.5 
 5 factor(pair)P~    -6.00     11.2    -0.535    5.96e-1   -25.0      13.0 
 6 factor(pair)P~     1.       11.2     0.0891   9.30e-1   -18.0      20.0 
 7 factor(pair)P~    -3.50     11.2    -0.312    7.57e-1   -22.5      15.5 
 8 factor(pair)P~   -10.5      11.2    -0.936    3.56e-1   -29.5       8.50
 9 factor(pair)P~    11.5      11.2     1.03     3.13e-1    -7.50     30.5 
10 factor(pair)P~    -3.50     11.2    -0.312    7.57e-1   -22.5      15.5 
# ... with 24 more rows

and the key elements are found in the statusexposed row, which we can focus on nicely (since the output of the tidy() function is always a tibble) with:

tidy(model2_lead, conf.int = TRUE, conf.level = 0.90) %>%
    filter(term == "statusexposed") %>% 
    knitr::kable(digits = 2)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
statusexposed	16	2.76	5.78	0	11.3	20.6

and again, we have our 90% confidence interval estimate of the population mean difference between exposed and control children.

More approaches to making inferences about paired sample means as well as other summaries of quantitative outcomes will be found in Chapter 23.

22.5 Comparing Proportions in Paired Samples

We discuss some ideas relevant to the comparison of proportions in paired samples in Chapter 24, in the context of a different example.

References

Good, Phillip I. 2005. Introduction to Statistics Through Resampling Methods and R/S-Plus. Hoboken, NJ: Wiley.

Grolemund, Garrett, and Hadley Wickham. 2019. R for Data Science. O’Reilly. http://r4ds.had.co.nz/.

Morton, D., A. Saah, S. Silberg, W. Owens, M. Roberts, and M. Saah. 1982. “Lead Absorption in Children of Employees in a Lead Related Industry.” American Journal of Epidemiology 115: 549–55.

Pruzek, Robert M., and James E. Helmreich. 2009. “Enhancing Dependent Sample Analyses with Graphics.” Journal of Statistics Education 17(1). http://ww2.amstat.org/publications/jse/v17n1/helmreich.html.