Chapter 22 A Paired Sample Study: Lead in the Blood of Children
One of the best ways to eliminate a source of variation and the errors of interpretation associated with it is through the use of matched pairs. Each subject in one group is matched as closely as possible by a subject in the other group. If a 45-year-old African-American male with hypertension is given a [treatment designed to lower their blood pressure], then we give a second, similarly built 45-year old African-American male with hypertension a placebo.
- Good (2005), section 5.2.4
22.1 The Lead in the Blood of Children Study
Morton et al. (1982) studied the absorption of lead into the blood of children. This was a matched-sample study, where the exposed group of interest contained 33 children of parents who worked in a battery manufacturing factory (where lead was used) in the state of Oklahoma. Specifically, each child with a lead-exposed parent was matched to another child of the same age, exposure to traffic, and living in the same neighborhood whose parents did not work in lead-related industries. So the complete study had 66 children, arranged in 33 matched pairs. The outcome of interest, gathered from a sample of whole blood from each of the children, was lead content, measured in mg/dl.
One motivation for doing this study is captured in the Abstract from Morton et al. (1982).
It has been repeatedly reported that children of employees in a lead-related industry are at increased risk of lead absorption because of the high levels of lead found in the household dust of these workers.
The data are available in several places, including Table 5 of Pruzek and Helmreich (2009), in the BloodLead data set within the PairedData
package in R, but we also make them available in the bloodlead.csv
file. A table of the first few pairs of observations (blood lead levels for one child exposed to lead and the matched control) is shown below.
# A tibble: 33 x 3
pair exposed control
<chr> <dbl> <dbl>
1 P01 38 16
2 P02 23 18
3 P03 41 18
4 P04 18 24
5 P05 37 19
6 P06 36 11
7 P07 23 10
8 P08 62 15
9 P09 31 16
10 P10 34 18
# ... with 23 more rows
- In each pair, one child was exposed (to having a parent working in the factory) and the other was not.
- Otherwise, though, each child was very similar to its matched partner.
- The data under
exposed
andcontrol
are the blood lead content, in mg/dl.
Our primary goal will be to estimate the difference in lead content between the exposed and control children, and then use that sample estimate to make inferences about the difference in lead content between the population of all children like those in the exposed group and the population of all children like those in the control group.
22.1.1 Our Key Questions for a Paired Samples Comparison
- What is the population under study?
- All pairs of children living in Oklahoma near the factory in question, in which one had a parent working in a factory that exposed them to lead, and the other did not.
- What is the sample? Is it representative of the population?
- The sample consists of 33 pairs of one exposed and one control child.
- This is a case-control study, where the children were carefully enrolled to meet the design criteria. Absent any other information, we’re likely to assume that there is no serious bias associated with these pairs, and that assuming they represent the population effectively (and perhaps the broader population of kids whose parents work in lead-based industries more generally) may well be at least as reasonable as assuming they don’t.
- Who are the subjects / individuals within the sample?
- Each of our 33 pairs of children includes one exposed child and one unexposed (control) child.
- What data are available on each individual?
- The blood lead content, as measured in mg/dl of whole blood.
22.1.2 Lead Study Caveats
Note that the children were not randomly selected from general populations of kids whose parents did and did not work in lead-based industries.
- To make inferences to those populations, we must make strong assumptions to believe, for instance, that the sample of exposed children is as representative as a random sample of children with similar exposures across the world would be.
- The researchers did have a detailed theory about how the exposed children might be at increased risk of lead absorption, and in fact as part of the study gathered additional information about whether a possible explanation might be related to the quality of hygiene of the parents (all of them were fathers, actually) who worked in the factory.
- This is an observational study, so that the estimation of a causal effect between parental work in a lead-based industry and children’s blood lead content can be made, without substantial (and perhaps heroic) assumptions.
22.2 Exploratory Data Analysis for Paired Samples
We’ll begin by adjusting the data in two ways.
- We’d like that first variable (
pair
) to be afactor
rather than acharacter
type in R, because we want to be able to summarize it more effectively. So we’ll make that change. - Also, we’d like to calculate the difference in lead content between the exposed and the control children in each pair, and we’ll save that within-pair difference in a variable called
lead_diff
. We’ll takelead_diff
=exposed
-control
so that positive values indicate increased lead in the exposed child.
bloodlead_original <- bloodlead
bloodlead <- bloodlead_original %>%
mutate(pair = factor(pair),
lead_diff = exposed - control)
bloodlead
# A tibble: 33 x 4
pair exposed control lead_diff
<fct> <dbl> <dbl> <dbl>
1 P01 38 16 22
2 P02 23 18 5
3 P03 41 18 23
4 P04 18 24 -6
5 P05 37 19 18
6 P06 36 11 25
7 P07 23 10 13
8 P08 62 15 47
9 P09 31 16 15
10 P10 34 18 16
# ... with 23 more rows
22.2.1 The Paired Differences
To begin, we focus on lead_diff
for our exploratory work, which is the exposed
- control
difference in lead content within each of the 33 pairs. So, we’ll have 33 observations, as compared to the 462 in the serum zinc data, but most of the same tools are still helpful.
res_lead <- mosaic::favstats(~ lead_diff, data = bloodlead)
bin_w1 <- 5 # specify binwidth
p1 <- ggplot(bloodlead, aes(x = lead_diff)) +
geom_histogram(binwidth = bin_w1,
fill = "firebrick",
col = "white") +
theme_light() +
stat_function(
fun = function(x) dnorm(x, mean = res_lead$mean,
sd = res_lead$sd) *
res_lead$n * bin_w1,
col = "blue") +
labs(x = "Diff. in Lead Content (mg/dl)", y = "")
p2 <- ggplot(bloodlead, aes(sample = lead_diff)) +
geom_qq(col = "firebrick") +
geom_qq_line(col = "black") +
theme_light() +
labs(y = "Diff. in Lead Content (mg/dl)")
p3 <- ggplot(bloodlead, aes(x = "", y = lead_diff)) +
geom_violin() +
geom_boxplot(width = 0.5, fill = "firebrick",
outlier.color = "firebrick") +
theme_light() +
coord_flip() +
labs(x = "", y = "Diff. in Lead Content (mg/dl)")
p1 + p2 - p3 + plot_layout(ncol = 1, height = c(3, 1)) +
plot_annotation(title = "Difference in Blood Lead Content (mg/dl) for 33 Pairs of Children")
Note that in all of this work, I plotted the paired differences. One obvious way to tell if you have paired samples is that you can pair every single subject from one exposure group to a unique subject in the other exposure group. Everyone has to be paired, so the sample sizes will always be the same in the two groups.
22.2.2 Numerical Summaries
min Q1 median Q3 max mean sd n missing
-9 4 15 25 60 16 15.9 33 0
[1] 0.0611
22.2.3 Impact of Matching - Scatterplot and Correlation
Here, the data are paired by the study through matching on neighborhood, age and exposure to traffic. Each individual child’s outcome value is part of a pair with the outcome value for his/her matching partner. We can see this pairing in several ways, perhaps by drawing a scatterplot of the pairs.
ggplot(bloodlead, aes(x = control, y = exposed)) +
geom_point(size = 2) +
geom_smooth(method = "lm", se = FALSE) +
geom_text(x = 20, y = 65, col = "blue",
label =
paste("Pearson r = ",
round(bloodlead %$%
cor(control, exposed),2))) +
labs(title = "Paired Samples in Blood Lead study",
x = "Blood Lead Content (mg/dl) in Control Child",
y = "Blood Lead Content (mg/dl) in Exposed Child")
Each point here represents a pair of observations, one from a control child, and one from the matched exposed child. If there is a strong linear relationship (usually with a positive slope, thus positive correlation) between the paired outcomes, then the pairing will be more helpful in terms of improving statistical power of the estimates we build than if there is a weak relationship.
- The stronger the Pearson correlation coefficient, the more helpful pairing will be.
- Here, a straight line model using the control child’s blood lead content accounts for about 3.2% of the variation in blood lead content in the exposed child.
- As it turns out, pairing will have only a modest impact here on the inferences we draw in the study. We still will treat the data as paired, despite this.
22.3 Looking at Separate Samples: Using pivot_longer
For the purpose of estimating the difference between the exposed and control children, the summaries of the paired differences are what we’ll need.
In some settings, however, we might also look at a boxplot, or violin plot, or ridgeline plot that showed the distributions of exposed and control children separately. But we will run into trouble because one variable (blood lead content) is spread across multiple columns (control and exposed.) The solution is to “pivot” the tibble from its current format to build a new, tidy tibble. Because the data aren’t tidied here, so that we have one row for each subject and one column for each variable, we have to do some work to get them in that form for our usual plotting strategy to work well.
pivot_longer()
“lengthens” the data, increasing the number of rows and decreasing the number of columns.pivot_wider()
performs the inverse of that transformation, “widening” the data.
In our original bloodlead
data, if we drop the lead_diff
addition we made, we have wide data, with each row representing two different subjects.
# A tibble: 3 x 3
pair exposed control
<chr> <dbl> <dbl>
1 P01 38 16
2 P02 23 18
3 P03 41 18
And what we want to accomplish is to have one row for each subject, instead of one row for each pair of subjects. So we want to make the data longer.
bloodlead_longer <- bloodlead_original %>%
pivot_longer(
cols = -c(pair),
names_to = "status",
values_to = "lead_level")
bloodlead_longer
# A tibble: 66 x 3
pair status lead_level
<chr> <chr> <dbl>
1 P01 exposed 38
2 P01 control 16
3 P02 exposed 23
4 P02 control 18
5 P03 exposed 41
6 P03 control 18
7 P04 exposed 18
8 P04 control 24
9 P05 exposed 37
10 P05 control 19
# ... with 56 more rows
For more on this approach (in this case, we’re making the data “longer” and its opposite would be be making the data “wider”), visit the Tidy data chapter in Grolemund and Wickham (2019) and the tidyr repository on Github at https://github.com/tidyverse/tidyr.
And now, we can plot as usual to compare the two samples.
First, we’ll look at a boxplot, showing all of the data.
ggplot(bloodlead_longer, aes(x = status, y = lead_level)) +
geom_violin() +
geom_boxplot(aes(fill = status), width = 0.2) +
scale_fill_viridis_d(begin = 0.5) +
guides(fill = FALSE) +
coord_flip() +
labs(title = "Boxplot of Lead Content in Exposed and Control kids") +
theme_bw()
We’ll also look at a ridgeline plot, because Dr. Love likes them, even though they’re really more useful when we’re comparing more than two samples.
ggplot(bloodlead_longer, aes(x = lead_level, y = status, fill = status)) +
ggridges::geom_density_ridges(scale = 0.9) +
guides(fill = FALSE) +
labs(title = "Lead Content in Exposed and Control kids") +
ggridges::theme_ridges()
Picking joint bandwidth of 4.01
Both the center and the spread of the distribution are substantially larger in the exposed group than in the controls. Of course, numerical summaries show these patterns, too.
status min Q1 median Q3 max mean sd n missing
1 control 7 13 16 19 25 15.9 4.54 33 0
2 exposed 10 21 34 39 73 31.8 14.41 33 0
22.4 Estimating the Difference in Means with Paired Samples
Suppose we want to estimate the difference in the mean blood level across the population of children represented by the sample taken in this study. To do so, we must take advantage of the matched samples design, and complete our estimation on the paired differences, treating them as if they were a single sample of data.
One way to accomplish this is simply to run the usual intercept-only linear regression model on the paired differences.
model_lead <- lm(lead_diff ~ 1, data = bloodlead)
tidy(model_lead, conf.int = TRUE, conf.level = 0.90)
# A tibble: 1 x 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 16.0 2.76 5.78 0.00000204 11.3 20.6
Our point estimate for the difference (exposed - control) in lead levels is 15.97 mg/dl, and our 90% confidence interval is (11.29, 20.65) mg/dl.
22.4.1 Paired Data in Longer Format?
If we had the data in “longer” format, as in bloodlead_longer
, with the pairs identified by the pair
variable, then we could obtained the same confidence interval using:
model2_lead <- lm(lead_level ~ status + factor(pair), data = bloodlead_longer)
tidy(model2_lead, conf.int = TRUE, conf.level = 0.90)
# A tibble: 34 x 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 19.0 8.05 2.36 2.44e-2 5.38 32.7
2 statusexposed 16.0 2.76 5.78 2.04e-6 11.3 20.6
3 factor(pair)P~ -6.50 11.2 -0.579 5.66e-1 -25.5 12.5
4 factor(pair)P~ 2.5 11.2 0.223 8.25e-1 -16.5 21.5
5 factor(pair)P~ -6.00 11.2 -0.535 5.96e-1 -25.0 13.0
6 factor(pair)P~ 1. 11.2 0.0891 9.30e-1 -18.0 20.0
7 factor(pair)P~ -3.50 11.2 -0.312 7.57e-1 -22.5 15.5
8 factor(pair)P~ -10.5 11.2 -0.936 3.56e-1 -29.5 8.50
9 factor(pair)P~ 11.5 11.2 1.03 3.13e-1 -7.50 30.5
10 factor(pair)P~ -3.50 11.2 -0.312 7.57e-1 -22.5 15.5
# ... with 24 more rows
and the key elements are found in the statusexposed
row, which we can focus on nicely (since the output of the tidy()
function is always a tibble) with:
tidy(model2_lead, conf.int = TRUE, conf.level = 0.90) %>%
filter(term == "statusexposed") %>%
knitr::kable(digits = 2)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
statusexposed | 16 | 2.76 | 5.78 | 0 | 11.3 | 20.6 |
and again, we have our 90% confidence interval estimate of the population mean difference between exposed and control children.
More approaches to making inferences about paired sample means as well as other summaries of quantitative outcomes will be found in Chapter 23.
22.5 Comparing Proportions in Paired Samples
We discuss some ideas relevant to the comparison of proportions in paired samples in Chapter 24, in the context of a different example.
References
Good, Phillip I. 2005. Introduction to Statistics Through Resampling Methods and R/S-Plus. Hoboken, NJ: Wiley.
Grolemund, Garrett, and Hadley Wickham. 2019. R for Data Science. O’Reilly. http://r4ds.had.co.nz/.
Morton, D., A. Saah, S. Silberg, W. Owens, M. Roberts, and M. Saah. 1982. “Lead Absorption in Children of Employees in a Lead Related Industry.” American Journal of Epidemiology 115: 549–55.
Pruzek, Robert M., and James E. Helmreich. 2009. “Enhancing Dependent Sample Analyses with Graphics.” Journal of Statistics Education 17(1). http://ww2.amstat.org/publications/jse/v17n1/helmreich.html.