Chapter 20 Comparing Means/Quantities using Two Independent Samples
Here, we’ll consider the problem of estimating a confidence interval to describe the difference in population means (or medians) based on a comparison of two samples of quantitative data, gathered using an independent samples design. Specifically, we’ll look at the randomized controlled trial of Ibuprofen in Sepsis patients described in Chapter 19.
In that trial, 300 patients meeting specific criteria (including elevated temperature) for a diagnosis of sepsis were randomly assigned to either the Ibuprofen group (150 patients) or the Placebo group (150 patients). Group information (our exposure) is contained in the treat variable. The key outcome of interest to us was temp_drop, the change in body temperature (in \(^{\circ}\)C) from baseline to 2 hours later, so that positive numbers indicate drops in temperature (a good outcome). Here’s the comparison of temp_drop summary statistics in the two treat groups.
treat min Q1 median Q3 max mean sd n missing
1 Ibuprofen -1.5 0.000 0.5 0.9 3.1 0.464 0.688 150 0
2 Placebo -2.7 -0.175 0.1 0.4 1.9 0.153 0.571 150 0
20.1 t-based CI for population mean difference \(\mu_1 - \mu_2\) from Independent Samples
20.1.1 The Pooled t procedure
The most commonly used t-procedure for building a confidence interval assumes not only that each of the two populations being compared follows a Normal distribution, but also that they have the same population variance. This is the pooled t-test, and it is what people usually mean when they describe a two-sample t test.
Two Sample t-test
data: temp_drop by treat
t = 4, df = 298, p-value = 3e-05
alternative hypothesis: true difference in means is not equal to 0
90 percent confidence interval:
0.191 0.432
sample estimates:
mean in group Ibuprofen mean in group Placebo
0.464 0.153
Or, we can store the test as an object and use tidy on it:
# %$% is the exposition pipe from magrittr; tidy() comes from broom
tt1 <- sepsis %$% t.test(temp_drop ~ treat,
                         conf.level = 0.90,
                         alternative = "two.sided",
                         var.equal = TRUE)
tidy(tt1)
# A tibble: 1 x 9
estimate1 estimate2 statistic p.value parameter conf.low conf.high method
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 0.464 0.153 4.27 2.68e-5 298 0.191 0.432 " Two~
# ... with 1 more variable: alternative <chr>
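To see where the pooled interval comes from, here is a minimal sketch in base R that rebuilds it from the rounded summary statistics shown at the start of the chapter. The variable names are illustrative, and because the inputs are rounded to three decimals the result should land close to, but not exactly on, the t.test output.

```r
# Summary statistics copied from the table at the top of this chapter
m1 <- 0.464; s1 <- 0.688; n1 <- 150   # Ibuprofen
m2 <- 0.153; s2 <- 0.571; n2 <- 150   # Placebo

# Pooled variance: weighted average of the two sample variances
df_pooled <- n1 + n2 - 2
sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df_pooled

# Standard error of the difference in means under the equal-variance assumption
se_pooled <- sqrt(sp2 * (1 / n1 + 1 / n2))

# A two-sided 90% interval uses the 0.95 quantile of t with n1 + n2 - 2 df
t_crit <- qt(0.95, df = df_pooled)
est <- m1 - m2
ci_pooled <- est + c(-1, 1) * t_crit * se_pooled

round(c(estimate = est, se = se_pooled,
        lower = ci_pooled[1], upper = ci_pooled[2]), 3)
```

The standard error here (about 0.073) is the same one that appears for the treatPlacebo term in the regression output later in this section.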
20.1.2 Using linear regression to obtain a pooled t confidence interval
A linear regression model, using the same outcome and predictor (group) as the pooled t procedure, produces the same confidence interval, again, under the assumption that the two populations we are comparing follow a Normal distribution with the same (population) variance.
# A tibble: 2 x 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.464 0.0516 8.99 2.91e-17 0.379 0.549
2 treatPlacebo -0.311 0.0730 -4.27 2.68e- 5 -0.432 -0.191
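We see that our point estimate from the linear regression model is that the difference in temp_drop (Placebo minus Ibuprofen) is -0.311, so Ibuprofen subjects have higher temp_drop values than do Placebo subjects, and that the 90% confidence interval for this difference ranges from -0.432 to -0.191. The signs are reversed relative to the pooled t output because the treatPlacebo coefficient measures how far the Placebo mean sits from the Ibuprofen mean (the intercept), while the t test reports Ibuprofen minus Placebo.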
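The equivalence between the two approaches can be checked with a short sketch. Since the sepsis tibble is not reproduced here, the code below uses simulated data (an assumption for illustration only, not the trial data) with group sizes, means and standard deviations loosely resembling the summaries above. It relies only on base R.

```r
set.seed(20)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in for the sepsis data (NOT the trial data)
sim <- data.frame(
  treat = factor(rep(c("Ibuprofen", "Placebo"), each = 150)),
  temp_drop = c(rnorm(150, mean = 0.46, sd = 0.69),
                rnorm(150, mean = 0.15, sd = 0.57))
)

# Pooled t interval: Ibuprofen minus Placebo
tt <- t.test(temp_drop ~ treat, data = sim,
             var.equal = TRUE, conf.level = 0.90)

# Regression interval for treatPlacebo: Placebo minus Ibuprofen
fit <- lm(temp_drop ~ treat, data = sim)
ci_lm <- confint(fit, "treatPlacebo", level = 0.90)

tt$conf.int               # Ibuprofen - Placebo
-rev(as.numeric(ci_lm))   # regression interval with sign and order flipped
```

After flipping the sign and swapping the endpoints, the regression interval should match the pooled t interval up to floating-point error.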
We can obtain a t-based confidence interval for each of the parameter estimates in a linear model directly using confint. The summary of a linear model usually shows only the estimate and standard error. Remember that a reasonable approximation in large samples to a 95% confidence interval for a regression estimate (slope or intercept) can be obtained from estimate \(\pm\) 2 \(\times\) standard error.
Call:
lm(formula = temp_drop ~ treat, data = sepsis)
Residuals:
Min 1Q Median 3Q Max
-2.8527 -0.3640 -0.0527 0.3473 2.6360
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4640 0.0516 8.99 < 2e-16 ***
treatPlacebo -0.3113 0.0730 -4.27 2.7e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.632 on 298 degrees of freedom
Multiple R-squared: 0.0575, Adjusted R-squared: 0.0544
F-statistic: 18.2 on 1 and 298 DF, p-value: 2.68e-05
So, in the case of the treatPlacebo estimate, we can obtain an approximate 95% confidence interval with -0.311 \(\pm\) 2 \(\times\) 0.073, or (-0.457, -0.165). Compare this to the 95% confidence interval available from the model directly with the confint command below, and you’ll see only a small difference. (The tidied output above shows the 90% interval, which is why it is narrower.)
2.5 % 97.5 %
(Intercept) 0.362 0.566
treatPlacebo -0.455 -0.168
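The small gap between the two intervals comes from the multiplier: the rule of thumb uses 2, while the t distribution with 298 degrees of freedom gives a value slightly below 2. A minimal sketch, plugging in the estimate and standard error from the summary output above (variable names are illustrative):

```r
est <- -0.3113   # treatPlacebo estimate from the summary output
se  <-  0.0730   # its standard error
df  <-  298      # residual degrees of freedom

approx_ci <- est + c(-1, 1) * 2 * se               # "2 standard errors" rule
exact_ci  <- est + c(-1, 1) * qt(0.975, df) * se   # t-based 95% interval

qt(0.975, df)        # the exact multiplier, a little below 2
round(approx_ci, 3)
round(exact_ci, 3)
```

With these inputs the t-based interval should agree with the confint output to about three decimal places.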
20.1.3 The Welch t procedure
The default confidence interval based on the t test for independent samples in R uses something called the Welch test, in which the two populations being compared are not assumed to have the same variance. Each population is assumed to follow a Normal distribution.
Welch Two Sample t-test
data: temp_drop by treat
t = 4, df = 288, p-value = 3e-05
alternative hypothesis: true difference in means is not equal to 0
90 percent confidence interval:
0.191 0.432
sample estimates:
mean in group Ibuprofen mean in group Placebo
0.464 0.153
Tidying works in this situation, too.
# A tibble: 1 x 10
estimate estimate1 estimate2 statistic p.value parameter conf.low
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.311 0.464 0.153 4.27 2.71e-5 288. 0.191
# ... with 3 more variables: conf.high <dbl>, method <chr>,
# alternative <chr>
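When there is a balanced design, that is, when the same number of observations appear in each of the two samples, the Welch and pooled t procedures share the same point estimate and the same standard error, so their confidence intervals differ only through the degrees of freedom (here, 288 vs. 298) and agree to three decimal places. Larger differences appear if the sample sizes in the two groups being compared are different, especially when the sample variances also differ.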
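The Welch degrees of freedom reported above come from the Welch-Satterthwaite approximation. As a rough check, this sketch computes them from the rounded summary statistics (so the result is approximate) and compares the Welch and pooled standard errors for these equal-sized groups.

```r
s1 <- 0.688; n1 <- 150   # Ibuprofen sd and sample size
s2 <- 0.571; n2 <- 150   # Placebo sd and sample size

v1 <- s1^2 / n1          # squared standard error of each group mean
v2 <- s2^2 / n2

# Standard errors of the difference in means under each procedure
se_welch  <- sqrt(v1 + v2)
sp2       <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)
se_pooled <- sqrt(sp2 * (1 / n1 + 1 / n2))

# Welch-Satterthwaite approximate degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

c(se_welch = se_welch, se_pooled = se_pooled)
c(df_welch = df_welch, df_pooled = n1 + n2 - 2)
c(t_welch = qt(0.95, df_welch), t_pooled = qt(0.95, n1 + n2 - 2))
```

With equal group sizes the two standard errors coincide, and the t multipliers for roughly 288 and 298 degrees of freedom differ only far past the decimal point, which is why the two 90% intervals look the same at three decimals.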
20.2 Bootstrap CI for \(\mu_1 - \mu_2\) from Independent Samples
The bootdif function contained in the Love-boost.R script that we will use in this setting is a slightly edited version of the function at http://biostat.mc.vanderbilt.edu/wiki/Main/BootstrapMeansSoftware. Note that this approach uses a comma to separate the outcome variable (here, temp_drop) from the variable identifying the exposure groups (here, treat). The mean difference it reports here is Placebo minus Ibuprofen, so the signs are reversed relative to the t-based intervals above.
Mean Difference 0.05 0.95
-0.311 -0.431 -0.183
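For readers without the Love-boost.R script, the general idea can be sketched with a basic percentile bootstrap. The helper below, boot_mean_diff, is a hypothetical function written for this illustration; it is not bootdif and may differ from it in detail. The data are simulated stand-ins, not the trial data.

```r
# Hypothetical helper (NOT the bootdif function from Love-boost.R):
# percentile bootstrap CI for the difference in means between two groups,
# computed as (second factor level) minus (first factor level).
boot_mean_diff <- function(y, g, conf.level = 0.90, B = 2000) {
  g <- factor(g)
  stopifnot(nlevels(g) == 2)
  y1 <- y[g == levels(g)[1]]
  y2 <- y[g == levels(g)[2]]
  # Resample each group separately, with replacement, B times
  diffs <- replicate(B, {
    mean(sample(y2, replace = TRUE)) - mean(sample(y1, replace = TRUE))
  })
  alpha <- 1 - conf.level
  c(mean_diff = mean(y2) - mean(y1),
    quantile(diffs, c(alpha / 2, 1 - alpha / 2)))
}

set.seed(431)  # arbitrary seed for reproducibility
sim_treat <- rep(c("Ibuprofen", "Placebo"), each = 150)
sim_drop  <- c(rnorm(150, 0.46, 0.69), rnorm(150, 0.15, 0.57))

res <- boot_mean_diff(sim_drop, sim_treat)
res
```

As in the output above, the comparison is passed as outcome first, then group, and the result is the observed difference followed by the 5th and 95th percentiles of the resampled differences.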
20.3 Wilcoxon-Mann-Whitney “Rank Sum” CI from Independent Samples
As in the one-sample case, a rank-based alternative attributed to Wilcoxon (and sometimes to Mann and Whitney) provides a two-sample comparison of the pseudomedians in the two treat groups in terms of temp_drop. This is called a rank sum test, rather than the Wilcoxon signed rank test that is used for inference about a single sample. Here’s the resulting 90% confidence interval for the difference in pseudomedians, which R labels the “difference in location”.
wt <- sepsis %$% wilcox.test(temp_drop ~ treat,
                             conf.int = TRUE, conf.level = 0.90,
                             alternative = "two.sided")
wt
Wilcoxon rank sum test with continuity correction
data: temp_drop by treat
W = 14614, p-value = 7e-06
alternative hypothesis: true location shift is not equal to 0
90 percent confidence interval:
0.2 0.4
sample estimates:
difference in location
0.3
# A tibble: 1 x 7
estimate statistic p.value conf.low conf.high method alternative
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 0.300 14614. 7.28e-6 0.200 0.400 Wilcoxon rank~ two.sided
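The “difference in location” reported here is described in the wilcox.test documentation as the median of the difference between a sample from the first group and a sample from the second. The sketch below illustrates this on small simulated samples (an assumption for illustration; not the trial data), where R uses its exact method and the estimate is the median of all pairwise differences.

```r
set.seed(2024)               # arbitrary seed for reproducibility
x <- rnorm(20, mean = 0.5)   # simulated "treatment" values, not trial data
y <- rnorm(20, mean = 0.1)   # simulated "control" values

# All 20 x 20 pairwise differences x_i - y_j, and their median
pairwise <- outer(x, y, "-")
hl <- median(pairwise)

# Small samples without ties: wilcox.test uses its exact method
wt_sim <- wilcox.test(x, y, conf.int = TRUE, conf.level = 0.90)

c(median_of_pairwise_differences = hl,
  wilcox_estimate = unname(wt_sim$estimate))
```

With larger samples or tied values, as in the sepsis data, R switches to a Normal approximation (hence “with continuity correction” in the output above), so the reported estimate can differ slightly from the raw median of pairwise differences.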
20.4 Summary: Specifying A Two-Sample Study Design
These questions will help specify the details of the study design involved in any comparison of two populations on a quantitative outcome, perhaps with means.
- What is the outcome under study?
- What are the (in this case, two) treatment/exposure groups?
- Were the data collected using matched / paired samples or independent samples?
- Are the data a random sample from the population(s) of interest? Or is there at least a reasonable argument for generalizing from the sample to the population(s)?
- What is the significance level (or, the confidence level) we require here?
- Are we doing one-sided or two-sided testing/confidence interval generation?
- If we have paired samples, did pairing help reduce nuisance variation?
- If we have paired samples, what does the distribution of sample paired differences tell us about which inferential procedure to use?
- If we have independent samples, what does the distribution of each individual sample tell us about which inferential procedure to use?
20.4.1 For the sepsis study
- The outcome is temp_drop, the change in body temperature (in \(^{\circ}\)C) from baseline to 2 hours later, so that positive numbers indicate drops in temperature (a good outcome).
- The groups are Ibuprofen and Placebo as contained in the treat variable in the sepsis tibble.
- The data were collected using independent samples. The Ibuprofen subjects are not matched or linked to individual Placebo subjects - they are separate groups.
- The subjects of the study aren’t drawn from a random sample of the population of interest, but they are randomly assigned to their respective treatments (Ibuprofen and Placebo) which will provide the reasoned basis for our inferences.
- We’ll use a 10% significance level (or 90% confidence level) in this setting, as we did in our previous work on these data.
- We’ll use a two-sided testing and confidence interval approach.
Questions 7 and 8 don’t apply, because these are independent samples of data, rather than paired samples.
To address question 9, we’ll need to look at the data in each sample. We’ll repeat the boxplot and Normal Q-Q plots from Chapter 19, which allow us to assess the Normality of the distributions of the temp_drop results in the Ibuprofen and Placebo groups separately.
# Boxplot with violin overlay of temp_drop by treatment group (requires ggplot2)
ggplot(sepsis, aes(x = treat, y = temp_drop, fill = treat)) +
  geom_violin() +
  geom_boxplot(width = 0.3, fill = "white") +
  scale_fill_viridis_d() +
  guides(fill = "none") +   # fill = FALSE is deprecated in recent ggplot2
  labs(title = "Boxplot of Temperature Drop in Sepsis Patients",
       x = "", y = "Drop in Temperature (degrees C)") +
  coord_flip() +
  theme_bw()
# Normal Q-Q plot of temp_drop, one panel per treatment group
ggplot(sepsis, aes(sample = temp_drop)) +
  geom_qq() + geom_qq_line(col = "red") +
  theme_bw() +
  facet_wrap(~ treat) +
  labs(y = "Temperature Drop Values (in degrees C)")
From these plots we conclude that the data in the Ibuprofen sample follow a reasonably Normal distribution, but this isn’t quite as true for the Placebo sample. It’s hard to know whether the apparent Placebo group outliers will affect whether the Normal distribution assumption is reasonable, so we can see if the confidence intervals change much when we don’t assume Normality (for instance, comparing the bootstrap to the t-based approaches), as a way of understanding whether a Normal model has a large impact on our conclusions.
20.4.2 Sepsis Estimation Results
Here’s a set of confidence interval estimates (we’ll use 90% confidence here) using the methods discussed in this Chapter.
treat min Q1 median Q3 max mean sd n missing
1 Ibuprofen -1.5 0.000 0.5 0.9 3.1 0.464 0.688 150 0
2 Placebo -2.7 -0.175 0.1 0.4 1.9 0.153 0.571 150 0
s_pooled_t_test <- sepsis %$% t.test(temp_drop ~ treat,
                                     conf.level = 0.90,
                                     alternative = "two.sided",
                                     var.equal = TRUE)
tidy(s_pooled_t_test) %>%
  select(conf.low, conf.high)
# A tibble: 1 x 2
conf.low conf.high
<dbl> <dbl>
1 0.191 0.432
s_welch_t_test <- sepsis %$% t.test(temp_drop ~ treat,
                                    conf.level = 0.90,
                                    alternative = "two.sided",
                                    var.equal = FALSE)
tidy(s_welch_t_test) %>%
  select(estimate, conf.low, conf.high)
# A tibble: 1 x 3
estimate conf.low conf.high
<dbl> <dbl> <dbl>
1 0.311 0.191 0.432
s_wilcoxon_test <- sepsis %$% wilcox.test(temp_drop ~ treat,
                                          conf.int = TRUE, conf.level = 0.90,
                                          alternative = "two.sided")
tidy(s_wilcoxon_test) %>%
  select(estimate, conf.low, conf.high)
# A tibble: 1 x 3
estimate conf.low conf.high
<dbl> <dbl> <dbl>
1 0.300 0.200 0.400
Mean Difference 0.05 0.95
-0.311 -0.431 -0.183
Procedure | Compares… | Point Estimate | 90% CI |
---|---|---|---|
Pooled t | Means | 0.311 | (0.191, 0.432) |
Welch t | Means | 0.311 | (0.191, 0.432) |
Bootstrap | Means | 0.311 | (0.183, 0.431) |
Wilcoxon rank sum | Pseudo-Medians | 0.3 | (0.2, 0.4) |
What conclusions can we draw in this setting?