Chapter 20 Comparing Means/Quantities using Two Independent Samples

Here, we’ll consider the problem of estimating a confidence interval to describe the difference in population means (or medians) based on a comparison of two samples of quantitative data, gathered using an independent samples design. Specifically, we’ll look at the randomized controlled trial of Ibuprofen in Sepsis patients described in Chapter 19.

In that trial, 300 patients meeting specific criteria (including elevated temperature) for a diagnosis of sepsis were randomly assigned to either the Ibuprofen group (150 patients) or the Placebo group (150 patients). Group information (our exposure) is contained in the treat variable. The key outcome of interest to us was temp_drop, the change in body temperature (in \(^{\circ}\)C) from baseline to 2 hours later, so that positive numbers indicate drops in temperature (a good outcome). Here’s the comparison of temp_drop summary statistics in the two treat groups.

mosaic::favstats(temp_drop ~ treat, data = sepsis)

      treat  min     Q1 median  Q3 max  mean    sd   n missing
1 Ibuprofen -1.5  0.000    0.5 0.9 3.1 0.464 0.688 150       0
2   Placebo -2.7 -0.175    0.1 0.4 1.9 0.153 0.571 150       0

20.1 t-based CI for population mean difference \(\mu_1 - \mu_2\) from Independent Samples

20.1.1 The Pooled t procedure

The most commonly used t-procedure for building a confidence interval assumes not only that each of the two populations being compared follows a Normal distribution, but also that they have the same population variance. This is the pooled t-test, and it is what people usually mean when they describe a two-sample t test.

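library(magrittr)   # provides the %$% exposition pipe
library(broom)      # provides tidy()
library(tidyverse)  # provides %>%, select() and ggplot()

sepsis %$% t.test(temp_drop ~ treat,
                  conf.level = 0.90,
                  alt = "two.sided",
                  var.equal = TRUE)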


    Two Sample t-test

data:  temp_drop by treat
t = 4, df = 298, p-value = 3e-05
alternative hypothesis: true difference in means is not equal to 0
90 percent confidence interval:
 0.191 0.432
sample estimates:
mean in group Ibuprofen   mean in group Placebo 
                  0.464                   0.153

Or, we can use tidy on this object:

tt1 <- sepsis %$% t.test(temp_drop ~ treat,
                  conf.level = 0.90,
                  alt = "two.sided",
                  var.equal = TRUE)
tidy(tt1)

# A tibble: 1 x 9
  estimate1 estimate2 statistic p.value parameter conf.low conf.high method
      <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl> <chr> 
1     0.464     0.153      4.27 2.68e-5       298    0.191     0.432 " Two~
# ... with 1 more variable: alternative <chr>
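To see where these numbers come from, the pooled t interval can be rebuilt by hand from the summary statistics. The following is a minimal sketch, assuming only the rounded favstats values shown earlier (the variable names are ours, chosen for illustration), so it should agree with the t.test result only up to rounding.

```r
# Summary statistics copied from the favstats output (rounded to 3 decimals)
n1 <- 150; m1 <- 0.464; s1 <- 0.688   # Ibuprofen
n2 <- 150; m2 <- 0.153; s2 <- 0.571   # Placebo

# Pooled standard deviation: weighted average of the two sample variances
sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))

# Standard error of the difference in means, and its degrees of freedom
se <- sp * sqrt(1 / n1 + 1 / n2)
df <- n1 + n2 - 2

# 90% two-sided interval uses the 95th percentile of the t distribution
t_crit <- qt(0.95, df)
pooled_ci <- (m1 - m2) + c(-1, 1) * t_crit * se

round(c(sp = sp, se = se, df = df), 3)
round(pooled_ci, 3)
```

Because the inputs are rounded, the last digit of either endpoint may differ from the interval reported by t.test on the raw data.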

20.1.2 Using linear regression to obtain a pooled t confidence interval

A linear regression model, using the same outcome and predictor (group) as the pooled t procedure, produces the same confidence interval, again, under the assumption that the two populations we are comparing follow a Normal distribution with the same (population) variance.

model1 <- lm(temp_drop ~ treat, data = sepsis)

tidy(model1, conf.int = TRUE, conf.level = 0.90)

# A tibble: 2 x 7
  term         estimate std.error statistic  p.value conf.low conf.high
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)     0.464    0.0516      8.99 2.91e-17    0.379     0.549
2 treatPlacebo   -0.311    0.0730     -4.27 2.68e- 5   -0.432    -0.191

We see that our point estimate from the linear regression model for the difference in mean temp_drop (Placebo minus Ibuprofen, since Ibuprofen is the reference group) is -0.311, so that Ibuprofen subjects have higher temp_drop values, on average, than do Placebo subjects. The 90% confidence interval for this difference ranges from -0.432 to -0.191, which is the pooled t interval with the signs reversed.
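The agreement between the regression and the pooled t procedure can be checked on any two-group data set. Here is a small sketch using simulated data, since the sepsis tibble is not reproduced here; the data frame sim, the seed, and the simulation parameters are assumptions made purely for illustration.

```r
set.seed(2024)  # arbitrary seed, for reproducibility only

# Simulated stand-in for a two-group study (not the real sepsis data)
sim <- data.frame(
  treat = factor(rep(c("Ibuprofen", "Placebo"), each = 150)),
  temp_drop = c(rnorm(150, mean = 0.46, sd = 0.69),
                rnorm(150, mean = 0.15, sd = 0.57))
)

# Pooled t interval: reports Ibuprofen minus Placebo
tt <- t.test(temp_drop ~ treat, data = sim,
             var.equal = TRUE, conf.level = 0.90)

# Regression interval for the slope: reports Placebo minus Ibuprofen
fit <- lm(temp_drop ~ treat, data = sim)
ci_lm <- confint(fit, "treatPlacebo", level = 0.90)

# Flip the sign and the order of the lm endpoints before comparing
same_interval <- all.equal(as.numeric(tt$conf.int),
                           rev(-as.numeric(ci_lm)))
same_interval
```

The only difference between the two results is the direction of subtraction, which depends on which level of treat is treated as the reference group.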

We can obtain a t-based confidence interval for each of the parameter estimates in a linear model directly using confint. The summary of a linear model usually shows only the estimate and standard error (along with a t statistic and p value), rather than a confidence interval. Remember that a reasonable approximation in large samples to a 95% confidence interval for a regression estimate (slope or intercept) can be obtained from estimate \(\pm\) 2 * standard error.

summary(model1)


Call:
lm(formula = temp_drop ~ treat, data = sepsis)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8527 -0.3640 -0.0527  0.3473  2.6360 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    0.4640     0.0516    8.99  < 2e-16 ***
treatPlacebo  -0.3113     0.0730   -4.27  2.7e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.632 on 298 degrees of freedom
Multiple R-squared:  0.0575,    Adjusted R-squared:  0.0544 
F-statistic: 18.2 on 1 and 298 DF,  p-value: 2.68e-05

So, in the case of the treatPlacebo estimate, we can obtain an approximate 95% confidence interval with -0.311 \(\pm\) 2 x 0.073 or (-0.457, -0.165). Compare this to the 95% confidence interval available from the model directly with the confint command below (the tidied output above used a 90% confidence level instead), and you’ll see only a small difference.

confint(model1, level = 0.95)

              2.5 % 97.5 %
(Intercept)   0.362  0.566
treatPlacebo -0.455 -0.168
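The small gap between the two intervals comes from replacing the exact t multiplier with 2. A short sketch of that arithmetic, using the estimate, standard error and degrees of freedom from the summary output above (the object names are ours):

```r
# Values taken from summary(model1) for the treatPlacebo term
est <- -0.3113
se  <- 0.0730
df  <- 298

# Rule-of-thumb interval: estimate plus or minus 2 standard errors
approx_ci <- est + c(-1, 1) * 2 * se

# Exact t-based interval: multiplier is the 97.5th percentile of t with 298 df
t_mult   <- qt(0.975, df)
exact_ci <- est + c(-1, 1) * t_mult * se

round(t_mult, 3)
round(rbind(approx_ci, exact_ci), 3)
```

With 298 degrees of freedom the exact multiplier is a little below 2, so the rule-of-thumb interval is slightly wider than the one from confint.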

20.1.3 The Welch t procedure

The default confidence interval based on the t test for independent samples in R uses something called the Welch test, in which the two populations being compared are not assumed to have the same variance. Each population is assumed to follow a Normal distribution.

sepsis %$% t.test(temp_drop ~ treat,
                  conf.level = 0.90, 
                  alt = "two.sided")


    Welch Two Sample t-test

data:  temp_drop by treat
t = 4, df = 288, p-value = 3e-05
alternative hypothesis: true difference in means is not equal to 0
90 percent confidence interval:
 0.191 0.432
sample estimates:
mean in group Ibuprofen   mean in group Placebo 
                  0.464                   0.153

Tidying works in this situation, too.

tt0 <- sepsis %$% t.test(temp_drop ~ treat,
                  conf.level = 0.90, 
                  alt = "two.sided")

tidy(tt0)

# A tibble: 1 x 10
  estimate estimate1 estimate2 statistic p.value parameter conf.low
     <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>
1    0.311     0.464     0.153      4.27 2.71e-5      288.    0.191
# ... with 3 more variables: conf.high <dbl>, method <chr>,
#   alternative <chr>

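When there is a balanced design, that is, when the same number of observations appear in each of the two samples, then the Welch t test and the Pooled t test produce the same point estimate and standard error, and differ only in their degrees of freedom (here, 288 for Welch and 298 for pooled). As a result, the two confidence intervals are essentially identical, and match to three decimal places here. Larger differences appear if the sample sizes in the two groups being compared are different, especially when the sample variances also differ.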
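One way to see how the Welch and pooled procedures relate in a balanced design is to compute both standard errors, along with the Welch-Satterthwaite degrees of freedom, by hand. This is a sketch that assumes only the rounded sample sizes and standard deviations from the favstats output; the object names are ours.

```r
# Sample sizes and standard deviations from the favstats output
n1 <- 150; s1 <- 0.688   # Ibuprofen
n2 <- 150; s2 <- 0.571   # Placebo

# Welch: each group contributes its own variance of the sample mean
v1 <- s1^2 / n1
v2 <- s2^2 / n2
se_welch <- sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Pooled: one common variance estimate, n1 + n2 - 2 degrees of freedom
sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
se_pooled <- sp * sqrt(1 / n1 + 1 / n2)
df_pooled <- n1 + n2 - 2

c(se_welch = se_welch, se_pooled = se_pooled)
c(df_welch = df_welch, df_pooled = df_pooled)

# The t multipliers for a 90% interval barely differ at these df
c(t_welch = qt(0.95, df_welch), t_pooled = qt(0.95, df_pooled))
```

With equal group sizes the two standard errors coincide, so any difference between the intervals comes only from the slightly different t multipliers.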

20.2 Bootstrap CI for \(\mu_1 - \mu_2\) from Independent Samples

The bootdif function that we will use in this setting is contained in the Love-boost.R script, and is a slightly edited version of the function at http://biostat.mc.vanderbilt.edu/wiki/Main/BootstrapMeansSoftware. Note that this approach uses a comma to separate the outcome variable (here, temp_drop) from the variable identifying the exposure groups (here, treat). Note also that the output below reports the difference as Placebo minus Ibuprofen, so its signs are reversed relative to the t.test results.

set.seed(431212)

sepsis %$% bootdif(temp_drop, treat, conf.level = 0.90)

Mean Difference            0.05            0.95 
         -0.311          -0.431          -0.183
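For readers without access to Love-boost.R, the general idea of a percentile bootstrap for a difference in means can be sketched as follows. This is not the bootdif implementation; boot_mean_diff is a hypothetical helper written for illustration, it subtracts the second group from the first, and it is demonstrated on simulated data rather than the sepsis tibble.

```r
# Hypothetical helper: percentile bootstrap CI for mean(group 1) - mean(group 2)
boot_mean_diff <- function(y, g, conf.level = 0.90, B = 2000) {
  g <- factor(g)
  stopifnot(nlevels(g) == 2)
  y1 <- y[g == levels(g)[1]]
  y2 <- y[g == levels(g)[2]]
  # Resample each group separately, with replacement, B times
  diffs <- replicate(B, mean(sample(y1, replace = TRUE)) -
                        mean(sample(y2, replace = TRUE)))
  alpha <- 1 - conf.level
  q <- quantile(diffs, c(alpha / 2, 1 - alpha / 2), names = FALSE)
  c(estimate = mean(y1) - mean(y2), conf.low = q[1], conf.high = q[2])
}

set.seed(431)  # arbitrary seed, for reproducibility only

# Simulated stand-in data (not the real sepsis data)
treat <- rep(c("Ibuprofen", "Placebo"), each = 150)
temp_drop <- c(rnorm(150, 0.46, 0.69), rnorm(150, 0.15, 0.57))

res <- boot_mean_diff(temp_drop, treat, conf.level = 0.90)
round(res, 3)
```

Because the interval comes from resampling, changing the seed or the number of replicates B will shift the endpoints slightly.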

20.3 Wilcoxon-Mann-Whitney “Rank Sum” CI from Independent Samples

As in the one-sample case, a rank-based alternative attributed to Wilcoxon (and sometimes to Mann and Whitney) provides a two-sample comparison of the pseudomedians in the two treat groups in terms of temp_drop. This is called a rank sum test, rather than the Wilcoxon signed rank test that is used for inference about a single sample. Here’s the resulting 90% confidence interval for the difference in pseudomedians.

wt <- sepsis %$% wilcox.test(temp_drop ~ treat,
                       conf.int = TRUE, conf.level = 0.90,
                       alt = "two.sided") 

wt


    Wilcoxon rank sum test with continuity correction

data:  temp_drop by treat
W = 14614, p-value = 7e-06
alternative hypothesis: true location shift is not equal to 0
90 percent confidence interval:
 0.2 0.4
sample estimates:
difference in location 
                   0.3

tidy(wt)

# A tibble: 1 x 7
  estimate statistic  p.value conf.low conf.high method         alternative
     <dbl>     <dbl>    <dbl>    <dbl>     <dbl> <chr>          <chr>      
1    0.300    14614.  7.28e-6    0.200     0.400 Wilcoxon rank~ two.sided
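The "difference in location" that wilcox.test reports is, per the R documentation, the median of the differences between an observation from one sample and an observation from the other (often called the Hodges-Lehmann estimate), rather than the difference of the two sample medians. The sketch below illustrates this on small simulated samples, which are an assumption made for illustration; with fewer than 50 observations per group and no ties, R uses its exact method.

```r
set.seed(431)  # arbitrary seed, for reproducibility only

# Two small simulated samples (not the real sepsis data)
x <- rnorm(40, mean = 0.5, sd = 0.7)
y <- rnorm(40, mean = 0.15, sd = 0.6)

# Median of all 40 * 40 pairwise differences x_i - y_j
hl <- median(outer(x, y, "-"))

# Rank sum procedure with a 90% confidence interval
wt_sim <- wilcox.test(x, y, conf.int = TRUE, conf.level = 0.90)

c(pairwise_median = hl,
  wilcox_estimate = unname(wt_sim$estimate),
  diff_of_medians = median(x) - median(y))
```

The first two values agree, while the simple difference of sample medians will generally be a somewhat different number.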

20.4 Summary: Specifying A Two-Sample Study Design

These questions will help specify the details of the study design involved in any comparison of two populations on a quantitative outcome, perhaps with means.

1. What is the outcome under study?
2. What are the (in this case, two) treatment/exposure groups?
3. Were the data collected using matched / paired samples or independent samples?
4. Are the data a random sample from the population(s) of interest? Or is there at least a reasonable argument for generalizing from the sample to the population(s)?
5. What is the significance level (or, the confidence level) we require here?
6. Are we doing one-sided or two-sided testing/confidence interval generation?
7. If we have paired samples, did pairing help reduce nuisance variation?
8. If we have paired samples, what does the distribution of sample paired differences tell us about which inferential procedure to use?
9. If we have independent samples, what does the distribution of each individual sample tell us about which inferential procedure to use?

20.4.1 For the sepsis study

1. The outcome is temp_drop, the change in body temperature (in \(^{\circ}\)C) from baseline to 2 hours later, so that positive numbers indicate drops in temperature (a good outcome).
2. The groups are Ibuprofen and Placebo as contained in the treat variable in the sepsis tibble.
3. The data were collected using independent samples. The Ibuprofen subjects are not matched or linked to individual Placebo subjects - they are separate groups.
4. The subjects of the study aren’t drawn from a random sample of the population of interest, but they are randomly assigned to their respective treatments (Ibuprofen and Placebo), which will provide the reasoned basis for our inferences.
5. We’ll use a 10% significance level (or 90% confidence level) in this setting, as we did in our previous work on these data.
6. We’ll use a two-sided testing and confidence interval approach.

Questions 7 and 8 don’t apply, because these are independent samples of data, rather than paired samples.

To address question 9, we’ll need to look at the data in each sample. We’ll repeat the boxplot and Normal Q-Q plots from Chapter 19, which allow us to assess the Normality of the distributions of the temp_drop results in the Ibuprofen and Placebo groups separately.

ggplot(sepsis, aes(x = treat, y = temp_drop, fill = treat)) +
    geom_violin() +
    geom_boxplot(width = 0.3, fill = "white") +
    scale_fill_viridis_d() +
    guides(fill = "none") + 
    labs(title = "Boxplot of Temperature Drop in Sepsis Patients",
         x = "", y = "Drop in Temperature (degrees C)") + 
    coord_flip() +
    theme_bw()

ggplot(sepsis, aes(sample = temp_drop)) +
    geom_qq() + geom_qq_line(col = "red") +
    theme_bw() +
    facet_wrap(~ treat) + 
    labs(y = "Temperature Drop Values (in degrees C)")

From these plots we conclude that the data in the Ibuprofen sample follow a reasonably Normal distribution, but this isn’t quite as true for the Placebo sample. It’s hard to know whether the apparent Placebo group outliers make the Normal distribution assumption unreasonable, so we can check whether the confidence intervals change much when we don’t assume Normality (for instance, comparing the bootstrap to the t-based approaches), as a way of understanding whether assuming a Normal model has a large impact on our conclusions.

20.4.2 Sepsis Estimation Results

Here’s a set of confidence interval estimates (we’ll use 90% confidence here) using the methods discussed in this Chapter.

mosaic::favstats(temp_drop ~ treat, data = sepsis)

      treat  min     Q1 median  Q3 max  mean    sd   n missing
1 Ibuprofen -1.5  0.000    0.5 0.9 3.1 0.464 0.688 150       0
2   Placebo -2.7 -0.175    0.1 0.4 1.9 0.153 0.571 150       0

s_pooled_t_test <- sepsis %$% t.test(temp_drop ~ treat, 
                           conf.level = 0.90,
                           alt = "two.sided", 
                           var.equal = TRUE)

tidy(s_pooled_t_test) %>% 
    select(conf.low, conf.high)

# A tibble: 1 x 2
  conf.low conf.high
     <dbl>     <dbl>
1    0.191     0.432

s_welch_t_test <- sepsis %$% t.test(temp_drop ~ treat, 
                           conf.level = 0.90,
                           alt = "two.sided", 
                           var.equal = FALSE)

tidy(s_welch_t_test) %>% 
    select(estimate, conf.low, conf.high)

# A tibble: 1 x 3
  estimate conf.low conf.high
     <dbl>    <dbl>     <dbl>
1    0.311    0.191     0.432

s_wilcoxon_test <- sepsis %$% wilcox.test(temp_drop ~ treat,
                       conf.int = TRUE, conf.level = 0.90,
                       alt = "two.sided") 

tidy(s_wilcoxon_test) %>% 
    select(estimate, conf.low, conf.high)

# A tibble: 1 x 3
  estimate conf.low conf.high
     <dbl>    <dbl>     <dbl>
1    0.300    0.200     0.400

set.seed(431212)
s_bootstrap <- sepsis %$% bootdif(temp_drop, treat, 
                                  conf.level = 0.90)

s_bootstrap

Mean Difference            0.05            0.95 
         -0.311          -0.431          -0.183

Procedure	Compares…	Point Estimate (Ibuprofen - Placebo)	90% CI
Pooled t	Means	0.311	(0.191, 0.432)
Welch t	Means	0.311	(0.191, 0.432)
Bootstrap	Means	0.311	(0.183, 0.431)
Wilcoxon rank sum	Pseudo-Medians	0.3	(0.2, 0.4)
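As one possible starting point for comparing these results, the table can be assembled in R and each interval checked for its width and for whether it excludes zero. This is only a sketch: the data frame results is ours, the numbers are typed in from the table above, and all estimates are expressed as Ibuprofen minus Placebo (so the bootstrap output has had its sign reversed).

```r
# Estimates and 90% intervals copied from the table above
results <- data.frame(
  procedure = c("Pooled t", "Welch t", "Bootstrap", "Wilcoxon rank sum"),
  estimate  = c(0.311, 0.311, 0.311, 0.3),
  conf.low  = c(0.191, 0.191, 0.183, 0.2),
  conf.high = c(0.432, 0.432, 0.431, 0.4)
)

# Interval width, and whether the whole interval lies above zero
results$width <- results$conf.high - results$conf.low
results$excludes_zero <- results$conf.low > 0

results
```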

What conclusions can we draw in this setting?