3 431 Review: Comparing Means
In this Chapter, we will review some key issues about comparing means, mostly drawn from the 431 course. This and the other Chapters labeled “431 Review” show elements of the two studies involved in a “431 Project B” using the pre-pandemic (2017 - March 2020) data from the National Health and Nutrition Examination Survey (NHANES) called nh432 that we developed in Chapter 1 and then summarized in Chapter 2. Note that we’ll ignore the issue of sampling weights and assume a “missing completely at random” (MCAR) mechanism in these “431 Review” chapters, so that we can work with complete cases.
The 431 course notes are at https://thomaselove.github.io/431-notes/ and will remain there until June 1.
3.1 R Setup

knitr::opts_chunk$set(comment = NA)

library(janitor)
library(broom)
library(car)
library(glue)
library(gt)
library(gtsummary)
library(lmboot)
library(MKinfer)
library(mosaic)
library(naniar)
library(patchwork)
library(rstatix)
library(tidyverse)

theme_set(theme_bw())
3.2 Data Ingest
Since we’ve already got the nh432 file formatted as an R data set, we’ll use that.
nh432 <- read_rds("data/nh432.Rds")
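Since we’re assuming MCAR and planning complete-case analyses throughout, it’s worth a quick look at how much missingness each variable carries. Here is a minimal sketch using the miss_var_summary() function from the naniar package we loaded in the R Setup.

# A quick sketch: tabulate missing values by variable using naniar,
# to get a feel for how much data a complete-case analysis sets aside
miss_var_summary(nh432) |> head()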
3.3 Testing or Summarizing Normality?
As we’ll see, the two most useful strategies for dealing with problematic non-Normality when comparing means are (1) transformation of the outcome to make the assumption of Normality more tenable, and (2) alternate inference approaches (for example, using a bootstrap or rank-based procedure instead of a t test.)
While it is possible to obtain numerical summaries of deviations from Normality, perhaps a measure of skewness (asymmetry) or kurtosis (heavy-tailed behavior), in practical work, I never use such summaries to overrule my assessment of the plots. It’s critical instead to focus on the pictures of a distribution, most especially Normal Q-Q plots.
Perhaps the simplest skewness summary is \(skew_1\) = (mean-median)/(standard deviation), where values below -0.2 are meant to indicate (meaningful) left skew, and values above +0.2 indicate (meaningful) right skew. Unfortunately, this approach works poorly with many distributions (for example, multimodal distributions) and so do many other (more sophisticated) measures1.
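To make the \(skew_1\) calculation concrete, here is a minimal sketch applying it to a simulated right-skewed sample (not to our NHANES data):

# Sketch: skew_1 = (mean - median)/sd for simulated right-skewed data
set.seed(431)
x <- rexp(1000)                   # exponential samples are right-skewed
(mean(x) - median(x)) / sd(x)     # lands above +0.2, flagging right skew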
It is also possible to develop hypothesis tests of whether a particular batch of data follows a Normal distribution, for example, the Kolmogorov-Smirnov test2, or the Shapiro-Wilk test3, but again, I find these to be without value in practical work and cannot recommend their use.
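Purely for illustration (again, I can’t recommend these tests for practical work), here is what the Shapiro-Wilk test looks like when applied to a simulated right-skewed sample:

# Sketch, for illustration only: Shapiro-Wilk test on simulated skewed data
set.seed(431)
x <- rexp(1000)
shapiro.test(x)    # tiny p-value: the hypothesis of Normality is rejected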
3.4 Comparing Two Means using Paired Samples
Now, we’ll demonstrate some approaches to comparing two means coming from paired samples. This will include:
- a paired t test (one-sample t test on the paired differences), which we can obtain from a linear model, or from t.test()

These procedures based on the t distribution for paired samples require that the distribution of the sample paired differences is well-approximated by a Normal model. As an alternative without that requirement, we’ll focus primarily on a bootstrap comparison (not assuming Normality) from boot.t.test(), which comes from the MKinfer package. It is also possible to generate rank-based inference, such as using the Wilcoxon signed rank approach, but this introduces the major weakness of not estimating the population mean (or even the population median.)
We’ll assume a Missing Completely at Random (MCAR) mechanism for missing data, so that a complete case analysis makes sense, and we’ll also use functions from the broom package to tidy our output, and from the gt package to help present it in an attractive table.
3.5 Comparing PULSE1 to PULSE2
We have two measurements of pulse rate (in beats per minute) in nh432 for each participant. Let’s compare the two for all participants with two PULSE readings. Since we have a value of PULSE1 and PULSE2 for each participant, it makes sense to treat these as paired samples, and study the paired differences in pulse rate.
dat1 <- nh432 |> select(SEQN, PULSE1, PULSE2) |>
drop_na() |>
mutate(PULSEDIFF = PULSE2 - PULSE1)
summary(dat1 |> select(-SEQN))
PULSE1 PULSE2 PULSEDIFF
Min. : 38.0 Min. : 37.00 Min. :-22.0000
1st Qu.: 62.0 1st Qu.: 63.00 1st Qu.: -1.0000
Median : 69.0 Median : 70.00 Median : 1.0000
Mean : 70.3 Mean : 70.96 Mean : 0.6533
3rd Qu.: 77.0 3rd Qu.: 78.00 3rd Qu.: 2.0000
Max. :126.0 Max. :121.00 Max. : 26.0000
df_stats(~ PULSE1 + PULSE2 + PULSEDIFF, data = dat1) |>
mutate(across(.cols = c(mean, sd), \(x) num(x, digits = 2))) |>
gt()
response | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
PULSE1 | 38 | 62 | 69 | 77 | 126 | 70.30 | 11.61 | 3314 | 0 |
PULSE2 | 37 | 63 | 70 | 78 | 121 | 70.96 | 11.57 | 3314 | 0 |
PULSEDIFF | -22 | -1 | 1 | 2 | 26 | 0.65 | 3.43 | 3314 | 0 |
3.5.1 Distribution of Paired Differences
p1 <- ggplot(dat1, aes(sample = PULSEDIFF)) +
geom_qq(col = "dodgerblue") + geom_qq_line(col = "navy") +
theme(aspect.ratio = 1) +
labs(title = "Normal Q-Q plot")
p2 <- ggplot(dat1, aes(x = PULSEDIFF)) +
geom_histogram(aes(y = after_stat(density)),
bins = 25, fill = "dodgerblue", col = "white") +
stat_function(fun = dnorm,
args = list(mean = mean(dat1$PULSEDIFF),
sd = sd(dat1$PULSEDIFF)),
col = "navy", linewidth = 1.5) +
labs(title = "Histogram and Normal Density")
p3 <- ggplot(dat1, aes(x = PULSEDIFF, y = "")) +
geom_boxplot(fill = "dodgerblue", outlier.color = "dodgerblue") +
stat_summary(fun = "mean", geom = "point",
shape = 23, size = 3, fill = "white") +
labs(title = "Boxplot with mean", y = "")
p1 + (p2 / p3 + plot_layout(heights = c(4,1))) +
plot_annotation(title = "Pulse 2 - Pulse 1 difference in nh432",
subtitle = glue(nrow(dat1), " NHANES Participants ages 30-59 in nh432 data"))
These data appear to come from a distribution that is essentially symmetric, but extremely heavy-tailed, with many outlier candidates on both the low and high end of the distribution. It seems unwise to assume a Normal distribution for these differences in pulse rate.
3.5.2 Using t.test to obtain a 90% CI for the mean pulse difference
Note that I use 90% as my confidence level here, mostly to make sure that we don’t always simply default to 95% without engaging our brains.
tt1 <- t.test(dat1$PULSEDIFF, conf.level = 0.90)

tt1
One Sample t-test
data: dat1$PULSEDIFF
t = 10.98, df = 3313, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
0.5553989 0.7511792
sample estimates:
mean of x
0.6532891
tidy(tt1, conf.int = TRUE, conf = 0.90) |>
select(estimate, low.90 = conf.low, hi.90 = conf.high,
"Paired t" = statistic, p.value, method) |>
mutate(across(.cols = -method, \(x) num(x, digits = 3))) |>
gt() |>
tab_header(title = "Pulse 2 - Pulse 1 Difference Estimate with 90% T-based CI",
subtitle = glue(nrow(dat1), " NHANES Participants ages 30-59 in nh432 data"))
Pulse 2 - Pulse 1 Difference Estimate with 90% T-based CI | |||||
3314 NHANES Participants ages 30-59 in nh432 data | |||||
estimate | low.90 | hi.90 | Paired t | p.value | method |
---|---|---|---|---|---|
0.653 | 0.555 | 0.751 | 10.980 | 0.000 | One Sample t-test |
3.5.3 Using linear regression to obtain a 90% CI for the mean pulse difference
A linear regression model predicting the paired differences with an intercept alone produces the same result as the paired t test.
lm1 <- lm(PULSEDIFF ~ 1, data = dat1)

summary(lm1)
Call:
lm(formula = PULSEDIFF ~ 1, data = dat1)
Residuals:
Min 1Q Median 3Q Max
-22.6533 -1.6533 0.3467 1.3467 25.3467
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.6533 0.0595 10.98 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.425 on 3313 degrees of freedom
tidy(lm1, conf.int = TRUE, conf = 0.90) |>
mutate(method = c("Linear Model")) |>
select(estimate, low.90 = conf.low, hi.90 = conf.high,
"Paired t" = statistic, p.value, method) |>
mutate(across(.cols = -method, \(x) num(x, digits = 3))) |>
gt() |>
tab_header(title = "Pulse 2 - Pulse 1 Difference Estimate with 90% T-based CI",
subtitle = glue(nrow(dat1), " NHANES Participants ages 30-59 in nh432 data"))
Pulse 2 - Pulse 1 Difference Estimate with 90% T-based CI | |||||
3314 NHANES Participants ages 30-59 in nh432 data | |||||
estimate | low.90 | hi.90 | Paired t | p.value | method |
---|---|---|---|---|---|
0.653 | 0.555 | 0.751 | 10.980 | 0.000 | Linear Model |
3.5.4 Using the bootstrap to obtain a 90% CI for the mean pulse difference
This is a better choice than the t test if the distribution of the paired differences veers far away from a Normal distribution, but you are still interested in making inferences about the population mean. This is a different approach to obtaining a bootstrap than I have used in the past, but I prefer it because it works well with the tidy() function in the broom package.
set.seed(4321)
bs1 <- boot.t.test(dat1$PULSEDIFF, conf.level = 0.90,
boot = TRUE, R = 999)

bs1
Bootstrap One Sample t-test
data: dat1$PULSEDIFF
number of bootstrap samples: 999
bootstrap p-value < 0.001001
bootstrap mean of x (SE) = 0.6554904 (0.05954107)
90 percent bootstrap percentile confidence interval:
0.5554617 0.7526554
Results without bootstrap:
t = 10.98, df = 3313, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
0.5553989 0.7511792
sample estimates:
mean of x
0.6532891
tidy(bs1, conf.int = TRUE, conf = 0.90) |>
select(estimate, low.90 = conf.low, hi.90 = conf.high,
"Bootstrap t" = statistic, p.value, method) |>
mutate(across(.cols = -method, \(x) num(x, digits = 3))) |>
gt() |>
tab_header(title = "Pulse 2 - Pulse 1 Difference Estimate with 90% Bootstrap CI",
subtitle = glue(nrow(dat1), " NHANES Participants ages 30-59 in nh432 data"))
Pulse 2 - Pulse 1 Difference Estimate with 90% Bootstrap CI | |||||
3314 NHANES Participants ages 30-59 in nh432 data | |||||
estimate | low.90 | hi.90 | Bootstrap t | p.value | method |
---|---|---|---|---|---|
0.653 | 0.555 | 0.751 | 10.980 | 0.000 | Bootstrap One Sample t-test |
Given our large sample size, it is perhaps not overly surprising that even a small difference in mean pulse rate (0.653 beats per minute) turns out to have a 90% confidence interval well above the value (0) that would occur if there were no difference at all between the groups.
3.5.5 Wilcoxon signed rank approach to comparing pulse rates
We can obtain a 90% confidence interval for the pseudo-median of our paired differences in pulse rate with the Wilcoxon signed rank approach.
wt1 <- wilcox.test(dat1$PULSEDIFF, conf.int = TRUE, conf.level = 0.90)

wt1
Wilcoxon signed rank test with continuity correction
data: dat1$PULSEDIFF
V = 2449203, p-value < 2.2e-16
alternative hypothesis: true location is not equal to 0
90 percent confidence interval:
0.5000466 0.9999290
sample estimates:
(pseudo)median
0.9999809
But this is of limited value, even though it doesn’t assume Normality of the distribution of paired differences, because the summary statistic is a pseudo-median4, which isn’t straightforward to interpret, unless the true distribution of the paired differences is symmetric, in which case the pseudo-median and the median have the same value.
Let’s consider another example using two paired samples to compare means, this time with a somewhat smaller sample size.
3.6 Comparing WEIGHT to ESTWT
We have two assessments of each participant’s weight in nh432: their WEIGHT (as measured using a scale, in kilograms) and their ESTWT (self-reported weight via questionnaire, in pounds.) First, let’s create a data set containing those values, converting pounds to kilograms for the ESTWT results so that we can compare the two assessments fairly. To shrink the sample size a bit, let’s only look at people whose age is 43, and who describe their overall health as either Good or Fair.
dat2 <- nh432 |> select(SEQN, AGE, SROH, WEIGHT, ESTWT) |>
filter(AGE == 43, SROH %in% c("Good", "Fair")) |>
drop_na() |>
mutate(ESTWTKG = ESTWT*0.45359,
WTDIFF = WEIGHT - ESTWTKG)
glimpse(dat2)
Rows: 70
Columns: 7
$ SEQN <chr> "109342", "109602", "109805", "110286", "110645", "111149", "1…
$ AGE <dbl> 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43…
$ SROH <fct> Good, Good, Good, Fair, Good, Good, Good, Good, Good, Fair, Go…
$ WEIGHT <dbl> 92.1, 76.5, 133.0, 86.8, 119.3, 74.1, 75.8, 106.8, 102.1, 77.0…
$ ESTWT <dbl> 200, 167, 260, 198, 230, 145, 167, 240, 223, 172, 150, 265, 22…
$ ESTWTKG <dbl> 90.71800, 75.74953, 117.93340, 89.81082, 104.32570, 65.77055, …
$ WTDIFF <dbl> 1.38200, 0.75047, 15.06660, -3.01082, 14.97430, 8.32945, 0.050…
df_stats(~ WEIGHT + ESTWTKG + WTDIFF, data = dat2) |>
mutate(across(.cols = -c(response, n, missing), \(x) num(x , digits = 3))) |>
gt()
response | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
WEIGHT | 36.900 | 74.525 | 89.400 | 106.175 | 204.600 | 93.040 | 29.211 | 70 | 0 |
ESTWTKG | 45.359 | 74.842 | 89.130 | 103.759 | 204.115 | 92.532 | 27.869 | 70 | 0 |
WTDIFF | -9.871 | -2.256 | -0.028 | 1.923 | 15.067 | 0.508 | 4.671 | 70 | 0 |
3.6.1 Plotting The Paired Difference in Weight
p1 <- ggplot(dat2, aes(sample = WTDIFF)) +
geom_qq(col = "seagreen") + geom_qq_line(col = "deeppink") +
theme(aspect.ratio = 1) +
labs(title = "Normal Q-Q plot")
p2 <- ggplot(dat2, aes(x = WTDIFF)) +
geom_histogram(aes(y = after_stat(density)),
bins = 15, fill = "seagreen", col = "white") +
stat_function(fun = dnorm,
args = list(mean = mean(dat2$WTDIFF),
sd = sd(dat2$WTDIFF)),
col = "deeppink", linewidth = 1.5) +
labs(title = "Histogram and Normal Density")
p3 <- ggplot(dat2, aes(x = WTDIFF, y = "")) +
geom_boxplot(fill = "seagreen", outlier.color = "seagreen") +
stat_summary(fun = "mean", geom = "point",
shape = 23, size = 3, fill = "white") +
labs(title = "Boxplot with mean", y = "")
p1 + (p2 / p3 + plot_layout(heights = c(4,1))) +
plot_annotation(title = "Measured - Self-reported Weight (in kilograms)",
subtitle = glue(nrow(dat2), " participants in Good or Fair Health aged 43 in nh432"))
As we saw with the differences in pulse rate, the differences in weight for this sample appear to come from a distribution that might be symmetric, but that still has several outlier candidates, especially on the high end of the distribution. We may want to consider whether the Normality assumption behind a t-based confidence interval is reasonable here, and whether we might be better off using a bootstrap approach.
3.6.2 t.test: 90% CI for the mean weight difference
tt2 <- t.test(dat2$WTDIFF, conf.level = 0.90)
tidy(tt2, conf.int = TRUE, conf = 0.90) |>
select(estimate, low.90 = conf.low, hi.90 = conf.high,
"Paired t" = statistic, p.value, method) |>
mutate(across(.cols = -method, \(x) num(x, digits = 3))) |>
gt() |>
tab_header(title = "Measured - Self-reported Weight with 90% T-based CI",
subtitle = glue(nrow(dat2), " NHANES Participants aged 43 in Good or Fair Health in nh432 data"))
Measured - Self-reported Weight with 90% T-based CI | |||||
70 NHANES Participants aged 43 in Good or Fair Health in nh432 data | |||||
estimate | low.90 | hi.90 | Paired t | p.value | method |
---|---|---|---|---|---|
0.508 | -0.423 | 1.438 | 0.909 | 0.366 | One Sample t-test |
3.6.3 Linear Regression: 90% CI for mean weight difference
lm2 <- lm(WTDIFF ~ 1, data = dat2)
tidy(lm2, conf.int = TRUE, conf = 0.90) |>
mutate(method = c("Linear Model")) |>
select(estimate, low.90 = conf.low, hi.90 = conf.high,
"Paired t" = statistic, p.value, method) |>
mutate(across(.cols = -method, \(x) num(x, digits = 3))) |>
gt() |>
tab_header(title = "Measured - Self-reported Weight with 90% T-based CI",
subtitle = glue(nrow(dat2), " NHANES Participants aged 43 in Good or Fair Health in nh432 data"))
Measured - Self-reported Weight with 90% T-based CI | |||||
70 NHANES Participants aged 43 in Good or Fair Health in nh432 data | |||||
estimate | low.90 | hi.90 | Paired t | p.value | method |
---|---|---|---|---|---|
0.508 | -0.423 | 1.438 | 0.909 | 0.366 | Linear Model |
3.6.4 Bootstrap 90% CI for the mean weight difference
set.seed(4322)
bs2 <- boot.t.test(dat2$WTDIFF, conf.level = 0.90,
boot = TRUE, R = 999)
tidy(bs2, conf.int = TRUE, conf = 0.90) |>
select(estimate, low.90 = conf.low, hi.90 = conf.high,
"Bootstrap t" = statistic, p.value, method) |>
mutate(across(.cols = -method, \(x) num(x, digits = 3))) |>
gt() |>
tab_header(title = "Measured - Self-reported Weight",
subtitle = "with 90% Bootstrap CI") |>
tab_footnote(footnote = glue(nrow(dat2), " NHANES Participants age 43 in Good or Fair Health in nh432 data"))
Measured - Self-reported Weight | |||||
with 90% Bootstrap CI | |||||
estimate | low.90 | hi.90 | Bootstrap t | p.value | method |
---|---|---|---|---|---|
0.508 | -0.423 | 1.438 | 0.909 | 0.366 | Bootstrap One Sample t-test |
70 NHANES Participants age 43 in Good or Fair Health in nh432 data |
In light of the clear issue with outliers in the plots of the weight differences, I think I would choose the bootstrap confidence interval, which clearly includes both negative and positive values as plausible estimates of the population mean difference.
3.6.5 Wilcoxon signed rank approach to comparing weight estimates
We can obtain a 90% confidence interval for the pseudo-median of our paired differences in weight with the Wilcoxon signed rank approach.
wt2 <- wilcox.test(dat2$WTDIFF, conf.int = TRUE, conf.level = 0.90)

wt2
Wilcoxon signed rank test with continuity correction
data: dat2$WTDIFF
V = 1262, p-value = 0.9115
alternative hypothesis: true location is not equal to 0
90 percent confidence interval:
-0.6347756 0.7941699
sample estimates:
(pseudo)median
0.06155979
But this is of limited value, even though it doesn’t assume Normality of the distribution of paired differences, because the summary statistic is a pseudo-median5, which isn’t straightforward to interpret, unless the true distribution of the paired differences is symmetric, in which case the pseudo-median and the median have the same value.
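If you want to see what the pseudo-median is actually estimating, here is a minimal sketch that computes the median of the Walsh averages (the averages of all pairs of our 70 weight differences); it should essentially match the estimate from wilcox.test() above. The name walsh below is mine.

# Sketch: the pseudo-median is the median of the Walsh averages,
# i.e. (WTDIFF[i] + WTDIFF[j])/2 over all pairs with i <= j
walsh <- outer(dat2$WTDIFF, dat2$WTDIFF, "+") / 2
median(walsh[upper.tri(walsh, diag = TRUE)])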
3.7 Comparing Two Means using Independent Samples
Now, we’ll demonstrate some approaches to comparing two means coming from independent samples. This will include:
- a pooled t test (t test assuming equal population variances), which we can obtain from a linear model, or from t.test()
- a Welch t test (t test not assuming equal population variances), from t.test()

Each of these t tests requires the distribution of each of our two independent samples to be well-approximated by a Normal model. As an alternative without that requirement, we’ll focus on a bootstrap comparison (not assuming equal variances or Normality) from boot.t.test() (again from the MKinfer package.) Once more, it is also possible to generate rank-based inference, such as using the Wilcoxon-Mann-Whitney rank sum approach, but again this does not provide us with estimates of either the difference in population means or medians, which limits its utility.
3.8 Comparing White Blood Cell Count by Hospitalization Status
Now, we’ll use independent samples to compare subjects who were hospitalized in the past year to those who were not, in terms of their white blood cell count. The normal range of white blood cell counts is 4.5 to 11 on the scale used in our data (1000 cells per microliter).
3.8.1 Exploring the Data
dat3 <- nh432 |>
select(SEQN, HOSPITAL, WBC) |>
drop_na()
ggplot(dat3, aes(x = factor(HOSPITAL), y = WBC)) +
geom_violin(aes(fill = factor(HOSPITAL))) +
geom_boxplot(width = 0.3, notch = TRUE) +
stat_summary(aes(fill = factor(HOSPITAL)), fun = "mean", geom = "point",
shape = 23, size = 3) +
guides(fill = "none", col = "none") +
scale_fill_viridis_d(option = "cividis", alpha = 0.3) +
coord_flip() +
labs(x = "Hospitalized in Past Year? (0 = No, 1 = Yes)",
y = "White blood cell count (1000 cells / uL)",
title = "White Blood Cell Count by Hospitalization Status",
subtitle = glue(nrow(dat3), " NHANES participants in nh432"))
Each of these distributions shows some signs of right skew, or at least more than a few outlier candidates on the upper end of the white blood cell count’s distribution, according to the boxplot. A pair of Normal Q-Q plots should help clarify issues for us.
ggplot(dat3, aes(sample = WBC)) +
geom_qq(aes(col = factor(HOSPITAL))) + geom_qq_line(col = "red") +
facet_wrap(~ HOSPITAL, labeller = "label_both") +
guides(col = "none") +
scale_color_viridis_d(option = "cividis", end = 0.5) +
labs(x = "Expectation Under Standard Normal Distribution",
y = "Observed White Blood Cell Count (1000 cells/uL)",
title = "Normal Q-Q plots of White Blood Cell Count",
subtitle = "By Hospitalization Status in the Past Year",
caption = glue(nrow(dat3), " NHANES participants in nh432"))
It seems unreasonable to assume that each of these samples comes from a distribution that is well-approximated by the Normal. There’s just too much skew here. Here are some key numerical summaries of the data in each sample.
favstats(WBC ~ HOSPITAL, data = dat3) |>
mutate(across(.cols = c(mean, sd), \(x) num(x, digits = 3))) |>
gt()
HOSPITAL | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
0 | 2.3 | 5.7 | 7.0 | 8.4 | 22.8 | 7.246 | 2.170 | 3430 | 0 |
1 | 2.5 | 5.6 | 6.9 | 8.7 | 17.1 | 7.332 | 2.392 | 325 | 0 |
3.8.2 Pooled t test (assumes equal variances) via linear model
The pooled t test for comparison of two population means using independent samples assumes:
- that the WBC (outcome) in each of the two HOSPITAL (exposure) groups follows a Normal distribution, and
- that the population variances are equal in the two groups
The “equal population variances” assumption can be relaxed and a pooled t test used if we have a balanced design, with the same number of subjects in each exposure group.
In our setting, we shouldn’t be particularly comfortable with the assumption of Normality, as mentioned above. Were we able to get past that, though, we can see that the two distributions have fairly similar sample variances (remember this is just the square of the standard deviation.) The sample sizes are wildly different, with many more non-hospitalized subjects than hospitalized ones.
For completeness, though, we’ll start by running the pooled t test.
lm3 <- lm(WBC ~ HOSPITAL, data = dat3)

summary(lm3)
Call:
lm(formula = WBC ~ HOSPITAL, data = dat3)
Residuals:
Min 1Q Median 3Q Max
-4.9463 -1.5463 -0.3463 1.1537 15.5537
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.24633 0.03740 193.760 <2e-16 ***
HOSPITAL 0.08567 0.12712 0.674 0.5
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.19 on 3753 degrees of freedom
Multiple R-squared: 0.000121, Adjusted R-squared: -0.0001454
F-statistic: 0.4542 on 1 and 3753 DF, p-value: 0.5004
confint(lm3, level = 0.90)
5 % 95 %
(Intercept) 7.1847964 7.3078567
HOSPITAL -0.1234734 0.2948204
tidy(lm3, conf.int = TRUE, conf.level = 0.90) |>
filter(term == "HOSPITAL") |>
mutate(method = "Pooled t") |>
select(estimate, low.90 = conf.low, hi.90 = conf.high,
"Pooled t" = statistic, p.value, method) |>
mutate(across(.cols = -method, \(x) num(x, digits = 3))) |>
gt() |>
tab_header(title = "White Blood Cell Count (Hospitalized - Non-Hospitalized)",
subtitle = "with 90% Pooled t-based CI via Linear Model") |>
tab_footnote(footnote = glue(nrow(dat3), " NHANES Participants ages 30-59 in nh432 data"))
White Blood Cell Count (Hospitalized - Non-Hospitalized) | |||||
with 90% Pooled t-based CI via Linear Model | |||||
estimate | low.90 | hi.90 | Pooled t | p.value | method |
---|---|---|---|---|---|
0.086 | -0.123 | 0.295 | 0.674 | 0.500 | Pooled t |
3755 NHANES Participants ages 30-59 in nh432 data |
3.8.3 Pooled t test (assumes equal variances) via t.test
Note that this approach estimates the difference as Non-Hospitalized minus Hospitalized, the reverse of the direction used in the linear model. Be careful to check the sample estimates provided in your output against the original summary of the sample data to avoid making a mistake.
tt3p <- t.test(WBC ~ HOSPITAL, data = dat3, var.equal = TRUE, conf.level = 0.90)

tt3p
Two Sample t-test
data: WBC by HOSPITAL
t = -0.67395, df = 3753, p-value = 0.5004
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
90 percent confidence interval:
-0.2948204 0.1234734
sample estimates:
mean in group 0 mean in group 1
7.246327 7.332000
tidy(tt3p, conf.int = TRUE, conf.level = 0.90) |>
select(estimate, low.90 = conf.low, hi.90 = conf.high,
"Pooled t" = statistic, p.value, method) |>
mutate(across(.cols = -method, \(x) num(x, digits = 3))) |>
gt() |>
tab_header(title = "White Blood Cell Count (Non-Hospitalized minus Hospitalized)",
subtitle = "with 90% Pooled t-based Confidence Interval") |>
tab_footnote(footnote = glue(nrow(dat3), " NHANES Participants ages 30-59 in nh432 data"))
White Blood Cell Count (Non-Hospitalized minus Hospitalized) | |||||
with 90% Pooled t-based Confidence Interval | |||||
estimate | low.90 | hi.90 | Pooled t | p.value | method |
---|---|---|---|---|---|
-0.086 | -0.295 | 0.123 | -0.674 | 0.500 | Two Sample t-test |
3755 NHANES Participants ages 30-59 in nh432 data |
Again, note that the t.test() approach estimates Non-Hospitalized minus Hospitalized (so that the estimated difference in sample means is negative.)
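If you would rather have t.test() report the Hospitalized minus Non-Hospitalized difference (matching the linear model), one minimal sketch is to reverse the order of the factor levels with fct_rev() from the forcats package (part of the tidyverse) before testing; the names dat3_rev and tt3p_rev below are mine.

# Sketch: reverse the grouping factor so Hospitalized (1) comes first,
# which flips the sign of the estimated difference and its CI
dat3_rev <- dat3 |> mutate(HOSPITAL_f = fct_rev(factor(HOSPITAL)))
tt3p_rev <- t.test(WBC ~ HOSPITAL_f, data = dat3_rev,
var.equal = TRUE, conf.level = 0.90)
tt3p_rev$conf.int    # the earlier interval with its sign flipped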
3.8.4 Welch t test (doesn’t assume equal variance) via t.test
The Welch t test (which is actually the default t.test() in R) assumes that the two groups each follow a Normal distribution, but does not require that those distributions have the same population variance.
tt3w <- t.test(WBC ~ HOSPITAL, data = dat3, conf.level = 0.90)

tt3w
Welch Two Sample t-test
data: WBC by HOSPITAL
t = -0.6218, df = 376.28, p-value = 0.5345
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
90 percent confidence interval:
-0.3128672 0.1415202
sample estimates:
mean in group 0 mean in group 1
7.246327 7.332000
tidy(tt3w, conf.int = TRUE, conf.level = 0.90) |>
select(estimate, low.90 = conf.low, hi.90 = conf.high,
"Welch t" = statistic, p.value, method) |>
mutate(across(.cols = -method, \(x) num(x, digits = 3))) |>
gt() |>
tab_header(title = "White Blood Cell Count (Non-Hospitalized minus Hospitalized)",
subtitle = "with 90% Welch t-based Confidence Interval") |>
tab_footnote(footnote = glue(nrow(dat3), " NHANES Participants ages 30-59 in nh432 data"))
White Blood Cell Count (Non-Hospitalized minus Hospitalized) | |||||
with 90% Welch t-based Confidence Interval | |||||
estimate | low.90 | hi.90 | Welch t | p.value | method |
---|---|---|---|---|---|
-0.086 | -0.313 | 0.142 | -0.622 | 0.534 | Welch Two Sample t-test |
3755 NHANES Participants ages 30-59 in nh432 data |
3.8.5 Bootstrap comparison of WBC by HOSPITAL
The bootstrap approach is appealing in part because it assumes neither Normality nor equal population variances.
set.seed(4323)
bs3 <- boot.t.test(WBC ~ HOSPITAL, data = dat3,
R = 999, conf.level = 0.90)
bs3
Bootstrap Welch Two Sample t-test
data: WBC by HOSPITAL
number of bootstrap samples: 999
bootstrap p-value = 0.5325
bootstrap difference of means (SE) = -0.08734076 (0.1376246)
90 percent bootstrap percentile confidence interval:
-0.3137606 0.1414317
Results without bootstrap:
t = -0.6218, df = 376.28, p-value = 0.5345
alternative hypothesis: true difference in means is not equal to 0
90 percent confidence interval:
-0.3128672 0.1415202
sample estimates:
mean in group 0 mean in group 1
7.246327 7.332000
tidy(bs3, conf.int = TRUE, conf.level = 0.90) |>
mutate(diff = estimate1 - estimate2) |>
select(est1 = estimate1, est2 = estimate2, diff,
low.90 = conf.low, hi.90 = conf.high,
p.value, method) |>
mutate(across(.cols = -method, \(x) num(x, digits = 3))) |>
gt() |>
tab_header(title = "White Blood Cell Count (Non-Hospitalized minus Hospitalized)",
subtitle = "with 90% Bootstrap Confidence Interval") |>
tab_footnote(footnote = glue(nrow(dat3), " NHANES Participants ages 30-59 in nh432 data"))
White Blood Cell Count (Non-Hospitalized minus Hospitalized) | ||||||
with 90% Bootstrap Confidence Interval | ||||||
est1 | est2 | diff | low.90 | hi.90 | p.value | method |
---|---|---|---|---|---|---|
7.246 | 7.332 | -0.086 | -0.313 | 0.142 | 0.534 | Bootstrap Welch Two Sample t-test |
3755 NHANES Participants ages 30-59 in nh432 data |
In any case, though, we come to the same basic conclusion - both positive and negative differences in WBC count are plausible.
Given the huge imbalance between the two groups in terms of sample size, and the apparent skew in the distribution of each sample, I would probably be more comfortable with the bootstrap approach here than with the t-based intervals.
3.8.6 Transforming the WBC Counts
Since the White Blood Cell counts are far from Normally distributed, and in fact appear to be substantially skewed (asymmetric), we might want to consider a transformation of the data. The Box-Cox approach can be used to suggest potential transformations even in a simple case like this. We can use the boxCox() function from the car package, for example.
m3 <- lm(WBC ~ HOSPITAL, data = dat3)

boxCox(m3)
The estimated power (\(\lambda\)) shown in the plot is close to 0. The ladder of power transformations looks like this:
\(\lambda\) | Transformation | Formula |
---|---|---|
-2 | inverse square | \(1/y^2\) |
-1 | inverse | \(1/y\) |
-0.5 | inverse square root | \(1/\sqrt{y}\) |
0 | logarithm | \(log y\) |
0.5 | square root | \(\sqrt{y}\) |
1 | no transformation | y |
2 | square | \(y^2\) |
3 | cube | \(y^3\) |
So in this case, the Box-Cox approach is suggesting we try the logarithm (we use the natural logarithm, with base e, here) of WBC.
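If you would like the numerical estimate of \(\lambda\) rather than reading it off the plot, here is a minimal sketch that pulls the maximizing value out of the grid that boxCox() searches; the name bc below is mine.

# Sketch: find the lambda on the Box-Cox grid with the largest log-likelihood
bc <- boxCox(m3, plotit = FALSE)
bc$x[which.max(bc$y)]    # a value near 0 points to the log transformation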
Let’s redraw our Normal Q-Q plots with this transformation applied.
ggplot(dat3, aes(sample = log(WBC))) +
geom_qq(aes(col = factor(HOSPITAL))) + geom_qq_line(col = "red") +
facet_wrap(~ HOSPITAL, labeller = "label_both") +
guides(col = "none") +
scale_color_viridis_d(option = "cividis", end = 0.5) +
labs(x = "Expectation Under Standard Normal Distribution",
y = "Log of White Blood Cell Count (1000 cells/uL)",
title = "Normal Q-Q plots of log(WBC)",
subtitle = "By Hospitalization Status in the Past Year",
caption = glue(nrow(dat3), " NHANES participants in nh432"))
The assumption of Normality now looks much more plausible for each of our samples. So we might try building a 90% confidence interval for the difference in means of log(WBC) across the two hospitalization groups, as follows:
favstats(log(WBC) ~ HOSPITAL, data = dat3) |>
mutate(across(.cols = -c(HOSPITAL, n, missing), \(x) num(x, digits = 3))) |>
gt() |>
tab_header(title = "log(WBC) by Hospitalization Status",
subtitle = "NHANES participants in nh432")
log(WBC) by Hospitalization Status | |||||||||
NHANES participants in nh432 | |||||||||
HOSPITAL | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
0 | 0.833 | 1.740 | 1.946 | 2.128 | 3.127 | 1.938 | 0.293 | 3430 | 0 |
1 | 0.916 | 1.723 | 1.932 | 2.163 | 2.839 | 1.941 | 0.322 | 325 | 0 |
We see that there’s essentially no difference at all in the means of the log(WBC) values across the two levels of hospitalization status.
tt3log <- t.test(log(WBC) ~ HOSPITAL, data = dat3, var.equal = TRUE, conf.level = 0.90)
tidy(tt3log, conf.int = TRUE, conf.level = 0.90) |>
select(estimate1, estimate2, estimate, conf.low, conf.high, p.value)
# A tibble: 1 × 6
estimate1 estimate2 estimate conf.low conf.high p.value
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1.94 1.94 -0.00320 -0.0314 0.0250 0.852
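A side benefit of working on the log scale: exponentiating the endpoints of this interval gives a 90% CI for the ratio of geometric mean WBC counts across the two groups, as this brief sketch shows.

# Sketch: back-transform the log-scale CI into a CI for the ratio of
# geometric mean WBC counts (an interval containing 1 = no clear difference)
exp(tt3log$conf.int)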
Let’s consider a second example for comparing means from independent samples.
3.9 Comparing Waist Circumference by Sleep Trouble
Now, we’ll restrict ourselves to NHANES participants who rated their overall health as “Fair”, and we’ll compare the mean waist circumference (WAIST, in cm) of people in that group who responded Yes (vs. No) to the question of whether they had told a doctor that they had trouble sleeping (gathered in the SLPTROUB variable.)
3.9.1 Summarizing the Data
dat4 <- nh432 |>
select(SEQN, SROH, SLPTROUB, WAIST) |>
filter(SROH == "Fair") |>
drop_na()
ggplot(dat4, aes(x = factor(SLPTROUB), y = WAIST)) +
geom_violin(aes(fill = factor(SLPTROUB))) +
geom_boxplot(width = 0.3, notch = TRUE) +
stat_summary(fill = "red", fun = "mean", geom = "point",
shape = 23, size = 3) +
guides(fill = "none", col = "none") +
scale_fill_brewer(palette = "Accent") +
labs(x = "Reported Sleep Trouble to a Doctor? (0 = No, 1 = Yes)",
y = "Waist circumference (cm)",
title = "Waist Circumference by Sleep Trouble",
subtitle = glue(nrow(dat4), " NHANES participants in Fair health in nh432"))
ggplot(dat4, aes(sample = WAIST)) +
geom_qq(aes(col = factor(SLPTROUB))) + geom_qq_line(col = "red") +
facet_wrap(~ SLPTROUB, labeller = "label_both") +
guides(col = "none") +
scale_color_brewer(palette = "Accent") +
labs(x = "Expectation Under Standard Normal Distribution",
y = "Observed Waist Circumference (in cm)",
title = "Normal Q-Q plots of Waist Circumference",
subtitle = "By Reported Sleep Trouble",
caption = glue(nrow(dat4), " NHANES participants in Fair health in nh432"))
Here’s a situation where we might be willing to consider a t test, since a Normal distribution is a much better fit for the data in each of our two samples. Let’s look at some brief numerical summaries, too.
favstats(WAIST ~ SLPTROUB, data = dat4) |>
mutate(across(.cols = c(mean, sd), \(x) num(x, digits = 2))) |>
gt()
SLPTROUB | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
0 | 65.8 | 91.7 | 102.1 | 117.5 | 178 | 104.71 | 18.35 | 425 | 0 |
1 | 69.6 | 97.4 | 109.4 | 124.0 | 166 | 110.69 | 18.86 | 309 | 0 |
3.9.2 Pooled t test (assumes equal variances) via linear model
Here’s the pooled t test via linear model.
lm4 <- lm(WAIST ~ SLPTROUB, data = dat4)

summary(lm4)
Call:
lm(formula = WAIST ~ SLPTROUB, data = dat4)
Residuals:
Min 1Q Median 3Q Max
-41.093 -13.293 -2.293 12.790 73.290
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 104.7099 0.9006 116.27 < 2e-16 ***
SLPTROUB 5.9830 1.3881 4.31 1.85e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.57 on 732 degrees of freedom
Multiple R-squared: 0.02475, Adjusted R-squared: 0.02342
F-statistic: 18.58 on 1 and 732 DF, p-value: 1.853e-05
confint(lm4, level = 0.90)
5 % 95 %
(Intercept) 103.226621 106.193144
SLPTROUB 3.696944 8.269052
tidy(lm4, conf.int = TRUE, conf.level = 0.90) |>
filter(term == "SLPTROUB") |>
mutate(method = "Pooled t") |>
select(estimate, low.90 = conf.low, hi.90 = conf.high,
"Pooled t" = statistic, p.value, method) |>
mutate(across(.cols = -method, \(x) num(x, digits = 2))) |>
gt() |>
tab_header(title = "Waist Circumference by Sleep Trouble",
subtitle = "with 90% Pooled t-based CI via Linear Model") |>
tab_footnote(footnote = glue(nrow(dat4), " NHANES Participants in Fair Health in nh432 data"))
Waist Circumference by Sleep Trouble | |||||
with 90% Pooled t-based CI via Linear Model | |||||
estimate | low.90 | hi.90 | Pooled t | p.value | method |
---|---|---|---|---|---|
5.98 | 3.70 | 8.27 | 4.31 | 0.00 | Pooled t |
734 NHANES Participants in Fair Health in nh432 data |
3.9.3 Pooled t test (assumes equal variances) via t.test
tt4p <- t.test(WAIST ~ SLPTROUB, data = dat4, var.equal = TRUE, conf.level = 0.90)

tt4p
Two Sample t-test
data: WAIST by SLPTROUB
t = -4.3103, df = 732, p-value = 1.853e-05
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
90 percent confidence interval:
-8.269052 -3.696944
sample estimates:
mean in group 0 mean in group 1
104.7099 110.6929
tidy(tt4p, conf.int = TRUE, conf.level = 0.90) |>
select(estimate, low.90 = conf.low, hi.90 = conf.high,
"Pooled t" = statistic, p.value, method) |>
mutate(across(.cols = -method, \(x) num(x , digits = 2))) |>
gt() |>
tab_header(title = "Waist Circumference by Sleep Trouble",
subtitle = "with 90% Pooled t-based Confidence Interval") |>
tab_footnote(footnote = glue(nrow(dat4), " NHANES Participants in Fair Health in nh432 data"))
Waist Circumference by Sleep Trouble | |||||
with 90% Pooled t-based Confidence Interval | |||||
estimate | low.90 | hi.90 | Pooled t | p.value | method |
---|---|---|---|---|---|
-5.98 | -8.27 | -3.70 | -4.31 | 0.00 | Two Sample t-test |
734 NHANES Participants in Fair Health in nh432 data |
3.9.4 Welch t test (doesn’t assume equal variance) via t.test
tt4w <- t.test(WAIST ~ SLPTROUB, data = dat4, conf.level = 0.90)

tt4w
Welch Two Sample t-test
data: WAIST by SLPTROUB
t = -4.2919, df = 653.24, p-value = 2.04e-05
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
90 percent confidence interval:
-8.279237 -3.686759
sample estimates:
mean in group 0 mean in group 1
104.7099 110.6929
tidy(tt4w, conf.int = TRUE, conf.level = 0.90) |>
select(estimate, low.90 = conf.low, hi.90 = conf.high,
"Welch t" = statistic, p.value, method) |>
mutate(across(.cols = -method, \(x) num(x, digits = 2))) |>
gt() |>
tab_header(title = "Waist Circumference by Sleep Trouble",
subtitle = "with 90% Welch t-based Confidence Interval") |>
tab_footnote(footnote = glue(nrow(dat4), " NHANES Participants in Fair Health in nh432 data"))
Waist Circumference by Sleep Trouble | |||||
with 90% Welch t-based Confidence Interval | |||||
estimate | low.90 | hi.90 | Welch t | p.value | method |
---|---|---|---|---|---|
-5.98 | -8.28 | -3.69 | -4.29 | 0.00 | Welch Two Sample t-test |
734 NHANES Participants in Fair Health in nh432 data |
3.9.5 Bootstrap comparison of WAIST by SLPTROUB
set.seed(4324)
bs4 <- boot.t.test(WAIST ~ SLPTROUB, data = dat4,
R = 999, conf.level = 0.90)
bs4
Bootstrap Welch Two Sample t-test
data: WAIST by SLPTROUB
number of bootstrap samples: 999
bootstrap p-value < 0.001001
bootstrap difference of means (SE) = -5.991814 (1.389937)
90 percent bootstrap percentile confidence interval:
-8.207833 -3.747965
Results without bootstrap:
t = -4.2919, df = 653.24, p-value = 2.04e-05
alternative hypothesis: true difference in means is not equal to 0
90 percent confidence interval:
-8.279237 -3.686759
sample estimates:
mean in group 0 mean in group 1
104.7099 110.6929
tidy(bs4, conf.int = TRUE, conf.level = 0.90) |>
mutate(diff = estimate1 - estimate2) |>
select(est1 = estimate1, est2 = estimate2, diff,
low.90 = conf.low, hi.90 = conf.high,
p.value, method) |>
mutate(across(.cols = -method, \(x) num(x, digits = 2))) |>
gt() |>
tab_header(title = "Waist Circumference by Sleep Trouble",
subtitle = "with 90% Bootstrap Confidence Interval") |>
tab_footnote(footnote = glue(nrow(dat4), " NHANES Participants in Fair Health in nh432 data"))
Waist Circumference by Sleep Trouble | ||||||
with 90% Bootstrap Confidence Interval | ||||||
est1 | est2 | diff | low.90 | hi.90 | p.value | method |
---|---|---|---|---|---|---|
104.71 | 110.69 | -5.98 | -8.28 | -3.69 | 0.00 | Bootstrap Welch Two Sample t-test |
734 NHANES Participants in Fair Health in nh432 data |
3.9.6 Wilcoxon-Mann-Whitney Rank Sum Approach
The Wilcoxon-Mann-Whitney rank sum approach also allows us (like the bootstrap) to avoid the assumptions of Normality and equal population variances, but at the cost of no longer yielding direct inference about the population mean.
wt4 <- wilcox.test(WAIST ~ SLPTROUB, data = dat4,
conf.int = TRUE, conf.level = 0.90, paired = FALSE)

wt4
Wilcoxon rank sum test with continuity correction
data: WAIST by SLPTROUB
W = 53322, p-value = 1.355e-05
alternative hypothesis: true location shift is not equal to 0
90 percent confidence interval:
-8.599997 -3.900026
sample estimates:
difference in location
-6.299963
Note that the estimated “difference in location” here is not the difference in the medians across the two groups, but instead the median of the difference between a sampled WAIST from the SLPTROUB = No group and a sampled WAIST from the SLPTROUB = Yes group (which is why the estimate is negative here.)
Just to prove my point, here are the sample median WAIST results in the two SLPTROUB groups. You can see that the difference between these medians does not match the “difference in location” estimate from the Wilcoxon-Mann-Whitney rank sum output.
dat4 |> group_by(SLPTROUB) |> summarise(median(WAIST))
# A tibble: 2 × 2
SLPTROUB `median(WAIST)`
<dbl> <dbl>
1 0 102.
2 1 109.
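To see where that “difference in location” comes from, here is a minimal sketch computing the median of all pairwise (No minus Yes) differences in WAIST, which should essentially reproduce the wilcox.test() estimate; the names w_no and w_yes are mine.

# Sketch: Hodges-Lehmann estimate = median of all pairwise differences
# between a WAIST in the No group and a WAIST in the Yes group
w_no <- dat4$WAIST[dat4$SLPTROUB == 0]
w_yes <- dat4$WAIST[dat4$SLPTROUB == 1]
median(outer(w_no, w_yes, "-"))    # close to the -6.3 reported above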
In conclusion, the confidence intervals (from any of these approaches) suggest that plausible means of waist circumference are around 3-8 centimeters larger in the “told Dr. about sleep problems” group, which I suppose isn’t especially surprising, at least in terms of its direction.
3.10 Comparing 3 Means using Independent Samples: Systolic BP by Weight Goal
We’ll compare systolic blood pressure means across the three samples defined by WTGOAL (goal is to weigh more, less or stay about the same), restricting to our participants of Hispanic or Latinx ethnicity in nh432.
dat5 <- nh432 |>
select(SEQN, RACEETH, SBP, WTGOAL) |>
filter(RACEETH == "Hispanic") |>
drop_na()
3.10.1 Summarizing SBP by WTGOAL
ggplot(dat5, aes(x = SBP, y = WTGOAL)) +
geom_violin(aes(fill = WTGOAL)) +
geom_boxplot(width = 0.3, notch = TRUE) +
guides(fill = "none") +
labs(title = "Comparing Mean Systolic BP by Weight Goal",
subtitle = glue("among ", nrow(dat5), " Hispanic participants in nh432"),
x = "Systolic Blood Pressure (mm Hg)", y = "Weight Goal")
favstats(SBP ~ WTGOAL, data = dat5) |>
as_tibble() |>
mutate(across(.cols = c("mean", "sd"), \(x) num(x, digits = 2))) |>
gt()
WTGOAL | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
More | 84 | 104.5 | 114.0 | 123 | 150 | 114.08 | 14.76 | 36 | 0 |
Same | 87 | 107.0 | 116.5 | 128 | 200 | 119.26 | 16.49 | 188 | 0 |
Less | 80 | 109.0 | 119.0 | 131 | 199 | 120.70 | 17.17 | 564 | 0 |
The analysis of variance is our primary tool for comparing more than two means (this is the extension of the pooled t test, with similar assumptions.) So the assumptions we might want to think about here are:
- SBP in each Weight Goal group is assumed to follow a Normal distribution
- SBP in each Weight Goal group is assumed to have the same population variance
The ANOVA, however, is far more robust to minor violations of these assumptions than is the pooled t test. So we might go ahead and fit the ANOVA model anyway, despite the apparent right skew in the “Less” group.
3.10.2 Fitting an ANOVA Model
m5 <- lm(SBP ~ WTGOAL, data = dat5)

anova(m5)
Analysis of Variance Table
Response: SBP
Df Sum Sq Mean Sq F value Pr(>F)
WTGOAL 2 1639 819.45 2.8656 0.05754 .
Residuals 785 224477 285.96
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A small p value like this (remember we are using 90% confidence in our 432 work) isn’t really very important on its own - usually it simply steers us towards trying to identify confidence intervals for differences between pairs of SBP means defined by WTGOAL.
3.10.2.1 ANOVA without assuming Equal Variances?
R will also fit an ANOVA-style model and produce a p value without the assumption of equal population SBP variance across the three groups of WTGOAL.
oneway.test(SBP ~ WTGOAL, data = dat5)
One-way analysis of means (not assuming equal variances)
data: SBP and WTGOAL
F = 3.5255, num df = 2.000, denom df = 93.753, p-value = 0.0334
I don’t use this approach much, as ANOVA is pretty robust to violations of the equal-variance assumption. The huge differences in sample size in this study (many more participants are in the Less group than the More group, for instance) are most of the cause of the difference we see here.
3.10.2.2 Testing for Equal Population Variance?
Some people like to perform tests for equal population variance to help choose between ANOVA and the oneway.test() approach, but I do not. If I’m happy with the assumption of Normality, I virtually always just use ANOVA. There are many such tests of “equal variance”, including:
- Bartlett’s test
- Levene’s test (which in R comes from the car package)
- Fligner-Killeen test
Bartlett’s test is the least reliable of these when the data in at least one sample appear to be poorly described by the Normal distribution. Either Levene or Fligner-Killeen is a better choice in that setting, but again, I don’t use any of these in my work.
bartlett.test(SBP ~ WTGOAL, data = dat5)
Bartlett test of homogeneity of variances
data: SBP by WTGOAL
Bartlett's K-squared = 1.6722, df = 2, p-value = 0.4334
leveneTest(SBP ~ WTGOAL, data = dat5)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 0.6557 0.5193
785
fligner.test(SBP ~ WTGOAL, data = dat5)
Fligner-Killeen test of homogeneity of variances
data: SBP by WTGOAL
Fligner-Killeen:med chi-squared = 1.26, df = 2, p-value = 0.5326
3.10.2.3 Is there a bootstrap one-way ANOVA approach?
If all you are looking for is a p value for the ANOVA model, then yes, there is a bootstrap approach available to perform one-way ANOVA testing. But I don’t actually use it, again usually preferring the usual ANOVA if the data seem reasonably likely to have been drawn from a Normal distribution, and the Kruskal-Wallis rank-based test otherwise. If you are willing to install the lmboot package, and use its ANOVA.boot() function, you can do so, like this.
bs5 <- ANOVA.boot(SBP ~ WTGOAL, B = 1000, seed = 4325, data = dat5)
bs5$`p-value`
[1] 0.052
In this case, it doesn’t seem that we have a wildly different result than we got from the original ANOVA. That is often the case, and I have never actually used ANOVA.boot() in practical work.
3.10.3 Tukey HSD Pairwise Comparisons
When pairwise comparisons are pre-planned, especially when the design is close to balanced, my favorite choice for generating adjusted inferences about the means is Tukey’s Honestly Significant Differences (HSD) approach.
Here, we generate confidence intervals for the pairwise differences in the SBP means by WTGOAL group with a 90% family-wise confidence level.
th5 <- TukeyHSD(aov(SBP ~ WTGOAL, data = dat5),
ordered = TRUE, conf.level = 0.90)
tidy(th5) |>
select(-c(term, null.value)) |>
mutate(across(.cols = -contrast, \(x) num(x, digits = 3))) |>
arrange(desc(estimate)) |>
gt() |>
tab_header(title = "Comparing Mean Systolic BP across pairs of WTGOAL groups",
subtitle = "90% Tukey HSD Confidence Intervals") |>
tab_footnote(footnote = glue(nrow(dat5), " Hispanic participants in nh432"))
Comparing Mean Systolic BP across pairs of WTGOAL groups | ||||
90% Tukey HSD Confidence Intervals | ||||
contrast | estimate | conf.low | conf.high | adj.p.value |
---|---|---|---|---|
Less-More | 6.617 | 0.642 | 12.592 | 0.060 |
Same-More | 5.172 | -1.151 | 11.495 | 0.213 |
Less-Same | 1.445 | -1.482 | 4.372 | 0.568 |
788 Hispanic participants in nh432 |
tidy(th5) |>
mutate(contrast = fct_reorder(contrast, estimate, .desc = TRUE)) %>%
ggplot(., aes(x = contrast, y = estimate, ymin = conf.low, ymax = conf.high)) +
geom_crossbar() +
geom_hline(yintercept = 0, col = "red", linetype = "dashed") +
geom_label(aes(label = round_half_up(estimate, 2))) +
labs(title = "Comparing Mean Systolic BP across pairs of WTGOAL groups",
subtitle = "90% Tukey HSD Confidence intervals",
caption = glue(nrow(dat5), " Hispanic participants in nh432"),
x = "Pairwise Difference between WTGOAL groups",
y = "Difference in Systolic Blood Pressure (mm Hg)")
The main problems here are that:
- the sample sizes in the various levels of WTGOAL are very different from one another, and
- the SBP data are not especially well-described by a Normal distribution, at least in the “Less” group.
3.10.4 Holm pairwise comparisons of means
Another approach to developing pairwise inferences would be to use either Bonferroni or (my preference) Holm-adjusted p values for the relevant t tests. First, we’ll run the appropriate Holm comparison of means assuming equal population variances of SBP across all three WTGOAL groups.
ht5 <- pairwise.t.test(dat5$SBP, dat5$WTGOAL, pool.sd = TRUE, p.adjust.method = "holm")

tidy(ht5) |>
gt()
group1 | group2 | p.value |
---|---|---|
Same | More | 0.18625400 |
Less | More | 0.06929224 |
Less | Same | 0.31056211 |
The results are merely p-values, and not confidence intervals; nothing of interest is being estimated here. We can also perform these Holm comparisons without assuming equal population variances, as shown below.
ht5un <- pairwise.t.test(dat5$SBP, dat5$WTGOAL, pool.sd = FALSE,
p.adjust.method = "holm")
tidy(ht5un) |>
gt()
group1 | group2 | p.value |
---|---|---|
Same | More | 0.12866665 |
Less | More | 0.04046259 |
Less | Same | 0.30396220 |
Again, the problem with this approach is that it’s only producing a p value, which tempts us into talking about useless things like “statistical significance.” This is part of the reason I prefer Tukey HSD approaches when appropriate.
3.11 Comparing 4 Means using Independent Samples: Weight by Food Security
dat6 <- nh432 |>
select(SEQN, WEIGHT, FOODSEC) |>
drop_na()
3.11.1 Summarizing the Data
ggplot(dat6, aes(x = FOODSEC, y = WEIGHT)) +
geom_violin(aes(fill = FOODSEC)) +
geom_boxplot(width = 0.3, notch = TRUE) +
guides(fill = "none") +
scale_fill_viridis_d(option = "rocket") +
labs(title = "Comparing Mean Weight by Food Security",
subtitle = glue("among ", nrow(dat6), " participants in nh432"),
x = "Food Security Category", y = "Weight (kg)")
favstats(WEIGHT ~ FOODSEC, data = dat6) |>
as_tibble() |>
mutate(across(.cols = -c("FOODSEC", "n", "missing"), \(x) num(x, digits = 1))) |>
gt()
FOODSEC | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
Full | 36.9 | 68.6 | 81.0 | 97.7 | 210.8 | 85.2 | 24.5 | 2233 | 0 |
Marginal | 39.9 | 69.2 | 83.0 | 100.2 | 242.6 | 87.0 | 25.0 | 560 | 0 |
Low | 40.9 | 70.4 | 83.6 | 102.3 | 201.0 | 88.2 | 24.6 | 501 | 0 |
Very Low | 46.1 | 73.4 | 85.8 | 101.3 | 254.3 | 90.1 | 24.4 | 379 | 0 |
3.11.2 Fitting the ANOVA model
m6 <- lm(WEIGHT ~ FOODSEC, data = dat6)

anova(m6)
Analysis of Variance Table
Response: WEIGHT
Df Sum Sq Mean Sq F value Pr(>F)
FOODSEC 3 10021 3340.3 5.525 0.0008786 ***
Residuals 3669 2218162 604.6
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Does the bootstrap ANOVA give a meaningfully different result? No.
bs6 <- ANOVA.boot(WEIGHT ~ FOODSEC, B = 5000, seed = 4326, data = dat6)
bs6$`p-value`
[1] 8e-04
3.11.3 Tukey HSD Pairwise Comparisons
th6 <- TukeyHSD(aov(WEIGHT ~ FOODSEC, data = dat6),
ordered = TRUE, conf.level = 0.90)
tidy(th6) |>
select(-c(term, null.value)) |>
mutate(across(.cols = -contrast, \(x) num(x, digits = 3))) |>
arrange(desc(estimate)) |>
gt() |>
tab_header(title = "Comparing Mean Weight across pairs of Food Security groups",
subtitle = "90% Tukey HSD Confidence Intervals") |>
tab_footnote(footnote = glue(nrow(dat6), " participants in nh432"))
Comparing Mean Weight across pairs of Food Security groups | ||||
90% Tukey HSD Confidence Intervals | ||||
contrast | estimate | conf.low | conf.high | adj.p.value |
---|---|---|---|---|
Very Low-Full | 4.877 | 1.746 | 8.008 | 0.002 |
Very Low-Marginal | 3.103 | -0.646 | 6.851 | 0.229 |
Low-Full | 2.930 | 0.144 | 5.717 | 0.075 |
Very Low-Low | 1.946 | -1.891 | 5.783 | 0.650 |
Marginal-Full | 1.774 | -0.889 | 4.438 | 0.422 |
Low-Marginal | 1.156 | -2.310 | 4.622 | 0.870 |
3673 participants in nh432 |
tidy(th6) |>
mutate(contrast = fct_reorder(contrast, estimate, .desc = TRUE)) %>%
ggplot(., aes(x = contrast, y = estimate, ymin = conf.low, ymax = conf.high)) +
geom_pointrange() +
geom_hline(yintercept = 0, col = "blue", linetype = "dashed") +
geom_label(aes(label = round_half_up(estimate, 2))) +
labs(title = "Comparing Mean Weight across pairs of Food Security groups",
subtitle = "90% Tukey HSD Confidence intervals",
caption = glue(nrow(dat6), " participants in nh432"),
x = "Pairwise Difference between FOODSEC groups",
y = "Difference in Weight (kg)")
3.11.4 Kruskal-Wallis Test
When the assumption of Normality is really unreasonable, many people (including me) will instead use a rank-based method, the Kruskal-Wallis test, to compare the locations of WEIGHT across levels of FOODSEC.
kruskal.test(WEIGHT ~ FOODSEC, data = dat6)
Kruskal-Wallis rank sum test
data: WEIGHT by FOODSEC
Kruskal-Wallis chi-squared = 21.102, df = 3, p-value = 0.0001003
3.11.5 Dunn Test for Pairwise Comparisons after Kruskal-Wallis Test
Should you develop a Kruskal-Wallis test result which implies that running a set of pairwise comparisons is important, I would suggest the use of the Dunn test, available in the dunn_test() function from the rstatix package.
dunn_test(data = dat6, WEIGHT ~ FOODSEC,
p.adjust.method = "holm", detailed = TRUE) |>
select(group1, group2, p.adj, n1, n2, estimate1, estimate2, estimate) |>
mutate(across(.cols = -c(group1, group2, n1, n2), \(x) num(x, digits = 3))) |>
gt() |>
tab_header(title = "Dunn Tests comparing WEIGHT by FOODSEC",
subtitle = "Pairwise Comparisons after Kruskal-Wallis test") |>
tab_footnote(footnote = glue(nrow(dat6), " participants in nh432"))
Dunn Tests comparing WEIGHT by FOODSEC | |||||||
Pairwise Comparisons after Kruskal-Wallis test | |||||||
group1 | group2 | p.adj | n1 | n2 | estimate1 | estimate2 | estimate |
---|---|---|---|---|---|---|---|
Full | Marginal | 0.269 | 2233 | 560 | 1780.595 | 1865.621 | 85.026 |
Full | Low | 0.049 | 2233 | 501 | 1780.595 | 1916.147 | 135.551 |
Full | Very Low | 0.000 | 2233 | 379 | 1780.595 | 2022.412 | 241.816 |
Marginal | Low | 0.438 | 560 | 501 | 1865.621 | 1916.147 | 50.525 |
Marginal | Very Low | 0.105 | 560 | 379 | 1865.621 | 2022.412 | 156.790 |
Low | Very Low | 0.282 | 501 | 379 | 1916.147 | 2022.412 | 106.265 |
3673 participants in nh432 |
Again, a problem with this approach is that all it provides is a set of adjusted p values for these comparisons, but if we’re not willing to assume even very approximate Normality (and thus use an ANOVA approach) this is what we’ll have to cope with.
1. The e1071 package in R provides several such measures of skewness and kurtosis if you really want to see them.↩︎
2. The Kolmogorov-Smirnov test is available in R: see ks.test().↩︎
3. See shapiro.test() in R.↩︎
4. The pseudomedian of a distribution \(F\) is the median of the distribution of (u + v)/2, where u and v are independent and each have distribution \(F\).↩︎
5. The pseudomedian of a distribution \(F\) is the median of the distribution of (u + v)/2, where u and v are independent and each have distribution \(F\).↩︎