Chapter 28 Comparing 3 or more Population Means: Analysis of Variance

Recall the National Youth Fitness Survey, which we explored a small piece of in some detail back in Section 7. We’ll look at a different part of the same survey here - specifically the 280 children whose data are captured in the nyfs2 file.

# A tibble: 280 x 21
   subject.id sex   age.exam race.eth english income.cat3 income.detail
        <int> <fct>    <int> <fct>      <int> <fct>       <fct>        
 1      73228 Male         4 5 Other~       1 Low (below~ 0 to 4999    
 2      72393 Male         4 2 Non-H~       1 Low (below~ 0 to 4999    
 3      73303 Male         3 2 Non-H~       1 Low (below~ 0 to 4999    
 4      72786 Male         5 1 Non-H~       1 Low (below~ 0 to 4999    
 5      73048 Male         3 2 Non-H~       1 Low (below~ 0 to 4999    
 6      72556 Fema~        4 2 Non-H~       1 Low (below~ 0 to 4999    
 7      72580 Fema~        5 2 Non-H~       1 Low (below~ 0 to 4999    
 8      72532 Fema~        4 4 Other~       0 Low (below~ 0 to 4999    
 9      73012 Male         4 1 Non-H~       1 Low (below~ 0 to 4999    
10      72099 Male         6 1 Non-H~       1 Low (below~ 0 to 4999    
# ... with 270 more rows, and 14 more variables: inc.to.pov <dbl>,
#   weight.kg <dbl>, height.cm <dbl>, bmi <dbl>, bmi.group <int>,
#   bmi.cat <fct>, arm.length <dbl>, arm.circ <dbl>, waist.circ <dbl>,
#   calf.circ <dbl>, calf.skinfold <dbl>, triceps.skinfold <dbl>,
#   subscap.skinfold <dbl>, GMQ <int>

28.1 Comparing Gross Motor Quotient Scores by Income Level (3 Categories)

In this first analysis, we’ll compare the population mean on the Gross Motor Quotient evaluation of these kids across three groups defined by income level. Higher values of this GMQ measure indicate improved levels of gross motor development, both in terms of locomotor and object control. See https://wwwn.cdc.gov/Nchs/Nnyfs/Y_GMX.htm for more details.

# A tibble: 3 x 4
  income.cat3            n `mean(GMQ)` `median(GMQ)`
  <fct>              <int>       <dbl>         <dbl>
1 High (65K or more)    92        95.7            97
2 Low (below 25K)       98        97.0            97
3 Middle (25 - 64K)     90        95.4            94

Uh, oh. We should rearrange those income categories to match a natural order from low to high.

28.1.2 Numerical Summaries

nyfs2a$income.cat3: Low (below 25K)
 min Q1 median  Q3 max mean   sd  n missing
  55 91     97 106 130   97 14.8 98       0
-------------------------------------------------------- 
nyfs2a$income.cat3: Middle (25 - 64K)
 min Q1 median  Q3 max mean   sd  n missing
  67 85     94 106 136 95.4 14.2 90       0
-------------------------------------------------------- 
nyfs2a$income.cat3: High (65K or more)
 min Q1 median  Q3 max mean   sd  n missing
  64 85     97 103 145 95.7 14.5 92       0

28.2 Alternative Procedures for Comparing More Than Two Means

Now, if we only had two independent samples, we’d be choosing between a pooled t test, a Welch t test, and a non-parametric procedure like the Wilcoxon-Mann-Whitney rank sum test, or even perhaps a bootstrap alternative.

In the case of more than two independent samples, we have methods analogous to the Welch test, and the rank sum test, and even the bootstrap, but we’re going to be far more likely to select the analysis of variance (ANOVA) or an equivalent regression-based approach. These are the extensions of the pooled t test. Unless the sample outcome data are very clearly not Normally distributed, and no transformation is available which makes them appear approximately Normal in all of the groups we are comparing, we will stick with ANOVA.

28.2.1 Extending the Welch Test to > 2 Independent Samples

It is possible to extend the Welch two-sample t test (not assuming equal population variances) into an analogous one-factor analysis for comparing population means based on independent samples from more than two groups.

If we want to compare the population mean GMQ levels across those three income groups without assuming equal population variances, oneway.test is up to the task. The hypotheses being tested here are:

  • H0: \(\mu_{Low} = \mu_{Middle} = \mu_{High}\) vs.
  • HA: At least one of the population means is different than the others.

    One-way analysis of means (not assuming equal variances)

data:  GMQ and income.cat3
F = 0.3, num df = 2, denom df = 200, p-value = 0.7

At our usual 5% significance level, since the p value for the one-way test is 0.71, we’d conclude that we cannot reject H0 and can only conclude that there is no significant difference between the true mean GMQ levels across the three income categories. You’ll note that this isn’t much help, though, because we don’t have any measure of effect size, nor do we have any confidence intervals.

Like the analogous Welch t test, this approach allows us to forego the assumption of equal population variances in each of the three income groups, but it still requires us to assume that the populations are Normally distributed.

That said, most of the time when we have more than two levels of the factor of interest, we won’t bother worrying about the equal population variance assumption, and will just use the one-factor ANOVA approach (with pooled variances) described below, to make the comparisons of interest.

28.2.2 Extending the Rank Sum Test to > 2 Independent Samples

It is also possible to extend the Wilcoxon-Mann-Whitney two-sample test into an analogous one-factor analysis called the Kruskal-Wallis test for comparing population measures of location based on independent samples from more than two groups.

If we want to compare the centers of the distributions of population GMQ score across our three income groups without assuming Normality, we can use kruskal.test.

The hypotheses being tested here are still as before, but the \(\mu\) referred to here is a measure of location other than the population mean:

  • H0: \(\mu_{Low} = \mu_{Middle} = \mu_{High}\) vs.
  • HA: At least one of the population means is different than the others.

    Kruskal-Wallis rank sum test

data:  GMQ by income.cat3
Kruskal-Wallis chi-squared = 2, df = 2, p-value = 0.3

So, in this case, at our usual 5% significance level, since the p value for the Kruskal-Wallis test is 0.31, we’d conclude that we cannot reject H0 and can only conclude that there is no significant difference between the true mean GMQ scores across the three income categories. Again, note that this isn’t much help, though, because we don’t have any measure of effect size, nor do we have any confidence intervals.

That said, most of the time when we have more than two levels of the factor of interest, we won’t bother worrying about potential violations of the Normality assumption unless they are glaring, and will just use the usual one-factor ANOVA approach (with pooled variances) described below, to make the comparisons of interest.

28.2.3 Can we use the bootstrap to compare more than two means?

Sure. There are both ANOVA and ANCOVA analogues using the bootstrap, and in fact, there are power calculations based on the bootstrap, too.

We’ll discuss all of these methods in 432, but if you want to see some example code, look at https://sammancuso.com/2015/05/28/bootstrap-anova-with-pairwise-post-hoc-contrasts/

28.3 The Analysis of Variance

Extending the two-sample t test (assuming equal population variances) into a comparison of more than two samples uses the analysis of variance or ANOVA.

This is analysis of a continuous outcome variable on the basis of a single categorical factor, in fact, it’s often called one-factor ANOVA or one-way ANOVA to indicate that the outcome is being split up into the groups defined by a single factor.

The null hypothesis is that the population means are all the same, and the alternative is that this is not the case. When there are just two groups, then this boils down to an F test that is equivalent to the Pooled t test.

28.3.1 The oneway.test approach

R will produce some elements of a one-factor ANOVA using the oneway.test command:


    One-way analysis of means

data:  GMQ and income.cat3
F = 0.3, num df = 2, denom df = 300, p-value = 0.7

This isn’t the full analysis, though, which would require a more complete ANOVA table. There are two equivalent approaches to obtaining the full ANOVA table when comparing a series of 2 or more population means based on independent samples.

28.3.2 Using the aov approach and the summary function

Here’s one possible ANOVA table, which doesn’t require directly fitting a linear model.

             Df Sum Sq Mean Sq F value Pr(>F)
income.cat3   2    146    72.8    0.35   0.71
Residuals   277  58174   210.0               

28.3.3 Using the anova function after fitting a linear model

An equivalent way to get identical results in a slightly different format runs the linear model behind the ANOVA approach directly.

Analysis of Variance Table

Response: GMQ
             Df Sum Sq Mean Sq F value Pr(>F)
income.cat3   2    146    72.8    0.35   0.71
Residuals   277  58174   210.0               

28.4 Interpreting the ANOVA Table

28.4.1 What are we Testing?

The null hypothesis for the ANOVA table is that the population means of the outcome across the various levels of the factor of interest are all the same, against a two-sided alternative hypothesis that the level-specific population means are not all the same.

Specifically, if we have a grouping factor with k levels, then we are testing:

  • H0: \(\mu_1 = \mu_2 = ... = \mu_k\) vs.
  • HA: At least one of the population means \(\mu_1, \mu_2, ..., \mu_k\) is different from the others.

28.4.2 Elements of the ANOVA Table

The ANOVA table breaks down the variation in the outcome explained by the k levels of the factor of interest, and the variation in the outcome which remains (the Residual, or Error).

Specifically, the elements of the ANOVA table are:

  1. the degrees of freedom (labeled Df) for the factor of interest and for the Residuals
  2. the sums of squares (labeled Sum Sq) for the factor of interest and for the Residuals
  3. the mean square (labeled Mean Sq) for the factor of interest and for the Residuals
  4. the ANOVA F test statistic (labeled F value), which is used to generate
  5. the p value for the comparison assessed by the ANOVA model, labeled Pr(>F)

28.4.3 The Degrees of Freedom

Analysis of Variance Table

Response: GMQ
             Df Sum Sq Mean Sq F value Pr(>F)
income.cat3   2    146    72.8    0.35   0.71
Residuals   277  58174   210.0               
  • The degrees of freedom attributable to the factor of interest (here, Income category) is the number of levels of the factor minus 1. Here, we have three Income categories (levels), so df(income.cat3) = 2.
  • The total degrees of freedom are the number of observations (across all levels of the factor) minus 1. We have 280 GMQ scores in the nyfs2a data, so the df(Total) must be 279, although the Total row isn’t shown by R in its output.
  • The Residual degrees of freedom are the Total df - Factor df. So, here, that’s 279 - 2 = 277.

28.4.4 The Sums of Squares

Analysis of Variance Table

Response: GMQ
             Df Sum Sq Mean Sq F value Pr(>F)
income.cat3   2    146    72.8    0.35   0.71
Residuals   277  58174   210.0               
  • The sum of squares (often abbreviated SS or Sum Sq) represents variation explained.
  • The factor SS is the sum across all levels of the factor of the sample size for the level multiplied by the squared difference between the level mean and the overall mean across all levels. Here, SS(income.cat3) = 146
  • The total SS is the sum across all observations of the square of the difference between the individual values and the overall mean. Here, that is 146 + 58174 = 58320
  • Residual SS = Total SS - Factor SS.
  • Also of interest is a calculation called \(\eta^2\), (“eta-squared”), which is equivalent to \(R^2\) in a linear model.
    • \(\eta^2 = SS(Factor) / SS(Total)\) = the proportion of variation in our outcome (here, GMQ) explained by the variation between groups (here, income groups)
    • In our case, \(\eta^2\) = 146 / (146 + 58174) = 146 / 58320 = 0.0025
    • So, Income Category alone accounts for about 0.25% of the variation in GMQ levels observed in these data.

28.4.5 The Mean Square

Analysis of Variance Table

Response: GMQ
             Df Sum Sq Mean Sq F value Pr(>F)
income.cat3   2    146    72.8    0.35   0.71
Residuals   277  58174   210.0               
  • The Mean Square is the Sum of Squares divided by the degrees of freedom, so MS(Factor) = SS(Factor)/df(Factor).
  • In our case, MS(income.cat3) = SS(income.cat3)/df(income.cat3) = 146 / 2 = 72.848 (notice that R maintains more decimal places than it shows for these calculations) and
  • MS(Residuals) = SS(Residuals) / df(Residuals) = 58174 / 277 = 210.014.
    • MS(Residuals) or MS(Error) is an estimate of the residual variance which corresponds to \(\sigma^2\) in the underlying linear model for the outcome of interest, here GMQ.

28.4.6 The F Test Statistic and p Value

Analysis of Variance Table

Response: GMQ
             Df Sum Sq Mean Sq F value Pr(>F)
income.cat3   2    146    72.8    0.35   0.71
Residuals   277  58174   210.0               
  • The ANOVA F test is obtained by calculating MS(Factor) / MS(Residuals). So in our case, F = 72.848 / 210.014 = 0.3469
  • The F test statistic is then compared to a specific F distribution to obtain a p value, which is shown here to be 0.7072
  • Specifically, the observed F test statistic is compared to an F distribution with numerator df = Factor df, and denominator df = Residual df to obtain the p value.
    • Here, we have SS(Factor) = 146 (approximately), and df(Factor) = 2, leaving MS(Factor) = 72.848
    • We have SS(Residual) = 58174, and df(Residual) = 277, leaving MS(Residual) = 210.014
    • MS(Factor) / MS(Residual) = F value = 0.3469, which, when compared to an F distribution with 2 and 277 degrees of freedom, yields a p value of 0.7072

28.5 The Residual Standard Error

The residual standard error is simply the square root of the variance estimate MS(Residual). Here, MS(Residual) = 210.014, so the Residual standard error = 14.49 points.

28.6 The Proportion of Variance Explained by the Factor, or \(\eta^2\)

We will often summarize the proportion of the variation explained by the factor. The summary statistic is called eta-squared (\(\eta^2\)), and is equivalent to the R2 value we have seen previously in linear regression models.

Again, \(\eta^2\) = SS(Factor) / SS(Total)

Here, we have - SS(income.cat3) = 146 and SS(Residuals) = 58174, so SS(Total) = 58320 - Thus, \(\eta^2\) = SS(Factor)/SS(Total) = 146/58320 = 0.0025

The income category accounts for 0.25% of the variation in GMQ levels: only a tiny fraction.

28.7 The Regression Approach to Compare Population Means based on Independent Samples

This approach is equivalent to the ANOVA approach, and thus also (when there are just two samples to compare) to the pooled-variance t test. We run a linear regression model to predict the outcome (here, GMQ) on the basis of the categorical factor with three levels (here, income.cat3)


Call:
lm(formula = GMQ ~ income.cat3, data = nyfs2a)

Residuals:
   Min     1Q Median     3Q    Max 
-42.03  -9.03  -0.03   8.97  49.27 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      97.03       1.46   66.28   <2e-16 ***
income.cat3Middle (25 - 64K)     -1.66       2.12   -0.79     0.43    
income.cat3High (65K or more)    -1.30       2.10   -0.62     0.54    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.5 on 277 degrees of freedom
Multiple R-squared:  0.0025,    Adjusted R-squared:  -0.0047 
F-statistic: 0.347 on 2 and 277 DF,  p-value: 0.707

28.7.1 Interpreting the Regression Output

This output tells us many things, but for now, we’ll focus just on the coefficients output, which tells us that:

  • the point estimate for the population mean GMQ score across “Low” income subjects is 97.03
  • the point estimate (sample mean difference) for the difference in population mean GMQ level between the “Middle” and “Low” income subjects is -1.66 (in words, the Middle income kids have lower GMQ scores than the Low income kids by 1.66 points on average.)
  • the point estimate (sample mean difference) for the difference in population mean GMQ level between the “High” and “Low” income subjects is -1.30 (in words, the High income kids have lower GMQ scores than the Low income kids by 1.30 points on average.)

Of course, we knew all of this already from a summary of the sample means.

# A tibble: 3 x 3
  income.cat3            n `mean(GMQ)`
  <fct>              <int>       <dbl>
1 Low (below 25K)       98        97.0
2 Middle (25 - 64K)     90        95.4
3 High (65K or more)    92        95.7

The model for predicting GMQ is based on two binary (1/0) indicator variables, specifically, we have:

  • Estimated GMQ = 97.03 - 1.66 x [1 if Middle income or 0 if not] - 1.30 x [1 if High income or 0 if not]

The coefficients section also provides a standard error and t statistic and two-sided p value for each coefficient.

28.7.2 The Full ANOVA Table

To see the full ANOVA table corresponding to any linear regression model, we run…

Analysis of Variance Table

Response: GMQ
             Df Sum Sq Mean Sq F value Pr(>F)
income.cat3   2    146    72.8    0.35   0.71
Residuals   277  58174   210.0               

28.7.3 ANOVA Assumptions

The assumptions behind analysis of variance are the same as those behind a linear model. Of specific interest are:

  • The samples obtained from each group are independent.
  • Ideally, the samples from each group are a random sample from the population described by that group.
  • In the population, the variance of the outcome in each group is equal. (This is less of an issue if our study involves a balanced design.)
  • In the population, we have Normal distributions of the outcome in each group.

Happily, the F test is fairly robust to violations of the Normality assumption.

28.8 Equivalent approach to get ANOVA Results

H0: \(\mu_{Low} = \mu_{Middle} = \mu_{High}\) vs. HA: H0 not true.

             Df Sum Sq Mean Sq F value Pr(>F)
income.cat3   2    146    72.8    0.35   0.71
Residuals   277  58174   210.0               

So which of the pairs of means are significantly different?

28.9 The Problem of Multiple Comparisons

  1. Suppose we compare High to Low, using a test with \(\alpha\) = 0.05
  2. Then we compare Middle to Low on the same outcome, also using \(\alpha\) = 0.05
  3. Then we compare High to Middle, also with \(\alpha\) = 0.05

What is our overall \(\alpha\) level across these three comparisons?

  • It could be as bad as 0.05 + 0.05 + 0.05, or 0.15.
  • Rather than our nominal 95% confidence, we have something as low as 85% confidence across this set of simultaneous comparisons.

28.9.1 The Bonferroni solution

  1. Suppose we compare High to Low, using a test with \(\alpha\) = 0.05/3
  2. Then we compare Middle to Low on the same outcome, also using \(\alpha\) = 0.05/3
  3. Then we compare High to Middle, also with \(\alpha\) = 0.05/3

Then across these three comparisons, our overall \(\alpha\) can be (at worst)

  • 0.05/3 + 0.05/3 + 0.05/3 = 0.05
  • So by changing our nominal confidence level from 95% to 98.333% in each comparison, we wind up with at least 95% confidence across this set of simultaneous comparisons.
  • This is a conservative (worst case) approach.

Goal: Simultaneous p values comparing White vs AA, AA vs Other and White vs Other


    Pairwise comparisons using t tests with pooled SD 

data:  nyfs2a$GMQ and nyfs2a$income.cat3 

                   Low (below 25K) Middle (25 - 64K)
Middle (25 - 64K)  1               -                
High (65K or more) 1               1                

P value adjustment method: bonferroni 

In this case, none of these p values are even remotely close to being small enough to indicate statistical significance. They’re all 1, in effect.

28.9.2 Pairwise Comparisons of Group Means using Tukey’s Honestly Significant Differences

Goal: Simultaneous (less conservative) confidence intervals and p values for our three pairwise comparisons (High vs. Low, High vs. Middle, Middle vs. Low)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = GMQ ~ income.cat3, data = nyfs2a)

$income.cat3
                                       diff   lwr  upr p adj
Middle (25 - 64K)-Low (below 25K)    -1.664 -6.65 3.32 0.712
High (65K or more)-Low (below 25K)   -1.302 -6.26 3.65 0.810
High (65K or more)-Middle (25 - 64K)  0.362 -4.70 5.42 0.985

28.9.3 Plotting the Tukey HSD results

Note that the default positioning of the y axis in the plot of Tukey HSD results can be problematic. If we have longer names, in particular, for the levels of our factor, R will leave out some of the labels. We can alleviate that problem either by using the fct_recode function in the forcats package to rename the factor levels, or we can use the following code to reconfigure the margins of the plot.

28.10 What if we consider another outcome, BMI?

We’ll look at the full data set in nyfs2 now, so we can look at BMI as a function of income.

Here are the descriptive numerical summaries:

nyfs2$income.cat3: Low (below 25K)
  min   Q1 median   Q3  max mean   sd  n missing
 13.8 15.5   16.5 17.9 25.2   17 2.19 98       0
-------------------------------------------------------- 
nyfs2$income.cat3: Middle (25 - 64K)
  min   Q1 median   Q3  max mean  sd  n missing
 13.7 15.2   16.1 16.9 24.1 16.4 1.9 90       0
-------------------------------------------------------- 
nyfs2$income.cat3: High (65K or more)
  min   Q1 median   Q3  max mean   sd  n missing
 13.8 15.3   15.9 17.2 21.8 16.3 1.61 92       0

Here is the ANOVA table.

Analysis of Variance Table

Response: bmi
             Df Sum Sq Mean Sq F value Pr(>F)  
income.cat3   2     28    14.2    3.83  0.023 *
Residuals   277   1025     3.7                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And we see a statistically significant result at usual \(\alpha\) levels. Let’s consider the Tukey HSD results. First, we’ll create a factor with shorter labels.

It appears that there is a statistically significant difference between the bmi means of the “Low” group and both the “High” and “Middle” group at the 10% significance level, but no significant difference between “Middle” and “High.” Details of those p values and confidence intervals for those pairwise comparisons follow.

  Tukey multiple comparisons of means
    90% family-wise confidence level

Fit: aov(formula = bmi ~ inc.new, data = nyfs2)

$inc.new
              diff    lwr     upr p adj
Middle-Low  -0.611 -1.189 -0.0317 0.078
High-Low    -0.711 -1.287 -0.1354 0.031
High-Middle -0.100 -0.688  0.4874 0.934