Chapter 29 Comparisons with Contingency Tables
29.1 Larger Contingency Tables - Testing for Independence
What will we do with tables describing data from more than two categories at a time, returning to the notion of independent (rather than paired or matched) samples? The chi-square tests we have already seen in our twobytwo
table output will extend nicely to this scenario, especially the Pearson \(\chi^2\) (asymptotic) test.
29.2 A 2x3 Table: Comparing Response to Active vs. Placebo
The table below, containing 2 rows and 3 columns of data (ignoring the marginal totals) specifies the number of patients who show complete, partial, or no response after treatment with either active medication or a placebo.
Group | None | Partial | Complete |
---|---|---|---|
Active | 16 | 26 | 29 |
Placebo | 24 | 26 | 18 |
Is there a statistically significant association here? That is to say, is there a statistically significant difference between the treatment groups in the distribution of responses?
29.2.1 Getting the Table into R
To answer this, we’ll have to get the data from this contingency table into a matrix in R. Here’s one approach…
T1 <- matrix(c(16,26,29,24,26,18), ncol=3, nrow=2, byrow=TRUE)
rownames(T1) <- c("Active", "Placebo")
colnames(T1) <- c("None", "Partial", "Complete")
T1
None Partial Complete
Active 16 26 29
Placebo 24 26 18
29.2.2 Manipulating the Table's Presentation
We can add margins to the matrix to get a table including row and column totals.
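The addmargins function handles this:
addmargins(T1)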
None Partial Complete Sum
Active 16 26 29 71
Placebo 24 26 18 68
Sum 40 52 47 139
Instead of the counts, we can tabulate the proportion of all patients within each cell.
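One way uses prop.table, with a little rounding:
round(prop.table(T1), 3)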
None Partial Complete
Active 0.115 0.187 0.209
Placebo 0.173 0.187 0.129
Or, we can tabulate the percentages, rather than the proportions, after some rounding.
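For example, multiplying the proportions by 100:
round(100 * prop.table(T1), 1)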
None Partial Complete
Active 11.5 18.7 20.9
Placebo 17.3 18.7 12.9
R can also tabulate the percentages conditional on the rows…
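One way to get these row percentages is:
round(100 * prop.table(T1, margin = 1), 1)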
None Partial Complete
Active 22.5 36.6 40.8
Placebo 35.3 38.2 26.5
The 40.8 in Active/Complete means that of the Active medication patients, 40.8% had a complete response.
R can also tabulate the percentages conditional on the columns…
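Similarly, for column percentages:
round(100 * prop.table(T1, margin = 2), 1)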
None Partial Complete
Active 40 50 61.7
Placebo 60 50 38.3
If we add the row of column totals for these percentages as shown below, it clarifies that the 61.7 in Active/Complete here means that of the patients with a Complete response, 61.7% were on the active medication.
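Adding that row of totals is another job for addmargins:
addmargins(round(100 * prop.table(T1, margin = 2), 1), margin = 1)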
None Partial Complete
Active 40 50 61.7
Placebo 60 50 38.3
Sum 100 100 100.0
29.3 Getting the Chi-Square Test Results
Now, to actually obtain a p value and perform the significance test with H0: rows and columns are independent vs. HA: rows and columns are associated, we simply run a Pearson chi-square test on T1 …
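The call is simply:
chisq.test(T1)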
Pearson's Chi-squared test
data: T1
X-squared = 4, df = 2, p-value = 0.1
With a p-value of about 0.13 from the Pearson \(\chi^2\) test, our conclusion is to retain the null hypothesis of independence in this setting.
We could have run a Fisher’s exact test, too, if we needed it.
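Here, that is:
fisher.test(T1)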
Fisher's Exact Test for Count Data
data: T1
p-value = 0.1
alternative hypothesis: two.sided
The Fisher exact test p value is also 0.13. Either way, there is insufficient evidence to conclude that there is a (true) difference in the distributions of responses.
29.4 Getting a 2x3 Table into R using a .csv file
Suppose we have a table like this available, and want to compute the Pearson \(\chi^2\) and Fisher exact test results in R, without having to set up the matrix by hand as we did above?
Group | None | Partial | Complete |
---|---|---|---|
Active | 16 | 26 | 29 |
Placebo | 24 | 26 | 18 |
We could do so by building a .csv file (a spreadsheet) containing the table above. In particular, I built a file called active2x3.csv, available on the course website. It is simply the table below.
Group None Partial Complete
Active 16 26 29
Placebo 24 26 18
When we pull this .csv file into R, it emerges as a data frame and deliberately NOT as a tibble.
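One way to pull it in (assuming the file sits in your working directory; adjust the path as needed) is with read.csv:
active2x3 <- read.csv("active2x3.csv")
active2x3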
Group None Partial Complete
1 Active 16 26 29
2 Placebo 24 26 18
29.5 Turning the Data Frame into a Table That R Recognizes
We need to turn this data frame into something that R can recognize as a contingency table. The steps to do this are:
- Establish row names from the Group variable.
- Delete the Group variable, retaining only the variables containing counts.
- Convert the data frame into a matrix.
Specifically in this case, the steps are:
rownames(active2x3) <- active2x3$Group
active2x3$Group <- NULL
tab2x3 <- as.matrix(active2x3)
tab2x3
None Partial Complete
Active 16 26 29
Placebo 24 26 18
And this resulting tab2x3 object can be treated just as the matrix was previously, using addmargins or chisq.test or prop.table or whatever else we need.
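For instance, here is Fisher's exact test applied to tab2x3:
fisher.test(tab2x3)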
Fisher's Exact Test for Count Data
data: tab2x3
p-value = 0.1
alternative hypothesis: two.sided
29.6 Collapsing Levels / Categories in a Factor
Returning to the survey1 data, let's build a table of the relationship between response to the sex and favorite color questions.
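A basic cross-tabulation will do:
table(survey1$sex, survey1$color)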
aqua aquamarine black blue brown chartreuse dark blue depends gray
f 2 0 1 21 0 0 2 0 0
m 0 1 2 41 3 1 0 1 1
green grey light blue navy none orange pink purple red royal blue
f 11 1 1 0 1 0 4 10 5 1
m 12 2 0 2 0 3 1 1 8 0
sapphire silver teal violet white yellow
f 1 1 1 1 3 3
m 0 0 0 0 2 1
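Running the Pearson chi-square test on this table:
chisq.test(table(survey1$sex, survey1$color))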
Warning in stats::chisq.test(x, y, ...): Chi-squared approximation may be
incorrect
Pearson's Chi-squared test
data: table(survey1$sex, survey1$color)
X-squared = 41, df = 24, p-value = 0.02
Note the warning message. With all those small cell counts and, in particular, so many zeros, this test might not be so useful.
We might reasonably consider collapsing the colors data into four new categories:
- Blue - in which I will include the old aqua, aquamarine, blue, dark blue, light blue, navy, royal blue, sapphire and teal;
- Green - which will include the old chartreuse and green;
- Red - which will include the old orange, pink and red; and
- Other - which will include all other colors, including purple and violet.
One way to do this sort of work uses the revalue function from the plyr library; here, though, I will use the closely related recode function from the dplyr package. I learned about this kind of recoding in Chapter 15 of Winston Chang's R Graphics Cookbook, which is invaluable.
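First, here is one way to list the 25 distinct color responses (wrapping in factor() works whether color is stored as a factor or as character):
levels(factor(survey1$color))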
[1] "aqua" "aquamarine" "black" "blue" "brown"
[6] "chartreuse" "dark blue" "depends" "gray" "green"
[11] "grey" "light blue" "navy" "none" "orange"
[16] "pink" "purple" "red" "royal blue" "sapphire"
[21] "silver" "teal" "violet" "white" "yellow"
survey1$color.new <- survey1$color   # create color.new as a copy of the original color variable
survey1$color.new <- dplyr::recode(survey1$color.new,
    aqua = "Blue", aquamarine = "Blue", black = "Other",
    blue = "Blue", brown = "Other", chartreuse = "Green",
    'dark blue' = "Blue", depends = "Other", gray = "Other",
    green = "Green", grey = "Other", 'light blue' = "Blue",
    navy = "Blue", none = "Other", orange = "Red",
    pink = "Red", purple = "Other", red = "Red",
    'royal blue' = "Blue", sapphire = "Blue", silver = "Other",
    teal = "Blue", violet = "Other", white = "Other", yellow = "Other")
So, what I've done here is create a new color variable that assigns the original colors to the four categories (Blue, Green, Red and Other) defined above. Let's run a sanity check on our recoding, then look at the relationship between sex and this new four-category variable, in a 2x4 table…
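The sanity check is a cross-tabulation of the old and new color variables:
table(survey1$color, survey1$color.new)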
Blue Other Green Red
aqua 2 0 0 0
aquamarine 1 0 0 0
black 0 3 0 0
blue 62 0 0 0
brown 0 3 0 0
chartreuse 0 0 1 0
dark blue 2 0 0 0
depends 0 1 0 0
gray 0 1 0 0
green 0 0 23 0
grey 0 3 0 0
light blue 1 0 0 0
navy 2 0 0 0
none 0 1 0 0
orange 0 0 0 3
pink 0 0 0 5
purple 0 11 0 0
red 0 0 0 13
royal blue 1 0 0 0
sapphire 1 0 0 0
silver 0 1 0 0
teal 1 0 0 0
violet 0 1 0 0
white 0 5 0 0
yellow 0 4 0 0
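And here is the 2x4 table of sex by the collapsed categories:
table(survey1$sex, survey1$color.new)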
Blue Other Green Red
f 29 21 11 9
m 44 13 13 12
To neaten this output up, we might want to have Other show up last in the color.new group.
survey1$color.new2 <- factor(survey1$color.new,
levels=c("Blue", "Green", "Red", "Other"))
table(survey1$sex, survey1$color.new2)
Blue Green Red Other
f 29 11 9 21
m 44 13 12 13
Now, we run the new Pearson \(\chi^2\) test, and conclude that there is no evidence of statistically significant association (at the 5% significance level) between sex and these collapsed favorite color categories.
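Specifically:
chisq.test(table(survey1$sex, survey1$color.new2))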
Pearson's Chi-squared test
data: table(survey1$sex, survey1$color.new2)
X-squared = 5, df = 3, p-value = 0.2
29.7 Accuracy of Death Certificates (A 6x3 Table)
The table below compiles data from six studies designed to investigate the accuracy of death certificates. 5373 autopsies were compared to the causes of death listed on the certificates. Of those, 3726 were confirmed to be accurate, 783 either lacked information or contained inaccuracies but did not require recoding of the underlying cause of death, and 864 were incorrect and required recoding. Do the results across studies appear consistent?
Date of Study | [Confirmed] Accurate | [Inaccurate] No Change | [Incorrect] Recoding | Total |
---|---|---|---|---|
1955-1965 | 2040 | 367 | 327 | 2734 |
1970 | 149 | 60 | 48 | 257 |
1970-1971 | 288 | 25 | 70 | 383 |
1975-1977 | 703 | 197 | 252 | 1152 |
1977-1978 | 425 | 62 | 88 | 575 |
1980 | 121 | 72 | 79 | 272 |
Total | 3726 | 783 | 864 | 5373 |
29.8 The Pearson Chi-Square Test of Independence
We can assess the homogeneity of the confirmation results (columns) we observe in the table using a Pearson chi-squared test of independence.
- The null hypothesis is that the rows and columns are independent.
- The alternative hypothesis is that there is an association between the rows and the columns.
z <- matrix(c(2040,367,327,149,60,48,288,25,70,703,
197,252,425,62,88,121,72,79), byrow=TRUE, nrow=6)
rownames(z) <- c("1955-65", "1970", "1970-71", "1975-77", "1977-78",
"1980")
colnames(z) <- c("Confirmed", "Inaccurate", "Incorrect")
addmargins(z)
Confirmed Inaccurate Incorrect Sum
1955-65 2040 367 327 2734
1970 149 60 48 257
1970-71 288 25 70 383
1975-77 703 197 252 1152
1977-78 425 62 88 575
1980 121 72 79 272
Sum 3726 783 864 5373
To see the potential heterogeneity across rows in these data, we should perhaps also look at the percentages of autopsies falling in each of the three accuracy categories within each study.
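One way to build the table below is:
addmargins(round(100 * prop.table(z, margin = 1), 1), margin = 2)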
Confirmed Inaccurate Incorrect Sum
1955-65 74.6 13.4 12.0 100
1970 58.0 23.3 18.7 100
1970-71 75.2 6.5 18.3 100
1975-77 61.0 17.1 21.9 100
1977-78 73.9 10.8 15.3 100
1980 44.5 26.5 29.0 100
In three of the studies, approximately 3/4 of the results were confirmed. In the other three, 45%, 58% and 61% were confirmed. It looks like there’s a fair amount of variation in results across studies. To see if this is true, formally, we run Pearson’s chi-square test of independence, where the null hypothesis is that the rows and columns are independent, and the alternative hypothesis is that there is an association between the rows and the columns.
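That test is simply:
chisq.test(z)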
Pearson's Chi-squared test
data: z
X-squared = 209, df = 10, p-value <2e-16
The chi-square test statistic is approximately 209 on 10 degrees of freedom, yielding p < 0.0001.
Autopsies are not performed at random; in fact, many are done because the cause of death listed on the certificate is uncertain. What problems may arise if you attempt to use the results of these studies to make inference about the population as a whole?
29.9 Three-Way Tables: A 2x2xK Table and a Mantel-Haenszel Analysis
The material I discuss in this section is attributable to Jeff Simonoff and his book Analyzing Categorical Data. The example is taken from Section 8.1 of that book.
A three-dimensional or three-way table of counts often reflects a situation where the rows and columns refer to variables whose association is of primary interest to us, and the third factor (a layer, or stratum) describes a control variable, whose effect on our primary association is something we are controlling for in the analysis.
29.9.1 Smoking and Mortality in the UK
In the early 1970s and then again 20 years later, in Whickham, United Kingdom, surveys yielded the following relationship between whether a person was a smoker at the time of the original survey and whether they were still alive 20 years later.
whickham1 <- matrix(c(443, 139, 502, 230), byrow=TRUE, nrow=2)
rownames(whickham1) <- c("Smoker", "Non-Smoker")
colnames(whickham1) <- c("Alive", "Dead")
addmargins(whickham1)
Alive Dead Sum
Smoker 443 139 582
Non-Smoker 502 230 732
Sum 945 369 1314
Here’s the two-by-two table analysis.
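The output below follows the format of the twoby2 function from the Epi package, so a sketch along these lines (assuming the Epi package is available), together with the Yates-corrected chi-square test, would reproduce it:
Epi::twoby2(whickham1)   # confidence intervals for P(Alive), the relative risk and the odds ratio
chisq.test(whickham1)    # Pearson chi-square test with Yates' continuity correction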
2 by 2 table analysis:
------------------------------------------------------
Outcome : Alive
Comparing : Smoker vs. Non-Smoker
Alive Dead P(Alive) 95% conf. interval
Smoker 443 139 0.761 0.725 0.794
Non-Smoker 502 230 0.686 0.651 0.718
95% conf. interval
Relative Risk: 1.1099 1.0381 1.187
Sample Odds Ratio: 1.4602 1.1414 1.868
Conditional MLE Odds Ratio: 1.4598 1.1335 1.884
Probability difference: 0.0754 0.0266 0.123
Exact P-value: 0.0030
Asymptotic P-value: 0.0026
------------------------------------------------------
Pearson's Chi-squared test with Yates' continuity correction
data: whickham1
X-squared = 9, df = 1, p-value = 0.003
There is a significant association between smoking and mortality (\(\chi^2\) = 8.75 on 1 df, p = 0.003), but it isn’t the one you might expect.
- The odds ratio is 1.46, implying that the odds of having lived were 46% higher for smokers than for non-smokers.
- Does that mean that smoking is good for you?
Not likely. There is a key “lurking” variable here - a variable that is related to both smoking and mortality that is obscuring the actual relationship - namely, age.
29.9.2 The whickham data, including age as well as smoking and mortality
The table below gives the mortality experience separated into subtables by initial age group.
age <- c(rep("18-24", 4), rep("25-34", 4),
rep("35-44", 4), rep("45-54", 4),
rep("55-64", 4), rep("65-74", 4),
rep("75+", 4))
smoking <- c(rep(c("Smoker", "Smoker", "Non-Smoker", "Non-Smoker"), 7))
status <- c(rep(c("Alive", "Dead"), 14))
counts <- c(53, 2, 61, 1, 121, 3, 152, 5,
95, 14, 114, 7, 103, 27, 66, 12,
64, 51, 81, 40, 7, 29, 28, 101,
0, 13, 0, 64)
whickham2 <- data.frame(smoking, status, age, counts) %>% as_tibble()  # as_tibble() replaces the deprecated tbl_df()
whickham2$smoking <- factor(whickham2$smoking, levels = c("Smoker", "Non-Smoker"))
whickham2.tab1 <- xtabs(counts ~ smoking + status + age, data = whickham2)
whickham2.tab1
, , age = 18-24
status
smoking Alive Dead
Smoker 53 2
Non-Smoker 61 1
, , age = 25-34
status
smoking Alive Dead
Smoker 121 3
Non-Smoker 152 5
, , age = 35-44
status
smoking Alive Dead
Smoker 95 14
Non-Smoker 114 7
, , age = 45-54
status
smoking Alive Dead
Smoker 103 27
Non-Smoker 66 12
, , age = 55-64
status
smoking Alive Dead
Smoker 64 51
Non-Smoker 81 40
, , age = 65-74
status
smoking Alive Dead
Smoker 7 29
Non-Smoker 28 101
, , age = 75+
status
smoking Alive Dead
Smoker 0 13
Non-Smoker 0 64
The odds ratios for each of these subtables (except the last one, where the odds ratio is undefined) are as follows:
Age Group | Odds Ratio |
---|---|
18-24 | 0.43 |
25-34 | 1.33 |
35-44 | 0.42 |
45-54 | 0.69 |
55-64 | 0.62 |
65-74 | 0.87 |
75+ | undefined |
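One way to compute these cross-product odds ratios directly from the three-way table (a quick sketch of my own, not part of the original analysis) is:
round(apply(whickham2.tab1, 3,
            function(tab) (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])), 2)
# the 75+ group returns NaN, since none of its respondents were still alive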
Thus, for all age groups except 25-34 year olds, smoking is associated with higher mortality.
Why? Not surprisingly, there is a strong association between age and mortality, with mortality rates being very low for young people (2.5% for 18-24 year olds) and increasing to 100% for 75+ year olds.
There is also an association between age and smoking, with smoking rates peaking in the 45-54 year old range and then falling off rapidly. In particular, respondents who were 65 and older at the time of the first survey had very low smoking rates (20.2%) but very high mortality rates (85.5%). Smoking was hardly protective, however: even among the 65-74 year olds, mortality was higher among smokers (80.6%) than among non-smokers (78.3%). A flat version of the table (ftable in R) can help us with these calculations.
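In this case:
ftable(whickham2.tab1)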
age 18-24 25-34 35-44 45-54 55-64 65-74 75+
smoking status
Smoker Alive 53 121 95 103 64 7 0
Dead 2 3 14 27 51 29 13
Non-Smoker Alive 61 152 114 66 81 28 0
Dead 1 5 7 12 40 101 64
29.9.3 The Cochran-Mantel-Haenszel Test
So, the marginal table looking at smoking and mortality combining all age groups isn’t the most meaningful summary of the relationship between smoking and mortality. Instead, we need to look at the conditional association of smoking and mortality, given age, to address our interests.
The null hypothesis would be that, in the population, smoking and mortality are independent within strata formed by age group. In other words, H0 requires that smoking be of no value in predicting mortality once age has been accounted for.
The alternative hypothesis would be that, in the population, smoking and mortality are associated within the strata formed by age group. In other words, HA requires that smoking be of at least some value in predicting mortality even after age has been accounted for.
We can consider the evidence that helps us choose between these two hypotheses with a Cochran-Mantel-Haenszel test, which is obtained in R through the mantelhaen.test function. This test requires us to assume that, in the population and within each age group, the smoking-mortality odds ratio is the same. Essentially, this means that the association of smoking with mortality is the same for older and younger people.
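A call along these lines, requesting the 90% confidence interval reported below, does the job:
mantelhaen.test(whickham2.tab1, conf.level = 0.90)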
Mantel-Haenszel chi-squared test with continuity correction
data: whickham2.tab1
Mantel-Haenszel X-squared = 5, df = 1, p-value = 0.02
alternative hypothesis: true common odds ratio is not equal to 1
90 percent confidence interval:
0.490 0.875
sample estimates:
common odds ratio
0.655
- The Cochran-Mantel-Haenszel test statistic is 5 (after a continuity correction), leading to a p value of 0.02, so we reject the null hypothesis of conditional independence of smoking and survival given age.
- The estimated common conditional odds ratio is 0.65. This implies that (given age) being a smoker is associated with a 35% lower odds of being alive 20 years later than being a non-smoker.
- A 90% confidence interval for that common odds ratio is (0.49, 0.87), reinforcing rejection of the conditional independence (where the odds ratio would be 1).
29.9.4 Checking Assumptions: The Woolf test
We can also obtain a test (using the woolf_test function, in the vcd library) to see if the common odds ratio estimated in the Mantel-Haenszel procedure is reasonable for all age groups. In other words, the Woolf test is a test of the assumption of homogeneous odds ratios across the seven age groups.
If the Woolf test is significant, it suggests that the Cochran-Mantel-Haenszel test is not appropriate, since the odds ratios for smoking and mortality vary too much across the sub-tables by age group. Here, we have the following log odds ratios (estimated after adding 0.5 to each cell count, which keeps the 75+ group's odds ratio defined, rather than from the raw cross-product ratios) and the associated Woolf test.
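Assuming the vcd package is loaded, calls like these produce the output below:
library(vcd)
oddsratio(whickham2.tab1, log = TRUE)  # log odds ratios by age; 0.5 is added to each cell because of the zero counts
woolf_test(whickham2.tab1)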
log odds ratios for smoking and status by age
18-24 25-34 35-44 45-54 55-64 65-74 75+
-0.6502 0.2247 -0.8407 -0.3461 -0.4742 -0.0993 1.5640
Woolf-test on Homogeneity of Odds Ratios (no 3-Way assoc.)
data: whickham2.tab1
X-squared = 3, df = 6, p-value = 0.8
As you can see, the Woolf test is not close to statistically significant, implying that a common odds ratio is at least potentially reasonable across the age groups (or at least across the groups under age 75, where both survivors and deaths were observed).
29.9.5 Without the Continuity Correction
By default, R presents the Mantel-Haenszel test with a continuity correction when it is used for a 2x2xK table. In virtually all cases, you should go ahead and use that correction, but as you can see below, the difference it makes in this case is modest.
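Turning the correction off amounts to:
mantelhaen.test(whickham2.tab1, correct = FALSE, conf.level = 0.90)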
Mantel-Haenszel chi-squared test without continuity correction
data: whickham2.tab1
Mantel-Haenszel X-squared = 6, df = 1, p-value = 0.02
alternative hypothesis: true common odds ratio is not equal to 1
90 percent confidence interval:
0.490 0.875
sample estimates:
common odds ratio
0.655