14 Cross-Tabulations
14.1 R setup for this chapter
Appendix A lists all R packages used in this book, and also provides R session information.
14.2 Tattoo Example
Appendix C provides further guidance on pulling data from other systems into R, while Appendix D gives more information (including download links) for all data sets used in this book.
tats <- read_tsv("data/tattoos.txt", show_col_types = FALSE) |>
  mutate(across(where(is.character), as_factor)) |>
  janitor::clean_names()
glimpse(tats)
Rows: 626
Columns: 2
$ location <fct> Commercial Parlor, Commercial Parlor, Commercial Parlo…
$ has_hepatitis_c <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,…
The tattoos.txt data we ingest here into R come from the Data and Story Library. The original source of the data is the University of Texas Southwestern Medical Center, and we observe 626 individuals categorized according to their tattoo status and whether or not they have a diagnosis of hepatitis C. Specifically, the variables include:
- location in one of three groups:
  - (tattoo obtained in a) Commercial Parlor,
  - (tattoo obtained) Elsewhere, or
  - No Tattoo
- has_hepatitis_c status in two groups: Yes, No
tats |> count(location, has_hepatitis_c)
# A tibble: 6 × 3
location has_hepatitis_c n
<fct> <fct> <int>
1 Commercial Parlor Yes 17
2 Commercial Parlor No 35
3 Elsewhere Yes 8
4 Elsewhere No 53
5 No Tattoo Yes 22
6 No Tattoo No 491
The question we’re interested in here is whether there is a strong association between the tattoo location and the probability of hepatitis C.
tats |> tabyl(location, has_hepatitis_c) |>
  adorn_title() |>
  kable()
| | has_hepatitis_c | |
---|---|---|
location | Yes | No |
Commercial Parlor | 17 | 35 |
Elsewhere | 8 | 53 |
No Tattoo | 22 | 491 |
tats |>
  tabyl(location, has_hepatitis_c) |>
  adorn_percentages(denominator = "row") |>
  adorn_pct_formatting() |>
  adorn_ns(position = "front")
location Yes No
Commercial Parlor 17 (32.7%) 35 (67.3%)
Elsewhere 8 (13.1%) 53 (86.9%)
No Tattoo 22 (4.3%) 491 (95.7%)
Yet another way to show the table is data_tabulate() from the datawizard package, used here with column percentages rather than row percentages.
data_tabulate(tats$location, tats$has_hepatitis_c,
              proportions = "col")
tats$location | Yes | No | <NA> | Total
------------------+------------+-------------+--------+------
Commercial Parlor | 17 (36.2%) | 35 (6.0%) | 0 (0%) | 52
Elsewhere | 8 (17.0%) | 53 (9.2%) | 0 (0%) | 61
No Tattoo | 22 (46.8%) | 491 (84.8%) | 0 (0%) | 513
<NA> | 0 (0.0%) | 0 (0.0%) | 0 (0%) | 0
------------------+------------+-------------+--------+------
Total | 47 | 579 | 0 | 626
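The two percentage displays above use different denominators: the first conditions on location (rows), the second on hepatitis C status (columns). As a minimal base-R sketch of that distinction, the block below types the six counts into a matrix (the object `tat_counts` is an illustrative name, not something created elsewhere in this chapter) and applies prop.table() with each margin.

```r
# Counts typed in from the cross-tabulation above (illustrative object,
# not created elsewhere in this chapter).
tat_counts <- matrix(
  c(17, 35,
    8, 53,
    22, 491),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    location = c("Commercial Parlor", "Elsewhere", "No Tattoo"),
    has_hepatitis_c = c("Yes", "No")
  )
)

# Row percentages: within each location group, what share has hepatitis C?
row_pct <- round(100 * prop.table(tat_counts, margin = 1), 1)

# Column percentages: within each hepatitis C group, how are locations distributed?
col_pct <- round(100 * prop.table(tat_counts, margin = 2), 1)

row_pct
col_pct
```

The row percentages are the ones that speak most directly to the question posed earlier, since they compare hepatitis C rates across the three location groups.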
14.3 Chi-Square Test
The most common approach to assessing whether the association we observe between two categorical variables is stronger than we would expect from sampling error alone, if the rows and columns had no effect on one another, is called a chi-square test.
chisq.test(table(tats$location, tats$has_hepatitis_c))
Warning in stats::chisq.test(x, y, ...): Chi-squared approximation may be
incorrect
Pearson's Chi-squared test
data: table(tats$location, tats$has_hepatitis_c)
X-squared = 57.912, df = 2, p-value = 2.658e-13
A chi-square test of independence is a descriptive summary, like a correlation coefficient, so there’s no outcome being modeled, really. This is reflected in the xtabs() function’s approach, where both variables sit on the right-hand side of the formula and nothing appears on the left.
tabx <- xtabs(~ location + has_hepatitis_c, data = tats)
tabx
has_hepatitis_c
location Yes No
Commercial Parlor 17 35
Elsewhere 8 53
No Tattoo 22 491
summary(tabx)
Call: xtabs(formula = ~location + has_hepatitis_c, data = tats)
Number of cases in table: 626
Number of factors: 2
Test for independence of all factors:
Chisq = 57.91, df = 2, p-value = 2.658e-13
Chi-squared approximation may be incorrect
Note that these chi-square assessments have very small p values (indicating some support for an association between location and hepatitis C status) but that the chi-square approximation used here may be incorrect.
R warns about a potentially incorrect approximation when the expected count in one or more of the cells is rather small. Specifically, if any of the expected frequencies (under the null hypothesis of no association between the rows and the columns) is below 5, we have reason to be concerned. Here, both the (Commercial Parlor, Yes) and (Elsewhere, Yes) cells have expected frequencies below 5, so R warns us that the results may be incorrect.
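One possible response to this warning, offered here only as a sketch and not necessarily the approach this book takes, is to ask chisq.test() for a Monte Carlo p value through its simulate.p.value argument, which does not rely on the chi-square approximation. The block below assumes the six counts are typed into a matrix called `tat_counts` (an illustrative name), and the seed value is arbitrary.

```r
# Counts typed in from the tattoo cross-tabulation (illustrative object).
tat_counts <- matrix(
  c(17, 35,
    8, 53,
    22, 491),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    location = c("Commercial Parlor", "Elsewhere", "No Tattoo"),
    has_hepatitis_c = c("Yes", "No")
  )
)

# Fit once to get expected counts; suppressWarnings() only silences the
# small-expected-count warning discussed above.
fit <- suppressWarnings(chisq.test(tat_counts))
small_cells <- fit$expected < 5
sum(small_cells)  # how many cells have an expected count below 5

# Monte Carlo p value based on B simulated tables with the same margins.
set.seed(2024)  # arbitrary seed, used only for reproducibility
fit_sim <- chisq.test(tat_counts, simulate.p.value = TRUE, B = 10000)
fit_sim$p.value
```

With B simulated tables, the smallest p value this method can report is 1 / (B + 1), so a very small result should be read as "at most about this small" rather than as an exact value.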
tat_tab <- tats |> tabyl(location, has_hepatitis_c)
tab_res1 <- chisq.test(tat_tab, tabyl_results = TRUE)
Warning in stats::chisq.test(., ...): Chi-squared approximation may be
incorrect
tab_res1
Pearson's Chi-squared test
data: tat_tab
X-squared = 57.912, df = 2, p-value = 2.658e-13
Next, here are the observed frequencies in each cell, and the expected frequencies if the null hypothesis of no relationship between rows and columns actually held up.
tab_res1$observed
location Yes No
Commercial Parlor 17 35
Elsewhere 8 53
No Tattoo 22 491
tab_res1$expected
location Yes No
Commercial Parlor 3.904153 48.09585
Elsewhere 4.579872 56.42013
No Tattoo 38.515974 474.48403
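The expected frequencies shown above follow from the table margins: under independence, each cell's expected count is its row total times its column total, divided by the overall total. The Pearson statistic then sums (observed - expected)^2 / expected over the cells. The block below is a minimal sketch of that arithmetic, assuming the counts are typed into an illustrative matrix named `tat_counts`.

```r
# Counts typed in from the tattoo cross-tabulation (illustrative object).
tat_counts <- matrix(
  c(17, 35,
    8, 53,
    22, 491),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    location = c("Commercial Parlor", "Elsewhere", "No Tattoo"),
    has_hepatitis_c = c("Yes", "No")
  )
)

# Expected count = (row total * column total) / grand total.
expected <- outer(rowSums(tat_counts), colSums(tat_counts)) / sum(tat_counts)

# Pearson chi-square statistic, degrees of freedom, and upper-tail p value.
x_squared <- sum((tat_counts - expected)^2 / expected)
df <- (nrow(tat_counts) - 1) * (ncol(tat_counts) - 1)
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

expected
c(x_squared = x_squared, df = df, p_value = p_value)
```

These values should agree with the X-squared, df, and p-value that chisq.test() reports above.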
14.4 Personal Appearance Example
These data are also adapted from an example in the Data and Story Library. The data are an excerpt from the results of a GfK Roper Reports® Worldwide survey. Subjects were grouped into five age groups, and each was asked how important their personal appearance is to them, on a seven-point scale.
The data are a contingency table of responses to this question by age group for 5,844 consumers.
Personal Appearance | 20-29 | 30-39 | 40-49 | 50-59 | 60plus | Total |
---|---|---|---|---|---|---|
1 - Not at all important | 37 | 53 | 56 | 36 | 52 | 234 |
2 | 43 | 53 | 58 | 37 | 45 | 236 |
3 | 83 | 88 | 93 | 54 | 45 | 363 |
4 - Average importance | 376 | 403 | 423 | 224 | 210 | 1636 |
5 | 312 | 317 | 270 | 150 | 106 | 1155 |
6 | 326 | 307 | 254 | 123 | 86 | 1096 |
7 - Extremely important | 337 | 300 | 252 | 142 | 93 | 1124 |
Total | 1514 | 1521 | 1406 | 766 | 637 | 5844 |
Rather than generating an R tibble with 5844 rows, here I’ll just recreate the cross-tabulation in R, then analyze it.
persapp <-
  as.table(rbind(
    c(37, 53, 56, 36, 52),
    c(43, 53, 58, 37, 45),
    c(83, 88, 93, 54, 45),
    c(376, 403, 423, 224, 210),
    c(312, 317, 270, 150, 106),
    c(326, 307, 254, 123, 86),
    c(337, 300, 252, 142, 93)))
dimnames(persapp) <-
  list(appear = c("1", "2", "3", "4", "5", "6", "7"),
       age = c("20-29", "30-39", "40-49", "50-59", "60plus"))
persapp
age
appear 20-29 30-39 40-49 50-59 60plus
1 37 53 56 36 52
2 43 53 58 37 45
3 83 88 93 54 45
4 376 403 423 224 210
5 312 317 270 150 106
6 326 307 254 123 86
7 337 300 252 142 93
Now, let’s look at the results from a \(\chi^2\) test of independence of the rows and columns from this contingency table.
out2 <- chisq.test(persapp)
out2
Pearson's Chi-squared test
data: persapp
X-squared = 120.83, df = 24, p-value = 6.914e-15
out2$observed
age
appear 20-29 30-39 40-49 50-59 60plus
1 37 53 56 36 52
2 43 53 58 37 45
3 83 88 93 54 45
4 376 403 423 224 210
5 312 317 270 150 106
6 326 307 254 123 86
7 337 300 252 142 93
out2$expected
age
appear 20-29 30-39 40-49 50-59 60plus
1 60.62218 60.90246 56.29774 30.67146 25.50616
2 61.14031 61.42300 56.77892 30.93361 25.72416
3 94.04209 94.47690 87.33368 47.58008 39.56725
4 423.83710 425.79671 393.60301 214.43806 178.32512
5 299.22485 300.60832 277.87988 151.39117 125.89579
6 283.93977 285.25257 263.68515 143.65777 119.46475
7 291.19370 292.54004 270.42163 147.32786 122.51677
Again, the chi-square test result has an extremely small p value, suggesting evidence favoring some association between the appearance and age group data.
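A small p value says that some association is present but not where it lies. As a hedged sketch of one way to look further, the block below rebuilds the table, computes the percentage of each age group giving each rating, and pulls the Pearson residuals, (observed - expected) / sqrt(expected), from the chisq.test() result. The object names `out`, `col_pct`, and `pearson_res` are illustrative choices, not names used elsewhere in this chapter.

```r
# Rebuild the contingency table so this block runs on its own.
persapp <- as.table(rbind(
  c(37, 53, 56, 36, 52),
  c(43, 53, 58, 37, 45),
  c(83, 88, 93, 54, 45),
  c(376, 403, 423, 224, 210),
  c(312, 317, 270, 150, 106),
  c(326, 307, 254, 123, 86),
  c(337, 300, 252, 142, 93)))
dimnames(persapp) <- list(
  appear = c("1", "2", "3", "4", "5", "6", "7"),
  age = c("20-29", "30-39", "40-49", "50-59", "60plus"))

out <- chisq.test(persapp)

# Percentage of each age group giving each rating (each column sums to 100).
col_pct <- round(100 * prop.table(persapp, margin = 2), 1)

# Pearson residuals: positive where a cell holds more respondents than
# independence predicts, negative where it holds fewer.
pearson_res <- round(out$residuals, 2)

col_pct
pearson_res
```

For example, the (1, 60plus) cell has an observed count of 52 against an expected count of about 25.5, so its residual is large and positive.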