14 Cross-Tabulations
This is a sketchy draft. I’ll remove this notice when I post a version of this Chapter that is essentially finished.
14.1 R setup for this chapter
Appendix A lists all R packages used in this book, and also provides R session information.
14.2 Tattoo Example
Appendix C provides further guidance on pulling data from other systems into R, while Appendix D gives more information (including download links) for all data sets used in this book.
tats <- read_tsv("data/tattoos.txt", show_col_types = FALSE) |>
mutate(across(where(is.character), as_factor)) |>
janitor::clean_names()
glimpse(tats)
Rows: 626
Columns: 2
$ location <fct> Commercial Parlor, Commercial Parlor, Commercial Parlo…
$ has_hepatitis_c <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,…
The tatoo.txt
data we ingest here into R comes from the Data and Story Library. The original source of the data is the University of Texas Southwestern Medical Center, and we observe 625 individuals categorized according to their tattoo status and whether or not they have a diagnosis of Hepatitis C. Specifically, the variables include:
-
location in one of three groups:
- (tattoo obtained in a) Commercial Parlor,
- (tattoo obtained) Elsewhere, or
- No Tattoo
- has_hepatitis_c status in two groups: Yes, No
tats |> count(location, has_hepatitis_c)
# A tibble: 6 × 3
location has_hepatitis_c n
<fct> <fct> <int>
1 Commercial Parlor Yes 17
2 Commercial Parlor No 35
3 Elsewhere Yes 8
4 Elsewhere No 53
5 No Tattoo Yes 22
6 No Tattoo No 491
tats |> tabyl(location, has_hepatitis_c) |>
adorn_title() |>
kable()
has_hepatitis_c | ||
---|---|---|
location | Yes | No |
Commercial Parlor | 17 | 35 |
Elsewhere | 8 | 53 |
No Tattoo | 22 | 491 |
tats |>
tabyl(location, has_hepatitis_c) |>
adorn_percentages(denominator = "row") |>
adorn_pct_formatting() |>
adorn_ns(position = "front")
location Yes No
Commercial Parlor 17 (32.7%) 35 (67.3%)
Elsewhere 8 (13.1%) 53 (86.9%)
No Tattoo 22 (4.3%) 491 (95.7%)
data_tabulate(tats$location, tats$has_hepatitis_c,
proportions = "col", include_na = FALSE
)
tats$location | Yes | No | <NA> | Total
------------------+------------+-------------+--------+------
Commercial Parlor | 17 (36.2%) | 35 (6.0%) | 0 (0%) | 52
Elsewhere | 8 (17.0%) | 53 (9.2%) | 0 (0%) | 61
No Tattoo | 22 (46.8%) | 491 (84.8%) | 0 (0%) | 513
<NA> | 0 (0.0%) | 0 (0.0%) | 0 (0%) | 0
------------------+------------+-------------+--------+------
Total | 47 | 579 | 0 | 626
14.3 Chi-Square Test
chisq.test(table(tats$location, tats$has_hepatitis_c))
Warning in stats::chisq.test(x, y, ...): Chi-squared approximation may be
incorrect
Pearson's Chi-squared test
data: table(tats$location, tats$has_hepatitis_c)
X-squared = 57.912, df = 2, p-value = 2.658e-13
A chi-square test of independence is a descriptive summary, like a correlation coefficient, so there’s no outcome being modeled, really. This is reflected in the xtabs()
function’s approach.
tabx <- xtabs(~ location + has_hepatitis_c, data = tats)
tabx
has_hepatitis_c
location Yes No
Commercial Parlor 17 35
Elsewhere 8 53
No Tattoo 22 491
summary(tabx)
Call: xtabs(formula = ~location + has_hepatitis_c, data = tats)
Number of cases in table: 626
Number of factors: 2
Test for independence of all factors:
Chisq = 57.91, df = 2, p-value = 2.658e-13
Chi-squared approximation may be incorrect
tat_tab <- tats |> tabyl(location, has_hepatitis_c)
tab_res1 <- chisq.test(tat_tab, tabyl_results = TRUE)
Warning in stats::chisq.test(., ...): Chi-squared approximation may be
incorrect
tab_res1
Pearson's Chi-squared test
data: tat_tab
X-squared = 57.912, df = 2, p-value = 2.658e-13
tab_res1$observed
location Yes No
Commercial Parlor 17 35
Elsewhere 8 53
No Tattoo 22 491
tab_res1$expected
location Yes No
Commercial Parlor 3.904153 48.09585
Elsewhere 4.579872 56.42013
No Tattoo 38.515974 474.48403
tab_res1$residuals
location Yes No
Commercial Parlor 6.627811 -1.8883383
Elsewhere 1.598143 -0.4553290
No Tattoo -2.661238 0.7582168
tab_res1$stdres
location Yes No
Commercial Parlor 7.196963 -7.196963
Elsewhere 1.749148 -1.749148
No Tattoo -6.512978 6.512978
14.4 Personal Appearance Example
These data are also adapted from an example in the Data and Story Library. The data are an excerpt from the results of a GfK Roper Reports® Worldwide survey. In addition to grouping the subjects into five age groups, each was also asked how important their personal appearance is to them, on a seven-point scale.
The data are a contingency table of responses to this question by age decade for 5,844 consumers.
Personal Appearance | 20-29 | 30-39 | 40-49 | 50-59 | 60plus | Total |
---|---|---|---|---|---|---|
1 - Not at all important | 37 | 53 | 56 | 36 | 52 | 234 |
2 | 43 | 53 | 58 | 37 | 45 | 236 |
3 | 83 | 88 | 93 | 54 | 45 | 363 |
4 - Average importance | 376 | 403 | 423 | 224 | 210 | 1636 |
5 | 312 | 317 | 270 | 150 | 106 | 1155 |
6 | 326 | 307 | 254 | 123 | 86 | 1096 |
7 - Extremely important | 337 | 300 | 252 | 142 | 93 | 1124 |
Total | 1514 | 1521 | 1406 | 766 | 637 | 5844 |
Rather than generating an R tibble with 5844 rows, here I’ll just recreate the cross-tabulation in R, then analyze it.
persapp <-
as.table(rbind (
c(37, 53, 56, 36, 52),
c(43, 53, 58, 37, 45),
c(83, 88, 93, 54, 45),
c(376, 403, 423, 224, 210),
c(312, 317, 270, 150, 106),
c(326, 307, 254, 123, 86),
c(337, 300, 252, 142, 93)))
dimnames(persapp) <-
list( appear= c("1", "2", "3", "4", "5", "6", "7"),
age = c("20-29", "30-39", "40-49", "50-59", "60plus"))
persapp
age
appear 20-29 30-39 40-49 50-59 60plus
1 37 53 56 36 52
2 43 53 58 37 45
3 83 88 93 54 45
4 376 403 423 224 210
5 312 317 270 150 106
6 326 307 254 123 86
7 337 300 252 142 93
Now, let’s look at the results from a \(\chi^2\) test of independence of the rows and columns from this contingency table.
out2 <- chisq.test(persapp)
out2
Pearson's Chi-squared test
data: persapp
X-squared = 120.83, df = 24, p-value = 6.914e-15
out2$observed
age
appear 20-29 30-39 40-49 50-59 60plus
1 37 53 56 36 52
2 43 53 58 37 45
3 83 88 93 54 45
4 376 403 423 224 210
5 312 317 270 150 106
6 326 307 254 123 86
7 337 300 252 142 93
out2$expected
age
appear 20-29 30-39 40-49 50-59 60plus
1 60.62218 60.90246 56.29774 30.67146 25.50616
2 61.14031 61.42300 56.77892 30.93361 25.72416
3 94.04209 94.47690 87.33368 47.58008 39.56725
4 423.83710 425.79671 393.60301 214.43806 178.32512
5 299.22485 300.60832 277.87988 151.39117 125.89579
6 283.93977 285.25257 263.68515 143.65777 119.46475
7 291.19370 292.54004 270.42163 147.32786 122.51677
out2$residuals
age
appear 20-29 30-39 40-49 50-59 60plus
1 -3.0339202 -1.0126167 -0.0396820 0.9621465 5.2459285
2 -2.3199626 -1.0747345 0.1620508 1.0907250 3.8005169
3 -1.1386502 -0.6663530 0.6063321 0.9307154 0.8636781
4 -2.3236213 -1.1047681 1.4817456 0.6529731 2.3719674
5 0.7385286 0.9454163 -0.4727057 -0.1130655 -1.7731913
6 2.4960803 1.2876363 -0.5964354 -1.7235300 -3.0617357
7 2.6843194 0.4361579 -1.1202303 -0.4389451 -2.6666809
out2$stdres
age
appear 20-29 30-39 40-49 50-59 60plus
1 -3.59740205 -1.20165904 -0.04647599 1.05347351 5.67227598
2 -2.75133354 -1.27560078 0.18982947 1.19446962 4.11012547
3 -1.36592442 -0.80000375 0.71845112 1.03098141 0.94479685
4 -3.18122022 -1.51373842 2.00379056 0.82550794 2.96133317
5 0.95784108 1.22715817 -0.60557414 -0.13541096 -2.09716160
6 3.21713632 1.66094580 -0.75931957 -2.05129138 -3.59856010
7 3.46999984 0.56427443 -1.43038477 -0.52396591 -3.14352177