14  Cross-Tabulations

This is a DRAFT version of this Chapter.

This is a sketchy draft. I’ll remove this notice when I post a version of this Chapter that is essentially finished.

14.1 R setup for this chapter

Note

Appendix A lists all R packages used in this book, and also provides R session information.

14.2 Tattoo Example

Note

Appendix C provides further guidance on pulling data from other systems into R, while Appendix D gives more information (including download links) for all data sets used in this book.

tats <- read_tsv("data/tattoos.txt", show_col_types = FALSE) |>
  mutate(across(where(is.character), as_factor)) |>
  janitor::clean_names()

glimpse(tats)
Rows: 626
Columns: 2
$ location        <fct> Commercial Parlor, Commercial Parlor, Commercial Parlo…
$ has_hepatitis_c <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,…

The tatoo.txt data we ingest here into R comes from the Data and Story Library. The original source of the data is the University of Texas Southwestern Medical Center, and we observe 625 individuals categorized according to their tattoo status and whether or not they have a diagnosis of Hepatitis C. Specifically, the variables include:

  • location in one of three groups:
    • (tattoo obtained in a) Commercial Parlor,
    • (tattoo obtained) Elsewhere, or
    • No Tattoo
  • has_hepatitis_c status in two groups: Yes, No
tats |> count(location, has_hepatitis_c)
# A tibble: 6 × 3
  location          has_hepatitis_c     n
  <fct>             <fct>           <int>
1 Commercial Parlor Yes                17
2 Commercial Parlor No                 35
3 Elsewhere         Yes                 8
4 Elsewhere         No                 53
5 No Tattoo         Yes                22
6 No Tattoo         No                491
tats |> tabyl(location, has_hepatitis_c) |>
  adorn_title() |>
  kable()
has_hepatitis_c
location Yes No
Commercial Parlor 17 35
Elsewhere 8 53
No Tattoo 22 491
tats |>
  tabyl(location, has_hepatitis_c) |>
  adorn_percentages(denominator = "row") |>
  adorn_pct_formatting() |>
  adorn_ns(position = "front")
          location        Yes          No
 Commercial Parlor 17 (32.7%)  35 (67.3%)
         Elsewhere  8 (13.1%)  53 (86.9%)
         No Tattoo 22  (4.3%) 491 (95.7%)
data_tabulate(tats$location, tats$has_hepatitis_c,
  proportions = "col", include_na = FALSE
)
tats$location     |        Yes |          No |   <NA> | Total
------------------+------------+-------------+--------+------
Commercial Parlor | 17 (36.2%) |  35  (6.0%) | 0 (0%) |    52
Elsewhere         |  8 (17.0%) |  53  (9.2%) | 0 (0%) |    61
No Tattoo         | 22 (46.8%) | 491 (84.8%) | 0 (0%) |   513
<NA>              |  0  (0.0%) |   0  (0.0%) | 0 (0%) |     0
------------------+------------+-------------+--------+------
Total             |         47 |         579 |      0 |   626

14.3 Chi-Square Test

chisq.test(table(tats$location, tats$has_hepatitis_c))
Warning in stats::chisq.test(x, y, ...): Chi-squared approximation may be
incorrect

    Pearson's Chi-squared test

data:  table(tats$location, tats$has_hepatitis_c)
X-squared = 57.912, df = 2, p-value = 2.658e-13

A chi-square test of independence is a descriptive summary, like a correlation coefficient, so there’s no outcome being modeled, really. This is reflected in the xtabs() function’s approach.

tabx <- xtabs(~ location + has_hepatitis_c, data = tats)

tabx
                   has_hepatitis_c
location            Yes  No
  Commercial Parlor  17  35
  Elsewhere           8  53
  No Tattoo          22 491
summary(tabx)
Call: xtabs(formula = ~location + has_hepatitis_c, data = tats)
Number of cases in table: 626 
Number of factors: 2 
Test for independence of all factors:
    Chisq = 57.91, df = 2, p-value = 2.658e-13
    Chi-squared approximation may be incorrect
tat_tab <- tats |> tabyl(location, has_hepatitis_c)

tab_res1 <- chisq.test(tat_tab, tabyl_results = TRUE)
Warning in stats::chisq.test(., ...): Chi-squared approximation may be
incorrect
tab_res1

    Pearson's Chi-squared test

data:  tat_tab
X-squared = 57.912, df = 2, p-value = 2.658e-13
tab_res1$observed
          location Yes  No
 Commercial Parlor  17  35
         Elsewhere   8  53
         No Tattoo  22 491
tab_res1$expected
          location       Yes        No
 Commercial Parlor  3.904153  48.09585
         Elsewhere  4.579872  56.42013
         No Tattoo 38.515974 474.48403
tab_res1$residuals
          location       Yes         No
 Commercial Parlor  6.627811 -1.8883383
         Elsewhere  1.598143 -0.4553290
         No Tattoo -2.661238  0.7582168
tab_res1$stdres
          location       Yes        No
 Commercial Parlor  7.196963 -7.196963
         Elsewhere  1.749148 -1.749148
         No Tattoo -6.512978  6.512978

14.4 Personal Appearance Example

These data are also adapted from an example in the Data and Story Library. The data are an excerpt from the results of a GfK Roper Reports® Worldwide survey. In addition to grouping the subjects into five age groups, each was also asked how important their personal appearance is to them, on a seven-point scale.

The data are a contingency table of responses to this question by age decade for 5,844 consumers.

Personal Appearance 20-29 30-39 40-49 50-59 60plus Total
1 - Not at all important 37 53 56 36 52 234
2 43 53 58 37 45 236
3 83 88 93 54 45 363
4 - Average importance 376 403 423 224 210 1636
5 312 317 270 150 106 1155
6 326 307 254 123 86 1096
7 - Extremely important 337 300 252 142 93 1124
Total 1514 1521 1406 766 637 5844

Rather than generating an R tibble with 5844 rows, here I’ll just recreate the cross-tabulation in R, then analyze it.

persapp <- 
  as.table(rbind ( 
    c(37, 53, 56, 36, 52),
    c(43, 53, 58, 37, 45),
    c(83, 88, 93, 54, 45),
    c(376, 403, 423, 224, 210),
    c(312, 317, 270, 150, 106),
    c(326, 307, 254, 123, 86),
    c(337, 300, 252, 142, 93)))

dimnames(persapp) <- 
  list( appear= c("1", "2", "3", "4", "5", "6", "7"),
        age = c("20-29", "30-39", "40-49", "50-59", "60plus"))

persapp
      age
appear 20-29 30-39 40-49 50-59 60plus
     1    37    53    56    36     52
     2    43    53    58    37     45
     3    83    88    93    54     45
     4   376   403   423   224    210
     5   312   317   270   150    106
     6   326   307   254   123     86
     7   337   300   252   142     93

Now, let’s look at the results from a \(\chi^2\) test of independence of the rows and columns from this contingency table.

out2 <- chisq.test(persapp)

out2

    Pearson's Chi-squared test

data:  persapp
X-squared = 120.83, df = 24, p-value = 6.914e-15
out2$observed
      age
appear 20-29 30-39 40-49 50-59 60plus
     1    37    53    56    36     52
     2    43    53    58    37     45
     3    83    88    93    54     45
     4   376   403   423   224    210
     5   312   317   270   150    106
     6   326   307   254   123     86
     7   337   300   252   142     93
out2$expected
      age
appear     20-29     30-39     40-49     50-59    60plus
     1  60.62218  60.90246  56.29774  30.67146  25.50616
     2  61.14031  61.42300  56.77892  30.93361  25.72416
     3  94.04209  94.47690  87.33368  47.58008  39.56725
     4 423.83710 425.79671 393.60301 214.43806 178.32512
     5 299.22485 300.60832 277.87988 151.39117 125.89579
     6 283.93977 285.25257 263.68515 143.65777 119.46475
     7 291.19370 292.54004 270.42163 147.32786 122.51677
out2$residuals
      age
appear      20-29      30-39      40-49      50-59     60plus
     1 -3.0339202 -1.0126167 -0.0396820  0.9621465  5.2459285
     2 -2.3199626 -1.0747345  0.1620508  1.0907250  3.8005169
     3 -1.1386502 -0.6663530  0.6063321  0.9307154  0.8636781
     4 -2.3236213 -1.1047681  1.4817456  0.6529731  2.3719674
     5  0.7385286  0.9454163 -0.4727057 -0.1130655 -1.7731913
     6  2.4960803  1.2876363 -0.5964354 -1.7235300 -3.0617357
     7  2.6843194  0.4361579 -1.1202303 -0.4389451 -2.6666809
out2$stdres
      age
appear       20-29       30-39       40-49       50-59      60plus
     1 -3.59740205 -1.20165904 -0.04647599  1.05347351  5.67227598
     2 -2.75133354 -1.27560078  0.18982947  1.19446962  4.11012547
     3 -1.36592442 -0.80000375  0.71845112  1.03098141  0.94479685
     4 -3.18122022 -1.51373842  2.00379056  0.82550794  2.96133317
     5  0.95784108  1.22715817 -0.60557414 -0.13541096 -2.09716160
     6  3.21713632  1.66094580 -0.75931957 -2.05129138 -3.59856010
     7  3.46999984  0.56427443 -1.43038477 -0.52396591 -3.14352177

14.5 For More Information