14  Cross-Tabulations

14.1 R setup for this chapter

Note

Appendix A lists all R packages used in this book, and also provides R session information.

14.2 Tattoo Example

Note

Appendix C provides further guidance on pulling data from other systems into R, while Appendix D gives more information (including download links) for all data sets used in this book.

tats <- read_tsv("data/tattoos.txt", show_col_types = FALSE) |>
  mutate(across(where(is.character), as_factor)) |>
  janitor::clean_names()

glimpse(tats)
Rows: 626
Columns: 2
$ location        <fct> Commercial Parlor, Commercial Parlor, Commercial Parlo…
$ has_hepatitis_c <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,…

The tattoos.txt data we ingest here into R comes from the Data and Story Library. The original source of the data is the University of Texas Southwestern Medical Center, and we observe 626 individuals categorized according to their tattoo status and whether or not they have a diagnosis of hepatitis C. Specifically, the variables include:

  • location in one of three groups:
    • (tattoo obtained in a) Commercial Parlor,
    • (tattoo obtained) Elsewhere, or
    • No Tattoo
  • has_hepatitis_c status in two groups: Yes, No
tats |> count(location, has_hepatitis_c)
# A tibble: 6 × 3
  location          has_hepatitis_c     n
  <fct>             <fct>           <int>
1 Commercial Parlor Yes                17
2 Commercial Parlor No                 35
3 Elsewhere         Yes                 8
4 Elsewhere         No                 53
5 No Tattoo         Yes                22
6 No Tattoo         No                491

The question we’re interested in here is whether hepatitis C status is associated with tattoo location, that is, whether the proportion of people with hepatitis C differs across the three location groups.

tats |> tabyl(location, has_hepatitis_c) |>
  adorn_title() |>
  kable()
                   has_hepatitis_c
location           Yes   No
Commercial Parlor   17   35
Elsewhere            8   53
No Tattoo           22  491

tats |>
  tabyl(location, has_hepatitis_c) |>
  adorn_percentages(denominator = "row") |>
  adorn_pct_formatting() |>
  adorn_ns(position = "front")
          location        Yes          No
 Commercial Parlor 17 (32.7%)  35 (67.3%)
         Elsewhere  8 (13.1%)  53 (86.9%)
         No Tattoo 22  (4.3%) 491 (95.7%)

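Yet another way to show the table is data_tabulate(), from the datawizard package. Here we ask for column percentages, so each percentage describes the share of a hepatitis C group (Yes or No) found in each location, rather than the row percentages shown above.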

data_tabulate(tats$location, tats$has_hepatitis_c,
  proportions = "col")
tats$location     |        Yes |          No |   <NA> | Total
------------------+------------+-------------+--------+------
Commercial Parlor | 17 (36.2%) |  35  (6.0%) | 0 (0%) |    52
Elsewhere         |  8 (17.0%) |  53  (9.2%) | 0 (0%) |    61
No Tattoo         | 22 (46.8%) | 491 (84.8%) | 0 (0%) |   513
<NA>              |  0  (0.0%) |   0  (0.0%) | 0 (0%) |     0
------------------+------------+-------------+--------+------
Total             |         47 |         579 |      0 |   626

14.3 Chi-Square Test

The most common approach to assessing whether the association we observe between two categorical variables is stronger than we would expect from sampling error alone, if the rows and columns were in fact independent, is called a chi-square test.
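As a brief sketch of what the test computes (the notation here is introduced only for illustration): write \(O_{ij}\) for the observed count in row \(i\) and column \(j\) of a table with \(r\) rows and \(c\) columns, \(E_{ij}\) for the count expected under independence, and \(n\) for the grand total. Pearson's statistic then compares the two sets of counts cell by cell.

\[
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n},
\qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\]

Under the null hypothesis of independence, this statistic is compared to a chi-square distribution with \((r - 1)(c - 1)\) degrees of freedom. For the 3 by 2 tattoo table, that is \((3 - 1)(2 - 1) = 2\).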

chisq.test(table(tats$location, tats$has_hepatitis_c))
Warning in stats::chisq.test(x, y, ...): Chi-squared approximation may be
incorrect

    Pearson's Chi-squared test

data:  table(tats$location, tats$has_hepatitis_c)
X-squared = 57.912, df = 2, p-value = 2.658e-13

A chi-square test of independence treats the two variables symmetrically, much as a correlation coefficient does, so neither variable is being modeled as an outcome. This is reflected in the xtabs() function’s approach, where both variables appear on the right-hand side of the formula.

tabx <- xtabs(~ location + has_hepatitis_c, data = tats)

tabx
                   has_hepatitis_c
location            Yes  No
  Commercial Parlor  17  35
  Elsewhere           8  53
  No Tattoo          22 491
summary(tabx)
Call: xtabs(formula = ~location + has_hepatitis_c, data = tats)
Number of cases in table: 626 
Number of factors: 2 
Test for independence of all factors:
    Chisq = 57.91, df = 2, p-value = 2.658e-13
    Chi-squared approximation may be incorrect

Note that these chi-square assessments have very small p values (indicating some support for an association between location and hepatitis C status) but that the chi-square approximation used here may be incorrect.

R warns about a potentially incorrect approximation when the expected count in one or more of the cells is small. Specifically, if any of the expected frequencies (under the null hypothesis of no association between the rows and the columns) is below 5, we have reason to be concerned. Here, both the (Commercial Parlor, Yes) and (Elsewhere, Yes) cells have expected frequencies below 5, so R warns us that the results may be incorrect.
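To see where those expected frequencies come from, here is a minimal base R sketch. It re-enters the six counts by hand, so the object names obs, expected and small_cells are illustrative and not part of the chapter's own code.

```r
# Observed counts, typed in from the count() output shown earlier
obs <- matrix(
  c(17,  35,
     8,  53,
    22, 491),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    location = c("Commercial Parlor", "Elsewhere", "No Tattoo"),
    has_hepatitis_c = c("Yes", "No")
  )
)

# Expected count under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
round(expected, 2)

# Flag the cells whose expected count falls below 5
small_cells <- expected < 5
small_cells
```

Only the two "Yes" cells for people with tattoos are flagged, which matches the cells named in the paragraph above.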

tat_tab <- tats |> tabyl(location, has_hepatitis_c)

tab_res1 <- chisq.test(tat_tab, tabyl_results = TRUE)
Warning in stats::chisq.test(., ...): Chi-squared approximation may be
incorrect
tab_res1

    Pearson's Chi-squared test

data:  tat_tab
X-squared = 57.912, df = 2, p-value = 2.658e-13

Next, here are the observed frequencies in each cell, and the expected frequencies if the null hypothesis of no relationship between rows and columns actually held up.

tab_res1$observed
          location Yes  No
 Commercial Parlor  17  35
         Elsewhere   8  53
         No Tattoo  22 491
tab_res1$expected
          location       Yes        No
 Commercial Parlor  3.904153  48.09585
         Elsewhere  4.579872  56.42013
         No Tattoo 38.515974 474.48403
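
Because two expected counts fall below 5, one possible cross-check (not something this chapter's original analysis does) is to ask chisq.test() for a Monte Carlo p value instead of relying on the chi-square approximation. The sketch below re-enters the counts by hand; the seed and the number of replicates B are arbitrary choices. Base R's fisher.test() is another option one might consider for small tables.

```r
# Observed counts, typed in from the table above
obs <- matrix(
  c(17,  35,
     8,  53,
    22, 491),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    location = c("Commercial Parlor", "Elsewhere", "No Tattoo"),
    has_hepatitis_c = c("Yes", "No")
  )
)

set.seed(431)  # arbitrary seed, so the simulation is reproducible

# Simulate tables with the same margins under independence and
# compare the observed statistic to the simulated ones
sim_res <- chisq.test(obs, simulate.p.value = TRUE, B = 10000)

sim_res$statistic  # same Pearson statistic as before
sim_res$p.value    # p value based on the simulated tables
```

The smallest p value this approach can report is 1 / (B + 1), so it cannot reproduce the tiny p value from the approximation, but it can indicate whether the overall conclusion changes.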

14.4 Personal Appearance Example

These data are also adapted from an example in the Data and Story Library. The data are an excerpt from the results of a GfK Roper Reports® Worldwide survey. Subjects were grouped into five age groups, and each was asked how important their personal appearance is to them, on a seven-point scale.

The data are a contingency table of responses to this question by age decade for 5,844 consumers.

Personal Appearance         20-29  30-39  40-49  50-59  60plus  Total
1 - Not at all important       37     53     56     36      52    234
2                              43     53     58     37      45    236
3                              83     88     93     54      45    363
4 - Average importance        376    403    423    224     210   1636
5                             312    317    270    150     106   1155
6                             326    307    254    123      86   1096
7 - Extremely important       337    300    252    142      93   1124
Total                        1514   1521   1406    766     637   5844

Rather than generating an R tibble with 5844 rows, here I’ll just recreate the cross-tabulation in R, then analyze it.

# Rows are the seven importance ratings; columns are the five age groups
persapp <- 
  as.table(rbind(
    c(37, 53, 56, 36, 52),
    c(43, 53, 58, 37, 45),
    c(83, 88, 93, 54, 45),
    c(376, 403, 423, 224, 210),
    c(312, 317, 270, 150, 106),
    c(326, 307, 254, 123, 86),
    c(337, 300, 252, 142, 93)))

# Label the rows (appearance rating) and columns (age group)
dimnames(persapp) <- 
  list(appear = c("1", "2", "3", "4", "5", "6", "7"),
       age = c("20-29", "30-39", "40-49", "50-59", "60plus"))

persapp
      age
appear 20-29 30-39 40-49 50-59 60plus
     1    37    53    56    36     52
     2    43    53    58    37     45
     3    83    88    93    54     45
     4   376   403   423   224    210
     5   312   317   270   150    106
     6   326   307   254   123     86
     7   337   300   252   142     93
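
When a table is typed in by hand like this, it can be worth checking the margins against the published totals before analyzing it. The following self-contained sketch rebuilds the same table under the hypothetical name persapp_check and uses base R's addmargins() to append row and column sums.

```r
# Rebuild the hand-entered table (same counts as above)
persapp_check <- as.table(rbind(
  c(37, 53, 56, 36, 52),
  c(43, 53, 58, 37, 45),
  c(83, 88, 93, 54, 45),
  c(376, 403, 423, 224, 210),
  c(312, 317, 270, 150, 106),
  c(326, 307, 254, 123, 86),
  c(337, 300, 252, 142, 93)))

dimnames(persapp_check) <- list(
  appear = c("1", "2", "3", "4", "5", "6", "7"),
  age = c("20-29", "30-39", "40-49", "50-59", "60plus"))

# Append a "Sum" row and column to compare with the Total margins
addmargins(persapp_check)

# Grand total, which should match the number of consumers surveyed
sum(persapp_check)
```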

Now, let’s look at the results from a \(\chi^2\) test of independence of the rows and columns from this contingency table.

out2 <- chisq.test(persapp)

out2

    Pearson's Chi-squared test

data:  persapp
X-squared = 120.83, df = 24, p-value = 6.914e-15
out2$observed
      age
appear 20-29 30-39 40-49 50-59 60plus
     1    37    53    56    36     52
     2    43    53    58    37     45
     3    83    88    93    54     45
     4   376   403   423   224    210
     5   312   317   270   150    106
     6   326   307   254   123     86
     7   337   300   252   142     93
out2$expected
      age
appear     20-29     30-39     40-49     50-59    60plus
     1  60.62218  60.90246  56.29774  30.67146  25.50616
     2  61.14031  61.42300  56.77892  30.93361  25.72416
     3  94.04209  94.47690  87.33368  47.58008  39.56725
     4 423.83710 425.79671 393.60301 214.43806 178.32512
     5 299.22485 300.60832 277.87988 151.39117 125.89579
     6 283.93977 285.25257 263.68515 143.65777 119.46475
     7 291.19370 292.54004 270.42163 147.32786 122.51677

Again, the chi-square test result has an extremely small p value, suggesting evidence favoring some association between the importance of personal appearance and age group.
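
The test alone does not say where the association comes from. One way to explore that, offered here as a sketch and not as part of the original analysis, is to look at the Pearson residuals, (observed - expected) / sqrt(expected), which chisq.test() returns in its residuals element. The object names tab, out and pearson_res are illustrative.

```r
# Rebuild the hand-entered table (same counts as above)
tab <- as.table(rbind(
  c(37, 53, 56, 36, 52),
  c(43, 53, 58, 37, 45),
  c(83, 88, 93, 54, 45),
  c(376, 403, 423, 224, 210),
  c(312, 317, 270, 150, 106),
  c(326, 307, 254, 123, 86),
  c(337, 300, 252, 142, 93)))

dimnames(tab) <- list(
  appear = c("1", "2", "3", "4", "5", "6", "7"),
  age = c("20-29", "30-39", "40-49", "50-59", "60plus"))

out <- chisq.test(tab)

# Degrees of freedom: (7 - 1) * (5 - 1)
out$parameter

# Pearson residuals: positive where a cell has more people than
# independence would predict, negative where it has fewer
pearson_res <- out$residuals
round(pearson_res, 2)
```

Cells with large absolute residuals contribute most to the statistic, since the squared residuals sum to the chi-square value.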