8  Summarizing Categories

8.1 Setup: Packages Used Here
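
The package-loading code for this section is not shown here. Judging only from the functions called later in the chapter (read_rds, select and filter, tabyl and the adorn functions, kbl and kable_styling, gt and tab_header), a setup along the following lines should be sufficient. Treat it as an assumption, not as the original setup. Hmisc is called below with the Hmisc:: prefix, so it needs to be installed but not loaded.

```r
# Assumed setup, inferred from the functions used in this chapter
library(janitor)     # tabyl() and the adorn_*() functions
library(kableExtra)  # kbl() and kable_styling()
library(gt)          # gt(), tab_header() and md()
library(tidyverse)   # read_rds() from readr; select(), filter() from dplyr
```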

8.2 Using the nh_adult750 data again

To demonstrate key ideas in this chapter, we will again consider our sample of 750 adults ages 21-64 from NHANES 2011-12, which includes some missing values. We’ll load the information from the nh_adult750.Rds file we created in Section @ref(newNHANES) into the nh_750 data frame.

nh_750 <- read_rds("data/nh_adult750.Rds")

Summarizing categorical variables numerically is mostly about building tables, and calculating percentages or proportions. We’ll save our discussion of modeling categorical data for later. Recall that in the nh_750 data set we built in Section @ref(newNHANES) we had the following categorical variables. The number of levels indicates the number of possible categories for each categorical variable.

Variable      Description                  Levels  Type
Sex           sex of subject               2       binary
Race          subject’s race               6       nominal
Education     subject’s educational level  5       ordinal
PhysActive    Participates in sports?      2       binary
Smoke100      Smoked 100+ cigarettes?      2       binary
SleepTrouble  Trouble sleeping?            2       binary
HealthGen     Self-report health           5       ordinal

8.3 The summary function for Categorical data

When R recognizes a variable as categorical, it stores it as a factor. Such variables get special treatment from the summary function: instead of numerical summaries, it shows a table of counts for the available values (so long as there aren’t too many).

nh_750 |>
  select(Sex, Race, Education, PhysActive, Smoke100, 
         SleepTrouble, HealthGen, MaritalStatus) |>
  summary()
     Sex            Race              Education   PhysActive Smoke100 
 female:388   Asian   : 70   8th Grade     : 50   No :326    No :453  
 male  :362   Black   :128   9 - 11th Grade: 76   Yes:424    Yes:297  
              Hispanic: 63   High School   :143                       
              Mexican : 80   Some College  :241                       
              White   :393   College Grad  :240                       
              Other   : 16                                            
 SleepTrouble     HealthGen        MaritalStatus
 No :555      Excellent: 84   Divorced    : 78  
 Yes:195      Vgood    :197   LivePartner : 70  
              Good     :252   Married     :388  
              Fair     :104   NeverMarried:179  
              Poor     : 14   Separated   : 19  
              NA's     : 99   Widowed     : 16  

8.4 Tables to describe One Categorical Variable

Suppose we build a table (using the tabyl function from the janitor package) to describe the HealthGen distribution.

nh_750 |>
    tabyl(HealthGen) |>
    adorn_pct_formatting()
 HealthGen   n percent valid_percent
 Excellent  84   11.2%         12.9%
     Vgood 197   26.3%         30.3%
      Good 252   33.6%         38.7%
      Fair 104   13.9%         16.0%
      Poor  14    1.9%          2.2%
      <NA>  99   13.2%             -

Note how the missing (<NA>) values are not included in the valid_percent calculation, but are in the percent calculation. Note also the use of percentage formatting.
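
To make the difference between the two columns concrete, here is a small base R sketch that redoes the arithmetic by hand. The counts are copied from the tabyl output above; the object names (counts, n_missing, and so on) are ours, chosen for illustration.

```r
# Counts copied from the tabyl output above
counts <- c(Excellent = 84, Vgood = 197, Good = 252, Fair = 104, Poor = 14)
n_missing <- 99

n_valid <- sum(counts)             # subjects with a non-missing HealthGen
n_all   <- n_valid + n_missing     # everyone in the sample

# percent uses all subjects in the denominator,
# valid_percent uses only the subjects with non-missing values
percent       <- round(100 * counts / n_all, 1)
valid_percent <- round(100 * counts / n_valid, 1)

rbind(percent, valid_percent)
```

The only thing that changes between the two rows is the denominator: 750 for percent and 651 for valid_percent.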

What if we want to add a total count, sometimes called the marginal total?

nh_750 |>
    tabyl(HealthGen) |>
    adorn_totals() |>
    adorn_pct_formatting()
 HealthGen   n percent valid_percent
 Excellent  84   11.2%         12.9%
     Vgood 197   26.3%         30.3%
      Good 252   33.6%         38.7%
      Fair 104   13.9%         16.0%
      Poor  14    1.9%          2.2%
      <NA>  99   13.2%             -
     Total 750  100.0%        100.0%

What about marital status, which has no missing data in our sample?

nh_750 |>
    tabyl(MaritalStatus) |>
    adorn_totals() |>
    adorn_pct_formatting()
 MaritalStatus   n percent
      Divorced  78   10.4%
   LivePartner  70    9.3%
       Married 388   51.7%
  NeverMarried 179   23.9%
     Separated  19    2.5%
       Widowed  16    2.1%
         Total 750  100.0%

8.5 Constructing Tables Well

The prolific Howard Wainer is responsible for many interesting books on visualization and related issues, including Wainer (2005) and Wainer (2013). The three rules below come from Chapter 10 of Wainer (1997).

  1. Order the rows and columns in a way that makes sense.
  2. Round, a lot!
  3. ALL is different and important

8.5.1 Alabama First!

Which of these tables is more useful to you?

2013 Percent of Students in grades 9-12 who are obese

State        % Obese  95% CI         Sample Size
Alabama      17.1     (14.6 - 19.9)  1,499
Alaska       12.4     (10.5 - 14.6)  1,167
Arizona      10.7     (8.3 - 13.6)   1,520
Arkansas     17.8     (15.7 - 20.1)  1,470
Connecticut  12.3     (10.2 - 14.7)  2,270
Delaware     14.2     (12.9 - 15.6)  2,475
Florida      11.6     (10.5 - 12.8)  5,491
Wisconsin    11.6     (9.7 - 13.9)   2,771
Wyoming      10.7     (9.4 - 12.2)   2,910

or …

State          % Obese  95% CI         Sample Size
Kentucky       18.0     (15.7 - 20.6)  1,537
Arkansas       17.8     (15.7 - 20.1)  1,470
Alabama        17.1     (14.6 - 19.9)  1,499
Tennessee      16.9     (15.1 - 18.8)  1,831
Texas          15.7     (13.9 - 17.6)  3,039
Massachusetts  10.2     (8.5 - 12.1)   2,547
Idaho          9.6      (8.2 - 11.1)   1,841
Montana        9.4      (8.4 - 10.5)   4,679
New Jersey     8.7      (6.8 - 11.2)   1,644
Utah           6.4      (4.8 - 8.5)    2,136

It is a rare event when Alabama first is the best choice.
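
As a minimal sketch of the first two rules, the base R code below takes four rows from the alphabetical table above, reorders them by the quantity of interest, and rounds. The data frame and column names (obesity, state, pct_obese) are ours, used only for illustration.

```r
# Four rows copied from the alphabetical table above
obesity <- data.frame(
  state     = c("Alabama", "Alaska", "Arizona", "Arkansas"),
  pct_obese = c(17.1, 12.4, 10.7, 17.8)
)

# Rule 1: order rows by the value we care about, not by the alphabet
by_value <- obesity[order(obesity$pct_obese, decreasing = TRUE), ]

# Rule 2: round, a lot (here, to whole percentages)
by_value$pct_obese <- round(by_value$pct_obese)

by_value
```

Sorting by the percentage puts the states with the highest rates at the top, which is usually the comparison a reader wants to make first.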

8.5.2 ALL is different and important

Summaries of rows and columns provide a measure of what is typical or usual. Sometimes a sum is helpful; at other times, consider presenting a median or other summary. The ALL category, as Wainer (1997) suggests, should be both visually different from the individual entries and set spatially apart.

On the whole, it’s far easier to fall into a good graph in R (at least if you have some ggplot2 skills) than to produce a good table.

8.6 The Mode of a Categorical Variable

A common summary of a categorical variable is its mode, the most frequently observed value. To find the mode for variables with lots of categories (so that the summary may not be sufficient), we usually tabulate the data and then sort by the counts, as we did with discrete quantitative variables.

nh_750 |>
    group_by(HealthGen) |>
    summarise(count = n()) |>
    arrange(desc(count)) 
# A tibble: 6 × 2
  HealthGen count
  <fct>     <int>
1 Good        252
2 Vgood       197
3 Fair        104
4 <NA>         99
5 Excellent    84
6 Poor         14
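
The same idea can be sketched in base R, without dplyr. Since the nh_750 data are not reproduced here, the code below rebuilds a HealthGen-like factor from the counts shown above; the names health_levels, health and mode_level are ours.

```r
# Rebuild a factor with the same counts as HealthGen (99 values missing)
health_levels <- c("Excellent", "Vgood", "Good", "Fair", "Poor")
health <- factor(
  rep(c(health_levels, NA), times = c(84, 197, 252, 104, 14, 99)),
  levels = health_levels
)

# table() drops missing values by default, so the mode is
# taken over the observed categories only
counts <- table(health)
mode_level <- names(counts)[which.max(counts)]
mode_level

# Sorted counts, keeping the missing values as their own category
counts_with_na <- sort(table(health, useNA = "ifany"), decreasing = TRUE)
counts_with_na
```

Note that which.max returns only the first maximum, so if two categories are tied for the largest count, this sketch reports just one of them.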

8.7 describe in the Hmisc package

Hmisc::describe(nh_750 |> 
                    select(Sex, Race, Education, PhysActive, 
                           Smoke100, SleepTrouble, 
                           HealthGen, MaritalStatus))
select(nh_750, Sex, Race, Education, PhysActive, Smoke100, SleepTrouble, HealthGen, MaritalStatus) 

 8  Variables      750  Observations
--------------------------------------------------------------------------------
Sex 
       n  missing distinct 
     750        0        2 
                        
Value      female   male
Frequency     388    362
Proportion  0.517  0.483
--------------------------------------------------------------------------------
Race 
       n  missing distinct 
     750        0        6 
                                                                
Value         Asian    Black Hispanic  Mexican    White    Other
Frequency        70      128       63       80      393       16
Proportion    0.093    0.171    0.084    0.107    0.524    0.021
--------------------------------------------------------------------------------
Education 
       n  missing distinct 
     750        0        5 
                                                                      
Value           8th Grade 9 - 11th Grade    High School   Some College
Frequency              50             76            143            241
Proportion          0.067          0.101          0.191          0.321
                         
Value        College Grad
Frequency             240
Proportion          0.320
--------------------------------------------------------------------------------
PhysActive 
       n  missing distinct 
     750        0        2 
                      
Value         No   Yes
Frequency    326   424
Proportion 0.435 0.565
--------------------------------------------------------------------------------
Smoke100 
       n  missing distinct 
     750        0        2 
                      
Value         No   Yes
Frequency    453   297
Proportion 0.604 0.396
--------------------------------------------------------------------------------
SleepTrouble 
       n  missing distinct 
     750        0        2 
                    
Value        No  Yes
Frequency   555  195
Proportion 0.74 0.26
--------------------------------------------------------------------------------
HealthGen 
       n  missing distinct 
     651       99        5 
                                                            
Value      Excellent     Vgood      Good      Fair      Poor
Frequency         84       197       252       104        14
Proportion     0.129     0.303     0.387     0.160     0.022
--------------------------------------------------------------------------------
MaritalStatus 
       n  missing distinct 
     750        0        6 
                                                                           
Value          Divorced  LivePartner      Married NeverMarried    Separated
Frequency            78           70          388          179           19
Proportion        0.104        0.093        0.517        0.239        0.025
                       
Value           Widowed
Frequency            16
Proportion        0.021
--------------------------------------------------------------------------------

8.8 Cross-Tabulations of Two Variables

It is very common for us to want to describe the association of one categorical variable with another. For instance, is there a relationship between Education and SleepTrouble in these data?

nh_750 |>
    tabyl(Education, SleepTrouble) |>
    adorn_totals(where = c("row", "col")) 
      Education  No Yes Total
      8th Grade  40  10    50
 9 - 11th Grade  52  24    76
    High School 102  41   143
   Some College 173  68   241
   College Grad 188  52   240
          Total 555 195   750

Note the use of adorn_totals to get the marginal counts, and how we specify that we want both the row and column totals. We can add a title for the columns with…

nh_750 |>
    tabyl(Education, SleepTrouble) |>
    adorn_totals(where = c("row", "col")) |>
    adorn_title(placement = "combined")
 Education/SleepTrouble  No Yes Total
              8th Grade  40  10    50
         9 - 11th Grade  52  24    76
            High School 102  41   143
           Some College 173  68   241
           College Grad 188  52   240
                  Total 555 195   750

Often, we’ll want to show percentages in a cross-tabulation like this. To get row percentages, so that we can directly see the proportion of subjects with SleepTrouble = Yes at each level of Education, we can use:

nh_750 |>
    tabyl(Education, SleepTrouble) |>
    adorn_totals(where = "row") |>
    adorn_percentages(denominator = "row") |>
    adorn_pct_formatting() |>
    adorn_title(placement = "combined")
 Education/SleepTrouble    No   Yes
              8th Grade 80.0% 20.0%
         9 - 11th Grade 68.4% 31.6%
            High School 71.3% 28.7%
           Some College 71.8% 28.2%
           College Grad 78.3% 21.7%
                  Total 74.0% 26.0%

If we want to compare the distribution of Education between the two levels of SleepTrouble with column percentages, we can use the following…

nh_750 |>
    tabyl(Education, SleepTrouble) |>
    adorn_totals(where = "col") |>
    adorn_percentages(denominator = "col") |>
    adorn_pct_formatting() |>
    adorn_title(placement = "combined") 
 Education/SleepTrouble    No   Yes Total
              8th Grade  7.2%  5.1%  6.7%
         9 - 11th Grade  9.4% 12.3% 10.1%
            High School 18.4% 21.0% 19.1%
           Some College 31.2% 34.9% 32.1%
           College Grad 33.9% 26.7% 32.0%

If we want overall percentages in the cells of the table, so that the total across all combinations of Education and SleepTrouble is 100%, we can use:

nh_750 |>
    tabyl(Education, SleepTrouble) |>
    adorn_totals(where = c("row", "col")) |>
    adorn_percentages(denominator = "all") |>
    adorn_pct_formatting() |>
    adorn_title(placement = "combined") |>
    kbl(align = 'lrrr') 
Education/SleepTrouble No Yes Total
8th Grade 5.3% 1.3% 6.7%
9 - 11th Grade 6.9% 3.2% 10.1%
High School 13.6% 5.5% 19.1%
Some College 23.1% 9.1% 32.1%
College Grad 25.1% 6.9% 32.0%
Total 74.0% 26.0% 100.0%
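
The three choices of denominator ("row", "col" and "all") can also be checked by hand. Here is a base R sketch using prop.table on the Education by SleepTrouble counts shown earlier; the matrix is typed in from that output, and the object names are ours.

```r
# Counts copied from the Education by SleepTrouble cross-tabulation
counts <- matrix(
  c(40, 52, 102, 173, 188,    # SleepTrouble = No
    10, 24,  41,  68,  52),   # SleepTrouble = Yes
  ncol = 2,
  dimnames = list(
    Education = c("8th Grade", "9 - 11th Grade", "High School",
                  "Some College", "College Grad"),
    SleepTrouble = c("No", "Yes")
  )
)

# Row percentages: each row sums to 100
row_pct <- round(100 * prop.table(counts, margin = 1), 1)

# Column percentages: each column sums to 100
col_pct <- round(100 * prop.table(counts, margin = 2), 1)

# Overall percentages: all ten cells together sum to 100
all_pct <- round(100 * prop.table(counts), 1)

row_pct
col_pct
all_pct
```

The margin argument plays the role of the denominator argument in adorn_percentages: 1 for rows, 2 for columns, and no margin at all for the overall total.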

Another common approach is to include both counts and percentages in a cross-tabulation. Let’s look at the breakdown of HealthGen by MaritalStatus.

nh_750 |>
    tabyl(MaritalStatus, HealthGen) |>
    adorn_totals(where = c("row")) |>
    adorn_percentages(denominator = "row") |>
    adorn_pct_formatting() |>
    adorn_ns(position = "front") |>
    adorn_title(placement = "combined") |>
    kbl(align = 'lrrrrrr') |>
    kable_styling(full_width = FALSE)
MaritalStatus/HealthGen Excellent Vgood Good Fair Poor NA_
Divorced 7 (9.0%) 19 (24.4%) 29 (37.2%) 11 (14.1%) 3 (3.8%) 9 (11.5%)
LivePartner 4 (5.7%) 19 (27.1%) 25 (35.7%) 18 (25.7%) 0 (0.0%) 4 (5.7%)
Married 46 (11.9%) 101 (26.0%) 130 (33.5%) 41 (10.6%) 6 (1.5%) 64 (16.5%)
NeverMarried 25 (14.0%) 52 (29.1%) 56 (31.3%) 24 (13.4%) 3 (1.7%) 19 (10.6%)
Separated 2 (10.5%) 3 (15.8%) 4 (21.1%) 8 (42.1%) 0 (0.0%) 2 (10.5%)
Widowed 0 (0.0%) 3 (18.8%) 8 (50.0%) 2 (12.5%) 2 (12.5%) 1 (6.2%)
Total 84 (11.2%) 197 (26.3%) 252 (33.6%) 104 (13.9%) 14 (1.9%) 99 (13.2%)

What if we wanted to ignore the missing HealthGen values? Most often, I filter down to the complete observations.

nh_750 |>
    filter(complete.cases(MaritalStatus, HealthGen)) |>
    tabyl(MaritalStatus, HealthGen) |>
    adorn_totals(where = c("row")) |>
    adorn_percentages(denominator = "row") |>
    adorn_pct_formatting() |>
    adorn_ns(position = "front") |>
    adorn_title(placement = "combined") |>
    kbl(align = 'lrrrrr') |>
    kable_styling(full_width = FALSE)
MaritalStatus/HealthGen Excellent Vgood Good Fair Poor
Divorced 7 (10.1%) 19 (27.5%) 29 (42.0%) 11 (15.9%) 3 (4.3%)
LivePartner 4 (6.1%) 19 (28.8%) 25 (37.9%) 18 (27.3%) 0 (0.0%)
Married 46 (14.2%) 101 (31.2%) 130 (40.1%) 41 (12.7%) 6 (1.9%)
NeverMarried 25 (15.6%) 52 (32.5%) 56 (35.0%) 24 (15.0%) 3 (1.9%)
Separated 2 (11.8%) 3 (17.6%) 4 (23.5%) 8 (47.1%) 0 (0.0%)
Widowed 0 (0.0%) 3 (20.0%) 8 (53.3%) 2 (13.3%) 2 (13.3%)
Total 84 (12.9%) 197 (30.3%) 252 (38.7%) 104 (16.0%) 14 (2.2%)
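
To see what the complete.cases filter is doing, here is a small base R sketch on a made-up data frame. The toy data and its column names are hypothetical and are not drawn from nh_750.

```r
# Hypothetical toy data: two of the four rows are missing health
toy <- data.frame(
  marital = c("Married", "Divorced", "Married", "Widowed"),
  health  = c("Good", NA, "Fair", NA)
)

# complete.cases() returns TRUE for rows with no missing values
# in any of the variables we pass to it
keep <- complete.cases(toy$marital, toy$health)
keep

# Keep only the complete rows
toy_complete <- toy[keep, ]
toy_complete
```

Because the incomplete rows are removed before the table is built, the row percentages are then computed out of the subjects with known values only.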

For more on working with tabyls, see this overview of janitor functions. There you’ll find a complete list of all of the adorn functions, for example.

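Here’s another approach, using the xtabs function, to look at the cross-classification of Race and HealthGen. Note that xtabs leaves out observations with missing values by default, so the 99 subjects with missing HealthGen do not appear in this table, and the counts sum to 651.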

xtabs(~ Race + HealthGen, data = nh_750)
          HealthGen
Race       Excellent Vgood Good Fair Poor
  Asian           10    17   24    6    1
  Black           15    28   40   24    4
  Hispanic         4     9   24   13    2
  Mexican          6    12   25   21    2
  White           48   128  131   37    5
  Other            1     3    8    3    0
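
The counts in the xtabs output above sum to 651, not 750, because rows with a missing HealthGen are left out. If we want the missing values counted, one option is the addNA argument of xtabs (available in R 3.4.0 and later). The sketch below uses a made-up data frame with hypothetical names, not nh_750.

```r
# Hypothetical toy data: two of the five health values are missing
toy <- data.frame(
  race   = factor(c("Asian", "Black", "Asian", "White", "Black")),
  health = factor(c("Good", NA, "Fair", "Good", NA))
)

# Default behavior: rows with missing values are dropped
default_tab <- xtabs(~ race + health, data = toy)
default_tab

# addNA = TRUE counts missing values in their own category
with_na_tab <- xtabs(~ race + health, data = toy, addNA = TRUE)
with_na_tab
```

With addNA = TRUE, the second table gains an extra column for the missing values, and its counts add up to the full number of rows.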

8.9 Cross-Classifying Three Categorical Variables

Suppose we are interested in Smoke100 and its relationship to PhysActive and SleepTrouble.

nh_750 |>
    tabyl(Smoke100, PhysActive, SleepTrouble) |>
    adorn_title(placement = "top")
$No
          PhysActive    
 Smoke100         No Yes
       No        137 219
      Yes         93 106

$Yes
          PhysActive    
 Smoke100         No Yes
       No         41  56
      Yes         55  43

The result here is a tabyl of Smoke100 (rows) by PhysActive (columns), split into a list by SleepTrouble.

There are several alternative approaches for doing this, although I expect us to stick with tabyl for our work in 431. These alternatives include the use of the xtabs function:

xtabs(~ Smoke100 + PhysActive + SleepTrouble, data = nh_750)
, , SleepTrouble = No

        PhysActive
Smoke100  No Yes
     No  137 219
     Yes  93 106

, , SleepTrouble = Yes

        PhysActive
Smoke100  No Yes
     No   41  56
     Yes  55  43

We can also build a flat version of this table, as follows:

ftable(Smoke100 ~ PhysActive + SleepTrouble, data = nh_750)
                        Smoke100  No Yes
PhysActive SleepTrouble                 
No         No                    137  93
           Yes                    41  55
Yes        No                    219 106
           Yes                    56  43

We can also do this by selecting the three variables with dplyr and passing them to the base table() function, for example…

nh_750 |>
    select(Smoke100, PhysActive, SleepTrouble) |>
    table() 
, , SleepTrouble = No

        PhysActive
Smoke100  No Yes
     No  137 219
     Yes  93 106

, , SleepTrouble = Yes

        PhysActive
Smoke100  No Yes
     No   41  56
     Yes  55  43
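
Each of these approaches stores the same eight counts in a three-dimensional structure. The base R sketch below rebuilds that structure from the counts shown above (the object names are ours), then pulls out one layer, collapses over SleepTrouble, and flattens the result.

```r
# Counts copied from the output above; the first dimension
# (Smoke100) varies fastest, then PhysActive, then SleepTrouble
three_way <- as.table(array(
  c(137, 93, 219, 106,    # SleepTrouble = No
     41, 55,  56,  43),   # SleepTrouble = Yes
  dim = c(2, 2, 2),
  dimnames = list(
    Smoke100     = c("No", "Yes"),
    PhysActive   = c("No", "Yes"),
    SleepTrouble = c("No", "Yes")
  )
))

# One layer: Smoke100 by PhysActive among those with SleepTrouble = Yes
three_way[, , "Yes"]

# Collapse over SleepTrouble to get the two-way Smoke100 by PhysActive table
collapsed <- margin.table(three_way, margin = c(1, 2))
collapsed

# Flat layout with PhysActive and SleepTrouble defining the rows
ftable(three_way, row.vars = c("PhysActive", "SleepTrouble"))
```

Collapsing with margin.table adds the two layers together, so the resulting two-way table again accounts for all 750 subjects.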

8.10 Gaining Control over Tables in R: the gt package

With the gt package, anyone can make wonderful-looking tables using the R programming language. The gt package lets you start with a tibble or data frame and build detailed, professional-looking tables from it, with tools for adding titles and subtitles, all sorts of labels, and footnotes and source notes.

Here’s a fairly simple example of a cross-tabulation of part of the nh_750 data built using a few tools from the gt package.

temp_tbl <- nh_750 |> 
  filter(complete.cases(PhysActive, HealthGen)) |>
  tabyl(PhysActive, HealthGen) |>
  tibble() 

gt(temp_tbl) |>
  tab_header(title = md("**Cross-Tabulation from nh_750**"),
             subtitle = md("Physical Activity vs. Overall Health"))
Cross-Tabulation from nh_750
Physical Activity vs. Overall Health
PhysActive Excellent Vgood Good Fair Poor
No 24 66 126 59 10
Yes 60 131 126 45 4

The gt package and its usage are described in detail at https://gt.rstudio.com/.

8.11 Coming Up

Next, we’ll make some early attempts at describing missingness in our data.