Chapter 6 Summarizing Categorical Variables

Summarizing categorical variables numerically is mostly about building tables and calculating percentages or proportions. We’ll save our discussion of modeling categorical data for later. Recall that the nh_adults data set we built in Section @ref(createnh_adults) includes the following categorical variables. The number of levels is the number of possible categories for each variable.

Variable      Description                  Levels  Type
Sex           sex of subject               2       binary
Race          subject’s race               6       nominal
Education     subject’s educational level  5       ordinal
PhysActive    Participates in sports?      2       binary
Smoke100      Smoked 100+ cigarettes?      2       binary
SleepTrouble  Trouble sleeping?            2       binary
HealthGen     Self-report health           5       ordinal

6.1 The summary function for Categorical data

When R recognizes a variable as categorical, it stores it as a factor. Such variables get special treatment from the summary function: in particular, it prints a table of the available values and their counts (so long as there aren’t too many).

nh_adults %>%
  select(Sex, Race, Education, PhysActive, Smoke100, SleepTrouble, HealthGen) %>%
  summary()
     Sex            Race              Education   PhysActive Smoke100 
 female:253   Asian   : 29   8th Grade     : 24   No :225    No :289  
 male  :247   Black   : 57   9 - 11th Grade: 57   Yes:275    Yes:211  
              Hispanic: 39   High School   : 81                       
              Mexican : 43   Some College  :153                       
              White   :322   College Grad  :185                       
              Other   : 10                                            
 SleepTrouble     HealthGen  
 No :362      Excellent: 51  
 Yes:138      Vgood    :153  
              Good     :172  
              Fair     : 71  
              Poor     :  7  
              NA's     : 46  
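To see why the factor storage matters, here is a minimal sketch using a small made-up data frame (the toy object and its column names are hypothetical and are not part of nh_adults). The same values stored as plain character strings get only a generic summary, while the factor version gets a count for each level.

```r
# Hypothetical toy data, used only so this example runs on its own
toy <- data.frame(
  smoker_chr = c("No", "Yes", "No", "No", NA),
  stringsAsFactors = FALSE
)

# Store the same values as a factor with explicit levels
toy$smoker_fct <- factor(toy$smoker_chr, levels = c("No", "Yes"))

summary(toy$smoker_chr)  # character: reports only length, class and mode

fct_counts <- summary(toy$smoker_fct)  # factor: one count per level, plus NA's
fct_counts
```

If a categorical variable shows up in summary output without counts, converting it with factor() is usually the first thing to check.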

6.2 Tables to describe One Categorical Variable

Suppose we build a table to describe the HealthGen distribution.

nh_adults %>%
    select(HealthGen) %>%
    table(., useNA = "ifany")
.
Excellent     Vgood      Good      Fair      Poor      <NA> 
       51       153       172        71         7        46 

The main tools we have for augmenting tables are:

  • adding in marginal totals, and
  • working with proportions/percentages.

What if we want to add a total count?

nh_adults %>%
    select(HealthGen) %>%
    table(., useNA = "ifany") %>%
    addmargins()
.
Excellent     Vgood      Good      Fair      Poor      <NA>       Sum 
       51       153       172        71         7        46       500 

What if we want to leave out the missing responses?

nh_adults %>%
    select(HealthGen) %>%
    table(., useNA = "no") %>%
    addmargins()
.
Excellent     Vgood      Good      Fair      Poor       Sum 
       51       153       172        71         7       454 

Let’s put the missing values back in, but now calculate proportions instead. Since the total will just be 1.0, we’ll leave that out.

nh_adults %>%
    select(HealthGen) %>%
    table(., useNA = "ifany") %>%
    prop.table()
.
Excellent     Vgood      Good      Fair      Poor      <NA> 
    0.102     0.306     0.344     0.142     0.014     0.092 

Now, we’ll calculate percentages by multiplying the proportions by 100.

nh_adults %>%
    select(HealthGen) %>%
    table(., useNA = "ifany") %>%
    prop.table() %>%
    "*"(100) 
.
Excellent     Vgood      Good      Fair      Poor      <NA> 
     10.2      30.6      34.4      14.2       1.4       9.2 
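The steps above can also be combined so that counts and percentages sit side by side. This is a sketch in base R on a small made-up factor (the health vector is hypothetical, chosen only so the code runs without nh_adults), and rounding to one decimal place is our choice here.

```r
# Hypothetical responses; levels follow the HealthGen ordering
health <- factor(
  c("Good", "Vgood", "Good", "Fair", NA, "Good", "Vgood", "Poor"),
  levels = c("Excellent", "Vgood", "Good", "Fair", "Poor")
)

# Counts, keeping missing values as their own category
counts <- table(health, useNA = "ifany")

# Percentages of all responses, rounded to one decimal place
pct <- round(100 * prop.table(counts), 1)

# One row per category, with count and percent as columns
cbind(count = counts, percent = pct)
```

Note that a level with no observations (Excellent, in this toy example) still appears with a count of zero, because it is a declared level of the factor.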

6.3 The Mode of a Categorical Variable

A common summary of a categorical variable is its mode, the most frequently observed value. For variables with many categories (where the summary output may not be sufficient), we usually tabulate the data and then sort by the counts, as we did with discrete quantitative variables.

nh_adults %>%
    group_by(HealthGen) %>%
    summarise(count = n()) %>%
    arrange(desc(count)) 
# A tibble: 6 x 2
  HealthGen count
     <fctr> <int>
1      Good   172
2     Vgood   153
3      Fair    71
4 Excellent    51
5      <NA>    46
6      Poor     7
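Sorting shows the mode at the top, but it does not flag ties. As one possible alternative, here is a small base R helper (find_modes is a hypothetical name of our own, not a built-in function) that returns every value tied for the highest count, ignoring missing values.

```r
# Return every value tied for the highest count (NA is ignored)
find_modes <- function(x) {
  counts <- table(x)                        # table() drops NA by default
  names(counts)[counts == max(counts)]      # keep all values at the maximum
}

# Hypothetical values for illustration
blood <- c("O", "A", "O", "B", "A", NA, "AB")
find_modes(blood)
```

Here both A and O appear twice, so the helper returns two values rather than silently picking one.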

6.4 describe in the Hmisc package

Hmisc::describe(nh_adults %>% 
                    select(Sex, Race, Education, PhysActive, 
                           Smoke100, SleepTrouble, HealthGen))
nh_adults %>% select(Sex, Race, Education, PhysActive, Smoke100, SleepTrouble, HealthGen) 

 7  Variables      500  Observations
---------------------------------------------------------------------------
Sex 
       n  missing distinct 
     500        0        2 
                        
Value      female   male
Frequency     253    247
Proportion  0.506  0.494
---------------------------------------------------------------------------
Race 
       n  missing distinct 
     500        0        6 
                                                                
Value         Asian    Black Hispanic  Mexican    White    Other
Frequency        29       57       39       43      322       10
Proportion    0.058    0.114    0.078    0.086    0.644    0.020
---------------------------------------------------------------------------
Education 
       n  missing distinct 
     500        0        5 
                                                                      
Value           8th Grade 9 - 11th Grade    High School   Some College
Frequency              24             57             81            153
Proportion          0.048          0.114          0.162          0.306
                         
Value        College Grad
Frequency             185
Proportion          0.370
---------------------------------------------------------------------------
PhysActive 
       n  missing distinct 
     500        0        2 
                    
Value        No  Yes
Frequency   225  275
Proportion 0.45 0.55
---------------------------------------------------------------------------
Smoke100 
       n  missing distinct 
     500        0        2 
                      
Value         No   Yes
Frequency    289   211
Proportion 0.578 0.422
---------------------------------------------------------------------------
SleepTrouble 
       n  missing distinct 
     500        0        2 
                      
Value         No   Yes
Frequency    362   138
Proportion 0.724 0.276
---------------------------------------------------------------------------
HealthGen 
       n  missing distinct 
     454       46        5 
                                                            
Value      Excellent     Vgood      Good      Fair      Poor
Frequency         51       153       172        71         7
Proportion     0.112     0.337     0.379     0.156     0.015
---------------------------------------------------------------------------
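Notice that the HealthGen proportions here differ from those in our earlier prop.table output. The describe output divides by the 454 non-missing responses, while the table built with useNA = "ifany" divided by all 500. A short sketch, with the counts copied from the output above, makes the two denominators explicit.

```r
# Counts copied from the HealthGen tables above
counts <- c(Excellent = 51, Vgood = 153, Good = 172, Fair = 71, Poor = 7)
n_missing <- 46

# Proportions among those who answered (denominator 454), as in describe
prop_answered <- round(counts / sum(counts), 3)

# Proportions among all subjects (denominator 500), as in the earlier table
prop_all <- round(counts / (sum(counts) + n_missing), 3)

rbind(prop_answered, prop_all)
```

Whichever denominator we report, it helps to say so explicitly in the table caption.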

6.5 Cross-Tabulations

It is very common for us to want to describe the association of one categorical variable with another. For instance, is there a relationship between Education and SleepTrouble in these data?

nh_adults %>%
    select(Education, SleepTrouble) %>%
    table() %>%
    addmargins()
                SleepTrouble
Education         No Yes Sum
  8th Grade       15   9  24
  9 - 11th Grade  40  17  57
  High School     67  14  81
  Some College   107  46 153
  College Grad   133  52 185
  Sum            362 138 500

To get row percentages, we can use:

nh_adults %>%
    select(Education, SleepTrouble) %>%
    table() %>%
    prop.table(., 1) %>%
    "*"(100) 
                SleepTrouble
Education              No      Yes
  8th Grade      62.50000 37.50000
  9 - 11th Grade 70.17544 29.82456
  High School    82.71605 17.28395
  Some College   69.93464 30.06536
  College Grad   71.89189 28.10811

For column percentages, we use 2 instead of 1 in the prop.table function. Here, we’ll also round off to two decimal places:

nh_adults %>%
    select(Education, SleepTrouble) %>%
    table() %>%
    prop.table(., 2) %>%
    "*"(100) %>%
    round(.,2) 
                SleepTrouble
Education           No   Yes
  8th Grade       4.14  6.52
  9 - 11th Grade 11.05 12.32
  High School    18.51 10.14
  Some College   29.56 33.33
  College Grad   36.74 37.68
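It is easy to mix up the two margins, so a quick check is worthwhile: row percentages should sum to 100 across each row, and column percentages should sum to 100 down each column. The sketch below rebuilds the Education by SleepTrouble counts as a plain matrix (copied from the table above, so it runs without nh_adults) and checks both.

```r
# Counts copied from the cross-tabulation above, entered row by row
edu_sleep <- matrix(
  c(15, 9,
    40, 17,
    67, 14,
    107, 46,
    133, 52),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Education = c("8th Grade", "9 - 11th Grade", "High School",
                  "Some College", "College Grad"),
    SleepTrouble = c("No", "Yes")
  )
)

# margin = 1 conditions on rows; margin = 2 conditions on columns
row_pct <- 100 * prop.table(edu_sleep, 1)
col_pct <- 100 * prop.table(edu_sleep, 2)

rowSums(row_pct)  # each Education level sums to 100
colSums(col_pct)  # each SleepTrouble category sums to 100
```

The row percentages answer "among people with this education level, what share report trouble sleeping?", while the column percentages answer "among people with (or without) trouble sleeping, what share have this education level?".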

Here’s another approach, to look at the cross-classification of Race and HealthGen:

xtabs(~ Race + HealthGen, data = nh_adults)
          HealthGen
Race       Excellent Vgood Good Fair Poor
  Asian            4     7    9    2    1
  Black            7    11   16   11    2
  Hispanic         1     9   18    8    0
  Mexican          5     6   12   16    1
  White           34   115  115   32    3
  Other            0     5    2    2    0

6.5.1 Cross-Classifying Three Categorical Variables

Suppose we are interested in Smoke100 and its relationship to PhysActive and SleepTrouble.

xtabs(~ Smoke100 + PhysActive + SleepTrouble, data = nh_adults)
, , SleepTrouble = No

        PhysActive
Smoke100  No Yes
     No   99 135
     Yes  62  66

, , SleepTrouble = Yes

        PhysActive
Smoke100  No Yes
     No   26  29
     Yes  38  45

We can also build a flat version of this table, as follows:

ftable(Smoke100 ~ PhysActive + SleepTrouble, data = nh_adults)
                        Smoke100  No Yes
PhysActive SleepTrouble                 
No         No                     99  62
           Yes                    26  38
Yes        No                    135  66
           Yes                    29  45

We can also produce the same three-way table with a dplyr pipeline that ends in table():

nh_adults %>%
    select(Smoke100, PhysActive, SleepTrouble) %>%
    table() 
, , SleepTrouble = No

        PhysActive
Smoke100  No Yes
     No   99 135
     Yes  62  66

, , SleepTrouble = Yes

        PhysActive
Smoke100  No Yes
     No   26  29
     Yes  38  45
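A three-way table can be collapsed back to a two-way table by summing over one of its dimensions. Here is a sketch using the base R function margin.table, with the counts above typed into an array by hand (the object names are our own) so that it runs without nh_adults.

```r
# Counts copied from the three-way table above.
# array() fills the first dimension (Smoke100) fastest, then PhysActive,
# then SleepTrouble.
counts <- array(
  c(99, 62, 135, 66,    # SleepTrouble = No
    26, 38, 29, 45),    # SleepTrouble = Yes
  dim = c(2, 2, 2),
  dimnames = list(
    Smoke100 = c("No", "Yes"),
    PhysActive = c("No", "Yes"),
    SleepTrouble = c("No", "Yes")
  )
)
three_way <- as.table(counts)

# Keep dimensions 1 and 3 (Smoke100 and SleepTrouble), summing over PhysActive
two_way <- margin.table(three_way, c(1, 3))
two_way
```

The second argument names the dimensions to keep, so c(1, 3) retains Smoke100 and SleepTrouble and adds up the counts across both PhysActive groups.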

6.6 Constructing Tables Well

The prolific Howard Wainer is responsible for many interesting books on visualization and related issues, including Wainer (2005) and Wainer (2013). These rules come from Chapter 10 of Wainer (1997).

  1. Order the rows and columns in a way that makes sense.
  2. Round, a lot!
  3. ALL is different and important.

6.6.1 Alabama First!

Which of these Tables is more useful to you?

2013 Percent of Students in grades 9-12 who are obese

State        % Obese  95% CI         Sample Size
Alabama      17.1     (14.6 - 19.9)  1,499
Alaska       12.4     (10.5 - 14.6)  1,167
Arizona      10.7     (8.3 - 13.6)   1,520
Arkansas     17.8     (15.7 - 20.1)  1,470
Connecticut  12.3     (10.2 - 14.7)  2,270
Delaware     14.2     (12.9 - 15.6)  2,475
Florida      11.6     (10.5 - 12.8)  5,491
Wisconsin    11.6     (9.7 - 13.9)   2,771
Wyoming      10.7     (9.4 - 12.2)   2,910

or …

State          % Obese  95% CI         Sample Size
Kentucky       18.0     (15.7 - 20.6)  1,537
Arkansas       17.8     (15.7 - 20.1)  1,470
Alabama        17.1     (14.6 - 19.9)  1,499
Tennessee      16.9     (15.1 - 18.8)  1,831
Texas          15.7     (13.9 - 17.6)  3,039
Massachusetts  10.2     (8.5 - 12.1)   2,547
Idaho          9.6      (8.2 - 11.1)   1,841
Montana        9.4      (8.4 - 10.5)   4,679
New Jersey     8.7      (6.8 - 11.2)   1,644
Utah           6.4      (4.8 - 8.5)    2,136

It is a rare event when Alabama first is the best choice.

6.6.2 Order rows and columns sensibly

  • Alabama First! (alphabetical order) is rarely the best choice.
    • Size places: put the largest first. We often look most carefully at the top.
  • Order time from the past to the future to help the viewer.
  • If there is a clear predictor-outcome relationship, put the predictors in the rows and the outcomes in the columns.
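As a sketch of the "size places" idea, here are the first five rows of the alphabetical obesity table above, typed into a small data frame (the object and column names are our own) and re-sorted from largest to smallest.

```r
# First five rows of the alphabetical table above
obesity <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "Connecticut"),
  pct_obese = c(17.1, 12.4, 10.7, 17.8, 12.3),
  stringsAsFactors = FALSE
)

# Reorder the rows so the largest percentage comes first
by_size <- obesity[order(obesity$pct_obese, decreasing = TRUE), ]
by_size
```

With dplyr loaded, arrange(obesity, desc(pct_obese)) would give the same ordering, in the style used earlier in this chapter.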

6.6.3 Round - a lot!

  • Humans cannot understand more than two digits very easily.
  • We almost never care about accuracy of more than two digits.
  • We can almost never justify more than two digits of accuracy statistically.
  • It’s also helpful to remember that we are almost invariably publishing progress to date, rather than a truly final answer.
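In R, the two usual tools for this are round (a fixed number of decimal places) and signif (a fixed number of significant digits). A brief sketch, using the row percentages for the Yes column from Section 6.5:

```r
# Percentages reporting trouble sleeping, by education level (from Section 6.5)
pct_yes <- c(37.50000, 29.82456, 17.28395, 30.06536, 28.10811)

one_decimal <- round(pct_yes, 1)   # one decimal place
two_digits  <- signif(pct_yes, 2)  # two significant digits

one_decimal
two_digits
```

For percentages between 10 and 100, two significant digits amounts to rounding to whole numbers, which is usually all a reader can absorb from a table.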

Suppose, for instance, we report a correlation coefficient of 0.25. How many observations do you think you would need to justify such a choice?

  • To report 0.25 meaningfully, we want to be sure that the second digit isn’t 4 or 6.
  • That requires a standard error less than 0.005.
  • The standard error of most statistics is proportional to 1 over the square root of the sample size, n.

So \(\frac{1}{\sqrt{n}} \approx 0.005\), which means \(\sqrt{n} = \frac{1}{0.005} = 200\). If \(\sqrt{n} = 200\), then \(n = 200^2 = 40{,}000\).

Do we usually have 40,000 observations?
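The same arithmetic can be checked in R. This sketch follows the simplification used above and assumes the constant of proportionality is 1, so that the standard error is exactly 1 over the square root of n; the function name n_for_se is our own.

```r
# Sample size needed for a target standard error, assuming SE = 1 / sqrt(n)
n_for_se <- function(target_se) {
  (1 / target_se)^2
}

n_two_digits <- n_for_se(0.005)  # second digit trustworthy
n_one_digit  <- n_for_se(0.05)   # first digit trustworthy

n_two_digits
n_one_digit
```

Each additional digit of justified accuracy multiplies the required sample size by 100 under this assumption.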

6.6.4 ALL is different and important

Summaries of rows and columns provide a measure of what is typical or usual. Sometimes a sum is helpful, at other times, consider presenting a median or other summary. The ALL category, as Wainer (1997) suggests, should be both visually different from the individual entries and set spatially apart.

On the whole, it’s far easier to fall into a good graph in R (at least if you have some ggplot2 skills) than to produce a good table.

References

Wainer, Howard. 2005. Graphic Discovery: A Trout in the Milk and Other Visual Adventures. Princeton, NJ: Princeton University Press.

Wainer, Howard. 2013. Medical Illuminations: Using Evidence, Visualization and Statistical Thinking to Improve Healthcare. New York: Oxford University Press.

Wainer, Howard. 1997. Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot. New York: Springer-Verlag.