Chapter 6 Summarizing Categorical Variables

Summarizing categorical variables numerically is mostly about building tables, and calculating percentages or proportions. We’ll save our discussion of modeling categorical data for later. Recall that in the nh_adults data set we built in Section 4.2 we had the following categorical variables. The number of levels indicates the number of possible categories for each categorical variable.

Variable      Description                   Levels  Type
Sex           sex of subject                2       binary
Race          subject’s race                6       nominal
Education     subject’s educational level   5       ordinal
PhysActive    Participates in sports?       2       binary
Smoke100      Smoked 100+ cigarettes?       2       binary
SleepTrouble  Trouble sleeping?             2       binary
HealthGen     Self-report health            5       ordinal

6.1 The summary function for Categorical data

When R recognizes a variable as categorical, it stores it as a factor. Such variables get special treatment from the summary function: instead of numerical summaries, it shows a table of counts for the available values (so long as there aren’t too many).

     Sex            Race              Education   PhysActive Smoke100 
 female:221   Asian   : 42   8th Grade     : 24   No :215    No :297  
 male  :279   Black   : 63   9 - 11th Grade: 60   Yes:285    Yes:203  
              Hispanic: 26   High School   : 81                       
              Mexican : 38   Some College  :153                       
              White   :313   College Grad  :182                       
              Other   : 18                                            
 SleepTrouble     HealthGen        MaritalStatus
 No :380      Excellent: 50   Divorced    : 51  
 Yes:120      Vgood    :154   LivePartner : 51  
              Good     :184   Married     :259  
              Fair     : 49   NeverMarried:112  
              Poor     : 14   Separated   : 16  
              NA's     : 49   Widowed     : 11  
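A minimal sketch of how output like this arises. Since nh_adults itself is built in Section 4.2 and is not defined here, the data frame `toy` below is a hypothetical stand-in, rebuilt from the SleepTrouble and HealthGen counts printed above.

```r
# Hypothetical stand-in for two nh_adults columns, rebuilt from the
# counts in the summary output (not the real data set).
toy <- data.frame(
  SleepTrouble = factor(rep(c("No", "Yes"), times = c(380, 120))),
  HealthGen = factor(
    rep(c("Excellent", "Vgood", "Good", "Fair", "Poor", NA),
        times = c(50, 154, 184, 49, 14, 49)),
    levels = c("Excellent", "Vgood", "Good", "Fair", "Poor")
  )
)

# For factor columns, summary() reports a count per level (and NA's)
summary(toy)

# Applied to a single factor, summary() returns a named vector of counts
health_counts <- summary(toy$HealthGen)
health_counts
```

Setting the levels explicitly keeps HealthGen in its natural order (Excellent to Poor) rather than alphabetical order.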

6.2 Tables to describe One Categorical Variable

Suppose we build a table (using the tabyl function from the janitor package) to describe the HealthGen distribution.

 HealthGen   n percent valid_percent
 Excellent  50   10.0%         11.1%
     Vgood 154   30.8%         34.1%
      Good 184   36.8%         40.8%
      Fair  49    9.8%         10.9%
      Poor  14    2.8%          3.1%
      <NA>  49    9.8%             -

Note how the missing (<NA>) values are excluded from the valid_percent calculation (which divides by the 451 non-missing observations), but are included in the percent calculation (which divides by all 500). Note also the use of percentage formatting.

What if we want to add a total count, sometimes called the marginal total?

 HealthGen   n percent valid_percent
 Excellent  50   10.0%         11.1%
     Vgood 154   30.8%         34.1%
      Good 184   36.8%         40.8%
      Fair  49    9.8%         10.9%
      Poor  14    2.8%          3.1%
      <NA>  49    9.8%             -
     Total 500  100.0%        100.0%
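The two HealthGen tables above could be produced along the following lines. This is a sketch under assumptions: `toy` is a hypothetical stand-in rebuilt from the printed counts, and the pipeline uses tabyl with the adorn_pct_formatting and adorn_totals helpers from janitor.

```r
library(janitor)
library(dplyr)   # loaded here for the %>% pipe

# Hypothetical stand-in for nh_adults$HealthGen, rebuilt from the counts above
toy <- data.frame(
  HealthGen = factor(
    rep(c("Excellent", "Vgood", "Good", "Fair", "Poor", NA),
        times = c(50, 154, 184, 49, 14, 49)),
    levels = c("Excellent", "Vgood", "Good", "Fair", "Poor")
  )
)

# One-way table with percentage formatting
toy %>%
  tabyl(HealthGen) %>%
  adorn_pct_formatting()

# Same table with a Total row added before formatting
health_tab <- toy %>%
  tabyl(HealthGen) %>%
  adorn_totals("row") %>%
  adorn_pct_formatting()
health_tab
```

The order matters: adorn_totals works on the numeric proportions, so it is called before adorn_pct_formatting converts them to text.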

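What about marital status (MaritalStatus, a six-level nominal variable that is not listed in the table at the start of this chapter), which has no missing data in our sample?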

 MaritalStatus   n percent
      Divorced  51   10.2%
   LivePartner  51   10.2%
       Married 259   51.8%
  NeverMarried 112   22.4%
     Separated  16    3.2%
       Widowed  11    2.2%
         Total 500  100.0%

6.3 The Mode of a Categorical Variable

A common summary of a categorical variable is its mode, the most frequently observed value. To find the mode for a variable with lots of categories (so that the summary may not be sufficient), we usually tabulate the data and then sort the categories by their counts, as we did with discrete quantitative variables.

Warning: Factor `HealthGen` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 6 x 2
  HealthGen count
  <fct>     <int>
1 Good        184
2 Vgood       154
3 Excellent    50
4 Fair         49
5 <NA>         49
6 Poor         14

6.4 describe in the Hmisc package

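The describe function in the Hmisc package gives another view of each variable: the number of non-missing and missing values, the number of distinct values, and the frequency and proportion for each level.

nh_adults %>% 
    select(Sex, Race, Education, PhysActive, Smoke100, SleepTrouble, HealthGen, MaritalStatus) %>%
    Hmisc::describe()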

 8  Variables      500  Observations
---------------------------------------------------------------------------
Sex 
       n  missing distinct 
     500        0        2 
                        
Value      female   male
Frequency     221    279
Proportion  0.442  0.558
---------------------------------------------------------------------------
Race 
       n  missing distinct 
     500        0        6 
                                                                
Value         Asian    Black Hispanic  Mexican    White    Other
Frequency        42       63       26       38      313       18
Proportion    0.084    0.126    0.052    0.076    0.626    0.036
---------------------------------------------------------------------------
Education 
       n  missing distinct 
     500        0        5 
                                                                      
Value           8th Grade 9 - 11th Grade    High School   Some College
Frequency              24             60             81            153
Proportion          0.048          0.120          0.162          0.306
                         
Value        College Grad
Frequency             182
Proportion          0.364
---------------------------------------------------------------------------
PhysActive 
       n  missing distinct 
     500        0        2 
                    
Value        No  Yes
Frequency   215  285
Proportion 0.43 0.57
---------------------------------------------------------------------------
Smoke100 
       n  missing distinct 
     500        0        2 
                      
Value         No   Yes
Frequency    297   203
Proportion 0.594 0.406
---------------------------------------------------------------------------
SleepTrouble 
       n  missing distinct 
     500        0        2 
                    
Value        No  Yes
Frequency   380  120
Proportion 0.76 0.24
---------------------------------------------------------------------------
HealthGen 
       n  missing distinct 
     451       49        5 
                                                            
Value      Excellent     Vgood      Good      Fair      Poor
Frequency         50       154       184        49        14
Proportion     0.111     0.341     0.408     0.109     0.031
---------------------------------------------------------------------------
MaritalStatus 
       n  missing distinct 
     500        0        6 
                                                              
Value          Divorced  LivePartner      Married NeverMarried
Frequency            51           51          259          112
Proportion        0.102        0.102        0.518        0.224
                                    
Value         Separated      Widowed
Frequency            16           11
Proportion        0.032        0.022
---------------------------------------------------------------------------

6.5 Cross-Tabulations

It is very common for us to want to describe the association of one categorical variable with another. For instance, is there a relationship between Education and SleepTrouble in these data?

      Education  No Yes Total
      8th Grade  18   6    24
 9 - 11th Grade  45  15    60
    High School  62  19    81
   Some College 118  35   153
   College Grad 137  45   182
          Total 380 120   500

Note the use of adorn_totals to get the marginal counts, and how we specify that we want both the row and column totals. We can add a title for the columns with adorn_title:

 Education/SleepTrouble  No Yes Total
              8th Grade  18   6    24
         9 - 11th Grade  45  15    60
            High School  62  19    81
           Some College 118  35   153
           College Grad 137  45   182
                  Total 380 120   500
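The two cross-tabulations above could be built roughly as follows. This is a sketch, assuming a hypothetical data frame `toy` rebuilt from the cell counts shown (the real nh_adults data is not defined here), with tabyl, adorn_totals and adorn_title from janitor.

```r
library(janitor)
library(dplyr)   # loaded here for the %>% pipe

# Hypothetical stand-in: one row per subject, rebuilt from the cell counts above
edu_levels <- c("8th Grade", "9 - 11th Grade", "High School",
                "Some College", "College Grad")
cells <- expand.grid(SleepTrouble = c("No", "Yes"), Education = edu_levels)
cells$n <- c(18, 6, 45, 15, 62, 19, 118, 35, 137, 45)
toy <- cells[rep(seq_len(nrow(cells)), times = cells$n),
             c("Education", "SleepTrouble")]

# Two-way table of counts with both row and column totals
edu_sleep <- toy %>%
  tabyl(Education, SleepTrouble) %>%
  adorn_totals(where = c("row", "col"))
edu_sleep

# Same table, with a combined title naming the row and column variables
edu_sleep %>%
  adorn_title(placement = "combined")
```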

Often, we’ll want to show percentages in a cross-tabulation like this. To get row percentages, so that we can directly see the observed proportion of subjects with SleepTrouble = Yes at each level of Education, we can use:

 Education/SleepTrouble    No   Yes
              8th Grade 75.0% 25.0%
         9 - 11th Grade 75.0% 25.0%
            High School 76.5% 23.5%
           Some College 77.1% 22.9%
           College Grad 75.3% 24.7%
                  Total 76.0% 24.0%

If we want to compare the distribution of Education between the two levels of SleepTrouble with column percentages, we can use the following…

 Education/SleepTrouble    No   Yes Total
              8th Grade  4.7%  5.0%  4.8%
         9 - 11th Grade 11.8% 12.5% 12.0%
            High School 16.3% 15.8% 16.2%
           Some College 31.1% 29.2% 30.6%
           College Grad 36.1% 37.5% 36.4%

If we want overall percentages in the cells of the table, so that the total across all combinations of Education and SleepTrouble is 100%, we can use:

 Education/SleepTrouble    No   Yes  Total
              8th Grade  3.6%  1.2%   4.8%
         9 - 11th Grade  9.0%  3.0%  12.0%
            High School 12.4%  3.8%  16.2%
           Some College 23.6%  7.0%  30.6%
           College Grad 27.4%  9.0%  36.4%
                  Total 76.0% 24.0% 100.0%
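The three percentage tables above differ only in the denominator. A sketch of all three, assuming the same hypothetical `toy` data rebuilt from the printed counts and janitor's adorn_percentages with its denominator set to "row", "col" or "all":

```r
library(janitor)
library(dplyr)   # loaded here for the %>% pipe

# Hypothetical stand-in: one row per subject, rebuilt from the cell counts above
edu_levels <- c("8th Grade", "9 - 11th Grade", "High School",
                "Some College", "College Grad")
cells <- expand.grid(SleepTrouble = c("No", "Yes"), Education = edu_levels)
cells$n <- c(18, 6, 45, 15, 62, 19, 118, 35, 137, 45)
toy <- cells[rep(seq_len(nrow(cells)), times = cells$n),
             c("Education", "SleepTrouble")]

counts <- toy %>% tabyl(Education, SleepTrouble)

# Row percentages: each row sums to 100%
row_pct <- counts %>%
  adorn_totals("row") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting()

# Column percentages: each column sums to 100%
col_pct <- counts %>%
  adorn_totals("col") %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting()

# Overall percentages: all cells together sum to 100%
all_pct <- counts %>%
  adorn_totals(c("row", "col")) %>%
  adorn_percentages("all") %>%
  adorn_pct_formatting()

row_pct %>% adorn_title("combined")
```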

Another common approach is to include both counts and percentages in a cross-tabulation. Let’s look at the breakdown of HealthGen by MaritalStatus.

MaritalStatus/HealthGen   Excellent        Vgood         Good        Fair       Poor         NA_
               Divorced   7 (13.7%)   14 (27.5%)   20 (39.2%)    5 (9.8%)   2 (3.9%)    3 (5.9%)
            LivePartner    1 (2.0%)   18 (35.3%)   16 (31.4%)  11 (21.6%)   1 (2.0%)    4 (7.8%)
                Married   23 (8.9%)   84 (32.4%)  102 (39.4%)   15 (5.8%)   4 (1.5%)  31 (12.0%)
           NeverMarried  14 (12.5%)   31 (27.7%)   43 (38.4%)  13 (11.6%)   3 (2.7%)    8 (7.1%)
              Separated   4 (25.0%)    4 (25.0%)     1 (6.2%)   4 (25.0%)   1 (6.2%)   2 (12.5%)
                Widowed    1 (9.1%)    3 (27.3%)    2 (18.2%)    1 (9.1%)  3 (27.3%)    1 (9.1%)
                  Total  50 (10.0%)  154 (30.8%)  184 (36.8%)   49 (9.8%)  14 (2.8%)   49 (9.8%)

What if we wanted to ignore the missing HealthGen values? Most often, I filter down to the complete observations.

 MaritalStatus/HealthGen  Excellent       Vgood        Good       Fair       Poor
                Divorced  7 (14.6%)  14 (29.2%)  20 (41.7%)  5 (10.4%)  2  (4.2%)
             LivePartner  1  (2.1%)  18 (38.3%)  16 (34.0%) 11 (23.4%)  1  (2.1%)
                 Married 23 (10.1%)  84 (36.8%) 102 (44.7%) 15  (6.6%)  4  (1.8%)
            NeverMarried 14 (13.5%)  31 (29.8%)  43 (41.3%) 13 (12.5%)  3  (2.9%)
               Separated  4 (28.6%)   4 (28.6%)   1  (7.1%)  4 (28.6%)  1  (7.1%)
                 Widowed  1 (10.0%)   3 (30.0%)   2 (20.0%)  1 (10.0%)  3 (30.0%)
                   Total 50 (11.1%) 154 (34.1%) 184 (40.8%) 49 (10.9%) 14  (3.1%)
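A table with counts and row percentages, restricted to complete observations, might be built as in the sketch below. The data frame `toy` is a hypothetical stand-in rebuilt from the counts in the earlier MaritalStatus by HealthGen table, and adorn_ns with position = "front" is assumed in order to match the "count (percent)" layout shown.

```r
library(janitor)
library(dplyr)

# Hypothetical stand-in: one row per subject, rebuilt from the printed counts
cells <- expand.grid(
  HealthGen = c("Excellent", "Vgood", "Good", "Fair", "Poor", NA),
  MaritalStatus = c("Divorced", "LivePartner", "Married",
                    "NeverMarried", "Separated", "Widowed"),
  stringsAsFactors = FALSE
)
cells$n <- c( 7, 14,  20,  5, 2,  3,    # Divorced
              1, 18,  16, 11, 1,  4,    # LivePartner
             23, 84, 102, 15, 4, 31,    # Married
             14, 31,  43, 13, 3,  8,    # NeverMarried
              4,  4,   1,  4, 1,  2,    # Separated
              1,  3,   2,  1, 3,  1)    # Widowed
toy <- cells[rep(seq_len(nrow(cells)), times = cells$n),
             c("MaritalStatus", "HealthGen")]
toy$HealthGen <- factor(toy$HealthGen,
                        levels = c("Excellent", "Vgood", "Good", "Fair", "Poor"))

# Drop rows with missing values, then show counts with row percentages
health_by_marital <- toy %>%
  filter(complete.cases(MaritalStatus, HealthGen)) %>%
  tabyl(MaritalStatus, HealthGen) %>%
  adorn_totals("row") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting() %>%
  adorn_ns(position = "front")

health_by_marital %>% adorn_title("combined")
```

Leaving out the filter step keeps the missing values, which then appear in their own NA_ column as in the first version of the table.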

For more on working with tabyls, see the vignette in the janitor package. There you’ll find a complete list of all of the adorn functions, for example.

Here’s another approach, using the base R table function, to look at the cross-classification of Race and HealthGen:

          HealthGen
Race       Excellent Vgood Good Fair Poor
  Asian            3    11   17    3    0
  Black            8    11   19   11    6
  Hispanic         3     3   11    4    1
  Mexican          2     8   17    6    3
  White           33   113  114   22    4
  Other            1     8    6    3    0

6.5.1 Cross-Classifying Three Categorical Variables

Suppose we are interested in Smoke100 and its relationship to PhysActive and SleepTrouble.

$No
          PhysActive    
 Smoke100         No Yes
       No         99 142
      Yes         62  77

$Yes
          PhysActive    
 Smoke100         No Yes
       No         21  35
      Yes         33  31

The result here is a tabyl of Smoke100 (rows) by PhysActive (columns), split into a list by SleepTrouble. Another approach to get the same table is:

, , SleepTrouble = No

        PhysActive
Smoke100  No Yes
     No   99 142
     Yes  62  77

, , SleepTrouble = Yes

        PhysActive
Smoke100  No Yes
     No   21  35
     Yes  33  31

We can also build a flat version of this table, as follows:

                        Smoke100  No Yes
PhysActive SleepTrouble                 
No         No                     99  62
           Yes                    21  33
Yes        No                    142  77
           Yes                    35  31
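The three-way array and the flat table above can both be produced with base R alone. The sketch below uses table and ftable on a hypothetical data frame `toy`, rebuilt from the eight cell counts shown, since nh_adults is not defined here.

```r
# Hypothetical stand-in: one row per subject, rebuilt from the cell counts above.
# expand.grid varies its first variable fastest, so counts are listed in that order.
cells <- expand.grid(Smoke100 = c("No", "Yes"),
                     PhysActive = c("No", "Yes"),
                     SleepTrouble = c("No", "Yes"))
cells$n <- c(99, 62, 142, 77, 21, 33, 35, 31)
toy <- cells[rep(seq_len(nrow(cells)), times = cells$n),
             c("Smoke100", "PhysActive", "SleepTrouble")]

# Three-way array: Smoke100 by PhysActive, one slice per SleepTrouble level
tab3 <- table(toy$Smoke100, toy$PhysActive, toy$SleepTrouble,
              dnn = c("Smoke100", "PhysActive", "SleepTrouble"))
tab3

# Flat version: variables right of ~ form the rows, Smoke100 forms the columns
flat <- ftable(Smoke100 ~ PhysActive + SleepTrouble, data = toy)
flat
```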

And we can do this with dplyr functions, as well, for example…

, , SleepTrouble = No

        PhysActive
Smoke100  No Yes
     No   99 142
     Yes  62  77

, , SleepTrouble = Yes

        PhysActive
Smoke100  No Yes
     No   21  35
     Yes  33  31
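A sketch of the pipeline versions of this three-way table: dplyr's select followed by table, and janitor's tabyl with three variables, which returns the list of tabyls shown at the start of this section. As before, `toy` is a hypothetical stand-in rebuilt from the printed cell counts.

```r
library(dplyr)
library(janitor)

# Hypothetical stand-in: one row per subject, rebuilt from the cell counts above
cells <- expand.grid(Smoke100 = c("No", "Yes"),
                     PhysActive = c("No", "Yes"),
                     SleepTrouble = c("No", "Yes"))
cells$n <- c(99, 62, 142, 77, 21, 33, 35, 31)
toy <- cells[rep(seq_len(nrow(cells)), times = cells$n),
             c("Smoke100", "PhysActive", "SleepTrouble")]

# dplyr: pick the three variables, then cross-classify them with table()
three_way <- toy %>%
  select(Smoke100, PhysActive, SleepTrouble) %>%
  table()
three_way

# janitor: a list of Smoke100-by-PhysActive tabyls, one per SleepTrouble level
by_sleep <- toy %>%
  tabyl(Smoke100, PhysActive, SleepTrouble)
by_sleep$Yes
```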

6.6 Constructing Tables Well

The prolific Howard Wainer is responsible for many interesting books on visualization and related issues, including Wainer (2005) and Wainer (2013). These rules come from Chapter 10 of Wainer (1997).

  1. Order the rows and columns in a way that makes sense.
  2. Round, a lot!
  3. ALL is different and important

6.6.1 Alabama First!

Which of these tables is more useful to you?

2013 Percent of Students in grades 9-12 who are obese

State        % Obese  95% CI         Sample Size
Alabama      17.1     (14.6 - 19.9)  1,499
Alaska       12.4     (10.5 - 14.6)  1,167
Arizona      10.7     (8.3 - 13.6)   1,520
Arkansas     17.8     (15.7 - 20.1)  1,470
Connecticut  12.3     (10.2 - 14.7)  2,270
Delaware     14.2     (12.9 - 15.6)  2,475
Florida      11.6     (10.5 - 12.8)  5,491
Wisconsin    11.6     (9.7 - 13.9)   2,771
Wyoming      10.7     (9.4 - 12.2)   2,910

or …

State          % Obese  95% CI         Sample Size
Kentucky       18.0     (15.7 - 20.6)  1,537
Arkansas       17.8     (15.7 - 20.1)  1,470
Alabama        17.1     (14.6 - 19.9)  1,499
Tennessee      16.9     (15.1 - 18.8)  1,831
Texas          15.7     (13.9 - 17.6)  3,039
Massachusetts  10.2     (8.5 - 12.1)   2,547
Idaho          9.6      (8.2 - 11.1)   1,841
Montana        9.4      (8.4 - 10.5)   4,679
New Jersey     8.7      (6.8 - 11.2)   1,644
Utah           6.4      (4.8 - 8.5)    2,136

It is a rare event when Alabama first is the best choice.

6.6.2 Order rows and columns sensibly

  • Alphabetical order (Alabama First!) is rarely the best choice.
  • Size places - put the largest first. We often look most carefully at the top.
  • Order time from the past to the future to help the viewer.
  • If there is a clear predictor-outcome relationship, put the predictors in the rows and the outcomes in the columns.
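As a small illustration of ordering by size rather than alphabetically, the sketch below sorts four rows taken from the obesity table in Section 6.6.1. The data frame and its column names (`state`, `pct_obese`) are made up for this example.

```r
# Four states from the table above, in alphabetical ("Alabama First!") order
obesity <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas"),
  pct_obese = c(17.1, 12.4, 10.7, 17.8),
  stringsAsFactors = FALSE
)

# Re-order the rows so the largest percentage comes first
by_size <- obesity[order(obesity$pct_obese, decreasing = TRUE), ]
by_size
```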

6.6.3 Round - a lot!

  • Humans cannot understand more than two digits very easily.
  • We almost never care about accuracy of more than two digits.
  • We can almost never justify more than two digits of accuracy statistically.
  • It’s also helpful to remember that we are almost invariably publishing progress to date, rather than a truly final answer.

Suppose, for instance, we report a correlation coefficient of 0.25. How many observations do you think you would need to justify such a choice?

  • To report 0.25 meaningfully, we want to be sure that the second digit isn’t 4 or 6.
  • That requires a standard error less than 0.005.
  • The standard error of most common statistics is proportional to 1 over the square root of the sample size, n.

So \(\frac{1}{\sqrt{n}} \approx 0.005\), which means \(\sqrt{n} = \frac{1}{0.005} = 200\). If \(\sqrt{n} = 200\), then \(n = 200^2 = 40{,}000\).
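The same arithmetic can be checked in R. This sketch takes the text's simplification at face value, treating the standard error as exactly 1 over the square root of n (a constant of proportionality of 1, which is an assumption); the variable names are illustrative.

```r
# Target standard error needed to trust the second digit
target_se <- 0.005

# Solve 1 / sqrt(n) = target_se for n
n_needed <- (1 / target_se)^2
n_needed

# For comparison: the rough standard error, under the same simplification,
# for a sample the size of nh_adults (500 observations)
se_500 <- 1 / sqrt(500)
se_500   # about 0.045, roughly nine times the target
```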

Do we usually have 40,000 observations?

6.6.4 ALL is different and important

Summaries of rows and columns provide a measure of what is typical or usual. Sometimes a sum is helpful, at other times, consider presenting a median or other summary. The ALL category, as Wainer (1997) suggests, should be both visually different from the individual entries and set spatially apart.

On the whole, it’s far easier to fall into a good graph in R (at least if you have some ggplot2 skills) than to produce a good table.

References

Wainer, Howard. 1997. Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot. New York: Springer-Verlag.

Wainer, Howard. 2005. Graphic Discovery: A Trout in the Milk and Other Visual Adventures. Princeton, NJ: Princeton University Press.

Wainer, Howard. 2013. Medical Illuminations: Using Evidence, Visualization and Statistical Thinking to Improve Healthcare. New York: Oxford University Press.