Chapter 6 Summarizing Categorical Variables
Summarizing categorical variables numerically is mostly about building tables and calculating percentages or proportions. We’ll save our discussion of modeling categorical data for later. Recall that the nh_adults data set we built in Section 4.2 includes the following categorical variables. The number of levels indicates the number of possible categories for each categorical variable.
Variable | Description | Levels | Type |
---|---|---|---|
Sex | sex of subject | 2 | binary |
Race | subject’s race | 6 | nominal |
Education | subject’s educational level | 5 | ordinal |
PhysActive | Participates in sports? | 2 | binary |
Smoke100 | Smoked 100+ cigarettes? | 2 | binary |
SleepTrouble | Trouble sleeping? | 2 | binary |
HealthGen | Self-report health | 5 | ordinal |
6.1 The summary function for Categorical data
When R recognizes a variable as categorical, it stores it as a factor. Such variables get special treatment from the summary function: in particular, a table of the available values and their counts (so long as there aren’t too many).
nh_adults %>%
select(Sex, Race, Education, PhysActive, Smoke100,
SleepTrouble, HealthGen, MaritalStatus) %>%
summary()
Sex Race Education PhysActive Smoke100
female:221 Asian : 42 8th Grade : 24 No :215 No :297
male :279 Black : 63 9 - 11th Grade: 60 Yes:285 Yes:203
Hispanic: 26 High School : 81
Mexican : 38 Some College :153
White :313 College Grad :182
Other : 18
SleepTrouble HealthGen MaritalStatus
No :380 Excellent: 50 Divorced : 51
Yes:120 Vgood :154 LivePartner : 51
Good :184 Married :259
Fair : 49 NeverMarried:112
Poor : 14 Separated : 16
NA's : 49 Widowed : 11
6.2 Tables to describe One Categorical Variable
Suppose we build a table (using the tabyl function from the janitor package) to describe the HealthGen distribution.
HealthGen n percent valid_percent
Excellent 50 10.0% 11.1%
Vgood 154 30.8% 34.1%
Good 184 36.8% 40.8%
Fair 49 9.8% 10.9%
Poor 14 2.8% 3.1%
<NA> 49 9.8% -
Note how the missing (<NA>) values are not included in the valid_percent calculation, but are in the percent calculation. Note also the use of percentage formatting.
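The call that produced this table is not shown. A minimal sketch of one way to get it follows. Because the nh_adults tibble from Section 4.2 is not rebuilt here, the code uses a stand-in data frame (the hypothetical health_demo), reconstructed from the counts in the table above.

```r
library(janitor)

# Stand-in for nh_adults$HealthGen, rebuilt from the counts reported above
# (an assumption for illustration; the real tibble comes from Section 4.2)
health_levels <- c("Excellent", "Vgood", "Good", "Fair", "Poor")
health_demo <- data.frame(
  HealthGen = factor(
    c(rep(health_levels, times = c(50, 154, 184, 49, 14)), rep(NA, 49)),
    levels = health_levels
  )
)

# One-way tabyl: n, percent (of all rows), valid_percent (of non-missing rows)
health_tab <- tabyl(health_demo, HealthGen)

# Turn the proportions into formatted percentages for display
adorn_pct_formatting(health_tab)
```

With the real data, the same two steps would be applied to nh_adults rather than health_demo.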
What if we want to add a total count, sometimes called the marginal total?
HealthGen n percent valid_percent
Excellent 50 10.0% 11.1%
Vgood 154 30.8% 34.1%
Good 184 36.8% 40.8%
Fair 49 9.8% 10.9%
Poor 14 2.8% 3.1%
<NA> 49 9.8% -
Total 500 100.0% 100.0%
What about marital status, which has no missing data in our sample?
MaritalStatus n percent
Divorced 51 10.2%
LivePartner 51 10.2%
Married 259 51.8%
NeverMarried 112 22.4%
Separated 16 3.2%
Widowed 11 2.2%
Total 500 100.0%
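A table with a Total row like this one can be sketched with adorn_totals from the janitor package. The data frame marital_demo below is a hypothetical stand-in for nh_adults, rebuilt from the counts shown above. Because it has no missing values, tabyl produces no valid_percent column.

```r
library(janitor)

# Stand-in for nh_adults$MaritalStatus, rebuilt from the counts shown above
marital_demo <- data.frame(
  MaritalStatus = rep(
    c("Divorced", "LivePartner", "Married",
      "NeverMarried", "Separated", "Widowed"),
    times = c(51, 51, 259, 112, 16, 11)
  )
)

# One-way table with a Total row appended (proportions are still raw numbers)
marital_totals <- adorn_totals(tabyl(marital_demo, MaritalStatus), where = "row")

# Format the proportions as percentages for display
adorn_pct_formatting(marital_totals)
```

Adding the totals before formatting keeps the percent column numeric while it is being summed.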
6.3 The Mode of a Categorical Variable
A common summary of a categorical variable is its mode, the most frequently observed value. To find the mode for variables with lots of categories (so that the summary may not be sufficient), we usually tabulate the data, and then sort by the counts, as we did with discrete quantitative variables.
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 6 x 2
HealthGen count
<fct> <int>
1 Good 184
2 Vgood 154
3 Excellent 50
4 Fair 49
5 <NA> 49
6 Poor 14
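The tabulate-then-sort step can be sketched with dplyr as follows. This is one plausible version, not necessarily the call used above, and it runs on a hypothetical stand-in (health_demo) rebuilt from the reported HealthGen counts rather than on nh_adults itself.

```r
library(dplyr)

# Stand-in for nh_adults$HealthGen, rebuilt from the counts reported above
health_levels <- c("Excellent", "Vgood", "Good", "Fair", "Poor")
health_demo <- tibble(
  HealthGen = factor(
    c(rep(health_levels, times = c(50, 154, 184, 49, 14)), rep(NA, 49)),
    levels = health_levels
  )
)

# Count each category (missing values form their own group),
# then sort so the most frequent category comes first
health_counts <- health_demo %>%
  group_by(HealthGen) %>%
  summarise(count = n(), .groups = "drop") %>%
  arrange(desc(count))

health_counts               # the sorted table
health_counts$HealthGen[1]  # the mode is the first row
```

Setting .groups = "drop" is a choice made here to silence the ungrouping message that summarise() otherwise prints.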
6.4 describe in the Hmisc package
Hmisc::describe(nh_adults %>%
select(Sex, Race, Education, PhysActive,
Smoke100, SleepTrouble,
HealthGen, MaritalStatus))
nh_adults %>% select(Sex, Race, Education, PhysActive, Smoke100, SleepTrouble, HealthGen, MaritalStatus)
8 Variables 500 Observations
--------------------------------------------------------------------------------
Sex
n missing distinct
500 0 2
Value female male
Frequency 221 279
Proportion 0.442 0.558
--------------------------------------------------------------------------------
Race
n missing distinct
500 0 6
lowest : Asian Black Hispanic Mexican White
highest: Black Hispanic Mexican White Other
Value Asian Black Hispanic Mexican White Other
Frequency 42 63 26 38 313 18
Proportion 0.084 0.126 0.052 0.076 0.626 0.036
--------------------------------------------------------------------------------
Education
n missing distinct
500 0 5
lowest : 8th Grade 9 - 11th Grade High School Some College College Grad
highest: 8th Grade 9 - 11th Grade High School Some College College Grad
Value 8th Grade 9 - 11th Grade High School Some College
Frequency 24 60 81 153
Proportion 0.048 0.120 0.162 0.306
Value College Grad
Frequency 182
Proportion 0.364
--------------------------------------------------------------------------------
PhysActive
n missing distinct
500 0 2
Value No Yes
Frequency 215 285
Proportion 0.43 0.57
--------------------------------------------------------------------------------
Smoke100
n missing distinct
500 0 2
Value No Yes
Frequency 297 203
Proportion 0.594 0.406
--------------------------------------------------------------------------------
SleepTrouble
n missing distinct
500 0 2
Value No Yes
Frequency 380 120
Proportion 0.76 0.24
--------------------------------------------------------------------------------
HealthGen
n missing distinct
451 49 5
lowest : Excellent Vgood Good Fair Poor
highest: Excellent Vgood Good Fair Poor
Value Excellent Vgood Good Fair Poor
Frequency 50 154 184 49 14
Proportion 0.111 0.341 0.408 0.109 0.031
--------------------------------------------------------------------------------
MaritalStatus
n missing distinct
500 0 6
lowest : Divorced LivePartner Married NeverMarried Separated
highest: LivePartner Married NeverMarried Separated Widowed
Value Divorced LivePartner Married NeverMarried Separated
Frequency 51 51 259 112 16
Proportion 0.102 0.102 0.518 0.224 0.032
Value Widowed
Frequency 11
Proportion 0.022
--------------------------------------------------------------------------------
6.5 Cross-Tabulations
It is very common for us to want to describe the association of one categorical variable with another. For instance, is there a relationship between Education and SleepTrouble in these data?
Education No Yes Total
8th Grade 18 6 24
9 - 11th Grade 45 15 60
High School 62 19 81
Some College 118 35 153
College Grad 137 45 182
Total 380 120 500
Note the use of adorn_totals to get the marginal counts, and how we specify that we want both the row and column totals. We can add a title for the columns with…
nh_adults %>%
tabyl(Education, SleepTrouble) %>%
adorn_totals(where = c("row", "col")) %>%
adorn_title(placement = "combined")
Education/SleepTrouble No Yes Total
8th Grade 18 6 24
9 - 11th Grade 45 15 60
High School 62 19 81
Some College 118 35 153
College Grad 137 45 182
Total 380 120 500
Often, we’ll want to show percentages in a cross-tabulation like this. To get row percentages so that we can directly see the proportion of subjects with SleepTrouble = Yes for each level of Education, we can use:
nh_adults %>%
tabyl(Education, SleepTrouble) %>%
adorn_totals(where = "row") %>%
adorn_percentages(denominator = "row") %>%
adorn_pct_formatting() %>%
adorn_title(placement = "combined")
Education/SleepTrouble No Yes
8th Grade 75.0% 25.0%
9 - 11th Grade 75.0% 25.0%
High School 76.5% 23.5%
Some College 77.1% 22.9%
College Grad 75.3% 24.7%
Total 76.0% 24.0%
If we want to compare the distribution of Education between the two levels of SleepTrouble with column percentages, we can use the following…
nh_adults %>%
tabyl(Education, SleepTrouble) %>%
adorn_totals(where = "col") %>%
adorn_percentages(denominator = "col") %>%
adorn_pct_formatting() %>%
adorn_title(placement = "combined")
Education/SleepTrouble No Yes Total
8th Grade 4.7% 5.0% 4.8%
9 - 11th Grade 11.8% 12.5% 12.0%
High School 16.3% 15.8% 16.2%
Some College 31.1% 29.2% 30.6%
College Grad 36.1% 37.5% 36.4%
If we want overall percentages in the cells of the table, so that the total across all combinations of Education and SleepTrouble is 100%, we can use:
nh_adults %>%
tabyl(Education, SleepTrouble) %>%
adorn_totals(where = c("row", "col")) %>%
adorn_percentages(denominator = "all") %>%
adorn_pct_formatting() %>%
adorn_title(placement = "combined")
Education/SleepTrouble No Yes Total
8th Grade 3.6% 1.2% 4.8%
9 - 11th Grade 9.0% 3.0% 12.0%
High School 12.4% 3.8% 16.2%
Some College 23.6% 7.0% 30.6%
College Grad 27.4% 9.0% 36.4%
Total 76.0% 24.0% 100.0%
Another common approach is to include both counts and percentages in a cross-tabulation. Let’s look at the breakdown of HealthGen by MaritalStatus.
nh_adults %>%
tabyl(MaritalStatus, HealthGen) %>%
adorn_totals(where = c("row")) %>%
adorn_percentages(denominator = "row") %>%
adorn_pct_formatting() %>%
adorn_ns(position = "front") %>%
adorn_title(placement = "combined") %>%
knitr::kable()
MaritalStatus/HealthGen | Excellent | Vgood | Good | Fair | Poor | NA_ |
---|---|---|---|---|---|---|
Divorced | 7 (13.7%) | 14 (27.5%) | 20 (39.2%) | 5 (9.8%) | 2 (3.9%) | 3 (5.9%) |
LivePartner | 1 (2.0%) | 18 (35.3%) | 16 (31.4%) | 11 (21.6%) | 1 (2.0%) | 4 (7.8%) |
Married | 23 (8.9%) | 84 (32.4%) | 102 (39.4%) | 15 (5.8%) | 4 (1.5%) | 31 (12.0%) |
NeverMarried | 14 (12.5%) | 31 (27.7%) | 43 (38.4%) | 13 (11.6%) | 3 (2.7%) | 8 (7.1%) |
Separated | 4 (25.0%) | 4 (25.0%) | 1 (6.2%) | 4 (25.0%) | 1 (6.2%) | 2 (12.5%) |
Widowed | 1 (9.1%) | 3 (27.3%) | 2 (18.2%) | 1 (9.1%) | 3 (27.3%) | 1 (9.1%) |
Total | 50 (10.0%) | 154 (30.8%) | 184 (36.8%) | 49 (9.8%) | 14 (2.8%) | 49 (9.8%) |
What if we wanted to ignore the missing HealthGen values? Most often, I filter down to the complete observations.
nh_adults %>%
filter(complete.cases(MaritalStatus, HealthGen)) %>%
tabyl(MaritalStatus, HealthGen) %>%
adorn_totals(where = c("row")) %>%
adorn_percentages(denominator = "row") %>%
adorn_pct_formatting() %>%
adorn_ns(position = "front") %>%
adorn_title(placement = "combined")
MaritalStatus/HealthGen Excellent Vgood Good Fair Poor
Divorced 7 (14.6%) 14 (29.2%) 20 (41.7%) 5 (10.4%) 2 (4.2%)
LivePartner 1 (2.1%) 18 (38.3%) 16 (34.0%) 11 (23.4%) 1 (2.1%)
Married 23 (10.1%) 84 (36.8%) 102 (44.7%) 15 (6.6%) 4 (1.8%)
NeverMarried 14 (13.5%) 31 (29.8%) 43 (41.3%) 13 (12.5%) 3 (2.9%)
Separated 4 (28.6%) 4 (28.6%) 1 (7.1%) 4 (28.6%) 1 (7.1%)
Widowed 1 (10.0%) 3 (30.0%) 2 (20.0%) 1 (10.0%) 3 (30.0%)
Total 50 (11.1%) 154 (34.1%) 184 (40.8%) 49 (10.9%) 14 (3.1%)
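An alternative to filtering first is the show_na argument of tabyl, which leaves missing values out of the table. The sketch below uses a small made-up data frame (toy, with hypothetical columns status and health) to check that the two routes give the same counts.

```r
library(janitor)

# Made-up data for illustration only: two of the six health values are missing
toy <- data.frame(
  status = c("Married", "Married", "Divorced", "Divorced", "Married", "Divorced"),
  health = c("Good", NA, "Fair", "Good", "Good", NA)
)

# Route 1: ask tabyl to leave missing values out of the table
dropped <- tabyl(toy, status, health, show_na = FALSE)

# Route 2: keep only the complete rows, then tabulate
filtered <- tabyl(toy[complete.cases(toy), ], status, health)

dropped
filtered
```

Filtering first has the advantage that every later step in a pipeline sees the same set of rows.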
For more on working with tabyls, see the vignette in the janitor package. There you’ll find a complete list of all of the adorn functions, for example.
Here’s another approach, to look at the cross-classification of Race and HealthGen:
HealthGen
Race Excellent Vgood Good Fair Poor
Asian 3 11 17 3 0
Black 8 11 19 11 6
Hispanic 3 3 11 4 1
Mexican 2 8 17 6 3
White 33 113 114 22 4
Other 1 8 6 3 0
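The call behind this table is not shown. One base R possibility is xtabs, sketched below on a hypothetical stand-in (race_health_demo) that is rebuilt from the cell counts above instead of being drawn from nh_adults. Subjects with missing HealthGen do not appear in the table above, so they are left out of the stand-in as well.

```r
race_levels <- c("Asian", "Black", "Hispanic", "Mexican", "White", "Other")
health_levels <- c("Excellent", "Vgood", "Good", "Fair", "Poor")

# One row per cell; expand.grid() varies Race fastest, so the counts are
# listed column by column (all Excellent counts, then all Vgood counts, ...)
cells <- expand.grid(Race = race_levels, HealthGen = health_levels)
cells$n <- c( 3,  8,  3,  2,  33, 1,
             11, 11,  3,  8, 113, 8,
             17, 19, 11, 17, 114, 6,
              3, 11,  4,  6,  22, 3,
              0,  6,  1,  3,   4, 0)

# Expand to one row per subject, mimicking the shape of the raw data
race_health_demo <- cells[rep(seq_len(nrow(cells)), times = cells$n),
                          c("Race", "HealthGen")]

# Cross-classify Race (rows) by HealthGen (columns)
race_by_health <- xtabs(~ Race + HealthGen, data = race_health_demo)
race_by_health
```

With the real data, only the final xtabs call would be needed, with data = nh_adults.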
6.5.1 Cross-Classifying Three Categorical Variables
Suppose we are interested in Smoke100 and its relationship to PhysActive and SleepTrouble.
$No
PhysActive
Smoke100 No Yes
No 99 142
Yes 62 77
$Yes
PhysActive
Smoke100 No Yes
No 21 35
Yes 33 31
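A three-variable tabyl of this kind can be sketched as follows. The data frame three_demo is a hypothetical stand-in for nh_adults, rebuilt from the eight cell counts above.

```r
library(janitor)

# One row per cell; expand.grid() varies Smoke100 fastest, then PhysActive
cells <- expand.grid(
  Smoke100 = c("No", "Yes"),
  PhysActive = c("No", "Yes"),
  SleepTrouble = c("No", "Yes")
)
cells$n <- c(99, 62, 142, 77, 21, 33, 35, 31)

# Expand to one row per subject
three_demo <- cells[rep(seq_len(nrow(cells)), times = cells$n), 1:3]

# Rows, columns, then the variable that splits the result into a list
three_way <- tabyl(three_demo, Smoke100, PhysActive, SleepTrouble)
three_way
```

Each list element is an ordinary two-way tabyl, so a single layer can be pulled out with, for example, three_way$Yes.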
The result here is a tabyl of Smoke100 (rows) by PhysActive (columns), split into a list by SleepTrouble. Another approach to get the same table is:
, , SleepTrouble = No
PhysActive
Smoke100 No Yes
No 99 142
Yes 62 77
, , SleepTrouble = Yes
PhysActive
Smoke100 No Yes
No 21 35
Yes 33 31
We can also build a flat version of this table, as follows:
Smoke100 No Yes
PhysActive SleepTrouble
No No 99 62
Yes 21 33
Yes No 142 77
Yes 35 31
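The three-dimensional table and its flat version can both be built in base R. The sketch below uses xtabs and ftable, which may differ from the calls used above, on a hypothetical stand-in (three_demo) rebuilt from the cell counts.

```r
# Stand-in for nh_adults: one row per cell, Smoke100 varying fastest
cells <- expand.grid(
  Smoke100 = c("No", "Yes"),
  PhysActive = c("No", "Yes"),
  SleepTrouble = c("No", "Yes")
)
cells$n <- c(99, 62, 142, 77, 21, 33, 35, 31)
three_demo <- cells[rep(seq_len(nrow(cells)), times = cells$n), 1:3]

# Three-dimensional table: one Smoke100 x PhysActive layer per SleepTrouble level
three_d <- xtabs(~ Smoke100 + PhysActive + SleepTrouble, data = three_demo)
three_d

# Flat version: column variable on the left of ~, row variables on the right
flat <- ftable(Smoke100 ~ PhysActive + SleepTrouble, data = three_demo)
flat
```

The order of the variables on the right-hand side of the ftable formula controls how the rows are nested.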
And we can do this with dplyr functions, as well, for example…
, , SleepTrouble = No
PhysActive
Smoke100 No Yes
No 99 142
Yes 62 77
, , SleepTrouble = Yes
PhysActive
Smoke100 No Yes
No 21 35
Yes 33 31
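One way to get this result with dplyr is to select the three variables and pass them to table. The sketch below assumes that approach and runs on a hypothetical stand-in (three_demo) rebuilt from the cell counts, not on nh_adults.

```r
library(dplyr)

# Stand-in for nh_adults: one row per cell, Smoke100 varying fastest
cells <- expand.grid(
  Smoke100 = c("No", "Yes"),
  PhysActive = c("No", "Yes"),
  SleepTrouble = c("No", "Yes")
)
cells$n <- c(99, 62, 142, 77, 21, 33, 35, 31)
three_demo <- cells[rep(seq_len(nrow(cells)), times = cells$n), 1:3]

# The order of the selected columns sets rows, columns, and layers
piped_table <- three_demo %>%
  select(Smoke100, PhysActive, SleepTrouble) %>%
  table()

piped_table
```

Changing the order inside select() changes which variable defines the layers.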
6.6 Constructing Tables Well
The prolific Howard Wainer is responsible for many interesting books on visualization and related issues, including Wainer (2005) and Wainer (2013). These rules come from Chapter 10 of Wainer (1997).
- Order the rows and columns in a way that makes sense.
- Round, a lot!
- ALL is different and important.
6.6.1 Alabama First!
Which of these Tables is more useful to you?
2013 Percent of Students in grades 9-12 who are obese
State | % Obese | 95% CI | Sample Size |
---|---|---|---|
Alabama | 17.1 | (14.6 - 19.9) | 1,499 |
Alaska | 12.4 | (10.5 - 14.6) | 1,167 |
Arizona | 10.7 | (8.3 - 13.6) | 1,520 |
Arkansas | 17.8 | (15.7 - 20.1) | 1,470 |
Connecticut | 12.3 | (10.2 - 14.7) | 2,270 |
Delaware | 14.2 | (12.9 - 15.6) | 2,475 |
Florida | 11.6 | (10.5 - 12.8) | 5,491 |
… | |||
Wisconsin | 11.6 | (9.7 - 13.9) | 2,771 |
Wyoming | 10.7 | (9.4 - 12.2) | 2,910 |
or …
State | % Obese | 95% CI | Sample Size |
---|---|---|---|
Kentucky | 18.0 | (15.7 - 20.6) | 1,537 |
Arkansas | 17.8 | (15.7 - 20.1) | 1,470 |
Alabama | 17.1 | (14.6 - 19.9) | 1,499 |
Tennessee | 16.9 | (15.1 - 18.8) | 1,831 |
Texas | 15.7 | (13.9 - 17.6) | 3,039 |
… | |||
Massachusetts | 10.2 | (8.5 - 12.1) | 2,547 |
Idaho | 9.6 | (8.2 - 11.1) | 1,841 |
Montana | 9.4 | (8.4 - 10.5) | 4,679 |
New Jersey | 8.7 | (6.8 - 11.2) | 1,644 |
Utah | 6.4 | (4.8 - 8.5) | 2,136 |
It is a rare event when Alabama first is the best choice.
6.6.2 Order rows and columns sensibly
- Alabama First!
- Size places - put the largest first. We often look most carefully at the top.
- Order time from the past to the future to help the viewer.
- If there is a clear predictor-outcome relationship, put the predictors in the rows and the outcomes in the columns.
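Reordering a table by size takes one line in R. As a small sketch, the code below sorts the first seven states of the alphabetical obesity table in Section 6.6.1 from largest to smallest. The data frame and column names (obesity, state, pct_obese) are made up for illustration.

```r
# The first seven rows of the alphabetical table in Section 6.6.1
obesity <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas",
            "Connecticut", "Delaware", "Florida"),
  pct_obese = c(17.1, 12.4, 10.7, 17.8, 12.3, 14.2, 11.6)
)

# "Size places": put the largest value first
by_size <- obesity[order(obesity$pct_obese, decreasing = TRUE), ]
by_size
```

The same idea applies to any summary table: choose the column the reader cares about most and sort on it before printing.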
6.6.3 Round - a lot!
- Humans cannot understand more than two digits very easily.
- We almost never care about accuracy of more than two digits.
- We can almost never justify more than two digits of accuracy statistically.
- It’s also helpful to remember that we are almost invariably publishing progress to date, rather than a truly final answer.
Suppose, for instance, we report a correlation coefficient of 0.25. How many observations do you think you would need to justify such a choice?
- To report 0.25 meaningfully, we want to be sure that the second digit isn’t 4 or 6.
- That requires a standard error less than 0.005.
- The standard error of most common statistics is roughly proportional to 1 over the square root of the sample size, n.
So \(\frac{1}{\sqrt{n}} \approx 0.005\), which means \(\sqrt{n} = \frac{1}{0.005} = 200\). If \(\sqrt{n} = 200\), then \(n = 200^2 = 40{,}000\).
Do we usually have 40,000 observations?
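The arithmetic can be checked in R. This sketch takes the rough rule used above at face value, treating the standard error as about 1 over the square root of n, and also shows the approximate standard error it implies at a few smaller sample sizes. The variable names are chosen for illustration.

```r
# Largest standard error that still pins down the second digit
target_se <- 0.005

# Rough rule: standard error is about 1 / sqrt(n), so n is about (1 / se)^2
n_needed <- (1 / target_se)^2
n_needed

# Approximate standard errors under the same rule for a few sample sizes
sample_sizes <- c(100, 1000, 10000, 40000)
data.frame(n = sample_sizes, approx_se = 1 / sqrt(sample_sizes))
```

Under this rule, a sample of 100 gives a standard error near 0.1, which supports only about one digit.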
6.6.4 ALL is different and important
Summaries of rows and columns provide a measure of what is typical or usual. Sometimes a sum is helpful, at other times, consider presenting a median or other summary. The ALL category, as Wainer (1997) suggests, should be both visually different from the individual entries and set spatially apart.
On the whole, it’s far easier to fall into a good graph in R (at least if you have some ggplot2 skills) than to produce a good table.
6.7 Gaining Control over Tables in R: the gt package
With the gt package, anyone can make wonderful-looking tables using the R programming language. The gt package is described in substantial detail at https://gt.rstudio.com/ and we’ll get started with it soon.
References
Wainer, Howard. 1997. Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot. New York: Springer-Verlag.
Wainer, Howard. 2005. Graphic Discovery: A Trout in the Milk and Other Visual Adventures. Princeton, NJ: Princeton University Press.
Wainer, Howard. 2013. Medical Illuminations: Using Evidence, Visualization and Statistical Thinking to Improve Healthcare. New York: Oxford University Press.