Chapter 6 Summarizing Categorical Variables
Summarizing categorical variables numerically is mostly about building tables and calculating percentages or proportions. We’ll save our discussion of modeling categorical data for later. Recall that the nh_adults data set we built in an earlier section includes the following categorical variables. The number of levels indicates the number of possible categories for each categorical variable.
Variable | Description | Levels | Type |
---|---|---|---|
Sex | sex of subject | 2 | binary |
Race | subject’s race | 6 | nominal |
Education | subject’s educational level | 5 | ordinal |
PhysActive | Participates in sports? | 2 | binary |
Smoke100 | Smoked 100+ cigarettes? | 2 | binary |
SleepTrouble | Trouble sleeping? | 2 | binary |
HealthGen | Self-report health | 5 | ordinal |
6.1 The summary function for Categorical data
When R recognizes a variable as categorical, it stores it as a factor. Such variables get special treatment from the summary function: in particular, it prints a table of the available values and their counts (so long as there aren’t too many).
nh_adults %>%
  select(Sex, Race, Education, PhysActive, Smoke100, SleepTrouble, HealthGen) %>%
  summary()
     Sex          Race              Education    PhysActive  Smoke100
 female:253   Asian   : 29   8th Grade     : 24   No :225    No :289
 male  :247   Black   : 57   9 - 11th Grade: 57   Yes:275    Yes:211
              Hispanic: 39   High School   : 81
              Mexican : 43   Some College  :153
              White   :322   College Grad  :185
              Other   : 10
 SleepTrouble     HealthGen
 No :362      Excellent: 51
 Yes:138      Vgood    :153
              Good     :172
              Fair     : 71
              Poor     :  7
              NA's     : 46
6.2 Tables to describe One Categorical Variable
Suppose we build a table to describe the HealthGen distribution.
.
Excellent Vgood Good Fair Poor <NA>
51 153 172 71 7 46
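The call that produced this table is not shown here. As a minimal sketch, assuming only the counts printed above, we can rebuild a stand-in for the HealthGen column and tabulate it with table(). The names health and health_levels are illustrative and do not come from nh_adults.

```r
# Stand-in for nh_adults$HealthGen, rebuilt from the counts printed above.
# `health` and `health_levels` are illustrative names, not objects from the text.
health_levels <- c("Excellent", "Vgood", "Good", "Fair", "Poor")
health <- factor(
  rep(c(health_levels, NA), times = c(51, 153, 172, 71, 7, 46)),
  levels = health_levels
)

# useNA = "ifany" adds an <NA> column only when missing values are present
health_tab <- table(health, useNA = "ifany")
health_tab
```

Leaving out the useNA argument would drop the 46 missing responses from the table silently.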
The main tools we have for augmenting tables are:
- adding in marginal totals, and
- working with proportions/percentages.
What if we want to add a total count?
.
Excellent Vgood Good Fair Poor <NA> Sum
51 153 172 71 7 46 500
What if we want to leave out the missing responses?
.
Excellent Vgood Good Fair Poor Sum
51 153 172 71 7 454
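One way to get both versions of the total is addmargins(), sketched here on a stand-in factor rebuilt from the counts above (health is an illustrative name, and this is not necessarily the code used to print the tables in this section).

```r
# Stand-in for nh_adults$HealthGen, rebuilt from the counts shown in this section
health_levels <- c("Excellent", "Vgood", "Good", "Fair", "Poor")
health <- factor(
  rep(c(health_levels, NA), times = c(51, 153, 172, 71, 7, 46)),
  levels = health_levels
)

# Total over everyone, with the missing responses kept as their own category
with_na <- addmargins(table(health, useNA = "ifany"))

# table() drops NA by default, so this total covers respondents only
without_na <- addmargins(table(health))

with_na
without_na
```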
Let’s put the missing values back in, but now calculate proportions instead. Since the total will just be 1.0, we’ll leave that out.
.
Excellent Vgood Good Fair Poor <NA>
0.102 0.306 0.344 0.142 0.014 0.092
Now, we’ll calculate percentages by multiplying the proportions by 100.
.
Excellent Vgood Good Fair Poor <NA>
10.2 30.6 34.4 14.2 1.4 9.2
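A sketch of the proportion and percentage steps, again on a stand-in factor rebuilt from the counts above (health is an illustrative name). prop.table() divides each count by the table total, and multiplying by 100 converts to percentages.

```r
# Stand-in for nh_adults$HealthGen, rebuilt from the counts shown in this section
health_levels <- c("Excellent", "Vgood", "Good", "Fair", "Poor")
health <- factor(
  rep(c(health_levels, NA), times = c(51, 153, 172, 71, 7, 46)),
  levels = health_levels
)

# Proportions of all 500 subjects, keeping the missing responses as a category
props <- prop.table(table(health, useNA = "ifany"))
round(props, 3)

# Percentages: the same proportions multiplied by 100
pcts <- 100 * props
round(pcts, 1)
```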
6.3 The Mode of a Categorical Variable
A common summary of a categorical variable is its mode, the most frequently observed value. To find the mode for a variable with many categories (so that the summary output may not be sufficient), we usually tabulate the data and then sort by the number of observations in each category, as we did with discrete quantitative variables.
# A tibble: 6 x 2
HealthGen count
<fct> <int>
1 Good 172
2 Vgood 153
3 Fair 71
4 Excellent 51
5 <NA> 46
6 Poor 7
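The pipeline behind this output is not shown. One way to produce a table of this shape, assuming the dplyr package is available, is to group, count, and sort in descending order. The data frame nh_stand_in below is a hypothetical stand-in rebuilt from the counts above, not nh_adults itself.

```r
library(dplyr)

# Hypothetical stand-in for nh_adults with only the HealthGen column,
# rebuilt from the counts reported in this chapter
health_levels <- c("Excellent", "Vgood", "Good", "Fair", "Poor")
nh_stand_in <- tibble(
  HealthGen = factor(
    rep(c(health_levels, NA), times = c(51, 153, 172, 71, 7, 46)),
    levels = health_levels
  )
)

# Count observations per category, then sort so the mode comes first
mode_tab <- nh_stand_in %>%
  group_by(HealthGen) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

mode_tab
```

The first row of the sorted result identifies the mode, which here is Good.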
6.4 describe in the Hmisc package
Hmisc::describe(nh_adults %>%
select(Sex, Race, Education, PhysActive,
Smoke100, SleepTrouble, HealthGen))
nh_adults %>% select(Sex, Race, Education, PhysActive, Smoke100, SleepTrouble, HealthGen)
7 Variables 500 Observations
---------------------------------------------------------------------------
Sex
n missing distinct
500 0 2
Value female male
Frequency 253 247
Proportion 0.506 0.494
---------------------------------------------------------------------------
Race
n missing distinct
500 0 6
Value Asian Black Hispanic Mexican White Other
Frequency 29 57 39 43 322 10
Proportion 0.058 0.114 0.078 0.086 0.644 0.020
---------------------------------------------------------------------------
Education
n missing distinct
500 0 5
Value 8th Grade 9 - 11th Grade High School Some College
Frequency 24 57 81 153
Proportion 0.048 0.114 0.162 0.306
Value College Grad
Frequency 185
Proportion 0.370
---------------------------------------------------------------------------
PhysActive
n missing distinct
500 0 2
Value No Yes
Frequency 225 275
Proportion 0.45 0.55
---------------------------------------------------------------------------
Smoke100
n missing distinct
500 0 2
Value No Yes
Frequency 289 211
Proportion 0.578 0.422
---------------------------------------------------------------------------
SleepTrouble
n missing distinct
500 0 2
Value No Yes
Frequency 362 138
Proportion 0.724 0.276
---------------------------------------------------------------------------
HealthGen
n missing distinct
454 46 5
Value Excellent Vgood Good Fair Poor
Frequency 51 153 172 71 7
Proportion 0.112 0.337 0.379 0.156 0.015
---------------------------------------------------------------------------
6.5 Cross-Tabulations
It is very common for us to want to describe the association of one categorical variable with another. For instance, is there a relationship between Education and SleepTrouble in these data?
SleepTrouble
Education No Yes Sum
8th Grade 15 9 24
9 - 11th Grade 40 17 57
High School 67 14 81
Some College 107 46 153
College Grad 133 52 185
Sum 362 138 500
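With the raw data, a cross-tabulation like this usually comes from passing both variables to table() and then adding totals; the exact call used here is not shown, so that is an assumption. The sketch below instead enters the cell counts from the table above by hand (edu_sleep is an illustrative name) and lets addmargins() supply the row and column sums.

```r
# Cell counts copied from the table above; `edu_sleep` is an illustrative name
edu_levels <- c("8th Grade", "9 - 11th Grade", "High School",
                "Some College", "College Grad")
edu_sleep <- as.table(matrix(
  c( 15,  9,
     40, 17,
     67, 14,
    107, 46,
    133, 52),
  ncol = 2, byrow = TRUE,
  dimnames = list(Education = edu_levels, SleepTrouble = c("No", "Yes"))
))

# Append a "Sum" row and a "Sum" column
with_totals <- addmargins(edu_sleep)
with_totals
```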
To get row percentages, we can use:
SleepTrouble
Education No Yes
8th Grade 62.50000 37.50000
9 - 11th Grade 70.17544 29.82456
High School 82.71605 17.28395
Some College 69.93464 30.06536
College Grad 71.89189 28.10811
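A sketch of the row-percentage calculation, using prop.table() with margin 1 on a table entered by hand from the counts reported earlier in this section (edu_sleep is an illustrative name). Rounding to one decimal place is a choice made here for readability.

```r
# Cell counts for Education by SleepTrouble, as reported in this section
edu_levels <- c("8th Grade", "9 - 11th Grade", "High School",
                "Some College", "College Grad")
edu_sleep <- as.table(matrix(
  c(15, 9, 40, 17, 67, 14, 107, 46, 133, 52),
  ncol = 2, byrow = TRUE,
  dimnames = list(Education = edu_levels, SleepTrouble = c("No", "Yes"))
))

# margin = 1 divides each cell by its row total, so each row sums to 100
row_pct <- 100 * prop.table(edu_sleep, margin = 1)
round(row_pct, 1)
```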
For column percentages, we use 2 instead of 1 in the prop.table function. Here, we’ll also round off to two decimal places:
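nh_adults %>%
  select(Education, SleepTrouble) %>%
  table() %>%
  prop.table(., 2) %>%   # margin = 2: proportions within each column
  "*"(100) %>%           # convert proportions to percentages
  round(., 2)            # round to two decimal places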
SleepTrouble
Education No Yes
8th Grade 4.14 6.52
9 - 11th Grade 11.05 12.32
High School 18.51 10.14
Some College 29.56 33.33
College Grad 36.74 37.68
Here’s another approach, to look at the cross-classification of Race and HealthGen:
HealthGen
Race Excellent Vgood Good Fair Poor
Asian 4 7 9 2 1
Black 7 11 16 11 2
Hispanic 1 9 18 8 0
Mexican 5 6 12 16 1
White 34 115 115 32 3
Other 0 5 2 2 0
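The code for this approach is not shown, so which function it used is unknown. One candidate is xtabs(), which builds a cross-classification from a formula. The sketch below is an assumption in that sense: it starts from a data frame of cell counts copied from the table above (cells and n are illustrative names) rather than from nh_adults.

```r
races <- c("Asian", "Black", "Hispanic", "Mexican", "White", "Other")
health_levels <- c("Excellent", "Vgood", "Good", "Fair", "Poor")

# One row per Race x HealthGen cell; Race varies fastest in expand.grid()
cells <- expand.grid(
  Race = factor(races, levels = races),
  HealthGen = factor(health_levels, levels = health_levels)
)

# Counts copied from the table above, one HealthGen column at a time
cells$n <- c(
  4,  7,  1,  5,  34, 0,   # Excellent
  7, 11,  9,  6, 115, 5,   # Vgood
  9, 16, 18, 12, 115, 2,   # Good
  2, 11,  8, 16,  32, 2,   # Fair
  1,  2,  0,  1,   3, 0    # Poor
)

# The left-hand side of the formula gives the counts to sum within each cell
race_health <- xtabs(n ~ Race + HealthGen, data = cells)
race_health
```

With one row per subject instead of one row per cell, the left-hand side is left empty, as in xtabs(~ Race + HealthGen, data = ...).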
6.5.1 Cross-Classifying Three Categorical Variables
Suppose we are interested in Smoke100 and its relationship to PhysActive and SleepTrouble.
, , SleepTrouble = No
PhysActive
Smoke100 No Yes
No 99 135
Yes 62 66
, , SleepTrouble = Yes
PhysActive
Smoke100 No Yes
No 26 29
Yes 38 45
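Passing three factors to table() typically gives a layout like this, with one two-way slice per level of the third variable; the call used here is not shown, so that is an assumption. As a sketch, the same object can be rebuilt from the printed counts as a 2 x 2 x 2 array (smoke_tab is an illustrative name).

```r
# Counts copied from the output above. Arrays fill with the first
# dimension (Smoke100) varying fastest, then PhysActive, then SleepTrouble.
smoke_tab <- as.table(array(
  c(99, 62, 135, 66,    # SleepTrouble = No
    26, 38,  29, 45),   # SleepTrouble = Yes
  dim = c(2, 2, 2),
  dimnames = list(
    Smoke100     = c("No", "Yes"),
    PhysActive   = c("No", "Yes"),
    SleepTrouble = c("No", "Yes")
  )
))
smoke_tab

# Collapse over SleepTrouble to get the two-way Smoke100 x PhysActive table
two_way <- margin.table(smoke_tab, margin = c(1, 2))
two_way
```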
We can also build a flat version of this table, as follows:
                        Smoke100  No Yes
PhysActive SleepTrouble
No         No                     99  62
           Yes                    26  38
Yes        No                    135  66
           Yes                    29  45
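A flat layout of this kind can be produced with ftable(). The sketch below applies it to a three-way table rebuilt from the counts in this section (smoke_tab is an illustrative name); whether the output above came from this exact call is an assumption.

```r
# Three-way table rebuilt from the counts reported in this section
smoke_tab <- as.table(array(
  c(99, 62, 135, 66, 26, 38, 29, 45),
  dim = c(2, 2, 2),
  dimnames = list(
    Smoke100     = c("No", "Yes"),
    PhysActive   = c("No", "Yes"),
    SleepTrouble = c("No", "Yes")
  )
))

# row.vars names the variables that define the rows;
# the remaining variable (Smoke100) goes across the columns
flat <- ftable(smoke_tab, row.vars = c("PhysActive", "SleepTrouble"))
flat
```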
And we can do this with dplyr functions, as well, for example…
, , SleepTrouble = No
PhysActive
Smoke100 No Yes
No 99 135
Yes 62 66
, , SleepTrouble = Yes
PhysActive
Smoke100 No Yes
No 26 29
Yes 38 45
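The dplyr code behind this output is not shown. A sketch of two options, assuming the dplyr package is available: select the three columns and pass them to table(), which gives the layout above, or use count() for a long-format version. The data frame subjects is a hypothetical stand-in with one row per subject, expanded from the printed cell counts.

```r
library(dplyr)

# Cell counts copied from the output above; Smoke100 varies fastest
cells <- expand.grid(
  Smoke100     = c("No", "Yes"),
  PhysActive   = c("No", "Yes"),
  SleepTrouble = c("No", "Yes")
)
cells$n <- c(99, 62, 135, 66, 26, 38, 29, 45)

# Hypothetical stand-in for nh_adults: repeat each cell's row n times
subjects <- cells[rep(seq_len(nrow(cells)), times = cells$n), 1:3]

# Option 1: select the columns, then cross-classify with table()
three_way <- subjects %>%
  select(Smoke100, PhysActive, SleepTrouble) %>%
  table()
three_way

# Option 2: a long-format count, one row per combination
long_counts <- subjects %>%
  count(Smoke100, PhysActive, SleepTrouble)
long_counts
```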
6.6 Constructing Tables Well
The prolific Howard Wainer is responsible for many interesting books on visualization and related issues, including Wainer (2005) and Wainer (2013). These rules come from Chapter 10 of Wainer (1997).
- Order the rows and columns in a way that makes sense.
- Round, a lot!
- ALL is different and important
6.6.1 Alabama First!
Which of these tables is more useful to you?
2013 Percent of Students in grades 9-12 who are obese
State | % Obese | 95% CI | Sample Size |
---|---|---|---|
Alabama | 17.1 | (14.6 - 19.9) | 1,499 |
Alaska | 12.4 | (10.5 - 14.6) | 1,167 |
Arizona | 10.7 | (8.3 - 13.6) | 1,520 |
Arkansas | 17.8 | (15.7 - 20.1) | 1,470 |
Connecticut | 12.3 | (10.2 - 14.7) | 2,270 |
Delaware | 14.2 | (12.9 - 15.6) | 2,475 |
Florida | 11.6 | (10.5 - 12.8) | 5,491 |
… | | | |
Wisconsin | 11.6 | (9.7 - 13.9) | 2,771 |
Wyoming | 10.7 | (9.4 - 12.2) | 2,910 |
or …
State | % Obese | 95% CI | Sample Size |
---|---|---|---|
Kentucky | 18.0 | (15.7 - 20.6) | 1,537 |
Arkansas | 17.8 | (15.7 - 20.1) | 1,470 |
Alabama | 17.1 | (14.6 - 19.9) | 1,499 |
Tennessee | 16.9 | (15.1 - 18.8) | 1,831 |
Texas | 15.7 | (13.9 - 17.6) | 3,039 |
… | | | |
Massachusetts | 10.2 | (8.5 - 12.1) | 2,547 |
Idaho | 9.6 | (8.2 - 11.1) | 1,841 |
Montana | 9.4 | (8.4 - 10.5) | 4,679 |
New Jersey | 8.7 | (6.8 - 11.2) | 1,644 |
Utah | 6.4 | (4.8 - 8.5) | 2,136 |
It is a rare event when Alabama first is the best choice.
6.6.2 Order rows and columns sensibly
- Alabama First!
- Size places - put the largest first. We often look most carefully at the top.
- Order time from the past to the future to help the viewer.
- If there is a clear predictor-outcome relationship, put the predictors in the rows and the outcomes in the columns.
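As a small illustration of ordering by size rather than alphabetically, here is a sketch that sorts a few of the states from the obesity table above by their percentage. The data frame and column names (obesity, state, pct_obese) are illustrative.

```r
# A few rows copied from the obesity table above, in alphabetical order
obesity <- data.frame(
  state     = c("Alabama", "Alaska", "Arizona", "Arkansas", "Connecticut"),
  pct_obese = c(17.1, 12.4, 10.7, 17.8, 12.3)
)

# Reorder the rows so the largest percentage comes first
by_size <- obesity[order(obesity$pct_obese, decreasing = TRUE), ]
by_size
```

Sorted this way, Arkansas and Alabama move to the top, where readers tend to look first.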
6.6.3 Round - a lot!
- Humans cannot understand more than two digits very easily.
- We almost never care about accuracy of more than two digits.
- We can almost never justify more than two digits of accuracy statistically.
- It’s also helpful to remember that we are almost invariably publishing progress to date, rather than a truly final answer.
Suppose, for instance, we report a correlation coefficient of 0.25. How many observations do you think you would need to justify such a choice?
- To report 0.25 meaningfully, we want to be sure that the second digit isn’t 4 or 6.
- That requires a standard error less than 0.005.
- The standard error of many common statistics is roughly proportional to 1 over the square root of the sample size, n.
So \(\frac{1}{\sqrt{n}} \approx 0.005\), which means \(\sqrt{n} = \frac{1}{0.005} = 200\). If \(\sqrt{n} = 200\), then \(n = 200^2 = 40{,}000\).
Do we usually have 40,000 observations?
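The arithmetic above can be checked in a couple of lines. This sketch takes the rough rule from the text at face value (standard error approximately 1 over the square root of n); target_se and n_needed are illustrative names.

```r
# Largest standard error that still pins down the second digit
target_se <- 0.005

# Solve 1 / sqrt(n) = target_se for n
n_needed <- (1 / target_se)^2
n_needed   # about 40,000

# Under the same rule, 500 subjects give a standard error of about 0.045
se_500 <- 1 / sqrt(500)
se_500
```

By this rule, a sample of 500 like nh_adults supports roughly one reliable digit in a correlation, not two.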
6.6.4 ALL is different and important
Summaries of rows and columns provide a measure of what is typical or usual. Sometimes a sum is helpful; at other times, consider presenting a median or other summary. The ALL category, as Wainer (1997) suggests, should be both visually different from the individual entries and set spatially apart.
On the whole, it’s far easier to fall into a good graph in R (at least if you have some ggplot2 skills) than to produce a good table.
References
Wainer, Howard. 2005. Graphic Discovery: A Trout in the Milk and Other Visual Adventures. Princeton, NJ: Princeton University Press.
Wainer, Howard. 2013. Medical Illuminations: Using Evidence, Visualization and Statistical Thinking to Improve Healthcare. New York: Oxford University Press.
Wainer, Howard. 1997. Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot. New York: Springer-Verlag.