8 Summarizing Categories
8.1 Setup: Packages Used Here
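The chapter calls functions from several packages. The list below is inferred from the function calls that appear later (read_rds, select and filter from the tidyverse; tabyl and the adorn functions from janitor; kbl and kable_styling from kableExtra; gt, tab_header and md from gt), so treat it as a plausible setup rather than the original one. Hmisc is called with the Hmisc:: prefix, so it only needs to be installed, not attached.

```r
# Packages inferred from the functions used in this chapter
# (an assumption, not necessarily the original setup chunk)
library(janitor)     # tabyl() and the adorn_*() functions
library(gt)          # gt(), tab_header(), md()
library(kableExtra)  # kbl(), kable_styling()
library(tidyverse)   # read_rds(), select(), filter(), tibble()
```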
8.2 Using the nh_adult750 data again
To demonstrate key ideas in this chapter, we will again consider our sample of 750 adults ages 21-64 from NHANES 2011-12, which includes some missing values. We’ll load into the nh_750 data frame the information from the nh_adult750.Rds file we created in Section @ref(newNHANES).
nh_750 <- read_rds("data/nh_adult750.Rds")
Summarizing categorical variables numerically is mostly about building tables, and calculating percentages or proportions. We’ll save our discussion of modeling categorical data for later. Recall that in the nh_750 data set we built in Section @ref(newNHANES) we had the following categorical variables. The number of levels indicates the number of possible categories for each categorical variable.
Variable | Description | Levels | Type |
---|---|---|---|
Sex | sex of subject | 2 | binary |
Race | subject’s race | 6 | nominal |
Education | subject’s educational level | 5 | ordinal |
PhysActive | Participates in sports? | 2 | binary |
Smoke100 | Smoked 100+ cigarettes? | 2 | binary |
SleepTrouble | Trouble sleeping? | 2 | binary |
HealthGen | Self-report health | 5 | ordinal |
8.3 The summary function for Categorical data
When R recognizes a variable as categorical, it stores it as a factor. Such variables get special treatment from the summary function, in particular a table of available values (so long as there aren’t too many).
nh_750 |>
select(Sex, Race, Education, PhysActive, Smoke100,
SleepTrouble, HealthGen, MaritalStatus) |>
summary()
     Sex            Race              Education    PhysActive Smoke100
 female:388   Asian   : 70   8th Grade     : 50    No :326    No :453
 male  :362   Black   :128   9 - 11th Grade: 76    Yes:424    Yes:297
              Hispanic: 63   High School   :143
              Mexican : 80   Some College  :241
              White   :393   College Grad  :240
              Other   : 16
 SleepTrouble     HealthGen        MaritalStatus
 No :555      Excellent: 84   Divorced    : 78
 Yes:195      Vgood    :197   LivePartner : 70
              Good     :252   Married     :388
              Fair     :104   NeverMarried:179
              Poor     : 14   Separated   : 19
              NA's     : 99   Widowed     : 16
8.4 Tables to describe One Categorical Variable
Suppose we build a table (using the tabyl function from the janitor package) to describe the HealthGen distribution.
nh_750 |>
tabyl(HealthGen) |>
adorn_pct_formatting()
HealthGen n percent valid_percent
Excellent 84 11.2% 12.9%
Vgood 197 26.3% 30.3%
Good 252 33.6% 38.7%
Fair 104 13.9% 16.0%
Poor 14 1.9% 2.2%
<NA> 99 13.2% -
Note how the missing (<NA>) values are not included in the valid_percent calculation, but are in the percent calculation. Note also the use of percentage formatting.
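To make the difference between the two columns concrete, here is a small base R sketch that recomputes them from the counts printed above. The object names (n, n_missing, percent, valid_percent) are chosen here for illustration; this is not how janitor computes the columns internally, only the same arithmetic.

```r
# Counts copied from the tabyl output above
n <- c(Excellent = 84, Vgood = 197, Good = 252, Fair = 104, Poor = 14)
n_missing <- 99

# percent: each count divided by all 750 rows, missing values included
percent <- 100 * n / (sum(n) + n_missing)

# valid_percent: each count divided by the 651 non-missing rows only
valid_percent <- 100 * n / sum(n)

round(rbind(percent, valid_percent), 1)
```

The only difference is the denominator: 750 for percent and 651 for valid_percent.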
What if we want to add a total count, sometimes called the marginal total?
nh_750 |>
tabyl(HealthGen) |>
adorn_totals() |>
adorn_pct_formatting()
HealthGen n percent valid_percent
Excellent 84 11.2% 12.9%
Vgood 197 26.3% 30.3%
Good 252 33.6% 38.7%
Fair 104 13.9% 16.0%
Poor 14 1.9% 2.2%
<NA> 99 13.2% -
Total 750 100.0% 100.0%
What about marital status, which has no missing data in our sample?
nh_750 |>
tabyl(MaritalStatus) |>
adorn_totals() |>
adorn_pct_formatting()
MaritalStatus n percent
Divorced 78 10.4%
LivePartner 70 9.3%
Married 388 51.7%
NeverMarried 179 23.9%
Separated 19 2.5%
Widowed 16 2.1%
Total 750 100.0%
8.5 Constructing Tables Well
The prolific Howard Wainer is responsible for many interesting books on visualization and related issues, including Wainer (2005) and Wainer (2013). These rules come from Chapter 10 of Wainer (1997).
- Order the rows and columns in a way that makes sense.
- Round, a lot!
- ALL is different and important
8.5.1 Alabama First!
Which of these tables is more useful to you?
2013 Percent of Students in grades 9-12 who are obese
State | % Obese | 95% CI | Sample Size |
---|---|---|---|
Alabama | 17.1 | (14.6-19.9) | 1,499 |
Alaska | 12.4 | (10.5-14.6) | 1,167 |
Arizona | 10.7 | (8.3-13.6) | 1,520 |
Arkansas | 17.8 | (15.7-20.1) | 1,470 |
Connecticut | 12.3 | (10.2-14.7) | 2,270 |
Delaware | 14.2 | (12.9-15.6) | 2,475 |
Florida | 11.6 | (10.5-12.8) | 5,491 |
… | | | |
Wisconsin | 11.6 | (9.7-13.9) | 2,771 |
Wyoming | 10.7 | (9.4-12.2) | 2,910 |
or …
State | % Obese | 95% CI | Sample Size |
---|---|---|---|
Kentucky | 18.0 | (15.7 - 20.6) | 1,537 |
Arkansas | 17.8 | (15.7 - 20.1) | 1,470 |
Alabama | 17.1 | (14.6 - 19.9) | 1,499 |
Tennessee | 16.9 | (15.1 - 18.8) | 1,831 |
Texas | 15.7 | (13.9 - 17.6) | 3,039 |
… | | | |
Massachusetts | 10.2 | (8.5 - 12.1) | 2,547 |
Idaho | 9.6 | (8.2 - 11.1) | 1,841 |
Montana | 9.4 | (8.4 - 10.5) | 4,679 |
New Jersey | 8.7 | (6.8 - 11.2) | 1,644 |
Utah | 6.4 | (4.8 - 8.5) | 2,136 |
It is a rare event when alphabetical order (“Alabama first”) is the best choice.
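Reordering rows by the quantity of interest takes one line in R. The sketch below uses four rows copied from the alphabetical table; the data frame and column names (obesity, state, pct_obese) are hypothetical labels chosen for this illustration.

```r
# A few rows copied from the alphabetical table above
obesity <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas"),
  pct_obese = c(17.1, 12.4, 10.7, 17.8)
)

# Reorder the rows from highest to lowest obesity percentage
obesity_sorted <- obesity[order(obesity$pct_obese, decreasing = TRUE), ]
obesity_sorted
```

With dplyr loaded, arrange(desc(pct_obese)) would be an equivalent way to get the same ordering.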
8.5.2 ALL is different and important
Summaries of rows and columns provide a measure of what is typical or usual. Sometimes a sum is helpful; at other times, consider presenting a median or other summary. The ALL category, as Wainer (1997) suggests, should be both visually different from the individual entries and set spatially apart.
On the whole, it’s far easier to fall into a good graph in R (at least if you have some ggplot2 skills) than to produce a good table.
8.6 The Mode of a Categorical Variable
A common measure applied to a categorical variable is the mode, the most frequently observed value. To find the mode for variables with lots of categories (so that the summary may not be sufficient), we usually tabulate the data, and then sort by the number of observations in each category, as we did with discrete quantitative variables.
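As a minimal sketch of the tabulate-then-sort idea, here is a base R version on a small made-up factor. The data and the object names (colors, counts, mode_value) are hypothetical and are not part of nh_750.

```r
# Toy data (hypothetical), standing in for a variable with many categories
colors <- factor(c("red", "blue", "green", "blue", "red", "blue", "teal"))

# Tabulate, then sort the counts from largest to smallest
counts <- sort(table(colors), decreasing = TRUE)
counts

# The mode is the name of the first (largest) entry
mode_value <- names(counts)[1]
mode_value
```

If two categories tie for the largest count, this picks only the first of them, so it is worth looking at the sorted table as well as the single value. With dplyr loaded, count() with sort = TRUE produces the same ordering for a column of a data frame.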
8.7 describe in the Hmisc package
Hmisc::describe(nh_750 |>
select(Sex, Race, Education, PhysActive,
Smoke100, SleepTrouble,
HealthGen, MaritalStatus))
select(nh_750, Sex, Race, Education, PhysActive, Smoke100, SleepTrouble, HealthGen, MaritalStatus)
8 Variables 750 Observations
--------------------------------------------------------------------------------
Sex
n missing distinct
750 0 2
Value female male
Frequency 388 362
Proportion 0.517 0.483
--------------------------------------------------------------------------------
Race
n missing distinct
750 0 6
Value Asian Black Hispanic Mexican White Other
Frequency 70 128 63 80 393 16
Proportion 0.093 0.171 0.084 0.107 0.524 0.021
--------------------------------------------------------------------------------
Education
n missing distinct
750 0 5
Value 8th Grade 9 - 11th Grade High School Some College
Frequency 50 76 143 241
Proportion 0.067 0.101 0.191 0.321
Value College Grad
Frequency 240
Proportion 0.320
--------------------------------------------------------------------------------
PhysActive
n missing distinct
750 0 2
Value No Yes
Frequency 326 424
Proportion 0.435 0.565
--------------------------------------------------------------------------------
Smoke100
n missing distinct
750 0 2
Value No Yes
Frequency 453 297
Proportion 0.604 0.396
--------------------------------------------------------------------------------
SleepTrouble
n missing distinct
750 0 2
Value No Yes
Frequency 555 195
Proportion 0.74 0.26
--------------------------------------------------------------------------------
HealthGen
n missing distinct
651 99 5
Value Excellent Vgood Good Fair Poor
Frequency 84 197 252 104 14
Proportion 0.129 0.303 0.387 0.160 0.022
--------------------------------------------------------------------------------
MaritalStatus
n missing distinct
750 0 6
Value Divorced LivePartner Married NeverMarried Separated
Frequency 78 70 388 179 19
Proportion 0.104 0.093 0.517 0.239 0.025
Value Widowed
Frequency 16
Proportion 0.021
--------------------------------------------------------------------------------
8.8 Cross-Tabulations of Two Variables
It is very common for us to want to describe the association of one categorical variable with another. For instance, is there a relationship between Education and SleepTrouble in these data?
nh_750 |>
tabyl(Education, SleepTrouble) |>
adorn_totals(where = c("row", "col"))
Education No Yes Total
8th Grade 40 10 50
9 - 11th Grade 52 24 76
High School 102 41 143
Some College 173 68 241
College Grad 188 52 240
Total 555 195 750
Note the use of adorn_totals to get the marginal counts, and how we specify that we want both the row and column totals. We can add a title for the columns with…
nh_750 |>
tabyl(Education, SleepTrouble) |>
adorn_totals(where = c("row", "col")) |>
adorn_title(placement = "combined")
Education/SleepTrouble No Yes Total
8th Grade 40 10 50
9 - 11th Grade 52 24 76
High School 102 41 143
Some College 173 68 241
College Grad 188 52 240
Total 555 195 750
Often, we’ll want to show percentages in a cross-tabulation like this. To get row percentages so that we can directly see the proportion of subjects with SleepTrouble = Yes for each level of Education, we can use:
nh_750 |>
tabyl(Education, SleepTrouble) |>
adorn_totals(where = "row") |>
adorn_percentages(denominator = "row") |>
adorn_pct_formatting() |>
adorn_title(placement = "combined")
Education/SleepTrouble No Yes
8th Grade 80.0% 20.0%
9 - 11th Grade 68.4% 31.6%
High School 71.3% 28.7%
Some College 71.8% 28.2%
College Grad 78.3% 21.7%
Total 74.0% 26.0%
If we want to compare the distribution of Education between the two levels of SleepTrouble with column percentages, we can use the following…
nh_750 |>
tabyl(Education, SleepTrouble) |>
adorn_totals(where = "col") |>
adorn_percentages(denominator = "col") |>
adorn_pct_formatting() |>
adorn_title(placement = "combined")
Education/SleepTrouble No Yes Total
8th Grade 7.2% 5.1% 6.7%
9 - 11th Grade 9.4% 12.3% 10.1%
High School 18.4% 21.0% 19.1%
Some College 31.2% 34.9% 32.1%
College Grad 33.9% 26.7% 32.0%
If we want overall percentages in the cells of the table, so that the total across all combinations of Education and SleepTrouble is 100%, we can use:
nh_750 |>
tabyl(Education, SleepTrouble) |>
adorn_totals(where = c("row", "col")) |>
adorn_percentages(denominator = "all") |>
adorn_pct_formatting() |>
adorn_title(placement = "combined") |>
kbl(align = 'lrrr')
Education/SleepTrouble | No | Yes | Total |
---|---|---|---|
8th Grade | 5.3% | 1.3% | 6.7% |
9 - 11th Grade | 6.9% | 3.2% | 10.1% |
High School | 13.6% | 5.5% | 19.1% |
Some College | 23.1% | 9.1% | 32.1% |
College Grad | 25.1% | 6.9% | 32.0% |
Total | 74.0% | 26.0% | 100.0% |
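The three choices of denominator ("row", "col" and "all") correspond to three different divisions of the same counts. The base R sketch below reproduces them with prop.table on the counts from the Education by SleepTrouble table. The object names are chosen here for illustration, and this is the same arithmetic rather than janitor's own implementation.

```r
# Counts copied from the Education by SleepTrouble table above
counts <- matrix(
  c( 40, 10,
     52, 24,
    102, 41,
    173, 68,
    188, 52),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Education = c("8th Grade", "9 - 11th Grade", "High School",
                  "Some College", "College Grad"),
    SleepTrouble = c("No", "Yes")
  )
)

row_pct <- 100 * prop.table(counts, margin = 1)  # each row sums to 100
col_pct <- 100 * prop.table(counts, margin = 2)  # each column sums to 100
all_pct <- 100 * prop.table(counts)              # all cells together sum to 100

round(row_pct, 1)
round(col_pct, 1)
round(all_pct, 1)
```

Row percentages answer "within each education level, what share reports trouble sleeping?", column percentages answer "within each sleep group, how is education distributed?", and overall percentages describe each combination as a share of all 750 subjects.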
Another common approach is to include both counts and percentages in a cross-tabulation. Let’s look at the breakdown of HealthGen by MaritalStatus.
nh_750 |>
tabyl(MaritalStatus, HealthGen) |>
adorn_totals(where = c("row")) |>
adorn_percentages(denominator = "row") |>
adorn_pct_formatting() |>
adorn_ns(position = "front") |>
adorn_title(placement = "combined") |>
kbl(align = 'lrrrrrr') |>
kable_styling(full_width = FALSE)
MaritalStatus/HealthGen | Excellent | Vgood | Good | Fair | Poor | NA_ |
---|---|---|---|---|---|---|
Divorced | 7 (9.0%) | 19 (24.4%) | 29 (37.2%) | 11 (14.1%) | 3 (3.8%) | 9 (11.5%) |
LivePartner | 4 (5.7%) | 19 (27.1%) | 25 (35.7%) | 18 (25.7%) | 0 (0.0%) | 4 (5.7%) |
Married | 46 (11.9%) | 101 (26.0%) | 130 (33.5%) | 41 (10.6%) | 6 (1.5%) | 64 (16.5%) |
NeverMarried | 25 (14.0%) | 52 (29.1%) | 56 (31.3%) | 24 (13.4%) | 3 (1.7%) | 19 (10.6%) |
Separated | 2 (10.5%) | 3 (15.8%) | 4 (21.1%) | 8 (42.1%) | 0 (0.0%) | 2 (10.5%) |
Widowed | 0 (0.0%) | 3 (18.8%) | 8 (50.0%) | 2 (12.5%) | 2 (12.5%) | 1 (6.2%) |
Total | 84 (11.2%) | 197 (26.3%) | 252 (33.6%) | 104 (13.9%) | 14 (1.9%) | 99 (13.2%) |
What if we wanted to ignore the missing HealthGen values? Most often, I filter down to the complete observations.
nh_750 |>
filter(complete.cases(MaritalStatus, HealthGen)) |>
tabyl(MaritalStatus, HealthGen) |>
adorn_totals(where = c("row")) |>
adorn_percentages(denominator = "row") |>
adorn_pct_formatting() |>
adorn_ns(position = "front") |>
adorn_title(placement = "combined") |>
kbl(align = 'lrrrrr') |>
kable_styling(full_width = FALSE)
MaritalStatus/HealthGen | Excellent | Vgood | Good | Fair | Poor |
---|---|---|---|---|---|
Divorced | 7 (10.1%) | 19 (27.5%) | 29 (42.0%) | 11 (15.9%) | 3 (4.3%) |
LivePartner | 4 (6.1%) | 19 (28.8%) | 25 (37.9%) | 18 (27.3%) | 0 (0.0%) |
Married | 46 (14.2%) | 101 (31.2%) | 130 (40.1%) | 41 (12.7%) | 6 (1.9%) |
NeverMarried | 25 (15.6%) | 52 (32.5%) | 56 (35.0%) | 24 (15.0%) | 3 (1.9%) |
Separated | 2 (11.8%) | 3 (17.6%) | 4 (23.5%) | 8 (47.1%) | 0 (0.0%) |
Widowed | 0 (0.0%) | 3 (20.0%) | 8 (53.3%) | 2 (13.3%) | 2 (13.3%) |
Total | 84 (12.9%) | 197 (30.3%) | 252 (38.7%) | 104 (16.0%) | 14 (2.2%) |
For more on working with tabyls, see this overview of janitor functions. There you’ll find a complete list of all of the adorn functions, for example.
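Here’s another approach, to look at the cross-classification of Race and HealthGen. Note that xtabs drops observations with missing values by default, so the 99 subjects with missing HealthGen are not counted in the table below.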
xtabs(~ Race + HealthGen, data = nh_750)
HealthGen
Race Excellent Vgood Good Fair Poor
Asian 10 17 24 6 1
Black 15 28 40 24 4
Hispanic 4 9 24 13 2
Mexican 6 12 25 21 2
White 48 128 131 37 5
Other 1 3 8 3 0
8.9 Cross-Classifying Three Categorical Variables
Suppose we are interested in Smoke100 and its relationship to PhysActive and SleepTrouble.
nh_750 |>
tabyl(Smoke100, PhysActive, SleepTrouble) |>
adorn_title(placement = "top")
$No
PhysActive
Smoke100 No Yes
No 137 219
Yes 93 106
$Yes
PhysActive
Smoke100 No Yes
No 41 56
Yes 55 43
The result here is a tabyl of Smoke100 (rows) by PhysActive (columns), split into a list by SleepTrouble.
There are several alternative approaches for doing this, although I expect us to stick with tabyl for our work in 431. These alternatives include the use of the xtabs function:
xtabs(~ Smoke100 + PhysActive + SleepTrouble, data = nh_750)
, , SleepTrouble = No
PhysActive
Smoke100 No Yes
No 137 219
Yes 93 106
, , SleepTrouble = Yes
PhysActive
Smoke100 No Yes
No 41 56
Yes 55 43
We can also build a flat version of this table, as follows:
ftable(Smoke100 ~ PhysActive + SleepTrouble, data = nh_750)
                        Smoke100  No Yes
PhysActive SleepTrouble
No         No                    137  93
           Yes                    41  55
Yes        No                    219 106
           Yes                    56  43
And we can do this with dplyr functions and the table() function, as well, for example…
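The original example is not shown here, so the following is only one guess at what a dplyr plus table() version might look like. To keep it self-contained, it rebuilds one row per subject from the eight cell counts printed above instead of reading nh_750. The names cells, toy and three_way are hypothetical.

```r
library(dplyr)

# Rebuild one row per subject from the eight cell counts shown above.
# expand.grid() varies Smoke100 fastest, then PhysActive, then SleepTrouble.
cells <- expand.grid(
  Smoke100 = c("No", "Yes"),
  PhysActive = c("No", "Yes"),
  SleepTrouble = c("No", "Yes")
)
cells$n <- c(137, 93, 219, 106, 41, 55, 56, 43)
toy <- cells[rep(seq_len(nrow(cells)), times = cells$n), 1:3]

# dplyr picks the variables, then table() cross-classifies them
three_way <- toy |>
  select(Smoke100, PhysActive, SleepTrouble) |>
  table()
three_way

# count() gives the same information as one row per combination
toy |> count(Smoke100, PhysActive, SleepTrouble)
```

With the real data, replacing toy with nh_750 in the pipeline should give the same three-way layout as the xtabs call above.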
8.10 Gaining Control over Tables in R: the gt package
With the gt package, anyone can make wonderful-looking tables using the R programming language. The gt package allows you to start with a tibble or data frame and use it to make very detailed tables that look professional. It includes tools for adding titles and subtitles, all sorts of labels, and footnotes and source notes.
Here’s a fairly simple example of a cross-tabulation of part of the nh_750 data built using a few tools from the gt package.
temp_tbl <- nh_750 |> filter(complete.cases(PhysActive, HealthGen)) |>
tabyl(PhysActive, HealthGen) |>
tibble()
gt(temp_tbl) |>
tab_header(title = md("**Cross-Tabulation from nh_750**"),
subtitle = md("Physical Activity vs. Overall Health"))
Cross-Tabulation from nh_750
Physical Activity vs. Overall Health
PhysActive | Excellent | Vgood | Good | Fair | Poor |
---|---|---|---|---|---|
No | 24 | 66 | 126 | 59 | 10 |
Yes | 60 | 131 | 126 | 45 | 4 |
The gt package and its usage are described in detail at https://gt.rstudio.com/.
8.11 Coming Up
Next, we’ll make some early attempts at describing missingness in our data.