Chapter 3 Visualizing Data
Part A of these Notes is designed to ease your transition into working effectively with data, so that you can better understand it. We’ll start by visualizing some data from the US National Health and Nutrition Examination Survey, or NHANES. We’ll display R code as we go, but we’ll return to all of the key coding ideas involved later in the Notes.
3.1 The NHANES data: Collecting a Sample
To begin, we’ll gather a random sample of 1,000 subjects participating in NHANES, and then identify several variables of interest about those subjects. The motivation for this example came from a Figure in Baumer, Kaplan, and Horton (2017).
library(NHANES)     # NHANES package: functions and the NHANES data
library(tidyverse)  # dplyr (sample_n, select, %>%) and ggplot2

# use set.seed to ensure that we all get the same random sample
# of 1,000 NHANES subjects in our nh_data collection
set.seed(431001)

nh_data <- sample_n(NHANES, size = 1000) %>%
  select(ID, Gender, Age, Height, Weight, BMI, Pulse, Race1, HealthGen, Diabetes)

nh_data
# A tibble: 1,000 x 10
ID Gender Age Height Weight BMI Pulse Race1 HealthGen Diabetes
<int> <fct> <int> <dbl> <dbl> <dbl> <int> <fct> <fct> <fct>
1 59640 male 54 176. 129 41.8 74 White Good No
2 59826 female 67 156. 50.2 20.5 66 White Vgood No
3 56340 male 9 128. 23.3 14.2 86 Black <NA> No
4 56747 male 33 194. 105. 27.9 68 White Vgood No
5 51754 female 58 167. 106 37.9 70 White <NA> No
6 52712 male 6 109. 16.9 14.3 NA White <NA> No
7 63908 male 55 169. 90.6 31.9 62 Mexican Vgood Yes
8 60865 female 25 156. 55 22.8 58 Other Vgood No
9 66642 male 41 178. 89.3 28.2 72 White Vgood No
10 59880 female 45 163. 98.3 36.9 80 Hispanic Good Yes
# ... with 990 more rows
We have 1000 rows (observations) and 10 columns (variables) that describe the subjects listed in the rows.
3.2 Age and Height
Suppose we want to visualize the relationship of Height and Age in our 1,000 NHANES observations. The best choice is likely to be a scatterplot.
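A call along the following lines produces such a scatterplot. Treat it as a minimal sketch rather than necessarily the exact code behind the figure: it assumes the ggplot2 package is installed and that nh_data is the sample built in Section 3.1.

```r
library(ggplot2)

# Assumes nh_data is the 1,000-subject sample created in Section 3.1.
# Each point is one subject: Age on the x axis, Height on the y axis.
ggplot(data = nh_data, aes(x = Age, y = Height)) +
  geom_point()
```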
Warning: Removed 25 rows containing missing values (geom_point).
We note several interesting results here.
- As a warning, R tells us that it has “Removed 25 rows containing missing values (geom_point).” Only 975 subjects are plotted here, because the remaining 25 people have missing (NA) values for Height, Age, or both.
- Unsurprisingly, the measured Heights of subjects grow from Age 0 to Age 20 or so, and we see that a typical Height increases rapidly across these Ages. The middle of the distribution at later Ages is pretty consistent, at a Height somewhere between 150 and 175. The units aren’t specified in the plot, but we expect they must be centimeters. The Ages are clearly reported in Years.
- No Age is reported over 80, and it appears that there is a large cluster of Ages at 80. This may be due to a requirement that Ages 80 and above be reported as 80 so as to help mask the identity of those individuals.
As in this case, we’re going to build most of our visualizations using tools from the ggplot2 package, which is part of the tidyverse series of packages. You’ll see similar coding structures throughout this Chapter, most of which are covered as well in Chapter 3 of Grolemund and Wickham (2017).
3.3 Subset of Subjects with Known Age and Height
Before we move on, let’s manipulate the data set a bit, to focus on only those subjects who have complete data on both Age and Height. This will help us avoid that warning message.
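One way to build that subset is sketched below. It assumes dplyr is installed and that nh_data comes from Section 3.1; the name nh_dat2 is taken from the plotting code in Section 3.4, and the use of complete.cases() is our assumption about how the rows were chosen.

```r
library(dplyr)

# Assumes nh_data from Section 3.1.
# Keep only the rows where both Age and Height are observed.
nh_dat2 <- nh_data %>%
  filter(complete.cases(Age, Height))

# One summary per column: quantiles for numeric variables,
# level counts for factors.
summary(nh_dat2)
```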
ID Gender Age Height
Min. :51654 female:498 Min. : 2.00 Min. : 86.3
1st Qu.:56753 male :477 1st Qu.:20.00 1st Qu.:156.4
Median :61453 Median :36.00 Median :165.8
Mean :61602 Mean :37.27 Mean :161.7
3rd Qu.:66484 3rd Qu.:53.00 3rd Qu.:174.1
Max. :71826 Max. :80.00 Max. :195.0
Weight BMI Pulse Race1
Min. : 12.50 Min. :13.17 Min. : 42.00 Black :112
1st Qu.: 57.60 1st Qu.:21.60 1st Qu.: 66.00 Hispanic: 69
Median : 73.40 Median :26.10 Median : 72.00 Mexican :104
Mean : 73.41 Mean :26.96 Mean : 73.75 White :607
3rd Qu.: 90.20 3rd Qu.:31.10 3rd Qu.: 82.00 Other : 83
Max. :198.70 Max. :80.60 Max. :124.00
NA's :2 NA's :2 NA's :120
HealthGen Diabetes
Excellent: 87 No :910
Vgood :276 Yes : 64
Good :276 NA's: 1
Fair :103
Poor : 15
NA's :218
Note that the units and explanations for these variables are contained in the NHANES help file, available via ?NHANES in the Console of RStudio.
3.4 Age-Height and Gender?
Let’s add Gender to the plot using color, and also adjust the y axis label to incorporate the units of measurement.
ggplot(data = nh_dat2, aes(x = Age, y = Height, color = Gender)) +
geom_point() +
labs(title = "Height-Age Relationship in NHANES sample",
y = "Height in cm.")
3.4.1 Can we show the Female and Male relationships in separate panels?
Sure.
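A sketch of one way to do it, assuming ggplot2 is installed and nh_dat2 is the complete-case subset described in Section 3.3: adding facet_wrap() to the previous plot draws one panel per Gender.

```r
library(ggplot2)

# Assumes nh_dat2, the subset with known Age and Height (Section 3.3).
ggplot(data = nh_dat2, aes(x = Age, y = Height, color = Gender)) +
  geom_point() +
  labs(title = "Height-Age Relationship in NHANES sample",
       y = "Height in cm.") +
  facet_wrap(~ Gender)   # one panel for each level of Gender
```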
3.4.2 Can we add a smooth curve to show the relationship in each plot?
Yep, and let’s change the theme of the graph to remove the gray background, too.
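The sketch below shows one way this could look. It assumes ggplot2 is installed and nh_dat2 is the subset from Section 3.3; the choice of a loess smoother and of theme_bw() is ours, and other smoothers or themes would serve equally well.

```r
library(ggplot2)

# Assumes nh_dat2, the subset with known Age and Height (Section 3.3).
ggplot(data = nh_dat2, aes(x = Age, y = Height, color = Gender)) +
  geom_point() +
  # loess fits a flexible local curve within each panel
  geom_smooth(method = "loess", formula = y ~ x) +
  labs(title = "Height-Age Relationship in NHANES sample",
       y = "Height in cm.") +
  theme_bw() +           # white background in place of the default gray
  facet_wrap(~ Gender)
```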
3.4.3 What if we want to assume straight line relationships?
We could look at a linear model in the plot. Does this make sense here?
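To see what a straight-line fit looks like, we can swap the smoother for a linear model. This is a sketch under the same assumptions as before (ggplot2 installed, nh_dat2 from Section 3.3).

```r
library(ggplot2)

# Assumes nh_dat2, the subset with known Age and Height (Section 3.3).
ggplot(data = nh_dat2, aes(x = Age, y = Height, color = Gender)) +
  geom_point() +
  # method = "lm" fits a least-squares line of Height on Age per panel
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(title = "Height-Age Relationship in NHANES sample",
       y = "Height in cm.") +
  theme_bw() +
  facet_wrap(~ Gender)
```

Since Heights rise steeply through childhood and then level off, a single straight line over the whole Age range is unlikely to follow the data well.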
3.5 A Subset: Ages 21-79
Suppose we wanted to look at a subset of our sample: those observations (subjects) whose Age is at least 21 and at most 79. We’ll create that sample below, keeping the subject ID along with nine variables of particular interest, and removing any observation with a missing value on any of the variables we’re including here.
nh_data_2179 <- nh_data %>%
  filter(Age > 20 & Age < 80) %>%   # Ages are whole years, so this keeps 21 through 79
  select(ID, Gender, Age, Height, Weight, BMI, Pulse, Race1, HealthGen, Diabetes) %>%
  na.omit()                         # drop rows with any missing value

nh_data_2179
# A tibble: 594 x 10
ID Gender Age Height Weight BMI Pulse Race1 HealthGen Diabetes
<int> <fct> <int> <dbl> <dbl> <dbl> <int> <fct> <fct> <fct>
1 59640 male 54 176. 129 41.8 74 White Good No
2 59826 female 67 156. 50.2 20.5 66 White Vgood No
3 56747 male 33 194. 105. 27.9 68 White Vgood No
4 63908 male 55 169. 90.6 31.9 62 Mexican Vgood Yes
5 60865 female 25 156. 55 22.8 58 Other Vgood No
6 66642 male 41 178. 89.3 28.2 72 White Vgood No
7 59880 female 45 163. 98.3 36.9 80 Hispanic Good Yes
8 71784 female 24 161. 50.2 19.3 72 White Vgood No
9 67616 male 63 184. 70 20.6 82 White Vgood No
10 55391 female 32 161. 69.2 26.6 114 Other Good No
# ... with 584 more rows
3.6 Distribution of Heights
What is the distribution of height in this new sample?
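A default histogram is a natural first look. The sketch below assumes ggplot2 is installed and nh_data_2179 is the subset built in Section 3.5; with no bins or binwidth specified, geom_histogram() falls back to 30 bins and says so.

```r
library(ggplot2)

# Assumes nh_data_2179, the ages 21-79 subset from Section 3.5.
# No bins or binwidth is given, so ggplot2 uses its default of 30 bins.
ggplot(data = nh_data_2179, aes(x = Height)) +
  geom_histogram()
```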
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can do several things to clean this up.
- We’ll change the color of the lines for each bar of the histogram.
- We’ll change the fill inside each bar to make them stand out a bit more.
- We’ll add a title and relabel the horizontal (x) axis to include the units of measurement.
- We’ll avoid the warning by selecting a number of bins (we’ll use 25 here) into which we’ll group the heights before drawing the histogram.
ggplot(data = nh_data_2179, aes(x = Height)) +
geom_histogram(bins = 25, col = "yellow", fill = "blue") +
labs(title = "Height of NHANES subjects ages 21-79",
x = "Height in cm.")
3.6.1 Changing a Histogram’s Fill and Color
The CWRU color guide (https://case.edu/umc/our-brand/visual-guidelines/) lists the HTML color schemes for CWRU blue and CWRU gray. Let’s match that color scheme.
cwru.blue <- '#0a304e'
cwru.gray <- '#626262'
ggplot(data = nh_data_2179, aes(x = Height)) +
geom_histogram(binwidth = 2, col = cwru.gray, fill = cwru.blue) +
labs(title = "Height of NHANES subjects ages 21-79",
x = "Height in cm.") +
theme_bw()
Note the other changes to the graph above.
- We changed the theme to replace the gray background.
- We changed the bins for the histogram, to gather observations into groups of 2 cm. each.
3.7 Height and Gender
ggplot(data = nh_data_2179, aes(x = Gender, y = Height, color = Gender)) +
geom_point() +
labs(title = "Height by Gender for NHANES subjects ages 21-79",
y = "Height in cm.")
This plot isn’t so useful. We can improve things a little by jittering the points horizontally, so that the overlap is reduced.
ggplot(data = nh_data_2179, aes(x = Gender, y = Height, color = Gender)) +
geom_jitter(width = 0.2) +
labs(title = "Height by Gender (jittered) for NHANES subjects ages 21-79",
y = "Height in cm.")
Perhaps it might be better to summarise the distribution in a different way. We might consider a boxplot of the data.
3.7.1 A Boxplot of Height by Gender
ggplot(data = nh_data_2179, aes(x = Gender, y = Height, fill = Gender)) +
geom_boxplot() +
labs(title = "Boxplot of Height by Gender for NHANES subjects ages 21-79",
y = "Height in cm.")
Or perhaps we’d like to see a pair of histograms?
3.7.2 Histograms of Height by Gender
ggplot(data = nh_data_2179, aes(x = Height, fill = Gender)) +
geom_histogram(color = "white", bins = 20) +
labs(title = "Histogram of Height by Gender for NHANES subjects ages 21-79",
x = "Height in cm.") +
facet_wrap(~ Gender)
Can we redraw these histograms so that they are a little more comparable, and to get rid of the unnecessary legend?
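One possible answer, sketched under the assumption that ggplot2 is installed and nh_data_2179 comes from Section 3.5: stack the two panels vertically so they share a horizontal axis, and drop the legend, which repeats what the panel labels already say.

```r
library(ggplot2)

# Assumes nh_data_2179, the ages 21-79 subset from Section 3.5.
ggplot(data = nh_data_2179, aes(x = Height, fill = Gender)) +
  geom_histogram(color = "white", bins = 20) +
  labs(title = "Histogram of Height by Gender for NHANES subjects ages 21-79",
       x = "Height in cm.") +
  guides(fill = "none") +   # the panel labels already identify each Gender
  facet_grid(Gender ~ .)    # one row per Gender, so the x axes line up
```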
3.8 A Look at Body-Mass Index
Let’s look at a different outcome, the body-mass index, or BMI. The definition of BMI for adult subjects (which is expressed in units of kg/m²) is:
\[ \mbox{Body Mass Index} = \frac{\mbox{weight in kg}}{(\mbox{height in meters})^2} = 703 \times \frac{\mbox{weight in pounds}}{(\mbox{height in inches})^2} \]
[BMI is essentially] … a measure of a person’s thinness or thickness… BMI was designed for use as a simple means of classifying average sedentary (physically inactive) populations, with an average body composition. For these individuals, the current value recommendations are as follow: a BMI from 18.5 up to 25 may indicate optimal weight, a BMI lower than 18.5 suggests the person is underweight, a number from 25 up to 30 may indicate the person is overweight, and a number from 30 upwards suggests the person is obese.
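As a quick check on the two versions of the formula, here is a small worked example. The function names bmi_metric and bmi_imperial are hypothetical helpers introduced for illustration, and the 70 kg, 1.75 m person is made up rather than taken from NHANES.

```r
# Hypothetical helpers implementing the two forms of the BMI formula.
bmi_metric <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}
bmi_imperial <- function(weight_lb, height_in) {
  703 * weight_lb / height_in^2
}

# Illustrative values only: a person weighing 70 kg who is 1.75 m tall.
bmi_metric(70, 1.75)

# The same person in pounds and inches
# (1 kg is about 2.20462 lb; 1 inch is 0.0254 m).
bmi_imperial(70 * 2.20462, 1.75 / 0.0254)
```

The two results agree closely but not exactly, because 703 is a rounded conversion factor.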
Here’s a histogram, again with CWRU colors, for the BMI data.
ggplot(data = nh_data_2179, aes(x = BMI)) +
geom_histogram(binwidth = 1, fill = cwru.blue, col = cwru.gray) +
labs(title = "Histogram of BMI: NHANES subjects ages 21-79",
x = "Body-mass index")
Note how different this picture looks if instead we bin up groups of 5 kg/m² at a time. Which is the more useful representation will depend a lot on what questions you’re trying to answer.
ggplot(data = nh_data_2179, aes(x = BMI)) +
geom_histogram(binwidth = 5, fill = cwru.blue, col = cwru.gray) +
labs(title = "Histogram of BMI: NHANES subjects ages 21-79",
x = "Body-mass index")
3.8.1 BMI by Gender
ggplot(data = nh_data_2179, aes(x = BMI, fill = Gender)) +
  geom_histogram(color = "white", bins = 20) +
  labs(title = "Histogram of BMI by Gender for NHANES subjects ages 21-79",
       x = "Body-mass index") +
  guides(fill = "none") +   # "none" replaces the deprecated fill = FALSE
  facet_grid(Gender ~ .)
As an accompanying numerical summary, we might ask how many people fall into each of these Gender categories, and what is their “average” BMI.
nh_data_2179 %>%
group_by(Gender) %>%
summarise(count = n(), mean(BMI), median(BMI)) %>%
knitr::kable()
Gender | count | mean(BMI) | median(BMI) |
---|---|---|---|
female | 290 | 29.35486 | 27.43 |
male | 304 | 29.35773 | 28.69 |
3.8.2 BMI and Diabetes
We can split up our histogram into groups based on whether the subjects have been told they have diabetes.
ggplot(data = nh_data_2179, aes(x = BMI, fill = Diabetes)) +
  geom_histogram(color = "white", bins = 20) +
  labs(title = "BMI by Diabetes Status for NHANES ages 21-79",
       x = "Body-mass index") +
  guides(fill = "none")   # "none" replaces the deprecated fill = FALSE
How many people fall into each of these Diabetes categories, and what is their “average” BMI?
nh_data_2179 %>%
group_by(Diabetes) %>%
summarise(count = n(), mean(BMI), median(BMI)) %>%
knitr::kable()
Diabetes | count | mean(BMI) | median(BMI) |
---|---|---|---|
No | 551 | 28.89544 | 27.89 |
Yes | 43 | 35.26209 | 33.43 |
3.8.3 BMI and Race
We can compare the distribution of BMI across Race groups, as well.
ggplot(data = nh_data_2179, aes(x = Race1, y = BMI, fill = Race1)) +
  geom_boxplot() +
  labs(title = "BMI by Race for NHANES ages 21-79",
       y = "Body-mass index") +   # BMI is on the y axis in this plot
  guides(fill = "none")
How many people fall into each of these Race1 categories, and what is their “average” BMI?
library(tidyverse)
nh_data_2179 %>%
group_by(Race1) %>%
summarise(count = n(), mean(BMI), median(BMI)) %>%
knitr::kable()
Race1 | count | mean(BMI) | median(BMI) |
---|---|---|---|
Black | 63 | 31.04444 | 29.010 |
Hispanic | 44 | 29.36227 | 29.505 |
Mexican | 50 | 29.97040 | 29.730 |
White | 387 | 29.27326 | 27.900 |
Other | 50 | 27.25300 | 25.805 |
3.8.4 BMI and Pulse Rate
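A scatterplot with a smooth curve is one way to look at how Pulse rate relates to BMI. The sketch below assumes ggplot2 is installed and nh_data_2179 comes from Section 3.5; the loess smoother is our choice, matching the one used in the next subsection.

```r
library(ggplot2)

# Assumes nh_data_2179, the ages 21-79 subset from Section 3.5.
ggplot(data = nh_data_2179, aes(x = BMI, y = Pulse)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x) +  # flexible local curve
  labs(title = "BMI vs. Pulse rate for NHANES subjects, ages 21-79")
```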
3.8.5 Diabetes vs. No Diabetes
Could we see whether subjects who have been told they have diabetes show different BMI-pulse rate patterns than the subjects who haven’t?
- Let’s try doing this by changing the shape and the color of the points based on diabetes status.
ggplot(data = nh_data_2179,
aes(x = BMI, y = Pulse,
color = Diabetes, shape = Diabetes)) +
geom_point() +
geom_smooth(method = "loess") +
labs(title = "BMI vs. Pulse rate for NHANES subjects, ages 21-79")
This plot might be easier to interpret if we faceted by Diabetes status, as well.
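A sketch of the faceted version, assuming ggplot2 is installed and nh_data_2179 comes from Section 3.5. Once each Diabetes group has its own panel, the legend is redundant, so we drop it here.

```r
library(ggplot2)

# Assumes nh_data_2179, the ages 21-79 subset from Section 3.5.
ggplot(data = nh_data_2179,
       aes(x = BMI, y = Pulse, color = Diabetes)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x) +
  guides(color = "none") +   # the panel labels identify each group
  labs(title = "BMI vs. Pulse rate for NHANES subjects, ages 21-79") +
  facet_wrap(~ Diabetes)     # one panel per Diabetes status
```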
3.9 General Health Status
Here’s a Table of the General Health Status results. This is a self-reported rating of each subject’s health on a five point scale (Excellent, Very Good, Good, Fair, Poor.)
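A table like this can be produced by selecting the variable and passing it to table(). The sketch assumes dplyr is installed and nh_data_2179 comes from Section 3.5; it follows the same select-then-table pattern used later in this section.

```r
library(dplyr)

# Assumes nh_data_2179, the ages 21-79 subset from Section 3.5.
# table() counts how many subjects fall in each HealthGen level.
nh_data_2179 %>%
  select(HealthGen) %>%
  table()
```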
.
Excellent Vgood Good Fair Poor
67 213 221 80 13
The HealthGen data are categorical, which means that summarizing them with averages isn’t as appealing as looking at percentages, proportions and rates.
3.9.1 Bar Chart for Categorical Data
Usually, a bar chart is the best choice for graphing a variable made up of categories.
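The simplest version needs only the variable and geom_bar(), which counts the subjects in each category. This is a minimal sketch assuming ggplot2 is installed and nh_data_2179 comes from Section 3.5.

```r
library(ggplot2)

# Assumes nh_data_2179, the ages 21-79 subset from Section 3.5.
# geom_bar() counts the rows in each HealthGen level and draws one bar per level.
ggplot(data = nh_data_2179, aes(x = HealthGen)) +
  geom_bar()
```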
There are lots of things we can do to make this plot fancier.
ggplot(data = nh_data_2179, aes(x = HealthGen, fill = HealthGen)) +
  geom_bar() +
  guides(fill = "none") +   # "none" replaces the deprecated fill = FALSE
  labs(x = "Self-Reported Health Status",
       y = "Number of NHANES subjects",
       title = "Self-Reported Health Status in NHANES subjects ages 21-79")
Or, we can really go crazy…
nh_data_2179 %>%
  count(HealthGen) %>%
  ungroup() %>%
  mutate(pct = round(prop.table(n) * 100, 1)) %>%
  ggplot(aes(x = HealthGen, y = pct, fill = HealthGen)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_viridis_d() +     # discrete viridis palette built into ggplot2
  guides(fill = "none") +
  geom_text(aes(y = pct + 1,   # nudge above top of bar
                label = paste0(pct, '%')),   # prettify
            position = position_dodge(width = .9),
            size = 4) +
  labs(x = "Self-Reported Health Status",
       y = "Percentage of NHANES subjects",
       title = "Self-Reported Health Status in NHANES subjects ages 21-79") +
  theme_bw()
3.9.2 Working with Tables
We can add a marginal total, and compare subjects by Gender, as follows…
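One way to get such a table is to cross-tabulate the two variables and then call addmargins(). The sketch assumes dplyr is installed and nh_data_2179 comes from Section 3.5; the object name health_by_gender is a hypothetical one used only for illustration.

```r
library(dplyr)

# Assumes nh_data_2179, the ages 21-79 subset from Section 3.5.
health_by_gender <- nh_data_2179 %>%
  select(Gender, HealthGen) %>%
  table() %>%      # rows are Gender, columns are HealthGen
  addmargins()     # append a "Sum" row and a "Sum" column

health_by_gender
```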
HealthGen
Gender Excellent Vgood Good Fair Poor Sum
female 34 107 107 34 8 290
male 33 106 114 46 5 304
Sum 67 213 221 80 13 594
If we like, we can make this look a little more polished with the knitr::kable function…
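A sketch of that step, assuming the dplyr and knitr packages are installed and nh_data_2179 comes from Section 3.5: the same cross-tabulation with margins is simply piped into knitr::kable().

```r
library(dplyr)

# Assumes nh_data_2179, the ages 21-79 subset from Section 3.5.
nh_data_2179 %>%
  select(Gender, HealthGen) %>%
  table() %>%
  addmargins() %>%     # add the Sum row and column
  knitr::kable()       # render the table in a cleaner format
```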
Gender | Excellent | Vgood | Good | Fair | Poor | Sum |
---|---|---|---|---|---|---|
female | 34 | 107 | 107 | 34 | 8 | 290 |
male | 33 | 106 | 114 | 46 | 5 | 304 |
Sum | 67 | 213 | 221 | 80 | 13 | 594 |
If we want the proportions of subjects within each Gender that fall in each HealthGen category (the row proportions), we can get them, too.
Gender | Excellent | Vgood | Good | Fair | Poor |
---|---|---|---|---|---|
female | 0.1172414 | 0.3689655 | 0.3689655 | 0.1172414 | 0.0275862 |
male | 0.1085526 | 0.3486842 | 0.3750000 | 0.1513158 | 0.0164474 |
To make this a little easier to use, we might consider rounding.
nh_data_2179 %>%
select(Gender, HealthGen) %>%
table() %>%
prop.table(.,1) %>%
round(.,2) %>%
knitr::kable()
Gender | Excellent | Vgood | Good | Fair | Poor |
---|---|---|---|---|---|
female | 0.12 | 0.37 | 0.37 | 0.12 | 0.03 |
male | 0.11 | 0.35 | 0.38 | 0.15 | 0.02 |
Another possibility would be to show the percentages, rather than the proportions (which requires multiplying the proportion by 100.) Note the strange "*" function, which is needed to convince R to multiply each entry by 100 here.
nh_data_2179 %>%
select(Gender, HealthGen) %>%
table() %>%
prop.table(.,1) %>%
"*"(100) %>%
round(.,2) %>%
knitr::kable()
Gender | Excellent | Vgood | Good | Fair | Poor |
---|---|---|---|---|---|
female | 11.72 | 36.90 | 36.9 | 11.72 | 2.76 |
male | 10.86 | 34.87 | 37.5 | 15.13 | 1.64 |
And, if we wanted the column percentages, to determine which gender had the higher rate of each HealthGen status level, we can get that by changing the prop.table to calculate 2 (column) proportions, rather than 1 (rows.)
nh_data_2179 %>%
select(Gender, HealthGen) %>%
table() %>%
prop.table(.,2) %>%
"*"(100) %>%
round(.,2) %>%
knitr::kable()
Gender | Excellent | Vgood | Good | Fair | Poor |
---|---|---|---|---|---|
female | 50.75 | 50.23 | 48.42 | 42.5 | 61.54 |
male | 49.25 | 49.77 | 51.58 | 57.5 | 38.46 |
3.9.3 BMI by General Health Status
Let’s consider now the relationship between self-reported overall health and body-mass index.
ggplot(data = nh_data_2179, aes(x = HealthGen, y = BMI, fill = HealthGen)) +
  geom_boxplot() +
  labs(title = "BMI by Health Status, Overall Health for NHANES ages 21-79",
       y = "Body-mass index", x = "Self-Reported Overall Health") +
  guides(fill = "none")   # "none" replaces the deprecated fill = FALSE
We can see that not too many people self-identify with the “Poor” health category.
nh_data_2179 %>%
group_by(HealthGen) %>%
summarise(count = n(), mean(BMI), median(BMI)) %>%
knitr::kable()
HealthGen | count | mean(BMI) | median(BMI) |
---|---|---|---|
Excellent | 67 | 25.70060 | 24.900 |
Vgood | 213 | 27.55878 | 26.700 |
Good | 221 | 32.00321 | 30.550 |
Fair | 80 | 29.28663 | 28.685 |
Poor | 13 | 33.08154 | 35.380 |
3.9.4 BMI by Gender and General Health Status
We’ll start with two panels of boxplots to try to understand the relationships between BMI, General Health Status and Gender. Note the use of coord_flip to rotate the graph 90 degrees.
ggplot(data = nh_data_2179, aes(x = HealthGen, y = BMI, fill = HealthGen)) +
  geom_boxplot() +
  labs(title = "BMI by Health Status, Overall Health for NHANES ages 21-79",
       y = "Body-mass index", x = "Self-Reported Overall Health") +
  guides(fill = "none") +   # "none" replaces the deprecated fill = FALSE
  facet_wrap(~ Gender) +
  coord_flip()
Here’s a plot of faceted histograms, which might be used to address similar questions.
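One possible layout is sketched below, assuming ggplot2 is installed and nh_data_2179 comes from Section 3.5. Putting HealthGen in the rows and Gender in the columns is our choice; the reverse arrangement would also work.

```r
library(ggplot2)

# Assumes nh_data_2179, the ages 21-79 subset from Section 3.5.
ggplot(data = nh_data_2179, aes(x = BMI, fill = Gender)) +
  geom_histogram(color = "white", bins = 20) +
  labs(title = "BMI by Gender and Overall Health for NHANES ages 21-79",
       x = "Body-mass index") +
  guides(fill = "none") +          # the column labels identify each Gender
  facet_grid(HealthGen ~ Gender)   # rows: health status; columns: Gender
```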
3.10 Conclusions
This is just a small piece of the toolbox for visualizations that we’ll create in this class. Many additional tools are on the way, but the main idea won’t change. Using the ggplot2 package, we can accomplish several critical tasks in creating a visualization, including:
- Identifying (and labeling) the axes and titles
- Identifying a type of geom to use, like a point, bar or histogram
- Changing fill, color, shape, size to facilitate comparisons
- Building “small multiples” of plots with faceting
Good data visualizations make it easy to see the data, and ggplot2’s tools make it relatively difficult to make a really bad graph.
References
Baumer, Benjamin S., Daniel T. Kaplan, and Nicholas J. Horton. 2017. Modern Data Science with R. Boca Raton, FL: CRC Press. https://mdsr-book.github.io/.
Grolemund, Garrett, and Hadley Wickham. 2017. R for Data Science. O’Reilly. http://r4ds.had.co.nz/.