Chapter 3 Visualizing Data

Part A of these Notes is designed to ease your transition into working effectively with data, so that you can better understand it. We’ll start by visualizing some data from the US National Health and Nutrition Examination Survey, or NHANES. We’ll display R code as we go, but we’ll return to all of the key coding ideas involved later in the Notes.

3.1 The NHANES data: Collecting a Sample

To begin, we’ll gather a random sample of 1,000 subjects participating in NHANES, and then identify several variables of interest about those subjects1. Some of the motivation for this example came from a Figure in Baumer, Kaplan, and Horton (2017).

# A tibble: 1,000 x 4
      ID Gender   Age Height
   <int> <fct>  <int>  <dbl>
 1 69638 female     5   106.
 2 70782 male      64   176.
 3 52408 female    54   162.
 4 59031 female    15   155.
 5 64530 male      53   185.
 6 71040 male      63   169.
 7 55186 female    30   168.
 8 60211 male       5   103.
 9 55730 male      66   161.
10 68229 female    36   170.
# ... with 990 more rows

We have 1000 rows (observations) and 4 columns (variables) that describe the subjects listed in the rows.

3.2 Age and Height

Suppose we want to visualize the relationship of Height and Age in our 1,000 NHANES observations. The best choice is likely to be a scatterplot.

Warning: Removed 37 rows containing missing values (geom_point).

We note several interesting results here.

  1. As a warning, R tells us that it has “Removed 37 rows containing missing values (geom_point).” Only 963 subjects plotted here, because the remaining 37 people have missing (NA) values for either Height, Age or both.
  2. Unsurprisingly, the measured Heights of subjects grow from Age 0 to Age 20 or so, and we see that a typical Height increases rapidly across these Ages. The middle of the distribution at later Ages is pretty consistent at at a Height somewhere between 150 and 175. The units aren’t specified, but we expect they must be centimeters. The Ages are clearly reported in Years.
  3. No Age is reported over 80, and it appears that there is a large cluster of Ages at 80. This may be due to a requirement that Ages 80 and above be reported at 80 so as to help mask the identity of those individuals.2

As in this case, we’re going to build most of our visualizations using tools from the ggplot2 package, which is part of the tidyverse series of packages. You’ll see similar coding structures throughout this Chapter, most of which are covered as well in Chapter 3 of Grolemund and Wickham (2019).

3.3 Subset of Subjects with Known Age and Height

Before we move on, let’s manipulate the data set a bit, to focus on only those subjects who have complete data on both Age and Height. This will help us avoid that warning message.

       ID           Gender         Age            Height     
 Min.   :51624   female:484   Min.   : 2.00   Min.   : 85.0  
 1st Qu.:57034   male  :479   1st Qu.:19.00   1st Qu.:156.2  
 Median :62056                Median :37.00   Median :165.0  
 Mean   :61967                Mean   :38.29   Mean   :162.3  
 3rd Qu.:67269                3rd Qu.:56.00   3rd Qu.:174.5  
 Max.   :71875                Max.   :80.00   Max.   :195.9  

Note that the units and explanations for these variables are contained in the NHANES help file, available via typing ?NHANES in the Console of R Studio, or by typing NHANES into the Search bar in R Studio’s Help window.

3.3.1 The Distinction between Gender and Sex

The Gender variable here is a mistake. These data refer to the biological status of these subjects, which is their Sex, and not the social construct of Gender which can be quite different. In our effort to avoid further confusion, we’ll rename the variable Gender to instead more accurately describe what is actually measured here.

To do this, we can use this approach…

       ID            Sex           Age            Height     
 Min.   :51624   female:484   Min.   : 2.00   Min.   : 85.0  
 1st Qu.:57034   male  :479   1st Qu.:19.00   1st Qu.:156.2  
 Median :62056                Median :37.00   Median :165.0  
 Mean   :61967                Mean   :38.29   Mean   :162.3  
 3rd Qu.:67269                3rd Qu.:56.00   3rd Qu.:174.5  
 Max.   :71875                Max.   :80.00   Max.   :195.9  

That’s better. How many observations do we have now? We could use dim to find out the number of rows and columns in this new data set.

[1] 963   4

Or, we could simply list the data set and read off the result.

# A tibble: 963 x 4
      ID Sex      Age Height
   <int> <fct>  <int>  <dbl>
 1 69638 female     5   106.
 2 70782 male      64   176.
 3 52408 female    54   162.
 4 59031 female    15   155.
 5 64530 male      53   185.
 6 71040 male      63   169.
 7 55186 female    30   168.
 8 60211 male       5   103.
 9 55730 male      66   161.
10 68229 female    36   170.
# ... with 953 more rows

3.5 A New Subset: Ages 21-79

Suppose we wanted to look only at those observations (subjects) whose Age is at least 21 and at most 79. Suppose also that we want to look at some of the additional variables available in NHANES. To start, we’ll do the following:

  1. Set a seed for the random sampling from the original NHANES data, so we get the same original sample of 1000 people we started with earlier.
  2. Select 1000 people from the NHANES data.
  3. Filter the sample to only those people whose age is more than 20 and less than 80 years.
  4. Select the variables we will be using in the remainder of this chapter, which will be:
    • Age as we’ve seen before, in years.
    • Height as we’ve seen before, in centimeters.
    • Gender which we’ll rename as Sex again.
    • Pulse = 60 second pulse rate (in beats per minute).
    • BPSysAve = Systolic Blood Pressure, in mm Hg (and we’ll rename this SBP).
    • SleepTrouble = Yes means the subject has told a health professional that they had trouble sleeping.
    • PhysActive = Yes means the subject does moderate or vigorous-intensity sports, fitness or recreational activity.
    • MaritalStatus = one of Married, Widowed, Divorced, Separated, NeverMarried or LivePartner (living with partner.)
    • HealthGen = self-reported rating of general health, one of Excellent, Vgood (Very Good), Good, Fair or Poor.
  5. Rename Gender as Sex, to more accurately describe what is being measured.
  6. Omit subjects with any missingness on any of the variables we’ve selected.

Can you see how the code below accomplishes these tasks?

# A tibble: 603 x 10
      ID Sex     Age Height Pulse   SBP SleepTrouble PhysActive
   <int> <fct> <int>  <dbl> <int> <int> <fct>        <fct>     
 1 70782 male     64   176.    78   127 No           No        
 2 52408 fema~    54   162.    80   135 No           No        
 3 64530 male     53   185.   100   131 No           No        
 4 71040 male     63   169.    70   124 Yes          Yes       
 5 55186 fema~    30   168.    76   107 No           No        
 6 55730 male     66   161.    78   133 No           No        
 7 68229 fema~    36   170.    90   105 No           Yes       
 8 63762 male     23   180.    66   118 No           No        
 9 66290 fema~    63   162.    88   116 No           No        
10 66984 male     75   174.    84   141 No           No        
# ... with 593 more rows, and 2 more variables: MaritalStatus <fct>,
#   HealthGen <fct>

3.6 Distribution of Heights

What is the distribution of height in this new sample?

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can do several things to clean this up.

  1. We’ll change the color of the lines for each bar of the histogram.
  2. We’ll change the fill inside each bar to make them stand out a bit more.
  3. We’ll add a title and relabel the horizontal (x) axis to include the units of measurement.
  4. We’ll avoid the warning by selecting a number of bins (we’ll use 25 here) into which we’ll group the heights before drawing the histogram.

3.6.1 Changing a Histogram’s Fill and Color

The CWRU color guide (https://case.edu/umc/our-brand/visual-guidelines/) lists the HTML color schemes for CWRU blue and CWRU gray. Let’s match that color scheme.

Note the other changes to the graph above.

  1. We changed the theme to replace the gray background.
  2. We changed the bins for the histogram, to gather observations into groups of 2 cm. each.

3.7 Height and Sex

This plot isn’t so useful. We can improve things a little by jittering the points horizontally, so that the overlap is reduced.

Perhaps it might be better to summarise the distribution in a different way. We might consider a boxplot of the data.

3.8 Looking at Pulse Rate

Let’s look at a different outcome, the pulse rate for our subjects.

Here’s a histogram, again with CWRU colors, for the pulse rates in our sample.

Suppose we instead bin up groups of 5 beats per minute together as we plot the Pulse rates.

Which is the more useful representation will depend a lot on what questions you’re trying to answer.

3.8.1 Pulse Rate and Physical Activity

We can also split up our data into groups based on whether the subjects are physically active. Let’s try a boxplot.

As an accompanying numerical summary, we might ask how many people fall into each of these PhysActive categories, and what is their “average” Pulse rate.

PhysActive count mean(Pulse) median(Pulse)
No 293 74.21 74
Yes 310 72.37 72

The knitr::kable(digits = 2) piece of this command tells R Markdown to generate a table with some attractive formatting, and rounding any decimals to two figures.

3.8.3 Pulse and HealthGen

We can compare the distribution of Pulse rate across groups by the subject’s self-reported overall health (HealthGen), as well.

How many people fall into each of these HealthGen categories, and what is their “average” Pulse rate?

HealthGen count mean(Pulse) median(Pulse)
Excellent 64 69.97 69
Vgood 196 72.81 72
Good 238 73.66 72
Fair 83 74.22 74
Poor 22 79.09 77

3.8.5 Sleep Troubls vs. No Sleep Trouble?

Could we see whether subjects who have described SleepTrouble show different SBP-pulse rate patterns than the subjects who haven’t?

  • Let’s try doing this by changing the shape and the color of the points based on SleepTrouble.

This plot might be easier to interpret if we faceted by SleepTrouble, as well.

3.9 General Health Status

Here’s a Table of the General Health Status results. Again, this is a self-reported rating of each subject’s health on a five point scale (Excellent, Very Good, Good, Fair, Poor.)

.
Excellent     Vgood      Good      Fair      Poor 
       64       196       238        83        22 

The HealthGen data are categorical, which means that summarizing them with averages isn’t as appealing as looking at percentages, proportions and rates.

Another, somewhat simpler way to get a table of this sort of information uses the tabyl function from the janitor package in R.

 HealthGen   n    percent
 Excellent  64 0.10613599
     Vgood 196 0.32504146
      Good 238 0.39469320
      Fair  83 0.13764511
      Poor  22 0.03648425

I don’t actually like the title of percent here, as it’s really a proportion, but that can be adjusted, and we can add a total.

 HealthGen   n percent
 Excellent  64   10.6%
     Vgood 196   32.5%
      Good 238   39.5%
      Fair  83   13.8%
      Poor  22    3.6%
     Total 603  100.0%

When working with an unordered categorical variable, like MaritalStatus, the same approach can work.

 MaritalStatus   n percent
      Divorced  61   10.1%
   LivePartner  43    7.1%
       Married 349   57.9%
  NeverMarried 104   17.2%
     Separated   8    1.3%
       Widowed  38    6.3%
         Total 603  100.0%

3.9.2 Working with Tables

We can add both row and column marginal totals, and compare subjects by Sex, as follows…

    Sex Excellent Vgood Good Fair Poor Total
 female        27    96  121   41   14   299
   male        37   100  117   42    8   304
  Total        64   196  238   83   22   603

If we like, we can make this look a little more polished with the knitr::kable function…

Sex Excellent Vgood Good Fair Poor Total
female 27 96 121 41 14 299
male 37 100 117 42 8 304
Total 64 196 238 83 22 603

Or, we can get a complete cross-tabulation, including (in this case) the percentages of people within each Sex that fall in each HealthGen category (percentages within each row) like this.

Sex Excellent Vgood Good Fair Poor
female 9.0% (27) 32.1% (96) 40.5% (121) 13.7% (41) 4.7% (14)
male 12.2% (37) 32.9% (100) 38.5% (117) 13.8% (42) 2.6% (8)
Total 10.6% (64) 32.5% (196) 39.5% (238) 13.8% (83) 3.6% (22)

And, if we wanted the column percentages, to determine which sex had the higher rate of each HealthGen status level, we can get that by changing the adorn_percentages to describe results at the column level:

Sex Excellent Vgood Good Fair Poor Total
female 42.2% (27) 49.0% (96) 50.8% (121) 49.4% (41) 63.6% (14) 49.6% (299)
male 57.8% (37) 51.0% (100) 49.2% (117) 50.6% (42) 36.4% (8) 50.4% (304)

3.9.3 SBP by General Health Status

Let’s consider now the relationship between self-reported overall health and systolic blood pressure.

We can see that not too many people self-identify with the “Poor” health category.

HealthGen count mean(SBP) median(SBP)
Excellent 64 119.1562 117.0
Vgood 196 119.0714 118.0
Good 238 120.4244 119.5
Fair 83 123.9398 119.0
Poor 22 125.8636 126.5

3.9.4 SBP by Physical Activity and General Health Status

We’ll build a panel of boxplots to try to understand the relationships between Systolic Blood Pressure, General Health Status and Physical Activity. Note the use of coord_flip to rotate the graph 90 degrees, and the use of labeller within facet_wrap to include both the name of the (Physical Activity) variable and its value.

3.9.5 SBP by Sleep Trouble and General Health Status

Here’s a plot of faceted histograms, which might be used to address similar questions related to the relationship between Overall Health, Systolic Blood Pressure and Sex.

3.10 Conclusions

This is just a small piece of the toolbox for visualizations that we’ll create in this class. Many additional tools are on the way, but the main idea won’t change. Using the ggplot2 package, we can accomplish several critical tasks in creating a visualization, including:

  • Identifying (and labeling) the axes and titles
  • Identifying a type of geom to use, like a point, bar or histogram
  • Changing fill, color, shape, size to facilitate comparisons
  • Building “small multiples” of plots with faceting

Good data visualizations make it easy to see the data, and ggplot2’s tools make it relatively difficult to make a really bad graph.

References

Baumer, Benjamin S., Daniel T. Kaplan, and Nicholas J. Horton. 2017. Modern Data Science with R. Boca Raton, FL: CRC Press. https://mdsr-book.github.io/.

Grolemund, Garrett, and Hadley Wickham. 2019. R for Data Science. O’Reilly. http://r4ds.had.co.nz/.


  1. For more on the NHANES data available in the NHANES package, type ?NHANES in the Console in R Studio.

  2. If you visit the NHANES help file with ?NHANES, you will see that subjects 80 years or older were indeed recorded as 80.