Chapter 3 Visualizing Data

Part A of these Notes is designed to ease your transition into working effectively with data, so that you can better understand it. We’ll start by visualizing some data from the US National Health and Nutrition Examination Survey, or NHANES. We’ll display R code as we go, but we’ll return to all of the key coding ideas involved later in the Notes.

3.1 The NHANES data: Collecting a Sample

To begin, we’ll gather a random sample of 1,000 subjects participating in NHANES, and then identify several variables of interest about those subjects.¹ The motivation for this example came from a figure in Baumer, Kaplan, and Horton (2017).

# A tibble: 1,000 x 10
      ID Gender   Age Height Weight   BMI Pulse Race1    HealthGen Diabetes
   <int> <fct>  <int>  <dbl>  <dbl> <dbl> <int> <fct>    <fct>     <fct>   
 1 59640 male      54   176.  129    41.8    74 White    Good      No      
 2 59826 female    67   156.   50.2  20.5    66 White    Vgood     No      
 3 56340 male       9   128.   23.3  14.2    86 Black    <NA>      No      
 4 56747 male      33   194.  105.   27.9    68 White    Vgood     No      
 5 51754 female    58   167.  106    37.9    70 White    <NA>      No      
 6 52712 male       6   109.   16.9  14.3    NA White    <NA>      No      
 7 63908 male      55   169.   90.6  31.9    62 Mexican  Vgood     Yes     
 8 60865 female    25   156.   55    22.8    58 Other    Vgood     No      
 9 66642 male      41   178.   89.3  28.2    72 White    Vgood     No      
10 59880 female    45   163.   98.3  36.9    80 Hispanic Good      Yes     
# ... with 990 more rows

We have 1,000 rows (observations) and 10 columns (variables) that describe the subjects listed in the rows.
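The sampling step described above might be sketched as follows. This is a minimal sketch, not necessarily the code that produced the output shown: the seed value and the object name nh_data are assumptions, so the rows you get will differ from those listed.

```r
library(NHANES)  # provides the NHANES data frame
library(dplyr)

set.seed(2024)   # arbitrary seed (an assumption), used only for reproducibility

# nh_data is a hypothetical name for the sampled data
nh_data <- NHANES %>%
  slice_sample(n = 1000) %>%
  select(ID, Gender, Age, Height, Weight, BMI, Pulse,
         Race1, HealthGen, Diabetes)

dim(nh_data)  # number of rows, then number of columns
```

Printing nh_data in the Console would show the first ten rows, in the same layout as the listing above.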

3.2 Age and Height

Suppose we want to visualize the relationship between Height and Age in our 1,000 NHANES observations. The best choice is likely to be a scatterplot.

Warning: Removed 25 rows containing missing values (geom_point).

We note several interesting results here.

  1. As a warning, R tells us that it has “Removed 25 rows containing missing values (geom_point).” Only 975 subjects are plotted here, because the remaining 25 people have missing (NA) values for Height, Age, or both.
  2. Unsurprisingly, the measured Heights of subjects grow from Age 0 to Age 20 or so, and we see that a typical Height increases rapidly across these Ages. The middle of the distribution at later Ages is pretty consistent, at a Height somewhere between 150 and 175. The units aren’t specified, but we expect they must be centimeters. The Ages are clearly reported in years.
  3. No Age is reported over 80, and it appears that there is a large cluster of Ages at 80. This may be due to a requirement that Ages 80 and above be reported at 80 so as to help mask the identity of those individuals.²

As in this case, we’re going to build most of our visualizations using tools from the ggplot2 package, which is part of the tidyverse series of packages. You’ll see similar coding structures throughout this Chapter, most of which are covered as well in Chapter 3 of Grolemund and Wickham (2017).
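A scatterplot of Height against Age can be sketched with ggplot2 as shown here. The seed and the names nh_data and p are assumptions, so the number of rows dropped for missing values may not be exactly 25.

```r
library(NHANES)
library(dplyr)
library(ggplot2)

set.seed(2024)  # arbitrary seed (an assumption)
nh_data <- slice_sample(NHANES, n = 1000)  # hypothetical name for the sample

# Map Age to the x axis and Height to the y axis, then draw one point per subject
p <- ggplot(nh_data, aes(x = Age, y = Height)) +
  geom_point()

p  # rows with a missing Age or Height are dropped, with a warning
```

The pattern is the same for most plots in this Chapter: a data set, an aesthetic mapping inside aes(), and one or more geoms added with +.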

3.3 Subset of Subjects with Known Age and Height

Before we move on, let’s manipulate the data set a bit, to focus on only those subjects who have complete data on both Age and Height. This will help us avoid that warning message.

       ID           Gender         Age            Height     
 Min.   :51654   female:498   Min.   : 2.00   Min.   : 86.3  
 1st Qu.:56753   male  :477   1st Qu.:20.00   1st Qu.:156.4  
 Median :61453                Median :36.00   Median :165.8  
 Mean   :61602                Mean   :37.27   Mean   :161.7  
 3rd Qu.:66484                3rd Qu.:53.00   3rd Qu.:174.1  
 Max.   :71826                Max.   :80.00   Max.   :195.0  
                                                             
     Weight            BMI            Pulse             Race1    
 Min.   : 12.50   Min.   :13.17   Min.   : 42.00   Black   :112  
 1st Qu.: 57.60   1st Qu.:21.60   1st Qu.: 66.00   Hispanic: 69  
 Median : 73.40   Median :26.10   Median : 72.00   Mexican :104  
 Mean   : 73.41   Mean   :26.96   Mean   : 73.75   White   :607  
 3rd Qu.: 90.20   3rd Qu.:31.10   3rd Qu.: 82.00   Other   : 83  
 Max.   :198.70   Max.   :80.60   Max.   :124.00                 
 NA's   :2        NA's   :2       NA's   :120                    
     HealthGen   Diabetes  
 Excellent: 87   No  :910  
 Vgood    :276   Yes : 64  
 Good     :276   NA's:  1  
 Fair     :103             
 Poor     : 15             
 NA's     :218             
                           

Note that the units and explanations for these variables are contained in the NHANES help file, available via ?NHANES in the Console of RStudio.
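The restriction to subjects with known Age and Height described in this section might be sketched as below. The seed and the names nh_data and nh_known are assumptions, so the resulting counts will not match the summary shown exactly.

```r
library(NHANES)
library(dplyr)

set.seed(2024)  # arbitrary seed (an assumption)
nh_data <- NHANES %>%
  slice_sample(n = 1000) %>%
  select(ID, Gender, Age, Height, Weight, BMI, Pulse,
         Race1, HealthGen, Diabetes)

# nh_known is a hypothetical name: keep rows where both Age and Height are observed
nh_known <- nh_data %>%
  filter(!is.na(Age), !is.na(Height))

summary(nh_known)  # numerical summaries for every retained variable
```

Other variables, such as Pulse or HealthGen, can still contain missing values after this step, which is why summary() continues to report NA's for them.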

3.5 A Subset: Ages 21-79

Suppose we wanted to look at a subset of our sample - those observations (subjects) whose Age is at least 21 and at most 79. We’ll create that sample below, and also subset the variables to include nine of particular interest, and remove any observations with any missingness on any of the nine variables we’re including here.

# A tibble: 594 x 10
      ID Gender   Age Height Weight   BMI Pulse Race1    HealthGen Diabetes
   <int> <fct>  <int>  <dbl>  <dbl> <dbl> <int> <fct>    <fct>     <fct>   
 1 59640 male      54   176.  129    41.8    74 White    Good      No      
 2 59826 female    67   156.   50.2  20.5    66 White    Vgood     No      
 3 56747 male      33   194.  105.   27.9    68 White    Vgood     No      
 4 63908 male      55   169.   90.6  31.9    62 Mexican  Vgood     Yes     
 5 60865 female    25   156.   55    22.8    58 Other    Vgood     No      
 6 66642 male      41   178.   89.3  28.2    72 White    Vgood     No      
 7 59880 female    45   163.   98.3  36.9    80 Hispanic Good      Yes     
 8 71784 female    24   161.   50.2  19.3    72 White    Vgood     No      
 9 67616 male      63   184.   70    20.6    82 White    Vgood     No      
10 55391 female    32   161.   69.2  26.6   114 Other    Good      No      
# ... with 584 more rows
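One way to build a subset like this is sketched below. The seed and the names nh_data and nh_adults are assumptions, and na.omit() is just one of several ways to drop incomplete rows, so the row count will differ somewhat from the 594 shown.

```r
library(NHANES)
library(dplyr)

set.seed(2024)  # arbitrary seed (an assumption)
nh_data <- slice_sample(NHANES, n = 1000)

# nh_adults is a hypothetical name for the Ages 21-79 subset
nh_adults <- nh_data %>%
  filter(Age >= 21, Age <= 79) %>%          # at least 21 and at most 79
  select(ID, Gender, Age, Height, Weight, BMI, Pulse,
         Race1, HealthGen, Diabetes) %>%    # ID plus nine variables of interest
  na.omit()                                 # drop rows with any missing value

nrow(nh_adults)
```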

3.6 Distribution of Heights

What is the distribution of height in this new sample?

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can do several things to clean this up.

  1. We’ll change the color of the lines for each bar of the histogram.
  2. We’ll change the fill inside each bar to make them stand out a bit more.
  3. We’ll add a title and relabel the horizontal (x) axis to include the units of measurement.
  4. We’ll avoid the warning by selecting a number of bins (we’ll use 25 here) into which we’ll group the heights before drawing the histogram.

3.6.1 Changing a Histogram’s Fill and Color

The CWRU color guide (https://case.edu/umc/our-brand/visual-guidelines/) lists the HTML color schemes for CWRU blue and CWRU gray. Let’s match that color scheme.

Note the other changes to the graph above.

  1. We changed the theme to replace the gray background.
  2. We changed the bins for the histogram, to gather observations into groups of 2 cm each.
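The histogram changes discussed in this section can be sketched as follows. The two hex codes are assumptions that should be checked against the CWRU color guide, and the seed, object names, title, and axis label are hypothetical as well.

```r
library(NHANES)
library(dplyr)
library(ggplot2)

set.seed(2024)  # arbitrary seed (an assumption)
nh_adults <- NHANES %>%
  slice_sample(n = 1000) %>%
  filter(Age >= 21, Age <= 79, !is.na(Height))

cwru_blue <- "#0a304e"  # assumed hex code for CWRU blue
cwru_gray <- "#626262"  # assumed hex code for CWRU gray

p <- ggplot(nh_adults, aes(x = Height)) +
  geom_histogram(binwidth = 2,         # bins that are 2 cm wide
                 color = cwru_gray,    # outline of each bar
                 fill = cwru_blue) +   # interior of each bar
  labs(title = "Heights of NHANES subjects, Ages 21-79",
       x = "Height in cm") +
  theme_bw()                           # replaces the default gray background

p
```

Setting either binwidth or bins explicitly also suppresses the `stat_bin()` message about the default of 30 bins.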

3.7 Height and Gender

A plot of Height by Gender that simply draws one point per subject isn’t so useful, because many of the points overlap. We can improve things a little by jittering the points horizontally, so that the overlap is reduced.

Perhaps it might be better to summarise the distribution in a different way. We might consider a boxplot of the data.
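Both ideas in this section, horizontal jittering and a boxplot, might be sketched like this. The seed, the object names, and the jitter width of 0.2 are assumptions chosen for illustration.

```r
library(NHANES)
library(dplyr)
library(ggplot2)

set.seed(2024)  # arbitrary seed (an assumption)
nh_adults <- NHANES %>%
  slice_sample(n = 1000) %>%
  filter(Age >= 21, Age <= 79, !is.na(Height))

# Jitter horizontally only: height = 0 leaves the measured Heights unchanged
p_jitter <- ggplot(nh_adults, aes(x = Gender, y = Height)) +
  geom_jitter(width = 0.2, height = 0)

# A boxplot summarizes each Gender's distribution of Height
p_box <- ggplot(nh_adults, aes(x = Gender, y = Height)) +
  geom_boxplot()

p_jitter
p_box
```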

3.8 A Look at Body-Mass Index

Let’s look at a different outcome, the body-mass index, or BMI. The definition of BMI for adult subjects (which is expressed in units of kg/m²) is:

\[ \mbox{Body Mass Index} = \frac{\mbox{weight in kg}}{(\mbox{height in meters})^2} = 703 \times \frac{\mbox{weight in pounds}}{(\mbox{height in inches})^2} \]
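As a quick check on the formula, here is a small sketch of both versions of the calculation. The function names and the example weights and heights are hypothetical, and are not taken from the NHANES sample.

```r
# BMI from metric measurements: weight in kg, height in meters
bmi_metric <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# BMI from weight in pounds and height in inches, using the factor of 703
bmi_imperial <- function(weight_lb, height_in) {
  703 * weight_lb / height_in^2
}

bmi_metric(70, 1.75)   # roughly 22.86 kg/m^2
bmi_imperial(154, 69)  # roughly 22.74 kg/m^2
```

Since NHANES records Height in centimeters, a Height value would need to be divided by 100 before being passed to a function like bmi_metric.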

[BMI is essentially] … a measure of a person’s thinness or thickness… BMI was designed for use as a simple means of classifying average sedentary (physically inactive) populations, with an average body composition. For these individuals, the current value recommendations are as follow: a BMI from 18.5 up to 25 may indicate optimal weight, a BMI lower than 18.5 suggests the person is underweight, a number from 25 up to 30 may indicate the person is overweight, and a number from 30 upwards suggests the person is obese.

Wikipedia, https://en.wikipedia.org/wiki/Body_mass_index

Here’s a histogram, again with CWRU colors, for the BMI data.

Note how different this picture looks if instead we bin up groups of 5 kg/m² at a time. Which is the more useful representation will depend a lot on what questions you’re trying to answer.

3.8.1 BMI by Gender

As an accompanying numerical summary, we might ask how many people fall into each of these Gender categories, and what is their “average” BMI.

Gender   count   mean(BMI)   median(BMI)
female     290    29.35486         27.43
male       304    29.35773         28.69
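A grouped summary of this kind can be sketched with group_by() and summarise(). The seed and the names nh_adults, bmi_by_gender, mean_bmi, and median_bmi are assumptions, so the counts and averages will not reproduce the table above exactly.

```r
library(NHANES)
library(dplyr)

set.seed(2024)  # arbitrary seed (an assumption)
nh_adults <- NHANES %>%
  slice_sample(n = 1000) %>%
  filter(Age >= 21, Age <= 79, !is.na(BMI))

# One row per Gender: the number of subjects, then two "averages" of BMI
bmi_by_gender <- nh_adults %>%
  group_by(Gender) %>%
  summarise(count = n(),
            mean_bmi = mean(BMI),
            median_bmi = median(BMI))

bmi_by_gender
```

Replacing Gender with Diabetes or Race1 inside group_by() gives the analogous summary for those groupings, provided rows with a missing value of the grouping variable are handled first.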

3.8.2 BMI and Diabetes

We can split up our histogram into groups based on whether the subjects have been told they have diabetes.

How many people fall into each of these Diabetes categories, and what is their “average” BMI?

Diabetes   count   mean(BMI)   median(BMI)
No           551    28.89544         27.89
Yes           43    35.26209         33.43

3.8.3 BMI and Race

We can compare the distribution of BMI across Race groups, as well.

How many people fall into each of these Race1 categories, and what is their “average” BMI?

Race1      count   mean(BMI)   median(BMI)
Black         63    31.04444        29.010
Hispanic      44    29.36227        29.505
Mexican       50    29.97040        29.730
White        387    29.27326        27.900
Other         50    27.25300        25.805

3.8.5 Diabetes vs. No Diabetes

Could we see whether subjects who have been told they have diabetes show different BMI-pulse rate patterns than the subjects who haven’t?

  • Let’s try doing this by changing the shape and the color of the points based on diabetes status.

This plot might be easier to interpret if we faceted by Diabetes status, as well.
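Both versions of this plot might be sketched as follows. The seed and object names are assumptions, as is the choice to put BMI on the horizontal axis and Pulse on the vertical axis.

```r
library(NHANES)
library(dplyr)
library(ggplot2)

set.seed(2024)  # arbitrary seed (an assumption)
nh_adults <- NHANES %>%
  slice_sample(n = 1000) %>%
  filter(Age >= 21, Age <= 79,
         !is.na(BMI), !is.na(Pulse), !is.na(Diabetes))

# Color and shape both vary with Diabetes status
p_both <- ggplot(nh_adults, aes(x = BMI, y = Pulse,
                                color = Diabetes, shape = Diabetes)) +
  geom_point()

# Adding facet_wrap() draws one panel per Diabetes status
p_facet <- p_both + facet_wrap(~ Diabetes)

p_both
p_facet
```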

3.9 General Health Status

Here’s a table of the General Health Status results. This is a self-reported rating of each subject’s health on a five-point scale (Excellent, Very Good, Good, Fair, Poor).

.
Excellent     Vgood      Good      Fair      Poor 
       67       213       221        80        13 

The HealthGen data are categorical, which means that summarizing them with averages isn’t as appealing as looking at percentages, proportions and rates.

3.9.2 Working with Tables

We can add a marginal total, and compare subjects by Gender, as follows…

        HealthGen
Gender   Excellent Vgood Good Fair Poor Sum
  female        34   107  107   34    8 290
  male          33   106  114   46    5 304
  Sum           67   213  221   80   13 594

If we like, we can make this look a little more polished with the knitr::kable function…

         Excellent   Vgood   Good   Fair   Poor   Sum
female          34     107    107     34      8   290
male            33     106    114     46      5   304
Sum             67     213    221     80     13   594
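To experiment with marginal totals without the full data set, we can rebuild the cross-tabulation from the counts reported above. This is a sketch: the object names are hypothetical, and the table is typed in by hand rather than computed from NHANES.

```r
# Counts of HealthGen by Gender, copied from the table above
health_by_gender <- as.table(matrix(
  c(34, 107, 107, 34, 8,
    33, 106, 114, 46, 5),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("female", "male"),
                  HealthGen = c("Excellent", "Vgood", "Good", "Fair", "Poor"))
))

# addmargins() appends a "Sum" row and a "Sum" column
with_totals <- addmargins(health_by_gender)
with_totals

# knitr::kable() renders the same table in a more polished form
knitr::kable(with_totals)
```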

If we want the proportions of subjects within each Gender that fall in each HealthGen category (the row percentages), we can get them, too.

         Excellent       Vgood        Good        Fair        Poor
female   0.1172414   0.3689655   0.3689655   0.1172414   0.0275862
male     0.1085526   0.3486842   0.3750000   0.1513158   0.0164474

To make this a little easier to use, we might consider rounding.

         Excellent   Vgood   Good   Fair   Poor
female        0.12    0.37   0.37   0.12   0.03
male          0.11    0.35   0.38   0.15   0.02

Another possibility would be to show the percentages, rather than the proportions (which requires multiplying each proportion by 100). Note the strange "*" function, which is needed to convince R to multiply each entry by 100 here.

         Excellent   Vgood   Good    Fair   Poor
female       11.72   36.90   36.9   11.72   2.76
male         10.86   34.87   37.5   15.13   1.64

And, if we wanted the column percentages, to determine which gender had the higher rate of each HealthGen status level, we can get them by changing the margin used in prop.table from 1 (rows) to 2 (columns).

         Excellent   Vgood    Good   Fair    Poor
female       50.75   50.23   48.42   42.5   61.54
male         49.25   49.77   51.58   57.5   38.46
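The row and column calculations in this section can be sketched with prop.table(), again starting from the counts reported earlier rather than from the raw data. The object names are hypothetical. Outside of a pipeline, ordinary multiplication by 100 does the same job as the "*" function.

```r
# Counts of HealthGen by Gender, copied from the tables above
health_by_gender <- as.table(matrix(
  c(34, 107, 107, 34, 8,
    33, 106, 114, 46, 5),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("female", "male"),
                  HealthGen = c("Excellent", "Vgood", "Good", "Fair", "Poor"))
))

# margin = 1: proportions within each row (each Gender sums to 1)
row_props <- prop.table(health_by_gender, margin = 1)
round(row_props, 2)

# The same values expressed as percentages
row_pcts <- round(100 * row_props, 2)
row_pcts

# margin = 2: percentages within each column (each HealthGen level sums to 100)
col_pcts <- round(100 * prop.table(health_by_gender, margin = 2), 2)
col_pcts
```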

3.9.3 BMI by General Health Status

Let’s consider now the relationship between self-reported overall health and body-mass index.

We can see that not too many people self-identify with the “Poor” health category.

HealthGen   count   mean(BMI)   median(BMI)
Excellent      67    25.70060        24.900
Vgood         213    27.55878        26.700
Good          221    32.00321        30.550
Fair           80    29.28663        28.685
Poor           13    33.08154        35.380

3.10 Conclusions

This is just a small piece of the toolbox for visualizations that we’ll create in this class. Many additional tools are on the way, but the main idea won’t change. Using the ggplot2 package, we can accomplish several critical tasks in creating a visualization, including:

  • Identifying (and labeling) the axes and titles
  • Identifying a type of geom to use, like a point, bar or histogram
  • Changing fill, color, shape, size to facilitate comparisons
  • Building “small multiples” of plots with faceting

Good data visualizations make it easy to see the data, and ggplot2’s tools make it relatively difficult to make a really bad graph.

References

Baumer, Benjamin S., Daniel T. Kaplan, and Nicholas J. Horton. 2017. Modern Data Science with R. Boca Raton, FL: CRC Press. https://mdsr-book.github.io/.

Grolemund, Garrett, and Hadley Wickham. 2017. R for Data Science. O’Reilly. http://r4ds.had.co.nz/.


  1. For more on the NHANES data available in the NHANES package, type ?NHANES in the Console in RStudio.

  2. If you visit the NHANES help file with ?NHANES, you will see that subjects 80 years or older were indeed recorded as 80.