Chapter 7 NHANES National Youth Fitness Survey (nnyfs)
The `nnyfs.csv` and the `nnyfs.Rds` data files were built by Professor Love using data from the 2012 National Youth Fitness Survey.
The NHANES National Youth Fitness Survey (NNYFS) was conducted in 2012 to evaluate the health and fitness of children in the U.S. ages 3 to 15. The survey collected data on the physical activity and fitness levels of our youth through interviews and fitness tests.
In the `nnyfs` data file (either `.csv` or `.Rds`), I’m only providing a modest fraction of the available information. More on the NNYFS (including information I’m not using) is available at https://wwwn.cdc.gov/nchs/nhanes/search/nnyfs12.aspx.
The data elements I’m using fall into four main groups, or components: Demographic, Dietary, Examination, and Questionnaire. I merged a few elements from each of the available components of the NHANES National Youth Fitness Survey, reformulated (and in some cases simplified) some variables, and restricted the sample to kids who had completed elements of each of the four components.
7.1 The Variables included in nnyfs
This section tells you where the data come from, and briefly describes what is collected.
7.1.1 From the NNYFS Demographic Component
All of these come from the `Y_DEMO` file.
In nnyfs | In Y_DEMO | Description |
---|---|---|
SEQN | SEQN | Subject ID, connects all of the files |
sex | RIAGENDR | Really, this is sex, not gender |
age_child | RIDAGEYR | Age in years at screening |
race_eth | RIDRETH1 | Race/Hispanic origin (collapsed to 4 levels) |
educ_child | DMDEDUC3 | Education Level (for children ages 6-15). 0 = Kindergarten, 9 = Ninth grade or higher |
language | SIALANG | Language in which the interview was conducted |
sampling_wt | WTMEC | Full-sample MEC exam weight (for inference) |
income_pov | INDFMPIR | Ratio of family income to poverty (ceiling is 5.0) |
age_adult | DMDHRAGE | Age of adult who brought child to interview |
educ_adult | DMDHREDU | Education level of adult who brought child |
7.1.2 From the NNYFS Dietary Component
From the `Y_DR1TOT` file, we have a number of variables related to the child’s diet, with the following summaries mostly describing consumption “yesterday” in a dietary recall questionnaire.
In nnyfs | In Y_DR1TOT | Description |
---|---|---|
respondent | DR1MNRSP | who responded to interview (child, Mom, someone else) |
salt_used | DBQ095Z | uses salt, lite salt or salt substitute at the table |
energy | DR1TKCAL | energy consumed (kcal) |
protein | DR1TPROT | protein consumed (g) |
sugar | DR1TSUGR | total sugar consumed (g) |
fat | DR1TTFAT | total fat consumed (g) |
diet_yesterday | DR1_300 | compare food consumed yesterday to usual amount |
water | DR1_320Z | total plain water drank (g) |
7.1.3 From the NNYFS Examination Component
From the `Y_BMX` file of Body Measures:
In nnyfs | In Y_BMX | Description |
---|---|---|
height | BMXHT | standing height (cm) |
weight | BMXWT | weight (kg) |
bmi | BMXBMI | body mass index (\(kg/m^2\)) |
bmi_cat | BMDBMIC | BMI category (4 levels) |
arm_length | BMXARML | Upper arm length (cm) |
waist | BMXWAIST | Waist circumference (cm) |
arm_circ | BMXARMC | Arm circumference (cm) |
calf_circ | BMXCALF | Maximal calf circumference (cm) |
calf_skinfold | BMXCALFF | Calf skinfold (mm) |
triceps_skinfold | BMXTRI | Triceps skinfold (mm) |
subscapular_skinfold | BMXSUB | Subscapular skinfold (mm) |
From the `Y_PLX` file of Plank test results:

In nnyfs | In Y_PLX | Description |
---|---|---|
plank_time | MPXPLANK | # of seconds plank position is held |
7.1.4 From the NNYFS Questionnaire Component
From the `Y_PAQ` file of Physical Activity questions:

In nnyfs | In Y_PAQ | Description |
---|---|---|
active_days | PAQ706 | Days physically active (\(\geq 60\) min.) in past week |
tv_hours | PAQ710 | Average hours watching TV/videos past 30d |
computer_hours | PAQ715 | Average hours on computer past 30d |
physical_last_week | PAQ722 | Any physical activity outside of school past week |
enjoy_recess | PAQ750 | Enjoy participating in PE/recess |
From the `Y_DBQ` file of Diet Behavior and Nutrition questions:

In nnyfs | In Y_DBQ | Description |
---|---|---|
meals_out | DBD895 | # meals not home-prepared in past 7 days |
From the `Y_HIQ` file of Health Insurance questions:

In nnyfs | In Y_HIQ | Description |
---|---|---|
insured | HIQ011 | Covered by Health Insurance? |
insurance | HIQ031 | Type of Health Insurance coverage |
From the `Y_HUQ` file of Access to Care questions:

In nnyfs | In Y_HUQ | Description |
---|---|---|
phys_health | HUQ010 | General health condition (Excellent - Poor) |
access_to_care | HUQ030 | Routine place to get care? |
care_source | HUQ040 | Type of place most often goes to for care |
From the `Y_MCQ` file of Medical Conditions questions:

In nnyfs | In Y_MCQ | Description |
---|---|---|
asthma_ever | MCQ010 | Ever told you have asthma? |
asthma_now | MCQ035 | Still have asthma? |
From the `Y_RXQ_RX` file of Prescription Medication questions:

In nnyfs | In Y_RXQ_RX | Description |
---|---|---|
med_use | RXDUSE | Taken prescription medication in last month? |
med_count | RXDCOUNT | # of prescription meds taken in past month |
7.2 Looking over the Data Set
Now, I’ll take a look at the `nnyfs` data, which I’ve made available in a comma-separated version (`nnyfs.csv`), if you prefer, as well as in an R data set (`nnyfs.Rds`) which loads a bit faster. After loading the file, let’s get a handle on its size and contents.
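A minimal sketch of that loading step (the file location in your working directory, and the use of `readRDS` rather than the author's exact loading code, are my assumptions):

```r
# Read the R data set version of nnyfs (path is an assumption)
nnyfs <- readRDS("nnyfs.Rds")

# or, from the comma-separated version:
# nnyfs <- readr::read_csv("nnyfs.csv")

dim(nnyfs)  # number of rows (subjects) and columns (variables)
```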
[1] 1518 45
There are 1518 rows (subjects) and 45 columns (variables), by which I mean that there are 1518 kids in the `nnyfs` data frame, and we have 45 pieces of information on each subject.
So, what do we have, exactly?
# A tibble: 1,518 x 45
SEQN sex age_child race_eth educ_child language sampling_wt
<dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 71917 Fema~ 15 3_Black~ 9 English 28299.
2 71918 Fema~ 8 3_Black~ 2 English 15127.
3 71919 Fema~ 14 2_White~ 8 English 29977.
4 71920 Fema~ 15 2_White~ 8 English 80652.
5 71921 Male 3 2_White~ NA English 55592.
6 71922 Male 12 1_Hispa~ 6 English 27365.
7 71923 Male 12 2_White~ 5 English 86673.
8 71924 Fema~ 8 4_Other~ 2 English 39549.
9 71925 Male 7 1_Hispa~ 0 English 42333.
10 71926 Male 8 3_Black~ 2 English 15307.
# ... with 1,508 more rows, and 38 more variables: income_pov <dbl>,
# age_adult <dbl>, educ_adult <chr>, respondent <chr>, salt_used <chr>,
# energy <dbl>, protein <dbl>, sugar <dbl>, fat <dbl>,
# diet_yesterday <chr>, water <dbl>, plank_time <dbl>, height <dbl>,
# weight <dbl>, bmi <dbl>, bmi_cat <chr>, arm_length <dbl>, waist <dbl>,
# arm_circ <dbl>, calf_circ <dbl>, calf_skinfold <dbl>,
# triceps_skinfold <dbl>, subscapular_skinfold <dbl>, active_days <dbl>,
# tv_hours <dbl>, computer_hours <dbl>, physical_last_week <chr>,
# enjoy_recess <chr>, meals_out <dbl>, insured <chr>, phys_health <chr>,
# access_to_care <chr>, care_source <chr>, asthma_ever <chr>,
# asthma_now <chr>, med_use <chr>, med_count <dbl>, insurance <chr>
Tibbles are a modern reimagining of the data frame, the main way in which people have stored data in R. Tibbles were developed to keep what time has proven to be effective, and to throw out what is not. We can learn something about the structure of the tibble from functions such as `str` or `glimpse`.
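For example, a call along these lines (a sketch, assuming the `nnyfs` tibble has been loaded) generates the structure listing below:

```r
# str() lists each variable with its storage type and first few values
str(nnyfs)

# glimpse() from the dplyr package gives a similar, tidier listing:
# dplyr::glimpse(nnyfs)
```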
Classes 'tbl_df', 'tbl' and 'data.frame': 1518 obs. of 45 variables:
$ SEQN : num 71917 71918 71919 71920 71921 ...
$ sex : chr "Female" "Female" "Female" "Female" ...
$ age_child : num 15 8 14 15 3 12 12 8 7 8 ...
$ race_eth : chr "3_Black Non-Hispanic" "3_Black Non-Hispanic" "2_White Non-Hispanic" "2_White Non-Hispanic" ...
$ educ_child : num 9 2 8 8 NA 6 5 2 0 2 ...
$ language : chr "English" "English" "English" "English" ...
$ sampling_wt : num 28299 15127 29977 80652 55592 ...
$ income_pov : num 0.21 5 5 0.87 4.34 5 5 2.74 0.46 1.57 ...
$ age_adult : num 46 46 42 53 31 42 39 31 45 56 ...
$ educ_adult : chr "2_9-11th Grade" "3_High School Graduate" "5_College Graduate" "3_High School Graduate" ...
$ respondent : chr "Child" "Mom" "Child" "Child" ...
$ salt_used : chr "Yes" "Yes" "Yes" "Yes" ...
$ energy : num 2844 1725 2304 1114 1655 ...
$ protein : num 169.1 55.2 199.3 14 50.6 ...
$ sugar : num 128.2 118.7 81.4 119.2 90.3 ...
$ fat : num 127.9 63.7 86.1 36 53.3 ...
$ diet_yesterday : chr "2_Usual" "2_Usual" "2_Usual" "2_Usual" ...
$ water : num 607 178 503 859 148 ...
$ plank_time : num NA 45 121 45 11 107 127 44 184 58 ...
$ height : num NA 131.6 172 167.1 90.2 ...
$ weight : num NA 38.6 58.7 92.5 12.4 66.4 56.7 22.2 20.9 28.3 ...
$ bmi : num NA 22.3 19.8 33.1 15.2 25.9 22.5 14.4 15.9 17 ...
$ bmi_cat : chr NA "4_Obese" "2_Normal" "4_Obese" ...
$ arm_length : num NA 27.7 38.4 35.9 18.3 34.2 33 26.5 24.2 26 ...
$ waist : num NA 71.9 79.4 96.4 46.8 90 72.3 56.1 54.5 59.7 ...
$ arm_circ : num NA 25.4 26 37.9 15.1 29.5 27.9 17.6 17.7 19.9 ...
$ calf_circ : num NA 32.3 35.3 46.8 19.4 36.9 36.8 24 24.3 27.3 ...
$ calf_skinfold : num NA 22 18.4 NA 8.4 22 18.3 7 7.2 8.2 ...
$ triceps_skinfold : num NA 19.9 15 20.6 8.6 22.8 20.5 12.9 6.9 8.8 ...
$ subscapular_skinfold: num NA 17.4 9.8 22.8 5.7 24.4 12.6 6.8 4.8 6.1 ...
$ active_days : num 3 5 3 3 7 2 5 3 7 7 ...
$ tv_hours : num 2 2 1 3 2 3 0 4 2 2 ...
$ computer_hours : num 1 2 3 3 0 1 0 3 1 1 ...
$ physical_last_week : chr "No" "No" "Yes" "Yes" ...
$ enjoy_recess : chr "1_Strongly Agree" "1_Strongly Agree" "3_Neither Agree nor Disagree" "2_Agree" ...
$ meals_out : num 0 2 3 2 1 1 2 1 0 2 ...
$ insured : chr "Has Insurance" "Has Insurance" "Has Insurance" "Has Insurance" ...
$ phys_health : chr "1_Excellent" "3_Good" "1_Excellent" "3_Good" ...
$ access_to_care : chr "Has Usual Care Source" "Has Usual Care Source" "Has Usual Care Source" "Has Usual Care Source" ...
$ care_source : chr "Clinic or Health Center" "Doctor's Office" "Doctor's Office" "Doctor's Office" ...
$ asthma_ever : chr "Never Had Asthma" "History of Asthma" "Never Had Asthma" "History of Asthma" ...
$ asthma_now : chr "No Asthma Now" "Asthma Now" "No Asthma Now" "Asthma Now" ...
$ med_use : chr "No Medications" "Had Medication" "No Medications" "Had Medication" ...
$ med_count : num 0 1 0 2 0 0 0 0 0 0 ...
$ insurance : chr "State Sponsored" "State Sponsored" "Private" "State Sponsored" ...
There are a lot of variables here. Let’s run through the first few rather slowly, and then we’ll speed up.
7.2.1 SEQN
The first variable, `SEQN`, is just a (numerical) identifying code assigned to each subject of the survey. This is nominal data, which will be of little interest down the line. On some occasions, as in this case, the ID numbers are sequential, in the sense that subject 71919 was included in the database after subject 71918, but this fact isn’t particularly interesting here, because the protocol remained unchanged throughout the study.
7.2.2 sex
The second variable, `sex`, is listed as a character variable (R uses factor and character to refer to categorical, especially non-numeric, information). Here, as we can see below, we have two levels, Female and Male.
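One way to produce counts and percentages like those shown below (a sketch using dplyr; the author's actual tabulation code may differ):

```r
library(dplyr)

# Count the levels of sex, expressing each count as a share of all subjects
nnyfs %>%
  count(sex) %>%
  mutate(percent = 100 * n / sum(n))
```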
sex n percent
Female 760 50.1%
Male 758 49.9%
Total 1518 100.0%
Obviously, we don’t actually need more than a decimal place here for any real purpose.
7.2.3 age_child
The third variable, `age_child`, is the age of the child at the time of their screening to be in the study, measured in years. Note that age is a continuous concept, but the measure used here (number of full years alive) is a common discrete approach to measurement. Age, of course, has a meaningful zero point, so this can be thought of as a ratio variable; a child who is 6 is half as old as one who is 12. We can tabulate the observed values, since there are only a dozen or so.
age_child n percent
3 110 7.2%
4 112 7.4%
5 114 7.5%
6 129 8.5%
7 123 8.1%
8 112 7.4%
9 99 6.5%
10 124 8.2%
11 111 7.3%
12 137 9.0%
13 119 7.8%
14 130 8.6%
15 98 6.5%
At the time of initial screening, these children should have been between 3 and 15 years of age, so things look reasonable. Since this is a meaningful quantitative variable, we may be interested in a more descriptive summary.
age_child
Min. : 3.000
1st Qu.: 6.000
Median : 9.000
Mean : 9.033
3rd Qu.:12.000
Max. :15.000
These six numbers provide a nice, if incomplete, look at the ages.
- Min. = the minimum, or youngest age at the examination, was 3 years old.
- 1st Qu. = the first quartile (25th percentile) of the ages was 6. This means that 25 percent of the subjects were age 6 or less.
- Median = the second quartile (50th percentile) of the ages was 9. This is often used to describe the center of the data. Half of the subjects were age 9 or less.
- 3rd Qu. = the third quartile (75th percentile) of the ages was 12.
- Max. = the maximum, or oldest age at the examination, was 15 years.
We could get the standard deviation and a count of missing and non-missing observations with `favstats` from the `mosaic` package.
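A sketch of the likely call, assuming the `mosaic` package is installed:

```r
# favstats() returns the five-number summary plus the mean, standard
# deviation, sample size, and a count of missing values
mosaic::favstats(~ age_child, data = nnyfs)
```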
min Q1 median Q3 max mean sd n missing
3 6 9 12 15 9.032938 3.705574 1518 0
7.2.4 race_eth
The fourth variable in the data set is `race_eth`, which is a multi-categorical variable describing the child’s race and ethnicity.
race_eth | n | percent |
---|---|---|
1_Hispanic | 450 | 29.6% |
2_White Non-Hispanic | 610 | 40.2% |
3_Black Non-Hispanic | 338 | 22.3% |
4_Other Race/Ethnicity | 120 | 7.9% |
And now, let’s look at whether our numerical summaries of the children’s ages vary by their race/ethnicity…
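The grouped summary below can be produced with a formula in `favstats` (a sketch, assuming `mosaic` is installed):

```r
# Summarize age_child separately within each race_eth category
mosaic::favstats(age_child ~ race_eth, data = nnyfs)
```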
race_eth min Q1 median Q3 max mean sd n
1 1_Hispanic 3 5.25 9.0 12 15 8.793333 3.733846 450
2 2_White Non-Hispanic 3 6.00 9.0 12 15 9.137705 3.804421 610
3 3_Black Non-Hispanic 3 6.00 9.0 12 15 9.038462 3.576423 338
4 4_Other Race/Ethnicity 3 7.00 9.5 12 15 9.383333 3.427970 120
missing
1 0
2 0
3 0
4 0
7.2.5 income_pov
Skipping down a bit, let’s look at the family income as a multiple of the poverty level. Here’s the summary.
income_pov
Min. :0.000
1st Qu.:0.870
Median :1.740
Mean :2.242
3rd Qu.:3.520
Max. :5.000
NA's :89
We see there is some missing data here. Let’s ignore that for the moment and concentrate on interpreting the results for the children with actual data. We should start with a picture.
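A histogram along these lines would do (a sketch; the bin count and colors are my choices, not necessarily those used in the original figure):

```r
library(ggplot2)

# Histogram of the income-to-poverty ratio; the 89 missing values
# produce the warning about removed rows
ggplot(nnyfs, aes(x = income_pov)) +
  geom_histogram(bins = 30, fill = "dodgerblue", color = "white")
```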
Warning: Removed 89 rows containing non-finite values (stat_bin).
The histogram shows us that the values are truncated at 5, so that children whose actual family income is above 5 times the poverty line are listed as 5. We also see a message reminding us that some of the data are missing for this variable.
Is there a relationship between `income_pov` and `race_eth` in these data?
race_eth min Q1 median Q3 max mean sd n
1 1_Hispanic 0 0.56 0.96 1.7400 5 1.336895 1.097235 409
2 2_White Non-Hispanic 0 1.52 2.96 4.5200 5 2.915816 1.617534 588
3 3_Black Non-Hispanic 0 0.78 1.57 2.8200 5 1.970457 1.494911 328
4 4_Other Race/Ethnicity 0 1.17 2.74 4.5775 5 2.845673 1.670373 104
missing
1 41
2 22
3 10
4 16
This deserves a picture. Let’s try a boxplot.
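A comparison boxplot along these lines would do (a sketch; the styling is my choice):

```r
library(ggplot2)

# Boxplots of income_pov within each race/ethnicity group; rows with
# missing income_pov are dropped, producing the warning below
ggplot(nnyfs, aes(x = race_eth, y = income_pov, fill = race_eth)) +
  geom_boxplot() +
  guides(fill = FALSE) +
  labs(x = "Race/Ethnicity", y = "Income to Poverty Ratio")
```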
Warning: Removed 89 rows containing non-finite values (stat_boxplot).
7.2.6 bmi
Moving into the body measurement data, `bmi` is the body-mass index of the child. The BMI is a person’s weight in kilograms divided by the square of his or her height in meters. Symbolically, BMI = weight in kg / (height in m)\(^2\). This is a continuous concept, measured to as many decimal places as you like, and it has a meaningful zero point, so it’s a ratio variable.
bmi
Min. :11.90
1st Qu.:15.90
Median :18.10
Mean :19.63
3rd Qu.:21.90
Max. :48.30
NA's :4
Why would a table of these BMI values not be a great idea, for these data? A hint is that R represents this variable as `num` or numeric in its depiction of the data structure, and this implies that R has some decimal values stored. Here, I’ll use the `head()` function and the `tail()` function to show the first few and the last few values of what would prove to be a very long table of `bmi` values.
bmi n percent valid_percent
11.9 1 0.1% 0.1%
12.6 1 0.1% 0.1%
12.7 1 0.1% 0.1%
12.9 1 0.1% 0.1%
13.0 2 0.1% 0.1%
13.1 1 0.1% 0.1%
bmi n percent valid_percent
42.8 1 0.1% 0.1%
43.0 1 0.1% 0.1%
46.9 1 0.1% 0.1%
48.2 1 0.1% 0.1%
48.3 1 0.1% 0.1%
NA 4 0.3% -
7.2.7 bmi_cat
Next I’ll look at the `bmi_cat` information. This is a four-category ordinal variable, which divides the sample according to BMI into four groups. The BMI categories use sex-specific 2000 BMI-for-age (in months) growth charts prepared by the Centers for Disease Control for the US. We can get the breakdown from a table of the variable’s values.
bmi_cat n percent valid_percent
1_Underweight 41 2.7% 2.7%
2_Normal 920 60.6% 60.8%
3_Overweight 258 17.0% 17.0%
4_Obese 295 19.4% 19.5%
<NA> 4 0.3% -
In terms of percentiles by age and sex from the growth charts, the meanings of the categories are:
- Underweight (BMI < 5th percentile)
- Normal weight (BMI 5th to < 85th percentile)
- Overweight (BMI 85th to < 95th percentile)
- Obese (BMI \(\geq\) 95th percentile)
Note how I’ve used labels in the `bmi_cat` variable that include a number at the start so that the table results are sorted in a rational way. R sorts tables alphabetically, in general. We’ll use the `forcats` package to work with categorical variables that we store as factors eventually, but for now, we’ll keep things relatively simple.
Note that the `bmi_cat` data don’t completely separate out the raw `bmi` data, because the calculation of percentiles requires different tables for each combination of `age` and `sex`.
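A sketch of the call that produces the grouped summary below, assuming `mosaic` is installed:

```r
# Compare bmi distributions across the four BMI categories
mosaic::favstats(bmi ~ bmi_cat, data = nnyfs)
```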
bmi_cat min Q1 median Q3 max mean sd n missing
1 1_Underweight 11.9 13.4 13.7 15.000 16.5 14.10976 1.104492 41 0
2 2_Normal 13.5 15.4 16.5 18.700 24.0 17.16391 2.304162 920 0
3 3_Overweight 16.9 18.3 21.4 23.375 27.9 21.18101 2.918489 258 0
4 4_Obese 17.9 22.3 26.2 30.200 48.3 26.73153 5.721179 295 0
7.2.8 waist
Let’s also look briefly at `waist`, which is the circumference of the child’s waist, in centimeters. Again, this is a numeric variable, so perhaps we’ll stick to the simple summary, rather than obtaining a table of observed values.
min Q1 median Q3 max mean sd n missing
42.5 55.6 64.8 76.6 144.7 67.70536 15.19809 1512 6
Here’s a histogram of the waist circumference data.
Warning: Removed 6 rows containing non-finite values (stat_bin).
7.2.9 triceps_skinfold
The last variable I’ll look at for now is `triceps_skinfold`, which is measured in millimeters. This is one of several common locations used for the assessment of body fat using skinfold calipers, and is a frequent part of growth assessments in children. Again, this is a numeric variable according to R.
min Q1 median Q3 max mean sd n missing
4 9.1 12.4 18 38.8 14.35725 6.758825 1497 21
And here’s a histogram of the triceps skinfold data, with the fill and color flipped from what we saw in the plot of the waist circumference data a moment ago.
ggplot(nnyfs, aes(x = triceps_skinfold)) +
geom_histogram(bins = 25, fill = "cyan", color = "tomato")
Warning: Removed 21 rows containing non-finite values (stat_bin).
OK. We’ve seen a few variables, and we’ll move on now to look more seriously at the data.
7.3 Basic Numerical Summaries
7.3.1 The Five Number Summary, Quantiles and IQR
The five number summary is most famous when used to form a box plot - it’s the minimum, 25th percentile, median, 75th percentile and maximum. For numerical and integer variables, the `summary` function produces the five number summary, plus the mean, and a count of any missing values (NA’s).
waist energy sugar
Min. : 42.50 Min. : 257 Min. : 1.00
1st Qu.: 55.60 1st Qu.:1368 1st Qu.: 82.66
Median : 64.80 Median :1794 Median :116.92
Mean : 67.71 Mean :1877 Mean :124.32
3rd Qu.: 76.60 3rd Qu.:2306 3rd Qu.:157.05
Max. :144.70 Max. :5265 Max. :405.49
NA's :6
As an alternative, we can use the `$` notation to indicate the variable we wish to study inside a data set, and we can use the `fivenum` function to get the five numbers used in developing a box plot. We’ll focus for a little while on the number of kilocalories consumed by each child, according to the dietary recall questionnaire. That’s the `energy` variable.
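A sketch of the call:

```r
# Tukey's five numbers for energy:
# minimum, lower hinge, median, upper hinge, maximum
fivenum(nnyfs$energy)
```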
[1] 257.0 1367.0 1794.5 2306.0 5265.0
- As mentioned in 5.3.1, the inter-quartile range, or IQR, is sometimes used as a competitor for the standard deviation. It’s the difference between the 75th percentile and the 25th percentile. The 25th percentile, median, and 75th percentile are referred to as the quartiles of the data set, because, together, they split the data into quarters.
[1] 938.5
We can obtain quantiles (percentiles) as we like - here, I’m asking for the 1st and 99th:
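A sketch of the call:

```r
# The 1st and 99th percentiles of energy consumption
quantile(nnyfs$energy, probs = c(0.01, 0.99))
```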
1% 99%
566.85 4051.75
7.4 Additional Summaries from favstats
If we’re focusing on a single variable, the `favstats` function in the `mosaic` package can be very helpful. Rather than calling up the entire `mosaic` library here, I’ll just specify the function within the library.
min Q1 median Q3 max mean sd n missing
257 1367.5 1794.5 2306 5265 1877.157 722.3537 1518 0
This adds three useful results to the base summary - the standard deviation, the sample size and the number of missing observations.
7.5 The Histogram
As we saw in Chapter 3, obtaining a basic histogram of, for example, the energy (kilocalories consumed) in the `nnyfs` data is pretty straightforward.
7.5.1 Freedman-Diaconis Rule to select bin width
If we like, we can suggest a particular number of cells for the histogram, instead of accepting the defaults. In this case, we have \(n\) = 1518 observations. The Freedman-Diaconis rule can be helpful here. That rule suggests that we set the bin-width to
\[ h = \frac{2*IQR}{n^{1/3}} \]
so that the number of bins is equal to the range of the data set (maximum - minimum) divided by \(h\).
For the `energy` data in the `nnyfs` tibble, we have

- IQR of 938.5, \(n\) = 1518 and range = 5008
- Thus, by the Freedman-Diaconis rule, the optimal binwidth \(h\) is 163.3203676, or, realistically, 163.
- And so the number of bins would be 30.6636586, or, realistically, 31.
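The arithmetic above can be checked directly with the values already quoted (self-contained, so it needs no data file):

```r
# Freedman-Diaconis rule: h = 2 * IQR / n^(1/3)
iqr <- 938.5
n <- 1518
range_energy <- 5265 - 257   # max - min = 5008

h <- 2 * iqr / n^(1/3)       # optimal binwidth, about 163.32
bins <- range_energy / h     # suggested number of bins, about 30.66
```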
Here, we’ll draw the graph again, using the Freedman-Diaconis rule to identify the number of bins, and also play around a bit with the fill and color of the bars.
bw <- 2 * IQR(nnyfs$energy) / length(nnyfs$energy)^(1/3)
ggplot(data = nnyfs, aes(x = energy)) +
geom_histogram(binwidth=bw, color = "white", fill = "black")
This is a nice start, but it is by no means a finished graph.
Let’s improve the axis labels, add a title, and fill in the bars with a distinctive blue and use a black outline around each bar. I’ll just use 25 bars, because I like how that looks in this case, and optimizing the number of bins is rarely important.
7.5.2 A Note on Colors
The simplest way to specify a color is with its name, enclosed in quotation marks. My favorite list of R colors is http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf. In a pinch, you can usually find it by googling Colors in R. You can also type `colors()` in the R console to obtain a list of the names of the same 657 colors.
When using colors to make comparisons, you may be interested in using a scale that has some nice properties. The viridis package vignette describes four color scales (viridis, magma, plasma and inferno) that are designed to be colorful, robust to colorblindness and gray scale printing, and perceptually uniform, which means (as the package authors describe it) that values close to each other have similar-appearing colors and values far away from each other have more different-appearing colors, consistently across the range of values. We can apply these colors with special functions within `ggplot`.
Here’s a comparison of several histograms, looking at `energy` consumed as a function of whether yesterday was typical in terms of food consumption.
ggplot(data = nnyfs, aes(x = energy, fill = diet_yesterday)) +
geom_histogram(bins = 20, col = "white") +
scale_fill_viridis_d() +
facet_wrap(~ diet_yesterday)
We don’t really need the legend here, and perhaps we should restrict the plot to participants who responded to the `diet_yesterday` question, and put in a title and better axis labels?
nnyfs %>% filter(complete.cases(energy, diet_yesterday)) %>%
ggplot(data = ., aes(x = energy, fill = diet_yesterday)) +
geom_histogram(bins = 20, col = "white") +
scale_fill_viridis_d() +
guides(fill = FALSE) +
facet_wrap(~ diet_yesterday) +
labs(x = "Energy consumed, in kcal",
title = "Energy Consumption and How Typical Was Yesterday's Eating",
subtitle = "NHANES National Youth Fitness Survey, no survey weighting")
7.6 The Stem-and-Leaf
We might consider a stem-and-leaf display (a John Tukey invention) to show the actual data values while retaining the shape of a histogram. The `scale` parameter can help expand the size of the diagram, so you can see more of the values. Stem and leaf displays are usually used for relatively small samples, perhaps with 10-200 observations, so we’ll first take a sample of 150 of the BMI values from the complete set gathered in the `nnyfs` tibble.
set.seed(431) # set a seed for the random sampling so we can replicate the results
sampleA <- sample_n(nnyfs, 150, replace = FALSE) # draw a sample of 150 unique rows from nnyfs
stem(sampleA$bmi) # build a stem-and-leaf for those 150 sampled BMI values
The decimal point is at the |
12 | 936678999
14 | 012457880011233555566777789999
16 | 0012233333335556688890001222355888889
18 | 03455778991112556889
20 | 02344678912378
22 | 00013455779125
24 | 346978
26 | 03799446
28 | 25
30 | 2
32 | 065
34 | 14
36 | 31
38 |
40 |
42 |
44 |
46 |
48 | 3
We can see that the minimum BMI value in this small sample is 12.9 and the maximum BMI value is 48.3.
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
12.90 15.90 17.80 19.68 22.00 48.30 1
If we really wanted to, we could obtain a stem-and-leaf of all of the BMI values in the entire `nnyfs` data. The `scale` parameter lets us see some more of the values.
The decimal point is at the |
11 | 9
12 | 679
13 | 00123444555566666777788888999999999999
14 | 00000001111111111112222222233333333333344444444445555555555666666667+48
15 | 00000000000000111111111111112222222222222222222333333333333333333333+134
16 | 00000000000000000000000011111111111111222222222222222223333333333333+114
17 | 00000000000000000000001111111111111222222222222222222233333333333333+75
18 | 00000000000000011111111111112222222222223333333334444444455555555555+39
19 | 00000111111111111111222222222222222333333333333444444444555555555555+36
20 | 00000000000001111111112222222333333333333444444455566666666666777778+3
21 | 00000001111112222223333333334444444444555555555566666666777777778888+7
22 | 00000000000001111111222222233333333444444445555566666677777788999999
23 | 00000001111222222223333334444444555667778888899999
24 | 000000112222233334444455555566666677788888899999
25 | 00011122222233344444555666666677777888899999
26 | 00011222223345555667789999
27 | 002233444455566679999
28 | 11122344456667778999
29 | 0112222333567788
30 | 1122222344556788999
31 | 0013445567788
32 | 0023344669
33 | 1255
34 | 01224456679
35 | 0179
36 | 369
37 | 01458
38 | 138
39 |
40 | 0
41 | 6
42 | 8
43 | 0
44 |
45 |
46 | 9
47 |
48 | 23
Note that some of the rows extend far beyond what is displayed (as indicated by the `+` sign, followed by a count of the number of unshown data values).
7.6.1 A Fancier Stem-and-Leaf Display
We can use the `stem.leaf` function in the `aplpack` package to obtain a fancier version of the stem-and-leaf plot, one that identifies outlying values. Below, we display this new version for the random sample of 150 BMI observations we developed earlier.
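A sketch of the call (assuming the `aplpack` package is installed, and using the `sampleA` tibble drawn above):

```r
# stem.leaf() adds depth counts in the left margin, flags high and low
# outliers (HI/LO), and reports the number of missing values
aplpack::stem.leaf(sampleA$bmi)
```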
1 | 2: represents 1.2
leaf unit: 0.1
n: 149
1 12 | 9
9 13 | 36678999
17 14 | 01245788
39 15 | 0011233555566777789999
60 16 | 001223333333555668889
(16) 17 | 0001222355888889
73 18 | 0345577899
63 19 | 1112556889
53 20 | 023446789
44 21 | 12378
39 22 | 00013455779
28 23 | 125
25 24 | 3469
21 25 | 78
19 26 | 03799
14 27 | 446
28 |
11 29 | 25
9 30 | 2
HI: 32 32.6 33.5 34.1 34.4 36.3 37.1 48.3
NA's: 1
We can also produce back-to-back stem and leaf plots to compare, for instance, body-mass index by sex.
samp.F <- filter(sampleA, sex=="Female")
samp.M <- filter(sampleA, sex=="Male")
aplpack::stem.leaf.backback(samp.F$bmi, samp.M$bmi)
___________________________________________________
1 | 2: represents 1.2, leaf unit: 0.1
samp.F$bmi samp.M$bmi
___________________________________________________
1 9| 12 |
5 9863| 13 |6799 4
7 10| 14 |245788 10
19 999875532100| 15 |1355667779 20
25 863330| 16 |012233335556889 35
29 8210| 17 |002235588889 (12)
(4) 9750| 18 |345789 38
31 985211| 19 |1568 32
25 943| 20 |024678 28
22 8731| 21 |2 22
18 000| 22 |13455779 21
15 51| 23 |2 13
13 63| 24 |49 12
11 87| 25 |
9 0| 26 |3799 10
8 644| 27 |
| 28 |
5 2| 29 |5 6
4 2| 30 |
| 31 |
___________________________________________________
HI: 33.5 37.1 48.3 HI: 32 32.6 34.1 34.4
36.3
n: 65 85
NAs: 1 0
___________________________________________________
7.7 The Dot Plot to display a distribution
We can plot the distribution of a single continuous variable using the `dotplot` geom:
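For instance, a sketch for the `energy` data (the binwidth and dot size are my choices):

```r
library(ggplot2)

# Dot plot of energy: each dot is one subject, stacked within bins
ggplot(nnyfs, aes(x = energy)) +
  geom_dotplot(binwidth = 50, dotsize = 0.6) +
  scale_y_continuous(NULL, breaks = NULL)  # the dotplot y scale isn't meaningful
```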
7.8 The Frequency Polygon
We can plot the distribution of a single continuous variable using the `freqpoly` geom:
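For instance, a sketch for the `energy` data (the bin count is my choice):

```r
library(ggplot2)

# Frequency polygon: bin counts as in a histogram, joined by line segments
ggplot(nnyfs, aes(x = energy)) +
  geom_freqpoly(bins = 25, color = "dodgerblue")
```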
7.9 Plotting the Probability Density Function
We can also produce a density function, which has the effect of smoothing out the bumps in a histogram or frequency polygon, while also changing what is plotted on the y-axis.
ggplot(data = nnyfs, aes(x = energy)) +
geom_density(kernel = "gaussian", color = "dodgerblue") +
labs(title = "Density of nnyfs Energy data",
x = "Energy (kcal)", y = "Probability Density function")
So, what’s a density function?
- A probability density function is a function of a continuous variable, x, that represents the probability of x falling within a given range. Specifically, the integral over the interval (a,b) of the density function gives the probability that the value of x is within (a,b).
- If you’re interested in exploring more on the notion of density functions for continuous (and discrete) random variables, some nice elementary material is available at Khan Academy.
7.10 The Boxplot
Sometimes, it’s helpful to picture the five-number summary of the data in such a way as to get a general sense of the distribution. One approach is a boxplot, sometimes called a box-and-whisker plot.
7.10.1 Drawing a Boxplot for One Variable in ggplot2
The `ggplot2` library easily handles comparison boxplots for multiple distributions, as we’ll see in a moment. However, building a boxplot for a single distribution requires a little trickiness.
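One common trick (a sketch; the author's approach may differ) is to map a constant to the unused aesthetic:

```r
library(ggplot2)

# A one-variable boxplot: a constant "" on x, energy on y,
# then flip the coordinates so the box lies horizontally
ggplot(nnyfs, aes(x = "", y = energy)) +
  geom_boxplot(fill = "yellow") +
  coord_flip() +
  labs(x = "", y = "Energy consumed (kcal)")
```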
7.10.2 About the Boxplot
The boxplot is another John Tukey invention.
- R draws the box (here in yellow) so that the edges of the box fall at the 25th and 75th percentiles of the data, and the thick line inside the box falls at the median (50th percentile).
- The whiskers then extend out to the largest and smallest values that are not classified by the plot as candidate outliers.
- An outlier is an unusual point, far from the center of a distribution.
- Note that I’ve used the `horizontal` option to show this boxplot in this direction. Most comparison boxplots, as we’ll see below, are oriented vertically.
The boxplot’s whiskers are drawn from the first and third quartiles (i.e. the 25th and 75th percentiles) out to the most extreme points in the data that do not meet the standard of “candidate outliers.” An outlier is simply a point that is far away from the center of the data - which may be due to any number of reasons, and generally indicates a need for further investigation.
Most software, including R, uses a standard proposed by Tukey which describes a “candidate outlier” as any point above the upper fence or below the lower fence. The definitions of the fences are based on the inter-quartile range (IQR).
If IQR = 75th percentile - 25th percentile, then the upper fence is the 75th percentile + 1.5 IQR, and the lower fence is the 25th percentile - 1.5 IQR.
So for these `energy` data,
- the upper fence is located at 2306 + 1.5(938.5) = 3713.75
- the lower fence is located at 1367 - 1.5(938.5) = -40.75
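These fence calculations are easy to verify (self-contained, using the quartiles quoted above):

```r
# Tukey's fences for the energy data
q1 <- 1367      # 25th percentile
q3 <- 2306      # 75th percentile
iqr <- 938.5    # inter-quartile range as quoted in the text

upper_fence <- q3 + 1.5 * iqr   # 3713.75
lower_fence <- q1 - 1.5 * iqr   # -40.75
```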
In this case, we see no points identified as outliers in the low part of the distribution, but quite a few identified that way on the high side. If the data follow a Normal distribution, this standard tends to identify fewer than 1% of the observations as candidate outliers.
- This plot is indicating clearly that there is some asymmetry (skew) in the data, specifically right skew.
- The standard R uses is to indicate as outliers any points that are more than 1.5 inter-quartile ranges away from the edges of the box.
The horizontal orientation I've chosen here clarifies the relationship between the direction of skew and the plot. A plot like this, with multiple outliers on the right side, indicates a long right tail in the distribution, and hence positive or right skew, with the mean being larger than the median. Other indications of skew include one side of the box being substantially wider than the other, or one side of the whiskers being substantially longer than the other. More on skew later.
7.11 A Simple Comparison Boxplot
Boxplots are most often used for comparison. We can build boxplots using ggplot2, as well, and we'll discuss that in detail later. For now, here's a boxplot built to compare the energy results by the subject's race/ethnicity.
ggplot(nnyfs, aes(x = factor(race_eth), y = energy, fill=factor(race_eth))) +
geom_boxplot() +
    guides(fill = "none") +
labs(y = "Energy consumed (kcal)", x = "Race/Ethnicity")
Let's look at the comparison of observed energy levels across the five categories in our phys_health variable, now making use of the viridis color scheme.
ggplot(nnyfs, aes(x = factor(phys_health), y = energy, fill = factor(phys_health))) +
geom_boxplot() +
scale_fill_viridis_d() +
labs(title = "Energy by Self-Reported Physical Health, in nnyfs data")
As a graph, that's not bad, but what if we want to improve it further? Let's turn the boxes in the horizontal direction, and get rid of the perhaps unnecessary phys_health labels.
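A sketch of one way to make those two changes (assuming the nnyfs data and ggplot2 are loaded, as in the chunks above): coord_flip() turns the boxes horizontal, and suppressing the fill legend removes the labels, which just repeat the phys_health categories already shown on the axis.

```r
# A sketch (assumes nnyfs is loaded): horizontal boxes via coord_flip(),
# legend suppressed since the axis already identifies the groups
ggplot(nnyfs, aes(x = factor(phys_health), y = energy,
                  fill = factor(phys_health))) +
  geom_boxplot() +
  scale_fill_viridis_d() +
  coord_flip() +
  guides(fill = "none") +
  labs(title = "Energy by Self-Reported Physical Health, in nnyfs data",
       x = "", y = "Energy consumed (kcal)")
```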
7.12 Using describe in the psych library
For additional numerical summaries, one option would be to consider using the describe function from the psych library.
vars n mean sd median trimmed mad min max range skew
X1 1 1518 1877.16 722.35 1794.5 1827.1 678.29 257 5265 5008 0.8
kurtosis se
X1 1.13 18.54
This function provides, in order, the following…
- n = the sample size
- mean = the sample mean
- sd = the sample standard deviation
- median = the median, or 50th percentile
- trimmed = the mean of the middle 80% of the data
- mad = the median absolute deviation
- min = the minimum value in the sample
- max = the maximum value in the sample
- range = max - min
- skew = a skewness measure, described below (indicates degree of asymmetry)
- kurtosis = a kurtosis measure, described below (indicates heaviness of tails, degree of outlier-proneness)
- se = the standard error of the sample mean = sd / square root of the sample size, useful in inference
7.12.1 The Trimmed Mean
The trim value in R indicates the proportion of observations to be trimmed from each end of the outcome distribution before the mean is calculated. The trimmed value provided by psych::describe is what this particular package calls a 20% trimmed mean (the bottom and top 10% of energy values are removed before taking the mean; it's the mean of the middle 80% of the data). I might call that a 10% trimmed mean in some settings, but that's just me.
[1] 1827.1
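Base R's mean() computes this directly through its trim argument; here's a minimal sketch with illustrative numbers (not the nnyfs data):

```r
# trim = 0.10 removes the lowest 10% and highest 10% of values
# before averaging -- the mean of the middle 80%
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)  # illustrative, right-skewed values
mean(x)               # pulled upward by the outlier: 14.5
mean(x, trim = 0.10)  # drops 1 and 100, then averages: 5.5
```

Applied to the energy data, mean(energy, trim = 0.1) reproduces the 1827.1 shown above.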
7.12.2 The Median Absolute Deviation
An alternative to the IQR that is fancier, and a bit more robust, is the median absolute deviation, which, in large sample sizes, for data that follow a Normal distribution, will be (in expectation) equal to the standard deviation. The MAD is the median of the absolute deviations from the median, multiplied by a constant (1.4826) to yield asymptotically normal consistency.
[1] 678.2895
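Base R's mad() implements this definition; a sketch with illustrative values:

```r
x <- c(1, 2, 3, 4, 5)
# median = 3; absolute deviations from it are 2, 1, 0, 1, 2 (median 1)
mad(x, constant = 1)  # unscaled median absolute deviation: 1
mad(x)                # default scaling by 1.4826 for Normal consistency: 1.4826
```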
7.13 Assessing Skew
A relatively common idea is to assess skewness, several measures of which (including the one below, sometimes called type 3 skewness, or Pearson’s moment coefficient of skewness) are available. Many models assume a Normal distribution, where, among other things, the data are symmetric around the mean.
Skewness measures asymmetry in the distribution - left skew (mean < median) is indicated by negative skewness values, while right skew (mean > median) is indicated by positive values. The skew value will be near zero for data that follow a Normal distribution.
7.13.1 Non-parametric Skew via skew1
A simpler measure of skew, sometimes called the nonparametric skew and closely related to Pearson’s notion of median skewness, falls between -1 and +1 for any distribution. It is just the difference between the mean and the median, divided by the standard deviation.
- Values greater than +0.2 are sometimes taken to indicate fairly substantial right skew, while values below -0.2 indicate fairly substantial left skew.
[1] 0.114427
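The value above is easy to compute by hand. Here's a sketch of a skew1 function of this form (the name follows the section heading; the exact helper used to produce the output isn't shown in the text):

```r
# nonparametric skew: (mean - median) / sd, always between -1 and +1
skew1 <- function(x) (mean(x) - median(x)) / sd(x)

skew1(c(1, 2, 3, 4, 100))  # illustrative right-skewed values: positive result
skew1(c(1, 2, 3, 4, 5))    # symmetric values: exactly zero
```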
The Wikipedia page on skewness, from which some of this material is derived, provides definitions for several other skewness measures.
7.14 Assessing Kurtosis (Heavy-Tailedness)
Another measure of a distribution's shape that can be found in the psych library is the kurtosis. Kurtosis is an indicator of whether the distribution is heavy-tailed or light-tailed as compared to a Normal distribution. Positive kurtosis means more of the variance is due to outliers - unusual points far away from the mean, relative to what we might expect from a Normally distributed data set with the same standard deviation.
- A Normal distribution will have a kurtosis value near 0, a distribution with similar tail behavior to what we would expect from a Normal is said to be mesokurtic
- Higher kurtosis values (meaningfully higher than 0) indicate that, as compared to a Normal distribution, the observed variance is more the result of extreme outliers (i.e. heavy tails) as opposed to being the result of more modest sized deviations from the mean. These heavy-tailed, or outlier prone, distributions are sometimes called leptokurtic.
- Kurtosis values meaningfully lower than 0 indicate light-tailed data, with fewer outliers than we’d expect in a Normal distribution. Such distributions are sometimes referred to as platykurtic, and include distributions without outliers, like the Uniform distribution.
Here’s a table:
Fewer outliers than a Normal | Approximately Normal | More outliers than a Normal
---|---|---
Light-tailed | "Normalish" | Heavy-tailed
platykurtic (kurtosis < 0) | mesokurtic (kurtosis = 0) | leptokurtic (kurtosis > 0)
[1] 1.130539
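The value above comes from psych::describe. As a self-contained illustration, excess kurtosis can be computed from the fourth standardized moment (note that psych applies a slightly different small-sample adjustment by default, so its results will differ a bit from this population-moment version):

```r
# excess kurtosis: fourth standardized moment minus 3,
# using population (divide-by-n) moments
kurtosis_excess <- function(x) {
  m2 <- mean((x - mean(x))^2)
  mean((x - mean(x))^4) / m2^2 - 3
}

kurtosis_excess(1:5)  # evenly spaced, light-tailed (platykurtic): -1.3
```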
7.14.1 The Standard Error of the Sample Mean
The standard error of the sample mean is the standard deviation divided by the square root of the sample size:
[1] 18.54018
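In base R terms (a sketch with illustrative numbers, not the nnyfs data):

```r
x <- c(2, 4, 6, 8)                  # illustrative values
se_mean <- sd(x) / sqrt(length(x))  # standard error of the sample mean
se_mean                             # about 1.291
```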
7.15 The describe function in the Hmisc library
The Hmisc library has lots of useful functions. It's named for its main developer, Frank Harrell. The describe function in Hmisc knows enough to separate numerical from categorical variables, and give you separate (and detailed) summaries for each.
- For a categorical variable, it provides counts of total observations (n), the number of missing values, and the number of unique categories, along with counts and percentages falling in each category.
- For a numerical variable, it provides:
- counts of total observations (n), the number of missing values, and the number of unique values
- an Info value for the data, which indicates how continuous the variable is (a score of 1 is generally indicative of a completely continuous variable with no ties, while scores near 0 indicate lots of ties, and very few unique values)
- the sample Mean
- many sample percentiles (quantiles) of the data, specifically the 5th, 10th, 25th, 50th, 75th, 90th, and 95th
- either a complete table of all observed values, with counts and percentages (if there are a modest number of unique values), or
- a table of the five smallest and five largest values in the data set, which is useful for range checking.
3 Variables 1518 Observations
---------------------------------------------------------------------------
waist
n missing distinct Info Mean Gmd .05 .10
1512 6 510 1 67.71 16.6 49.40 51.40
.25 .50 .75 .90 .95
55.60 64.80 76.60 88.70 96.84
lowest : 42.5 43.4 44.1 44.4 44.5, highest: 125.8 126.0 127.0 132.3 144.7
---------------------------------------------------------------------------
energy
n missing distinct Info Mean Gmd .05 .10
1518 0 1137 1 1877 796.1 849 1047
.25 .50 .75 .90 .95
1368 1794 2306 2795 3195
lowest : 257 260 326 349 392, highest: 4382 4529 5085 5215 5265
---------------------------------------------------------------------------
bmi
n missing distinct Info Mean Gmd .05 .10
1514 4 225 1 19.63 5.269 14.30 14.90
.25 .50 .75 .90 .95
15.90 18.10 21.90 26.27 30.20
lowest : 11.9 12.6 12.7 12.9 13.0, highest: 42.8 43.0 46.9 48.2 48.3
---------------------------------------------------------------------------
More on the Info value is available in the documentation for Hmisc::describe.
7.16 xda from GitHub for numerical summaries for exploratory data analysis
## next two commands needed if xda is not already installed
library(devtools)
install_github("ujjwalkarn/xda")
Skipping install of 'xda' from a github remote, the SHA1 (86cf14db) has not changed since last install.
Use `force = TRUE` to force installation
n mean sd max min range
SEQN 1518 7.27e+04 4.57e+02 73492.0 71917.00 1575.0
age_child 1518 9.03e+00 3.71e+00 15.0 3.00 12.0
educ_child 1181 4.26e+00 2.82e+00 9.0 0.00 9.0
sampling_wt 1518 3.41e+04 1.60e+04 104673.9 9412.87 95261.0
income_pov 1429 2.24e+00 1.61e+00 5.0 0.00 5.0
age_adult 1518 4.00e+01 9.60e+00 80.0 18.00 62.0
energy 1518 1.88e+03 7.22e+02 5265.0 257.00 5008.0
protein 1518 6.69e+01 3.10e+01 241.8 4.18 237.7
sugar 1518 1.24e+02 5.90e+01 405.5 1.00 404.5
fat 1518 6.74e+01 3.37e+01 235.2 1.70 233.5
water 1518 5.43e+02 6.50e+02 8591.2 0.00 8591.2
plank_time 1384 6.11e+01 4.58e+01 450.0 1.00 449.0
height 1514 1.37e+02 2.30e+01 188.9 89.70 99.2
weight 1514 3.98e+01 2.08e+01 136.9 12.30 124.6
bmi 1514 1.96e+01 5.08e+00 48.3 11.90 36.4
arm_length 1511 2.90e+01 5.63e+00 42.5 17.30 25.2
waist 1512 6.77e+01 1.52e+01 144.7 42.50 102.2
arm_circ 1513 2.29e+01 5.50e+00 46.8 13.70 33.1
calf_circ 1509 2.98e+01 6.19e+00 55.9 18.30 37.6
calf_skinfold 1390 1.37e+01 6.81e+00 39.9 3.70 36.2
triceps_skinfold 1497 1.44e+01 6.76e+00 38.8 4.00 34.8
subscapular_skinfold 1451 1.07e+01 6.31e+00 38.0 3.70 34.3
active_days 1513 5.39e+00 2.12e+00 7.0 0.00 7.0
tv_hours 1514 1.90e+00 1.37e+00 5.0 0.00 5.0
computer_hours 1515 1.04e+00 1.35e+00 5.0 0.00 5.0
meals_out 1510 1.79e+00 2.13e+00 20.0 0.00 20.0
med_count 1518 2.96e-01 7.56e-01 6.0 0.00 6.0
nunique nzeros iqr lowerbound upperbound
SEQN 1518 0 792.00 71121.25 74288.75
age_child 13 0 6.00 -3.00 21.00
educ_child 11 146 5.00 -5.50 14.50
sampling_wt 845 0 19987.74 -6664.16 73223.34
income_pov 238 15 2.65 -3.10 7.49
age_adult 55 0 12.00 15.00 63.00
energy 1137 0 939.00 -41.00 3714.50
protein 1482 0 37.25 -10.55 138.44
sugar 1491 0 74.35 -28.86 268.57
fat 1474 0 42.48 -20.57 149.34
water 342 326 636.94 -851.73 1696.05
plank_time 190 0 55.00 -55.50 164.75
height 694 0 38.45 60.67 214.45
weight 615 0 30.55 -23.30 98.90
bmi 226 0 6.00 6.90 30.90
arm_length 222 0 9.60 9.80 48.20
waist 511 0 21.00 24.10 108.10
arm_circ 240 0 7.80 6.70 37.90
calf_circ 259 0 9.57 10.34 48.66
calf_skinfold 253 0 8.40 -4.00 29.60
triceps_skinfold 263 0 8.90 -4.25 31.35
subscapular_skinfold 230 0 7.00 -4.30 23.70
active_days 9 68 3.00 -0.50 11.50
tv_hours 7 254 2.00 -2.00 6.00
computer_hours 7 759 2.00 -3.00 5.00
meals_out 17 425 2.00 -3.00 5.00
med_count 7 1241 0.00 0.00 0.00
noutlier kurtosis skewness mode miss miss%
SEQN 0 -1.2050 0.000814 71917.0 0 0.000
age_child 0 -1.2317 -0.025225 12.0 0 0.000
educ_child 0 -1.2195 0.010719 NA 337 22.200
sampling_wt 37 1.9218 1.123744 53977.6 0 0.000
income_pov 0 -1.1165 0.501565 5.0 89 5.863
age_adult 20 0.8409 0.631691 40.0 0 0.000
energy 30 1.1305 0.797210 1915.0 0 0.000
protein 45 2.0114 1.121048 82.5 0 0.000
sugar 30 1.4265 0.921237 94.7 0 0.000
fat 43 1.6609 1.024582 76.7 0 0.000
water 79 24.9830 3.486903 0.0 0 0.000
plank_time 40 5.3557 1.538689 NA 134 8.827
height 0 -1.0648 -0.057423 154.0 4 0.264
weight 18 1.0902 1.031371 22.0 4 0.264
bmi 63 3.3768 1.600300 15.5 4 0.264
arm_length 0 -1.0848 -0.023759 34.0 7 0.461
waist 25 1.1911 1.041341 55.4 6 0.395
arm_circ 19 0.5714 0.862535 17.0 5 0.329
calf_circ 6 -0.1366 0.535805 33.0 9 0.593
calf_skinfold 49 0.7705 1.114848 NA 128 8.432
triceps_skinfold 39 0.8936 1.135255 8.0 21 1.383
subscapular_skinfold 77 2.2220 1.573722 NA 67 4.414
active_days 0 0.0351 -1.095239 7.0 5 0.329
tv_hours 0 -0.2414 0.538424 2.0 4 0.264
computer_hours 0 0.8827 1.293648 0.0 3 0.198
meals_out 70 12.4583 2.823921 0.0 8 0.527
med_count 277 14.3174 3.409516 0.0 0 0.000
1% 5% 25% 50% 75% 95%
SEQN 7.19e+04 71993.85 72309.25 72704.50 73100.75 73415.2
age_child 3.00e+00 3.00 6.00 9.00 12.00 15.0
educ_child 0.00e+00 0.00 2.00 4.00 7.00 9.0
sampling_wt 1.07e+04 13352.52 23317.44 31605.91 43241.74 62215.7
income_pov 2.80e-03 0.31 0.87 1.74 3.52 5.0
age_adult 2.12e+01 26.00 33.00 40.00 45.00 56.0
energy 5.67e+02 849.00 1367.50 1794.50 2306.00 3195.4
protein 1.67e+01 26.53 45.33 61.25 82.57 125.7
sugar 2.41e+01 44.55 82.66 116.91 157.05 234.0
fat 1.16e+01 22.74 43.16 61.98 85.62 129.9
water 0.00e+00 0.00 103.69 375.00 740.64 1718.2
plank_time 2.00e+00 5.00 27.00 54.00 82.25 144.0
height 9.45e+01 100.73 118.35 137.60 156.78 172.2
weight 1.36e+01 15.70 22.52 34.95 53.08 77.7
bmi 1.35e+01 14.30 15.90 18.10 21.90 30.2
arm_length 1.88e+01 20.00 24.20 29.00 33.80 37.5
waist 4.61e+01 49.40 55.60 64.80 76.60 96.8
arm_circ 1.50e+01 16.10 18.40 22.00 26.20 33.3
calf_circ 1.98e+01 21.40 24.70 29.20 34.30 40.2
calf_skinfold 4.89e+00 5.95 8.60 12.00 17.00 27.7
triceps_skinfold 5.60e+00 6.80 9.10 12.40 18.00 28.0
subscapular_skinfold 4.20e+00 4.80 6.20 8.20 13.20 24.1
active_days 0.00e+00 1.00 4.00 7.00 7.00 7.0
tv_hours 0.00e+00 0.00 1.00 2.00 3.00 5.0
computer_hours 0.00e+00 0.00 0.00 0.00 2.00 4.0
meals_out 0.00e+00 0.00 0.00 1.00 2.00 5.0
med_count 0.00e+00 0.00 0.00 0.00 0.00 2.0
99%
SEQN 73476.8
age_child 15.0
educ_child 9.0
sampling_wt 86783.8
income_pov 5.0
age_adult 66.0
energy 4051.7
protein 165.8
sugar 309.1
fat 172.8
water 2841.3
plank_time 200.9
height 180.5
weight 103.2
bmi 36.9
arm_length 39.5
waist 112.2
arm_circ 38.3
calf_circ 45.3
calf_skinfold 33.8
triceps_skinfold 35.5
subscapular_skinfold 31.4
active_days 7.0
tv_hours 5.0
computer_hours 5.0
meals_out 10.0
med_count 4.0
Most of the elements of this numSummary should be familiar. Some new pieces include:
- nunique = the number of unique values
- nzeros = the number of zeroes
- noutlier = the number of outliers (using a standard that isn't entirely transparent to me)
- miss = the number of rows with missing values
- miss% = the percentage of total rows with missing values ((miss/n)*100)
- 5% = the 5th percentile value of that variable (the value below which 5 percent of the observations may be found)
n miss miss% unique
sex 1518 0 0.000 2
race_eth 1518 0 0.000 4
language 1518 0 0.000 2
educ_adult 1496 22 1.449 6
respondent 1506 12 0.791 4
salt_used 1505 13 0.856 3
diet_yesterday 1516 2 0.132 4
bmi_cat 1514 4 0.264 5
physical_last_week 1514 4 0.264 3
enjoy_recess 1240 278 18.314 6
insured 1518 0 0.000 2
phys_health 1518 0 0.000 5
access_to_care 1518 0 0.000 2
care_source 1518 0 0.000 6
asthma_ever 1518 0 0.000 2
asthma_now 1518 0 0.000 2
med_use 1518 0 0.000 2
insurance 1518 0 0.000 10
top5levels:count
sex Female:760, Male:758
race_eth 2_White Non-Hispanic:610, 1_Hispanic:450, 3_Black Non-Hispanic:338, 4_Other Race/Ethnicity:120
language English:1285, Spanish:233
educ_adult 4_Some College:464, 5_College Graduate:386, 3_High School Graduate:318, 2_9-11th Grade:187, 1_Less than 9th Grade:141
respondent Child:866, Mom:522, Other:118
salt_used Yes:858, No:647
diet_yesterday 2_Usual:1175, 3_Much less than usual:203, 1_Much more than usual:138
bmi_cat 2_Normal:920, 4_Obese:295, 3_Overweight:258, 1_Underweight:41
physical_last_week Yes:1262, No:252
enjoy_recess 1_Strongly Agree:903, 2_Agree:257, 3_Neither Agree nor Disagree:49, 4_Disagree:21, 5_Strongly Disagree:10
insured Has Insurance:1447, Not Insured:71
phys_health 1_Excellent:742, 2_VeryGood:424, 3_Good:315, 4_Fair:35, 5_Poor:2
access_to_care Has Usual Care Source:1491, No Usual Care Source:27
care_source Doctor's Office:1164, Clinic or Health Center:297, No Usual Care Source:27, Hospital ER:18, Hospital Outpatient:8
asthma_ever Never Had Asthma:1258, History of Asthma:260
asthma_now No Asthma Now:1347, Asthma Now:171
med_use No Medications:1241, Had Medication:277
insurance Private:736, Medicaid:470, State Sponsored:179, Uninsured:71, SCHIP:21
The top5levels:count column provides the top 5 unique values for each variable, sorted by their counts.
7.17 What Summaries to Report
It is usually helpful to focus on the shape, center and spread of a distribution. Bock, Velleman and DeVeaux provide some useful advice:
- If the data are skewed, report the median and IQR (or the three middle quantiles). You may want to include the mean and standard deviation, but you should point out why the mean and median differ. The fact that the mean and median do not agree is a sign that the distribution may be skewed. A histogram will help you make that point.
- If the data are symmetric, report the mean and standard deviation, and possibly the median and IQR as well.
- If there are clear outliers and you are reporting the mean and standard deviation, report them with the outliers present and with the outliers removed. The differences may be revealing. The median and IQR are not likely to be seriously affected by outliers.