6 Summarizing Quantities
Most numerical summaries that might be new to you are applied most appropriately to quantitative variables. The measures that will interest us relate to:
- the center of our distribution,
- the spread of our distribution, and
- the shape of our distribution.
To demonstrate key ideas in this chapter, we will consider our sample of 750 adults ages 21-64 from NHANES 2011-12, which includes some missing values. We’ll load into the nh_750 data frame the information from the nh_adult750.Rds file we created in Section 4.2.
nh_750 <- read_rds("data/nh_adult750.Rds")
6.1 The summary function for Quantitative data
R’s summary function, for instance, provides a small sampling of numerical summaries.
Age BMI SBP
Min. :21.00 Min. :16.70 Min. : 83.0
1st Qu.:30.00 1st Qu.:24.20 1st Qu.:108.0
Median :40.00 Median :27.90 Median :118.0
Mean :40.82 Mean :29.08 Mean :118.8
3rd Qu.:51.00 3rd Qu.:32.10 3rd Qu.:127.0
Max. :64.00 Max. :80.60 Max. :209.0
NA's :5 NA's :33
DBP Pulse
Min. : 0.00 Min. : 40.00
1st Qu.: 66.00 1st Qu.: 66.00
Median : 73.00 Median : 72.00
Mean : 72.69 Mean : 73.53
3rd Qu.: 80.00 3rd Qu.: 80.00
Max. :108.00 Max. :124.00
NA's :33 NA's :32
This basic summary includes a set of five quantiles18, plus the sample’s mean.
- Min. = the minimum value for each variable, so, for example, the youngest subject’s Age was 21.
- 1st Qu. = the first quartile (25th percentile) for each variable - for example, 25% of the subjects were Age 30 or younger.
- Median = the median (50th percentile) - half of the subjects were Age 40 or younger.
- Mean = the mean, usually what one means by an average - the sum of the Ages divided by 750 is 40.8.
- 3rd Qu. = the third quartile (75th percentile) - 25% of the subjects were Age 51 or older.
- Max. = the maximum value for each variable, so the oldest subject was Age 64.
The summary also specifies the number of missing values for each variable. Here, we are missing 5 of the BMI values, for example.
6.2 Measuring the Center of a Distribution
6.2.1 The Mean and The Median
The mean and median are the most commonly used measures of the center of a distribution for a quantitative variable. The median is the more generally useful value, as it is relevant even if the data have a shape that is not symmetric. We might also collect the sum of the observations, and the count of the number of observations, usually symbolized with n.
For variables without missing values, like Age, this is pretty straightforward.
# A tibble: 1 x 4
n Mean Median Sum
<int> <dbl> <dbl> <int>
1 750 40.8 40 30616
And again, the Mean is just the Sum (30616), divided by the number of non-missing values of Age (750), or 40.8213333.
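As a minimal sketch of how a table like this could be produced, the code below uses a small made-up tibble (`toy`, a hypothetical stand-in rather than the NHANES sample) and confirms that the mean is just the sum divided by the count. The column names mirror the output above, but the exact call used in the original is an assumption.

```r
library(dplyr)

# Hypothetical toy data, not the NHANES sample
toy <- tibble(Age = c(21L, 30L, 40L, 51L, 64L, 34L))

toy_summary <- toy %>%
  summarise(n = n(),              # number of rows
            Mean = mean(Age),
            Median = median(Age),
            Sum = sum(Age))

toy_summary

# The mean is the sum divided by the number of observations
toy_summary$Sum / toy_summary$n
```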
The Median is the middle value when the data are sorted in order. When we have an odd number of values, this is sufficient. When we have an even number, as in this case, we take the mean of the two middle values. We could sort and list all 750 Ages, if we wanted to do so.
# A tibble: 750 x 1
Age
<int>
1 21
2 21
3 21
4 21
5 21
6 21
7 21
8 21
9 21
10 21
# ... with 740 more rows
But a tibble like this one prints only its first 10 rows by default.
If we really want to see all of the data, we can use View(nh_750) to get a spreadsheet-style presentation, or use the sort command…
sort(nh_750$Age)
[1] 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21
[19] 21 21 21 21 22 22 22 22 22 22 22 22 22 22 22 22 22 22
[37] 22 22 22 22 22 22 22 22 23 23 23 23 23 23 23 23 23 23
[55] 23 23 23 23 23 23 23 23 23 23 23 23 24 24 24 24 24 24
[73] 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 25
[91] 25 25 25 25 25 25 25 25 25 25 25 25 25 26 26 26 26 26
[109] 26 26 26 26 26 26 26 26 26 27 27 27 27 27 27 27 27 27
[127] 27 27 27 27 28 28 28 28 28 28 28 28 28 28 28 28 28 28
[145] 28 28 28 28 28 28 29 29 29 29 29 29 29 29 29 29 29 29
[163] 29 29 29 29 29 29 29 29 30 30 30 30 30 30 30 30 30 30
[181] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 31 31 31 31
[199] 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 32 32 32
[217] 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32
[235] 32 32 32 32 32 32 32 33 33 33 33 33 33 33 33 33 33 33
[253] 33 33 33 33 33 33 33 33 33 33 33 33 33 34 34 34 34 34
[271] 34 34 34 34 34 34 34 34 34 35 35 35 35 35 35 35 35 35
[289] 35 35 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36
[307] 36 36 36 36 36 36 36 36 36 36 37 37 37 37 37 37 37 37
[325] 37 37 37 37 37 37 37 37 37 37 38 38 38 38 38 38 38 38
[343] 38 38 38 38 38 38 38 38 39 39 39 39 39 39 39 39 39 39
[361] 39 39 39 39 39 39 39 39 39 40 40 40 40 40 40 40 40 40
[379] 40 40 40 40 40 40 40 40 40 41 41 41 41 41 41 41 41 41
[397] 41 41 41 41 42 42 42 42 42 42 42 42 42 42 42 42 42 42
[415] 42 42 42 42 42 43 43 43 43 43 43 43 43 43 43 43 43 43
[433] 43 43 43 43 43 44 44 44 44 44 44 44 44 44 44 44 44 44
[451] 44 44 44 45 45 45 45 45 45 45 45 45 45 45 45 45 45 45
[469] 45 45 46 46 46 46 46 46 46 46 46 46 46 46 46 46 47 47
[487] 47 47 47 47 47 47 47 47 47 47 47 47 47 47 47 48 48 48
[505] 48 48 48 48 48 48 48 48 49 49 49 49 49 49 49 49 49 49
[523] 49 49 49 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
[541] 50 50 50 50 50 50 50 50 50 50 50 51 51 51 51 51 51 51
[559] 51 51 51 51 51 51 51 51 51 51 51 51 51 52 52 52 52 52
[577] 52 52 52 52 52 52 52 53 53 53 53 53 53 53 53 53 53 53
[595] 53 53 53 53 54 54 54 54 54 54 54 54 54 54 54 54 54 54
[613] 54 54 55 55 55 55 55 55 55 55 55 55 56 56 56 56 56 56
[631] 56 56 56 56 56 56 56 56 56 56 56 56 56 57 57 57 57 57
[649] 57 57 57 57 58 58 58 58 58 58 58 58 58 58 58 58 58 58
[667] 58 58 58 58 59 59 59 59 59 59 59 59 59 59 59 59 59 60
[685] 60 60 60 60 60 60 60 60 60 60 60 60 61 61 61 61 61 61
[703] 61 61 61 61 61 61 61 61 62 62 62 62 62 62 62 62 62 62
[721] 62 62 62 63 63 63 63 63 63 63 63 63 63 63 63 63 64 64
[739] 64 64 64 64 64 64 64 64 64 64 64 64
Again, to find the median, we would take the mean of the middle two observations in this sorted data set. With 750 observations, those are the 375th and 376th values in sorted order.
sort(nh_750$Age)[375:376]
[1] 40 40
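The rule for odd and even sample sizes can be sketched as a small function. Here `median_by_hand` is a hypothetical helper written for illustration, and the `ages` vector is made up; with n values in sorted order and n even, the two middle values sit at positions n/2 and n/2 + 1.

```r
# Hypothetical helper: the middle value when n is odd, or the mean of
# the two middle values when n is even
median_by_hand <- function(x) {
  x <- sort(x)            # sort() drops NA values by default
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]
  } else {
    mean(x[c(n / 2, n / 2 + 1)])
  }
}

ages <- c(33, 21, 40, 64, 51, 30)   # made-up values, n = 6

median_by_hand(ages)   # mean of the 3rd and 4th sorted values
median(ages)           # base R's median() agrees
```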
6.2.2 Dealing with Missingness
When calculating a mean, you may be tempted to try something like this…
# A tibble: 1 x 2
`mean(Pulse)` `median(Pulse)`
<dbl> <int>
1 NA NA
This fails because we have some missing values in the Pulse data. We can address this either by omitting the rows with missing values before we run the summarise() function, or by telling the mean and median summary functions to remove missing values.19
# A tibble: 1 x 3
count `mean(Pulse)` `median(Pulse)`
<int> <dbl> <dbl>
1 718 73.5 72
Or, we could tell the summary functions themselves to remove NA values.
# A tibble: 1 x 2
`mean(Pulse, na.rm = TRUE)` `median(Pulse, na.rm = TRUE)`
<dbl> <dbl>
1 73.5 72
In Chapter 8, we will discuss various assumptions we can make about missing data, and the importance of imputation when dealing with it in modeling or making inferences. For now, we will limit our descriptive summaries to observed values, in what are called complete case or available case analyses.
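Both approaches to missing values can be sketched on a small made-up tibble (`toy` is hypothetical, and the exact calls behind the tables above are an assumption).

```r
library(dplyr)

# Hypothetical toy data with one missing Pulse value
toy <- tibble(Pulse = c(60, 72, NA, 80, 68))

mean(toy$Pulse)   # NA, because one value is missing

# Option 1: drop the incomplete rows first, then summarise
opt1 <- toy %>%
  filter(complete.cases(Pulse)) %>%
  summarise(count = n(), mean = mean(Pulse), median = median(Pulse))

# Option 2: keep all rows and ask each function to ignore NA values
opt2 <- toy %>%
  summarise(mean = mean(Pulse, na.rm = TRUE),
            median = median(Pulse, na.rm = TRUE))

opt1
opt2
```

Either way, the summaries describe only the observed values.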
6.2.3 The Mode of a Quantitative Variable
One other less common measure of the center of a quantitative variable’s distribution is its most frequently observed value, referred to as the mode. This measure is only appropriate for discrete variables, be they quantitative or categorical. To find the mode, we usually tabulate the data, and then sort by the counts of the numbers of observations.
# A tibble: 44 x 2
Age count
<int> <int>
1 32 28
2 36 26
3 50 26
4 30 24
5 33 24
6 24 23
7 21 22
8 22 22
9 23 22
10 28 20
# ... with 34 more rows
The mode is just the most common Age observed in the data.
Note the use of three different “verbs” in our function there - for more explanation of this strategy, visit Grolemund and Wickham.20 The group_by function here is very useful. It converts the nh_750 data frame into a new grouped tibble where operations are performed on the groups. Here, this means that it groups the data by Age before counting observations, and then sorting the groups (the Ages) by their frequencies.
As an alternative, the modeest package’s mfv function calculates the sample mode (or most frequent value).21
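The tabulate-then-sort strategy might look like the sketch below. It uses a made-up tibble (`toy` is hypothetical), and the three verbs chosen here (group_by, summarise, arrange) are one plausible combination rather than a confirmed copy of the original code.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(Age = c(32, 36, 32, 50, 21, 32, 36))

age_counts <- toy %>%
  group_by(Age) %>%               # one group per distinct Age
  summarise(count = n()) %>%      # count the observations in each group
  arrange(desc(count))            # most frequent Ages first

age_counts

# The mode is the Age in the first row (ties are not handled here)
age_counts$Age[1]
```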
6.3 Measuring the Spread of a Distribution
Statistics is all about variation, so spread or dispersion is an important fundamental concept in statistics. Measures of spread like the inter-quartile range and range (maximum - minimum) can help us understand and compare data sets. If the values in the data are close to the center, the spread will be small. If many of the values in the data are scattered far away from the center, the spread will be large.
6.3.1 The Range and the Interquartile Range (IQR)
The range of a quantitative variable is sometimes interpreted as the difference between the maximum and the minimum, even though R presents the actual minimum and maximum values when you ask for a range…
[1] 21 64
And, for a variable with missing values, we can use…
[1] 16.7 80.6
A more interesting and useful statistic is the inter-quartile range, or IQR, which is the range of the middle half of the distribution, calculated by subtracting the 25th percentile value from the 75th percentile value.
# A tibble: 1 x 3
`IQR(Age)` `quantile(Age, 0.25)` `quantile(Age, 0.75)`
<dbl> <dbl> <dbl>
1 21 30 51
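A short base R sketch of these two measures, using a made-up vector `x` with one missing value (an assumption for illustration, not the NHANES data):

```r
x <- c(21, 30, 40, 51, 64, NA)   # made-up values with one NA

range(x, na.rm = TRUE)            # returns the minimum and the maximum
diff(range(x, na.rm = TRUE))      # the range as a single number: max - min

# The IQR is the 75th percentile minus the 25th percentile
q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
unname(q[2] - q[1])
IQR(x, na.rm = TRUE)              # same value
```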
We can calculate the range and IQR nicely from the summary information on quantiles, of course:
Age BMI SBP
Min. :21.00 Min. :16.70 Min. : 83.0
1st Qu.:30.00 1st Qu.:24.20 1st Qu.:108.0
Median :40.00 Median :27.90 Median :118.0
Mean :40.82 Mean :29.08 Mean :118.8
3rd Qu.:51.00 3rd Qu.:32.10 3rd Qu.:127.0
Max. :64.00 Max. :80.60 Max. :209.0
NA's :5 NA's :33
DBP Pulse
Min. : 0.00 Min. : 40.00
1st Qu.: 66.00 1st Qu.: 66.00
Median : 73.00 Median : 72.00
Mean : 72.69 Mean : 73.53
3rd Qu.: 80.00 3rd Qu.: 80.00
Max. :108.00 Max. :124.00
NA's :33 NA's :32
6.3.2 The Variance and the Standard Deviation
The IQR is always a reasonable summary of spread, just as the median is always a reasonable summary of the center of a distribution. Yet, most people are inclined to summarize a batch of data using two numbers: the mean and the standard deviation. This is really only a sensible thing to do if you are willing to assume the data follow a Normal distribution: a bell-shaped, symmetric distribution without substantial outliers.
But most data do not (even approximately) follow a Normal distribution. Summarizing by the median and quartiles (25th and 75th percentiles) is much more robust, explaining R’s emphasis on them.
6.3.3 Obtaining the Variance and Standard Deviation in R
Here are the variances of the quantitative variables in the nh_750
data. Note the need to include na.rm = TRUE
to deal with the missing values in some variables.
nh_750 %>%
select(Age, BMI, SBP, DBP, Pulse) %>%
summarise_all(var, na.rm = TRUE)
# A tibble: 1 x 5
Age BMI SBP DBP Pulse
<dbl> <dbl> <dbl> <dbl> <dbl>
1 157. 52.4 229. 128. 136.
And here are the standard deviations of those same variables.
nh_750 %>%
select(Age, BMI, SBP, DBP, Pulse) %>%
summarise_all(sd, na.rm = TRUE)
# A tibble: 1 x 5
Age BMI SBP DBP Pulse
<dbl> <dbl> <dbl> <dbl> <dbl>
1 12.5 7.24 15.1 11.3 11.6
6.3.4 Defining the Variance and Standard Deviation
Bock, Velleman, and De Veaux22 have lots of useful thoughts on this topic, which are lightly edited here.
In thinking about spread, we might consider how far each data value is from the mean. Such a difference is called a deviation. We could just average the deviations, but the positive and negative differences always cancel out, leaving an average deviation of zero, so that’s not helpful. Instead, we square each deviation to obtain non-negative values, and to emphasize larger differences. When we add up these squared deviations and find their mean (almost), this yields the variance.
\[ \mbox{Variance} = s^2 = \frac{\Sigma (y - \bar{y})^2}{n-1} \]
Why almost? It would be the mean of the squared deviations only if we divided the sum by \(n\), but instead we divide by \(n-1\) because doing so produces an estimate of the true (population) variance that is unbiased23. If you’re looking for a more intuitive explanation, this Stack Exchange link awaits your attention.
- To return to the original units of measurement, we take the square root of \(s^2\), and instead work with \(s\), the standard deviation, also abbreviated SD.
\[ \mbox{Standard Deviation} = s = \sqrt{\frac{\Sigma (y - \bar{y})^2}{n-1}} \]
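To connect the formulas to R’s built-in functions, here is a small sketch on a made-up vector `y` (the values are an assumption chosen to keep the arithmetic easy to follow).

```r
y <- c(2, 4, 4, 4, 5, 5, 7, 9)        # made-up values
n <- length(y)

deviations <- y - mean(y)
sum(deviations)                        # deviations from the mean sum to zero

s2 <- sum(deviations^2) / (n - 1)      # variance: divide by n - 1, not n
s  <- sqrt(s2)                         # standard deviation, in the original units

c(by_hand = s2, built_in = var(y))
c(by_hand = s,  built_in = sd(y))
```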
6.3.5 Interpreting the SD when the data are Normally distributed
For a set of measurements that follow a Normal distribution, the interval:
- Mean \(\pm\) Standard Deviation contains approximately 68% of the measurements;
- Mean \(\pm\) 2(Standard Deviation) contains approximately 95% of the measurements;
- Mean \(\pm\) 3(Standard Deviation) contains nearly all (approximately 99.7%) of the measurements.
We often refer to the population or process mean of a distribution with \(\mu\) and the standard deviation with \(\sigma\), leading to the Figure below.
But if the data are not from an approximately Normal distribution, then this Empirical Rule is less helpful.
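The three percentages in the Empirical Rule can be recovered from the Normal cumulative distribution function, and checked against a simulated sample. This is only a sketch: the simulated mean of 100, standard deviation of 15 and sample size are arbitrary assumptions.

```r
k <- 1:3

# Exact probability of falling within k standard deviations of the mean
within_k_sd <- pnorm(k) - pnorm(-k)
round(within_k_sd, 4)

# The same proportions in a simulated Normal sample (values will vary slightly)
set.seed(431)
z <- rnorm(10000, mean = 100, sd = 15)
sapply(k, function(kk) mean(abs(z - mean(z)) <= kk * sd(z)))
```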
6.3.6 Chebyshev’s Inequality: One Interpretation of the Standard Deviation
Chebyshev’s Inequality tells us that for any distribution, regardless of its relationship to a Normal distribution, no more than \(1/k^2\) of the distribution’s values can lie more than \(k\) standard deviations from the mean. This implies, for instance, that for any distribution, at least 75% of the values must lie within two standard deviations of the mean, and at least 89% must lie within three standard deviations of the mean.
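The bounds quoted above come from \(1 - 1/k^2\). As a rough check, the sketch below compares them with the observed proportions in a strongly right-skewed simulated sample; the exponential distribution and the sample size are arbitrary assumptions.

```r
k <- c(2, 3)

# Chebyshev's lower bound on the proportion within k SDs of the mean
chebyshev_lower_bound <- 1 - 1 / k^2

# Observed proportions in a strongly right-skewed simulated sample
set.seed(431)
x <- rexp(5000, rate = 1)
observed <- sapply(k, function(kk) mean(abs(x - mean(x)) <= kk * sd(x)))

rbind(k, chebyshev_lower_bound, observed)
```

The observed proportions can sit well above the bound, since Chebyshev’s Inequality is a worst-case guarantee.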
Again, most data sets do not follow a Normal distribution. We’ll return to this notion soon. But first, let’s try to draw some pictures that let us get a better understanding of the distribution of our data.
6.4 Measuring the Shape of a Distribution
When considering the shape of a distribution, one is often interested in three key points.
- The number of modes in the distribution, which I always assess through plotting the data.
- The skewness, or symmetry that is present, which I typically assess by looking at a plot of the distribution of the data, but if required to, will summarize with a non-parametric measure of skewness.
- The kurtosis, or heavy-tailedness (outlier-proneness) that is present, usually in comparison to a Normal distribution. Again, this is something I nearly inevitably assess graphically, but there are measures.
A Normal distribution has a single mode, is symmetric and, naturally, is neither heavy-tailed nor light-tailed as compared to a Normal distribution (we call this mesokurtic).
6.4.1 Multimodal vs. Unimodal distributions
A unimodal distribution, on some level, is straightforward. It is a distribution with a single mode, or “peak” in the distribution. Such a distribution may be skewed or symmetric, light-tailed or heavy-tailed. We usually describe as multimodal distributions like the two on the right below, which have multiple local maxima, even though they have just a single global maximum peak.
Truly multimodal distributions are usually described that way in terms of shape. For unimodal distributions, skewness and kurtosis become useful ideas.
6.4.2 Skew
Whether or not a distribution is approximately symmetric is an important consideration in describing its shape. Graphical assessments are always most useful in this setting, particularly for unimodal data. My favorite measure of skew, or skewness if the data have a single mode, is:
\[ skew_1 = \frac{\mbox{mean} - \mbox{median}}{\mbox{standard deviation}} \]
- Symmetric distributions generally show values of \(skew_1\) near zero. If the distribution is actually symmetric, the mean should be equal to the median.
- Distributions with \(skew_1\) values above 0.2 in absolute value generally indicate meaningful skew.
- Positive skew (mean > median if the data are unimodal) is also referred to as right skew.
- Negative skew (mean < median if the data are unimodal) is referred to as left skew.
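The \(skew_1\) measure is easy to compute directly. In this sketch, `skew1` is a hypothetical helper name and both vectors are made up.

```r
# Hypothetical helper: nonparametric skew = (mean - median) / sd
skew1 <- function(x) {
  x <- x[!is.na(x)]                 # work with the observed values only
  (mean(x) - median(x)) / sd(x)
}

symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(1, 2, 3, 4, 50)   # one large value pulls the mean up

skew1(symmetric)      # zero: the mean equals the median
skew1(right_skewed)   # positive: the mean exceeds the median
```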
6.4.3 Kurtosis
When we have a unimodal distribution that is symmetric, we will often be interested in the behavior of the tails of the distribution, as compared to a Normal distribution with the same mean and standard deviation. High values of kurtosis measures (and there are several) indicate data which has extreme outliers, or is heavy-tailed.
- A mesokurtic distribution has similar tail behavior to what we would expect from a Normal distribution.
- A leptokurtic distribution is a thinner, more slender distribution, with heavier tails than we’d expect from a Normal distribution. One example is the t distribution.
- A platykurtic distribution is a broader, flatter distribution, with thinner tails than we’d expect from a Normal distribution. One example is a uniform distribution.
library(tidyverse)   # tibble and ggplot2
library(patchwork)   # combines plots with + and /

set.seed(431)
# simulate 300 draws each from a Normal, a t (4 df) and a uniform distribution
sims_kurt <- tibble(meso = rnorm(n = 300, mean = 0, sd = 1),
                    lepto = rt(n = 300, df = 4),
                    platy = runif(n = 300, min = -2, max = 2))

# top row: histograms with a Normal density curve superimposed
# (after_stat() replaces the deprecated stat() in recent ggplot2 versions)
p1 <- ggplot(sims_kurt, aes(x = meso)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 25, fill = "royalblue", col = "white") +
  stat_function(fun = dnorm,
                args = list(mean = mean(sims_kurt$meso),
                            sd = sd(sims_kurt$meso)),
                col = "red") +
  labs(title = "Normal (mesokurtic)")

# bottom row: boxplots laid over violin plots
p1a <- ggplot(sims_kurt, aes(x = meso, y = "")) +
  geom_violin() +
  geom_boxplot(fill = "royalblue", outlier.color = "royalblue", width = 0.3) +
  labs(y = "", x = "Normal (mesokurtic)")

p2 <- ggplot(sims_kurt, aes(x = lepto)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 25, fill = "tomato", col = "white") +
  stat_function(fun = dnorm,
                args = list(mean = mean(sims_kurt$lepto),
                            sd = sd(sims_kurt$lepto)),
                col = "royalblue") +
  labs(title = "t (leptokurtic)")

p2a <- ggplot(sims_kurt, aes(x = lepto, y = "")) +
  geom_violin() +
  geom_boxplot(fill = "tomato", outlier.color = "tomato", width = 0.3) +
  labs(y = "", x = "t (slender with heavy tails)")

p3 <- ggplot(sims_kurt, aes(x = platy)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 25, fill = "yellow", col = "black") +
  stat_function(fun = dnorm,
                args = list(mean = mean(sims_kurt$platy),
                            sd = sd(sims_kurt$platy)),
                col = "royalblue", lwd = 1.5) +
  xlim(-3, 3) +
  labs(title = "Uniform (platykurtic)")

p3a <- ggplot(sims_kurt, aes(x = platy, y = "")) +
  geom_violin() +
  geom_boxplot(fill = "yellow", width = 0.3) +
  xlim(-3, 3) +
  labs(y = "", x = "Uniform (broad with thin tails)")

# stack the histograms above the boxplots, giving the histograms more height
(p1 + p2 + p3) / (p1a + p2a + p3a) +
  plot_layout(heights = c(3, 1))
Graphical tools are in most cases the best way to identify issues related to kurtosis.
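If a number is wanted to accompany the plots, one simple moment-based measure of excess kurtosis can be sketched as follows. The helper name `excess_kurtosis` is hypothetical, and this is only one of the several kurtosis measures in use, so its values need not match those reported by any particular package.

```r
# Hypothetical helper: fourth central moment divided by the squared
# second central moment, minus 3 (the value for a Normal distribution)
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

set.seed(431)
excess_kurtosis(rnorm(10000))            # should be near 0 (mesokurtic)
excess_kurtosis(rt(10000, df = 4))       # typically well above 0 (leptokurtic)
excess_kurtosis(runif(10000, -2, 2))     # should be near -1.2 (platykurtic)
```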
6.5 Numerical Summaries for Quantitative Variables
6.5.1 favstats in the mosaic package
The favstats function adds the standard deviation, and counts of overall and missing observations, to our usual summary for a continuous variable. Let’s look at systolic blood pressure, because we haven’t yet.
mosaic::favstats(~ SBP, data = nh_750)
min Q1 median Q3 max mean sd n missing
83 108 118 127 209 118.7908 15.14329 717 33
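We could, of course, duplicate these results with several summarise() pieces. Note that because we filter to the complete cases first, the miss count in this version is 0, rather than the 33 that favstats reports.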
nh_750 %>%
filter(complete.cases(SBP)) %>%
summarise(min = min(SBP), Q1 = quantile(SBP, 0.25),
median = median(SBP), Q3 = quantile(SBP, 0.75),
max = max(SBP), mean = mean(SBP),
sd = sd(SBP), n = n(), miss = sum(is.na(SBP)))
# A tibble: 1 x 9
min Q1 median Q3 max mean sd n miss
<int> <dbl> <int> <dbl> <int> <dbl> <dbl> <int> <int>
1 83 108 118 127 209 119. 15.1 717 0
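If the goal is to report the number of missing values alongside the other statistics, one option is to skip the filter step and pass na.rm = TRUE to each function instead. The sketch below does this on a made-up tibble (`toy` is hypothetical, and only some of the statistics are shown).

```r
library(dplyr)

# Hypothetical toy data with two missing SBP values
toy <- tibble(SBP = c(110, 118, NA, 127, 140, NA))

toy_stats <- toy %>%
  summarise(min = min(SBP, na.rm = TRUE),
            median = median(SBP, na.rm = TRUE),
            max = max(SBP, na.rm = TRUE),
            mean = mean(SBP, na.rm = TRUE),
            sd = sd(SBP, na.rm = TRUE),
            n = sum(!is.na(SBP)),      # observed values
            miss = sum(is.na(SBP)))    # missing values, still visible here

toy_stats
```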
The somewhat unusual structure of favstats (complete with an easy to forget ~) is actually helpful. It allows you to look at some interesting grouping approaches, like this:
mosaic::favstats(SBP ~ Education, data = nh_750)
Education min Q1 median Q3 max mean
1 8th Grade 96 110.25 119.5 129.75 167 122.4565
2 9 - 11th Grade 85 107.75 116.0 127.00 191 118.8026
3 High School 84 111.50 120.5 129.00 209 121.0882
4 Some College 85 108.00 117.0 126.00 186 118.6293
5 College Grad 83 107.00 117.0 125.00 171 116.8326
sd n missing
1 16.34993 46 4
2 15.79453 76 0
3 16.52853 136 7
4 14.32736 232 9
5 14.41202 227 13
Of course, we could accomplish the same comparison with dplyr commands, too, but the favstats approach has much to offer.
nh_750 %>%
filter(complete.cases(SBP, Education)) %>%
group_by(Education) %>%
summarise(min = min(SBP), Q1 = quantile(SBP, 0.25),
median = median(SBP), Q3 = quantile(SBP, 0.75),
max = max(SBP), mean = mean(SBP),
sd = sd(SBP), n = n(), miss = sum(is.na(SBP)))
# A tibble: 5 x 10
Education min Q1 median Q3 max mean sd n
<fct> <int> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <int>
1 8th Grade 96 110. 120. 130. 167 122. 16.3 46
2 9 - 11th~ 85 108. 116 127 191 119. 15.8 76
3 High Sch~ 84 112. 120. 129 209 121. 16.5 136
4 Some Col~ 85 108 117 126 186 119. 14.3 232
5 College ~ 83 107 117 125 171 117. 14.4 227
# ... with 1 more variable: miss <int>
6.5.2 describe in the psych package
The psych package’s describe function has a more detailed list of numerical summaries for quantitative variables, and lets us look at a group of variables at once.
vars n mean sd median trimmed mad min max
Age 1 750 40.82 12.54 40.0 40.53 14.83 21.0 64.0
BMI 2 745 29.08 7.24 27.9 28.31 5.93 16.7 80.6
SBP 3 717 118.79 15.14 118.0 117.88 13.34 83.0 209.0
DBP 4 717 72.69 11.34 73.0 72.65 10.38 0.0 108.0
Pulse 5 718 73.53 11.65 72.0 73.11 11.86 40.0 124.0
range skew kurtosis se
Age 43.0 0.16 -1.15 0.46
BMI 63.9 1.72 6.16 0.27
SBP 126.0 0.96 3.10 0.57
DBP 108.0 -0.28 2.59 0.42
Pulse 84.0 0.48 0.73 0.43
The additional statistics presented here are:
- trimmed = a trimmed mean (by default in this function, this removes the top and bottom 10% from the data, then computes the mean of the remaining values - the middle 80% of the full data set.)
- mad = the median absolute deviation (from the median), which can be used in a manner similar to the standard deviation or IQR to measure spread.
  - If the data are \(Y_1, Y_2, ..., Y_n\), then the mad is defined as \(median(|Y_i - median(Y_i)|)\).
  - To find the mad for a set of numbers, find the median, subtract the median from each value and find the absolute value of that difference, and then find the median of those absolute differences.
  - For non-normal data with a skewed shape but tails well approximated by the Normal, the mad is likely to be a better (more robust) estimate of the spread than is the standard deviation.
- a measure of skew, which refers to how much asymmetry is present in the shape of the distribution. The measure is not the same as the nonparametric skew measure that we will usually prefer. The Wikipedia page on skewness (https://en.wikipedia.org/wiki/Skewness) is very detailed.
- a measure of excess kurtosis, which refers to how outlier-prone, or heavy-tailed the shape of the distribution is, as compared to a Normal distribution.
- se = the standard error of the sample mean, equal to the sample sd divided by the square root of the sample size.
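Several of these statistics can be reproduced in base R. The sketch below uses a made-up vector `y` with one extreme value. Note that base R’s mad function multiplies the raw median absolute deviation by 1.4826 by default, so that it estimates the standard deviation for Normal data; the value of 14.83 shown for Age above appears consistent with a raw value of 10 scaled in this way, though that reading is an inference from the output.

```r
y <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)   # made-up values, one extreme

mean(y)                  # pulled upward by the extreme value
mean(y, trim = 0.1)      # drops the lowest and highest 10% before averaging

# Median absolute deviation, step by step
raw_mad <- median(abs(y - median(y)))
raw_mad
mad(y, constant = 1)     # same raw value
mad(y)                   # default: raw value times 1.4826

# Standard error of the sample mean
sd(y) / sqrt(length(y))
```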
6.5.3 The Hmisc package’s version of describe
nh_750 %>% select(Age, BMI, SBP, DBP, Pulse) %>% Hmisc::describe()
5 Variables 750 Observations
------------------------------------------------------------
Age
n missing distinct Info Mean Gmd
750 0 44 0.999 40.82 14.46
.05 .10 .25 .50 .75 .90
22 24 30 40 51 59
.95
62
lowest : 21 22 23 24 25, highest: 60 61 62 63 64
------------------------------------------------------------
BMI
n missing distinct Info Mean Gmd
745 5 250 1 29.08 7.538
.05 .10 .25 .50 .75 .90
20.22 21.30 24.20 27.90 32.10 37.60
.95
41.28
lowest : 16.7 17.6 17.8 17.9 18.0, highest: 59.1 62.8 63.3 69.0 80.6
------------------------------------------------------------
SBP
n missing distinct Info Mean Gmd
717 33 81 0.999 118.8 16.36
.05 .10 .25 .50 .75 .90
98.0 102.0 108.0 118.0 127.0 137.0
.95
144.2
lowest : 83 84 85 86 89, highest: 171 179 186 191 209
------------------------------------------------------------
DBP
n missing distinct Info Mean Gmd
717 33 66 0.999 72.69 12.43
.05 .10 .25 .50 .75 .90
55 59 66 73 80 86
.95
91
lowest : 0 25 41 42 44, highest: 104 105 106 107 108
------------------------------------------------------------
Pulse
n missing distinct Info Mean Gmd
718 32 37 0.997 73.53 12.95
.05 .10 .25 .50 .75 .90
56 60 66 72 80 88
.95
94
lowest : 40 44 46 48 50, highest: 108 112 114 118 124
------------------------------------------------------------
The Hmisc package’s version of describe for a distribution of data presents three new ideas, in addition to a more comprehensive list of quantiles (the 5th, 10th, 25th, 50th, 75th, 90th and 95th percentiles are shown) and the lowest and highest few observations. These are:
- distinct - the number of different values observed in the data.
- Info - a measure of how “continuous” the variable is, related to how many “ties” there are in the data, with Info taking a higher value (closer to its maximum of one) if the data are more continuous.
- Gmd - the Gini mean difference - a robust measure of spread that is calculated as the mean absolute difference between all pairs of observations. Larger values of Gmd indicate more spread-out distributions. (Gini is pronounced as either “Genie” or “Ginny.”)
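The Gini mean difference can be computed by hand for a small vector. This sketch uses base R’s dist function to obtain the absolute difference for every pair of observations; the vector `x` is made up.

```r
x <- c(1, 2, 4, 7)                     # made-up values

# dist() returns |x_i - x_j| once for every pair with i < j
pair_diffs <- as.numeric(dist(x))
pair_diffs

gmd <- mean(pair_diffs)                # mean absolute difference over all pairs
gmd
```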
6.5.4 Other options
The package summarytools has a function called dfSummary which I like, and Dominic Comtois has also published Recommendations for Using summarytools with R Markdown. Note that this isn’t really for Word documents.
DataExplorer can be used for more automated exploratory data analyses (and some people also like skimr) and visdat, as well.
The df_stats function available when the mosaic package is loaded allows you to run favstats for multiple outcome variables simultaneously.