knitr::opts_chunk$set(comment = NA)
library(palmerpenguins)
library(kableExtra)
library(gtsummary)
library(summarytools)
library(visdat)
library(lvplot)
library(tidyverse)
theme_set(theme_bw())
3 Summarizing Penguins
We will again use the data contained in the palmerpenguins
package in this chapter. Here, we present a few of the more appealing ways to obtain numerical and graphical summaries, without much explanation. We’ll discuss these issues further in the rest of Part A of these Course Notes.
3.1 Setup: Packages Used Here
Here, we’ll add several new packages to allow us to display some additional summaries, and present our tables and plots in different ways.
We will also use functions from the mosaic
and Hmisc
packages here, though I won’t load them into our session at this time.
3.2 Our Data Set
Let’s look again at the penguins
data contained in the palmerpenguins
package.
penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
3.3 Numerical Summaries for a Tibble
Note that in this work, I sometimes don’t explain all of the numerical summaries provided. Some of that discussion is postponed to Chapter 7.
3.3.1 Using summary()
We have several ways to obtain useful summaries of all variables in the penguins
data.
penguins |>
summary()
species island bill_length_mm bill_depth_mm
Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
Mean :43.92 Mean :17.15
3rd Qu.:48.50 3rd Qu.:18.70
Max. :59.60 Max. :21.50
NA's :2 NA's :2
flipper_length_mm body_mass_g sex year
Min. :172.0 Min. :2700 female:165 Min. :2007
1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
Median :197.0 Median :4050 NA's : 11 Median :2008
Mean :200.9 Mean :4202 Mean :2008
3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
Max. :231.0 Max. :6300 Max. :2009
NA's :2 NA's :2
3.3.2 Using inspect() from mosaic
Some people like the inspect()
function from the mosaic
package.
penguins |>
mosaic::inspect()
categorical variables:
name class levels n missing
1 species factor 3 344 0
2 island factor 3 344 0
3 sex factor 2 333 11
distribution
1 Adelie (44.2%), Gentoo (36%) ...
2 Biscoe (48.8%), Dream (36%) ...
3 male (50.5%), female (49.5%)
quantitative variables:
name class min Q1 median Q3 max mean
1 bill_length_mm numeric 32.1 39.225 44.45 48.5 59.6 43.92193
2 bill_depth_mm numeric 13.1 15.600 17.30 18.7 21.5 17.15117
3 flipper_length_mm integer 172.0 190.000 197.00 213.0 231.0 200.91520
4 body_mass_g integer 2700.0 3550.000 4050.00 4750.0 6300.0 4201.75439
5 year integer 2007.0 2007.000 2008.00 2009.0 2009.0 2008.02907
sd n missing
1 5.4595837 342 2
2 1.9747932 342 2
3 14.0617137 342 2
4 801.9545357 342 2
5 0.8183559 344 0
Daniel Kaplan’s Statistical Modeling, 2nd edition provides an entire course which coordinates nicely with the tools available in the mosaic
package. In our course, we’ll most often use this inspect()
tool, and a related tool called favstats
.
3.3.3 Using favstats() from mosaic.
The favstats
function lets us look at some common summaries for a single variable, or for one variable divided into groups by another. We’ll also return to this approach in Chapter 7.
mosaic::favstats(~ bill_length_mm, data = penguins) |>
kbl() |>
kable_styling()
min | Q1 | median | Q3 | max | mean | sd | n | missing | |
---|---|---|---|---|---|---|---|---|---|
32.1 | 39.225 | 44.45 | 48.5 | 59.6 | 43.92193 | 5.459584 | 342 | 2 |
mosaic::favstats(bill_length_mm ~ species, data = penguins) |>
kbl() |>
kable_styling()
species | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
Adelie | 32.1 | 36.75 | 38.80 | 40.750 | 46.0 | 38.79139 | 2.663405 | 151 | 1 |
Chinstrap | 40.9 | 46.35 | 49.55 | 51.075 | 58.0 | 48.83382 | 3.339256 | 68 | 0 |
Gentoo | 40.9 | 45.30 | 47.30 | 49.550 | 59.6 | 47.50488 | 3.081857 | 123 | 1 |
3.3.4 Using describe() from psych
We can use the describe()
function from the psych
package to get some additional summaries, if we’re interested, and here we also demonstrate the use of the kbl()
and kable_styling()
functions from the kableExtra
package to make the table look appealing in HTML. More on the use of the kableExtra
package is available here. We’ll also return to this approach in Chapter 7.
penguins |>
psych::describe() |>
kbl() |>
kable_styling()
vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
species* | 1 | 344 | 1.918605 | 0.8933198 | 2.00 | 1.898551 | 1.48260 | 1.0 | 3.0 | 2.0 | 0.1591315 | -1.7318773 | 0.0481646 |
island* | 2 | 344 | 1.662791 | 0.7261940 | 2.00 | 1.579710 | 1.48260 | 1.0 | 3.0 | 2.0 | 0.6086049 | -0.9064333 | 0.0391538 |
bill_length_mm | 3 | 342 | 43.921930 | 5.4595837 | 44.45 | 43.906934 | 7.04235 | 32.1 | 59.6 | 27.5 | 0.0526530 | -0.8931397 | 0.2952205 |
bill_depth_mm | 4 | 342 | 17.151170 | 1.9747932 | 17.30 | 17.172628 | 2.22390 | 13.1 | 21.5 | 8.4 | -0.1422086 | -0.9233523 | 0.1067846 |
flipper_length_mm | 5 | 342 | 200.915205 | 14.0617137 | 197.00 | 200.335766 | 16.30860 | 172.0 | 231.0 | 59.0 | 0.3426554 | -0.9991866 | 0.7603704 |
body_mass_g | 6 | 342 | 4201.754386 | 801.9545357 | 4050.00 | 4154.014598 | 889.56000 | 2700.0 | 6300.0 | 3600.0 | 0.4662117 | -0.7395200 | 43.3647348 |
sex* | 7 | 333 | 1.504504 | 0.5007321 | 2.00 | 1.505618 | 0.00000 | 1.0 | 2.0 | 1.0 | -0.0179376 | -2.0056743 | 0.0274400 |
year | 8 | 344 | 2008.029070 | 0.8183559 | 2008.00 | 2008.036232 | 1.48260 | 2007.0 | 2009.0 | 2.0 | -0.0532601 | -1.5092478 | 0.0441228 |
3.3.5 Using describe() from Hmisc
One approach Frank Harrell has developed that I find helpful is the describe()
function within his Hmisc
package, which produces these results. We’ll also return to this approach in Chapter 7.
penguins |>
Hmisc::describe()
penguins
8 Variables 344 Observations
--------------------------------------------------------------------------------
species
n missing distinct
344 0 3
Value Adelie Chinstrap Gentoo
Frequency 152 68 124
Proportion 0.442 0.198 0.360
--------------------------------------------------------------------------------
island
n missing distinct
344 0 3
Value Biscoe Dream Torgersen
Frequency 168 124 52
Proportion 0.488 0.360 0.151
--------------------------------------------------------------------------------
bill_length_mm
n missing distinct Info Mean Gmd .05 .10
342 2 164 1 43.92 6.274 35.70 36.60
.25 .50 .75 .90 .95
39.23 44.45 48.50 50.80 51.99
lowest : 32.1 33.1 33.5 34 34.1, highest: 55.1 55.8 55.9 58 59.6
--------------------------------------------------------------------------------
bill_depth_mm
n missing distinct Info Mean Gmd .05 .10
342 2 80 1 17.15 2.267 13.9 14.3
.25 .50 .75 .90 .95
15.6 17.3 18.7 19.5 20.0
lowest : 13.1 13.2 13.3 13.4 13.5, highest: 20.7 20.8 21.1 21.2 21.5
--------------------------------------------------------------------------------
flipper_length_mm
n missing distinct Info Mean Gmd .05 .10
342 2 55 0.999 200.9 16.03 181.0 185.0
.25 .50 .75 .90 .95
190.0 197.0 213.0 220.9 225.0
lowest : 172 174 176 178 179, highest: 226 228 229 230 231
--------------------------------------------------------------------------------
body_mass_g
n missing distinct Info Mean Gmd .05 .10
342 2 94 1 4202 911.8 3150 3300
.25 .50 .75 .90 .95
3550 4050 4750 5400 5650
lowest : 2700 2850 2900 2925 2975, highest: 5850 5950 6000 6050 6300
--------------------------------------------------------------------------------
sex
n missing distinct
333 11 2
Value female male
Frequency 165 168
Proportion 0.495 0.505
--------------------------------------------------------------------------------
year
n missing distinct Info Mean Gmd
344 0 3 0.888 2008 0.8919
Value 2007 2008 2009
Frequency 110 114 120
Proportion 0.320 0.331 0.349
For the frequency table, variable is rounded to the nearest 0
--------------------------------------------------------------------------------
3.3.6 Using tbl_summary() from gtsummary
If you want to produce results which look like you might expect to see in a published paper, the tbl_summary()
function from the gtsummary
package has many nice features.
penguins |>
tbl_summary()
Characteristic | N = 3441 |
---|---|
species | |
Adelie | 152 (44%) |
Chinstrap | 68 (20%) |
Gentoo | 124 (36%) |
island | |
Biscoe | 168 (49%) |
Dream | 124 (36%) |
Torgersen | 52 (15%) |
bill_length_mm | 44.5 (39.2, 48.5) |
Unknown | 2 |
bill_depth_mm | 17.30 (15.60, 18.70) |
Unknown | 2 |
flipper_length_mm | 197 (190, 213) |
Unknown | 2 |
body_mass_g | 4,050 (3,550, 4,750) |
Unknown | 2 |
sex | |
female | 165 (50%) |
male | 168 (50%) |
Unknown | 11 |
year | |
2007 | 110 (32%) |
2008 | 114 (33%) |
2009 | 120 (35%) |
1 n (%); Median (IQR) |
A vignette explaining the use of the gtsummary
package is available here. We’ll also return to this approach in Chapter 7.
3.3.6.1 Using descr from summarytools
The descr()
function from the summarytools
package can also be used to provide numerical descriptions of all of the numerical variables contained within a tibble.
penguins |>
descr(stats = "common" )
Non-numerical variable(s) ignored: species, island, sex
Descriptive Statistics
penguins
N: 344
bill_depth_mm bill_length_mm body_mass_g flipper_length_mm year
--------------- --------------- ---------------- ------------- ------------------- ---------
Mean 17.15 43.92 4201.75 200.92 2008.03
Std.Dev 1.97 5.46 801.95 14.06 0.82
Min 13.10 32.10 2700.00 172.00 2007.00
Median 17.30 44.45 4050.00 197.00 2008.00
Max 21.50 59.60 6300.00 231.00 2009.00
N.Valid 342.00 342.00 342.00 342.00 344.00
Pct.Valid 99.42 99.42 99.42 99.42 100.00
An introduction to the summarytools package is available here, and illustrates some other ways to modify this output to suit your needs. We’ll also return to this approach in Chapter 7.
3.3.7 dfSummary() from summarytools
The dfSummary()
function from the summarytools
package can be used to provide some additional descriptions of all variables within a tibble. You’ll find more information about these numerical descriptions in Chapter 7.
dfSummary(penguins,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.75,
valid.col = FALSE)
### Data Frame Summary
#### penguins
**Dimensions:** 344 x 8
**Duplicates:** 0
+----+--------------------+---------------------------+---------------------+------------------------+---------+
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing |
+====+====================+===========================+=====================+========================+=========+
| 1 | species\ | 1\. Adelie\ | 152 (44.2%)\ | IIIIIIII \ | 0\ |
| | [factor] | 2\. Chinstrap\ | 68 (19.8%)\ | III \ | (0.0%) |
| | | 3\. Gentoo | 124 (36.0%) | IIIIIII | |
+----+--------------------+---------------------------+---------------------+------------------------+---------+
| 2 | island\ | 1\. Biscoe\ | 168 (48.8%)\ | IIIIIIIII \ | 0\ |
| | [factor] | 2\. Dream\ | 124 (36.0%)\ | IIIIIII \ | (0.0%) |
| | | 3\. Torgersen | 52 (15.1%) | III | |
+----+--------------------+---------------------------+---------------------+------------------------+---------+
| 3 | bill_length_mm\ | Mean (sd) : 43.9 (5.5)\ | 164 distinct values | \ \ \ \ . \ \ \ \ . :\ | 2\ |
| | [numeric] | min < med < max:\ | | \ \ . : : : : :\ | (0.6%) |
| | | 32.1 < 44.5 < 59.6\ | | \ \ : : : : : :\ | |
| | | IQR (CV) : 9.3 (0.1) | | \ \ : : : : : : .\ | |
| | | | | : : : : : : : : . | |
+----+--------------------+---------------------------+---------------------+------------------------+---------+
| 4 | bill_depth_mm\ | Mean (sd) : 17.2 (2)\ | 80 distinct values | \ \ \ \ \ \ \ \ \ \ :\ | 2\ |
| | [numeric] | min < med < max:\ | | \ \ \ \ \ \ \ \ : :\ | (0.6%) |
| | | 13.1 < 17.3 < 21.5\ | | \ \ : . : : : .\ | |
| | | IQR (CV) : 3.1 (0.1) | | . : : : : : :\ | |
| | | | | : : : : : : : . . | |
+----+--------------------+---------------------------+---------------------+------------------------+---------+
| 5 | flipper_length_mm\ | Mean (sd) : 200.9 (14.1)\ | 55 distinct values | \ \ \ \ \ \ :\ | 2\ |
| | [integer] | min < med < max:\ | | \ \ \ \ . :\ | (0.6%) |
| | | 172 < 197 < 231\ | | \ \ \ \ : : : \ \ . .\ | |
| | | IQR (CV) : 23 (0.1) | | \ \ . : : : \ \ : : :\ | |
| | | | | \ \ : : : : : : : : : | |
+----+--------------------+---------------------------+---------------------+------------------------+---------+
| 6 | body_mass_g\ | Mean (sd) : 4201.8 (802)\ | 94 distinct values | \ \ \ \ :\ | 2\ |
| | [integer] | min < med < max:\ | | \ \ . :\ | (0.6%) |
| | | 2700 < 4050 < 6300\ | | \ \ : : : :\ | |
| | | IQR (CV) : 1200 (0.2) | | \ \ : : : : : .\ | |
| | | | | . : : : : : : | |
+----+--------------------+---------------------------+---------------------+------------------------+---------+
| 7 | sex\ | 1\. female\ | 165 (49.5%)\ | IIIIIIIII \ | 11\ |
| | [factor] | 2\. male | 168 (50.5%) | IIIIIIIIII | (3.2%) |
+----+--------------------+---------------------------+---------------------+------------------------+---------+
| 8 | year\ | Mean (sd) : 2008 (0.8)\ | 2007 : 110 (32.0%)\ | IIIIII \ | 0\ |
| | [integer] | min < med < max:\ | 2008 : 114 (33.1%)\ | IIIIII \ | (0.0%) |
| | | 2007 < 2008 < 2009\ | 2009 : 120 (34.9%) | IIIIII | |
| | | IQR (CV) : 2 (0) | | | |
+----+--------------------+---------------------------+---------------------+------------------------+---------+
3.3.8 Visualizing with visdat
functions
The vis_dat()
function from the visdat
package shows something about the types of variables, providing visual clues about what’s inside. The picture below identifies variables types, and missing values.
vis_dat(penguins)
We can explore the missing data further using the vis_miss
function.
vis_miss(penguins)
A vignette explaining the use of the visdat
package is available here.
3.4 Histograms for a Variable
The most common tool we use in producing a graphical summary of a variable, like the penguin’s flipper length, is a histogram. Here’s one option.
ggplot(data = penguins, aes(x = flipper_length_mm)) +
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
This approach produces two messages that alert us to potential concerns, and a fairly unattractive plot.
This time, we’ll first exclude the two penguins without a measured flipper length, and then set the binwidth
to be 10. How well does that work?
penguins2 <-
penguins |>
filter(complete.cases(flipper_length_mm))
ggplot(data = penguins2, aes(x = flipper_length_mm)) +
geom_histogram(binwidth = 10)
Now we’ve eliminated the messages, but it would be nice to have some more granularity in the bars (so we’d like a smaller binwidth) and I’d also like to make the bars more clearly separated with colors. I’d also like to add a title. Like this:
ggplot(data = penguins2, aes(x = flipper_length_mm)) +
geom_histogram(binwidth = 5, fill = "orange", col = "navy") +
labs(title = "Distribution of Flipper Length in 342 Palmer Penguins")
There are some other options for creating a graphical summary of a variable’s distribution. For example, we might consider a density plot, as well as a rug plot along the horizontal (X) axis:
ggplot(data = penguins2, aes(x = flipper_length_mm)) +
geom_density(col = "navy") +
geom_rug(col = "red") +
labs(title = "Density and Rug Plot of Flipper Length in 342 Palmer Penguins")
Or perhaps a dotplot would provide a useful look…
ggplot(data = penguins2, aes(x = flipper_length_mm)) +
geom_dotplot(binwidth = 1, fill = "orange", col = "navy") +
labs(title = "Dot Plot of Flipper Length in 342 Palmer Penguins")
We’ll learn about several other approaches to summarizing the distribution of a variable graphically later in the course.
3.5 Comparing Penguins by Species Numerically
We have data from three different species of penguin. Can we compare their flipper lengths numerically, perhaps by calculating the mean flipper length within each species?
# A tibble: 3 × 2
species `mean(flipper_length_mm)`
<fct> <dbl>
1 Adelie NA
2 Chinstrap 196.
3 Gentoo NA
Well, that’s a problem. Looks like we have some missing values. Can we fix that, and also provide some additional summaries, like the sample size (n) and the median and standard deviation within each species? While we’re at it, can we make it prettier, with kbl()
and kable_styling()
?
penguins |>
filter(complete.cases(species, flipper_length_mm)) |>
group_by(species) |>
summarise(n = n(),
mean = mean(flipper_length_mm),
sd = sd(flipper_length_mm),
median = median(flipper_length_mm)) |>
kbl() |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
species | n | mean | sd | median |
---|---|---|---|---|
Adelie | 151 | 189.9536 | 6.539457 | 190 |
Chinstrap | 68 | 195.8235 | 7.131894 | 196 |
Gentoo | 123 | 217.1870 | 6.484976 | 216 |
3.6 Using favstats() from the mosaic package
As we noted previously, we can also use favstats()
from the mosaic
package to help us look at the results for a single variable, split into groups by another, like this:
mosaic::favstats(bill_length_mm ~ species, data = penguins) |>
kbl() |>
kable_styling()
species | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
Adelie | 32.1 | 36.75 | 38.80 | 40.750 | 46.0 | 38.79139 | 2.663405 | 151 | 1 |
Chinstrap | 40.9 | 46.35 | 49.55 | 51.075 | 58.0 | 48.83382 | 3.339256 | 68 | 0 |
Gentoo | 40.9 | 45.30 | 47.30 | 49.550 | 59.6 | 47.50488 | 3.081857 | 123 | 1 |
One advantage of this approach is that (as you’ll note) it handles the missing data in the way we’d probably expect, by restricting the summaries to the complete cases.
3.7 Using tbl_summary() to summarize the tibble
The tbl_summary()
function from the gtsummary
package can also do the job of summarizing all of the other variables in the tibble, broken down by species, very nicely.
penguins |>
tbl_summary(by = species)
Characteristic | Adelie, N = 1521 | Chinstrap, N = 681 | Gentoo, N = 1241 |
---|---|---|---|
island | |||
Biscoe | 44 (29%) | 0 (0%) | 124 (100%) |
Dream | 56 (37%) | 68 (100%) | 0 (0%) |
Torgersen | 52 (34%) | 0 (0%) | 0 (0%) |
bill_length_mm | 38.8 (36.8, 40.8) | 49.6 (46.4, 51.1) | 47.3 (45.3, 49.6) |
Unknown | 1 | 0 | 1 |
bill_depth_mm | 18.40 (17.50, 19.00) | 18.45 (17.50, 19.40) | 15.00 (14.20, 15.70) |
Unknown | 1 | 0 | 1 |
flipper_length_mm | 190 (186, 195) | 196 (191, 201) | 216 (212, 221) |
Unknown | 1 | 0 | 1 |
body_mass_g | 3,700 (3,350, 4,000) | 3,700 (3,488, 3,950) | 5,000 (4,700, 5,500) |
Unknown | 1 | 0 | 1 |
sex | |||
female | 73 (50%) | 34 (50%) | 58 (49%) |
male | 73 (50%) | 34 (50%) | 61 (51%) |
Unknown | 6 | 0 | 5 |
year | |||
2007 | 50 (33%) | 26 (38%) | 34 (27%) |
2008 | 50 (33%) | 18 (26%) | 46 (37%) |
2009 | 52 (34%) | 24 (35%) | 44 (35%) |
1 n (%); Median (IQR) |
3.8 Comparing Penguins by Species Graphically
3.8.1 Faceting Histograms with facet_wrap()
We could compare the distributions of the flipper lengths across the three species, by creating a set of faceted histograms, like so…
penguins3 <-
penguins |>
filter(complete.cases(flipper_length_mm, species))
ggplot(data = penguins3, aes(x = flipper_length_mm, fill = species)) +
geom_histogram(binwidth = 5, col = "white") +
facet_wrap(~ species) +
labs(title = "Distribution of Flipper Length in Palmer Penguins, by Species")
We might add in the command
guides(fill = "none") +
to eliminate the redundant legend on the right-hand side of the plot.
3.8.2 Using facet_grid() instead
The facet_wrap()
approach has created three histograms, spread horizontally. Alternatively, we could plot the species vertically using facet_grid()
, which clearly shows which species produces the penguins with the larger flipper lengths, especially if we reduce the width of the bins a bit.
ggplot(data = penguins3, aes(x = flipper_length_mm, fill = species)) +
geom_histogram(binwidth = 2, col = "white") +
facet_grid(species ~ .) +
guides(fill = "none") +
labs(title = "Distribution of Flipper Length in Palmer Penguins, by Species")
We’ll use facets like this all the time in what follows.
3.8.3 Boxplots
Another very common tool we’ll use for looking simultaneously at the distributions of a variable across two or more categories is a boxplot. More on this later, but here’s one example of what this might look like.
ggplot(data = penguins3, aes(x = flipper_length_mm, y = species,
fill = species)) +
geom_boxplot() +
guides(fill = "none") +
labs(title = "Distribution of Flipper Length in Palmer Penguins, by Species")
3.8.4 Adding Violins
And here’s a somewhat fancier version, including a violin plot, and with the coordinates flipped so the plots are shown vertically rather than horizontally.
ggplot(data = penguins3, aes(x = flipper_length_mm, y = species)) +
geom_violin(aes(col = species)) +
geom_boxplot(aes(fill = species), width = 0.3) +
guides(col = "none", fill = "none") +
coord_flip() +
labs(title = "Augmented Boxplot of Flipper Length in Penguins by Species")
3.8.5 Letter-Value Plots (Boxplots for Large Data)
We might also consider a letter-value plot, using the geom_lv()
function from the lvplot
package in R, although I rarely use such a plot unless I have at least 1000 observations to work with.
ggplot(data = penguins3, aes(x = species, y = flipper_length_mm)) +
geom_lv(aes(fill=..LV..)) + scale_fill_brewer() +
labs(title = "Letter-Value Plot of Flipper Length in Penguins by Species")
Warning: The dot-dot notation (`..LV..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(LV)` instead.
3.9 Coming Up
You’re probably tiring of the penguins now. Next, we’ll look at some data on people, taken from the National Health and Nutrition Examination Survey, or NHANES.