Appendix E — Statistical Summaries
E.1 Summarizing a Quantity
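E.1.1 The sample mean, $\bar{x}$
The sample mean, $\bar{x}$, of a set of $n$ observations $x_1, x_2, \ldots, x_n$ is the sum of the observations divided by $n$:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$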
- Wikipedia on the mean
E.1.2 The sample variance and standard deviation
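The sample variance, $s^2$, of a set of $n$ observations $x_1, x_2, \ldots, x_n$ with sample mean $\bar{x}$ is

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

The sample standard deviation, $s$, is the square root of the sample variance:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$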
- Wikipedia on the standard deviation
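As a minimal sketch of these summaries, the following computes the sample mean (sum divided by $n$), the sample variance (using the $n - 1$ divisor), and the sample standard deviation (the square root of the variance) by hand. Python's standard library is used purely for illustration (the book's own work is done in R), and the data values are hypothetical.

```python
import math
import statistics

# Hypothetical data, chosen only for illustration
x = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(x)

mean = sum(x) / n                                         # sample mean
variance = sum((xi - mean) ** 2 for xi in x) / (n - 1)    # divisor is n - 1
sd = math.sqrt(variance)                                  # sample standard deviation

# The standard library uses the same n - 1 definitions
library_variance = statistics.variance(x)
library_sd = statistics.stdev(x)
```

The hand calculations and the library results should agree up to floating-point rounding.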
E.1.3 The range
The range of a data set can be expressed either as the two-element vector consisting of the minimum and maximum values of the data, or as the difference (maximum - minimum), depending on context. A larger range indicates greater variation.
- Wikipedia on the range
E.1.4 The median
The median of a set of $n$ observations is found as follows:
- if $n$ is odd, the sample median is the middle value when the data are sorted in order from lowest to highest
- if $n$ is even, the sample median is the mean of the two middle values when the data are sorted in order from lowest to highest

- Wikipedia on the median
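The odd and even cases can be sketched in a few lines. This is an illustrative Python version (not the book's R code); the function name `sample_median` and the data are hypothetical.

```python
import statistics

def sample_median(values):
    """Middle value if n is odd; mean of the two middle values if n is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

odd_data = [7, 1, 3]         # sorted: 1, 3, 7
even_data = [7, 1, 3, 10]    # sorted: 1, 3, 7, 10

odd_median = sample_median(odd_data)
even_median = sample_median(even_data)
```

Here the odd-sized sample returns its middle sorted value, and the even-sized sample returns the mean of 3 and 7.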
E.1.5 Quantiles / Percentiles
Quantiles or percentiles (the terms are used equivalently here) divide the observations of a sample into groups with equal probabilities within the sample. For example, the 42nd percentile is the cut point which splits the sample data so that 42% of the observations are below that cut point, with the remaining 58% above that cut point.
We often describe the 25th percentile as the first quartile (Q25) of the data, and the 75th percentile as the third quartile (Q75). Some regard the zeroth percentile as the minimum value in the data, while the 100th percentile describes the maximum value.
- Wikipedia on percentiles
E.1.6 The IQR (inter-quartile range)
The IQR of a data set is the difference between the third quartile (Q75) and the first quartile (Q25). Thus, it provides a measure of the range of the “middle half” of the data. It also describes the length of the box in a boxplot.
- Wikipedia on the inter-quartile range
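A short sketch of the quartiles and the IQR, using Python's standard library for illustration (the data are hypothetical). Software packages use several different interpolation rules for quantiles, so results from other tools can differ slightly in small samples; the "inclusive" rule chosen here is an assumption made for this example, not necessarily the rule behind the book's own output.

```python
import statistics

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical data

# n=4 requests the three quartile cut points. The "inclusive" method treats
# the sample minimum and maximum as the 0th and 100th percentiles.
q25, q50, q75 = statistics.quantiles(x, n=4, method="inclusive")

iqr = q75 - q25  # spread of the "middle half" of the data
```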
E.1.7 The median absolute deviation
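The median absolute deviation (MAD), used in this book as a robust measure of variation, is the median of the absolute deviations from the sample median, multiplied by a constant. So if $m$ is the sample median of the observations $x_1, x_2, \ldots, x_n$, then

$$\text{MAD} = 1.4826 \times \text{median}\left(|x_i - m|\right)$$

The purpose of multiplying by the constant is so that if data come from a Normal distribution, the MAD (with this multiplication) and the standard deviation will have approximately the same value in large samples.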
- Wikipedia on the median absolute deviation
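To illustrate the robustness of the MAD, here is a minimal Python sketch. It assumes the usual Normal-consistency constant of 1.4826; the function name `scaled_mad` and the data (which include one wild value) are hypothetical.

```python
import statistics

def scaled_mad(values, constant=1.4826):
    """Median of absolute deviations from the median, times a constant.

    The default constant (an assumption here) makes the result comparable
    to the standard deviation for Normally distributed data.
    """
    m = statistics.median(values)
    return constant * statistics.median([abs(v - m) for v in values])

x = [1, 2, 3, 4, 100]          # hypothetical data with one outlier

mad_value = scaled_mad(x)      # barely affected by the outlier
sd_value = statistics.stdev(x) # heavily inflated by the outlier
```

The outlier pulls the standard deviation far above the MAD, which is the sense in which the MAD is a robust measure of variation.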
E.1.8 The standard error of the sample mean
The standard error of the sample mean for a set of $n$ observations with sample standard deviation $s$ is

$$SE(\bar{x}) = \frac{s}{\sqrt{n}}$$
- Wikipedia on the standard error generally, and the standard error of the sample mean, specifically
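As a small illustration, the standard error of the sample mean can be computed as the sample standard deviation divided by the square root of the sample size. The Python below is a sketch with hypothetical data.

```python
import math
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data
n = len(x)

s = statistics.stdev(x)        # sample standard deviation (n - 1 divisor)
se_mean = s / math.sqrt(n)     # standard error of the sample mean
```

Because of the $\sqrt{n}$ in the denominator, quadrupling the sample size (with the same standard deviation) would halve the standard error.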
E.1.9 The coefficient of variation
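The coefficient of variation of a set of $n$ observations with sample mean $\bar{x}$ and sample standard deviation $s$ is the ratio of the standard deviation to the mean:

$$CV = \frac{s}{\bar{x}}$$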
- Wikipedia on the coefficient of variation
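A brief sketch of the calculation, taking the coefficient of variation to be the sample standard deviation divided by the sample mean (some sources multiply by 100 and report a percentage, which is not done here). The data are hypothetical, and the measure is only meaningful when the mean is not close to zero.

```python
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical, strictly positive data

mean = statistics.mean(x)
s = statistics.stdev(x)

cv = s / mean                  # unitless ratio of spread to center
```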
E.1.10 The mode
The mode of a set of observations is simply the most common value. A batch of data can have more than one mode if multiple values tie for the most common.
- Wikipedia on the mode
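Ties are easy to see in a small example. This Python sketch uses `statistics.multimode`, which returns every value that ties for most common; the data are hypothetical.

```python
import statistics

one_mode = [1, 2, 2, 3]          # 2 appears most often
two_modes = [1, 2, 2, 3, 3, 4]   # 2 and 3 tie

modes_a = statistics.multimode(one_mode)
modes_b = statistics.multimode(two_modes)
```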
E.1.11 Skewness
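Skew measures the degree of asymmetry in our data. The skewness function in R that is used by describe_distribution() is Type II, often used by SAS and SPSS, for instance. For a sample of $n$ observations $x_1, x_2, \ldots, x_n$ with sample mean $\bar{x}$:
- The Type I or “classical” method produces

$$g_1 = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\left[\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2\right]^{3/2}}$$

- The Type II method then adjusts the result of the Type I method as follows:

$$G_1 = g_1 \frac{\sqrt{n(n-1)}}{n-2}$$

and this is what describe_distribution() returns.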
Each of these skewness measures will have a value of zero for symmetric data, including the Normal distribution, with negative values indicating left skew and positive values indicating right skew.
- See the skewness and kurtosis page in easystats for more details.
- Wikipedia on skewness.
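The two skewness types can be sketched directly. The Python below is an illustration, not the easystats implementation: it assumes the Type I estimate is the third central moment divided by the 3/2 power of the second central moment (both with divisor $n$), and that Type II multiplies this by $\sqrt{n(n-1)}/(n-2)$. Function names and data are hypothetical.

```python
import math

def skewness_type1(x):
    """Classical skewness: m3 / m2^(3/2), with moments using divisor n."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    return m3 / m2 ** 1.5

def skewness_type2(x):
    """Type I skewness with a small-sample adjustment (requires n >= 3)."""
    n = len(x)
    return skewness_type1(x) * math.sqrt(n * (n - 1)) / (n - 2)

symmetric = [1, 2, 3, 4, 5]       # hypothetical, symmetric
right_skewed = [1, 2, 3, 4, 20]   # hypothetical, long right tail

g1_sym = skewness_type1(symmetric)
g1_right = skewness_type1(right_skewed)
G1_right = skewness_type2(right_skewed)
```

The symmetric sample gives a skewness of zero, the right-skewed sample gives a positive value, and the Type II adjustment moves the estimate further from zero in small samples.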
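E.1.12 Simple Skewness ($\text{skew}_1$)
A (perhaps overly) simple description of skew of a set of $n$ observations compares the sample mean $\bar{x}$ to the sample median, in units of the sample standard deviation $s$:

$$\text{skew}_1 = \frac{\bar{x} - \text{median}}{s}$$

- Values of $\text{skew}_1$ above +0.2 indicate right skew worthy of additional consideration
- Values of $\text{skew}_1$ below -0.2 indicate left skew worthy of additional consideration
- Values of $\text{skew}_1$ near 0 indicate fairly symmetric data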
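A minimal sketch of this rule of thumb follows. It assumes the simple skewness measure is (mean - median) / standard deviation, which is an assumption about the exact definition intended here; the function name `simple_skew` and the data are hypothetical.

```python
import statistics

def simple_skew(x):
    """(mean - median) / sd, an assumed form of the simple skewness measure."""
    return (statistics.mean(x) - statistics.median(x)) / statistics.stdev(x)

symmetric = [1, 2, 3, 4, 5]       # mean equals median
right_skewed = [1, 2, 3, 4, 20]   # mean is pulled above the median

skew_sym = simple_skew(symmetric)
skew_right = simple_skew(right_skewed)

flag_right = skew_right > 0.2     # worth a closer look for right skew
```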
E.1.13 Kurtosis
The sample kurtosis measures the “tailedness” of a distribution - whether it is light or heavy tailed as compared to a Normal distribution.
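The kurtosis function in R that is used by describe_distribution() is also the Type II approach.
For a sample of $n$ observations $x_1, x_2, \ldots, x_n$ with sample mean $\bar{x}$:
- The Type I or “classical” method produces

$$g_2 = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^4}{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]^2} - 3$$

- The Type II method then adjusts the result of the Type I method as follows:

$$G_2 = \frac{\left[(n+1)\, g_2 + 6\right](n-1)}{(n-2)(n-3)}$$

and this is what describe_distribution() returns.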
Each of these kurtosis estimates measures tail behavior and can help in characterizing the sample distribution as either:
- mesokurtic (distribution has a kurtosis value near 0) which indicates similar tail behavior to a Normal distribution
- leptokurtic (“fatter tails”) as indicated by a kurtosis value well above 0, or
- platykurtic (“thinner tails”) as indicated by a kurtosis value well below 0.
That said, I cannot remember the last time I used a kurtosis calculation in practical work.
- See the skewness and kurtosis page in easystats for more details.
- Wikipedia on kurtosis.
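For readers who want to see the arithmetic, here is an illustrative Python sketch of the two kurtosis types (not the easystats code). It assumes the Type I estimate is $n \sum (x_i - \bar{x})^4 / [\sum (x_i - \bar{x})^2]^2 - 3$, and that Type II is $[(n+1) g_2 + 6](n-1) / [(n-2)(n-3)]$. Function names and data are hypothetical.

```python
def kurtosis_type1(x):
    """Classical excess kurtosis, so a Normal distribution scores near 0."""
    n = len(x)
    mean = sum(x) / n
    ss2 = sum((v - mean) ** 2 for v in x)
    ss4 = sum((v - mean) ** 4 for v in x)
    return n * ss4 / ss2 ** 2 - 3

def kurtosis_type2(x):
    """Type I kurtosis with a small-sample adjustment (requires n >= 4)."""
    n = len(x)
    g2 = kurtosis_type1(x)
    return ((n + 1) * g2 + 6) * (n - 1) / ((n - 2) * (n - 3))

flat = [1, 2, 3, 4, 5]   # hypothetical, evenly spread values (thin tails)

g2_flat = kurtosis_type1(flat)
G2_flat = kurtosis_type2(flat)
```

Both estimates are negative for this evenly spread sample, which is the platykurtic ("thinner tails") case.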
E.2 Summarizing an Association
E.2.1 The Pearson Correlation Coefficient
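When applied to a sample, the Pearson correlation is represented by $r_{xy}$. Given $n$ paired observations $(x_1, y_1), \ldots, (x_n, y_n)$ with sample means $\bar{x}$ and $\bar{y}$, it is

$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

An equivalent formula requires that we specify $s_x$ and $s_y$, the sample standard deviations of $x$ and $y$:

$$r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

Yet another formula for the Pearson correlation is:

$$r_{xy} = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{(n-1)\, s_x s_y}$$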
- Wikipedia on the Pearson correlation coefficient
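The Pearson correlation can be written in several algebraically equivalent ways, and a quick numerical check makes that concrete. The Python sketch below (hypothetical data, illustrative variable names) computes it three ways: from sums of centered cross-products, as an average product of standardized scores with divisor $n - 1$, and from the raw sum of products.

```python
import math
import statistics

x = [1, 2, 3, 4, 5]   # hypothetical paired data
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_x, s_y = statistics.stdev(x), statistics.stdev(y)

# Form 1: centered cross-products over the product of root sums of squares
sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
sxx = sum((a - x_bar) ** 2 for a in x)
syy = sum((b - y_bar) ** 2 for b in y)
r1 = sxy / (math.sqrt(sxx) * math.sqrt(syy))

# Form 2: sum of products of standardized scores, divided by n - 1
r2 = sum(((a - x_bar) / s_x) * ((b - y_bar) / s_y) for a, b in zip(x, y)) / (n - 1)

# Form 3: raw sum of products, corrected for the means
r3 = (sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar) / ((n - 1) * s_x * s_y)
```

All three versions should agree up to floating-point rounding.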
E.2.2 Intercept and Slope of a Least Squares Fit
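Suppose we have $n$ paired observations $(x_1, y_1), \ldots, (x_n, y_n)$ and we fit the line $\hat{y} = b_0 + b_1 x$ by least squares, that is, by choosing $b_0$ and $b_1$ to minimize the sum of squared residuals.
The least squares estimate of the slope is:

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = r_{xy} \frac{s_y}{s_x}$$

and the least squares estimate of the intercept is:

$$b_0 = \bar{y} - b_1 \bar{x}$$

In addition to writing the equation as

$$y_i = b_0 + b_1 x_i + e_i$$

where the residual $e_i = y_i - \hat{y}_i$ is the difference between the observed and fitted values, we can substitute the intercept estimate into the fitted line to obtain

$$\hat{y}_i - \bar{y} = b_1 (x_i - \bar{x})$$

and this should make it clear that the least squares regression line must pass through $(\bar{x}, \bar{y})$.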
- Wikipedia on the method of least squares
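A small numerical sketch can confirm that a least squares line passes through the point of means. The Python below uses hypothetical data and the standard closed-form estimates: the slope is the sum of centered cross-products divided by the sum of squared deviations in x, and the intercept is the mean of y minus the slope times the mean of x.

```python
x = [1, 2, 3, 4, 5]   # hypothetical paired data
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n

sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
sxx = sum((a - x_bar) ** 2 for a in x)

slope = sxy / sxx
intercept = y_bar - slope * x_bar

fitted = [intercept + slope * a for a in x]
residuals = [b - f for b, f in zip(y, fitted)]

# Evaluating the fitted line at the mean of x should return the mean of y
value_at_x_bar = intercept + slope * x_bar
```

For these data the slope is 0.6 and the intercept is 2.2; the residuals sum to zero (up to rounding), and the fitted value at the mean of x equals the mean of y.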