Appendix E — Statistical Summaries

E.1 Summarizing a Quantity

E.1.1 The sample mean, \(\bar{x}\)

The sample mean, \(\bar{x}\), of a set of \(n\) observations is the sum of the observations divided by the count, \(n\).

\[ \bar{x} = \frac{1}{n} \sum_{i = 1}^{n} x_i \]

  • Wikipedia on the mean
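As a quick check in R (with a small, made-up sample), the built-in mean() function matches the formula:

    x <- c(3, 5, 8, 12, 14)   # a small, made-up sample

    mean(x)              # built-in sample mean
    sum(x) / length(x)   # the formula: sum of the observations over the count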

E.1.2 The sample variance and standard deviation

The sample variance, \(s^2\), of a set of \(n\) observations is the sum of the squared deviations of the observations from the sample mean, divided by one less than the count, \(n - 1\).

\[ s^2 = \frac{1}{n-1} \sum_{i = 1}^{n} (x_i - \bar{x})^2 \]

The sample standard deviation, \(s\), is the square root of the sample variance.

\[ s = \sqrt{\frac{1}{n-1} \sum_{i = 1}^{n} (x_i - \bar{x})^2} \]
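R's var() and sd() functions use this \(n - 1\) denominator; a quick sketch with made-up data:

    x <- c(3, 5, 8, 12, 14)   # a small, made-up sample

    var(x)                                   # built-in sample variance
    sum((x - mean(x))^2) / (length(x) - 1)   # the formula directly

    sd(x)          # built-in sample standard deviation
    sqrt(var(x))   # square root of the sample variance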

E.1.3 The range

The range of a data set can be expressed either as the two-element vector consisting of the minimum and maximum values of the data, or as the difference (maximum - minimum), depending on context. A larger range indicates greater variation.
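In R, range() returns the two-element version, and diff() converts it to the single-number version:

    x <- c(3, 5, 8, 12, 14)   # a small, made-up sample

    range(x)         # two-element vector: minimum and maximum
    diff(range(x))   # single number: maximum - minimum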

E.1.4 The median

The median of a set of \(n\) observations separates the upper half of the observations from the lower half. It is the 50th percentile (or quantile) of the data, calculated as follows:

  • if \(n\) is odd, the sample median is the middle value when the data are sorted in order from lowest to highest

  • if \(n\) is even, the sample median is the mean of the two middle values when the data are sorted in order from lowest to highest

  • Wikipedia on the median
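A quick R sketch of both cases, with made-up samples:

    x_odd  <- c(3, 5, 8, 12, 14)       # n = 5 (odd)
    x_even <- c(3, 5, 8, 12, 14, 20)   # n = 6 (even)

    median(x_odd)    # middle sorted value: 8
    median(x_even)   # mean of the two middle values: (8 + 12) / 2 = 10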

E.1.5 Quantiles / Percentiles

Quantiles or percentiles (the terms are used equivalently here) divide the observations of a sample into groups with equal probabilities within the sample. For example, the 42nd percentile is the cut point which splits the sample data so that 42% of the observations are below that cut point, with the remaining 58% above that cut point.

We often describe the 25th percentile as the first quartile (Q25) of the data, and the 75th percentile as the third quartile (Q75). Some regard the zeroth percentile as the minimum value in the data, while the 100th percentile describes the maximum value.
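In R, quantile() computes these cut points, though note that it offers several interpolation rules (via its type argument), so results can differ slightly across software. A sketch with made-up data:

    x <- c(3, 5, 8, 12, 14, 20, 21, 25)   # a small, made-up sample

    quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))   # min, Q25, median, Q75, max
    quantile(x, probs = 0.42)                       # the 42nd percentile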

E.1.6 The IQR (inter-quartile range)

The IQR of a data set is the difference between the third quartile (Q75) and the first quartile (Q25). Thus, it provides a measure of the range of the “middle half” of the data. It also describes the length of the box in a boxplot.
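R's IQR() function computes this difference directly:

    x <- c(3, 5, 8, 12, 14, 20, 21, 25)   # a small, made-up sample

    IQR(x)                                          # built-in
    unname(quantile(x, 0.75) - quantile(x, 0.25))   # Q75 minus Q25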

E.1.7 The median absolute deviation

The median absolute deviation (MAD), used in this book as a robust measure of variation, is the median of the absolute deviations of the observations from the sample median, multiplied by the constant 1.4826. If \(x_{MED}\) is the sample median of a set of \(n\) observations, then:

\[ MAD = median(|x_i - x_{MED}|) \times 1.4826 \]

The purpose of multiplying by the constant is that, if the data come from a Normal distribution, the MAD (with this multiplication) and the standard deviation will estimate the same quantity.
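Base R's mad() function applies the 1.4826 constant by default, so it matches the formula above:

    x <- c(3, 5, 8, 12, 14, 20, 21, 25)   # a small, made-up sample

    mad(x)                                # constant = 1.4826 by default
    median(abs(x - median(x))) * 1.4826   # the formula directly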

E.1.8 The standard error of the sample mean

The standard error of the sample mean for a set of \(n\) observations with sample standard deviation \(s\) is:

\[ SE = \frac{s}{\sqrt{n}} \]

  • Wikipedia on the standard error generally, and the standard error of the sample mean, specifically
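Base R has no dedicated function for this, but the formula is a one-liner, with sd(x) playing the role of \(s\) and length(x) the role of \(n\):

    x <- c(3, 5, 8, 12, 14, 20, 21, 25)   # a small, made-up sample

    sd(x) / sqrt(length(x))   # standard error of the sample mean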

E.1.9 The coefficient of variation

The coefficient of variation of a set of \(n\) observations is the sample standard deviation divided by the sample mean:

\[ CV = \frac{s}{\bar{x}} \]
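Again a one-liner in base R, sketched with made-up data:

    x <- c(3, 5, 8, 12, 14, 20, 21, 25)   # a small, made-up sample

    sd(x) / mean(x)   # coefficient of variation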

E.1.10 The mode

The mode of a set of observations is simply the most common value. A batch of data can have more than one mode, if multiple values tie for the highest count.

  • Wikipedia on the mode
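Base R has no built-in function for the statistical mode (its mode() function reports an object's storage type instead), so here is a small illustrative helper (the name sample_modes is mine) that returns every value tied for most common:

    sample_modes <- function(x) {
      counts <- table(x)
      as.numeric(names(counts)[counts == max(counts)])
    }

    sample_modes(c(3, 5, 5, 8, 8, 12))   # two modes: 5 and 8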

E.1.11 Skewness

Skewness measures the degree of asymmetry in our data. The skewness function in R that is used by describe_distribution() is the Type II method, which is used by SAS and SPSS, for instance. For a sample of \(n\) observations with sample mean \(\bar{x}\),

  • The Type I or “classical” method produces

\[ skew_I = \left(\frac{\sum_{i = 1}^{n} (x_i - \bar{x})^3}{n} \right) / \left( \frac{\sum_{i = 1}^{n} (x_i - \bar{x})^2}{n} \right)^{1.5} \]

  • The Type II method then adjusts the result of the Type I method as follows:

\[ skew_{II} = skew_I \times \frac{\sqrt{n(n-1)}}{n-2} \]

and this is what describe_distribution() returns.

Each of these skewness measures will have a value of zero for symmetric data, including the Normal distribution, with negative values indicating left skew and positive values indicating right skew.
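Here is a minimal base R sketch of both calculations, using a small, made-up right-skewed sample (in practice, describe_distribution() reports the Type II value for you):

    x <- c(3, 5, 5, 8, 12, 14, 20, 41)   # made-up, right-skewed sample
    n <- length(x)

    m2 <- sum((x - mean(x))^2) / n   # second central moment
    m3 <- sum((x - mean(x))^3) / n   # third central moment

    skew_I  <- m3 / m2^1.5                            # Type I ("classical")
    skew_II <- skew_I * sqrt(n * (n - 1)) / (n - 2)   # Type II adjustment

    c(skew_I, skew_II)   # both positive, indicating right skew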

E.1.12 Simple Skewness (\(skew_0\))

A (perhaps overly) simple description of skew of a set of \(n\) observations that I occasionally look at is the sample mean minus the sample median, divided by the sample standard deviation:

\[ skew_0 = \frac{\bar{x} - x_{MED}}{s} \]

  • Values of \(skew_0\) above +0.2 indicate right skew worthy of additional consideration
  • Values of \(skew_0\) below -0.2 indicate left skew worthy of additional consideration
  • Values of \(skew_0\) near 0 indicate fairly symmetric data
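This one is easy to compute by hand in R; with the same made-up right-skewed sample as in the skewness sketch above:

    x <- c(3, 5, 5, 8, 12, 14, 20, 41)   # made-up, right-skewed sample

    (mean(x) - median(x)) / sd(x)   # skew_0; about 0.28 here, above +0.2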

E.1.13 Kurtosis

The sample kurtosis measures the “tailedness” of a distribution: whether it is light-tailed or heavy-tailed as compared to a Normal distribution.

The kurtosis function in R that is used by describe_distribution() is also the Type II approach.

For a sample of \(n\) observations with sample mean \(\bar{x}\),

  • The Type I or “classical” method produces the excess kurtosis (so that a Normal distribution has a value of 0):

\[ kurt_I = n \times \frac{\sum_{i = 1}^{n} (x_i - \bar{x})^4}{ \left( \sum_{i = 1}^{n} (x_i - \bar{x})^2 \right)^2} - 3 \]

  • The Type II method then adjusts the result of the Type I method as follows:

\[ kurt_{II} = ((n + 1) \times kurt_I + 6) \times \frac{n-1}{(n-2)\times(n-3)} \]

and this is what describe_distribution() returns.

Each of these kurtosis estimates measures tail behavior and can help in characterizing the sample distribution as either:

  • mesokurtic (distribution has a kurtosis value near 0) which indicates similar tail behavior to a Normal distribution
  • leptokurtic (“fatter tails”) as indicated by a kurtosis value well above 0, or
  • platykurtic (“thinner tails”) as indicated by a kurtosis value well below 0.
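For completeness, a base R sketch of both calculations, parallel to the skewness sketch above:

    x <- c(3, 5, 5, 8, 12, 14, 20, 41)   # a small, made-up sample
    n <- length(x)

    m2 <- sum((x - mean(x))^2) / n   # second central moment
    m4 <- sum((x - mean(x))^4) / n   # fourth central moment

    kurt_I  <- m4 / m2^2 - 3         # Type I excess kurtosis
    kurt_II <- ((n + 1) * kurt_I + 6) * (n - 1) / ((n - 2) * (n - 3))

    c(kurt_I, kurt_II)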

That said, I cannot remember the last time I used a kurtosis calculation in practical work.

E.2 Summarizing an Association

E.2.1 The Pearson Correlation Coefficient

When applied to a sample, the Pearson correlation is represented by \(r_{xy}\), and can be calculated using the formula below, assuming we have \(n\) observations on both \(x\) and \(y\), and that \(\bar{x}\) is the sample mean of the \(x\) values, and \(\bar{y}\) is the sample mean of the \(y\) values.

\[ r_{xy} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} \]

An equivalent formula requires that we specify \(s_x\) as the standard deviation of \(x\), and \(s_y\) as the standard deviation of \(y\). Then we have:

\[ r_{xy} = \frac{1}{n-1} \sum_{i = 1}^n \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) \]

Yet another formula for the Pearson correlation is:

\[ r_{xy} = \frac{\sum_{i=1}^n x_i y_i - n \bar{x} \bar{y}}{(n-1) s_x s_y} \]
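In R, cor() computes \(r_{xy}\) directly, and the standardized-values formula above gives the same answer; a sketch with made-up paired data:

    x <- c(3, 5, 8, 12, 14, 20, 21, 25)   # made-up paired sample
    y <- c(2, 6, 7, 10, 16, 18, 24, 23)

    cor(x, y)   # Pearson correlation

    # the standardized-values formula, by hand:
    sum(scale(x) * scale(y)) / (length(x) - 1)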

E.2.2 Intercept and Slope of a Least Squares Fit

Suppose we have \(n\) observations on two variables, \(x\) and \(y\), where the mean of \(x\) is \(\bar{x}\) and the mean of \(y\) is \(\bar{y}\). We want to estimate the slope (\(b\)) and y-intercept (\(a\)) of the least squares line \(y = a + bx\).

The least squares estimate of the slope is:

\[ \hat{b} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) y_i }{\sum_{i = 1}^{n} (x_i - \bar{x})^2 } \]

and the least squares estimate of the intercept is:

\[ \hat{a} = \bar{y} - \hat{b} \bar{x} \]

In addition to writing the equation as \(y = a + bx\), we could also write it as:

\[ y_i = \bar{y} + \hat{b}(x_i - \bar{x}) + r_i \]

where the residual \(r_i\) is

\[ r_i = y_i - (\hat{a} + \hat{b} x_i) \]

and this should make it clear that the least squares regression line must pass through \((\bar{x}, \bar{y})\), the means of the two variables.
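A base R sketch with made-up paired data: the hand calculation matches lm(), and the fitted line does pass through the two means:

    x <- c(3, 5, 8, 12, 14, 20, 21, 25)   # made-up paired sample
    y <- c(2, 6, 7, 10, 16, 18, 24, 23)

    b_hat <- sum((x - mean(x)) * y) / sum((x - mean(x))^2)   # slope
    a_hat <- mean(y) - b_hat * mean(x)                       # intercept

    c(a_hat, b_hat)
    coef(lm(y ~ x))   # matches the hand calculation

    all.equal(a_hat + b_hat * mean(x), mean(y))   # passes through the means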