Chapter 9 Using Transformations to “Normalize” Distributions

  • When we are confronted with a variable that is not Normally distributed but that we wish was Normally distributed, it is sometimes useful to consider whether working with a transformation of the data will yield a more helpful result.
  • Many statistical methods, including t tests and analyses of variance, assume Normal distributions.
  • We’ll discuss using R to assess a range of what are called Box-Cox power transformations, via plots, mainly.

9.1 The Ladder of Power Transformations

The key notion in re-expression of a single variable to obtain a distribution better approximated by the Normal or re-expression of an outcome in a simple regression model is that of a ladder of power transformations, which applies to any unimodal data.

Power Transformation
3 x3
2 x2
1 x (unchanged)
0.5 x0.5 = \(\sqrt{x}\)
0 ln x
-0.5 x-0.5 = 1/\(\sqrt{x}\)
-1 x-1 = 1/x
-2 x-2 = 1/x2

9.2 Using the Ladder

As we move further away from the identity function (power = 1) we change the shape more and more in the same general direction.

  • For instance, if we try a logarithm, and this seems like too much of a change, we might try a square root instead.
  • Note that this ladder (which like many other things is due to John Tukey) uses the logarithm for the “power zero” transformation rather than the constant, which is what x0 actually is.
  • If the variable x can take on negative values, we might take a different approach. If x is a count of something that could be zero, we often simply add 1 to x before transformation.

9.3 Can we transform Waist Circumferences?

Here are a pair of plots describing the waist circumference data in the NYFS data.

All of the values are positive, naturally, and there is some sign of skew. If we want to use the tools of the Normal distribution to describe these data, we might try taking a step “up” our ladder from power 1 to power 2.

9.3.3 The Logarithm

We might also try a logarithm of the waist circumference data. We can use either the natural logarithm (log, in R) or the base-10 logarithm (log10, in R) - either will have the same impact on skew.