Chapter 14 Re-Expression, Tukey’s Ladder & Box-Cox Plot

14.1 “Linearize” The Association between Quantitative Variables

Confronted with a scatterplot describing a monotone association between two quantitative variables, we may decide the data are not well approximated by a straight line, and thus, that a least squares regression may not be not sufficiently useful. In these circumstances, we have at least two options, which are not mutually exclusive:

  1. Let the data be as they may, and summarize the scatterplot using tools like loess curves, polynomial functions, or cubic splines to model the relationship.
  2. Consider re-expressing the data (often we start with re-expressions of the outcome data [the Y variable]) using a transformation so that the transformed data may be modeled effectively using a straight line.

14.2 A New Tool: the Box-Cox Plot

As before, Tukey’s ladder of power transformations can guide our exploration.

Power (\(\lambda\)) -2 -1 -1/2 0 1/2 1 2
Transformation 1/y2 1/y 1/\(\sqrt{y}\) log y \(\sqrt{y}\) y y2

The Box-Cox plot, from the boxCox function in the car package, sifts through the ladder of options to suggest a transformation (for Y) to best linearize the outcome-predictor(s) relationship.

14.2.1 A Few Caveats

  1. These methods work well with monotone data, where a smooth function of Y is either strictly increasing, or strictly decreasing, as X increases.
  2. Some of these transformations require the data to be positive. We can rescale the Y data by adding a constant to every observation in a data set without changing shape.
  3. We can use a natural logarithm (log in R), a base 10 logarithm (log10) or even sometimes a base 2 logarithm (log2) to good effect in Tukey’s ladder. All affect the association’s shape in the same way, so we’ll stick with log (base e).
  4. Some re-expressions don’t lead to easily interpretable results. Not many things that make sense in their original units also make sense in inverse square roots. There are times when we won’t care, but often, we will.
  5. If our primary interest is in making predictions, we’ll generally be more interested in getting good predictions back on the original scale, and we can back-transform the point and interval estimates to accomplish this.

14.3 A Simulated Example

set.seed(999); x.rand <- rbeta(80, 2, 5) * 20 + 3
set.seed(1000); y.rand <- abs(50 + 0.75*x.rand^(3) - 0.65*x.rand + rnorm(80, 0, 200))
scatter1 <- data.frame(x = x.rand, y = y.rand) %>% tbl_df
Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
Please use `tibble::as_tibble()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
rm(x.rand, y.rand)

ggplot(scatter1, aes(x = x, y = y)) +
    geom_point(shape = 1, size = 3) +
    ## add loess smooth
    geom_smooth(method = "loess", se = FALSE, 
                col = "dodgerblue", formula = y ~ x) +
    ## then add linear fit
    geom_smooth(method = "lm", se = FALSE, 
                col = "red", formula = y ~ x, linetype = "dashed") +
    labs(title = "Simulated scatter1 example: Y vs. X")

Having simulated data that produces a curved scatterplot, I will now use the Box-Cox plot to lead my choice of an appropriate power transformation for Y in order to “linearize” the association of Y and X.

library(car)
boxCox(scatter1$y ~ scatter1$x) 

powerTransform(scatter1$y ~ scatter1$x)
Estimated transformation parameter 
       Y1 
0.4368753 

The Box-Cox plot peaks at the value \(\lambda\) = 0.44, which is pretty close to \(\lambda\) = 0.5. Now, 0.44 isn’t on Tukey’s ladder, but 0.5 is.

Power (\(\lambda\)) -2 -1 -1/2 0 1/2 1 2
Transformation 1/y2 1/y 1/\(\sqrt{y}\) log y \(\sqrt{y}\) y y2

If we use \(\lambda\) = 0.5, on Tukey’s ladder of power transformations, it suggests we look at the relationship between the square root of Y and X, as shown next.

p1 <- ggplot(scatter1, aes(x = x, y = y)) +
    geom_point(size = 2) +
    geom_smooth(method = "loess", se = FALSE, 
                formula = y ~ x, col = "dodgerblue") +
    geom_smooth(method = "lm", se = FALSE, 
                formula = y ~ x, col = "red", linetype = "dashed") +
    labs(title = "scatter1: Y vs. X")

p2 <- ggplot(scatter1, aes(x = x, y = sqrt(y))) +
    geom_point(size = 2) +
    geom_smooth(method = "loess", se = FALSE, 
                formula = y ~ x, col = "dodgerblue") +
    geom_smooth(method = "lm", se = FALSE, 
                formula = y ~ x, col = "red", linetype = "dashed") +
    labs(title = "scatter1: Square Root of Y vs. X")

p1 + p2

By eye, I think the square root plot better matches the linear fit.

14.4 Checking on a Transformation or Re-Expression

We can do three more things to check on our transformation.

  1. We can calculate the correlation of our original and re-expressed associations.
  2. We can use the testTransform function in the car library in R to perform a statistical test comparing the optimal choice of power (\(\lambda\) = 0.44) to various other transformations.
  3. We can go ahead and fit the regression models using each approach and compare the plots of studentized residuals against fitted values from the data to see if the re-expression reduces the curve in that residual plot, as well.

Option 3 is by far the most important in practice, and it’s the one we’ll focus on going forward, but we’ll demonstrate all three here.

14.4.1 Checking the Correlation Coefficients

Here, we calculate the correlation of original and re-expressed associations.

cor(scatter1$y, scatter1$x)
[1] 0.891198
cor(sqrt(scatter1$y), scatter1$x)
[1] 0.9144307

The Pearson correlation is a little stronger after the transformation. as we’d expect.

14.4.2 Using the testTransform function

Here, we use the testTransform function (also from the car package) to compare the optimal choice determined by the powerTransform function (here \(\lambda\) = 0.44) to \(\lambda\) = 0 (logarithm), 0.5 (square root) and 1 (no transformation).

testTransform(powerTransform(scatter1$y ~ scatter1$x), 0)
                           LRT df      pval
LR test, lambda = (0) 46.17947  1 1.079e-11
testTransform(powerTransform(scatter1$y ~ scatter1$x), 0.5)
                             LRT df    pval
LR test, lambda = (0.5) 1.024888  1 0.31136
testTransform(powerTransform(scatter1$y ~ scatter1$x), 1)
                           LRT df       pval
LR test, lambda = (1) 63.75953  1 1.4433e-15
  • It looks like only the square root (\(\lambda\) = 0.5) of these three options is not significantly worse by the log-likelihood criterion applied here than the optimal choice.
  • That’s because it’s the only one with a p value larger than our usual standard for statistical significance, of 0.05.

14.4.3 Comparing the Residual Plots

We can fit the regression models, obtain plots of residuals against fitted values, and compare them to see which one has less indication of a curve in the residuals.

model.orig <- lm(scatter1$y ~ scatter1$x)
model.sqrt <- lm(sqrt(scatter1$y) ~ scatter1$x)

p1 <- augment(model.orig) %>%
    ggplot(., aes(x = scatter1$x, y = .resid)) + 
    geom_point() +
    geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE, col = "red") +
    labs(title = "Y vs X Residual Plot")

p2 <- augment(model.sqrt) %>%
    ggplot(., aes(x = scatter1$x, y = .resid)) + 
    geom_point() +
    geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE, col = "red") +
    labs(title = "Square Root Model Residuals")

p1 + p2

What we’re looking for in such a plot is the absence of a curve, among other things, we want to see “fuzzy football” shapes.

As compared to the original residual plot, the square root version, is a modest improvement in this regard. It does look a bit less curved, and a bit more like a random cluster of points, so that’s nice.