Chapter 49 Standardizing/Rescaling in Regression Models
49.1 Scaling Predictors using Z Scores: Semi-Standardized Coefficients
We know that the interpretation of the coefficients in a regression model is sensitive to the scale of the predictors. We have already seen how to “standardize” each predictor by subtracting its mean and dividing by its standard deviation.
- Each coefficient in this semi-standardized model has the following interpretation: the expected difference in the outcome, comparing units (subjects) that differ by one standard deviation in the variable of interest, but for which all other variables are fixed at their average.
- Remember also that the intercept in such a model shows the mean outcome across all subjects.
Consider a two-variable model, using area and elevation to predict the number of species …
model2 <- lm(species ~ area + elevation, data=gala)
summary(model2)
Call:
lm(formula = species ~ area + elevation, data = gala)
Residuals:
Min 1Q Median 3Q Max
-192.62 -33.53 -19.20 7.54 261.51
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.1052 20.9421 0.82 0.4212
area 0.0188 0.0259 0.72 0.4748
elevation 0.1717 0.0532 3.23 0.0032 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 79.3 on 27 degrees of freedom
Multiple R-squared: 0.554, Adjusted R-squared: 0.521
F-statistic: 16.8 on 2 and 27 DF, p-value: 1.84e-05
Now compare these results to the ones we get after scaling the area and elevation variables. Remember that the scale function centers a variable on zero by subtracting the mean from each observation, and then scales the result by dividing by the standard deviation. This ensures that each regression input has mean 0 and standard deviation 1, and is thus a z score.
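As a quick check of what scale does, here is a minimal sketch on a made-up vector (x is hypothetical and is not part of the gala data). It compares the manual z score calculation with the result of scale.

```r
# Hypothetical toy vector (not the gala data), used only to illustrate scale()
x <- c(2, 5, 9, 14, 30)

# Manual z scores: subtract the mean, then divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.numeric() drops the matrix structure
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)  # the two versions agree
mean(z_scale)                 # zero, up to floating-point error
sd(z_scale)                   # one
```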
model2.z <- lm(species ~ scale(area) + scale(elevation), data=gala)
summary(model2.z)
Call:
lm(formula = species ~ scale(area) + scale(elevation), data = gala)
Residuals:
Min 1Q Median 3Q Max
-192.62 -33.53 -19.20 7.54 261.51
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 85.2 14.5 5.88 2.9e-06 ***
scale(area) 16.2 22.4 0.72 0.4748
scale(elevation) 72.4 22.4 3.23 0.0032 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 79.3 on 27 degrees of freedom
Multiple R-squared: 0.554, Adjusted R-squared: 0.521
F-statistic: 16.8 on 2 and 27 DF, p-value: 1.84e-05
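The link between the two sets of coefficients can be sketched on simulated data. The data frame toy and its columns x1, x2 and y are assumptions made up for illustration, not the gala variables. Under those assumptions, each semi-standardized slope is the raw slope multiplied by that predictor's standard deviation, the intercept becomes the sample mean of the outcome, and the residuals and R-squared do not change.

```r
set.seed(1)
# Hypothetical simulated data; x1, x2 and y are made-up names
toy <- data.frame(x1 = rnorm(40, mean = 50, sd = 10),
                  x2 = rexp(40, rate = 0.01))
toy$y <- 5 + 0.3 * toy$x1 + 0.02 * toy$x2 + rnorm(40, sd = 4)

raw  <- lm(y ~ x1 + x2, data = toy)
semi <- lm(y ~ scale(x1) + scale(x2), data = toy)

# Each semi-standardized slope is the raw slope times that predictor's SD
slopes_from_raw <- coef(raw)[-1] * c(sd(toy$x1), sd(toy$x2))
all.equal(unname(slopes_from_raw), unname(coef(semi)[-1]))

# With centered predictors, the intercept is the sample mean of the outcome
all.equal(unname(coef(semi)[1]), mean(toy$y))

# The fit itself is unchanged by rescaling the predictors
all.equal(summary(raw)$r.squared, summary(semi)$r.squared)
all.equal(residuals(raw), residuals(semi))
```

In the gala output above, the same relation links 0.0188 and 16.2, which implies a standard deviation for area of roughly 860.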
49.1.1 Questions about the Semi-Standardized Model
- What changes after centering and rescaling the predictors, and what does not?
- Why might rescaling like this be a helpful thing to do if you want to compare predictors in terms of importance?
49.2 Fully Standardized Regression Coefficients
Suppose we standardize the coefficients by also centering and scaling (using the z score) the outcome variable, species, creating a fully standardized model.
model2.zout <- lm(scale(species) ~
scale(area) + scale(elevation), data=gala)
summary(model2.zout)
Call:
lm(formula = scale(species) ~ scale(area) + scale(elevation),
data = gala)
Residuals:
Min 1Q Median 3Q Max
-1.6803 -0.2925 -0.1675 0.0658 2.2813
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.59e-17 1.26e-01 0.00 1.0000
scale(area) 1.42e-01 1.96e-01 0.72 0.4748
scale(elevation) 6.32e-01 1.96e-01 3.23 0.0032 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.692 on 27 degrees of freedom
Multiple R-squared: 0.554, Adjusted R-squared: 0.521
F-statistic: 16.8 on 2 and 27 DF, p-value: 1.84e-05
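A minimal sketch of how fully standardized slopes relate to the raw ones, again on simulated data (toy, x1, x2, y and y_z are hypothetical names, not the gala variables). Each fully standardized slope equals the raw slope times the predictor's standard deviation, divided by the outcome's standard deviation. The intercept is zero up to rounding error, and the t statistics for the slopes are unchanged.

```r
set.seed(2)
# Hypothetical simulated data; all names are made up for illustration
toy <- data.frame(x1 = rnorm(40, mean = 50, sd = 10),
                  x2 = rexp(40, rate = 0.01))
toy$y <- 5 + 0.3 * toy$x1 + 0.02 * toy$x2 + rnorm(40, sd = 4)

# z score of the outcome, stored as a plain numeric column
toy$y_z <- as.numeric(scale(toy$y))

raw  <- lm(y ~ x1 + x2, data = toy)
full <- lm(y_z ~ scale(x1) + scale(x2), data = toy)

# Fully standardized slope = raw slope * sd(predictor) / sd(outcome)
beta_from_raw <- coef(raw)[-1] * c(sd(toy$x1), sd(toy$x2)) / sd(toy$y)
all.equal(unname(beta_from_raw), unname(coef(full)[-1]))

# The intercept is zero apart from floating-point error
abs(unname(coef(full)[1])) < 1e-10

# t statistics for the slopes match those of the raw model
t_raw  <- unname(summary(raw)$coefficients[-1, "t value"])
t_full <- unname(summary(full)$coefficients[-1, "t value"])
all.equal(t_raw, t_full)
```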
49.2.1 Questions about the Standardized Model
- How do you interpret the value 0.142 of the scale(area) coefficient here? You may want to start by reviewing the summary of the original gala data shown here.
summary(gala[c("species", "area", "elevation")])
species area elevation
Min. : 2 Min. : 0 Min. : 25
1st Qu.: 13 1st Qu.: 0 1st Qu.: 98
Median : 42 Median : 3 Median : 192
Mean : 85 Mean : 262 Mean : 368
3rd Qu.: 96 3rd Qu.: 59 3rd Qu.: 435
Max. :444 Max. :4669 Max. :1707
- How do you interpret the value 0.632 of the scale(elevation) coefficient in the standardized model?
- What is the intercept in this setting? Will this be the case whenever you scale like this?
- What are some of the advantages of looking at scaled regression coefficients?
- Why are these called fully standardized coefficients while the previous page described semi-standardized coefficients?
- What would motivate you to use one of these two methods of standardization (fully standardized or semi-standardized) vs. the other?
49.3 Robust Standardization of Regression Coefficients
Another common option for scaling is to specify lower and upper comparison points, perhaps by comparing the impact of a move from the 25th to the 75th percentile for each variable, while holding all of the other variables constant.
Occasionally, you will see robust semi-standardized regression coefficients, which measure the increase in the outcome, Y, associated with an increase in that particular predictor of one IQR (inter-quartile range).
gala$area.scaleiqr <- (gala$area - mean(gala$area)) / IQR(gala$area)
gala$elevation.scaleiqr <- (gala$elevation - mean(gala$elevation)) /
IQR(gala$elevation)
model2.iqr <- lm(species ~ area.scaleiqr + elevation.scaleiqr,
data=gala)
summary(model2.iqr)
Call:
lm(formula = species ~ area.scaleiqr + elevation.scaleiqr, data = gala)
Residuals:
Min 1Q Median 3Q Max
-192.62 -33.53 -19.20 7.54 261.51
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 85.23 14.48 5.88 2.9e-06 ***
area.scaleiqr 1.11 1.53 0.72 0.4748
elevation.scaleiqr 57.96 17.95 3.23 0.0032 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 79.3 on 27 degrees of freedom
Multiple R-squared: 0.554, Adjusted R-squared: 0.521
F-statistic: 16.8 on 2 and 27 DF, p-value: 1.84e-05
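The effect of dividing by the IQR can be sketched on simulated data (toy, x1, x2, y and the _iqr columns are hypothetical names, not the gala variables). Assuming the same centering on the mean as in the code above, each rescaled slope is the raw slope multiplied by that predictor's IQR, and the intercept is the sample mean of the outcome.

```r
set.seed(3)
# Hypothetical simulated data; all names are made up for illustration
toy <- data.frame(x1 = rnorm(40, mean = 50, sd = 10),
                  x2 = rexp(40, rate = 0.01))
toy$y <- 5 + 0.3 * toy$x1 + 0.02 * toy$x2 + rnorm(40, sd = 4)

# Center on the mean, then divide by the inter-quartile range
toy$x1_iqr <- (toy$x1 - mean(toy$x1)) / IQR(toy$x1)
toy$x2_iqr <- (toy$x2 - mean(toy$x2)) / IQR(toy$x2)

raw <- lm(y ~ x1 + x2, data = toy)
rob <- lm(y ~ x1_iqr + x2_iqr, data = toy)

# Each rescaled slope is the raw slope times that predictor's IQR
slopes_from_raw <- coef(raw)[-1] * c(IQR(toy$x1), IQR(toy$x2))
all.equal(unname(slopes_from_raw), unname(coef(rob)[-1]))

# Centering on the mean keeps the intercept at the mean outcome
all.equal(unname(coef(rob)[1]), mean(toy$y))
```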
49.3.1 Questions about Robust Standardization
- How should we interpret the 57.96 value for the scaled elevation variable? You may want to start by considering the summary of the original elevation data below.
summary(gala$elevation)
Min. 1st Qu. Median Mean 3rd Qu. Max.
25 98 192 368 435 1707
A robust standardized coefficient analysis measures the increase in Y (in units of the IQR of Y) associated with an increase in the predictor of interest of one IQR.
gala$species.scaleiqr <- (gala$species - mean(gala$species)) / IQR(gala$species)
model2.iqrout <- lm(species.scaleiqr ~ area.scaleiqr + elevation.scaleiqr, data=gala)
model2.iqrout
Call:
lm(formula = species.scaleiqr ~ area.scaleiqr + elevation.scaleiqr,
data = gala)
Coefficients:
(Intercept) area.scaleiqr elevation.scaleiqr
-1.01e-16 1.34e-02 6.98e-01
- What can we learn from the R output above?
49.4 Scaling Inputs by Dividing by 2 Standard Deviations
It turns out that standardizing the inputs to a regression model by dividing by a standard deviation creates some difficulties when you want to include a binary predictor in the model.
Instead, Andrew Gelman recommends that you consider centering all of the predictors (binary or continuous) by subtracting off the mean, and then, for the non-binary predictors, also dividing not by one, but rather by two standard deviations.
- Such a standardization can go a long way to helping us understand a model whose predictors are on different scales, and provides an interpretable starting point.
- Another appealing part of this approach is that in the arm library, Gelman and his colleagues have created an R function called standardize, which can be used to automate the process of checking coefficients that have been standardized in this manner, after the regression model has been fit.
model2
Call:
lm(formula = species ~ area + elevation, data = gala)
Coefficients:
(Intercept) area elevation
17.1052 0.0188 0.1717
arm::standardize(model2)
Call:
lm(formula = species ~ z.area + z.elevation, data = gala)
Coefficients:
(Intercept) z.area z.elevation
85.2 32.5 144.8
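The recipe described above can also be sketched by hand on simulated data that includes a binary predictor. This is a manual illustration under stated assumptions, not the arm implementation: toy, x1, group, y, z_x1 and c_group are hypothetical names. The continuous input is centered and divided by two standard deviations, while the binary input is only centered.

```r
set.seed(4)
# Hypothetical simulated data with one continuous and one binary predictor
toy <- data.frame(x1 = rnorm(60, mean = 50, sd = 10),
                  group = rbinom(60, size = 1, prob = 0.4))
toy$y <- 5 + 0.3 * toy$x1 + 6 * toy$group + rnorm(60, sd = 4)

# Continuous input: center, then divide by two standard deviations
toy$z_x1 <- (toy$x1 - mean(toy$x1)) / (2 * sd(toy$x1))
# Binary input: center only
toy$c_group <- toy$group - mean(toy$group)

raw    <- lm(y ~ x1 + group, data = toy)
two_sd <- lm(y ~ z_x1 + c_group, data = toy)

# The rescaled slope is the raw slope times two standard deviations
all.equal(unname(coef(two_sd)["z_x1"]),
          unname(coef(raw)["x1"]) * 2 * sd(toy$x1))

# Centering alone leaves the binary predictor's slope unchanged
all.equal(unname(coef(two_sd)["c_group"]), unname(coef(raw)["group"]))

# With every input centered, the intercept is the mean outcome
all.equal(unname(coef(two_sd)[1]), mean(toy$y))
```

One motivation for the factor of two is that a binary variable split evenly between 0 and 1 has a standard deviation of 0.5, so a change from 0 to 1 corresponds to two standard deviations. In the gala output above, 32.5 is about twice the semi-standardized value of 16.2.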
49.4.1 Questions about Standardizing by Dividing by Two SD
- How does this result compare to the semi-standardized regression coefficients we have seen on the previous few pages?
- How should we interpret the z.area coefficient of 32.5 here? Again, you may want to start by obtaining a statistical summary of the original area data, as shown below.
summary(gala$area)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 0 3 262 59 4669
To standardize the outcome in this way as well, we use:
arm::standardize(model2, standardize.y=TRUE)
Call:
lm(formula = z.species ~ z.area + z.elevation, data = gala)
Coefficients:
(Intercept) z.area z.elevation
1.65e-19 1.42e-01 6.32e-01
- How should we interpret the z.area coefficient of 0.142 here?
- How do these relate to the standardized regression coefficients we’ve seen before?