Chapter 49 Standardizing/Rescaling in Regression Models

49.1 Scaling Predictors using Z Scores: Semi-Standardized Coefficients

We know that the interpretation of the coefficients in a regression model is sensitive to the scale of the predictors. We have already seen how to “standardize” each predictor by subtracting its mean and dividing by its standard deviation.

  • Each coefficient in this semi-standardized model has the following interpretation: the expected difference in the outcome, comparing units (subjects) that differ by one standard deviation in the variable of interest, with all other variables held fixed (for example, at their average).
  • Remember also that the intercept in such a model is the predicted outcome when every predictor is at its mean, which for a least squares fit equals the mean outcome across all subjects.

Consider a two-variable model, using area and elevation to predict the number of species:


Call:
lm(formula = species ~ area + elevation, data = gala)

Residuals:
    Min      1Q  Median      3Q     Max 
-192.62  -33.53  -19.20    7.54  261.51 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  17.1052    20.9421    0.82   0.4212   
area          0.0188     0.0259    0.72   0.4748   
elevation     0.1717     0.0532    3.23   0.0032 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 79.3 on 27 degrees of freedom
Multiple R-squared:  0.554, Adjusted R-squared:  0.521 
F-statistic: 16.8 on 2 and 27 DF,  p-value: 1.84e-05

Now compare these results to the ones we get after scaling the area and elevation variables. Remember that the scale function centers a variable on zero by subtracting the mean from each observation, and then scales the result by dividing by the standard deviation. This ensures that each regression input has mean 0 and standard deviation 1, and is thus a z score.


Call:
lm(formula = species ~ scale(area) + scale(elevation), data = gala)

Residuals:
    Min      1Q  Median      3Q     Max 
-192.62  -33.53  -19.20    7.54  261.51 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)          85.2       14.5    5.88  2.9e-06 ***
scale(area)          16.2       22.4    0.72   0.4748    
scale(elevation)     72.4       22.4    3.23   0.0032 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 79.3 on 27 degrees of freedom
Multiple R-squared:  0.554, Adjusted R-squared:  0.521 
F-statistic: 16.8 on 2 and 27 DF,  p-value: 1.84e-05
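
The comparison between the two fits above can be sketched in a few lines of R. This is a minimal illustration, not the code used to produce the output: the data frame `sim` is simulated (an assumption made so the block runs on its own), and only the variable names are borrowed from the gala example.

```r
# Simulated stand-in for the gala data (hypothetical values, not the real data)
set.seed(431)
n <- 30
sim <- data.frame(
  area      = rexp(n, rate = 1 / 250),
  elevation = runif(n, min = 25, max = 1700)
)
sim$species <- 15 + 0.02 * sim$area + 0.17 * sim$elevation + rnorm(n, sd = 80)

# Fit on the original scale, then with z-scored predictors
m_raw <- lm(species ~ area + elevation, data = sim)
m_z   <- lm(species ~ scale(area) + scale(elevation), data = sim)

# Place the slopes side by side, along with (raw slope) x (predictor SD)
sd_x <- c(sd(sim$area), sd(sim$elevation))
cbind(raw          = coef(m_raw)[-1],
      raw_times_sd = coef(m_raw)[-1] * sd_x,
      scaled       = coef(m_z)[-1])

# Compare the new intercept with the mean outcome
c(intercept = unname(coef(m_z)[1]), mean_species = mean(sim$species))

# Compare t statistics for the slopes, and R-squared, across the two fits
summary(m_raw)$coefficients[-1, "t value"]
summary(m_z)$coefficients[-1, "t value"]
c(raw = summary(m_raw)$r.squared, scaled = summary(m_z)$r.squared)
```

Using `scale()` inside the formula keeps the original columns untouched, at the cost of longer coefficient names such as `scale(area)`.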

49.1.1 Questions about the Semi-Standardized Model

  1. What changes after centering and rescaling the predictors, and what does not?
  2. Why might rescaling like this be a helpful thing to do if you want to compare predictors in terms of importance?

49.2 Fully Standardized Regression Coefficients

Suppose we standardize the coefficients by also centering and scaling (using the z score) the outcome variable, species, creating a fully standardized model.


Call:
lm(formula = scale(species) ~ scale(area) + scale(elevation), 
    data = gala)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6803 -0.2925 -0.1675  0.0658  2.2813 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)   
(Intercept)      4.59e-17   1.26e-01    0.00   1.0000   
scale(area)      1.42e-01   1.96e-01    0.72   0.4748   
scale(elevation) 6.32e-01   1.96e-01    3.23   0.0032 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.692 on 27 degrees of freedom
Multiple R-squared:  0.554, Adjusted R-squared:  0.521 
F-statistic: 16.8 on 2 and 27 DF,  p-value: 1.84e-05
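
A fully standardized fit can also be checked against the raw fit by hand. The sketch below again uses simulated data (an assumption, so that the block is self-contained) and compares the slopes from a model with `scale()` applied to every variable against raw slopes multiplied by sd(x) / sd(y).

```r
# Simulated stand-in for the gala data (hypothetical values, not the real data)
set.seed(431)
n <- 30
sim <- data.frame(
  area      = rexp(n, rate = 1 / 250),
  elevation = runif(n, min = 25, max = 1700)
)
sim$species <- 15 + 0.02 * sim$area + 0.17 * sim$elevation + rnorm(n, sd = 80)

m_raw  <- lm(species ~ area + elevation, data = sim)
m_full <- lm(scale(species) ~ scale(area) + scale(elevation), data = sim)

# Rescale the raw slopes by sd(x) / sd(y) and compare with the fitted values
beta_by_hand <- coef(m_raw)[-1] *
  c(sd(sim$area), sd(sim$elevation)) / sd(sim$species)
cbind(by_hand = beta_by_hand, from_model = coef(m_full)[-1])

# The residual standard error is now expressed in SD units of the outcome
c(from_model = sigma(m_full), by_hand = sigma(m_raw) / sd(sim$species))
```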

49.2.1 Questions about the Standardized Model

  1. How do you interpret the value 0.142 of the scale(area) coefficient here? You may want to start by reviewing the summary of the original gala data shown here.
    species         area        elevation   
 Min.   :  2   Min.   :   0   Min.   :  25  
 1st Qu.: 13   1st Qu.:   0   1st Qu.:  98  
 Median : 42   Median :   3   Median : 192  
 Mean   : 85   Mean   : 262   Mean   : 368  
 3rd Qu.: 96   3rd Qu.:  59   3rd Qu.: 435  
 Max.   :444   Max.   :4669   Max.   :1707  
  2. How do you interpret the value 0.632 of the scale(elevation) coefficient in the standardized model?
  3. What is the intercept in this setting? Will this be the case whenever you scale like this?
  4. What are some of the advantages of looking at scaled regression coefficients?
  5. Why are these called fully standardized coefficients while the previous section described semi-standardized coefficients?
  6. What would motivate you to use one of these two methods of standardization (fully standardized or semi-standardized) vs. the other?

49.3 Robust Standardization of Regression Coefficients

Another common option for scaling is to specify lower and upper comparison points, perhaps by comparing the impact of a move from the 25th to the 75th percentile for each variable, while holding all of the other variables constant.

Occasionally, you will see robust semi-standardized regression coefficients, which measure the increase in the outcome, Y, associated with an increase of one IQR (inter-quartile range) in a particular predictor.


Call:
lm(formula = species ~ area.scaleiqr + elevation.scaleiqr, data = gala)

Residuals:
    Min      1Q  Median      3Q     Max 
-192.62  -33.53  -19.20    7.54  261.51 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)           85.23      14.48    5.88  2.9e-06 ***
area.scaleiqr          1.11       1.53    0.72   0.4748    
elevation.scaleiqr    57.96      17.95    3.23   0.0032 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 79.3 on 27 degrees of freedom
Multiple R-squared:  0.554, Adjusted R-squared:  0.521 
F-statistic: 16.8 on 2 and 27 DF,  p-value: 1.84e-05
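
The construction of `area.scaleiqr` and `elevation.scaleiqr` is not shown above. One construction consistent with the output (the intercept matches the mean-centered fits) is to subtract the mean and divide by the IQR. The sketch below assumes that construction; the helper `scale_iqr` and the simulated data frame `sim` are hypothetical and exist only for illustration.

```r
# Hypothetical helper: center at the mean, then divide by the interquartile range.
# (Centering at the mean is an assumption; centering at the median is another option.)
scale_iqr <- function(x) (x - mean(x)) / IQR(x)

# Simulated stand-in for the gala data (hypothetical values, not the real data)
set.seed(431)
n <- 30
sim <- data.frame(
  area      = rexp(n, rate = 1 / 250),
  elevation = runif(n, min = 25, max = 1700)
)
sim$species <- 15 + 0.02 * sim$area + 0.17 * sim$elevation + rnorm(n, sd = 80)

sim$area.scaleiqr      <- scale_iqr(sim$area)
sim$elevation.scaleiqr <- scale_iqr(sim$elevation)

m_raw <- lm(species ~ area + elevation, data = sim)
m_iqr <- lm(species ~ area.scaleiqr + elevation.scaleiqr, data = sim)

# Compare (raw slope) x (predictor IQR) with the slopes from the rescaled fit
iqr_x <- c(IQR(sim$area), IQR(sim$elevation))
cbind(raw_times_iqr = coef(m_raw)[-1] * iqr_x,
      iqr_scaled    = coef(m_iqr)[-1])
```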

49.3.1 Questions about Robust Standardization

  1. How should we interpret the 57.96 value for the scaled elevation variable? You may want to start by considering the summary of the original elevation data below.
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     25      98     192     368     435    1707 

A robust standardized coefficient analysis measures the increase in Y (in units of the IQR of Y) associated with an increase of one IQR in the predictor of interest.


Call:
lm(formula = species.scaleiqr ~ area.scaleiqr + elevation.scaleiqr, 
    data = gala)

Coefficients:
       (Intercept)       area.scaleiqr  elevation.scaleiqr  
         -1.01e-16            1.34e-02            6.98e-01  
  2. What can we learn from the R output above?

49.4 Scaling Inputs by Dividing by 2 Standard Deviations

It turns out that standardizing the inputs to a regression model by dividing by a standard deviation creates some difficulties when you want to include a binary predictor in the model.

Instead, Andrew Gelman recommends that you consider centering all of the predictors (binary or continuous) by subtracting off the mean, and then, for the non-binary predictors, also dividing not by one, but rather by two standard deviations.

  • Such a standardization can go a long way to helping us understand a model whose predictors are on different scales, and provides an interpretable starting point.
  • Another appealing part of this approach is that in the arm package, Gelman and his colleagues have created an R function called standardize, which can be used to automate the process of checking coefficients that have been standardized in this manner, after the regression model has been fit.
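
A short calculation suggests why two standard deviations is a natural divisor. For a binary (0/1) predictor in which a proportion p of the values are 1, the standard deviation is approximately

```latex
s_x = \sqrt{p(1-p)}, \qquad \text{so for } p = 0.5: \quad s_x = 0.5 \quad \text{and} \quad 2 s_x = 1 .
```

For a roughly balanced binary predictor, then, the move from 0 to 1 is a change of about two standard deviations. Dividing continuous predictors by two standard deviations puts their coefficients on approximately the same footing as the coefficient of a binary predictor left on its 0/1 scale.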

Call:
lm(formula = species ~ area + elevation, data = gala)

Coefficients:
(Intercept)         area    elevation  
    17.1052       0.0188       0.1717  

Call:
lm(formula = species ~ z.area + z.elevation, data = gala)

Coefficients:
(Intercept)       z.area  z.elevation  
       85.2         32.5        144.8  
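
The two-standard-deviation rescaling can also be carried out by hand, which makes explicit what the `z.` variables contain. This is a sketch under stated assumptions: the helper `rescale_2sd` and the simulated data frame `sim` are hypothetical, and the `z.` names simply mirror those in the output above.

```r
# Hypothetical helper: center at the mean, then divide by two standard deviations
rescale_2sd <- function(x) (x - mean(x)) / (2 * sd(x))

# Simulated stand-in for the gala data (hypothetical values, not the real data)
set.seed(431)
n <- 30
sim <- data.frame(
  area      = rexp(n, rate = 1 / 250),
  elevation = runif(n, min = 25, max = 1700)
)
sim$species <- 15 + 0.02 * sim$area + 0.17 * sim$elevation + rnorm(n, sd = 80)

m_raw <- lm(species ~ area + elevation, data = sim)

sim$z.area      <- rescale_2sd(sim$area)
sim$z.elevation <- rescale_2sd(sim$elevation)
m_2sd <- lm(species ~ z.area + z.elevation, data = sim)

# Compare (raw slope) x (2 SD of the predictor) with the rescaled slopes
two_sd <- 2 * c(sd(sim$area), sd(sim$elevation))
cbind(raw_times_2sd = coef(m_raw)[-1] * two_sd,
      rescaled      = coef(m_2sd)[-1])

# With the arm package installed, arm::standardize(m_raw) is meant to
# automate this refit from the fitted model (not run here).
```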

49.4.1 Questions about Standardizing by Dividing by Two SD

  1. How does this result compare to the semi-standardized regression coefficients we have seen in the previous sections?

  2. How should we interpret the z.area coefficient of 32.5 here? Again, you may want to start by obtaining a statistical summary of the original area data, as shown below.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       0       3     262      59    4669 

To standardize the outcome in this way as well, we refit the model with the rescaled outcome:


Call:
lm(formula = z.species ~ z.area + z.elevation, data = gala)

Coefficients:
(Intercept)       z.area  z.elevation  
   1.65e-19     1.42e-01     6.32e-01  
  3. How should we interpret the z.area coefficient of 0.142 here?
  4. How do these relate to the standardized regression coefficients we’ve seen before?