Chapter 47 Influence Measures for Multiple Regression

R can output a series of influence measures for a regression model. Let me show you all of the available measures for model 1, but just for three of the data points - #1 (which is not particularly influential) and #12 and #16 (which are).

First, we’ll look at the raw data:

# A tibble: 3 x 11
     id island species   area elevation nearest scruz adjacent stures  fits
  <int> <fct>    <int>  <dbl>     <int>   <dbl> <dbl>    <dbl>  <dbl> <dbl>
1     1 Baltra      58   25.1       346     0.6   0.6     1.84 -1.00  117. 
2    12 Ferna~      93  634.       1494     4.3  95.3  4669.   -0.286  97.0
3    16 Isabe~     347 4669.       1707     0.7  28.1   634.   -5.33  386. 
# ... with 1 more variable: stanres <dbl>

And then, we’ll gather the output available in the influence.measures function.

Here’s an edited version of this output…

Influence measures of
lm(formula = species ~ area + elevation + nearest + scruz + adjacent, 
data = gala) :

     dfb.1_  dfb.area  dfb.elvt dfb.nrst  dfb.scrz  dfb.adjc
1  -0.15064   0.13572 -0.122412  0.07684  0.084786  1.14e-01
12  0.16112   0.16395 -0.122578  0.03093 -0.059059 -8.27e-01
16 -1.18618 -20.87453  4.885852  0.36713 -1.022431 -8.09e-01

     dffit   cov.r   cook.d    hat inf
1   -0.29335  1.0835 1.43e-02 0.0787    
12  -1.24249 25.1101 2.68e-01 0.9497   *
16 -29.59041  0.3275 6.81e+01 0.9685   *

This output presents dfbetas for each coefficient, followed by dffit statistics, covariance ratios, Cook’s distance and leverage values (hat) along with an indicator of influence.
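A minimal sketch of how output like this might be produced. It assumes that `gala` is the Galapagos data shipped in the `faraway` package, with its capitalized column names lowercased to match the formula shown above. That renaming step, and the object names `model1` and `infl`, are our assumptions rather than something shown in the text.

```r
# Assumption: gala comes from the faraway package, where the columns are
# capitalized (Species, Area, ...). Lowercase them to match the chapter.
gala <- faraway::gala
names(gala) <- tolower(names(gala))

model1 <- lm(species ~ area + elevation + nearest + scruz + adjacent,
             data = gala)

# influence.measures() returns a list. Its infmat element is a matrix with
# one row per observation and one column per measure
# (the dfbetas, then dffit, cov.r, cook.d and hat).
infl <- influence.measures(model1)

# Keep only the three islands discussed in the text (rows 1, 12 and 16).
round(infl$infmat[c(1, 12, 16), ], 3)
```

Printing `infl` itself shows all 30 rows, with the asterisk column added by the print method.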

We’ll consider each of these elements in turn.

47.1 DFBETAs

The first part of the influence measures output concerns what are generally called dfbetas.

id  island      dfb.1_  dfb.area  dfb.elvt  dfb.nrst  dfb.scrz  dfb.adjc
1   Baltra      -0.151     0.136    -0.122     0.077     0.085     0.114
12  Fernandina   0.161     0.164    -0.123     0.031    -0.059    -0.827
16  Isabela     -1.186   -20.875     4.886     0.367    -1.022    -0.809

The dfbetas look at a standardized difference in the estimate of a coefficient (slope) that will occur if the specified point (here, island) is removed from the data set.

  • Positive values indicate that deleting the point will yield a smaller coefficient.
  • Negative values indicate that deleting the point will yield a larger coefficient.
  • If the absolute value of the dfbeta is greater than \(2 / \sqrt{n}\), where \(n\) is the sample size, then the dfbeta is considered to be large.

In this case, our cutoff would be \(2 / \sqrt{30}\) or 0.365, so every one of the Isabela dfbeta values indicates large influence. Essentially, if we remove Isabela from the data and refit the model, our regression slopes will change a lot (see below). Fernandina has some influence as well, especially on the adjacent coefficient.

Predictor   Coefficient (p), all 30 islands   Coefficient (p), without Isabela
Intercept    7.07 (p = 0.72)                  22.59 (p = 0.11)
area        -0.02 (p = 0.30)                   0.30 (p < 0.01)
elevation    0.32 (p < 0.01)                   0.14 (p < 0.01)
nearest      0.01 (p = 0.99)                  -0.26 (p = 0.73)
scruz       -0.24 (p = 0.28)                  -0.09 (p = 0.55)
adjacent    -0.08 (p < 0.01)                  -0.07 (p < 0.01)
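The dfbeta cutoff and the refit comparison can be sketched as follows. As before, we assume `gala` is the `faraway` data with lowercased column names. The names `dfb`, `flagged` and `model1_no16` are ours, chosen for illustration.

```r
# Assumption: gala comes from the faraway package, with names lowercased.
gala <- faraway::gala
names(gala) <- tolower(names(gala))
model1 <- lm(species ~ area + elevation + nearest + scruz + adjacent,
             data = gala)

# Standardized dfbetas: one row per island, one column per coefficient.
dfb <- dfbetas(model1)

# The rule of thumb from the text: 2 / sqrt(n), about 0.365 when n = 30.
cutoff <- 2 / sqrt(nrow(gala))

# TRUE wherever a point exceeds the cutoff for a given coefficient.
flagged <- abs(dfb) > cutoff
flagged[c(1, 12, 16), ]

# Refit without row 16 (Isabela) and put the estimates side by side.
model1_no16 <- update(model1, data = gala[-16, ])
cbind(all_30 = coef(model1), without_16 = coef(model1_no16))
```

Running `summary()` on each fit gives the p values to go with the coefficients.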

47.2 Other Available Influence Measures

After the dfbetas, the influence.measures output presents dffit, covariance ratios, Cook’s distance and leverage values (hat) for each observation, along with an indicator of influence.

id  island         dffit   cov.r   cook.d    hat inf
1   Baltra      -0.29335  1.0835 1.43e-02 0.0787    
12  Fernandina  -1.24249 25.1101 2.68e-01 0.9497   *
16  Isabela    -29.59041  0.3275 6.81e+01 0.9685   *

47.2.1 Cook’s d or Cook’s Distance

The main measure of influence is Cook’s Distance, also called Cook’s d. Cook’s d provides a summary of the influence of a particular point on all of the regression coefficients. It is a function of the standardized residual and the leverage.

  • Cook’s distance values greater than 1 are generally indicators of high influence.
  • Obviously, Isabela (with a value of Cook’s d = 68.1) is a highly influential observation by this measure.
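To make the "function of the standardized residual and the leverage" concrete, here is a sketch that rebuilds Cook's distance by hand and compares it with `cooks.distance()`. The formula used is the standard one, \(D_i = \frac{r_i^2}{k} \cdot \frac{h_i}{1 - h_i}\), where \(r_i\) is the standardized residual, \(h_i\) is the leverage and \(k\) is the number of coefficients. The data setup is again our assumption.

```r
# Assumption: gala comes from the faraway package, with names lowercased.
gala <- faraway::gala
names(gala) <- tolower(names(gala))
model1 <- lm(species ~ area + elevation + nearest + scruz + adjacent,
             data = gala)

r <- rstandard(model1)       # standardized residuals
h <- hatvalues(model1)       # leverage values
k <- length(coef(model1))    # number of coefficients, intercept included

# Cook's distance built from its two ingredients.
cook_manual <- (r^2 / k) * (h / (1 - h))

# Should agree with R's own calculation.
all.equal(cook_manual, cooks.distance(model1))

# Islands exceeding the cutoff of 1 used in the text.
which(cooks.distance(model1) > 1)
```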

47.2.2 Plotting Cook’s Distance

As one of its automated regression diagnostic plots, R will produce an index plot of the Cook’s distance values. Note the relatively enormous influence for island 16 (Isabela).
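One way to draw this plot is with the `which` argument of `plot()` for a fitted `lm` object, where the fourth diagnostic plot is the Cook's distance index plot. The data setup below is our assumption, as elsewhere in this chapter's sketches.

```r
# Assumption: gala comes from the faraway package, with names lowercased.
gala <- faraway::gala
names(gala) <- tolower(names(gala))
model1 <- lm(species ~ area + elevation + nearest + scruz + adjacent,
             data = gala)

# Index plot of Cook's distance (diagnostic plot number 4 for an lm fit).
plot(model1, which = 4)

# Row number of the tallest spike in that plot.
which.max(cooks.distance(model1))
```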

47.2.3 DFFITS

A similar measure to Cook’s distance is called DFFITS. The DFFITS value describes the influence of the point on the fitted value. It’s the number of standard deviations that the fitted value changes if the observation is removed. This is defined as a function of the studentized residual and the leverage.

  • If the absolute value of DFFITS is greater than 2 times \(\sqrt{p / (n-p)}\), where p is the number of predictors (not including the intercept) and n is the sample size, we deem the observation influential.
  • For the gala data, we’d consider any point whose DFFITS exceeds 2 x \(\sqrt{5 / (30-5)}\) = 0.894 in absolute value to be influential by this standard, since n = 30 and we are estimating p = 5 slopes in our model. This is true of both Fernandina (-1.24) and Isabela (-29.59).
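The DFFITS rule can be checked with a few lines of R. This is a sketch using the cutoff as defined in this section, and it assumes the same `faraway` data setup as before.

```r
# Assumption: gala comes from the faraway package, with names lowercased.
gala <- faraway::gala
names(gala) <- tolower(names(gala))
model1 <- lm(species ~ area + elevation + nearest + scruz + adjacent,
             data = gala)

n <- nrow(gala)                   # 30 islands
p <- length(coef(model1)) - 1     # 5 slopes, intercept not counted

# Cutoff used in this section: 2 * sqrt(p / (n - p)), about 0.894 here.
cutoff <- 2 * sqrt(p / (n - p))

# DFFITS for every island, then those beyond the cutoff in absolute value.
dff <- dffits(model1)
dff[abs(dff) > cutoff]
```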

47.2.4 Covariance Ratio

The covariance ratio cov.r indicates the effect of the observation on the precision of estimation. If cov.r is greater than 1, then including the observation improves the overall precision. If cov.r is less than 1, the observation reduces the precision of estimation, and these are the points about which we’ll be most concerned.

  • As with most of our other influence measures, Isabela appears to be a concern.
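The covariance ratios can also be pulled out directly with `covratio()`. A brief sketch, with the data setup assumed as in the other examples:

```r
# Assumption: gala comes from the faraway package, with names lowercased.
gala <- faraway::gala
names(gala) <- tolower(names(gala))
model1 <- lm(species ~ area + elevation + nearest + scruz + adjacent,
             data = gala)

# One covariance ratio per island.
cr <- covratio(model1)

# The three islands discussed in the text.
cr[c(1, 12, 16)]

# Islands with a ratio below 1, smallest first.
sort(cr[cr < 1])
```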

47.2.5 Leverage

The hat value is a measure of leverage. Specifically, this addresses whether or not the point in question is unusual in terms of its combination of predictor values.

  • The usual cutoff for a large leverage value is 2.5 times the average leverage across all observations. The average leverage is k/n, where n is the number of observations included in the regression model and k is the number of model coefficients (slopes plus intercept).
  • In the gala example, we’d regard any observation with a hat value larger than 2.5 x 6/30 = 0.5 to have large leverage. This includes Fernandina and Isabela.
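The leverage cutoff above can be sketched as follows, using `hatvalues()` and the same assumed data setup:

```r
# Assumption: gala comes from the faraway package, with names lowercased.
gala <- faraway::gala
names(gala) <- tolower(names(gala))
model1 <- lm(species ~ area + elevation + nearest + scruz + adjacent,
             data = gala)

h <- hatvalues(model1)       # leverage for each island
k <- length(coef(model1))    # 6 coefficients (5 slopes plus intercept)
n <- nrow(gala)              # 30 islands

# The average leverage is k / n.
mean(h)

# Cutoff from the text: 2.5 times the average leverage, which is 0.5 here.
cutoff <- 2.5 * k / n

# Islands with large leverage by this standard.
h[h > cutoff]
```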

47.2.6 Indicator of Influence

The little asterisk indicates an observation which is influential according to R’s standards for at least one of these measures. You can take the absence of an asterisk as a clear indication that a point is NOT influential. Points with asterisks may or may not be influential in an important way. In practice, when the results aren’t completely clear, I usually focus on the Cook’s distance to make decisions about likely influence.
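To see which measures earned a point its asterisk, we can look at the `is.inf` element of the object that `influence.measures()` returns. This is a sketch with the same assumed data setup. Note that R's built-in thresholds are not necessarily the cutoffs quoted in this chapter.

```r
# Assumption: gala comes from the faraway package, with names lowercased.
gala <- faraway::gala
names(gala) <- tolower(names(gala))
model1 <- lm(species ~ area + elevation + nearest + scruz + adjacent,
             data = gala)

infl <- influence.measures(model1)

# is.inf is a logical matrix: TRUE where R flags a point on a measure.
# A row gets an asterisk if any of its entries is TRUE.
flag_any <- apply(infl$is.inf, 1, any)

# Only the flagged islands, showing which measures triggered the flag.
infl$is.inf[flag_any, , drop = FALSE]

# summary() prints the measures for the flagged observations only.
summary(infl)
```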