14 Dehydration Recovery

The hydrate data describe the degree of recovery that takes place 90 minutes following treatment of moderate to severe dehydration, for 36 children diagnosed at a hospital’s main pediatric clinic.

Upon diagnosis and study entry, patients were treated with an electrolytic solution at one of seven dose levels (0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0 mEq/l), delivered in a frozen, flavored ice popsicle. The degree of rehydration was determined using a subjective scale based on physical examination and parental input, converted to a 0 to 100 point scale representing the percent of recovery (recov.score). Each child’s age (in years) and weight (in pounds) are also available.

First, we’ll check the ranges of each variable (and look for missing data) in the hydrate file.

hydrate <- read_csv("data/hydrate.csv")

summary(hydrate)
       id         recov.score          dose      
 Min.   : 1.00   Min.   : 44.00   Min.   :0.000  
 1st Qu.: 9.75   1st Qu.: 61.50   1st Qu.:1.000  
 Median :18.50   Median : 71.50   Median :1.500  
 Mean   :18.50   Mean   : 71.56   Mean   :1.569  
 3rd Qu.:27.25   3rd Qu.: 80.00   3rd Qu.:2.500  
 Max.   :36.00   Max.   :100.00   Max.   :3.000  
      age             weight     
 Min.   : 3.000   Min.   :22.00  
 1st Qu.: 5.000   1st Qu.:34.50  
 Median : 6.500   Median :47.50  
 Mean   : 6.667   Mean   :46.89  
 3rd Qu.: 8.000   3rd Qu.:57.25  
 Max.   :11.000   Max.   :76.00  

There are no missing values, and all of the ranges make sense. There are no especially egregious problems to report.
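If we want an explicit count of missing values (summary only prints an NA count for a column when one exists), one quick base R check is:

colSums(is.na(hydrate))

which returns 0 for every column here.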

14.1 A Scatterplot Matrix

Next, we’ll use a scatterplot matrix to summarize relationships between the outcome recov.score and the key predictor dose, as well as the ancillary predictors age and weight, which are of less interest but are expected to be related to our outcome. The one below uses the ggpairs function from the GGally package, as introduced in Part A of the Notes. We place the outcome in the bottom row and the key predictor immediately above it, with age and weight in the top rows, using the select function within the ggpairs call.

GGally::ggpairs(dplyr::select(hydrate, age, weight, dose, recov.score), 
                title = "Scatterplot Matrix for hydrate data")

What can we conclude here?

  • It looks like recov.score has a moderately strong negative relationship with both age and weight (with correlations in each case around -0.5), but a positive relationship with dose (correlation = 0.36).
  • The distribution of recov.score looks to be pretty close to Normal. No potential predictors (age, weight and dose) show substantial non-Normality.
  • age and weight, as we’d expect, show a very strong and positive linear relationship, with r = 0.94.
  • Neither age nor weight shows a meaningful relationship with dose (r = 0.16).

14.2 Are the recovery scores well described by a Normal model?

Next, we’ll do a more thorough graphical summary of our outcome, recovery score.

p1 <- ggplot(hydrate, aes(sample = recov.score)) +
  geom_qq(col = '#440154') + geom_qq_line(col = "red") + 
  theme(aspect.ratio = 1) + 
  labs(title = "Normal Q-Q plot: hydrate")

p2 <- ggplot(hydrate, aes(x = recov.score)) +
  geom_histogram(aes(y = after_stat(density)), 
                 bins = 10, fill = '#440154', col = '#FDE725') +
  stat_function(fun = dnorm, 
                args = list(mean = mean(hydrate$recov.score), 
                            sd = sd(hydrate$recov.score)),
                col = "red", lwd = 1.5) +
  labs(title = "Density Function: hydrate")

p3 <- ggplot(hydrate, aes(x = recov.score, y = "")) +
  geom_boxplot(fill = '#440154', outlier.color = '#440154') + 
  labs(title = "Boxplot: hydrate", y = "")

p1 + (p2 / p3 + plot_layout(heights = c(4,1)))

mosaic::favstats(~ recov.score, data = hydrate) %>% kable(digits = 1)

 min    Q1  median  Q3  max  mean    sd   n  missing
  44  61.5    71.5  80  100  71.6  12.9  36        0

I see no serious problems with assuming Normality for these recovery scores. Our outcome variable doesn’t in any way need to follow a Normal distribution, but it’s nice when it does, because summaries involving means and standard deviations make sense.

14.3 Simple Regression: Using Dose to predict Recovery

To start, consider a simple (one predictor) regression model using dose alone to predict the % Recovery (recov.score). Ignoring the age and weight covariates, what can we conclude about this relationship?

14.4 The Scatterplot, with fitted Linear Model

ggplot(hydrate, aes(x = dose, y = recov.score)) +
    geom_point(size = 2) +
    geom_smooth(method = "lm", formula = y ~ x, col = "red") +
    labs(title = "Simple Regression model for the hydrate data",
         x = "Dose (mEq/l)", y = "Recovery Score (points)")

14.5 The Fitted Linear Model

To obtain the fitted linear regression model, we use the lm function:

m1 <- lm(recov.score ~ dose, data = hydrate)

tidy(m1) %>% kable(digits = 2)
term         estimate  std.error  statistic  p.value
(Intercept)     63.90       3.97      16.09     0.00
dose             4.88       2.17       2.25     0.03

So, our fitted regression model (prediction model) is recov.score = 63.9 + 4.88 dose.

14.5.1 Confidence Intervals

We can obtain confidence intervals around the coefficients of our fitted model with tidy, too.

tidy(m1, conf.int = TRUE, conf.level = 0.90) %>% kable(digits = 2)
term         estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)     63.90       3.97      16.09     0.00     57.18      70.61
dose             4.88       2.17       2.25     0.03      1.21       8.55

So, our 90% confidence interval for the slope of dose ranges from 1.21 to 8.55.

14.6 Coefficient Plots

The tidy method makes it easy to construct coefficient plots using ggplot2.

td <- tidy(m1, conf.int = TRUE, conf.level = 0.90)
ggplot(td, aes(x = estimate, y = term, col = term)) +
  geom_point() +
  # geom_crossbar with orientation = "y" draws horizontal interval bars
  geom_crossbar(aes(xmin = conf.low, xmax = conf.high), orientation = "y") +
  geom_vline(xintercept = 0) +
  guides(col = "none") +
  labs(title = "Estimates with 90% confidence intervals from m1 in hydrate")

Another option would be to use geom_errorbarh in this setting, perhaps with a different color scheme…

td <- tidy(m1, conf.int = TRUE, conf.level = 0.90)
ggplot(td, aes(x = estimate, y = term, col = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 0) +
  scale_color_viridis_d(end = 0.5) +
  guides(col = "none") +
  labs(title = "Estimates with 90% confidence intervals from m1 in hydrate")

14.7 The Summary Output

To get a more complete understanding of the fitted model, we’ll summarize it.

summary(lm(recov.score ~ dose, data = hydrate))

Call:
lm(formula = recov.score ~ dose, data = hydrate)

Residuals:
     Min       1Q   Median       3Q      Max 
-22.3360  -7.2763   0.0632   8.4233  23.9028 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   63.896      3.970  16.093   <2e-16 ***
dose           4.881      2.172   2.247   0.0313 *  
---
Signif. codes:  
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.21 on 34 degrees of freedom
Multiple R-squared:  0.1293,    Adjusted R-squared:  0.1037 
F-statistic: 5.047 on 1 and 34 DF,  p-value: 0.03127

14.7.1 Model Specification

  1. The first part of the output specifies the model that has been fit.
    • Here, we have a simple regression model that predicts recov.score on the basis of dose.
    • Notice that we’re treating dose here as a quantitative variable. If we wanted dose to be treated as a factor, we’d have specified that in the model, as sketched below.
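For instance, if we did want R to treat dose as categorical, a sketch of that alternative specification (the name m1_factor is just for illustration) would be:

# treat dose as a factor with seven levels, not as a quantitative predictor
m1_factor <- lm(recov.score ~ factor(dose), data = hydrate)

That model would estimate a separate shift in mean recovery score for each non-zero dose level, instead of a single slope.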

14.7.2 Residual Summary

  1. The second part of the output summarizes the regression residuals across the subjects involved in fitting the model.
    • The residual is defined as the Actual value of our outcome minus the predicted value of that outcome fitted by the model.
    • In our case, the residual for a given child is their actual recov.score minus the predicted recov.score according to our model, for that child.
    • The residual summary gives us a sense of how “incorrect” our predictions are for the hydrate observations.
      • A positive residual means that the observed value was higher than the predicted value from the linear regression model, so the prediction was too low.
      • A negative residual means that the observed value was lower than the predicted value from the linear regression model, so the prediction was too high.
      • The residuals will center near 0 (the ordinary least squares model fitting process is designed so the mean of the residuals will always be zero)
      • We hope to see the median of the residuals also be near zero, generally. In this case, the median prediction is 0.06 point too low.
      • The minimum and maximum show us the largest prediction errors, made in the subjects used to fit this model.
      • Here, we predicted a recovery score that was 22.3 points too high for one patient, and another of our predicted recovery scores was 23.9 points too low.
      • The middle half of our predictions were between 8.4 points too low and 7.3 points too high.
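We can reproduce this residual summary directly: resid pulls the residuals from the fitted model, and summary gives the five-number summary plus the mean (which is zero by construction):

# residual = actual recov.score - model-fitted recov.score
summary(resid(m1))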

14.7.3 Coefficients Output

  1. The Coefficients output begins with a table of the estimated coefficients from the regression equation.
    • Generally, we write a simple regression model as \(y = \beta_0 + \beta_1 x\).
    • In the hydrate model, we have recov.score = \(\beta_0\) + \(\beta_1\) dose.
    • The first column of the table gives the estimated \(\beta\) coefficients for our model
      • Here the estimated intercept \(\hat{\beta_0} = 63.9\)
      • The estimated slope of dose \(\hat{\beta_1} = 4.88\)
      • Thus, our model is recov.score= 63.9 + 4.88 dose
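If we just want the estimated coefficients for further calculation, the coef function extracts them from the fitted model:

coef(m1)            # named vector: (Intercept) and dose
coef(m1)["dose"]    # the estimated slope, 4.88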

We interpret these coefficients as follows:

  • The intercept (63.9) is the predicted recov.score for a patient receiving a dose of 0 mEq/l of the electrolytic solution.
  • The slope (4.88) of the dose is the predicted change in recov.score associated with a 1 mEq/l increase in the dose of electrolytic solution.
    • Essentially, if we have two children like the ones studied here, and we give Roger a popsicle with dose X and Sarah a popsicle with dose X + 1, then this model predicts that Sarah will have a recovery score that is 4.88 points higher than will Roger.
    • From the 95% confidence interval output (obtained, for example, with confint(m1, level = 0.95), as shown in Section 14.7.5), we are 95% confident that the true slope for dose is between 0.47 and 9.30 points per mEq/l. We are also 95% confident that the true intercept is between 55.8 and 72.0 points.
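We can watch the slope interpretation work via predict. A small sketch (the doses 1 and 2 are arbitrary values one unit apart, and preds is just an illustrative name):

# predicted recovery scores for two children whose doses differ by 1 mEq/l
preds <- predict(m1, newdata = data.frame(dose = c(1, 2)))
diff(preds)    # equals the estimated slope, 4.88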

14.7.4 Correlation and Slope

If we like, we can use the cor function to obtain the Pearson correlation of recov.score and dose, which turns out to be 0.36. Note that the slope in a simple regression model will always follow the sign of the Pearson correlation coefficient; in this case, both are positive.

hydrate %$% cor(recov.score, dose)
[1] 0.359528
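In fact, in a simple regression the estimated slope is completely determined by this correlation and the two standard deviations, via \(\hat{\beta_1} = r \frac{s_y}{s_x}\). A quick verification:

# slope = correlation * sd(outcome) / sd(predictor)
with(hydrate, cor(recov.score, dose) * sd(recov.score) / sd(dose))
coef(m1)["dose"]    # matches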

14.7.5 Coefficient Testing

summary(lm(recov.score ~ dose, data = hydrate))

Call:
lm(formula = recov.score ~ dose, data = hydrate)

Residuals:
     Min       1Q   Median       3Q      Max 
-22.3360  -7.2763   0.0632   8.4233  23.9028 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   63.896      3.970  16.093   <2e-16 ***
dose           4.881      2.172   2.247   0.0313 *  
---
Signif. codes:  
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.21 on 34 degrees of freedom
Multiple R-squared:  0.1293,    Adjusted R-squared:  0.1037 
F-statistic: 5.047 on 1 and 34 DF,  p-value: 0.03127

Next to each coefficient in the summary regression table is its estimated standard error, followed by the coefficient’s t value (the coefficient value divided by the standard error), and the associated two-tailed p value for the test of:

  • H0: This coefficient’s \(\beta\) value = 0 vs. 
  • HA: This coefficient’s \(\beta\) value \(\neq\) 0.

For the slope coefficient, we can interpret this choice as:

  • H0: This predictor adds no predictive value to the model vs. 
  • HA: This predictor adds some predictive value to the model.
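The t value and p value in the table can be reproduced by hand from the estimate and standard error; a sketch for the dose row (t_dose is just an illustrative name):

t_dose <- 4.881 / 2.172    # estimate / std. error = 2.247
2 * pt(abs(t_dose), df = 34, lower.tail = FALSE)    # two-tailed p, about 0.031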

In the hydrate simple regression model, by running either tidy with conf.int = TRUE or just the confint function, as shown below, we can establish a confidence interval for each of the estimated regression coefficients.

tidy(m1, conf.int = TRUE, conf.level = 0.95) %>% kable(digits = 2)
term         estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)     63.90       3.97      16.09     0.00     55.83      71.96
dose             4.88       2.17       2.25     0.03      0.47       9.30

confint(m1, level = .95)
                2.5 %    97.5 %
(Intercept) 55.826922 71.964589
dose         0.465695  9.295466

If the slope of dose were in fact zero, then knowing the dose would add no predictive value over simply guessing the mean of recov.score for every subject.

So, since the confidence interval for the slope of dose does not include zero, it appears that there is at least some evidence that the model m1 is more effective than a model that ignores the dose information (and simply predicts the mean of recov.score for each subject.) That’s not saying much, actually.

14.7.6 Summarizing the Quality of Fit

  1. The next part of the regression summary output is a summary of fit quality.

The residual standard error estimates the standard deviation of the prediction errors made by the model.

  • If assumptions hold, the model will produce residuals that follow a Normal distribution with mean 0 and standard deviation equal to this residual standard error.
    • So we’d expect roughly 95% of our residuals to fall between -2(12.21) and +2(12.21), or roughly -24.4 to +24.4, and we’d expect to see virtually no residuals outside the range of -3(12.21) to +3(12.21), or roughly -36.6 to +36.6.
    • The output at the top of the summary tells us about the observed regression residuals, and that they actually range from -22 to +24.
    • In context, it’s hard to know whether or not we should be happy about this. On a scale from 0 to 100, rarely missing by more than 24 seems OK to me, but not terrific.
  • The degrees of freedom here are the same as the denominator degrees of freedom in the ANOVA to follow. The calculation is \(n - k\), where \(n\) = the number of observations and \(k\) is the number of coefficients estimated by the regression (including the intercept and any slopes).
    • Here, there are 36 observations in the model, and we fit k = 2 coefficients (the intercept and the slope, as in any simple regression model), so df = 36 - 2 = 34.
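Since the residual standard error is the square root of the residual sum of squares divided by the residual degrees of freedom, we can confirm the value reported in the summary:

sqrt(sum(resid(m1)^2) / df.residual(m1))    # 12.21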

The multiple R-squared value is usually just referred to as R-squared.

  • This is interpreted as the proportion of variation in the outcome variable that has been accounted for by our regression model.
    • Here, we’ve accounted for just under 13% of the variation in % Recovery using Dose.
  • The R in multiple R-squared is the Pearson correlation of recov.score and dose, which in this case is 0.3595.
    • Squaring this value gives the R-squared for this simple regression.
    • (0.3595)^2 = 0.129
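A quick check of this relationship in R:

cor(hydrate$recov.score, hydrate$dose)^2    # 0.129, the multiple R-squared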

R-squared is greedy.

  • R-squared will always suggest that we make our models as big as possible, often including variables of dubious predictive value.
  • As a result, there are various methods for adjusting or penalizing R-squared so that we wind up with smaller models.
  • The adjusted R-squared is often a useful way to compare multiple models for the same response.
    • \(R^2_{adj} = 1 - \frac{(1-R^2)(n - 1)}{n - k}\), where \(n\) = the number of observations and \(k\) is the number of coefficients estimated by the regression (including the intercept and any slopes).
    • So, in this case, \(R^2_{adj} = 1 - \frac{(1 - 0.1293)(35)}{34} = 0.1037\)
    • The adjusted R-squared value is not, technically, a proportion of anything, but it is comparable across models for the same outcome.
    • The adjusted R-squared will always be less than the (unadjusted) R-squared.
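Here is that adjusted R-squared arithmetic in R (r2, n and k are just illustrative names):

r2 <- 0.1293; n <- 36; k <- 2
1 - (1 - r2) * (n - 1) / (n - k)    # 0.1037, matching the summary output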

14.7.7 ANOVA F test

  1. The last part of the standard summary of a regression model is the overall ANOVA F test.

The hypotheses for this test are:

  • H0: Each of the coefficients in the model (other than the intercept) has \(\beta\) = 0 vs.
  • HA: At least one regression slope has \(\beta \neq\) 0

Since we are doing a simple regression with just one predictor, the ANOVA F test hypotheses are exactly the same as the t test for dose:

  • H0: The slope for dose has \(\beta\) = 0 vs.
  • HA: The slope for dose has \(\beta \neq\) 0

In this case, we have an F statistic of 5.05 on 1 and 34 degrees of freedom, yielding p = 0.03

This provides some evidence that “something” in our model (here, dose is the only predictor) predicts the outcome to a degree beyond that easily attributed to chance alone. This is not actually surprising, nor is it especially interesting. The confidence interval for the slope is definitely more interesting than this.

  • In simple regression (regression with only one predictor), the t test for the slope (dose) always provides the same p value as the ANOVA F test.
    • The F test statistic in a simple regression is always by definition just the square of the slope’s t test statistic.
    • Here, F = 5.047, and this is the square of t = 2.247 from the Coefficients output, as we verify below.
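We can confirm this numerically with broom:

tidy(m1)$statistic[2]^2    # squared t statistic for dose
glance(m1)$statistic       # ANOVA F statistic; the two agree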

This test is basically just a combination of the R-squared value (13%) and the sample size. We don’t learn much from it that’s practically interesting or useful.

14.8 Viewing the complete ANOVA table

We can obtain the complete ANOVA table associated with this model, and the details behind this F test, using the anova function:

anova(lm(recov.score ~ dose, data = hydrate))
Analysis of Variance Table

Response: recov.score
          Df Sum Sq Mean Sq F value  Pr(>F)  
dose       1  752.2  752.15  5.0473 0.03127 *
Residuals 34 5066.7  149.02                  
---
Signif. codes:  
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  • The R-squared for our regression model is equal to the \(\eta^2\) for this ANOVA model.
    • If we divide SS(dose) = 752.2 by the total sum of squares (752.2 + 5066.7), we’ll get the multiple R-squared [0.1293], as sketched below.
  • Note that this is not the same ANOVA model we would get if we treated dose as a factor with seven levels, rather than as a quantitative variable.
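A sketch of that \(\eta^2\) (R-squared) calculation, pulled straight from the ANOVA table (a1 is just an illustrative name):

a1 <- anova(m1)
a1$"Sum Sq"[1] / sum(a1$"Sum Sq")    # SS(dose) / SS(total) = 0.1293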

14.9 Using glance to summarize the model’s fit

When applied to a linear model, the glance function from the broom package summarizes 12 characteristics of the model’s fit.

Let’s look at the eight of these that we’ve already addressed.

glance(m1) %>% select(r.squared:df, df.residual, nobs) %>%
  kable(digits = c(3, 3, 1, 2, 3, 0, 0, 0))
r.squared  adj.r.squared  sigma  statistic  p.value  df  df.residual  nobs
    0.129          0.104   12.2       5.05    0.031   1           34    36

  • We’ve discussed the R-square value, shown in r.squared.
  • We’ve also discussed the adjusted R-square value, in adj.r.squared
  • sigma is the residual standard error.
  • statistic is the ANOVA F statistic.
  • p.value is the p value associated with the ANOVA F statistic.
  • df is the numerator degrees of freedom (here, the df associated with dose) for the ANOVA test associated with this model.
  • df.residual is the denominator degrees of freedom (here the df associated with residual) for that same ANOVA test.
  • Remember that the F-statistic at the bottom of the summary output provides these last four statistics, as well.
  • nobs is the number of observations (rows) used to fit the model.

Now, let’s look at the remaining four summaries:

glance(m1) %>% select(logLik:deviance) %>%
  kable(digits = 1)
logLik    AIC  BIC  deviance
-140.1  286.3  291    5066.7

  • logLik is the log-likelihood value for the model. For a model fit by lm, the ordinary least squares solution coincides with the maximum likelihood solution (assuming Normal errors), so this fitting process maximizes the log-likelihood.
  • AIC is the Akaike Information Criterion for the model. When comparing models fitted by maximum likelihood to the same outcome variable (using the same transformation, for example), the smaller the AIC, the better the fit.
  • BIC is the Bayes Information Criterion for the model. When comparing models fitted by maximum likelihood to the same outcome variable (using the same transformation, for example), the smaller the BIC, the better the fit. BIC often prefers models with fewer coefficients to estimate than does AIC.
    • AIC and BIC can be estimated using several different approaches in R, but we’ll need to use the same one across multiple models if we’re comparing the results, because the concepts are only defined up to a constant.
  • deviance is the fitted model’s deviance, a measure of lack of fit. It is a generalization of the residual sum of squares seen in the ANOVA table, and takes the same value in the case of a simple linear regression model fit with lm as we have here. For some generalized linear models, we’ll use this for hypothesis testing, just as the ANOVA table does in the linear model case.
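We can check that equivalence directly:

deviance(m1)        # 5066.7
sum(resid(m1)^2)    # the residual sum of squares: the same value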

14.10 Plotting Residuals vs. Fitted Values

To save the residuals and predicted (fitted) values from this simple regression model, we can use the resid and fitted commands, respectively, or we can use the augment function in the broom package to obtain a tidy data set containing these objects and others.

augment(m1) %>%
    ggplot(aes(x = .fitted, y = .resid)) +
    geom_point() +
    geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE, col = "red") +
    labs(title = "Residuals vs. Fitted values for Model m1")

We can also obtain a plot of residuals vs. fitted values for m1 using the following code from base R.

plot(m1, which = 1)

We hope in this plot to see a generally random scatter of points, perhaps looking like a “fuzzy football.” Since we only have seven possible dose values, we obtain only seven distinct predicted values, which explains the seven vertical strips of points in the plot. Here, the smooth red line shows at most a gentle bend, with no evidence of a strong curve or any other regular pattern in this residual plot.