Chapter 12 Logistic Regression and the resect data
12.1 The resect data
My source for these data was Riffenburgh (2006). The data describe 134 patients who had undergone resection of the tracheal carina (most often this is done to address tumors in the trachea), and the resect.csv data file contains the following variables:
- id = a patient ID #,
- age = the patient’s age at surgery,
- prior = prior tracheal surgery (1 = yes, 0 = no),
- resection = extent of the resection (in cm),
- intubated = whether intubation was required at the end of surgery (1 = yes, 0 = no), and
- died = the patient’s death status (1 = dead, 0 = alive).
miss_var_summary(resect)
# A tibble: 6 x 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 id 0 0
2 age 0 0
3 prior 0 0
4 resection 0 0
5 intubated 0 0
6 died 0 0
resect %>% count(died, prior)
# A tibble: 4 x 3
died prior n
<dbl> <dbl> <int>
1 0 0 89
2 0 1 28
3 1 0 11
4 1 1 6
resect %>% mosaic::inspect()
quantitative variables:
name class min Q1 median Q3 max mean sd n
...1 id numeric 1 34.25 67.5 100.75 134 67.5000000 38.8265373 134
...2 age numeric 8 36.00 51.0 61.00 80 47.8432836 15.7775202 134
...3 prior numeric 0 0.00 0.0 0.75 1 0.2537313 0.4367785 134
...4 resection numeric 1 2.00 2.5 4.00 6 2.9634328 1.2402123 134
...5 intubated numeric 0 0.00 0.0 0.00 1 0.1417910 0.3501447 134
...6 died numeric 0 0.00 0.0 0.00 1 0.1268657 0.3340713 134
missing
...1 0
...2 0
...3 0
...4 0
...5 0
...6 0
We have no missing data, and 17 of the 134 patients died. Our goal will be to understand the characteristics of the patients, and how they relate to the binary outcome of interest, death.
12.2 Running A Simple Logistic Regression Model
In the most common scenario, a logistic regression model is used to predict a binary outcome (which can take on the values 0 or 1). We will eventually fit a logistic regression model in two ways.
- Through the glm function in the stats package that ships with base R (similar to lm for linear regression)
- Through the lrm function available in the rms package (similar to ols for linear regression)
We’ll focus on the glm approach first, and save the lrm ideas for later in this Chapter.
12.2.1 Logistic Regression Can Be Harder than Linear Regression
- Logistic regression models are fitted using the method of maximum likelihood in glm, which requires multiple iterations until convergence is reached.
- Logistic regression models are harder to interpret (for most people) than linear regressions.
- Logistic regression models don’t have the same set of assumptions as linear models.
- Logistic regression outcomes (yes/no) carry much less information than quantitative outcomes. As a result, fitting a reasonable logistic regression requires more data than a linear model of similar size.
- The rule I learned in graduate school was that a logistic regression requires 100 observations to fit an intercept plus another 15 observations for each candidate predictor. That’s not terrible, but it’s a very large sample size.
- Frank Harrell recommends 96 observations plus a function of the number of candidate predictors (which depends on the amount of variation in the predictors, but 15 x the number of such predictors isn’t too bad if the signal to noise ratio is pretty good) just to get reasonable confidence intervals around your predictions.
- In a Twitter note, Frank suggests that 96 + 8 times the number of candidate parameters might be reasonable so long as the smallest cell of interest (combination of an outcome and a split of the covariates) contains 96 or more observations.
- Peduzzi et al. (1996) suggest that if we let \(\pi\) be the smaller of the proportions of “yes” or “no” cases in the population of interest, and k be the number of inputs under consideration, then \(N = 10k/\pi\) is the minimum number of cases to include, except that if N < 100 by this standard, you should increase it to 100, according to Long (1997).
- That suggests that if you have an outcome that happens 10% of the time, and you are running a model with 3 predictors, then you could get away with \((10 \times 3)/(0.10) = 300\) observations, but if your outcome happened 40% of the time, you could get away with only \((10 \times 3)/(0.40) = 75\) observations, which you’d round up to 100.
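As a rough illustration, the Peduzzi/Long rule of thumb can be written as a small helper. The function name min_cases_peduzzi is hypothetical (it is not from any package), and the rule itself is only a heuristic, so treat this as a sketch rather than a sample size calculation.

```r
# Hypothetical helper: minimum N under the 10k/pi rule,
# raised to 100 when the rule gives less (Long 1997)
min_cases_peduzzi <- function(k, pi_small) {
  stopifnot(k >= 1, pi_small > 0, pi_small <= 0.5)
  max(100, 10 * k / pi_small)
}

min_cases_peduzzi(k = 3, pi_small = 0.10)  # outcome occurs 10% of the time
min_cases_peduzzi(k = 3, pi_small = 0.40)  # outcome occurs 40% of the time
```

The two calls reproduce the two scenarios worked out in the last bullet.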
12.3 Logistic Regression using glm
We’ll begin by attempting to predict death based on the extent of the resection.
res_modA <- glm(died ~ resection, data = resect,
                family = binomial(link = "logit"))
res_modA
Call: glm(formula = died ~ resection, family = binomial(link = "logit"),
data = resect)
Coefficients:
(Intercept) resection
-4.4337 0.7417
Degrees of Freedom: 133 Total (i.e. Null); 132 Residual
Null Deviance: 101.9
Residual Deviance: 89.49 AIC: 93.49
Note that the logit link is the default approach with the binomial family, so we could also have used:
res_modA <- glm(died ~ resection, data = resect,
                family = "binomial")
which yields the same model.
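If you want to convince yourself that the two specifications agree, here is a minimal sketch on simulated data (not the resect data; the object names sim, fit_explicit and fit_default are made up for this illustration).

```r
# Simulated data: a binary outcome driven by one predictor
set.seed(2024)
sim <- data.frame(x = rnorm(150))
sim$y <- rbinom(150, size = 1, prob = plogis(-1 + 0.8 * sim$x))

fit_explicit <- glm(y ~ x, data = sim, family = binomial(link = "logit"))
fit_default  <- glm(y ~ x, data = sim, family = "binomial")

# The family component of the fit records which link was actually used
fit_default$family$link
all.equal(coef(fit_explicit), coef(fit_default))
```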
12.3.1 Interpreting the Coefficients of a Logistic Regression Model
Our model is:
\[ logit(died = 1) = log\left(\frac{Pr(died = 1)}{1 - Pr(died = 1)}\right) = \beta_0 + \beta_1 x = -4.4337 + 0.7417 \times resection \]
The predicted log odds of death for a subject with a resection of 4 cm is:
\[ log\left(\frac{Pr(died = 1)}{1 - Pr(died = 1)}\right) = -4.4337 + 0.7417 \times 4 = -1.467 \]
The predicted odds of death for a subject with a resection of 4 cm is thus:
\[ \frac{Pr(died = 1)}{1 - Pr(died = 1)} = e^{-4.4337 + 0.7417 \times 4} = e^{-1.467} = 0.2306 \]
Since the odds are less than 1, we should find that the probability of death is less than 1/2. With a little algebra, we see that the predicted probability of death for a subject with a resection of 4 cm is:
\[ Pr(died = 1) = \frac{e^{-4.4337 + 0.7417 \times 4}}{1 + e^{-4.4337 + 0.7417 \times 4}} = \frac{e^{-1.467}}{1 + e^{-1.467}} = \frac{0.2306}{1.2306} = 0.187 \]
In general, we can frame the model in terms of a statement about probabilities, like this:
\[ Pr(died = 1) = \frac{e^{\beta_0 + \beta_1 x}}{1 + {e^{\beta_0 + \beta_1 x}}} = \frac{e^{-4.4337 + 0.7417 \times resection}}{1 + e^{-4.4337 + 0.7417 \times resection}} \]
and so by substituting in values for resection, we can estimate the model’s fitted probabilities of death.
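The arithmetic above can be checked by hand in R. This sketch uses the rounded coefficients printed for res_modA rather than the fitted object, so results may differ from the model’s own fitted values in the fourth decimal place.

```r
# Coefficients copied (rounded) from the res_modA output
b0 <- -4.4337
b1 <-  0.7417

resection_cm <- c(2, 4, 6)
log_odds <- b0 + b1 * resection_cm
odds     <- exp(log_odds)
prob     <- plogis(log_odds)   # same as exp(log_odds) / (1 + exp(log_odds))

round(data.frame(resection_cm, log_odds, odds, prob), 4)
```

The row for a 4 cm resection should match the worked example: log odds near -1.467, odds near 0.23, and probability near 0.187.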
12.3.2 Using predict to describe the model’s fits
To obtain these fitted odds and probabilities in R, we can use the predict function.
- The default predictions are on the scale of the log odds. These predictions are also available through the type = "link" argument within the predict function for a generalized linear model like logistic regression.
- Here are the predicted log odds of death for a subject (Sally) with a 4 cm resection and a subject (Harry) who had a 5 cm resection.
predict(res_modA, newdata = tibble(resection = c(4,5)))
1 2
-1.4669912 -0.7253027
- We can also obtain predictions for each subject on the original response (here, probability) scale, backing out of the logit link.
predict(res_modA, newdata = tibble(resection = c(4, 5)),
type = "response")
1 2
0.1874004 0.3262264
So the predicted probability of death is 0.187 for Sally, the subject with a 4 cm resection, and 0.326 for Harry, the subject with a 5 cm resection.
12.3.3 Odds Ratio interpretation of Coefficients
Often, we will exponentiate the estimated slope coefficients of a logistic regression model to help us understand the impact of changing a predictor on the odds of our outcome.
exp(coef(res_modA))
(Intercept) resection
0.01186995 2.09947754
To interpret this finding, suppose we have two subjects, Harry and Sally. Harry had a resection that was 1 cm larger than Sally’s. This estimated coefficient suggests that the estimated odds of death for Harry are 2.099 times as large as the estimated odds of death for Sally. In general, the odds ratio comparing two subjects who differ by 1 cm on the resection length is 2.099.
To illustrate, again let’s assume that Harry’s resection was 5 cm, and Sally’s was 4 cm. Then we have:
\[ log\left(\frac{Pr(\text{Harry died})}{1 - Pr(\text{Harry died})}\right) = -4.4337 + 0.7417 \times 5 = -0.7253, \\ log\left(\frac{Pr(\text{Sally died})}{1 - Pr(\text{Sally died})}\right) = -4.4337 + 0.7417 \times 4 = -1.4670. \]
which implies that our estimated odds of death for Harry and Sally are:
\[ Odds(\text{Harry died}) = \frac{Pr(\text{Harry died})}{1 - Pr(\text{Harry died})} = e^{-4.4337 + 0.7417 \times 5} = e^{-0.7253} = 0.4842 \\ Odds(\text{Sally died}) = \frac{Pr(\text{Sally died})}{1 - Pr(\text{Sally died})} = e^{-4.4337 + 0.7417 \times 4} = e^{-1.4670} = 0.2306 \]
and so the odds ratio is:
\[ OR = \frac{Odds(\text{Harry died})}{Odds(\text{Sally died})} = \frac{0.4842}{0.2306} \approx 2.10 \]
Up to rounding in the intermediate odds, this matches the exponentiated slope, 2.099.
- If the odds ratio was 1, that would mean that Harry and Sally had the same estimated odds of death, and thus the same estimated probability of death, despite having different sizes of resections.
- Since the odds ratio is greater than 1, it means that Harry has a higher estimated odds of death than Sally, and thus that Harry has a higher estimated probability of death than Sally.
- If the odds ratio was less than 1, it would mean that Harry had a lower estimated odds of death than Sally, and thus that Harry had a lower estimated probability of death than Sally.
Remember that the odds ratio is a fraction describing two positive numbers (odds can only be non-negative) so that the smallest possible odds ratio is 0.
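One consequence worth seeing numerically: the odds ratio for a 1 cm difference is the same wherever we start, but the corresponding difference in probability is not. A small sketch, again using the rounded res_modA coefficients (the helper odds_at is made up for this illustration):

```r
b0 <- -4.4337; b1 <- 0.7417      # rounded estimates from res_modA
odds_at <- function(cm) exp(b0 + b1 * cm)

start_cm <- c(2, 4, 5)

# Odds ratio for a 1 cm larger resection, at three different starting sizes
or_by_start <- odds_at(start_cm + 1) / odds_at(start_cm)
or_by_start                      # each equals exp(b1)

# The change in probability for the same 1 cm step depends on the start
prob_diff <- plogis(b0 + b1 * (start_cm + 1)) - plogis(b0 + b1 * start_cm)
round(prob_diff, 3)
```

This is why the odds ratio, and not a single probability difference, is the usual one-number summary of a logistic regression slope.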
12.3.4 Interpreting the rest of the model output from glm
res_modA
Call: glm(formula = died ~ resection, family = "binomial", data = resect)
Coefficients:
(Intercept) resection
-4.4337 0.7417
Degrees of Freedom: 133 Total (i.e. Null); 132 Residual
Null Deviance: 101.9
Residual Deviance: 89.49 AIC: 93.49
In addition to specifying the logistic regression coefficients, we are also presented with information on degrees of freedom, deviance (null and residual) and AIC.
- The degrees of freedom indicate the sample size.
- Recall that we had n = 134 subjects in the data. The “Null” model includes only an intercept term (which uses 1 df) and we thus have n - 1 (here 133) degrees of freedom available for estimation.
- In our res_modA model, a logistic regression is fit including a single slope (resection) and an intercept term. Each uses up one degree of freedom to build an estimate, so we have n - 2 = 134 - 2 = 132 residual df remaining.
- The AIC or Akaike Information Criterion (lower values are better) is also provided. This is helpful if we’re comparing multiple models for the same outcome.
12.3.5 Deviance and Comparing Our Model to the Null Model
- The deviance (a measure of the model’s lack of fit) is available for both the null model (the model with only an intercept) and for our model (res_modA) predicting our outcome, mortality.
- The deviance test, though available in R (see below) isn’t really a test of whether the model works well. Instead, it assumes the model is true, and then tests to see if the coefficients are detectably different from zero. So it isn’t of much practical use.
- To compare the deviance statistics, we can subtract the residual deviance from the null deviance to describe the impact of our model on fit.
  - Null Deviance - Residual Deviance can be compared to a \(\chi^2\) distribution with Null DF - Residual DF degrees of freedom to obtain a global test of the in-sample predictive power of our model.
  - We can see this comparison more directly by running anova on our model:
anova(res_modA, test = "LRT")
Analysis of Deviance Table
Model: binomial, link: logit
Response: died
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 133 101.943
resection 1 12.45 132 89.493 0.0004179 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The test = "LRT" argument completes a deviance test and provides a p value, which just estimates the probability that a chi-square random variable with a single degree of freedom would be at least as large as the observed improvement in deviance, 12.45.
The p value for the deviance test here is about 0.0004. But, again, this isn’t a test of whether the model is any good - it assumes the model is true, and then tests some consequences.
- Specifically, it tests whether (if the model is TRUE) some of the model’s coefficients are non-zero.
- That’s not so practically useful, so I discourage you from performing global tests of a logistic regression model with a deviance test.
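For readers who want to see where the anova p value comes from, this sketch reproduces it from the printed deviances alone, using the chi-square upper tail.

```r
null_dev  <- 101.943   # null deviance, 133 df
resid_dev <-  89.493   # residual deviance for res_modA, 132 df

drop_in_deviance <- null_dev - resid_dev
p_value <- pchisq(drop_in_deviance, df = 133 - 132, lower.tail = FALSE)
c(drop = drop_in_deviance, p = p_value)
```

The result should agree with the Pr(>Chi) entry in the Analysis of Deviance table, up to rounding of the inputs.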
12.3.6 Using glance with a logistic regression model
We can use the glance function from the broom package to obtain the null and residual deviance and degrees of freedom. Note that the deviance for our model is related to the log likelihood by -2*logLik.
glance(res_modA)
# A tibble: 1 x 8
null.deviance df.null logLik AIC BIC deviance df.residual nobs
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 102. 133 -44.7 93.5 99.3 89.5 132 134
The glance result also provides the AIC, and the BIC (Bayesian Information Criterion), each of which is helpful in understanding comparisons between multiple models for the same outcome (with smaller values of either criterion indicating better models). The AIC is based on the deviance, but penalizes you for making the model more complicated. The BIC does the same sort of thing but with a different penalty.
Again we see that we have a null deviance of 101.94 on 133 degrees of freedom. Including the resection information in the model decreased the deviance to 89.49 on 132 degrees of freedom, so that’s a decrease of 12.45 points while using one degree of freedom, a statistically significant reduction in deviance.
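To make the two penalties concrete, here is a sketch that rebuilds the log likelihood, AIC and BIC from the residual deviance. It relies on the fact that for individual 0/1 outcomes the deviance equals -2 times the log likelihood; the variable names are just for illustration.

```r
deviance_A <- 89.493   # residual deviance of res_modA
k <- 2                 # estimated coefficients: intercept and slope
n <- 134               # number of observations

log_lik <- -deviance_A / 2          # deviance = -2 * logLik for 0/1 outcomes
aic <- -2 * log_lik + 2 * k         # penalty of 2 per coefficient
bic <- -2 * log_lik + log(n) * k    # penalty of log(n) per coefficient

round(c(logLik = log_lik, AIC = aic, BIC = bic), 1)
```

Because log(134) is about 4.9, the BIC charges more than the AIC for each added coefficient here.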
12.4 Interpreting the Model Summary
Let’s get a more detailed summary of our res_modA model, including 95% confidence intervals for the coefficients:
summary(res_modA)
Call:
glm(formula = died ~ resection, family = "binomial", data = resect)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.1844 -0.5435 -0.3823 -0.2663 2.4501
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.4337 0.8799 -5.039 4.67e-07 ***
resection 0.7417 0.2230 3.327 0.000879 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 101.943 on 133 degrees of freedom
Residual deviance: 89.493 on 132 degrees of freedom
AIC: 93.493
Number of Fisher Scoring iterations: 5
confint(res_modA, level = 0.95)
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -6.344472 -2.855856
resection 0.322898 1.208311
Some elements of this summary are very familiar from our work with linear models.
- We still have a five-number summary of residuals, although these are called deviance residuals.
- We have a table of coefficients with standard errors, and hypothesis tests, although these are Wald z-tests, rather than the t tests we saw in linear modeling.
- We have a summary of global fit in the comparison of null deviance and residual deviance, but without a formal p value. And we have the AIC, as discussed above.
- We also have some new items related to a dispersion parameter and to the number of Fisher Scoring Iterations.
Let’s walk through each of these elements.
12.4.1 Wald Z tests for Coefficients in a Logistic Regression
The coefficients output provides the estimated coefficients, and their standard errors, plus a Wald Z statistic, which is just the estimated coefficient divided by its standard error. This is compared to a standard Normal distribution to obtain the two-tailed p values summarized in the Pr(>|z|) column.
- The interesting result is resection, which has a Wald Z = 3.327, yielding a p value of 0.00088.
- The p value assesses whether the estimated coefficient of resection, 0.7417, is statistically detectably different from 0. If the coefficient (on the logit scale) for resection was truly 0, this would mean that:
  - the log odds of death did not change based on the resection size,
  - the odds of death were unchanged based on the resection size (the odds ratio would be 1), and
  - the probability of death was unchanged based on the resection size.
In our case, we have a statistically detectable change in the log odds of died associated with changes in resection, according to this p value. We conclude that larger resection size is associated with higher death rates (death rates are generally higher for people with larger resections).
12.4.2 Confidence Intervals for the Coefficients
As in linear regression, we can obtain 95% confidence intervals (to get other levels, change the level parameter in confint) for the intercept and slope coefficients.
Here, we see, for example, that the coefficient of resection has a point estimate of 0.7417, and a confidence interval of (0.3229, 1.208). Since this is on the logit scale, it’s not that interpretable, but we will often exponentiate the model and its confidence interval to obtain a more interpretable result on the odds ratio scale.
tidy(res_modA, exponentiate = TRUE, conf.int = TRUE) %>%
select(term, estimate, conf.low, conf.high)
# A tibble: 2 x 4
term estimate conf.low conf.high
<chr> <dbl> <dbl> <dbl>
1 (Intercept) 0.0119 0.00176 0.0575
2 resection 2.10 1.38 3.35
From this output, we estimate that the odds ratio for death associated with a 1 cm increase in resection size is 2.099, with a 95% CI of (1.38, 3.35).
- If the odds ratio was 1, it would indicate that the odds of death did not change based on the change in resection size.
- Here, it’s clear that the estimated odds of death will be larger (odds ratio > 1) for subjects with larger resection sizes. Larger odds of death also indicate larger probabilities of death. Since the entire 95% confidence interval lies above 1, we conclude that increases in resection size are associated with statistically detectable increases in the odds of death.
- If the odds ratio was less than 1 (remember that it cannot be less than 0) that would mean that subjects with larger resection sizes were associated with smaller estimated odds of death.
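The intervals reported by confint for a glm fit come from profiling the likelihood (hence the "Waiting for profiling to be done..." message), so they will not exactly match the simpler Wald interval of estimate plus or minus 1.96 standard errors. As a point of comparison, here is a sketch of the Wald version on the odds ratio scale, built from the rounded summary values.

```r
est <- 0.7417   # slope for resection (logit scale)
se  <- 0.2230   # its standard error

wald_ci_logit <- est + c(-1, 1) * qnorm(0.975) * se
wald_ci_or    <- exp(wald_ci_logit)   # back-transform to the odds ratio scale

round(wald_ci_or, 2)
```

The Wald interval is a little narrower at the upper end than the profile-based (1.38, 3.35), but it leads to the same conclusion: the whole interval sits above 1.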
12.4.3 Deviance Residuals
In logistic regression, it’s certainly a good idea to check to see how well the model fits the data. However, there are a few different types of residuals. The residuals presented here by default are called deviance residuals. Other types of residuals are available for generalized linear models, such as Pearson residuals, working residuals, and response residuals. Logistic regression model diagnostics often make use of multiple types of residuals.
The squared deviance residuals for the individual subjects sum up to the deviance statistic for the model, and describe the contribution of each point to the model likelihood function.
The deviance residual, \(d_i\), for the ith observation in a model predicting \(y_i\) (a binary variable), with the estimate being \(\hat{\pi}_i\) is:
\[ d_i = s_i \sqrt{-2 [y_i log \hat{\pi}_i + (1 - y_i) log(1 - \hat{\pi}_i)]}, \]
where \(s_i = 1\) if \(y_i = 1\) and \(s_i = -1\) if \(y_i = 0\).
Again, the sum of the squared deviance residuals is the deviance.
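Here is a minimal sketch of that formula as an R function (dev_resid is a made-up name), checked two ways: against two fitted probabilities that appear in the augment output for res_modA later in the chapter, and against R’s own deviance residuals on simulated data.

```r
dev_resid <- function(y, p_hat) {
  s <- ifelse(y == 1, 1, -1)
  s * sqrt(-2 * (y * log(p_hat) + (1 - y) * log(1 - p_hat)))
}

# A survivor with fitted probability 0.0705, and a death with 0.187
round(dev_resid(y = c(0, 1), p_hat = c(0.0705, 0.187)), 3)

# Simulated data: squared deviance residuals add up to the model deviance
set.seed(432)
sim <- data.frame(x = rnorm(100))
sim$y <- rbinom(100, 1, plogis(-0.5 + sim$x))
fit <- glm(y ~ x, data = sim, family = binomial)

d <- dev_resid(sim$y, fitted(fit))
all.equal(sum(d^2), deviance(fit))
```

Note that it is the squares that add up to the deviance; the raw residuals carry signs and partly cancel.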
12.4.4 Dispersion Parameter
The dispersion parameter is taken to be 1 for glm fit using either the binomial or Poisson families. For other sorts of generalized linear models, the dispersion parameter will be of some importance in estimating standard errors sensibly.
12.4.5 Fisher Scoring iterations
The solution of a logistic regression model involves maximizing a likelihood function. Fisher’s scoring algorithm in our res_modA needed five iterations to perform the logistic regression fit. All that this tells you is that the model converged, and didn’t require a lot of time to do so.
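The iteration count and the convergence flag are stored on the fitted object, so you can check them directly rather than reading them off the summary. A small sketch on simulated data (not the resect data):

```r
set.seed(12)
sim <- data.frame(x = rnorm(120))
sim$y <- rbinom(120, 1, plogis(-1 + 0.7 * sim$x))
fit <- glm(y ~ x, data = sim, family = binomial)

fit$converged         # TRUE if the iterations met the convergence tolerance
fit$iter              # number of Fisher scoring iterations used
glm.control()$maxit   # default cap on the number of iterations
```

A fit that stops at the iteration cap, or reports converged as FALSE, deserves a closer look before its coefficients are interpreted.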
12.5 Plotting a Simple Logistic Regression Model
Let’s plot the logistic regression model res_modA for died using the extent of the resection in terms of probabilities. We can use either of two different approaches:
- we can plot the fitted values from our specific model against the original data, using the augment function from the broom package, or
- we can create a smooth function called binomial_smooth that plots a simple logistic model in an analogous way to geom_smooth(method = "lm") for a simple linear regression.
12.5.1 Using augment to capture the fitted probabilities
res_A_aug <- augment(res_modA, resect,
                     type.predict = "response")
head(res_A_aug)
# A tibble: 6 x 12
id age prior resection intubated died .fitted .resid .std.resid .hat
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 34 1 2.5 0 0 0.0705 -0.382 -0.384 0.0100
2 2 57 0 5 0 0 0.326 -0.889 -0.904 0.0337
3 3 60 1 4 1 1 0.187 1.83 1.84 0.0120
4 4 62 1 4.2 0 0 0.211 -0.689 -0.693 0.0143
5 5 28 0 6 1 1 0.504 1.17 1.22 0.0818
6 6 52 0 3 0 0 0.0990 -0.457 -0.459 0.00922
# ... with 2 more variables: .sigma <dbl>, .cooksd <dbl>
This approach augments the resect data set with fitted, residual and other summaries of each observation’s impact on the fit, using the “response” type of prediction, which yields the fitted probabilities in the .fitted column.
12.5.2 Plotting a Logistic Regression Model’s Fitted Values
ggplot(res_A_aug, aes(x = resection, y = died)) +
geom_jitter(height = 0.05) +
geom_line(aes(x = resection, y = .fitted),
col = "blue") +
labs(title = "Logistic Regression from Model res_modA")
12.5.3 Plotting a Simple Logistic Model using binomial_smooth
binomial_smooth <- function(...) {
    geom_smooth(method = "glm",
                method.args = list(family = "binomial"), ...)
}
ggplot(resect, aes(x = resection, y = died)) +
geom_jitter(height = 0.05) +
binomial_smooth() + ## ...smooth(se=FALSE) to leave out interval
labs(title = "Logistic Regression from Model A") +
theme_bw()
`geom_smooth()` using formula 'y ~ x'
As expected, we see an increase in the model probability of death as the extent of the resection grows larger.
12.6 How well does Model A classify subjects?
A natural question to ask is how well does our model classify patients in terms of likelihood of death.
We could specify a particular rule, for example: if the predicted probability of death is 0.5 or greater, then predict “Died.”
res_A_aug$rule.5 <- ifelse(res_A_aug$.fitted >= 0.5,
                           "Predict Died", "Predict Alive")
table(res_A_aug$rule.5, res_A_aug$died)
0 1
Predict Alive 114 16
Predict Died 3 1
And perhaps build the linked table of row probabilities which tells us, for example, that 87.69% of the patients predicted by the model to be alive actually did survive.
round(100*prop.table(
table(res_A_aug$rule.5, res_A_aug$died), 1), 2)
0 1
Predict Alive 87.69 12.31
Predict Died 75.00 25.00
Or the table of column probabilities which tell us, for example, that 97.44% of those who actually survived were predicted by the model to be alive.
round(100*prop.table(
table(res_A_aug$rule.5, res_A_aug$died), 2), 2)
0 1
Predict Alive 97.44 94.12
Predict Died 2.56 5.88
We’ll discuss various measures of concordance derived from this sort of classification later.
12.7 The Confusion Matrix
Let’s build this misclassification table in standard epidemiological format.
## the %$% exposition pipe comes from the magrittr package
confuseA_small <-
    res_A_aug %>%
    mutate(death_predicted = factor(.fitted >= 0.5),
           death_actual = factor(died == "1"),
           death_predicted = fct_relevel(death_predicted, "TRUE"),
           death_actual = fct_relevel(death_actual, "TRUE")) %$%
    table(death_predicted, death_actual)
confuseA_small
death_actual
death_predicted TRUE FALSE
TRUE 1 3
FALSE 16 114
In total, we have 134 observations.
- 115 correct predictions, or 85.8% accuracy
- 17 subjects who died, or 12.7% prevalence of death
- 4 subjects who were predicted to die, or 3.0% detection prevalence.
The sensitivity (also called recall) here is 1 / (1 + 16) = 5.9%.
- 5.9% of the subjects who actually died were predicted to die by the model.
The specificity here is 114 / (114 + 3) = 97.4%.
- 97.4% of the subjects who actually survived were predicted to survive by the model.
The positive predictive value (PPV: also called precision) is 1 / (1 + 3) = 25%
- Our predictions of death were correct 25% of the time.
The negative predictive value (NPV) is 114 / (114 + 16) = 87.7%
- Our predictions of survival were correct 87.7% of the time.
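All of these summaries come from the same four counts, so it can help to compute them together. A minimal sketch, with the counts typed in from confuseA_small (the names tp, fp, fn and tn are just conventional labels):

```r
tp <- 1     # predicted to die, and died
fp <- 3     # predicted to die, but survived
fn <- 16    # predicted to survive, but died
tn <- 114   # predicted to survive, and survived
n  <- tp + fp + fn + tn

metrics <- c(
  accuracy    = (tp + tn) / n,
  prevalence  = (tp + fn) / n,
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp),
  ppv         = tp / (tp + fp),
  npv         = tn / (tn + fn)
)
round(100 * metrics, 1)   # as percentages
```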
12.8 Using the confusionMatrix tool from the caret package
This provides a more detailed summary of the classification results from our logistic regression model.
res_A_aug %$%
    confusionMatrix(
        data = factor(.fitted >= 0.5),
        reference = factor(died == 1),
        positive = "TRUE"
    )
Confusion Matrix and Statistics
Reference
Prediction FALSE TRUE
FALSE 114 16
TRUE 3 1
Accuracy : 0.8582
95% CI : (0.7875, 0.9124)
No Information Rate : 0.8731
P-Value [Acc > NIR] : 0.747802
Kappa : 0.0493
Mcnemar's Test P-Value : 0.005905
Sensitivity : 0.058824
Specificity : 0.974359
Pos Pred Value : 0.250000
Neg Pred Value : 0.876923
Prevalence : 0.126866
Detection Rate : 0.007463
Detection Prevalence : 0.029851
Balanced Accuracy : 0.516591
'Positive' Class : TRUE
- The No Information Rate or NIR is just the percentage of correct predictions we’d get if we just predicted the more common classification (not dead) for every subject.
- Kappa is a correlation statistic ranging from -1 to +1. It measures the inter-rater reliability of our predictions and the true classifications, in this context. Complete agreement would be +1, and complete disagreement would be -1.
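To see how the No Information Rate and Kappa relate to the table itself, here is a sketch that recomputes both from the four cell counts, using the usual Cohen’s kappa formula (observed agreement compared with the agreement expected by chance from the margins). The object names are made up for this illustration.

```r
# Same layout as the confusionMatrix output: rows = prediction, columns = reference
tab <- matrix(c(114, 3, 16, 1), nrow = 2,
              dimnames = list(Prediction = c("FALSE", "TRUE"),
                              Reference  = c("FALSE", "TRUE")))
n <- sum(tab)

p_observed <- sum(diag(tab)) / n                       # accuracy
p_expected <- sum(rowSums(tab) * colSums(tab)) / n^2   # chance agreement
cohen_kappa <- (p_observed - p_expected) / (1 - p_expected)

nir <- max(colSums(tab)) / n   # always predict the more common class

round(c(accuracy = p_observed, NIR = nir, kappa = cohen_kappa), 4)
```

Here the accuracy falls below the NIR, and Kappa is close to 0, so the 0.5 cutoff rule adds very little beyond predicting survival for everyone.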
12.9 Receiver Operating Characteristic Curve Analysis
One way to assess the predictive accuracy within the model development sample in a logistic regression is to consider an analysis based on the receiver operating characteristic (ROC) curve. ROC curves are commonly used in assessing diagnoses in medical settings, and in signal detection applications.
The accuracy of a “test” can be evaluated by considering two types of errors: false positives and false negatives.
In our res_modA model, we use resection size to predict whether the patient died. Suppose we established a value R, so that if the resection size was larger than R cm, we would predict that the patient died, and otherwise we would predict that the patient did not die.
A good outcome of our model’s “test,” then, would be when the resection size is larger than R for a patient who actually died. Another good outcome would be when the resection size is smaller than R for a patient who survived.
But we can make errors, too.
- A false positive error in this setting would occur when the resection size is larger than R (so we predict the patient dies) but in fact the patient does not die.
- A false negative error in this case would occur when the resection size is smaller than R (so we predict the patient survives) but in fact the patient dies.
Formally, the true positive fraction (TPF) for a specific resection cutoff \(R\), is the probability of a positive test (a prediction that the patient will die) among the people who have the outcome died = 1 (those who actually die).
\[ TPF(R) = Pr(resection > R | subject died) \]
Similarly, the false positive fraction (FPF) for a specific cutoff \(R\) is the probability of a positive test (prediction that the patient will die) among the people with died = 0 (those who don’t actually die)
\[ FPF(R) = Pr(resection > R | subject did not die) \]
The True Positive Rate is referred to as the sensitivity of a diagnostic test, and the True Negative rate (1 - the False Positive rate) is referred to as the specificity of a diagnostic test.
Since the cutoff \(R\) is not fixed in advance, we can plot the value of TPF (on the y axis) against FPF (on the x axis) for all possible values of \(R\), and this is what the ROC curve is. Others refer to the Sensitivity on the Y axis, and 1-Specificity on the X axis, and this is the same idea.
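To make the definitions of TPF(R) and FPF(R) concrete, here is a base R sketch that traces out the ROC points by hand. The data are simulated, and the names score and died_sim are hypothetical stand-ins for a continuous predictor and a 0/1 outcome.

```r
set.seed(7)
score    <- rnorm(60)                     # hypothetical continuous predictor
died_sim <- rbinom(60, 1, plogis(score))  # simulated 0/1 outcome

# Try every observed value as the cutoff R, from largest to smallest,
# ending with -Inf so that every subject is predicted to die
cutoffs <- c(sort(score, decreasing = TRUE), -Inf)

tpf <- sapply(cutoffs, function(r) mean(score[died_sim == 1] > r))
fpf <- sapply(cutoffs, function(r) mean(score[died_sim == 0] > r))

head(data.frame(cutoff = cutoffs, fpf = fpf, tpf = tpf))
```

Plotting tpf against fpf gives the ROC curve, which starts at (0, 0) for the strictest cutoff and climbs to (1, 1) as the cutoff is relaxed.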
Before we get too far into the weeds, let me show you some simple situations so you can understand what you might learn from the ROC curve. The web page http://blog.yhat.com/posts/roc-curves.html provides source materials.
12.9.1 Interpreting the Area under the ROC curve
The AUC or Area under the ROC curve is the amount of space underneath the ROC curve. Often referred to as the c statistic, the AUC represents the quality of your TPR and FPR overall in a single number. The C statistic ranges from 0 to 1, with C = 0.5 for a prediction that is no better than random guessing, and C = 1 for a perfect prediction model.
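Another way to read the C statistic is as a concordance probability: the chance that a randomly chosen subject with the outcome has a higher predicted value than a randomly chosen subject without it. This sketch, on simulated data with made-up names, computes it from all pairs and then checks it against the equivalent rank-sum formula.

```r
set.seed(7)
score    <- rnorm(60)                     # hypothetical predicted values
died_sim <- rbinom(60, 1, plogis(score))  # simulated 0/1 outcome

# Compare every (died, survived) pair; a tie would count as one half
diffs  <- outer(score[died_sim == 1], score[died_sim == 0], FUN = "-")
c_stat <- mean((diffs > 0) + 0.5 * (diffs == 0))

# Same quantity from the rank-sum (Wilcoxon / Mann-Whitney) formula
n1 <- sum(died_sim == 1)
n0 <- sum(died_sim == 0)
c_rank <- (sum(rank(score)[died_sim == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

c(pairwise = c_stat, rank_based = c_rank)
```

Under this reading, C = 0.5 means the model orders such pairs correctly only half the time, which is what coin-flipping would achieve.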
Next, I’ll build a simulation to demonstrate several possible ROC curves in the sections that follow.
set.seed(432999)
## tibble() replaces the deprecated data_frame()
sim.temp <- tibble(x = rnorm(n = 200),
                   prob = exp(x)/(1 + exp(x)),
                   y = as.numeric(1 * runif(200) < prob))
sim.temp <- sim.temp %>%
    mutate(p_guess = 1,
           p_perfect = y,
           p_bad = exp(-2*x) / (1 + exp(-2*x)),
           p_ok = prob + (1-y)*runif(1, 0, 0.05),
           p_good = prob + y*runif(1, 0, 0.27))
12.9.1.1 What if we are guessing?
If we’re guessing completely at random, then the model should correctly classify a subject (as died or not died) about 50% of the time, so the TPR and FPR will be equal. This yields a diagonal line in the ROC curve, and an area under the curve (C statistic) of 0.5.
There are several ways to do this on the web, but I’ll show this one, which has some bizarre code, but that’s a function of using a package called ROCR to do the work. It comes from this link
library(ROCR)  ## provides prediction() and performance()
pred_guess <- prediction(sim.temp$p_guess, sim.temp$y)
perf_guess <- performance(pred_guess, measure = "tpr", x.measure = "fpr")
auc_guess <- performance(pred_guess, measure="auc")
auc_guess <- round(auc_guess@y.values[[1]],3)
roc_guess <- data.frame(fpr=unlist(perf_guess@x.values),
                        tpr=unlist(perf_guess@y.values),
                        model="GLM")
ggplot(roc_guess, aes(x=fpr, ymin=0, ymax=tpr)) +
geom_ribbon(alpha=0.2, fill = "blue") +
geom_line(aes(y=tpr), col = "blue") +
labs(title = paste0("Guessing: ROC Curve w/ AUC=", auc_guess)) +
theme_bw()
12.9.1.2 What if we classify things perfectly?
If we’re classifying subjects perfectly, then we have a TPR of 1 and an FPR of 0. That yields an ROC curve that looks like the upper and left edges of a box. If our model correctly classifies a subject (as died or not died) 100% of the time, the area under the curve (c statistic) will be 1.0. We’ll add in the diagonal line here (in a dashed black line) to show how this model compares to random guessing.
pred_perf <- prediction(sim.temp$p_perfect, sim.temp$y)
perf_perf <- performance(pred_perf, measure = "tpr", x.measure = "fpr")
auc_perf <- performance(pred_perf, measure="auc")
auc_perf <- round(auc_perf@y.values[[1]],3)
roc_perf <- data.frame(fpr=unlist(perf_perf@x.values),
                       tpr=unlist(perf_perf@y.values),
                       model="GLM")
ggplot(roc_perf, aes(x=fpr, ymin=0, ymax=tpr)) +
geom_ribbon(alpha=0.2, fill = "blue") +
geom_line(aes(y=tpr), col = "blue") +
geom_abline(intercept = 0, slope = 1, lty = "dashed") +
labs(title = paste0("Perfect Prediction: ROC Curve w/ AUC=", auc_perf)) +
theme_bw()
12.9.1.3 What does “worse than guessing” look like?
A bad classifier will appear below and to the right of the diagonal line we’d see if we were completely guessing. Such a model will have a c statistic below 0.5, and will be valueless.
pred_bad <- prediction(sim.temp$p_bad, sim.temp$y)
perf_bad <- performance(pred_bad, measure = "tpr", x.measure = "fpr")
auc_bad <- performance(pred_bad, measure="auc")
auc_bad <- round(auc_bad@y.values[[1]],3)
roc_bad <- data.frame(fpr=unlist(perf_bad@x.values),
                      tpr=unlist(perf_bad@y.values),
                      model="GLM")
ggplot(roc_bad, aes(x=fpr, ymin=0, ymax=tpr)) +
geom_ribbon(alpha=0.2, fill = "blue") +
geom_line(aes(y=tpr), col = "blue") +
geom_abline(intercept = 0, slope = 1, lty = "dashed") +
labs(title = paste0("A Bad Model: ROC Curve w/ AUC=", auc_bad)) +
theme_bw()
12.9.1.4 What does “better than guessing” look like?
An “OK” classifier will appear above and to the left of the diagonal line we’d see if we were completely guessing. Such a model will have a c statistic above 0.5, and might have some value. The plot below shows a fairly poor model, but at least it’s better than guessing.
pred_ok <- prediction(sim.temp$p_ok, sim.temp$y)
perf_ok <- performance(pred_ok, measure = "tpr", x.measure = "fpr")
auc_ok <- performance(pred_ok, measure="auc")
auc_ok <- round(auc_ok@y.values[[1]],3)
roc_ok <- data.frame(fpr=unlist(perf_ok@x.values),
                     tpr=unlist(perf_ok@y.values),
                     model="GLM")
ggplot(roc_ok, aes(x=fpr, ymin=0, ymax=tpr)) +
geom_ribbon(alpha=0.2, fill = "blue") +
geom_line(aes(y=tpr), col = "blue") +
geom_abline(intercept = 0, slope = 1, lty = "dashed") +
labs(title = paste0("A Mediocre Model: ROC Curve w/ AUC=", auc_ok)) +
theme_bw()
Sometimes people grasp for a rough guide as to the accuracy of a model’s predictions based on the area under the ROC curve. A common thought is to assess the C statistic much like you would a class grade.
C statistic | Interpretation |
---|---|
0.90 to 1.00 | model does an excellent job at discriminating “yes” from “no” (A) |
0.80 to 0.90 | model does a good job (B) |
0.70 to 0.80 | model does a fair job (C) |
0.60 to 0.70 | model does a poor job (D) |
0.50 to 0.60 | model fails (F) |
below 0.50 | model is worse than random guessing |
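The grading scale in the table can be written as a small lookup. This is a minimal sketch in base R, not part of the original analysis: the function name `grade_c_statistic` is hypothetical, and giving a boundary value such as exactly 0.80 the higher grade is an assumption, since the table leaves the boundaries ambiguous.

```r
# Hypothetical helper: map a C statistic to the rough "class grade" in the table.
# Assumption: a boundary value (e.g. exactly 0.80) receives the higher grade.
grade_c_statistic <- function(c_stat) {
  cut(c_stat,
      breaks = c(0, 0.5, 0.6, 0.7, 0.8, 0.9, 1),
      labels = c("worse than guessing", "F (fails)", "D (poor)",
                 "C (fair)", "B (good)", "A (excellent)"),
      right = FALSE,          # intervals are closed on the left, e.g. [0.7, 0.8)
      include.lowest = TRUE)  # so that C = 1 still falls in the top interval
}

# Three illustrative C statistics
grades <- grade_c_statistic(c(0.45, 0.771, 0.86))
as.character(grades)
```

A rough guide like this is only a reading aid; the shape of the ROC curve still matters.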
12.9.1.5 What does “pretty good” look like?
A strong and good classifier will appear above and to the left of the diagonal line we’d see if we were completely guessing, often with a nice curve that is continually increasing and appears to be pulled up towards the top left. Such a model will have a c statistic well above 0.5, but not as large as 1. The plot below shows a stronger model, which appears substantially better than guessing.
pred_good <- prediction(sim.temp$p_good, sim.temp$y)
perf_good <- performance(pred_good, measure = "tpr", x.measure = "fpr")
auc_good <- performance(pred_good, measure="auc")
auc_good <- round(auc_good@y.values[[1]],3)
roc_good <- data.frame(fpr=unlist(perf_good@x.values),
                       tpr=unlist(perf_good@y.values),
                       model="GLM")
ggplot(roc_good, aes(x=fpr, ymin=0, ymax=tpr)) +
geom_ribbon(alpha=0.2, fill = "blue") +
geom_line(aes(y=tpr), col = "blue") +
geom_abline(intercept = 0, slope = 1, lty = "dashed") +
labs(title = paste0("A Pretty Good Model: ROC Curve w/ AUC=", auc_good)) +
theme_bw()
12.10 The ROC Plot for res_modA
Let me show you the ROC curve for our res_modA model.
## requires ROCR package
prob <- predict(res_modA, resect, type="response")
pred <- prediction(prob, resect$died)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
auc <- performance(pred, measure="auc")
auc <- round(auc@y.values[[1]],3)
roc.data <- data.frame(fpr=unlist(perf@x.values),
                       tpr=unlist(perf@y.values),
                       model="GLM")
ggplot(roc.data, aes(x=fpr, ymin=0, ymax=tpr)) +
geom_ribbon(alpha=0.2, fill = "blue") +
geom_line(aes(y=tpr), col = "blue") +
geom_abline(intercept = 0, slope = 1, lty = "dashed") +
labs(title = paste0("ROC Curve w/ AUC=", auc)) +
theme_bw()
Based on the C statistic (AUC = 0.771) this would rank somewhere near the high end of a “fair” predictive model by this standard, not quite to the level of a “good” model.
12.10.1 Another way to plot the ROC Curve
If we’ve loaded the pROC package, we can also use the following (admittedly simpler) approach to plot the ROC curve, without ggplot2, and to obtain the C statistic, and a 95% confidence interval around that C statistic.
## requires pROC package
roc.modA <-
  roc(resect$died ~ predict(res_modA, type="response"),
      ci = TRUE)
roc.modA
Call:
roc.formula(formula = resect$died ~ predict(res_modA, type = "response"), ci = TRUE)
Data: predict(res_modA, type = "response") in 117 controls (resect$died 0) < 17 cases (resect$died 1).
Area under the curve: 0.7707
95% CI: 0.67-0.8715 (DeLong)
plot(roc.modA)
12.11 Assessing Residual Plots from Model A
Residuals are certainly less informative for logistic regression than they are for linear regression: not only do yes/no outcomes inherently contain less information than continuous ones, but the fact that the adjusted response depends on the fit hampers our ability to use residuals as external checks on the model.
This is mitigated to some extent, however, by the fact that we are also making fewer distributional assumptions in logistic regression, so there is no need to inspect residuals for, say, skewness or heteroskedasticity.
- Patrick Breheny, University of Kentucky, Slides on GLM Residuals and Diagnostics
The usual residual plots are available in R for a logistic regression model, but most of them are irrelevant in the logistic regression setting. The residuals shouldn’t follow a standard Normal distribution, and they will not show constant variance over the range of the predictor variables, so plots looking into those issues aren’t helpful.
The only plot from the standard set that we’ll look at in many settings is plot 5, which helps us assess influence (via Cook’s distance contours), and a measure related to leverage (how unusual an observation is in terms of the predictors) and standardized Pearson residuals.
plot(res_modA, which = 5)
In this case, I don’t see any highly influential points, as no points fall outside of the Cook’s distance (0.5 or 1) contours.
12.12 Model B: A “Kitchen Sink” Logistic Regression Model
res_modB <- glm(died ~ resection + age + prior + intubated,
                data = resect, family = binomial)
res_modB
Call: glm(formula = died ~ resection + age + prior + intubated, family = binomial,
data = resect)
Coefficients:
(Intercept) resection age prior intubated
-5.152886 0.612211 0.001173 0.814691 2.810797
Degrees of Freedom: 133 Total (i.e. Null); 129 Residual
Null Deviance: 101.9
Residual Deviance: 67.36 AIC: 77.36
12.12.1 Comparing Model A to Model B
anova(res_modA, res_modB)
Analysis of Deviance Table
Model 1: died ~ resection
Model 2: died ~ resection + age + prior + intubated
Resid. Df Resid. Dev Df Deviance
1 132 89.493
2 129 67.359 3 22.134
The addition of age, prior and intubated reduces the lack of fit by 22.134 points, at a cost of 3 degrees of freedom.
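To judge whether a deviance drop of this size is large relative to the degrees of freedom spent, we can compare it to a \(\chi^2\) distribution. The sketch below simply re-uses the numbers printed in the analysis of deviance table; the variable names are illustrative. Asking `anova` for the test directly, for instance with `test = "Chisq"`, should report the same comparison.

```r
# Drop in deviance from Model A to Model B, taken from the anova() table
dev_drop <- 89.493 - 67.359   # 22.134
df_drop  <- 132 - 129         # 3 additional coefficients in Model B

# Upper-tail probability of a chi-square distribution with 3 df
p_value <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)
p_value
```

A very small p value here indicates that the three added predictors, taken together, improve the fit by more than we would expect from chance alone.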
glance(res_modA)
# A tibble: 1 x 8
null.deviance df.null logLik AIC BIC deviance df.residual nobs
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 102. 133 -44.7 93.5 99.3 89.5 132 134
glance(res_modB)
# A tibble: 1 x 8
null.deviance df.null logLik AIC BIC deviance df.residual nobs
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 102. 133 -33.7 77.4 91.8 67.4 129 134
By either AIC or BIC, the larger model (res_modB) looks more effective.
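The AIC and BIC values reported by `glance` can be rebuilt from the residual deviance and the number of estimated coefficients. This is a sketch that assumes ungrouped 0/1 outcomes, where the deviance equals minus twice the log likelihood; the helper names `aic_from_deviance` and `bic_from_deviance` are hypothetical.

```r
n <- 134  # number of patients

# Residual deviance and number of coefficients (including the intercept)
dev_A <- 89.493; k_A <- 2   # died ~ resection
dev_B <- 67.359; k_B <- 5   # died ~ resection + age + prior + intubated

# Assumption: deviance = -2 * logLik, which holds for ungrouped binary data
aic_from_deviance <- function(dev, k) dev + 2 * k
bic_from_deviance <- function(dev, k, n) dev + k * log(n)

aic_values <- c(A = aic_from_deviance(dev_A, k_A), B = aic_from_deviance(dev_B, k_B))
bic_values <- c(A = bic_from_deviance(dev_A, k_A, n), B = bic_from_deviance(dev_B, k_B, n))
round(rbind(AIC = aic_values, BIC = bic_values), 1)
```

BIC charges log(134), or about 4.9, per coefficient instead of 2, which is why the gap between the two models is narrower on the BIC scale.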
12.12.2 Interpreting Model B
summary(res_modB)
Call:
glm(formula = died ~ resection + age + prior + intubated, family = binomial,
data = resect)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7831 -0.3741 -0.2386 -0.2014 2.5228
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.152886 1.469453 -3.507 0.000454 ***
resection 0.612211 0.282807 2.165 0.030406 *
age 0.001173 0.020646 0.057 0.954700
prior 0.814691 0.704785 1.156 0.247705
intubated 2.810797 0.658395 4.269 1.96e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 101.943 on 133 degrees of freedom
Residual deviance: 67.359 on 129 degrees of freedom
AIC: 77.359
Number of Fisher Scoring iterations: 6
It appears that the intubated predictor adds significant value to the model, by the Wald test.
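Each Wald test in the coefficient table is just the estimate divided by its standard error, compared to a standard Normal distribution. The sketch below recomputes the z statistics and two-sided p values from the printed estimates; the vector names are illustrative.

```r
# Estimates and standard errors copied from summary(res_modB)
est <- c(resection = 0.612211, age = 0.001173,
         prior = 0.814691, intubated = 2.810797)
se  <- c(resection = 0.282807, age = 0.020646,
         prior = 0.704785, intubated = 0.658395)

z <- est / se                 # Wald z statistics
p <- 2 * pnorm(-abs(z))       # two-sided p values from the standard Normal

round(cbind(z, p), 4)
```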
Let’s focus on the impact of these variables through odds ratios.
tidy(res_modB, exponentiate = TRUE, conf.int = TRUE) %>%
select(term, estimate, conf.low, conf.high)
# A tibble: 5 x 4
term estimate conf.low conf.high
<chr> <dbl> <dbl> <dbl>
1 (Intercept) 0.00578 0.000241 0.0837
2 resection 1.84 1.08 3.35
3 age 1.00 0.962 1.04
4 prior 2.26 0.549 9.17
5 intubated 16.6 4.75 64.6
At a 5% significance level, we might conclude that:
- larger sized resections are associated with a meaningful rise (est OR: 1.84, 95% CI 1.08, 3.35) in the odds of death, holding all other predictors constant,
- the need for intubation at the end of surgery is associated with a substantial rise (est OR: 16.6, 95% CI 4.75, 64.6) in the odds of death, holding all other predictors constant, but that
- older age as well as having a prior tracheal surgery appears to be associated with an increase in death risk, but not to an extent that we can declare statistically significant.
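The odds ratios are the exponentiated coefficients. The sketch below reproduces the point estimates by hand and adds simple Wald intervals. The Wald limits are an approximation and will not match the interval columns above exactly, since (as far as I know) those come from profiling the likelihood. Variable names are illustrative.

```r
# Estimates and standard errors copied from summary(res_modB)
est <- c(resection = 0.612211, age = 0.001173,
         prior = 0.814691, intubated = 2.810797)
se  <- c(resection = 0.282807, age = 0.020646,
         prior = 0.704785, intubated = 0.658395)

odds_ratio <- exp(est)

# Approximate 95% Wald limits, built on the log odds scale and then exponentiated
z_crit  <- qnorm(0.975)
wald_lo <- exp(est - z_crit * se)
wald_hi <- exp(est + z_crit * se)

round(cbind(odds_ratio, wald_lo, wald_hi), 2)
```

An interval that includes 1 (as for age and prior) corresponds to an effect we cannot distinguish from no effect at the 5% level.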
12.13 Plotting Model B
Let’s think about plotting the fitted values from our model, in terms of probabilities.
12.13.1 Using augment to capture the fitted probabilities
res_B_aug <- augment(res_modB, resect,
                     type.predict = "response")
head(res_B_aug)
# A tibble: 6 x 12
id age prior resection intubated died .fitted .resid .std.resid .hat
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 34 1 2.5 0 0 0.0591 -0.349 -0.354 0.0267
2 2 57 0 5 0 0 0.117 -0.498 -0.508 0.0380
3 3 60 1 4 1 1 0.729 0.794 0.844 0.114
4 4 62 1 4.2 0 0 0.155 -0.581 -0.602 0.0704
5 5 28 0 6 1 1 0.796 0.675 0.724 0.131
6 6 52 0 3 0 0 0.0371 -0.275 -0.277 0.0105
# ... with 2 more variables: .sigma <dbl>, .cooksd <dbl>
12.13.2 Plotting Model B Fits by Observed Mortality
ggplot(res_B_aug, aes(x = factor(died), y = .fitted, col = factor(died))) +
geom_boxplot() +
geom_jitter(width = 0.1) +
guides(col = "none")
Certainly it appears as though most of our predicted probabilities (of death) for the subjects who actually survived are quite small, but not all of them. We also have at least 6 big “misses” among the 17 subjects who actually died.
12.13.3 Confusion Matrix for Model B
res_B_aug %$%
  confusionMatrix(
    data = factor(.fitted >= 0.5),
    reference = factor(died == 1),
    positive = "TRUE"
  )
Confusion Matrix and Statistics
Reference
Prediction FALSE TRUE
FALSE 113 6
TRUE 4 11
Accuracy : 0.9254
95% CI : (0.867, 0.9636)
No Information Rate : 0.8731
P-Value [Acc > NIR] : 0.03897
Kappa : 0.6453
Mcnemar's Test P-Value : 0.75183
Sensitivity : 0.64706
Specificity : 0.96581
Pos Pred Value : 0.73333
Neg Pred Value : 0.94958
Prevalence : 0.12687
Detection Rate : 0.08209
Detection Prevalence : 0.11194
Balanced Accuracy : 0.80644
'Positive' Class : TRUE
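The headline summaries in this output follow directly from the four cell counts. As a check, here is a base R sketch that recomputes them; the object names are illustrative.

```r
# Cell counts from the confusion matrix (prediction rule: fitted probability >= 0.5)
tn <- 113  # predicted alive, actually alive
fn <- 6    # predicted alive, actually died
fp <- 4    # predicted dead, actually alive
tp <- 11   # predicted dead, actually died

metrics <- c(
  accuracy          = (tp + tn) / (tp + tn + fp + fn),
  sensitivity       = tp / (tp + fn),   # share of deaths correctly flagged
  specificity       = tn / (tn + fp),   # share of survivors correctly cleared
  pos_pred_value    = tp / (tp + fp),
  neg_pred_value    = tn / (tn + fn),
  balanced_accuracy = (tp / (tp + fn) + tn / (tn + fp)) / 2
)
round(metrics, 4)
```

Note that accuracy is dominated by the 117 survivors, so sensitivity (11 of 17 deaths) is the more telling number here.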
12.13.4 The ROC curve for Model B
## requires ROCR package
prob <- predict(res_modB, resect, type="response")
pred <- prediction(prob, resect$died)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
auc <- performance(pred, measure="auc")
auc <- round(auc@y.values[[1]],3)
roc.data <- data.frame(fpr=unlist(perf@x.values),
                       tpr=unlist(perf@y.values),
                       model="GLM")
ggplot(roc.data, aes(x=fpr, ymin=0, ymax=tpr)) +
geom_ribbon(alpha=0.2, fill = "blue") +
geom_line(aes(y=tpr), col = "blue") +
geom_abline(intercept = 0, slope = 1, lty = "dashed") +
labs(title = paste0("Model B: ROC Curve w/ AUC=", auc)) +
theme_bw()
The area under the curve (C-statistic) is 0.86, which certainly looks like a more discriminating fit than model A with resection alone.
12.13.5 Residuals, Leverage and Influence
plot(res_modB, which = 5)
Again, we see no signs of deeply influential points in this model.
12.14 Logistic Regression using lrm
To obtain the Nagelkerke \(R^2\) and the C statistic, as well as some other summaries, I’ll now demonstrate the use of lrm from the rms package to fit a logistic regression model.
We’ll return to the original model, predicting death using resection size alone.
dd <- datadist(resect)
options(datadist="dd")
res_modC <- lrm(died ~ resection, data=resect, x=TRUE, y=TRUE)
res_modC
Logistic Regression Model
lrm(formula = died ~ resection, data = resect, x = TRUE, y = TRUE)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 134 LR chi2 12.45 R2 0.167 C 0.771
0 117 d.f. 1 g 1.037 Dxy 0.541
1 17 Pr(> chi2) 0.0004 gr 2.820 gamma 0.582
max |deriv| 2e-06 gp 0.110 tau-a 0.121
Brier 0.103
Coef S.E. Wald Z Pr(>|Z|)
Intercept -4.4337 0.8799 -5.04 <0.0001
resection 0.7417 0.2230 3.33 0.0009
This output specifies the following:
- Obs = The number of observations used to fit the model, with 0 = the number of zeros and 1 = the number of ones in our outcome, died. Also specified is the maximum absolute value of the derivative at the point where the maximum likelihood function was estimated. I wouldn’t worry about that practically, as all you will care about is whether the iterative function-fitting process converged, and R will warn you in other ways if it doesn’t.
- A likelihood ratio test (drop in deviance test) subtracting the residual deviance from the null deviance to obtain the Likelihood Ratio \(\chi^2\) statistic, subtracting residual df from null df to obtain degrees of freedom, and comparing the resulting test statistic to a \(\chi^2\) distribution with the appropriate degrees of freedom to determine a p value.
- A series of discrimination indexes, including the Nagelkerke \(R^2\), symbolized R2, and several others we’ll discuss shortly.
- A series of rank discrimination indexes, including the C statistic (area under the ROC curve) and Somers’ D (Dxy), and several others.
- A table of coefficients, standard errors, Wald Z statistics and p values based on those Wald statistics.
The C statistic is estimated to be 0.771, with an associated (Nagelkerke) \(R^2\) = 0.167, both indicating at best mediocre performance for this model, as it turns out.
12.14.1 Interpreting Nagelkerke \(R^2\)
There are many ways to calculate \(R^2\) for logistic regression.
- At the URL linked here there is a nice summary of the key issue, which is that there are at least three different ways to think about \(R^2\) in linear regression that are equivalent in that context, but when you move to a categorical outcome, which interpretation you use leads you down a different path for extension to the new type of outcome. In linear regression…
- You might think of \(R^2\) as a measure of the proportion of variability explained.
- You might think of \(R^2\) as measuring the improvement from a null model to a fitted model.
- You might think of \(R^2\) as the square of the correlation coefficient.
- Paul Allison, for instance, describes several at this link in a post entitled “What’s the Best R-Squared for Logistic Regression?”
- Jonathan Bartlett looks at McFadden’s pseudo \(R^2\) in some detail (including some R code) at this link, in a post entitled “R squared in logistic regression”
The Nagelkerke approach that is presented as R2 in the lrm output is as good as most of the available approaches, and has the positive feature that it does reach 1 if the fitted model shows as much improvement as possible over the null model (which predicts the mean response for all subjects, and has \(R^2\) = 0). The greater the improvement, the higher the Nagelkerke \(R^2\).
For model A, our Nagelkerke \(R^2\) = 0.167, which is pretty poor. It doesn’t technically mean that 16.7% of any sort of variation has been explained, though.
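To see where this number comes from, here is a sketch of the usual Nagelkerke calculation: a Cox-Snell \(R^2\) rescaled by its largest attainable value. It assumes ungrouped 0/1 outcomes, so the null deviance equals minus twice the null log likelihood; the function name `nagelkerke_r2` is hypothetical, and the inputs are the likelihood ratio \(\chi^2\) from the lrm output and the null deviance reported earlier.

```r
# Hypothetical helper: Nagelkerke R-square from summary quantities
nagelkerke_r2 <- function(lr_chisq, null_deviance, n) {
  cox_snell     <- 1 - exp(-lr_chisq / n)        # improvement over the null model
  max_cox_snell <- 1 - exp(-null_deviance / n)   # largest value Cox-Snell can reach
  cox_snell / max_cox_snell
}

# Resection-only model: LR chi2 = 12.45, null deviance = 101.943, n = 134
r2_resection_only <- nagelkerke_r2(lr_chisq = 12.45, null_deviance = 101.943, n = 134)
round(r2_resection_only, 3)
```

The rescaling is what lets the index reach 1 for a perfect fit, which the unscaled Cox-Snell version cannot do with a binary outcome.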
12.14.2 Interpreting the C statistic and Plotting the ROC Curve
The C statistic is a measure of the area under the receiver operating characteristic curve. This link has some nice material that provides some insight into the C statistic and ROC curve.
- Recall that C ranges from 0 to 1. 0 = BAD, 1 = GOOD.
- values of C less than 0.5 indicate that your prediction model is not even as good as simple random guessing of “yes” or “no” for your response.
- C = 0.5 for random guessing
- C = 1 indicates a perfect classification scheme - one that correctly guesses “yes” for all “yes” patients, and for none of the “no” patients.
- The closer C is to 1, the happier we’ll be, most of the time.
- Often we’ll call models with 0.5 < C < 0.8 poor or weak in terms of predictive ability by this measure
- 0.8 \(\leq\) C < 0.9 are moderately strong in terms of predictive power (indicate good discrimination)
- C \(\geq\) 0.9 usually indicates a very strong model in this regard (indicate excellent discrimination)
We’ve seen the ROC curve for this model before, when we looked at model res_modA fitted using glm earlier. But, just for completeness, I’ll include it.
Note. I change the initial predict call from type = "response" for a glm fit to type = "fitted" in a lrm fit. Otherwise, this is the same approach.
## requires ROCR package
prob <- predict(res_modC, resect, type="fitted")
pred <- prediction(prob, resect$died)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
auc <- performance(pred, measure="auc")
auc <- round(auc@y.values[[1]],3)
roc.data <- data.frame(fpr=unlist(perf@x.values),
                       tpr=unlist(perf@y.values),
                       model="GLM")
ggplot(roc.data, aes(x=fpr, ymin=0, ymax=tpr)) +
geom_ribbon(alpha=0.2, fill = "blue") +
geom_line(aes(y=tpr), col = "blue") +
geom_abline(intercept = 0, slope = 1, lty = "dashed") +
labs(title = paste0("Model C: ROC Curve w/ AUC=", auc)) +
theme_bw()
12.14.3 The C statistic and Somers’ D
- The C statistic is directly related to Somers’ D statistic, abbreviated \(D_{xy}\), by the equation C = 0.5 + (D/2).
- Somers’ D and the ROC area only measure how well predicted values from the model can rank-order the responses. For example, predicted probabilities of 0.01 and 0.99 for a pair of subjects are no better than probabilities of 0.2 and 0.8 using rank measures, if the first subject had a lower response value than the second.
- Thus, the C statistic (or \(D_{xy}\)) may not be very sensitive ways to choose between models, even though they provide reasonable summaries of the models individually.
- This is especially true when the models are strong. The Nagelkerke \(R^2\) may be more sensitive.
- But as it turns out, we sometimes have to look at the ROC shapes, as the summary statistic alone isn’t enough.
In our case, Somers D (Dxy) = .541, so the C statistic is 0.771.
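The point about rank-ordering can be checked directly. The sketch below computes the C statistic as the share of (death, survivor) pairs in which the death has the higher predicted probability, counting ties as one half. It then shows that pushing the predictions toward 0 and 1 without changing their order leaves C untouched. The data are made up for illustration and the function name `c_statistic` is hypothetical.

```r
# Hypothetical helper: C statistic by direct pair counting
c_statistic <- function(prob, y) {
  cases    <- prob[y == 1]
  controls <- prob[y == 0]
  diffs <- outer(cases, controls, "-")        # every case-control pair
  mean((diffs > 0) + 0.5 * (diffs == 0))      # concordant = 1, tied = 0.5
}

# Toy data (not from resect): six subjects, three events
y         <- c(0, 0, 1, 0, 1, 1)
p_modest  <- c(0.10, 0.30, 0.35, 0.40, 0.70, 0.90)
p_extreme <- plogis(10 * (p_modest - 0.5))    # same ordering, more extreme values

c_modest  <- c_statistic(p_modest, y)
c_extreme <- c_statistic(p_extreme, y)
dxy       <- 2 * c_modest - 1                 # Somers' D from C

c(C_modest = c_modest, C_extreme = c_extreme, Dxy = dxy)
```

Because only the ordering matters, two models with very different calibration can share the same C, which is one reason to look at other summaries as well.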
12.14.4 Validating the Logistic Regression Model Summary Statistics
Like other regression-fitting tools in rms, the lrm function has a special validate tool to help perform resampling validation of a model, with or without backwards step-wise variable selection. Here, we’ll validate our model’s summary statistics using 100 bootstrap replications.
set.seed(432001)
validate(res_modC, B = 100)
index.orig training test optimism index.corrected n
Dxy 0.5415 0.5422 0.5415 0.0007 0.5408 100
R2 0.1666 0.1748 0.1666 0.0083 0.1583 100
Intercept 0.0000 0.0000 0.1631 -0.1631 0.1631 100
Slope 1.0000 1.0000 1.0463 -0.0463 1.0463 100
Emax 0.0000 0.0000 0.0428 0.0428 0.0428 100
D 0.0854 0.0909 0.0854 0.0055 0.0800 100
U -0.0149 -0.0149 0.0017 -0.0167 0.0017 100
Q 0.1004 0.1058 0.0837 0.0221 0.0783 100
B 0.1025 0.0986 0.1051 -0.0065 0.1090 100
g 1.0369 1.0677 1.0369 0.0308 1.0061 100
gp 0.1101 0.1080 0.1101 -0.0021 0.1122 100
Recall that our area under the curve (C statistic) = 0.5 + (Dxy/2), so that we can also use the first row of statistics to validate the C statistic. Accounting for optimism in this manner, our validation-corrected estimates are Dxy = 0.5408, so C = 0.7704, and, from the second row of statistics, we can read off the validated Nagelkerke \(R^2\), which is 0.1583.
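The arithmetic behind the corrected column is short: the optimism is the training value minus the test value, and the corrected index is the original index minus that optimism. Here is a sketch using the numbers printed by `validate`; the helper name `dxy_to_c` is hypothetical.

```r
# Hypothetical helper: convert Somers' D to the C statistic
dxy_to_c <- function(dxy) 0.5 + dxy / 2

# Values copied from the validate() output for the resection-only model
dxy_orig <- 0.5415; dxy_optimism <- 0.0007
r2_orig  <- 0.1666; r2_optimism  <- 0.0083

dxy_corrected <- dxy_orig - dxy_optimism
r2_corrected  <- r2_orig - r2_optimism
c_corrected   <- dxy_to_c(dxy_corrected)

round(c(Dxy = dxy_corrected, C = c_corrected, R2 = r2_corrected), 4)
```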
12.14.5 Plotting the Summary of the lrm approach
The summary function applied to an lrm fit shows the effect size comparing the 25th to the 75th percentile of resection.
plot(summary(res_modC))
summary(res_modC)
Effects Response : died
Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
resection 2 4 2 1.4834 0.44591 0.6094 2.3574
Odds Ratio 2 4 2 4.4078 NA 1.8393 10.5630
So, a move from a resection of 2 cm to a resection of 4 cm is associated with an estimated increase in the log odds of death of 1.48 (with standard error 0.45), or equivalently with an estimated odds ratio for death of 4.41, with 95% CI (1.84, 10.56).
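These effect sizes can be rebuilt from the slope and its standard error. The sketch below scales the coefficient by the 2 cm difference and then exponentiates; because it starts from the rounded values in the coefficient table, the last digits may differ slightly from the summary output. Variable names are illustrative.

```r
# Slope and standard error for resection from the lrm coefficient table
beta_resection <- 0.7417
se_resection   <- 0.2230

low  <- 2   # 25th percentile of resection (cm)
high <- 4   # 75th percentile of resection (cm)

# Effect on the log odds scale for a move from low to high
effect    <- beta_resection * (high - low)
se_effect <- se_resection * (high - low)
ci_effect <- effect + c(-1, 1) * qnorm(0.975) * se_effect

# Same effect expressed as an odds ratio
odds_ratio    <- exp(effect)
ci_odds_ratio <- exp(ci_effect)

round(c(effect = effect, OR = odds_ratio,
        OR_low = ci_odds_ratio[1], OR_high = ci_odds_ratio[2]), 2)
```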
12.14.6 Plot In-Sample Predictions for Model C
Here we plot the effect of resection (and 95% confidence intervals) across the range of observed values of resection on the log odds of death. Note the linear effect of resection size on the log odds scale.
ggplot(Predict(res_modC))
By applying the plogis function within the Predict command, we can plot the effect of resection on the estimated probability of death. Note the non-linear effect on this probability in this logistic regression model.
ggplot(Predict(res_modC, fun = plogis)) +
labs(y = "Predicted probability from Model C",
title = "Model C with the resect data")
The Predict function itself provides the raw material being captured in this plot.
head(Predict(res_modC, fun = plogis))
resection yhat lower upper .predictor.
resection.1 1.000000 0.02431476 0.006636502 0.08505223 resection
resection.2 1.020101 0.02467096 0.006789313 0.08559056 resection
resection.3 1.040201 0.02503224 0.006945549 0.08613277 resection
resection.4 1.060302 0.02539867 0.007105283 0.08667889 resection
resection.5 1.080402 0.02577033 0.007268589 0.08722896 resection
resection.6 1.100503 0.02614728 0.007435542 0.08778304 resection
Response variable (y):
Limits are 0.95 confidence limits
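The point estimates in this table are the inverse logit of the linear predictor. As a sketch, using the rounded coefficients from the lrm output (so small rounding differences are possible), we can reproduce a few of them with base R's `plogis`; the variable names are illustrative.

```r
# Intercept and slope from the lrm coefficient table
b0 <- -4.4337
b1 <-  0.7417

resection_cm <- c(1, 2.5, 4, 6)

log_odds <- b0 + b1 * resection_cm   # linear in resection
prob     <- plogis(log_odds)         # inverse logit: exp(x) / (1 + exp(x))

round(data.frame(resection_cm, log_odds, prob), 4)
```

Equal steps in resection add equal amounts to the log odds, but the resulting changes in probability are unequal, which is the curvature seen in the probability-scale plot.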
12.14.7 ANOVA from the lrm approach
anova(res_modC)
Wald Statistics Response: died
Factor Chi-Square d.f. P
resection 11.07 1 9e-04
TOTAL 11.07 1 9e-04
The ANOVA approach applied to a lrm fit provides a Wald test for the model as a whole. Here, the use of resection is a significant improvement over a null (intercept-only) model. The p value is \(9 \times 10^{-4}\).
12.14.8 Are any points particularly influential?
I’ll use a cutoff for dfbeta here of 0.3, instead of the default 0.2, because I want to focus on truly influential points. Note that we have to use the data frame version of resect as show.influence isn’t tibble-friendly.
inf.C <- which.influence(res_modC, cutoff=0.3)
inf.C
$Intercept
[1] 84 128
$resection
[1] 84
show.influence(object = inf.C, dframe = data.frame(resect))
Count resection
84 2 *2.0
128 1 2.5
It appears that observation 84 may have a meaningful effect on both the intercept and the coefficient for resection.
12.14.9 A Nomogram for Model C
We use the plogis function within a nomogram call to get R to produce fitted probabilities (of our outcome, died) in this case.
plot(nomogram(res_modC, fun=plogis,
fun.at=c(0.05, seq(0.1, 0.9, by = 0.1), 0.95),
funlabel="Pr(died)"))
Since there’s no non-linearity in the right hand side of our simple logistic regression model, the nomogram is straightforward. We calculate the points based on the resection by traveling up, and then travel down in a straight vertical line from total points through the linear (log odds) predictor straight to a fitted probability. Note that fitted probabilities meaningfully above 0.5 are not possible within the range of observed resection values (1 to 6 cm) in this case.
12.15 Model D: An Augmented Kitchen Sink Model
Can we predict survival from the patient’s age, whether the patient had prior tracheal surgery or not, the extent of the resection, and whether intubation was required at the end of surgery?
12.15.1 Spearman \(\rho^2\) Plot
Let’s start by considering the limited use of non-linear terms for predictors that look important in a Spearman \(\rho^2\) plot.
plot(spearman2(died ~ age + prior + resection + intubated, data=resect))
The most important variable appears to be whether intubation was required, so I’ll include intubated’s interaction with the linear effect of the next most (apparently) important variable, resection, and also a cubic spline for resection, with three knots. Since prior and age look less important, I’ll simply add them as linear terms.
12.15.2 Fitting Model D using lrm
Note the use of %ia% here. This ensures that only the linear part of the resection term will be used in the interaction with intubated.
dd <- datadist(resect)
options(datadist="dd")
res_modD <- lrm(died ~ age + prior + rcs(resection, 3) +
                  intubated + intubated %ia% resection,
                data=resect, x=TRUE, y=TRUE)
12.15.3 Assessing Model D using lrm’s tools
res_modD
Logistic Regression Model
lrm(formula = died ~ age + prior + rcs(resection, 3) + intubated +
intubated %ia% resection, data = resect, x = TRUE, y = TRUE)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 134 LR chi2 38.08 R2 0.464 C 0.880
0 117 d.f. 6 g 2.382 Dxy 0.759
1 17 Pr(> chi2) <0.0001 gr 10.825 gamma 0.770
max |deriv| 9e-08 gp 0.172 tau-a 0.169
Brier 0.067
Coef S.E. Wald Z Pr(>|Z|)
Intercept -11.3636 4.9099 -2.31 0.0206
age 0.0000 0.0210 0.00 0.9988
prior 0.6269 0.7367 0.85 0.3947
resection 3.3799 1.9700 1.72 0.0862
resection' -4.2104 2.7035 -1.56 0.1194
intubated 0.4576 2.7848 0.16 0.8695
intubated * resection 0.6188 0.7306 0.85 0.3970
- The model likelihood ratio test suggests that at least some of these predictors are helpful.
- The Nagelkerke \(R^2\) of 0.46, and the C statistic of 0.88 indicate a meaningful improvement in discrimination over our model with resection alone.
- The Wald Z tests see some potential need to prune the model, as none of the elements reaches statistical significance without the others. The product term between intubated and resection, in particular, doesn’t appear to have helped much, once we already had the main effects.
12.15.4 ANOVA and Wald Tests for Model D
anova(res_modD)
Wald Statistics Response: died
Factor Chi-Square d.f. P
age 0.00 1 0.9988
prior 0.72 1 0.3947
resection (Factor+Higher Order Factors) 4.95 3 0.1753
All Interactions 0.72 1 0.3970
Nonlinear 2.43 1 0.1194
intubated (Factor+Higher Order Factors) 16.45 2 0.0003
All Interactions 0.72 1 0.3970
intubated * resection (Factor+Higher Order Factors) 0.72 1 0.3970
TOTAL NONLINEAR + INTERACTION 2.56 2 0.2783
TOTAL 23.90 6 0.0005
Neither the interaction term nor the non-linearity from the cubic spline appears to be statistically significant, based on the Wald tests via ANOVA. However, it is clear that intubated has a significant impact as a main effect.
12.15.5 Effect Sizes in Model D
plot(summary(res_modD))
summary(res_modD)
Effects Response : died
Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
age 36 61 25 -0.00080933 0.52409 -1.02800 1.0264
Odds Ratio 36 61 25 0.99919000 NA 0.35772 2.7910
prior 0 1 1 0.62693000 0.73665 -0.81688 2.0707
Odds Ratio 0 1 1 1.87190000 NA 0.44181 7.9307
resection 2 4 2 2.42930000 1.43510 -0.38342 5.2419
Odds Ratio 2 4 2 11.35000000 NA 0.68153 189.0400
intubated 0 1 1 2.00470000 1.11220 -0.17513 4.1845
Odds Ratio 0 1 1 7.42380000 NA 0.83934 65.6610
Adjusted to: resection=2.5 intubated=0
The effect sizes are perhaps best described in terms of odds ratios. The odds ratio for death isn’t significantly different from 1 for any variable, but the impacts of resection and intubated, though not strong enough to be significant, look more substantial (if poorly estimated) than the effects of age and prior.
12.15.6 Plot In-Sample Predictions for Model D
Here are plots of the effects across the range of each predictor (holding the others constant) on the log odds scale. Note the non-linear effect of resection implied by the use of a spline there.
ggplot(Predict(res_modD))
We can also capture and plot these results on the probability scale, as follows.
ggplot(Predict(res_modD, fun = plogis))
12.15.7 Plotting the ROC curve for Model D
Again, remember to use type = "fitted" with a lrm fit.
## requires ROCR package
prob <- predict(res_modD, resect, type="fitted")
pred <- prediction(prob, resect$died)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
auc <- performance(pred, measure="auc")
auc <- round(auc@y.values[[1]],3)
roc.data <- data.frame(fpr=unlist(perf@x.values),
                       tpr=unlist(perf@y.values),
                       model="GLM")
ggplot(roc.data, aes(x=fpr, ymin=0, ymax=tpr)) +
geom_ribbon(alpha=0.2, fill = "blue") +
geom_line(aes(y=tpr), col = "blue") +
geom_abline(intercept = 0, slope = 1, lty = "dashed") +
labs(title = paste0("ROC Curve w/ AUC=", auc)) +
theme_bw()
The AUC fitted with ROCR (0.883) is slightly different from what lrm has told us (0.880), and this also happens if we use the pROC approach, demonstrated below.
## requires pROC package
roc.modD <-
  roc(resect$died ~ predict(res_modD, type="fitted"),
      ci = TRUE)
roc.modD
Call:
roc.formula(formula = resect$died ~ predict(res_modD, type = "fitted"), ci = TRUE)
Data: predict(res_modD, type = "fitted") in 117 controls (resect$died 0) < 17 cases (resect$died 1).
Area under the curve: 0.8826
95% CI: 0.7952-0.97 (DeLong)
plot(roc.modD)
12.15.8 Validation of Model D summaries
set.seed(432002)
validate(res_modD, B = 100)
Divergence or singularity in 5 samples
index.orig training test optimism index.corrected n
Dxy 0.7652 0.8080 0.7352 0.0727 0.6925 95
R2 0.4643 0.5347 0.4119 0.1228 0.3416 95
Intercept 0.0000 0.0000 -0.3533 0.3533 -0.3533 95
Slope 1.0000 1.0000 0.7658 0.2342 0.7658 95
Emax 0.0000 0.0000 0.1308 0.1308 0.1308 95
D 0.2767 0.3415 0.2407 0.1008 0.1759 95
U -0.0149 -0.0149 0.0883 -0.1032 0.0883 95
Q 0.2916 0.3564 0.1524 0.2040 0.0876 95
B 0.0673 0.0640 0.0736 -0.0096 0.0769 95
g 2.3819 4.0387 2.4635 1.5751 0.8068 95
gp 0.1720 0.1910 0.1632 0.0278 0.1442 95
The C statistic indicates fairly strong discrimination, at C = 0.88, although after validation this looks weaker: based on the corrected Dxy = 0.6925, we would have C = 0.5 + 0.6925/2 = 0.85. The Nagelkerke \(R^2\) is also reasonably good, at 0.46, although this, too, is overly optimistic, and we bias-correct through our validation study to 0.34.
12.16 Model E: Fitting a Reduced Model in light of Model D
Can you suggest a reduced model (using a subset of the independent variables in model D) that adequately predicts survival?
Based on the anova for model D and the Spearman rho-squared plot, it appears that a two-predictor model using intubation and resection may be sufficient. Neither of the other potential predictors shows a statistically detectable effect in its Wald test.
res_modE <- lrm(died ~ intubated + resection, data=resect,
                x=TRUE, y=TRUE)
res_modE
Logistic Regression Model
lrm(formula = died ~ intubated + resection, data = resect, x = TRUE,
y = TRUE)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 134 LR chi2 33.27 R2 0.413 C 0.867
0 117 d.f. 2 g 1.397 Dxy 0.734
1 17 Pr(> chi2) <0.0001 gr 4.043 gamma 0.757
max |deriv| 5e-10 gp 0.160 tau-a 0.164
Brier 0.073
Coef S.E. Wald Z Pr(>|Z|)
Intercept -4.6370 1.0430 -4.45 <0.0001
intubated 2.8640 0.6479 4.42 <0.0001
resection 0.5475 0.2689 2.04 0.0418
The model equation is that the log odds of death is -4.637 + 2.864 intubated + 0.548 resection.
This implies that:
- for intubated patients, the equation is -1.773 + 0.548 resection, while
- for non-intubated patients, the equation is -4.637 + 0.548 resection.
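To make the two equations concrete, here is a sketch that evaluates them with base R's `plogis`, which computes the same inverse logit. The function name `predict_model_E` is hypothetical, and the coefficients are the rounded values from the lrm output.

```r
# Hypothetical helper: fitted probability of death for the two-predictor model
predict_model_E <- function(resection, intubated) {
  log_odds <- -4.6370 + 2.8640 * intubated + 0.5475 * resection
  plogis(log_odds)
}

# Intercept for intubated patients: overall intercept plus the intubated coefficient
intercept_intubated <- -4.6370 + 2.8640

# Two patients from the data: 4 cm resection with intubation,
# and 2.5 cm resection without intubation
p_intubated     <- predict_model_E(resection = 4,   intubated = 1)
p_not_intubated <- predict_model_E(resection = 2.5, intubated = 0)

round(c(intercept_intubated = intercept_intubated,
        p_intubated = p_intubated,
        p_not_intubated = p_not_intubated), 4)
```

The two groups share the same slope for resection on the log odds scale; only the intercept shifts.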
We can use the ilogit function within the faraway package to help plot this.
12.16.1 A Plot comparing the two intubation groups
ggplot(resect, aes(x = resection, y = died,
col = factor(intubated))) +
scale_color_manual(values = c("blue", "red")) +
geom_jitter(size = 2, height = 0.1) +
geom_line(aes(x = resection,
y = faraway::ilogit(-4.637 + 0.548*resection)),
col = "blue") +
geom_line(aes(x = resection,
y = faraway::ilogit(-1.773 + 0.548*resection)),
col = "red") +
geom_text(x = 4, y = 0.2, label = "Not Intubated",
col="blue") +
geom_text(x = 2.5, y = 0.6, label = "Intubated Patients",
col="red") +
labs(x = "Extent of Resection (in cm.)",
y = "Death (1,0) and estimated probability of death",
title = "resect data, Model E")
The effect of intubation appears to be very large, compared to the resection size effect.
12.16.2 Nomogram for Model E
A nomogram of the model would help, too.
plot(nomogram(res_modE, fun=plogis,
fun.at=c(0.05, seq(0.1, 0.9, by=0.1), 0.95),
funlabel="Pr(died)"))
Again, we see that the effect of intubation is enormous, compared to the effect of resection. Another way to see this is to plot the effect sizes directly.
12.16.3 Effect Sizes from Model E
plot(summary(res_modE))
summary(res_modE)
Effects Response : died
Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
intubated 0 1 1 2.8640 0.64790 1.59410 4.1338
Odds Ratio 0 1 1 17.5310 NA 4.92390 62.4160
resection 2 4 2 1.0949 0.53783 0.04082 2.1491
Odds Ratio 2 4 2 2.9890 NA 1.04170 8.5769
12.16.4 Plot In-Sample Predictions for Model E
Here are plots of the effects across the range of each predictor (holding the other constant) on the log odds scale.
ggplot(Predict(res_modE))
We can also capture and plot these results on the probability scale, as follows.
ggplot(Predict(res_modE, fun = plogis))
12.16.5 ANOVA for Model E
anova(res_modE)
Wald Statistics Response: died
Factor Chi-Square d.f. P
intubated 19.54 1 <.0001
resection 4.14 1 0.0418
TOTAL 25.47 2 <.0001
12.16.6 Validation of Model E
validate(res_modE, method="boot", B=40)
index.orig training test optimism index.corrected n
Dxy 0.7340 0.6896 0.7326 -0.0430 0.7771 40
R2 0.4128 0.3814 0.4025 -0.0211 0.4339 40
Intercept 0.0000 0.0000 0.1367 -0.1367 0.1367 40
Slope 1.0000 1.0000 1.0472 -0.0472 1.0472 40
Emax 0.0000 0.0000 0.0369 0.0369 0.0369 40
D 0.2408 0.2183 0.2339 -0.0157 0.2565 40
U -0.0149 -0.0149 -0.0001 -0.0148 -0.0001 40
Q 0.2558 0.2332 0.2340 -0.0009 0.2566 40
B 0.0727 0.0727 0.0759 -0.0032 0.0759 40
g 1.3970 1.3391 1.3577 -0.0186 1.4156 40
gp 0.1597 0.1446 0.1563 -0.0117 0.1714 40
Our bootstrap validated assessments of discrimination and goodness of fit look somewhat more reasonable now.
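To read the validate output: the optimism column is training minus test, and index.corrected is index.orig minus optimism. Somers' Dxy also converts to the C statistic (area under the ROC curve) through C = 0.5 + Dxy/2. The sketch below redoes that arithmetic for the Dxy and R2 rows, with values typed in from the output above. Bootstrap results vary from run to run unless a seed is set, so a rerun with B = 40 may give somewhat different numbers.

```r
# Values copied from the validate(res_modE) output above
val <- data.frame(
  index      = c("Dxy", "R2"),
  index_orig = c(0.7340, 0.4128),
  training   = c(0.6896, 0.3814),
  test       = c(0.7326, 0.4025)
)

val$optimism  <- val$training - val$test
val$corrected <- val$index_orig - val$optimism
val

# Convert Somers' Dxy to the C statistic: C = 0.5 + Dxy / 2
c_orig      <- 0.5 + val$index_orig[1] / 2
c_corrected <- 0.5 + val$corrected[1] / 2
round(c(original = c_orig, corrected = c_corrected), 3)
```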
12.16.7 Do any points seem particularly influential?
As a last step, I’ll look at influence, and residuals, associated with model E.
inf.E <- which.influence(res_modE, cutoff=0.3)
inf.E
$Intercept
[1] 84 94
$resection
[1] 84 94
show.influence(inf.E, dframe = data.frame(resect))
Count resection
84 2 *2
94 2 *6
12.16.8 Fitting Model E using glm to get plots about influence
res_modEglm <- glm(died ~ intubated + resection,
                   data=resect, family="binomial")
par(mfrow=c(1,2))
plot(res_modEglm, which=c(4:5))
Using this glm residuals approach, we again see that points 84 and 94 have the largest influence on our model E.
12.17 Concordance: Comparing Model C, D and E’s predictions
To start, we’ll gather the predictions made by each model (C, D and E) on the probability scale, in one place. Sadly, augment from broom doesn’t work well with lrm fits, so we have to do this on our own.
resect_preds <- resect %>%
  mutate(C = predict(res_modC, type = "fitted"),
         D = predict(res_modD, type = "fitted"),
         E = predict(res_modE, type = "fitted"))
head(resect_preds)
# A tibble: 6 x 9
id age prior resection intubated died C D E
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 34 1 2.5 0 0 0.0705 0.0632 0.0367
2 2 57 0 5 0 0 0.326 0.0620 0.130
3 3 60 1 4 1 1 0.187 0.791 0.603
4 4 62 1 4.2 0 0 0.211 0.158 0.0881
5 5 28 0 6 1 1 0.504 0.711 0.819
6 6 52 0 3 0 0 0.0990 0.0737 0.0477
And now, we’ll use the gather command to arrange the models and predicted probabilities in a more useful manner for plotting.
res_p <- resect_preds %>%
  gather("model", "prediction", 7:9) %>%
  select(id, died, model, prediction)
head(res_p)
# A tibble: 6 x 4
id died model prediction
<dbl> <dbl> <chr> <dbl>
1 1 0 C 0.0705
2 2 0 C 0.326
3 3 1 C 0.187
4 4 0 C 0.211
5 5 1 C 0.504
6 6 0 C 0.0990
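In current versions of tidyr, gather is marked as superseded and pivot_longer is the suggested replacement. As a sketch only, here is how the same reshaping might look, using a small stand-in tibble built from the six rows printed by head(resect_preds) above so that the example runs on its own. The names resect_preds_demo and res_p_demo are made up for this illustration.

```r
library(dplyr)
library(tidyr)

# Stand-in for resect_preds: the six rows shown by head(resect_preds)
resect_preds_demo <- tibble(
  id   = 1:6,
  died = c(0, 0, 1, 0, 1, 0),
  C = c(0.0705, 0.326, 0.187, 0.211, 0.504, 0.0990),
  D = c(0.0632, 0.0620, 0.791, 0.158, 0.711, 0.0737),
  E = c(0.0367, 0.130, 0.603, 0.0881, 0.819, 0.0477)
)

# pivot_longer stacks the C, D and E columns into model / prediction pairs
res_p_demo <- resect_preds_demo %>%
  pivot_longer(cols = c(C, D, E),
               names_to = "model",
               values_to = "prediction") %>%
  arrange(model, id)   # match the column-by-column order gather produces

head(res_p_demo)
```

Selecting the columns by name rather than by position (7:9) should also keep the code working if the column order of the data changes.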
Here’s the resulting plot.
ggplot(res_p, aes(x = factor(died), y = prediction,
group = model, col = model)) +
geom_jitter(width = 0.25) +
geom_hline(yintercept = 0.5) +
facet_wrap( ~ model) +
guides(color = "none") +
labs(title = "Comparing Predictions for our Three Models",
subtitle = "A graphical view of concordance",
x = "Actual mortality status (1 = died)",
y = "Predicted probability of death")
We could specify a particular rule, for example: if the predicted probability of death is 0.5 or greater, then predict “Died.”
res_p$rule.5 <- ifelse(res_p$prediction >= 0.5,
                       "Predict Died", "Predict Alive")
ftable(table(res_p$model, res_p$rule.5, res_p$died))
0 1
C Predict Alive 114 16
Predict Died 3 1
D Predict Alive 113 7
Predict Died 4 10
E Predict Alive 114 8
Predict Died 3 9
And perhaps build the linked table of row percentages…
round(100*prop.table(
  ftable(table(res_p$model, res_p$rule.5, res_p$died)),
  1), 2)
0 1
C Predict Alive 87.69 12.31
Predict Died 75.00 25.00
D Predict Alive 94.17 5.83
Predict Died 28.57 71.43
E Predict Alive 93.44 6.56
Predict Died 25.00 75.00
For example, in model E, 93.44% of those predicted to be alive actually survived, and 75% of those predicted to die actually died.
- Model D does a little better in one direction (94.17% of those predicted by Model D to be alive actually survived) but worse in the other (71.43% of those predicted by Model D to die actually died.)
- Model C does worse than each of the others in both predicting those who survive and those who die.
Note that the approaches discussed here would be useful if we had a new sample to predict on, as well. We could then compare the errors for that new data made by this sort of classification scheme either graphically or in a table.
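The row percentages above condition on the prediction. Conditioning instead on the actual outcome gives sensitivity (the share of deaths that the rule flags) and specificity (the share of survivors it correctly predicts to be alive). Here is a minimal sketch, with the counts typed in from the ftable output above and a made-up helper name, classify_stats.

```r
# Counts copied from the ftable output: rows = prediction, cols = died (0, 1)
dn <- list(c("Predict Alive", "Predict Died"), c("0", "1"))
counts <- list(
  C = matrix(c(114, 3, 16, 1), nrow = 2, dimnames = dn),
  D = matrix(c(113, 4, 7, 10), nrow = 2, dimnames = dn),
  E = matrix(c(114, 3, 8, 9),  nrow = 2, dimnames = dn)
)

# Hypothetical helper: summaries that condition on the actual outcome
classify_stats <- function(m) {
  c(sensitivity = m["Predict Died", "1"] / sum(m[, "1"]),
    specificity = m["Predict Alive", "0"] / sum(m[, "0"]),
    accuracy    = (m["Predict Alive", "0"] + m["Predict Died", "1"]) / sum(m))
}

# One column per model, as percentages
round(100 * sapply(counts, classify_stats), 1)
```

With the 0.5 cutoff, Models D and E classify the same number of patients correctly overall (123 of 134), while Model C flags only one of the 17 deaths.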
12.18 Conclusions
It appears that intubated status and, to a lesser degree, the extent of the resection both play a meaningful role in predicting death associated with tracheal carina resection surgery. Intubation is associated with worse outcomes (greater risk of death), and more extensive resections are also associated with worse outcomes.
References
Although I’ve yet to figure out how to get the y axis relabeled properly without simply dumping the Predict results into a new tibble and starting over with creating the plots.