Chapter 23 Cox Regression Models for Survival Data: Example 1
The Cox proportional hazards (Cox regression) model fits survival data with a constant (i.e. not varying over time) covariate \(x\) to a hazard function of the form:
\[ h(t | x) = h_0(t) exp[\beta_1 x] \]
where we will estimate the unknown value of \(\beta_1\) and where \(h_0(t)\) is the baseline hazard, which is a non-parametric and unspecified value which depends on \(t\) but not on \(x\).
- For particular \(x\) values, we will be able to estimate the survival function if we have an estimate of the baseline survival function, \(\hat{S_0}(t)\).
The estimated survival function for an individual with covariate value \(x_k\) turns out to be
\[ \hat{S}(t | x_k) = [\hat{S_0}(t)]^{exp(\beta_1 x_k)} \]
From Wikipedia (yes, really) …
Survival models can be viewed as consisting of two parts: the underlying hazard function, describing how the risk of event per time unit changes over time at baseline levels of covariates; and the effect parameters, describing how the hazard varies in response to explanatory covariates.
The key assumption in a Cox model is that the hazards are proportional - other types of survival models need not have this restriction. Quoting the always reliable (well, it’s better than you think) Wikipedia …
In a proportional hazards model, the unique effect of a unit increase in a covariate is multiplicative with respect to the hazard rate. For example, taking a drug may halve one’s hazard rate for a stroke occurring, or, changing the material from which a manufactured component is constructed may double its hazard rate for failure.
There are two main approaches to fitting Cox models in R.
- the
coxph
function in thesurvival
package, and - the
cph
function in therms
package.
23.1 Sources used in building this material
- David Diez’s excellent supplement for the OpenIntro Statistics project, on Survival Analysis in R. I’ve posted it on our web site, as well.
- Some tools in R to do some fancier work can be viewed at https://cran.r-project.org/web/views/Survival.html
- You might also look at these two blog posts, originally from the Just Another Data blog.
- https://rpubs.com/daspringate/survival has some great slides, and I’ve stolen from them quite a bit here.
23.2 Fitting a Cox Model in R with coxph
As a first example, I’ll fit a model to predict time to recurrence in the hem
data, on the basis of a single predictor: genotype
.
Call:
coxph(formula = Surv(time, recur) ~ genotype)
coef exp(coef) se(coef) z p
genotype 1.3317 3.7874 0.5699 2.337 0.0195
Likelihood ratio test=6.61 on 1 df, p=0.01015
n= 70, number of events= 18
This summary provides an overall comparison of the two genotypes, using a proportional hazards model.
- The default approach in R is to use the “efron” method of breaking ties: other options include “breslow” and “exact”.
23.2.1 Summarizing the Fit
Call:
coxph(formula = Surv(time, recur) ~ genotype)
n= 70, number of events= 18
coef exp(coef) se(coef) z Pr(>|z|)
genotype 1.3317 3.7874 0.5699 2.337 0.0195 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
genotype 3.787 0.264 1.239 11.57
Concordance= 0.622 (se = 0.061 )
Likelihood ratio test= 6.61 on 1 df, p=0.01
Wald test = 5.46 on 1 df, p=0.02
Score (logrank) test = 6.28 on 1 df, p=0.01
This provides estimates of the \(\beta\) value for genotype
, including standard errors and \(p\) values for a Wald test. Also included is an estimate of the hazard ratio and its confidence interval.
- Here we have a hazard ratio estimate of
exp(coef) = 3.787
, with 95% CI (1.24, 11.57). - The hazard ratio is the multiplicative effect of the covariate (here, having at least one of the \(\epsilon 2\) or \(\epsilon 4\) allele) on the hazard function for recurrent hemorrhage
- A hazard ratio of 1 indicates no effect
- A hazard ratio > 1 indicates an increase in the hazard as the covariate rises
- A hazard ratio < 1 indicates a decrease in the hazard as the covariate rises
We can also tidy
the hazard ratio estimate with the broom
package.
# A tibble: 1 x 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 genotype 3.79 0.570 2.34 0.0195 1.24 11.6
In addition, we have several other summaries:
- The concordance measure is only appropriate when we have at least one continuous predictor in our Cox model.
- The Cox & Snell pseudo-R2 reflects the improvement of the model we’ve fit over the model with an intercept alone, but isn’t a proportion of anything (hence the listing of the maximum possible value).
- The Likelihood ratio, Wald and Score (logrank) tests provide insight into the overall significance of the model.
We can obtain a more detailed description of the likelihood-ratio test of the model with anova
.
Analysis of Deviance Table
Cox model: response is Surv(time, recur)
Terms added sequentially (first to last)
loglik Chisq Df Pr(>|Chi|)
NULL -66.675
genotype -63.371 6.6078 1 0.01015 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
23.2.2 Glancing at the model?
# A tibble: 1 x 15
n nevent statistic.log p.value.log statistic.sc p.value.sc statistic.wald
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 70 18 6.61 0.0102 6.28 0.0122 5.46
# ... with 8 more variables: p.value.wald <dbl>, r.squared <dbl>,
# r.squared.max <dbl>, concordance <dbl>, std.error.concordance <dbl>,
# logLik <dbl>, AIC <dbl>, BIC <dbl>
Here, we obtain several additional summaries of the model, including most of the important information from a summary of cfit
.
23.2.3 Plot the baseline survival function
Here, we’ll plot the time in terms of months, but scaled to 12 month (one year) groups.
23.2.4 Plot the genotype effect
There are several ways to build these plots. One approach follows. Another uses a cph
fit and the survplot
function from the rms
package.
genotype
1 1
2 0
plot(survfit(cfit, newdata = newdat), xscale = 12,
conf.int = TRUE,
col = c("red", "blue"),
xlab = "Years after initial hemorrhage",
ylab = "Proportion without recurrent hemorrhage",
main = "Plotting the genotype effects from the cfit Model")
legend(0.5, 0.2,
legend=c(
expression(paste("Homozygous ", epsilon, "3 / ",
epsilon, "3")),
expression(paste("At least one ", epsilon,"2 or ",
epsilon,"4 Allele"))
),
lty = 1,
col = c("red", "blue"),
text.col = c("red", "blue"))
23.2.5 Testing the Key Assumption: Proportional Hazards
The cox.zph
function in the survival
package will test the proportionality of all of the predictors included in your model by creating interactions with time.
- A small \(p\) value would indicate a violation of the proportionality assumption.
chisq df p
genotype 2.09 1 0.15
GLOBAL 2.09 1 0.15
Since the p value here is not significant, we may be all right. But it’s sensible to focus further on plots derived from the model fit, rather than relying solely on this test.
23.2.6 Plotting the cox.zph
results for the cfit
model
cox.zph
function can be used to generate a plot for each of the individual predictors in the model. Of course, in this case, we have just one predictor: genotype
. If the proportional hazards assumption is appropriate, then we should see a slope of essentially zero in each such plot. A slope that is seriously different from zero suggests a violation of the proportional hazards assumption.
The plot suggests only a slight rise in the plotted values over time, suggesting no serious problem with the proportional hazards assumption. This combined testing and plotting approach is a reasonable starting place for assessing the proportional hazards assumption, but it’s likely insufficient for good practical work.
Should the proportional hazards assumption fit less well, we have two main options: (1) fit a non-linear term in the covariate in question, and (2) fit a different type of regression model that doesn’t require the proportional hazards assumption.
23.3 Fitting a Cox Model using cph
from the rms
package
To set up a cph
fit for our comparison of genotypes in the hem
data, we’ll follow these steps.
units(hem$time) <- "month"
d <- datadist(hem)
options(datadist = "d")
hemsurv <- Surv(time = hem$time, event = hem$recur)
model_hem <- cph(hemsurv ~ genotype, data = hem,
x = TRUE, y = TRUE, surv = TRUE)
Note that the surv = TRUE
piece is required to get some of the follow-up analyses to work smoothly.
23.3.1 The Main cph
results
Cox Proportional Hazards Model
cph(formula = hemsurv ~ genotype, data = hem, x = TRUE, y = TRUE,
surv = TRUE)
Model Tests Discrimination
Indexes
Obs 70 LR chi2 6.61 R2 0.106
Events 18 d.f. 1 Dxy 0.244
Center 0.7229 Pr(> chi2) 0.0102 g 0.671
Score chi2 6.28 gr 1.955
Pr(> chi2) 0.0122
Coef S.E. Wald Z Pr(>|Z|)
genotype 1.3317 0.5699 2.34 0.0195
Included here are likelihood ratio and score tests for the model as a whole (as compared to the intercept-only model), as well as the usual discrimination indexes.
- These include both an R2 analog due to Nagelkerke (which can go all the way up to 1), and
- Somers’ \(Dxy\), which can also produce an estimate of the C statistic (area under the curve) via the formula C = 0.5 + Dxy / 2, so here C = 0.5 + (.244/2) = 0.622
- For lots more on survival analysis C statistics, look at the
survAUC
package in R.
These results are followed by a table of Wald tests for each of the coefficients in the model.
23.3.2 Using anova
with cph
As in other rms
fits, we can use anova
to obtain more detailed (in terms of combining nonlinear terms and, if available, interactions) tests.
Wald Statistics Response: hemsurv
Factor Chi-Square d.f. P
genotype 5.46 1 0.0195
TOTAL 5.46 1 0.0195
23.3.3 Effect Sizes after cph
fit
We can use summary
on a cph
object to get and plot effect size estimates (here, these are hazard ratios.)
Effects Response : hemsurv
Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
genotype 0 1 1 1.3317 0.5699 0.21468 2.4486
Hazard Ratio 0 1 1 3.7873 NA 1.23950 11.5730
23.3.4 Validating cph
summaries
For details on these last few indices (D, U, Q, etc.), visit ?validate.cph
in R.
index.orig training test optimism index.corrected n
Dxy 0.2441 0.2508 0.2197 0.0311 0.2130 40
R2 0.1058 0.1203 0.1058 0.0144 0.0914 40
Slope 1.0000 1.0000 -6.4747 7.4747 -6.4747 40
D 0.0421 0.0532 0.0421 0.0111 0.0310 40
U -0.0150 -0.0168 0.0198 -0.0366 0.0216 40
Q 0.0571 0.0699 0.0222 0.0477 0.0094 40
g 0.6705 0.7908 0.6705 0.1203 0.5502 40
23.3.5 Plotting Survival Functions for each Genotype
Here is the survplot
approach I mentioned earlier.
survplot(model_hem, genotype,
lty = c(1,2), n.risk=TRUE, time.inc=12,
col=c("magenta", "dodgerblue"),
xlab="Hemorrhage-Free Survival in Months")
We can add, for instance, confidence interval bars, with:
survplot(model_hem, genotype,
lwd=3, lty = c(1,2), conf.int=.95,
n.risk=TRUE, time.inc = 12, conf='bars',
col=c("magenta", "dodgerblue"),
xlab="Hemorrhage-Free Survival Time in Months")
For more details, check out R’s help file on survplot
.
23.3.6 Genotype’s effect on log relative hazard
23.3.7 Nomogram of our simple hem
model
We can estimate 1-year and 3-year hemorrhage-free survival probabilities, for example, with this model, and incorporate these results into our nomogram.
survx <- Survival(model_hem)
plot(nomogram(model_hem, fun=list(function(x) survx(12, x),
function(x) survx(36, x)),
funlabel=c("12-month Pr(Survival)",
"36-month Pr(Survival)")))
Again, this is just a very simple model, with one binary predictor.
23.3.8 Assessing the Proportional Hazards Assumption
chisq df p
genotype 2.09 1 0.15
GLOBAL 2.09 1 0.15
- Consider using
transform="rank"
to transform the survival times by their ranks prior to performing the test. - Or use
transform="identity"
as we’ll do in the plot below.