General Instructions
- Submit your work via Canvas.
- The deadline for this Lab is specified on the Calendar.
- Work submitted late, but within 12 hours of the deadline will lose 5 of the available 50 points.
- Work submitted more than 12 hours after the deadline will lose 10 of the available 50 points.
Your response should include a Quarto file (.qmd) and an HTML document that is the result of applying your Quarto file to the data we’ve provided. While we have not provided a specific template for this Lab, we encourage you to adapt the one provided for Lab 2.
Data and Codebook for Questions 1-2
The umaru.csv
data file found on our 432-data page contains information for 575 subjects selected from the UMARU IMPACT study collaborative project done by the University of Massachusetts AIDS Research Unit over 5 years (1989-1994). Various versions of this data set are frequently used in survival analysis texts. I’ve tweaked your data set enough that you’ll see some different results. The study included two concurrent randomized trials of residential treatment for drug abuse. The key question is to compare treatment programs of different planned durations in terms of their ability to reduce drug abuse and prevent high-risk HIV behavior. Here’s a codebook:
subject |
Subject ID #, ranging from 1001 - 1575 |
age |
age at enrollment, in years |
beck |
Beck Depression Score at admission |
hercoc |
heroin or cocaine use during the 3 months prior to admission (1 = Heroin & Cocaine, 2 = Heroin only, 3 = Cocaine only, 4 = Neither Heroin nor Cocaine) |
ivhx |
IV drug use history at admission (1 = never, 2 = previous but not recent, 3 = recent) |
ndrugtx |
# of prior drug treatments |
race |
subject’s race (0 = white, 1 = other) |
treat |
treatment randomization assignment (Long, or Short) |
site |
treatment site (A or B) |
lot |
Length of Treatment (Exit Date - Admission Date), in days |
time |
Time to Return to Drug Use (measured from Admission Date), in days |
censor |
Returned to Drug Use indicator (1 = returned to drug use, 0 = otherwise) |
Question 1 (10 points)
Using the umaru
data, build a Cox model, using treat
as a predictor, and spending degrees of freedom in any way you like after you include the main effects of the rest of the available predictors (i.e. everything but subject
, lot
, time
and censor
) in the data set, so long as you include a total of 13 degrees of freedom, predicting the time to return to drug use.
You’ll want to use a Spearman rho-squared plot to make your selection, and you should stick with the model you develop using that tool, regardless of its eventual performance. Specify your model carefully, and interpret the hazard ratio for treat implied by your new model.
Hints for Question 1
- When you build the Spearman \(\rho^2\) plot, use
time
but not the entire survival object as the “outcome.”
- The correct model for Question 1 should include the main effects of
age
, site
, beck
, hercoc
, ivhx
, race
, ndrugtx
and treat
, being sure to treat ivhx
and hercoc
as factors, along with one interaction term.
Question 2 (10 points)
Assess the quality of fit from the model you fit in Question 1 (be sure to use at least two different assessments of fit quality) and an assessment of adherence to the assumptions of a proportional hazards model to help justify your final selection. Validate the summary statistics describing your chosen model, and explain what those results mean in complete English sentences, too.
Hint for Question 2
Somers’ d and \(R^2\) are the two assessments of fit quality that we recommend you use in this question, although there are others available.
Question 3 (30 points)
Next, we will fit the same model in two different ways to the hbp3456
data we first described and developed in Lab 2. In each model, we’ll be predicting systolic blood pressure (sbp
) using three variables: the subject’s age (age
), weight (weight
), and primary insurance type (insurance
). Throughout this question, annotate your R code with text so that it is very clear what you’re doing.
(5 points) Use set.seed(2023)
and the rsample
package to split the data into a training sample (called hbp_train
) consisting of essentially 70% of the subjects, and a testing sample (called hbp_test
) of the remaining 30% of the counties, while requiring a very similar distribution of insurance
across the training and test samples.
(15 points) In this step, you will build two models, each using the same three predictors: age
, weight
, and insurance
, to predict sbp
in the training sample hbp_train
.
First, build a recipe for pre-processing the predictors and establishing the roles of your variables that works for each model. Be sure your recipe centers the outcome (sbp
), normalizes all predictors and uses indicator variables for any factors, using the training set of data. The same “recipe” should be used for both Model 1 and Model 2. Next, for model 1, create a workflow that uses the lm
modeling engine. Now, for model 2, create a revised workflow that instead uses a Bayesian approach with stan
to fit the same model, using a seed of 2023 again, as well as a Student t distribution with 1 degree of freedom for the intercept term, and a Normal distribution with mean 0 and variance 4 for each of the predictors. Fit the two models in the training sample, and display the results for each.
Fit the two models, and display the fitted coefficients for each.
- (10 points) Compare the coefficients of your
stan
fit for the model to the lm
fit in a plot of the coefficients, with a 95% confidence interval for each. Write a couple of complete English sentences addressing what you see in your plot, and what conclusions you draw from it.
Our Best Advice
Review your HTML output file carefully before submission for copy-editing issues (spelling, grammar and syntax.) Even with spell-check in RStudio (just hit F7), it’s hard to find errors with these issues in your Quarto file so long as it is running. You really need to look closely at the resulting HTML output.