500 Lab 0 is an Example with instructions, meant to help you get started on other Labs. Your task for Lab 0 is to read over these materials and see if they help you answer the questions that arise in generating your responses to your actual Lab assignments (Lab 1 - Lab 4) this semester. It’s not a bad idea to try to build a response yourself, and then check it against my solution, but there’s nothing for you to turn in here.
This work uses a data set called lab0.csv (which is a comma-separated .csv file suitable for reading into R.) You’ll find this in the data folder at our 500-data web site.
Template for Lab 0
We have provided a template for building a response to Lab 0, specifically a Quarto file containing some key bits of code which you can edit to produce a response. You’ll find the Lab 0 template in the templates folder at our 500-data web site.
Answer Sketch for Lab 0
We have also provided an Answer Sketch for Lab 0, including both the Quarto file we used to create the sketch and the resulting HTML file it generates. You’ll find those files in the Labs section of our Shared Google Drive.
Original Instructions for this work
Here are the instructions I gave to students for whom this was a required assignment.
Do professional work with this little problem. What do I mean by this?
Properly labeled graphs/figures are a minimal expectation for graduate school.
Use complete English sentences to describe your findings and clarify when annotating code.
Make sure that the answers include enough of the question that your text responses (in addition to the graphs) stand on their own. Be sure to address all three tasks.
Present edited code, making an effort to delete false starts, and comment liberally. Don’t present R code without explaining what you’re doing in English. Quarto makes it easy to intersperse code with explanations, so make that happen.
Use words I know, without simply repeating my explanations verbatim, please.
You are welcome to discuss Lab 0 with anyone, including myself, or your colleagues, but your answer must be prepared by you alone.
If you are confused by the assignment, or stuck in the development of your response, please ask questions!
The file includes 135 subjects, the first 40 of whom have received a particular treatment and the remaining 95 of whom have not received it.
Also provided are five meaningful predictors of treatment status, labeled (imaginatively) cov1, cov2, cov3, cov4 and cov5.
Covariates 1-4 are continuous covariates, gathered at varying levels of precision. The cov5 variable is an indicator of whether the subject has a particular characteristic (1 = yes, 0 = no.)
Happily, there are no missing values in the data.
Tasks
Build a logistic regression model using the main effects of the five predictors to predict treatment status.
Use R to add two columns to the data set, specifically the fitted probability (according to your logistic regression model) of being treated, and the linear component of the logistic regression model (the logit of the probability of being treated.)
Next, summarize the resulting probabilities across the untreated and treated patients in an appropriate and attractive manner.
Raw R code is rarely attractive on its face - build something brief, effective and appropriate for a presentation.
Of course, we’d expect that the average probability of being treated will be higher in the patients who are actually treated. Verify that this is the case, in a short numerical and graphical summary of your findings.
How much overlap is there between the fitted probabilities of the treated patients and the fitted probabilities of the untreated patients?
A graph of this overlap (perhaps a boxplot, but a better option would be a dot chart or density plot of some sort; creativity is welcome here) is crucial, supplemented by a short written description of your findings.
R Setup
To start, I’ll request that R sets its responses to be rendered without the default pair of hashtags. Next, I’ll load two R packages that will help me with these instructions. Then, I’ll load the data, and take a look at the variables.