General Instructions
- Submit your work via Canvas.
- The deadline for this Lab is specified on the Course Calendar.
- We charge a 5 point penalty for a lab that is 1-48 hours late.
- We do not grade work that is more than 48 hours late.
- Your response should include a Quarto file (.qmd) and an HTML document that is the result of applying your Quarto file to the data we’ve provided.
You can skip exactly one of Labs 1-5 without penalty, but all students must complete both Lab 6 and Lab 7. If you decide to skip a lab, please submit a note to Canvas by the deadline saying that you are skipping the lab.
Template
There is a Lab 2 Quarto template available on our 432-data page. Please use the template to prepare your response to Lab 2, as it will make things easier for you and for the people grading your work.
Our Best Advice
Review your HTML output file carefully before submission for copy-editing issues (spelling, grammar and syntax.) Even with spell-check in RStudio (just hit F7), it’s hard to find errors with these issues in your Quarto file so long as it is running. You really need to look closely at the resulting HTML output.
The Data
- The
hbp3024.xlsx
file is available for download on the 432 data page.
- A detailed description of each variable in the
hbp3024
data is available here.
Question 1 (25 points)
Import the data from the hbp_3024.xlsx
file into R, being sure to include NA as a potential missing value when you do, since all missing values are indicated in the Excel file with NA
. You should find 8 variables which have at least 1 missing value. For Question 1, begin by restricting the hbp_3024
data to include only those subjects with complete data on the entire set of 23 variables included in hbp_3024
.
Please explicitly state in your response that you assume that the missingness you observe in these data are MCAR, and that a complete case analysis is thus appropriate for this Question.
Your data set for Question 1 should include meaningfully fewer than 3024 observations.
- {10 points} Shortly, you will build a model (which we’ll call
fit1
) to describe the impact of insurance status (insurance
) to predict 100 times the natural logarithm of a subject’s systolic blood pressure (sbp
), adjusting for whether or not they have a prescription for an ACE inhibitor or angiotensin receptor blocker (acearb
). As a first step, decide whether your model should include an interaction term (between the two predictors) in a sensible way, providing an appropriate plot of means and a complete sentence or two describing your conclusion from that plot to help us understand your reasoning.
The outcome you’re using in model fit1
should be \(100 \times log(sbp)\), where sbp
is measured in millimeters of mercury.
- {5 points} Build model
fit1
(including an interaction if you chose to include it in part 1a) with the lm()
function, and display the resulting coefficient estimates (along with 90% confidence intervals) in an attractive way. Rounding to two decimal places should be fine here.
We’re hoping you will use the gt()
function from the gt
package in parts b and c in question 1.
{5 points} For the fit1
model, use R code to specify the number of observations your model uses, the (raw) \(R^2\) value (rounded to four decimal places) and the residual standard error (rounded to two decimal places.)
{5 points} Use your fit1
model to predict the systolic blood pressure (in mm Hg, rounded to zero decimal places) for each of the following two subjects:
- Charlie, who has Medicaid insurance and has not been prescribed an ACE or ARB, and
- Delta, who has Commercial insurance and has been prescribed an ACE inhibitor.
Question 2 (25 points)
Your work on Question 2a should use all 3024 observations from the original hbp_3024
data. Questions 2b and 2c will use data from a single practice, with 432 subjects.
{5} Returning to the original data stored in hbp_3024.xlsx
, which of the seven practices has the largest mean number of visits for primary care in the past two years? Show how you figured this out using R.
{10} For the 432 subjects associated with the practice you identified in part a, predict the probability of a depression diagnosis on the basis of the subject’s number of visits for primary care in the past two years using a logistic regression model. Obtain and then interpret the coefficient of your predictor as an odds ratio, and display its 90% confidence interval. Feel encouraged to round your response to two decimal places.
{10} Use your model from part b. to make a prediction for the probability of a depression diagnosis for a “new” patient at the practice who had 4 primary care visits in the past two years, and state that prediction clearly. How does this prediction compare to the actual fraction of patients at this practice with 4 primary care visits in the past two years who have a depression diagnosis?
Use of AI
If you decide to use some sort of AI to help you with this Lab, we ask that you place a note to that effect, describing what you used and how you used it, as a separate section called “Use of AI”, after your answers to our questions, and just before your presentation of the Session Information. Thank you.
After the Lab
- We will post an answer sketch to our Shared Google Drive 48 hours after the Lab is due.
- We will post grades to our Grading Roster on our Shared Google Drive one week after the Lab is due.
- See the Lab Appeal Policy in our Syllabus if you are interested in having your Lab grade reviewed, and use the Lab Regrade Request form specified there to complete the task. Thank you.