Lab 2

Published

2025-04-02

General Instructions

Submit your work via Canvas.
The deadline for this Lab is specified on the Course Calendar.
- We charge a 5 point penalty for a lab that is 1-48 hours late.
- We do not grade work that is more than 48 hours late.
Your response should include a Quarto file (.qmd) and an HTML document that is the result of applying your Quarto file to the data we’ve provided.

Important

You can skip exactly one of Labs 1-5 without penalty, but all students must complete both Lab 6 and Lab 7. If you decide to skip a lab, please submit a note to Canvas by the deadline saying that you are skipping the lab.

Template

There is a Lab 2 Quarto template available on our 432-data page. Please use the template to prepare your response to Lab 2, as it will make things easier for you and for the people grading your work.

In the Lab 2 template, we use the zephyr theme. If you’d like to use a different theme, the available list is here.

Our Best Advice

Review your HTML output file carefully before submission for copy-editing issues (spelling, grammar and syntax). Even with spell-check in RStudio (just hit F7), these errors are hard to spot in your Quarto file as long as it renders without complaint. You really need to look closely at the resulting HTML output.

The Data

The hbp3024.xlsx file is available for download on the 432 data page.
A detailed description of each variable in the hbp3024 data is available here.

Question 1 (25 points)

Import the data from the hbp_3024.xlsx file into R, being sure to include NA as a potential missing value when you do, since all missing values are indicated in the Excel file with NA. You should find 8 variables which have at least 1 missing value. For Question 1, begin by restricting the hbp_3024 data to include only those subjects with complete data on the entire set of 23 variables included in hbp_3024.

Please explicitly state in your response that you assume the missingness you observe in these data is MCAR (missing completely at random), and that a complete case analysis is thus appropriate for this Question.

Tip

Your data set for Question 1 should include meaningfully fewer than 3024 observations.
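The import and complete-case steps described above might look something like the sketch below. It assumes the file sits in your working directory under the name hbp_3024.xlsx; if it is not found, the code falls back on a tiny invented data frame (the object names hbp_raw and hbp_cc, and the toy values, are ours and purely illustrative) so that the logic can still be run.

```r
# Assumed file location; adjust the path to wherever you saved the download.
path <- "hbp_3024.xlsx"

if (file.exists(path)) {
  # na = "NA" tells read_excel() to treat the text NA as a missing value
  hbp_raw <- readxl::read_excel(path, na = "NA")
} else {
  # Toy stand-in with invented values, used only when the file is absent
  hbp_raw <- data.frame(
    subject   = 1:6,
    sbp       = c(128, NA, 141, 119, 135, 150),
    insurance = c("Medicaid", "Commercial", NA, "Medicare", "Commercial", "Medicaid"),
    acearb    = c("No", "Yes", "Yes", "No", NA, "Yes")
  )
}

# Count missing values per variable, then count variables with any missing
n_miss <- colSums(is.na(hbp_raw))
n_miss[n_miss > 0]
sum(n_miss > 0)

# Complete-case data: keep only rows with no missing value on any variable
hbp_cc <- hbp_raw[complete.cases(hbp_raw), ]

# Compare the number of rows before and after filtering
nrow(hbp_raw)
nrow(hbp_cc)
```

With the real data, the count of variables with missing values and the reduced row count give you a quick check against the figures mentioned in this Question.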

a. {10 points} Shortly, you will build a model (which we’ll call fit1) that uses insurance status (insurance) to predict 100 times the natural logarithm of a subject’s systolic blood pressure (sbp), adjusting for whether or not they have a prescription for an ACE inhibitor or angiotensin receptor blocker (acearb). As a first step, decide in a sensible way whether your model should include an interaction term between the two predictors. Provide an appropriate plot of means, and a complete sentence or two describing your conclusion from that plot, to help us understand your reasoning.

Tip

The outcome you’re using in model fit1 should be \(100 \times \log(\text{sbp})\), where \(\log\) is the natural logarithm and sbp is measured in millimeters of mercury (mm Hg).
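One way to build a plot of means for judging an interaction is sketched here on simulated data. The variable names insurance, acearb and sbp come from the Lab, but the category labels, the Yes/No coding of acearb, the simulated values and the names toy, log_sbp100, cell_means and p_means are assumptions made for illustration, so check them against the actual data description. The sketch also assumes the dplyr and ggplot2 packages are installed.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for the complete-case data (not the real hbp_3024 values)
set.seed(432)
toy <- data.frame(
  insurance = sample(c("Commercial", "Medicaid", "Medicare", "Uninsured"),
                     400, replace = TRUE),
  acearb    = sample(c("No", "Yes"), 400, replace = TRUE),
  sbp       = round(rnorm(400, mean = 132, sd = 16))
)

# Outcome: 100 times the natural log of sbp (log() in R is the natural log)
toy <- toy |> mutate(log_sbp100 = 100 * log(sbp))

# Mean outcome in each insurance-by-acearb cell
cell_means <- toy |>
  group_by(insurance, acearb) |>
  summarise(mean_y = mean(log_sbp100), n = n(), .groups = "drop")

# One line per acearb group; clearly non-parallel lines suggest an interaction
p_means <- ggplot(cell_means,
                  aes(x = insurance, y = mean_y,
                      group = acearb, color = acearb)) +
  geom_point(size = 3) +
  geom_line() +
  labs(x = "Insurance status",
       y = "Mean of 100 * log(sbp)",
       title = "Cell means by insurance and ACE/ARB prescription (toy data)")

p_means
```

Roughly parallel lines are consistent with leaving the interaction out, while lines that cross or diverge noticeably point toward including it. How much departure from parallel counts as meaningful is a judgment you should explain in your sentence or two.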

b. {5 points} Build model fit1 (including an interaction if you chose to include it in part 1a) with the lm() function, and display the resulting coefficient estimates (along with 90% confidence intervals) in an attractive way. Rounding to two decimal places should be fine here.

Tip

We’re hoping you will use the gt() function from the gt package in parts b and c of Question 1.
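As a rough illustration of fitting with lm() and tabulating coefficients with gt(), the sketch below again uses simulated data. Using tidy() from the broom package to pull out estimates and 90% confidence intervals is our assumption rather than a requirement of the Lab, and the names toy, fit1_toy, coef_tab and coef_gt are hypothetical.

```r
library(broom)
library(gt)

# Simulated stand-in for the complete-case data (not the real hbp_3024 values)
set.seed(432)
toy <- data.frame(
  insurance = sample(c("Commercial", "Medicaid", "Medicare", "Uninsured"),
                     400, replace = TRUE),
  acearb    = sample(c("No", "Yes"), 400, replace = TRUE),
  sbp       = round(rnorm(400, mean = 132, sd = 16))
)
toy$log_sbp100 <- 100 * log(toy$sbp)

# insurance * acearb expands to both main effects plus their interaction;
# use insurance + acearb instead if you decided against an interaction
fit1_toy <- lm(log_sbp100 ~ insurance * acearb, data = toy)

# Coefficient estimates with 90% confidence intervals
coef_tab <- tidy(fit1_toy, conf.int = TRUE, conf.level = 0.90)
coef_tab <- coef_tab[, c("term", "estimate", "std.error",
                         "conf.low", "conf.high")]

# Present the table with gt(), rounding numeric columns to two decimals
coef_gt <- coef_tab |>
  gt() |>
  fmt_number(decimals = 2)

coef_gt
```

The conf.level argument is what switches the intervals from the default 95% to 90%.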

c. {5 points} For the fit1 model, use R code to report the number of observations your model uses, the (raw) \(R^2\) value (rounded to four decimal places) and the residual standard error (rounded to two decimal places).
d. {5 points} Use your fit1 model to predict the systolic blood pressure (in mm Hg, rounded to zero decimal places) for each of the following two subjects:

Charlie, who has Medicaid insurance and has not been prescribed an ACE or ARB, and
Delta, who has Commercial insurance and has been prescribed an ACE inhibitor.
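Because the outcome is \(100 \times \log(\text{sbp})\), a prediction from the model has to be divided by 100 and then exponentiated to return to mm Hg. The sketch below shows that step, along with one way to pull out the fit summaries, on simulated data. The insurance labels, the Yes/No coding of acearb and the object names are assumptions for illustration only.

```r
# Simulated stand-in for the complete-case data (not the real hbp_3024 values)
set.seed(432)
toy <- data.frame(
  insurance = sample(c("Commercial", "Medicaid", "Medicare", "Uninsured"),
                     400, replace = TRUE),
  acearb    = sample(c("No", "Yes"), 400, replace = TRUE),
  sbp       = round(rnorm(400, mean = 132, sd = 16))
)
toy$log_sbp100 <- 100 * log(toy$sbp)

fit1_toy <- lm(log_sbp100 ~ insurance * acearb, data = toy)

# Fit summaries: observations used, raw R-squared, residual standard error
n_used <- nobs(fit1_toy)
r2     <- round(summary(fit1_toy)$r.squared, 4)
rse    <- round(sigma(fit1_toy), 2)
c(n = n_used, r_squared = r2, resid_se = rse)

# The two new subjects, described with the same variables the model uses
new_subjects <- data.frame(
  name      = c("Charlie", "Delta"),
  insurance = c("Medicaid", "Commercial"),
  acearb    = c("No", "Yes")
)

# Predictions are on the 100 * log(sbp) scale ...
pred_log100 <- predict(fit1_toy, newdata = new_subjects)

# ... so divide by 100 and exponentiate to get back to mm Hg
new_subjects$pred_sbp <- round(exp(pred_log100 / 100), 0)
new_subjects
```

The summaries could equally be collected with glance() from the broom package and displayed with gt(); the base R functions are used here only to keep the sketch short.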

Question 2 (25 points)

Tip

Your work on Question 2a should use all 3024 observations from the original hbp_3024 data. Questions 2b and 2c will use data from a single practice, with 432 subjects.

a. {5 points} Returning to the original data stored in hbp_3024.xlsx, which of the seven practices has the largest mean number of visits for primary care in the past two years? Show how you figured this out using R.
b. {10 points} For the 432 subjects associated with the practice you identified in part a, predict the probability of a depression diagnosis on the basis of the subject’s number of visits for primary care in the past two years using a logistic regression model. Obtain and then interpret the coefficient of your predictor as an odds ratio, and display its 90% confidence interval. Feel encouraged to round your response to two decimal places.
c. {10 points} Use your model from part b to make a prediction for the probability of a depression diagnosis for a “new” patient at the practice who had 4 primary care visits in the past two years, and state that prediction clearly. How does this prediction compare to the actual fraction of patients at this practice with 4 primary care visits in the past two years who have a depression diagnosis?
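The three steps in this Question can be sketched on simulated data as follows. The column names practice, visits and depr_diag are hypothetical placeholders (consult the data description for the real names), the values are invented, and the interval shown is a Wald interval from confint.default(), which is one reasonable choice among several rather than the required method.

```r
# Simulated data: seven practices, visit counts, and a 0/1 depression flag
set.seed(432)
toy <- data.frame(
  practice = sample(LETTERS[1:7], 700, replace = TRUE),
  visits   = rpois(700, lambda = 5)
)
toy$depr_diag <- rbinom(700, size = 1,
                        prob = plogis(-1.5 + 0.15 * toy$visits))

# Step 1: mean visits by practice, sorted so the largest mean comes first
means <- aggregate(visits ~ practice, data = toy, FUN = mean)
means <- means[order(-means$visits), ]
top_practice <- means$practice[1]

# Step 2: logistic regression within that one practice
one_practice <- toy[toy$practice == top_practice, ]
fit_dep <- glm(depr_diag ~ visits, data = one_practice, family = binomial)

# Exponentiate the slope to get an odds ratio per additional visit,
# with a 90% Wald confidence interval on the same scale
or_visits <- exp(coef(fit_dep)["visits"])
ci_visits <- exp(confint.default(fit_dep, parm = "visits", level = 0.90))
round(c(or_visits, ci_visits), 2)

# Step 3: predicted probability for a new patient with 4 visits ...
p4 <- predict(fit_dep, newdata = data.frame(visits = 4), type = "response")

# ... compared with the observed fraction among subjects with exactly 4 visits
obs4 <- mean(one_practice$depr_diag[one_practice$visits == 4])
round(c(predicted = unname(p4), observed = obs4), 2)
```

An odds ratio above 1 means the odds of a depression diagnosis rise with each additional visit, and one below 1 means they fall. The model-based prediction and the observed fraction will usually differ somewhat, since the model smooths across all visit counts while the observed fraction uses only the subjects with exactly 4 visits.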

Use of AI

If you decide to use some sort of AI to help you with this Lab, we ask that you place a note to that effect, describing what you used and how you used it, as a separate section called “Use of AI”, after your answers to our questions, and just before your presentation of the Session Information. Thank you.

Be sure to include Session Information

Please display your session information at the end of your submission, as shown below.

xfun::session_info()

R version 4.4.3 (2025-02-28 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  base64enc_0.1.3   bslib_0.9.0       cachem_1.1.0      cli_3.6.4        
  compiler_4.4.3    digest_0.6.37     evaluate_1.0.3    fastmap_1.2.0    
  fontawesome_0.5.3 fs_1.6.6          glue_1.8.0        graphics_4.4.3   
  grDevices_4.4.3   highr_0.11        htmltools_0.5.8.1 htmlwidgets_1.6.4
  jquerylib_0.1.4   jsonlite_2.0.0    knitr_1.50        lifecycle_1.0.4  
  memoise_2.0.1     methods_4.4.3     mime_0.13         R6_2.6.1         
  rappdirs_0.3.3    rlang_1.1.6       rmarkdown_2.29    rstudioapi_0.17.1
  sass_0.4.10       stats_4.4.3       tinytex_0.57      tools_4.4.3      
  utils_4.4.3       xfun_0.52         yaml_2.3.10

After the Lab

We will post an answer sketch to our Shared Google Drive 48 hours after the Lab is due.
We will post grades to our Grading Roster on our Shared Google Drive one week after the Lab is due.
See the Lab Appeal Policy in Section 8.5 of our Syllabus if you are interested in having your Lab 1, 2, 3, 4, 5 or 7 grade reviewed, and use the Lab Regrade Request form to complete the task. The form (which is optional) is due when the Calendar says it is. Thank you.