Lab 2

Published

2025-01-09

General Instructions

  • Submit your work via Canvas.
  • The deadline for this Lab is specified on the Course Calendar.
    • We charge a 5 point penalty for a lab that is 1-48 hours late.
    • We do not grade work that is more than 48 hours late.
  • Your response should include a Quarto file (.qmd) and an HTML document that is the result of applying your Quarto file to the data we’ve provided.

Important

You can skip exactly one of Labs 1-5 without penalty, but all students must complete both Lab 6 and Lab 7. If you decide to skip a lab, please submit a note to Canvas by the deadline saying that you are skipping the lab.

Template

There is a Lab 2 Quarto template available on our 432-data page. Please use the template to prepare your response to Lab 2, as it will make things easier for you and for the people grading your work.

Our Best Advice

Review your HTML output file carefully before submission for copy-editing issues (spelling, grammar, and syntax). Even with spell-check in RStudio (just hit F7), such errors are hard to spot in the Quarto file itself, since a file that renders without error can still contain them. You really need to look closely at the resulting HTML output.

The Data

  • The hbp3024.xlsx file is available for download on the 432 data page.
  • A detailed description of each variable in the hbp3024 data is available here.

Question 1 (25 points)

Import the data from the hbp_3024.xlsx file into R, being sure to include NA as a potential missing value when you do, since all missing values are indicated in the Excel file with NA. You should find 8 variables which have at least 1 missing value.
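A minimal sketch of the import step, assuming the Excel file sits in your working directory and that you use the readxl package (neither is required by the Lab). The helper name count_vars_with_missing and the toy data frame are made up for illustration.

```r
# Hypothetical helper: how many columns contain at least one missing value?
count_vars_with_missing <- function(df) {
  sum(colSums(is.na(df)) > 0)
}

path <- "hbp_3024.xlsx"  # assumed location; adjust to wherever you saved the file

if (file.exists(path)) {
  # na = "NA" tells read_excel() to treat cells containing the text NA as missing
  hbp <- readxl::read_excel(path, na = "NA")
  print(count_vars_with_missing(hbp))
}

# Toy check of the helper on a small data frame (not the lab data)
toy <- data.frame(
  x = c(1, NA, 3),
  y = c("a", "b", "c"),
  z = c(NA, NA, TRUE)
)
count_vars_with_missing(toy)  # x and z contain NA, y does not
```

Without the na argument, a column holding NA as text would typically be read as character rather than numeric, which is one reason to set it at import time.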

Please explicitly state in your response that you assume that the missingness you observe in these data is MCAR (missing completely at random), and that a complete case analysis is thus appropriate for this Question.
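One way to set up a complete case analysis is sketched below on toy data. The column names (sbp, insurance, acearb, other) are placeholders, not the actual variable names in the lab data, and restricting to the variables used in the model (rather than to all variables) is one reasonable reading of "complete case", not the only one.

```r
# Toy data with hypothetical column names standing in for the lab variables
toy <- data.frame(
  sbp       = c(128, NA, 141, 135, 150),
  insurance = c("Medicare", "Commercial", NA, "Medicaid", "Medicare"),
  acearb    = c(1, 0, 1, NA, 0),
  other     = c(NA, 2, 3, 4, 5)  # not used in the model
)

model_vars <- c("sbp", "insurance", "acearb")

# Keep rows that are complete on the model variables only
cc_model <- toy[complete.cases(toy[, model_vars]), ]
nrow(cc_model)

# Compare: dropping any row with a missing value in ANY column
cc_all <- na.omit(toy)
nrow(cc_all)
```

The two counts can differ, so it helps to say which definition you used when you report the number of observations.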

  a. {10 points} Describe the impact of insurance status on a subject’s systolic blood pressure, adjusting for whether or not they have a prescription for an ACE inhibitor. To do this, build a linear model using the lm function, after first deciding in a sensible way whether your model should include an interaction term (and providing a graph to help us understand your reasoning). Display your resulting model’s coefficients (along with 90% confidence intervals for the estimates) in an attractive way. Be sure to specify the number of observations your model uses.
    • One graph you might use would be one to assess the need for an interaction term, probably via a plot of means.
    • Another graph (or perhaps table) to consider for insight would look at the relationship between insurance and ACE inhibitor status in these subjects.
  b. {15 points} Provide a written explanation of your findings, in complete sentences. Your explanation should address all of the following:
    • the overall quality of fit (at minimum, the \(R^2\) and residual standard error, interpreted in context),
    • a careful interpretation of the meaning of each of the coefficients of your chosen model, and
    • a detailed description of how you used the output you generated in part a to decide whether or not to include an interaction term.
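The workflow described in part a might be sketched as follows, using simulated data in place of the lab data. Everything here is an assumption for illustration: the variable names (sbp, insurance, acearb), the insurance categories, and the simulated effects. The sketch uses only base R; a nicer coefficient table could be produced with other packages.

```r
set.seed(432)
n <- 300

# Simulated stand-in for the lab data (hypothetical names and levels)
sim <- data.frame(
  insurance = factor(sample(c("Commercial", "Medicaid", "Medicare", "Uninsured"),
                            n, replace = TRUE)),
  acearb    = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
sim$sbp <- 130 + 3 * (sim$insurance == "Medicaid") -
  2 * (sim$acearb == "Yes") + rnorm(n, sd = 15)

# Plot of means: roughly parallel lines suggest little need for an interaction
with(sim, interaction.plot(insurance, acearb, sbp,
                           xlab = "Insurance", ylab = "Mean SBP",
                           trace.label = "ACE inhibitor"))

# Table relating the two predictors (watch for sparse cells)
table(sim$insurance, sim$acearb)

# Main-effects model and the alternative with an interaction term
fit_main <- lm(sbp ~ insurance + acearb, data = sim)
fit_int  <- lm(sbp ~ insurance * acearb, data = sim)

# Coefficients with 90% confidence intervals
ci90 <- confint(fit_main, level = 0.90)
coef_table <- cbind(estimate = coef(fit_main), ci90)
round(coef_table, 2)

# Number of observations used, R-squared, and residual standard error
nobs(fit_main)
summary(fit_main)$r.squared
summary(fit_main)$sigma
```

Note that lm drops rows with missing values in the model variables by default, so nobs() reports the number of complete cases the fit actually used.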

Question 2 (25 points)

  a. {5 points} Again using the data in hbp_3024.xlsx, which of the seven practices has the largest mean number of visits for primary care in the past two years? Show how you figured this out using R.

  b. {10 points} For the 432 subjects associated with the practice you identified in part a, predict the probability of a depression diagnosis on the basis of the subject’s number of visits for primary care in the past two years using a logistic regression model. Obtain and then interpret the coefficient of your predictor as an odds ratio, with a 90% confidence interval.

  c. {10 points} Use your model from part b to make a prediction for the probability of a depression diagnosis for a “new” patient at the practice who had 4 primary care visits in the past two years, and state that prediction clearly. How does this prediction compare to the actual fraction of patients at this practice with 4 primary care visits in the past two years who have a depression diagnosis?
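The three steps of this Question could be sketched as follows on simulated data. The variable names (practice, visits, depression), the practice labels, and the simulated relationship are all hypothetical; the actual lab data will use different names and values.

```r
set.seed(432)

# Simulated stand-in: 7 practices, visit counts, and a 0/1 depression indicator
sim <- data.frame(
  practice = sample(LETTERS[1:7], 700, replace = TRUE),
  visits   = rpois(700, lambda = 4)
)
sim$depression <- rbinom(700, 1, plogis(-1.5 + 0.15 * sim$visits))

# Mean visits by practice, then the practice with the largest mean
means <- aggregate(visits ~ practice, data = sim, FUN = mean)
top <- means$practice[which.max(means$visits)]

# Logistic regression within that practice only
one <- subset(sim, practice == top)
fit <- glm(depression ~ visits, data = one, family = binomial())

# Odds ratio per additional visit, with a 90% confidence interval
or_hat <- exp(coef(fit)["visits"])
or_ci  <- exp(confint(fit, parm = "visits", level = 0.90))

# Predicted probability for a new patient with 4 visits
p_hat <- predict(fit, newdata = data.frame(visits = 4), type = "response")

# Observed fraction with a depression diagnosis among those with exactly 4 visits
p_obs <- mean(one$depression[one$visits == 4])

c(predicted = unname(p_hat), observed = p_obs)
```

Exponentiating the coefficient and its interval endpoints converts from the log-odds scale to the odds ratio scale; type = "response" asks predict to return a probability rather than a log-odds value.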

Use of AI

If you decide to use some sort of AI to help you with this Lab, we ask that you place a note to that effect, describing what you used and how you used it, as a separate section called “Use of AI”, after your answers to our questions, and just before your presentation of the Session Information. Thank you.

Be sure to include Session Information

Please display your session information at the end of your submission, as shown below.

xfun::session_info()

R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  base64enc_0.1.3   bslib_0.8.0       cachem_1.1.0      cli_3.6.3        
  compiler_4.4.2    digest_0.6.37     evaluate_1.0.3    fastmap_1.2.0    
  fontawesome_0.5.3 fs_1.6.5          glue_1.8.0        graphics_4.4.2   
  grDevices_4.4.2   highr_0.11        htmltools_0.5.8.1 htmlwidgets_1.6.4
  jquerylib_0.1.4   jsonlite_1.8.9    knitr_1.49        lifecycle_1.0.4  
  memoise_2.0.1     methods_4.4.2     mime_0.12         R6_2.5.1         
  rappdirs_0.3.3    rlang_1.1.4       rmarkdown_2.29    rstudioapi_0.17.1
  sass_0.4.9        stats_4.4.2       tinytex_0.54      tools_4.4.2      
  utils_4.4.2       xfun_0.50         yaml_2.3.10      

After the Lab

  • We will post an answer sketch to our Shared Google Drive 48 hours after the Lab is due.
  • We will post grades to our Grading Roster on our Shared Google Drive one week after the Lab is due.
  • See the Lab Appeal Policy in our Syllabus if you are interested in having your Lab grade reviewed, and use the Lab Regrade Request form specified there to complete the task. Thank you.