Lab 4

Published

2025-01-09

General Instructions

  • Submit your work via Canvas.
  • The deadline for this Lab is specified on the Course Calendar.
    • We charge a 5 point penalty for a lab that is 1-48 hours late.
    • We do not grade work that is more than 48 hours late.
  • Your response should include a Quarto file (.qmd) and an HTML document that is the result of applying your Quarto file to the data we’ve provided.
Important

You can skip exactly one of Labs 1-5 without penalty, but all students must complete both Lab 6 and Lab 7. If you decide to skip a lab, please submit a note to Canvas by the deadline saying that you are skipping the lab.

Template

You should be able to modify the Lab 3 Quarto template available on our 432-data page to help you do this Lab.

Our Best Advice

Review your HTML output file carefully before submission for copy-editing issues (spelling, grammar, and syntax). Even with spell-check in RStudio (just hit F7), these errors are hard to spot in your Quarto file, because the file will still render without complaint. You really need to look closely at the resulting HTML output.

The Data

  • The hbp3024.xlsx Excel file (first introduced in Lab 2) is available for download on the 432 data page.
  • A detailed description of each variable in the hbp3024 data is available here.

Question 1. (15 points)

Begin with the hbp3024 data, but filter to include only the 1,714 subjects with complete data on the hsgrad and tobacco variables, and who were seen by either the Elm, Highland, Sycamore, or Walnut practices. We will use this sample of 1,714 subjects, which I’ll call the hbp_lab4 data, for the rest of this Lab.
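
As a rough illustration of the filtering logic only, here is a minimal sketch on a six-row toy data frame. It is not the hbp3024 data: every value below is made up, and the practice label "Other" is a hypothetical stand-in for any practice outside the four of interest.

```r
# Toy stand-in rows (all values invented); the real data come from hbp3024.xlsx
toy_raw <- data.frame(
  practice = c("Elm", "Highland", "Sycamore", "Walnut", "Elm", "Other"),
  hsgrad   = c(91.2, NA, 85.0, 78.4, 88.1, 90.0),
  tobacco  = c("Never", "Former", NA, "Current", "Former", "Never")
)

keep <- c("Elm", "Highland", "Sycamore", "Walnut")

# Keep rows with complete hsgrad and tobacco values from the four practices
toy_lab4 <- subset(toy_raw, complete.cases(hsgrad, tobacco) & practice %in% keep)

nrow(toy_lab4)  # 3 rows survive in this toy example
```

The same two conditions carry over to dplyr::filter() if you prefer the tidyverse. With the real data, checking that the filtered result has 1,714 rows is a sensible sanity check.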

Using the hbp_lab4 data, build a logistic regression model to predict whether a subject has a depression diagnosis based on:

  • which of the four practices they receive care from, along with
  • the subject’s age,
  • the subject’s tobacco use status, and
  • the estimated percentage of adults living in the subject’s home neighborhood who have graduated from high school (the hsgrad variable)

Fit two models: one with and one without an interaction term between the practice and the hsgrad value. Include the age variable in each model using a restricted cubic spline with three knots, but without any interaction with the other predictors. Display the coefficients of your two models.
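
Here is a minimal sketch of one way to fit the two models with lrm() from the rms package, run on simulated stand-in data rather than hbp3024. The outcome name depr, the tobacco levels, and every simulated value are assumptions made purely for illustration; substitute the names and data you are actually using.

```r
library(rms)

# Simulated stand-in data; NOT the hbp3024 data. All values are made up.
set.seed(432)
n <- 600
toy <- data.frame(
  practice = factor(sample(c("Elm", "Highland", "Sycamore", "Walnut"), n, replace = TRUE)),
  age      = round(runif(n, 30, 80)),
  tobacco  = factor(sample(c("Current", "Former", "Never"), n, replace = TRUE)),
  hsgrad   = round(runif(n, 60, 98), 1)
)
# Hypothetical 0/1 outcome name; the depression variable in the real data may be named differently.
toy$depr <- rbinom(n, 1, plogis(-1 + 0.01 * (toy$age - 55) - 0.02 * (toy$hsgrad - 80)))

# rms needs the predictor distributions for later summary() and plotting calls
dd <- datadist(toy)
options(datadist = "dd")

# "No interaction" model: age enters through a restricted cubic spline with 3 knots
m_no <- lrm(depr ~ practice + rcs(age, 3) + tobacco + hsgrad,
            data = toy, x = TRUE, y = TRUE)

# "Interaction" model: practice * hsgrad expands to both main effects plus their product
m_int <- lrm(depr ~ practice * hsgrad + rcs(age, 3) + tobacco,
             data = toy, x = TRUE, y = TRUE)

coef(m_no)
coef(m_int)
```

Setting x = TRUE and y = TRUE stores the design matrix and outcome in the fit, which rms needs later for bootstrap validation.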

Question 2. (10 points)

For the “no interaction” model from Question 1, interpret the odds ratio associated with the hsgrad main effect carefully, specifying a 90% confidence interval and what we can conclude from the results.

  • To obtain a 90% confidence interval with a fit using one of the rms fitting functions rather than the default 95% interval, the appropriate code would be summary(modelname, conf.int = 0.9).
  • Hint: We assume you will describe the hsgrad main effect by considering the case of Harry and Sally. Harry lives in a neighborhood with an hsgrad value equal to the 75th percentile hsgrad value in the data. Sally has an hsgrad value equal to the 25th percentile hsgrad value in the data. Assume Harry and Sally are the same age and receive care at the same practice and have the same tobacco use status. So the odds ratio of interest here compares the odds of a depression diagnosis for Harry to the odds of a depression diagnosis for Sally.
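
To see where the Harry-versus-Sally odds ratio comes from in the "no interaction" model, note that hsgrad enters that model as a single linear term, so everything Harry and Sally share cancels. The symbols below are our own notation for this sketch: \(\hat{\beta}_{\text{hsgrad}}\) is the fitted hsgrad slope, and \(h_{75}\) and \(h_{25}\) are the 75th and 25th percentile hsgrad values.

\[
\widehat{\mathrm{OR}}_{\text{Harry vs. Sally}}
  = \frac{\text{odds}(\text{Harry})}{\text{odds}(\text{Sally})}
  = \exp\!\left\{ \hat{\beta}_{\text{hsgrad}} \, (h_{75} - h_{25}) \right\}
\]

By default, summary() applied to an rms fit reports the effect of a continuous predictor across exactly this interquartile range, provided datadist has been set up.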

Question 3. (15 points)

Now using the “interaction” model from Question 1, please interpret the effect of hsgrad on the odds of a depression diagnosis appropriately, specifying again what we can conclude from the results. A detailed description of the point estimate(s) will be sufficient here.

  • Again, we want you to describe the hsgrad effect by considering the case of Harry and Sally. Harry lives in a neighborhood with an hsgrad value equal to the 75th percentile hsgrad value in the data. Sally has an hsgrad value equal to the 25th percentile hsgrad value in the data. Assume Harry and Sally are the same age, receive care at the same practice, and have the same tobacco use status. So the odds ratio of interest here compares the odds of a depression diagnosis for Harry to the odds of a depression diagnosis for Sally. But now, you need to be able to do this separately for each individual level of practice, since practice interacts with hsgrad. There are at least two ways to accomplish this.
    • In one approach, you would create predicted odds values for Harry and Sally, assuming a common age (40 would be a reasonable choice, and it’s the one used in the answer sketch), but creating four different versions of Harry and Sally (one for each practice). Then use those predicted odds within each practice to obtain practice-specific odds ratios.
    • In the other approach, you could convince the rms package to use a different practice as the choice for which adjustments are made. By default, datadist chooses the modal practice. To change this, you’d need to tell datadist to choose the first level of the practice factor instead, and relevel the practice factor accordingly. So, if you’d releveled practice so that Elm came first and placed the result into a tibble called dataelm, you could use the following adjustment to the datadist call to ensure that the adjustments made by datadist used Elm instead of the modal practice.
d_elm <- datadist(dataelm, adjto.cat = "first")
options(datadist = "d_elm")
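
The first approach (predicted odds for four versions of Harry and Sally) might look roughly like the sketch below. It runs on simulated stand-in data, not hbp3024. The outcome name depr, the tobacco levels, and the choice to hold tobacco at "Never" are all assumptions made for illustration.

```r
library(rms)

# Simulated stand-in data; NOT the hbp3024 data. All values are made up.
set.seed(432)
n <- 600
toy <- data.frame(
  practice = factor(sample(c("Elm", "Highland", "Sycamore", "Walnut"), n, replace = TRUE)),
  age      = round(runif(n, 30, 80)),
  tobacco  = factor(sample(c("Current", "Former", "Never"), n, replace = TRUE)),
  hsgrad   = round(runif(n, 60, 98), 1)
)
toy$depr <- rbinom(n, 1, plogis(-1 + 0.01 * (toy$age - 55) - 0.02 * (toy$hsgrad - 80)))  # hypothetical outcome

dd <- datadist(toy)
options(datadist = "dd")

m_int <- lrm(depr ~ practice * hsgrad + rcs(age, 3) + tobacco,
             data = toy, x = TRUE, y = TRUE)

# 25th and 75th percentiles of hsgrad (Sally and Harry, respectively)
q <- quantile(toy$hsgrad, c(0.25, 0.75))

# One row per practice, with a common age and tobacco status
practices <- levels(toy$practice)
base <- data.frame(
  practice = factor(practices, levels = practices),
  age      = 40,
  tobacco  = factor("Never", levels = levels(toy$tobacco))
)
harry <- transform(base, hsgrad = q[["75%"]])
sally <- transform(base, hsgrad = q[["25%"]])

# Linear predictors are log odds; their difference is the log odds ratio
lp_h <- predict(m_int, newdata = harry, type = "lp")
lp_s <- predict(m_int, newdata = sally, type = "lp")

or_by_practice <- setNames(exp(lp_h - lp_s), practices)
or_by_practice
```

Because age and tobacco do not interact with hsgrad in this model, the common values chosen for them cancel out of each odds ratio; only the practice changes the result.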

Question 4. (10 points)

Now, use bootstrap validation to compare the effectiveness of your two fitted models (the “interaction” and “no interaction” models) from Question 1 and draw a reasoned conclusion about which of those two models is more effective in describing the available set of observations (after those without hsgrad or tobacco data are removed) from these four practices.

An appropriate response will make use of at least two different validated assessments of fit quality. The natural choices for validated assessments of fit quality in Question 4 are a bootstrap-validated C statistic and a bootstrap-validated Nagelkerke \(R^2\). Use 2025 as your random seed here and 40 bootstrap replications.

Be sure to justify your eventual selection (between the “interaction” or “no interaction” model) with complete sentences.
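
A minimal sketch of the validation step, using validate() from the rms package on simulated stand-in data (not hbp3024), follows. The outcome name depr and all simulated values are assumptions for illustration, and resetting the seed before each validate() call is simply one reasonable choice. The validated C statistic is obtained from the corrected Somers' Dxy as C = 0.5 + Dxy/2.

```r
library(rms)

# Simulated stand-in data; NOT the hbp3024 data. All values are made up.
set.seed(432)
n <- 600
toy <- data.frame(
  practice = factor(sample(c("Elm", "Highland", "Sycamore", "Walnut"), n, replace = TRUE)),
  age      = round(runif(n, 30, 80)),
  tobacco  = factor(sample(c("Current", "Former", "Never"), n, replace = TRUE)),
  hsgrad   = round(runif(n, 60, 98), 1)
)
toy$depr <- rbinom(n, 1, plogis(-1 + 0.01 * (toy$age - 55) - 0.02 * (toy$hsgrad - 80)))  # hypothetical outcome

dd <- datadist(toy)
options(datadist = "dd")

# x = TRUE, y = TRUE are required so validate() can resample the fit
m_no  <- lrm(depr ~ practice + rcs(age, 3) + tobacco + hsgrad,
             data = toy, x = TRUE, y = TRUE)
m_int <- lrm(depr ~ practice * hsgrad + rcs(age, 3) + tobacco,
             data = toy, x = TRUE, y = TRUE)

# Bootstrap validation with 40 replications, seed 2025 for each model
set.seed(2025)
v_no <- validate(m_no, method = "boot", B = 40)
set.seed(2025)
v_int <- validate(m_int, method = "boot", B = 40)

# Pull the optimism-corrected C statistic and Nagelkerke R-square
corrected <- function(v) {
  c(C  = 0.5 + v["Dxy", "index.corrected"] / 2,
    R2 = v["R2", "index.corrected"])
}

val_table <- rbind(no_interaction = corrected(v_no),
                   interaction    = corrected(v_int))
val_table
```

Larger corrected values of either measure suggest better performance in new data, but you still need to explain your choice in complete sentences.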

Use of AI

If you decide to use some sort of AI to help you with this Lab, we ask that you place a note to that effect, describing what you used and how you used it, as a separate section called “Use of AI”, after your answers to our questions, and just before your presentation of the Session Information. Thank you.

Be sure to include Session Information

Please display your session information at the end of your submission, as shown below.

xfun::session_info()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  base64enc_0.1.3   bslib_0.8.0       cachem_1.1.0      cli_3.6.3        
  compiler_4.4.2    digest_0.6.37     evaluate_1.0.3    fastmap_1.2.0    
  fontawesome_0.5.3 fs_1.6.5          glue_1.8.0        graphics_4.4.2   
  grDevices_4.4.2   highr_0.11        htmltools_0.5.8.1 htmlwidgets_1.6.4
  jquerylib_0.1.4   jsonlite_1.8.9    knitr_1.49        lifecycle_1.0.4  
  memoise_2.0.1     methods_4.4.2     mime_0.12         R6_2.5.1         
  rappdirs_0.3.3    rlang_1.1.4       rmarkdown_2.29    rstudioapi_0.17.1
  sass_0.4.9        stats_4.4.2       tinytex_0.54      tools_4.4.2      
  utils_4.4.2       xfun_0.50         yaml_2.3.10      

After the Lab

  • We will post an answer sketch to our Shared Google Drive 48 hours after the Lab is due.
  • We will post grades to our Grading Roster on our Shared Google Drive one week after the Lab is due.
  • See the Lab Appeal Policy in our Syllabus if you are interested in having your Lab grade reviewed, and use the Lab Regrade Request form specified there to complete the task. Thank you.