library(janitor)
library(tidyverse)
knitr::opts_chunk$set(comment = NA)
oh22 <- read_csv("https://raw.githubusercontent.com/THOMASELOVE/432-data/master/data/oh_counties_2022.csv", show_col_types = FALSE) |>
  clean_names() |>
  mutate(fips = as.character(fips))
Lab 4
General Instructions
- Submit your work via Canvas.
- The deadline for this Lab is specified on the Calendar.
- Work submitted more than 59 minutes late, but within 12 hours of the deadline, will lose 5 of the available 50 points.
- Work submitted 12 to 24 hours after the deadline will lose 10 of the available 50 points.
- Work submitted more than 24 hours after the deadline will not be graded.
Your response should include a Quarto file (.qmd) and an HTML document that is the result of applying your Quarto file to the data we’ve provided. While we have not provided a specific template for this Lab, we encourage you to adapt the one provided for Lab 2.
Question 1. (10 points)
This question uses the oh22 data developed in Lab 1. See the Lab 1 instructions for details on the data set. Recall that in Lab 1 we loaded the data with the code shown at the top of this document.
Use the oh22 data to create a logistic regression model to predict the presence of a water violation (as contained in h2oviol) on the basis of sev_housing and pm2.5. Use a model with main effects only, and annotate your code with text so that it’s extremely clear what you are doing. Specify and then carefully interpret the estimated odds ratio associated with the sev_housing effect and a 90% confidence interval around that estimate in context using complete English sentences. Be sure to get the direction of the effect right in your modeling and description.
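As a rough illustration of the mechanics only (not the answer sketch), the code below fits a main-effects logistic regression to simulated data and converts the coefficients to odds ratios with 90% Wald intervals. The data frame toy, the column names viol and pm25, and all simulated values are hypothetical stand-ins. With the real data you would need to check how h2oviol is coded so that the model predicts the presence, not the absence, of a violation.

```r
# Simulated stand-in for county-level data; every name and value is hypothetical.
set.seed(432)
n <- 88
toy <- data.frame(
  sev_housing = runif(n, 8, 20),  # hypothetical percentages
  pm25        = runif(n, 7, 12)   # hypothetical particulate levels
)
# Outcome coded 1 = violation present, 0 = absent, so that the model
# describes the odds of a violation being PRESENT.
toy$viol <- rbinom(n, 1, plogis(-2 + 0.15 * toy$sev_housing))

# Main-effects logistic regression (no interaction term)
fit <- glm(viol ~ sev_housing + pm25, data = toy, family = binomial())

# Exponentiate to move from the log-odds scale to odds ratios
or_est <- exp(coef(fit))
or_ci  <- exp(confint.default(fit, level = 0.90))  # 90% Wald intervals

cbind(OR = or_est, or_ci)
```

Each odds ratio here describes the multiplicative change in the odds of the outcome for a one-unit increase in that predictor, holding the other predictor constant.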
Question 2. (10 points)
Begin with the hbp3456 data we developed in Lab 2, but now restricted to the following four practices: Center, Elm, Plympton and Walnut, and to subjects with complete data on the ldl and statin variables. We will use these data (which should now include 1446 rows of data) for Questions 2-5 in this Lab.
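One way to carry out this kind of restriction is sketched below on a tiny made-up data frame. The rows are invented, and "Highland" is used here only as a hypothetical example of a practice outside the four of interest.

```r
library(dplyr)

# Tiny invented example; not the hbp3456 data.
toy <- data.frame(
  practice = c("Center", "Elm", "Plympton", "Walnut", "Highland", "Elm"),
  ldl      = c(120, NA, 95, 140, 110, 88),
  statin   = c(1, 0, NA, 1, 0, 1)
)

toy_small <- toy |>
  # keep only the four practices of interest
  filter(practice %in% c("Center", "Elm", "Plympton", "Walnut")) |>
  # keep only rows with both ldl and statin observed
  filter(complete.cases(ldl, statin)) |>
  # rebuild the factor so unused practice levels do not linger
  mutate(practice = factor(practice))

nrow(toy_small)
```

Checking the row count after filtering, as in the last line, is a quick way to confirm that the restriction did what you intended.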
Build a logistic regression model to predict whether a subject seen in one of those four practices has a statin prescription based on:
- the subject’s current LDL cholesterol level
- which of the four practices they receive care from, along with
- the subject’s age.
Fit two models: one with and one without an interaction term between the practice and the LDL level. Include the age variable in each model using a restricted cubic spline with four knots, but without any interaction with the other predictors. Display the coefficients of your two models.
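The general shape of such a pair of fits, using lrm from the rms package, might look like the following. This is a sketch on simulated data, so the data frame sim, the object names m_noint and m_int, and every simulated value are assumptions made for illustration.

```r
library(rms)

# Simulated stand-in data; all values are hypothetical.
set.seed(432)
n <- 500
sim <- data.frame(
  ldl      = round(rnorm(n, mean = 112, sd = 35)),
  age      = round(runif(n, 25, 80)),
  practice = factor(sample(c("Center", "Elm", "Plympton", "Walnut"),
                           n, replace = TRUE))
)
sim$statin <- rbinom(n, 1, plogis(-2 + 0.008 * sim$ldl + 0.02 * sim$age))

# rms functions expect a datadist object to be registered
dd <- datadist(sim)
options(datadist = "dd")

# No interaction: ldl and practice as main effects, age via a 4-knot spline
m_noint <- lrm(statin ~ ldl + practice + rcs(age, 4),
               data = sim, x = TRUE, y = TRUE)

# Interaction: ldl * practice expands to both main effects plus their product
m_int <- lrm(statin ~ ldl * practice + rcs(age, 4),
             data = sim, x = TRUE, y = TRUE)

coef(m_noint)
coef(m_int)
```

Setting x = TRUE and y = TRUE stores the design matrix and outcome in the fit, which rms needs later for bootstrap validation.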
Question 3. (10 points)
For the “no interaction” model from Question 2, interpret the odds ratio associated with the ldl main effect carefully, specifying a 90% confidence interval and what we can conclude from the results.
- To obtain a 90% confidence interval with a fit using one of the rms fitting functions rather than the default 95% interval, the appropriate code would be summary(modelname, conf.int = 0.9).
- Hint: We assume you will describe the ldl main effect by considering the case of Harry and Sally. Harry has an ldl value of 142, equal to the 75th percentile ldl value in the data. Sally has an ldl value of 85, equal to the 25th percentile ldl value in the data. Assume Harry and Sally are the same age and receive care at the same practice. So the odds ratio of interest here compares the odds of statin prescription for Harry to the odds of statin prescription for Sally.
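To make the Harry-versus-Sally comparison concrete, here is a sketch on simulated data showing two routes to the same quantity: the rms summary call with a 90% interval, and a hand calculation that multiplies the ldl coefficient by the 57-point difference (142 minus 85) before exponentiating. The data and the object names sim and fit are hypothetical, and passing ldl = c(85, 142) simply pins the comparison to the two values named in the hint.

```r
library(rms)

# Simulated stand-in data; all values are hypothetical.
set.seed(432)
n <- 500
sim <- data.frame(
  ldl      = round(rnorm(n, mean = 112, sd = 35)),
  age      = round(runif(n, 25, 80)),
  practice = factor(sample(c("Center", "Elm", "Plympton", "Walnut"),
                           n, replace = TRUE))
)
sim$statin <- rbinom(n, 1, plogis(-2 + 0.008 * sim$ldl + 0.02 * sim$age))

dd <- datadist(sim)
options(datadist = "dd")

fit <- lrm(statin ~ ldl + practice + rcs(age, 4), data = sim)

# Route 1: let rms report effects with 90% limits,
# comparing ldl = 142 (Harry) to ldl = 85 (Sally)
s <- summary(fit, ldl = c(85, 142), conf.int = 0.90)
print(s)

# Route 2: by hand, on the log-odds scale and then exponentiated
b  <- coef(fit)["ldl"]
se <- sqrt(diag(vcov(fit)))["ldl"]
z  <- qnorm(0.95)                      # two-sided 90% Wald interval
or_hand <- exp(57 * b)                 # Harry's odds divided by Sally's odds
ci_hand <- exp(57 * (b + c(-1, 1) * z * se))

c(OR = unname(or_hand), lower = ci_hand[1], upper = ci_hand[2])
```

Because there is no interaction in this model, age and practice cancel out of the comparison, so the same odds ratio applies whatever common age and practice Harry and Sally share.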
Question 4. (10 points)
Now using the “interaction” model from Question 2, please interpret the effect of ldl on the odds of a statin prescription appropriately, specifying again what we can conclude from the results. A detailed description of the point estimate(s) will be sufficient here.
- Here, we want you to describe the ldl main effect by considering the case of Harry and Sally. Harry has an ldl value of 142, equal to the 75th percentile ldl value in the data. Sally has an ldl value of 85, equal to the 25th percentile ldl value in the data. Assume Harry and Sally are the same age and receive care at the same practice. So the odds ratio of interest here compares the odds of statin prescription for Harry to the odds of statin prescription for Sally. But now, you need to be able to do this separately for each individual level of practice, since practice interacts with ldl. There are at least two ways to accomplish this.
  - In one approach, you would create predicted odds values for Harry and Sally, assuming a common age (40 would be a reasonable choice, and it’s the one used in the answer sketch) with ldl set to 142 for Harry and 85 for Sally, but creating four different versions of Harry and Sally (one for each practice). Then use those predicted odds within each practice to obtain practice-specific odds ratios.
  - In the other approach, you could convince the rms package to use a different practice as the choice for which adjustments are made. By default, datadist chooses the modal practice. To change this, you’d need to convince datadist instead to choose its practice based on which practice is the first one, and relevel the practice factor accordingly. So, if you’d releveled the practice data so that Elm was first and placed that into a tibble called dataelm, you could use the following adjustment to the datadist call to ensure that the adjustments made by datadist used Elm instead of the modal practice.
d_elm <- datadist(dataelm, adjto.cat = "first")
options(datadist = "d_elm")
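The first approach (predicted odds for four versions of Harry and Sally) can be sketched as follows on simulated data. The data, the object names sim, m_int and grid, and the simulated effect sizes are all hypothetical. The sketch works on the log-odds scale and exponentiates the Harry-minus-Sally difference, which is equivalent to dividing Harry's predicted odds by Sally's.

```r
library(rms)

# Simulated stand-in data; all values are hypothetical.
set.seed(432)
n <- 800
sim <- data.frame(
  ldl      = round(rnorm(n, mean = 112, sd = 35)),
  age      = round(runif(n, 25, 80)),
  practice = factor(sample(c("Center", "Elm", "Plympton", "Walnut"),
                           n, replace = TRUE))
)
sim$statin <- rbinom(n, 1, plogis(-2 + 0.008 * sim$ldl + 0.02 * sim$age))

dd <- datadist(sim)
options(datadist = "dd")

m_int <- lrm(statin ~ ldl * practice + rcs(age, 4), data = sim)

# Four Harrys (ldl = 142) and four Sallys (ldl = 85), all aged 40
grid <- expand.grid(ldl = c(85, 142), age = 40,
                    practice = levels(sim$practice))
grid$log_odds <- predict(m_int, newdata = grid, type = "lp")

harry <- grid[grid$ldl == 142, ]
sally <- grid[grid$ldl == 85, ]

# Within each practice: odds(Harry) / odds(Sally) = exp(difference in log odds)
or_by_practice <- exp(harry$log_odds - sally$log_odds)
names(or_by_practice) <- as.character(harry$practice)
or_by_practice
```

Since age enters without interaction, the choice of 40 as the common age does not change these ratios. Only the practice does.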
Question 5. (10 points)
Now, compare the effectiveness of your two fitted models (the “interaction” and “no interaction” models) from Question 2 and draw a reasoned conclusion about which of those two models is more effective in describing the available set of observations (after those without statin data are removed) from these four practices. An appropriate response will make use of at least two different validated assessments of fit quality. Be sure to justify your eventual selection (between the “interaction” or “no interaction” model) with complete sentences.
- The natural choices for validated assessments of fit quality in Question 5 are a bootstrap-validated C statistic and a bootstrap-validated Nagelkerke \(R^2\). In the answer sketch, we will use 2023 as our random seed for this work, and we’ll use the default number of bootstrap replications.
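A sketch of how these two validated summaries could be pulled from rms::validate is shown below, again on simulated data with hypothetical object names. The validated C statistic is recovered from the optimism-corrected Somers' Dxy as C = 0.5 + Dxy / 2. Resetting the seed before each validate call is a choice made in this sketch, not something the Lab specifies.

```r
library(rms)

# Simulated stand-in data; all values are hypothetical.
set.seed(432)
n <- 800
sim <- data.frame(
  ldl      = round(rnorm(n, mean = 112, sd = 35)),
  age      = round(runif(n, 25, 80)),
  practice = factor(sample(c("Center", "Elm", "Plympton", "Walnut"),
                           n, replace = TRUE))
)
sim$statin <- rbinom(n, 1, plogis(-2 + 0.008 * sim$ldl + 0.02 * sim$age))

dd <- datadist(sim)
options(datadist = "dd")

# validate() needs the design matrix and outcome stored in the fit
m_noint <- lrm(statin ~ ldl + practice + rcs(age, 4),
               data = sim, x = TRUE, y = TRUE)
m_int   <- lrm(statin ~ ldl * practice + rcs(age, 4),
               data = sim, x = TRUE, y = TRUE)

set.seed(2023)
v_noint <- validate(m_noint)  # default number of bootstrap replications
set.seed(2023)
v_int   <- validate(m_int)

# Pull the optimism-corrected values; convert Dxy to a C statistic
validated <- function(v) {
  c(C  = 0.5 + v["Dxy", "index.corrected"] / 2,
    R2 = v["R2", "index.corrected"])
}

res <- rbind(no_interaction = validated(v_noint),
             interaction    = validated(v_int))
res
```

Placing the two models side by side in one table makes it easier to write the comparison in complete sentences.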
Our Best Advice
Review your HTML output file carefully before submission for copy-editing issues (spelling, grammar and syntax). Even with spell-check in RStudio (just hit F7), these errors are hard to spot in your Quarto file so long as it runs without complaint. You really need to look closely at the resulting HTML output.
Use of AI
If you decide to use some sort of AI to help you with this Lab, we ask that you place a note to that effect, describing what you used and how you used it, as a separate section called “Use of AI”, after your answers to our questions, and just before your presentation of the Session Information. Thank you.
Session Information
Please display your session information at the end of your submission, as shown below.
xfun::session_info()
R version 4.3.3 (2024-02-29 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)
Locale:
LC_COLLATE=English_United States.utf8
LC_CTYPE=English_United States.utf8
LC_MONETARY=English_United States.utf8
LC_NUMERIC=C
LC_TIME=English_United States.utf8
Package version:
askpass_1.2.0 backports_1.4.1 base64enc_0.1.3
bit_4.0.5 bit64_4.0.5 blob_1.2.4
broom_1.0.5 bslib_0.7.0 cachem_1.0.8
callr_3.7.6 cellranger_1.1.0 cli_3.6.2
clipr_0.8.0 colorspace_2.1-0 compiler_4.3.3
conflicted_1.2.0 cpp11_0.4.7 crayon_1.5.2
curl_5.2.1 data.table_1.15.4 DBI_1.2.2
dbplyr_2.5.0 digest_0.6.35 dplyr_1.1.4
dtplyr_1.3.1 ellipsis_0.3.2 evaluate_0.23
fansi_1.0.6 farver_2.1.1 fastmap_1.1.1
fontawesome_0.5.2 forcats_1.0.0 fs_1.6.3
gargle_1.5.2 generics_0.1.3 ggplot2_3.5.0
glue_1.7.0 googledrive_2.1.1 googlesheets4_1.1.1
graphics_4.3.3 grDevices_4.3.3 grid_4.3.3
gtable_0.3.4 haven_2.5.4 highr_0.10
hms_1.1.3 htmltools_0.5.8.1 htmlwidgets_1.6.4
httr_1.4.7 ids_1.0.1 isoband_0.2.7
janitor_2.2.0 jquerylib_0.1.4 jsonlite_1.8.8
knitr_1.46 labeling_0.4.3 lattice_0.22.6
lifecycle_1.0.4 lubridate_1.9.3 magrittr_2.0.3
MASS_7.3.60.0.1 Matrix_1.6.5 memoise_2.0.1
methods_4.3.3 mgcv_1.9.1 mime_0.12
modelr_0.1.11 munsell_0.5.1 nlme_3.1.164
openssl_2.1.1 parallel_4.3.3 pillar_1.9.0
pkgconfig_2.0.3 prettyunits_1.2.0 processx_3.8.4
progress_1.2.3 ps_1.7.6 purrr_1.0.2
R6_2.5.1 ragg_1.3.0 rappdirs_0.3.3
RColorBrewer_1.1.3 readr_2.1.5 readxl_1.4.3
rematch_2.0.0 rematch2_2.1.2 reprex_2.1.0
rlang_1.1.3 rmarkdown_2.26 rstudioapi_0.16.0
rvest_1.0.4 sass_0.4.9 scales_1.3.0
selectr_0.4.2 snakecase_0.11.1 splines_4.3.3
stats_4.3.3 stringi_1.8.3 stringr_1.5.1
sys_3.4.2 systemfonts_1.0.6 textshaping_0.3.7
tibble_3.2.1 tidyr_1.3.1 tidyselect_1.2.1
tidyverse_2.0.0 timechange_0.3.0 tinytex_0.50
tools_4.3.3 tzdb_0.4.0 utf8_1.2.4
utils_4.3.3 uuid_1.2.0 vctrs_0.6.5
viridisLite_0.4.2 vroom_1.6.5 withr_3.0.0
xfun_0.43 xml2_1.3.6 yaml_2.3.8
After the Lab
We will post an answer sketch 24 hours after the Lab is due.
We will post grades to our Grading Roster on our Shared Google Drive one week after the Lab is due.
See the Lab Appeal Policy in our Syllabus if you are interested in having your Lab grade reviewed, and use the Lab Regrade Request form specified there to complete the task. Thank you.