hbp_q3 <- hbp_3024 |>
  select(record, practice, dbp, age, ldl, tobacco, insurance, betab) |>
  mutate(across(where(is.character), as_factor)) |>
  mutate(record = as.character(record))

str(hbp_q3)

miss_var_summary(hbp_q3) |> filter(n_miss > 0)

hbp_q3_cc <- hbp_q3 |> drop_na()

dim(hbp_q3_cc)
n_miss(hbp_q3_cc)

hbp_train <- hbp_q3_cc |>
  filter(practice %in% c("Highland", "King", "Plympton", "Sycamore"))

hbp_test <- hbp_q3_cc |>
  filter(practice %in% c("Center", "Elm", "Walnut"))

dim(hbp_train); dim(hbp_test)

Lab 4
General Instructions
- Submit your work via Canvas.
- The deadline for this Lab is specified on the Course Calendar.
- We charge a 5 point penalty for a lab that is 1-48 hours late.
- We do not grade work that is more than 48 hours late.
- Your response should include a Quarto file (.qmd) and an HTML document that is the result of applying your Quarto file to the data we’ve provided.
You can skip exactly one of Labs 1-5 without penalty, but all students must complete both Lab 6 and Lab 7. If you decide to skip a lab, please submit a note to Canvas by the deadline saying that you are skipping the lab.
Template
You should be able to modify any of the first three Lab templates available on our 432-data page to help you do this Lab. Feel encouraged to try a different HTML theme if you like, maybe yeti or spacelab or materia.
In my answer sketch for Lab 4, I used the following R packages:
janitor, naniar, broom, car, caret, gt, mice, readxl, rms, easystats and tidyverse
in case that is useful for you to know.
Our Best Advice
Review your HTML output file carefully before submission for copy-editing issues (spelling, grammar and syntax.) Even with spell-check in RStudio (just hit F7), it’s hard to find errors with these issues in your Quarto file so long as it is running. You really need to look closely at the resulting HTML output.
The Data
- The hbp_3024.xlsx Excel file (first introduced in Lab 2) is available for download on the 432 data page. Be sure that you see 8 variables with missing values when you impute the data.
- A detailed description of each variable in the hbp_3024 data is available here.
Question 1 (10 points)
Begin with the hbp_3024 data, but include only the 1,296 subjects who were seen by the Center, Elm, or Walnut practices, and only the following seven variables: record, practice, age, tobacco, hsgrad, income and depr_diag.
- Be sure at this point that depr_diag, tobacco and practice are factor variables.
- Be sure that age and hsgrad and income are numeric variables (they should be.)
- Be sure that record is a character variable.
Next, develop a single imputation strategy using the mice package and a seed of 2025 to account for missing values. Call the resulting imputed data set hbp_1. Show your R code, and demonstrate that the following three things are true…
- your hbp_1 data has no missing values,
- within hbp_1, the practices have 88, 121 and 156 subjects with a depression diagnosis, respectively, and
- both the mean and standard deviation of income are lower in the subjects with a depression diagnosis than in the subjects without.
You’ll use the hbp_1 data in Questions 1 and 2 of this Lab.
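The single-imputation step can be sketched on made-up data as follows. This is only an illustration under stated assumptions, not the answer sketch for this Lab: the `toy` data frame and its values are invented, the ID column is left out (how `mice` treats a character `record` column is something to check on the real data), and the default imputation methods chosen by `mice` are accepted as they are.

```r
library(mice)

# Toy data (hypothetical): NOT the hbp_3024 data, only similar kinds of columns
set.seed(432)
n <- 200
toy <- data.frame(
  practice  = factor(sample(c("Center", "Elm", "Walnut"), n, replace = TRUE)),
  age       = round(runif(n, 30, 80)),
  income    = round(rnorm(n, mean = 40000, sd = 12000)),
  depr_diag = factor(sample(c("No", "Yes"), n, replace = TRUE,
                            prob = c(0.7, 0.3)))
)
toy$income[sample(n, 25)] <- NA   # introduce some missing values
toy$age[sample(n, 10)]    <- NA

# Single imputation: m = 1 builds one completed data set,
# and the seed makes the result reproducible
imp   <- mice(toy, m = 1, seed = 2025, printFlag = FALSE)
toy_1 <- mice::complete(imp, 1)

# Checks analogous to the three requested for hbp_1
sum(is.na(toy_1))                         # count of remaining missing values
table(toy_1$practice, toy_1$depr_diag)    # diagnosis counts within each practice
tapply(toy_1$income, toy_1$depr_diag, mean)
tapply(toy_1$income, toy_1$depr_diag, sd)
```

With the real data, the same three checks would be run on `hbp_1` and compared to the counts listed above.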
Question 2 (15 points)
Using the hbp_1 data, build a logistic regression model, called fit1, to predict whether a subject has a depression diagnosis based on the main effects of four variables:
- which of the practices they receive care from, along with
- the subject’s age,
- the subject’s tobacco use status, and
- the estimated percentage of adults living in the subject’s home neighborhood who have graduated from high school (the hsgrad variable)
a. {5} What is the area under the ROC curve for fit1? Interpret the meaning of that C statistic in a complete English sentence.
b. {5} Drop hsgrad from fit1 and call the resulting model fit2. What do the AIC and BIC statistics for fit2 suggest to you as compared to those values in fit1?
c. {5} For this question, use the model (fit1 or fit2) which looks better to you according to your response to Question 2b. Use a decision rule that you will predict a depression diagnosis if the predicted probability of such a diagnosis according to your model exceeds 0.4. Now, across the hbp_1 sample, specify the sensitivity, specificity and accuracy of this prediction rule for your chosen model.
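The three summaries asked for in Question 2 can be illustrated with base R on simulated data. This is a hedged sketch, not the answer sketch: the `toy` data, the model names `fit_a` and `fit_b`, and the helper `c_stat()` are all hypothetical, and the C statistic is computed here from the rank-sum identity rather than with any particular package.

```r
# Toy data (hypothetical), standing in for hbp_1
set.seed(432)
n <- 500
toy <- data.frame(
  age     = round(runif(n, 30, 80)),
  tobacco = factor(sample(c("never", "former", "current"), n, replace = TRUE)),
  hsgrad  = round(runif(n, 60, 100), 1)
)
# Simulated binary outcome: depends on age and tobacco, not on hsgrad
lin_pred <- -1.5 + 0.02 * toy$age + 0.6 * (toy$tobacco == "current")
toy$depr <- rbinom(n, size = 1, prob = plogis(lin_pred))

fit_a <- glm(depr ~ age + tobacco + hsgrad, data = toy, family = binomial)
fit_b <- update(fit_a, . ~ . - hsgrad)    # same model without hsgrad

# C statistic (area under the ROC curve): the probability that a randomly
# chosen case gets a higher predicted probability than a randomly chosen
# non-case, with ties counted as one half
c_stat <- function(fit) {
  p  <- fitted(fit)
  y  <- fit$y
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
c_stat(fit_a)

# Smaller AIC and BIC indicate the preferred model
AIC(fit_a, fit_b)
BIC(fit_a, fit_b)

# Decision rule: predict a diagnosis when the fitted probability exceeds 0.4
pred_yes <- fitted(fit_a) > 0.4
obs_yes  <- toy$depr == 1
tp <- sum(pred_yes & obs_yes);  fn <- sum(!pred_yes & obs_yes)
tn <- sum(!pred_yes & !obs_yes); fp <- sum(pred_yes & !obs_yes)

sensitivity <- tp / (tp + fn)   # share of true cases that are flagged
specificity <- tn / (tn + fp)   # share of true non-cases that are not flagged
accuracy    <- (tp + tn) / n    # share of all subjects classified correctly
c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy)
```

Writing out the four cells of the confusion matrix by hand keeps it clear which level is being treated as the "positive" result, which is an easy thing to get backwards when a package chooses the reference level for you.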
Question 3 (25 points)
Starting again from the hbp_3024 data, take the following steps (and note the tip below):
1. Create a hbp_q3 data set to include only the following eight variables: record, practice, dbp, age, ldl, tobacco, insurance and betab.
2. Within your hbp_q3 data, ensure that practice, tobacco, insurance and betab are represented using factors, that dbp, age, and ldl are numeric (double-precision is fine), and record is a character variable. You can check this by running glimpse(hbp_q3) or str(hbp_q3).
3. In your hbp_q3 data, ensure that you have 403 missing ldl values, 4 missing tobacco values and no other missing data.
4. Create a “complete case” version of hbp_q3, which you’ll call hbp_q3_cc, which contains only the 2,617 subjects who are not missing data on any of the eight variables in hbp_q3.
5. Partition the hbp_q3_cc data into:
   - a training sample, which I would call hbp_train, including the 1,439 subjects contained in hbp_q3_cc who were seen in the Highland, King, Plympton or Sycamore practices, and
   - a test sample, which I would call hbp_test, consisting of the remaining three practices (Center, Elm, or Walnut practices) and the 1,178 subjects from hbp_q3_cc seen in those practices.
You will build two models to predict diastolic blood pressure (potentially using an outcome transformation) on the basis of:
- in model fit3: the subject’s age, LDL cholesterol level, tobacco status, insurance status, and whether or not they had a beta-blocker prescription.
- in model fit4: the subject’s age and LDL cholesterol level alone.
Here is the code that I used to create all of the necessary samples and check the details listed above. Please feel encouraged to use this code in your response to Lab 4.
You will use the hbp_train data in Questions 3a and 3b, the hbp_test data in Question 3c and then go back to the hbp_q3 data set in Question 3d.
a. {5} Use an appropriate tool to make a decision about a transformation of your outcome, and describe your conclusions from that tool. Use the hbp_train data.
b. {5} Using the transformed outcome you identified in Question 3a, fit models fit3 and fit4, and compare the two models in terms of AIC, BIC, adjusted \(R^2\) and sigma. Use these four summaries to help you decide which of the two models (fit3 or fit4) shows better training sample performance. Again, use the hbp_train data.
c. {5} Compare the performance of your two models (fit3 and fit4) in the test sample (the hbp_test data set), using the following four summaries of prediction error: MAPE, Maximum absolute prediction error, RMSPE and validated \(R^2\). According to these measures, which model looks better in terms of fit in the test sample?
d. {10} Select a winning model (fit3 or fit4) based on your results in parts b and c, and refit that model now using the whole sample from hbp_q3 (including missing values) and 20 multiple imputations developed using the mice package, with the seed 20254323. Specify the point estimate of the LDL effect, as well as its 90% confidence interval, with each rounded to four decimal places, after this imputation process, and interpret the meaning of the point estimate in a complete sentence or two.
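The training-sample and test-sample comparisons in parts b and c can be sketched in base R on simulated data. Several things here are assumptions made only for illustration: the `toy` data and its split, the names `fit_big` and `fit_small`, the two helper functions, and the use of a log transformation (your own conclusion in part a may differ). MAPE is taken to mean the mean absolute prediction error, and validated \(R^2\) the squared correlation between observed and predicted values in the test sample; check both against the course definitions.

```r
# Toy data (hypothetical), standing in for hbp_train and hbp_test
set.seed(432)
n <- 600
toy <- data.frame(
  age   = round(runif(n, 30, 80)),
  ldl   = round(rnorm(n, mean = 110, sd = 30)),
  betab = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
toy$dbp <- round(exp(4.2 + 0.001 * toy$age + 0.0005 * toy$ldl +
                       0.03 * (toy$betab == "Yes") + rnorm(n, sd = 0.12)))
train <- toy[1:400, ]
test  <- toy[401:600, ]

# Assumed for this sketch only: a log transformation of the outcome
fit_big   <- lm(log(dbp) ~ age + ldl + betab, data = train)
fit_small <- lm(log(dbp) ~ age + ldl, data = train)

# Training-sample summaries: AIC, BIC, adjusted R-squared and sigma
train_summary <- function(fit) {
  s <- summary(fit)
  c(AIC = AIC(fit), BIC = BIC(fit),
    adj_r2 = s$adj.r.squared, sigma = s$sigma)
}
rbind(fit_big = train_summary(fit_big), fit_small = train_summary(fit_small))

# Test-sample summaries: predict on the log scale, back-transform with exp(),
# then measure errors on the original dbp scale
test_summary <- function(fit, newdata) {
  pred <- exp(predict(fit, newdata = newdata))
  err  <- newdata$dbp - pred
  c(MAPE         = mean(abs(err)),     # mean absolute prediction error
    max_error    = max(abs(err)),      # maximum absolute prediction error
    RMSPE        = sqrt(mean(err^2)),  # root mean squared prediction error
    validated_r2 = cor(newdata$dbp, pred)^2)
}
test_big   <- test_summary(fit_big, test)
test_small <- test_summary(fit_small, test)
rbind(fit_big = test_big, fit_small = test_small)
```

Back-transforming before computing the errors puts both models' prediction errors in the units of the original outcome, so the test-sample summaries can be read directly as blood pressure differences.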
Use of AI
If you decide to use some sort of AI to help you with this Lab, we ask that you place a note to that effect, describing what you used and how you used it, as a separate section called “Use of AI”, after your answers to our questions, and just before your presentation of the Session Information. Thank you.
Be sure to include Session Information
Please display your session information at the end of your submission, as shown below.
xfun::session_info()
R version 4.4.3 (2025-02-28 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)
Locale:
LC_COLLATE=English_United States.utf8
LC_CTYPE=English_United States.utf8
LC_MONETARY=English_United States.utf8
LC_NUMERIC=C
LC_TIME=English_United States.utf8
Package version:
base64enc_0.1.3 bslib_0.9.0 cachem_1.1.0 cli_3.6.4
compiler_4.4.3 digest_0.6.37 evaluate_1.0.3 fastmap_1.2.0
fontawesome_0.5.3 fs_1.6.6 glue_1.8.0 graphics_4.4.3
grDevices_4.4.3 highr_0.11 htmltools_0.5.8.1 htmlwidgets_1.6.4
jquerylib_0.1.4 jsonlite_2.0.0 knitr_1.50 lifecycle_1.0.4
memoise_2.0.1 methods_4.4.3 mime_0.13 R6_2.6.1
rappdirs_0.3.3 rlang_1.1.6 rmarkdown_2.29 rstudioapi_0.17.1
sass_0.4.10 stats_4.4.3 tinytex_0.57 tools_4.4.3
utils_4.4.3 xfun_0.52 yaml_2.3.10
After the Lab
- We will post an answer sketch to our Shared Google Drive 48 hours after the Lab is due.
- We will post grades to our Grading Roster on our Shared Google Drive one week after the Lab is due.
- See the Lab Appeal Policy in Section 8.5 of our Syllabus if you are interested in having your Lab 1, 2, 3, 4, 5 or 7 grade reviewed, and use the Lab Regrade Request form to complete the task. The form (which is optional) is due when the Calendar says it is. Thank you.