<- hbp_3024 |>
hbp_q3 select(record, practice, dbp, age, ldl, tobacco, insurance, betab) |>
mutate(across(where(is.character), as_factor)) |>
mutate(record = as.character(record))
str(hbp_q3)
miss_var_summary(hbp_q3) |> filter(n_miss > 0)
<- hbp_q3 |> drop_na()
hbp_q3_cc
dim(hbp_q3_cc)
n_miss(hbp_q3_cc)
<- hbp_q3_cc |>
hbp_train filter(practice %in% c("Highland", "King", "Plympton", "Sycamore"))
<- hbp_q3_cc |>
hbp_test filter(practice %in% c("Center", "Elm", "Walnut"))
dim(hbp_train); dim(hbp_test)
Lab 4
General Instructions
- Submit your work via Canvas.
- The deadline for this Lab is specified on the Course Calendar.
- We charge a 5 point penalty for a lab that is 1-48 hours late.
- We do not grade work that is more than 48 hours late.
- Your response should include a Quarto file (.qmd) and an HTML document that is the result of applying your Quarto file to the data we’ve provided.
You can skip exactly one of Labs 1-5 without penalty, but all students must complete both Lab 6 and Lab 7. If you decide to skip a lab, please submit a note to Canvas by the deadline saying that you are skipping the lab.
Template
You should be able to modify any of the first three Lab templates available on our 432-data page to help you do this Lab. Feel encouraged to try a different HTML theme if you like, maybe yeti or spacelab or materia.
In my answer sketch for Lab 4, I used the following R packages:
janitor
,naniar
broom
,car
,caret
,gt
,mice
,readxl
,rms
easystats
andtidyverse
in case that is useful for you to know.
Our Best Advice
Review your HTML output file carefully before submission for copy-editing issues (spelling, grammar and syntax.) Even with spell-check in RStudio (just hit F7), it’s hard to find errors with these issues in your Quarto file so long as it is running. You really need to look closely at the resulting HTML output.
The Data
- The
hbp_3024.xlsx
Excel file (first introduced in Lab 2) is available for download on the 432 data page. Be sure that you see 8 variables with missing values when you impute the data. - A detailed description of each variable in the
hbp_3024
data is available here.
Question 1 (10 points)
Begin with the hbp_3024
data, but include only the 1,296 subjects with who were seen by either the Center, Elm, or Walnut practices, and only the following seven variables: record
, practice
, age
, tobacco
, hsgrad
, income
and depr_diag
.
- Be sure at this point that
depr_diag
,tobacco
andpractice
are factor variables. - Be sure that
age
andhsgrad
andincome
are numeric variables (they should be.) - Be sure that
record
is a character variable.
Next, develop a single imputation strategy using the mice
package and a seed of 2025
to account for missing values. Call the resulting imputed data set hbp_1
. Show your R code, and demonstrate that the following three things are true…
- your
hbp_1
data has no missing values, - within
hbp_1
, the practices have 88, 121 and 156 subjects with a depression diagnosis, respectively, and - both the mean and standard deviation of
income
is lower in the subjects with a depression diagnosis than in the subjects without.
You’ll use the hbp_1
data in Questions 1 and 2 of this Lab.
Question 2 (15 points)
Using the hbp_1
data, build a logistic regression model, called fit1
, to predict whether a subject has a depression diagnosis based on the main effects of four variables:
- which of the practices they receive care from, along with
- the subject’s age,
- the subject’s tobacco use status, and
- the estimated percentage of adults living in the subject’s home neighborhood who have graduated from high school (the
hsgrad
variable)
{5} What is the area under the ROC curve for
fit1
? Interpret the meaning of that C statistic in a complete English sentence.{5} Drop
hsgrad
fromfit1
and call the resulting modelfit2
. What do the AIC and BIC statistics forfit2
suggest to you as compared to those values infit1
?{5} For this question, use the model (
fit1
orfit2
) which looks better to you according to your response to Question 2b. Use a decision rule that you will predict a depression diagnosis if the predicted probability of such a diagnosis according to your model exceeds 0.4. Now, across thehbp_1
sample, specify the sensitivity, specificity and accuracy of this prediction rule for your chosen model.
Question 3 (25 points)
Starting again from the hbp_3024
data, take the following steps (and note the tip below):
Create a
hbp_q3
data set to include only the following eight variables:record
,practice
,dbp
,age
,ldl
,tobacco
,insurance
andbetab
.Within your
hbp_q3
data, ensure thatpractice
,tobacco
,insurance
andbetab
are represented using factors, thatdbp
,age
, andldl
are numeric (double-precision is fine), andrecord
is a character variable. You can check this by runningglimpse(hbp_q3)
orstr(hbp_q3)
.In your
hbp_q3
data, ensure that you have 403 missingldl
values, 4 missingtobacco
values and no other missing data.Create a “complete case” version of
hbp_q3
, which you’ll callhbp_q3_cc
, which contains only the 2,617 subjects who are not missing data on any of the eight variables inhbp_q3
.Partition the
hbp_q3_cc
data into:
- a training sample, which I would call
hbp_train
, including the 1,439 subjects contained inhbp_q3_cc
who were seen in the Highland, King, Plympton or Sycamore practices, and - a test sample, which I would call
hbp_test
, consisting of the remaining three practices (Center, Elm, or Walnut practices) and the 1,178 subjects fromhbp_q3_cc
seen in those practices.
You will build two models to predict diastolic blood pressure (potentially using an outcome transformation) on the basis of:
- in model
fit3
: the subject’s age, LDL cholesterol level, tobacco status, insurance status, and whether or not they had a beta-blocker prescription. - in model
fit4
: the subject’s age and LDL cholesterol level alone.
Here is the code that I used to create all of the necessary samples and check the details listed above. Please feel encouraged to use this code in your response to Lab 4.
You will use the hbp_train
data in Questions 3a and 3b, the hbp_test
data in Question 3c and then go back to the hbp_q3
data set in Question 3d.
{5} Use an appropriate tool to make a decision about a transformation of your outcome, and describe your conclusions from that tool. Use the
hbp_train
data.{5} Using the transformed outcome you identified in Question 3a, fit models
fit3
andfit4
, and compare the two models in terms of AIC, BIC, adjusted \(R^2\) and sigma. Use these four summaries to help you decide which of the two models (fit3
orfit4
) shows better training sample performance. Again, use thehbp_train
data.{5} Compare the performance of your two models (
fit3
andfit4
) in the test sample (thehbp_test
data set), using the following four summaries of prediction error: MAPE, Maximum absolute prediction error, RMSPE and validated \(R^2\). According to these measures, which model looks better in terms of fit in the test sample?{10} Select a winning model (
fit3
orfit4
) based on your results in parts b and c, and refit that model now using the whole sample fromhbp_q3
(including missing values) and 20 multiple imputations developed using themice
package, with the seed20254323
. Specify the point estimate of theLDL
effect, as well as its 90% confidence interval, with each rounded to four decimal places, after this imputation process, and interpret the meaning of the point estimate in a complete sentence or two.
Use of AI
If you decide to use some sort of AI to help you with this Lab, we ask that you place a note to that effect, describing what you used and how you used it, as a separate section called “Use of AI”, after your answers to our questions, and just before your presentation of the Session Information. Thank you.
Be sure to include Session Information
Please display your session information at the end of your submission, as shown below.
::session_info() xfun
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)
Locale:
LC_COLLATE=English_United States.utf8
LC_CTYPE=English_United States.utf8
LC_MONETARY=English_United States.utf8
LC_NUMERIC=C
LC_TIME=English_United States.utf8
Package version:
base64enc_0.1.3 bslib_0.9.0 cachem_1.1.0 cli_3.6.4
compiler_4.4.2 digest_0.6.37 evaluate_1.0.3 fastmap_1.2.0
fontawesome_0.5.3 fs_1.6.5 glue_1.8.0 graphics_4.4.2
grDevices_4.4.2 highr_0.11 htmltools_0.5.8.1 htmlwidgets_1.6.4
jquerylib_0.1.4 jsonlite_1.9.0 knitr_1.49 lifecycle_1.0.4
memoise_2.0.1 methods_4.4.2 mime_0.12 R6_2.6.1
rappdirs_0.3.3 rlang_1.1.5 rmarkdown_2.29 rstudioapi_0.17.1
sass_0.4.9 stats_4.4.2 tinytex_0.55 tools_4.4.2
utils_4.4.2 xfun_0.51 yaml_2.3.10
After the Lab
- We will post an answer sketch to our Shared Google Drive 48 hours after the Lab is due.
- We will post grades to our Grading Roster on our Shared Google Drive one week after the Lab is due.
- See the Lab Appeal Policy in our Syllabus if you are interested in having your Lab grade reviewed, and use the Lab Regrade Request form specified there to complete the task. Thank you.