knitr::opts_chunk$set(comment = NA)
library(janitor)
library(naniar)
library(gt)
library(here)
library(conflicted)
library(readxl)
library(haven)
library(broom)
library(MASS)
library(mosaic)
library(nnet)
library(rms)
library(survival)
library(survminer)
library(topmodels)
library(yardstick)
library(easystats)
library(tidyverse)
theme_set(theme_dark())
conflicts_prefer(base::max, base::mean, stats::cor,
                 dplyr::filter, dplyr::select, dplyr::summarize)
Lab 7
General Instructions
- Submit your work via Canvas.
- The deadline for this Lab is specified on the Course Calendar.
- For this lab, we will charge a 5-point penalty for a lab that is 1-24 hours late.
- We will charge a 15-point penalty for a response that is more than 24 but less than 48 hours late.
- We do not grade work that is more than 48 hours late.
- Your response should include a Quarto file (.qmd) and an HTML document that is the result of applying your Quarto file to the data we’ve provided.
All students must complete both Lab 6 and Lab 7. Neither can be “skipped”.
Template
You should be able to modify the Lab 3 Quarto template available on our 432-data page to help you do this Lab.
Our Best Advice
Review your HTML output file carefully before submission for copy-editing issues (spelling, grammar and syntax). Even with spell-check in RStudio (just hit F7), it is hard to catch these issues in your Quarto file so long as it is running. You really need to look closely at the resulting HTML output.
R Packages used in the Answer Sketch
In my answer sketch for Lab 7, I used the following setup. You are welcome to do the same.
R Packages and Setup
Note that my list of R packages does not include separate loading of any of the core tidyverse packages, or the packages in the easystats framework. The core tidyverse packages are listed at https://www.tidyverse.org/packages/#core-tidyverse, and the packages in the easystats framework are listed at https://easystats.github.io/easystats/. If you separately load any of these packages, you will lose points on this Lab. Same goes for Projects A and B.
The Data
- The `chr_2015.csv` CSV file (from Lab 1), the `hbp3024.xlsx` Excel file (from Lab 2), the `nh_1500.Rds` R data set (from Lab 3) and the `remit48.sav` SPSS file all appear on the 432 data page.
- A detailed codebook for all of the data in the `chr_2015` file is available here.
- A detailed description of each variable in the `hbp3024` data is available here.
- A detailed description of each variable in the `nh_1500` data is available here.
- The variables included in the `remit48` data are described in Question 4, below.
Question 1. (14 points)
Use the `chr_2015` data to build a model to predict each county’s percentage of the population ages 16 and older who are unemployed but seeking work, as measured in 2013 (and reported in CHR 2015). Note that each of the `unemp` values in the data is an integer (falling between 1 and 28), and so we will treat the `unemp` values as counts in Question 1. You will produce a Poisson regression model for `unemp` using the main effects of two quantitative predictors: the county’s food environment index and the county’s adult obesity rate.
- {5} Produce the Poisson regression model, which I’ll call `mod1`, then interpret the coefficient (the point estimate is sufficient) for the `food_env` variable. Finally, provide a 90% confidence interval for the point estimate you provide. Round your estimates to two decimal places.
In stating your response to Question 1a, you can choose whether or not to exponentiate the coefficient’s estimate and its confidence interval. So long as you interpret the result you present correctly, we will be happy with either, but there is no need to present both the exponentiated and unexponentiated results.
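Here is a minimal sketch of one way to start part a, assuming your imported tibble is named `chr_2015` and uses the `unemp`, `food_env` and `obesity` variable names from this Question (the `tidy()` call comes from the broom package loaded above):

# Poisson regression for the unemp counts (one possible approach)
mod1 <- glm(unemp ~ food_env + obesity, family = poisson(), data = chr_2015)

tidy(mod1, conf.int = TRUE, conf.level = 0.90)  # estimates and 90% CIs
# add exponentiate = TRUE to tidy() if you choose the exponentiated scale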
- {5} Produce a rootogram for `mod1` and interpret its meaning, and specify the model’s \(R^2\) value, rounded to three decimal places.
In stating your response to Question 1b, you can choose whether you want to show the Nagelkerke \(R^2\), or the squared correlation of the observed and model-predicted outcome values.
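For part b, one possible sketch follows: `rootogram()` comes from the topmodels package loaded above, and `r2_nagelkerke()` from the performance package (part of easystats); the last line shows the squared-correlation alternative.

rootogram(mod1)   # hanging rootogram for the Poisson fit

# either R-squared option mentioned above:
r2_nagelkerke(mod1)
cor(chr_2015$unemp, predict(mod1, type = "response"))^2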
- {4} Use `mod1` to make a prediction of the `unemp` rate (rounded to two decimal places) for Cuyahoga County, in Ohio, based on Cuyahoga’s values for `food_env` (6.7) and for `obesity` (28), and then, in a complete sentence, compare the `mod1` prediction to the observed `unemp` rate for Cuyahoga County as reported in 2013 as part of CHR 2015.
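For part c, a sketch of the prediction step:

new1 <- tibble(food_env = 6.7, obesity = 28)       # Cuyahoga's observed values
predict(mod1, newdata = new1, type = "response")   # predicted unemp rate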
Question 2. (12 points)
Import the data from the `hbp3024.xlsx` file into R, being sure to include NA as a potential missing value when you do, since all missing values are indicated in the Excel file with NA. Next, create a data set I’ll call `lab5q2`, which:

- restricts the `hbp3024` data to only the 1296 subjects who were seen in one of three practices, specifically: Center, King or Plympton, and
- includes only those 1284 subjects from those three practices with complete data on the three variables we will study in Question 2, specifically `income`, `insurance` and `practice`, and which
- includes a new variable called `s_income`, which rescales the `income` data to have mean 0 and standard deviation 1.
The rescaling would be facilitated by including code using the scale()
function. My code, for instance, includes the following …
s_income = scale(income, center = TRUE, scale = TRUE)
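Putting the pieces together, here is a minimal sketch of one way to build `lab5q2`, assuming (as in my Question 4 code) that the Excel file sits in a data subfolder; the `as.numeric()` wrapper is my addition here, to keep `s_income` a plain numeric column rather than a one-column matrix:

lab5q2 <- read_xlsx(here("data/hbp3024.xlsx"), na = "NA") |>
  filter(practice %in% c("Center", "King", "Plympton")) |>  # 1296 subjects
  drop_na(income, insurance, practice) |>                   # 1284 complete cases
  mutate(s_income = as.numeric(scale(income, center = TRUE, scale = TRUE)))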
- {6} Using your `lab5q2` data set, build a multinomial logistic regression model, called `fit2`, to predict `practice` (a nominal categorical outcome) for each of the 1284 subjects on the basis of main effects of the subject’s `insurance` and (re-scaled) `s_income`. Show R code which displays the coefficients of the model you fit, and their 90% confidence intervals. Then use the model you fit to estimate the log odds of a subject’s practice being Plympton rather than Center for a subject with Medicare insurance whose `income` is at the mean across our sample. Round the resulting log odds estimate to two decimal places.
- {6} Now use the model `fit2` you fit in Question 2 part a to produce a classification table for the 1284 subjects in your data which compares their actual `practice` to their predicted `practice`. Then, in a complete sentence or two, specify both the percentage of correct classifications overall, and also identify the `practice` that is most often mis-classified by your model.
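A hedged sketch for both parts of Question 2, which assumes the relevant `insurance` level is labeled exactly “Medicare” in your data:

# multinomial logistic regression (trace = FALSE quiets the iteration log)
fit2 <- multinom(practice ~ insurance + s_income,
                 data = lab5q2, trace = FALSE)

tidy(fit2, conf.int = TRUE, conf.level = 0.90)   # coefficients with 90% CIs

# log odds of Plympton vs. Center for Medicare insurance at the mean income
# (s_income = 0 at the sample mean, by construction)
probs <- predict(fit2, type = "probs",
                 newdata = tibble(insurance = "Medicare", s_income = 0))
log(probs["Plympton"] / probs["Center"])

# classification table: predicted vs. actual practice
table(predicted = predict(fit2, type = "class"), actual = lab5q2$practice)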
Question 3. (12 points)
Use the `nh_1500` data to predict self-reported overall health (which is a five-category ordinal categorical outcome) on the basis of the subject’s age, waist circumference, and whether or not they have smoked 100 cigarettes in their lifetime. On 2025-03-01, I updated the `nh_1500` data to save the `health` variable as an ordered factor. You’ll want to ensure that’s what you see when you pull in the data.
- {6} Produce two proportional odds logistic regression models. The first, which I’ll call `mod3a`, should use all three predictors to predict `health`, while the second model, called `mod3b`, should use only two predictors, leaving out age. For each model, write R code to display the exponentiated coefficient estimates, along with 90% confidence intervals. Then, for `mod3a`, write a sentence (or two) where you interpret the meaning of the point estimate (after exponentiating) for waist circumference, and also specify its 90% confidence interval.
- {6} Validate the C statistic and Nagelkerke \(R^2\) for each of your models using a bootstrap procedure with 300 iterations and a seed set to `2025`. Specify your conclusion about which model looks better on the basis of this work.
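A sketch of one approach to both parts using the rms package; the predictor names `age`, `waist` and `smoke100` are my guesses here, so check a glimpse of `nh_1500` for the actual names:

nh_1500 <- read_rds(here("data/nh_1500.Rds"))

d <- datadist(nh_1500); options(datadist = "d")

# x = TRUE, y = TRUE stores the data needed by validate()
mod3a <- lrm(health ~ age + waist + smoke100, data = nh_1500, x = TRUE, y = TRUE)
mod3b <- lrm(health ~ waist + smoke100, data = nh_1500, x = TRUE, y = TRUE)

summary(mod3a, conf.int = 0.90)   # exponentiated effects with 90% CIs

set.seed(2025)
validate(mod3a, B = 300)   # bootstrap-validated Dxy (C = 0.5 + Dxy/2) and R2
set.seed(2025)
validate(mod3b, B = 300)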
Question 4. (12 points)
The `remit48.sav` file gathers initial remission times, in days (the variable is called `days`), for 48 adult subjects with a leukemia diagnosis who were randomly allocated to one of two different `treatment`s, labeled Old and New. Some patients were right-censored before their remission times could be fully determined, as indicated by values of `censored` = “Yes” in the data set. Note that remission is a good thing, so long times before remission are bad.
Here is my code for creating the necessary tibble, which I called `lab5q4`.
lab5q4 <- read_spss(here("data/remit48.sav"))

# check the SPSS variable labels before recoding
lab5q4$treatment |> attr("label")
lab5q4$censored |> attr("label")

# recode the labelled numeric codes into factors, then drop the SPSS labels
lab5q4 <- lab5q4 |>
  mutate(treatment =
           fct_recode(factor(treatment), "New" = "1", "Old" = "2"),
         censored =
           fct_recode(factor(censored), "No" = "1", "Yes" = "2"),
         subject = as.character(subject)) |>
  zap_labels()
Be sure that a glimpse of your `lab5q4` tibble produces the following:
> glimpse(lab5q4)
Rows: 48
Columns: 4
$ subject <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", …
$ treatment <fct> New, New, Old, New, New, Old, Old, Old, New, New, …
$ days <dbl> 269, 139, 161, 9, 31, 199, 19, 20, 28, 29, …
$ censored <fct> Yes, No, Yes, No, No, Yes, No, No, No, No, …
- {6} Plot appropriate Kaplan-Meier estimates of the survival functions for each of the two treatments in a single plot. Create a table that shows the restricted mean and median for survival time in days for the two treatment groups. In a sentence or two, what conclusions can you draw from your plot and table?
- {6} Use a Cox proportional hazards model to compare the two treatments, specifying the relevant point estimate and 90% confidence interval for the hazard ratio, and describing the meaning of the point estimate.
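A sketch of one approach to both parts, building on the `lab5q4` tibble created above; note that the event (remission) occurred when `censored` is “No”:

fit4 <- survfit(Surv(days, censored == "No") ~ treatment, data = lab5q4)

ggsurvplot(fit4, data = lab5q4, conf.int = TRUE)  # Kaplan-Meier curves by treatment

print(fit4, print.rmean = TRUE)   # restricted means and medians by group

# Cox proportional hazards model comparing the two treatments
mod4 <- coxph(Surv(days, censored == "No") ~ treatment, data = lab5q4)
tidy(mod4, exponentiate = TRUE, conf.int = TRUE,
     conf.level = 0.90)   # hazard ratio with 90% CI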
Use of AI
If you decide to use some sort of AI to help you with this Lab, we ask that you place a note to that effect, describing what you used and how you used it, as a separate section called “Use of AI”, after your answers to our questions, and just before your presentation of the Session Information. Thank you.
Session Information
Please display your session information at the end of your submission in a separate section of its own, as shown below.
xfun::session_info()
R version 4.4.3 (2025-02-28 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)
Locale:
LC_COLLATE=English_United States.utf8
LC_CTYPE=English_United States.utf8
LC_MONETARY=English_United States.utf8
LC_NUMERIC=C
LC_TIME=English_United States.utf8
Package version:
abind_1.4-8 askpass_1.2.1 backports_1.5.0
base64enc_0.1-3 bayestestR_0.15.2 bigD_0.3.0
bit_4.6.0 bit64_4.6.0.1 bitops_1.0.9
blob_1.2.4 boot_1.3.31 broom_1.0.7
bslib_0.9.0 cachem_1.1.0 callr_3.7.6
car_3.1-3 carData_3.0-5 cellranger_1.1.0
checkmate_2.3.2 cli_3.6.4 clipr_0.8.0
cluster_2.1.8 coda_0.19-4.1 codetools_0.2-20
colorspace_2.1-1 commonmark_1.9.5 compiler_4.4.3
conflicted_1.2.0 correlation_0.8.7 corrplot_0.95
cowplot_1.1.3 cpp11_0.5.2 crayon_1.5.3
curl_6.2.1 data.table_1.17.0 datasets_4.4.3
datawizard_1.0.1 DBI_1.2.3 dbplyr_2.5.0
Deriv_4.1.6 digest_0.6.37 distributions3_0.2.2
doBy_4.6.25 dplyr_1.1.4 dtplyr_1.3.1
easystats_0.7.4 effectsize_1.0.0 emmeans_1.11.0
estimability_1.5.1 evaluate_1.0.3 exactRankTests_0.8.35
fansi_1.0.6 farver_2.1.2 fastmap_1.2.0
fontawesome_0.5.3 forcats_1.0.0 foreign_0.8-88
Formula_1.2-5 fs_1.6.5 gargle_1.5.2
generics_0.1.3 ggformula_0.12.0 ggplot2_3.5.1
ggpubr_0.6.0 ggrepel_0.9.6 ggridges_0.5.6
ggsci_3.2.0 ggsignif_0.6.4 ggtext_0.1.2
glue_1.8.0 googledrive_2.1.1 googlesheets4_1.1.1
graphics_4.4.3 grDevices_4.4.3 grid_4.4.3
gridExtra_2.3 gridtext_0.1.5 gt_0.11.1
gtable_0.3.6 hardhat_1.4.1 haven_2.5.4
here_1.0.1 highr_0.11 Hmisc_5.2-3
hms_1.1.3 htmlTable_2.4.3 htmltools_0.5.8.1
htmlwidgets_1.6.4 httr_1.4.7 ids_1.0.1
insight_1.1.0 isoband_0.2.7 janitor_2.2.1
jpeg_0.1.11 jquerylib_0.1.4 jsonlite_1.9.1
juicyjuice_0.1.0 km.ci_0.5-6 KMsurv_0.1-5
knitr_1.50 labeling_0.4.3 labelled_2.14.0
lattice_0.22-6 lifecycle_1.0.4 lme4_1.1.36
lubridate_1.9.4 magrittr_2.0.3 markdown_1.13
MASS_7.3-65 Matrix_1.7-2 MatrixModels_0.5-3
maxstat_0.7.25 memoise_2.0.1 methods_4.4.3
mgcv_1.9.1 microbenchmark_1.5.0 mime_0.13
minqa_1.2.8 modelbased_0.10.0 modelr_0.1.11
mosaic_1.9.1 mosaicCore_0.9.4.0 mosaicData_0.20.4
multcomp_1.4-28 munsell_0.5.1 mvtnorm_1.3-3
naniar_1.1.0 nlme_3.1-167 nloptr_2.2.1
nnet_7.3-20 norm_1.0.11.1 numDeriv_2016.8.1.1
openssl_2.3.2 parallel_4.4.3 parameters_0.24.2
patchwork_1.3.0 pbkrtest_0.5.3 performance_0.13.0
pillar_1.10.1 pkgconfig_2.0.3 plyr_1.8.9
png_0.1.8 polspline_1.1.25 polynom_1.4.1
prettyunits_1.2.0 processx_3.8.6 progress_1.2.3
ps_1.9.0 purrr_1.0.4 quantreg_6.1
R6_2.6.1 ragg_1.3.3 rappdirs_0.3.3
rbibutils_2.3 RColorBrewer_1.1.3 Rcpp_1.0.14
RcppEigen_0.3.4.0.2 Rdpack_2.6.3 reactable_0.4.4
reactR_0.6.1 readr_2.1.5 readxl_1.4.5
reformulas_0.4.0 rematch_2.0.0 rematch2_2.1.2
report_0.6.1 reprex_2.1.1 rlang_1.1.5
rmarkdown_2.29 rms_7.0-0 rpart_4.1.24
rprojroot_2.0.4 rstatix_0.7.2 rstudioapi_0.17.1
rvest_1.0.4 sandwich_3.1-1 sass_0.4.9
scales_1.3.0 see_0.11.0 selectr_0.4.2
snakecase_0.11.1 SparseM_1.84-2 sparsevctrs_0.3.2
splines_4.4.3 stats_4.4.3 stringi_1.8.4
stringr_1.5.1 survival_3.8-3 survminer_0.5.0
survMisc_0.5.6 sys_3.4.3 systemfonts_1.2.1
textshaping_1.0.0 TH.data_1.1-3 tibble_3.2.1
tidyr_1.3.1 tidyselect_1.2.1 tidyverse_2.0.0
timechange_0.3.0 tinytex_0.56 tools_4.4.3
topmodels_0.3-0 tzdb_0.5.0 UpSetR_1.4.0
utf8_1.2.4 utils_4.4.3 uuid_1.2.1
V8_6.0.2 vctrs_0.6.5 viridis_0.6.5
viridisLite_0.4.2 visdat_0.6.0 vroom_1.6.5
withr_3.0.2 xfun_0.51 xml2_1.3.8
xtable_1.8-4 yaml_2.3.10 yardstick_1.3.2
zoo_1.8-13
After the Lab
- We will post an answer sketch to our Shared Google Drive 48 hours after the Lab is due.
- We will post grades to our Grading Roster on our Shared Google Drive one week after the Lab is due.
- See the Lab Appeal Policy in Section 8.5 of our Syllabus if you are interested in having your Lab 1, 2, 3, 4, 5 or 7 grade reviewed, and use the Lab Regrade Request form to complete the task. The form (which is optional) is due when the Calendar says it is. Thank you.