knitr::opts_chunk$set(comment = NA)
library(conflicted)
library(janitor)
library(naniar)
library(here)
library(readxl)
library(haven)
library(broom)
library(MASS)
library(nnet)
library(rms)
library(survival)
library(survminer)
library(yardstick)
library(easystats)
library(tidyverse)
conflicts_prefer(dplyr::select, dplyr::filter)
theme_set(theme_bw())
Lab 7
Information to come.
General Instructions
- Submit your work via Canvas.
- The deadline for this Lab is specified on the Course Calendar.
- We charge a 5 point penalty for a lab that is 1-48 hours late.
- Labs that are more than 48 hours late will receive 30 points (out of a possible 50).
- No labs may be skipped in 432. Students must submit all seven Labs to pass the course.
- Your response should include a Quarto file (.qmd) and an HTML document that is the result of applying your Quarto file to the data we’ve provided.
- Our usual advice and templates apply to Lab 7 in the same way as they did in Labs 1-4 and 6.
The Data
- The hbp3024.xlsx Excel file (from Lab 2), the nh_1500.Rds R data set (from Lab 3) and the remit48.sav SPSS file all appear on the 432 data page.
- A detailed description of each variable in the hbp3024 data is available here.
- A detailed description of each variable in the nh_1500 data is available here.
- The variables in the remit48 data are described in Question 3 below, as well as in Lab 6 Question 2.
R Packages and Setup
My answer sketch uses the R packages and set-up shown at the start of this document.
Question 1. (15 points)
Import the data from the hbp_3024.xlsx file into R, being sure to include NA as a potential missing value when you do, since all missing values are indicated in the Excel file with NA. Next, create a data set I’ll call lab7q1, which:
- restricts the hbp_3024 data to only the 1296 subjects who were seen in one of three practices, specifically: Center, King or Plympton,
- includes only those 1284 subjects from those three practices with complete data on the three variables we will study here, specifically income, insurance and practice, and
- includes a new variable called s_income, which rescales the income data to have mean 0 and standard deviation 1.
The rescaling can be done with the scale() function. My code, for instance, includes the following …
s_income = scale(income, center = TRUE, scale = TRUE)
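Putting those specifications together, the data preparation might look like the sketch below. This is one possible approach, not the required one; the data folder location is an assumption about how your project is organized.

```r
library(tidyverse)
library(readxl)
library(here)

# Sketch of one way to build lab7q1 (the "data" subfolder is an assumption)
lab7q1 <- read_xlsx(here("data", "hbp_3024.xlsx"), na = "NA") |>
  filter(practice %in% c("Center", "King", "Plympton")) |>  # 1296 subjects
  drop_na(income, insurance, practice) |>                   # 1284 remain
  mutate(s_income = as.numeric(scale(income, center = TRUE, scale = TRUE)))
# scale() returns a one-column matrix, so as.numeric() flattens it to a vector
```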
- {5} Using your lab7q1 data set, build a multinomial logistic regression model, called fit1, to predict practice (a nominal categorical outcome) for each of the 1284 subjects on the basis of main effects of the subject's insurance and (re-scaled) s_income. Show R code which displays the coefficients of the model you fit, and their 90% confidence intervals.
R may warn you that some of the coefficients are very large, which may indicate issues with complete separation. If it does, feel free to ignore that message in Lab 7.
- {5} Then use the fit1 model to estimate the log odds of a subject's practice being Plympton rather than Center for a subject with Medicare insurance whose income is at the mean across our sample. Round the resulting log odds estimate to two decimal places.
- {5} Now use the model fit1 to produce a classification table for the 1284 subjects in your data which compares their actual practice to their predicted practice. Then, in a complete sentence or two, specify both the percentage of correct classifications overall, and also identify the practice that is most often mis-classified by your model.
Question 2. (15 points)
Use the nh_1500 data to predict self-reported overall health (which is a five-category ordinal categorical outcome) on the basis of the subject’s age, waist circumference, and whether or not they have smoked 100 cigarettes in their lifetime. The nh_1500 data saves the health variable as an unordered factor. You’ll need to change that to an ordered factor when you pull in the data.
- To check whether the health variable in the lab7q2 tibble is ordered, try str(lab7q2) or is.ordered(lab7q2$health).
- One way to convert an unordered factor health in an R tibble called lab7q2 into an ordered one is: lab7q2 <- lab7q2 |> mutate(health = as.ordered(health))
- {5} Produce two proportional odds logistic regression models. The first, which I'll call mod2a, should use all three predictors to predict health (which you should ensure is an ordered factor - see above), while the second model, called mod2b, should use only two predictors, leaving out age. For each model, use R to display the exponentiated coefficient estimates, along with 90% confidence intervals.
- {5} For your mod2a, write a sentence (or two) where you interpret the meaning of the point estimate (after exponentiating) for waist circumference, and also specify its 90% confidence interval.
- {5} Validate the C statistic and Nagelkerke \(R^2\) for each of your models using a bootstrap procedure with 300 iterations and a seed set to 20261 for mod2a and to 20262 for mod2b. Specify your conclusion about which model looks better on the basis of this work.
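One possible workflow is sketched below, using MASS::polr() for the fits and rms::lrm() with validate() for the bootstrap. The predictor names age, waist and smoke100 are assumptions on my part - check the nh_1500 codebook for the actual names in your data.

```r
library(MASS)
library(rms)

# Proportional odds models (health must already be an ordered factor)
mod2a <- polr(health ~ age + waist + smoke100, data = lab7q2, Hess = TRUE)
mod2b <- polr(health ~ waist + smoke100, data = lab7q2, Hess = TRUE)

# Exponentiated coefficients with 90% confidence intervals (repeat for mod2b)
exp(cbind(OR = coef(mod2a), confint(mod2a, level = 0.90)))

# Bootstrap validation of C and Nagelkerke R^2 via rms::lrm()
d <- datadist(lab7q2); options(datadist = "d")
mod2a_lrm <- lrm(health ~ age + waist + smoke100, data = lab7q2,
                 x = TRUE, y = TRUE)
set.seed(20261)
validate(mod2a_lrm, B = 300)
# The Dxy row gives C = 0.5 + Dxy/2; the R2 row is the Nagelkerke R-square.
```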
Question 3. (10 points)
The remit48.sav file gathers initial remission times, in days (the variable is called days) for 48 adult subjects with a leukemia diagnosis who were randomly allocated to one of two different treatments, labeled Old and New. Some patients were right-censored before their remission times could be fully determined, as indicated by values of censored = “Yes” in the data set. Note that remission is a good thing, so long times before remission are bad.
Use a Cox proportional hazards model to compare the two treatments, specifying the relevant point estimate and 90% confidence interval for the hazard ratio, and describing the meaning of the point estimate carefully and thoroughly.
See Question 2 from Lab 6 for details on loading the data for this Question.
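A sketch of one way to run this analysis appears below. The variable names days and censored come from the question; the treatment variable name and the data folder location are assumptions, so check Lab 6 and the SPSS file for the actual names. Note the event indicator must be 1 for an observed remission and 0 for a censored time.

```r
library(tidyverse)
library(haven)
library(here)
library(survival)

remit48 <- read_sav(here("data", "remit48.sav")) |>
  mutate(censored = as_factor(censored),
         treatment = as_factor(treatment),          # name is an assumption
         event = if_else(censored == "Yes", 0, 1))  # 1 = remission observed

fit3 <- coxph(Surv(days, event) ~ treatment, data = remit48)
summary(fit3, conf.int = 0.90)  # hazard ratio with its 90% CI
```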
Question 4. (10 points)
Write an essay of at least 100 words (and a minimum of 4 complete sentences) specifying something from your reading of Jeff Leek's How To Be a Modern Scientist that you disagree with, and that you specifically don't intend to make use of in your life. Please be specific about what Leek's suggestion is and why you find it problematic, provide details of your reasoning, and (ideally) describe what alternative approach you anticipate would be more helpful.
We will award full credit to any student who we believe:
- provides an insightful response
- provides a response that is written well
- avoids grammar, syntax and spelling errors
- clearly indicates the source of the advice and context for it
- clearly provides context about why this idea is problematic for them,
- provides specific information about an alternative they would prefer,
- in at least 100 words and four sentences
Use of AI
If you decide to use some sort of AI to help you with this Lab, we ask that you place a note to that effect, describing what you used and how you used it, as a separate section called “Use of AI”, after your answers to our questions, and just before your presentation of the Session Information. Thank you.
Be sure to include Session Information
Please display your session information at the end of your submission, as shown below.
xfun::session_info()
R version 4.5.2 (2025-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)
Locale:
LC_COLLATE=English_United States.utf8
LC_CTYPE=English_United States.utf8
LC_MONETARY=English_United States.utf8
LC_NUMERIC=C
LC_TIME=English_United States.utf8
Package version:
abind_1.4-8 askpass_1.2.1 backports_1.5.0
base64enc_0.1-6 bayestestR_0.17.0 bit_4.6.0
bit64_4.6.0.1 blob_1.3.0 boot_1.3.32
broom_1.0.12 bslib_0.10.0 cachem_1.1.0
callr_3.7.6 car_3.1-5 carData_3.0-6
cellranger_1.1.0 checkmate_2.3.4 cli_3.6.5
clipr_0.8.0 cluster_2.1.8.2 coda_0.19-4.1
codetools_0.2-20 colorspace_2.1-2 commonmark_2.0.0
compiler_4.5.2 conflicted_1.2.0 correlation_0.8.8
corrplot_0.95 cowplot_1.2.0 cpp11_0.5.3
crayon_1.5.3 curl_7.0.0 data.table_1.18.2.1
datasets_4.5.2 datawizard_1.3.0 DBI_1.3.0
dbplyr_2.5.2 Deriv_4.2.0 digest_0.6.39
doBy_4.7.1 dplyr_1.2.0 dtplyr_1.3.3
easystats_0.7.5 effectsize_1.0.1 emmeans_2.0.2
estimability_1.5.1 evaluate_1.0.5 exactRankTests_0.8.35
farver_2.1.2 fastmap_1.2.0 fontawesome_0.5.3
forcats_1.0.1 forecast_9.0.1 foreign_0.8-91
Formula_1.2-5 fracdiff_1.5.3 fs_1.6.6
gargle_1.6.1 generics_0.1.4 ggplot2_4.0.2
ggpubr_0.6.3 ggrepel_0.9.7 ggsci_4.2.0
ggsignif_0.6.4 ggtext_0.1.2 glue_1.8.0
googledrive_2.1.2 googlesheets4_1.1.2 graphics_4.5.2
grDevices_4.5.2 grid_4.5.2 gridExtra_2.3
gridtext_0.1.6 gtable_0.3.6 hardhat_1.4.2
haven_2.5.5 here_1.0.2 highr_0.12
Hmisc_5.2-5 hms_1.1.4 htmlTable_2.4.3
htmltools_0.5.9 htmlwidgets_1.6.4 httr_1.4.8
ids_1.0.1 insight_1.4.6 isoband_0.3.0
janitor_2.2.1 jpeg_0.1.11 jquerylib_0.1.4
jsonlite_2.0.0 knitr_1.51 labeling_0.4.3
lattice_0.22-9 lifecycle_1.0.5 litedown_0.9
lme4_1.1.38 lmtest_0.9.40 lubridate_1.9.5
magrittr_2.0.4 markdown_2.0 MASS_7.3-65
Matrix_1.7-4 MatrixModels_0.5-4 maxstat_0.7.26
memoise_2.0.1 methods_4.5.2 mgcv_1.9.4
microbenchmark_1.5.0 mime_0.13 minqa_1.2.8
modelbased_0.14.0 modelr_0.1.11 multcomp_1.4-29
mvtnorm_1.3-3 naniar_1.1.0 nlme_3.1-168
nloptr_2.2.1 nnet_7.3-20 norm_1.0.11.1
numDeriv_2016.8.1.1 openssl_2.3.5 otel_0.2.0
parallel_4.5.2 parameters_0.28.3 patchwork_1.3.2
pbkrtest_0.5.5 performance_0.16.0 pillar_1.11.1
pkgconfig_2.0.3 plyr_1.8.9 png_0.1.8
polspline_1.1.25 polynom_1.4.1 prettyunits_1.2.0
processx_3.8.6 progress_1.2.3 ps_1.9.1
purrr_1.2.1 quantreg_6.1 R6_2.6.1
ragg_1.5.0 rappdirs_0.3.4 rbibutils_2.4.1
RColorBrewer_1.1-3 Rcpp_1.1.1 RcppArmadillo_15.2.3.1
RcppEigen_0.3.4.0.2 Rdpack_2.6.6 readr_2.2.0
readxl_1.4.5 reformulas_0.4.4 rematch_2.0.0
rematch2_2.1.2 report_0.6.3 reprex_2.1.1
rlang_1.1.7 rmarkdown_2.30 rms_8.1-1
rpart_4.1.24 rprojroot_2.1.1 rstatix_0.7.3
rstudioapi_0.18.0 rvest_1.0.5 S7_0.2.1
sandwich_3.1-1 sass_0.4.10 scales_1.4.0
see_0.13.0 selectr_0.5.1 snakecase_0.11.1
SparseM_1.84-2 sparsevctrs_0.3.6 splines_4.5.2
stats_4.5.2 stringi_1.8.7 stringr_1.6.0
survival_3.8-6 survminer_0.5.2 sys_3.4.3
systemfonts_1.3.2 textshaping_1.0.4 TH.data_1.1-5
tibble_3.3.1 tidyr_1.3.2 tidyselect_1.2.1
tidyverse_2.0.0 timechange_0.4.0 timeDate_4052.112
tinytex_0.58 tools_4.5.2 tzdb_0.5.0
UpSetR_1.4.0 urca_1.3.4 utf8_1.2.6
utils_4.5.2 uuid_1.2.2 vctrs_0.7.1
viridis_0.6.5 viridisLite_0.4.3 visdat_0.6.0
vroom_1.7.0 withr_3.0.2 xfun_0.56
xml2_1.5.2 xtable_1.8-8 yaml_2.3.12
yardstick_1.3.2 zoo_1.8-15
After the Lab
- We will post an answer sketch to our Shared Google Drive 48 hours after the Lab is due.
- We will post grades to our Grading Roster on our Shared Google Drive one week after the Lab is due.
- See the Lab Appeal Policy in Section 8.4 of our Syllabus if you are interested in having your Lab grade reviewed, and use the Lab Regrade Request form to complete the task. The deadline for the form (which is optional) is specified in the Calendar. Thank you.