Lab 6

Published

2024-01-21

General Instructions

  • Submit your work via Canvas.
  • The deadline for this Lab is specified on the Calendar.
    • Work submitted more than 59 minutes late but within 12 hours of the deadline will lose 5 of the available 50 points.
    • Work submitted 12 to 24 hours after the deadline will lose 10 of the available 50 points.
    • Work submitted more than 24 hours after the deadline will not be graded.

Your response should include a Quarto file (.qmd) and the HTML document produced by rendering that Quarto file with the data we’ve provided. While we have not provided a specific template for this Lab, we encourage you to adapt the one provided for Lab 2.

Question 1. (20 points)

The remission.csv file located on our 432-data page contains initial remission times, in days, for 44 leukemia patients who were randomly allocated to two different treatments, labeled A and B. Some patients were right-censored before their remission times could be fully determined, as indicated by values of censored = 1 in the data set. Note that remission is a good thing, so long times before remission are bad.

Your task is to plot and compare appropriate estimates of the survival functions for each of the two treatments, using a Kaplan-Meier estimate for each treatment. Write at least two complete sentences providing context to accompany your estimates and plots. Do not use a regression model.
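As a minimal sketch of the mechanics (not a solution), the code below fits Kaplan-Meier curves by treatment on a small made-up data set. Only the censored column name and its coding come from the description above; the days and treatment column names and all of the values are assumptions for illustration. Because censored = 1 marks a censored time, the indicator has to be flipped before it is passed to Surv(), which expects 1 to mean the event (here, remission) was observed.

```r
library(survival)  # a recommended package that ships with standard R installs

# Hypothetical toy data, NOT the values in remission.csv.
toy <- data.frame(
  days      = c(5, 8, 12, 20, 3, 6, 9, 15),
  censored  = c(0, 1, 0, 0, 0, 0, 1, 0),
  treatment = rep(c("A", "B"), each = 4)
)

# Surv() expects 1 = event observed, so flip the censoring indicator.
toy$event <- 1 - toy$censored

# Kaplan-Meier estimate of the survival function, one curve per treatment.
km_fit <- survfit(Surv(days, event) ~ treatment, data = toy)

# Estimated survival probabilities at each observed event time.
summary(km_fit)

# Step-function plot of the two estimated curves.
plot(km_fit, col = c("blue", "red"), lty = 1:2,
     xlab = "Days",
     ylab = "Estimated probability of not yet being in remission")
legend("topright", legend = c("A", "B"), col = c("blue", "red"), lty = 1:2)
```

In this setting the "survival" probability at a given day is the estimated probability that a patient has not yet reached remission, so a curve that drops faster is the more favorable one.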

Question 2. (30 points)

This question uses the oh22 data we built back in Lab 1. There, we loaded the data with the following code.

library(janitor)
library(tidyverse)

knitr::opts_chunk$set(comment = NA)

oh22 <- read_csv("https://raw.githubusercontent.com/THOMASELOVE/432-data/master/data/oh_counties_2022.csv", show_col_types = FALSE) |>
  clean_names() |>
  mutate(fips = as.character(fips))

For Question 2, you’re going to develop two models using 86 of the counties (every county other than Cuyahoga County and Monroe County). Later, you will use each of those models to make predictions of the outcome of interest for Cuyahoga County and for Monroe County and to assess the quality of those predictions.
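One way to separate the development sample from the two held-out counties is sketched below on a tiny stand-in tibble. The county column name, the placeholder column, and the object names are assumptions; check the actual names in oh22 before adapting this.

```r
library(dplyr)

# Hypothetical stand-in for oh22; the `county` column name is an assumption.
toy_counties <- tibble(
  county      = c("Cuyahoga", "Lake", "Monroe", "Stark"),
  placeholder = 1:4
)

holdout_names <- c("Cuyahoga", "Monroe")

# Development sample: every county except the two held out.
dev_sample <- toy_counties |>
  filter(!county %in% holdout_names)

# Holdout sample: only the two counties reserved for prediction.
holdout <- toy_counties |>
  filter(county %in% holdout_names)

nrow(dev_sample)
nrow(holdout)
```

With the real data, a quick check that the development sample has 86 rows and the holdout sample has 2 would confirm the split worked as intended.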

Build a new outcome variable that counts (possible values = 0-4) how many of the following four standards the county meets:

  • the county has a sroh_fairpoor value below the Ohio-wide mean of 18.1
  • the county has an obese_pct value below the Ohio-wide average of 34.6
  • the county has an exer_access value above the Ohio-wide average of 77.2
  • the county has NOT had a water violation in the past year (as shown by h2oviol = No)

Among the 86 counties (excluding Cuyahoga and Monroe) you should find 16 counties which meet 0 of these standards, 45 which meet 1, 16 which meet 2, 5 which meet 3 and 4 which meet all 4.

To illustrate, consider these five counties:

County    sroh_fairpoor  obese_pct  exer_access  h2oviol  Standards Met
Standard  < 18.1         < 34.6     > 77.2       No
Stark     19.6           36.1       67.8         Yes      0
Putnam    16.5           36.7       35.3         Yes      1
Lorain    19.9           39.0       85.0         No       2
Summit    18.8           34.4       89.4         No       3
Lake      16.8           34.4       85.8         No       4
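The counting rule can be sketched by adding up four logical comparisons, since R treats TRUE as 1 and FALSE as 0. The example below re-enters the five illustrative counties by hand; the variable names and cutoffs are those listed above, while the county column name and the standards_met name are assumptions.

```r
library(dplyr)

# The five illustrative counties, typed in by hand from the table.
five <- tibble(
  county        = c("Stark", "Putnam", "Lorain", "Summit", "Lake"),
  sroh_fairpoor = c(19.6, 16.5, 19.9, 18.8, 16.8),
  obese_pct     = c(36.1, 36.7, 39.0, 34.4, 34.4),
  exer_access   = c(67.8, 35.3, 85.0, 89.4, 85.8),
  h2oviol       = c("Yes", "Yes", "No", "No", "No")
)

# Each comparison is TRUE/FALSE; summing them counts the standards met.
five <- five |>
  mutate(standards_met = (sroh_fairpoor < 18.1) +
                         (obese_pct < 34.6) +
                         (exer_access > 77.2) +
                         (h2oviol == "No"))

five |> select(county, standards_met)
```

Tabulating the new variable on the full development sample (for example with count()) and comparing against the frequencies stated above is a sensible check before modeling.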

Your job is to fit two possible regression models in your development sample to predict this count, using the predictors (not used in the calculation of standards) available in the data set. Fit one model using 4-6 of the predictors, as described in Lab 1, and fit the other model using 2-3 of those same predictors, so that the predictor set in the smaller model is a subset of the larger model. Demonstrate how well each model fits the counts by developing a rootogram and other summaries that you deem useful, then select the model you prefer, specifying your reasons for doing so. Finally, use your preferred model to predict Cuyahoga County and Monroe County results, and assess the quality of those predictions, with an attractive table of results, and a brief discussion in a few complete English sentences.
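The sketch below illustrates one possible workflow on simulated data: a Poisson regression with four predictors, a nested model using two of them, and a table of observed versus model-expected frequencies for each count, which are the quantities a rootogram displays on a square-root scale. Everything here is an assumption for illustration: the predictor names x1 to x4, the simulated outcome, and the choice of Poisson rather than another count model. It does not draw the rootogram itself.

```r
set.seed(432)

# Simulated stand-in for the development sample (all names hypothetical).
n <- 86
dev <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
# Cap the simulated outcome at 4 to mimic a 0-4 count.
dev$standards_met <- pmin(rpois(n, lambda = exp(0.2 + 0.3 * dev$x1)), 4)

# Larger model (4 predictors) and a nested smaller model (2 of the same 4).
mod_large <- glm(standards_met ~ x1 + x2 + x3 + x4, family = poisson, data = dev)
mod_small <- glm(standards_met ~ x1 + x2, family = poisson, data = dev)

# Expected number of counties at each count value under a fitted model:
# sum the Poisson probability of that count across all fitted means.
expected_freq <- function(model, counts = 0:4) {
  mu <- fitted(model)
  sapply(counts, function(k) sum(dpois(k, mu)))
}

fit_check <- data.frame(
  count     = 0:4,
  observed  = as.vector(table(factor(dev$standards_met, levels = 0:4))),
  exp_large = expected_freq(mod_large),
  exp_small = expected_freq(mod_small)
)
fit_check

# One summary among many for comparing the two fits.
AIC(mod_large, mod_small)

# Predicted mean counts for two hypothetical held-out counties.
new_counties <- data.frame(x1 = c(0.5, -1), x2 = c(0, 0.2),
                           x3 = c(1, 0),    x4 = c(-0.3, 0.4))
pred <- predict(mod_small, newdata = new_counties, type = "response")
pred
```

A Poisson model places some probability on counts above 4, which this outcome cannot take, so the expected frequencies for 0 through 4 will sum to slightly less than the number of counties. That gap is worth keeping in mind when judging fit.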

Hint for Question 2

The modeling approaches we’ve worked on for count outcomes can be finicky, at least in comparison to OLS. Sometimes, you’ll get to the point where it seems like the model won’t run, or won’t summarize properly, or you have some extremely large or extremely small coefficient estimates or standard errors. Should this happen to you, the first thing we would do is try to identify which of your predictors is causing this problem, by running the model first with one predictor, then two, etc. until you figure out which predictors cause problems. Reasons why you could be having a problem include:

  1. a predictor has values that perfectly identify the category of your outcome variable (e.g., one category’s predictor values are always lower than all of another category’s predictor values, with no overlap)
  2. the scales of the predictors are wildly different (for instance, one predictor has extremely large or extremely small values), causing the estimated standard errors to explode. In that case, think about reducing the impact, perhaps by changing the units (say, from $s to $1000s) or by normalizing the predictors
  3. intense collinearity between two or more of your predictors
  4. coding issues in setting up one or more of the variables.
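The troubleshooting steps above can be sketched on simulated data as follows. All variable names and values are hypothetical: income is recorded in dollars so that its scale dwarfs the other predictors, and x2 is built to be nearly a copy of x1. The sketch adds predictors one at a time while tracking the largest standard error, rescales the large-valued predictor, and checks pairwise correlations.

```r
set.seed(2024)

# Simulated data (hypothetical names and values).
n <- 86
toy <- data.frame(x1 = rnorm(n), income = rnorm(n, mean = 50000, sd = 8000))
toy$x2 <- toy$x1 + rnorm(n, sd = 0.05)   # nearly collinear with x1
toy$y  <- rpois(n, lambda = exp(0.3 + 0.2 * toy$x1 +
                                0.00002 * (toy$income - 50000)))

# Step 1: add predictors one at a time and watch the standard errors.
predictors <- c("x1", "income", "x2")
se_by_step <- lapply(seq_along(predictors), function(k) {
  f <- reformulate(predictors[1:k], response = "y")
  m <- glm(f, family = poisson, data = toy)
  data.frame(n_predictors = k, added = predictors[k],
             max_se = max(summary(m)$coefficients[, "Std. Error"]))
})
se_by_step <- do.call(rbind, se_by_step)
se_by_step

# Step 2: rescale a predictor with huge values (dollars to thousands).
toy$income_k <- toy$income / 1000
m_dollars   <- glm(y ~ x1 + income,   family = poisson, data = toy)
m_thousands <- glm(y ~ x1 + income_k, family = poisson, data = toy)
coef(m_dollars)["income"] * 1000   # effect per $1000, from the dollar model
coef(m_thousands)["income_k"]      # the same effect after rescaling

# Step 3: look for strong pairwise correlation among the predictors.
round(cor(toy[, c("x1", "x2", "income_k")]), 2)
```

Rescaling does not change the fitted model, only the units of the coefficient and its standard error, which often makes the summary easier to read and diagnose. A jump in max_se when a particular predictor enters is a hint that this predictor deserves a closer look.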

Our Best Advice

Review your HTML output file carefully before submission for copy-editing issues (spelling, grammar and syntax). Even with spell-check in RStudio (just hit F7), it’s hard to spot these errors in the Quarto file itself when it renders without problems. You really need to look closely at the resulting HTML output.

Use of AI

If you decide to use some sort of AI to help you with this Lab, we ask that you place a note to that effect, describing what you used and how you used it, as a separate section called “Use of AI”, after your answers to our questions, and just before your presentation of the Session Information. Thank you.

Session Information

Please display your session information at the end of your submission, as shown below.

xfun::session_info()
R version 4.3.3 (2024-02-29 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  askpass_1.2.0       backports_1.4.1     base64enc_0.1.3    
  bit_4.0.5           bit64_4.0.5         blob_1.2.4         
  broom_1.0.5         bslib_0.7.0         cachem_1.0.8       
  callr_3.7.6         cellranger_1.1.0    cli_3.6.2          
  clipr_0.8.0         colorspace_2.1-0    compiler_4.3.3     
  conflicted_1.2.0    cpp11_0.4.7         crayon_1.5.2       
  curl_5.2.1          data.table_1.15.4   DBI_1.2.2          
  dbplyr_2.5.0        digest_0.6.35       dplyr_1.1.4        
  dtplyr_1.3.1        ellipsis_0.3.2      evaluate_0.23      
  fansi_1.0.6         farver_2.1.1        fastmap_1.1.1      
  fontawesome_0.5.2   forcats_1.0.0       fs_1.6.3           
  gargle_1.5.2        generics_0.1.3      ggplot2_3.5.0      
  glue_1.7.0          googledrive_2.1.1   googlesheets4_1.1.1
  graphics_4.3.3      grDevices_4.3.3     grid_4.3.3         
  gtable_0.3.4        haven_2.5.4         highr_0.10         
  hms_1.1.3           htmltools_0.5.8.1   htmlwidgets_1.6.4  
  httr_1.4.7          ids_1.0.1           isoband_0.2.7      
  janitor_2.2.0       jquerylib_0.1.4     jsonlite_1.8.8     
  knitr_1.46          labeling_0.4.3      lattice_0.22.6     
  lifecycle_1.0.4     lubridate_1.9.3     magrittr_2.0.3     
  MASS_7.3.60.0.1     Matrix_1.6.5        memoise_2.0.1      
  methods_4.3.3       mgcv_1.9.1          mime_0.12          
  modelr_0.1.11       munsell_0.5.1       nlme_3.1.164       
  openssl_2.1.1       parallel_4.3.3      pillar_1.9.0       
  pkgconfig_2.0.3     prettyunits_1.2.0   processx_3.8.4     
  progress_1.2.3      ps_1.7.6            purrr_1.0.2        
  R6_2.5.1            ragg_1.3.0          rappdirs_0.3.3     
  RColorBrewer_1.1.3  readr_2.1.5         readxl_1.4.3       
  rematch_2.0.0       rematch2_2.1.2      reprex_2.1.0       
  rlang_1.1.3         rmarkdown_2.26      rstudioapi_0.16.0  
  rvest_1.0.4         sass_0.4.9          scales_1.3.0       
  selectr_0.4.2       snakecase_0.11.1    splines_4.3.3      
  stats_4.3.3         stringi_1.8.3       stringr_1.5.1      
  sys_3.4.2           systemfonts_1.0.6   textshaping_0.3.7  
  tibble_3.2.1        tidyr_1.3.1         tidyselect_1.2.1   
  tidyverse_2.0.0     timechange_0.3.0    tinytex_0.50       
  tools_4.3.3         tzdb_0.4.0          utf8_1.2.4         
  utils_4.3.3         uuid_1.2.0          vctrs_0.6.5        
  viridisLite_0.4.2   vroom_1.6.5         withr_3.0.0        
  xfun_0.43           xml2_1.3.6          yaml_2.3.8         

After the Lab

We will post an answer sketch 24 hours after the Lab is due.

We will post grades to our Grading Roster on our Shared Google Drive one week after the Lab is due.

See the Lab Appeal Policy in our Syllabus if you are interested in having your Lab grade reviewed, and use the Lab Regrade Request form specified there to complete the task. Thank you.