Lab 3

Published

2025-01-09

General Instructions

  • Submit your work via Canvas.
  • The deadline for this Lab is specified on the Course Calendar.
    • We charge a 5 point penalty for a lab that is 1-48 hours late.
    • We do not grade work that is more than 48 hours late.
  • Your response should include a Quarto file (.qmd) and an HTML document that is the result of applying your Quarto file to the data we’ve provided.

Important

You can skip exactly one of Labs 1-5 without penalty, but all students must complete both Lab 6 and Lab 7. If you decide to skip a lab, please submit a note to Canvas by the deadline saying that you are skipping the lab.

Template

There is a Lab 3 Quarto template available on our 432-data page. Please use the template to prepare your response to Lab 3, as it will make things easier for you and for the people grading your work.

Our Best Advice

Review your HTML output file carefully before submission for copy-editing issues (spelling, grammar and syntax). Even with spell-check in RStudio (just hit F7), these errors are hard to spot in your Quarto file, because the file will still render as long as the code runs. You really need to look closely at the resulting HTML output.

The Data

  • The nh_1500 R data set is available for download on the 432 data page.
  • A detailed description of each variable in the nh_1500 (and also the nh_3143) data is available here.

Question 1 (25 points)

In question 1, you will evaluate a linear regression fit in the nh_1500 data to predict a subject’s red blood cell count using these five predictors:

  • the subject’s sex,
  • the subject’s race/ethnicity,
  • the subject’s waist circumference,
  • the subject’s pulse rate, and
  • whether or not the subject has smoked 100 cigarettes in their life.

Note that the main effects model using all five of these predictors will use 7 degrees of freedom, since there are four race/ethnicity categories, and the other four variables are all either binary or quantitative.
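
The degrees-of-freedom count above can be checked with a small sketch in base R. The data frame and its column names (sex, race, waist, pulse, smoke) are made up for illustration and are an assumption; the actual variable names in nh_1500 may differ. Each binary or quantitative predictor contributes one column to the design matrix, and the four-category factor contributes three.

```r
# Toy data with hypothetical column names; these are NOT the nh_1500 names.
n <- 40
toy <- data.frame(
  sex   = factor(rep(c("F", "M"), length.out = n)),            # binary: 1 df
  race  = factor(rep(c("A", "B", "C", "D"), length.out = n)),  # 4 levels: 3 df
  waist = seq(70, 130, length.out = n),                        # quantitative: 1 df
  pulse = seq(55, 100, length.out = n),                        # quantitative: 1 df
  smoke = factor(rep(c("No", "No", "Yes"), length.out = n))    # binary: 1 df
)

# The design matrix has one column per model degree of freedom, plus the intercept.
X <- model.matrix(~ sex + race + waist + pulse + smoke, data = toy)
df_model <- ncol(X) - 1
df_model
```

Under these assumptions df_model is 7, which matches the count stated above.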

  1. {10 points} Use a Spearman \(\rho^2\) plot to identify a single non-linear term which could be added to the model. Your selected non-linear term may add at most 3 degrees of freedom to the main effects model. Specify the added term clearly, and then fit both the main effects model (call it m1_main) and the model with your non-linear term (call it m1_add) using both ols() and lm().

  2. {10} Which of the two models you fit in part 1 appears to do a better job, when evaluated using bootstrap validation in the development sample? Why? An appropriate response will compare the models in terms of validated R-square and MSE values using set.seed(2025) and 40 bootstrap replications.

  3. {5} Plot the effect summary (using plot(summary) after an ols() fit) for the model you preferred in part 2, and explain the meaning of the pulse coefficient shown in the plot in a complete English sentence.
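
The three steps in Question 1 might be organized along the following lines. This is only a sketch on simulated data, not a solution: the data frame sim, its column names, and the choice of a restricted cubic spline in waist are all assumptions made for illustration. In your own work, the non-linear term should be whatever your Spearman \(\rho^2\) plot on nh_1500 suggests, and the variable names should come from the data description.

```r
library(Hmisc)  # spearman2()
library(rms)    # ols(), rcs(), datadist(), validate()

# Simulated stand-in for the real data; every column name here is hypothetical.
set.seed(432)
n <- 300
sim <- data.frame(
  sex   = factor(rep(c("Female", "Male"), length.out = n)),
  race  = factor(sample(c("G1", "G2", "G3", "G4"), n, replace = TRUE)),
  waist = rnorm(n, 95, 12),
  pulse = rnorm(n, 72, 10),
  smoke = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
sim$rbc <- 4.5 + 0.4 * (sim$sex == "Male") + 0.004 * sim$waist +
  0.002 * sim$pulse + rnorm(n, sd = 0.4)

# rms needs a datadist to build effect summaries later on.
dd <- datadist(sim)
options(datadist = "dd")

# Step 1: Spearman rho^2 plot, then the main effects and augmented models.
plot(spearman2(rbc ~ sex + race + waist + pulse + smoke, data = sim))

m1_main <- ols(rbc ~ sex + race + waist + pulse + smoke,
               data = sim, x = TRUE, y = TRUE)
# Illustrative choice only: a 4-knot restricted cubic spline adds 2 df.
m1_add <- ols(rbc ~ sex + race + rcs(waist, 4) + pulse + smoke,
              data = sim, x = TRUE, y = TRUE)
m1_add_lm <- lm(rbc ~ sex + race + rcs(waist, 4) + pulse + smoke, data = sim)

# Step 2: bootstrap validation, resetting the seed before each call.
set.seed(2025)
v_main <- validate(m1_main, method = "boot", B = 40)
set.seed(2025)
v_add <- validate(m1_add, method = "boot", B = 40)

v_main[c("R-square", "MSE"), "index.corrected"]
v_add[c("R-square", "MSE"), "index.corrected"]

# Step 3: effect summary plot for whichever model is preferred.
plot(summary(m1_add))
```

The x = TRUE, y = TRUE arguments store the design matrix and response in the fit, which validate() requires. The index.corrected column holds the optimism-corrected (validated) values.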

Question 2 (25 points)

Again using the nh_1500 data, we will now build a set of logistic regression models to predict whether a subject is limited in the kind or amount of work they can do by a physical, mental or emotional problem.

  1. {10} Build a model to predict limited on the basis of self-reported overall health. Call this model2a. Then add the main effects of white blood cell count, waist circumference and age to the model and call this new model model2b.

  2. {10} Interpret the odds ratio associated with self-reported overall health being Excellent as compared to being Good in each of your two models, and provide a 90% confidence interval for each such estimate.

  3. {5} As measured by a validated C statistic using a seed of 432 and 40 bootstrap replications, which model performs better, model2a or model2b, and why?
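
One possible shape for the Question 2 workflow is sketched below on simulated data. The data frame sim, its column names, and the three health levels are assumptions made for illustration; the real nh_1500 variable has its own name and categories. The sketch uses a Wald interval from confint.default() for brevity, which is a simplification, and converts the validated Somers' Dxy to a C statistic through C = 0.5 + Dxy/2.

```r
library(rms)  # lrm(), datadist(), validate()

# Simulated stand-in; all column names and health levels are hypothetical.
set.seed(1)
n <- 600
sim <- data.frame(
  health = factor(sample(c("Excellent", "Good", "Fair"), n, replace = TRUE),
                  levels = c("Good", "Excellent", "Fair")),  # "Good" is the reference
  wbc   = rnorm(n, 7, 2),
  waist = rnorm(n, 95, 12),
  age   = round(runif(n, 21, 79))
)
lp <- -1.5 - 0.7 * (sim$health == "Excellent") +
  0.8 * (sim$health == "Fair") + 0.02 * (sim$age - 50)
sim$limited <- rbinom(n, 1, plogis(lp))

# Odds ratio for Excellent vs. Good with a 90% Wald interval.
# With "Good" as the reference level, the healthExcellent coefficient
# is the log odds ratio of interest.
g2a <- glm(limited ~ health, data = sim, family = binomial())
or_2a <- exp(c(
  estimate = unname(coef(g2a)["healthExcellent"]),
  confint.default(g2a, "healthExcellent", level = 0.90)[1, ]
))
or_2a

# Fits with lrm() so that validate() can be applied.
dd <- datadist(sim)
options(datadist = "dd")
model2a <- lrm(limited ~ health, data = sim, x = TRUE, y = TRUE)
model2b <- lrm(limited ~ health + wbc + waist + age,
               data = sim, x = TRUE, y = TRUE)

# Validated C statistic from the optimism-corrected Dxy.
set.seed(432)
v2a <- validate(model2a, method = "boot", B = 40)
set.seed(432)
v2b <- validate(model2b, method = "boot", B = 40)

c_2a <- 0.5 + v2a["Dxy", "index.corrected"] / 2
c_2b <- 0.5 + v2b["Dxy", "index.corrected"] / 2
c(model2a = c_2a, model2b = c_2b)
```

The same odds ratio calculation would be repeated for the larger model, where the estimate is adjusted for the added predictors.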

Use of AI

If you decide to use some sort of AI to help you with this Lab, we ask that you place a note to that effect, describing what you used and how you used it, as a separate section called “Use of AI”, after your answers to our questions, and just before your presentation of the Session Information. Thank you.

Be sure to include Session Information

Please display your session information at the end of your submission, as shown below.

xfun::session_info()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  base64enc_0.1.3   bslib_0.8.0       cachem_1.1.0      cli_3.6.3        
  compiler_4.4.2    digest_0.6.37     evaluate_1.0.3    fastmap_1.2.0    
  fontawesome_0.5.3 fs_1.6.5          glue_1.8.0        graphics_4.4.2   
  grDevices_4.4.2   highr_0.11        htmltools_0.5.8.1 htmlwidgets_1.6.4
  jquerylib_0.1.4   jsonlite_1.8.9    knitr_1.49        lifecycle_1.0.4  
  memoise_2.0.1     methods_4.4.2     mime_0.12         R6_2.5.1         
  rappdirs_0.3.3    rlang_1.1.4       rmarkdown_2.29    rstudioapi_0.17.1
  sass_0.4.9        stats_4.4.2       tinytex_0.54      tools_4.4.2      
  utils_4.4.2       xfun_0.50         yaml_2.3.10      

After the Lab

  • We will post an answer sketch to our Shared Google Drive 48 hours after the Lab is due.
  • We will post grades to our Grading Roster on our Shared Google Drive one week after the Lab is due.
  • See the Lab Appeal Policy in our Syllabus if you are interested in having your Lab grade reviewed, and use the Lab Regrade Request form specified there to complete the task. Thank you.