Lab 3

Published

2025-04-02

General Instructions

Submit your work via Canvas.
The deadline for this Lab is specified on the Course Calendar.
- We charge a 5 point penalty for a lab that is 1-48 hours late.
- We do not grade work that is more than 48 hours late.
Your response should include a Quarto file (.qmd) and an HTML document that is the result of applying your Quarto file to the data we’ve provided.

Important

You can skip exactly one of Labs 1-5 without penalty, but all students must complete both Lab 6 and Lab 7. If you decide to skip a lab, please submit a note to Canvas by the deadline saying that you are skipping the lab.

Template

There is a Lab 3 Quarto template available on our 432-data page. Please use the template to prepare your response to Lab 3, as it will make things easier for you and for the people grading your work.

In the Lab 3 template, we use the united theme. If you’d like to use a different theme, the available list is here.

Our Best Advice

Review your HTML output file carefully before submission for copy-editing issues (spelling, grammar and syntax.) Even with spell-check in RStudio (just hit F7), it’s hard to find errors with these issues in your Quarto file so long as it is running. You really need to look closely at the resulting HTML output.

The Data

The nh_1500 R data set is available for download on the 432 data page.
A detailed description of each variable in the nh_1500 (and also the nh_3143) data is available here.

Question 1 (25 points)

In question 1, you will evaluate a linear regression fit in the nh_1500 data to predict a subject’s red blood cell count (please use rbc, untransformed, as the outcome for all of Question 1) using these five predictors:

the subject’s sex,
the subject’s race/ethnicity,
the subject’s waist circumference,
the subject’s pulse rate, and
whether or not the subject has smoked 100 cigarettes in their life

Note that the main effects model using all five of these predictors will use 7 degrees of freedom, since there are four race/ethnicity categories, and the other four variables are all either binary or quantitative.

{9 points} Use a Spearman \(\rho^2\) plot to identify a single non-linear term which could be added to the model. Your selected non-linear term may add at most 3 degrees of freedom to the main effects model. Specify the added term clearly, and then fit both the main effects model (call it m1_main) and the model with your non-linear term (call it m1_add) using both ols() and lm().
{8} Which of the two models you fit in part a. appears to do a better job, when evaluated using bootstrap validation in the development sample? Why? An appropriate response will compare the models in terms of validated R-square and MSE values using 40 bootstrap replications. Set your seed to be set.seed(20251) for your main effects model, and set.seed(20252) for your model adding a non-linear term.
{8} Using a 90% confidence level, plot the effect summary (using plot(summary(modelname, conf.int = 0.90)) after an ols() fit) for the model you preferred in part b, and then show the corresponding tabular summary of effect size estimates. State and fully explain the meaning of the pulse effect shown in your output in a complete English sentence or two.

Question 2 (25 points)

Again using the nh_1500 data, we will now build a set of logistic regression models to predict whether a subject is limited in the kind or amount of work they can do by a physical, mental or emotional problem.

{10} Build a model to predict limited on the basis of self-reported overall health. Call this model2a. Then add the main effects of white blood cell count, waist circumference and age to the model and call this new model model2b. In a sentence, specify the estimated value of Tjur’s \(R^2\) for each model and indicate which of these two models shows the better fit by that measure.
{10} Interpret the odds ratio associated with self-reported overall health being Good as compared to being Excellent in each of your two models, and provide a 90% confidence interval for each such estimate.
{5} As measured by a validated C statistic using a seed of 4321 for model2a and 4322 for model2b and 40 bootstrap replications in each case, which model performs better, model2a or model2b, and why? Be sure to specify your validated C statistics for each model.

Use of AI

If you decide to use some sort of AI to help you with this Lab, we ask that you place a note to that effect, describing what you used and how you used it, as a separate section called “Use of AI”, after your answers to our questions, and just before your presentation of the Session Information. Thank you.

Be sure to include Session Information

Please display your session information at the end of your submission, as shown below.

xfun::session_info()

R version 4.4.3 (2025-02-28 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  base64enc_0.1.3   bslib_0.9.0       cachem_1.1.0      cli_3.6.4        
  compiler_4.4.3    digest_0.6.37     evaluate_1.0.3    fastmap_1.2.0    
  fontawesome_0.5.3 fs_1.6.6          glue_1.8.0        graphics_4.4.3   
  grDevices_4.4.3   highr_0.11        htmltools_0.5.8.1 htmlwidgets_1.6.4
  jquerylib_0.1.4   jsonlite_2.0.0    knitr_1.50        lifecycle_1.0.4  
  memoise_2.0.1     methods_4.4.3     mime_0.13         R6_2.6.1         
  rappdirs_0.3.3    rlang_1.1.6       rmarkdown_2.29    rstudioapi_0.17.1
  sass_0.4.10       stats_4.4.3       tinytex_0.57      tools_4.4.3      
  utils_4.4.3       xfun_0.52         yaml_2.3.10

After the Lab

We will post an answer sketch to our Shared Google Drive 48 hours after the Lab is due.
We will post grades to our Grading Roster on our Shared Google Drive one week after the Lab is due.
See the Lab Appeal Policy in Section 8.5 of our Syllabus if you are interested in having your Lab 1, 2, 3, 4, 5 or 7 grade reviewed, and use the Lab Regrade Request form to complete the task. The form (which is optional) is due when the Calendar says it is. Thank you.