Lab 3

Published

2026-02-02

General Instructions

  • Submit your work via Canvas.
  • The deadline for this Lab is specified on the Course Calendar.
    • We charge a 5-point penalty for a lab that is 1-48 hours late.
    • Labs that are more than 48 hours late will receive 30 points (out of a possible 50).
    • No labs may be skipped in 432. Students must submit all seven Labs to pass the course.
  • Your response should include a Quarto file (.qmd) and the HTML document produced by rendering that Quarto file with the data we’ve provided.

Template

There is a Lab 3 Quarto template available on our 432-data page. Please use the template to prepare your response to Lab 3, as it will make things easier for you and for the people grading your work.

Our Best Advice

Review your HTML output file carefully before submission for copy-editing issues (spelling, grammar, and syntax). Even with spell-check in RStudio (just hit F7), these errors are hard to spot in your Quarto file, particularly when the file renders without complaint. You really need to look closely at the resulting HTML output.

The Data

Important

In Question 3, you will need to specify some information about the data you plan to use in Project A, and the logistic regression outcome you intend to use.

  • Questions 1 and 2 use the nh_1500 data set, which is available for download on the 432 data page.
  • A detailed description of each variable in the nh_1500 (and also the nh_3143) data is available here.

Question 1 (20 points)

In Question 1, you will evaluate a linear regression model fit to the nh_1500 data to predict a subject’s red blood cell count (please use rbc, untransformed, as the outcome for all of Question 1) using these five predictors:

  • the subject’s sex,
  • the subject’s race/ethnicity,
  • the subject’s waist circumference,
  • the subject’s pulse rate, and
  • whether or not the subject has smoked 100 cigarettes in their life.

Note that the main effects model using all five of these predictors will use 7 degrees of freedom, since there are four race/ethnicity categories, and the other four variables are all either binary or quantitative.
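
As a quick check of that count, assuming sex and the smoking indicator are each binary and that waist circumference and pulse rate each enter the model linearly, the degrees of freedom add up as:

\[
\underbrace{1}_{\text{sex}} + \underbrace{(4 - 1)}_{\text{race/ethnicity}} + \underbrace{1}_{\text{waist}} + \underbrace{1}_{\text{pulse}} + \underbrace{1}_{\text{smoking}} = 7
\]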

  1. {7 points} Use a Spearman \(\rho^2\) plot to identify a single non-linear term which could be added to the model. Your selected non-linear term may add at most 3 degrees of freedom to the main effects model. Specify the added term clearly, and then fit both the main effects model (call it m1_main) and the model with your non-linear term (call it m1_add) using both ols() and lm().
Tip: Hint for Question 1a

For Question 1a in Lab 3, “a single non-linear term” means either a spline or an interaction term or a polynomial, not more than one of those.

  • If the predictor variable with the largest value of Spearman \(\rho^2\) is quantitative, I would include a spline or a polynomial in that variable which adds no more than 3 degrees of freedom to the main effects model.
  • If the predictor variable with the largest value of Spearman \(\rho^2\) is categorical, I would include an interaction term between that variable and the variable with the next largest value of Spearman \(\rho^2\), again so that the resulting interaction adds no more than 3 degrees of freedom to the main effects model.
  2. {5} Which of the two models you fit in part a. appears to do a better job, when evaluated using bootstrap validation in the development sample? Why? An appropriate response will compare the models in terms of validated R-square and MSE values using 40 bootstrap replications. Use set.seed(20251) for your main effects model, and set.seed(20252) for your model adding a non-linear term.

  3. {8} Using a 90% confidence level, plot the effect summary (using plot(summary(modelname, conf.int = 0.90)) after an ols() fit) for the model you preferred in part b, and then show the corresponding tabular summary of effect size estimates. State and fully explain the meaning of the pulse effect shown in your output in a complete English sentence or two.
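
The mechanics of Question 1 can be sketched in R roughly as follows. This is a minimal sketch on simulated stand-in data, not the nh_1500 file: apart from rbc, every variable name (sex, race_eth, waist, pulse, smoke100) is a hypothetical placeholder, and the spline on waist is chosen only to illustrate the steps, not because it is the right choice for the real data. It assumes the rms package is installed.

```r
library(rms)  # also attaches Hmisc, which provides spearman2()

# Simulated stand-in data; names other than rbc are hypothetical placeholders.
set.seed(432)
n <- 500
sim <- data.frame(
  sex      = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  race_eth = factor(sample(c("A", "B", "C", "D"), n, replace = TRUE)),
  waist    = rnorm(n, mean = 100, sd = 15),
  pulse    = rnorm(n, mean = 72, sd = 10),
  smoke100 = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
sim$rbc <- 4.3 + 0.4 * (sim$sex == "Male") + 0.003 * sim$waist +
  rnorm(n, sd = 0.4)

# rms uses a datadist object to build effect summaries
dd <- datadist(sim)
options(datadist = "dd")

# Spearman rho^2 plot to suggest where a non-linear term might be spent
sp <- spearman2(rbc ~ sex + race_eth + waist + pulse + smoke100, data = sim)
plot(sp)

# Main effects model (7 df) and an illustrative alternative:
# a 4-knot restricted cubic spline in waist adds 2 df
m1_main <- ols(rbc ~ sex + race_eth + waist + pulse + smoke100,
               data = sim, x = TRUE, y = TRUE)
m1_add  <- ols(rbc ~ sex + race_eth + rcs(waist, 4) + pulse + smoke100,
               data = sim, x = TRUE, y = TRUE)

# The same two models fit with lm()
m1_main_lm <- lm(rbc ~ sex + race_eth + waist + pulse + smoke100, data = sim)
m1_add_lm  <- lm(rbc ~ sex + race_eth + rcs(waist, 4) + pulse + smoke100,
                 data = sim)

# Bootstrap validation with 40 replications and the seeds named in part b
set.seed(20251)
val_main <- validate(m1_main, B = 40)
set.seed(20252)
val_add <- validate(m1_add, B = 40)
rbind(
  main = val_main[c("R-square", "MSE"), "index.corrected"],
  add  = val_add[c("R-square", "MSE"), "index.corrected"]
)

# Effect summary at 90% confidence: plot, then table
eff <- summary(m1_main, conf.int = 0.90)
plot(eff)
print(eff)
```

In the summary() output from an rms fit, the row for a quantitative predictor such as pulse describes, by default, the change in the predicted outcome when that predictor moves from its 25th to its 75th percentile, with the other predictors held at the reference values stored in the datadist object.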

Question 2 (20 points)

Again using the nh_1500 data, we will now build a set of logistic regression models to predict whether a subject is limited in the kind or amount of work they can do by a physical, mental or emotional problem.

  1. {7} Build a model to predict limited on the basis of self-reported overall health. Call this model2a. Then add the main effects of white blood cell count, waist circumference and age to the model and call this new model model2b. In a sentence, specify the estimated value of Tjur’s \(R^2\) for each model and indicate which of these two models shows the better fit by that measure.

  2. {6} Use 40 bootstrap replications for each model, with a seed of 4321 for model2a and 4322 for model2b. As measured by the validated C statistic, which model performs better, model2a or model2b, and why? Be sure to report the validated C statistic for each model.

  3. {7} Interpret the odds ratio associated with self-reported overall health being Good as compared to being Excellent in the model you chose in part b., and provide a 90% confidence interval for that estimate.
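
The three pieces of Question 2 can be sketched in R roughly as follows. This is a sketch on simulated stand-in data, not nh_1500: apart from limited, the variable names (sroh, wbc, waist, age) are hypothetical placeholders. Tjur's \(R^2\) is computed by hand from its definition rather than with a helper package, the validated C statistic is recovered from the validated Somers' \(D_{xy}\) via \(C = 0.5 + D_{xy}/2\), and the confidence interval shown is a Wald interval, which may differ slightly from other interval types. It assumes the rms package is installed.

```r
library(rms)

# Simulated stand-in data; names other than limited are hypothetical.
set.seed(432)
n <- 600
health_levels <- c("Excellent", "Very Good", "Good", "Fair", "Poor")
sim2 <- data.frame(
  sroh  = factor(sample(health_levels, n, replace = TRUE),
                 levels = health_levels),  # Excellent is the reference level
  wbc   = rnorm(n, mean = 7, sd = 2),
  waist = rnorm(n, mean = 100, sd = 15),
  age   = round(runif(n, min = 21, max = 79))
)
lin_pred <- -2.5 + 0.5 * (as.integer(sim2$sroh) - 1) + 0.02 * (sim2$age - 50)
sim2$limited <- rbinom(n, size = 1, prob = plogis(lin_pred))

# Part a: two logistic models and Tjur's R^2
model2a <- glm(limited ~ sroh, data = sim2, family = binomial)
model2b <- glm(limited ~ sroh + wbc + waist + age,
               data = sim2, family = binomial)

# Tjur's R^2: mean fitted probability among events minus among non-events
tjur_r2 <- function(fit) {
  p <- fitted(fit)
  y <- fit$y
  mean(p[y == 1]) - mean(p[y == 0])
}
c(model2a = tjur_r2(model2a), model2b = tjur_r2(model2b))

# Part b: validated C statistic from lrm() fits, 40 bootstrap replications
lrm2a <- lrm(limited ~ sroh, data = sim2, x = TRUE, y = TRUE)
lrm2b <- lrm(limited ~ sroh + wbc + waist + age,
             data = sim2, x = TRUE, y = TRUE)
set.seed(4321)
val2a <- validate(lrm2a, B = 40)
set.seed(4322)
val2b <- validate(lrm2b, B = 40)

# C = 0.5 + Dxy / 2, applied to the optimism-corrected Dxy
c_stat <- function(val) 0.5 + val["Dxy", "index.corrected"] / 2
c(model2a = c_stat(val2a), model2b = c_stat(val2b))

# Part c: odds ratio for Good vs. Excellent with a 90% Wald interval
or_good <- exp(coef(model2b)["srohGood"])
ci_good <- exp(confint.default(model2b, parm = "srohGood", level = 0.90))
or_good
ci_good
```

Because Excellent is the first factor level in this sketch, the srohGood coefficient compares Good to Excellent directly. With real data, it is worth confirming the reference level before interpreting the odds ratio.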

Question 3 (10 points)

This question relates to your Project A data.

  1. {1} Specify whether or not you will be working with a partner on Project A. If you are working with a partner, then (1) tell us who it is, and (2) remember that you will need to work alone on Project B later this term.

  2. {5} Provide the name of the data set (or sets) you plan to use for Project A, and a working URL where we can obtain the data.

  3. {4} Identify the variable you intend to use as your logistic regression outcome: give the variable name and briefly describe what it means. Then provide a table showing the number of observations in each of the two categories of that variable.

Tip: Hint for Question 3

My goal is to build a table like the one below from your response to this question, so be sure your answer makes this level of information perfectly clear.

| Investigator Name(s) | Data Source (with Link) | Logistic Model Outcome | Sample Size | Description and Counts |
|---|---|---|---|---|
| Seymour Krelborn | Edible Plants from Tidy Tuesday 2026-02-03 | sunlight | 140 | 87 require Full Sun, 53 require at least partial shade |

Note: Seymour doesn’t have the sample size we want to see here. We want between 300 and 2000 observations, with at least 150 in each group of your binary outcome.
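
The size rule in that note is easy to check before committing to an outcome. Here is a minimal base R sketch; the helper name meets_size_rule is hypothetical, and the second set of counts is invented purely for illustration.

```r
# Hypothetical helper encoding the size rule from the note above:
# 300 to 2000 observations in total, and at least 150 in each outcome group.
meets_size_rule <- function(counts) {
  stopifnot(length(counts) == 2)
  total <- sum(counts)
  total >= 300 && total <= 2000 && min(counts) >= 150
}

# Seymour's counts from the example table
seymour <- c("Full Sun" = 87, "At least partial shade" = 53)
sum(seymour)              # 140
meets_size_rule(seymour)  # FALSE: too few observations overall and per group

# Invented counts that would satisfy the rule
meets_size_rule(c(Yes = 180, No = 220))  # TRUE
```

With your own data, the counts would typically come from something like table() applied to your outcome variable.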

Use of AI

If you decide to use some sort of AI to help you with this Lab, we ask that you place a note to that effect, describing what you used and how you used it, as a separate section called “Use of AI”, after your answers to our questions, and just before your presentation of the Session Information. Thank you.

Be sure to include Session Information

Please display your session information at the end of your submission, as shown below.

xfun::session_info()
R version 4.5.2 (2025-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  base64enc_0.1.3   bslib_0.10.0      cachem_1.1.0      cli_3.6.5        
  compiler_4.5.2    digest_0.6.39     evaluate_1.0.5    fastmap_1.2.0    
  fontawesome_0.5.3 fs_1.6.6          graphics_4.5.2    grDevices_4.5.2  
  highr_0.11        htmltools_0.5.9   htmlwidgets_1.6.4 jquerylib_0.1.4  
  jsonlite_2.0.0    knitr_1.51        lifecycle_1.0.5   memoise_2.0.1    
  methods_4.5.2     mime_0.13         otel_0.2.0        R6_2.6.1         
  rappdirs_0.3.4    rlang_1.1.7       rmarkdown_2.30    rstudioapi_0.18.0
  sass_0.4.10       stats_4.5.2       tinytex_0.58      tools_4.5.2      
  utils_4.5.2       xfun_0.56         yaml_2.3.12      

After the Lab