Lab 3

Published

2024-02-07

General Instructions

Submit your work via Canvas.
The deadline for this Lab is specified on the Calendar.
- Work submitted more than 59 minutes late but within 12 hours of the deadline will lose 5 of the available 50 points.
- Work submitted 12 to 24 hours after the deadline will lose 10 of the available 50 points.
- Work submitted more than 24 hours after the deadline will not be graded.

Your response should include a Quarto file (.qmd) and an HTML document that is the result of applying your Quarto file to the data we’ve provided. While we have not provided a specific template for this Lab, we encourage you to adapt the one provided for Lab 2.

The Data

This Lab uses the hbp3456 data developed in Lab 2. See the Lab 2 instructions for details on the data set. Back in Lab 2, we loaded the data with this code.

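library(janitor)    # provides clean_names()
library(tidyverse)  # provides read_csv(), mutate(), and the pipeline tools

knitr::opts_chunk$set(comment = NA)  # print output without the "##" prefix

hbp3456 <- read_csv("https://raw.githubusercontent.com/THOMASELOVE/432-data/master/data/hbp3456.csv", 
                    show_col_types = FALSE) |>
  clean_names() |>                        # standardize the column names
  mutate(record = as.character(record))   # treat the subject code as text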

Here, we will walk through the process of fitting and evaluating linear regression models to predict a subject’s estimated (neighborhood) median income (the income variable) on the basis of the following five predictors:

the subject’s neighborhood high school graduation rate, collected in the hsgrad variable
the subject’s race category, from the race variable
the subject’s Hispanic/Latinx ethnicity category, as shown in eth_hisp,
the subject’s age (in the age variable), and
the subject’s current tobacco status, available in the tobacco variable.

Preliminary Data Work for Lab 3

Start your work by completing the following tasks to create a tibble that we’ll call hbp_b in the answer sketch:

Exclude the 25 subjects in hbp3456 who have missing values of either hsgrad or income.
Restrict your data to the variables we’ll use in our models (the five predictors listed above, the estimated neighborhood income, and the subject identifying code (the record)).
Ensure that all character variables (other than record) in your tibble are recognized as factors.
Create a new variable within your tibble called sqrtinc, the square root of income, which will serve as your response (outcome) for your regression modeling.
Use set.seed(432) and slice_sample() to select a random sample of 1000 subjects from the tibble.

Your resulting hbp_b tibble should look like this:

hbp_b

# A tibble: 1,000 × 8
   record income hsgrad race     eth_hisp   age tobacco sqrtinc
   <chr>   <int>  <dbl> <fct>    <fct>    <int> <fct>     <dbl>
 1 903574  34800   94.9 White    No          48 Current    187.
 2 926837  24700   74.2 AA_Black No          55 Current    157.
 3 929198  14700   40   AA_Black No          35 Never      121.
 4 932367  24700   74.2 AA_Black No          41 Never      157.
 5 925592  65600   92.2 <NA>     <NA>        61 Never      256.
 6 932404  18500   67.8 AA_Black No          67 Never      136.
 7 933953  21500   84.4 White    No          72 Never      147.
 8 911527  23000   83.6 White    No          62 Never      152.
 9 918228  13400   70.3 AA_Black No          52 Current    116.
10 930262  48300   90   <NA>     <NA>        73 Never      220.
# ℹ 990 more rows

Note: If your tibble looks different, it’s not immediately clear to me why. In the answer sketch, I do the following things, in this order:

read in the data (note that whether you then clean_names() or not has no real effect in this particular case)
filter to complete cases on hsgrad and income
select the seven variables we’ll use
convert all character variables to factors
ensure record remains a character variable
add the sqrtinc variable to the tibble
set a seed of 432
create the random sample
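The ordering above can be sketched on a small made-up tibble. This is only an illustration under stated assumptions: the toy data, the names toy, toy_prep and toy_b, the unused column, and the sample size of 5 are all hypothetical, and this is not the answer sketch itself.

```r
library(dplyr)

# Hypothetical stand-in for hbp3456: 12 made-up rows plus one column we drop
toy <- tibble(
  record   = as.character(901:912),
  income   = c(34800, 24700, NA, 14700, 65600, 18500,
               21500, 23000, 13400, 48300, 30000, 27000),
  hsgrad   = c(94.9, 74.2, 80, 40, 92.2, NA,
               84.4, 83.6, 70.3, 90, 88, 79),
  race     = rep(c("White", "AA_Black", NA), times = 4),
  eth_hisp = rep(c("No", "Yes", NA, "No"), times = 3),
  age      = 40:51,
  tobacco  = rep(c("Current", "Never", "Former"), times = 4),
  unused   = 1:12
)

toy_prep <- toy |>
  filter(complete.cases(hsgrad, income)) |>          # drop rows missing hsgrad or income
  select(record, income, hsgrad, race, eth_hisp,
         age, tobacco) |>                            # keep the seven variables
  mutate(across(where(is.character), as.factor)) |>  # character columns become factors
  mutate(record = as.character(record)) |>           # record goes back to character
  mutate(sqrtinc = sqrt(income))                     # add the square-root outcome

set.seed(432)                                        # seed set just before sampling
toy_b <- toy_prep |> slice_sample(n = 5)             # the Lab itself asks for n = 1000
toy_b
```

Missing values in race and eth_hisp are left in place here, as in the printed hbp_b, because only hsgrad and income are used for the complete-case filter.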

Question 1. (10 points)

Display the code you used to ingest the data and complete the preliminary data work described above. Be sure to include text annotations to clarify exactly what your code is doing. Then produce a table to tell us how many missing values you have in each of the important variables in your hbp_b tibble. The important variables are your outcome (square root of estimated neighborhood income) and the five predictors.
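One possible way to count missing values by variable is to combine summarise() with across(). The sketch below uses a hypothetical four-row stand-in for hbp_b (the values and the name na_counts are made up), so the counts it prints say nothing about the real data.

```r
library(dplyr)

# Hypothetical four-row stand-in for hbp_b (values are made up)
toy_b <- tibble(
  sqrtinc  = c(187, 157, 121, 256),
  hsgrad   = c(94.9, 74.2, 40, 92.2),
  race     = factor(c("White", "AA_Black", NA, NA)),
  eth_hisp = factor(c("No", "No", NA, "No")),
  age      = c(48L, 55L, 35L, 61L),
  tobacco  = factor(c("Current", "Current", "Never", NA))
)

# Count the NA values in the outcome and in each of the five predictors
na_counts <- toy_b |>
  summarise(across(c(sqrtinc, hsgrad, race, eth_hisp, age, tobacco),
                   ~ sum(is.na(.x))))
na_counts
```

The result is a one-row tibble with one column per variable, which can then be passed to whatever table-formatting tool you prefer.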

Question 2. (10 points)

Using the entire sample in hbp_b, obtain and display an appropriate Spearman \(\rho^2\) plot and use it to identify a good choice of a single non-linear term that adds exactly two degrees of freedom to the main effects model using all five predictors for sqrtinc. Specify your choice of non-linear term, and your motivation for that choice, based on the plot.
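As a rough sketch of the mechanics only, a Spearman \(\rho^2\) plot can be built with spearman2() from the Hmisc package. The data below are simulated (the data frame sim, its factor levels, and the relationship used to generate sqrtinc are all assumptions), so the ordering of predictors in this plot carries no information about the real hbp_b sample.

```r
library(Hmisc)

# Simulated stand-in for hbp_b; levels and effect sizes are made up
set.seed(1)
n <- 240
sim <- data.frame(
  hsgrad   = runif(n, 40, 100),
  race     = factor(rep(c("White", "AA_Black", "Asian", "Other"), length.out = n)),
  eth_hisp = factor(rep(c("No", "No", "No", "No", "Yes"), length.out = n)),
  age      = round(runif(n, 30, 80)),
  tobacco  = factor(rep(c("Current", "Former", "Never"), length.out = n))
)
sim$sqrtinc <- 60 + 1.5 * sim$hsgrad + rnorm(n, sd = 20)

# Squared Spearman correlation of each predictor with the outcome
sp <- spearman2(sqrtinc ~ hsgrad + race + eth_hisp + age + tobacco, data = sim)
sp
plot(sp)  # predictors higher in the plot are stronger candidates for extra terms
```

The usual reading is that predictors with larger adjusted \(\rho^2\) are the ones where spending additional degrees of freedom on a non-linear term is most likely to pay off.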

Question 3. (10 points)

Fit the main effects model for sqrtinc using ols in the hbp_b sample, and call that model m1. Plot the effect summary (using plot(summary(m1))) for model m1 and carefully explain the meaning of the hsgrad coefficient shown in that plot in a complete English sentence.

Hint 1: you are permitted to also fit the model using lm, if that is useful to you.
Hint 2: If you use anova() on model m1 you should have 8 total degrees of freedom in your model.
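A minimal sketch of the ols workflow follows, run on simulated data rather than hbp_b. The data frame sim and its factor levels are assumptions, chosen so that the main effects model has 8 degrees of freedom as in Hint 2 (1 + 3 + 1 + 1 + 2). The datadist() step is there because summary() on an rms fit needs to know the distribution of each predictor.

```r
library(rms)

# Simulated stand-in for hbp_b; levels and effect sizes are made up
set.seed(1)
n <- 240
sim <- data.frame(
  hsgrad   = runif(n, 40, 100),
  race     = factor(rep(c("White", "AA_Black", "Asian", "Other"), length.out = n)),
  eth_hisp = factor(rep(c("No", "No", "No", "No", "Yes"), length.out = n)),
  age      = round(runif(n, 30, 80)),
  tobacco  = factor(rep(c("Current", "Former", "Never"), length.out = n))
)
sim$sqrtinc <- 60 + 1.5 * sim$hsgrad + rnorm(n, sd = 20)

dd <- datadist(sim)        # store predictor distributions for rms
options(datadist = "dd")

# x = TRUE, y = TRUE keep the design matrix and outcome for later validation
m1 <- ols(sqrtinc ~ hsgrad + race + eth_hisp + age + tobacco,
          data = sim, x = TRUE, y = TRUE)

anova(m1)          # the TOTAL row reports the model degrees of freedom
plot(summary(m1))  # effect summary plot
```

For a quantitative predictor such as hsgrad, summary() on an rms fit reports by default the estimated change in the outcome when the predictor moves from its 25th to its 75th percentile, with the other predictors held at fixed reference values.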

Question 4. (10 points)

Fit a new model using ols, for sqrtinc using all five predictors, including the non-linear term you identified in Question 2 in the hbp_b sample, and call that model m2. Plot the effect summary (using plot(summary(m2))) for model m2, and explain the meaning of the tobacco coefficient shown in the plot in a complete English sentence.

Hint 1: you are permitted to also fit the model using lm, if that is useful to you.
Hint 2: If you use anova() on model m2 you should have 2 non-linear degrees of freedom, and 10 total degrees of freedom in your model.
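To show how a non-linear term changes the degrees of freedom, the sketch below adds a restricted cubic spline with four knots to a model fit on simulated data. The choice of rcs(hsgrad, 4) is purely illustrative and is not a statement about what your Spearman plot should suggest; the data frame sim and its factor levels are also assumptions.

```r
library(rms)

# Simulated stand-in for hbp_b; levels and effect sizes are made up
set.seed(1)
n <- 240
sim <- data.frame(
  hsgrad   = runif(n, 40, 100),
  race     = factor(rep(c("White", "AA_Black", "Asian", "Other"), length.out = n)),
  eth_hisp = factor(rep(c("No", "No", "No", "No", "Yes"), length.out = n)),
  age      = round(runif(n, 30, 80)),
  tobacco  = factor(rep(c("Current", "Former", "Never"), length.out = n))
)
sim$sqrtinc <- 60 + 1.5 * sim$hsgrad + rnorm(n, sd = 20)

dd <- datadist(sim)
options(datadist = "dd")

# A 4-knot restricted cubic spline uses 3 d.f.: 1 linear plus 2 non-linear
m2 <- ols(sqrtinc ~ rcs(hsgrad, 4) + race + eth_hisp + age + tobacco,
          data = sim, x = TRUE, y = TRUE)

anova(m2)          # look for the Nonlinear and TOTAL rows
plot(summary(m2))  # effect summary plot
```

With these simulated factor levels the model has 10 degrees of freedom in total, two more than the corresponding main effects model.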

Question 5. (10 points)

You’ve now fit models m1 and m2. For each model, obtain the following summary statistics: the uncorrected raw \(R^2\) value, the AIC and BIC. Then validate each model’s \(R^2\) and MSE values using set.seed(2023) and 40 bootstrap replications.

Now, report the five results you obtained for each model in an attractive, well-formatted table. Then write a sentence or two explaining what your findings mean about the performance of the two models.
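One way to gather the five summaries per model is sketched below on simulated data. Several things here are assumptions: the data frame sim, the helper summarize_fit (a hypothetical name), the illustrative rcs(hsgrad, 4) term, the use of parallel lm fits to obtain AIC and BIC, and the decision to reset the seed before each validate() call.

```r
library(rms)

# Simulated stand-in for hbp_b; levels and effect sizes are made up
set.seed(1)
n <- 240
sim <- data.frame(
  hsgrad   = runif(n, 40, 100),
  race     = factor(rep(c("White", "AA_Black", "Asian", "Other"), length.out = n)),
  eth_hisp = factor(rep(c("No", "No", "No", "No", "Yes"), length.out = n)),
  age      = round(runif(n, 30, 80)),
  tobacco  = factor(rep(c("Current", "Former", "Never"), length.out = n))
)
sim$sqrtinc <- 60 + 1.5 * sim$hsgrad + rnorm(n, sd = 20)

f1 <- sqrtinc ~ hsgrad + race + eth_hisp + age + tobacco
f2 <- sqrtinc ~ rcs(hsgrad, 4) + race + eth_hisp + age + tobacco

m1 <- ols(f1, data = sim, x = TRUE, y = TRUE)
m2 <- ols(f2, data = sim, x = TRUE, y = TRUE)
m1_lm <- lm(f1, data = sim)   # same models via lm, used for AIC and BIC
m2_lm <- lm(f2, data = sim)

# Hypothetical helper: one row of summaries for one model
summarize_fit <- function(name, fit_ols, fit_lm, seed = 2023, B = 40) {
  set.seed(seed)
  v <- validate(fit_ols, method = "boot", B = B)   # bootstrap validation
  data.frame(
    model   = name,
    raw_r2  = summary(fit_lm)$r.squared,
    aic     = AIC(fit_lm),
    bic     = BIC(fit_lm),
    val_r2  = v["R-square", "index.corrected"],
    val_mse = v["MSE", "index.corrected"]
  )
}

results <- rbind(summarize_fit("m1", m1, m1_lm),
                 summarize_fit("m2", m2, m2_lm))
results
```

The resulting data frame has one row per model and can be handed to a table-formatting function for the final report.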

Our Best Advice

Review your HTML output file carefully before submission for copy-editing issues (spelling, grammar, and syntax). Even with spell-check in RStudio (just hit F7), these errors are hard to spot in your Quarto file, because the file will still run without complaint. You really need to look closely at the resulting HTML output.

Use of AI

If you decide to use some sort of AI to help you with this Lab, we ask that you place a note to that effect, describing what you used and how you used it, as a separate section called “Use of AI”, after your answers to our questions, and just before your presentation of the Session Information. Thank you.

Session Information

Please display your session information at the end of your submission, as shown below.

xfun::session_info()

R version 4.3.3 (2024-02-29 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  askpass_1.2.0       backports_1.4.1     base64enc_0.1.3    
  bit_4.0.5           bit64_4.0.5         blob_1.2.4         
  broom_1.0.5         bslib_0.7.0         cachem_1.0.8       
  callr_3.7.6         cellranger_1.1.0    cli_3.6.2          
  clipr_0.8.0         colorspace_2.1-0    compiler_4.3.3     
  conflicted_1.2.0    cpp11_0.4.7         crayon_1.5.2       
  curl_5.2.1          data.table_1.15.4   DBI_1.2.2          
  dbplyr_2.5.0        digest_0.6.35       dplyr_1.1.4        
  dtplyr_1.3.1        ellipsis_0.3.2      evaluate_0.23      
  fansi_1.0.6         farver_2.1.1        fastmap_1.1.1      
  fontawesome_0.5.2   forcats_1.0.0       fs_1.6.3           
  gargle_1.5.2        generics_0.1.3      ggplot2_3.5.0      
  glue_1.7.0          googledrive_2.1.1   googlesheets4_1.1.1
  graphics_4.3.3      grDevices_4.3.3     grid_4.3.3         
  gtable_0.3.4        haven_2.5.4         highr_0.10         
  hms_1.1.3           htmltools_0.5.8.1   htmlwidgets_1.6.4  
  httr_1.4.7          ids_1.0.1           isoband_0.2.7      
  janitor_2.2.0       jquerylib_0.1.4     jsonlite_1.8.8     
  knitr_1.46          labeling_0.4.3      lattice_0.22.6     
  lifecycle_1.0.4     lubridate_1.9.3     magrittr_2.0.3     
  MASS_7.3.60.0.1     Matrix_1.6.5        memoise_2.0.1      
  methods_4.3.3       mgcv_1.9.1          mime_0.12          
  modelr_0.1.11       munsell_0.5.1       nlme_3.1.164       
  openssl_2.1.1       parallel_4.3.3      pillar_1.9.0       
  pkgconfig_2.0.3     prettyunits_1.2.0   processx_3.8.4     
  progress_1.2.3      ps_1.7.6            purrr_1.0.2        
  R6_2.5.1            ragg_1.3.0          rappdirs_0.3.3     
  RColorBrewer_1.1.3  readr_2.1.5         readxl_1.4.3       
  rematch_2.0.0       rematch2_2.1.2      reprex_2.1.0       
  rlang_1.1.3         rmarkdown_2.26      rstudioapi_0.16.0  
  rvest_1.0.4         sass_0.4.9          scales_1.3.0       
  selectr_0.4.2       snakecase_0.11.1    splines_4.3.3      
  stats_4.3.3         stringi_1.8.3       stringr_1.5.1      
  sys_3.4.2           systemfonts_1.0.6   textshaping_0.3.7  
  tibble_3.2.1        tidyr_1.3.1         tidyselect_1.2.1   
  tidyverse_2.0.0     timechange_0.3.0    tinytex_0.50       
  tools_4.3.3         tzdb_0.4.0          utf8_1.2.4         
  utils_4.3.3         uuid_1.2.0          vctrs_0.6.5        
  viridisLite_0.4.2   vroom_1.6.5         withr_3.0.0        
  xfun_0.43           xml2_1.3.6          yaml_2.3.8

After the Lab

We will post an answer sketch 24 hours after the Lab is due.

We will post grades to our Grading Roster on our Shared Google Drive one week after the Lab is due.

See the Lab Appeal Policy in our Syllabus if you are interested in having your Lab grade reviewed, and use the Lab Regrade Request form specified there to complete the task. Thank you.