Lab 2

Published

2024-01-21

General Instructions

Submit your work via Canvas.
The deadline for this Lab is specified on the Course Calendar.
- Work submitted more than 59 minutes late, but within 12 hours of the deadline, will lose 5 of the available 50 points.
- Work submitted 12 to 24 hours after the deadline will lose 10 of the available 50 points.
- Work submitted more than 24 hours after the deadline will not be graded.

Your response should include a Quarto file (.qmd) and the HTML document produced by rendering that Quarto file with the data we’ve provided.

Template

There is a Lab 2 Quarto template available on our 432-data page. Please use the template to prepare your response to Lab 2, as it will make things easier for you and for the people grading your work. The template is quite generic, and can also be used for other work, including Labs 3-8.

Our Best Advice

Review your HTML output file carefully before submission for copy-editing issues (spelling, grammar and syntax). Even with spell-check in RStudio (just hit F7), these issues are hard to spot in your Quarto file, because they don’t stop the file from rendering. You really need to look closely at the resulting HTML output.

Question 1 (25 points)

Question 1 uses the lab2q1 data, based on NHANES 2017-18 results. You will use these data to generate two different responses to the question:

Among US non-institutionalized adults ages 21-49 who engage in moderate-intensity sports, estimate the percentage who would describe their General Health as either “Excellent” or “Very Good”.

a. (10 points) Among the subjects in the lab2q1 data who responded “Yes” to the moderate-intensity sports question and who provided an answer to the General Health question, what percentage described their General Health as either “Excellent” or “Very Good”? Be sure to use a complete-case analysis to deal with missing data on the General Health variable, and provide all of the R code you use to obtain your result, annotated with detailed text that makes it clear what you are doing as you proceed. Please express your final response as a percentage between 0 and 100, including a single decimal place.
b. (15 points) Please answer the question asked in Question 1a again, but this time accounting for the sampling weights stored in wtint2yr, again using a complete-case analysis to deal with missing General Health values. As you did in Question 1a, provide all of the R code you use to obtain your result, annotated with text to make it clear what you are doing, and then express your final response to Question 1b as a percentage, again including a single decimal place.
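
To make the difference between the two parts concrete, here is a minimal sketch on a made-up five-row data set (not the lab2q1 data). The column names `health` and `wt` and every value in them are invented for illustration, and the sketch assumes that a complete-case analysis simply drops rows with a missing health rating before either percentage is computed.

```r
# Toy data, invented for illustration only (not lab2q1).
# health: 1 = Excellent, 2 = Very Good, 3 = Good, 4 = Fair, 5 = Poor, NA = missing
toy <- data.frame(
  health = c(1, 2, 3, NA, 5),
  wt     = c(100, 300, 400, 500, 200)
)

# Complete-case analysis: keep only rows with an observed health rating
toy_cc <- toy[!is.na(toy$health), ]

# Indicator of an "Excellent" or "Very Good" rating
good <- toy_cc$health <= 2

# Unweighted percentage: every remaining subject counts equally
pct_unweighted <- 100 * mean(good)

# Weighted percentage: each subject counts in proportion to their weight
pct_weighted <- 100 * weighted.mean(good, w = toy_cc$wt)

round(c(unweighted = pct_unweighted, weighted = pct_weighted), 1)
```

In this toy example the unweighted result is 50.0 and the weighted result is 40.0, because the two subjects with lower ratings carry more weight. The sketch gives a point estimate only; it does not address standard errors for survey data.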

Data for Question 1

Dr. Love created these data from NHANES 2017-18 Demographics and Questionnaire data, using the code below.

Specifically, he used the DEMO_J (Demographics), HSQ_J (Current Health Status) and PAQ_J (Physical Activity) files, which are described at this link.

library(nhanesA)
library(janitor)
library(tidyverse)

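# Pull the three NHANES 2017-18 files used here
temp1 <- nhanes('DEMO_J')
temp2 <- nhanes('HSQ_J')
temp3 <- nhanes('PAQ_J')

# Keep only subjects (SEQN) who appear in all three files
temp12 <- inner_join(temp1, temp2, by = "SEQN")
temp123 <- inner_join(temp12, temp3, by = "SEQN")

lab2q1 <- temp123 |>
  select(SEQN, WTINT2YR, RIDAGEYR, HSD010, PAQ665) |>
  # restrict to ages 21 through 49
  filter(RIDAGEYR > 20 & RIDAGEYR < 50) |>
  # keep PAQ665 responses of 1 (Yes) or 2 (No)
  filter(PAQ665 < 3) |>
  mutate(HSD010 = factor(HSD010),
         PAQ665 = factor(PAQ665),
         SEQN = as.character(SEQN)) |>
  # convert variable names to lower case
  clean_names() |>
  tibble()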

rm(temp1, temp12, temp2, temp3, temp123)

saveRDS(lab2q1, file = "data/lab2q1.Rds")

Variables Studied in Question 1

The resulting variables are listed below.

Item	Description	Possible Responses
`seqn`	Subject id code	93717 through 102956
`wtint2yr`	Full sample 2 year interview weight	min = 4363, max = 387879
`ridageyr`	Age in years at screening	min = 21, max = 49
`hsd010`	General Health Condition	see below
`paq665`	Moderate Recreational Activities	see below

hsd010 Would you say your health in general is
- 1 = Excellent,
- 2 = Very Good,
- 3 = Good,
- 4 = Fair, or
- 5 = Poor?
- (Note that 7 = Refused, 9 = Don’t know in this variable, which we will treat as missing.)
paq665 Do you do any moderate-intensity sports, fitness, or recreational activities that cause a small increase in breathing or heart rate such as brisk walking, bicycling, swimming, or golf for at least 10 minutes continuously?
- 1 = Yes, 2 = No
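
As a minimal sketch of the coding rule described above, the following uses an invented vector of responses (not the lab2q1 data) to show one way to treat codes 7 and 9 as missing and then build an Excellent / Very Good indicator. The object names `raw`, `code` and `exc_vg` are hypothetical. Whether codes 7 and 9 still appear in lab2q1, or have already been stored as NA, is something to check in the data itself.

```r
# Toy responses, invented for illustration; codes follow the hsd010 list above
raw <- factor(c(1, 2, 7, 3, 9, 5, NA))

# Convert factor labels to numbers (via character, not the internal level codes)
code <- as.numeric(as.character(raw))

# Treat 7 (Refused) and 9 (Don't know) as missing
code[code %in% c(7, 9)] <- NA

# TRUE for Excellent / Very Good, FALSE for Good / Fair / Poor, NA if missing
exc_vg <- code <= 2

table(exc_vg, useNA = "ifany")
```

Converting through `as.character()` matters here: calling `as.numeric()` directly on a factor returns the internal level positions rather than the labels.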

Loading the Question 1 Data

I have provided the saved lab2q1.Rds file to you on the 432-data page. I encourage you to load it using the code below.

library(janitor)
library(tidyverse)

knitr::opts_chunk$set(comment = NA)

lab2q1 <- read_rds("https://raw.githubusercontent.com/THOMASELOVE/432-data/master/data/lab2q1.Rds")

lab2q1

# A tibble: 2,295 × 5
   seqn  wtint2yr ridageyr hsd010 paq665
   <chr>    <dbl>    <dbl> <fct>  <fct> 
 1 93717   53249.       22 2      2     
 2 93718   20257.       45 3      1     
 3 93729   11760.       42 4      2     
 4 93738   59333.       26 3      2     
 5 93746   27135.       25 2      2     
 6 93755   30922.       26 2      1     
 7 93761   18939.       44 3      2     
 8 93763  103670.       40 3      1     
 9 93766   16414.       36 4      2     
10 93774  232377.       41 <NA>   1     
# ℹ 2,285 more rows

Question 2 (25 points)

Question 2 uses the hbp3456 data.

a. (10 points) Does a person’s insurance status seem to have a meaningful impact on their systolic blood pressure, adjusting for whether or not they have a prescription for a beta-blocker? Decide, in a sensible way, whether your model should include an interaction term (providing a graph to help us understand your reasoning), and then fit your choice of model using the lm function in R. Display your results.
b. (15 points) Provide a written explanation of your findings, in complete sentences. Your explanation should address both the overall quality of fit and the interpretation of the coefficients of your chosen model, and should describe in detail how you used the output you generated in part a to decide whether or not to include an interaction term.

Question 2 Hints

One graph you might use would assess the need for an interaction term, probably via a plot of means.
Another graph (or perhaps table) to consider for insight would look at the relationship between insurance and beta-blocker status in these subjects.
Please explicitly state in your response that you assume that the missingness you observe in these data is MCAR (missing completely at random), and that a complete-case analysis is thus appropriate for this Question.
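
The hints above can be sketched on simulated data. Everything in this example is invented for illustration: the variables `grp`, `trt` and `y` are hypothetical stand-ins for a multi-level factor, a Yes/No factor and a quantitative outcome, and the data are not hbp3456. The sketch uses base R only and assumes MCAR, so that dropping incomplete rows is reasonable.

```r
set.seed(432)  # arbitrary seed so the simulated toy data are reproducible

# Simulated toy data, invented for illustration (not hbp3456)
toy <- data.frame(
  grp = factor(rep(c("A", "B", "C"), times = 40)),
  trt = factor(rep(c("No", "Yes"), each = 60))
)
toy$y <- 130 + 3 * (toy$grp == "B") + 2 * (toy$trt == "Yes") +
  rnorm(nrow(toy), sd = 15)

# Make two outcomes missing, then keep complete cases (assumes MCAR)
toy$y[c(5, 50)] <- NA
toy_cc <- toy[complete.cases(toy), ]

# Cross-tabulation of the two predictors
table(toy_cc$grp, toy_cc$trt)

# Table of cell means: the numbers behind a plot of means
means <- aggregate(y ~ grp + trt, data = toy_cc, FUN = mean)
means

# Plot of means: roughly parallel lines suggest little need for an interaction
with(toy_cc, interaction.plot(grp, trt, y, ylab = "Mean of y"))

# Fit models without and with the interaction term, then compare them
m_main <- lm(y ~ grp + trt, data = toy_cc)
m_int  <- lm(y ~ grp * trt, data = toy_cc)
anova(m_main, m_int)
```

The plot and the model comparison are two views of the same question. The decision about an interaction term should rest on how far the lines in the plot depart from parallel, and not on a p-value alone.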

Data for Question 2 (`hbp3456` data)

The (simulated) data in the hbp3456.csv file describe a total of 3456 people living with hypertension (high blood pressure) diagnoses who receive primary care in one of eight practices.

In each of the eight practices, 432 (different) individuals (who I’ll call subjects in what follows) were sampled at random from all eligible subjects.
The data are based on real electronic health record (EHR) data, but with some noise added.
- The practices are named after streets that appear in The Simpsons.
- There are 62 (fictional) providers identified across the eight practices, and each provider cares for subjects within a single practice.

Eligibility Criteria

The data are cross-sectional and describe results from a one-year reporting window. To be eligible for the study, a subject had to meet all of the following criteria:

- have an EHR-documented hypertension diagnosis which applied during the one-year reporting window,
- be cared for at one of the eight practices in this study, by one of the 62 participating providers in this study,
- be age 25 or older at the start of the one-year reporting period (note that all subjects with ages 80 and higher are listed as age 80 in the data),
- have between 1 and 12 primary care office visits in the one-year reporting period,
- have between 2 and 24 primary care office visits combined across the reporting period and the previous year,
- fall into one of two biological sex categories (female or male),
- fall into one of four primary insurance categories, specifically Medicare, Commercial, Medicaid or Uninsured, and
- have a most recent systolic BP between 80 and 220 mm Hg and most recent diastolic BP between 40 and 140 mm Hg, where the systolic BP is at least 15 and no more than 130 mm Hg larger than the diastolic BP.
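
The blood pressure criterion combines three conditions, which may be easier to read as code. The helper `bp_eligible()` below is hypothetical (it is not part of the course materials), the readings are invented, and the sketch assumes that "between" includes the endpoints.

```r
# Hypothetical helper (not from the source) that checks only the blood pressure
# criterion: SBP in [80, 220], DBP in [40, 140], and SBP exceeding DBP by 15 to 130.
# Endpoints are assumed to be inclusive.
bp_eligible <- function(sbp, dbp) {
  gap <- sbp - dbp
  sbp >= 80 & sbp <= 220 &
    dbp >= 40 & dbp <= 140 &
    gap >= 15 & gap <= 130
}

# Invented readings for illustration
bp_eligible(sbp = c(120, 100, 230, 210), dbp = c(80, 90, 100, 60))
```

Only the first reading passes: the second has a gap of just 10 mm Hg, the third has a systolic BP above 220, and the fourth has a gap of 150 mm Hg.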

Codebook

Variable	Description
`record`	unique code for each subject (six digits, first digit is 9, last indicates practice)
`practice`	primary care practice, of which there are eight in the data
`provider`	primary care provider (each practice has multiple providers)
`age`	subject’s age as of the start of the reporting period
`race`	subject’s race (4 levels: Asian, AA_Black, White, Other)
`eth_hisp`	is subject of Hispanic/Latino ethnicity? Yes or No
`sex`	subject’s sex (F or M)
`insurance`	subject’s primary insurance (Medicare, Commercial, Medicaid, Uninsured)
`income`	estimated median income of subject’s home neighborhood (via American Community Survey, to nearest $100)
`hsgrad`	estimated percentage of adults living in the subject’s home neighborhood who have graduated from high school (via American Community Survey, to the nearest tenth of a percent)
`tobacco`	tobacco use status (Current, Former, or Never)
`depr_diag`	does subject have depression diagnosis? Yes or No
`height`	subject’s height in meters, rounded to two decimal places
`weight`	subject’s weight in kilograms, rounded to one decimal place
`ldl`	subject’s LDL cholesterol level, in mg/dl
`statin`	does subject have a current prescription for a statin medication? Yes or No
`bp_med`	does subject have a current prescription for a blood pressure control medication? Yes or No
`sbp`	subject’s most recently obtained systolic blood pressure, in mm Hg
`dbp`	subject’s most recently obtained diastolic blood pressure, in mm Hg
`visits_1`	subject’s number of visits for primary care in reporting period (one year)
`visits_2`	subject’s number of visits for primary care in the past two years (the reporting period and the previous year combined)
`acearb`	does subject have a current prescription for an ACE-inhibitor or ARB? Yes or No
`betab`	does subject have a current prescription for a beta-blocker? Yes or No

Notes on Specific Variables

The list of medications included in bp_med is: ACE-inhibitor, ARB, Diuretic, Calcium-Channel Blocker, Beta-Blocker, Alpha-1 Blocker, Centrally acting Alpha-2 Agonist, Vasodilator or other antihypertensive agents. A subject with a current prescription for any of these will have a Yes in bp_med.
For the acearb, betab, bp_med, statin and depr_diag variables, a No response includes all subjects where there’s no evidence in the EHR of meeting the Yes criterion, so that there are no missing values (a missing value is interpreted there as No.)
For the height, weight and ldl results, implausible values were treated as missing in preparing the data for you.
The race and eth_hisp values are self-reported, and some subjects refused to answer one or both of the relevant questions.
The income and hsgrad values are imputed from the subject’s home address, usually at the census block level, but occasionally at the level of the zip code.
- When a subject’s home address could not be geocoded, these values are noted as missing.
- Geocoded estimates of income below 6500 are reported as 6500, and estimates above 130000 are reported as 130000.
- For hsgrad, geocoded estimates below 40 are reported as 40, and estimates above 99.9 are reported as 99.9.
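
The reporting limits above amount to clamping each estimate to a fixed range. A minimal sketch of that rule follows. The helper `clamp()` is hypothetical and the input values are invented; this illustrates the rule and is not the code used to prepare hbp3456.

```r
# Hypothetical helper (not from the source): clamp values to [lower, upper],
# leaving missing values as NA
clamp <- function(x, lower, upper) {
  pmin(pmax(x, lower), upper)
}

# Invented neighborhood estimates for illustration
income_raw <- c(4200, 52300, 187000, NA)
hsgrad_raw <- c(35.2, 88.4, 100, NA)

clamp(income_raw, lower = 6500, upper = 130000)
clamp(hsgrad_raw, lower = 40, upper = 99.9)
```

One practical consequence is that values sitting exactly at a limit (for instance, an income of 6500 or 130000) may represent more extreme underlying estimates.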

Loading the Data for Question 2

Here’s the approach I took to load and view the hbp3456 data.

library(janitor)
library(tidyverse)

knitr::opts_chunk$set(comment = NA)

hbp3456 <- read_csv("https://raw.githubusercontent.com/THOMASELOVE/432-data/master/data/hbp3456.csv", show_col_types = FALSE) |>
  clean_names() |>
  mutate(record = as.character(record))

hbp3456

# A tibble: 3,456 × 23
   record practice provider   age race    eth_hisp sex   insurance income hsgrad
   <chr>  <chr>    <chr>    <dbl> <chr>   <chr>    <chr> <chr>      <dbl>  <dbl>
 1 900018 Walnut   W_05        64 <NA>    <NA>     F     Medicare   15600   83  
 2 900024 King     K_07        74 AA_Bla… No       F     Medicare   16200   92.8
 3 900037 Sycamore S_06        60 AA_Bla… No       F     Commerci…  21400   79  
 4 900043 Highland H_07        46 White   Yes      F     Medicaid   38300   83.5
 5 900057 Sycamore S_04        59 AA_Bla… No       M     Commerci…  23200   78.7
 6 900062 Elm      E_03        54 AA_Bla… No       M     Commerci…  48600   85.5
 7 900076 Plympton P_03        74 White   No       M     Commerci…  64200   92.9
 8 900082 Elm      E_06        73 White   No       M     Medicare   48600   85.5
 9 900097 Sycamore S_10        58 AA_Bla… No       F     Commerci…  29900   86.2
10 900101 Center   C_01        46 AA_Bla… No       M     Uninsured  63600   97.5
# ℹ 3,446 more rows
# ℹ 13 more variables: tobacco <chr>, depr_diag <chr>, height <dbl>,
#   weight <dbl>, ldl <dbl>, statin <chr>, bp_med <chr>, sbp <dbl>, dbp <dbl>,
#   visits_1 <dbl>, visits_2 <dbl>, acearb <chr>, betab <chr>

Use of AI

If you decide to use some sort of AI to help you with this Lab, we ask that you place a note to that effect, describing what you used and how you used it, as a separate section called “Use of AI”, after your answers to our questions, and just before your presentation of the Session Information. Thank you.

Include the Session Information

Please display your session information at the end of your submission, as shown below.

xfun::session_info()

R version 4.3.3 (2024-02-29 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  askpass_1.2.0       backports_1.4.1     base64enc_0.1.3    
  bit_4.0.5           bit64_4.0.5         blob_1.2.4         
  broom_1.0.5         bslib_0.7.0         cachem_1.0.8       
  callr_3.7.6         cellranger_1.1.0    cli_3.6.2          
  clipr_0.8.0         colorspace_2.1-0    compiler_4.3.3     
  conflicted_1.2.0    cpp11_0.4.7         crayon_1.5.2       
  curl_5.2.1          data.table_1.15.4   DBI_1.2.2          
  dbplyr_2.5.0        digest_0.6.35       dplyr_1.1.4        
  dtplyr_1.3.1        ellipsis_0.3.2      evaluate_0.23      
  fansi_1.0.6         farver_2.1.1        fastmap_1.1.1      
  fontawesome_0.5.2   forcats_1.0.0       fs_1.6.3           
  gargle_1.5.2        generics_0.1.3      ggplot2_3.5.0      
  glue_1.7.0          googledrive_2.1.1   googlesheets4_1.1.1
  graphics_4.3.3      grDevices_4.3.3     grid_4.3.3         
  gtable_0.3.4        haven_2.5.4         highr_0.10         
  hms_1.1.3           htmltools_0.5.8.1   htmlwidgets_1.6.4  
  httr_1.4.7          ids_1.0.1           isoband_0.2.7      
  janitor_2.2.0       jquerylib_0.1.4     jsonlite_1.8.8     
  knitr_1.46          labeling_0.4.3      lattice_0.22.6     
  lifecycle_1.0.4     lubridate_1.9.3     magrittr_2.0.3     
  MASS_7.3.60.0.1     Matrix_1.6.5        memoise_2.0.1      
  methods_4.3.3       mgcv_1.9.1          mime_0.12          
  modelr_0.1.11       munsell_0.5.1       nlme_3.1.164       
  openssl_2.1.1       parallel_4.3.3      pillar_1.9.0       
  pkgconfig_2.0.3     prettyunits_1.2.0   processx_3.8.4     
  progress_1.2.3      ps_1.7.6            purrr_1.0.2        
  R6_2.5.1            ragg_1.3.0          rappdirs_0.3.3     
  RColorBrewer_1.1.3  readr_2.1.5         readxl_1.4.3       
  rematch_2.0.0       rematch2_2.1.2      reprex_2.1.0       
  rlang_1.1.3         rmarkdown_2.26      rstudioapi_0.16.0  
  rvest_1.0.4         sass_0.4.9          scales_1.3.0       
  selectr_0.4.2       snakecase_0.11.1    splines_4.3.3      
  stats_4.3.3         stringi_1.8.3       stringr_1.5.1      
  sys_3.4.2           systemfonts_1.0.6   textshaping_0.3.7  
  tibble_3.2.1        tidyr_1.3.1         tidyselect_1.2.1   
  tidyverse_2.0.0     timechange_0.3.0    tinytex_0.50       
  tools_4.3.3         tzdb_0.4.0          utf8_1.2.4         
  utils_4.3.3         uuid_1.2.0          vctrs_0.6.5        
  viridisLite_0.4.2   vroom_1.6.5         withr_3.0.0        
  xfun_0.43           xml2_1.3.6          yaml_2.3.8

After the Lab

We will post an answer sketch 24 hours after the Lab is due.

We will post grades to our Grading Roster on our Shared Google Drive one week after the Lab is due.

See the Lab Appeal Policy in our Syllabus if you are interested in having your Lab grade reviewed, and use the Lab Regrade Request form specified there to complete the task. Thank you.