library(nhanesA)
library(janitor)
library(tidyverse)

# Pull the three NHANES 2017-18 files
temp1 <- nhanes('DEMO_J')
temp2 <- nhanes('HSQ_J')
temp3 <- nhanes('PAQ_J')

# Merge the files on the subject identifier
temp12 <- inner_join(temp1, temp2, by = "SEQN")
temp123 <- inner_join(temp12, temp3, by = "SEQN")

# Keep ages 21-49 with a Yes (1) or No (2) response to PAQ665
lab2q1 <- temp123 |>
  select(SEQN, WTINT2YR, RIDAGEYR, HSD010, PAQ665) |>
  filter(RIDAGEYR > 20 & RIDAGEYR < 50) |>
  filter(PAQ665 < 3) |>
  mutate(HSD010 = factor(HSD010),
         PAQ665 = factor(PAQ665),
         SEQN = as.character(SEQN)) |>
  clean_names() |>
  tibble()

rm(temp1, temp12, temp2, temp3, temp123)

saveRDS(lab2q1, file = "data/lab2q1.Rds")
Lab 2
General Instructions
- Submit your work via Canvas.
- The deadline for this Lab is specified on the Course Calendar.
- Work submitted more than 59 minutes late, but within 12 hours of the deadline, will lose 5 of the available 50 points.
- Work submitted 12 to 24 hours after the deadline will lose 10 of the available 50 points.
- Work submitted more than 24 hours after the deadline will not be graded.
Your response should include a Quarto file (.qmd) and an HTML document that is the result of applying your Quarto file to the data we’ve provided.
Template
There is a Lab 2 Quarto template available on our 432-data page. Please use the template to prepare your response to Lab 2, as it will make things easier for you and for the people grading your work. The template is quite generic, and can also be used for other work, including Labs 3-8.
Our Best Advice
Review your HTML output file carefully before submission for copy-editing issues (spelling, grammar and syntax). Even with spell-check in RStudio (just hit F7), it’s hard to spot these errors in your Quarto file as long as the file renders successfully. You really need to look closely at the resulting HTML output.
Question 1 (25 points)
Question 1 uses the lab2q1 data, based on NHANES 2017-18 results. You will use these data to generate two different responses to the question:
Estimate the percentage of the US non-institutionalized adult population within the ages of 21-49 who engage in moderate-activity sports that would describe their General Health as either “Excellent” or “Very Good”.
a. (10 points) What percentage of the subjects who responded “Yes” to the moderate-intensity sports question included in the lab2q1 data have described their General Health as either “Excellent” or “Very Good”, among those who provided an answer to the General Health question? Be sure to use a complete-case analysis to deal with missing data on the General Health variable, and provide all of the R code you use to obtain your result, annotated with detailed text that makes it clear what you are doing as you proceed. Please express your final response as a percentage between 0 and 100, including a single decimal place.
b. (15 points) Please answer the question asked in Question 1a again, but this time accounting for the sampling weights used in wtint2yr, again using a complete-case analysis to deal with missing General Health values. As you did in Question 1a, provide all of the R code you use to obtain your result, annotated with text to make it clear what you are doing, and then express your final response to Question 1b as a percentage, again including a single decimal place.
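To make the difference between an unweighted and a weighted percentage concrete, here is a minimal sketch on a tiny made-up data frame. The values in `toy` are hypothetical and are not drawn from the lab2q1 data; the column names simply mirror the codebook. The sketch uses base R only, and it produces point estimates alone. A survey-design approach would be another reasonable way to account for the weights.

```r
# Toy data (hypothetical values, NOT the lab2q1 data)
toy <- data.frame(
  hsd010   = factor(c(1, 2, 3, NA, 2, 5)),
  paq665   = factor(c(1, 1, 1, 1, 2, 1)),
  wtint2yr = c(1000, 3000, 4000, 2500, 1500, 2000)
)

# Complete cases: "Yes" (1) on paq665 and a non-missing hsd010
cc <- toy[toy$paq665 == "1" & !is.na(toy$hsd010), ]

# 1 when General Health is Excellent (1) or Very Good (2), else 0
good <- as.numeric(cc$hsd010 %in% c("1", "2"))

# Each subject counts equally
pct_unweighted <- 100 * mean(good)

# Each subject counts in proportion to the sampling weight
pct_weighted <- 100 * weighted.mean(good, w = cc$wtint2yr)

round(c(unweighted = pct_unweighted, weighted = pct_weighted), 1)
```

In this toy example the two results differ because the subjects reporting better health happen to carry smaller weights.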
Data for Question 1
Dr. Love created these data from NHANES 2017-18 Demographics and Questionnaire data, using the code shown at the top of this document.
Specifically, he used the DEMO_J (Demographics), HSQ_J (Current Health Status) and PAQ_J (Physical Activity) files, which are described at this link.
Variables Studied in Question 1
The resulting variables are listed below.
Item | Description | Possible Responses |
---|---|---|
seqn | Subject id code | 93717 through 102956 |
wtint2yr | Full sample 2 year interview weight | min = 4363, max = 387879 |
ridageyr | Age in years at screening | min = 21, max = 49 |
hsd010 | General Health Condition | see below |
paq665 | Moderate Recreational Activities | see below |
hsd010
Would you say your health in general is
- 1 = Excellent,
- 2 = Very Good,
- 3 = Good,
- 4 = Fair, or
- 5 = Poor?
- (Note that 7 = Refused, 9 = Don’t know in this variable, which we will treat as missing.)
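As an illustration of treating the 7 and 9 codes as missing, here is a minimal base R sketch. The `raw` vector is hypothetical, and this is only one of several reasonable ways to do the recoding.

```r
# Hypothetical raw codes, including 7 (Refused) and 9 (Don't know)
raw <- c(1, 2, 7, 3, 9, 5, NA)

# Treat 7 and 9 as missing
raw[raw %in% c(7, 9)] <- NA

# Label the five substantive response levels
hsd010 <- factor(raw, levels = 1:5,
                 labels = c("Excellent", "Very Good", "Good", "Fair", "Poor"))

# Count responses, showing the missing values as well
table(hsd010, useNA = "ifany")
```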
paq665
Do you do any moderate-intensity sports, fitness, or recreational activities that cause a small increase in breathing or heart rate such as brisk walking, bicycling, swimming, or golf for at least 10 minutes continuously?
- 1 = Yes, 2 = No
Loading the Question 1 Data
I have provided the saved lab2q1.Rds file to you on the 432-data page. I encourage you to load it using the code below.
library(janitor)
library(tidyverse)
knitr::opts_chunk$set(comment = NA)
lab2q1 <- read_rds("https://raw.githubusercontent.com/THOMASELOVE/432-data/master/data/lab2q1.Rds")
lab2q1
# A tibble: 2,295 × 5
seqn wtint2yr ridageyr hsd010 paq665
<chr> <dbl> <dbl> <fct> <fct>
1 93717 53249. 22 2 2
2 93718 20257. 45 3 1
3 93729 11760. 42 4 2
4 93738 59333. 26 3 2
5 93746 27135. 25 2 2
6 93755 30922. 26 2 1
7 93761 18939. 44 3 2
8 93763 103670. 40 3 1
9 93766 16414. 36 4 2
10 93774 232377. 41 <NA> 1
# ℹ 2,285 more rows
Question 2 (25 points)
Question 2 uses the hbp3456 data.
a. (10 points) Does a person’s insurance status seem to have a meaningful impact on their systolic blood pressure, adjusting for whether or not they have a prescription for a beta-blocker? Decide whether your model should include an interaction term in a sensible way (providing a graph to help us understand your reasoning), and then fit your choice of model using the lm function in R. Display your results.
b. (15 points) Provide a written explanation of your findings, in complete sentences. Your explanation should address both the overall quality of fit and the interpretation of the coefficients of your chosen model, as well as provide a detailed description as to how you used the output you generated in part a to decide whether or not to include an interaction term.
Question 2 Hints
- One graph you might use would be one to assess the need for an interaction term, probably via a plot of means.
- Another graph (or perhaps table) to consider for insight would look at the relationship between insurance and beta-blocker status in these subjects.
- Please explicitly state in your response that you assume that the missingness you observe in these data is MCAR (missing completely at random), and that a complete case analysis is thus appropriate for this Question.
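The hints above can be sketched on simulated data as follows. Everything in `sim` is made up for illustration (the sample size, the seed, and the blood pressure distribution are all assumptions), so the output says nothing about the hbp3456 data. The sketch uses base R only: `table()` for the cross-tabulation, `interaction.plot()` for the plot of means, and `lm()` for the two candidate models.

```r
set.seed(432)  # arbitrary seed, for reproducibility only

# Simulated stand-in data (hypothetical; NOT the hbp3456 file)
n <- 400
sim <- data.frame(
  insurance = factor(sample(c("Medicare", "Commercial", "Medicaid", "Uninsured"),
                            n, replace = TRUE)),
  betab     = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
sim$sbp <- round(rnorm(n, mean = 132, sd = 15))

# Relationship between the two predictors
xt <- table(sim$insurance, sim$betab)
xt

# Plot of means: roughly parallel lines suggest little need for an interaction
with(sim, interaction.plot(x.factor = insurance, trace.factor = betab,
                           response = sbp, fun = mean,
                           ylab = "Mean SBP (mm Hg)"))

# Fit the model without and with the interaction term, then compare
m_main <- lm(sbp ~ insurance + betab, data = sim)
m_int  <- lm(sbp ~ insurance * betab, data = sim)
anova(m_main, m_int)
```

Note that `lm()` drops rows with missing values by default, which amounts to a complete case analysis for the variables in the model.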
Data for Question 2 (hbp3456 data)
The (simulated) data in the hbp3456.csv file describe a total of 3456 people living with hypertension (high blood pressure) diagnoses who receive primary care in one of eight practices.
- In each of the eight practices, 432 (different) individuals (who I’ll call subjects in what follows) were sampled at random from all eligible subjects.
- The data are based on real electronic health record (EHR) data, but with some noise added.
- The practices are named after streets that appear in The Simpsons.
- There are 62 (fictional) providers identified across the eight practices, and each provider cares for subjects within a single practice.
Eligibility Criteria
The data are cross-sectional and describe results from a one-year reporting window. To be eligible for the study, a subject had to meet all of the following criteria:
- have an EHR-documented hypertension diagnosis which applied during the one-year reporting window,
- cared for at one of the eight practices in this study, and by one of the 62 participating providers in this study
- age 25 or older at the start of the one-year reporting period (note that all subjects with ages 80 and higher are listed as age 80 in the data)
- between 1 and 12 primary care office visits in the one-year reporting period
- between 2 and 24 primary care office visits combined across the reporting period and the previous year
- fall into one of two biological sex categories (female or male)
- fall into one of four primary insurance categories, specifically Medicare, Commercial, Medicaid or Uninsured.
- have a most recent systolic BP between 80 and 220 mm Hg and most recent diastolic BP between 40 and 140 mm Hg, where the systolic BP is at least 15 and no more than 130 mm Hg larger than the diastolic BP.
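The blood pressure criterion combines three range checks, which can be written out as a small function. This is a minimal sketch: `bp_eligible` is a hypothetical helper, not part of the course materials, and it assumes that each "between" in the criterion is inclusive of its endpoints.

```r
# Hypothetical helper: TRUE when a systolic/diastolic pair meets the criterion.
# Assumes all ranges are inclusive of their endpoints.
bp_eligible <- function(sbp, dbp) {
  gap <- sbp - dbp
  sbp >= 80 & sbp <= 220 &
    dbp >= 40 & dbp <= 140 &
    gap >= 15 & gap <= 130
}

# Four made-up readings: only the first meets all three checks
bp_check <- bp_eligible(sbp = c(130, 100, 225, 200),
                        dbp = c(80, 90, 100, 60))
bp_check
```

The second reading fails because the gap is only 10 mm Hg, the third because the systolic value exceeds 220, and the fourth because the gap of 140 mm Hg exceeds 130.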
Codebook
Variable | Description |
---|---|
record | unique code for each subject (six digits, first digit is 9, last indicates practice) |
practice | primary care practice, of which there are eight in the data |
provider | primary care provider (each practice has multiple providers) |
age | subject’s age as of the start of the reporting period |
race | subject’s race (4 levels: Asian, AA_Black, White, Other) |
eth_hisp | is subject of Hispanic/Latino ethnicity? Yes or No |
sex | subject’s sex (F or M) |
insurance | subject’s primary insurance (Medicare, Commercial, Medicaid, Uninsured) |
income | estimated median income of subject’s home neighborhood (via American Community Survey, to nearest $100) |
hsgrad | estimated percentage of adults living in the subject’s home neighborhood who have graduated from high school (via American Community Survey, to the nearest tenth of a percent) |
tobacco | tobacco use status (Current, Former, or Never) |
depr_diag | does subject have depression diagnosis? Yes or No |
height | subject’s height in meters, rounded to two decimal places |
weight | subject’s weight in kilograms, rounded to one decimal place |
ldl | subject’s LDL cholesterol level, in mg/dl |
statin | does subject have a current prescription for a statin medication? Yes or No |
bp_med | does subject have a current prescription for a blood pressure control medication? Yes or No |
sbp | subject’s most recently obtained systolic blood pressure, in mm Hg |
dbp | subject’s most recently obtained diastolic blood pressure, in mm Hg |
visits_1 | subject’s number of visits for primary care in reporting period (one year) |
visits_2 | subject’s visits for primary care in the past two years |
acearb | does subject have a current prescription for an ACE-inhibitor or ARB? Yes or No |
betab | does subject have a current prescription for a beta-blocker? Yes or No |
Notes on Specific Variables
- The list of medications included in bp_med is: ACE-inhibitor, ARB, Diuretic, Calcium-Channel Blocker, Beta-Blocker, Alpha-1 Blocker, Centrally acting Alpha-2 Agonist, Vasodilator or other antihypertensive agents. A subject with a current prescription for any of these will have a Yes in bp_med.
- For the acearb, betab, bp_med, statin and depr_diag variables, a No response includes all subjects where there’s no evidence in the EHR of meeting the Yes criterion, so that there are no missing values (a missing value is interpreted there as No.)
- For the height, weight and ldl results, implausible values were treated as missing in preparing the data for you.
- The race and eth_hisp values are self-reported, and some subjects refused to answer one or both of the relevant questions.
- The income and hsgrad values are imputed from the subject’s home address, usually at the census block level, but occasionally at the level of the zip code.
  - When a subject’s home address could not be geocoded, these values are noted as missing.
  - Geocoded estimates of income below 6500 are reported as 6500, and estimates above 130000 are reported as 130000.
  - For hsgrad, geocoded estimates below 40 are reported as 40, and estimates above 99.9 are reported as 99.9.
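The reporting limits on income and hsgrad described above amount to clamping each value to a fixed range. Here is a minimal sketch of that idea; `clamp` is a hypothetical helper and the input values are made up, so this is not necessarily how the data were actually prepared.

```r
# Hypothetical helper: pull values outside [lower, upper] back to the nearest limit
clamp <- function(x, lower, upper) pmin(pmax(x, lower), upper)

# Made-up geocoded estimates, including a missing value
income_raw <- c(4200, 6500, 48600, 150000, NA)
hsgrad_raw <- c(35.2, 83.0, 99.95, NA, 40)

income_rep <- clamp(income_raw, 6500, 130000)
hsgrad_rep <- clamp(hsgrad_raw, 40, 99.9)

income_rep
hsgrad_rep
```

Missing values stay missing, because `pmax()` and `pmin()` return NA for an NA input unless `na.rm = TRUE` is set.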
Loading the Data for Question 2
Here’s the approach I took to load and view the hbp3456 data.
library(janitor)
library(tidyverse)
knitr::opts_chunk$set(comment = NA)
hbp3456 <- read_csv("https://raw.githubusercontent.com/THOMASELOVE/432-data/master/data/hbp3456.csv", show_col_types = FALSE) |>
  clean_names() |>
  mutate(record = as.character(record))
hbp3456
# A tibble: 3,456 × 23
record practice provider age race eth_hisp sex insurance income hsgrad
<chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 900018 Walnut W_05 64 <NA> <NA> F Medicare 15600 83
2 900024 King K_07 74 AA_Bla… No F Medicare 16200 92.8
3 900037 Sycamore S_06 60 AA_Bla… No F Commerci… 21400 79
4 900043 Highland H_07 46 White Yes F Medicaid 38300 83.5
5 900057 Sycamore S_04 59 AA_Bla… No M Commerci… 23200 78.7
6 900062 Elm E_03 54 AA_Bla… No M Commerci… 48600 85.5
7 900076 Plympton P_03 74 White No M Commerci… 64200 92.9
8 900082 Elm E_06 73 White No M Medicare 48600 85.5
9 900097 Sycamore S_10 58 AA_Bla… No F Commerci… 29900 86.2
10 900101 Center C_01 46 AA_Bla… No M Uninsured 63600 97.5
# ℹ 3,446 more rows
# ℹ 13 more variables: tobacco <chr>, depr_diag <chr>, height <dbl>,
# weight <dbl>, ldl <dbl>, statin <chr>, bp_med <chr>, sbp <dbl>, dbp <dbl>,
# visits_1 <dbl>, visits_2 <dbl>, acearb <chr>, betab <chr>
Use of AI
If you decide to use some sort of AI to help you with this Lab, we ask that you place a note to that effect, describing what you used and how you used it, as a separate section called “Use of AI”, after your answers to our questions, and just before your presentation of the Session Information. Thank you.
Include the Session Information
Please display your session information at the end of your submission, as shown below.
xfun::session_info()
R version 4.3.3 (2024-02-29 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)
Locale:
LC_COLLATE=English_United States.utf8
LC_CTYPE=English_United States.utf8
LC_MONETARY=English_United States.utf8
LC_NUMERIC=C
LC_TIME=English_United States.utf8
Package version:
askpass_1.2.0 backports_1.4.1 base64enc_0.1.3
bit_4.0.5 bit64_4.0.5 blob_1.2.4
broom_1.0.5 bslib_0.7.0 cachem_1.0.8
callr_3.7.6 cellranger_1.1.0 cli_3.6.2
clipr_0.8.0 colorspace_2.1-0 compiler_4.3.3
conflicted_1.2.0 cpp11_0.4.7 crayon_1.5.2
curl_5.2.1 data.table_1.15.4 DBI_1.2.2
dbplyr_2.5.0 digest_0.6.35 dplyr_1.1.4
dtplyr_1.3.1 ellipsis_0.3.2 evaluate_0.23
fansi_1.0.6 farver_2.1.1 fastmap_1.1.1
fontawesome_0.5.2 forcats_1.0.0 fs_1.6.3
gargle_1.5.2 generics_0.1.3 ggplot2_3.5.0
glue_1.7.0 googledrive_2.1.1 googlesheets4_1.1.1
graphics_4.3.3 grDevices_4.3.3 grid_4.3.3
gtable_0.3.4 haven_2.5.4 highr_0.10
hms_1.1.3 htmltools_0.5.8.1 htmlwidgets_1.6.4
httr_1.4.7 ids_1.0.1 isoband_0.2.7
janitor_2.2.0 jquerylib_0.1.4 jsonlite_1.8.8
knitr_1.46 labeling_0.4.3 lattice_0.22.6
lifecycle_1.0.4 lubridate_1.9.3 magrittr_2.0.3
MASS_7.3.60.0.1 Matrix_1.6.5 memoise_2.0.1
methods_4.3.3 mgcv_1.9.1 mime_0.12
modelr_0.1.11 munsell_0.5.1 nlme_3.1.164
openssl_2.1.1 parallel_4.3.3 pillar_1.9.0
pkgconfig_2.0.3 prettyunits_1.2.0 processx_3.8.4
progress_1.2.3 ps_1.7.6 purrr_1.0.2
R6_2.5.1 ragg_1.3.0 rappdirs_0.3.3
RColorBrewer_1.1.3 readr_2.1.5 readxl_1.4.3
rematch_2.0.0 rematch2_2.1.2 reprex_2.1.0
rlang_1.1.3 rmarkdown_2.26 rstudioapi_0.16.0
rvest_1.0.4 sass_0.4.9 scales_1.3.0
selectr_0.4.2 snakecase_0.11.1 splines_4.3.3
stats_4.3.3 stringi_1.8.3 stringr_1.5.1
sys_3.4.2 systemfonts_1.0.6 textshaping_0.3.7
tibble_3.2.1 tidyr_1.3.1 tidyselect_1.2.1
tidyverse_2.0.0 timechange_0.3.0 tinytex_0.50
tools_4.3.3 tzdb_0.4.0 utf8_1.2.4
utils_4.3.3 uuid_1.2.0 vctrs_0.6.5
viridisLite_0.4.2 vroom_1.6.5 withr_3.0.0
xfun_0.43 xml2_1.3.6 yaml_2.3.8
After the Lab
We will post an answer sketch 24 hours after the Lab is due.
We will post grades to our Grading Roster on our Shared Google Drive one week after the Lab is due.
See the Lab Appeal Policy in our Syllabus if you are interested in having your Lab grade reviewed, and use the Lab Regrade Request form specified there to complete the task. Thank you.