Lab 5

Published

2025-04-02

General Instructions

Submit your work via Canvas.
The deadline for this Lab is specified on the Course Calendar.
- We charge a 5 point penalty for a lab that is 1-48 hours late.
- We do not grade work that is more than 48 hours late.
Your response should include a Quarto file (.qmd) and an HTML document that is the result of applying your Quarto file to the data we’ve provided.

Important

You can skip exactly one of Labs 1-5 without penalty, but all students must complete both Lab 6 and Lab 7. If you decide to skip a lab, please submit a note to Canvas by the deadline saying that you are skipping the lab.

Template

You should be able to modify any of the first three Lab templates available on our 432-data page to help you do this Lab. Feel encouraged to try a different HTML theme if you like, maybe yeti or spacelab or materia.

Tip

In my answer sketch for Lab 5, I used the following R packages:

janitor, naniar
readxl, survey, tableone
easystats and tidyverse

in case that is useful for you to know.

Our Best Advice

Review your HTML output file carefully before submission for copy-editing issues (spelling, grammar and syntax.) Even with spell-check in RStudio (just hit F7), it’s hard to find errors with these issues in your Quarto file so long as it is running. You really need to look closely at the resulting HTML output.

The Data

The hbp3024.xlsx Excel file (first introduced in Lab 2) and the nh_3143.Rds R data set (introduced in Lab 3) are available for download on the 432 data page.
A detailed description of each variable in the hbp3024 data is available here.
A detailed description of each variable in the nh_3143 data is available here.

Question 1 (30 points)

Question 1 uses the hbp3024 data. Build a Table 1 to compare the subjects in the Elm practice to the subjects in the King practice on the following nine variables:

age,
sex,
race,
primary insurance,
tobacco use status,
body mass index,
BMI category,
systolic blood pressure, and
diastolic blood pressure.

Make the Table as well as you can within Quarto, and display the result as part of your HTML file. All code must be visible to us. Include a description of the important results from your Table 1 that does not exceed 100 words, using complete English sentences.

Hints for Question 1

Be sure that your table specifies the number of subjects in each practice. Note that you’ll have to do something so that your work focuses on the comparison of Elm to King, leaving out the other practices.
You’ll have to deal with some missing values in the data. All missing values are indicated in the .xlsx file with NA, and you’ll need to specify that as you ingest the data in order to have them show up properly in R. It’s not usually appropriate to report results that include imputation in a Table 1, so build a note specifying the amount of missing data in a footnote to the table. An appropriate approach would be to produce a list just below your Table. Do not impute.
Some variables will present as characters in the data, but you’d instead prefer them to appear as factors. Be sure to include code in your response to make these changes (the forcats package is your friend here) and then (perhaps using the fct_relevel function in the forcats package) be sure to move the levels of those factors into an order that facilitates interpretation.
Be sure, too, to make reasoned choices about whether means and standard deviations or instead medians and quartiles are more appropriate displays for the quantitative variables. Include your reasons in a list displayed at the end of your table. Note that the record information is just a code (even though it is numerical) and should be treated as a character variable in using these data.
Note that body mass index (BMI) and BMI category are not supplied in the data, although you do have height and weight. So, you’ll have to calculate the BMI and add it to the data set. If you don’t know the formula for BMI, you have Google to help you figure it out.
For BMI categories, use the four groups specified in the How is BMI interpreted for Adults section of this description of Adult BMI by the Centers for Disease Control. Again, you’ll need to use your calculated BMI values and then create the categories in your data set, and you’ll need to figure out a way to accurately get each subject into the correct category.
Do not include R output without complete sentences describing what you are doing in each step, and what you conclude from that work.

Question 2 (20 points)

Question 2 uses the nh_3143 data to generate two different responses to the question:

Estimate the percentage of the US non-institutionalized adult population within the ages of 30-59 who have smoked at least 100 cigarettes in their life that would describe their General Health as either “Excellent” or “Very Good”.

{10 points} What percentage of the subjects who responded “Yes” to the smoked at least 100 cigarettes in their life question included in the nh_3143 data have described their General Health as either “Excellent” or “Very Good”? Provide all of the R code you use to obtain your result, annotated with detailed text that makes it clear what you are doing as you proceed. Please express your final response as a percentage between 0 and 100, including a single decimal place.
{10} Please answer the question asked in Question 2a, again, but this time accounting for the sampling weights used in wtint2yr. Again, provide all of the R code you use to obtain your result, annotated with text to make it clear what you are doing, and then express your final response as a percentage, again including a single decimal place.

Use of AI

If you decide to use some sort of AI to help you with this Lab, we ask that you place a note to that effect, describing what you used and how you used it, as a separate section called “Use of AI”, after your answers to our questions, and just before your presentation of the Session Information. Thank you.

Be sure to include Session Information

Please display your session information at the end of your submission, as shown below.

xfun::session_info()

R version 4.4.3 (2025-02-28 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  base64enc_0.1.3   bslib_0.9.0       cachem_1.1.0      cli_3.6.4        
  compiler_4.4.3    digest_0.6.37     evaluate_1.0.3    fastmap_1.2.0    
  fontawesome_0.5.3 fs_1.6.6          glue_1.8.0        graphics_4.4.3   
  grDevices_4.4.3   highr_0.11        htmltools_0.5.8.1 htmlwidgets_1.6.4
  jquerylib_0.1.4   jsonlite_2.0.0    knitr_1.50        lifecycle_1.0.4  
  memoise_2.0.1     methods_4.4.3     mime_0.13         R6_2.6.1         
  rappdirs_0.3.3    rlang_1.1.6       rmarkdown_2.29    rstudioapi_0.17.1
  sass_0.4.10       stats_4.4.3       tinytex_0.57      tools_4.4.3      
  utils_4.4.3       xfun_0.52         yaml_2.3.10

After the Lab

We will post an answer sketch to our Shared Google Drive 48 hours after the Lab is due.
We will post grades to our Grading Roster on our Shared Google Drive one week after the Lab is due.
See the Lab Appeal Policy in Section 8.5 of our Syllabus if you are interested in having your Lab 1, 2, 3, 4, 5 or 7 grade reviewed, and use the Lab Regrade Request form to complete the task. The form (which is optional) is due when the Calendar says it is. Thank you.