Lab 1

Published

2025-01-15

General Instructions

  • Submit your work via Canvas.
  • The deadline for this Lab is specified on the Course Calendar.
    • We charge a 5 point penalty for a lab that is 1-48 hours late.
    • We do not grade work that is more than 48 hours late.
  • Your response should include a Quarto file (.qmd) and an HTML document that is the result of applying your Quarto file to the data we’ve provided.
Important

You can skip exactly one of Labs 1-5 without penalty, but all students must complete both Lab 6 and Lab 7. If you decide to skip a lab, please submit a note to Canvas by the deadline saying that you are skipping the lab.

Template

There is a Lab 1 Quarto template available on our 432-data page. Please use the template to prepare your response to Lab 1, as it will make things easier for you and for the people grading your work.

Our Best Advice

Review your HTML output file carefully before submission for copy-editing issues (spelling, grammar and syntax.) Even with spell-check in RStudio (just hit F7), it’s hard to find errors with these issues in your Quarto file so long as it is running. You really need to look closely at the resulting HTML output.

The Data

The chr_2024.csv data set we have provided describes a series of 30 variables, pulled from the data for 3054 counties in the County Health Rankings report for 2024.

  • The chr_2024.csv file is available for download on the 432 data page.
  • A detailed codebook for all of the data in the chr_2024 file is available here.

Question 1 (20 points)

Load the data from chr_2024.csv into R appropriately. You should see 30 variables and 3054 counties (rows) in your data. Now, filter your data to create a new data set called ohio24 which should include only the 88 counties located in the state of Ohio.

  1. {5 points} Use R code to demonstrate concisely that there are no missing values in the 88 Ohio counties for any of the 30 variables in the ohio24 data.

  2. {15} Then create a visualization (using R and Quarto) based on your ohio24 data to help describe the 88 Ohio counties and share it (the visualization and all of the R code you used to build it) with us.

The visualization should:

  • be of a professional quality,
  • describe information from three different variables from this list of 15 below:
    • prem_death, pf_health, poor_phys, poor_ment, low_bwt
    • smoking, drinking, sti_rate, unins, pcp_rate
    • unemp, hs_grad, sev_hous, commute, non_eng
  • you are welcome to transform or re-express the variables if that is of interest to you
    • the main option we have in mind is an attractive faceted scatterplot showing the association of two of the variables divided into categories by a third variable
  • include proper labels and a meaningful title,
  • include a caption of no more than 75 words that highlights the key result. Your caption can be placed within the visualization, or in a note below.
  • In developing your caption, I find it helpful to think about what question this visualization is meant to answer, and then provide a caption which makes it clear what the question (and answer) is.

You are welcome to find useful tools for visualizing data in R that we have seen in either 431 or 432 or elsewhere.

Although you may fit a model to help show patterns if you like, your primary task is to show the data in a meaningful way, rather than to simply highlight the results of a model.

We will evaluate Question 1 based on the quality of the visualization, its title and caption, in terms of being attractive, well-labeled and useful for representing the data reported in County Health Rankings 2024 for Ohio, and how well it adheres to general principles for good visualizations we’ve seen in 431 and 432.

Tip

The code below could be used to create four groups of 22 counties from the smoking data, using the categorize() function from the datawizard package in the easystats ecosystem.

ohio24 <- ohio24 |> 
  mutate(smoke_group = categorize(smoking, split = "quantile", 
                                  n_groups = 4, labels = "range"))

Question 2 (30 points)

Create a linear regression model using the data for the state of Ohio that you developed in Question 1 to predict obesity as a function of food_env adjusting for unemployment, and treating all three variables as quantitative. Please build your model using main effects only, entered as linear predictors without transformation, and call this model model1.

  1. {10 points} Provide R code which specifies the estimated coefficient of food_env and a 90% confidence interval around that estimate. Then write a concise but sufficient explanation of the meaning of these results in context using complete English sentences.

  2. {10} Evaluate the quality of the model you fit in part a, in terms of adherence to regression modeling assumptions, through a set of regression residual plots using check_model(). Describe any problems you see with the residual plots in complete sentences.

Tip

In question 2b, I suggest you use check_model(model1, detrend = FALSE) so that the Normal Q-Q plot looks like it usually does in our other work.

  1. {10} Create an attractive table which compares model1 to a simple linear model (called model2) for the same outcome (obesity) which uses only the food_env variable as a predictor. Your comparisons should include assessments of raw and adjusted R-squared, AIC, BIC and the residual standard error in your sample of all 88 Ohio counties. Then reflect on your findings in at least two complete sentences: based on these metrics, which model looks like it fits the Ohio obesity data for 2021 (that were reported in CHR 2024) more effectively, and why?

Use of AI

If you decide to use some sort of AI to help you with this Lab, we ask that you place a note to that effect, describing what you used and how you used it, as a separate section called “Use of AI”, after your answers to our questions, and just before your presentation of the Session Information. Thank you.

Be sure to include Session Information

Please display your session information at the end of your submission, as shown below.

xfun::session_info()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  base64enc_0.1.3   bslib_0.8.0       cachem_1.1.0      cli_3.6.3        
  compiler_4.4.2    digest_0.6.37     evaluate_1.0.3    fastmap_1.2.0    
  fontawesome_0.5.3 fs_1.6.5          glue_1.8.0        graphics_4.4.2   
  grDevices_4.4.2   highr_0.11        htmltools_0.5.8.1 htmlwidgets_1.6.4
  jquerylib_0.1.4   jsonlite_1.8.9    knitr_1.49        lifecycle_1.0.4  
  memoise_2.0.1     methods_4.4.2     mime_0.12         R6_2.5.1         
  rappdirs_0.3.3    rlang_1.1.4       rmarkdown_2.29    rstudioapi_0.17.1
  sass_0.4.9        stats_4.4.2       tinytex_0.54      tools_4.4.2      
  utils_4.4.2       xfun_0.50         yaml_2.3.10      

After the Lab

  • We will post an answer sketch to our Shared Google Drive 48 hours after the Lab is due.
  • We will post grades to our Grading Roster on our Shared Google Drive one week after the Lab is due.
  • See the Lab Appeal Policy in our Syllabus if you are interested in having your Lab grade reviewed, and use the Lab Regrade Request form specified there to complete the task. Thank you.