Lab 1

Published

2024-01-21

General Instructions

Submit your work via Canvas.
The deadline for this Lab is specified on the Course Calendar.
- Work submitted more than 59 minutes late but within 12 hours of the deadline will lose 5 of the available 50 points.
- Work submitted 12 to 24 hours after the deadline will lose 10 of the available 50 points.
- Work submitted more than 24 hours after the deadline will not be graded.

Your response should include a Quarto file (.qmd) and an HTML document that is the result of applying your Quarto file to the data we’ve provided.

Template

There is a Lab 1 Quarto template available on our 432-data page. Please use the template to prepare your response to Lab 1, as it will make things easier for you and for the people grading your work.

Our Best Advice

Review your HTML output file carefully before submission for copy-editing issues (spelling, grammar and syntax). Even with spell-check in RStudio (just hit F7), it’s hard to spot these errors in your Quarto file, because the file will render without complaint as long as the code runs. You really need to look closely at the resulting HTML output.

The Data

The oh_counties_2022.csv data set I have provided describes a series of variables, pulled from the data for the 88 counties of the State of Ohio from the County Health Rankings report for 2022.

The oh_counties_2022.csv file is available for download on the 432 data page.
Several detailed County Health Rankings files augment these 2022 Ohio Rankings Data. Find those items here if you’re interested. Remember to use the 2022 files.

The Variables

The available variables are listed below. Each variable describes data at the COUNTY level.

Variable	Description
`fips`	Federal Information Processing Standard code
`county`	name of County
`years_lost_rate`	age-adjusted years of potential life lost rate (per 100,000 population)
`sroh_fairpoor`	% of adults reporting fair or poor health (via BRFSS)
`phys_days`	mean number of reported physically unhealthy days per month
`ment_days`	mean number of reported mentally unhealthy days per month
`lbw_pct`	% of births with low birth weight (< 2500 grams)
`smoker_pct`	% of adults that report currently smoking
`obese_pct`	% of adults that report body mass index of 30 or higher
`food_env`	indicator of access to healthy foods, in points (0 is worst, 10 is best)
`inactive_pct`	% of adults that report no leisure-time physical activity
`exer_access`	% of the population with access to places for physical activity
`exc_drink`	% of adults that report excessive drinking
`alc_drive`	% of driving deaths with alcohol involvement
`sti_rate`	Chlamydia cases / Population x 100,000
`teen_births`	Teen births / females ages 15-19 x 1,000
`uninsured`	% of people under age 65 without insurance
`pcp_ratio`	Population to Primary Care Physicians ratio
`prev_hosp`	Rate of hospital stays for ambulatory-care sensitive conditions per 100,000 Medicare enrollees
`hsgrads`	High School graduation rate
`unemployed`	% of population age 16+ who are unemployed and looking for work
`poor_kids`	% of children (under age 18) living in poverty
`income_ratio`	Ratio of household income at the 80th percentile to income at the 20th percentile
`associations`	# of social associations / population x 10,000
`pm2.5`	Average daily amount of fine particulate matter in micrograms per cubic meter (appears as `pm2_5` after `clean_names()`)
`h2oviol`	Presence of a water violation: Yes or No
`sev_housing`	% of households with at least 1 of 4 housing problems: overcrowding, high housing costs, lack of kitchen facilities, or lack of plumbing facilities
`drive_alone`	% of workers who drive alone to work
`age_adj_mortality`	premature age-adjusted mortality
`dm_prev`	% of adults with a diabetes diagnosis
`freq_phys_distress`	% in frequent physical distress
`freq_mental_distress`	% in frequent mental distress
`food_insecure`	% who are food insecure
`insuff_sleep`	% who get insufficient sleep
`median_income`	estimated median income
`population`	population size
`age65plus`	% of population who are 65 and over
`african_am`	% of population who are African-American
`hispanic`	% of population who are of Hispanic/Latino ethnicity
`white`	% of population who are White
`female`	% of population who are Female
`rural`	% of people in the county who live in rural areas

Loading the Data

Applying the clean_names() function from the janitor package as part of the initial oh22 creation process, as I’ve done in my code below, is a sensible strategy. We hope you’ll adopt it when ingesting almost any data you ever try to pull into R.

library(janitor)
library(tidyverse)

knitr::opts_chunk$set(comment = NA)

oh22 <- read_csv("https://raw.githubusercontent.com/THOMASELOVE/432-data/master/data/oh_counties_2022.csv", show_col_types = FALSE) |>
  clean_names() |>
  mutate(fips = as.character(fips))

oh22

# A tibble: 88 × 43
   fips  state county  years_lost_rate sroh_fairpoor phys_days ment_days lbw_pct
   <chr> <chr> <chr>             <dbl>         <dbl>     <dbl>     <dbl>   <dbl>
 1 39001 Ohio  Adams             11037          25.5      5.41      6.1      9.3
 2 39003 Ohio  Allen              8518          20.1      4.53      5.35     9.7
 3 39005 Ohio  Ashland            7769          19.9      4.49      5.44     6  
 4 39007 Ohio  Ashtab…            9749          24.7      5.07      5.73     8  
 5 39009 Ohio  Athens             7619          22.4      5.02      5.71     8.3
 6 39011 Ohio  Auglai…            6498          17.3      4.16      5.2      6.7
 7 39013 Ohio  Belmont            8782          20.5      4.48      5.38     8.4
 8 39015 Ohio  Brown             10510          21.2      4.68      5.6      7.7
 9 39017 Ohio  Butler             9053          18.7      4.16      5.05     7.8
10 39019 Ohio  Carroll            8066          19.8      4.54      5.5      8.3
# ℹ 78 more rows
# ℹ 35 more variables: smoker_pct <dbl>, obese_pct <dbl>, food_env <dbl>,
#   inactive_pct <dbl>, exer_access <dbl>, exc_drink <dbl>, alc_drive <dbl>,
#   sti_rate <dbl>, teen_births <dbl>, uninsured <dbl>, pcp_ratio <dbl>,
#   prev_hosp <dbl>, hsgrads <dbl>, unemployed <dbl>, poor_kids <dbl>,
#   income_ratio <dbl>, associations <dbl>, pm2_5 <dbl>, h2oviol <chr>,
#   sev_housing <dbl>, drive_alone <dbl>, age_adj_mortality <dbl>, …
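
To see what clean_names() does to column names, here is a minimal sketch on a made-up one-row tibble. The raw column names `FIPS` and `County Name` and all of the values are hypothetical, chosen only for illustration; they are not copied from oh_counties_2022.csv. The sketch suggests why `pm2.5` in the variable list shows up as `pm2_5` in the output above, and it repeats the conversion of fips to character, which makes sense because fips is an identifier rather than a quantity.

```r
library(janitor)
library(tibble)

# Hypothetical one-row stand-in for a raw file; names and values are
# illustrative only, NOT taken from oh_counties_2022.csv.
raw <- tibble(
  FIPS = 39001,
  `County Name` = "Adams",
  `pm2.5` = 8.9
)

# clean_names() converts names to lower snake_case and replaces
# characters such as spaces and periods with underscores.
cleaned <- raw |>
  clean_names() |>
  dplyr::mutate(fips = as.character(fips))  # identifier, not a number

names(cleaned)
```

After cleaning, every name can be typed without backticks, which is most of the reason we recommend the habit.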

Question 1 (20 points)

Create a visualization (using R and Quarto) based on some part of the oh_counties_2022.csv data set we have provided, and share it (the visualization and all of the R code you used to build it) with us.

The visualization should:

- be of a professional quality,
- describe information from at least three different variables from those listed above,
  - you are welcome to transform or re-express the variables if that is of interest to you
  - please do not use the obese_pct variable, since we will look at that in Question 2, below.
- include proper labels and a meaningful title, and
- include a caption of no more than 75 words that highlights the key result. Your caption can be placed within the visualization, or in a note below. When developing a caption, I find it helpful to think about what question the visualization is meant to answer, and then to write a caption that makes clear what the question (and the answer) is.

You are welcome to use tools for visualizing data in R that we have seen in 431 or 432, or that you have found elsewhere.

Although you may fit a model to help show patterns, your primary task is to show the data in a meaningful way, rather than to simply highlight the results of a model.

We will evaluate Question 1 on the quality of the visualization, its title and its caption: whether they are attractive, well-labeled and useful for representing the County Health Rankings data for Ohio, and how well the visualization adheres to the general principles for good visualizations we’ve seen in 431 and 432.
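
As a rough illustration of the mechanics only (three variables in one plot, axis labels, a title, and a caption checked against the 75-word limit), here is a sketch built on simulated data. The data frame `sim` is made up and merely borrows three variable names from the list above; the relationship it shows is an artifact of the simulation, not a finding about Ohio. The choice of a scatterplot with a color scale is one option among many, not a recommendation.

```r
library(ggplot2)

# Simulated stand-in for oh22: 88 made-up "counties", NOT the real data.
set.seed(432)
sim <- data.frame(
  uninsured = runif(88, 4, 12),
  rural = runif(88, 0, 100)
)
sim$poor_kids <- 8 + 1.2 * sim$uninsured + rnorm(88, sd = 3)

# Keep the caption in its own object so its length can be checked.
cap <- paste(
  "Simulated data for illustration only.",
  "Each point is one county; color shows % of residents in rural areas."
)

p <- ggplot(sim, aes(x = uninsured, y = poor_kids, color = rural)) +
  geom_point(size = 2, alpha = 0.8) +
  scale_color_viridis_c() +
  labs(
    title = "Child poverty and uninsured rate (simulated counties)",
    x = "% under age 65 without insurance",
    y = "% of children living in poverty",
    color = "% rural",
    caption = cap
  ) +
  theme_bw()

# Count whitespace-separated words in the caption (limit is 75).
n_words <- lengths(strsplit(cap, "\\s+"))
n_words

p
```

With the real data you would replace `sim` with `oh22` and choose variables, a title and a caption that answer a question you actually care about.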

Question 2 (30 points)

Create a linear regression model using the oh22 data you developed in Question 1 to predict obese_pct as a function of food_env adjusting for median_income, and treating all three variables as quantitative. Please build your model using main effects only, entered as linear predictors without transformation, and call this model model1.
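
To make "main effects only, entered as linear predictors without transformation" concrete, here is a minimal sketch of that model form. It is fit to simulated values, not the Ohio data: the data frame `sim` and the coefficients used to generate it are made up for illustration.

```r
# Simulated stand-in for oh22 (made-up values, NOT the real data).
set.seed(432)
sim <- data.frame(
  food_env = runif(88, 5, 9),
  median_income = runif(88, 40000, 110000)
)
sim$obese_pct <- 45 - 0.8 * sim$food_env - 0.00005 * sim$median_income +
  rnorm(88, sd = 2)

# Main effects only: no interaction term, no transformations.
model1 <- lm(obese_pct ~ food_env + median_income, data = sim)

coef(model1)
```

The formula is the part that carries over; with the real data, `data = sim` would become `data = oh22`.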

Provide R code which specifies the estimated coefficient of food_env and a 90% confidence interval around that estimate. Then write a concise but sufficient explanation of the meaning of these results in context using complete English sentences.
Evaluate the quality of the model you fit in terms of adherence to regression modeling assumptions, through the specification and written evaluation of the four basic regression residual plots. Then reflect on your findings in a few complete sentences: what might be done to improve the fit of the model you’ve developed? Be sure to identify by name any outlying counties and explain why they are flagged as outliers.
Use the glance function in the broom package to help you create an attractive table which compares model1 to a simple linear model (called model2) for the same outcome (obese_pct) which uses only the food_env variable as a predictor. Your comparisons should include assessments of raw and adjusted R-squared, AIC, BIC and residual standard error within the complete sample of all 88 Ohio counties. Then reflect on your findings in a few complete sentences: based on these metrics, which model looks like it fits the Ohio 2022 data more effectively, and why?
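
The three tasks above each rely on a small amount of standard tooling, sketched here on simulated data. The data frame `sim` is made up, so none of the numbers this code produces say anything about Ohio. This is one reasonable approach under those assumptions, not the answer sketch: tidy() from broom for the coefficient and its 90% interval, the default plot() method for lm objects for the four residual plots, and glance() for the model summaries.

```r
library(broom)
library(dplyr)

# Simulated stand-in for oh22 (made-up values, NOT the real data).
set.seed(432)
sim <- data.frame(
  food_env = runif(88, 5, 9),
  median_income = runif(88, 40000, 110000)
)
sim$obese_pct <- 45 - 0.8 * sim$food_env - 0.00005 * sim$median_income +
  rnorm(88, sd = 2)

model1 <- lm(obese_pct ~ food_env + median_income, data = sim)
model2 <- lm(obese_pct ~ food_env, data = sim)

# Point estimate and 90% confidence interval for the food_env slope.
food_ci <- tidy(model1, conf.int = TRUE, conf.level = 0.90) |>
  filter(term == "food_env") |>
  select(term, estimate, conf.low, conf.high)

# The four default residual plots for an lm fit, in a 2 x 2 grid.
op <- par(mfrow = c(2, 2))
plot(model1)
par(op)

# One row of fit summaries per model; sigma is the residual standard error.
comparison <- bind_rows(
  glance(model1) |> mutate(model = "model1"),
  glance(model2) |> mutate(model = "model2")
) |>
  select(model, r.squared, adj.r.squared, sigma, AIC, BIC)

food_ci
comparison
```

In a Quarto document, passing `comparison` to a table function such as `knitr::kable(comparison, digits = 3)` is one way to get a cleaner display than the raw tibble printout.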

Use of AI

If you decide to use some sort of AI to help you with this Lab, we ask that you place a note to that effect, describing what you used and how you used it, as a separate section called “Use of AI”, after your answers to our questions, and just before your presentation of the Session Information. Thank you.

Be sure to include Session Information

Please display your session information at the end of your submission, as shown below.

xfun::session_info()

R version 4.3.3 (2024-02-29 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  askpass_1.2.0       backports_1.4.1     base64enc_0.1.3    
  bit_4.0.5           bit64_4.0.5         blob_1.2.4         
  broom_1.0.5         bslib_0.7.0         cachem_1.0.8       
  callr_3.7.6         cellranger_1.1.0    cli_3.6.2          
  clipr_0.8.0         colorspace_2.1-0    compiler_4.3.3     
  conflicted_1.2.0    cpp11_0.4.7         crayon_1.5.2       
  curl_5.2.1          data.table_1.15.4   DBI_1.2.2          
  dbplyr_2.5.0        digest_0.6.35       dplyr_1.1.4        
  dtplyr_1.3.1        ellipsis_0.3.2      evaluate_0.23      
  fansi_1.0.6         farver_2.1.1        fastmap_1.1.1      
  fontawesome_0.5.2   forcats_1.0.0       fs_1.6.3           
  gargle_1.5.2        generics_0.1.3      ggplot2_3.5.0      
  glue_1.7.0          googledrive_2.1.1   googlesheets4_1.1.1
  graphics_4.3.3      grDevices_4.3.3     grid_4.3.3         
  gtable_0.3.4        haven_2.5.4         highr_0.10         
  hms_1.1.3           htmltools_0.5.8.1   htmlwidgets_1.6.4  
  httr_1.4.7          ids_1.0.1           isoband_0.2.7      
  janitor_2.2.0       jquerylib_0.1.4     jsonlite_1.8.8     
  knitr_1.46          labeling_0.4.3      lattice_0.22.6     
  lifecycle_1.0.4     lubridate_1.9.3     magrittr_2.0.3     
  MASS_7.3.60.0.1     Matrix_1.6.5        memoise_2.0.1      
  methods_4.3.3       mgcv_1.9.1          mime_0.12          
  modelr_0.1.11       munsell_0.5.1       nlme_3.1.164       
  openssl_2.1.1       parallel_4.3.3      pillar_1.9.0       
  pkgconfig_2.0.3     prettyunits_1.2.0   processx_3.8.4     
  progress_1.2.3      ps_1.7.6            purrr_1.0.2        
  R6_2.5.1            ragg_1.3.0          rappdirs_0.3.3     
  RColorBrewer_1.1.3  readr_2.1.5         readxl_1.4.3       
  rematch_2.0.0       rematch2_2.1.2      reprex_2.1.0       
  rlang_1.1.3         rmarkdown_2.26      rstudioapi_0.16.0  
  rvest_1.0.4         sass_0.4.9          scales_1.3.0       
  selectr_0.4.2       snakecase_0.11.1    splines_4.3.3      
  stats_4.3.3         stringi_1.8.3       stringr_1.5.1      
  sys_3.4.2           systemfonts_1.0.6   textshaping_0.3.7  
  tibble_3.2.1        tidyr_1.3.1         tidyselect_1.2.1   
  tidyverse_2.0.0     timechange_0.3.0    tinytex_0.50       
  tools_4.3.3         tzdb_0.4.0          utf8_1.2.4         
  utils_4.3.3         uuid_1.2.0          vctrs_0.6.5        
  viridisLite_0.4.2   vroom_1.6.5         withr_3.0.0        
  xfun_0.43           xml2_1.3.6          yaml_2.3.8

After the Lab

We will post an answer sketch 24 hours after the Lab is due.

We will post grades to our Grading Roster on our Shared Google Drive one week after the Lab is due.

See the Lab Appeal Policy in our Syllabus if you are interested in having your Lab grade reviewed, and use the Lab Regrade Request form specified there to complete the task. Thank you.