# Hint for Question 1: split smoking into four quantile-based groups
ohio24 <- ohio24 |>
  mutate(smoke_group = categorize(smoking, split = "quantile",
                                  n_groups = 4, labels = "range"))
Lab 1
General Instructions
- Submit your work via Canvas.
- The deadline for this Lab is specified on the Course Calendar.
- We charge a 5-point penalty for a lab that is 1-48 hours late.
- We do not grade work that is more than 48 hours late.
- Your response should include a Quarto file (.qmd) and an HTML document that is the result of applying your Quarto file to the data we’ve provided.
You can skip exactly one of Labs 1-5 without penalty, but all students must complete both Lab 6 and Lab 7. If you decide to skip a lab, please submit a note to Canvas by the deadline saying that you are skipping the lab.
Template
There is a Lab 1 Quarto template available on our 432-data page. Please use the template to prepare your response to Lab 1, as it will make things easier for you and for the people grading your work.
- In the Lab 1 template, we use the flatly theme. If you’d like to use a different theme, the available list is here.
Our Best Advice
Review your HTML output file carefully before submission for copy-editing issues (spelling, grammar and syntax). Even with spell-check in RStudio (just hit F7), these errors are hard to spot in your Quarto file as long as it renders without complaint. You really need to look closely at the resulting HTML output.
The Data
The chr_2024.csv data set we have provided describes a series of 30 variables, pulled from the data for 3054 counties in the County Health Rankings report for 2024.
- The chr_2024.csv file is available for download on the 432 data page.
- A detailed codebook for all of the data in the chr_2024 file is available here.
Question 1 (20 points)
Load the data from chr_2024.csv into R appropriately. You should see 30 variables and 3054 counties (rows) in your data. Now, filter your data to create a new data set called ohio24 which should include only the 88 counties located in the state of Ohio.
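The load-and-filter step described above can be sketched on a tiny stand-in file. This is only an illustration in base R: the `state` and `county` column names and all of the values are assumptions made up for the example, so check the codebook for the real names in chr_2024.csv.

```r
# Toy stand-in for chr_2024.csv (the real file has 30 variables and 3054 rows).
# The `state` and `county` column names and the values are assumptions.
toy <- data.frame(
  state = c("OH", "OH", "PA", "MI"),
  county = c("Cuyahoga", "Franklin", "Allegheny", "Wayne"),
  smoking = c(19.2, 17.5, 16.8, 20.1)
)
toy_path <- tempfile(fileext = ".csv")
write.csv(toy, toy_path, row.names = FALSE)

# Read the file, then confirm its dimensions before filtering.
chr_toy <- read.csv(toy_path)
dim(chr_toy)

# Keep only the Ohio rows and confirm how many remain.
ohio_toy <- subset(chr_toy, state == "OH")
nrow(ohio_toy)
```

With the real data, the same two checks should report 30 columns before filtering and 88 rows after it.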
- {5 points} Use R code to demonstrate concisely that there are no missing values in the 88 Ohio counties for any of the 30 variables in the ohio24 data.
- {15} Then create a visualization (using R and Quarto) based on your ohio24 data to help describe the 88 Ohio counties and share it (the visualization and all of the R code you used to build it) with us.
The visualization should:
- be of a professional quality,
- describe information from three different variables from this list of 15 below:
  prem_death, pf_health, poor_phys, poor_ment, low_bwt,
  smoking, drinking, sti_rate, unins, pcp_rate,
  unemp, hs_grad, sev_hous, commute, non_eng
- you are welcome to transform or re-express the variables if that is of interest to you
- the main option we have in mind is an attractive faceted scatterplot showing the association of two of the variables divided into categories by a third variable
- include proper labels and a meaningful title,
- include a caption of no more than 75 words that highlights the key result. Your caption can be placed within the visualization, or in a note below.
- In developing your caption, I find it helpful to think about what question this visualization is meant to answer, and then provide a caption which makes it clear what the question (and answer) is.
You are welcome to find useful tools for visualizing data in R that we have seen in either 431 or 432 or elsewhere.
Although you may fit a model to help show patterns if you like, your primary task is to show the data in a meaningful way, rather than to simply highlight the results of a model.
We will evaluate Question 1 based on the quality of the visualization, its title and caption, in terms of being attractive, well-labeled and useful for representing the data reported in County Health Rankings 2024 for Ohio, and how well it adheres to general principles for good visualizations we’ve seen in 431 and 432.
The categorize() function from the datawizard package in the easystats ecosystem could be used to create four groups of 22 counties from the smoking data, using split = "quantile" and n_groups = 4.
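Two of the steps in Question 1, the missing-value check and the split into four equal-sized groups, can be sketched on simulated data. This is a base R analogue for illustration, not the categorize() call itself, and the `ohio_toy` data frame with its values is made up.

```r
# Made-up stand-in for ohio24: 88 rows with a simulated smoking column.
set.seed(432)
ohio_toy <- data.frame(
  county = paste("County", 1:88),
  smoking = runif(88, min = 12, max = 30)
)

# Count missing values per column; all zeros means the data are complete.
colSums(is.na(ohio_toy))

# Cut smoking at its quartiles to form four groups.
# This is a base R analogue, not the datawizard::categorize() call itself.
quartiles <- quantile(ohio_toy$smoking, probs = seq(0, 1, by = 0.25))
ohio_toy$smoke_group <- cut(ohio_toy$smoking, breaks = quartiles,
                            include.lowest = TRUE)

# With 88 distinct values, each quartile group should hold 22 counties.
table(ohio_toy$smoke_group)
```

A grouping variable like `smoke_group` is the kind of third variable that could define the facets in a faceted scatterplot. Tied values at a quartile boundary can make the group sizes uneven, so it is worth checking the table.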
Question 2 (30 points)
Create a linear regression model using the data for the state of Ohio that you developed in Question 1 to predict obesity as a function of food_env adjusting for unemployment, and treating all three variables as quantitative. Please build your model using main effects only, entered as linear predictors without transformation, and call this model model1.
- {10 points} Provide R code which specifies the estimated coefficient of food_env and a 90% confidence interval around that estimate. Then write a concise but sufficient explanation of the meaning of these results in context using complete English sentences.
- {10} Evaluate the quality of the model you fit in part a, in terms of adherence to regression modeling assumptions, through a set of regression residual plots using check_model(). Describe any problems you see with the residual plots in complete sentences.
In question 2b, I suggest you use check_model(model1, detrend = FALSE) so that the Normal Q-Q plot looks like it usually does in our other work.
- {10} Create an attractive table which compares model1 to a simple linear model (called model2) for the same outcome (obesity) which uses only the food_env variable as a predictor. Your comparisons should include assessments of raw and adjusted R-squared, AIC, BIC and the residual standard error in your sample of all 88 Ohio counties. Then reflect on your findings in at least two complete sentences: based on these metrics, which model looks like it fits the Ohio obesity data for 2021 (that were reported in CHR 2024) more effectively, and why?
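The mechanics of Question 2 can be sketched with base R on simulated data. The column names follow the lab text, but every value below is invented, the helper `fit_summary()` is hypothetical, and the resulting numbers say nothing about the real Ohio counties.

```r
# Made-up stand-in for ohio24; the three column names follow the lab text,
# but every value here is simulated and carries no real meaning.
set.seed(432)
n <- 88
ohio_toy <- data.frame(
  food_env = runif(n, min = 5, max = 10),
  unemployment = runif(n, min = 2, max = 8)
)
ohio_toy$obesity <- 45 - 1.2 * ohio_toy$food_env +
  0.5 * ohio_toy$unemployment + rnorm(n, sd = 2)

# Main effects only, entered as linear predictors without transformation.
model1 <- lm(obesity ~ food_env + unemployment, data = ohio_toy)
model2 <- lm(obesity ~ food_env, data = ohio_toy)

# Point estimate and 90% confidence interval for the food_env slope.
coef(model1)["food_env"]
confint(model1, parm = "food_env", level = 0.90)

# Hypothetical helper: gather the requested fit summaries for one model.
fit_summary <- function(model) {
  s <- summary(model)
  data.frame(r2 = s$r.squared, adj_r2 = s$adj.r.squared,
             AIC = AIC(model), BIC = BIC(model), sigma = sigma(model))
}
comparison <- rbind(fit_summary(model1), fit_summary(model2))
rownames(comparison) <- c("model1", "model2")
round(comparison, 3)
```

The `comparison` data frame is only raw material. Turning it into an attractive table, and running check_model(model1, detrend = FALSE) for the residual plots, is left to whichever table and diagnostic tools you prefer.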
Use of AI
If you decide to use some sort of AI to help you with this Lab, we ask that you place a note to that effect, describing what you used and how you used it, as a separate section called “Use of AI”, after your answers to our questions, and just before your presentation of the Session Information. Thank you.
Be sure to include Session Information
Please display your session information at the end of your submission, as shown below.
xfun::session_info()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)
Locale:
LC_COLLATE=English_United States.utf8
LC_CTYPE=English_United States.utf8
LC_MONETARY=English_United States.utf8
LC_NUMERIC=C
LC_TIME=English_United States.utf8
Package version:
base64enc_0.1.3 bslib_0.8.0 cachem_1.1.0 cli_3.6.3
compiler_4.4.2 digest_0.6.37 evaluate_1.0.3 fastmap_1.2.0
fontawesome_0.5.3 fs_1.6.5 glue_1.8.0 graphics_4.4.2
grDevices_4.4.2 highr_0.11 htmltools_0.5.8.1 htmlwidgets_1.6.4
jquerylib_0.1.4 jsonlite_1.8.9 knitr_1.49 lifecycle_1.0.4
memoise_2.0.1 methods_4.4.2 mime_0.12 R6_2.5.1
rappdirs_0.3.3 rlang_1.1.4 rmarkdown_2.29 rstudioapi_0.17.1
sass_0.4.9 stats_4.4.2 tinytex_0.54 tools_4.4.2
utils_4.4.2 xfun_0.50 yaml_2.3.10
After the Lab
- We will post an answer sketch to our Shared Google Drive 48 hours after the Lab is due.
- We will post grades to our Grading Roster on our Shared Google Drive one week after the Lab is due.
- See the Lab Appeal Policy in our Syllabus if you are interested in having your Lab grade reviewed, and use the Lab Regrade Request form specified there to complete the task. Thank you.