data_url <- "https://www.countyhealthrankings.org/sites/default/files/media/document/analytic_data2022.csv"
chr_2022_raw <- read_csv(data_url, skip = 1, guess_max = 4000,
                         show_col_types = FALSE)
Project A Data
Most Recent Update: 2022-10-18 07:39:43.
What Data Will I Use?
Your Project A will use the 2022 version of the analytic data from County Health Rankings (CHR). You have already seen some versions of these data in Labs 02 and 03, but you’ll create and clean a new data set of your own for Project A.
The data are gathered at the County Health Rankings Data & Documentation site.
- The key elements we’ll use are in the Rankings Data & Documentation section, specifically the National Data & Documentation section site for the 2022 County Health Rankings.
- Do not use data from previous editions of the CHR, and do not use the trends data available on their website for this project.
Specifically, you’ll need three files:
- the 2022 CHR CSV Analytic Data (a .csv file)
- the 2022 CHR Analytic Data Documentation file (PDF), and
- the 2022 Data Dictionary (PDF)
These files are also available in the data folder for Project A on Github.
You’ll wind up selecting five measures, which you’ll also want to look up at this CHR 2022 Measures link for further details, including who created the measure, how it is measured, and when it was measured.
Developing the Data: Overview
Obtaining and cleaning your data takes a little while, but you can start right now. There are seven main tasks to complete. You will create a clean tibble containing 200-400 rows (counties) and eleven variables, and then present the data, a codebook and some brief summaries, by following the steps below. Detailed instructions follow this overview.
- Ingest the data into a raw tibble in R called chr_2022_raw, carefully.
- Select the states (3-5 plus Ohio) that you want to study, creating a new chr_2022 tibble that contains only the counties from those states, carefully.
- Select nine variables (4 we’ll specify and 5 more you’ll select), rename them in useful ways, clean up problems, and then save this smaller tibble as chr_2022.
- Create new factors (categorical variables) from two of your 5 selected variables from the previous task, and add them to chr_2022 so you now have 11 variables.
- Save the resulting chr_2022 tibble, and share it with us in your Proposal.
- Create a codebook for your chr_2022 tibble which briefly but sufficiently describes your selected variables and subjects (counties).
- Print your chr_2022 tibble (to prove it is a tibble) and then provide a modest set of appropriate numerical summaries of each variable.
Here are some key additional details for each step of this process.
- Please refer to Dr. Love’s materials in the rest of these Project A Instructions, particularly in the Examples & Tips section, for more help.
Task 1. Ingest the raw data into chr_2022_raw
Begin an R Project just for Project A, and create an R Markdown file within that project where you will do your data development work. Working from a template, or from your own best understanding of what works well for you, start by loading the packages you’ll need, including the tidyverse and any other packages you plan to use.
You’ll use read_csv() to ingest your raw .csv file into R and call the resulting tibble chr_2022_raw. But as you do this, you’ll need to remove the top row from the .csv within your R code. You should probably look at the raw .csv in Excel or another spreadsheet system so that you know why we need to do this.
To accomplish this, use the skip = 1 argument within your read_csv() call. Sample code follows:
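To see what skip = 1 does without downloading anything, here is a minimal sketch on a tiny inline CSV. The toy_csv contents and the toy_raw name are made up for illustration, and the layout (a row of descriptive labels above the row of variable names) is an assumption based on the description on this page.

```r
library(readr)

# Toy stand-in for the raw file (made-up values): row 1 holds descriptive
# labels, row 2 holds the variable names, and the data start in row 3.
toy_csv <- I(paste(
  "State Abbreviation,Name,County Ranked",
  "state,county,county_ranked",
  "OH,Adams County,1",
  "OH,Allen County,1",
  sep = "\n"
))

# skip = 1 drops the label row, so the second row supplies the column names
toy_raw <- read_csv(toy_csv, skip = 1, show_col_types = FALSE)
names(toy_raw)
```

Without skip = 1, the descriptive labels would become the column names and the variable names would be read as a row of data.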
You should at this point have a chr_2022_raw tibble with 3194 rows and 725 columns.
Finally, to complete Task 1, you should now write code to restrict the chr_2022_raw data to only include the 3082 rows which have “county_ranked” values of 1, since the other rows will not be used by us in this project.
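One way to apply this restriction is with filter(). The sketch below uses a made-up toy_raw tibble in place of chr_2022_raw, so the names and values are illustrative only.

```r
library(dplyr)
library(tibble)

# Made-up rows standing in for chr_2022_raw
toy_raw <- tibble(
  fipscode = c("00001", "00002", "00003"),
  county = c("Row A", "Row B", "Row C"),
  county_ranked = c(NA, 1, 0)
)

# Keep only ranked counties; rows where county_ranked is 0 or missing drop out
toy_ranked <- toy_raw |>
  filter(county_ranked == 1)

nrow(toy_ranked)
```

Note that filter() drops rows where the condition evaluates to NA, so rows with a missing county_ranked value are removed as well.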
Task 2. Filter your data to the counties in the states you’ve chosen
Now we’ll filter the chr_2022_raw data containing 3082 rows and 725 columns to a new tibble called chr_2022 which contains the 200-400 rows (counties) you will actually study in your project. Each row you’ll keep should have a county_ranked value of 1, as we’ve ensured at the end of Task 1. To go from 3082 counties to our final sample, you will need to select 4-6 states, including Ohio.
Your selection must include:
- the 88 counties of Ohio, and
- all of the counties in a subset of 3-5 additional states in the US
The number of counties (which have county_ranked values of 1) associated with each state (specified using its two-letter postal abbreviation code) is listed below, for your convenience.
- Do not include any of the six “states” (District of Columbia, Connecticut, Delaware, Hawaii, New Hampshire and Rhode Island) with fewer than 12 counties.
States You Are Permitted to Select
chr_2022_raw |> count(statecode, state) |> filter(n > 12) |> print(n = 50)
# A tibble: 45 × 3
statecode state n
<chr> <chr> <int>
1 01 AL 67
2 02 AK 24
3 04 AZ 15
4 05 AR 75
5 06 CA 58
6 08 CO 59
7 12 FL 67
8 13 GA 159
9 16 ID 43
10 17 IL 102
11 18 IN 92
12 19 IA 99
13 20 KS 104
14 21 KY 120
15 22 LA 64
16 23 ME 16
17 24 MD 24
18 25 MA 14
19 26 MI 83
20 27 MN 87
21 28 MS 82
22 29 MO 115
23 30 MT 47
24 31 NE 79
25 32 NV 16
26 34 NJ 21
27 35 NM 32
28 36 NY 62
29 37 NC 100
30 38 ND 48
31 39 OH 88 ## note: must be selected
32 40 OK 77
33 41 OR 35
34 42 PA 67
35 45 SC 46
36 46 SD 61
37 47 TN 95
38 48 TX 244
39 49 UT 28
40 50 VT 14
41 51 VA 133
42 53 WA 39
43 54 WV 55
44 55 WI 72
45 56 WY 23
So, including OH, you will need a total of 200-400 counties, from 4-6 states. For example,
- one possible combination would be the states of TX (244 counties), AZ (15 counties) and NM (32 counties) in addition to OH (88 counties), which would yield 379 counties in 4 states
- another combination would be the states of WA (39 counties), WI (72 counties), WV (55 counties) and WY (23 counties) to join OH (88 counties) yielding 277 counties in 5 states
- and yet another would be PA (67), NY (62), NJ (21), MD (24) and VA (133) in addition to OH (88) yielding 395 counties in 6 states
After making your selection, and applying it, you should have a tibble called chr_2022 which contains 200-400 rows, and 725 columns. In Task 3, we’ll cut down those columns.
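The state selection can be sketched with filter() and %in%. The toy_raw tibble and the my_states vector below are hypothetical stand-ins; the four states shown are just the first example combination listed above.

```r
library(dplyr)
library(tibble)

# Made-up rows standing in for the ranked counties in chr_2022_raw
toy_raw <- tibble(
  state = c("OH", "OH", "TX", "AZ", "NM", "CA"),
  county = c("Adams", "Allen", "Anderson", "Apache", "Bernalillo", "Alameda"),
  county_ranked = 1
)

# Hypothetical selection: Ohio plus three additional states
my_states <- c("OH", "TX", "AZ", "NM")

toy_chr <- toy_raw |>
  filter(state %in% my_states)

# Count rows per state to confirm the selection
toy_chr |> count(state)
```

On the real data, the per-state counts from this last step should match the table of permitted states shown earlier.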
Additional Task 2 Notes
- Choose your subset of states with the knowledge in mind that some variables in the CHR data are not available for some counties, and that each state you select must have more than 12 counties.
- You should have some reason for selecting the states that you do, and you will need to describe that reason in a complete sentence or two in your proposal and your final report.
- Since we’ve studied five Midwest states (and 437 counties) in Lab 2, we want you to look at a minimum of one state not included in that list, so your final selection of states (besides Ohio) must include at least one state other than Indiana, Illinois, Michigan and Wisconsin.
- Don’t forget to filter the rows at the end of Task 1 so that only those rows with county_ranked values of 1 are included. Otherwise, your counts won’t match those shown above, and you’ll have lots of problems.
Task 3. Select five variables, add four required ones, and clean up chr_2022
Next, you will cut down your existing chr_2022 tibble, which describes your selected sample of counties, to include only nine variables (columns) instead of the 725 you should have at the start of Task 3. These nine variables must include:
- the five-digit fips code for the county, which will be a convenient ID variable, and which is called fipscode in the raw data.
- the name of the county, which will be useful for labeling and identifying the counties, and which is called county in the raw data.
- the state, which will be a multi-categorical (with 4-6 categories) variable
- the county_ranked variable, which tells us whether the row should be included in our data (all of the rows you include in your data should have county_ranked == 1)
- followed by five variables you select from the 84 variables in the raw CHR data set that we have listed below.
  - Each of these must be of the form vXXX_rawvalue (note: to select the entire group of 86 variables, you might try select(ends_with("rawvalue")) as part of a pipe.)
  - Note that we have not included two of these 86 in our list below of the 84 variables you can choose from.
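A minimal sketch of the select-and-rename step follows. The toy_chr tibble holds one made-up row, and the five measures and new names are borrowed from the demo codebook later on this page purely for illustration; your own choices will differ.

```r
library(dplyr)
library(tibble)

# One made-up row with a few extra columns, standing in for the 725-column chr_2022
toy_chr <- tibble(
  statecode = "39", fipscode = "39001", state = "OH", county = "Adams County",
  county_ranked = 1,
  v001_rawvalue = 9000, v002_rawvalue = 0.20, v155_rawvalue = 0.40,
  v129_rawvalue = 7, v149_rawvalue = 0.10, v015_rawvalue = 5
)

# select() can rename while selecting: new_name = old_name
toy_small <- toy_chr |>
  select(fipscode, county, state, county_ranked,
         poorfair = v002_rawvalue,
         fluvax = v155_rawvalue,
         infmort = v129_rawvalue,
         discyouth_raw = v149_rawvalue,
         homicide_raw = v015_rawvalue)

names(toy_small)
```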
The 84 variables you must select five from
You must select your five variables from the 84 variables in the raw (.csv) data file listed below. We show them here in the order in which they appear in the raw file. The listing “v001” in this table actually refers to the variable named “v001_rawvalue”.
[1] "v001" "v002" "v036" "v042" "v037" "v009" "v011" "v133" "v070" "v132"
[11] "v049" "v134" "v045" "v014" "v085" "v004" "v088" "v062" "v005" "v050"
[21] "v155" "v168" "v069" "v023" "v024" "v044" "v082" "v140" "v043" "v135"
[31] "v125" "v124" "v136" "v067" "v137" "v173" "v147" "v127" "v128" "v129"
[41] "v144" "v145" "v060" "v061" "v139" "v083" "v138" "v039" "v143" "v003"
[51] "v122" "v021" "v149" "v159" "v160" "v167" "v169" "v151" "v063" "v065"
[61] "v141" "v142" "v171" "v172" "v015" "v161" "v148" "v158" "v156" "v153"
[71] "v154" "v166" "v051" "v052" "v053" "v054" "v055" "v081" "v080" "v056"
[81] "v126" "v059" "v057" "v058"
For example, v001_rawvalue shows the raw values for the premature death measure. If you select this variable, it is up to you to use the documentation in the two PDF files I have linked to, as well as the information on the County Health Rankings website, to get a reasonable understanding of what the variable measures, and how it was collected.
The 2022 CHR Analytic Data Documentation file (PDF), and the 2022 Data Dictionary file (also PDF) are crucial here, as those are the ones that explain what the available variables mean, and how they should be labeled.
As noted below, you will also need to look up each measure you wind up selecting to learn more about it at this CHR 2022 Measures link.
Some important notes for Task 3
Be sure to read this entire Data page (especially the material on How to Clean Your Variables) before selecting your variables, so you are aware of some features of the data we disclose below.
The five variables you select must be five different variables selected from the 84 variables listed above. Some variables in that set of 84 are better choices than others.
Each variable you select should be of some interest to you on its own, in terms of either providing a health outcome of interest, or potentially providing useful information about a feature of the county that might relate to that health outcome. Your five selected quantitative variables, selected by you from the 84 available “raw value” CHR variables, will need to be treated as follows:
- variable 1 will be treated as quantitative, and as an outcome of interest
- two others (variables 2 and 3) will be treated as quantitative predictors of interest for variable 1
  - Each of your quantitative variables (1, 2 and 3) must have at least 15 distinct values within your final tibble.
- variable 4 will be categorized into 2 mutually exclusive and collectively exhaustive levels to create a binary categorical variable of interest in predicting variable 1 (this is a terrible idea in practical work.)
  - Exactly one of the variables in the data (v124 about drinking water violations) is already a binary (1 = Yes, 0 = No) variable (all other variables are quantitative.) You are permitted to use v124 as your variable 4, if you like. v124 may not be used as anything other than variable 4 in your work.
- variable 5 will be categorized into 3-5 mutually exclusive and collectively exhaustive levels to create a multi-categorical variable of interest in predicting variable 1 (this is almost as bad as what we’ll do to variable 4 in practical work)
- the state will serve as another multi-categorical (with 4-6 categories) predictor of variable 1, so this will also be part of your tibble
- Each of the five variables you select must have data for at least 75% of the counties in each state you plan to study. This is something you will have to check on, in R, and you’ll have to present the code, and demonstrate with complete English sentences that you’ve verified this. The use of a tool (or more than one) from the naniar package to do this checking is encouraged.
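The 75% completeness requirement can be checked state by state. The sketch below uses plain dplyr on a made-up toy_chr tibble (the variable names and values are illustrative); naniar tools applied to grouped data are an alternative route to the same information.

```r
library(dplyr)
library(tibble)

# Made-up data: four counties in each of two states
toy_chr <- tibble(
  state = rep(c("OH", "TX"), each = 4),
  fluvax = c(40, 45, NA, 50, 38, 41, 44, 47),
  infmort = c(7, NA, NA, 6, 5, 6, 7, 8)
)

# Percent of non-missing values for each variable within each state
completeness <- toy_chr |>
  group_by(state) |>
  summarise(across(c(fluvax, infmort), ~ 100 * mean(!is.na(.x))))

completeness

# States where at least one variable falls below the 75% threshold
problem_states <- completeness |>
  filter(fluvax < 75 | infmort < 75)
```

In this toy example, infmort is only 50% complete in OH, so that combination of variable and state would not meet the requirement.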
Caution: Some Variables have Lots of Missing Data
The variables listed below have more than 10% missing values across all 3082 ranked counties in the US. While you are welcome to use any of these variables, you may want to look elsewhere to avoid problems with the “minimum 75% complete data” requirement. (Note that variable v170_rawvalue is missing for all counties in the data so we don’t allow its use.)
Variable | Description | % missing |
---|---|---|
v129_rawvalue | infant mortality | 60.5 |
v149_rawvalue | disconnected youth | 60.4 |
v015_rawvalue | homicides | 57.0 |
v138_rawvalue | drug overdose deaths | 41.7 |
v128_rawvalue | child mortality | 38.9 |
v158_rawvalue | juvenile arrests | 37.3 |
v141_rawvalue | residential segregation | 32.6 |
v148_rawvalue | firearm fatalities | 26.3 |
v161_rawvalue | suicides | 21.1 |
v021_rawvalue | High School graduation | 19.5 |
v173_rawvalue | COVID-19 age-adjusted mortality | 16.8 |
v061_rawvalue | HIV prevalence | 14.2 |
v160_rawvalue | Math scores | 12.8 |
v039_rawvalue | Motor vehicle crash deaths | 12.7 |
The descriptions provided in these instructions associated with each variable are available as the first (deleted in R) row in the .csv file, and are also specified in the 2022 CHR Analytic Data Documentation PDF file.
Across your complete set of 4-6 selected states, the raw versions of each of your five selected variables must have at least ten distinct non-missing values. You’ll need to show R code to do this checking (the best choice is often Hmisc::describe since you’ll run that anyway), and you’ll need to demonstrate with a complete English sentence or two that you have checked this to be true.
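If you want a quick count alongside the Hmisc::describe output, a sketch with dplyr is shown here. It is not a replacement for the required summary, and the toy_chr values are made up.

```r
library(dplyr)
library(tibble)

# Made-up values for two selected variables
toy_chr <- tibble(
  fluvax = c(40, 45, NA, 45, 38),
  infmort = c(7, NA, NA, 6, 5)
)

# Number of distinct non-missing values in each variable
distinct_counts <- toy_chr |>
  summarise(across(everything(), ~ n_distinct(.x, na.rm = TRUE)))

distinct_counts
```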
How To Clean Your Selected Variables
Find each of your five selected variables on one of the lists below.
- If you plan to use the variable as quantitative, do what is suggested below as part of your data development work, rename the variable appropriately, and use that version in your codebook.
- If you plan to use the variable as a categorical predictor, you should still make the appropriate change to the original quantitative version as indicated below.
The Binary Category
v124, about Drinking Water Violations, is the only binary (1 = Yes, 0 = No) variable in the data (all other variables are quantitative). You are permitted to use v124 as your variable 4, if you like. v124 may not be used as anything other than variable 4 in your work.
Ratios That Need Converting
These variables are specified as providers per population in the raw data file. You will want to take the reciprocal (1/raw value) to rescale the results in terms of population per provider, which should be much more interpretable.
- You’ll note that this rescaled ratio is also provided in the raw file if you want it. For example, for primary care physicians, v004_rawvalue is providers per population, but v004_other_data_1 is the ratio of population (residents) per provider.
- Note that after rescaling by taking the reciprocal, you may see some counties with infinite ratios, which should then be changed to missing values.
Code | Variable Description |
---|---|
v004 | Primary care physicians |
v062 | Mental health providers |
v088 | Dentists |
- Note that we don’t allow the use of variable v131 in Project A because of the large number of counties with very small numbers of these providers listed.
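The reciprocal conversion, including the change from infinite ratios to missing values, might look like the sketch below. The pcp_ratio name and the toy values are hypothetical.

```r
library(dplyr)
library(tibble)

# Made-up providers-per-population values, including a zero and a missing value
toy_chr <- tibble(
  county = c("A", "B", "C"),
  v004_rawvalue = c(0.001, 0, NA)
)

toy_chr <- toy_chr |>
  mutate(
    # population per provider
    pcp_ratio = 1 / v004_rawvalue,
    # 1 / 0 is Inf in R; treat those counties as missing
    pcp_ratio = if_else(is.infinite(pcp_ratio), NA_real_, pcp_ratio)
  )

toy_chr
```

County B has zero providers, so its ratio becomes missing rather than infinite. County C was missing to begin with and stays missing.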
Variables that should be rescaled
Code | Variable Description | What to do? |
---|---|---|
v001 | Premature death | Divide by 100 to represent losses per 1000 population |
v051 | Population | Either use log10(population), or divide population by 1000 to represent population in thousands. |
v061 | HIV prevalence | Divide by 100 to represent cases per 1000 population (caution: substantial missing data) |
v063 | Median household income | Divide by 1000 to represent in thousands of dollars |
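The rescaling suggestions in this table can be applied with mutate(). The new variable names (ypll_per1000, pop_log10, income_k) and the toy values below are made up for illustration.

```r
library(dplyr)
library(tibble)

# Made-up raw values for two counties
toy_chr <- tibble(
  v001_rawvalue = c(8000, 10500),
  v051_rawvalue = c(27000, 1200000),
  v063_rawvalue = c(45000, 62000)
)

toy_chr <- toy_chr |>
  mutate(
    ypll_per1000 = v001_rawvalue / 100,   # premature death, per 1000 population
    pop_log10 = log10(v051_rawvalue),     # population on the log10 scale
    income_k = v063_rawvalue / 1000       # median household income, in $1000s
  )

toy_chr
```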
Variables that are Proportions should be converted to Percentages
Each of the variables listed below is a proportion (between 0 and 1). Before you use them in any analyses, please multiply them by 100 in your data development (using mutate) to turn their values into percentages (between 0 and 100). This will seriously ease the interpretation of slopes and transformations for these variables.
Code | Variable Description |
---|---|
v002 | Poor or fair health |
v003 | Uninsured adults (pick v003 or v085, but not both) |
v009 | Adult smoking |
v011 | Adult obesity |
v021 | High school graduation (pick v021 or v168, but not both, see note) |
v023 | Unemployment |
v024 | Children in poverty |
v037 | Low birthweight |
v049 | Excessive drinking |
v050 | Mammography screening |
v052 | Proportion below 18 years of age |
v053 | Proportion 65 and older |
v054 | Proportion Non-Hispanic Black (see note below) |
v055 | Proportion American Indian and Alaska Native (see note below) |
v056 | Proportion Hispanic (see note below) |
v057 | Proportion Females |
v058 | Proportion Rural |
v059 | Proportion not proficient in English |
v060 | Diabetes prevalence |
v065 | Children eligible for free or reduced price lunch |
v067 | Driving alone to work |
v069 | Some college |
v070 | Physical inactivity |
v080 | Proportion Native Hawaiian/Other Pacific Islander (see note below) |
v081 | Proportion Asian (see note below) |
v082 | Children in single-parent households |
v083 | Limited access to healthy foods |
v085 | Uninsured (pick v003 or v085, but not both) |
v122 | Uninsured children (pick v085 or v122, but not both) |
v126 | Proportion Non-Hispanic White (see note below) |
v132 | Access to exercise opportunities |
v134 | Proportion of driving deaths with alcohol involvement |
v136 | Severe housing problems |
v137 | Long commute - driving alone |
v139 | Food insecurity |
v143 | Insufficient sleep |
v144 | Frequent physical distress (pick v036 or v144, but not both) |
v145 | Frequent mental distress (pick v042 or v145, but not both) |
v149 | Disconnected youth (caution: lots of missing data) |
v153 | Homeownership |
v154 | Severe housing cost burden |
v155 | Flu vaccinations |
v166 | Broadband access |
v168 | High school completion (pick v021 or v168, but not both, see note) |
v171 | Childcare cost burden |
- If you are interested in studying race and ethnicity and their impact on a health outcome, these data aren’t particularly well suited to a detailed look. Instead we suggest using v126 to incorporate this dimension as a predictor (or its complement, 1 - v126), rather than including any of the other race/ethnicity variables, specifically v054, v055, v056, v080, or v081. This is because there’s more variation in the v126 data across the reported counties. Again, a serious look at the impact of race/ethnicity is beyond the scope of Project A.
- Note that variable v021 generally has more missing data than v168, should you be trying to choose between them.
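Converting the proportions in this section to percentages is a single mutate() step. In the sketch below, the smoking and obesity names and the toy values are hypothetical.

```r
library(dplyr)
library(tibble)

# Made-up proportions for two counties
toy_chr <- tibble(
  v009_rawvalue = c(0.25, 0.5),
  v011_rawvalue = c(0.125, 0.375)
)

# Multiply by 100 so values run from 0 to 100 rather than 0 to 1
toy_chr <- toy_chr |>
  mutate(
    smoking = 100 * v009_rawvalue,
    obesity = 100 * v011_rawvalue
  )

toy_chr
```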
Variables that should be OK as is
The variables listed below should be useful as they are. Most of them are ratios, although a few are averages or indexes. The main issue for these variables is correctly specifying the units of measurement (note that the indexes don’t have units.)
Code | Variable Description |
---|---|
v005 | Preventable hospital stays |
v014 | Teen births |
v015 | Homicide rate (caution: lots of missing data) |
v036 | Poor physical health days (pick v036 or v144, but not both) |
v039 | Motor vehicle crash deaths (caution: substantial missing data) |
v042 | Poor mental health days (pick v042 or v145, but not both) |
v043 | Violent crime |
v044 | Income inequality |
v045 | Sexually transmitted infections |
v125 | Air pollution - particulate matter |
v127 | Premature age-adjusted mortality |
v128 | Child mortality (caution: lots of missing data) |
v129 | Infant mortality (caution: lots of missing data) |
v133 | Food environment index |
v135 | Injury deaths |
v138 | Drug overdose deaths (caution: lots of missing data) |
v140 | Social associations |
v141 | Residential segregation - Black/White (pick v141 or v142, see note) |
v142 | Residential segregation - non-White/White (pick v141 or v142, see note) |
v147 | Life expectancy |
v148 | Firearm fatalities (caution: substantial missing data) |
v151 | Gender pay gap |
v156 | Traffic volume |
v158 | Juvenile arrests (caution: lots of missing data) |
v159 | Reading scores |
v160 | Math scores (caution: substantial missing data) |
v161 | Suicides (caution: substantial missing data) |
v167 | School segregation (index from 0-1 with no units of measurement) |
v169 | School funding adequacy (gap measured in dollars per pupil) |
v172 | Childcare centers per 1000 population under 5 years old |
v173 | COVID-19 age-adjusted mortality (caution: substantial missing data) |
- Note that variable v141 has more missing data than v142, should you be trying to choose between them.
- For any variable you select, be sure that it has at least 75% complete data across the counties in your selected states, and within each state you have selected.
Task 4. Create factors for two of the five variables you chose in Task 3
Create new categorical variables (factors) based on two of your previously selected variables, and add them to your chr_2022 tibble. You’ll also retain the original (quantitative) versions of those two variables, so your tibble will now have 11 variables.
As mentioned, you will need to provide code to categorize two of your variables. Specifically…
- variable 4 will be categorized into 2 mutually exclusive and collectively exhaustive levels to create a binary categorical variable of interest in predicting variable 1 (this is a terrible idea in practical work.)
- variable 5 will be categorized into 3-5 mutually exclusive and collectively exhaustive levels to create a multi-categorical variable of interest in predicting variable 1 (this is almost as bad as what we’ll do to variable 4 in practical work.)
Additional Task 4 Notes
- You will need to verify that all levels (categories) in each of your two categorical variables contain at least ten counties. Using tabyl is a good approach to check that this is true.
- Verify that your state variable and your two categorical variables are now recognized by R as factors with appropriate levels.
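One possible way to build and check the two factors is sketched below, using a median split for variable 4 and tertiles for variable 5, as in the demo codebook later on this page. The toy_chr tibble is made up and far smaller than a real selection, so its levels do not reach ten counties each. count() is used here for the check, with tabyl as an alternative.

```r
library(dplyr)
library(tibble)

# Made-up data: twelve counties in two states
toy_chr <- tibble(
  state = rep(c("OH", "TX"), each = 6),
  discyouth_raw = 4:15,
  homicide_raw = 1:12
)

# Cut points: the median for the binary split, tertiles for the 3-level split
cut_a <- median(toy_chr$discyouth_raw, na.rm = TRUE)
cuts_bc <- quantile(toy_chr$homicide_raw, probs = c(1/3, 2/3), na.rm = TRUE)

toy_chr <- toy_chr |>
  mutate(
    state = factor(state),
    discyouth_cat = factor(if_else(discyouth_raw < cut_a, "Low", "High"),
                           levels = c("Low", "High")),
    # cut() uses right-closed intervals by default, so a value equal to a
    # cut point falls in the lower category
    homicide_cat = cut(homicide_raw, breaks = c(-Inf, cuts_bc, Inf),
                       labels = c("Low", "Medium", "High"))
  )

# Counties per level, and confirmation that the new columns are factors
toy_chr |> count(discyouth_cat)
toy_chr |> count(homicide_cat)
is.factor(toy_chr$state)
```

Whatever cut points you choose, report the actual numbers in your codebook.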
Task 6. Provide a Codebook for Your Tibble
Next, you will provide a codebook as part of your R Markdown file (and HTML/PDF output) that specifies the name of each variable in your tibble and its definition. After you select your variables, use the County Health Rankings website’s 2022 Measures list, and in particular the linked information on that page for full descriptions, definitions and limitations of the variables you have selected.
Demo Codebook
The start of your Codebook might look something like this (note that I’ve used states you’re not allowed to use here.) Here, I’ve split the codebook into two parts - first describing the states, and then the variables.
My states are:
State Name | Abbreviation | # of counties |
---|---|---|
Ohio | OH | 88 |
Connecticut | CT | 8 |
Delaware | DE | 3 |
Hawaii | HI | 4 |
New Hampshire | NH | 10 |
Rhode Island | RI | 5 |
Total (6 states) | – | 118 |
Remember that you need to have 4-6 states, including Ohio, with 200-400 counties, so my choice of states wouldn’t be acceptable.
My variables are:
Variable | Description | Original Name | # Missing |
---|---|---|---|
fipscode | five-digit identifying code for state and county | fipscode | 0 |
county | name of the county | county | 0 |
state | postal abbreviation for the state | state | 0 |
poorfair | (variable 1: outcome) % of adults reporting fair or poor health (age-adjusted). | v002_rawvalue | 0 |
fluvax | (variable 2) % of fee-for-service (FFS) Medicare enrollees that had an annual flu vaccination. | v155_rawvalue | 0 |
infmort | (variable 3) Infant deaths (within 1 year) per 1,000 live births. | v129_rawvalue | 31 |
discyouth_raw | % of teens and young adults ages 16-19 who are neither working nor in school. | v149_rawvalue | 44 |
homicide_raw | Deaths due to homicide per 100,000 population. | v015_rawvalue | 48 |
discyouth_cat | (variable 4) Low (below A%) vs. High (A% or higher) based on discyouth_raw | v149_rawvalue | 44 |
homicide_cat | (variable 5) Low (below B), Medium (between B and C), High (above C) based on homicide_raw | v015_rawvalue | 48 |
- A% is the median value for discyouth_raw
- B and C represent the tertiles of homicide_raw.
Remember that you need to have at least 75% complete cases for each variable, so several of my variables here wouldn’t be acceptable given this selection of states.
Additional Task 6 Notes
- Be sure to include a description for each of the 11 variables (4 required including county_ranked + 5 you select + 2 categorical versions you create) in your codebook for your final tibble. The descriptions I’ve used here come from the County Health Rankings website’s 2022 Measures list site, which is what you should use, as well.
- Include a description of what the cut points are for the two categorical variables you create, and specify how those cut points were chosen, as I have done above. You should include the actual numbers, not just A, B and C.
- Tools from the naniar package, especially the miss_var_summary() tool, can be of help in preparing this codebook.
Task 7. Print and Summarize Your Tibble
- Print the Tibble: You will then prove that your tibble is in fact a tibble and not just a data frame by listing it, so that the first 10 rows are printed, and the columns are appropriately labeled. The command you want is just chr_2022.
- Numerical Summaries: You will then demonstrate main numerical summaries from the tibble by running some of the following summaries.
  - At a minimum in the proposal, you will need to show the results of Hmisc::describe run on the whole tibble. This can also be used to demonstrate that you have at least 15 distinct values for each quantitative variable (variables 1-3) in your data set.
  - You may need an additional summary to prove that you have at least 75% complete data within each state for each variable.
  - You may need an additional summary to demonstrate that there are at least ten observations within each level of each of your categorical variables.
  - Note that in the project report, there are two additional requirements, so you might consider them now.
    - mosaic::favstats on each of your quantitative variables (so variables 1, 2, and 3 and the original versions of variables 4 and 5)
    - janitor::tabyl on your categorical variables 4 and 5 as well as on state.
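As a small illustration of the printing step, the sketch below builds a made-up toy_chr tibble, confirms it is a tibble, and prints it by name. The required Hmisc::describe, mosaic::favstats and janitor::tabyl summaries are not shown here.

```r
library(tibble)

# Made-up two-row tibble standing in for chr_2022
toy_chr <- tibble(
  county = c("Adams County", "Allen County"),
  poorfair = c(21.5, 18.2)
)

# TRUE for a tibble, FALSE for a plain data frame
is_tibble(toy_chr)
is_tibble(as.data.frame(toy_chr))

# Typing the name prints the tibble, with its dimensions and column types
toy_chr
```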