This page was last updated: 2021-10-25 09:36:46.
The data set for Project A is now available. The data are the 2021 version of the analytic data from County Health Rankings. In much of what follows, we will abbreviate County Health Rankings as CHR. You have already seen some versions of these data in Labs 02 and 03, but you’ll need to generate a new data set of your own for Project A.
The data are gathered at the County Health Rankings Data & Documentation site.
The key links for you are provided as part of the National Data & Documentation section of that site for the 2021 County Health Rankings. Do not use data from previous editions of the CHR, and do not use the trends data available on their website for this project.
Specifically, you’ll need three files:
These files are also available in the data folder for Project A on GitHub.
Cleaning the data will be a time-consuming effort, but the good news is that you can begin it immediately. Before you complete any of the required analyses in R, you’ll need to complete the following steps.
Restrict the data to rows with a `county_ranked` value of 1. You will be working with a subset of the available states, which should include the 88 counties of Ohio, plus counties from 3-5 other states you will select.

Here are some of the necessary details for each step of this process. Please refer to Dr. Love’s materials in the Project A Examples and Tips section for more help.
Remove the first row of the `.csv` file to leave the second row as the top row before ingesting the data into R. You could do this in Excel or perhaps Google Sheets, and then resave a new version of the file without this top row as a `.csv` before ingesting into R. A cleaner approach would be to remove this row in R when ingesting the data, by using the `skip = 1` argument within your `read_csv()` call, as shown below.

library(tidyverse)  # provides read_csv() and the %>% pipe

data_url <- "https://www.countyhealthrankings.org/sites/default/files/media/document/analytic_data2021.csv"
chr_2021_raw <- read_csv(data_url, skip = 1)
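To see what `skip = 1` does without downloading anything, here is a minimal sketch on a made-up three-column stand-in for the raw file. The labels and values are invented for illustration; only the two-header-row layout mirrors the description above.

```r
library(readr)

# Toy stand-in for the raw file (made-up values): line 1 holds descriptive
# labels, line 2 holds the variable names we want as column headers.
toy_csv <- I("State Abbreviation,Name,Premature death raw value
state,county,v001_rawvalue
OH,Adams County,11000
OH,Allen County,8500")

# skip = 1 drops the first line, so the second line becomes the header row
toy_raw <- read_csv(toy_csv, skip = 1, show_col_types = FALSE)

names(toy_raw)
dim(toy_raw)
```

The same `dim()` call on the real `chr_2021_raw` is a quick way to confirm the row and column counts quoted on this page.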
Read the `.csv` file into R, and call the resulting tibble `chr_2021_raw`. Be sure that this goes as smoothly as possible. At this point, you should have 3194 rows and 690 columns. At this stage, you may also want to restrict the data set to only the 3081 rows which have `county_ranked` values of 1, since the other rows will not be used in this project.

Your selection must include:
The number of counties (which have `county_ranked` values of 1) associated with each state (specified using its two-letter postal abbreviation code) is listed below, for your convenience.
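If you would like to reproduce these per-state counts yourself, a sketch along the following lines might help. It uses a tiny made-up tibble in place of `chr_2021_raw`, and the name `ranked_counts` is just a placeholder.

```r
library(dplyr)
library(tibble)

# Toy stand-in for chr_2021_raw (made-up rows, for illustration only)
toy_raw <- tribble(
  ~state, ~county,         ~county_ranked,
  "OH",   "Adams County",  1,
  "OH",   "Allen County",  1,
  "OH",   "Ohio",          0,
  "WY",   "Albany County", 1,
  "WY",   "Wyoming",       0
)

# Keep only ranked counties, then count the rows within each state
ranked_counts <- toy_raw %>%
  filter(county_ranked == 1) %>%
  count(state, name = "counties")

ranked_counts
```

Note that `filter()` also drops rows where `county_ranked` is missing, which is usually what you want here.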
So, including `OH`, you will need a total of 200-400 counties, from 4-6 states. For example,

- `TX` (243 counties), `AZ` (15 counties) and `NM` (32 counties) in addition to `OH` (88 counties), which would yield 378 counties in 4 states
- `WA` (39 counties), `WI` (72 counties), `WV` (55 counties) and `WY` (23 counties) to join `OH` (88 counties) yielding 277 counties in 5 states
- `PA` (67), `NY` (62), `NJ` (21), `MD` (24) and `VA` (133) in addition to `OH` (88) yielding 395 counties in 6 states

Be sure that only counties with a `county_ranked` of 1 are included. Otherwise, your counts won’t match the image above.

Here you will need to identify variables from the data for your selected sample of counties that will allow you to create a tibble that includes:
- `state`, which will be a multi-categorical (with 4-6 categories) variable
- the `county_ranked` variable, which tells us whether the row should be included in our data (so you’ll want to eventually filter down to the rows where `county_ranked == 1`)
- `vXXX_rawvalue` (note: to select the entire group of 79 variables, you might try `select(ends_with("rawvalue"))` as part of a pipe of the data.)

Your code to look at all 79 variables, plus the other required elements, might look something like this:
chr_2021_raw2 <- chr_2021_raw %>%
select(fipscode, state, county, county_ranked,
ends_with("rawvalue"))
Note that the resulting `chr_2021_raw2` tibble would have 3,194 rows and 83 columns, and you’d still need to reduce the number of variables (as described below), and filter down to the rows with `county_ranked` equal to 1.
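The remaining filtering step could be sketched as follows. The tibble is a made-up stand-in for `chr_2021_raw2`, and `my_states`, `chr_subset`, `n_counties` and `n_states` are placeholder names, not required ones.

```r
library(dplyr)
library(tibble)

# One of the example state selections given earlier on this page
my_states <- c("OH", "WA", "WI", "WV", "WY")

# Toy stand-in for chr_2021_raw2 (made-up rows, for illustration only)
toy_raw2 <- tribble(
  ~fipscode, ~state, ~county,         ~county_ranked,
  "39001",   "OH",   "Adams County",  1,
  "39000",   "OH",   "Ohio",          0,
  "56001",   "WY",   "Albany County", 1,
  "48001",   "TX",   "Anderson County", 1
)

# Keep ranked counties in the chosen states only
chr_subset <- toy_raw2 %>%
  filter(county_ranked == 1, state %in% my_states)

n_counties <- nrow(chr_subset)
n_states <- n_distinct(chr_subset$state)

# With the real data, n_counties should fall in 200-400 and n_states in 4-6
c(n_counties = n_counties, n_states = n_states)
```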
The 2021 CHR Analytic Data Documentation file (PDF) and the 2021 Data Dictionary (PDF) are crucial here, as they explain what the available variables mean and how they should be labeled.
You must select your five variables from these 79 variables in the raw (.csv) data file. We show them here in the order in which they appear in the raw file. The listing “v001” in this table actually refers to the variable named “v001_rawvalue”.
“v001”, “v002”, “v036”, “v042”, “v037”, “v009”, “v011”, “v133”, “v070”, “v132”, “v049”, “v134”, “v045”, “v014”, “v085”, “v004”, “v088”, “v062”, “v005”, “v050”, “v155”, “v168”, “v069”, “v023”, “v024”, “v044”, “v082”, “v140”, “v043”, “v135”, “v125”, “v124”, “v136”, “v067”, “v137”, “v147”, “v127”, “v128”, “v129”, “v144”, “v145”, “v060”, “v061”, “v139”, “v083”, “v138”, “v039”, “v143”, “v003”, “v122”, “v131”, “v021”, “v149”, “v159”, “v160”, “v063”, “v065”, “v141”, “v142”, “v015”, “v161”, “v148”, “v158”, “v156”, “v153”, “v154”, “v166”, “v051”, “v052”, “v053”, “v054”, “v055”, “v081”, “v080”, “v056”, “v126”, “v059”, “v057”, “v058”
For example, `v001_rawvalue` shows the raw values for the premature death measure. If you select this variable, it is up to you to use the documentation in the two PDF files I have linked to, as well as the information on the County Health Rankings website, to get a reasonable understanding of what the variable measures, and how it was collected.
Be sure to read this entire Data page (especially the material on Cleaning Your Variables) before selecting your variables, so you are aware of some features of the data we disclose below.
The five variables you select must be five different variables selected from the 79 variables in the CHR data set that are listed as `vXXX_rawvalue`. Some of the variables in that set of 79 are better choices than others, based on the criteria in the next few notes.
Each of the variables you select should be of some actual interest to you on its own, either as a health outcome of interest or as potentially useful information about a feature of the county that might relate to that health outcome. Your five quantitative variables, selected by you from the 79 available “raw value” CHR variables, will need to be treated as follows:
- `v124` (about drinking water violations) is already a binary (1 = Yes, 0 = No) variable (all other variables are quantitative). You are permitted to use `v124` as your variable 4, if you like. `v124` may not be used as anything other than variable 4 in your work.
- `state` will serve as another multi-categorical (with 4-6 categories) predictor of variable 1, so this will also be part of your tibble

Variable | Description | % missing |
---|---|---|
`v129_rawvalue` | infant mortality | 60.1 |
`v015_rawvalue` | homicides | 57.8 |
`v149_rawvalue` | disconnected youth | 57.6 |
`v138_rawvalue` | drug overdose deaths | 43.6 |
`v158_rawvalue` | juvenile arrests | 41.9 |
`v128_rawvalue` | child mortality | 38.8 |
`v141_rawvalue` | residential segregation | 32.2 |
`v148_rawvalue` | firearm fatalities | 27.3 |
`v061_rawvalue` | HIV prevalence | 22.5 |
`v161_rawvalue` | suicides | 22.2 |
`v021_rawvalue` | High School graduation | 16.1 |
`v160_rawvalue` | Math scores | 12.8 |
`v039_rawvalue` | Motor vehicle crash deaths | 12.7 |
Note that the descriptions in the table above (and in the tables below) associated with each variable are available as the first (deleted) row in the .csv file, and are also specified in the 2021 CHR Analytic Data Documentation file.
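Missingness percentages like those in the table can be computed for your own selection of states. The sketch below uses made-up values, and `pct_missing` is a placeholder name; which rows the table above was computed over is not stated here, so your results for a subset of states will differ.

```r
library(dplyr)
library(tibble)

# Toy data (made-up values) with some missing raw values
toy <- tibble(
  v129_rawvalue = c(NA, 5.1, NA, NA),
  v039_rawvalue = c(12.3, NA, 9.8, 15.0)
)

# Percentage of rows that are missing, for every *_rawvalue column
pct_missing <- toy %>%
  summarise(across(ends_with("rawvalue"), ~ 100 * mean(is.na(.x))))

pct_missing
```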
Across your complete set of 4-6 selected states, the raw versions of each of your five selected variables must have at least 10 distinct non-missing values. Again, you’ll need to show R code to do this checking, and demonstrate with complete English sentences that you have checked this to be true.
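One possible way to run this check is with `n_distinct()`, ignoring missing values. This is only a sketch on made-up data; the column names and `distinct_counts` are placeholders for your own renamed variables.

```r
library(dplyr)
library(tibble)

# Toy data (made-up values) standing in for two selected variables
toy <- tibble(
  adult_smoking = c(18.2, 21.0, 21.0, NA, 25.4),
  median_income = c(41.0, 58.5, 47.3, 52.1, NA)
)

# Number of distinct non-missing values in each column
distinct_counts <- toy %>%
  summarise(across(everything(), ~ n_distinct(.x, na.rm = TRUE)))

distinct_counts

# TRUE for each variable that meets the "at least 10" requirement
unlist(distinct_counts) >= 10
```

With real data you would then state, in a complete sentence, that each of your five variables meets the requirement.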
Check your five variables on the lists below. You should find all five of your selected variables on the list somewhere.
`v124`, which is about Drinking Water Violations, is the only binary (1 = Yes, 0 = No) variable in the data (all other variables are quantitative). You are permitted to use `v124` as your variable 4, if you like. `v124` may not be used as anything other than variable 4 in your work.
These variables are specified as providers per population in the raw data file. You will want to take the reciprocal (1/raw value) to rescale the results in terms of population per provider, which should be much more interpretable.
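The reciprocal rescaling could look something like this. The values are made up, and `pcp_pop_per_provider` is a hypothetical name chosen for illustration.

```r
library(dplyr)
library(tibble)

# Toy data (made-up values): primary care physicians per resident
toy <- tibble(
  county = c("County A", "County B"),
  v004_rawvalue = c(0.0005, 0.00125)
)

# Take the reciprocal to get residents per provider
toy_rescaled <- toy %>%
  mutate(pcp_pop_per_provider = 1 / v004_rawvalue)

toy_rescaled
```

Be aware that a raw value of 0 would produce `Inf` after this transformation, so it is worth checking for zeros first.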
`v004_rawvalue` is providers per population, but `v004_other_data_1` is the ratio of population (residents) per provider.

Code | Variable Description |
---|---|
v004 | Primary care physicians |
v062 | Mental health providers |
v088 | Dentists |
v131 | Other primary care providers (not recommended) |

`v131` is not recommended in Project A because of the large number of counties with very small numbers of these providers listed.

Code | Variable Description | What to do? |
---|---|---|
v001 | Premature death | Divide by 100 to represent losses per 1000 population |
v051 | Population | Either use log10(population), or divide population by 1000 to represent population in thousands. |
v061 | HIV prevalence | Divide by 100 to represent cases per 1000 population (caution: substantial missing data) |
v063 | Median household income | Divide by 1000 to represent in thousands of dollars |
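As a rough sketch, the rescalings in this table might be applied with a single `mutate()`. The values below are made up, and the new column names are only suggestions.

```r
library(dplyr)
library(tibble)

# Toy data (made-up values) for three of the variables in the table
toy <- tibble(
  v001_rawvalue = c(8500, 11000),   # premature death
  v051_rawvalue = c(27000, 102000), # population
  v063_rawvalue = c(41000, 58500)   # median household income, dollars
)

rescaled <- toy %>%
  mutate(
    premature_death = v001_rawvalue / 100,   # divide by 100, per the table
    pop_thousands   = v051_rawvalue / 1000,  # population in thousands
    pop_log10       = log10(v051_rawvalue),  # alternative: log10(population)
    median_income   = v063_rawvalue / 1000   # thousands of dollars
  )

rescaled
```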
Each of the variables listed below is a proportion (between 0 and 1). Before you use them in any analyses, please multiply them by 100 in your data development (using `mutate`) to turn their values into percentages (between 0 and 100). This will substantially ease the interpretation of slopes and transformations for these variables.
Code | Variable Description |
---|---|
v002 | Poor or fair health |
v003 | Uninsured adults (pick v003 or v085, but not both) |
v009 | Adult smoking |
v011 | Adult obesity |
v021 | High school graduation (pick v021 or v168, but not both, see note) |
v023 | Unemployment |
v024 | Children in poverty |
v037 | Low birthweight |
v049 | Excessive drinking |
v050 | Mammography screening |
v052 | Proportion below 18 years of age |
v053 | Proportion 65 and older |
v054 | Proportion Non-Hispanic Black (see note below) |
v055 | Proportion American Indian and Alaska Native (see note below) |
v056 | Proportion Hispanic (see note below) |
v057 | Proportion Females |
v058 | Proportion Rural |
v059 | Proportion not proficient in English |
v060 | Diabetes prevalence |
v065 | Children eligible for free or reduced price lunch |
v067 | Driving alone to work |
v069 | Some college |
v070 | Physical inactivity |
v080 | Proportion Native Hawaiian/Other Pacific Islander (see note below) |
v081 | Proportion Asian (see note below) |
v082 | Children in single-parent households |
v083 | Limited access to healthy foods |
v085 | Uninsured (pick v003 or v085, but not both) |
v122 | Uninsured children (pick v085 or v122, but not both) |
v126 | Proportion Non-Hispanic White (see note below) |
v132 | Access to exercise opportunities |
v134 | Proportion of driving deaths with alcohol involvement |
v136 | Severe housing problems |
v137 | Long commute - driving alone |
v139 | Food insecurity |
v143 | Insufficient sleep |
v144 | Frequent physical distress (pick v036 or v144, but not both) |
v145 | Frequent mental distress (pick v042 or v145, but not both) |
v149 | Disconnected youth (caution: lots of missing data) |
v153 | Homeownership |
v154 | Severe housing cost burden |
v155 | Flu vaccinations |
v166 | Broadband access |
v168 | High school completion (pick v021 or v168, but not both, see note) |
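Converting the proportions in this table to percentages can be done for several columns at once with `across()`. This is a sketch on made-up values; swap in whichever of these variables you actually selected.

```r
library(dplyr)
library(tibble)

# Toy data (made-up values): two proportion variables from the table
toy <- tibble(
  county = c("County A", "County B"),
  v009_rawvalue = c(0.18, 0.25),  # adult smoking
  v011_rawvalue = c(0.31, 0.36)   # adult obesity
)

# Multiply each selected proportion by 100 to get a percentage
toy_pct <- toy %>%
  mutate(across(c(v009_rawvalue, v011_rawvalue), ~ 100 * .x))

toy_pct
```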
Note on race/ethnicity: we suggest using `v126` to incorporate this dimension as a predictor (or its complement, `1 - v126`), rather than including any of the other race/ethnicity variables, specifically `v054`, `v055`, `v056`, `v080`, or `v081`. This is because there’s more variation in the `v126` data across the reported counties. Again, a serious look at the impact of race/ethnicity is beyond the scope of Project A.

`v021` has more missing data than `v168`, should you be trying to choose between them.

The variables listed below should be fine as they are. Most of them are ratios, although a few are averages or indexes. The main issue for these variables is correctly specifying the units of measurement (note that the indexes don’t have units.)
Code | Variable Description |
---|---|
v005 | Preventable hospital stays |
v014 | Teen births |
v015 | Homicide rate (caution: lots of missing data) |
v036 | Poor physical health days (pick v036 or v144, but not both) |
v039 | Motor vehicle crash deaths (caution: substantial missing data) |
v042 | Poor mental health days (pick v042 or v145, but not both) |
v043 | Violent crime |
v044 | Income inequality |
v045 | Sexually transmitted infections |
v125 | Air pollution - particulate matter |
v127 | Premature age-adjusted mortality |
v128 | Child mortality (caution: lots of missing data) |
v129 | Infant mortality (caution: lots of missing data) |
v133 | Food environment index |
v135 | Injury deaths |
v138 | Drug overdose deaths (caution: lots of missing data) |
v140 | Social associations |
v141 | Residential segregation - Black/White (pick v141 or v142, see note) |
v142 | Residential segregation - non-White/White (pick v141 or v142, see note) |
v147 | Life expectancy |
v148 | Firearm fatalities (caution: substantial missing data) |
v156 | Traffic volume |
v158 | Juvenile arrests (caution: lots of missing data) |
v159 | Reading scores |
v160 | Math scores (caution: substantial missing data) |
v161 | Suicides (caution: substantial missing data) |
`v141` has more missing data than `v142`, should you be trying to choose between them.

As mentioned, you will need to provide code to categorize two of your variables. Specifically…

`tabyl` is a good approach to check that this is true.

In your R Markdown file, you will need to present all code necessary to take the original `.csv` data file and wind up with your `chr_2021` tibble so that we can run that code and get the same results you do.
Your main data set for analysis then should be gathered into a tibble called chr_2021 containing the following information:
- `fipscode` = the five-digit fips code for the state and county,
- `county` = the name of the county
- `state` = the postal abbreviation code for the state

All of these variables should be renamed (and also have `clean_names` applied) so as to have descriptive and maximally helpful variable names.

- If you select `v009_rawvalue`, which is about adult smoking, you should rename the variable `v009_rawvalue` to `adult_smoking` in your tibble.
- If `v009_rawvalue` (about adult smoking) is to be one of your categorical variables, then you should include both the original quantitative value (renamed `adult_smoking_raw`) and your categorical variable that you’ll actually use in analyses, which should be named `adult_smoking_cat`.
- Your `chr_2021` tibble should contain all of the counties within the 4-6 states you are studying, and no other counties should be included in your tibble. All rows of your `chr_2021` tibble should be taken from those rows in the original data with `county_ranked` values of 1.

Next, you will provide a codebook as part of your R Markdown file (and HTML/PDF output) that specifies the name of each variable in your tibble and its definition. After you select your variables, use the County Health Rankings website’s 2021 Measures list, and in particular the linked information on that page for full descriptions, definitions and limitations of the variables you have selected.
Be sure to include a description for each of the 10 variables (3 required + 5 you select + 2 categorical versions you create) in your final tibble, not just the ones that you wind up using in your analyses.
Include a description of what the cutpoints are for the two categorical variables you create, and specify how those cutpoints were chosen.
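To make the `_raw` / `_cat` naming convention and the idea of documented cutpoints concrete, here is a minimal sketch. The data are made up, and the cutpoints (18 and 24) and the three labels are purely hypothetical; your own number of categories and cutpoints must follow the project requirements and be justified in your codebook.

```r
library(dplyr)
library(tibble)

# Toy data (made-up values): adult smoking as a proportion
toy <- tibble(
  county = c("County A", "County B", "County C", "County D"),
  v009_rawvalue = c(0.15, 0.19, 0.22, 0.27)
)

toy_cat <- toy %>%
  rename(adult_smoking_raw = v009_rawvalue) %>%
  mutate(
    adult_smoking_raw = 100 * adult_smoking_raw,  # proportion -> percentage
    # Hypothetical cutpoints at 18% and 24%, chosen only for illustration
    adult_smoking_cat = cut(adult_smoking_raw,
                            breaks = c(-Inf, 18, 24, Inf),
                            labels = c("Low", "Middle", "High"))
  )

# Check how many counties land in each category
toy_cat %>% count(adult_smoking_cat)
```

`janitor::tabyl()` would serve the same checking purpose as `count()` here.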
Save the Tibble: You should provide code that saves your tibble as an R data set into your R Project, with the file name `chr_2021_YOURNAME.Rds`. If you like, you can store this `.Rds` file in a `data` subdirectory within your R Project.
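Saving and reloading might be sketched as below, using base R's `saveRDS()` and `readRDS()`. The tibble is a two-row stand-in, and the sketch writes to a temporary directory so it can run anywhere; in your own R Project you would likely point `out_dir` at `"data"` instead.

```r
library(tibble)

# Two-row stand-in for the real chr_2021 tibble
chr_2021 <- tibble(
  fipscode = c("39001", "39003"),
  county   = c("Adams County", "Allen County"),
  state    = c("OH", "OH")
)

# Assumption for this sketch: a temporary folder; use "data" in your project
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE)

out_file <- file.path(out_dir, "chr_2021_YOURNAME.Rds")
saveRDS(chr_2021, file = out_file)

# Reading the file back confirms the save worked
reloaded <- readRDS(out_file)
```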
Print the Tibble: You will then prove that your tibble is in fact a tibble and not just a data frame by listing it, so that the first 10 rows are printed, and the columns are appropriately labeled.
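A quick way to confirm the object's class before printing is `tibble::is_tibble()`. The sketch below uses a made-up 25-row tibble as a stand-in for `chr_2021`.

```r
library(tibble)

# Made-up 25-row stand-in for chr_2021
chr_2021 <- tibble(
  row_id = 1:25,
  state  = rep(c("OH", "WA", "WI", "WV", "WY"), times = 5)
)

is_tibble(chr_2021)                 # TRUE for a tibble
is_tibble(as.data.frame(chr_2021))  # FALSE for a plain data frame

# Printing a tibble with more than 20 rows shows only the first 10 by default,
# along with the dimensions and column types
chr_2021
```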
Numerical Summaries: You will then demonstrate main numerical summaries from the tibble by running the following summaries.
- `Hmisc::describe` on the whole tibble (all 10 variables, which is what you’ll do for the proposal in Lab 04).
- `mosaic::favstats` on each of your quantitative variables (so variables 1, 2, and 3 and the original versions of variables 4 and 5)
- `janitor::tabyl` on your categorical variables 4 and 5 as well as on `state`.
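The three summaries could be run along these lines. This is a sketch on a made-up four-row tibble with hypothetical variable names, and it assumes the Hmisc, mosaic and janitor packages are installed.

```r
library(tibble)

# Made-up stand-in for chr_2021, with one quantitative and one categorical variable
chr_toy <- tibble(
  state = c("OH", "OH", "WY", "WY"),
  adult_smoking = c(18, 25, 21, 19),
  adult_smoking_cat = factor(c("Low", "High", "Middle", "Middle"))
)

# Overall description of every variable in the tibble
Hmisc::describe(chr_toy)

# Favorite statistics for one quantitative variable (repeat for each one)
mosaic::favstats(~ adult_smoking, data = chr_toy)

# Counts and percentages for the categorical variables
smoking_tab <- janitor::tabyl(chr_toy, adult_smoking_cat)
state_tab <- janitor::tabyl(chr_toy, state)
smoking_tab
state_tab
```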