Chapter 17 Cleaning the BRFSS SMART Data
The Centers for Disease Control analyzes Behavioral Risk Factor Surveillance System (BRFSS) survey data for specific metropolitan and micropolitan statistical areas (MMSAs) in a program called the Selected Metropolitan/Micropolitan Area Risk Trends of BRFSS (SMART BRFSS.)
In this work, we will focus on data from the 2016 SMART, and in particular on data from the state of Ohio, and from the Cleveland-Elyria, OH, Metropolitan Statistical Area. The purpose of this survey is to provide localized health information that can help public health practitioners identify local emerging health problems, plan and evaluate local responses, and efficiently allocate resources to specific needs.
In this chapter, I describe all of the cleaning I have done (so far) on the BRFSS SMART data, and break it out into national, statewide, and local samples.
The data files produced by this chapter include:
smart_ohio.csv
which includes data on 57 variables for 6014 subjects in five MMSAs in Ohio.smartcle.csv
which includes data on those same 57 variables for the 1036 subjects in the Cleveland-Elyria-Lorain OH MMSA.smartcle1.csv
which includes a subset of 11 variables for those same 1036 subjects.
SMART and BRFSS
The Centers for Disease Control analyzes Behavioral Risk Factor Surveillance System (BRFSS) survey data for specific metropolitan and micropolitan statistical areas (MMSAs) in a program called the Selected Metropolitan/Micropolitan Area Risk Trends of BRFSS (SMART BRFSS.)
In this work, we will focus on data from the 2016 SMART, and in particular on data from the Cleveland-Elyria, OH, Metropolitan Statistical Area. The purpose of this survey is to provide localized health information that can help public health practitioners identify local emerging health problems, plan and evaluate local responses, and efficiently allocate resources to specific needs.
17.1 Key resources
- the data, in the form of the 2016 SMART BRFSS MMSA Data, found in a zipped SAS Transport Format file. The data were released in August 2017.
- the MMSA Variable Layout PDF which simply lists the variables included in the data file
- the Calculated Variables PDF which describes the risk factors by data variable names - there is also an online summary matrix of these calculated variables, as well.
- the lengthy 2016 Survey Questions PDF which lists all questions asked as part of the BRFSS in 2016
- the enormous Codebook for the 2016 BRFSS Survey PDF which identifies the variables by name for us.
17.2 Ingesting The Raw Data
library(skimr)
library(broom); library(modelr)
library(haven); library(tidyverse)
17.3 The National Data
To create the data files we’ll use, I used the read_xpt
function from the haven
package to bring in the SAS XPT data file that is provided by CDC. The codes I used (but won’t use in these Notes) were:
smart.raw <- read_xpt("data/MMSA2016.xpt") %>% tbl_df
dim(smart.raw)
This gives the nationwide data, which has 249,011 rows and 146 columns.
But for the purposes of putting these Notes online, I needed to crank down the sample size enormously. To that end, I created a new data file, called smartorig.csv, which I developed by
- importing the MMSA2016.xpt file
- filtering away all observations except those from the state of Ohio
- saving it
So, for purposes of these notes, our complete data set is actually coming from smartorg.csv
and consists only of the 6,014 observations from Ohio.
smart.raw <- read_csv("data/smartorig.csv") %>% tbl_df
Parsed with column specification:
cols(
.default = col_double(),
MMSANAME = col_character()
)
See spec(...) for full column specifications.
dim(smart.raw)
[1] 6014 146
17.4 Cleaning the BRFSS Data
My source for learning about the codes in these variables is the enormous Codebook for the 2016 BRFSS Survey PDF which identifies the variables by name for us. The first two variables in the raw file are
DISPCODE
which is 1100 if the subject completed the interview, and 1200 if they partially completed the interview.
smart.raw %>% count(DISPCODE)
# A tibble: 2 x 2
DISPCODE n
<dbl> <int>
1 1100 5162
2 1200 852
HHADULT
which is the response to “How many members of your household, including yourself, are 18 years of age or older?”- The response is 1-76, with special values 77 for Don’t Know/Not Sure and 99 for refused, with BLANK for missing or not asked.
- So we should change all numerical values above 76 to NA for our analyses (the blanks are already regarded as NAs by R in the ingestion process.)
smart.raw <- smart.raw %>%
mutate(hhadults = HHADULT,
hhadults = replace(hhadults, hhadults > 76, NA))
smart.raw %>% count(HHADULT, hhadults) %>% tail()
# A tibble: 6 x 3
HHADULT hhadults n
<dbl> <dbl> <int>
1 6 6 9
2 7 7 3
3 10 10 2
4 77 NA 4
5 99 NA 1
6 NA NA 3419
17.4.1 Health Status (1 item)
The next variable describes relate to the subject’s health status.
17.4.1.1 GENHLTH
and its cleanup to genhealth
GENHLTH
, the General Health variable, which is the response to “Would you say that in general your health is …”- 1 = Excellent
- 2 = Very good
- 3 = Good
- 4 = Fair
- 5 = Poor
- 7 = Don’t know/Not sure
- 9 = Refused
- BLANK = Not asked or missing
To clean up the GENHLTH
data into a new variable called genhealth
we’ll need to
- convince R that the 7 and 9 values are in fact best interpreted as NA
,
- and perhaps change the variable to a factor and incorporate the names into the levels.
smart.raw <- smart.raw %>%
mutate(genhealth = fct_recode(factor(GENHLTH),
"1_Excellent" = "1",
"2_VeryGood" = "2",
"3_Good" = "3",
"4_Fair" = "4",
"5_Poor" = "5",
NULL = "7",
NULL = "9"))
smart.raw %>% count(GENHLTH, genhealth) %>% tail()
Warning: Factor `genhealth` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 6 x 3
GENHLTH genhealth n
<dbl> <fct> <int>
1 2 2_VeryGood 2014
2 3 3_Good 1895
3 4 4_Fair 798
4 5 5_Poor 298
5 7 <NA> 8
6 9 <NA> 4
17.4.3 Health Care Access (4 items)
The next four variables relate to the subject’s health care access.
17.4.3.1 HLTHPLN1
and its cleanup to healthplan
HLTHPLN1
, the Have any health care coverage variable, is the response to “Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service?”
- 1 = Yes
- 2 = No
- 7 = Don’t know/Not sure
- 9 = Refused
To clean up the HLTHPLN1
data into a new variable called healthplan
we’ll
- convince R that the 7 and 9 values are in fact best interpreted as NA
,
- and turn it into an indicator variable, e.g., we will leave the variable as numeric, but change the values to 1 = Yes and 0 = No.
smart.raw <- smart.raw %>%
mutate(healthplan = HLTHPLN1,
healthplan = replace(healthplan, healthplan %in% c(7, 9), NA),
healthplan = replace(healthplan, healthplan == 2, 0))
smart.raw %>% count(HLTHPLN1, healthplan)
# A tibble: 4 x 3
HLTHPLN1 healthplan n
<dbl> <dbl> <int>
1 1 1 5722
2 2 0 271
3 7 NA 11
4 9 NA 10
17.4.3.2 PERSDOC2
and its cleanup to hasdoc
and to numdocs2
PERSDOC2
, the Multiple Health Care Professionals variable, is the response to “Do you have one person you think of as your personal doctor or health care provider?” where if the response is “No”, the survey then asks “Is there more than one or is there no person who you think of as your personal doctor or health care provider?”
- 1 = Yes, only one
- 2 = More than one
- 3 = No
- 7 = Don’t know/Not sure
- 9 = Refused
- BLANK = Not asked or missing
Let’s build two variables here. We’ll create a hasdoc
variable which is simply Yes or No, and a factor variable called numdocs2
which keeps the first three levels in place.
To clean up the PERSDOC2
data into a new variable called hasdoc
we’ll
- convince R that the 7 and 9 values are in fact best interpreted as NA
,
- and turn it into an indicator variable, e.g., we will leave the variable as numeric, but change the values to 1 = Yes and 0 = No, so that the original 1 and 2 become 1, and the original 3 becomes 0.
To clean up the PERSDOC2
data into a new variable called numdocs2
we’ll
- convince R that the 7 and 9 values are in fact best interpreted as NA
,
- relabel options 1, 2 and 3 while turning the variable into a factor.
smart.raw <- smart.raw %>%
mutate(hasdoc = PERSDOC2,
hasdoc = replace(hasdoc, hasdoc %in% c(7, 9), NA),
hasdoc = replace(hasdoc, hasdoc %in% c(1, 2), 1),
hasdoc = replace(hasdoc, hasdoc == 3, 0)) %>%
mutate(numdocs2 = fct_recode(factor(PERSDOC2),
"1_One" = "1",
"2_MoreThanOne" = "2",
"3_Zero" = "3",
NULL = "7",
NULL = "9"))
smart.raw %>% count(PERSDOC2, hasdoc, numdocs2)
Warning: Factor `numdocs2` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 5 x 4
PERSDOC2 hasdoc numdocs2 n
<dbl> <dbl> <fct> <int>
1 1 1 1_One 4783
2 2 1 2_MoreThanOne 463
3 3 0 3_Zero 744
4 7 NA <NA> 14
5 9 NA <NA> 10
17.4.3.3 MEDCOST
and its cleanup to costprob
MEDCOST
, the Could Not See Doctor Because of Cost variable, is the response to “Was there a time in the past 12 months when you needed to see a doctor but could not because of cost?”
- 1 = Yes
- 2 = No
- 7 = Don’t know/Not sure
- 9 = Refused
- BLANK = Not asked or missing
This is just like HLTHPLAN
.
smart.raw <- smart.raw %>%
mutate(costprob = MEDCOST,
costprob = replace(costprob, costprob %in% c(7, 9), NA),
costprob = replace(costprob, costprob == 2, 0))
smart.raw %>% count(MEDCOST, costprob)
# A tibble: 4 x 3
MEDCOST costprob n
<dbl> <dbl> <int>
1 1 1 503
2 2 0 5498
3 7 NA 11
4 9 NA 2
17.4.3.4 CHECKUP1
and its cleanup to checktime
CHECKUP1
, the Length of time since last routine checkup variable, is the response to “About how long has it been since you last visited a doctor for a routine checkup? [A routine checkup is a general physical exam, not an exam for a specific injury, illness, or condition.]”
- 1 = Within past year (anytime less than 12 months ago)
- 2 = Within past 2 years (1 year but less than 2 years ago)
- 3 = Within past 5 years (2 years but less than 5 years ago)
- 4 = 5 or more years ago
- 7 = Don’t know/Not sure
- 8 = Never
- 9 = Refused
- BLANK = Not asked or missing
To clean up the CHECKUP1
data into a new variable called checktime
we’ll
- convince R that the 7 and 9 values are in fact best interpreted as NA
,
- relabel options 1, 2, 3, 4 and 8 while turning the variable into a factor.
smart.raw <- smart.raw %>%
mutate(checktime = fct_recode(factor(CHECKUP1),
"1_In-past-year" = "1",
"2_1-to-2-years" = "2",
"3_2-to-5-years" = "3",
"4_5_plus_years" = "4",
"8_Never" = "8",
NULL = "7",
NULL = "9"))
smart.raw %>% count(CHECKUP1, checktime)
Warning: Factor `checktime` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 7 x 3
CHECKUP1 checktime n
<dbl> <fct> <int>
1 1 1_In-past-year 4742
2 2 2_1-to-2-years 571
3 3 3_2-to-5-years 299
4 4 4_5_plus_years 299
5 7 <NA> 66
6 8 8_Never 30
7 9 <NA> 7
17.4.4 Exercise (1 item)
17.4.4.1 EXERANY2
and its cleanup to exerany
EXERANY2
, the Exercise in Past 30 Days variable, is the response to “During the past month, other than your regular job, did you participate in any physical activities or exercises such as running, calisthenics, golf, gardening, or walking for exercise?”
- 1 = Yes
- 2 = No
- 7 = Don’t know/Not sure
- 9 = Refused
- BLANK = Not asked or missing
This is just like HLTHPLAN
.
smart.raw <- smart.raw %>%
mutate(exerany = EXERANY2,
exerany = replace(exerany, exerany %in% c(7, 9), NA),
exerany = replace(exerany, exerany == 2, 0))
smart.raw %>% count(EXERANY2, exerany)
# A tibble: 4 x 3
EXERANY2 exerany n
<dbl> <dbl> <int>
1 1 1 4464
2 2 0 1537
3 7 NA 10
4 9 NA 3
17.4.5 Inadequate Sleep (1 item)
17.4.5.1 SLEPTIM1
and its cleanup to sleephrs
SLEPTIM1
, the How Much Time Do You Sleep variable, is the response to “I would like to ask you about your sleep pattern. On average, how many hours of sleep do you get in a 24-hour period?”
- The response is 1-24, with special values 77 for Don’t Know/Not Sure and 99 for refused, with BLANK for missing or not asked.
- So we should change all numerical values above 24 to NA for our analyses (the blanks are already regarded as NAs by R in the ingestion process.)
smart.raw <- smart.raw %>%
mutate(sleephrs = SLEPTIM1,
sleephrs = replace(sleephrs, sleephrs > 24, NA))
smart.raw %>% count(SLEPTIM1, sleephrs) %>% tail()
# A tibble: 6 x 3
SLEPTIM1 sleephrs n
<dbl> <dbl> <int>
1 18 18 2
2 20 20 4
3 22 22 1
4 23 23 1
5 77 NA 47
6 99 NA 8
17.4.6 Chronic Health Conditions (13 items)
17.4.6.1 Self-reported diagnosis history (11 items)
The next few variables describe whether or not the subject meets a particular standard, and are all coded in the raw data the same way:
- 1 = Yes
- 2 = No
- 7 = Don’t know/Not sure
- 9 = Refused
- BLANK = Not asked or missing
and we’ll recode them all to 1 = Yes, 0 = No, otherwise NA, as we’ve done previously.
The questions are all started with “Has a doctor, nurse, or other health professional ever told you that you had any of the following? For each, tell me Yes, No, or you’re Not sure.”
Original | Revised | Details |
---|---|---|
CVDINFR4 |
hx.mi |
(Ever told) you had a heart attack, also called a myocardial infarction? |
CVDCRHD4 |
hx.chd |
(Ever told) you had angina or coronary heart disease? |
CVDSTRK3 |
hx.stroke |
(Ever told) you had a stroke? |
ASTHMA3 |
hx.asthma |
(Ever told) you had asthma? |
ASTHNOW |
now.asthma |
Do you still have asthma? (only asked of those with Yes in ASTHMA3 ) |
CHCSCNCR |
hx.skinc |
(Ever told) you had skin cancer? |
CHCOCNCR |
hx.otherc |
(Ever told) you had any other types of cancer? |
CHCCOPD1 |
hx.copd |
(Ever told) you have Chronic Obstructive Pulmonary Disease or COPD, emphysema or chronic bronchitis? |
HAVARTH3 |
hx.arthr |
(Ever told) you have some form of arthritis, rheumatoid arthritis, gout, lupus, or fibromyalgia? (Arthritis diagnoses include: rheumatism, polymyalgia rheumatica; osteoarthritis (not osteporosis); tendonitis, bursitis, bunion, tennis elbow; carpal tunnel syndrome, tarsal tunnel syndrome; joint infection, etc.) |
ADDEPEV2 |
hx.depress |
(Ever told) you that you have a depressive disorder, including depression, major depression, dysthymia, or minor depression? |
CHCKIDNY |
hx.kidney |
(Ever told) you have kidney disease? Do NOT include kidney stones, bladder infection or incontinence. |
smart.raw <- smart.raw %>%
mutate(hx.mi = CVDINFR4,
hx.mi = replace(hx.mi, hx.mi %in% c(7, 9), NA),
hx.mi = replace(hx.mi, hx.mi == 2, 0),
hx.chd = CVDCRHD4,
hx.chd = replace(hx.chd, hx.chd %in% c(7, 9), NA),
hx.chd = replace(hx.chd, hx.chd == 2, 0),
hx.stroke = CVDSTRK3,
hx.stroke = replace(hx.stroke, hx.stroke %in% c(7, 9), NA),
hx.stroke = replace(hx.stroke, hx.stroke == 2, 0),
hx.asthma = ASTHMA3,
hx.asthma = replace(hx.asthma, hx.asthma %in% c(7, 9), NA),
hx.asthma = replace(hx.asthma, hx.asthma == 2, 0),
now.asthma = ASTHNOW,
now.asthma = replace(now.asthma, now.asthma %in% c(7, 9), NA),
now.asthma = replace(now.asthma, now.asthma == 2, 0),
hx.skinc = CHCSCNCR,
hx.skinc = replace(hx.skinc, hx.skinc %in% c(7, 9), NA),
hx.skinc = replace(hx.skinc, hx.skinc == 2, 0),
hx.otherc = CHCOCNCR,
hx.otherc = replace(hx.otherc, hx.otherc %in% c(7, 9), NA),
hx.otherc = replace(hx.otherc, hx.otherc == 2, 0),
hx.copd = CHCCOPD1,
hx.copd = replace(hx.copd, hx.copd %in% c(7, 9), NA),
hx.copd = replace(hx.copd, hx.copd == 2, 0),
hx.arthr = HAVARTH3,
hx.arthr = replace(hx.arthr, hx.arthr %in% c(7, 9), NA),
hx.arthr = replace(hx.arthr, hx.arthr == 2, 0),
hx.depress = ADDEPEV2,
hx.depress = replace(hx.depress, hx.depress %in% c(7, 9), NA),
hx.depress = replace(hx.depress, hx.depress == 2, 0),
hx.kidney = CHCKIDNY,
hx.kidney = replace(hx.kidney, hx.kidney %in% c(7, 9), NA),
hx.kidney = replace(hx.kidney, hx.kidney == 2, 0))
We definitely should have written a function to do that, of course.
17.4.6.2 DIABETE3
and its cleanup to hx.diabetes
and diabetes3
DIABETE3
, the (Ever told) you have diabetes variable, is the response to “(Ever told) you have diabetes (If Yes and respondent is female, ask Was this only when you were pregnant?. If Respondent says pre-diabetes or borderline diabetes, use response code 4.)”
- 1 = Yes
- 2 = Yes, but female told only during pregnancy
- 3 = No
- 4 = No, pre-diabetes or borderline diabetes
- 7 = Don’t know/Not sure
- 9 = Refused
- BLANK = Not asked or missing
I’ll create one variable called hx.diabetes
which is 1 if DIABETE3
= 1, and 0 otherwise, with appropriate NAs, like our other variables. Then I’ll create diabetes3
to include all of this information in a factor, but again recode the missing values properly.
smart.raw <- smart.raw %>%
mutate(hx.diabetes = DIABETE3,
hx.diabetes = replace(hx.diabetes, hx.diabetes %in% c(7, 9), NA),
hx.diabetes = replace(hx.diabetes, hx.diabetes %in% 2:4, 0),
diabetes3 = fct_recode(factor(DIABETE3),
"Diabetes" = "1",
"Pregnancy-Induced" = "2",
"No-Diabetes" = "3",
"Pre-Diabetes" = "4",
NULL = "7",
NULL = "9"))
smart.raw %>% count(DIABETE3, hx.diabetes, diabetes3)
Warning: Factor `diabetes3` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 6 x 4
DIABETE3 hx.diabetes diabetes3 n
<dbl> <dbl> <fct> <int>
1 1 1 Diabetes 833
2 2 0 Pregnancy-Induced 49
3 3 0 No-Diabetes 5037
4 4 0 Pre-Diabetes 85
5 7 NA <NA> 8
6 9 NA <NA> 2
17.4.6.3 DIABAGE2
and its cleanup to diab.age
DIABAGE2
, the Age When Told Diabetic variable, is the response to “How old were you when you were told you have diabetes?” It is asked only of people with DIABETE3 = 1
(Yes).
- The response is 1-97, with special values 98 for Don’t Know/Not Sure and 99 for refused, with BLANK for missing or not asked. People 97 years of age and above were listed as 97.
smart.raw <- smart.raw %>%
mutate(diab.age = DIABAGE2,
diab.age = replace(diab.age, diab.age > 97, NA))
smart.raw %>% count(DIABAGE2, diab.age) %>% tail()
# A tibble: 6 x 3
DIABAGE2 diab.age n
<dbl> <dbl> <int>
1 81 81 2
2 83 83 1
3 85 85 2
4 98 NA 32
5 99 NA 3
6 NA NA 5181
17.4.7 Oral Health (2 items)
17.4.7.1 LASTDEN3
and its cleanup to denttime
LASTDEN3
, the Last Visited Dentist or Dental Clinic variable, is the response to “How long has it been since you last visited a dentist or a dental clinic for any reason? Include visits to dental specialists, such as orthodontists.”
- 1 = Within past year (anytime less than 12 months ago)
- 2 = Within past 2 years (1 year but less than 2 years ago)
- 3 = Within past 5 years (2 years but less than 5 years ago)
- 4 = 5 or more years ago
- 7 = Don’t know/Not sure
- 8 = Never
- 9 = Refused
- BLANK = Not asked or missing
We’ll again use the approach from CHECKUP1
…
smart.raw <- smart.raw %>%
mutate(denttime = fct_recode(factor(LASTDEN3),
"1_In-past-year" = "1",
"2_1-to-2-years" = "2",
"3_2-to-5-years" = "3",
"4_5_plus_years" = "4",
"8_Never" = "8",
NULL = "7",
NULL = "9"))
smart.raw %>% count(LASTDEN3, denttime)
Warning: Factor `denttime` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 7 x 3
LASTDEN3 denttime n
<dbl> <fct> <int>
1 1 1_In-past-year 4221
2 2 2_1-to-2-years 638
3 3 3_2-to-5-years 453
4 4 4_5_plus_years 617
5 7 <NA> 53
6 8 8_Never 24
7 9 <NA> 8
17.4.7.2 RMVTETH3
and its cleanup to remteeth
RMVTETH3
, the Number of Permanent Teeth Removed variable, is the response to “How many of your permanent teeth have been removed because of tooth decay or gum disease? Include teeth lost to infection, but do not include teeth lost for other reasons, such as injury or orthodontics. (If wisdom teeth are removed because of tooth decay or gum disease, they should be included in the count for lost teeth).”
- 1 = 1 to 5
- 2 = 6 or more but not all
- 3 = All
- 7 = Don’t know/Not sure
- 8 = None
- 9 = Refused
- BLANK = Not asked or missing
We’ll recode this to remteeth
which will be 0 for None, 1 for 1-5, 2 for 6+ but not all, and 3 for All.
smart.raw <- smart.raw %>%
mutate(remteeth = fct_recode(factor(RMVTETH3),
"None" = "8",
"1-5" = "1",
"6+_butnotAll" = "2",
"All" = "3",
NULL = "7",
NULL = "9"))
smart.raw %>% count(RMVTETH3, remteeth)
Warning: Factor `remteeth` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 6 x 3
RMVTETH3 remteeth n
<dbl> <fct> <int>
1 1 1-5 1632
2 2 6+_butnotAll 775
3 3 All 485
4 7 <NA> 118
5 8 None 2993
6 9 <NA> 11
17.4.8 Demographics (23 items)
17.4.8.1 _AGEG5YR
, which we’ll edit into agegroup
The _AGEG5YR
variable is a calculated variable (by CDC) obtained from the subject’s age. Since the age
data are not available, we instead get these groupings, which we’ll rearrange into the agegroup
factor.
_AGEG5YR |
Age range | agegroup |
---|---|---|
1 | 18 <= AGE <= 24 | 18-24 |
2 | 25 <= AGE <= 29 | 25-29 |
3 | 30 <= AGE <= 34 | 30-34 |
4 | 35 <= AGE <= 39 | 35-39 |
5 | 40 <= AGE <= 44 | 40-44 |
6 | 45 <= AGE <= 49 | 45-49 |
7 | 50 <= AGE <= 54 | 50-54 |
8 | 55 <= AGE <= 59 | 55-59 |
9 | 60 <= AGE <= 64 | 60-64 |
10 | 65 <= AGE <= 69 | 65-69 |
11 | 70 <= AGE <= 74 | 70-74 |
12 | 75 <= AGE <= 79 | 75-79 |
13 | AGE >= 80 | 80plus |
14 | Don’t Know, Refused or Missing | NA |
smart.raw <- smart.raw %>%
mutate(agegroup = fct_recode(factor(`_AGEG5YR`),
"18-24" = "1",
"25-29" = "2",
"30-34" = "3",
"35-39" = "4",
"40-44" = "5",
"45-49" = "6",
"50-54" = "7",
"55-59" = "8",
"60-64" = "9",
"65-69" = "10",
"70-74" = "11",
"75-79" = "12",
"80-96" = "13",
NULL = "14"))
smart.raw %>% count(`_AGEG5YR`, agegroup)
Warning: Factor `agegroup` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 14 x 3
`_AGEG5YR` agegroup n
<dbl> <fct> <int>
1 1 18-24 348
2 2 25-29 280
3 3 30-34 322
4 4 35-39 298
5 5 40-44 346
6 6 45-49 437
7 7 50-54 504
8 8 55-59 666
9 9 60-64 651
10 10 65-69 711
11 11 70-74 482
12 12 75-79 395
13 13 80-96 498
14 14 <NA> 76
17.4.8.2 SEX
recoded to female
The available levels of SEX
are:
- 1 = Male
- 2 = Female
- 9 = Refused
We’ll recode that to female
= 1 for Female, 0 Male, otherwise NA. Note the trick here is to subtract one from the coded SEX
to get the desired female
, but this requires that we move 9 to NA, rather than 9.
smart.raw <- smart.raw %>%
mutate(female = SEX - 1,
female = replace(female, female == 8, NA))
smart.raw %>% count(SEX, female)
# A tibble: 3 x 3
SEX female n
<dbl> <dbl> <int>
1 1 0 2559
2 2 1 3454
3 9 NA 1
17.4.8.3 MARITAL
status, revised to marital
The available levels of MARITAL
are:
- 1 = Married
- 2 = Divorced
- 3 = Widowed
- 4 = Separated
- 5 = Never married
- 6 = A member of an unmarried couple
- 9 = Refused
- BLANK = Not asked or missing
We’ll just turn this into a factor, and move 9 to NA.
smart.raw <- smart.raw %>%
mutate(marital = fct_recode(factor(MARITAL),
"Married" = "1",
"Divorced" = "2",
"Widowed" = "3",
"Separated" = "4",
"Never_Married" = "5",
"Unmarried_Couple" = "6",
NULL = "9"))
smart.raw %>% count(MARITAL, marital)
Warning: Factor `marital` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 7 x 3
MARITAL marital n
<dbl> <fct> <int>
1 1 Married 2989
2 2 Divorced 888
3 3 Widowed 813
4 4 Separated 107
5 5 Never_Married 1019
6 6 Unmarried_Couple 148
7 9 <NA> 50
17.4.8.4 EDUCA
recoded to educgroup
The available levels of EDUCA
(Education Level) are responses to: “What is the highest grade or year of school you completed?”
- 1 = Never attended school or only kindergarten
- 2 = Grades 1 through 8 (Elementary)
- 3 = Grades 9 through 11 (Some high school)
- 4 = Grade 12 or GED (High school graduate)
- 5 = College 1 year to 3 years (Some college or technical school)
- 6 = College 4 years or more (College graduate)
- 9 = Refused
- BLANK = Not asked or missing
We’ll just turn this into a factor, and move 9 to NA.
smart.raw <- smart.raw %>%
mutate(educgroup = fct_recode(factor(EDUCA),
"Kindergarten" = "1",
"Elementary" = "2",
"Some_HS" = "3",
"HS_Grad" = "4",
"Some_College" = "5",
"College_Grad" = "6",
NULL = "9"))
smart.raw %>% count(EDUCA, educgroup)
Warning: Factor `educgroup` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 7 x 3
EDUCA educgroup n
<dbl> <fct> <int>
1 1 Kindergarten 5
2 2 Elementary 80
3 3 Some_HS 287
4 4 HS_Grad 1745
5 5 Some_College 1611
6 6 College_Grad 2270
7 9 <NA> 16
17.4.8.5 RENTHOM1
recoded to home.own
The available levels of RENTHOM1
(Own or Rent Home) are responses to: “Do you own or rent your home? (Home is defined as the place where you live most of the time/the majority of the year.)”
- 1 = Own
- 2 = Rent
- 3 = Other Arrangement
- 7 = Don’t know/Not Sure
- 9 = Refused
- BLANK = Not asked or missing
We’ll recode as home.own
= 1 if they own their home, and 0 otherwise, and dealing with missingness properly.
smart.raw <- smart.raw %>%
mutate(home.own = RENTHOM1,
home.own = replace(home.own, home.own %in% c(7,9), NA),
home.own = replace(home.own, home.own %in% c(2,3), 0))
smart.raw %>% count(RENTHOM1, home.own)
# A tibble: 5 x 3
RENTHOM1 home.own n
<dbl> <dbl> <int>
1 1 1 4296
2 2 0 1486
3 3 0 196
4 7 NA 9
5 9 NA 27
17.4.8.6 NUMHHOL2
and its cleanup to phones2
NUMHHOL2
, the Household Telephones variable, is the response to “Do you have more than one telephone number in your household? Do not include cell phones or numbers that are only used by a computer or fax machine.”
- 1 = Yes
- 2 = No
- 7 = Don’t know/Not sure
- 9 = Refused
- BLANK = Not asked or missing
smart.raw <- smart.raw %>%
mutate(phones2 = NUMHHOL2,
phones2 = replace(phones2, phones2 %in% c(7, 9), NA),
phones2 = replace(phones2, phones2 == 2, 0))
smart.raw %>% count(NUMHHOL2, phones2)
# A tibble: 5 x 3
NUMHHOL2 phones2 n
<dbl> <dbl> <int>
1 1 1 222
2 2 0 3151
3 7 NA 2
4 9 NA 8
5 NA NA 2631
17.4.8.7 NUMPHON2
and its cleanup to numphones
NUMPHON2
, the Residential Phones variable, is the response to “How many of these telephone numbers are residential numbers?”
- 1-6 = legitimate responses (6 = 6 or more)
- 7 = Don’t know/Not sure
- 9 = Refused
- BLANK = Not asked or missing
smart.raw <- smart.raw %>%
mutate(numphones = NUMPHON2,
numphones = replace(numphones, numphones > 6, NA))
smart.raw %>% count(NUMPHON2, numphones)
# A tibble: 9 x 3
NUMPHON2 numphones n
<dbl> <dbl> <int>
1 1 1 131
2 2 2 66
3 3 3 10
4 4 4 5
5 5 5 4
6 6 6 2
7 7 NA 2
8 9 NA 2
9 NA NA 5792
17.4.8.8 VETERAN3
and its cleanup to veteran
VETERAN3
, the Are You A Veteran variable, is the response to “Have you ever served on active duty in the United States Armed Forces, either in the regular military or in a National Guard or military reserve unit? (Active duty does not include training for the Reserves or National Guard, but DOES include activation, for example, for the Persian Gulf War.)”
- 1 = Yes
- 2 = No
- 7 = Don’t know/Not sure
- 9 = Refused
- BLANK = Not asked or missing
smart.raw <- smart.raw %>%
mutate(veteran = VETERAN3,
veteran = replace(veteran, veteran %in% c(7, 9), NA),
veteran = replace(veteran, veteran == 2, 0))
smart.raw %>% count(VETERAN3, veteran)
# A tibble: 5 x 3
VETERAN3 veteran n
<dbl> <dbl> <int>
1 1 1 736
2 2 0 5264
3 7 NA 1
4 9 NA 12
5 NA NA 1
17.4.8.9 EMPLOY1
and its cleanup to employment
EMPLOY1
, the Employment Status variable, is the response to “Are you currently … ?”
- 1 = Employed for wages
- 2 = Self-employed
- 3 = Out of work for 1 year or more
- 4 = Out of work for less than 1 year
- 5 = A homemaker
- 6 = A student
- 7 = Retired
- 8 = Unable to work
- 9 = Refused
- BLANK = Not asked or missing
We’ll just turn this into a factor, and move 9 to NA.
smart.raw <- smart.raw %>%
mutate(employment = fct_recode(factor(EMPLOY1),
"Employed_for_wages" = "1",
"Self-employed" = "2",
"Outofwork_1yearormore" = "3",
"Outofwork_lt1year" = "4",
"Homemaker" = "5",
"Student" = "6",
"Retired" = "7",
"Unable_to_work" = "8",
NULL = "9"))
smart.raw %>% count(EMPLOY1, employment)
Warning: Factor `employment` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 10 x 3
EMPLOY1 employment n
<dbl> <fct> <int>
1 1 Employed_for_wages 2614
2 2 Self-employed 378
3 3 Outofwork_1yearormore 103
4 4 Outofwork_lt1year 110
5 5 Homemaker 302
6 6 Student 144
7 7 Retired 1868
8 8 Unable_to_work 453
9 9 <NA> 41
10 NA <NA> 1
17.4.8.10 CHILDREN
and its cleanup to numkids
CHILDREN
, the Number of Children in Household variable, is the response to “How many children less than 18 years of age live in your household?”
- 1-87 = legitimate responses
- 88 = None
- 99 = Refused
- BLANK = Not asked or missing
smart.raw <- smart.raw %>%
mutate(numkids = CHILDREN,
numkids = replace(numkids, numkids == 99, NA),
numkids = replace(numkids, numkids == 88, 0))
smart.raw %>% count(CHILDREN, numkids) %>% tail()
# A tibble: 6 x 3
CHILDREN numkids n
<dbl> <dbl> <int>
1 7 7 2
2 8 8 2
3 9 9 4
4 88 0 4408
5 99 NA 39
6 NA NA 1
17.4.8.11 INCOME2
to incomegroup
The available levels of INCOME2
(Income Level) are responses to: “Is your annual household income from all sources …”
- 1 = Less than $10,000
- 2 = $10,000 to less than $15,000
- 3 = $15,000 to less than $20,000
- 4 = $20,000 to less than $25,000
- 5 = $25,000 to less than $35,000
- 6 = $35,000 to less than $50,000
- 7 = $50,000 to less than $75,000
- 8 = $75,000 or more
- 77 = Don’t know/Not sure
- 99 = Refused
- BLANK = Not asked or missing
We’ll just turn this into a factor, and move 77 and 99 to NA.
smart.raw <- smart.raw %>%
mutate(incomegroup = fct_recode(factor(`INCOME2`),
"0-9K" = "1",
"10-14K" = "2",
"15-19K" = "3",
"20-24K" = "4",
"25-34K" = "5",
"35-49K" = "6",
"50-74K" = "7",
"75K+" = "8",
NULL = "77",
NULL = "99"))
smart.raw %>% count(`INCOME2`, incomegroup)
Warning: Factor `incomegroup` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 11 x 3
INCOME2 incomegroup n
<dbl> <fct> <int>
1 1 0-9K 212
2 2 10-14K 235
3 3 15-19K 374
4 4 20-24K 482
5 5 25-34K 567
6 6 35-49K 723
7 7 50-74K 814
8 8 75K+ 1631
9 77 <NA> 387
10 99 <NA> 565
11 NA <NA> 24
17.4.8.12 INTERNET
and its cleanup to internet30
INTERNET
, the Internet use in the past 30 days variable, is the response to “Have you used the internet in the past 30 days?”
- 1 = Yes
- 2 = No
- 7 = Don’t know/Not sure
- 9 = Refused
- BLANK = Not asked or missing
smart.raw <- smart.raw %>%
mutate(internet30 = INTERNET,
internet30 = replace(internet30, internet30 %in% c(7, 9), NA),
internet30 = replace(internet30, internet30 == 2, 0))
smart.raw %>% count(INTERNET, internet30)
# A tibble: 5 x 3
INTERNET internet30 n
<dbl> <dbl> <int>
1 1 1 4895
2 2 0 1078
3 7 NA 6
4 9 NA 6
5 NA NA 29
17.4.8.13 WTKG3
is weight_kg
WTKG3
is computed by CDC, as the respondent’s weight in kilograms with two implied decimal places. We calculate the actual weight in kg, with the following:
smart.raw <- smart.raw %>%
mutate(weight_kg = WTKG3/100)
smart.raw %>% count(WTKG3, weight_kg)# %>% tail()
# A tibble: 260 x 3
WTKG3 weight_kg n
<dbl> <dbl> <int>
1 3084 30.8 1
2 3266 32.7 1
3 3447 34.5 1
4 3629 36.3 2
5 3674 36.7 1
6 3856 38.6 3
7 3946 39.5 1
8 4037 40.4 1
9 4082 40.8 3
10 4128 41.3 1
# ... with 250 more rows
17.4.8.14 HEIGHT3
is replaced with height_m
HEIGHT3
is strangely gathered to allow people to specify their height in either feet and inches or in meters and centimeters.
- 200-711 indicates height in feet (first digit) and inches (second two digits)
- 9000 - 9998 indicates height in meters (second digit) and centimeters (last two digits)
- 7777 = Don’t know/Not sure
- 9999 = Refused
Note that there is one impossible value of 575 in the data set. We’ll make that an NA, and we’ll also make NA any heights below 3 feet, or above 2.24 meters. Specifically, we calculate the actual height in meters, with the following:
smart.raw <- smart.raw %>%
mutate(height_m = case_when(
HEIGHT3 >= 300 & HEIGHT3 <= 511 ~ round((12*floor(HEIGHT3/100) + (HEIGHT3 - 100*floor(HEIGHT3/100)))*0.0254,2),
HEIGHT3 >= 600 & HEIGHT3 <= 711 ~ round((12*floor(HEIGHT3/100) + (HEIGHT3 - 100*floor(HEIGHT3/100)))*0.0254,2),
HEIGHT3 >= 9000 & HEIGHT3 <= 9224 ~ ((HEIGHT3 - 9000)/100)))
smart.raw %>% count(HEIGHT3, height_m) # %>% tail()
# A tibble: 35 x 3
HEIGHT3 height_m n
<dbl> <dbl> <int>
1 406 1.37 1
2 407 1.4 1
3 408 1.42 4
4 409 1.45 14
5 410 1.47 23
6 411 1.5 63
7 500 1.52 184
8 501 1.55 171
9 502 1.57 415
10 503 1.6 408
# ... with 25 more rows
17.4.8.15 bmi
is calculated from height_m
and weight_kg
We’ll calculate body-mass index from height and weight.
smart.raw <- smart.raw %>%
mutate(bmi = round(weight_kg/(height_m)^2,2))
smart.raw %>% count(height_m, weight_kg, bmi)# %>% tail()
# A tibble: 1,583 x 4
height_m weight_kg bmi n
<dbl> <dbl> <dbl> <int>
1 1.37 68.0 36.2 1
2 1.4 81.6 41.7 1
3 1.42 40.4 20.0 1
4 1.42 66.2 32.8 1
5 1.42 72.6 36.0 2
6 1.45 44.4 21.1 1
7 1.45 45.4 21.6 1
8 1.45 46.7 22.2 1
9 1.45 52.2 24.8 2
10 1.45 53.5 25.5 1
# ... with 1,573 more rows
17.4.8.16 bmigroup
is calculated from bmi
We’ll then divide the respondents into adult BMI categories, in the usual way.
- BMI < 18.5 indicates underweight
- BMI from 18.5 up to 25 indicates normal weight
- BMI from 25 up to 30 indicates overweight
- BMI of 30 and higher indicates obesity
smart.raw <- smart.raw %>%
mutate(bmigroup = factor(cut2(as.numeric(bmi),
cuts = c(18.5, 25.0, 30.0))))
smart.raw %>% count(bmigroup)
Warning: Factor `bmigroup` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 5 x 2
bmigroup n
<fct> <int>
1 [ 11.6, 18.5) 102
2 [ 18.5, 25.0) 1723
3 [ 25.0, 30.0) 2001
4 [ 30.0,110.9] 1766
5 <NA> 422
17.4.8.17 PREGNANT
and its cleanup to pregnant
PREGNANT
, the Pregnancy Status variable, is the response to “To your knowledge, are you now pregnant?”
- 1 = Yes
- 2 = No
- 7 = Don’t know/Not sure
- 9 = Refused
- BLANK = Not asked or missing (includes SEX = male)
smart.raw <- smart.raw %>%
mutate(pregnant = PREGNANT,
pregnant = replace(pregnant, pregnant %in% c(7, 9), NA),
pregnant = replace(pregnant, pregnant == 2, 0))
smart.raw %>% count(PREGNANT, pregnant)
# A tibble: 5 x 3
PREGNANT pregnant n
<dbl> <dbl> <int>
1 1 1 33
2 2 0 797
3 7 NA 5
4 9 NA 1
5 NA NA 5178
17.4.8.18 DEAF
and its cleanup to deaf
DEAF
, the Are you deaf or do you have serious difficulty hearing variable, is the response to “Are you deaf or do you have serious difficulty hearing?”
- 1 = Yes
- 2 = No
- 7 = Don’t know/Not sure
- 9 = Refused
- BLANK = Not asked or missing
smart.raw <- smart.raw %>%
mutate(deaf = DEAF,
deaf = replace(deaf, deaf %in% c(7, 9), NA),
deaf = replace(deaf, deaf == 2, 0))
smart.raw %>% count(DEAF, deaf)
# A tibble: 5 x 3
DEAF deaf n
<dbl> <dbl> <int>
1 1 1 507
2 2 0 5392
3 7 NA 9
4 9 NA 7
5 NA NA 99
17.4.8.19 BLIND
and its cleanup to blind
BLIND
, the Blind or Difficulty seeing variable, is the response to “Are you blind or do you have serious difficulty seeing, even when wearing glasses?”
- 1 = Yes
- 2 = No
- 7 = Don’t know/Not sure
- 9 = Refused
- BLANK = Not asked or missing
smart.raw <- smart.raw %>%
mutate(blind = BLIND,
blind = replace(blind, blind %in% c(7, 9), NA),
blind = replace(blind, blind == 2, 0))
smart.raw %>% count(BLIND, blind)
# A tibble: 5 x 3
BLIND blind n
<dbl> <dbl> <int>
1 1 1 290
2 2 0 5605
3 7 NA 9
4 9 NA 2
5 NA NA 108
17.4.8.20 DECIDE
and its cleanup to decide
DECIDE
, the Difficulty Concentrating or Remembering variable, is the response to “Because of a physical, mental, or emotional condition, do you have serious difficulty concentrating, remembering, or making decisions?”
- 1 = Yes
- 2 = No
- 7 = Don’t know/Not sure
- 9 = Refused
- BLANK = Not asked or missing
smart.raw <- smart.raw %>%
mutate(decide = DECIDE,
decide = replace(decide, decide %in% c(7, 9), NA),
decide = replace(decide, decide == 2, 0))
smart.raw %>% count(DECIDE, decide)
# A tibble: 5 x 3
DECIDE decide n
<dbl> <dbl> <int>
1 1 1 583
2 2 0 5293
3 7 NA 16
4 9 NA 5
5 NA NA 117
17.4.8.21 DIFFWALK
and its cleanup to diffwalk
DIFFWALK
, the Difficulty Walking or Climbing Stairs variable, is the response to “Do you have serious difficulty walking or climbing stairs?”
- 1 = Yes
- 2 = No
- 7 = Don’t know/Not sure
- 9 = Refused
- BLANK = Not asked or missing
smart.raw <- smart.raw %>%
mutate(diffwalk = DIFFWALK,
diffwalk = replace(diffwalk, diffwalk %in% c(7, 9), NA),
diffwalk = replace(diffwalk, diffwalk == 2, 0))
smart.raw %>% count(DIFFWALK, diffwalk)
# A tibble: 5 x 3
DIFFWALK diffwalk n
<dbl> <dbl> <int>
1 1 1 1036
2 2 0 4841
3 7 NA 11
4 9 NA 5
5 NA NA 121
17.4.8.22 DIFFDRES
and its cleanup to diffdress
DIFFDRES
, the Difficulty Dressing or Bathing variable, is the response to “Do you have difficulty dressing or bathing?”
- 1 = Yes
- 2 = No
- 7 = Don’t know/Not sure
- 9 = Refused
- BLANK = Not asked or missing
smart.raw <- smart.raw %>%
mutate(diffdress = DIFFDRES,
diffdress = replace(diffdress, diffdress %in% c(7, 9), NA),
diffdress = replace(diffdress, diffdress == 2, 0))
smart.raw %>% count(DIFFDRES, diffdress)
# A tibble: 5 x 3
DIFFDRES diffdress n
<dbl> <dbl> <int>
1 1 1 221
2 2 0 5658
3 7 NA 3
4 9 NA 2
5 NA NA 130
17.4.8.23 DIFFALON
and its cleanup to diffalone
DIFFALON
, the Difficulty Doing Errands Alone variable, is the response to “Because of a physical, mental, or emotional condition, do you have difficulty doing errands alone such as visiting a doctor’s office or shopping?”
- 1 = Yes
- 2 = No
- 7 = Don’t know/Not sure
- 9 = Refused
- BLANK = Not asked or missing
smart.raw <- smart.raw %>%
mutate(diffalone = DIFFALON,
diffalone = replace(diffalone, diffalone %in% c(7, 9), NA),
diffalone = replace(diffalone, diffalone == 2, 0))
smart.raw %>% count(DIFFALON, diffalone)
# A tibble: 5 x 3
DIFFALON diffalone n
<dbl> <dbl> <int>
1 1 1 437
2 2 0 5419
3 7 NA 13
4 9 NA 6
5 NA NA 139
17.4.9 Tobacco Use (5 items)
17.4.9.1 SMOKE100
and its cleanup to smoke100
SMOKE100
, the Smoked at Least 100 Cigarettes variable, is the response to “Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]”
- 1 = Yes
- 2 = No
- 7 = Don’t know/Not sure
- 9 = Refused
smart.raw <- smart.raw %>%
mutate(smoke100 = SMOKE100,
smoke100 = replace(smoke100, smoke100 %in% c(7, 9), NA),
smoke100 = replace(smoke100, smoke100 == 2, 0))
smart.raw %>% count(SMOKE100, smoke100)
# A tibble: 5 x 3
SMOKE100 smoke100 n
<dbl> <dbl> <int>
1 1 1 2567
2 2 0 3260
3 7 NA 30
4 9 NA 6
5 NA NA 151
17.4.9.2 SMOKDAY2
17.4.9.3 STOPSMK2
17.4.9.4 LASTSMK2
17.4.9.5 USENOW3
17.4.10 E-Cigarettes (2 items)
17.4.10.1 ECIGARET
and its cleanup to ecig.ever
ECIGARET
, the Ever used an e-cigarette variable, is the response to “Have you ever used an e-cigarette or other electronic vaping product, even just one time, in your entire life?”
- 1 = Yes
- 2 = No
- 7 = Don’t know/Not sure
- 9 = Refused
smart.raw <- smart.raw %>%
mutate(ecig.ever = ECIGARET,
ecig.ever = replace(ecig.ever, ecig.ever %in% c(7, 9), NA),
ecig.ever = replace(ecig.ever, ecig.ever == 2, 0))
smart.raw %>% count(ECIGARET, ecig.ever)
# A tibble: 5 x 3
ECIGARET ecig.ever n
<dbl> <dbl> <int>
1 1 1 1028
2 2 0 4818
3 7 NA 8
4 9 NA 1
5 NA NA 159
17.4.10.2 ECIGNOW
17.4.11 Alcohol Consumption (4 items)
17.4.11.1 ALCDAY5
and its cleanup to alcdays
ALCDAY5
, the Days in past 30 had alcoholic beverage variable, is the response to “During the past 30 days, how many days per week or per month did you have at least one drink of any alcoholic beverage such as beer, wine, a malt beverage or liquor?”
- 101-107 = # of days per week (101 = 1 day per week, 107 = 7 days per week)
- 201-230 = # of days in past 30 days (201 = 1 day in last 30, 230 = 30 days in last 30)
- 777 = Don’t know/Not sure
- 888 = No drinks in past 30 days
- 999 = Refused
- BLANK = Not asked or Missing
We’re going to convert this to a single numeric value. Answers in days per week (in the past 7 days) will be converted (after rounding) to days in the past 30. This is a little bit of a mess, really, but we can do it.
smart.raw <- smart.raw %>%
mutate(alcdays = as.numeric(ALCDAY5)) %>%
mutate(alcdays = replace(alcdays, alcdays == 888, 0),
alcdays = replace(alcdays, alcdays %in% c(777, 999), NA)) %>%
mutate(alcdays = case_when(ALCDAY5 > 199 & ALCDAY5 < 231 ~ ALCDAY5 - 200,
ALCDAY5 > 100 & ALCDAY5 < 108 ~ round((ALCDAY5 - 100)*30/7,0),
TRUE ~ alcdays))
smart.raw %>% count(ALCDAY5, alcdays)
# A tibble: 38 x 3
ALCDAY5 alcdays n
<dbl> <dbl> <int>
1 101 4 256
2 102 9 197
3 103 13 142
4 104 17 62
5 105 21 51
6 106 26 11
7 107 30 124
8 201 1 471
9 202 2 363
10 203 3 216
# ... with 28 more rows
17.4.11.2 AVEDRNK2
and its cleanup to avgdrinks
AVEDRNK2
, the Avg alcoholic drinks per day in past 30 variable, is the response to “One drink is equivalent to a 12-ounce beer, a 5-ounce glass of wine, or a drink with one shot of liquor. During the past 30 days, on the days when you drank, about how many drinks did you drink on the average? (A 40 ounce beer would count as 3 drinks, or a cocktail drink with 2 shots would count as 2 drinks.)”
- 1-76 = # of drinks per day
- 77 = Don’t know/Not sure
- 99 = Refused
- BLANK = Not asked or Missing (always happens when ALCDAY5 = 777, 888 or 999)
smart.raw <- smart.raw %>%
mutate(avgdrinks = AVEDRNK2,
avgdrinks = replace(avgdrinks, avgdrinks > 76, NA))
smart.raw %>% count(AVEDRNK2, avgdrinks) %>% tail()
# A tibble: 6 x 3
AVEDRNK2 avgdrinks n
<dbl> <dbl> <int>
1 40 40 2
2 50 50 2
3 75 75 1
4 77 NA 36
5 99 NA 10
6 NA NA 2981
17.4.11.3 DRNK3GE5
17.4.11.4 MAXDRNKS
17.4.12 Immunization (4 items)
17.4.12.1 FLUSHOT6
17.4.12.2 FLSHTMY2
17.4.12.3 PNEUVAC3
17.4.12.4 TETANUS
17.4.13 Falls (2 items)
17.4.13.1 FALL12MN
17.4.13.2 FALLINJ2
17.4.14 Seatbelt Use (1 item)
17.4.14.1 SEATBELT
17.4.15 Drinking and Driving (1 item)
17.4.15.1 DRNKDRI2
17.4.16 Breast and Cervical Cancer Screening (7 items)
17.4.16.1 HADMAM
17.4.16.2 HOWLONG
17.4.16.3 HADPAP2
17.4.16.4 LASTPAP2
17.4.16.5 HPVTEST
17.4.16.6 HPLSTTST
17.4.16.7 HADHYST2
17.4.17 Prostate Cancer Screening (6 items)
17.4.17.1 PCPSAAD2
17.4.17.2 PCPSADI1
17.4.17.3 PCPSARE1
17.4.17.4 PSATEST1
17.4.17.5 PSATIME
17.4.17.6 PCPSARS1
17.4.18 Colorectal Cancer Screening (5 items)
17.4.18.1 BLDSTOOL
17.4.18.2 LSTBLDS3
17.4.18.3 HADSIGM3
17.4.18.4 HADSGCO1
17.4.18.5 LASTSIG3
17.4.19 HIV/AIDS (3 items)
17.4.19.1 HIVTST6
17.4.19.2 HIVTSTD3
17.4.19.3 HIVRISK4
17.5 Creating Some Quantitative Variables from Thin Air
This section is purely for teaching purposes. I would never use the variables created in this section for research work.
17.5.1 age_imp
: Creating Age Data out of Thin Air
I want a quantitative age variable, so I’m going to create an imputed age_imp
value for each subject based on their agegroup
. For each age group, I will assume that each of the ages represented by a value in that age group will be equally likely, and will draw from the relevant uniform distribution to impute age.
set.seed(20180114)
smart.raw <- smart.raw %>%
mutate(age_low = as.numeric(str_sub(as.character(agegroup), 1, 2))) %>%
mutate(age_high = as.numeric(str_sub(as.character(agegroup), 4, 5))) %>%
rowwise() %>%
mutate(age_imp = ifelse(!is.na(agegroup),
round(runif(1, min = age_low, max = age_high),0),
NA))
smart.raw %>% count(agegroup, age_imp) %>% tail()
# A tibble: 6 x 3
agegroup age_imp n
<fct> <dbl> <int>
1 80-96 92 30
2 80-96 93 30
3 80-96 94 33
4 80-96 95 33
5 80-96 96 14
6 <NA> NA 76
17.6 Clean Data in the State of Ohio
There are five MMSAs associated with the state of Ohio, associated with Cincinnati, Cleveland, Columbus, Dayton and Toledo.
smart_ohio <- smart.raw %>%
filter(str_detect(MMSANAME, ', OH')) %>%
select(SEQNO, MMSANAME, hhadults,
genhealth, physhealth, menthealth,
poorhealth, healthplan, hasdoc, numdocs2,
costprob, checktime, exerany, sleephrs,
hx.mi, hx.chd, hx.stroke, hx.asthma,
now.asthma, hx.skinc, hx.otherc, hx.copd,
hx.arthr, hx.depress, hx.kidney,
hx.diabetes, diabetes3, diab.age,
denttime, remteeth, agegroup, female,
marital, educgroup, home.own, phones2,
numphones, veteran, employment, numkids,
incomegroup, internet30, weight_kg,
height_m, bmi, bmigroup, pregnant, deaf,
blind, decide, diffwalk, diffdress,
diffalone, smoke100, ecig.ever, alcdays,
avgdrinks, age_imp)
skim_with(numeric = list(hist = NULL))
skim(select(smart_ohio, -SEQNO))
Skim summary statistics
n obs: 6014
n variables: 57
-- Variable type:character --------------------------------------------------------
variable missing complete n min max empty n_unique
MMSANAME 0 6014 6014 41 51 0 5
-- Variable type:factor -----------------------------------------------------------
variable missing complete n n_unique
agegroup 76 5938 6014 13
bmigroup 422 5592 6014 4
checktime 73 5941 6014 5
denttime 61 5953 6014 5
diabetes3 10 6004 6014 4
educgroup 16 5998 6014 6
employment 42 5972 6014 8
genhealth 12 6002 6014 5
incomegroup 976 5038 6014 8
marital 50 5964 6014 6
numdocs2 24 5990 6014 3
remteeth 129 5885 6014 4
top_counts ordered
65-: 711, 55-: 666, 60-: 651, 50-: 504 FALSE
[ 2: 2001, [ 3: 1766, [ 1: 1723, NA: 422 FALSE
1_I: 4742, 2_1: 571, 3_2: 299, 4_5: 299 FALSE
1_I: 4221, 2_1: 638, 4_5: 617, 3_2: 453 FALSE
No-: 5037, Dia: 833, Pre: 85, Pre: 49 FALSE
Col: 2270, HS_: 1745, Som: 1611, Som: 287 FALSE
Emp: 2614, Ret: 1868, Una: 453, Sel: 378 FALSE
2_V: 2014, 3_G: 1895, 1_E: 997, 4_F: 798 FALSE
75K: 1631, NA: 976, 50-: 814, 35-: 723 FALSE
Mar: 2989, Nev: 1019, Div: 888, Wid: 813 FALSE
1_O: 4783, 3_Z: 744, 2_M: 463, NA: 24 FALSE
Non: 2993, 1-5: 1632, 6+_: 775, All: 485 FALSE
-- Variable type:numeric ----------------------------------------------------------
variable missing complete n mean sd p0 p25 p50 p75
age_imp 76 5938 6014 55.96 18.34 18 43 57 68
alcdays 217 5797 6014 4.76 8.02 0 0 1 5
avgdrinks 3027 2987 6014 2.28 2.96 1 1 2 2
blind 119 5895 6014 0.049 0.22 0 0 0 0
bmi 422 5592 6014 28.36 6.7 11.61 23.92 27.28 31.41
costprob 13 6001 6014 0.084 0.28 0 0 0 0
deaf 115 5899 6014 0.086 0.28 0 0 0 0
decide 138 5876 6014 0.099 0.3 0 0 0 0
diab.age 5216 798 6014 50.83 14.12 1 43 51 60
diffalone 158 5856 6014 0.075 0.26 0 0 0 0
diffdress 135 5879 6014 0.038 0.19 0 0 0 0
diffwalk 137 5877 6014 0.18 0.38 0 0 0 0
ecig.ever 168 5846 6014 0.18 0.38 0 0 0 0
exerany 13 6001 6014 0.74 0.44 0 0 1 1
female 1 6013 6014 0.57 0.49 0 0 1 1
hasdoc 24 5990 6014 0.88 0.33 0 1 1 1
healthplan 21 5993 6014 0.95 0.21 0 1 1 1
height_m 166 5848 6014 1.7 0.1 1.37 1.63 1.7 1.78
hhadults 3424 2590 6014 2.08 0.97 1 1 2 2
home.own 36 5978 6014 0.72 0.45 0 0 1 1
hx.arthr 26 5988 6014 0.38 0.49 0 0 0 1
hx.asthma 16 5998 6014 0.13 0.33 0 0 0 0
hx.chd 41 5973 6014 0.064 0.25 0 0 0 0
hx.copd 24 5990 6014 0.093 0.29 0 0 0 0
hx.depress 19 5995 6014 0.18 0.38 0 0 0 0
hx.diabetes 10 6004 6014 0.14 0.35 0 0 0 0
hx.kidney 22 5992 6014 0.039 0.19 0 0 0 0
hx.mi 25 5989 6014 0.061 0.24 0 0 0 0
hx.otherc 9 6005 6014 0.099 0.3 0 0 0 0
hx.skinc 15 5999 6014 0.087 0.28 0 0 0 0
hx.stroke 11 6003 6014 0.045 0.21 0 0 0 0
internet30 41 5973 6014 0.82 0.38 0 1 1 1
menthealth 74 5940 6014 3.55 7.94 0 0 0 2
now.asthma 5269 745 6014 0.74 0.44 0 0 1 1
numkids 40 5974 6014 0.51 1.02 0 0 0 1
numphones 5796 218 6014 1.58 0.94 1 1 1 2
phones2 2641 3373 6014 0.066 0.25 0 0 0 0
physhealth 94 5920 6014 4.24 8.78 0 0 0 3
poorhealth 3043 2971 6014 5.3 9.39 0 0 0 5
pregnant 5184 830 6014 0.04 0.2 0 0 0 0
sleephrs 55 5959 6014 6.97 1.51 1 6 7 8
smoke100 187 5827 6014 0.44 0.5 0 0 0 1
veteran 14 6000 6014 0.12 0.33 0 0 0 0
weight_kg 390 5624 6014 82.39 21.91 30.84 68.04 79.38 92.99
p100
96
30
75
1
110.88
1
1
1
85
1
1
1
1
1
1
1
1
2.03
10
1
1
1
1
1
1
1
1
1
1
1
1
1
30
1
9
6
1
30
30
1
23
1
1
249.48
write_csv(smart_ohio, "data/smart_ohio.csv")
17.7 Clean Cleveland-Elyria Data
17.7.1 Cleveland - Elyria Raw Data
The MMSANAME
variable is probably the simplest way for us to filter our data down to the MMSA we are interested in. Here, I’m using the str_detect
function to identify the values of MMSANAME
that contain the text “Cleveland”.
smart.cle.raw <- smart.raw %>%
filter(str_detect(MMSANAME, 'Cleveland'))
In the Cleveland-Elyria MSA, we have 1036 observations on the same 204 variables.
17.7.2 Clean Data - Larger
smart.cle <- smart.cle.raw %>%
select(SEQNO, MMSANAME, hhadults,
genhealth, physhealth, menthealth,
poorhealth, healthplan, hasdoc, numdocs2,
costprob, checktime, exerany, sleephrs,
hx.mi, hx.chd, hx.stroke, hx.asthma,
now.asthma, hx.skinc, hx.otherc, hx.copd,
hx.arthr, hx.depress, hx.kidney,
hx.diabetes, diabetes3, diab.age,
denttime, remteeth, agegroup, female,
marital, educgroup, home.own, phones2,
numphones, veteran, employment, numkids,
incomegroup, internet30, weight_kg,
height_m, bmi, bmigroup, pregnant, deaf,
blind, decide, diffwalk, diffdress,
diffalone, smoke100, ecig.ever, alcdays,
avgdrinks, age_imp)
skim_with(numeric = list(hist = NULL))
skim(select(smart.cle, -SEQNO))
Skim summary statistics
n obs: 1036
n variables: 57
-- Variable type:character --------------------------------------------------------
variable missing complete n min max empty n_unique
MMSANAME 0 1036 1036 51 51 0 1
-- Variable type:factor -----------------------------------------------------------
variable missing complete n n_unique
agegroup 12 1024 1036 13
bmigroup 84 952 1036 4
checktime 9 1027 1036 5
denttime 6 1030 1036 5
diabetes3 1 1035 1036 4
educgroup 2 1034 1036 6
employment 10 1026 1036 8
genhealth 3 1033 1036 5
incomegroup 170 866 1036 8
marital 10 1026 1036 6
numdocs2 5 1031 1036 3
remteeth 25 1011 1036 4
top_counts ordered
65-: 128, 60-: 122, 55-: 105, 80-: 103 FALSE
[ 2: 349, [ 1: 322, [ 3: 268, NA: 84 FALSE
1_I: 820, 2_1: 101, 3_2: 52, 4_5: 49 FALSE
1_I: 752, 2_1: 113, 4_5: 91, 3_2: 70 FALSE
No-: 884, Dia: 128, Pre: 15, Pre: 8 FALSE
Col: 385, HS_: 292, Som: 290, Som: 54 FALSE
Emp: 427, Ret: 345, Sel: 70, Una: 57 FALSE
2_V: 350, 3_G: 344, 1_E: 173, 4_F: 122 FALSE
75K: 258, NA: 170, 50-: 146, 35-: 115 FALSE
Mar: 476, Nev: 213, Wid: 157, Div: 143 FALSE
1_O: 795, 3_Z: 135, 2_M: 101, NA: 5 FALSE
Non: 510, 1-5: 293, 6+_: 127, All: 81 FALSE
-- Variable type:numeric ----------------------------------------------------------
variable missing complete n mean sd p0 p25 p50 p75
age_imp 12 1024 1036 57.3 18.86 18 45 60 70
alcdays 46 990 1036 4.65 8.05 0 0 1 4
avgdrinks 510 526 1036 2.17 2.44 1 1 2 2
blind 20 1016 1036 0.037 0.19 0 0 0 0
bmi 84 952 1036 27.89 6.47 12.71 23.7 26.68 30.53
costprob 3 1033 1036 0.067 0.25 0 0 0 0
deaf 18 1018 1036 0.071 0.26 0 0 0 0
decide 24 1012 1036 0.069 0.25 0 0 0 0
diab.age 912 124 1036 52.35 12.96 5 46.75 55 60
diffalone 33 1003 1036 0.07 0.25 0 0 0 0
diffdress 24 1012 1036 0.037 0.19 0 0 0 0
diffwalk 23 1013 1036 0.16 0.37 0 0 0 0
ecig.ever 36 1000 1036 0.15 0.35 0 0 0 0
exerany 3 1033 1036 0.76 0.43 0 1 1 1
female 0 1036 1036 0.6 0.49 0 0 1 1
hasdoc 5 1031 1036 0.87 0.34 0 1 1 1
healthplan 0 1036 1036 0.95 0.22 0 1 1 1
height_m 33 1003 1036 1.69 0.1 1.45 1.63 1.68 1.75
hhadults 586 450 1036 2.07 1.02 1 1 2 2
home.own 3 1033 1036 0.71 0.45 0 0 1 1
hx.arthr 6 1030 1036 0.37 0.48 0 0 0 1
hx.asthma 4 1032 1036 0.12 0.32 0 0 0 0
hx.chd 11 1025 1036 0.062 0.24 0 0 0 0
hx.copd 6 1030 1036 0.08 0.27 0 0 0 0
hx.depress 3 1033 1036 0.13 0.34 0 0 0 0
hx.diabetes 1 1035 1036 0.12 0.33 0 0 0 0
hx.kidney 4 1032 1036 0.052 0.22 0 0 0 0
hx.mi 5 1031 1036 0.051 0.22 0 0 0 0
hx.otherc 2 1034 1036 0.1 0.3 0 0 0 0
hx.skinc 3 1033 1036 0.075 0.26 0 0 0 0
hx.stroke 4 1032 1036 0.042 0.2 0 0 0 0
internet30 6 1030 1036 0.81 0.39 0 1 1 1
menthealth 11 1025 1036 2.72 6.82 0 0 0 2
now.asthma 918 118 1036 0.69 0.46 0 0 1 1
numkids 5 1031 1036 0.41 0.92 0 0 0 0
numphones 994 42 1036 1.74 0.99 1 1 2 2
phones2 460 576 1036 0.075 0.26 0 0 0 0
physhealth 17 1019 1036 3.97 8.67 0 0 0 2
poorhealth 543 493 1036 4.07 8.09 0 0 0 3
pregnant 905 131 1036 0.053 0.23 0 0 0 0
sleephrs 8 1028 1036 7.02 1.53 1 6 7 8
smoke100 37 999 1036 0.45 0.5 0 0 0 1
veteran 3 1033 1036 0.098 0.3 0 0 0 0
weight_kg 77 959 1036 80.4 21.36 36.29 65.77 77.11 90.72
p100
96
30
40
1
66.06
1
1
1
83
1
1
1
1
1
1
1
1
1.98
6
1
1
1
1
1
1
1
1
1
1
1
1
1
30
1
6
5
1
30
30
1
20
1
1
204.12
write_csv(smart.cle, "data/smartcle.csv")
17.7.3 Creation of the smartcle1.csv
data
smart.cle1 <- smart.raw %>%
filter(str_detect(MMSANAME, 'Cleveland')) %>%
select(SEQNO, physhealth, menthealth,
poorhealth, genhealth, bmi, female,
internet30, exerany, sleephrs, alcdays)
# write_csv(smart.cle1, "data/smartcle1.csv")
## this last line commented out because the file was
## created earlier in these notes