Chapter 17 Cleaning the BRFSS SMART Data

The Centers for Disease Control analyzes Behavioral Risk Factor Surveillance System (BRFSS) survey data for specific metropolitan and micropolitan statistical areas (MMSAs) in a program called the Selected Metropolitan/Micropolitan Area Risk Trends of BRFSS (SMART BRFSS.)

In this work, we will focus on data from the 2016 SMART, and in particular on data from the state of Ohio, and from the Cleveland-Elyria, OH, Metropolitan Statistical Area. The purpose of this survey is to provide localized health information that can help public health practitioners identify local emerging health problems, plan and evaluate local responses, and efficiently allocate resources to specific needs.

In this chapter, I describe all of the cleaning I have done (so far) on the BRFSS SMART data, and break it out into national, statewide, and local samples.

The data files produced by this chapter include:

  • smart_ohio.csv which includes data on 57 variables for 6014 subjects in five MMSAs in Ohio.
  • smartcle.csv which includes data on those same 57 variables for the 1036 subjects in the Cleveland-Elyria-Lorain OH MMSA.
  • smartcle1.csv which includes a subset of 11 variables for those same 1036 subjects.

SMART and BRFSS

The Centers for Disease Control analyzes Behavioral Risk Factor Surveillance System (BRFSS) survey data for specific metropolitan and micropolitan statistical areas (MMSAs) in a program called the Selected Metropolitan/Micropolitan Area Risk Trends of BRFSS (SMART BRFSS.)

In this work, we will focus on data from the 2016 SMART, and in particular on data from the Cleveland-Elyria, OH, Metropolitan Statistical Area. The purpose of this survey is to provide localized health information that can help public health practitioners identify local emerging health problems, plan and evaluate local responses, and efficiently allocate resources to specific needs.

17.1 Key resources

17.2 Ingesting The Raw Data

library(skimr) 
library(broom); library(modelr)
library(haven); library(tidyverse)

17.3 The National Data

To create the data files we’ll use, I used the read_xpt function from the haven package to bring in the SAS XPT data file that is provided by CDC. The codes I used (but won’t use in these Notes) were:

smart.raw <- read_xpt("data/MMSA2016.xpt") %>% tbl_df

dim(smart.raw)

This gives the nationwide data, which has 249,011 rows and 146 columns.

But for the purposes of putting these Notes online, I needed to crank down the sample size enormously. To that end, I created a new data file, called smartorig.csv, which I developed by

  • importing the MMSA2016.xpt file
  • filtering away all observations except those from the state of Ohio
  • saving it

So, for purposes of these notes, our complete data set is actually coming from smartorg.csv and consists only of the 6,014 observations from Ohio.

smart.raw <- read_csv("data/smartorig.csv") %>% tbl_df
Parsed with column specification:
cols(
  .default = col_double(),
  MMSANAME = col_character()
)
See spec(...) for full column specifications.
dim(smart.raw)
[1] 6014  146

17.4 Cleaning the BRFSS Data

My source for learning about the codes in these variables is the enormous Codebook for the 2016 BRFSS Survey PDF which identifies the variables by name for us. The first two variables in the raw file are

  • DISPCODE which is 1100 if the subject completed the interview, and 1200 if they partially completed the interview.
smart.raw %>% count(DISPCODE)
# A tibble: 2 x 2
  DISPCODE     n
     <dbl> <int>
1     1100  5162
2     1200   852
  • HHADULT which is the response to “How many members of your household, including yourself, are 18 years of age or older?”
    • The response is 1-76, with special values 77 for Don’t Know/Not Sure and 99 for refused, with BLANK for missing or not asked.
    • So we should change all numerical values above 76 to NA for our analyses (the blanks are already regarded as NAs by R in the ingestion process.)
smart.raw <- smart.raw %>% 
    mutate(hhadults = HHADULT,
           hhadults = replace(hhadults, hhadults > 76, NA))

smart.raw %>% count(HHADULT, hhadults) %>% tail()
# A tibble: 6 x 3
  HHADULT hhadults     n
    <dbl>    <dbl> <int>
1       6        6     9
2       7        7     3
3      10       10     2
4      77       NA     4
5      99       NA     1
6      NA       NA  3419

17.4.1 Health Status (1 item)

The next variable describes relate to the subject’s health status.

17.4.1.1 GENHLTH and its cleanup to genhealth

  • GENHLTH, the General Health variable, which is the response to “Would you say that in general your health is …”
    • 1 = Excellent
    • 2 = Very good
    • 3 = Good
    • 4 = Fair
    • 5 = Poor
    • 7 = Don’t know/Not sure
    • 9 = Refused
    • BLANK = Not asked or missing

To clean up the GENHLTH data into a new variable called genhealth we’ll need to - convince R that the 7 and 9 values are in fact best interpreted as NA, - and perhaps change the variable to a factor and incorporate the names into the levels.

smart.raw <- smart.raw %>% 
    mutate(genhealth = fct_recode(factor(GENHLTH), 
                                "1_Excellent" = "1",
                                "2_VeryGood" = "2",
                                "3_Good" = "3",
                                "4_Fair" = "4", 
                                "5_Poor" = "5",
                                NULL = "7",
                                NULL = "9"))

smart.raw %>% count(GENHLTH, genhealth) %>% tail()
Warning: Factor `genhealth` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 6 x 3
  GENHLTH genhealth      n
    <dbl> <fct>      <int>
1       2 2_VeryGood  2014
2       3 3_Good      1895
3       4 4_Fair       798
4       5 5_Poor       298
5       7 <NA>           8
6       9 <NA>           4

17.4.3 Health Care Access (4 items)

The next four variables relate to the subject’s health care access.

17.4.3.1 HLTHPLN1 and its cleanup to healthplan

HLTHPLN1, the Have any health care coverage variable, is the response to “Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused

To clean up the HLTHPLN1 data into a new variable called healthplan we’ll
- convince R that the 7 and 9 values are in fact best interpreted as NA, - and turn it into an indicator variable, e.g., we will leave the variable as numeric, but change the values to 1 = Yes and 0 = No.

smart.raw <- smart.raw %>% 
    mutate(healthplan = HLTHPLN1,
           healthplan = replace(healthplan, healthplan %in% c(7, 9), NA),
           healthplan = replace(healthplan, healthplan == 2, 0))

smart.raw %>% count(HLTHPLN1, healthplan)
# A tibble: 4 x 3
  HLTHPLN1 healthplan     n
     <dbl>      <dbl> <int>
1        1          1  5722
2        2          0   271
3        7         NA    11
4        9         NA    10

17.4.3.2 PERSDOC2 and its cleanup to hasdoc and to numdocs2

PERSDOC2, the Multiple Health Care Professionals variable, is the response to “Do you have one person you think of as your personal doctor or health care provider?” where if the response is “No”, the survey then asks “Is there more than one or is there no person who you think of as your personal doctor or health care provider?”

  • 1 = Yes, only one
  • 2 = More than one
  • 3 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing

Let’s build two variables here. We’ll create a hasdoc variable which is simply Yes or No, and a factor variable called numdocs2 which keeps the first three levels in place.

To clean up the PERSDOC2 data into a new variable called hasdoc we’ll
- convince R that the 7 and 9 values are in fact best interpreted as NA, - and turn it into an indicator variable, e.g., we will leave the variable as numeric, but change the values to 1 = Yes and 0 = No, so that the original 1 and 2 become 1, and the original 3 becomes 0.

To clean up the PERSDOC2 data into a new variable called numdocs2 we’ll - convince R that the 7 and 9 values are in fact best interpreted as NA, - relabel options 1, 2 and 3 while turning the variable into a factor.

smart.raw <- smart.raw %>% 
    mutate(hasdoc = PERSDOC2,
           hasdoc = replace(hasdoc, hasdoc %in% c(7, 9), NA),
           hasdoc = replace(hasdoc, hasdoc %in% c(1, 2), 1),
           hasdoc = replace(hasdoc, hasdoc == 3, 0)) %>%
    mutate(numdocs2 = fct_recode(factor(PERSDOC2),
                                 "1_One" = "1",
                                 "2_MoreThanOne" = "2",
                                 "3_Zero" = "3",
                                 NULL = "7",
                                 NULL = "9"))

smart.raw %>% count(PERSDOC2, hasdoc, numdocs2)
Warning: Factor `numdocs2` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 5 x 4
  PERSDOC2 hasdoc numdocs2          n
     <dbl>  <dbl> <fct>         <int>
1        1      1 1_One          4783
2        2      1 2_MoreThanOne   463
3        3      0 3_Zero          744
4        7     NA <NA>             14
5        9     NA <NA>             10

17.4.3.3 MEDCOST and its cleanup to costprob

MEDCOST, the Could Not See Doctor Because of Cost variable, is the response to “Was there a time in the past 12 months when you needed to see a doctor but could not because of cost?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing

This is just like HLTHPLAN.

smart.raw <- smart.raw %>% 
    mutate(costprob = MEDCOST,
           costprob = replace(costprob, costprob %in% c(7, 9), NA),
           costprob = replace(costprob, costprob == 2, 0))

smart.raw %>% count(MEDCOST, costprob)
# A tibble: 4 x 3
  MEDCOST costprob     n
    <dbl>    <dbl> <int>
1       1        1   503
2       2        0  5498
3       7       NA    11
4       9       NA     2

17.4.3.4 CHECKUP1 and its cleanup to checktime

CHECKUP1, the Length of time since last routine checkup variable, is the response to “About how long has it been since you last visited a doctor for a routine checkup? [A routine checkup is a general physical exam, not an exam for a specific injury, illness, or condition.]”

  • 1 = Within past year (anytime less than 12 months ago)
  • 2 = Within past 2 years (1 year but less than 2 years ago)
  • 3 = Within past 5 years (2 years but less than 5 years ago)
  • 4 = 5 or more years ago
  • 7 = Don’t know/Not sure
  • 8 = Never
  • 9 = Refused
  • BLANK = Not asked or missing

To clean up the CHECKUP1 data into a new variable called checktime we’ll - convince R that the 7 and 9 values are in fact best interpreted as NA, - relabel options 1, 2, 3, 4 and 8 while turning the variable into a factor.

smart.raw <- smart.raw %>% 
    mutate(checktime = fct_recode(factor(CHECKUP1),
                                 "1_In-past-year" = "1",
                                 "2_1-to-2-years" = "2",
                                 "3_2-to-5-years" = "3",
                                 "4_5_plus_years" = "4",
                                 "8_Never" = "8",
                                 NULL = "7",
                                 NULL = "9"))

smart.raw %>% count(CHECKUP1, checktime)
Warning: Factor `checktime` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 7 x 3
  CHECKUP1 checktime          n
     <dbl> <fct>          <int>
1        1 1_In-past-year  4742
2        2 2_1-to-2-years   571
3        3 3_2-to-5-years   299
4        4 4_5_plus_years   299
5        7 <NA>              66
6        8 8_Never           30
7        9 <NA>               7

17.4.4 Exercise (1 item)

17.4.4.1 EXERANY2 and its cleanup to exerany

EXERANY2, the Exercise in Past 30 Days variable, is the response to “During the past month, other than your regular job, did you participate in any physical activities or exercises such as running, calisthenics, golf, gardening, or walking for exercise?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing

This is just like HLTHPLAN.

smart.raw <- smart.raw %>% 
    mutate(exerany = EXERANY2,
           exerany = replace(exerany, exerany %in% c(7, 9), NA),
           exerany = replace(exerany, exerany == 2, 0))

smart.raw %>% count(EXERANY2, exerany)
# A tibble: 4 x 3
  EXERANY2 exerany     n
     <dbl>   <dbl> <int>
1        1       1  4464
2        2       0  1537
3        7      NA    10
4        9      NA     3

17.4.5 Inadequate Sleep (1 item)

17.4.5.1 SLEPTIM1 and its cleanup to sleephrs

SLEPTIM1, the How Much Time Do You Sleep variable, is the response to “I would like to ask you about your sleep pattern. On average, how many hours of sleep do you get in a 24-hour period?”

  • The response is 1-24, with special values 77 for Don’t Know/Not Sure and 99 for refused, with BLANK for missing or not asked.
  • So we should change all numerical values above 24 to NA for our analyses (the blanks are already regarded as NAs by R in the ingestion process.)
smart.raw <- smart.raw %>% 
    mutate(sleephrs = SLEPTIM1,
           sleephrs = replace(sleephrs, sleephrs > 24, NA))

smart.raw %>% count(SLEPTIM1, sleephrs) %>% tail()
# A tibble: 6 x 3
  SLEPTIM1 sleephrs     n
     <dbl>    <dbl> <int>
1       18       18     2
2       20       20     4
3       22       22     1
4       23       23     1
5       77       NA    47
6       99       NA     8

17.4.6 Chronic Health Conditions (13 items)

17.4.6.1 Self-reported diagnosis history (11 items)

The next few variables describe whether or not the subject meets a particular standard, and are all coded in the raw data the same way:

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing

and we’ll recode them all to 1 = Yes, 0 = No, otherwise NA, as we’ve done previously.

The questions are all started with “Has a doctor, nurse, or other health professional ever told you that you had any of the following? For each, tell me Yes, No, or you’re Not sure.”

Original Revised Details
CVDINFR4 hx.mi (Ever told) you had a heart attack, also called a myocardial infarction?
CVDCRHD4 hx.chd (Ever told) you had angina or coronary heart disease?
CVDSTRK3 hx.stroke (Ever told) you had a stroke?
ASTHMA3 hx.asthma (Ever told) you had asthma?
ASTHNOW now.asthma Do you still have asthma? (only asked of those with Yes in ASTHMA3)
CHCSCNCR hx.skinc (Ever told) you had skin cancer?
CHCOCNCR hx.otherc (Ever told) you had any other types of cancer?
CHCCOPD1 hx.copd (Ever told) you have Chronic Obstructive Pulmonary Disease or COPD, emphysema or chronic bronchitis?
HAVARTH3 hx.arthr (Ever told) you have some form of arthritis, rheumatoid arthritis, gout, lupus, or fibromyalgia? (Arthritis diagnoses include: rheumatism, polymyalgia rheumatica; osteoarthritis (not osteporosis); tendonitis, bursitis, bunion, tennis elbow; carpal tunnel syndrome, tarsal tunnel syndrome; joint infection, etc.)
ADDEPEV2 hx.depress (Ever told) you that you have a depressive disorder, including depression, major depression, dysthymia, or minor depression?
CHCKIDNY hx.kidney (Ever told) you have kidney disease? Do NOT include kidney stones, bladder infection or incontinence.
smart.raw <- smart.raw %>% 
    mutate(hx.mi = CVDINFR4,
           hx.mi = replace(hx.mi, hx.mi %in% c(7, 9), NA),
           hx.mi = replace(hx.mi, hx.mi == 2, 0),
           hx.chd = CVDCRHD4,
           hx.chd = replace(hx.chd, hx.chd %in% c(7, 9), NA),
           hx.chd = replace(hx.chd, hx.chd == 2, 0),
           hx.stroke = CVDSTRK3,
           hx.stroke = replace(hx.stroke, hx.stroke %in% c(7, 9), NA),
           hx.stroke = replace(hx.stroke, hx.stroke == 2, 0),
           hx.asthma = ASTHMA3,
           hx.asthma = replace(hx.asthma, hx.asthma %in% c(7, 9), NA),
           hx.asthma = replace(hx.asthma, hx.asthma == 2, 0),
           now.asthma = ASTHNOW,
           now.asthma = replace(now.asthma, now.asthma %in% c(7, 9), NA),
           now.asthma = replace(now.asthma, now.asthma == 2, 0),
           hx.skinc = CHCSCNCR,
           hx.skinc = replace(hx.skinc, hx.skinc %in% c(7, 9), NA),
           hx.skinc = replace(hx.skinc, hx.skinc == 2, 0),
           hx.otherc = CHCOCNCR,
           hx.otherc = replace(hx.otherc, hx.otherc %in% c(7, 9), NA),
           hx.otherc = replace(hx.otherc, hx.otherc == 2, 0),
           hx.copd = CHCCOPD1,
           hx.copd = replace(hx.copd, hx.copd %in% c(7, 9), NA),
           hx.copd = replace(hx.copd, hx.copd == 2, 0),
           hx.arthr = HAVARTH3,
           hx.arthr = replace(hx.arthr, hx.arthr %in% c(7, 9), NA),
           hx.arthr = replace(hx.arthr, hx.arthr == 2, 0),
           hx.depress = ADDEPEV2,
           hx.depress = replace(hx.depress, hx.depress %in% c(7, 9), NA),
           hx.depress = replace(hx.depress, hx.depress == 2, 0),
           hx.kidney = CHCKIDNY,
           hx.kidney = replace(hx.kidney, hx.kidney %in% c(7, 9), NA),
           hx.kidney = replace(hx.kidney, hx.kidney == 2, 0))

We definitely should have written a function to do that, of course.

17.4.6.2 DIABETE3 and its cleanup to hx.diabetes and diabetes3

DIABETE3, the (Ever told) you have diabetes variable, is the response to “(Ever told) you have diabetes (If Yes and respondent is female, ask Was this only when you were pregnant?. If Respondent says pre-diabetes or borderline diabetes, use response code 4.)”

  • 1 = Yes
  • 2 = Yes, but female told only during pregnancy
  • 3 = No
  • 4 = No, pre-diabetes or borderline diabetes
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing

I’ll create one variable called hx.diabetes which is 1 if DIABETE3 = 1, and 0 otherwise, with appropriate NAs, like our other variables. Then I’ll create diabetes3 to include all of this information in a factor, but again recode the missing values properly.

smart.raw <- smart.raw %>% 
    mutate(hx.diabetes = DIABETE3,
           hx.diabetes = replace(hx.diabetes, hx.diabetes %in% c(7, 9), NA),
           hx.diabetes = replace(hx.diabetes, hx.diabetes %in% 2:4, 0),
           diabetes3 = fct_recode(factor(DIABETE3),
                                  "Diabetes" = "1",
                                  "Pregnancy-Induced" = "2",
                                  "No-Diabetes" = "3",
                                  "Pre-Diabetes" = "4",
                                  NULL = "7",
                                  NULL = "9"))

smart.raw %>% count(DIABETE3, hx.diabetes, diabetes3)
Warning: Factor `diabetes3` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 6 x 4
  DIABETE3 hx.diabetes diabetes3             n
     <dbl>       <dbl> <fct>             <int>
1        1           1 Diabetes            833
2        2           0 Pregnancy-Induced    49
3        3           0 No-Diabetes        5037
4        4           0 Pre-Diabetes         85
5        7          NA <NA>                  8
6        9          NA <NA>                  2

17.4.6.3 DIABAGE2 and its cleanup to diab.age

DIABAGE2, the Age When Told Diabetic variable, is the response to “How old were you when you were told you have diabetes?” It is asked only of people with DIABETE3 = 1 (Yes).

  • The response is 1-97, with special values 98 for Don’t Know/Not Sure and 99 for refused, with BLANK for missing or not asked. People 97 years of age and above were listed as 97.
smart.raw <- smart.raw %>% 
    mutate(diab.age = DIABAGE2,
           diab.age = replace(diab.age, diab.age > 97, NA))

smart.raw %>% count(DIABAGE2, diab.age) %>% tail()
# A tibble: 6 x 3
  DIABAGE2 diab.age     n
     <dbl>    <dbl> <int>
1       81       81     2
2       83       83     1
3       85       85     2
4       98       NA    32
5       99       NA     3
6       NA       NA  5181

17.4.7 Oral Health (2 items)

17.4.7.1 LASTDEN3 and its cleanup to denttime

LASTDEN3, the Last Visited Dentist or Dental Clinic variable, is the response to “How long has it been since you last visited a dentist or a dental clinic for any reason? Include visits to dental specialists, such as orthodontists.”

  • 1 = Within past year (anytime less than 12 months ago)
  • 2 = Within past 2 years (1 year but less than 2 years ago)
  • 3 = Within past 5 years (2 years but less than 5 years ago)
  • 4 = 5 or more years ago
  • 7 = Don’t know/Not sure
  • 8 = Never
  • 9 = Refused
  • BLANK = Not asked or missing

We’ll again use the approach from CHECKUP1

smart.raw <- smart.raw %>% 
    mutate(denttime = fct_recode(factor(LASTDEN3),
                                 "1_In-past-year" = "1",
                                 "2_1-to-2-years" = "2",
                                 "3_2-to-5-years" = "3",
                                 "4_5_plus_years" = "4",
                                 "8_Never" = "8",
                                 NULL = "7",
                                 NULL = "9"))

smart.raw %>% count(LASTDEN3, denttime)
Warning: Factor `denttime` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 7 x 3
  LASTDEN3 denttime           n
     <dbl> <fct>          <int>
1        1 1_In-past-year  4221
2        2 2_1-to-2-years   638
3        3 3_2-to-5-years   453
4        4 4_5_plus_years   617
5        7 <NA>              53
6        8 8_Never           24
7        9 <NA>               8

17.4.7.2 RMVTETH3 and its cleanup to remteeth

RMVTETH3, the Number of Permanent Teeth Removed variable, is the response to “How many of your permanent teeth have been removed because of tooth decay or gum disease? Include teeth lost to infection, but do not include teeth lost for other reasons, such as injury or orthodontics. (If wisdom teeth are removed because of tooth decay or gum disease, they should be included in the count for lost teeth).”

  • 1 = 1 to 5
  • 2 = 6 or more but not all
  • 3 = All
  • 7 = Don’t know/Not sure
  • 8 = None
  • 9 = Refused
  • BLANK = Not asked or missing

We’ll recode this to remteeth which will be 0 for None, 1 for 1-5, 2 for 6+ but not all, and 3 for All.

smart.raw <- smart.raw %>% 
    mutate(remteeth = fct_recode(factor(RMVTETH3),
                                 "None" = "8",
                                 "1-5" = "1",
                                 "6+_butnotAll" = "2",
                                 "All" = "3",
                                 NULL = "7",
                                 NULL = "9"))

smart.raw %>% count(RMVTETH3, remteeth)
Warning: Factor `remteeth` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 6 x 3
  RMVTETH3 remteeth         n
     <dbl> <fct>        <int>
1        1 1-5           1632
2        2 6+_butnotAll   775
3        3 All            485
4        7 <NA>           118
5        8 None          2993
6        9 <NA>            11

17.4.8 Demographics (23 items)

17.4.8.1 _AGEG5YR, which we’ll edit into agegroup

The _AGEG5YR variable is a calculated variable (by CDC) obtained from the subject’s age. Since the age data are not available, we instead get these groupings, which we’ll rearrange into the agegroup factor.

_AGEG5YR Age range agegroup
1 18 <= AGE <= 24 18-24
2 25 <= AGE <= 29 25-29
3 30 <= AGE <= 34 30-34
4 35 <= AGE <= 39 35-39
5 40 <= AGE <= 44 40-44
6 45 <= AGE <= 49 45-49
7 50 <= AGE <= 54 50-54
8 55 <= AGE <= 59 55-59
9 60 <= AGE <= 64 60-64
10 65 <= AGE <= 69 65-69
11 70 <= AGE <= 74 70-74
12 75 <= AGE <= 79 75-79
13 AGE >= 80 80plus
14 Don’t Know, Refused or Missing NA
smart.raw <- smart.raw %>% 
    mutate(agegroup = fct_recode(factor(`_AGEG5YR`),
                                "18-24" = "1",
                                "25-29" = "2",
                                "30-34" = "3",
                                "35-39" = "4",
                                "40-44" = "5",
                                "45-49" = "6",
                                "50-54" = "7",
                                "55-59" = "8",
                                "60-64" = "9",
                                "65-69" = "10",
                                "70-74" = "11",
                                "75-79" = "12",
                                "80-96" = "13",
                                NULL = "14"))

smart.raw %>% count(`_AGEG5YR`, agegroup)
Warning: Factor `agegroup` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 14 x 3
   `_AGEG5YR` agegroup     n
        <dbl> <fct>    <int>
 1          1 18-24      348
 2          2 25-29      280
 3          3 30-34      322
 4          4 35-39      298
 5          5 40-44      346
 6          6 45-49      437
 7          7 50-54      504
 8          8 55-59      666
 9          9 60-64      651
10         10 65-69      711
11         11 70-74      482
12         12 75-79      395
13         13 80-96      498
14         14 <NA>        76

17.4.8.2 SEX recoded to female

The available levels of SEX are:

  • 1 = Male
  • 2 = Female
  • 9 = Refused

We’ll recode that to female = 1 for Female, 0 Male, otherwise NA. Note the trick here is to subtract one from the coded SEX to get the desired female, but this requires that we move 9 to NA, rather than 9.

smart.raw <- smart.raw %>% 
    mutate(female = SEX - 1,
           female = replace(female, female == 8, NA))

smart.raw %>% count(SEX, female)
# A tibble: 3 x 3
    SEX female     n
  <dbl>  <dbl> <int>
1     1      0  2559
2     2      1  3454
3     9     NA     1

17.4.8.3 MARITAL status, revised to marital

The available levels of MARITAL are:

  • 1 = Married
  • 2 = Divorced
  • 3 = Widowed
  • 4 = Separated
  • 5 = Never married
  • 6 = A member of an unmarried couple
  • 9 = Refused
  • BLANK = Not asked or missing

We’ll just turn this into a factor, and move 9 to NA.

smart.raw <- smart.raw %>% 
    mutate(marital = fct_recode(factor(MARITAL),
                                "Married" = "1",
                                "Divorced" = "2",
                                "Widowed" = "3",
                                "Separated" = "4",
                                "Never_Married" = "5",
                                "Unmarried_Couple" = "6",
                                NULL = "9"))

smart.raw %>% count(MARITAL, marital)
Warning: Factor `marital` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 7 x 3
  MARITAL marital              n
    <dbl> <fct>            <int>
1       1 Married           2989
2       2 Divorced           888
3       3 Widowed            813
4       4 Separated          107
5       5 Never_Married     1019
6       6 Unmarried_Couple   148
7       9 <NA>                50

17.4.8.4 EDUCA recoded to educgroup

The available levels of EDUCA (Education Level) are responses to: “What is the highest grade or year of school you completed?”

  • 1 = Never attended school or only kindergarten
  • 2 = Grades 1 through 8 (Elementary)
  • 3 = Grades 9 through 11 (Some high school)
  • 4 = Grade 12 or GED (High school graduate)
  • 5 = College 1 year to 3 years (Some college or technical school)
  • 6 = College 4 years or more (College graduate)
  • 9 = Refused
  • BLANK = Not asked or missing

We’ll just turn this into a factor, and move 9 to NA.

smart.raw <- smart.raw %>% 
    mutate(educgroup = fct_recode(factor(EDUCA),
                                "Kindergarten" = "1",
                                "Elementary" = "2",
                                "Some_HS" = "3",
                                "HS_Grad" = "4",
                                "Some_College" = "5",
                                "College_Grad" = "6",
                                NULL = "9"))

smart.raw %>% count(EDUCA, educgroup)
Warning: Factor `educgroup` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 7 x 3
  EDUCA educgroup        n
  <dbl> <fct>        <int>
1     1 Kindergarten     5
2     2 Elementary      80
3     3 Some_HS        287
4     4 HS_Grad       1745
5     5 Some_College  1611
6     6 College_Grad  2270
7     9 <NA>            16

17.4.8.5 RENTHOM1 recoded to home.own

The available levels of RENTHOM1 (Own or Rent Home) are responses to: “Do you own or rent your home? (Home is defined as the place where you live most of the time/the majority of the year.)”

  • 1 = Own
  • 2 = Rent
  • 3 = Other Arrangement
  • 7 = Don’t know/Not Sure
  • 9 = Refused
  • BLANK = Not asked or missing

We’ll recode as home.own = 1 if they own their home, and 0 otherwise, and dealing with missingness properly.

smart.raw <- smart.raw %>% 
    mutate(home.own = RENTHOM1,
           home.own = replace(home.own, home.own %in% c(7,9), NA),
           home.own = replace(home.own, home.own %in% c(2,3), 0))

smart.raw %>% count(RENTHOM1, home.own)
# A tibble: 5 x 3
  RENTHOM1 home.own     n
     <dbl>    <dbl> <int>
1        1        1  4296
2        2        0  1486
3        3        0   196
4        7       NA     9
5        9       NA    27

17.4.8.6 NUMHHOL2 and its cleanup to phones2

NUMHHOL2, the Household Telephones variable, is the response to “Do you have more than one telephone number in your household? Do not include cell phones or numbers that are only used by a computer or fax machine.”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing
smart.raw <- smart.raw %>% 
    mutate(phones2 = NUMHHOL2,
           phones2 = replace(phones2, phones2 %in% c(7, 9), NA),
           phones2 = replace(phones2, phones2 == 2, 0))

smart.raw %>% count(NUMHHOL2, phones2)
# A tibble: 5 x 3
  NUMHHOL2 phones2     n
     <dbl>   <dbl> <int>
1        1       1   222
2        2       0  3151
3        7      NA     2
4        9      NA     8
5       NA      NA  2631

17.4.8.7 NUMPHON2 and its cleanup to numphones

NUMPHON2, the Residential Phones variable, is the response to “How many of these telephone numbers are residential numbers?”

  • 1-6 = legitimate responses (6 = 6 or more)
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing
smart.raw <- smart.raw %>% 
    mutate(numphones = NUMPHON2,
           numphones = replace(numphones, numphones > 6, NA))

smart.raw %>% count(NUMPHON2, numphones) 
# A tibble: 9 x 3
  NUMPHON2 numphones     n
     <dbl>     <dbl> <int>
1        1         1   131
2        2         2    66
3        3         3    10
4        4         4     5
5        5         5     4
6        6         6     2
7        7        NA     2
8        9        NA     2
9       NA        NA  5792

17.4.8.8 VETERAN3 and its cleanup to veteran

VETERAN3, the Are You A Veteran variable, is the response to “Have you ever served on active duty in the United States Armed Forces, either in the regular military or in a National Guard or military reserve unit? (Active duty does not include training for the Reserves or National Guard, but DOES include activation, for example, for the Persian Gulf War.)”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing
smart.raw <- smart.raw %>% 
    mutate(veteran = VETERAN3,
           veteran = replace(veteran, veteran %in% c(7, 9), NA),
           veteran = replace(veteran, veteran == 2, 0))

smart.raw %>% count(VETERAN3, veteran)
# A tibble: 5 x 3
  VETERAN3 veteran     n
     <dbl>   <dbl> <int>
1        1       1   736
2        2       0  5264
3        7      NA     1
4        9      NA    12
5       NA      NA     1

17.4.8.9 EMPLOY1 and its cleanup to employment

EMPLOY1, the Employment Status variable, is the response to “Are you currently … ?”

  • 1 = Employed for wages
  • 2 = Self-employed
  • 3 = Out of work for 1 year or more
  • 4 = Out of work for less than 1 year
  • 5 = A homemaker
  • 6 = A student
  • 7 = Retired
  • 8 = Unable to work
  • 9 = Refused
  • BLANK = Not asked or missing

We’ll just turn this into a factor, and move 9 to NA.

smart.raw <- smart.raw %>% 
    mutate(employment = fct_recode(factor(EMPLOY1),
                                "Employed_for_wages" = "1",
                                "Self-employed" = "2",
                                "Outofwork_1yearormore" = "3",
                                "Outofwork_lt1year" = "4",
                                "Homemaker" = "5",
                                "Student" = "6",
                                "Retired" = "7",
                                "Unable_to_work" = "8",
                                NULL = "9"))

smart.raw %>% count(EMPLOY1, employment)
Warning: Factor `employment` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 10 x 3
   EMPLOY1 employment                n
     <dbl> <fct>                 <int>
 1       1 Employed_for_wages     2614
 2       2 Self-employed           378
 3       3 Outofwork_1yearormore   103
 4       4 Outofwork_lt1year       110
 5       5 Homemaker               302
 6       6 Student                 144
 7       7 Retired                1868
 8       8 Unable_to_work          453
 9       9 <NA>                     41
10      NA <NA>                      1

17.4.8.10 CHILDREN and its cleanup to numkids

CHILDREN, the Number of Children in Household variable, is the response to “How many children less than 18 years of age live in your household?”

  • 1-87 = legitimate responses
  • 88 = None
  • 99 = Refused
  • BLANK = Not asked or missing
smart.raw <- smart.raw %>% 
    mutate(numkids = CHILDREN,
           numkids = replace(numkids, numkids == 99, NA),
           numkids = replace(numkids, numkids == 88, 0))

smart.raw %>% count(CHILDREN, numkids) %>% tail()
# A tibble: 6 x 3
  CHILDREN numkids     n
     <dbl>   <dbl> <int>
1        7       7     2
2        8       8     2
3        9       9     4
4       88       0  4408
5       99      NA    39
6       NA      NA     1

17.4.8.11 INCOME2 to incomegroup

The available levels of INCOME2 (Income Level) are responses to: “Is your annual household income from all sources …”

  • 1 = Less than $10,000
  • 2 = $10,000 to less than $15,000
  • 3 = $15,000 to less than $20,000
  • 4 = $20,000 to less than $25,000
  • 5 = $25,000 to less than $35,000
  • 6 = $35,000 to less than $50,000
  • 7 = $50,000 to less than $75,000
  • 8 = $75,000 or more
  • 77 = Don’t know/Not sure
  • 99 = Refused
  • BLANK = Not asked or missing

We’ll just turn this into a factor, and move 77 and 99 to NA.

smart.raw <- smart.raw %>% 
    mutate(incomegroup = fct_recode(factor(`INCOME2`),
                                "0-9K" = "1",
                                "10-14K" = "2",
                                "15-19K" = "3",
                                "20-24K" = "4",
                                "25-34K" = "5",
                                "35-49K" = "6",
                                "50-74K" = "7",
                                "75K+" = "8",
                                NULL = "77",
                                NULL = "99"))

smart.raw %>% count(`INCOME2`, incomegroup)
Warning: Factor `incomegroup` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 11 x 3
   INCOME2 incomegroup     n
     <dbl> <fct>       <int>
 1       1 0-9K          212
 2       2 10-14K        235
 3       3 15-19K        374
 4       4 20-24K        482
 5       5 25-34K        567
 6       6 35-49K        723
 7       7 50-74K        814
 8       8 75K+         1631
 9      77 <NA>          387
10      99 <NA>          565
11      NA <NA>           24

17.4.8.12 INTERNET and its cleanup to internet30

INTERNET, the Internet use in the past 30 days variable, is the response to “Have you used the internet in the past 30 days?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing
smart.raw <- smart.raw %>% 
    mutate(internet30 = INTERNET,
           internet30 = replace(internet30, internet30 %in% c(7, 9), NA),
           internet30 = replace(internet30, internet30 == 2, 0))

smart.raw %>% count(INTERNET, internet30)
# A tibble: 5 x 3
  INTERNET internet30     n
     <dbl>      <dbl> <int>
1        1          1  4895
2        2          0  1078
3        7         NA     6
4        9         NA     6
5       NA         NA    29

17.4.8.13 WTKG3 is weight_kg

WTKG3 is computed by CDC, as the respondent’s weight in kilograms with two implied decimal places. We calculate the actual weight in kg, with the following:

smart.raw <- smart.raw %>% 
    mutate(weight_kg = WTKG3/100)

smart.raw %>% count(WTKG3, weight_kg)# %>% tail()
# A tibble: 260 x 3
   WTKG3 weight_kg     n
   <dbl>     <dbl> <int>
 1  3084      30.8     1
 2  3266      32.7     1
 3  3447      34.5     1
 4  3629      36.3     2
 5  3674      36.7     1
 6  3856      38.6     3
 7  3946      39.5     1
 8  4037      40.4     1
 9  4082      40.8     3
10  4128      41.3     1
# ... with 250 more rows

17.4.8.14 HEIGHT3 is replaced with height_m

HEIGHT3 is strangely gathered to allow people to specify their height in either feet and inches or in meters and centimeters.

  • 200-711 indicates height in feet (first digit) and inches (second two digits)
  • 9000 - 9998 indicates height in meters (second digit) and centimeters (last two digits)
  • 7777 = Don’t know/Not sure
  • 9999 = Refused

Note that there is one impossible value of 575 in the data set. We’ll make that an NA, and we’ll also make NA any heights below 3 feet, or above 2.24 meters. Specifically, we calculate the actual height in meters, with the following:

smart.raw <- smart.raw %>%
    mutate(height_m = case_when(
        HEIGHT3 >= 300 & HEIGHT3 <= 511 ~ round((12*floor(HEIGHT3/100) + (HEIGHT3 - 100*floor(HEIGHT3/100)))*0.0254,2),
        HEIGHT3 >= 600 & HEIGHT3 <= 711 ~ round((12*floor(HEIGHT3/100) + (HEIGHT3 - 100*floor(HEIGHT3/100)))*0.0254,2),
        HEIGHT3 >= 9000 & HEIGHT3 <= 9224 ~ ((HEIGHT3 - 9000)/100)))

smart.raw %>% count(HEIGHT3, height_m) # %>% tail()
# A tibble: 35 x 3
   HEIGHT3 height_m     n
     <dbl>    <dbl> <int>
 1     406     1.37     1
 2     407     1.4      1
 3     408     1.42     4
 4     409     1.45    14
 5     410     1.47    23
 6     411     1.5     63
 7     500     1.52   184
 8     501     1.55   171
 9     502     1.57   415
10     503     1.6    408
# ... with 25 more rows

17.4.8.15 bmi is calculated from height_m and weight_kg

We’ll calculate body-mass index from height and weight.

smart.raw <- smart.raw %>% 
    mutate(bmi = round(weight_kg/(height_m)^2,2))

smart.raw %>% count(height_m, weight_kg, bmi)# %>% tail()
# A tibble: 1,583 x 4
   height_m weight_kg   bmi     n
      <dbl>     <dbl> <dbl> <int>
 1     1.37      68.0  36.2     1
 2     1.4       81.6  41.7     1
 3     1.42      40.4  20.0     1
 4     1.42      66.2  32.8     1
 5     1.42      72.6  36.0     2
 6     1.45      44.4  21.1     1
 7     1.45      45.4  21.6     1
 8     1.45      46.7  22.2     1
 9     1.45      52.2  24.8     2
10     1.45      53.5  25.5     1
# ... with 1,573 more rows

17.4.8.16 bmigroup is calculated from bmi

We’ll then divide the respondents into adult BMI categories, in the usual way.

  • BMI < 18.5 indicates underweight
  • BMI from 18.5 up to 25 indicates normal weight
  • BMI from 25 up to 30 indicates overweight
  • BMI of 30 and higher indicates obesity
smart.raw <- smart.raw %>%
    mutate(bmigroup = factor(cut2(as.numeric(bmi), 
                           cuts = c(18.5, 25.0, 30.0))))

smart.raw %>% count(bmigroup)
Warning: Factor `bmigroup` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 5 x 2
  bmigroup          n
  <fct>         <int>
1 [ 11.6, 18.5)   102
2 [ 18.5, 25.0)  1723
3 [ 25.0, 30.0)  2001
4 [ 30.0,110.9]  1766
5 <NA>            422

17.4.8.17 PREGNANT and its cleanup to pregnant

PREGNANT, the Pregnancy Status variable, is the response to “To your knowledge, are you now pregnant?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing (includes SEX = male)
smart.raw <- smart.raw %>% 
    mutate(pregnant = PREGNANT,
           pregnant = replace(pregnant, pregnant %in% c(7, 9), NA),
           pregnant = replace(pregnant, pregnant == 2, 0))

smart.raw %>% count(PREGNANT, pregnant)
# A tibble: 5 x 3
  PREGNANT pregnant     n
     <dbl>    <dbl> <int>
1        1        1    33
2        2        0   797
3        7       NA     5
4        9       NA     1
5       NA       NA  5178

17.4.8.18 DEAF and its cleanup to deaf

DEAF, the Are you deaf or do you have serious difficulty hearing variable, is the response to “Are you deaf or do you have serious difficulty hearing?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing
smart.raw <- smart.raw %>% 
    mutate(deaf = DEAF,
           deaf = replace(deaf, deaf %in% c(7, 9), NA),
           deaf = replace(deaf, deaf == 2, 0))

smart.raw %>% count(DEAF, deaf)
# A tibble: 5 x 3
   DEAF  deaf     n
  <dbl> <dbl> <int>
1     1     1   507
2     2     0  5392
3     7    NA     9
4     9    NA     7
5    NA    NA    99

17.4.8.19 BLIND and its cleanup to blind

BLIND, the Blind or Difficulty seeing variable, is the response to “Are you blind or do you have serious difficulty seeing, even when wearing glasses?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing
smart.raw <- smart.raw %>% 
    mutate(blind = BLIND,
           blind = replace(blind, blind %in% c(7, 9), NA),
           blind = replace(blind, blind == 2, 0))

smart.raw %>% count(BLIND, blind)
# A tibble: 5 x 3
  BLIND blind     n
  <dbl> <dbl> <int>
1     1     1   290
2     2     0  5605
3     7    NA     9
4     9    NA     2
5    NA    NA   108

17.4.8.20 DECIDE and its cleanup to decide

DECIDE, the Difficulty Concentrating or Remembering variable, is the response to “Because of a physical, mental, or emotional condition, do you have serious difficulty concentrating, remembering, or making decisions?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing
smart.raw <- smart.raw %>% 
    mutate(decide = DECIDE,
           decide = replace(decide, decide %in% c(7, 9), NA),
           decide = replace(decide, decide == 2, 0))

smart.raw %>% count(DECIDE, decide)
# A tibble: 5 x 3
  DECIDE decide     n
   <dbl>  <dbl> <int>
1      1      1   583
2      2      0  5293
3      7     NA    16
4      9     NA     5
5     NA     NA   117

17.4.8.21 DIFFWALK and its cleanup to diffwalk

DIFFWALK, the Difficulty Walking or Climbing Stairs variable, is the response to “Do you have serious difficulty walking or climbing stairs?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing
smart.raw <- smart.raw %>% 
    mutate(diffwalk = DIFFWALK,
           diffwalk = replace(diffwalk, diffwalk %in% c(7, 9), NA),
           diffwalk = replace(diffwalk, diffwalk == 2, 0))

smart.raw %>% count(DIFFWALK, diffwalk)
# A tibble: 5 x 3
  DIFFWALK diffwalk     n
     <dbl>    <dbl> <int>
1        1        1  1036
2        2        0  4841
3        7       NA    11
4        9       NA     5
5       NA       NA   121

17.4.8.22 DIFFDRES and its cleanup to diffdress

DIFFDRES, the Difficulty Dressing or Bathing variable, is the response to “Do you have difficulty dressing or bathing?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing
smart.raw <- smart.raw %>% 
    mutate(diffdress = DIFFDRES,
           diffdress = replace(diffdress, diffdress %in% c(7, 9), NA),
           diffdress = replace(diffdress, diffdress == 2, 0))

smart.raw %>% count(DIFFDRES, diffdress)
# A tibble: 5 x 3
  DIFFDRES diffdress     n
     <dbl>     <dbl> <int>
1        1         1   221
2        2         0  5658
3        7        NA     3
4        9        NA     2
5       NA        NA   130

17.4.8.23 DIFFALON and its cleanup to diffalone

DIFFALON, the Difficulty Doing Errands Alone variable, is the response to “Because of a physical, mental, or emotional condition, do you have difficulty doing errands alone such as visiting a doctor’s office or shopping?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing
smart.raw <- smart.raw %>% 
    mutate(diffalone = DIFFALON,
           diffalone = replace(diffalone, diffalone %in% c(7, 9), NA),
           diffalone = replace(diffalone, diffalone == 2, 0))

smart.raw %>% count(DIFFALON, diffalone)
# A tibble: 5 x 3
  DIFFALON diffalone     n
     <dbl>     <dbl> <int>
1        1         1   437
2        2         0  5419
3        7        NA    13
4        9        NA     6
5       NA        NA   139

17.4.9 Tobacco Use (5 items)

17.4.9.1 SMOKE100 and its cleanup to smoke100

SMOKE100, the Smoked at Least 100 Cigarettes variable, is the response to “Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
smart.raw <- smart.raw %>% 
    mutate(smoke100 = SMOKE100,
           smoke100 = replace(smoke100, smoke100 %in% c(7, 9), NA),
           smoke100 = replace(smoke100, smoke100 == 2, 0))

smart.raw %>% count(SMOKE100, smoke100)
# A tibble: 5 x 3
  SMOKE100 smoke100     n
     <dbl>    <dbl> <int>
1        1        1  2567
2        2        0  3260
3        7       NA    30
4        9       NA     6
5       NA       NA   151

17.4.9.2 SMOKDAY2

17.4.9.3 STOPSMK2

17.4.9.4 LASTSMK2

17.4.9.5 USENOW3

17.4.10 E-Cigarettes (2 items)

17.4.10.1 ECIGARET and its cleanup to ecig.ever

ECIGARET, the Ever used an e-cigarette variable, is the response to “Have you ever used an e-cigarette or other electronic vaping product, even just one time, in your entire life?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
smart.raw <- smart.raw %>% 
    mutate(ecig.ever = ECIGARET,
           ecig.ever = replace(ecig.ever, ecig.ever %in% c(7, 9), NA),
           ecig.ever = replace(ecig.ever, ecig.ever == 2, 0))

smart.raw %>% count(ECIGARET, ecig.ever)
# A tibble: 5 x 3
  ECIGARET ecig.ever     n
     <dbl>     <dbl> <int>
1        1         1  1028
2        2         0  4818
3        7        NA     8
4        9        NA     1
5       NA        NA   159

17.4.10.2 ECIGNOW

17.4.11 Alcohol Consumption (4 items)

17.4.11.1 ALCDAY5 and its cleanup to alcdays

ALCDAY5, the Days in past 30 had alcoholic beverage variable, is the response to “During the past 30 days, how many days per week or per month did you have at least one drink of any alcoholic beverage such as beer, wine, a malt beverage or liquor?”

  • 101-107 = # of days per week (101 = 1 day per week, 107 = 7 days per week)
  • 201-230 = # of days in past 30 days (201 = 1 day in last 30, 230 = 30 days in last 30)
  • 777 = Don’t know/Not sure
  • 888 = No drinks in past 30 days
  • 999 = Refused
  • BLANK = Not asked or Missing

We’re going to convert this to a single numeric value. Answers in days per week (in the past 7 days) will be converted (after rounding) to days in the past 30. This is a little bit of a mess, really, but we can do it.

smart.raw <- smart.raw %>% 
    mutate(alcdays = as.numeric(ALCDAY5)) %>%
    mutate(alcdays = replace(alcdays, alcdays == 888, 0),
           alcdays = replace(alcdays, alcdays %in% c(777, 999), NA)) %>%
    mutate(alcdays = case_when(ALCDAY5 > 199 & ALCDAY5 < 231 ~ ALCDAY5 - 200,
                               ALCDAY5 > 100 & ALCDAY5 < 108 ~ round((ALCDAY5 - 100)*30/7,0),
                               TRUE ~ alcdays))

smart.raw %>% count(ALCDAY5, alcdays)
# A tibble: 38 x 3
   ALCDAY5 alcdays     n
     <dbl>   <dbl> <int>
 1     101       4   256
 2     102       9   197
 3     103      13   142
 4     104      17    62
 5     105      21    51
 6     106      26    11
 7     107      30   124
 8     201       1   471
 9     202       2   363
10     203       3   216
# ... with 28 more rows

17.4.11.2 AVEDRNK2 and its cleanup to avgdrinks

AVEDRNK2, the Avg alcoholic drinks per day in past 30 variable, is the response to “One drink is equivalent to a 12-ounce beer, a 5-ounce glass of wine, or a drink with one shot of liquor. During the past 30 days, on the days when you drank, about how many drinks did you drink on the average? (A 40 ounce beer would count as 3 drinks, or a cocktail drink with 2 shots would count as 2 drinks.)”

  • 1-76 = # of drinks per day
  • 77 = Don’t know/Not sure
  • 99 = Refused
  • BLANK = Not asked or Missing (always happens when ALCDAY5 = 777, 888 or 999)
smart.raw <- smart.raw %>% 
    mutate(avgdrinks = AVEDRNK2,
           avgdrinks = replace(avgdrinks, avgdrinks > 76, NA))

smart.raw %>% count(AVEDRNK2, avgdrinks) %>% tail()
# A tibble: 6 x 3
  AVEDRNK2 avgdrinks     n
     <dbl>     <dbl> <int>
1       40        40     2
2       50        50     2
3       75        75     1
4       77        NA    36
5       99        NA    10
6       NA        NA  2981

17.4.11.3 DRNK3GE5

17.4.11.4 MAXDRNKS

17.4.12 Immunization (4 items)

17.4.12.1 FLUSHOT6

17.4.12.2 FLSHTMY2

17.4.12.3 PNEUVAC3

17.4.12.4 TETANUS

17.4.13 Falls (2 items)

17.4.13.1 FALL12MN

17.4.13.2 FALLINJ2

17.4.14 Seatbelt Use (1 item)

17.4.14.1 SEATBELT

17.4.15 Drinking and Driving (1 item)

17.4.15.1 DRNKDRI2

17.4.16 Breast and Cervical Cancer Screening (7 items)

17.4.16.1 HADMAM

17.4.16.2 HOWLONG

17.4.16.3 HADPAP2

17.4.16.4 LASTPAP2

17.4.16.5 HPVTEST

17.4.16.6 HPLSTTST

17.4.16.7 HADHYST2

17.4.17 Prostate Cancer Screening (6 items)

17.4.17.1 PCPSAAD2

17.4.17.2 PCPSADI1

17.4.17.3 PCPSARE1

17.4.17.4 PSATEST1

17.4.17.5 PSATIME

17.4.17.6 PCPSARS1

17.4.18 Colorectal Cancer Screening (5 items)

17.4.18.1 BLDSTOOL

17.4.18.2 LSTBLDS3

17.4.18.3 HADSIGM3

17.4.18.4 HADSGCO1

17.4.18.5 LASTSIG3

17.4.19 HIV/AIDS (3 items)

17.4.19.1 HIVTST6

17.4.19.2 HIVTSTD3

17.4.19.3 HIVRISK4

17.5 Creating Some Quantitative Variables from Thin Air

This section is purely for teaching purposes. I would never use the variables created in this section for research work.

17.5.1 age_imp: Creating Age Data out of Thin Air

I want a quantitative age variable, so I’m going to create an imputed age_imp value for each subject based on their agegroup. For each age group, I will assume that each of the ages represented by a value in that age group will be equally likely, and will draw from the relevant uniform distribution to impute age.

set.seed(20180114)

smart.raw <- smart.raw %>%
    mutate(age_low = as.numeric(str_sub(as.character(agegroup), 1, 2))) %>%
    mutate(age_high = as.numeric(str_sub(as.character(agegroup), 4, 5))) %>%
    rowwise() %>%
    mutate(age_imp = ifelse(!is.na(agegroup), 
                            round(runif(1, min = age_low, max = age_high),0),
                            NA)) 

smart.raw %>% count(agegroup, age_imp) %>% tail()
# A tibble: 6 x 3
  agegroup age_imp     n
  <fct>      <dbl> <int>
1 80-96         92    30
2 80-96         93    30
3 80-96         94    33
4 80-96         95    33
5 80-96         96    14
6 <NA>          NA    76

17.6 Clean Data in the State of Ohio

There are five MMSAs associated with the state of Ohio, associated with Cincinnati, Cleveland, Columbus, Dayton and Toledo.

smart_ohio <- smart.raw %>%
    filter(str_detect(MMSANAME, ', OH')) %>% 
    select(SEQNO, MMSANAME, hhadults, 
           genhealth, physhealth, menthealth, 
           poorhealth, healthplan, hasdoc, numdocs2, 
           costprob, checktime, exerany, sleephrs, 
           hx.mi, hx.chd, hx.stroke, hx.asthma, 
           now.asthma, hx.skinc, hx.otherc, hx.copd, 
           hx.arthr, hx.depress, hx.kidney, 
           hx.diabetes, diabetes3, diab.age, 
           denttime, remteeth, agegroup, female, 
           marital, educgroup, home.own, phones2, 
           numphones, veteran, employment, numkids, 
           incomegroup, internet30, weight_kg, 
           height_m, bmi, bmigroup, pregnant, deaf, 
           blind, decide, diffwalk, diffdress, 
           diffalone, smoke100, ecig.ever, alcdays, 
           avgdrinks, age_imp)

skim_with(numeric = list(hist = NULL))
skim(select(smart_ohio, -SEQNO))
Skim summary statistics
 n obs: 6014 
 n variables: 57 

-- Variable type:character --------------------------------------------------------
 variable missing complete    n min max empty n_unique
 MMSANAME       0     6014 6014  41  51     0        5

-- Variable type:factor -----------------------------------------------------------
    variable missing complete    n n_unique
    agegroup      76     5938 6014       13
    bmigroup     422     5592 6014        4
   checktime      73     5941 6014        5
    denttime      61     5953 6014        5
   diabetes3      10     6004 6014        4
   educgroup      16     5998 6014        6
  employment      42     5972 6014        8
   genhealth      12     6002 6014        5
 incomegroup     976     5038 6014        8
     marital      50     5964 6014        6
    numdocs2      24     5990 6014        3
    remteeth     129     5885 6014        4
                                top_counts ordered
    65-: 711, 55-: 666, 60-: 651, 50-: 504   FALSE
  [ 2: 2001, [ 3: 1766, [ 1: 1723, NA: 422   FALSE
   1_I: 4742, 2_1: 571, 3_2: 299, 4_5: 299   FALSE
   1_I: 4221, 2_1: 638, 4_5: 617, 3_2: 453   FALSE
     No-: 5037, Dia: 833, Pre: 85, Pre: 49   FALSE
 Col: 2270, HS_: 1745, Som: 1611, Som: 287   FALSE
  Emp: 2614, Ret: 1868, Una: 453, Sel: 378   FALSE
  2_V: 2014, 3_G: 1895, 1_E: 997, 4_F: 798   FALSE
    75K: 1631, NA: 976, 50-: 814, 35-: 723   FALSE
  Mar: 2989, Nev: 1019, Div: 888, Wid: 813   FALSE
     1_O: 4783, 3_Z: 744, 2_M: 463, NA: 24   FALSE
  Non: 2993, 1-5: 1632, 6+_: 775, All: 485   FALSE

-- Variable type:numeric ----------------------------------------------------------
    variable missing complete    n   mean    sd    p0   p25   p50   p75
     age_imp      76     5938 6014 55.96  18.34 18    43    57    68   
     alcdays     217     5797 6014  4.76   8.02  0     0     1     5   
   avgdrinks    3027     2987 6014  2.28   2.96  1     1     2     2   
       blind     119     5895 6014  0.049  0.22  0     0     0     0   
         bmi     422     5592 6014 28.36   6.7  11.61 23.92 27.28 31.41
    costprob      13     6001 6014  0.084  0.28  0     0     0     0   
        deaf     115     5899 6014  0.086  0.28  0     0     0     0   
      decide     138     5876 6014  0.099  0.3   0     0     0     0   
    diab.age    5216      798 6014 50.83  14.12  1    43    51    60   
   diffalone     158     5856 6014  0.075  0.26  0     0     0     0   
   diffdress     135     5879 6014  0.038  0.19  0     0     0     0   
    diffwalk     137     5877 6014  0.18   0.38  0     0     0     0   
   ecig.ever     168     5846 6014  0.18   0.38  0     0     0     0   
     exerany      13     6001 6014  0.74   0.44  0     0     1     1   
      female       1     6013 6014  0.57   0.49  0     0     1     1   
      hasdoc      24     5990 6014  0.88   0.33  0     1     1     1   
  healthplan      21     5993 6014  0.95   0.21  0     1     1     1   
    height_m     166     5848 6014  1.7    0.1   1.37  1.63  1.7   1.78
    hhadults    3424     2590 6014  2.08   0.97  1     1     2     2   
    home.own      36     5978 6014  0.72   0.45  0     0     1     1   
    hx.arthr      26     5988 6014  0.38   0.49  0     0     0     1   
   hx.asthma      16     5998 6014  0.13   0.33  0     0     0     0   
      hx.chd      41     5973 6014  0.064  0.25  0     0     0     0   
     hx.copd      24     5990 6014  0.093  0.29  0     0     0     0   
  hx.depress      19     5995 6014  0.18   0.38  0     0     0     0   
 hx.diabetes      10     6004 6014  0.14   0.35  0     0     0     0   
   hx.kidney      22     5992 6014  0.039  0.19  0     0     0     0   
       hx.mi      25     5989 6014  0.061  0.24  0     0     0     0   
   hx.otherc       9     6005 6014  0.099  0.3   0     0     0     0   
    hx.skinc      15     5999 6014  0.087  0.28  0     0     0     0   
   hx.stroke      11     6003 6014  0.045  0.21  0     0     0     0   
  internet30      41     5973 6014  0.82   0.38  0     1     1     1   
  menthealth      74     5940 6014  3.55   7.94  0     0     0     2   
  now.asthma    5269      745 6014  0.74   0.44  0     0     1     1   
     numkids      40     5974 6014  0.51   1.02  0     0     0     1   
   numphones    5796      218 6014  1.58   0.94  1     1     1     2   
     phones2    2641     3373 6014  0.066  0.25  0     0     0     0   
  physhealth      94     5920 6014  4.24   8.78  0     0     0     3   
  poorhealth    3043     2971 6014  5.3    9.39  0     0     0     5   
    pregnant    5184      830 6014  0.04   0.2   0     0     0     0   
    sleephrs      55     5959 6014  6.97   1.51  1     6     7     8   
    smoke100     187     5827 6014  0.44   0.5   0     0     0     1   
     veteran      14     6000 6014  0.12   0.33  0     0     0     0   
   weight_kg     390     5624 6014 82.39  21.91 30.84 68.04 79.38 92.99
   p100
  96   
  30   
  75   
   1   
 110.88
   1   
   1   
   1   
  85   
   1   
   1   
   1   
   1   
   1   
   1   
   1   
   1   
   2.03
  10   
   1   
   1   
   1   
   1   
   1   
   1   
   1   
   1   
   1   
   1   
   1   
   1   
   1   
  30   
   1   
   9   
   6   
   1   
  30   
  30   
   1   
  23   
   1   
   1   
 249.48
write_csv(smart_ohio, "data/smart_ohio.csv")

17.7 Clean Cleveland-Elyria Data

17.7.1 Cleveland - Elyria Raw Data

The MMSANAME variable is probably the simplest way for us to filter our data down to the MMSA we are interested in. Here, I’m using the str_detect function to identify the values of MMSANAME that contain the text “Cleveland”.

smart.cle.raw <- smart.raw %>% 
  filter(str_detect(MMSANAME, 'Cleveland')) 

In the Cleveland-Elyria MSA, we have 1036 observations on the same 204 variables.

17.7.2 Clean Data - Larger

smart.cle <- smart.cle.raw %>%
    select(SEQNO, MMSANAME, hhadults, 
           genhealth, physhealth, menthealth, 
           poorhealth, healthplan, hasdoc, numdocs2, 
           costprob, checktime, exerany, sleephrs, 
           hx.mi, hx.chd, hx.stroke, hx.asthma, 
           now.asthma, hx.skinc, hx.otherc, hx.copd, 
           hx.arthr, hx.depress, hx.kidney, 
           hx.diabetes, diabetes3, diab.age, 
           denttime, remteeth, agegroup, female, 
           marital, educgroup, home.own, phones2, 
           numphones, veteran, employment, numkids, 
           incomegroup, internet30, weight_kg, 
           height_m, bmi, bmigroup, pregnant, deaf, 
           blind, decide, diffwalk, diffdress, 
           diffalone, smoke100, ecig.ever, alcdays, 
           avgdrinks, age_imp)

skim_with(numeric = list(hist = NULL))
skim(select(smart.cle, -SEQNO))
Skim summary statistics
 n obs: 1036 
 n variables: 57 

-- Variable type:character --------------------------------------------------------
 variable missing complete    n min max empty n_unique
 MMSANAME       0     1036 1036  51  51     0        1

-- Variable type:factor -----------------------------------------------------------
    variable missing complete    n n_unique
    agegroup      12     1024 1036       13
    bmigroup      84      952 1036        4
   checktime       9     1027 1036        5
    denttime       6     1030 1036        5
   diabetes3       1     1035 1036        4
   educgroup       2     1034 1036        6
  employment      10     1026 1036        8
   genhealth       3     1033 1036        5
 incomegroup     170      866 1036        8
     marital      10     1026 1036        6
    numdocs2       5     1031 1036        3
    remteeth      25     1011 1036        4
                             top_counts ordered
 65-: 128, 60-: 122, 55-: 105, 80-: 103   FALSE
   [ 2: 349, [ 1: 322, [ 3: 268, NA: 84   FALSE
   1_I: 820, 2_1: 101, 3_2: 52, 4_5: 49   FALSE
   1_I: 752, 2_1: 113, 4_5: 91, 3_2: 70   FALSE
    No-: 884, Dia: 128, Pre: 15, Pre: 8   FALSE
  Col: 385, HS_: 292, Som: 290, Som: 54   FALSE
   Emp: 427, Ret: 345, Sel: 70, Una: 57   FALSE
 2_V: 350, 3_G: 344, 1_E: 173, 4_F: 122   FALSE
  75K: 258, NA: 170, 50-: 146, 35-: 115   FALSE
 Mar: 476, Nev: 213, Wid: 157, Div: 143   FALSE
    1_O: 795, 3_Z: 135, 2_M: 101, NA: 5   FALSE
  Non: 510, 1-5: 293, 6+_: 127, All: 81   FALSE

-- Variable type:numeric ----------------------------------------------------------
    variable missing complete    n   mean    sd    p0   p25   p50   p75
     age_imp      12     1024 1036 57.3   18.86 18    45    60    70   
     alcdays      46      990 1036  4.65   8.05  0     0     1     4   
   avgdrinks     510      526 1036  2.17   2.44  1     1     2     2   
       blind      20     1016 1036  0.037  0.19  0     0     0     0   
         bmi      84      952 1036 27.89   6.47 12.71 23.7  26.68 30.53
    costprob       3     1033 1036  0.067  0.25  0     0     0     0   
        deaf      18     1018 1036  0.071  0.26  0     0     0     0   
      decide      24     1012 1036  0.069  0.25  0     0     0     0   
    diab.age     912      124 1036 52.35  12.96  5    46.75 55    60   
   diffalone      33     1003 1036  0.07   0.25  0     0     0     0   
   diffdress      24     1012 1036  0.037  0.19  0     0     0     0   
    diffwalk      23     1013 1036  0.16   0.37  0     0     0     0   
   ecig.ever      36     1000 1036  0.15   0.35  0     0     0     0   
     exerany       3     1033 1036  0.76   0.43  0     1     1     1   
      female       0     1036 1036  0.6    0.49  0     0     1     1   
      hasdoc       5     1031 1036  0.87   0.34  0     1     1     1   
  healthplan       0     1036 1036  0.95   0.22  0     1     1     1   
    height_m      33     1003 1036  1.69   0.1   1.45  1.63  1.68  1.75
    hhadults     586      450 1036  2.07   1.02  1     1     2     2   
    home.own       3     1033 1036  0.71   0.45  0     0     1     1   
    hx.arthr       6     1030 1036  0.37   0.48  0     0     0     1   
   hx.asthma       4     1032 1036  0.12   0.32  0     0     0     0   
      hx.chd      11     1025 1036  0.062  0.24  0     0     0     0   
     hx.copd       6     1030 1036  0.08   0.27  0     0     0     0   
  hx.depress       3     1033 1036  0.13   0.34  0     0     0     0   
 hx.diabetes       1     1035 1036  0.12   0.33  0     0     0     0   
   hx.kidney       4     1032 1036  0.052  0.22  0     0     0     0   
       hx.mi       5     1031 1036  0.051  0.22  0     0     0     0   
   hx.otherc       2     1034 1036  0.1    0.3   0     0     0     0   
    hx.skinc       3     1033 1036  0.075  0.26  0     0     0     0   
   hx.stroke       4     1032 1036  0.042  0.2   0     0     0     0   
  internet30       6     1030 1036  0.81   0.39  0     1     1     1   
  menthealth      11     1025 1036  2.72   6.82  0     0     0     2   
  now.asthma     918      118 1036  0.69   0.46  0     0     1     1   
     numkids       5     1031 1036  0.41   0.92  0     0     0     0   
   numphones     994       42 1036  1.74   0.99  1     1     2     2   
     phones2     460      576 1036  0.075  0.26  0     0     0     0   
  physhealth      17     1019 1036  3.97   8.67  0     0     0     2   
  poorhealth     543      493 1036  4.07   8.09  0     0     0     3   
    pregnant     905      131 1036  0.053  0.23  0     0     0     0   
    sleephrs       8     1028 1036  7.02   1.53  1     6     7     8   
    smoke100      37      999 1036  0.45   0.5   0     0     0     1   
     veteran       3     1033 1036  0.098  0.3   0     0     0     0   
   weight_kg      77      959 1036 80.4   21.36 36.29 65.77 77.11 90.72
   p100
  96   
  30   
  40   
   1   
  66.06
   1   
   1   
   1   
  83   
   1   
   1   
   1   
   1   
   1   
   1   
   1   
   1   
   1.98
   6   
   1   
   1   
   1   
   1   
   1   
   1   
   1   
   1   
   1   
   1   
   1   
   1   
   1   
  30   
   1   
   6   
   5   
   1   
  30   
  30   
   1   
  20   
   1   
   1   
 204.12
write_csv(smart.cle, "data/smartcle.csv")

17.7.3 Creation of the smartcle1.csv data

smart.cle1 <- smart.raw %>%
    filter(str_detect(MMSANAME, 'Cleveland')) %>%
    select(SEQNO, physhealth, menthealth, 
           poorhealth, genhealth, bmi, female,
           internet30, exerany, sleephrs, alcdays)

# write_csv(smart.cle1, "data/smartcle1.csv")

## this last line commented out because the file was
## created earlier in these notes