Chapter 2 BRFSS SMART Data

The Centers for Disease Control and Prevention (CDC) analyzes Behavioral Risk Factor Surveillance System (BRFSS) survey data for specific metropolitan and micropolitan statistical areas (MMSAs) in a program called the Selected Metropolitan/Micropolitan Area Risk Trends of BRFSS (SMART BRFSS).

In this work, we will focus on data from the 2017 SMART, in particular on data from the state of Ohio and from the Cleveland-Elyria, OH, Metropolitan Statistical Area. The purpose of this survey is to provide localized health information that can help public health practitioners identify local emerging health problems, plan and evaluate local responses, and efficiently allocate resources to specific needs.

In this chapter, I describe some cleaning of the BRFSS SMART data, and break it out into national, statewide, and local samples.

The data files produced by this chapter include:

  • smart_ohio.Rds which includes data on approximately 100 variables for over 7000 subjects in six MMSAs that are at least partially located in the state of Ohio.
  • smart_cle.Rds which includes data on those same variables for a little over 1000 subjects in the Cleveland-Elyria, OH MMSA.

2.1 Key resources

For each subject, we are also provided with a sampling weight, in _MMSAWT, which will help us incorporate the sampling design later. These weights are at the MMSA level, and are used for generating MMSA-level estimates for variables in the data set. Details on the weighting methodology are available at https://www.cdc.gov/brfss/annual_data/2017/pdf/2017_SMART_BRFSS_MMSA_Methodology-508.pdf

2.2 Ingesting the Raw Data

To create the data files we’ll use, I used the read_xpt function from the haven package to bring in the SAS XPT data file that is provided by CDC. The code I used (but won’t run in these Notes) was:
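A minimal sketch of that ingestion step (assuming the CDC transport file is saved locally as MMSA2017.xpt — this chunk is not run in these Notes):

```r
# Not run here: read the CDC SAS transport (XPT) file
library(haven)

smart_2017 <- read_xpt("MMSA2017.xpt")

dim(smart_2017)  # expecting 230875 rows and 177 columns
```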

This gives the nationwide data, which has 230,875 rows and 177 columns.

But for the purposes of putting these Notes online, I needed to crank down the sample size enormously. To that end, I created a new data file, which I developed by

  • importing the MMSA2017.xpt file as above
  • filtering away all observations except those from MMSAs which include Ohio in their name, and
  • saving the result, which now has 7,412 rows and 177 columns.

The code (again, not run here) that I used to filter to the OH-based MMSAs was:
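Since the nationwide file isn’t loaded in these Notes, here is the filtering idea sketched on a toy two-row stand-in for smart_2017 (only the MMSA name matters for the filter):

```r
library(dplyr)
library(stringr)

# toy stand-in for the nationwide smart_2017 data
smart_2017 <- tibble::tibble(
  MMSANAME = c("Cleveland-Elyria, OH, Metropolitan Statistical Area",
               "Seattle-Tacoma-Bellevue, WA, Metropolitan Statistical Area"),
  SEQNO = c(1, 2)
)

# keep only the rows whose MMSA name mentions Ohio ("OH")
smart_ohio_raw <- smart_2017 %>%
  filter(str_detect(MMSANAME, "OH"))

nrow(smart_ohio_raw)  # 1: just the Cleveland-Elyria row survives
```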

So, for purposes of these notes, our complete data set is actually coming from smart_ohio_raw.csv and consists only of the 7,412 observations associated with the six MMSAs that include Ohio in their names.

2.3 Ingesting from our CSV file
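A sketch of the CSV ingestion (assuming the file sits in the working directory under the name used above); read_csv guesses the column types and prints the specification shown below:

```r
# Read the Ohio-only raw data from the CSV file
library(readr)

smart_ohio_raw <- read_csv("smart_ohio_raw.csv")

dim(smart_ohio_raw)
```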

Parsed with column specification:
cols(
  .default = col_double(),
  CALLBCKZ = col_logical(),
  NAMTRIBE = col_logical(),
  NAMOTHR = col_logical(),
  MMSANAME = col_character()
)
See spec(...) for full column specifications.
[1] 7412  177

2.4 What does the raw data look like?

  [1] "DISPCODE" "STATERE1" "SAFETIME" "HHADULT"  "GENHLTH"  "PHYSHLTH"
  [7] "MENTHLTH" "POORHLTH" "HLTHPLN1" "PERSDOC2" "MEDCOST"  "CHECKUP1"
 [13] "BPHIGH4"  "BPMEDS"   "CHOLCHK1" "TOLDHI2"  "CHOLMED1" "CVDINFR4"
 [19] "CVDCRHD4" "CVDSTRK3" "ASTHMA3"  "ASTHNOW"  "CHCSCNCR" "CHCOCNCR"
 [25] "CHCCOPD1" "HAVARTH3" "ADDEPEV2" "CHCKIDNY" "DIABETE3" "DIABAGE2"
 [31] "LMTJOIN3" "ARTHDIS2" "ARTHSOCL" "JOINPAI1" "SEX"      "MARITAL" 
 [37] "EDUCA"    "RENTHOM1" "NUMHHOL2" "NUMPHON2" "CPDEMO1A" "VETERAN3"
 [43] "EMPLOY1"  "CHILDREN" "INCOME2"  "INTERNET" "WEIGHT2"  "HEIGHT3" 
 [49] "PREGNANT" "DEAF"     "BLIND"    "DECIDE"   "DIFFWALK" "DIFFDRES"
 [55] "DIFFALON" "SMOKE100" "SMOKDAY2" "STOPSMK2" "LASTSMK2" "USENOW3" 
 [61] "ECIGARET" "ECIGNOW"  "ALCDAY5"  "AVEDRNK2" "DRNK3GE5" "MAXDRNKS"
 [67] "FRUIT2"   "FRUITJU2" "FVGREEN1" "FRENCHF1" "POTATOE1" "VEGETAB2"
 [73] "EXERANY2" "EXRACT11" "EXEROFT1" "EXERHMM1" "EXRACT21" "EXEROFT2"
 [79] "EXERHMM2" "STRENGTH" "SEATBELT" "FLUSHOT6" "FLSHTMY2" "PNEUVAC3"
 [85] "SHINGLE2" "HIVTST6"  "HIVTSTD3" "HIVRISK5" "CASTHDX2" "CASTHNO2"
 [91] "CALLBCKZ" "WDUSENOW" "WDINFTRK" "WDHOWOFT" "WDSHARE"  "NAMTRIBE"
 [97] "NAMOTHR"  "_URBNRRL" "_STSTR"   "_IMPSEX"  "_RFHLTH"  "_PHYS14D"
[103] "_MENT14D" "_HCVU651" "_RFHYPE5" "_CHOLCH1" "_RFCHOL1" "_MICHD"  
[109] "_LTASTH1" "_CASTHM1" "_ASTHMS1" "_DRDXAR1" "_LMTACT1" "_LMTWRK1"
[115] "_LMTSCL1" "_PRACE1"  "_MRACE1"  "_HISPANC" "_RACE"    "_RACEG21"
[121] "_RACEGR3" "_AGEG5YR" "_AGE65YR" "_AGE80"   "_AGE_G"   "WTKG3"   
[127] "_BMI5"    "_BMI5CAT" "_RFBMI5"  "_EDUCAG"  "_INCOMG"  "_SMOKER3"
[133] "_RFSMOK3" "_ECIGSTS" "_CURECIG" "DRNKANY5" "_RFBING5" "_DRNKWEK"
[139] "_RFDRHV5" "FTJUDA2_" "FRUTDA2_" "GRENDA1_" "FRNCHDA_" "POTADA1_"
[145] "VEGEDA2_" "_MISFRT1" "_MISVEG1" "_FRTRES1" "_VEGRES1" "_FRUTSU1"
[151] "_VEGESU1" "_FRTLT1A" "_VEGLT1A" "_FRT16A"  "_VEG23A"  "_FRUITE1"
[157] "_VEGETE1" "_TOTINDA" "_MINAC11" "_MINAC21" "_PACAT1"  "_PAINDX1"
[163] "_PA150R2" "_PA300R2" "_PA30021" "_PASTRNG" "_PAREC1"  "_PASTAE1"
[169] "_RFSEAT2" "_RFSEAT3" "_FLSHOT6" "_PNEUMO2" "_AIDTST3" "_MMSA"   
[175] "_MMSAWT"  "SEQNO"    "MMSANAME"

We’re not going to use all of those variables, but for now, we’ll leave this as is.

2.5 Cleaning the BRFSS Data

2.5.1 Identifying Information

The identifying information for each subject is gathered in SEQNO, which I’ll leave alone.

  • Each statistical (geographic) area is identified by a _MMSA variable, which I’ll rename mmsa_code, and by an MMSANAME, which I’ll rename as mmsa_name.

  • For each subject, we are also provided with a sampling weight, in _MMSAWT, which will help us incorporate the sampling design later in the semester. We’ll rename this as mmsa_wt. Details on the weighting methodology are available at https://www.cdc.gov/brfss/annual_data/2017/pdf/2017_SMART_BRFSS_MMSA_Methodology-508.pdf

# A tibble: 6 x 3
  mmsa_code mmsa_name                                                       n
      <dbl> <chr>                                                       <int>
1     17140 Cincinnati, OH-KY-IN, Metropolitan Statistical Area          1737
2     17460 Cleveland-Elyria, OH, Metropolitan Statistical Area          1133
3     18140 Columbus, OH, Metropolitan Statistical Area                  2033
4     19380 Dayton, OH, Metropolitan Statistical Area                     587
5     26580 Huntington-Ashland, WV-KY-OH, Metropolitan Statistical Area  1156
6     45780 Toledo, OH, Metropolitan Statistical Area                     766

Those names are very long. I’ll build some shorter ones by dropping everything after the comma.
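One way to do that (a sketch with a hypothetical helper, not necessarily the exact code used — note the results below retain a trailing space that this version would not) is to remove the first comma and everything after it with stringr:

```r
library(stringr)

# drop the first comma and everything that follows it
shorten <- function(x) str_remove(x, ",.*$")

shorten("Cleveland-Elyria, OH, Metropolitan Statistical Area")
# [1] "Cleveland-Elyria"
```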

# A tibble: 6 x 3
  mmsa                mmsa_name                                                n
  <chr>               <chr>                                                <int>
1 "Cincinnati "       Cincinnati, OH-KY-IN, Metropolitan Statistical Area   1737
2 "Cleveland-Elyria " Cleveland-Elyria, OH, Metropolitan Statistical Area   1133
3 "Columbus "         Columbus, OH, Metropolitan Statistical Area           2033
4 "Dayton "           Dayton, OH, Metropolitan Statistical Area              587
5 "Huntington-Ashlan~ Huntington-Ashland, WV-KY-OH, Metropolitan Statisti~  1156
6 "Toledo "           Toledo, OH, Metropolitan Statistical Area              766

And here are the sampling weights for the subjects in the Cleveland-Elyria MSA.

2.5.2 Survey Method

2.5.2.1 DISPCODE and its cleanup to completed

DISPCODE is 1100 if the subject completed the interview, and 1200 if they partially completed it. We’ll create a variable called completed that indicates (1 = complete, 0 = not) whether the subject completed the interview.
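A sketch of that recode, on a toy demo tibble (the real chunk isn’t shown in these Notes):

```r
library(dplyr)

demo <- tibble::tibble(DISPCODE = c(1100, 1200, 1100))  # toy data

demo <- demo %>%
  mutate(completed = if_else(DISPCODE == 1100, 1, 0))

demo$completed  # 1 0 1
```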

# A tibble: 2 x 3
  DISPCODE completed     n
     <dbl>     <dbl> <int>
1     1100         1  6277
2     1200         0  1135

2.5.2.2 STATERE1 and SAFETIME and their reduction to landline

The BRFSS is conducted by telephone. The next two variables help us understand whether the subject was contacted via land line or via cellular phone.

  • STATERE1 is 1 if the subject is a resident of the state (only asked of people in the land line version of the survey).
  • SAFETIME is 1 if this is a safe time to talk (only asked of people in the cell phone version of the survey).
  • We’ll use STATERE1 and SAFETIME to create an indicator variable landline that specifies how the respondent was surveyed (1 = land line, 0 = cell phone), as follows…
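A sketch of that construction using case_when, on a toy demo tibble (in the real data, exactly one of the two items is answered for each respondent):

```r
library(dplyr)

demo <- tibble::tibble(
  STATERE1 = c(1, NA),
  SAFETIME = c(NA, 1)
)

demo <- demo %>%
  mutate(landline = case_when(
    STATERE1 == 1 ~ 1,   # asked only in the land line survey
    SAFETIME == 1 ~ 0    # asked only in the cell phone survey
  ))

demo$landline  # 1 0
```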
# A tibble: 2 x 4
  STATERE1 SAFETIME landline     n
     <dbl>    <dbl>    <dbl> <int>
1        1       NA        1  3649
2       NA        1        0  3763

2.5.2.3 HHADULT and its cleanup to hhadults

  • HHADULT is the response to “How many members of your household, including yourself, are 18 years of age or older?”
    • The permitted responses range from 1 to 76, with special values 77 for Don’t Know/Not Sure and 99 for Refused, and BLANK for missing or not asked.
    • So we should change all numerical values above 76 to NA for our analyses (the blanks are already regarded as NAs by R during ingestion).
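That cleanup can be sketched on a toy vector (values above 76 become NA; existing NAs stay NA):

```r
library(dplyr)

HHADULT <- c(1, 2, 11, 77, 99, NA)  # toy responses

# anything above 76 is a special code, so treat it as missing
hhadults <- if_else(HHADULT > 76, NA_real_, HHADULT)

hhadults  # 1 2 11 NA NA NA
```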
 HHADULT    n      percent valid_percent
       1  274 0.0369670804   0.236206897
       2  603 0.0813545602   0.519827586
       3  170 0.0229357798   0.146551724
       4   73 0.0098488937   0.062931034
       5   28 0.0037776579   0.024137931
       6    4 0.0005396654   0.003448276
       7    3 0.0004047491   0.002586207
       8    1 0.0001349164   0.000862069
      10    1 0.0001349164   0.000862069
      11    1 0.0001349164   0.000862069
      99    2 0.0002698327   0.001724138
      NA 6252 0.8434970318            NA
# A tibble: 6 x 3
  HHADULT hhadults     n
    <dbl>    <dbl> <int>
1       7        7     3
2       8        8     1
3      10       10     1
4      11       11     1
5      99       NA     2
6      NA       NA  6252

2.5.3 Health Status (1 item)

The next variable relates to the subject’s health status.

2.5.3.1 GENHLTH and its cleanup to genhealth

  • GENHLTH, the General Health variable, which is the response to “Would you say that in general your health is …”
    • 1 = Excellent
    • 2 = Very good
    • 3 = Good
    • 4 = Fair
    • 5 = Poor
    • 7 = Don’t know/Not sure
    • 9 = Refused
    • BLANK = Not asked or missing

To clean up the GENHLTH data into a new variable called genhealth we’ll need to:

  • convince R that the 7 and 9 values are in fact best interpreted as NA, and
  • change the variable to a factor and incorporate the names into the levels.
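A sketch of that factor cleanup on a toy demo tibble; passing only levels 1-5 to factor() means the 7 and 9 codes fall outside the levels and become NA automatically:

```r
library(dplyr)

demo <- tibble::tibble(GENHLTH = c(1, 2, 3, 4, 5, 7, 9))  # toy data

demo <- demo %>%
  mutate(genhealth = factor(GENHLTH, levels = 1:5,
                            labels = c("1_Excellent", "2_VeryGood",
                                       "3_Good", "4_Fair", "5_Poor")))

demo$genhealth  # 7 and 9 are not listed levels, so they become NA
```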

Warning: Factor `genhealth` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 7 x 3
  GENHLTH genhealth       n
    <dbl> <fct>       <int>
1       1 1_Excellent  1057
2       2 2_VeryGood   2406
3       3 3_Good       2367
4       4 4_Fair       1139
5       5 5_Poor        428
6       7 <NA>           10
7       9 <NA>            5

2.5.5 Health Care Access (4 items)

The next four variables relate to the subject’s health care access.

2.5.5.1 HLTHPLN1 and its cleanup to healthplan

HLTHPLN1, the Have any health care coverage variable, is the response to “Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused

To clean up the HLTHPLN1 data into a new variable called healthplan we’ll:

  • convince R that the 7 and 9 values are in fact best interpreted as NA, and
  • turn it into an indicator variable: we will leave the variable as numeric, but change the values to 1 = Yes and 0 = No.

# A tibble: 4 x 3
  HLTHPLN1 healthplan     n
     <dbl>      <dbl> <int>
1        1          1  6994
2        2          0   398
3        7         NA    10
4        9         NA    10

2.5.5.2 PERSDOC2 and its cleanup to hasdoc and to numdocs2

PERSDOC2, the Multiple Health Care Professionals variable, is the response to “Do you have one person you think of as your personal doctor or health care provider?” where if the response is “No”, the survey then asks “Is there more than one or is there no person who you think of as your personal doctor or health care provider?”

  • 1 = Yes, only one
  • 2 = More than one
  • 3 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing

To clean up the PERSDOC2 data into a new variable called hasdoc we’ll:

  • convince R that the 7 and 9 values are in fact best interpreted as NA, and
  • turn it into an indicator variable (numeric, with 1 = Yes and 0 = No), so that the original 1 and 2 become 1, and the original 3 becomes 0.

# A tibble: 5 x 3
  PERSDOC2 hasdoc     n
     <dbl>  <dbl> <int>
1        1      1  5784
2        2      1   623
3        3      0   990
4        7     NA    14
5        9     NA     1

2.5.5.3 MEDCOST and its cleanup to costprob

MEDCOST, the Could Not See Doctor Because of Cost variable, is the response to “Was there a time in the past 12 months when you needed to see a doctor but could not because of cost?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing

This is just like HLTHPLN1.

# A tibble: 4 x 3
  MEDCOST costprob     n
    <dbl>    <dbl> <int>
1       1        1   714
2       2        0  6680
3       7       NA    14
4       9       NA     4

2.5.5.4 CHECKUP1 and its cleanup to t_checkup

CHECKUP1, the Length of time since last routine checkup variable, is the response to “About how long has it been since you last visited a doctor for a routine checkup? [A routine checkup is a general physical exam, not an exam for a specific injury, illness, or condition.]”

  • 1 = Within past year (anytime less than 12 months ago)
  • 2 = Within past 2 years (1 year but less than 2 years ago)
  • 3 = Within past 5 years (2 years but less than 5 years ago)
  • 4 = 5 or more years ago
  • 7 = Don’t know/Not sure
  • 8 = Never
  • 9 = Refused
  • BLANK = Not asked or missing

To clean up the CHECKUP1 data into a new variable called t_checkup we’ll:

  • convince R that the 7 and 9 values are in fact best interpreted as NA, and
  • relabel options 1, 2, 3, 4 and 8 while turning the variable into a factor.

Warning: Factor `t_checkup` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 7 x 3
  CHECKUP1 t_checkup          n
     <dbl> <fct>          <int>
1        1 1_In-past-year  5803
2        2 2_1-to-2-years   714
3        3 3_2-to-5-years   413
4        4 4_5_plus_years   376
5        7 <NA>              68
6        8 8_Never           32
7        9 <NA>               6

2.5.6 Blood Pressure (2 measures)

2.5.6.1 BPHIGH4 and its cleanup to bp_high

BPHIGH4 is asking about awareness of a hypertension diagnosis. It’s the response to the question: “Have you EVER been told by a doctor, nurse or other health professional that you have high blood pressure?” In addition, if the answer was “Yes” and the respondent is female, they were then asked “Was this only when you were pregnant?”

The available codes are:

  • 1 = Yes
  • 2 = Yes, but female told only during pregnancy
  • 3 = No
  • 4 = Told borderline high or pre-hypertensive
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing

To clean up the BPHIGH4 data into a new variable called bp_high we’ll:

  • convince R that the 7 and 9 values are in fact best interpreted as NA, and
  • relabel (and re-order) options 1, 2, 3 and 4 while turning the variable into a factor.

Warning: Factor `bp_high` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 6 x 3
  BPHIGH4 bp_high                   n
    <dbl> <fct>                 <int>
1       1 1_Yes                  3161
2       2 2_Only_while_pregnant    67
3       3 0_No                   4114
4       4 4_Borderline             49
5       7 <NA>                     19
6       9 <NA>                      2

2.5.6.2 BPMEDS and its cleanup to bp_meds

BPMEDS is the response to the question “Are you currently taking medicine for your high blood pressure?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing

To clean up the BPMEDS data into a new variable called bp_meds we’ll treat it just as we did with HLTHPLN1:

  • convince R that the 7 and 9 values are in fact best interpreted as NA, and
  • turn it into an indicator variable: we will leave the variable as numeric, but change the values to 1 = Yes and 0 = No.

# A tibble: 5 x 3
  BPMEDS bp_meds     n
   <dbl>   <dbl> <int>
1      1       1  2675
2      2       0   481
3      7      NA     4
4      9      NA     1
5     NA      NA  4251

What is the relationship between our two blood pressure variables? Only the people with bp_high = 1_Yes were asked the BPMEDS question.

               bp_high   0    1  NA_
                  0_No   0    0 4114
                 1_Yes 481 2675    5
 2_Only_while_pregnant   0    0   67
          4_Borderline   0    0   49
                  <NA>   0    0   21

2.5.7 Cholesterol (3 items)

2.5.7.1 CHOLCHK1 and its cleanup to t_chol

CHOLCHK1, the Length of time since cholesterol was checked, is the response to “Blood cholesterol is a fatty substance found in the blood. About how long has it been since you last had your blood cholesterol checked?”

  • 1 = Never
  • 2 = Within past year (anytime less than 12 months ago)
  • 3 = Within past 2 years (1 year but less than 2 years ago)
  • 4 = Within past 5 years (2 years but less than 5 years ago)
  • 5 = 5 or more years ago
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing

To clean up the CHOLCHK1 data into a new variable called t_chol we’ll:

  • convince R that the 7 and 9 values are in fact best interpreted as NA, and
  • relabel options 1 through 5 while turning the variable into a factor.

Warning: Factor `t_chol` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 8 x 3
  CHOLCHK1 t_chol             n
     <dbl> <fct>          <int>
1        1 1_Never          424
2        2 2_In-past-year  5483
3        3 3_1-to-2-years   559
4        4 4_2-to-5-years   289
5        5 5_5_plus_years   272
6        7 <NA>             376
7        9 <NA>               8
8       NA <NA>               1

The next two measures are not gathered from the people who answered “Never” to this question.

2.5.7.2 TOLDHI2 and its cleanup to chol_high

TOLDHI2 is asking about awareness of a diagnosis of high cholesterol. It’s the response to the question: “Have you EVER been told by a doctor, nurse or other health professional that your blood cholesterol is high?”

The available codes are:

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing

To clean up the TOLDHI2 data into a new variable called chol_high we’ll treat it like BPMEDS and HLTHPLN1:

  • convince R that the 7 and 9 values are in fact best interpreted as NA, and
  • turn it into an indicator variable: we will leave the variable as numeric, but change the values to 1 = Yes and 0 = No.

# A tibble: 5 x 3
  TOLDHI2 chol_high     n
    <dbl>     <dbl> <int>
1       1         1  2612
2       2         0  4286
3       7        NA    70
4       9        NA     4
5      NA        NA   440

2.5.7.3 CHOLMED1 and its cleanup to chol_meds

CHOLMED1 is the response to the question “Are you currently taking medicine prescribed by a doctor or other health professional for your blood cholesterol?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing

To clean up the CHOLMED1 data into a new variable called chol_meds we’ll treat it just as we did with HLTHPLN1:

  • convince R that the 7 and 9 values are in fact best interpreted as NA, and
  • turn it into an indicator variable: we will leave the variable as numeric, but change the values to 1 = Yes and 0 = No.

# A tibble: 4 x 3
  CHOLMED1 chol_meds     n
     <dbl>     <dbl> <int>
1        1         1  1781
2        2         0   826
3        7        NA     5
4       NA        NA  4800

2.5.8 Chronic Health Conditions (14 items)

2.5.8.1 Self-reported diagnosis history (11 items)

The next several variables describe whether or not the subject reports a history of a particular diagnosis, and are all coded in the raw data the same way:

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing

and we’ll recode them all to 1 = Yes, 0 = No, otherwise NA, as we’ve done previously.

The questions all start with “Has a doctor, nurse, or other health professional ever told you that you had any of the following? For each, tell me Yes, No, or you’re Not sure.”

Original Revised Details
CVDINFR4 hx_mi (Ever told) you had a heart attack, also called a myocardial infarction?
CVDCRHD4 hx_chd (Ever told) you had angina or coronary heart disease?
CVDSTRK3 hx_stroke (Ever told) you had a stroke?
ASTHMA3 hx_asthma (Ever told) you had asthma?
ASTHNOW now_asthma Do you still have asthma? (only asked of those with Yes in ASTHMA3)
CHCSCNCR hx_skinc (Ever told) you had skin cancer?
CHCOCNCR hx_otherc (Ever told) you had any other types of cancer?
CHCCOPD1 hx_copd (Ever told) you have Chronic Obstructive Pulmonary Disease or COPD, emphysema or chronic bronchitis?
HAVARTH3 hx_arthr (Ever told) you have some form of arthritis, rheumatoid arthritis, gout, lupus, or fibromyalgia? (Arthritis diagnoses include: rheumatism, polymyalgia rheumatica; osteoarthritis (not osteoporosis); tendonitis, bursitis, bunion, tennis elbow; carpal tunnel syndrome, tarsal tunnel syndrome; joint infection, etc.)
ADDEPEV2 hx_depress (Ever told) you that you have a depressive disorder, including depression, major depression, dysthymia, or minor depression?
CHCKIDNY hx_kidney (Ever told) you have kidney disease? Do NOT include kidney stones, bladder infection or incontinence.

We definitely should have written a function to do that, of course.
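Such a function might look like this (a sketch; the Notes don’t show the actual code, and the commented-out mutate is a hypothetical use on two of the eleven items):

```r
library(dplyr)

# recode 1 = Yes -> 1, 2 = No -> 0, and anything else (7, 9, NA) -> NA
yes_no_01 <- function(x) {
  case_when(x == 1 ~ 1,
            x == 2 ~ 0)
}

# hypothetical use on the raw history items:
# smart_ohio <- smart_ohio_raw %>%
#   mutate(hx_mi = yes_no_01(CVDINFR4),
#          hx_stroke = yes_no_01(CVDSTRK3))

yes_no_01(c(1, 2, 7, 9, NA))  # 1 0 NA NA NA
```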

2.5.8.2 _ASTHMS1 and its cleanup to asthma

_ASTHMS1 categorizes subjects by asthma status as:

  • 1 = Current
  • 2 = Former
  • 3 = Never
  • 9 = Don’t Know / Not Sure / Refused / Missing

We’ll turn this into a factor with appropriate levels and NA information.

Warning: Factor `asthma` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 4 x 3
  `_ASTHMS1` asthma      n
       <dbl> <fct>   <int>
1          1 Current   734
2          2 Former    248
3          3 Never    6376
4          9 <NA>       54

2.5.8.3 DIABETE3 and its cleanup to hx_diabetes and dm_status

DIABETE3, the (Ever told) you have diabetes variable, is the response to “(Ever told) you have diabetes (If Yes and respondent is female, ask Was this only when you were pregnant?. If Respondent says pre-diabetes or borderline diabetes, use response code 4.)”

  • 1 = Yes
  • 2 = Yes, but female told only during pregnancy
  • 3 = No
  • 4 = No, pre-diabetes or borderline diabetes
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing

I’ll create one variable called hx_diabetes which is 1 if DIABETE3 = 1, and 0 otherwise, with appropriate NAs, like our other variables. Then I’ll create dm_status to include all of this information in a factor, but again recode the missing values properly.
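A sketch of both derivations on a toy demo tibble (the factor levels match the output below; 7 and 9 become NA in each new variable):

```r
library(dplyr)

demo <- tibble::tibble(DIABETE3 = c(1, 2, 3, 4, 7, 9))  # toy data

demo <- demo %>%
  mutate(
    hx_diabetes = case_when(
      DIABETE3 == 1 ~ 1,
      DIABETE3 %in% 2:4 ~ 0        # 7 and 9 fall through to NA
    ),
    dm_status = factor(DIABETE3, levels = 1:4,
                       labels = c("Diabetes", "Pregnancy-Induced",
                                  "No-Diabetes", "Pre-Diabetes"))
  )
```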

Warning: Factor `dm_status` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 6 x 4
  DIABETE3 hx_diabetes dm_status             n
     <dbl>       <dbl> <fct>             <int>
1        1           1 Diabetes           1098
2        2           0 Pregnancy-Induced    67
3        3           0 No-Diabetes        6100
4        4           0 Pre-Diabetes        133
5        7          NA <NA>                 12
6        9          NA <NA>                  2

2.5.8.4 DIABAGE2 and its cleanup to dm_age

DIABAGE2, the Age When Told Diabetic variable, is the response to “How old were you when you were told you have diabetes?” It is asked only of people with DIABETE3 = 1 (Yes).

  • The response is 1-97, with special values 98 for Don’t Know/Not Sure and 99 for refused, with BLANK for missing or not asked. People 97 years of age and above were listed as 97.
# A tibble: 6 x 3
  DIABAGE2 dm_age     n
     <dbl>  <dbl> <int>
1       84     84     1
2       85     85     2
3       90     90     1
4       98     NA    61
5       99     NA     4
6       NA     NA  6314

2.5.9 Arthritis Burden (4 items)

The first two measures are only asked of people with hx_arthr = 1, and are coded as:

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing

and we’ll recode them to 1 = Yes, 0 = No, otherwise NA, as we’ve done previously.

2.5.9.1 LMTJOIN3 (Limited because of joint symptoms), and its cleanup to arth_lims

This is the response to “Are you now limited in any way in any of your usual activities because of arthritis or joint symptoms?”

# A tibble: 6 x 4
  hx_arthr LMTJOIN3 arth_lims     n
     <dbl>    <dbl>     <dbl> <int>
1        0       NA        NA  4587
2        1        1         1  1378
3        1        2         0  1388
4        1        7        NA    17
5        1        9        NA     2
6       NA       NA        NA    40

2.5.9.2 ARTHDIS2 (Does Arthritis Affect Whether You Work), and its cleanup to arth_work

This is the response to “Do arthritis or joint symptoms now affect whether you work, the type of work you do or the amount of work you do?”

# A tibble: 5 x 3
  ARTHDIS2 arth_work     n
     <dbl>     <dbl> <int>
1        1         1   925
2        2         0  1808
3        7        NA    42
4        9        NA    10
5       NA        NA  4627

2.5.9.3 ARTHSOCL (Social Activities Limited Because of Joint Symptoms) and its cleanup to arth_soc

This is the response to “During the past 30 days, to what extent has your arthritis or joint symptoms interfered with your normal social activities, such as going shopping, to the movies, or to religious or social gatherings?”

The responses are:

  • 1 = A lot
  • 2 = A little
  • 3 = Not at all
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing
Warning: Factor `arth_soc` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 6 x 3
  ARTHSOCL arth_soc       n
     <dbl> <fct>      <int>
1        1 A lot        606
2        2 A little     734
3        3 Not at all  1427
4        7 <NA>          15
5        9 <NA>           3
6       NA <NA>        4627

2.5.9.4 JOINPAI1 (How Bad Was Joint Pain - scale of 0-10) and its cleanup to joint_pain

This is the response to the following question: “Please think about the past 30 days, keeping in mind all of your joint pain or aching and whether or not you have taken medication. On a scale of 0 to 10 where 0 is no pain or aching and 10 is pain or aching as bad as it can be, DURING THE PAST 30 DAYS, how bad was your joint pain ON AVERAGE?”

The available values are 0-10, plus codes 77 (Don’t Know / Not Sure), 99 (Refused) and BLANK.

To clean up JOINPAI1 to a new variable called joint_pain, we’ll need to convince R that the 77 and 99 values are, like BLANK, in fact best interpreted as NA.

# A tibble: 6 x 3
  JOINPAI1 joint_pain     n
     <dbl>      <dbl> <int>
1        8          8   277
2        9          9    72
3       10         10   158
4       77         NA    28
5       99         NA     5
6       NA         NA  4627

2.5.10 Demographics (25 items)

2.5.10.1 _AGEG5YR, which we’ll edit into agegroup

The _AGEG5YR variable is a calculated variable (by CDC) obtained from the subject’s age. Since the age data are not available, we instead get these groupings, which we’ll rearrange into the agegroup factor.

_AGEG5YR  Age range                       agegroup
       1  18 <= AGE <= 24                 18-24
       2  25 <= AGE <= 29                 25-29
       3  30 <= AGE <= 34                 30-34
       4  35 <= AGE <= 39                 35-39
       5  40 <= AGE <= 44                 40-44
       6  45 <= AGE <= 49                 45-49
       7  50 <= AGE <= 54                 50-54
       8  55 <= AGE <= 59                 55-59
       9  60 <= AGE <= 64                 60-64
      10  65 <= AGE <= 69                 65-69
      11  70 <= AGE <= 74                 70-74
      12  75 <= AGE <= 79                 75-79
      13  AGE >= 80                       80plus
      14  Don’t Know, Refused or Missing  NA
Warning: Factor `agegroup` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 14 x 3
   `_AGEG5YR` agegroup     n
        <dbl> <fct>    <int>
 1          1 18-24      448
 2          2 25-29      327
 3          3 30-34      375
 4          4 35-39      446
 5          5 40-44      426
 6          6 45-49      509
 7          7 50-54      604
 8          8 55-59      786
 9          9 60-64      837
10         10 65-69      810
11         11 70-74      685
12         12 75-79      499
13         13 80-96      592
14         14 <NA>        68

2.5.10.2 _MRACE1 recoded to race

We’ll create three variables describing race/ethnicity. The first comes from the _MRACE1 variable categorized by CDC, and the available responses are:

  • 1 = White only
  • 2 = Black or African-American only
  • 3 = American Indian or Alaskan Native only
  • 4 = Asian only
  • 5 = Native Hawaiian or Pacific Islander only
  • 6 = Other race only
  • 7 = Multiracial
  • 77 = Don’t know / Not Sure
  • 99 = Refused
  • BLANK = Missing

We’ll create a factor out of this information, with appropriate level names.

Warning: Factor `race` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 9 x 3
  `_MRACE1` race                       n
      <dbl> <fct>                  <int>
1         1 White                   6177
2         2 Black or African A       739
3         3 Amer Indian or Alaskan    66
4         4 Asian                    115
5         5 Hawaiian or Pac Island     5
6         6 Other Race                43
7         7 Multiracial              153
8        77 <NA>                      14
9        99 <NA>                     100

2.5.10.3 _HISPANC recoded to hispanic

The _HISPANC variable specifies whether or not the respondent is of Hispanic or Latinx origin. The available responses are:

  • 1 = Hispanic, Latinx or Spanish origin
  • 2 = Not of Hispanic, Latinx or Spanish origin
  • 9 = Don’t Know, Refused, or Missing

We’ll turn the 9s into NA, and create an indicator variable (1 = Hispanic or Latinx, 0 = not)

# A tibble: 3 x 3
  `_HISPANC` hispanic     n
       <dbl>    <dbl> <int>
1          1        1   146
2          2        0  7217
3          9       NA    49

2.5.10.4 _RACEGR3 recoded to race_eth

The _RACEGR3 variable is a five-level combination of race and ethnicity. The responses are:

  • 1 = White non-Hispanic
  • 2 = Black non-Hispanic
  • 3 = Other race non-Hispanic
  • 4 = Multiracial non-Hispanic
  • 5 = Hispanic
  • 9 = Don’t Know / Not Sure / Refused

We’ll create a factor out of this information, with appropriate level names.

Warning: Factor `race_eth` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 6 x 3
  `_RACEGR3` race_eth                     n
       <dbl> <fct>                    <int>
1          1 White non-Hispanic        6086
2          2 Black non-Hispanic         725
3          3 Other race non-Hispanic    193
4          4 Multiracial non-Hispanic   143
5          5 Hispanic                   146
6          9 <NA>                       119

2.5.10.5 SEX recoded to female

The available levels of SEX are:

  • 1 = Male
  • 2 = Female
  • 9 = Refused

We’ll recode that to female = 1 for Female, 0 for Male, and otherwise NA. The trick here is to subtract one from the coded SEX to get the desired female, but this requires that we first move the 9s to NA, so a refusal doesn’t become female = 8.
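A sketch of that trick on a toy vector:

```r
library(dplyr)

SEX <- c(1, 2, 9)  # toy data: Male, Female, Refused

# move the 9s to NA first, then subtract one from the remaining codes
female <- if_else(SEX == 9, NA_real_, SEX - 1)

female  # 0 1 NA
```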

# A tibble: 2 x 3
    SEX female     n
  <dbl>  <dbl> <int>
1     1      0  3136
2     2      1  4276

2.5.10.6 MARITAL status, revised to marital

The available levels of MARITAL are:

  • 1 = Married
  • 2 = Divorced
  • 3 = Widowed
  • 4 = Separated
  • 5 = Never married
  • 6 = A member of an unmarried couple
  • 9 = Refused
  • BLANK = Not asked or missing

We’ll just turn this into a factor, and move 9 to NA.

Warning: Factor `marital` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 7 x 3
  MARITAL marital              n
    <dbl> <fct>            <int>
1       1 Married           3668
2       2 Divorced          1110
3       3 Widowed            978
4       4 Separated          142
5       5 Never_Married     1248
6       6 Unmarried_Couple   208
7       9 <NA>                58

2.5.10.7 EDUCA recoded to educgroup

The available levels of EDUCA (Education Level) are responses to: “What is the highest grade or year of school you completed?”

  • 1 = Never attended school or only kindergarten
  • 2 = Grades 1 through 8 (Elementary)
  • 3 = Grades 9 through 11 (Some high school)
  • 4 = Grade 12 or GED (High school graduate)
  • 5 = College 1 year to 3 years (Some college or technical school)
  • 6 = College 4 years or more (College graduate)
  • 9 = Refused
  • BLANK = Not asked or missing

We’ll just turn this into a factor, and move 9 to NA.

Warning: Factor `educgroup` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 7 x 3
  EDUCA educgroup        n
  <dbl> <fct>        <int>
1     1 Kindergarten     3
2     2 Elementary     117
3     3 Some_HS        332
4     4 HS_Grad       2209
5     5 Some_College  2079
6     6 College_Grad  2646
7     9 <NA>            26

2.5.10.8 RENTHOM1 recoded to home_own

The available levels of RENTHOM1 (Own or Rent Home) are responses to: “Do you own or rent your home? (Home is defined as the place where you live most of the time/the majority of the year.)”

  • 1 = Own
  • 2 = Rent
  • 3 = Other Arrangement
  • 7 = Don’t know/Not Sure
  • 9 = Refused
  • BLANK = Not asked or missing

We’ll recode this as home_own = 1 if they own their home, and 0 otherwise, while dealing with missingness properly.

# A tibble: 5 x 3
  RENTHOM1 home_own     n
     <dbl>    <dbl> <int>
1        1        1  5216
2        2        0  1793
3        3        0   348
4        7       NA    28
5        9       NA    27

2.5.10.9 CPDEMO1A and its cleanup to cell_own

CPDEMO1A is the response to “Including phones for business and personal use, do you have a cell phone for personal use?”

Available responses are:

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing

We’ll recode them to 1 = Yes, 0 = No, and otherwise NA, as we’ve done previously.
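
This generic 1 = Yes / 2 = No cleanup can be sketched with case_when, whose unmatched cases default to NA (toy data below):

```r
library(dplyr)

demo <- tibble(CPDEMO1A = c(1, 2, 7, 9, NA))

demo <- demo %>%
    mutate(cell_own = case_when(CPDEMO1A == 1 ~ 1,   # Yes
                                CPDEMO1A == 2 ~ 0))  # No; 7, 9, BLANK become NA
```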

# A tibble: 5 x 3
  CPDEMO1A cell_own     n
     <dbl>    <dbl> <int>
1        1        1  2930
2        2        0   698
3        7       NA     2
4        9       NA    19
5       NA       NA  3763

2.5.10.10 VETERAN3 and its cleanup to veteran

VETERAN3, the Are You A Veteran variable, is the response to “Have you ever served on active duty in the United States Armed Forces, either in the regular military or in a National Guard or military reserve unit? (Active duty does not include training for the Reserves or National Guard, but DOES include activation, for example, for the Persian Gulf War.)”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing
# A tibble: 3 x 3
  VETERAN3 veteran     n
     <dbl>   <dbl> <int>
1        1       1   927
2        2       0  6479
3        9      NA     6

2.5.10.11 EMPLOY1 and its cleanup to employment

EMPLOY1, the Employment Status variable, is the response to “Are you currently … ?”

  • 1 = Employed for wages
  • 2 = Self-employed
  • 3 = Out of work for 1 year or more
  • 4 = Out of work for less than 1 year
  • 5 = A homemaker
  • 6 = A student
  • 7 = Retired
  • 8 = Unable to work
  • 9 = Refused
  • BLANK = Not asked or missing

We’ll just turn this into a factor, and move 9 to NA.

Warning: Factor `employment` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 9 x 3
  EMPLOY1 employment                n
    <dbl> <fct>                 <int>
1       1 Employed_for_wages     3119
2       2 Self-employed           466
3       3 Outofwork_1yearormore   254
4       4 Outofwork_lt1year       134
5       5 Homemaker               411
6       6 Student                 190
7       7 Retired                2202
8       8 Unable_to_work          603
9       9 <NA>                     33

2.5.10.12 CHILDREN and its cleanup to kids

CHILDREN, the Number of Children in Household variable, is the response to “How many children less than 18 years of age live in your household?”

  • 1-87 = legitimate responses
  • 88 = None
  • 99 = Refused
  • BLANK = Not asked or missing
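
We’ll treat 88 as zero children, keep the counts 1-87 as given, and move 99 and BLANK to NA. A plausible sketch on toy data:

```r
library(dplyr)

demo <- tibble(CHILDREN = c(2, 88, 99, NA))

demo <- demo %>%
    mutate(kids = case_when(
        CHILDREN == 88 ~ 0,                         # 88 means no children
        CHILDREN >= 1 & CHILDREN <= 87 ~ CHILDREN   # legitimate counts
    ))                                              # 99 and BLANK become NA
```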
# A tibble: 6 x 3
  CHILDREN  kids     n
     <dbl> <dbl> <int>
1        6     6     7
2        7     7     5
3        8     8     2
4       12    12     1
5       88     0  5449
6       99    NA    43

2.5.10.13 INCOME2 to incomegroup

The available levels of INCOME2 (Income Level) are responses to: “Is your annual household income from all sources …”

  • 1 = Less than $10,000
  • 2 = $10,000 to less than $15,000
  • 3 = $15,000 to less than $20,000
  • 4 = $20,000 to less than $25,000
  • 5 = $25,000 to less than $35,000
  • 6 = $35,000 to less than $50,000
  • 7 = $50,000 to less than $75,000
  • 8 = $75,000 or more
  • 77 = Don’t know/Not sure
  • 99 = Refused
  • BLANK = Not asked or missing

We’ll just turn this into a factor, and move 77 and 99 to NA.

Warning: Factor `incomegroup` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 11 x 3
   INCOME2 incomegroup     n
     <dbl> <fct>       <int>
 1       1 0-9K          285
 2       2 10-14K        306
 3       3 15-19K        477
 4       4 20-24K        589
 5       5 25-34K        685
 6       6 35-49K        922
 7       7 50-74K        928
 8       8 75K+         1910
 9      77 <NA>          610
10      99 <NA>          678
11      NA <NA>           22

2.5.10.14 INTERNET and its cleanup to internet30

INTERNET, the Internet use in the past 30 days variable, is the response to “Have you used the internet in the past 30 days?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing
# A tibble: 5 x 3
  INTERNET internet30     n
     <dbl>      <dbl> <int>
1        1          1  6020
2        2          0  1335
3        7         NA    10
4        9         NA    10
5       NA         NA    37

2.5.10.15 WTKG3 is weight_kg

WTKG3 is computed by CDC as the respondent’s weight in kilograms, with two implied decimal places. We calculate the actual weight in kg with the following:
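
Since the chunk itself isn’t shown, here is a sketch on toy data:

```r
library(dplyr)

demo <- tibble(WTKG3 = c(19051, NA))

demo <- demo %>%
    mutate(weight_kg = WTKG3 / 100)  # insert the two implied decimal places
```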

# A tibble: 6 x 3
  WTKG3 weight_kg     n
  <dbl>     <dbl> <int>
1 19051      191.     1
2 19278      193.     1
3 19504      195.     1
4 20412      204.     2
5 20865      209.     1
6    NA       NA    462

2.5.10.16 HEIGHT3 is replaced with height_m

HEIGHT3 is strangely gathered to allow people to specify their height in either feet and inches or in meters and centimeters.

  • 200-711 indicates height in feet (first digit) and inches (second two digits)
  • 9000 - 9998 indicates height in meters (second digit) and centimeters (last two digits)
  • 7777 = Don’t know/Not sure
  • 9999 = Refused

Note that there is one impossible value of 575 in the data set. We’ll make that an NA, and we’ll also make NA any heights below 3 feet, or above 2.24 meters. Specifically, we calculate the actual height in meters, with the following:
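
One plausible way to implement this conversion (the real chunk isn’t echoed; the toy data below use %/% and %% to split the feet and inches digits):

```r
library(dplyr)

demo <- tibble(HEIGHT3 = c(508, 9175, 575, 7777, 9999))

demo <- demo %>%
    mutate(height_m = case_when(
        # feet-and-inches codes: first digit feet, last two digits inches
        HEIGHT3 >= 300 & HEIGHT3 <= 711 & HEIGHT3 %% 100 < 12 ~
            round((12 * (HEIGHT3 %/% 100) + HEIGHT3 %% 100) * 0.0254, 2),
        # metric codes: 9000 + height in centimeters, capped at 2.24 m
        HEIGHT3 >= 9000 & HEIGHT3 <= 9224 ~ (HEIGHT3 - 9000) / 100))
    # 575 (impossible inches), 7777, 9999, heights under 3 feet,
    # and BLANK all fall through to NA
```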

# A tibble: 6 x 3
  HEIGHT3 height_m     n
    <dbl>    <dbl> <int>
1     607     2.01     2
2     608     2.03     6
3     609     2.06     1
4    7777    NA       27
5    9999    NA       86
6      NA    NA       67

2.5.10.17 bmi is calculated from height_m and weight_kg

We’ll calculate body-mass index from height and weight.
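
The calculation itself is weight in kilograms divided by squared height in meters; a sketch on toy data:

```r
library(dplyr)

demo <- tibble(height_m = 1.42, weight_kg = 45.4)

demo <- demo %>%
    mutate(bmi = weight_kg / height_m^2)  # kg / m^2
```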

# A tibble: 1,806 x 4
   height_m weight_kg   bmi     n
      <dbl>     <dbl> <dbl> <int>
 1     1.35      39.0  21.4     1
 2     1.35      52.2  28.6     1
 3     1.4       89.8  45.8     1
 4     1.42      31.8  15.8     1
 5     1.42      45.4  22.5     1
 6     1.42      55.8  27.7     1
 7     1.42      58.5  29.0     1
 8     1.42      59.9  29.7     1
 9     1.42      60.8  30.1     1
10     1.42      71.2  35.3     1
# ... with 1,796 more rows

2.5.10.18 bmigroup is calculated from bmi

We’ll then divide the respondents into adult BMI categories, in the usual way.

  • BMI < 18.5 indicates underweight
  • BMI from 18.5 up to 25 indicates normal weight
  • BMI from 25 up to 30 indicates overweight
  • BMI of 30 and higher indicates obesity
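
One plausible way to build these categories is with cut, bounding the outer intervals by the minimum and maximum observed BMI (toy data below):

```r
library(dplyr)

demo <- tibble(bmi = c(17, 22, 27, 35, NA))

demo <- demo %>%
    mutate(bmigroup = cut(bmi,
                          breaks = c(13.3, 18.5, 25, 30, 75.5),
                          right = FALSE, include.lowest = TRUE))
    # intervals are closed on the left: [13.3,18.5), ..., [30,75.5]
```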
Warning: Factor `bmigroup` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 5 x 2
  bmigroup        n
  <fct>       <int>
1 [13.3,18.5)   119
2 [18.5,25.0)  2010
3 [25.0,30.0)  2447
4 [30.0,75.5]  2343
5 <NA>          493

2.5.10.19 PREGNANT and its cleanup to pregnant

PREGNANT, the Pregnancy Status variable, is the response to “To your knowledge, are you now pregnant?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing (includes SEX = male)
# A tibble: 5 x 3
  PREGNANT pregnant     n
     <dbl>    <dbl> <int>
1        1        1    41
2        2        0  1329
3        7       NA     3
4        9       NA     3
5       NA       NA  6036

2.5.10.20 DEAF and its cleanup to deaf

DEAF, the Are you deaf or do you have serious difficulty hearing variable, is the response to “Are you deaf or do you have serious difficulty hearing?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing
# A tibble: 5 x 3
   DEAF  deaf     n
  <dbl> <dbl> <int>
1     1     1   708
2     2     0  6551
3     7    NA    15
4     9    NA     4
5    NA    NA   134

2.5.10.21 BLIND and its cleanup to blind

BLIND, the Blind or Difficulty seeing variable, is the response to “Are you blind or do you have serious difficulty seeing, even when wearing glasses?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing
# A tibble: 5 x 3
  BLIND blind     n
  <dbl> <dbl> <int>
1     1     1   415
2     2     0  6834
3     7    NA    14
4     9    NA     1
5    NA    NA   148

2.5.10.22 DECIDE and its cleanup to decide

DECIDE, the Difficulty Concentrating or Remembering variable, is the response to “Because of a physical, mental, or emotional condition, do you have serious difficulty concentrating, remembering, or making decisions?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing
# A tibble: 5 x 3
  DECIDE decide     n
   <dbl>  <dbl> <int>
1      1      1   870
2      2      0  6348
3      7     NA    30
4      9     NA     2
5     NA     NA   162

2.5.10.23 DIFFWALK and its cleanup to diffwalk

DIFFWALK, the Difficulty Walking or Climbing Stairs variable, is the response to “Do you have serious difficulty walking or climbing stairs?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing
# A tibble: 5 x 3
  DIFFWALK diffwalk     n
     <dbl>    <dbl> <int>
1        1        1  1482
2        2        0  5738
3        7       NA    19
4        9       NA     2
5       NA       NA   171

2.5.10.24 DIFFDRES and its cleanup to diffdress

DIFFDRES, the Difficulty Dressing or Bathing variable, is the response to “Do you have difficulty dressing or bathing?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing
# A tibble: 5 x 3
  DIFFDRES diffdress     n
     <dbl>     <dbl> <int>
1        1         1   352
2        2         0  6868
3        7        NA    12
4        9        NA     1
5       NA        NA   179

2.5.10.25 DIFFALON and its cleanup to diffalone

DIFFALON, the Difficulty Doing Errands Alone variable, is the response to “Because of a physical, mental, or emotional condition, do you have difficulty doing errands alone such as visiting a doctor’s office or shopping?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing
# A tibble: 5 x 3
  DIFFALON diffalone     n
     <dbl>     <dbl> <int>
1        1         1   636
2        2         0  6560
3        7        NA    15
4        9        NA     4
5       NA        NA   197

2.5.11 Tobacco Use (2 items)

2.5.11.1 SMOKE100 and its cleanup to smoke100

SMOKE100, the Smoked at Least 100 Cigarettes variable, is the response to “Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
# A tibble: 5 x 3
  SMOKE100 smoke100     n
     <dbl>    <dbl> <int>
1        1        1  3294
2        2        0  3881
3        7       NA    31
4        9       NA     4
5       NA       NA   202

2.5.11.2 _SMOKER3 and its cleanup to smoker

_SMOKER3 is a calculated variable which categorizes subjects by their smoking status:

  • 1 = Current smoker who smokes daily
  • 2 = Current smoker but not every day
  • 3 = Former smoker
  • 4 = Never smoked
  • 9 = Don’t Know / Refused / Missing

We’ll reclassify this as a factor with appropriate labels and NAs.

Warning: Factor `smoker` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 5 x 3
  `_SMOKER3` smoker                n
       <dbl> <fct>             <int>
1          1 Current_daily       990
2          2 Current_not_daily   300
3          3 Former             1999
4          4 Never              3881
5          9 <NA>                242

2.5.12 E-Cigarettes (2 items)

2.5.12.1 ECIGARET and its cleanup to ecig_ever

ECIGARET, the Ever used an e-cigarette variable, is the response to “Have you ever used an e-cigarette or other electronic vaping product, even just one time, in your entire life?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
# A tibble: 5 x 3
  ECIGARET ecig_ever     n
     <dbl>     <dbl> <int>
1        1         1  1354
2        2         0  5799
3        7        NA     9
4        9        NA     3
5       NA        NA   247

2.5.12.2 _ECIGSTS and its cleanup to ecigs

_ECIGSTS is a calculated variable which categorizes subjects by their e-cigarette use status:

  • 1 = Current and uses daily
  • 2 = Current user but not every day
  • 3 = Former user
  • 4 = Never used e-cigarettes
  • 9 = Don’t Know / Refused / Missing

We’ll reclassify this as a factor with appropriate labels and NAs.

Warning: Factor `ecigs` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 5 x 3
  `_ECIGSTS` ecigs                 n
       <dbl> <fct>             <int>
1          1 Current_daily       102
2          2 Current_not_daily   165
3          3 Former             1085
4          4 Never              5799
5          9 <NA>                261

2.5.13 Alcohol Consumption (6 items)

2.5.13.1 ALCDAY5 and its cleanup to alcdays

ALCDAY5, the Days in past 30 had alcoholic beverage variable, is the response to “During the past 30 days, how many days per week or per month did you have at least one drink of any alcoholic beverage such as beer, wine, a malt beverage or liquor?”

  • 101-107 = # of days per week (101 = 1 day per week, 107 = 7 days per week)
  • 201-230 = # of days in past 30 days (201 = 1 day in last 30, 230 = 30 days in last 30)
  • 777 = Don’t know/Not sure
  • 888 = No drinks in past 30 days
  • 999 = Refused
  • BLANK = Not asked or Missing

We’re going to convert this to a single numeric value, counting days in the past 30. Answers given in days per week (codes 101-107) are converted, after rounding, to days in the past 30; answers given directly in days (codes 201-230) carry over; 888 becomes 0; and 777, 999, and BLANK become NA. This is a little bit of a mess, really, but we can do it.
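
A sketch of that conversion (a reconstruction, not the echoed chunk; toy data below):

```r
library(dplyr)

demo <- tibble(ALCDAY5 = c(103, 205, 888, 777, 999))

demo <- demo %>%
    mutate(alcdays = case_when(
        ALCDAY5 >= 101 & ALCDAY5 <= 107 ~ round((ALCDAY5 - 100) * 30 / 7),
        ALCDAY5 >= 201 & ALCDAY5 <= 230 ~ ALCDAY5 - 200,
        ALCDAY5 == 888 ~ 0))
    # 777, 999 and BLANK fall through to NA
```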

# A tibble: 39 x 3
   ALCDAY5 alcdays     n
     <dbl>   <dbl> <int>
 1     101       4   263
 2     102       9   197
 3     103      13   142
 4     104      17    76
 5     105      21    53
 6     106      26    18
 7     107      30   114
 8     201       1   621
 9     202       2   448
10     203       3   233
# ... with 29 more rows

2.5.13.2 AVEDRNK2 and its cleanup to avgdrinks

AVEDRNK2, the Avg alcoholic drinks per day in past 30 variable, is the response to “One drink is equivalent to a 12-ounce beer, a 5-ounce glass of wine, or a drink with one shot of liquor. During the past 30 days, on the days when you drank, about how many drinks did you drink on the average? (A 40 ounce beer would count as 3 drinks, or a cocktail drink with 2 shots would count as 2 drinks.)”

  • 1-76 = # of drinks per day
  • 77 = Don’t know/Not sure
  • 99 = Refused
  • BLANK = Not asked or Missing (always happens when ALCDAY5 = 777, 888 or 999)
# A tibble: 6 x 3
  AVEDRNK2 avgdrinks     n
     <dbl>     <dbl> <int>
1       42        42     1
2       60        60     2
3       76        76     1
4       77        NA    46
5       99        NA     5
6       NA        NA  3876

2.5.13.3 MAXDRNKS and its cleanup to maxdrinks

MAXDRNKS, the most drinks on a single occasion in the past 30 days variable, is the response to “During the past 30 days, what is the largest number of drinks you had on any occasion?”

  • 1-76 = # of drinks
  • 77 = Don’t know/Not sure
  • 99 = Refused
  • BLANK = Not asked or Missing (always happens when ALCDAY5 = 777, 888 or 999)
# A tibble: 6 x 3
  MAXDRNKS maxdrinks     n
     <dbl>     <dbl> <int>
1       42        42     1
2       48        48     1
3       76        76     2
4       77        NA    94
5       99        NA    11
6       NA        NA  3899

2.5.13.4 _RFBING5 and its cleanup to binge

_RFBING5 identifies binge drinkers (males having five or more drinks on one occasion, females having four or more drinks on one occasion in the past 30 days)

The values are

  • 1 = No
  • 2 = Yes
  • 9 = Don’t Know / Refused / Missing

People who reported no alcdays are reported here as “No”, so we’ll adjust this into an indicator variable, and create the necessary NAs.

# A tibble: 3 x 3
  `_RFBING5` binge     n
       <dbl> <dbl> <int>
1          1     0  6035
2          2     1  1000
3          9    NA   377

2.5.13.5 _DRNKWEK and its cleanup to drinks_wk

_DRNKWEK provides the computed number of alcoholic drinks per week, with two implied decimal places. The code 99900 is used for “Don’t know / Not sure / Refused / Missing” so we’ll fix that, and also divide by 100 to get an average with a decimal point.

Note: We’re also going to treat all results of 100 or more drinks per week as incorrect, and thus indicate them as missing data here.
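
Both fixes can be sketched together on toy data (na_if is from dplyr):

```r
library(dplyr)

demo <- tibble(`_DRNKWEK` = c(9333, 10000, 99900))

demo <- demo %>%
    mutate(drinks_wk = na_if(`_DRNKWEK`, 99900) / 100,          # implied decimals
           drinks_wk = ifelse(drinks_wk >= 100, NA, drinks_wk)) # implausible -> NA
```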

# A tibble: 12 x 3
   `_DRNKWEK` drinks_wk     n
        <dbl>     <dbl> <int>
 1       9333      93.3     2
 2      10000      NA       1
 3      10500      NA       2
 4      11667      NA       1
 5      14000      NA       2
 6      16800      NA       2
 7      17500      NA       1
 8      18200      NA       1
 9      28000      NA       1
10      29400      NA       1
11      53200      NA       1
12      99900      NA     379

2.5.13.6 _RFDRHV5 and its cleanup to drink_heavy

_RFDRHV5 identifies heavy drinkers (males having 14 or more drinks per week, females having 7 or more drinks per week)

The values are

  • 1 = No
  • 2 = Yes
  • 9 = Don’t Know / Refused / Missing

People who reported no alcdays are reported here as “No”, so we’ll adjust this into an indicator variable, and create the necessary NAs.

# A tibble: 3 x 3
  `_RFDRHV5` drink_heavy     n
       <dbl>       <dbl> <int>
1          1           0  6607
2          2           1   426
3          9          NA   379

2.5.14 Fruits and Vegetables (8 items)

2.5.14.1 _FRUTSU1 and its cleanup to fruit_day

_FRUTSU1 provides the computed number of fruit servings consumed per day, with two implied decimal places. We’ll divide by 100 to insert the decimal point.

Note: We’re also going to treat all results exceeding 16 servings per day as implausible, and thus indicate them as missing data here, following some CDC procedures.

# A tibble: 6 x 3
  `_FRUTSU1` fruit_day     n
       <dbl>     <dbl> <int>
1        913      9.13     1
2       1000     10        4
3       1400     14        1
4       3000     NA        1
5       7600     NA        1
6         NA     NA      555

2.5.14.2 _VEGESU1 and its cleanup to veg_day

_VEGESU1 provides the computed number of vegetable servings consumed per day, with two implied decimal places. We’ll divide by 100 to insert the decimal point.

Note: We’re also going to treat all results exceeding 23 servings per day as implausible, and thus indicate them as missing data here, following some CDC procedures.

# A tibble: 6 x 3
  `_VEGESU1` veg_day     n
       <dbl>   <dbl> <int>
1       1414    14.1     1
2       1603    16.0     1
3       1891    18.9     1
4       2167    21.7     1
5       3150    NA       1
6         NA    NA     666

2.5.14.3 FTJUDA2_ and its cleanup to eat_juice

FTJUDA2_ provides the servings of fruit juice consumed per day, with two implied decimal places. We’ll divide by 100 to insert the decimal point.

Note: We’re also going to treat all results exceeding 16 servings per day as implausible, and thus indicate them as missing data here.

# A tibble: 6 x 3
  FTJUDA2_ eat_juice     n
     <dbl>     <dbl> <int>
1      500         5     6
2      600         6     1
3      700         7     1
4     1200        12     1
5     7500        NA     1
6       NA        NA   469

2.5.14.4 FRUTDA2_ and its cleanup to eat_fruit

FRUTDA2_ provides the servings of fruit consumed per day, with two implied decimal places. We’ll divide by 100 to insert the decimal point.

Note: We’re also going to treat all results exceeding 16 servings per day as implausible, and thus indicate them as missing data here.

# A tibble: 6 x 3
  FRUTDA2_ eat_fruit     n
     <dbl>     <dbl> <int>
1      700         7     5
2      800         8     3
3      900         9     1
4     1000        10     1
5     3000        NA     1
6       NA        NA   456

2.5.14.5 GRENDA1_ and its cleanup to eat_greenveg

GRENDA1_ provides the servings of dark green vegetables consumed per day, with two implied decimal places. We’ll divide by 100 to insert the decimal point.

Note: We’re also going to treat all results exceeding 16 servings per day as implausible, and thus indicate them as missing data here.

# A tibble: 6 x 3
  GRENDA1_ eat_greenveg     n
     <dbl>        <dbl> <int>
1      700         7        4
2      786         7.86     1
3      800         8        2
4     2000        NA        1
5     3000        NA        1
6       NA        NA      447

2.5.14.6 FRNCHDA_ and its cleanup to eat_fries

FRNCHDA_ provides the servings of french fries consumed per day, with two implied decimal places. We’ll divide by 100 to insert the decimal point.

Note: We’re also going to treat all results exceeding 16 servings per day as implausible, and thus indicate them as missing data here.

# A tibble: 6 x 3
  FRNCHDA_ eat_fries     n
     <dbl>     <dbl> <int>
1      300      3        9
2      314      3.14     1
3      400      4        3
4      500      5        1
5      700      7        1
6       NA     NA      453

2.5.14.7 POTADA1_ and its cleanup to eat_potato

POTADA1_ provides the servings of potatoes consumed per day, with two implied decimal places. We’ll divide by 100 to insert the decimal point.

Note: We’re also going to treat all results exceeding 16 servings per day as implausible, and thus indicate them as missing data here.

# A tibble: 6 x 3
  POTADA1_ eat_potato     n
     <dbl>      <dbl> <int>
1      314       3.14     1
2      329       3.29     1
3      400       4        3
4      471       4.71     1
5      700       7        1
6       NA      NA      501

2.5.14.8 VEGEDA2_ and its cleanup to eat_otherveg

VEGEDA2_ provides the servings of other vegetables consumed per day, with two implied decimal places. We’ll divide by 100 to insert the decimal point.

Note: We’re also going to treat all results exceeding 16 servings per day as implausible, and thus indicate them as missing data here.

# A tibble: 6 x 3
  VEGEDA2_ eat_otherveg     n
     <dbl>        <dbl> <int>
1      600            6     3
2      700            7    11
3      800            8     1
4     1000           10     2
5     1100           11     1
6       NA           NA   509

2.5.15 Exercise and Physical Activity (8 items)

2.5.15.1 _TOTINDA and its cleanup to exerany

_TOTINDA, the Exercise in Past 30 Days variable, is the response to “During the past month, other than your regular job, did you participate in any physical activities or exercises such as running, calisthenics, golf, gardening, or walking for exercise?”

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused
  • BLANK = Not asked or missing

This is just like HLTHPLAN.

# A tibble: 3 x 3
  `_TOTINDA` exerany     n
       <dbl>   <dbl> <int>
1          1       1  4828
2          2       0  2137
3          9      NA   447

2.5.15.2 _PACAT1 and its cleanup to activity

_PACAT1 contains physical activity categories, estimated from responses to the BRFSS. The categories are:

  • 1 = Highly Active
  • 2 = Active
  • 3 = Insufficiently Active
  • 4 = Inactive
  • 9 = Don’t Know / Not Sure / Refused / Missing

So we’ll create a factor.

Warning: Factor `activity` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 5 x 3
  `_PACAT1` activity                  n
      <dbl> <fct>                 <int>
1         1 Highly_Active          2053
2         2 Active                 1132
3         3 Insufficiently_Active  1293
4         4 Inactive               2211
5         9 <NA>                    723

2.5.15.3 _PAINDX1 and its cleanup to rec_aerobic

_PAINDX1 indicates whether the respondent’s stated levels of physical activity meet recommendations for aerobic activity. The responses are:

  • 1 = Yes
  • 2 = No
  • 9 = Don’t know/Not sure/Refused/Missing
# A tibble: 3 x 3
  `_PAINDX1` rec_aerobic     n
       <dbl>       <dbl> <int>
1          1           1  3228
2          2           0  3504
3          9          NA   680

2.5.15.4 _PASTRNG and its cleanup to rec_strength

_PASTRNG indicates whether the respondent’s stated levels of physical activity meet recommendations for strength-building activity. The responses are:

  • 1 = Yes
  • 2 = No
  • 9 = Don’t know/Not sure/Refused/Missing
# A tibble: 3 x 3
  `_PASTRNG` rec_strength     n
       <dbl>        <dbl> <int>
1          1            1  1852
2          2            0  5004
3          9           NA   556

2.5.15.5 EXRACT11 and its cleanup to exer1_type

Respondents are asked “What type of physical activity or exercise did you spend the most time doing during the past month?” and these responses are gathered into a set of 76 named categories, including an “other” category. Codes 77 (Don’t Know / Not Sure) and 99 (Refused) are dropped into NA in my code below, and Code 98 (“Other type of activity”) remains. Then I went through the tedious work of converting the factor levels from numbers to names, following the value labels provided by BRFSS.

smart_ohio_raw <- smart_ohio_raw %>%
    mutate(exer1_type = factor(EXRACT11),
           exer1_type = fct_recode(
               exer1_type,
               "Active Gaming Devices" = "1",
               "Aerobics video or class" = "2",
               "Backpacking" = "3",
               "Badminton" = "4",
               "Basketball" = "5",
               "Bicycling machine" = "6",
               "Bicycling" = "7",
               "Boating" = "8",
               "Bowling" = "9",
               "Boxing" = "10",
               "Calisthenics" = "11",
               "Canoeing" = "12",
               "Carpentry" = "13",
               "Dancing" = "14",
               "Elliptical machine" = "15",
               "Fishing" = "16",
               "Frisbee" = "17",
               "Gardening" = "18",
               "Golf with cart" = "19",
               "Golf without cart" = "20",
               "Handball" = "21",
               "Hiking" = "22",
               "Hockey" = "23",
               "Horseback riding" = "24",
               "Hunting large game" = "25",
               "Hunting small game" = "26",
               "Inline skating" = "27",
               "Jogging" = "28",
               "Lacrosse" = "29",
               "Mountain climbing" = "30",
               "Mowing lawn" = "31",
               "Paddleball" = "32",
               "Painting house" = "33",
               "Pilates" = "34",
               "Racquetball" = "35",
               "Raking lawn" = "36",
               "Running" = "37",
               "Rock climbing" = "38",
               "Rope skipping" = "39",
               "Rowing machine" = "40",
               "Rugby" = "41",
               "Scuba diving" = "42",
               "Skateboarding" = "43",
               "Skating" = "44",
               "Sledding" = "45",
               "Snorkeling" = "46",
               "Snow blowing" = "47",
               "Snow shoveling" = "48",
               "Snow skiing" = "49",
               "Snowshoeing" = "50",
               "Soccer" = "51",
               "Softball/Baseball" = "52",
               "Squash" = "53",
               "Stair Climbing" = "54",
               "Stream fishing" = "55",
               "Surfing" = "56",
               "Swimming" = "57",
               "Swimming in laps" = "58",
               "Table tennis" = "59",
               "Tai Chi" = "60",
               "Tennis" = "61",
               "Touch football" = "62",
               "Volleyball" = "63",
               "Walking" = "64",
               "Waterskiing" = "66",
               "Weight lifting" = "67",
               "Wrestling" = "68",
               "Yoga" = "69",
               "Child Care" = "71",
               "Farm Work" = "72",
               "Household Activities" = "73",
               "Martial Arts" = "74",
               "Upper Body Cycle" = "75",
               "Yard Work" = "76",
               "Other Activities" = "98",
               NULL = "77", 
               NULL = "99")
    )
Warning: Unknown levels in `f`: 3, 17, 21, 32, 36, 41, 42, 45, 47, 53, 55, 56,
59
Warning: Factor `exer1_type` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 6 x 3
  EXRACT11 exer1_type           n
     <dbl> <fct>            <int>
1       75 Upper Body Cycle     6
2       76 Yard Work           78
3       77 <NA>                10
4       98 Other Activities   276
5       99 <NA>                 4
6       NA <NA>              2588

The most common activities are:

Warning: Factor `exer1_type` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 10 x 2
   exer1_type                  n
   <fct>                   <int>
 1 Walking                  2605
 2 <NA>                     2602
 3 Running                   324
 4 Other Activities          276
 5 Gardening                 242
 6 Weight lifting            189
 7 Aerobics video or class   103
 8 Bicycling machine         103
 9 Bicycling                  96
10 Golf with cart             90

2.5.15.6 EXRACT21 and its cleanup to exer2_type

As a follow-up, respondents are asked “What other type of physical activity gave you the next most exercise during the past month?” and these responses are also gathered into the same set of 76 named categories, including an “other” category, but now also adding a “No Other Activity” category (code 88). Codes 77 (Don’t Know / Not Sure) and 99 (Refused) are dropped into NA in my code below, and Code 98 (“Other type of activity”) remains. Then I went through the tedious work of converting the factor levels from numbers to names, following the value labels provided by BRFSS. I’m sure there’s a better way to do this.

smart_ohio_raw <- smart_ohio_raw %>%
    mutate(exer2_type = factor(EXRACT21),
           exer2_type = fct_recode(
               exer2_type,
               "Active Gaming Devices" = "1",
               "Aerobics video or class" = "2",
               "Backpacking" = "3",
               "Badminton" = "4",
               "Basketball" = "5",
               "Bicycling machine" = "6",
               "Bicycling" = "7",
               "Boating" = "8",
               "Bowling" = "9",
               "Boxing" = "10",
               "Calisthenics" = "11",
               "Canoeing" = "12",
               "Carpentry" = "13",
               "Dancing" = "14",
               "Elliptical machine" = "15",
               "Fishing" = "16",
               "Frisbee" = "17",
               "Gardening" = "18",
               "Golf with cart" = "19",
               "Golf without cart" = "20",
               "Handball" = "21",
               "Hiking" = "22",
               "Hockey" = "23",
               "Horseback riding" = "24",
               "Hunting large game" = "25",
               "Hunting small game" = "26",
               "Inline skating" = "27",
               "Jogging" = "28",
               "Lacrosse" = "29",
               "Mountain climbing" = "30",
               "Mowing lawn" = "31",
               "Paddleball" = "32",
               "Painting house" = "33",
               "Pilates" = "34",
               "Racquetball" = "35",
               "Raking lawn" = "36",
               "Running" = "37",
               "Rock climbing" = "38",
               "Rope skipping" = "39",
               "Rowing machine" = "40",
               "Rugby" = "41",
               "Scuba diving" = "42",
               "Skateboarding" = "43",
               "Skating" = "44",
               "Sledding" = "45",
               "Snorkeling" = "46",
               "Snow blowing" = "47",
               "Snow shoveling" = "48",
               "Snow skiing" = "49",
               "Snowshoeing" = "50",
               "Soccer" = "51",
               "Softball/Baseball" = "52",
               "Squash" = "53",
               "Stair Climbing" = "54",
               "Stream fishing" = "55",
               "Surfing" = "56",
               "Swimming" = "57",
               "Swimming in laps" = "58",
               "Table tennis" = "59",
               "Tai Chi" = "60",
               "Tennis" = "61",
               "Touch football" = "62",
               "Volleyball" = "63",
               "Walking" = "64",
               "Waterskiing" = "66",
               "Weight lifting" = "67",
               "Wrestling" = "68",
               "Yoga" = "69",
               "Child Care" = "71",
               "Farm Work" = "72",
               "Household Activities" = "73",
               "Martial Arts" = "74",
               "Upper Body Cycle" = "75",
               "Yard Work" = "76",
               "No Other Activity" = "88",
               "Other Activities" = "98",
               NULL = "77", 
               NULL = "99")
    )
Warning: Unknown levels in `f`: 3, 21, 30, 39, 41, 46, 50, 62
Warning: Factor `exer2_type` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 6 x 3
  EXRACT21 exer2_type            n
     <dbl> <fct>             <int>
1       76 Yard Work           153
2       77 <NA>                 26
3       88 No Other Activity  1854
4       98 Other Activities    246
5       99 <NA>                 19
6       NA <NA>               2627
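
One less tedious alternative (a sketch, not the code used in these Notes) keeps the code-to-label pairs in a named lookup vector and indexes into it, so any code missing from the table (like 77 and 99) becomes NA automatically:

```r
library(tidyverse)

# Hypothetical lookup vector: BRFSS code as name, label as value.
# A complete version would list all of the value labels shown above.
exract_labels <- c("2"  = "Aerobics video or class",
                   "64" = "Walking",
                   "88" = "No Other Activity",
                   "98" = "Other Activities")

smart_ohio_raw <- smart_ohio_raw %>%
    mutate(exer2_type = factor(exract_labels[as.character(EXRACT21)],
                               levels = unname(exract_labels)))
```

Indexing a named vector by character returns NA for unmatched codes, which handles the Don’t Know and Refused responses without listing them explicitly.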

The most common activity types in this group are:

Warning: Factor `exer2_type` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 10 x 2
   exer2_type               n
   <fct>                <int>
 1 <NA>                  2672
 2 No Other Activity     1854
 3 Walking                629
 4 Weight lifting         272
 5 Other Activities       246
 6 Gardening              202
 7 Household Activities   169
 8 Yard Work              153
 9 Running                148
10 Bicycling              118

2.5.15.7 _MINAC11 and its cleanup to exer1_min

_MINAC11 is minutes of physical activity per week for the first activity (listed as exer1_type above). Since there are only 10,080 minutes in a week, we’ll treat as implausible any value larger than 4200 minutes (which would indicate 70 hours per week).
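
The cleanup rule might look like this sketch (the actual code is suppressed in these Notes; the backticks are needed because the raw name begins with an underscore):

```r
# Treat weekly activity minutes above 4200 (70 hours) as implausible (NA).
smart_ohio_raw <- smart_ohio_raw %>%
    mutate(exer1_min = ifelse(`_MINAC11` > 4200, NA, `_MINAC11`))
```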

# A tibble: 6 x 3
  `_MINAC11` exer1_min     n
       <dbl>     <dbl> <int>
1       3780      3780     8
2       3959      3959     1
3       3960      3960     1
4       4193      4193     6
5      27000        NA     1
6         NA        NA  2760

2.5.15.8 _MINAC21 and its cleanup to exer2_min

_MINAC21 is minutes of physical activity per week for the second activity (listed as exer2_type above). Again, we’ll treat as implausible any value larger than 4200 minutes (70 hours per week).

# A tibble: 6 x 3
  `_MINAC21` exer2_min     n
       <dbl>     <dbl> <int>
1       3360      3360     3
2       3780      3780     7
3       4193      4193     3
4       6120        NA     1
5       8400        NA     1
6         NA        NA  2770

2.5.16 Seatbelt Use (1 item)

2.5.16.1 SEATBELT and its cleanup to seatbelt

This question asks “How often do you use seat belts when you drive or ride in a car?” Possible responses are:

  • 1 = Always
  • 2 = Nearly always
  • 3 = Sometimes
  • 4 = Seldom
  • 5 = Never
  • 7 = Don’t know / Not sure
  • 8 = Never drive or ride in a car
  • 9 = Refused

We’ll treat codes 7, 8 and 9 as NA, and turn this into a factor.
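
A sketch of that cleanup, following the same fct_recode pattern used for exer2_type above:

```r
# Recode SEATBELT to a factor; codes 7, 8 and 9 become NA via NULL.
smart_ohio_raw <- smart_ohio_raw %>%
    mutate(seatbelt = factor(SEATBELT),
           seatbelt = fct_recode(seatbelt,
                                 "Always"        = "1",
                                 "Nearly_always" = "2",
                                 "Sometimes"     = "3",
                                 "Seldom"        = "4",
                                 "Never"         = "5",
                                 NULL = "7", NULL = "8", NULL = "9"))
```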

Warning: Factor `seatbelt` contains implicit NA, consider using
`forcats::fct_explicit_na`
# A tibble: 9 x 3
  SEATBELT seatbelt          n
     <dbl> <fct>         <int>
1        1 Always         6047
2        2 Nearly_always   409
3        3 Sometimes       191
4        4 Seldom           81
5        5 Never           148
6        7 <NA>              7
7        8 <NA>             21
8        9 <NA>              2
9       NA <NA>            506

2.5.17 Immunization (3 items)

2.5.17.1 FLUSHOT6 and its cleanup to vax_flu

FLUSHOT6 gives the response to “During the past 12 months, have you had either a flu shot or a flu vaccine that was sprayed in your nose?” The responses are:

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused

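
Judging from the counts that follow, vax_flu is a 1/0 numeric indicator, with codes 7 and 9 converted to NA. A sketch of that cleanup:

```r
# 1 = Yes -> 1, 2 = No -> 0; all other codes fall through to NA.
smart_ohio_raw <- smart_ohio_raw %>%
    mutate(vax_flu = case_when(FLUSHOT6 == 1 ~ 1,
                               FLUSHOT6 == 2 ~ 0))
```

The same pattern would apply to vax_pneumo, vax_shingles, hiv_test and hiv_risk below.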
# A tibble: 5 x 3
  FLUSHOT6 vax_flu     n
     <dbl>   <dbl> <int>
1        1       1  3453
2        2       0  3410
3        7      NA    26
4        9      NA     3
5       NA      NA   520

2.5.17.2 PNEUVAC3 and its cleanup to vax_pneumo

PNEUVAC3 gives the response to “A pneumonia shot or pneumococcal vaccine is usually given only once or twice in a person’s lifetime and is different from the flu shot. Have you ever had a pneumonia shot?” The responses are:

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused

# A tibble: 5 x 3
  PNEUVAC3 vax_pneumo     n
     <dbl>      <dbl> <int>
1        1          1  3112
2        2          0  3262
3        7         NA   509
4        9         NA     3
5       NA         NA   526

2.5.17.3 SHINGLE2 and its cleanup to vax_shingles

SHINGLE2 gives the response to “Have you ever had the shingles or zoster vaccine?” The responses are:

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused

# A tibble: 4 x 3
  SHINGLE2 vax_shingles     n
     <dbl>        <dbl> <int>
1        1            1  1503
2        2            0  2979
3        7           NA    78
4       NA           NA  2852

2.5.18 HIV/AIDS (2 items)

2.5.18.1 HIVTST6 and its cleanup to hiv_test

HIVTST6 gives the response to “Have you ever been tested for HIV? Do not count tests you may have had as part of a blood donation. Include testing fluid from your mouth.” The responses are:

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused

# A tibble: 5 x 3
  HIVTST6 hiv_test     n
    <dbl>    <dbl> <int>
1       1        1  2017
2       2        0  4565
3       7       NA   260
4       9       NA    14
5      NA       NA   556

2.5.18.2 HIVRISK5 and its cleanup to hiv_risk

HIVRISK5 gives the response to “I am going to read you a list. When I am done, please tell me if any of the situations apply to you. You do not need to tell me which one. You have injected any drug other than those prescribed for you in the past year. You have been treated for a sexually transmitted disease or STD in the past year. You have given or received money or drugs in exchange for sex in the past year.” The responses are:

  • 1 = Yes
  • 2 = No
  • 7 = Don’t know/Not sure
  • 9 = Refused

# A tibble: 5 x 3
  HIVRISK5 hiv_risk     n
     <dbl>    <dbl> <int>
1        1        1   277
2        2        0  6537
3        7       NA     2
4        9       NA    17
5       NA       NA   579

2.6 Imputing Age and Income as Quantitative from Thin Air

This section is purely for teaching purposes. I would never use the variables created in this section for research work.

2.6.1 age_imp: Imputing Age Data

I want a quantitative age variable, so I’m going to create an imputed age_imp value for each subject based on their agegroup. Within each age group, I will assume that every age the group represents is equally likely, and I will draw from that discrete uniform distribution to impute age.
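
A sketch of one way to do this (hypothetical; the actual code is suppressed in these Notes) parses the bounds out of agegroup labels like "18-24" and draws uniformly between them:

```r
library(tidyverse)

# Pull the lower and upper bound from labels like "18-24", then draw a
# uniform integer age between them. An open-ended top group (no upper
# bound in the label) would need its own rule, and is left NA here.
set.seed(20191015)   # arbitrary seed, for reproducibility
smart_ohio_raw <- smart_ohio_raw %>%
    mutate(age_lo  = as.numeric(str_extract(as.character(agegroup), "^\\d+")),
           age_hi  = as.numeric(str_extract(as.character(agegroup), "\\d+$")),
           age_imp = if_else(is.na(age_lo), NA_real_,
                             floor(runif(n(), age_lo, age_hi + 1)))) %>%
    select(-age_lo, -age_hi)
```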

# A tibble: 80 x 3
   agegroup age_imp     n
   <fct>      <dbl> <int>
 1 18-24         18    46
 2 18-24         19    75
 3 18-24         20    76
 4 18-24         21    82
 5 18-24         22    80
 6 18-24         23    54
 7 18-24         24    35
 8 25-29         25    42
 9 25-29         26    93
10 25-29         27    77
# ... with 70 more rows

Here is a histogram of the age_imp variable.

2.6.2 inc_imp: Imputing Income Data

I want a quantitative income variable, so I’m going to create an imputed inc_imp value for each subject based on their incomegroup. For most income groups, I will assume that every income the group represents is equally likely, and I will draw from that uniform distribution to impute income. The exception is the highest income group, where I will impute values drawn from a distribution that places all values at $75,000 or more, but has substantial right skew and a long tail.
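
One way to get that long right tail (a hypothetical sketch, not the code actually used; the rate parameter is arbitrary) is to shift an exponential draw up to the $75,000 floor, assuming inc_imp already holds the uniform draws for the other groups:

```r
# For the "75K+" group, impute $75,000 plus a right-skewed exponential draw.
set.seed(20191015)
smart_ohio_raw <- smart_ohio_raw %>%
    mutate(inc_imp = if_else(incomegroup == "75K+",
                             75000 + rexp(n(), rate = 1/40000),
                             inc_imp))
```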

# A tibble: 6 x 3
  incomegroup inc_imp     n
  <fct>         <dbl> <int>
1 75K+         774009     1
2 75K+         798174     1
3 75K+         806161     1
4 75K+         847758     1
5 75K+        1085111     1
6 <NA>             NA  1310

Here are density plots of the inc_imp variable. The top picture shows the results on a linear scale, and the bottom shows them on a log (base 10) scale.

2.7 Clean Data in the State of Ohio

There are six MMSAs associated with the state of Ohio. We’re going to create a smart_ohio tibble that includes all of them. First, I’ll ungroup the data I created earlier, so that I get a clean tibble.

Next, I’ll select the variables I want to retain (the ones I created, plus SEQNO).
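
The two steps just described might look like this sketch, which assumes the raw BRFSS items are the ones with upper-case names and the variables created in this chapter are lower-case:

```r
# Keep SEQNO plus the lower-case variables created above.
created_vars <- names(smart_ohio_raw)[
    names(smart_ohio_raw) == tolower(names(smart_ohio_raw))]

smart_ohio <- smart_ohio_raw %>%
    ungroup() %>%
    select(SEQNO, all_of(created_vars))
```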

The smart_ohio file should contain 99 variables, describing 7412 respondents.

2.7.1 Summary of the smart_ohio file

     SEQNO               mmsa             mmsa_code      mmsa_name        
 Min.   :2.017e+09   Length:7412        Min.   :17140   Length:7412       
 1st Qu.:2.017e+09   Class :character   1st Qu.:17460   Class :character  
 Median :2.017e+09   Mode  :character   Median :18140   Mode  :character  
 Mean   :2.017e+09                      Mean   :22073                     
 3rd Qu.:2.017e+09                      3rd Qu.:26580                     
 Max.   :2.017e+09                      Max.   :45780                     
                                                                          
    mmsa_wt           completed         landline         hhadults     
 Min.   :   8.531   Min.   :0.0000   Min.   :0.0000   Min.   : 1.000  
 1st Qu.: 261.648   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.: 2.000  
 Median : 565.210   Median :1.0000   Median :0.0000   Median : 2.000  
 Mean   : 842.477   Mean   :0.8469   Mean   :0.4923   Mean   : 2.155  
 3rd Qu.:1111.718   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.: 2.000  
 Max.   :8573.596   Max.   :1.0000   Max.   :1.0000   Max.   :11.000  
                                                      NA's   :6254    
       genhealth      physhealth       menthealth       poorhealth    
 1_Excellent:1057   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
 2_VeryGood :2406   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000  
 3_Good     :2367   Median : 0.000   Median : 0.000   Median : 0.000  
 4_Fair     :1139   Mean   : 4.975   Mean   : 4.044   Mean   : 5.716  
 5_Poor     : 428   3rd Qu.: 4.000   3rd Qu.: 3.000   3rd Qu.: 6.250  
 NA's       :  15   Max.   :30.000   Max.   :30.000   Max.   :30.000  
                    NA's   :138      NA's   :114      NA's   :3412    
    agegroup       age_imp                          race         hispanic      
 60-64  : 837   Min.   :18.00   White                 :6177   Min.   :0.00000  
 65-69  : 810   1st Qu.:42.00   Black or African A    : 739   1st Qu.:0.00000  
 55-59  : 786   Median :58.00   Multiracial           : 153   Median :0.00000  
 70-74  : 685   Mean   :55.93   Asian                 : 115   Mean   :0.01983  
 50-54  : 604   3rd Qu.:69.00   Amer Indian or Alaskan:  66   3rd Qu.:0.00000  
 (Other):3622   Max.   :96.00   (Other)               :  48   Max.   :1.00000  
 NA's   :  68   NA's   :68      NA's                  : 114   NA's   :49       
                     race_eth        female                   marital    
 White non-Hispanic      :6086   Min.   :0.0000   Married         :3668  
 Black non-Hispanic      : 725   1st Qu.:0.0000   Divorced        :1110  
 Other race non-Hispanic : 193   Median :1.0000   Widowed         : 978  
 Multiracial non-Hispanic: 143   Mean   :0.5769   Separated       : 142  
 Hispanic                : 146   3rd Qu.:1.0000   Never_Married   :1248  
 NA's                    : 119   Max.   :1.0000   Unmarried_Couple: 208  
                                                  NA's            :  58  
      kids                educgroup       home_own        veteran      
 Min.   : 0.0000   Kindergarten:   3   Min.   :0.000   Min.   :0.0000  
 1st Qu.: 0.0000   Elementary  : 117   1st Qu.:0.000   1st Qu.:0.0000  
 Median : 0.0000   Some_HS     : 332   Median :1.000   Median :0.0000  
 Mean   : 0.5036   HS_Grad     :2209   Mean   :0.709   Mean   :0.1252  
 3rd Qu.: 1.0000   Some_College:2079   3rd Qu.:1.000   3rd Qu.:0.0000  
 Max.   :12.0000   College_Grad:2646   Max.   :1.000   Max.   :1.0000  
 NA's   :43        NA's        :  26   NA's   :55      NA's   :6       
              employment    incomegroup      inc_imp           cell_own    
 Employed_for_wages:3119   75K+   :1910   Min.   :    128   Min.   :0.000  
 Retired           :2202   50-74K : 928   1st Qu.:  23810   1st Qu.:1.000  
 Unable_to_work    : 603   35-49K : 922   Median :  46316   Median :1.000  
 Self-employed     : 466   25-34K : 685   Mean   :  73966   Mean   :0.808  
 Homemaker         : 411   20-24K : 589   3rd Qu.:  80926   3rd Qu.:1.000  
 (Other)           : 578   (Other):1068   Max.   :1085111   Max.   :1.000  
 NA's              :  33   NA's   :1310   NA's   :1310      NA's   :3784   
   internet30       weight_kg         height_m          bmi       
 Min.   :0.0000   Min.   : 31.75   Min.   :1.350   Min.   :13.30  
 1st Qu.:1.0000   1st Qu.: 68.04   1st Qu.:1.630   1st Qu.:24.16  
 Median :1.0000   Median : 79.38   Median :1.700   Median :27.40  
 Mean   :0.8185   Mean   : 82.88   Mean   :1.696   Mean   :28.65  
 3rd Qu.:1.0000   3rd Qu.: 95.25   3rd Qu.:1.780   3rd Qu.:31.84  
 Max.   :1.0000   Max.   :208.65   Max.   :2.060   Max.   :75.52  
 NA's   :57       NA's   :462      NA's   :180     NA's   :493    
        bmigroup       pregnant         deaf             blind        
 [13.3,18.5): 119   Min.   :0.00   Min.   :0.00000   Min.   :0.00000  
 [18.5,25.0):2010   1st Qu.:0.00   1st Qu.:0.00000   1st Qu.:0.00000  
 [25.0,30.0):2447   Median :0.00   Median :0.00000   Median :0.00000  
 [30.0,75.5]:2343   Mean   :0.03   Mean   :0.09753   Mean   :0.05725  
 NA's       : 493   3rd Qu.:0.00   3rd Qu.:0.00000   3rd Qu.:0.00000  
                    Max.   :1.00   Max.   :1.00000   Max.   :1.00000  
                    NA's   :6042   NA's   :153       NA's   :163      
     decide          diffwalk        diffdress         diffalone      
 Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000  
 Median :0.0000   Median :0.0000   Median :0.00000   Median :0.00000  
 Mean   :0.1205   Mean   :0.2053   Mean   :0.04875   Mean   :0.08838  
 3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000  
 NA's   :194      NA's   :192      NA's   :192       NA's   :216      
    smoke100                    smoker       ecig_ever     
 Min.   :0.0000   Current_daily    : 990   Min.   :0.0000  
 1st Qu.:0.0000   Current_not_daily: 300   1st Qu.:0.0000  
 Median :0.0000   Former           :1999   Median :0.0000  
 Mean   :0.4591   Never            :3881   Mean   :0.1893  
 3rd Qu.:1.0000   NA's             : 242   3rd Qu.:0.0000  
 Max.   :1.0000                            Max.   :1.0000  
 NA's   :237                               NA's   :259     
               ecigs        healthplan         hasdoc          costprob      
 Current_daily    : 102   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 Current_not_daily: 165   1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:0.00000  
 Former           :1085   Median :1.0000   Median :1.0000   Median :0.00000  
 Never            :5799   Mean   :0.9462   Mean   :0.8662   Mean   :0.09656  
 NA's             : 261   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.00000  
                          Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
                          NA's   :20       NA's   :15       NA's   :18       
          t_checkup                     bp_high        bp_meds     
 1_In-past-year:5803   0_No                 :4114   Min.   :0.000  
 2_1-to-2-years: 714   1_Yes                :3161   1st Qu.:1.000  
 3_2-to-5-years: 413   2_Only_while_pregnant:  67   Median :1.000  
 4_5_plus_years: 376   4_Borderline         :  49   Mean   :0.848  
 8_Never       :  32   NA's                 :  21   3rd Qu.:1.000  
 NA's          :  74                                Max.   :1.000  
                                                    NA's   :4256   
            t_chol       chol_high        chol_meds         asthma    
 1_Never       : 424   Min.   :0.0000   Min.   :0.000   Current: 734  
 2_In-past-year:5483   1st Qu.:0.0000   1st Qu.:0.000   Former : 248  
 3_1-to-2-years: 559   Median :0.0000   Median :1.000   Never  :6376  
 4_2-to-5-years: 289   Mean   :0.3787   Mean   :0.683   NA's   :  54  
 5_5_plus_years: 272   3rd Qu.:1.0000   3rd Qu.:1.000                 
 NA's          : 385   Max.   :1.0000   Max.   :1.000                 
                       NA's   :514      NA's   :4805                  
   hx_asthma        now_asthma        hx_mi             hx_chd       
 Min.   :0.0000   Min.   :0.000   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.00000   1st Qu.:0.00000  
 Median :0.0000   Median :1.000   Median :0.00000   Median :0.00000  
 Mean   :0.1365   Mean   :0.747   Mean   :0.06771   Mean   :0.06554  
 3rd Qu.:0.0000   3rd Qu.:1.000   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :1.0000   Max.   :1.000   Max.   :1.00000   Max.   :1.00000  
 NA's   :28       NA's   :6430    NA's   :42        NA's   :73       
   hx_stroke          hx_skinc        hx_otherc          hx_copd      
 Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
 1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
 Median :0.00000   Median :0.0000   Median :0.00000   Median :0.0000  
 Mean   :0.04643   Mean   :0.0908   Mean   :0.09959   Mean   :0.1101  
 3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
 Max.   :1.00000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  
 NA's   :25        NA's   :22       NA's   :22        NA's   :35      
   hx_depress       hx_kidney        hx_diabetes                 dm_status   
 Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   No-Diabetes      :6100  
 1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   Pre-Diabetes     : 133  
 Median :0.0000   Median :0.00000   Median :0.0000   Pregnancy-Induced:  67  
 Mean   :0.2284   Mean   :0.03982   Mean   :0.1484   Diabetes         :1098  
 3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   NA's             :  14  
 Max.   :1.0000   Max.   :1.00000   Max.   :1.0000                           
 NA's   :29       NA's   :28        NA's   :14                               
     dm_age         hx_arthr        arth_lims       arth_work    
 Min.   : 1.00   Min.   :0.0000   Min.   :0.000   Min.   :0.000  
 1st Qu.:42.00   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.000  
 Median :52.00   Median :0.0000   Median :0.000   Median :0.000  
 Mean   :50.73   Mean   :0.3778   Mean   :0.498   Mean   :0.338  
 3rd Qu.:60.00   3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:1.000  
 Max.   :90.00   Max.   :1.0000   Max.   :1.000   Max.   :1.000  
 NA's   :6379    NA's   :40       NA's   :4646    NA's   :4679   
       arth_soc      joint_pain        alcdays         avgdrinks     
 A lot     : 606   Min.   : 0.000   Min.   : 0.000   Min.   : 1.000  
 A little  : 734   1st Qu.: 3.000   1st Qu.: 0.000   1st Qu.: 1.000  
 Not at all:1427   Median : 5.000   Median : 0.000   Median : 2.000  
 NA's      :4645   Mean   : 4.709   Mean   : 4.402   Mean   : 2.394  
                   3rd Qu.: 7.000   3rd Qu.: 4.000   3rd Qu.: 3.000  
                   Max.   :10.000   Max.   :30.000   Max.   :76.000  
                   NA's   :4660     NA's   :319      NA's   :3927    
   maxdrinks          binge          drinks_wk       drink_heavy    
 Min.   : 1.000   Min.   :0.0000   Min.   : 0.000   Min.   :0.0000  
 1st Qu.: 1.000   1st Qu.:0.0000   1st Qu.: 0.000   1st Qu.:0.0000  
 Median : 2.000   Median :0.0000   Median : 0.000   Median :0.0000  
 Mean   : 3.506   Mean   :0.1421   Mean   : 2.562   Mean   :0.0606  
 3rd Qu.: 4.000   3rd Qu.:0.0000   3rd Qu.: 2.000   3rd Qu.:0.0000  
 Max.   :76.000   Max.   :1.0000   Max.   :93.330   Max.   :1.0000  
 NA's   :4004     NA's   :377      NA's   :392      NA's   :379     
   fruit_day        veg_day         eat_juice         eat_fruit    
 Min.   : 0.00   Min.   : 0.000   Min.   : 0.0000   Min.   : 0.00  
 1st Qu.: 0.57   1st Qu.: 1.190   1st Qu.: 0.0000   1st Qu.: 0.43  
 Median : 1.00   Median : 1.720   Median : 0.0700   Median : 1.00  
 Mean   : 1.34   Mean   : 1.903   Mean   : 0.2965   Mean   : 1.04  
 3rd Qu.: 2.00   3rd Qu.: 2.320   3rd Qu.: 0.4300   3rd Qu.: 1.00  
 Max.   :14.00   Max.   :21.670   Max.   :12.0000   Max.   :10.00  
 NA's   :557     NA's   :667      NA's   :470       NA's   :457    
  eat_greenveg      eat_fries        eat_potato      eat_otherveg    
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   : 0.0000  
 1st Qu.:0.1400   1st Qu.:0.0700   1st Qu.:0.0700   1st Qu.: 0.4300  
 Median :0.4300   Median :0.1400   Median :0.1400   Median : 1.0000  
 Mean   :0.5338   Mean   :0.2255   Mean   :0.2461   Mean   : 0.8929  
 3rd Qu.:0.7100   3rd Qu.:0.2900   3rd Qu.:0.2900   3rd Qu.: 1.0000  
 Max.   :8.0000   Max.   :7.0000   Max.   :7.0000   Max.   :11.0000  
 NA's   :449      NA's   :453      NA's   :501      NA's   :509      
    exerany                        activity     rec_aerobic    
 Min.   :0.0000   Highly_Active        :2053   Min.   :0.0000  
 1st Qu.:0.0000   Active               :1132   1st Qu.:0.0000  
 Median :1.0000   Insufficiently_Active:1293   Median :0.0000  
 Mean   :0.6932   Inactive             :2211   Mean   :0.4795  
 3rd Qu.:1.0000   NA's                 : 723   3rd Qu.:1.0000  
 Max.   :1.0000                                Max.   :1.0000  
 NA's   :447                                   NA's   :680     
  rec_strength               exer1_type               exer2_type  
 Min.   :0.0000   Walking         :2605   No Other Activity:1854  
 1st Qu.:0.0000   Running         : 324   Walking          : 629  
 Median :0.0000   Other Activities: 276   Weight lifting   : 272  
 Mean   :0.2701   Gardening       : 242   Other Activities : 246  
 3rd Qu.:1.0000   Weight lifting  : 189   Gardening        : 202  
 Max.   :1.0000   (Other)         :1174   (Other)          :1537  
 NA's   :556      NA's            :2602   NA's             :2672  
   exer1_min        exer2_min               seatbelt       vax_flu      
 Min.   :   0.0   Min.   :   0.0   Always       :6047   Min.   :0.0000  
 1st Qu.:  64.0   1st Qu.:   0.0   Nearly_always: 409   1st Qu.:0.0000  
 Median : 140.0   Median :  14.0   Sometimes    : 191   Median :1.0000  
 Mean   : 242.7   Mean   : 124.2   Seldom       :  81   Mean   :0.5031  
 3rd Qu.: 270.0   3rd Qu.: 120.0   Never        : 148   3rd Qu.:1.0000  
 Max.   :4193.0   Max.   :4193.0   NA's         : 536   Max.   :1.0000  
 NA's   :2761     NA's   :2772                          NA's   :549     
   vax_pneumo      vax_shingles       hiv_test         hiv_risk     
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   :0.4882   Mean   :0.3353   Mean   :0.3064   Mean   :0.0407  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
 NA's   :1038     NA's   :2930     NA's   :830      NA's   :598     

2.8 Clean Cleveland-Elyria Data

2.8.1 Cleveland-Elyria Data

The mmsa_name variable is probably the simplest way for us to filter our data down to the MMSA we are interested in. Here, I’m using the str_detect function to identify the values of mmsa_name that contain the text “Cleveland”.
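
That filter is a one-liner; here smart_cle matches the name of the smart_cle.Rds file described at the start of this chapter:

```r
# Keep only the rows whose mmsa_name contains "Cleveland".
smart_cle <- smart_ohio %>%
    filter(str_detect(mmsa_name, "Cleveland"))
```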

In the Cleveland-Elyria MSA, we have 1133 observations on the same 99 variables.

2.8.2 Inspecting the Cleveland-Elyria Data

categorical variables:  
          name     class levels    n missing
1         mmsa character      1 1133       0
2    mmsa_name character      1 1133       0
3    genhealth    factor      5 1129       4
4     agegroup    factor     13 1122      11
5         race    factor      7 1107      26
6     race_eth    factor      5 1107      26
7      marital    factor      6 1122      11
8    educgroup    factor      6 1129       4
9   employment    factor      8 1129       4
10 incomegroup    factor      8  947     186
11    bmigroup    factor      4 1042      91
12      smoker    factor      4 1093      40
13       ecigs    factor      4 1090      43
14   t_checkup    factor      5 1129       4
15     bp_high    factor      4 1131       2
16      t_chol    factor      5 1079      54
17      asthma    factor      3 1126       7
18   dm_status    factor      4 1128       5
19    arth_soc    factor      3  407     726
20    activity    factor      4 1024     109
21  exer1_type    factor     62  756     377
22  exer2_type    factor     68  749     384
23    seatbelt    factor      5 1056      77
                                    distribution
1  Cleveland-Elyria  (100%)                     
2   (%) ...                                     
3  2_VeryGood (33.9%), 3_Good (32.2%) ...       
4  65-69 (11.6%), 60-64 (11.2%) ...             
5  White (74.8%) ...                            
6  White non-Hispanic (72.7%) ...               
7  Married (43.9%), Never_Married (20.4%) ...   
8  College_Grad (36.8%), HS_Grad (29.9%) ...    
9  Employed_for_wages (40.9%) ...               
10 75K+ (29.6%), 35-49K (14.7%) ...             
11 [25.0,30.0) (37.2%) ...                      
12 Never (53%), Former (29.3%) ...              
13 Never (82.2%), Former (14.1%) ...            
14 1_In-past-year (79.5%) ...                   
15 0_No (56.9%), 1_Yes (41.6%) ...              
16 2_In-past-year (79%) ...                     
17 Never (87.5%), Current (9.8%) ...            
18 No-Diabetes (83.6%), Diabetes (14.2%) ...    
19 Not at all (55.8%), A little (24.3%) ...     
20 Highly_Active (33%) ...                      
21 Walking (56%) ...                            
22 No Other Activity (36.7%) ...                
23 Always (86.9%), Nearly_always (5.6%) ...     

quantitative variables:  
           name   class       min         Q1    median       Q3        max
1       mmsa_wt numeric  84.72969   616.7628  1062.991  1761.88   7117.962
2     completed numeric   0.00000     1.0000     1.000     1.00      1.000
3      landline numeric   0.00000     0.0000     1.000     1.00      1.000
4      hhadults numeric   1.00000     2.0000     2.000     2.00      5.000
5    physhealth numeric   0.00000     0.0000     0.000     3.00     30.000
6    menthealth numeric   0.00000     0.0000     0.000     3.00     30.000
7    poorhealth numeric   0.00000     0.0000     0.000     6.00     30.000
8       age_imp numeric  18.00000    45.0000    59.000    71.00     95.000
9      hispanic numeric   0.00000     0.0000     0.000     0.00      1.000
10       female numeric   0.00000     0.0000     1.000     1.00      1.000
11         kids numeric   0.00000     0.0000     0.000     0.00     12.000
12     home_own numeric   0.00000     0.0000     1.000     1.00      1.000
13      veteran numeric   0.00000     0.0000     0.000     0.00      1.000
14      inc_imp numeric 216.00000 22424.5000 43948.000 76938.00 727553.000
15     cell_own numeric   0.00000     1.0000     1.000     1.00      1.000
16   internet30 numeric   0.00000     1.0000     1.000     1.00      1.000
17    weight_kg numeric  31.75000    65.8825    78.700    92.99    180.980
18     height_m numeric   1.42000     1.6300     1.680     1.78      2.030
19          bmi numeric  13.30000    24.1100    27.300    31.68     70.560
20     pregnant numeric   0.00000     0.0000     0.000     0.00      1.000
21         deaf numeric   0.00000     0.0000     0.000     0.00      1.000
22        blind numeric   0.00000     0.0000     0.000     0.00      1.000
23       decide numeric   0.00000     0.0000     0.000     0.00      1.000
24     diffwalk numeric   0.00000     0.0000     0.000     0.00      1.000
25    diffdress numeric   0.00000     0.0000     0.000     0.00      1.000
26    diffalone numeric   0.00000     0.0000     0.000     0.00      1.000
27     smoke100 numeric   0.00000     0.0000     0.000     1.00      1.000
28    ecig_ever numeric   0.00000     0.0000     0.000     0.00      1.000
29   healthplan numeric   0.00000     1.0000     1.000     1.00      1.000
30       hasdoc numeric   0.00000     1.0000     1.000     1.00      1.000
31     costprob numeric   0.00000     0.0000     0.000     0.00      1.000
Merging the wrapped columns (mean, sd, n, missing) back onto their rows, the numerical summaries for variables 32-74 are:

 #  variable      type      min      Q1  median     Q3     max    mean      sd     n  missing
32  bp_meds       numeric     0       1    1.00   1.00    1.00   0.840   0.367   470      663
33  chol_high     numeric     0       0    0.00   1.00    1.00   0.366   0.482  1064       69
34  chol_meds     numeric     0       0    1.00   1.00    1.00   0.654   0.476   387      746
35  hx_asthma     numeric     0       0    0.00   0.00    1.00   0.130   0.336  1132        1
36  now_asthma    numeric     0       1    1.00   1.00    1.00   0.780   0.416   141      992
37  hx_mi         numeric     0       0    0.00   0.00    1.00   0.057   0.231  1128        5
38  hx_chd        numeric     0       0    0.00   0.00    1.00   0.060   0.237  1123       10
39  hx_stroke     numeric     0       0    0.00   0.00    1.00   0.050   0.219  1130        3
40  hx_skinc      numeric     0       0    0.00   0.00    1.00   0.076   0.265  1131        2
41  hx_otherc     numeric     0       0    0.00   0.00    1.00   0.115   0.319  1128        5
42  hx_copd       numeric     0       0    0.00   0.00    1.00   0.091   0.288  1130        3
43  hx_depress    numeric     0       0    0.00   0.00    1.00   0.204   0.403  1128        5
44  hx_kidney     numeric     0       0    0.00   0.00    1.00   0.027   0.161  1127        6
45  hx_diabetes   numeric     0       0    0.00   0.00    1.00   0.142   0.349  1128        5
46  dm_age        numeric     7      40   53.50  61.25   85.00   50.30   16.49   152      981
47  hx_arthr      numeric     0       0    0.00   1.00    1.00   0.360   0.480  1131        2
48  arth_lims     numeric     0       0    0.00   1.00    1.00   0.478   0.500   404      729
49  arth_work     numeric     0       0    0.00   1.00    1.00   0.298   0.458   396      737
50  joint_pain    numeric     0       2    4.00   7.00   10.00   4.486   2.776   401      732
51  alcdays       numeric     0       0    1.00   6.00   30.00   4.884   7.874  1082       51
52  avgdrinks     numeric     1       1    2.00   3.00   60.00   2.375   3.584   594      539
53  maxdrinks     numeric     1       1    2.00   4.00   42.00   3.298   3.383   580      553
54  binge         numeric     0       0    0.00   0.00    1.00   0.157   0.364  1073       60
55  drinks_wk     numeric     0       0    0.23   2.33   56.00   2.612   5.835  1067       66
56  drink_heavy   numeric     0       0    0.00   0.00    1.00   0.068   0.252  1072       61
57  fruit_day     numeric     0  0.6225    1.07   2.00   10.00   1.432   1.153  1056       77
58  veg_day       numeric     0  1.2675    1.72   2.42    7.49   1.913   1.038  1032      101
59  eat_juice     numeric     0       0    0.07   0.43    7.00   0.303   0.511  1066       67
60  eat_fruit     numeric     0    0.43    1.00   2.00   10.00   1.121   0.989  1068       65
61  eat_greenveg  numeric     0    0.20    0.43   1.00    5.00   0.574   0.492  1066       67
62  eat_fries     numeric     0    0.03    0.14   0.29    3.14   0.202   0.280  1067       66
63  eat_potato    numeric     0    0.07    0.14   0.29    2.00   0.233   0.262  1054       79
64  eat_otherveg  numeric     0    0.43    1.00   1.00    7.00   0.903   0.732  1060       73
65  exerany       numeric     0       0    1.00   1.00    1.00   0.713   0.452  1064       69
66  rec_aerobic   numeric     0       0    1.00   1.00    1.00   0.501   0.500  1028      105
67  rec_strength  numeric     0       0    0.00   1.00    1.00   0.276   0.447  1050       83
68  exer1_min     numeric     0      70  140.00 270.00 4193.00   248.7   402.3   729      404
69  exer2_min     numeric     0       0   21.00 120.00 3780.00   99.87   248.5   730      403
70  vax_flu       numeric     0       0    0.00   1.00    1.00   0.500   0.500  1055       78
71  vax_pneumo    numeric     0       0    1.00   1.00    1.00   0.512   0.500   973      160
72  vax_shingles  numeric     0       0    0.00   1.00    1.00   0.340   0.474   715      418
73  hiv_test      numeric     0       0    0.00   1.00    1.00   0.334   0.472  1009      124
74  hiv_risk      numeric     0       0    0.00   0.00    1.00   0.049   0.215  1047       86

The corresponding mean, sd, n and missing values for variables 1-31 (named earlier) are:

 #     mean      sd     n  missing
 1   1425.9  1163.1  1133        0
 2    0.853   0.355  1133        0
 3    0.560   0.497  1133        0
 4    2.075   0.917    40     1093
 5    4.671   9.218  1109       24
 6    3.902   8.266  1121       12
 7    5.757   9.783   605      528
 8    57.33   18.17  1122       11
 9    0.030   0.172  1121       12
10    0.593   0.491  1133        0
11    0.445   0.972  1127        6
12    0.680   0.467  1126        7
13    0.111   0.315  1133        0
14    70163   87722   947      186
15    0.820   0.385   632      501
16    0.816   0.388  1126        7
17    81.77   21.52  1054       79
18    1.692   0.107  1091       42
19    28.41   6.630  1042       91
20    0.036   0.186   196      937
21    0.078   0.269  1110       23
22    0.054   0.227  1107       26
23    0.108   0.310  1104       29
24    0.185   0.389  1100       33
25    0.043   0.204  1104       29
26    0.080   0.272  1098       35
27    0.470   0.499  1093       40
28    0.178   0.383  1090       43
29    0.953   0.212  1130        3
30    0.871   0.335  1132        1
31    0.104   0.306  1132        1

We’ll build a variety of smaller subsets from these data, eventually.
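
As a sketch of that kind of subsetting, one might pull just the alcohol-use items (variables 51-56 in the summary above) from the Cleveland-Elyria sample into a smaller analytic file. This assumes the `smart_cle.Rds` file described at the start of this chapter; the path, the object name `smart_cle_alc`, and the choice to drop incomplete cases are all illustrative assumptions, not part of the original workflow:

```r
library(dplyr)
library(tidyr)

# Read the Cleveland-Elyria sample built in this chapter
# (path is an assumption; adjust to where smart_cle.Rds was saved)
smart_cle <- readRDS("data/smart_cle.Rds")

# Keep only the alcohol-use items, then restrict to complete cases
smart_cle_alc <- smart_cle |>
  select(alcdays, avgdrinks, maxdrinks, binge, drinks_wk, drink_heavy) |>
  drop_na()
```

Restricting to complete cases is just one option; later chapters on missing data may suggest imputation instead, so treat `drop_na()` here as a placeholder for whatever missingness strategy the analysis calls for.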