Data Development
Most Recent Update: 2022-12-15 12:21:01.
You must use the same data source for Study 1 and Study 2 in Project B. That data source must either be:
- NHANES data, as discussed here.
- something else, as discussed here.
- To encourage people to use data other than NHANES, there is a three-point bonus in the final project B grade available for all projects which use non-NHANES data.
Data Requirements for Study 1 and Study 2
- Number of observations
- If you are using NHANES data, you will need between 500 and 7,500 observations with a minimum of 500 observations containing complete data on all of the variables you will use in Study 1 or Study 2.
- If you are using any other data source, you will need between 250 and 10,000 observations, and at least 250 with complete data on all variables you will use in Study 1 or Study 2.
- We require that all variables treated as quantitative in Study 1 (Analyses A, B, or C) or as your quantitative outcome in Study 2 contain at least 15 unique values.
- The data must include a unique coded identifier (
SEQN
in NHANES) for each row (subject.)
- Number and type of variables for Study 1 (where you will do 4 of the following 5 analyses)
- For Analysis A, you will need appropriate data to allow you to compare two means using paired samples. So that would require a set of paired measurements of the same quantity, perhaps measurements of the same subjects before and after the application of an exposure. Again, we require that all variables treated as quantitative in Study 1 (Analyses A, B, or C) or as your quantitative outcome in Study 2 contain at least 15 unique values.
- For Analysis B, you will need appropriate data to allow you to compare two means using independent samples. This would require a quantitative outcome (with at least 15 unique values), and a binary categorical variable which divides the data into two subgroups, so that each subgroup has a minimum of 30 observations.
- For Analysis C, you will need appropriate data to allow you to compare 3-6 means using independent samples. This would require a quantitative outcome (with at least 15 unique values), and a multi-categorical variable with 3-6 categories which divides the data into subgroups, so that each subgroup has a minimum of 30 observations.
- For Analysis D, you will need appropriate data to allow you to create and analyze a 2 \(\times\) 2 table. This would require two independently collected binary categorical variables, that split the data into four groups in a 2 \(\times\) 2 table, each with a minimum of 30 observations.
- For Analysis E, you will need appropriate data to allow you to create and analyze a J \(\times\) K table, where \(2 \leq J \leq 5\) and \(3 \leq K \leq 5\). This would require two independently collected categorical variables (one with J groups and one with K groups), that split the data into groups in a J \(\times\) K table, so that each cell within the table has a minimum of 15 observations.
- At a bare minimum, then, you will need a quantitative outcome (for Analyses A, B and C) and two binary variables (one for B and two for D and one for E) and a multi-categorical variable with 3-5 levels (for C and E) although you are welcome to use different variables from the same data source for each of the four Study 1 analyses you complete.
- Number and type of variables for Study 2
- You will need a quantitative outcome. Again, we require that your quantitative outcome contains at least 15 unique values.
- You will need a key predictor of interest (which may be either quantitative, or categorical.) If it is categorical, it must have 2-6 categories, and each category must contain at least 30 observations. Your research question will focus largely on how effectively this key predictor can be used to predict your quantitative outcome.
- You will need to identify 3-8 additional predictors of your outcome, at least one of which must be a multi-categorical predictor with 3-6 categories, where each category contains at least 30 observations. The other predictors can be any combination of quantitative and categorical (with 2-6 categories each, with at least 30 observations in each category.)
- Your data for each Study must be managed, merged (if necessary) and cleaned exclusively using R to go from raw data to that Study’s final clean tibble. This data management process must include the creation of appropriately labeled (and, if necessary, collapsed) factors for all categorical variables you will use in either study, and appropriate investigation and actions regarding missing values, numbers of unique values, and impossible values.
- You will need to generate a single clean data set which you will then use for all four of your Study 1 analyses. Describe those data overall using a codebook and an appropriate set of numerical summaries, such as those provided by the
describe()
function in theHmisc
package, or something similar. - You will need to generate a single clean data set for all of your Study 2 analyses, which you will then partition into a model training (or development) sample containing 60-80% of the data, and a model testing (or validation) sample containing the remaining 20-40% of the data. Describe the training data overall using a codebook and an appropriate set of numerical summaries, as you did in Study 1.
Tips on Cleaning Your Data
If you need to merge data (for instance in NHANES) I would clean the data after doing the merge.
Note that it’s only necessary to clean the variables you will actually use in your analyses below. Create an analytic data set containing only those variables.
- This should include a subject identification code (the SEQN in NHANES), your outcome, your key predictor and your other predictors.
- If you are working with NHANES 2017-March 2020 data, you will need to include RIDSTATR and RIDAGEYR from the P_DEMO file.
- Include RIDSTATR just so that you can prove that all of its values are 2 in your sample.
- Include RIDAGEYR even if you’re not using it in your models, so you can describe the ages of the people in your sample.
If you create a categorical variable from a quantitative one, do so in this section of your report, and then refer to that work in the analyses below when you use the new variable. In general, though, I wouldn’t do that except in dire circumstances. Variables that use categories to describe what were originally quantitative variables aren’t quantitative any more.
Things I would treat as missing include responses like Refused, Don’t Know, Did Not Respond, Unknown, No response and missing.
- If you have a quantitative variable that includes a code like 5555 or 9999 for “don’t know” or “missing”, you will need to drop those cases, just as you would if you were working with a categorical variable.
- Be sure that R recognizes things that are missing as missing and filters them out when you filter for complete cases.
Collapse levels sensibly for multi-categorical variables with more than 6 categories. If you want to use more than 6 categories for a categorical variable in your analyses, contact Dr. Love.
- If you have a categorical variable with codes like 77, 88 or 99, in addition to treating those as missing, you want to drop those levels from the factors you create. I recommend you run droplevels() on your tibble to remove all factor levels with zero subjects. That can help down the line.
For NHANES folks, a few specific things:
- Gender vs. Sex I would treat the
RIAGENDR
variable as describing biological sex and would rename it as I created a factor. - Race/Ethnicity If you want to use race/ethnicity I would prefer the use of
RIDRETH3
overRIDRETH1
, and I would recommend using all six categories, assuming you have at least 100 subjects at each level after whatever other pruning you do. If you want to collapse, then lumping codes 1 and 2 into “Hispanic/Latinx” is acceptable. Remember that race/ethnicity as a covariate is an attempt to understand the impact of structural racism, at least as much as it is anything else, so interpretation requires special care. - Age Do not use a categorical version of age. Use the quantitative version, called
RIDAGEYR
, provided in theP_DEMO
data. When you describe your subjects, you should specify the range (minimum and maximum) ages of those subjects, so you will need to captureRIDAGEYR
in your final analytic data set even if you’re not including it in your regression models. - Income and Measurement Caps The family income ratio
INDFMPIR
is appealing and quantitative, but it has a pronounced ceiling effect. It is the ratio of income to the poverty level, but is capped at 5. How should you think about that? (Note that age in adults is also capped, at 80.) - Categorical Income? As a categorical alternative, the income data in
INDHHIN2
in NHANES can be tricky to use, since there are so many categories and some of them overlap. CollapseINDHHIN2
to the following four categories, which are easy to describe, and have reasonable numbers of subjects in each category. Note that this approach drops the subjects with codes 12, 77 or 99, in addition to those with missing data.- Lowest: Below 20,000 (includes original codes 1, 2, 3, 4 and 13)
- Low: between 20,000 and 44,999 (includes original codes 5, 6, and 7)
- High: between 45,000 and 74,999 (includes original codes 8, 9 and 10)
- Highest: 75,000 and above (includes original codes 14 and 15)
- Education Categories If you’re working with adults (ages 20 and over), the
DMDEDUC2
variable in theP_DEMO
file is the set of categories to use. I would probably collapse codes 1 and 2 together to create a four-category variable with “Less than HS”, “HS Grad”, “Some College”, “College Grad”.
- Gender vs. Sex I would treat the
Be sure to treat all multi-categorical variables as factors in R, and don’t treat numeric codes as meaningful numeric variables.
Make sure that all of your quantitative variables have sensible minimum and maximum values as you’re cleaning.
Some binary variables are coded 1 and 2. Fix that in your work, ideally by using the real names and treating the variable as a factor, or by converting the 1-2 to a proper 1-0 indicator variable.
- Use the formula NEWVAR = 2 - OLDVAR to turn OLDVAR: 1 = Yes, 2 = No into NEWVAR: 1 = Yes, 0 = No.
- If you have OLDVAR: 1 = No, 2 = Yes, create a NEWVAR with 1 = Yes, 0 = No using NEWVAR = OLDVAR - 1.
If you would prefer to impute missing values for variables that are neither your outcome or your key predictor, you are permitted to do so, but a complete cases analysis is completely acceptable for this project, so long as you wind up in the range specified above for complete cases. You are not permitted to impute your outcome or your key predictor in Study 2.
You are welcome to apply clean_names at the start or end of your cleaning, if you like, but I wouldn’t otherwise change the variable names, at least for NHANES. If you do decide to change the names, that’s OK, but you will then need to specify the original names as well in the codebook.
Please don’t include sanity checks in your report. We’ll trust you have to have done that work on your own.
To use data from NHANES (National Health and Nutrition Examination Survey).
- Please read the Data Requirements for Study 1 and Study 2 and Tips on Cleaning Your Data above, then read the additional details on what is required if you’re working with NHANES data here.
- Many people (in past years) have felt that using NHANES data was a little easier than using another data source. To encourage people to use data other than NHANES, there is a three-point bonus in the final project B grade available for all projects which use non-NHANES data.
To use data from a non-NHANES source that meets with Dr. Love’s approval.
- Please read the Data Requirements for Study 1 and Study 2 and Tips on Cleaning Your Data above, then read the additional details on what is required if you’re instead working with some other data source here.
- Again, to encourage people to use data other than NHANES, there is a three-point bonus available to all projects which use non-NHANES data.