If you’re using data from another source
If you don’t want to use NHANES data, you will need to obtain Dr. Love’s approval to use something else, by filling out a form he will make available. Here are the data specifications.
- The data must be freely available to all, and there must be no risk of any kind associated with your using the data for this project. What does that mean?
    - Your use of the data for this project must not be subject to IRB approval, or to the approval of anyone other than you (this includes your principal investigator).
    - The data must be available to the general public at no cost.
    - There can be no protected health information of any kind.
    - The data must be completely de-identified.
    - Dr. Love will need to see your source for the data in its entirety. You will need to be able to provide a link to a web page from which you (and Dr. Love) can download the raw data.
- The data must be of a certain type, so as to suit this project.
    - The data must be cross-sectional, rather than longitudinal.
        - The only exception to this rule would be data where a baseline set of predictors is measured, which might include the baseline measure of the outcome, and then the outcome (and only the outcome) is measured at a later time.
    - The data must not be hierarchical, so everything must be measured at the subject level.
        - We cannot have subjects nested in states, for instance, with some variables measured only at the state level included in your set of 5-10 variables.
    - The data must not be from County Health Rankings, nor can they appear in any teaching repository of data.
    - The data must not be pre-compiled as part of an R package, but rather available in raw form and ingested into R by you.
    - Dr. Love has a strong preference for data that describe individual people or animals, as opposed to other types of “subjects”. Who the subjects (rows) of your data are must be completely clear.
    - The data you select must in all ways be suitable for the analyses required in Project B.
    - Dr. Love can refuse to let you use a data set for any reason at all, and this includes the reason that he’s tired of people using the data set.
- The data must include 5-10 variables (columns) measured on each subject, not including a coded identifier for each subject.
    - These must include at least 2 quantitative variables, each of which shows more than 20 unique values. One of these quantitative variables will need to be your outcome.
    - These must also include at least 2 categorical predictors.
        - At least one of your categorical predictors must have between 3 and 6 categories (variables with more than 6 categories must be collapsed down to no more than 6 levels).
        - Your other categorical predictors (of which you must have at least one) must have between 2 and 6 categories (again, collapse all categorical variables with more than 6 levels).
        - Each of your categorical predictors must include at least 30 subjects at each level.
- The data must include 250-10,000 observations (rows), each describing a unique subject, for whom there must be a coded identifier.
    - You will need a minimum of 250 complete cases across all of the 5-10 variables you identified.
    - If there are more than 10,000 observations, draw a sample of 10,000 observations with complete data on your selected variables, and use that sample as a new version of your raw data.
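
As a rough illustration of how the size and variable rules above might be checked once your data are in R, here is a minimal base-R sketch. The function names `check_project_data()` and `sample_down()`, their arguments, the seed, and the simulated `toy` data are all hypothetical and are not part of the project requirements. The sketch assumes your data frame holds only the subject identifier plus your selected 5-10 variables, and that level counts are taken over complete cases. Where the raw data have more than 10,000 rows but no more than 10,000 complete cases, it simply keeps all complete cases, which is one possible reading of the rule.

```r
# Hypothetical helper: reduce a data set with more than `max_rows` rows to a
# random sample of complete cases. Smaller data sets are returned unchanged.
sample_down <- function(dat, id, max_rows = 10000, seed = 431) {
  if (nrow(dat) <= max_rows) return(dat)
  vars <- setdiff(names(dat), id)
  complete <- dat[complete.cases(dat[, vars, drop = FALSE]), , drop = FALSE]
  if (nrow(complete) <= max_rows) return(complete)
  set.seed(seed)  # assumed seed, so the sample can be reproduced
  complete[sample(nrow(complete), max_rows), , drop = FALSE]
}

# Hypothetical helper: returns a named logical vector, one entry per rule.
check_project_data <- function(dat, id, outcome, categorical) {
  vars <- setdiff(names(dat), id)
  complete <- dat[complete.cases(dat[, vars, drop = FALSE]), , drop = FALSE]

  # Quantitative here means numeric with more than 20 unique values
  is_quant <- vapply(vars, function(v) {
    is.numeric(complete[[v]]) && length(unique(complete[[v]])) > 20
  }, logical(1))

  # Number of observed levels, and smallest level count, per categorical predictor
  n_levels <- vapply(categorical, function(v) {
    length(table(as.character(complete[[v]])))
  }, integer(1))
  min_count <- vapply(categorical, function(v) {
    min(table(as.character(complete[[v]])))
  }, integer(1))

  c(
    five_to_ten_variables   = length(vars) >= 5 && length(vars) <= 10,
    unique_subject_ids      = anyDuplicated(dat[[id]]) == 0,
    rows_250_to_10000       = nrow(dat) >= 250 && nrow(dat) <= 10000,
    complete_cases_250      = nrow(complete) >= 250,
    two_quantitative        = sum(is_quant) >= 2,
    outcome_is_quantitative = is_quant[[outcome]],
    two_categorical         = length(categorical) >= 2,
    one_with_3_to_6_levels  = any(n_levels >= 3 & n_levels <= 6),
    all_with_2_to_6_levels  = all(n_levels >= 2 & n_levels <= 6),
    thirty_per_level        = all(min_count >= 30)
  )
}

# Simulated data for illustration only; these are not real subjects.
set.seed(2020)
n <- 12000
toy <- data.frame(
  subject_id = sprintf("S%05d", seq_len(n)),
  outcome    = rnorm(n, mean = 120, sd = 15),
  age        = sample(21:79, n, replace = TRUE),
  sex        = sample(c("F", "M"), n, replace = TRUE),
  education  = sample(c("HS", "College", "Grad", "Other"), n, replace = TRUE),
  smoker     = sample(c("Never", "Former", "Current"), n, replace = TRUE)
)
toy$age[sample(n, 500)] <- NA  # introduce some missing values

project_data <- sample_down(toy, id = "subject_id")
checks <- check_project_data(
  project_data,
  id = "subject_id",
  outcome = "outcome",
  categorical = c("sex", "education", "smoker")
)
print(checks)
```

Any entry of `checks` that comes back `FALSE` points to a rule worth revisiting before you submit the form. Passing these checks does not by itself mean a data set will be approved.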
This page was last updated: 2020-12-06 13:42:32.