Required Study 2 Analyses

Most Recent Update: 2022-12-15 12:21:14.

Once you have identified an acceptable data set, you will produce a report that demonstrates that you have accomplished the following:

Identify a quantitative outcome.
- For purposes of this project, we will require your quantitative outcome to contain more than 15 unique values.
Identify a key predictor (which may be either quantitative or categorical.)
- If the key predictor is categorical, it must have 3-6 categories, and each category must contain at least 30 observations.
Identify 3-8 other predictors of your outcome (demonstrating that either your key predictor or at least one of the “other” predictors is multi-categorical with 3-6 categories.)
Define a research question related to how effectively your key predictor predicts your quantitative outcome, while (possibly) adjusting for the other predictors.
Steps 1-3 will yield a set of 6-11 variables (an outcome, a key predictor, 3-8 other predictors, and a subject identifier). Use those selections to create your analytic data set.
- You must have between (500 and 7,500 observations if you’re using NHANES; 250 and 10,000 observations if not using NHANES) with complete data on all 6-11 variables included in your Study 2 analytic tibble. No other variables should be included in your Study 2 analytic tibble.
Clean the data in R, and this includes the creation of appropriately labeled (and if necessary, collapsed) factors for all categorical variables, and the investigation and decision-making regarding missing values, numbers of unique values and impossible values.
Use Hmisc::describe or mosaic::favstats or a similar tool to provide useful numerical summaries for all data you will study, after all cleaning, so that these results describe the exact variables you will be modeling in the remaining work.
Partition the clean data into a model development (also called a model training) sample (60-80% of the data) and a model testing (also called a model validation) sample (the remaining 20-40%) using the approach recommended in the data .
- Suppose you had a tibble called original_data which identified its subjects with a subjID variable, and you wanted to place 75% of your data for development into training_sample and the rest into test_sample. You could do that with the following code…

library(tidyverse)

set.seed(431) # pick a different seed than this

training_sample <- original_data %>% 
    slice_sample(., prop = 0.75)

test_sample <- 
    anti_join(original_data, training_sample, by = "subjID")

Provide appropriate, well-labeled visualizations of your outcome, and investigate potential transformations of that outcome for the purpose of fitting regression models in a useful way. Whatever transformation (including no transformation at all) should be used for the steps that follow.
Produce two competitive models for predicting your outcome using your data that provide evidence regarding your research question.
- One of these models should be the full model with all candidate predictors included.
- Your other model should be a well-motivated subset of your full model, that at least includes the key predictor. The naive strategy of using as your subset the key predictor alone is 100% appropriate.
Assess the performance of your two models (full model or subset) and come to a conclusion about which is better. This assessment should includes both in-sample (predictive performance and adherence to assumptions) and holdout sample (predictive quality) assessments. Be sure to attend to back-transformation properly should that be necessary, in evaluating the quality of predictions.
Use the results of the model you chose to answer your research question, and then describe the limitations of this study and next steps you would like to pursue.