Chapter 37 Introduction for Part C
In 431, my primary goal is to immerse you in several case studies that demonstrate good statistical practice in the analysis of data using multiple regression models. We will often leave gaps to be filled in 432, but the principal goal is to get you to the point where you can do a solid (if not quite complete) analysis of data for the modeling portion of your project.
The ten main topics to be discussed or reviewed in these notes are:
- Describing the multivariate relationship
    - Scatterplots and smoothing
    - Correlation coefficients and correlation matrices
- Transformations and re-expression
    - The need for transformation
    - Using a Box-Cox method to help identify effective transformation choices
- Testing the significance of a multiple regression model
    - t tests for individual predictors, each assessed as the last predictor into the model
    - Global F tests based on ANOVA to assess overall predictive significance
    - Incremental and sequential testing of groups of predictors
- Interpreting the predictive value of a model
    - R² and adjusted R², along with AIC and BIC
    - Residual standard deviation and RMSE
    - Estimating effect sizes in terms of raw units, standard deviations, or IQRs
    - Fitted values; distinguishing prediction intervals from confidence intervals
- Checking model assumptions
    - Residual analysis, including studentized residuals and the major residual plots
    - Identifying points with high leverage
    - Assessing influence numerically and graphically
    - Measuring and addressing collinearity
- Model selection
    - The importance of parsimony
    - Stepwise regression and other automated techniques
- Assessing predictive accuracy through cross-validation
    - Summaries of predictive error
- Dealing with missing values sensibly
    - Imputation vs. complete-case analyses
    - Including a missing-data category vs. simple imputation vs. removal
- Dealing with categorical predictors
    - Indicator variables
    - The impact of categorical variables on the rest of our modeling
- Summarizing the key findings of the model, briefly and accurately
    - Making the distinction between causal findings and associations
    - The importance of logic, theory, and empirical evidence (LTE)
37.1 Additional Reading
Vittinghoff et al. (2012) is strong in this area. The relevant sections of the text for 431 Part C are:
- Section 3.3 on the Simple Linear Regression Model
- Chapter 4 on Linear Regression, where most of the material is relevant to 431, although we’ll mostly postpone the discussion of cubic splines to 432.
- Chapter 10 (Model Selection), in particular the alternatives to R² in 10.1.3.2 and some of the material on cross-validation, though we’ll do much more in 432.
- A little of Chapter 11 (Missing Data), specifically Section 11.1.1 and a little of Section 11.3, although we’ll do more on this in 432 as well.
37.2 Scatterplots
We have often accompanied our scatterplots with regression lines estimated by the method of least squares and with loess smooths, which use local polynomial functions to display curved relationships. Occasionally we have presented scatterplots in the form of a scatterplot matrix, which enables simultaneous comparison of multiple two-way associations.
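As a concrete illustration, here is a minimal `ggplot2` sketch of that kind of display, assuming a data frame `dat` with numeric variables `x` and `y` (all hypothetical names):

```r
# Minimal sketch: scatterplot with a least squares line and a loess smooth.
# `dat`, `x`, and `y` are hypothetical placeholders.
library(ggplot2)

ggplot(dat, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, col = "red") +     # least squares line
  geom_smooth(method = "loess", se = FALSE, col = "blue")   # loess smooth

# One option for building a scatterplot matrix is GGally::ggpairs(dat).
```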
37.3 Correlation Coefficients
By far the most commonly used is the Pearson correlation, a unitless (scale-free) measure of bivariate linear association for the variables X and Y, symbolized by r and ranging from -1 to +1. The Pearson correlation equals the covariance of X and Y divided by the product of their standard deviations; equivalently, it is the slope of the least squares regression line multiplied by the ratio of the standard deviation of X to the standard deviation of Y.
We have also mentioned the Spearman rank correlation coefficient, which is obtained by applying the usual Pearson correlation formula to the ranks (1 = minimum, n = maximum, with average ranks assigned to ties) of the X and Y values. This approach, which correlates the orderings of the data rather than the values themselves, substantially reduces the effect of outliers. The result still ranges from -1 to +1, with 0 indicating no monotone association.
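In R, both coefficients come from the `cor()` function. A minimal sketch, again using the hypothetical `dat` with numeric variables `x` and `y`:

```r
cor(dat$x, dat$y)                       # Pearson correlation (the default)
cor(dat$x, dat$y, method = "spearman")  # Spearman rank correlation

# A correlation matrix across all (numeric) columns of `dat`:
cor(dat, use = "complete.obs")
```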
37.4 Fitting a Linear Model
We have fit several styles of linear model to date, including both simple regressions, where our outcome Y is modeled as a linear function of a single predictor X, and multiple regression models, where more than one predictor is used. Important elements of a regression fit, obtained through the `summary()` function for an `lm` object, include (see the sketch after this list):
- the estimated coefficients (intercept and slope(s)) of the fitted model, and
- the R² or coefficient of determination, which specifies the proportion of variation in our outcome accounted for by the linear model.
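A minimal sketch of such a fit, assuming `dat` contains an outcome `y` and predictors `x1` and `x2` (all hypothetical names):

```r
m1 <- lm(y ~ x1 + x2, data = dat)   # fit a multiple regression model

summary(m1)             # coefficients, R^2, residual standard error, and more
coef(m1)                # just the estimated intercept and slopes
summary(m1)$r.squared   # the R^2 value on its own
```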
37.5 Building Predictions from a Linear Model
We’ve also used the `predict()` function applied to an `lm` object to obtain point and interval estimates for our outcome based on new values of the predictor(s). We’ve established both confidence intervals from such models, which describe the mean result across a population of subjects with the new predictor values, and prediction intervals, which describe an individual result for a new subject with those same new values. Prediction intervals are much wider than confidence intervals, since they must also account for individual variation around the mean.
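A minimal sketch, continuing with the hypothetical model `m1` from the previous section:

```r
# New predictor values at which we want estimates (hypothetical values).
new_dat <- data.frame(x1 = 10, x2 = 3)

predict(m1, newdata = new_dat, interval = "confidence")  # CI for the mean response
predict(m1, newdata = new_dat, interval = "prediction")  # PI for one new subject
```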
References
Vittinghoff, Eric, David V. Glidden, Stephen C. Shiboski, and Charles E. McCulloch. 2012. *Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models*. 2nd ed. Springer-Verlag. http://www.biostat.ucsf.edu/vgsm/.