This page was last updated: 2021-10-25 09:36:49.

Here is an outline of the “Analysis” section of your final Project A report. Note that none of this material should be included in your Project A proposal.

Analysis 1

The Variables

In this section you will build a simple linear regression model to predict your outcome using one of your two quantitative predictors (it’s your choice.) Start by identifying those variables, and restricting your data set for this analysis to the complete cases on those variables.

Use complete English sentences to identify your outcome and your predictor, describing what each variable means and its units of measurement.
Also specify the name of each variable in your tibble (making it clear which is the outcome and which the predictor), the name of your tibble, which states you’re studying, and how many counties have complete data on both variables.
Finally, specify the values of your outcome and predictor for Cuyahoga County, in Ohio, where CWRU’s campus is located.

Research Question

Here, you should state your research question clearly. A good research question (and your response to it) is the most important part of this analysis.

Do not use more than one research question for this Analysis.
A research question will end with a question mark, and will be something you will be able to answer (or at least respond to effectively) after your analysis is complete.

Guidance on Research Questions from Dr. Love

Examples of dull but moderately effective and minimally appropriate research questions in this setting would be:

How well does a linear model using [predictor] predict [outcome], in [number] counties in the states of [list of your states]?

What is the nature of the association between [predictor] and [outcome], in [number] counties in the states of [list of your states]?

You should be able to do meaningfully better than that, especially if you have a reason to believe something in advance about the direction or strength of the association you are anticipating.
However, if you’re struggling, using that format will be OK.
A research question uses formal but clear language.
Given your planned analyses, stay away from statements about cause and effect, and don’t use the words correlate or regression (in any form) in your research question.

Visualizing the Data

Provide an attractive, well-labeled scatterplot of your outcome and predictor, before any transformations, including all code necessary to build it, and describe what the plot tells you about the association of the variables, as detailed in the instructions.

Transformation Assessment

Here, you will provide information about any transformations you chose to apply to the outcome, and explain why you did (or didn’t) choose to use a transformation.

If you decide to use a transformation, specify the transformation you’ve chosen carefully, and show the scatterplot of the transformed data, using the same suggestions as were provided in the previous section to describe the resulting plot. Write a few sentences describing the transformations you considered, and why you thought this was the most promising, and why you eventually decided to use the transformed versions of your variables.
If you decide not to use a transformation, identify the ONE plot which was the most promising of available transformations and show that one. Write a few sentences describing the transformations you considered, and why you thought this was the most promising, and why you eventually decided to stick with the original versions of the variables.

Guidance from Dr. Love on the Transformation Assessment

Your response will include ONE plot demonstrating a particular transformation, although you will probably fit and view several plots in order to select a final one for this work.

For this part of Project A, confine your search to either a logarithm, an inverse, or a square as applied to the outcome. If you want to consider one of those transformations for the predictor as well, that’s OK but not crucial. You should select the most promising transformation on the basis of a scatterplot (perhaps with a loess smooth and linear fit) after the transformation has been applied.

You are discouraged from using numerical summaries of fit in this section. Of course, summary statistics like \(R^2\) when the outcome is transformed will definitely not be comparable.

Note that this transformation assessment is part of Analysis 1, but will not be shown in Analyses 2 or 3. Feel free to use the transformation (of the outcome) that you select in Analysis 1 for the other two Analyses, if you like.

The Fitted Model

Fit your model to use your predictor to predict your outcome (applying your selected transformation) and provide the code you used, and the following summary elements in this section.

A written statement of the full prediction equation, with coefficients nicely rounded, and a careful description of what the coefficients mean in context.
A tidy summary of the model’s coefficients, including 90% confidence interval for model estimates.
The model’s R-squared, residual standard error, the number of observations to which the model was fit, and the Pearson correlation of your predictor and outcome.

Residual Analysis

Here, you’ll need to do four things.

prepare a pair of residual plots (one to assess residuals vs. fitted values for non-linearity, and one to assess Normality in the residuals or the standardized residuals.)
interpret those plots in terms of what they tell you about how well the assumptions of linearity and Normality hold for your setting, in complete English sentences.
display your model’s prediction for the original (untransformed) outcome you are studying for Cuyahoga County, in Ohio, and compare it to Cuyahoga’s actual value of this outcome.
identify the two counties (by name and state) where the model you’ve fit is least successful at predicting the outcome (in the sense of having the largest residual in absolute value.)

The augment function would be very helpful here.
If you model is called m1, you could use something like plot(m1, which = c(1:2)) to obtain these two plots and that’s OK.
You can produce a more pleasing picture using ggplot2 and patchwork following the strategy I’ve demonstrated multiple times in the slides and the Course Notes, should you desire.

Conclusions and Limitations

Here, you’ll write two paragraphs.

In the first paragraph, you should provide a clear restatement of your research question, followed by a clear and appropriate response to your research question, motivated by your results.

Then, write a paragraph which summarizes the key limitations of your work in Analysis 1.

If you see problems with regression assumptions in your residual plot, that would be a good thing to talk about here, for instance.
Another issue that may be worth discussing is your target population, and what evidence you can describe that might indicate whether your selected states are a representative sample of the US as a whole, or perhaps some particular part of the United States.

Analysis 2

The Variables

In this section you will build a simple linear regression model to predict your outcome using either of your two categorical predictors (it’s your choice.) Start by identifying those variables, and restricting your data set for this analysis to the complete cases on those variables.

Use complete English sentences to identify your outcome and your predictor, describing what each variable means and the available categories for the predictor. Be sure the predictor is represented in R as a factor, with an appropriate ordering.
Again, specify the name of each variable in your tibble (making it clear which is the outcome and which the predictor), the name of your tibble, which states you’re studying, and how many counties have complete data on both variables.
Finally, specify the values of your outcome and predictor for Cuyahoga County, in Ohio, where CWRU’s campus is located.

Research Question

Here, you should state your research question clearly. A good research question (and your response to it) is the most important part of this analysis.

Do not use more than one research question for this Analysis.
A research question will end with a question mark, and will be something you will be able to answer (or at least respond to effectively) after your analysis is complete.

Guidance on Research Questions from Dr. Love

Examples of dull but moderately effective and minimally appropriate research questions for analysis 2 would be:

Does [category A] or [category B] show higher levels of [outcome], in [number] counties in the states of [list of your states]? (for a binary predictor)

Which of [the categories in your predictor] is associated with the highest mean level of [outcome], in [number] counties in the states of [list of your states]? (for a multi-categorical predictor)

Otherwise, use the same guidance about research questions I provided for analysis 1.

Visualizing the Data

In Analysis 2, it is up to decide whether or not a transformation of the outcome would be valuable. Your model will assume that the distribution of the outcome is Normal, with similar variance across each level of your categorical predictor. If you decide to transform the outcome, again, stick with either the logarithm, inverse or square.

Provide an attractive boxplot (with or without a violin plot) showing your outcome broken down by levels of your predictor, after whatever transformation you choose (if any), including all code necessary to build it, and describe what the plot tells you about the association of the variables, as detailed in the instructions.

The Fitted Model

Fit your model to use your predictor to predict your outcome (applying your selected transformation) and provide the code you used, and the following summary elements in this section.

A written statement of the full prediction equation, with coefficients nicely rounded, and a careful description of what the coefficients mean in context.
A tidy summary of the model’s coefficients, including 90% confidence interval for model estimates.
The model’s R-squared, residual standard error, and the number of observations to which the model was fit.

Prediction Analysis

Here, you’ll need to do three things.

plot the residuals against the categorical predictor in a useful way to help you assess Normality of the residuals within each category.
display your model’s prediction for the original (untransformed) outcome you are studying for Cuyahoga County, in Ohio, and compare it to Cuyahoga’s actual value of this outcome.
identify the two counties (by name and state) where the model you’ve fit is least successful at predicting the outcome (in the sense of having the largest residual in absolute value.)

Again, augment would be very helpful here.

Conclusions and Limitations

Your first paragraph in this response should provide a clear restatement of your research question, and then a clear and appropriate response to your research question, motivated by your results. This should include a statement comparing the categories on the mean of the outcome you modeled.

Then, in your second and final paragraph in this section, provide a brief description of the limitations of this Analysis, Be specific about your concerns.

Analysis 3

In this section you will build a linear regression model to predict your outcome using one of your two quantitative predictors (it’s your choice) and the state (which should be a factor in your model with Ohio as the baseline category.)

The Variables

Do everything in this section that I recommended for the Variables section in Analysis 1.

Research Question

Here, you should state your research question clearly. A good research question (and your response to it) is the most important part of this analysis.

Do not use more than one research question for this Analysis.
A research question will end with a question mark, and will be something you will be able to answer (or at least respond to effectively) after your analysis is complete.

Guidance on Research Questions from Dr. Love

A dull but moderately effective and minimally appropriate research questions in this setting would be:

How well does [predictor] predict [outcome] after accounting for differences between states?

Again, follow the guidance from Analysis 1.

Visualizing the Data

Provide an attractive, well-labeled scatterplot of your outcome and your quantitative predictor, and incorporate some way of providing separate results by state. You can do this in several ways in R, including through the use of facets.

Plot the data incorporating the transformation for your outcome (if any) that you will use in your model for Analysis 3.
In your model for Analysis 3, use the same transformation for the outcome that you used in Analysis 1.
Show your code, and describe what the plot tells you about the association of the variables.

The Fitted Model

Fit your model to use your predictors to predict your outcome (applying your selected transformation) and provide the code you used, and the following summary elements in this section.

A written statement of the full prediction equation, with coefficients nicely rounded, and a careful description of what the coefficients mean for the intercept, the quantitative predictor and one of the states, in context.
A tidy summary of the model’s coefficients, including 90% confidence interval for model estimates. Be sure that Ohio is used as the baseline state.
The model’s R-squared, residual standard error, the number of observations to which the model was fit.

Residual Analysis

Do what you did in Analysis 1.

Conclusion and Limitations

Your first paragraph in this response should provide a clear restatement of your research question, and then a clear and appropriate response to your research question, motivated by your results.

Then, write a paragraph to summarize the key limitations of your work, as you did in the previous analyses.

Checklist?

Note that there is a checklist for the final report as part of the Examples & Hints page that includes many things we’ll be looking for in grading. It’s well worth your time to review the checklist before submitting your work.

431 Footer

Outline for Project Report

Analysis 1

The Variables

Research Question

Guidance on Research Questions from Dr. Love

Visualizing the Data

Transformation Assessment

Guidance from Dr. Love on the Transformation Assessment

The Fitted Model

Residual Analysis

Conclusions and Limitations

Analysis 2

The Variables

Research Question

Guidance on Research Questions from Dr. Love

Visualizing the Data

The Fitted Model

Prediction Analysis

Conclusions and Limitations

Analysis 3

The Variables

Research Question

Guidance on Research Questions from Dr. Love

Visualizing the Data

The Fitted Model

Residual Analysis

Conclusion and Limitations

Checklist?