Project A Analyses

Author

431 Staff

Published

2023-10-19

1 General Information (applicable to all three Analyses)

Using the tibble you’ve developed following the instructions on the Data and Proposal pages, you will perform three analyses, as specified below. Your final portfolio report will include:

  • Sections 1-12 from the Proposal, edited as necessary.
  • Section 13 will be your Analysis 1, and will have four subsections.
  • Section 14 will be your Analysis 2, with four subsections.
  • Section 15 will be your Analysis 3, with four subsections.
  • Section 16 will be a new Reflections section, discussed in the Portfolio materials.
  • Section 17 will be your new location for session information.

The four subsections for each Analysis are to be labeled:

  1. Variables
  2. Summaries
  3. Approach
  4. Conclusions

1.1 Subsection 1. Variables

For each of the three Analyses, you will start by re-specifying the variables you are studying, including appropriate units of measurement, and briefly remind us how your sample was developed. This reminder should include the total number of counties involved in this Analysis, which states are used, and whether any relevant data are missing. Be sure to edit the data management and Codebook sections from your Proposal so that they accurately describe all variables you are using in your analyses. In particular, no additional adjustments to your tibble should be made after section 8 in your portfolio report.

In proposal section 12, you stated the research question you are trying to answer with each analysis. You should also indicate in that section your pre-data collection belief about what conclusion you would draw. In developing your final portfolio report, you should edit these statements (as necessary) to reflect the Analyses you’ve actually done.

1.2 Subsection 2. Summaries

Here, you will provide numerical summaries and visualizations of interest that are relevant to your analysis, and comment on any issues you observe. All of your plots should be attractive and well-labeled, and (when possible) use ggplot tools. Specific suggestions about necessary data descriptions for each Analysis are discussed below.

1.3 Subsection 3. Approach

Each Analysis should, of course, include all of the R code you used to create your work, and complete English sentences that interpret your results. Your work must demonstrate that you are able to reason about what you’ve done, not just generate code that produces results. Here, you will:

  • Assess which of several options for modeling (Analysis 1) or creating an inference (Analyses 2 and 3) is most appropriate, with the help of useful diagnostics.
  • Complete any specialized analytic requirements (as discussed below) unique to that Analysis, and obtain results which allow you to address your research question.

1.4 Subsection 4. Conclusions

You will write a two-paragraph conclusions section for each Analysis.

  • The first paragraph should provide a clear restatement of your research question, followed by a clear response to that question, motivated by your results. Remember that a research question will end with a question mark, and will be something you will be able to answer (or at least respond to effectively) after your analysis is complete. You should also reflect on your pre-existing belief about what would happen (as discussed in Section 12) in light of the data.
  • The second paragraph should summarize the key limitations of your Analysis, and opportunities for useful next steps associated with that Analysis. To be clear, just writing “get more data,” though generally good advice, isn’t a sufficient next step. Also, note that it’s not a good idea to suggest limitations that you could fix with the tools you have - instead, apply those tools and build a better Analysis.

2 Analysis 1: Simple Linear Regression Model

2.1 Advice on Variables

In this section you will build a simple linear regression model to predict your Analysis 1 outcome using your Analysis 1 predictor. Start by identifying those variables, and by restricting your data set for this analysis to the complete cases on those variables.

  • Use complete English sentences to identify your outcome and your predictor, describing what each variable means and its units of measurement.
  • Also specify how many counties have complete data on both variables.
  • Finally, specify the values of your outcome and predictor for Cuyahoga County, in Ohio, where CWRU’s campus is located.
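
As a minimal sketch of these steps, the code below restricts a tibble to complete cases, counts the counties that remain, and pulls the row for Cuyahoga County. The tibble chr_demo, its column names (state, county, a1_outcome, a1_predictor) and all of its values are made up for illustration; substitute your own tibble and variable names.

```r
library(dplyr)

# Hypothetical stand-in for your project tibble (values are invented).
chr_demo <- tibble(
  state = c("OH", "OH", "OH", "MI", "MI", "PA"),
  county = c("Cuyahoga County", "Franklin County", "Vinton County",
             "Wayne County", "Kent County", "Erie County"),
  a1_outcome = c(7.1, 6.4, NA, 8.0, 5.9, 6.8),
  a1_predictor = c(31.2, 28.5, 35.0, NA, 27.1, 30.3)
)

# Keep only counties with complete data on both Analysis 1 variables
a1_data <- chr_demo |>
  filter(complete.cases(a1_outcome, a1_predictor))

# Number of counties with complete data on both variables
n_complete <- nrow(a1_data)
n_complete

# Outcome and predictor values for Cuyahoga County, Ohio
cuyahoga <- a1_data |>
  filter(state == "OH", county == "Cuyahoga County") |>
  select(state, county, a1_outcome, a1_predictor)
cuyahoga
```

Filtering on both state and county guards against counties in different states that share a name.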

2.2 Advice on Summaries

Build a visualization of the relationship between the outcome (Y-axis) and the predictor (X-axis), and write a description of what you learn about the association. Your description should comment on its direction, shape and strength, and identify any substantial outliers.

You will need to address the possibility of a transformation here:

  • Specify any transformations you choose to apply to the X or Y variable in order to obtain a better fit with a simple linear regression, and justify your choice (or your decision not to apply a transformation).
    • For this part of Project A, confine your search to either a logarithm, an inverse, or a square as applied to the outcome. If you want to consider one of those transformations for the predictor as well, that’s OK but not crucial.
    • You should select the most promising transformation on the basis of a scatterplot (perhaps with a loess smooth and linear fit) after the transformation has been applied. You are permitted but not required to use the Box-Cox approach to help here.

If you decide to use a transformation with either the outcome or predictor before fitting your final model, you should display two plots: one with and one without that transformation. Your plot (including the transformation, if you use one) should include both a loess smooth and the regression line from your final linear model.
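
One way to draw the pair of plots described above is sketched here, assuming a logarithm of the outcome is the transformation under consideration. The data are simulated, and the names a1_data, a1_outcome and a1_predictor are placeholders rather than variables from the project data.

```r
library(ggplot2)

# Simulated stand-in data (not real county values); names are assumptions.
set.seed(431)
a1_data <- data.frame(a1_predictor = runif(80, 10, 40))
a1_data$a1_outcome <- exp(0.5 + 0.04 * a1_data$a1_predictor + rnorm(80, sd = 0.2))

# Plot 1: untransformed outcome, with loess smooth (blue) and linear fit (red)
p_raw <- ggplot(a1_data, aes(x = a1_predictor, y = a1_outcome)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "blue") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "red") +
  labs(title = "Outcome vs. predictor (no transformation)",
       x = "Predictor (units)", y = "Outcome (units)")

# Plot 2: log-transformed outcome, same smooths
p_log <- ggplot(a1_data, aes(x = a1_predictor, y = log(a1_outcome))) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "blue") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "red") +
  labs(title = "log(Outcome) vs. predictor",
       x = "Predictor (units)", y = "Natural log of outcome")

p_raw
p_log
```

If the loess smooth tracks the straight line more closely after the transformation, that is some evidence in favor of using it.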

2.3 Advice on Approach

Fit a model that uses your predictor to predict your outcome (applying your selected transformation, if any). Provide the code you used, along with the following summary elements, in this section.

  1. A written statement of the full prediction equation, with coefficients nicely rounded, and a careful description of what the coefficients mean in context. If you’re using a transformation of the outcome or the predictor, be sure this is reflected in your comments here.

  2. A tidy summary of the model’s coefficients, including 90% confidence intervals for the model estimates.

  3. The model’s R-squared, residual standard error, and the number of observations to which the model was fit.
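
The summary elements listed above could be obtained along the following lines. This is a sketch on simulated data, assuming a log-transformed outcome and using the broom package; the object names (a1_data, m1, m1_coefs, m1_fit) are placeholders.

```r
library(broom)

# Simulated stand-in data (not real county values); names are assumptions.
set.seed(431)
a1_data <- data.frame(a1_predictor = runif(80, 10, 40))
a1_data$a1_outcome <- exp(0.5 + 0.04 * a1_data$a1_predictor + rnorm(80, sd = 0.2))

# Simple linear regression on the log of the outcome
m1 <- lm(log(a1_outcome) ~ a1_predictor, data = a1_data)

# Tidy coefficient summary with 90% confidence intervals
m1_coefs <- tidy(m1, conf.int = TRUE, conf.level = 0.90)
m1_coefs

# R-squared, residual standard error (sigma), and number of observations
m1_fit <- glance(m1)[, c("r.squared", "sigma", "nobs")]
m1_fit
```

When the outcome is log-transformed, remember that a slope of b describes a multiplicative change: a one-unit increase in the predictor is associated with multiplying the predicted outcome by exp(b).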

2.3.1 Residual Analysis

At the end of your Approach section for Analysis 1, you’ll need to:

  1. prepare a pair of residual plots (one to assess residuals vs. fitted values for non-linearity, and one to assess Normality in the residuals or the standardized residuals.)
    • If your model is called m1, you could use something like plot(m1, which = c(1:2)) to obtain these two plots, and that’s OK, although a ggplot-based alternative using patchwork would be even nicer.
  2. interpret the residual plots in terms of what they tell you about how well the assumptions of linearity and Normality hold for your setting, in complete English sentences.
  3. display your model’s prediction for the original (untransformed) outcome you are studying for Cuyahoga County, in Ohio, and compare it to Cuyahoga’s actual value of this outcome.
  4. identify the two counties (by name and state) where the model you’ve fit is least successful at predicting the outcome (in the sense of having the largest residual in absolute value.)
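
The four tasks above might be approached as in the sketch below, which uses ggplot2 and patchwork for the residual plots and base R for the rest. The data are simulated, a log-transformed outcome is assumed, and every object and column name here is a placeholder for your own.

```r
library(ggplot2)
library(patchwork)

# Simulated stand-in data (not real county values); names are assumptions.
set.seed(431)
a1_data <- data.frame(
  state = sample(c("OH", "MI", "PA"), 80, replace = TRUE),
  county = paste("County", 1:80),
  a1_predictor = runif(80, 10, 40)
)
a1_data$a1_outcome <- exp(0.5 + 0.04 * a1_data$a1_predictor + rnorm(80, sd = 0.2))
a1_data$state[1] <- "OH"
a1_data$county[1] <- "Cuyahoga County"

m1 <- lm(log(a1_outcome) ~ a1_predictor, data = a1_data)

# Attach fitted values and residuals (both on the log scale)
a1_aug <- a1_data
a1_aug$fit_log <- fitted(m1)
a1_aug$res_log <- residuals(m1)

# Residuals vs. fitted values, to look for non-linearity
p1 <- ggplot(a1_aug, aes(x = fit_log, y = res_log)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs. fitted values",
       x = "Fitted log(outcome)", y = "Residual")

# Normal Q-Q plot of the residuals
p2 <- ggplot(a1_aug, aes(sample = res_log)) +
  geom_qq() +
  geom_qq_line(color = "red") +
  labs(title = "Normal Q-Q plot of residuals",
       x = "Theoretical quantiles", y = "Residual")

p1 + p2

# Prediction for Cuyahoga County, back-transformed to the original scale
cuy <- subset(a1_aug, state == "OH" & county == "Cuyahoga County")
cuy_pred <- unname(exp(predict(m1, newdata = cuy)))
c(predicted = cuy_pred, observed = cuy$a1_outcome)

# Two counties with the largest residuals in absolute value
worst2 <- a1_aug[order(abs(a1_aug$res_log), decreasing = TRUE)[1:2],
                 c("county", "state", "a1_outcome", "fit_log", "res_log")]
worst2
```

Because the model here is fit to log(outcome), the prediction is exponentiated before comparing it to the observed value; with a different transformation, the back-transformation would change accordingly.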

2.4 Advice on Conclusions

For Analysis 1, you’ll write two paragraphs.

In the first paragraph, you should provide a clear restatement of your research question, followed by a clear and appropriate response to your research question, motivated by your results. Most of the time, one model won’t let you come to a strong conclusion about a question of interest, and it is your job to accurately present what information can be specified as a result of the model, without overstating your conclusions.

Then, write a paragraph which summarizes the key limitations of your work in Analysis 1.

  • If you see problems with regression assumptions in your residual plot, that would be a good thing to talk about here, for instance.
  • Another issue that is worth discussing is your target population, and what evidence you can describe that might indicate whether your selected states are a representative sample of the US as a whole, or perhaps some particular part of the United States.
  • You should also provide at least one useful “next step” that you could take to improve this analysis (just saying “get more data” isn’t a sufficient next step.)

3 Analysis 2: Comparing Two Independent Samples

3.1 Advice on Variables

Here, you have identified one quantitative variable (your Analysis 2 outcome) and one categorical variable (your Analysis 2 binary predictor).

  • Use complete English sentences to identify your outcome and your predictor, describing what each variable means and its units of measurement.
  • Also specify how many counties have complete data on both variables.
  • Finally, specify the values of your outcome and predictor for Cuyahoga County, in Ohio, where CWRU’s campus is located.

3.2 Advice on Summaries

Here, prepare descriptive summaries of the data across the two predictor groups for your chosen outcome, including, of course, attractive and well-constructed visualizations which can be used for comparisons.

A comparison boxplot with violins is an excellent option here for the key visualization. Be sure to label it carefully, and use color and/or fill wisely to create a clear and attractive picture.
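
A comparison boxplot with violins, plus a small numerical summary by group, could look something like this. The data are simulated, and a2_data, a2_group and a2_outcome are hypothetical names.

```r
library(ggplot2)

# Simulated stand-in data (not real county values); names are assumptions.
set.seed(431)
a2_data <- data.frame(
  a2_group = rep(c("Low", "High"), times = c(40, 45)),
  a2_outcome = c(rnorm(40, mean = 20, sd = 4), rnorm(45, mean = 23, sd = 5))
)

# Violin plot with a narrow boxplot overlaid for each group
p_a2 <- ggplot(a2_data, aes(x = a2_group, y = a2_outcome)) +
  geom_violin(aes(fill = a2_group), alpha = 0.4) +
  geom_boxplot(width = 0.2, outlier.size = 2) +
  guides(fill = "none") +
  labs(title = "Outcome by predictor group",
       x = "Binary predictor group", y = "Outcome (units)")
p_a2

# Sample size, mean, standard deviation and median within each group
a2_summary <- aggregate(
  a2_outcome ~ a2_group, data = a2_data,
  FUN = function(x) c(n = length(x), mean = mean(x), sd = sd(x), median = median(x))
)
a2_summary
```

The fill legend is suppressed because the group labels already appear on the x-axis.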

3.3 Advice on Approach

You’ll analyze the results and build a 90% confidence interval for the difference in group means with an appropriate t-based procedure, and with a bootstrap procedure.

You’ll then select one of these two procedures to provide your final response, and discuss the reasons behind the choice you made.

Show your work and your reasoning (not just your code), and comment on any analytic decisions you make. Be sure to actively present and justify any assumptions you are making.
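
For illustration, here is one way to obtain both kinds of 90% confidence interval for a difference in means between two independent samples. The t-based interval uses the Welch procedure from t.test, and the bootstrap interval is a simple hand-rolled percentile bootstrap; neither is necessarily the specific procedure expected for your project, and the data and names are simulated placeholders.

```r
# Simulated stand-in data (not real county values); names are assumptions.
set.seed(431)
a2_data <- data.frame(
  a2_group = rep(c("Low", "High"), times = c(40, 45)),
  a2_outcome = c(rnorm(40, mean = 20, sd = 4), rnorm(45, mean = 23, sd = 5))
)

# Welch t-based 90% CI for mean(High) - mean(Low)
# (groups are ordered alphabetically, so High comes first)
t_welch <- t.test(a2_outcome ~ a2_group, data = a2_data, conf.level = 0.90)
t_welch$conf.int

# Percentile bootstrap 90% CI: resample within each group, with replacement
high <- a2_data$a2_outcome[a2_data$a2_group == "High"]
low <- a2_data$a2_outcome[a2_data$a2_group == "Low"]
obs_diff <- mean(high) - mean(low)

boot_diffs <- replicate(
  2000,
  mean(sample(high, replace = TRUE)) - mean(sample(low, replace = TRUE))
)
boot_ci <- quantile(boot_diffs, probs = c(0.05, 0.95))
boot_ci
```

Adding var.equal = TRUE to the t.test call gives the pooled-variance interval instead, which is only reasonable if the two groups have similar spread. Whichever interval you report, state the direction of the difference (here, High minus Low) explicitly.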

3.4 Advice on Conclusions

For Analysis 2, you’ll write two paragraphs.

In the first paragraph, you should provide a clear restatement of your research question, followed by a clear and appropriate response to your research question, motivated by your results. Interpret your 90% confidence interval’s endpoints and width in context.

Then, write a paragraph which summarizes the key limitations of your work in Analysis 2.

  • If you see problems with the assumptions behind your choice of interval, that would be a good thing to talk about here, for instance.
  • Another issue that is worth discussing is your target population, and what evidence you can describe that might indicate whether your selected states are a representative sample of the US as a whole, or perhaps some particular part of the United States.
  • You should also provide at least one useful “next step” that you could take to improve this analysis (just saying “get more data” isn’t a sufficient next step.)

4 Analysis 3: Comparing An Outcome in 2023 to its value in 2018

4.1 Advice on Variables

Here, we have identified two quantitative variables (the same outcome in two different time periods) which are paired (so that they have a natural link between them, and use the same units of measurement.)

  • Your Analysis 3 material should start with specifications of what the outcome you are studying in this analysis actually means, including its units, and how your samples in 2018 and 2023 were created.

4.2 Advice on Summaries

Provide numerical summaries and visualizations of interest that are relevant to this analysis, and comment on any issues you observe.

The natural choice is a boxplot with violin for the paired differences, along with a Normal Q-Q plot of those paired differences. Be sure to remind us how many “pairs” of values you have to work with in your labels for these plots.

You will need to provide some evidence on how well the “pairing” worked in this setting, by interpreting the Pearson correlation between the 2018 and 2023 values.
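
The summaries described above might be assembled as follows. This sketch uses simulated data, with a3_data, y2018, y2023 and change as hypothetical names; the differences are computed as 2023 minus 2018.

```r
library(ggplot2)
library(patchwork)

# Simulated stand-in data (not real county values); names are assumptions.
set.seed(431)
n_pairs <- 60
y2018 <- rnorm(n_pairs, mean = 30, sd = 5)
a3_data <- data.frame(y2018 = y2018,
                      y2023 = y2018 + rnorm(n_pairs, mean = 1.5, sd = 2))
a3_data$change <- a3_data$y2023 - a3_data$y2018

# Boxplot with violin for the paired differences
p1 <- ggplot(a3_data, aes(x = "", y = change)) +
  geom_violin(fill = "lightblue", alpha = 0.5) +
  geom_boxplot(width = 0.2) +
  coord_flip() +
  labs(title = "Boxplot with violin", x = "", y = "2023 minus 2018 (units)")

# Normal Q-Q plot of the paired differences
p2 <- ggplot(a3_data, aes(sample = change)) +
  geom_qq() +
  geom_qq_line(color = "red") +
  labs(title = "Normal Q-Q plot",
       x = "Theoretical quantiles", y = "2023 minus 2018 (units)")

p1 + p2 +
  plot_annotation(title = paste0("Paired differences for ", n_pairs, " counties"))

# Pearson correlation between the 2018 and 2023 values
pair_cor <- cor(a3_data$y2018, a3_data$y2023)
pair_cor
```

A strong positive correlation suggests that counties with high values in 2018 also tend to have high values in 2023, which is the situation where pairing helps most.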

4.3 Advice on Approach

You’ll analyze the results and build a 90% confidence interval for the population mean difference with an appropriate t-based procedure, and an appropriate bootstrap procedure.

You’ll then select one of these two procedures to provide your final response, and discuss the reasons behind the choice you made.

Show your work and your reasoning (not just your code), and comment on any analytic decisions you make. Be sure to actively present and justify any assumptions you are making.
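
As a sketch of the two procedures for paired samples, the code below builds a t-based 90% confidence interval and a hand-rolled percentile bootstrap interval for the mean of the paired differences. The data are simulated, the names are placeholders, and the percentile bootstrap shown is one common choice rather than the only acceptable one.

```r
# Simulated stand-in data (not real county values); names are assumptions.
set.seed(431)
n_pairs <- 60
y2018 <- rnorm(n_pairs, mean = 30, sd = 5)
a3_data <- data.frame(y2018 = y2018,
                      y2023 = y2018 + rnorm(n_pairs, mean = 1.5, sd = 2))
a3_data$change <- a3_data$y2023 - a3_data$y2018

# Paired t-based 90% CI: a one-sample t interval on the differences
t_paired <- t.test(a3_data$change, conf.level = 0.90)
t_paired$conf.int

# Percentile bootstrap 90% CI for the mean difference:
# resample the differences (not the two years separately), with replacement
boot_means <- replicate(2000, mean(sample(a3_data$change, replace = TRUE)))
boot_ci <- quantile(boot_means, probs = c(0.05, 0.95))
boot_ci
```

Resampling the differences keeps each county's 2018 and 2023 values linked, which is what distinguishes this from the independent-samples bootstrap.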

4.4 Advice on Conclusions

For Analysis 3, you’ll write two paragraphs.

In the first paragraph, you should provide a clear restatement of your research question, followed by a clear and appropriate response to your research question, motivated by your results. Interpret your chosen 90% confidence interval’s endpoints and width in context. You should also reflect on your pre-existing belief about what would happen (as discussed in Section 12) in light of your results.

Then, write a paragraph which summarizes the key limitations of your work in Analysis 3.

  • If you see problems with the assumptions behind your choice of interval, that would be a good thing to talk about here, for instance.
  • If pairing didn’t “help” (in the sense that there was no substantial positive correlation between the 2018 and 2023 reports), that would be worth discussing here.
  • Another issue that is worth discussing is your target population, and what evidence you can describe that might indicate whether your selected states are a representative sample of the US as a whole, or perhaps some particular part of the United States.
  • You should also provide at least one useful “next step” that you could take to improve this analysis (just saying “get more data” isn’t a sufficient next step.)