431 Project A Instructions

Most Recent Update: 2022-10-18 07:39:47.

What is on this Website

  • The Data page contains information on the data you’ll use, and on the required cleaning and development of your final project tibble. You’ll need to complete this cleaning to complete the Proposal.
  • The Proposal page describes the proposal, which you will submit in mid-October, according to the deadline on the Course Calendar.
    • You can (and should) do all of the work for the Proposal in September, and completing it by the end of the month will definitely help you with Quiz 1.
    • To encourage an early start, proposals submitted to Canvas prior to Monday 2021-10-03 at noon will be reviewed quickly by Dr. Love and the TAs, and if a redo is required, you’ll hear back quickly enough that you should be able to accomplish it before the final proposal deadline, and thus still have a chance to earn full credit.
  • The Analyses page contains information on the analyses you’ll need to include in your final submission after completing the development of your tibble.
  • The Final Report page describes the elements of the final Project A submission (the report, the video and the self-evaluation) that you will build in October and provides submission instructions.
  • The Examples & Tips section includes information on sample work and other tips we’ve prepared to help you through Project A.
  • The top menu also provides links to contact us, and to the 431 home page.

Project A Overview

  • You can work alone, or with one other person on this project. If you work as a pair, you create one project together, and each of you receive the same grade.
  • Project A deliverables. There are two deadlines, each of which is specified on the Course Calendar.
    • The Proposal is due in mid-October. Here you will answer a few specific questions after creating a clean “tibble” for the project, including 200-400 rows (“counties”) and a set of variables you will choose from options we provide.
      • You’ll also need to tell us in the Proposal whether you are working alone or with another person.
    • The final submission is due at the end of October, and includes:
      • a report in R Markdown, and knitted to HTML
      • a short video which highlights some of your key findings from the report
      • a self-evaluation form, which you’ll complete after submitting the report and video
  • The Data: You will be working with data from the 2022 edition of County Health Rankings.
    • These data describe most counties in the United States. Counties are located within states in the US. There are 88 counties, for instance, in Ohio.
    • The data are available now online with extensive additional information available at the County Health Rankings website.
    • Everyone in the class will start with the same data set, but will pick a subset of that data that will be different for each project.
  • Cleaning the Data: To build your data set, which will be a sample of the full data, you will do a series of things in R.
    • You’ll need to select a subset of 4-6 states (Ohio will automatically be one of them) which together include 200-400 counties.
    • You’ll need to select a sample of 5 variables from those available in the data, to which you’ll add three pre-specified variables (a fips code, the state name and the county name) which will identify your observations.
    • Three of those variables will be included as quantitative, and for the other two, we’ll require you to use R to create categories out of the variables. (Note: Categorizing quantitative information like this turns out to be a terrible idea in practical model-building. We’re doing this here for pedagogical reasons that will, in time, become clearer.)
    • Creating a tibble of your selected counties, and your set of variables will be necessary to complete the proposal.
  • Analyses You’ll Do: After your proposal is accepted, you have three analytic tasks using your clean tibble. The observations (rows) in your data will be the counties from your set of 4-6 states. The three tasks are listed below.
    • Visualizing and modeling the relationship between a quantitative outcome and a quantitative predictor.
    • Visualizing and modeling the relationship between a quantitative outcome and a categorical predictor (with 2-5 levels).
    • Visualizing and modeling the relationship between a quantitative outcome and a quantitative predictor, after adjustment for the state in which the county is located
  • Examples from Dr. Love: We have provided a sample proposal (using R Markdown) that we strongly recommend you use in developing your data and proposal work. We’ve also provided an outline of the analysis report we hope to see. These items will help ensure your final project report matches our expectations and is something to be proud of.

Project A Objectives

On the remainder of this page, you’ll find a description of the educational objectives for this project and for projects in this course, generally.

It is hard to learn statistics (or anything else) passively; concurrent theory and application are essential1.

Project A is about building linear models and visualizing data from a (fairly clean) data set I provide to you. In Project A, you will complete most of the elements of a data science project designed to create a statistical model for a quantitative outcome, then use it for prediction, and assess the quality of those predictions. Tools necessary for Project A include:

  • Describing the experimental or observational study design used to gather the data, as well as the complete data collection process.
  • Sharing the process of ingesting and tidying the data, with specific reference to the decisions you make to “clean” things up.
  • Descriptive and exploratory summaries of the data across the groups for each of your chosen outcomes, including, of course, attractive and well-constructed visualizations, which may include both graphs and tables.
  • Developing appropriate research questions that lead to the identification of smart measures for predictors and outcomes, and then the development of a prediction model using linear regression.
  • Regression-based comparisons of mean differences for a quantitative outcome across a set of two (or more) groups, including appropriate demonstrations of the reasons behind the models you build.
  • Regression-based predictions of a quantitative outcome using a quantitative predictor.

In Project A’s analysis stage, everyone will be working with different parts of the same data set.

Think of a graph as a comparison. All graphs are comparisons (indeed, all statistical analyses are comparisons). If you already have the graph in mind, think of what comparisons it’s enabling. Or if you haven’t settled on the graph yet, think of what comparisons you’d like to make. Andrew Gelman

Why Two Projects?

The main reason is that I can’t figure out a way to get you to think about all of the things I hope you’ll learn from this course in a single Project. Another important reason is that I want you to be able to make mistakes during the semester without worrying about it too much, and having two projects spreads out this learning a bit.

  1. I set different tasks for Project A and for Project B, allowing us to touch on a wider fraction of the things I hope you’ll learn in 431.
  2. I give much more guidance in Project A than in Project B.
  3. We have to evaluate each of your projects quickly, and there are many students in the class. Knowing the data set you’ll be working with helps us manage this.
  4. Having a broad range of activities to evaluate helps reduce the cost of a mistake on any one of them, so that we can build on what you do well.
  5. All of Project A can be done using materials discussed in Classes 1-15.

Educational Objectives

“Statistics has no reason for existence except as the catalyst for investigation and discovery.” George E. P. Box

I am primarily interested in your learning something interesting, useful and even valuable from your project experiences in 431. In particular, an effective Project A will demonstrate:

  1. The ability to create and formulate research questions that are statistically and scientifically appropriate.
  2. The ability to turn research questions into measures of interest.
  3. The ability to pull and merge and clean and tidy data, then present the data set following Jeff Leek’s guide to sharing data with a statistician.
  4. The ability to build a reasonable (if simplistic) linear model, assess the quality of the model, and use it to make predictions.
  5. The ability to identify and (with help) solve problems that crop up
  6. The ability to comment on your work within code, and in written and oral presentation.
  7. The ability to build a Markdown-based report and to give a short presentation based on key findings from that report.

Questions?

If you have questions, let us know about them on Campuswire using the projectA folder, or speak with Dr. Love before or after class, or discuss them with the TAs during office hours.

Footnotes

  1. Though by no means an original idea, this particular phrasing is stolen from Harry Roberts.↩︎