This page was last updated: 2021-10-25 09:36:55.
What is on this Website
- The Data page contains information on the data you’ll be using, and on the required cleaning and development of your final data frame (tibble) for the project. You’ll need to complete this cleaning to complete the Proposal.
- The Proposal page describes the proposal, which you will submit as a part of Lab 04. You can (and should) do all of the work for the Proposal in September, and completing it by the end of the month will definitely help you with Quiz 1, and give you an opportunity to get some initial feedback before it’s due.
- The Analyses page contains information on the analyses you’ll need to include in your final submission after completing the development of your data frame.
- The Final Report page describes the three elements of the final submission (the report, the video and the self-evaluation) that you will build in October and provides submission instructions.
- The Examples & Tips section includes information on sample work and other tips we’ve prepared to help you find your way through Project A.
- The top menu also provides links to contact us, and to the 431 home page.
Project A Overview
- You can work alone, or with one other person on this project. If you work as a pair, you create one project together, and each of you receive the same grade.
- Project A deliverables. There are two deadlines.
- as part of Lab 04, due Monday 2021-10-11 at 9 PM, a “proposal” which requires you to answer a few specific questions after creating a clean data set (a “tibble” with 200-400 rows “counties” and a specific set of variables you will choose from the available options.)
- you’ll have to tell us at this time whether you are working alone or with another person
- there is an opportunity to turn in the proposal early (by 2021-10-01 at Noon) for a “free” review by the TAs and Dr. Love, and if we’re happy, you’d be done at that point with the proposal.
- by Tuesday 2021-11-02 at 9 PM, a final submission consisting of:
- a report in R Markdown, and knitted to HTML
- a short video which highlights some of your key findings from the report
- a self-evaluation Google Form, which you’ll complete after submitting the report and video
- The Data: You will be working with data from the 2021 edition of County Health Rankings.
- These data describe most counties in the United States. Counties are located within states in the US. There are 88 counties, for instance, in Ohio.
- The data are available now online with extensive additional information available at the County Health Rankings website.
- Everyone in the class will start with the same data set, but will pick a subset of that data that will be different for each project.
- Cleaning the Data: To build your data set, which will be a sample of the full data, you will do a series of things in R.
- You’ll need to select a subset of 4-6 states (Ohio will automatically be one of them) which together include 200-400 counties.
- You’ll need to select a sample of 5 variables from those available in the data, to which you’ll add three pre-specified variables (a fips code, the state name and the county name) which will identify your observations.
- Three of those variables will be included as quantitative, and for the other two, we’ll require you to use R to create categories out of the variables. (Note: Categorizing quantitative information like this turns out to be a terrible idea in practical model-building. We’re doing this here for pedagogical reasons that will, in time, become clearer.)
- Creating a tibble of your selected counties, and your set of variables will be necessary to complete the proposal.
- Analyses You’ll Do: After your proposal is accepted, you have three analytic tasks using your clean tibble. The observations (rows) in your data will be the counties from your set of 4-6 states. The three tasks are listed below.
- Visualizing and modeling the relationship between a quantitative outcome and a quantitative predictor.
- Visualizing and modeling the relationship between a quantitative outcome and a categorical predictor (with 2-5 levels).
- Visualizing and modeling the relationship between a quantitative outcome and a quantitative predictor, after adjustment for the state in which the county is located
- Examples from Dr. Love: We have provided a sample proposal (using R Markdown) that we strongly recommend you use in developing your data and proposal work. We’ve also provided an outline of the analysis report we hope to see. These items will help ensure your final project report matches our expectations and is something to be proud of.
Project A Objectives
On the remainder of this page, you’ll find a description of the educational objectives for this project and for projects in this course, generally.
It is hard to learn statistics (or anything else) passively; concurrent theory and application are essential.
Project A is about building linear models and visualizing data from a (fairly clean) data set I provide to you. In Project A, you will complete most of the elements of a data science project designed to create a statistical model for a quantitative outcome, then use it for prediction, and assess the quality of those predictions. Tools necessary for Project A include:
- Describing the experimental or observational study design used to gather the data, as well as the complete data collection process.
- Sharing the process of ingesting and tidying the data, with specific reference to the decisions you make to “clean” things up.
- Descriptive and exploratory summaries of the data across the groups for each of your chosen outcomes, including, of course, attractive and well-constructed visualizations, which may include both graphs and tables.
- Developing appropriate research questions that lead to the identification of smart measures for predictors and outcomes, and then the development of a prediction model using linear regression.
- Regression-based comparisons of mean differences for a quantitative outcome across a set of two (or more) groups, including appropriate demonstrations of the reasons behind the models you build.
- Regression-based predictions of a quantitative outcome using a quantitative predictor.
In Project A’s analysis stage, everyone will be working with different parts of the same data set.
Think of a graph as a comparison. All graphs are comparisons (indeed, all statistical analyses are comparisons). If you already have the graph in mind, think of what comparisons it’s enabling. Or if you haven’t settled on the graph yet, think of what comparisons you’d like to make. Andrew Gelman
Why Two Projects?
The main reason is that I can’t figure out a way to get you to think about all of the things I hope you’ll learn from this course in a single Project. Another important reason is that I want you to be able to make mistakes during the semester without worrying about it too much, and having two projects spreads out this learning a bit.
- I set different tasks for Project A and for Project B, allowing us to touch on a wider fraction of the things I hope you’ll learn in 431.
- I give more guidance (including a sample Project) in Project A than in Project B.
- I have to evaluate each of your projects, and there are many students in the class. Knowing at least one of the data sets you’ll be working with helps me manage this.
- Having a broad range of activities to evaluate helps reduce the cost of a mistake on any one of them, so that we can build on what you do well.
- All of Project A can be done using materials discussed in classes 1-17.
Educational Objectives
“Statistics has no reason for existence except as the catalyst for investigation and discovery.” George E. P. Box
I am primarily interested in your learning something interesting, useful and even valuable from your project experiences in 431. In particular, an effective Project A will demonstrate:
- The ability to create and formulate research questions that are statistically and scientifically appropriate.
- The ability to turn research questions into measures of interest.
- The ability to pull and merge and clean and tidy data, then present the data set following Jeff Leek’s guide to sharing data with a statistician.
- The ability to build a reasonable (if simplistic) linear model, assess the quality of the model, and use it to make predictions.
- The ability to identify and (with help) solve problems that crop up
- The ability to comment on your work within code, and in written and oral presentation.
- The ability to build a Markdown-based report and to give a short presentation based on key findings from that report.