Project A Plan

Author

Thomas E. Love, Ph.D.

Modified

2024-09-23

1 The Deliverables

You will develop and send to us (through Canvas):

  1. the analytic tibble (as an .Rds data file) you developed following the Data page instructions,
  2. a Quarto (.qmd) file containing your data management work, and the other elements required in the Project A Plan, and
  3. the HTML result of rendering your Quarto file. The resulting HTML document will have 14 sections, as described below.

The deadline for the Project A Plan is found on the Course Calendar.

Note
  • Use spell check (hit F7) on your Quarto file before rendering it, and review the HTML result carefully to ensure that no residual warnings or error messages remain in the document.
  • When we run the Quarto file, it needs to produce your HTML file without errors. (Note: warnings or messages are OK at this stage, but it has to work.)
  • Do not suppress the R code at any point in the document. All R code should be echoed.
  • If you are working with a partner, exactly one of you should submit the materials to Canvas, and the other partner should submit a text document (Word or PDF is fine) to Canvas that reads: “My name is [YOUR NAME]. I am working on Project A with [INSERT FULL NAME OF YOUR PARTNER], and they will submit the materials for the Plan”

To complete the Project A Plan, you will need to have completed all six Data Tasks on our Data page but none of the work described on our Analyses page.

2 Use the Template and Sample Plan

There is a Project A Plan template, and a sample Project A Plan, available to you on the Examples page. Please use these tools to facilitate your progress, and help us evaluate your work quickly.

3 Project A Plan: Title

You need to come up with a meaningful title (containing at most 80 characters) for your work.

  1. Do not use the terms “Project”, “Plan” or “Project A” or “431” in your title.
  2. You can use “CHR 2024” as the abbreviation in your title for “County Health Rankings, 2024.”
  3. Your title should be no longer than 80 characters. You can use a subtitle if you like, but the title must stand on its own.
  4. It’s fine if your title focuses on just one of your three analyses, rather than all three.

4 Plan Author(s)

Your full name, and if you worked with a partner, both full names, are found in the author section of the Quarto (in the YAML) and appear legibly at the start of the HTML document.

5 Sections of the Project A Plan

The Project A Plan includes sections, as described below.

5.1 Section 1. R Packages

All necessary packages (and no unused packages) should be loaded at the start of the work, and all warnings or messages associated with that loading are suppressed in the HTML result.

Note

It seems impossible to do this work without the janitor, naniar, xfun, easystats and tidyverse packages. Remember to load easystats and the tidyverse last.

The use of other packages is up to you, and you may want to source in the Love-431.R script, too.

5.2 Section 2. Data Ingest

This section should display the results of following the instructions in Data Task 1 from the Data page. In this section, you should ingest the raw data into chr_2024_raw, selecting the main variables under consideration and then filter to the ranked counties, producing a tibble with 3088 rows and 90 columns. Do not list the tibble, but use R code to demonstrate that your work meets this requirement.

Note

We will want to check that the project ingests the data properly. This means that you have eliminated the correct row (in any way that works) as you import the data, and that you have used read_csv() to read in the data, thus creating a tibble.

5.3 Section 3. State Selection

This section follows the instructions for Data Task 2. Here, you will identify your six states by postal abbreviation code and name, and write a sentence or two describing your motivation for these selections. You will also demonstrate that your selection yields a new tibble with 300-800 counties, all of which are ranked, and that you have converted the state variable of postal abbreviation codes to a factor.

Note
  • Which states were chosen needs to be clear, and we’ll want to read a sentence about why you chose the states you did.
  • This must lead to 300-800 counties, and this number (total number of counties) must be clear from the document (both the number of rows in the final tibble and in writing.)
  • Do not drop any counties from any of the states you selected in the development of your tibble.

5.4 Section 4. Variable Selection

This section follows the instructions for Data Task 3. Here, you will provide code to identify and select a total of nine variables, including the four required and your five selected variables.

5.5 Section 5. Variable Cleaning and Renaming

For each of your selected variables, you will use the instructions in Data Task 3 to complete appropriate cleaning of the variable, and replace the initial version with your cleaned alternative, renaming each selected variable as you go.

5.6 Section 6. Creating the Analysis 2 Predictor

Use the instructions in Data Task 4 to develop the binary factor that you will use as your Analysis 2 predictor. Be sure to describe the exact cutpoints used in creating the low and high groups, and demonstrate that the resulting factor has approximately 40% of your original observations in the low group, 40% in the high group, and missing data for the middle 20%.

5.7 Section 7. Adding Data from CHR 2019 for the Analysis 3 Outcome

Use the instructions in Data Task 5 to include the CHR 2019 version of your Analysis 3 outcome as a variable in your tibble. This should give you a total of 11 variables.

5.8 Section 8. Arranging and Saving The Analytic Tibble

After renaming your variables, arrange the 11 variables in your chr_2024 tibble into the order specified in Data Task 6.

Then save the tibble as an R data set. Don’t make any changes to the tibble after this point.

5.9 Section 9. Print the Tibble

You will show that your chr_2024 tibble is in fact a tibble and not just a data frame by listing it, so that we can see that only the first 10 rows are printed, and the columns are appropriately labeled. The command you want is just:

chr_2024

5.10 Section 10. Numerical Summaries

You will do four things here.

5.10.1 Section 10.1 Table of States by Binary Factor

First, use the tabyl() function from the janitor package to produce a two-way table of state (in the rows) classified against the factor variable (with levels Low and High, plus some missing values, in the columns) that you created in Data Task 4 from the Data page.

  • Be sure to include both row and column marginal totals in your tabyl() result here.
  • We should see some missing values in the columns.

5.10.2 Section 10.2 describe_distribution() results

Next, display the results of describe_distribution() from the datawizard package (from the easystats meta-package), run on the entire tibble.

  • For all variables except the two factors, this will allow us to see (among other things) that the minimum and maximum values that make sense, and also that each of your variables shows the appropriate number of counties.

5.10.3 Section 10.3 data_codebook() results

Next, run the data_codebook() function from the datawizard package from easystats on your tibble, with two specific settings (max_values = 6 and range_at = 15), as shown below.

data_codebook(chr_2024, max_values = 6, range_at = 15)

Look over these results to be sure that:

  • all of your planned outcome and quantitative predictor variables show a minimum and maximum value,
  • you have no more than 20% missing values in either of your Analysis 1 variables, your Analysis 2 outcome and also your Analysis 3 outcome (in each year)
  • and that your data are missing only where they are supposed to be missing.
    • Specifically, you have no missing data at all in your identifying variables (fipscode, state, county) nor in your county_clustered variable.
    • Your binary factor should definitely have about 20% missing values, of course, since that’s how you created it.

Include a sentence verifying that all of these checks are true.

5.10.4 Section 10.4 Distinct Values

Fourth and finally, provide counts of distinct values for each of your variables. The following code should be helpful:

chr_2024 |> 
  summarise(across(everything(), ~ n_distinct(.))) |>
  knitr::kable()

Be sure to check this output to see that:

  • you have a distinct fipscode for every row in your tibble, and
  • you have at least 15 unique values for each outcome and for the quantitative predictors in your analyses,
  • we can see all of the results in your HTML output.

Include a sentence verifying that these checks are true.

5.11 Section 11. The Codebook

There are two steps to complete the codebook section.

First, write a sentence stating the number of counties (should be 300-800) and variables (should be eleven) in your tibble. Then use the dim() function to show us that your sentence is correct.

Second, create a variable description table which contains, for each of the 11 variables in your data set:

  • the variable’s name in your tibble,
  • its role in your planned analyses, where:
    • A1-out = Analysis 1 outcome
    • A1-pre = Analysis 1 predictor
    • and so on, leaving the variables not used in Analyses 1-3 with blank roles,
  • the original vXXX code (leave off the _rawvalue, so it might be v009, for example) for this variable in the CHR 2024 data,
  • the definition of the variable (from CHR) including its units,
    • for your binary variable, a clear indication of the cutpoints you used should be provided (what does Low mean and what does High mean?)
  • the year(s) in which the variable was measured (from CHR)
Tip

Use the County Health Rankings website’s 2024 Measures list, and in particular the linked information on that page for full descriptions, definitions, years and limitations of the variables you have selected.

  • With regard to the data for Analysis 3 that you obtained from CHR 2019, I provided the relevant information at the end of the instructions for Data Task 5.

Here is a sample variable description table with just some of the variables - yours will have to include all 11, of course…

Variable Role Old Name Description Year(s)
fipscode ID fipscode FIPS code
state ID state State Abbreviation (list your states here)
county ID county County Name
poor_fair_health A1-out v002 Percentage of adults reporting fair or poor health (age-adjusted). 2021
adult_obesity A1-pre v011 Percentage of the adult population (age 18 and older) that reports a body mass index (BMI) greater than or equal to 30 kg/m2 (age-adjusted). 2021
adult_smoking A2-out v009 Percentage of adults who are current smokers (age-adjusted). 2021
income_ineq_cat A2-pre from v044 Low: \(\leq\) 4.25, High: \(\geq\) 4.60. 2018-22
income_inequality v044 Ratio of household income at the 80th percentile to income at the 20th percentile. 2018-22

and so on.

5.12 Section 12. Your Research Questions

In this section, you will provide three research questions, one each for Analysis 1, Analysis 2, and Analysis 3. It would help to provide one subsections for each Analysis within this section, so that Section 12.1 is Analysis 1, Section 12.2 is Analysis 2, and Section 12.3 is Analysis 3.

For each analysis …

  1. Start by describing what you want to study, and then specify a research question (which should end with a question mark and be something you can resolve with the planned analysis.)
  2. Don’t boil the ocean here. You’re looking for a research question that can be reasonably addressed using your data, so it has to be pretty straightforward.
  3. If you have a pre-existing belief about what will happen, before you look at the data, please feel encouraged to include a statement about that belief before specifying your question.

5.12.1 Your Research Question for Analysis 1

Here you will propose your Research Question for Analysis 1 and provide some motivation for your variable choices associated with that analysis. A trio of well-developed research questions in this section is the most important part of this portfolio.

  • A research question will end with a question mark, and will be something you will be able to answer (or at least respond to effectively) after your Analysis 1 is complete.
  • Use one research question for Analysis 1, another for Analysis 2, and a third for Analysis 3.

After listing your question, you should provide some brief speculation as to the nature of the relationship you anticipate finding based on your hypotheses about the variables you plan to study (following my tips below.)

5.12.1.1 Tips for the research question in Analysis 1

  • Examples of dull but moderately effective and minimally appropriate research questions for Analysis 1 would be:

How well does a linear model using [predictor] predict [outcome], in [number] counties in the states of [list of your states]?

or

What is the nature of the association between [predictor] and [outcome], in [number] counties in the states of [list of your states]?

  • You should be able to do better than that, especially if you have a reason to believe something in advance about the direction or strength of the association you are anticipating.
  • However, if you’re struggling, using that format will be OK.
  • A research question uses formal but clear language.
  • Given your planned analyses, stay away from statements about cause and effect, and don’t use the words correlate or regression (in any form) in your research question.

5.12.2 Research Question for Analysis 2

Here you will propose your Research Question for Analysis 2 and provide some motivation for your variable choices associated with that analysis. All of the tips for Analysis 1 apply in this case, too.

  • Please remember that a research question for this course should end in a question mark.
  • Again, after listing your question, it is completely appropriate for you to provide some brief speculation as to the nature of the relationship you anticipate finding based on your hypotheses about the variables you plan to study.
  • An example of a dull but moderately effective and minimally appropriate research question for Analysis 2 might be:

Are there meaningful differences in the mean of [outcome] for counties with high and low levels of [predictor]?

Again, you should be able to do a bit better than that.

5.12.3 Research Question for Analysis 3

Here you will propose your Research Question for Analysis 3, and provide some motivation for your variable choices associated with that analysis. All of the tips for Analysis 1 apply in this case, too.

  • Please remember that a research question for this course should end in a question mark.
  • Again, after listing your question, it is completely appropriate for you to provide some brief speculation as to the nature of the relationship you anticipate finding based on your hypotheses about the variables you plan to study.
  • An example of a dull but moderately effective and minimally appropriate research question for Analysis 3 might be:

Are there meaningful differences in the mean of [outcome] for counties in 2024 as compared to 2019?

Again, you should be able to do a bit better than that.

5.13 Section 13. Reflection

Here you will provide a paragraph describing the most challenging (or difficult) part of completing the work so far, and how you were able to overcome whatever it was that was most challenging.

Your discussion of the biggest challenge includes a paragraph (or more) written in English, with attention paid to grammar, syntax and spelling. It needs to be clear to us from reading your reflection what your biggest issue was, and how you tried to address it.

Please be specific - a time management issue, for example, could be attributed to a lot of different things. Take the time to describe what happened and how you got past it in some detail.

5.14 Section 14. Session Information

As we have required in our Labs, there must be a Session Information section at the end of the document. This should indicate that a current version of R and R packages is used.

We’d like you to use the xfun package’s version of session_info(), as we’ve done in the labs.

6 Submission Instructions

  • Be sure to use spell check (just hit F7) on your Quarto file before rendering it, and review the HTML result carefully to ensure that no residual warnings or error messages remain in the document.
  • Be sure that your Plan includes your name (names if you are working with a partner) as the author in Quarto, and thus in HTML, as well as a proper title.

Your Plan (including your R data set, Quarto and HTML files) should then be submitted to Canvas by the deadline listed in the Course Calendar.

7 Grading The Project A Plan

Dr. Love and the TAs will review your Project A Plan and either ACCEPT it or ask for a REDO.

  • ACCEPT means you’re done, and should move on the rest of Project A, perhaps while making some small changes that we request of you in your submission on Canvas.
    • To receive an ACCEPT, you must meet the specifications above, and also you must have no more than two spelling, typographical or grammatical errors in the document as a whole that we catch.
  • REDO means you’ll need to fix the problems we’ve identified (plus make sure you’ve done all of the things in these instructions) and resubmit by the deadline we give you, which will require a rapid turnaround.
    • If your Plan is accepted on the initial submission and was on time, you’ll receive 20 points.
    • If your Plan is accepted on the first REDO (submission 2), you’ll receive 18 points out of a possible 20.
    • If your Plan is accepted on the second or third REDO (submissions 3 or 4) or later, you’ll receive 16 points out of a possible 20.
    • We hope we won’t have to go further than 2 revisions for anyone.

If you don’t have some feature that we “prefer” but doesn’t alone require a REDO, we may mention this in our ACCEPT feedback and ask you to fix it in your Project A Portfolio.