<- lm(mpg ~ disp + wt, data = mtcars)
m1 par(mfrow = c(2,2)); plot(m1); par(mfrow = c(1,1))
431 Project B Tips
What is this?
I wanted to pass along these tips, most of which have came up in assessing last year’s projects.
- This should be a good set of things to review (along with the Project B Checklist) as you’re preparing your final materials for submission.
YAML and Setup issues
- No hashtags preceding R results. I would like you to be sure that your R output is not preceded by hashtags. The easiest way to ensure this is to include the following code at the top of your code, where you load your R packages.
knitr::opts_chunk$set(comment = NA)
- Theming gg-Plots. I would like you to use either
theme_bw()
ortheme_light()
globally, rather than including it in the code of every individual plot you build. So include something like
theme_set(theme_bw())
immediately after you load the tidyverse, and then don’t include theming of this sort in your individual plots, unless you are deliberately adding some specialized theming elements for a specific plot.
- Clean List of R Packages. Your list of R packages should be clean for each study, which means:
- tidyverse is loaded last
- none of the core tidyverse packages (core list is here) are loaded
- all packages you will load for this study are in one place
- no packages are loaded that you don’t use in your work.
General Issues
Check your HTML for plot-text transition problems. One of the hardest things to get people to do is add empty lines in R Markdown after they create a plot or heading. Forgetting to do this can cause your plots to show up in the HTML with the start of the next paragraph shown to the right of the plot instead of below the plot. Make sure you avoid this mistake.
Missing Data Mechanism and Dealing with Missingness. You need to have an explicit statement about your assumed missing data mechanism, including either the term MCAR, MAR or MNAR, in both Study 1 and Study 2, and you have to be specific about what you’ve done. This should be part of your HTML file everywhere where you impute or filter to complete cases. None of your analyses (in Study 1 or Study 2) should involve missing values: either you should have imputed missing values or you should have filtered to complete cases.
Spell check doesn’t check headings and subheadings. Using spell check in R Studio is trivial (just hit F7) and important, but be aware that you still need to read your HTML to be sure that you don’t have problems. A particular issue is that the spell check doesn’t check your headings and subheadings so you’ll want to pay especially close attention to those pieces. In particular, I’ve seen several people misspell the word “Transformations” in section 6 of Study 2.
Your confidence level is 90%, not 95%. All of Project B uses a 90% confidence level, so the phrase “p < 0.05” is 100% irrelevant to this work. If you must compare a p value to something, be sure it is 0.10. I would also strongly suggest you search through your work and eliminate the terms “statistical significance” and even “significant” unless you have a very good reason to include them.
Order multi-categorical factors properly. Please respect the ordering of multicategorical variables, especially in Analyses C and E for Study 1. Be sure that you adjust the levels of your factor so that they use the natural order of the variable. If you have a nominal multi-categorical variable, like race/ethnicity, in Study 2, then I suggest you order the levels of that factor variable from largest to smallest in terms of number of subjects, so that the baseline group will be the one that appears most frequently in your data.
Don’t change numeric variables to factors. If you change a numeric variable to a factor, and then change it back into a numeric variable, that will create many, many problems. Don’t do that. Instead, create a new factor variable if you’re going to convert a numeric variable into categories.
NHANES issues
NHANES isn’t a random sample. Don’t suggest or state that it is. So the NHANES sampling procedure is a limitation in terms of you cannot really generalize to the US population with NHANES unless you use survey weighting.
Specify your approach if not standard. If you’re using NHANES data but either not using the 2017-March 2020 data, or not using adults ages 21-79, be sure that you’ve made that abundantly clear everywhere where it’s relevant, including at least in the Data Description section for Study 1 and Study 2.
Study 1 Issues
Study 1 Analyses must stand on their own. Each of your four Study 1 Analyses should stand on its own, in the sense that you should specify the relevant group of subjects, the exposure and the outcome in words at the start of each of those analyses. Please label these as Analysis A, B, C, D or E, (leaving out one, of course) as I did in building the assignment.
Describe the direction and size of estimated effects. In Study 1, you should have no statements about, for instance, a statistically detectable, or clinically meaningful effect, without indicating the direction of that effect, and, if possible, its estimated size, including a confidence interval. This is easy for Analyses A, B, and D, I think, but more challenging for C and E. Be sure to carefully focus your description of your result on the direction and size of the effect you estimate, in the context of your problem.
- For instance, a bad sentence in Analysis B would be something like “We saw a significant difference between males and females on mean systolic blood pressure.”
- A better sentence would be something like “The mean systolic blood pressure for males was 3 mm Hg higher than that of females, with a 90% CI of (1, 5).” Notice that this better sentence includes the actual units of measurement, and not something generic like “points”.
Paired vs. Independent Samples. In Analyses A and B for Study 1, be sure that you provide a logical argument near the top of your work for why the data you are studying use (in Analysis A, paired) (in Analysis B, independent) samples.
Simplifying Conclusions in Analysis D. In Analysis D of Study 1, in writing up your conclusions after forming an appropriate 2x2 table, and specifying the probabilities of obtaining your outcome within each exposure group as estimated at the top of the table, it is completely sufficient to provide your interpretation of either:
- the relative risk and the odds ratio and their confidence intervals, or
- the relative risk and the difference in probabilities, with their confidence intervals.
- Describe some percentages in Analysis E. In Analysis E for Study 1, you should focus your interpretation of the result from your table and chi-square test on a comparison of interesting percentages from your table, in addition to the p value and a visualization of the results.
Study 2 Issues
- Residual Plots should be tall. When building residual plots, whether with
ggplot2
or with theplot
function from base R, make them tall, by incorporatingr, fig.height = 8
into your chunk header for that code. For example, this is the default size:
and below is what you get if you add #| fig.height: 8
at the start of the code chunk.
<- lm(mpg ~ disp + wt, data = mtcars)
m1 par(mfrow = c(2,2)); plot(m1); par(mfrow = c(1,1))
This helps us see things more effectively, especially with large sample sizes in the plots. So please do it.
Box-Cox. In Study 2, in the Transformation of Outcome section, please show the Box-Cox analysis immediately after the starting graphical summary (as opposed to the strange approach I used in the template) and then either use it (which is fine) or specify why you’ve decided not to use it. Remember that a Box-Cox \(\lambda\) near 0 suggests a logarithmic transformation, and that a Box-Cox \(\lambda\) of 1 indicates no transformation.
Observed vs. Predicted Plots. In Study 2, the plots of observed vs. predicted results for each of your two models you develop in section 10.2. on Visualizing the Predictions is important for two reasons (I think the first of these is being ignored):
- It helps you see the range of predicted values for each model on the X axis, and compare it to the range of observed values on the Y axis. Many models will be overly conservative, only predicting outcomes within a small range. For instance, if your big model’s range of predictions is much more in keeping with the observed range, that’s a reason to like the big model.
- It helps you see if one of the models matches the line for observed = predicted better than the other.
- Using Validated R-Square. In Study 2, you should use the validated R-square you develop in section 10.3.2 as part of your discussion in both Sections 10.4 and 11.1. (in addition to whatever else you decide to use) to help describe how successful your winning model is. You should also reflect in Section 11.1 (Chosen Model, within the Discussion section) on the relationship between the original training sample R-square you observed for your chosen model and the validated R-square you calculated for that model in section 10.3.2. Here, you want to assess how overconfident or underconfident your original R-square was, basically.
And finally…
The Discussion section is important. The one piece of your HTML that I guarantee I will be looking at to help me settle on your final grade is the Discussion section in Study 2. I expect to see meaningful paragraphs there in response to the required elements. So don’t neglect that material just because it comes last.
Don’t forget to submit:
- your Study 1 qmd and HTML, and your Study 2 qmd and HTML to Canvas no later than the deadline.
- your data, if you’re not using NHANES, to Canvas no later than the deadline,
- your Project B self-evaluation form after you submit your Canvas materials, and no later than the deadline.
- your CWRU class evaluation by their deadline.
Thanks and good luck to you all!