m1 <- lm(mpg ~ disp + wt, data = mtcars)
par(mfrow = c(2,2)); plot(m1); par(mfrow = c(1,1))
431 Staff
2023-10-16
I wanted to pass along these tips, most of which have came up in assessing last year’s projects.
knitr::opts_chunk$set(comment = NA)theme_bw() or theme_light() globally, rather than including it in the code of every individual plot you build. So include something liketheme_set(theme_bw())immediately after you load the tidyverse, and then don’t include theming of this sort in your individual plots, unless you are deliberately adding some specialized theming elements for a specific plot.
Check your HTML for plot-text transition problems. One of the hardest things to get people to do is add empty lines in R Markdown after they create a plot or heading. Forgetting to do this can cause your plots to show up in the HTML with the start of the next paragraph shown to the right of the plot instead of below the plot. Make sure you avoid this mistake.
Missing Data Mechanism and Dealing with Missingness. You need to have an explicit statement about your assumed missing data mechanism, including either the term MCAR, MAR or MNAR, in both Study 1 and Study 2, and you have to be specific about what you’ve done. This should be part of your HTML file everywhere where you impute or filter to complete cases. None of your analyses (in Study 1 or Study 2) should involve missing values: either you should have imputed missing values or you should have filtered to complete cases.
Spell check doesn’t check headings and subheadings. Using spell check in R Studio is trivial (just hit F7) and important, but be aware that you still need to read your HTML to be sure that you don’t have problems. A particular issue is that the spell check doesn’t check your headings and subheadings so you’ll want to pay especially close attention to those pieces. In particular, I’ve seen several people misspell the word “Transformations” in section 6 of Study 2.
Your confidence level is 90%, not 95%. All of Project B uses a 90% confidence level, so the phrase “p < 0.05” is 100% irrelevant to this work. If you must compare a p value to something, be sure it is 0.10. I would also strongly suggest you search through your work and eliminate the terms “statistical significance” and even “significant” unless you have a very good reason to include them.
Order multi-categorical factors properly. Please respect the ordering of multicategorical variables, especially in Analyses C and E for Study 1. Be sure that you adjust the levels of your factor so that they use the natural order of the variable. If you have a nominal multi-categorical variable, like race/ethnicity, in Study 2, then I suggest you order the levels of that factor variable from largest to smallest in terms of number of subjects, so that the baseline group will be the one that appears most frequently in your data.
Don’t change numeric variables to factors. If you change a numeric variable to a factor, and then change it back into a numeric variable, that will create many, many problems. Don’t do that. Instead, create a new factor variable if you’re going to convert a numeric variable into categories.
NHANES isn’t a random sample. Don’t suggest or state that it is. So the NHANES sampling procedure is a limitation in terms of you cannot really generalize to the US population with NHANES unless you use survey weighting.
Specify your approach if not standard. If you’re using NHANES data but either not using the 2017-March 2020 data, or not using adults ages 21-79, be sure that you’ve made that abundantly clear everywhere where it’s relevant, including at least in the Data Description section for Study 1 and Study 2.
Study 1 Analyses must stand on their own. Each of your four Study 1 Analyses should stand on its own, in the sense that you should specify the relevant group of subjects, the exposure and the outcome in words at the start of each of those analyses. Please label these as Analysis A, B, C, D or E, (leaving out one, of course) as I did in building the assignment.
Describe the direction and size of estimated effects. In Study 1, you should have no statements about, for instance, a statistically detectable, or clinically meaningful effect, without indicating the direction of that effect, and, if possible, its estimated size, including a confidence interval. This is easy for Analyses A, B, and D, I think, but more challenging for C and E. Be sure to carefully focus your description of your result on the direction and size of the effect you estimate, in the context of your problem.
Paired vs. Independent Samples. In Analyses A and B for Study 1, be sure that you provide a logical argument near the top of your work for why the data you are studying use (in Analysis A, paired) (in Analysis B, independent) samples.
Simplifying Conclusions in Analysis D. In Analysis D of Study 1, in writing up your conclusions after forming an appropriate 2x2 table, and specifying the probabilities of obtaining your outcome within each exposure group as estimated at the top of the table, it is completely sufficient to provide your interpretation of either:
ggplot2 or with the plot function from base R, make them tall, by incorporating r, fig.height = 8 into your chunk header for that code. For example, this is the default size:and below is what you get if you add #| fig.height: 8 at the start of the code chunk.
This helps us see things more effectively, especially with large sample sizes in the plots. So please do it.
Box-Cox. In Study 2, in the Transformation of Outcome section, please show the Box-Cox analysis immediately after the starting graphical summary (as opposed to the strange approach I used in the template) and then either use it (which is fine) or specify why you’ve decided not to use it. Remember that a Box-Cox \(\lambda\) near 0 suggests a logarithmic transformation, and that a Box-Cox \(\lambda\) of 1 indicates no transformation.
Observed vs. Predicted Plots. In Study 2, the plots of observed vs. predicted results for each of your two models you develop in section 10.2. on Visualizing the Predictions is important for two reasons (I think the first of these is being ignored):
The Discussion section is important. The one piece of your HTML that I guarantee I will be looking at to help me settle on your final grade is the Discussion section in Study 2. I expect to see meaningful paragraphs there in response to the required elements. So don’t neglect that material just because it comes last.
Don’t forget to submit:
Thanks and good luck to you all!