Chapter 2 Setting Up R
These Notes make extensive use of
- the statistical software language R, and
- the development environment RStudio,
both of which are free. You’ll need to install them on your machine; instructions for doing so are in the course syllabus.
If you need an even gentler introduction, or if you’re new to R and RStudio and need to learn about them, we encourage you to take a look at Ismay and Kim (2019), available at http://moderndive.com/, which provides an introduction to statistical and data sciences via R.
2.1 R Markdown
These Notes were written using R Markdown. R Markdown, like R and RStudio, is free and open source.
R Markdown is described as an authoring framework for data science, which lets you
- save and execute R code
- generate high-quality reports that can be shared with an audience
This description comes from http://rmarkdown.rstudio.com/lesson-1.html, which you can visit to get an overview and quick tour of what’s possible with R Markdown.
Another excellent resource to learn more about R Markdown tools is the Communicate section (especially the R Markdown chapter) of Grolemund and Wickham (2019).
2.2 R Packages
To start, I’ll present a series of commands I run at the beginning of these Notes. These particular commands set up the output so it will look nice as either an HTML or PDF file, and also set up R to use several packages (libraries) of functions that expand its capabilities. A chunk of code like this will occur near the top of any R Markdown work.
knitr::opts_chunk$set(comment = NA)
# library(pander); library(pwr)
library(grid); library(devtools);
library(magrittr); library(patchwork);
library(knitr); library(NHANES); library(boot);
library(broom); library(janitor); library(tidyverse)
# source("data/Love-boost.R")
I have deliberately set up this list of loaded packages/libraries to be relatively small, and will add some other packages later, as needed. You only need to install a package once, but you need to reload it every time you start a new session.
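Here is a minimal sketch of the difference between a package being installed and a package being loaded. The helper package_status is hypothetical (it is not part of these Notes) and relies only on the base R functions system.file() and search().

```r
# Hypothetical helper (not part of the Notes): report whether a package is
# installed on this machine and whether it is attached in the current session.
package_status <- function(pkg) {
  c(
    # system.file() returns "" when the package cannot be found
    installed = nzchar(system.file(package = pkg)),
    # attached packages appear on the search path as "package:<name>"
    attached  = paste0("package:", pkg) %in% search()
  )
}

package_status("base")    # always installed and always attached
package_status("broom")   # attached only after library(broom) in this session
```

A package that shows up as installed but not attached is one you have already installed, and simply need to load again with library() in the current session.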
2.3 Other Packages
I may also make use of functions in the following packages/libraries, but when I do so, I will explicitly specify the package name, using a command like Hmisc::describe(x), rather than just describe(x), so as to specify that I want the Hmisc package’s version of describe applied to whatever x is. Those packages are:
- aplpack, which provides stem.leaf and stem.leaf.backback for building fancier stem-and-leaf displays
- arm, which provides a set of functions for model building and checking that are used in Gelman and Hill (2007)
- broom, which turns the results of lots of different analyses in R into more useful tidy data frames (tibbles)
- car, which provides some tools for building scatterplot matrices, but also many other functions described in Fox and Weisberg (2011)
- cowplot, which is used in Part C to put multiple graphical objects in the same plot, like gridExtra: https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html
- DataExplorer, for generating highly detailed profiles of a data frame
- Epi, for 2x2 table analyses and materials for classical epidemiology: http://BendixCarstensen.com/Epi/
- exact2x2, for calculating McNemar odds ratios and confidence intervals in paired comparisons of proportions
- GGally, for scatterplot and correlation matrix visualizations: http://ggobi.github.io/ggally/
- ggridges, which is used to make ridgeline plots
- gridExtra, which includes a variety of functions for manipulating graphs: https://github.com/baptiste/gridextra
- Hmisc, from Frank Harrell at Vanderbilt U., for its version of describe and for many regression modeling functions we’ll use in 432. Details on Hmisc are at http://biostat.mc.vanderbilt.edu/wiki/Main/Hmisc. Frank has written several books; the most useful for 431 students is probably Harrell and Slaughter (2019)
- mice, which we’ll use (a little) in 431 for multiple imputation to deal with missing data: http://www.stefvanbuuren.nl/mi/
- mosaic, mostly for its favstats summary, but Project MOSAIC is a community of educators you might be interested in: http://mosaic-web.org/
- naniar, for wrangling and visualizing missingness, and for checking imputations. See http://naniar.njtierney.com/
- PropCIs, for computing confidence intervals for differences in proportions in paired samples
- psych, for its own version of describe, but other features are described at http://personality-project.org/r/psych/
- simputation, for some imputation work
- skimr, for its ability to provide a “skimmed” descriptive analysis of a data set
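To see why the package::function notation described above is useful, here is a small sketch using only packages that ship with R. The masking function named median is hypothetical and exists purely to create a name clash, in the same way that Hmisc and psych each supply their own describe.

```r
x <- c(2, 4, 4, 5, 100)

# A user-defined function that happens to reuse the name `median`
# (hypothetical, written only to create a name clash for this example).
median <- function(v) "not the real median"

median(x)         # calls the masking definition above
stats::median(x)  # always calls the stats package's version, which returns 4
```

Writing the package name explicitly removes any doubt about which version of a function is being called, regardless of which packages happen to be loaded.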
We’ll also use some packages that get loaded via devtools and GitHub by the code in these notes, including:
- xda, for two functions called numSummary and charSummary
- visdat, for two functions called vis_miss and vis_dat
- patchwork, which is a framework for composing ggplot2 objects (actually this is now loaded above).
Several other packages are included below, even though they are not used in these Notes, because they will be used in class sessions or in 432.
When compiling the Notes from the original code files, these packages will need to be installed (but not loaded) in R, or an error will be thrown when compiling this document. To install all of the packages used within these Notes, type in (or copy and paste) the following commands and run them in the R Console. Again, you only need to install a package once, but you need to reload it every time you start a new session.
pkgs <- c("aplpack", "arm", "babynames", "boot", "broom", "car", "cowplot",
"DataExplorer", "devtools", "Epi", "exact2x2", "faraway", "fivethirtyeight",
"foreign", "gapminder", "GGally", "ggridges", "gridExtra", "Hmisc",
"janitor", "kableExtra", "knitr", "lme4", "magrittr", "markdown",
"MASS", "mice", "mosaic", "multcomp", "naniar", "NHANES", "pander",
"PropCIs", "psych", "pwr", "qcc", "rmarkdown", "rmdformats", "rms",
"sandwich", "simputation", "skimr", "survival", "tableone",
"tidyverse", "vcd")
install.packages(pkgs)
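Since a package only needs to be installed once, one option is to install just the packages that are not yet on your machine. Here is a minimal sketch of that idea. The helper not_installed is hypothetical (it is not part of these Notes), the short vector wanted stands in for the full pkgs list, and the install.packages call is commented out so that running the sketch changes nothing.

```r
# Hypothetical helper: return the subset of `pkgs` that is not yet installed.
not_installed <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# A short stand-in for the full list of packages used in the Notes.
wanted <- c("stats", "broom", "janitor")
to_install <- not_installed(wanted)

if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to actually install them
} else {
  message("All requested packages are already installed.")
}
```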
References
Fox, John, and Sanford Weisberg. 2011. An R Companion to Applied Regression. Second edition. Thousand Oaks CA: Sage. http://socserv.socsci.mcmaster.ca/jfox/Books/Companion.
Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel-Hierarchical Models. New York: Cambridge University Press. http://www.stat.columbia.edu/~gelman/arm/.
Grolemund, Garrett, and Hadley Wickham. 2019. R for Data Science. O’Reilly. http://r4ds.had.co.nz/.
Harrell, Frank E., and James C. Slaughter. 2019. Biostatistics for Biomedical Research. Vanderbilt University School of Medicine. biostat.mc.vanderbilt.edu/ClinStat.
Ismay, Chester, and Albert Y. Kim. 2019. ModernDive: Statistical Inference via Data Science. http://moderndive.com/.