Section 4 Course Description
PQHS 431 (cross-listed as CRSP 431 and MPHP 431) is the first half of a two-semester sequence (with 432) focused on modern data analysis and advanced statistical modeling, with a practical bent and as little theory as possible. We emphasize the key roles of thinking hard, and well, about design and analysis in research.
The course is formally titled Statistical Methods in Biological & Medical Sciences, Part 1. A more accurate title is Data Science for Biological, Medical or Health Research.
We’ll learn about managing and visualizing data, building models and making predictions, and other data science activities. This highly applied course focuses on modern tools for learning from data. We’ll learn a lot of R, and we’ll use RStudio and R Markdown as tools that make R easier to work with and help us do our research in rigorous and replicable ways.
4.1 Course Objectives
During the 431-432 sequence, students will:
- Use modern data science tools to import, tidy/manage, explore (through transformation, visualization and modeling) and communicate about data.
- Think hard and well about rigorous design and analysis in scientific research.
- Gain sufficient background in the practical issues regarding linear and generalized linear models to have a starting place for meaningful applied work, particularly in making comparisons to address general types of statistical and analytic questions (exploratory, predictive, inferential and causal, in particular).
- Learn about the importance of replicable research, and develop facility and practice in open source tools for doing it.
- Complete a series of assignments designed to help you demonstrate what you’ve learned.
- Program (“Code”) in R sufficiently to accomplish the tasks above, with enough self-sufficiency afterwards to be able to debug and use new R tools without substantial troubleshooting help. What separates “doing data science” from “doing data analysis” is programming.
4.2 Key Topics in 431 and 432
This is NOT a course in mathematical statistics or statistical inference. It’s far more applied than that.
- Exploratory Data Analysis: “All graphs are comparisons” - including data exploration, statistical graphics and more general visualization of information.
- Placing biological, medical and health research questions into a statistical framework.
- Study Development - making choices in designing and executing the collection and aggregation of data.
- Data Handling - including important issues in importing, tidying and transforming data, as well as methods for dealing with missing data, including imputation.
- Statistical Comparisons: “All of statistics are comparisons” - including methods for discrete and continuous variables: intervals, assumptions, some thoughts on statistical power, and the bootstrap, design of visualizations and models for rates, proportions and contingency tables.
- The proper and rigorous use of multi-predictor models for continuous and discrete data, including…
  - Fitting, evaluating, and interpreting linear and generalized linear models.
  - Prediction and validation.
  - Critical role of graphics, including diagnostics and residual analysis.
  - Model choice, including variable selection, shrinkage and model uncertainty.
  - Dealing with categorical predictors and interactions meaningfully.
  - Causal inference using regression: controlling for covariates meaningfully.
- Using R and RStudio to make all of the things above happen, with particular emphasis on doing replicable research and using Markdown to document the work.
4.3 The 431 course is split in two parts.
Part A (Classes 2-14, roughly) is mostly about R, Visualizing Data and Making Comparisons.
Project A is focused on the material from this part of the course.
- Exploratory Data Analysis
  - Descriptive Numerical and Graphical Summaries
  - Distributions, specifically the Normal
  - Histograms and their cousins
  - Scatterplots and related tools from correlation and linear regression
- Exploring Data with the Tidyverse, Getting Up To Speed with R
  - Visualizing Data with ggplot2
  - Data Transformation and dplyr
  - Using scripts and projects, Building Code
  - Dealing with Missing Data
- Estimation and Inference for Means and Proportions (especially)
  - Confidence Intervals
  - Design Implications: Matched vs. Independent Samples
  - Hypothesis Testing Strategies and why significance isn’t so helpful
  - Cross-Tabulations
  - Randomized Trials vs. Non-Randomized Studies
Part B (which starts around Class 15) is about Building Regression Models.
Project B also incorporates material from this part of the course.
- Estimation and Inference using Ordinary Least Squares
- Simple and Multiple Linear Regression Models
- Building Prediction Models, and Validating Them
- Categorical Variables, Analysis of Variance
- Analysis of Covariance
- Residual and Influence Analyses
- Foundations of Model / Feature / Variable Selection
- What you’ve learned in the past and how it wasn’t so helpful
4.4 What We Expect You To Know Already
Not much.
Useful prior experience includes training/experience in statistics, coding/programming and biology/biomedical science. We expect most people will have some experience in one or two of these areas, but very few will have all three.
- Some students have lots of prior training in statistics. But many students in the class have no statistical training that they use regularly. We assume only that everyone knows what an average is, and has some sense of why statistics might be useful to them in their chosen field.
- Some students have lots of prior coding and programming experience, including experience with R. Some have never written a line of code in their lives. We assume only that everyone is willing to learn how to do modern work with data, which means writing computer code, and we recognize that some people will be starting from nothing.
- Some students have lots of prior experience with biological and biomedical science, and know a lot of useful things in those areas which relate directly to our work. Others have zero experience in this area, and will learn a lot from their colleagues. We assume only that everyone is willing to learn, and to put in some effort to do so.
People succeed in this course with a wide range of backgrounds and a common interest in using data effectively in research related to biology, health or medicine. There will be multiple people in the class who are years away from their last statistics class. We expect the majority of students will have no prior experience using R, or any meaningful recollection of using statistical software.
The pace can be brisk at times, but all CWRU students who feel up to it are welcome, regardless of their field of study or prior experience.
4.5 Why We Teach 431 Like This
Dr. Love has a lot of thoughts on this issue, but you may prefer to hear from other people on the subject. So here are a few references that have guided our recent thinking.
- A Guide to Teaching Data Science by Stephanie C. Hicks, Rafael A. Irizarry (pdf)
  - “… our (case-study) approach (in a graduate-level, introductory data science course) teaches students three key skills needed to succeed in data science, which we refer to as creating, connecting, and computing.”
- Data Visualization on Day One: Bringing Big Ideas into Intro Stats Early and Often by Xiaofei Wang, Cynthia Rush, Nicholas Jon Horton (pdf)
- 50 Years of Data Science by David Donoho in the Journal of Computational and Graphical Statistics, 2017.
- Why You Should Master R (Even if it might eventually become obsolete) blog post from Sharp Sight, 2016-12-27
- Teaching R to New Users - From tapply to the Tidyverse by Roger D. Peng, which is also available as a YouTube Video
- Teach the Tidyverse to Beginners and a related post on teaching ggplot2, specifically, from David Robinson. There is also a related video from rstudio::conf 2018.
- Video from Hadley Wickham, You can’t do data science in a GUI, 2018 in Chicago.