Section 3 Course Description
PQHS 431 (cross-listed as CRSP 431 and MPHP 431) is the first half of a two-semester sequence (with 432) focused on modern data analysis and advanced statistical modeling, with a practical bent and as little theory as possible. We emphasize the key roles of thinking hard, and well, about design and analysis in research.
The course is formally titled Statistical Methods in Biological & Medical Sciences, Part 1. A more accurate title is Data Science for Biological, Medical or Health Research.
We’ll learn about managing and visualizing data, building models and making predictions, and other data science activities. This highly applied course focuses on modern tools for learning from data. We’ll learn a lot of R, and we’ll use RStudio and R Markdown (and, in 432, Quarto) as tools to help make R work better, and help perform our research in rigorous and replicable ways.
3.1 Course Objectives
During the 431-432 sequence, students will:
- Use modern data science tools to import, tidy/manage, explore (through transformation, visualization and modeling) and communicate about data.
- Think hard and well about rigorous design and analysis in scientific research.
- Gain sufficient background in the practical issues regarding linear and generalized linear models to give you a starting place for meaningful applied work, particularly in terms of making comparisons to address general types of statistical and analytic questions (exploratory, predictive, inferential, and causal, in particular).
- Learn about the importance of replicable research, and develop facility and practice in open source tools for doing it.
- Complete a series of assignments (labs, projects and quizzes) designed to help you demonstrate what you’ve learned.
- Program (“Code”) in R sufficiently to accomplish the tasks above, with enough self-sufficiency afterwards to be able to debug and use new R tools without substantial troubleshooting help. What separates “doing data science” from “doing data analysis” is programming.
This is NOT a course in mathematical statistics or statistical inference. It’s far more applied than that.
3.2 Key Topics in 431-432
- Exploratory Data Analysis: “All graphs are comparisons” - including data exploration, statistical graphics and more general visualization of information.
- Placing biological, medical and health research questions into a statistical framework.
- Study Development - making choices in designing and executing the collection and aggregation of data.
- Data Handling - including important issues in importing, tidying and transforming data, as well as methods for dealing with missing data, including imputation.
- Statistical Comparisons: “All of statistics are comparisons” - including methods for discrete and continuous variables: intervals, assumptions, some thoughts on statistical power, the bootstrap, and the design of visualizations and models for rates, proportions and contingency tables.
- The proper and rigorous use of multi-predictor models for continuous and discrete data, including…
- Fitting, evaluating, and interpreting linear and generalized linear models.
- Prediction and validation.
- Critical role of graphics, including diagnostics and residual analysis.
- Model choice, including variable selection, shrinkage and model uncertainty.
- Dealing with categorical predictors and interactions meaningfully.
- Causal inference using regression: controlling for covariates meaningfully.
- Using R and RStudio to make all of the things above happen, with particular emphasis on doing replicable research and using R Markdown (and Quarto) to document the work in a replicable way.
3.3 431 Course Outline & Format
The main group sessions for the 431 course will include 24 in-person lecture sessions (plus two “working day” sessions to be explained later in the term) led by Professor Love, to be held on Tuesdays and Thursdays from 1:00 to 2:15 PM in E321-323 in the Robbins Building at the CWRU School of Medicine.
- The Course Calendar provides additional detail on specific sessions, and links to materials used in those sessions, including slides.
3.3.1 Part A: R and Exploring Data
Classes 1-12 (roughly) focus on this material.
- Exploratory Data Analysis
- Descriptive Numerical and Graphical Summaries
- Histograms and their cousins
- Scatterplots and related tools from correlation and linear regression
- Dealing with Missing Data
- The Importance of the Normal Distribution
- Exploring Data with the Tidyverse, Getting Up To Speed with R
- Visualizing Data with ggplot2
- Data Transformation and dplyr
- Using scripts and projects, Building Code
3.3.2 Part B: Making Comparisons
Classes 13-17 (roughly) focus on this material.
- Estimation and Inference for Means and Proportions (especially)
- Confidence Intervals
- Design Implications: Matched vs. Independent Samples
- Hypothesis Testing Strategies and why significance isn’t so helpful
- Cross-Tabulations
- Randomized Trials vs. Non-Randomized Studies
3.3.3 Part C: Linear Models
Classes 18-24 (roughly) focus on this material.
- Estimation and Inference using Ordinary Least Squares
- Simple and Multiple Linear Regression Models
- Building Prediction Models, and Validating Them
- Categorical Variables, Analysis of Variance
- Analysis of Covariance
- Residual and Influence Analyses
- Foundations of Model / Feature / Variable Selection
- What you’ve learned in the past and how it wasn’t so helpful
3.4 Prerequisites and Intended Student Population
What do we expect you to know already before you start the course? Not much.
Useful prior experience includes training/experience in statistics, coding/programming and biology/biomedical science. We expect most people will have some experience in one or two of these areas, but very few will have all three.
- Some students have lots of prior training in statistics. But many students in the class have no statistical training that they use regularly. We assume only that everyone knows what an average is, and has some sense of why statistics might be useful to them in their chosen field.
- Some students have lots of prior coding and programming experience, including experience with R. Some have never written a line of code in their life. We assume only that everyone is willing to learn how to do modern work with data, and that means writing computer code, but that some people will be starting from nothing.
- Some students have lots of prior experience with biological and biomedical science, and know a lot of useful things in those areas which relate directly to our work. Others have zero experience in this area, and will learn a lot from their colleagues. We assume only that everyone is willing to learn, and to put in some effort to do so.
People succeed in this course with a wide range of backgrounds and a common interest in using data effectively in research related to biology, health or medicine. There will be multiple people in the class who are years away from their last statistics class. We expect the majority of students will have no prior experience using R, or any meaningful recollection of using statistical software.
The pace can be brisk at times, but all CWRU students who feel up to it are welcome, regardless of their field of study or prior experience. Section 1 (Professor Love’s section) is specifically geared towards students in programs under the auspices of the Department of Population & Quantitative Health Sciences, as well as students who intend to continue on and take 432 this Spring. Section 2 (with Professor Zhang) is more appropriate for most other students.
3.5 Motivations for our Approach
Professor Love has a lot of thoughts on this issue and you’ll hear about them through the semester, but you may prefer to hear from other people on the subject. So here are a few references that have guided our recent thinking.
- A Guide to Teaching Data Science by Stephanie C. Hicks, Rafael A. Irizarry (pdf)
- … our (case-study) approach (in a graduate-level, introductory data science course) teaches students three key skills needed to succeed in data science, which we refer to as creating, connecting, and computing.
- Data Visualization on Day One: Bringing Big Ideas into Intro Stats Early and Often by Xiaofei Wang, Cynthia Rush, Nicholas Jon Horton (pdf)
- 50 Years of Data Science by David Donoho in the Journal of Computational and Graphical Statistics, 2017.
- Why You Should Master R (Even if it might eventually become obsolete) blog post from Sharp Sight, 2016-12-27
- Teaching R to New Users - From tapply to the Tidyverse, A YouTube video by Roger D. Peng
- Teach the Tidyverse to Beginners and a related post on teaching ggplot2 specifically, from David Robinson. There is also a related video from rstudio::conf 2018.
- Video from Hadley Wickham, You can’t do data science in a GUI, 2018 in Chicago.
3.6 Is 432 Required?
If I take 431 this semester, do I have to take 432 in the Spring?
It is the natural thing to do, and I assume that almost all of you will do so. The 431 course is part 1 of a two-semester sequence. Frankly, 432 contains some of the most interesting material and is generally regarded by students who take both as the more entertaining course. Every year, some students take only 431, though. The decision is up to you. The 432 course assumes you have completed 431, whether with me or another instructor.