Section 1 Course Description
PQHS 432 (cross-listed as, for example, CRSP 432 and MPHP 432, and formerly known as EPBI 432) is the second half of a two-semester sequence (with PQHS 431) focused on modern data analysis and advanced statistical modeling, with a practical bent (as little theory as possible), emphasizing the key role of thinking hard, and well, about design and analysis in research. The title listed by the registrar is a little dated - I prefer Data Science for Biological, Medical or Health Research.
This is a good course for people who want to learn how to use the R language to get information from data, and who want to learn about making comparisons and building models to help make meaningful progress in research, focusing on questions from biology, medicine and public health. We spend time managing and visualizing data, building models and making predictions, and other things thought of as “data science” - in essence, this highly applied course focuses on modern, more than classical, tools for learning from data. The course is taught using the R statistical software and the RStudio environment, and the material covered in 431 is assumed in 432. Students learned a lot of R in the 431 course, and that material remains available at https://github.com/THOMASELOVE/431. We’ll continue to use RStudio and R Markdown as tools to help make R work better, and to perform our research in replicable ways.
1.1 General Approach / Topics
The course covers the following general topics, roughly in this order, through early April. Additional topics (for the remainder of April) will be determined later in the semester.
- Linear Regression (including weighted and robust approaches, variable selection, dealing with missing data, fitting non-linear relationships through predictor transformation, cross-validation approaches, and multi-factor ANOVA and ANCOVA)
- Logistic Regression (including models for binary outcomes, models for proportions, and risk adjustment)
- Generalized Linear Models (including regression models for count data and multi-categorical outcomes)
- The Statistical Crisis in Science
- Cluster Analysis (mostly in the form of Principal Components Analysis)
- Survival Analysis (Kaplan-Meier curves and Cox Regression)
A more detailed and up-to-date plan is contained on the Topics page for the course.
1.2 Prerequisites
Taking 432 without 431 is not recommended. The pace can be brisk at times, but all CWRU students who feel up to it are welcome, in any field of study.
The main things students need for 432 are:
- tools: substantive knowledge of the use of R, RStudio and R Markdown to produce code which will ingest, visualize, explore, analyze and model data, then communicate the results
- statistical methodology: substantive understanding of statistical inference in the one-, two- and multi-sample cases and the fundamentals of linear regression models, including the building of multiple linear regressions, and their evaluation through diagnostic plots, stepwise model selection, assessment of uncertainty via confidence and prediction intervals, and basic in-sample and out-of-sample validation summaries
- data to study related to biological, health and/or medical phenomena, and
- an interest in studying data closely and presenting rigorous analyses effectively
Some of these topics are reviewed in early 432 sessions.
1.3 Everything is on the Web
https://github.com/THOMASELOVE/432-2018 is the place to go for everything related to this course. Please visit any time you need something. I update the web site frequently. You’ll find links there related to:
- your homework Assignments
- Data and Code I will provide
- my in-class presentation Slides
- instructions and hints related to the Projects
- various outside Texts
- a detailed Schedule of classes and deadlines
- details on the Quizzes