::opts_chunk$set(comment = NA)
knitr
library(broom)
library(broom.mixed)
library(gt)
library(janitor)
library(mosaic)
library(bayestestR)
library(insight)
library(rstanarm)
library(conflicted)
library(tidyverse)
conflicts_prefer(dplyr::select, dplyr::filter, base::mean, base::range)
theme_set(theme_bw())
34 NEW!! Bayes and a Logistic Model
Almost all of this material is based on
- https://mc-stan.org/rstanarm/articles/binomial.html and
- https://easystats.github.io/bayestestR/articles/bayestestR.html and
- https://easystats.github.io/bayestestR/articles/example1.html and
- https://easystats.github.io/bayestestR/articles/example2.html#logistic-model
There’s not a lot that is truly original here. That’s a summer project.
34.1 R Setup Used Here
34.2 Return to the smalldat
Example
Consider the smalldat.csv
data we discussed initially Chapter 22. The data includes 150 observations on 6 variables, and our goal here is to predict smoker
given four predictors (totchol
, age
, sex
and educ
.)
<- read_csv("data/smalldat.csv", show_col_types = FALSE) smalldat
Variable | Description |
---|---|
subject | Subject identification code |
smoker | 1 = current smoker, 0 = not current smoker |
totchol | total cholesterol, in mg/dl |
age | age in years |
sex | subject’s sex (M or F) |
educ | subject’s educational attainment (4 levels) |
The educ levels are: 1_Low, 2_Middle, 3_High and 4_VHigh, which stands for Very High
34.3 The Distribution of Smoking Status
Across the 150 observations in the smalldat
data, we have 74 smokers.
|> tabyl(smoker) |> adorn_pct_formatting() |> adorn_totals() smalldat
smoker n percent
0 76 50.7%
1 74 49.3%
Total 150 -
34.4 Fitting a Logistic Regression Model with glm()
<- glm((smoker == 1) ~ age + sex + totchol + factor(educ),
m1 data = smalldat)
glance(m1) |> select(nobs, df.residual, AIC, BIC, logLik, deviance, null.deviance) |>
gt() |> fmt_number(AIC:null.deviance, decimals = 2)
nobs | df.residual | AIC | BIC | logLik | deviance | null.deviance |
---|---|---|---|---|---|---|
150 | 143 | 218.30 | 242.39 | −101.15 | 33.83 | 37.49 |
## raw coefficients
tidy(m1, conf.int = TRUE, conf.level = 0.95) |>
gt() |> fmt_number(decimals = 3)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 1.071 | 0.317 | 3.374 | 0.001 | 0.449 | 1.693 |
age | −0.014 | 0.005 | −2.872 | 0.005 | −0.024 | −0.004 |
sexM | 0.131 | 0.083 | 1.578 | 0.117 | −0.032 | 0.293 |
totchol | 0.001 | 0.001 | 0.531 | 0.597 | −0.002 | 0.003 |
factor(educ)2_Middle | −0.089 | 0.100 | −0.894 | 0.373 | −0.284 | 0.106 |
factor(educ)3_High | −0.073 | 0.121 | −0.604 | 0.547 | −0.309 | 0.163 |
factor(educ)4_VHigh | −0.242 | 0.126 | −1.919 | 0.057 | −0.489 | 0.005 |
## with exponentiated coefficients
tidy(m1, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) |>
gt() |> fmt_number(decimals = 3)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 2.918 | 0.317 | 3.374 | 0.001 | 1.566 | 5.435 |
age | 0.986 | 0.005 | −2.872 | 0.005 | 0.977 | 0.996 |
sexM | 1.140 | 0.083 | 1.578 | 0.117 | 0.969 | 1.340 |
totchol | 1.001 | 0.001 | 0.531 | 0.597 | 0.998 | 1.003 |
factor(educ)2_Middle | 0.915 | 0.100 | −0.894 | 0.373 | 0.753 | 1.112 |
factor(educ)3_High | 0.930 | 0.121 | −0.604 | 0.547 | 0.734 | 1.177 |
factor(educ)4_VHigh | 0.785 | 0.126 | −1.919 | 0.057 | 0.613 | 1.005 |
34.5 Fitting a Bayesian Logistic Regression
Can we fit a model for the same data using a Bayesian approach?
Yes, we can, for instance using the stan_glm()
function from the rstanarm
package.
set.seed(43234231) # best to set a random seed first
<- stan_glm((smoker == 1) ~ age + sex + totchol + factor(educ),
m2 data = smalldat, refresh = 0)
Here the refresh = 0
parameter stops the machine from printing out each of the updates it does while sampling, which is not generally something I need to look at. Here’s what’s placed in the m2
object:
m2
stan_glm
family: gaussian [identity]
formula: (smoker == 1) ~ age + sex + totchol + factor(educ)
observations: 150
predictors: 7
------
Median MAD_SD
(Intercept) 1.1 0.3
age 0.0 0.0
sexM 0.1 0.1
totchol 0.0 0.0
factor(educ)2_Middle -0.1 0.1
factor(educ)3_High -0.1 0.1
factor(educ)4_VHigh -0.2 0.1
Auxiliary parameter(s):
Median MAD_SD
sigma 0.5 0.0
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
- The first few lines specify the fitting process.
- Next, for each coefficient, we find the median value from the posterior distribution, and the MAD_SD value, which is an indicator of variation derived from the estimated posterior distribution of the parameters, and is used as a standard error in what follows.
- Finally, we see the estimated root mean squared error (residual standard deviation)
sigma
, again estimated with the median of thesigma
values in the posterior distribution.
Using the summary()
function provides some additional information about the parameter estimates, but mostly some convergence diagnostics for the Markov Chain Monte Carlo procedure that the rstanarm
package used to build the estimates.
summary(m2)
Model Info:
function: stan_glm
family: gaussian [identity]
formula: (smoker == 1) ~ age + sex + totchol + factor(educ)
algorithm: sampling
sample: 4000 (posterior sample size)
priors: see help('prior_summary')
observations: 150
predictors: 7
Estimates:
mean sd 10% 50% 90%
(Intercept) 1.1 0.3 0.7 1.1 1.5
age 0.0 0.0 0.0 0.0 0.0
sexM 0.1 0.1 0.0 0.1 0.2
totchol 0.0 0.0 0.0 0.0 0.0
factor(educ)2_Middle -0.1 0.1 -0.2 -0.1 0.0
factor(educ)3_High -0.1 0.1 -0.2 -0.1 0.1
factor(educ)4_VHigh -0.2 0.1 -0.4 -0.2 -0.1
sigma 0.5 0.0 0.5 0.5 0.5
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 0.5 0.1 0.4 0.5 0.6
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0 1.0 4790
age 0.0 1.0 2700
sexM 0.0 1.0 3922
totchol 0.0 1.0 3983
factor(educ)2_Middle 0.0 1.0 2847
factor(educ)3_High 0.0 1.0 3724
factor(educ)4_VHigh 0.0 1.0 3550
sigma 0.0 1.0 4324
mean_PPD 0.0 1.0 4262
log-posterior 0.0 1.0 1996
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
34.5.1 Extracting the Posterior
Let’s extract the coefficients of our model, using the get_parameters()
function from the insight
package:
<- get_parameters(m2)
posteriors
head(posteriors)
(Intercept) age sexM totchol factor(educ)2_Middle
1 1.0395289 -0.004619993 0.25120820 -0.0012609246 -0.07421785
2 1.3897359 -0.015504179 0.05308297 -0.0005806271 -0.05711548
3 0.6960912 -0.012991489 0.12013110 0.0015857650 0.08652352
4 0.3501038 -0.011626531 0.23179922 0.0024659073 0.12610455
5 1.5843855 -0.009405856 0.05310819 -0.0025427990 -0.05049668
6 1.0403138 -0.006376498 0.04919309 -0.0013779428 0.08575331
factor(educ)3_High factor(educ)4_VHigh
1 -0.247278983 -0.34931693
2 -0.027945532 -0.04715445
3 0.098936630 -0.09453129
4 0.055186994 -0.08377399
5 0.003482159 -0.11669724
6 0.123287117 -0.06056735
In all, we have 4000 observations of this posterior distribution:
nrow(posteriors)
[1] 4000
Let’s visualize the posterior distribution of our parameter for age
.
ggplot(posteriors, aes(x = age)) + geom_density(fill = "dodgerblue")
This distribution describes the probability (on the vertical axis) of various age
effects (shown on the horizontal axis). Most of the distribution is between -0.025 and -0.005, with the peak being around -0.15.
Remember that our m1
fit with glm()
had an estimated \(\beta\) for age
of -0.014, so, as is often the case, there is not a lot of difference between the two models in terms of the estimates they make.
Here’s the mean and median of the age
effect, across our 4000 simulations from the posterior distribution.
mean(posteriors$age)
[1] -0.01389409
median(posteriors$age)
[1] -0.01391701
And here are the results after exponentiation, so that the estimates describe odds ratios:
mean(exp(posteriors$age))
[1] 0.9862135
median(exp(posteriors$age))
[1] 0.9861794
Again, these are very close to what we obtained from least squares estimation.
Another option is to take the mode (peak) of the posterior distribution, and this is called the maximum a posteriori (MAP) estimate:
map_estimate(posteriors$age)
MAP Estimate
Parameter | MAP_Estimate
------------------------
x | -0.01
map_estimate(exp(posteriors$age)) # on odds ratio scale
MAP Estimate
Parameter | MAP_Estimate
------------------------
x | 0.99
Adding these estimates to our plot, we can see that they are quite close:
ggplot(posteriors, aes(x = age)) +
geom_density(fill = "dodgerblue") +
# The mean in yellow
geom_vline(xintercept = mean(posteriors$age), color = "yellow", linewidth = 1) +
# The median in red
geom_vline(xintercept = median(posteriors$age), color = "red", linewidth = 1) +
# The MAP in purple
geom_vline(xintercept = as.numeric(map_estimate(posteriors$age)), color = "purple", linewidth = 1)
34.5.2 Describing Uncertainty
We might describe the range of estimates for the age
effect.
range(posteriors$age)
[1] -0.03166005 0.00409895
Instead of showing the whole range, we usually compute the highest density interval at some percentage level, for instance a 95% credible interval which shows the range containing the 95% most probable effect values.
hdi(posteriors$age, ci = 0.95)
95% HDI: [-0.02, 0.00]
So we conclude that the age
effect has a 95% chance of falling within the [-0.02, 0.00] range. We can also exponentiate here, so as to provide the result in terms of an odds ratio.
hdi(exp(posteriors$age), ci = 0.95)
95% HDI: [0.98, 1.00]
34.5.3 Visualizing the Coefficients and Credible Intervals
Here is a plot of the coefficients and parameters estimated in m2
, along with credible intervals for their values. The inner interval (shaded region) here uses the default choice of 50%, and the outer interval (lines) uses a non-default choice of 95% (90% is the default choice here, as it turns out.) The point estimate shown here is the median of the posterior distribution, which is the default.
plot(m2, prob = 0.5, prob_outer = 0.95)
34.6 Summarizing the Posterior Distribution
A more detailed set of summaries for the posterior distribution can be obtained from the describe_posterior()
function from the bayestestR
package.
A brief tutorial on what is shown here is available at https://easystats.github.io/bayestestR/articles/bayestestR.html and https://easystats.github.io/bayestestR/articles/example1.html and this is the source for much of what I’ve built in this little chapter.
describe_posterior(posteriors, test = c("pd", "ROPE")) |> print_md(decimals = 3)
Parameter | Median | 95% CI | pd | ROPE | % in ROPE |
---|---|---|---|---|---|
(Intercept) | 1.07 | [ 0.43, 1.70] | 100% | [-0.10, 0.10] | 0% |
age | -0.01 | [-0.02, 0.00] | 99.85% | [-0.10, 0.10] | 100% |
sexM | 0.13 | [-0.04, 0.30] | 94.15% | [-0.10, 0.10] | 35.08% |
totchol | 5.36e-04 | [ 0.00, 0.00] | 69.92% | [-0.10, 0.10] | 100% |
factor(educ)2_Middle | -0.08 | [-0.29, 0.11] | 79.77% | [-0.10, 0.10] | 54.87% |
factor(educ)3_High | -0.07 | [-0.30, 0.17] | 72.32% | [-0.10, 0.10] | 53.47% |
factor(educ)4_VHigh | -0.24 | [-0.49, 0.01] | 96.88% | [-0.10, 0.10] | 11.82% |
Let’s walk through all of this output.
34.6.1 Summarizing the Parameter values
For each parameter, we have:
- its estimated median across the posterior distribution
- its 95% credible interval (highest density interval of values within the posterior distribution)
as we’ve previously discussed.
34.6.2 Probability of Direction (pd
) estimates.
The pd
estimate helps us understand whether each effect is positive or negative. For instance, regarding age
, we see the proportion of the posterior that is in the direction of the median effect (negative), no matter what the “size” of the effect is, will be as follows.
<- posteriors |> filter(age < 0) |> nrow()
n_negative 100 * n_negative / nrow(posteriors)
[1] 99.85
So we see that the effect of age
is negative with a probability of 99.85%, and this is called the probability of direction and abbreviated pd.
We can also calculate this with
p_direction(posteriors$age)
Probability of Direction
Parameter | pd
------------------
Posterior | 99.85%
34.6.3 The ROPE estimates
Testing whether this distribution is different from 0 doesn’t make sense, as 0 is a single value (and the probability that any distribution is different from a single value is infinite). However, one way to assess significance could be to define an area around 0, which will consider as practically equivalent to zero (i.e., absence of, or a negligible, effect). This is called the Region of Practical Equivalence (ROPE).
The default (semi-objective) way of defining the ROPE is to use (-0.1, 0.1) in this context. This is sometimes considered a “negligible” effect size.
rope_range(posteriors)
[1] -0.1 0.1
So we then compute the percentage in ROPE as the percentage of the posterior distribution that falls within this ROPE range. When most of the posterior distribution does not overlap with ROPE, we might conclude that the effect is important enough to be noted.
In our case, 100% of the age
effects are in the ROPE, so that’s not really evidence of an important effect.
For sex
, though, only 35% of the effects are in the ROPE, so that’s indicative of a somewhat more substantial effect, but we’d only get really excited if a much smaller fraction, say 1%, 5% or maybe 10% were in the ROPE.
34.6.4 Summarizing the Coefficients as Odds Ratios
::tidy(m2, exponentiate = TRUE,
broom.mixedconf.int = TRUE, conf.level = 0.95) |>
gt() |> fmt_number(decimals = 3)
term | estimate | std.error | conf.low | conf.high |
---|---|---|---|---|
(Intercept) | 2.912 | 0.949 | 1.531 | 5.492 |
age | 0.986 | 0.005 | 0.977 | 0.996 |
sexM | 1.141 | 0.096 | 0.965 | 1.348 |
totchol | 1.001 | 0.001 | 0.998 | 1.003 |
factor(educ)2_Middle | 0.920 | 0.093 | 0.751 | 1.121 |
factor(educ)3_High | 0.931 | 0.111 | 0.738 | 1.188 |
factor(educ)4_VHigh | 0.790 | 0.101 | 0.616 | 1.013 |
34.7 Summarizing the Priors Used
From the bayestestR
package, we also have the describe_prior()
and print_md()
functions to get the following summary of the priors we have assumed. Since we didn’t specify anything about the priors in fitting model m2
, we are looking at the default choices, which are weakly informative priors following Normal distributions. Details on the default prior choices can be found at https://mc-stan.org/rstanarm/articles/priors.html.
describe_prior(m2) |> print_md(decimals = 3)
Parameter | Prior_Distribution | Prior_Location | Prior_Scale |
---|---|---|---|
(Intercept) | normal | 0.49 | 1.25 |
age | normal | 0.00 | 0.14 |
sexM | normal | 0.00 | 2.53 |
totchol | normal | 0.00 | 0.03 |
factor(educ)2_Middle | normal | 0.00 | 2.73 |
factor(educ)3_High | normal | 0.00 | 3.53 |
factor(educ)4_VHigh | normal | 0.00 | 3.68 |
34.8 Graphical Posterior Predictive Checks
For more on these checks, visit https://mc-stan.org/rstanarm/articles/continuous.html#the-posterior-predictive-distribution-1, for example.
Here’s the first plot which compares the density function of the observed outcome \(y\) (totchol) to several of the simulated data sets \(y_{rep}\) from the posterior predictive distribution using the same predictor values as were used to fit the model.
pp_check(m2, nreps = 5)
The idea is that if the model is a good fit to the data we should be able to generate data \(y_{rep}\) from the posterior predictive distribution that looks a lot like the observed data \(y\). That is, given \(y\), the \(y_{rep}\) we generate should look plausible. We’d worry a bit if this plot showed histograms that were wildly different from one another.
Another useful plot (shown below) made using pp_check
shows the distribution of a test quantity \(T(y_{rep})\) compared to \(T(y)\), the value of the quantity in the observed data. I like this scatterplot version which allows us to look at where the simulations’ mean and standard deviation fall compared to what the observed totchol
values show us.
pp_check(m2, plotfun = "stat_2d", stat = c("mean", "sd"))
We can see that the cloud of simulated means and standard deviations has the observed statistics near its center, although perhaps the standard deviations are a bit higher than we might like to see, typically. Ideally, the dot would be right in the center of this cloud of simulated results.