34  NEW!! Bayes and a Logistic Model

Almost all of this material is based on

There’s not a lot that is truly original here. That’s a summer project.

34.1 R Setup Used Here

knitr::opts_chunk$set(comment = NA)

library(broom)
library(broom.mixed)
library(gt)
library(janitor)
library(mosaic)
library(bayestestR)
library(insight)
library(rstanarm)
library(conflicted)
library(tidyverse) 

conflicts_prefer(dplyr::select, dplyr::filter, base::mean, base::range)

theme_set(theme_bw())

34.2 Return to the smalldat Example

Consider the smalldat.csv data we discussed initially Chapter 22. The data includes 150 observations on 6 variables, and our goal here is to predict smoker given four predictors (totchol, age, sex and educ.)

smalldat <- read_csv("data/smalldat.csv", show_col_types = FALSE)
Variable Description
subject Subject identification code
smoker 1 = current smoker, 0 = not current smoker
totchol total cholesterol, in mg/dl
age age in years
sex subject’s sex (M or F)
educ subject’s educational attainment (4 levels)

The educ levels are: 1_Low, 2_Middle, 3_High and 4_VHigh, which stands for Very High

34.3 The Distribution of Smoking Status

Across the 150 observations in the smalldat data, we have 74 smokers.

smalldat |> tabyl(smoker) |> adorn_pct_formatting() |> adorn_totals()
 smoker   n percent
      0  76   50.7%
      1  74   49.3%
  Total 150       -

34.4 Fitting a Logistic Regression Model with glm()

m1 <- glm((smoker == 1) ~ age + sex + totchol + factor(educ), 
          data = smalldat)

glance(m1) |> select(nobs, df.residual, AIC, BIC, logLik, deviance, null.deviance) |>
  gt() |> fmt_number(AIC:null.deviance, decimals = 2)
nobs df.residual AIC BIC logLik deviance null.deviance
150 143 218.30 242.39 −101.15 33.83 37.49
## raw coefficients

tidy(m1, conf.int = TRUE, conf.level = 0.95) |> 
  gt() |> fmt_number(decimals = 3)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 1.071 0.317 3.374 0.001 0.449 1.693
age −0.014 0.005 −2.872 0.005 −0.024 −0.004
sexM 0.131 0.083 1.578 0.117 −0.032 0.293
totchol 0.001 0.001 0.531 0.597 −0.002 0.003
factor(educ)2_Middle −0.089 0.100 −0.894 0.373 −0.284 0.106
factor(educ)3_High −0.073 0.121 −0.604 0.547 −0.309 0.163
factor(educ)4_VHigh −0.242 0.126 −1.919 0.057 −0.489 0.005
## with exponentiated coefficients

tidy(m1, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) |> 
  gt() |> fmt_number(decimals = 3)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 2.918 0.317 3.374 0.001 1.566 5.435
age 0.986 0.005 −2.872 0.005 0.977 0.996
sexM 1.140 0.083 1.578 0.117 0.969 1.340
totchol 1.001 0.001 0.531 0.597 0.998 1.003
factor(educ)2_Middle 0.915 0.100 −0.894 0.373 0.753 1.112
factor(educ)3_High 0.930 0.121 −0.604 0.547 0.734 1.177
factor(educ)4_VHigh 0.785 0.126 −1.919 0.057 0.613 1.005

34.5 Fitting a Bayesian Logistic Regression

Can we fit a model for the same data using a Bayesian approach?

Yes, we can, for instance using the stan_glm() function from the rstanarm package.

set.seed(43234231) # best to set a random seed first

m2 <- stan_glm((smoker == 1) ~ age + sex + totchol + factor(educ), 
               data = smalldat, refresh = 0)

Here the refresh = 0 parameter stops the machine from printing out each of the updates it does while sampling, which is not generally something I need to look at. Here’s what’s placed in the m2 object:

m2
stan_glm
 family:       gaussian [identity]
 formula:      (smoker == 1) ~ age + sex + totchol + factor(educ)
 observations: 150
 predictors:   7
------
                     Median MAD_SD
(Intercept)           1.1    0.3  
age                   0.0    0.0  
sexM                  0.1    0.1  
totchol               0.0    0.0  
factor(educ)2_Middle -0.1    0.1  
factor(educ)3_High   -0.1    0.1  
factor(educ)4_VHigh  -0.2    0.1  

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.5    0.0   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
  • The first few lines specify the fitting process.
  • Next, for each coefficient, we find the median value from the posterior distribution, and the MAD_SD value, which is an indicator of variation derived from the estimated posterior distribution of the parameters, and is used as a standard error in what follows.
  • Finally, we see the estimated root mean squared error (residual standard deviation) sigma, again estimated with the median of the sigma values in the posterior distribution.

Using the summary() function provides some additional information about the parameter estimates, but mostly some convergence diagnostics for the Markov Chain Monte Carlo procedure that the rstanarm package used to build the estimates.

summary(m2)

Model Info:
 function:     stan_glm
 family:       gaussian [identity]
 formula:      (smoker == 1) ~ age + sex + totchol + factor(educ)
 algorithm:    sampling
 sample:       4000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 150
 predictors:   7

Estimates:
                       mean   sd   10%   50%   90%
(Intercept)           1.1    0.3  0.7   1.1   1.5 
age                   0.0    0.0  0.0   0.0   0.0 
sexM                  0.1    0.1  0.0   0.1   0.2 
totchol               0.0    0.0  0.0   0.0   0.0 
factor(educ)2_Middle -0.1    0.1 -0.2  -0.1   0.0 
factor(educ)3_High   -0.1    0.1 -0.2  -0.1   0.1 
factor(educ)4_VHigh  -0.2    0.1 -0.4  -0.2  -0.1 
sigma                 0.5    0.0  0.5   0.5   0.5 

Fit Diagnostics:
           mean   sd   10%   50%   90%
mean_PPD 0.5    0.1  0.4   0.5   0.6  

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
                     mcse Rhat n_eff
(Intercept)          0.0  1.0  4790 
age                  0.0  1.0  2700 
sexM                 0.0  1.0  3922 
totchol              0.0  1.0  3983 
factor(educ)2_Middle 0.0  1.0  2847 
factor(educ)3_High   0.0  1.0  3724 
factor(educ)4_VHigh  0.0  1.0  3550 
sigma                0.0  1.0  4324 
mean_PPD             0.0  1.0  4262 
log-posterior        0.0  1.0  1996 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

34.5.1 Extracting the Posterior

Let’s extract the coefficients of our model, using the get_parameters() function from the insight package:

posteriors <- get_parameters(m2)

head(posteriors)
  (Intercept)          age       sexM       totchol factor(educ)2_Middle
1   1.0395289 -0.004619993 0.25120820 -0.0012609246          -0.07421785
2   1.3897359 -0.015504179 0.05308297 -0.0005806271          -0.05711548
3   0.6960912 -0.012991489 0.12013110  0.0015857650           0.08652352
4   0.3501038 -0.011626531 0.23179922  0.0024659073           0.12610455
5   1.5843855 -0.009405856 0.05310819 -0.0025427990          -0.05049668
6   1.0403138 -0.006376498 0.04919309 -0.0013779428           0.08575331
  factor(educ)3_High factor(educ)4_VHigh
1       -0.247278983         -0.34931693
2       -0.027945532         -0.04715445
3        0.098936630         -0.09453129
4        0.055186994         -0.08377399
5        0.003482159         -0.11669724
6        0.123287117         -0.06056735

In all, we have 4000 observations of this posterior distribution:

nrow(posteriors)
[1] 4000

Let’s visualize the posterior distribution of our parameter for age.

ggplot(posteriors, aes(x = age)) + geom_density(fill = "dodgerblue")

This distribution describes the probability (on the vertical axis) of various age effects (shown on the horizontal axis). Most of the distribution is between -0.025 and -0.005, with the peak being around -0.15.

Remember that our m1 fit with glm() had an estimated \(\beta\) for age of -0.014, so, as is often the case, there is not a lot of difference between the two models in terms of the estimates they make.

Here’s the mean and median of the age effect, across our 4000 simulations from the posterior distribution.

mean(posteriors$age)
[1] -0.01389409
median(posteriors$age)
[1] -0.01391701

And here are the results after exponentiation, so that the estimates describe odds ratios:

mean(exp(posteriors$age))
[1] 0.9862135
median(exp(posteriors$age))
[1] 0.9861794

Again, these are very close to what we obtained from least squares estimation.

Another option is to take the mode (peak) of the posterior distribution, and this is called the maximum a posteriori (MAP) estimate:

map_estimate(posteriors$age)
MAP Estimate

Parameter | MAP_Estimate
------------------------
x         |        -0.01
map_estimate(exp(posteriors$age)) # on odds ratio scale
MAP Estimate

Parameter | MAP_Estimate
------------------------
x         |         0.99

Adding these estimates to our plot, we can see that they are quite close:

ggplot(posteriors, aes(x = age)) +
  geom_density(fill = "dodgerblue") +
  # The mean in yellow
  geom_vline(xintercept = mean(posteriors$age), color = "yellow", linewidth = 1) +
  # The median in red
  geom_vline(xintercept = median(posteriors$age), color = "red", linewidth = 1) +
  # The MAP in purple
  geom_vline(xintercept = as.numeric(map_estimate(posteriors$age)), color = "purple", linewidth = 1)

34.5.2 Describing Uncertainty

We might describe the range of estimates for the age effect.

range(posteriors$age)
[1] -0.03166005  0.00409895

Instead of showing the whole range, we usually compute the highest density interval at some percentage level, for instance a 95% credible interval which shows the range containing the 95% most probable effect values.

hdi(posteriors$age, ci = 0.95)
95% HDI: [-0.02,  0.00]

So we conclude that the age effect has a 95% chance of falling within the [-0.02, 0.00] range. We can also exponentiate here, so as to provide the result in terms of an odds ratio.

hdi(exp(posteriors$age), ci = 0.95)
95% HDI: [0.98, 1.00]

34.5.3 Visualizing the Coefficients and Credible Intervals

Here is a plot of the coefficients and parameters estimated in m2, along with credible intervals for their values. The inner interval (shaded region) here uses the default choice of 50%, and the outer interval (lines) uses a non-default choice of 95% (90% is the default choice here, as it turns out.) The point estimate shown here is the median of the posterior distribution, which is the default.

plot(m2, prob = 0.5, prob_outer = 0.95)

34.6 Summarizing the Posterior Distribution

A more detailed set of summaries for the posterior distribution can be obtained from the describe_posterior() function from the bayestestR package.

A brief tutorial on what is shown here is available at https://easystats.github.io/bayestestR/articles/bayestestR.html and https://easystats.github.io/bayestestR/articles/example1.html and this is the source for much of what I’ve built in this little chapter.

describe_posterior(posteriors, test = c("pd", "ROPE")) |> print_md(decimals = 3)
Summary of Posterior Distribution
Parameter Median 95% CI pd ROPE % in ROPE
(Intercept) 1.07 [ 0.43, 1.70] 100% [-0.10, 0.10] 0%
age -0.01 [-0.02, 0.00] 99.85% [-0.10, 0.10] 100%
sexM 0.13 [-0.04, 0.30] 94.15% [-0.10, 0.10] 35.08%
totchol 5.36e-04 [ 0.00, 0.00] 69.92% [-0.10, 0.10] 100%
factor(educ)2_Middle -0.08 [-0.29, 0.11] 79.77% [-0.10, 0.10] 54.87%
factor(educ)3_High -0.07 [-0.30, 0.17] 72.32% [-0.10, 0.10] 53.47%
factor(educ)4_VHigh -0.24 [-0.49, 0.01] 96.88% [-0.10, 0.10] 11.82%

Let’s walk through all of this output.

34.6.1 Summarizing the Parameter values

For each parameter, we have:

  • its estimated median across the posterior distribution
  • its 95% credible interval (highest density interval of values within the posterior distribution)

as we’ve previously discussed.

34.6.2 Probability of Direction (pd) estimates.

The pd estimate helps us understand whether each effect is positive or negative. For instance, regarding age, we see the proportion of the posterior that is in the direction of the median effect (negative), no matter what the “size” of the effect is, will be as follows.

n_negative <- posteriors |> filter(age < 0) |> nrow()
100 * n_negative / nrow(posteriors)
[1] 99.85

So we see that the effect of age is negative with a probability of 99.85%, and this is called the probability of direction and abbreviated pd.

We can also calculate this with

p_direction(posteriors$age)
Probability of Direction

Parameter |     pd
------------------
Posterior | 99.85%

34.6.3 The ROPE estimates

Testing whether this distribution is different from 0 doesn’t make sense, as 0 is a single value (and the probability that any distribution is different from a single value is infinite). However, one way to assess significance could be to define an area around 0, which will consider as practically equivalent to zero (i.e., absence of, or a negligible, effect). This is called the Region of Practical Equivalence (ROPE).

The default (semi-objective) way of defining the ROPE is to use (-0.1, 0.1) in this context. This is sometimes considered a “negligible” effect size.

rope_range(posteriors)
[1] -0.1  0.1

So we then compute the percentage in ROPE as the percentage of the posterior distribution that falls within this ROPE range. When most of the posterior distribution does not overlap with ROPE, we might conclude that the effect is important enough to be noted.

In our case, 100% of the age effects are in the ROPE, so that’s not really evidence of an important effect.

For sex, though, only 35% of the effects are in the ROPE, so that’s indicative of a somewhat more substantial effect, but we’d only get really excited if a much smaller fraction, say 1%, 5% or maybe 10% were in the ROPE.

34.6.4 Summarizing the Coefficients as Odds Ratios

broom.mixed::tidy(m2, exponentiate = TRUE, 
                  conf.int = TRUE, conf.level = 0.95) |> 
  gt() |> fmt_number(decimals = 3)
term estimate std.error conf.low conf.high
(Intercept) 2.912 0.949 1.531 5.492
age 0.986 0.005 0.977 0.996
sexM 1.141 0.096 0.965 1.348
totchol 1.001 0.001 0.998 1.003
factor(educ)2_Middle 0.920 0.093 0.751 1.121
factor(educ)3_High 0.931 0.111 0.738 1.188
factor(educ)4_VHigh 0.790 0.101 0.616 1.013

34.7 Summarizing the Priors Used

From the bayestestR package, we also have the describe_prior() and print_md() functions to get the following summary of the priors we have assumed. Since we didn’t specify anything about the priors in fitting model m2, we are looking at the default choices, which are weakly informative priors following Normal distributions. Details on the default prior choices can be found at https://mc-stan.org/rstanarm/articles/priors.html.

describe_prior(m2) |> print_md(decimals = 3)
Parameter Prior_Distribution Prior_Location Prior_Scale
(Intercept) normal 0.49 1.25
age normal 0.00 0.14
sexM normal 0.00 2.53
totchol normal 0.00 0.03
factor(educ)2_Middle normal 0.00 2.73
factor(educ)3_High normal 0.00 3.53
factor(educ)4_VHigh normal 0.00 3.68

34.8 Graphical Posterior Predictive Checks

For more on these checks, visit https://mc-stan.org/rstanarm/articles/continuous.html#the-posterior-predictive-distribution-1, for example.

Here’s the first plot which compares the density function of the observed outcome \(y\) (totchol) to several of the simulated data sets \(y_{rep}\) from the posterior predictive distribution using the same predictor values as were used to fit the model.

pp_check(m2, nreps = 5)

The idea is that if the model is a good fit to the data we should be able to generate data \(y_{rep}\) from the posterior predictive distribution that looks a lot like the observed data \(y\). That is, given \(y\), the \(y_{rep}\) we generate should look plausible. We’d worry a bit if this plot showed histograms that were wildly different from one another.

Another useful plot (shown below) made using pp_check shows the distribution of a test quantity \(T(y_{rep})\) compared to \(T(y)\), the value of the quantity in the observed data. I like this scatterplot version which allows us to look at where the simulations’ mean and standard deviation fall compared to what the observed totchol values show us.

pp_check(m2, plotfun = "stat_2d", stat = c("mean", "sd"))

We can see that the cloud of simulated means and standard deviations has the observed statistics near its center, although perhaps the standard deviations are a bit higher than we might like to see, typically. Ideally, the dot would be right in the center of this cloud of simulated results.