Chapter 16 Colorectal Cancer Screening and Some Special Cases
In this Chapter, we discuss two issues as yet unraised regarding regression on a binary outcome.
What do we do if our binary outcome is not available for each subject individually, but instead aggregated?
What is probit regression, and how can we use it as an alternative to logistic regression on a binary outcome?
16.1 Logistic Regression for Aggregated Data
16.1.1 Colorectal Cancer Screening Data
The screening.csv
data (imported into the R tibble colscr
are simulated. They mirror a subset of the actual results from the Better Health Partnership’s pilot study of colorectal cancer screening in primary care clinics in Northeast Ohio, but the numbers have been fuzzed slightly, and the clinics have been de-identified and moved around from system to system.
Available to us are the following variables:
Variable | Description |
---|---|
location |
clinic code |
subjects |
number of subjects reported by clinic |
screen_rate |
proportion of subjects who were screened |
screened |
number of subjects who were screened |
notscreened |
number of subjects not screened |
meanage |
mean age of clinic’s subjects, years |
female |
% of clinic’s subjects who are female |
pct_lowins |
% of clinic’s subjects who have Medicaid or are uninsured |
system |
system code |
colscr %>% skim()
Skim summary statistics
n obs: 26
n variables: 9
-- Variable type:factor -----------------------------------------------------------
variable missing complete n n_unique top_counts
location 0 26 26 26 A: 1, B: 1, C: 1, D: 1
system 0 26 26 4 Sys: 7, Sys: 7, Sys: 6, Sys: 6
ordered
FALSE
FALSE
-- Variable type:integer ----------------------------------------------------------
variable missing complete n mean sd p0 p25 p50
notscreened 0 26 26 663.23 271.17 231 508.75 611
screened 0 26 26 2584.04 1765.11 572 1395.25 2169.5
subjects 0 26 26 3247.27 1945.83 803 1914.75 2765.5
p75 p100
791 1356
2716 6947
3607.75 7677
-- Variable type:numeric ----------------------------------------------------------
variable missing complete n mean sd p0 p25 p50 p75 p100
female 0 26 26 58.72 6.29 46.2 55.42 60.05 62.62 70.3
meanage 0 26 26 60.58 1.93 58 58.82 60.5 61.98 65.9
pct_lowins 0 26 26 24.47 19.13 0.3 4.8 23.95 44.03 51.3
screen_rate 0 26 26 0.77 0.072 0.64 0.72 0.76 0.81 0.9
16.1.2 Fitting a Logistic Regression Model to Proportion Data
Here, we have a binary outcome (was the subject screened or not?) but we have aggregated results. We can use the counts of the numbers of subjects at each clinic (in subjects
) and the proportion who were screened (in screen_rate
) to fit a logistic regression model, as follows:
m_screen1 <- glm(screen_rate ~ meanage + female +
pct_lowins + system, family = binomial,
weights = subjects, data = colscr)
summary(m_screen1)
Call:
glm(formula = screen_rate ~ meanage + female + pct_lowins + system,
family = binomial, data = colscr, weights = subjects)
Deviance Residuals:
Min 1Q Median 3Q Max
-9.017 -3.705 -2.000 1.380 11.500
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.3270393 0.5530782 -2.399 0.0164 *
meanage 0.0679866 0.0089754 7.575 3.60e-14 ***
female -0.0193142 0.0015831 -12.200 < 2e-16 ***
pct_lowins -0.0134547 0.0008585 -15.672 < 2e-16 ***
systemSys_2 -0.1382189 0.0246591 -5.605 2.08e-08 ***
systemSys_3 -0.0400170 0.0254505 -1.572 0.1159
systemSys_4 0.0229273 0.0294207 0.779 0.4358
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2825.28 on 25 degrees of freedom
Residual deviance: 816.39 on 19 degrees of freedom
AIC: 1037.9
Number of Fisher Scoring iterations: 4
16.1.3 Fitting a Logistic Regression Model to Counts of Successes and Failures
m_screen2 <- glm(cbind(screened, notscreened) ~
meanage + female + pct_lowins + system,
family = binomial, data = colscr)
summary(m_screen2)
Call:
glm(formula = cbind(screened, notscreened) ~ meanage + female +
pct_lowins + system, family = binomial, data = colscr)
Deviance Residuals:
Min 1Q Median 3Q Max
-9.017 -3.705 -2.000 1.380 11.500
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.3270392 0.5530782 -2.399 0.0164 *
meanage 0.0679866 0.0089754 7.575 3.60e-14 ***
female -0.0193142 0.0015831 -12.200 < 2e-16 ***
pct_lowins -0.0134547 0.0008585 -15.672 < 2e-16 ***
systemSys_2 -0.1382189 0.0246591 -5.605 2.08e-08 ***
systemSys_3 -0.0400170 0.0254505 -1.572 0.1159
systemSys_4 0.0229273 0.0294207 0.779 0.4358
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2825.28 on 25 degrees of freedom
Residual deviance: 816.39 on 19 degrees of freedom
AIC: 1037.9
Number of Fisher Scoring iterations: 4
16.1.4 How does one address this problem in rms
?
We can use Glm
. As an example to mirror m_screen1
, we have the following…
d <- datadist(colscr)
options(datadist = "d")
mod_screen_1 <- Glm(screen_rate ~ meanage + female +
pct_lowins + system,
family = binomial, weights = subjects,
data = colscr, x = T, y = T)
mod_screen_1
General Linear Model
Glm(formula = screen_rate ~ meanage + female + pct_lowins + system,
family = binomial, data = colscr, weights = subjects, x = T,
y = T)
Model Likelihood
Ratio Test
Obs 26 LR chi2 2008.90
Residual d.f.19 d.f. 6
g 0.4614539 Pr(> chi2) <0.0001
Coef S.E. Wald Z Pr(>|Z|)
Intercept -1.3270 0.5531 -2.40 0.0164
meanage 0.0680 0.0090 7.57 <0.0001
female -0.0193 0.0016 -12.20 <0.0001
pct_lowins -0.0135 0.0009 -15.67 <0.0001
system=Sys_2 -0.1382 0.0247 -5.61 <0.0001
system=Sys_3 -0.0400 0.0255 -1.57 0.1159
system=Sys_4 0.0229 0.0294 0.78 0.4358
16.2 Probit Regression
16.2.1 Colorectal Cancer Screening Data on Individuals
The data in the colscr2
data frame describe (disguised) data on the status of 172 adults who were eligible for colon cancer screening, with the following information included:
Variable | Description |
---|---|
subject |
subject ID code |
age |
subject’s age (years) |
race |
subject’s race (White/Black/Other) |
hispanic |
subject of Hispanic ethnicity (1 = yes / 0 = no) |
insurance |
Commercial, Medicaid, Medicare, Uninsured |
bmi |
body mass index at most recent visit |
sbp |
systolic blood pressure at most recent visit |
up_to_date |
meets colon cancer screening standards |
The goal is to use the other variables (besides subject ID) to predict whether or not a subject is up to date.
16.2.2 A logistic regression model
Here is a logistic regression model.
colscr2 %>% summary()
subject age race hispanic
Min. :101.0 Min. :51.00 Black:118 Min. :0.00000
1st Qu.:143.8 1st Qu.:54.00 Other: 9 1st Qu.:0.00000
Median :186.5 Median :57.00 White: 45 Median :0.00000
Mean :186.5 Mean :57.80 Mean :0.06395
3rd Qu.:229.2 3rd Qu.:61.25 3rd Qu.:0.00000
Max. :272.0 Max. :69.00 Max. :1.00000
insurance bmi sbp up_to_date
Commercial:32 Min. :17.20 Min. : 89.0 Min. :0.0000
Medicaid :81 1st Qu.:25.48 1st Qu.:118.0 1st Qu.:0.0000
Medicare :46 Median :30.05 Median :127.0 Median :1.0000
Uninsured :13 Mean :31.24 Mean :128.9 Mean :0.6047
3rd Qu.:36.03 3rd Qu.:138.0 3rd Qu.:1.0000
Max. :55.41 Max. :198.0 Max. :1.0000
m_scr2_logistic <- glm(up_to_date ~ age + race + hispanic +
insurance + bmi + sbp,
family = binomial, data = colscr2)
summary(m_scr2_logistic)
Call:
glm(formula = up_to_date ~ age + race + hispanic + insurance +
bmi + sbp, family = binomial, data = colscr2)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8604 -1.1540 0.6933 0.9564 1.6108
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.7040470 2.7418625 0.986 0.3240
age 0.0204901 0.0396920 0.516 0.6057
raceOther -1.9722351 1.0023237 -1.968 0.0491 *
raceWhite -0.3210458 0.4001744 -0.802 0.4224
hispanic 0.0005855 0.7953482 0.001 0.9994
insuranceMedicaid -1.0151860 0.4945169 -2.053 0.0401 *
insuranceMedicare -0.5216006 0.5629935 -0.926 0.3542
insuranceUninsured 0.1099966 0.7906196 0.139 0.8893
bmi 0.0155894 0.0213547 0.730 0.4654
sbp -0.0241777 0.0099138 -2.439 0.0147 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 230.85 on 171 degrees of freedom
Residual deviance: 210.55 on 162 degrees of freedom
AIC: 230.55
Number of Fisher Scoring iterations: 4
confint(m_scr2_logistic)
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -2.64274441 8.157738367
age -0.05703934 0.099298904
raceOther -4.24604744 -0.175953229
raceWhite -1.10800168 0.469396096
hispanic -1.53691431 1.688738936
insuranceMedicaid -2.03640881 -0.079142467
insuranceMedicare -1.66246932 0.564088260
insuranceUninsured -1.40110091 1.759391281
bmi -0.02568426 0.058563761
sbp -0.04436143 -0.005282686
In this model, there appears to be some link between sbp
and screening, as well as, perhaps, some statistically significant differences between some race groups and some insurance groups. We won’t look at this much further for now, though. Instead, we’ll simply describe predictions for two subjects, Harry and Sally.
16.2.3 Predicting status for Harry and Sally
- Harry is age 65, White, non-Hispanic, with Medicare insurance, a BMI of 28 and SBP of 135.
- Sally is age 60, Black, Hispanic, with Medicaid insurance, a BMI of 22 and SBP of 148.
newdat_s2 <- data_frame(subject = c("Harry", "Sally"),
age = c(65, 60),
race = c("White", "Black"),
hispanic = c(0, 1),
insurance = c("Medicare", "Medicaid"),
bmi = c(28, 22),
sbp = c(135, 148))
predict(m_scr2_logistic, newdata = newdat_s2,
type = "response")
1 2
0.5904364 0.4215335
The prediction for Harry is 0.59, and for Sally, 0.42, by this logistic regression model.
16.2.4 A probit regression model
Now, consider a probit regression, fit by changing the default link for the binomial
family as follows:
m_scr2_probit <- glm(up_to_date ~ age + race + hispanic +
insurance + bmi + sbp,
family = binomial(link = "probit"),
data = colscr2)
summary(m_scr2_probit)
Call:
glm(formula = up_to_date ~ age + race + hispanic + insurance +
bmi + sbp, family = binomial(link = "probit"), data = colscr2)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8608 -1.1561 0.6933 0.9607 1.6073
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.584604 1.658489 0.955 0.3394
age 0.013461 0.024107 0.558 0.5766
raceOther -1.238445 0.587093 -2.109 0.0349 *
raceWhite -0.199260 0.243505 -0.818 0.4132
hispanic 0.029483 0.484819 0.061 0.9515
insuranceMedicaid -0.619277 0.293205 -2.112 0.0347 *
insuranceMedicare -0.322881 0.333549 -0.968 0.3330
insuranceUninsured 0.052776 0.463798 0.114 0.9094
bmi 0.009652 0.012887 0.749 0.4539
sbp -0.014696 0.005944 -2.472 0.0134 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 230.85 on 171 degrees of freedom
Residual deviance: 210.49 on 162 degrees of freedom
AIC: 230.49
Number of Fisher Scoring iterations: 4
confint(m_scr2_probit)
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -1.70739455 4.894713408
age -0.03400934 0.061437784
raceOther -2.53819772 -0.119812915
raceWhite -0.67910311 0.282566349
hispanic -0.92476109 1.011089937
insuranceMedicaid -1.21212866 -0.049749873
insuranceMedicare -0.99315267 0.329851215
insuranceUninsured -0.83335121 0.966504724
bmi -0.01577917 0.035632196
sbp -0.02659318 -0.003148488
16.2.5 Interpreting the Probit Model’s Coefficients
It is possible to use any number of link functions to ensure that predicted values in a generalized linear model fall between 0 and 1. The probit regression model, for instance, uses the inverse of the cumulative distribution function of the Normal model as its link function. Let’s look more closely at the coefficients of the probit model we just fit.
m_scr2_probit$coef
(Intercept) age raceOther
1.584603569 0.013461338 -1.238445198
raceWhite hispanic insuranceMedicaid
-0.199260184 0.029483051 -0.619276718
insuranceMedicare insuranceUninsured bmi
-0.322880519 0.052775722 0.009652339
sbp
-0.014695526
The probit regression coefficients give the change in the z-score of the outcome of interest (here, up_to_date
) for a one-unit change in the target predictor, holding all other predictors constant.
- So, for a one-year increase in age, holding all other predictors constant, the z-score for
up_to_date
increases by 0.013 - And for a Medicaid subject as compared to a Commercial subject of the same age, race, ethnicity, bmi and sbp, the z-score for the Medicaid subject is predicted to be -0.619 lower, according to this model.
16.2.6 What about Harry and Sally?
Do the predictions for Harry and Sally change much with this probit model, as compared to the logistic regression?
predict(m_scr2_probit, newdata = newdat_s2, type = "response")
1 2
0.5885511 0.4364027