Chapter 18 Colorectal Cancer Screening and Some Special Cases
In this Chapter, we discuss two issues not yet raised regarding regression on a binary outcome.
What do we do if our binary outcome is not available for each subject individually, but instead aggregated?
What is probit regression, and how can we use it as an alternative to logistic regression on a binary outcome?
18.1 Logistic Regression for Aggregated Data
18.1.1 Colorectal Cancer Screening Data
The screening.csv
data (imported into the R tibble colscr
are simulated. They mirror a subset of the actual results from the Better Health Partnership’s pilot study of colorectal cancer screening in primary care clinics in Northeast Ohio, but the numbers have been fuzzed slightly, and the clinics have been de-identified and moved around from system to system.
Available to us are the following variables:
Variable | Description |
---|---|
location |
clinic code |
subjects |
number of subjects reported by clinic |
screen_rate |
proportion of subjects who were screened |
screened |
number of subjects who were screened |
notscreened |
number of subjects not screened |
meanage |
mean age of clinic’s subjects, years |
female |
% of clinic’s subjects who are female |
pct_lowins |
% of clinic’s subjects who have Medicaid or are uninsured |
system |
system code |
describe(colscr)
colscr
9 Variables 26 Observations
--------------------------------------------------------------------------------
location
n missing distinct
26 0 26
lowest : A B C D E, highest: V W X Y Z
--------------------------------------------------------------------------------
subjects
n missing distinct Info Mean Gmd .05 .10
26 0 26 1 3247 2134 1249 1475
.25 .50 .75 .90 .95
1915 2766 3608 6523 7068
lowest : 803 1179 1459 1491 1512, highest: 5451 6061 6985 7095 7677
--------------------------------------------------------------------------------
screen_rate
n missing distinct Info Mean Gmd .05 .10
26 0 26 1 0.7659 0.08339 0.6682 0.6732
.25 .50 .75 .90 .95
0.7179 0.7579 0.8088 0.8654 0.8899
lowest : 0.6354901 0.6666667 0.6728914 0.6734521 0.6869919
highest: 0.8237129 0.8469110 0.8838745 0.8919112 0.9049108
--------------------------------------------------------------------------------
screened
n missing distinct Info Mean Gmd .05 .10
26 0 26 1 2584 1895 843.5 993.0
.25 .50 .75 .90 .95
1395.2 2169.5 2716.0 5293.5 6107.2
lowest : 572 794 992 994 1088, highest: 4818 4848 5739 6230 6947
--------------------------------------------------------------------------------
notscreened
n missing distinct Info Mean Gmd .05 .10
26 0 26 1 663.2 303.5 336.0 352.5
.25 .50 .75 .90 .95
508.8 611.0 791.0 989.0 1172.5
lowest : 231 335 339 366 371, highest: 881 927 1051 1213 1356
--------------------------------------------------------------------------------
meanage
n missing distinct Info Mean Gmd .05 .10
26 0 23 0.999 60.58 2.186 58.23 58.35
.25 .50 .75 .90 .95
58.82 60.50 61.98 62.50 62.90
lowest : 58.0 58.2 58.3 58.4 58.5, highest: 62.2 62.4 62.6 63.0 65.9
--------------------------------------------------------------------------------
female
n missing distinct Info Mean Gmd .05 .10
26 0 23 0.999 58.72 7.118 46.93 48.45
.25 .50 .75 .90 .95
55.42 60.05 62.62 64.90 67.50
lowest : 46.2 46.6 47.9 49.0 54.3, highest: 63.6 64.1 65.7 68.1 70.3
--------------------------------------------------------------------------------
pct_lowins
n missing distinct Info Mean Gmd .05 .10
26 0 24 0.999 24.47 22.12 0.675 1.800
.25 .50 .75 .90 .95
4.800 23.950 44.025 49.500 49.950
lowest : 0.3 0.4 1.5 2.1 3.0, highest: 45.4 47.1 49.5 50.1 51.3
--------------------------------------------------------------------------------
system
n missing distinct
26 0 4
Value Sys_1 Sys_2 Sys_3 Sys_4
Frequency 7 7 6 6
Proportion 0.269 0.269 0.231 0.231
--------------------------------------------------------------------------------
18.1.2 Fitting a Logistic Regression Model to Proportion Data
Here, we have a binary outcome (was the subject screened or not?) but we have aggregated results. We can use the counts of the numbers of subjects at each clinic (in subjects
) and the proportion who were screened (in screen_rate
) to fit a logistic regression model, as follows:
<- glm(screen_rate ~ meanage + female +
m_screen1 + system, family = binomial,
pct_lowins weights = subjects, data = colscr)
summary(m_screen1)
Call:
glm(formula = screen_rate ~ meanage + female + pct_lowins + system,
family = binomial, data = colscr, weights = subjects)
Deviance Residuals:
Min 1Q Median 3Q Max
-9.017 -3.705 -2.000 1.380 11.500
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.3270393 0.5530782 -2.399 0.0164 *
meanage 0.0679866 0.0089754 7.575 3.60e-14 ***
female -0.0193142 0.0015831 -12.200 < 2e-16 ***
pct_lowins -0.0134547 0.0008585 -15.672 < 2e-16 ***
systemSys_2 -0.1382189 0.0246591 -5.605 2.08e-08 ***
systemSys_3 -0.0400170 0.0254505 -1.572 0.1159
systemSys_4 0.0229273 0.0294207 0.779 0.4358
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2825.28 on 25 degrees of freedom
Residual deviance: 816.39 on 19 degrees of freedom
AIC: 1037.9
Number of Fisher Scoring iterations: 4
18.1.3 Fitting a Logistic Regression Model to Counts of Successes and Failures
<- glm(cbind(screened, notscreened) ~
m_screen2 + female + pct_lowins + system,
meanage family = binomial, data = colscr)
summary(m_screen2)
Call:
glm(formula = cbind(screened, notscreened) ~ meanage + female +
pct_lowins + system, family = binomial, data = colscr)
Deviance Residuals:
Min 1Q Median 3Q Max
-9.017 -3.705 -2.000 1.380 11.500
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.3270392 0.5530782 -2.399 0.0164 *
meanage 0.0679866 0.0089754 7.575 3.60e-14 ***
female -0.0193142 0.0015831 -12.200 < 2e-16 ***
pct_lowins -0.0134547 0.0008585 -15.672 < 2e-16 ***
systemSys_2 -0.1382189 0.0246591 -5.605 2.08e-08 ***
systemSys_3 -0.0400170 0.0254505 -1.572 0.1159
systemSys_4 0.0229273 0.0294207 0.779 0.4358
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2825.28 on 25 degrees of freedom
Residual deviance: 816.39 on 19 degrees of freedom
AIC: 1037.9
Number of Fisher Scoring iterations: 4
18.1.4 How does one address this problem in rms
?
We can use Glm
. As an example to mirror m_screen1
, we have the following…
<- datadist(colscr)
d options(datadist = "d")
<- Glm(screen_rate ~ meanage + female +
mod_screen_1 + system,
pct_lowins family = binomial, weights = subjects,
data = colscr, x = T, y = T)
mod_screen_1
General Linear Model
Glm(formula = screen_rate ~ meanage + female + pct_lowins + system,
family = binomial, data = colscr, weights = subjects, x = T,
y = T)
Model Likelihood
Ratio Test
Obs 26 LR chi2 2008.90
Residual d.f.19 d.f. 6
g 0.4614539 Pr(> chi2) <0.0001
Coef S.E. Wald Z Pr(>|Z|)
Intercept -1.3270 0.5531 -2.40 0.0164
meanage 0.0680 0.0090 7.57 <0.0001
female -0.0193 0.0016 -12.20 <0.0001
pct_lowins -0.0135 0.0009 -15.67 <0.0001
system=Sys_2 -0.1382 0.0247 -5.61 <0.0001
system=Sys_3 -0.0400 0.0255 -1.57 0.1159
system=Sys_4 0.0229 0.0294 0.78 0.4358
18.2 Probit Regression
18.2.1 Colorectal Cancer Screening Data on Individuals
The data in the colscr2
data frame describe (disguised) data on the status of 172 adults who were eligible for colon cancer screening, with the following information included:
Variable | Description |
---|---|
subject |
subject ID code |
age |
subject’s age (years) |
race |
subject’s race (White/Black/Other) |
hispanic |
subject of Hispanic ethnicity (1 = yes / 0 = no) |
insurance |
Commercial, Medicaid, Medicare, Uninsured |
bmi |
body mass index at most recent visit |
sbp |
systolic blood pressure at most recent visit |
up_to_date |
meets colon cancer screening standards |
The goal is to use the other variables (besides subject ID) to predict whether or not a subject is up to date.
%>% describe() colscr2
.
8 Variables 172 Observations
--------------------------------------------------------------------------------
subject
n missing distinct Info Mean Gmd .05 .10
172 0 172 1 186.5 57.67 109.6 118.1
.25 .50 .75 .90 .95
143.8 186.5 229.2 254.9 263.4
lowest : 101 102 103 104 105, highest: 268 269 270 271 272
--------------------------------------------------------------------------------
age
n missing distinct Info Mean Gmd .05 .10
172 0 19 0.995 57.8 5.536 51.00 52.00
.25 .50 .75 .90 .95
54.00 57.00 61.25 65.00 67.00
lowest : 51 52 53 54 55, highest: 65 66 67 68 69
Value 51 52 53 54 55 56 57 58 59 60 61
Frequency 10 17 14 9 15 13 18 13 4 10 6
Proportion 0.058 0.099 0.081 0.052 0.087 0.076 0.105 0.076 0.023 0.058 0.035
Value 62 63 64 65 66 67 68 69
Frequency 11 4 5 7 6 3 4 3
Proportion 0.064 0.023 0.029 0.041 0.035 0.017 0.023 0.017
--------------------------------------------------------------------------------
race
n missing distinct
172 0 3
Value Black Other White
Frequency 118 9 45
Proportion 0.686 0.052 0.262
--------------------------------------------------------------------------------
hispanic
n missing distinct Info Sum Mean Gmd
172 0 2 0.18 11 0.06395 0.1204
--------------------------------------------------------------------------------
insurance
n missing distinct
172 0 4
Value Commercial Medicaid Medicare Uninsured
Frequency 32 81 46 13
Proportion 0.186 0.471 0.267 0.076
--------------------------------------------------------------------------------
bmi
n missing distinct Info Mean Gmd .05 .10
172 0 165 1 31.24 8.982 20.32 21.88
.25 .50 .75 .90 .95
25.48 30.05 36.03 43.06 45.68
lowest : 17.20 17.59 17.85 18.09 18.44, highest: 49.41 50.83 53.28 54.66 55.41
--------------------------------------------------------------------------------
sbp
n missing distinct Info Mean Gmd .05 .10
172 0 68 0.999 128.9 19.46 101.6 109.1
.25 .50 .75 .90 .95
118.0 127.0 138.0 150.9 162.0
lowest : 89 90 94 95 96, highest: 166 168 169 170 198
--------------------------------------------------------------------------------
up_to_date
n missing distinct Info Sum Mean Gmd
172 0 2 0.717 104 0.6047 0.4809
--------------------------------------------------------------------------------
18.2.2 A logistic regression model
Here is a logistic regression model.
<- glm(up_to_date ~ age + race + hispanic +
m_scr2_logistic + bmi + sbp,
insurance family = binomial, data = colscr2)
summary(m_scr2_logistic)
Call:
glm(formula = up_to_date ~ age + race + hispanic + insurance +
bmi + sbp, family = binomial, data = colscr2)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8604 -1.1540 0.6933 0.9564 1.6108
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.7040470 2.7418625 0.986 0.3240
age 0.0204901 0.0396920 0.516 0.6057
raceOther -1.9722351 1.0023237 -1.968 0.0491 *
raceWhite -0.3210458 0.4001744 -0.802 0.4224
hispanic 0.0005855 0.7953482 0.001 0.9994
insuranceMedicaid -1.0151860 0.4945169 -2.053 0.0401 *
insuranceMedicare -0.5216006 0.5629935 -0.926 0.3542
insuranceUninsured 0.1099966 0.7906196 0.139 0.8893
bmi 0.0155894 0.0213547 0.730 0.4654
sbp -0.0241777 0.0099138 -2.439 0.0147 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 230.85 on 171 degrees of freedom
Residual deviance: 210.55 on 162 degrees of freedom
AIC: 230.55
Number of Fisher Scoring iterations: 4
confint(m_scr2_logistic)
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -2.64274441 8.157738367
age -0.05703934 0.099298904
raceOther -4.24604744 -0.175953229
raceWhite -1.10800168 0.469396096
hispanic -1.53691431 1.688738936
insuranceMedicaid -2.03640881 -0.079142467
insuranceMedicare -1.66246932 0.564088260
insuranceUninsured -1.40110091 1.759391281
bmi -0.02568426 0.058563761
sbp -0.04436143 -0.005282686
In this model, there appears to be some link between sbp
and screening, as well as, perhaps, some statistically significant differences between some race groups and some insurance groups. We won’t look at this much further for now, though. Instead, we’ll simply describe predictions for two subjects, Harry and Sally.
18.2.3 Predicting status for Harry and Sally
- Harry is age 65, White, non-Hispanic, with Medicare insurance, a BMI of 28 and SBP of 135.
- Sally is age 60, Black, Hispanic, with Medicaid insurance, a BMI of 22 and SBP of 148.
<- tibble(subject = c("Harry", "Sally"),
newdat_s2 age = c(65, 60),
race = c("White", "Black"),
hispanic = c(0, 1),
insurance = c("Medicare", "Medicaid"),
bmi = c(28, 22),
sbp = c(135, 148))
predict(m_scr2_logistic, newdata = newdat_s2,
type = "response")
1 2
0.5904364 0.4215335
The prediction for Harry is 0.59, and for Sally, 0.42, by this logistic regression model.
18.2.4 A probit regression model
Now, consider a probit regression, fit by changing the default link for the binomial
family as follows:
<- glm(up_to_date ~ age + race + hispanic +
m_scr2_probit + bmi + sbp,
insurance family = binomial(link = "probit"),
data = colscr2)
summary(m_scr2_probit)
Call:
glm(formula = up_to_date ~ age + race + hispanic + insurance +
bmi + sbp, family = binomial(link = "probit"), data = colscr2)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8608 -1.1561 0.6933 0.9607 1.6073
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.584604 1.658489 0.955 0.3394
age 0.013461 0.024107 0.558 0.5766
raceOther -1.238445 0.587093 -2.109 0.0349 *
raceWhite -0.199260 0.243505 -0.818 0.4132
hispanic 0.029483 0.484819 0.061 0.9515
insuranceMedicaid -0.619277 0.293205 -2.112 0.0347 *
insuranceMedicare -0.322881 0.333549 -0.968 0.3330
insuranceUninsured 0.052776 0.463798 0.114 0.9094
bmi 0.009652 0.012887 0.749 0.4539
sbp -0.014696 0.005944 -2.472 0.0134 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 230.85 on 171 degrees of freedom
Residual deviance: 210.49 on 162 degrees of freedom
AIC: 230.49
Number of Fisher Scoring iterations: 4
confint(m_scr2_probit)
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -1.70739455 4.894713408
age -0.03400934 0.061437784
raceOther -2.53819772 -0.119812915
raceWhite -0.67910311 0.282566349
hispanic -0.92476109 1.011089937
insuranceMedicaid -1.21212866 -0.049749873
insuranceMedicare -0.99315267 0.329851215
insuranceUninsured -0.83335121 0.966504724
bmi -0.01577917 0.035632196
sbp -0.02659318 -0.003148488
18.2.5 Interpreting the Probit Model’s Coefficients
It is possible to use any number of link functions to ensure that predicted values in a generalized linear model fall between 0 and 1. The probit regression model, for instance, uses the inverse of the cumulative distribution function of the Normal model as its link function. Let’s look more closely at the coefficients of the probit model we just fit.
$coef m_scr2_probit
(Intercept) age raceOther raceWhite
1.584603569 0.013461338 -1.238445198 -0.199260184
hispanic insuranceMedicaid insuranceMedicare insuranceUninsured
0.029483051 -0.619276718 -0.322880519 0.052775722
bmi sbp
0.009652339 -0.014695526
The probit regression coefficients give the change in the z-score of the outcome of interest (here, up_to_date
) for a one-unit change in the target predictor, holding all other predictors constant.
- So, for a one-year increase in age, holding all other predictors constant, the z-score for
up_to_date
increases by 0.013 - And for a Medicaid subject as compared to a Commercial subject of the same age, race, ethnicity, bmi and sbp, the z-score for the Medicaid subject is predicted to be -0.619 lower, according to this model.
18.2.6 What about Harry and Sally?
Do the predictions for Harry and Sally change much with this probit model, as compared to the logistic regression?
predict(m_scr2_probit, newdata = newdat_s2, type = "response")
1 2
0.5885511 0.4364027