Chapter 18 Colorectal Cancer Screening and Some Special Cases

In this Chapter, we discuss two issues not yet raised regarding regression on a binary outcome.

  1. What do we do if our binary outcome is not available for each subject individually, but instead aggregated?

  2. What is probit regression, and how can we use it as an alternative to logistic regression on a binary outcome?

18.1 Logistic Regression for Aggregated Data

18.1.1 Colorectal Cancer Screening Data

The screening.csv data (imported into the R tibble colscr are simulated. They mirror a subset of the actual results from the Better Health Partnership’s pilot study of colorectal cancer screening in primary care clinics in Northeast Ohio, but the numbers have been fuzzed slightly, and the clinics have been de-identified and moved around from system to system.

Available to us are the following variables:

Variable Description
location clinic code
subjects number of subjects reported by clinic
screen_rate proportion of subjects who were screened
screened number of subjects who were screened
notscreened number of subjects not screened
meanage mean age of clinic’s subjects, years
female % of clinic’s subjects who are female
pct_lowins % of clinic’s subjects who have Medicaid or are uninsured
system system code
describe(colscr)
colscr 

 9  Variables      26  Observations
--------------------------------------------------------------------------------
location 
       n  missing distinct 
      26        0       26 

lowest : A B C D E, highest: V W X Y Z
--------------------------------------------------------------------------------
subjects 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      26        0       26        1     3247     2134     1249     1475 
     .25      .50      .75      .90      .95 
    1915     2766     3608     6523     7068 

lowest :  803 1179 1459 1491 1512, highest: 5451 6061 6985 7095 7677
--------------------------------------------------------------------------------
screen_rate 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      26        0       26        1   0.7659  0.08339   0.6682   0.6732 
     .25      .50      .75      .90      .95 
  0.7179   0.7579   0.8088   0.8654   0.8899 

lowest : 0.6354901 0.6666667 0.6728914 0.6734521 0.6869919
highest: 0.8237129 0.8469110 0.8838745 0.8919112 0.9049108
--------------------------------------------------------------------------------
screened 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      26        0       26        1     2584     1895    843.5    993.0 
     .25      .50      .75      .90      .95 
  1395.2   2169.5   2716.0   5293.5   6107.2 

lowest :  572  794  992  994 1088, highest: 4818 4848 5739 6230 6947
--------------------------------------------------------------------------------
notscreened 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      26        0       26        1    663.2    303.5    336.0    352.5 
     .25      .50      .75      .90      .95 
   508.8    611.0    791.0    989.0   1172.5 

lowest :  231  335  339  366  371, highest:  881  927 1051 1213 1356
--------------------------------------------------------------------------------
meanage 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      26        0       23    0.999    60.58    2.186    58.23    58.35 
     .25      .50      .75      .90      .95 
   58.82    60.50    61.98    62.50    62.90 

lowest : 58.0 58.2 58.3 58.4 58.5, highest: 62.2 62.4 62.6 63.0 65.9
--------------------------------------------------------------------------------
female 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      26        0       23    0.999    58.72    7.118    46.93    48.45 
     .25      .50      .75      .90      .95 
   55.42    60.05    62.62    64.90    67.50 

lowest : 46.2 46.6 47.9 49.0 54.3, highest: 63.6 64.1 65.7 68.1 70.3
--------------------------------------------------------------------------------
pct_lowins 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      26        0       24    0.999    24.47    22.12    0.675    1.800 
     .25      .50      .75      .90      .95 
   4.800   23.950   44.025   49.500   49.950 

lowest :  0.3  0.4  1.5  2.1  3.0, highest: 45.4 47.1 49.5 50.1 51.3
--------------------------------------------------------------------------------
system 
       n  missing distinct 
      26        0        4 
                                  
Value      Sys_1 Sys_2 Sys_3 Sys_4
Frequency      7     7     6     6
Proportion 0.269 0.269 0.231 0.231
--------------------------------------------------------------------------------

18.1.2 Fitting a Logistic Regression Model to Proportion Data

Here, we have a binary outcome (was the subject screened or not?) but we have aggregated results. We can use the counts of the numbers of subjects at each clinic (in subjects) and the proportion who were screened (in screen_rate) to fit a logistic regression model, as follows:

m_screen1 <-  glm(screen_rate ~ meanage + female + 
                    pct_lowins + system, family = binomial, 
                  weights = subjects, data = colscr)

summary(m_screen1)

Call:
glm(formula = screen_rate ~ meanage + female + pct_lowins + system, 
    family = binomial, data = colscr, weights = subjects)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-9.017  -3.705  -2.000   1.380  11.500  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.3270393  0.5530782  -2.399   0.0164 *  
meanage      0.0679866  0.0089754   7.575 3.60e-14 ***
female      -0.0193142  0.0015831 -12.200  < 2e-16 ***
pct_lowins  -0.0134547  0.0008585 -15.672  < 2e-16 ***
systemSys_2 -0.1382189  0.0246591  -5.605 2.08e-08 ***
systemSys_3 -0.0400170  0.0254505  -1.572   0.1159    
systemSys_4  0.0229273  0.0294207   0.779   0.4358    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2825.28  on 25  degrees of freedom
Residual deviance:  816.39  on 19  degrees of freedom
AIC: 1037.9

Number of Fisher Scoring iterations: 4

18.1.3 Fitting a Logistic Regression Model to Counts of Successes and Failures

m_screen2 <-  glm(cbind(screened, notscreened) ~ 
                    meanage + female + pct_lowins + system, 
           family = binomial, data = colscr)
summary(m_screen2)

Call:
glm(formula = cbind(screened, notscreened) ~ meanage + female + 
    pct_lowins + system, family = binomial, data = colscr)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-9.017  -3.705  -2.000   1.380  11.500  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.3270392  0.5530782  -2.399   0.0164 *  
meanage      0.0679866  0.0089754   7.575 3.60e-14 ***
female      -0.0193142  0.0015831 -12.200  < 2e-16 ***
pct_lowins  -0.0134547  0.0008585 -15.672  < 2e-16 ***
systemSys_2 -0.1382189  0.0246591  -5.605 2.08e-08 ***
systemSys_3 -0.0400170  0.0254505  -1.572   0.1159    
systemSys_4  0.0229273  0.0294207   0.779   0.4358    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2825.28  on 25  degrees of freedom
Residual deviance:  816.39  on 19  degrees of freedom
AIC: 1037.9

Number of Fisher Scoring iterations: 4

18.1.4 How does one address this problem in rms?

We can use Glm. As an example to mirror m_screen1, we have the following…

d <- datadist(colscr)
options(datadist = "d")

mod_screen_1 <-  Glm(screen_rate ~ meanage + female + 
                         pct_lowins + system, 
                     family = binomial, weights = subjects, 
                     data = colscr, x = T, y = T)

mod_screen_1
General Linear Model
 
 Glm(formula = screen_rate ~ meanage + female + pct_lowins + system, 
     family = binomial, data = colscr, weights = subjects, x = T, 
     y = T)
 
                      Model Likelihood    
                            Ratio Test    
     Obs      26    LR chi2    2008.90    
 Residual d.f.19    d.f.             6    
     g 0.4614539    Pr(> chi2) <0.0001    
 
              Coef    S.E.   Wald Z Pr(>|Z|)
 Intercept    -1.3270 0.5531  -2.40 0.0164  
 meanage       0.0680 0.0090   7.57 <0.0001 
 female       -0.0193 0.0016 -12.20 <0.0001 
 pct_lowins   -0.0135 0.0009 -15.67 <0.0001 
 system=Sys_2 -0.1382 0.0247  -5.61 <0.0001 
 system=Sys_3 -0.0400 0.0255  -1.57 0.1159  
 system=Sys_4  0.0229 0.0294   0.78 0.4358  
 

18.2 Probit Regression

18.2.1 Colorectal Cancer Screening Data on Individuals

The data in the colscr2 data frame describe (disguised) data on the status of 172 adults who were eligible for colon cancer screening, with the following information included:

Variable Description
subject subject ID code
age subject’s age (years)
race subject’s race (White/Black/Other)
hispanic subject of Hispanic ethnicity (1 = yes / 0 = no)
insurance Commercial, Medicaid, Medicare, Uninsured
bmi body mass index at most recent visit
sbp systolic blood pressure at most recent visit
up_to_date meets colon cancer screening standards

The goal is to use the other variables (besides subject ID) to predict whether or not a subject is up to date.

colscr2 %>% describe()
. 

 8  Variables      172  Observations
--------------------------------------------------------------------------------
subject 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     172        0      172        1    186.5    57.67    109.6    118.1 
     .25      .50      .75      .90      .95 
   143.8    186.5    229.2    254.9    263.4 

lowest : 101 102 103 104 105, highest: 268 269 270 271 272
--------------------------------------------------------------------------------
age 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     172        0       19    0.995     57.8    5.536    51.00    52.00 
     .25      .50      .75      .90      .95 
   54.00    57.00    61.25    65.00    67.00 

lowest : 51 52 53 54 55, highest: 65 66 67 68 69
                                                                            
Value         51    52    53    54    55    56    57    58    59    60    61
Frequency     10    17    14     9    15    13    18    13     4    10     6
Proportion 0.058 0.099 0.081 0.052 0.087 0.076 0.105 0.076 0.023 0.058 0.035
                                                          
Value         62    63    64    65    66    67    68    69
Frequency     11     4     5     7     6     3     4     3
Proportion 0.064 0.023 0.029 0.041 0.035 0.017 0.023 0.017
--------------------------------------------------------------------------------
race 
       n  missing distinct 
     172        0        3 
                            
Value      Black Other White
Frequency    118     9    45
Proportion 0.686 0.052 0.262
--------------------------------------------------------------------------------
hispanic 
       n  missing distinct     Info      Sum     Mean      Gmd 
     172        0        2     0.18       11  0.06395   0.1204 

--------------------------------------------------------------------------------
insurance 
       n  missing distinct 
     172        0        4 
                                                      
Value      Commercial   Medicaid   Medicare  Uninsured
Frequency          32         81         46         13
Proportion      0.186      0.471      0.267      0.076
--------------------------------------------------------------------------------
bmi 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     172        0      165        1    31.24    8.982    20.32    21.88 
     .25      .50      .75      .90      .95 
   25.48    30.05    36.03    43.06    45.68 

lowest : 17.20 17.59 17.85 18.09 18.44, highest: 49.41 50.83 53.28 54.66 55.41
--------------------------------------------------------------------------------
sbp 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     172        0       68    0.999    128.9    19.46    101.6    109.1 
     .25      .50      .75      .90      .95 
   118.0    127.0    138.0    150.9    162.0 

lowest :  89  90  94  95  96, highest: 166 168 169 170 198
--------------------------------------------------------------------------------
up_to_date 
       n  missing distinct     Info      Sum     Mean      Gmd 
     172        0        2    0.717      104   0.6047   0.4809 

--------------------------------------------------------------------------------

18.2.2 A logistic regression model

Here is a logistic regression model.

m_scr2_logistic <- glm(up_to_date ~ age + race + hispanic + 
                    insurance + bmi + sbp, 
                family = binomial, data = colscr2)

summary(m_scr2_logistic)

Call:
glm(formula = up_to_date ~ age + race + hispanic + insurance + 
    bmi + sbp, family = binomial, data = colscr2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8604  -1.1540   0.6933   0.9564   1.6108  

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)  
(Intercept)         2.7040470  2.7418625   0.986   0.3240  
age                 0.0204901  0.0396920   0.516   0.6057  
raceOther          -1.9722351  1.0023237  -1.968   0.0491 *
raceWhite          -0.3210458  0.4001744  -0.802   0.4224  
hispanic            0.0005855  0.7953482   0.001   0.9994  
insuranceMedicaid  -1.0151860  0.4945169  -2.053   0.0401 *
insuranceMedicare  -0.5216006  0.5629935  -0.926   0.3542  
insuranceUninsured  0.1099966  0.7906196   0.139   0.8893  
bmi                 0.0155894  0.0213547   0.730   0.4654  
sbp                -0.0241777  0.0099138  -2.439   0.0147 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 230.85  on 171  degrees of freedom
Residual deviance: 210.55  on 162  degrees of freedom
AIC: 230.55

Number of Fisher Scoring iterations: 4
confint(m_scr2_logistic)
Waiting for profiling to be done...
                         2.5 %       97.5 %
(Intercept)        -2.64274441  8.157738367
age                -0.05703934  0.099298904
raceOther          -4.24604744 -0.175953229
raceWhite          -1.10800168  0.469396096
hispanic           -1.53691431  1.688738936
insuranceMedicaid  -2.03640881 -0.079142467
insuranceMedicare  -1.66246932  0.564088260
insuranceUninsured -1.40110091  1.759391281
bmi                -0.02568426  0.058563761
sbp                -0.04436143 -0.005282686

In this model, there appears to be some link between sbp and screening, as well as, perhaps, some statistically significant differences between some race groups and some insurance groups. We won’t look at this much further for now, though. Instead, we’ll simply describe predictions for two subjects, Harry and Sally.

18.2.3 Predicting status for Harry and Sally

  • Harry is age 65, White, non-Hispanic, with Medicare insurance, a BMI of 28 and SBP of 135.
  • Sally is age 60, Black, Hispanic, with Medicaid insurance, a BMI of 22 and SBP of 148.
newdat_s2 <- tibble(subject = c("Harry", "Sally"),
                     age = c(65, 60),
                     race = c("White", "Black"),
                     hispanic = c(0, 1),
                     insurance = c("Medicare", "Medicaid"),
                     bmi = c(28, 22),
                     sbp = c(135, 148))

predict(m_scr2_logistic, newdata = newdat_s2, 
        type = "response")
        1         2 
0.5904364 0.4215335 

The prediction for Harry is 0.59, and for Sally, 0.42, by this logistic regression model.

18.2.4 A probit regression model

Now, consider a probit regression, fit by changing the default link for the binomial family as follows:

m_scr2_probit <- glm(up_to_date ~ age + race + hispanic + 
                    insurance + bmi + sbp, 
                family = binomial(link = "probit"), 
                data = colscr2)

summary(m_scr2_probit)

Call:
glm(formula = up_to_date ~ age + race + hispanic + insurance + 
    bmi + sbp, family = binomial(link = "probit"), data = colscr2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8608  -1.1561   0.6933   0.9607   1.6073  

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)  
(Intercept)         1.584604   1.658489   0.955   0.3394  
age                 0.013461   0.024107   0.558   0.5766  
raceOther          -1.238445   0.587093  -2.109   0.0349 *
raceWhite          -0.199260   0.243505  -0.818   0.4132  
hispanic            0.029483   0.484819   0.061   0.9515  
insuranceMedicaid  -0.619277   0.293205  -2.112   0.0347 *
insuranceMedicare  -0.322881   0.333549  -0.968   0.3330  
insuranceUninsured  0.052776   0.463798   0.114   0.9094  
bmi                 0.009652   0.012887   0.749   0.4539  
sbp                -0.014696   0.005944  -2.472   0.0134 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 230.85  on 171  degrees of freedom
Residual deviance: 210.49  on 162  degrees of freedom
AIC: 230.49

Number of Fisher Scoring iterations: 4
confint(m_scr2_probit)
Waiting for profiling to be done...
                         2.5 %       97.5 %
(Intercept)        -1.70739455  4.894713408
age                -0.03400934  0.061437784
raceOther          -2.53819772 -0.119812915
raceWhite          -0.67910311  0.282566349
hispanic           -0.92476109  1.011089937
insuranceMedicaid  -1.21212866 -0.049749873
insuranceMedicare  -0.99315267  0.329851215
insuranceUninsured -0.83335121  0.966504724
bmi                -0.01577917  0.035632196
sbp                -0.02659318 -0.003148488

18.2.5 Interpreting the Probit Model’s Coefficients

It is possible to use any number of link functions to ensure that predicted values in a generalized linear model fall between 0 and 1. The probit regression model, for instance, uses the inverse of the cumulative distribution function of the Normal model as its link function. Let’s look more closely at the coefficients of the probit model we just fit.

m_scr2_probit$coef
       (Intercept)                age          raceOther          raceWhite 
       1.584603569        0.013461338       -1.238445198       -0.199260184 
          hispanic  insuranceMedicaid  insuranceMedicare insuranceUninsured 
       0.029483051       -0.619276718       -0.322880519        0.052775722 
               bmi                sbp 
       0.009652339       -0.014695526 

The probit regression coefficients give the change in the z-score of the outcome of interest (here, up_to_date) for a one-unit change in the target predictor, holding all other predictors constant.

  • So, for a one-year increase in age, holding all other predictors constant, the z-score for up_to_date increases by 0.013
  • And for a Medicaid subject as compared to a Commercial subject of the same age, race, ethnicity, bmi and sbp, the z-score for the Medicaid subject is predicted to be -0.619 lower, according to this model.

18.2.6 What about Harry and Sally?

Do the predictions for Harry and Sally change much with this probit model, as compared to the logistic regression?

predict(m_scr2_probit, newdata = newdat_s2, type = "response")
        1         2 
0.5885511 0.4364027