Chapter 16 Colorectal Cancer Screening and Some Special Cases

In this Chapter, we discuss two issues as yet unraised regarding regression on a binary outcome.

  1. What do we do if our binary outcome is not available for each subject individually, but instead aggregated?

  2. What is probit regression, and how can we use it as an alternative to logistic regression on a binary outcome?

16.1 Logistic Regression for Aggregated Data

16.1.1 Colorectal Cancer Screening Data

The screening.csv data (imported into the R tibble colscr are simulated. They mirror a subset of the actual results from the Better Health Partnership’s pilot study of colorectal cancer screening in primary care clinics in Northeast Ohio, but the numbers have been fuzzed slightly, and the clinics have been de-identified and moved around from system to system.

Available to us are the following variables:

Variable Description
location clinic code
subjects number of subjects reported by clinic
screen_rate proportion of subjects who were screened
screened number of subjects who were screened
notscreened number of subjects not screened
meanage mean age of clinic’s subjects, years
female % of clinic’s subjects who are female
pct_lowins % of clinic’s subjects who have Medicaid or are uninsured
system system code
colscr %>% skim()
Skim summary statistics
 n obs: 26 
 n variables: 9 

-- Variable type:factor -----------------------------------------------------------
 variable missing complete  n n_unique                     top_counts
 location       0       26 26       26         A: 1, B: 1, C: 1, D: 1
   system       0       26 26        4 Sys: 7, Sys: 7, Sys: 6, Sys: 6
 ordered
   FALSE
   FALSE

-- Variable type:integer ----------------------------------------------------------
    variable missing complete  n    mean      sd  p0     p25    p50
 notscreened       0       26 26  663.23  271.17 231  508.75  611  
    screened       0       26 26 2584.04 1765.11 572 1395.25 2169.5
    subjects       0       26 26 3247.27 1945.83 803 1914.75 2765.5
     p75 p100
  791    1356
 2716    6947
 3607.75 7677

-- Variable type:numeric ----------------------------------------------------------
    variable missing complete  n  mean     sd    p0   p25   p50   p75 p100
      female       0       26 26 58.72  6.29  46.2  55.42 60.05 62.62 70.3
     meanage       0       26 26 60.58  1.93  58    58.82 60.5  61.98 65.9
  pct_lowins       0       26 26 24.47 19.13   0.3   4.8  23.95 44.03 51.3
 screen_rate       0       26 26  0.77  0.072  0.64  0.72  0.76  0.81  0.9

16.1.2 Fitting a Logistic Regression Model to Proportion Data

Here, we have a binary outcome (was the subject screened or not?) but we have aggregated results. We can use the counts of the numbers of subjects at each clinic (in subjects) and the proportion who were screened (in screen_rate) to fit a logistic regression model, as follows:

m_screen1 <-  glm(screen_rate ~ meanage + female + 
                    pct_lowins + system, family = binomial, 
                  weights = subjects, data = colscr)

summary(m_screen1)

Call:
glm(formula = screen_rate ~ meanage + female + pct_lowins + system, 
    family = binomial, data = colscr, weights = subjects)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-9.017  -3.705  -2.000   1.380  11.500  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.3270393  0.5530782  -2.399   0.0164 *  
meanage      0.0679866  0.0089754   7.575 3.60e-14 ***
female      -0.0193142  0.0015831 -12.200  < 2e-16 ***
pct_lowins  -0.0134547  0.0008585 -15.672  < 2e-16 ***
systemSys_2 -0.1382189  0.0246591  -5.605 2.08e-08 ***
systemSys_3 -0.0400170  0.0254505  -1.572   0.1159    
systemSys_4  0.0229273  0.0294207   0.779   0.4358    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2825.28  on 25  degrees of freedom
Residual deviance:  816.39  on 19  degrees of freedom
AIC: 1037.9

Number of Fisher Scoring iterations: 4

16.1.3 Fitting a Logistic Regression Model to Counts of Successes and Failures

m_screen2 <-  glm(cbind(screened, notscreened) ~ 
                    meanage + female + pct_lowins + system, 
           family = binomial, data = colscr)
summary(m_screen2)

Call:
glm(formula = cbind(screened, notscreened) ~ meanage + female + 
    pct_lowins + system, family = binomial, data = colscr)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-9.017  -3.705  -2.000   1.380  11.500  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.3270392  0.5530782  -2.399   0.0164 *  
meanage      0.0679866  0.0089754   7.575 3.60e-14 ***
female      -0.0193142  0.0015831 -12.200  < 2e-16 ***
pct_lowins  -0.0134547  0.0008585 -15.672  < 2e-16 ***
systemSys_2 -0.1382189  0.0246591  -5.605 2.08e-08 ***
systemSys_3 -0.0400170  0.0254505  -1.572   0.1159    
systemSys_4  0.0229273  0.0294207   0.779   0.4358    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2825.28  on 25  degrees of freedom
Residual deviance:  816.39  on 19  degrees of freedom
AIC: 1037.9

Number of Fisher Scoring iterations: 4

16.1.4 How does one address this problem in rms?

We can use Glm. As an example to mirror m_screen1, we have the following…

d <- datadist(colscr)
options(datadist = "d")

mod_screen_1 <-  Glm(screen_rate ~ meanage + female + 
                         pct_lowins + system, 
                     family = binomial, weights = subjects, 
                     data = colscr, x = T, y = T)

mod_screen_1
General Linear Model
 
 Glm(formula = screen_rate ~ meanage + female + pct_lowins + system, 
     family = binomial, data = colscr, weights = subjects, x = T, 
     y = T)
 
                     Model Likelihood     
                        Ratio Test        
 Obs      26        LR chi2    2008.90    
 Residual d.f.19    d.f.             6    
 g 0.4614539        Pr(> chi2) <0.0001    
 
              Coef    S.E.   Wald Z Pr(>|Z|)
 Intercept    -1.3270 0.5531  -2.40 0.0164  
 meanage       0.0680 0.0090   7.57 <0.0001 
 female       -0.0193 0.0016 -12.20 <0.0001 
 pct_lowins   -0.0135 0.0009 -15.67 <0.0001 
 system=Sys_2 -0.1382 0.0247  -5.61 <0.0001 
 system=Sys_3 -0.0400 0.0255  -1.57 0.1159  
 system=Sys_4  0.0229 0.0294   0.78 0.4358  
 

16.2 Probit Regression

16.2.1 Colorectal Cancer Screening Data on Individuals

The data in the colscr2 data frame describe (disguised) data on the status of 172 adults who were eligible for colon cancer screening, with the following information included:

Variable Description
subject subject ID code
age subject’s age (years)
race subject’s race (White/Black/Other)
hispanic subject of Hispanic ethnicity (1 = yes / 0 = no)
insurance Commercial, Medicaid, Medicare, Uninsured
bmi body mass index at most recent visit
sbp systolic blood pressure at most recent visit
up_to_date meets colon cancer screening standards

The goal is to use the other variables (besides subject ID) to predict whether or not a subject is up to date.

16.2.2 A logistic regression model

Here is a logistic regression model.

colscr2 %>% summary()
    subject           age           race        hispanic      
 Min.   :101.0   Min.   :51.00   Black:118   Min.   :0.00000  
 1st Qu.:143.8   1st Qu.:54.00   Other:  9   1st Qu.:0.00000  
 Median :186.5   Median :57.00   White: 45   Median :0.00000  
 Mean   :186.5   Mean   :57.80               Mean   :0.06395  
 3rd Qu.:229.2   3rd Qu.:61.25               3rd Qu.:0.00000  
 Max.   :272.0   Max.   :69.00               Max.   :1.00000  
      insurance       bmi             sbp          up_to_date    
 Commercial:32   Min.   :17.20   Min.   : 89.0   Min.   :0.0000  
 Medicaid  :81   1st Qu.:25.48   1st Qu.:118.0   1st Qu.:0.0000  
 Medicare  :46   Median :30.05   Median :127.0   Median :1.0000  
 Uninsured :13   Mean   :31.24   Mean   :128.9   Mean   :0.6047  
                 3rd Qu.:36.03   3rd Qu.:138.0   3rd Qu.:1.0000  
                 Max.   :55.41   Max.   :198.0   Max.   :1.0000  
m_scr2_logistic <- glm(up_to_date ~ age + race + hispanic + 
                    insurance + bmi + sbp, 
                family = binomial, data = colscr2)

summary(m_scr2_logistic)

Call:
glm(formula = up_to_date ~ age + race + hispanic + insurance + 
    bmi + sbp, family = binomial, data = colscr2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8604  -1.1540   0.6933   0.9564   1.6108  

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)  
(Intercept)         2.7040470  2.7418625   0.986   0.3240  
age                 0.0204901  0.0396920   0.516   0.6057  
raceOther          -1.9722351  1.0023237  -1.968   0.0491 *
raceWhite          -0.3210458  0.4001744  -0.802   0.4224  
hispanic            0.0005855  0.7953482   0.001   0.9994  
insuranceMedicaid  -1.0151860  0.4945169  -2.053   0.0401 *
insuranceMedicare  -0.5216006  0.5629935  -0.926   0.3542  
insuranceUninsured  0.1099966  0.7906196   0.139   0.8893  
bmi                 0.0155894  0.0213547   0.730   0.4654  
sbp                -0.0241777  0.0099138  -2.439   0.0147 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 230.85  on 171  degrees of freedom
Residual deviance: 210.55  on 162  degrees of freedom
AIC: 230.55

Number of Fisher Scoring iterations: 4
confint(m_scr2_logistic)
Waiting for profiling to be done...
                         2.5 %       97.5 %
(Intercept)        -2.64274441  8.157738367
age                -0.05703934  0.099298904
raceOther          -4.24604744 -0.175953229
raceWhite          -1.10800168  0.469396096
hispanic           -1.53691431  1.688738936
insuranceMedicaid  -2.03640881 -0.079142467
insuranceMedicare  -1.66246932  0.564088260
insuranceUninsured -1.40110091  1.759391281
bmi                -0.02568426  0.058563761
sbp                -0.04436143 -0.005282686

In this model, there appears to be some link between sbp and screening, as well as, perhaps, some statistically significant differences between some race groups and some insurance groups. We won’t look at this much further for now, though. Instead, we’ll simply describe predictions for two subjects, Harry and Sally.

16.2.3 Predicting status for Harry and Sally

  • Harry is age 65, White, non-Hispanic, with Medicare insurance, a BMI of 28 and SBP of 135.
  • Sally is age 60, Black, Hispanic, with Medicaid insurance, a BMI of 22 and SBP of 148.
newdat_s2 <- data_frame(subject = c("Harry", "Sally"),
                     age = c(65, 60),
                     race = c("White", "Black"),
                     hispanic = c(0, 1),
                     insurance = c("Medicare", "Medicaid"),
                     bmi = c(28, 22),
                     sbp = c(135, 148))

predict(m_scr2_logistic, newdata = newdat_s2, 
        type = "response")
        1         2 
0.5904364 0.4215335 

The prediction for Harry is 0.59, and for Sally, 0.42, by this logistic regression model.

16.2.4 A probit regression model

Now, consider a probit regression, fit by changing the default link for the binomial family as follows:

m_scr2_probit <- glm(up_to_date ~ age + race + hispanic + 
                    insurance + bmi + sbp, 
                family = binomial(link = "probit"), 
                data = colscr2)

summary(m_scr2_probit)

Call:
glm(formula = up_to_date ~ age + race + hispanic + insurance + 
    bmi + sbp, family = binomial(link = "probit"), data = colscr2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8608  -1.1561   0.6933   0.9607   1.6073  

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)  
(Intercept)         1.584604   1.658489   0.955   0.3394  
age                 0.013461   0.024107   0.558   0.5766  
raceOther          -1.238445   0.587093  -2.109   0.0349 *
raceWhite          -0.199260   0.243505  -0.818   0.4132  
hispanic            0.029483   0.484819   0.061   0.9515  
insuranceMedicaid  -0.619277   0.293205  -2.112   0.0347 *
insuranceMedicare  -0.322881   0.333549  -0.968   0.3330  
insuranceUninsured  0.052776   0.463798   0.114   0.9094  
bmi                 0.009652   0.012887   0.749   0.4539  
sbp                -0.014696   0.005944  -2.472   0.0134 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 230.85  on 171  degrees of freedom
Residual deviance: 210.49  on 162  degrees of freedom
AIC: 230.49

Number of Fisher Scoring iterations: 4
confint(m_scr2_probit)
Waiting for profiling to be done...
                         2.5 %       97.5 %
(Intercept)        -1.70739455  4.894713408
age                -0.03400934  0.061437784
raceOther          -2.53819772 -0.119812915
raceWhite          -0.67910311  0.282566349
hispanic           -0.92476109  1.011089937
insuranceMedicaid  -1.21212866 -0.049749873
insuranceMedicare  -0.99315267  0.329851215
insuranceUninsured -0.83335121  0.966504724
bmi                -0.01577917  0.035632196
sbp                -0.02659318 -0.003148488

16.2.5 Interpreting the Probit Model’s Coefficients

It is possible to use any number of link functions to ensure that predicted values in a generalized linear model fall between 0 and 1. The probit regression model, for instance, uses the inverse of the cumulative distribution function of the Normal model as its link function. Let’s look more closely at the coefficients of the probit model we just fit.

m_scr2_probit$coef
       (Intercept)                age          raceOther 
       1.584603569        0.013461338       -1.238445198 
         raceWhite           hispanic  insuranceMedicaid 
      -0.199260184        0.029483051       -0.619276718 
 insuranceMedicare insuranceUninsured                bmi 
      -0.322880519        0.052775722        0.009652339 
               sbp 
      -0.014695526 

The probit regression coefficients give the change in the z-score of the outcome of interest (here, up_to_date) for a one-unit change in the target predictor, holding all other predictors constant.

  • So, for a one-year increase in age, holding all other predictors constant, the z-score for up_to_date increases by 0.013
  • And for a Medicaid subject as compared to a Commercial subject of the same age, race, ethnicity, bmi and sbp, the z-score for the Medicaid subject is predicted to be -0.619 lower, according to this model.

16.2.6 What about Harry and Sally?

Do the predictions for Harry and Sally change much with this probit model, as compared to the logistic regression?

predict(m_scr2_probit, newdata = newdat_s2, type = "response")
        1         2 
0.5885511 0.4364027