13 Proportions and Rates
This is a sketchy draft. I’ll remove this notice when I post a version of this Chapter that is essentially finished.
13.1 R setup for this chapter
Appendix A lists all R packages used in this book, and also provides R session information.
13.2 Data: strep_tb data from the medicaldata R package
Appendix C provides further guidance on pulling data from other systems into R, while Appendix D gives more information (including download links) for all data sets used in this book. Appendix B describes the 431-Love.R script, and demonstrates its use.
See pages 51-52 of R&OS for standard errors and confidence intervals for proportions, and for what to do when y = 0 or y = n.
My source for these data is Higgins (2023).
We’ll use the strep_tb data from the medicaldata R package. We’ll look at study arm (streptomycin or control) and the dichotomous outcome improved (TRUE or FALSE). That outcome is stored as a logical variable, so we’ll build a factor version of it, and we’ll also keep the patient ID.
See https://higgi13425.github.io/medicaldata/ for more details.
# build a factor version of the logical outcome, with "Improved" as the first level
strep <- medicaldata::strep_tb |>
  mutate(
    imp_f = factor(improved),
    imp_f = fct_recode(imp_f,
      "Improved" = "TRUE",
      "Worsened" = "FALSE"
    ),
    imp_f = fct_relevel(imp_f, "Improved")
  )
strep
# A tibble: 107 × 14
patient_id arm dose_strep_g dose_PAS_g gender baseline_condition
<chr> <fct> <dbl> <dbl> <fct> <fct>
1 0001 Control 0 0 M 1_Good
2 0002 Control 0 0 F 1_Good
3 0003 Control 0 0 F 1_Good
4 0004 Control 0 0 M 1_Good
5 0005 Control 0 0 F 1_Good
6 0006 Control 0 0 M 1_Good
7 0007 Control 0 0 F 1_Good
8 0008 Control 0 0 M 1_Good
9 0009 Control 0 0 F 2_Fair
10 0010 Control 0 0 M 2_Fair
# ℹ 97 more rows
# ℹ 8 more variables: baseline_temp <fct>, baseline_esr <fct>,
# baseline_cavitation <fct>, strep_resistance <fct>, radiologic_6m <fct>,
# rad_num <dbl>, improved <lgl>, imp_f <fct>
table(strep$arm, strep$imp_f)
Improved Worsened
Streptomycin 38 17
Control 17 35
13.3 Estimating a Proportion
Among the 55 subjects who received Streptomycin, 38 improved and 17 did not, for a sample proportion of 38/55 = 0.691. Can we build a confidence interval for the population proportion of Streptomycin-treated subjects who would improve?
13.3.1 Using a Bayesian augmentation
binom.test(x = 38 + 2, n = 55 + 4, conf.level = 0.95)
Exact binomial test
data: 38 + 2 and 55 + 4
number of successes = 40, number of trials = 59, p-value = 0.008641
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.5436200 0.7937535
sample estimates:
probability of success
0.6779661
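The call above adds two imaginary successes and two imaginary failures to the observed 38 improvements in 55 subjects before running binom.test(). As a rough companion, here is a minimal sketch of the closely related "plus four" calculation. It assumes a normal-approximation interval (estimate plus or minus a normal quantile times a standard error), which is not what binom.test() does, and the object names x, n, p_aug, se_aug and plus4_ci are illustrative only.

```r
# Observed data in the Streptomycin arm
x <- 38   # subjects who improved
n <- 55   # subjects in the arm

# Augment with 2 imaginary successes and 2 imaginary failures
p_aug  <- (x + 2) / (n + 4)
se_aug <- sqrt(p_aug * (1 - p_aug) / (n + 4))

# Normal-approximation 95% interval (an assumption of this sketch;
# the binom.test() call above reports an exact binomial interval instead)
z <- qnorm(0.975)
plus4_ci <- c(lower = p_aug - z * se_aug, upper = p_aug + z * se_aug)

round(c(estimate = p_aug, plus4_ci), 4)
```

The point estimate here is the same 40/59 that binom.test() reports. The interval endpoints should be close to, but not identical to, the exact interval shown above.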
13.3.2 SAIFS: single augmentation with an imaginary failure or success
The saifs_ci() function
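# requires the tibble package (loaded with the tidyverse)
saifs_ci <- function(x, n, conf.level = 0.95, dig = 3) {
  # sample proportion, rounded for display
  p.sample <- round(x / n, digits = dig)

  # add one imaginary failure (p1) or one imaginary success (p2)
  p1 <- x / (n + 1)
  p2 <- (x + 1) / (n + 1)

  # standard errors for each augmented proportion
  var1 <- (p1 * (1 - p1)) / n
  se1 <- sqrt(var1)
  var2 <- (p2 * (1 - p2)) / n
  se2 <- sqrt(var2)

  # t cutoff with n - 1 degrees of freedom
  lowq <- (1 - conf.level) / 2
  tcut <- qt(lowq, df = n - 1, lower.tail = FALSE)

  lower.bound <- round(p1 - tcut * se1, digits = dig)
  upper.bound <- round(p2 + tcut * se2, digits = dig)

  tibble(
    sample_x = x,
    sample_n = n,
    sample_p = p.sample,
    lower = lower.bound,
    upper = upper.bound,
    conf_level = conf.level
  )
}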
Using the saifs_ci() function from Love-431.R
saifs_ci(x = 38, n = 55, conf.level = 0.95, dig = 3)
# A tibble: 1 × 6
sample_x sample_n sample_p lower upper conf_level
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 38 55 0.691 0.552 0.821 0.95
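To see where those bounds come from, the arithmetic inside saifs_ci() can be traced by hand for x = 38 and n = 55. This is only a sketch in base R that mirrors the steps of the function as listed in this chapter; the object name saifs_by_hand is illustrative.

```r
x <- 38
n <- 55

# Lower bound starts from one imaginary failure: x successes in n + 1 trials
p1 <- x / (n + 1)
# Upper bound starts from one imaginary success: x + 1 successes in n + 1 trials
p2 <- (x + 1) / (n + 1)

# Standard errors, dividing by n as saifs_ci() does
se1 <- sqrt(p1 * (1 - p1) / n)
se2 <- sqrt(p2 * (1 - p2) / n)

# t cutoff with n - 1 degrees of freedom for a 95% interval
tcut <- qt(0.975, df = n - 1)

saifs_by_hand <- c(lower = p1 - tcut * se1, upper = p2 + tcut * se2)
round(saifs_by_hand, 3)
```

Rounded to three digits, these values should agree with the lower and upper bounds of 0.552 and 0.821 in the tibble above.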
13.4 Assessing the 2 x 2 table
This table is in standard epidemiological format, which means that:
- The rows of the table describe the “treatment” (which we’ll take here to be arm).
- The more interesting (sometimes also the more common) “treatment” is placed in the top row. That’s Streptomycin here.
- The columns of the table describe the “outcome” (which we’ll take here to be whether the subject improved or not).
- Typically, the more common or more interesting “outcome” is placed to the left. Here, we’ll use “improved” on the left.
2 by 2 table analysis:
------------------------------------------------------
Outcome : Improved
Comparing : Streptomycin vs. Control
Improved Worsened P(Improved) 95% conf. interval
Streptomycin 38 17 0.6909 0.5579 0.7984
Control 17 35 0.3269 0.2139 0.4644
95% conf. interval
Relative Risk: 2.1134 1.3773 3.2429
Sample Odds Ratio: 4.6021 2.0389 10.3877
Conditional MLE Odds Ratio: 4.5304 1.8962 11.2779
Probability difference: 0.3640 0.1754 0.5182
Exact P-value: 0.0002
Asymptotic P-value: 0.0002
------------------------------------------------------
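The point estimates in this output can be checked directly from the four counts in the table. The following is a minimal base R sketch of that arithmetic only; it does not attempt to reproduce the confidence intervals or p-values, and the object names (tab, p_improved, and so on) are illustrative.

```r
# Counts from the table: rows are study arms, columns are outcomes
tab <- matrix(c(38, 17,
                17, 35),
              nrow = 2, byrow = TRUE,
              dimnames = list(arm = c("Streptomycin", "Control"),
                              outcome = c("Improved", "Worsened")))

# Probability of improvement within each arm
p_improved <- tab[, "Improved"] / rowSums(tab)

# Odds of improvement within each arm
odds <- p_improved / (1 - p_improved)

# Point estimates comparing Streptomycin to Control
rel_risk   <- p_improved[["Streptomycin"]] / p_improved[["Control"]]
odds_ratio <- odds[["Streptomycin"]] / odds[["Control"]]
risk_diff  <- p_improved[["Streptomycin"]] - p_improved[["Control"]]

round(c(RR = rel_risk, OR = odds_ratio, diff = risk_diff), 4)
```

These should match the Relative Risk (2.1134), Sample Odds Ratio (4.6021) and Probability difference (0.3640) rows of the output.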
13.5 Ebola Virus Study
The World Health Organization’s Ebola Response Team published an article (WHO Ebola Response Team 2014) in the October 16, 2014 issue of the New England Journal of Medicine, which contained some data I will use in this example, focusing on their Table 2.
Suppose we want to compare the proportion of deaths among hospitalized cases with a definitive outcome to the proportion of deaths among non-hospitalized cases with a definitive outcome.
The article suggests that of the 1,737 cases with a definitive outcome, 1,153 were hospitalized. Among those 1,153 hospitalized cases, 741 people (64.3%) died. Among the remaining 584 non-hospitalized cases, 488 people (83.6%) died.
Here is the initial contingency table, using only the numbers from the previous paragraph.
Initial Ebola Table | Deceased | Alive | Total |
---|---|---|---|
Hospitalized | 741 | – | 1153 |
Not Hospitalized | 488 | – | 584 |
Total | – | – | 1737 |
Now, we can use arithmetic to complete the table, since the rows and the columns are each mutually exclusive and collectively exhaustive.
Ebola 2x2 Table | Deceased | Alive | Total |
---|---|---|---|
Hospitalized | 741 | 412 | 1153 |
Not Hospitalized | 488 | 96 | 584 |
Total | 1229 | 508 | 1737 |
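The arithmetic used to complete the table can be written out as a short base R sketch. It starts from the four figures quoted from the article and fills in the rest by subtraction; the object names are illustrative.

```r
# Figures quoted from the article
total_cases  <- 1737
hosp_total   <- 1153
hosp_dead    <- 741
nothosp_dead <- 488

# Fill in the remaining cells by subtraction
nothosp_total <- total_cases - hosp_total
hosp_alive    <- hosp_total - hosp_dead
nothosp_alive <- nothosp_total - nothosp_dead

ebola <- matrix(c(hosp_dead, hosp_alive,
                  nothosp_dead, nothosp_alive),
                nrow = 2, byrow = TRUE,
                dimnames = list(status = c("Hospitalized", "Not Hospitalized"),
                                outcome = c("Deceased", "Alive")))

# Add row and column totals to compare against the completed table
ebola_totals <- addmargins(as.table(ebola))
ebola_totals
```

The margins give 1229 deceased and 508 alive, which sum to the 1737 cases with a definitive outcome.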
We want to compare the fatality risk (probability of being in the deceased column) for the population of people in the hospitalized row to the population of people in the not hospitalized row.
See sections 25.4 and 26.11 in the 2023 course notes.
twobytwo(741, 412, 488, 96,
         "Hosp", "Not Hosp", "Dead", "Alive",
         conf.level = 0.95)
2 by 2 table analysis:
------------------------------------------------------
Outcome : Dead
Comparing : Hosp vs. Not Hosp
Dead Alive P(Dead) 95% conf. interval
Hosp 741 412 0.6427 0.6146 0.6698
Not Hosp 488 96 0.8356 0.8033 0.8635
95% conf. interval
Relative Risk: 0.7691 0.7271 0.8135
Sample Odds Ratio: 0.3538 0.2756 0.4542
Conditional MLE Odds Ratio: 0.3540 0.2726 0.4566
Probability difference: -0.1929 -0.2325 -0.1508
Exact P-value: 0.0000
Asymptotic P-value: 0.0000
------------------------------------------------------
twobytwo(412, 741, 96, 488,
         "Hosp", "Not Hosp", "Alive", "Dead",
         conf.level = 0.95)
2 by 2 table analysis:
------------------------------------------------------
Outcome : Alive
Comparing : Hosp vs. Not Hosp
Alive Dead P(Alive) 95% conf. interval
Hosp 412 741 0.3573 0.3302 0.3854
Not Hosp 96 488 0.1644 0.1365 0.1967
95% conf. interval
Relative Risk: 2.1737 1.7823 2.6512
Sample Odds Ratio: 2.8264 2.2016 3.6284
Conditional MLE Odds Ratio: 2.8248 2.1900 3.6678
Probability difference: 0.1929 0.1508 0.2325
Exact P-value: 0.0000
Asymptotic P-value: 0.0000
------------------------------------------------------
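The two analyses above describe the same table with the outcome columns swapped, and it can help to see which summaries simply flip and which do not. Here is a minimal base R sketch using only the four cell counts; the object names are illustrative, and it covers only the point estimates.

```r
# Cell counts from the completed Ebola table
dead  <- c(Hosp = 741, NotHosp = 488)
alive <- c(Hosp = 412, NotHosp = 96)
n     <- dead + alive

p_dead  <- dead / n
p_alive <- alive / n

# Relative risks (Hosp vs. Not Hosp) for each choice of outcome
rr_dead  <- p_dead[["Hosp"]] / p_dead[["NotHosp"]]
rr_alive <- p_alive[["Hosp"]] / p_alive[["NotHosp"]]

# Odds ratios for each choice of outcome
or_dead  <- (dead[["Hosp"]] * alive[["NotHosp"]]) /
            (alive[["Hosp"]] * dead[["NotHosp"]])
or_alive <- (alive[["Hosp"]] * dead[["NotHosp"]]) /
            (dead[["Hosp"]] * alive[["NotHosp"]])

# Risk differences for each choice of outcome
diff_dead  <- p_dead[["Hosp"]] - p_dead[["NotHosp"]]
diff_alive <- p_alive[["Hosp"]] - p_alive[["NotHosp"]]

round(c(rr_dead = rr_dead, rr_alive = rr_alive,
        or_dead = or_dead, or_alive = or_alive,
        diff_dead = diff_dead, diff_alive = diff_alive), 4)
```

Swapping the outcome inverts the odds ratio (0.3538 and 2.8264 are reciprocals) and changes the sign of the probability difference, but the two relative risks (0.7691 and 2.1737) are not reciprocals of one another.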
13.6 For More Information
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2797398/ discusses 2x2 tables.
OpenIntro Statistics https://www.openintro.org/book/os/ Section 5 - foundations for inference about a proportion
OpenIntro Statistics https://www.openintro.org/book/os/ Section 6 - inference for categorical data (although we don’t cover Section 6.3 in 431)
WHO Ebola Response Team (2014) Ebola virus disease in West Africa: The first 9 months of the epidemic and forward projections. New Engl J Med 371: 1481-1495. doi: 10.1056/NEJMoa1411100