Propensity Matching (in several ways) using R

Author

Thomas E. Love, Ph.D.

Published

2025-01-27

1 Setup

Code
knitr::opts_chunk$set(comment = NA) # do not remove this

library(janitor) # load other packages as desired
library(naniar)

library(gt)
library(tableone)
library(broom)
library(Epi)
library(survival)
library(Matching)
library(MatchIt)
library(cobalt)
library(lme4)
library(survey)
library(rbounds)

library(easystats)
library(tidyverse) # load tidyverse last

## Note that we will also use the broom.mixed package
## but we won't load it here

theme_set(theme_bw()) # set theme for ggplots

1.1 The dm2200 Data Set

I’ve simulated data to match real information we’ve collected over the years at Better Health Partnership on adults who live with diabetes. These data mirror some of the real data collected from electronic health records across the region by Better Health Partnership, but individual values have been permuted across patients, so the results are not applicable to any population. The data I simulated from was a subset of Better Health data that required that the subject fall into exactly one of the two exposure groups we’ll study, that they live in Cuyahoga County, prefer English for health-related communications, and have no missing data on the variables we’ll study.

  • The exposure we’ll study is called exposure and consists of two levels: A and B. I won’t specify the details further on how the exposure is determined, except to say that it is uniquely determinable for each subject.
  • We’ll study a binary outcome, specifically whether the subject’s blood pressure is in control, in the sense that both their systolic blood pressure is below 140 mm Hg, and their diastolic blood pressure is below 90 mm Hg.
  • We’ll also study a continuous outcome, the subject’s body-mass index or bmi.
Code
dm2200 <- read_csv("data/dm2200.csv", 
                   show_col_types = FALSE) |> 
    mutate(across(where(is.character), as_factor)) |>
    mutate(subject = as.character(subject),
           bp_good = as.numeric(sbp < 140 & dbp < 90))

dm2200
# A tibble: 2,200 × 26
   subject exposure   age race      hisp sex   insur      nincome nhsgrad cleve
   <chr>   <fct>    <dbl> <fct>    <dbl> <fct> <fct>        <dbl>   <dbl> <dbl>
 1 S-0001  A           67 Black_AA     0 F     Medicare     24700      72     1
 2 S-0002  B           56 White        0 F     Commercial   62300      85     0
 3 S-0003  B           41 Black_AA     0 M     Medicare      6500      49     1
 4 S-0004  A           56 Black_AA     0 M     Medicaid     45400      86     0
 5 S-0005  B           69 Black_AA     0 F     Medicare     15400      72     0
 6 S-0006  B           44 Black_AA     0 M     Medicaid     34100      87     0
 7 S-0007  B           47 Black_AA     0 F     Medicaid     13900      42     1
 8 S-0008  B           60 Black_AA     0 M     Medicaid     30800      91     0
 9 S-0009  B           67 Other        1 F     Medicare     47100      90     1
10 S-0010  B           70 Black_AA     0 F     Medicare     20800      69     1
# ℹ 2,190 more rows
# ℹ 16 more variables: height_cm <dbl>, weight_kg <dbl>, bmi <dbl>, a1c <dbl>,
#   sbp <dbl>, dbp <dbl>, ldl <dbl>, visits <dbl>, tobacco <fct>, statin <dbl>,
#   ace_arb <dbl>, betab <dbl>, depr_dx <dbl>, eyeex <dbl>, pneumo <dbl>,
#   bp_good <dbl>

1.2 Elements of a Codebook

Note

I used paste(colnames(dm2200), collapse = " | ") to help me make this list.

Variable Type Description
subject character subject identifier (S-0001 to S-2200)
exposure factor (2 levels) A or B
age integer age in years
race factor (4 levels) White, Black_AA, Asian, Other
hisp 1/0 1 = Hispanic or Latinx, 0 = not
sex F/M F = Female, M = Male
insur factor (4 levels) Insurance: Medicare, Commercial, Medicaid or Uninsured
nincome integer est. Neighborhood Median Income, in $
nhsgrad integer est. % of adults in Neighborhood who are High School graduates
cleve 1/0 1 = Cleveland resident, 0 = resident of suburbs
height_cm integer height in cm
weight_kg integer weight in kg
bmi numeric body mass index (kg/m2)
a1c numeric most recent Hemoglobin A1c (in %)
sbp numeric most recent systolic blood pressure (in mm Hg)
dbp numeric most recent diastolic blood pressure (in mm Hg)
bp_good 1/0 1 if sbp < 140 and dbp < 90, 0 otherwise
ldl numeric most recent LDL cholesterol (in mg/dl)
visits integer primary care office visits in past year
tobacco factor (3 levels) Tobacco use: Current, Former, Never
statin 1/0 1 if subject had a statin prescription in the past year
ace_arb 1/0 1 if subject had an ACE inhibitor or ARB prescription in the past year
betab 1/0 1 if subject had a beta-blocker prescription in the past year
depr_dx 1/0 1 if the subject has a depression diagnosis
eyeex 1/0 1 if the subject has had a retinal eye exam in the past year
pneumo 1/0 1 if the subject has had a pneumococcal vaccination in the past 10 years
Code
data_codebook(dm2200 |> select(-subject))
select(dm2200, -subject) (2200 rows and 25 variables, 25 shown)

ID | Name      | Type        | Missings |         Values |            N
---+-----------+-------------+----------+----------------+-------------
1  | exposure  | categorical | 0 (0.0%) |              A |  200 ( 9.1%)
   |           |             |          |              B | 2000 (90.9%)
---+-----------+-------------+----------+----------------+-------------
2  | age       | numeric     | 0 (0.0%) |       [25, 75] |         2200
---+-----------+-------------+----------+----------------+-------------
3  | race      | categorical | 0 (0.0%) |       Black_AA | 1231 (56.0%)
   |           |             |          |          White |  867 (39.4%)
   |           |             |          |          Other |   75 ( 3.4%)
   |           |             |          |          Asian |   27 ( 1.2%)
---+-----------+-------------+----------+----------------+-------------
4  | hisp      | numeric     | 0 (0.0%) |              0 | 2092 (95.1%)
   |           |             |          |              1 |  108 ( 4.9%)
---+-----------+-------------+----------+----------------+-------------
5  | sex       | categorical | 0 (0.0%) |              F | 1230 (55.9%)
   |           |             |          |              M |  970 (44.1%)
---+-----------+-------------+----------+----------------+-------------
6  | insur     | categorical | 0 (0.0%) |       Medicare |  950 (43.2%)
   |           |             |          |     Commercial |  517 (23.5%)
   |           |             |          |       Medicaid |  626 (28.5%)
   |           |             |          |      Uninsured |  107 ( 4.9%)
---+-----------+-------------+----------+----------------+-------------
7  | nincome   | numeric     | 0 (0.0%) | [6100, 149300] |         2200
---+-----------+-------------+----------+----------------+-------------
8  | nhsgrad   | numeric     | 0 (0.0%) |       [37, 99] |         2200
---+-----------+-------------+----------+----------------+-------------
9  | cleve     | numeric     | 0 (0.0%) |              0 |  910 (41.4%)
   |           |             |          |              1 | 1290 (58.6%)
---+-----------+-------------+----------+----------------+-------------
10 | height_cm | numeric     | 0 (0.0%) |     [138, 203] |         2200
---+-----------+-------------+----------+----------------+-------------
11 | weight_kg | numeric     | 0 (0.0%) |      [45, 223] |         2200
---+-----------+-------------+----------+----------------+-------------
12 | bmi       | numeric     | 0 (0.0%) |   [16.4, 77.2] |         2200
---+-----------+-------------+----------+----------------+-------------
13 | a1c       | numeric     | 0 (0.0%) |    [4.3, 15.9] |         2200
---+-----------+-------------+----------+----------------+-------------
14 | sbp       | numeric     | 0 (0.0%) |      [86, 232] |         2200
---+-----------+-------------+----------+----------------+-------------
15 | dbp       | numeric     | 0 (0.0%) |      [38, 119] |         2200
---+-----------+-------------+----------+----------------+-------------
16 | ldl       | numeric     | 0 (0.0%) |      [25, 243] |         2200
---+-----------+-------------+----------+----------------+-------------
17 | visits    | numeric     | 0 (0.0%) |        [2, 18] |         2200
---+-----------+-------------+----------+----------------+-------------
18 | tobacco   | categorical | 0 (0.0%) |         Former |  821 (37.3%)
   |           |             |          |        Current |  553 (25.1%)
   |           |             |          |          Never |  826 (37.5%)
---+-----------+-------------+----------+----------------+-------------
19 | statin    | numeric     | 0 (0.0%) |              0 |  443 (20.1%)
   |           |             |          |              1 | 1757 (79.9%)
---+-----------+-------------+----------+----------------+-------------
20 | ace_arb   | numeric     | 0 (0.0%) |              0 |  509 (23.1%)
   |           |             |          |              1 | 1691 (76.9%)
---+-----------+-------------+----------+----------------+-------------
21 | betab     | numeric     | 0 (0.0%) |              0 | 1366 (62.1%)
   |           |             |          |              1 |  834 (37.9%)
---+-----------+-------------+----------+----------------+-------------
22 | depr_dx   | numeric     | 0 (0.0%) |              0 | 1394 (63.4%)
   |           |             |          |              1 |  806 (36.6%)
---+-----------+-------------+----------+----------------+-------------
23 | eyeex     | numeric     | 0 (0.0%) |              0 |  871 (39.6%)
   |           |             |          |              1 | 1329 (60.4%)
---+-----------+-------------+----------+----------------+-------------
24 | pneumo    | numeric     | 0 (0.0%) |              0 |  318 (14.5%)
   |           |             |          |              1 | 1882 (85.5%)
---+-----------+-------------+----------+----------------+-------------
25 | bp_good   | numeric     | 0 (0.0%) |              0 |  652 (29.6%)
   |           |             |          |              1 | 1548 (70.4%)
-----------------------------------------------------------------------
Note

Here is a basic Table 1 comparing the two exposure groups.

Code
t1 <- CreateTableOne(
    vars = c("age", "race", "hisp", "sex", "insur", 
             "nincome", "nhsgrad", "cleve", "sbp", "dbp",
             "ldl", "visits", "tobacco", "statin", 
             "ace_arb", "betab", "depr_dx", "eyeex", 
             "pneumo", "bmi", "bp_good"), 
    factorVars = c("hisp", "cleve", "statin",
                   "ace_arb", "betab", "depr_dx", 
                   "eyeex", "pneumo", "bp_good"),
    strata = "exposure", 
    data = dm2200)

t1
                     Stratified by exposure
                      A                   B                   p      test
  n                        200                2000                       
  age (mean (SD))        54.90 (8.55)        58.90 (9.83)     <0.001     
  race (%)                                                    <0.001     
     Black_AA              166 (83.0)         1065 (53.2)                
     White                  31 (15.5)          836 (41.8)                
     Other                   1 ( 0.5)           74 ( 3.7)                
     Asian                   2 ( 1.0)           25 ( 1.2)                
  hisp = 1 (%)               4 ( 2.0)          104 ( 5.2)      0.068     
  sex = M (%)              128 (64.0)          842 (42.1)     <0.001     
  insur (%)                                                   <0.001     
     Medicare               37 (18.5)          913 (45.6)                
     Commercial              7 ( 3.5)          510 (25.5)                
     Medicaid              125 (62.5)          501 (25.0)                
     Uninsured              31 (15.5)           76 ( 3.8)                
  nincome (mean (SD)) 26927.50 (11642.18) 37992.55 (20854.20) <0.001     
  nhsgrad (mean (SD))    77.74 (11.50)       82.23 (11.35)    <0.001     
  cleve = 1 (%)            181 (90.5)         1109 (55.5)     <0.001     
  sbp (mean (SD))       135.88 (20.60)      132.44 (16.17)     0.005     
  dbp (mean (SD))        82.47 (9.08)        72.41 (11.94)    <0.001     
  ldl (mean (SD))        94.34 (38.62)       97.22 (36.66)     0.293     
  visits (mean (SD))      4.81 (2.62)         3.69 (1.78)     <0.001     
  tobacco (%)                                                 <0.001     
     Former                 46 (23.0)          775 (38.8)                
     Current               108 (54.0)          445 (22.2)                
     Never                  46 (23.0)          780 (39.0)                
  statin = 1 (%)           154 (77.0)         1603 (80.2)      0.334     
  ace_arb = 1 (%)          160 (80.0)         1531 (76.6)      0.310     
  betab = 1 (%)             53 (26.5)          781 (39.1)      0.001     
  depr_dx = 1 (%)           53 (26.5)          753 (37.6)      0.002     
  eyeex = 1 (%)            104 (52.0)         1225 (61.3)      0.013     
  pneumo = 1 (%)           116 (58.0)         1766 (88.3)     <0.001     
  bmi (mean (SD))        32.83 (8.22)        35.09 (8.17)     <0.001     
  bp_good = 1 (%)          117 (58.5)         1431 (71.6)     <0.001     

2 Propensity for Exposure

We’ll fit a logistic regression model to predict propensity for exposure A (as compared to B), on the basis of these 18 covariates:

  • age, race, hisp, sex, insur, nincome, nhsgrad,
  • cleve, a1c, ldl, visits, tobacco, statin,
  • ace_arb, betab, depr_dx, eyeex, pneumo

Practically, we might well fit something more complex than a simple model with main effects, but that’s what we’ll limit ourselves to in this setting. Note that we’re not including any direct information on either of our outcomes, or the elements that go into them. In practical work, we might fit different propensity scores for each outcome, but we’re keeping things simple here.

2.1 Fitting a Propensity Model

We’ll use the f.build tool from the cobalt package here.

Code
dm2200 <- dm2200 |>
    mutate(treat = as.logical(exposure == "A"))

covs_1 <- dm2200 |>
    select(age, race, hisp, sex, insur, nincome,
           nhsgrad, cleve, a1c, ldl, visits, tobacco,
           statin, ace_arb, betab, depr_dx, eyeex, pneumo)

prop_model <- glm(f.build("treat", covs_1), data = dm2200,
                  family = binomial)

2.2 Summarizing the Propensity Model

Code
model_parameters(prop_model, exponentiate = TRUE, ci = 0.95)
Parameter          | Odds Ratio |       SE |        95% CI |     z |      p
---------------------------------------------------------------------------
(Intercept)        |       0.01 |     0.01 | [0.00,  0.13] | -3.56 | < .001
age                |       1.01 |     0.01 | [0.99,  1.04] |  1.03 | 0.305 
race [White]       |       0.32 |     0.08 | [0.19,  0.52] | -4.44 | < .001
race [Other]       |       0.12 |     0.14 | [0.01,  0.79] | -1.86 | 0.063 
race [Asian]       |       1.05 |     1.06 | [0.10,  5.97] |  0.05 | 0.962 
hisp               |       1.04 |     0.66 | [0.26,  3.26] |  0.06 | 0.949 
sex [M]            |       2.30 |     0.44 | [1.59,  3.35] |  4.37 | < .001
insur [Commercial] |       0.38 |     0.17 | [0.15,  0.88] | -2.12 | 0.034 
insur [Medicaid]   |       4.31 |     1.11 | [2.62,  7.22] |  5.66 | < .001
insur [Uninsured]  |       9.83 |     3.41 | [4.98, 19.47] |  6.59 | < .001
nincome            |       1.00 | 9.07e-06 | [1.00,  1.00] |  0.25 | 0.803 
nhsgrad            |       1.00 | 9.53e-03 | [0.98,  1.02] | -0.42 | 0.677 
cleve              |       6.12 |     1.96 | [3.35, 11.80] |  5.65 | < .001
a1c                |       0.96 |     0.05 | [0.88,  1.05] | -0.82 | 0.412 
ldl                |       1.00 | 2.58e-03 | [0.99,  1.00] | -1.62 | 0.104 
visits             |       1.31 |     0.05 | [1.20,  1.42] |  6.47 | < .001
tobacco [Current]  |       3.11 |     0.71 | [2.00,  4.89] |  5.00 | < .001
tobacco [Never]    |       1.12 |     0.28 | [0.68,  1.83] |  0.44 | 0.657 
statin             |       1.01 |     0.23 | [0.65,  1.59] |  0.03 | 0.978 
ace arb            |       1.47 |     0.34 | [0.95,  2.32] |  1.68 | 0.093 
betab              |       0.77 |     0.16 | [0.51,  1.14] | -1.28 | 0.202 
depr dx            |       0.51 |     0.11 | [0.34,  0.77] | -3.20 | 0.001 
eyeex              |       0.72 |     0.14 | [0.50,  1.05] | -1.70 | 0.089 
pneumo             |       0.20 |     0.04 | [0.13,  0.30] | -7.65 | < .001

Uncertainty intervals (profile-likelihood) and p-values (two-tailed)
  computed using a Wald z-distribution approximation.
Code
tidy(prop_model, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) |>
  select(term, estimate, std.error, conf.low, conf.high, p.value) |>
  gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 6, color = "blue")
term estimate std.error conf.low conf.high p.value
(Intercept) 0.011 1.268 0.001 0.128 0.000
age 1.013 0.012 0.989 1.037 0.305
raceWhite 0.323 0.255 0.193 0.524 0.000
raceOther 0.119 1.143 0.006 0.788 0.063
raceAsian 1.049 1.011 0.105 5.972 0.962
hisp 1.041 0.633 0.260 3.263 0.949
sexM 2.296 0.190 1.586 3.346 0.000
insurCommercial 0.383 0.452 0.146 0.881 0.034
insurMedicaid 4.312 0.258 2.622 7.223 0.000
insurUninsured 9.831 0.347 4.985 19.467 0.000
nincome 1.000 0.000 1.000 1.000 0.803
nhsgrad 0.996 0.010 0.978 1.015 0.677
cleve 6.121 0.320 3.347 11.802 0.000
a1c 0.962 0.047 0.877 1.053 0.412
ldl 0.996 0.003 0.991 1.001 0.104
visits 1.307 0.041 1.205 1.418 0.000
tobaccoCurrent 3.108 0.227 2.004 4.886 0.000
tobaccoNever 1.117 0.250 0.684 1.826 0.657
statin 1.006 0.228 0.648 1.586 0.978
ace_arb 1.468 0.229 0.946 2.325 0.093
betab 0.771 0.204 0.514 1.145 0.202
depr_dx 0.515 0.208 0.340 0.769 0.001
eyeex 0.723 0.191 0.497 1.051 0.089
pneumo 0.200 0.211 0.132 0.301 0.000
Code
glance(prop_model) |> gt()
null.deviance df.null logLik AIC BIC deviance df.residual nobs
1340.399 2199 -415.85 879.7 1016.409 831.7 2176 2200
Code
model_performance(prop_model)
# Indices of model performance

AIC     |    AICc |      BIC | Tjur's R2 |  RMSE | Sigma | Log_loss | Score_log
-------------------------------------------------------------------------------
879.700 | 880.252 | 1016.409 |     0.329 | 0.237 | 1.000 |    0.189 |      -Inf

AIC     | Score_spherical |   PCP
---------------------------------
879.700 |           0.012 | 0.889

2.3 Storing the Propensity Scores

Code
dm2200 <- dm2200 |>
    mutate(ps = prop_model$fitted,
           linps = prop_model$linear.predictors)

2.4 Comparing Propensity Scores

Code
ggplot(dm2200, aes(x = exposure, y = ps)) +
  geom_boxplot(width = 0.2, col = "red", fill = "yellow") +
  geom_jitter(width = 0.4, alpha = 0.2, col = "blue") +
  stat_summary(fun = mean, geom = "point", col = "black", size = 2.5) +
  labs(x = "Exposure Group", y = "Propensity Score (for Exposure A)")

Code
ggplot(dm2200, aes(x = exposure, y = linps)) +
  geom_violin(fill = "azure") +
  geom_boxplot(width = 0.3) + 
  stat_summary(fun = mean, geom = "point", col = "red", size = 2.5) +
  labs(x = "Exposure Group", y = "Linear Propensity Score")

3 match_1 1:1 greedy matching without replacement with the Matching package

We’re going to match on the linear propensity score, and define our treat (treatment) as occurring when exposure is A.

Code
match_1 <- Match(Tr = dm2200$treat, X = dm2200$linps, 
                 M = 1, replace = FALSE, ties = FALSE,
                 estimand = "ATT")

summary(match_1)

Estimate...  0 
SE.........  0 
T-stat.....  NaN 
p.val......  NA 

Original number of observations..............  2200 
Original number of treated obs...............  200 
Matched number of observations...............  200 
Matched number of observations  (unweighted).  200 

3.1 ATT vs. ATE vs. ATC estimates

Note that in each of the matched samples we build, we’ll focus on ATT estimates (average treated effect on the treated) rather than ATE estimates. This means that in our matching we’re trying to mirror the population represented by the “treated” sample we observed.

  • To obtain ATE estimates rather than ATT with the Match function from the Matching package, use estimand = "ATE" in the process of developing the matched sample.
  • To obtain ATC estimates (average treatment effect on the controls), use estimand = "ATC".

I encourage the use of ATT estimates in your projects, where possible. I suggest also that you define the “treated” group (the one that the propensity score is estimating) to be the smaller of the two groups you have, to facilitate this approach. If you estimate ATE or ATC instead of ATT, of course, you are answering a different question than what ATT resolves.

3.2 Obtaining the Matched Sample

Now, we build a new matched sample data frame in order to do some of the analyses to come. This will contain only the matched subjects.

Code
match1_matches <- factor(rep(match_1$index.treated, 2))
dm2200_matched1 <- cbind(match1_matches, 
                         dm2200[c(match_1$index.control, 
                                  match_1$index.treated),])

Some sanity checks:

Code
dm2200_matched1 |> count(exposure)
  exposure   n
1        A 200
2        B 200
Code
dm2200_matched1 |> head()
  match1_matches subject exposure age     race hisp sex    insur nincome
1              1  S-0236        B  74 Black_AA    0   M Medicare   28900
2              4  S-1650        B  48 Black_AA    0   M Medicaid   22100
3             17  S-0834        B  61 Black_AA    0   M Medicaid   31500
4             26  S-1232        B  69 Black_AA    0   F Medicare   21700
5             27  S-0678        B  36 Black_AA    0   F Medicaid   16600
6             37  S-1007        B  49 Black_AA    0   M Medicaid   29800
  nhsgrad cleve height_cm weight_kg  bmi  a1c sbp dbp ldl visits tobacco statin
1      75     1       178       105 33.1  8.4 138  79  81      2  Former      1
2      80     1       172       134 45.3  5.1  98  66 121      3   Never      1
3      80     1       183        94 28.1  6.5 127  79 137      3 Current      1
4      77     1       168        65 23.0  5.9 101  70  76      2 Current      1
5      79     1       160        80 31.3 14.5 142  95 157      5 Current      1
6      82     1       170        74 25.6 12.1 138  80 126      3 Current      1
  ace_arb betab depr_dx eyeex pneumo bp_good treat         ps      linps
1       0     0       0     1      1       1 FALSE 0.03787384 -3.2348848
2       1     0       0     0      0       1 FALSE 0.62855155  0.5260079
3       1     0       1     0      1       1 FALSE 0.33963493 -0.6649215
4       1     0       0     1      1       1 FALSE 0.07464006 -2.5175054
5       0     1       0     1      0       0 FALSE 0.40466703 -0.3860563
6       1     0       0     0      1       1 FALSE 0.41799059 -0.3310277

3.3 Checking Covariate Balance for our 1:1 Greedy Match

3.3.1 Using bal.tab to obtain a balance table

Code
covs_1plus <- dm2200 |>
    select(age, race, hisp, sex, insur, nincome,
           nhsgrad, cleve, a1c, ldl, visits, tobacco,
           statin, ace_arb, betab, depr_dx, eyeex, pneumo,
           ps, linps)

bal1 <- bal.tab(match_1,
                treat = dm2200$exposure,
                covs = covs_1plus, quick = FALSE,
                data = dm2200, stats = c("m", "v"),
                un = TRUE, disp.v.ratio = TRUE)
bal1
Balance Measures
                    Type Diff.Un V.Ratio.Un Diff.Adj V.Ratio.Adj
age              Contin.  0.4076     1.3195  -0.0209      1.3292
race_Black_AA     Binary -0.2975          .  -0.0050           .
race_White        Binary  0.2630          .   0.0100           .
race_Other        Binary  0.0320          .  -0.0050           .
race_Asian        Binary  0.0025          .   0.0000           .
hisp              Binary  0.0320          .  -0.0050           .
sex_M             Binary -0.2190          .  -0.0400           .
insur_Medicare    Binary  0.2715          .   0.0400           .
insur_Commercial  Binary  0.2200          .  -0.0050           .
insur_Medicaid    Binary -0.3745          .  -0.0050           .
insur_Uninsured   Binary -0.1170          .  -0.0300           .
nincome          Contin.  0.5306     3.2086  -0.0270      1.3512
nhsgrad          Contin.  0.3958     0.9727  -0.0405      1.0954
cleve             Binary -0.3505          .   0.0250           .
a1c              Contin. -0.0419     0.8190  -0.0189      1.0321
ldl              Contin.  0.0783     0.9012  -0.0038      0.9641
visits           Contin. -0.6304     0.4602  -0.3063      0.7620
tobacco_Former    Binary  0.1575          .   0.0350           .
tobacco_Current   Binary -0.3175          .  -0.0100           .
tobacco_Never     Binary  0.1600          .  -0.0250           .
statin            Binary  0.0315          .  -0.0350           .
ace_arb           Binary -0.0345          .  -0.0450           .
betab             Binary  0.1255          .   0.0050           .
depr_dx           Binary  0.1115          .   0.0200           .
eyeex             Binary  0.0925          .   0.0650           .
pneumo            Binary  0.3030          .   0.0850           .
ps               Contin. -2.9189     0.1576  -0.7387      0.4791
linps            Contin. -1.7842     1.3395  -0.2127      0.5378

Sample sizes
            A    B
All       200 2000
Matched   200  200
Unmatched   0 1800

3.3.2 Checking Rubin’s Rules 1 and 2

We’ll build a little table of the Rubin’s Rules (1 and 2) before and after our match_1 is applied.

Code
covs_for_rubin <- dm2200 |>
    select(linps)

rubin_m1 <- bal.tab(match_1,
                treat = dm2200$treat,
                covs = covs_for_rubin,
                data = dm2200, stats = c("m", "v"),
                un = TRUE, disp.v.ratio = TRUE)[1]

rubin_report_m1 <- tibble(
    status = c("Rule1", "Rule2"),
    Unmatched = c(rubin_m1$Balance$Diff.Un,
                  rubin_m1$Balance$V.Ratio.Un),
    Matched = c(rubin_m1$Balance$Diff.Adj,
               rubin_m1$Balance$V.Ratio.Adj))

rubin_report_m1 |> 
  gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 2, color = "blue")
status Unmatched Matched
Rule1 2.065 0.246
Rule2 0.747 1.859
  • The Rule 1 results tell us about the standardized differences expressed as proportions, so we’d like to be certain that our results are as close to zero as possible, and definitely below 0.5 in absolute value.
    • Multiply these by 100 to describe them as percentages, adjusting the cutoff to below 50 in absolute value.
    • Here, before matching we have a bias of 206.5005338%, and this is reduced to 24.620512% after 1:1 greedy matching.
  • The Rule 2 results tell us about the variance ratio of the linear propensity scores. We want this to be within (0.5, 2) and ideally within (0.8, 1.25).
    • Here, before matching we have a variance ratio of 74.6560482%, and this becomes 185.9473698% after 1:1 greedy matching.

3.3.3 Using bal.plot from cobalt

We can look at any particular variable with this approach, for example, age:

Code
bal.plot(match_1,
         treat = dm2200$exposure,
         covs = covs_1plus,
         var.name = "age", 
         which = "both",
         sample.names = 
             c("Unmatched Sample", "Matched Sample"))

We could also look at the propensity scores in each group, perhaps in mirrored histograms, with …

Code
bal.plot(match_1,
         treat = dm2200$exposure,
         covs = covs_1plus,
         var.name = "ps", 
         which = "both",
         sample.names = 
             c("Unmatched Sample", "Matched Sample"),
         type = "histogram", mirror = TRUE)

Can we look at a categorical variable this way?

Code
bal.plot(match_1,
         treat = dm2200$exposure,
         covs = covs_1plus,
         var.name = "insur", 
         which = "both",
         sample.names = 
             c("Unmatched Sample", "Matched Sample"))

3.3.4 Using love.plot to look at Standardized Differences

Code
love.plot(bal1, 
          threshold = .1, size = 3,
          var.order = "unadjusted",
          stats = "mean.diffs",
          stars = "raw",
          sample.names = c("Unmatched", "Matched"),
          title = "Love Plot for our 1:1 Match") +
    labs(caption = "* indicates raw mean differences (for binary variables)")

Code
love.plot(bal1, 
          threshold = .1, size = 3,
          var.order = "unadjusted",
          stats = "mean.diffs",
          stars = "raw",
          abs = TRUE,
          sample.names = c("Unmatched", "Matched"),
          title = "Absolute Differences for 1:1 Match") +
    labs(caption = "* indicates raw mean differences (for binary variables)")

3.3.5 Using love.plot to look at Variance Ratios

Note that this will only include the variables (and summaries like ps and linps) that describe quantities. Categorical variables are dropped.

Code
love.plot(bal1, 
          threshold = .5, size = 3,
          stats = "variance.ratios",
          sample.names = c("Unmatched", "Matched"),
          title = "Variance Ratios for our 1:1 Match") 

4 match_2 1:2 greedy matching without replacement with the Matching package

Again, we’ll match on the linear propensity score, and define our treat (treatment) as occurring when exposure is A. The only difference will be that we’ll allow each subject with exposure A to be matched to exactly two subjects with exposure B.

Code
match_2 <- Match(Tr = dm2200$treat, X = dm2200$linps, 
                 M = 2, replace = FALSE, ties = FALSE,
                 estimand = "ATT")

summary(match_2)

Estimate...  0 
SE.........  0 
T-stat.....  NaN 
p.val......  NA 

Original number of observations..............  2200 
Original number of treated obs...............  200 
Matched number of observations...............  200 
Matched number of observations  (unweighted).  400 

Note that we now have 400 matched exposure “B” subjects in our matched sample.

4.1 Obtaining the Matched Sample

As before,

Code
match2_matches <- factor(rep(match_2$index.treated, 2))
dm2200_matched2 <- cbind(match2_matches, 
                         dm2200[c(match_2$index.control, 
                                  match_2$index.treated),])

How many unique subjects are in our matched sample?

Code
n_distinct(dm2200_matched2$subject)
[1] 600

This match repeats each exposure A subject twice, to match up with the 400 exposure B subjects.

Code
dm2200_matched2 |> count(exposure) 
  exposure   n
1        A 400
2        B 400
Code
dm2200_matched2 |> count(subject, exposure) |> head()
  subject exposure n
1  S-0001        A 2
2  S-0004        A 2
3  S-0007        B 1
4  S-0008        B 1
5  S-0014        B 1
6  S-0017        A 2

4.2 Checking Covariate Balance for our 1:2 Greedy Match

4.2.1 Using bal.tab to obtain a balance table

Code
covs_2plus <- dm2200 |>
    select(age, race, hisp, sex, insur, nincome,
           nhsgrad, cleve, a1c, ldl, visits, tobacco,
           statin, ace_arb, betab, depr_dx, eyeex, pneumo,
           ps, linps)

bal2 <- bal.tab(match_2,
                treat = dm2200$exposure,
                covs = covs_2plus, quick = FALSE,
                data = dm2200, stats = c("m", "v"),
                un = TRUE, disp.v.ratio = TRUE)
bal2
Balance Measures
                    Type Diff.Un V.Ratio.Un Diff.Adj V.Ratio.Adj
age              Contin.  0.4076     1.3195   0.0804      1.2519
race_Black_AA     Binary -0.2975          .  -0.0175           .
race_White        Binary  0.2630          .   0.0175           .
race_Other        Binary  0.0320          .  -0.0025           .
race_Asian        Binary  0.0025          .   0.0025           .
hisp              Binary  0.0320          .   0.0025           .
sex_M             Binary -0.2190          .  -0.0750           .
insur_Medicare    Binary  0.2715          .   0.0925           .
insur_Commercial  Binary  0.2200          .  -0.0050           .
insur_Medicaid    Binary -0.3745          .  -0.0400           .
insur_Uninsured   Binary -0.1170          .  -0.0475           .
nincome          Contin.  0.5306     3.2086  -0.0002      1.4755
nhsgrad          Contin.  0.3958     0.9727   0.0170      1.0717
cleve             Binary -0.3505          .  -0.0175           .
a1c              Contin. -0.0419     0.8190  -0.0307      0.8214
ldl              Contin.  0.0783     0.9012   0.0126      0.9109
visits           Contin. -0.6304     0.4602  -0.3204      0.7619
tobacco_Former    Binary  0.1575          .   0.0600           .
tobacco_Current   Binary -0.3175          .  -0.1025           .
tobacco_Never     Binary  0.1600          .   0.0425           .
statin            Binary  0.0315          .   0.0175           .
ace_arb           Binary -0.0345          .  -0.0150           .
betab             Binary  0.1255          .   0.0300           .
depr_dx           Binary  0.1115          .   0.0525           .
eyeex             Binary  0.0925          .   0.0425           .
pneumo            Binary  0.3030          .   0.1525           .
ps               Contin. -2.9189     0.1576  -1.4656      0.3428
linps            Contin. -1.7842     1.3395  -0.4322      0.4092

Sample sizes
            A    B
All       200 2000
Matched   200  400
Unmatched   0 1600

4.2.2 Checking Rubin’s Rules 1 and 2

We’ll build a little table of the Rubin’s Rules (1 and 2) before and after our 1:2 greedy match_2 is applied, and compare these to the results we found in match_1 (the 1:1 match).

Code
covs_for_rubin <- dm2200 |>
    select(linps)

rubin_m2 <- bal.tab(match_2,
                treat = dm2200$treat,
                covs = covs_for_rubin, 
                data = dm2200, stats = c("m", "v"),
                un = TRUE, disp.v.ratio = TRUE)[1]

rubin_report_m12 <- tibble(
    status = c("Rule1", "Rule2"),
    Unmatched = c(rubin_m2$Balance$Diff.Un,
                  rubin_m2$Balance$V.Ratio.Un),
    Match1 = c(rubin_m1$Balance$Diff.Adj,
               rubin_m1$Balance$V.Ratio.Adj),
    Match2 = c(rubin_m2$Balance$Diff.Adj,
               rubin_m2$Balance$V.Ratio.Adj))

rubin_report_m12 |> 
  gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 2, color = "blue")
status Unmatched Match1 Match2
Rule1 2.065 0.246 0.500
Rule2 0.747 1.859 2.444
  • Again, we’d like to see Rule 1 as close to zero as possible, and definitely below 0.5 in absolute value. Unsurprisingly, when we have to match two exposure B subjects to each exposure A subject, we don’t get matches that are as close.
  • The Rule 2 results tell us about the variance ratio of the linear propensity scores. We want this to be within (0.5, 2) and ideally within (0.8, 1.25). Again, here the results are a bit disappointing in comparison to what we saw in our 1:1 match.

4.2.3 Using bal.plot from cobalt

Looking at the propensity scores in each group, perhaps in mirrored histograms, we have …

Code
bal.plot(match_2,
         treat = dm2200$exposure,
         covs = covs_2plus,
         var.name = "ps", 
         which = "both",
         sample.names = 
             c("Unmatched Sample", "match_2 Sample"),
         type = "histogram", mirror = TRUE)

4.2.4 Using love.plot to look at Standardized Differences

Code
love.plot(bal2, 
          threshold = .1, size = 3,
          var.order = "unadjusted",
          stats = "mean.diffs",
          stars = "raw",
          sample.names = c("Unmatched", "Matched"),
          title = "Love Plot for our 1:2 Match") +
    labs(caption = "* indicates raw mean differences (for binary variables)")

4.2.5 Using love.plot to look at Variance Ratios

Again, the categorical variables are dropped.

Code
love.plot(bal2, 
          threshold = .5, size = 3,
          stats = "variance.ratios",
          sample.names = c("Unmatched", "Matched"),
          title = "Variance Ratios for our 1:2 Match") 

5 match_3 1:3 matching, with replacement with the Matching package

Again, we’ll match on the linear propensity score, and define our treat (treatment) as occurring when exposure is A. But now, we’ll match with replacement (which means that multiple subject with exposure A can be matched to the same subject with exposure B) and we’ll also match each subject with exposure A to be matched to exactly three subjects with exposure B.

Code
match_3 <- Match(Tr = dm2200$treat, X = dm2200$linps, 
                 M = 3, replace = TRUE, ties = FALSE,
                 estimand = "ATT")

summary(match_3)

Estimate...  0 
SE.........  0 
T-stat.....  NaN 
p.val......  NA 

Original number of observations..............  2200 
Original number of treated obs...............  200 
Matched number of observations...............  200 
Matched number of observations  (unweighted).  600 

Note that we now have 600 matched exposure “B” subjects in our matched sample.

5.1 Obtaining the Matched Sample

As before,

Code
match3_matches <- factor(rep(match_3$index.treated, 2))
dm2200_matched3 <- cbind(match3_matches, 
                         dm2200[c(match_3$index.control, 
                                  match_3$index.treated),])

If this was being done without replacement, this would repeat each exposure A subject three times, to match up with the 600 exposure B subjects. But here, we have a different result.

How many unique subjects are in our matched sample?

Code
n_distinct(dm2200_matched3$subject)
[1] 504

How many of those are in Exposure A?

Code
temp <- dm2200_matched3 |> filter(exposure == "A") 
n_distinct(temp$subject)
[1] 200

How many of those are in Exposure B?

Code
temp <- dm2200_matched3 |> filter(exposure == "B")
n_distinct(temp$subject)
[1] 304

Among those exposure A subjects, how many times were they used in the matches?

Code
dm2200_matched3 |> filter(exposure == "A") |> 
    count(subject) |>
    tabyl(n)
 n n_n percent
 3 200       1

Among those exposure B subjects, how many times were they used in the matches?

Code
dm2200_matched3 |> filter(exposure == "B") |> 
    count(subject) |>
    tabyl(n)
  n n_n     percent
  1 198 0.651315789
  2  62 0.203947368
  3  19 0.062500000
  4   7 0.023026316
  5   4 0.013157895
  6   2 0.006578947
  7   1 0.003289474
  8   1 0.003289474
  9   2 0.006578947
 10   2 0.006578947
 13   2 0.006578947
 15   1 0.003289474
 20   1 0.003289474
 23   1 0.003289474
 24   1 0.003289474

5.2 Checking Covariate Balance for our 1:3 Match

5.2.1 Using bal.tab to obtain a balance table

Code
covs_3plus <- dm2200 |>
    select(age, race, hisp, sex, insur, nincome,
           nhsgrad, cleve, a1c, ldl, visits, tobacco,
           statin, ace_arb, betab, depr_dx, eyeex, pneumo,
           ps, linps)

bal3 <- bal.tab(match_3,
                treat = dm2200$exposure,
                covs = covs_3plus, quick = FALSE,
                data = dm2200, stats = c("m", "v"),
                un = TRUE, disp.v.ratio = TRUE)
bal3
Balance Measures
                    Type Diff.Un V.Ratio.Un Diff.Adj V.Ratio.Adj
age              Contin.  0.4076     1.3195  -0.0845      1.2582
race_Black_AA     Binary -0.2975          .   0.0300           .
race_White        Binary  0.2630          .  -0.0250           .
race_Other        Binary  0.0320          .  -0.0033           .
race_Asian        Binary  0.0025          .  -0.0017           .
hisp              Binary  0.0320          .  -0.0083           .
sex_M             Binary -0.2190          .   0.0133           .
insur_Medicare    Binary  0.2715          .   0.0100           .
insur_Commercial  Binary  0.2200          .  -0.0033           .
insur_Medicaid    Binary -0.3745          .  -0.0283           .
insur_Uninsured   Binary -0.1170          .   0.0217           .
nincome          Contin.  0.5306     3.2086   0.0254      1.3567
nhsgrad          Contin.  0.3958     0.9727  -0.0179      0.8844
cleve             Binary -0.3505          .  -0.0033           .
a1c              Contin. -0.0419     0.8190  -0.0539      1.0273
ldl              Contin.  0.0783     0.9012   0.1121      1.0653
visits           Contin. -0.6304     0.4602  -0.0646      0.8624
tobacco_Former    Binary  0.1575          .   0.0450           .
tobacco_Current   Binary -0.3175          .  -0.0100           .
tobacco_Never     Binary  0.1600          .  -0.0350           .
statin            Binary  0.0315          .  -0.0150           .
ace_arb           Binary -0.0345          .  -0.0083           .
betab             Binary  0.1255          .  -0.0250           .
depr_dx           Binary  0.1115          .   0.0167           .
eyeex             Binary  0.0925          .   0.0017           .
pneumo            Binary  0.3030          .   0.0167           .
ps               Contin. -2.9189     0.1576  -0.0386      0.9512
linps            Contin. -1.7842     1.3395  -0.0268      0.8894

Sample sizes
                       A       B
All                  200 2000.  
Matched (ESS)        200  104.53
Matched (Unweighted) 200  304.  
Unmatched              0 1696.  

5.2.2 Checking Rubin’s Rules 1 and 2

We’ll build a little table of the Rubin’s Rules (1 and 2) before and after our 1:2 greedy match_2 is applied, and compare these to the results we found in match_1 (the 1:1 match).

Code
covs_for_rubin <- dm2200 |>
    select(linps)

rubin_m3 <- bal.tab(match_3,
                treat = dm2200$treat,
                covs = covs_for_rubin, 
                data = dm2200, stats = c("m", "v"),
                un = TRUE, disp.v.ratio = TRUE)[1]

rubin_report_m123 <- tibble(
    status = c("Rule1", "Rule2"),
    Unmatched = c(rubin_m2$Balance$Diff.Un,
                  rubin_m2$Balance$V.Ratio.Un),
    Match1 = c(rubin_m1$Balance$Diff.Adj,
               rubin_m1$Balance$V.Ratio.Adj),
    Match2 = c(rubin_m2$Balance$Diff.Adj,
               rubin_m2$Balance$V.Ratio.Adj),
    Match3 = c(rubin_m3$Balance$Diff.Adj,
               rubin_m3$Balance$V.Ratio.Adj))


rubin_report_m123 |> 
  gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 2, color = "blue")
status Unmatched Match1 Match2 Match3
Rule1 2.065 0.246 0.500 0.031
Rule2 0.747 1.859 2.444 1.124
  • Again, we’d like to see Rule 1 results as close to zero as possible, and definitely below 0.5 in absolute value.
  • In Rule 2, we want the variance ratio of the linear propensity scores to be within (0.5, 2) and ideally within (0.8, 1.25).
  • It appears that (in these data) allowing the same exposure B subject to be used for multiple matches (matching with replacement) more than makes up for the fact that matching 3 exposure B’s for each exposure A (1:3 matching) is a tougher job than pair (1:1) matching, as seen in the results for Rubin’s Rule 1 and Rule 2.

5.2.3 Using bal.plot from cobalt

Looking at the propensity scores in each group, perhaps in mirrored histograms, we have …

Code
bal.plot(match_3,
         treat = dm2200$exposure,
         covs = covs_3plus,
         var.name = "ps", 
         which = "both",
         sample.names = 
             c("Unmatched Sample", "match_3 Sample"),
         type = "histogram", mirror = TRUE)

5.2.4 Using love.plot to look at Standardized Differences

Code
love.plot(bal3, 
          threshold = .1, size = 3,
          var.order = "unadjusted",
          stats = "mean.diffs",
          stars = "raw",
          abs = TRUE,
          sample.names = c("Unmatched", "Matched"),
          title = "Love Plot of |Mean Differences| for our 1:3 Match") +
    labs(caption = "* indicates raw mean differences (for binary variables)")

5.2.5 Using love.plot to look at Variance Ratios

Again, the categorical variables are dropped.

Code
love.plot(bal3, 
          threshold = .5, size = 3,
          stats = "variance.ratios",
          sample.names = c("Unmatched", "Matched"),
          title = "Variance Ratios for our 1:3 Match") 

6 match_4 Caliper Matching (1:1 without replacement) with the Matching package

The Match function in the Matching package allows you to specify a caliper. From the Matching help file:

  • A caliper is the maximum acceptable distance (on a covariate) which we are willing to accept in any match. Observations for which we cannot find a match within the caliper are dropped.Dropping observations generally changes the quantity being estimated.
  • The caliper is interpreted to be in standardized units. For example, caliper=.25 means that all matches not equal to or within .25 standard deviations of each covariate in X are dropped, and not matched.
    • If a scalar caliper is provided to the caliper setting in the Match function, this caliper is used for all covariates in X.
    • If a vector of calipers is provided, a caliper value should be provided for each covariate in X.

We’ll again perform a 1:1 match without replacement, but now we’ll do so while only accepting matches where the linear propensity score of each match is within 0.2 standard deviations of the linear PS.

Code
match_4 <- Match(Tr = dm2200$treat, X = dm2200$linps, 
                 M = 1, replace = FALSE, ties = FALSE,
                 caliper = 0.2, estimand = "ATT")

summary(match_4)

Estimate...  0 
SE.........  0 
T-stat.....  NaN 
p.val......  NA 

Original number of observations..............  2200 
Original number of treated obs...............  200 
Matched number of observations...............  162 
Matched number of observations  (unweighted).  162 

Caliper (SDs)........................................   0.2 
Number of obs dropped by 'exact' or 'caliper'  38 

Note that we have now dropped 38 of the exposure “A” subjects, and reduced our sample to the 168 remaining exposure “A” subjects, who are paired with 162 unique matched exposure “B” subjects in our matched sample.

6.1 Obtaining the Matched Sample

As before,

Code
match4_matches <- factor(rep(match_4$index.treated, 2))
dm2200_matched4 <- cbind(match4_matches, 
                         dm2200[c(match_4$index.control, 
                                  match_4$index.treated),])

How many unique subjects are in our matched sample?

Code
n_distinct(dm2200_matched4$subject)
[1] 324

This match includes 162 pairs so 324 subjects, since we’ve done matching without replacement.

Code
dm2200_matched4 |> count(exposure)
  exposure   n
1        A 162
2        B 162

6.2 Checking Covariate Balance for our 1:1 Caliper Match

6.2.1 Using bal.tab to obtain a balance table

Code
covs_4plus <- dm2200 |>
    select(age, race, hisp, sex, insur, nincome,
           nhsgrad, cleve, a1c, ldl, visits, tobacco,
           statin, ace_arb, betab, depr_dx, eyeex, pneumo,
           ps, linps)

bal4 <- bal.tab(match_4,
                treat = dm2200$exposure,
                covs = covs_4plus, quick = FALSE,
                data = dm2200, stats = c("m", "v"),
                un = TRUE, disp.v.ratio = TRUE)
bal4
Balance Measures
                    Type Diff.Un V.Ratio.Un Diff.Adj V.Ratio.Adj
age              Contin.  0.4076     1.3195  -0.0069      1.1267
race_Black_AA     Binary -0.2975          .  -0.0062           .
race_White        Binary  0.2630          .   0.0123           .
race_Other        Binary  0.0320          .   0.0000           .
race_Asian        Binary  0.0025          .  -0.0062           .
hisp              Binary  0.0320          .  -0.0062           .
sex_M             Binary -0.2190          .   0.0000           .
insur_Medicare    Binary  0.2715          .   0.0247           .
insur_Commercial  Binary  0.2200          .  -0.0185           .
insur_Medicaid    Binary -0.3745          .  -0.0062           .
insur_Uninsured   Binary -0.1170          .   0.0000           .
nincome          Contin.  0.5306     3.2086  -0.0071      1.5352
nhsgrad          Contin.  0.3958     0.9727  -0.0501      1.2527
cleve             Binary -0.3505          .   0.0062           .
a1c              Contin. -0.0419     0.8190  -0.0732      0.8675
ldl              Contin.  0.0783     0.9012   0.0115      0.9169
visits           Contin. -0.6304     0.4602  -0.1353      1.1436
tobacco_Former    Binary  0.1575          .   0.0062           .
tobacco_Current   Binary -0.3175          .   0.0370           .
tobacco_Never     Binary  0.1600          .  -0.0432           .
statin            Binary  0.0315          .   0.0000           .
ace_arb           Binary -0.0345          .  -0.0309           .
betab             Binary  0.1255          .   0.0679           .
depr_dx           Binary  0.1115          .  -0.0432           .
eyeex             Binary  0.0925          .  -0.0123           .
pneumo            Binary  0.3030          .   0.0000           .
ps               Contin. -2.9189     0.1576  -0.0309      0.9599
linps            Contin. -1.7842     1.3395  -0.0073      0.9775

Sample sizes
            A    B
All       200 2000
Matched   162  162
Unmatched   0 1838
Discarded  38    0

6.2.2 Checking Rubin’s Rules 1 and 2

We’ll build a little table of the Rubin’s Rules (1 and 2) before and after our 1:2 greedy match_4 is applied, and compare these to the results we found in match_1 (the 1:1 match).

Code
covs_for_rubin <- dm2200 |>
    select(linps)

rubin_m4 <- bal.tab(match_4,
                treat = dm2200$treat,
                covs = covs_for_rubin, 
                data = dm2200, stats = c("m", "v"),
                un = TRUE, disp.v.ratio = TRUE)[1]

rubin_report_m1234 <- tibble(
    status = c("Rule1", "Rule2"),
    Unmatched = c(rubin_m2$Balance$Diff.Un,
                  rubin_m2$Balance$V.Ratio.Un),
    Match1 = c(rubin_m1$Balance$Diff.Adj,
               rubin_m1$Balance$V.Ratio.Adj),
    Match2 = c(rubin_m2$Balance$Diff.Adj,
               rubin_m2$Balance$V.Ratio.Adj),
    Match3 = c(rubin_m3$Balance$Diff.Adj,
               rubin_m3$Balance$V.Ratio.Adj),
    Match4 = c(rubin_m4$Balance$Diff.Adj,
               rubin_m4$Balance$V.Ratio.Adj))

rubin_report_m1234 |> gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 2, color = "blue")
status Unmatched Match1 Match2 Match3 Match4
Rule1 2.065 0.246 0.500 0.031 0.008
Rule2 0.747 1.859 2.444 1.124 1.023
  • This approach produces an exceptionally strong match in terms of balance, with Rubin’s Rule 1 being very close to 0, and Rule 2 being very close to 1.
  • Unfortunately, we’ve only done this by dropping the 38 “hardest to match” subjects receiving exposure “A”.

6.2.3 Using bal.plot from cobalt

Looking at the propensity scores in each group, perhaps in mirrored histograms, we have …

Code
bal.plot(match_4,
         treat = dm2200$exposure,
         covs = covs_4plus,
         var.name = "ps", 
         which = "both",
         sample.names = 
             c("Unmatched Sample", "match_4 Sample"),
         type = "histogram", mirror = TRUE)

6.2.4 Using love.plot to look at Standardized Differences

Code
love.plot(bal4, 
          threshold = .1, size = 3,
          var.order = "unadjusted",
          stats = "mean.diffs",
          stars = "raw",
          sample.names = c("Unmatched", "Matched"),
          title = "Love Plot for our 1:1 Caliper Match") +
    labs(caption = "* indicates raw mean differences (for binary variables)")

6.2.5 Using love.plot to look at Variance Ratios

Again, the categorical variables are dropped.

Code
love.plot(bal4, 
          threshold = .5, size = 3,
          stats = "variance.ratios",
          sample.names = c("Unmatched", "Matched"),
          title = "Variance Ratios for our 1:1 Caliper Match") 

7 match_5 1:1 Nearest Neighbor Matching with Matchit

The MatchIt package implements the suggestions of Ho, Imai, King, and Stuart (2007) for improving parametric statistical models by pre-processing data with nonparametric matching methods. Read more about MatchIt at https://gking.harvard.edu/matchit.

With MatchIt, the matching is done using the matchit(treat ~ X, ...) function, where treat is the vector of treatment assignments and X are the covariates to be used in the matching.

We’ll start our exploration of the MatchIt approach to developing matches with nearest neighbor matching, which (quoting the MatchIt manual…)

… selects the r (default = 1, specified by the ratio option) best control matches for each individual in the treated group, using a distance measure specified by the distance option (default = logit). Matches are chosen for each treated unit one at a time, with the order specified by the m.order command (default=largest to smallest). At each matching step we choose the control unit that is not yet matched but is closest to the treated unit on the distance measure.

The full syntax (with default choices indicated) is

matchit(formula, data=NULL, discard=0, exact=FALSE, 
         replace=FALSE, ratio=1, model="logit", 
         reestimate=FALSE, nearest=TRUE, m.order=2, 
         caliper=0, calclosest=FALSE, mahvars=NULL, 
         subclass=0, sub.by="treat", counter=TRUE, 
         full=FALSE, full.options=list(), ...)

Here, we’ll match on the linear propensity score (by specifying the distance to use the logistic link with the linear propensity score via "linear.logit"), and we’ll perform 1:1 nearest neighbor matching without replacement using the default ordering (largest to smallest). Since we’ve already seen that greedy 1:1 matching without replacement doesn’t work well in this setting, we’re not expecting a strong result in terms of balance here, either.

Code
dm2200 <- dm2200 |>
    mutate(treat = as.logical(exposure == "A"))

covs_1 <- dm2200 |>
    select(age, race, hisp, sex, insur, nincome,
           nhsgrad, cleve, a1c, ldl, visits, tobacco,
           statin, ace_arb, betab, depr_dx, eyeex, pneumo)

match_5 <- matchit(f.build("treat", covs_1), data = dm2200,
                   distance = "glm", link = "linear.logit",
                   method = "nearest", 
                   ratio = 1, replace = FALSE)

match_5
A `matchit` object
 - method: 1:1 nearest neighbor matching without replacement
 - distance: Propensity score
             - estimated with logistic regression and linearized
 - number of obs.: 2200 (original), 400 (matched)
 - target estimand: ATT
 - covariates: age, race, hisp, sex, insur, nincome, nhsgrad, cleve, a1c, ldl, visits, tobacco, statin, ace_arb, betab, depr_dx, eyeex, pneumo

7.1 Obtaining the Matched Sample

There is just one tricky part to doing this in MatchIt. The main work can be done with a very simple command.

Code
dm2200_matched5 <- match.data(match_5)

This leaves only the job of creating a matching number, for which we have to develop some additional R code.

Code
# Thanks to Robert McDonald at Mayo
# https://lists.gking.harvard.edu/pipermail/matchit/2012-April/000458.html

len <- dim(dm2200)[1]
matchx <- rep(NA,len)
len2 <- length(match_5$match.matrix)
count <- 1
for(i in 1:len2){
    
    match1 <- match_5$match.matrix[i]
    match2 <- row.names(match_5$match.matrix)[i]
    
    if(!is.na(match1)){
    matchx[as.numeric(match1)] <- count
    matchx[as.numeric(match2)] <- count
    count <- count+1}
    
}

dm2200_matched5 <- dm2200_matched5 |>
    mutate(match5_matches = 
               matchx[as.numeric
                      (row.names(dm2200_matched5))]) |>
    mutate(match5_matches = factor(match5_matches))

How many unique subjects are in our matched sample?

Code
n_distinct(dm2200_matched5$subject)
[1] 400

This match includes 200 pairs so 400 subjects, as we’ve done matching without replacement.

Code
dm2200_matched5 |> count(exposure)
# A tibble: 2 × 2
  exposure     n
  <fct>    <int>
1 A          200
2 B          200

7.2 Checking Covariate Balance for our 1:1 Nearest Neighbor Match

7.2.1 Default Numerical Balance Summary from MatchIt

Code
summary(match_5)

Call:
matchit(formula = f.build("treat", covs_1), data = dm2200, method = "nearest", 
    distance = "glm", link = "linear.logit", replace = FALSE, 
    ratio = 1)

Summary of Balance for All Data:
                Means Treated Means Control Std. Mean Diff. Var. Ratio
distance              -0.6577       -4.0979          2.0650     0.7466
age                   54.8950       58.9000         -0.4682     0.7579
raceBlack_AA           0.8300        0.5325          0.7920          .
raceWhite              0.1550        0.4180         -0.7267          .
raceOther              0.0050        0.0370         -0.4537          .
raceAsian              0.0100        0.0125         -0.0251          .
hisp                   0.0200        0.0520         -0.2286          .
sexF                   0.3600        0.5790         -0.4562          .
sexM                   0.6400        0.4210          0.4563          .
insurMedicare          0.1850        0.4565         -0.6992          .
insurCommercial        0.0350        0.2550         -1.1971          .
insurMedicaid          0.6250        0.2505          0.7736          .
insurUninsured         0.1550        0.0380          0.3233          .
nincome            26927.5000    37992.5500         -0.9504     0.3117
nhsgrad               77.7350       82.2260         -0.3904     1.0281
cleve                  0.9050        0.5545          1.1954          .
a1c                    7.7655        7.6868          0.0379     1.2209
ldl                   94.3450       97.2160         -0.0743     1.1096
visits                 4.8100        3.6885          0.4276     2.1732
tobaccoFormer          0.2300        0.3875         -0.3743          .
tobaccoCurrent         0.5400        0.2225          0.6370          .
tobaccoNever           0.2300        0.3900         -0.3802          .
statin                 0.7700        0.8015         -0.0749          .
ace_arb                0.8000        0.7655          0.0863          .
betab                  0.2650        0.3905         -0.2844          .
depr_dx                0.2650        0.3765         -0.2526          .
eyeex                  0.5200        0.6125         -0.1851          .
pneumo                 0.5800        0.8830         -0.6139          .
                eCDF Mean eCDF Max
distance           0.4093   0.6595
age                0.0801   0.2350
raceBlack_AA       0.2975   0.2975
raceWhite          0.2630   0.2630
raceOther          0.0320   0.0320
raceAsian          0.0025   0.0025
hisp               0.0320   0.0320
sexF               0.2190   0.2190
sexM               0.2190   0.2190
insurMedicare      0.2715   0.2715
insurCommercial    0.2200   0.2200
insurMedicaid      0.3745   0.3745
insurUninsured     0.1170   0.1170
nincome            0.1442   0.3445
nhsgrad            0.0724   0.2295
cleve              0.3505   0.3505
a1c                0.0146   0.0465
ldl                0.0189   0.0670
visits             0.0743   0.2150
tobaccoFormer      0.1575   0.1575
tobaccoCurrent     0.3175   0.3175
tobaccoNever       0.1600   0.1600
statin             0.0315   0.0315
ace_arb            0.0345   0.0345
betab              0.1255   0.1255
depr_dx            0.1115   0.1115
eyeex              0.0925   0.0925
pneumo             0.3030   0.3030

Summary of Balance for Matched Data:
                Means Treated Means Control Std. Mean Diff. Var. Ratio
distance              -0.6577       -1.0679          0.2462     1.8590
age                   54.8950       54.2500          0.0754     0.8059
raceBlack_AA           0.8300        0.8300          0.0000          .
raceWhite              0.1550        0.1600         -0.0138          .
raceOther              0.0050        0.0000          0.0709          .
raceAsian              0.0100        0.0100          0.0000          .
hisp                   0.0200        0.0150          0.0357          .
sexF                   0.3600        0.4000         -0.0833          .
sexM                   0.6400        0.6000          0.0833          .
insurMedicare          0.1850        0.1800          0.0129          .
insurCommercial        0.0350        0.0400         -0.0272          .
insurMedicaid          0.6250        0.6450         -0.0413          .
insurUninsured         0.1550        0.1350          0.0553          .
nincome            26927.5000    25385.5000          0.1324     1.1096
nhsgrad               77.7350       77.4650          0.0235     1.0794
cleve                  0.9050        0.9050          0.0000          .
a1c                    7.7655        7.8350         -0.0334     0.9035
ldl                   94.3450       94.4900         -0.0038     1.0865
visits                 4.8100        4.3400          0.1792     1.3150
tobaccoFormer          0.2300        0.2650         -0.0832          .
tobaccoCurrent         0.5400        0.4950          0.0903          .
tobaccoNever           0.2300        0.2400         -0.0238          .
statin                 0.7700        0.7150          0.1307          .
ace_arb                0.8000        0.7650          0.0875          .
betab                  0.2650        0.3100         -0.1020          .
depr_dx                0.2650        0.2850         -0.0453          .
eyeex                  0.5200        0.5300         -0.0200          .
pneumo                 0.5800        0.6600         -0.1621          .
                eCDF Mean eCDF Max Std. Pair Dist.
distance           0.0120    0.230          0.2469
age                0.0198    0.060          1.1755
raceBlack_AA       0.0000    0.000          0.2700
raceWhite          0.0050    0.005          0.7322
raceOther          0.0050    0.005          0.0709
raceAsian          0.0000    0.000          0.0200
hisp               0.0050    0.005          0.2500
sexF               0.0400    0.040          0.9583
sexM               0.0400    0.040          0.9583
insurMedicare      0.0050    0.005          0.6310
insurCommercial    0.0050    0.005          0.2993
insurMedicaid      0.0200    0.020          0.8882
insurUninsured     0.0200    0.020          0.6355
nincome            0.0290    0.140          1.0179
nhsgrad            0.0183    0.050          1.0605
cleve              0.0000    0.000          0.1500
a1c                0.0161    0.070          1.0819
ldl                0.0099    0.045          1.0769
visits             0.0320    0.125          0.9456
tobaccoFormer      0.0350    0.035          0.8436
tobaccoCurrent     0.0450    0.045          0.8929
tobaccoNever       0.0100    0.010          0.8079
statin             0.0550    0.055          0.9861
ace_arb            0.0350    0.035          0.8625
betab              0.0450    0.045          0.8950
depr_dx            0.0200    0.020          0.8837
eyeex              0.0100    0.010          0.8807
pneumo             0.0800    0.080          0.8510

Sample Sizes:
          Control Treated
All          2000     200
Matched       200     200
Unmatched    1800       0
Discarded       0       0

7.2.2 Using bal.tab to obtain a balance table

Code
bal5 <- bal.tab(match_5, un = TRUE, disp.v.ratio = TRUE)

bal5
Balance Measures
                     Type Diff.Un V.Ratio.Un Diff.Adj V.Ratio.Adj
distance         Distance  2.0650     0.7466   0.2462      1.8590
age               Contin. -0.4682     0.7579   0.0754      0.8059
race_Black_AA      Binary  0.2975          .   0.0000           .
race_White         Binary -0.2630          .  -0.0050           .
race_Other         Binary -0.0320          .   0.0050           .
race_Asian         Binary -0.0025          .   0.0000           .
hisp               Binary -0.0320          .   0.0050           .
sex_M              Binary  0.2190          .   0.0400           .
insur_Medicare     Binary -0.2715          .   0.0050           .
insur_Commercial   Binary -0.2200          .  -0.0050           .
insur_Medicaid     Binary  0.3745          .  -0.0200           .
insur_Uninsured    Binary  0.1170          .   0.0200           .
nincome           Contin. -0.9504     0.3117   0.1324      1.1096
nhsgrad           Contin. -0.3904     1.0281   0.0235      1.0794
cleve              Binary  0.3505          .   0.0000           .
a1c               Contin.  0.0379     1.2209  -0.0334      0.9035
ldl               Contin. -0.0743     1.1096  -0.0038      1.0865
visits            Contin.  0.4276     2.1732   0.1792      1.3150
tobacco_Former     Binary -0.1575          .  -0.0350           .
tobacco_Current    Binary  0.3175          .   0.0450           .
tobacco_Never      Binary -0.1600          .  -0.0100           .
statin             Binary -0.0315          .   0.0550           .
ace_arb            Binary  0.0345          .   0.0350           .
betab              Binary -0.1255          .  -0.0450           .
depr_dx            Binary -0.1115          .  -0.0200           .
eyeex              Binary -0.0925          .  -0.0100           .
pneumo             Binary -0.3030          .  -0.0800           .

Sample sizes
          Control Treated
All          2000     200
Matched       200     200
Unmatched    1800       0

7.2.3 Checking Rubin’s Rules 1 and 2

We’ll build a little table of the Rubin’s Rules (1 and 2) results before and after match_5 is applied.

Code
rubin_report_m5 <- tibble(
    status = c("Rule1", "Rule2"),
    Unmatched = c(bal5$Balance$Diff.Un[1],
                  bal5$Balance$V.Ratio.Un[1]),
    Match5 = c(bal5$Balance$Diff.Adj[1],
               bal5$Balance$V.Ratio.Adj[1]))

rubin_report_m5 |> gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 2, color = "blue")
status Unmatched Match5
Rule1 2.065 0.246
Rule2 0.747 1.859
  • As we’d expect based on our previous greedy 1:1 match, this isn’t great sufficiently strong balance.

7.2.4 Using bal.plot from cobalt

Looking at the linear propensity scores in each group, in mirrored histograms, we have …

Code
bal.plot(match_5,
         var.name = "distance", 
         which = "both",
         sample.names = 
             c("Unmatched Sample", "match_5 Sample"),
         type = "histogram", mirror = TRUE)

7.2.5 Using love.plot to look at Standardized Differences

Code
love.plot(bal5, 
          threshold = .1, size = 3,
          var.order = "unadjusted",
          stats = "mean.diffs",
          stars = "raw",
          sample.names = c("Unmatched", "Matched"),
          title = "Love Plot for our 1:1 Nearest Neighbor Match") +
    labs(subtitle = "distance = linear propensity score",
         caption = "* indicates raw mean differences (for binary variables)")

7.2.6 Using love.plot to look at Variance Ratios

Again, the categorical variables are dropped.

Code
love.plot(bal5, 
          threshold = .5, size = 3,
          stats = "variance.ratios",
          sample.names = c("Unmatched", "Matched"),
          title = "Variance Ratios for our 1:1 Nearest Neighbor Match") +
    labs(subtitle = "distance = linear propensity score",
         caption = "* indicates raw mean differences (for binary variables)")

8 match_6 1:1 Nearest Neighbor Caliper Matching with Matchit

As in match_5, we’ll match on the linear propensity score (by specifying the distance to use the logistic link with the linear propensity score via "linear.logit"), and we’ll perform 1:1 nearest neighbor matching without replacement using the default ordering (largest to smallest), but we’ll add a some arguments to build a specific kind caliper match. Specifically, we’ll require our matches to be within 0.25 standard deviations of each other on the linear propensity score.

Code
match_6 <- matchit(f.build("treat", covs_1), data = dm2200,
                   distance = "glm", link = "linear.logit", 
                   method = "nearest", caliper = 0.25,
                   ratio = 1, replace = FALSE)

match_6
A `matchit` object
 - method: 1:1 nearest neighbor matching without replacement
 - distance: Propensity score [caliper]

             - estimated with logistic regression and linearized
 - caliper: <distance> (0.537)
 - number of obs.: 2200 (original), 344 (matched)
 - target estimand: ATT
 - covariates: age, race, hisp, sex, insur, nincome, nhsgrad, cleve, a1c, ldl, visits, tobacco, statin, ace_arb, betab, depr_dx, eyeex, pneumo

8.1 Obtaining the Matched Sample

There is just one tricky part to doing this in MatchIt. The main work can be done with a very simple command.

Code
dm2200_matched6 <- match.data(match_6)

This leaves only the job of creating a matching number, for which we have to develop some additional R code.

Code
# Thanks to Robert McDonald at Mayo
# https://lists.gking.harvard.edu/pipermail/matchit/2012-April/000458.html

len <- dim(dm2200)[1]
matchx <- rep(NA,len)
len2 <- length(match_6$match.matrix)
count <- 1
for(i in 1:len2){
    
    match1 <- match_6$match.matrix[i]
    match2 <- row.names(match_6$match.matrix)[i]
    
    if(!is.na(match1)){
    matchx[as.numeric(match1)] <- count
    matchx[as.numeric(match2)] <- count
    count <- count+1}
    
}

dm2200_matched6 <- dm2200_matched6 |>
    mutate(match6_matches = 
               matchx[as.numeric
                      (row.names(dm2200_matched6))]) |>
    mutate(match6_matches = factor(match6_matches))

How many unique subjects are in our matched sample?

Code
n_distinct(dm2200_matched6$subject)
[1] 344

This match includes 172 pairs so 344 subjects, as we’ve done matching without replacement.

Code
dm2200_matched6 |> count(exposure)
# A tibble: 2 × 2
  exposure     n
  <fct>    <int>
1 A          172
2 B          172

8.2 Checking Covariate Balance for our 1:1 Nearest Neighbor Caliper Match

8.2.1 Using bal.tab to obtain a balance table

Code
bal6 <- bal.tab(match_6, un = TRUE, disp.v.ratio = TRUE)

bal6
Balance Measures
                     Type Diff.Un V.Ratio.Un Diff.Adj V.Ratio.Adj
distance         Distance  2.0650     0.7466   0.0612      1.2154
age               Contin. -0.4682     0.7579   0.0068      0.8700
race_Black_AA      Binary  0.2975          .  -0.0116           .
race_White         Binary -0.2630          .   0.0058           .
race_Other         Binary -0.0320          .   0.0058           .
race_Asian         Binary -0.0025          .   0.0000           .
hisp               Binary -0.0320          .   0.0116           .
sex_M              Binary  0.2190          .   0.0233           .
insur_Medicare     Binary -0.2715          .   0.0000           .
insur_Commercial   Binary -0.2200          .  -0.0058           .
insur_Medicaid     Binary  0.3745          .   0.0000           .
insur_Uninsured    Binary  0.1170          .   0.0058           .
nincome           Contin. -0.9504     0.3117   0.1677      1.1339
nhsgrad           Contin. -0.3904     1.0281   0.1137      0.9119
cleve              Binary  0.3505          .   0.0000           .
a1c               Contin.  0.0379     1.2209  -0.0011      0.9429
ldl               Contin. -0.0743     1.1096  -0.0188      1.1241
visits            Contin.  0.4276     2.1732   0.0887      1.2449
tobacco_Former     Binary -0.1575          .  -0.0174           .
tobacco_Current    Binary  0.3175          .   0.0000           .
tobacco_Never      Binary -0.1600          .   0.0174           .
statin             Binary -0.0315          .   0.0349           .
ace_arb            Binary  0.0345          .   0.0174           .
betab              Binary -0.1255          .  -0.0465           .
depr_dx            Binary -0.1115          .   0.0174           .
eyeex              Binary -0.0925          .  -0.0233           .
pneumo             Binary -0.3030          .   0.0000           .

Sample sizes
          Control Treated
All          2000     200
Matched       172     172
Unmatched    1828      28

8.2.2 Checking Rubin’s Rules 1 and 2

We’ll build a little table of the Rubin’s Rules (1 and 2) results before and after match_6 is applied, and compare these to the match_5 results.

Code
rubin_report_m56 <- tibble(
    status = c("Rule1", "Rule2"),
    Unmatched = c(bal6$Balance$Diff.Un[1],
                  bal6$Balance$V.Ratio.Un[1]),
    Match5 = c(bal5$Balance$Diff.Adj[1],
               bal5$Balance$V.Ratio.Adj[1]),
    Match6 = c(bal6$Balance$Diff.Adj[1],
               bal6$Balance$V.Ratio.Adj[1])
    )

rubin_report_m56 |> gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 2, color = "blue")
status Unmatched Match5 Match6
Rule1 2.065 0.246 0.061
Rule2 0.747 1.859 1.215
  • So the caliper appears to help quite a bit to improve the results that we saw in 1:1 nearest neighbor matching without replacement, at the cost of not including 28 of the treated subjects in the match.

8.2.3 Using bal.plot from cobalt

Looking at the linear propensity scores in each group, in mirrored histograms, we have …

Code
bal.plot(match_6,
         var.name = "distance", 
         which = "both",
         sample.names = 
             c("Unmatched Sample", "match_6 Sample"),
         type = "histogram", mirror = TRUE)

8.2.4 Using love.plot to look at Standardized Differences

Code
love.plot(bal6, 
          threshold = .1, size = 3,
          var.order = "unadjusted",
          stats = "mean.diffs",
          stars = "raw",
          sample.names = c("Unmatched", "Matched"),
          title = "Love Plot for 1:1 Nearest Neighbor Caliper Match") +
    labs(subtitle = "distance = linear propensity score",
         caption = "* indicates raw mean differences (for binary variables)")

8.2.5 Using love.plot to look at Variance Ratios

Again, the categorical variables are dropped.

Code
love.plot(bal6, 
          threshold = .5, size = 3,
          stats = "variance.ratios",
          sample.names = c("Unmatched", "Matched"),
          title = "Variance Ratios for 1:1 Nearest Neighbor Caliper Match") +
    labs(subtitle = "distance = linear propensity score",
         caption = "* indicates raw mean differences (for binary variables)")

9 match_7 1:1 Optimal Matching with Matchit

As in match_5, we’ll match on the linear propensity score (by specifying the distance to use the logistic link with the linear propensity score via "linear.logit"), but now we’ll use an optimal 1:1 matching (which is always done in matchit without replacement).

Code
match_7 <- matchit(f.build("treat", covs_1), data = dm2200,
                   distance = "glm", link = "linear.logit",
                   method = "optimal", ratio = 1)

match_7
A `matchit` object
 - method: 1:1 optimal pair matching
 - distance: Propensity score
             - estimated with logistic regression and linearized
 - number of obs.: 2200 (original), 400 (matched)
 - target estimand: ATT
 - covariates: age, race, hisp, sex, insur, nincome, nhsgrad, cleve, a1c, ldl, visits, tobacco, statin, ace_arb, betab, depr_dx, eyeex, pneumo

9.1 Obtaining the Matched Sample

As before, much of the work can be done with match.data.

Code
dm2200_matched7 <- match.data(match_7)

This leaves only the job of creating a matching number, for which we have to develop some additional R code.

Code
# Thanks to Robert McDonald at Mayo
# https://lists.gking.harvard.edu/pipermail/matchit/2012-April/000458.html

len <- dim(dm2200)[1]
matchx <- rep(NA,len)
len2 <- length(match_7$match.matrix)
count <- 1
for(i in 1:len2){
    
    match1 <- match_7$match.matrix[i]
    match2 <- row.names(match_7$match.matrix)[i]
    
    if(!is.na(match1)){
    matchx[as.numeric(match1)] <- count
    matchx[as.numeric(match2)] <- count
    count <- count+1}
    
}

dm2200_matched7 <- dm2200_matched7 |>
    mutate(match7_matches = 
               matchx[as.numeric
                      (row.names(dm2200_matched7))]) |>
    mutate(match7_matches = factor(match7_matches))

How many unique subjects are in our matched sample?

Code
n_distinct(dm2200_matched7$subject)
[1] 400

This match includes 200 pairs so 400 subjects, as we’ve done matching without replacement.

Code
dm2200_matched7 |> count(exposure)
# A tibble: 2 × 2
  exposure     n
  <fct>    <int>
1 A          200
2 B          200

9.2 Checking Covariate Balance for our 1:1 Optimal Match

9.2.1 Using bal.tab to obtain a balance table

Code
bal7 <- bal.tab(match_7, un = TRUE, disp.v.ratio = TRUE)

bal7
Balance Measures
                     Type Diff.Un V.Ratio.Un Diff.Adj V.Ratio.Adj
distance         Distance  2.0650     0.7466   0.2460      1.8591
age               Contin. -0.4682     0.7579   0.1005      0.8433
race_Black_AA      Binary  0.2975          .   0.0050           .
race_White         Binary -0.2630          .  -0.0050           .
race_Other         Binary -0.0320          .   0.0000           .
race_Asian         Binary -0.0025          .   0.0000           .
hisp               Binary -0.0320          .   0.0000           .
sex_M              Binary  0.2190          .   0.0400           .
insur_Medicare     Binary -0.2715          .   0.0100           .
insur_Commercial   Binary -0.2200          .   0.0050           .
insur_Medicaid     Binary  0.3745          .  -0.0200           .
insur_Uninsured    Binary  0.1170          .   0.0050           .
nincome           Contin. -0.9504     0.3117   0.1197      1.0425
nhsgrad           Contin. -0.3904     1.0281  -0.0226      1.1902
cleve              Binary  0.3505          .   0.0050           .
a1c               Contin.  0.0379     1.2209   0.0135      0.9199
ldl               Contin. -0.0743     1.1096  -0.0443      1.0515
visits            Contin.  0.4276     2.1732   0.1754      1.2822
tobacco_Former     Binary -0.1575          .  -0.0350           .
tobacco_Current    Binary  0.3175          .   0.0400           .
tobacco_Never      Binary -0.1600          .  -0.0050           .
statin             Binary -0.0315          .   0.0500           .
ace_arb            Binary  0.0345          .   0.0500           .
betab              Binary -0.1255          .  -0.0600           .
depr_dx            Binary -0.1115          .  -0.0450           .
eyeex              Binary -0.0925          .  -0.0050           .
pneumo             Binary -0.3030          .  -0.0800           .

Sample sizes
          Control Treated
All          2000     200
Matched       200     200
Unmatched    1800       0

9.2.2 Checking Rubin’s Rules 1 and 2

We’ll build a little table of the Rubin’s Rules (1 and 2) results before and after match_6 is applied, and compare these to the match_5 results.

Code
rubin_report_m567 <- tibble(
    status = c("Rule1", "Rule2"),
    Unmatched = c(bal6$Balance$Diff.Un[1],
                  bal6$Balance$V.Ratio.Un[1]),
    Match5 = c(bal5$Balance$Diff.Adj[1],
               bal5$Balance$V.Ratio.Adj[1]),
    Match6 = c(bal6$Balance$Diff.Adj[1],
               bal6$Balance$V.Ratio.Adj[1]),
    Match7 = c(bal7$Balance$Diff.Adj[1],
               bal7$Balance$V.Ratio.Adj[1])
    )

rubin_report_m567 |> gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 2, color = "blue")
status Unmatched Match5 Match6 Match7
Rule1 2.065 0.246 0.061 0.246
Rule2 0.747 1.859 1.215 1.859
  • We note here that the optimal matching here does very little, if anything, to improved on the nearest neighbor match.

9.2.3 Are the Optimal and Nearest Neighbor Matches the Same?

Code
m5_subs <- dm2200_matched5 |> select(subject)
m7_subs <- dm2200_matched7 |> select(subject)

all.equal(m5_subs, m7_subs)
[1] "Component \"subject\": 356 string mismatches"

Apparently not, but they are very similar.

9.2.4 Using bal.plot from cobalt

Looking at the linear propensity scores in each group, in mirrored histograms, we have …

Code
bal.plot(match_7,
         var.name = "distance", 
         which = "both",
         sample.names = 
             c("Unmatched Sample", "match_7 Sample"),
         type = "histogram", mirror = TRUE)

9.2.5 Using love.plot to look at Standardized Differences

Code
love.plot(bal7, 
          threshold = .1, size = 3,
          var.order = "unadjusted",
          stats = "mean.diffs",
          stars = "raw",
          sample.names = c("Unmatched", "Matched"),
          title = "Love Plot for 1:1 Optimal Match") +
    labs(subtitle = "distance = linear propensity score",
         caption = "* indicates raw mean differences (for binary variables)")

9.2.6 Using love.plot to look at Variance Ratios

Again, the categorical variables are dropped.

Code
love.plot(bal7, 
          threshold = .5, size = 3,
          stats = "variance.ratios",
          sample.names = c("Unmatched", "Matched"),
          title = "Variance Ratios for 1:1 Optimal Match") +
    labs(subtitle = "distance = linear propensity score",
         caption = "* indicates raw mean differences (for binary variables)")

10 match_8 1:2 Optimal Matching with Matchit

Again, we’ll match on the linear propensity score (by specifying the distance to use the logistic link with the linear propensity score via "linear.logit"), but now we’ll perform 1:2 (so ratio = 2) optimal (so `method = “optimal”) matching without replacement using the default ordering (largest to smallest).

Code
match_8 <- matchit(f.build("treat", covs_1), data = dm2200,
                   distance = "glm", link = "linear.logit", 
                   method = "optimal", ratio = 2)

match_8
A `matchit` object
 - method: 2:1 optimal pair matching
 - distance: Propensity score
             - estimated with logistic regression and linearized
 - number of obs.: 2200 (original), 600 (matched)
 - target estimand: ATT
 - covariates: age, race, hisp, sex, insur, nincome, nhsgrad, cleve, a1c, ldl, visits, tobacco, statin, ace_arb, betab, depr_dx, eyeex, pneumo

10.1 Obtaining the Matched Sample

As before, much of the work can be done with a very simple command.

Code
dm2200_matched8 <- match.data(match_8)

This leaves only the job of creating a matching number, for which we have to develop some additional R code, and this needs some modification because we now have two matched controls for each treated subject.

Code
# Thanks to Robert McDonald at Mayo
# https://lists.gking.harvard.edu/pipermail/matchit/2012-April/000458.html

mm1 <- match_8$match.matrix[,1]
mm2 <- match_8$match.matrix[,2]

len <- dim(dm2200)[1]
matchx <- rep(NA,len)
len2 <- length(match_8$match.matrix)
count <- 1
for(i in 1:len2){
    
    match_tr <- row.names(match_8$match.matrix)[i]
    match_con1 <- mm1[i]
    match_con2 <- mm2[i]
    
    if(!is.na(match_tr) & !is.na(match_con1) & !is.na(match_con2)){
    matchx[as.numeric(match_con1)] <- count
    matchx[as.numeric(match_con2)] <- count
    matchx[as.numeric(match_tr)] <- count
    count <- count+1}
    
}

dm2200_matched8 <- dm2200_matched8 |>
     mutate(match8_matches = 
            matchx[as.numeric(row.names(dm2200_matched8))]) 

How many unique subjects are in our matched sample?

Code
n_distinct(dm2200_matched8$subject)
[1] 600

This match includes 200 triplets (1 treated and 2 control) so 600 subjects, and again we’ve done matching without replacement.

Code
dm2200_matched8 |> count(exposure)
# A tibble: 2 × 2
  exposure     n
  <fct>    <int>
1 A          200
2 B          400

10.2 Checking Covariate Balance for our 1:2 Optimal Match

10.2.1 Using bal.tab to obtain a balance table

Code
bal8 <- bal.tab(match_8, un = TRUE, disp.v.ratio = TRUE)

bal8
Balance Measures
                     Type Diff.Un V.Ratio.Un Diff.Adj V.Ratio.Adj
distance         Distance  2.0650     0.7466   0.5002      2.4424
age               Contin. -0.4682     0.7579  -0.0605      0.8144
race_Black_AA      Binary  0.2975          .   0.0225           .
race_White         Binary -0.2630          .  -0.0225           .
race_Other         Binary -0.0320          .   0.0000           .
race_Asian         Binary -0.0025          .   0.0000           .
hisp               Binary -0.0320          .  -0.0050           .
sex_M              Binary  0.2190          .   0.0700           .
insur_Medicare     Binary -0.2715          .  -0.0800           .
insur_Commercial   Binary -0.2200          .  -0.0025           .
insur_Medicaid     Binary  0.3745          .   0.0350           .
insur_Uninsured    Binary  0.1170          .   0.0475           .
nincome           Contin. -0.9504     0.3117   0.0429      0.7310
nhsgrad           Contin. -0.3904     1.0281  -0.0139      0.9187
cleve              Binary  0.3505          .   0.0175           .
a1c               Contin.  0.0379     1.2209   0.0463      1.1943
ldl               Contin. -0.0743     1.1096  -0.0223      1.0845
visits            Contin.  0.4276     2.1732   0.2097      1.3355
tobacco_Former     Binary -0.1575          .  -0.0550           .
tobacco_Current    Binary  0.3175          .   0.1075           .
tobacco_Never      Binary -0.1600          .  -0.0525           .
statin             Binary -0.0315          .  -0.0025           .
ace_arb            Binary  0.0345          .   0.0075           .
betab              Binary -0.1255          .  -0.0375           .
depr_dx            Binary -0.1115          .  -0.0600           .
eyeex              Binary -0.0925          .  -0.0225           .
pneumo             Binary -0.3030          .  -0.1475           .

Sample sizes
          Control Treated
All          2000     200
Matched       400     200
Unmatched    1600       0

10.2.2 Checking Rubin’s Rules 1 and 2

We’ll build a little table of the Rubin’s Rules (1 and 2) results before and after match_8 is applied, and compare these to the other MatchIt results we’ve developed.

Code
rubin_report_m5678 <- tibble(
    status = c("Rule1", "Rule2"),
    Unmatched = c(bal8$Balance$Diff.Un[1],
                  bal8$Balance$V.Ratio.Un[1]),
    Match5 = c(bal5$Balance$Diff.Adj[1],
               bal5$Balance$V.Ratio.Adj[1]),
    Match6 = c(bal6$Balance$Diff.Adj[1],
               bal6$Balance$V.Ratio.Adj[1]),
    Match7 = c(bal7$Balance$Diff.Adj[1],
               bal7$Balance$V.Ratio.Adj[1]),
    Match8 = c(bal8$Balance$Diff.Adj[1],
               bal8$Balance$V.Ratio.Adj[1])
    )

rubin_report_m5678 |> gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 2, color = "blue")
status Unmatched Match5 Match6 Match7 Match8
Rule1 2.065 0.246 0.061 0.246 0.500
Rule2 0.747 1.859 1.215 1.859 2.442
  • So the optimal matching doesn’t look very strong here. Of course, we’re matching 1:2 in this situation.

10.2.3 Using bal.plot from cobalt

Looking at the linear propensity scores in each group, in mirrored histograms, we have …

Code
bal.plot(match_8,
         var.name = "distance", 
         which = "both",
         sample.names = 
             c("Unmatched Sample", "match_8 Sample"),
         type = "histogram", mirror = TRUE)

10.2.4 Using love.plot to look at Standardized Differences

Code
love.plot(bal8, 
          threshold = .1, size = 3,
          var.order = "unadjusted",
          stats = "mean.diffs",
          stars = "raw",
          sample.names = c("Unmatched", "Matched"),
          title = "Love Plot for 1:2 Optimal Match") +
    labs(subtitle = "distance = linear propensity score",
         caption = "* indicates raw mean differences (for binary variables)")

10.2.5 Using love.plot to look at Variance Ratios

Again, the categorical variables are dropped.

Code
love.plot(bal8, 
          threshold = .5, size = 3,
          stats = "variance.ratios",
          sample.names = c("Unmatched", "Matched"),
          title = "Variance Ratios for 1:2 Optimal Match") +
    labs(subtitle = "distance = linear propensity score",
         caption = "* indicates raw mean differences (for binary variables)")

11 Outcome Models

We’ll fit two (overly simplistic) outcome models, one for bp_good (our binary outcome) and another for bmi (our quantitative outcome.) Later, we’ll compare the exposure effect estimates made here to the estimates we obtain after propensity matching. In each case, we’ll focus on ATT estimates (average treated effect on the treated) rather than ATE estimates.

11.1 Potential Fitting Problems

When you fit mixed effect models as I have done below for binary and for quantitative outcomes, you may find that the model fit appears to be singular, which means that some element of the variance-covariance matrix estimated by lmer is zero. This may be an ignorable problem in this setting, since the problem is usually with the separation of residual effects from random match effects. We’re focusing right now on the fixed effects, which can be summarized as indicated below.

But, there’s certainly an argument to be made that an alternative strategy for estimating the match might be more appropriate. We won’t worry about that in this example, and just display what we get.

11.2 Unadjusted Models prior to Propensity Matching

11.2.1 Unadjusted Outcome Model for bp_good

Code
unadj_mod1 <- glm(bp_good == 1 ~ exposure == "A", data = dm2200, 
                  family = binomial())

tidy(unadj_mod1, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")
term estimate std.error conf.low conf.high
(Intercept) 2.515 0.050 2.284 2.773
exposure == "A"TRUE 0.561 0.152 0.417 0.757

11.2.2 Unadjusted Outcome Model for bmi

Code
unadj_mod2 <- lm(bmi ~ exposure == "A", data = dm2200)

tidy(unadj_mod2, conf.int = TRUE, conf.level = 0.95) |>
    select(term, estimate, std.error, 
           conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")
term estimate std.error conf.low conf.high
(Intercept) 35.087 0.183 34.729 35.446
exposure == "A"TRUE −2.260 0.606 −3.449 −1.071

11.3 Adjusted Outcome Models after match1

11.3.1 Binary Outcome: bp_good

Code
result_match1_bp <- clogit(bp_good ~ (exposure == "A") + 
                          strata(match1_matches),
                      data = dm2200_matched1)

tidy(result_match1_bp, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")
term estimate std.error conf.low conf.high
exposure == "A"TRUE 0.559 0.217 0.365 0.856

11.3.2 Quantitative Outcome: bmi

We’ll use a mixed model to account for our 1:1 matching. The matches here are treated as a random factor, with the exposure a fixed factor, in the lme4 package.

Code
dm2200_matched1 <- dm2200_matched1 |> 
    mutate(match1_matches_f = as.factor(match1_matches))

result_match1_bmi <- lmer(bmi ~ (exposure == "A") + 
                              (1 | match1_matches_f), 
                          data = dm2200_matched1)
Code
broom.mixed::tidy(result_match1_bmi, 
     conf.int = TRUE, conf.level = 0.95) |> 
    filter(effect == "fixed") |>
    select(term, estimate, std.error, 
           conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")
term estimate std.error conf.low conf.high
(Intercept) 35.168 0.612 33.969 36.367
exposure == "A"TRUE −2.341 0.857 −4.020 −0.662

11.4 Adjusted Outcome Models after match2

11.4.1 Binary Outcome: bp_good

Code
result_match2_bp <- clogit(bp_good ~ (exposure == "A") + 
                          strata(match2_matches),
                      data = dm2200_matched2)

tidy(result_match2_bp, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")
term estimate std.error conf.low conf.high
exposure == "A"TRUE 0.417 0.180 0.293 0.593

11.4.2 Quantitative Outcome: bmi

We’ll use a mixed model to account for our 1:1 matching. The matches here are treated as a random factor, with the exposure a fixed factor, in the lme4 package.

Code
dm2200_matched2 <- dm2200_matched2 |> 
    mutate(match2_matches_f = as.factor(match2_matches))

result_match2_bmi <- lmer(bmi ~ (exposure == "A") + 
                              (1 | match2_matches_f), 
                          data = dm2200_matched2)

broom.mixed::tidy(result_match2_bmi, 
     conf.int = TRUE, conf.level = 0.95) |> 
    filter(effect == "fixed") |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")
term estimate std.error conf.low conf.high
(Intercept) 34.609 0.458 33.712 35.506
exposure == "A"TRUE −1.782 0.558 −2.876 −0.688

11.5 Adjusted Outcome Models after match3

11.5.1 Binary Outcome: bp_good

Code
result_match3_bp <- clogit(bp_good ~ (exposure == "A") + 
                          strata(match3_matches),
                      data = dm2200_matched3)

tidy(result_match3_bp, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")
term estimate std.error conf.low conf.high
exposure == "A"TRUE 0.573 0.139 0.436 0.753

11.5.2 Quantitative Outcome: bmi

We’ll use a mixed model to account for our 1:1 matching. The matches here are treated as a random factor, with the exposure a fixed factor, in the lme4 package.

Code
dm2200_matched3 <- dm2200_matched3 |> 
    mutate(match3_matches_f = as.factor(match3_matches))

result_match3_bmi <- lmer(bmi ~ (exposure == "A") + 
                              (1 | match3_matches_f), 
                          data = dm2200_matched3)

broom.mixed::tidy(result_match3_bmi, 
     conf.int = TRUE, conf.level = 0.95) |> 
    filter(effect == "fixed") |>
    select(term, estimate, std.error, 
           conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")
term estimate std.error conf.low conf.high
(Intercept) 34.531 0.410 33.728 35.335
exposure == "A"TRUE −1.704 0.449 −2.584 −0.824

11.6 Adjusted Outcome Models after match4

11.6.1 Binary Outcome: bp_good

Code
result_match4_bp <- clogit(bp_good ~ (exposure == "A") + 
                          strata(match4_matches),
                      data = dm2200_matched4)

tidy(result_match4_bp, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")
term estimate std.error conf.low conf.high
exposure == "A"TRUE 0.651 0.243 0.405 1.048

11.6.2 Quantitative Outcome: bmi

We’ll use a mixed model to account for our 1:1 matching. The matches here are treated as a random factor, with the exposure a fixed factor, in the lme4 package.

Code
dm2200_matched4 <- dm2200_matched4 |> 
    mutate(match4_matches_f = as.factor(match4_matches))

result_match4_bmi <- lmer(bmi ~ (exposure == "A") + 
                              (1 | match4_matches_f), 
                          data = dm2200_matched4)
boundary (singular) fit: see help('isSingular')
Code
broom.mixed::tidy(result_match4_bmi, 
     conf.int = TRUE, conf.level = 0.95) |> 
    filter(effect == "fixed") |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")
term estimate std.error conf.low conf.high
(Intercept) 34.828 0.657 33.540 36.116
exposure == "A"TRUE −1.577 0.929 −3.399 0.244

11.7 Adjusted Outcome Models after match5

11.7.1 Binary Outcome: bp_good

Code
result_match5_bp <- clogit(bp_good ~ (exposure == "A") + 
                          strata(match5_matches),
                      data = dm2200_matched5)
Warning in coxexact.fit(X, Y, istrat, offset, init, control, weights = weights,
: Loglik converged before variable 1 ; beta may be infinite.
Code
tidy(result_match5_bp, exponentiate = TRUE, 
     conf.int = TRUE, conf.level = 0.95) |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")
term estimate std.error conf.low conf.high
exposure == "A"TRUE 1,615,474,899.108 40,192.969 0.000 Inf

11.7.2 Quantitative Outcome: bmi

We’ll use a mixed model to account for our 1:1 matching. The matches here are treated as a random factor, with the exposure a fixed factor, in the lme4 package.

Code
result_match5_bmi <- lmer(bmi ~ (exposure == "A") + 
                              (1 | match5_matches), 
                          data = dm2200_matched5)
boundary (singular) fit: see help('isSingular')
Code
broom.mixed::tidy(result_match5_bmi, 
     conf.int = TRUE, conf.level = 0.95) |> 
    filter(effect == "fixed") |>
    select(term, estimate, std.error, 
           conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")
term estimate std.error conf.low conf.high
(Intercept) 33.258 1.354 30.605 35.911
exposure == "A"TRUE 1.683 2.037 −2.310 5.676

11.8 Adjusted Outcome Models after match6

11.8.1 Binary Outcome: bp_good

Code
result_match6_bp <- clogit(bp_good ~ (exposure == "A") + 
                          strata(match6_matches),
                      data = dm2200_matched6)
Warning in coxexact.fit(X, Y, istrat, offset, init, control, weights = weights,
: Ran out of iterations and did not converge
Code
tidy(result_match6_bp, exponentiate = TRUE, 
     conf.int = TRUE, conf.level = 0.95) |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")
term estimate std.error conf.low conf.high
exposure == "A"TRUE 0.000 24,378.269 0.000 Inf

11.8.2 Quantitative Outcome: bmi

We’ll use a mixed model to account for our 1:1 matching. The matches here are treated as a random factor, with the exposure a fixed factor, in the lme4 package.

Code
result_match6_bmi <- lmer(bmi ~ (exposure == "A") + 
                              (1 | match6_matches), 
                          data = dm2200_matched6)
boundary (singular) fit: see help('isSingular')
Code
broom.mixed::tidy(result_match6_bmi, 
     conf.int = TRUE, conf.level = 0.95) |> 
    filter(effect == "fixed") |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")
term estimate std.error conf.low conf.high
(Intercept) 34.338 1.982 30.453 38.223
exposure == "A"TRUE 1.097 2.567 −3.934 6.129

11.9 Adjusted Outcome Models after match7

11.9.1 Binary Outcome: bp_good

Code
result_match7_bp <- clogit(bp_good ~ (exposure == "A") + 
                          strata(match7_matches),
                      data = dm2200_matched7)

tidy(result_match7_bp, exponentiate = TRUE, 
     conf.int = TRUE, conf.level = 0.95) |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")
term estimate std.error conf.low conf.high
exposure == "A"TRUE 2.000 1.225 0.181 22.056

11.9.2 Quantitative Outcome: bmi

We’ll use a mixed model to account for our 1:1 matching. The matches here are treated as a random factor, with the exposure a fixed factor, in the lme4 package.

Code
result_match7_bmi <- lmer(bmi ~ (exposure == "A") + 
                              (1 | match7_matches), 
                          data = dm2200_matched7)

broom.mixed::tidy(result_match7_bmi, 
     conf.int = TRUE, conf.level = 0.95) |> 
    filter(effect == "fixed") |>
    select(term, estimate, std.error, 
           conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")
term estimate std.error conf.low conf.high
(Intercept) 35.658 1.416 32.882 38.433
exposure == "A"TRUE −3.210 1.884 −6.903 0.484

11.10 Adjusted Outcome Models after match8

11.10.1 Binary Outcome: bp_good

Code
result_match8_bp <- clogit(bp_good ~ (exposure == "A") + 
                          strata(match8_matches),
                      data = dm2200_matched8)

tidy(result_match8_bp, exponentiate = TRUE, 
     conf.int = TRUE, conf.level = 0.95) |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")
term estimate std.error conf.low conf.high
exposure == "A"TRUE 1.073 0.650 0.300 3.839

11.10.2 Quantitative Outcome: bmi

We’ll use a mixed model to account for our 1:1 matching. The matches here are treated as a random factor, with the exposure a fixed factor, in the lme4 package.

Code
dm2200_matched8 <- dm2200_matched8 |>
    mutate(match8_f = factor(match8_matches))

result_match8_bmi <- lmer(bmi ~ (exposure == "A") + 
                              (1 | match8_f),  
                          data = dm2200_matched8)

broom.mixed::tidy(result_match8_bmi, 
     conf.int = TRUE, conf.level = 0.95) |> 
    filter(effect == "fixed") |>
    select(term, estimate, std.error, 
           conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")
term estimate std.error conf.low conf.high
(Intercept) 34.597 0.786 33.057 36.137
exposure == "A"TRUE −3.797 1.329 −6.402 −1.192

12 Cleanup

We’ve created a lot of variables here that we don’t actually need going forward. So I’ll remove them here:

Code
rm(bal1, bal2, bal3, bal4,
   covs_1, covs_1plus, covs_2plus, covs_3plus, covs_4plus,
   covs_for_rubin, dm2200, 
   dm2200_matched1, dm2200_matched2, dm2200_matched3,
   dm2200_matched4,
   match_1, match_2, match_3, match_4,
   prop_model, 
   result_match1_bmi, result_match2_bmi, result_match3_bmi,
   result_match4_bmi,
   result_match1_bp, result_match2_bp, result_match3_bp,
   result_match4_bp,
   rubin_m1, rubin_m2, rubin_m3, rubin_m4,
   rubin_report_m1, rubin_report_m12, rubin_report_m123,
   rubin_report_m1234, rubin_report_m5, 
   rubin_report_m56, rubin_report_m567, rubin_report_m5678, 
   t1, unadj_mod1, unadj_mod2,
   match1_matches, match2_matches, match3_matches,
   count, i, len, len2, match_con1, match_con2, match_tr,
   match1, match2, match4_matches, matchx, mm1, mm2, 
   bal5, bal6, bal7, bal8, dm2200_matched5, 
   dm2200_matched6, dm2200_matched7, dm2200_matched8,
   m5_subs, m7_subs, match_5, match_6, match_7, match_8,
   result_match5_bmi, result_match6_bmi, result_match7_bmi,
   result_match8_bmi,
   result_match5_bp, result_match6_bp, result_match7_bp,
   result_match8_bp)

13 Key References

Matching in these examples was performed using the Matching package (Sekhon, 2011), and the MatchIt package and covariate balance was assessed using cobalt (Greifer, 2020), both in R (R Core Team, 2019).

  • Greifer, N. (2020). cobalt: Covariate Balance Tables and Plots. R package version 4.3.0.
  • MatchIt: Nonparametric Preprocessing for Parametric Causal Inference, R package version 4.1.0.
  • Sekhon, J.S. (2011) Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching Package for R, J of Statistical Software, 2011, 42: 7, http://www.jstatsoft.org/. R package version 4.9-6.
  • R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

13.1 Session Information

Code
xfun::session_info()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  askpass_1.2.1          backports_1.5.0        base64enc_0.1.3       
  bayestestR_0.15.1      bigD_0.3.0             bit_4.5.0.1           
  bit64_4.6.0-1          bitops_1.0.9           blob_1.2.4            
  boot_1.3-31            broom_1.0.7            broom.mixed_0.2.9.6   
  bslib_0.8.0            cachem_1.1.0           callr_3.7.6           
  cellranger_1.1.0       chk_0.10.0             class_7.3-22          
  cli_3.6.3              clipr_0.8.0            cmprsk_2.2-12         
  cobalt_4.5.5           coda_0.19-4.1          codetools_0.2-20      
  colorspace_2.1-1       commonmark_1.9.2       compiler_4.4.2        
  conflicted_1.2.0       correlation_0.8.6      cpp11_0.5.1           
  crayon_1.5.3           curl_6.2.0             data.table_1.16.4     
  datasets_4.4.2         datawizard_1.0.0       DBI_1.2.3             
  dbplyr_2.5.0           digest_0.6.37          dplyr_1.1.4           
  dtplyr_1.3.1           e1071_1.7-16           easystats_0.7.3       
  effectsize_1.0.0       emmeans_1.10.6         Epi_2.59              
  estimability_1.5.1     etm_1.1.1              evaluate_1.0.3        
  fansi_1.0.6            farver_2.1.2           fastmap_1.2.0         
  fontawesome_0.5.3      forcats_1.0.0          fs_1.6.5              
  furrr_0.3.1            future_1.34.0          gargle_1.5.2          
  gdata_3.0.1            generics_0.1.3         ggplot2_3.5.1         
  globals_0.16.3         glue_1.8.0             gmodels_2.19.1        
  googledrive_2.1.1      googlesheets4_1.1.1    graphics_4.4.2        
  grDevices_4.4.2        grid_4.4.2             gridExtra_2.3         
  gt_0.11.1              gtable_0.3.6           gtools_3.9.5          
  haven_2.5.4            highr_0.11             hms_1.1.3             
  htmltools_0.5.8.1      htmlwidgets_1.6.4      httr_1.4.7            
  ids_1.0.1              insight_1.0.1          isoband_0.2.7         
  janitor_2.2.1          jquerylib_0.1.4        jsonlite_1.8.9        
  juicyjuice_0.1.0       knitr_1.49             labeling_0.4.3        
  labelled_2.14.0        lattice_0.22-6         lifecycle_1.0.4       
  listenv_0.9.1          lme4_1.1-36            lubridate_1.9.4       
  magrittr_2.0.3         markdown_1.13          MASS_7.3-64           
  Matching_4.10-15       MatchIt_4.7.0          Matrix_1.7-1          
  memoise_2.0.1          methods_4.4.2          mgcv_1.9-1            
  mime_0.12              minqa_1.2.8            mitools_2.4           
  modelbased_0.8.9       modelr_0.1.11          multcomp_1.4-26       
  munsell_0.5.1          mvtnorm_1.3-3          naniar_1.1.0          
  nlme_3.1-166           nloptr_2.1.1           norm_1.0.11.1         
  numDeriv_2016.8-1.1    openssl_2.3.1          optmatch_0.10.8       
  parallel_4.4.2         parallelly_1.41.0      parameters_0.24.1     
  patchwork_1.3.0        performance_0.13.0     pillar_1.10.1         
  pkgconfig_2.0.3        plyr_1.8.9             prettyunits_1.2.0     
  processx_3.8.5         progress_1.2.3         proxy_0.4-27          
  ps_1.8.1               purrr_1.0.2            R6_2.5.1              
  ragg_1.3.3             rappdirs_0.3.3         rbibutils_2.3         
  rbounds_2.2            RColorBrewer_1.1.3     Rcpp_1.0.14           
  RcppArmadillo_14.2.2.1 RcppEigen_0.3.4.0.2    RcppProgress_0.4.2    
  Rdpack_2.6.2           reactable_0.4.4        reactR_0.6.1          
  readr_2.1.5            readxl_1.4.3           reformulas_0.4.0      
  rematch_2.0.0          rematch2_2.1.2         report_0.5.9          
  reprex_2.1.1           rlang_1.1.5            rlemon_0.2.1          
  rmarkdown_2.29         rstudioapi_0.17.1      rvest_1.0.4           
  sandwich_3.1-1         sass_0.4.9             scales_1.3.0          
  see_0.10.0             selectr_0.4.2          snakecase_0.11.1      
  splines_4.4.2          stats_4.4.2            stringi_1.8.4         
  stringr_1.5.1          survey_4.4-2           survival_3.8-3        
  sys_3.4.3              systemfonts_1.2.1      tableone_0.13.2       
  textshaping_1.0.0      TH.data_1.1-3          tibble_3.2.1          
  tidyr_1.3.1            tidyselect_1.2.1       tidyverse_2.0.0       
  timechange_0.3.0       tinytex_0.54           tools_4.4.2           
  tzdb_0.4.0             UpSetR_1.4.0           utf8_1.2.4            
  utils_4.4.2            uuid_1.2.1             V8_6.0.0              
  vctrs_0.6.5            viridis_0.6.5          viridisLite_0.4.2     
  visdat_0.6.0           vroom_1.6.5            withr_3.0.2           
  xfun_0.50              xml2_1.3.6             xtable_1.8-4          
  yaml_2.3.10            zoo_1.8-12