The dm2200 example

Propensity Matching (in several ways) using R

Author

Thomas E. Love, Ph.D.

Published

2025-01-27

1 Setup

Code

knitr::opts_chunk$set(comment = NA) # do not remove this

library(janitor) # load other packages as desired
library(naniar)

library(gt)
library(tableone)
library(broom)
library(Epi)
library(survival)
library(Matching)
library(MatchIt)
library(cobalt)
library(lme4)
library(survey)
library(rbounds)

library(easystats)
library(tidyverse) # load tidyverse last

## Note that we will also use the broom.mixed package
## but we won't load it here

theme_set(theme_bw()) # set theme for ggplots

1.1 The `dm2200` Data Set

I’ve simulated data to match real information we’ve collected over the years at Better Health Partnership on adults who live with diabetes. These data mirror some of the real data collected from electronic health records across the region by Better Health Partnership, but individual values have been permuted across patients, so the results are not applicable to any population. The data I simulated from was a subset of Better Health data that required that the subject fall into exactly one of the two exposure groups we’ll study, that they live in Cuyahoga County, prefer English for health-related communications, and have no missing data on the variables we’ll study.

The exposure we’ll study is called exposure and consists of two levels: A and B. I won’t specify the details further on how the exposure is determined, except to say that it is uniquely determinable for each subject.
We’ll study a binary outcome, specifically whether the subject’s blood pressure is in control, in the sense that both their systolic blood pressure is below 140 mm Hg, and their diastolic blood pressure is below 90 mm Hg.
We’ll also study a continuous outcome, the subject’s body-mass index or bmi.

Code

dm2200 <- read_csv("data/dm2200.csv", 
                   show_col_types = FALSE) |> 
    mutate(across(where(is.character), as_factor)) |>
    mutate(subject = as.character(subject),
           bp_good = as.numeric(sbp < 140 & dbp < 90))

dm2200

# A tibble: 2,200 × 26
   subject exposure   age race      hisp sex   insur      nincome nhsgrad cleve
   <chr>   <fct>    <dbl> <fct>    <dbl> <fct> <fct>        <dbl>   <dbl> <dbl>
 1 S-0001  A           67 Black_AA     0 F     Medicare     24700      72     1
 2 S-0002  B           56 White        0 F     Commercial   62300      85     0
 3 S-0003  B           41 Black_AA     0 M     Medicare      6500      49     1
 4 S-0004  A           56 Black_AA     0 M     Medicaid     45400      86     0
 5 S-0005  B           69 Black_AA     0 F     Medicare     15400      72     0
 6 S-0006  B           44 Black_AA     0 M     Medicaid     34100      87     0
 7 S-0007  B           47 Black_AA     0 F     Medicaid     13900      42     1
 8 S-0008  B           60 Black_AA     0 M     Medicaid     30800      91     0
 9 S-0009  B           67 Other        1 F     Medicare     47100      90     1
10 S-0010  B           70 Black_AA     0 F     Medicare     20800      69     1
# ℹ 2,190 more rows
# ℹ 16 more variables: height_cm <dbl>, weight_kg <dbl>, bmi <dbl>, a1c <dbl>,
#   sbp <dbl>, dbp <dbl>, ldl <dbl>, visits <dbl>, tobacco <fct>, statin <dbl>,
#   ace_arb <dbl>, betab <dbl>, depr_dx <dbl>, eyeex <dbl>, pneumo <dbl>,
#   bp_good <dbl>

1.2 Elements of a Codebook

Note

I used paste(colnames(dm2200), collapse = " | ") to help me make this list.

Variable	Type	Description
`subject`	character	subject identifier (S-0001 to S-2200)
`exposure`	factor (2 levels)	A or B
`age`	integer	age in years
`race`	factor (4 levels)	White, Black_AA, Asian, Other
`hisp`	1/0	1 = Hispanic or Latinx, 0 = not
`sex`	F/M	F = Female, M = Male
`insur`	factor (4 levels)	Insurance: Medicare, Commercial, Medicaid or Uninsured
`nincome`	integer	est. Neighborhood Median Income, in $
`nhsgrad`	integer	est. % of adults in Neighborhood who are High School graduates
`cleve`	1/0	1 = Cleveland resident, 0 = resident of suburbs
`height_cm`	integer	height in cm
`weight_kg`	integer	weight in kg
`bmi`	numeric	body mass index (kg/m²)
`a1c`	numeric	most recent Hemoglobin A1c (in %)
`sbp`	numeric	most recent systolic blood pressure (in mm Hg)
`dbp`	numeric	most recent diastolic blood pressure (in mm Hg)
`bp_good`	1/0	1 if `sbp` < 140 and `dbp` < 90, 0 otherwise
`ldl`	numeric	most recent LDL cholesterol (in mg/dl)
`visits`	integer	primary care office visits in past year
`tobacco`	factor (3 levels)	Tobacco use: Current, Former, Never
`statin`	1/0	1 if subject had a statin prescription in the past year
`ace_arb`	1/0	1 if subject had an ACE inhibitor or ARB prescription in the past year
`betab`	1/0	1 if subject had a beta-blocker prescription in the past year
`depr_dx`	1/0	1 if the subject has a depression diagnosis
`eyeex`	1/0	1 if the subject has had a retinal eye exam in the past year
`pneumo`	1/0	1 if the subject has had a pneumococcal vaccination in the past 10 years

Code

data_codebook(dm2200 |> select(-subject))

select(dm2200, -subject) (2200 rows and 25 variables, 25 shown)

ID | Name      | Type        | Missings |         Values |            N
---+-----------+-------------+----------+----------------+-------------
1  | exposure  | categorical | 0 (0.0%) |              A |  200 ( 9.1%)
   |           |             |          |              B | 2000 (90.9%)
---+-----------+-------------+----------+----------------+-------------
2  | age       | numeric     | 0 (0.0%) |       [25, 75] |         2200
---+-----------+-------------+----------+----------------+-------------
3  | race      | categorical | 0 (0.0%) |       Black_AA | 1231 (56.0%)
   |           |             |          |          White |  867 (39.4%)
   |           |             |          |          Other |   75 ( 3.4%)
   |           |             |          |          Asian |   27 ( 1.2%)
---+-----------+-------------+----------+----------------+-------------
4  | hisp      | numeric     | 0 (0.0%) |              0 | 2092 (95.1%)
   |           |             |          |              1 |  108 ( 4.9%)
---+-----------+-------------+----------+----------------+-------------
5  | sex       | categorical | 0 (0.0%) |              F | 1230 (55.9%)
   |           |             |          |              M |  970 (44.1%)
---+-----------+-------------+----------+----------------+-------------
6  | insur     | categorical | 0 (0.0%) |       Medicare |  950 (43.2%)
   |           |             |          |     Commercial |  517 (23.5%)
   |           |             |          |       Medicaid |  626 (28.5%)
   |           |             |          |      Uninsured |  107 ( 4.9%)
---+-----------+-------------+----------+----------------+-------------
7  | nincome   | numeric     | 0 (0.0%) | [6100, 149300] |         2200
---+-----------+-------------+----------+----------------+-------------
8  | nhsgrad   | numeric     | 0 (0.0%) |       [37, 99] |         2200
---+-----------+-------------+----------+----------------+-------------
9  | cleve     | numeric     | 0 (0.0%) |              0 |  910 (41.4%)
   |           |             |          |              1 | 1290 (58.6%)
---+-----------+-------------+----------+----------------+-------------
10 | height_cm | numeric     | 0 (0.0%) |     [138, 203] |         2200
---+-----------+-------------+----------+----------------+-------------
11 | weight_kg | numeric     | 0 (0.0%) |      [45, 223] |         2200
---+-----------+-------------+----------+----------------+-------------
12 | bmi       | numeric     | 0 (0.0%) |   [16.4, 77.2] |         2200
---+-----------+-------------+----------+----------------+-------------
13 | a1c       | numeric     | 0 (0.0%) |    [4.3, 15.9] |         2200
---+-----------+-------------+----------+----------------+-------------
14 | sbp       | numeric     | 0 (0.0%) |      [86, 232] |         2200
---+-----------+-------------+----------+----------------+-------------
15 | dbp       | numeric     | 0 (0.0%) |      [38, 119] |         2200
---+-----------+-------------+----------+----------------+-------------
16 | ldl       | numeric     | 0 (0.0%) |      [25, 243] |         2200
---+-----------+-------------+----------+----------------+-------------
17 | visits    | numeric     | 0 (0.0%) |        [2, 18] |         2200
---+-----------+-------------+----------+----------------+-------------
18 | tobacco   | categorical | 0 (0.0%) |         Former |  821 (37.3%)
   |           |             |          |        Current |  553 (25.1%)
   |           |             |          |          Never |  826 (37.5%)
---+-----------+-------------+----------+----------------+-------------
19 | statin    | numeric     | 0 (0.0%) |              0 |  443 (20.1%)
   |           |             |          |              1 | 1757 (79.9%)
---+-----------+-------------+----------+----------------+-------------
20 | ace_arb   | numeric     | 0 (0.0%) |              0 |  509 (23.1%)
   |           |             |          |              1 | 1691 (76.9%)
---+-----------+-------------+----------+----------------+-------------
21 | betab     | numeric     | 0 (0.0%) |              0 | 1366 (62.1%)
   |           |             |          |              1 |  834 (37.9%)
---+-----------+-------------+----------+----------------+-------------
22 | depr_dx   | numeric     | 0 (0.0%) |              0 | 1394 (63.4%)
   |           |             |          |              1 |  806 (36.6%)
---+-----------+-------------+----------+----------------+-------------
23 | eyeex     | numeric     | 0 (0.0%) |              0 |  871 (39.6%)
   |           |             |          |              1 | 1329 (60.4%)
---+-----------+-------------+----------+----------------+-------------
24 | pneumo    | numeric     | 0 (0.0%) |              0 |  318 (14.5%)
   |           |             |          |              1 | 1882 (85.5%)
---+-----------+-------------+----------+----------------+-------------
25 | bp_good   | numeric     | 0 (0.0%) |              0 |  652 (29.6%)
   |           |             |          |              1 | 1548 (70.4%)
-----------------------------------------------------------------------

Note

Here is a basic Table 1 comparing the two exposure groups.

Code

t1 <- CreateTableOne(
    vars = c("age", "race", "hisp", "sex", "insur", 
             "nincome", "nhsgrad", "cleve", "sbp", "dbp",
             "ldl", "visits", "tobacco", "statin", 
             "ace_arb", "betab", "depr_dx", "eyeex", 
             "pneumo", "bmi", "bp_good"), 
    factorVars = c("hisp", "cleve", "statin",
                   "ace_arb", "betab", "depr_dx", 
                   "eyeex", "pneumo", "bp_good"),
    strata = "exposure", 
    data = dm2200)

t1

                     Stratified by exposure
                      A                   B                   p      test
  n                        200                2000                       
  age (mean (SD))        54.90 (8.55)        58.90 (9.83)     <0.001     
  race (%)                                                    <0.001     
     Black_AA              166 (83.0)         1065 (53.2)                
     White                  31 (15.5)          836 (41.8)                
     Other                   1 ( 0.5)           74 ( 3.7)                
     Asian                   2 ( 1.0)           25 ( 1.2)                
  hisp = 1 (%)               4 ( 2.0)          104 ( 5.2)      0.068     
  sex = M (%)              128 (64.0)          842 (42.1)     <0.001     
  insur (%)                                                   <0.001     
     Medicare               37 (18.5)          913 (45.6)                
     Commercial              7 ( 3.5)          510 (25.5)                
     Medicaid              125 (62.5)          501 (25.0)                
     Uninsured              31 (15.5)           76 ( 3.8)                
  nincome (mean (SD)) 26927.50 (11642.18) 37992.55 (20854.20) <0.001     
  nhsgrad (mean (SD))    77.74 (11.50)       82.23 (11.35)    <0.001     
  cleve = 1 (%)            181 (90.5)         1109 (55.5)     <0.001     
  sbp (mean (SD))       135.88 (20.60)      132.44 (16.17)     0.005     
  dbp (mean (SD))        82.47 (9.08)        72.41 (11.94)    <0.001     
  ldl (mean (SD))        94.34 (38.62)       97.22 (36.66)     0.293     
  visits (mean (SD))      4.81 (2.62)         3.69 (1.78)     <0.001     
  tobacco (%)                                                 <0.001     
     Former                 46 (23.0)          775 (38.8)                
     Current               108 (54.0)          445 (22.2)                
     Never                  46 (23.0)          780 (39.0)                
  statin = 1 (%)           154 (77.0)         1603 (80.2)      0.334     
  ace_arb = 1 (%)          160 (80.0)         1531 (76.6)      0.310     
  betab = 1 (%)             53 (26.5)          781 (39.1)      0.001     
  depr_dx = 1 (%)           53 (26.5)          753 (37.6)      0.002     
  eyeex = 1 (%)            104 (52.0)         1225 (61.3)      0.013     
  pneumo = 1 (%)           116 (58.0)         1766 (88.3)     <0.001     
  bmi (mean (SD))        32.83 (8.22)        35.09 (8.17)     <0.001     
  bp_good = 1 (%)          117 (58.5)         1431 (71.6)     <0.001

2 Propensity for Exposure

We’ll fit a logistic regression model to predict propensity for exposure A (as compared to B), on the basis of these 18 covariates:

age, race, hisp, sex, insur, nincome, nhsgrad,
cleve, a1c, ldl, visits, tobacco, statin,
ace_arb, betab, depr_dx, eyeex, pneumo

Practically, we might well fit something more complex than a simple model with main effects, but that’s what we’ll limit ourselves to in this setting. Note that we’re not including any direct information on either of our outcomes, or the elements that go into them. In practical work, we might fit different propensity scores for each outcome, but we’re keeping things simple here.

2.1 Fitting a Propensity Model

We’ll use the f.build tool from the cobalt package here.

Code

dm2200 <- dm2200 |>
    mutate(treat = as.logical(exposure == "A"))

covs_1 <- dm2200 |>
    select(age, race, hisp, sex, insur, nincome,
           nhsgrad, cleve, a1c, ldl, visits, tobacco,
           statin, ace_arb, betab, depr_dx, eyeex, pneumo)

prop_model <- glm(f.build("treat", covs_1), data = dm2200,
                  family = binomial)

2.2 Summarizing the Propensity Model

Code

model_parameters(prop_model, exponentiate = TRUE, ci = 0.95)

Parameter          | Odds Ratio |       SE |        95% CI |     z |      p
---------------------------------------------------------------------------
(Intercept)        |       0.01 |     0.01 | [0.00,  0.13] | -3.56 | < .001
age                |       1.01 |     0.01 | [0.99,  1.04] |  1.03 | 0.305 
race [White]       |       0.32 |     0.08 | [0.19,  0.52] | -4.44 | < .001
race [Other]       |       0.12 |     0.14 | [0.01,  0.79] | -1.86 | 0.063 
race [Asian]       |       1.05 |     1.06 | [0.10,  5.97] |  0.05 | 0.962 
hisp               |       1.04 |     0.66 | [0.26,  3.26] |  0.06 | 0.949 
sex [M]            |       2.30 |     0.44 | [1.59,  3.35] |  4.37 | < .001
insur [Commercial] |       0.38 |     0.17 | [0.15,  0.88] | -2.12 | 0.034 
insur [Medicaid]   |       4.31 |     1.11 | [2.62,  7.22] |  5.66 | < .001
insur [Uninsured]  |       9.83 |     3.41 | [4.98, 19.47] |  6.59 | < .001
nincome            |       1.00 | 9.07e-06 | [1.00,  1.00] |  0.25 | 0.803 
nhsgrad            |       1.00 | 9.53e-03 | [0.98,  1.02] | -0.42 | 0.677 
cleve              |       6.12 |     1.96 | [3.35, 11.80] |  5.65 | < .001
a1c                |       0.96 |     0.05 | [0.88,  1.05] | -0.82 | 0.412 
ldl                |       1.00 | 2.58e-03 | [0.99,  1.00] | -1.62 | 0.104 
visits             |       1.31 |     0.05 | [1.20,  1.42] |  6.47 | < .001
tobacco [Current]  |       3.11 |     0.71 | [2.00,  4.89] |  5.00 | < .001
tobacco [Never]    |       1.12 |     0.28 | [0.68,  1.83] |  0.44 | 0.657 
statin             |       1.01 |     0.23 | [0.65,  1.59] |  0.03 | 0.978 
ace arb            |       1.47 |     0.34 | [0.95,  2.32] |  1.68 | 0.093 
betab              |       0.77 |     0.16 | [0.51,  1.14] | -1.28 | 0.202 
depr dx            |       0.51 |     0.11 | [0.34,  0.77] | -3.20 | 0.001 
eyeex              |       0.72 |     0.14 | [0.50,  1.05] | -1.70 | 0.089 
pneumo             |       0.20 |     0.04 | [0.13,  0.30] | -7.65 | < .001


Uncertainty intervals (profile-likelihood) and p-values (two-tailed)
  computed using a Wald z-distribution approximation.

Code

tidy(prop_model, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) |>
  select(term, estimate, std.error, conf.low, conf.high, p.value) |>
  gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 6, color = "blue")

term	estimate	std.error	conf.low	conf.high	p.value
(Intercept)	0.011	1.268	0.001	0.128	0.000
age	1.013	0.012	0.989	1.037	0.305
raceWhite	0.323	0.255	0.193	0.524	0.000
raceOther	0.119	1.143	0.006	0.788	0.063
raceAsian	1.049	1.011	0.105	5.972	0.962
hisp	1.041	0.633	0.260	3.263	0.949
sexM	2.296	0.190	1.586	3.346	0.000
insurCommercial	0.383	0.452	0.146	0.881	0.034
insurMedicaid	4.312	0.258	2.622	7.223	0.000
insurUninsured	9.831	0.347	4.985	19.467	0.000
nincome	1.000	0.000	1.000	1.000	0.803
nhsgrad	0.996	0.010	0.978	1.015	0.677
cleve	6.121	0.320	3.347	11.802	0.000
a1c	0.962	0.047	0.877	1.053	0.412
ldl	0.996	0.003	0.991	1.001	0.104
visits	1.307	0.041	1.205	1.418	0.000
tobaccoCurrent	3.108	0.227	2.004	4.886	0.000
tobaccoNever	1.117	0.250	0.684	1.826	0.657
statin	1.006	0.228	0.648	1.586	0.978
ace_arb	1.468	0.229	0.946	2.325	0.093
betab	0.771	0.204	0.514	1.145	0.202
depr_dx	0.515	0.208	0.340	0.769	0.001
eyeex	0.723	0.191	0.497	1.051	0.089
pneumo	0.200	0.211	0.132	0.301	0.000

Code

glance(prop_model) |> gt()

null.deviance	df.null	logLik	AIC	BIC	deviance	df.residual	nobs
1340.399	2199	-415.85	879.7	1016.409	831.7	2176	2200

Code

model_performance(prop_model)

# Indices of model performance

AIC     |    AICc |      BIC | Tjur's R2 |  RMSE | Sigma | Log_loss | Score_log
-------------------------------------------------------------------------------
879.700 | 880.252 | 1016.409 |     0.329 | 0.237 | 1.000 |    0.189 |      -Inf

AIC     | Score_spherical |   PCP
---------------------------------
879.700 |           0.012 | 0.889

2.3 Storing the Propensity Scores

Code

dm2200 <- dm2200 |>
    mutate(ps = prop_model$fitted,
           linps = prop_model$linear.predictors)

2.4 Comparing Propensity Scores

Code

ggplot(dm2200, aes(x = exposure, y = ps)) +
  geom_boxplot(width = 0.2, col = "red", fill = "yellow") +
  geom_jitter(width = 0.4, alpha = 0.2, col = "blue") +
  stat_summary(fun = mean, geom = "point", col = "black", size = 2.5) +
  labs(x = "Exposure Group", y = "Propensity Score (for Exposure A)")

Code

ggplot(dm2200, aes(x = exposure, y = linps)) +
  geom_violin(fill = "azure") +
  geom_boxplot(width = 0.3) + 
  stat_summary(fun = mean, geom = "point", col = "red", size = 2.5) +
  labs(x = "Exposure Group", y = "Linear Propensity Score")

3 `match_1` 1:1 greedy matching without replacement with the `Matching` package

We’re going to match on the linear propensity score, and define our treat (treatment) as occurring when exposure is A.

Code

match_1 <- Match(Tr = dm2200$treat, X = dm2200$linps, 
                 M = 1, replace = FALSE, ties = FALSE,
                 estimand = "ATT")

summary(match_1)


Estimate...  0 
SE.........  0 
T-stat.....  NaN 
p.val......  NA 

Original number of observations..............  2200 
Original number of treated obs...............  200 
Matched number of observations...............  200 
Matched number of observations  (unweighted).  200

3.1 ATT vs. ATE vs. ATC estimates

Note that in each of the matched samples we build, we’ll focus on ATT estimates (average treated effect on the treated) rather than ATE estimates. This means that in our matching we’re trying to mirror the population represented by the “treated” sample we observed.

To obtain ATE estimates rather than ATT with the Match function from the Matching package, use estimand = "ATE" in the process of developing the matched sample.
To obtain ATC estimates (average treatment effect on the controls), use estimand = "ATC".

I encourage the use of ATT estimates in your projects, where possible. I suggest also that you define the “treated” group (the one that the propensity score is estimating) to be the smaller of the two groups you have, to facilitate this approach. If you estimate ATE or ATC instead of ATT, of course, you are answering a different question than what ATT resolves.

3.2 Obtaining the Matched Sample

Now, we build a new matched sample data frame in order to do some of the analyses to come. This will contain only the matched subjects.

Code

match1_matches <- factor(rep(match_1$index.treated, 2))
dm2200_matched1 <- cbind(match1_matches, 
                         dm2200[c(match_1$index.control, 
                                  match_1$index.treated),])

Some sanity checks:

Code

dm2200_matched1 |> count(exposure)

  exposure   n
1        A 200
2        B 200

Code

dm2200_matched1 |> head()

  match1_matches subject exposure age     race hisp sex    insur nincome
1              1  S-0236        B  74 Black_AA    0   M Medicare   28900
2              4  S-1650        B  48 Black_AA    0   M Medicaid   22100
3             17  S-0834        B  61 Black_AA    0   M Medicaid   31500
4             26  S-1232        B  69 Black_AA    0   F Medicare   21700
5             27  S-0678        B  36 Black_AA    0   F Medicaid   16600
6             37  S-1007        B  49 Black_AA    0   M Medicaid   29800
  nhsgrad cleve height_cm weight_kg  bmi  a1c sbp dbp ldl visits tobacco statin
1      75     1       178       105 33.1  8.4 138  79  81      2  Former      1
2      80     1       172       134 45.3  5.1  98  66 121      3   Never      1
3      80     1       183        94 28.1  6.5 127  79 137      3 Current      1
4      77     1       168        65 23.0  5.9 101  70  76      2 Current      1
5      79     1       160        80 31.3 14.5 142  95 157      5 Current      1
6      82     1       170        74 25.6 12.1 138  80 126      3 Current      1
  ace_arb betab depr_dx eyeex pneumo bp_good treat         ps      linps
1       0     0       0     1      1       1 FALSE 0.03787384 -3.2348848
2       1     0       0     0      0       1 FALSE 0.62855155  0.5260079
3       1     0       1     0      1       1 FALSE 0.33963493 -0.6649215
4       1     0       0     1      1       1 FALSE 0.07464006 -2.5175054
5       0     1       0     1      0       0 FALSE 0.40466703 -0.3860563
6       1     0       0     0      1       1 FALSE 0.41799059 -0.3310277

3.3 Checking Covariate Balance for our 1:1 Greedy Match

3.3.1 Using `bal.tab` to obtain a balance table

Code

covs_1plus <- dm2200 |>
    select(age, race, hisp, sex, insur, nincome,
           nhsgrad, cleve, a1c, ldl, visits, tobacco,
           statin, ace_arb, betab, depr_dx, eyeex, pneumo,
           ps, linps)

bal1 <- bal.tab(match_1,
                treat = dm2200$exposure,
                covs = covs_1plus, quick = FALSE,
                data = dm2200, stats = c("m", "v"),
                un = TRUE, disp.v.ratio = TRUE)
bal1

Balance Measures
                    Type Diff.Un V.Ratio.Un Diff.Adj V.Ratio.Adj
age              Contin.  0.4076     1.3195  -0.0209      1.3292
race_Black_AA     Binary -0.2975          .  -0.0050           .
race_White        Binary  0.2630          .   0.0100           .
race_Other        Binary  0.0320          .  -0.0050           .
race_Asian        Binary  0.0025          .   0.0000           .
hisp              Binary  0.0320          .  -0.0050           .
sex_M             Binary -0.2190          .  -0.0400           .
insur_Medicare    Binary  0.2715          .   0.0400           .
insur_Commercial  Binary  0.2200          .  -0.0050           .
insur_Medicaid    Binary -0.3745          .  -0.0050           .
insur_Uninsured   Binary -0.1170          .  -0.0300           .
nincome          Contin.  0.5306     3.2086  -0.0270      1.3512
nhsgrad          Contin.  0.3958     0.9727  -0.0405      1.0954
cleve             Binary -0.3505          .   0.0250           .
a1c              Contin. -0.0419     0.8190  -0.0189      1.0321
ldl              Contin.  0.0783     0.9012  -0.0038      0.9641
visits           Contin. -0.6304     0.4602  -0.3063      0.7620
tobacco_Former    Binary  0.1575          .   0.0350           .
tobacco_Current   Binary -0.3175          .  -0.0100           .
tobacco_Never     Binary  0.1600          .  -0.0250           .
statin            Binary  0.0315          .  -0.0350           .
ace_arb           Binary -0.0345          .  -0.0450           .
betab             Binary  0.1255          .   0.0050           .
depr_dx           Binary  0.1115          .   0.0200           .
eyeex             Binary  0.0925          .   0.0650           .
pneumo            Binary  0.3030          .   0.0850           .
ps               Contin. -2.9189     0.1576  -0.7387      0.4791
linps            Contin. -1.7842     1.3395  -0.2127      0.5378

Sample sizes
            A    B
All       200 2000
Matched   200  200
Unmatched   0 1800

3.3.2 Checking Rubin’s Rules 1 and 2

We’ll build a little table of the Rubin’s Rules (1 and 2) before and after our match_1 is applied.

Code

covs_for_rubin <- dm2200 |>
    select(linps)

rubin_m1 <- bal.tab(match_1,
                treat = dm2200$treat,
                covs = covs_for_rubin,
                data = dm2200, stats = c("m", "v"),
                un = TRUE, disp.v.ratio = TRUE)[1]

rubin_report_m1 <- tibble(
    status = c("Rule1", "Rule2"),
    Unmatched = c(rubin_m1$Balance$Diff.Un,
                  rubin_m1$Balance$V.Ratio.Un),
    Matched = c(rubin_m1$Balance$Diff.Adj,
               rubin_m1$Balance$V.Ratio.Adj))

rubin_report_m1 |> 
  gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 2, color = "blue")

status	Unmatched	Matched
Rule1	2.065	0.246
Rule2	0.747	1.859

The Rule 1 results tell us about the standardized differences expressed as proportions, so we’d like to be certain that our results are as close to zero as possible, and definitely below 0.5 in absolute value.
- Multiply these by 100 to describe them as percentages, adjusting the cutoff to below 50 in absolute value.
- Here, before matching we have a bias of 206.5005338%, and this is reduced to 24.620512% after 1:1 greedy matching.
The Rule 2 results tell us about the variance ratio of the linear propensity scores. We want this to be within (0.5, 2) and ideally within (0.8, 1.25).
- Here, before matching we have a variance ratio of 74.6560482%, and this becomes 185.9473698% after 1:1 greedy matching.

3.3.3 Using `bal.plot` from `cobalt`

We can look at any particular variable with this approach, for example, age:

Code

bal.plot(match_1,
         treat = dm2200$exposure,
         covs = covs_1plus,
         var.name = "age", 
         which = "both",
         sample.names = 
             c("Unmatched Sample", "Matched Sample"))

We could also look at the propensity scores in each group, perhaps in mirrored histograms, with …

Code

bal.plot(match_1,
         treat = dm2200$exposure,
         covs = covs_1plus,
         var.name = "ps", 
         which = "both",
         sample.names = 
             c("Unmatched Sample", "Matched Sample"),
         type = "histogram", mirror = TRUE)

Can we look at a categorical variable this way?

Code

bal.plot(match_1,
         treat = dm2200$exposure,
         covs = covs_1plus,
         var.name = "insur", 
         which = "both",
         sample.names = 
             c("Unmatched Sample", "Matched Sample"))

3.3.4 Using `love.plot` to look at Standardized Differences

Code

love.plot(bal1, 
          threshold = .1, size = 3,
          var.order = "unadjusted",
          stats = "mean.diffs",
          stars = "raw",
          sample.names = c("Unmatched", "Matched"),
          title = "Love Plot for our 1:1 Match") +
    labs(caption = "* indicates raw mean differences (for binary variables)")

Code

love.plot(bal1, 
          threshold = .1, size = 3,
          var.order = "unadjusted",
          stats = "mean.diffs",
          stars = "raw",
          abs = TRUE,
          sample.names = c("Unmatched", "Matched"),
          title = "Absolute Differences for 1:1 Match") +
    labs(caption = "* indicates raw mean differences (for binary variables)")

3.3.5 Using `love.plot` to look at Variance Ratios

Note that this will only include the variables (and summaries like ps and linps) that describe quantities. Categorical variables are dropped.

Code

love.plot(bal1, 
          threshold = .5, size = 3,
          stats = "variance.ratios",
          sample.names = c("Unmatched", "Matched"),
          title = "Variance Ratios for our 1:1 Match")

4 `match_2` 1:2 greedy matching without replacement with the `Matching` package

Again, we’ll match on the linear propensity score, and define our treat (treatment) as occurring when exposure is A. The only difference will be that we’ll allow each subject with exposure A to be matched to exactly two subjects with exposure B.

Code

match_2 <- Match(Tr = dm2200$treat, X = dm2200$linps, 
                 M = 2, replace = FALSE, ties = FALSE,
                 estimand = "ATT")

summary(match_2)


Estimate...  0 
SE.........  0 
T-stat.....  NaN 
p.val......  NA 

Original number of observations..............  2200 
Original number of treated obs...............  200 
Matched number of observations...............  200 
Matched number of observations  (unweighted).  400

Note that we now have 400 matched exposure “B” subjects in our matched sample.

4.1 Obtaining the Matched Sample

As before,

Code

match2_matches <- factor(rep(match_2$index.treated, 2))
dm2200_matched2 <- cbind(match2_matches, 
                         dm2200[c(match_2$index.control, 
                                  match_2$index.treated),])

How many unique subjects are in our matched sample?

Code

n_distinct(dm2200_matched2$subject)

[1] 600

This match repeats each exposure A subject twice, to match up with the 400 exposure B subjects.

Code

dm2200_matched2 |> count(exposure)

  exposure   n
1        A 400
2        B 400

Code

dm2200_matched2 |> count(subject, exposure) |> head()

  subject exposure n
1  S-0001        A 2
2  S-0004        A 2
3  S-0007        B 1
4  S-0008        B 1
5  S-0014        B 1
6  S-0017        A 2

4.2 Checking Covariate Balance for our 1:2 Greedy Match

4.2.1 Using `bal.tab` to obtain a balance table

Code

covs_2plus <- dm2200 |>
    select(age, race, hisp, sex, insur, nincome,
           nhsgrad, cleve, a1c, ldl, visits, tobacco,
           statin, ace_arb, betab, depr_dx, eyeex, pneumo,
           ps, linps)

bal2 <- bal.tab(match_2,
                treat = dm2200$exposure,
                covs = covs_2plus, quick = FALSE,
                data = dm2200, stats = c("m", "v"),
                un = TRUE, disp.v.ratio = TRUE)
bal2

Balance Measures
                    Type Diff.Un V.Ratio.Un Diff.Adj V.Ratio.Adj
age              Contin.  0.4076     1.3195   0.0804      1.2519
race_Black_AA     Binary -0.2975          .  -0.0175           .
race_White        Binary  0.2630          .   0.0175           .
race_Other        Binary  0.0320          .  -0.0025           .
race_Asian        Binary  0.0025          .   0.0025           .
hisp              Binary  0.0320          .   0.0025           .
sex_M             Binary -0.2190          .  -0.0750           .
insur_Medicare    Binary  0.2715          .   0.0925           .
insur_Commercial  Binary  0.2200          .  -0.0050           .
insur_Medicaid    Binary -0.3745          .  -0.0400           .
insur_Uninsured   Binary -0.1170          .  -0.0475           .
nincome          Contin.  0.5306     3.2086  -0.0002      1.4755
nhsgrad          Contin.  0.3958     0.9727   0.0170      1.0717
cleve             Binary -0.3505          .  -0.0175           .
a1c              Contin. -0.0419     0.8190  -0.0307      0.8214
ldl              Contin.  0.0783     0.9012   0.0126      0.9109
visits           Contin. -0.6304     0.4602  -0.3204      0.7619
tobacco_Former    Binary  0.1575          .   0.0600           .
tobacco_Current   Binary -0.3175          .  -0.1025           .
tobacco_Never     Binary  0.1600          .   0.0425           .
statin            Binary  0.0315          .   0.0175           .
ace_arb           Binary -0.0345          .  -0.0150           .
betab             Binary  0.1255          .   0.0300           .
depr_dx           Binary  0.1115          .   0.0525           .
eyeex             Binary  0.0925          .   0.0425           .
pneumo            Binary  0.3030          .   0.1525           .
ps               Contin. -2.9189     0.1576  -1.4656      0.3428
linps            Contin. -1.7842     1.3395  -0.4322      0.4092

Sample sizes
            A    B
All       200 2000
Matched   200  400
Unmatched   0 1600

4.2.2 Checking Rubin’s Rules 1 and 2

We’ll build a little table of the Rubin’s Rules (1 and 2) before and after our 1:2 greedy match_2 is applied, and compare these to the results we found in match_1 (the 1:1 match).

Code

covs_for_rubin <- dm2200 |>
    select(linps)

rubin_m2 <- bal.tab(match_2,
                treat = dm2200$treat,
                covs = covs_for_rubin, 
                data = dm2200, stats = c("m", "v"),
                un = TRUE, disp.v.ratio = TRUE)[1]

rubin_report_m12 <- tibble(
    status = c("Rule1", "Rule2"),
    Unmatched = c(rubin_m2$Balance$Diff.Un,
                  rubin_m2$Balance$V.Ratio.Un),
    Match1 = c(rubin_m1$Balance$Diff.Adj,
               rubin_m1$Balance$V.Ratio.Adj),
    Match2 = c(rubin_m2$Balance$Diff.Adj,
               rubin_m2$Balance$V.Ratio.Adj))

rubin_report_m12 |> 
  gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 2, color = "blue")

status	Unmatched	Match1	Match2
Rule1	2.065	0.246	0.500
Rule2	0.747	1.859	2.444

Again, we’d like to see Rule 1 as close to zero as possible, and definitely below 0.5 in absolute value. Unsurprisingly, when we have to match two exposure B subjects to each exposure A subject, we don’t get matches that are as close.
The Rule 2 results tell us about the variance ratio of the linear propensity scores. We want this to be within (0.5, 2) and ideally within (0.8, 1.25). Again, here the results are a bit disappointing in comparison to what we saw in our 1:1 match.

4.2.3 Using `bal.plot` from `cobalt`

Looking at the propensity scores in each group, perhaps in mirrored histograms, we have …

Code

bal.plot(match_2,
         treat = dm2200$exposure,
         covs = covs_2plus,
         var.name = "ps", 
         which = "both",
         sample.names = 
             c("Unmatched Sample", "match_2 Sample"),
         type = "histogram", mirror = TRUE)

4.2.4 Using `love.plot` to look at Standardized Differences

Code

love.plot(bal2, 
          threshold = .1, size = 3,
          var.order = "unadjusted",
          stats = "mean.diffs",
          stars = "raw",
          sample.names = c("Unmatched", "Matched"),
          title = "Love Plot for our 1:2 Match") +
    labs(caption = "* indicates raw mean differences (for binary variables)")

4.2.5 Using `love.plot` to look at Variance Ratios

Again, the categorical variables are dropped.

Code

love.plot(bal2, 
          threshold = .5, size = 3,
          stats = "variance.ratios",
          sample.names = c("Unmatched", "Matched"),
          title = "Variance Ratios for our 1:2 Match")

5 `match_3` 1:3 matching, with replacement with the `Matching` package

Again, we’ll match on the linear propensity score, and define our treat (treatment) as occurring when exposure is A. But now, we’ll match with replacement (which means that multiple subject with exposure A can be matched to the same subject with exposure B) and we’ll also match each subject with exposure A to be matched to exactly three subjects with exposure B.

Code

match_3 <- Match(Tr = dm2200$treat, X = dm2200$linps, 
                 M = 3, replace = TRUE, ties = FALSE,
                 estimand = "ATT")

summary(match_3)


Estimate...  0 
SE.........  0 
T-stat.....  NaN 
p.val......  NA 

Original number of observations..............  2200 
Original number of treated obs...............  200 
Matched number of observations...............  200 
Matched number of observations  (unweighted).  600

Note that we now have 600 matched exposure “B” subjects in our matched sample.

5.1 Obtaining the Matched Sample

As before,

Code

match3_matches <- factor(rep(match_3$index.treated, 2))
dm2200_matched3 <- cbind(match3_matches, 
                         dm2200[c(match_3$index.control, 
                                  match_3$index.treated),])

If this was being done without replacement, this would repeat each exposure A subject three times, to match up with the 600 exposure B subjects. But here, we have a different result.

How many unique subjects are in our matched sample?

Code

n_distinct(dm2200_matched3$subject)

[1] 504

How many of those are in Exposure A?

Code

temp <- dm2200_matched3 |> filter(exposure == "A") 
n_distinct(temp$subject)

[1] 200

How many of those are in Exposure B?

Code

temp <- dm2200_matched3 |> filter(exposure == "B")
n_distinct(temp$subject)

[1] 304

Among those exposure A subjects, how many times were they used in the matches?

Code

dm2200_matched3 |> filter(exposure == "A") |> 
    count(subject) |>
    tabyl(n)

 n n_n percent
 3 200       1

Among those exposure B subjects, how many times were they used in the matches?

Code

dm2200_matched3 |> filter(exposure == "B") |> 
    count(subject) |>
    tabyl(n)

  n n_n     percent
  1 198 0.651315789
  2  62 0.203947368
  3  19 0.062500000
  4   7 0.023026316
  5   4 0.013157895
  6   2 0.006578947
  7   1 0.003289474
  8   1 0.003289474
  9   2 0.006578947
 10   2 0.006578947
 13   2 0.006578947
 15   1 0.003289474
 20   1 0.003289474
 23   1 0.003289474
 24   1 0.003289474

5.2 Checking Covariate Balance for our 1:3 Match

5.2.1 Using `bal.tab` to obtain a balance table

Code

covs_3plus <- dm2200 |>
    select(age, race, hisp, sex, insur, nincome,
           nhsgrad, cleve, a1c, ldl, visits, tobacco,
           statin, ace_arb, betab, depr_dx, eyeex, pneumo,
           ps, linps)

bal3 <- bal.tab(match_3,
                treat = dm2200$exposure,
                covs = covs_3plus, quick = FALSE,
                data = dm2200, stats = c("m", "v"),
                un = TRUE, disp.v.ratio = TRUE)
bal3

Balance Measures
                    Type Diff.Un V.Ratio.Un Diff.Adj V.Ratio.Adj
age              Contin.  0.4076     1.3195  -0.0845      1.2582
race_Black_AA     Binary -0.2975          .   0.0300           .
race_White        Binary  0.2630          .  -0.0250           .
race_Other        Binary  0.0320          .  -0.0033           .
race_Asian        Binary  0.0025          .  -0.0017           .
hisp              Binary  0.0320          .  -0.0083           .
sex_M             Binary -0.2190          .   0.0133           .
insur_Medicare    Binary  0.2715          .   0.0100           .
insur_Commercial  Binary  0.2200          .  -0.0033           .
insur_Medicaid    Binary -0.3745          .  -0.0283           .
insur_Uninsured   Binary -0.1170          .   0.0217           .
nincome          Contin.  0.5306     3.2086   0.0254      1.3567
nhsgrad          Contin.  0.3958     0.9727  -0.0179      0.8844
cleve             Binary -0.3505          .  -0.0033           .
a1c              Contin. -0.0419     0.8190  -0.0539      1.0273
ldl              Contin.  0.0783     0.9012   0.1121      1.0653
visits           Contin. -0.6304     0.4602  -0.0646      0.8624
tobacco_Former    Binary  0.1575          .   0.0450           .
tobacco_Current   Binary -0.3175          .  -0.0100           .
tobacco_Never     Binary  0.1600          .  -0.0350           .
statin            Binary  0.0315          .  -0.0150           .
ace_arb           Binary -0.0345          .  -0.0083           .
betab             Binary  0.1255          .  -0.0250           .
depr_dx           Binary  0.1115          .   0.0167           .
eyeex             Binary  0.0925          .   0.0017           .
pneumo            Binary  0.3030          .   0.0167           .
ps               Contin. -2.9189     0.1576  -0.0386      0.9512
linps            Contin. -1.7842     1.3395  -0.0268      0.8894

Sample sizes
                       A       B
All                  200 2000.  
Matched (ESS)        200  104.53
Matched (Unweighted) 200  304.  
Unmatched              0 1696.

5.2.2 Checking Rubin’s Rules 1 and 2

We’ll build a little table of the Rubin’s Rules (1 and 2) before and after our 1:2 greedy match_2 is applied, and compare these to the results we found in match_1 (the 1:1 match).

Code

covs_for_rubin <- dm2200 |>
    select(linps)

rubin_m3 <- bal.tab(match_3,
                treat = dm2200$treat,
                covs = covs_for_rubin, 
                data = dm2200, stats = c("m", "v"),
                un = TRUE, disp.v.ratio = TRUE)[1]

rubin_report_m123 <- tibble(
    status = c("Rule1", "Rule2"),
    Unmatched = c(rubin_m2$Balance$Diff.Un,
                  rubin_m2$Balance$V.Ratio.Un),
    Match1 = c(rubin_m1$Balance$Diff.Adj,
               rubin_m1$Balance$V.Ratio.Adj),
    Match2 = c(rubin_m2$Balance$Diff.Adj,
               rubin_m2$Balance$V.Ratio.Adj),
    Match3 = c(rubin_m3$Balance$Diff.Adj,
               rubin_m3$Balance$V.Ratio.Adj))


rubin_report_m123 |> 
  gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 2, color = "blue")

status	Unmatched	Match1	Match2	Match3
Rule1	2.065	0.246	0.500	0.031
Rule2	0.747	1.859	2.444	1.124

Again, we’d like to see Rule 1 results as close to zero as possible, and definitely below 0.5 in absolute value.
In Rule 2, we want the variance ratio of the linear propensity scores to be within (0.5, 2) and ideally within (0.8, 1.25).
It appears that (in these data) allowing the same exposure B subject to be used for multiple matches (matching with replacement) more than makes up for the fact that matching 3 exposure B’s for each exposure A (1:3 matching) is a tougher job than pair (1:1) matching, as seen in the results for Rubin’s Rule 1 and Rule 2.

5.2.3 Using `bal.plot` from `cobalt`

Looking at the propensity scores in each group, perhaps in mirrored histograms, we have …

Code

bal.plot(match_3,
         treat = dm2200$exposure,
         covs = covs_3plus,
         var.name = "ps", 
         which = "both",
         sample.names = 
             c("Unmatched Sample", "match_3 Sample"),
         type = "histogram", mirror = TRUE)

5.2.4 Using `love.plot` to look at Standardized Differences

Code

love.plot(bal3, 
          threshold = .1, size = 3,
          var.order = "unadjusted",
          stats = "mean.diffs",
          stars = "raw",
          abs = TRUE,
          sample.names = c("Unmatched", "Matched"),
          title = "Love Plot of |Mean Differences| for our 1:3 Match") +
    labs(caption = "* indicates raw mean differences (for binary variables)")

5.2.5 Using `love.plot` to look at Variance Ratios

Again, the categorical variables are dropped.

Code

love.plot(bal3, 
          threshold = .5, size = 3,
          stats = "variance.ratios",
          sample.names = c("Unmatched", "Matched"),
          title = "Variance Ratios for our 1:3 Match")

6 `match_4` Caliper Matching (1:1 without replacement) with the `Matching` package

The Match function in the Matching package allows you to specify a caliper. From the Matching help file:

A caliper is the maximum acceptable distance (on a covariate) which we are willing to accept in any match. Observations for which we cannot find a match within the caliper are dropped.Dropping observations generally changes the quantity being estimated.
The caliper is interpreted to be in standardized units. For example, caliper=.25 means that all matches not equal to or within .25 standard deviations of each covariate in X are dropped, and not matched.
- If a scalar caliper is provided to the caliper setting in the Match function, this caliper is used for all covariates in X.
- If a vector of calipers is provided, a caliper value should be provided for each covariate in X.

We’ll again perform a 1:1 match without replacement, but now we’ll do so while only accepting matches where the linear propensity score of each match is within 0.2 standard deviations of the linear PS.

Code

match_4 <- Match(Tr = dm2200$treat, X = dm2200$linps, 
                 M = 1, replace = FALSE, ties = FALSE,
                 caliper = 0.2, estimand = "ATT")

summary(match_4)


Estimate...  0 
SE.........  0 
T-stat.....  NaN 
p.val......  NA 

Original number of observations..............  2200 
Original number of treated obs...............  200 
Matched number of observations...............  162 
Matched number of observations  (unweighted).  162 

Caliper (SDs)........................................   0.2 
Number of obs dropped by 'exact' or 'caliper'  38

Note that we have now dropped 38 of the exposure “A” subjects, and reduced our sample to the 168 remaining exposure “A” subjects, who are paired with 162 unique matched exposure “B” subjects in our matched sample.

6.1 Obtaining the Matched Sample

As before,

Code

match4_matches <- factor(rep(match_4$index.treated, 2))
dm2200_matched4 <- cbind(match4_matches, 
                         dm2200[c(match_4$index.control, 
                                  match_4$index.treated),])

How many unique subjects are in our matched sample?

Code

n_distinct(dm2200_matched4$subject)

[1] 324

This match includes 162 pairs so 324 subjects, since we’ve done matching without replacement.

Code

dm2200_matched4 |> count(exposure)

  exposure   n
1        A 162
2        B 162

6.2 Checking Covariate Balance for our 1:1 Caliper Match

6.2.1 Using `bal.tab` to obtain a balance table

Code

covs_4plus <- dm2200 |>
    select(age, race, hisp, sex, insur, nincome,
           nhsgrad, cleve, a1c, ldl, visits, tobacco,
           statin, ace_arb, betab, depr_dx, eyeex, pneumo,
           ps, linps)

bal4 <- bal.tab(match_4,
                treat = dm2200$exposure,
                covs = covs_4plus, quick = FALSE,
                data = dm2200, stats = c("m", "v"),
                un = TRUE, disp.v.ratio = TRUE)
bal4

Balance Measures
                    Type Diff.Un V.Ratio.Un Diff.Adj V.Ratio.Adj
age              Contin.  0.4076     1.3195  -0.0069      1.1267
race_Black_AA     Binary -0.2975          .  -0.0062           .
race_White        Binary  0.2630          .   0.0123           .
race_Other        Binary  0.0320          .   0.0000           .
race_Asian        Binary  0.0025          .  -0.0062           .
hisp              Binary  0.0320          .  -0.0062           .
sex_M             Binary -0.2190          .   0.0000           .
insur_Medicare    Binary  0.2715          .   0.0247           .
insur_Commercial  Binary  0.2200          .  -0.0185           .
insur_Medicaid    Binary -0.3745          .  -0.0062           .
insur_Uninsured   Binary -0.1170          .   0.0000           .
nincome          Contin.  0.5306     3.2086  -0.0071      1.5352
nhsgrad          Contin.  0.3958     0.9727  -0.0501      1.2527
cleve             Binary -0.3505          .   0.0062           .
a1c              Contin. -0.0419     0.8190  -0.0732      0.8675
ldl              Contin.  0.0783     0.9012   0.0115      0.9169
visits           Contin. -0.6304     0.4602  -0.1353      1.1436
tobacco_Former    Binary  0.1575          .   0.0062           .
tobacco_Current   Binary -0.3175          .   0.0370           .
tobacco_Never     Binary  0.1600          .  -0.0432           .
statin            Binary  0.0315          .   0.0000           .
ace_arb           Binary -0.0345          .  -0.0309           .
betab             Binary  0.1255          .   0.0679           .
depr_dx           Binary  0.1115          .  -0.0432           .
eyeex             Binary  0.0925          .  -0.0123           .
pneumo            Binary  0.3030          .   0.0000           .
ps               Contin. -2.9189     0.1576  -0.0309      0.9599
linps            Contin. -1.7842     1.3395  -0.0073      0.9775

Sample sizes
            A    B
All       200 2000
Matched   162  162
Unmatched   0 1838
Discarded  38    0

6.2.2 Checking Rubin’s Rules 1 and 2

We’ll build a little table of the Rubin’s Rules (1 and 2) before and after our 1:2 greedy match_4 is applied, and compare these to the results we found in match_1 (the 1:1 match).

Code

covs_for_rubin <- dm2200 |>
    select(linps)

rubin_m4 <- bal.tab(match_4,
                treat = dm2200$treat,
                covs = covs_for_rubin, 
                data = dm2200, stats = c("m", "v"),
                un = TRUE, disp.v.ratio = TRUE)[1]

rubin_report_m1234 <- tibble(
    status = c("Rule1", "Rule2"),
    Unmatched = c(rubin_m2$Balance$Diff.Un,
                  rubin_m2$Balance$V.Ratio.Un),
    Match1 = c(rubin_m1$Balance$Diff.Adj,
               rubin_m1$Balance$V.Ratio.Adj),
    Match2 = c(rubin_m2$Balance$Diff.Adj,
               rubin_m2$Balance$V.Ratio.Adj),
    Match3 = c(rubin_m3$Balance$Diff.Adj,
               rubin_m3$Balance$V.Ratio.Adj),
    Match4 = c(rubin_m4$Balance$Diff.Adj,
               rubin_m4$Balance$V.Ratio.Adj))

rubin_report_m1234 |> gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 2, color = "blue")

status	Unmatched	Match1	Match2	Match3	Match4
Rule1	2.065	0.246	0.500	0.031	0.008
Rule2	0.747	1.859	2.444	1.124	1.023

This approach produces an exceptionally strong match in terms of balance, with Rubin’s Rule 1 being very close to 0, and Rule 2 being very close to 1.
Unfortunately, we’ve only done this by dropping the 38 “hardest to match” subjects receiving exposure “A”.

6.2.3 Using `bal.plot` from `cobalt`

Looking at the propensity scores in each group, perhaps in mirrored histograms, we have …

Code

bal.plot(match_4,
         treat = dm2200$exposure,
         covs = covs_4plus,
         var.name = "ps", 
         which = "both",
         sample.names = 
             c("Unmatched Sample", "match_4 Sample"),
         type = "histogram", mirror = TRUE)

6.2.4 Using `love.plot` to look at Standardized Differences

Code

love.plot(bal4, 
          threshold = .1, size = 3,
          var.order = "unadjusted",
          stats = "mean.diffs",
          stars = "raw",
          sample.names = c("Unmatched", "Matched"),
          title = "Love Plot for our 1:1 Caliper Match") +
    labs(caption = "* indicates raw mean differences (for binary variables)")

6.2.5 Using `love.plot` to look at Variance Ratios

Again, the categorical variables are dropped.

Code

love.plot(bal4, 
          threshold = .5, size = 3,
          stats = "variance.ratios",
          sample.names = c("Unmatched", "Matched"),
          title = "Variance Ratios for our 1:1 Caliper Match")

7 `match_5` 1:1 Nearest Neighbor Matching with `Matchit`

The MatchIt package implements the suggestions of Ho, Imai, King, and Stuart (2007) for improving parametric statistical models by pre-processing data with nonparametric matching methods. Read more about MatchIt at https://gking.harvard.edu/matchit.

With MatchIt, the matching is done using the matchit(treat ~ X, ...) function, where treat is the vector of treatment assignments and X are the covariates to be used in the matching.

We’ll start our exploration of the MatchIt approach to developing matches with nearest neighbor matching, which (quoting the MatchIt manual…)

… selects the r (default = 1, specified by the ratio option) best control matches for each individual in the treated group, using a distance measure specified by the distance option (default = logit). Matches are chosen for each treated unit one at a time, with the order specified by the m.order command (default=largest to smallest). At each matching step we choose the control unit that is not yet matched but is closest to the treated unit on the distance measure.

The full syntax (with default choices indicated) is

matchit(formula, data=NULL, discard=0, exact=FALSE, 
         replace=FALSE, ratio=1, model="logit", 
         reestimate=FALSE, nearest=TRUE, m.order=2, 
         caliper=0, calclosest=FALSE, mahvars=NULL, 
         subclass=0, sub.by="treat", counter=TRUE, 
         full=FALSE, full.options=list(), ...)

Here, we’ll match on the linear propensity score (by specifying the distance to use the logistic link with the linear propensity score via "linear.logit"), and we’ll perform 1:1 nearest neighbor matching without replacement using the default ordering (largest to smallest). Since we’ve already seen that greedy 1:1 matching without replacement doesn’t work well in this setting, we’re not expecting a strong result in terms of balance here, either.

Code

dm2200 <- dm2200 |>
    mutate(treat = as.logical(exposure == "A"))

covs_1 <- dm2200 |>
    select(age, race, hisp, sex, insur, nincome,
           nhsgrad, cleve, a1c, ldl, visits, tobacco,
           statin, ace_arb, betab, depr_dx, eyeex, pneumo)

match_5 <- matchit(f.build("treat", covs_1), data = dm2200,
                   distance = "glm", link = "linear.logit",
                   method = "nearest", 
                   ratio = 1, replace = FALSE)

match_5

A `matchit` object
 - method: 1:1 nearest neighbor matching without replacement
 - distance: Propensity score
             - estimated with logistic regression and linearized
 - number of obs.: 2200 (original), 400 (matched)
 - target estimand: ATT
 - covariates: age, race, hisp, sex, insur, nincome, nhsgrad, cleve, a1c, ldl, visits, tobacco, statin, ace_arb, betab, depr_dx, eyeex, pneumo

7.1 Obtaining the Matched Sample

There is just one tricky part to doing this in MatchIt. The main work can be done with a very simple command.

Code

dm2200_matched5 <- match.data(match_5)

This leaves only the job of creating a matching number, for which we have to develop some additional R code.

Code

# Thanks to Robert McDonald at Mayo
# https://lists.gking.harvard.edu/pipermail/matchit/2012-April/000458.html

len <- dim(dm2200)[1]
matchx <- rep(NA,len)
len2 <- length(match_5$match.matrix)
count <- 1
for(i in 1:len2){
    
    match1 <- match_5$match.matrix[i]
    match2 <- row.names(match_5$match.matrix)[i]
    
    if(!is.na(match1)){
    matchx[as.numeric(match1)] <- count
    matchx[as.numeric(match2)] <- count
    count <- count+1}
    
}

dm2200_matched5 <- dm2200_matched5 |>
    mutate(match5_matches = 
               matchx[as.numeric
                      (row.names(dm2200_matched5))]) |>
    mutate(match5_matches = factor(match5_matches))

How many unique subjects are in our matched sample?

Code

n_distinct(dm2200_matched5$subject)

[1] 400

This match includes 200 pairs so 400 subjects, as we’ve done matching without replacement.

Code

dm2200_matched5 |> count(exposure)

# A tibble: 2 × 2
  exposure     n
  <fct>    <int>
1 A          200
2 B          200

7.2 Checking Covariate Balance for our 1:1 Nearest Neighbor Match

7.2.1 Default Numerical Balance Summary from `MatchIt`

Code

summary(match_5)


Call:
matchit(formula = f.build("treat", covs_1), data = dm2200, method = "nearest", 
    distance = "glm", link = "linear.logit", replace = FALSE, 
    ratio = 1)

Summary of Balance for All Data:
                Means Treated Means Control Std. Mean Diff. Var. Ratio
distance              -0.6577       -4.0979          2.0650     0.7466
age                   54.8950       58.9000         -0.4682     0.7579
raceBlack_AA           0.8300        0.5325          0.7920          .
raceWhite              0.1550        0.4180         -0.7267          .
raceOther              0.0050        0.0370         -0.4537          .
raceAsian              0.0100        0.0125         -0.0251          .
hisp                   0.0200        0.0520         -0.2286          .
sexF                   0.3600        0.5790         -0.4562          .
sexM                   0.6400        0.4210          0.4563          .
insurMedicare          0.1850        0.4565         -0.6992          .
insurCommercial        0.0350        0.2550         -1.1971          .
insurMedicaid          0.6250        0.2505          0.7736          .
insurUninsured         0.1550        0.0380          0.3233          .
nincome            26927.5000    37992.5500         -0.9504     0.3117
nhsgrad               77.7350       82.2260         -0.3904     1.0281
cleve                  0.9050        0.5545          1.1954          .
a1c                    7.7655        7.6868          0.0379     1.2209
ldl                   94.3450       97.2160         -0.0743     1.1096
visits                 4.8100        3.6885          0.4276     2.1732
tobaccoFormer          0.2300        0.3875         -0.3743          .
tobaccoCurrent         0.5400        0.2225          0.6370          .
tobaccoNever           0.2300        0.3900         -0.3802          .
statin                 0.7700        0.8015         -0.0749          .
ace_arb                0.8000        0.7655          0.0863          .
betab                  0.2650        0.3905         -0.2844          .
depr_dx                0.2650        0.3765         -0.2526          .
eyeex                  0.5200        0.6125         -0.1851          .
pneumo                 0.5800        0.8830         -0.6139          .
                eCDF Mean eCDF Max
distance           0.4093   0.6595
age                0.0801   0.2350
raceBlack_AA       0.2975   0.2975
raceWhite          0.2630   0.2630
raceOther          0.0320   0.0320
raceAsian          0.0025   0.0025
hisp               0.0320   0.0320
sexF               0.2190   0.2190
sexM               0.2190   0.2190
insurMedicare      0.2715   0.2715
insurCommercial    0.2200   0.2200
insurMedicaid      0.3745   0.3745
insurUninsured     0.1170   0.1170
nincome            0.1442   0.3445
nhsgrad            0.0724   0.2295
cleve              0.3505   0.3505
a1c                0.0146   0.0465
ldl                0.0189   0.0670
visits             0.0743   0.2150
tobaccoFormer      0.1575   0.1575
tobaccoCurrent     0.3175   0.3175
tobaccoNever       0.1600   0.1600
statin             0.0315   0.0315
ace_arb            0.0345   0.0345
betab              0.1255   0.1255
depr_dx            0.1115   0.1115
eyeex              0.0925   0.0925
pneumo             0.3030   0.3030

Summary of Balance for Matched Data:
                Means Treated Means Control Std. Mean Diff. Var. Ratio
distance              -0.6577       -1.0679          0.2462     1.8590
age                   54.8950       54.2500          0.0754     0.8059
raceBlack_AA           0.8300        0.8300          0.0000          .
raceWhite              0.1550        0.1600         -0.0138          .
raceOther              0.0050        0.0000          0.0709          .
raceAsian              0.0100        0.0100          0.0000          .
hisp                   0.0200        0.0150          0.0357          .
sexF                   0.3600        0.4000         -0.0833          .
sexM                   0.6400        0.6000          0.0833          .
insurMedicare          0.1850        0.1800          0.0129          .
insurCommercial        0.0350        0.0400         -0.0272          .
insurMedicaid          0.6250        0.6450         -0.0413          .
insurUninsured         0.1550        0.1350          0.0553          .
nincome            26927.5000    25385.5000          0.1324     1.1096
nhsgrad               77.7350       77.4650          0.0235     1.0794
cleve                  0.9050        0.9050          0.0000          .
a1c                    7.7655        7.8350         -0.0334     0.9035
ldl                   94.3450       94.4900         -0.0038     1.0865
visits                 4.8100        4.3400          0.1792     1.3150
tobaccoFormer          0.2300        0.2650         -0.0832          .
tobaccoCurrent         0.5400        0.4950          0.0903          .
tobaccoNever           0.2300        0.2400         -0.0238          .
statin                 0.7700        0.7150          0.1307          .
ace_arb                0.8000        0.7650          0.0875          .
betab                  0.2650        0.3100         -0.1020          .
depr_dx                0.2650        0.2850         -0.0453          .
eyeex                  0.5200        0.5300         -0.0200          .
pneumo                 0.5800        0.6600         -0.1621          .
                eCDF Mean eCDF Max Std. Pair Dist.
distance           0.0120    0.230          0.2469
age                0.0198    0.060          1.1755
raceBlack_AA       0.0000    0.000          0.2700
raceWhite          0.0050    0.005          0.7322
raceOther          0.0050    0.005          0.0709
raceAsian          0.0000    0.000          0.0200
hisp               0.0050    0.005          0.2500
sexF               0.0400    0.040          0.9583
sexM               0.0400    0.040          0.9583
insurMedicare      0.0050    0.005          0.6310
insurCommercial    0.0050    0.005          0.2993
insurMedicaid      0.0200    0.020          0.8882
insurUninsured     0.0200    0.020          0.6355
nincome            0.0290    0.140          1.0179
nhsgrad            0.0183    0.050          1.0605
cleve              0.0000    0.000          0.1500
a1c                0.0161    0.070          1.0819
ldl                0.0099    0.045          1.0769
visits             0.0320    0.125          0.9456
tobaccoFormer      0.0350    0.035          0.8436
tobaccoCurrent     0.0450    0.045          0.8929
tobaccoNever       0.0100    0.010          0.8079
statin             0.0550    0.055          0.9861
ace_arb            0.0350    0.035          0.8625
betab              0.0450    0.045          0.8950
depr_dx            0.0200    0.020          0.8837
eyeex              0.0100    0.010          0.8807
pneumo             0.0800    0.080          0.8510

Sample Sizes:
          Control Treated
All          2000     200
Matched       200     200
Unmatched    1800       0
Discarded       0       0

7.2.2 Using `bal.tab` to obtain a balance table

Code

bal5 <- bal.tab(match_5, un = TRUE, disp.v.ratio = TRUE)

bal5

Balance Measures
                     Type Diff.Un V.Ratio.Un Diff.Adj V.Ratio.Adj
distance         Distance  2.0650     0.7466   0.2462      1.8590
age               Contin. -0.4682     0.7579   0.0754      0.8059
race_Black_AA      Binary  0.2975          .   0.0000           .
race_White         Binary -0.2630          .  -0.0050           .
race_Other         Binary -0.0320          .   0.0050           .
race_Asian         Binary -0.0025          .   0.0000           .
hisp               Binary -0.0320          .   0.0050           .
sex_M              Binary  0.2190          .   0.0400           .
insur_Medicare     Binary -0.2715          .   0.0050           .
insur_Commercial   Binary -0.2200          .  -0.0050           .
insur_Medicaid     Binary  0.3745          .  -0.0200           .
insur_Uninsured    Binary  0.1170          .   0.0200           .
nincome           Contin. -0.9504     0.3117   0.1324      1.1096
nhsgrad           Contin. -0.3904     1.0281   0.0235      1.0794
cleve              Binary  0.3505          .   0.0000           .
a1c               Contin.  0.0379     1.2209  -0.0334      0.9035
ldl               Contin. -0.0743     1.1096  -0.0038      1.0865
visits            Contin.  0.4276     2.1732   0.1792      1.3150
tobacco_Former     Binary -0.1575          .  -0.0350           .
tobacco_Current    Binary  0.3175          .   0.0450           .
tobacco_Never      Binary -0.1600          .  -0.0100           .
statin             Binary -0.0315          .   0.0550           .
ace_arb            Binary  0.0345          .   0.0350           .
betab              Binary -0.1255          .  -0.0450           .
depr_dx            Binary -0.1115          .  -0.0200           .
eyeex              Binary -0.0925          .  -0.0100           .
pneumo             Binary -0.3030          .  -0.0800           .

Sample sizes
          Control Treated
All          2000     200
Matched       200     200
Unmatched    1800       0

7.2.3 Checking Rubin’s Rules 1 and 2

We’ll build a little table of the Rubin’s Rules (1 and 2) results before and after match_5 is applied.

Code

rubin_report_m5 <- tibble(
    status = c("Rule1", "Rule2"),
    Unmatched = c(bal5$Balance$Diff.Un[1],
                  bal5$Balance$V.Ratio.Un[1]),
    Match5 = c(bal5$Balance$Diff.Adj[1],
               bal5$Balance$V.Ratio.Adj[1]))

rubin_report_m5 |> gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 2, color = "blue")

status	Unmatched	Match5
Rule1	2.065	0.246
Rule2	0.747	1.859

As we’d expect based on our previous greedy 1:1 match, this isn’t great sufficiently strong balance.

7.2.4 Using `bal.plot` from `cobalt`

Looking at the linear propensity scores in each group, in mirrored histograms, we have …

Code

bal.plot(match_5,
         var.name = "distance", 
         which = "both",
         sample.names = 
             c("Unmatched Sample", "match_5 Sample"),
         type = "histogram", mirror = TRUE)

7.2.5 Using `love.plot` to look at Standardized Differences

Code

love.plot(bal5, 
          threshold = .1, size = 3,
          var.order = "unadjusted",
          stats = "mean.diffs",
          stars = "raw",
          sample.names = c("Unmatched", "Matched"),
          title = "Love Plot for our 1:1 Nearest Neighbor Match") +
    labs(subtitle = "distance = linear propensity score",
         caption = "* indicates raw mean differences (for binary variables)")

7.2.6 Using `love.plot` to look at Variance Ratios

Again, the categorical variables are dropped.

Code

love.plot(bal5, 
          threshold = .5, size = 3,
          stats = "variance.ratios",
          sample.names = c("Unmatched", "Matched"),
          title = "Variance Ratios for our 1:1 Nearest Neighbor Match") +
    labs(subtitle = "distance = linear propensity score",
         caption = "* indicates raw mean differences (for binary variables)")

8 `match_6` 1:1 Nearest Neighbor Caliper Matching with `Matchit`

As in match_5, we’ll match on the linear propensity score (by specifying the distance to use the logistic link with the linear propensity score via "linear.logit"), and we’ll perform 1:1 nearest neighbor matching without replacement using the default ordering (largest to smallest), but we’ll add a some arguments to build a specific kind caliper match. Specifically, we’ll require our matches to be within 0.25 standard deviations of each other on the linear propensity score.

Code

match_6 <- matchit(f.build("treat", covs_1), data = dm2200,
                   distance = "glm", link = "linear.logit", 
                   method = "nearest", caliper = 0.25,
                   ratio = 1, replace = FALSE)

match_6

A `matchit` object
 - method: 1:1 nearest neighbor matching without replacement
 - distance: Propensity score [caliper]

             - estimated with logistic regression and linearized
 - caliper: <distance> (0.537)
 - number of obs.: 2200 (original), 344 (matched)
 - target estimand: ATT
 - covariates: age, race, hisp, sex, insur, nincome, nhsgrad, cleve, a1c, ldl, visits, tobacco, statin, ace_arb, betab, depr_dx, eyeex, pneumo

8.1 Obtaining the Matched Sample

There is just one tricky part to doing this in MatchIt. The main work can be done with a very simple command.

Code

dm2200_matched6 <- match.data(match_6)

This leaves only the job of creating a matching number, for which we have to develop some additional R code.

Code

# Thanks to Robert McDonald at Mayo
# https://lists.gking.harvard.edu/pipermail/matchit/2012-April/000458.html

len <- dim(dm2200)[1]
matchx <- rep(NA,len)
len2 <- length(match_6$match.matrix)
count <- 1
for(i in 1:len2){
    
    match1 <- match_6$match.matrix[i]
    match2 <- row.names(match_6$match.matrix)[i]
    
    if(!is.na(match1)){
    matchx[as.numeric(match1)] <- count
    matchx[as.numeric(match2)] <- count
    count <- count+1}
    
}

dm2200_matched6 <- dm2200_matched6 |>
    mutate(match6_matches = 
               matchx[as.numeric
                      (row.names(dm2200_matched6))]) |>
    mutate(match6_matches = factor(match6_matches))

How many unique subjects are in our matched sample?

Code

n_distinct(dm2200_matched6$subject)

[1] 344

This match includes 172 pairs so 344 subjects, as we’ve done matching without replacement.

Code

dm2200_matched6 |> count(exposure)

# A tibble: 2 × 2
  exposure     n
  <fct>    <int>
1 A          172
2 B          172

8.2 Checking Covariate Balance for our 1:1 Nearest Neighbor Caliper Match

8.2.1 Using `bal.tab` to obtain a balance table

Code

bal6 <- bal.tab(match_6, un = TRUE, disp.v.ratio = TRUE)

bal6

Balance Measures
                     Type Diff.Un V.Ratio.Un Diff.Adj V.Ratio.Adj
distance         Distance  2.0650     0.7466   0.0612      1.2154
age               Contin. -0.4682     0.7579   0.0068      0.8700
race_Black_AA      Binary  0.2975          .  -0.0116           .
race_White         Binary -0.2630          .   0.0058           .
race_Other         Binary -0.0320          .   0.0058           .
race_Asian         Binary -0.0025          .   0.0000           .
hisp               Binary -0.0320          .   0.0116           .
sex_M              Binary  0.2190          .   0.0233           .
insur_Medicare     Binary -0.2715          .   0.0000           .
insur_Commercial   Binary -0.2200          .  -0.0058           .
insur_Medicaid     Binary  0.3745          .   0.0000           .
insur_Uninsured    Binary  0.1170          .   0.0058           .
nincome           Contin. -0.9504     0.3117   0.1677      1.1339
nhsgrad           Contin. -0.3904     1.0281   0.1137      0.9119
cleve              Binary  0.3505          .   0.0000           .
a1c               Contin.  0.0379     1.2209  -0.0011      0.9429
ldl               Contin. -0.0743     1.1096  -0.0188      1.1241
visits            Contin.  0.4276     2.1732   0.0887      1.2449
tobacco_Former     Binary -0.1575          .  -0.0174           .
tobacco_Current    Binary  0.3175          .   0.0000           .
tobacco_Never      Binary -0.1600          .   0.0174           .
statin             Binary -0.0315          .   0.0349           .
ace_arb            Binary  0.0345          .   0.0174           .
betab              Binary -0.1255          .  -0.0465           .
depr_dx            Binary -0.1115          .   0.0174           .
eyeex              Binary -0.0925          .  -0.0233           .
pneumo             Binary -0.3030          .   0.0000           .

Sample sizes
          Control Treated
All          2000     200
Matched       172     172
Unmatched    1828      28

8.2.2 Checking Rubin’s Rules 1 and 2

We’ll build a little table of the Rubin’s Rules (1 and 2) results before and after match_6 is applied, and compare these to the match_5 results.

Code

rubin_report_m56 <- tibble(
    status = c("Rule1", "Rule2"),
    Unmatched = c(bal6$Balance$Diff.Un[1],
                  bal6$Balance$V.Ratio.Un[1]),
    Match5 = c(bal5$Balance$Diff.Adj[1],
               bal5$Balance$V.Ratio.Adj[1]),
    Match6 = c(bal6$Balance$Diff.Adj[1],
               bal6$Balance$V.Ratio.Adj[1])
    )

rubin_report_m56 |> gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 2, color = "blue")

status	Unmatched	Match5	Match6
Rule1	2.065	0.246	0.061
Rule2	0.747	1.859	1.215

So the caliper appears to help quite a bit to improve the results that we saw in 1:1 nearest neighbor matching without replacement, at the cost of not including 28 of the treated subjects in the match.

8.2.3 Using `bal.plot` from `cobalt`

Looking at the linear propensity scores in each group, in mirrored histograms, we have …

Code

bal.plot(match_6,
         var.name = "distance", 
         which = "both",
         sample.names = 
             c("Unmatched Sample", "match_6 Sample"),
         type = "histogram", mirror = TRUE)

8.2.4 Using `love.plot` to look at Standardized Differences

Code

love.plot(bal6, 
          threshold = .1, size = 3,
          var.order = "unadjusted",
          stats = "mean.diffs",
          stars = "raw",
          sample.names = c("Unmatched", "Matched"),
          title = "Love Plot for 1:1 Nearest Neighbor Caliper Match") +
    labs(subtitle = "distance = linear propensity score",
         caption = "* indicates raw mean differences (for binary variables)")

8.2.5 Using `love.plot` to look at Variance Ratios

Again, the categorical variables are dropped.

Code

love.plot(bal6, 
          threshold = .5, size = 3,
          stats = "variance.ratios",
          sample.names = c("Unmatched", "Matched"),
          title = "Variance Ratios for 1:1 Nearest Neighbor Caliper Match") +
    labs(subtitle = "distance = linear propensity score",
         caption = "* indicates raw mean differences (for binary variables)")

9 `match_7` 1:1 Optimal Matching with `Matchit`

As in match_5, we’ll match on the linear propensity score (by specifying the distance to use the logistic link with the linear propensity score via "linear.logit"), but now we’ll use an optimal 1:1 matching (which is always done in matchit without replacement).

Code

match_7 <- matchit(f.build("treat", covs_1), data = dm2200,
                   distance = "glm", link = "linear.logit",
                   method = "optimal", ratio = 1)

match_7

A `matchit` object
 - method: 1:1 optimal pair matching
 - distance: Propensity score
             - estimated with logistic regression and linearized
 - number of obs.: 2200 (original), 400 (matched)
 - target estimand: ATT
 - covariates: age, race, hisp, sex, insur, nincome, nhsgrad, cleve, a1c, ldl, visits, tobacco, statin, ace_arb, betab, depr_dx, eyeex, pneumo

9.1 Obtaining the Matched Sample

As before, much of the work can be done with match.data.

Code

dm2200_matched7 <- match.data(match_7)

This leaves only the job of creating a matching number, for which we have to develop some additional R code.

Code

# Thanks to Robert McDonald at Mayo
# https://lists.gking.harvard.edu/pipermail/matchit/2012-April/000458.html

len <- dim(dm2200)[1]
matchx <- rep(NA,len)
len2 <- length(match_7$match.matrix)
count <- 1
for(i in 1:len2){
    
    match1 <- match_7$match.matrix[i]
    match2 <- row.names(match_7$match.matrix)[i]
    
    if(!is.na(match1)){
    matchx[as.numeric(match1)] <- count
    matchx[as.numeric(match2)] <- count
    count <- count+1}
    
}

dm2200_matched7 <- dm2200_matched7 |>
    mutate(match7_matches = 
               matchx[as.numeric
                      (row.names(dm2200_matched7))]) |>
    mutate(match7_matches = factor(match7_matches))

How many unique subjects are in our matched sample?

Code

n_distinct(dm2200_matched7$subject)

[1] 400

This match includes 200 pairs so 400 subjects, as we’ve done matching without replacement.

Code

dm2200_matched7 |> count(exposure)

# A tibble: 2 × 2
  exposure     n
  <fct>    <int>
1 A          200
2 B          200

9.2 Checking Covariate Balance for our 1:1 Optimal Match

9.2.1 Using `bal.tab` to obtain a balance table

Code

bal7 <- bal.tab(match_7, un = TRUE, disp.v.ratio = TRUE)

bal7

Balance Measures
                     Type Diff.Un V.Ratio.Un Diff.Adj V.Ratio.Adj
distance         Distance  2.0650     0.7466   0.2460      1.8591
age               Contin. -0.4682     0.7579   0.1005      0.8433
race_Black_AA      Binary  0.2975          .   0.0050           .
race_White         Binary -0.2630          .  -0.0050           .
race_Other         Binary -0.0320          .   0.0000           .
race_Asian         Binary -0.0025          .   0.0000           .
hisp               Binary -0.0320          .   0.0000           .
sex_M              Binary  0.2190          .   0.0400           .
insur_Medicare     Binary -0.2715          .   0.0100           .
insur_Commercial   Binary -0.2200          .   0.0050           .
insur_Medicaid     Binary  0.3745          .  -0.0200           .
insur_Uninsured    Binary  0.1170          .   0.0050           .
nincome           Contin. -0.9504     0.3117   0.1197      1.0425
nhsgrad           Contin. -0.3904     1.0281  -0.0226      1.1902
cleve              Binary  0.3505          .   0.0050           .
a1c               Contin.  0.0379     1.2209   0.0135      0.9199
ldl               Contin. -0.0743     1.1096  -0.0443      1.0515
visits            Contin.  0.4276     2.1732   0.1754      1.2822
tobacco_Former     Binary -0.1575          .  -0.0350           .
tobacco_Current    Binary  0.3175          .   0.0400           .
tobacco_Never      Binary -0.1600          .  -0.0050           .
statin             Binary -0.0315          .   0.0500           .
ace_arb            Binary  0.0345          .   0.0500           .
betab              Binary -0.1255          .  -0.0600           .
depr_dx            Binary -0.1115          .  -0.0450           .
eyeex              Binary -0.0925          .  -0.0050           .
pneumo             Binary -0.3030          .  -0.0800           .

Sample sizes
          Control Treated
All          2000     200
Matched       200     200
Unmatched    1800       0

9.2.2 Checking Rubin’s Rules 1 and 2

We’ll build a little table of the Rubin’s Rules (1 and 2) results before and after match_6 is applied, and compare these to the match_5 results.

Code

rubin_report_m567 <- tibble(
    status = c("Rule1", "Rule2"),
    Unmatched = c(bal6$Balance$Diff.Un[1],
                  bal6$Balance$V.Ratio.Un[1]),
    Match5 = c(bal5$Balance$Diff.Adj[1],
               bal5$Balance$V.Ratio.Adj[1]),
    Match6 = c(bal6$Balance$Diff.Adj[1],
               bal6$Balance$V.Ratio.Adj[1]),
    Match7 = c(bal7$Balance$Diff.Adj[1],
               bal7$Balance$V.Ratio.Adj[1])
    )

rubin_report_m567 |> gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 2, color = "blue")

status	Unmatched	Match5	Match6	Match7
Rule1	2.065	0.246	0.061	0.246
Rule2	0.747	1.859	1.215	1.859

We note here that the optimal matching here does very little, if anything, to improved on the nearest neighbor match.

9.2.3 Are the Optimal and Nearest Neighbor Matches the Same?

Code

m5_subs <- dm2200_matched5 |> select(subject)
m7_subs <- dm2200_matched7 |> select(subject)

all.equal(m5_subs, m7_subs)

[1] "Component \"subject\": 356 string mismatches"

Apparently not, but they are very similar.

9.2.4 Using `bal.plot` from `cobalt`

Looking at the linear propensity scores in each group, in mirrored histograms, we have …

Code

bal.plot(match_7,
         var.name = "distance", 
         which = "both",
         sample.names = 
             c("Unmatched Sample", "match_7 Sample"),
         type = "histogram", mirror = TRUE)

9.2.5 Using `love.plot` to look at Standardized Differences

Code

love.plot(bal7, 
          threshold = .1, size = 3,
          var.order = "unadjusted",
          stats = "mean.diffs",
          stars = "raw",
          sample.names = c("Unmatched", "Matched"),
          title = "Love Plot for 1:1 Optimal Match") +
    labs(subtitle = "distance = linear propensity score",
         caption = "* indicates raw mean differences (for binary variables)")

9.2.6 Using `love.plot` to look at Variance Ratios

Again, the categorical variables are dropped.

Code

love.plot(bal7, 
          threshold = .5, size = 3,
          stats = "variance.ratios",
          sample.names = c("Unmatched", "Matched"),
          title = "Variance Ratios for 1:1 Optimal Match") +
    labs(subtitle = "distance = linear propensity score",
         caption = "* indicates raw mean differences (for binary variables)")

10 `match_8` 1:2 Optimal Matching with `Matchit`

Again, we’ll match on the linear propensity score (by specifying the distance to use the logistic link with the linear propensity score via "linear.logit"), but now we’ll perform 1:2 (so ratio = 2) optimal (so `method = “optimal”) matching without replacement using the default ordering (largest to smallest).

Code

match_8 <- matchit(f.build("treat", covs_1), data = dm2200,
                   distance = "glm", link = "linear.logit", 
                   method = "optimal", ratio = 2)

match_8

A `matchit` object
 - method: 2:1 optimal pair matching
 - distance: Propensity score
             - estimated with logistic regression and linearized
 - number of obs.: 2200 (original), 600 (matched)
 - target estimand: ATT
 - covariates: age, race, hisp, sex, insur, nincome, nhsgrad, cleve, a1c, ldl, visits, tobacco, statin, ace_arb, betab, depr_dx, eyeex, pneumo

10.1 Obtaining the Matched Sample

As before, much of the work can be done with a very simple command.

Code

dm2200_matched8 <- match.data(match_8)

This leaves only the job of creating a matching number, for which we have to develop some additional R code, and this needs some modification because we now have two matched controls for each treated subject.

Code

# Thanks to Robert McDonald at Mayo
# https://lists.gking.harvard.edu/pipermail/matchit/2012-April/000458.html

mm1 <- match_8$match.matrix[,1]
mm2 <- match_8$match.matrix[,2]

len <- dim(dm2200)[1]
matchx <- rep(NA,len)
len2 <- length(match_8$match.matrix)
count <- 1
for(i in 1:len2){
    
    match_tr <- row.names(match_8$match.matrix)[i]
    match_con1 <- mm1[i]
    match_con2 <- mm2[i]
    
    if(!is.na(match_tr) & !is.na(match_con1) & !is.na(match_con2)){
    matchx[as.numeric(match_con1)] <- count
    matchx[as.numeric(match_con2)] <- count
    matchx[as.numeric(match_tr)] <- count
    count <- count+1}
    
}

dm2200_matched8 <- dm2200_matched8 |>
     mutate(match8_matches = 
            matchx[as.numeric(row.names(dm2200_matched8))])

How many unique subjects are in our matched sample?

Code

n_distinct(dm2200_matched8$subject)

[1] 600

This match includes 200 triplets (1 treated and 2 control) so 600 subjects, and again we’ve done matching without replacement.

Code

dm2200_matched8 |> count(exposure)

# A tibble: 2 × 2
  exposure     n
  <fct>    <int>
1 A          200
2 B          400

10.2 Checking Covariate Balance for our 1:2 Optimal Match

10.2.1 Using `bal.tab` to obtain a balance table

Code

bal8 <- bal.tab(match_8, un = TRUE, disp.v.ratio = TRUE)

bal8

Balance Measures
                     Type Diff.Un V.Ratio.Un Diff.Adj V.Ratio.Adj
distance         Distance  2.0650     0.7466   0.5002      2.4424
age               Contin. -0.4682     0.7579  -0.0605      0.8144
race_Black_AA      Binary  0.2975          .   0.0225           .
race_White         Binary -0.2630          .  -0.0225           .
race_Other         Binary -0.0320          .   0.0000           .
race_Asian         Binary -0.0025          .   0.0000           .
hisp               Binary -0.0320          .  -0.0050           .
sex_M              Binary  0.2190          .   0.0700           .
insur_Medicare     Binary -0.2715          .  -0.0800           .
insur_Commercial   Binary -0.2200          .  -0.0025           .
insur_Medicaid     Binary  0.3745          .   0.0350           .
insur_Uninsured    Binary  0.1170          .   0.0475           .
nincome           Contin. -0.9504     0.3117   0.0429      0.7310
nhsgrad           Contin. -0.3904     1.0281  -0.0139      0.9187
cleve              Binary  0.3505          .   0.0175           .
a1c               Contin.  0.0379     1.2209   0.0463      1.1943
ldl               Contin. -0.0743     1.1096  -0.0223      1.0845
visits            Contin.  0.4276     2.1732   0.2097      1.3355
tobacco_Former     Binary -0.1575          .  -0.0550           .
tobacco_Current    Binary  0.3175          .   0.1075           .
tobacco_Never      Binary -0.1600          .  -0.0525           .
statin             Binary -0.0315          .  -0.0025           .
ace_arb            Binary  0.0345          .   0.0075           .
betab              Binary -0.1255          .  -0.0375           .
depr_dx            Binary -0.1115          .  -0.0600           .
eyeex              Binary -0.0925          .  -0.0225           .
pneumo             Binary -0.3030          .  -0.1475           .

Sample sizes
          Control Treated
All          2000     200
Matched       400     200
Unmatched    1600       0

10.2.2 Checking Rubin’s Rules 1 and 2

We’ll build a little table of the Rubin’s Rules (1 and 2) results before and after match_8 is applied, and compare these to the other MatchIt results we’ve developed.

Code

rubin_report_m5678 <- tibble(
    status = c("Rule1", "Rule2"),
    Unmatched = c(bal8$Balance$Diff.Un[1],
                  bal8$Balance$V.Ratio.Un[1]),
    Match5 = c(bal5$Balance$Diff.Adj[1],
               bal5$Balance$V.Ratio.Adj[1]),
    Match6 = c(bal6$Balance$Diff.Adj[1],
               bal6$Balance$V.Ratio.Adj[1]),
    Match7 = c(bal7$Balance$Diff.Adj[1],
               bal7$Balance$V.Ratio.Adj[1]),
    Match8 = c(bal8$Balance$Diff.Adj[1],
               bal8$Balance$V.Ratio.Adj[1])
    )

rubin_report_m5678 |> gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 2, color = "blue")

status	Unmatched	Match5	Match6	Match7	Match8
Rule1	2.065	0.246	0.061	0.246	0.500
Rule2	0.747	1.859	1.215	1.859	2.442

So the optimal matching doesn’t look very strong here. Of course, we’re matching 1:2 in this situation.

10.2.3 Using `bal.plot` from `cobalt`

Looking at the linear propensity scores in each group, in mirrored histograms, we have …

Code

bal.plot(match_8,
         var.name = "distance", 
         which = "both",
         sample.names = 
             c("Unmatched Sample", "match_8 Sample"),
         type = "histogram", mirror = TRUE)

10.2.4 Using `love.plot` to look at Standardized Differences

Code

love.plot(bal8, 
          threshold = .1, size = 3,
          var.order = "unadjusted",
          stats = "mean.diffs",
          stars = "raw",
          sample.names = c("Unmatched", "Matched"),
          title = "Love Plot for 1:2 Optimal Match") +
    labs(subtitle = "distance = linear propensity score",
         caption = "* indicates raw mean differences (for binary variables)")

10.2.5 Using `love.plot` to look at Variance Ratios

Again, the categorical variables are dropped.

Code

love.plot(bal8, 
          threshold = .5, size = 3,
          stats = "variance.ratios",
          sample.names = c("Unmatched", "Matched"),
          title = "Variance Ratios for 1:2 Optimal Match") +
    labs(subtitle = "distance = linear propensity score",
         caption = "* indicates raw mean differences (for binary variables)")

11 Outcome Models

We’ll fit two (overly simplistic) outcome models, one for bp_good (our binary outcome) and another for bmi (our quantitative outcome.) Later, we’ll compare the exposure effect estimates made here to the estimates we obtain after propensity matching. In each case, we’ll focus on ATT estimates (average treated effect on the treated) rather than ATE estimates.

11.1 Potential Fitting Problems

When you fit mixed effect models as I have done below for binary and for quantitative outcomes, you may find that the model fit appears to be singular, which means that some element of the variance-covariance matrix estimated by lmer is zero. This may be an ignorable problem in this setting, since the problem is usually with the separation of residual effects from random match effects. We’re focusing right now on the fixed effects, which can be summarized as indicated below.

But, there’s certainly an argument to be made that an alternative strategy for estimating the match might be more appropriate. We won’t worry about that in this example, and just display what we get.

11.2 Unadjusted Models prior to Propensity Matching

11.2.1 Unadjusted Outcome Model for `bp_good`

Code

unadj_mod1 <- glm(bp_good == 1 ~ exposure == "A", data = dm2200, 
                  family = binomial())

tidy(unadj_mod1, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")

term	estimate	std.error	conf.low	conf.high
(Intercept)	2.515	0.050	2.284	2.773
exposure == "A"TRUE	0.561	0.152	0.417	0.757

11.2.2 Unadjusted Outcome Model for `bmi`

Code

unadj_mod2 <- lm(bmi ~ exposure == "A", data = dm2200)

tidy(unadj_mod2, conf.int = TRUE, conf.level = 0.95) |>
    select(term, estimate, std.error, 
           conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")

term	estimate	std.error	conf.low	conf.high
(Intercept)	35.087	0.183	34.729	35.446
exposure == "A"TRUE	−2.260	0.606	−3.449	−1.071

11.3 Adjusted Outcome Models after `match1`

11.3.1 Binary Outcome: `bp_good`

Code

result_match1_bp <- clogit(bp_good ~ (exposure == "A") + 
                          strata(match1_matches),
                      data = dm2200_matched1)

tidy(result_match1_bp, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")

term	estimate	std.error	conf.low	conf.high
exposure == "A"TRUE	0.559	0.217	0.365	0.856

11.3.2 Quantitative Outcome: `bmi`

We’ll use a mixed model to account for our 1:1 matching. The matches here are treated as a random factor, with the exposure a fixed factor, in the lme4 package.

Code

dm2200_matched1 <- dm2200_matched1 |> 
    mutate(match1_matches_f = as.factor(match1_matches))

result_match1_bmi <- lmer(bmi ~ (exposure == "A") + 
                              (1 | match1_matches_f), 
                          data = dm2200_matched1)

Code

broom.mixed::tidy(result_match1_bmi, 
     conf.int = TRUE, conf.level = 0.95) |> 
    filter(effect == "fixed") |>
    select(term, estimate, std.error, 
           conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")

term	estimate	std.error	conf.low	conf.high
(Intercept)	35.168	0.612	33.969	36.367
exposure == "A"TRUE	−2.341	0.857	−4.020	−0.662

11.4 Adjusted Outcome Models after `match2`

11.4.1 Binary Outcome: `bp_good`

Code

result_match2_bp <- clogit(bp_good ~ (exposure == "A") + 
                          strata(match2_matches),
                      data = dm2200_matched2)

tidy(result_match2_bp, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")

term	estimate	std.error	conf.low	conf.high
exposure == "A"TRUE	0.417	0.180	0.293	0.593

11.4.2 Quantitative Outcome: `bmi`

We’ll use a mixed model to account for our 1:1 matching. The matches here are treated as a random factor, with the exposure a fixed factor, in the lme4 package.

Code

dm2200_matched2 <- dm2200_matched2 |> 
    mutate(match2_matches_f = as.factor(match2_matches))

result_match2_bmi <- lmer(bmi ~ (exposure == "A") + 
                              (1 | match2_matches_f), 
                          data = dm2200_matched2)

broom.mixed::tidy(result_match2_bmi, 
     conf.int = TRUE, conf.level = 0.95) |> 
    filter(effect == "fixed") |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")

term	estimate	std.error	conf.low	conf.high
(Intercept)	34.609	0.458	33.712	35.506
exposure == "A"TRUE	−1.782	0.558	−2.876	−0.688

11.5 Adjusted Outcome Models after `match3`

11.5.1 Binary Outcome: `bp_good`

Code

result_match3_bp <- clogit(bp_good ~ (exposure == "A") + 
                          strata(match3_matches),
                      data = dm2200_matched3)

tidy(result_match3_bp, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")

term	estimate	std.error	conf.low	conf.high
exposure == "A"TRUE	0.573	0.139	0.436	0.753

11.5.2 Quantitative Outcome: `bmi`

We’ll use a mixed model to account for our 1:1 matching. The matches here are treated as a random factor, with the exposure a fixed factor, in the lme4 package.

Code

dm2200_matched3 <- dm2200_matched3 |> 
    mutate(match3_matches_f = as.factor(match3_matches))

result_match3_bmi <- lmer(bmi ~ (exposure == "A") + 
                              (1 | match3_matches_f), 
                          data = dm2200_matched3)

broom.mixed::tidy(result_match3_bmi, 
     conf.int = TRUE, conf.level = 0.95) |> 
    filter(effect == "fixed") |>
    select(term, estimate, std.error, 
           conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")

term	estimate	std.error	conf.low	conf.high
(Intercept)	34.531	0.410	33.728	35.335
exposure == "A"TRUE	−1.704	0.449	−2.584	−0.824

11.6 Adjusted Outcome Models after `match4`

11.6.1 Binary Outcome: `bp_good`

Code

result_match4_bp <- clogit(bp_good ~ (exposure == "A") + 
                          strata(match4_matches),
                      data = dm2200_matched4)

tidy(result_match4_bp, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")

term	estimate	std.error	conf.low	conf.high
exposure == "A"TRUE	0.651	0.243	0.405	1.048

11.6.2 Quantitative Outcome: `bmi`

We’ll use a mixed model to account for our 1:1 matching. The matches here are treated as a random factor, with the exposure a fixed factor, in the lme4 package.

Code

dm2200_matched4 <- dm2200_matched4 |> 
    mutate(match4_matches_f = as.factor(match4_matches))

result_match4_bmi <- lmer(bmi ~ (exposure == "A") + 
                              (1 | match4_matches_f), 
                          data = dm2200_matched4)

boundary (singular) fit: see help('isSingular')

Code

broom.mixed::tidy(result_match4_bmi, 
     conf.int = TRUE, conf.level = 0.95) |> 
    filter(effect == "fixed") |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")

term	estimate	std.error	conf.low	conf.high
(Intercept)	34.828	0.657	33.540	36.116
exposure == "A"TRUE	−1.577	0.929	−3.399	0.244

11.7 Adjusted Outcome Models after `match5`

11.7.1 Binary Outcome: `bp_good`

Code

result_match5_bp <- clogit(bp_good ~ (exposure == "A") + 
                          strata(match5_matches),
                      data = dm2200_matched5)

Warning in coxexact.fit(X, Y, istrat, offset, init, control, weights = weights,
: Loglik converged before variable 1 ; beta may be infinite.

Code

tidy(result_match5_bp, exponentiate = TRUE, 
     conf.int = TRUE, conf.level = 0.95) |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")

term	estimate	std.error	conf.low	conf.high
exposure == "A"TRUE	1,615,474,899.108	40,192.969	0.000	Inf

11.7.2 Quantitative Outcome: `bmi`

We’ll use a mixed model to account for our 1:1 matching. The matches here are treated as a random factor, with the exposure a fixed factor, in the lme4 package.

Code

result_match5_bmi <- lmer(bmi ~ (exposure == "A") + 
                              (1 | match5_matches), 
                          data = dm2200_matched5)

boundary (singular) fit: see help('isSingular')

Code

broom.mixed::tidy(result_match5_bmi, 
     conf.int = TRUE, conf.level = 0.95) |> 
    filter(effect == "fixed") |>
    select(term, estimate, std.error, 
           conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")

term	estimate	std.error	conf.low	conf.high
(Intercept)	33.258	1.354	30.605	35.911
exposure == "A"TRUE	1.683	2.037	−2.310	5.676

11.8 Adjusted Outcome Models after `match6`

11.8.1 Binary Outcome: `bp_good`

Code

result_match6_bp <- clogit(bp_good ~ (exposure == "A") + 
                          strata(match6_matches),
                      data = dm2200_matched6)

Warning in coxexact.fit(X, Y, istrat, offset, init, control, weights = weights,
: Ran out of iterations and did not converge

Code

tidy(result_match6_bp, exponentiate = TRUE, 
     conf.int = TRUE, conf.level = 0.95) |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")

term	estimate	std.error	conf.low	conf.high
exposure == "A"TRUE	0.000	24,378.269	0.000	Inf

11.8.2 Quantitative Outcome: `bmi`

We’ll use a mixed model to account for our 1:1 matching. The matches here are treated as a random factor, with the exposure a fixed factor, in the lme4 package.

Code

result_match6_bmi <- lmer(bmi ~ (exposure == "A") + 
                              (1 | match6_matches), 
                          data = dm2200_matched6)

boundary (singular) fit: see help('isSingular')

Code

broom.mixed::tidy(result_match6_bmi, 
     conf.int = TRUE, conf.level = 0.95) |> 
    filter(effect == "fixed") |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")

term	estimate	std.error	conf.low	conf.high
(Intercept)	34.338	1.982	30.453	38.223
exposure == "A"TRUE	1.097	2.567	−3.934	6.129

11.9 Adjusted Outcome Models after `match7`

11.9.1 Binary Outcome: `bp_good`

Code

result_match7_bp <- clogit(bp_good ~ (exposure == "A") + 
                          strata(match7_matches),
                      data = dm2200_matched7)

tidy(result_match7_bp, exponentiate = TRUE, 
     conf.int = TRUE, conf.level = 0.95) |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")

term	estimate	std.error	conf.low	conf.high
exposure == "A"TRUE	2.000	1.225	0.181	22.056

11.9.2 Quantitative Outcome: `bmi`

We’ll use a mixed model to account for our 1:1 matching. The matches here are treated as a random factor, with the exposure a fixed factor, in the lme4 package.

Code

result_match7_bmi <- lmer(bmi ~ (exposure == "A") + 
                              (1 | match7_matches), 
                          data = dm2200_matched7)

broom.mixed::tidy(result_match7_bmi, 
     conf.int = TRUE, conf.level = 0.95) |> 
    filter(effect == "fixed") |>
    select(term, estimate, std.error, 
           conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")

term	estimate	std.error	conf.low	conf.high
(Intercept)	35.658	1.416	32.882	38.433
exposure == "A"TRUE	−3.210	1.884	−6.903	0.484

11.10 Adjusted Outcome Models after `match8`

11.10.1 Binary Outcome: `bp_good`

Code

result_match8_bp <- clogit(bp_good ~ (exposure == "A") + 
                          strata(match8_matches),
                      data = dm2200_matched8)

tidy(result_match8_bp, exponentiate = TRUE, 
     conf.int = TRUE, conf.level = 0.95) |>
    select(term, estimate, std.error, conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")

term	estimate	std.error	conf.low	conf.high
exposure == "A"TRUE	1.073	0.650	0.300	3.839

11.10.2 Quantitative Outcome: `bmi`

We’ll use a mixed model to account for our 1:1 matching. The matches here are treated as a random factor, with the exposure a fixed factor, in the lme4 package.

Code

dm2200_matched8 <- dm2200_matched8 |>
    mutate(match8_f = factor(match8_matches))

result_match8_bmi <- lmer(bmi ~ (exposure == "A") + 
                              (1 | match8_f),  
                          data = dm2200_matched8)

broom.mixed::tidy(result_match8_bmi, 
     conf.int = TRUE, conf.level = 0.95) |> 
    filter(effect == "fixed") |>
    select(term, estimate, std.error, 
           conf.low, conf.high) |>
    gt() |> fmt_number(decimals = 3) |> 
  opt_stylize(style = 4, color = "blue")

term	estimate	std.error	conf.low	conf.high
(Intercept)	34.597	0.786	33.057	36.137
exposure == "A"TRUE	−3.797	1.329	−6.402	−1.192

12 Cleanup

We’ve created a lot of variables here that we don’t actually need going forward. So I’ll remove them here:

Code

rm(bal1, bal2, bal3, bal4,
   covs_1, covs_1plus, covs_2plus, covs_3plus, covs_4plus,
   covs_for_rubin, dm2200, 
   dm2200_matched1, dm2200_matched2, dm2200_matched3,
   dm2200_matched4,
   match_1, match_2, match_3, match_4,
   prop_model, 
   result_match1_bmi, result_match2_bmi, result_match3_bmi,
   result_match4_bmi,
   result_match1_bp, result_match2_bp, result_match3_bp,
   result_match4_bp,
   rubin_m1, rubin_m2, rubin_m3, rubin_m4,
   rubin_report_m1, rubin_report_m12, rubin_report_m123,
   rubin_report_m1234, rubin_report_m5, 
   rubin_report_m56, rubin_report_m567, rubin_report_m5678, 
   t1, unadj_mod1, unadj_mod2,
   match1_matches, match2_matches, match3_matches,
   count, i, len, len2, match_con1, match_con2, match_tr,
   match1, match2, match4_matches, matchx, mm1, mm2, 
   bal5, bal6, bal7, bal8, dm2200_matched5, 
   dm2200_matched6, dm2200_matched7, dm2200_matched8,
   m5_subs, m7_subs, match_5, match_6, match_7, match_8,
   result_match5_bmi, result_match6_bmi, result_match7_bmi,
   result_match8_bmi,
   result_match5_bp, result_match6_bp, result_match7_bp,
   result_match8_bp)

13 Key References

Matching in these examples was performed using the Matching package (Sekhon, 2011), and the MatchIt package and covariate balance was assessed using cobalt (Greifer, 2020), both in R (R Core Team, 2019).

Greifer, N. (2020). cobalt: Covariate Balance Tables and Plots. R package version 4.3.0.
MatchIt: Nonparametric Preprocessing for Parametric Causal Inference, R package version 4.1.0.
- See also MatchIt: Getting Started by Noah Greifer.
Sekhon, J.S. (2011) Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching Package for R, J of Statistical Software, 2011, 42: 7, http://www.jstatsoft.org/. R package version 4.9-6.
R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

13.1 Session Information

Code

xfun::session_info()

R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  askpass_1.2.1          backports_1.5.0        base64enc_0.1.3       
  bayestestR_0.15.1      bigD_0.3.0             bit_4.5.0.1           
  bit64_4.6.0-1          bitops_1.0.9           blob_1.2.4            
  boot_1.3-31            broom_1.0.7            broom.mixed_0.2.9.6   
  bslib_0.8.0            cachem_1.1.0           callr_3.7.6           
  cellranger_1.1.0       chk_0.10.0             class_7.3-22          
  cli_3.6.3              clipr_0.8.0            cmprsk_2.2-12         
  cobalt_4.5.5           coda_0.19-4.1          codetools_0.2-20      
  colorspace_2.1-1       commonmark_1.9.2       compiler_4.4.2        
  conflicted_1.2.0       correlation_0.8.6      cpp11_0.5.1           
  crayon_1.5.3           curl_6.2.0             data.table_1.16.4     
  datasets_4.4.2         datawizard_1.0.0       DBI_1.2.3             
  dbplyr_2.5.0           digest_0.6.37          dplyr_1.1.4           
  dtplyr_1.3.1           e1071_1.7-16           easystats_0.7.3       
  effectsize_1.0.0       emmeans_1.10.6         Epi_2.59              
  estimability_1.5.1     etm_1.1.1              evaluate_1.0.3        
  fansi_1.0.6            farver_2.1.2           fastmap_1.2.0         
  fontawesome_0.5.3      forcats_1.0.0          fs_1.6.5              
  furrr_0.3.1            future_1.34.0          gargle_1.5.2          
  gdata_3.0.1            generics_0.1.3         ggplot2_3.5.1         
  globals_0.16.3         glue_1.8.0             gmodels_2.19.1        
  googledrive_2.1.1      googlesheets4_1.1.1    graphics_4.4.2        
  grDevices_4.4.2        grid_4.4.2             gridExtra_2.3         
  gt_0.11.1              gtable_0.3.6           gtools_3.9.5          
  haven_2.5.4            highr_0.11             hms_1.1.3             
  htmltools_0.5.8.1      htmlwidgets_1.6.4      httr_1.4.7            
  ids_1.0.1              insight_1.0.1          isoband_0.2.7         
  janitor_2.2.1          jquerylib_0.1.4        jsonlite_1.8.9        
  juicyjuice_0.1.0       knitr_1.49             labeling_0.4.3        
  labelled_2.14.0        lattice_0.22-6         lifecycle_1.0.4       
  listenv_0.9.1          lme4_1.1-36            lubridate_1.9.4       
  magrittr_2.0.3         markdown_1.13          MASS_7.3-64           
  Matching_4.10-15       MatchIt_4.7.0          Matrix_1.7-1          
  memoise_2.0.1          methods_4.4.2          mgcv_1.9-1            
  mime_0.12              minqa_1.2.8            mitools_2.4           
  modelbased_0.8.9       modelr_0.1.11          multcomp_1.4-26       
  munsell_0.5.1          mvtnorm_1.3-3          naniar_1.1.0          
  nlme_3.1-166           nloptr_2.1.1           norm_1.0.11.1         
  numDeriv_2016.8-1.1    openssl_2.3.1          optmatch_0.10.8       
  parallel_4.4.2         parallelly_1.41.0      parameters_0.24.1     
  patchwork_1.3.0        performance_0.13.0     pillar_1.10.1         
  pkgconfig_2.0.3        plyr_1.8.9             prettyunits_1.2.0     
  processx_3.8.5         progress_1.2.3         proxy_0.4-27          
  ps_1.8.1               purrr_1.0.2            R6_2.5.1              
  ragg_1.3.3             rappdirs_0.3.3         rbibutils_2.3         
  rbounds_2.2            RColorBrewer_1.1.3     Rcpp_1.0.14           
  RcppArmadillo_14.2.2.1 RcppEigen_0.3.4.0.2    RcppProgress_0.4.2    
  Rdpack_2.6.2           reactable_0.4.4        reactR_0.6.1          
  readr_2.1.5            readxl_1.4.3           reformulas_0.4.0      
  rematch_2.0.0          rematch2_2.1.2         report_0.5.9          
  reprex_2.1.1           rlang_1.1.5            rlemon_0.2.1          
  rmarkdown_2.29         rstudioapi_0.17.1      rvest_1.0.4           
  sandwich_3.1-1         sass_0.4.9             scales_1.3.0          
  see_0.10.0             selectr_0.4.2          snakecase_0.11.1      
  splines_4.4.2          stats_4.4.2            stringi_1.8.4         
  stringr_1.5.1          survey_4.4-2           survival_3.8-3        
  sys_3.4.3              systemfonts_1.2.1      tableone_0.13.2       
  textshaping_1.0.0      TH.data_1.1-3          tibble_3.2.1          
  tidyr_1.3.1            tidyselect_1.2.1       tidyverse_2.0.0       
  timechange_0.3.0       tinytex_0.54           tools_4.4.2           
  tzdb_0.4.0             UpSetR_1.4.0           utf8_1.2.4            
  utils_4.4.2            uuid_1.2.1             V8_6.0.0              
  vctrs_0.6.5            viridis_0.6.5          viridisLite_0.4.2     
  visdat_0.6.0           vroom_1.6.5            withr_3.0.2           
  xfun_0.50              xml2_1.3.6             xtable_1.8-4          
  yaml_2.3.10            zoo_1.8-12

1 Setup

1.1 The dm2200 Data Set

1.2 Elements of a Codebook

2 Propensity for Exposure

2.1 Fitting a Propensity Model

2.2 Summarizing the Propensity Model

2.3 Storing the Propensity Scores

2.4 Comparing Propensity Scores

3 match_1 1:1 greedy matching without replacement with the Matching package

3.1 ATT vs. ATE vs. ATC estimates

3.2 Obtaining the Matched Sample

3.3 Checking Covariate Balance for our 1:1 Greedy Match

3.3.1 Using bal.tab to obtain a balance table

3.3.2 Checking Rubin’s Rules 1 and 2

3.3.3 Using bal.plot from cobalt

3.3.4 Using love.plot to look at Standardized Differences

3.3.5 Using love.plot to look at Variance Ratios

4 match_2 1:2 greedy matching without replacement with the Matching package

4.1 Obtaining the Matched Sample

4.2 Checking Covariate Balance for our 1:2 Greedy Match

4.2.1 Using bal.tab to obtain a balance table

4.2.2 Checking Rubin’s Rules 1 and 2

4.2.3 Using bal.plot from cobalt

4.2.4 Using love.plot to look at Standardized Differences

4.2.5 Using love.plot to look at Variance Ratios

5 match_3 1:3 matching, with replacement with the Matching package

5.1 Obtaining the Matched Sample

5.2 Checking Covariate Balance for our 1:3 Match

5.2.1 Using bal.tab to obtain a balance table

5.2.2 Checking Rubin’s Rules 1 and 2

5.2.3 Using bal.plot from cobalt

5.2.4 Using love.plot to look at Standardized Differences

5.2.5 Using love.plot to look at Variance Ratios

6 match_4 Caliper Matching (1:1 without replacement) with the Matching package

6.1 Obtaining the Matched Sample

6.2 Checking Covariate Balance for our 1:1 Caliper Match

6.2.1 Using bal.tab to obtain a balance table

6.2.2 Checking Rubin’s Rules 1 and 2

6.2.3 Using bal.plot from cobalt

6.2.4 Using love.plot to look at Standardized Differences

6.2.5 Using love.plot to look at Variance Ratios

7 match_5 1:1 Nearest Neighbor Matching with Matchit

7.1 Obtaining the Matched Sample

7.2 Checking Covariate Balance for our 1:1 Nearest Neighbor Match

7.2.1 Default Numerical Balance Summary from MatchIt

7.2.2 Using bal.tab to obtain a balance table

7.2.3 Checking Rubin’s Rules 1 and 2

7.2.4 Using bal.plot from cobalt

7.2.5 Using love.plot to look at Standardized Differences

7.2.6 Using love.plot to look at Variance Ratios

8 match_6 1:1 Nearest Neighbor Caliper Matching with Matchit

8.1 Obtaining the Matched Sample

8.2 Checking Covariate Balance for our 1:1 Nearest Neighbor Caliper Match

8.2.1 Using bal.tab to obtain a balance table

8.2.2 Checking Rubin’s Rules 1 and 2

8.2.3 Using bal.plot from cobalt

8.2.4 Using love.plot to look at Standardized Differences

8.2.5 Using love.plot to look at Variance Ratios

9 match_7 1:1 Optimal Matching with Matchit

9.1 Obtaining the Matched Sample

9.2 Checking Covariate Balance for our 1:1 Optimal Match

9.2.1 Using bal.tab to obtain a balance table

9.2.2 Checking Rubin’s Rules 1 and 2

9.2.3 Are the Optimal and Nearest Neighbor Matches the Same?

9.2.4 Using bal.plot from cobalt

9.2.5 Using love.plot to look at Standardized Differences

9.2.6 Using love.plot to look at Variance Ratios

10 match_8 1:2 Optimal Matching with Matchit

10.1 Obtaining the Matched Sample

10.2 Checking Covariate Balance for our 1:2 Optimal Match

10.2.1 Using bal.tab to obtain a balance table

10.2.2 Checking Rubin’s Rules 1 and 2

10.2.3 Using bal.plot from cobalt

10.2.4 Using love.plot to look at Standardized Differences

10.2.5 Using love.plot to look at Variance Ratios

11 Outcome Models

11.1 Potential Fitting Problems

11.2 Unadjusted Models prior to Propensity Matching

11.2.1 Unadjusted Outcome Model for bp_good

11.2.2 Unadjusted Outcome Model for bmi

1.1 The `dm2200` Data Set

3 `match_1` 1:1 greedy matching without replacement with the `Matching` package

3.3.1 Using `bal.tab` to obtain a balance table

3.3.3 Using `bal.plot` from `cobalt`

3.3.4 Using `love.plot` to look at Standardized Differences

3.3.5 Using `love.plot` to look at Variance Ratios

4 `match_2` 1:2 greedy matching without replacement with the `Matching` package

4.2.1 Using `bal.tab` to obtain a balance table

4.2.3 Using `bal.plot` from `cobalt`

4.2.4 Using `love.plot` to look at Standardized Differences

4.2.5 Using `love.plot` to look at Variance Ratios

5 `match_3` 1:3 matching, with replacement with the `Matching` package

5.2.1 Using `bal.tab` to obtain a balance table

5.2.3 Using `bal.plot` from `cobalt`

5.2.4 Using `love.plot` to look at Standardized Differences

5.2.5 Using `love.plot` to look at Variance Ratios

6 `match_4` Caliper Matching (1:1 without replacement) with the `Matching` package

6.2.1 Using `bal.tab` to obtain a balance table

6.2.3 Using `bal.plot` from `cobalt`

6.2.4 Using `love.plot` to look at Standardized Differences

6.2.5 Using `love.plot` to look at Variance Ratios

7 `match_5` 1:1 Nearest Neighbor Matching with `Matchit`

7.2.1 Default Numerical Balance Summary from `MatchIt`

7.2.2 Using `bal.tab` to obtain a balance table

7.2.4 Using `bal.plot` from `cobalt`

7.2.5 Using `love.plot` to look at Standardized Differences

7.2.6 Using `love.plot` to look at Variance Ratios

8 `match_6` 1:1 Nearest Neighbor Caliper Matching with `Matchit`

8.2.1 Using `bal.tab` to obtain a balance table

8.2.3 Using `bal.plot` from `cobalt`

8.2.4 Using `love.plot` to look at Standardized Differences

8.2.5 Using `love.plot` to look at Variance Ratios

9 `match_7` 1:1 Optimal Matching with `Matchit`

9.2.1 Using `bal.tab` to obtain a balance table

9.2.4 Using `bal.plot` from `cobalt`

9.2.5 Using `love.plot` to look at Standardized Differences

9.2.6 Using `love.plot` to look at Variance Ratios

10 `match_8` 1:2 Optimal Matching with `Matchit`

10.2.1 Using `bal.tab` to obtain a balance table

10.2.3 Using `bal.plot` from `cobalt`

10.2.4 Using `love.plot` to look at Standardized Differences

10.2.5 Using `love.plot` to look at Variance Ratios

11.2.1 Unadjusted Outcome Model for `bp_good`

11.2.2 Unadjusted Outcome Model for `bmi`

11.3 Adjusted Outcome Models after `match1`

11.3.1 Binary Outcome: `bp_good`

11.3.2 Quantitative Outcome: `bmi`

11.4 Adjusted Outcome Models after `match2`

11.4.1 Binary Outcome: `bp_good`

11.4.2 Quantitative Outcome: `bmi`

11.5 Adjusted Outcome Models after `match3`

11.5.1 Binary Outcome: `bp_good`

11.5.2 Quantitative Outcome: `bmi`

11.6 Adjusted Outcome Models after `match4`

11.6.1 Binary Outcome: `bp_good`

11.6.2 Quantitative Outcome: `bmi`

11.7 Adjusted Outcome Models after `match5`

11.7.1 Binary Outcome: `bp_good`

11.7.2 Quantitative Outcome: `bmi`

11.8 Adjusted Outcome Models after `match6`

11.8.1 Binary Outcome: `bp_good`

11.8.2 Quantitative Outcome: `bmi`

11.9 Adjusted Outcome Models after `match7`

11.9.1 Binary Outcome: `bp_good`

11.9.2 Quantitative Outcome: `bmi`

11.10 Adjusted Outcome Models after `match8`

11.10.1 Binary Outcome: `bp_good`

11.10.2 Quantitative Outcome: `bmi`