8 Weighting
8.1 R setup for this chapter
Appendix A lists all R packages used in this book, and also provides R session information. Appendix B describes the 431-Love.R script, and demonstrates its use.
8.2 Data from a SAS file: NHANES 2015-16
Appendix C provides further guidance on pulling data from other systems into R, while Appendix D gives more information (including download links) for all data sets used in this book.
In this chapter, we’ll be using data from the 2015-16 administration of the NHANES (National Health and Nutrition Examination Survey.) See https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/overview.aspx?BeginYear=2015 for an overview of NHANES 2015-16.
Course Project B also uses NHANES data, and there we will use the nhanesA
(see https://cran.r-project.org/web/packages/nhanesA/vignettes/Introducing_nhanesA.html) R package to pull in publicly available data.
Here, though, we’ll use two SAS files (called DEMO_I.xpt
and BPX_I.xpt
) downloaded from the NHANES 2015-16 data repository to pull in the data we need. The files are known as SAS Transport files, and carry the extension .xpt
. The haven
package provides the read_xpt()
function to ingest such files.
Now, we’ll reduce the size of each raw file to include only the variables we plan to use.
Next, we’ll merge the two files together (each contains an ID variable called SEQN which allows us to do this) using the left_join()
function from the dplyr
package within the tidyverse
family.
nh1516 <- left_join(demo_i, bpx_i, by = "SEQN")
Several of the variables in nh1516
are in fact categorical, and we’ll account for that as follows.
nh1516 <- nh1516 |>
mutate(RIDSTATR = factor(RIDSTATR),
RIAGENDR = factor(RIAGENDR),
SIALANG = factor(SIALANG),
SEQN = as.character(SEQN))
8.3 What’s in the Data?
We now have a tibble containing 9971 rows and 9 columns.
nh1516
# A tibble: 9,971 × 9
SEQN RIDSTATR RIAGENDR RIDAGEYR SIALANG WTINT2YR WTMEC2YR BPXSY1 BPXDI1
<chr> <fct> <fct> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 83732 2 1 62 1 134671. 135630. 128 70
2 83733 2 1 53 1 24329. 25282. 146 88
3 83734 2 1 78 1 12400. 12576. 138 46
4 83735 2 2 56 1 102718. 102079. 132 72
5 83736 2 2 42 1 17628. 18235. 100 70
6 83737 2 2 72 2 11252. 10879. 116 58
7 83738 2 2 11 1 9965. 9861. 102 36
8 83739 2 1 4 1 44750. 46173. NA NA
9 83740 2 1 1 1 9892. 10963. NA NA
10 83741 2 1 22 1 37043. 39353. 110 70
# ℹ 9,961 more rows
From the About NHANES page…
The NHANES interview includes demographic, socioeconomic, dietary, and health-related questions. The examination component consists of medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel.
The 9 variables included in nh1516
are:
Variable | Description | Source |
---|---|---|
SEQN | respondent sequence number (code) | DEMO_I BPX_I |
RIDSTATR | interview/examination status (1 = Interview only, 2 = Interview + MEC Exam) |
DEMO_I |
RIAGENDR | sex1 (1 = Male, 2 = Female) | DEMO_I |
RIDAGEYR | age in years at screening | DEMO_I |
SIALANG | language used2 in interview (1 = English, 2 = Spanish) | DEMO_I |
WTINT2YR | full sample 2 year interview weight | DEMO_I |
WTMEC2YR | full sample 2 year MEC3 exam weight | DEMO_I |
BPXSY1 | first systolic blood pressure, in mm Hg | BPX_I |
BPXDI1 | first diastolic blood pressure, in mm Hg | BPX_I |
8.3.1 Renaming factor levels
nh1516 |> count(RIAGENDR, SIALANG)
# A tibble: 4 × 3
RIAGENDR SIALANG n
<fct> <fct> <int>
1 1 1 4239
2 1 2 653
3 2 1 4345
4 2 2 734
It would help to rename the factor levels from 1 and 2 to something more meaningful.
nh1516 <- nh1516 |>
mutate(RIAGENDR =
fct_recode(RIAGENDR, "Male" = "1", "Female" = "2"),
SIALANG =
fct_recode(SIALANG, "English" = "1", "Spanish" = "2"),
RIDSTATR =
fct_recode(RIDSTATR, "Int only" = "1",
"Exam and Int" = "2")
)
nh1516 |> count(RIAGENDR, SIALANG)
# A tibble: 4 × 3
RIAGENDR SIALANG n
<fct> <fct> <int>
1 Male English 4239
2 Male Spanish 653
3 Female English 4345
4 Female Spanish 734
8.3.2 Missing data?
The only missing values in our data are blood pressures. This is because not all subjects completed the MEC examination (where the blood pressures were recorded) as well as the interview.
miss_var_summary(nh1516)
# A tibble: 9 × 3
variable n_miss pct_miss
<chr> <int> <num>
1 BPXSY1 2826 28.3
2 BPXDI1 2826 28.3
3 SEQN 0 0
4 RIDSTATR 0 0
5 RIAGENDR 0 0
6 RIDAGEYR 0 0
7 SIALANG 0 0
8 WTINT2YR 0 0
9 WTMEC2YR 0 0
8.4 What is the average age?
8.4.1 Unweighted
Here’s a summary of the age information in our sample.
n | miss | mean | sd | med | mad | min | q25 | q75 | max |
---|---|---|---|---|---|---|---|---|---|
9971 | 0 | 31.9 | 24.77 | 27 | 29.65 | 0 | 9 | 53 | 80 |
That’s an accurate description of the ages shown in our sample, but each subject in NHANES is assigned a different sampling weight (called WTINT2YR
) which allows this sample to represent the NHANES Target Population, which is…
The NHANES (target) population is the non-institutionalized U.S. civilian population of all ages residing in all 50 states and Washington, D.C. The survey exmaines a nationally representative sample of about 5,000 persons per year.
8.4.2 Weighting is important here
In statistics it is common to reweight data or inferences so as to adapt to a target population. – Gelman, Hill, and Vehtari (2021)
From the description of the DEMO_I file:
The 2-year sample weights (WTINT2YR, WTMEC2YR) should be used for all NHANES 2015-2016 analyses.
For a variable (like AGE) gathered through the interview process, we will use the WTINT2YR
weight.
# A tibble: 6 × 3
SEQN RIDAGEYR WTINT2YR
<chr> <dbl> <dbl>
1 83732 62 134671.
2 83733 53 24329.
3 83734 78 12400.
4 83735 56 102718.
5 83736 42 17628.
6 83737 72 11252.
So, for example, subject 83732 is 62 years old, and has a sampling weight of more than 134,600 people, while subject 83733 (aged 53) has a much smaller sampling weight of just over 24,300 people. In order to describe the NHANES target population, we must apply these weights in calculating summaries like the mean, median, standard deviation or MAD of age. So how might we do this?
8.4.3 Using weighted_mean()
, etc.
A series of functions, called weighted_mean()
, weighted_median()
and so forth, are available from the datawizard
package in the easystats
family.
weighted_mean(nh1516$RIDAGEYR, weights = nh1516$WTINT2YR)
[1] 37.98659
weighted_sd(nh1516$RIDAGEYR, weights = nh1516$WTINT2YR)
[1] 22.60222
weighted_median(nh1516$RIDAGEYR, weights = nh1516$WTINT2YR)
[1] 37
weighted_mad(nh1516$RIDAGEYR, weights = nh1516$WTINT2YR)
[1] 28.1694
Compare these to the unweighted results (repeated below) and we see that the weighting makes a big difference - after weighting the sample appears to describe a much older population.
8.5 Variation in age by preferred language?
Can we compare the ages we see by the language used in the interview?
8.5.1 Ignoring the weighting
means_by_group(nh1516$RIDAGEYR, by = nh1516$SIALANG)
# Mean of Age in years at screening by nh1516$SIALANG
Category | Mean | N | SD | 95% CI | p
---------------------------------------------------------
English | 31.45 | 8584 | 24.73 | [30.93, 31.97] | < .001
Spanish | 34.67 | 1387 | 24.85 | [33.37, 35.98] | < .001
Total | 31.90 | 9971 | 24.77 | |
Anova: R2=0.002; adj.R2=0.002; F=20.258; p<.001
fit1 <- lm(RIDAGEYR ~ SIALANG, data = nh1516)
fit1 |> model_parameters(ci = 0.95) |> kable(digits = 2)
Parameter | Coefficient | SE | CI | CI_low | CI_high | t | df_error | p |
---|---|---|---|---|---|---|---|---|
(Intercept) | 31.45 | 0.27 | 0.95 | 30.93 | 31.97 | 117.76 | 9969 | 0 |
SIALANGSpanish | 3.22 | 0.72 | 0.95 | 1.82 | 4.63 | 4.50 | 9969 | 0 |
Ignoring the weighting, we see that the mean age of subjects for whom the interview is in Spanish is about 3.2 years older than those who were interviewed in English, with 95% CI (1.8, 4.6) years.
8.5.2 Weighted means within subgroups
Let’s look at the mean age after weighting within each SIALANG
group.
nh1516_ENG <- nh1516 |> filter(SIALANG == "English")
weighted_mean(nh1516_ENG$RIDAGEYR, weights = nh1516_ENG$WTINT2YR)
[1] 38.28088
weighted_sd(nh1516_ENG$RIDAGEYR, weights = nh1516_ENG$WTINT2YR)
[1] 22.68499
nh1516_ESP <- nh1516 |> filter(SIALANG == "Spanish")
weighted_mean(nh1516_ESP$RIDAGEYR, weights = nh1516_ESP$WTINT2YR)
[1] 34.21842
weighted_sd(nh1516_ESP$RIDAGEYR, weights = nh1516_ESP$WTINT2YR)
[1] 21.16711
8.5.3 Using means_by_group()
Another approach is available through the means_by_group()
function.
means_by_group(nh1516$RIDAGEYR,
by = nh1516$SIALANG,
weights = nh1516$WTINT2YR)
# Mean of Age in years at screening by nh1516$SIALANG
Category | Mean | N | SD | 95% CI | p
----------------------------------------------------------
English | 38.28 | 3e+08 | 22.68 | [37.82, 38.74] | < .001
Spanish | 34.22 | 2e+07 | 21.17 | [32.57, 35.87] | < .001
Total | 37.99 | 9971 | 22.60 | |
Anova: R2=0.002; adj.R2=0.002; F=21.691; p<.001
8.5.4 Comparing Weighted Means: Linear Model
fit1w <- lm(RIDAGEYR ~ SIALANG,
weights = WTINT2YR, data = nh1516)
fit1w |> model_parameters(ci = 0.95) |> kable(digits = 2)
Parameter | Coefficient | SE | CI | CI_low | CI_high | t | df_error | p |
---|---|---|---|---|---|---|---|---|
(Intercept) | 38.28 | 0.23 | 0.95 | 37.82 | 38.74 | 163.06 | 9969 | 0 |
SIALANGSpanish | -4.06 | 0.87 | 0.95 | -5.77 | -2.35 | -4.66 | 9969 | 0 |
However, after incorporating the weighting, we see that the weighted mean age of subjects for whom the interview is in Spanish is almost 4.1 years YOUNGER than those who were interviewed in English, with 95% CI (2.4, 5.8) years.
So the weighting in fact alters the direction of our estimate. This is deliberate, as primary language is one of the strata on which the NHANES weights are based to match the sample to the target population.
8.6 What is the average systolic blood pressure (SBP)?
8.6.1 Unweighted SBP summaries
8.6.2 Weighted mean and SD
For the blood pressure readings, which are obtained through the MEC exam, rather than through the interview, we will need to use the MEC weights gathered in WTMEC2YR
instead of the interview weights we used previously.
weighted_mean(nh1516$BPXSY1, weights = nh1516$WTMEC2YR)
Warning: Some `weights` were negative. Weighting not carried out.
[1] 120.5394
Whoops - we have some negative weights?
summary(nh1516$WTMEC2YR)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 12551 20281 31740 33708 242387
Actually, it looks like we have some zeros.
Are all of these coming from the people who were not examined?
nh1516 |>
reframe(lovedist(WTMEC2YR), .by = RIDSTATR)
# A tibble: 2 × 11
RIDSTATR n miss mean sd med mad min q25 q75 max
<fct> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Exam and I… 9544 0 33160. 34178. 21034. 13256. 3419. 13395. 34969. 2.42e5
2 Int only 427 0 0 0 0 0 0 0 0 0
OK. So in order to apply the weights, we need to restrict ourselves to only those subjects who completed the MEC Examination.
nh1516_exam <- nh1516 |>
filter(RIDSTATR == "Exam and Int")
WTMEC2YR
Min. : 3419
1st Qu.: 13395
Median : 21034
Mean : 33160
3rd Qu.: 34969
Max. :242387
OK. Now we shouldn’t have any non-positive weights…
weighted_mean(nh1516_exam$BPXSY1, weights = nh1516_exam$WTMEC2YR)
[1] 120.6674
weighted_sd(nh1516_exam$BPXSY1, weights = nh1516_exam$WTMEC2YR)
[1] 17.45271
After weighting, the mean SBP is a little larger, and the standard deviation of SBP is a little smaller.
8.7 Variation in SBP by sex?
8.7.1 Ignoring the weighting
means_by_group(nh1516$BPXSY1, by = nh1516$RIAGENDR)
# Mean of Systolic: Blood pres (1st rdg) mm Hg by nh1516$RIAGENDR
Category | Mean | N | SD | 95% CI | p
------------------------------------------------------------
Male | 122.18 | 3498 | 18.15 | [121.56, 122.79] | < .001
Female | 118.97 | 3647 | 18.93 | [118.37, 119.57] | < .001
Total | 120.54 | 7145 | 18.62 | |
Anova: R2=0.007; adj.R2=0.007; F=53.393; p<.001
fit2 <- lm(BPXSY1 ~ RIAGENDR, data = nh1516)
fit2 |> model_parameters(ci = 0.95) |> kable(digits = 2)
Parameter | Coefficient | SE | CI | CI_low | CI_high | t | df_error | p |
---|---|---|---|---|---|---|---|---|
(Intercept) | 122.18 | 0.31 | 0.95 | 121.56 | 122.79 | 389.56 | 7143 | 0 |
RIAGENDRFemale | -3.21 | 0.44 | 0.95 | -4.07 | -2.35 | -7.31 | 7143 | 0 |
Ignoring the weighting, we see that the mean SBP for female subjects is about 3.2 mm Hg smaller than male subjects, with 95% CI for the difference of (2.4, 4.1) mm Hg.
8.7.2 Weighted Means by sex
Again, we’ll restrict ourselves to those completing both the Exam and the interview.
means_by_group(nh1516_exam$BPXSY1,
by = nh1516_exam$RIAGENDR,
weights = nh1516_exam$WTMEC2YR)
# Mean of Systolic: Blood pres (1st rdg) mm Hg by nh1516_exam$RIAGENDR
Category | Mean | N | SD | 95% CI | p
-------------------------------------------------------------
Male | 122.08 | 1e+08 | 16.42 | [121.50, 122.66] | < .001
Female | 119.33 | 1e+08 | 18.28 | [118.77, 119.90] | < .001
Total | 120.67 | 7145 | 17.45 | |
Anova: R2=0.006; adj.R2=0.006; F=44.449; p<.001
8.7.3 Comparing Weighted Means: Linear Model
fit2w <- lm(BPXSY1 ~ RIAGENDR,
weights = WTMEC2YR, data = nh1516_exam)
fit2w |> model_parameters(ci = 0.95) |> kable(digits = 2)
Parameter | Coefficient | SE | CI | CI_low | CI_high | t | df_error | p |
---|---|---|---|---|---|---|---|---|
(Intercept) | 122.08 | 0.30 | 0.95 | 121.50 | 122.66 | 413.58 | 7143 | 0 |
RIAGENDRFemale | -2.75 | 0.41 | 0.95 | -3.55 | -1.94 | -6.67 | 7143 | 0 |
After weighting, the SBP difference is more like 2.75 mm Hg (95% CI: 1.9, 3.6) which is a bit smaller than what we saw without weighting.
8.8 For More Information
The National Center for Health Statistics, as part of its materials on the NHANES (National Health and Nutrition Examination Survey) publishes modules found here on why weights are calculated, how they are calculated, the importance of weights in making estimates that are representative of the U.S. civilian non-institutionalized population, how to select the appropriate weight to use in your analysis, and other issues.
Thomas Lumley on Weights in Statistics from 2020 is a nice way of learning more about the various uses of the terms weights and weighting in statistics and data science.
Max Kuhn on Using case weights with tidymodels in R may be helpful to you for a general overview. Case weights are non-negative numbers used to specify how much each observation influences the estimation of a model.
Within R for Data Science, the Joins chapter provides more information about joining and merging R data frames.
The name
RIAGENDR
is somewhat unfortunate, as this variable describes biological sex, rather than gender.↩︎Persons 16 years and older and emancipated minors were interviewed directly. A proxy provided information for survey participants who were under 16 and for participants who could not answer the questions themselves.↩︎
MEC = Mobile Examination Center↩︎