8  Weighting

8.1 R setup for this chapter

Note

Appendix A lists all R packages used in this book, and also provides R session information. Appendix B describes the 431-Love.R script, and demonstrates its use.

8.2 Data from a SAS file: NHANES 2015-16

Note

Appendix C provides further guidance on pulling data from other systems into R, while Appendix D gives more information (including download links) for all data sets used in this book.

In this chapter, we’ll be using data from the 2015-16 administration of the NHANES (National Health and Nutrition Examination Survey.) See https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/overview.aspx?BeginYear=2015 for an overview of NHANES 2015-16.

Course Project B also uses NHANES data, and there we will use the nhanesA (see https://cran.r-project.org/web/packages/nhanesA/vignettes/Introducing_nhanesA.html) R package to pull in publicly available data.

Here, though, we’ll use two SAS files (called DEMO_I.xpt and BPX_I.xpt) downloaded from the NHANES 2015-16 data repository to pull in the data we need. The files are known as SAS Transport files, and carry the extension .xpt. The haven package provides the read_xpt() function to ingest such files.

demo_i_raw <- read_xpt("data/DEMO_I.xpt")
bpx_i_raw <- read_xpt("data/BPX_I.xpt")

Now, we’ll reduce the size of each raw file to include only the variables we plan to use.

demo_i <- demo_i_raw |> 
  select(SEQN, RIDSTATR, RIAGENDR, RIDAGEYR, SIALANG,
         WTINT2YR, WTMEC2YR)

bpx_i <- bpx_i_raw |> select(SEQN, BPXSY1, BPXDI1)

Next, we’ll merge the two files together (each contains an ID variable called SEQN which allows us to do this) using the left_join() function from the dplyr package within the tidyverse family.

nh1516 <- left_join(demo_i, bpx_i, by = "SEQN")

Several of the variables in nh1516 are in fact categorical, and we’ll account for that as follows.

nh1516 <- nh1516 |>
  mutate(RIDSTATR = factor(RIDSTATR),
         RIAGENDR = factor(RIAGENDR),
         SIALANG = factor(SIALANG),
         SEQN = as.character(SEQN))

8.3 What’s in the Data?

We now have a tibble containing 9971 rows and 9 columns.

nh1516
# A tibble: 9,971 × 9
   SEQN  RIDSTATR RIAGENDR RIDAGEYR SIALANG WTINT2YR WTMEC2YR BPXSY1 BPXDI1
   <chr> <fct>    <fct>       <dbl> <fct>      <dbl>    <dbl>  <dbl>  <dbl>
 1 83732 2        1              62 1        134671.  135630.    128     70
 2 83733 2        1              53 1         24329.   25282.    146     88
 3 83734 2        1              78 1         12400.   12576.    138     46
 4 83735 2        2              56 1        102718.  102079.    132     72
 5 83736 2        2              42 1         17628.   18235.    100     70
 6 83737 2        2              72 2         11252.   10879.    116     58
 7 83738 2        2              11 1          9965.    9861.    102     36
 8 83739 2        1               4 1         44750.   46173.     NA     NA
 9 83740 2        1               1 1          9892.   10963.     NA     NA
10 83741 2        1              22 1         37043.   39353.    110     70
# ℹ 9,961 more rows

From the About NHANES page

The NHANES interview includes demographic, socioeconomic, dietary, and health-related questions. The examination component consists of medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel.

The 9 variables included in nh1516 are:

Variable Description Source
SEQN respondent sequence number (code) DEMO_I
BPX_I
RIDSTATR interview/examination status
(1 = Interview only, 2 = Interview + MEC Exam)
DEMO_I
RIAGENDR sex1 (1 = Male, 2 = Female) DEMO_I
RIDAGEYR age in years at screening DEMO_I
SIALANG language used2 in interview (1 = English, 2 = Spanish) DEMO_I
WTINT2YR full sample 2 year interview weight DEMO_I
WTMEC2YR full sample 2 year MEC3 exam weight DEMO_I
BPXSY1 first systolic blood pressure, in mm Hg BPX_I
BPXDI1 first diastolic blood pressure, in mm Hg BPX_I

8.3.1 Renaming factor levels

nh1516 |> count(RIAGENDR, SIALANG)
# A tibble: 4 × 3
  RIAGENDR SIALANG     n
  <fct>    <fct>   <int>
1 1        1        4239
2 1        2         653
3 2        1        4345
4 2        2         734

It would help to rename the factor levels from 1 and 2 to something more meaningful.

nh1516 <- nh1516 |>
  mutate(RIAGENDR = 
           fct_recode(RIAGENDR, "Male" = "1", "Female" = "2"),
         SIALANG = 
           fct_recode(SIALANG, "English" = "1", "Spanish" = "2"),
         RIDSTATR =
           fct_recode(RIDSTATR, "Int only" = "1",
                      "Exam and Int" = "2")
  )

nh1516 |> count(RIAGENDR, SIALANG)
# A tibble: 4 × 3
  RIAGENDR SIALANG     n
  <fct>    <fct>   <int>
1 Male     English  4239
2 Male     Spanish   653
3 Female   English  4345
4 Female   Spanish   734

8.3.2 Missing data?

The only missing values in our data are blood pressures. This is because not all subjects completed the MEC examination (where the blood pressures were recorded) as well as the interview.

# A tibble: 9 × 3
  variable n_miss pct_miss
  <chr>     <int>    <num>
1 BPXSY1     2826     28.3
2 BPXDI1     2826     28.3
3 SEQN          0      0  
4 RIDSTATR      0      0  
5 RIAGENDR      0      0  
6 RIDAGEYR      0      0  
7 SIALANG       0      0  
8 WTINT2YR      0      0  
9 WTMEC2YR      0      0  

8.4 What is the average age?

8.4.1 Unweighted

Here’s a summary of the age information in our sample.

nh1516 |>
  reframe(lovedist(RIDAGEYR)) |>
  kable(digits = 2)
n miss mean sd med mad min q25 q75 max
9971 0 31.9 24.77 27 29.65 0 9 53 80

That’s an accurate description of the ages shown in our sample, but each subject in NHANES is assigned a different sampling weight (called WTINT2YR) which allows this sample to represent the NHANES Target Population, which is…

The NHANES (target) population is the non-institutionalized U.S. civilian population of all ages residing in all 50 states and Washington, D.C. The survey exmaines a nationally representative sample of about 5,000 persons per year.

8.4.2 Weighting is important here

In statistics it is common to reweight data or inferences so as to adapt to a target population. – Gelman, Hill, and Vehtari (2021)

From the description of the DEMO_I file:

The 2-year sample weights (WTINT2YR, WTMEC2YR) should be used for all NHANES 2015-2016 analyses.

For a variable (like AGE) gathered through the interview process, we will use the WTINT2YR weight.

nh1516 |> select(SEQN, RIDAGEYR, WTINT2YR) |> head()
# A tibble: 6 × 3
  SEQN  RIDAGEYR WTINT2YR
  <chr>    <dbl>    <dbl>
1 83732       62  134671.
2 83733       53   24329.
3 83734       78   12400.
4 83735       56  102718.
5 83736       42   17628.
6 83737       72   11252.

So, for example, subject 83732 is 62 years old, and has a sampling weight of more than 134,600 people, while subject 83733 (aged 53) has a much smaller sampling weight of just over 24,300 people. In order to describe the NHANES target population, we must apply these weights in calculating summaries like the mean, median, standard deviation or MAD of age. So how might we do this?

8.4.3 Using weighted_mean(), etc.

A series of functions, called weighted_mean(), weighted_median() and so forth, are available from the datawizard package in the easystats family.

weighted_mean(nh1516$RIDAGEYR, weights = nh1516$WTINT2YR)
[1] 37.98659
weighted_sd(nh1516$RIDAGEYR, weights = nh1516$WTINT2YR)
[1] 22.60222
weighted_median(nh1516$RIDAGEYR, weights = nh1516$WTINT2YR)
[1] 37
weighted_mad(nh1516$RIDAGEYR, weights = nh1516$WTINT2YR)
[1] 28.1694

Compare these to the unweighted results (repeated below) and we see that the weighting makes a big difference - after weighting the sample appears to describe a much older population.

nh1516 |> summarise(
  mean = mean(RIDAGEYR), sd = sd(RIDAGEYR),
  median = median(RIDAGEYR), mad = mad(RIDAGEYR)
)
# A tibble: 1 × 4
   mean    sd median   mad
  <dbl> <dbl>  <dbl> <dbl>
1  31.9  24.8     27  29.7

8.5 Variation in age by preferred language?

Can we compare the ages we see by the language used in the interview?

8.5.1 Ignoring the weighting

means_by_group(nh1516$RIDAGEYR, by = nh1516$SIALANG)
# Mean of Age in years at screening by nh1516$SIALANG

Category |  Mean |    N |    SD |         95% CI |      p
---------------------------------------------------------
English  | 31.45 | 8584 | 24.73 | [30.93, 31.97] | < .001
Spanish  | 34.67 | 1387 | 24.85 | [33.37, 35.98] | < .001
Total    | 31.90 | 9971 | 24.77 |                |       

Anova: R2=0.002; adj.R2=0.002; F=20.258; p<.001
fit1 <- lm(RIDAGEYR ~ SIALANG, data = nh1516)

fit1 |> model_parameters(ci = 0.95) |> kable(digits = 2)
Parameter Coefficient SE CI CI_low CI_high t df_error p
(Intercept) 31.45 0.27 0.95 30.93 31.97 117.76 9969 0
SIALANGSpanish 3.22 0.72 0.95 1.82 4.63 4.50 9969 0

Ignoring the weighting, we see that the mean age of subjects for whom the interview is in Spanish is about 3.2 years older than those who were interviewed in English, with 95% CI (1.8, 4.6) years.

8.5.2 Weighted means within subgroups

Let’s look at the mean age after weighting within each SIALANG group.

nh1516_ENG <- nh1516 |> filter(SIALANG == "English")

weighted_mean(nh1516_ENG$RIDAGEYR, weights = nh1516_ENG$WTINT2YR)
[1] 38.28088
weighted_sd(nh1516_ENG$RIDAGEYR, weights = nh1516_ENG$WTINT2YR)
[1] 22.68499
nh1516_ESP <- nh1516 |> filter(SIALANG == "Spanish")

weighted_mean(nh1516_ESP$RIDAGEYR, weights = nh1516_ESP$WTINT2YR)
[1] 34.21842
weighted_sd(nh1516_ESP$RIDAGEYR, weights = nh1516_ESP$WTINT2YR)
[1] 21.16711

8.5.3 Using means_by_group()

Another approach is available through the means_by_group() function.

means_by_group(nh1516$RIDAGEYR, 
               by = nh1516$SIALANG, 
               weights = nh1516$WTINT2YR)
# Mean of Age in years at screening by nh1516$SIALANG

Category |  Mean |     N |    SD |         95% CI |      p
----------------------------------------------------------
English  | 38.28 | 3e+08 | 22.68 | [37.82, 38.74] | < .001
Spanish  | 34.22 | 2e+07 | 21.17 | [32.57, 35.87] | < .001
Total    | 37.99 |  9971 | 22.60 |                |       

Anova: R2=0.002; adj.R2=0.002; F=21.691; p<.001

8.5.4 Comparing Weighted Means: Linear Model

fit1w <- lm(RIDAGEYR ~ SIALANG, 
            weights = WTINT2YR, data = nh1516)

fit1w |> model_parameters(ci = 0.95) |> kable(digits = 2)
Parameter Coefficient SE CI CI_low CI_high t df_error p
(Intercept) 38.28 0.23 0.95 37.82 38.74 163.06 9969 0
SIALANGSpanish -4.06 0.87 0.95 -5.77 -2.35 -4.66 9969 0

However, after incorporating the weighting, we see that the weighted mean age of subjects for whom the interview is in Spanish is almost 4.1 years YOUNGER than those who were interviewed in English, with 95% CI (2.4, 5.8) years.

So the weighting in fact alters the direction of our estimate. This is deliberate, as primary language is one of the strata on which the NHANES weights are based to match the sample to the target population.

8.6 What is the average systolic blood pressure (SBP)?

8.6.1 Unweighted SBP summaries

nh1516 |> reframe(lovedist(BPXSY1)) |> kable(digits = 2)
n miss mean sd med mad min q25 q75 max
9971 2826 120.54 18.62 118 17.79 72 108 130 236

8.6.2 Weighted mean and SD

For the blood pressure readings, which are obtained through the MEC exam, rather than through the interview, we will need to use the MEC weights gathered in WTMEC2YR instead of the interview weights we used previously.

weighted_mean(nh1516$BPXSY1, weights = nh1516$WTMEC2YR)
Warning: Some `weights` were negative. Weighting not carried out.
[1] 120.5394

Whoops - we have some negative weights?

summary(nh1516$WTMEC2YR)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0   12551   20281   31740   33708  242387 

Actually, it looks like we have some zeros.

Are all of these coming from the people who were not examined?

nh1516 |>
  reframe(lovedist(WTMEC2YR), .by = RIDSTATR)
# A tibble: 2 × 11
  RIDSTATR        n  miss   mean     sd    med    mad   min    q25    q75    max
  <fct>       <int> <int>  <dbl>  <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1 Exam and I…  9544     0 33160. 34178. 21034. 13256. 3419. 13395. 34969. 2.42e5
2 Int only      427     0     0      0      0      0     0      0      0  0     

OK. So in order to apply the weights, we need to restrict ourselves to only those subjects who completed the MEC Examination.

nh1516_exam <- nh1516 |>
  filter(RIDSTATR == "Exam and Int")
nh1516_exam |> select(WTMEC2YR) |> summary()
    WTMEC2YR     
 Min.   :  3419  
 1st Qu.: 13395  
 Median : 21034  
 Mean   : 33160  
 3rd Qu.: 34969  
 Max.   :242387  

OK. Now we shouldn’t have any non-positive weights…

weighted_mean(nh1516_exam$BPXSY1, weights = nh1516_exam$WTMEC2YR)
[1] 120.6674
weighted_sd(nh1516_exam$BPXSY1, weights = nh1516_exam$WTMEC2YR)
[1] 17.45271

After weighting, the mean SBP is a little larger, and the standard deviation of SBP is a little smaller.

8.7 Variation in SBP by sex?

8.7.1 Ignoring the weighting

means_by_group(nh1516$BPXSY1, by = nh1516$RIAGENDR)
# Mean of Systolic: Blood pres (1st rdg) mm Hg by nh1516$RIAGENDR

Category |   Mean |    N |    SD |           95% CI |      p
------------------------------------------------------------
Male     | 122.18 | 3498 | 18.15 | [121.56, 122.79] | < .001
Female   | 118.97 | 3647 | 18.93 | [118.37, 119.57] | < .001
Total    | 120.54 | 7145 | 18.62 |                  |       

Anova: R2=0.007; adj.R2=0.007; F=53.393; p<.001
fit2 <- lm(BPXSY1 ~ RIAGENDR, data = nh1516)

fit2 |> model_parameters(ci = 0.95) |> kable(digits = 2)
Parameter Coefficient SE CI CI_low CI_high t df_error p
(Intercept) 122.18 0.31 0.95 121.56 122.79 389.56 7143 0
RIAGENDRFemale -3.21 0.44 0.95 -4.07 -2.35 -7.31 7143 0

Ignoring the weighting, we see that the mean SBP for female subjects is about 3.2 mm Hg smaller than male subjects, with 95% CI for the difference of (2.4, 4.1) mm Hg.

8.7.2 Weighted Means by sex

Again, we’ll restrict ourselves to those completing both the Exam and the interview.

means_by_group(nh1516_exam$BPXSY1, 
               by = nh1516_exam$RIAGENDR, 
               weights = nh1516_exam$WTMEC2YR)
# Mean of Systolic: Blood pres (1st rdg) mm Hg by nh1516_exam$RIAGENDR

Category |   Mean |     N |    SD |           95% CI |      p
-------------------------------------------------------------
Male     | 122.08 | 1e+08 | 16.42 | [121.50, 122.66] | < .001
Female   | 119.33 | 1e+08 | 18.28 | [118.77, 119.90] | < .001
Total    | 120.67 |  7145 | 17.45 |                  |       

Anova: R2=0.006; adj.R2=0.006; F=44.449; p<.001

8.7.3 Comparing Weighted Means: Linear Model

fit2w <- lm(BPXSY1 ~ RIAGENDR, 
            weights = WTMEC2YR, data = nh1516_exam)

fit2w |> model_parameters(ci = 0.95) |> kable(digits = 2)
Parameter Coefficient SE CI CI_low CI_high t df_error p
(Intercept) 122.08 0.30 0.95 121.50 122.66 413.58 7143 0
RIAGENDRFemale -2.75 0.41 0.95 -3.55 -1.94 -6.67 7143 0

After weighting, the SBP difference is more like 2.75 mm Hg (95% CI: 1.9, 3.6) which is a bit smaller than what we saw without weighting.

8.8 For More Information

  1. The National Center for Health Statistics, as part of its materials on the NHANES (National Health and Nutrition Examination Survey) publishes modules found here on why weights are calculated, how they are calculated, the importance of weights in making estimates that are representative of the U.S. civilian non-institutionalized population, how to select the appropriate weight to use in your analysis, and other issues.

  2. Thomas Lumley on Weights in Statistics from 2020 is a nice way of learning more about the various uses of the terms weights and weighting in statistics and data science.

  3. Max Kuhn on Using case weights with tidymodels in R may be helpful to you for a general overview. Case weights are non-negative numbers used to specify how much each observation influences the estimation of a model.

  4. Within R for Data Science, the Joins chapter provides more information about joining and merging R data frames.


  1. The name RIAGENDR is somewhat unfortunate, as this variable describes biological sex, rather than gender.↩︎

  2. Persons 16 years and older and emancipated minors were interviewed directly. A proxy provided information for survey participants who were under 16 and for participants who could not answer the questions themselves.↩︎

  3. MEC = Mobile Examination Center↩︎