17  Nations of the World

This is a DRAFT version of this Chapter.

This is a sketchy draft. I’ll remove this notice when I post a version of this Chapter that is essentially finished.

17.1 R setup for this chapter

Note

Appendix A lists all R packages used in this book, and also provides R session information. Appendix B describes the 431-Love.R script, and demonstrates its use.

17.2 Data on Nations of the World

Note

Appendix C provides further guidance on pulling data from other systems into R, while Appendix D gives more information (including download links) for all data sets used in this book.

I collected these data from the World Health Organization and other sources, as of the end of May 2024. The primary resource was the WHO's Global Health Observatory; the GDP per capita data come from the World Bank.

nations <- read_csv("data/nations.csv", show_col_types = FALSE) |>
  mutate(
    UNIV_CARE = factor(UNIV_CARE),
    WHO_REGION = factor(WHO_REGION),
    C_NUM = as.character(C_NUM),
    GDP_PERCAP = GDP_PERCAP/1000
  ) |>
  janitor::clean_names()

glimpse(nations)
Rows: 192
Columns: 7
$ c_num        <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", …
$ country_name <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Angola",…
$ iso_alpha3   <chr> "AFG", "ALB", "DZA", "AND", "AGO", "ATG", "ARG", "ARM", "…
$ uhc_index    <dbl> 41, 64, 74, 79, 37, 76, 79, 68, 87, 85, 66, 77, 76, 52, 7…
$ univ_care    <fct> no, yes, yes, no, no, no, yes, no, yes, yes, no, yes, no,…
$ gdp_percap   <dbl> 0.356, 6.810, 4.343, 41.993, 3.000, 19.920, 13.651, 7.018…
$ who_region   <fct> Eastern Mediterranean, European, African, European, Afric…

The data include information on 7 variables (listed below), describing 192 separate countries of the world.

Note

By Country, we mean a sovereign state that is, in its own right, a member of both the United Nations and the World Health Organization. Liechtenstein belongs to the UN but not to the WHO, and it is the only UN member nation not included in the nations data.

| Variable | Description |
|----------|-------------|
| c_num | Country ID (alphabetical by country_name) |
| country_name | Name of country (according to WHO and UN) |
| iso_alpha3 | ISO alphabetical 3-letter code, gathered from WHO |
| uhc_index | Universal Health Coverage: Service coverage index1 (higher values indicate better coverage) |
| univ_care | yes if the country offered government-regulated and government-funded health care to more than 90% of its citizens as of 2023, else no |
| gdp_percap | GDP per capita, per the World Bank, in thousands of 2023 US dollars2 |
| who_region | One of six WHO regions3, by geographic distribution |

17.3 Exploratory Data Analysis

Our interest is in understanding the association of uhc_index and univ_care, perhaps after adjustment for the country’s gdp_percap.

17.3.1 EDA for our outcome

bw = 4 # specify width of bins in histogram

p1 <- ggplot(nations, aes(uhc_index)) +
  geom_histogram(binwidth = bw, 
                 fill = "black", col = "yellow") +
  stat_function(fun = function(x)
    dnorm(x, mean = mean(nations$uhc_index, 
                         na.rm = TRUE),
          sd = sd(nations$uhc_index, 
                  na.rm = TRUE)) *
      length(nations$uhc_index) * bw,  
    geom = "area", alpha = 0.5,
    fill = "lightblue", col = "blue"
    ) +
  labs(
    x = "UHC Service Coverage Index",
    title = "Histogram & Normal Curve"
  ) 

p2 <- ggplot(nations, aes(sample = uhc_index)) +
  geom_qq() +
  geom_qq_line(col = "red") +
  labs(
    y = "UHC Service Coverage Index",
    x = "Standard Normal Distribution",
    title = "Normal Q-Q plot"
  ) 

p3 <- ggplot(nations, aes(x = uhc_index, y = "")) +
  geom_violin(fill = "cornsilk") +
  geom_boxplot(width = 0.2) +
  stat_summary(
    fun = mean, geom = "point",
    shape = 16, col = "red"
  ) +
  labs(
    y = "", x = "UHC Service Coverage Index",
    title = "Boxplot with Violin"
  ) 

p1 + (p2 / p3 + plot_layout(heights = c(2, 1))) +
  plot_annotation(
    title = "Nations: UHC Service Coverage Index"
  )

nations |>
  reframe(lovedist(uhc_index)) |>
  kable(digits = 2)
| n   | miss | mean  | sd    | med | mad   | min | q25 | q75 | max |
|-----|------|-------|-------|-----|-------|-----|-----|-----|-----|
| 192 | 0    | 65.46 | 16.29 | 69  | 19.27 | 27  | 52  | 79  | 91  |

17.3.2 UHC Service Coverage by Universal Care Status

ggplot(nations, aes(y = univ_care, x = uhc_index, 
                    fill = univ_care)) +
  geom_violin() +
  geom_boxplot(width = 0.3, fill = "cornsilk", notch = TRUE) +
  stat_summary(fun = mean, geom = "point", fill = "red", 
               size = 3, shape = 21) +
  scale_fill_brewer(palette = "Set2") +
  guides(fill = "none")

nations |>
  reframe(lovedist(uhc_index), .by = univ_care) |>
  kable(digits = 2)
| univ_care | n   | miss | mean  | sd    | med  | mad   | min | q25   | q75 | max |
|-----------|-----|------|-------|-------|------|-------|-----|-------|-----|-----|
| no        | 124 | 0    | 59.92 | 15.77 | 59.5 | 21.50 | 27  | 46.00 | 75  | 86  |
| yes       | 68  | 0    | 75.57 | 11.79 | 79.0 | 9.64  | 40  | 68.75 | 85  | 91  |

17.4 Should we transform our outcome?

Eventually, we will consider including the gdp_percap predictor in our model, so let's include it here, along with univ_care, when predicting uhc_index.

fit1 <- lm(uhc_index ~ univ_care + gdp_percap, 
           data = nations)

boxCox(fit1, main = "Transforming UHC_Index")

summary(powerTransform(fit1))$result
   Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
Y1  2.532079           2     1.908487     3.155671

17.4.1 Looking at Residual Plots

We might consider whether the residuals from our two models (with and without transforming the outcome) follow the assumption of a Normal distribution.

fit2 <- lm((uhc_index^2) ~ univ_care + gdp_percap, 
           data = nations)

resids_df <- tibble(raw = fit1$residuals, 
                 sqr = fit2$residuals)

p1 <- ggplot(resids_df, aes(sample = raw)) +
  geom_qq() + geom_qq_line(col = "red") +
  labs(title = "uhc_index",
       subtitle = "Normal Q-Q for fit1 residuals")

p2 <- ggplot(resids_df, aes(sample = sqr)) +
  geom_qq() + geom_qq_line(col = "red") +
  labs(title = "Square of uhc_index",
       subtitle = "Normal Q-Q for fit2 residuals")

p1 + p2 + 
  plot_annotation("How well do we support a Normality assumption?")

In terms of providing a better match to the assumption of Normality in the residuals, the square transformation doesn’t appear to be very helpful. If anything, our problem with light tails (fewer values far from the mean than expected) may get a little worse in fit2. So we’ll skip the transformation and work with our original outcome in what follows.

17.5 Least Squares Linear Model

fit3 <- lm(uhc_index ~ univ_care, data = nations)

model_parameters(fit3, ci = 0.95)
Parameter       | Coefficient |   SE |         95% CI | t(190) |      p
-----------------------------------------------------------------------
(Intercept)     |       59.92 | 1.30 | [57.35, 62.49] |  46.04 | < .001
univ care [yes] |       15.65 | 2.19 | [11.34, 19.97] |   7.16 | < .001

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.
estimate_contrasts(fit3, contrast = "univ_care", ci = 0.95)
Marginal Contrasts Analysis

Level1 | Level2 | Difference |           95% CI |   SE | t(190) |      p
------------------------------------------------------------------------
no     |    yes |     -15.65 | [-19.97, -11.34] | 2.19 |  -7.16 | < .001

Marginal contrasts estimated at univ_care
p-value adjustment method: Holm (1979)

17.5.1 Interpreting the Fit

When comparing two nations where one offers universal health care and the other does not, this fitted model estimates that the country offering universal care will, on average, have a UHC index score that is 15.7 points higher (the 95% uncertainty interval for the estimated difference is 11.34 to 19.97 points).

17.5.2 Performance of the Model

model_performance(fit3)
# Indices of model performance

AIC      |     AICc |      BIC |    R2 | R2 (adj.) |   RMSE |  Sigma
--------------------------------------------------------------------
1575.544 | 1575.672 | 1585.317 | 0.212 |     0.208 | 14.417 | 14.493

17.5.3 Checking the Model

check_model(fit3)

17.6 Bayesian Linear Model

set.seed(431)
fit4 <- stan_glm(uhc_index ~ univ_care, data = nations, refresh = 0)

model_parameters(fit4, ci = 0.95)
Parameter    | Median |         95% CI |   pd |  Rhat |     ESS |                   Prior
-----------------------------------------------------------------------------------------
(Intercept)  |  59.92 | [57.29, 62.50] | 100% | 1.000 | 4052.00 | Normal (65.46 +- 40.72)
univ_careyes |  15.69 | [11.28, 19.94] | 100% | 1.000 | 3679.00 |  Normal (0.00 +- 84.92)

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a MCMC distribution approximation.
estimate_contrasts(fit4, contrast = "univ_care", ci = 0.95)
Marginal Contrasts Analysis

Level1 | Level2 | Difference |           95% CI |   pd | % in ROPE
------------------------------------------------------------------
no     |    yes |     -15.69 | [-19.94, -11.28] | 100% |        0%

Marginal contrasts estimated at univ_care

17.6.1 Interpreting the Fit

Under this fitted Bayesian model (fit4), which uses a weakly informative prior, the estimated median difference in UHC index score between a country that offers universal health care and one that does not is 15.7 points (the 95% credible interval for this difference runs from 11.28 to 19.94 points).

The estimated probability that the median (universal care - no universal care) difference in UHC index scores is positive is 100%, according to the fit4 model. None of the simulated results comparing the two groups (universal care yes vs. no) fall in the region of practical equivalence.

17.6.2 Performance of the Model

model_performance(fit4)
# Indices of model performance

ELPD     | ELPD_SE |    LOOIC | LOOIC_SE |     WAIC |    R2 | R2 (adj.) |   RMSE |  Sigma
-----------------------------------------------------------------------------------------
-787.518 |   7.559 | 1575.035 |   15.119 | 1575.027 | 0.212 |     0.202 | 14.417 | 14.518

17.6.3 Checking the Model

check_model(fit4)

17.7 Missing Data Mechanisms

We now tackle the issue of missing data.

My source for the following description of mechanisms is Chapter 25 of Gelman and Hill (2007), and that chapter is available at this link.

  1. MCAR = Missingness completely at random. A variable is missing completely at random if the probability of missingness is the same for all units, for example, if for each subject, we decide whether to collect gdp_percap by rolling a die and refusing to answer if a “6” shows up. If data are missing completely at random, then throwing out cases with missing data does not bias your inferences.
Note

We have tacitly made the MCAR assumption in all of our prior dealings with missing data in this book.

  2. Missingness that depends only on observed predictors. A more general assumption, called missing at random or MAR, is that the probability a variable is missing depends only on available information. Here, we would have to be willing to assume that the probability of non-response to gdp_percap depends only on the other, fully recorded variables in the data. It is often reasonable to model this process as a logistic regression, where the outcome variable equals 1 for observed cases and 0 for missing. When an outcome variable is missing at random, it is acceptable to exclude the missing cases (that is, to treat them as NA), as long as the regression adjusts for all of the variables that affect the probability of missingness.

  3. Missingness that depends on unobserved predictors. Missingness is no longer “at random” if it depends on information that has not been recorded and this information also predicts the missing values. If a particular treatment causes discomfort, a patient is more likely to drop out of the study. This missingness is not at random (unless “discomfort” is measured and observed for all patients). If missingness is not at random, it must be explicitly modeled, or else you must accept some bias in your inferences.

  4. Missingness that depends on the missing value itself. Finally, a particularly difficult situation arises when the probability of missingness depends on the (potentially missing) variable itself. For example, suppose that people with higher earnings are less likely to reveal them.

Essentially, situations 3 and 4 are referred to collectively as non-random missingness, and cause more trouble for us than 1 and 2.
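The logistic-regression idea in the MAR discussion above can be sketched for our data. As an illustration only (this is not part of the chapter's modeling), we can model an indicator that equals 1 when gdp_percap is observed and 0 when it is missing, using only the fully recorded variables:

```r
# Sketch: model the probability that gdp_percap is observed,
# using only fully recorded variables. 1 = observed, 0 = missing.
nations_miss <- nations |>
  mutate(gdp_observed = as.numeric(!is.na(gdp_percap)))

miss_fit <- glm(gdp_observed ~ uhc_index + univ_care,
                family = binomial, data = nations_miss)
summary(miss_fit)
```

With only four missing values among 192 nations, this model will be very imprecise, but the structure is the point: predictors strongly associated with the missingness indicator would argue against an MCAR assumption.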

17.8 Dealing with Missing Data

There are several available methods for dealing with missing data that are MCAR or MAR, but they basically boil down to:

  • Complete Case (or Available Case) analyses
  • Single Imputation
  • Multiple Imputation

17.8.1 Complete Case (and Available Case) analyses

In Complete Case analyses, rows containing NA values are omitted from the data before analyses commence. This is the default approach for many statistical software packages, and may introduce unpredictable bias and fail to include some useful, often hard-won information.

  • A complete case analysis can be appropriate when the number of missing observations is not large, and the missing pattern is either MCAR (missing completely at random) or MAR (missing at random.)
  • Two problems arise with complete-case analysis:
    1. If the units with missing values differ systematically from the completely observed cases, this could bias the complete-case analysis.
    2. If many variables are included in a model, there may be very few complete cases, so that most of the data would be discarded for the sake of a straightforward analysis.
  • A related approach is available-case analysis where different aspects of a problem are studied with different subsets of the data, perhaps identified on the basis of what is missing in them.

17.8.2 Single Imputation

In single imputation analyses, NA values are estimated and replaced one time with one particular data value, for the purpose of obtaining a more complete sample. The cost is some potential bias in the eventual conclusions, or slightly less accurate estimates than would be available if there were no missing values in the data.

  • A single imputation can be just a replacement with the mean or median (for a quantity) or the mode (for a categorical variable.) However, such an approach, though easy to understand, underestimates variance and ignores the relationship of missing values to other variables.
  • Single imputation can also be done using a variety of models to try to capture information about the NA values that are available in other variables within the data set.
  • In this book, we will use the mice package to perform single imputations, as well as multiple imputation.
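For contrast, here is a sketch of the naive mean-replacement approach described in the first bullet above (shown only to illustrate the idea; in this book we rely on mice instead):

```r
# Naive single imputation: replace each missing gdp_percap with the
# observed mean. Easy to understand, but it shrinks the variance of
# gdp_percap and ignores its relationships with the other variables.
nations_meanimp <- nations |>
  mutate(gdp_percap = if_else(is.na(gdp_percap),
                              mean(gdp_percap, na.rm = TRUE),
                              gdp_percap))

n_miss(nations_meanimp$gdp_percap)   # now 0
```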

17.8.3 Multiple Imputation

In multiple imputation, NA values are repeatedly estimated and replaced with multiple plausible data values, producing several completed data sets. This both yields more complete samples and captures the extra variation inherent in the fact that the data have missingness, so it produces more accurate estimates than are possible with single imputation.

How many imputations?

Quoting van Buuren at https://stefvanbuuren.name/fimd/sec-howmany.html:

Van Buuren takes a rule of thumb from Von Hippel (2009): the number of imputations should be similar to the percentage of cases that are incomplete. This rule applies to fractions of missing information of up to 0.5.

White, Royston, and Wood (2011) suggest this criterion provides an adequate level of reproducibility in practice. The idea of reproducibility is sensible and the rule is simple to apply, so there is much to commend it. The rule has now become the de facto standard, especially in medical applications.

It is convenient to set m = 5 during model building, and to increase m only after being satisfied with the model, for the “final” round of imputation. So, if computation is not prohibitive, we may set m to the average percentage of missing data. The substantive conclusions are unlikely to change as a result of raising m beyond m = 5.
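Applying the Von Hippel rule to our data, a quick sketch (using naniar's prop_miss_case(), with the conventional m = 5 as a floor) might look like:

```r
# m should be similar to the percentage of incomplete cases,
# but no smaller than the conventional choice of m = 5.
pct_incomplete <- 100 * prop_miss_case(nations)  # about 2.1 here
m_choice <- max(5, ceiling(pct_incomplete))
m_choice
# [1] 5
```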

17.9 Nations and Missing Data

In our study, if we want to adjust our estimates for the impact of per-capita GDP, we must deal with the fact that these data are missing for four of the nations in our data.

miss_var_summary(nations)
# A tibble: 7 × 3
  variable     n_miss pct_miss
  <chr>         <int>    <num>
1 gdp_percap        4     2.08
2 c_num             0     0   
3 country_name      0     0   
4 iso_alpha3        0     0   
5 uhc_index         0     0   
6 univ_care         0     0   
7 who_region        0     0   
nations |> filter(is.na(gdp_percap)) |>
  select(country_name, iso_alpha3, gdp_percap, uhc_index, univ_care) |>
  kable()
| country_name                          | iso_alpha3 | gdp_percap | uhc_index | univ_care |
|---------------------------------------|------------|------------|-----------|-----------|
| Democratic People’s Republic of Korea | PRK        | NA         | 68        | yes       |
| Eritrea                               | ERI        | NA         | 45        | no        |
| South Sudan                           | SSD        | NA         | 34        | no        |
| Venezuela (Bolivarian Republic of)    | VEN        | NA         | 75        | no        |

17.9.1 Complete Case Analysis

We can drop all of the rows with missing values from a data set with drop_na(), with na.omit(), or by filtering on complete.cases(). Any of these approaches produces the same result.
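As a quick sketch, the three approaches just mentioned all reduce our 192 nations to the same set of complete cases:

```r
# Three equivalent routes to the complete cases:
cc1 <- nations |> drop_na()                        # tidyr
cc2 <- na.omit(nations)                            # base R
cc3 <- nations |> filter(complete.cases(nations))  # base R inside dplyr

c(nrow(cc1), nrow(cc2), nrow(cc3))
# each has 188 rows (192 nations minus the 4 with missing gdp_percap)
```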

Note

If we don’t do anything about missing data, and simply feed the results to the lm() or stan_glm() functions, the machine will produce a fit on the complete cases.

fit5 <- lm(uhc_index ~ univ_care + gdp_percap, 
           data = nations)
summary(fit5)

Call:
lm(formula = uhc_index ~ univ_care + gdp_percap, data = nations)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.161  -9.625   1.336   9.558  19.917 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  57.02924    1.20223  47.436  < 2e-16 ***
univ_careyes 11.04387    1.98727   5.557 9.44e-08 ***
gdp_percap    0.27041    0.03414   7.920 2.14e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.5 on 185 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.4117,    Adjusted R-squared:  0.4053 
F-statistic: 64.73 on 2 and 185 DF,  p-value: < 2.2e-16
model_parameters(fit5, ci = 0.95)
Parameter       | Coefficient |   SE |         95% CI | t(185) |      p
-----------------------------------------------------------------------
(Intercept)     |       57.03 | 1.20 | [54.66, 59.40] |  47.44 | < .001
univ care [yes] |       11.04 | 1.99 | [ 7.12, 14.96] |   5.56 | < .001
gdp percap      |        0.27 | 0.03 | [ 0.20,  0.34] |   7.92 | < .001

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.
model_performance(fit5)
# Indices of model performance

AIC      |     AICc |      BIC |    R2 | R2 (adj.) |   RMSE |  Sigma
--------------------------------------------------------------------
1488.247 | 1488.465 | 1501.192 | 0.412 |     0.405 | 12.402 | 12.503

Note the indication that this model fit has had four observations deleted due to missingness.

Finally, we check the model…

check_model(fit5)

17.9.2 Single Imputation

Here is how I produced a single imputation of the nations data in R using the mice package, and some default choices, so that I could perform a regression analysis incorporating all of the nations, including those with missing gdp_percap values.

Note

By default, mice uses predictive mean matching (or pmm), for numeric data. It uses logistic regression imputation for binary data, polytomous regression imputation for unordered categorical data with more than 2 levels, and a proportional odds regression model for ordered categorical data with more than 2 levels.

All of these methodologies will be explored in the 432 class, but a detailed explanation is beyond the scope of this book.
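One way to see which method mice plans to use for each variable, before doing any actual imputing, is a “dry run” with maxit = 0 (a sketch; the methods chosen depend on each variable's type):

```r
# Dry run: set up the imputation scheme without running any iterations.
setup <- mice(nations, m = 1, maxit = 0, print = FALSE)
setup$method   # one method per variable; "" means nothing to impute
```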

nations_simp <- 
  mice(nations, m = 1, seed = 431, print = FALSE) |>
  complete() |>
  tibble()
Warning: Number of logged events: 3

Here, m is the number of imputations I want to create (here just one), seed is the random seed I set so that I can replicate the results later, and print is set to FALSE to reduce the amount of unnecessary output this produces. I will ignore this warning about logged events in most cases, including here.

Tip

If you want to see the logged events, try:

ini <- mice(nations, m = 1, seed = 431, print = FALSE)
Warning: Number of logged events: 3
ini$loggedEvents
  it im dep     meth          out
1  0  0     constant        c_num
2  0  0     constant country_name
3  0  0     constant   iso_alpha3

Here, our nations data include three variables which are not included in the imputation process (and which we don’t want included) so all is well.

Here’s what I wind up with:

nations_simp
# A tibble: 192 × 7
   c_num country_name       iso_alpha3 uhc_index univ_care gdp_percap who_region
   <chr> <chr>              <chr>          <dbl> <fct>          <dbl> <fct>     
 1 1     Afghanistan        AFG               41 no             0.356 Eastern M…
 2 2     Albania            ALB               64 yes            6.81  European  
 3 3     Algeria            DZA               74 yes            4.34  African   
 4 4     Andorra            AND               79 no            42.0   European  
 5 5     Angola             AGO               37 no             3     African   
 6 6     Antigua and Barbu… ATG               76 no            19.9   Americas  
 7 7     Argentina          ARG               79 yes           13.7   Americas  
 8 8     Armenia            ARM               68 no             7.02  European  
 9 9     Australia          AUS               87 yes           65.1   Western P…
10 10    Austria            AUT               85 yes           52.1   European  
# ℹ 182 more rows
n_miss(nations_simp)
[1] 0

Now, I can do everything I did previously with this singly imputed data set.

fit6 <- lm(uhc_index ~ univ_care + gdp_percap, 
           data = nations_simp)
summary(fit6)

Call:
lm(formula = uhc_index ~ univ_care + gdp_percap, data = nations_simp)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.216  -9.940   1.337   9.692  20.062 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  56.86773    1.19475  47.598  < 2e-16 ***
univ_careyes 11.16974    1.98286   5.633 6.34e-08 ***
gdp_percap    0.27131    0.03427   7.916 2.02e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.59 on 189 degrees of freedom
Multiple R-squared:  0.4085,    Adjusted R-squared:  0.4023 
F-statistic: 65.27 on 2 and 189 DF,  p-value: < 2.2e-16
model_parameters(fit6, ci = 0.95)
Parameter       | Coefficient |   SE |         95% CI | t(189) |      p
-----------------------------------------------------------------------
(Intercept)     |       56.87 | 1.19 | [54.51, 59.22] |  47.60 | < .001
univ care [yes] |       11.17 | 1.98 | [ 7.26, 15.08] |   5.63 | < .001
gdp percap      |        0.27 | 0.03 | [ 0.20,  0.34] |   7.92 | < .001

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.
model_performance(fit6)
# Indices of model performance

AIC      |     AICc |      BIC |    R2 | R2 (adj.) |   RMSE |  Sigma
--------------------------------------------------------------------
1522.564 | 1522.778 | 1535.594 | 0.409 |     0.402 | 12.494 | 12.593

And, again, we check the model…

check_model(fit6)

17.9.3 Multiple Imputation

The first thing we might consider is how many multiple imputations we want to do.

nations |> prop_miss_case()
[1] 0.02083333

About 2% of the cases in our nations tibble contain missing values, so Von Hippel's rule suggests a small number of imputations. We'll use the conventional choice and try 5 imputations.

nations_imp5 <- mice(nations, m = 5, seed = 431, print = FALSE)
Warning: Number of logged events: 3

Here, m is the number of imputations I want to create (five, in our case), seed is the random seed I set so that I can replicate the results later, and print is set to FALSE to reduce the amount of unnecessary output this produces. I will ignore this warning about logged events in most cases, including here.

Here is how I would fit a set of five models, each using a different one of the five imputed data sets and obtain a summary of my results:

est7 <- nations |>
  mice(seed = 431, print = FALSE) |>
  with(lm(uhc_index ~ univ_care + gdp_percap)) |>
  pool()
Warning: Number of logged events: 3
model_parameters(est7, ci = 0.95)
Warning: Number of logged events: 3
Warning: Number of logged events: 3
Warning: Number of logged events: 3
Warning: Number of logged events: 3
# Fixed Effects

Parameter       | Coefficient |   SE |         95% CI |     t |     df |      p
-------------------------------------------------------------------------------
(Intercept)     |       56.81 | 1.19 | [54.46, 59.17] | 47.58 | 186.81 | < .001
univ care [yes] |       11.16 | 1.98 | [ 7.26, 15.06] |  5.64 | 187.01 | < .001
gdp percap      |        0.27 | 0.03 | [ 0.21,  0.34] |  7.98 | 186.80 | < .001

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.
glance(est7)
  nimp nobs r.squared adj.r.squared
1    5  192 0.4111089     0.4048772

17.10 For More Information

See https://library.virginia.edu/data/articles/getting-started-with-multiple-imputation-in-r

  • Reference there to Rubin (1976)

Rubin (1976) classified types of missing data in three categories: MCAR, MAR, MNAR

MCAR: Missing Completely at Random - the missingness of the data points is random, meaning that the pattern of missing values is uncorrelated with the structure of the data. An example would be a random sample taken from the population: data on some people will be missing, but they will be missing at random, since everyone had the same chance of being included in the sample.

MAR: Missing at Random - the missingness is not completely random, but the propensity for missingness depends only on the observed data, not the missing data. An example would be a survey respondent choosing not to answer a question on income because of concerns about the privacy of personal information. In this case, the missing value for income can be predicted by looking at the answers to the personal-information questions.

MNAR: Missing Not at Random - the missingness is not random; it correlates with unobservable characteristics unknown to the researcher. An example would be social desirability bias in surveys, where respondents with certain characteristics that we cannot observe systematically shy away from answering questions on racial issues.

All multiple imputation techniques start with the MAR assumption. While MCAR is desirable, it is generally unrealistic in practice. Thus, researchers assume that missing values can be replaced by predictions derived from the observed portion of the dataset. This is a fundamental assumption; without it, we would not be able to predict plausible values of missing data points from the observed data.

17.11 For More Information

  • https://stefvanbuuren.name/fimd/

  1. This is a measure of coverage of essential health services as expressed as the average score of 14 tracer indicators of health service coverage, measured in 2021.↩︎

  2. Some of the GDP data describes 2021, some 2022 and some 2023, depending on what was most recent and available.↩︎

  3. The WHO regions are African, Americas, Eastern Mediterranean, European, Southeast Asian and Western Pacific.↩︎