17 Nations of the World
This is a sketchy draft. I’ll remove this notice when I post a version of this Chapter that is essentially finished.
17.1 R setup for this chapter
Appendix A lists all R packages used in this book, and also provides R session information. Appendix B describes the 431-Love.R script, and demonstrates its use.
17.2 Data on Nations of the World
Appendix C provides further guidance on pulling data from other systems into R, while Appendix D gives more information (including download links) for all data sets used in this book.
I collected these data from the World Health Organization and other sources, as of the end of May 2024. The primary resource was the Global Health Observatory from the WHO. Other resources included:
- The World Health Statistics Report
- Table of World Health Statistics 2024
- Gross Domestic Product Per Capita comes from the World Bank, accessed 2024-06-20.
```r
nations <- read_csv("data/nations.csv", show_col_types = FALSE) |>
  mutate(
    UNIV_CARE = factor(UNIV_CARE),
    WHO_REGION = factor(WHO_REGION),
    C_NUM = as.character(C_NUM),
    GDP_PERCAP = GDP_PERCAP / 1000
  ) |>
  janitor::clean_names()

glimpse(nations)
```
```
Rows: 192
Columns: 7
$ c_num        <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", …
$ country_name <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Angola",…
$ iso_alpha3   <chr> "AFG", "ALB", "DZA", "AND", "AGO", "ATG", "ARG", "ARM", "…
$ uhc_index    <dbl> 41, 64, 74, 79, 37, 76, 79, 68, 87, 85, 66, 77, 76, 52, 7…
$ univ_care    <fct> no, yes, yes, no, no, no, yes, no, yes, yes, no, yes, no,…
$ gdp_percap   <dbl> 0.356, 6.810, 4.343, 41.993, 3.000, 19.920, 13.651, 7.018…
$ who_region   <fct> Eastern Mediterranean, European, African, European, Afric…
```
The data include information on 7 variables (listed below), describing 192 separate countries of the world.
By Country, we mean a sovereign state that is a member of both the United Nations and the World Health Organization in its own right. Note that Liechtenstein is a member of the UN but not of the World Health Organization; it is the only UN member nation not included in the `nations` data.
Variable | Description
---|---
`c_num` | Country ID (alphabetical by `country_name`)
`country_name` | Name of Country (according to WHO and UN)
`iso_alpha3` | ISO alphabetical 3-letter code, gathered from WHO
`uhc_index` | Universal Health Coverage: service coverage index[^1]; higher numbers indicate better coverage
`univ_care` | "yes" if the country offers government-regulated and government-funded health care to more than 90% of its citizens as of 2023, else "no"
`gdp_percap` | GDP per capita, per the World Bank, in thousands of 2023 US dollars[^2]
`who_region` | One of six groups[^3], according to the WHO's regional distribution
17.3 Exploratory Data Analysis
Our interest is in understanding the association of `uhc_index` and `univ_care`, perhaps after adjustment for the country's `gdp_percap`.
17.3.1 EDA for our outcome
```r
bw = 4 # specify width of bins in histogram

p1 <- ggplot(nations, aes(uhc_index)) +
  geom_histogram(binwidth = bw,
                 fill = "black", col = "yellow") +
  stat_function(fun = function(x)
    dnorm(x, mean = mean(nations$uhc_index, na.rm = TRUE),
          sd = sd(nations$uhc_index, na.rm = TRUE)) *
      length(nations$uhc_index) * bw,
    geom = "area", alpha = 0.5,
    fill = "lightblue", col = "blue"
  ) +
  labs(
    x = "UHC Service Coverage Index",
    title = "Histogram & Normal Curve"
  )

p2 <- ggplot(nations, aes(sample = uhc_index)) +
  geom_qq() +
  geom_qq_line(col = "red") +
  labs(
    y = "UHC Service Coverage Index",
    x = "Standard Normal Distribution",
    title = "Normal Q-Q plot"
  )

p3 <- ggplot(nations, aes(x = uhc_index, y = "")) +
  geom_violin(fill = "cornsilk") +
  geom_boxplot(width = 0.2) +
  stat_summary(
    fun = mean, geom = "point",
    shape = 16, col = "red"
  ) +
  labs(
    y = "", x = "UHC Service Coverage Index",
    title = "Boxplot with Violin"
  )

p1 + (p2 / p3 + plot_layout(heights = c(2, 1))) +
  plot_annotation(
    title = "Nations: UHC Service Coverage Index"
  )
```
17.3.2 UHC Service Coverage by Universal Care Status
```r
ggplot(nations, aes(y = univ_care, x = uhc_index,
                    fill = univ_care)) +
  geom_violin() +
  geom_boxplot(width = 0.3, fill = "cornsilk", notch = TRUE) +
  stat_summary(fun = mean, geom = "point", fill = "red",
               size = 3, shape = 21) +
  scale_fill_brewer(palette = "Set2") +
  guides(fill = "none")
```
17.4 Should we transform our outcome?
Eventually, we will consider including the `gdp_percap` predictor in our model, so let's include that here, as well as `univ_care`, when predicting `uhc_index`.
```r
fit1 <- lm(uhc_index ~ univ_care + gdp_percap,
           data = nations)

boxCox(fit1, main = "Transforming UHC_Index")
summary(powerTransform(fit1))$result
```
```
   Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
Y1  2.532079           2     1.908487     3.155671
```
17.4.1 Looking at Residual Plots
We might consider whether the residuals from our two models (with and without transforming the outcome) follow the assumption of a Normal distribution.
```r
fit2 <- lm((uhc_index^2) ~ univ_care + gdp_percap,
           data = nations)

resids_df <- tibble(raw = fit1$residuals,
                    sqr = fit2$residuals)

p1 <- ggplot(resids_df, aes(sample = raw)) +
  geom_qq() + geom_qq_line(col = "red") +
  labs(title = "uhc_index",
       subtitle = "Normal Q-Q for fit1 residuals")

p2 <- ggplot(resids_df, aes(sample = sqr)) +
  geom_qq() + geom_qq_line(col = "red") +
  labs(title = "Square of uhc_index",
       subtitle = "Normal Q-Q for fit2 residuals")

p1 + p2 +
  plot_annotation("How well do we support a Normality assumption?")
```
In terms of providing a better match to the assumption of Normality in the residuals, the square transformation doesn’t appear to be very helpful. If anything, our problem with light tails (fewer values far from the mean than expected) may get a little worse in fit2. So we’ll skip the transformation and work with our original outcome in what follows.
17.5 Least Squares Linear Model
```r
fit3 <- lm(uhc_index ~ univ_care, data = nations)
model_parameters(fit3, ci = 0.95)
```
```
Parameter       | Coefficient |   SE |         95% CI | t(190) |      p
-----------------------------------------------------------------------
(Intercept)     |       59.92 | 1.30 | [57.35, 62.49] |  46.04 | < .001
univ care [yes] |       15.65 | 2.19 | [11.34, 19.97] |   7.16 | < .001

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
using a Wald t-distribution approximation.
```
```r
estimate_contrasts(fit3, contrast = "univ_care", ci = 0.95)
```
```
Marginal Contrasts Analysis

Level1 | Level2 | Difference |           95% CI |   SE | t(190) |      p
------------------------------------------------------------------------
no     | yes    |     -15.65 | [-19.97, -11.34] | 2.19 |  -7.16 | < .001

Marginal contrasts estimated at univ_care
p-value adjustment method: Holm (1979)
```
17.5.1 Interpreting the Fit
When comparing two nations, one of which offers universal health care while the other does not, the model estimates that the country offering universal care will, on average, have a UHC index score 15.7 points higher.

Put another way: under this fitted model, the average difference in UHC index score between a country that offers universal health care and one that does not is 15.7 points (95% uncertainty interval for the estimated difference: 11.34 to 19.97 points).
17.5.2 Performance of the Model
```r
model_performance(fit3)
```
```
# Indices of model performance

AIC      |     AICc |      BIC |    R2 | R2 (adj.) |   RMSE |  Sigma
--------------------------------------------------------------------
1575.544 | 1575.672 | 1585.317 | 0.212 |     0.208 | 14.417 | 14.493
```
17.5.3 Checking the Model
```r
check_model(fit3)
```
17.6 Bayesian Linear Model
```r
set.seed(431)
fit4 <- stan_glm(uhc_index ~ univ_care, data = nations, refresh = 0)
model_parameters(fit4, ci = 0.95)
```
```
Parameter    | Median |         95% CI |   pd |  Rhat |     ESS |                   Prior
-----------------------------------------------------------------------------------------
(Intercept)  |  59.92 | [57.29, 62.50] | 100% | 1.000 | 4052.00 | Normal (65.46 +- 40.72)
univ_careyes |  15.69 | [11.28, 19.94] | 100% | 1.000 | 3679.00 | Normal (0.00 +- 84.92)

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
using a MCMC distribution approximation.
```
```r
estimate_contrasts(fit4, contrast = "univ_care", ci = 0.95)
```
```
Marginal Contrasts Analysis

Level1 | Level2 | Difference |           95% CI |   pd | % in ROPE
------------------------------------------------------------------
no     | yes    |     -15.69 | [-19.94, -11.28] | 100% |        0%

Marginal contrasts estimated at univ_care
```
17.6.1 Interpreting the Fit
Under this fitted Bayesian model (fit4), which uses a weakly informative prior, the estimated median difference in UHC index score between a country that offers universal health care and one that does not is 15.7 points (95% credible interval for the median difference: 11.28 to 19.94 points).
The estimated probability that the median (universal care - no universal care) difference in UHC index scores is positive is 100%, according to the fit4 model. None of the simulated results comparing the two groups (universal care yes vs. no) fall in the region of practical equivalence.
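The pd and ROPE summaries can be computed directly from posterior draws. Here is a minimal sketch, with simulated Normal draws standing in for the actual fit4 posterior (the draws, and the ROPE of (-1.6, 1.6) points, are illustrative assumptions, not the chapter's values):

```r
set.seed(431)
# stand-in for 4000 posterior draws of the univ_care contrast
draws <- rnorm(4000, mean = 15.7, sd = 2.2)

# probability of direction: share of draws on the dominant side of zero
pd <- max(mean(draws > 0), mean(draws < 0))
pd                        # 1: every draw is positive

# share of draws inside an illustrative ROPE of (-1.6, 1.6) points
mean(abs(draws) < 1.6)    # 0: no draws are practically equivalent to zero
```

With the real model, `bayestestR` computes these same proportions from the draws that `stan_glm()` stored.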
17.6.2 Performance of the Model
```r
model_performance(fit4)
```
```
# Indices of model performance

ELPD     | ELPD_SE |    LOOIC | LOOIC_SE |     WAIC |    R2 | R2 (adj.) |   RMSE |  Sigma
-----------------------------------------------------------------------------------------
-787.518 |   7.559 | 1575.035 |   15.119 | 1575.027 | 0.212 |     0.202 | 14.417 | 14.518
```
17.6.3 Checking the Model
```r
check_model(fit4)
```
17.7 Missing Data Mechanisms
We now tackle the issue of missing data.
My source for the following description of mechanisms is Chapter 25 of Gelman and Hill (2007), and that chapter is available at this link.
1. MCAR = Missingness completely at random. A variable is missing completely at random if the probability of missingness is the same for all units; for example, if for each subject we decided whether to collect `gdp_percap` by rolling a die and refusing to record it whenever a "6" showed up. If data are missing completely at random, then throwing out cases with missing data does not bias your inferences. We have tacitly made the MCAR assumption in all of our prior dealings with missing data in this book.
2. Missingness that depends only on observed predictors. A more general assumption, called missing at random or MAR, is that the probability a variable is missing depends only on available information. Here, we would have to be willing to assume that the probability of non-response to `gdp_percap` depends only on the other, fully recorded variables in the data. It is often reasonable to model this process as a logistic regression, where the outcome variable equals 1 for observed cases and 0 for missing. When an outcome variable is missing at random, it is acceptable to exclude the missing cases (that is, to treat them as NA), as long as the regression adjusts for all of the variables that affect the probability of missingness.
3. Missingness that depends on unobserved predictors. Missingness is no longer "at random" if it depends on information that has not been recorded and this information also predicts the missing values. For example, if a particular treatment causes discomfort, a patient is more likely to drop out of the study. This missingness is not at random (unless "discomfort" is measured and observed for all patients). If missingness is not at random, it must be explicitly modeled, or else you must accept some bias in your inferences.
4. Missingness that depends on the missing value itself. Finally, a particularly difficult situation arises when the probability of missingness depends on the (potentially missing) variable itself. For example, suppose that people with higher earnings are less likely to reveal them.

Essentially, situations 3 and 4 are referred to collectively as non-random missingness, and cause more trouble for us than situations 1 and 2.
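A tiny simulation makes the practical difference concrete: under MCAR, dropping the incomplete cases leaves estimates essentially unbiased, while missingness that depends on the value itself biases them badly (base R, purely illustrative):

```r
set.seed(431)
income <- rlnorm(10000, meanlog = 10, sdlog = 0.5)

# MCAR: every value has the same 30% chance of being missing
mcar <- income
mcar[runif(10000) < 0.3] <- NA

# Depends on the missing value itself: the top 30% of earners never respond
mnar <- income
mnar[rank(income) / 10000 > 0.7] <- NA

mean(income)              # the truth
mean(mcar, na.rm = TRUE)  # close to the truth
mean(mnar, na.rm = TRUE)  # biased well below the truth
```

Complete-case analysis is only safe in the first scenario; in the second, the missingness mechanism itself must be modeled.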
17.8 Dealing with Missing Data
There are several available methods for dealing with missing data that are MCAR or MAR, but they basically boil down to:
- Complete Case (or Available Case) analyses
- Single Imputation
- Multiple Imputation
17.8.1 Complete Case (and Available Case) analyses
In Complete Case analyses, rows containing NA values are omitted from the data before analyses commence. This is the default approach for many statistical software packages, and may introduce unpredictable bias and fail to include some useful, often hard-won information.
- A complete case analysis can be appropriate when the number of missing observations is not large, and the missing pattern is either MCAR (missing completely at random) or MAR (missing at random.)
- Two problems arise with complete-case analysis:
  - If the units with missing values differ systematically from the completely observed cases, this could bias the complete-case analysis.
  - If many variables are included in a model, there may be very few complete cases, so that most of the data would be discarded for the sake of a straightforward analysis.
- A related approach is available-case analysis where different aspects of a problem are studied with different subsets of the data, perhaps identified on the basis of what is missing in them.
17.8.2 Single Imputation
In single imputation analyses, each NA value is estimated and replaced once with a single data value, in order to obtain a more complete sample. The cost is some potential bias in the eventual conclusions, or slightly less accurate estimates than would be available if there were no missing values in the data.
- A single imputation can be just a replacement with the mean or median (for a quantity) or the mode (for a categorical variable.) However, such an approach, though easy to understand, underestimates variance and ignores the relationship of missing values to other variables.
- Single imputation can also be done using a variety of models to try to capture information about the NA values that are available in other variables within the data set.
- In this book, we will use the `mice` package to perform single imputation, as well as multiple imputation.
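To see why mean imputation underestimates variance, compare standard deviations before and after imputing on a toy vector (base R, not part of the chapter's code):

```r
set.seed(431)
x <- rnorm(100, mean = 50, sd = 10)

x_miss <- x
x_miss[sample(100, 20)] <- NA           # 20% missing, completely at random

# single imputation: replace every NA with the observed mean
x_imp <- ifelse(is.na(x_miss), mean(x_miss, na.rm = TRUE), x_miss)

sd(x_miss, na.rm = TRUE)  # spread of the observed values
sd(x_imp)                 # always smaller: the 20 imputed values add no spread
```

The imputed values all sit exactly at the mean, so they contribute nothing to the sum of squared deviations while inflating the sample size, shrinking the estimated SD.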
17.8.3 Multiple Imputation
In multiple imputation, NA values are repeatedly estimated and replaced with multiple plausible data values, in order to obtain more complete samples while capturing the variation inherent in the fact that the data have missingness. This yields more accurate estimates than are possible with single imputation.
How many imputations?
Quoting van Buuren at https://stefvanbuuren.name/fimd/sec-howmany.html:

> They take a quote from Von Hippel (2009) as a rule of thumb: the number of imputations should be similar to the percentage of cases that are incomplete. This rule applies to fractions of missing information of up to 0.5.
>
> White, Royston, and Wood (2011) suggest these criteria provide an adequate level of reproducibility in practice. The idea of reproducibility is sensible, the rule is simple to apply, so there is much to commend it. The rule has now become the de-facto standard, especially in medical applications.
>
> It is convenient to set m = 5 during model building, and increase m only after being satisfied with the model for the "final" round of imputation. So if calculation is not prohibitive, we may set m to the average percentage of missing data. The substantive conclusions are unlikely to change as a result of raising m beyond m = 5.
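That rule of thumb is easy to encode. The helper below (`m_suggested` is a hypothetical name, not from `mice` or this chapter) returns the larger of the conventional default m = 5 and the percentage of incomplete cases:

```r
# von Hippel's rule of thumb: m roughly equal to the percentage of
# incomplete cases, with m = 5 as a conventional floor for model building
m_suggested <- function(n_incomplete, n_total, floor_m = 5) {
  max(floor_m, ceiling(100 * n_incomplete / n_total))
}

m_suggested(4, 192)   # 5: about 2% incomplete, so the default is ample
m_suggested(60, 200)  # 30: with 30% incomplete, use about 30 imputations
```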
17.9 Nations and Missing Data
In our study, if we want to adjust our estimates for the impact of per-capita GDP, we must deal with the fact that these data are missing for four of the nations in our data.
```r
miss_var_summary(nations)
```
```
# A tibble: 7 × 3
  variable     n_miss pct_miss
  <chr>         <int>    <num>
1 gdp_percap        4     2.08
2 c_num             0     0   
3 country_name      0     0   
4 iso_alpha3        0     0   
5 uhc_index         0     0   
6 univ_care         0     0   
7 who_region       0     0   
```
```r
nations |> filter(is.na(gdp_percap)) |>
  select(country_name, iso_alpha3, gdp_percap, uhc_index, univ_care) |>
  kable()
```
country_name | iso_alpha3 | gdp_percap | uhc_index | univ_care |
---|---|---|---|---|
Democratic People’s Republic of Korea | PRK | NA | 68 | yes |
Eritrea | ERI | NA | 45 | no |
South Sudan | SSD | NA | 34 | no |
Venezuela (Bolivarian Republic of) | VEN | NA | 75 | no |
17.9.1 Complete Case Analysis
We can drop all of the missing values from a data set with `drop_na()`, with `na.omit()`, or by filtering on `complete.cases()`. Any of these approaches produces the same result.

If we don't do anything about missing data, and simply feed the data to the `lm()` or `stan_glm()` functions, the machine will produce a fit on the complete cases.
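On a toy data frame, we can verify that these approaches agree (base R shown here; `tidyr::drop_na()` keeps the same rows):

```r
df <- data.frame(x = c(1, 2, NA, 4), y = c(10, NA, 30, 40))

cc1 <- na.omit(df)              # drop any row containing an NA
cc2 <- df[complete.cases(df), ] # keep only rows flagged as complete

cc1$x                     # 1 and 4: rows 2 and 3 are gone either way
identical(cc1$x, cc2$x)   # TRUE
```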
```r
fit5 <- lm(uhc_index ~ univ_care + gdp_percap,
           data = nations)
summary(fit5)
```
```
Call:
lm(formula = uhc_index ~ univ_care + gdp_percap, data = nations)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.161  -9.625   1.336   9.558  19.917 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  57.02924    1.20223  47.436  < 2e-16 ***
univ_careyes 11.04387    1.98727   5.557 9.44e-08 ***
gdp_percap    0.27041    0.03414   7.920 2.14e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.5 on 185 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.4117,    Adjusted R-squared:  0.4053 
F-statistic: 64.73 on 2 and 185 DF,  p-value: < 2.2e-16
```
```r
model_parameters(fit5, ci = 0.95)
```
```
Parameter       | Coefficient |   SE |         95% CI | t(185) |      p
-----------------------------------------------------------------------
(Intercept)     |       57.03 | 1.20 | [54.66, 59.40] |  47.44 | < .001
univ care [yes] |       11.04 | 1.99 | [ 7.12, 14.96] |   5.56 | < .001
gdp percap      |        0.27 | 0.03 | [ 0.20,  0.34] |   7.92 | < .001

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
using a Wald t-distribution approximation.
```
```r
model_performance(fit5)
```
```
# Indices of model performance

AIC      |     AICc |      BIC |    R2 | R2 (adj.) |   RMSE |  Sigma
--------------------------------------------------------------------
1488.247 | 1488.465 | 1501.192 | 0.412 |     0.405 | 12.402 | 12.503
```
Note the indication that this model fit has had four observations deleted due to missingness.
Finally, we check the model…
```r
check_model(fit5)
```
17.9.2 Single Imputation
Here is how I produced a single imputation of the nations data in R using the mice
package, and some default choices, so that I could perform a regression analysis incorporating all of the nations, including those with missing gdp_percap
values.
By default, mice uses predictive mean matching (pmm) for numeric data. It uses logistic regression imputation for binary data, polytomous regression imputation for unordered categorical data with more than 2 levels, and a proportional odds regression model for ordered categorical data with more than 2 levels.
All of these methodologies will be explored in the 432 class, but a detailed explanation is beyond the scope of this book.
```
Warning: Number of logged events: 3
```
Here, `m` is the number of imputations I want to create (here just one), `seed` is the random seed I set so that I can replicate the results later, and `print` is set to FALSE to reduce the amount of unnecessary output this produces. I will ignore this warning about logged events in most cases, including here.
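If you want to see which default method mice chose for each variable, look at the `method` element of the fitted object. A sketch using mice's built-in `nhanes2` data, which mixes numeric and factor columns (the object name `imp` is illustrative, not from the chapter):

```r
library(mice)

imp <- mice(nhanes2, m = 1, seed = 431, print = FALSE)
imp$method
# bmi and chl (numeric) get "pmm"; hyp (a binary factor) gets "logreg";
# age has no missing values, so its method is the empty string
```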
If you want to see the logged events, try:
```r
ini <- mice(nations, m = 1, seed = 431, print = FALSE)
```
```
Warning: Number of logged events: 3
```
```r
ini$loggedEvents
```
```
  it im dep     meth          out
1  0  0     constant        c_num
2  0  0     constant country_name
3  0  0     constant   iso_alpha3
```
Here, our `nations` data include three variables which are not included in the imputation process (and which we don't want included), so all is well.
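The call that produced `nations_simp` is not shown above. The same pattern, sketched on mice's built-in `nhanes` data rather than `nations` (the names `imp1` and `simp` are illustrative; `complete()` extracts the filled-in data set):

```r
library(mice)

imp1 <- mice(nhanes, m = 1, seed = 431, print = FALSE)  # one imputation
simp <- complete(imp1)                                  # the filled-in data frame

sum(is.na(nhanes))  # positive: the original data contain missing values
sum(is.na(simp))    # 0: the imputed copy is complete
```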
Here’s what I wind up with:
```r
nations_simp
```
```
# A tibble: 192 × 7
   c_num country_name       iso_alpha3 uhc_index univ_care gdp_percap who_region
   <chr> <chr>              <chr>          <dbl> <fct>          <dbl> <fct>     
 1 1     Afghanistan        AFG               41 no             0.356 Eastern M…
 2 2     Albania            ALB               64 yes            6.81  European  
 3 3     Algeria            DZA               74 yes            4.34  African   
 4 4     Andorra            AND               79 no            42.0   European  
 5 5     Angola             AGO               37 no             3     African   
 6 6     Antigua and Barbu… ATG               76 no            19.9   Americas  
 7 7     Argentina          ARG               79 yes           13.7   Americas  
 8 8     Armenia            ARM               68 no             7.02  European  
 9 9     Australia          AUS               87 yes           65.1   Western P…
10 10    Austria            AUT               85 yes           52.1   European  
# ℹ 182 more rows
```
```r
n_miss(nations_simp)
```
```
[1] 0
```
Now, I can do everything I did previously with this singly imputed data set.
```r
fit6 <- lm(uhc_index ~ univ_care + gdp_percap,
           data = nations_simp)
summary(fit6)
```
```
Call:
lm(formula = uhc_index ~ univ_care + gdp_percap, data = nations_simp)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.216  -9.940   1.337   9.692  20.062 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  56.86773    1.19475  47.598  < 2e-16 ***
univ_careyes 11.16974    1.98286   5.633 6.34e-08 ***
gdp_percap    0.27131    0.03427   7.916 2.02e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.59 on 189 degrees of freedom
Multiple R-squared:  0.4085,    Adjusted R-squared:  0.4023 
F-statistic: 65.27 on 2 and 189 DF,  p-value: < 2.2e-16
```
```r
model_parameters(fit6, ci = 0.95)
```
```
Parameter       | Coefficient |   SE |         95% CI | t(189) |      p
-----------------------------------------------------------------------
(Intercept)     |       56.87 | 1.19 | [54.51, 59.22] |  47.60 | < .001
univ care [yes] |       11.17 | 1.98 | [ 7.26, 15.08] |   5.63 | < .001
gdp percap      |        0.27 | 0.03 | [ 0.20,  0.34] |   7.92 | < .001

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
using a Wald t-distribution approximation.
```
```r
model_performance(fit6)
```
```
# Indices of model performance

AIC      |     AICc |      BIC |    R2 | R2 (adj.) |   RMSE |  Sigma
--------------------------------------------------------------------
1522.564 | 1522.778 | 1535.594 | 0.409 |     0.402 | 12.494 | 12.593
```
And, again, we check the model…
```r
check_model(fit6)
```
17.9.3 Multiple Imputation
The first thing we might consider is how many multiple imputations we want to do.
```r
nations |> prop_miss_case()
```
```
[1] 0.02083333
```
2% of the cases in our `nations` tibble contain missing values, so in selecting the number of imputations, I'd go with the most commonly used option. Let's try 5 imputations.
```r
nations_imp5 <- mice(nations, m = 5, seed = 431, print = FALSE)
```
```
Warning: Number of logged events: 3
```
Here, `m` is the number of imputations I want to create (five, in our case), `seed` is the random seed I set so that I can replicate the results later, and `print` is set to FALSE to reduce the amount of unnecessary output this produces. I will ignore this warning about logged events in most cases, including here.
Here is how I would fit a set of five models, each using a different one of the five imputed data sets and obtain a summary of my results:
```r
est7 <- nations |>
  mice(seed = 431, print = FALSE) |>
  with(lm(uhc_index ~ univ_care + gdp_percap)) |>
  pool()
```
```
Warning: Number of logged events: 3
```
```r
model_parameters(est7, ci = 0.95)
```
```
Warning: Number of logged events: 3
Warning: Number of logged events: 3
Warning: Number of logged events: 3
Warning: Number of logged events: 3

# Fixed Effects

Parameter       | Coefficient |   SE |         95% CI |     t |     df |      p
-------------------------------------------------------------------------------
(Intercept)     |       56.81 | 1.19 | [54.46, 59.17] | 47.58 | 186.81 | < .001
univ care [yes] |       11.16 | 1.98 | [ 7.26, 15.06] |  5.64 | 187.01 | < .001
gdp percap      |        0.27 | 0.03 | [ 0.21,  0.34] |  7.98 | 186.80 | < .001

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
using a Wald t-distribution approximation.
```
```r
glance(est7)
```
```
  nimp nobs r.squared adj.r.squared
1    5  192 0.4111089     0.4048772
```
17.10 For More Information
See https://library.virginia.edu/data/articles/getting-started-with-multiple-imputation-in-r
- Reference there to Rubin (1976)
Rubin (1976) classified missing data into three categories: MCAR, MAR, and MNAR.

MCAR: Missing Completely at Random - the missingness of data points is entirely random, meaning that the pattern of missing values is uncorrelated with the structure of the data. An example would be a random sample taken from the population: data on some people will be missing, but at random, since everyone had the same chance of being included in the sample.

MAR: Missing at Random - the missingness is not completely random, but the propensity for missingness depends on the observed data, not the missing data. An example would be a survey respondent choosing not to answer a question on income because they are concerned about the privacy of their personal information. In this case, the missing value for income can be predicted by looking at the answers to the questions on personal information.

MNAR: Missing Not at Random - the missingness is not random; it correlates with unobservable characteristics unknown to the researcher. An example would be social desirability bias in surveys, where respondents with certain characteristics we can't observe systematically shy away from answering questions on racial issues.

All multiple imputation techniques start with the MAR assumption. While MCAR is desirable, it is generally unrealistic for real data. Thus, researchers assume that missing values can be replaced by predictions derived from the observable portion of the dataset. This is a fundamental assumption; without it, we wouldn't be able to predict plausible values of missing data points from the observed data.
17.11 For More Information
- https://stefvanbuuren.name/fimd/
[^1]: This is a measure of coverage of essential health services, expressed as the average score of 14 tracer indicators of health service coverage, measured in 2021.

[^2]: Some of the GDP data describe 2021, some 2022, and some 2023, depending on what was most recent and available.

[^3]: The WHO regions are African, Americas, Eastern Mediterranean, European, Southeast Asian and Western Pacific.