Chapter 16 The Serum Zinc Study

16.1 Serum Zinc Levels in 462 Teenage Males (`serzinc`)

The serzinc data include serum zinc levels in micrograms per deciliter that have been gathered for a sample of 462 males aged 15-17, My source for these data is Appendix B1 of Pagano and Gauvreau (2000). Serum zinc deficiency has been associated with anemia, loss of strength and endurance, and it is thought that 25% of the world’s population is at risk of zinc deficiency. Such a deficiency can indicate poor nutrition, and can affect growth and vision, for instance. “Typical” values¹⁰ are said to be 0.66-1.10 mcg/ml, which is 66 - 110 micrograms per deciliter.

serzinc

# A tibble: 462 x 2
   ID     zinc
   <chr> <dbl>
 1 M-001   142
 2 M-002    88
 3 M-003    83
 4 M-004   100
 5 M-005   123
 6 M-006    63
 7 M-007   102
 8 M-008    80
 9 M-009   117
10 M-010    86
# ... with 452 more rows

16.2 Our Goal: A Confidence Interval for the Population Mean

After we assess the data a bit, and are satisfied that we understand it, our first inferential goal will be to produce a confidence interval for the true (population) mean of males age 15-17 based on this sample, assuming that these 462 males are a random sample from the population of interest, that each serum zinc level is drawn independently from an identical distribution describing that population.

To do this, we will have several different procedures available, including:

A confidence interval for the population mean based on a t distribution, when we assume that the data are drawn from an approximately Normal distribution, using the sample standard deviation. (Interval corresponding to a t test, and it will be a good choice when the data really are approximately Normally distributed.)
A resampling approach to generate a bootstrap confidence interval for the population mean, which does not require that we assume either that the population standard deviation is known, nor that the data are drawn from an approximately Normal distribution, but which has some other weaknesses.
A rank-based procedure called the Wilcoxon signed rank test can also be used to yield a confidence interval statement about the population pseudo-median, a measure of the population distribution’s center (but not the population’s mean).

16.3 Exploratory Data Analysis for Serum Zinc

16.3.1 Graphical Summaries

The code presented below builds:

a histogram (with Normal model superimposed),
a boxplot (with median notch) and
a Normal Q-Q plot (with guiding straight line through the quartiles)

for the zinc results from the serzinc tibble. It does this while making use of several functions contained in the script Love-boost.R.

These functions include:

fd_bins to estimate the Freedman-Diaconis bins setting for the histogram
qq_int and qq_slope to facilitate the drawing of a line on the Normal Q-Q plot

rezinc <- mosaic::favstats(~ zinc, data = serzinc)
bin_w1 <- 10 # specify binwidth

p1 <- ggplot(serzinc, aes(x = zinc)) +
    geom_histogram(binwidth = bin_w1, 
                   fill = "slateblue", 
                   col = "white") +
    theme_light() +
    stat_function(
        fun = function(x) dnorm(x, mean = rezinc$mean, 
                                sd = rezinc$sd) * 
            rezinc$n * bin_w1,
        col = "blue") +
labs(x = "Serum zinc level", y = "")

p2 <- ggplot(serzinc, aes(sample = zinc)) +
    geom_qq(col = "slateblue") + 
    geom_qq_line(col = "red") +
    theme_light() +
    labs(y = "Serum zinc level")

p3 <- ggplot(serzinc, aes(x = "", y = zinc)) +
    geom_violin() +
    geom_boxplot(width = 0.3, fill = "slateblue", 
                 outlier.color = "slateblue") +
    theme_light() +
    coord_flip() +
    labs(x = "", y = "Serum zinc level")

p1 + p2 - p3 + plot_layout(ncol = 1, height = c(3, 1)) +
    plot_annotation(title = "Serum Zinc (micrograms per deciliter) for 462 Teenage Males")

These results include some of the more useful plots and numerical summaries when assessing shape, center and spread. The zinc data in the serzinc data frame appear to be slightly right skewed, with five outlier values on the high end of the scale, in particular.

You could potentially add coord_flip() + to the histogram, and this would have the advantage of getting all three plots oriented in the same direction, but then we (or at least I) lose the ability to tell the direction of skew at a glance from the direction of the histogram.

16.3.2 Numerical Summaries

This section describes some numerical summaries of interest to augment the plots in summarizing the center, spread and shape of the distribution of serum zinc among these 462 teenage males.

The tables below are built using two functions from the Love-boost.R script.

skew1 provides the skew₁ value for the zinc data and
Emp_Rule provides the results of applying the 68-95-99.7 Empirical Rule to the zinc data.

mosaic::favstats(~ zinc, data = serzinc)

 min Q1 median Q3 max mean sd   n missing
  50 76     86 98 153 87.9 16 462       0

serzinc %$% skew1(zinc)

[1] 0.121

The skew₁ value backs up our graphical assessment, that the data are slightly right skewed.

We can also assess how well the 68-95-99.7 Empirical Rule for a Normal distribution holds up for these data. Not too badly, as it turns out.

serzinc %$% Emp_Rule(zinc)

                count proportion
Mean +/- 1 SD     323      0.699
Mean +/- 2 SD     447      0.968
Mean +/- 3 SD     458      0.991
Entire Data Set   462      1.000

serzinc %$% psych::describe(zinc)

   vars   n mean sd median trimmed  mad min max range skew kurtosis   se
X1    1 462 87.9 16     86    87.2 16.3  50 153   103 0.62     0.87 0.74

Rounded to two decimal places, the standard deviation of the serum zinc data turns out to be 16, and so the standard error of the mean, shown as se in the psych::describe output, is 16 divided by the square root of the sample size, n = 462. This standard error is about to become quite important to us in building statistical inferences about the mean of the entire population of teenage males based on this sample.

16.4 Inference about a Population Mean from this Sample

The notion of a confidence interval will be important to us, and is explained in further detail in Chapter 17. For now, we’ll focus on building a confidence interval for the mean of a population or process.

A confidence interval for a population or process mean uses data from a sample (and perhaps some additional information) to identify a range of potential values for the population mean, which, if certain assumptions hold, can be assumed to provide a reasonable estimate for the true population mean. A confidence interval consists of:

An interval estimate describing the population parameter of interest (here the population mean), and
A probability statement, expressed in terms of a confidence level.

Now, let’s identify a 90% confidence interval for the mean of the population zinc levels for teenagers like those sampled in this scenario. Under certain assumptions we’ll discuss later, a reasonable approach uses an intercept-only linear model to estimate the population mean zinc level, both in terms of a point estimate and a confidence interval for a given confidence level.

model_zinc <- lm(zinc ~ 1, data = serzinc)

coef(model_zinc) # for the point estimate of pop. mean

(Intercept) 
       87.9

confint(model_zinc, conf = 0.90) # establish 90% CI for pop. mean

            2.5 % 97.5 %
(Intercept)  86.5   89.4

Equivalently, we could use the tidy() from the broom package to obtain these results within a tibble for our intercept-only model.

# requires library(broom)
tidy(model_zinc, conf.int = TRUE, conf = 0.90)

# A tibble: 1 x 7
  term        estimate std.error statistic p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 (Intercept)     87.9     0.745      118.       0     86.7      89.2

This method is identical to estimating a confidence interval for the population mean using the t.test function in R. Here are those results.

tidy(t.test(serzinc$zinc, conf.level = 0.90))

# A tibble: 1 x 8
  estimate statistic p.value parameter conf.low conf.high method
     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl> <chr> 
1     87.9      118.       0       461     86.7      89.2 One S~
# ... with 1 more variable: alternative <chr>

In Chapter 17, we will discuss the development of this confidence interval, and compare this approach to several alternatives.

16.5 Comparison to “Normal” Zinc Levels

Recall that the “Normal” zinc level would be between 66 and 110. What percentage of the sampled 462 teenagers meet that standard?

There are at least two ways to get what we need.

serzinc %>%
    count(zinc > 65 & zinc < 111) %>%
    mutate(proportion = n / sum(n), percentage = 100 * n / sum(n))

# A tibble: 2 x 4
  `zinc > 65 & zinc < 111`     n proportion percentage
  <lgl>                    <int>      <dbl>      <dbl>
1 FALSE                       67      0.145       14.5
2 TRUE                       395      0.855       85.5

Or, we could combine the “explosion pipe” with tabyl to obtain:

serzinc %$% tabyl(zinc > 65 & zinc < 111) %>%
    adorn_pct_formatting()

 zinc > 65 & zinc < 111   n percent
                  FALSE  67   14.5%
                   TRUE 395   85.5%

16.6 Inference about a Population Proportion from this Sample

Previously, we estimated a confidence interval for the mean of the population zinc levels. Now, we want to estimate a confidence interval for the proportion of the population whose serum zinc levels are in the range of 66 to 110. We’ll discuss this further in Chapter 18, but for now, we’ll just provide a little insight.

As before, we want to build both a point estimate for the population proportion, and a confidence interval for the population proportion.

Now, let’s identify a 90% confidence interval for the proportion of the population whose zinc levels are within the “normal” range. We have seen that 395 / 462 subjects (or a proportion of 0.855) fall in the “normal range” in our sample. For now, that will also be our point estimate of the proportion in the “normal range” across the entire population of teenagers like those in our sample.

serzinc <- serzinc %>%
    mutate(in_range = ifelse(zinc > 65 & zinc < 111, 1, 0))

serzinc %>% tabyl(in_range)

 in_range   n percent
        0  67   0.145
        1 395   0.855

Once we’ve created this 0-1 variable, there are several available approaches for wrapping a confidence interval around this proportion.

16.6.1 Using an Intercept-Only Regression Again?

We might consider taking the same approach as we did with the population mean earlier:

model_zincprop <- lm(in_range ~ 1, data = serzinc)

tidy(model_zincprop, conf.int = TRUE, conf = 0.90) %>% 
    knitr::kable(digits = 3)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.855	0.016	52.1	0	0.828	0.882

While there are more powerful approaches to estimate a confidence interval around this proportion, this simple approach turns out not to be too bad, so long as the sample proportion isn’t very close to either 0 or 1.

16.6.2 Better Approaches: `binom.test` and `prop.test`

Better approaches to estimating a confidence interval for the population proportion use specialized functions such as binom.test or prop.test in R. Here are those results. These approaches want us first to summarize our information into two counts:

x = number of observations where the event of interest occurs
n = number of observations, overall

Here, as we’ve seen, we have 395 “events” in 462 “trials”.

serzinc %>% tabyl(in_range) %>% adorn_totals()

 in_range   n percent
        0  67   0.145
        1 395   0.855
    Total 462   1.000

So, our data can be summarized as: x = 395, n = 462.

binom.test(x = 395, n = 462, conf.level = 0.90)


    Exact binomial test

data:  395 and 462
number of successes = 395, number of trials = 462, p-value <2e-16
alternative hypothesis: true probability of success is not equal to 0.5
90 percent confidence interval:
 0.825 0.881
sample estimates:
probability of success 
                 0.855

prop.test(x = 395, n = 462, conf.level = 0.90)


    1-sample proportions test with continuity correction

data:  395 out of 462, null probability 0.5
X-squared = 231, df = 1, p-value <2e-16
alternative hypothesis: true p is not equal to 0.5
90 percent confidence interval:
 0.825 0.881
sample estimates:
    p 
0.855

Or, if we want to see tidied output, we have…

binom.test(x = 395, n = 462, conf.level = 0.90) %>% tidy()

# A tibble: 1 x 8
  estimate statistic  p.value parameter conf.low conf.high method
     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl> <chr> 
1    0.855       395 1.23e-57       462    0.825     0.881 Exact~
# ... with 1 more variable: alternative <chr>

prop.test(x = 395, n = 462, conf.level = 0.90) %>% tidy()

# A tibble: 1 x 8
  estimate statistic  p.value parameter conf.low conf.high method
     <dbl>     <dbl>    <dbl>     <int>    <dbl>     <dbl> <chr> 
1    0.855      231. 2.88e-52         1    0.825     0.881 1-sam~
# ... with 1 more variable: alternative <chr>

In Chapter 18, we will discuss the development of confidence intervals for population proportions, and develop additional alternatives for building inference about proportions, percentages and rates.

References

Pagano, Marcello, and Kimberlee Gauvreau. 2000. Principles of Biostatistics. Second. Duxbury Press.

Reference values for those over the age of 10 years at http://www.mayomedicallaboratories.com/test-catalog/Clinical+and+Interpretive/8620 , visited 2019-09-17.↩