Predicting High-Density Lipoprotein Cholesterol Levels
432 Project A: A Partial Demonstration
Thomas E. Love, Ph.D.
What is This?
This is a demonstration Project A for 432. It includes everything you must include in sections 1, 2, 4-9, and 11-13, although there are numerous places where a better project would say more about the results shown here than I have. I also have only performed single imputation, rather than multiple imputation, in this demonstration project, and I’ve only given some guidance regarding the discussion, leaving it up to you.
I decided to share this with students in the class in an effort to ensure that students could meet the minimum standards for the project (to obtain a solid B) more easily.
R Packages and Setup
Code
knitr::opts_chunk$set(comment = NA)
library(janitor)
library(naniar)
library(broom)
library(car)
library(caret)
library(GGally)
library(gt)
library(gtsummary)
library(knitr)
library(mice)
library(patchwork)
library(ROCR)
library(rsample)
library(rms)
library(easystats)
library(tidyverse)
theme_set(theme_bw())
1 Data Source
These data come from the 2015-16 and 2017-18 administrations of the National Health and Nutrition Examination Survey (https://www.cdc.gov/nchs/nhanes/index.htm) from the Centers for Disease Control and Prevention.
The 2015-16 data are described in detail and accessed at https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2015
The 2017-18 data are described in detail and accessed at https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2017
In each year, we gathered data from the Demographics (DEMO) files (DEMO_I and DEMO_J, respectively), the Body Measures (BMX) files, the Cholesterol - High Density Lipoprotein (HDL) files, and the Current Health Status (HSQ) files.
2 The Subjects
The NHANES files contain complete data on our linear regression outcome (HDL cholesterol level) and our logistic regression outcome (sex-dependent cutoff for “at risk” HDL cholesterol: < 40 mg/dl for men and < 50 mg/dl for women) for a total of 6203 subjects between the ages of 40 and 79. Our tidied tibble (for analyses) consists of a random sample of 999 of those subjects.
3 Loading and Tidying the Data
Important
Here I provide only a summary of the steps I took. In your project, you’ll actually have to show your data management work.
In building this document, I used the nhanesA package for R, which is not currently available on CRAN. In addition, the repository of NHANES materials at the CDC website has changed in some ways that don’t allow me to use my original code. So, I have created a file called nh_demo.Rds which contains the tidied version of my data.
3.1 How the nh_demo file was built
I included the following items, from both 2017-18 and 2015-16 NHANES.
From the Demographics (DEMO) files, DEMO_J for 2017-18 and DEMO_I for 2015-16:
SEQN = Subject identifying code
RIDSTATR = Interview/Examination Status (categorical)
1 = Interview only
2 = Interview and MEC examination (MEC = mobile examination center)
RIDAGEYR = Age in years (quantitative, topcoded so that all subjects ages 80 and over are coded as 80)
RIAGENDR = Sex (categorical): 1 = Male, 2 = Female
RIDRETH3 = Race/Ethnicity (categorical)
1 = Mexican-American
2 = Other Hispanic (categories 1 and 2 are often combined)
3 = Non-Hispanic White
4 = Non-Hispanic Black
6 = Non-Hispanic Asian
7 = Other Race including Multi-Racial (sometimes left out, or combined with 6)
From the Body Measures (BMX) files (part of the Examination data):
BMXWAIST = Waist Circumference in cm (quantitative)
From the Cholesterol - High Density Lipoprotein (HDL) files (part of the Lab data):
LBDHDD = Direct HDL cholesterol in mg/dl (quantitative)
From the Current Health Status (HSQ) files (part of the Questionnaire data):
HSD010 = Self-reported general health (categorical)
1 = Excellent
2 = Very good
3 = Good
4 = Fair
5 = Poor
7 = Refused
9 = Don’t Know
3.2 Ingesting and Cleaning the NHANES data
I used functions from the nhanesA package to pull in the data from each necessary database using nhanes() and translated = FALSE, then used zap_label() to delete the labels, and converted to tibbles. Next, I used inner_join() to merge data within each NHANES cycle.
Next, I selected the 8 variables we need from the merged cycle-specific files, and converted each of the categorical variables to factors except the SEQN (subject code.) Then I added an indicator of the CYCLE (2015-16 or 2017-18) from which the data were drawn, and combined it all into one big tibble with full_join().
Then I decided to restrict our work here to adult subjects between the ages of 40 and 79, in part to avoid the fact that NHANES classifies everyone over age 80 as RIDAGEYR = 80. So that involved filtering to those with RIDAGEYR > 39 and RIDAGEYR < 80.
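As a compressed sketch of those combining and filtering steps (the names demo_15 and demo_17 are hypothetical stand-ins for the merged, 8-variable cycle-specific tibbles just described):
Code
## sketch only: demo_15 and demo_17 are hypothetical names for the
## merged cycle-specific tibbles described above
nh_merged <- full_join(
  demo_15 |> mutate(CYCLE = "2015-16"),
  demo_17 |> mutate(CYCLE = "2017-18")) |>
  filter(RIDAGEYR > 39, RIDAGEYR < 80)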
We have three quantitative variables: age (RIDAGEYR), waist circumference (BMXWAIST) and HDL Cholesterol (LBDHDD), which I renamed to AGE, WAIST and HDL, respectively. The ranges (minimum and maximum) for Age, Waist Circumference and HDL Cholesterol all seem fairly plausible to me. Perhaps you know better, and please do use that knowledge. We do have several hundred missing values in the waist circumference and HDL cholesterol values that we will need to deal with after we sample the data.
In terms of categorical variables, I started by checking to see that everyone has RIDSTATR status 2, meaning they completed both the NHANES questionnaire and the NHANES examination. Since they do, I then made sure our CYCLE variable works to indicate the NHANES reporting CYCLE for each subject by running tabyl(CYCLE, RIDSTATR).
Next, I changed RIAGENDR to SEX with values “M” and “F” as a factor. Another option would have been to use a 1/0 numeric variable to represent, for example FEMALE = 1 if the subject was reported as female and 0 if the subject was reported as male.
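A minimal sketch of that recoding, assuming the combined tibble is called nh_merged (as in the sketch above):
Code
## RIAGENDR is coded 1 = Male, 2 = Female in NHANES
nh_merged <- nh_merged |>
  mutate(SEX = fct_recode(factor(RIAGENDR), "M" = "1", "F" = "2"))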
Next, I created RACE_ETH to replace RIDRETH3.
I recoded the levels of the RIDRETH3 variable to use short, understandable group names, using mutate() and fct_recode(), and
Collapsed the first two categories (Mexican-American and Other Hispanic) into a single category, using fct_recode(), and
Changed the variable name to RACE_ETH using mutate() and
Sorted the levels of the resulting factor in order of their counts using fct_infreq().
The most common RACE_ETH turns out to be Non-Hispanic White, followed by Hispanic, then Non-Hispanic Black, Non-Hispanic Asian and finally Other Race (including Multi-Racial.) We won’t collapse any further for now.
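A sketch of those RACE_ETH steps, again under the hypothetical nh_merged name; note that giving two old levels the same new name in fct_recode() collapses them:
Code
nh_merged <- nh_merged |>
  mutate(RACE_ETH = fct_recode(factor(RIDRETH3),
                               "Hispanic" = "1", "Hispanic" = "2",
                               "NH_White" = "3", "NH_Black" = "4",
                               "NH_Asian" = "6", "Other" = "7"),
         RACE_ETH = fct_infreq(RACE_ETH))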
Next, I changed HSD010 to SROH, as follows.
Again, I recoded the levels of this categorical variable with fct_recode()
I also renamed the variable SROH (which is an abbreviation for self-reported overall health) with mutate(). Always explain your abbreviations.
I also converted the values 7 (Refused) and 9 (Don’t Know) to missing values with na_if(), then I used droplevels() to drop all of the now-unused levels in the factors I’m using.
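A sketch of the SROH steps, under the same nh_merged assumption:
Code
nh_merged <- nh_merged |>
  mutate(HSD010 = na_if(HSD010, 7),     ## 7 = Refused
         HSD010 = na_if(HSD010, 9),     ## 9 = Don't Know
         SROH = fct_recode(factor(HSD010),
                           "Excellent" = "1", "Very_Good" = "2",
                           "Good" = "3", "Fair" = "4", "Poor" = "5")) |>
  droplevels()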
3.3 Our outcomes
Our outcomes are each based on HDL cholesterol level. So I filtered the data for complete cases on that variable.
Next, I created a binary outcome for logistic regression, which is an indicator (taking the values 0 and 1) for an HDL cholesterol value that is “at risk” according to these standards published on the Mayo Clinic site, specifically, an adult male is considered to be at risk if their HDL < 40 mg/dl and an adult female is considered to be at risk if their HDL < 50 mg/dl.
Since we have complete data in nh_demo_full on HDL and SEX, I used that data to create the new variable HDL_RISK as shown below, then checked to see that this had worked properly:
Code
## this code chunk is not evaluated here - just showing you what I would do
nh_demo_full <- nh_demo_full |>
  mutate(HDL_RISK = case_when(
    SEX == "M" & HDL < 40 ~ 1,
    SEX == "F" & HDL < 50 ~ 1,
    .default = 0))

## check to see if this worked properly
nh_demo_full |>
  group_by(HDL_RISK, SEX) |>
  summarise(n = n(), min(HDL), max(HDL))
3.4 Arranging the Tibble and Sampling the Data
We don’t need RIDSTATR any more because we’ve checked and seen that all its values are 2, as needed.
We’ve renamed some variables and replaced others with better versions.
Finally, I selected 999 observations from our working file (the nh_demo_full tibble) for use in our subsequent analyses; this sample is called nh_demo. I have provided you with the nh_demo.Rds file, which I’m loading below.
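Before the load shown next, here's a sketch of how such a sample could be drawn; the seed is hypothetical, as the actual seed isn't part of this demo:
Code
set.seed(432001)   ## hypothetical seed, for illustration only
nh_demo <- nh_demo_full |>
  slice_sample(n = 999)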
Code
nh_demo <- read_rds("data/nh_demo.Rds")
4 The Tidy Tibble
4.1 Listing the Tibble
Note
If I ask you to print/list a tibble, part of the reason is to ensure that it actually is a tibble. A tibble will, as below, only print the first 10 rows.
Code
nh_demo
# A tibble: 999 × 9
SEQN HDL HDL_RISK AGE RACE_ETH WAIST SROH SEX CYCLE
<chr> <dbl> <dbl> <dbl> <fct> <dbl> <fct> <chr> <chr>
1 84842 48 0 50 Hispanic 101. Good M 2015-16
2 92202 111 0 65 NH_Black 77 Fair F 2015-16
3 93010 46 1 63 NH_White NA Poor F 2015-16
4 93974 39 1 59 NH_White 111. Good M 2017-18
5 96205 56 0 52 NH_Black 98 Good M 2017-18
6 95571 42 1 54 NH_White 104. Good F 2017-18
7 99969 54 0 69 NH_Black 115. Fair M 2017-18
8 89494 36 1 42 NH_Asian 98.1 <NA> F 2015-16
9 83887 101 0 51 NH_Black 74 Excellent F 2015-16
10 96908 48 1 73 Other 102 Good F 2017-18
# ℹ 989 more rows
4.2 Size and Identifiers
Our analytic tibble, called nh_demo, has 999 rows (observations) on 9 columns (variables.) Our identifier variable is SEQN, and we have a unique SEQN for each row in our data set, as the code below demonstrates.
Code
identical(nrow(nh_demo), n_distinct(nh_demo$SEQN))
4.3 Save The Tibble
I’ve already done this so I won’t do it again, but this is the code I would use to save the file.
Note
Note that I have set this code chunk to eval: false so that it will not actually do anything. I’m just showing you what I would do to save the Rds file to the data sub-directory of our R project. You should not use eval: false anywhere in your Project A.
Code
write_rds(nh_demo, "data/nh_demo.Rds")
5 The Code Book
I’ve chosen here to build a rather detailed codebook containing useful information describing the analytic sample, nh_demo. I’ve used numerous tools to help give a variety of summaries which may be helpful.
5.1 Five Key Statements
I suggest you begin your Project A codebook with five statements (like the ones shown below) to fix ideas about the sample size, the missingness, the outcomes and a statement about the roles for the other variables.
Note
If you look at the Quarto code that generated this document, you will see that I have used in-line coding to fill in the counts of subjects in statements 1 and 2. You should, too.
Sample Size The data in our complete nh_demo sample consist of 999 subjects from NHANES 2015-16 and NHANES 2017-18 between the ages of 40 and 79 in whom our outcome variables (HDL and HDL_RISK) were measured.
Missingness Of the 999 subjects, 912 have complete data on all variables listed below.
Our outcome variables are HDL, which is the subject’s serum HDL cholesterol measured in mg/dl for the linear regression, and HDL_RISK, which indicates whether or not the subject’s HDL cholesterol is at risk (too low) given their sex, for the logistic regression. There are no missing data in either of these outcomes.
Candidate predictors for my models include AGE, WAIST, RACE_ETH and SROH, (for both models) as well as SEX for the linear model.
The other variables contained in my tidy tibble are SEQN which is the subject identifying code, and CYCLE which identifies whether this subject was part of the 2015-16 or 2017-18 NHANES reporting cycle.
5.2 Describing the Variables (5 Tabs)
Variables included in the nh_demo data are summarized and described in five different ways in the panel below. Click on the tab that interests you to see results.
In addition to the brief descriptions (including units of measurement, where appropriate) in the table below, descriptions of the original NHANES variables which led to these are also discussed in Section 3.1.
Variable | Description
-------: | :------------------------------------------------------
SEQN | NHANES subject identifying code
HDL | HDL cholesterol in mg/dl (outcome for our linear model)
HDL_RISK | Does HDL put subject at risk? 1 (yes) if HDL < 40 for males or if HDL < 50 for females, otherwise 0 (outcome for logistic model)
AGE | Age in years, restricted to 40-79 here
RACE_ETH | Race-Ethnicity: five categories (NH_White, NH_Black, NH_Asian, Hispanic, Other)
WAIST | Waist circumference, in cm
SROH | Self-Reported Overall Health: five categories (Excellent, Very_Good, Good, Fair, Poor)
SEX | Biological Sex: either F or M
CYCLE | NHANES reporting cycle: either 2015-16 or 2017-18
Note
The data_codebook() function comes from the datawizard package, which is part of the easystats ecosystem.
Code
data_codebook(nh_demo)
6 Linear Regression Plans
6.1 My First Research Question
How effectively can we predict HDL cholesterol levels using age, sex, race/ethnicity, waist circumference and self-reported overall health, in a sample of 999 NHANES participants ages 40-79?
6.2 My Quantitative Outcome
My quantitative outcome is HDL, and I am interested in predicting this value using several more easily gathered characteristics of a subject.
I have complete HDL data for all 999 subjects in my nh_demo tibble.
My HDL data includes 86 different values, all measured in mg/dl.
The distribution of HDL across the 999 subjects in my nh_demo data (shown in the Figure below) is a bit right skewed, with a median of 51, and ranging from 22 mg/dl to 149 mg/dl.
Code
p1 <- ggplot(nh_demo, aes(sample = HDL)) +
  geom_qq(col = "navy") + geom_qq_line(col = "red") +
  labs(title = "Normal Q-Q plot of HDL", x = "",
       y = "HDL Cholesterol Level (mg/dl)")
p2 <- ggplot(nh_demo, aes(x = HDL)) +
  geom_histogram(binwidth = 5, col = "white", fill = "navy") +
  labs(title = "Histogram of HDL", x = "HDL Cholesterol Level (mg/dl)")
p1 + p2
6.3 My Planned Predictors (Linear Model)
The predictors I intend to use in my linear model are AGE and WAIST, which are quantitative, SEX, which is a binary categorical variable, and RACE_ETH and SROH, each of which are 5-category variables treated as factors in my nh_demo tibble.
AGE has 40 distinct values, and is measured in years.
WAIST has 483 distinct values, and is measured in centimeters.
SEX has two levels (F and M); the female and male counts can be obtained with nrow(nh_demo |> filter(SEX == "F")) and nrow(nh_demo |> filter(SEX == "M")), respectively.
RACE_ETH’s five categories each have at least 30 observations in each level (actually they each have 37 or more), and
SROH’s five categories also have at least 30 observations (actually 36 or more) in each level, as we can see in the table below. SROH also has 56 missing values.
Code
nh_demo |> tabyl(RACE_ETH, SROH) |> adorn_totals(where = c("row", "col"))
These five predictors are fewer than the allowed maximum of \(4 + (999 - 100)/100 = 12.999\), rounding to 12 predictors as prescribed by the Project A instructions, given our 999 observations in nh_demo.
6.3.1 Anticipated Direction of Effects
I expect higher HDL to be associated with lower age, with lower waist circumference, with being female, with reporting better self-reported overall health, and with non-Hispanic White ethnicity.
7 Logistic Regression Plans
7.1 My Second Research Question
How effectively can we predict whether or not a subject’s HDL cholesterol level is at risk (too low) using age, race/ethnicity, waist circumference and self-reported overall health, in a sample of 999 NHANES participants ages 40-79?
7.2 My Binary Outcome
My binary outcome is HDL_RISK, and I am interested in predicting this value using several more easily gathered characteristics of a subject.
I have complete HDL_RISK data for all 999 subjects in my nh_demo tibble.
Code
nh_demo |>tabyl(HDL_RISK)
HDL_RISK n percent
0 666 0.6666667
1 333 0.3333333
7.3 My Planned Predictors (Logistic Model)
I am using four of the five predictors I used in my linear model, specifically age, waist circumference, self-reported overall health and race/ethnicity.
Note that I am not including SEX in this model (although I did in my linear model) because my binary outcome variable’s cutoff values are stratified by SEX.
I have 333 observations in my smaller outcome group (those at risk), and so I am permitted to have up to \(4 + (333-100)/100 = 6.33\), rounded to 6 predictors, so I’m within the acceptable limit prescribed by the Project A instructions.
7.3.1 Anticipated Direction of Effects
I expect higher rates of HDL risk to be associated with higher age, with higher waist circumference, with reporting worse self-reported overall health, and with Hispanic or non-Hispanic Black ethnicity.
8 Linear Regression Analyses
8.1 Missingness
I will assume missing values are missing at random (MAR) in this work, and use single imputation for my analyses.
Here’s a table of missingness before imputation. I have missing data on two of my predictors, SROH and WAIST.
Code
miss_var_summary(nh_demo) |> filter(n_miss > 0)
8.1.1 Single Imputation Approach
I’ll use the mice package to do my single imputation (m = 1), with a seed set to 432432.
Code
nh_demo_i <- mice(nh_demo, m = 1, seed = 432432, print = FALSE) |>
  complete() |>
  tibble()
Warning: Number of logged events: 3
Code
n_miss(nh_demo_i)
[1] 0
We needn’t fear logged events in this context.
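If you want to see what those logged events were, the mids object that mice() returns stores them; a quick sketch, keeping the imputation object rather than piping straight to complete():
Code
imp <- mice(nh_demo, m = 1, seed = 432432, print = FALSE)
imp$loggedEvents   ## a small data frame describing the 3 logged events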
8.2 Outcome Transformation
Let’s look at the Box-Cox results from the car package.
Code
mod_temp <- lm(HDL ~ AGE + SEX + RACE_ETH + WAIST + SROH, data = nh_demo_i)
boxCox(mod_temp)
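Beyond reading the plot, we could also ask for the estimated transformation parameter numerically. A small sketch using car's powerTransform(), which is an addition here rather than part of the original analysis; an estimated power near 0 supports the log transform:
Code
## estimated Box-Cox power for the response, given mod_temp
summary(powerTransform(mod_temp))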
The Box-Cox plot suggests that we take the logarithm of HDL before fitting our model, and so we shall. Here’s a plot of the distribution of the natural logarithm of HDL.
Code
nh_demo_i <- nh_demo_i |> mutate(logHDL = log(HDL))
p1 <- ggplot(nh_demo_i, aes(sample = logHDL)) +
  geom_qq(col = "navy") + geom_qq_line(col = "red") +
  labs(title = "Normal Q-Q plot of log(HDL)", x = "",
       y = "Log of HDL Cholesterol Level (mg/dl)")
p2 <- ggplot(nh_demo_i, aes(x = logHDL)) +
  geom_histogram(bins = 20, col = "white", fill = "navy") +
  labs(title = "Histogram of log(HDL)",
       x = "Log of HDL Cholesterol Level (mg/dl)")
p1 + p2
It does seem that this transformed outcome follows something closer to a Normal distribution.
8.3 Scatterplot Matrix and Collinearity
Here’s a scatterplot matrix of our outcome and predictors after transformation and imputation.
Code
ggpairs(nh_demo_i, columns = c("AGE", "WAIST", "SEX", "RACE_ETH", "SROH", "logHDL"))
To check collinearity in more detail, we can estimate variance inflation factors for our predictors.
Code
mod_A <- lm(logHDL ~ AGE + SEX + RACE_ETH + WAIST + SROH, data = nh_demo_i)
car::vif(mod_A)
We don’t have any large generalized VIF results here, so we have no meaningful concerns regarding collinearity.
8.4 Model A
8.4.1 Fitting Model A
Here is the fit using the lm() function.
Code
mod_A <- lm(logHDL ~ AGE + SEX + RACE_ETH + WAIST + SROH, data = nh_demo_i)
We’ll also fit model A with the ols() function from the rms package, even though we won’t actually use this fit for a while.
Code
dd <- datadist(nh_demo_i)
options(datadist = "dd")
mod_A_ols <- ols(logHDL ~ AGE + SEX + RACE_ETH + WAIST + SROH,
                 data = nh_demo_i, x = TRUE, y = TRUE)
8.4.2 Coefficient Estimates
Note
I’ll show two different approaches (model_parameters() from the easystats ecosystem, and tidy() from the broom package) to summarize the coefficient estimates from mod_A. Either is sufficient for Project A, and there’s no need to show both approaches.
8.4.3 Model A Effects
Since we’ve built the mod_A_ols model using the rms package, it can be helpful to look at the effects using its plot and associated table, as shown below.
Note
If you look at the Quarto code, you’ll see that I’ve used warning: false in this code chunk to suppress some un-interesting warnings here, and in several other code chunks in this document. I encourage you to do the same in your Project A, for the code chunks that I’ve used warning: false in this demo.
Don’t use warning: false otherwise in Project A, though.
Code
plot(summary(mod_A_ols, conf.int = 0.90))
summary(mod_A_ols, conf.int = 0.90) |> kable(digits = 3)
8.4.4 Quality of Fit Summaries
Note
I’ll show two different approaches (model_performance() from the easystats ecosystem, and glance() from the broom package) to summarize the quality of fit measures describing mod_A. I actually like being able to see both of these approaches, since each provides some unique information, so I would encourage you, too, to include both in your Project A.
Code
model_performance(mod_A) |> print_md(digits = 3)
glance(mod_A) |>
  select(r2 = r.squared, adjr2 = adj.r.squared, sigma, AIC, BIC,
         nobs, df, df.residual) |>
  kable(digits = c(3, 3, 2, 1, 1, 0, 0, 0))
8.4.5 Regression Diagnostics for Model A
Note
Note in the Quarto code for the next chunk that I have set the fig-height to 9, so that the plots are easier to read. Please do that, too, whenever you use check_model() from the easystats ecosystem.
I also prefer that you run check_model() with detrend = FALSE for the Normal Q-Q plot of residuals, as done here.
Code
check_model(mod_A, detrend = FALSE)
For the most part, these plots look very reasonable. I see no clear problems with the assumptions of linearity, normality or constant variance evident in any of these results.
The main issue is the posterior predictive check, where our predictions are missing near the center of the distribution a bit, with more predicted values of log(HDL) in the 3.75 to 4.25 range than we see in the original data.
8.5 Non-Linearity and Spearman \(\rho^2\)
Here’s the relevant Spearman \(\rho^2\) plot, as a place to look for sensible places to consider a non-linear term or terms.
Code
plot(spearman2(logHDL ~ AGE + WAIST + SEX + RACE_ETH + SROH, data = nh_demo_i))
Our Spearman \(\rho^2\) plot first suggests the use of a non-linear term in WAIST, so we’ll add a restricted cubic spline in WAIST using 5 knots, which should add 3 degrees of freedom to our initial model.
Next, the SEX variable also seems to be a good choice, so we’ll add an interaction between SEX and the main effect of WAIST, which will add one more degree of freedom to our model A.
8.6 Model B
Our Model B will add two non-linear terms, summing up to 4 additional degrees of freedom, to our Model A.
8.6.1 Fitting Model B
Code
mod_B <- lm(logHDL ~ AGE + rcs(WAIST, 5) + SEX + WAIST %ia% SEX +
              RACE_ETH + SROH, data = nh_demo_i)
We’ll also fit model B with the ols() function from the rms package.
Code
dd <- datadist(nh_demo_i)
options(datadist = "dd")
mod_B_ols <- ols(logHDL ~ AGE + rcs(WAIST, 5) + SEX + WAIST %ia% SEX +
                   RACE_ETH + SROH,
                 data = nh_demo_i, x = TRUE, y = TRUE)
8.6.2 Coefficient Estimates
Note
I’ll show two different approaches (model_parameters() from the easystats ecosystem, and tidy() from the broom package) to summarize the coefficient estimates from mod_B. Either is sufficient for Project A, and there’s no need to show both approaches. The tidy() approach handles the non-linear terms a bit better in my view, so I’d probably go with that.
8.6.3 Model B Effects
Here, we use the mod_B_ols model to look at the effects using its plot and associated table, which may be especially helpful when we include non-linear terms.
Note
I’ve used warning: false in this code chunk to suppress some un-interesting warnings. You can (for this type of plot) too, if you need to, in your Project A.
Code
plot(summary(mod_B_ols, conf.int = 0.90))
summary(mod_B_ols, conf.int = 0.90) |> kable(digits = 3)
8.6.4 Quality of Fit Summaries
Code
model_performance(mod_B) |> print_md(digits = 3)
glance(mod_B) |>
  select(r2 = r.squared, adjr2 = adj.r.squared, sigma, AIC, BIC,
         nobs, df, df.residual) |>
  kable(digits = c(3, 3, 2, 1, 1, 0, 0, 0))
8.6.5 Regression Diagnostics for Model B
Code
check_model(mod_B, detrend = FALSE)
These residual plots also look pretty reasonable. Again, I see no clear problems with the assumptions of linearity, normality or constant variance evident in these results. The posterior predictive check is a little better than Model A, but not much. The collinearity we’ve introduced here is due to the interaction terms, so that’s not a concern for us.
8.7 Validating Models A and B
We will use the validate() function from the rms package to validate our ols fits.
Code
set.seed(4321); (valA <- validate(mod_A_ols))
set.seed(4322); (valB <- validate(mod_B_ols))
8.7.1 Validated \(R^2\) and MSE, as well as IC statistics
Note
If you inspect the Quarto code for the table below, you’ll see that I have used in-line coding to specify the values of each element. That’s definitely worth considering in building your work, although it takes some care.
Model | Validated \(R^2\) | Validated MSE | AIC | BIC | df
----: | :---------------: | :-----------: | :---: | :---: | :--:
A | 0.256 | 0.0674 | 146.4 | 210.1 | 11
B | 0.263 | 0.0674 | 130.8 | 214.2 | 15
8.8 Final Linear Regression Model
We’ll choose Model B here.
We see a small improvement for Model B in terms of validated \(R^2\) and AIC. We also note that BIC is a little lower (thus better) for Model A, and there’s no material difference in validated mean squared error. Each of the models matches the assumptions of linear regression well, so there’s not much to choose from in that regard. Really, we could pick either model here, but I wanted to pick a model with non-linear terms so that I could display and discuss them in what follows.
8.8.1 Winning Model’s OLS summary
Code
mod_B_ols
Linear Regression Model
ols(formula = logHDL ~ AGE + rcs(WAIST, 5) + SEX + WAIST %ia%
SEX + RACE_ETH + SROH, data = nh_demo_i, x = TRUE, y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs     999    LR chi2    341.85    R2        0.290
sigma 0.2560   d.f.           15    R2 adj    0.279
d.f.    983    Pr(> chi2) 0.0000    g         0.183
Residuals
Min 1Q Median 3Q Max
-0.763472 -0.183023 -0.002842 0.171374 0.916543
Coef S.E. t Pr(>|t|)
Intercept 5.4362 0.2523 21.55 <0.0001
AGE 0.0032 0.0008 4.10 <0.0001
WAIST -0.0170 0.0030 -5.62 <0.0001
WAIST' 0.0387 0.0180 2.15 0.0317
WAIST'' -0.1579 0.0951 -1.66 0.0971
WAIST''' 0.1881 0.1323 1.42 0.1552
SEX=M -0.1071 0.1115 -0.96 0.3370
WAIST * SEX=M -0.0005 0.0011 -0.47 0.6367
RACE_ETH=Hispanic -0.0341 0.0220 -1.55 0.1219
RACE_ETH=NH_Black 0.0877 0.0224 3.91 <0.0001
RACE_ETH=NH_Asian -0.0754 0.0276 -2.73 0.0064
RACE_ETH=Other 0.0100 0.0449 0.22 0.8234
SROH=Very_Good -0.0014 0.0323 -0.04 0.9645
SROH=Good -0.0635 0.0307 -2.07 0.0389
SROH=Fair -0.0873 0.0331 -2.64 0.0085
SROH=Poor -0.0312 0.0498 -0.63 0.5310
As a reminder, we displayed the validated \(R^2\) value (0.263) for our Model B back in Section 8.7.1.
In your work, we would only need to see one of the following effect size descriptions.
WAIST description: If we have two female subjects of the same age, race/ethnicity and self-reported overall health, then if subject 1 has a waist circumference of 92 cm and subject 2 has a waist circumference of 112.6 cm, then our model estimates that subject 1 will have a log(HDL) that is 0.133 higher than subject 2. The 90% confidence interval around that estimated effect on log(HDL) ranges from (0.091, 0.175).
SEX description: If we have two subjects, each with the same age, race/ethnicity and self-reported overall health and with a waist circumference of 101.7 cm, then if subject 1 is male and subject 2 is female, our model estimates that the female subject will have a log(HDL) that is 0.158 higher (with 90% CI: (0.131, 0.186)) than the male subject.
AGE description: If we have two subjects of the same sex, waist circumference, race/ethnicity and self-reported overall health, then if subject 1 is age 50 and subject 2 is age 66, our model predicts that subject 1’s log(HDL) will be 0.052 larger than subject 2’s log(HDL), with 90% confidence interval (0.031, 0.073).
SROH description: If we have two subjects of the same age, sex, waist circumference and race/ethnicity, then if subject 1 reports Excellent overall health while subject 2 reports only Good overall health, our model predicts that subject 1’s log(HDL) will be 0.063 higher than subject 2’s log(HDL), with 90% CI (0.013, 0.114).
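Contrasts like these can be produced directly from the ols fit by giving summary() the comparison points yourself; a sketch (the low/high values shown are the ones used in the descriptions above):
Code
## AGE contrast of 50 vs. 66 and WAIST contrast of 92 vs. 112.6 cm,
## with 90% confidence intervals
summary(mod_B_ols, AGE = c(50, 66), WAIST = c(92, 112.6), conf.int = 0.90)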
8.8.5 Prediction Plot for Winning Model
Here’s the set of prediction plots to describe the impact of the coefficients on log(HDL).
Code
ggplot(Predict(mod_B_ols))
8.8.6 Nomogram of Winning Model
Code
plot(nomogram(mod_B_ols, fun = exp, funlabel = "HDL"))
8.8.7 Prediction for a New Subject
I will create a predicted HDL for a new female Non-Hispanic White subject who is 60 years old, has an 85 cm waist circumference, and rates their overall health as Fair.
Note
I can do this using either the ols() or the lm() fit to our model. Either is sufficient for Project A, and there’s no need to show both approaches.
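Here's a sketch of how these predictions could be generated from the lm() fit, using a hypothetical new_subjects tibble that covers both the male and female versions of this subject:
Code
new_subjects <- tibble(AGE = 60, WAIST = 85, SEX = c("M", "F"),
                       RACE_ETH = "NH_White", SROH = "Fair")
preds <- predict(mod_B, newdata = new_subjects,
                 interval = "prediction", level = 0.90)
exp(preds)   ## back-transform from log(HDL) to mg/dl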
We then exponentiate these results to obtain the estimate and 90% prediction interval on the original scale of HDL cholesterol, as shown in the table below.
Predictor Values | Predicted HDL | 90% Prediction Interval
:--------------- | :-----------: | :---------------------:
AGE = 60, WAIST = 85, SEX = M, RACE_ETH = NH_White, SROH = Fair | 52.09 mg/dl | (34.05, 79.68) mg/dl
AGE = 60, WAIST = 85, SEX = F, RACE_ETH = NH_White, SROH = Fair | 60.51 mg/dl | (39.58, 92.5) mg/dl
Alternatively, I could have used our ols() fit. Here, I’ll just fit the results for the female subject.
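A sketch of that parallel calculation with the ols fit; in predict() for rms fits, conf.type = "individual" requests a prediction interval, here on the log(HDL) scale:
Code
female <- tibble(AGE = 60, WAIST = 85, SEX = "F",
                 RACE_ETH = "NH_White", SROH = "Fair")
p <- predict(mod_B_ols, newdata = female,
             conf.int = 0.90, conf.type = "individual")
exp(cbind(fit = p$linear.predictors, lower = p$lower, upper = p$upper))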
We then exponentiate these results to obtain the estimate and 90% prediction interval on the original scale of HDL cholesterol, as shown in the table below.

Predictor Values | Predicted HDL | 90% Prediction Interval
:--------------- | :-----------: | :---------------------:
AGE = 60, WAIST = 85, SEX = F, RACE_ETH = NH_White, SROH = Fair | 60.51 mg/dl | (39.58, 92.5) mg/dl
Naturally, the two fitting approaches (lm() and ols()) produce the same model, so they produce the same predictions.
9 Logistic Regression Analyses
9.1 Missingness
Again, we’ll assume missing values are MAR, and use the single imputation approach developed previously in Section 8.1.1.
9.2 Model Y
We’ll predict Pr(HDL_RISK = 1), the probability of having a low enough HDL to put the subject at risk, as a function of AGE, WAIST, RACE_ETH and SROH.
9.2.1 Fitting Model Y
We’ll fit our logistic regression model using both glm() and lrm().
Code
mod_Y <- glm(HDL_RISK ~ AGE + WAIST + RACE_ETH + SROH,
             data = nh_demo_i, family = binomial())
ddd <- datadist(nh_demo_i)
options(datadist = "ddd")
mod_Y_lrm <- lrm(HDL_RISK ~ AGE + WAIST + RACE_ETH + SROH,
                 data = nh_demo_i, x = TRUE, y = TRUE)
9.2.2 Coefficient Estimates
Note
I’ll show two different approaches (model_parameters() from the easystats ecosystem, and tidy() from the broom package) to summarize the coefficient estimates from mod_Y. Either is sufficient for Project A, and there’s no need to show both approaches.
Here, we have three available approaches to summarize the fit of Model Y, thanks to our glm() and lrm() fits of the same model. I’ll show all three, but in practice (and in your Project A) I wouldn’t include the model_performance() results.
9.2.5 Confusion Matrix (Model Y)
My prediction rule for this confusion matrix is that the fitted value of Pr(HDL_RISK = 1) needs to be greater than or equal to 0.5 for me to predict HDL_RISK is 1, and otherwise I predict 0.
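The code that produced the matrix below isn't shown above; here is one sketch that would generate it, assuming broom's augment() for the fitted probabilities and caret's confusionMatrix():
Code
res_Y <- augment(mod_Y, type.predict = "response")   ## .fitted = Pr(HDL_RISK = 1)
confusionMatrix(data = factor(res_Y$.fitted >= 0.5),
                reference = factor(res_Y$HDL_RISK == 1),
                positive = "TRUE")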
Confusion Matrix and Statistics
Reference
Prediction FALSE TRUE
FALSE 603 241
TRUE 63 92
Accuracy : 0.6957
95% CI : (0.6661, 0.7241)
No Information Rate : 0.6667
P-Value [Acc > NIR] : 0.02722
Kappa : 0.2097
Mcnemar's Test P-Value : < 2e-16
Sensitivity : 0.27628
Specificity : 0.90541
Pos Pred Value : 0.59355
Neg Pred Value : 0.71445
Prevalence : 0.33333
Detection Rate : 0.09209
Detection Prevalence : 0.15516
Balanced Accuracy : 0.59084
'Positive' Class : TRUE
Here are our results, tabulated nicely.

Model | Classification Rule | Sensitivity | Specificity | Pos. Pred. Value
:---: | :-----------------: | :---------: | :---------: | :--------------:
Y | Predicted Pr(HDL_RISK = 1) >= 0.5 | 0.276 | 0.905 | 0.594
9.3 Non-Linearity and Spearman \(\rho^2\) plot
Here’s the relevant Spearman \(\rho^2\) plot.
Code
plot(spearman2(HDL_RISK ~ AGE + WAIST + SEX + RACE_ETH + SROH, data = nh_demo_i))
Our Spearman \(\rho^2\) plot first suggests the use of a non-linear term in WAIST, so we’ll add a restricted cubic spline in WAIST using 4 knots, which should add 2 degrees of freedom to our initial model.
Next, the SROH variable seems to be a good choice, so we’ll add an interaction between SROH and the main effect of WAIST, which will add four more degrees of freedom to our model Y.
9.4 Model Z
As mentioned, our model Z will add 6 degrees of freedom through two non-linear terms, to model Y.
9.4.1 Fitting Model Z
We’ll fit Model Z with both glm() and lrm().
Code
mod_Z <- glm(HDL_RISK ~ AGE + rcs(WAIST, 4) + RACE_ETH + SROH + WAIST %ia% SROH,
             data = nh_demo_i, family = binomial())
ddd <- datadist(nh_demo_i)
options(datadist = "ddd")
mod_Z_lrm <- lrm(HDL_RISK ~ AGE + rcs(WAIST, 4) + RACE_ETH + SROH + WAIST %ia% SROH,
                 data = nh_demo_i, x = TRUE, y = TRUE)
9.4.2 Coefficient Estimates
Note
I’ll show two different approaches (model_parameters() from the easystats ecosystem, and tidy() from the broom package) to summarize the coefficient estimates from mod_Y. Either is sufficient for Project A, and there’s no need to show both approaches. The tidy() approach handles the non-linear terms a bit better in my view, so I’d probably go with that.
Again, we have three available approaches to summarize the fit of Model Z, thanks to our glm() and lrm() fits of the same model. I’ll show all three, but in practice (and in your Project A) I wouldn’t include the model_performance() results.
As mentioned previously, these summaries aren’t as appealing to me as those in the other two tabs here.
Code
model_performance(mod_Z) |> print_md(digits = 2)
Indices of model performance

AIC | AICc | BIC | Tjur’s R2 | RMSE | Sigma | Log_loss | Score_log | Score_spherical | PCP
---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---:
1181.84 | 1182.46 | 1265.25 | 0.12 | 0.44 | 1.00 | 0.57 | -144.81 | 1.60e-03 | 0.61
9.4.5 Confusion Matrix (Model Z)
As in Model Y, my prediction rule for my Model Z confusion matrix is that the fitted value of Pr(HDL_RISK = 1) needs to be greater than or equal to 0.5 for me to predict HDL_RISK is 1, and otherwise I predict 0.
Again, we augment our nh_demo_i data to include predicted probabilities of (HDL_RISK = 1) from Model Z.
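A sketch paralleling the Model Y approach, again assuming augment() and confusionMatrix():
Code
res_Z <- augment(mod_Z, type.predict = "response")
confusionMatrix(data = factor(res_Z$.fitted >= 0.5),
                reference = factor(res_Z$HDL_RISK == 1),
                positive = "TRUE")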
9.5 Validating Models Y and Z
9.5.1 Validated Nagelkerke \(R^2\) and C, as well as IC statistics
Model | Validated \(R^2\) | Validated C | AIC | BIC | df
----: | :---------------: | :---------: | :----: | :----: | :--:
Y | 0.1254 | 0.6831 | 1178.3 | 1232.2 | 10
Z | 0.1275 | 0.6823 | 1181.8 | 1265.3 | 16
9.6 Final Logistic Regression Model
I prefer Model Y, because of its slightly better AIC, BIC and validated C statistic, despite the fact that Model Z has a slightly higher validated \(R^2\) and that Model Z also has a slightly higher positive predictive value. It’s pretty close, though.
9.6.1 Winning Model’s Parameter Estimates
Code
mod_Y_lrm
Logistic Regression Model
lrm(formula = HDL_RISK ~ AGE + WAIST + RACE_ETH + SROH, data = nh_demo_i,
x = TRUE, y = TRUE)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs          999    LR chi2     115.50    R2          0.152    C       0.701
 0           666    d.f.            10    R2(10,999)  0.100    Dxy     0.402
 1           333    Pr(> chi2) <0.0001    R2(10,666)  0.146    gamma   0.402
max |deriv| 1e-06                         Brier       0.198    tau-a   0.179
Coef S.E. Wald Z Pr(>|Z|)
Intercept -4.5080 0.7221 -6.24 <0.0001
AGE -0.0138 0.0070 -1.98 0.0478
WAIST 0.0403 0.0052 7.78 <0.0001
RACE_ETH=Hispanic 0.0041 0.1886 0.02 0.9826
RACE_ETH=NH_Black -0.6327 0.2032 -3.11 0.0018
RACE_ETH=NH_Asian 0.4743 0.2390 1.98 0.0472
RACE_ETH=Other -0.6235 0.4205 -1.48 0.1381
SROH=Very_Good 0.1786 0.3175 0.56 0.5739
SROH=Good 0.5168 0.3001 1.72 0.0850
SROH=Fair 0.8776 0.3154 2.78 0.0054
SROH=Poor 0.6895 0.4503 1.53 0.1258
In your work, we would only need to see one of the following effect size descriptions.
WAIST description: If we have two subjects of the same age, race/ethnicity and self-reported overall health, then if subject 1 has a waist circumference of 92 cm and subject 2 has a waist circumference of 112.6 cm, then our model estimates that subject 2 will have 2.295 times the odds (90% CI: 1.93, 2.74) that subject 1 has of having a HDL that is at risk.
AGE description: If we have two subjects of the same waist circumference, race/ethnicity and self-reported overall health, then if subject 1 is age 50 and subject 2 is age 66, our model predicts that subject 2 has 80.2% of the odds of an at-risk HDL that subject 1 has. The 90% CI for the odds ratio comparing subject 1 to subject 2 is (0.667, 0.963).
SROH description: If we have two subjects of the same age, waist circumference and race/ethnicity, then if subject 1 reports Excellent overall health while subject 2 reports only Good overall health, our model predicts that the odds ratio comparing subject 1’s odds of an at-risk HDL to those of subject 2 will be 0.596, with a 90% CI of (0.364, 0.977).
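Odds ratio estimates like these can be pulled from the lrm fit with summary(), setting the comparison points ourselves; a sketch, using the low/high values from the descriptions above:
Code
## odds ratios for AGE 50 -> 66 and WAIST 92 -> 112.6, with 90% CIs
summary(mod_Y_lrm, AGE = c(50, 66), WAIST = c(92, 112.6), conf.int = 0.90)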
9.6.5 Validated \(R^2\) and \(C\) statistic for Winning Model
the validated \(R^2\) statistic for Model Y is 0.1254, and
the validated \(C\) statistic for Model Y is 0.6831.
9.6.6 Nomogram of Winning Model
Code
plot(nomogram(mod_Y_lrm, fun = plogis, funlabel = "Pr(HDL_RISK = 1)"))
9.6.7 Predictions for Two New Subjects
I will create a predicted Pr(HDL_RISK = 1) for two new Non-Hispanic White subjects who are 60 years old, and who rate their overall health as Fair. The first will have a waist circumference of 80 cm, and the second will have a waist circumference of 100 cm.
Note
I can do this using either the glm() or the lrm() fit to our model. Either is sufficient for Project A, and there’s no need to show both approaches.
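Here's a sketch of the glm() approach, using a hypothetical tibble new2 to hold the two new subjects:
Code
new2 <- tibble(AGE = 60, WAIST = c(80, 100),
               RACE_ETH = "NH_White", SROH = "Fair")
predict(mod_Y, newdata = new2, type = "response")   ## Pr(HDL_RISK = 1)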
As before, the two fitting approaches (glm() and lrm()) produce the same model, so they produce the same predictions.
10 Discussion
Note
I am leaving the discussion up to you, but I have provided an outline of what you need to address below.
10.1 Answering My Research Questions
10.1.1 Question 1 (with Answer)
I’ll leave the job of restating and drawing conclusions about the research question under study in our linear regression analyses for you to answer.
10.1.2 Question 2 (with Answer)
I’ll leave the job of restating and drawing conclusions about the research question under study in our logistic regression analyses for you to answer.
10.2 Thoughts on Project A
Note that we expect you to answer exactly two of the following four questions in your discussion.
10.2.1 Question 1
What was substantially harder or easier than you expected, and why?
If you chose to answer Question 1, your answer would go here.
10.2.2 Question 2
What do you wish you’d known at the start of this process that you know now, and why?
If you chose to answer Question 2, your answer would go here.
10.2.3 Question 3
What was the most confusing part of doing the project, and how did you get past it?
If you chose to answer Question 3, your answer would go here.
10.2.4 Question 4
What was the most useful thing you learned while doing the project, and why?
If you chose to answer Question 4, your answer would go here.
11 Affirmation
I am certain that it is completely appropriate for these NHANES data to be shared with anyone, without any conditions. There are no concerns about privacy or security.
---title: "Predicting High-Density Lipoprotein Cholesterol Levels"subtitle: "432 Project A: A Partial Demonstration"author: "Thomas E. Love, Ph.D."date: last-modifiedformat: html: toc: true number-sections: true code-fold: show code-tools: true code-overflow: wrap embed-resources: true date-format: iso---## What is This? {.unnumbered}This is a demonstration Project A for 432. It includes everything you **must** include in sections 1, 2, 4-9, and 11-13, although there are numerous places where a better project would say more about the results shown here than I have. I also have only performed single imputation, rather than multiple imputation, in this demonstration project, and I've only given some guidance regarding the discussion, leaving it up to you.I decided to share this with students in the class in an effort to ensure that students could meet the minimum standards for the project (to obtain a solid B) more easily.## R Packages and Setup {.unnumbered}```{r}#| message: false#| warning: falseknitr::opts_chunk$set(comment =NA) library(janitor) library(naniar)library(broom)library(car)library(caret)library(GGally)library(gt)library(gtsummary)library(knitr)library(mice)library(patchwork)library(ROCR)library(rsample)library(rms)library(easystats)library(tidyverse) theme_set(theme_bw()) ```# Data SourceThese data come from the 2015-16 and 2017-18 administrations of the [National Health and Nutrition Examination Survey](https://www.cdc.gov/nchs/nhanes/index.htm) from the Centers for Disease Control and Prevention.- The 2015-16 data are described in detail and accessed [here](https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2015)- The 2017-18 data are described in detail and accessed [here](https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2017)In each year, we gathered data from the Demographics (DEMO) files (DEMO_I and DEMO_J, respectively), the Body Measures (BMX) files, the Cholesterol - High Density Lipoprotein (HDL) files, and the Current Health Status (HSQ) files.# The SubjectsThe NHANES files contain complete data on our linear regression outcome (HDL cholesterol level) and our logistic regression outcome (sex-dependent cutoff for "at risk" HDL cholesterol: < 40 mg/dl for men and < 50 mg/dl for women) for a total of 6203 subjects between the ages of 40 and 79. Our tidied tibble (for analyses) consists of a random sample of 999 of those subjects.# Loading and Tidying the Data {#sec-load}:::{.callout-important}Here I provide only a summary of the steps I took. In your project, you'll actually have to show your data management work.:::In building this document, I used the `nhanesA` package for R, which is not currently available on CRAN. In addition, the repository of NHANES materials at the CDC website has changed in some ways that don't allow me to use my original code. 
So, I have created a file called `nh_demo.Rds` which contains the tidied version of my data.## How the `nh_demo` file was built {#sec-orig}I included the following items, from both 2017-18 and 2015-16 NHANES.From the DEMO files: [`DEMO_J` for 2017-18](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.htm) and [`DEMO_I` for 2015-16](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm)- SEQN = Subject identifying code - RIDSTATR = Interview/Examination Status (categorical) - 1 = Interview only - 2 = Interview and MEC examination (MEC = mobile examination center)- RIDAGEYR = Age in years (quantitative, topcoded at 80) - All subjects ages 80 and over are coded as 80- RIAGENDR = Sex (categorical) - 1 = Male, 2 = Female- RIDRETH3 = Race/Ethnicity (categorical) - 1 = Mexican-American, - 2 = Other Hispanic (1 and 2 are often combined) - 3 = Non-Hispanic White - 4 = Non-Hispanic Black - 6 = Non-Hispanic Asian - 7 = Other Race including Multi-Racial - Note that categories 1 and 2 are often combined, and sometimes we leave out category 7, or combine it with 6.From the Body Measures (BMX) files: [`BMX_J` for 2017-18](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/BMX_J.htm) and [`BMX_I` for 2015-16](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BMX_I.htm) (BMX is part of the Examination data)- BMXWAIST = Waist Circumference in cm (quantitative)From the Cholesterol - High-Density Lipoprotein (HDL) files: [`HDL_J` for 2017-18](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/HDL_J.htm) and [`HDL_I` for 2015-16](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/HDL_I.htm) (HDL is part of the Lab data)- LBDHDD = Direct HDL cholesterol in mg/dl (quantitative)From the Current Health Status (HSQ) file [`HSQ_J` for 2017-18](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/HSQ_J.htm) and [`HSQ_I` for 2015-16](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/HSQ_I.htm) (HSQ is part of the Questionnaire data)- HSD010 = Self-reported general health (categorical) - 1 = Excellent - 2 = Very good - 3 = Good - 4 = Fair - 5 = Poor - 7 = Refused - 9 = Don't Know## Ingesting and Cleaning the NHANES dataI used functions from the `nhanesA` package to pull in the data from each necessary database using `nhanes()` and `translated = FALSE`, then used `zap_label()` to delete the labels, and converted to tibbles. Next, I used `inner_join()` to merge data within each NHANES cycle.Next, I selected the 8 variables we need from the merged cycle-specific files, and converted each of the categorical variables to factors except the SEQN (subject code.) Then I added an indicator of the CYCLE (2015-16 or 2017-18) in which the data were drawn, and combined it all into one big tibble with `full_join()`.Then I decided to restrict our work here to adult subjects between the ages of 40 and 79, in part to avoid the fact that NHANES classifies everyone over age 80 as `RIDAGEYR` = 80. So that involved filtering to those with RIDAGEYR > 39 and RIDAGEYR < 80.We have three quantitative variables: age (`RIDAGEYR`), waist circumference (`BMXWAIST`) and HDL Cholesterol (`LBDHDD`), which I renamed to AGE, WAIST and HDL, respectively. The ranges (minimum and maximum) for Age, Waist Circumference and HDL Cholesterol all seem fairly plausible to me. Perhaps you know better, and please do use that knowledge. 
We do have several hundred missing values in the waist circumference and HDL cholesterol values that we will need to deal with after we sample the data.In terms of categorical variables, I started by checking to see that everyone has `RIDSTATR` status 2, meaning they completed both the NHANES questionnaire and the NHANES examination. Since they do, I then made sure our `CYCLE` variable works to indicate the NHANES reporting CYCLE for each subject by running `tabyl(CYCLE, RIDSTATR)`.Next, I changed RIAGENDR to `SEX` with values "M" and "F" as a factor. Another option would have been to use a 1/0 numeric variable to represent, for example FEMALE = 1 if the subject was reported as female and 0 if the subject was reported as male.Next, I created `RACE_ETH` to replace `RIDRETH3`.- I recoded the levels of the `RIDRETH3` variable to use short, understandable group names, using `mutate()` and `fct_recode()`, and- Collapsed the first two categories (Mexican-American and Other Hispanic) into a single category, using `fct_recode()`, and- Changed the variable name to `RACE_ETH` using `mutate()` and- Sorted the resulting factor in order of their counts using `fct_infreq()`.The most common `RACE_ETH` turns out to be Non-Hispanic White, followed by Hispanic, then Non-Hispanic Black, Non-Hispanic Asian and finally Other Race (including Multi-Racial.) We won't collapse any further for now.Next, I changed `HSD010` to `SROH`, as follows.- Again, I recoded the levels of this categorical variable with `fct_recode()`- I also renamed the variable SROH (which is an abbreviation for self-reported overall health) with `mutate()`. Always explain your abbreviations.- I also converted the values 7 (Refused) and 9 (Don't Know) to missing values with `na_if()`, then I used `droplevels()` to drop all of the now-unused levels in the factors I'm using.## Our outcomes {#sec-hdlrisk}Our outcomes are each based on HDL cholesterol level. So I filtered the data for complete cases on that variable.Next, I created a binary outcome for logistic regression, which is an indicator (taking the values 0 and 1) for an HDL cholesterol value that is "at risk" according to [these standards published on the Mayo Clinic site](https://www.mayoclinic.org/diseases-conditions/high-blood-cholesterol/in-depth/hdl-cholesterol/art-20046388), specifically, an adult male is considered to be at risk if their HDL < 40 mg/dl and an adult female is considered to be at risk if their HDL < 50 mg/dl.Since we have complete data in `nh_demo_full` on HDL and SEX, I used that data to create the new variable `HDL_RISK` as shown below, then checked to see that this had worked properly:```{r}#| message: false#| eval: false## this code chunk is not evaluated here - just showing you what I would donh_demo_full <- nh_demo_full |>mutate(HDL_RISK =case_when( SEX =="M"& HDL <40~1, SEX =="F"& HDL <50~1,.default =0))## check to see if this worked properlynh_demo_full |>group_by(HDL_RISK, SEX) |>summarise(n =n(), min(HDL), max(HDL))```## Arranging the Tibble and Sampling the Data- We don't need `RIDSTATR` any more because we've checked and see that all its values are 2, as needed.- We've renamed some variables and replaced others with better versions.Finally, I selected 999 observations from our working file, called the `nh_demo_full` tibble for use in our subsequent analyses, and this sample is called `nh_demo`. I have provided you with the `nh_demo.Rds` file, which I'm loading below. 
```{r}nh_demo <-read_rds("data/nh_demo.Rds")```# The Tidy Tibble## Listing the Tibble:::{.callout-note}If I ask you to print/list a tibble, part of the reason is to ensure that it actually *is* a tibble. A tibble will, as below, only print the first 10 rows.:::```{r}nh_demo```## Size and IdentifiersOur analytic tibble, called `nh_demo`, has `r nrow(nh_demo)` rows (observations) on `r ncol(nh_demo)` columns (variables.) Our indicator variable is the SEQN, and we have a unique SEQN for each row in our data set, as the code below demonstrates.```{r}identical(nrow(nh_demo), n_distinct(nh_demo$SEQN))```## Save The TibbleI've already done this so I won't do it again, but this is the code I would use to save the file.:::{.callout-note}Note that I have set this code chunk to `eval: false` so that it will not actually do anything. I'm just showing you what I would do to save the Rds file to the data sub-directory of our R project. You should not use `eval: false` anywhere in your Project A.:::```{r}#| eval: falsewrite_rds(nh_demo, "data/nh_demo.Rds")```# The Code BookI've chosen here to build a rather detailed codebook containing useful information describing the analytic sample, `nh_demo`. I've used numerous tools to help give a variety of summaries which may be helpful.## Five Key StatementsI suggest you begin your Project A codebook with five statements (like the ones shown below) to fix ideas about the sample size, the missingness, the outcomes and a statement about the roles for the other variables. :::{.callout-note}If you look at the Quarto code that generated this document, you will see that I have used in-line coding to fill in the counts of subjects in statements 1 and 2. You should, too.:::1. **Sample Size** The data in our complete `nh_demo` sample consist of `r nrow(nh_demo)` subjects from NHANES 2015-16 and NHANES 2017-18 between the ages of 40 and 79 in whom our outcome variables (`HDL` and `HDL_RISK`) were measured. 2. **Missingness** Of the `r nrow(nh_demo)` subjects, `r n_case_complete(nh_demo |> select(HDL, HDL_RISK, AGE, WAIST, CYCLE, SEX, RACE_ETH, SROH))` have complete data on all variables listed below.3. Our **outcome** variables are `HDL`, which is the subject's serum HDL cholesterol measured in mg/dl for the linear regression, and `HDL_RISK`, which indicates whether or not the subject's HDL cholesterol is at risk (too low) given their sex, for the logistic regression. There are no missing data in either of these outcomes.4. Candidate **predictors** for my models include `AGE`, `WAIST`, `RACE_ETH` and `SROH`, (for both models) as well as `SEX` for the linear model.5. The other variables contained in my tidy tibble are `SEQN` which is the subject identifying code, and `CYCLE` which identifies whether this subject was part of the 2015-16 or 2017-18 NHANES reporting cycle.## Describing the Variables (5 Tabs)Variables included in the `nh_demo` data are summarized and described in five different ways in the panel below. Click on the tab that interests you to see results.::: {.panel-tabset}### DefinitionsIn addition to the brief descriptions (including units of measurement, where appropriate) in the table below, descriptions of the original NHANES variables which led to these are also discussed in @sec-orig.Variable | Description--------: | :------------------------------------------------------`SEQN` | NHANES subject identifying code`HDL` | HDL cholesterol in mg/dl (**outcome for our linear model**)`HDL_RISK` | Does HDL put subject at risk? 
1 (yes) if HDL < 40 for males or if HDL < 50 for females, otherwise 0 (**outcome for logistic model**)`AGE` | Age in years, restricted to 40-79 here`RACE_ETH` | Race-Ethnicity: five categories (NH_white, NH_Black, NH_Asian, Hispanic, Other)`WAIST` | Waist circumference, in cm`SROH` | Self-Reported Overall Health: five categories (Excellent, Very_Good, Good, Fair, Poor)`SEX` | Biological Sex: either F or M`CYCLE` | NHANES reporting cycle: either 2015-16 or 2017-18### `data_codebook()`:::{.callout-note}The `data_codebook()` function comes from the `datawizard` package, which is part of the `easystats` ecosystem.:::```{r}data_codebook(nh_demo)```### `describe()`:::{.callout-note}The `describe()` function (and the `html()` function) used here come from the `Hmisc` package, developed by Frank Harrell and his colleagues.:::```{r}#| warning: falsedescribe(nh_demo) |>html()```### `tbl_summary()`:::{.callout-note}The `tbl_summary()` function used here comes from the `gtsummary` package. Note that missing values are listed as "Unknown" by default.:::```{r}tbl_summary(select(nh_demo, -SEQN),label =list(HDL ="HDL (HDL Cholesterol in mg/dl)",HDL_RISK ="HDL_RISK (HDL < 40 if Male, < 50 if Female)?",AGE ="AGE (in years)",WAIST ="WAIST (circumference in cm)",CYCLE ="CYCLE (NHANES reporting)",RACE_ETH ="RACE_ETH (Race/Ethnicity)",SROH ="SROH (Self-Reported Overall Health)"),stat =list( all_continuous() ~"{median} [{min} to {max}]" ))```### Missingness:::{.callout-note}These summaries come from the `naniar` package.:::```{r}gg_miss_var(nh_demo)miss_var_summary(nh_demo)miss_case_table(nh_demo)```:::# Linear Regression Plans## My First Research QuestionHow effectively can we predict HDL cholesterol levels using age, sex, race/ethnicity, waist circumference and self-reported overall health, in a sample of 999 NHANES participants ages 40-79?## My Quantitative Outcome- My quantitative outcome is `HDL`, and I am interested in predicting this value using several more easily gathered characteristics of a subject.- I have complete HDL data for all `r nrow(nh_demo)` subjects in my `nh_demo` tibble.- My HDL data includes `r n_distinct(nh_demo$HDL)` different values, all measured in mg/dl.- The distribution of HDL across the 999 subjects in my `nh_demo` data (shown in the Figure below) is a bit right skewed, with a median of 51, and ranging from 22 mg/dl to 149 mg/dl.```{r}p1 <-ggplot(nh_demo, aes(sample = HDL)) +geom_qq(col ="navy") +geom_qq_line(col ="red") +labs(title ="Normal Q-Q plot of HDL", x ="",y ="HDL Cholesterol Level (mg/dl)")p2 <-ggplot(nh_demo, aes(x = HDL)) +geom_histogram(binwidth =5, col ="white", fill ="navy") +labs(title ="Histogram of HDL", x ="HDL Cholesterol Level (mg/dl)")p1 + p2```## My Planned Predictors (Linear Model)The predictors I intend to use in my linear model are `AGE` and `WAIST`, which are quantitative, `SEX`, which is a binary categorical variable, and `RACE_ETH` and `SROH`, each of which are 5-category variables treated as factors in my `nh_demo` tibble.- `AGE` has `r n_distinct(nh_demo$AGE)` distinct values, and is measured in years.- `WAIST` has `r n_distinct(nh_demo$WAIST)` distinct values, and is measured in centimeters.- `SEX` has two levels, with `nrow(nh_demo |> filter(SEX == "F"))` female and `nrow(nh_demo |> filter(SEX == "M"))` male subjects.- `RACE_ETH`'s five categories each have at least 30 observations in each level (actually they each have 37 or more), and - `SROH`'s five categories also have at least 30 observations (actually 36 or more) in each level, as we can see 
in the table below. `SROH` also has 56 missing values.```{r}nh_demo |>tabyl(RACE_ETH, SROH) |>adorn_totals(where =c("row", "col"))```These five predictors are fewer than the allowed maximum of $4 + (999 - 100)/100 = 12.999$, rounding to 12 predictors as prescribed by the Project A instructions, given our 999 observations in `nh_demo`. ### Anticipated Direction of EffectsI expect Higher HDL to be associated with lower age, with lower waist circumference, with being female, with reporting better self-reported overall health, and with non-hispanic white ethnicity. # Logistic Regression Plans## My Second Research QuestionHow effectively can we predict whether or not a subject's HDL cholesterol level is at risk (too low) using age, race/ethnicity, waist circumference and self-reported overall health, in a sample of 999 NHANES participants ages 40-79? ## My Binary Outcome- My binary outcome is `HDL_RISK`, and I am interested in predicting this value using several more easily gathered characteristics of a subject.- I have complete `HDL_RISK` data for all `r nrow(nh_demo)` subjects in my `nh_demo` tibble.```{r}nh_demo |>tabyl(HDL_RISK)```## My Planned Predictors (Logistic Model)I am using four of the five predictors I used in my linear model, specifically age, waist circumference, self-reported overall health and race/ethnicity.- Note that I am not including `SEX` in this model (although I did in my linear model) because my binary outcome variable's cutoff values are stratified by `SEX`.I have 333 observations in my smaller outcome group (those at risk), and so I am permitted to have up to $4 + (333-100)/100 = 6.33$, rounded to 6 predictors, so I'm within the acceptable limit prescribed by the Project A instructions.### Anticipated Direction of EffectsI expect higher rates of HDL risk to be associated with higher age, with higher waist circumference, with reporting worse self-reported overall health, and with Hispanic or non-Hispanic Black ethnicity. # Linear Regression Analyses## MissingnessI will assume missing values are missing at random (MAR) in this work, and use single imputation for my analyses.Here's a table of missingness before imputation. I have missing data on two of my predictors, SROH and WAIST.```{r}miss_var_summary(nh_demo) |>filter(n_miss >0)```### Single Imputation Approach {#sec-singleimp}I'll use the `mice` package to do my single imputation (`m = 1`), with a seed set to `432432`.```{r}nh_demo_i <-mice(nh_demo, m =1, seed =432432, print =FALSE) |>complete() |>tibble()n_miss(nh_demo_i)```We needn't fear logged events in this context.## Outcome TransformationLet's look at the Box-Cox results from the `car` package.```{r}mod_temp <-lm(HDL ~ AGE + SEX + RACE_ETH + WAIST + SROH, data = nh_demo_i)boxCox(mod_temp)```The Box-Cox plot suggests that we take the logarithm of HDL before fitting our model, and so we shall. 
Here's a plot of the distribution of the natural logarithm of HDL.```{r}nh_demo_i <- nh_demo_i |>mutate(logHDL =log(HDL))p1 <-ggplot(nh_demo_i, aes(sample = logHDL)) +geom_qq(col ="navy") +geom_qq_line(col ="red") +labs(title ="Normal Q-Q plot of log(HDL)", x ="",y ="Log of HDL Cholesterol Level (mg/dl)")p2 <-ggplot(nh_demo_i, aes(x = logHDL)) +geom_histogram(bins =20, col ="white", fill ="navy") +labs(title ="Histogram of log(HDL)", x ="Log of HDL Cholesterol Level (mg/dl)")p1 + p2```It does seem that this transformed outcome follows something closer to a Normal distribution.## Scatterplot Matrix and CollinearityHere's a scatterplot matrix of our outcome and predictors after transformation and imputation.```{r}#| message: falseggpairs(nh_demo_i, columns =c("AGE", "WAIST", "SEX", "RACE_ETH", "SROH", "logHDL"))```To check collinearity in more detail, we can estimate variance inflation factors for our predictors.```{r}mod_A <-lm(logHDL ~ AGE + SEX + RACE_ETH + WAIST + SROH, data = nh_demo_i)car::vif(mod_A)```We don't have any large generalized VIF results here, so we have no meaningful concerns regarding collinearity.## Model A### Fitting Model AHere is the fit using the `lm()` function.```{r}mod_A <-lm(logHDL ~ AGE + SEX + RACE_ETH + WAIST + SROH, data = nh_demo_i)```We'll also fit model A with the `ols()` function from the `rms` package, even though we won't actually use this fit for a while.```{r}dd <-datadist(nh_demo_i)options(datadist ="dd")mod_A_ols <-ols(logHDL ~ AGE + SEX + RACE_ETH + WAIST + SROH,data = nh_demo_i, x =TRUE, y =TRUE)```### Coefficient Estimates:::{.callout-note}I'll show two different approaches (`model_parameters()` from the `easystats` ecosystem, and `tidy()` from the `broom` package) to summarize the coefficient estimates from `mod_A`. Either is sufficient for Project A, and there's no need to show both approaches.:::::: {.panel-tabset}#### Model Parameters (Model A)```{r}model_parameters(mod_A, ci =0.90) |>print_md(digits =3)```#### Tidied Coefficient Estimates (Model A)```{r}tidy(mod_A, conf.int =TRUE, conf.level =0.90) |>select(term, estimate, se = std.error, low90 = conf.low, high90 = conf.high, p = p.value) |>kable(digits =3)```:::### Model A EffectsSince we've built the `mod_A_ols` model using the `rms` package, it can be helpful to look at the effects using its plot and associated table, as shown below.:::{.callout-note}If you look at the Quarto code, you'll see that I've used `warning: false` in this code chunk to suppress some un-interesting warnings here, and in several other code chunks in this document. I encourage you to do the same in your Project A, for the code chunks that I've used `warning: false` in this demo. - **Don't** use `warning: false` otherwise in Project A, though. :::```{r}#| warning: falseplot(summary(mod_A_ols, conf.int =0.90))summary(mod_A_ols, conf.int =0.90) |>kable(digits =3)```### Quality of Fit Summaries:::{.callout-note}I'll show two different approaches (`model_performance()` from the `easystats` ecosystem, and `glance()` from the `broom` package) to summarize the quality of fit measures describing `mod_A`. 
## Model A

### Fitting Model A

Here is the fit using the `lm()` function.

```{r}
mod_A <- lm(logHDL ~ AGE + SEX + RACE_ETH + WAIST + SROH, data = nh_demo_i)
```

We'll also fit model A with the `ols()` function from the `rms` package, even though we won't actually use this fit for a while.

```{r}
dd <- datadist(nh_demo_i)
options(datadist = "dd")

mod_A_ols <- ols(logHDL ~ AGE + SEX + RACE_ETH + WAIST + SROH,
                 data = nh_demo_i, x = TRUE, y = TRUE)
```

### Coefficient Estimates

:::{.callout-note}
I'll show two different approaches (`model_parameters()` from the `easystats` ecosystem, and `tidy()` from the `broom` package) to summarize the coefficient estimates from `mod_A`. Either is sufficient for Project A, and there's no need to show both approaches.
:::

::: {.panel-tabset}

#### Model Parameters (Model A)

```{r}
model_parameters(mod_A, ci = 0.90) |> print_md(digits = 3)
```

#### Tidied Coefficient Estimates (Model A)

```{r}
tidy(mod_A, conf.int = TRUE, conf.level = 0.90) |>
  select(term, estimate, se = std.error, low90 = conf.low,
         high90 = conf.high, p = p.value) |>
  kable(digits = 3)
```

:::

### Model A Effects

Since we've built the `mod_A_ols` model using the `rms` package, it can be helpful to look at the effects using its plot and associated table, as shown below.

:::{.callout-note}
If you look at the Quarto code, you'll see that I've used `warning: false` in this code chunk to suppress some uninteresting warnings here, and in several other code chunks in this document. I encourage you to do the same in your Project A, for the code chunks in which I've used `warning: false` in this demo.

- **Don't** use `warning: false` otherwise in Project A, though.
:::

```{r}
#| warning: false
plot(summary(mod_A_ols, conf.int = 0.90))
summary(mod_A_ols, conf.int = 0.90) |> kable(digits = 3)
```

### Quality of Fit Summaries

:::{.callout-note}
I'll show two different approaches (`model_performance()` from the `easystats` ecosystem, and `glance()` from the `broom` package) to summarize the quality of fit measures describing `mod_A`. I actually like being able to see both of these approaches, since each provides some unique information, so I would encourage you, too, to include both in your Project A.
:::

::: {.panel-tabset}

#### Model A Performance

```{r}
model_performance(mod_A) |> print_md(digits = 3)
```

#### Model A at a `glance()`

In our Model A, we use `r glance(mod_A)$df` degrees of freedom, and obtain an $R^2$ value of `r round_half_up(glance(mod_A)$r.squared,3)`.

```{r}
glance(mod_A) |>
  select(r2 = r.squared, adjr2 = adj.r.squared, sigma, AIC, BIC,
         nobs, df, df.residual) |>
  kable(digits = c(3, 3, 2, 1, 1, 0, 0, 0))
```

:::

### Regression Diagnostics for Model A

:::{.callout-note}
Note in the Quarto code for the next chunk that I have set the fig-height to 9, so that the plots are easier to read. Please do that, too, whenever you use `check_model()` from the `easystats` ecosystem.

- I also prefer that you run `check_model()` with `detrend = FALSE` for the Normal Q-Q plot of residuals, as done here.
:::

```{r}
#| fig-height: 9
check_model(mod_A, detrend = FALSE)
```

For the most part, these plots look very reasonable. I see no clear problems with the assumptions of linearity, normality or constant variance in any of these results. The main issue is the posterior predictive check, where our predictions miss near the center of the distribution a bit, with more predicted values of `log(HDL)` in the 3.75 to 4.25 range than we see in the original data.

## Non-Linearity and Spearman $\rho^2$

Here's the relevant Spearman $\rho^2$ plot, as a place to look for sensible places to consider a non-linear term or terms.

```{r}
plot(spearman2(logHDL ~ AGE + WAIST + SEX + RACE_ETH + SROH, data = nh_demo_i))
```

Our Spearman $\rho^2$ plot first suggests the use of a non-linear term in `WAIST`, so we'll add a restricted cubic spline in `WAIST` using 5 knots, which should add 3 degrees of freedom to our initial model. Next, the `SEX` variable also seems to be a good choice, so we'll add an interaction between `SEX` and the main effect of `WAIST`, which will add one more degree of freedom to our model A.

## Model B

Our Model B adds these two non-linear terms, for a total of 4 additional degrees of freedom beyond our Model A.

### Fitting Model B

```{r}
mod_B <- lm(logHDL ~ AGE + rcs(WAIST, 5) + SEX + WAIST %ia% SEX +
              RACE_ETH + SROH, data = nh_demo_i)
```

We'll also fit model B with the `ols()` function from the **rms** package.

```{r}
dd <- datadist(nh_demo_i)
options(datadist = "dd")

mod_B_ols <- ols(logHDL ~ AGE + rcs(WAIST, 5) + SEX + WAIST %ia% SEX +
                   RACE_ETH + SROH, data = nh_demo_i, x = TRUE, y = TRUE)
```
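As an optional check on the degrees-of-freedom accounting, Model A is nested within Model B, so an `anova()` comparison of the two `lm()` fits should show exactly 4 added degrees of freedom (3 from the spline, 1 from the interaction). A minimal sketch:

```{r}
# Optional sketch: the Df entry for the second row should show the
# 4 degrees of freedom added by Model B, with an F test for the addition
anova(mod_A, mod_B)
```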
### Coefficient Estimates

:::{.callout-note}
I'll show two different approaches (`model_parameters()` from the `easystats` ecosystem, and `tidy()` from the `broom` package) to summarize the coefficient estimates from `mod_B`. Either is sufficient for Project A, and there's no need to show both approaches. The `tidy()` approach handles the non-linear terms a bit better in my view, so I'd probably go with that.
:::

::: {.panel-tabset}

#### Model Parameters (Model B)

```{r}
model_parameters(mod_B, ci = 0.90) |> print_md(digits = 3)
```

#### Tidied Coefficient Estimates (Model B)

```{r}
tidy(mod_B, conf.int = TRUE, conf.level = 0.90) |>
  select(term, estimate, se = std.error, low90 = conf.low,
         high90 = conf.high, p = p.value) |>
  kable(digits = 3)
```

:::

### Model B Effects

Here, we use the `mod_B_ols` model to look at the effects using its plot and associated table, which may be especially helpful when we include non-linear terms.

:::{.callout-note}
I've used `warning: false` in this code chunk to suppress some uninteresting warnings. You can too (for this type of plot), if you need to, in your Project A.
:::

```{r}
#| warning: false
plot(summary(mod_B_ols, conf.int = 0.90))
summary(mod_B_ols, conf.int = 0.90) |> kable(digits = 3)
```

### Quality of Fit Summaries

::: {.panel-tabset}

#### Model B Performance

```{r}
model_performance(mod_B) |> print_md(digits = 3)
```

#### Model B at a `glance()`

In our Model B, we use `r glance(mod_B)$df` degrees of freedom, and obtain an $R^2$ value of `r round_half_up(glance(mod_B)$r.squared,3)`.

```{r}
glance(mod_B) |>
  select(r2 = r.squared, adjr2 = adj.r.squared, sigma, AIC, BIC,
         nobs, df, df.residual) |>
  kable(digits = c(3, 3, 2, 1, 1, 0, 0, 0))
```

:::

### Regression Diagnostics for Model B

```{r}
#| fig-height: 9
check_model(mod_B, detrend = FALSE)
```

These residual plots look pretty reasonable, too. Again, I see no clear problems with the assumptions of linearity, normality or constant variance in these results. The posterior predictive check is a little better than Model A's, but not much. The collinearity we've introduced here is due to the interaction terms, so it's not a concern for us.

## Validating Models A and B

We will use the `validate()` function from the **rms** package to validate our `ols` fits.

```{r}
set.seed(4321); (valA <- validate(mod_A_ols))
set.seed(4322); (valB <- validate(mod_B_ols))
```

### Validated $R^2$ and MSE, as well as IC statistics {#sec-vallin}

:::{.callout-note}
If you inspect the Quarto code for the table below, you'll see that I have used in-line coding to specify the values of each element. That's definitely worth considering in building your work, although it takes some care.
:::

Model | Validated $R^2$ | Validated MSE | AIC | BIC | df
-----: | :--------: | :--------: | :-----: | :-----: | :--:
A | `r round_half_up(valA[1,5],3)` | `r round_half_up(valA[2,5],4)` | `r round_half_up(AIC(mod_A),1)` | `r round_half_up(BIC(mod_A),1)` | `r glance(mod_A)$df`
B | `r round_half_up(valB[1,5],3)` | `r round_half_up(valB[2,5],4)` | `r round_half_up(AIC(mod_B),1)` | `r round_half_up(BIC(mod_B),1)` | `r glance(mod_B)$df`
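If the positional `valA[1,5]` style of indexing feels opaque, the same values can be pulled by name: `validate()` returns a matrix with named rows (such as `R-square` and `MSE`) and a bias-corrected `index.corrected` column. A sketch of the equivalent name-based look-ups:

```{r}
# Sketch: name-based equivalents of valA[1,5] and valA[2,5]
valA["R-square", "index.corrected"]
valA["MSE", "index.corrected"]
```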
## Final Linear Regression Model

We'll choose Model B here. We see a small improvement for Model B in terms of validated $R^2$ and AIC. We also note that BIC is a little lower (thus better) for Model A, and there's no material difference in validated mean squared error. Each of the models matches the assumptions of linear regression well, so there's not much to choose from in that regard. Really, we could pick either model here, but I wanted to pick a model with non-linear terms so that I could display and discuss them in what follows.

### Winning Model's OLS Summary

```{r}
mod_B_ols
```

As a reminder, we displayed the validated $R^2$ value ( = `r round_half_up(valB[1,5],3)`) for our Model B back in @sec-vallin.

### Effects Plot for Winning Model

```{r}
#| warning: false
plot(summary(mod_B_ols, conf.int = 0.90))
```

### Numerical Description of Effect Sizes

```{r}
#| warning: false
summary(mod_B_ols, conf.int = 0.90) |> kable(digits = 3)
```

### Effect Size Description(s)

In your work, we would only need to see one of the following effect size descriptions.

**WAIST** description: If we have two female subjects of the same age, race/ethnicity and self-reported overall health, and subject 1 has a waist circumference of 92 cm while subject 2 has a waist circumference of 112.6 cm, then our model estimates that subject 1 will have a log(`HDL`) that is 0.133 higher than subject 2's, with a 90% confidence interval of (0.091, 0.175). (On the original scale, a difference of 0.133 in log(HDL) corresponds to a multiplicative effect of exp(0.133), or about 14% higher HDL.)

**SEX** description: If we have two subjects, each with the same age, race/ethnicity and self-reported overall health and with a waist circumference of 101.7 cm, and subject 1 is male while subject 2 is female, our model estimates that the female subject will have a log(`HDL`) that is 0.158 higher (with 90% CI: (0.131, 0.186)) than the male subject's.

**AGE** description: If we have two subjects of the same sex, waist circumference, race/ethnicity and self-reported overall health, and subject 1 is age 50 while subject 2 is age 66, our model predicts that subject 1's log(`HDL`) will be 0.052 larger than subject 2's, with 90% confidence interval (0.031, 0.073).

**SROH** description: If we have two subjects of the same age, sex, waist circumference and race/ethnicity, and subject 1 reports Excellent overall health while subject 2 reports only Good overall health, our model predicts that subject 1's log(`HDL`) will be 0.063 higher than subject 2's, with 90% CI (0.013, 0.114).
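The WAIST and SEX contrasts above come straight from the `summary()` table for the `ols` fit, which by default compares the 25th and 75th percentiles of each quantitative predictor. To reproduce the specific AGE contrast (50 vs. 66) used above, we can ask `summary()` for it directly; a minimal sketch:

```{r}
# Sketch: request the custom AGE contrast (50 vs. 66) described above
summary(mod_B_ols, AGE = c(50, 66), conf.int = 0.90) |> kable(digits = 3)
```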
### Prediction Plot for Winning Model

Here's the set of prediction plots to describe the impact of the coefficients on `log(HDL)`.

```{r}
#| warning: false
ggplot(Predict(mod_B_ols))
```

### Nomogram of Winning Model

```{r}
#| warning: false
#| fig-height: 9
plot(nomogram(mod_B_ols, fun = exp, funlabel = "HDL"))
```

### Prediction for a New Subject

I will create a predicted HDL for a new female Non-Hispanic White subject who is 60 years old, has an 85 cm waist circumference, and rates their overall health as Fair.

:::{.callout-note}
I can do this using either the `ols()` or the `lm()` fit to our model. Either is sufficient for Project A, and there's no need to show both approaches.
:::

::: {.panel-tabset}

#### Using the `lm()` fit

Here, I'll actually run two predictions, one for a Male and one for a Female subject with the same values of AGE, WAIST, RACE_ETH and SROH.

```{r}
new_subjects <- data.frame(AGE = c(60, 60), WAIST = c(85, 85), SEX = c("M", "F"),
                           RACE_ETH = c("NH_White", "NH_White"),
                           SROH = c("Fair", "Fair"))

preds1 <- predict.lm(mod_B, newdata = new_subjects,
                     interval = "prediction", level = 0.90)
exp(preds1)
```

Exponentiating these results yields the estimate and 90% prediction interval on the original scale of HDL cholesterol, as shown in the table below.

Predictor Values | Predicted HDL | 90% Prediction Interval
:-------------: | :-------------: | :-------------:
AGE = 60, WAIST = 85, SEX = M, RACE_ETH = NH_White, SROH = Fair | `r round_half_up(exp(preds1[1,])[1],2)` mg/dl | (`r round_half_up(exp(preds1[1,])[2],2)`, `r round_half_up(exp(preds1[1,])[3],2)`) mg/dl
AGE = 60, WAIST = 85, SEX = F, RACE_ETH = NH_White, SROH = Fair | `r round_half_up(exp(preds1[2,])[1],2)` mg/dl | (`r round_half_up(exp(preds1[2,])[2],2)`, `r round_half_up(exp(preds1[2,])[3],2)`) mg/dl

#### Using the `ols()` fit

Alternatively, I could have used our `ols()` fit. Here, I'll just run the prediction for the female subject.

```{r}
#| warning: false
new_subject <- tibble(AGE = 60, WAIST = 85, SEX = "F",
                      RACE_ETH = "NH_White", SROH = "Fair")

preds2 <- predict(mod_B_ols, newdata = new_subject,
                  conf.int = 0.90, conf.type = "individual")
preds2
```

Again, exponentiating these results yields the estimate and 90% prediction interval on the original scale of HDL cholesterol, as shown in the table below.

Predictor Values | Predicted HDL | 90% Prediction Interval
:-------------: | :-------------: | :-------------:
AGE = 60, WAIST = 85, SEX = F, RACE_ETH = NH_White, SROH = Fair | `r round_half_up(exp(preds2$linear.predictors),2)` mg/dl | (`r round_half_up(exp(preds2$lower),2)`, `r round_half_up(exp(preds2$upper),2)`) mg/dl

Naturally, the two fitting approaches (`lm()` and `ols()`) produce the same model, so they produce the same predictions.

:::
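That claim is easy to verify: on the log(HDL) scale the two predictions for the female subject should agree to machine precision. An optional sketch:

```{r}
# Optional sketch: confirm the lm() and ols() predictions match
# on the log(HDL) scale for the female subject
all.equal(as.numeric(preds1[2, "fit"]), as.numeric(preds2$linear.predictors))
```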
# Logistic Regression Analyses

## Missingness

Again, we'll assume missing values are MAR, and use the single imputation approach developed previously in @sec-singleimp.

## Model Y

We'll predict Pr(`HDL_RISK` = 1), the probability of having a low enough `HDL` to put the subject at risk, as a function of `AGE`, `WAIST`, `RACE_ETH` and `SROH`.

### Fitting Model Y

We'll fit our logistic regression model using both `glm()` and `lrm()`.

```{r}
mod_Y <- glm(HDL_RISK ~ AGE + WAIST + RACE_ETH + SROH,
             data = nh_demo_i, family = binomial())

ddd <- datadist(nh_demo_i)
options(datadist = "ddd")

mod_Y_lrm <- lrm(HDL_RISK ~ AGE + WAIST + RACE_ETH + SROH,
                 data = nh_demo_i, x = TRUE, y = TRUE)
```

### Coefficient Estimates

:::{.callout-note}
I'll show two different approaches (`model_parameters()` from the `easystats` ecosystem, and `tidy()` from the `broom` package) to summarize the coefficient estimates from `mod_Y`. Either is sufficient for Project A, and there's no need to show both approaches.
:::

::: {.panel-tabset}

#### Tidied Odds Ratio Estimates (Model Y)

```{r}
tidy(mod_Y, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.90) |>
  select(term, estimate, se = std.error, low90 = conf.low,
         high90 = conf.high, p = p.value) |>
  kable(digits = 3)
```

#### Model Y Parameters

```{r}
model_parameters(mod_Y, ci = 0.90, exponentiate = TRUE) |> print_md(digits = 3)
```

:::

### Model Y Effects

```{r}
#| warning: false
plot(summary(mod_Y_lrm, conf.int = 0.90))
summary(mod_Y_lrm, conf.int = 0.90) |> kable(digits = 3)
```

### Quality of Fit Summaries

:::{.callout-note}
Here, we have three available approaches to summarize the fit of Model Y, thanks to our `glm()` and `lrm()` fits of the same model. I'll show all three, but in practice (and in your Project A) I wouldn't include the `model_performance()` results.
:::

::: {.panel-tabset}

#### `lrm()` for Model Y

Our Nagelkerke $R^2$ estimate for Model Y is `r round_half_up(mod_Y_lrm$stats["R2"], 3)`, and our C statistic is estimated to be `r round_half_up(mod_Y_lrm$stats["C"], 3)`.

```{r}
mod_Y_lrm
```

#### `glance` at Model Y

The `glance()` results below give us our AIC and BIC values, as well as the degrees of freedom used by Model Y.

```{r}
glance(mod_Y) |>
  mutate(df = nobs - df.residual - 1) |>
  select(AIC, BIC, df, df.residual, nobs) |>
  kable(digits = 1)
```

#### Model Y Performance

:::{.callout-note}
These summaries aren't as appealing to me as those in the other two tabs here.
:::

```{r}
model_performance(mod_Y) |> print_md(digits = 2)
```

:::

### Confusion Matrix (Model Y)

First, we augment our `nh_demo_i` data to include predicted probabilities of (HDL_RISK = 1) from Model Y.

```{r}
resY_aug <- augment(mod_Y, type.predict = "response")
```

My prediction rule for this confusion matrix is that the fitted value of Pr(HDL_RISK = 1) needs to be greater than or equal to 0.5 for me to predict HDL_RISK is 1; otherwise I predict 0.

```{r}
cm_Y <- caret::confusionMatrix(
  data = factor(resY_aug$.fitted >= 0.5),
  reference = factor(resY_aug$HDL_RISK == 1),
  positive = "TRUE")

cm_Y
```

Here are our results, tabulated nicely.

Model | Classification Rule | Sensitivity | Specificity | Pos. Pred. Value
----: | :--------------: | :--------: | :--------: | :----------:
Y | Predicted Pr(HDL_RISK = 1) >= 0.5 | `r round_half_up(cm_Y$byClass["Sensitivity"], 3)` | `r round_half_up(cm_Y$byClass["Specificity"], 3)` | `r round_half_up(cm_Y$byClass["Pos Pred Value"], 3)`
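As an optional cross-check on the C statistic reported by `lrm()`, the area under the ROC curve built from these fitted probabilities should match it. A sketch, assuming the `pROC` package (not otherwise used in this project) is available:

```{r}
#| message: false
# Optional sketch: the AUC from the fitted probabilities should
# match the C statistic that lrm() reports for Model Y
library(pROC)
roc_Y <- roc(resY_aug$HDL_RISK, resY_aug$.fitted)
auc(roc_Y)
```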
## Non-Linearity and Spearman $\rho^2$ Plot

Here's the relevant Spearman $\rho^2$ plot.

```{r}
plot(spearman2(HDL_RISK ~ AGE + WAIST + SEX + RACE_ETH + SROH, data = nh_demo_i))
```

Our Spearman $\rho^2$ plot first suggests the use of a non-linear term in `WAIST`, so we'll add a restricted cubic spline in `WAIST` using 4 knots, which should add 2 degrees of freedom to our initial model. Next, the `SROH` variable seems to be a good choice, so we'll add an interaction between `SROH` and the main effect of `WAIST`, which will add four more degrees of freedom to our model Y.

## Model Z

As mentioned, our model Z will add 6 degrees of freedom to model Y, through two non-linear terms.

### Fitting Model Z

We'll fit Model Z with both `glm()` and `lrm()`.

```{r}
mod_Z <- glm(HDL_RISK ~ AGE + rcs(WAIST, 4) + RACE_ETH + SROH +
               WAIST %ia% SROH, data = nh_demo_i, family = binomial())

ddd <- datadist(nh_demo_i)
options(datadist = "ddd")

mod_Z_lrm <- lrm(HDL_RISK ~ AGE + rcs(WAIST, 4) + RACE_ETH + SROH +
                   WAIST %ia% SROH, data = nh_demo_i, x = TRUE, y = TRUE)
```

### Coefficient Estimates

:::{.callout-note}
I'll show two different approaches (`model_parameters()` from the `easystats` ecosystem, and `tidy()` from the `broom` package) to summarize the coefficient estimates from `mod_Z`. Either is sufficient for Project A, and there's no need to show both approaches. The `tidy()` approach handles the non-linear terms a bit better in my view, so I'd probably go with that.
:::

::: {.panel-tabset}

#### Tidied Odds Ratio Estimates (Model Z)

```{r}
tidy(mod_Z, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.90) |>
  select(term, estimate, se = std.error, low90 = conf.low,
         high90 = conf.high, p = p.value) |>
  kable(digits = 3)
```

#### Model Z Parameters

```{r}
model_parameters(mod_Z, ci = 0.90, exponentiate = TRUE) |> print_md(digits = 3)
```

:::

### Model Z Effects

```{r}
#| warning: false
plot(summary(mod_Z_lrm, conf.int = 0.90))
summary(mod_Z_lrm, conf.int = 0.90) |> kable(digits = 3)
```

### Quality of Fit Summaries

:::{.callout-note}
Again, we have three available approaches to summarize the fit of Model Z, thanks to our `glm()` and `lrm()` fits of the same model. I'll show all three, but in practice (and in your Project A) I wouldn't include the `model_performance()` results.
:::

::: {.panel-tabset}

#### `lrm()` for Model Z

Our Nagelkerke $R^2$ estimate for Model Z is `r round_half_up(mod_Z_lrm$stats["R2"], 3)`, and our C statistic is estimated to be `r round_half_up(mod_Z_lrm$stats["C"], 3)`.

```{r}
mod_Z_lrm
```

#### `glance` at Model Z

The `glance()` results below give us our AIC and BIC values, as well as the degrees of freedom used by Model Z.

```{r}
glance(mod_Z) |>
  mutate(df = nobs - df.residual - 1) |>
  select(AIC, BIC, df, df.residual, nobs) |>
  kable(digits = 1)
```

#### Model Z Performance

:::{.callout-note}
As mentioned previously, these summaries aren't as appealing to me as those in the other two tabs here.
:::

```{r}
model_performance(mod_Z) |> print_md(digits = 2)
```

:::

### Confusion Matrix (Model Z)

As in Model Y, my prediction rule for my Model Z confusion matrix is that the fitted value of Pr(HDL_RISK = 1) needs to be greater than or equal to 0.5 for me to predict HDL_RISK is 1; otherwise I predict 0. Again, we augment our `nh_demo_i` data to include predicted probabilities of (HDL_RISK = 1) from Model Z.

```{r}
resZ_aug <- augment(mod_Z, type.predict = "response")
```

Applying my prediction rule, we obtain:

```{r}
cm_Z <- caret::confusionMatrix(
  data = factor(resZ_aug$.fitted >= 0.5),
  reference = factor(resZ_aug$HDL_RISK == 1),
  positive = "TRUE")

cm_Z
```

Here are our results comparing classification performance by models Y and Z.

Model | Classification Rule | Sensitivity | Specificity | Pos. Pred. Value
----: | :--------------: | :--------: | :--------: | :----------:
Y | Predicted Pr(HDL_RISK = 1) >= 0.5 | `r round_half_up(cm_Y$byClass["Sensitivity"], 3)` | `r round_half_up(cm_Y$byClass["Specificity"], 3)` | `r round_half_up(cm_Y$byClass["Pos Pred Value"], 3)`
Z | Predicted Pr(HDL_RISK = 1) >= 0.5 | `r round_half_up(cm_Z$byClass["Sensitivity"], 3)` | `r round_half_up(cm_Z$byClass["Specificity"], 3)` | `r round_half_up(cm_Z$byClass["Pos Pred Value"], 3)`
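As an alternative to in-line coding the table above, the same rates can be gathered programmatically from the two `confusionMatrix()` objects. A sketch:

```{r}
# Sketch: collect the key classification rates for Models Y and Z
# from the caret confusionMatrix objects into one small table
tibble(Model = c("Y", "Z"),
       Sensitivity = c(cm_Y$byClass["Sensitivity"], cm_Z$byClass["Sensitivity"]),
       Specificity = c(cm_Y$byClass["Specificity"], cm_Z$byClass["Specificity"]),
       PPV = c(cm_Y$byClass["Pos Pred Value"], cm_Z$byClass["Pos Pred Value"])) |>
  kable(digits = 3)
```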
## Validating Models Y and Z

We will use the `validate()` function from the **rms** package to validate our `lrm` fits.

```{r}
set.seed(4323); (valY <- validate(mod_Y_lrm))
set.seed(4324); (valZ <- validate(mod_Z_lrm))
```

### Validated Nagelkerke $R^2$ and C, as well as IC statistics {#sec-vallog}

Note that `validate()` for an `lrm` fit reports Somers' $D_{xy}$ rather than C directly, so the validated C statistics below are obtained via $C = 0.5 + D_{xy}/2$.

Model | Validated $R^2$ | Validated C | AIC | BIC | df
-----: | :--------: | :--------: | :-----: | :-----: | :--:
Y | `r round_half_up(valY[2,5],4)` | `r round_half_up(0.5 + (0.5*valY[1,5]), 4)` | `r round_half_up(AIC(mod_Y),1)` | `r round_half_up(BIC(mod_Y),1)` | `r glance(mod_Y)$nobs - glance(mod_Y)$df.residual - 1`
Z | `r round_half_up(valZ[2,5],4)` | `r round_half_up(0.5 + (0.5*valZ[1,5]), 4)` | `r round_half_up(AIC(mod_Z),1)` | `r round_half_up(BIC(mod_Z),1)` | `r glance(mod_Z)$nobs - glance(mod_Z)$df.residual - 1`

## Final Logistic Regression Model

I prefer Model Y, because of its slightly better AIC, BIC and validated C statistic, despite the fact that Model Z has a slightly higher validated $R^2$ and a slightly higher positive predictive value. It's pretty close, though.

### Winning Model's Parameter Estimates

```{r}
mod_Y_lrm
```

### Winning Model Effect Sizes

```{r}
plot(summary(mod_Y_lrm, conf.int = 0.90))
```

### Numerical Description of Effect Sizes

```{r}
summary(mod_Y_lrm, conf.int = 0.90) |> kable(digits = 3)
```

### Effect Size Description(s)

In your work, we would only need to see one of the following effect size descriptions.

**WAIST** description: If we have two subjects of the same age, race/ethnicity and self-reported overall health, and subject 1 has a waist circumference of 92 cm while subject 2 has a waist circumference of 112.6 cm, then our model estimates that subject 2 will have 2.295 times the odds (90% CI: 1.93, 2.74) that subject 1 has of having an HDL that is at risk.

**AGE** description: If we have two subjects of the same waist circumference, race/ethnicity and self-reported overall health, and subject 1 is age 50 while subject 2 is age 66, our model predicts that subject 2 has 80.2% of the odds of an at-risk HDL that subject 1 has. The 90% CI for the odds ratio comparing subject 2 to subject 1 is (0.667, 0.963).

**SROH** description: If we have two subjects of the same age, waist circumference and race/ethnicity, and subject 1 reports Excellent overall health while subject 2 reports only Good overall health, our model predicts that the odds ratio comparing subject 1's odds of an at-risk HDL to subject 2's will be 0.596, with a 90% CI of (0.364, 0.977).
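As with the linear model, these contrasts come from `summary()` for the `lrm` fit, and the specific AGE contrast (50 vs. 66) can be requested directly; the Odds Ratio row of this optional sketch corresponds to the AGE description above:

```{r}
# Optional sketch: request the custom AGE contrast (50 vs. 66);
# the Odds Ratio row matches the AGE description above
summary(mod_Y_lrm, AGE = c(50, 66), conf.int = 0.90) |> kable(digits = 3)
```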
### Validated $R^2$ and $C$ statistic for Winning Model

As we saw in @sec-vallog,

- the validated $R^2$ statistic for Model Y is `r round_half_up(valY[2,5],4)`, and
- the validated $C$ statistic for Model Y is `r round_half_up(0.5 + (0.5*valY[1,5]), 4)`.

### Nomogram of Winning Model

```{r}
#| fig-height: 8
plot(nomogram(mod_Y_lrm, fun = plogis, funlabel = "Pr(HDL_RISK = 1)"))
```

### Predictions for Two New Subjects

I will create a predicted Pr(`HDL_RISK` = 1) for two new Non-Hispanic White subjects who are 60 years old, and who rate their overall health as Fair. The first will have a waist circumference of 80 cm, and the second will have a waist circumference of 100 cm.

:::{.callout-note}
I can do this using either the `glm()` or the `lrm()` fit to our model. Either is sufficient for Project A, and there's no need to show both approaches.
:::

::: {.panel-tabset}

#### Using the `glm()` fit

```{r}
new_subj <- data.frame(AGE = c(60, 60), WAIST = c(80, 100),
                       RACE_ETH = c("NH_White", "NH_White"),
                       SROH = c("Fair", "Fair"))

preds3 <- predict(mod_Y, newdata = new_subj, type = "response")
preds3
```

- Our prediction for subject 1's Pr(HDL_RISK = 1) is `r round_half_up(preds3[1], 3)`.
- Our prediction for subject 2's Pr(HDL_RISK = 1) is `r round_half_up(preds3[2], 3)`.

#### Using the `lrm()` fit

Alternatively, I could have used our `lrm()` fit.

```{r}
new_subj <- data.frame(AGE = c(60, 60), WAIST = c(80, 100),
                       RACE_ETH = c("NH_White", "NH_White"),
                       SROH = c("Fair", "Fair"))

preds4 <- predict(mod_Y_lrm, newdata = new_subj, type = "fitted")
preds4
```

As before, the two fitting approaches (`glm()` and `lrm()`) produce the same model, so they produce the same predictions.

:::

# Discussion

:::{.callout-note}
I am leaving the discussion up to you, but I have provided an outline of what you need to address below.
:::

## Answering My Research Questions

### Question 1 (with Answer)

I'll leave the job of restating and drawing conclusions about the research question under study in our linear regression analyses for you to answer.

### Question 2 (with Answer)

I'll leave the job of restating and drawing conclusions about the research question under study in our logistic regression analyses for you to answer.

## Thoughts on Project A

Note that we expect you to answer exactly two of the following four questions in your discussion.

### Question 1

> What was substantially harder or easier than you expected, and why?

If you chose to answer Question 1, your answer would go here.

### Question 2

> What do you wish you'd known at the start of this process that you know now, and why?

If you chose to answer Question 2, your answer would go here.

### Question 3

> What was the most confusing part of doing the project, and how did you get past it?

If you chose to answer Question 3, your answer would go here.

### Question 4

> What was the most useful thing you learned while doing the project, and why?

If you chose to answer Question 4, your answer would go here.

# Affirmation

I am certain that it is completely appropriate for these NHANES data to be shared with anyone, without any conditions. There are no concerns about privacy or security.

# References

1. NHANES 2015-16 data: <https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2015>
2. NHANES 2017-18 data: <https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2017>
3. Descriptions/definitions of "at risk" levels of HDL cholesterol, from the Mayo Clinic: <https://www.mayoclinic.org/diseases-conditions/high-blood-cholesterol/in-depth/hdl-cholesterol/art-20046388>

# Session Information

```{r}
xfun::session_info()
```