Predicting High-Density Lipoprotein Cholesterol Levels

432 Project A: A Partial Demonstration

Author

Thomas E. Love, Ph.D.

Published

2025-01-11

What is This?

This is a demonstration Project A for 432. It includes everything you must include in sections 1, 2, 4-9, and 11-13, although there are numerous places where a better project would say more about the results shown here than I have. I also have only performed single imputation, rather than multiple imputation, in this demonstration project, and I’ve only given some guidance regarding the discussion, leaving it up to you.

I decided to share this with students in the class in an effort to ensure that students could meet the minimum standards for the project (to obtain a solid B) more easily.

R Packages and Setup

Code
knitr::opts_chunk$set(comment = NA) 

library(janitor) 
library(naniar)

library(broom)
library(car)
library(caret)
library(GGally)
library(gt)
library(gtsummary)
library(knitr)
library(mice)
library(patchwork)
library(ROCR)
library(rsample)
library(rms)

library(easystats)
library(tidyverse) 

theme_set(theme_bw()) 

1 Data Source

These data come from the 2015-16 and 2017-18 administrations of the National Health and Nutrition Examination Survey from the Centers for Disease Control and Prevention.

  • The 2015-16 data are described in detail and accessed here
  • The 2017-18 data are described in detail and accessed here

In each year, we gathered data from the Demographics (DEMO) files (DEMO_I and DEMO_J, respectively), the Body Measures (BMX) files, the Cholesterol - High Density Lipoprotein (HDL) files, and the Current Health Status (HSQ) files.

2 The Subjects

The NHANES files contain complete data on our linear regression outcome (HDL cholesterol level) and our logistic regression outcome (sex-dependent cutoff for “at risk” HDL cholesterol: < 40 mg/dl for men and < 50 mg/dl for women) for a total of 6203 subjects between the ages of 40 and 79. Our tidied tibble (for analyses) consists of a random sample of 999 of those subjects.

3 Loading and Tidying the Data

Important

Here I provide only a summary of the steps I took. In your project, you’ll actually have to show your data management work.

In building this document, I used the nhanesA package for R, which is not currently available on CRAN. In addition, the repository of NHANES materials at the CDC website has changed in some ways that don’t allow me to use my original code. So, I have created a file called nh_demo.Rds which contains the tidied version of my data.

3.1 How the nh_demo file was built

I included the following items, from both 2017-18 and 2015-16 NHANES.

From the DEMO files: DEMO_J for 2017-18 and DEMO_I for 2015-16

  • SEQN = Subject identifying code
  • RIDSTATR = Interview/Examination Status (categorical)
    • 1 = Interview only
    • 2 = Interview and MEC examination (MEC = mobile examination center)
  • RIDAGEYR = Age in years (quantitative, topcoded at 80)
    • All subjects ages 80 and over are coded as 80
  • RIAGENDR = Sex (categorical)
    • 1 = Male, 2 = Female
  • RIDRETH3 = Race/Ethnicity (categorical)
    • 1 = Mexican-American,
    • 2 = Other Hispanic
    • 3 = Non-Hispanic White
    • 4 = Non-Hispanic Black
    • 6 = Non-Hispanic Asian
    • 7 = Other Race including Multi-Racial
    • Note that categories 1 and 2 are often combined, and sometimes we leave out category 7, or combine it with 6.

From the Body Measures (BMX) files: BMX_J for 2017-18 and BMX_I for 2015-16 (BMX is part of the Examination data)

  • BMXWAIST = Waist Circumference in cm (quantitative)

From the Cholesterol - High-Density Lipoprotein (HDL) files: HDL_J for 2017-18 and HDL_I for 2015-16 (HDL is part of the Lab data)

  • LBDHDD = Direct HDL cholesterol in mg/dl (quantitative)

From the Current Health Status (HSQ) file HSQ_J for 2017-18 and HSQ_I for 2015-16 (HSQ is part of the Questionnaire data)

  • HSD010 = Self-reported general health (categorical)
    • 1 = Excellent
    • 2 = Very good
    • 3 = Good
    • 4 = Fair
    • 5 = Poor
    • 7 = Refused
    • 9 = Don’t Know

3.2 Ingesting and Cleaning the NHANES data

I used functions from the nhanesA package to pull in the data from each necessary database using nhanes() and translated = FALSE, then used zap_label() to delete the labels, and converted to tibbles. Next, I used inner_join() to merge data within each NHANES cycle.

Next, I selected the 8 variables we need from the merged cycle-specific files, and converted each of the categorical variables to factors, except SEQN (the subject code). Then I added an indicator of the CYCLE (2015-16 or 2017-18) from which the data were drawn, and combined it all into one big tibble with full_join().

Then I decided to restrict our work here to adult subjects between the ages of 40 and 79, in part to avoid the fact that NHANES classifies everyone over age 80 as RIDAGEYR = 80. So that involved filtering to those with RIDAGEYR > 39 and RIDAGEYR < 80.
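The merging and filtering steps described above can be sketched with tiny made-up tibbles standing in for the real NHANES files. Only the variable names below are real; the subjects and values are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-ins for one cycle's DEMO and BMX files
demo_j <- tibble(SEQN = c("1", "2", "3"), RIDAGEYR = c(45, 82, 60))
bmx_j  <- tibble(SEQN = c("1", "2"),      BMXWAIST = c(101.3, 88.0))

# inner_join() merges data within a cycle, keeping subjects in both files
cycle_j <- inner_join(demo_j, bmx_j, by = "SEQN") |>
  mutate(CYCLE = "2017-18")

# A second (hypothetical) cycle; full_join() then combines the cycles
cycle_i <- tibble(SEQN = "4", RIDAGEYR = 55, BMXWAIST = 95.5, CYCLE = "2015-16")
combined <- full_join(cycle_j, cycle_i,
                      by = c("SEQN", "RIDAGEYR", "BMXWAIST", "CYCLE"))

# Restrict to adults ages 40-79 (NHANES topcodes everyone 80+ at 80)
combined <- combined |> filter(RIDAGEYR > 39, RIDAGEYR < 80)
```

In this toy example, the inner join keeps subjects 1 and 2, the full join stacks on subject 4, and the age filter drops subject 2 (age 82).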

We have three quantitative variables: age (RIDAGEYR), waist circumference (BMXWAIST) and HDL Cholesterol (LBDHDD), which I renamed to AGE, WAIST and HDL, respectively. The ranges (minimum and maximum) for Age, Waist Circumference and HDL Cholesterol all seem fairly plausible to me. Perhaps you know better, and please do use that knowledge. We do have several hundred missing values in the waist circumference and HDL cholesterol values that we will need to deal with after we sample the data.

In terms of categorical variables, I started by checking to see that everyone has RIDSTATR status 2, meaning they completed both the NHANES questionnaire and the NHANES examination. Since they do, I then made sure our CYCLE variable works to indicate the NHANES reporting CYCLE for each subject by running tabyl(CYCLE, RIDSTATR).

Next, I changed RIAGENDR to SEX with values “M” and “F” as a factor. Another option would have been to use a 1/0 numeric variable to represent, for example FEMALE = 1 if the subject was reported as female and 0 if the subject was reported as male.

Next, I created RACE_ETH to replace RIDRETH3.

  • I recoded the levels of the RIDRETH3 variable to use short, understandable group names, using mutate() and fct_recode(), and
  • Collapsed the first two categories (Mexican-American and Other Hispanic) into a single category, using fct_recode(), and
  • Changed the variable name to RACE_ETH using mutate() and
  • Sorted the resulting factor in order of their counts using fct_infreq().
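A minimal sketch of these recoding steps, using an invented factor built from the RIDRETH3 codes listed in Section 3.1:

```r
library(dplyr)
library(forcats)

# Made-up example data using the RIDRETH3 coding scheme
dat <- tibble(RIDRETH3 = factor(c(1, 2, 3, 3, 4, 6, 7, 3)))

dat <- dat |>
  mutate(RACE_ETH = fct_recode(RIDRETH3,
           Hispanic = "1",   # Mexican-American
           Hispanic = "2",   # Other Hispanic (collapsed with "1")
           NH_White = "3",
           NH_Black = "4",
           NH_Asian = "6",
           Other    = "7"),
         RACE_ETH = fct_infreq(RACE_ETH))  # order levels by frequency
```

Assigning two old levels to the same new name in fct_recode() collapses them, and fct_infreq() then puts the most common level first.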

The most common RACE_ETH turns out to be Non-Hispanic White, followed by Hispanic, then Non-Hispanic Black, Non-Hispanic Asian, and finally Other Race (including Multi-Racial). We won’t collapse any further for now.

Next, I changed HSD010 to SROH, as follows.

  • Again, I recoded the levels of this categorical variable with fct_recode()
  • I also renamed the variable to SROH (an abbreviation for self-reported overall health) with mutate(). Always explain your abbreviations.
  • I also converted the values 7 (Refused) and 9 (Don’t Know) to missing values with na_if(), then I used droplevels() to drop all of the now-unused levels in the factors I’m using.
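A minimal sketch of this recoding, starting (for simplicity) from the numeric HSD010 codes rather than a factor; the data here are invented:

```r
library(dplyr)
library(forcats)

# Made-up example data using the HSD010 coding scheme
dat <- tibble(HSD010 = c(1, 2, 3, 4, 5, 7, 9, 3))

dat <- dat |>
  mutate(SROH = na_if(HSD010, 7),   # 7 = Refused -> NA
         SROH = na_if(SROH, 9),     # 9 = Don't Know -> NA
         SROH = factor(SROH, levels = 1:5,
                       labels = c("Excellent", "Very_Good", "Good",
                                  "Fair", "Poor")) |>
           droplevels())            # drop unused levels (a no-op here,
                                    # since all five appear in the data)
```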

3.3 Our outcomes

Our outcomes are each based on HDL cholesterol level. So I filtered the data for complete cases on that variable.

Next, I created a binary outcome for logistic regression, which is an indicator (taking the values 0 and 1) for an HDL cholesterol value that is “at risk” according to these standards published on the Mayo Clinic site, specifically, an adult male is considered to be at risk if their HDL < 40 mg/dl and an adult female is considered to be at risk if their HDL < 50 mg/dl.

Since we have complete data in nh_demo_full on HDL and SEX, I used that data to create the new variable HDL_RISK as shown below, then checked to see that this had worked properly:

Code
## this code chunk is not evaluated here - just showing you what I would do

nh_demo_full <- nh_demo_full |>
  mutate(HDL_RISK = case_when( 
    SEX == "M" & HDL < 40 ~ 1,
    SEX == "F" & HDL < 50 ~ 1,
    .default = 0))

## check to see if this worked properly
nh_demo_full |> group_by(HDL_RISK, SEX) |> 
  summarise(n = n(), min(HDL), max(HDL))

3.4 Arranging the Tibble and Sampling the Data

  • We don’t need RIDSTATR any more because we’ve checked and seen that all its values are 2, as needed.
  • We’ve renamed some variables and replaced others with better versions.

Finally, I selected a random sample of 999 observations from our working file (the nh_demo_full tibble) for use in our subsequent analyses; this sample is called nh_demo. I have provided you with the nh_demo.Rds file, which I’m loading below.
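Here is a sketch of how such a reproducible sample could be drawn. Both the seed and the toy stand-in for nh_demo_full below are illustrative only, not the ones actually used.

```r
library(dplyr)

set.seed(2025)  # hypothetical seed; the actual seed is not shown here

# Toy stand-in for the full working file of 6203 eligible subjects
nh_demo_full_toy <- tibble(SEQN = as.character(1:6203),
                           HDL  = rnorm(6203, mean = 54, sd = 15))

# Draw a random sample of 999 subjects without replacement
nh_demo_toy <- slice_sample(nh_demo_full_toy, n = 999)
```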

Code
nh_demo <- read_rds("data/nh_demo.Rds")

4 The Tidy Tibble

4.1 Listing the Tibble

Note

If I ask you to print/list a tibble, part of the reason is to ensure that it actually is a tibble. A tibble will, as below, only print the first 10 rows.

Code
nh_demo
# A tibble: 999 × 9
   SEQN    HDL HDL_RISK   AGE RACE_ETH WAIST SROH      SEX   CYCLE  
   <chr> <dbl>    <dbl> <dbl> <fct>    <dbl> <fct>     <chr> <chr>  
 1 84842    48        0    50 Hispanic 101.  Good      M     2015-16
 2 92202   111        0    65 NH_Black  77   Fair      F     2015-16
 3 93010    46        1    63 NH_White  NA   Poor      F     2015-16
 4 93974    39        1    59 NH_White 111.  Good      M     2017-18
 5 96205    56        0    52 NH_Black  98   Good      M     2017-18
 6 95571    42        1    54 NH_White 104.  Good      F     2017-18
 7 99969    54        0    69 NH_Black 115.  Fair      M     2017-18
 8 89494    36        1    42 NH_Asian  98.1 <NA>      F     2015-16
 9 83887   101        0    51 NH_Black  74   Excellent F     2015-16
10 96908    48        1    73 Other    102   Good      F     2017-18
# ℹ 989 more rows

4.2 Size and Identifiers

Our analytic tibble, called nh_demo, has 999 rows (observations) on 9 columns (variables). Our identifier variable is SEQN, and we have a unique SEQN for each row in our data set, as the code below demonstrates.

Code
identical(nrow(nh_demo), n_distinct(nh_demo$SEQN))
[1] TRUE

4.3 Save The Tibble

I’ve already done this so I won’t do it again, but this is the code I would use to save the file.

Note

Note that I have set this code chunk to eval: false so that it will not actually do anything. I’m just showing you what I would do to save the Rds file to the data sub-directory of our R project. You should not use eval: false anywhere in your Project A.

Code
write_rds(nh_demo, "data/nh_demo.Rds")

5 The Code Book

I’ve chosen here to build a rather detailed codebook containing useful information describing the analytic sample, nh_demo. I’ve used numerous tools to help give a variety of summaries which may be helpful.

5.1 Five Key Statements

I suggest you begin your Project A codebook with five statements (like the ones shown below) to fix ideas about the sample size, the missingness, the outcomes and a statement about the roles for the other variables.

Note

If you look at the Quarto code that generated this document, you will see that I have used in-line coding to fill in the counts of subjects in statements 1 and 2. You should, too.

  1. Sample Size The data in our complete nh_demo sample consist of 999 subjects from NHANES 2015-16 and NHANES 2017-18 between the ages of 40 and 79 in whom our outcome variables (HDL and HDL_RISK) were measured.
  2. Missingness Of the 999 subjects, 912 have complete data on all variables listed below.
  3. Our outcome variables are HDL, which is the subject’s serum HDL cholesterol measured in mg/dl for the linear regression, and HDL_RISK, which indicates whether or not the subject’s HDL cholesterol is at risk (too low) given their sex, for the logistic regression. There are no missing data in either of these outcomes.
  4. Candidate predictors for my models include AGE, WAIST, RACE_ETH and SROH, (for both models) as well as SEX for the linear model.
  5. The other variables contained in my tidy tibble are SEQN which is the subject identifying code, and CYCLE which identifies whether this subject was part of the 2015-16 or 2017-18 NHANES reporting cycle.

5.2 Describing the Variables (5 Tabs)

Variables included in the nh_demo data are summarized and described in five different ways in the panel below. Click on the tab that interests you to see results.

In addition to the brief descriptions (including units of measurement, where appropriate) in the table below, the original NHANES variables from which these were derived are described in Section 3.1.

Variable Description
SEQN NHANES subject identifying code
HDL HDL cholesterol in mg/dl (outcome for our linear model)
HDL_RISK Does HDL put subject at risk? 1 (yes) if HDL < 40 for males or if HDL < 50 for females, otherwise 0 (outcome for logistic model)
AGE Age in years, restricted to 40-79 here
RACE_ETH Race-Ethnicity: five categories (NH_white, NH_Black, NH_Asian, Hispanic, Other)
WAIST Waist circumference, in cm
SROH Self-Reported Overall Health: five categories (Excellent, Very_Good, Good, Fair, Poor)
SEX Biological Sex: either F or M
CYCLE NHANES reporting cycle: either 2015-16 or 2017-18
Note

The data_codebook() function comes from the datawizard package, which is part of the easystats ecosystem.

Code
data_codebook(nh_demo)
nh_demo (999 rows and 9 variables, 9 shown)

ID | Name     | Type        |  Missings |        Values |           N
---+----------+-------------+-----------+---------------+------------
1  | SEQN     | character   |  0 (0.0%) |        100012 |   1 ( 0.1%)
   |          |             |           |        100067 |   1 ( 0.1%)
   |          |             |           |        100112 |   1 ( 0.1%)
   |          |             |           |        100142 |   1 ( 0.1%)
   |          |             |           |        100145 |   1 ( 0.1%)
   |          |             |           |        100150 |   1 ( 0.1%)
   |          |             |           |        100167 |   1 ( 0.1%)
   |          |             |           |        100178 |   1 ( 0.1%)
   |          |             |           |        100191 |   1 ( 0.1%)
   |          |             |           |        100210 |   1 ( 0.1%)
   |          |             |           |         (...) |            
---+----------+-------------+-----------+---------------+------------
2  | HDL      | numeric     |  0 (0.0%) |     [22, 149] |         999
---+----------+-------------+-----------+---------------+------------
3  | HDL_RISK | numeric     |  0 (0.0%) |             0 | 666 (66.7%)
   |          |             |           |             1 | 333 (33.3%)
---+----------+-------------+-----------+---------------+------------
4  | AGE      | numeric     |  0 (0.0%) |      [40, 79] |         999
---+----------+-------------+-----------+---------------+------------
5  | RACE_ETH | categorical |  0 (0.0%) |      NH_White | 333 (33.3%)
   |          |             |           |      Hispanic | 255 (25.5%)
   |          |             |           |      NH_Black | 229 (22.9%)
   |          |             |           |      NH_Asian | 145 (14.5%)
   |          |             |           |         Other |  37 ( 3.7%)
---+----------+-------------+-----------+---------------+------------
6  | WAIST    | numeric     | 44 (4.4%) | [66.5, 159.2] |         955
---+----------+-------------+-----------+---------------+------------
7  | SROH     | categorical | 56 (5.6%) |     Excellent |  84 ( 8.9%)
   |          |             |           |     Very_Good | 223 (23.6%)
   |          |             |           |          Good | 369 (39.1%)
   |          |             |           |          Fair | 231 (24.5%)
   |          |             |           |          Poor |  36 ( 3.8%)
---+----------+-------------+-----------+---------------+------------
8  | SEX      | character   |  0 (0.0%) |             F | 540 (54.1%)
   |          |             |           |             M | 459 (45.9%)
---+----------+-------------+-----------+---------------+------------
9  | CYCLE    | character   |  0 (0.0%) |       2015-16 | 475 (47.5%)
   |          |             |           |       2017-18 | 524 (52.5%)
---------------------------------------------------------------------
Note

The describe() function (and the html() function) used here come from the Hmisc package, developed by Frank Harrell and his colleagues.

Code
describe(nh_demo) |> html()
nh_demo Descriptives

9 Variables      999 Observations

SEQN
        n  missing distinct 
      999        0      999 
lowest : 100012 100067 100112 100142 100145 , highest: 99937 99945 99966 99969 99997

HDL
        n  missing distinct     Info     Mean  pMedian      Gmd      .05      .10 
      999        0       86    0.999     53.6     52.5     18.3       31       35 
      .25      .50      .75      .90      .95 
       41       51       63       76       85 
lowest : 22 23 24 25 26 , highest: 108 111 112 137 149

HDL_RISK
        n  missing distinct     Info      Sum     Mean 
      999        0        2    0.667      333   0.3333 

AGE
        n  missing distinct     Info     Mean  pMedian      Gmd      .05      .10 
      999        0       40    0.999    58.01       58    12.03       42       44 
      .25      .50      .75      .90      .95 
       50       58       66       73       75 
lowest : 40 41 42 43 44 , highest: 75 76 77 78 79

RACE_ETH
        n  missing distinct 
      999        0        5 
 Value      NH_White Hispanic NH_Black NH_Asian    Other
 Frequency       333      255      229      145       37
 Proportion    0.333    0.255    0.229    0.145    0.037 

WAIST
        n  missing distinct     Info     Mean  pMedian      Gmd      .05      .10 
      955       44      482        1    102.7    102.1    18.05    77.27    82.40 
      .25      .50      .75      .90      .95 
    92.05   101.20   112.60   123.50   131.65  
lowest : 66.5 67.9 68.7 69 70.3 , highest: 150.5 151 151.3 153.5 159.2

SROH
        n  missing distinct 
      943       56        5 
 Value      Excellent Very_Good      Good      Fair      Poor
 Frequency         84       223       369       231        36
 Proportion     0.089     0.236     0.391     0.245     0.038 

SEX
        n  missing distinct 
      999        0        2 
 Value          F     M
 Frequency    540   459
 Proportion 0.541 0.459 

CYCLE
        n  missing distinct 
      999        0        2 
 Value      2015-16 2017-18
 Frequency      475     524
 Proportion   0.475   0.525 

Note

The tbl_summary() function used here comes from the gtsummary package. Note that missing values are listed as “Unknown” by default.

Code
tbl_summary(select(nh_demo, -SEQN),
        label = list(
            HDL = "HDL (HDL Cholesterol in mg/dl)",
            HDL_RISK = "HDL_RISK (HDL < 40 if Male, < 50 if Female)?",
            AGE = "AGE (in years)",
            WAIST = "WAIST (circumference in cm)",
            CYCLE = "CYCLE (NHANES reporting)",
            RACE_ETH = "RACE_ETH (Race/Ethnicity)",
            SROH = "SROH (Self-Reported Overall Health)"),
        stat = list( all_continuous() ~ 
                "{median} [{min} to {max}]" ))
Characteristic N = 9991
HDL (HDL Cholesterol in mg/dl) 51 [22 to 149]
HDL_RISK (HDL < 40 if Male, < 50 if Female)? 333 (33%)
AGE (in years) 58 [40 to 79]
RACE_ETH (Race/Ethnicity)
    NH_White 333 (33%)
    Hispanic 255 (26%)
    NH_Black 229 (23%)
    NH_Asian 145 (15%)
    Other 37 (3.7%)
WAIST (circumference in cm) 101 [67 to 159]
    Unknown 44
SROH (Self-Reported Overall Health)
    Excellent 84 (8.9%)
    Very_Good 223 (24%)
    Good 369 (39%)
    Fair 231 (24%)
    Poor 36 (3.8%)
    Unknown 56
SEX
    F 540 (54%)
    M 459 (46%)
CYCLE (NHANES reporting)
    2015-16 475 (48%)
    2017-18 524 (52%)
1 Median [Min to Max]; n (%)
Note

These summaries come from the naniar package.

Code
gg_miss_var(nh_demo)

Code
miss_var_summary(nh_demo)
# A tibble: 9 × 3
  variable n_miss pct_miss
  <chr>     <int>    <num>
1 SROH         56     5.61
2 WAIST        44     4.40
3 SEQN          0     0   
4 HDL           0     0   
5 HDL_RISK      0     0   
6 AGE           0     0   
7 RACE_ETH      0     0   
8 SEX           0     0   
9 CYCLE         0     0   
Code
miss_case_table(nh_demo)
# A tibble: 3 × 3
  n_miss_in_case n_cases pct_cases
           <int>   <int>     <dbl>
1              0     912     91.3 
2              1      74      7.41
3              2      13      1.30

6 Linear Regression Plans

6.1 My First Research Question

How effectively can we predict HDL cholesterol levels using age, sex, race/ethnicity, waist circumference and self-reported overall health, in a sample of 999 NHANES participants ages 40-79?

6.2 My Quantitative Outcome

  • My quantitative outcome is HDL, and I am interested in predicting this value using several more easily gathered characteristics of a subject.
  • I have complete HDL data for all 999 subjects in my nh_demo tibble.
  • My HDL data includes 86 different values, all measured in mg/dl.
  • The distribution of HDL across the 999 subjects in my nh_demo data (shown in the Figure below) is a bit right skewed, with a median of 51, and ranging from 22 mg/dl to 149 mg/dl.
Code
p1 <- ggplot(nh_demo, aes(sample = HDL)) +
  geom_qq(col = "navy") + geom_qq_line(col = "red") + 
  labs(title = "Normal Q-Q plot of HDL", x = "",
       y = "HDL Cholesterol Level (mg/dl)")

p2 <- ggplot(nh_demo, aes(x = HDL)) +
  geom_histogram(binwidth = 5, col = "white", fill = "navy") +
  labs(title = "Histogram of HDL", x = "HDL Cholesterol Level (mg/dl)")

p1 + p2

6.3 My Planned Predictors (Linear Model)

The predictors I intend to use in my linear model are AGE and WAIST, which are quantitative, SEX, which is a binary categorical variable, and RACE_ETH and SROH, each of which are 5-category variables treated as factors in my nh_demo tibble.

  • AGE has 40 distinct values, and is measured in years.
  • WAIST has 482 distinct values, and is measured in centimeters.
  • SEX has two levels, with 540 female and 459 male subjects.
  • RACE_ETH’s five categories each have at least 30 observations in each level (actually they each have 37 or more), and
  • SROH’s five categories also have at least 30 observations (actually 36 or more) in each level, as we can see in the table below. SROH also has 56 missing values.
Code
nh_demo |> tabyl(RACE_ETH, SROH) |> adorn_totals(where = c("row", "col"))
 RACE_ETH Excellent Very_Good Good Fair Poor NA_ Total
 NH_White        25       101  121   62   14  10   333
 Hispanic        19        38   87   90    8  13   255
 NH_Black        13        46   95   58    7  10   229
 NH_Asian        25        32   48   16    3  21   145
    Other         2         6   18    5    4   2    37
    Total        84       223  369  231   36  56   999

These five predictors are fewer than the allowed maximum of \(4 + (999 - 100)/100 = 12.99\), rounding down to 12 predictors as prescribed by the Project A instructions, given our 999 observations in nh_demo.
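This limit, and the corresponding one for the logistic model in Section 7.3, comes from the 4 + (N - 100)/100 rule, which we can compute directly. The helper function name here is my own, not part of any package.

```r
# Maximum number of candidate predictors permitted for a sample of size N,
# per the 4 + (N - 100)/100 rule, rounded down.
max_preds <- function(N) floor(4 + (N - 100) / 100)

max_preds(999)  # linear model: N = all 999 subjects -> 12 predictors
max_preds(333)  # logistic model: N = smaller outcome group -> 6 predictors
```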

6.3.1 Anticipated Direction of Effects

I expect higher HDL to be associated with lower age, with lower waist circumference, with being female, with reporting better self-reported overall health, and with non-Hispanic White race/ethnicity.

7 Logistic Regression Plans

7.1 My Second Research Question

How effectively can we predict whether or not a subject’s HDL cholesterol level is at risk (too low) using age, race/ethnicity, waist circumference and self-reported overall health, in a sample of 999 NHANES participants ages 40-79?

7.2 My Binary Outcome

  • My binary outcome is HDL_RISK, and I am interested in predicting this value using several more easily gathered characteristics of a subject.
  • I have complete HDL_RISK data for all 999 subjects in my nh_demo tibble.
Code
nh_demo |> tabyl(HDL_RISK)
 HDL_RISK   n   percent
        0 666 0.6666667
        1 333 0.3333333

7.3 My Planned Predictors (Logistic Model)

I am using four of the five predictors I used in my linear model, specifically age, waist circumference, self-reported overall health and race/ethnicity.

  • Note that I am not including SEX in this model (although I did in my linear model) because my binary outcome variable’s cutoff values are stratified by SEX.

I have 333 observations in my smaller outcome group (those at risk), and so I am permitted up to \(4 + (333-100)/100 = 6.33\), rounded down to 6 predictors, so I’m within the acceptable limit prescribed by the Project A instructions.

7.3.1 Anticipated Direction of Effects

I expect higher rates of HDL risk to be associated with higher age, with higher waist circumference, with reporting worse self-reported overall health, and with Hispanic or non-Hispanic Black ethnicity.

8 Linear Regression Analyses

8.1 Missingness

I will assume missing values are missing at random (MAR) in this work, and use single imputation for my analyses.

Here’s a table of missingness before imputation. I have missing data on two of my predictors, SROH and WAIST.

Code
miss_var_summary(nh_demo) |> filter(n_miss > 0)
# A tibble: 2 × 3
  variable n_miss pct_miss
  <chr>     <int>    <num>
1 SROH         56     5.61
2 WAIST        44     4.40

8.1.1 Single Imputation Approach

I’ll use the mice package to do my single imputation (m = 1), with a seed set to 432432.

Code
nh_demo_i <- 
  mice(nh_demo, m = 1, seed = 432432, print = FALSE) |>
  complete() |>
  tibble()
Warning: Number of logged events: 3
Code
n_miss(nh_demo_i)
[1] 0

We needn’t fear logged events in this context.

8.2 Outcome Transformation

Let’s look at the Box-Cox results from the car package.

Code
mod_temp <- lm(HDL ~ AGE + SEX + RACE_ETH + WAIST + SROH, data = nh_demo_i)

boxCox(mod_temp)

The Box-Cox plot suggests that we take the logarithm of HDL before fitting our model, and so we shall. Here’s a plot of the distribution of the natural logarithm of HDL.

Code
nh_demo_i <- nh_demo_i |>
  mutate(logHDL = log(HDL))

p1 <- ggplot(nh_demo_i, aes(sample = logHDL)) +
  geom_qq(col = "navy") + geom_qq_line(col = "red") + 
  labs(title = "Normal Q-Q plot of log(HDL)", x = "",
       y = "Log of HDL Cholesterol Level (mg/dl)")

p2 <- ggplot(nh_demo_i, aes(x = logHDL)) +
  geom_histogram(bins = 20, col = "white", fill = "navy") +
  labs(title = "Histogram of log(HDL)", x = "Log of HDL Cholesterol Level (mg/dl)")

p1 + p2

It does seem that this transformed outcome follows something closer to a Normal distribution.
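If you want a numeric check on the Box-Cox suggestion, car::powerTransform() returns the estimated \(\lambda\); values near 0 support a log transformation. Here is a self-contained illustration on simulated (not NHANES) data:

```r
library(car)

set.seed(432)
x <- rnorm(500)
y <- exp(4 + 0.1 * x + rnorm(500, 0, 0.3))  # a log-normal outcome

# Estimated Box-Cox lambda for the model y ~ x;
# an estimate close to 0 points toward log(y)
lam <- coef(powerTransform(lm(y ~ x)))
round(lam, 2)  # close to 0 here, by construction
```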

8.3 Scatterplot Matrix and Collinearity

Here’s a scatterplot matrix of our outcome and predictors after transformation and imputation.

Code
ggpairs(nh_demo_i, columns = c("AGE", "WAIST", "SEX", 
                               "RACE_ETH", "SROH", "logHDL"))

To check collinearity in more detail, we can estimate variance inflation factors for our predictors.

Code
mod_A <- lm(logHDL ~ AGE + SEX + RACE_ETH + WAIST + SROH, data = nh_demo_i)

car::vif(mod_A)
             GVIF Df GVIF^(1/(2*Df))
AGE      1.024684  1        1.012267
SEX      1.025596  1        1.012717
RACE_ETH 1.248667  4        1.028148
WAIST    1.233811  1        1.110770
SROH     1.182393  4        1.021163

We don’t have any large generalized VIF results here: every GVIF^(1/(2*Df)) value is close to 1, well below common thresholds for concern, so we have no meaningful worries regarding collinearity.

8.4 Model A

8.4.1 Fitting Model A

Here is the fit using the lm() function.

Code
mod_A <- lm(logHDL ~ AGE + SEX + RACE_ETH + WAIST + SROH, data = nh_demo_i)

We’ll also fit model A with the ols() function from the rms package, even though we won’t actually use this fit for a while.

Code
dd <- datadist(nh_demo_i)
options(datadist = "dd")

mod_A_ols <- ols(logHDL ~ AGE + SEX + RACE_ETH + WAIST + SROH,
                 data = nh_demo_i, x = TRUE, y = TRUE)

8.4.2 Coefficient Estimates

Note

I’ll show two different approaches (model_parameters() from the easystats ecosystem, and tidy() from the broom package) to summarize the coefficient estimates from mod_A. Either is sufficient for Project A, and there’s no need to show both approaches.

Code
model_parameters(mod_A, ci = 0.90) |> print_md(digits = 3)
Parameter Coefficient SE 90% CI t(987) p
(Intercept) 4.616 0.078 (4.489, 4.744) 59.436 < .001
AGE 0.003 7.938e-04 (0.001, 0.004) 3.510 < .001
SEX (M) -0.166 0.017 (-0.194, -0.139) -10.010 < .001
RACE ETH (Hispanic) -0.042 0.022 (-0.078, -0.005) -1.873 0.061
RACE ETH (NH_Black) 0.090 0.022 (0.053, 0.127) 4.025 < .001
RACE ETH (NH_Asian) -0.068 0.028 (-0.114, -0.023) -2.479 0.013
RACE ETH (Other) 0.013 0.045 (-0.062, 0.088) 0.289 0.773
WAIST -0.007 5.665e-04 (-0.008, -0.006) -12.257 < .001
SROH (Very_Good) -0.008 0.032 (-0.061, 0.045) -0.243 0.808
SROH (Good) -0.074 0.031 (-0.125, -0.023) -2.392 0.017
SROH (Fair) -0.088 0.033 (-0.143, -0.033) -2.640 0.008
SROH (Poor) -0.023 0.050 (-0.106, 0.059) -0.466 0.642
Code
tidy(mod_A, conf.int = TRUE, conf.level = 0.90) |>
  select(term, estimate, se = std.error, low90 = conf.low, 
         high90 = conf.high, p = p.value) |>
  kable(digits = 3)
term estimate se low90 high90 p
(Intercept) 4.616 0.078 4.489 4.744 0.000
AGE 0.003 0.001 0.001 0.004 0.000
SEXM -0.166 0.017 -0.194 -0.139 0.000
RACE_ETHHispanic -0.042 0.022 -0.078 -0.005 0.061
RACE_ETHNH_Black 0.090 0.022 0.053 0.127 0.000
RACE_ETHNH_Asian -0.068 0.028 -0.114 -0.023 0.013
RACE_ETHOther 0.013 0.045 -0.062 0.088 0.773
WAIST -0.007 0.001 -0.008 -0.006 0.000
SROHVery_Good -0.008 0.032 -0.061 0.045 0.808
SROHGood -0.074 0.031 -0.125 -0.023 0.017
SROHFair -0.088 0.033 -0.143 -0.033 0.008
SROHPoor -0.023 0.050 -0.106 0.059 0.642

8.4.3 Model A Effects

Since we’ve built the mod_A_ols model using the rms package, it can be helpful to look at the effects using its plot and associated table, as shown below.

Note

If you look at the Quarto code, you’ll see that I’ve used warning: false in this code chunk to suppress some un-interesting warnings here, and in several other code chunks in this document. I encourage you to do the same in your Project A, for the code chunks that I’ve used warning: false in this demo.

  • Don’t use warning: false otherwise in Project A, though.
Code
plot(summary(mod_A_ols, conf.int = 0.90))

Code
summary(mod_A_ols, conf.int = 0.90) |> kable(digits = 3)
Low High Diff. Effect S.E. Lower 0.9 Upper 0.9 Type
AGE 50 66.0 16.0 0.045 0.013 0.024 0.065 1
WAIST 92 112.6 20.6 -0.143 0.012 -0.162 -0.124 1
SEX - M:F 1 2.0 NA -0.166 0.017 -0.194 -0.139 1
RACE_ETH - Hispanic:NH_White 1 2.0 NA -0.042 0.022 -0.078 -0.005 1
RACE_ETH - NH_Black:NH_White 1 3.0 NA 0.090 0.022 0.053 0.127 1
RACE_ETH - NH_Asian:NH_White 1 4.0 NA -0.068 0.028 -0.114 -0.023 1
RACE_ETH - Other:NH_White 1 5.0 NA 0.013 0.045 -0.062 0.088 1
SROH - Excellent:Good 3 1.0 NA 0.074 0.031 0.023 0.125 1
SROH - Very_Good:Good 3 2.0 NA 0.066 0.022 0.030 0.102 1
SROH - Fair:Good 3 4.0 NA -0.014 0.022 -0.050 0.021 1
SROH - Poor:Good 3 5.0 NA 0.050 0.043 -0.021 0.122 1

8.4.4 Quality of Fit Summaries

Note

I’ll show two different approaches (model_performance() from the easystats ecosystem, and glance() from the broom package) to summarize the quality of fit measures describing mod_A. I actually like being able to see both of these approaches, since each provides some unique information, so I would encourage you, too, to include both in your Project A.

Code
model_performance(mod_A) |> print_md(digits = 3)
Indices of model performance
AIC AICc BIC R2 R2 (adj.) RMSE Sigma
146.355 146.724 210.143 0.273 0.265 0.257 0.259

In our Model A, we use 11 degrees of freedom, and obtain an \(R^2\) value of 0.273.

Code
glance(mod_A) |>
  select(r2 = r.squared, adjr2 = adj.r.squared, sigma, 
         AIC, BIC, nobs, df, df.residual) |>
  kable(digits = c(3, 3, 2, 1, 1, 0, 0, 0))
r2 adjr2 sigma AIC BIC nobs df df.residual
0.273 0.265 0.26 146.4 210.1 999 11 987

8.4.5 Regression Diagnostics for Model A

Note

Note in the Quarto code for the next chunk that I have set the fig-height to 9, so that the plots are easier to read. Please do that, too, whenever you use check_model() from the easystats ecosystem.

  • I also prefer that you run check_model() with detrend = FALSE for the Normal Q-Q plot of residuals, as done here.
Code
check_model(mod_A, detrend = FALSE)

For the most part, these plots look very reasonable. I see no clear problems with the assumptions of linearity, normality or constant variance evident in any of these results.

The main issue is the posterior predictive check, where our predictions miss a bit near the center of the distribution, with more predicted values of log(HDL) in the 3.75 to 4.25 range than we see in the original data.

8.5 Non-Linearity and Spearman \(\rho^2\)

Here’s the relevant Spearman \(\rho^2\) plot, as a place to look for sensible places to consider a non-linear term or terms.

Code
plot(spearman2(logHDL ~ AGE + WAIST + SEX + RACE_ETH + SROH, 
               data = nh_demo_i))

Our Spearman \(\rho^2\) plot first suggests the use of a non-linear term in WAIST, so we’ll add a restricted cubic spline in WAIST using 5 knots, which should add 3 degrees of freedom to our initial model.

Next, the plot suggests the SEX variable, so we'll add an interaction between SEX and the main effect of WAIST, which will add one more degree of freedom to our Model A.

8.6 Model B

Our Model B will add two non-linear terms, for a total of 4 additional degrees of freedom, to our Model A.

8.6.1 Fitting Model B

Code
mod_B <- lm(logHDL ~ AGE + rcs(WAIST, 5) + SEX + 
              WAIST %ia% SEX + RACE_ETH + SROH,
            data = nh_demo_i)

We’ll also fit model B with the ols() function from the rms package.

Code
dd <- datadist(nh_demo_i)
options(datadist = "dd")

mod_B_ols <- ols(logHDL ~ AGE + rcs(WAIST, 5) + SEX + 
              WAIST %ia% SEX + RACE_ETH + SROH,
            data = nh_demo_i, x = TRUE, y = TRUE)

8.6.2 Coefficient Estimates

Note

I’ll show two different approaches (model_parameters() from the easystats ecosystem, and tidy() from the broom package) to summarize the coefficient estimates from mod_B. Either is sufficient for Project A, and there’s no need to show both approaches. The tidy() approach handles the non-linear terms a bit better in my view, so I’d probably go with that.

Code
model_parameters(mod_B, ci = 0.90) |> print_md(digits = 3)
Parameter Coefficient SE 90% CI t(983) p
(Intercept) 5.436 0.252 (5.021, 5.852) 21.549 < .001
AGE 0.003 7.920e-04 (0.002, 0.005) 4.096 < .001
rcs(WAIST, 5) (1st degree) -0.017 0.003 (-0.022, -0.012) -5.617 < .001
rcs(WAIST, 5) (2nd degree) 0.039 0.018 (0.009, 0.068) 2.151 0.032
rcs(WAIST, 5) (3rd degree) -0.158 0.095 (-0.314, -0.001) -1.661 0.097
rcs(WAIST, 5) (4th degree) 0.188 0.132 (-0.030, 0.406) 1.422 0.155
SEX (M) -0.107 0.111 (-0.291, 0.076) -0.961 0.337
WAIST %ia% SEX -5.032e-04 0.001 (-0.002, 0.001) -0.473 0.637
RACE ETH (Hispanic) -0.034 0.022 (-0.070, 0.002) -1.548 0.122
RACE ETH (NH_Black) 0.088 0.022 (0.051, 0.125) 3.911 < .001
RACE ETH (NH_Asian) -0.075 0.028 (-0.121, -0.030) -2.735 0.006
RACE ETH (Other) 0.010 0.045 (-0.064, 0.084) 0.223 0.823
SROH (Very_Good) -0.001 0.032 (-0.055, 0.052) -0.045 0.964
SROH (Good) -0.063 0.031 (-0.114, -0.013) -2.068 0.039
SROH (Fair) -0.087 0.033 (-0.142, -0.033) -2.638 0.008
SROH (Poor) -0.031 0.050 (-0.113, 0.051) -0.627 0.531
Code
tidy(mod_B, conf.int = TRUE, conf.level = 0.90) |>
  select(term, estimate, se = std.error, 
         low90 = conf.low, high90 = conf.high, 
         p = p.value) |>
  kable(digits = 3)
term estimate se low90 high90 p
(Intercept) 5.436 0.252 5.021 5.852 0.000
AGE 0.003 0.001 0.002 0.005 0.000
rcs(WAIST, 5)WAIST -0.017 0.003 -0.022 -0.012 0.000
rcs(WAIST, 5)WAIST’ 0.039 0.018 0.009 0.068 0.032
rcs(WAIST, 5)WAIST’’ -0.158 0.095 -0.314 -0.001 0.097
rcs(WAIST, 5)WAIST’’’ 0.188 0.132 -0.030 0.406 0.155
SEXM -0.107 0.111 -0.291 0.076 0.337
WAIST %ia% SEX -0.001 0.001 -0.002 0.001 0.637
RACE_ETHHispanic -0.034 0.022 -0.070 0.002 0.122
RACE_ETHNH_Black 0.088 0.022 0.051 0.125 0.000
RACE_ETHNH_Asian -0.075 0.028 -0.121 -0.030 0.006
RACE_ETHOther 0.010 0.045 -0.064 0.084 0.823
SROHVery_Good -0.001 0.032 -0.055 0.052 0.964
SROHGood -0.063 0.031 -0.114 -0.013 0.039
SROHFair -0.087 0.033 -0.142 -0.033 0.008
SROHPoor -0.031 0.050 -0.113 0.051 0.531

8.6.3 Model B Effects

Here, we use the mod_B_ols model to look at the effects using its plot and associated table, which may be especially helpful when we include non-linear terms.

Note

I’ve used warning: false in this code chunk to suppress some uninteresting warnings. You can do the same (for this type of plot) in your Project A, if you need to.

Code
plot(summary(mod_B_ols, conf.int = 0.90))

Code
summary(mod_B_ols, conf.int = 0.90) |> kable(digits = 3)
Low High Diff. Effect S.E. Lower 0.9 Upper 0.9 Type
AGE 50 66.0 16.0 0.052 0.013 0.031 0.073 1
WAIST 92 112.6 20.6 -0.133 0.026 -0.175 -0.091 1
SEX - M:F 1 2.0 NA -0.158 0.017 -0.186 -0.131 1
RACE_ETH - Hispanic:NH_White 1 2.0 NA -0.034 0.022 -0.070 0.002 1
RACE_ETH - NH_Black:NH_White 1 3.0 NA 0.088 0.022 0.051 0.125 1
RACE_ETH - NH_Asian:NH_White 1 4.0 NA -0.075 0.028 -0.121 -0.030 1
RACE_ETH - Other:NH_White 1 5.0 NA 0.010 0.045 -0.064 0.084 1
SROH - Excellent:Good 3 1.0 NA 0.063 0.031 0.013 0.114 1
SROH - Very_Good:Good 3 2.0 NA 0.062 0.022 0.027 0.097 1
SROH - Fair:Good 3 4.0 NA -0.024 0.022 -0.059 0.012 1
SROH - Poor:Good 3 5.0 NA 0.032 0.043 -0.039 0.103 1

8.6.4 Quality of Fit Summaries

Code
model_performance(mod_B) |> print_md(digits = 3)
Indices of model performance
AIC AICc BIC R2 R2 (adj.) RMSE Sigma
130.829 131.453 214.244 0.290 0.279 0.254 0.256

In our Model B, we use 15 degrees of freedom, and obtain an \(R^2\) value of 0.29.

Code
glance(mod_B) |>
  select(r2 = r.squared, adjr2 = adj.r.squared, sigma, 
         AIC, BIC, nobs, df, df.residual) |>
  kable(digits = c(3, 3, 2, 1, 1, 0, 0, 0))
r2 adjr2 sigma AIC BIC nobs df df.residual
0.29 0.279 0.26 130.8 214.2 999 15 983

8.6.5 Regression Diagnostics for Model B

Code
check_model(mod_B, detrend = FALSE)

These residual plots look pretty reasonable, too. Again, I see no clear problems with the assumptions of linearity, normality or constant variance in these results. The posterior predictive check is a little better than in Model A, but not by much. The collinearity flagged here comes from the interaction term we introduced, so it’s not a concern for us.

8.7 Validating Models A and B

We will use the validate() function from the rms package to validate our ols fits.

Code
set.seed(4321); (valA <- validate(mod_A_ols))
          index.orig training   test optimism index.corrected  n
R-square      0.2729   0.2800 0.2630   0.0170          0.2558 40
MSE           0.0660   0.0656 0.0669  -0.0014          0.0674 40
g             0.1787   0.1811 0.1759   0.0052          0.1736 40
Intercept     0.0000   0.0000 0.1209  -0.1209          0.1209 40
Slope         1.0000   1.0000 0.9689   0.0311          0.9689 40
Code
set.seed(4322); (valB <- validate(mod_B_ols))
          index.orig training   test optimism index.corrected  n
R-square      0.2898   0.3056 0.2790   0.0267          0.2631 40
MSE           0.0645   0.0626 0.0655  -0.0029          0.0674 40
g             0.1827   0.1867 0.1796   0.0072          0.1755 40
Intercept     0.0000   0.0000 0.1501  -0.1501          0.1501 40
Slope         1.0000   1.0000 0.9620   0.0380          0.9620 40
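As a check on how validate() builds its summary, the index.corrected column is simply index.orig minus optimism. A minimal sketch, using Model A’s R-square row copied from the output above (the tiny discrepancy is rounding, since validate() corrects the unrounded values):

```r
# index.corrected = index.orig - optimism in validate() output.
# Model A's R-square row, copied from the output above:
index_orig <- 0.2729
optimism   <- 0.0170
corrected  <- index_orig - optimism
corrected   # approximately 0.2558, the validated R-square
```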

8.7.1 Validated \(R^2\) and MSE, along with IC statistics

Note

If you inspect the Quarto code for the table below, you’ll see that I have used in-line coding to specify the values of each element. That’s definitely worth considering in building your work, although it takes some care.

Model Validated \(R^2\) Validated MSE AIC BIC df
A 0.256 0.0674 146.4 210.1 11
B 0.263 0.0674 130.8 214.2 15

8.8 Final Linear Regression Model

We’ll choose Model B here.

We see a small improvement for Model B in terms of validated \(R^2\) and AIC. We also note that BIC is a little lower (thus better) for Model A, and there’s no material difference in validated mean squared error. Each of the models matches the assumptions of linear regression well, so there’s not much to choose from in that regard. Really, we could pick either model here, but I wanted to pick a model with non-linear terms so that I could display and discuss them in what follows.

8.8.1 Winning Model’s OLS summary

Code
mod_B_ols
Linear Regression Model

ols(formula = logHDL ~ AGE + rcs(WAIST, 5) + SEX + WAIST %ia% 
    SEX + RACE_ETH + SROH, data = nh_demo_i, x = TRUE, y = TRUE)

                Model Likelihood    Discrimination    
                      Ratio Test           Indexes    
Obs     999    LR chi2    341.85    R2       0.290    
sigma0.2560    d.f.           15    R2 adj   0.279    
d.f.    983    Pr(> chi2) 0.0000    g        0.183    

Residuals

      Min        1Q    Median        3Q       Max 
-0.763472 -0.183023 -0.002842  0.171374  0.916543 

                  Coef    S.E.   t     Pr(>|t|)
Intercept          5.4362 0.2523 21.55 <0.0001 
AGE                0.0032 0.0008  4.10 <0.0001 
WAIST             -0.0170 0.0030 -5.62 <0.0001 
WAIST'             0.0387 0.0180  2.15 0.0317  
WAIST''           -0.1579 0.0951 -1.66 0.0971  
WAIST'''           0.1881 0.1323  1.42 0.1552  
SEX=M             -0.1071 0.1115 -0.96 0.3370  
WAIST * SEX=M     -0.0005 0.0011 -0.47 0.6367  
RACE_ETH=Hispanic -0.0341 0.0220 -1.55 0.1219  
RACE_ETH=NH_Black  0.0877 0.0224  3.91 <0.0001 
RACE_ETH=NH_Asian -0.0754 0.0276 -2.73 0.0064  
RACE_ETH=Other     0.0100 0.0449  0.22 0.8234  
SROH=Very_Good    -0.0014 0.0323 -0.04 0.9645  
SROH=Good         -0.0635 0.0307 -2.07 0.0389  
SROH=Fair         -0.0873 0.0331 -2.64 0.0085  
SROH=Poor         -0.0312 0.0498 -0.63 0.5310  

As a reminder, we displayed the validated \(R^2\) value ( = 0.263) for our Model B back in Section 8.7.1.

8.8.2 Effects Plot for Winning Model

Code
plot(summary(mod_B_ols, conf.int = 0.90))

8.8.3 Numerical Description of Effect Sizes

Code
summary(mod_B_ols, conf.int = 0.90) |> kable(digits = 3)
Low High Diff. Effect S.E. Lower 0.9 Upper 0.9 Type
AGE 50 66.0 16.0 0.052 0.013 0.031 0.073 1
WAIST 92 112.6 20.6 -0.133 0.026 -0.175 -0.091 1
SEX - M:F 1 2.0 NA -0.158 0.017 -0.186 -0.131 1
RACE_ETH - Hispanic:NH_White 1 2.0 NA -0.034 0.022 -0.070 0.002 1
RACE_ETH - NH_Black:NH_White 1 3.0 NA 0.088 0.022 0.051 0.125 1
RACE_ETH - NH_Asian:NH_White 1 4.0 NA -0.075 0.028 -0.121 -0.030 1
RACE_ETH - Other:NH_White 1 5.0 NA 0.010 0.045 -0.064 0.084 1
SROH - Excellent:Good 3 1.0 NA 0.063 0.031 0.013 0.114 1
SROH - Very_Good:Good 3 2.0 NA 0.062 0.022 0.027 0.097 1
SROH - Fair:Good 3 4.0 NA -0.024 0.022 -0.059 0.012 1
SROH - Poor:Good 3 5.0 NA 0.032 0.043 -0.039 0.103 1

8.8.4 Effect Size Description(s)

In your work, we would only need to see one of the following effect size descriptions.

WAIST description: If we have two female subjects of the same age, race/ethnicity and self-reported overall health, where subject 1 has a waist circumference of 92 cm and subject 2 has a waist circumference of 112.6 cm, our model estimates that subject 1 will have a log(HDL) that is 0.133 higher than subject 2. The 90% confidence interval around that estimated effect on log(HDL) is (0.091, 0.175).

SEX description: If we have two subjects, each with the same age, race/ethnicity and self-reported overall health and with a waist circumference of 101.7 cm, then if subject 1 is male and subject 2 is female, our model estimates that the female subject will have a log(HDL) that is 0.158 higher (with 90% CI: (0.131, 0.186)) than the male subject.

AGE description: If we have two subjects of the same sex, waist circumference, race/ethnicity and self-reported overall health, then if subject 1 is age 50 and subject 2 is age 66, our model predicts that subject 1’s log(HDL) will be 0.052 larger than subject 2’s log(HDL), with 90% confidence interval (0.031, 0.073).

SROH description: If we have two subjects of the same age, sex, waist circumference and race/ethnicity, then if subject 1 reports Excellent overall health while subject 2 reports only Good overall health, our model predicts that subject 1’s log(HDL) will be 0.063 higher than subject 2’s log(HDL), with 90% CI (0.013, 0.114).
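As a sketch of where the SEX effect comes from, we can rebuild it by hand from Model B’s coefficient estimates. Because the model includes a WAIST %ia% SEX interaction, the M:F contrast is evaluated holding WAIST at 101.7 cm (the value used in the SEX description above), so the F vs. M effect is the negative of (SEX main effect + interaction coefficient × 101.7). Coefficient values are copied from the mod_B output shown earlier; this is a hand check, not new model output:

```r
# Rebuild the SEX (F vs. M) effect on log(HDL) at WAIST = 101.7 cm,
# using the coefficient estimates shown for mod_B above.
b_sexM <- -0.1071       # main effect of SEX = M
b_ia   <- -0.0005032    # WAIST %ia% SEX interaction coefficient
waist0 <- 101.7         # reference waist circumference
effect_F_vs_M <- -(b_sexM + b_ia * waist0)
effect_F_vs_M           # approximately 0.158, matching the effects table
```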

8.8.5 Prediction Plot for Winning Model

Here’s the set of prediction plots describing the impact of each predictor on log(HDL).

Code
ggplot(Predict(mod_B_ols))

8.8.6 Nomogram of Winning Model

Code
plot(nomogram(mod_B_ols, fun = exp, funlabel = "HDL"))

8.8.7 Prediction for a New Subject

I will create a predicted HDL for a new female Non-Hispanic White subject who is 60 years old, has an 85 cm waist circumference, and rates their overall health as Fair.

Note

I can do this using either the ols() or the lm() fit to our model. Either is sufficient for Project A, and there’s no need to show both approaches.

Here, I’ll actually run two predictions, one for a Male and one for a Female subject with the same values of AGE, WAIST, RACE_ETH and SROH.

Code
new_subjects <- 
  data.frame(AGE = c(60,60), WAIST = c(85, 85), SEX = c("M", "F"),
             RACE_ETH = c("NH_White", "NH_White"), SROH = c("Fair", "Fair"))

preds1 <- predict.lm(mod_B, newdata = new_subjects, 
                     interval = "prediction", level = 0.90)

exp(preds1)
       fit      lwr      upr
1 52.08776 34.05169 79.67694
2 60.50899 39.58373 92.49604

The exp() call above converts these results back to the original scale of HDL cholesterol, giving the estimates and 90% prediction intervals shown in the table below.

Predictor Values Predicted HDL 90% Prediction Interval
AGE = 60, WAIST = 85, SEX = M, RACE_ETH = NH_White, SROH = Fair 52.09 mg/dl (34.05, 79.68) mg/dl
AGE = 60, WAIST = 85, SEX = F, RACE_ETH = NH_White, SROH = Fair 60.51 mg/dl (39.58, 92.50) mg/dl

Alternatively, I could have used our ols() fit. Here, I’ll just show the results for the female subject.

Code
new_subject <- tibble(  AGE = 60, WAIST = 85, SEX = "F", 
                       RACE_ETH = "NH_White", SROH = "Fair" ) 

preds2 <- predict(mod_B_ols, newdata = new_subject, 
                 conf.int = 0.90, conf.type = "individual")

preds2
$linear.predictors
       1 
4.102792 

$lower
       1 
3.678418 

$upper
       1 
4.527166 

We then exponentiate these results to obtain the estimate and 90% prediction interval on the original scale of HDL cholesterol, as shown in the table below.

Predictor Values Predicted HDL 90% Prediction Interval
AGE = 60, WAIST = 85, SEX = F, RACE_ETH = NH_White, SROH = Fair 60.51 mg/dl (39.58, 92.50) mg/dl
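The exponentiation step here can be sketched directly, plugging in the linear predictor and 90% limits from the preds2 output above:

```r
# Convert the log(HDL) prediction and its 90% limits back to mg/dl.
preds_mgdl <- exp(c(fit = 4.102792, lwr = 3.678418, upr = 4.527166))
preds_mgdl   # approximately 60.51, 39.58, 92.50
```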

Naturally, the two fitting approaches (lm() and ols()) produce the same model, so they produce the same predictions.

9 Logistic Regression Analyses

9.1 Missingness

Again, we’ll assume missing values are MAR, and use the single imputation approach developed previously in Section 8.1.1.

9.2 Model Y

We’ll predict Pr(HDL_RISK = 1), the probability of having an HDL level low enough to put the subject at risk, as a function of AGE, WAIST, RACE_ETH and SROH.

9.2.1 Fitting Model Y

We’ll fit our logistic regression model using both glm() and lrm().

Code
mod_Y <- glm(HDL_RISK ~ AGE + WAIST + RACE_ETH + SROH,
            data = nh_demo_i, family = binomial())

ddd <- datadist(nh_demo_i)
options(datadist = "ddd")

mod_Y_lrm <- lrm(HDL_RISK ~ AGE + WAIST + RACE_ETH + SROH,
                data = nh_demo_i, x = TRUE, y = TRUE)

9.2.2 Coefficient Estimates

Note

I’ll show two different approaches (model_parameters() from the easystats ecosystem, and tidy() from the broom package) to summarize the coefficient estimates from mod_Y. Either is sufficient for Project A, and there’s no need to show both approaches.

Code
tidy(mod_Y, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.90) |>
  select(term, estimate, se = std.error, 
         low90 = conf.low, high90 = conf.high, p = p.value) |>
  kable(digits = 3)
term estimate se low90 high90 p
(Intercept) 0.011 0.722 0.003 0.036 0.000
AGE 0.986 0.007 0.975 0.998 0.048
WAIST 1.041 0.005 1.032 1.050 0.000
RACE_ETHHispanic 1.004 0.189 0.736 1.369 0.983
RACE_ETHNH_Black 0.531 0.203 0.379 0.740 0.002
RACE_ETHNH_Asian 1.607 0.239 1.083 2.380 0.047
RACE_ETHOther 0.536 0.420 0.261 1.049 0.138
SROHVery_Good 1.196 0.318 0.717 2.044 0.574
SROHGood 1.677 0.300 1.037 2.792 0.085
SROHFair 2.405 0.315 1.448 4.100 0.005
SROHPoor 1.993 0.450 0.949 4.193 0.126
Code
model_parameters(mod_Y, ci = 0.90, exponentiate = TRUE) |>
  print_md(digits = 3)
Parameter Odds Ratio SE 90% CI z p
(Intercept) 0.011 0.008 (0.003, 0.036) -6.243 < .001
AGE 0.986 0.007 (0.975, 0.998) -1.979 0.048
WAIST 1.041 0.005 (1.032, 1.050) 7.779 < .001
RACE ETH (Hispanic) 1.004 0.189 (0.736, 1.369) 0.022 0.983
RACE ETH (NH_Black) 0.531 0.108 (0.379, 0.740) -3.114 0.002
RACE ETH (NH_Asian) 1.607 0.384 (1.083, 2.380) 1.985 0.047
RACE ETH (Other) 0.536 0.225 (0.261, 1.049) -1.483 0.138
SROH (Very_Good) 1.196 0.380 (0.717, 2.044) 0.562 0.574
SROH (Good) 1.677 0.503 (1.037, 2.792) 1.722 0.085
SROH (Fair) 2.405 0.759 (1.448, 4.100) 2.782 0.005
SROH (Poor) 1.993 0.897 (0.949, 4.193) 1.531 0.126
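The odds ratios in these tables are just the exponentiated logit coefficients. A quick hand check, using the WAIST coefficient (0.0403 on the log-odds scale) from the lrm fit shown in Section 9.2.4:

```r
# Odds ratio = exp(log-odds coefficient), e.g. for WAIST in Model Y:
or_waist <- exp(0.0403)
or_waist   # approximately 1.041, matching the tables above
```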

9.2.3 Model Y Effects

Code
plot(summary(mod_Y_lrm, conf.int = 0.90))

Code
summary(mod_Y_lrm, conf.int = 0.90) |> kable(digits = 3)
Low High Diff. Effect S.E. Lower 0.9 Upper 0.9 Type
AGE 50 66.0 16.0 -0.221 0.112 -0.404 -0.037 1
Odds Ratio 50 66.0 16.0 0.802 NA 0.667 0.963 2
WAIST 92 112.6 20.6 0.831 0.107 0.655 1.006 1
Odds Ratio 92 112.6 20.6 2.295 NA 1.925 2.735 2
RACE_ETH - Hispanic:NH_White 1 2.0 NA 0.004 0.189 -0.306 0.314 1
Odds Ratio 1 2.0 NA 1.004 NA 0.736 1.369 2
RACE_ETH - NH_Black:NH_White 1 3.0 NA -0.633 0.203 -0.967 -0.299 1
Odds Ratio 1 3.0 NA 0.531 NA 0.380 0.742 2
RACE_ETH - NH_Asian:NH_White 1 4.0 NA 0.474 0.239 0.081 0.867 1
Odds Ratio 1 4.0 NA 1.607 NA 1.085 2.381 2
RACE_ETH - Other:NH_White 1 5.0 NA -0.624 0.420 -1.315 0.068 1
Odds Ratio 1 5.0 NA 0.536 NA 0.268 1.070 2
SROH - Excellent:Good 3 1.0 NA -0.517 0.300 -1.010 -0.023 1
Odds Ratio 3 1.0 NA 0.596 NA 0.364 0.977 2
SROH - Very_Good:Good 3 2.0 NA -0.338 0.193 -0.656 -0.020 1
Odds Ratio 3 2.0 NA 0.713 NA 0.519 0.980 2
SROH - Fair:Good 3 4.0 NA 0.361 0.180 0.065 0.656 1
Odds Ratio 3 4.0 NA 1.434 NA 1.067 1.928 2
SROH - Poor:Good 3 5.0 NA 0.173 0.368 -0.433 0.778 1
Odds Ratio 3 5.0 NA 1.188 NA 0.649 2.177 2

9.2.4 Quality of Fit Summaries

Note

Here, we have three available approaches to summarize the fit of Model Y, thanks to our glm() and lrm() fits of the same model. I’ll show all three, but in practice (and in your Project A) I wouldn’t include the model_performance() results.

Our Nagelkerke \(R^2\) estimate for Model Y is 0.152, and our C statistic is estimated to be 0.701.

Code
mod_Y_lrm
Logistic Regression Model

lrm(formula = HDL_RISK ~ AGE + WAIST + RACE_ETH + SROH, data = nh_demo_i, 
    x = TRUE, y = TRUE)

                       Model Likelihood     Discrimination    Rank Discrim.    
                             Ratio Test            Indexes          Indexes    
Obs           999    LR chi2     115.50     R2       0.152    C       0.701    
 0            666    d.f.            10    R2(10,999)0.100    Dxy     0.402    
 1            333    Pr(> chi2) <0.0001    R2(10,666)0.146    gamma   0.402    
max |deriv| 1e-06                           Brier    0.198    tau-a   0.179    

                  Coef    S.E.   Wald Z Pr(>|Z|)
Intercept         -4.5080 0.7221 -6.24  <0.0001 
AGE               -0.0138 0.0070 -1.98  0.0478  
WAIST              0.0403 0.0052  7.78  <0.0001 
RACE_ETH=Hispanic  0.0041 0.1886  0.02  0.9826  
RACE_ETH=NH_Black -0.6327 0.2032 -3.11  0.0018  
RACE_ETH=NH_Asian  0.4743 0.2390  1.98  0.0472  
RACE_ETH=Other    -0.6235 0.4205 -1.48  0.1381  
SROH=Very_Good     0.1786 0.3175  0.56  0.5739  
SROH=Good          0.5168 0.3001  1.72  0.0850  
SROH=Fair          0.8776 0.3154  2.78  0.0054  
SROH=Poor          0.6895 0.4503  1.53  0.1258  
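The rank-discrimination indexes in this lrm output are linked: Somers’ Dxy = 2 × (C − 0.5). A quick check against the values shown above:

```r
# Somers' d relates to the C statistic via Dxy = 2 * (C - 0.5).
C_stat <- 0.701
Dxy <- 2 * (C_stat - 0.5)
Dxy   # 0.402, matching the lrm output above
```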

The glance() results below give us our AIC and BIC values, as well as the degrees of freedom used by Model Y.

Code
glance(mod_Y) |>
  mutate(df = nobs - df.residual - 1) |>
  select(AIC, BIC, df, df.residual, nobs) |>
  kable(digits = 1)
AIC BIC df df.residual nobs
1178.3 1232.2 10 988 999
Note

These summaries aren’t as appealing to me as those in the other two tabs here.

Code
model_performance(mod_Y) |> print_md(digits = 2)
Indices of model performance
AIC AICc BIC Tjur’s R2 RMSE Sigma Log_loss Score_log Score_spherical PCP
1178.26 1178.53 1232.23 0.11 0.44 1.00 0.58 -145.30 1.81e-03 0.60

9.2.5 Confusion Matrix (Model Y)

First, we augment our nh_demo_i data to include predicted probabilities of (HDL_RISK = 1) from Model Y.

Code
resY_aug <- augment(mod_Y, type.predict = "response")

My prediction rule for this confusion matrix is that the fitted value of Pr(HDL_RISK = 1) needs to be greater than or equal to 0.5 for me to predict HDL_RISK is 1, and otherwise I predict 0.

Code
cm_Y <- caret::confusionMatrix(
  data = factor(resY_aug$.fitted >= 0.5),
  reference = factor(resY_aug$HDL_RISK == 1),
  positive = "TRUE")

cm_Y
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE   603  241
     TRUE     63   92
                                          
               Accuracy : 0.6957          
                 95% CI : (0.6661, 0.7241)
    No Information Rate : 0.6667          
    P-Value [Acc > NIR] : 0.02722         
                                          
                  Kappa : 0.2097          
                                          
 Mcnemar's Test P-Value : < 2e-16         
                                          
            Sensitivity : 0.27628         
            Specificity : 0.90541         
         Pos Pred Value : 0.59355         
         Neg Pred Value : 0.71445         
             Prevalence : 0.33333         
         Detection Rate : 0.09209         
   Detection Prevalence : 0.15516         
      Balanced Accuracy : 0.59084         
                                          
       'Positive' Class : TRUE            
                                          

Here are our results, tabulated nicely.

Model Classification Rule Sensitivity Specificity Pos. Pred. Value
Y Predicted Pr(HDL_RISK = 1) >= 0.5 0.276 0.905 0.594
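These summary measures can be recovered by hand from the cell counts in the Model Y confusion matrix above:

```r
# Cell counts copied from the Model Y confusion matrix above.
TP <- 92; FP <- 63; FN <- 241; TN <- 603
sens <- TP / (TP + FN)   # sensitivity: 92/333
spec <- TN / (TN + FP)   # specificity: 603/666
ppv  <- TP / (TP + FP)   # positive predictive value: 92/155
round(c(sens, spec, ppv), 3)   # approximately 0.276, 0.905, 0.594
```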

9.3 Non-Linearity and Spearman \(\rho^2\) plot

Here’s the relevant Spearman \(\rho^2\) plot.

Code
plot(spearman2(HDL_RISK ~ AGE + WAIST + SEX + RACE_ETH + SROH, 
               data = nh_demo_i))

Our Spearman \(\rho^2\) plot first suggests the use of a non-linear term in WAIST, so we’ll add a restricted cubic spline in WAIST using 4 knots, which should add 2 degrees of freedom to our initial model.

Next, the plot suggests the SROH variable, so we’ll add an interaction between SROH and the main effect of WAIST, which will add four more degrees of freedom to our Model Y.

9.4 Model Z

As mentioned, our Model Z will add two non-linear terms, for a total of 6 additional degrees of freedom, to Model Y.

9.4.1 Fitting Model Z

We’ll fit Model Z with both glm() and lrm().

Code
mod_Z <- glm(HDL_RISK ~ AGE + rcs(WAIST, 4) + RACE_ETH + 
               SROH + WAIST %ia% SROH,
            data = nh_demo_i, family = binomial())

ddd <- datadist(nh_demo_i)
options(datadist = "ddd")

mod_Z_lrm <- lrm(HDL_RISK ~ AGE + rcs(WAIST, 4) + RACE_ETH + 
                   SROH + WAIST %ia% SROH,
                 data = nh_demo_i, x = TRUE, y = TRUE)

9.4.2 Coefficient Estimates

Note

I’ll show two different approaches (model_parameters() from the easystats ecosystem, and tidy() from the broom package) to summarize the coefficient estimates from mod_Z. Either is sufficient for Project A, and there’s no need to show both approaches. The tidy() approach handles the non-linear terms a bit better in my view, so I’d probably go with that.

Code
tidy(mod_Z, exponentiate = TRUE, conf.int = TRUE, conf.level = 0.90) |>
  select(term, estimate, se = std.error, 
         low90 = conf.low, high90 = conf.high, p = p.value) |>
  kable(digits = 3)
term estimate se low90 high90 p
(Intercept) 0.000 2.905 0.000 0.018 0.003
AGE 0.985 0.007 0.973 0.996 0.032
rcs(WAIST, 4)WAIST 1.093 0.032 1.039 1.154 0.005
rcs(WAIST, 4)WAIST’ 0.893 0.074 0.787 1.006 0.128
rcs(WAIST, 4)WAIST’’ 1.295 0.231 0.890 1.909 0.263
RACE_ETHHispanic 0.971 0.190 0.710 1.326 0.875
RACE_ETHNH_Black 0.525 0.203 0.375 0.732 0.002
RACE_ETHNH_Asian 1.739 0.248 1.157 2.615 0.025
RACE_ETHOther 0.541 0.419 0.264 1.056 0.142
SROHVery_Good 0.630 2.352 0.013 31.942 0.845
SROHGood 1.151 2.158 0.032 43.328 0.948
SROHFair 0.437 2.246 0.010 18.809 0.713
SROHPoor 1.828 2.837 0.015 187.479 0.832
WAIST %ia% SROHWAIST * SROH=Very_Good 1.006 0.024 0.968 1.047 0.791
WAIST %ia% SROHWAIST * SROH=Good 1.004 0.022 0.968 1.041 0.860
WAIST %ia% SROHWAIST * SROH=Fair 1.017 0.022 0.979 1.056 0.462
WAIST %ia% SROHWAIST * SROH=Poor 1.002 0.027 0.959 1.049 0.936
Code
model_parameters(mod_Z, ci = 0.90, exponentiate = TRUE) |>
  print_md(digits = 3)
Parameter Odds Ratio SE 90% CI z p
(Intercept) 1.765e-04 5.127e-04 (1.213e-06, 0.018) -2.975 0.003
AGE 0.985 0.007 (0.973, 0.996) -2.150 0.032
rcs(WAIST, 4) (1st degree) 1.093 0.035 (1.039, 1.154) 2.791 0.005
rcs(WAIST, 4) (2nd degree) 0.893 0.066 (0.787, 1.006) -1.523 0.128
rcs(WAIST, 4) (3rd degree) 1.295 0.300 (0.890, 1.909) 1.118 0.263
RACE ETH (Hispanic) 0.971 0.184 (0.710, 1.326) -0.157 0.875
RACE ETH (NH_Black) 0.525 0.107 (0.375, 0.732) -3.164 0.002
RACE ETH (NH_Asian) 1.739 0.431 (1.157, 2.615) 2.236 0.025
RACE ETH (Other) 0.541 0.227 (0.264, 1.056) -1.467 0.142
SROH (Very_Good) 0.630 1.483 (0.013, 31.942) -0.196 0.845
SROH (Good) 1.151 2.485 (0.032, 43.328) 0.065 0.948
SROH (Fair) 0.437 0.982 (0.010, 18.809) -0.368 0.713
SROH (Poor) 1.828 5.185 (0.015, 187.479) 0.213 0.832
WAIST %ia% SROHWAIST * SROH=Very Good 1.006 0.024 (0.968, 1.047) 0.265 0.791
WAIST %ia% SROHWAIST * SROH=Good 1.004 0.022 (0.968, 1.041) 0.176 0.860
WAIST %ia% SROHWAIST * SROH=Fair 1.017 0.023 (0.979, 1.056) 0.735 0.462
WAIST %ia% SROHWAIST * SROH=Poor 1.002 0.027 (0.959, 1.049) 0.081 0.936

9.4.3 Model Z Effects

Code
plot(summary(mod_Z_lrm, conf.int = 0.90))

Code
summary(mod_Z_lrm, conf.int = 0.90) |> kable(digits = 3)
Low High Diff. Effect S.E. Lower 0.9 Upper 0.9 Type
AGE 50 66.0 16.0 -0.243 0.113 -0.429 -0.057 1
Odds Ratio 50 66.0 16.0 0.784 NA 0.651 0.944 2
WAIST 92 112.6 20.6 0.769 0.228 0.395 1.144 1
Odds Ratio 92 112.6 20.6 2.158 NA 1.484 3.140 2
RACE_ETH - Hispanic:NH_White 1 2.0 NA -0.030 0.190 -0.342 0.283 1
Odds Ratio 1 2.0 NA 0.971 NA 0.710 1.327 2
RACE_ETH - NH_Black:NH_White 1 3.0 NA -0.644 0.203 -0.979 -0.309 1
Odds Ratio 1 3.0 NA 0.525 NA 0.376 0.734 2
RACE_ETH - NH_Asian:NH_White 1 4.0 NA 0.553 0.248 0.146 0.961 1
Odds Ratio 1 4.0 NA 1.739 NA 1.157 2.613 2
RACE_ETH - Other:NH_White 1 5.0 NA -0.615 0.419 -1.304 0.074 1
Odds Ratio 1 5.0 NA 0.541 NA 0.271 1.077 2
SROH - Excellent:Good 3 1.0 NA -0.529 0.311 -1.041 -0.017 1
Odds Ratio 3 1.0 NA 0.589 NA 0.353 0.983 2
SROH - Very_Good:Good 3 2.0 NA -0.355 0.197 -0.680 -0.031 1
Odds Ratio 3 2.0 NA 0.701 NA 0.507 0.969 2
SROH - Fair:Good 3 4.0 NA 0.306 0.198 -0.019 0.632 1
Odds Ratio 3 4.0 NA 1.359 NA 0.981 1.882 2
SROH - Poor:Good 3 5.0 NA 0.295 0.402 -0.366 0.956 1
Odds Ratio 3 5.0 NA 1.344 NA 0.694 2.602 2

9.4.4 Quality of Fit Summaries

Note

Again, we have three available approaches to summarize the fit of Model Z, thanks to our glm() and lrm() fits of the same model. I’ll show all three, but in practice (and in your Project A) I wouldn’t include the model_performance() results.

Our Nagelkerke \(R^2\) estimate for Model Z is 0.162, and our C statistic is estimated to be 0.702.

Code
mod_Z_lrm
Logistic Regression Model

lrm(formula = HDL_RISK ~ AGE + rcs(WAIST, 4) + RACE_ETH + SROH + 
    WAIST %ia% SROH, data = nh_demo_i, x = TRUE, y = TRUE)

                       Model Likelihood     Discrimination    Rank Discrim.    
                             Ratio Test            Indexes          Indexes    
Obs           999    LR chi2     123.92     R2       0.162    C       0.702    
 0            666    d.f.            16    R2(16,999)0.102    Dxy     0.404    
 1            333    Pr(> chi2) <0.0001    R2(16,666)0.150    gamma   0.404    
max |deriv| 3e-08                           Brier    0.197    tau-a   0.180    

                       Coef    S.E.   Wald Z Pr(>|Z|)
Intercept              -8.6421 2.9045 -2.98  0.0029  
AGE                    -0.0152 0.0071 -2.15  0.0315  
WAIST                   0.0887 0.0318  2.79  0.0053  
WAIST'                 -0.1134 0.0745 -1.52  0.1277  
WAIST''                 0.2588 0.2314  1.12  0.2634  
RACE_ETH=Hispanic      -0.0299 0.1900 -0.16  0.8751  
RACE_ETH=NH_Black      -0.6438 0.2035 -3.16  0.0016  
RACE_ETH=NH_Asian       0.5535 0.2476  2.24  0.0254  
RACE_ETH=Other         -0.6148 0.4189 -1.47  0.1422  
SROH=Very_Good         -0.4613 2.3523 -0.20  0.8445  
SROH=Good               0.1409 2.1582  0.07  0.9480  
SROH=Fair              -0.8270 2.2460 -0.37  0.7127  
SROH=Poor               0.6031 2.8370  0.21  0.8317  
WAIST * SROH=Very_Good  0.0063 0.0236  0.27  0.7908  
WAIST * SROH=Good       0.0038 0.0217  0.18  0.8602  
WAIST * SROH=Fair       0.0164 0.0223  0.73  0.4625  
WAIST * SROH=Poor       0.0022 0.0271  0.08  0.9356  

The glance() results below give us our AIC and BIC values, as well as the degrees of freedom used by Model Z.

Code
glance(mod_Z) |>
  mutate(df = nobs - df.residual - 1) |>
  select(AIC, BIC, df, df.residual, nobs) |>
  kable(digits = 1)
AIC BIC df df.residual nobs
1181.8 1265.3 16 982 999
Note

As mentioned previously, these summaries aren’t as appealing to me as those in the other two tabs here.

Code
model_performance(mod_Z) |> print_md(digits = 2)
Indices of model performance
AIC AICc BIC Tjur’s R2 RMSE Sigma Log_loss Score_log Score_spherical PCP
1181.84 1182.46 1265.25 0.12 0.44 1.00 0.57 -144.81 1.60e-03 0.61

9.4.5 Confusion Matrix (Model Z)

As in Model Y, my prediction rule for my Model Z confusion matrix is that the fitted value of Pr(HDL_RISK = 1) needs to be greater than or equal to 0.5 for me to predict HDL_RISK is 1, and otherwise I predict 0.

Again, we augment our nh_demo_i data to include predicted probabilities of (HDL_RISK = 1) from Model Z.

Code
resZ_aug <- augment(mod_Z, type.predict = "response")

Applying my prediction rule, we obtain:

Code
cm_Z <- caret::confusionMatrix(
  data = factor(resZ_aug$.fitted >= 0.5),
  reference = factor(resZ_aug$HDL_RISK == 1),
  positive = "TRUE")

cm_Z
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE   608  241
     TRUE     58   92
                                         
               Accuracy : 0.7007         
                 95% CI : (0.6712, 0.729)
    No Information Rate : 0.6667         
    P-Value [Acc > NIR] : 0.01177        
                                         
                  Kappa : 0.2193         
                                         
 Mcnemar's Test P-Value : < 2e-16        
                                         
            Sensitivity : 0.27628        
            Specificity : 0.91291        
         Pos Pred Value : 0.61333        
         Neg Pred Value : 0.71614        
             Prevalence : 0.33333        
         Detection Rate : 0.09209        
   Detection Prevalence : 0.15015        
      Balanced Accuracy : 0.59459        
                                         
       'Positive' Class : TRUE           
                                         

Here are our results comparing classification performance by models Y and Z.

Model Classification Rule Sensitivity Specificity Pos. Pred. Value
Y Predicted Pr(HDL_RISK = 1) >= 0.5 0.276 0.905 0.594
Z Predicted Pr(HDL_RISK = 1) >= 0.5 0.276 0.913 0.613

9.5 Validating Models Y and Z

We will use the validate() function from the rms package to validate our lrm fits.

Code
set.seed(4323); (valY <- validate(mod_Y_lrm))
          index.orig training    test optimism index.corrected  n
Dxy           0.4017   0.4198  0.3843   0.0355          0.3662 40
R2            0.1516   0.1668  0.1406   0.0262          0.1254 40
Intercept     0.0000   0.0000 -0.0602   0.0602         -0.0602 40
Slope         1.0000   1.0000  0.9034   0.0966          0.9034 40
Emax          0.0000   0.0000  0.0329   0.0329          0.0329 40
D             0.1146   0.1272  0.1057   0.0215          0.0932 40
U            -0.0020  -0.0020  0.0010  -0.0030          0.0010 40
Q             0.1166   0.1292  0.1047   0.0245          0.0921 40
B             0.1979   0.1956  0.2003  -0.0047          0.2026 40
g             0.8757   0.9354  0.8396   0.0959          0.7798 40
gp            0.1767   0.1855  0.1701   0.0154          0.1612 40
Code
set.seed(4324); (valZ <- validate(mod_Z_lrm))
          index.orig training    test optimism index.corrected  n
Dxy           0.4036   0.4258  0.3867   0.0391          0.3645 40
R2            0.1620   0.1806  0.1461   0.0345          0.1275 40
Intercept     0.0000   0.0000 -0.0896   0.0896         -0.0896 40
Slope         1.0000   1.0000  0.8674   0.1326          0.8674 40
Emax          0.0000   0.0000  0.0474   0.0474          0.0474 40
D             0.1230   0.1386  0.1102   0.0284          0.0946 40
U            -0.0020  -0.0020  0.0021  -0.0041          0.0021 40
Q             0.1250   0.1406  0.1081   0.0325          0.0925 40
B             0.1968   0.1938  0.2000  -0.0062          0.2030 40
g             0.9605   1.0350  0.8957   0.1393          0.8212 40
gp            0.1825   0.1924  0.1724   0.0200          0.1626 40

9.5.1 Validated Nagelkerke \(R^2\) and \(C\) statistics, along with AIC and BIC

Model  Validated \(R^2\)  Validated C     AIC     BIC  df
Y              0.1254          0.6831  1178.3  1232.2  10
Z              0.1275          0.6823  1181.8  1265.3  16
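Note that the validated \(C\) statistics in this table are not printed directly by validate(); they follow from the optimism-corrected Somers’ Dxy values shown above, via \(C = 0.5 + Dxy/2\):

```r
# C = 0.5 + Dxy/2, applied to the index.corrected Dxy values from validate()
0.5 + 0.3662 / 2   # Model Y: 0.6831
0.5 + 0.3645 / 2   # Model Z: 0.68225, shown as 0.6823 above
```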

9.6 Final Logistic Regression Model

I prefer Model Y because of its slightly better AIC, BIC, and validated C statistic, even though Model Z shows a slightly higher validated \(R^2\) and a slightly higher positive predictive value. The two models perform very similarly, though.

9.6.1 Winning Model’s Parameter Estimates

Code
mod_Y_lrm
Logistic Regression Model

lrm(formula = HDL_RISK ~ AGE + WAIST + RACE_ETH + SROH, data = nh_demo_i, 
    x = TRUE, y = TRUE)

                       Model Likelihood     Discrimination    Rank Discrim.    
                             Ratio Test            Indexes          Indexes    
Obs           999    LR chi2     115.50     R2       0.152    C       0.701    
 0            666    d.f.            10    R2(10,999)0.100    Dxy     0.402    
 1            333    Pr(> chi2) <0.0001    R2(10,666)0.146    gamma   0.402    
max |deriv| 1e-06                           Brier    0.198    tau-a   0.179    

                  Coef    S.E.   Wald Z Pr(>|Z|)
Intercept         -4.5080 0.7221 -6.24  <0.0001 
AGE               -0.0138 0.0070 -1.98  0.0478  
WAIST              0.0403 0.0052  7.78  <0.0001 
RACE_ETH=Hispanic  0.0041 0.1886  0.02  0.9826  
RACE_ETH=NH_Black -0.6327 0.2032 -3.11  0.0018  
RACE_ETH=NH_Asian  0.4743 0.2390  1.98  0.0472  
RACE_ETH=Other    -0.6235 0.4205 -1.48  0.1381  
SROH=Very_Good     0.1786 0.3175  0.56  0.5739  
SROH=Good          0.5168 0.3001  1.72  0.0850  
SROH=Fair          0.8776 0.3154  2.78  0.0054  
SROH=Poor          0.6895 0.4503  1.53  0.1258  

9.6.2 Winning Model Effect Sizes

Code
plot(summary(mod_Y_lrm, conf.int = 0.90))

9.6.3 Numerical Description of Effect Sizes

Code
summary(mod_Y_lrm, conf.int = 0.90) |> kable(digits = 3)
                               Low   High  Diff.  Effect   S.E.  Lower 0.9  Upper 0.9  Type
AGE                             50   66.0   16.0  -0.221  0.112     -0.404     -0.037     1
 Odds Ratio                     50   66.0   16.0   0.802     NA      0.667      0.963     2
WAIST                           92  112.6   20.6   0.831  0.107      0.655      1.006     1
 Odds Ratio                     92  112.6   20.6   2.295     NA      1.925      2.735     2
RACE_ETH - Hispanic:NH_White     1    2.0     NA   0.004  0.189     -0.306      0.314     1
 Odds Ratio                      1    2.0     NA   1.004     NA      0.736      1.369     2
RACE_ETH - NH_Black:NH_White     1    3.0     NA  -0.633  0.203     -0.967     -0.299     1
 Odds Ratio                      1    3.0     NA   0.531     NA      0.380      0.742     2
RACE_ETH - NH_Asian:NH_White     1    4.0     NA   0.474  0.239      0.081      0.867     1
 Odds Ratio                      1    4.0     NA   1.607     NA      1.085      2.381     2
RACE_ETH - Other:NH_White        1    5.0     NA  -0.624  0.420     -1.315      0.068     1
 Odds Ratio                      1    5.0     NA   0.536     NA      0.268      1.070     2
SROH - Excellent:Good            3    1.0     NA  -0.517  0.300     -1.010     -0.023     1
 Odds Ratio                      3    1.0     NA   0.596     NA      0.364      0.977     2
SROH - Very_Good:Good            3    2.0     NA  -0.338  0.193     -0.656     -0.020     1
 Odds Ratio                      3    2.0     NA   0.713     NA      0.519      0.980     2
SROH - Fair:Good                 3    4.0     NA   0.361  0.180      0.065      0.656     1
 Odds Ratio                      3    4.0     NA   1.434     NA      1.067      1.928     2
SROH - Poor:Good                 3    5.0     NA   0.173  0.368     -0.433      0.778     1
 Odds Ratio                      3    5.0     NA   1.188     NA      0.649      2.177     2

9.6.4 Effect Size Description(s)

In your work, we would only need to see one of the following effect size descriptions.

WAIST description: If we have two subjects of the same age, race/ethnicity, and self-reported overall health, and subject 1 has a waist circumference of 92 cm while subject 2 has a waist circumference of 112.6 cm, then our model estimates that subject 2 has 2.295 times the odds (90% CI: 1.93, 2.74) that subject 1 has of an at-risk HDL level.

AGE description: If we have two subjects of the same waist circumference, race/ethnicity, and self-reported overall health, and subject 1 is age 50 while subject 2 is age 66, our model predicts that subject 2 has 80.2% of the odds of an at-risk HDL that subject 1 has. The 90% CI for the odds ratio comparing subject 2 to subject 1 is (0.667, 0.963).

SROH description: If we have two subjects of the same age, waist circumference, and race/ethnicity, and subject 1 reports Excellent overall health while subject 2 reports only Good overall health, our model predicts that the odds ratio comparing subject 1’s odds of an at-risk HDL to those of subject 2 will be 0.596, with a 90% CI of (0.364, 0.977).
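These odds ratios are simple transformations of the Model Y coefficients shown in Section 9.6.1. For instance, the WAIST odds ratio is just exp(coefficient × contrast width), where summary() uses the 25th-to-75th percentile contrast (92 to 112.6 cm):

```r
# WAIST odds ratio from the (rounded) Model Y coefficient shown earlier
b_waist <- 0.0403             # log-odds increase per 1 cm of waist circumference
exp(b_waist * (112.6 - 92))   # approx 2.29; the table's 2.295 uses unrounded values
```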

9.6.5 Validated \(R^2\) and \(C\) statistic for Winning Model

As we saw in Section 9.5.1,

  • the validated \(R^2\) statistic for Model Y is 0.1254, and
  • the validated \(C\) statistic for Model Y is 0.6831.

9.6.6 Nomogram of Winning Model

Code
plot(nomogram(mod_Y_lrm, fun = plogis, 
              funlabel = "Pr(HDL_RISK = 1)"))

9.6.7 Predictions for Two New Subjects

I will create a predicted Pr(HDL_RISK = 1) for two new Non-Hispanic White subjects who are 60 years old, and who rate their overall health as Fair. The first will have a waist circumference of 80 cm, and the second will have a waist circumference of 100 cm.

Note

I can do this using either the glm() or the lrm() fit to our model. Either is sufficient for Project A, and there’s no need to show both approaches.

Code
new_subj <- 
  data.frame(AGE = c(60,60), WAIST = c(80, 100), 
             RACE_ETH = c("NH_White", "NH_White"), SROH = c("Fair", "Fair"))

preds3 <- predict(mod_Y, newdata = new_subj, type = "response")

preds3
        1         2 
0.2256724 0.3949595 
  • Our prediction for subject 1’s Pr(HDL_RISK = 1) is 0.226.
  • Our prediction for subject 2’s Pr(HDL_RISK = 1) is 0.395.

Alternatively, I could have used our lrm() fit.

Code
new_subj <- 
  data.frame(AGE = c(60,60), WAIST = c(80, 100), 
             RACE_ETH = c("NH_White", "NH_White"), SROH = c("Fair", "Fair"))

preds4 <- predict(mod_Y_lrm, newdata = new_subj, type = "fitted")

preds4
        1         2 
0.2256724 0.3949595 

As before, the two fitting approaches (glm() and lrm()) produce the same model, so they produce the same predictions.
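We can also verify these fitted probabilities by hand from the (rounded) Model Y coefficients in Section 9.6.1. Both new subjects are Non-Hispanic White (the reference level for RACE_ETH) and report Fair overall health:

```r
# Linear predictor = Intercept + AGE + WAIST + SROH=Fair terms, then plogis()
lp1 <- -4.5080 + 60 * (-0.0138) + 80  * 0.0403 + 0.8776   # waist 80 cm
lp2 <- -4.5080 + 60 * (-0.0138) + 100 * 0.0403 + 0.8776   # waist 100 cm
plogis(c(lp1, lp2))   # approx 0.225 and 0.394; full precision gives 0.226, 0.395
```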

10 Discussion

Note

I am leaving the discussion up to you, but I have provided an outline of what you need to address below.

10.1 Answering My Research Questions

10.1.1 Question 1 (with Answer)

I’ll leave the job of restating and drawing conclusions about the research question under study in our linear regression analyses for you to answer.

10.1.2 Question 2 (with Answer)

I’ll leave the job of restating and drawing conclusions about the research question under study in our logistic regression analyses for you to answer.

10.2 Thoughts on Project A

Note that we expect you to answer exactly two of the following four questions in your discussion.

10.2.1 Question 1

What was substantially harder or easier than you expected, and why?

If you chose to answer Question 1, your answer would go here.

10.2.2 Question 2

What do you wish you’d known at the start of this process that you know now, and why?

If you chose to answer Question 2, your answer would go here.

10.2.3 Question 3

What was the most confusing part of doing the project, and how did you get past it?

If you chose to answer Question 3, your answer would go here.

10.2.4 Question 4

What was the most useful thing you learned while doing the project, and why?

If you chose to answer Question 4, your answer would go here.

11 Affirmation

I am certain that it is completely appropriate for these NHANES data to be shared with anyone, without any conditions. There are no concerns about privacy or security.

12 References

  1. NHANES 2015-16 data are found at https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2015.
  2. NHANES 2017-18 data are found at https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2017.
  3. Descriptions/definitions of “at risk” levels of HDL cholesterol are provided by the Mayo Clinic at https://www.mayoclinic.org/diseases-conditions/high-blood-cholesterol/in-depth/hdl-cholesterol/art-20046388.

13 Session Information

Code
xfun::session_info()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  abind_1.4-8          askpass_1.2.1        backports_1.5.0     
  base64enc_0.1-3      bayestestR_0.15.0    bigD_0.3.0          
  bit_4.5.0.1          bit64_4.5.2          bitops_1.0.9        
  blob_1.2.4           boot_1.3-31          broom_1.0.7         
  bslib_0.8.0          cachem_1.1.0         callr_3.7.6         
  car_3.1-3            carData_3.0-5        cards_0.4.0         
  caret_7.0-1          caTools_1.18.3       cellranger_1.1.0    
  checkmate_2.3.2      class_7.3-22         cli_3.6.3           
  clipr_0.8.0          clock_0.7.1          cluster_2.1.6       
  coda_0.19-4.1        codetools_0.2-20     colorspace_2.1-1    
  commonmark_1.9.2     compiler_4.4.2       conflicted_1.2.0    
  correlation_0.8.6    cowplot_1.1.3        cpp11_0.5.1         
  crayon_1.5.3         curl_6.1.0           data.table_1.16.4   
  datasets_4.4.2       datawizard_0.13.0    DBI_1.2.3           
  dbplyr_2.5.0         Deriv_4.1.6          diagram_1.6.5       
  digest_0.6.37        doBy_4.6.24          dplyr_1.1.4         
  dtplyr_1.3.1         e1071_1.7-16         easystats_0.7.3     
  effectsize_1.0.0     emmeans_1.10.6       estimability_1.5.1  
  evaluate_1.0.3       fansi_1.0.6          farver_2.1.2        
  fastmap_1.2.0        fontawesome_0.5.3    forcats_1.0.0       
  foreach_1.5.2        foreign_0.8-87       Formula_1.2-5       
  fs_1.6.5             furrr_0.3.1          future_1.34.0       
  future.apply_1.11.3  gargle_1.5.2         generics_0.1.3      
  GGally_2.2.1         ggplot2_3.5.1        ggrepel_0.9.6       
  ggstats_0.8.0        glmnet_4.1-8         globals_0.16.3      
  glue_1.8.0           googledrive_2.1.1    googlesheets4_1.1.1 
  gower_1.0.2          gplots_3.2.0         graphics_4.4.2      
  grDevices_4.4.2      grid_4.4.2           gridExtra_2.3       
  gt_0.11.1            gtable_0.3.6         gtools_3.9.5        
  gtsummary_2.0.4      hardhat_1.4.0        haven_2.5.4         
  highr_0.11           Hmisc_5.2-1          hms_1.1.3           
  htmlTable_2.4.3      htmltools_0.5.8.1    htmlwidgets_1.6.4   
  httr_1.4.7           ids_1.0.1            insight_1.0.0       
  ipred_0.9-15         isoband_0.2.7        iterators_1.0.14    
  janitor_2.2.1        jomo_2.7-6           jquerylib_0.1.4     
  jsonlite_1.8.9       juicyjuice_0.1.0     KernSmooth_2.23.24  
  knitr_1.49           labeling_0.4.3       lattice_0.22-6      
  lava_1.8.0           lifecycle_1.0.4      listenv_0.9.1       
  lme4_1.1-35.5        lubridate_1.9.4      magrittr_2.0.3      
  markdown_1.13        MASS_7.3-64          Matrix_1.7-1        
  MatrixModels_0.5-3   memoise_2.0.1        methods_4.4.2       
  mgcv_1.9-1           mice_3.17.0          microbenchmark_1.5.0
  mime_0.12            minqa_1.2.8          mitml_0.4-5         
  modelbased_0.8.9     ModelMetrics_1.2.2.2 modelr_0.1.11       
  multcomp_1.4-26      munsell_0.5.1        mvtnorm_1.3-2       
  naniar_1.1.0         nlme_3.1-166         nloptr_2.1.1        
  nnet_7.3-20          norm_1.0.11.1        numDeriv_2016.8.1.1 
  openssl_2.3.1        ordinal_2023.12.4.1  pan_1.9             
  parallel_4.4.2       parallelly_1.41.0    parameters_0.24.0   
  patchwork_1.3.0      pbkrtest_0.5.3       performance_0.12.4  
  pillar_1.10.1        pkgconfig_2.0.3      plyr_1.8.9          
  polspline_1.1.25     prettyunits_1.2.0    pROC_1.18.5         
  processx_3.8.5       prodlim_2024.06.25   progress_1.2.3      
  progressr_0.15.1     proxy_0.4-27         ps_1.8.1            
  purrr_1.0.2          quantreg_5.99.1      R6_2.5.1            
  ragg_1.3.3           rappdirs_0.3.3       RColorBrewer_1.1-3  
  Rcpp_1.0.13-1        RcppEigen_0.3.4.0.2  reactable_0.4.4     
  reactR_0.6.1         readr_2.1.5          readxl_1.4.3        
  recipes_1.1.0        rematch_2.0.0        rematch2_2.1.2      
  report_0.5.9         reprex_2.1.1         reshape2_1.4.4      
  rlang_1.1.4          rmarkdown_2.29       rms_6.9-0           
  ROCR_1.0-11          rpart_4.1.24         rsample_1.2.1       
  rstudioapi_0.17.1    rvest_1.0.4          sandwich_3.1-1      
  sass_0.4.9           scales_1.3.0         see_0.9.0           
  selectr_0.4.2        shape_1.4.6.1        slider_0.3.2        
  snakecase_0.11.1     SparseM_1.84-2       splines_4.4.2       
  SQUAREM_2021.1       stats_4.4.2          stats4_4.4.2        
  stringi_1.8.4        stringr_1.5.1        survival_3.8-3      
  sys_3.4.3            systemfonts_1.1.0    textshaping_0.4.1   
  TH.data_1.1-2        tibble_3.2.1         tidyr_1.3.1         
  tidyselect_1.2.1     tidyverse_2.0.0      timechange_0.3.0    
  timeDate_4041.110    tinytex_0.54         tools_4.4.2         
  tzdb_0.4.0           ucminf_1.2.2         UpSetR_1.4.0        
  utf8_1.2.4           utils_4.4.2          uuid_1.2.1          
  V8_6.0.0             vctrs_0.6.5          viridis_0.6.5       
  viridisLite_0.4.2    visdat_0.6.0         vroom_1.6.5         
  warp_0.2.1           withr_3.0.2          xfun_0.50           
  xml2_1.3.6           xtable_1.8-4         yaml_2.3.10         
  zoo_1.8-12