16  Validating our Prostate Cancer Model

16.1 R Setup Used Here

knitr::opts_chunk$set(comment = NA)

library(broom)
library(rsample)
library(yardstick)
library(caret)
library(tidyverse) 

theme_set(theme_bw())

16.1.1 Data Load

prost <- read_csv("data/prost.csv", show_col_types = FALSE) 

We’ll repeat the data cleaning and model-fitting from our previous chapter.

16.2 Data Cleaning

prost <- prost |>
    mutate(svi_f = fct_recode(factor(svi), "No" = "0", "Yes" = "1"),
           gleason_f = fct_relevel(gleason, c("> 7", "7", "6")),
           bph_f = fct_relevel(bph, c("Low", "Medium", "High")),
           lcavol_c = lcavol - mean(lcavol),
           cavol = exp(lcavol),
           psa = exp(lpsa))

16.3 Fitting the prostA model

prost_A <- lm(lpsa ~ lcavol_c * svi, data = prost)

16.4 Split Validation of Model prost_A

Suppose we want to evaluate whether our model prost_A predicts effectively in new data.

We’ll first demonstrate a validation split approach (used, for instance, in 431), which splits our sample into separate training (perhaps 70% of the data) and test (perhaps 30% of the data) samples, and then:

  • fit the model in the training sample,
  • use the resulting model to make predictions for lpsa in the test sample, and
  • evaluate the quality of those predictions, perhaps by comparing the results to what we’d get using a different model.

Our goal will be to validate model prost_A, which, you’ll recall, uses lcavol_c, svi and their interaction to predict lpsa in the prost data.

We’ll start by identifying a random sample of 70% of our prost data as a training sample (which we’ll call prost_train), leaving the rest as our test sample, called prost_test. To do this, we’ll use functions from the rsample package.

set.seed(432432)

prost_split <- initial_split(prost, prop = 0.7)

prost_train <- training(prost_split)
prost_test <- testing(prost_split)
  • Don’t forget to pre-specify the random seed, for replicability, as I’ve done here.
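
If we were concerned that the training and test samples might differ substantially on the outcome, one option (not used here) would be to stratify the split on lpsa. Here is a minimal sketch, using the strata argument to initial_split(), which cuts a quantitative outcome into quartiles by default (the object names below are just illustrative):

# stratified 70/30 split on lpsa (illustrative only; not used in what follows)
prost_split_strat <- initial_split(prost, prop = 0.7, strata = lpsa)

prost_train_strat <- training(prost_split_strat)
prost_test_strat <- testing(prost_split_strat)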

Let’s verify that we now have the samples we expect…

dim(prost_train)
[1] 67 16
dim(prost_test)
[1] 30 16

OK. Next, we’ll run the prost_A model in the training sample.

prost_A_train <- lm(lpsa ~ lcavol_c * svi, data = prost_train)

prost_A_train

Call:
lm(formula = lpsa ~ lcavol_c * svi, data = prost_train)

Coefficients:
 (Intercept)      lcavol_c           svi  lcavol_c:svi  
      2.2900        0.6922        1.1317       -0.4269  

Then we’ll use the coefficients from this model to obtain predicted lpsa values in the test sample.

prost_A_test_aug <- augment(prost_A_train, newdata = prost_test)
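
If we’d like to peek at these predictions directly, we might place the observed and fitted values side by side, for instance:

# observed lpsa values alongside the predictions (.fitted) in the test sample
prost_A_test_aug |> 
    select(lpsa, .fitted) |> 
    head()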

Now, we can use the functions from the yardstick package to obtain several key summaries of fit quality for our model. These summary statistics are:

  • the RMSE or root mean squared error, which measures the average prediction error between the observed outcome values and the values predicted by the model, by first squaring all of the errors, averaging them, and then taking the square root of the result. The lower the RMSE, the better the model.
  • the Rsquared or \(R^2\), which is just the square of the Pearson correlation coefficient relating the predicted and observed values, so we’d like this to be as large as possible, and
  • the MAE or mean absolute error, which is a bit less sensitive to outliers than the RMSE, because it measures the average prediction error by taking the absolute value of each error, and then averaging those values. The lower the MAE, the better the model.

These statistics are more helpful, generally, for comparing multiple models to each other, than for making final decisions on their own. The yardstick package provides individual functions to summarize performance, as follows.

rmse(data = prost_A_test_aug, truth = lpsa, estimate = .fitted)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.813
rsq(data = prost_A_test_aug, truth = lpsa, estimate = .fitted)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rsq     standard       0.515
mae(data = prost_A_test_aug, truth = lpsa, estimate = .fitted)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 mae     standard       0.672
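
As a check on our understanding of these definitions, we can reproduce the summaries directly from the prediction errors (lpsa - .fitted) in the augmented test data. The results of this sketch should agree, up to rounding, with the yardstick output above.

# compute the same summaries "by hand" from the prediction errors
prost_A_test_aug |>
    summarise(RMSE = sqrt(mean((lpsa - .fitted)^2)),  # root mean squared error
              Rsq  = cor(lpsa, .fitted)^2,            # squared Pearson correlation
              MAE  = mean(abs(lpsa - .fitted)))       # mean absolute error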

16.5 V-fold Cross-Validation Approach for model prostA

V-fold cross-validation (also known as k-fold cross-validation) randomly splits the data into V groups of roughly equal size (called “folds”). A resample of the analysis data consists of V-1 of the folds while the assessment set contains the final fold. In basic V-fold cross-validation (i.e. no repeats), the number of resamples is equal to V.

The idea of, for instance, 5-fold cross-validation in this case is to create five different subgroups (or folds) of the data, and then select 4 of the folds to be used as a model training sample, leaving the remaining fold as the model testing sample. We then repeat this over each of the five possible selections of testing sample, and summarize the results. This is very straightforward using the caret package, so we’ll demonstrate that approach here.
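
As an aside, the rsample package we loaded earlier can also build these folds directly, which is handy if we want to inspect them ourselves. A minimal sketch (prost_folds is just an illustrative name):

# create 5 folds with rsample, as an alternative to letting caret do the splitting
set.seed(432432)
prost_folds <- vfold_cv(prost, v = 5)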

First, we use the trainControl() function from caret to set up five-fold cross-validation.

set.seed(432432)
ctrl <- trainControl(method = "cv", number = 5)
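
If we instead wanted repeated V-fold cross-validation (recall that we are using the basic, no-repeats version here), trainControl() can set that up, too. A minimal sketch, assuming 5 folds repeated 10 times (ctrl_rep is just an illustrative name and isn’t used below):

# repeated 5-fold cross-validation: 10 repeats of the 5-fold scheme
ctrl_rep <- trainControl(method = "repeatedcv", number = 5, repeats = 10)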

Next, we train our model on these five folds:

pros_model <- train(lpsa ~ lcavol_c * svi, data = prost, 
                    method = "lm", trControl = ctrl)

Now, we can view a summary of the cross-validation results.

pros_model
Linear Regression 

97 samples
 2 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 79, 78, 77, 77, 77 
Resampling results:

  RMSE       Rsquared   MAE      
  0.7777655  0.5946201  0.6411997

Tuning parameter 'intercept' was held constant at a value of TRUE
  • No pre-processing means we didn’t center, scale, or otherwise transform the predictors before fitting the model.
  • We used 5-fold cross-validation.
  • The sample size for the training set was between 77 and 79 in each pass.
  • The cross-validated root mean squared error (averaged across the five resamplings) was 0.7778.
  • The cross-validated R-squared was 0.595.
  • The cross-validated mean absolute error was 0.641.
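
If we prefer these cross-validated summaries as a data frame (for instance, to tabulate them alongside the results for a competing model), caret stores them in the results element of the fitted object:

# cross-validated summaries (and their standard deviations), stored by caret
pros_model$results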

To examine the final fitted model, we have:

pros_model$finalModel

Call:
lm(formula = .outcome ~ ., data = dat)

Coefficients:
   (Intercept)        lcavol_c             svi  `lcavol_c:svi`  
       2.33134         0.58640         0.60132         0.06479  

This model can be presented using all of our usual tools from the broom package.

tidy(pros_model$finalModel)
# A tibble: 4 × 5
  term           estimate std.error statistic  p.value
  <chr>             <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)      2.33      0.0913    25.5   8.25e-44
2 lcavol_c         0.586     0.0821     7.15  1.98e-10
3 svi              0.601     0.358      1.68  9.67e- 2
4 `lcavol_c:svi`   0.0648    0.266      0.243 8.08e- 1
glance(pros_model$finalModel)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.581         0.567 0.759      42.9 1.68e-17     3  -109.  228.  241.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

We can also review the model predictions made within each fold:

pros_model$resample
       RMSE  Rsquared       MAE Resample
1 0.7447413 0.7030717 0.6124787    Fold1
2 0.8208461 0.3071658 0.6699472    Fold2
3 0.6510721 0.7944721 0.5589196    Fold3
4 0.8514246 0.4731454 0.6885641    Fold4
5 0.8207437 0.6952452 0.6760887    Fold5
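
The overall resampling results we saw earlier are just the averages of these per-fold summaries, which we can verify directly:

# averaging the per-fold results should reproduce the resampling summary above
pros_model$resample |>
    summarise(mean_RMSE = mean(RMSE),
              mean_Rsq = mean(Rsquared),
              mean_MAE = mean(MAE))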