knitr::opts_chunk$set(comment = NA)
library(broom)
library(rsample)
library(yardstick)
library(caret)
library(tidyverse)
theme_set(theme_bw())
16 Validating our Prostate Cancer Model
16.1 R Setup Used Here
16.1.1 Data Load
prost <- read_csv("data/prost.csv", show_col_types = FALSE)
We’ll repeat the data cleaning and model-fitting from our previous chapter.
16.2 Data Cleaning
prost <- prost |>
  mutate(svi_f = fct_recode(factor(svi), "No" = "0", "Yes" = "1"),
         gleason_f = fct_relevel(gleason, c("> 7", "7", "6")),
         bph_f = fct_relevel(bph, c("Low", "Medium", "High")),
         lcavol_c = lcavol - mean(lcavol),
         cavol = exp(lcavol),
         psa = exp(lpsa))
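As a quick optional check (my own addition, not part of the original cleaning step), we can confirm that the recoded factors line up with the source variables used in the mutate() call above:

# Optional check: the recoded factors should track the original columns
prost |> count(svi, svi_f)
prost |> count(gleason, gleason_f)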
16.3 Fitting the prost_A model
prost_A <- lm(lpsa ~ lcavol_c * svi, data = prost)
16.4 Split Validation of Model prost_A
Suppose we want to evaluate whether our model prost_A predicts effectively in new data.
We’ll first demonstrate a validation split approach (used, for instance, in 431), which splits our sample into separate training (perhaps 70% of the data) and test (perhaps 30% of the data) samples, and then:
- fit the model in the training sample,
- use the resulting model to make predictions for lpsa in the test sample, and
- evaluate the quality of those predictions, perhaps by comparing the results to what we’d get using a different model.
Our goal will be to cross-validate model prost_A, which, you’ll recall, uses lcavol_c, svi and their interaction to predict lpsa in the prost data.
We’ll start by identifying a random sample of 70% of our prost data in a training sample (which we’ll call prost_train), and leave the rest as our test sample, called prost_test. To do this, we’ll use functions from the rsample package.
set.seed(432432)
prost_split <- initial_split(prost, prop = 0.7)

prost_train <- training(prost_split)
prost_test <- testing(prost_split)
- Don’t forget to pre-specify the random seed, for replicability, as I’ve done here.
Let’s verify that we now have the samples we expect…
dim(prost_train)
[1] 67 16
dim(prost_test)
[1] 30 16
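As one more optional check (again, my own addition), the training and test samples should partition the full prost data:

# Every row of prost should land in exactly one of the two samples
nrow(prost_train) + nrow(prost_test) == nrow(prost)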
OK. Next, we’ll run the prost_A model in the training sample.
prost_A_train <- lm(lpsa ~ lcavol_c * svi, data = prost_train)

prost_A_train
Call:
lm(formula = lpsa ~ lcavol_c * svi, data = prost_train)
Coefficients:
(Intercept) lcavol_c svi lcavol_c:svi
2.2900 0.6922 1.1317 -0.4269
Then we’ll use the coefficients from this model to obtain predicted lpsa values in the test sample.
prost_A_test_aug <- augment(prost_A_train, newdata = prost_test)
Now, we can use the functions from the yardstick package to obtain several key summaries of fit quality for our model. These summary statistics are:
- the RMSE or root mean squared error, which measures the average difference (i.e. prediction error) between the observed known outcome values and the values predicted by the model by first squaring all of the errors, averaging them, and then taking the square root of the result. The lower the RMSE, the better the model.
- the Rsquared or \(R^2\), which is just the square of the Pearson correlation coefficient relating the predicted and observed values, so we’d like this to be as large as possible, and
- the MAE or mean absolute error, which is a bit less sensitive to outliers than the RMSE, because it measures the average prediction error by taking the absolute value of each error, and then grabbing the average of those values. The lower the MAE, the better the model.
These statistics are more helpful, generally, for comparing multiple models to each other, than for making final decisions on their own. The yardstick package provides individual functions to summarize performance, as follows.
rmse(data = prost_A_test_aug, truth = lpsa, estimate = .fitted)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.813
rsq(data = prost_A_test_aug, truth = lpsa, estimate = .fitted)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rsq standard 0.515
mae(data = prost_A_test_aug, truth = lpsa, estimate = .fitted)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 mae standard 0.672
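To make those definitions concrete, here is a minimal sketch (my addition, not part of the original workflow) that computes the same three summaries directly from the observed lpsa values and the .fitted predictions stored in prost_A_test_aug; the results should agree with the yardstick output above.

# Hand-computed versions of the three summaries reported by yardstick
prost_A_test_aug |>
  summarize(rmse = sqrt(mean((lpsa - .fitted)^2)),
            rsq = cor(lpsa, .fitted)^2,
            mae = mean(abs(lpsa - .fitted)))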
16.5 V-fold Cross-Validation Approach for model prost_A
V-fold cross-validation (also known as k-fold cross-validation) randomly splits the data into V groups of roughly equal size (called “folds”). A resample of the analysis data consists of V-1 of the folds while the assessment set contains the final fold. In basic V-fold cross-validation (i.e. no repeats), the number of resamples is equal to V.
The idea of, for instance, 5-fold cross-validation in this case is to create five different subgroups (or folds) of the data, and then select 4 of the folds to be used as a model training sample, leaving the remaining fold as the model testing sample. We then repeat this over each of the five possible selections of testing sample, and summarize the results. This is very straightforward using the caret package, so we’ll demonstrate that approach here.
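Before turning to caret, here is a small “by hand” sketch of that idea built from the rsample and yardstick tools we loaded earlier. This is just an illustration I’m adding for clarity (it is not how the rest of this chapter proceeds), and because its fold assignments differ from caret’s, its summaries won’t exactly match the results below.

# Illustration only: 5-fold cross-validation of prost_A "by hand"
set.seed(432432)
prost_5fold <- vfold_cv(prost, v = 5)

map_dfr(prost_5fold$splits, function(split) {
  fit <- lm(lpsa ~ lcavol_c * svi, data = analysis(split)) # fit on 4 folds
  held_out <- augment(fit, newdata = assessment(split))    # predict the 5th
  tibble(rmse = rmse_vec(held_out$lpsa, held_out$.fitted),
         rsq = rsq_vec(held_out$lpsa, held_out$.fitted),
         mae = mae_vec(held_out$lpsa, held_out$.fitted))
}) |>
  summarize(across(everything(), mean))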
First, we use the trainControl() function from caret to set up five-fold cross-validation.
set.seed(432432)
ctrl <- trainControl(method = "cv", number = 5)
Next, we train our model on these five folds:
pros_model <- train(lpsa ~ lcavol_c * svi, data = prost,
                    method = "lm", trControl = ctrl)
Now, we can view a summary of the k-fold cross-validation results:
pros_model
Linear Regression
97 samples
2 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 79, 78, 77, 77, 77
Resampling results:
RMSE Rsquared MAE
0.7777655 0.5946201 0.6411997
Tuning parameter 'intercept' was held constant at a value of TRUE
- No pre-processing means we didn’t scale the data before fitting models.
- We used 5-fold cross-validation.
- The sample size for these training sets was between 77 and 79 for each pass.
- The validated root mean squared error (averaged across the five resamplings) was 0.7778.
- The cross-validated R-squared was 0.595.
- The cross-validated mean absolute error was 0.641.
To examine the final fitted model, we have:
pros_model$finalModel
Call:
lm(formula = .outcome ~ ., data = dat)
Coefficients:
(Intercept) lcavol_c svi `lcavol_c:svi`
2.33134 0.58640 0.60132 0.06479
This model can be presented using all of our usual tools from the broom package.
tidy(pros_model$finalModel)
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2.33 0.0913 25.5 8.25e-44
2 lcavol_c 0.586 0.0821 7.15 1.98e-10
3 svi 0.601 0.358 1.68 9.67e- 2
4 `lcavol_c:svi` 0.0648 0.266 0.243 8.08e- 1
glance(pros_model$finalModel)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.581 0.567 0.759 42.9 1.68e-17 3 -109. 228. 241.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
We can also review the model predictions made within each fold:
pros_model$resample
RMSE Rsquared MAE Resample
1 0.7447413 0.7030717 0.6124787 Fold1
2 0.8208461 0.3071658 0.6699472 Fold2
3 0.6510721 0.7944721 0.5589196 Fold3
4 0.8514246 0.4731454 0.6885641 Fold4
5 0.8207437 0.6952452 0.6760887 Fold5
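One optional addition (mine, not part of the original chapter) is to summarize how much these fold-by-fold results vary, by averaging each summary across the five resamples and looking at its spread:

# Mean and standard deviation of each summary across the five folds
pros_model$resample |>
  summarize(across(c(RMSE, Rsquared, MAE), list(mean = mean, sd = sd)))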