Cross-validation is a key method used to assess model generalizability, which describes the extent to which statistical models created from one sample fit other samples from the same population (Song, Tang, and Wee 2021). Essentially, the goal of cross-validation is to mimic the prediction of future individuals from the population, allowing the accuracy of a predictive model's performance to be estimated. It is a tool that can help determine the "best" model for a data set, and it is often used in machine learning and in applications where it is difficult to obtain new data to validate models, such as medical research. The early pioneering work of Stone and Geisser in the 1970s, and Burman's work on leave-one-out cross-validation in the 1980s, set the stage for current cross-validation techniques (Jung and Hu 2015). Today's common techniques include data splitting by the hold-out or validation approach (the most common method), random subsampling by Monte Carlo, k-folds and repeated k-folds, and leave-one-out methods (Ahmed and Nandi 2019).
The general, five-step process of cross-validation is as follows (Song, Tang, and Wee 2021): split the data into a training set and a test set, fit the model on the training set, use the fitted model to predict the test set, compute a measure of prediction error, and repeat the split-fit-predict cycle and aggregate the error estimates.
This process differs in the various cross-validation techniques by varying how the data is split and how many repetitions are performed of the train and test cycles. Details for the various techniques are outlined in the methods section. The best cross-validation technique considers the model’s bias (difference between the population parameter and the cross-validation estimate), the variance (uncertainty in the cross-validation estimates), and the computation costs and time associated with each method.
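To make this concrete, the sketch below walks through one train-and-test cycle in base R on a small simulated data set; the data frame `dat`, its columns `x` and `y`, and the 70/30 split are illustrative assumptions, not the car data analyzed later. The cross-validation techniques described next simply vary how step 1 is performed and how many times steps 1 through 4 are repeated.
# A minimal sketch of one train/test cycle on simulated data
set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y <- 2 * dat$x + rnorm(100)

# Step 1: split the data into a training set and a test set (70/30 here)
test_idx  <- sample(nrow(dat), size = 0.3 * nrow(dat))
train_set <- dat[-test_idx, ]
test_set  <- dat[test_idx, ]

# Step 2: fit the model on the training set
fit <- lm(y ~ x, data = train_set)

# Step 3: predict the held-out test set
pred <- predict(fit, newdata = test_set)

# Step 4: compute a measure of prediction error on the test set
mse <- mean((test_set$y - pred)^2)

# Step 5: repeat steps 1-4 with new splits and aggregate the error estimates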
One limitation of cross-validation methods is that they are not guaranteed to select the true model of the data-generating distribution, even as the number of samples approaches infinity. When two different models are trained on the same training set, the one with more free parameters will often fit the training data better, which can lead to picking the wrong model due to overfitting (Gronau and Wagenmakers 2019).
Techniques for linear models are typically compared using the mean squared error (MSE) or root mean squared error (RMSE), the mean absolute error (MAE), and R-squared (\(R^2\)). RMSE measures how far predicted values fall from the observed values in the data set and is calculated by:
\[RMSE= \sqrt{\frac{\sum_{i=1}^{n}(y_i -\hat{y}_i)^2}{n}}\]
with \(\hat{y}_i\) being the model-predicted value for the ith observation, \(y_i\) being the observed value for the ith observation, and n being the sample size. MAE describes the typical magnitude of the residuals and is calculated by:
\[MAE=\frac{1}{n}\sum_{i=1}^{n}|y_i-\hat{y}_i|\]
\(R^2\) describes how well the model predictors explain the variation in the response variable, that is, the fraction of the variance explained by the model. For simple linear regression, \(R^2\) is the square of the Pearson correlation coefficient \(r\), where \[r=\frac{n\sum(xy)-\sum x\sum y}{\sqrt{[n\sum x^2-(\sum x)^2][n\sum y^2-(\sum y)^2]}}\] More generally, it can be written as:
\[R^2=1-\frac{\sum(y_i-\hat{y}_i)^2}{\sum(y_i-\overline{y})^2}\]
where \(\overline{y}\) is the mean value of y (Ott 2015).
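As a quick illustration, the sketch below computes these three metrics directly from vectors of observed and predicted values; the vectors `obs` and `pred` are made-up examples. The caret helpers RMSE(), MAE(), and R2() used later in the paper report the same kinds of quantities.
# Hypothetical observed and predicted values
obs  <- c(3.5, 7.2, 1.1, 9.8, 4.6)
pred <- c(3.9, 6.5, 1.8, 9.1, 5.0)

rmse <- sqrt(mean((obs - pred)^2))                          # RMSE
mae  <- mean(abs(obs - pred))                               # MAE
r2   <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)  # R-squared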
The most common method of performing cross-validation is the Hold-Out or Validation Technique (Ahmed and Nandi 2019). This method uses a fixed split of the data into training and test sets, such as 90/10, 80/20, 70/30, 60/40, or 50/50. It is an easy, straightforward approach, but because the model is built on only a portion of the data, it may not predict well: the results are sensitive to which observations end up in the training set, which is especially problematic for small sample sizes. For the car data set modeled in this paper, hold-out validation proceeds by randomly splitting the data 70/30 into training and test sets, fitting the linear model on the training set, predicting the held-out test set, and computing RMSE, \(R^2\), and MAE on those predictions.
The k-folds Cross-Validation Technique is one answer to the limitations of the simple hold-out technique. The data is divided into k groups, or folds, and each fold in turn serves as the test set while the remaining folds together form the training set (Jung and Hu 2015). For each fold i, the model is fit on the other k-1 folds and the test error \(MSE_i\) is computed on the held-out fold; the k estimates are then averaged, as sketched in the code following Figure 1:
\[CV_{(k)}=\frac{1}{k}\sum_{i=1}^{k}MSE_i\]
The following image, Figure 1, visually shows how data is split in the k-folds technique.
Figure 1: Visual Depiction of K-folds (Sourced from i2tutorials - Machine Learning-K-Fold)
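The from-scratch sketch below makes explicit the loop that the caret code in the methods section performs internally; it assumes a simulated data frame `dat` with predictor `x` and response `y` (not the car data) and k = 5.
# From-scratch k-fold cross-validation on simulated data (k = 5)
set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y <- 2 * dat$x + rnorm(100)

k <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))   # random fold assignment

fold_mse <- sapply(1:k, function(i) {
  train_i <- dat[folds != i, ]                      # k - 1 folds form the training set
  test_i  <- dat[folds == i, ]                      # fold i is the test set
  fit_i   <- lm(y ~ x, data = train_i)
  mean((test_i$y - predict(fit_i, test_i))^2)       # MSE_i on the held-out fold
})

cv_k <- mean(fold_mse)                              # CV_(k): average of the k fold MSEs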
Expanding on the k-folds technique is the repeated k-folds Cross-Validation Technique (Song, Tang, and Wee 2021), which extends k-folds by conducting multiple repetitions (n), each using a different random assignment of observations to the k folds. If 5 repeats of a 10-fold cross-validation were chosen, 50 (n*k) different models would be fit and evaluated. Because each repetition uses a slightly different partition of the data, the resulting estimates should be even less biased than with a single round of k-folds, but repeating the process many times makes it more time- and labor-intensive, increasing the computational cost.
The Leave-one-out Cross-Validation Technique is a special case of k-folds that systematically excludes each point in the data set and fits the model on the remaining n-1 points; that is, the data is split into sets of size n-1 and 1, n times. For each observation, the cross-validation residual is the difference between the observed value and the model's prediction. This technique has the advantage of a less biased MSE estimate than a single hold-out test, but it can be time-consuming when the data set is large or the model is complex, and can therefore be an expensive method (Derryberry 2014). A from-scratch sketch of the leave-one-out loop is shown below.
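This sketch again uses a small simulated data frame `dat` with predictor `x` and response `y` (an illustrative assumption, not the car data): each observation is held out once, the model is refit on the remaining n - 1 points, and the squared cross-validation residuals are averaged.
# From-scratch leave-one-out cross-validation on simulated data
set.seed(1)
dat <- data.frame(x = rnorm(50))
dat$y <- 2 * dat$x + rnorm(50)

loo_sq_err <- sapply(1:nrow(dat), function(i) {
  fit_i <- lm(y ~ x, data = dat[-i, ])       # fit on the other n - 1 points
  (dat$y[i] - predict(fit_i, dat[i, ]))^2    # squared cross-validation residual
})

loo_mse <- mean(loo_sq_err)                  # LOOCV estimate of the test MSE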
One decision that must be made when using cross-validation methods other than leave-one-out is the number of folds, k, that the data should be divided into. In the article "Performance of Machine Learning Algorithms with Different K Values in K-fold Cross-Validation," k-fold cross-validation was applied to 4 different machine learning algorithms with k values of 3, 5, 7, 10, 15, and 20, and the optimal number of folds changed depending on the model being tested (Nti, Nyarko-Boateng, and Aning 2021). Common practice uses 5 or 10 folds, as these values have been shown to yield favorable test error rate estimates (James 2013). Our paper explores k-folds with k = 3, 5, and 10, and repeated k-folds with 3 and 5 repetitions of 10 folds.
Another technique is the Monte Carlo Cross-Validation Technique, which is similar to a k-fold method, but a predefined portion of the data is randomly selected to form the test set in each repetition, and the remaining portion forms the training set. The train-test process is repeated a predetermined number of times (Wainer and Cawley 2021). In this paper we used a 70/30 data split, repeated 10 times.
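A minimal sketch of this idea, again on a simulated data frame `dat` rather than the car data, repeats a random 70/30 split ten times and averages the test error:
# From-scratch Monte Carlo cross-validation: 10 random 70/30 splits
set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y <- 2 * dat$x + rnorm(100)

mc_mse <- replicate(10, {
  test_idx <- sample(nrow(dat), size = 0.3 * nrow(dat))     # fresh random split
  fit <- lm(y ~ x, data = dat[-test_idx, ])
  mean((dat$y[test_idx] - predict(fit, dat[test_idx, ]))^2)
})

mean(mc_mse)   # Monte Carlo cross-validation estimate of the test MSE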
Additionally, Multiple Predicting Cross-Validation is a technique that can be used for cross-validation but is not modeled in this paper. It is a variation of k-folds in which each fold serves as the training set rather than the validation set; the model trained on a single fold is then evaluated on the remaining data, and the average across the k folds is used to measure model performance (Jung 2018).
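Although this technique is not modeled here, a brief sketch on a simulated data frame `dat` (a hypothetical example, not the car data) shows how the fold roles are reversed relative to ordinary k-folds: the model is trained on a single fold and evaluated on the remaining k - 1 folds.
# Sketch of multiple predicting cross-validation on simulated data (k = 5)
set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y <- 2 * dat$x + rnorm(100)

k <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))

mp_mse <- sapply(1:k, function(i) {
  fit_i <- lm(y ~ x, data = dat[folds == i, ])   # train on fold i only
  hold  <- dat[folds != i, ]                     # predict the remaining k - 1 folds
  mean((hold$y - predict(fit_i, hold))^2)
})

mean(mp_mse)   # average performance across the k training folds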
Cross-validation techniques of hold-out validation, k-folds, repeated k-folds, leave-one-out, and Monte Carlo were performed in R on the Kaggle "Car data.csv" data set. This data set allowed us to model the selling price of a used car in the United Kingdom given the kilometers (kms) driven, the fuel type, the year the car was manufactured, whether the seller was an individual or a dealership, and the type of transmission. The data set has 301 entries and is detailed in Table 1 below.
| Variable Name | Type | Characteristic |
|---|---|---|
| Selling_Price | Response | Numeric |
| Year | Predictor | Numeric |
| Kms_Driven | Predictor | Numeric |
| Fuel_Type | Predictor | Categorical, 3 levels |
| Seller_Type | Predictor | Categorical, 2 levels |
| Transmission | Predictor | Categorical, 2 levels |
Table 1: Car Data Set for Cross Validation
Link to the documentation for the Car data set
Figure 2 below shows a scatter plot of the year of the car versus its selling price. Unsurprisingly, newer cars seem to sell for higher prices than older models.
# Load Data
car_data <- read.csv("car data.csv")

# Selling Price ~ Year
plot(car_data$Year, car_data$Selling_Price,
     xlab = "Year Car Mfg",
     ylab = "Selling Price (thousands)",
     main = "Car Selling Price vs Year Manufactured")
Figure 2: Scatter Plot of Car Selling Price vs. Year Car Manufactured
As Figure 3 shows, the plot of selling price versus kilometers driven does not show any noticeable relationship between the two.
# Selling Price ~ Kms Driven
plot(car_data$Kms_Driven, car_data$Selling_Price,
xlab = "Kms Driven",
ylab = "Selling Price (thousands)",
main = "Car Selling Price vs. kms Driven",
xlim = c(0, 100000))
Figure 3: Scatter Plot of Car Selling Price vs. Kilometers Driven
In Figure 4 below, we can see the selling price is higher for diesel cars, automatic transmissions, and when a dealer sells the car instead of an individual.
# Selling Price ~ Fuel_Type+Transmission+Seller_Type
car_data$Fuel_Type = as.factor(car_data$Fuel_Type)
car_data$Seller_Type = as.factor(car_data$Seller_Type)
car_data$Transmission = as.factor(car_data$Transmission)
par(mfrow = c(1, 3))
plot(Selling_Price~Fuel_Type, data=car_data)
plot(Selling_Price~Transmission, data=car_data)
plot(Selling_Price~Seller_Type, data=car_data)
Figure 4: Boxplots of Selling Price vs Fuel, Transmission, and Seller Type
First, the data was fit to the following linear model: \[Selling\_Price \sim Year + Kms\_Driven + Fuel\_Type + Seller\_Type + Transmission\]
This linear model was fitted using the various cross-validation techniques. First, the hold-out method was performed using a 70/30 data split. Next we explored k-folds, using 3, 5, and 10 folds. The repeated k-folds method was then performed using 10 folds, repeated 3 and 5 times. Leave-one-out and Monte Carlo techniques were also run on the model for comparison.
Noticing the linear model's poor performance, with a low \(R^2\) (0.53), poor normality, and issues with the residuals and outliers, we also considered a log transformation of the response variable, selling price. This transformation produced a better model (\(R^2\) = 0.82, better residual plots) than the untransformed linear model, and we wanted to see how the various cross-validation techniques would perform and compare on it.
\[\log(Selling\_Price) \sim Year + Kms\_Driven + Fuel\_Type + Seller\_Type + Transmission\]
Hold-Out
library(caret)
library(ggplot2)
library(tidyverse)
library(ggfortify)
library(caTools)
set.seed(1)
car_data = read.csv("car data.csv")

# model normal linear model
model1 <- lm(Selling_Price ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year, data = car_data)
#autoplot(model1)
#summary(model1)

# Hold Out
sample <- sample.split(car_data$Year, SplitRatio = 0.7)
train <- subset(car_data, sample == TRUE)
test <- subset(car_data, sample == FALSE)

model_train = lm(Selling_Price ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year, data = train)
test_pred = predict(model_train, test)

#data.frame(R_squared = R2(test_pred, test$Selling_Price),
#           RMSE = RMSE(test_pred, test$Selling_Price),
#           MAE = MAE(test_pred, test$Selling_Price))
# R_squared     RMSE      MAE
#  0.675032 3.066001 2.059764
k-folds Cross-Validation 10
#k-folds Cross-Validation 10
ctrl10 <- trainControl(method = "cv", number = 10)
cv_model10 <- train(Selling_Price ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                    method = "lm",
                    data = car_data,
                    trControl = ctrl10)
#     RMSE  Rsquared      MAE
# 3.177328 0.6034333 2.084711
k-folds Cross-Validation 3
#k-folds Cross-Validation 3
ctrl3 <- trainControl(method = "cv", number = 3)
cv_model3 <- train(Selling_Price ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                   method = "lm",
                   data = car_data,
                   trControl = ctrl3)
#cv_model3$results
#     RMSE  Rsquared      MAE
# 3.341571 0.5671751 2.080826
k-folds Cross-Validation 5
#k-folds Cross-Validation 5
ctrl5 <- trainControl(method = "cv", number = 5)
cv_model5 <- train(Selling_Price ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                   method = "lm",
                   data = car_data,
                   trControl = ctrl5)
#cv_model5$results
#     RMSE  Rsquared      MAE
# 3.252752 0.5970045 2.088944
10-fold repeated 5
#10-fold repeated 5
ctrl10_5 <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
cv_model10_5 <- train(Selling_Price ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                      method = "lm",
                      data = car_data,
                      trControl = ctrl10_5)
#cv_model10_5$results
#     RMSE  Rsquared     MAE
# 3.230983 0.6018475 2.09941
10-fold repeated 3
#10-fold repeated 3
ctrl10_3 <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
cv_model10_3 <- train(Selling_Price ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                      method = "lm",
                      data = car_data,
                      trControl = ctrl10_3)
#cv_model10_3$results
#     RMSE  Rsquared      MAE
# 3.188268 0.6032351 2.093896
Leave-one-out Cross-Validation
#Leave-one-out Cross-Validation
ctrlLOOCV <- trainControl(method = "LOOCV")
LOO_model <- train(Selling_Price ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                   method = "lm",
                   data = car_data,
                   trControl = ctrlLOOCV)
#LOO_model$results
#     RMSE  Rsquared      MAE
# 3.380532 0.5563956 2.089128
Monte Carlo Cross-Validation
#Monte Carlo Cross-Validation
ctrlMC <- trainControl(method = "LGOCV", number = 10)
MC_model <- train(Selling_Price ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                  method = "lm",
                  data = car_data,
                  trControl = ctrlMC)
#MC_model$results
#     RMSE  Rsquared      MAE
# 3.375063 0.5821096 2.077517
Hold-Out
#Log Model
model2 <- lm(log(Selling_Price) ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year, data = car_data)
#autoplot(model2)
#summary(model2)

# Hold Out
samplel <- sample.split(car_data$Year, SplitRatio = 0.7)
trainl <- subset(car_data, samplel == TRUE)
testl <- subset(car_data, samplel == FALSE)

model_trainl = lm(log(Selling_Price) ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year, data = trainl)
test_predl = predict(model_trainl, testl)

#data.frame(R_squared = R2(test_predl, log(testl$Selling_Price)),
#           RMSE = RMSE(test_predl, log(testl$Selling_Price)),
#           MAE = MAE(test_predl, log(testl$Selling_Price)))
# R_squared RMSE MAE
k-folds Cross-Validation 10
#k-folds Cross-Validation 10
ctrl10l <- trainControl(method = "cv", number = 10)
cv_model10l <- train(log(Selling_Price) ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                     method = "lm",
                     data = car_data,
                     trControl = ctrl10l)
#cv_model10l$results
# RMSE Rsquared MAE
k-folds Cross-Validation 3
#k-folds Cross-Validation 3
ctrl3l <- trainControl(method = "cv", number = 3)
cv_model3l <- train(log(Selling_Price) ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                    method = "lm",
                    data = car_data,
                    trControl = ctrl3l)
#cv_model3l$results
k-folds Cross-Validation 5
#k-folds Cross-Validation 5
ctrl5l <- trainControl(method = "cv", number = 5)
cv_model5l <- train(log(Selling_Price) ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                    method = "lm",
                    data = car_data,
                    trControl = ctrl5l)
#cv_model5l$results
10-fold repeated 5
#10-fold repeated 5
ctrl10_5l <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
cv_model10_5l <- train(log(Selling_Price) ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                       method = "lm",
                       data = car_data,
                       trControl = ctrl10_5l)
#cv_model10_5l$results
#RMSE Rsquared MAE
10-fold repeated 3
#10-fold repeated 3
ctrl10_3l <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
cv_model10_3l <- train(log(Selling_Price) ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                       method = "lm",
                       data = car_data,
                       trControl = ctrl10_3l)
#cv_model10_3l$results
# RMSE Rsquared MAE
Leave-one-out Cross-Validation
#Leave-one-out Cross-Validation
ctrlLOOCVl <- trainControl(method = "LOOCV")
LOO_modell <- train(log(Selling_Price) ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                    method = "lm",
                    data = car_data,
                    trControl = ctrlLOOCVl)
#LOO_modell$results
# RMSE Rsquared MAE
Monte Carlo Cross-Validation
#Monte Carlo Cross-Validation
ctrlMCl <- trainControl(method = "LGOCV", number = 10)
MC_modell <- train(log(Selling_Price) ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                   method = "lm",
                   data = car_data,
                   trControl = ctrlMCl)
#MC_modell$results
Table 2 below shows the prediction accuracy results from the cross-validation techniques applied to the linear model.
| | Monte Carlo | k-folds k=3 | k-folds k=5 | k-folds k=10 |
|---|---|---|---|---|
| RMSE | 3.375063 | 3.389747 | 3.35485 | 3.177328 |
| Rsquared | 0.5821096 | 0.5431807 | 0.5792003 | 0.6034333 |
| MAE | 2.077517 | 2.121226 | 2.078258 | 2.084711 |
| | 10-fold repeated 5 times | 10-fold repeated 3 times | Leave-one-out | Hold-Out |
|---|---|---|---|---|
| RMSE | 3.230983 | 3.188268 | 3.380532 | 3.066001 |
| Rsquared | 0.6018475 | 0.6032351 | 0.5563956 | 0.675032 |
| MAE | 2.09941 | 2.093896 | 2.089128 | 2.059764 |
Table 2: Cross-Validation on Cars Data - Linear Model
Looking at the results for the linear model, based on its high \(R^2\) and low RMSE and MAE, the hold-out method appears to be the superior cross-validation technique, with an \(R^2\) of 0.675, an RMSE of 3.066, and an MAE of 2.060.
The hold-out model can be summarized with the following equation: \[\begin{align*}Selling\_Price = -601.9555 - 0.000004\,Kms\_Driven + 6.441331\,Fuel\_TypeDiesel \\ + 1.825635\,Fuel\_TypePetrol - 4.18545\,Seller\_TypeIndividual\\ - 3.891009\,TransmissionManual + 0.3023\,Year \end{align*}\]
However, none of the models predict the selling price particularly well; a linear model simply does not fit this data set well.
Reviewing the log-transformation cross-validation models, we observe better models overall. Surprisingly, we still see the hold-out method performing slightly better than the others, although all models have similar RMSE, \(R^2\), and MAE values. The hold-out method has RMSE = 0.554, \(R^2\) = 0.828, and MAE = 0.401. We attribute the hold-out method's better performance to the luck of the data split; here (again) the random 70% training split happened to be representative of the test sample.
| | Monte Carlo | k-folds k=3 | k-folds k=5 | k-folds k=10 |
|---|---|---|---|---|
| RMSE | 0.6000005 | 0.5597472 | 0.5546057 | 0.5547619 |
| Rsquared | 0.7844457 | 0.8191809 | 0.81445 | 0.8097507 |
| MAE | 0.4230068 | 0.4125619 | 0.406683 | 0.4080991 |
| | 10-fold repeated 5 times | 10-fold repeated 3 times | Leave-one-out | Hold-Out |
|---|---|---|---|---|
| RMSE | 0.5429957 | 0.5415532 | 0.5564873 | 0.5544994 |
| Rsquared | 0.8180418 | 0.8220338 | 0.807924 | 0.8283292 |
| MAE | 0.4056521 | 0.4047157 | 0.4072408 | 0.400559 |
Table 3: Cross-Validation on Cars Data - Log Transformation Model
The transformed hold-out model can be summarized with the following equation: \[\begin{align*}\log(Selling\_Price) = -187.9796 - 0.000011\,Kms\_Driven + 0.9575\,Fuel\_TypeDiesel \\ + 0.4335012\,Fuel\_TypePetrol - 2.078542\,Seller\_TypeIndividual\\ - 0.3952547\,TransmissionManual + 0.09409\,Year \end{align*}\]
This paper demonstrates various cross-validation techniques that can be used for model validation, including hold-out, k-folds and repeated k-folds, leave-one-out, and Monte Carlo cross-validation. As cross-validation tends to be used in machine learning on large data sets, or in situations where it is hard to gather new data to test a model, our example allowed us to fit a linear regression to a full data set and then compare the performance of the cross-validation techniques. We compared both a linear model and a log-transformed linear model and found that, for the most part, the cross-validation techniques performed similarly and produced comparable accuracy estimates. While k-folds, repeated k-folds, and hold-out appeared to perform best, most of these techniques seem sufficient to help validate a model based on this example. One should also consider the time and resources needed to perform each test; even with our small data set, the repeated k-folds and leave-one-out techniques took considerably longer to run.