Cross-validation is a key method used to assess model generalizability, which describes the extent to which statistical models created from one sample fit other samples from the same population (Song, Tang, and Wee 2021). Essentially, the goal of cross-validation is to mimic the prediction of future individuals from the population, allowing the accuracy of a predictive model's performance to be estimated. It is a tool that can help determine the "best" model for a data set, and it is often used in machine learning and in applications where it is difficult to obtain new data to validate models, such as medical research. The early pioneering work of Stone and Geisser in the 1970s, and Burman's work on leave-one-out cross-validation in the 1980s, set the stage for current cross-validation techniques (Jung and Hu 2015). Today's common techniques include data splitting by the hold-out or validation approach (the most common method), random subsampling by Monte Carlo, k-folds and repeated k-folds, and leave-one-out methods (Ahmed and Nandi 2019).
The general, five-step process of cross-validation is as follows (Song, Tang, and Wee 2021): split the data into a training set and a test set, fit the model on the training set, use the fitted model to predict the test set, compute a measure of prediction error, and repeat the split-fit-predict cycle and aggregate the error estimates.
This process differs in the various cross-validation techniques by varying how the data is split and how many repetitions are performed of the train and test cycles. Details for the various techniques are outlined in the methods section. The best cross-validation technique considers the model’s bias (difference between the population parameter and the cross-validation estimate), the variance (uncertainty in the cross-validation estimates), and the computation costs and time associated with each method.
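To make this concrete, the sketch below walks through one train-and-test cycle in base R on a small simulated data set; the data frame `dat`, its columns `x` and `y`, and the 70/30 split are illustrative assumptions, not the car data analyzed later. The cross-validation techniques described next simply vary how step 1 is performed and how many times steps 1 through 4 are repeated.
# A minimal sketch of one train/test cycle on simulated data
set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y <- 2 * dat$x + rnorm(100)

# Step 1: split the data into a training set and a test set (70/30 here)
test_idx  <- sample(nrow(dat), size = 0.3 * nrow(dat))
train_set <- dat[-test_idx, ]
test_set  <- dat[test_idx, ]

# Step 2: fit the model on the training set
fit <- lm(y ~ x, data = train_set)

# Step 3: predict the held-out test set
pred <- predict(fit, newdata = test_set)

# Step 4: compute a measure of prediction error on the test set
mse <- mean((test_set$y - pred)^2)

# Step 5: repeat steps 1-4 with new splits and aggregate the error estimates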
One limitation of cross-validation methods is that they are not guaranteed to select the true model of the data-generating distribution, even as the number of samples approaches infinity. When two different models are trained on the same training set, the one with more free parameters will often fit the training data better, which can lead to picking the wrong model due to overfitting (Gronau and Wagenmakers 2019).
Techniques for linear models are typically compared using the mean squared error (MSE) or root mean squared error (RMSE), the mean absolute error (MAE), and R-squared (\(R^2\)). RMSE measures how far predicted values fall from the observed values in the data set and is calculated by:
\[RMSE= \sqrt{\frac{\sum_{i=1}^{n}(y_i -\hat{y}_i)^2}{n}}\]
with \(\hat{y}_i\) being the model-predicted value for the ith observation, \(y_i\) being the observed value for the ith observation, and n being the sample size. MAE describes the typical magnitude of the residuals and is calculated by:
\[MAE=\frac{1}{n}\sum_{i=1}^{n}|y_i-\hat{y}_i|\]
\(R^2\) describes how well the model predictors explain the variation in the response variable, that is, the fraction of the variance explained by the model. For simple linear regression, \(R^2\) is the square of the Pearson correlation coefficient \(r\), where \[r=\frac{n\sum(xy)-\sum x\sum y}{\sqrt{[n\sum x^2-(\sum x)^2][n\sum y^2-(\sum y)^2]}}\] More generally, it can be written as:
\[R^2=1-\frac{\sum(y_i-\hat{y}_i)^2}{\sum(y_i-\overline{y})^2}\]
where \(\overline{y}\) is the mean value of y (Ott 2015).
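As a quick illustration, the sketch below computes these three metrics directly from vectors of observed and predicted values; the vectors `obs` and `pred` are made-up examples. The caret helpers RMSE(), MAE(), and R2() used later in the paper report the same kinds of quantities.
# Hypothetical observed and predicted values
obs  <- c(3.5, 7.2, 1.1, 9.8, 4.6)
pred <- c(3.9, 6.5, 1.8, 9.1, 5.0)

rmse <- sqrt(mean((obs - pred)^2))                          # RMSE
mae  <- mean(abs(obs - pred))                               # MAE
r2   <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)  # R-squared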
The most common method of performing cross-validation is the Hold-Out or Validation Technique (Ahmed and Nandi 2019). This method uses a fixed split of the data into training and test sets, such as 90/10, 80/20, 70/30, 60/40, or 50/50. It is an easy, straightforward approach, but because the model is built on only a portion of the data, it may not predict well: the results are sensitive to which observations end up in the training set, which is especially problematic for small sample sizes. For the car data set modeled in this paper, hold-out validation proceeds by randomly splitting the data 70/30 into training and test sets, fitting the linear model on the training set, predicting the held-out test set, and computing RMSE, \(R^2\), and MAE on those predictions.
The k-folds Cross-Validation Technique is one answer to the limitations of the simple hold-out technique. The data is divided into k groups, or folds, and each fold in turn serves as the test set while the remaining folds together form the training set (Jung and Hu 2015). For each fold i, the model is fit on the other k-1 folds and the test error \(MSE_i\) is computed on the held-out fold; the k estimates are then averaged, as sketched in the code following Figure 1:
\[CV_{(k)}=\frac{1}{k}\sum_{i=1}^{k}MSE_i\]
The following image, Figure 1, visually shows how data is split in the k-folds technique.
Figure 1: Visual Depiction of K-folds (Sourced from i2tutorials - Machine Learning-K-Fold)
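The from-scratch sketch below makes explicit the loop that the caret code in the methods section performs internally; it assumes a simulated data frame `dat` with predictor `x` and response `y` (not the car data) and k = 5.
# From-scratch k-fold cross-validation on simulated data (k = 5)
set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y <- 2 * dat$x + rnorm(100)

k <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))   # random fold assignment

fold_mse <- sapply(1:k, function(i) {
  train_i <- dat[folds != i, ]                      # k - 1 folds form the training set
  test_i  <- dat[folds == i, ]                      # fold i is the test set
  fit_i   <- lm(y ~ x, data = train_i)
  mean((test_i$y - predict(fit_i, test_i))^2)       # MSE_i on the held-out fold
})

cv_k <- mean(fold_mse)                              # CV_(k): average of the k fold MSEs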
Expanding on the k-folds technique is the repeated k-folds Cross-Validation Technique (Song, Tang, and Wee 2021), which extends k-folds by conducting multiple repetitions (n), each using a different random assignment of observations to the k folds. If 5 repeats of a 10-fold cross-validation were chosen, 50 (n*k) different models would be fit and evaluated. Because each repetition uses a slightly different partition of the data, the resulting estimates should be even less biased than with a single round of k-folds, but repeating the process many times makes it more time- and labor-intensive, increasing the computational cost.
The Leave-one-out Cross-Validation Technique is a special case of k-folds that systematically excludes each point in the data set and fits the model on the remaining n-1 points; that is, the data is split into sets of size n-1 and 1, n times. For each observation, the cross-validation residual is the difference between the observed value and the model's prediction. This technique has the advantage of a less biased MSE estimate than a single hold-out test, but it can be time-consuming when the data set is large or the model is complex, and can therefore be an expensive method (Derryberry 2014). A from-scratch sketch of the leave-one-out loop is shown below.
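This sketch again uses a small simulated data frame `dat` with predictor `x` and response `y` (an illustrative assumption, not the car data): each observation is held out once, the model is refit on the remaining n - 1 points, and the squared cross-validation residuals are averaged.
# From-scratch leave-one-out cross-validation on simulated data
set.seed(1)
dat <- data.frame(x = rnorm(50))
dat$y <- 2 * dat$x + rnorm(50)

loo_sq_err <- sapply(1:nrow(dat), function(i) {
  fit_i <- lm(y ~ x, data = dat[-i, ])       # fit on the other n - 1 points
  (dat$y[i] - predict(fit_i, dat[i, ]))^2    # squared cross-validation residual
})

loo_mse <- mean(loo_sq_err)                  # LOOCV estimate of the test MSE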
One decision that must be made when using cross-validation methods other than leave-one-out is the number of folds, k, that the data should be divided into. In the article "Performance of Machine Learning Algorithms with Different K Values in K-fold Cross-Validation," k-fold cross-validation was applied to 4 different machine learning algorithms with k values of 3, 5, 7, 10, 15, and 20, and the optimal number of folds changed depending on the model being tested (Nti, Nyarko-Boateng, and Aning 2021). Common practice uses 5 or 10 folds, as these values have been shown to yield favorable test error rate estimates (James 2013). Our paper explores k-folds with k = 3, 5, and 10, and repeated k-folds with 3 and 5 repetitions of 10 folds.
Another technique is the Monte Carlo Cross-Validation Technique, which is similar to a k-fold method, but a predefined portion of the data is randomly selected to form the test set in each repetition, and the remaining portion forms the training set. The train-test process is repeated a predetermined number of times (Wainer and Cawley 2021). In this paper we used a 70/30 data split, repeated 10 times.
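A minimal sketch of this idea, again on a simulated data frame `dat` rather than the car data, repeats a random 70/30 split ten times and averages the test error:
# From-scratch Monte Carlo cross-validation: 10 random 70/30 splits
set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y <- 2 * dat$x + rnorm(100)

mc_mse <- replicate(10, {
  test_idx <- sample(nrow(dat), size = 0.3 * nrow(dat))     # fresh random split
  fit <- lm(y ~ x, data = dat[-test_idx, ])
  mean((dat$y[test_idx] - predict(fit, dat[test_idx, ]))^2)
})

mean(mc_mse)   # Monte Carlo cross-validation estimate of the test MSE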
Additionally, Multiple Predicting Cross-Validation is a technique that can be used for cross-validation but is not modeled in this paper. It is a variation of k-folds in which each fold serves as the training set rather than the validation set; the model trained on a single fold is then evaluated on the remaining data, and the average across the k folds is used to measure model performance (Jung 2018).
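Although this technique is not modeled here, a brief sketch on a simulated data frame `dat` (a hypothetical example, not the car data) shows how the fold roles are reversed relative to ordinary k-folds: the model is trained on a single fold and evaluated on the remaining k - 1 folds.
# Sketch of multiple predicting cross-validation on simulated data (k = 5)
set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y <- 2 * dat$x + rnorm(100)

k <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))

mp_mse <- sapply(1:k, function(i) {
  fit_i <- lm(y ~ x, data = dat[folds == i, ])   # train on fold i only
  hold  <- dat[folds != i, ]                     # predict the remaining k - 1 folds
  mean((hold$y - predict(fit_i, hold))^2)
})

mean(mp_mse)   # average performance across the k training folds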
Cross-validation techniques of hold-out validation, k-folds, repeated k-folds, leave-one-out, and Monte Carlo were performed in R on the Kaggle "Car data.csv" data set. This data set allowed us to model the selling price of a used car in the United Kingdom given the kilometers (kms) driven, the fuel type, the year the car was manufactured, whether the seller was an individual or a dealership, and the type of transmission. The data set has 301 entries and is detailed in Table 1 below.
| Variable Name | Type | Characteristic |
|---|---|---|
| Selling_Price | Response | Numeric |
| Year | Predictor | Numeric |
| Kms_Driven | Predictor | Numeric |
| Fuel_Type | Predictor | Categorical, 3 levels |
| Seller_Type | Predictor | Categorical, 2 levels |
| Transmission | Predictor | Categorical, 2 levels |
Table 1: Car Data Set for Cross Validation
Link to the documentation for the Car data set
Figure 2 below shows a scatter plot of the year of the car versus its selling price. Unsurprisingly, newer cars seem to sell for higher prices than older models.
# Load Data
car_data <- read.csv("car data.csv")

# Selling Price ~ Year
plot(car_data$Year, car_data$Selling_Price,
     xlab = "Year Car Mfg",
     ylab = "Selling Price (thousands)",
     main = "Car Selling Price vs Year Manufactured")
Figure 2: Scatter Plot of Car Selling Price vs. Year Car Manufactured
As Figure 3 shows, the plot of selling price versus kilometers driven does not show any noticeable relationship between the two.
# Selling Price ~ Kms Driven
plot(car_data$Kms_Driven, car_data$Selling_Price,
xlab = "Kms Driven",
ylab = "Selling Price (thousands)",
main = "Car Selling Price vs. kms Driven",
xlim = c(0, 100000))
Figure 3: Scatter Plot of Car Selling Price vs. Kilometers Driven
In Figure 4 below, we can see the selling price is higher for diesel cars, automatic transmissions, and when a dealer sells the car instead of an individual.
# Selling Price ~ Fuel_Type+Transmission+Seller_Type
car_data$Fuel_Type = as.factor(car_data$Fuel_Type)
car_data$Seller_Type = as.factor(car_data$Seller_Type)
car_data$Transmission = as.factor(car_data$Transmission)
par(mfrow = c(1, 3))
plot(Selling_Price~Fuel_Type, data=car_data)
plot(Selling_Price~Transmission, data=car_data)
plot(Selling_Price~Seller_Type, data=car_data)
Figure 4: Boxplots of Selling Price vs Fuel, Transmission, and Seller Type
First, the data was fit to the following linear model: \[Selling\_Price \sim Year + Kms\_Driven + Fuel\_Type + Seller\_Type + Transmission\]
This linear model was fitted using the various cross-validation techniques. First, the hold-out method was performed using a 70/30 data split. Next we explored k-folds, using 3, 5, and 10 folds. The repeated k-folds method was then performed using 10 folds, repeated 3 and 5 times. Leave-one-out and Monte Carlo techniques were also run on the model for comparison.
Noticing the linear model's poor performance, with a low \(R^2\) (0.53), poor normality, and issues with the residuals and outliers, we also considered a log transformation of the response variable, selling price. This transformation produced a better model (\(R^2\) = 0.82, better residual plots) than the untransformed linear model, and we wanted to see how the various cross-validation techniques would perform and compare on it.
\[\log(Selling\_Price) \sim Year + Kms\_Driven + Fuel\_Type + Seller\_Type + Transmission\]
Hold-Out
library(caret)
library(ggplot2)
library(tidyverse)
library(ggfortify)
library(caTools)
set.seed(1)
car_data = read.csv("car data.csv")

# model normal linear model
model1 <- lm(Selling_Price ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year, data = car_data)
#autoplot(model1)
#summary(model1)

# Hold Out
sample <- sample.split(car_data$Year, SplitRatio = 0.7)
train <- subset(car_data, sample == TRUE)
test <- subset(car_data, sample == FALSE)

model_train = lm(Selling_Price ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year, data = train)
test_pred = predict(model_train, test)

#data.frame(R_squared = R2(test_pred, test$Selling_Price),
#           RMSE = RMSE(test_pred, test$Selling_Price),
#           MAE = MAE(test_pred, test$Selling_Price))
# R_squared     RMSE      MAE
#  0.675032 3.066001 2.059764
k-folds Cross-Validation 10
#k-folds Cross-Validation 10
ctrl10 <- trainControl(method = "cv", number = 10)
cv_model10 <- train(Selling_Price ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                    method = "lm",
                    data = car_data,
                    trControl = ctrl10)
#     RMSE  Rsquared      MAE
# 3.177328 0.6034333 2.084711
k-folds Cross-Validation 3
#k-folds Cross-Validation 3
ctrl3 <- trainControl(method = "cv", number = 3)
cv_model3 <- train(Selling_Price ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                   method = "lm",
                   data = car_data,
                   trControl = ctrl3)
#cv_model3$results
#     RMSE  Rsquared      MAE
# 3.341571 0.5671751 2.080826
k-folds Cross-Validation 5
#k-folds Cross-Validation 5
ctrl5 <- trainControl(method = "cv", number = 5)
cv_model5 <- train(Selling_Price ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                   method = "lm",
                   data = car_data,
                   trControl = ctrl5)
#cv_model5$results
#     RMSE  Rsquared      MAE
# 3.252752 0.5970045 2.088944
10-fold repeated 5
#10-fold repeated 5
ctrl10_5 <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
cv_model10_5 <- train(Selling_Price ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                      method = "lm",
                      data = car_data,
                      trControl = ctrl10_5)
#cv_model10_5$results
#     RMSE  Rsquared     MAE
# 3.230983 0.6018475 2.09941
10-fold repeated 3
#10-fold repeated 3
ctrl10_3 <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
cv_model10_3 <- train(Selling_Price ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                      method = "lm",
                      data = car_data,
                      trControl = ctrl10_3)
#cv_model10_3$results
#     RMSE  Rsquared      MAE
# 3.188268 0.6032351 2.093896
Leave-one-out Cross-Validation
#Leave-one-out Cross-Validation
ctrlLOOCV <- trainControl(method = "LOOCV")
LOO_model <- train(Selling_Price ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                   method = "lm",
                   data = car_data,
                   trControl = ctrlLOOCV)
#LOO_model$results
#     RMSE  Rsquared      MAE
# 3.380532 0.5563956 2.089128
Monte Carlo Cross-Validation
#Monte Carlo Cross-Validation
ctrlMC <- trainControl(method = "LGOCV", number = 10)
MC_model <- train(Selling_Price ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                  method = "lm",
                  data = car_data,
                  trControl = ctrlMC)
#MC_model$results
#     RMSE  Rsquared      MAE
# 3.375063 0.5821096 2.077517
Hold-Out
#Log Model
model2 <- lm(log(Selling_Price) ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year, data = car_data)
#autoplot(model2)
#summary(model2)

# Hold Out
samplel <- sample.split(car_data$Year, SplitRatio = 0.7)
trainl <- subset(car_data, samplel == TRUE)
testl <- subset(car_data, samplel == FALSE)

model_trainl = lm(log(Selling_Price) ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year, data = trainl)
test_predl = predict(model_trainl, testl)

#data.frame(R_squared = R2(test_predl, log(testl$Selling_Price)),
#           RMSE = RMSE(test_predl, log(testl$Selling_Price)),
#           MAE = MAE(test_predl, log(testl$Selling_Price)))
# R_squared RMSE MAE
k-folds Cross-Validation 10
#k-folds Cross-Validation 10
ctrl10l <- trainControl(method = "cv", number = 10)
cv_model10l <- train(log(Selling_Price) ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                     method = "lm",
                     data = car_data,
                     trControl = ctrl10l)
#cv_model10l$results
# RMSE Rsquared MAE
k-folds Cross-Validation 3
#k-folds Cross-Validation 3
ctrl3l <- trainControl(method = "cv", number = 3)
cv_model3l <- train(log(Selling_Price) ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                    method = "lm",
                    data = car_data,
                    trControl = ctrl3l)
#cv_model3l$results
k-folds Cross-Validation 5
#k-folds Cross-Validation 5
ctrl5l <- trainControl(method = "cv", number = 5)
cv_model5l <- train(log(Selling_Price) ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                    method = "lm",
                    data = car_data,
                    trControl = ctrl5l)
#cv_model5l$results
10-fold repeated 5
#10-fold repeated 5
ctrl10_5l <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
cv_model10_5l <- train(log(Selling_Price) ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                       method = "lm",
                       data = car_data,
                       trControl = ctrl10_5l)
#cv_model10_5l$results
#RMSE Rsquared MAE
10-fold repeated 3
#10-fold repeated 3
ctrl10_3l <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
cv_model10_3l <- train(log(Selling_Price) ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                       method = "lm",
                       data = car_data,
                       trControl = ctrl10_3l)
#cv_model10_3l$results
# RMSE Rsquared MAE
Leave-one-out Cross-Validation
#Leave-one-out Cross-Validation
ctrlLOOCVl <- trainControl(method = "LOOCV")
LOO_modell <- train(log(Selling_Price) ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                    method = "lm",
                    data = car_data,
                    trControl = ctrlLOOCVl)
#LOO_modell$results
# RMSE Rsquared MAE
Monte Carlo Cross-Validation
#Monte Carlo Cross-Validation
ctrlMCl <- trainControl(method = "LGOCV", number = 10)
MC_modell <- train(log(Selling_Price) ~ Kms_Driven + Fuel_Type + Seller_Type + Transmission + Year,
                   method = "lm",
                   data = car_data,
                   trControl = ctrlMCl)
#MC_modell$results
Table 2 below shows the prediction accuracy results from the cross-validation techniques applied to the linear model.
| | Monte Carlo | k-folds k=3 | k-folds k=5 | k-folds k=10 |
|---|---|---|---|---|
| RMSE | 3.375063 | 3.389747 | 3.35485 | 3.177328 |
| Rsquared | 0.5821096 | 0.5431807 | 0.5792003 | 0.6034333 |
| MAE | 2.077517 | 2.121226 | 2.078258 | 2.084711 |
| | 10-fold repeated 5 times | 10-fold repeated 3 times | Leave-one-out | Hold-Out |
|---|---|---|---|---|
| RMSE | 3.230983 | 3.188268 | 3.380532 | 3.066001 |
| Rsquared | 0.6018475 | 0.6032351 | 0.5563956 | 0.675032 |
| MAE | 2.09941 | 2.093896 | 2.089128 | 2.059764 |
Table 2: Cross-Validation on Cars Data - Linear Model
Looking at the results for the linear model, based on its high \(R^2\) and low RMSE and MAE, the hold-out method appears to be the superior cross-validation technique, with an \(R^2\) of 0.675, an RMSE of 3.066, and an MAE of 2.060.
The hold-out model can be summarized with the following equation: \[\begin{align*}Selling\_Price = -601.9555 - 0.000004\,Kms\_Driven + 6.441331\,Fuel\_TypeDiesel \\ + 1.825635\,Fuel\_TypePetrol - 4.18545\,Seller\_TypeIndividual\\ - 3.891009\,TransmissionManual + 0.3023\,Year \end{align*}\]
However, none of the models predict the selling price particularly well; a linear model simply does not fit this data set well.
Reviewing the log-transformation cross-validation models, we observe better models overall. Surprisingly, we still see the hold-out method performing slightly better than the others, although all models have similar RMSE, \(R^2\), and MAE values. The hold-out method has RMSE = 0.554, \(R^2\) = 0.828, and MAE = 0.401. We attribute the hold-out method's better performance to the luck of the data split; here (again) the random 70% training split happened to be representative of the test sample.
| | Monte Carlo | k-folds k=3 | k-folds k=5 | k-folds k=10 |
|---|---|---|---|---|
| RMSE | 0.6000005 | 0.5597472 | 0.5546057 | 0.5547619 |
| Rsquared | 0.7844457 | 0.8191809 | 0.81445 | 0.8097507 |
| MAE | 0.4230068 | 0.4125619 | 0.406683 | 0.4080991 |
| | 10-fold repeated 5 times | 10-fold repeated 3 times | Leave-one-out | Hold-Out |
|---|---|---|---|---|
| RMSE | 0.5429957 | 0.5415532 | 0.5564873 | 0.5544994 |
| Rsquared | 0.8180418 | 0.8220338 | 0.807924 | 0.8283292 |
| MAE | 0.4056521 | 0.4047157 | 0.4072408 | 0.400559 |
Table 3: Cross-Validation on Cars Data - Log Transformation Model
The transformed hold-out model can be summarized with the following equation: \[\begin{align*}\log(Selling\_Price) = -187.9796 - 0.000011\,Kms\_Driven + 0.9575\,Fuel\_TypeDiesel \\ + 0.4335012\,Fuel\_TypePetrol - 2.078542\,Seller\_TypeIndividual\\ - 0.3952547\,TransmissionManual + 0.09409\,Year \end{align*}\]
This paper demonstrates various cross-validation techniques that can be used for model validation, including hold-out, k-folds and repeated k-folds, leave-one-out, and Monte Carlo cross-validation. As cross-validation tends to be used in machine learning on large data sets, or in situations where it is hard to gather new data to test a model, our example allowed us to fit a linear regression to a full data set and then compare the performance of the cross-validation techniques. We compared both a linear model and a log-transformed linear model and found that, for the most part, the cross-validation techniques performed similarly and produced comparable accuracy estimates. While k-folds, repeated k-folds, and hold-out appeared to perform best, most of these techniques seem sufficient to help validate a model based on this example. One should also consider the time and resources needed to perform each test; even with our small data set, the repeated k-folds and leave-one-out techniques took considerably longer to run.