Cross-validation is a method for evaluating how accurately a model predicts new values. It is widely used in machine learning to compare candidate models and determine the best one for a data set.
Steps
1. Split the data into a training set and a test set.
2. Fit a model to the training set and obtain the model parameters.
3. Apply the fitted model to the test set and obtain the prediction accuracy.
4. Repeat steps 1-3.
5. Calculate the average cross-validation prediction accuracy across all the repetitions.
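As an illustration, a minimal base-R sketch of this loop is shown below. The data frame `car_data` and the response column `Selling_Price` are illustrative names only (not taken from the text), and any model could stand in for `lm`:

```r
# Generic cross-validation loop: repeat the split / fit / evaluate cycle
# and average the prediction accuracy (illustrative data / column names).
set.seed(1)

n_reps    <- 20                 # number of split-fit-evaluate repetitions
rmse_vals <- numeric(n_reps)

for (r in seq_len(n_reps)) {
  # 1. Split the data into a training set and a test set (80/20 here)
  idx   <- sample(nrow(car_data), size = floor(0.8 * nrow(car_data)))
  train <- car_data[idx, ]
  test  <- car_data[-idx, ]

  # 2. Fit the model to the training set
  fit <- lm(Selling_Price ~ ., data = train)

  # 3. Apply the fitted model to the test set and record prediction accuracy
  pred         <- predict(fit, newdata = test)
  rmse_vals[r] <- sqrt(mean((test$Selling_Price - pred)^2))
}

# 4-5. Average prediction accuracy across all repetitions
mean(rmse_vals)
```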
There are many ways to assess how well a model fits, but the root mean squared error (RMSE), mean absolute error (MAE), and R-squared (\(R^2\)) are used the most. RMSE measures how far predicted values are from observed values. MAE describes the typical magnitude of the residuals. \(R^2\) describes how well the predictors explain the variation in the response variable, i.e., the fraction of the variance explained by the model. The formulas used to calculate each measure are:
\[RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{(y_i-\hat{y}_i)^2}}\]
\[MAE=\frac{1}{n}\sum_{i=1}^{n}{|y_i-\hat{y}_i|}\]
\[R^2=1-\frac{\sum_{i=1}^{n}{(y_i-\hat{y}_i)^2}}{\sum_{i=1}^{n}{(y_i-\bar{y})^2}}\]
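As a quick illustration, the three measures can be computed directly from these definitions; `obs` and `pred` below are placeholder vectors of observed and predicted values:

```r
# Accuracy measures written directly from the formulas above
rmse      <- function(obs, pred) sqrt(mean((obs - pred)^2))
mae       <- function(obs, pred) mean(abs(obs - pred))
r_squared <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
```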
One negative of cross-validation methods is that they are not guaranteed to select the true model of the data-generating distribution, even as the sample size approaches infinity. When two different models are trained on the same training set, the one with more free parameters tends to fit the training data better simply because it can overfit more easily, which can lead to the wrong model being selected. (Gronau and Wagenmakers 2019)
Hold-Out or Validation Technique
This is the most common method of performing cross-validation (Ahmed, 2019). It uses a single predefined split of the data into training and test sets, such as a 90/10 or 80/20 train/test split. It is an easy, straightforward approach, but because the model is built on only a portion of the data, it may not lead to accurate predictions: the result is sensitive to which observations are (or are not) chosen for the training set. This is especially problematic for small sample sizes.
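A hold-out split might look like the following sketch, again assuming the illustrative `car_data` / `Selling_Price` names and reusing the helper functions defined above:

```r
# Hold-out (validation set) sketch: one 80/20 split
set.seed(1)
idx   <- sample(nrow(car_data), size = floor(0.8 * nrow(car_data)))
train <- car_data[idx, ]
test  <- car_data[-idx, ]

fit  <- lm(Selling_Price ~ ., data = train)
pred <- predict(fit, newdata = test)

# Prediction accuracy on the held-out 20%
c(RMSE = rmse(test$Selling_Price, pred),
  MAE  = mae(test$Selling_Price, pred),
  R2   = r_squared(test$Selling_Price, pred))
```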
k-folds Cross-Validation Technique
This method is one potential answer to the limitations of the simple hold-out technique. The data is divided into k groups (folds); each fold in turn serves as the test set while the remaining k-1 folds together form the training set. A base-R sketch follows the steps below.
k-folds Cross-Validation Steps:
1. Randomly and evenly split the data set into k folds.
2. Use k-1 folds of data as the training set to fit the model.
3. With the fitted model, predict the value of the response variable in the hold-out fold (the kth fold).
4. Using the observed response values in the hold-out fold, calculate the prediction error.
5. Repeat steps 2-4 k times, so that each fold is used once as the hold-out set.
6. Compare the prediction performance measures to select the best model using the equation: \[CV_{(k)}=\frac{1}{k}\sum_{i=1}^{k}{MSE_i}\]
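The procedure can be sketched in base R as follows (same illustrative names as before; k = 5 is arbitrary):

```r
# k-fold cross-validation sketch (k = 5)
set.seed(1)
k      <- 5
folds  <- sample(rep(1:k, length.out = nrow(car_data)))  # random, even fold labels
cv_mse <- numeric(k)

for (i in 1:k) {
  train <- car_data[folds != i, ]   # k-1 folds for training
  test  <- car_data[folds == i, ]   # i-th fold held out

  fit       <- lm(Selling_Price ~ ., data = train)
  pred      <- predict(fit, newdata = test)
  cv_mse[i] <- mean((test$Selling_Price - pred)^2)
}

# CV_(k): average prediction error over the k hold-out folds
mean(cv_mse)
```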
Repeated k-folds Cross-Validation Technique
This method expands on the k-folds technique (Song, Tang, and Wee 2021) by conducting multiple repetitions (n), each using a different random k-fold split. If 5 repeats of a 10-fold cross-validation were chosen, 50 (n*k) different models would be fit and evaluated. Because each repetition (n) uses a slightly different partition of the data, the performance estimates should be even less biased than with a single run of k-folds. One negative associated with this method is having to repeat the process numerous times, which makes it time and computationally intensive.
Repeated k-folds Cross-Validation Steps
The process is as follows:
1. Randomly and evenly split the data set into k folds.
2. Use k-1 folds of data as the training set to fit the model.
3. With the fitted model, predict the value of the response variable in the hold-out fold (the kth fold).
4. Using the observed response values in the hold-out fold, calculate the prediction error.
5. Repeat steps 2-4 k times, so that each fold is used once as the hold-out set.
6. Repeat steps 1-5 n times.
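One common way to run repeated k-folds in R is the caret package; the snippet below is a sketch under that assumption (the text does not say which tools were used), again with the illustrative `car_data` / `Selling_Price` names:

```r
# Repeated k-fold sketch: 5 repeats of 10-fold CV (50 model fits in total)
library(caret)

set.seed(1)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
fit  <- train(Selling_Price ~ ., data = car_data, method = "lm", trControl = ctrl)
fit$results   # resampled RMSE, R-squared, and MAE averaged over the 50 fits
```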
Leave-One-Out Cross-Validation
This method is a special case of the k-folds technique in which each observation in the data set is excluded in turn and the model is fit on the remaining n-1 observations; in other words, the data is split into sets of size n-1 and 1, n times. For each observation, the cross-validation residual is the difference between the observed value and the model-predicted value. This technique yields a less biased estimate of the test MSE than a single train/test split, but it can be time-consuming when the data set is large or the model is complex, and can thus be an expensive method. (Derryberry 2014)
Leave-One-Out Cross-Validation Steps
1. Split the data into a training set and a test set, using all but one observation as the training set.
2. Use the training set to build the model.
3. Use the model to predict the response value of the one observation left out.
4. Repeat the process n times, so that each observation is left out once.
5. Average the n test errors to obtain the overall cross-validation estimate: \[CV_{(n)}=\frac{1}{n}\sum_{i=1}^{n}{MSE_i}\]
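A base-R sketch of leave-one-out cross-validation, with the same illustrative names:

```r
# Leave-one-out sketch: each observation is held out once; CV_(n) is the
# average of the n squared prediction errors
n       <- nrow(car_data)
loo_mse <- numeric(n)

for (i in 1:n) {
  fit        <- lm(Selling_Price ~ ., data = car_data[-i, ])        # fit on n-1 rows
  pred       <- predict(fit, newdata = car_data[i, , drop = FALSE]) # predict row i
  loo_mse[i] <- (car_data$Selling_Price[i] - pred)^2
}

mean(loo_mse)   # CV_(n)
```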
Optimal Number of Folds
Something to consider with the k-folds method is the optimal number of folds (k) into which the data should be divided. In the article “Performance of Machine Learning Algorithms with Different K Values in K-fold Cross-Validation,” k-fold cross-validation was applied to 4 different machine learning algorithms with k values of 3, 5, 7, 10, 15, and 20. The optimal number of folds changed depending on the model being tested. (Nti, Nyarko-Boateng, and Aning 2021) Common practice uses 5 or 10 folds, as these values have been shown to yield favorable test error rate estimates (James 2013). Our paper explores using 3, 5, and 10 folds, and repeated k-folds with 3 and 5 repetitions of 10 folds.
Monte Carlo Cross-Validation Technique
This technique is similar to a k-fold method, but a predefined portion of the data is randomly selected to form the test set in each repetition, and the remaining portion forms the training set. The train-test process is repeated a predetermined number of times. (Wainer and Cawley 2021)
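In caret this scheme is available as leave-group-out cross-validation (`LGOCV`), i.e., repeated random train/test splits; the sketch below assumes that package and the illustrative names used earlier:

```r
# Monte Carlo cross-validation sketch: 50 random 80/20 splits
library(caret)

set.seed(1)
ctrl <- trainControl(method = "LGOCV", p = 0.8, number = 50)
fit  <- train(Selling_Price ~ ., data = car_data, method = "lm", trControl = ctrl)
fit$results   # accuracy measures averaged over the 50 random splits
```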
Multiple Predicting Cross-Validation
This is a variation of k-folds in which each fold is used as the training set rather than the validation set; the trained model is then evaluated on the remaining k-1 folds. The average prediction error across the k folds is then used to measure how good the model is. (Jung 2018)
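A rough base-R sketch of this reversed scheme, under the same illustrative names (with few rows per training fold, a sketch like this can run into unseen factor levels in practice):

```r
# Multiple-predicting k-fold sketch: train on one fold, predict the rest
set.seed(1)
k      <- 5
folds  <- sample(rep(1:k, length.out = nrow(car_data)))
mp_mse <- numeric(k)

for (i in 1:k) {
  fit       <- lm(Selling_Price ~ ., data = car_data[folds == i, ])  # train on fold i only
  pred      <- predict(fit, newdata = car_data[folds != i, ])        # predict the other k-1 folds
  mp_mse[i] <- mean((car_data$Selling_Price[folds != i] - pred)^2)
}

mean(mp_mse)   # average prediction error across the k training folds
```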
Data Analysis
Hold-out validation, k-folds, repeated k-folds, leave-one-out, and Monte Carlo cross-validation techniques were performed on a data set from Kaggle titled Car data.csv. This data set allowed us to model the selling price of a used car in the United Kingdom from the kilometers driven, fuel type, year of manufacture, seller type, and transmission type. The data set contains 301 cars. The data is detailed in the next slide.
This linear model was fitted using the various cross-validation techniques. First, the hold-out method was performed using a 70/30 data split. Next, k-folds cross-validation was performed with 3, 5, and 10 folds. The repeated k-folds technique was then performed using 10 folds, repeated 3 and 5 times. Leave-one-out and Monte Carlo techniques were also run on the model for comparison.
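The comparison described here could be organized along the following lines; this is only a sketch, and the caret package, the `car_data` data frame, and the `Selling_Price` column name are assumptions rather than the authors' actual code:

```r
# Fit the same linear model under each resampling scheme and collect the
# resampled accuracy measures for comparison
library(caret)
set.seed(1)

controls <- list(
  holdout_70_30 = trainControl(method = "LGOCV",      p = 0.7, number = 1),   # single 70/30 split
  cv_3          = trainControl(method = "cv",         number = 3),
  cv_5          = trainControl(method = "cv",         number = 5),
  cv_10         = trainControl(method = "cv",         number = 10),
  repcv_10x3    = trainControl(method = "repeatedcv", number = 10, repeats = 3),
  repcv_10x5    = trainControl(method = "repeatedcv", number = 10, repeats = 5),
  loocv         = trainControl(method = "LOOCV"),
  monte_carlo   = trainControl(method = "LGOCV",      p = 0.8, number = 50)
)

results <- lapply(controls, function(ctrl)
  train(Selling_Price ~ ., data = car_data, method = "lm", trControl = ctrl)$results)
results   # RMSE, R-squared, and MAE under each resampling scheme
```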
As cross-validation tends to be used in machine learning on large data sets, or in situations where gathering large samples is difficult, our example allowed us to review the full data set’s performance with linear regression and then compare the performance of the cross-validation techniques. We compared both a linear model and a transformed linear model and found that most of the cross-validation techniques performed about the same and were able to fit accurate models. While k-folds, repeated k-folds, and hold-out seemed to perform best, most of these techniques appear sufficient to help validate a model.
References
Derryberry, DeWayne R. 2014. Basic Data Analysis for Time Series with R. John Wiley & Sons.
Gronau, Quentin F., and Eric-Jan Wagenmakers. 2019. “Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection.” Computational Brain & Behavior 2 (1): 1–11.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. Springer.
Jung, Yoonsuh. 2018. “Multiple Predicting k-Fold Cross-Validation for Model Selection.” Journal of Nonparametric Statistics 30 (1): 197–215.
Nti, Isaac Kofi, Owusu Nyarko-Boateng, and Justice Aning. 2021. “Performance of Machine Learning Algorithms with Different k Values in k-Fold Cross-Validation.” International Journal of Information Technology and Computer Science 13: 61–71.
Song, Q. Chelsea, Chen Tang, and Serena Wee. 2021. “Making Sense of Model Generalizability: A Tutorial on Cross-Validation in R and Shiny.” Advances in Methods and Practices in Psychological Science 4 (1): 2515245920947067.
Wainer, Jacques, and Gavin Cawley. 2021. “Nested Cross-Validation When Selecting Classifiers Is Overzealous for Most Practical Applications.” Expert Systems with Applications 182: 115222.