With supervised learning algorithms, one of the biggest obstacles to model efficacy is finding the right fit between the target function and the training data. It is imperative to avoid both underfitting and overfitting; no single model configuration works for every problem. Finding a balanced fit underlies almost any attempt to improve model performance.
Supervised Learning and Generalization
Supervised learning is, in essence, the approximation of a target function (f) that maps input variables (X) to an output variable (Y).
Y = f(X)
With supervised learning, model efficacy comes down to how well your target function generalizes to new data.
An effective machine learning model generalizes well: it extrapolates the approximated mapping between X and Y to make accurate predictions on any data in the problem domain.
Fit matters. For many machine learning models, there are two core reasons for poor performance, particularly during the validation phase of building an ML model: underfitting and overfitting.
Underfitting: Complexity is Sometimes Necessary
Understanding underfitting is simple. An underfitted model fails to approximate the underlying mapping function between input variables and output variables. As a result, its predictive accuracy is poor, not only on new data but even on the training dataset: the model is far too simple to be effective.
During training, learning curves that show high error (high bias) on both the training and validation sets indicate an underfitted model. This can result from fitting a linear model to non-linear data. To counter underfitting, you can add more parameters to increase model complexity, or relax regularization terms.
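The high-bias signature is easy to reproduce. The sketch below (plain Python; the quadratic dataset and all names are hypothetical) fits a straight line, by ordinary least squares, to non-linear data; the mean squared error stays high on both the training and validation splits.

```python
import random

random.seed(0)

# Hypothetical non-linear dataset: y = x^2 plus a little noise.
data = [(x / 10, (x / 10) ** 2 + random.gauss(0, 0.05)) for x in range(-20, 21)]
random.shuffle(data)
train, valid = data[:30], data[30:]

def fit_line(points):
    # Ordinary least squares for y = a*x + b, in closed form.
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return a, (sy - a * sx) / n

def mse(points, a, b):
    return sum((y - (a * x + b)) ** 2 for x, y in points) / len(points)

a, b = fit_line(train)
print(mse(train, a, b), mse(valid, a, b))  # high error on both sets: high bias
```

Adding a quadratic term (more parameters, more complexity) would bring both errors down together, which is exactly the remedy described above.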
Overfitting: Mistaking Noise for Signal
Overfitting is the other extreme: model complexity grows to an inordinate level and, as a result, degrades model efficacy.
During training, individual data points shape the model's internal parameters. As a result, your model may learn from both the relevant parts of the data (signal) and the irrelevant parts (noise).
Overfitting is what happens when noise is mistaken for signal. It occurs when a model encounters noise in the training data and incorporates it, rigidly shaping the target function around it. When extraneous values and noise are learned as concepts, model performance suffers, particularly on novel data.
Reduce Overfitting: K-Fold Cross Validation
Two ways to counteract overfitting are:
- Holding back a validation dataset
- Using a resampling method
The first option, holding back a validation dataset, is a luxury that is often not possible.
Many problem domains offer only a limited amount of data, and there is simply not enough of it to afford setting some aside. In addition, a holdout estimate of model efficacy rests on a single act: training and testing exactly once. Such an estimate of model accuracy can be misleading.
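The instability of a single holdout estimate is easy to demonstrate. The sketch below (plain Python; the dataset and the fixed decision rule are hypothetical) scores the same rule against five different 20% holdout sets; the resulting accuracy estimates typically disagree with one another, even though nothing about the model has changed.

```python
import random

# Hypothetical dataset: integer inputs labeled 1 above 40, and a fixed
# decision rule (predict 1 above 50) whose true accuracy we try to estimate.
data = [(x, 1 if x > 40 else 0) for x in range(100)]

def holdout_accuracy(data, seed):
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    held_back = shuffled[:20]  # a single 20% holdout set
    return sum((x > 50) == bool(y) for x, y in held_back) / len(held_back)

# Five different single splits, five accuracy estimates for the same rule.
estimates = [holdout_accuracy(data, seed) for seed in range(5)]
print(estimates)
```

Which estimate should you trust? Averaging over many systematic splits, which is what cross validation does, removes the luck of any single draw.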
Cross validation, particularly k-fold cross validation, is a more effective method for evaluating accuracy and a better tool for working with a finite number of data points. Cross validation is a statistical resampling technique that splits the training data into subsets, trains the model on some of those subsets, and evaluates it on the remaining ones. By decreasing variance, cross validation helps estimate a model's accuracy and its effectiveness on independent, unseen data.
K-fold cross validation consists of the following steps:
- Randomize the dataset
- Split the dataset into k subsets
- For each of the k subsets, perform the following steps:
  - Hold back the subset, reserving it as validation data
  - Use the remaining subsets as training data for the model
  - Evaluate the model on the validation data (the held-back subset)
- Evaluate your machine learning model by averaging the accuracies derived across all k rounds of cross validation.
For a k-fold cross validation procedure in which k=3, the dataset is split into three folds, and each fold is used exactly once as the validation subset over the course of the procedure.
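The steps above, with k=3, can be sketched in plain Python. The dataset and the threshold "model" (fit to the midpoint of the two class means) are hypothetical stand-ins for any learner; the cross validation loop itself follows the listed procedure step by step.

```python
import random

random.seed(2)

# Hypothetical dataset: integer features labeled 1 above 50. The "model"
# is an illustrative threshold classifier standing in for any learner.
data = [(x, 1 if x > 50 else 0) for x in random.sample(range(100), 30)]

def fit(train):
    # Learn a decision threshold: the midpoint between the two class means.
    zeros = [x for x, y in train if y == 0]
    ones = [x for x, y in train if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def accuracy(threshold, subset):
    return sum((x > threshold) == bool(y) for x, y in subset) / len(subset)

def k_fold_cv(data, k=3):
    random.shuffle(data)                       # 1. randomize the dataset
    folds = [data[i::k] for i in range(k)]     # 2. split into k subsets
    scores = []
    for i in range(k):                         # 3. each fold takes one turn
        validation = folds[i]                  #    as the held-back subset
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        scores.append(accuracy(fit(train), validation))
    return sum(scores) / k                     # 4. average the k accuracies

print(k_fold_cv(data))
```

Each of the three folds serves once as validation data and twice as training data, so every data point contributes to both roles.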
Cross validation provides a lower-variance estimate of your model's true out-of-sample accuracy than a single train-test split does. A key element in training an effective machine learning model is the ability to generalize well, and finding the right fit is an important step in that process.