Overfitting vs Underfitting in Machine Learning

Consider that we are designing a machine learning model. A model is said to be a good machine-learning model if it generalizes any new input data from the problem domain in a proper way. This helps us make predictions on future data that the model has never seen. Whenever we work on a data set to predict or classify a problem, we tend to measure accuracy by running the model first on the training set and then on the test set. If the accuracy is not satisfactory, we try to improve it by adding or removing features, refining feature selection, or applying feature engineering to our machine-learning model. But sometimes the model still gives poor results. This can be explained by overfitting and underfitting, which are the major causes of poor performance in machine learning algorithms.

Generalization in Machine Learning

In machine learning, we describe the learning of the target function from training data as inductive learning. Induction refers to learning general concepts from specific examples, which is exactly the problem that supervised machine learning aims to solve. This is different from deduction, which works the other way around and seeks to derive specific conclusions from general rules.

Generalization refers to how well the concepts learned by a machine learning model apply to specific examples not seen by the model when it was learning. The goal of a good machine learning model is to generalize well from the training data to any data from the problem domain. This allows us to make predictions in the future on data the model has never seen.

Generalization is the model’s ability to give sensible outputs for inputs it has never seen before.

Overfitting in Machine Learning

When we run our training algorithm on the data set, we allow the overall cost (i.e. the distance from each point to the line) to become smaller with more iterations. Letting this training algorithm run for a long time leads to a minimal overall cost. However, this means that the line will be fit to all the points (including noise), catching secondary patterns that may not be needed for the generalizability of the model.

[Figure: overfitting]

The essence of an algorithm like Linear Regression is to capture the dominant trend and fit our line within that trend. In the figure above, the algorithm captured all the trends - but not the dominant one. If the model does not capture the dominant trend that we can all see (positively increasing, in our case), it cannot predict a likely output for an input it has never seen before - defeating the purpose of Machine Learning to begin with.

Overfitting is the case where the overall cost is small, but the generalization of the model is unreliable.
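
To see this numerically, here is a minimal sketch - using synthetic data, and gradient boosting rather than the line-fitting example above, simply because boosting makes the iteration count visible - where the training error keeps falling with more iterations while the test error eventually climbs:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Made-up noisy data with a dominant linear trend.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X.ravel() + rng.normal(scale=2.0, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.5,
                                  random_state=0).fit(X_tr, y_tr)

# Training error keeps shrinking with more iterations, but test error
# bottoms out and then climbs as the model starts memorizing noise.
for i, (p_tr, p_te) in enumerate(zip(model.staged_predict(X_tr),
                                     model.staged_predict(X_te))):
    if (i + 1) % 100 == 0:
        print(f"iter {i + 1:3d}  train MSE {mean_squared_error(y_tr, p_tr):.2f}"
              f"  test MSE {mean_squared_error(y_te, p_te):.2f}")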

This overfitting is due to the model learning “too much” from the training data set. Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain how much detail the model learns. For example, decision trees are a nonparametric machine-learning algorithm that is very flexible and is therefore subject to overfitting the training data. This problem can be addressed by pruning a tree after it has learned, in order to remove some of the detail it has picked up. Overfitting (or high variance) does more harm than good.
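
As a rough sketch of that pruning idea (scikit-learn on made-up data; the ccp_alpha value is an arbitrary choice for illustration), compare an unconstrained tree with a cost-complexity-pruned one:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Synthetic noisy data: a dominant linear trend plus noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X.ravel() + rng.normal(scale=2.0, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree fits the training noise almost perfectly.
deep = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Cost-complexity pruning (ccp_alpha) removes detail the tree picked up.
pruned = DecisionTreeRegressor(ccp_alpha=1.0, random_state=0).fit(X_train, y_train)

print("deep tree:   train R^2", deep.score(X_train, y_train),
      " test R^2", deep.score(X_test, y_test))
print("pruned tree: train R^2", pruned.score(X_train, y_train),
      " test R^2", pruned.score(X_test, y_test))

Typically the pruned tree scores a little worse on the training data but noticeably better on the held-out test data.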

Underfitting in Machine Learning

A statistical model or a machine-learning algorithm is said to underfit when it cannot capture the underlying trend of the data. We want the model to learn from the training data, but we don’t want it to learn too much (i.e. too many patterns). One solution could be to stop the training earlier, as sketched below. However, this could lead the model to not learn enough patterns from the training data, and possibly not even capture the dominant trend. Underfitting (i.e. high bias) is just as bad for the generalization of the model as overfitting. In high bias, the model might not have enough flexibility in terms of line fitting, resulting in a simplistic line that does not generalize well.
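
Here is what “stopping the training earlier” can look like in practice - a minimal sketch using scikit-learn’s SGDRegressor, which halts once the score on a held-out validation fraction stops improving (the data and parameter values below are made up for illustration):

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=500)
X = StandardScaler().fit_transform(X)

# Training stops once the validation score stops improving for
# n_iter_no_change epochs, instead of running for the full max_iter.
model = SGDRegressor(max_iter=10_000, early_stopping=True,
                     validation_fraction=0.1, n_iter_no_change=5,
                     random_state=0)
model.fit(X, y)
print("stopped after", model.n_iter_, "epochs")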

How to differentiate between overfitting and underfitting?

Solving the issue of bias and variance is really about dealing with overfitting and underfitting. But what are bias and variance? Let’s first learn what bias and variance are and why they matter in a predictive model.

Bias: It tells us how close our model’s predictions are to the training data, after averaging the predicted values. An algorithm with high bias learns fast and produces a model that is easy to understand, but it is less flexible. It loses the ability to fit complex problems, and this results in underfitting of our model.

Variance: It is defined as the variability of the predictions; in simple terms, it tells us how much the predicted value changes when the training data change, whether a single data point is altered or a different data set is used with the same model. Ideally, the value predicted by the model should remain much the same even when we move from one training data set to another, but if the model has high variance, its predictions are strongly affected by the particular data set it was trained on.
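
One way to make variance tangible is to retrain a model on many different training sets and watch how much its prediction at a single fixed point moves around. A sketch, assuming scikit-learn and synthetic data:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
x0 = np.array([[5.0]])  # a fixed query point
preds_lin, preds_tree = [], []

for _ in range(100):  # 100 different training sets from the same process
    X = rng.uniform(0, 10, size=(50, 1))
    y = np.sin(X.ravel()) + rng.normal(scale=0.3, size=50)
    preds_lin.append(LinearRegression().fit(X, y).predict(x0)[0])
    preds_tree.append(DecisionTreeRegressor().fit(X, y).predict(x0)[0])

# The simple (high-bias) model barely reacts to a change of training set;
# the flexible tree's prediction swings with every new data set (high variance).
print("linear model prediction std:", np.std(preds_lin))
print("deep tree prediction std:", np.std(preds_tree))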

How to overcome it? What are the techniques?

Both overfitting and underfitting can lead to poor model performance. But by far the most common problem in applied machine learning is overfitting.

Overfitting is such a problem because the evaluation of machine learning algorithms on training data is different from the evaluation we actually care the most about, namely how well the algorithm performs on unseen data. There are two important techniques that you can use when evaluating machine learning algorithms to limit overfitting:
1. Use a resampling technique to estimate model accuracy.
2. Hold back a validation dataset.

The most popular resampling technique is k-fold cross validation. It allows you to train and test your model k times on different subsets of the training data and build up an estimate of the performance of a machine-learning model on unseen data.
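
A minimal sketch with scikit-learn’s cross_val_score (synthetic data; k = 5 here, but that choice is up to you):

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Each of the k folds is held out once while the model trains on the
# remaining folds; averaging the k scores estimates unseen-data performance.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv)
print("per-fold R^2:", scores)
print("estimated generalization R^2:", scores.mean())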

A validation dataset is simply a subset of your training data that you hold back from your machine learning algorithms until the very end of your project. After you have selected and tuned your machine learning algorithms on your training dataset you can evaluate the learned models on the validation dataset to get a final objective idea of how the models might perform on unseen data.
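
A sketch of holding back a validation set with scikit-learn’s train_test_split (synthetic data; the 20% split and the Ridge model are just illustrative choices):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=300)

# Hold back 20% of the data; it is not touched during training or tuning.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = Ridge(alpha=1.0).fit(X_train, y_train)  # select and tune here only
print("final check on held-back data, R^2:", model.score(X_val, y_val))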

In order to overcome underfitting, we can increase the capacity of the model, for example by modelling the expected value of the target variable as an nth-degree polynomial, yielding the general polynomial regression model. The training error will tend to decrease as we increase the degree d of the polynomial.
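
A sketch of that degree sweep with scikit-learn’s PolynomialFeatures on synthetic quadratic data - note how the training MSE shrinks as the degree d grows:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.5, size=100)

# Higher-degree polynomials can only fit the training data better, so the
# training error falls as d grows - but past the true degree (2 here),
# the extra capacity goes into fitting noise.
for d in (1, 2, 5, 10):
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    model.fit(X, y)
    print(f"degree {d:2d}  training MSE: {mean_squared_error(y, model.predict(X)):.3f}")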

What is a good Statistical Fit?

In statistics, a fit refers to how well you approximate a target function. This is good terminology to use in machine learning, because supervised machine learning algorithms seek to approximate the unknown underlying mapping function for the output variables given the input variables.

Statistics often describe the goodness of fit, which refers to measures used to estimate how well the approximation of the function matches the target function. Some of these methods are useful in machine learning (e.g. calculating the residual errors), but some of these techniques assume we know the form of the target function we are approximating, which is not the case in machine learning. If we knew the form of the target function, we would use it directly to make predictions, rather than trying to learn an approximation from samples of noisy training data.

Summary

Applying a machine learning model to a data set directly will not give us the accuracy we expect; the model may overfit or underfit the training data. Depending on the model at hand, a performance that lies between overfitting and underfitting is more desirable. This trade-off is an integral aspect of training Machine Learning models. Machine Learning models fulfill their purpose when they generalize well. Generalization is bound by the two undesirable outcomes - high bias and high variance.

Good Reads & References:
https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6803a989c76
https://towardsdatascience.com/what-are-overfitting-and-underfitting-in-machine-learning-a96b30864690
https://machinelearningmastery.com