Bias–variance tradeoff


Abstract

In statistics and machine learning, the bias–variance tradeoff is the property of a model that the variance of the parameter estimates across samples can be reduced by increasing the bias in the estimated parameters. The bias–variance dilemma or bias–variance problem is the conflict in trying to simultaneously minimize these two sources of error, which prevent supervised learning algorithms from generalizing beyond their training set:

  • The bias error is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).

  • The variance is an error from sensitivity to small fluctuations in the training set. High variance may result from an algorithm modeling the random noise in the training data (overfitting).

The bias–variance decomposition is a way of analyzing a learning algorithm's expected generalization error with respect to a particular problem as a sum of three terms: the bias, the variance, and a quantity called the irreducible error, resulting from noise in the problem itself.

Introduction

Whenever we discuss model prediction, it's important to understand prediction errors (bias and variance). There is a tradeoff between a model's ability to minimize bias and variance. Gaining a proper understanding of these errors helps us not only to build accurate models but also to avoid the mistakes of overfitting and underfitting.

What is bias?

Bias is the difference between the average prediction of our model and the correct value we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the relationship, which leads to high error on both training and test data.

What is variance?

Variance is the variability of a model's prediction for a given data point, which tells us the spread of our predictions. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data.

Mathematically

Let Y be the variable we are trying to predict and X the other covariates. We assume there is a relationship between the two such that

Y = f(X) + e

where e is the error term, normally distributed with a mean of 0.

We will build a model f̂(X) of f(X) using linear regression or any other modeling technique.

So the expected squared error at a point x is

Err(x) = E[(Y − f̂(x))²]

Err(x) can be further decomposed as

Err(x) = (E[f̂(x)] − f(x))² + E[(f̂(x) − E[f̂(x)])²] + σₑ²

That is, Err(x) is the sum of Bias², variance and the irreducible error.

Irreducible error is the error that can’t be reduced by creating good models. It is a measure of the amount of noise in our data. Here it is important to understand that no matter how good we make our model, our data will have certain amount of noise or irreducible error that can not be removed.
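To make this concrete, here is a minimal sketch (the quadratic f and the noise level are made-up choices, not from this article) showing that even a model that predicts f(x) perfectly still incurs the irreducible error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth and noise level, chosen only for illustration.
f = lambda x: x ** 2
sigma_e = 0.5

x = rng.uniform(0, 2, size=100_000)
y = f(x) + rng.normal(0, sigma_e, size=x.size)

# Even a "perfect" model that predicts f(x) exactly cannot beat the noise:
mse_perfect = np.mean((y - f(x)) ** 2)
print(mse_perfect)   # ≈ 0.25, i.e. sigma_e**2, the irreducible error
```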

Bias and variance using bulls-eye diagram

In the bulls-eye diagram above, the center of the target is a model that perfectly predicts the correct values. As we move away from the bulls-eye, our predictions get worse and worse. We can repeat our process of model building to get separate hits on the target.

In supervised learning, underfitting happens when a model is unable to capture the underlying pattern of the data. These models usually have high bias and low variance. It happens when we have too little data to build an accurate model, or when we try to fit a linear model to nonlinear data. Such models, like linear and logistic regression, are too simple to capture complex patterns in the data.

In supervised learning, overfitting happens when our model captures the noise along with the underlying pattern in the data. It happens when we train our model extensively on a noisy dataset. These models have low bias and high variance. They are typically very complex models, like decision trees, which are prone to overfitting.

Why is there a Bias–Variance Tradeoff?

If our model is too simple and has very few parameters, it may have high bias and low variance. On the other hand, if our model has a large number of parameters, it is going to have high variance and low bias. So we need to find the right balance, without overfitting or underfitting the data.

This tradeoff in complexity is why there is a tradeoff between bias and variance. An algorithm can’t be more complex and less complex at the same time.
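As a rough illustration, here is a minimal sketch (the sine ground truth, noise level and sample sizes are hypothetical choices) comparing a too-simple, a reasonable, and a too-complex polynomial model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical nonlinear ground truth and noise level (illustrative choices).
f = lambda x: np.sin(2 * x)
x_train = rng.uniform(0, 3, 25)
y_train = f(x_train) + rng.normal(0, 0.3, x_train.size)
x_test = rng.uniform(0, 3, 1000)
y_test = f(x_test) + rng.normal(0, 0.3, x_test.size)

for degree in (1, 3, 10):
    w = np.polyfit(x_train, y_train, degree)   # fit polynomial of this degree
    train_mse = np.mean((np.polyval(w, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(w, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The simplest model misses the pattern entirely (high error on both sets), while the most complex one chases the noise (low training error, higher test error).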

Total Error

To build a good model, we need to find a good balance between bias and variance such that it minimizes the total error.

An optimal balance of bias and variance would never overfit or underfit the model.

Therefore, understanding bias and variance is critical for understanding the behavior of prediction models.

Formulas

Let's start by defining some key concepts. We assume we have independent variables x that affect the value of a dependent variable y via a deterministic or non-deterministic relationship. We say non-deterministic because y's value can also be affected by noise that cannot be modeled explicitly. Let's denote the dependence of y on x via a function f that essentially represents the true underlying relationship between x and y. In real situations, it is of course very hard, if not impossible, to know this relationship, but we will assume that f is fixed, even when it is unknown. In that case, y, which is the result of x and random noise, is given by the formula:

y = f(x) + ϵ

Noise is modeled by a random variable ϵ with zero mean and variance σϵ². The magnitude of the variance represents the level of uncertainty about the underlying phenomenon: the larger our uncertainty is, the bigger the value of σϵ². Mathematically, ϵ has the following properties:

𝔼[ϵ] = 0,   Var(ϵ) = 𝔼[ϵ²] = σϵ²

Now, when we try to model the underlying real-life problem, what this actually means is that we try to find a function f̂ that is as close as possible to the true (yet unknown to us) function f. Function f̂ can take the form of coefficients in the regression case, or support vectors and dual coefficients in the case of Support Vector Machines (SVMs), and it is learned from training data. The closer the underlying distribution generating the training data is to the underlying distribution generating the test (unseen) data, the better the model represented by function f̂ will generalize to unseen data. Function f̂ is learned by minimizing a loss function whose goal is to bring the predictions on training data as close as possible to their observed values: y ≈ f̂(x).

Mean squared error (MSE) is the average squared difference of a prediction f̂(x) from its true value y. It is defined as:

MSE = 𝔼[(y − f̂(x))²]
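In code, the MSE of a set of predictions is a one-liner (the arrays below are made up for illustration):

```python
import numpy as np

y = np.array([3.1, 0.5, 2.2])        # observed values (made up)
y_hat = np.array([2.9, 0.7, 2.0])    # model predictions (made up)

mse = np.mean((y - y_hat) ** 2)      # average squared difference
print(mse)                           # ≈ 0.04
```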

Bias is defined as the difference of the average value of the prediction (over different realizations of training data) from the true underlying function f(x) for a given unseen (test) point x:

Bias = 𝔼[f̂(x)] − f(x)

Let's spend some time explaining what we mean by "different realizations of training data". Assume we want to monitor the relationship between family income level and house sale prices in a certain neighborhood. If we had access to data from every household, we would be able to train a very accurate model. But because obtaining data can be costly, time-consuming or subject to privacy concerns, most of the time we do not have access to all the data of the underlying population. A realization means that we have access to only a portion of the underlying data as our training data. This realization can be unrepresentative of the underlying population (for example, if we poll only houses where a household has a certain educational level) or representative (if it is done without racial, educational, age or other types of biases). So, when we say that the expectation 𝔼[f̂(x)] is over different realizations of training data, this can be thought of as if we had the opportunity to poll a sample out of the underlying population, train our model f̂ on this sample, compute f̂(x), and repeat this multiple times (with a different training sample each time). The average of the predictions will represent 𝔼[f̂(x)]. Here, f̂(x) changes even though x is fixed, simply because f̂ depends on the training data. So, f̂(x) will be different for different realizations of training data. In more mathematical terms, f̂(x) is a random variable affected by the randomness in how we obtain training data.
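A minimal simulation of this "multiple realizations" idea might look as follows. The ground-truth function is a stand-in (the article's exact f is not reproduced here), and the sampling range, sample size and model are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in ground truth: a square root plus a cosine term (the exact f used
# later in the article is not reproduced, so this form is an assumption).
f = lambda x: np.sqrt(x) + np.cos(x)
sigma_e = 1.0
x0 = 4.0                                   # the fixed unseen point x

predictions = []
for _ in range(1000):                      # 1000 realizations of training data
    x_tr = rng.uniform(0, 10, 20)
    y_tr = f(x_tr) + rng.normal(0, sigma_e, 20)
    w = np.polyfit(x_tr, y_tr, 1)          # train a simple line on this sample
    predictions.append(np.polyval(w, x0))  # f̂(x0) for this realization

predictions = np.array(predictions)
print("E[f̂(x0)] ≈", predictions.mean())
print("bias     ≈", predictions.mean() - f(x0))
```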

Variance is defined as the mean squared deviation of f̂(x) from its expected value 𝔼[f̂(x)] over different realizations of training data:

Variance = 𝔼[(f̂(x) − 𝔼[f̂(x)])²]

The formula that connects test MSE to bias, variance and irreducible error is:

𝔼[𝔼[(y − f̂(x))²]] = Bias² + Variance + σϵ²

The first expectation in the term 𝔼[𝔼[(y − f̂(x))²]] is over the distribution of unseen (test) points x, while the second is over the distribution of training data and the random variable ϵ. Because f̂ depends on training data, we can also say that the second expectation is over f̂ and ϵ. If we were to write the above formula more explicitly, it would be:

𝔼ₓ[ 𝔼[(y − f̂(x))²] ] = 𝔼ₓ[ (𝔼[f̂(x)] − f(x))² ] + 𝔼ₓ[ 𝔼[(f̂(x) − 𝔼[f̂(x)])²] ] + σϵ²

but we will skip the expectation identifiers for simplicity. All three terms on the right-hand side are non-negative, and the irreducible error is not affected by model choice. This means that test MSE cannot go below σϵ². We will now derive the formula for a given test point x; since it holds for any given test point, it will hold for any distribution of unseen test points.

Proof of bias-variance decomposition

As a reminder, we assume x is an unseen (test) point, f is the underlying true function (dictating the relationship between x and y), which is unknown but fixed, and ϵ represents the inherent noise in the problem. Test MSE, 𝔼[(y − f̂(x))²], is over the different realizations of training data and the random variable ϵ:

𝔼[(y − f̂(x))²] = 𝔼[(f(x) + ϵ − f̂(x))²]                                        (1)
               = 𝔼[(f(x) − f̂(x))²] + 2 𝔼[ϵ] 𝔼[f(x) − f̂(x)] + 𝔼[ϵ²]            (2)
               = 𝔼[(f(x) − f̂(x))²] + σϵ²                                       (3)

(1) holds because y = f(x) + ϵ; (2) because of the square expansion, the linearity of expectation, and the independence of the random variables ϵ and f̂(x). Remember that when two random variables are independent, the expectation of their product is equal to the product of their expectations; since 𝔼[ϵ] = 0, the middle term vanishes, and 𝔼[ϵ²] = σϵ². In Eq. (3), we see how test MSE decomposes into the irreducible error σϵ² and 𝔼[(f(x) − f̂(x))²]. Let's see now how this latter term can be further analyzed.

𝔼[(f(x) − f̂(x))²]
  = 𝔼[(f(x) − 𝔼[f̂(x)] + 𝔼[f̂(x)] − f̂(x))²]                                                        (4)
  = 𝔼[(f(x) − 𝔼[f̂(x)])²] + 𝔼[2(f(x) − 𝔼[f̂(x)])(𝔼[f̂(x)] − f̂(x))] + 𝔼[(𝔼[f̂(x)] − f̂(x))²]          (5)
  = (f(x) − 𝔼[f̂(x)])² + 2(f(x) − 𝔼[f̂(x)]) 𝔼[𝔼[f̂(x)] − f̂(x)] + 𝔼[(𝔼[f̂(x)] − f̂(x))²]              (6)
  = (f(x) − 𝔼[f̂(x)])² + 2(f(x) − 𝔼[f̂(x)]) · 0 + 𝔼[(𝔼[f̂(x)] − f̂(x))²]                             (7)
  = (𝔼[f̂(x)] − f(x))² + 𝔼[(f̂(x) − 𝔼[f̂(x)])²] = Bias² + Variance                                  (8)

In Eq. (4) we subtract and add 𝔼[f̂(x)], and in Eq. (5) we expand the square. The bias 𝔼[f̂(x)] − f(x) is just a constant, since we subtract f(x) (a constant) from 𝔼[f̂(x)], which is also a constant. Therefore, applying the expectation to the squared bias has no effect; in other words, 𝔼[(𝔼[f̂(x)] − f(x))²] = (𝔼[f̂(x)] − f(x))². In Eq. (6), we are able to pull f(x) − 𝔼[f̂(x)] out of the expectation because, as we mentioned, it is just a constant. Eq. (7) holds because of the linearity of expectation, under which 𝔼[𝔼[f̂(x)] − f̂(x)] = 𝔼[f̂(x)] − 𝔼[f̂(x)] = 0, so the middle term vanishes. Therefore, we see in Eq. (8) that 𝔼[(f(x) − f̂(x))²] is the sum of the squared bias and the variance. When we combine Eqs. (3) and (8), we end up with:

𝔼[(y − f̂(x))²] = (𝔼[f̂(x)] − f(x))² + 𝔼[(f̂(x) − 𝔼[f̂(x)])²] + σϵ² = Bias² + Variance + σϵ²

This is for a given test point x, but we usually have a set of test points, and this converts to the formula we presented in the previous section:

𝔼[𝔼[(y − f̂(x))²]] = 𝔼[(𝔼[f̂(x)] − f(x))²] + 𝔼[𝔼[(f̂(x) − 𝔼[f̂(x)])²]] + σϵ²

(Expectation 𝔼 in the right hand side is over the distribution of test data.)
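We can also sanity-check the decomposition numerically. The sketch below uses the same stand-in assumptions as before (sqrt-plus-cosine ground truth, illustrative sample sizes) to estimate test MSE, bias and variance at a fixed point by simulation and verify that MSE ≈ Bias² + Variance + σϵ²:

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in assumptions: sqrt+cos ground truth, noise standard deviation 1.
f = lambda x: np.sqrt(x) + np.cos(x)
sigma_e = 1.0
x0 = 4.0                                    # fixed unseen test point

preds, sq_errors = [], []
for _ in range(10_000):
    x_tr = rng.uniform(0, 10, 20)
    y_tr = f(x_tr) + rng.normal(0, sigma_e, 20)
    w = np.polyfit(x_tr, y_tr, 2)           # a quadratic model (d = 2)
    pred = np.polyval(w, x0)
    y0 = f(x0) + rng.normal(0, sigma_e)     # fresh noisy observation at x0
    preds.append(pred)
    sq_errors.append((y0 - pred) ** 2)

preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2
variance = preds.var()
print("test MSE          ≈", np.mean(sq_errors))
print("Bias² + Var + σϵ² ≈", bias2 + variance + sigma_e ** 2)  # should agree
```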

Showing the bias-variance tradeoff in practice

Having derived the bias-variance decomposition formula, we will now show what it means in practice. Assume the underlying true function f that dictates the relationship between x and y is a fixed non-linear function (combining a square root and a cosine term, as shown in the plot below),

and the noise is modeled by a Gaussian with zero mean and standard deviation 1, ϵ ~ 𝒩(0, 1). As a reminder, y = f(x) + ϵ. If we randomly generate 1,000 points from this process, we get the following plot.

Blue dots represent (x, y) pairs and the red line is the underlying true function f(x). The red dot is the unseen (test) point we want to predict. We see that f follows a non-linear pattern due to the addition of the square root and cosine in the function's definition. For our purposes, these 1,000 points represent the whole underlying population.
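A sketch of how such a population could be generated and plotted follows; the function f below merely stands in for the article's (it likewise combines a square root and a cosine term, but the exact coefficients are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Stand-in ground truth (assumption): the article's f also combines a square
# root and a cosine term; the noise is ϵ ~ N(0, 1) as stated.
f = lambda x: np.sqrt(x) + np.cos(x)

x = rng.uniform(0, 10, 1000)          # the "whole underlying population"
y = f(x) + rng.normal(0, 1, x.size)   # y = f(x) + ϵ

xs = np.linspace(0, 10, 500)
plt.scatter(x, y, s=8, alpha=0.4, label="(x, y) pairs")
plt.plot(xs, f(xs), color="red", label="true f(x)")
plt.legend()
plt.show()
```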

Applications

  • In regression

    The bias–variance decomposition forms the conceptual basis for regression regularization methods such as lasso and ridge regression. Regularization methods introduce bias into the regression solution that can reduce variance considerably relative to the ordinary least squares (OLS) solution. Although the OLS solution provides unbiased regression estimates, the lower-variance solutions produced by regularization techniques can provide superior MSE performance; a sketch illustrating this appears after this list.

  • In classification

    The bias–variance decomposition was originally formulated for least-squares regression. For the case of classification under the 0-1 loss (misclassification rate), it is possible to find a similar decomposition. Alternatively, if the classification problem can be phrased as probabilistic classification, then the expected squared error of the predicted probabilities with respect to the true probabilities can be decomposed as before.

    It has been argued that as training data increases, the variance of learned models will tend to decrease, and hence that as training data quantity increases, error is minimized by methods that learn models with lesser bias, and that conversely, for smaller training data quantities it is ever more important to minimize variance.

  • In reinforcement learning

    Even though the bias–variance decomposition does not directly apply in reinforcement learning, a similar tradeoff can also characterize generalization. When an agent has limited information on its environment, the suboptimality of an RL algorithm can be decomposed into the sum of two terms: a term related to an asymptotic bias and a term due to overfitting. The asymptotic bias is directly related to the learning algorithm (independently of the quantity of data) while the overfitting term comes from the fact that the amount of data is limited.

  • In human learning

    While widely discussed in the context of machine learning, the bias–variance dilemma has been examined in the context of human cognition, most notably by Gerd Gigerenzer and co-workers in the context of learned heuristics. They have argued (see references below) that the human brain resolves the dilemma in the case of the typically sparse, poorly characterised training sets provided by experience by adopting high-bias/low-variance heuristics. This reflects the fact that a zero-bias approach has poor generalisability to new situations, and also unreasonably presumes precise knowledge of the true state of the world. The resulting heuristics are relatively simple, but produce better inferences in a wider variety of situations.

    Geman et al. argue that the bias–variance dilemma implies that abilities such as generic object recognition cannot be learned from scratch, but require a certain degree of “hard wiring” that is later tuned by experience. This is because model-free approaches to inference require impractically large training sets if they are to avoid high variance.
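Returning to the regression item above, here is a minimal sketch of the regularization effect, using scikit-learn on a made-up, nearly collinear dataset (all settings are illustrative): ridge trades a small bias for a large reduction in variance relative to OLS.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(7)

# Made-up setup: few samples and two nearly identical features, conditions
# under which plain OLS predictions have high variance.
n, p = 30, 10
true_w = np.zeros(p)
true_w[:3] = [2.0, -1.0, 0.5]
x0 = rng.normal(0, 1, p)                       # a fixed test point

ols_preds, ridge_preds = [], []
for _ in range(500):                           # realizations of training data
    X = rng.normal(0, 1, (n, p))
    X[:, 1] = X[:, 0] + rng.normal(0, 0.05, n) # near-duplicate feature
    y = X @ true_w + rng.normal(0, 1, n)
    ols_preds.append(LinearRegression().fit(X, y).predict(x0.reshape(1, -1))[0])
    ridge_preds.append(Ridge(alpha=10.0).fit(X, y).predict(x0.reshape(1, -1))[0])

for name, preds in (("OLS", np.array(ols_preds)), ("ridge", np.array(ridge_preds))):
    bias = preds.mean() - x0 @ true_w
    print(f"{name:5s}: bias {bias:+.3f}  variance {preds.var():.3f}")
```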

Problems

We will model the problem with polynomial regressions of varying degrees of complexity. As a reminder, in polynomial regression we try to fit the following non-linear relationship between x and y:

f̂(x) = w₀ + w₁x + w₂x² + … + w_d xᵈ    (9)

In other words, we try to approximate y with f̂(x) as described in Eq. (9). We will not go into the details of how the model parameters w₀, w₁, …, w_d are learned, as that is beyond the scope of this article; let's assume they are estimated by minimizing a loss function that tries to bring f̂(x) as close to y as possible.
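For instance, a least-squares fit of Eq. (9) can be sketched with NumPy (the data points below are made up; np.polyfit minimizes squared error, matching the loss just described):

```python
import numpy as np

# Made-up training data for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 1.9, 3.8, 8.5, 17.1])

d = 2                        # polynomial degree in Eq. (9)
w = np.polyfit(x, y, d)      # least-squares estimates of w_d, ..., w_0
print(w)                     # highest-degree coefficient first
print(np.polyval(w, 2.5))    # prediction f̂(2.5)
```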

Now, let's assume we can only use 20 points (out of the 1,000) to train our polynomial regression model, and we consider four different regression models: one with degree d=1 (a simple line), and one each with d=2, d=3 and d=5. If we randomly sample 20 points from the underlying population and repeat this experiment 6 times, this is a possible outcome.

Blue dots represent the 20 training data points for a specific realization (experiment). The red line is the underlying (unknown to us) true function f, and the other lines represent the fit of the four different models to different realizations of training data. The green, purple, cyan and orange dots represent the prediction f̂(x) of the test (unseen) point x under each model. As we can see, there is less variation in lines with smaller degrees of complexity. Take for instance d=1 (a simple line): the slope of the line does not change much between experiments. On the other hand, a more complex model (d=5) is much more sensitive to small fluctuations in the training data. See, for example, the difference in the orange line (d=5) between experiments 1 and 6 and how this affects the prediction f̂(x). This is the variance problem we mentioned in previous sections. A simplistic model is very robust to changes in training data, but a more complex one is not. On the other hand, the deviation of f̂(x) from f(x) on average (the bias) is larger for more simplistic models, since our assumptions are not as representative of the underlying true relationship f. A minimal sketch of the code for such plots follows.
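This is a re-creation under the stand-in assumptions used throughout (the article's exact f, colors and styling are not reproduced):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Stand-in ground truth (assumption): square root plus cosine.
f = lambda x: np.sqrt(x) + np.cos(x)
x0 = 4.0                                         # the unseen test point
degrees = [1, 2, 3, 5]
colors = ["green", "purple", "cyan", "orange"]
xs = np.linspace(0.01, 10, 300)

fig, axes = plt.subplots(2, 3, figsize=(15, 8))  # 6 experiments
for i, ax in enumerate(axes.flat):
    # One realization: 20 training points sampled from the population.
    x_tr = rng.uniform(0, 10, 20)
    y_tr = f(x_tr) + rng.normal(0, 1, 20)
    ax.scatter(x_tr, y_tr, color="blue", s=15)
    ax.plot(xs, f(xs), color="red", label="true f")
    for d, c in zip(degrees, colors):
        w = np.polyfit(x_tr, y_tr, d)            # fit a degree-d polynomial
        ax.plot(xs, np.polyval(w, xs), color=c, label=f"d={d}")
        ax.scatter([x0], [np.polyval(w, x0)], color=c, zorder=3)
    ax.set_ylim(-3, 6)
    ax.set_title(f"experiment {i + 1}")
axes.flat[0].legend(loc="upper left", fontsize=8)
plt.tight_layout()
plt.show()
```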

Conclusion

In this article, we presented the bias-variance problem. We proceeded with its mathematical derivation and showed with an example what the bias-variance tradeoff really means in practice. We demonstrated that model choice has to balance two competing forces: bias and variance. A good model should strike a balance between the two, but we can never achieve zero test error due to the presence of irreducible error. Our model should not be overly simplistic, but not too complex either, so that it can generalize well to previously unseen data.

References

Domingos, Pedro (2000). "A unified bias-variance decomposition". ICML.

Valentini, Giorgio; Dietterich, Thomas G. (2004). "Bias–variance analysis of support vector machines for the development of SVM-based ensemble methods". Journal of Machine Learning Research. 5: 725–775.

https://en.wikipedia.org/wiki/Bias–variance_tradeoff

https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229

https://towardsdatascience.com/the-bias-variance-tradeoff-8818f41e39e9