Overfitting and How to Avoid It

N Nedd

May 4, 2017

Introduction

When building a model, we are ultimately concerned with how well it applies to data not seen during model building. That is, we are concerned with how well our model generalises.

Overfitting and underfitting occur when a model fails to generalise well to previously unseen data.

Model fitting

In data science, fitting a model means adjusting its parameters so that it describes the training data as well as possible.

E.g. a regression model:

y = b0 + b1x1 + b2x2

is fitted to the data to obtain, for example:

y = 0.75 + 9.5x1 - 4.6x2
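
As a minimal sketch of what fitting means in practice (an assumed example; the mtcars data set and the predictors wt and hp are chosen purely for illustration), R's lm() estimates the coefficients b0, b1 and b2 from the data:

# Estimate the coefficients b0, b1, b2 of a two-predictor linear model
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)  # fitted intercept and slopes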

Overfitting

“Overfitting occurs when excellent performance is seen in training data, but poor performance is seen in test data”. (http://www.igi-global.com/dictionary/overfittingunderfitting/39804)
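
As an illustrative sketch (not from the original slides; mtcars and the model choices are arbitrary), the gap can be seen by comparing a simple and a very flexible model on a random train/test split. In most random splits the flexible model fits the training rows much more closely yet predicts the held-out rows worse:

set.seed(101)
idx        <- sample(seq_len(nrow(mtcars)), 22)
train_rows <- mtcars[idx, ]
test_rows  <- mtcars[-idx, ]

simple  <- lm(mpg ~ wt, data = train_rows)           # one predictor
complex <- lm(mpg ~ poly(wt, 8), data = train_rows)  # very flexible polynomial

rmse <- function(model, newdata)
  sqrt(mean((newdata$mpg - predict(model, newdata))^2))

# Compare training vs. test RMSE for the two models
c(train_simple  = rmse(simple,  train_rows), test_simple  = rmse(simple,  test_rows),
  train_complex = rmse(complex, train_rows), test_complex = rmse(complex, test_rows))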

Overfitting problems

Ideally, we would have:

- a large set of data to train the model
- a large, independent set of data to test the model

How to avoid

Holdout

Use sampling to:

- set aside some data to train the model
- leave the rest to test it

Common train/test split ratios include 60/40, 70/30, and 75/25.

Holdout Example

require(caTools)
## Loading required package: caTools
data <- mtcars
set.seed(101)
# sample.split() expects a vector (here the outcome column), not the whole
# data frame; it returns TRUE for the rows assigned to the training set
sample <- sample.split(data$mpg, SplitRatio = 0.75)
train  <- subset(data, sample == TRUE)
test   <- subset(data, sample == FALSE)
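
One possible follow-up (assumed here, not part of the original example; the model formula is illustrative) is to fit a model on the training set only and judge it by its error on the held-out test set:

model <- lm(mpg ~ wt + hp, data = train)         # fit using training rows only
sqrt(mean((test$mpg - predict(model, test))^2))  # RMSE on the unseen test rows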

Holdout issues

Cross Validation
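
As a minimal sketch of k-fold cross-validation (assuming the caret package mentioned in the conclusion; the model formula is illustrative), the data is split into k folds, and each fold in turn is held out for validation while the model is fitted on the rest:

require(caret)
set.seed(101)
# 10-fold cross-validation: each fold is held out once while the model is
# trained on the remaining 9 folds; performance is averaged over the folds
ctrl   <- trainControl(method = "cv", number = 10)
cv_fit <- train(mpg ~ wt + hp, data = mtcars, method = "lm", trControl = ctrl)
cv_fit$results  # cross-validated RMSE and R-squared estimates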

Other Methods

Another Problem

Underfitting occurs when the model is too simple, in which case poor performance is seen in both the training and the test data.
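
For an illustrative sketch (assumed, reusing the train/test split from the holdout example above; an intercept-only model stands in for "too simple"):

# An intercept-only model ignores all predictors, so it fits the training
# data poorly and does no better on the test data
too_simple <- lm(mpg ~ 1, data = train)
c(train_rmse = sqrt(mean((train$mpg - predict(too_simple, train))^2)),
  test_rmse  = sqrt(mean((test$mpg  - predict(too_simple, test))^2)))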

Conclusion

- caret is a useful package for training and validating models
- aim for a balance in the bias-variance trade-off