Overfitting and How to Avoid It

N Nedd

May 4, 2017

Introduction

When building a model, we are ultimately concerned with how well it applies to data not seen during model building. That is, we are concerned with how well our model generalises.

Overfitting and underfitting occur when a model fails to generalise well to previously unseen data.

Model fitting

In data science, fitting a model means adjusting its parameters so that it describes the training data as well as possible.

E.g. a regression model:

y = b0 + b1x1 + b2x2

is fitted to the data to obtain, for example:

y = 0.75 + 9.5x1 - 4.6x2
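
As a minimal sketch of what fitting means in practice (an assumed example; the mtcars data set and the predictors wt and hp are chosen purely for illustration), R's lm() estimates the coefficients b0, b1 and b2 from the data:

# Estimate the coefficients b0, b1, b2 of a two-predictor linear model
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)  # fitted intercept and slopes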

Overfitting

“Overfitting occurs when excellent performance is seen in training data, but poor performance is seen in test data”. (http://www.igi-global.com/dictionary/overfittingunderfitting/39804)
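
As an illustrative sketch (not from the original slides; mtcars and the model choices are arbitrary), the gap can be seen by comparing a simple and a very flexible model on a random train/test split. In most random splits the flexible model fits the training rows much more closely yet predicts the held-out rows worse:

set.seed(101)
idx        <- sample(seq_len(nrow(mtcars)), 22)
train_rows <- mtcars[idx, ]
test_rows  <- mtcars[-idx, ]

simple  <- lm(mpg ~ wt, data = train_rows)           # one predictor
complex <- lm(mpg ~ poly(wt, 8), data = train_rows)  # very flexible polynomial

rmse <- function(model, newdata)
  sqrt(mean((newdata$mpg - predict(model, newdata))^2))

# Compare training vs. test RMSE for the two models
c(train_simple  = rmse(simple,  train_rows), test_simple  = rmse(simple,  test_rows),
  train_complex = rmse(complex, train_rows), test_complex = rmse(complex, test_rows))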

Overfitting problems

Ideally, we would have:

- a large set of data to train the model
- a large, independent set of data to test the model

How to avoid

Holdout

Use sampling to:

- set aside some data to train the model
- leave the rest to test it

Common train/test split ratios include 60/40, 70/30, and 75/25.

Holdout Example

require(caTools)
## Loading required package: caTools
data <- mtcars
set.seed(101)
# sample.split() expects a vector (here the outcome column), not the whole
# data frame; it returns TRUE for the rows assigned to the training set
sample <- sample.split(data$mpg, SplitRatio = 0.75)
train  <- subset(data, sample == TRUE)
test   <- subset(data, sample == FALSE)
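
One possible follow-up (assumed here, not part of the original example; the model formula is illustrative) is to fit a model on the training set only and judge it by its error on the held-out test set:

model <- lm(mpg ~ wt + hp, data = train)         # fit using training rows only
sqrt(mean((test$mpg - predict(model, test))^2))  # RMSE on the unseen test rows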

Holdout issues

Cross Validation
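
As a minimal sketch of k-fold cross-validation (assuming the caret package mentioned in the conclusion; the model formula is illustrative), the data is split into k folds, and each fold in turn is held out for validation while the model is fitted on the rest:

require(caret)
set.seed(101)
# 10-fold cross-validation: each fold is held out once while the model is
# trained on the remaining 9 folds; performance is averaged over the folds
ctrl   <- trainControl(method = "cv", number = 10)
cv_fit <- train(mpg ~ wt + hp, data = mtcars, method = "lm", trControl = ctrl)
cv_fit$results  # cross-validated RMSE and R-squared estimates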

Other Methods

Another Problem

Underfitting occurs when the model is too simple, in which case poor performance is seen in both the training and the test data.
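
For an illustrative sketch (assumed, reusing the train/test split from the holdout example above; an intercept-only model stands in for "too simple"):

# An intercept-only model ignores all predictors, so it fits the training
# data poorly and does no better on the test data
too_simple <- lm(mpg ~ 1, data = train)
c(train_rmse = sqrt(mean((train$mpg - predict(too_simple, train))^2)),
  test_rmse  = sqrt(mean((test$mpg  - predict(too_simple, test))^2)))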

Conclusion

- caret is a useful package for training and validating models
- aim for a balance in the bias-variance trade-off