N Nedd
May 4, 2017
We are ultimately concerned with how well the model applies to data not seen when building the model. That is, we are concerned with how well our model generalises.
Overfitting and underfitting are both problems with how a model generalises to previously unseen data.
In data science:
E.g. a regression model:
y = b0 + b1x1 + b2x2
is fitted to the training data, giving an equation with estimated coefficients such as:
y = 0.75 + 9.5x1 - 4.6x2
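As a minimal sketch of how such coefficients come about, the block below fits a two-predictor linear model with base R's lm(). The choice of mtcars with wt as x1 and hp as x2 is an assumption made for illustration; the coefficients on this slide come from some other, unspecified data set.

```r
# Fit y = b0 + b1*x1 + b2*x2 by least squares.
# Assumption for illustration: y = mpg, x1 = wt, x2 = hp from mtcars.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Estimated b0, b1, b2, in that order
b <- coef(fit)
print(b)
```

Reading the printed vector back into the equation gives the fitted model for this data.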
“Overfitting occurs when excellent performance is seen in training data, but poor performance is seen in test data”. (http://www.igi-global.com/dictionary/overfittingunderfitting/39804)
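The definition above can be illustrated with simulated data. This is a sketch under stated assumptions: the data generator, the sample sizes and the degree-10 polynomial are all hypothetical choices, not taken from the source. The flexible model can never fit the training data worse than the straight line, and it will typically do worse on the test data.

```r
set.seed(42)

# Hypothetical data generator: a straight line plus noise
make_data <- function(n) {
  x <- runif(n, 0, 10)
  data.frame(x = x, y = 2 + 0.5 * x + rnorm(n, sd = 1))
}
train <- make_data(20)   # small training set
test  <- make_data(200)  # independent test set

# Root mean squared error of a fitted model on a data set
rmse <- function(fit, d) sqrt(mean((d$y - predict(fit, newdata = d))^2))

simple   <- lm(y ~ x, data = train)            # matches the true structure
flexible <- lm(y ~ poly(x, 10), data = train)  # far more flexible than needed

results <- rbind(
  simple   = c(train = rmse(simple, train),   test = rmse(simple, test)),
  flexible = c(train = rmse(flexible, train), test = rmse(flexible, test))
)
print(round(results, 3))
```

A gap between a low training error and a much higher test error in the flexible row is the signature of overfitting.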
Ideally, we would have:
- Large set of data to train the model
- Large independent set of data to test the model
In practice, use sampling to:
- Set aside some data to train
- Leave the rest to test
Common train/test split ratios:
60/40, 70/30, 75/25
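library(caTools)  # provides sample.split()
data <- mtcars
set.seed(101)  # make the random split reproducible
# Pass a single column (here the outcome mpg), not the whole data frame:
# sample.split() returns one TRUE/FALSE per element, and the elements of a
# data frame are its columns, so the original call produced 11 flags for 32 rows.
in_train <- sample.split(data$mpg, SplitRatio = 0.75)
train <- subset(data, in_train == TRUE)
test <- subset(data, in_train == FALSE)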
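Once the data is split, the model is fitted on the training rows only and then scored on both parts. The sketch below is one possible way to do this; it assumes a base-R split with sample() instead of caTools so that it has no package dependency, and the predictors wt and hp are an illustrative choice.

```r
data <- mtcars
set.seed(101)

# Base-R 75/25 split (assumption: stands in for the caTools split)
n <- nrow(data)
train_idx <- sample(n, size = round(0.75 * n))
train <- data[train_idx, ]
test  <- data[-train_idx, ]

# Fit on the training rows only
fit <- lm(mpg ~ wt + hp, data = train)

# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

train_rmse <- rmse(train$mpg, predict(fit, newdata = train))
test_rmse  <- rmse(test$mpg,  predict(fit, newdata = test))
print(c(train = train_rmse, test = test_rmse))
```

The test figure is the one that estimates performance on unseen data; the training figure alone is optimistic.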
Regularisation: a mathematical technique to improve the generalisation of a model
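One common form of regularisation is ridge regression, which adds a penalty on the size of the coefficients. The source does not say which technique it has in mind, so ridge is an assumption here, as are the predictors and the penalty value. The sketch uses the closed-form solution in base R rather than a dedicated package.

```r
# Standardise predictors and centre the outcome so no intercept is needed
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp")]))
y <- mtcars$mpg - mean(mtcars$mpg)

# Ridge estimate: solve (X'X + lambda * I) b = X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
}

b_ols   <- ridge_coef(X, y, lambda = 0)   # no penalty: ordinary least squares
b_ridge <- ridge_coef(X, y, lambda = 10)  # hypothetical penalty value

# The penalty shrinks the overall size of the coefficient vector
print(c(ols = sum(b_ols^2), ridge = sum(b_ridge^2)))
```

In practice the penalty strength would be chosen by checking performance on held-out data rather than fixed in advance.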
Avoid Complexity
Underfitting occurs when the model is too simple, so that poor performance is seen in both training and test data.
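A small simulation can make this concrete. The sketch assumes a hypothetical curved relationship (y = x^2 plus noise) and compares a straight line, which is too simple for it, with a model that includes the squared term.

```r
set.seed(7)

# Hypothetical data generator: a curved relationship plus noise
make_curved <- function(n) {
  x <- runif(n, -3, 3)
  data.frame(x = x, y = x^2 + rnorm(n, sd = 0.5))
}
train <- make_curved(100)
test  <- make_curved(100)

# Root mean squared error of a fitted model on a data set
rmse <- function(fit, d) sqrt(mean((d$y - predict(fit, newdata = d))^2))

too_simple <- lm(y ~ x, data = train)           # cannot represent the curve
adequate   <- lm(y ~ x + I(x^2), data = train)  # includes the squared term

results <- rbind(
  too_simple = c(train = rmse(too_simple, train), test = rmse(too_simple, test)),
  adequate   = c(train = rmse(adequate, train),   test = rmse(adequate, test))
)
print(round(results, 3))
```

Unlike overfitting, the too-simple model shows no large gap between training and test error: both are poor.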
caret: a useful R package for training and evaluating models
Aim for a balance between bias and variance
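As a hedged sketch of what caret offers here, the block below estimates out-of-sample error with 5-fold cross-validation through train() and trainControl(). The model formula, the number of folds and the choice of method = "lm" are assumptions for illustration; the block skips itself if caret is not installed.

```r
if (requireNamespace("caret", quietly = TRUE)) {
  set.seed(101)

  # 5-fold cross-validation: each fold is held out once for testing
  ctrl <- caret::trainControl(method = "cv", number = 5)

  # Illustrative model: mpg explained by wt and hp
  cv_fit <- caret::train(mpg ~ wt + hp, data = mtcars,
                         method = "lm", trControl = ctrl)

  # Cross-validated error estimates (includes an RMSE column)
  print(cv_fit$results)
} else {
  message("caret is not installed; run install.packages('caret') to try this sketch")
}
```

Comparing the cross-validated RMSE of simpler and more complex candidate models is one way to look for the balance between bias and variance.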