The Bias-Variance Tradeoff and Cross Validation

Tony Fischetti
9/15/2015

Data Scientist at College Factual
@tonyfischetti
onthelambda.com


classification

mtcars

library(magrittr)
mtcars %>% names
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"
mtcars %>% head
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Decision tree

(figure: decision tree)

why is this bad?

(conf.matrix2)
     [,1] [,2]
[1,]    7    4
[2,]    3   16
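
For reference, a matrix like this can be produced along these lines (a sketch, not the talk's actual code: the target variable and evaluation split aren't shown on the slide, so this assumes a classification tree predicting the transmission type am, scored against the data it was trained on):

library(rpart)
# treat am as a categorical outcome, fit a classification tree,
# and tabulate its predictions against the observed labels
mt <- transform(mtcars, am=factor(am))
tree.fit <- rpart(am ~ ., data=mt, method="class")
preds <- predict(tree.fit, type="class")
table(predicted=preds, actual=mt$am)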

irrelevant features

(cr.leaders)
                   name die.age born
1            Rosa Parks      92 1913
2 Martin Luther King Jr      39 1929
3              Malcom X      39 1925
4        Harriet Tubman      91 1822
5         Marcus Garvey      52 1887
6      Shirley Chisholm      80 1924

irrelevant features

(decision tree)

irrelevant features

(figure: decision tree)

irrelevant features

(more.leaders)
               name die.age born
1 Fredrick Douglass      77 1818
2      Marvel Cooke      97 1903
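
A sketch of the kind of overfitting being demonstrated here (not the slides' code; the rpart settings are chosen purely to let a tree memorize six rows):

library(rpart)
# with cp=0 and a tiny minsplit, the tree is free to memorize the training set
overfit.tree <- rpart(die.age ~ born, data=cr.leaders,
                      control=rpart.control(minsplit=2, cp=0))
# fitted values track the six leaders above very closely...
predict(overfit.tree)
# ...but predictions for unseen leaders are essentially arbitrary, because
# birth year carries no real information about age at death
predict(overfit.tree, newdata=more.leaders)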

linear regression

mtcars

model <- lm(mpg ~ ., data=mtcars)
summary(model)$r.squared
[1] 0.8690158
model2 <- lm(mpg ~ am + wt, data=mtcars)
summary(model2)$r.squared
[1] 0.7528348
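
A side note on why training R-squared rewards complexity (not from the talk; the seed and the junk column are purely illustrative): adding even a column of pure noise to the predictors can only raise the R-squared computed on the data the model was fit to.

set.seed(42)
# append a predictor that is pure noise, then refit the full model
with.noise <- transform(mtcars, junk=rnorm(nrow(mtcars)))
summary(lm(mpg ~ ., data=with.noise))$r.squared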

bias-variance tradeoff

(figure: bias-variance tradeoff curve)

why is it a tradeoff?
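
One way to make the tradeoff concrete is a small simulation (not from the talk; the data-generating process below is made up for illustration): as model complexity grows, error on the training sample keeps shrinking while error on fresh data eventually climbs back up.

set.seed(1)
# noisy samples from a sine curve, plus a second sample for testing
x <- runif(20, 0, 2*pi);      y <- sin(x) + rnorm(20, sd=0.5)
x.new <- runif(200, 0, 2*pi); y.new <- sin(x.new) + rnorm(200, sd=0.5)
# fit polynomials of increasing degree and record train/test MSE
errs <- sapply(1:15, function(d) {
  fit <- lm(y ~ poly(x, d))
  c(train=mean((fitted(fit) - y)^2),
    test=mean((predict(fit, data.frame(x=x.new)) - y.new)^2))
})
round(errs, 2)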

finding the optimal point

validation set approach

set.seed(1)
train.indices <- sample(1:nrow(mtcars), nrow(mtcars)/2)
training <- mtcars[train.indices,]
testing <- mtcars[-train.indices,]
model <- lm(mpg ~ ., data=training)
summary(model)$r.squared
[1] 0.9879687
mean((predict(model) - training$mpg) ^ 2)
[1] 0.4408109
mean((predict(model, newdata=testing) - testing$mpg) ^ 2)
[1] 337.9995

validation set approach

simpler.model <- lm(mpg ~ am + wt, data=training)
mean((predict(simpler.model) - training$mpg) ^ 2)
[1] 9.396091
mean((predict(simpler.model, newdata=testing) - testing$mpg) ^ 2)
[1] 12.70338

5-fold cross validation

library(boot)
bad.model <- glm(mpg ~ ., data=mtcars)
better.model <- glm(mpg ~ am + wt + qsec, data=mtcars)
bad.cv.err <- cv.glm(mtcars, bad.model, K=5)

bad.cv.err$delta[2]
[1] 9.351715
better.cv.err <- cv.glm(mtcars, better.model, K=5)
better.cv.err$delta[2]
[1] 6.746993
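
The second element of delta is the bias-adjusted estimate of prediction error (the first is the raw cross-validation estimate). Roughly what cv.glm(..., K=5) is doing behind the scenes, as a hand-rolled sketch for intuition (not boot's actual implementation):

set.seed(1)
# assign each row of mtcars to one of 5 folds
folds <- sample(rep(1:5, length.out=nrow(mtcars)))
# hold each fold out in turn, fit on the rest, and score on the held-out rows
cv.mse <- sapply(1:5, function(k) {
  fit <- lm(mpg ~ am + wt + qsec, data=mtcars[folds != k, ])
  mean((predict(fit, newdata=mtcars[folds == k, ]) - mtcars$mpg[folds == k])^2)
})
mean(cv.mse)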

bias-variance tradeoff

(figure: bias-variance tradeoff curve)

final lessons

  • interpret your results
  • rule of parsimony
  • be mindful of overfitting

thanks

Tony Fischetti

College Factual

@tonyfischetti

onthelambda.com