In this lesson we will be using the Hitters dataset from the ISLR package. These data reflect the salaries of baseball platers and various player metrics for the 1886 and 1987 seasons.
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.6.2
data("Hitters")
names(Hitters)
## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks"
## [7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI"
## [13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors"
## [19] "Salary" "NewLeague"
dim(Hitters)
## [1] 322 20
First we want to check for any missing data in your response, which is salary. Then we will remove the NAs. This process is called listwise deletion.
# How many data points are misisng for Salary?
sum(is.na(Hitters$Salary))
## [1] 59
# We are going to have to remove the NAs
Hitters<-na.omit(Hitters)
dim(Hitters)
## [1] 263 20
sum(is.na(Hitters$Salary))
## [1] 0
In order to demonstrate how the algorithms work, we will first only consider a small subset of predictors.
Candidate Predictors:
Let’s do this with our three variables.
STEP 1: Use simple linear regression to select one variable to enter the model.
# Forward Selection
mod1<-lm(Salary~CRBI, data=Hitters)
summary(mod1)
##
## Call:
## lm(formula = Salary ~ CRBI, data = Hitters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1099.27 -203.45 -97.43 146.37 1847.22
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 274.58039 32.85537 8.357 3.85e-15 ***
## CRBI 0.79095 0.07113 11.120 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 372.3 on 261 degrees of freedom
## Multiple R-squared: 0.3215, Adjusted R-squared: 0.3189
## F-statistic: 123.6 on 1 and 261 DF, p-value: < 2.2e-16
mod2<-lm(Salary~Hits, data=Hitters)
summary(mod2)
##
## Call:
## lm(formula = Salary ~ Hits, data = Hitters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -893.99 -245.63 -59.08 181.12 2059.90
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.0488 64.9822 0.970 0.333
## Hits 4.3854 0.5561 7.886 8.53e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 406.2 on 261 degrees of freedom
## Multiple R-squared: 0.1924, Adjusted R-squared: 0.1893
## F-statistic: 62.19 on 1 and 261 DF, p-value: 8.531e-14
mod3<-lm(Salary~Runs, data=Hitters)
summary(mod3)
##
## Call:
## lm(formula = Salary ~ Runs, data = Hitters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -714.8 -251.2 -55.9 193.6 1997.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 129.9292 59.9240 2.168 0.031 *
## Runs 7.4161 0.9923 7.474 1.18e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 410.2 on 261 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1731
## F-statistic: 55.86 on 1 and 261 DF, p-value: 1.18e-12
# First step: keep CRBI (smallest p-val)
STEP 2: Look for a second variable to add to the model.
mod12<-lm(Salary~CRBI+Hits, data=Hitters)
summary(mod12)
##
## Call:
## lm(formula = Salary ~ CRBI + Hits, data = Hitters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -942.47 -183.72 -37.85 96.55 2167.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -47.95590 55.98245 -0.857 0.392
## CRBI 0.68990 0.06723 10.262 < 2e-16 ***
## Hits 3.30084 0.48177 6.851 5.28e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 343.3 on 260 degrees of freedom
## Multiple R-squared: 0.4252, Adjusted R-squared: 0.4208
## F-statistic: 96.17 on 2 and 260 DF, p-value: < 2.2e-16
mod13<-lm(Salary~CRBI+Runs, data=Hitters)
summary(mod13)
##
## Call:
## lm(formula = Salary ~ CRBI + Runs, data = Hitters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1037.58 -194.51 -41.45 111.69 2125.83
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.40744 52.04504 -0.065 0.948
## CRBI 0.70114 0.06737 10.408 < 2e-16 ***
## Runs 5.61989 0.85295 6.589 2.45e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 345.3 on 260 degrees of freedom
## Multiple R-squared: 0.4185, Adjusted R-squared: 0.4141
## F-statistic: 93.57 on 2 and 260 DF, p-value: < 2.2e-16
# Second step: keep Hits
STEP 3: Look for a third variable, but ….OH NO! ITs not significant! So we will stop and only use the best model with two variables.
mod123<-lm(Salary~CRBI+Hits+Runs, data=Hitters)
summary(mod123)
##
## Call:
## lm(formula = Salary ~ CRBI + Hits + Runs, data = Hitters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -969.16 -186.72 -41.07 100.55 2166.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -46.24811 56.01270 -0.826 0.4098
## CRBI 0.68948 0.06724 10.255 <2e-16 ***
## Hits 2.28191 1.14187 1.998 0.0467 *
## Runs 1.97827 2.00996 0.984 0.3259
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 343.3 on 259 degrees of freedom
## Multiple R-squared: 0.4274, Adjusted R-squared: 0.4207
## F-statistic: 64.43 on 3 and 259 DF, p-value: < 2.2e-16
# STOP! Because adding Hits is not significant
STEP 1: Start with a full model using all the variables. Take out any variables that are not significant.
# Backward
# Start with all vars and take away
mod123<-lm(Salary~CRBI+Hits+Runs, data=Hitters)
summary(mod123)
##
## Call:
## lm(formula = Salary ~ CRBI + Hits + Runs, data = Hitters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -969.16 -186.72 -41.07 100.55 2166.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -46.24811 56.01270 -0.826 0.4098
## CRBI 0.68948 0.06724 10.255 <2e-16 ***
## Hits 2.28191 1.14187 1.998 0.0467 *
## Runs 1.97827 2.00996 0.984 0.3259
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 343.3 on 259 degrees of freedom
## Multiple R-squared: 0.4274, Adjusted R-squared: 0.4207
## F-statistic: 64.43 on 3 and 259 DF, p-value: < 2.2e-16
# First step: Remove Runs
STEP 2: Continue taking out variables until the only variables that remain are significant. We can stop with a two variable model.
mod12<-lm(Salary~CRBI+Hits, data=Hitters)
summary(mod12)
##
## Call:
## lm(formula = Salary ~ CRBI + Hits, data = Hitters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -942.47 -183.72 -37.85 96.55 2167.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -47.95590 55.98245 -0.857 0.392
## CRBI 0.68990 0.06723 10.262 < 2e-16 ***
## Hits 3.30084 0.48177 6.851 5.28e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 343.3 on 260 degrees of freedom
## Multiple R-squared: 0.4252, Adjusted R-squared: 0.4208
## F-statistic: 96.17 on 2 and 260 DF, p-value: < 2.2e-16
# STOP! Now everything is significant
The leaps package in R can used for variable selection. For some reason, there can be problems with installing this package, but if you specify the repository it will work! (See the example in my code)
#install.packages("leaps", repo = 'https://mac.R-project.org')
library(leaps)
This package will default to telling you the best models with one through eight variables included in the model. However, if you want to know all of the different possibilities up to the full model with all the predictors, you can do that too using the nvmax argument.
You must specify what type of selection you would like to do using the method argument.
# Forward
regfit.fwd<-regsubsets(Salary~., data=Hitters,
method="forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, method = "forward")
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
## 7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" "*"
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " "*" "*" " " " " " "
## 5 ( 1 ) " " " " "*" "*" " " " " " "
## 6 ( 1 ) " " " " "*" "*" " " " " " "
## 7 ( 1 ) "*" " " "*" "*" " " " " " "
## 8 ( 1 ) "*" " " "*" "*" " " " " " "
# Backward
regfit.bwd<-regsubsets(Salary~., data=Hitters,
method="backward")
summary(regfit.bwd)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, method = "backward")
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: backward
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " "
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " "*" " "
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " "*" " "
## 4 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " "*" " "
## 5 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" " "
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" " "
## 7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" "*"
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " " " "*" " " " " " "
## 5 ( 1 ) " " " " " " "*" " " " " " "
## 6 ( 1 ) " " " " "*" "*" " " " " " "
## 7 ( 1 ) "*" " " "*" "*" " " " " " "
## 8 ( 1 ) "*" " " "*" "*" " " " " " "
# Best Subset
regfit.full<-regsubsets(Salary~., data=Hitters)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters)
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " " " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " "
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " "*" "*" " " " " " "
## 5 ( 1 ) " " " " "*" "*" " " " " " "
## 6 ( 1 ) " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " " " "*" "*" " " " " " "
## 8 ( 1 ) "*" " " "*" "*" " " " " " "
regfit.full<-regsubsets(Salary~., data=Hitters, nvmax=19)
reg.summary<-summary(regfit.full)
reg.summary
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19)
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: exhaustive
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " " " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " "
## 9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
## 13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " "*" "*" " " " " " "
## 5 ( 1 ) " " " " "*" "*" " " " " " "
## 6 ( 1 ) " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " " " "*" "*" " " " " " "
## 8 ( 1 ) "*" " " "*" "*" " " " " " "
## 9 ( 1 ) "*" " " "*" "*" " " " " " "
## 10 ( 1 ) "*" " " "*" "*" "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
Now lets look at the variables that are contained in the regsubset object. We can learn a lot about different model comparison metrics.
Remember, the goal of these model comparison metrics is to find the “best” model that balances the model fit and simplicity/interpretability.
names(reg.summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
\[ R^2=\frac{SS(Reg)}{SS(Tot)}=1-\frac{SS(Res)}{SS(Tot)}\]
Although the R-squared metric is useful in simple linear regression, it has some drawbacks for multiple linear regression. R-squared monotonically increases as we add more and more variables into the model, which is not good because it does not balance for the increased complexity. We want to maximize this value!
# R-squared (monotonically increasing)
reg.summary$rsq
## [1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227
## [8] 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164
## [15] 0.5454692 0.5457656 0.5459518 0.5460945 0.5461159
rsq_df<-data.frame(rsq=reg.summary$rsq,
size=1:19)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.2
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'readr' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## Warning: package 'forcats' was built under R version 3.6.2
ggplot(rsq_df, aes(x=size, y=rsq))+
geom_point()+
geom_line()+
ggtitle("R-squared")
This is similar to R-squared but gives some penalty to the number of predictors, p. We want to maximize this value!
\[ R^2_{adj}=1-\frac{SS(Res)/(n-p-1)}{SS(Tot)/(n-1)}\]
adjrsq_df<-data.frame(rsq=c(reg.summary$rsq,reg.summary$adjr2),
type=c(rep("rsq",19),rep("rsq_adj",19)),
size=rep(1:19,2))
ggplot(adjrsq_df, aes(x=size, y=rsq, color=type))+
geom_point()+
geom_line()+
ggtitle("R-squared vs Adj R-squared")
Unlike the R-squared values, we want to minimize these metrics:
\[C_p=\frac{1}{n}(SS(Res)+2p\hat{\sigma}^2)\]
\[BIC=\frac{1}{n}(SS(Res)+log(n)p\hat{\sigma}^2)\] Note that BIC gives a harsher penalty and thereby picks smaller models.
# Lets look at the other metrics
metric_df<-data.frame(metric=c(reg.summary$rsq,
reg.summary$adjr2,
reg.summary$cp,
reg.summary$bic),
type=c(rep("rsq",19),
rep("rsq_adj",19),
rep("cp",19),
rep("bic",19)),
size=rep(1:19,4))
# Add vertical lines to show the best model
which.min(reg.summary$bic)
## [1] 6
which.min(reg.summary$cp)
## [1] 10
which.max(reg.summary$rsq)
## [1] 19
which.max(reg.summary$adjr2)
## [1] 11
vline.dat <- data.frame(type=unique(metric_df$type), vl=c(6, 10,19, 11 ))
ggplot(metric_df, aes(x=size, y=metric, color=type))+
geom_point()+
geom_line()+
ggtitle("Model Comparison Metrics")+
geom_vline(aes(xintercept=vl), data=vline.dat, lty=2) +
facet_grid(type~., scales = "free_y")