In this project we analyze baseball data, perform exploratory data analysis, and eventually use team statistics to predict playoff appearances. This project is especially important to me: I have played baseball my entire life, through high school and at UC Berkeley, and I am a die-hard San Francisco Giants fan. So this data and project are close to my heart.
# load initial dependencies
setwd("~/Desktop/DataScientist/Projects")
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(gridExtra)
library(caTools)
library(caret)
## Loading required package: lattice
library(corrplot)
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
##
## The following object is masked from 'package:stats':
##
## lowess
library(memisc)
## Loading required package: MASS
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
##
##
## Attaching package: 'memisc'
##
## The following objects are masked from 'package:dplyr':
##
## collect, query, rename
##
## The following objects are masked from 'package:stats':
##
## contr.sum, contr.treatment, contrasts
##
## The following objects are masked from 'package:base':
##
## as.array, trimws
library(rpart)
library(rpart.plot)
library(MASS)
library(gbm)
## Loading required package: survival
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
The data comes from Lahman's Baseball Database. We download the zip file from the internet, unzip it, and extract the files we need.
# Baseball Zip File Url
url <- 'http://seanlahman.com/files/database/lahman-csv_2015-01-24.zip'
# create a temp file
temp <- tempfile()
# fetch the file into the temp. file and download it
download.file(url, temp)
# extract the target file from temp. file
teams <- read.csv(unz(temp, 'Teams.csv'))
salaries <- read.csv(unz(temp, 'Salaries.csv'))
Clean up the data, mapping previous team IDs (from franchise moves and renames) to the current ones
teams$teamID[teams$teamID == 'CAL'] <- 'LAA'
teams$teamID[teams$teamID == 'FLO'] <- 'MIA'
teams$teamID[teams$teamID == 'ML4'] <- 'MIL'
teams$teamID[teams$teamID == 'MON'] <- 'WAS'
teams$teamID[teams$teamID == 'ANA'] <- 'LAA'
Since we are going to predict whether or not a team makes the playoffs, we need to add that column to the teams data frame
# new column to display if team made playoffs
# the DivWin column already flags division winners, but we also need the wild card teams
teams$Playoff <- teams$DivWin
# this only adds the division winners, so we also add the wild card winners
teams$Playoff[teams$WCWin == 'Y'] <- 'Y'
# convert Playoff to factor
teams$Playoff <- factor(teams$Playoff)
Subset the data to only include wildcard-era teams. The wild card was introduced in 1995, but in 1995 every team played fewer than 146 games, so we drop that year and use 1996 as the cutoff. Now all teams have the same number of seasons (19), except the Arizona Diamondbacks and the Tampa Bay Rays, since neither team entered the league until 1998.
wildcard_era_teams <- teams[teams$yearID >= 1995, ]
table(wildcard_era_teams$G == 144, wildcard_era_teams$yearID)
##
## 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007
## FALSE 6 28 28 30 30 30 30 30 30 30 30 30 30
## TRUE 22 0 0 0 0 0 0 0 0 0 0 0 0
##
## 2008 2009 2010 2011 2012 2013 2014
## FALSE 30 30 30 30 30 30 30
## TRUE 0 0 0 0 0 0 0
# use 1996 as cutoff
with(wildcard_era_teams, table(yearID == 1995, G))
## G
## 143 144 145 161 162 163
## FALSE 0 0 0 48 494 24
## TRUE 2 22 4 0 0 0
wildcard_era_teams <- teams[teams$yearID >= 1996, ]
Since we will eventually predict playoff teams, first we want to look into what makes a playoff team. The factor most directly tied to playoff appearances is wins, so we will look at the distribution of wins and discover, on average, how many wins a team needs to make its way into October
ggplot(subset(wildcard_era_teams, Playoff == 'Y'), aes(W)) +
geom_histogram(fill = 'light blue', color = 'black', binwidth = 1) +
geom_vline(xintercept = mean(wildcard_era_teams$W[wildcard_era_teams$Playoff == 'Y']), color = 'red', linetype = 'longdash') +
geom_vline(xintercept = median(wildcard_era_teams$W[wildcard_era_teams$Playoff == 'Y']), color = 'green', linetype = 'longdash') +
xlab('Wins') + ylab('Frequency') +
ggtitle('Histogram of Win Counts\nWith lines at mean and median')
Since the data is approximately normally distributed, the mean and median will be about the same
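To back this up, here is a minimal check (assuming only the columns built above): compare the two statistics directly and run a rough normality test.
# compare mean and median of playoff-team win counts
playoff_wins <- wildcard_era_teams$W[wildcard_era_teams$Playoff == 'Y']
c(mean = mean(playoff_wins), median = median(playoff_wins))
# Shapiro-Wilk as a rough check of approximate normality
shapiro.test(playoff_wins)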
We can now investigate these mean and median values and determine whether they are suitable as a cutoff for playoff teams
ggplot(wildcard_era_teams, aes(W, color = Playoff)) +
geom_histogram(position = 'dodge', binwidth = 2) +
coord_cartesian(xlim=c(50,120)) +
geom_vline(xintercept = 94, linetype = 2) +
xlab('Wins') + ylab('Count') +
ggtitle('Histogram of Wins Sorted By Playoff Appearance\nWith line at mean of 94')
As we can tell from the graph, we do have one team that won more than 94 games but didn't make the playoffs; we will ignore this team since our accuracy is still 99.8%. So we can use 94 wins as our cutoff to classify a team as a playoff team for that season
# we only care about the one lonely team (over 94 wins, no playoffs) in the confusion matrix
table(wildcard_era_teams$W > 94, wildcard_era_teams$Playoff == 'N')
##
## FALSE TRUE
## FALSE 82 407
## TRUE 76 1
paste0("With an accuracy of ", round(407 /(407 + 1), 3))
## [1] "With an accuracy of 0.998"
# And looking at the data further
over94_no_playoff <- wildcard_era_teams[(wildcard_era_teams$W >= 94) &
(wildcard_era_teams$Playoff == 'N'), ]
paste0("Team with 96 wins and didn't make playoffs is The ", over94_no_playoff$name)
## [1] "Team with 96 wins and didn't make playoffs is The Cincinnati Reds"
We can see this in another plot of wins by team, colored by playoff appearances; it really highlights the '99 Reds against the true playoff teams
# plot to show team wins and playoff appearance
ggplot(wildcard_era_teams, aes(teamID, W, color = Playoff)) + geom_point() +
theme(axis.text.x = element_text(angle = 45)) +
xlab('Team') + ylab('Wins') +
geom_hline(yintercept = 94, linetype = 2) +
ggtitle('Wins By Team Colored By Playoffs Appearances')
We can make a line plot stratified by league and division to display the distribution of wins per year
# summarize new data, extracting the desired columns
team_wins <- wildcard_era_teams %>%
group_by(teamID, yearID, lgID, divID, Playoff) %>%
summarise(W = W) %>%
ungroup() %>%
arrange(teamID)
nl <- ggplot(subset(team_wins, (lgID == 'NL')), aes(yearID, W)) +
geom_line(aes(color = teamID)) +
facet_wrap(~ divID, nrow = 3) +
xlab('Year') + ylab('Wins') +
ylim(50,100) +
ggtitle('Wins By Year - National League')
al <- ggplot(subset(team_wins, (lgID == 'AL')), aes(yearID, W)) +
geom_line(aes(color = teamID)) +
facet_wrap(~ divID, nrow = 3) +
ylim(50,100) +
xlab('Year') + ylab('') +
ggtitle('Wins By Year - American League')
grid.arrange(nl, al, ncol = 2, name = 'Wins By Year Per League and Division')
The boxplot below better represents the data. In it you can see how rare it was for some teams to make the playoffs in certain years, and how routine it was for others. We plot a black dashed line at 94 wins, where a team was 99.8% likely to make the playoffs, and a red dashed line at 88 wins, where a team had a 94% chance of making the playoffs
ggplot(wildcard_era_teams, aes(teamID, W)) +
geom_boxplot(fill = 'blue') +
geom_hline(yintercept = 94, linetype = 2, color = 'black') +
geom_hline(yintercept = 88, linetype = 2, color = 'red') +
theme(axis.text.x = element_text(angle = 90)) +
xlab('Team Name') + ylab('Wins') +
ggtitle('Boxplots of Teams vs Wins')
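As a sanity check on the 99.8% and 94% figures quoted above, here is a small hypothetical helper (not part of the original analysis) that recomputes the playoff rate for any win cutoff; exact values depend on whether the cutoff is inclusive.
# among teams with at least `cutoff` wins, the share that made the playoffs
playoff_rate <- function(df, cutoff) {
  mean(df$Playoff[df$W >= cutoff] == 'Y')
}
playoff_rate(wildcard_era_teams, 94)
playoff_rate(wildcard_era_teams, 88)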
This is a very interesting graph: you can see outlier seasons where a team made the playoffs when it typically did not, like PIT and CIN. In contrast, you can see teams that continually made the playoffs, like the NY Yankees and ATL, and also teams that really haven't been close to playoff bound throughout these years, like MIA and TOR.
Now we need to use EDA to see which variables contribute to wins, and hence to making the playoffs. First we add some derived variables, including Batting Average, On-Base Percentage, and Slugging Percentage, to our wildcard-era teams data frame. We also add run differential, which we will explore later
# set NA sac flies and hit-by-pitches to 0 so we can compute the correlation matrix
wildcard_era_teams$SF[is.na(wildcard_era_teams$SF)] <- 0
wildcard_era_teams$HBP[is.na(wildcard_era_teams$HBP)] <- 0
# new column for team Batting Average
wildcard_era_teams$BA <- wildcard_era_teams$H / wildcard_era_teams$AB
# new column for team On-Base Percentage
# this would return NA for seasons before HBP was recorded, but that does not affect this analysis
wildcard_era_teams <- transform(wildcard_era_teams, OBP = (H + BB + HBP) /
(AB + BB + HBP + SF))
# new column for team Slugging Percentage
wildcard_era_teams <- transform(wildcard_era_teams, SLG = (H + X2B + (2 * X3B) +
(3 * HR)) / AB)
wildcard_era_teams <- transform(wildcard_era_teams, OPS = OBP + SLG)
# new column for run differential
wildcard_era_teams$Diff <- wildcard_era_teams$R - wildcard_era_teams$RA
We also have a separate data frame with each player's salary and the respective year. We can subset the salary data and use merge to attach total team payrolls to the wildcard-era teams data.
#subset same years for salary
wildcard_era_salaries <- salaries[salaries$yearID >= 1996, ]
# pick out desired columns to perform merge
team_salaries <- wildcard_era_salaries %>%
group_by(teamID, yearID) %>%
summarise(Salary = sum(salary)) %>%
ungroup() %>%
arrange(teamID)
wildcard_era_teams <- merge(wildcard_era_teams, team_salaries, by = c('teamID', 'yearID'))
Split the wildcard-era teams into training and test sets before looking at predictions for playoff teams
set.seed(101)
split <- sample.split(wildcard_era_teams$Playoff, 0.8)
train <- subset(wildcard_era_teams, split == TRUE)
test <- subset(wildcard_era_teams, split == FALSE)
dim(train)
## [1] 424 55
There are obviously a lot of independent variables that are correlated with one another, since baseball stats are deeply intertwined.
# subset data to only include numeric and integer variables
nums <- sapply(train, is.numeric) | sapply(train, is.integer)
# create data frame for correlation
train_nums <- train[,nums]
# exclude unwanted variables
train_nums <- train_nums[,-which(names(train_nums) %in% c('yearID', 'Rank', 'G', 'Ghome', 'IPouts', 'attendance', 'BPF', 'PPF'))]
#compute correlation matrix
corMatrix <- cor(train_nums)
# plot matrix
corrplot(corMatrix, method = "circle", type="lower", tl.cex=0.55, cl.cex = 0.5, add = FALSE, tl.pos="lower")
corrplot(corMatrix, method = "shade", type="upper", tl.cex=0.55, cl.cex = 0.5, add = TRUE, tl.pos="upper")
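If we wanted to prune collinear columns before modeling, caret's findCorrelation offers a quick way to flag them (a sketch, not part of the original pipeline):
# flag columns with pairwise correlation above 0.9 (candidates to drop)
high_cor_idx <- findCorrelation(corMatrix, cutoff = 0.9)
colnames(corMatrix)[high_cor_idx]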
To make the playoffs, a team needs a lot of wins; to win, a team needs to score more runs than it gives up; and to have a good run differential, a team needs good hitting along with good pitching and defense. So really, almost every variable can potentially predict playoff appearances (which is why I love baseball)
So to predict playoff appearances, we have to predict wins. The most obvious first variables to look at are Runs and Runs Against
# plot wins vs runs
p1 <- ggplot(train, aes(W, R)) + geom_point(aes(color = Playoff)) +
xlab('Wins') + ylab('Runs') +
ggtitle('Wins Per Runs\ncolored by Playoff Teams')
# plot wins vs runs given up
p2 <- ggplot(train, aes(W, RA)) + geom_point(aes(color = Playoff)) +
xlab('Wins') + ylab('Runs Against') +
ggtitle('Wins Per Runs Against\n colored by Playoff Teams')
grid.arrange(p1, p2, ncol = 2)
Now we can combine these two predictors. The difference between runs scored and runs allowed better represents the data; this is the Diff column calculated previously
ggplot(train, aes(W, Diff)) + geom_point(aes(color = Playoff)) +
xlab('Wins') + ylab('Difference in Runs (R - RA)') +
ggtitle('Wins Per Runs Differential\n colored by Playoff Teams')
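As a rough quantification of how well Diff tracks wins, a one-variable linear fit (a sketch, separate from the models below) reports the variance explained:
# how much of the variance in wins does run differential alone explain?
diff_fit <- lm(W ~ Diff, data = train)
summary(diff_fit)$r.squared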
Now, since wins and runs are obvious predictors of playoffs, we need to look at the real baseball stats that predict runs, so we pick out some variables to explore further
p1 <- ggplot(train, aes(R, BA)) +
geom_point(aes(color = W)) +
scale_color_continuous(name = 'Wins') +
xlab('Runs') + ylab('Batting Average') +
ggtitle(paste0('Runs by Batting Average\n Stratified by Wins\n Correlation: ',
round(cor(train_nums$R, train_nums$BA), 3)))
p2 <- ggplot(train, aes(R, OBP)) +
geom_point(aes(color = W)) +
scale_color_continuous(name = 'Wins') +
xlab('Runs') + ylab('On Base Percentage') +
ggtitle(paste0('Runs by On Base Percentage\n Stratified by Wins\n Correlation: ',
round(cor(train_nums$R, train_nums$OBP), 3)))
p3 <- ggplot(train, aes(R, SLG)) +
geom_point(aes(color = W)) +
scale_color_continuous(name = 'Wins') +
xlab('Runs') + ylab('Slugging Percentage') +
ggtitle(paste0('Runs by Slugging Percentage\n Stratified by Wins\n Correlation: ',
round(cor(train_nums$R, train_nums$SLG), 3)))
p4 <- ggplot(train, aes(R, HR)) +
geom_point(aes(color = W)) +
scale_color_continuous(name = 'Wins') +
xlab('Runs') + ylab('Home Runs') +
ggtitle(paste0('Runs by Home Runs\n Stratified by Wins\n Correlation: ',
round(cor(train_nums$R, train_nums$HR), 3)))
grid.arrange(p1, p2, p3, p4, ncol = 2)
We can do one better and combine OBP and SLG, since in the last few years baseball has started to track OPS (On-Base Percentage Plus Slugging).
ggplot(train, aes(R, OPS)) +
geom_point(aes(color = W)) +
scale_color_continuous(name = 'Wins') +
xlab('Runs') + ylab('On Base Plus Slugging') +
ggtitle(paste0('Runs by On Base Plus Slugging\n Stratified by Wins\n Correlation: ',
round(cor(train_nums$R, train_nums$OPS), 3)))
So the best single variable for predicting runs is On-Base Plus Slugging. This is interesting, and it makes us wonder: statistical analysis like this determined that OBP and SLG were two of the most important factors, and it simply made sense to combine the two into one comprehensible predictor. So did analysis add a new baseball stat?
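To see the comparison in one place, a quick sketch lines up each batting stat's correlation with runs scored:
# correlation of each batting stat with runs, side by side
sapply(train_nums[, c('BA', 'OBP', 'SLG', 'HR', 'OPS')],
       function(x) cor(train_nums$R, x))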
Now we can look at the other side of the spectrum. It is always said that pitching and defense win championships (definitely true for the SF Giants!), so we can look at that, using Runs Against instead of Runs.
p1 <- ggplot(train, aes(RA, ERA)) +
geom_point(aes(color = W)) +
scale_color_continuous(name = 'Wins') +
xlab('Runs Against') + ylab('Earned Run Average') +
ggtitle(paste0('Runs Against by ERA\n Stratified by Wins\n Correlation: ',
round(cor(train_nums$RA, train_nums$ERA), 3)))
p2 <- ggplot(train, aes(RA, SO)) +
geom_point(aes(color = W)) +
scale_color_continuous(name = 'Wins') +
xlab('Runs Against') + ylab('Strike Outs') +
ggtitle(paste0('Runs Against by Strike Outs\n Stratified by Wins\n Correlation: ',
round(cor(train_nums$RA, train_nums$SO), 3)))
p3 <- ggplot(train, aes(RA, HA)) +
geom_point(aes(color = W)) +
scale_color_continuous(name = 'Wins') +
xlab('Runs Against') + ylab('Hits Against') +
ggtitle(paste0('Runs Against by Hits Allowed\n Stratified by Wins\n Correlation: ',
round(cor(train_nums$RA, train_nums$HA), 3)))
p4 <- ggplot(train, aes(RA, E)) +
geom_point(aes(color = W)) +
scale_color_continuous(name = 'Wins') +
xlab('Runs Against') + ylab('Errors') +
ggtitle(paste0('Runs Against by Errors\n Stratified by Wins\n Correlation: ',
round(cor(train_nums$RA, train_nums$E), 3)))
grid.arrange(p1, p2, p3, p4, ncol = 2)
ERA is obviously the king of all pitching stats, so it is the most important here; interestingly, Errors were not as strongly associated with Runs Against as expected.
Maybe you can pay for wins and playoff appearances, so we can also plot wins by salary per year
ggplot(train, aes(W, Salary)) + geom_point(aes(color = Playoff)) +
facet_wrap(~ yearID) +
xlab('Wins') + ylab('Team Salary') +
ggtitle('Total Wins By Salary')
We can also look at the correlation, and it is not as high as we might think, so you really can't just pay for wins in today's MLB, a point famously driven home by Billy Beane's A's many years ago
cor(train$W, train$Salary)
## [1] 0.3026466
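A quick sketch (not in the original analysis) checks whether that relationship varies by season:
# correlation of wins with payroll, computed separately for each year
train %>%
  group_by(yearID) %>%
  summarise(cor_W_Salary = cor(W, Salary))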
It would be cheating to just use Wins, Runs, or Run Differential in our model, as it is obvious that these predict playoff appearances. Since we need to use on-field stats, we will leave those attributes out and look at real baseball stats to predict whether or not a team makes the playoffs.
#### Logistic Regression
We can use simple logistic regression for predictions on the training set
binary_train <- train
# change Playoff to 1 for Y and 0 for N to work with ROCR
binary_train$Playoff <- ifelse(binary_train$Playoff == 'Y', 1, 0)
# fit several candidate logistic models
glmfit1 <- glm(Playoff ~ OPS, data = binary_train, family = binomial)
glmfit2 <- glm(Playoff ~ OPS + ERA, data = binary_train, family = binomial)
glmfit3 <- glm(Playoff ~ OPS + ERA + E, data = binary_train, family = binomial)
glmfit4 <- glm(Playoff ~ OPS + ERA + E + Salary, data = binary_train, family = binomial)
glmfit5 <- glm(Playoff ~ OPS + SF + SO + HA + RA + SV + BBA + DP + HRA + HR + AB + R + ER + X2B + H + SB + BB + HBP + SOA + E + SHO + OBP + Salary + SLG + X3B + CG + BA + CS + OPS + ERA, data = binary_train, family = binomial)
Bias-variance trade-off: in the last model we simply threw in all the variables just to look at the result. We would not really be interested in this model, as it is not very interpretable; we should select only a certain subset of variables
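One quick way to quantify that trade-off is AIC, which penalizes the extra parameters (lower is better); a sketch:
# compare the five fits by AIC
sapply(list(glmfit1, glmfit2, glmfit3, glmfit4, glmfit5), AIC)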
Therefore our baseline prediction would be No Playoffs, with an accuracy of 0.7099057 (301 of 424 training teams missed the playoffs)
with(train, table(Playoff))
## Playoff
## N Y
## 0 301 123
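That baseline is just the majority-class share, which a one-liner confirms:
# always predicting 'N' (no playoffs) gives the majority-class accuracy
sum(train$Playoff == 'N') / nrow(train)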
For convenience, we define functions to evaluate our logistic regression models, then build vectors of accuracy and area under the ROC curve for the models
EvalModelCF <- function(model) {
pred <- predict(model, type = 'response')
return(sum(diag(table(pred > 0.5, binary_train$Playoff))) / nrow(binary_train))
}
EvalModelAUC <- function(model) {
pred <- predict(model, type = 'response')
ROCRpred <- prediction(pred, binary_train$Playoff)
return(as.numeric(performance(ROCRpred, 'auc')@y.values))
}
# training set confusion-matrix accuracy
c(EvalModelCF(glmfit1), EvalModelCF(glmfit2), EvalModelCF(glmfit3), EvalModelCF(glmfit4), EvalModelCF(glmfit5))
## [1] 0.7216981 0.8702830 0.8726415 0.8726415 0.9174528
# training set AUC
c(EvalModelAUC(glmfit1), EvalModelAUC(glmfit2), EvalModelAUC(glmfit3), EvalModelAUC(glmfit4), EvalModelAUC(glmfit5))
## [1] 0.7038598 0.9355266 0.9350944 0.9352565 0.9686411
Even when we put all of our variables into the model, we only get a slight increase in predictive performance, so simpler is better. Therefore, our best model for logistic regression would be the simple OPS + ERA
We can also try linear discriminant analysis (LDA), returning a vector of confusion matrix accuracies
ldafit1 <- lda(Playoff ~ OPS + ERA, data = binary_train)
ldafit2 <- lda(Playoff ~ OPS + ERA + BB + HA + BBA, data = binary_train)
Evallda <- function(model) {
pred <- predict(model)
lda.class <- pred$class
return(sum(diag(table(lda.class, binary_train$Playoff)))/nrow(binary_train))
}
c(Evallda(ldafit1), Evallda(ldafit2))
## [1] 0.8726415 0.8655660
We can then use a decision tree. Decision trees typically aren't highly predictive, but they are excellent for interpretability, and even non-analysts can comprehend the plot. We first use cross-validation to set the cp parameter to better build our tree
set.seed(321)
numFolds <- trainControl(method = 'cv', number = 10)
cpGrid <- expand.grid(.cp = seq(0.001, 0.05, 0.001))
best_cp <- train(Playoff ~ ., data = binary_train, method = 'rpart', trControl = numFolds, tuneGrid = cpGrid)
CARTfit <- rpart(Playoff ~ OPS + SF + SO + HA + RA + SV + BBA + DP + HRA + HR + AB + R + ER + X2B + H + SB + BB + HBP + SOA + E + SHO + OBP + Salary + SLG + X3B + CG + BA + CS + OPS + ERA, data = binary_train, method = 'class', cp = best_cp$bestTune)
# plot tree
prp(CARTfit)
As we can see from the tree, the most important (first) split is ERA, and you can work your way down the branches of the tree to determine what factors contribute to playoff teams. Note: in the plot, 1 = playoffs and 0 = no playoffs
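For reference, we can also inspect the cross-validated complexity parameter chosen above and (assuming the fit recorded them) rpart's own variable-importance scores as a cross-check on the plot:
# the chosen cp from cross-validation
best_cp$bestTune
# rpart's internal importance ranking (may be absent for some fits)
CARTfit$variable.importance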
We can evaluate the prediction accuracy on the training set for our decision tree
# predict on training set
trainPredCART <- predict(CARTfit)
# prediction accuracy
table(trainPredCART[,2] > 0.5, binary_train$Playoff)
##
## 0 1
## FALSE 289 29
## TRUE 12 94
sum(diag(table(trainPredCART[,2] > 0.5, binary_train$Playoff))) / nrow(binary_train)
## [1] 0.9033019
# AUC
ROCRpredCART <- prediction(trainPredCART[,2], binary_train$Playoff)
as.numeric(performance(ROCRpredCART, 'auc')@y.values)
## [1] 0.9028982
We get a decent result, not as good as our logistic regression, but that was to be expected, and we get a great explanatory graphic here.
We could use a random forest or gradient boosting, as we do below, to increase our prediction accuracy, even reaching 100% accuracy on the training set. However, again, we would rather have a more interpretable, slightly less accurate model than a complex one
set.seed(3)
boostfit <- gbm(Playoff ~ OPS + ERA + SF + SO + HA + BBA + DP + HRA + HR + AB + X2B + H + SB + BB + HBP + SOA + E + SHO + Salary + X3B + CG + BA + CS, data = binary_train, distribution="bernoulli", n.trees=5000, interaction.depth=4, shrinkage =0.2)
# training set confusion-matrix accuracy
boostpred <- predict(boostfit, n.trees = 5000)
table(boostpred > 0.5, binary_train$Playoff)
##
## 0 1
## FALSE 301 0
## TRUE 0 123
sum(diag(table(boostpred > 0.5, binary_train$Playoff)))/nrow(binary_train)
## [1] 1
# AUC
ROCRpredBoost <- prediction(boostpred, binary_train$Playoff)
as.numeric(performance(ROCRpredBoost, 'auc')@y.values)
## [1] 1
As you can see, we get perfect predictions on the training set when we pass in many variables
Interestingly, we can look at the variable importance summary from the boosting model and see that the two most important variables are ERA and OPS, just as we expected from our previous analysis
summary(boostfit)
## var rel.inf
## ERA ERA 21.6485360
## OPS OPS 15.5040136
## BB BB 9.7736368
## BBA BBA 6.6310762
## HR HR 6.0199394
## HA HA 4.7977298
## HRA HRA 4.3374891
## SHO SHO 3.8999004
## SB SB 3.6564620
## AB AB 3.1035279
## BA BA 2.8410996
## E E 2.5468286
## X2B X2B 2.0285220
## SF SF 1.8918423
## SOA SOA 1.6992293
## HBP HBP 1.6919997
## CS CS 1.6317152
## Salary Salary 1.5708404
## H H 1.3823560
## DP DP 1.0941966
## SO SO 1.0714925
## X3B X3B 0.6643984
## CG CG 0.5131681
After seeing this, we can fit a new logistic regression, adding five of the top predictors from the gradient boosting machine. The first number in the vector is the confusion matrix accuracy and the second is the AUC
glmfit6 <- glm(Playoff ~ OPS + ERA + BB + HA + BBA, data = binary_train, family = binomial)
c(EvalModelCF(glmfit6), EvalModelAUC(glmfit6))
## [1] 0.8702830 0.9412527
This is now our best interpretable model, so we will use it; let's look at its coefficients
glmfit6$coefficients
## (Intercept) OPS ERA BB HA
## -21.992433956 52.832405072 -5.528801215 0.006763696 0.003092544
## BBA
## -0.008091060
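Since logistic regression coefficients are on the log-odds scale, exponentiating gives odds ratios; a small sketch (bearing in mind that a one-unit jump in OPS or ERA is enormous relative to their real ranges):
# multiplicative change in playoff odds per one-unit increase in each predictor
exp(glmfit6$coefficients)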
For convenience we create new variables for predicting on our testing set.
best_model_glm <- glmfit6
best_model_lda <- ldafit1
best_model_tree <- CARTfit
best_model_boost <- boostfit
First we need to create our binary test set, like we did for our training set.
binary_test <- test
# change Playoff to 1 for Y and 0 for N
binary_test$Playoff <- ifelse(binary_test$Playoff == 'Y', 1, 0)
Here we get a test accuracy of 0.8584906 and an AUC of 0.9212903, only a couple of percentage points lower than our training set predictions, meaning that our model did not badly overfit the data
predTestglm <- predict(best_model_glm, newdata = binary_test, type = 'response')
table(predTestglm > 0.5, binary_test$Playoff)
##
## 0 1
## FALSE 71 11
## TRUE 4 20
sum(diag(table(predTestglm > 0.5, binary_test$Playoff))) / nrow(binary_test)
## [1] 0.8584906
ROCRpredglm <- prediction(predTestglm, binary_test$Playoff)
as.numeric(performance(ROCRpredglm, 'auc')@y.values)
## [1] 0.9212903
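The AUC above summarizes the whole ROC curve; a short sketch plots the curve itself from the same ROCR prediction object:
# test-set ROC curve for the logistic model, with the chance diagonal
roc_perf <- performance(ROCRpredglm, 'tpr', 'fpr')
plot(roc_perf, main = 'Test ROC - Logistic Regression')
abline(0, 1, lty = 2)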
Here we get an accuracy of 0.8490566, slightly lower than the logistic regression
predTestlda <- predict(best_model_lda, newdata = binary_test)
table(predTestlda$class, binary_test$Playoff)
##
## 0 1
## 0 68 9
## 1 7 22
sum(diag(table(predTestlda$class, binary_test$Playoff))) / nrow(binary_test)
## [1] 0.8490566
With our tree we get a lower accuracy of 0.8018868 and an area under the curve of 0.9028982, which is not bad for a decision tree, and it comes with a very convenient graphic
predTestCART <- predict(best_model_tree, newdata = binary_test)
table(predTestCART[,2] > 0.5, binary_test$Playoff)
##
## 0 1
## FALSE 70 16
## TRUE 5 15
sum(diag(table(predTestCART[,2] > 0.5, binary_test$Playoff))) / nrow(binary_test)
## [1] 0.8018868
ROCRpredTestCART <- prediction(predTestCART[,2], binary_test$Playoff)
as.numeric(performance(ROCRpredTestCART, 'auc')@y.values)
## [1] 0.9028982
Interestingly, our worst test prediction accuracy came from the boosting model, with a confusion matrix accuracy of 0.7641509 and an AUC of 0.867957, meaning we really overfit our training data
predTestboost <- predict(best_model_boost, newdata = binary_test, n.trees = 5000)
table(predTestboost > 0.5, binary_test$Playoff)
##
## 0 1
## FALSE 69 19
## TRUE 6 12
sum(diag(table(predTestboost > 0.5, binary_test$Playoff))) / nrow(binary_test)
## [1] 0.7641509
ROCRpredboost <- prediction(predTestboost, binary_test$Playoff)
as.numeric(performance(ROCRpredboost, 'auc')@y.values)
## [1] 0.867957
So our best model is our logistic regression, with a 0.8584906 test prediction accuracy. Interestingly, our GBM scored only about 76% on the test set while hitting 100% on the training set, so a good deal of overfitting was occurring. If we used the model with all the variables, we might obtain better training accuracy at the cost of higher variance; the most important thing for this analysis, and for baseball teams, is a simple model that trades a little bias for much lower variance. This model consisted of ERA + OPS + BB + HA + BBA. These may not be the sexiest baseball statistics to strive for, but the numbers never lie. So for all you young athletes out there, remember that Earned Run Average and On-Base Plus Slugging are the two best attributes to strive for, and don't forget that walks are important as well.
I thought it was really cool that you can create a spray chart in R. This chart and data have no connection to the analysis in the previous sections, but it shows what you can do with good data and R. In this plot I used Detroit's Miguel Cabrera's seasons from 2009-2012
load("balls_strikes_count.Rdata")
# create bases for graph
bases <- data.frame(x = c(0, 90/sqrt(2), 0, -90/sqrt(2), 0),
y = c(0, 90/sqrt(2), 2 * 90/sqrt(2), 90/sqrt(2), 0))
p1 <- ggplot(cabrera, aes(hitx, hity)) +
geom_point(aes(color = hit_outcome)) +
coord_equal() +
facet_wrap(~ season)
p2 <- p1 + geom_path(data = bases, aes(x = x, y = y)) +
geom_segment(x = 0, xend = 300, y = 0, yend = 300) +
geom_segment(x = 0, xend = -300, y = 0, yend = 300) +
xlab('') + ylab('') +
ggtitle('Miguel Cabrera Spray Chart')
p2
## Warning: Removed 1058 rows containing missing values (geom_point).
## Warning: Removed 964 rows containing missing values (geom_point).
## Warning: Removed 1030 rows containing missing values (geom_point).
## Warning: Removed 1232 rows containing missing values (geom_point).