Intro: Using the 2016 Lahman Baseball Database, I queried information on MLB pitching staffs from 2007-2016. With this data, I am going to figure out what teams spent for every strikeout their pitching staff created. I also want to run analysis on different statistic combinations to identify what to look for when assembling a pitching staff.
Objective: 1. To find the Price Per Strikeout of each pitching staff from 2007-2016. 2. To find any correlation between other factors (Age, Salary, Walks, etc.) that would lead to more efficient wins.
Libraries In order to start my analysis I loaded the libraries I will need for this.
library(RSQLite)
library(ggplot2)
library(dplyr)
library(caret)
Next step was the use SQLite to querey the data I will be using.
conn <- dbConnect(drv = SQLite(), dbname = "lahman2016.db")
q <- "select P.playerID as Player, (P.yearID - birthYear) as age, P.yearID as Year, P.teamID as Team,
P.W as Wins, P.L as Loss, P.G as Games, P.GS as Starts, P.SO as SO, P.HR as HR, P.BB as BB, salary, tm.W as TeamW, tm.L as TeamL, tm.WSWin as WS
from Pitching P
inner join (
select playerID, yearID, teamID, salary
from Salaries
where yearID >= 2007
) PS on PS.playerID = P.playerID and PS.yearID = P.yearID and PS.teamID = P.teamID
inner join (
select distinct teamID, franchID, W, L, yearID, WSWin
from Teams
where yearID >= 2007 ) tm
on tm.teamID = P.teamID and tm.yearID = p.YearID
left join (
select playerID, birthYear
from Master ) ma
on ma.playerID = P.playerID "
MLB <- dbGetQuery(conn = conn, statement = q)
dbDisconnect(conn)
This querey yeilded the following data frame:
head(MLB)
## Player age Year Team Wins Loss Games Starts SO HR BB salary TeamW
## 1 cruzju02 29 2007 ARI 6 1 53 0 87 7 32 1437500 90
## 2 davisdo02 32 2007 ARI 13 12 33 33 144 21 95 5500000 90
## 3 durbijd01 25 2007 ARI 0 0 1 0 1 0 1 380000 90
## 4 gonzaed01 24 2007 ARI 8 4 32 12 62 18 28 387500 90
## 5 hernali01 32 2007 ARI 11 11 33 33 90 34 79 7000000 90
## 6 johnsra05 44 2007 ARI 4 3 10 10 72 7 13 9100546 90
## TeamL WS
## 1 72 N
## 2 72 N
## 3 72 N
## 4 72 N
## 5 72 N
## 6 72 N
From this data I wanted to calculate the price every team paid for a strikeout.
PPS <- MLB %>%
group_by(Team, Year) %>%
mutate(Price_ps = sum(salary)/sum(SO),
PricePPW = sum(salary)/TeamW,
Avg_Age = mean(age))%>%
select(Team, Year, Price_ps, TeamW, PricePPW, Avg_Age, WS) %>%
ungroup()
This code grouped by team and year and then found the price per K. I also added in team wins and which teams won the World Series that season.
summary(PPS$Price_ps)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9988 32010 43752 46702 57128 120897
ggplot(data = PPS) +
geom_histogram(mapping = aes(x = Price_ps),
fill = "steel blue") +
geom_vline(xintercept = mean(PPS$Price_ps),
color = '#dd2e31')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = PPS) +
geom_point(mapping = aes(x = TeamW, y = Price_ps, color = WS)) +
labs(x = "Total Wins", y = "Price Per K")
The charted price per strikeout two different ways. In the histogram the red line is the average price per stirkeout. And the green dots in the point graph represent teams that won the World Series. There isn’t really much correlation between price per strike out and wins (or even winning a World Series).
ggplot(data = PPS) +
geom_point(mapping = aes(x = Avg_Age, y = Price_ps, color = WS)) +
labs(x = "Age", y = 'Price per K')+
theme_dark()
Subbing out total wins with avergae age, you see that the younger a pitching staff the less their Price Per K is. Young arms don’t nearly command the money that established pitchers do. And I’ll show this in my next graph. And the ideal age for a pitching staff to win the World Series is between 30 and 32 years old. And the majority of those teams spent 50-75k per strikeout.
AgeSal <- MLB %>%
group_by(age) %>%
summarise(Avg_Sal = mean(salary))
ggplot(data = AgeSal) +
geom_point(mapping = aes(x = age, y = Avg_Sal))
I ran a querey to find the average salary for each age represented in our original data. the results show an interesting trajectory.
Few thoughts: -Age 36 stood out to me. Everything between age 21 and 40 has this nice flow to it, but age 36 is like, “Nah.” Just complete outlier.
-After age 40, it has a litte shape to it, but overall is messy. Hypothesis to this is that lower paid pitchers who are aging get cut for younger arms that cost the same amount of money. So the average salary gets inflated.
Next, I wanted to look at a few other items for fun.
Team_Year <- tbl_df(MLB) %>%
group_by(Team, Year) %>%
summarise(PriceSO = sum(salary)/sum(SO),
PriceBB = sum(salary)/sum(BB),
PriceHR = sum(salary)/sum(HR),
BBOut = sum(SO)/sum(BB),
Avg_Age = mean(age),
Avg_Sal = mean(salary),
WonWS = first(WS),
Wins = first(TeamW))%>%
ungroup()
So I added a few more items to the list of things I wanted to look at. Things such as the cost of a home run, the cost of a walk, and the cost of an out. I want to use these items to see if I can find a correlation with those and wins.
features_1 <- Team_Year %>%
select(PriceSO)
features_2 <- Team_Year %>%
select(PriceBB)
features_3 <- Team_Year %>%
select(PriceHR)
features_4 <- Team_Year %>%
select(Avg_Sal)
features_5 <- Team_Year %>%
select(BBOut)
features_6 <- Team_Year %>%
select(Avg_Age)
features_7 <- Team_Year %>%
select(Wins)
I set up these as features is be more efficient when running liner regresion models.
labels <- Team_Year %>%
.$PriceSO
model <- train(x = features_6, y = labels, method = "lm")
model$results
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 17220.46 0.2437634 13521.52 1283.047 0.0642422 862.0192
model$finalModel$coefficients
## (Intercept) Avg_Age
## -186838.46 7899.99
ggplot(data = Team_Year) +
geom_point(mapping = aes(x = Avg_Age , y = PriceSO),
alpha = .75,
color = '#2db9dd') +
geom_abline(intercept = model$finalModel$coefficients[[1]],
slope = model$finalModel$coefficients[[2]],
color = 'red', size = 2) +
theme_dark() +
labs(x = "Age", y = "Price Per K")
Running this model, it predicts that for every $7,900 a team spends on it’s pitching staff, that will return 1 additional strikeout.
labels <- Team_Year %>%
.$Avg_Sal
model <- train(x = features_5, y = labels, method = "lm")
model$results
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 1307787 0.1663019 1045336 79667.58 0.06605643 64906.02
model$finalModel$coefficients
## (Intercept) BBOut
## 122279.2 1254648.2
ggplot(data = Team_Year) +
geom_point(mapping = aes(x = BBOut , y = Avg_Sal),
alpha = .75,
color = '#fffa00') +
geom_abline(intercept = model$finalModel$coefficients[[1]],
slope = model$finalModel$coefficients[[2]],
color = '#0011ff', size = 2) +
theme_dark() +
labs(x = "K/BB Ratio", y = "Salary")
This model shows that in order to get improve a staffs strikeout to walk ratio by one tenth, a team will have to spend an additional $125,464.
Now comparing to a model of K/BB ratio to wins, a team can improve their K/BB ratio that should provide 9 extra wins.
labels <- Team_Year %>%
.$Wins
model <- train(x = features_5, y = labels, method = "lm")
model$results
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 10.14063 0.1409351 8.509873 0.3952705 0.03699643 0.34917
model$finalModel$coefficients
## (Intercept) BBOut
## 58.775678 9.052806
ggplot(data = Team_Year) +
geom_point(mapping = aes(x = BBOut, y = Wins),
alpha = .5,
color = "#ff8c00") +
labs(x = "K/BB Ratio", y = "Wins") +
geom_abline(intercept = model$finalModel$coefficients[[1]],
slope = model$finalModel$coefficients[[2]],
color = '#0011ff', size = 2)
A team can improve their record by 9 games for an additional $1,254,648 for pitchers with a solid K/BB ratio.
In conclusion, there are so many other factors that can lead to a teams success, and baseball has a lot of variance. I did find it intersting that teams who had an average staff age between 30-32 years old with a price per K between 50-75k won the most World Series titles and that the oldest pitchers had the highest average salary.