Title: The Cost of a K in the Moneyball Era

Author: Aaron Staggs

Intro: Using the 2016 Lahman Baseball Database, I queried information on MLB pitching staffs from 2007-2016. With this data, I am going to figure out what teams spent for every strikeout their pitching staff created. I also want to run analysis on different statistic combinations to identify what to look for when assembling a pitching staff.

Objective: 1. To find the Price Per Strikeout of each pitching staff from 2007-2016. 2. To find any correlation between other factors (Age, Salary, Walks, etc.) that would lead to more efficient wins.

Libraries In order to start my analysis I loaded the libraries I will need for this.

library(RSQLite)
library(ggplot2)
library(dplyr)
library(caret) 

Next step was the use SQLite to querey the data I will be using.

conn <- dbConnect(drv = SQLite(), dbname = "lahman2016.db")

q <- "select P.playerID as Player, (P.yearID - birthYear) as age, P.yearID as Year, P.teamID as Team, 
P.W as Wins, P.L as Loss, P.G as Games, P.GS as Starts, P.SO as SO,  P.HR as HR, P.BB as BB, salary, tm.W as TeamW, tm.L as TeamL, tm.WSWin as WS
from Pitching P
inner join (
select playerID, yearID, teamID, salary
from Salaries
where yearID >= 2007
) PS on PS.playerID = P.playerID and PS.yearID = P.yearID and PS.teamID = P.teamID
inner join (
select distinct teamID, franchID, W,  L, yearID, WSWin
from Teams
where yearID >= 2007 ) tm
on tm.teamID = P.teamID and tm.yearID = p.YearID
left join (
select playerID, birthYear
from Master ) ma
on ma.playerID = P.playerID "

MLB <- dbGetQuery(conn = conn, statement = q)
dbDisconnect(conn)

This querey yeilded the following data frame:

head(MLB)
##      Player age Year Team Wins Loss Games Starts  SO HR BB  salary TeamW
## 1  cruzju02  29 2007  ARI    6    1    53      0  87  7 32 1437500    90
## 2 davisdo02  32 2007  ARI   13   12    33     33 144 21 95 5500000    90
## 3 durbijd01  25 2007  ARI    0    0     1      0   1  0  1  380000    90
## 4 gonzaed01  24 2007  ARI    8    4    32     12  62 18 28  387500    90
## 5 hernali01  32 2007  ARI   11   11    33     33  90 34 79 7000000    90
## 6 johnsra05  44 2007  ARI    4    3    10     10  72  7 13 9100546    90
##   TeamL WS
## 1    72  N
## 2    72  N
## 3    72  N
## 4    72  N
## 5    72  N
## 6    72  N

From this data I wanted to calculate the price every team paid for a strikeout.

PPS <- MLB %>%
  group_by(Team, Year) %>%
  mutate(Price_ps = sum(salary)/sum(SO),
         PricePPW = sum(salary)/TeamW,
         Avg_Age = mean(age))%>%
  select(Team, Year, Price_ps, TeamW, PricePPW, Avg_Age, WS) %>%
  ungroup()

This code grouped by team and year and then found the price per K. I also added in team wins and which teams won the World Series that season.

summary(PPS$Price_ps)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9988   32010   43752   46702   57128  120897
ggplot(data = PPS) +
  geom_histogram(mapping = aes(x = Price_ps),
                 fill = "steel blue") +
  geom_vline(xintercept = mean(PPS$Price_ps),
             color = '#dd2e31')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = PPS) +
  geom_point(mapping = aes(x = TeamW, y = Price_ps, color = WS)) +
  labs(x = "Total Wins", y = "Price Per K")

The charted price per strikeout two different ways. In the histogram the red line is the average price per stirkeout. And the green dots in the point graph represent teams that won the World Series. There isn’t really much correlation between price per strike out and wins (or even winning a World Series).

ggplot(data = PPS) +  
  geom_point(mapping = aes(x = Avg_Age, y = Price_ps, color = WS)) +
  labs(x = "Age", y = 'Price per K')+
  theme_dark()

Subbing out total wins with avergae age, you see that the younger a pitching staff the less their Price Per K is. Young arms don’t nearly command the money that established pitchers do. And I’ll show this in my next graph. And the ideal age for a pitching staff to win the World Series is between 30 and 32 years old. And the majority of those teams spent 50-75k per strikeout.

AgeSal <- MLB %>%
  group_by(age) %>%
  summarise(Avg_Sal = mean(salary))

ggplot(data = AgeSal) +
  geom_point(mapping = aes(x = age, y = Avg_Sal))

I ran a querey to find the average salary for each age represented in our original data. the results show an interesting trajectory.

Few thoughts: -Age 36 stood out to me. Everything between age 21 and 40 has this nice flow to it, but age 36 is like, “Nah.” Just complete outlier.

-After age 40, it has a litte shape to it, but overall is messy. Hypothesis to this is that lower paid pitchers who are aging get cut for younger arms that cost the same amount of money. So the average salary gets inflated.

Next, I wanted to look at a few other items for fun.

Team_Year <- tbl_df(MLB) %>%
  group_by(Team, Year) %>%
  summarise(PriceSO = sum(salary)/sum(SO),
         PriceBB = sum(salary)/sum(BB),
         PriceHR = sum(salary)/sum(HR),
         BBOut = sum(SO)/sum(BB),
         Avg_Age = mean(age),
         Avg_Sal = mean(salary),
         WonWS = first(WS),
         Wins = first(TeamW))%>%
  ungroup()

So I added a few more items to the list of things I wanted to look at. Things such as the cost of a home run, the cost of a walk, and the cost of an out. I want to use these items to see if I can find a correlation with those and wins.

features_1 <- Team_Year %>%
  select(PriceSO)

features_2 <- Team_Year %>%
  select(PriceBB)

features_3 <- Team_Year %>%
  select(PriceHR)

features_4 <- Team_Year %>%
  select(Avg_Sal)

features_5 <- Team_Year %>%
  select(BBOut)

features_6 <- Team_Year %>%
  select(Avg_Age)

features_7 <- Team_Year %>%
  select(Wins)

I set up these as features is be more efficient when running liner regresion models.

labels <- Team_Year %>%
  .$PriceSO
  
model <- train(x = features_6, y = labels, method = "lm")
model$results
##   intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      TRUE 17220.46 0.2437634 13521.52 1283.047  0.0642422 862.0192
model$finalModel$coefficients
## (Intercept)     Avg_Age 
##  -186838.46     7899.99
ggplot(data = Team_Year) +
  geom_point(mapping = aes(x = Avg_Age , y = PriceSO),
             alpha = .75,
             color = '#2db9dd') +
  geom_abline(intercept = model$finalModel$coefficients[[1]],
              slope = model$finalModel$coefficients[[2]],
              color = 'red', size = 2) +
  theme_dark() +
  labs(x = "Age", y = "Price Per K")

Running this model, it predicts that for every $7,900 a team spends on it’s pitching staff, that will return 1 additional strikeout.

labels <- Team_Year %>%
  .$Avg_Sal
  
model <- train(x = features_5, y = labels, method = "lm")
model$results
##   intercept    RMSE  Rsquared     MAE   RMSESD RsquaredSD    MAESD
## 1      TRUE 1307787 0.1663019 1045336 79667.58 0.06605643 64906.02
model$finalModel$coefficients
## (Intercept)       BBOut 
##    122279.2   1254648.2
ggplot(data = Team_Year) +
  geom_point(mapping = aes(x = BBOut , y = Avg_Sal),
             alpha = .75,
             color = '#fffa00') +
  geom_abline(intercept = model$finalModel$coefficients[[1]],
              slope = model$finalModel$coefficients[[2]],
              color = '#0011ff', size = 2) +
  theme_dark() +
  labs(x = "K/BB Ratio", y = "Salary")

This model shows that in order to get improve a staffs strikeout to walk ratio by one tenth, a team will have to spend an additional $125,464.

Now comparing to a model of K/BB ratio to wins, a team can improve their K/BB ratio that should provide 9 extra wins.

labels <- Team_Year %>%
  .$Wins
  
model <- train(x = features_5, y = labels, method = "lm")
model$results
##   intercept     RMSE  Rsquared      MAE    RMSESD RsquaredSD   MAESD
## 1      TRUE 10.14063 0.1409351 8.509873 0.3952705 0.03699643 0.34917
model$finalModel$coefficients
## (Intercept)       BBOut 
##   58.775678    9.052806
ggplot(data = Team_Year) +
  geom_point(mapping = aes(x = BBOut, y = Wins),
             alpha = .5,
             color = "#ff8c00") +
  labs(x = "K/BB Ratio", y = "Wins") +
  geom_abline(intercept = model$finalModel$coefficients[[1]],
              slope = model$finalModel$coefficients[[2]],
              color = '#0011ff', size = 2)

A team can improve their record by 9 games for an additional $1,254,648 for pitchers with a solid K/BB ratio.

In conclusion, there are so many other factors that can lead to a teams success, and baseball has a lot of variance. I did find it intersting that teams who had an average staff age between 30-32 years old with a price per K between 50-75k won the most World Series titles and that the oldest pitchers had the highest average salary.