Cost of a Strike Out

Introduction

Lahman Data base was used to analyze team and pitcher data from 2007 to 2016. Analyzed both individual and a summarization of individuals that make up a team to find characteristics that tend to have a lower cost to strike out ratio. Only used pitchers that played in 10 or more games.

Load Libraries

library(RSQLite)
library(dplyr)
library(lubridate)
library(ggplot2)
library(caret)
library(kableExtra)

Data Brought in from SQLite

SQLight was used to select data and rename columns. From the Lahman data base tables Master, Pitching, Teams, and Salaries were used, and the data was loaded into R for analysis.

q_pitch <- "
SELECT playerID, yearID, teamID, G as pit_games,
BB as pit_walks, IBB as pit_intwalks,
SO as pit_so, ERA
FROM Pitching
WHERE yearID BETWEEN 2007 AND 2016;
"
q_master <- "
SELECT playerID, birthYear, birthMonth, 
nameLast, nameGiven, debut, finalGame
FROM Master
WHERE finalGame >= 2007
"
q_salaries <- "
SELECT yearID, teamID, playerID, salary as pit_salary
FROM Salaries
WHERE yearID BETWEEN 2007 AND 2016;
"

q_teams <- "
SELECT yearID, lgID, teamID, W as team_wins, L as team_losses, 
SOA as team_so, AB as team_atbat, BB as team_batters_walk
FROM Teams
WHERE yearID BETWEEN 2007 AND 2016;
"

conn <- dbConnect(drv = SQLite(), dbname = "lahman2016.db")
pitchers <- dbGetQuery(conn = conn, statement = q_pitch)
teams <- dbGetQuery(conn = conn, statement = q_teams)
master <- dbGetQuery(conn = conn, statement = q_master)
salaries <- dbGetQuery(conn = conn, statement = q_salaries)
dbDisconnect(conn)

Join Tables from SQL Data

All four tables were joined together to get individual statistics and then grouped to get team statics by team and year. Calculations performed to get information that was not provided in database such as cost per strike out, career length, total walks. Grouping performed to calculate team data such as average career length, cost per strike out, and average age.

baseball_full <-tbl_df(pitchers) %>%
  filter(pit_games >= 10) %>%
  inner_join(y= master, by = 'playerID') %>%
  inner_join(y = salaries, by = c('playerID', 'yearID', 'teamID')) %>%
  inner_join(y = teams, by = c('yearID', 'teamID')) %>%
  mutate(pit_age = ifelse(birthMonth < 4, yearID - birthYear, 
                          yearID - birthYear -1)) %>%
  mutate(pit_costso = pit_salary/pit_so) %>%
  mutate(career_length = yearID - year(debut)) %>%
  mutate(pit_tot_walks = pit_walks + pit_intwalks) %>%
  mutate(wl_ratio = team_wins/(team_wins + team_losses)) %>%
  mutate(team_atbat_walkpct = team_batters_walk / team_atbat) %>%
  group_by(teamID, yearID) %>%
  mutate(team_salary = sum(pit_salary),
         team_avg_age = mean(pit_age),
         team_avg_career_length = mean(career_length),
         team_costso = team_salary/team_so) %>%
  ungroup ()%>%
  select(playerID, nameGiven, nameLast, yearID, pit_games, pit_so,
         pit_tot_walks, ERA, pit_salary, pit_costso, pit_age, finalGame, career_length,
         team_atbat_walkpct, teamID, wl_ratio, team_wins, team_losses, team_so,
         team_costso,team_avg_career_length, team_avg_age, team_salary)
summary (baseball_full)
##    playerID          nameGiven           nameLast             yearID    
##  Length:3355        Length:3355        Length:3355        Min.   :2007  
##  Class :character   Class :character   Class :character   1st Qu.:2009  
##  Mode  :character   Mode  :character   Mode  :character   Median :2011  
##                                                           Mean   :2011  
##                                                           3rd Qu.:2014  
##                                                           Max.   :2016  
##    pit_games         pit_so       pit_tot_walks         ERA        
##  Min.   :10.00   Min.   :  1.00   Min.   :  0.00   Min.   : 0.000  
##  1st Qu.:26.00   1st Qu.: 38.00   1st Qu.: 18.00   1st Qu.: 3.150  
##  Median :33.00   Median : 62.00   Median : 29.00   Median : 3.920  
##  Mean   :39.93   Mean   : 76.99   Mean   : 33.95   Mean   : 4.127  
##  3rd Qu.:58.00   3rd Qu.:106.00   3rd Qu.: 46.00   3rd Qu.: 4.870  
##  Max.   :92.00   Max.   :301.00   Max.   :114.00   Max.   :13.500  
##    pit_salary         pit_costso        pit_age       finalGame        
##  Min.   :  380000   Min.   :  1528   Min.   :20.00   Length:3355       
##  1st Qu.:  500000   1st Qu.:  9202   1st Qu.:26.00   Class :character  
##  Median : 1250000   Median : 27667   Median :28.00   Mode  :character  
##  Mean   : 3302823   Mean   : 55849   Mean   :28.83                     
##  3rd Qu.: 4250000   3rd Qu.: 70455   3rd Qu.:31.00                     
##  Max.   :32571000   Max.   :923077   Max.   :49.00                     
##  career_length   team_atbat_walkpct    teamID             wl_ratio     
##  Min.   : 0.00   Min.   :0.06844    Length:3355        Min.   :0.3148  
##  1st Qu.: 2.00   1st Qu.:0.08314    Class :character   1st Qu.:0.4506  
##  Median : 5.00   Median :0.09152    Mode  :character   Median :0.5031  
##  Mean   : 5.58   Mean   :0.09183                       Mean   :0.5007  
##  3rd Qu.: 8.00   3rd Qu.:0.09925                       3rd Qu.:0.5556  
##  Max.   :26.00   Max.   :0.12355                       Max.   :0.6358  
##    team_wins      team_losses        team_so      team_costso   
##  Min.   : 51.0   Min.   : 59.00   Min.   : 911   Min.   : 7162  
##  1st Qu.: 73.0   1st Qu.: 72.00   1st Qu.:1096   1st Qu.:21172  
##  Median : 81.0   Median : 80.00   Median :1179   Median :30740  
##  Mean   : 81.1   Mean   : 80.87   Mean   :1177   Mean   :32729  
##  3rd Qu.: 90.0   3rd Qu.: 89.00   3rd Qu.:1249   3rd Qu.:40819  
##  Max.   :103.0   Max.   :111.00   Max.   :1450   Max.   :81455  
##  team_avg_career_length  team_avg_age    team_salary      
##  Min.   :2.727          Min.   :25.70   Min.   : 6982500  
##  1st Qu.:4.750          1st Qu.:27.93   1st Qu.:25058500  
##  Median :5.385          Median :28.82   Median :35479900  
##  Mean   :5.580          Mean   :28.83   Mean   :38822138  
##  3rd Qu.:6.357          3rd Qu.:29.67   3rd Qu.:48572271  
##  Max.   :9.769          Max.   :33.08   Max.   :98203500
dim(diamonds)
## [1] 53940    10

Build a pitching team with “Good deal players”

Looking to build a pitching team where one would get the best deal for cost of strike out.Looking at cost per strike out of indivual pitchers the majority fall below $120,000 with a few outliers.

ggplot(data = baseball_full) +
  geom_histogram(mapping =  aes( x = pit_costso), bins = 30, fill ="steel blue")+
  scale_x_continuous(labels = scales::comma) +
  geom_vline(xintercept = mean(baseball_full$pit_costso), color ="dark red") +
    labs(x = "Pitchers Cost per Strike Out",
       y = "Count" ,
       title = "Comparison of Cost per Strike Out Among Pitchers")

Correlations with Cost of Strike Out by Pitcher

Correlation table gives a quick view of how each attribute effects the other attributes. Pitcher salary, career and pitcher age have this highest corrolation values where the team win loss ratio has the lowest.

sel_data <- baseball_full %>%
  select(pit_games, pit_so, ERA, pit_age, pit_salary, pit_costso, pit_tot_walks,
         career_length, wl_ratio)

sel_data_cor <- round(cor(sel_data),4)

head(sel_data_cor)
##            pit_games  pit_so     ERA pit_age pit_salary pit_costso
## pit_games     1.0000 -0.0841 -0.4583  0.0802    -0.1592    -0.2098
## pit_so       -0.0841  1.0000 -0.2830 -0.1652     0.3676    -0.2256
## ERA          -0.4583 -0.2830  1.0000  0.0157    -0.0665     0.1981
## pit_age       0.0802 -0.1652  0.0157  1.0000     0.3556     0.4037
## pit_salary   -0.1592  0.3676 -0.0665  0.3556     1.0000     0.5094
## pit_costso   -0.2098 -0.2256  0.1981  0.4037     0.5094     1.0000
##            pit_tot_walks career_length wl_ratio
## pit_games        -0.1337        0.0069   0.0580
## pit_so            0.7809       -0.0208   0.1175
## ERA              -0.0603        0.0111  -0.1795
## pit_age          -0.1726        0.8506   0.0528
## pit_salary        0.2417        0.4832   0.0870
## pit_costso       -0.2245        0.4429  -0.0096

Looking at corolations graphed

Graph the high and low correlations to see how the compair to one another and how the correlation value is linked to the RSquared factor that is used in linear modeling.

features_pit_age <- baseball_full %>%
  select(pit_age)

labels_pit_costso <- baseball_full$pit_costso

model <- train(x = features_pit_age, y = labels_pit_costso, method = 'lm')
model$results
##   intercept     RMSE  Rsquared      MAE  RMSESD RsquaredSD    MAESD
## 1      TRUE 74880.03 0.1696633 41628.33 5546.62 0.01864453 1051.357
ggplot(data = baseball_full) +
  geom_point(mapping = aes(x = pit_age, y = pit_costso)) +
  labs(x = "Pitchers Age",
       y = "Cost per Strike Out" ,
       title = "Pitcher Age to Strike Out Value") +
  scale_y_continuous(labels = scales::comma) +
  theme_dark() +
  geom_abline(intercept = model$finalModel$coefficients[[1]],
              slope = model$finalModel$coefficients[[2]],
              color = 'dark red', size = 1)

As the graph shows there is a decent interrelationship between the age of the pitcher and the cost per strike out. As pitcher get older the value per strike out goes down.

features_career_length <- baseball_full %>%
  select(career_length)

labels_pit_costso <- baseball_full$pit_costso

model <- train(x = features_career_length, y = labels_pit_costso, method = 'lm')
model$results
##   intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      TRUE 73306.72 0.1995042 39463.56 5448.225 0.01692802 1327.326
ggplot(data = baseball_full) +
  geom_point(mapping = aes(x = career_length, y = pit_costso)) +
  labs(x = "Career Length",
       y = "Cost per Strike Out" ,
       title = "Career Length to Strike Out Value") +
  scale_y_continuous(labels = scales::comma) +
  theme_dark() +
  geom_abline(intercept = model$finalModel$coefficients[[1]],
              slope = model$finalModel$coefficients[[2]],
              color = 'dark red', size = 1)

As the graph shows there is a decent interrelationship between the career length of the pitcher and the cost per strike out. As pitcher’s career gets longer the value per strike out goes down.

features_pit_salary <- baseball_full %>%
  select(pit_salary)

labels_pit_costso <- baseball_full$pit_costso

model <- train(x = features_pit_salary, y = labels_pit_costso, method = 'lm')
model$results
##   intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      TRUE 72443.15 0.2563275 36930.24 6883.971 0.03165741 1593.329
ggplot(data = baseball_full) +
  geom_point(mapping = aes(x = pit_salary, y = pit_costso))+
  labs(x = "Pitcher Salary",
       y = "Cost per Strike Out" ,
       title = "Pitcher Salary to Cost per Strike Out") +
  scale_y_continuous(labels = scales::comma) +
  theme_dark() +
  geom_abline(intercept = model$finalModel$coefficients[[1]],
              slope = model$finalModel$coefficients[[2]],
              color = 'dark red', size = 1)

As the graph shows there is a decent interrelationship between the salary of the pitcher and the cost per strike out. As pitcher’s goes up the cost per strike up goes up as well. These two values should have a direct corelations because pitcher’s salary is used to calculate the cost per strike out. The variation the graph could be related to the value of the pitcher off the mound, health of the pitcher, or power of negotiating salary in baseball.

features_wl_ratio <- baseball_full %>%
  select(wl_ratio)

labels_pit_costso <- baseball_full$pit_costso

model <- train(x = features_wl_ratio, y = labels_pit_costso, method = 'lm')
model$results
##   intercept     RMSE     Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      TRUE 81830.83 0.0006658077 50432.45 4713.404 0.00104132 1159.298
ggplot(data = baseball_full) +
  geom_point(mapping = aes(x = wl_ratio, y = pit_costso)) +
  labs(x = "Percentage of Wins",
       y = "Cost per Strike Out" ,
       title = "Strike Out cost to Wins") +
  scale_y_continuous(labels = scales::comma) +
  theme_dark() +
  geom_abline(intercept = model$finalModel$coefficients[[1]],
              slope = model$finalModel$coefficients[[2]],
              color = 'dark red', size = 1)

There is very little correlation between the cost of a strike out to a team having a winning record. A manager or owner would have a hard time going out and buying a team of pitchers to get a winning record.

“Good Deal” Bullpen

Manager or owner was to get the biggest bang for their buck on a team of pitchers. Looking for pitchers that have above average stats with below average pay. Gathering a team of 13 pitchers. The first rank has more rigorous stats and might result in less than 13 pitchers for a given year. The second rank just has below average cost per strike out and above average amount of strike outs. Taken the first list out of the second list to ensure that there are no duplicates. Joining the groups back together ordered them by rank to ensure the starter group was included then the rest of the pitchers by cost per strike out. Took the top 13 to form a “Good Deal” bullpen.

 starters <- baseball_full %>%
  filter(pit_salary < mean(pit_salary),
         pit_so > mean(pit_so),
         pit_tot_walks < mean(pit_tot_walks),
         ERA < mean(ERA),
         wl_ratio > 0.50,
         year(finalGame) == 2016,
         yearID == 2016) %>%
   select(playerID, nameGiven, nameLast, pit_salary, pit_so, pit_costso,
          pit_tot_walks, ERA, career_length, pit_age) %>%
   mutate(rank = 1) %>%
   arrange(pit_costso)
 
 
 good_costso<-baseball_full %>%
   filter(pit_salary < mean(pit_salary),
          pit_so > mean(pit_so),
          year(finalGame) == 2016,
          yearID == 2016) %>%
   select(playerID, nameGiven, nameLast, pit_salary, pit_so, pit_costso,
          pit_tot_walks, ERA, career_length, pit_age) %>%
   arrange(pit_costso) 
 
 second_string <- anti_join(good_costso, starters, 
                            by = c('playerID', 'nameGiven', 'nameLast', 
                                  'pit_salary', 'pit_so', 
                                  'pit_costso', 'pit_tot_walks', 
                                  'ERA', 'career_length', 'pit_age')) %>%
   mutate(rank = 2)
 
 good_deal_team <- starters %>%
   full_join(y = second_string, 
             by = c('playerID', 'nameGiven', 'nameLast', 'pit_salary',
                    'pit_so', 'pit_costso', 'pit_tot_walks', 'ERA',
                    'career_length','pit_age','rank')) %>%
   select(nameGiven, nameLast, pit_salary, pit_so, pit_tot_walks, 
          ERA, pit_costso, rank) %>%
     rename(FirstMiddle = nameGiven,
            Last = nameLast,
            Salary = pit_salary,
            StrikeOut = pit_so,
            Walks = pit_tot_walks,
            SOCost = pit_costso) %>%
   arrange(rank, SOCost) %>%
   slice (1:13) 
   
 good_deal_team %>%  
  kable("html") %>%
  kable_styling()
FirstMiddle Last Salary StrikeOut Walks ERA SOCost rank
Kenneth Robert Giles 528200 102 26 4.11 5178.431 1
Roberto Osuna 516100 82 18 2.68 6293.902 1
Brad Brach 1250000 92 26 2.05 13586.957 1
Robert Glenn Ray 521000 218 75 4.90 2389.908 2
Jonathan Charles Gray 509500 185 61 4.61 2754.054 2
Collin Alexander McHugh 529000 177 55 4.34 2988.701 2
Kevin John Gausman 532000 174 48 3.61 3057.471 2
Jerad Joseph Eickhoff 516000 167 44 3.65 3089.820 2
Marcus Earl Stroman 515900 166 54 4.37 3107.831 2
Daniel Steven Straily 512100 162 77 3.76 3161.111 2
Aaron Jacob Sanchez 517800 161 63 3.00 3216.149 2
Daniel Dariel Salazar 536200 161 66 3.87 3330.435 2
Vincent John Velasquez 516000 152 46 4.12 3394.737 2

Looking at team data

Correlations with Cost of Strike Out by Teams

Correlation table gives a quick view of how each attribute effects the other attributes. The team cost per strike out has a high interrelationship with team average age, team avgerage career length and the total team salary. There is a much smaller correlation with the number of strike outs to cost per strike out. The ultimate goal in baseball is to win. When it comes to pitchers there is no winning stratagy. All of the team attributes have a low interrelationship to the win loss ratio.

sel_team_data <- baseball_full %>%
  select(team_so, team_atbat_walkpct,
         team_avg_age, team_avg_career_length, team_salary, team_costso, wl_ratio)

sel_team_cor <- round(cor(sel_team_data),4)

head(sel_team_cor)
##                        team_so team_atbat_walkpct team_avg_age
## team_so                 1.0000            -0.0545       0.0226
## team_atbat_walkpct     -0.0545             1.0000       0.1313
## team_avg_age            0.0226             0.1313       1.0000
## team_avg_career_length -0.0062             0.1297       0.8804
## team_salary             0.3594            -0.0665       0.4574
## team_costso             0.1848            -0.0618       0.4902
##                        team_avg_career_length team_salary team_costso
## team_so                               -0.0062      0.3594      0.1848
## team_atbat_walkpct                     0.1297     -0.0665     -0.0618
## team_avg_age                           0.8804      0.4574      0.4902
## team_avg_career_length                 1.0000      0.5166      0.5568
## team_salary                            0.5166      1.0000      0.9800
## team_costso                            0.5568      0.9800      1.0000
##                        wl_ratio
## team_so                  0.3549
## team_atbat_walkpct       0.3425
## team_avg_age             0.1742
## team_avg_career_length   0.2271
## team_salary              0.2765
## team_costso              0.2281

Looking at ten years of team data for the cost per strike out the team as a whole the cost is more copared to that of the indiduals because the pitching bullpen is lob sided to the number of pitchers playing in any game. The extreams are reduced but there is still some outliers.

data_baseball <- baseball_full %>%
  select(yearID, teamID, team_wins, team_losses, team_so, 
         team_atbat_walkpct,team_avg_age, team_avg_career_length, 
         team_salary, team_costso, wl_ratio) %>%
  group_by(teamID, yearID)

team_baseball<- distinct(data_baseball)

mean_team_cost <- mean(baseball_full$team_costso)
 
  ggplot(data = team_baseball) +
  geom_histogram(mapping =  aes( x = team_costso), bins = 25) +
    scale_x_continuous(labels = scales::comma) +
  geom_vline(xintercept=mean(baseball_full$team_costso), color="red")

ggplot(data = baseball_full) +
  geom_boxplot(mapping = aes(x = yearID, y = team_costso, 
                             group = yearID,
                             fill = yearID)) +
  labs(x = "Year",
       y = "Team Price for Strike Out" ,
       title = "Team Strike Out Cost by Year") +
  scale_y_continuous(labels = scales::comma) +
  theme_dark()

  features_team_avg_age <- baseball_full %>%
    select(team_avg_age)
  
  labels_team_costso <- baseball_full$team_costso
  
  model <- train(x = features_team_avg_age, 
                 y = labels_team_costso, 
                 method = 'lm')
  model$results
##   intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      TRUE 12983.11 0.2381298 10262.29 241.4323 0.01986402 161.9459
  ggplot(data = baseball_full) +
    geom_point(mapping = aes(x = team_avg_age,
                             y = team_costso)) +
    labs(x = "Average Team Age",
         y = "Team Cost per Strike Out" ,
         title = "Team Cost per Strike Out to age") +
    scale_y_continuous(labels = scales::comma) +
    theme_dark() +
    geom_abline(intercept = model$finalModel$coefficients[[1]],
                slope = model$finalModel$coefficients[[2]],
                color = 'dark red', size = 1)

As the graph shows there is a decent correlation between the average age of the pitching team and the cost per strike out. The same interrelationships is with individuals, as the pitching team get older the value per strike out goes down making the cost per strike out go up.

  features_team_avg_career_length <- baseball_full %>%
    select(team_avg_career_length)
  
  labels_team_costso <- baseball_full$team_costso
  
  model <- train(x = features_team_avg_career_length, 
                 y = labels_team_costso, 
                 method = 'lm')
  model$results
##   intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      TRUE 12310.64 0.3091983 9729.782 215.5717 0.02364329 159.7309
  ggplot(data = baseball_full) +
    geom_point(mapping = aes(x = team_avg_career_length,
                             y = team_costso)) +
    labs(x = "Team Career Length",
         y = "Team Cost per Strike Out" ,
         title = "Career Length to Strike Out Cost") +
    scale_y_continuous(labels = scales::comma) +
    theme_dark() +
    geom_abline(intercept = model$finalModel$coefficients[[1]],
                slope = model$finalModel$coefficients[[2]],
                color = 'dark red', size = 1)

As the graph shows there is a decent correlation between the average career of the pitching team and the cost per strike out. The same interrelationships is with individuals, as the pitching team’s career gets long the cost per strike out goes up and value per strike out goes down.

  features_wl_ratio <- baseball_full %>%
    select(wl_ratio)
  
  labels_team_avg_career_length <- baseball_full$team_avg_career_length
  
  model <- train(x = features_wl_ratio, 
                 y = labels_team_avg_career_length, 
                 method = 'lm')
  model$results
##   intercept     RMSE   Rsquared       MAE    RMSESD RsquaredSD      MAESD
## 1      TRUE 1.236875 0.05315146 0.9758566 0.0200377 0.01037708 0.01778283
  ggplot(data = baseball_full) +
    geom_point(mapping = aes(x = wl_ratio,
                             y = team_avg_career_length)) +
    labs(x = "Win Percentage",
         y = "Team Career Length" ,
         title = "Career Length to Wins") +
    theme_dark() +
    geom_abline(intercept = model$finalModel$coefficients[[1]],
                slope = model$finalModel$coefficients[[2]],
                color = 'dark red', size = 1)

There is a very low correlation between career length and the capabity to win games. Despite younger player and teams having a better cost per strike out it does not transpher to wins.

  features_wl_ratio <- baseball_full %>%
    select(wl_ratio)
  
  labels_team_salary <- baseball_full$team_salary
  
  model <- train(x = features_wl_ratio, 
                 y = labels_team_salary, 
                 method = 'lm')
  model$results
##   intercept     RMSE   Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      TRUE 18194131 0.07637461 14036615 314264.8 0.01170675 277928.9
  ggplot(data = baseball_full) +
    geom_point(mapping = aes(x = wl_ratio,
                             y = team_salary)) +
    labs(x = "Win Percentage",
         y = "Team Salary" ,
         title = "Team Salary to Wins") +
    scale_y_continuous(labels = scales::comma) +
    theme_dark() +
    geom_abline(intercept = model$finalModel$coefficients[[1]],
                slope = model$finalModel$coefficients[[2]],
                color = 'dark red', size = 1)

As salaries increase so does the win percentage.

  features_team_so <- baseball_full %>%
    select(team_so)
  
  labels_wl_ratio <- baseball_full$wl_ratio
  
  model <- train(x = features_team_so, 
                 y = labels_wl_ratio, 
                 method = 'lm')
  model$results
##   intercept       RMSE  Rsquared       MAE       RMSESD RsquaredSD
## 1      TRUE 0.06199979 0.1262335 0.0524347 0.0007881526 0.01276425
##          MAESD
## 1 0.0006361553
  ggplot(data = baseball_full) +
    geom_point(mapping = aes(x = team_so,
                             y = wl_ratio)) +
    labs(x = "Strike Outs",
         y = "Win Percentage" ,
         title = "Team Strike Out Cost to Win Ratio") +
    theme_dark() +
    geom_abline(intercept = model$finalModel$coefficients[[1]],
                slope = model$finalModel$coefficients[[2]],
                color = 'dark red', size = 1)

Also as salaries increase so do win percentages

  features_team_costso <- baseball_full %>%
    select(team_costso)
  
  labels_wl_ratio <- baseball_full$wl_ratio
  
  model <- train(x = features_team_costso, 
                 y = labels_wl_ratio, 
                 method = 'lm')
  model$results
##   intercept       RMSE   Rsquared        MAE       RMSESD  RsquaredSD
## 1      TRUE 0.06473426 0.05558987 0.05462158 0.0005627593 0.008806155
##          MAESD
## 1 0.0005630158
  ggplot(data = baseball_full) +
    geom_point(mapping = aes(x = team_costso,
                             y = wl_ratio)) +
    labs(x = "Cost per Strike Out",
         y = "Win Percentage" ,
         title = "Team Strike Out Cost to Win Ratio") +
    scale_x_continuous(labels = scales::comma) +
    theme_dark() +
    geom_abline(intercept = model$finalModel$coefficients[[1]],
                slope = model$finalModel$coefficients[[2]],
                color = 'dark red', size = 1)

There is only a low interrelationship bewtween win percentages and the cost per strike out.

team_baseball %>%
  select(yearID, teamID, team_costso, wl_ratio) %>%
  arrange(team_costso) %>%
  head(n=10)
## # A tibble: 10 x 4
## # Groups:   teamID, yearID [10]
##    yearID teamID team_costso wl_ratio
##     <int> <chr>        <dbl>    <dbl>
##  1   2009 OAK          7162.    0.463
##  2   2008 PIT          7251.    0.414
##  3   2007 TBA          7403.    0.407
##  4   2009 FLO          8004.    0.537
##  5   2013 HOU          8143.    0.315
##  6   2009 WAS          8144.    0.364
##  7   2008 WAS          8373.    0.366
##  8   2014 MIA          8671.    0.475
##  9   2008 FLO          8968.    0.522
## 10   2016 CIN         10213.    0.420

Out of the top ten team cost per strike out only two of the teams had a winning record.

Conclusion

To get the best price per strike out the best interrelationship factors are age and career length. Both of these atributes have the highest correlations on both the indivudal and team scales. There is no strong factor with pitching teams that would determine a strong winning record.