library(lubridate)
library(dotwhisker)
library(broom)
library(dplyr)
library(ggplot2)
library(cem)
library(dplyr)
library(car)

Introduction

I’ve been thinking a lot about handedness recently. I have two boys that I hope play a little baseball. Both are tracking to be tall-ish and may be athletic. I really want them to pitch. It’s become a pretty accepted point of view that left handed pitchers are at a premium in MLB. The one case the really make that clear was a pitcher named JA Happ. Happ, by all accounts, is a slightly below average left handed starting pitcher. This offseason, Happ signed a 3 year deal with the Blue Jays for 36 milliion dollars. When the news broke someone on twitter wrote, “My God. Parents, put a baseball in your child’s left hand and hope for the best.”

So, I want to test that assumption. The Lehmann database is great, but the information I need is in several different data files. I need the master file for the pitching hand, I need salaries (obviously), and pitching stats to compare apples to apples.

master <- read.csv("D:/Baseball/Master.csv", stringsAsFactors = FALSE)
salary <- read.csv("D:/Baseball/salaries.csv", stringsAsFactors = FALSE)
pitching <- read.csv("D:/Baseball/pitching.csv", stringsAsFactors = FALSE)

Data Cleaning

I don’t really need to look back at pitching salaries from 1910. I am going to pick an arbitrary cut point (2005). Salaries really started to explode after that.

master$finalGame <- as.Date(master$finalGame, "%m/%d/%Y")
master$year <- year(master$finalGame)
master <- subset(master, master$year >=2005)
pitching <- subset(pitching, pitching$year >=2005)
salary <- subset(salary, yearID >=2005)
salary$year <- salary$yearID
salary$player_id <- salary$playerID
salary$yearID <- NULL
salary$playerID <- NULL
df <- merge(salary, pitching, by=c("year", "player_id"))
master$player_id <- master$playerID
df <- merge(df, master, by=c("year", "player_id"))
df$throw <- df$throws.y
head(df)
##   year player_id teamID lgID  salary stint team_id league_id w l  g gs cg
## 1 2005 adamste01    PHI   NL  500000     1     PHI        NL 0 2 16  0  0
## 2 2005 almanca01    TEX   AL 1100000     1     TEX        AL 0 0  6  0  0
## 3 2005 alvarwi01    LAN   NL 2000000     1     LAN        NL 1 4 21  2  0
## 4 2005 anderbr02    KCA   AL 3250000     1     KCA        AL 1 2  6  6  0
## 5 2005 aybarma01    NYN   NL  425000     1     NYN        NL 0 0 22  0  0
## 6 2005 bartocl01    CHN   NL  318600     1     CHN        NL 0 2 19  0  0
##   sho sv ipouts  h er hr bb so baopp   era ibb wp hbp bk bfp gf  r sh sf
## 1   0  0     40 25 19  3 10  4 0.403 12.83   2  0   4  0  77  5 19  1  0
## 2   0  0     15 10  8  2  7  3 0.435 14.40   0  4   1  0  33  2  8  0  2
## 3   0  0     72 31 15  7  7 16 0.316  5.63   0  0   0  0 109  3 15  2  2
## 4   0  0     92 39 23  7  4 17 0.305  6.75   1  0   0  1 133  0 24  0  1
## 5   0  0     76 31 17  4  7 27 0.301  6.04   1  0   1  0 114  4 17  1  2
## 6   0  0     59 23 12  7 11 15 0.307  5.49   0  0   2  0  91  7 13  2  1
##   g_idp  playerID birthYear birthMonth birthDay birthCountry birthState
## 1    NA adamste01      1973          3        6          USA         AL
## 2    NA almanca01      1973         11        6         D.R.   Santiago
## 3    NA alvarwi01      1970          3       24    Venezuela      Zulia
## 4    NA anderbr02      1972          4       26          USA         VA
## 5    NA aybarma01      1972          5        4         D.R.    Peravia
## 6    NA bartocl01      1979          9        5          USA         TX
##    birthCity deathYear deathMonth deathDay deathCountry deathState
## 1     Mobile        NA         NA       NA                        
## 2   Santiago        NA         NA       NA                        
## 3  Maracaibo        NA         NA       NA                        
## 4 Portsmouth        NA         NA       NA                        
## 5       Bani        NA         NA       NA                        
## 6       West        NA         NA       NA                        
##   deathCity nameFirst nameLast      nameGiven weight height bats throws
## 1               Terry    Adams    Terry Wayne    180     75    R      R
## 2              Carlos Almanzar  Carlos Manuel    166     74    R      R
## 3              Wilson  Alvarez Wilson Eduardo    175     73    L      L
## 4               Brian Anderson    Brian James    190     73    L      L
## 5               Manny    Aybar Manuel Antonio    165     73    R      R
## 6               Cliff  Bartosh  Clifford Paul    175     74    L      L
##       debut  finalGame  retroID   bbrefID
## 1 8/10/1995 2005-05-23 adamt001 adamste01
## 2  9/4/1997 2005-04-30 almac001 almanca01
## 3 7/24/1989 2005-09-28 alvaw001 alvarwi01
## 4 9/10/1993 2005-05-08 andeb002 anderbr02
## 5  8/4/1997 2005-06-10 aybam001 aybarma01
## 6 5/15/2004 2005-06-18 bartc001 bartocl01

Visualization

Okay, I’ve got the data in a format that I can use. Let’s visualize. Let’s create a dataframe of just lefties and just righties.

righties <- subset(df, df$throws =="R")
lefties <- subset(df, df$throws =="L")
mean(lefties$salary)
## [1] 2674764
mean(righties$salary)
## [1] 2680025

So, there’s nothing there. Less than $10,000 difference in the two samples. Let’s press onward.

I’m not going to display a lot of what I did behind the scenes but it’s a lot of subsetting and creating color palettes. Let’s go right to visuals.

ggplot(histogram, aes(x=Group.1, y = x/1000)) + geom_bar(aes(fill=throw),stat="identity", position= "dodge") + xlab("Season") + ylab("Salary (in thousands)") + scale_fill_manual(values=handPalette)

This is also inconclusive. Just take 2011-2013. In 2011, lefties and righties made basically the same. In 2012 lefties made (on average) more a million dollars more than righties. However in 2013, righties made a couple hundred grand more than lefties.

ggplot(leftleague, aes(x=Group.1, y = x/1000)) + geom_bar(aes(fill=league),stat="identity", position= "dodge") + xlab("Season") + ylab("Salary (in thousands)") + ggtitle("Lefties Salaries") + scale_fill_manual(values=leaguePalette)

ggplot(rightleague, aes(x=Group.1, y = x/1000)) + geom_bar(aes(fill=league),stat="identity", position= "dodge") + xlab("Season") + ylab("Salary (in thousands)") + ggtitle("Righties Salaries") + scale_fill_manual(values=leaguePalette)

Looking at salaries in the AL vs the NL is interesting. Lefties in the National League made more money than righties for 2010-2013. The story is a little more mixed for righties.

Let’s take a look at a scatterplot for ERA and salary.

p <- ggplot(df, aes(salary/1000, era))
p + geom_point(aes(colour = df$throw)) + xlim(5000, 25000) + ylim(0, 10) + scale_color_manual(values = c("#daa520", "#228b22")) + xlab("Salary (in thousands)") + ylab("ERA") +  theme(legend.title=element_blank())

I truncated this data on both the x and the y axes. Any ERA over 10 is not going to keep you in the league for a long time so those were dropped. And any salary below 500k is going to be a player that has not reached arbitration and therefore is not really getting paid what the market will bear. So the picture is mixed so far. The next step would be a regression.

Regression and Matching

reg1 <- lm(salary ~ era + w + l + g + ipouts + throws + baopp + so + bb , data=df)
summary(reg1)
## 
## Call:
## lm(formula = salary ~ era + w + l + g + ipouts + throws + baopp + 
##     so + bb, data = df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -10381417  -1518634   -786951    793625  16901303 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2801563.4   583014.3   4.805 1.80e-06 ***
## era             1875.1    22269.7   0.084 0.932915    
## w             139291.6    74154.5   1.878 0.060641 .  
## l             371822.9    67913.1   5.475 5.64e-08 ***
## g             -32995.3     6125.3  -5.387 9.11e-08 ***
## ipouts           339.9     3342.7   0.102 0.919032    
## throwsR      -176206.0   237314.7  -0.742 0.457974    
## baopp       -2486503.4  1927267.8  -1.290 0.197314    
## so             26083.3     7856.4   3.320 0.000935 ***
## bb            -62288.0    13925.8  -4.473 8.68e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3310000 on 924 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.256,  Adjusted R-squared:  0.2488 
## F-statistic: 35.33 on 9 and 924 DF,  p-value: < 2.2e-16
dwplot(reg1)

dwplot(reg1) + geom_vline(xintercept = 0, colour = "grey60", linetype = 2)

I’ve also included a dotwhisker plot that helps to visualize a regression. If the vertical dashed line is not intersected by the dots or the horizontal line (the confidence intervals) then it’s statistically significant. Or you could read the regression table.

So salary is our dependent variable and I’m going to use a lot of the stats that should predict a better pitcher. ERA, wins, losses, etc. Unfortunately this data has a lot of noise in it. A good example? Losses actually predict a higher salary. That may be because losses denote starting pitchers and starting pitchers are much more likely to take a loss than a reliever. Games pitched predicts lower salary but that’s probably because relievers can show up in 80 games a year while starters average around 35 or so. Strike outs drive up salary and walks drive it down. Interestingly enough. Throwing right handed is not statistically significant.

The next thing I want to do is coarsened exact matching. Gary King and some others wrote the package. What it does is essentially this: it fights someone in the treatment case (in our example that’s left handed pitchers) and finds someone in the control case (righties) who is very close in terms of performance metrics. So this will compare apples to apples. It will help to correct the problems of pitchers have more games or less games. It will compare pitchers with lower ERAs to those with lower ERAs and so on. The one thing that needs to be done is variables need to be binned together. In order for the package to actually find a match it needs era to be broken up into several ranges (3.00-3.50, 3.51-4.00). I will do that below.

cem <- select(df, salary, w, l, g, gs, sv, ipouts, h, er, bb, so, baopp, era, throws, lgID)
cem <- data.frame(na.omit(cem))

cem$treated = recode(cem$throws, "'L'=1; 'R'=0;", as.factor.result=FALSE)

tr <- which(cem$treated==1)
ct <- which(cem$treated==0)

mean(cem$salary[tr]) - mean(cem$salary[ct])
## [1] -700.6434
cem$league = recode(cem$lgID, "'NL'=1; 'AL'=2;", as.factor.result=FALSE)
cem$lgID <- NULL

cem$ba <- recode(cem$baopp, ".000:.100= 1; .151:.200 =3; .201:.250=4; .251:.300 =5; .301:.350 =6; .351:.400 =6; .401:.500 =7; .501:.700 =8")
cem$baopp <- NULL

cem$ERA <- recode(cem$era, ".000:.1= 1; 1.01:2.00 =2; 2.01:3=3; 3.01:4 =4; 4.01:5 =5; 5.01:10 =6")
cem$era <- NULL
cem$games <- recode(cem$g, "1:10= 1; 11:20 =2; 21:30=3; 31:40 =4; 41:50 =6; 51:80 =6")
cem$g <- NULL
cem$loss <- recode(cem$l, "0:2= 1; 2:5 =2; 6:10=3; 10:18 =4")
cem$l <- NULL
cem$walks <- recode(cem$bb, "0:5= 1; 6:10 =2; 11:15=3; 16:20 =4; 25:30 =5; 31:88=6")
cem$bb <- NULL

mat <- cem(treatment = "treated", data = cem, drop = "salary", keep.all=TRUE)
est <- att(mat, salary ~ treated, data = cem)
summary(est)
## 
## Treatment effect estimation for data:
## 
##            G0  G1
## All       655 279
## Matched   171  96
## Unmatched 484 183
## 
## Linear regression model estimated on matched data only
## 
## Coefficients:
##             Estimate Std. Error t value p-value    
## (Intercept)  1383222     130156  10.627  <2e-16 ***
## treated       -39502     217062  -0.182  0.8557    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After all that, the answer is really not exciting at all. There is no statistical relationship between throwing hand and pitcher’s salary. Being left handed could mean anything from making 600k more or 300k less than a right hander. In other words? It means nothing.

Concluding Thoughts

So, if the perception is that left handers make more than right handers why doesn’t the data bear this out? I have a theory, at least. Maybe two.

  1. Baseball has a really weird salary structure. Not to go too far into it but for the first three years that a player is in the majors, he basically makes the league minimum (around 500k). After that he goes through three years of arbitration where his salary rises each of those three years. He is still not receiving his market value. Really, that doesn’t happen until free agency which doesn’t happen for most players until they are 28-30 years old. Many elite pitchers will then sign a huge deal for six or seven years. They really only get one bite at the apple.

  2. Relievers screw everything up. As another Kaggle user found, teams overpay closers. That also means that they underpay middle relievers. If I could break this down to just starting pitchers I might see something different but I didn’t do that is because lefties seem to be more important in the bullpen. Guys like Randy Choate was a LOOGY. He couldn’t really do much well except get out other left handers. And he pitched for a long time doing just that. A left handed starter cannot be a LOOGY.

  3. This data is just noisy. Inflated salaries have not existed long enough to really have a large enough dataset.