Introduction

Following the conclusion of the 2019 regular season, the baseball world was rocked (although not entirely surprised) when Ken Rosenthal reported a massive allegation against the Houston Astros. Rosenthal wrote that the Astros had constructed and implemented a sign-stealing scheme in which a camera set up in center field relayed the catcher’s signs to a laptop in the Astros’ tunnel. Astros staff would watch the signs, decode them and bang on a garbage can (very techy, Houston, we see you) to communicate which pitch was coming. Investigations have confirmed the scandal with video evidence, and there are potentially more details that the public is not yet aware of. So when I had to carry out a research project for my Data Analytics 210 class, my classmate Serena Laro and I were inclined to investigate formally.

At first glance, knowing what pitch is coming seems like it would offer extraordinary benefits. Hitting a baseball is commonly described as the most challenging task in all of sports. Surely, knowing which pitch is coming would make this task easier. However, quantifying the specific effect is easier said than done. FanGraphs’ Jake Mailhot, Baseball Prospectus’ Rob Arthur and Ben Lindbergh at The Ringer all attempted to answer this question to varying degrees. The question is difficult to answer for several reasons.

First, what type of effect could sign-stealing potentially have? Different hitters and pitchers have varying strengths and weaknesses, so the impact of stealing would vary from at-bat to at-bat. Hitters who struggle with fastballs could see a huge benefit because the pitch is thrown so often. In contrast, hitters who struggle more with curveballs will encounter the pitch less and therefore derive less benefit.

Second, the Houston Astros were projected to be GREAT. The Astros exhibited one of the most significant drops in strikeout percentage, from 23.4% in 2016 to 17.3% in 2017. The decrease seems like a great piece of evidence against the team, but it coincided with significant roster changes that prioritized on-base percentage and limited strikeouts. Furthermore, preseason projections had the club as a high-contact, low-strikeout team. In 2017 the Astros also made substantial training changes, including the use of hitting technologies like Blast Motion, HitTrax and Rapsodo. The Astros also used pitching machines in an attempt to ease the transition from pre-game work to the game.

Finally, major league hitters have extremely well-developed pitch recognition and contact skills. Merely knowing what pitch is coming could disrupt the patterns hitters have already established, possibly altering their approach. Hitters often avoid extra thought at the plate, and trying to hear banging on a trash can in a raucous Houston ballpark could be a distraction instead of a competitive advantage. It is also worth noting that with the benefits of advanced scouting and analytics, hitters likely often have an idea of what pitch is coming and still swing and miss. Mariano Rivera was the first unanimous Hall of Famer and one of the greatest closers of all time, and everyone knew he was throwing them a cutter!

Data

To try to answer this question, I began by collecting PITCHf/x data from MLB.com, starting on May 19th (the alleged start date of the cheating). PITCHf/x data does not come in a format conducive to easy statistical analysis, so I had some data munging to do. I needed to use the game URL to determine the home and away teams, create the variable Whiffs from the pitch description (des) and create data sets for my different subcategories of analysis. PitchesRAW was the data set from my scrape of the 2017 season.

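#Load the packages used below
library(dplyr)
library(stringr)
library(ggplot2)

#Extract the teams playing (three-letter codes at fixed positions in the game URL)
Pitches17 <- PitchesRAW %>%
  mutate(Home = str_sub(url, start = 81, end = 83),
         Away = str_sub(url, start = 88, end = 90))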
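Fixed character positions only work while every URL has exactly the same length up to the game ID, so a pattern match is one possible alternative. The sketch below uses base R and a hypothetical URL written in the usual Gameday style; the URL, `pattern`, `first_team` and `second_team` are illustrative names and values, not part of the original script. Gameday IDs are generally written with the visiting club first and the home club second. If that holds for these URLs, it is worth checking the Home and Away labels against a game whose home team is known.

```r
# Hypothetical URL in the style of MLB Gameday paths; the real url column may differ
url <- "http://gd2.mlb.com/components/game/mlb/year_2017/month_05/day_19/gid_2017_05_19_clemlb_houmlb_1/inning/inning_all.xml"

# Capture the two three-letter club codes inside the game ID
pattern <- "gid_[0-9]{4}_[0-9]{2}_[0-9]{2}_([a-z]{3})[a-z]{3}_([a-z]{3})[a-z]{3}_[0-9]"
parts <- regmatches(url, regexec(pattern, url))[[1]]

first_team  <- parts[2]  # first code in the ID
second_team <- parts[3]  # second code in the ID

# Compare with the fixed positions used in the script
c(first_team = first_team, pos_81_83 = substr(url, 81, 83))
c(second_team = second_team, pos_88_90 = substr(url, 88, 90))
```

For this example URL, positions 81 to 83 line up with the first code in the ID and positions 88 to 90 with the second.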

#Creates Whiffs Variable

Pitches17$Whiffs <- ifelse(Pitches17$des %in% c("Swinging Strike", "Swinging Strike (Blocked)"), 1, 0)
Pitches17$Whiffs <- as.factor(Pitches17$Whiffs)

#Create Fast variable 
Pitches17$fast <- ifelse(Pitches17$pitch_type %in% c("FA", "FC", "FF", "FT", "SI"), 1, 0)
Pitches17$fast <- as.factor(Pitches17$fast)

#Keep Only what is needed
Pitches17 <- Pitches17 %>%
  select(des, start_speed, end_speed, sz_top, sz_bot, pfx_x, pfx_z, px, pz, x0, y0, z0, vx0, vy0, vz0,
         ax, ay, az, break_y, break_angle, break_length, pitch_type, spin_dir, spin_rate, 
         inning_side, Home, Away, Whiffs, fast, count)

#Remove NAs
Pitches17 <- na.omit(Pitches17) 

#Astros Home Pitches
HomePitches <- Pitches17 %>%
  filter(str_detect(Home, "hou")) %>%
  filter(str_detect(inning_side, "bottom"))

#Astros Away Pitches
AwayPitches <- Pitches17 %>%
  filter(str_detect(Away, "hou")) %>%
  filter(str_detect(inning_side, "top"))

#Astros2017 (stack the home and away batting pitches)
AstrosPitches <- bind_rows(HomePitches, AwayPitches)

#Creates Home/Away Variable (1 = home, 0 = away)
AstrosPitches$HomeOrAway <- ifelse(AstrosPitches$Home == "hou", 1, 0)

#Astros Fast Pitches
AstrosFastPitches <- AstrosPitches %>%
  filter(fast==1)

#Keep only the whiffs (swinging strikes, blocked or not)
Whiffs17 <- Pitches17 %>%
  filter(Whiffs == "1")

#Houston Home Whiffs
HomeWhiffs <- Whiffs17 %>%
  filter(str_detect(Home, "hou")) %>%
  filter(str_detect(inning_side, "bottom"))

#Houston Away Whiffs
AwayWhiffs <- Whiffs17 %>%
  filter(str_detect(Away, "hou")) %>%
  filter(str_detect(inning_side, "top"))

Previous writers have looked at run expectancy by pitch, isolated power and BB/K ratio, and found varying levels of evidence. For this piece, I decided to examine whiffs. Whiffs, for any of you unaware, are swings and misses by the batter. Whiffs are statistically valuable for pitchers and defences and damaging to hitters and offences. I hypothesized that hitters with this extra information would swing and miss less. Early in the count, hitters are less likely to be fooled by pitches other than what they are actively hunting. At a minimum, hitters who know what pitch is coming would be expected to make some form of contact. Initially, I thought that pitch type might be a significant factor in the Astros’ whiffs.

#Whiffs by Pitch Type
ggplot(Whiffs17, aes(pitch_type)) + geom_bar() + ggtitle("MLB Whiffs by Pitch Type")

#Astros Home Whiffs
ggplot(HomeWhiffs, aes(pitch_type)) + geom_bar() + ggtitle("Astros Home Whiffs by Pitch Type")

table(HomeWhiffs$pitch_type)
## 
##  CH  CU  FC  FF  FS  FT  KC  SI  SL 
## 125 116  35 222  13  90  73   1 404
#Away Whiffs
ggplot(AwayWhiffs, aes(pitch_type)) + geom_bar() + ggtitle("Astros Away Whiffs by Pitch Type")

table(AwayWhiffs$pitch_type)
## 
##  CH  CU  FC  FF  FS  FT  KC  SI  SL 
## 133 162  50 289  13 113  48   4 347

The bar charts show total MLB whiffs and the Astros’ home/away whiff splits. We can see that the Astros bucked the league trend for swings and misses on pitch type FF (four-seam fastball) by a fairly wide margin. Furthermore, the Astros had noticeably fewer whiffs at home than away on three types of fastballs: four-seamers (222 vs. 289), two-seamers (90 vs. 113) and cutters (35 vs. 50). I then examined the Astros’ home/away whiffs by break length to see if that produced any interesting information.
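Whiff totals depend on how many pitches of each type a team sees, so one way to put the home and away tables on equal footing is to convert them into whiffs per pitch. A minimal sketch on made-up data: the `toy` data frame and its values are invented for illustration, and Whiffs is kept numeric here (rather than a factor, as in the script) so that it can be averaged.

```r
# Invented pitches: two pitch types, home (1) and away (0), whiff (1) or not (0)
toy <- data.frame(
  pitch_type = c("FF", "FF", "FF", "FF", "SL", "SL", "SL", "SL"),
  HomeOrAway = c(1, 1, 0, 0, 1, 1, 0, 0),
  Whiffs     = c(0, 0, 1, 0, 1, 0, 1, 1)
)

# The mean of a 0/1 column is the share of pitches that were whiffs
rates <- aggregate(Whiffs ~ pitch_type + HomeOrAway, data = toy, FUN = mean)
names(rates)[3] <- "whiff_rate"
rates
```

The same call on the full Astros pitch data would need Whiffs converted back to 0/1 numbers first, for example with `as.numeric(as.character(Whiffs))`.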

#Astros Home Whiffs by Break Length
ggplot(HomeWhiffs, aes(break_length)) + geom_freqpoly() + ggtitle("Astros Home Whiffs by Break Length")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Astros Away Whiffs by Break Length
ggplot(AwayWhiffs, aes(break_length)) + geom_freqpoly() + ggtitle("Astros Away Whiffs by Break Length")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The two distributions were similar except at around a 5-inch break length. This is roughly the break we would expect from sinkers, two-seamers and cutters, which is consistent with the fastball pattern seen above. Perhaps the counts in which they swing and miss would give us an indication of differences in the Astros’ approach.

#Home Whiffs by Count
ggplot(HomeWhiffs, aes(count)) + geom_bar() + ggtitle("Astros Home Whiffs by Count")

table(HomeWhiffs$count)
## 
## 0-0 0-1 0-2 1-0 1-1 1-2 2-0 2-1 2-2 3-1 3-2 
## 172 141  99 110 121 149  22  65 124  19  57
#Away Whiffs by Count
ggplot(AwayWhiffs, aes(count)) + geom_bar() + ggtitle("Astros Away Whiffs by Count")

table(AwayWhiffs$count)
## 
## 0-0 0-1 0-2 1-0 1-1 1-2 2-0 2-1 2-2 3-0 3-1 3-2 
## 194 164 101  92 142 155  30  78 129   4  15  55

Alas, whiffs by count yielded no notable differences. So far, our most intriguing findings have concerned whiffs on fastball pitch types. Moving forward, we restricted our data set to fastballs only, as this was the only real deviation we saw. To try to find a numerical effect, I constructed logistic regressions. For those of you who hate statistics, logistic regression is a method that models the probability that an event happens or not (in our case, whiff or no whiff). The first model attempts to predict whiffs using whether the Astros were home or away, the speed of the pitch and the spin rate. The second model uses home or away plus horizontal and vertical movement, and the final model uses home or away plus horizontal and vertical pitch location.
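For readers who have not met logistic regression before, a small simulation may help show what the models are estimating. This is only a sketch on invented data: the seed, the sample size, the variable names (`home`, `velo`, `whiff`) and the "true" effects are all assumptions chosen for illustration, not values from the PITCHf/x data.

```r
set.seed(210)  # arbitrary seed so the toy data are reproducible

n    <- 5000
home <- rbinom(n, 1, 0.5)            # 1 = home, 0 = away
velo <- rnorm(n, mean = 93, sd = 2)  # made-up fastball speeds in mph

# Invented "truth": whiffs have lower log-odds at home, slightly higher with speed
eta   <- -2 - 1 * home + 0.05 * (velo - 93)
whiff <- rbinom(n, 1, plogis(eta))   # plogis() turns log-odds into probabilities

toy <- data.frame(whiff, home, velo)
fit <- glm(whiff ~ home + velo, data = toy, family = "binomial")

coef(fit)                  # estimates on the log-odds scale
exp(coef(fit)[["home"]])   # odds ratio for home relative to away
```

Because the simulated home effect is negative, the fitted `home` coefficient should come out negative and its exponentiated value below 1, which is the same reading applied to HomeOrAway in the models that follow.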

#********************Astros Fast Speed Model******************
speed.model <- glm(Whiffs ~ HomeOrAway + start_speed + spin_rate, data = AstrosFastPitches, family = "binomial")
speed.model
## 
## Call:  glm(formula = Whiffs ~ HomeOrAway + start_speed + spin_rate, 
##     family = "binomial", data = AstrosFastPitches)
## 
## Coefficients:
## (Intercept)   HomeOrAway  start_speed    spin_rate  
##  -6.0444562   -0.2157306    0.0443450   -0.0001877  
## 
## Degrees of Freedom: 9943 Total (i.e. Null);  9940 Residual
## Null Deviance:       5585 
## Residual Deviance: 5565  AIC: 5573
speed.model[1]
## $coefficients
##   (Intercept)    HomeOrAway   start_speed     spin_rate 
## -6.0444561704 -0.2157306153  0.0443449850 -0.0001877071
summary(speed.model)
## 
## Call:
## glm(formula = Whiffs ~ HomeOrAway + start_speed + spin_rate, 
##     family = "binomial", data = AstrosFastPitches)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5083  -0.4294  -0.4047  -0.3774   2.4964  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.044e+00  1.201e+00  -5.032 4.86e-07 ***
## HomeOrAway  -2.157e-01  7.460e-02  -2.892  0.00383 ** 
## start_speed  4.434e-02  1.376e-02   3.223  0.00127 ** 
## spin_rate   -1.877e-04  7.483e-05  -2.509  0.01212 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5585.5  on 9943  degrees of freedom
## Residual deviance: 5565.1  on 9940  degrees of freedom
## AIC: 5573.1
## 
## Number of Fisher Scoring iterations: 5
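#Score the pitches with the fitted model (in-sample; there is no held-out test set)
#Note: this call scores all Astros pitches (AstrosPitches), not just the fastballs the model was fit on,
#so the matrix below covers 17,983 pitches rather than 9,944
res <- predict(speed.model, AstrosPitches, type = "response")
#Classify as a whiff when the predicted probability exceeds 0.1
res2 <- ifelse(res > 0.1, 1, 0)
#Validate the model - Confusion Matrix 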
confmatrix <- table(res2, AstrosPitches$Whiffs)
confmatrix
##     
## res2     0     1
##    0 14999  2146
##    1   746    92
#Accuracy 
(confmatrix[[1,1]] + confmatrix[[2,2]]) / sum(confmatrix) #Predicts 83.9% of Outcomes
## [1] 0.8391814
#Odds ratio for away relative to home (inverse of the exponentiated HomeOrAway coefficient)
1/exp(speed.model$coef[2]) 
## HomeOrAway 
##   1.240768
#Controlling for speed and spin rate, the Astros' odds of a swing and miss on a fastball were 1.24x higher away from home
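An odds ratio is not the same thing as a ratio of probabilities, so it can help to push the reported coefficients through the logistic function for one concrete pitch. The sketch below copies the coefficients from the speed.model output above; the 95 mph, 2,200 rpm fastball is a hypothetical pitch chosen only for illustration, and the variable names are not part of the original script.

```r
# Coefficients copied from the speed.model output
b <- c(intercept = -6.0444562, home = -0.2157306,
       speed = 0.0443450, spin = -0.0001877)

# Hypothetical pitch, chosen only for illustration
speed <- 95    # mph
spin  <- 2200  # rpm

# Linear predictor (log-odds) away, then shifted by the home coefficient
eta_away <- b[["intercept"]] + b[["speed"]] * speed + b[["spin"]] * spin
eta_home <- eta_away + b[["home"]]

# plogis() is the inverse logit: 1 / (1 + exp(-x))
p_away <- plogis(eta_away)
p_home <- plogis(eta_home)

round(c(away = p_away, home = p_home), 3)

# Ratio of the odds recovers the 1.24 figure; the ratio of probabilities is smaller
odds_ratio <- (p_away / (1 - p_away)) / (p_home / (1 - p_home))
prob_ratio <- p_away / p_home
c(odds_ratio = odds_ratio, prob_ratio = prob_ratio)
```

The odds ratio stays at about 1.24 whatever pitch is plugged in, while the probability ratio depends on the speed and spin chosen.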

#********************Astros Fast Movement Prediction Model**********************
movement.model <- glm(Whiffs ~ HomeOrAway + pfx_x + pfx_z, data = AstrosFastPitches, family = "binomial")
movement.model
## 
## Call:  glm(formula = Whiffs ~ HomeOrAway + pfx_x + pfx_z, family = "binomial", 
##     data = AstrosFastPitches)
## 
## Coefficients:
## (Intercept)   HomeOrAway        pfx_x        pfx_z  
##   -2.485601    -0.253138     0.006651     0.026688  
## 
## Degrees of Freedom: 9943 Total (i.e. Null);  9940 Residual
## Null Deviance:       5585 
## Residual Deviance: 5570  AIC: 5578
movement.model[1]
## $coefficients
##  (Intercept)   HomeOrAway        pfx_x        pfx_z 
## -2.485600999 -0.253137998  0.006650885  0.026688161
summary(movement.model)
## 
## Call:
## glm(formula = Whiffs ~ HomeOrAway + pfx_x + pfx_z, family = "binomial", 
##     data = AstrosFastPitches)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4764  -0.4331  -0.4033  -0.3829   2.4279  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.485601   0.106543 -23.330  < 2e-16 ***
## HomeOrAway  -0.253138   0.075279  -3.363 0.000772 ***
## pfx_x        0.006651   0.007053   0.943 0.345663    
## pfx_z        0.026688   0.011926   2.238 0.025234 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5585.5  on 9943  degrees of freedom
## Residual deviance: 5570.3  on 9940  degrees of freedom
## AIC: 5578.3
## 
## Number of Fisher Scoring iterations: 5
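#Score the fastballs with the fitted model (in-sample; there is no held-out test set)
res <- predict(movement.model, AstrosFastPitches, type = "response")
#Classify as a whiff when the predicted probability exceeds 0.1
res2 <- ifelse(res > 0.1, 1, 0)
#Validate the model - Confusion Matrix 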
confmatrix <- table(res2, AstrosFastPitches$Whiffs)
confmatrix
##     
## res2    0    1
##    0 9004  784
##    1  136   20
#Accuracy 
(confmatrix[[1,1]] + confmatrix[[2,2]]) / sum(confmatrix) #Predicts 90.7% of Outcomes
## [1] 0.9074819
#Odds ratio for away relative to home (inverse of the exponentiated HomeOrAway coefficient)
1/exp(movement.model$coef[2])
## HomeOrAway 
##   1.288061
#Controlling for horizontal and vertical movement, the Astros' odds of a swing and miss on a fastball were 1.29x higher away from home


#******************Astros Fast Location Prediction Model*******************
location.model <- glm(Whiffs ~ HomeOrAway + px + pz, data = AstrosFastPitches, family = "binomial")
location.model
## 
## Call:  glm(formula = Whiffs ~ HomeOrAway + px + pz, family = "binomial", 
##     data = AstrosFastPitches)
## 
## Coefficients:
## (Intercept)   HomeOrAway           px           pz  
##     -3.2005      -0.2659       0.0288       0.3566  
## 
## Degrees of Freedom: 9943 Total (i.e. Null);  9940 Residual
## Null Deviance:       5585 
## Residual Deviance: 5508  AIC: 5516
location.model[1]
## $coefficients
## (Intercept)  HomeOrAway          px          pz 
## -3.20052994 -0.26585685  0.02880469  0.35656676
summary(location.model)
## 
## Call:
## glm(formula = Whiffs ~ HomeOrAway + px + pz, family = "binomial", 
##     data = AstrosFastPitches)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8836  -0.4369  -0.3921  -0.3490   2.6289  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.20053    0.12046 -26.570  < 2e-16 ***
## HomeOrAway  -0.26586    0.07467  -3.560 0.000371 ***
## px           0.02880    0.04590   0.628 0.530268    
## pz           0.35657    0.04315   8.264  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5585.5  on 9943  degrees of freedom
## Residual deviance: 5508.1  on 9940  degrees of freedom
## AIC: 5516.1
## 
## Number of Fisher Scoring iterations: 5
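#Score the fastballs with the fitted model (in-sample; there is no held-out test set)
res <- predict(location.model, AstrosFastPitches, type = "response")
#Classify as a whiff when the predicted probability exceeds 0.1
res2 <- ifelse(res > 0.1, 1, 0)
#Validate the model - Confusion Matrix 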
confmatrix <- table(res2, AstrosFastPitches$Whiffs)
confmatrix
##     
## res2    0    1
##    0 7494  571
##    1 1646  233
#Accuracy 
(confmatrix[[1,1]] + confmatrix[[2,2]]) / sum(confmatrix) #Predicts 77.7% of Outcomes
## [1] 0.7770515
#Odds ratio for away relative to home (inverse of the exponentiated HomeOrAway coefficient)
1/exp(location.model$coef[2])
## HomeOrAway 
##   1.304548
#Controlling for horizontal and vertical location, the Astros' odds of a swing and miss on a fastball were 1.30x higher away from home
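The three odds ratios are point estimates, and the summary() outputs already contain what is needed to attach rough uncertainty to them. The sketch below builds approximate 95% Wald intervals from the HomeOrAway estimates and standard errors copied from those outputs. The vector names are illustrative, and a Wald interval is one common approximation rather than the only option.

```r
# HomeOrAway estimates and standard errors, copied from the three summary() outputs
est <- c(speed = -0.2157306, movement = -0.253138, location = -0.26586)
se  <- c(speed = 0.07460,    movement = 0.075279,  location = 0.07467)

z <- qnorm(0.975)  # about 1.96 for a 95% interval

# Flip the sign so the ratio reads "away relative to home", as in the comments above
or_away <- exp(-est)
lower   <- exp(-est - z * se)
upper   <- exp(-est + z * se)

round(cbind(or_away, lower, upper), 2)
```

With the fitted objects in hand, `exp(confint.default(speed.model))` should give the matching Wald interval on the home-relative scale.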

Results

Okay, let us break this down a bit further. The data show that the Astros had swing-and-miss tendencies similar to the major league pattern for non-fastball pitch types. For four-seam fastballs, two-seam fastballs and cutters, there were distinct differences from the major league trend as well as between their home and away splits. This led us to narrow our query to fastballs. At a 0.1 probability cutoff, our models classified between 78% and 91% of outcomes (whiff or no whiff) correctly. That number needs context given the noise associated with swings and misses: only about 8% of the fastballs in the data were whiffs, so a model that always predicted “no whiff” would score about 92%. Next, we calculated the odds ratio from the Home or Away coefficient. Across the three models, the Astros’ odds of a swing and miss on a fastball were between 1.24x and 1.30x higher away from home than at home. This is especially interesting. The high-spin, high-velocity fastballs that have haunted hitters in the last few years are stupendously challenging to hit. Seeing that the Astros had noticeably fewer whiffs at home could potentially be taken as statistical evidence of the effect of knowing what pitch was coming. Knowing what is coming could also be especially damaging to pitchers who rely on their fastball more than on other pitches like the slider.
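Accuracy is easiest to judge against a simple benchmark, and the confusion matrices printed earlier are enough to build one. The sketch below re-enters the movement model's matrix by hand and compares its accuracy with a rule that always predicts “no whiff”, along with the share of actual whiffs the model caught. The object names are illustrative.

```r
# Counts copied from the movement model's confusion matrix
# (rows = predicted class, columns = actual class; matrix() fills column by column)
cm <- matrix(c(9004, 136, 784, 20), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)          # share of pitches classified correctly
baseline    <- sum(cm[, "0"]) / sum(cm)         # accuracy of always predicting "no whiff"
sensitivity <- cm["1", "1"] / sum(cm[, "1"])    # share of actual whiffs that were flagged

round(c(accuracy = accuracy, baseline = baseline, sensitivity = sensitivity), 3)
```

When one class is rare, measures such as sensitivity usually say more about a classifier than raw accuracy does.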

Conclusion

This study was a rewarding and fun experience for me. When examining whiffs strictly, we saw a noticeable difference between the Astros’ home and away splits. Seeing that the odds of a whiff on a fastball were 1.24x to 1.30x higher away from home than at home could be interpreted as a direct effect of cheating. Before this study, I had never had the opportunity to work with PITCHf/x data. I also had never constructed logistic regressions, confusion matrices or odds ratios. It is, therefore, entirely possible that I misinterpreted or misstated some statistical effects. If anyone has any experience or curiosity on this subject, I implore you to reach out to me. I am very early in my data science career and know that mistakes happen. My goal is to continue to improve, not boost my own ego.