Introduction

Baseball is one of the most heavily analyzed sports, and a wealth of data is publicly available. The MLB Pitch Data set from Kaggle records statistics for every pitch thrown in the 2015-2018 MLB regular seasons. This analysis focused on the pitches.csv file, which describes the type of pitch thrown, the result of the pitch, and measurements of the pitch itself, including velocity and spin rate. Not every variable was of interest, and the full file is very large (about 2.87 million rows and 40 columns), so the analysis was restricted to a subset of variables to speed up computation and keep the results interpretable.

Twenty-three variables of interest were identified, and only the last pitch of each at-bat was kept. Only pitches whose pitch-type classification was recorded with full certainty (type_confidence equal to 2) were counted. The variables examined most closely were spin rate, start speed (velocity), pitch type, and break length. The resulting data set has 23 columns and 308,773 rows.

## 'data.frame':    308773 obs. of  23 variables:
##  $ px             : num  0.627 -0.286 0.008 0.2 -0.432 -0.389 -0.22 0.487 0.088 -0.382 ...
##  $ pz             : num  2.4 1.83 2.6 2.21 2.44 ...
##  $ start_speed    : num  92.9 92.6 87.5 87.5 86.8 87.5 87.2 89.2 91.1 88.8 ...
##  $ spin_rate      : num  2744 2475 1308 846 1081 ...
##  $ spin_dir       : num  148 137 167 144 173 ...
##  $ break_angle    : num  -45.7 -39 -8.1 -10.3 0 2.1 -13 5.3 -42.9 -6.2 ...
##  $ break_length   : num  3.7 4.8 5.4 6.8 5.7 5.8 5.3 4.5 4.5 5.7 ...
##  $ break_y        : num  23.7 23.7 23.8 23.8 23.9 23.8 23.8 23.8 23.7 23.8 ...
##  $ type_confidence: num  2 2 2 2 2 2 2 2 2 2 ...
##  $ pfx_x          : num  7.32 8.56 1.56 2.65 0.75 0.28 2.8 -1.55 8.59 2.36 ...
##  $ pfx_z          : num  11.72 9.19 6.73 3.6 5.64 ...
##  $ nasty          : num  42 48 18 32 38 46 34 38 39 40 ...
##  $ zone           : num  6 13 5 5 4 4 5 9 5 7 ...
##  $ code           : chr  "X" "E" "D" "X" ...
##  $ type           : chr  "X" "X" "X" "X" ...
##  $ pitch_type     : chr  "FF" "FF" "FC" "FC" ...
##  $ b_count        : num  2 2 1 2 0 0 2 0 0 2 ...
##  $ s_count        : num  2 0 0 1 2 0 0 1 0 2 ...
##  $ outs           : num  0 1 0 2 1 2 0 1 0 0 ...
##  $ pitch_num      : num  6 3 2 4 3 1 3 2 1 7 ...
##  $ on_1b          : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ on_2b          : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ on_3b          : num  0 0 0 1 0 1 0 0 0 0 ...

This kind of data is regularly used to inform in-game decision-making, so it is worth examining pitching trends and attempting to predict outcomes. The main questions are: what is the relationship between velocity and break length, how do different game situations affect pitch speed, and can the type of pitch thrown be predicted from the measured data alone?

Velocity and Break Length

The first step in modeling the data was to create two sets of side-by-side boxplots, one comparing break length across pitch types and the other comparing start speed across pitch types. These plots were originally created only to give basic information about the pitch types for later modeling, but a clear trend emerged: pitch types that rank lower in break length generally rank higher in start speed, and vice versa.

A linear regression model was used to test this apparent relationship. First, a scatterplot of break length against start speed was made. The fitted model showed a strong negative linear relationship: as pitch velocity increases, break length decreases. It gave an adjusted R-squared of 0.6998, meaning that about 70% of the variation in break length is explained by pitch velocity, and a p-value below 2.2e-16.
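
As a minimal sketch of this kind of fit, the following uses simulated data rather than the project's data. The data frame `toy_pitches`, the coefficients used to generate it, and the noise level are all assumptions chosen only so the example runs on its own; with the real data, the same `lm()` call would be applied to `last_pitches`.

```r
# Simulated stand-in for the pitch data (NOT the Kaggle data).
set.seed(1)
n <- 500
toy_pitches <- data.frame(start_speed = runif(n, min = 70, max = 100))

# Assumed relationship: break length falls as start speed rises, plus noise.
toy_pitches$break_length <- 20 - 0.15 * toy_pitches$start_speed +
  rnorm(n, sd = 0.8)

# Simple linear regression of break length on start speed.
fit <- lm(break_length ~ start_speed, data = toy_pitches)

# Slope estimate and adjusted R-squared, the two quantities quoted in the text.
slope  <- coef(fit)[["start_speed"]]
adj_r2 <- summary(fit)$adj.r.squared

print(slope)
print(adj_r2)
```

A negative slope is what the boxplots suggest, and the adjusted R-squared reports the share of variation in break length that start speed accounts for.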

Velocity in Situations

Start speed remained in the modeling, but now as the response variable, in order to see how it is affected by the situation the pitcher faces, such as the number of balls or strikes in the count or the number of runners on base. The process began with a scatterplot of start speed against spin rate, with each point colored by the number of strikes in the count when the pitch was thrown. It showed a weak positive linear relationship: as spin rate increases, so does pitch velocity. More importantly, it showed that the fastest pitches overall are those with high spin rates thrown with no strikes.

## 
## Call:
## lm(formula = start_speed ~ spin_rate * num_s, data = last_pitches)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.158  -2.406   0.851   3.249  15.962 
## 
## Coefficients:
##                     Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)        8.075e+01  4.949e-02 1631.732  < 2e-16 ***
## spin_rate          4.704e-03  2.525e-05  186.302  < 2e-16 ***
## num_s1S           -4.118e-01  6.436e-02   -6.399 1.56e-10 ***
## num_s2S           -4.797e-01  6.127e-02   -7.829 4.92e-15 ***
## spin_rate:num_s1S -1.598e-05  3.348e-05   -0.477   0.6331    
## spin_rate:num_s2S  7.314e-05  3.218e-05    2.272   0.0231 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.777 on 308767 degrees of freedom
## Multiple R-squared:  0.3134, Adjusted R-squared:  0.3134 
## F-statistic: 2.819e+04 on 5 and 308767 DF,  p-value: < 2.2e-16
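
Because the model includes an interaction between spin rate and strike count, each strike count has its own fitted line. As a rough sketch, the three intercepts and slopes can be rebuilt from the coefficients printed above (rounded as displayed). The variable names and the 2000 rpm reference spin rate are illustrative choices, not part of the original analysis.

```r
# Coefficients copied from the summary output above (rounded as printed).
b_intercept <- 8.075e+01   # (Intercept): baseline, zero strikes
b_spin      <- 4.704e-03   # spin_rate
b_1s        <- -4.118e-01  # num_s1S
b_2s        <- -4.797e-01  # num_s2S
b_spin_1s   <- -1.598e-05  # spin_rate:num_s1S
b_spin_2s   <- 7.314e-05   # spin_rate:num_s2S

# One line per strike count: baseline plus the matching adjustment.
intercepts <- c(b_intercept, b_intercept + b_1s, b_intercept + b_2s)
slopes     <- c(b_spin, b_spin + b_spin_1s, b_spin + b_spin_2s)

# Predicted start speed (mph) at an assumed reference spin rate of 2000 rpm.
ref_spin  <- 2000
pred_2000 <- intercepts + slopes * ref_spin
names(pred_2000) <- c("0S", "1S", "2S")

print(round(pred_2000, 2))
```

At this reference spin rate the zero-strike line sits about 0.3 to 0.4 mph above the other two, which is the pattern described in the text.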

However, this result may simply reflect the mix of pitch types: the model used pitches of all types, and fastballs may be thrown more often with no strikes than in other counts. In other words, the fastest pitches are not necessarily thrown with no strikes; rather, the fastest kind of pitch (the fastball) is thrown more often with no strikes. To test this hypothesis, a new data set containing only four-seam fastballs was created, and side-by-side boxplots of velocity by count were drawn from it. The results were clear: the further ahead in the count pitchers are, meaning the more strikes there are relative to balls, the harder they throw. At a strike-ball differential of -3 (3-0 counts), the average fastball start speed is 92.3 miles per hour. As the differential rises to 2, fastball velocity rises steadily with it, peaking at 93.2 miles per hour in 0-2 counts.
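
The per-count averages quoted above come from grouping fastballs by strike-ball differential and taking the mean start speed in each group. A minimal sketch of that calculation follows, using simulated data and base R so it runs on its own. The data frame `toy_fastballs` and the upward trend built into it are assumptions for illustration only, not results.

```r
# Simulated fastballs (NOT the Kaggle data): every ball-strike count, 50 pitches each.
set.seed(2)
counts <- expand.grid(b_count = 0:3, s_count = 0:2)
toy_fastballs <- counts[rep(seq_len(nrow(counts)), each = 50), ]

# Strike-ball differential, as in the appendix: strikes minus balls.
toy_fastballs$count <- toy_fastballs$s_count - toy_fastballs$b_count

# Assumed toy speeds: a small upward trend in the differential plus noise.
toy_fastballs$start_speed <- rnorm(
  nrow(toy_fastballs),
  mean = 92.5 + 0.2 * toy_fastballs$count,
  sd = 1.5
)

# Mean start speed for each differential, from -3 (3-0 count) to 2 (0-2 count).
avg_by_count <- tapply(toy_fastballs$start_speed, toy_fastballs$count, mean)

print(round(avg_by_count, 2))
```

With the real data, the same grouping would be applied to the `fastballs` data frame built in the appendix.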

Making Predictions

With enough experience watching baseball, it is possible to identify a pitch just by looking at it. Predicting pitch type from only a few explanatory variables, however, is a much more daunting task. Only three pitch types were examined for this process: the four-seam fastball (FF), the changeup (CH), and the curveball (CU). An initial scatterplot of break length against spin rate showed no clear correlation.

Once color was added based on pitch type, however, this changed dramatically.

The colored plot shows that each pitch type has its own clear relationship between the two variables. After similar relationships were found for other variables in the data set, the data were split into training and test sets for a K-Nearest Neighbors (KNN) classifier. With four predictor variables and k set to three, a confusion matrix was created to display the accuracy of the classifier.

## [1] 0.9519778
##     
##         FF    CH    CU
##   FF 50816  1029    48
##   CH  1277 15788   728
##   CU    29   752  9975
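
The accuracy printed above the table can be recovered from the confusion matrix itself: it is the sum of the diagonal (correct predictions) divided by the total number of pitches. The short check below re-enters the counts from the table; the object names are illustrative.

```r
# Confusion matrix counts copied from the output above.
pitch_levels <- c("FF", "CH", "CU")
conf_mat <- matrix(
  c(50816,  1029,   48,
     1277, 15788,  728,
       29,   752, 9975),
  nrow = 3, byrow = TRUE,
  dimnames = list(pitch_levels, pitch_levels)
)

# Overall accuracy: correct predictions (diagonal) over all predictions.
correct  <- sum(diag(conf_mat))
total    <- sum(conf_mat)
accuracy <- correct / total

print(round(accuracy, 7))
```

Most of the errors involve the changeup, which is confused with both the fastball and the curveball, while fastballs and curveballs are rarely mistaken for each other.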

The classifier predicted the correct pitch type 95.2% of the time. Cross-validation was used to choose the value of k for the K-Nearest Neighbors classifier. One way to potentially improve the accuracy would be to split pitchers by handedness, since the ball is released differently and therefore moves in different directions depending on which hand throws it.
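
The KNN code is not shown in the appendix, so the following is only a sketch of how such a classifier and a cross-validated choice of k might look with the `class` package. It runs on simulated data. The four predictors, the class means, the 70/30 split, the scaling step, and the candidate values of k are all assumptions, since the text does not say which were used.

```r
library(class)

# Simulated pitches (NOT the Kaggle data); the means below are arbitrary toy values.
set.seed(42)
n_per <- 150
make_type <- function(label, speed, spin, brk, angle) {
  data.frame(
    pitch_type   = label,
    start_speed  = rnorm(n_per, speed, 2),
    spin_rate    = rnorm(n_per, spin, 200),
    break_length = rnorm(n_per, brk, 1),
    break_angle  = rnorm(n_per, angle, 5)
  )
}
toy <- rbind(
  make_type("FF", 93, 2200, 4, -25),
  make_type("CH", 84, 1700, 7, -15),
  make_type("CU", 78, 1300, 12, 8)
)
toy$pitch_type <- factor(toy$pitch_type, levels = c("FF", "CH", "CU"))

# Hypothetical choice of four predictors.
vars <- c("start_speed", "spin_rate", "break_length", "break_angle")

# 70/30 train/test split (assumed proportions).
train_idx <- sample(nrow(toy), size = 0.7 * nrow(toy))

# Scale with training-set statistics so no variable dominates the distance.
train_x <- scale(toy[train_idx, vars])
test_x  <- scale(toy[-train_idx, vars],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))
train_y <- toy$pitch_type[train_idx]
test_y  <- toy$pitch_type[-train_idx]

# Fit and predict with k = 3, then tabulate predictions against the truth.
pred     <- knn(train = train_x, test = test_x, cl = train_y, k = 3)
conf_mat <- table(predicted = pred, actual = test_y)
accuracy <- mean(pred == test_y)

# Leave-one-out cross-validation on the training set to compare values of k.
ks <- c(1, 3, 5, 7, 9)
cv_acc <- sapply(ks, function(k) {
  mean(knn.cv(train = train_x, cl = train_y, k = k) == train_y)
})
best_k <- ks[which.max(cv_acc)]

print(conf_mat)
print(accuracy)
print(best_k)
```

Here `knn.cv()` performs leave-one-out cross-validation, which is one of several ways to choose k; the original analysis may have used a different scheme.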

Conclusion

The modeling gave several interesting results. First, pitch velocity has a strong negative linear correlation with break length: in general, a pitch with a higher start speed has a lower break length. Second, the modeling showed that pitchers tend to throw harder the further ahead in the count they are. This is likely because pitchers who are ahead become more comfortable and worry less about throwing balls, so they focus more on velocity than on accuracy. Lastly, spin rate is the best metric for determining how much a given pitch will move. That said, the pitch type must be known in order to use spin rate to estimate break length, since spin rate affects each pitch type differently. For example, higher spin rates make curveballs drop more, make fastballs drop less, and make sliders move sideways more.

Further studies could link player IDs and names to the data to identify the highest- and lowest-performing players in particular situations or metrics. One could also incorporate all pitches rather than only the last pitch of each at-bat. Pitch location metrics could be used to determine which types of pitches, in which areas of the strike zone, are most or least effective at getting batters out. As statistical analysis continues to grow in popularity in the baseball industry, new metrics and information will emerge that can be studied. This already enormous field offers nearly unlimited opportunities for modeling, and this project has only scratched the surface of the possible analyses.

Appendix

library(dplyr)
library(tidyverse)
library(ggplot2)
library(class)

#pitches <- read_csv("~/Statistical Learning with R/Pitches R Proj./Pitch Data/pitches.csv") #L
pitches <- read_csv("Downloads/pitches.csv") #C
## Rows: 2867154 Columns: 40
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): code, type, pitch_type
## dbl (37): px, pz, start_speed, end_speed, spin_rate, spin_dir, break_angle, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pitch <- pitches

pitches <- data.frame(
  px=pitch$px, pz=pitch$pz,
  start_speed=pitch$start_speed,
  spin_rate=pitch$spin_rate,
  spin_dir=pitch$spin_dir,
  break_angle=pitch$break_angle,
  break_length=pitch$break_length,
  break_y=pitch$break_y,
  type_confidence=pitch$type_confidence,
  pfx_x=pitch$pfx_x,
  pfx_z=pitch$pfx_z,
  nasty=pitch$nasty,
  zone=pitch$zone,
  code=pitch$code,
  type=pitch$type,
  pitch_type=pitch$pitch_type,
  b_count=pitch$b_count,
  s_count=pitch$s_count,
  outs=pitch$outs,
  pitch_num=pitch$pitch_num,
  on_1b=pitch$on_1b,
  on_2b=pitch$on_2b,
  on_3b=pitch$on_3b
)

# Only Last Pitch in AB w/ Certainty in Call
last_pitches <- pitches %>%
  filter(code %in% c("X", "D", "E", "H"), type_confidence == 2) %>%
  na.omit()

#last_pitches$pitch_type<-factor(last_pitches$pitch_type, levels=orderPitch$pitch_type[order(orderPitch$avg)])

# Removing Misinput (impossible breaklength)
last_pitches <- last_pitches %>%
  filter(break_length<50000)

# Just fastballs
fastballs <- pitches %>%
  filter(pitch_type=="FF") %>%
  mutate(count=s_count-b_count)


str(last_pitches)

# Pitches ordered by start speed
orderPitch<-last_pitches%>%
  group_by(pitch_type)%>%
  summarise(avg=mean(start_speed, na.rm=TRUE))

#orderPitch$pitch_type[order(orderPitch$avg)]

#re-level
last_pitches$pitch_type<-factor(last_pitches$pitch_type, levels=orderPitch$pitch_type[order(orderPitch$avg)])

ggplot(last_pitches, aes(x=pitch_type, y=start_speed, fill=pitch_type))+
  geom_boxplot()+
  ggtitle("Pitch Type vs Start Speed")+
  xlab("Pitch Type")+
  ylab("Start Speed")

orderPitch<-last_pitches%>%
  group_by(pitch_type)%>%
  summarise(avg=mean(break_length, na.rm=TRUE))

#orderPitch$pitch_type[order(orderPitch$avg)]

#re-level
last_pitches$pitch_type<-factor(last_pitches$pitch_type, levels=orderPitch$pitch_type[order(orderPitch$avg)])

ggplot(last_pitches, aes(x=pitch_type, y=break_length, fill=pitch_type))+
  geom_boxplot()+
  ggtitle("Pitch Type vs Break Length")+
  xlab("Pitch Type")+
  ylab("Break Length")


ggplot(last_pitches, aes(start_speed, break_length))+
  geom_point()+
  ggtitle("Start Speed vs Break Length")+
  xlab("Velocity")+
  ylab("Break Length")


last_pitches$num_s <- as.factor(ifelse(last_pitches$s_count == 0, '0S',
                                       ifelse(last_pitches$s_count == 1, '1S',
                                              ifelse(last_pitches$s_count == 2, '2S', NA))))

mult_mod2 <- lm(start_speed ~ spin_rate*num_s, data = last_pitches)
summary(mult_mod2)
s1 <- mult_mod2$coefficients[2]
s2 <- mult_mod2$coefficients[2]+mult_mod2$coefficients[5]
s3 <- mult_mod2$coefficients[2]+mult_mod2$coefficients[6]
i1 <- mult_mod2$coefficients[1]
i2 <- mult_mod2$coefficients[1]+mult_mod2$coefficients[3]
i3 <- mult_mod2$coefficients[1]+mult_mod2$coefficients[4]


ggplot(data=last_pitches, aes(x=spin_rate, y=start_speed, color=num_s))+
  geom_point()+
  ggtitle("Scatterplot of Spin Rate vs Velocity of Pitches")+
  theme_bw()+
  geom_abline(slope=s1, intercept = i1, col=2)+
  geom_abline(slope=s2, intercept = i2, col=4)+
  geom_abline(slope=s3, intercept = i3, col=6)


ggplot(fastballs, aes(group=count, x=count, y=start_speed, color=count))+
  geom_boxplot()+
  ggtitle("Pitch Speed per Count")+
  xlab("Count")+
  ylab("Velocity")

# taking main 3 pitch types (four seam fastball, curve ball, change-up)
spec_pitch <- last_pitches %>%
  filter(pitch_type %in% c("FF", "CH", "CU"))

# No color
ggplot(spec_pitch , aes(x=spin_rate, y=break_length))+
  geom_point()

# With color
ggplot(spec_pitch, aes(x=spin_rate, y=break_length, color=pitch_type))+
  geom_point(alpha=.4)