Overview

We investigated an NCAA baseball player’s Trackman data to find a reliable pitch at different strike counts, using the ideas of variance and the normal distribution. By a reliable pitch, we mean a pitch that has a higher chance of getting a strike.

Libraries

library(dplyr)
library(Boruta)
library(knitr)

We used the dplyr library for data manipulation, the Boruta library for feature selection, and the knitr library to create clean tables.

Importing Data

ball_data = read.csv("project_dataset_trackman_pitcher.csv")

Fastball = ball_data %>% dplyr::filter(TaggedPitchType == "Fastball")
Curveball = ball_data %>% dplyr::filter(TaggedPitchType == "Curveball")
Slider = ball_data %>% dplyr::filter(TaggedPitchType == "Slider")
ChangeUp = ball_data %>% dplyr::filter(TaggedPitchType == "ChangeUp")

First, we imported the pitcher’s performance data from a Trackman CSV file. After importing the data, we separated it by pitch type. This particular pitcher threw four types of pitches: fastball, curveball, slider, and changeup.

Pitch Speed vs Inning

interaction.plot(as.factor(ball_data$Inning),as.factor(ball_data$TaggedPitchType),ball_data$RelSpeed,xlab = "Inning", ylab = "RelSpeed", trace.label = "Pitch Type")

Our primary interest in this project was pitch speed at different strike counts and its reliability. First, we plotted pitch speed against inning for each pitch type. As we suspected, pitch speed tended to decrease in later innings for all four pitch types. However, we wanted to find factors that affect pitch speed besides this obvious one, so we used feature selection.
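For each pitch type, interaction.plot draws the mean of RelSpeed at each inning. A minimal sketch of that underlying computation follows, using a small made-up data frame (the values are illustrative and are not taken from the Trackman file):

```r
# Toy data: illustrative values only, not taken from the Trackman file.
toy <- data.frame(
  Inning          = c(1, 1, 1, 1, 2, 2, 2, 2),
  TaggedPitchType = c("Fastball", "Fastball", "Slider", "Slider",
                      "Fastball", "Fastball", "Slider", "Slider"),
  RelSpeed        = c(90, 89, 83, 82, 88, 87, 81, 82)
)

# Mean speed for every (inning, pitch type) cell; these cell means are
# the points that interaction.plot() connects with one line per pitch type.
cell_means <- tapply(toy$RelSpeed,
                     list(Inning = toy$Inning, PitchType = toy$TaggedPitchType),
                     mean)
print(cell_means)
```

Reading down a column of cell_means shows how the average speed of one pitch type changes from inning to inning.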

Boruta Feature Selection

speed_Boruta =  Boruta(RelSpeed~as.factor(Balls)+as.factor(Outs)+as.factor(BatterTeam)+as.factor(Strikes)+VertBreak+HorzBreak,data=ball_data)

plot(speed_Boruta ,las=2 ,xlab="")

Boruta is a feature selection algorithm built around random forests. It uses the variable importance from a random forest to determine which variables are unnecessary in the model. The algorithm suggested that batter team, strike count, vertical break, and horizontal break are important features in this model. Using baseball knowledge, we concluded that vertical and horizontal break were selected because of their close link to pitch type. Therefore, we chose to analyze each pitch type separately, using batter team and strike count as the chosen features.
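The core idea behind Boruta is to compare each real feature against shuffled "shadow" copies that cannot carry any signal. The sketch below is a deliberately simplified, single-pass illustration of that comparison in base R. It is not the Boruta algorithm itself: it swaps random forest importance for absolute correlation, runs only once instead of iterating with a statistical test, and uses synthetic data (x1, x2, and y are hypothetical names).

```r
set.seed(1)  # make the random permutations reproducible

# Synthetic data (an assumption for illustration): y depends on x1 only.
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 3 * x1 + rnorm(n)
X  <- data.frame(x1 = x1, x2 = x2)

# Shadow features: each column shuffled, which breaks any link to y.
shadows <- as.data.frame(lapply(X, sample))
names(shadows) <- paste0("shadow_", names(X))

# Stand-in importance score: absolute correlation with the response.
# Boruta itself uses random forest importance instead.
importance <- function(df) sapply(df, function(col) abs(cor(col, y)))
real_imp   <- importance(X)
shadow_max <- max(importance(shadows))

# Keep the features that beat the best shadow feature.
selected <- names(real_imp)[real_imp > shadow_max]
print(selected)
```

A pure-noise feature such as x2 can still beat the shadows by chance in a single pass, which is one reason the real algorithm repeats the comparison many times before deciding.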

Fastball Speed Linear Model

lm.fast_vel = lm(RelSpeed ~ as.factor(Strikes) * as.factor(BatterTeam),data=Fastball)
anova(lm.fast_vel)
interaction.plot(as.factor(Fastball$Strikes),as.factor(Fastball$BatterTeam),(Fastball$RelSpeed),xlab = "Strike", ylab = "RelSpeed", trace.label = "Team")

We used the ANOVA table to verify the significance of strike count and batter team for fastball speed. The low p-value for each variable indicated that both were significant factors. We then plotted fastball speed against strike count, which showed that speed increased as the strike count increased.
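The fastball formula joins its two factors with `*`, which fits both main effects and their interaction, whereas `+` fits the main effects only. A quick way to see which terms a formula expands to is base R's terms(); the sketch below uses bare column names rather than the as.factor() wrappers purely to keep the labels short.

```r
# term.labels lists the terms R will fit for a given formula.
interaction_terms <- attr(terms(RelSpeed ~ Strikes * BatterTeam), "term.labels")
additive_terms    <- attr(terms(RelSpeed ~ Strikes + BatterTeam), "term.labels")

print(interaction_terms)  # main effects plus the Strikes:BatterTeam interaction
print(additive_terms)     # main effects only
```

With the interaction term included, the ANOVA table gains a third row that tests whether the effect of strike count differs between batter teams.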

Slider Speed Linear Model

lm.slid_vel = lm(RelSpeed ~ as.factor(Strikes) + as.factor(BatterTeam),data=Slider)
anova(lm.slid_vel)
interaction.plot(as.factor(Slider$Strikes),as.factor(Slider$BatterTeam),(Slider$RelSpeed),xlab = "Strike", ylab = "RelSpeed", trace.label = "Team")

Using the same analysis as for the fastball, we concluded that slider speed increased as the strike count increased.

Curveball Speed Linear Model

lm.curv_vel = lm(RelSpeed ~ as.factor(Strikes) + as.factor(BatterTeam),data=Curveball)
anova(lm.curv_vel)
interaction.plot(as.factor(Curveball$Strikes),as.factor(Curveball$BatterTeam),(Curveball$RelSpeed),xlab = "Strike", ylab = "RelSpeed", trace.label = "Team")

Using the same analysis as for the fastball, we concluded that curveball speed did not increase as the strike count increased. The p-value for strike count was well above 0.05, so we failed to reject the hypothesis that the average speed is the same at each strike count.
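The decision described above can be read directly off the ANOVA table. The following is a minimal sketch on synthetic data (the toy data frame is an assumption, generated with the same mean at every strike count), showing where the p-value lives and how the 0.05 rule is applied:

```r
set.seed(42)

# Synthetic speeds (illustrative assumption): same mean at every strike count.
toy <- data.frame(
  Strikes  = factor(rep(0:2, each = 30)),
  RelSpeed = rnorm(90, mean = 77, sd = 1.5)
)

fit <- lm(RelSpeed ~ Strikes, data = toy)
tab <- anova(fit)

# The F-test p-value for the Strikes term sits in the "Pr(>F)" column.
p_value <- tab["Strikes", "Pr(>F)"]
print(p_value)

# Decision at the 0.05 level: reject the equal-means hypothesis only if p < 0.05.
reject_null <- p_value < 0.05
print(reject_null)
```

Failing to reject does not prove the means are equal; it only says the data do not show a detectable difference.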

Strike Percentage per Velocity Range

s_data_table = ball_data %>%
  filter(TaggedPitchType != "ChangeUp") %>%
  group_by(TaggedPitchType, `Speed Range`=cut(RelSpeed, breaks= seq(55, 95, by = 2.5))) %>%
  summarise(`Total Pitch` = n(),
            `Called Strike` = sum(PitchCall == "StrikeCalled"),
            `Strike Swinging` = sum(PitchCall == "StrikeSwinging")) %>%
  mutate(`S %` = (`Strike Swinging` + `Called Strike`)/`Total Pitch`)

kable(s_data_table, align = "c")
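| TaggedPitchType | Speed Range | Total Pitch | Called Strike | Strike Swinging |    S %    |
|:---------------:|:-----------:|:-----------:|:-------------:|:---------------:|:---------:|
|    Curveball    |  (70,72.5]  |      1      |       0       |        0        | 0.0000000 |
|    Curveball    |  (72.5,75]  |      5      |       1       |        0        | 0.2000000 |
|    Curveball    |  (75,77.5]  |     31      |       9       |        5        | 0.4516129 |
|    Curveball    |  (77.5,80]  |     18      |       7       |        2        | 0.5000000 |
|    Fastball     |  (82.5,85]  |      1      |       0       |        0        | 0.0000000 |
|    Fastball     |  (85,87.5]  |     80      |      17       |        8        | 0.3125000 |
|    Fastball     |  (87.5,90]  |     169     |      14       |       18        | 0.1893491 |
|    Fastball     |  (90,92.5]  |     29      |       4       |        4        | 0.2758621 |
|     Slider      |  (77.5,80]  |     22      |       6       |        7        | 0.5909091 |
|     Slider      |  (80,82.5]  |     71      |      24       |        8        | 0.4507042 |
|     Slider      |  (82.5,85]  |     57      |      11       |        5        | 0.2807018 |
|     Slider      |  (85,87.5]  |      7      |       2       |        1        | 0.4285714 |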

Once we found that fastball and slider speed increased with strike count while curveball speed stayed steady, we calculated each pitch type’s performance in different speed ranges. We defined better performance as a higher chance of earning a strike, either called or swinging. We found the following: the curveball performed best in the (77.5,80] mph range, and its performance declined at lower speeds. The fastball and slider showed the opposite pattern; they tended to perform better at lower speeds. The fastball performed best in the (85,87.5] mph range and the slider in the (77.5,80] mph range.
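The speed ranges come from cut(), whose intervals are open on the left and closed on the right by default. The sketch below shows how boundary speeds are assigned and how a per-range strike rate can be computed; the toy data frame and the "BallCalled" label are illustrative assumptions, not values from the Trackman file.

```r
# Same 2.5 mph bins as in the table; intervals are open on the left
# and closed on the right, so 80.0 falls in (77.5,80], not (80,82.5].
breaks <- seq(55, 95, by = 2.5)
bins   <- cut(c(77.5, 80, 80.1), breaks = breaks)
print(as.character(bins))

# Strike rate per bin on toy data (illustrative values, not from the file).
toy <- data.frame(
  RelSpeed  = c(78, 79, 79.5, 81, 82),
  PitchCall = c("StrikeCalled", "BallCalled", "StrikeSwinging",
                "BallCalled", "StrikeCalled")
)
toy$is_strike <- toy$PitchCall %in% c("StrikeCalled", "StrikeSwinging")
strike_rate <- tapply(toy$is_strike,
                      droplevels(cut(toy$RelSpeed, breaks = breaks)),
                      mean)
print(strike_rate)
```

Ranges with very few pitches yield unstable rates, so the pitch counts in each range are worth reading alongside the percentages.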

Average Speed and Standard Deviation per Pitch Type

norm_prob = function(val1, val2, mean, sd) {
  pnorm(val2,mean,sd) - pnorm(val1,mean,sd)
}

ball_data_table = ball_data %>%
  filter(TaggedPitchType != "ChangeUp") %>%
  group_by(TaggedPitchType,Strikes) %>%
  summarise(`Mean` = mean(RelSpeed),
            `Standard Deviation`= sd(RelSpeed)) %>%
  mutate(Reliability = case_when(TaggedPitchType == "Fastball" ~
                                   0.3125000*norm_prob(85,87.5,`Mean`,`Standard Deviation`)+
                                   0.1893491*norm_prob(87.5,90,`Mean`,`Standard Deviation`)+
                                   0.2758621*norm_prob(90,92.5,`Mean`,`Standard Deviation`),
                                 TaggedPitchType == "Curveball" ~
                                   0.2000000*norm_prob(72.5,75,`Mean`,`Standard Deviation`)+
                                   0.4516129*norm_prob(75,77.5,`Mean`,`Standard Deviation`)+
                                   0.5000000*norm_prob(77.5,80,`Mean`,`Standard Deviation`),
                                 TaggedPitchType == "Slider" ~
                                   # The (77.5,80] slider range (S % = 0.5909091) is not included in this sum.
                                   0.4507042*norm_prob(80,82.5,`Mean`,`Standard Deviation`)+
                                   0.2807018*norm_prob(82.5,85,`Mean`,`Standard Deviation`)+
                                   0.4285714*norm_prob(85,87.5,`Mean`,`Standard Deviation`))
  )

kable(ball_data_table, align = "c")
| TaggedPitchType | Strikes |   Mean   | Standard Deviation | Reliability |
|:---------------:|:-------:|:--------:|:------------------:|:-----------:|
|    Curveball    |    0    | 76.86264 |      1.547311      |  0.4281754  |
|    Curveball    |    1    | 76.83849 |      1.400711      |  0.4370018  |
|    Curveball    |    2    | 76.59574 |      1.753424      |  0.4056354  |
|    Fastball     |    0    | 87.90559 |      1.441914      |  0.2365586  |
|    Fastball     |    1    | 88.40766 |      1.429681      |  0.2299278  |
|    Fastball     |    2    | 88.80115 |      1.239662      |  0.2211411  |
|     Slider      |    0    | 81.77771 |      1.589298      |  0.3392368  |
|     Slider      |    1    | 81.84915 |      1.855177      |  0.3232710  |
|     Slider      |    2    | 82.56458 |      1.596408      |  0.3475555  |

Now that we knew which speed range was ideal for each pitch type, the next step was to calculate the mean and standard deviation of pitch speed at each strike count. Assuming pitch speed is normally distributed, these give the probability that a pitch falls in each speed range at a given strike count. We then multiplied the strike percentage in each speed range by the probability of pitching in that range and summed the products to obtain the reliability.
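As a worked example of this calculation, the sketch below recomputes the reliability for one row, the fastball at 0 strikes, using the mean, standard deviation, and strike percentages reported in the two tables. The variable names (mu, sigma, lo, hi, s_pct) are introduced here for illustration only.

```r
# Probability that a normal(mean, sd) speed lands between lo and hi.
norm_prob <- function(lo, hi, mean, sd) pnorm(hi, mean, sd) - pnorm(lo, mean, sd)

# Fastball at 0 strikes, values copied from the two tables in this report.
mu    <- 87.90559
sigma <- 1.441914
lo    <- c(85, 87.5, 90)
hi    <- c(87.5, 90, 92.5)
s_pct <- c(0.3125000, 0.1893491, 0.2758621)  # S % per speed range

# Reliability = sum over ranges of (strike rate in range) * P(speed in range).
reliability <- sum(s_pct * norm_prob(lo, hi, mu, sigma))
print(reliability)  # the report's table lists 0.2365586 for this row
```

Because pnorm is vectorized, passing the range boundaries as vectors evaluates all three ranges in one call.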

Conclusion

In this project, we calculated pitch performance in different speed ranges and the probability of pitching in such a speed range in different strike counts. Using these two pieces of information, we can calculate each pitch’s reliability in different strike counts. Even though this is limited information and we need to consider many other factors to make a pitch decision, we could further develop the idea of reliability and make this part of a data-driven pitch decision.