We investigated an NCAA baseball player’s Trackman data to find a reliable pitch at different strike counts, using variance and the normal distribution. By a reliable pitch, we mean a pitch that has a higher chance of resulting in a strike.
library(dplyr)
library(Boruta)
library(knitr)
We used the dplyr library for data manipulation, the Boruta library for feature selection, and the knitr library to create clean tables.
ball_data = read.csv("project_dataset_trackman_pitcher.csv")
Fastball = ball_data %>% dplyr::filter(TaggedPitchType == "Fastball")
Curveball = ball_data %>% dplyr::filter(TaggedPitchType == "Curveball")
Slider = ball_data %>% dplyr::filter(TaggedPitchType == "Slider")
ChangeUp = ball_data %>% dplyr::filter(TaggedPitchType == "ChangeUp")
First, we imported the pitcher’s performance data from a CSV file exported from Trackman. We then split the data by pitch type. This pitcher threw four types of pitches: fastball, curveball, slider, and changeup.
interaction.plot(as.factor(ball_data$Inning),as.factor(ball_data$TaggedPitchType),ball_data$RelSpeed,xlab = "Inning", ylab = "RelSpeed", trace.label = "Pitch Type")
Our primary interest in this project was pitch speed at different strike counts and its reliability. First, we plotted pitch speed against inning for each pitch type. As we suspected, pitch speed tended to decrease in later innings for all four pitch types. However, we wanted to find other factors that affect pitch speed besides this obvious one, so we used feature selection.
speed_Boruta = Boruta(RelSpeed~as.factor(Balls)+as.factor(Outs)+as.factor(BatterTeam)+as.factor(Strikes)+VertBreak+HorzBreak,data=ball_data)
plot(speed_Boruta ,las=2 ,xlab="")
Boruta is a feature selection algorithm built around random forests. It uses the variable importance scores from a random forest to determine which variables are unnecessary in the model. The algorithm suggested that batter team, strike count, vertical break, and horizontal break are important features in this model. Using baseball knowledge, we concluded that vertical and horizontal break were selected because of their relationship with pitch type. Therefore, we chose to analyze each pitch type separately, using batter team and strike count as the chosen features.
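The core comparison behind this kind of feature selection can be illustrated without the Boruta package. The sketch below is a simplified base-R illustration on synthetic data, not the package's implementation: it uses absolute correlation as a stand-in for random forest importance and checks whether each variable beats shuffled "shadow" copies of the variables. The names `toy`, `importance`, and `hit_rate` are hypothetical.

```r
# Simplified illustration of the shadow-feature idea behind Boruta.
# Assumptions: synthetic data, and |correlation| as a stand-in importance
# score (Boruta itself uses random forest importance).
set.seed(1)
n <- 200
toy <- data.frame(signal = rnorm(n), noise = rnorm(n))
y <- 2 * toy$signal + rnorm(n)   # only `signal` drives the response

importance <- function(x) abs(cor(x, y))
real_imp <- sapply(toy, importance)

rounds <- 100
hits <- c(signal = 0, noise = 0)
for (i in seq_len(rounds)) {
  # Shadow features: each column shuffled, which breaks any link to y
  shadows <- lapply(toy, sample)
  shadow_max <- max(sapply(shadows, importance))
  # A "hit" means the real feature beat the best shadow in this round
  hits <- hits + (real_imp > shadow_max)
}
hit_rate <- hits / rounds
print(hit_rate)
```

A variable that beats the best shadow in nearly every round is a candidate to keep. Boruta formalizes this decision with a statistical test over many random forest runs.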
lm.fast_vel = lm(RelSpeed ~ as.factor(Strikes) * as.factor(BatterTeam),data=Fastball)
anova(lm.fast_vel)
interaction.plot(as.factor(Fastball$Strikes),as.factor(Fastball$BatterTeam),(Fastball$RelSpeed),xlab = "Strike", ylab = "RelSpeed", trace.label = "Team")
We used the ANOVA table to test the significance of strike count and batter team for fastball speed. The low p-value for each variable suggested that both were significant factors. We then plotted fastball speed against strike count, which showed that speed increased as the strike count increased.
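As a minimal sketch of how such a p-value can be read from an ANOVA table, the example below fits the same kind of model to synthetic data. The data frame `toy` and the 0.5 mph increase per strike are assumptions made for illustration, not values from the Trackman data.

```r
# Synthetic data (NOT the Trackman data): speed rises 0.5 mph per strike.
set.seed(42)
strikes <- rep(0:2, each = 200)
toy <- data.frame(
  Strikes  = strikes,
  RelSpeed = 88 + 0.5 * strikes + rnorm(length(strikes), sd = 1.2)
)

# One-way model: does mean speed differ between strike counts?
fit <- lm(RelSpeed ~ as.factor(Strikes), data = toy)
tab <- anova(fit)
print(tab)

# The first row of the table belongs to the strike-count term
p_strikes <- tab[["Pr(>F)"]][1]

# Group means show the direction of the effect
group_means <- tapply(toy$RelSpeed, toy$Strikes, mean)
print(group_means)
```

A p-value below 0.05 rejects the hypothesis that the mean speed is the same at every strike count. The group means then show whether speed rises or falls with the count.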
lm.slid_vel = lm(RelSpeed ~ as.factor(Strikes) + as.factor(BatterTeam),data=Slider)
anova(lm.slid_vel)
interaction.plot(as.factor(Slider$Strikes),as.factor(Slider$BatterTeam),(Slider$RelSpeed),xlab = "Strike", ylab = "RelSpeed", trace.label = "Team")
Using the same analysis as for the fastball, we concluded that slider speed increased as the strike count increased.
lm.curv_vel = lm(RelSpeed ~ as.factor(Strikes) + as.factor(BatterTeam),data=Curveball)
anova(lm.curv_vel)
interaction.plot(as.factor(Curveball$Strikes),as.factor(Curveball$BatterTeam),(Curveball$RelSpeed),xlab = "Strike", ylab = "RelSpeed", trace.label = "Team")
Using the same analysis as for the fastball, we found no evidence that curveball speed increased with strike count. The p-value for strike count was well above 0.05, so we failed to reject the null hypothesis that the mean speed is the same at each strike count.
s_data_table = ball_data %>%
  filter(TaggedPitchType != "ChangeUp") %>%
  group_by(TaggedPitchType, `Speed Range`=cut(RelSpeed, breaks= seq(55, 95, by = 2.5))) %>%
  summarise(`Total Pitch` = n(),
            `Called Strike` = sum(PitchCall == "StrikeCalled"),
            `Strike Swinging` = sum(PitchCall == "StrikeSwinging")) %>%
  mutate(`S %` = (`Strike Swinging` + `Called Strike`)/`Total Pitch`)
kable(s_data_table, align = "c")
| TaggedPitchType | Speed Range | Total Pitch | Called Strike | Strike Swinging | S % |
|---|---|---|---|---|---|
| Curveball | (70,72.5] | 1 | 0 | 0 | 0.0000000 |
| Curveball | (72.5,75] | 5 | 1 | 0 | 0.2000000 |
| Curveball | (75,77.5] | 31 | 9 | 5 | 0.4516129 |
| Curveball | (77.5,80] | 18 | 7 | 2 | 0.5000000 |
| Fastball | (82.5,85] | 1 | 0 | 0 | 0.0000000 |
| Fastball | (85,87.5] | 80 | 17 | 8 | 0.3125000 |
| Fastball | (87.5,90] | 169 | 14 | 18 | 0.1893491 |
| Fastball | (90,92.5] | 29 | 4 | 4 | 0.2758621 |
| Slider | (77.5,80] | 22 | 6 | 7 | 0.5909091 |
| Slider | (80,82.5] | 71 | 24 | 8 | 0.4507042 |
| Slider | (82.5,85] | 57 | 11 | 5 | 0.2807018 |
| Slider | (85,87.5] | 7 | 2 | 1 | 0.4285714 |
Once we found that fastball and slider speed increased with strike count while curveball speed stayed steady, we calculated each pitch type’s performance in different speed ranges. We defined better performance as a higher chance of getting a strike, either called or swinging. The curveball performed best in the (77.5,80] mph range (50.0% strikes), and its performance declined at lower speeds. The fastball and slider showed the opposite pattern: they tended to perform better at lower speeds. The fastball performed best in the (85,87.5] mph range (31.3%) and the slider in the (77.5,80] mph range (59.1%). Several ranges contain only a handful of pitches, so their percentages are rough estimates.
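The binning and the strike percentage can be checked by hand. The sketch below shows how `cut` assigns speeds to right-closed ranges and recomputes the fastball S % values from the counts in the table above. The three example speeds and the names `bins` and `fastball` are chosen only for illustration.

```r
# Same break points as in the analysis: 55, 57.5, ..., 95 mph
breaks <- seq(55, 95, by = 2.5)

# Ranges are right-closed, so a pitch at exactly 87.5 mph
# belongs to (85,87.5], not (87.5,90]
speeds <- c(85, 87.5, 87.6)
bins <- as.character(cut(speeds, breaks = breaks))
print(bins)

# Fastball counts copied from the table above
fastball <- data.frame(
  range    = c("(85,87.5]", "(87.5,90]", "(90,92.5]"),
  total    = c(80, 169, 29),
  called   = c(17, 14, 4),
  swinging = c(8, 18, 4)
)

# S % = (called strikes + swinging strikes) / total pitches
fastball$strike_rate <- (fastball$called + fastball$swinging) / fastball$total
print(fastball)
```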
# Probability that a normally distributed speed falls between val1 and val2
norm_prob = function(val1, val2, mean, sd) {
  pnorm(val2,mean,sd) - pnorm(val1,mean,sd)
}
# Mean and standard deviation of speed per pitch type and strike count, then
# Reliability = sum over speed ranges of (S %) * P(speed falls in the range).
# Ranges with S % = 0 are left out because they add nothing to the sum.
ball_data_table = ball_data %>%
  filter(TaggedPitchType != "ChangeUp") %>%
  group_by(TaggedPitchType,Strikes) %>%
  summarise(`Mean` = mean(RelSpeed),
            `Standard Deviation`= sd(RelSpeed)) %>%
  mutate(Reliability = case_when(
    TaggedPitchType == "Fastball" ~
      0.3125000*norm_prob(85,87.5,`Mean`,`Standard Deviation`)+
      0.1893491*norm_prob(87.5,90,`Mean`,`Standard Deviation`)+
      0.2758621*norm_prob(90,92.5,`Mean`,`Standard Deviation`),
    TaggedPitchType == "Curveball" ~
      0.2000000*norm_prob(72.5,75,`Mean`,`Standard Deviation`)+
      0.4516129*norm_prob(75,77.5,`Mean`,`Standard Deviation`)+
      0.5000000*norm_prob(77.5,80,`Mean`,`Standard Deviation`),
    # Note: the slider's (77.5,80] range (S % = 0.5909091) is not part of this sum
    TaggedPitchType == "Slider" ~
      0.4507042*norm_prob(80,82.5,`Mean`,`Standard Deviation`)+
      0.2807018*norm_prob(82.5,85,`Mean`,`Standard Deviation`)+
      0.4285714*norm_prob(85,87.5,`Mean`,`Standard Deviation`))
  )
kable(ball_data_table, align = "c")
| TaggedPitchType | Strikes | Mean | Standard Deviation | Reliability |
|---|---|---|---|---|
| Curveball | 0 | 76.86264 | 1.547311 | 0.4281754 |
| Curveball | 1 | 76.83849 | 1.400711 | 0.4370018 |
| Curveball | 2 | 76.59574 | 1.753424 | 0.4056354 |
| Fastball | 0 | 87.90559 | 1.441914 | 0.2365586 |
| Fastball | 1 | 88.40766 | 1.429681 | 0.2299278 |
| Fastball | 2 | 88.80115 | 1.239662 | 0.2211411 |
| Slider | 0 | 81.77771 | 1.589298 | 0.3392368 |
| Slider | 1 | 81.84915 | 1.855177 | 0.3232710 |
| Slider | 2 | 82.56458 | 1.596408 | 0.3475555 |
Now that we knew which speed ranges worked best for each pitch type, the next step was to calculate the mean and standard deviation of pitch speed at each strike count. Assuming pitch speed is normally distributed, these give the probability that a pitch falls in each speed range at a given strike count. We then calculated reliability by multiplying each speed range’s strike percentage by the probability of pitching in that range and summing the products.
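As a worked check, the sketch below recomputes the reliability of the fastball at zero strikes from the rounded mean, standard deviation, and strike percentages shown in the tables above. The variable names `mu`, `sigma`, and `bins` are introduced here for illustration.

```r
# Probability that a normal variable falls between lower and upper
norm_prob <- function(lower, upper, mean, sd) {
  pnorm(upper, mean, sd) - pnorm(lower, mean, sd)
}

# Fastball at 0 strikes: mean and standard deviation from the table above
mu    <- 87.90559
sigma <- 1.441914

# Fastball speed ranges and their strike percentages (S %)
bins <- data.frame(
  lower       = c(85, 87.5, 90),
  upper       = c(87.5, 90, 92.5),
  strike_rate = c(0.3125000, 0.1893491, 0.2758621)
)

# Probability of throwing in each range under the normal assumption
bins$prob <- norm_prob(bins$lower, bins$upper, mu, sigma)

# Reliability: strike percentage weighted by the range probabilities
reliability <- sum(bins$strike_rate * bins$prob)
print(bins)
print(reliability)  # should be close to the 0.2365586 reported in the table
```

Because the inputs are rounded table values, small differences in the last digits are expected.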
In this project, we calculated pitch performance in different speed ranges and the probability of pitching in each speed range at different strike counts. Combining these two pieces of information gives each pitch’s reliability at different strike counts. Although this is limited information and many other factors must be considered when choosing a pitch, the idea of reliability could be developed further and made part of a data-driven pitch decision.