Overview

We designed a non-linear regression model and used kernel smoothing to construct more comprehensive contact probability and contact productivity plots.

Libraries

library(dplyr)
library(ggplot2)
library(gridExtra)

Importing Data

# Load the team's Trackman export
baseball_data = read.csv("NCAA_TM_Final.csv")

# Keep 2019 pitches to IOW_HAW batters near the strike zone, excluding errors
play_baseball_data_2019 = baseball_data %>%
  filter(Season == 2019,
         BatterTeam == "IOW_HAW",
         PlayResult != "Error",
         PlateLocHeight <= 3.7,
         PlateLocHeight >= 1.3,
         PlateLocSide <= 1,
         PlateLocSide >= -1) %>%
  select(Batter, PitchCall, PlayResult, PlateLocHeight, PlateLocSide)

play_baseball_data_2019[12:14,-1]

We loaded the data from the NCAA_TM_Final.csv file, which our team uses to store all Trackman baseball data. Then, we filtered the pitches and selected the batter name, pitch call, pitch location, and play result columns. To keep the players anonymous, we did not include their names in the sample table above. Also, “PlayResult” reports “Undefined” when the pitch is not put in play.

Contact Probability

# Flag pitches that were put in play (1) versus everything else (0)
in_play_2019 = play_baseball_data_2019 %>%
  mutate(contact = case_when(
    PlayResult == "Out" | PlayResult == "Single" | PlayResult == "Double" |
    PlayResult == "Triple" | PlayResult == "HomeRun" | PlayResult == "Sacrifice" ~ 1,
    TRUE ~ 0))

Player_1[1:3,]

We first wanted to estimate, for each strike-zone location, the probability that a batter makes contact and puts the ball in play. We coded “PlayResult” as a binary variable: “Out,” “Single,” “Double,” “Triple,” “HomeRun,” and “Sacrifice” were assigned 1, meaning that contact resulted in a ball in play, and all other values were assigned 0. The table above is a sample of the data for one of the batters.
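As a rough sanity check on the binary indicator before any modeling, one could tabulate the raw contact rate in each cell of the 3 x 3 strike-zone grid drawn in the plots. The sketch below uses a small made-up data frame (the `pitches` values, the cell labels, and the `raw_rate` name are all hypothetical and not taken from the Trackman file) and base R only.

```r
# Toy data: hypothetical pitches, NOT taken from the Trackman file.
pitches <- data.frame(
  PlateLocSide   = c(-0.6, -0.5, 0.1, 0.2, 0.7, 0.6, 0.0, -0.7),
  PlateLocHeight = c( 1.6,  1.7, 2.5, 2.4, 3.3, 3.2, 2.6,  3.4),
  PlayResult     = c("Undefined", "Out", "Single", "Undefined",
                     "Undefined", "Undefined", "HomeRun", "Undefined")
)

# Same binary coding as in the text: 1 if the ball was put in play, else 0.
in_play_results <- c("Out", "Single", "Double", "Triple", "HomeRun", "Sacrifice")
pitches$contact <- as.integer(pitches$PlayResult %in% in_play_results)

# Bin each pitch into the 3 x 3 grid spanning the strike zone used in the plots
# (x from -0.833 to 0.833, y from 1.5 to 3.5). Labels follow plot orientation.
pitches$col <- cut(pitches$PlateLocSide,
                   breaks = seq(-0.833, 0.833, length.out = 4),
                   labels = c("left", "middle", "right"),
                   include.lowest = TRUE)
pitches$row <- cut(pitches$PlateLocHeight,
                   breaks = seq(1.5, 3.5, length.out = 4),
                   labels = c("low", "mid", "high"),
                   include.lowest = TRUE)

# Raw (unsmoothed) contact rate for each occupied cell.
raw_rate <- aggregate(contact ~ row + col, data = pitches, FUN = mean)
print(raw_rate)
```

With real data, cells containing only a handful of pitches give very noisy rates, which is one motivation for fitting a smooth model instead.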

Logistic Regression & Generalized Additive Model

First, we modeled the relationship between pitch location and contact resulting in a ball in play using a Generalized Additive Model (GAM). One benefit of a GAM is that it relates the response variable to the predictors through smooth functions, so it can capture non-linear relationships without assuming a parametric form, which gives us a more comprehensive understanding of the relationship and better predictions. We then predicted the in-play outcome on the link (log-odds) scale and converted the predicted value to a probability using the logistic function.

Logistic Regression
\[p:\mathbb{R} \rightarrow (0,1)\]
\[p(t) = \frac{1}{1+e^{-t}}\]
Generalized Additive Model
\[g(E(y)) = c + s_{1}(x_{1}) + s_{2}(x_{2}) + ... + s_{n}(x_{n})\]
where
\(g(E(y))\) denotes the link function applied to the expected value of the response variable.
\(s_{1}(x_{1}), s_{2}(x_{2}), ..., s_{n}(x_{n})\) denote the smooth functions of the predictor variables, and \(c\) is the intercept.

In our case
\[g(E(y)) = c + s_{1}(x_{1},x_{2})\]
\(g\) is the logit link function, since the response \(y\) (contact) is binary.
\(s_{1}(x_{1},x_{2})\) denotes a two-dimensional smooth function of the horizontal and vertical pitch location (PlateLocSide and PlateLocHeight), which includes their interaction.
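The helper that produces the predicted probabilities is not shown in this write-up, so the following is only a minimal sketch of how such a fit could look, assuming the `mgcv` package. The function name `gam_contact_sketch`, the grid size, the grid limits, and the simulated `toy` data are all assumptions made for illustration; only the column names `PlateLocSide`, `PlateLocHeight`, `contact`, and `contact_prob` come from the code in this document.

```r
library(mgcv)  # provides gam() and the s() smooth constructor

# Hypothetical stand-in for the helper used in this report; the original is not shown.
gam_contact_sketch <- function(df, n_grid = 25) {
  # family = binomial uses the logit link by default, matching g(E(y)) above.
  fit <- gam(contact ~ s(PlateLocSide, PlateLocHeight),
             family = binomial, data = df)

  # Regular grid of locations covering the filtered region (assumed limits).
  grid <- expand.grid(
    PlateLocSide   = seq(-1, 1, length.out = n_grid),
    PlateLocHeight = seq(1.3, 3.7, length.out = n_grid)
  )

  # Predict on the link (log-odds) scale, then apply the logistic function
  # p(t) = 1 / (1 + exp(-t)) to map the values into (0, 1).
  eta <- as.numeric(predict(fit, newdata = grid, type = "link"))
  grid$contact_prob <- plogis(eta)
  grid
}

# Simulated pitches (made up), with contact more likely near the zone center.
set.seed(1)
n <- 400
toy <- data.frame(PlateLocSide   = runif(n, -1, 1),
                  PlateLocHeight = runif(n, 1.3, 3.7))
toy$contact <- rbinom(n, 1,
                      plogis(1 - 2 * toy$PlateLocSide^2 -
                               (toy$PlateLocHeight - 2.5)^2))

toy_prob <- gam_contact_sketch(toy)
head(toy_prob)
```

Calling `predict()` with `type = "response"` would return the probabilities directly; the two-step version is written out here only to mirror the conversion described in the text.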

Contact Probability Plot

player_1_prob_contact = gam_lr_contact(Player_1)

ggplot(player_1_prob_contact, aes(PlateLocSide, PlateLocHeight)) +
  geom_tile(data=player_1_prob_contact, aes(x=PlateLocSide, y=PlateLocHeight, fill= contact_prob)) +
  scale_fill_gradientn("Probability",colors = c("royalblue3", "red2")) +
  theme_bw() +
  scale_x_continuous(limits = c(-2.5, 2.5), breaks = NULL) + 
  scale_y_continuous(limits = c(-0.5, 5.5),breaks = NULL) +
  geom_rect(xmin= -0.833, xmax=0.833, ymin=1.5, ymax=3.5,colour = "black", alpha = 0) +
  geom_segment(aes(x = -0.833, y = 1.5 + 2/3, xend = 0.833, yend = 1.5 + 2/3)) +
  geom_segment(aes(x = -0.833, y = 1.5 + 4/3, xend = 0.833, yend = 1.5 + 4/3)) +
  geom_segment(aes(x = -0.833 + 1.667/3, y = 1.5, xend = -0.833 + 1.667/3, yend = 3.5)) +
  geom_segment(aes(x = -0.833 + 2*1.667/3, y = 1.5, xend = -0.833 + 2*1.667/3, yend = 3.5)) +
  labs(title="Contact Probability", x =element_blank(), y = element_blank()) +
  theme(plot.title = element_text(hjust = 0.5))

After calculating the probabilities using the process in the previous section, we made a contact probability plot. This particular batter had a higher chance of making contact in the center and upper-right of the strike zone, while the probability of putting the ball in play was 0.2 or below in the lower part of the zone.

Contact Productivity using Extrapolated Runs

in_play_baseball_data_2019 = play_baseball_data_2019 %>%
  filter(PitchCall == "InPlay")

in_play_productivity_2019 = in_play_baseball_data_2019 %>%
  mutate(productivity = case_when(
    PlayResult =="Out" ~ -0.09,
    PlayResult =="Single" ~ 0.5,
    PlayResult =="Double" ~ 0.72,
    PlayResult =="Triple" ~ 1.04,
    PlayResult =="HomeRun" ~1.44,
    PlayResult == "Sacrifice" ~ 0.37))

Player_11[1:3,]

For the second part of the research, we wanted to calculate contact productivity per location. First, we kept only the pitches that were put in play. Then, we converted each in-play result into a score using the idea of Extrapolated Runs (XR). XR is a numerical way to estimate the number of runs a hitter contributes to his team. We believed that XR would let us calculate contact productivity per location, giving us a more in-depth understanding of hitters’ performance. We took the XR weights from https://en.wikipedia.org/wiki/Extrapolated_Runs.
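To make the scoring concrete, here is a small worked example of the conversion using the same weights as the `case_when()` call above. The sequence of results is invented for illustration, and the names `xr_weights`, `results`, and `mean_xr` are hypothetical.

```r
# XR weights exactly as used in the case_when() call above.
xr_weights <- c(Out = -0.09, Single = 0.5, Double = 0.72,
                Triple = 1.04, HomeRun = 1.44, Sacrifice = 0.37)

# Hypothetical in-play results for one batter (not real data).
results <- c("Out", "Single", "Out", "HomeRun", "Double")

# Look up the weight of each result by name.
productivity <- unname(xr_weights[results])

# Average XR per ball in play:
# (-0.09 + 0.5 - 0.09 + 1.44 + 0.72) / 5 = 2.48 / 5 = 0.496
mean_xr <- mean(productivity)
print(mean_xr)
```

A named lookup vector like this is one alternative to `case_when()` that keeps the weights in a single place.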

Kernel Smoothing

ggplot(Player_11, aes(PlateLocSide, PlateLocHeight, z = productivity)) +
  geom_point(aes(x = PlateLocSide, y = PlateLocHeight, color = productivity)) +
  scale_colour_gradientn("XR", colors = c("royalblue3", "red2")) +
  theme_bw() +
  scale_x_continuous(limits = c(-2.5, 2.5), breaks = NULL) +
  scale_y_continuous(limits = c(-0.5, 5.5), breaks = NULL) +
  geom_rect(xmin = -0.833, xmax = 0.833, ymin = 1.5, ymax = 3.5, colour = "black", alpha = 0) +
  geom_segment(aes(x = -0.833, y = 1.5 + 2/3, xend = 0.833, yend = 1.5 + 2/3)) +
  geom_segment(aes(x = -0.833, y = 1.5 + 4/3, xend = 0.833, yend = 1.5 + 4/3)) +
  geom_segment(aes(x = -0.833 + 1.667/3, y = 1.5, xend = -0.833 + 1.667/3, yend = 3.5)) +
  geom_segment(aes(x = -0.833 + 2*1.667/3, y = 1.5, xend = -0.833 + 2*1.667/3, yend = 3.5)) +
  labs(title = "Contact Productivity", x = element_blank(), y = element_blank()) +
  theme(plot.title = element_text(hjust = 0.5))

When we plotted contact productivity as a scatter plot, it was hard to tell where in the strike zone the batter had higher productivity, so we wanted a more comprehensive plot. For this reason, we applied kernel smoothing, which gives us an estimated productivity score across the entire strike zone rather than only at the observed pitch locations.
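The smoothing helper itself is not shown in this write-up. As an illustration of the general idea, the sketch below implements a Nadaraya-Watson estimator with a Gaussian kernel, which is one common form of kernel smoothing and may differ from the method actually used. The function name `kernel_smooth_sketch`, the bandwidth `h = 0.3`, the grid limits, and the `toy` data are all assumptions; the output columns `x_coord`, `y_coord`, and `value` are named to match the plotting code in this document.

```r
# Hypothetical stand-in for the smoothing helper; the original is not shown.
# Nadaraya-Watson estimate: a kernel-weighted average of observed productivity.
kernel_smooth_sketch <- function(df, h = 0.3, n_grid = 25) {
  # Regular grid of locations at which to estimate productivity (assumed limits).
  grid <- expand.grid(
    x_coord = seq(-1, 1, length.out = n_grid),
    y_coord = seq(1.3, 3.7, length.out = n_grid)
  )

  grid$value <- vapply(seq_len(nrow(grid)), function(i) {
    # Squared distance from grid point i to every observed ball in play.
    d2 <- (df$PlateLocSide - grid$x_coord[i])^2 +
          (df$PlateLocHeight - grid$y_coord[i])^2
    w <- exp(-d2 / (2 * h^2))          # Gaussian kernel weights
    sum(w * df$productivity) / sum(w)  # weighted mean of XR scores
  }, numeric(1))

  grid
}

# Made-up balls in play with XR scores (not real data).
toy <- data.frame(
  PlateLocSide   = c(-0.5, 0.0, 0.5, 0.2),
  PlateLocHeight = c( 2.0, 3.0, 2.5, 1.8),
  productivity   = c(-0.09, 1.44, 0.5, -0.09)
)

smoothed <- kernel_smooth_sketch(toy)
range(smoothed$value)
```

Because each grid value is a weighted average, it always lies between the smallest and largest observed score. A larger `h` gives a smoother but flatter surface, and estimates near the edges or in sparsely sampled regions depend heavily on a few nearby points.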

Contact Productivity Plot

player_11_contact_prod = k_smoothing(Player_11)

ggplot(player_11_contact_prod, aes(x_coord, y_coord)) +
  geom_tile(data=player_11_contact_prod, aes(x=x_coord, y=y_coord, fill= value)) +
  scale_fill_gradientn("XR",colors = c("royalblue3", "red2")) +
  theme_bw() +
  scale_x_continuous(limits = c(-2.5, 2.5), breaks = NULL) + 
  scale_y_continuous(limits = c(-0.5, 5.5),breaks = NULL) +
  geom_rect(xmin= -0.833, xmax=0.833, ymin=1.5, ymax=3.5,colour = "black", alpha = 0) +
  geom_segment(aes(x = -0.833, y = 1.5 + 2/3, xend = 0.833, yend = 1.5 + 2/3)) +
  geom_segment(aes(x = -0.833, y = 1.5 + 4/3, xend = 0.833, yend = 1.5 + 4/3)) +
  geom_segment(aes(x = -0.833 + 1.667/3, y = 1.5, xend = -0.833 + 1.667/3, yend = 3.5)) +
  geom_segment(aes(x = -0.833 + 2*1.667/3, y = 1.5, xend = -0.833 + 2*1.667/3, yend = 3.5)) +
  labs(title="Contact Productivity", x =element_blank(), y = element_blank()) +
  theme(plot.title = element_text(hjust = 0.5))

From the contact productivity plot above, we can see that this player had the highest productivity in the upper-middle area of the strike zone. Along with this red hot spot, there are purplish spots in the middle and bottom-right areas of the strike zone, which indicate average productivity, roughly that of hitting a single, and some dark-blue locations around the edge of the strike zone, which show negative productivity. One unique aspect of this batter is that he showed above-average productivity in the upper-left part of the strike zone.

Conclusion

For the first part of the project, we used a Generalized Additive Model (GAM) and logistic regression to find the relationship between strike zone location and contact probability. For the second part of the project, we used Extrapolated Runs (XR) and kernel smoothing to estimate contact productivity at different strike zone locations.

Using these two plots, we can analyze batters’ hitting mechanics in more depth and help them make better swing decisions. From the pitchers’ point of view, we can use these plots to build more solid pitching strategies.

For a future project, we will work on reducing the bias introduced by the smoothing process and on making separate plots for left-handed and right-handed pitchers.

Contact Productivity Plot Using Extrapolated Runs

Kitae Kim