Predicting True Speed Off Bat

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

The Gameplan

After examining the two measurement systems, I would like to take each system’s strengths and build a cumulative system that will offer a “true” average speed for each batter.

Differences in System A & System B

The first task that we want to take on is observing the differences between our two measurement systems. For this, we will separate the two systems into their own data frames before comparing. While doing this, we’re going to omit the unknown values and the two instances of hittype “U” as well.

#load data; drop NA values and "U" observations
TBR <- read.csv("battedBallData.csv") %>%
  filter(!is.na(speed_A), !is.na(speed_B), !is.na(vangle_A), !is.na(vangle_B), !is.na(hittype == "U")) 

#create dataframe only System A
System_A <- TBR %>%
  select(batter, pitcher, hittype, speed_A, vangle_A) 
summary(System_A)

##      batter         pitcher        hittype             speed_A      
##  Min.   :  1.0   Min.   :  1.0   Length:64943       Min.   : 26.46  
##  1st Qu.:170.0   1st Qu.:121.0   Class :character   1st Qu.: 80.47  
##  Median :341.0   Median :283.0   Mode  :character   Median : 90.70  
##  Mean   :364.8   Mean   :290.3                      Mean   : 88.51  
##  3rd Qu.:550.0   3rd Qu.:449.0                      3rd Qu.: 98.36  
##  Max.   :816.0   Max.   :645.0                      Max.   :121.85  
##     vangle_A      
##  Min.   :-87.594  
##  1st Qu.: -5.454  
##  Median : 10.994  
##  Mean   : 10.823  
##  3rd Qu.: 27.280  
##  Max.   : 78.461

#create dataframe only System B
System_B <- TBR %>%
  select(batter, pitcher, hittype, speed_B, vangle_B) 
summary(System_B)

##      batter         pitcher        hittype             speed_B       
##  Min.   :  1.0   Min.   :  1.0   Length:64943       Min.   :  6.967  
##  1st Qu.:170.0   1st Qu.:121.0   Class :character   1st Qu.: 69.092  
##  Median :341.0   Median :283.0   Mode  :character   Median : 83.709  
##  Mean   :364.8   Mean   :290.3                      Mean   : 79.860  
##  3rd Qu.:550.0   3rd Qu.:449.0                      3rd Qu.: 93.100  
##  Max.   :816.0   Max.   :645.0                      Max.   :114.403  
##     vangle_B      
##  Min.   :-66.469  
##  1st Qu.: -3.695  
##  Median : 10.476  
##  Mean   : 12.453  
##  3rd Qu.: 27.470  
##  Max.   : 83.871

The differences between the two systems are quite clear as System A measures speed much higher than its counterpart. The two systems both measure angle differently as well. So what do we do with this knowledge? If we know (or suspect) that System A is more accurate, we will want to rely on this system in the case that both measurements have been used. Eventually we will try to determine the factors that may have an impact on System A’s measurement of speed.

After the initial cleaning of data and exploratory analysis, I want to visualize the different categories of hits and double check the accuracy of both systems.

#compare the two systems for accuracy
ggplot(System_A, aes(speed_A, vangle_A, color = hittype)) +
  geom_point() +
  labs(x = "System A Off Bat Speed", 
       y = "System A Angle")

ggplot(System_B, aes(speed_B, vangle_B, color = hittype)) +
  geom_point() +
  labs(x = "System B Off Bat Speed", 
       y = "System B Angle")

We can, again, see some discrepancies in the two measurement systems. The first thing that I notice is the high amount of ground ball observations in System B with an angle of well over 0. This inaccuracy is immediately noticeable when compared to System A, which seems to have had a more accurate human observer when diagnosing ground balls.

Next, I want to separate and identify the mean speeds and angles for each pitcher and hitter, specifically for each of the different hit types. This will show us the mean speed and angle for each of our hitters and pitchers so that we know the expected values for each hit type within each measurement system.

#create expected values per pitcher using averages for speeds and angles
ex_pitcher_hit_types <- TBR %>%
  filter(!is.na(vangle_A), !is.na(vangle_B), !is.na(speed_A), !is.na(speed_B)) %>%
  group_by(pitcher, hittype) %>%
  summarize(ex_pitch_speed_A = mean(speed_A), 
            ex_pitch_speed_B = mean(speed_B), 
            ex_pitch_vangle_A = mean(vangle_A), 
            ex_pitch_vangle_B = mean(vangle_B))

## `summarise()` has grouped output by 'pitcher'. You can override using the
## `.groups` argument.

#create expected values per batter using averages for speeds and angles
ex_batter_hit_types <- TBR %>%
  filter(!is.na(vangle_A), !is.na(vangle_B), !is.na(speed_A), !is.na(speed_B)) %>%
  group_by(batter, hittype) %>%
  summarize(ex_bat_speed_A = mean(speed_A), 
            ex_bat_speed_B = mean(speed_B), 
            ex_bat_vangle_A = mean(vangle_A), 
            ex_bat_vangle_B = mean(vangle_B))

## `summarise()` has grouped output by 'batter'. You can override using the
## `.groups` argument.

#merge both expected batter/pitcher dataframes into main dataframe
TBR <- merge(TBR, ex_pitcher_hit_types)
TBR <- merge(TBR, ex_batter_hit_types)

str(TBR)

## 'data.frame':    64943 obs. of  15 variables:
##  $ hittype          : chr  "fly_ball" "fly_ball" "fly_ball" "fly_ball" ...
##  $ batter           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pitcher          : int  31 453 28 154 35 403 39 168 59 635 ...
##  $ speed_A          : num  93.5 76 80.9 85.1 83.1 ...
##  $ vangle_A         : num  22.8 29.2 30.5 29.3 57.6 ...
##  $ speed_B          : num  88.6 72.9 78.6 80.9 80.5 ...
##  $ vangle_B         : num  22.6 28.8 29.6 29.7 58.1 ...
##  $ ex_pitch_speed_A : num  92.3 90.6 89.8 87.7 89.2 ...
##  $ ex_pitch_speed_B : num  88 87.5 86.2 84 85.6 ...
##  $ ex_pitch_vangle_A: num  30.8 34.8 36.1 36.9 34.5 ...
##  $ ex_pitch_vangle_B: num  31 35 36.4 37.1 34.7 ...
##  $ ex_bat_speed_A   : num  86.6 86.6 86.6 86.6 86.6 ...
##  $ ex_bat_speed_B   : num  82.9 82.9 82.9 82.9 82.9 ...
##  $ ex_bat_vangle_A  : num  34.5 34.5 34.5 34.5 34.5 ...
##  $ ex_bat_vangle_B  : num  34.8 34.8 34.8 34.8 34.8 ...

A simple ggplot call will show us that there is a linear relationship between observed values and expected values.

We have combined the two measurement systems and taken their means. Now that we have all of our expected values and our observed values in one data frame (TBR), we can begin to build a predictive model. We have lots of options to look at!

Building a Model

First, we want to create a model estimating off bat speed from System A based on all other factors that we have created and explored so far. We will then sort out the highest contributing factors to off bat speed in our more trustworthy system (A).

#examining regression using all available variables (exploratory)
lm1 <- lm(speed_A ~ ., TBR)
summary(lm1)

## 
## Call:
## lm(formula = speed_A ~ ., data = TBR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -72.937  -2.866  -0.127   2.468  64.338 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -2.494e+01  1.044e+00 -23.886  < 2e-16 ***
## hittypeground_ball  4.662e+00  5.762e-01   8.090 6.06e-16 ***
## hittypeline_drive   4.356e+00  3.086e-01  14.116  < 2e-16 ***
## hittypepopup       -2.267e+00  3.879e-01  -5.844 5.13e-09 ***
## hittypeU            2.850e+00  5.669e+00   0.503   0.6151    
## batter              4.380e-06  9.724e-05   0.045   0.9641    
## pitcher            -2.605e-05  1.195e-04  -0.218   0.8274    
## vangle_A           -3.667e-01  4.347e-03 -84.349  < 2e-16 ***
## speed_B             8.244e-01  1.886e-03 437.066  < 2e-16 ***
## vangle_B            4.410e-02  4.843e-03   9.106  < 2e-16 ***
## ex_pitch_speed_A    9.093e-01  1.910e-02  47.617  < 2e-16 ***
## ex_pitch_speed_B   -7.576e-01  1.872e-02 -40.480  < 2e-16 ***
## ex_pitch_vangle_A   3.473e-01  2.569e-02  13.514  < 2e-16 ***
## ex_pitch_vangle_B  -6.668e-02  2.599e-02  -2.566   0.0103 *  
## ex_bat_speed_A      9.570e-01  1.410e-02  67.853  < 2e-16 ***
## ex_bat_speed_B     -7.873e-01  1.440e-02 -54.684  < 2e-16 ***
## ex_bat_vangle_A     3.499e-01  2.417e-02  14.479  < 2e-16 ***
## ex_bat_vangle_B    -4.892e-02  2.482e-02  -1.971   0.0487 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.666 on 64925 degrees of freedom
## Multiple R-squared:  0.8139, Adjusted R-squared:  0.8139 
## F-statistic: 1.671e+04 on 17 and 64925 DF,  p-value: < 2.2e-16

We find that we have several variables that are not significant in the multiple regression model, so we can remove those now. Our new model looks like this:

#observed estimates and p-values to build
lm1 <- lm(speed_A ~ speed_B + vangle_A + ex_pitch_speed_A + ex_pitch_speed_B + 
            ex_bat_speed_A + ex_bat_speed_B, data = TBR)
summary(lm1)

## 
## Call:
## lm(formula = speed_A ~ speed_B + vangle_A + ex_pitch_speed_A + 
##     ex_pitch_speed_B + ex_bat_speed_A + ex_bat_speed_B, data = TBR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.327  -3.519   0.107   2.967  59.121 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      18.821251   0.610149  30.847  < 2e-16 ***
## speed_B           0.757504   0.001888 401.252  < 2e-16 ***
## vangle_A         -0.168945   0.002054 -82.258  < 2e-16 ***
## ex_pitch_speed_A -0.113113   0.013998  -8.081 6.54e-16 ***
## ex_pitch_speed_B  0.068990   0.011141   6.193 5.96e-10 ***
## ex_bat_speed_A    0.397321   0.011816  33.626  < 2e-16 ***
## ex_bat_speed_B   -0.245939   0.010355 -23.750  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.183 on 64936 degrees of freedom
## Multiple R-squared:  0.7783, Adjusted R-squared:  0.7783 
## F-statistic: 3.8e+04 on 6 and 64936 DF,  p-value: < 2.2e-16

#plot residuals to look for even distribution around mean
ggplot(lm1, aes(lm1$residuals)) +
  geom_histogram(binwidth = 1)

While our Multiple R-squared dropped from .81 to .78, I don’t believe this is a significant enough decrease in correlation coefficient to warrant resetting our model. I’ve chosen to use this model to avoid over fitting and to focus on the variables which provide the most predictive outcomes.

Note: Using the step function (both “forward” and “backward”) helped me find the best combination of regression variables.

Creating a Training Set

The first thing I want to do in this instance when separating our training set is to make sure that the at bats are randomized in order to ensure an even distribution without bias.

#randomize variables before subsetting into training and test
set.seed(2022)
x <- runif(nrow(TBR))
TBR_rand <- TBR[order(x), ]

Next, we’ll check the regression on our subset and separate the data into its training set data frame, roughly 2/3 of the original data. Once we have input the final 1/3 (test set) of the data into our model, we’ll examine our confidence intervals and bind our output with our original randomized data set.

#run model on training set and examine statistics
train_set <- lm(speed_A ~ speed_B + vangle_A + ex_pitch_speed_A + ex_pitch_speed_B + 
     ex_bat_speed_A + ex_bat_speed_B, data = TBR_rand, subset = 1:42862)
summary(train_set)

## 
## Call:
## lm(formula = speed_A ~ speed_B + vangle_A + ex_pitch_speed_A + 
##     ex_pitch_speed_B + ex_bat_speed_A + ex_bat_speed_B, data = TBR_rand, 
##     subset = 1:42862)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.176  -3.510   0.111   2.940  58.919 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      18.034281   0.751917  23.984  < 2e-16 ***
## speed_B           0.755388   0.002310 326.978  < 2e-16 ***
## vangle_A         -0.166640   0.002524 -66.035  < 2e-16 ***
## ex_pitch_speed_A -0.092572   0.017306  -5.349 8.88e-08 ***
## ex_pitch_speed_B  0.052843   0.013774   3.836 0.000125 ***
## ex_bat_speed_A    0.393175   0.014581  26.964  < 2e-16 ***
## ex_bat_speed_B   -0.236381   0.012792 -18.479  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.173 on 42855 degrees of freedom
## Multiple R-squared:  0.7795, Adjusted R-squared:  0.7795 
## F-statistic: 2.526e+04 on 6 and 42855 DF,  p-value: < 2.2e-16

#create test set and input into model; predicted values (conf) binded with test set as new variables
test_set <- data.frame(c(TBR_rand[42862:64943, ]))
conf <- predict(lm1, newdata = test_set, interval = "confidence")
true_speed <- cbind(test_set, conf)

Findings and Predictions of “True” Speed

Now, let’s look at what our model predicted as the “true” average speed off of the bat for our batters.

#batters and their average true speed
final_exp_speed <- true_speed %>%
  group_by(batter) %>%
  summarize(true_speed = mean(fit))

It looks like we have a well-distributed set of expected speed values produced by our model. Let’s visualize some of these and compare them with the original System A measurement of speed.

#true speed and system A speed measurements 
ggplot(true_speed, aes(speed_A, fit)) +
  geom_point(alpha = .3) +
  geom_smooth() +
  labs(x = "Off Bat Speed from Measurement A", 
       y = "Predicted 'True' Average Speed", 
       title = "Predicted Speed and Observed Speed") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5))

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

#true speed and system A angle measurements
ggplot(true_speed, aes(fit, ex_bat_vangle_A, color = hittype)) +
  geom_line(alpha = .4) +
  geom_smooth() +
  labs(x = "Predicted 'True' Off Bat Speed", 
       y = "Expected Angle using Measurement A", 
       title = "Predicted 'True' Speed") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5))

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

#final data frame of batters and predicted true speeds
final_exp_speed

## # A tibble: 734 × 2
##    batter true_speed
##     <int>      <dbl>
##  1      1       89.5
##  2      2       78.4
##  3      3       91.4
##  4      4       89.4
##  5      5       86.4
##  6      6       84.1
##  7      7       85.2
##  8      8       88.6
##  9     10       88.7
## 10     11       88.8
## # … with 724 more rows

Conclusion

After running our model we now have the predicted values for each hitter. When reflecting on the process and the time spent working with this data, I would have liked to use bootstrapping with replacement data to gather results and compare with this simple linear regression model. Another area that I would have liked to explore is the measurement system’s approaches to determining angles. I noticed some peculiar values of angle measurement within each bin of hit type and ultimately, knowing how the systems measure these angles may have given us more insight into the process of determining off bat speed.

write.csv(final_exp_speed, file = "Final_Exp_Speed.csv", row.names = FALSE)