Set Up

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.2.3

## Warning: package 'ggplot2' was built under R version 4.2.3

## Warning: package 'tibble' was built under R version 4.2.3

## Warning: package 'tidyr' was built under R version 4.2.3

## Warning: package 'readr' was built under R version 4.2.3

## Warning: package 'purrr' was built under R version 4.2.3

## Warning: package 'dplyr' was built under R version 4.2.3

## Warning: package 'stringr' was built under R version 4.2.3

## Warning: package 'forcats' was built under R version 4.2.3

## Warning: package 'lubridate' was built under R version 4.2.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors

library(dplyr)
library(ggplot2)
library(readxl)

## Warning: package 'readxl' was built under R version 4.2.3

savant_Hitter_data<-read_excel("C:/Users/grifw/Documents/School/CSC360/CSC360 Final Project/savant_Hitter data.xlsx")

mlbhit<-select(savant_Hitter_data, player_name, total_pitches, ba, babip,launch_speed,whiffs,swings)%>%
  filter(total_pitches>999)

mlbhit<-mutate(mlbhit,hitter_luck=ba-babip)
mlbhit<-mutate(mlbhit,whiff_rate=whiffs/swings)

colnames(mlbhit)[5]  <- "avg_exit_velo"




mlbhit

## # A tibble: 312 × 9
##    player_name total_pitches    ba babip avg_exit_velo whiffs swings hitter_luck
##    <chr>               <dbl> <dbl> <dbl>         <dbl>  <dbl>  <dbl>       <dbl>
##  1 Judge, Aar…          2906 0.311 0.34           95.9    370   1240    -0.0290 
##  2 Hoskins, R…          2897 0.246 0.292          90.1    312   1213    -0.046  
##  3 Olson, Matt          2893 0.24  0.274          92.8    414   1433    -0.0340 
##  4 Schwarber,…          2878 0.218 0.24           92.9    339   1168    -0.022  
##  5 Lindor, Fr…          2829 0.27  0.301          89.2    310   1352    -0.0310 
##  6 Profar, Ju…          2790 0.243 0.272          87      228   1217    -0.0290 
##  7 Nimmo, Bra…          2781 0.274 0.317          88.5    238   1217    -0.043  
##  8 Semien, Ma…          2776 0.248 0.263          87.3    270   1310    -0.0150 
##  9 Soto, Juan           2765 0.242 0.249          90.9    188    981    -0.00700
## 10 Freeman, F…          2755 0.325 0.359          91.3    264   1357    -0.0340 
## # ℹ 302 more rows
## # ℹ 1 more variable: whiff_rate <dbl>

I filtered the data to make the minimum number of pitches a batter saw to be 1000, to eliminate data points that are not representative of a players true performance. I based this number off of the rough average number of pitches per plate appearance in the MLB last season of 4, and having at least 250 plate appearances, which now includes most starting players on teams.

Hitter luck is a new column I made that shows the difference in batting average minus batting average on balls in play(BABIP). BABIP is how often a player gets a hit when they hit the ball. I have chosen this to be called hitter luck, as BABIP can be a lucky stat for hitters, and the difference between their normal batting average and BABIP can show this amount of luck they had over the given time period. The greater the number, the greater luck that player had.

Whiff rate is another column I added that is whiffs, or swings and misses, divided by total swings. This gives us and idea of how often a player swings and misses as a percentage of their total swings, and so how often they’re making contact and putting the ball in play. The lower the number, the lower the percentage of swing and misses, or whiffs the player had.

Structure

str(mlbhit)

## tibble [312 × 9] (S3: tbl_df/tbl/data.frame)
##  $ player_name  : chr [1:312] "Judge, Aaron" "Hoskins, Rhys" "Olson, Matt" "Schwarber, Kyle" ...
##  $ total_pitches: num [1:312] 2906 2897 2893 2878 2829 ...
##  $ ba           : num [1:312] 0.311 0.246 0.24 0.218 0.27 0.243 0.274 0.248 0.242 0.325 ...
##  $ babip        : num [1:312] 0.34 0.292 0.274 0.24 0.301 0.272 0.317 0.263 0.249 0.359 ...
##  $ avg_exit_velo: num [1:312] 95.9 90.1 92.8 92.9 89.2 87 88.5 87.3 90.9 91.3 ...
##  $ whiffs       : num [1:312] 370 312 414 339 310 228 238 270 188 264 ...
##  $ swings       : num [1:312] 1240 1213 1433 1168 1352 ...
##  $ hitter_luck  : num [1:312] -0.029 -0.046 -0.034 -0.022 -0.031 ...
##  $ whiff_rate   : num [1:312] 0.298 0.257 0.289 0.29 0.229 ...

summary(mlbhit)

##  player_name        total_pitches        ba             babip       
##  Length:312         Min.   :1002   Min.   :0.1600   Min.   :0.1850  
##  Class :character   1st Qu.:1396   1st Qu.:0.2240   1st Qu.:0.2700  
##  Mode  :character   Median :1817   Median :0.2450   Median :0.2920  
##                     Mean   :1823   Mean   :0.2463   Mean   :0.2928  
##                     3rd Qu.:2221   3rd Qu.:0.2680   3rd Qu.:0.3182  
##                     Max.   :2906   Max.   :0.3260   Max.   :0.3740  
##  avg_exit_velo       whiffs          swings        hitter_luck     
##  Min.   :79.40   Min.   : 53.0   Min.   : 417.0   Min.   :-0.1290  
##  1st Qu.:87.00   1st Qu.:159.8   1st Qu.: 678.2   1st Qu.:-0.0610  
##  Median :88.45   Median :209.5   Median : 865.5   Median :-0.0445  
##  Mean   :88.40   Mean   :217.2   Mean   : 871.6   Mean   :-0.0465  
##  3rd Qu.:90.00   3rd Qu.:262.5   3rd Qu.:1044.2   3rd Qu.:-0.0320  
##  Max.   :95.90   Max.   :455.0   Max.   :1433.0   Max.   : 0.0180  
##    whiff_rate    
##  Min.   :0.0706  
##  1st Qu.:0.2109  
##  Median :0.2495  
##  Mean   :0.2510  
##  3rd Qu.:0.2911  
##  Max.   :0.4059

Batting Average and BABIP

ggplot(mlbhit,aes(ba, babip))+
  geom_line()+
  geom_smooth(method=lm)+
  ggtitle("Batting Average vs. Batting Average on Balls in Play")+
  labs(x="Batting Average", y="Batting Average on Balls in Play")

## `geom_smooth()` using formula = 'y ~ x'

As Batting Average increases, so does BABIP. This shows that they are related, but how are they different?

Batting Average and Whiffs

ggplot(mlbhit,aes(whiff_rate,ba,color=avg_exit_velo))+
  geom_point()+
  geom_smooth(method=lm)+
  ggtitle("Whiff Rate vs. Batting Average")+
  labs(y="Batting Average", x="Whiff Rate")

## `geom_smooth()` using formula = 'y ~ x'

## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

We see here that as players whiff more, their batting average decreases. The average exit velocity for each player is also shown in the color, with more of the lighter blue higher on the chart showing a higher batting average.

Whiff Rate and Exit Velocity

ggplot(mlbhit,aes(whiff_rate,avg_exit_velo))+
  geom_point()+
  geom_smooth(method=lm)+
  ggtitle("Average Exit Velo vs. Whiff Rate")+
  labs(y="Average Exit Velo", x="Whiff Rate")

## `geom_smooth()` using formula = 'y ~ x'

There looks to be a slight connection between whiff rate and exit velocity. This would likely be explained through players swinging hard have less control of the bat, and so swing and miss more often.

Hitter Luck and Whiff Rate

ggplot(mlbhit,aes(whiff_rate,hitter_luck))+
  geom_point()+
  geom_smooth(method=lm)+
  ggtitle("Hitter Luck vs. Whiff Rate")+
  labs(y="Hitter Luck", x="Whiff Rate")

## `geom_smooth()` using formula = 'y ~ x'

Here we see a pretty clear connection of hitter luck and whiff rate.As players whiff more and make less contact, their level of luck decreases.

Exit Velocity and Hitter Luck

ggplot(mlbhit,aes(avg_exit_velo,hitter_luck))+
  geom_point()+
  geom_smooth(method=lm)+
  ggtitle("Average Exit Velo vs. Hitter Luck")+
  labs(x="Average Exit Velo", y="Hitter Luck")

## `geom_smooth()` using formula = 'y ~ x'

I expected a greater connection between average exit velocity and hitter luck, but It appears as if they are not effecting each other much

summary(lm(mlbhit$hitter_luck~mlbhit$avg_exit_velo))

## 
## Call:
## lm(formula = mlbhit$hitter_luck ~ mlbhit$avg_exit_velo)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.084053 -0.015201  0.002284  0.014724  0.063183 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)  
## (Intercept)          -0.0881421  0.0490678  -1.796   0.0734 .
## mlbhit$avg_exit_velo  0.0004710  0.0005548   0.849   0.3965  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02384 on 310 degrees of freedom
## Multiple R-squared:  0.00232,    Adjusted R-squared:  -0.0008986 
## F-statistic: 0.7208 on 1 and 310 DF,  p-value: 0.3965

With and r-squared of .002, there is no real connection between these variables.

Luckiest Hitters Table

luckiest_hitters<-mlbhit[order(-mlbhit$hitter_luck),]

luckiest_hitters

## # A tibble: 312 × 9
##    player_name total_pitches    ba babip avg_exit_velo whiffs swings hitter_luck
##    <chr>               <dbl> <dbl> <dbl>         <dbl>  <dbl>  <dbl>       <dbl>
##  1 Pujols, Al…          1385 0.27  0.252          91.2    142    668     0.0180 
##  2 Paredes, I…          1617 0.205 0.195          87.4    106    644     0.0100 
##  3 Rizzo, Ant…          2112 0.224 0.216          89.3    222    974     0.00800
##  4 Tellez, Ro…          2392 0.219 0.215          91.1    248   1040     0.00400
##  5 Arenado, N…          2330 0.293 0.29           88.7    215   1127     0.00300
##  6 Seager, Co…          2268 0.245 0.242          91.1    328   1236     0.00300
##  7 Ramírez, J…          2742 0.28  0.279          87.7    192   1269     0.00100
##  8 Bregman, A…          2521 0.259 0.26           88.9    151   1015    -0.00100
##  9 Betts, Moo…          2428 0.269 0.272          90.5    177   1072    -0.00300
## 10 Tucker, Ky…          2262 0.257 0.261          89.8    238   1193    -0.00400
## # ℹ 302 more rows
## # ℹ 1 more variable: whiff_rate <dbl>

This is a table of the hitters sorted in descending order by the hitter luck column. We see that Albert Pujols was the highest player in this variable.

Luckiest Hitters Sorted by Whiff Rate

regress<-luckiest_hitters[1:10, ]

regress[order(-regress$whiff_rate),]

## # A tibble: 10 × 9
##    player_name total_pitches    ba babip avg_exit_velo whiffs swings hitter_luck
##    <chr>               <dbl> <dbl> <dbl>         <dbl>  <dbl>  <dbl>       <dbl>
##  1 Seager, Co…          2268 0.245 0.242          91.1    328   1236     0.00300
##  2 Tellez, Ro…          2392 0.219 0.215          91.1    248   1040     0.00400
##  3 Rizzo, Ant…          2112 0.224 0.216          89.3    222    974     0.00800
##  4 Pujols, Al…          1385 0.27  0.252          91.2    142    668     0.0180 
##  5 Tucker, Ky…          2262 0.257 0.261          89.8    238   1193    -0.00400
##  6 Arenado, N…          2330 0.293 0.29           88.7    215   1127     0.00300
##  7 Betts, Moo…          2428 0.269 0.272          90.5    177   1072    -0.00300
##  8 Paredes, I…          1617 0.205 0.195          87.4    106    644     0.0100 
##  9 Ramírez, J…          2742 0.28  0.279          87.7    192   1269     0.00100
## 10 Bregman, A…          2521 0.259 0.26           88.9    151   1015    -0.00100
## # ℹ 1 more variable: whiff_rate <dbl>

I then sorted the top 10 players from the hitters luck sorted table, in descending order by their whiff rate. This shows us which of these players were lucky last year and whiffing at a higher rate compared to other players, and so are more likely to regress in their performance. Corey Seager is shown to be a top candidate for regression, along with rowdy tellez, and Anthony Rizzo. This may not hold true as MLB introduces some new rules next year, but based on the data I have and connections I have seen, this is the 10 players most likely to under perform in 2023.

Unluckiest Hitters Table

unluckiest_hitters<-mlbhit[order(mlbhit$hitter_luck),]

unluckiest_hitters

## # A tibble: 312 × 9
##    player_name total_pitches    ba babip avg_exit_velo whiffs swings hitter_luck
##    <chr>               <dbl> <dbl> <dbl>         <dbl>  <dbl>  <dbl>       <dbl>
##  1 Hiura, Kes…          1146 0.226 0.355          91.7    225    561      -0.129
##  2 Thompson, …          1138 0.256 0.374          92.2    189    500      -0.118
##  3 Alfaro, Jo…          1008 0.246 0.364          89.4    240    631      -0.118
##  4 Taylor, Ch…          1980 0.221 0.336          86.6    395    989      -0.115
##  5 Marsh, Bra…          1807 0.245 0.36           87.6    252    866      -0.115
##  6 Adell, Jo            1175 0.224 0.338          87.1    221    610      -0.114
##  7 Davis, J.D.          1486 0.248 0.36           92.4    285    762      -0.112
##  8 Bart, Joey           1161 0.215 0.326          86.6    208    548      -0.111
##  9 Rivas, Alf…          1208 0.235 0.344          85.8    133    507      -0.109
## 10 Reynolds, …          1080 0.246 0.348          90.2    131    448      -0.102
## # ℹ 302 more rows
## # ℹ 1 more variable: whiff_rate <dbl>

This now is a table of the hitters sorted in ascending order by the hitter luck column. We see that Keston Hiura was the lowest player in this variable.

Unluckiest Hitters Sorted by Whiff Rate

improve<-unluckiest_hitters[1:10, ]

improve[order(improve$whiff_rate),]

## # A tibble: 10 × 9
##    player_name total_pitches    ba babip avg_exit_velo whiffs swings hitter_luck
##    <chr>               <dbl> <dbl> <dbl>         <dbl>  <dbl>  <dbl>       <dbl>
##  1 Rivas, Alf…          1208 0.235 0.344          85.8    133    507      -0.109
##  2 Marsh, Bra…          1807 0.245 0.36           87.6    252    866      -0.115
##  3 Reynolds, …          1080 0.246 0.348          90.2    131    448      -0.102
##  4 Adell, Jo            1175 0.224 0.338          87.1    221    610      -0.114
##  5 Davis, J.D.          1486 0.248 0.36           92.4    285    762      -0.112
##  6 Thompson, …          1138 0.256 0.374          92.2    189    500      -0.118
##  7 Bart, Joey           1161 0.215 0.326          86.6    208    548      -0.111
##  8 Alfaro, Jo…          1008 0.246 0.364          89.4    240    631      -0.118
##  9 Taylor, Ch…          1980 0.221 0.336          86.6    395    989      -0.115
## 10 Hiura, Kes…          1146 0.226 0.355          91.7    225    561      -0.129
## # ℹ 1 more variable: whiff_rate <dbl>

I then sorted the top 10 players from the hitters luck sorted table, in ascending order by their whiff rate. This shows us which of these players were unlucky last year and whiffing at a lower rate compared to other players, and so are more likely to improve in their performance. Alfonso Rivas is the top candidate to have an improved season in 2023, with Brandon Marsh and Matt Reynolds also expected to improve their performance. With rule changes acknowledged, and early season small sample sizes also playing a factor, Brandon Marsh has been off to a hot start, as has J.D. Davis. They could be successful predictions in a few months.

MLB Hitters Project