Griffin Williams
2023-04-25
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
## Warning: package 'readxl' was built under R version 4.2.3
savant_Hitter_data<-read_excel("C:/Users/grifw/Documents/School/CSC360/CSC360 Final Project/savant_Hitter data.xlsx")mlbhit<-select(savant_Hitter_data, player_name, total_pitches, ba, babip,launch_speed,whiffs,swings)%>%
filter(total_pitches>999)
mlbhit<-mutate(mlbhit,hitter_luck=ba-babip)
mlbhit<-mutate(mlbhit,whiff_rate=whiffs/swings)
colnames(mlbhit)[5] <- "avg_exit_velo"
mlbhit## # A tibble: 312 × 9
## player_name total_pitches ba babip avg_exit_velo whiffs swings hitter_luck
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Judge, Aar… 2906 0.311 0.34 95.9 370 1240 -0.0290
## 2 Hoskins, R… 2897 0.246 0.292 90.1 312 1213 -0.046
## 3 Olson, Matt 2893 0.24 0.274 92.8 414 1433 -0.0340
## 4 Schwarber,… 2878 0.218 0.24 92.9 339 1168 -0.022
## 5 Lindor, Fr… 2829 0.27 0.301 89.2 310 1352 -0.0310
## 6 Profar, Ju… 2790 0.243 0.272 87 228 1217 -0.0290
## 7 Nimmo, Bra… 2781 0.274 0.317 88.5 238 1217 -0.043
## 8 Semien, Ma… 2776 0.248 0.263 87.3 270 1310 -0.0150
## 9 Soto, Juan 2765 0.242 0.249 90.9 188 981 -0.00700
## 10 Freeman, F… 2755 0.325 0.359 91.3 264 1357 -0.0340
## # ℹ 302 more rows
## # ℹ 1 more variable: whiff_rate <dbl>
I filtered the data to make the minimum number of pitches a batter saw to be 1000, to eliminate data points that are not representative of a players true performance. I based this number off of the rough average number of pitches per plate appearance in the MLB last season of 4, and having at least 250 plate appearances, which now includes most starting players on teams.
Hitter luck is a new column I made that shows the difference in batting average minus batting average on balls in play(BABIP). BABIP is how often a player gets a hit when they hit the ball. I have chosen this to be called hitter luck, as BABIP can be a lucky stat for hitters, and the difference between their normal batting average and BABIP can show this amount of luck they had over the given time period. The greater the number, the greater luck that player had.
Whiff rate is another column I added that is whiffs, or swings and misses, divided by total swings. This gives us and idea of how often a player swings and misses as a percentage of their total swings, and so how often they’re making contact and putting the ball in play. The lower the number, the lower the percentage of swing and misses, or whiffs the player had.
## tibble [312 × 9] (S3: tbl_df/tbl/data.frame)
## $ player_name : chr [1:312] "Judge, Aaron" "Hoskins, Rhys" "Olson, Matt" "Schwarber, Kyle" ...
## $ total_pitches: num [1:312] 2906 2897 2893 2878 2829 ...
## $ ba : num [1:312] 0.311 0.246 0.24 0.218 0.27 0.243 0.274 0.248 0.242 0.325 ...
## $ babip : num [1:312] 0.34 0.292 0.274 0.24 0.301 0.272 0.317 0.263 0.249 0.359 ...
## $ avg_exit_velo: num [1:312] 95.9 90.1 92.8 92.9 89.2 87 88.5 87.3 90.9 91.3 ...
## $ whiffs : num [1:312] 370 312 414 339 310 228 238 270 188 264 ...
## $ swings : num [1:312] 1240 1213 1433 1168 1352 ...
## $ hitter_luck : num [1:312] -0.029 -0.046 -0.034 -0.022 -0.031 ...
## $ whiff_rate : num [1:312] 0.298 0.257 0.289 0.29 0.229 ...
## player_name total_pitches ba babip
## Length:312 Min. :1002 Min. :0.1600 Min. :0.1850
## Class :character 1st Qu.:1396 1st Qu.:0.2240 1st Qu.:0.2700
## Mode :character Median :1817 Median :0.2450 Median :0.2920
## Mean :1823 Mean :0.2463 Mean :0.2928
## 3rd Qu.:2221 3rd Qu.:0.2680 3rd Qu.:0.3182
## Max. :2906 Max. :0.3260 Max. :0.3740
## avg_exit_velo whiffs swings hitter_luck
## Min. :79.40 Min. : 53.0 Min. : 417.0 Min. :-0.1290
## 1st Qu.:87.00 1st Qu.:159.8 1st Qu.: 678.2 1st Qu.:-0.0610
## Median :88.45 Median :209.5 Median : 865.5 Median :-0.0445
## Mean :88.40 Mean :217.2 Mean : 871.6 Mean :-0.0465
## 3rd Qu.:90.00 3rd Qu.:262.5 3rd Qu.:1044.2 3rd Qu.:-0.0320
## Max. :95.90 Max. :455.0 Max. :1433.0 Max. : 0.0180
## whiff_rate
## Min. :0.0706
## 1st Qu.:0.2109
## Median :0.2495
## Mean :0.2510
## 3rd Qu.:0.2911
## Max. :0.4059
ggplot(mlbhit,aes(ba, babip))+
geom_line()+
geom_smooth(method=lm)+
ggtitle("Batting Average vs. Batting Average on Balls in Play")+
labs(x="Batting Average", y="Batting Average on Balls in Play")## `geom_smooth()` using formula = 'y ~ x'
As Batting Average increases, so does BABIP. This shows that they are related, but how are they different?
ggplot(mlbhit,aes(whiff_rate,ba,color=avg_exit_velo))+
geom_point()+
geom_smooth(method=lm)+
ggtitle("Whiff Rate vs. Batting Average")+
labs(y="Batting Average", x="Whiff Rate")## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
We see here that as players whiff more, their batting average decreases. The average exit velocity for each player is also shown in the color, with more of the lighter blue higher on the chart showing a higher batting average.
ggplot(mlbhit,aes(whiff_rate,avg_exit_velo))+
geom_point()+
geom_smooth(method=lm)+
ggtitle("Average Exit Velo vs. Whiff Rate")+
labs(y="Average Exit Velo", x="Whiff Rate")## `geom_smooth()` using formula = 'y ~ x'
There looks to be a slight connection between whiff rate and exit velocity. This would likely be explained through players swinging hard have less control of the bat, and so swing and miss more often.
ggplot(mlbhit,aes(whiff_rate,hitter_luck))+
geom_point()+
geom_smooth(method=lm)+
ggtitle("Hitter Luck vs. Whiff Rate")+
labs(y="Hitter Luck", x="Whiff Rate")## `geom_smooth()` using formula = 'y ~ x'
Here we see a pretty clear connection of hitter luck and whiff rate.As players whiff more and make less contact, their level of luck decreases.
ggplot(mlbhit,aes(avg_exit_velo,hitter_luck))+
geom_point()+
geom_smooth(method=lm)+
ggtitle("Average Exit Velo vs. Hitter Luck")+
labs(x="Average Exit Velo", y="Hitter Luck")## `geom_smooth()` using formula = 'y ~ x'
I expected a greater connection between average exit velocity and hitter luck, but It appears as if they are not effecting each other much
##
## Call:
## lm(formula = mlbhit$hitter_luck ~ mlbhit$avg_exit_velo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.084053 -0.015201 0.002284 0.014724 0.063183
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0881421 0.0490678 -1.796 0.0734 .
## mlbhit$avg_exit_velo 0.0004710 0.0005548 0.849 0.3965
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02384 on 310 degrees of freedom
## Multiple R-squared: 0.00232, Adjusted R-squared: -0.0008986
## F-statistic: 0.7208 on 1 and 310 DF, p-value: 0.3965
With and r-squared of .002, there is no real connection between these variables.
## # A tibble: 312 × 9
## player_name total_pitches ba babip avg_exit_velo whiffs swings hitter_luck
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Pujols, Al… 1385 0.27 0.252 91.2 142 668 0.0180
## 2 Paredes, I… 1617 0.205 0.195 87.4 106 644 0.0100
## 3 Rizzo, Ant… 2112 0.224 0.216 89.3 222 974 0.00800
## 4 Tellez, Ro… 2392 0.219 0.215 91.1 248 1040 0.00400
## 5 Arenado, N… 2330 0.293 0.29 88.7 215 1127 0.00300
## 6 Seager, Co… 2268 0.245 0.242 91.1 328 1236 0.00300
## 7 Ramírez, J… 2742 0.28 0.279 87.7 192 1269 0.00100
## 8 Bregman, A… 2521 0.259 0.26 88.9 151 1015 -0.00100
## 9 Betts, Moo… 2428 0.269 0.272 90.5 177 1072 -0.00300
## 10 Tucker, Ky… 2262 0.257 0.261 89.8 238 1193 -0.00400
## # ℹ 302 more rows
## # ℹ 1 more variable: whiff_rate <dbl>
This is a table of the hitters sorted in descending order by the hitter luck column. We see that Albert Pujols was the highest player in this variable.
## # A tibble: 10 × 9
## player_name total_pitches ba babip avg_exit_velo whiffs swings hitter_luck
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Seager, Co… 2268 0.245 0.242 91.1 328 1236 0.00300
## 2 Tellez, Ro… 2392 0.219 0.215 91.1 248 1040 0.00400
## 3 Rizzo, Ant… 2112 0.224 0.216 89.3 222 974 0.00800
## 4 Pujols, Al… 1385 0.27 0.252 91.2 142 668 0.0180
## 5 Tucker, Ky… 2262 0.257 0.261 89.8 238 1193 -0.00400
## 6 Arenado, N… 2330 0.293 0.29 88.7 215 1127 0.00300
## 7 Betts, Moo… 2428 0.269 0.272 90.5 177 1072 -0.00300
## 8 Paredes, I… 1617 0.205 0.195 87.4 106 644 0.0100
## 9 Ramírez, J… 2742 0.28 0.279 87.7 192 1269 0.00100
## 10 Bregman, A… 2521 0.259 0.26 88.9 151 1015 -0.00100
## # ℹ 1 more variable: whiff_rate <dbl>
I then sorted the top 10 players from the hitters luck sorted table, in descending order by their whiff rate. This shows us which of these players were lucky last year and whiffing at a higher rate compared to other players, and so are more likely to regress in their performance. Corey Seager is shown to be a top candidate for regression, along with rowdy tellez, and Anthony Rizzo. This may not hold true as MLB introduces some new rules next year, but based on the data I have and connections I have seen, this is the 10 players most likely to under perform in 2023.
## # A tibble: 312 × 9
## player_name total_pitches ba babip avg_exit_velo whiffs swings hitter_luck
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Hiura, Kes… 1146 0.226 0.355 91.7 225 561 -0.129
## 2 Thompson, … 1138 0.256 0.374 92.2 189 500 -0.118
## 3 Alfaro, Jo… 1008 0.246 0.364 89.4 240 631 -0.118
## 4 Taylor, Ch… 1980 0.221 0.336 86.6 395 989 -0.115
## 5 Marsh, Bra… 1807 0.245 0.36 87.6 252 866 -0.115
## 6 Adell, Jo 1175 0.224 0.338 87.1 221 610 -0.114
## 7 Davis, J.D. 1486 0.248 0.36 92.4 285 762 -0.112
## 8 Bart, Joey 1161 0.215 0.326 86.6 208 548 -0.111
## 9 Rivas, Alf… 1208 0.235 0.344 85.8 133 507 -0.109
## 10 Reynolds, … 1080 0.246 0.348 90.2 131 448 -0.102
## # ℹ 302 more rows
## # ℹ 1 more variable: whiff_rate <dbl>
This now is a table of the hitters sorted in ascending order by the hitter luck column. We see that Keston Hiura was the lowest player in this variable.
## # A tibble: 10 × 9
## player_name total_pitches ba babip avg_exit_velo whiffs swings hitter_luck
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Rivas, Alf… 1208 0.235 0.344 85.8 133 507 -0.109
## 2 Marsh, Bra… 1807 0.245 0.36 87.6 252 866 -0.115
## 3 Reynolds, … 1080 0.246 0.348 90.2 131 448 -0.102
## 4 Adell, Jo 1175 0.224 0.338 87.1 221 610 -0.114
## 5 Davis, J.D. 1486 0.248 0.36 92.4 285 762 -0.112
## 6 Thompson, … 1138 0.256 0.374 92.2 189 500 -0.118
## 7 Bart, Joey 1161 0.215 0.326 86.6 208 548 -0.111
## 8 Alfaro, Jo… 1008 0.246 0.364 89.4 240 631 -0.118
## 9 Taylor, Ch… 1980 0.221 0.336 86.6 395 989 -0.115
## 10 Hiura, Kes… 1146 0.226 0.355 91.7 225 561 -0.129
## # ℹ 1 more variable: whiff_rate <dbl>
I then sorted the top 10 players from the hitters luck sorted table, in ascending order by their whiff rate. This shows us which of these players were unlucky last year and whiffing at a lower rate compared to other players, and so are more likely to improve in their performance. Alfonso Rivas is the top candidate to have an improved season in 2023, with Brandon Marsh and Matt Reynolds also expected to improve their performance. With rule changes acknowledged, and early season small sample sizes also playing a factor, Brandon Marsh has been off to a hot start, as has J.D. Davis. They could be successful predictions in a few months.