Bat-Tracking

Author

Brady Baur

Let’s take a dive into the baseball swing. This summer, the website Baseball Savant released a plethora of new data, including a group of metrics called bat-tracking. With this data we can gain a better understanding of how the swing functions and which pieces make up an elite hitter’s swing.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(dslabs)
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom        1.0.7     ✔ rsample      1.2.1
✔ dials        1.3.0     ✔ tune         1.2.1
✔ infer        1.0.7     ✔ workflows    1.1.4
✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
✔ parsnip      1.2.1     ✔ yardstick    1.3.1
✔ recipes      1.1.0     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/
library(openintro)
Loading required package: airports
Loading required package: cherryblossom
Loading required package: usdata

Attaching package: 'openintro'

The following object is masked from 'package:modeldata':

    ames

The following object is masked from 'package:dslabs':

    murders
stats <- read_csv("Data/stats.csv")
Rows: 207 Columns: 37
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): last_name, first_name
dbl (36): player_id, year, player_age, ab, pa, hit, single, double, triple, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the player photos (read_file() would only dump the raw JPEG bytes as text)
knitr::include_graphics("pictures/aaron_judge.jpg")
knitr::include_graphics("pictures/Shohei_Ohtani.jpg")

How do we Judge player performance?

One of the most widely agreed-upon metrics for a hitter’s ability to produce runs is OPS. It stands for on-base plus slugging and takes into account both the player’s ability to get on base and his ability to hit for power. While the argument could be made that swing speed gives the hitter more time to decide whether or not to swing, we will set on-base percentage aside and focus on the batter’s ability to slug.

Slugging

Slugging percentage is the average number of bases a player earns per at-bat. A single counts as 1.000, a home run as 4.000, and a strikeout as 0. Let’s take a glance at the top 5 in slugging during the 2024 regular season.
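To make the definition concrete, here is a minimal sketch of the standard slugging formula (total bases divided by at-bats) in base R. The function name and the stat line below are made up for illustration; they are not taken from the Baseball Savant file.

```r
# Slugging percentage = total bases / at-bats.
# Argument names mirror the columns in stats.csv; the numbers passed in below
# are a hypothetical stat line, not a real player.
slugging <- function(single, double, triple, home_run, ab) {
  total_bases <- single + 2 * double + 3 * triple + 4 * home_run
  total_bases / ab
}

example_slg <- slugging(single = 100, double = 30, triple = 5,
                        home_run = 25, ab = 550)
example_slg  # 275 total bases / 550 at-bats = 0.5
```

Applied to the real columns, the same arithmetic should land close to slg_percent, up to rounding in the published figures.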

# Keep the five highest slugging percentages instead of hard-coding a cutoff
top5slg <- stats |>
  select(Player = `last_name, first_name`, `Slug Percentage` = slg_percent) |>
  slice_max(`Slug Percentage`, n = 5)
top5slg
# A tibble: 5 × 2
  Player          `Slug Percentage`
  <chr>                       <dbl>
1 Judge, Aaron                0.701
2 Ohtani, Shohei              0.646
3 Witt Jr., Bobby             0.588
4 Soto, Juan                  0.569
5 Alvarez, Yordan             0.567

You may recognize some familiar faces. While we are going to dig deep into what may predict a player’s success, it’s also helpful to see that some of the highest-paid and most respected players sit at the top of our leaderboard.

Swing Speed

While slugging is a good metric, it only shows us the result of a player’s hitting. Baseball Savant’s recently released metric, “Swing Speed,” tracks the speed of each player’s bat on any given swing. This helps us look at the player’s capacity for success.

ggplot(stats, mapping = aes(
  x = avg_swing_speed,  # swing speed on the x-axis, matching the axis labels
  y = slg_percent)) +
  geom_point() +
  labs(title = "Slugging vs. Swing Speed (mph)",
       x = "Avg Swing Speed (mph)",
       y = "Slugging %") +
  geom_smooth(method = "lm",
              color = "darkblue")
`geom_smooth()` using formula = 'y ~ x'

As you can see above, as swing speed goes up, so does slugging.

stats
# A tibble: 207 × 37
   `last_name, first_name` player_id  year player_age    ab    pa   hit single
   <chr>                       <dbl> <dbl>      <dbl> <dbl> <dbl> <dbl>  <dbl>
 1 Blackmon, Charlie          453568  2024         37   449   499   115     74
 2 McCutchen, Andrew          457705  2024         37   448   515   104     65
 3 Turner, Justin             457759  2024         39   460   539   119     84
 4 Santana, Carlos            467793  2024         38   521   594   124     75
 5 Pham, Tommy                502054  2024         36   440   478   109     77
 6 Martinez, J.D.             502110  2024         36   434   495   102     61
 7 Goldschmidt, Paul          502671  2024         36   599   654   147     91
 8 Altuve, Jose               514888  2024         34   628   682   185    134
 9 Freeman, Freddie           518692  2024         34   542   638   153     94
10 Stanton, Giancarlo         519317  2024         34   417   459    97     50
# ℹ 197 more rows
# ℹ 29 more variables: double <dbl>, triple <dbl>, home_run <dbl>,
#   strikeout <dbl>, walk <dbl>, k_percent <dbl>, bb_percent <dbl>,
#   batting_avg <dbl>, slg_percent <dbl>, on_base_percent <dbl>,
#   on_base_plus_slg <dbl>, xba <dbl>, xslg <dbl>, woba <dbl>, xwoba <dbl>,
#   xobp <dbl>, xiso <dbl>, avg_swing_speed <dbl>, fast_swing_rate <dbl>,
#   blasts_contact <dbl>, blasts_swing <dbl>, squared_up_contact <dbl>, …
slg.ss <- cor(stats$slg_percent, stats$avg_swing_speed)
slg.ss
[1] 0.544707

We can reduce the role of luck by using Baseball Savant’s xSLG (expected slugging, the xslg column). It gives up some credit for a player’s ability to hit balls into the gaps, but it uses an algorithm to predict the number of bases the batter will reach based on the ball’s exit velocity and launch angle and the player’s sprint speed.

ggplot(stats, mapping = aes(
  x = avg_swing_speed,  # swing speed on the x-axis, matching the axis labels
  y = xslg)) +
  geom_point() +
  labs(title = "Expected Slugging vs. Swing Speed (mph)",
       x = "Avg Swing Speed (mph)",
       y = "Predicted Slugging") +
  geom_smooth(method = "lm",
              color = "darkblue")
`geom_smooth()` using formula = 'y ~ x'

xslg.ss <- cor(stats$xslg, stats$avg_swing_speed)
xslg.ss
[1] 0.6141739

As you can see, we get a somewhat higher correlation (0.61 versus 0.54). This is likely because expected metrics help mitigate the effect of luck on a player’s results. That luck includes obstacles like facing above-average fielders or playing in a larger or smaller ballpark, which can turn home runs into fly outs and vice versa.
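One way to see why a luck-adjusted metric can correlate more strongly with swing speed is a small simulation. The sketch below uses entirely synthetic numbers (not Baseball Savant data) and hypothetical variable names: an "expected" outcome depends on a skill variable, and the "observed" outcome is the expected one plus extra random noise standing in for luck.

```r
# Synthetic illustration only: every value here is simulated.
set.seed(1)
n <- 2000
skill    <- rnorm(n)                          # stand-in for swing speed
expected <- 0.6 * skill + rnorm(n, sd = 0.8)  # stand-in for an expected stat
observed <- expected + rnorm(n, sd = 1)       # same stat plus "luck" noise

cor_expected <- cor(skill, expected)
cor_observed <- cor(skill, observed)

# The extra noise weakens the correlation with the underlying skill
c(expected = cor_expected, observed = cor_observed)
```

This only shows the general attenuation effect of noise; it does not claim to reproduce how xSLG itself is calculated.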

Swing speed’s relationship with other metrics

Let’s take a quick glance at other metrics and see whether they match what we have looked at so far.

exitvelo.ss <- cor(stats$exit_velocity_avg, stats$avg_swing_speed)
avglen.ss <- cor(stats$avg_swing_length, stats$avg_swing_speed)
whiff.ss <- cor(stats$whiff_percent, stats$avg_swing_speed)
k.ss <- cor(stats$k_percent, stats$avg_swing_speed)
k.whiff <- cor(stats$k_percent, stats$whiff_percent)
slen.whiff <- cor(stats$avg_swing_length, stats$whiff_percent)
tibble::tibble(exitvelo.ss, avglen.ss,whiff.ss, k.ss, k.whiff,slen.whiff)
# A tibble: 1 × 6
  exitvelo.ss avglen.ss whiff.ss  k.ss k.whiff slen.whiff
        <dbl>     <dbl>    <dbl> <dbl>   <dbl>      <dbl>
1       0.747     0.542    0.555 0.478   0.892      0.428
avglen.ss * avglen.ss
[1] 0.2942018

Now that we have collected a few more statistics, we can improve our approach to finding the “ideal” swing. While it’s not perfect, exit velocity has a high correlation with swing speed (0.75), which checks out.

We can also see a moderate connection between the length of the swing and its speed (0.54), which makes sense from a physics standpoint: a longer swing gives the bat more distance over which to accelerate. However, the correlation is far from perfect, which suggests players make different trade-offs. Some keep the swing short and quick, while others load more and swing harder.

For example, imagine punching one of those arcade punching bags. If you start a few inches from the bag and punch as fast as possible, not caring about the score, you move directly forward and lose some speed and power. Now try again from the same spot, but swing as hard as possible. It makes sense that you would pull your arm back first to add some speed to your fist before hitting the bag. The same goes for a player deciding how hard to swing. This is supported by the data we have collected, which shows that as swing length increases, swing speed goes up. In fact, swing length explains nearly 30% of the variation in swing speed (r² ≈ 0.29).
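The "nearly 30%" figure comes from squaring the correlation coefficient. As a minimal sketch of why that works, the base R code below checks that the squared correlation equals the R-squared of a one-predictor linear model. The data are synthetic stand-ins with made-up parameters, not the values in stats.csv.

```r
# Synthetic stand-in data (NOT the Baseball Savant values).
set.seed(42)
swing_length <- rnorm(200, mean = 7.3, sd = 0.4)            # hypothetical, feet
swing_speed  <- 50 + 3 * swing_length + rnorm(200, sd = 2)  # hypothetical, mph

r  <- cor(swing_length, swing_speed)
r2 <- summary(lm(swing_speed ~ swing_length))$r.squared

# For simple linear regression, r squared and R-squared are the same number
all.equal(r^2, r2)
```

R-squared describes the share of variance accounted for by a straight-line fit, which is a statement about association rather than proof that swing length causes swing speed.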

Maximizing Swing Length & Speed

# Both: shorter-than-median swing AND faster-than-median swing
tophalfss.sl <- stats |>
  filter(avg_swing_length < median(avg_swing_length), 
         avg_swing_speed > median(avg_swing_speed))
# "Top 50% Swing Length" here means a swing shorter than the median
tophalfsl <- stats |>
  filter(avg_swing_length < median(avg_swing_length))
# Faster-than-median swing
tophalfss <- stats |>
  filter(avg_swing_speed > median(avg_swing_speed))
tibble(
  Groups = c("Top 50% Swing Speed", "Top 50% Swing Length", "Top 50% in Both"),
  "Average xSlugging" = c(mean(tophalfss$xslg, na.rm = TRUE),
           mean(tophalfsl$xslg, na.rm = TRUE),
           mean(tophalfss.sl$xslg, na.rm = TRUE)))
# A tibble: 3 × 2
  Groups               `Average xSlugging`
  <chr>                              <dbl>
1 Top 50% Swing Speed                0.464
2 Top 50% Swing Length               0.408
3 Top 50% in Both                    0.457
linear_reg() |>
  set_engine("lm") |>
  fit(avg_swing_speed ~ xslg, data = stats) |>
  tidy()
# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)     61.3     0.951      64.5 4.45e-138
2 xslg            24.5     2.20       11.1 7.48e- 23

We can take a few things from the previous two tables. First, being in the top 50% in swing length (defined in the code as a shorter-than-median swing) is much less advantageous than being in the top 50% in swing speed: an average xSLG of .408 versus .464. Furthermore, we looked deeper into swing speed with a linear regression model. Because the model predicts swing speed from xSLG, its slope says that each additional .100 of xSLG is associated with about 2.45 mph more average swing speed (24.5 mph per 1.000 of xSLG). This can help us determine where a player stands relative to the rest of the field. For example, let’s take Jorge Polanco. His xSLG is .426. We can use the model to predict his expected swing speed.

61.33 + .426 * 24.53
[1] 71.77978

His actual average swing speed in 2024 was 69.7 mph, about 2 mph below the projected 71.8. This suggests he likely excels in other areas of hitting and may benefit from an off-season of rotational focus. Throwing medicine balls or underload swing training may help squeeze some extra velocity out of his swing, aiding his already solid slugging.
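The same comparison can be wrapped in a small helper so it is easy to repeat for other hitters. This is a sketch in base R that reuses the rounded coefficients from the regression table above; the function and variable names are hypothetical.

```r
# Rounded coefficients copied from the regression output in the text.
intercept <- 61.33  # predicted swing speed (mph) when xSLG is 0
slope     <- 24.53  # change in predicted mph per 1.000 of xSLG

# Hypothetical helper: predicted average swing speed for a given xSLG
predict_swing_speed <- function(xslg) intercept + slope * xslg

predicted <- predict_swing_speed(0.426)  # Jorge Polanco's xSLG
actual    <- 69.7                        # his 2024 average swing speed (mph)
residual  <- actual - predicted          # negative = slower than the model expects

round(c(predicted = predicted, actual = actual, residual = residual), 2)
```

A negative residual of roughly 2 mph means the player swings slower than the fitted line would suggest for his xSLG. With the fitted model object at hand, predict() on new data would avoid the rounding in the hand-copied coefficients.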

Conclusion

It’s important to note that while we discovered a lot about the swing in this project, there are many other aspects to a hitter’s swing. We can quantify almost all of them, from plate discipline to contact quality. What we have determined is that physics works: by swinging harder, a hitter has the potential to hit the ball harder and farther. We also discovered that there is a limit to that potential, as a hitter cannot simply “sell out” for swing speed by lengthening their swing.

References

“Statcast Custom Leaderboards.” Baseballsavant.Com, baseballsavant.mlb.com/leaderboard/custom?year=2024&type=batter&filter=&min=q&selections=player_age%2Cab%2Cpa%2Chit%2Csingle%2Cdouble%2Ctriple%2Chome_run%2Cstrikeout%2Cwalk%2Ck_percent%2Cbb_percent%2Cbatting_avg%2Cslg_percent%2Con_base_percent%2Con_base_plus_slg%2Cxba%2Cxslg%2Cwoba%2Cxwoba%2Cxobp%2Cxiso%2Cavg_swing_speed%2Cfast_swing_rate%2Cblasts_contact%2Cblasts_swing%2Csquared_up_contact%2Csquared_up_swing%2Cavg_swing_length%2Cswords%2Cexit_velocity_avg%2Claunch_angle_avg%2Csweet_spot_percent%2Cbarrel_batted_rate%2Chard_hit_percent%2Cavg_best_speed%2Cavg_hyper_speed%2Cwhiff_percent%2Cswing_percent&chart=false&x=player_age&y=player_age&r=no&chartType=beeswarm&sort=xwoba&sortDir=desc. Accessed 29 Nov. 2024.

RNZ News. “Shohei Ohtani Makes Major League Baseball History.” RNZ, 20 Sept. 2024, www.rnz.co.nz/news/sport/528539/shohei-ohtani-makes-major-league-baseball-history.

Witz, Billy. “How Aaron Judge Built Baseball’s Mightiest Swing.” The New York Times, 17 July 2017, www.nytimes.com/2017/07/17/sports/baseball/how-aaron-judge-built-baseballs-mightiest-swing.html.