I will be looking at how average exit velocity, average launch angle, barrel%, and hard hit% correlate with hitting home runs. In today’s MLB, there is a much bigger focus on hitting home runs, and that is normally achieved by hitting the ball very hard at a specific launch angle. I want to take a deeper dive into whether hitting the ball hard consistently, or having an average launch angle in the right range, has a noticeable relationship with hitting home runs.
Definitions
Before I get into the analysis and results, I need to define these metrics. The definitions come from the Baseball Savant website, which is also where the data was scraped from:
Exit Velocity: How fast, in miles per hour, a ball was hit by a batter
Launch Angle: How high/low, in degrees, a ball was hit by a batter
Barrels: A batted ball with the perfect combination of exit velocity and launch angle
Hard Hit: A batted ball with an exit velocity of 95 mph or higher
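To make the hard hit definition concrete, here is a minimal sketch of how a hard hit percentage could be computed from a set of exit velocities. The function name `hard_hit_pct` and the sample values are my own illustration and are not from Baseball Savant; only the 95 mph cutoff comes from the definition above.

```r
# Hypothetical helper: percent of batted balls at or above the
# 95 mph hard-hit cutoff. Input is a numeric vector of exit velocities (mph).
hard_hit_pct <- function(exit_velos, threshold = 95) {
  100 * mean(exit_velos >= threshold, na.rm = TRUE)
}

# Made-up exit velocities, for illustration only
example_velos <- c(88.1, 95.0, 101.3, 79.4)
hard_hit_pct(example_velos)
```

Because the definition says "95 mph or higher", the comparison uses `>=`, so a ball hit at exactly 95.0 mph counts as hard hit.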
My hypothesis is that the players with the highest average exit velocity, barrel%, and hard hit% should have the most home runs. For launch angle, the majority of home runs are hit between 20 and 35 degrees, so players whose average launch angle falls within that range should produce the most home runs.
Scatter Plot Analysis
The data was collected from the Baseball Savant leaderboards with a separate R script, which uses a function to scrape the five metrics being examined.
library(tidyverse)
savant_df <- read_csv("baseballsavant.csv")
Rows: 670 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Player
dbl (6): Year, HR, Avg_Exit_Velo, Avg_Launch_Angle, Hard_Hit_Pct, Barrel_Pct
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
To start, I am going to make scatter plots comparing home runs to each individual metric. These give a good first indication of the correlation.
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
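The plotting code is not shown in this report, so the following is only a sketch of how one of these scatter plots could be produced. The helper name `plot_hr_vs_metric`, the choice to put home runs on the y-axis, and the point transparency are my assumptions; the column names come from the CSV column specification above, and `geom_smooth()` is left at its default, which picks loess for a data set of this size.

```r
library(ggplot2)

# Hypothetical helper: scatter plot of home runs against one metric,
# with a default smoothed trendline (assumes HR belongs on the y-axis).
plot_hr_vs_metric <- function(df, metric, metric_label) {
  ggplot(df, aes(x = .data[[metric]], y = HR)) +
    geom_point(alpha = 0.5) +
    geom_smooth() +
    labs(
      title = paste(metric_label, "and Home Runs"),
      x = metric_label,
      y = "Home Runs"
    )
}

# Usage, assuming savant_df has been loaded as shown earlier:
# plot_hr_vs_metric(savant_df, "Avg_Exit_Velo", "Average Exit Velocity")
# plot_hr_vs_metric(savant_df, "Barrel_Pct", "Barrel %")
```

Passing the column name as a string keeps one function reusable for all four metrics instead of repeating the plot code four times.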
Scatter Plot Results
These scatter plots and their trendlines give an early indication that my hypothesis is correct. For the exit velocity, hard hit, and barrel graphs, the correlation is positive: home runs go up as each metric goes up. This means that the hitters who hit the ball hard most often and square it up the most tend to hit the most home runs. The launch angle graph doesn’t show much of a correlation, although it is slightly positive. This also makes sense: since most home runs are hit between 20 and 35 degrees, a higher average launch angle only helps up to a point, so launch angle shouldn’t track home runs as closely as the other metrics do.
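One way to put numbers on the trends seen in the scatter plots is the Pearson correlation between home runs and each metric. This is a sketch rather than part of the original analysis: the function name `hr_correlations` is hypothetical, and it assumes the column names from the CSV column specification.

```r
# Hypothetical helper: Pearson correlation of each metric with home runs.
# Returns a named numeric vector, one value per metric.
hr_correlations <- function(df,
                            metrics = c("Avg_Exit_Velo", "Avg_Launch_Angle",
                                        "Hard_Hit_Pct", "Barrel_Pct")) {
  sapply(metrics, function(m) {
    # complete.obs drops rows where either value is missing
    cor(df[[m]], df$HR, use = "complete.obs")
  })
}

# Usage, assuming savant_df has been loaded as shown earlier:
# hr_correlations(savant_df)
```

Pearson correlation only measures a straight-line relationship, so it may understate launch angle if home runs peak inside a range rather than rising steadily.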
Boxplot Analysis
For the second part of the analysis, I am going to compute the average home run total and label each batter as above or below that average. I will then use boxplots to compare the metrics for the two groups to see if there is a clear difference.
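The step that creates the grouping column is not shown in this report, so here is a sketch of one way it could be done. The column name `Above_or_below_average_hrs` matches the boxplot code, but the function name `add_hr_group`, the "Above"/"Below" labels, and the rule that a batter exactly at the average counts as "Below" are all my assumptions.

```r
library(dplyr)

# Hypothetical helper: flag each row as above or below the mean home run total.
add_hr_group <- function(df) {
  df %>%
    mutate(
      Above_or_below_average_hrs = if_else(
        HR > mean(HR, na.rm = TRUE),  # mean over all rows in df
        "Above",
        "Below"
      )
    )
}

# Usage, assuming savant_df has been loaded as shown earlier:
# savant_df <- add_hr_group(savant_df)
```

Note that this uses a single average across every row. Since the data includes a Year column, grouping by year before computing the mean would be a reasonable alternative.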
savant_df %>%
  ggplot(aes(x = as.factor(Above_or_below_average_hrs), y = Avg_Exit_Velo)) +
  geom_boxplot() +
  labs(title = "Exit Velocity and Home Runs Correlation", x = "Above or Below Average Home Runs", y = "Exit Velocity")
savant_df %>%
  ggplot(aes(x = as.factor(Above_or_below_average_hrs), y = Avg_Launch_Angle)) +
  geom_boxplot() +
  labs(title = "Launch Angle and Home Runs Correlation", x = "Above or Below Average Home Runs", y = "Launch Angle")
savant_df %>%
  ggplot(aes(x = as.factor(Above_or_below_average_hrs), y = Hard_Hit_Pct)) +
  geom_boxplot() +
  labs(title = "Hard Hit % and Home Runs Correlation", x = "Above or Below Average Home Runs", y = "Hard Hit %")
savant_df %>%
  ggplot(aes(x = as.factor(Above_or_below_average_hrs), y = Barrel_Pct)) +
  geom_boxplot() +
  labs(title = "Barrel % and Home Runs Correlation", x = "Above or Below Average Home Runs", y = "Barrel %")
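The boxplot comparison can also be summarized as a small table. The sketch below reports the group size and the median of each metric per group, since the median is the center line a boxplot draws. The function name `summarise_by_hr_group` is hypothetical, and it assumes the grouping column used in the boxplot code already exists.

```r
library(dplyr)

# Hypothetical helper: group size and per-metric medians for each HR group.
summarise_by_hr_group <- function(df) {
  df %>%
    group_by(Above_or_below_average_hrs) %>%
    summarise(
      n = n(),
      across(
        c(Avg_Exit_Velo, Avg_Launch_Angle, Hard_Hit_Pct, Barrel_Pct),
        ~ median(.x, na.rm = TRUE)
      ),
      .groups = "drop"
    )
}

# Usage, assuming savant_df has the grouping column:
# summarise_by_hr_group(savant_df)
```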
Results
The boxplots support what the scatter plots showed. For exit velocity, hard hit percentage, and barrel percentage, the box for batters with above average home runs sits much higher than the box for batters with below average home runs. The launch angle for the above average group is slightly higher than for the below average group, but the difference is small. The results support my hypothesis that players who hit the ball extremely hard most often tend to hit more home runs. For the project, I will expand on this a bit more by looking at a couple more metrics and looking at specific players over their careers.