For this project, I want to examine a few questions about baseball, specifically about hitting home runs. Home runs are among the most exciting plays in sports, and teams and players now put much more emphasis on hitting them. Many of the players considered the best hitters also hit a large number of home runs, and I want to examine that correlation more closely. First, I want to determine whether hitting a lot of home runs makes for a better hitter. For the second part of my analysis, I want to look at some important metrics and see how they correlate with hitting home runs.
I am loading a data set from Baseball Savant, which is a website for baseball statistics.
library(tidyverse)
Warning: package 'readr' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
hitting<-read_csv("~/Downloads/stats.csv")
Rows: 670 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): last_name, first_name
dbl (19): player_id, year, single, double, triple, home_run, k_percent, bb_p...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Data Dictionary and Summary Statistics
There are 20 variables in this data set, but most of them are simple and easy to understand once they are explained. Many of them are percentages and could be calculated by anyone. Here are the definitions for some of the important terms:
Batting Average: How often a player gets a hit, calculated as hits divided by at-bats
On-Base Percentage: How often a player reaches base
Slugging Percentage: A measure of a batter’s power, calculated as total bases per at-bat (1 for a single, 2 for a double, 3 for a triple, 4 for a home run)
OPS: On-Base Percentage + Slugging Percentage
Strikeout Percentage: How often a batter strikes out; walk percentage follows the same logic
Exit Velocity: How fast, in miles per hour, a ball comes off the bat
Launch Angle: The vertical angle, in degrees, at which a ball leaves the bat
Barrels: Batted balls with an ideal combination of exit velocity and launch angle
Hard Hit: A batted ball with an exit velocity of 95 mph or higher
Whiff Percentage: How often a batter swings and misses
Swing Percentage: How often a batter swings
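As a rough illustration of how the rate stats above relate to one another, the sketch below computes slugging percentage from hit counts and adds it to on-base percentage to get OPS. The at-bat count and on-base percentage are made-up numbers for a single hypothetical season (the data set used here does not list an at-bat column), and `slugging()` is a helper name introduced only for this example.

```r
# Hypothetical helper: total bases divided by at-bats.
slugging <- function(single, double, triple, home_run, at_bats) {
  total_bases <- single + 2 * double + 3 * triple + 4 * home_run
  total_bases / at_bats
}

# Made-up season line, for illustration only (not taken from the data set).
slg <- slugging(single = 90, double = 30, triple = 2, home_run = 28, at_bats = 550)
obp <- 0.350       # assumed on-base percentage
ops <- obp + slg   # OPS is the sum of the two rates

round(c(SLG = slg, OPS = ops), 3)
```

Because each home run counts for four bases in the numerator, a hitter's home run total feeds directly into slugging percentage and therefore into OPS.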
Now that some of the important terms are defined, here are a few of the summary statistics for the data set.
summary(hitting)
 last_name, first_name   player_id           year          single
 Length:670              Min.   :408234   Min.   :2021   Min.   : 44.0
 Class :character        1st Qu.:593428   1st Qu.:2022   1st Qu.: 76.0
 Mode  :character        Median :650402   Median :2023   Median : 88.0
                         Mean   :626494   Mean   :2023   Mean   : 88.4
                         3rd Qu.:668227   3rd Qu.:2024   3rd Qu.:100.0
                         Max.   :808982   Max.   :2025   Max.   :161.0
     double          triple          home_run       k_percent
 Min.   : 9.00   Min.   : 0.000   Min.   : 0.00   Min.   : 3.1
 1st Qu.:23.00   1st Qu.: 1.000   1st Qu.:15.00   1st Qu.:16.7
 Median :28.00   Median : 2.000   Median :21.00   Median :20.5
 Mean   :27.97   Mean   : 2.279   Mean   :21.74   Mean   :20.6
 3rd Qu.:32.00   3rd Qu.: 3.000   3rd Qu.:28.00   3rd Qu.:24.5
 Max.   :59.00   Max.   :17.000   Max.   :62.00   Max.   :34.6
   bb_percent      batting_avg      slg_percent     on_base_percent
 Min.   : 2.500   Min.   :0.1840   Min.   :0.2730   Min.   :0.2390
 1st Qu.: 6.800   1st Qu.:0.2430   1st Qu.:0.3990   1st Qu.:0.3130
 Median : 8.750   Median :0.2600   Median :0.4370   Median :0.3285
 Mean   : 8.899   Mean   :0.2602   Mean   :0.4411   Mean   :0.3325
 3rd Qu.:10.700   3rd Qu.:0.2750   3rd Qu.:0.4758   3rd Qu.:0.3510
 Max.   :22.200   Max.   :0.3540   Max.   :0.7010   Max.   :0.4650
 on_base_plus_slg     b_rbi        exit_velocity_avg launch_angle_avg
 Min.   :0.5610   Min.   : 23.00   Min.   :82.3      Min.   :-4.40
 1st Qu.:0.7183   1st Qu.: 61.00   1st Qu.:88.3      1st Qu.:10.30
 Median :0.7680   Median : 73.00   Median :89.7      Median :13.40
 Mean   :0.7736   Mean   : 74.76   Mean   :89.7      Mean   :13.28
 3rd Qu.:0.8177   3rd Qu.: 88.00   3rd Qu.:91.1      3rd Qu.:16.20
 Max.   :1.1590   Max.   :144.00   Max.   :96.2      Max.   :25.20
 barrel_batted_rate hard_hit_percent whiff_percent   swing_percent
 Min.   : 0.000     Min.   :14.90    Min.   : 5.30   Min.   :35.00
 1st Qu.: 6.500     1st Qu.:37.60    1st Qu.:20.20   1st Qu.:44.30
 Median : 8.950     Median :42.40    Median :24.20   Median :47.40
 Mean   : 9.262     Mean   :41.94    Mean   :23.97   Mean   :47.51
 3rd Qu.:11.775     3rd Qu.:46.90    3rd Qu.:28.38   3rd Qu.:50.40
 Max.   :27.100     Max.   :61.90    Max.   :40.60   Max.   :62.30
This summary is most useful for home runs and for the rate and percentage stats, since the mean and quartiles give easy points of comparison. For example, the mean home run total is 21.74 and the median is 21, so it is clear whether a given player-season is above or below the typical value in a particular stat.
Part 2
Visualization 1
My first visualization is a line chart that shows the total home runs by season. This provides a good baseline and shows how prevalent home runs have become in MLB today.
hr_per_season <- hitting %>%
  group_by(year) %>%
  summarise(totalHR = sum(home_run))

hr_per_season %>%
  ggplot(aes(year, totalHR)) +
  geom_line() +
  labs(title = "Home Runs per Season", x = "Year", y = "Number of Home Runs")
This graph shows that the total number of home runs dipped from 2022 to 2024, but in 2025 the total jumped to well over 3,000. These totals cover only the player-seasons in this data set, so they also depend on how many players are included each year. Many hitters are trying to hit more home runs, and the next visualizations examine whether that makes for a better hitter.
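Because a season total depends on how many player-seasons the data set contains for that year, it can help to look at the average per player alongside the total. The following is a minimal sketch on a small made-up data frame (`hitting_demo` is hypothetical and its values are not from the data set); the column names `year` and `home_run` match the data loaded above.

```r
library(dplyr)

# Toy data for illustration only; use `hitting` in the real analysis.
hitting_demo <- data.frame(
  year     = c(2021, 2021, 2021, 2022, 2022),
  home_run = c(30, 20, 10, 25, 15)
)

hr_by_season <- hitting_demo %>%
  group_by(year) %>%
  summarise(
    players       = n(),             # player-seasons in the data for that year
    total_hr      = sum(home_run),   # the quantity the line chart plots
    hr_per_player = mean(home_run)   # total adjusted for the number of players
  )

hr_by_season
```

If `players` changes a lot from year to year, `hr_per_player` is likely the fairer series to compare across seasons.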
Visualization 2
My second visualization is going to be a simple scatter plot that shows the correlation between home runs and OPS. OPS is considered one of the best stats to look at the overall value of a player, so I want to see how it correlates with home runs.
hitting %>%
  ggplot(aes(home_run, on_base_plus_slg)) +
  geom_point() +
  geom_smooth() +
  labs(title = "Home Runs and OPS Relationship", x = "Home Runs", y = "OPS")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
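This scatter plot shows that higher home run totals are strongly associated with a higher OPS, which is an early indication that hitting more home runs could make for a better hitter. Part of this relationship is built in, because slugging percentage, one half of OPS, credits each home run with four bases.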
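To put a number on the relationship the scatter plot shows, one option is the correlation coefficient from base R's `cor()`. The sketch below uses a small made-up data frame (`ops_demo` is hypothetical and its values are not from the data set); with the real data the call would use `hitting$home_run` and `hitting$on_base_plus_slg`.

```r
# Toy values for illustration only.
ops_demo <- data.frame(
  home_run         = c(5, 12, 20, 28, 35, 45),
  on_base_plus_slg = c(0.650, 0.700, 0.760, 0.800, 0.870, 0.950)
)

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear trend.
r <- cor(ops_demo$home_run, ops_demo$on_base_plus_slg)

# Spearman (rank-based) correlation is less sensitive to outliers and curved trends.
rho <- cor(ops_demo$home_run, ops_demo$on_base_plus_slg, method = "spearman")

c(pearson = r, spearman = rho)
```

The same two calls can be reused for batting average or strikeout rate by swapping the second column.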
Visualization 3
My third visualization is also going to be a scatter plot, but comparing home runs to batting average, which is another commonly used statistic to evaluate hitters. The trendline should be particularly interesting for this graph because it could go either way. There are some hitters who seem to sacrifice their batting average to hit more home runs, but also some hitters who are well rounded. I don’t expect to see a huge correlation here.
hitting %>%
  ggplot(aes(home_run, batting_avg)) +
  geom_point() +
  geom_smooth() +
  labs(title = "Home Runs and Batting Average Relationship", x = "Home Runs", y = "Batting Average")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
As predicted, there isn’t a strong correlation between these two statistics, although it appears to be slightly positive. This suggests that some hitters can hit a lot of home runs while maintaining a high batting average, while others sacrifice batting average for more home runs.
Visualization 4
My fourth visualization is going to follow a similar logic to my third one. This will be another scatter plot comparing home runs to strikeout rate. Just like batting average, I want to see if there is a tradeoff to hitting a lot of home runs. Typically, hitters that hit a lot of home runs also tend to strike out more, so I want to see if the graph reflects that.
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The trend line rises steeply at low home run totals, but there is little correlation once home runs reach around 20. It is only slightly positive the rest of the way, which is a bit surprising. I expected this correlation to be much more strongly positive.
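The code chunk for this plot does not appear above. A sketch that follows the same pattern as the two previous scatter plots might look like the following, assuming the strikeout rate is stored in the `k_percent` column listed in the column specification. The function name `plot_hr_vs_k()` and the toy data frame are introduced only for this example.

```r
library(ggplot2)

# Hypothetical helper mirroring the earlier scatter plots.
plot_hr_vs_k <- function(df) {
  ggplot(df, aes(home_run, k_percent)) +
    geom_point() +
    geom_smooth() +   # picks loess automatically when there are fewer than 1,000 rows
    labs(
      title = "Home Runs and Strikeout Rate Relationship",
      x = "Home Runs",
      y = "Strikeout Rate (%)"
    )
}

# Toy values for illustration only; call plot_hr_vs_k(hitting) on the real data.
k_demo <- data.frame(
  home_run  = c(3, 8, 12, 18, 22, 27, 33, 40, 46, 52),
  k_percent = c(12.0, 15.5, 18.0, 21.0, 21.5, 22.0, 23.5, 22.5, 24.0, 25.0)
)
p <- plot_hr_vs_k(k_demo)
```

Printing `p` draws the plot; the title and axis labels here are guesses modeled on the earlier charts.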
Visualization 5
My last visualization is going to be a box plot where I will take the average in home runs for the whole data set and put each player into a category of either above or below the average. I am then going to compare the OPS for the players with above average home run totals and those with below average home run totals to see if there is a significant difference.
hitting %>%summarise(avg_hr=mean(home_run))
# A tibble: 1 × 1
avg_hr
<dbl>
1 21.7
# 21.74 is the mean home run total reported by summary(hitting).
hitting <- hitting %>%
  mutate(Above_or_below_average_hrs = ifelse(home_run > 21.74, "Above Average", "Below Average"))

hitting %>%
  ggplot(aes(x = as.factor(Above_or_below_average_hrs), y = on_base_plus_slg)) +
  geom_boxplot() +
  labs(title = "Home Runs and OPS Boxplot", x = "Above or Below Average Home Runs", y = "OPS")
The box plot shows that player-seasons with an above average number of home runs tend to have a higher OPS, and that is especially true when considering the outliers. The highest OPS for below average home run totals is just below .900, while the above average group includes a few seasons with an OPS above 1.100. There are also outliers well below .600 in the below average group, but the above average group doesn’t include anything below .650.
Part 2 Results
In general, the visualizations suggest that hitting home runs goes along with being a better overall hitter. There is a strong positive correlation between home runs and OPS, which is arguably the most important statistic for evaluating hitters, and there is also a small positive correlation with batting average. There are some downsides to hitting a lot of home runs, as there is also a positive correlation with strikeout rate, meaning that high reward comes with high risk. The first graph is consistent with many players being willing to take on the added risk of strikeouts to hit more home runs. These results could be used by individual players, but also by teams as they try to win games. Having a lot of home run hitters can mean games with many runs scored and lots of home runs, but it can also lead to games with very few runs and lots of strikeouts.
Part 3 Intro
For the next part, I want to transition to what leads to hitting a lot of home runs, using some of the more advanced metrics from Baseball Savant. Specifically, I want to look at four metrics: exit velocity, launch angle, barrel %, and hard hit %. These were in the first data set, but I want to use the Baseball Savant website to look at them more closely and see their correlation with hitting home runs. Exit velocity, barrel %, and hard hit % should have a high positive correlation. In general, the harder the ball is hit, the farther it can go. Launch angle is the interesting one because higher doesn’t necessarily mean better. Most home runs are hit at a launch angle between 15 and 35 degrees, meaning that an average launch angle in that range should lead to the most home runs.
Scraping Analysis and Hypothesis
This portion also uses Baseball Savant data, but the data is scraped in a separate R script and only the resulting data set is loaded in this document. My hypothesis is that the players with the most home runs should have the highest average exit velocity, barrel %, and hard hit %, along with an average launch angle between 15 and 35 degrees.
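One way to make the launch-angle part of this hypothesis concrete is to flag which player-seasons have an average launch angle inside the 15 to 35 degree window and compare home runs across the two groups. This is a sketch on made-up rows (`angle_demo` is hypothetical); the column names `HR` and `Avg_Launch_Angle` are assumed to match the scraped file.

```r
library(dplyr)

# Toy rows for illustration only; use the scraped data in the real analysis.
angle_demo <- data.frame(
  HR               = c(8, 35, 22, 12, 41),
  Avg_Launch_Angle = c(4.5, 18.2, 15.0, 9.8, 21.3)
)

angle_summary <- angle_demo %>%
  mutate(in_window = between(Avg_Launch_Angle, 15, 35)) %>%   # bounds are inclusive
  group_by(in_window) %>%
  summarise(player_seasons = n(), mean_hr = mean(HR))

angle_summary
```

Unlike the other three metrics, this treats launch angle as a range to fall inside rather than a value to maximize, which matches the reasoning above that higher is not always better.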
savant_df<-read_csv("baseballsavant.csv")
Rows: 670 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Player
dbl (6): Year, HR, Avg_Exit_Velo, Avg_Launch_Angle, Hard_Hit_Pct, Barrel_Pct
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Boxplot Analysis
I am going to follow the same logic I used for the fifth visualization in Part 2 by computing the average of the home run variable. I will then use boxplots to compare the metrics for batters with an above average home run total and a below average home run total to see if there is a clear difference.
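The boxplots below group by `Above_or_below_average_hrs`, a column that is not among the seven columns reported when `baseballsavant.csv` was read, so it has to be created first. The step that does this is not shown; a sketch of what it might look like, assuming the home run column is `HR` as in the column specification above (`add_hr_group()` is a hypothetical helper name):

```r
library(dplyr)

# Hypothetical helper: label each row relative to the mean home run total.
add_hr_group <- function(df) {
  df %>%
    mutate(
      Above_or_below_average_hrs =
        ifelse(HR > mean(HR), "Above Average", "Below Average")
    )
}

# Toy rows for illustration only.
# On the real data: savant_df <- add_hr_group(savant_df)
savant_demo <- add_hr_group(data.frame(HR = c(10, 20, 30, 40)))
savant_demo
```

Computing the mean inside `mutate()` avoids hard-coding a threshold, so the labels stay correct if the scraped file changes.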
savant_df %>%
  ggplot(aes(x = as.factor(Above_or_below_average_hrs), y = Avg_Exit_Velo)) +
  geom_boxplot() +
  labs(title = "Exit Velocity and Home Runs Correlation", x = "Above or Below Average Home Runs", y = "Exit Velocity")

savant_df %>%
  ggplot(aes(x = as.factor(Above_or_below_average_hrs), y = Avg_Launch_Angle)) +
  geom_boxplot() +
  labs(title = "Launch Angle and Home Runs Correlation", x = "Above or Below Average Home Runs", y = "Launch Angle")

savant_df %>%
  ggplot(aes(x = as.factor(Above_or_below_average_hrs), y = Hard_Hit_Pct)) +
  geom_boxplot() +
  labs(title = "Hard Hit % and Home Runs Correlation", x = "Above or Below Average Home Runs", y = "Hard Hit %")

savant_df %>%
  ggplot(aes(x = as.factor(Above_or_below_average_hrs), y = Barrel_Pct)) +
  geom_boxplot() +
  labs(title = "Barrel % and Home Runs Correlation", x = "Above or Below Average Home Runs", y = "Barrel %")
Part 3 Results
The boxplots support my hypothesis: the boxes for above average home run totals sit much higher for exit velocity, hard hit percentage, and barrel percentage. The launch angle for above average home runs is slightly higher than for below average, but the difference is small. No formal test was run here, so these are visual comparisons, but based on these boxplots it seems reasonable to conclude that higher exit velocity, hard hit percentage, and barrel percentage are positively associated with hitting home runs.
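Boxplots are a visual comparison only. To check whether a difference between the two groups is statistically significant, one option is Welch's two-sample t-test via base R's `t.test()`. The sketch below uses made-up exit velocities (`velo_demo` is hypothetical); on the real data the formula would be `Avg_Exit_Velo ~ Above_or_below_average_hrs` with `data = savant_df`, assuming that grouping column exists.

```r
# Toy values for illustration only (not from the scraped data).
velo_demo <- data.frame(
  group = rep(c("Above Average", "Below Average"), each = 5),
  Avg_Exit_Velo = c(91.2, 92.0, 90.5, 93.1, 91.8,
                    88.4, 89.0, 87.6, 88.9, 89.5)
)

# Welch's t-test (var.equal = FALSE is the default) comparing the group means.
velo_test <- t.test(Avg_Exit_Velo ~ group, data = velo_demo)

velo_test$estimate   # mean exit velocity in each group
velo_test$p.value    # a small value suggests the group means differ
```

Repeating the test for launch angle, hard hit %, and barrel % would show which of the four differences hold up beyond what the boxplots suggest.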