BAIS 462 Final Project

Author

Nick Moscovic

Part 1

For this project, I want to examine a few questions about baseball, specifically about hitting home runs. Home runs are one of the most exciting plays in sports, and teams and players now put much more emphasis on hitting them. Many of the players considered the best hitters hit a large number of home runs, but I want to examine that relationship more closely and determine whether hitting a lot of home runs makes for a better hitter. For the second part of my analysis, I want to look at some important metrics and see how they correlate with hitting home runs.

I am loading a data set from Baseball Savant, which is a website for baseball statistics.

library(tidyverse)
Warning: package 'readr' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
hitting<- read_csv("~/Downloads/stats.csv")
Rows: 670 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): last_name, first_name
dbl (19): player_id, year, single, double, triple, home_run, k_percent, bb_p...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Dictionary and Summary Statistics

There are 20 variables in this data set, but most of them are simple and easy to understand once they are explained. Many of them are percentages and could be calculated by anyone. Here are the definitions for some of the important terms:

  • Batting Average: The share of at bats in which a player gets a hit, calculated as hits / at bats

  • On Base Percentage: Measures the frequency of a player reaching base

  • Slugging Percentage: Measures a batter’s power as total bases per at bat (1 for a single, 2 for a double, 3 for a triple, 4 for a home run)

  • OPS: On Base Percentage + Slugging Percentage

  • Strikeout Percentage: How often a batter strikes out; walk percentage follows the same logic

  • Exit Velocity: How fast, in miles per hour, a ball was hit by a batter

  • Launch Angle: How high/low, in degrees, a ball was hit by a batter

  • Barrels: A batted ball with the perfect combination of exit velocity and launch angle

  • Hard Hit: A batted ball with an exit velocity of 95 mph or higher.

  • Whiff Percentage: How often a batter swings and misses, as a share of swings

  • Swing Percentage: How often a batter swings, as a share of pitches seen
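
The rate stats defined above can be worked through by hand. Here is a minimal sketch in base R, assuming the standard formulas for batting average, on base percentage, and slugging percentage. The function name `rate_stats` and all of the counts passed to it are made up for illustration; at bats, walks, hit by pitch, and sacrifice flies are not among the columns shown for the data set.

```r
# Illustrative only: the counts passed in below are made up, not taken from the data set.
rate_stats <- function(singles, doubles, triples, home_runs,
                       at_bats, walks, hit_by_pitch, sac_flies) {
  hits <- singles + doubles + triples + home_runs
  # Each hit type is weighted by the number of bases it is worth
  total_bases <- singles + 2 * doubles + 3 * triples + 4 * home_runs
  avg <- hits / at_bats
  obp <- (hits + walks + hit_by_pitch) /
    (at_bats + walks + hit_by_pitch + sac_flies)
  slg <- total_bases / at_bats
  c(avg = avg, obp = obp, slg = slg, ops = obp + slg)
}

example <- rate_stats(singles = 90, doubles = 30, triples = 2, home_runs = 28,
                      at_bats = 500, walks = 60, hit_by_pitch = 5, sac_flies = 5)
round(example, 3)
```

With these made-up counts, the batting average is 150 / 500 = .300 and the slugging percentage is 268 / 500 = .536. The 28 home runs alone contribute 112 of the 268 total bases.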

Now that some of the important terms are defined, here are a few of the summary statistics for the data set.

summary(hitting)
 last_name, first_name   player_id           year          single     
 Length:670            Min.   :408234   Min.   :2021   Min.   : 44.0  
 Class :character      1st Qu.:593428   1st Qu.:2022   1st Qu.: 76.0  
 Mode  :character      Median :650402   Median :2023   Median : 88.0  
                       Mean   :626494   Mean   :2023   Mean   : 88.4  
                       3rd Qu.:668227   3rd Qu.:2024   3rd Qu.:100.0  
                       Max.   :808982   Max.   :2025   Max.   :161.0  
     double          triple          home_run       k_percent   
 Min.   : 9.00   Min.   : 0.000   Min.   : 0.00   Min.   : 3.1  
 1st Qu.:23.00   1st Qu.: 1.000   1st Qu.:15.00   1st Qu.:16.7  
 Median :28.00   Median : 2.000   Median :21.00   Median :20.5  
 Mean   :27.97   Mean   : 2.279   Mean   :21.74   Mean   :20.6  
 3rd Qu.:32.00   3rd Qu.: 3.000   3rd Qu.:28.00   3rd Qu.:24.5  
 Max.   :59.00   Max.   :17.000   Max.   :62.00   Max.   :34.6  
   bb_percent      batting_avg      slg_percent     on_base_percent 
 Min.   : 2.500   Min.   :0.1840   Min.   :0.2730   Min.   :0.2390  
 1st Qu.: 6.800   1st Qu.:0.2430   1st Qu.:0.3990   1st Qu.:0.3130  
 Median : 8.750   Median :0.2600   Median :0.4370   Median :0.3285  
 Mean   : 8.899   Mean   :0.2602   Mean   :0.4411   Mean   :0.3325  
 3rd Qu.:10.700   3rd Qu.:0.2750   3rd Qu.:0.4758   3rd Qu.:0.3510  
 Max.   :22.200   Max.   :0.3540   Max.   :0.7010   Max.   :0.4650  
 on_base_plus_slg     b_rbi        exit_velocity_avg launch_angle_avg
 Min.   :0.5610   Min.   : 23.00   Min.   :82.3      Min.   :-4.40   
 1st Qu.:0.7183   1st Qu.: 61.00   1st Qu.:88.3      1st Qu.:10.30   
 Median :0.7680   Median : 73.00   Median :89.7      Median :13.40   
 Mean   :0.7736   Mean   : 74.76   Mean   :89.7      Mean   :13.28   
 3rd Qu.:0.8177   3rd Qu.: 88.00   3rd Qu.:91.1      3rd Qu.:16.20   
 Max.   :1.1590   Max.   :144.00   Max.   :96.2      Max.   :25.20   
 barrel_batted_rate hard_hit_percent whiff_percent   swing_percent  
 Min.   : 0.000     Min.   :14.90    Min.   : 5.30   Min.   :35.00  
 1st Qu.: 6.500     1st Qu.:37.60    1st Qu.:20.20   1st Qu.:44.30  
 Median : 8.950     Median :42.40    Median :24.20   Median :47.40  
 Mean   : 9.262     Mean   :41.94    Mean   :23.97   Mean   :47.51  
 3rd Qu.:11.775     3rd Qu.:46.90    3rd Qu.:28.38   3rd Qu.:50.40  
 Max.   :27.100     Max.   :61.90    Max.   :40.60   Max.   :62.30  

This summary is especially useful for home runs and for the rate and percentage stats because it gives easy points of comparison. It is clear at a glance whether a player is above or below the average in a particular stat.

Part 2

Visualization 1

My first visualization is a line chart of total home runs by season. This provides a good baseline and shows how prevalent home runs have become in MLB today.

hr_per_season<- hitting %>% 
  group_by(year) %>% 
  summarise(totalHR= sum(home_run))

hr_per_season %>% 
  ggplot(aes(year, totalHR))+
  geom_line()+
  labs(
    title = "Home Runs per Season",
    x= "Year",
    y="Number of Home Runs"
  )

This graph shows that the total number of home runs dipped for a few years from 2022 to 2024, but in 2025 the number of home runs jumped to well over 3,000. Because these totals only cover the hitters in this data set, they also depend on how many players appear in each season. Many hitters are trying to hit more home runs, and the next visualizations will help examine whether that makes for a better hitter.
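
One way to check whether a change in the season total comes from more home runs per hitter or simply from more hitters in the data is to put the two side by side. This is a minimal sketch in base R on a small made-up data frame (`toy`), using the same `year` and `home_run` column names as the data set. With the real data, `hitting` would take the place of `toy`.

```r
# Made-up rows for illustration; not values from the Baseball Savant file.
toy <- data.frame(
  year     = c(2021, 2021, 2022, 2022, 2022),
  home_run = c(30, 10, 20, 25, 15)
)

# Total home runs and number of player rows for each season
season_totals <- aggregate(home_run ~ year, data = toy, FUN = sum)
season_counts <- aggregate(home_run ~ year, data = toy, FUN = length)

season_summary <- data.frame(
  year          = season_totals$year,
  total_hr      = season_totals$home_run,
  players       = season_counts$home_run,
  hr_per_player = season_totals$home_run / season_counts$home_run
)
season_summary
```

In this toy example the 2022 total is higher (60 versus 40), but only because there are three players instead of two. The home runs per player are the same in both seasons.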

Visualization 2

My second visualization is a simple scatter plot of home runs against OPS. OPS is considered one of the best single stats for judging a hitter's overall value, so I want to see how it correlates with home runs.

hitting %>% 
  ggplot(aes(home_run, on_base_plus_slg))+
  geom_point()+
  geom_smooth()+
  labs(
    title = "Home Runs and OPS Relationship",
    x= "Home Runs",
    y="OPS"
  )
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

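This scatter plot shows that OPS rises with home run totals across almost the entire range, which is an early indication that hitting more home runs goes along with being a better hitter. Part of this relationship is built in, since each home run adds four total bases to slugging percentage, which is one of the two components of OPS.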
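
The scatter plot shows the direction of the relationship, and a correlation coefficient can put a number on its strength. This is a minimal sketch using base R's `cor()` on a small made-up data frame (`toy`) with the same column names as the data set. The values are illustrative only, and no coefficient was computed in the analysis above.

```r
# Made-up rows for illustration; not values from the Baseball Savant file.
toy <- data.frame(
  home_run         = c(5, 12, 18, 25, 33, 41),
  on_base_plus_slg = c(0.650, 0.700, 0.745, 0.790, 0.860, 0.930)
)

# Pearson measures linear association; Spearman only uses the rank order
pearson_r  <- cor(toy$home_run, toy$on_base_plus_slg, method = "pearson")
spearman_r <- cor(toy$home_run, toy$on_base_plus_slg, method = "spearman")

c(pearson = pearson_r, spearman = spearman_r)
```

With the real data, the same two calls would use `hitting$home_run` and `hitting$on_base_plus_slg`.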

Visualization 3

My third visualization is also a scatter plot, this time comparing home runs to batting average, which is another commonly used statistic for evaluating hitters. The trend line should be particularly interesting for this graph because it could go either way. There are some hitters who seem to sacrifice their batting average to hit more home runs, but also some hitters who are well rounded. I don’t expect to see a strong correlation here.

hitting %>% 
  ggplot(aes(home_run, batting_avg))+
  geom_point()+
  geom_smooth()+
  labs(
    title = "Home Runs and Batting Average Relationship",
    x= "Home Runs",
    y="Batting Average"
  )
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

My prediction was correct: there isn’t a strong correlation between these two statistics, although the trend is slightly positive. This suggests that some hitters can hit a lot of home runs while maintaining a high batting average, while others are sacrificing batting average for more home runs.

Visualization 4

My fourth visualization follows similar logic to my third. It is another scatter plot, this time comparing home runs to strikeout rate. Just like with batting average, I want to see if there is a tradeoff to hitting a lot of home runs. Typically, hitters who hit a lot of home runs also tend to strike out more, so I want to see if the graph reflects that.

hitting %>% 
  ggplot(aes(home_run, k_percent))+
  geom_point()+
  geom_smooth()+
  labs(
    title = "Home Runs and Strikeout Rate Relationship",
    x= "Home Runs",
    y="Strikeout Rate"
  )
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The trend line in this scatter plot rises steeply at low home run totals, but it flattens out once home runs reach around 20. It is only slightly positive the rest of the way, which is a bit surprising. I thought this relationship was going to be much more positive.
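
The change in the trend around 20 home runs can be checked by fitting a straight line on each side of that point and comparing the slopes. This is a minimal sketch in base R on a small made-up data frame (`toy`). The cutoff of 20 comes from reading the plot, the helper name `slope_for` is hypothetical, and the values are illustrative only.

```r
# Made-up rows for illustration; not values from the Baseball Savant file.
toy <- data.frame(
  home_run  = c(2, 6, 10, 14, 18, 22, 28, 34, 40, 46),
  k_percent = c(10, 13, 16, 19, 22, 22.5, 23, 23.5, 24, 24.5)
)

# Slope of a simple linear fit of strikeout rate on home runs
slope_for <- function(df) {
  coef(lm(k_percent ~ home_run, data = df))[["home_run"]]
}

slope_low  <- slope_for(toy[toy$home_run < 20, ])
slope_high <- slope_for(toy[toy$home_run >= 20, ])

c(below_20 = slope_low, at_or_above_20 = slope_high)
```

A much larger slope below 20 than above it would match what the loess curve shows: strikeout rate climbs quickly at first and then levels off.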

Visualization 5

My last visualization is a box plot. I will take the average home run total for the whole data set and put each player season into one of two categories, above or below the average. I will then compare the OPS for the seasons with above average home run totals and those with below average home run totals to see if there is a clear difference.

hitting %>% 
  summarise(avg_hr= mean(home_run))
# A tibble: 1 × 1
  avg_hr
   <dbl>
1   21.7
# Compare against the computed mean instead of a hard-coded, rounded 21.74
hitting <- hitting %>% 
  mutate(Above_or_below_average_hrs = 
           ifelse(home_run > mean(home_run), "Above Average", "Below Average"))

hitting %>% 
  ggplot(aes(x=as.factor(Above_or_below_average_hrs),y=on_base_plus_slg))+
           geom_boxplot()+
           labs(title = "Home Runs and OPS Boxplot",
                x= "Above or Below Average Home Runs",
                y= "OPS")

The box plot shows that above average home run totals tend to go with a higher OPS, and that is especially true when considering the outliers. The highest OPS in the below average group is just below .900, while the above average group includes a few seasons with an OPS above 1.100. The below average group also has outliers well below .600, while the above average group has nothing below .650.

Part 2 Results

In general, the visualizations suggest that hitting home runs goes along with being a better overall hitter. There is a strong positive relationship between home runs and OPS, which is arguably the most important statistic for evaluating hitters, and there is also a small positive relationship with batting average. There are some downsides to hitting a lot of home runs, since strikeout rate also rises with home runs, meaning there is high risk with high reward. As the first graph suggests, home run totals remain high, so many players appear willing to take on the added risk of strikeouts to hit more home runs. These results could be used by individual players, but also by teams as they try to win games. Having a lot of home run hitters can mean games with many runs scored and lots of home runs, but it can also lead to games with very few runs and lots of strikeouts.

Part 3 Intro

For the next part, I want to transition into what leads to hitting a lot of home runs, using some of the more advanced metrics from Baseball Savant. Specifically, I want to look at four metrics: exit velocity, launch angle, barrel %, and hard hit %. These were in the first data set, but I want to use the Baseball Savant website to look at them more closely and see their correlation with hitting home runs. Exit velocity, barrel %, and hard hit % should have a high positive correlation. In general, the harder the ball is hit, the farther it can go. Launch angle is the interesting one because higher doesn’t necessarily mean better. Most home runs are hit at a launch angle between 15 and 35 degrees, meaning that an average launch angle in that range should lead to the most home runs.

Scraping Analysis and Hypothesis

This portion will also use Baseball Savant data, but it will be scraped in a separate R script and the resulting data set will be loaded in this document. My hypothesis is that the players with the highest average exit velocity, barrel %, and hard hit %, along with an average launch angle between 15 and 35 degrees, should have the most home runs.

savant_df<- read_csv("baseballsavant.csv")
Rows: 670 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Player
dbl (6): Year, HR, Avg_Exit_Velo, Avg_Launch_Angle, Hard_Hit_Pct, Barrel_Pct

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
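
The launch angle part of the hypothesis can be checked directly by flagging which rows fall inside the 15 to 35 degree range and comparing average home runs for the two groups. This is a minimal sketch in base R on a small made-up data frame (`toy`) that reuses the `Avg_Launch_Angle` and `HR` column names from the scraped file. The `in_band` column is a name introduced here for illustration, and the values are not real.

```r
# Made-up rows for illustration; not values from the scraped file.
toy <- data.frame(
  Avg_Launch_Angle = c(4, 9, 13, 16, 19, 22),
  HR               = c(8, 12, 20, 27, 31, 35)
)

# TRUE when the average launch angle is inside the hypothesized range
toy$in_band <- toy$Avg_Launch_Angle >= 15 & toy$Avg_Launch_Angle <= 35

# Mean home runs for rows outside ("FALSE") and inside ("TRUE") the range
band_means <- tapply(toy$HR, toy$in_band, mean)
band_means
```

With the real data, `savant_df` would take the place of `toy`. A higher mean for the `TRUE` group would be consistent with the hypothesis.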

Boxplot Analysis

I am going to follow the same logic I used for the fifth visualization in Part 2 by calculating the average of the home run variable. I will then use boxplots to compare the metrics for batters with an above average home run total and a below average home run total to see if there is a clear difference.

savant_df %>% 
  summarise(avg_hr= mean(HR))
# A tibble: 1 × 1
  avg_hr
   <dbl>
1   21.7
# Compare against the computed mean instead of a hard-coded, rounded 21.74
savant_df <- savant_df %>% 
  mutate(Above_or_below_average_hrs = 
           ifelse(HR > mean(HR), "Above Average", "Below Average")
  )
savant_df %>% 
  ggplot(aes(x=as.factor(Above_or_below_average_hrs),y=Avg_Exit_Velo))+
           geom_boxplot()+
           labs(title = "Exit Velocity and Home Runs Correlation",
                x= "Above or Below Average Home Runs",
                y= "Exit Velocity")

savant_df %>% 
  ggplot(aes(x=as.factor(Above_or_below_average_hrs),y=Avg_Launch_Angle))+
           geom_boxplot()+
           labs(title = "Launch Angle and Home Runs Correlation",
                x= "Above or Below Average Home Runs",
                y= "Launch Angle")

savant_df %>% 
  ggplot(aes(x=as.factor(Above_or_below_average_hrs),y=Hard_Hit_Pct))+
           geom_boxplot()+
           labs(title = "Hard Hit % and Home Runs Correlation",
                x= "Above or Below Average Home Runs",
                y= "Hard Hit %")

savant_df %>% 
  ggplot(aes(x=as.factor(Above_or_below_average_hrs),y=Barrel_Pct))+
           geom_boxplot()+
           labs(title = "Barrel % and Home Runs Correlation",
                x= "Above or Below Average Home Runs",
                y= "Barrel %")

Part 3 Results

The boxplots support my hypothesis: for exit velocity, hard hit percentage, and barrel percentage, the boxes for above average home run totals sit much higher than the boxes for below average totals. The launch angle for above average home runs is slightly higher than for below average, but the difference looks small, and the boxplots alone cannot show whether it is statistically significant. Based on these boxplots, it seems reasonable to conclude that higher exit velocity, hard hit percentage, and barrel percentage are positively associated with hitting home runs.
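
Whether a gap between the two groups is statistically significant can be checked with a two-sample test. This is a minimal sketch using base R's `t.test()`, which runs Welch's t-test by default, on a small made-up data frame (`toy`). The `group` column stands in for `Above_or_below_average_hrs`, the values are illustrative only, and the t-test is one reasonable choice rather than something taken from the analysis above.

```r
# Made-up rows for illustration; not values from the scraped file.
toy <- data.frame(
  group = rep(c("Above Average", "Below Average"), each = 5),
  Avg_Launch_Angle = c(14.1, 15.3, 13.8, 16.0, 14.9,
                       12.9, 13.5, 11.8, 14.2, 12.6)
)

# Welch two-sample t-test comparing mean launch angle between the groups
result <- t.test(Avg_Launch_Angle ~ group, data = toy)

result$method
result$p.value
```

With the real data, the formula would be `Avg_Launch_Angle ~ Above_or_below_average_hrs` with `data = savant_df`, and the same call could be repeated for the other three metrics.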