Today we will dig into the Batting data set (data table listed below). We will filter it down and add to it in order to answer one question: who is the best home run hitter of all time?

But before we get into how we’ll answer that, let’s look at what the Batting data set is and what variables it has. It has 110495 entries and the following 22 variables:

playerID: The player ID code.
yearID: The year.
stint: The order of appearance(s) within a season.
teamID: The team ID.
lgID: The league ID.
G: The number of games a player played.
AB: At bats.
R: Runs.
H: Hits.
X2B: Doubles.
X3B: Triples.
HR: Home runs.
RBI: Runs batted in.
SB: Stolen bases.
CS: Caught stealing.
BB: Bases on balls.
SO: Strikeouts.
IBB: Intentional walks.
HBP: Hit by pitch.
SH: Sacrifice hits.
SF: Sacrifice flies.
GIDP: Grounded into double play.

Source

Lahman, S. (2022) Lahman’s Baseball Database, 1871-2021, 2021 version, https://www.seanlahman.com/baseball-archive/statistics/

How Will We Answer Our Question?

Now you might be wondering, how do we get to that answer? We will first obtain two lists of baseball players: 1) those who hit the most home runs in a single season, and 2) those who hit the most home runs over their whole career. These two lists give us our top contenders. We will then merge them into a single group and filter the Batting data set down to only these players. Then, we will total each player’s career home runs, at bats, and strikeouts. Next, we’ll compute two ratios: total at bats divided by total home runs, and total strikeouts divided by total home runs. Finally, we will create a function that gives us the longest run of consecutive seasons in which a player hit at least a given number of home runs, for thresholds of 20, 30, 40, 50, and 60. This will give us even more evidence to help answer our question.

Loading Libraries

Before we start answering our question, we have to load our packages (listed below): Lahman, tidyverse, and DT. The Lahman package contains the Batting data set, which holds Major League Baseball batting data going back to 1871. The tidyverse package lets us carry out all of the transformations on the Batting data set. DT gives us the datatable function, which we use to display our data sets.

library(tidyverse)
library(Lahman)
library(DT)

(The Batting data set would go here, but it has too many entries to display and crashes R. Here’s the code I would use if I could: datatable(Batting))
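If rendering every row is the problem, one possible workaround is to display only a slice of the table. This is a sketch, assuming a 100-row preview is enough for the reader; the name batting_preview, the row count, and the page length are arbitrary choices made here.

```r
library(Lahman)
library(DT)

# Keep only the first 100 rows so the widget stays small
batting_preview <- head(Batting, 100)

# pageLength controls how many rows are shown per page of the table
datatable(batting_preview, options = list(pageLength = 10))
```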

Now, let’s move on to answering our question.

Who Are Our Top Contenders For Best Home Run Hitter?

We’ll start off with our code below.

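# Add up stints so each row is one player-season, then rank seasons by home runs.
# We keep 30 rows here so that 20 remain after some rows are removed below.
`Top 20 Home Runs Hit in One Season` <- Batting %>%
  group_by(playerID, yearID) %>%
  summarize("HR's in One Season" = sum(HR))%>%
  arrange(desc(`HR's in One Season`))%>%
  head(30)

# Remove hand-picked rows by position, then keep the first 20 player IDs.
`Top 20 Home Runs Hit in One Season Part 2` <- `Top 20 Home Runs Hit in One Season`[-c(4 ,5 ,6 , 9, 14, 18, 24, 25, 26 ), ]%>%
  head(20)%>%
  select(playerID)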

datatable(`Top 20 Home Runs Hit in One Season Part 2`)
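The row positions removed above were chosen by hand, so they are tied to one particular version of the data. If those rows are repeat appearances of players already on the list (which is what they appear to be), a version-independent alternative is dplyr's distinct(). The sketch below uses a small made-up table; season_hr and top_players are names invented for illustration.

```r
library(dplyr)

# Hypothetical season totals; player "a" appears twice
season_hr <- tibble(
  playerID = c("a", "b", "a", "c", "d"),
  yearID   = c(2001, 1998, 2000, 1961, 1927),
  HR       = c(73, 70, 49, 61, 60)
)

top_players <- season_hr %>%
  arrange(desc(HR)) %>%
  distinct(playerID) %>%  # keeps the first (highest-ranked) row per player
  head(3)

top_players$playerID
```

Because distinct() keeps the first row it sees for each player, sorting before calling it means each player is represented by their best season.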

Now we have a data set that gives us our top 20 contenders for most home runs hit in a single season. Let’s move on to our top 20 contenders for most home runs hit over a career.

Below is our code.

`Top 20 Home Run Hitters Over Whole Career` <- Batting %>%
  group_by(playerID) %>%
  summarize("Total HR's Hit in Career" = sum(HR))%>%
  arrange(desc(`Total HR's Hit in Career`))%>%
  head(20)%>%
  select(playerID)

datatable(`Top 20 Home Run Hitters Over Whole Career`)

Now we have our top 20 career home run hitters. With both data sets in hand, we can merge the two groups to get our top contenders.

`Top Contenders` <- full_join(`Top 20 Home Runs Hit in One Season Part 2`, `Top 20 Home Run Hitters Over Whole Career`, by = c("playerID"))

datatable(`Top Contenders`)
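Two lists of 20 do not have to produce 40 contenders: full_join() keeps every playerID found in either table, and a player who is on both lists appears only once. A minimal sketch with made-up IDs, assuming each table lists a player at most once:

```r
library(dplyr)

# Hypothetical lists; "b" and "c" are on both
single_season <- tibble(playerID = c("a", "b", "c"))
career        <- tibble(playerID = c("b", "c", "d"))

# The result is the union of the two sets of IDs
contenders <- full_join(single_season, career, by = "playerID")

nrow(contenders)
```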

Here is our data set of 31 top contenders for best home run hitter of all time. Next, we will find the total home runs, at bats, and strikeouts for each of them.

What Are The Stats of Our Top Contenders?

`Top Contenders Only` <- Batting%>%
  filter(playerID %in% `Top Contenders`$playerID)

`Top Contenders Stats` <-`Top Contenders Only`%>%
  group_by(playerID)%>%
  summarize("Career Total HR's" = sum(HR), 
            "Career Total AB's"= sum(AB), 
            "Career Total SO's" = sum(SO))

datatable(`Top Contenders Stats`)

Now we have our top contenders alongside their stats. But we’re not done yet. We still need more data to help us answer our question, so we will find the ratio of at bats per home run and the ratio of strikeouts per home run.

Below is our code and datatables.

`At Bats per Home Run` <- `Top Contenders Only`%>%
  group_by(playerID)%>%
  summarize("Total AB" = sum(AB), 
            "Total HR" = sum(HR), 
            "AB per HR" = `Total AB`/ `Total HR`)%>%
  select(playerID, `AB per HR`)%>%
  arrange(`AB per HR`)

datatable(`At Bats per Home Run`)

`Strikeout per Home Run` <- `Top Contenders Only`%>%
  group_by(playerID)%>%
  summarize("Total SO" = sum(SO), 
            "Total HR" = sum(HR), 
            "SO per HR" = `Total SO`/ `Total HR`)%>%
  select(playerID, `SO per HR`)%>%
  arrange(`SO per HR`)

datatable(`Strikeout per Home Run`)
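Both ratios are computed from career totals, and a lower value is better: fewer at bats (or strikeouts) spent per home run. As a quick check of the arithmetic, here is a sketch on a made-up table; the players, numbers, and column names are invented for illustration.

```r
library(dplyr)

# Hypothetical two-season careers for players "x" and "y"
toy <- tibble(
  playerID = c("x", "x", "y", "y"),
  AB       = c(500, 400, 600, 600),
  HR       = c(50, 40, 30, 30),
  SO       = c(100, 80, 90, 150)
)

ratios <- toy %>%
  group_by(playerID) %>%
  summarize(ab_per_hr = sum(AB) / sum(HR),   # x: 900 / 90, y: 1200 / 60
            so_per_hr = sum(SO) / sum(HR)) %>%  # x: 180 / 90, y: 240 / 60
  arrange(ab_per_hr)

ratios
```

Player x needs 10 at bats per home run and player y needs 20, so x sorts first.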

What Is The Longest Consecutive Run of Seasons With 20, 30, 40, 50, or 60 Home Runs Hit?

Finally, we will make a function that finds the longest run of consecutive seasons in which a player hit at least a given number of home runs. For example, say Joe Smith hits the following home run totals in consecutive seasons: (12, 23, 32, 21, 10, 24, 21). If we set our function to count only seasons with 20 or more home runs, it will return 3, because the longest unbroken run is (23, 32, 21).

Let’s write out our function.

max_conse <- function(i, m){
  # The function takes two arguments: i is a vector of home run counts in
  # season order, and m is the minimum number of home runs a season needs
  # to count toward a streak.
  highest_streak_so_far <- 0
  streak <- 0
  # Both counters start at 0 on every call, so nothing carries over from a previous player.
  for (p in i){
    # p takes each entry of i in turn.
    if (p >= m){
      # The entry is greater than or equal to m, so our streak goes up by 1.
      streak <- streak + 1
    } else {
      # The entry is less than m, so the streak ends. If it is longer than our
      # highest streak so far, it becomes the new highest. Then we reset it to 0.
      if (streak > highest_streak_so_far){
        highest_streak_so_far <- streak
      }
      streak <- 0
    }
  }
  # A streak that runs through the last entry never reaches the else branch
  # above, so we compare it with the highest streak once more here.
  if (streak > highest_streak_so_far){
    highest_streak_so_far <- streak
  }
  # Finally, the function returns the highest streak.
  return(highest_streak_so_far)
}
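As a cross-check on the loop, the same quantity can be computed with base R's rle(), which encodes a logical vector as runs of TRUE and FALSE. This is only a sketch for comparison, assuming the input has no missing values; the name max_conse_rle is made up here, and the input is the Joe Smith example from earlier.

```r
max_conse_rle <- function(hr, threshold) {
  runs <- rle(hr >= threshold)             # run lengths of TRUE and FALSE
  qualifying <- runs$lengths[runs$values]  # lengths of the TRUE runs only
  if (length(qualifying) == 0) 0 else max(qualifying)
}

joe_smith <- c(12, 23, 32, 21, 10, 24, 21)

max_conse_rle(joe_smith, 20)  # 3: the run (23, 32, 21)
max_conse_rle(joe_smith, 30)  # 1: only the 32 season
max_conse_rle(joe_smith, 40)  # 0: no season reaches 40
```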

Now we have our function. Let’s use it to calculate, for each contender, the longest run of consecutive seasons with 20 or more, 30 or more, 40 or more, 50 or more, and 60 or more home runs. Here is the code that applies our function from above.

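`Consecutive Home Runs Hit` <- `Top Contenders Only`%>%
  # Batting has one row per stint, so add up stints to get one row per
  # player-season, sorted by year, before measuring streaks.
  group_by(playerID, yearID)%>%
  summarize(HR = sum(HR), .groups = "drop")%>%
  arrange(playerID, yearID)%>%
  group_by(playerID)%>%
  summarize("20 or more HR hit consecutively" = max_conse(HR, 20), 
            "30 or more HR hit consecutively" = max_conse(HR, 30), 
            "40 or more HR hit consecutively" = max_conse(HR, 40), 
            "50 or more HR hit consecutively" = max_conse(HR, 50), 
            "60 or more HR hit consecutively" = max_conse(HR, 60))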

datatable(`Consecutive Home Runs Hit`)
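Because Batting has one row per stint, a player traded mid-season has two rows for that year. Streaks measured against a season threshold need season totals, which the following sketch illustrates on a small table with numbers chosen for illustration (toy and season_totals are names invented here).

```r
library(dplyr)

# Hypothetical player traded during 1997, so that year has two stints
toy <- tibble(
  playerID = "p",
  yearID   = c(1996, 1997, 1997, 1998),
  stint    = c(1, 1, 2, 1),
  HR       = c(52, 34, 24, 70)
)

# Collapse stints into one row per player-season
season_totals <- toy %>%
  group_by(playerID, yearID) %>%
  summarize(HR = sum(HR), .groups = "drop") %>%
  arrange(playerID, yearID)

# Seasons with 50 or more home runs
sum(toy$HR >= 50)            # 2 when counting raw stint rows
sum(season_totals$HR >= 50)  # 3 when counting season totals
```

Counted by stint, the 1997 season (34 + 24 = 58) looks like two sub-50 rows and would break a streak of 50-homer seasons; counted by season, it does not.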

Now, let’s put all these columns together into a single data set to make it easier for us to answer our question.

`All Data Set's Combined` <- Batting%>%
  filter(playerID %in% `Top Contenders`$playerID)%>%
  # Add up stints so each row is one player-season, sorted by year.
  # Career totals are unchanged; streaks now use season totals.
  group_by(playerID, yearID)%>%
  summarize(HR = sum(HR), AB = sum(AB), SO = sum(SO), .groups = "drop")%>%
  arrange(playerID, yearID)%>%
  group_by(playerID)%>%
  summarize("Career Total HR's" = sum(HR), 
            "Career Total AB's"= sum(AB), 
            "Career Total SO's" = sum(SO),
            "AB per HR" = `Career Total AB's`/ `Career Total HR's`,
            "SO per HR" = `Career Total SO's`/ `Career Total HR's`, 
            "20 or more HR hit consecutively" = max_conse(HR, 20), 
            "30 or more HR hit consecutively" = max_conse(HR, 30), 
            "40 or more HR hit consecutively" = max_conse(HR, 40), 
            "50 or more HR hit consecutively" = max_conse(HR, 50), 
            "60 or more HR hit consecutively" = max_conse(HR, 60))

datatable(`All Data Set's Combined`)

Who is The Greatest Home Run Hitter of All Time?

This question is a hard one to answer. You could ask five baseball enthusiasts and get ten different answers. Do you put consistency on a higher pedestal than at bats per home run? Let me sort the full data set by what I consider most valuable for answering this question.

`Cleaning All Data Set's Combined`<- `All Data Set's Combined`%>%
  arrange(desc(`20 or more HR hit consecutively`), `AB per HR`)

datatable(`Cleaning All Data Set's Combined`)

The main thing I am looking for to help in answering this question is consistency and the amount of at bats needed to hit a home run. I was able to find two that fit my main idea of best home run hitter, ruthba01 (Babe Ruth) and bondsba01 (Barry Bonds). Let me make a data set that only contains these two so we can better compare them to each other.

`Top Two Contenders` <- `Cleaning All Data Set's Combined`%>%
  filter(`AB per HR` < 13, `20 or more HR hit consecutively` >14)

datatable(`Top Two Contenders`)

With these two left, we can pick the better home run hitter. I think we can ignore their career totals, since they are reasonably close to one another. On AB per HR, Babe Ruth has a slight edge over Barry Bonds. On SO per HR, the values are too close to separate, so we’ll count them as equal. Next we have the consecutive-season variables. For the 20 variable they are very close, so once again we cannot really call a winner. The 30 variable is where this gets interesting: there is a huge gap between the two, and the point goes to Barry Bonds. Next comes something odd: the 40 variable goes to Babe Ruth, so a point goes to him. On the 50 variable, Babe Ruth very slightly beats Barry Bonds. And finally, on the 60 variable, we have a tie.

Now, what can we say from here? That really depends on who you ask. Personally, I would have to pick Babe Ruth. His AB per HR is slightly lower than Barry Bonds’s. And across the consecutive-season variables, Babe Ruth’s values follow a clear trend, while Barry Bonds’s follow an odd one.

Summary

We covered a lot with the Batting data set, and we finally got to answer our question: who is the best home run hitter of all time? We started by finding the top home run hitters for single seasons and for whole careers, then merged those lists to get our top contenders. Next, we totaled their career home runs, at bats, and strikeouts to compute the ratios of at bats per home run and strikeouts per home run. The lower these were, the better for our players. The final computation was the longest run of consecutive seasons at each home run threshold, which was possible with the function we wrote. With this information laid out, others can decide what matters most to them and come to their own conclusions. But, in my humble opinion, Babe Ruth is the best home run hitter of all time.