Analyzing MLB Offensive Trends Using R

Author

Adam Lamping

Introduction

Baseball is a sport that breaks down naturally into numbers, which makes it a good subject for scraping, manipulating, and analyzing data. This document uses data scraped from the web to identify offensive trends in MLB since 2000 and to determine which statistics correlate most strongly with runs scored.

The Data

All of the data I am using comes from baseball-reference.com. Baseball Reference is a database of statistics, standings, and scores from throughout baseball history.

The specific data comes from Baseball Reference’s yearly team stats pages, which compile statistics for every team in Major League Baseball for a given season. I use the offensive stats for each season since 2000. An example of one of these pages, along with definitions of all variables included in my data, can be found at the following link: https://www.baseball-reference.com/leagues/majors/2024.shtml

This data is suitable for our question because it allows a comprehensive look at all of Major League Baseball. It provides nearly every offensive stat we will need, and any others can be calculated from the information on Baseball Reference. I chose to look specifically at 2000 to the present in order to analyze more recent trends, as the game has continued to evolve.

To retrieve this data, I wrote an R function that scrapes the page for each season I wanted. The function then combines the seasons into a single data frame that is easy to use.

A few changes were made to the data as it was scraped to make it easier to use later. The total rows and row labels were removed during scraping to avoid errant calculations. Using HTML scraping techniques, a “year” variable was added to identify the season each team’s stats come from. Finally, every variable except team name was converted to numeric.
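The scraping and cleaning steps described above might look roughly like the sketch below. It is not the original function: the helper names (clean_season, scrape_season, scrape_all), the table id "teams_standard_batting", the row labels being dropped, and the use of the rvest package are all assumptions made for illustration. The URL pattern is taken from the example link above.

```r
# Hypothetical helper: tidy one season's raw batting table.
# `raw` is the table as scraped; the team-name column is assumed to be "Tm".
clean_season <- function(raw, year) {
  raw <- as.data.frame(raw, stringsAsFactors = FALSE)
  # Make column names valid R identifiers (e.g. "2B" becomes "X2B").
  names(raw) <- make.names(names(raw))
  # Drop repeated header rows and the unlabeled totals row (assumed labels).
  raw <- raw[!is.na(raw$Tm) & !(raw$Tm %in% c("", "Tm")), , drop = FALSE]
  # Convert every column except the team name to numeric.
  stat_cols <- setdiff(names(raw), "Tm")
  raw[stat_cols] <- lapply(raw[stat_cols], function(x) as.numeric(as.character(x)))
  # Record which season the rows belong to.
  raw$year <- year
  rownames(raw) <- NULL
  raw
}

# Hypothetical scraper for one season; requires the rvest package and network access.
scrape_season <- function(year) {
  url <- sprintf("https://www.baseball-reference.com/leagues/majors/%d.shtml", year)
  page <- rvest::read_html(url)
  # The table id below is an assumption about the page's HTML.
  table_node <- rvest::html_element(page, "#teams_standard_batting")
  clean_season(rvest::html_table(table_node), year)
}

# Combine several seasons into one data frame.
scrape_all <- function(years) {
  do.call(rbind, lapply(years, scrape_season))
}

# Example call (not run here because it needs network access):
# team_stats <- scrape_all(2000:2024)
```

Keeping the cleaning step separate from the download makes it possible to check the cleaning logic on a small hand-made table without touching the website.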

Data Transformation

Now that we have our data, a few transformations are needed before the analysis. A new variable called R_Quartile is added to the data set. It takes values from 1 to 4 and identifies the quartile a team falls in based on runs scored. For example, a team in the top 25% of runs scored has an R_Quartile of 4.
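One way to build a quartile variable like R_Quartile is sketched below in base R. The runs values are made up, and the choice to compute the quartile breaks over the whole data set rather than within each season is an assumption, since the text does not say which was used.

```r
# Made-up runs totals for eight team-seasons (illustration only).
team_stats <- data.frame(
  R = c(600, 650, 700, 750, 800, 850, 900, 950)
)

# Quartile break points of runs scored: 0%, 25%, 50%, 75%, 100%.
run_breaks <- quantile(team_stats$R, probs = seq(0, 1, by = 0.25))

# Assign each row to a quartile from 1 (fewest runs) to 4 (most runs).
# include.lowest = TRUE keeps the minimum value in quartile 1.
team_stats$R_Quartile <- cut(
  team_stats$R,
  breaks = run_breaks,
  include.lowest = TRUE,
  labels = FALSE
)

print(team_stats)
```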

Additionally, Baseball Reference includes a league average row for every year in its data. To keep this from affecting our calculations, we remove every league average row. We also remove all data from the 2020 season, which was shortened due to COVID-19. As a result, many statistics for 2020 differ significantly from other years and could skew the results.
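The two removals described above can be sketched as a single filter. The column names Tm and year, the label "League Average", and the values in the toy data frame are assumptions for illustration.

```r
# Made-up rows standing in for the scraped data (illustration only).
team_stats <- data.frame(
  Tm = c("Team A", "League Average", "Team A", "Team B"),
  year = c(2019, 2019, 2020, 2021),
  R = c(800, 750, 290, 710),
  stringsAsFactors = FALSE
)

# Keep only real teams, and drop the shortened 2020 season.
is_league_avg <- team_stats$Tm == "League Average"  # assumed row label
is_2020 <- team_stats$year == 2020
filtered_stats <- team_stats[!is_league_avg & !is_2020, ]

print(filtered_stats)
```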

More transformations will be made as the data is grouped during the analysis. Each will be identified later in the document where it is used.

Correlations to Runs

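One of the most useful things we can learn from this data is which offensive statistics are most correlated with runs scored. This tells us which areas of offense teams should focus on in order to score more runs over a season. Below is a list of every variable’s correlation with runs scored. In the output, X2B and X3B are doubles and triples, OPS. is OPS+, and X.Bat is the number of batters used; these names were altered because the originals are not valid R variable names.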

       Variable Correlation
R             R  1.00000000
RBI         RBI  0.99653793
OPS         OPS  0.94593099
SLG         SLG  0.91533930
TB           TB  0.91511670
OBP         OBP  0.85323672
PA           PA  0.77130672
BA           BA  0.71366831
H             H  0.71238882
OPS.       OPS.  0.70923543
HR           HR  0.66368658
BB           BB  0.60384731
X2B         X2B  0.59427436
AB           AB  0.55878095
SF           SF  0.45278929
LOB         LOB  0.41968555
BatAge   BatAge  0.23968271
IBB         IBB  0.22660783
HBP         HBP  0.17749046
G             G  0.09879168
X3B         X3B  0.07931275
SB           SB  0.00468850
CS           CS -0.07390926
SH           SH -0.07835237
X.Bat     X.Bat -0.21228103
SO           SO -0.23629057

This shows us which variables are most correlated with runs. The most correlated variable is RBI, or runs batted in. This makes sense because every RBI corresponds to a run scored. Since RBIs essentially measure runs and are situational in nature, they are often left out of baseball analysis. As a result, we will highlight OPS, which has the next highest correlation, as one of the best measures for predicting runs.
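A table like the one above can be produced by correlating every numeric column with runs and sorting the result. The sketch below uses base R and a small made-up data frame; it is not the original code, and the values are for illustration only.

```r
# Made-up team-season rows (illustration only).
team_stats <- data.frame(
  Tm = c("A", "B", "C", "D", "E", "F"),
  R = c(650, 700, 720, 760, 800, 850),
  OPS = c(0.700, 0.715, 0.730, 0.745, 0.760, 0.790),
  SO = c(1400, 1350, 1380, 1300, 1250, 1200),
  stringsAsFactors = FALSE
)

# Pick out the numeric columns; the team name is skipped.
numeric_cols <- names(team_stats)[sapply(team_stats, is.numeric)]

# Pearson correlation of each numeric column with runs scored.
correlations <- sapply(
  numeric_cols,
  function(col) cor(team_stats[[col]], team_stats$R)
)

# Arrange as a table sorted from most to least correlated.
cor_table <- data.frame(
  Variable = names(correlations),
  Correlation = unname(correlations)
)
cor_table <- cor_table[order(-cor_table$Correlation), ]

print(cor_table)
```

Runs correlate perfectly with themselves, so R always appears at the top of such a table with a correlation of 1.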

OPS is on-base percentage plus slugging percentage. On-base percentage measures how often a player reaches base, and slugging percentage is total bases divided by at-bats, so it works like a batting average weighted by the number of bases gained on each hit. OPS is widely used in the baseball world and, setting RBI aside, is the statistic most correlated with runs in this analysis, so it will be used to identify other offensive trends.
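To make the definition concrete, the sketch below computes OBP, SLG, and OPS from counting stats using the standard formulas, OBP = (H + BB + HBP) / (AB + BB + HBP + SF) and SLG = TB / AB. The input numbers are made up for illustration, and the function name is hypothetical.

```r
# Hypothetical helper: compute OBP, SLG, and OPS from counting stats.
compute_ops <- function(H, BB, HBP, AB, SF, TB) {
  obp <- (H + BB + HBP) / (AB + BB + HBP + SF)  # times on base per opportunity
  slg <- TB / AB                                # total bases per at-bat
  c(OBP = obp, SLG = slg, OPS = obp + slg)
}

# Made-up season totals for one team (illustration only).
example_ops <- compute_ops(H = 1400, BB = 500, HBP = 60, AB = 5500, SF = 40, TB = 2300)

print(round(example_ops, 3))
```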

OPS by Quartile

Next, we want to examine what makes the good teams so good. This is where we will use the quartile variable created earlier. First, we will group the data by quartile. From there we can see how teams in each quartile vary by certain statistics, in this case, OPS.
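A plot of OPS by run quartile could be drawn along the following lines with base R graphics. The data is made up, and the use of boxplot() from base R rather than another plotting package is an assumption.

```r
# Made-up team-season rows (illustration only).
team_stats <- data.frame(
  R = c(600, 650, 700, 750, 800, 850, 900, 950),
  OPS = c(0.690, 0.700, 0.715, 0.725, 0.740, 0.750, 0.770, 0.790)
)

# Quartile of runs scored, from 1 (fewest) to 4 (most).
team_stats$R_Quartile <- cut(
  team_stats$R,
  breaks = quantile(team_stats$R, probs = seq(0, 1, by = 0.25)),
  include.lowest = TRUE,
  labels = FALSE
)

# One box per quartile, showing the spread of team OPS within it.
boxplot(
  OPS ~ R_Quartile,
  data = team_stats,
  xlab = "Run quartile (1 = lowest, 4 = highest)",
  ylab = "Team OPS",
  main = "OPS by run quartile"
)

# The same grouping summarized numerically: median OPS per quartile.
median_ops <- tapply(team_stats$OPS, team_stats$R_Quartile, median)
print(median_ops)
```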

Above is a boxplot showing the distribution of OPS for teams in each run quartile. Teams in the highest quartile, which score the most runs, tend to have the highest OPS. This is consistent with the correlation analysis from earlier and once again suggests that OPS is a stat major league teams should focus on.

Team Success

Up to this point, several trends and correlations have been examined. To get a better picture of offensive trends since 2000, one final question is which teams have been the best at scoring runs over that period. To answer it, we group by team, and from there we can examine which teams have been the most and least successful at scoring runs since 2000.
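The grouping by team could be sketched as follows in base R. The team names and runs values are made up, and summarizing each team by its mean runs per season (rather than, say, total runs) is an assumption, since the text does not specify the summary used.

```r
# Made-up team-season rows (illustration only).
team_stats <- data.frame(
  Tm = c("Team A", "Team B", "Team C", "Team A", "Team B", "Team C"),
  year = c(2022, 2022, 2022, 2023, 2023, 2023),
  R = c(800, 700, 650, 820, 720, 610),
  stringsAsFactors = FALSE
)

# Mean runs per season for each team.
runs_by_team <- aggregate(R ~ Tm, data = team_stats, FUN = mean)

# Sort from highest-scoring to lowest-scoring team.
runs_by_team <- runs_by_team[order(-runs_by_team$R), ]

# Column chart of the result; las = 2 rotates the team labels.
barplot(
  runs_by_team$R,
  names.arg = runs_by_team$Tm,
  las = 2,
  ylab = "Mean runs per season",
  main = "Mean runs scored by team"
)

print(runs_by_team)
```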

Above is a column chart showing which teams have been the best at scoring runs since 2000. Teams like the Red Sox, Yankees, and Angels have been the best at scoring runs, while teams like the Marlins, Pirates, and Padres have been the worst. This graph helps us identify trends in offensive production by team, as we can see which teams have produced the most since 2000.

Conclusion

Using the data retrieved from baseball-reference.com, we were able to discover which offensive statistics correlate with runs and to identify trends in MLB over the past 24 years. We found that OPS is a very telling statistic in terms of offensive success. We saw that offensive production has seemingly gone down since 2000, with a small bump in recent years. Finally, we saw which teams have been the most successful at scoring runs over the past 24 years. This data from Baseball Reference allowed us to learn a lot about modern-day baseball.