Variable    Correlation with runs (R)
R            1.00000000
RBI          0.99653793
OPS          0.94593099
SLG          0.91533930
TB           0.91511670
OBP          0.85323672
PA           0.77130672
BA           0.71366831
H            0.71238882
OPS.         0.70923543
HR           0.66368658
BB           0.60384731
X2B          0.59427436
AB           0.55878095
SF           0.45278929
LOB          0.41968555
BatAge       0.23968271
IBB          0.22660783
HBP          0.17749046
G            0.09879168
X3B          0.07931275
SB           0.00468850
CS          -0.07390926
SH          -0.07835237
X.Bat       -0.21228103
SO          -0.23629057
Analyzing MLB Offensive Trends Using R
Introduction
Baseball is a sport that can easily be broken down into numbers, which makes it a natural fit for scraping, manipulating, and analyzing data. This document uses data scraped from the web to identify offensive trends in Major League Baseball (MLB) since the year 2000 and to determine which statistics correlate most strongly with runs scored.
The Data
All of the data I am using comes from baseball-reference.com. Baseball Reference is a database of stats, standings, and scores from throughout baseball history.
The specific data I am using comes from Baseball Reference’s yearly team stats pages. On these pages, Baseball Reference compiles statistics for every team in Major League Baseball for that season. My data comes specifically from the offensive stats for each season since the year 2000. An example of one of these pages, as well as definitions of all variables included in my data, can be found at the following link: https://www.baseball-reference.com/leagues/majors/2024.shtml
This data is suitable for our question because it allows for a comprehensive look at all of Major League Baseball. It provides virtually every offensive stat that we will need, and any others can be calculated from the information on Baseball Reference. I chose to look specifically at 2000 to the present in order to analyze more recent trends, as the game has continued to evolve and develop.
To retrieve this data, I created a function in R that scrapes the website for each season that I wanted data on. This function then combines everything into one data frame that is easy to use.
It should be noted that there were a few changes made to the data as it was scraped to make it easier for future use. The total rows and row labels were removed in the scraping process to avoid any errant calculations. Using HTML scraping techniques, a “year” variable was added to identify the year that team’s stats were from. Finally, the data types were corrected and changed to numeric for every variable except for team name.
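The scraping and cleaning steps described above might look roughly like the following. This is a minimal sketch, not the function used for this report: the helper names clean_season and scrape_seasons, the team column name "Tm", the "League Total" label, and the CSS selector are all assumptions about the page layout, and the scraper relies on the rvest package.

```r
# Hypothetical helper: tidy one season's raw batting table (all columns as text).
# Assumes the team column is called "Tm" and that repeated row labels and the
# totals row can be recognised by the value in that column.
clean_season <- function(raw, year, team_col = "Tm") {
  drop <- raw[[team_col]] %in% c("", team_col, "League Total")
  out <- raw[!drop, , drop = FALSE]

  # Every column except the team name becomes numeric.
  stat_cols <- setdiff(names(out), team_col)
  out[stat_cols] <- lapply(out[stat_cols], function(x) as.numeric(as.character(x)))

  # Record which season the rows came from.
  out$year <- year
  rownames(out) <- NULL
  out
}

# Hypothetical scraper: one page per season, stacked into a single data frame.
# The selector is an assumption; rbind also assumes every season has the same columns.
scrape_seasons <- function(years, selector = "table#teams_standard_batting") {
  pages <- lapply(years, function(yr) {
    url <- sprintf("https://www.baseball-reference.com/leagues/majors/%d.shtml", yr)
    tbl <- rvest::html_table(rvest::html_element(rvest::read_html(url), selector))
    # as.data.frame() makes names syntactic, e.g. "2B" becomes "X2B".
    raw <- as.data.frame(lapply(tbl, as.character), stringsAsFactors = FALSE)
    clean_season(raw, yr)
  })
  do.call(rbind, pages)
}
```

Under these assumptions, a call such as scrape_seasons(2000:2024) would return one row per team per season. Pausing between requests (for example with Sys.sleep) is a sensible courtesy to the site.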
Data Transformation
Now that we have our data, it is time to start analyzing it. A few transformations must be made first. A new variable called R_Quartile will be added to the data set. This variable ranges from 1 to 4 and identifies the quartile that a team falls in based on runs scored. For example, a team in the top 25% of runs scored will have an R_Quartile of 4.
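One way to build such a quartile variable is with quantile() and cut(). The sketch below uses made-up run totals and assumes the quartiles are computed across all rows at once; the text does not say whether the original calculation was done overall or season by season.

```r
# Toy data: hypothetical run totals, not real team figures.
teams <- data.frame(Tm = paste("Team", 1:8),
                    R  = c(610, 655, 690, 720, 745, 780, 815, 860))

# Quartile break points of runs scored (0%, 25%, 50%, 75%, 100%).
breaks <- quantile(teams$R, probs = seq(0, 1, by = 0.25))

# Label each team 1 (bottom 25%) through 4 (top 25%).
teams$R_Quartile <- cut(teams$R, breaks = breaks,
                        labels = FALSE, include.lowest = TRUE)
```

Setting include.lowest = TRUE keeps the lowest-scoring team in quartile 1 instead of giving it a missing value.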
Additionally, Baseball Reference includes a league average row every year in its data. To keep this from affecting our calculations, we will remove every league average from our data. We will also remove all data from the 2020 season, which was shortened due to COVID-19. As a result, many statistics for the 2020 season are significantly different from other years and may skew the results.
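Both removals amount to a simple row filter. The sketch below uses toy rows, and the label "League Average" and the column names Tm and year are assumptions about how the scraped data is labeled.

```r
# Toy rows with hypothetical values.
stats <- data.frame(
  Tm   = c("Team A", "League Average", "Team A", "Team B"),
  year = c(2019, 2019, 2020, 2021),
  R    = c(750, 720, 280, 705)
)

# Keep only real teams from full-length seasons.
keep <- stats$Tm != "League Average" & stats$year != 2020
stats_clean <- stats[keep, ]
```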
Further transformations will be made as the data is grouped for each part of the analysis. These are identified later in the document, at the point where they are used.
Correlations to Runs
One of the most useful things we can learn from this data is which offensive statistics are most correlated with runs scored. This tells us which areas of offense teams should focus on in order to score more runs throughout the season. The correlation table lists every variable's correlation with runs scored (R).
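A table like this can be produced by correlating every numeric column with R and sorting the result. The sketch below uses toy values and assumes Pearson correlation, which is the default for cor(); the text does not state which method was used.

```r
# Toy data with hypothetical values; the real data has one row per team-season.
stats <- data.frame(
  Tm  = paste("Team", 1:6),
  R   = c(650, 700, 720, 760, 800, 850),
  OPS = c(0.690, 0.715, 0.725, 0.740, 0.765, 0.790),
  SO  = c(1450, 1380, 1400, 1300, 1320, 1250)
)

# Correlate every numeric column with runs scored.
num_cols <- stats[sapply(stats, is.numeric)]
cors <- cor(num_cols, num_cols$R, use = "complete.obs")[, 1]

# Arrange from most to least correlated.
cor_table <- data.frame(Variable = names(cors), Correlation = unname(cors))
cor_table <- cor_table[order(-cor_table$Correlation), ]
```

R always appears first with a correlation of 1, since it is being compared with itself.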
This shows us which variables are most correlated with runs. Setting aside R itself, the most correlated variable is RBI, or runs batted in. This makes sense because every RBI corresponds to a run scored. Since RBIs essentially measure runs and are situational in nature, they are often left out of baseball analysis. As a result, we will highlight OPS, which has the next-highest correlation, as one of the best measures for predicting runs.
OPS is on-base percentage plus slugging percentage. On-base percentage is the percentage of the time that a player gets on base, and slugging percentage is total bases divided by at-bats, in effect a batting average that weights each hit by the number of bases it earns. OPS is widely used in the baseball world and, apart from RBI, is the statistic most correlated with runs in this analysis, so it will be used to identify other offensive trends.
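To make the definition concrete, here is the arithmetic using the standard OBP and SLG formulas. The season totals are invented for illustration and do not belong to any real team.

```r
# Hypothetical season totals for one team (illustrative numbers only).
H <- 1400; BB <- 500; HBP <- 60; AB <- 5500; SF <- 40; TB <- 2300

# On-base percentage: times on base per qualifying plate appearance.
OBP <- (H + BB + HBP) / (AB + BB + HBP + SF)

# Slugging percentage: total bases per at-bat.
SLG <- TB / AB

# OPS is simply the sum of the two.
OPS <- OBP + SLG
round(c(OBP = OBP, SLG = SLG, OPS = OPS), 3)
# OBP 0.321, SLG 0.418, OPS 0.739
```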
Year by Year Trends in Significant Stats
The next part of my analysis looks at how the MLB has changed over the past 24 years. By looking at significant stats and how they have varied by year, we can identify trends in offensive production and style. To do this, the data is grouped by year and the offensive categories are totaled or averaged to examine how things change over time. Several scatterplots show how different offensive statistics have changed from year to year.
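The yearly grouping could be sketched as follows with base R. The values are toy numbers, and the choice to total counting stats (R, H) while averaging rate stats (OPS) across teams is an assumption about how each category was summarized.

```r
# Toy team-season rows with hypothetical values.
stats <- data.frame(
  year = c(2000, 2000, 2001, 2001),
  R    = c(900, 800, 850, 790),
  H    = c(1550, 1480, 1500, 1460),
  OPS  = c(0.800, 0.760, 0.780, 0.750)
)

# Counting stats are totaled per year; rate stats are averaged across teams.
totals   <- aggregate(cbind(R, H) ~ year, data = stats, FUN = sum)
averages <- aggregate(OPS ~ year, data = stats, FUN = mean)
yearly   <- merge(totals, averages, by = "year")

# One scatterplot per statistic, for example total runs by year.
plot(yearly$year, yearly$R, xlab = "Year", ylab = "Total runs", main = "Runs by year")
```

Note that the mean of team OPS values is only an approximation of league-wide OPS, since teams have different numbers of plate appearances.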
These graphs say a lot about how the MLB has changed over the past two decades. Most notably, offense is down from where it was 20 years ago. As the runs graph shows, the total number of runs scored each year has trended downward. However, there was a jump around 2015, and the same jump appears in other categories as well. This could possibly be attributed to a new wave of rule changes over the past decade. As the MLB continues to try to make the game more offensive and exciting, it will be interesting to see whether these trends continue.
Hits have been in steady decline for a number of years, which also points to a decrease in offensive productivity over the past 20 years. This could be a result of a greater focus on power hitting and a lesser focus on contact hitting. Hitters who are more concerned with hitting for power make contact less often, which could account for this trend.
OPS is one of the most important stats in baseball, as we have already identified. It shows the same decline, with a bump around 2015 similar to the one in runs, which also points to a decrease in offensive production.
OBP, or on-base percentage, has suffered a similar fate to OPS. This makes sense, since OBP is used to calculate OPS. With today’s game focused more on home runs and less on contact hitting, hitters have gotten on base less often.
Finally, strikeouts may be the most telling offensive trend in today’s game. Hitters strike out more than they did 24 years ago. Again, this could be explained by a shift in strategies and attitudes about hitting. Whatever the cause, it suggests that players are making contact less often.
Overall, these graphs show us that offensive production is not what it used to be. Maybe pitchers are better, hitters are worse, or strategies have simply changed. Whatever the cause, as the MLB continues to push for more offensive production, it will be interesting to see whether these trends continue.
OPS by Quartile
Next, we want to examine what makes the good teams so good. This is where we use the R_Quartile variable created earlier. First, we group the data by quartile. From there, we can see how teams in each quartile differ on a given statistic, in this case OPS.
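A quartile comparison like this might be sketched as below. The OPS values are toy numbers, and the data frame is assumed to already contain the R_Quartile column.

```r
# Toy data with hypothetical values; R_Quartile is assumed to exist already.
stats <- data.frame(
  R_Quartile = rep(1:4, each = 3),
  OPS = c(0.680, 0.695, 0.705, 0.715, 0.725, 0.735,
          0.740, 0.750, 0.760, 0.770, 0.785, 0.800)
)

# Median OPS within each run quartile.
ops_by_quartile <- tapply(stats$OPS, stats$R_Quartile, median)

# Boxplot of the OPS distribution for each quartile.
boxplot(OPS ~ R_Quartile, data = stats, xlab = "Run quartile", ylab = "OPS")
```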
The boxplot shows the distribution of OPS for teams in each run quartile. Teams in the highest quartile, which score the most runs, tend to have the highest OPS. This is consistent with the correlation analysis from earlier, and it once again suggests that OPS is a stat that major league teams should focus on.
Team Success
Up to this point, many different trends and correlations have been examined. To round out the picture of offensive trends since 2000, the final question is which teams have been the best at scoring runs over that period. To answer it, we group the data by team and examine which teams have been the most and least successful at scoring runs.
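Grouping by team could be sketched as follows. The rows are toy data, and ranking teams by total runs (as opposed to average runs per season) is an assumption about how success was measured.

```r
# Toy team-season rows with hypothetical values.
stats <- data.frame(
  Tm = c("Team A", "Team B", "Team A", "Team B", "Team C", "Team C"),
  R  = c(800, 700, 820, 690, 750, 760)
)

# Total runs per team across all seasons, highest first.
team_runs <- aggregate(R ~ Tm, data = stats, FUN = sum)
team_runs <- team_runs[order(-team_runs$R), ]

# Column chart of total runs by team.
barplot(team_runs$R, names.arg = team_runs$Tm, las = 2, ylab = "Total runs")
```

With real data, franchises that changed names or cities during the period would need to be combined under a single label before grouping.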
The column chart shows which teams have been the best at scoring runs since 2000. Teams like the Red Sox, Yankees, and Angels have been the best at scoring runs, while teams like the Marlins, Pirates, and Padres have been the worst. This graph helps us identify trends in offensive production by team, as it shows which teams have produced the most since 2000.
Conclusion
Using the data retrieved from baseball-reference.com, we discovered which offensive statistics correlate most strongly with runs and identified trends in the MLB over the past 24 years. We found that OPS is a very telling statistic in terms of offensive success. We saw that offensive production has seemingly gone down since 2000, but with a small bump in more recent years. Finally, we saw which teams have been the most successful at scoring runs over the past 24 years. This data from Baseball Reference allowed us to learn a lot about modern-day baseball.