In 2002, the Oakland Athletics, under general manager Billy Beane and assistant general manager Paul DePodesta, adopted an unorthodox method of building and managing a baseball team. “The Moneyball Theory,” as it is most commonly called, is the idea, devised by Bill James, that putting players on base at a higher rate leads to more runs, which in turn translates to more wins. Beane put all his faith in the untested theory in 2002, after three of his best players signed with other ballclubs for more money than Beane offered them.
By 2002, the Oakland Athletics were the third-poorest ballclub in MLB, with a payroll of $40 million, whereas the New York Yankees’ payroll that year was almost three times as large, at $112 million. Beane believed that success in baseball was closely tied to winning offseason bidding wars and having a bigger wallet to pay the best players, so he recognized the need to adopt practices that many considered outlandish in order to win games.
In this project, I test a couple of Beane’s theories and their relevance to today’s game. I used a dataset of MLB team statistics from 2012 through 2018 to see whether these theories hold in a more recent period.
The dataset is a compilation of MLB team statistics from 2012 through 2018. It contains team averages and totals for most of the major stat categories in baseball, including Batting Average, On Base Percentage, Slugging Percentage, Hits, Home Runs, Earned Run Average, and Stolen Bases. In addition, the dataset provides a salary total for each team, which is helpful for comparing payrolls and measuring how payrolls grew over the period.
First, I want to test whether there is a correlation between a team’s payroll and its wins. Based on the strength of that correlation, I can judge whether MLB is unfair in the sense that teams with more money hold an edge over their opponents.
The correlation between wins and salary was not as strong as I expected, as shown by the wide scatter of points on the scatterplot; however, some correlation still appears to be present.
Along the x and y axes of the scatterplot lie histograms that show the distribution of values in the salary and wins columns of the dataset. The salary histogram (along the bottom) is right-skewed, meaning that most team-seasons have relatively low payrolls while a few have very high ones. Strong skew in a salary histogram that pools several years of data can indicate either inflation during the period or an overall year-over-year increase in salaries.
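To put numbers on what the scatterplot and histogram suggest, one option is to compute the Pearson correlation coefficient and a skewness estimate. The sketch below is only illustrative: it hard-codes ten team-seasons copied from the dataset preview shown in this post instead of loading the full 210-row file, so its output will not match the full-data figures. The name `sample` is my own.

```python
import pandas as pd

# Ten team-seasons copied from the dataset preview in this post
# (five from 2018, five from 2012); the full dataset has 210 rows.
sample = pd.DataFrame({
    "Team":   ["ARI", "ATL", "BAL", "BOS", "CHC", "STL", "TBR", "TEX", "TOR", "WSN"],
    "salary": [143324597, 130649395, 127633703, 227398860, 194259933,
               120461369, 70242330, 138226346, 97293922, 98256813],
    "W":      [82, 90, 47, 108, 95, 88, 90, 93, 73, 98],
})

# Pearson correlation between payroll and wins (-1 to 1)
r = sample["salary"].corr(sample["W"])
# r squared: share of the variance in wins explained by a linear fit on payroll
r_squared = r ** 2

# Sample skewness of payroll: positive values mean a longer right tail
salary_skew = sample["salary"].skew()

print(f"r = {r:.2f}, r^2 = {r_squared:.2f}, salary skew = {salary_skew:.2f}")
```

With the full dataset loaded as `df`, the same calls would be `df["salary"].corr(df["W"])` and `df["salary"].skew()`.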
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings; warnings.filterwarnings(action = 'ignore')
import ipywidgets as widgets
from IPython.display import display
import matplotlib.patches as mpatches
import matplotlib.image as mpimg
import statistics
from matplotlib.ticker import FuncFormatter
# load only the columns used in this project
path = "/Users/mike/Desktop/"
filename = path + "mlbdataset.csv"
df = pd.read_csv(filename, usecols = ['salary', 'W', 'TeamName', 'HR', 'L', 'ERA', 'ER', 'WAR', 'OBP', 'RBI', 'DefEff', 'W-L%', 'SB', 'SO', 'SLG', 'BA', 'E'])
# shared font sizes and figure size for all plots
large = 22
med = 16
small = 12
params = {'legend.fontsize': med,
          'figure.figsize': (16,10),
          'axes.labelsize': med,
          'axes.titlesize': med,
          'xtick.labelsize': med,
          'ytick.labelsize': med,
          'figure.titlesize': large}
plt.rcParams.update(params)
# on Matplotlib 3.6 and later this style is named 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
sns.set_style("white")
# VISUALIZATION 1: SCATTERPLOT WITH MARGINAL HISTOGRAMS
fig = plt.figure(figsize = (16,10), dpi = 80)
grid = plt.GridSpec(4,4, hspace = 0.5, wspace = 0.2)
ax_main = fig.add_subplot(grid[:-1, :-1])
ax_right = fig.add_subplot(grid[:-1, -1], xticklabels = [], yticklabels = [])
ax_bottom = fig.add_subplot(grid[-1, 0:-1], xticklabels = [], yticklabels = [])
# create scatterplot
ax_main.scatter(df.salary, df.W)
# create histogram for salary (drawn upside down below the scatterplot)
ax_bottom.hist(df.salary, 40, histtype = 'stepfilled', orientation = 'vertical', color = 'deeppink')
ax_bottom.invert_yaxis()
# create histogram for wins
ax_right.hist(df.W, 40, histtype = 'stepfilled', orientation = 'horizontal', color = 'deeppink')
ax_main.set(title = 'Scatterplot of Wins by Salary', xlabel = "Team Salary Payroll (in Millions of Dollars)", ylabel = "Total Wins")
ax_main.title.set_fontsize(20)
# label the x axis in millions of dollars instead of raw dollars
ax_main.xaxis.set_major_formatter(FuncFormatter(lambda x, pos: '%1.0f' % (x*1e-6)))
plt.show()
I measured the change in MLB payrolls as a whole between 2012 and 2018 with a stacked barplot. Given the strong skew in the salary histogram, I thought it important to analyze how total salary in MLB changed over time.
Between 2012 and 2018, as the stacked barplot shows, total payroll increased substantially, from around $3.25 billion in 2012 to $4 billion in 2018. While the stacked bar plot is helpful for understanding how the league as a whole progressed in a statistical category over a period of time, it does not compare individual teams against each other well. To test whether the change in overall payrolls during the period caused team payrolls to diverge or converge, I coded a stacked line plot of all 30 teams. To see a line plot of the change in salary for each MLB team between 2012 and 2018, click the dropdown menu and select ‘Stacked Line Plot’.
In 2012 and 2015, there is a clear front runner in the salary category: the New York Yankees and Kansas City Royals, respectively, sit well above the rest of the league. After 2015, however, the variance in salary decreases, and by 2018 the teams’ payrolls are much closer together. A team with a higher payroll than its competitors still holds an advantage, but that advantage gradually shrank between 2012 and 2018.
Additionally, the stacked line plot shows that team salaries do not change in the same way from team to team. Some teams kept their payroll roughly constant throughout the period, shown by a nearly horizontal line, while others increased it significantly, like the Kansas City Royals, who peaked at the top of MLB in 2015, the year they won the World Series. The line plot also demonstrates that there is not always a consistent top dog in terms of payroll. Just because the New York Yankees are worth the most financially of all MLB teams does not mean that they always pay their players the most. It is important to note how the payroll rankings changed during this period. While there is a correlation worth noting between salary and win percentage, as illustrated by the scatterplot above, MLB remains a fair league because the leaders in salary output change from year to year.
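League-wide growth and the narrowing spread between teams can both be checked numerically. The following is a minimal sketch on made-up payrolls for three hypothetical teams (`A`, `B`, `C`), not real MLB figures; it uses the coefficient of variation (standard deviation divided by mean) as one possible measure of how far apart payrolls are within a season.

```python
import pandas as pd

# Made-up payrolls (in dollars) for three hypothetical teams; NOT real MLB data
toy = pd.DataFrame({
    "Year":   [2012, 2012, 2012, 2018, 2018, 2018],
    "Team":   ["A", "B", "C", "A", "B", "C"],
    "salary": [200e6, 100e6, 60e6, 180e6, 150e6, 120e6],
})

by_year = toy.groupby("Year")["salary"]

# League-wide payroll per season and its growth from 2012 to 2018
total = by_year.sum()
growth = total.loc[2018] / total.loc[2012] - 1

# Coefficient of variation per season: a smaller value means payrolls sit closer together
cv = by_year.std() / by_year.mean()

print(f"league payroll growth: {growth:.0%}")
print(cv.round(2))
```

Dividing by the mean matters here because total payroll rises over time; comparing raw standard deviations across seasons would mix the spread between teams with league-wide growth.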
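Because the line plot is crowded, it can help to read the payroll leader and each team’s rank per season directly from a table. This is a small sketch on made-up numbers for three hypothetical teams (not real MLB data), showing one way to do that with `pivot`, `idxmax`, and `rank`.

```python
import pandas as pd

# Made-up payrolls for three hypothetical teams; NOT real MLB data
toy = pd.DataFrame({
    "Year":   [2012, 2012, 2012, 2015, 2015, 2015, 2018, 2018, 2018],
    "Team":   ["A", "B", "C"] * 3,
    "salary": [200e6, 100e6, 60e6, 150e6, 170e6, 90e6, 160e6, 150e6, 165e6],
})

# Wide table: one row per season, one column per team
wide = toy.pivot(index="Year", columns="Team", values="salary")

# Team with the largest payroll in each season
leader = wide.idxmax(axis=1)

# Payroll rank within each season (1 = highest payroll)
ranks = wide.rank(axis=1, ascending=False).astype(int)

print(leader)
print(ranks)
```

If `leader` contains several different teams across seasons, no single team dominates payroll, which is the pattern the paragraph above describes.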
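# split column 'TeamName' (e.g. "2018 ARI") at the first space to get Year and Team
df2 = df['TeamName'].str.split(" ", n = 1, expand = True)
df['Year'] = df2[0]
df['Team'] = df2[1]
# preview the data without 'TeamName'; df itself keeps the column because a later plot uses it
df.drop(columns = 'TeamName', inplace = False)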
| | DefEff | E | W | L | W-L% | ERA | ER | HR | SO | RBI | SB | BA | OBP | SLG | salary | WAR | Year | Team |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.698 | 75 | 82 | 80 | 0.506 | 3.72 | 605 | 174 | 1448 | 658 | 79 | 0.235 | 0.310 | 0.397 | 143324597 | 34.1 | 2018 | ARI |
1 | 0.709 | 80 | 90 | 72 | 0.556 | 3.75 | 607 | 153 | 1423 | 717 | 90 | 0.257 | 0.324 | 0.417 | 130649395 | 40.8 | 2018 | ATL |
2 | 0.674 | 104 | 47 | 115 | 0.290 | 5.18 | 824 | 234 | 1203 | 593 | 81 | 0.239 | 0.298 | 0.391 | 127633703 | 11.4 | 2018 | BAL |
3 | 0.693 | 77 | 108 | 54 | 0.667 | 3.75 | 608 | 176 | 1558 | 829 | 125 | 0.268 | 0.339 | 0.453 | 227398860 | 56.5 | 2018 | BOS |
4 | 0.700 | 104 | 95 | 68 | 0.583 | 3.65 | 598 | 157 | 1333 | 722 | 66 | 0.258 | 0.333 | 0.410 | 194259933 | 45.0 | 2018 | CHC |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
205 | 0.686 | 107 | 88 | 74 | 0.543 | 3.71 | 603 | 134 | 1218 | 732 | 91 | 0.271 | 0.338 | 0.421 | 120461369 | 40.8 | 2012 | STL |
206 | 0.704 | 114 | 90 | 72 | 0.556 | 3.19 | 518 | 139 | 1383 | 665 | 134 | 0.240 | 0.317 | 0.394 | 70242330 | 45.6 | 2012 | TBR |
207 | 0.694 | 85 | 93 | 69 | 0.574 | 3.99 | 639 | 175 | 1286 | 780 | 91 | 0.273 | 0.334 | 0.446 | 138226346 | 44.9 | 2012 | TEX |
208 | 0.694 | 101 | 73 | 89 | 0.451 | 4.64 | 745 | 204 | 1142 | 677 | 123 | 0.245 | 0.309 | 0.407 | 97293922 | 28.2 | 2012 | TOR |
209 | 0.702 | 94 | 98 | 64 | 0.605 | 3.33 | 543 | 129 | 1325 | 688 | 105 | 0.261 | 0.322 | 0.428 | 98256813 | 45.7 | 2012 | WSN |
210 rows × 18 columns
# VISUALIZATION 2 and 3: STACKED BARPLOT and STACKED LINE PLOT
# create dropdown menu
dropdown1 = widgets.Dropdown(
    options = ['Stacked Bar Plot', 'Stacked Line Plot'],
    description = 'View Type:'
)

def Plot_Type(valuetype):
    # total payroll for each team in each year
    salary_df = df.groupby(['Year', 'Team'])['salary'].sum().reset_index(name = 'Salary')
    fig = plt.figure(figsize = (18, 18))
    ax = fig.add_subplot(1,1,1)
    # create colors - most using team colors (due to deficiency of selectivity and in order to maximize variety of colors on plot)
    my_colors = {'ARI': 'darkred',
                 'ATL': 'turquoise',
                 'BAL': 'goldenrod',
                 'BOS': 'red',
                 'CHC': 'blue',
                 'CHW': 'black',
                 'CIN': 'firebrick',
                 'CLE': 'aquamarine',
                 'COL': 'purple',
                 'DET': 'midnightblue',
                 'HOU': 'darkblue',
                 'KCR': 'powderblue',
                 'LAA': 'crimson',
                 'LAD': 'dodgerblue',
                 'MIA': 'slategray',
                 'MIL': 'gold',
                 'MIN': 'thistle',
                 'NYM': 'orange',
                 'NYY': 'navy',
                 'OAK': 'green',
                 'PHI': 'darkorchid',
                 'PIT': 'greenyellow',
                 'SDP': 'brown',
                 'SEA': 'rebeccapurple',
                 'SFG': 'palevioletred',
                 'STL': 'plum',
                 'TBR': 'skyblue',
                 'TEX': 'darkseagreen',
                 'TOR': 'steelblue',
                 'WSN': 'pink',
                 }
    # create stacked line plot
    if valuetype == "Stacked Line Plot":
        # group by the column name (not a list) so each key is a team abbreviation, not a tuple
        for key, grp in salary_df.groupby('Team'):
            grp.plot(ax = ax, kind = 'line', x = 'Year', y = 'Salary', color = my_colors[key], label = key, marker = '8')
        plt.title('MLB Team Salary Payroll by Year\n', fontsize = 18)
        ax.set_xlabel('\nYear', fontsize = 18)
        ax.set_ylabel('Total Salary Payroll\n', fontsize = 18, labelpad = 20)
        ax.tick_params(axis = 'x', labelsize = 14, rotation = 0)
        ax.tick_params(axis = 'y', labelsize = 14, rotation = 0)
        ax.set_xticks(np.arange(8))
        handles, labels = ax.get_legend_handles_labels()
        plt.legend(handles, labels, loc = 'best', fontsize = 14, ncol = 1)
        # show payrolls in millions of dollars
        ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos: ('$%1.1fM')%(x*1e-6)))
    # create stacked bar plot
    elif valuetype == "Stacked Bar Plot":
        # reshape to one row per year and one column per team
        salary_df = salary_df.pivot(index = 'Year', columns = 'Team', values = 'Salary')
        salary_df.plot(kind = 'bar', stacked = True, color = my_colors, ax = ax)
        plt.ylabel('\nTotal Salary Payroll\n', fontsize = 18, labelpad = 10)
        plt.title('Total MLB Salary Payroll by Year\nStacked Bar Plot\n')
        plt.xticks(rotation = 0, horizontalalignment = 'center', fontsize = 14)
        plt.yticks(rotation = 0, fontsize = 14)
        ax.set_xlabel('\nYear', fontsize = 18)
        handles, labels = ax.get_legend_handles_labels()
        plt.legend(handles, labels, loc = 'best', fontsize = 14, ncol = 4)
        # show the league total in billions of dollars
        ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos: ('$%1.1fB')%(x*1e-9)))

# call function
widgets.interact(Plot_Type, valuetype = dropdown1)
I used the following correlogram to examine three of Beane’s ideas:
First, I want to test various statistics that Beane built into his team’s system. To measure the legitimacy of the Moneyball theorem, I did not simply measure the correlation between OBP and Wins. I also compared it to the correlation between Wins and other statistics that many teams weigh more heavily when scouting and analyzing players, like BA (Batting Average), HR (Home Runs), and SLG (Slugging Percentage).
When Beane signed Scott Hatteberg, an unathletic catcher riddled with injuries that had essentially ruined his defensive skills, to play first base - a position he had never played before - many thought he was crazy. The thinking behind Beane’s decision to place Hatteberg at first was that Hatteberg, like many of the other players Beane signed, had a high career On Base Percentage and was overlooked, so the price to sign him was very low. To measure whether the decision to sign Hatteberg - which was rooted in his high OBP and ignored his poor defense - was sound, I compared the correlation coefficients between win percentage (W-L%) and two different variables: DefEff (Defensive Efficiency) and OBP (On Base Percentage).
Beane’s Moneyball practice put high emphasis on putting players on base, and because a failed steal takes a runner off the bases and produces an out for your own team, Beane strongly advised his players against stealing. To test whether Beane’s advice was sound, I used the correlogram below to show the correlation between Stolen Bases and both Runs and Win Percentage. If Stolen Bases correlates significantly with both Runs and Wins, that would suggest Beane’s advice was mistaken and that teams would be better off stealing more often.
# VISUALIZATION 5: CORRELOGRAM OF TEAM STATISTICS
# create new data frame with the statistics used in the correlogram
heatMapdf = pd.read_csv(filename, usecols = ['W-L%', 'WAR','BA', 'OBP', 'SLG', 'DefEff', 'SB', 'R', 'HR'])
# compute the pairwise Pearson correlation matrix once and reuse it
corr = heatMapdf.corr()
plt.figure(figsize = (12, 10), dpi = 100)
sns.heatmap(corr, xticklabels = corr.columns, yticklabels = corr.columns, cmap = 'YlGnBu', center = 0, annot = True)
plt.title('Correlogram of MLB Team Statistics (2012-2018)\n')
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.show()
The correlogram shows that, of the four statistics OBP, BA, HR, and SLG, the one with the greatest correlation with win percentage (“W-L%”) is OBP (On Base Percentage). OBP has a correlation coefficient of .59, which is substantial given that 0 represents no correlation and 1 represents a perfect correlation. By contrast, BA, SLG, and HR have correlation coefficients of .46, .39, and -.38, all smaller in magnitude than that of OBP.
According to the correlogram, the correlation between Defensive Efficiency and Win Percentage is 0.49, which is a solid correlation. However, the correlation between OBP and Win Percentage is 0.59, which supports Billy Beane’s decision to put more emphasis on players with high OBP than on players with high Defensive Efficiency. In Beane’s case, he had to sacrifice one for the other, and because high OBP is more strongly correlated with a high win percentage, Beane made the smart decision.
Stolen Bases (SB) has a slightly negative, near-zero correlation with both Win Percentage and Runs (-0.036 and -0.019, respectively), so stealing more bases was not associated with scoring or winning more. This is consistent with Beane’s view that the team would be no worse off by not stealing bases.
Because the relatively strong correlation between OBP and Win Percentage in the correlogram above supported the Moneyball theorem, I wanted to rank MLB teams from 2012 through 2018 by OBP and then look at their wins, in order to see how the teams with the best OBP perform.
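The single column of the correlogram that matters for these comparisons, each statistic against W-L%, can also be pulled out and ranked directly. The sketch below is an illustration only: it hard-codes the ten team-seasons visible in the dataset preview earlier in this post, so the coefficients it prints will differ from the full 210-row values quoted above. The names `stats` and `ranked` are my own.

```python
import pandas as pd

# Ten team-seasons copied from the dataset preview in this post (not the full dataset)
stats = pd.DataFrame({
    "W-L%":   [0.506, 0.556, 0.290, 0.667, 0.583, 0.543, 0.556, 0.574, 0.451, 0.605],
    "OBP":    [0.310, 0.324, 0.298, 0.339, 0.333, 0.338, 0.317, 0.334, 0.309, 0.322],
    "BA":     [0.235, 0.257, 0.239, 0.268, 0.258, 0.271, 0.240, 0.273, 0.245, 0.261],
    "SLG":    [0.397, 0.417, 0.391, 0.453, 0.410, 0.421, 0.394, 0.446, 0.407, 0.428],
    "DefEff": [0.698, 0.709, 0.674, 0.693, 0.700, 0.686, 0.704, 0.694, 0.694, 0.702],
    "SB":     [79, 90, 81, 125, 66, 91, 134, 91, 123, 105],
})

# Pearson correlation of every other column with win percentage
with_wins = stats.drop(columns="W-L%").corrwith(stats["W-L%"])

# Strongest positive correlation first
ranked = with_wins.sort_values(ascending=False)
print(ranked.round(2))
```

On the full data, the equivalent would be `heatMapdf.drop(columns="W-L%").corrwith(heatMapdf["W-L%"])`. A correlation coefficient describes association only, so it cannot by itself show that raising OBP causes more wins.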
Of the top 50 teams measured by OBP, 41 are above the mean win total. As a second component of the bar chart, I added a color scale to show how the teams compare in terms of salary. The top 50 teams by OBP are split fairly evenly among the subgroups “Above Average Payroll”, “Average Payroll”, and “Below Average Payroll”. This suggests that MLB teams can improve their chances of winning games without overspending in the free agent market by putting more emphasis on getting on base than on hitting home runs.
However, this graph also demonstrates that although OBP and Win Percentage are correlated, even some of the teams with the highest OBP fail to win many games. While the Moneyball theorem tells us that a high OBP can help teams win, it takes more to win games than simply stacking a team with players who can get on base.
Applying this knowledge to today’s MLB, the New York Yankees have one of the highest payrolls in MLB, but fall short every year in terms of wins to the Tampa Bay Rays. The Yankees, a team that constantly puts up big numbers in the HR and RBI categories - the categories many people associate with talented players - have a top 5 strikeout percentage in MLB, and as a result, a lower OBP than they could achieve. If the Yankees reduced their strikeout percentage, they could greatly increase their OBP and elevate their status in MLB.
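The two counts in this paragraph (how many top-OBP teams beat the mean win total, and how those teams split by payroll) can be computed rather than read off the chart. Below is a minimal sketch using the ten team-seasons from the dataset preview in this post and the top 5 by OBP instead of the top 50. The tier labels and the use of `pd.cut` are my own choices; as in the post’s coloring, each payroll is compared with the average of the selected teams.

```python
import pandas as pd

# Ten team-seasons copied from the dataset preview in this post (not the full dataset)
sample = pd.DataFrame({
    "Team":   ["ARI", "ATL", "BAL", "BOS", "CHC", "STL", "TBR", "TEX", "TOR", "WSN"],
    "OBP":    [0.310, 0.324, 0.298, 0.339, 0.333, 0.338, 0.317, 0.334, 0.309, 0.322],
    "W":      [82, 90, 47, 108, 95, 88, 90, 93, 73, 98],
    "salary": [143324597, 130649395, 127633703, 227398860, 194259933,
               120461369, 70242330, 138226346, 97293922, 98256813],
})

mean_wins = sample["W"].mean()

# Top 5 by OBP here; the post uses the top 50 of 210 team-seasons
top = sample.nlargest(5, "OBP")
n_above = int((top["W"] > mean_wins).sum())

# Payroll tier relative to the average payroll of the selected teams
# (bins are right-inclusive, so exactly 90% of the average counts as "below")
ratio = top["salary"] / top["salary"].mean()
tier = pd.cut(ratio, bins=[0, 0.9, 1.1, float("inf")],
              labels=["below", "within 10%", "above"])

print(f"{n_above} of {len(top)} top-OBP teams won more than the mean of {mean_wins:.1f}")
print(tier.value_counts())
```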
# pick colors for each salary subgroup
def pick_colors_according_to_salary(df):
    colors = []
    avg = df.salary.mean()
    for each in df.salary:
        if each > avg*1.10:
            # more than 10% above the average salary
            colors.append('cyan')
        elif each < avg*0.90:
            # more than 10% below the average salary
            colors.append('cadetblue')
        else:
            # within 10% of the average salary
            colors.append('purple')
    return colors

# VISUALIZATION 6: SORTED BARPLOT OF WINS BY TOP 50 TEAMS
# sort rows by highest OBP
sorteddf = (df.sort_values('OBP', ascending = False))
# pick out the top 50 rows by position rather than by hard-coded index labels
d1 = sorteddf.head(50)
my_colors1 = pick_colors_according_to_salary(d1)
# create plot
plt.figure(figsize = (18,10), dpi = 100)
plt.bar(range(len(d1.TeamName)), d1.W, color = my_colors1)
plt.xticks(range(len(d1.TeamName)), d1.TeamName, rotation = 'vertical')
plt.ylabel("Total Season Wins", fontsize = 20)
plt.xlabel("\n Top 50 in terms of OBP\n(Sorted Highest OBP to Lowest OBP - Left to Right)")
plt.title("Wins of Top 50 MLB Teams Measured by OBP (2012-2018)\n", fontsize = 20)
# dashed line at the mean win total of all teams in the dataset
plt.axhline(df.W.mean(), color = 'black', linestyle = 'dashed')
plt.text(52, 81, 'Mean: 81', fontsize=14, va='center', ha='center', backgroundcolor='w')
# customize legend (labels match the colors assigned in pick_colors_according_to_salary)
Above = mpatches.Patch(color = 'cyan', label = 'Above Average Salary')
At = mpatches.Patch(color = 'purple', label = 'Within 10% of the Average Salary')
Below = mpatches.Patch(color = 'cadetblue', label = 'Below Average Salary')
plt.legend(handles = [Above, At, Below], fontsize = 14)
plt.show()
In 2002, Billy Beane took a huge gamble by following the advice of former scout and newly hired assistant general manager, Paul DePodesta. Beane and DePodesta’s approach arose from the idea that they needed to build their team differently from the richest teams in baseball, because those teams held a significant financial advantage over Oakland. Judging from the scatterplot of salary and wins of MLB teams between 2012 and 2018, there appears to be some correlation; however, the correlation is not as strong as I imagined from the way Beane described the unfairness in MLB.
To check this conclusion further, I made a stacked line plot of the player salaries for each MLB team in every year between 2012 and 2018. However much I edited the code to expand the plot, it was hard to tell which line was which because the plot was filled with intersections between lines, indicating that the payrolls MLB teams invest in are not fixed from year to year. While a couple of teams sit around the top 10 consistently, like the Kansas City Royals and New York Yankees, there is no consistent dominator in the salary category in MLB, and the salary disparity, as shown by the spread of y values in the 2018 column of the line plot, has decreased, reducing the edge that big market teams have over the poorer franchises.
Twenty years after Beane encountered and conquered the issue of salary disparity in MLB, the issue is no longer as formidable. However, Beane and DePodesta’s wisdom is still relevant in today’s game. Using a correlogram and an ordered bar chart, I tested the actual Moneyball theory: that teams that succeed in getting players on base are more successful. The theory is supported statistically, as On Base Percentage has a greater correlation with Wins than Batting Average, Slugging Percentage, and Home Runs. Additionally, Beane argued that keeping players on base is just as important as putting them on base, and that stealing can hurt a team’s chances of keeping players on base and winning games. The data is consistent with this claim as well, as Stolen Bases has a slight negative correlation with Win Percentage. The ordered bar chart tested the potential logical fallacy that, because OBP and Wins are correlated, the teams with the best OBP will win the most games. As it shows, although a higher OBP improves a team’s chances of winning games, it does not necessarily mean that the teams with the greatest OBP will perform the best.
Overall, by testing the Moneyball theorem and other theories developed and practiced by Beane and DePodesta against a more modern set of data, we can recognize that the Moneyball theorem still holds true today. Beane’s work, which involved experimenting with and testing theories that were labeled radical and outlandish, shows that even in the game of baseball the biggest upper hand is knowledge. Data mining gives us an upper hand in finding knowledge that not even the richest teams in baseball could match, and that knowledge is still verifiable, even in today’s baseball.