For my project I will be analyzing Major League Baseball statistics to determine a statistical method for evaluating whether or not a player is over-valued. I will then take this system and apply it to the Philadelphia Phillies’ roster to recommend actions they may be able to take this season to unload players who are over-valued or unproductive.
Baseball analytics is now widely used throughout Major League Baseball for assessing player transactions as well as analyzing player performance. The Society for American Baseball Research (SABR) records data and also studies effective ways of quantifying a player’s impact on a team (“The SABR Story”). These analytical techniques have helped level the playing field for lower budget teams by allowing them to find good value in players based solely off of their statistics, without any bias from what many scouts refer to as the “eye-test,” which can skew the perception of how well a baseball player actually performs.
The difference in payroll between the top and bottom major league baseball teams is a staggering $173,160,088 (“MLB Team Payroll Tracker”), which makes the ability for lower budget teams to find good value talent extremely important. Without these analysis tools, the higher budget teams could monopolize the league talent and perennially win out their divisions. This lack of competition could likely decrease baseball viewership and game attendance for other teams, which could, in turn, hurt the league by an overall reduction in viewership, game attendance, and merchandise sales.
These concepts were popularized in the 2000s by Michael Lewis’ book Moneyball (2003) which was adapted as a movie in 2011. Lewis’ book draws frequently from ideas formulated by Bill James, who is credited with beginning the interest in baseball statistics and analysis with his annual publication started in 1977 (Lewis 2003).
I have been a Phillies fan for my entire life, however they are historically not a very good team. They have more losses than any franchise in major league baseball history and are one of only five teams to have reached the 10,000 loss benchmark. The others are the Atlanta Braves, Chicago Cubs, Pittsburgh Pirates, and Cincinnati Reds (“Top 10 MLB Teams with the Most Overall Losses”). Philadelphia seems to be in need of a rebuilding. The Phillies currently have the worst record in Major League Baseball and are (as of 7/21/2017) 24 1/2 games back in the NL East.
The data I have drawn from come mainly from two online sources. Spotrac.com provides salary information along with basic performance statistics for all MLB players. These data will help determine a collective value that can be placed on player productivity. I would also like to bring in an advanced metric of player performance called “Wins Above Replacement” or WAR for my final analysis. WAR data will be drawn from ESPN’s website.
Data Sources
To begin my analysis, the first variables I will consider are OPS (On-Base percentage plus slugging percentage) and salary information for all major league hitters. I will explore these data to see first, how strong is the correlation between OPS and salary for hitters and second, how to place a monetary value on increases in OPS. Looking at the 5 number summary of the hitter data can help with this process by breaking down the OPS data into four intervals. Either an average or median salary from each of these OPS quartiles can then be found. Looking at a boxplot overlayed with the OPS data points can show how spread out these quartiles are and if more intervals might be needed to get a better representation in the groupings. Analyzing a scatter plot of the Salary vs OPS data can show if there is a correlation as well as the nature and strength of that correlation and a trendline can be layered on top of the scatter plot to further highlight any pattern. If the spread of the quartiles is too wide, the data can be broken down into smaller percentile ranges as needed. (NOTE: For OPS, higher numbers are better for a hitter and for WHIP lower numbers are better for a pitcher.)
After analyzing the league as a whole, I will narrow my focus to just one team, the Philadelphia Phillies. Using the data from OPS and salary information, I will see if any Phillies players are being paid disproportionately more money than other players in their quartile (or percentile range). I can also visualize this information with a conditional scatter plot showing where any players are being paid above a certain threshold.
As a final measure, I will compute the league average salary and league average for W.A.R. Using the value ratio between salary and W.A.R. I will see if any Phillies players are being paid disproportionately high on this metric as well. By using two different, independent productivity measures, I should be able to explicitly isolate those target players (if there are any) that are shown to be overpaid on both of the two different scales.
As Henry Ford said, “Failure is just a resting place. It is an opportunity to begin again more intelligently.” The Phillies are definitely at a resting place now and needing to start again more intelligently. The team currently has the worst record in Major League Baseball and they need to find a way to start moving in the right direction. Catching the wave of “analytics in baseball”" a little late seems to have hurt the team, but they are now becoming increasingly interested in bringing in more analysis to their personnel decisions (George 2016). I believe my analysis will either confirm that they are making good value choices in their talent, which should pay off over the next few years, or it will point out opportunities to make changes and find better value talent.
George, John “Phillies Looking for More Math Geeks” bizjournals.com Philadelphia Business Journal, 21 OCT 2016. Web. 19 JUL 2017. https://www.bizjournals.com/philadelphia/news/2016/10/ 21/phillies-math-geeks-analytics-sabermetrics.html
Lewis, Michael Moneyball New York: Norton, 2003. Print
“MLB 2017 Payroll Tracker” Spotrac Web. 19 JUL 2017
“The SABR Story” SABR.org SABR, Web. 19 JUL 2017 www.sabr.org/about
“Top Ten MLB Teams with the Most Overall Losses” The Sportster 13 JAN 2015. Web. 19 JUL 2017 http://www.thesportster.com/baseball/top-10-mlb-teams- with-the-most-overall-losses/
“Wins Above Replacement 2017” ESPN.com ESPN. Web. 19 JUL 2017.http://www.espn.com/mlb/war/leaders/_/year /2017/type/seasonal/alltime/false/count/1
I used the XML package for its readHTMLTable function. I also use dplyr functions in the cleaning process and ggplot for visualizations, so both the XML package and tidyverse packages are loaded upfront here. Knitr was loaded to use the kable function for displaying tables. I also used the dataMeta package for creating a data dictionary for each data set, and the DT package for displaying interactive tables. I had to use the separate function from the reshape package to separate the player column on the original OPS data set into two columns with Player and Position.
library("tidyverse")
library("XML")
library("knitr")
library("dataMeta")
library("DT")
library("reshape")
Using the XML::readHTMLTable function I brought in salary and batting statistics from spotrac.com. I used the same function to import Wins Above Replacement Data from ESPN. The W.A.R. data on ESPN is separated on four different pages, so I repeated this same process four times. Complete W.A.R. and OPS data sets are shown below. The justification for using OPS as the measure of offensive productivity is that it is widely accepted by baseball statisticians as an accurate measure of a player’s contributions to a team’s offense (Lewis 2003, pg. 80).
After bringing in the data using readHTMLTable, the data is stored as a list. The standard as.data.frame function seemed to be the most forgiving option for transforming the list into a usable data frame. The most challenging issue with the resulting data frame was dealing with the salary data. Salary had to be imported as a character variable because the values contained dollar signs and commas. Using a mutate function, in combination with a gsub and as.numeric, gave me a new salary variable that was a numeric class. The original table had several extra offensive statistics that were not going to be considered in my analysis. Using the dplyr select and rename functions, the data was whittled down to include only Player, Team, OPS, and Salary.
The W.A.R. data was brought into RStudio using the same XML::readHTMLTable function that was used for the OPS data. The difference with the W.A.R. data was that the table was split between 4 separate pages. I imported each page separately, then combined them using a series of full_join functions. Similar to the OPS data, I had a variable (WAR) that was assigned as character, but needed to be numeric. I fixed this with a mutate function, adding a new variable that served as a numeric version of the WAR column. I then used the select function to choose only the variables for Player and W.A.R.
After the W.A.R. and OPS data were in R, I joined them, using inner_join, by player to get a set of data that included OPS, W.A.R., and Salary.
variable_name | variable_description | variable_options |
---|---|---|
Bin | OPS Percentile Bin ranging from 1 - 10 | 1 to 10 |
OPS | On-Base percentage plus slugging percentage | 0.477 to 0.918 |
Player | Player name | A.J to Zack Cozart |
Pos | Player Position | 1B to SS |
Salary | Salary in US Dollars | 535000 to 28166666 |
Team | Player’s Team | ARI to WSH |
variable_name | variable_description | variable_options |
---|---|---|
Player | Player name | Aaron Hicks to Zack Greinke |
W_A_R | Wins Above Replacement | 1.82 to 6.91 |
The first time I tried to visualize the data for Salary and OPS, I used a scatter plot with ggplot + geom point just to see if there was a strong or clear correlation. If there was a clear correlation, it would make creating a player valuation system pretty easy. However, this plot just showed a general cluster, even when using a log transformation on salary, because there is so much spread in the salary data, and clearly not a very strong correlation between salary and OPS. Even without this correlation, it is still possible to find a baseline for what a player should be paid based off of his production.
To get a better picture of comparison, I used OPS “bins” that split players into 10 percentile ranges. I did comparison boxplots so that the median as well as the general spread of the data in each level would be clear. In a very broad way, the median salary in each bin is typically going up as we move into higher production ranges. However, there is a huge amount of spread within each bin and fluctuations occur on a daily basis as players have good and bad games over the summer. The median salary in the 51 - 60 percentile range has been well above all the other bin medians, except for the top bin.
I also pulled the numbers out of these plots using the summarise and group_by functions. The table below shows the median, minimum, and maximum salary for each OPS bin. Looking at the table and the boxplots shows the interesting outlier again, with one of the highest paid groups being between the 51 and 60 percentile in OPS. In addition, there is at least one player in each bin being paid close to the league minimum ($535,000) and at least one player in each bin being paid close to or above $20 Million.
The lack of strong correlation is actually a positive sign for the purpose of this analysis. It means that there are both unproductive players being overpaid that can be dropped, and also that there are talented players who can be picked up at a reasonable salary for a lower-budget team.
Bin | Median | Max | Min |
---|---|---|---|
1 | 893750 | 26000000 | 535000 |
2 | 1750000 | 16000000 | 535000 |
3 | 700000 | 20428571 | 535000 |
4 | 5200000 | 28166666 | 535000 |
5 | 2250000 | 18000000 | 535000 |
6 | 4000000 | 28000000 | 535000 |
7 | 4200000 | 17200000 | 535000 |
8 | 2500000 | 22500000 | 535000 |
9 | 4000000 | 22125000 | 535000 |
10 | 5800000 | 23000000 | 535000 |
After setting this standard for appropriate pay based on performance “bins”, I filtered the data to include only players from the Philadelphia Phillies and used an ifesle function to make a recommendation based on whether or not the player was paid above the median salary for his particular OPS level. (NOTE: If a player was paid above the median, the recommendation is “TRADE”, if not the recommendation is “KEEP”) This valuation system made four recommendations for trades out of the 10 position players included in the data set, Freddy Galvis, Hyun-soo Kim, Andres Blanco, and Cesar Hernandez.
Player | Pos | Team | OPS | Salary | Bin | Recommendation |
---|---|---|---|---|---|---|
Freddy Galvis | SS | PHI | 0.606 | 4350000 | 3 | Trade |
Hyun-soo Kim | LF | PHI | 0.597 | 4200000 | 2 | Trade |
Andres Blanco | SS | PHI | 0.481 | 3000000 | 1 | Trade |
Cesar Hernandez | 2B | PHI | 0.711 | 2550000 | 8 | Trade |
Odubel Herrera | CF | PHI | 0.661 | 1600000 | 5 | Keep |
Daniel Nava | LF | PHI | 0.779 | 1350000 | 10 | Keep |
Cameron Rupp | C | PHI | 0.623 | 565000 | 3 | Keep |
Maikel Franco | 3B | PHI | 0.557 | 560000 | 1 | Keep |
Tommy Joseph | 1B | PHI | 0.583 | 543000 | 2 | Keep |
Aaron Altherr | RF | PHI | 0.714 | 538500 | 8 | Keep |
After creating this valuation and recommendation system for OPS, I wanted to use a similar system bringing in an advanced player metric called “Wins Above Replacement”, which measures how many more games a team has won with a particular player than having an “average” player in the same spot. Using this second measure is a good way to double-check if a player is over-valued. Wins above replacement data is not available for every player in MLB because not every player will necessarily produce an extra win for his team (some players are average or below average, commonly referred to as “scrubs”).
I started the W.A.R. evaluation by joining the salary data with the W.A.R. data for players who had W.A.R. information. I used this data to find a league average of “cost per additional win” that a player provides. The plot below shows the W.A.R. and Salary data, with the condition that players who are paid over the cost per win ratio are highlighted as “overpaid”. The plot looks fairly evenly split between players paid over that ratio and players paid at or below that ratio. Similar to OPS, there does not seem to be an explicit trend between W.A.R. and Salary, but the ratio that other teams are paying for wins can still help set a benchmark value.
Avg_salary | Avg_WAR | cost_per_win |
---|---|---|
6280570 | 3.276293 | 1916974 |
Using this same cost per W.A.R. ratio, I checked the Phillies’ lineup again to see if any players were being paid to much money for not enough additional production. Using another ifelse statement, the table below calls players being paid above the ratio “Overpaid” and calls players being paid at or below the ratio “OK”. The Phillies only two players in the W.A.R. data set are being paid appropriately for the number of additional wins they are producing (like the table says, they’re OK). This table also included Cesar Hernandez, who was overpaid based on OPS, so it might be valuable to keep him based on other factors that are included in W.A.R.
Player | Pos | Team | OPS | Salary | Bin | W_A_R | Cost_per_win | Rec |
---|---|---|---|---|---|---|---|---|
Cesar Hernandez | 2B | PHI | 0.711 | 2550000 | 8 | 2.39 | 1916974 | OK |
Odubel Herrera | CF | PHI | 0.661 | 1600000 | 5 | 2.69 | 1916974 | OK |
The purpose of this analysis was to create a statistical method for evaluating how a player should be paid based on his performance and then use that method to make personnel recommendations to help lower-budget teams become more competitive. The team I chose to use for this system was the Philadelphia Phillies, who currently have the worst record in Major League Baseball.
I focused this system on the offensive statistic OPS (On-Base Plus Slugging) because many baseball theorists consider this to be the most accurate measure of a player’s offensive impact (Lewis 2003, pg. 80). The data I used for salary and OPS come from the SPOTRAC website.
I took the original data and broke it down into 10 percentile “Bins” by OPS (where 1 is the lowest performing bin and 10 is the highest). I used the median salary for each bin as the benchmark for whether a player is overpaid based off of his performance, with the idea being that if a player is paid above the median in his bin, there are a good number of players available who are just as good, but cost the team less money. Finally, I used an ifelse function to compare Phillies players’ salaries to the median in their bins and make a recommendation to keep or trade each player.
To get a second opinion, I also incorporated a similar system using Wins Above Replacement, using data from ESPN. Unlike OPS, W.A.R. also takes defensive skills into account in its computation, so it could give additional insight for deciding whether a player is truly worth their salary. For the W.A.R. data, the benchmark I used for evaluation was the average league ratio of Salary per Win Above Replacement. Only two Phillies had W.A.R. data available, and both were paid appropriately for their extra value-added.
Although the data set continues to update and change on a daily basis, this analysis highlights four potential money-saving personnel moves the Phillies could make to stretch their budget further without sacrificing production. In addition, it confirms that most of their regular rotation of players are being paid at an appropriate level for their performance.
The next step that is needed to build on this analysis, is to create a method for identifying players that are within the Phillies’ budget that they could trade for. Setting up a system using dplyr functions could filter the data and find players in a certain performance range who meet certain salary criteria to replace overpaid players on the current roster.
# Load tidyverse, XML and other packages #
library("tidyverse")
library("XML")
library("knitr")
library("dataMeta")
library("DT")
library("reshape")
####### This section is for WRANGLING ops data #######
#Grab data table from website and save as a LIST #
Salary_Data_Raw <-
readHTMLTable("http://www.spotrac.com/mlb/statistics/player/",
colClasses = c("character", "character",
"numeric", "numeric",
"numeric", "numeric",
"numeric", "numeric",
"numeric", "numeric",
"numeric", "numeric",
"numeric", "numeric",
"numeric", "numeric",
"character"),
header = TRUE)
#Let's see how that turned out...#
#head(Salary_Data_Raw)#
#Format is key, list ---> data frame #
All_table <- as.data.frame(Salary_Data_Raw,
stringsAsFactors = FALSE)
# Let's look at Player, OPS, Team, and Salary #
clean_OPS_data <- select(All_table,
NULL.Player,
NULL.Team,
NULL.OPS,
NULL.Salary) %>%
#Try to add a new variable called Salary #
#Also convert team and player to character #
mutate(Salary = as.numeric(gsub("[$,]", "", NULL.Salary))) %>%
mutate(Team = as.character(NULL.Team)) %>%
#Now select only NULL.Player, NULL.Team, NULL.OPS, and Salary)#
select(NULL.Player, Team, NULL.OPS,Salary) %>%
#Rename the "NULLS" #
dplyr::rename(OPS = NULL.OPS) %>%
separate(col = NULL.Player, into = c("Player", "Pos"), sep = "[.]", extra = "merge")
# head(clean_OPS_data) #
OPS_bins <- mutate(clean_OPS_data,
Bin = ntile(OPS, 10))
#save(clean_OPS_data, OPS_bins, file = #
#"C:/Users/csj23/Documents/MSBA/Summer 2017/Intro# #to R/Final Project Data/OPS_data.RData") #
####Bring in the W.A.R data so that it can be joined#####
#Page 1#
WAR_Data_Raw1 <-
readHTMLTable("http://www.espn.com/mlb/war/leaders",
header = TRUE, skip.rows = 1)
All_table_WAR1 <- as.data.frame(WAR_Data_Raw1,
stringsAsFactors = FALSE)
EX_WAR1 <- select(All_table_WAR1,
"NULL.PLAYER",
"NULL.WAR") %>%
dplyr::rename(Player = NULL.PLAYER,
WAR = NULL.WAR) %>%
transform(WAR = as.factor(WAR)) %>%
mutate(W_A_R = as.numeric(as.character(WAR)))
# EX_WAR1 #
# Page 2 #
WAR_Data_Raw2 <-
readHTMLTable("http://www.espn.com/mlb/war/leaders/_/year/2017/type/seasonal/alltime/false/count/51",
header = TRUE, skip.rows = 1)
All_table_WAR2 <- as.data.frame(WAR_Data_Raw2,
stringsAsFactors = FALSE)
EX_WAR2 <- select(All_table_WAR2,
"NULL.PLAYER",
"NULL.WAR") %>%
dplyr::rename(Player = NULL.PLAYER,
WAR = NULL.WAR) %>%
transform(WAR = as.factor(WAR)) %>%
mutate(W_A_R = as.numeric(as.character(WAR)))
# EX_WAR2 #
#Page 3#
WAR_Data_Raw3 <-
readHTMLTable("http://www.espn.com/mlb/war/leaders/_/year/2017/type/seasonal/alltime/false/count/101",
header = TRUE, skip.rows = 1)
All_table_WAR3 <- as.data.frame(WAR_Data_Raw3,
stringsAsFactors = FALSE)
EX_WAR3 <- select(All_table_WAR3,
"NULL.PLAYER",
"NULL.WAR") %>%
dplyr::rename(Player = NULL.PLAYER,
WAR = NULL.WAR) %>%
transform(WAR = as.factor(WAR)) %>%
mutate(W_A_R = as.numeric(as.character(WAR)))
# EX_WAR3 #
#Page 4#
WAR_Data_Raw4 <-
readHTMLTable("http://www.espn.com/mlb/war/leaders/_/year/2017/type/seasonal/alltime/false/count/151",
header = TRUE, skip.rows = 1)
All_table_WAR4 <- as.data.frame(WAR_Data_Raw4,
stringsAsFactors = FALSE)
EX_WAR4 <- select(All_table_WAR4,
"NULL.PLAYER",
"NULL.WAR") %>%
dplyr::rename(Player = NULL.PLAYER,
WAR = NULL.WAR) %>%
transform(WAR = as.factor(WAR)) %>%
mutate(W_A_R = as.numeric(as.character(WAR)))
# EX_WAR4 #
# Get all 4 pages into one table #
WAR_Data <- full_join(EX_WAR1, EX_WAR2) %>%
full_join(EX_WAR3) %>%
full_join(EX_WAR4) %>%
select(Player, W_A_R)
# WAR_Data #
# WAR data is ready, inner join it with Clean OPS #
OPS_WAR_data <- inner_join(OPS_bins,
WAR_Data,
by = "Player")
#OPS_WAR_data#
#save(OPS_WAR_data, WAR_Data, #
#file = "C:/Users/csj23/Documents/MSBA/Summer# #2017/Intro to R/Final Project Data/WAR_data.RData")#
# Create a data set with only Phillies' players #
# Put Bin Median in as extra variable #
Phils <- OPS_bins %>%
filter(Team == "PHI") %>%
inner_join(Bin_Summary, by = "Bin") %>%
dplyr::rename("Bin Median Salary" = "median(Salary, na.rm = TRUE)")
#########Exploratory Analysis###########################
# Compare salary by OPS percentile bins #
ggplot(OPS_bins, aes(x = Bin, y = Salary, fill = as.character(Bin))) +
geom_boxplot() +
ggtitle("Salary by OPS Ranking") +
xlab("OPS Percentile Bin (1 - 10)") +
ylab("Salary") + guides(fill = guide_legend("OPS Bin"))
# Spread of salary data makes it hard to see pattern #
OPS_bins %>%
mutate(log_salary = log(Salary)) %>%
ggplot(aes(x = OPS, y = Salary)) +
geom_point() +
ggtitle("OPS vs Salary for MLB Players") +
xlab("OPS") +
ylab("Salary")
# Looking at median within each bin as a measuring #
# stick for whether or not a player is over-paid #
Bin_Summary <- as_tibble(group_by(OPS_bins, Bin) %>%
summarise(median(Salary, na.rm = TRUE), max(Salary), min(Salary)))
# Make a recommendation based off of bin median #
# Trade if over bin medain, keep if not #
Recs <- Phils %>%
mutate(Recommendation =
ifelse(Salary > `Bin Median Salary`, "Trade", "Keep")) %>%
select(Player, Pos, Team, OPS, Salary, Bin, Recommendation)
#Recs#
kable(Bin_Summary,
col.names = c("Bin",
"Median",
"Max",
"Min"),
caption = "Salary Data by OPS Bin")
Bin | Median | Max | Min |
---|---|---|---|
1 | 893750 | 26000000 | 535000 |
2 | 1750000 | 16000000 | 535000 |
3 | 700000 | 20428571 | 535000 |
4 | 5200000 | 28166666 | 535000 |
5 | 2250000 | 18000000 | 535000 |
6 | 4000000 | 28000000 | 535000 |
7 | 4200000 | 17200000 | 535000 |
8 | 2500000 | 22500000 | 535000 |
9 | 4000000 | 22125000 | 535000 |
10 | 5800000 | 23000000 | 535000 |
# WAR Section #
# Summary Stats Table for WAR #
# Add this onto WAR table #
Cost_per_WAR <- OPS_WAR_data %>%
summarise(Avg_salary = mean(Salary),
Avg_WAR = mean(W_A_R)) %>%
mutate(cost_per_win = Avg_salary/Avg_WAR)
# Scatter WAR and Salary #
# With and Without log transformation #
# Not a clear pattern #
OPS_WAR_data %>%
mutate(log_salary = log(Salary)) %>%
ggplot(aes(x = W_A_R, y = Salary, color = (Salary/W_A_R)> Cost_per_WAR$cost_per_win,
shape = (Salary/W_A_R)> Cost_per_WAR$cost_per_win)) +
geom_point() +
ggtitle("Scatter Plot of Players Paid Above Average") +
guides(color = guide_legend("Overpaid"),
shape = guide_legend("Overpaid")) +
xlab("Wins Above Replacement") +
ylab("Salary")
# Phillies Recommendations based on WAR #
WAR_Recs <- OPS_WAR_data %>%
mutate(Cost_per_win = Cost_per_WAR$cost_per_win) %>%
filter(Team == "PHI") %>%
mutate(Rec = ifelse((Salary/W_A_R)> Cost_per_win,
"Overpaid",
"OK"))
#WAR_Recs#
############ Data Dictionary ###########################
##For OPS data##
OPS_var_desc <- c("Player name",
"Player Position",
"Player's Team",
"On-Base percentage plus slugging percentage",
"Salary in US Dollars",
"OPS Percentile Bin ranging from 1 - 10")
OPS_var_type <- c(0, 0, 0, 0, 0, 0)
OPS_linker <- build_linker(OPS_bins,
variable_description = OPS_var_desc,
variable_type = OPS_var_type)
OPS_dict <- build_dict(my.data = OPS_bins,
linker = OPS_linker,
option_description = NULL,
prompt_varopts = FALSE)
##For WAR data##
WAR_var_desc <- c("Player name",
"Wins Above Replacement")
WAR_var_type <- c(0,0)
WAR_linker <- build_linker(WAR_Data,
variable_description = WAR_var_desc,
variable_type = WAR_var_type)
WAR_dict <- build_dict(my.data = WAR_Data,
linker = WAR_linker,
option_description = NULL,
prompt_varopts = FALSE)