Throughout the history of baseball, analysts have been trying to figure out the best ways to evaluate players. WAR (wins above replacement) is an attempt to provide a single number that encompasses a player's contributions in all phases of the game. Combining all elements into a single number makes it easier to compare two players with different styles of play.
However, the formula for WAR has been a subject of debate among fans and media members who are not directly involved with the industry. Despite not knowing the formula, they still seem to take the number at face value, citing WAR in television broadcasts and online articles. This is largely a matter of ease and convenience. In this project, I intend to uncover which statistics affect WAR the most and, based on these findings, what potential biases WAR could have.
In order to analyze WAR data, statistics from over 500 players over a ten-year period will be taken from Fangraphs.com.
The data table includes a large portion of useful data, such as batting average, on-base percentage, and stolen bases, but it does not include every variable I intend to analyze. For example, isolated plate discipline will need to be created by subtracting batting average from on-base percentage.
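As a minimal sketch of how such a derived variable could be built, the snippet below computes isolated plate discipline on a hypothetical three-player sample. The column name `IsoD` and the sample values are assumptions for illustration; only `AVG` and `OBP` come from the data set.

```r
# Hypothetical three-player sample; AVG and OBP mirror the data set's column names.
players <- data.frame(
  Name = c("Player A", "Player B", "Player C"),
  AVG  = c(0.250, 0.300, 0.275),
  OBP  = c(0.320, 0.350, 0.400)
)

# Isolated plate discipline: the part of on-base percentage not explained by hits.
# IsoD is an assumed column name, not one supplied in the Fangraphs export.
players$IsoD <- players$OBP - players$AVG

print(players)
```

The same subtraction would apply unchanged to the full data set once `AVG` and `OBP` are numeric columns.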
Two other limitations of this analysis are pitching and defense. This data set only analyzes position players and excludes pitchers. In order to examine pitchers, a completely separate study would need to be done due to their vastly different skill set and role on the team. Defensive contributions are included, but are also inexact. Quantifying the quality of a player’s defense has greatly improved over the years, but still is a highly contested topic. For the purposes of this study, I will use the cumulative defensive statistic as provided by Fangraphs while being cognizant of the limitations. The defense statistic needs to be merged from another data set. Ideally, UZR (ultimate zone rating) would be used, but data on UZR is only available for a small sample of players over a time span of this length.
Methods to compare variables will include scatter plots, line graphs, histograms, and others.
Knowing what influences WAR can help build trust in the statistic and clarify its potential biases whenever it is cited. It may also reveal value in players who would otherwise have been overlooked, which is important for small-market teams that cannot outbid larger markets.
Today’s landscape has player salaries tied to WAR. Some players who may not seem valuable based on conventional statistics are getting large contracts because of a high WAR. This study can help players decide which skills to focus on in order to improve their WAR, and thus possibly increase their earning potential.
To prepare the data for analysis, two data sets (one for offense, one for defense) are merged. Only 17 of the 44 variables in the merged data set matter for this study, so those 17 are selected and the remainder are dropped. The following code performs these steps:
# PACKAGES REQUIRED
library(tidyverse) # to clean data and conduct graphical analysis
library(DT) # to provide a user-friendly data set
library(corrplot) # to create correlation matrices
# Import the two Fangraphs exports (these paths are specific to the author's machine)
Offense_Leaderboard <- read_csv("C:/Users/Akshay/Downloads/FanGraphs Leaderboard.csv")
Defense_Leaderboard <- read_csv("C:/Users/Akshay/Downloads/FanGraphs Leaderboard (1).csv")
# Merge on the columns the two tables share, keeping every row of the offense table
Combined_Data <- Offense_Leaderboard %>% left_join(Defense_Leaderboard)
# Keep only the 17 variables used in this study
Final_Dataset <- select(Combined_Data, Name, G, PA, HR, R, RBI, SB, "BB%", "K%", ISO, AVG, OBP, SLG, BsR, Off, Def, WAR)
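The merge above lets `left_join()` pick the key columns automatically. A hedged sketch of a more explicit variant follows, using two hypothetical miniature leaderboards. Joining on `Name` is an assumption made for illustration; if the exports carry a player ID column, that would be a safer key because two players can share a name.

```r
library(dplyr)

# Hypothetical miniature leaderboards; the real exports have many more columns.
offense <- tibble(Name = c("Player A", "Player B", "Player C"),
                  HR   = c(30, 12, 45),
                  WAR  = c(4.1, 1.2, 6.3))
defense <- tibble(Name = c("Player A", "Player C"),
                  Def  = c(5.5, -12.0))

# Naming the key explicitly avoids relying on whichever columns happen to match.
combined <- offense %>% left_join(defense, by = "Name")

# Players missing from the defense table keep their row but get NA for Def.
unmatched <- combined %>% filter(is.na(Def)) %>% pull(Name)

print(combined)
print(unmatched)
```

Checking for unmatched rows right after the join is a cheap way to confirm that no player silently lost a defensive value.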
The following code gives summary statistics (the five-number summary plus the mean) for each numeric variable, which will be used throughout the analysis:
summary(Final_Dataset)
## Name G PA HR
## Length:590 Min. : 257.0 Min. :1000 Min. : 3.0
## Class :character 1st Qu.: 470.2 1st Qu.:1574 1st Qu.: 31.0
## Mode :character Median : 640.0 Median :2260 Median : 59.0
## Mean : 733.8 Mean :2770 Mean : 78.4
## 3rd Qu.: 939.8 3rd Qu.:3612 3rd Qu.:104.0
## Max. :1695.0 Max. :7258 Max. :358.0
## R RBI SB BB%
## Min. : 79.0 Min. : 62.0 Min. : 0.00 Length:590
## 1st Qu.: 174.2 1st Qu.: 157.2 1st Qu.: 9.00 Class :character
## Median : 265.5 Median : 246.5 Median : 22.00 Mode :character
## Mean : 331.8 Mean : 318.1 Mean : 44.78
## 3rd Qu.: 444.8 3rd Qu.: 415.2 3rd Qu.: 56.00
## Max. :1056.0 Max. :1201.0 Max. :387.00
## K% ISO AVG OBP
## Length:590 Min. :0.0500 Min. :0.1990 Min. :0.2570
## Class :character 1st Qu.:0.1212 1st Qu.:0.2460 1st Qu.:0.3090
## Mode :character Median :0.1540 Median :0.2590 Median :0.3250
## Mean :0.1525 Mean :0.2602 Mean :0.3264
## 3rd Qu.:0.1850 3rd Qu.:0.2747 3rd Qu.:0.3430
## Max. :0.2800 Max. :0.3200 Max. :0.4260
## SLG BsR Off Def
## Min. :0.2840 Min. :-85.20000 Min. :-141.00 Min. :-184.800
## 1st Qu.:0.3792 1st Qu.: -7.50000 1st Qu.: -30.32 1st Qu.: -30.350
## Median :0.4130 Median : -0.95000 Median : -4.10 Median : -1.200
## Mean :0.4128 Mean : 0.03068 Mean : 12.31 Mean : -3.992
## 3rd Qu.:0.4447 3rd Qu.: 7.27500 3rd Qu.: 33.92 3rd Qu.: 22.500
## Max. :0.5690 Max. : 66.90000 Max. : 400.90 Max. : 166.200
## WAR
## Min. :-4.20
## 1st Qu.: 2.60
## Median : 6.90
## Mean :10.14
## 3rd Qu.:13.97
## Max. :53.90
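The summary above lists BB% and K% as character columns, so they have no quartiles and cannot enter a correlation until they are converted. A minimal sketch of one way to do that is below. It assumes the values are stored as text with a trailing percent sign (for example "8.5 %"), which is a guess about the export format, and `percent_to_number` is a hypothetical helper name.

```r
# Hypothetical values in the format percent columns often take in a CSV export.
bb_raw <- c("8.5 %", "12.1%", "5.0 %")

# Strip the trailing percent sign (and any space before it),
# then divide by 100 to express the value as a proportion.
percent_to_number <- function(x) {
  as.numeric(sub("\\s*%$", "", x)) / 100
}

bb <- percent_to_number(bb_raw)

print(bb)
print(summary(bb))
```

Applied to the real columns, this would let `summary()` report quartiles for walk and strikeout rates alongside the other variables.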
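Over the ten-year period from 2007 to 2017, 590 Major League Baseball hitters are considered eligible to be analyzed. The usual single-season standard for eligibility is an average of 3.1 plate appearances per team game, which works out to about 502 plate appearances in a 162-game season and 5,022 over 10 seasons. The summary above, however, shows a minimum of 1,000 plate appearances and a median of 2,260, so the sample evidently reflects a 1,000 plate appearance minimum rather than the full 5,022.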
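The plate appearance arithmetic can be checked directly. The 162-game season length is an assumption (the standard regular-season schedule), and the variable names are illustrative only.

```r
pa_per_team_game <- 3.1   # qualification rate quoted in the text
games_per_season <- 162   # assumed length of a full regular season
seasons          <- 10

# Plate appearances needed to qualify in one season, then across all ten.
pa_per_season <- pa_per_team_game * games_per_season
pa_total      <- pa_per_season * seasons

print(pa_per_season)
print(pa_total)
```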
For those 590 hitters, the WAR distribution is as follows:
Fangraphs breaks down player performance into offense-only and defense-only components (Off and Def), both expressed in runs relative to average, which will be elaborated on later. As a preliminary example of how they correlate with WAR, consider the graph below. The red line is offense, while the blue line is defense.
Based on the graph, before making any calculations, offense appears to have a strong positive correlation with WAR, while defense shows close to none. Since the goal of this study is to investigate what goes into WAR, this preliminary assessment suggests that defense does not have a strong influence on WAR.
As shown in the preliminary analysis, the distribution of WAR is right-skewed. WAR is a cumulative statistic, so players with more playing time over the 10-year period in question have more opportunity to accumulate it. To adjust for the number of games played, WAR can be converted to a rate statistic by dividing it by the number of games each player has appeared in. Doing so creates the following histogram using a new variable:
This histogram has a similar skew to the previous one, but now any advantage or disadvantage of playing longer has been controlled for.
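A minimal sketch of the rate conversion, on a hypothetical five-player sample, is shown below. `WAR_per_G` is an assumed name for the new variable, and the values are made up for illustration.

```r
# Hypothetical sample; G and WAR mirror the data set's column names.
players <- data.frame(
  G   = c(300, 640, 900, 1200, 1500),
  WAR = c(1.5, 6.4, 13.5, 30.0, 45.0)
)

# Convert the cumulative statistic to a per-game rate.
# WAR_per_G is an assumed name for the new variable.
players$WAR_per_G <- players$WAR / players$G

# plot = FALSE returns the bin counts without opening a graphics device;
# dropping it draws the histogram itself.
h <- hist(players$WAR_per_G, plot = FALSE)

print(players)
print(h$counts)
```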
What if experience matters, even with a rate statistic? So far, the graphs have shown the distributions for all players, but how do more experienced players compare with less experienced ones?
In order to find out, the same plot can be run for games played.
This graph is also right-skewed, but unlike the WAR graphs, its distribution does not follow a smooth curve. In this case, the median expresses the midpoint of the data better than the mean. As seen in the ‘Data Preparation’ tab, the median for games played (G) is 640.
Using 640 as a cutoff, two new WAR frequency line graphs are created and superimposed on one another. The red line is for players with more than 640 games, and the blue line is for players with 640 or fewer games.
In this graph, players with more than 640 games are more frequent at higher values of WAR per game, while less experienced players are more frequent at lower values.
One possible explanation lies in the distinction between correlation and causation. Good players usually have longer careers than worse players and thus accumulate more games, so a high game count may reflect quality rather than cause it. Another interesting note is that both lines look similar in shape despite their irregularities.
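The split at the median can be sketched as follows, on a hypothetical seven-player sample chosen so that its median of games played is also 640. The group labels and the `WAR_per_G` name are assumptions, and `density()` stands in for whatever frequency curve the original graph used.

```r
# Hypothetical sample whose median games played happens to be 640.
players <- data.frame(
  G   = c(300, 450, 600, 640, 800, 1200, 1500),
  WAR = c(0.9, 2.7, 4.8, 6.4, 12.0, 24.0, 37.5)
)
players$WAR_per_G <- players$WAR / players$G

# Use the median of games played as the experience cutoff.
cutoff <- median(players$G)
players$group <- ifelse(players$G > cutoff, "more", "fewer_or_equal")

# Compare the typical rate in each group.
group_medians <- tapply(players$WAR_per_G, players$group, median)
print(group_medians)

# Kernel density estimates, one per group, for superimposed line graphs.
dens_more  <- density(players$WAR_per_G[players$group == "more"])
dens_fewer <- density(players$WAR_per_G[players$group == "fewer_or_equal"])

# Draw only in an interactive session so the sketch also runs as a script.
if (interactive()) {
  plot(dens_more, col = "red", main = "WAR per game by experience")
  lines(dens_fewer, col = "blue")
}
```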
In order to further investigate, three different correlation matrices are needed. The first one is for all players, while the second and third are with the applied filters based on games played. This information reveals why the above graph appears the way it does.
The first matrix below is for all players, henceforth referred to as Matrix 1.
The next matrix below is for players with 640 or fewer games, henceforth referred to as Matrix 2.
The third matrix is for players with more than 640 games, henceforth referred to as Matrix 3.
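A hedged sketch of how the three matrices could be computed is given below. The data are simulated purely so the example runs on its own, so the printed correlations say nothing about the real players. The variable subset and object names (`matrix_1` and so on) are assumptions.

```r
# Simulated stand-in data: these numbers are NOT the Fangraphs results.
set.seed(1)
n <- 40
players <- data.frame(G = round(runif(n, 257, 1695)))
players$HR  <- round(players$G * runif(n, 0.05, 0.20))
players$Def <- rnorm(n, mean = 0, sd = 30)
players$WAR <- 0.01 * players$G + 0.05 * players$HR +
  0.02 * players$Def + rnorm(n)

vars   <- c("G", "HR", "Def", "WAR")
cutoff <- median(players$G)

# Matrix 1: all players. Matrix 2: at or below the cutoff. Matrix 3: above it.
matrix_1 <- cor(players[, vars])
matrix_2 <- cor(players[players$G <= cutoff, vars])
matrix_3 <- cor(players[players$G >  cutoff, vars])

print(round(matrix_1, 2))
print(round(matrix_2, 2))
print(round(matrix_3, 2))

# With the corrplot package installed, corrplot::corrplot(matrix_1)
# would draw the matrix as a plot.
```

Computing all three from the same variable list keeps the matrices directly comparable cell by cell.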
For all three matrices, the variable PA (plate appearances) can be disregarded because of its multicollinearity with games played, which has already been accounted for.
Based on Matrix 1, the strongest correlations with WAR, other than games and plate appearances, are home runs, runs, RBI, and offensive rating. Offensive rating is a composite measure rather than a skill in itself, so it cannot be analyzed the same way as the other variables: a player can work on home run power and swing technique, but cannot directly work on his “offensive rating.” The same holds for “defensive rating,” whose correlation with WAR is weak anyway, so it is not included.
Since home runs, runs, and RBI seem to have a strong correlation coefficient, a scatterplot for each with a smooth line superimposed is a practical way to visualize the data. The plot below is for all players and shows WAR per game on the horizontal axis and home runs on the vertical axis:
Plot 1 shows a positive correlation, but a high degree of variance. The next plot compares WAR per game with runs:
Plot 2 shows a more homoscedastic fit than Plot 1.
Plot 3, which compares WAR per game with RBI, falls between Plots 1 and 2 in terms of homoscedasticity.
Based on Matrices 2 and 3, less experienced players do not appear to focus on power as much as those with more experience. Since the correlation between home runs and WAR is strong compared with the other variables, players should be encouraged to work on power, even if that means compromising in another area of their game.
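The scatterplots with a superimposed smooth line, and the informal comparison of homoscedasticity, can be sketched as below. The data are simulated and do not reproduce the real plots. Base R's `lowess()` is used here to avoid extra dependencies (ggplot2's `geom_smooth()` is the tidyverse counterpart), and comparing residual spread in the lower and upper halves of the horizontal axis is only a rough check chosen for illustration.

```r
# Simulated stand-in data: NOT the real WAR per game or home run values.
set.seed(2)
n <- 50
war_per_g <- runif(n, 0, 0.03)
hr <- round(40 + 2000 * war_per_g + rnorm(n, mean = 0, sd = 15))

# Smooth line through the scatter (locally weighted regression).
fit <- lowess(war_per_g, hr)

# Rough homoscedasticity check: compare the residual spread of a linear fit
# in the lower and upper halves of the horizontal axis.
res      <- residuals(lm(hr ~ war_per_g))
low_half <- war_per_g <= median(war_per_g)
spread   <- c(low = sd(res[low_half]), high = sd(res[!low_half]))
print(spread)

# Draw only in an interactive session so the sketch also runs as a script.
if (interactive()) {
  plot(war_per_g, hr, xlab = "WAR per game", ylab = "Home runs")
  lines(fit, col = "blue")
}
```

Similar spreads in the two halves would be consistent with a homoscedastic fit, while a much larger spread in one half would point to the kind of uneven variance described for Plot 1.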
Data was retrieved from Fangraphs.com, and then exported to a .csv file before being imported into R.
| Variable name | Variable meaning |
|---|---|
| G | games played |
| PA | plate appearances |
| HR | home runs |
| R | runs scored |
| RBI | runs batted in |
| SB | stolen bases |
| BB% | walk percentage |
| K% | strikeout percentage |
| ISO | isolated power (SLG - AVG) |
| AVG | batting average |
| OBP | on-base percentage |
| SLG | slugging percentage |
| BsR | base-running runs above average |
| Off | offensive rating |
| Def | defensive rating |
| WAR | wins above replacement |