This project looks at the top 132 Major League Baseball players from the 2021 season, ranked by wins above replacement (WAR). Wins above replacement is a statistic that estimates how many more wins a player contributes to his team than a replacement-level player, such as the best minor league player who could be called up to take his place.
Wins above replacement appears to be a reliable measure, since it usually identifies the players who had the best overall seasons. However, it is calculated differently by each team and by different websites. The data used in this project is from FanGraphs.com, so the version used here is fWAR (FanGraphs wins above replacement). This matters because, as mentioned, wins above replacement can differ considerably depending on who calculates it and how much weight a team or website gives to the statistics that go into it.
This project looks at how fWAR is calculated, or more precisely, which statistics matter most to FanGraphs. It also looks for interesting correlations that may or may not be important in calculating a statistic that is so highly regarded in baseball today.
To run the code and view the results, the packages listed below must be installed.
library(tidyverse)
library(dplyr)
library(ggplot2)
library(plotly)
library(rio)
library(janitor)
library(readxl)
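Before loading the packages, it can help to check which of them are not yet installed. The following is a minimal sketch rather than part of the original analysis; the vector name `not_installed` is made up for illustration.

```r
# Packages loaded by this report.
pkgs <- c("tidyverse", "dplyr", "ggplot2", "plotly", "rio", "janitor", "readxl")

# requireNamespace() returns FALSE (without stopping) when a package is absent.
not_installed <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

if (length(not_installed) > 0) {
  message("Install these first: ", paste(not_installed, collapse = ", "))
}
```

Any names reported can then be passed to `install.packages()`.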
CustomBaseballDataEC4990 <- read_excel("~/Documents/CustomBaseballDataEC4990.xlsx")
TotalRunsAdded<-mutate(CustomBaseballDataEC4990,TotalRuns=Runs+RunsBattedIn)
lm1 <- lm(WinsAboveReplacement~HomeRuns+TotalRuns+ISO+BABIP+Average+SluggingPercentage+
OnBasePercentage+Age+Hits+BaseOnBalls+Speed+Singles+Doubles+Triples+Offense+Defense, data=TotalRunsAdded)
summary(lm1)
##
## Call:
## lm(formula = WinsAboveReplacement ~ HomeRuns + TotalRuns + ISO +
## BABIP + Average + SluggingPercentage + OnBasePercentage +
## Age + Hits + BaseOnBalls + Speed + Singles + Doubles + Triples +
## Offense + Defense, data = TotalRunsAdded)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.110245 -0.030194 -0.003835 0.028994 0.110154
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.0551439 0.1660990 12.373 < 2e-16 ***
## HomeRuns 0.0003480 0.0036605 0.095 0.92442
## TotalRuns 0.0010831 0.0003966 2.731 0.00731 **
## ISO -4.4425904 8.7639714 -0.507 0.61318
## BABIP 0.1842813 0.2158823 0.854 0.39508
## Average -9.4213765 8.7600784 -1.075 0.28439
## SluggingPercentage 1.9438807 8.7427068 0.222 0.82444
## OnBasePercentage 0.8247360 0.6992802 1.179 0.24065
## Age 0.0005453 0.0011859 0.460 0.64650
## Hits 0.0196892 0.0032062 6.141 1.18e-08 ***
## BaseOnBalls 0.0015449 0.0006968 2.217 0.02856 *
## Speed -0.0100571 0.0044976 -2.236 0.02726 *
## Singles -0.0091484 0.0035341 -2.589 0.01087 *
## Doubles -0.0058319 0.0032584 -1.790 0.07609 .
## Triples NA NA NA NA
## Offense 0.1025524 0.0010114 101.396 < 2e-16 ***
## Defense 0.1000217 0.0005756 173.781 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04368 on 116 degrees of freedom
## Multiple R-squared: 0.9994, Adjusted R-squared: 0.9993
## F-statistic: 1.33e+04 on 15 and 116 DF, p-value: < 2.2e-16
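The `NA` row for Triples and the note "1 not defined because of singularities" are what `lm()` reports when one predictor is an exact linear combination of the others. By definition, hits are the sum of singles, doubles, triples, and home runs, so once the other four are in the model Triples adds no new information. The sketch below reproduces the effect on simulated counts; every number in it is synthetic and none comes from the FanGraphs data.

```r
set.seed(1)
n <- 50

# Synthetic counting stats, for illustration only.
singles   <- rpois(n, 90)
doubles   <- rpois(n, 30)
triples   <- rpois(n, 3)
home_runs <- rpois(n, 25)
hits      <- singles + doubles + triples + home_runs  # exact identity

# A made-up response that depends on hits plus noise.
y <- 0.02 * hits + rnorm(n, sd = 0.1)

# Same ordering as in lm1: triples enters last and is fully determined
# by the predictors before it, so its coefficient cannot be estimated.
fit <- lm(y ~ home_runs + hits + singles + doubles + triples)
coef(fit)

# alias() spells out the linear dependency that lm() detected.
alias(fit)
```

Dropping any one of the five columns from the formula would remove the `NA` without changing the fitted values.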
GPlot<- ggplot(TotalRunsAdded,aes(x=TotalRuns, y=WinsAboveReplacement))+
geom_point()+
geom_smooth(method = lm)+
theme_bw()+
xlab("Total Runs")+
ylab("Wins Above Replacement")
PlotlyScatHR <- TotalRunsAdded %>%
plot_ly(x= ~TotalRuns, y= ~WinsAboveReplacement,type = "scatter",
mode="markers",
hoverinfo="text",
text= paste("Name:", TotalRunsAdded$Name, "<br>",
"Age:", TotalRunsAdded$Age, "<br>",
"Home Runs:", TotalRunsAdded$HomeRuns) ,color = ~HomeRuns)
PlotlyScatOPS <- TotalRunsAdded %>%
plot_ly(x= ~TotalRuns, y= ~WinsAboveReplacement,type = "scatter",
mode="markers",
hoverinfo="text",
text= paste("Name:", TotalRunsAdded$Name, "<br>",
"Age:", TotalRunsAdded$Age, "<br>",
"OPS:", TotalRunsAdded$OnBasePlusSlugging) ,color = ~OnBasePlusSlugging)
PlotlyScatOpsSlg <- TotalRunsAdded %>%
plot_ly(x= ~SluggingPercentage, y= ~OnBasePercentage,type = "scatter",
mode="markers",
hoverinfo="text",
text= paste("Name:", TotalRunsAdded$Name, "<br>",
"Age:", TotalRunsAdded$Age, "<br>",
"OPS:", TotalRunsAdded$OnBasePlusSlugging) ,color = ~WinsAboveReplacement)
BattingPlotlyScatWAR <- TotalRunsAdded %>%
plot_ly(x= ~TotalRuns, y= ~Hits,type = "scatter",
mode="markers",
hoverinfo="text",
text= paste("Name:", TotalRunsAdded$Name, "<br>",
"Age:", TotalRunsAdded$Age, "<br>",
"Base On Balls:", TotalRunsAdded$BaseOnBalls) ,color = ~WinsAboveReplacement)
DefOffPlotlyScat <- TotalRunsAdded %>%
plot_ly(x= ~Offense, y= ~Defense,type = "scatter",
mode="markers",
hoverinfo="text",
text= paste("Name:", TotalRunsAdded$Name, "<br>",
"Age:", TotalRunsAdded$Age) ,color = ~WinsAboveReplacement)
GPlotDefOff<- ggplot(TotalRunsAdded,aes(x=Offense, y=Defense))+
geom_point()+
geom_smooth(method = lm)+
theme_bw()+
xlab("Offense")+
ylab("Defense")
TotalRunsAdded[1:10,]
## # A tibble: 10 × 33
## Name Team Games PlateAppearances HomeRuns Runs RunsBattedIn StolenBases
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Bryce H… PHI 141 599 35 101 84 13
## 2 Juan So… WSN 151 654 29 111 95 9
## 3 Vladimi… TOR 161 698 48 123 111 4
## 4 Fernand… SDP 130 546 42 99 97 25
## 5 Shohei … LAA 158 639 46 103 100 26
## 6 Joey Vo… CIN 129 533 36 73 99 1
## 7 Nick Ca… CIN 138 585 34 95 100 3
## 8 Aaron J… NYY 148 633 39 89 98 6
## 9 Trea Tu… - - - 148 646 28 107 77 32
## 10 Bryan R… PIT 159 646 24 93 90 5
## # … with 25 more variables: BB% <dbl>, K% <dbl>, ISO <dbl>, BABIP <dbl>,
## # Average <dbl>, OnBasePercentage <dbl>, SluggingPercentage <dbl>,
## # BaseRunning <dbl>, Offense <dbl>, Defense <dbl>,
## # WinsAboveReplacement <dbl>, Age <dbl>, Hits <dbl>, Singles <dbl>,
## # Doubles <dbl>, Triples <dbl>, BaseOnBalls <dbl>, Strikouts <dbl>,
## # HitByPitch <dbl>, BB/K <dbl>, OnBasePlusSlugging <dbl>, Speed <dbl>,
## # Clutch <dbl>, HardHit% <dbl>, TotalRuns <dbl>
The data used comes from: https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2021&month=0&season1=2021&ind=0
This data source was easy to work with, and the website allowed me to customize the variables that I wanted. From there I cleaned the data by simply renaming the variables so that they were friendlier to a reader who is less invested in baseball statistics (sabermetrics). Some variables that I included based solely on curiosity were “Speed”, “On Base Plus Slugging”, “Singles”, “Doubles”, and “Triples”. The data covers every game of the 2021 regular season and includes only the top 132 players in MLB. In R I mutated a new variable titled “Total Runs”; I explain why in the results.
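The renaming step described above can be sketched in base R. The short column codes on the left (`HR`, `R`, `RBI`, `SB`) are assumed to be the headers of the FanGraphs export and are not confirmed by this report. The two rows are copied from the table printed earlier, and the friendly names are the ones used throughout.

```r
# Two rows from the printed table; the short headers are an assumption.
raw <- data.frame(HR = c(35, 29), R = c(101, 111), RBI = c(84, 95), SB = c(13, 9))

# Lookup from short code to reader-friendly name.
friendly <- c(HR = "HomeRuns", R = "Runs", RBI = "RunsBattedIn", SB = "StolenBases")

# Replace each header with its friendly equivalent.
names(raw) <- unname(friendly[names(raw)])
raw
```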
Below are my results:
For the first graph I mutated a new statistic titled “Total Runs”, made up of “Runs Batted In” plus “Runs”, minus “Home Runs”. This was important because, when constructing the statistic for Wins Above Replacement, a key metric is how many runs in total a player contributes to the team. Total runs are like points contributed: the statistic “Runs” only tells us how many times each player scored for his team, while “Runs Batted In” tells us how many runs the player drove in. I then subtracted “Home Runs” because a home run counts as both a run scored and a run batted in for the batter, so without the subtraction it would be counted twice.
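Note that the `mutate()` call shown earlier computes `Runs + RunsBattedIn` without the subtraction. The definition as described in this paragraph would look like the sketch below, which uses the ten rows printed in the table above (identified by row number, since the names are truncated there). It is written in base R so that it runs without extra packages; the column name `RunsPlusRBI` is made up for the comparison.

```r
# HomeRuns, Runs and RunsBattedIn for the ten rows printed in this report.
top10 <- data.frame(
  Row          = 1:10,
  HomeRuns     = c(35, 29, 48, 42, 46, 36, 34, 39, 28, 24),
  Runs         = c(101, 111, 123, 99, 103, 73, 95, 89, 107, 93),
  RunsBattedIn = c(84, 95, 111, 97, 100, 99, 100, 98, 77, 90)
)

# Same idea as mutate(top10, TotalRuns = Runs + RunsBattedIn - HomeRuns).
top10 <- transform(
  top10,
  RunsPlusRBI = Runs + RunsBattedIn,             # what the earlier code computes
  TotalRuns   = Runs + RunsBattedIn - HomeRuns   # what the text describes
)
top10
```

For the first row the two versions give 185 and 150, so the choice changes the scale of the x-axis in the plots that use “Total Runs”.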
The graph shows what we might expect: the regression line slopes upward, meaning that players with more “Total Runs” tend to have a higher “Wins Above Replacement”.
PlotlyScatHR
Although “Home Runs” were removed from total runs, it was still interesting to add them to a scatter plot to see how much they mattered. We still see the same trend between “Total Runs” and “Wins Above Replacement”, but the color now shows whether players who hit more home runs had an advantage. They did tend to have both more “Total Runs” and a higher “Wins Above Replacement”, but home runs were not the most important metric.
PlotlyScatOPS
Another very important metric in baseball is “On Base Plus Slugging” (OPS). This metric has been highly regarded for some time and was in use before “Wins Above Replacement” was ever formulated. OPS comes from two statistics: “On Base Percentage” (OBP) and “Slugging Percentage” (SLG). Adding the two gives OPS, and because it is a sum of two rates it can be greater than 1.000 (100%). It is interesting to note that two of the greatest players ever to play the sport hold the “Slugging Percentage” records: Barry Bonds has the highest SLG recorded for a single season, and Babe Ruth has the highest over a career. Babe Ruth also has the second-highest career OBP in history, while Barry Bonds has the two highest single-season OBP marks ever recorded. This goes to show how important those two statistics are to the game of baseball, and the graph above reflects it. Players in the 2021 season with a higher OPS also sit at the upper right of the scatter plot, which has a positive trend.
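As a worked example of how the three rates relate, the sketch below computes OBP, SLG, and OPS from counting statistics using the standard definitions. The season line is hypothetical and is not taken from the FanGraphs data.

```r
# Hypothetical season line, for illustration only.
at_bats      <- 500
singles      <- 90
doubles      <- 30
triples      <- 5
home_runs    <- 25
walks        <- 70
hit_by_pitch <- 5
sac_flies    <- 5

hits        <- singles + doubles + triples + home_runs
total_bases <- singles + 2 * doubles + 3 * triples + 4 * home_runs

# OBP: times on base divided by the plate appearances that count toward it.
obp <- (hits + walks + hit_by_pitch) / (at_bats + walks + hit_by_pitch + sac_flies)

# SLG: total bases per at bat.
slg <- total_bases / at_bats

# OPS is simply the sum, which is why it can exceed 1.000.
ops <- obp + slg
round(c(OBP = obp, SLG = slg, OPS = ops), 3)
```

Because the two rates have different denominators, OPS is a convenient summary rather than a true percentage.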
PlotlyScatOpsSlg
In this graph we separated “On Base Plus Slugging” into “On Base Percentage” on the Y-axis and “Slugging Percentage” on the X-axis. The results say something interesting about the game, even though both statistics measure how a player contributes to the team. Every player with an “On Base Percentage” above 40% had a “Wins Above Replacement” greater than 6. For “Slugging Percentage”, a “Wins Above Replacement” above 6 was only reached by players who slugged above 50%, but slugging above 50% did not guarantee a “Wins Above Replacement” of 6 or more. This could signal to teams choosing players that power is less necessary for winning than the simple ability to reach base. That makes sense: slugging percentage measures how many bases a player gets per at bat, and while that is important, it is less important than reaching base safely. A player who reaches base safely makes fewer outs, which gives teammates more chances to drive that runner in.
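The threshold comparison above can be checked directly instead of being read off the plot. The sketch below shows one way to do it: find the lowest “Wins Above Replacement” among players above a cutoff. The helper `min_war_above()` and the five rows of data are hypothetical; with the real data, `TotalRunsAdded` would be passed in place of `players`.

```r
# Hypothetical values for illustration only; not taken from the FanGraphs data.
players <- data.frame(
  OnBasePercentage     = c(0.430, 0.410, 0.350, 0.330, 0.360),
  SluggingPercentage   = c(0.610, 0.480, 0.560, 0.520, 0.430),
  WinsAboveReplacement = c(6.6, 6.1, 4.2, 3.1, 2.5)
)

# Lowest WAR among the players whose value in `column` exceeds `cutoff`.
min_war_above <- function(data, column, cutoff) {
  keep <- data[[column]] > cutoff
  min(data$WinsAboveReplacement[keep])
}

min_war_above(players, "OnBasePercentage", 0.400)
min_war_above(players, "SluggingPercentage", 0.500)
```

If the first result is above 6 and the second is not, the data support the claim that a high OBP is the stronger guarantee of the two.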
BattingPlotlyScatWAR
In the scatter plot above, we notice that “Hits” and “Total Runs” are highly correlated. This makes sense: as a player gets more “Hits” he is on base more often, which makes it more likely that he scores. Hits can also lead to more “Runs Batted In”, which is part of our “Total Runs” statistic. Both relate closely to how high a player’s “Wins Above Replacement” can be. In the scatter plot, players with a lower “Wins Above Replacement” are darker blue or purple and sit toward the lower left of the plot, while players who are lighter in color tend to total more than 150 “Total Runs” and 150 “Hits”. I also thought it was necessary to add “Base On Balls”, which in baseball we call a walk; if you hover over the points you will see how often a player was walked. This matters because a walk is another way for a player to get on base that is not reflected in “Hits” but can lead to more “Total Runs”. For example, Barry Bonds was such a good hitter, with the “Slugging Percentage” mentioned above, that at his peak he hit a home run about once every 7 at bats. This led many teams to walk him rather than risk giving Bonds and his team an easy run. It was a good strategy, but his team was aware of it and put their best hitters after him so that he would still score and add to his “Total Runs”. That is why walks were added to the graph: although a walk is not a guaranteed run, it can increase “Total Runs”.
DefOffPlotlyScat
GPlotDefOff
## `geom_smooth()` using formula 'y ~ x'
These final two graphs were somewhat different but interesting. At first we notice that there is not much of a positive relationship between “Offense” and “Defense”; in fact, in the second graph the relationship is negative. We can also see that a higher “Offense” or a higher “Defense” value goes with a higher “Wins Above Replacement”. However, there is something of a trade-off. Although some players are well rounded and have both offensive and defensive skill sets, it is still possible to achieve a high “Wins Above Replacement” by being dominant on only one side. For example, the player with the highest offensive value, Vladimir Guerrero Jr., has poor defensive skills but still has a very high “Wins Above Replacement”, above 6.0. This shows that a player can be dominant on offense alone and still be far better than a replacement-level player. The same does not hold for defensive ability: Michael A. Taylor has a very high defensive value but not a high offensive one, and his “Wins Above Replacement” is not very high. This suggests that players may gain more from spending their practice time on offense rather than defense. It could also be something for teams to weigh when developing their roster, especially with new players.
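The regression output earlier in this report helps put numbers on this comparison. Its “Offense” and “Defense” coefficients are both close to 0.1, which is what we would expect if FanGraphs expresses both in runs and converts them to wins at roughly ten runs per win. The sketch below uses only the intercept and those two coefficients, copied from the `summary(lm1)` output. It ignores every other term in the model, so it is a rough approximation and will not reproduce the fitted values. The function name `approx_war()` and the example inputs are hypothetical.

```r
# Coefficients copied from the summary(lm1) output in this report.
intercept <- 2.0551439
b_offense <- 0.1025524
b_defense <- 0.1000217

# Rough WAR from Offense and Defense alone (all other regressors ignored).
approx_war <- function(offense, defense) {
  intercept + b_offense * offense + b_defense * defense
}

# A hypothetical strong hitter with below-average defense.
approx_war(offense = 40, defense = -10)

# Implied runs per win for each component.
1 / b_offense
1 / b_defense
```

Read this way, one run of “Offense” and one run of “Defense” carry nearly the same weight. One possible reading of the graphs is therefore that “Offense” values are spread over a wider range among these players, not that each offensive run counts for more.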
This report gave some insight into what may matter most for the “Wins Above Replacement” statistic in baseball, but more importantly it showed which statistics carried the least weight. One unintended takeaway is a sense of which statistics young athletes might focus their energy on. This does not apply to every player, because the data does not include pitchers. Pitchers do not normally hit very well, and their “Wins Above Replacement” is calculated very differently. For the other eight players on the field for each team, however, it is useful to see where limited practice time should be focused. For a very long time the best defensive athlete on the field tended to be the shortstop, but after doing this project it seems clear why shortstops in today’s game are much better offensively. With so much of the game focused on scoring runs to win, it is easy to sacrifice defensive ability for offensive ability, as we noticed in our last graph.
Baseball is a game of numbers. This was recognized most recently by the film “Moneyball”, but the game has always been analyzed through numbers. We can go back to the early 1900s and still find statistics for those games, but it was not until more recently that “sabermetrics” came about through Bill James, who coined the term. The name combines SABR (the Society for American Baseball Research) with “metrics”, for the measurement of the game. Sabermetrics has become a sought-after skill for many baseball teams, and many economists now work in the field. Part of the reason is to evaluate player values, but it is also because of the predictive knowledge they are able to add to the game using data. Today teams use data for everything from increasing the velocity at which pitchers throw, to structuring a lineup for more wins, to finding the combination of players that increases a team’s likelihood of winning.
The objective of this project was to see which statistics were most important in calculating “Wins Above Replacement”, but every team evaluates the measure differently. So although we learned more about how FanGraphs calculates its “Wins Above Replacement”, we also learned about a variety of other valuable parts of the game from this data set. Baseball might have more data available than other sports, but new statistics keep being created, and it might be a long time before a perfect statistic arrives to predict which players offer the best value. Until then it is fun to try to keep up with the experts who have succeeded in using statistics to change the game of baseball from purely a sport on a field to what it is today, where an office of statisticians and economists work together to better the sport on the field.