In this chapter the authors look at the relationship between runs and wins. According to them, this relationship is a prerequisite for determining how valuable a player is to his team: wins and player performance are not easily linked, but runs and player performance are. So before we can determine which players contribute the most to wins, we need to see how runs contribute to wins.
This is the load-data section of the chapter. We take Lahman's Teams.csv file, load it, keep only the columns we need, and subset to the seasons from 2002 onward (yearID > 2001). We then create two columns: Run Differential (Runs - Runs Allowed) and Winning Percentage (Wins / (Wins + Losses)).
This is what the top of the new data frame looks like:
#Import data
Teams <- read.csv("~/Data Projects/bookBaseballDataR/chapter4/Teams.csv")
#Subset rows and columns
myteams <- subset(Teams, yearID > 2001)[, c("teamID", "yearID", "lgID", "G", "W", "L", "R", "RA")]
#Add two variables: run differential (RD) and winning percentage (Wpct)
myteams$RD <- with(myteams, R - RA)
myteams$Wpct <- with(myteams, W / (W + L))
head(myteams)
The first thing we do is run a linear regression to find the relationship between Run Differential and Winning Percentage. We do that with the “lm” function. Example below:
linfit <- lm(Wpct ~ RD, data = myteams)
linfit
##
## Call:
## lm(formula = Wpct ~ RD, data = myteams)
##
## Coefficients:
## (Intercept) RD
## 0.4999928 0.0006247
#Quick plot
plot(myteams$RD, myteams$Wpct,
     xlab = "Run Differential",
     ylab = "Winning Percentage")
abline(a = coef(linfit)[1], b = coef(linfit)[2], lwd = 2)
We see that the linear regression shows a positive relationship between Run Differential and Win Percentage.
\[ WinPct = 0.4999928 + 0.0006247 * RD \]
Let me expound on this to make it more intuitive. If a team has a run differential of 0, it scored the same number of runs it allowed, so the model gives it a winning percentage of 50% (0.4999928 rounds to 50%), which is 81 games. If a team has a run differential of +10, it scored ten more runs than it allowed, and we plug 10 in for RD:
\[ 10 * 0.0006247 = 0.006247 \] \[ 0.006247 + 0.4999928 = 0.5062398 = WinPct\] \[ 0.5062398 * 162 = 82.01085 \]
So a run differential of +10 would lead to approximately an 82-win season. Since this is a linear regression, we can extrapolate that every +10 in run differential is worth about one more win. A linear model isn't a perfect fit for the relationship, as there are likely diminishing returns to run differential. Also, with a linear regression it would be theoretically possible to predict a team winning more than 100% of its games, which isn't actually possible. The good news is that 99% of all teams through major league history have had a run differential between -350 and +350, so it's unlikely that we will be at the edges of the linear regression.
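As a quick sanity check (a minimal sketch using the linfit model from above, with made-up run differentials), we can ask the model for predictions at extreme values and watch it break:
#Predict winning percentage at increasingly extreme run differentials
extremes <- data.frame(RD = c(-1000, -350, 0, 10, 350, 1000))
extremes$predWpct <- predict(linfit, newdata = extremes)
extremes
#At RD = +1000 the predicted Wpct is above 1 (more than 100% of games won),
#and at RD = -1000 it is below 0 - the linear model has no built-in bounds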
The book then makes you plot the residuals.
#Predictions and residuals
myteams$linWpct <- predict(linfit)
myteams$linResids <- residuals(linfit)
#Graph of residuals
plot(myteams$RD, myteams$linResids,
     xlab = "Run Differential",
     ylab = "Residual")
abline(h = 0, lty = 3)
It's good to look at the residual plot to make sure there are no patterns or trends. The book then makes you manually calculate the mean and the standard deviation of the residuals; interestingly, it doesn't tell you that's what you're doing. It demonstrates that ~68% of the data can be found within one RMSE of the mean and ~95% within two RMSE, which is exactly how a standard deviation behaves.
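For the record, here's a sketch of that calculation, using the residuals computed above: the mean should be essentially zero, the RMSE plays the role of the standard deviation, and we can count what fraction of teams land within one and two RMSE of zero.
#Mean of the residuals - essentially zero by construction in least squares
mean(myteams$linResids)
#Root mean squared error of the residuals
linRMSE <- sqrt(mean(myteams$linResids ^ 2))
linRMSE
#Fraction of teams within one and two RMSE of zero - expect ~68% and ~95%
mean(abs(myteams$linResids) < linRMSE)
mean(abs(myteams$linResids) < 2 * linRMSE)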
Bill James, the father of sabermetrics, derived a non-linear formula to estimate winning percentage:
\[ Wpct = R^2 / (R^2 + RA^2) \]
We can compare this formula to the one derived from the linear regression model. We make a new column for the Pythagorean formula, then calculate the residuals manually by subtracting the Pythagorean Wpct from the actual Wpct.
##4.4
##
#Pythagorean Formula for Winning Percentage
myteams$pytWpct <- with(myteams, R^2 / (R^2 + RA^2))
#Residuals Computed
myteams$pytResids <- myteams$Wpct - myteams$pytWpct
mean(myteams$pytResids)
## [1] -0.0005466628
sqrt(mean(myteams$pytResids ^ 2))
## [1] 0.02577569
As you can see, the mean of the residuals is near zero, like with the linear regression, and the RMSE is almost identical. The Pythagorean formula performs very similarly to our linear regression model, but it has the added benefit of behaving better at the extremes. As I mentioned earlier, the linear regression extends without bound, while the Pythagorean formula does not. If we do any experimentation later on, it will be helpful to use a model that behaves well at the extremes.
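To see that boundedness concretely, here's a tiny sketch with made-up run totals: no matter how lopsided the runs scored and allowed, the Pythagorean estimate stays strictly between 0 and 1.
#Pythagorean Wpct is bounded between 0 and 1 for any positive R and RA
pyt <- function(R, RA) R ^ 2 / (R ^ 2 + RA ^ 2)
pyt(1000, 300)  #a wildly dominant team, about 0.92 - still below 1
pyt(300, 1000)  #a wildly bad team, about 0.08 - still above 0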
In the next section we will try to optimize the Pythagorean formula. Squaring the variables works well, but there is likely a better exponent. The book goes through it briefly; I will be a bit more thorough. The math isn't that hard, and I could use the practice.
The first thing we do is replace the 2’s with the variable k. Our Pythagorean now looks like this:
\[ Wpct = R^k/(R^k + RA^k) \]
First we need to write Wpct out in terms of wins and losses.
\[ W / (W + L) = R^k/(R^k + RA^k)\]
Multiply both sides by (W + L).
\[ W = R^k(W+L)/(R^k + RA^k) \]
Multiply both sides by (R^k + RA^k) and expand.
\[ WR^k + WRA^k = WR^k + LR^k \]
The WR^k terms cancel.
\[ WRA^k = LR^k \]
Divide both sides by L.
\[ WRA^k/L = R^k \]
Divide both sides by RA^k.
\[ W/L = R^k/RA^k \]
Since \( R^k/RA^k = (R/RA)^k \), we can make this equation linear by taking the log of both sides.
\[log(W/L) = k * log(R/RA) \]
Using linear regression we can now find a value for k.
##4.5
##
#Exponent to the Pythag Formula
myteams$logWratio <- log(myteams$W / myteams$L)
myteams$logRratio <- log(myteams$R / myteams$RA)
pytFit <- lm(logWratio ~ 0 + logRratio, data = myteams)
pytFit
##
## Call:
## lm(formula = logWratio ~ 0 + logRratio, data = myteams)
##
## Coefficients:
## logRratio
## 1.858
1.858 is a better value for k than 2. While it is statistically different from 2, it is close enough that the original formula still makes intuitive sense. The book I am using is from 2012, and interestingly, the value for k the authors got was 1.903. So with the additional seasons of data since then, the fitted value of k has dropped slightly.
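To verify that the fitted exponent actually beats 2, here's a quick sketch plugging it back into the formula and comparing RMSEs (using the pytFit and pytResids objects from above; the pytWpctK column name is my own):
#Recompute the Pythagorean Wpct using the fitted exponent
k <- coef(pytFit)[1]
myteams$pytWpctK <- with(myteams, R ^ k / (R ^ k + RA ^ k))
#Compare RMSEs - the fitted exponent should come out slightly lower
sqrt(mean((myteams$Wpct - myteams$pytWpctK) ^ 2))  #fitted k = 1.858
sqrt(mean(myteams$pytResids ^ 2))                  #classic k = 2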
The relationship between wins and run differential isn't perfect; certain circumstances can produce marginally different outcomes. Take a team that, when it wins, wins by a lot: each blowout inflates the run differential but still counts as only one win. If its losses tend to be close, the effect compounds, and the team will fall short of its Pythagorean winning percentage. A team that underperforms its Pythagorean in this way could be considered unlucky.
It is also possible for teams to beat their Pythagorean. A team that wins many of its games by only one run will have a low run differential relative to its winning percentage. Winning a large number of close games is often attributed to luck, but it could also be a product of team strategy: a good pitching staff paired with a weaker offense could produce exactly this result. We can test this theory by checking whether teams with a top-tier closer tend to outperform their Pythagorean winning percentage.
##4.6
##
#Definition of a top closer: finished 50+ games with an ERA under 2.50
Pitching <- read.csv("~/Data Projects/bookBaseballDataR/chapter4/Pitching.csv")
top_closer <- subset(Pitching, GF > 50 & ERA < 2.5)[, c("playerID", "yearID", "teamID")]
#merge with myteams
teams_top_closer <- merge(myteams, top_closer)
summary(teams_top_closer$pytResids)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.053365 -0.015701 0.002275 0.003103 0.020592 0.069403
The mean is 0.0031, which doesn't seem like much, but multiply it by 162 and you get about 0.50. A good closer is worth roughly half a win above the predicted Pythagorean winning percentage.
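That conversion in code, for the record:
#Average residual for top-closer teams, converted to wins over 162 games
mean(teams_top_closer$pytResids) * 162  #roughly half a win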
We're going to take the partial derivative of the Pythagorean formula to figure out how runs directly affect wins. Math is fun! The good news is we're going to make R do all the work.
We start with a slightly modified form of the formula that predicts wins rather than winning percentage:
\[ W = G * (R^2/(R^2 + RA^2)) \]
We take the partial derivative with respect to \(R\); this tells us the rate at which wins increase for each additional run scored.
D(expression(G * R ^ 2 / (R^2 + RA ^ 2)), "R")
## G * (2 * R)/(R^2 + RA^2) - G * R^2 * (2 * R)/(R^2 + RA^2)^2
Take my word for it, or see the quick numeric check below: after flipping the derivative over, so that we get runs per win rather than wins per run, it simplifies to this:
\[ IR/W = (R^2 + RA^2)^2 / (2*G*R*RA^2) \]
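Rather than working through the algebra here, a quick numeric check (a sketch with made-up values): evaluate the raw derivative that D() produced and confirm its reciprocal matches the simplified formula.
#Made-up test values: 162 games, 750 runs scored, 700 allowed
G <- 162; R <- 750; RA <- 700
#Raw derivative of wins with respect to runs scored, copied from D() above
dWdR <- G * (2 * R)/(R^2 + RA^2) - G * R^2 * (2 * R)/(R^2 + RA^2)^2
#Simplified incremental-runs-per-win formula
IRW <- (R^2 + RA^2)^2 / (2 * G * R * RA^2)
#These two should print the same number
1 / dWdR
IRW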
IR/W stands for incremental runs per win. We can remove the G from the formula if R and RA are expressed in runs per game. We can then create a function for incremental runs.
IR <- function(RS = 5, RA = 5){
  round((RS ^ 2 + RA ^ 2) ^ 2 / (2 * RS * RA ^ 2), 1)
}
IRtable <- expand.grid(RS = seq(3, 6, 0.5), RA = seq(3, 6, 0.5))
IRtable$IRW <- IR(IRtable$RS, IRtable$RA)
xtabs(IRW ~ RS + RA, data = IRtable)
## RA
## RS 3 3.5 4 4.5 5 5.5 6
## 3 6.0 6.1 6.5 7.0 7.7 8.5 9.4
## 3.5 7.2 7.0 7.1 7.5 7.9 8.5 9.2
## 4 8.7 8.1 8.0 8.1 8.4 8.8 9.4
## 4.5 10.6 9.6 9.1 9.0 9.1 9.4 9.8
## 5 12.8 11.3 10.5 10.1 10.0 10.1 10.3
## 5.5 15.6 13.4 12.2 11.4 11.1 11.0 11.1
## 6 18.8 15.8 14.1 13.0 12.4 12.1 12.0
We can then feed this formula a bunch of run scenarios and see how different run environments affect the Pythagorean. The average number of runs scored in a game is around 4-5, and as you can see, that squares with the rule of 10 discussed earlier: on average, an increase of 10 in run differential buys another win. In lower-scoring environments (the top-left corner of the table), however, fewer runs are needed to gain a win.
Honestly, I don't really get the importance of this matrix. The intuition makes sense: if a lot of runs are being scored, then obviously you need a bigger run differential to gain a win. The partial derivative was cool; the application of it is a letdown. And that's how the chapter ends. There are exercises!
A. Repeat the linear regression from section 4.3 for the various decades. Here it is for the sixties:
#Subset rows and columns
sixties <- subset(Teams, yearID > 1961 & yearID < 1970)[, c("teamID", "yearID", "lgID", "G", "W", "L", "R", "RA")]
#Add two variables: run differential and winning percentage
sixties$RD <- with(sixties, R - RA)
sixties$Wpct <- with(sixties, W / (W + L))
#Linear Regression
linsix <- lm(Wpct ~ RD, data = sixties)
linsix
##
## Call:
## lm(formula = Wpct ~ RD, data = sixties)
##
## Coefficients:
## (Intercept) RD
## 0.4999330 0.0006937
B. What is a 10-run differential worth in each decade?
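Using the coefficient from the sixties fit above, a +10 run differential over a 162-game season is worth:
\[ 10 * 0.0006937 * 162 \approx 1.12 \]
About 1.12 extra wins, slightly more than the roughly one win (10 * 0.0006247 * 162, about 1.01) it buys in the 2002-onward data.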
Let's take a look at the teams from the 19th century. The plot is mostly linear, except between -400 and 0, where the winning percentage drops off. It looks like once the run differential started to go against a team, the wheels would completely fall off.
#Subset rows and columns
dedball <- subset(Teams, yearID < 1900)[, c("teamID", "yearID", "lgID", "G", "W", "L", "R", "RA")]
#Add two variables: run differential and winning percentage
dedball$RD <- with(dedball, R - RA)
dedball$Wpct <- with(dedball, W / (W + L))
#Quick plot
plot(dedball$RD, dedball$Wpct,
     xlab = "Run Differential",
     ylab = "Winning Percentage")
The residual plot demonstrates this phenomenon well:
#Linear Regression
dedfit <- lm(Wpct ~ RD, data = dedball)
dedfit
##
## Call:
## lm(formula = Wpct ~ RD, data = dedball)
##
## Coefficients:
## (Intercept) RD
## 0.4873796 0.0007386
#Plot of Linear Regression
#abline(a = coef(dedfit)[1] , b = coef(dedfit)[2] , lwd = 2)
#Predictions and residuals
dedball$linWpct <- predict(dedfit)
dedball$linResids <- residuals(dedfit)
#Graph of residuals
plot(dedball$RD, dedball$linResids,
     xlab = "Run Differential",
     ylab = "Residual")
abline(h = 0, lty = 3)
#Pythagorean Formula for Winning Percentage
dedball$pytWpct <- with(dedball, R^2 / (R^2 + RA^2))
#Residuals Computed
dedball$pytResids <- dedball$Wpct - dedball$pytWpct
sqrt(mean(dedball$pytResids ^ 2))
## [1] 0.04033301
For comparison, the RMSE here (0.0403) is noticeably larger than the 0.0258 we got for the modern era, so the Pythagorean formula fits 19th-century seasons considerably worse. That's all I feel like doing for this chapter. The other exercises don't look as interesting, and they make you dig through Retrosheet data, something I hate doing.