18.10 Exercises
Since the 1980s, sabermetricians have used a summary statistic different
from batting average to evaluate players. They realized walks were
important and that doubles, triples, and HRs, should be weighed more
than singles. As a result, they proposed the following metric: \(\frac{BB}{PA}+\frac{Singles+2Doubles+3Triples+4HR}{AB}\)
They called this on-base-percentage plus slugging percentage (OPS).
Although the sabermetricians probably did not use regression, here we
show how this metric is close to what one gets with regression. 1.
Compute the OPS for each team in the 2001 season. Then plot Runs per
game versus OPS.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ ggplot2 3.4.4 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(magrittr)
##
## Attaching package: 'magrittr'
##
## The following object is masked from 'package:purrr':
##
## set_names
##
## The following object is masked from 'package:tidyr':
##
## extract
library(Lahman)
## Warning: package 'Lahman' was built under R version 4.3.3
library(ggplot2)
j<-Teams %>% filter(yearID %in% 2001) %>% mutate(singles=(H-HR-X2B-X3B)/G, BB=BB/G, HR=HR/G, R_per_game=R/G, doubles=X2B/G, triples=X3B/G, PA=BB+AB)
OPStab<-j%>%mutate(OPS=(BB/PA+(singles+2*doubles+3*triples+4*HR)/AB))
ggplot(data=OPStab, mapping=aes(x=R_per_game, y=OPS))+geom_point()+
labs(title="2001 Baseball On-Base-Percentages Vs. Runs Per Game", x="On-Base-Percentages", y="Runs per Game")
2. For every year since 1962, compute the correlation between runs per game and OPS; then plot these correlations as a function of year.
corrOPS<-Teams %>% filter(yearID %in% 1962:2001) %>% mutate(singles=(H-HR-X2B-X3B)/G, BB=BB/G, HR=HR/G, R_per_game=R/G, doubles=X2B/G, triples=X3B/G, PA=BB+AB) %>%mutate(OPS=(BB/PA+(singles+2*doubles+3*triples+4*HR)/AB)) %>% group_by(yearID) %>% summarise(corrR=cor(R_per_game, OPS))
ggplot(data=corrOPS, mapping=aes(x=yearID, y=corrR))+geom_point()+
labs(title="1962-2001 Baseball Correlation Between On-Base-Percentages Vs. Runs Per Game By Year", x="Year", y="Correlation")
3. Note that we can rewrite OPS as a weighted average of BBs, singles, doubles, triples, and HRs. We know that the weights for doubles, triples, and HRs are 2, 3, and 4 times that of singles. But what about BB? What is the weight for BB relative to singles? Hint: the weight for BB relative to singles will be a function of AB and PA. \(\frac{AB}{PA}\) is the weight for BB to singles.
4. Note that the weight for \(BB\), , will change from team to team. To see how variable it is, compute and plot this quantity for each team for each year since 1962. Then plot it again, but instead of computing it for every team, compute and plot the ratio for the entire year. Then, once you are convinced that there is not much of a time or team trend, report the overall average.
#Plot the BB weight for each team during the time period from 1962 to 2001.
BBweight<-Teams%>%filter(yearID %in% 1962:2001)%>%mutate(singles=(H-HR-X2B-X3B)/G, BB=BB/G, HR=HR/G, R_per_game=R/G, doubles=X2B/G, triples=X3B/G, PA=BB+AB,BBw=AB/PA)
ggplot(data=BBweight, mapping=aes(x=yearID, y=BBw)) + geom_point()
#Plot the BB weight for a single year.
BBweightT<-Teams%>%filter(yearID %in% 1962:1969)%>%mutate(singles=(H-HR-X2B-X3B)/G, BB=BB/G, HR=HR/G, R_per_game=R/G, doubles=X2B/G, triples=X3B/G, PA=BB+AB,BBw=AB/PA)
ggplot(data=BBweightT, mapping=aes(x=yearID, y=BBw)) + geom_point()+labs(title="1962-1969 Baseball Weight of 'A Base on Balls' for all MLB Teams", x="Year", y="Weight of Base on Balls")
#Overall average based on team year.
overallavg<-mean(BBweight$BBw)
overallavg
## [1] 0.9993856
5. So now we know that the formula for OPS is proportional to \(0.91 ×BB+singles+2×doubles+3×triples+4×HR\). Let’s see how these coefficients compare to those obtained with regression. Fit a regression model to the data after 1962, as done earlier: using per game statistics for each year for each team. After fitting this model, report the coefficients as weights relative to the coefficient for singles.
library(purrr)
library(dplyr)
#Load your basеball data
jx<-Teams %>% filter(yearID %in% 1962:2001) %>% mutate(singles=(H-HR-X2B-X3B)/G, BB=BB/G, HR=HR/G, R_per_game=R/G, doubles=X2B/G, triples=X3B/G, PA=BB+AB)
OPStabx<-jx%>%mutate(OPS=(BB/PA+(singles+2*doubles+3*triples+4*HR)/AB))
regmod<-lm(OPS~BB+singles+doubles+triples+HR, data=OPStabx)
getco<-coef(regmod)
weight<-getco/getco["singles"]
print(weight)
## (Intercept) BB singles doubles triples HR
## 1.755233 1.707146 1.000000 3.673497 3.462409 4.029015
regmod
##
## Call:
## lm(formula = OPS ~ BB + singles + doubles + triples + HR, data = OPStabx)
##
## Coefficients:
## (Intercept) BB singles doubles triples HR
## 0.0002329 0.0002266 0.0001327 0.0004875 0.0004595 0.0005347
#Compute OPS.
corrOPS<-Teams %>% filter(yearID %in% 1962:2001) %>% mutate(singles=(H-HR-X2B-X3B)/G, BB=BB/G, HR=HR/G, R_per_game=R/G, doubles=X2B/G, triples=X3B/G, PA=BB+AB) %>%mutate(OPS=(BB/PA+(singles+2*doubles+3*triples+4*HR)/AB))
rmm<-regmod$coefficients
#Predicted runs from prev. regression model.
rmm<-as.data.frame(rmm)
corrOPS2<-corrOPS%>%mutate(predrun=rmm[1,1]+rmm[2,1]*singles+
rmm[3,1]*doubles+rmm[4,1]*triples+rmm[5,1]*HR)
#Correlation between OPS & predicted runs.
cOPspredrun<-corrOPS2 %>% group_by(yearID) %>% summarise(corrPR=cor(predrun, OPS))
#Correlation between 0PS & runs per game.
cOPsrun<-corrOPS %>% group_by(yearID) %>% summarise(corrR=cor(R_per_game, OPS))
#Calculate OPS + predicted runs by PlayerID.#
corrOPSX<-Batting %>% filter(yearID %in% 1962:2001) %>% mutate(singles=(H-HR-X2B-X3B)/G, BB=BB/G, HR=HR/G, R_per_game=R/G, doubles=X2B/G, triples=X3B/G, PA=BB+AB) %>%mutate(OPS=(BB/PA+(singles+2*doubles+3*triples+4*HR)/AB)) %>% mutate(predrun=rmm[1,1]+rmm[2,1]*singles+rmm[3,1]*doubles+rmm[4,1]*triples+rmm[5,1]*HR) %>% group_by(playerID, yearID)
#Plot OPS for Aaron Hanks from 1962 onwards.#
J<-c("aaronha01")
newdata<-subset(corrOPSX, playerID=="aaronha01")
ggplot(data=newdata, mapping=aes(x=yearID, y=OPS))+geom_point()+labs(title="Aaron Hanks OPS", x="Year", y="OPS")
#Plot Predicted Runs for Aaron Hanks from 1962 onwards.#
Jj<-c("aaronha01")
newdataj<-subset(corrOPSX, playerID=="aaronha01")
ggplot(data=newdata, mapping=aes(x=yearID, y=predrun))+geom_point()+labs(title="Aaron Hanks Predicted Runs", x="Year", y="Predicted Runs")
large<-corrOPSX%>%mutate(rankch=predrun-OPS)%>%select(playerID, yearID, rankch)
largex<-large%>%group_by(playerID)%>%summarize(meanrk=mean(rankch))
largex<-largex[order(largex$meanrk, decreasing=TRUE),]
#10 players with the largest difference between their predicted runs and OPS.#
head(largex, 10)
## # A tibble: 10 × 2
## playerID meanrk
## <chr> <dbl>
## 1 abadan01 0.000233
## 2 abregjo01 0.000233
## 3 aldrico01 0.000233
## 4 alstoga01 0.000233
## 5 andermi02 0.000233
## 6 andrecl01 0.000233
## 7 arlicdo01 0.000233
## 8 arroyru01 0.000233
## 9 atchlju01 0.000233
## 10 austde01 0.000233