18.10 Exercises
Since the 1980s, sabermetricians have used a summary statistic different from batting average to evaluate players. They realized walks were important and that doubles, triples, and HRs, should be weighed more than singles. As a result, they proposed the following metric: \(\frac{BB}{PA}+\frac{Singles+2Doubles+3Triples+4HR}{AB}\) They called this on-base-percentage plus slugging percentage (OPS). Although the sabermetricians probably did not use regression, here we show how this metric is close to what one gets with regression. 1. Compute the OPS for each team in the 2001 season. Then plot Runs per game versus OPS.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.4     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(magrittr)
## 
## Attaching package: 'magrittr'
## 
## The following object is masked from 'package:purrr':
## 
##     set_names
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
library(Lahman)
## Warning: package 'Lahman' was built under R version 4.3.3
library(ggplot2)
j<-Teams %>% filter(yearID %in% 2001) %>% mutate(singles=(H-HR-X2B-X3B)/G, BB=BB/G, HR=HR/G, R_per_game=R/G, doubles=X2B/G, triples=X3B/G, PA=BB+AB)

OPStab<-j%>%mutate(OPS=(BB/PA+(singles+2*doubles+3*triples+4*HR)/AB))
ggplot(data=OPStab, mapping=aes(x=R_per_game, y=OPS))+geom_point()+
labs(title="2001 Baseball On-Base-Percentages Vs. Runs Per Game", x="On-Base-Percentages", y="Runs per Game")

2. For every year since 1962, compute the correlation between runs per game and OPS; then plot these correlations as a function of year.

corrOPS<-Teams %>% filter(yearID %in% 1962:2001) %>% mutate(singles=(H-HR-X2B-X3B)/G, BB=BB/G, HR=HR/G, R_per_game=R/G, doubles=X2B/G, triples=X3B/G, PA=BB+AB) %>%mutate(OPS=(BB/PA+(singles+2*doubles+3*triples+4*HR)/AB)) %>% group_by(yearID) %>% summarise(corrR=cor(R_per_game, OPS)) 
ggplot(data=corrOPS, mapping=aes(x=yearID, y=corrR))+geom_point()+
labs(title="1962-2001 Baseball Correlation Between On-Base-Percentages Vs. Runs Per Game By Year", x="Year", y="Correlation")

3. Note that we can rewrite OPS as a weighted average of BBs, singles, doubles, triples, and HRs. We know that the weights for doubles, triples, and HRs are 2, 3, and 4 times that of singles. But what about BB? What is the weight for BB relative to singles? Hint: the weight for BB relative to singles will be a function of AB and PA. \(\frac{AB}{PA}\) is the weight for BB to singles.

4. Note that the weight for \(BB\), , will change from team to team. To see how variable it is, compute and plot this quantity for each team for each year since 1962. Then plot it again, but instead of computing it for every team, compute and plot the ratio for the entire year. Then, once you are convinced that there is not much of a time or team trend, report the overall average.

#Plot the BB weight for each team during the time period from 1962 to 2001.
BBweight<-Teams%>%filter(yearID %in% 1962:2001)%>%mutate(singles=(H-HR-X2B-X3B)/G, BB=BB/G, HR=HR/G, R_per_game=R/G, doubles=X2B/G, triples=X3B/G, PA=BB+AB,BBw=AB/PA)
ggplot(data=BBweight, mapping=aes(x=yearID, y=BBw)) + geom_point()

#Plot the BB weight for a single year. 
BBweightT<-Teams%>%filter(yearID %in% 1962:1969)%>%mutate(singles=(H-HR-X2B-X3B)/G, BB=BB/G, HR=HR/G, R_per_game=R/G, doubles=X2B/G, triples=X3B/G, PA=BB+AB,BBw=AB/PA)
ggplot(data=BBweightT, mapping=aes(x=yearID, y=BBw)) + geom_point()+labs(title="1962-1969 Baseball Weight of 'A Base on Balls' for all MLB Teams", x="Year", y="Weight of Base on Balls")

#Overall average based on team year.
overallavg<-mean(BBweight$BBw)
overallavg
## [1] 0.9993856

5. So now we know that the formula for OPS is proportional to \(0.91 ×BB+singles+2×doubles+3×triples+4×HR\). Let’s see how these coefficients compare to those obtained with regression. Fit a regression model to the data after 1962, as done earlier: using per game statistics for each year for each team. After fitting this model, report the coefficients as weights relative to the coefficient for singles.

library(purrr)
library(dplyr)

#Load your basеball data
jx<-Teams %>% filter(yearID %in% 1962:2001) %>% mutate(singles=(H-HR-X2B-X3B)/G, BB=BB/G, HR=HR/G, R_per_game=R/G, doubles=X2B/G, triples=X3B/G, PA=BB+AB)
OPStabx<-jx%>%mutate(OPS=(BB/PA+(singles+2*doubles+3*triples+4*HR)/AB))
regmod<-lm(OPS~BB+singles+doubles+triples+HR, data=OPStabx)
getco<-coef(regmod)
weight<-getco/getco["singles"]
print(weight)
## (Intercept)          BB     singles     doubles     triples          HR 
##    1.755233    1.707146    1.000000    3.673497    3.462409    4.029015
regmod
## 
## Call:
## lm(formula = OPS ~ BB + singles + doubles + triples + HR, data = OPStabx)
## 
## Coefficients:
## (Intercept)           BB      singles      doubles      triples           HR  
##   0.0002329    0.0002266    0.0001327    0.0004875    0.0004595    0.0005347
  1. We see that our linear regression model coefficients follow the same general trend as those used by OPS, but with slightly less weight for metrics other than singles. For each team in years after 1962, compute the OPS, the predicted runs with the regression model and compute the correlation between the two as well as the correlation with runs per game.
#Compute OPS.
corrOPS<-Teams %>% filter(yearID %in% 1962:2001) %>% mutate(singles=(H-HR-X2B-X3B)/G, BB=BB/G, HR=HR/G, R_per_game=R/G, doubles=X2B/G, triples=X3B/G, PA=BB+AB) %>%mutate(OPS=(BB/PA+(singles+2*doubles+3*triples+4*HR)/AB))

rmm<-regmod$coefficients

#Predicted runs from prev. regression model.
rmm<-as.data.frame(rmm)
corrOPS2<-corrOPS%>%mutate(predrun=rmm[1,1]+rmm[2,1]*singles+
rmm[3,1]*doubles+rmm[4,1]*triples+rmm[5,1]*HR)


#Correlation between OPS & predicted runs.
cOPspredrun<-corrOPS2 %>% group_by(yearID) %>% summarise(corrPR=cor(predrun, OPS))

#Correlation between 0PS & runs per game.
cOPsrun<-corrOPS %>% group_by(yearID) %>% summarise(corrR=cor(R_per_game, OPS))
  1. We see that using the regression approach predicts runs slightly better than OPS, but not that much. However, note that we have been computing OPS and predicting runs for teams when these measures are used to evaluate players. Let’s show that OPS is quite similar to what one obtains with regression at the player level. For the 1962 season and after, compute the OPS and the predicted runs from our model for each player and plot them. Use the PA per game correction we used in the previous chapter:
#Calculate OPS + predicted runs by PlayerID.#
corrOPSX<-Batting %>% filter(yearID %in% 1962:2001) %>% mutate(singles=(H-HR-X2B-X3B)/G, BB=BB/G, HR=HR/G, R_per_game=R/G, doubles=X2B/G, triples=X3B/G, PA=BB+AB) %>%mutate(OPS=(BB/PA+(singles+2*doubles+3*triples+4*HR)/AB)) %>% mutate(predrun=rmm[1,1]+rmm[2,1]*singles+rmm[3,1]*doubles+rmm[4,1]*triples+rmm[5,1]*HR) %>% group_by(playerID, yearID) 

#Plot OPS for Aaron Hanks from 1962 onwards.#
J<-c("aaronha01")
newdata<-subset(corrOPSX, playerID=="aaronha01")
ggplot(data=newdata, mapping=aes(x=yearID, y=OPS))+geom_point()+labs(title="Aaron Hanks OPS", x="Year", y="OPS")

#Plot Predicted Runs for Aaron Hanks from 1962 onwards.#
Jj<-c("aaronha01")
newdataj<-subset(corrOPSX, playerID=="aaronha01")
ggplot(data=newdata, mapping=aes(x=yearID, y=predrun))+geom_point()+labs(title="Aaron Hanks Predicted Runs", x="Year", y="Predicted Runs")

  1. What players have show the largest difference between their rank by predicted runs and OPS?
large<-corrOPSX%>%mutate(rankch=predrun-OPS)%>%select(playerID, yearID, rankch)
largex<-large%>%group_by(playerID)%>%summarize(meanrk=mean(rankch))
largex<-largex[order(largex$meanrk, decreasing=TRUE),]

#10 players with the largest difference between their predicted runs and OPS.#
head(largex, 10)
## # A tibble: 10 × 2
##    playerID    meanrk
##    <chr>        <dbl>
##  1 abadan01  0.000233
##  2 abregjo01 0.000233
##  3 aldrico01 0.000233
##  4 alstoga01 0.000233
##  5 andermi02 0.000233
##  6 andrecl01 0.000233
##  7 arlicdo01 0.000233
##  8 arroyru01 0.000233
##  9 atchlju01 0.000233
## 10 austde01  0.000233