The data used here come from the retrosplitsretrosheet github repository, provided by chadwickbureau. https://github.com/chadwickbureau/retrosplits https://github.com/chadwickbureau/retrosheet
I took the database of gamelogs and created a dataframe of starting pitcher, starting catcher, year, parkID, opponent offense, and runs given up (available as a CSV file here). The idea is to measure the impact of the starting catcher on runs given up, using a random effects model. Obviously using runs given up by the team, without controlling for the identity of any relievers used, among other factors, gives only a rough estimate of the impact of the catcher. I could make this more precise with the daybyday data from retrosplits, but without event level data, the data have some fundamental limitations.
First I read in the data
library(dplyr)
library(lme4)
library(sqldf)
master <- read.csv('master.csv')
master <- master %>% mutate(namefull = paste(namefirst, namelast, sep=" "))
df <- read.csv('old_timey_cm.csv')
df <- tbl_df(df)
I define some empty data frames to hold the results.
catcher.metric <- data.frame(yearID=integer(),
catID=character(),
val=double(),
n=integer()
)
park.factors <- data.frame(val=double(),
parkID=character(),
yearID=integer()
)
Now I loop over all available years, compute the random effects model, and use rbind with the empty data frames createwd above to store the results.
for (yr in 1911:2015) {
fdf <- df %>% filter(yearID>=yr, yearID<=yr)
fdf <- fdf %>%
mutate(pitKey=paste(pitID, yearID, sep="")) %>%
group_by(catID) %>% mutate(n=n()) %>% ungroup()
ans <- lmer(runs ~ (1|parkID) +
(1|pitKey) +
(1|catID) +
(1|opp),
data=fdf)
print(ans)
rr <- ranef(ans)
pitchers <- as.data.frame(rr$pitKey[,1])
pitchers$pitID <- row.names(rr$pitKey)
names(pitchers) <- c("val", "pitKey")
catchers <- as.data.frame(rr$catID[,1])
catchers$catID <- row.names(rr$catID)
names(catchers) <- c("val", "catID")
park.effects <- as.data.frame(rr$parkID[,1])
park.effects$parkID <- row.names(rr$parkID)
names(park.effects) <- c("val", "team")
park.effects$yearID <- rep(yr, nrow(park.effects))
park.factors <- rbind.data.frame(park.factors, park.effects)
ndf <- sqldf('select b.yearID, a.catID, a.val, b.n from catchers a inner join fdf b on a.catID=b.catID group by a.catID')
catcher.metric <- rbind.data.frame(catcher.metric, ndf)
}
Here I put a limit of at least 300 career starts and compute the weighted average of runs saved (runsv). Finally I join with the master table to output full name instead of retrosheet ID.
ss <- catcher.metric %>%
mutate(wval=val*n) %>%
group_by(catID) %>%
summarise(runsv=sum(wval)/sum(n), n=sum(n)) %>%
filter(n>=300) %>%
arrange(runsv)
uu <- sqldf('select a.*, m.namefull from ss a left join master m on a.catID=m.retroid')
The top 20, weighted average of catcher run-saving are,
print(uu[1:20,] %>% select(namefull, runsv, n))
## namefull runsv n
## 1 Jim Hegan -0.3193418 1429
## 2 Mickey Cochrane -0.2917492 1395
## 3 Bill Killefer -0.2773495 876
## 4 Yogi Berra -0.2711254 1641
## 5 George Gibson -0.2699274 486
## 6 Ken O'Dea -0.2668881 519
## 7 Chief Meyers -0.2653147 702
## 8 Red Wilson -0.2450274 507
## 9 Paul Lo Duca -0.2379321 887
## 10 Tom Padden -0.2343716 338
## 11 Gabby Hartnett -0.2298820 1714
## 12 Bubbles Hargrave -0.2283137 680
## 13 Yadier Molina -0.2197564 1380
## 14 Ray Katt -0.2170842 304
## 15 Del Crandall -0.2140449 1360
## 16 Roy Spencer -0.2109688 498
## 17 Russell Martin -0.2096291 1187
## 18 Jason Varitek -0.2060981 1372
## 19 Salvador Perez -0.1969813 519
## 20 Roy Campanella -0.1947015 1128