R Markdown

The data used here come from the retrosplits retrosheet github repository, provided by chadwickbureau. https://github.com/chadwickbureau/retrosplits https://github.com/chadwickbureau/retrosheet

I took the database of gamelogs and created a dataframe of starting pitcher, starting catcher, year, parkID, opponent offense, and runs given up (available as a CSV file here). The idea is to measure the impact of the starting catcher on runs given up, using a random effects model. Obviously using runs given up by the team, without controlling for the identity of any relievers used, among other factors, gives only a rough estimate of the impact of the catcher. I could make this more precise with the daybyday data from retrosplits, but without event level data, the data have some fundamental limitations.

First I read in the data

library(dplyr)
library(lme4)
library(sqldf)

master <- read.csv('master.csv')
master <- master %>% mutate(namefull = paste(namefirst, namelast, sep=" "))
df <- read.csv('old_timey_cm.csv')
df <- tbl_df(df)

I define some empty data frames to hold the results.

catcher.metric <- data.frame(yearID=integer(), 
                             catID=character(), 
                             val=double(),
                             n=integer()
)
park.factors <- data.frame(val=double(),
                           parkID=character(),
                           yearID=integer()
)

Now I loop over all available years, compute the random effects model, and use rbind with the empty data frames createwd above to store the results.

for (yr in 1911:2015) {
  fdf <- df %>% filter(yearID>=yr, yearID<=yr)
  fdf <- fdf %>% 
    mutate(pitKey=paste(pitID, yearID, sep="")) %>%
    group_by(catID) %>% mutate(n=n()) %>% ungroup()

  ans <- lmer(runs ~ (1|parkID) + 
                (1|pitKey) + 
                (1|catID) + 
                (1|opp), 
              data=fdf)

  print(ans)
  rr <- ranef(ans)

  pitchers <- as.data.frame(rr$pitKey[,1])
  pitchers$pitID <- row.names(rr$pitKey)
  names(pitchers) <- c("val", "pitKey")

  catchers <- as.data.frame(rr$catID[,1])
  catchers$catID <- row.names(rr$catID)
  names(catchers) <- c("val", "catID")

  park.effects <- as.data.frame(rr$parkID[,1])
  park.effects$parkID <- row.names(rr$parkID)
  names(park.effects) <- c("val", "team")
  park.effects$yearID <- rep(yr, nrow(park.effects))
  
  park.factors <- rbind.data.frame(park.factors, park.effects)
  ndf <- sqldf('select b.yearID, a.catID, a.val, b.n from catchers a inner join fdf b on a.catID=b.catID group by a.catID')
  catcher.metric <- rbind.data.frame(catcher.metric, ndf)
}

Here I put a limit of at least 300 career starts and compute the weighted average of runs saved (runsv). Finally I join with the master table to output full name instead of retrosheet ID.

ss <- catcher.metric %>% 
  mutate(wval=val*n) %>% 
  group_by(catID) %>% 
  summarise(runsv=sum(wval)/sum(n), n=sum(n)) %>% 
  filter(n>=300) %>% 
  arrange(runsv)

uu <- sqldf('select a.*, m.namefull from ss a left join master m on a.catID=m.retroid')

The top 20, weighted average of catcher run-saving are,

print(uu[1:20,] %>% select(namefull, runsv, n)) 
##            namefull      runsv    n
## 1         Jim Hegan -0.3193418 1429
## 2   Mickey Cochrane -0.2917492 1395
## 3     Bill Killefer -0.2773495  876
## 4        Yogi Berra -0.2711254 1641
## 5     George Gibson -0.2699274  486
## 6         Ken O'Dea -0.2668881  519
## 7      Chief Meyers -0.2653147  702
## 8        Red Wilson -0.2450274  507
## 9      Paul Lo Duca -0.2379321  887
## 10       Tom Padden -0.2343716  338
## 11   Gabby Hartnett -0.2298820 1714
## 12 Bubbles Hargrave -0.2283137  680
## 13    Yadier Molina -0.2197564 1380
## 14         Ray Katt -0.2170842  304
## 15     Del Crandall -0.2140449 1360
## 16      Roy Spencer -0.2109688  498
## 17   Russell Martin -0.2096291 1187
## 18    Jason Varitek -0.2060981 1372
## 19   Salvador Perez -0.1969813  519
## 20   Roy Campanella -0.1947015 1128