Sabermetrics Example (Key). Ursinus College

Moneyball (2011).

Introductory videos:
Trailer: https://www.youtube.com/watch?v=-4QPVo0UIzc
He Gets on Base Scene: https://www.youtube.com/watch?v=3MjxoaynCmk
Player Value Scene: https://www.youtube.com/watch?v=Tzin1DgexlE

Academic videos:
We will use and follow a set of videos by HarvardX (Machine Intelligence YouTube Channel).
Link here to first video. Replication in R follows.

Install and load key R packages

Install and load the the following packages: tidyverse and Lahman.

#install.packages("tidyverse")
#install.packages("Lahman")
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(Lahman)

Lecture 1.03 - Bases on Balls or Stolen Bases (Link to Video)

Create the following Scatterplots: 1. Home Runs per Game vs Runs per Game, 2. Stolen Bases per Game and Runs per Game. 3. Bases on Ball (Walks) per Game and Runs per Game. Use data from 1961 to 2001.

Key Code:

#Create scatterplots
library(ggplot2)
Teams %>%  filter(yearID %in% 1961:2001) %>%  
  mutate(HR_per_game = HR/G, R_per_game = R/G) %>%  
  ggplot(aes(HR_per_game, R_per_game))+
  geom_point(alpha = 0.5) + geom_smooth()+theme_minimal()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Teams %>%  filter(yearID %in% 1961:2001) %>%  
  mutate(SB_per_game = SB/G, R_per_game = R/G) %>%  
  ggplot(aes(SB_per_game, R_per_game))+
  geom_point(alpha = 0.5) +theme_minimal()

Teams %>%  filter(yearID %in% 1961:2001) %>%  
  mutate(BB_per_game = BB/G, R_per_game = R/G) %>%  
  ggplot(aes(BB_per_game, R_per_game))+
  geom_point(alpha = 0.5) + geom_smooth() + theme_minimal()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Lecture 1.04 - 1.10 Regression Basics

The rest of the videos in this chapter (chapter 1) show a quick review of simple linear regression using mostly Francis Galton’s original example (father’s height and son’s heights). To save time, we can skip to the sabermetrics practice, and the hands on practice picking the winning team.

Lecture 2.01 - Confounding Are BBs More Predictive?(Link to Video)

Create correlations between: 1. Home Runs per Game and Walks (Bases on Ball) per Game, 2. Home Runs per Game and Singles per Game, and 3. Walks per Game and Singles per Game. What do we see here?

Key Code:

#Calculate correlations
Teams %>%  filter(yearID %in% 1961:2001) %>%  
  mutate(Singles = (H-HR-X2B-X3B)/G, BB = BB/G, HR=HR/G) %>%  
  summarize(cor(BB,HR), cor(Singles, HR), cor(BB,Singles))

##   cor(BB, HR) cor(Singles, HR) cor(BB, Singles)
## 1   0.4039313       -0.1737435      -0.05603822

The key to this video is to show the confounding between HR and BB predicting runs. This motivates the need for multivariate regression analysis. The next videos build the foundations up from scratch, starting by conditioning or stratifying in each of the variables. Given our time constraints, we’ll skip those for now even though they are highly recommended for anyone (either first time seeing this material or as a review). I’ll just provide a quick summary here.

In lecture 2.02, multivariate regression is revisited using the sabermetrics example, and stratification is used to motivate the analysis. In lecture 2.03, theoretical justifications for the multivariate model are given, including the bivariate normality and the role of conditional expectations. In lecture 2.04, the ordinary least squared (OLS) method is discussed, including the Residual Sum of Squares (RSS) and the intuition behind the OLS optimal solution. Lecture 2.05 discusses the lm function (which stands for Linear Model) and lecture 2.06 shows that the coefficients are random variables using simulations (cool!). Lecture 2.07 comes back to Baseball, cleans the data using dplyr and introduces Tibbles. Lecture 2.08 explains the differences between tibbles and data.frames very nicely.

Lecture 2.09 describes the do() function and provides a useful example of how to define functions in R. Lecture 2.10 describes the very useful broom() package. Finally, lecture 2.11 (8 min long) builds the team, which is what we want. We will focus on doing that today.

Lecture 2.11 - Building a Better Offensive Metric for Baseball(Link to video:)

This particular video is relatively long, so we’ll break it down into two parts: Part 1 (until minute 3:30): Team regression model.
Part 2 (after minute 3:30): Player regression model, and picking the winning team

Part 1 Code:

Create multivariate and univariate regressions. Compare the coefficients. Multivariate 1: Runs per Game on Bases on Ball per Game and Home Runs per Game. Univariate regressions: 1) Runs per Game on Bases on Ball per Game. 2) Runs per Game on Home Runs per Game. Multivariate 2: Runs per Game on Bases on Ball per Game, Singles per Game, Doubles per Game, Triples per Game, and Home Runs per Game.

#Bivariate Regression
fit<-Teams %>%  filter(yearID %in% 1961:2001) %>%  
  mutate(BB = BB/G, HR=HR/G, R=R/G) %>%  
  lm(R~BB+HR, data = .)
library(broom)
tidy(fit, conf.int = TRUE)

## # A tibble: 3 × 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)    1.74     0.0824      21.2 7.62e- 83    1.58      1.91 
## 2 BB             0.387    0.0270      14.3 1.20e- 42    0.334     0.440
## 3 HR             1.56     0.0490      31.9 1.78e-155    1.47      1.66

#Univariate Regressions
fit1.1<-Teams %>%  filter(yearID %in% 1961:2001) %>%  
  mutate(BB = BB/G, HR=HR/G, R=R/G) %>%  
  lm(R~BB, data = .)
tidy(fit1.1)

## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    1.93     0.116       16.7 2.30e-55
## 2 BB             0.735    0.0349      21.1 3.00e-82

fit1.2<-Teams %>%  filter(yearID %in% 1961:2001) %>%  
  mutate(BB = BB/G, HR=HR/G, R=R/G) %>%  
  lm(R~HR, data = .)
tidy(fit1.2)

## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)     2.78    0.0436      63.7 0        
## 2 HR              1.84    0.0491      37.6 4.29e-195

fit2 <- Teams %>%  
  filter(yearID %in% 1961:2001) %>%  
  mutate(BB = BB/G,
         #SB = SB/G, 
         singles = (H-HR-X2B-X3B)/G,
         doubles = X2B/G,
         triples = X3B/G, 
         HR = HR/G, 
         R = R/G) %>%
  lm(R ~ BB + singles + doubles + triples + HR, data = .)
tidy(fit2)

## # A tibble: 6 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   -2.77     0.0862     -32.1 4.76e-157
## 2 BB             0.371    0.0117      31.6 1.87e-153
## 3 singles        0.519    0.0127      40.8 8.67e-217
## 4 doubles        0.771    0.0226      34.1 8.44e-171
## 5 triples        1.24     0.0768      16.1 2.12e- 52
## 6 HR             1.44     0.0243      59.3 0

Team Prediction: Use the estimated the model above (using 1961-2001 data) to predict runs for the year 2002. You should use the last multivariate model with 5 independent variables. Then compare the predictions of the model with the actual runs the teams made in 2002.

#Prediction: 
Teams %>%  
  filter(yearID %in% 2002) %>%  
  mutate(BB = BB/G,
         #SB = SB/G, 
         singles = (H-HR-X2B-X3B)/G,
         doubles = X2B/G,
         triples = X3B/G, 
         HR = HR/G, 
         R = R/G) %>%
  mutate(R_hat = predict(fit2, newdata = .)) %>%   
  ggplot(aes(x = R_hat, y = R, label = teamID)) + geom_point() +
  geom_text(hjust=-0.5, vjust=0) +
  geom_smooth(formula = y ~ x, method = 'loess') +
  geom_smooth(formula = y ~ x, method = 'lm', col='red')

Part 2 Code: Min 3:30+

(Player model)

Make the Per-Game Team Rate comparable with the Per-Plate Appearance Player Rate by computing the Team Plate Appearances per Game. For each team, you need to add the At bats and Bases on Ball of all players (that’s the total number of appearances) and divided by the total numbers of games the team played in the season. Use season 2002 for this calculation.

pa_per_game <- Batting %>%  filter(yearID == 2002) %>% 
  group_by(teamID) %>%  
  summarize(pa_per_game = sum(AB + BB)/max(G)) %>% 
  .$pa_per_game %>% 
  mean

Pick players in 2002 using data from 1999-2001. To do that, first, we need to filter our players and create our dataset. Filter out players with too few plate appearances (less than 300), calculate the plate appearances, Number of Games they played, Bases on Ball per Game, Singles per Game, Doubles per Game, Triples per Game, Home Runs per Game, their hitting AVG, and the predicted runs using the model above.

#Create the data frame from which to pick the players
players <- Batting %>% 
  filter(yearID %in% 1999:2001) %>% 
  group_by(playerID) %>% 
  mutate(PA = BB + AB) %>% 
  summarize(G = sum(PA)/pa_per_game,
            BB = sum(BB)/G,
            singles = sum(H-X2B-X3B-HR)/G,
            doubles = sum(X2B)/G,
            triples = sum(X3B)/G,
            HR = sum(HR)/G,
            AVG = sum(H)/sum(AB),
            PA = sum(PA)) %>% 
  filter(PA >= 300) %>% 
  select(-G) %>% 
  mutate(R_hat = predict(fit2, newdata = .)) 

#Distribution of R_hat. Note: Wide variability is a good sign.  
players %>%  ggplot(aes(R_hat)) +
  geom_histogram(binwidth =  0.5, color = "black")

Picking the Winning Team:

First, We need the salary of the players and their position.

Some Data Wrangling is necessary to combine different Tables.
First, we need to add (or merge) the 2002 salaries for each player, from the Salary Table. Write the code to do that here:

Key code: #### Salary:

#Merge in the player's salaries
players <- Salaries %>%  
  filter(yearID == 2002) %>% 
  select(playerID, salary) %>% 
  right_join(players, by="playerID")

Defensive Position. From Fielding’s Table

Then, we need to get the position most played (since some players play multiple positions). Also, remove the outfielders (OF = CF+LF+RF) and pitchers.

#Merge in the defensive position (we need to build an entire Team) 
#Remove outfielders and pitchers, and pick positions most played. 
players <- Fielding %>%  filter(yearID==2002) %>% 
  filter(!POS %in% c("OF","P")) %>% 
  group_by(playerID) %>%  
  top_n(1, G) %>% 
  filter(row_number(G) == 1 ) %>%  
  ungroup() %>%  
  select(playerID, POS) %>% 
  right_join(players, by="playerID") %>% 
  filter(!is.na(POS) & !is.na(salary))

Add Names (from People’s Table)

Get names of the players so we know who each player is.

Key code:

#Merge in the players' names. 
players <- People %>%  
  select(playerID, nameFirst, nameLast, debut) %>%  
  right_join(players, by="playerID")

#Show the top 10 most (potentially) productive players according to our model.      
players %>%  select(nameFirst, nameLast, POS, salary, R_hat) %>%  
  arrange(desc(R_hat)) %>% 
  top_n(10)

## Selecting by R_hat

##    nameFirst    nameLast POS   salary    R_hat
## 1       Todd      Helton  1B  5000000 8.228108
## 2      Jason      Giambi  1B 10428571 7.993795
## 3     Albert      Pujols  3B   600000 7.536447
## 4      Nomar Garciaparra  SS  9000000 7.505063
## 5       Jeff     Bagwell  1B 11000000 7.475066
## 6       Alex   Rodriguez  SS 22000000 7.439937
## 7     Carlos     Delgado  1B 19400000 7.368570
## 8     Rafael    Palmeiro  1B  8712986 7.257195
## 9       Mike      Piazza   C 10571429 7.160964
## 10       Jim       Thome  1B  8000000 7.156418

Visualizations

Create a scatterplot with salary in the horizontal axis and the predicted runs from the model in the vertical axis. Display each position in a different color.

#First Visual
players %>%  ggplot(aes(x = salary, y = R_hat, color = POS))+
  geom_point() +
  scale_x_log10()

We want only players whose debut was in 1997 or earlier (i.e., to remove all young players that haven’t been able to negotiate their contracts yet). Write a code that does that.

#Second (refined) visual, removing too young players (no contract yet)
players %>% filter(debut<1998) %>%  
  ggplot(aes(salary, R_hat, color= POS))+
  geom_point() + 
  scale_x_log10()

Finally, pick the Team

Make it interactive so that it’s easier to pick a team under the $40 MM Billy Beane had to work with.

#Make it interactive
#install.packages("plotly")
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

visual1 <- players %>% filter(debut<1998) %>%  
  ggplot( aes(salary, R_hat, color= POS, 
             text = paste(nameFirst, nameLast))) +
  geom_point() + 
  scale_x_log10()

visual2<- ggplotly(visual1, type = 'scatter') 
visual2