Introductory videos:
Trailer: https://www.youtube.com/watch?v=-4QPVo0UIzc
He Gets on Base Scene: https://www.youtube.com/watch?v=3MjxoaynCmk
Player Value Scene: https://www.youtube.com/watch?v=Tzin1DgexlE
Academic videos:
We will use and follow a set of videos by HarvardX
(Machine Intelligence YouTube Channel).
Link here
to first video. Replication in R follows.
Install and load the the following packages: tidyverse and Lahman.
#install.packages("tidyverse")
#install.packages("Lahman")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(Lahman)
Create the following Scatterplots: 1. Home Runs per Game vs Runs per Game, 2. Stolen Bases per Game and Runs per Game. 3. Bases on Ball (Walks) per Game and Runs per Game. Use data from 1961 to 2001.
Key Code:
#Create scatterplots
library(ggplot2)
Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(HR_per_game = HR/G, R_per_game = R/G) %>%
ggplot(aes(HR_per_game, R_per_game))+
geom_point(alpha = 0.5) + geom_smooth()+theme_minimal()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(SB_per_game = SB/G, R_per_game = R/G) %>%
ggplot(aes(SB_per_game, R_per_game))+
geom_point(alpha = 0.5) +theme_minimal()
Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(BB_per_game = BB/G, R_per_game = R/G) %>%
ggplot(aes(BB_per_game, R_per_game))+
geom_point(alpha = 0.5) + geom_smooth() + theme_minimal()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
The rest of the videos in this chapter (chapter 1) show a quick review of simple linear regression using mostly Francis Galton’s original example (father’s height and son’s heights). To save time, we can skip to the sabermetrics practice, and the hands on practice picking the winning team.
Create correlations between: 1. Home Runs per Game and Walks (Bases on Ball) per Game, 2. Home Runs per Game and Singles per Game, and 3. Walks per Game and Singles per Game. What do we see here?
Key Code:
#Calculate correlations
Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(Singles = (H-HR-X2B-X3B)/G, BB = BB/G, HR=HR/G) %>%
summarize(cor(BB,HR), cor(Singles, HR), cor(BB,Singles))
## cor(BB, HR) cor(Singles, HR) cor(BB, Singles)
## 1 0.4039313 -0.1737435 -0.05603822
The key to this video is to show the confounding between HR and BB predicting runs. This motivates the need for multivariate regression analysis. The next videos build the foundations up from scratch, starting by conditioning or stratifying in each of the variables. Given our time constraints, we’ll skip those for now even though they are highly recommended for anyone (either first time seeing this material or as a review). I’ll just provide a quick summary here.
In lecture 2.02, multivariate regression is revisited using the sabermetrics example, and stratification is used to motivate the analysis. In lecture 2.03, theoretical justifications for the multivariate model are given, including the bivariate normality and the role of conditional expectations. In lecture 2.04, the ordinary least squared (OLS) method is discussed, including the Residual Sum of Squares (RSS) and the intuition behind the OLS optimal solution. Lecture 2.05 discusses the lm function (which stands for Linear Model) and lecture 2.06 shows that the coefficients are random variables using simulations (cool!). Lecture 2.07 comes back to Baseball, cleans the data using dplyr and introduces Tibbles. Lecture 2.08 explains the differences between tibbles and data.frames very nicely.
Lecture 2.09 describes the do() function and provides a useful example of how to define functions in R. Lecture 2.10 describes the very useful broom() package. Finally, lecture 2.11 (8 min long) builds the team, which is what we want. We will focus on doing that today.
This particular video is relatively long, so we’ll break it down into
two parts: Part 1 (until minute 3:30): Team regression model.
Part 2 (after minute 3:30): Player regression model, and picking the
winning team
Create multivariate and univariate regressions. Compare the coefficients. Multivariate 1: Runs per Game on Bases on Ball per Game and Home Runs per Game. Univariate regressions: 1) Runs per Game on Bases on Ball per Game. 2) Runs per Game on Home Runs per Game. Multivariate 2: Runs per Game on Bases on Ball per Game, Singles per Game, Doubles per Game, Triples per Game, and Home Runs per Game.
#Bivariate Regression
fit<-Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(BB = BB/G, HR=HR/G, R=R/G) %>%
lm(R~BB+HR, data = .)
library(broom)
tidy(fit, conf.int = TRUE)
## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.74 0.0824 21.2 7.62e- 83 1.58 1.91
## 2 BB 0.387 0.0270 14.3 1.20e- 42 0.334 0.440
## 3 HR 1.56 0.0490 31.9 1.78e-155 1.47 1.66
#Univariate Regressions
fit1.1<-Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(BB = BB/G, HR=HR/G, R=R/G) %>%
lm(R~BB, data = .)
tidy(fit1.1)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.93 0.116 16.7 2.30e-55
## 2 BB 0.735 0.0349 21.1 3.00e-82
fit1.2<-Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(BB = BB/G, HR=HR/G, R=R/G) %>%
lm(R~HR, data = .)
tidy(fit1.2)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.78 0.0436 63.7 0
## 2 HR 1.84 0.0491 37.6 4.29e-195
fit2 <- Teams %>%
filter(yearID %in% 1961:2001) %>%
mutate(BB = BB/G,
#SB = SB/G,
singles = (H-HR-X2B-X3B)/G,
doubles = X2B/G,
triples = X3B/G,
HR = HR/G,
R = R/G) %>%
lm(R ~ BB + singles + doubles + triples + HR, data = .)
tidy(fit2)
## # A tibble: 6 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.77 0.0862 -32.1 4.76e-157
## 2 BB 0.371 0.0117 31.6 1.87e-153
## 3 singles 0.519 0.0127 40.8 8.67e-217
## 4 doubles 0.771 0.0226 34.1 8.44e-171
## 5 triples 1.24 0.0768 16.1 2.12e- 52
## 6 HR 1.44 0.0243 59.3 0
Team Prediction: Use the estimated the model above (using 1961-2001 data) to predict runs for the year 2002. You should use the last multivariate model with 5 independent variables. Then compare the predictions of the model with the actual runs the teams made in 2002.
#Prediction:
Teams %>%
filter(yearID %in% 2002) %>%
mutate(BB = BB/G,
#SB = SB/G,
singles = (H-HR-X2B-X3B)/G,
doubles = X2B/G,
triples = X3B/G,
HR = HR/G,
R = R/G) %>%
mutate(R_hat = predict(fit2, newdata = .)) %>%
ggplot(aes(x = R_hat, y = R, label = teamID)) + geom_point() +
geom_text(hjust=-0.5, vjust=0) +
geom_smooth(formula = y ~ x, method = 'loess') +
geom_smooth(formula = y ~ x, method = 'lm', col='red')
(Player model)
Make the Per-Game Team Rate comparable with the Per-Plate Appearance Player Rate by computing the Team Plate Appearances per Game. For each team, you need to add the At bats and Bases on Ball of all players (that’s the total number of appearances) and divided by the total numbers of games the team played in the season. Use season 2002 for this calculation.
pa_per_game <- Batting %>% filter(yearID == 2002) %>%
group_by(teamID) %>%
summarize(pa_per_game = sum(AB + BB)/max(G)) %>%
.$pa_per_game %>%
mean
Pick players in 2002 using data from 1999-2001. To do that, first, we need to filter our players and create our dataset. Filter out players with too few plate appearances (less than 300), calculate the plate appearances, Number of Games they played, Bases on Ball per Game, Singles per Game, Doubles per Game, Triples per Game, Home Runs per Game, their hitting AVG, and the predicted runs using the model above.
#Create the data frame from which to pick the players
players <- Batting %>%
filter(yearID %in% 1999:2001) %>%
group_by(playerID) %>%
mutate(PA = BB + AB) %>%
summarize(G = sum(PA)/pa_per_game,
BB = sum(BB)/G,
singles = sum(H-X2B-X3B-HR)/G,
doubles = sum(X2B)/G,
triples = sum(X3B)/G,
HR = sum(HR)/G,
AVG = sum(H)/sum(AB),
PA = sum(PA)) %>%
filter(PA >= 300) %>%
select(-G) %>%
mutate(R_hat = predict(fit2, newdata = .))
#Distribution of R_hat. Note: Wide variability is a good sign.
players %>% ggplot(aes(R_hat)) +
geom_histogram(binwidth = 0.5, color = "black")
Some Data Wrangling is necessary to combine different Tables.
First, we need to add (or merge) the 2002 salaries for each player, from
the Salary Table. Write the code to do that here:
Key code: #### Salary:
#Merge in the player's salaries
players <- Salaries %>%
filter(yearID == 2002) %>%
select(playerID, salary) %>%
right_join(players, by="playerID")
Then, we need to get the position most played (since some players
play multiple positions). Also, remove the outfielders (OF = CF+LF+RF)
and pitchers.
#Merge in the defensive position (we need to build an entire Team)
#Remove outfielders and pitchers, and pick positions most played.
players <- Fielding %>% filter(yearID==2002) %>%
filter(!POS %in% c("OF","P")) %>%
group_by(playerID) %>%
top_n(1, G) %>%
filter(row_number(G) == 1 ) %>%
ungroup() %>%
select(playerID, POS) %>%
right_join(players, by="playerID") %>%
filter(!is.na(POS) & !is.na(salary))
Get names of the players so we know who each player is.
Key code:
#Merge in the players' names.
players <- People %>%
select(playerID, nameFirst, nameLast, debut) %>%
right_join(players, by="playerID")
#Show the top 10 most (potentially) productive players according to our model.
players %>% select(nameFirst, nameLast, POS, salary, R_hat) %>%
arrange(desc(R_hat)) %>%
top_n(10)
## Selecting by R_hat
## nameFirst nameLast POS salary R_hat
## 1 Todd Helton 1B 5000000 8.228108
## 2 Jason Giambi 1B 10428571 7.993795
## 3 Albert Pujols 3B 600000 7.536447
## 4 Nomar Garciaparra SS 9000000 7.505063
## 5 Jeff Bagwell 1B 11000000 7.475066
## 6 Alex Rodriguez SS 22000000 7.439937
## 7 Carlos Delgado 1B 19400000 7.368570
## 8 Rafael Palmeiro 1B 8712986 7.257195
## 9 Mike Piazza C 10571429 7.160964
## 10 Jim Thome 1B 8000000 7.156418
Create a scatterplot with salary in the horizontal axis and the predicted runs from the model in the vertical axis. Display each position in a different color.
#First Visual
players %>% ggplot(aes(x = salary, y = R_hat, color = POS))+
geom_point() +
scale_x_log10()
We want only players whose debut was in 1997 or earlier (i.e., to remove all young players that haven’t been able to negotiate their contracts yet). Write a code that does that.
#Second (refined) visual, removing too young players (no contract yet)
players %>% filter(debut<1998) %>%
ggplot(aes(salary, R_hat, color= POS))+
geom_point() +
scale_x_log10()
Make it interactive so that it’s easier to pick a team under the $40 MM Billy Beane had to work with.
#Make it interactive
#install.packages("plotly")
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
visual1 <- players %>% filter(debut<1998) %>%
ggplot( aes(salary, R_hat, color= POS,
text = paste(nameFirst, nameLast))) +
geom_point() +
scale_x_log10()
visual2<- ggplotly(visual1, type = 'scatter')
visual2