Moneyball (2011).

Introductory videos:
Trailer: https://www.youtube.com/watch?v=-4QPVo0UIzc
He Gets on Base Scene: https://www.youtube.com/watch?v=3MjxoaynCmk
Player Value Scene: https://www.youtube.com/watch?v=Tzin1DgexlE

Academic videos:
We will use and follow a set of videos by HarvardX (Machine Intelligence YouTube Channel).
Link here to first video. Replication in R follows.

Install and load key R packages

Install and load the the following packages: tidyverse and Lahman.

#install.packages("tidyverse")
#install.packages("Lahman")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(Lahman)

Lecture 1.04 - 1.10 Regression Basics

The rest of the videos in this chapter (chapter 1) show a quick review of simple linear regression using mostly Francis Galton’s original example (father’s height and son’s heights). To save time, we can skip to the sabermetrics practice, and the hands on practice picking the winning team.

Picking the Winning Team:

First, We need the salary of the players and their position.

Some Data Wrangling is necessary to combine different Tables.
First, we need to add (or merge) the 2002 salaries for each player, from the Salary Table. Write the code to do that here:

Key code: #### Salary:

#Merge in the player's salaries
players <- Salaries %>%  
  filter(yearID == 2002) %>% 
  select(playerID, salary) %>% 
  right_join(players, by="playerID")

Defensive Position. From Fielding’s Table

Then, we need to get the position most played (since some players play multiple positions). Also, remove the outfielders (OF = CF+LF+RF) and pitchers.

#Merge in the defensive position (we need to build an entire Team) 
#Remove outfielders and pitchers, and pick positions most played. 
players <- Fielding %>%  filter(yearID==2002) %>% 
  filter(!POS %in% c("OF","P")) %>% 
  group_by(playerID) %>%  
  top_n(1, G) %>% 
  filter(row_number(G) == 1 ) %>%  
  ungroup() %>%  
  select(playerID, POS) %>% 
  right_join(players, by="playerID") %>% 
  filter(!is.na(POS) & !is.na(salary))

Add Names (from People’s Table)

Get names of the players so we know who each player is.

Key code:

#Merge in the players' names. 
players <- People %>%  
  select(playerID, nameFirst, nameLast, debut) %>%  
  right_join(players, by="playerID")

#Show the top 10 most (potentially) productive players according to our model.      
players %>%  select(nameFirst, nameLast, POS, salary, R_hat) %>%  
  arrange(desc(R_hat)) %>% 
  top_n(10)
## Selecting by R_hat
##    nameFirst    nameLast POS   salary    R_hat
## 1       Todd      Helton  1B  5000000 8.228108
## 2      Jason      Giambi  1B 10428571 7.993795
## 3     Albert      Pujols  3B   600000 7.536447
## 4      Nomar Garciaparra  SS  9000000 7.505063
## 5       Jeff     Bagwell  1B 11000000 7.475066
## 6       Alex   Rodriguez  SS 22000000 7.439937
## 7     Carlos     Delgado  1B 19400000 7.368570
## 8     Rafael    Palmeiro  1B  8712986 7.257195
## 9       Mike      Piazza   C 10571429 7.160964
## 10       Jim       Thome  1B  8000000 7.156418

Visualizations

Create a scatterplot with salary in the horizontal axis and the predicted runs from the model in the vertical axis. Display each position in a different color.

#First Visual
players %>%  ggplot(aes(x = salary, y = R_hat, color = POS))+
  geom_point() +
  scale_x_log10()

We want only players whose debut was in 1997 or earlier (i.e., to remove all young players that haven’t been able to negotiate their contracts yet). Write a code that does that.

#Second (refined) visual, removing too young players (no contract yet)
players %>% filter(debut<1998) %>%  
  ggplot(aes(salary, R_hat, color= POS))+
  geom_point() + 
  scale_x_log10()

Finally, pick the Team

Make it interactive so that it’s easier to pick a team under the $40 MM Billy Beane had to work with.

#Make it interactive
#install.packages("plotly")
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
visual1 <- players %>% filter(debut<1998) %>%  
  ggplot( aes(salary, R_hat, color= POS, 
             text = paste(nameFirst, nameLast))) +
  geom_point() + 
  scale_x_log10()

visual2<- ggplotly(visual1, type = 'scatter') 
visual2