Moneyball

The Story

  • Moneyball tells the story of the Oakland A’s in 2002
    • One of the poorest teams in baseball
    • But they were improving every year

The Problem

  • Rich teams can afford the all-star players

  • How do the poor teams compete?

A Different Approach

  • The A’s started using a different method to select players

  • The traditional way was through scouting
    • Scouts would go watch high school and college players
    • Report back about their skills
    • A lot of talk about speed and athletic build
  • The A’s selected players based on their statistics, not on their looks

The Goal of a Baseball Team

Making it to the Playoffs

  • The A’s calculated that they needed to score 135 more runs than they allowed during the regular season to expect to win 95 games

  • Now, let’s verify this statement with Linear Regression

Scoring Runs

  • How does a team score more runs?

  • The A’s discovered that two baseball statistics were significantly more important than anything else

    • On-Base Percentage (OBP)
      • Percentage of time a player gets on base (including walks)
    • Slugging Percentage (SLG)
      • How far a player gets around bases on his turn (measures power)
  • Most teams focused on Batting Average (BA)

  • The A’s claimed that
    • OBP was the most important
    • SLG was important
    • BA was overvalued

Runs Allowed

  • Use pitching statistics to predict runs allowed
    • Opponents On-Base Percentage (OOBP)
    • Opponents Slugging Percentage (OSLG)
  • We get the following linear regression model

\[RunsAllowed = -837.38 + 2913.60(OOBP) + 1514.29(OSLG)\]
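A model of this form could be fit with lm(), in the same way as the regressions in the R section of this document. The sketch below is a minimal illustration, not necessarily the exact call used to produce the equation above. It assumes a data frame with RA, OOBP and OSLG columns, such as the moneyball data frame built in the R section; the names fit_runs_allowed and RunsAllowedReg are illustrative.

```r
# Hypothetical helper: fit runs allowed on the two opponent statistics.
# Assumes team_data has columns RA, OOBP and OSLG.
fit_runs_allowed = function(team_data) {
  lm(RA ~ OOBP + OSLG, data=team_data)
}

# Only runs if the moneyball data frame has already been created
if (exists("moneyball")) {
  RunsAllowedReg = fit_runs_allowed(moneyball)
  print(summary(RunsAllowedReg))
}
```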

Predicting Runs and Wins

  • Can we predict how many games the 2002 Oakland A’s will win using our models?

  • The models for runs use team statistics

  • We need to estimate the new team statistics using past player performances
    • Assumes past performance correlates with future performance
    • Assumes few injuries
  • We can estimate the team statistics for 2002 by using the 2001 player statistics

Predicting Runs Scored

  • Using the 2001 regular season statistics for these players
    • Team OBP is 0.339
    • Team SLG is 0.430
  • Our regression equation was

\[RunsScored = -804.63 + 2737.77(OBP) + 1584.91(SLG)\]

  • Our 2002 prediction for the A’s is

\[RunsScored = -804.63 + 2737.77(0.339) + 1584.91(0.430) = 805\]

Predicting Runs Allowed

  • Using the 2001 regular season statistics for these players
    • Team OOBP is 0.307
    • Team OSLG is 0.373
  • Our regression equation was

\[RunsAllowed = -837.38 + 2913.60(OOBP) + 1514.29(OSLG)\]

  • Our 2002 prediction for the A’s is

\[RunsAllowed = -837.38 + 2913.60(0.307) + 1514.29 (0.373) = 622\]

Predicting Wins

  • Regression equation to predict wins

\[Wins = 80.8814 + 0.1058(RS - RA)\]

\[Wins = 80.8814 + 0.1058(805 - 622) = 100\]
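The three predictions above can be chained in a few lines of R. This is a sketch that simply plugs the 2001 team statistics into the equations quoted on the slides; the function and variable names are illustrative and do not come from the original.

```r
# Coefficients copied from the regression equations on the slides
predict_runs_scored = function(obp, slg) -804.63 + 2737.77 * obp + 1584.91 * slg
predict_runs_allowed = function(oobp, oslg) -837.38 + 2913.60 * oobp + 1514.29 * oslg
predict_wins = function(rs, ra) 80.8814 + 0.1058 * (rs - ra)

# 2001 team statistics used as estimates for 2002
rs_2002 = predict_runs_scored(obp=0.339, slg=0.430)
ra_2002 = predict_runs_allowed(oobp=0.307, oslg=0.373)
wins_2002 = predict_wins(rs_2002, ra_2002)

# Rounded to whole runs and wins: RS 805, RA 622, Wins 100
round(c(RS=rs_2002, RA=ra_2002, Wins=wins_2002))
```

Passing the unrounded run predictions into the wins equation gives the same rounded result as using 805 and 622 directly.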

Results

The Analytics Edge

  • Models allow managers to more accurately value players and minimize risk

  • Relatively simple models can be useful

Moneyball in R

Linear Regression

Read in data

# Load the dataset
baseball = read.csv("baseball.csv")
# Display the structure of the data frame
str(baseball)
## 'data.frame':    1232 obs. of  15 variables:
##  $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 2 3 4 5 7 8 9 10 11 12 ...
##  $ League      : Factor w/ 2 levels "AL","NL": 2 2 1 1 2 1 2 1 2 1 ...
##  $ Year        : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ RS          : int  734 700 712 734 613 748 669 667 758 726 ...
##  $ RA          : int  688 600 705 806 759 676 588 845 890 670 ...
##  $ W           : int  81 94 93 69 61 85 97 68 64 88 ...
##  $ OBP         : num  0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
##  $ SLG         : num  0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
##  $ BA          : num  0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
##  $ Playoffs    : int  0 1 1 0 0 0 1 0 0 1 ...
##  $ RankSeason  : int  NA 4 5 NA NA NA 2 NA NA 6 ...
##  $ RankPlayoffs: int  NA 5 4 NA NA NA 4 NA NA 2 ...
##  $ G           : int  162 162 162 162 162 162 162 162 162 162 ...
##  $ OOBP        : num  0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
##  $ OSLG        : num  0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...

Subset to only include moneyball years

# Keep only the seasons before 2002 (the data available to the A's at the time)
moneyball = subset(baseball, Year < 2002)
# Display the structure of the subset
str(moneyball)
## 'data.frame':    902 obs. of  15 variables:
##  $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 1 2 3 4 5 7 8 9 10 11 ...
##  $ League      : Factor w/ 2 levels "AL","NL": 1 2 2 1 1 2 1 2 1 2 ...
##  $ Year        : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
##  $ RS          : int  691 818 729 687 772 777 798 735 897 923 ...
##  $ RA          : int  730 677 643 829 745 701 795 850 821 906 ...
##  $ W           : int  75 92 88 63 82 88 83 66 91 73 ...
##  $ OBP         : num  0.327 0.341 0.324 0.319 0.334 0.336 0.334 0.324 0.35 0.354 ...
##  $ SLG         : num  0.405 0.442 0.412 0.38 0.439 0.43 0.451 0.419 0.458 0.483 ...
##  $ BA          : num  0.261 0.267 0.26 0.248 0.266 0.261 0.268 0.262 0.278 0.292 ...
##  $ Playoffs    : int  0 1 1 0 0 0 0 0 1 0 ...
##  $ RankSeason  : int  NA 5 7 NA NA NA NA NA 6 NA ...
##  $ RankPlayoffs: int  NA 1 3 NA NA NA NA NA 4 NA ...
##  $ G           : int  162 162 162 162 161 162 162 162 162 162 ...
##  $ OOBP        : num  0.331 0.311 0.314 0.337 0.329 0.321 0.334 0.341 0.341 0.35 ...
##  $ OSLG        : num  0.412 0.404 0.384 0.439 0.393 0.398 0.427 0.455 0.417 0.48 ...

Compute Run Difference

# Compute Run Differences
moneyball$RD = moneyball$RS - moneyball$RA
# Display the structure, now including the new RD column
str(moneyball)
## 'data.frame':    902 obs. of  16 variables:
##  $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 1 2 3 4 5 7 8 9 10 11 ...
##  $ League      : Factor w/ 2 levels "AL","NL": 1 2 2 1 1 2 1 2 1 2 ...
##  $ Year        : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
##  $ RS          : int  691 818 729 687 772 777 798 735 897 923 ...
##  $ RA          : int  730 677 643 829 745 701 795 850 821 906 ...
##  $ W           : int  75 92 88 63 82 88 83 66 91 73 ...
##  $ OBP         : num  0.327 0.341 0.324 0.319 0.334 0.336 0.334 0.324 0.35 0.354 ...
##  $ SLG         : num  0.405 0.442 0.412 0.38 0.439 0.43 0.451 0.419 0.458 0.483 ...
##  $ BA          : num  0.261 0.267 0.26 0.248 0.266 0.261 0.268 0.262 0.278 0.292 ...
##  $ Playoffs    : int  0 1 1 0 0 0 0 0 1 0 ...
##  $ RankSeason  : int  NA 5 7 NA NA NA NA NA 6 NA ...
##  $ RankPlayoffs: int  NA 1 3 NA NA NA NA NA 4 NA ...
##  $ G           : int  162 162 162 162 161 162 162 162 162 162 ...
##  $ OOBP        : num  0.331 0.311 0.314 0.337 0.329 0.321 0.334 0.341 0.341 0.35 ...
##  $ OSLG        : num  0.412 0.404 0.384 0.439 0.393 0.398 0.427 0.455 0.417 0.48 ...
##  $ RD          : int  -39 141 86 -142 27 76 3 -115 76 17 ...

Scatterplot to check for linear relationship

# Scatterplot of wins against run difference
plot(moneyball$RD, moneyball$W, xlab="Run difference (RS - RA)", ylab="Wins")

Regression model to predict wins

# Linear Regression model
WinsReg = lm(W ~ RD, data=moneyball)
# Output the summary
summary(WinsReg)
## 
## Call:
## lm(formula = W ~ RD, data = moneyball)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.2662  -2.6509   0.1234   2.9364  11.6570 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 80.881375   0.131157  616.67   <2e-16 ***
## RD           0.105766   0.001297   81.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.939 on 900 degrees of freedom
## Multiple R-squared:  0.8808, Adjusted R-squared:  0.8807 
## F-statistic:  6651 on 1 and 900 DF,  p-value: < 2.2e-16
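
The fitted line can be inverted to ask how large a run difference is needed to expect 95 wins, which can then be compared against the figure quoted in the slides. The sketch below hard-codes the two coefficients from the summary above so that it runs on its own; with the fitted model in memory, coef(WinsReg) would give the same values. The variable names are illustrative.

```r
# Coefficients copied from the summary(WinsReg) output above
intercept = 80.881375
slope = 0.105766

# Solve 95 = intercept + slope * RD for RD
target_wins = 95
required_rd = (target_wins - intercept) / slope
required_rd  # about 133.5 runs
```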

Regression model to predict runs scored

# Linear Regression model
RunsReg = lm(RS ~ OBP + SLG + BA, data=moneyball)
# Output the summary
summary(RunsReg)
## 
## Call:
## lm(formula = RS ~ OBP + SLG + BA, data = moneyball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.941 -17.247  -0.621  16.754  90.998 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -788.46      19.70 -40.029  < 2e-16 ***
## OBP          2917.42     110.47  26.410  < 2e-16 ***
## SLG          1637.93      45.99  35.612  < 2e-16 ***
## BA           -368.97     130.58  -2.826  0.00482 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.69 on 898 degrees of freedom
## Multiple R-squared:  0.9302, Adjusted R-squared:   0.93 
## F-statistic:  3989 on 3 and 898 DF,  p-value: < 2.2e-16
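
In the summary above, BA has a negative coefficient and the weakest significance of the three predictors, which may reflect overlap between BA and the other two statistics. The runs-scored equation on the slides has no BA term, so it presumably comes from a model refit without BA. A minimal sketch of that refit follows, assuming the moneyball data frame built earlier; fit_runs_without_ba and RunsRegNoBA are illustrative names that do not appear in the original.

```r
# Hypothetical helper: fit runs scored on OBP and SLG only.
# Assumes team_data has columns RS, OBP and SLG.
fit_runs_without_ba = function(team_data) {
  lm(RS ~ OBP + SLG, data=team_data)
}

# Only runs if the moneyball data frame has already been created
if (exists("moneyball")) {
  RunsRegNoBA = fit_runs_without_ba(moneyball)
  print(summary(RunsRegNoBA))
}
```

Comparing the R-squared of this model with the 0.9302 reported above shows how much explanatory power is lost by dropping BA.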