This is an example of some code I’ve been working on for modeling baseball hall-of-fame voting. The code is very much still a work in progress, but is available on my github. https://github.com/bdilday/BaseballHOFModels

Data preparation

The data used for modeling are a combination of basic stats from the Lahman package and WAR and WAA values from baseball reference. In a previous step I imported the baseball-reference daily WAR tables into an R data frame and stored the results in br_war.RData. I augmengt the basic batting stats from the Lahman::Batting table with a count of World Series wins, All Star selections and All Star starts, and MVP awards and votes.

source('./R/prepare_data.R')
load('./data/br_war.RData')

bb <- get_lahman_batting()
bb <- append_age(bb)
bb <- append_br_war(bb, war=br_war$BattingWAR)
bb <- append_pos(bb)
bb <- append_hof(bb)
bb <- append_ws_wins(bb)
bb <- append_mvps(bb)
bb <- append_all_star(bb)
bb %>% head(2) %>% print.data.frame()
##    playerID yearID   bbrefID AB  BA BABIP BASIC_WOBA BB CS  G GIDP H HBP
## 1 aardsda01   2004 aardsda01  0 NaN     0        NaN  0  0 11    0 0   0
## 2 aardsda01   2006 aardsda01  2   0     0          0  0  0 45    0 0   0
##   HR IBB OBP OPS PA R RBI SB SF SH SlugPct SO TB wwoba X1B X2B X3B
## 1  0   0 NaN NaN  0 0   0  0  0  0     NaN  0  0     0   0   0   0
## 2  0   0   0   0  3 0   0  0  0  1       0  0  0     0   0   0   0
##   birthYear birthMonth birthDay birthCountry birthState birthCity
## 1      1981         12       27          USA         CO    Denver
## 2      1981         12       27          USA         CO    Denver
##   deathYear deathMonth deathDay deathCountry deathState deathCity
## 1        NA         NA       NA         <NA>       <NA>      <NA>
## 2        NA         NA       NA         <NA>       <NA>      <NA>
##   nameFirst nameLast   nameGiven weight height bats throws      debut
## 1     David  Aardsma David Allan    215     75    R      R 2004-04-06
## 2     David  Aardsma David Allan    215     75    R      R 2004-04-06
##    finalGame  retroID deathDate  birthDate Age   WAA   WAR WAR_def WAR_off
## 1 2015-08-23 aardd001      <NA> 1981-12-27  23  0.01  0.01       0    0.01
## 2 2015-08-23 aardd001      <NA> 1981-12-27  25 -0.02 -0.02       0   -0.02
##   WAA_rank POS inducted votedBy WSWin MVPWin MVPShare MPVWin AllStar
## 1      442   P        N       N     0      0        0      0       0
## 2      728   P        N       N     0      0        0      0       0
##   AllStarStart
## 1            0
## 2            0

Feature generation

The get_fit_data function orders the data for each player from best to worst by one of the features, which by default is wins-above-average. The stats for the top N years (by default 20) are kept and tidyr::spread is used to cast them to a wide format. If a player played less than 20 years, the non-existant years are filled in with zeros. Before generating the data frame I exclude pitchers, and before fitting the model I limit the data to players who’s best best season had at least 30 PA.

tmp <- bb %>% filter(POS!='P')
fit_df_train <- get_fit_data(tmp, final_game_min='1901-01-01', final_game_max='2006-01-01') %>% 
  filter(PA_1>=30)
fit_df_test <- get_fit_data(tmp, final_game_min='2006-01-01', final_game_max='2016-01-01') %>% 
  filter(PA_1>=30)

Here’s an example of what the feature set looks like. The concept is to use overall value meaurements (WAR, WAA, WAR_off, WAR_def), subjective honors (ALL Star, MVP wins & shares) and FIP components (SO, BB, HR). This makes it straight forward to include pitchers with the same feature set, although I’m not doing that here.

head(fit_df_train, 1) %>% print.data.frame()
##    playerID POS inducted AllStar_1 AllStar_10 AllStar_11 AllStar_12
## 1 aaronha01  OF        Y         1          1          1          1
##   AllStar_13 AllStar_14 AllStar_15 AllStar_16 AllStar_17 AllStar_18
## 1          1          1          1          1          1          1
##   AllStar_19 AllStar_2 AllStar_20 AllStar_3 AllStar_4 AllStar_5 AllStar_6
## 1          1         1          1         1         1         1         1
##   AllStar_7 AllStar_8 AllStar_9 AllStarStart_1 AllStarStart_10
## 1         1         1         1              0               1
##   AllStarStart_11 AllStarStart_12 AllStarStart_13 AllStarStart_14
## 1               1               1               0               0
##   AllStarStart_15 AllStarStart_16 AllStarStart_17 AllStarStart_18
## 1               1               0               1               1
##   AllStarStart_19 AllStarStart_2 AllStarStart_20 AllStarStart_3
## 1               1              1               1              1
##   AllStarStart_4 AllStarStart_5 AllStarStart_6 AllStarStart_7
## 1              1              0              1              1
##   AllStarStart_8 AllStarStart_9 BB_1 BB_10 BB_11 BB_12 BB_13 BB_14 BB_15
## 1              1              1   56    76    71    59    37    62    64
##   BB_16 BB_17 BB_18 BB_19 BB_2 BB_20 BB_3 BB_4 BB_5 BB_6 BB_7 BB_8 BB_9
## 1    49    68    74    92   78    39   63   51   66   87   57   60   60
##   MVPShare_1 MVPShare_10 MVPShare_11 MVPShare_12 MVPShare_13 MVPShare_14
## 1  0.1741071   0.2035714   0.5357143   0.4940476   0.4345238  0.07857143
##   MVPShare_15 MVPShare_16 MVPShare_17 MVPShare_18 MVPShare_19 MVPShare_2
## 1  0.06785714   0.1071429   0.1041667  0.04761905  0.03571429  0.4821429
##   MVPShare_20 MVPShare_3 MVPShare_4 MVPShare_5 MVPShare_6 MVPShare_7
## 1           0  0.2821429  0.5178571  0.2571429  0.5595238  0.7113095
##   MVPShare_8 MVPShare_9 MVPWin_1 MVPWin_10 MVPWin_11 MVPWin_12 MVPWin_13
## 1  0.1458333  0.2071429        0         0         0         0         0
##   MVPWin_14 MVPWin_15 MVPWin_16 MVPWin_17 MVPWin_18 MVPWin_19 MVPWin_2
## 1         0         0         0         0         0         0        0
##   MVPWin_20 MVPWin_3 MVPWin_4 MVPWin_5 MVPWin_6 MVPWin_7 MVPWin_8 MVPWin_9
## 1         0        0        0        0        0        1        0        0
##   PA_1 PA_10 PA_11 PA_12 PA_13 PA_14 PA_15 PA_16 PA_17 PA_18 PA_19 PA_2
## 1  671   688   573   664   660   634   676   665   465   598   544  714
##   PA_20 PA_3 PA_4 PA_5 PA_6 PA_7 PA_8 PA_9 SO_1 SO_10 SO_11 SO_12 SO_13
## 1   382  669  693  667  639  675  664  639   64    96    58    49    54
##   SO_14 SO_15 SO_16 SO_17 SO_18 SO_19 SO_2 SO_20 SO_3 SO_4 SO_5 SO_6 SO_7
## 1    46    62    61    51    63    55   94    29   97   54   73   47   58
##   SO_8 SO_9 WAA_1 WAA_10 WAA_11 WAA_12 WAA_13 WAA_14 WAA_15 WAA_16 WAA_17
## 1   63   81  6.85   5.35   5.25   4.84   4.57   4.44   4.44   3.75   3.17
##   WAA_18 WAA_19 WAA_2 WAA_20 WAA_3 WAA_4 WAA_5 WAA_6 WAA_7 WAA_8 WAA_9
## 1   2.99   2.19  6.42   0.88  6.15  6.11  5.93  5.87  5.52  5.47   5.4
##   WAA_rank_1 WAA_rank_10 WAA_rank_11 WAA_rank_12 WAA_rank_13 WAA_rank_14
## 1          3           6           3           4           4          12
##   WAA_rank_15 WAA_rank_16 WAA_rank_17 WAA_rank_18 WAA_rank_19 WAA_rank_2
## 1           6           9          21          28          36          2
##   WAA_rank_20 WAA_rank_3 WAA_rank_4 WAA_rank_5 WAA_rank_6 WAA_rank_7
## 1          90          4          2          3          5          5
##   WAA_rank_8 WAA_rank_9 WAR_1 WAR_10 WAR_11 WAR_12 WAR_13 WAR_14 WAR_15
## 1          3          2  9.39   7.77   7.25   7.31   7.12    6.8   6.81
##   WAR_16 WAR_17 WAR_18 WAR_19 WAR_2 WAR_20 WAR_3 WAR_4 WAR_5 WAR_6 WAR_7
## 1   6.23   4.65   4.99   3.92   9.1   2.07  8.52  8.62  8.46  8.03  7.99
##   WAR_8 WAR_9 WAR_def_1 WAR_def_10 WAR_def_11 WAR_def_12 WAR_def_13
## 1  7.98  7.85         2       0.75      -1.34       0.21       0.63
##   WAR_def_14 WAR_def_15 WAR_def_16 WAR_def_17 WAR_def_18 WAR_def_19
## 1      -0.16       0.44      -0.01      -0.78       -0.6      -1.18
##   WAR_def_2 WAR_def_20 WAR_def_3 WAR_def_4 WAR_def_5 WAR_def_6 WAR_def_7
## 1     -1.29      -0.38     -0.47      -1.1       0.3      0.04     -0.13
##   WAR_def_8 WAR_def_9 WAR_off_1 WAR_off_10 WAR_off_11 WAR_off_12
## 1      0.77      0.32      7.03        6.2       7.74       6.59
##   WAR_off_13 WAR_off_14 WAR_off_15 WAR_off_16 WAR_off_17 WAR_off_18
## 1       5.93       6.25       5.31       5.81       4.85       4.91
##   WAR_off_19 WAR_off_2 WAR_off_20 WAR_off_3 WAR_off_4 WAR_off_5 WAR_off_6
## 1       4.24      9.51       2.03      8.18      9.09      7.75      7.19
##   WAR_off_7 WAR_off_8 WAR_off_9 WSWin_1 WSWin_10 WSWin_11 WSWin_12
## 1      7.78       6.5      6.81       0        0        0        0
##   WSWin_13 WSWin_14 WSWin_15 WSWin_16 WSWin_17 WSWin_18 WSWin_19 WSWin_2
## 1        0        0        0        0        0        0        0       0
##   WSWin_20 WSWin_3 WSWin_4 WSWin_5 WSWin_6 WSWin_7 WSWin_8 WSWin_9
## 1        0       0       0       0       0       1       0       0

Defining the model

Here I extract the data using model.matrix and setup a cluster to parallelize the cross-validations. The method I’m using in this instance is xgboost, with 10-fold cross validation via the caret package. The training data comprise players that retired prior to 2006-01-01 and the test data players that retired between 2006-01-01 and 2016-01-01.

frm <- as.formula(inducted ~ . - playerID)

xx = model.matrix(frm, data=fit_df_train)[,-1]
yy = fit_df_train$inducted

cl <- makeCluster(detectCores())
registerDoParallel(cl)

set.seed(825)
cv_train <- trainControl(method = "cv", number = 10)

xgbFit1 <- train(xx, yy, method = "xgbLinear", trControl = cv_train, verbose = FALSE)
stopCluster(cl)

Training results

The performance is perfect on the training data.

table(predict(xgbFit1), fit_df_train$inducted)
##    
##        N    Y
##   N 4227    0
##   Y    0  142

Test results

The model predicts 11 playesr from the test data set to be inducted.

xx_test <- model.matrix(frm,data=fit_df_test)[,-1]
yy_test <- fit_df_test$inducted

results_df_test <- fit_df_test
results_df_test$predict <- 'N'
cc = which(predict(xgbFit1, newdata = xx_test) == 'Y')
results_df_test[cc,]$predict <- 'Y'

results_df_test %>% 
  merge(Lahman::Master %>% select(playerID, nameLast, nameFirst), by="playerID") %>%
  arrange(desc(predict), -WAA_1) %>% 
  head(50) %>% 
  select(nameLast, nameFirst, predict, inducted, WAA_1) %>% 
  print.data.frame()
##       nameLast nameFirst predict inducted WAA_1
## 1        Bonds     Barry       Y        N  9.91
## 2      Griffey       Ken       Y        Y  7.60
## 3       Biggio     Craig       Y        Y  7.29
## 4       Piazza      Mike       Y        Y  6.76
## 5        Jones    Andruw       Y        N  6.04
## 6        Jones   Chipper       Y        N  5.86
## 7        Jeter     Derek       Y        N  5.57
## 8      Ramirez     Manny       Y        N  5.09
## 9       Thomas     Frank       Y        Y  5.04
## 10   Sheffield      Gary       Y        N  4.81
## 11   Rodriguez      Ivan       Y        Y  4.24
## 12        Sosa     Sammy       N        N  8.20
## 13       Rolen     Scott       N        N  7.34
## 14      Giambi     Jason       N        N  6.79
## 15      Helton      Todd       N        N  6.75
## 16    Hamilton      Josh       N        N  6.66
## 17        Drew     J. D.       N        N  6.37
## 18       Giles    Marcus       N        N  5.96
## 19      Erstad     Darin       N        N  5.95
## 20    Gonzalez      Luis       N        N  5.83
## 21         Lee    Derrek       N        N  5.66
## 22      Lofton     Kenny       N        N  5.56
## 23       Glaus      Troy       N        N  5.46
## 24       Thome       Jim       N        N  5.37
## 25     Edmonds       Jim       N        N  5.34
## 26    Guerrero  Vladimir       N        N  5.33
## 27 Garciaparra     Nomar       N        N  5.32
## 28     Figgins     Chone       N        N  5.28
## 29       Lopez      Javy       N        N  5.28
## 30      Jordan     Brian       N        N  5.12
## 31        Kent      Jeff       N        N  5.07
## 32     Roberts     Brian       N        N  5.07
## 33        Pena    Carlos       N        N  4.99
## 34     Ordonez   Magglio       N        N  4.89
## 35      Tejada    Miguel       N        N  4.89
## 36     Sanders    Reggie       N        N  4.88
## 37       Green     Shawn       N        N  4.87
## 38     Berkman     Lance       N        N  4.83
## 39     Delgado    Carlos       N        N  4.82
## 40       Giles     Brian       N        N  4.80
## 41       Floyd     Cliff       N        N  4.70
## 42     Aurilia      Rich       N        N  4.69
## 43       Abreu     Bobby       N        N  4.57
## 44    Youkilis     Kevin       N        N  4.49
## 45      Franco     Julio       N        N  4.48
## 46     Alfonzo   Edgardo       N        N  4.47
## 47      Furcal    Rafael       N        N  4.43
## 48      Salmon       Tim       N        N  4.43
## 49     Ensberg    Morgan       N        N  4.35
## 50   Victorino     Shane       N        N  4.31

Variable importance

The caret package provides a generic measure of variable importance which I can use to see which features are most and least important to the model.

Most important

varImp(xgbFit1)$importance  %>% head(20)
##                Overall
## WAA_8       100.000000
## WAA_5        76.256564
## WAR_6        69.148215
## WAR_9        36.334372
## WAR_5        34.409416
## WAA_rank_6   22.821211
## WAR_4        16.874295
## WAA_rank_11  13.074527
## SO_2         12.687577
## MVPShare_4   11.618700
## WAA_6        10.562202
## MVPShare_2   10.241774
## WAA_rank_3   10.133550
## WAA_rank_8    8.431491
## MVPShare_5    7.698129
## WAR_off_10    7.545897
## WAA_rank_13   6.272291
## WAR_off_1     5.993722
## BB_7          5.871416
## WAR_def_20    5.676881

Least important

varImp(xgbFit1)$importance  %>% tail(20)
##                 Overall
## AllStar_1   0.111321911
## WAR_def_11  0.088921086
## SO_10       0.081932590
## WAR_def_5   0.078701011
## SO_17       0.051725401
## PA_16       0.050446350
## PA_13       0.048166566
## WAR_def_6   0.046303194
## WAA_11      0.037712336
## PA_5        0.037445693
## PA_15       0.021192191
## WAA_rank_10 0.019294249
## BB_12       0.013170653
## BB_8        0.011597420
## MVPShare_17 0.007283144
## WAR_2       0.006749164
## SO_13       0.005737265
## WAA_7       0.003128745
## WAA_3       0.001529839
## WAA_13      0.000000000