This is an example of some code I’ve been working on for modeling baseball hall-of-fame voting. The code is very much still a work in progress, but is available on my github. https://github.com/bdilday/BaseballHOFModels
The data used for modeling are a combination of basic stats from the Lahman package and WAR and WAA values from baseball reference. In a previous step I imported the baseball-reference daily WAR tables into an R data frame and stored the results in br_war.RData. I augmengt the basic batting stats from the Lahman::Batting table with a count of World Series wins, All Star selections and All Star starts, and MVP awards and votes.
source('./R/prepare_data.R')
load('./data/br_war.RData')
bb <- get_lahman_batting()
bb <- append_age(bb)
bb <- append_br_war(bb, war=br_war$BattingWAR)
bb <- append_pos(bb)
bb <- append_hof(bb)
bb <- append_ws_wins(bb)
bb <- append_mvps(bb)
bb <- append_all_star(bb)
bb %>% head(2) %>% print.data.frame()
## playerID yearID bbrefID AB BA BABIP BASIC_WOBA BB CS G GIDP H HBP
## 1 aardsda01 2004 aardsda01 0 NaN 0 NaN 0 0 11 0 0 0
## 2 aardsda01 2006 aardsda01 2 0 0 0 0 0 45 0 0 0
## HR IBB OBP OPS PA R RBI SB SF SH SlugPct SO TB wwoba X1B X2B X3B
## 1 0 0 NaN NaN 0 0 0 0 0 0 NaN 0 0 0 0 0 0
## 2 0 0 0 0 3 0 0 0 0 1 0 0 0 0 0 0 0
## birthYear birthMonth birthDay birthCountry birthState birthCity
## 1 1981 12 27 USA CO Denver
## 2 1981 12 27 USA CO Denver
## deathYear deathMonth deathDay deathCountry deathState deathCity
## 1 NA NA NA <NA> <NA> <NA>
## 2 NA NA NA <NA> <NA> <NA>
## nameFirst nameLast nameGiven weight height bats throws debut
## 1 David Aardsma David Allan 215 75 R R 2004-04-06
## 2 David Aardsma David Allan 215 75 R R 2004-04-06
## finalGame retroID deathDate birthDate Age WAA WAR WAR_def WAR_off
## 1 2015-08-23 aardd001 <NA> 1981-12-27 23 0.01 0.01 0 0.01
## 2 2015-08-23 aardd001 <NA> 1981-12-27 25 -0.02 -0.02 0 -0.02
## WAA_rank POS inducted votedBy WSWin MVPWin MVPShare MPVWin AllStar
## 1 442 P N N 0 0 0 0 0
## 2 728 P N N 0 0 0 0 0
## AllStarStart
## 1 0
## 2 0
The get_fit_data function orders the data for each player from best to worst by one of the features, which by default is wins-above-average. The stats for the top N years (by default 20) are kept and tidyr::spread is used to cast them to a wide format. If a player played less than 20 years, the non-existant years are filled in with zeros. Before generating the data frame I exclude pitchers, and before fitting the model I limit the data to players who’s best best season had at least 30 PA.
tmp <- bb %>% filter(POS!='P')
fit_df_train <- get_fit_data(tmp, final_game_min='1901-01-01', final_game_max='2006-01-01') %>%
filter(PA_1>=30)
fit_df_test <- get_fit_data(tmp, final_game_min='2006-01-01', final_game_max='2016-01-01') %>%
filter(PA_1>=30)
Here’s an example of what the feature set looks like. The concept is to use overall value meaurements (WAR, WAA, WAR_off, WAR_def), subjective honors (ALL Star, MVP wins & shares) and FIP components (SO, BB, HR). This makes it straight forward to include pitchers with the same feature set, although I’m not doing that here.
head(fit_df_train, 1) %>% print.data.frame()
## playerID POS inducted AllStar_1 AllStar_10 AllStar_11 AllStar_12
## 1 aaronha01 OF Y 1 1 1 1
## AllStar_13 AllStar_14 AllStar_15 AllStar_16 AllStar_17 AllStar_18
## 1 1 1 1 1 1 1
## AllStar_19 AllStar_2 AllStar_20 AllStar_3 AllStar_4 AllStar_5 AllStar_6
## 1 1 1 1 1 1 1 1
## AllStar_7 AllStar_8 AllStar_9 AllStarStart_1 AllStarStart_10
## 1 1 1 1 0 1
## AllStarStart_11 AllStarStart_12 AllStarStart_13 AllStarStart_14
## 1 1 1 0 0
## AllStarStart_15 AllStarStart_16 AllStarStart_17 AllStarStart_18
## 1 1 0 1 1
## AllStarStart_19 AllStarStart_2 AllStarStart_20 AllStarStart_3
## 1 1 1 1 1
## AllStarStart_4 AllStarStart_5 AllStarStart_6 AllStarStart_7
## 1 1 0 1 1
## AllStarStart_8 AllStarStart_9 BB_1 BB_10 BB_11 BB_12 BB_13 BB_14 BB_15
## 1 1 1 56 76 71 59 37 62 64
## BB_16 BB_17 BB_18 BB_19 BB_2 BB_20 BB_3 BB_4 BB_5 BB_6 BB_7 BB_8 BB_9
## 1 49 68 74 92 78 39 63 51 66 87 57 60 60
## MVPShare_1 MVPShare_10 MVPShare_11 MVPShare_12 MVPShare_13 MVPShare_14
## 1 0.1741071 0.2035714 0.5357143 0.4940476 0.4345238 0.07857143
## MVPShare_15 MVPShare_16 MVPShare_17 MVPShare_18 MVPShare_19 MVPShare_2
## 1 0.06785714 0.1071429 0.1041667 0.04761905 0.03571429 0.4821429
## MVPShare_20 MVPShare_3 MVPShare_4 MVPShare_5 MVPShare_6 MVPShare_7
## 1 0 0.2821429 0.5178571 0.2571429 0.5595238 0.7113095
## MVPShare_8 MVPShare_9 MVPWin_1 MVPWin_10 MVPWin_11 MVPWin_12 MVPWin_13
## 1 0.1458333 0.2071429 0 0 0 0 0
## MVPWin_14 MVPWin_15 MVPWin_16 MVPWin_17 MVPWin_18 MVPWin_19 MVPWin_2
## 1 0 0 0 0 0 0 0
## MVPWin_20 MVPWin_3 MVPWin_4 MVPWin_5 MVPWin_6 MVPWin_7 MVPWin_8 MVPWin_9
## 1 0 0 0 0 0 1 0 0
## PA_1 PA_10 PA_11 PA_12 PA_13 PA_14 PA_15 PA_16 PA_17 PA_18 PA_19 PA_2
## 1 671 688 573 664 660 634 676 665 465 598 544 714
## PA_20 PA_3 PA_4 PA_5 PA_6 PA_7 PA_8 PA_9 SO_1 SO_10 SO_11 SO_12 SO_13
## 1 382 669 693 667 639 675 664 639 64 96 58 49 54
## SO_14 SO_15 SO_16 SO_17 SO_18 SO_19 SO_2 SO_20 SO_3 SO_4 SO_5 SO_6 SO_7
## 1 46 62 61 51 63 55 94 29 97 54 73 47 58
## SO_8 SO_9 WAA_1 WAA_10 WAA_11 WAA_12 WAA_13 WAA_14 WAA_15 WAA_16 WAA_17
## 1 63 81 6.85 5.35 5.25 4.84 4.57 4.44 4.44 3.75 3.17
## WAA_18 WAA_19 WAA_2 WAA_20 WAA_3 WAA_4 WAA_5 WAA_6 WAA_7 WAA_8 WAA_9
## 1 2.99 2.19 6.42 0.88 6.15 6.11 5.93 5.87 5.52 5.47 5.4
## WAA_rank_1 WAA_rank_10 WAA_rank_11 WAA_rank_12 WAA_rank_13 WAA_rank_14
## 1 3 6 3 4 4 12
## WAA_rank_15 WAA_rank_16 WAA_rank_17 WAA_rank_18 WAA_rank_19 WAA_rank_2
## 1 6 9 21 28 36 2
## WAA_rank_20 WAA_rank_3 WAA_rank_4 WAA_rank_5 WAA_rank_6 WAA_rank_7
## 1 90 4 2 3 5 5
## WAA_rank_8 WAA_rank_9 WAR_1 WAR_10 WAR_11 WAR_12 WAR_13 WAR_14 WAR_15
## 1 3 2 9.39 7.77 7.25 7.31 7.12 6.8 6.81
## WAR_16 WAR_17 WAR_18 WAR_19 WAR_2 WAR_20 WAR_3 WAR_4 WAR_5 WAR_6 WAR_7
## 1 6.23 4.65 4.99 3.92 9.1 2.07 8.52 8.62 8.46 8.03 7.99
## WAR_8 WAR_9 WAR_def_1 WAR_def_10 WAR_def_11 WAR_def_12 WAR_def_13
## 1 7.98 7.85 2 0.75 -1.34 0.21 0.63
## WAR_def_14 WAR_def_15 WAR_def_16 WAR_def_17 WAR_def_18 WAR_def_19
## 1 -0.16 0.44 -0.01 -0.78 -0.6 -1.18
## WAR_def_2 WAR_def_20 WAR_def_3 WAR_def_4 WAR_def_5 WAR_def_6 WAR_def_7
## 1 -1.29 -0.38 -0.47 -1.1 0.3 0.04 -0.13
## WAR_def_8 WAR_def_9 WAR_off_1 WAR_off_10 WAR_off_11 WAR_off_12
## 1 0.77 0.32 7.03 6.2 7.74 6.59
## WAR_off_13 WAR_off_14 WAR_off_15 WAR_off_16 WAR_off_17 WAR_off_18
## 1 5.93 6.25 5.31 5.81 4.85 4.91
## WAR_off_19 WAR_off_2 WAR_off_20 WAR_off_3 WAR_off_4 WAR_off_5 WAR_off_6
## 1 4.24 9.51 2.03 8.18 9.09 7.75 7.19
## WAR_off_7 WAR_off_8 WAR_off_9 WSWin_1 WSWin_10 WSWin_11 WSWin_12
## 1 7.78 6.5 6.81 0 0 0 0
## WSWin_13 WSWin_14 WSWin_15 WSWin_16 WSWin_17 WSWin_18 WSWin_19 WSWin_2
## 1 0 0 0 0 0 0 0 0
## WSWin_20 WSWin_3 WSWin_4 WSWin_5 WSWin_6 WSWin_7 WSWin_8 WSWin_9
## 1 0 0 0 0 0 1 0 0
Here I extract the data using model.matrix and setup a cluster to parallelize the cross-validations. The method I’m using in this instance is xgboost, with 10-fold cross validation via the caret package. The training data comprise players that retired prior to 2006-01-01 and the test data players that retired between 2006-01-01 and 2016-01-01.
frm <- as.formula(inducted ~ . - playerID)
xx = model.matrix(frm, data=fit_df_train)[,-1]
yy = fit_df_train$inducted
cl <- makeCluster(detectCores())
registerDoParallel(cl)
set.seed(825)
cv_train <- trainControl(method = "cv", number = 10)
xgbFit1 <- train(xx, yy, method = "xgbLinear", trControl = cv_train, verbose = FALSE)
stopCluster(cl)
The performance is perfect on the training data.
table(predict(xgbFit1), fit_df_train$inducted)
##
## N Y
## N 4227 0
## Y 0 142
The model predicts 11 playesr from the test data set to be inducted.
xx_test <- model.matrix(frm,data=fit_df_test)[,-1]
yy_test <- fit_df_test$inducted
results_df_test <- fit_df_test
results_df_test$predict <- 'N'
cc = which(predict(xgbFit1, newdata = xx_test) == 'Y')
results_df_test[cc,]$predict <- 'Y'
results_df_test %>%
merge(Lahman::Master %>% select(playerID, nameLast, nameFirst), by="playerID") %>%
arrange(desc(predict), -WAA_1) %>%
head(50) %>%
select(nameLast, nameFirst, predict, inducted, WAA_1) %>%
print.data.frame()
## nameLast nameFirst predict inducted WAA_1
## 1 Bonds Barry Y N 9.91
## 2 Griffey Ken Y Y 7.60
## 3 Biggio Craig Y Y 7.29
## 4 Piazza Mike Y Y 6.76
## 5 Jones Andruw Y N 6.04
## 6 Jones Chipper Y N 5.86
## 7 Jeter Derek Y N 5.57
## 8 Ramirez Manny Y N 5.09
## 9 Thomas Frank Y Y 5.04
## 10 Sheffield Gary Y N 4.81
## 11 Rodriguez Ivan Y Y 4.24
## 12 Sosa Sammy N N 8.20
## 13 Rolen Scott N N 7.34
## 14 Giambi Jason N N 6.79
## 15 Helton Todd N N 6.75
## 16 Hamilton Josh N N 6.66
## 17 Drew J. D. N N 6.37
## 18 Giles Marcus N N 5.96
## 19 Erstad Darin N N 5.95
## 20 Gonzalez Luis N N 5.83
## 21 Lee Derrek N N 5.66
## 22 Lofton Kenny N N 5.56
## 23 Glaus Troy N N 5.46
## 24 Thome Jim N N 5.37
## 25 Edmonds Jim N N 5.34
## 26 Guerrero Vladimir N N 5.33
## 27 Garciaparra Nomar N N 5.32
## 28 Figgins Chone N N 5.28
## 29 Lopez Javy N N 5.28
## 30 Jordan Brian N N 5.12
## 31 Kent Jeff N N 5.07
## 32 Roberts Brian N N 5.07
## 33 Pena Carlos N N 4.99
## 34 Ordonez Magglio N N 4.89
## 35 Tejada Miguel N N 4.89
## 36 Sanders Reggie N N 4.88
## 37 Green Shawn N N 4.87
## 38 Berkman Lance N N 4.83
## 39 Delgado Carlos N N 4.82
## 40 Giles Brian N N 4.80
## 41 Floyd Cliff N N 4.70
## 42 Aurilia Rich N N 4.69
## 43 Abreu Bobby N N 4.57
## 44 Youkilis Kevin N N 4.49
## 45 Franco Julio N N 4.48
## 46 Alfonzo Edgardo N N 4.47
## 47 Furcal Rafael N N 4.43
## 48 Salmon Tim N N 4.43
## 49 Ensberg Morgan N N 4.35
## 50 Victorino Shane N N 4.31
The caret package provides a generic measure of variable importance which I can use to see which features are most and least important to the model.
varImp(xgbFit1)$importance %>% head(20)
## Overall
## WAA_8 100.000000
## WAA_5 76.256564
## WAR_6 69.148215
## WAR_9 36.334372
## WAR_5 34.409416
## WAA_rank_6 22.821211
## WAR_4 16.874295
## WAA_rank_11 13.074527
## SO_2 12.687577
## MVPShare_4 11.618700
## WAA_6 10.562202
## MVPShare_2 10.241774
## WAA_rank_3 10.133550
## WAA_rank_8 8.431491
## MVPShare_5 7.698129
## WAR_off_10 7.545897
## WAA_rank_13 6.272291
## WAR_off_1 5.993722
## BB_7 5.871416
## WAR_def_20 5.676881
varImp(xgbFit1)$importance %>% tail(20)
## Overall
## AllStar_1 0.111321911
## WAR_def_11 0.088921086
## SO_10 0.081932590
## WAR_def_5 0.078701011
## SO_17 0.051725401
## PA_16 0.050446350
## PA_13 0.048166566
## WAR_def_6 0.046303194
## WAA_11 0.037712336
## PA_5 0.037445693
## PA_15 0.021192191
## WAA_rank_10 0.019294249
## BB_12 0.013170653
## BB_8 0.011597420
## MVPShare_17 0.007283144
## WAR_2 0.006749164
## SO_13 0.005737265
## WAA_7 0.003128745
## WAA_3 0.001529839
## WAA_13 0.000000000