This analysis investigates whether tree-based learning methods can successfully predict whether a Major League Baseball (“MLB”) team will win its division. There are 30 teams in MLB split into six divisions of five teams each based on geography. The team with the best record within each division at the end of the season wins the division, and in turn earns an automatic berth into the playoffs. We use a variety of techniques to fit tree-based classification models that predict a team’s chances of winning its division from the explanatory variables included in the data set. While our primary goal is to build a model that accurately predicts whether or not a team will win its division, we also hope to use our models to understand which baseball metrics are most significant in making that prediction.
The data set used in this report can be found within Lahman’s Baseball Database (http://www.seanlahman.com/baseball-archive/statistics/), an open-source collection of baseball statistics managed by Sean Lahman, a journalist for the Rochester Democrat & Chronicle. The dataset contains data on MLB team performance across every season from 1871 to 2018. Each row in the data set represents one team season, while each column represents a different variable or statistic describing that team’s performance. The file contains 48 variables (columns), and 2,895 observations (rows).
While the large number of observations was a factor in selecting this particular data set, our question of interest, outlined above, led us to subset the data to MLB seasons from 1977 through 2018 (the last year included in the data), excluding 1994, which we discuss shortly. We felt it important to include as many observations as possible to train and test our models, and selected 1977 because of its significance as an expansion year for MLB and its somewhat subjective status as one of the earlier years of what can be described as the modern era of baseball. Including years well before 1977 not only risks missing data, as certain statistics were not recorded at the time, but also risks including observations from an era of baseball fundamentally different from the way the game has been played for the past half-century or so, thereby introducing unexplained error into our models.
As mentioned above, we were forced to exclude the 1994 MLB season from our analysis because of its strange status in MLB history. On August 11, 1994, with about two months to play in the season, MLB players went on strike due to dissatisfaction with ongoing labor negotiations. About a month later, the Commissioner of Major League Baseball canceled the remainder of the season, meaning that not only did the teams play fewer games that season (meaning fewer runs, hits, etc.), but also that no division winners were officially named. Thus, we felt it necessary and appropriate to exclude the 1994 season from our analysis.
In addition to subsetting the data set, we removed 14 variables that were either unimportant to our question of interest (e.g., the name of each team, the team’s location, stadium name, etc.) or whose presence in the model interfered with our research goal of learning which baseball metrics best forecast whether a team will win its division. It was with this in mind that we removed Wins, Losses (a linear function of wins), WCWin, WSWin, and LgWin from consideration. WCWin, WSWin, and LgWin are binary variables describing a team’s playoff success, and as such are clearly highly correlated with DivWin. However, DivWin measures whether or not a team won its division in the regular season, and we felt that including postseason achievement metrics was not appropriate, especially since they tended to displace other variables in the decision trees that do measure a team’s regular season performance. Additionally, if using a decision tree to predict a future team’s chances of winning its division, we would not know the values of those variables, since the team in question would not yet have competed in the playoffs; including them would therefore limit the models’ forecasting ability. Finally, Wins and Losses were removed because they are very close proxies for DivWin (the team with the most wins in a division is the division winner), and therefore should not be included in tree-based models intended to estimate a team’s chances of winning its division from a variety of baseball statistics.
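A minimal sketch of this cleaning step, assuming the Lahman Teams table has been read into a data frame named `teams` with R's default column naming (so that the 2B and 3B columns become X2B and X3B); only some of the 14 removed columns are shown:

```r
# Keep the modern era (1977-2018) and drop the strike-shortened 1994 season
teams.sub <- subset(teams, yearID >= 1977 & yearID <= 2018 & yearID != 1994)

# Derive singles (X1B) from hits minus extra-base hits, and recode DivWin as 0/1
teams.sub$X1B    <- with(teams.sub, H - X2B - X3B - HR)
teams.sub$DivWin <- factor(ifelse(teams.sub$DivWin == "Y", 1, 0))

# Drop identifiers and the near-proxies for DivWin discussed above
drop.cols <- c("name", "park", "teamID", "franchID",
               "W", "L", "WCWin", "LgWin", "WSWin")
baseball.data <- teams.sub[, !(names(teams.sub) %in% drop.cols)]
```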
W: Total number of regular season wins achieved by the team in a season
SV: A save is awarded to the relief pitcher who finishes a game for the winning team. The total number of saves earned by a team’s pitchers during the season.
ERA: Earned run average represents the number of earned runs a pitcher allows per nine innings with earned runs being any runs that scored without the aid of an error or a passed ball. ERA is the most commonly accepted statistical tool for evaluating pitchers. The formula for finding ERA is: 9 x earned runs / innings pitched. If a pitcher exits a game with runners on base, any earned runs scored by those runners will count against him.
ER: Any run that scores against a pitcher without the benefit of an error or a passed ball.
R: A player is awarded a run if he crosses the plate to score his team a run. When tallying runs scored, the way in which a player reached base is not considered. R represents the total number of runs scored by a team over the course of the season.
SHO: A starting pitcher is credited with a shutout when he pitches the entire game for a team and does not allow the opposition to score. By definition, any pitcher who throws a shutout is also awarded a win. SHO represents the total number of shutouts thrown by a team over the course of the season.
Attendance: Number of tickets sold throughout the season.
BB: A walk (or base on balls) occurs when a pitcher throws four pitches out of the strike zone, none of which are swung at by the hitter. After refraining from swinging at four pitches out of the zone, the batter is awarded first base. The total number of walks by a team’s hitters during the season.
BBA: The total number of walks allowed by a team over the course of the season.
H: A hit occurs when a batter strikes the baseball into fair territory and reaches base without doing so via an error or a fielder’s choice. The total number of hits by a team during the season.
HR: A home run occurs when a batter hits a fair ball and scores on the play without being put out or without the benefit of an error. The total number of HR by a team during the season.
WCWin: A binary variable with a value of Y if the team won a Wild Card round playoff game, and N if not.
LgWin: A binary variable with a value of Y if the team won a League Championship Series playoff series, and N if not.
RA: The total number of runs allowed by a team in a season.
SB: A stolen base occurs when a baserunner takes another base without the ball being put in play. The total number of stolen bases a team’s baserunners earn over the course of the season.
WSWin: A binary variable with a value of Y if the team won the World Series, and N if not.
X1B: Singles, or hits on which the batter reaches first base safely without the contribution of a fielding error. The total number of singles a team’s batters hit during the season.
CG: Complete game, the act of a pitcher pitching an entire game without the benefit of a relief pitcher. The total number of complete games thrown by a team’s pitchers during the season.
PPF: Pitching Park Factor - centered around 100, with numbers above 100 representing the percentage increase in run-scoring against pitchers in that park as compared to other parks, and numbers below 100 representing the percentage decrease in run-scoring against pitchers in that park. A metric for how hitter or pitcher-friendly a team’s park is.
E: An error is a mistake by a fielder which allows a baserunner to advance one or more bases that he would not have been able to take without the error. The total number of errors a team’s fielders make during the season.
Title: Predicting Win-Loss outcomes in MLB regular season games – A comparative study using data mining methods
This study compared several machine learning methods, including various decision tree models and k-NN models, for predicting regular season win/loss outcomes. They used more advanced statistics than our model does, such as batting average on balls in play (BABIP). The research found that Support Vector Machines produced the best results in terms of accuracy, although most of the methods produced accuracies just under 60%.
Title: A Two-Stage Bayesian Model for Predicting Winners in Major League Baseball
This study used a two-stage Bayesian model to predict the win percentage of MLB teams. The model was effective at predicting team win percentages, and the authors also discuss how it could theoretically be used in a betting strategy.
## [1] 0.8131313
As we can see, the base rate for our variable of interest in this analysis, DivWin, is 0.813. This tells us that 81.3% of teams, or roughly four out of five, do not win their respective divisions in a given year.
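For reference, this base rate is simply the proportion of team-seasons that did not win their division; a minimal sketch, assuming the cleaned data frame is named baseball.data4 (the name that appears in the ROC output later in this report) with DivWin coded 0/1:

```r
# Proportion of team-seasons that did not win their division (the base rate)
mean(baseball.data4$DivWin == "0")
```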
| DivWin | G | R | AB | H | X2B | X3B | HR | BB | SO | SB | CS | HBP | SF | RA | ER | ERA | CG | SHO | SV | IPouts | HA | HRA | BBA | SOA | E | DP | FP | attendance | BPF | PPF | X1B |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 160.4855 | 711.9710 | 5480.405 | 1420.824 | 268.0694 | 32.13561 | 151.9741 | 516.5062 | 1034.445 | 103.8934 | 47.18323 | 45.68634 | 44.67495 | 739.3023 | 672.8789 | 4.226553 | 12.50932 | 8.749482 | 38.08075 | 4298.624 | 1440.648 | 158.2733 | 532.6294 | 1017.92 | 114.3540 | 148.8913 | 0.9812174 | 2058022 | 100.1284 | 100.34369 | 968.6449 |
| 1 | 159.5270 | 776.8559 | 5462.477 | 1455.559 | 277.5225 | 31.92342 | 172.4775 | 555.5586 | 1014.144 | 109.1396 | 44.09009 | 48.95946 | 48.43694 | 657.9279 | 600.2703 | 3.769234 | 12.85586 | 11.563063 | 44.86036 | 4296.018 | 1369.297 | 145.0676 | 485.4009 | 1086.05 | 104.4369 | 142.6892 | 0.9826937 | 2738243 | 100.4144 | 99.51802 | 973.6351 |
The table above shows the mean of each statistic for teams that did not (0) and did (1) win their division. Eight of these variables have a correlation with DivWin whose absolute value exceeds 0.25, and we expect these most highly correlated statistics to be the most important predictors going forward.
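The variable importance scores printed below come from our first classification tree. A minimal sketch of how such a tree could be fit with rpart, using its built-in 10-fold cross-validation and assuming the cleaned data frame is named baseball.data4 as above:

```r
library(rpart)

set.seed(1)
# Classification tree on all remaining predictors; xval = 10 gives the
# 10-fold cross-validated errors reported in the CP table further below
baseball.tree <- rpart(DivWin ~ ., data = baseball.data4,
                       method = "class",
                       control = rpart.control(cp = 0.01, xval = 10))

# Variable importance scores, as reported below
sort(baseball.tree$variable.importance, decreasing = TRUE)
```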
## R ERA RA ER H HR HA
## 66.454291 50.680709 46.981253 42.375185 30.056818 28.596707 27.620885
## HRA AB X2B SHO SV attendance SOA
## 26.831440 19.422238 18.459735 14.139572 11.007874 8.089110 5.062500
## DP IPouts BB SO CG HBP SB
## 4.750000 4.266273 3.865556 3.499964 2.951660 2.908517 2.095238
## CS X1B SF BPF E
## 1.404336 1.375000 1.333333 1.225950 0.475830
Runs scored (R), ERA, runs allowed (RA), and earned runs allowed (ER) are the four most important variables for predicting division wins. It is interesting that three of the four are pitching and defensive statistics, which may suggest that preventing runs matters more than scoring them. It is also somewhat surprising that saves (SV) is not more important, since the number of saves is closely tied to the number of wins and thus to how likely a team is to win the division. Finally, it is notable that attendance ranks higher than half of our variables, including a team’s strikeout total.
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 924 79
## 1 42 143
##
## Accuracy : 0.8981
## 95% CI : (0.8795, 0.9148)
## No Information Rate : 0.8131
## P-Value [Acc > NIR] : 4.839e-16
##
## Kappa : 0.6419
##
## Mcnemar's Test P-Value : 0.001065
##
## Sensitivity : 0.6441
## Specificity : 0.9565
## Pos Pred Value : 0.7730
## Neg Pred Value : 0.9212
## Prevalence : 0.1869
## Detection Rate : 0.1204
## Detection Prevalence : 0.1557
## Balanced Accuracy : 0.8003
##
## 'Positive' Class : 1
##
##
## Call:
## roc.default(response = baseball.data4$DivWin, predictor = as.numeric(baseball_data_fitted_model), plot = TRUE)
##
## Data: as.numeric(baseball_data_fitted_model) in 966 controls (baseball.data4$DivWin 0) < 222 cases (baseball.data4$DivWin 1).
## Area under the curve: 0.8003
We are pleased with our model’s overall accuracy of 0.898, which is significantly higher than the base rate of approximately 0.81. However, the sensitivity of our model is only about 0.644, which is not totally surprising considering the unbalanced nature of the data, but is still a metric we would like to see improve. The model’s good overall accuracy, driven in large part by an extremely high specificity, in combination with a good AUC of 0.8003, tells us that this decision tree has worked fairly well in predicting whether or not a team will win its division.
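The confusion matrix and ROC curve above can be reproduced along the following lines with the caret and pROC packages; baseball.tree is the hypothetical tree object from the earlier sketch, while baseball_data_fitted_model and baseball.data4 are the object names that appear in the roc() call above:

```r
library(caret)
library(pROC)

# Predicted class (0/1) for each team-season from the fitted tree
baseball_data_fitted_model <- predict(baseball.tree, baseball.data4, type = "class")

# Confusion matrix, treating division winners (1) as the positive class
confusionMatrix(data = baseball_data_fitted_model,
                reference = baseball.data4$DivWin,
                positive = "1")

# ROC curve and AUC based on the predicted classes
roc(response = baseball.data4$DivWin,
    predictor = as.numeric(baseball_data_fitted_model),
    plot = TRUE)
```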
## CP nsplit rel error xerror xstd opt
## 1 0.10585586 0 1.0000000 1.0000000 0.06052069 1.0605207
## 2 0.04954955 2 0.7882883 0.9504505 0.05933724 0.8476255
## 3 0.04054054 3 0.7387387 0.8648649 0.05715038 0.7958891
## 4 0.03603604 4 0.6981982 0.8783784 0.05750833 0.7557065
## 5 0.01576577 6 0.6261261 0.7612613 0.05423370 0.6803598
## 6 0.01351351 10 0.5585586 0.7882883 0.05502532 0.6135839
## 7 0.01000000 11 0.5450450 0.8063063 0.05554064 0.6005857
## R ERA RA ER H HRA HA
## 62.754922 50.680709 46.981253 41.041852 28.101568 26.831440 26.183385
## HR X2B AB SHO SV attendance IPouts
## 22.157723 18.459735 15.012978 14.139572 11.007874 8.089110 4.266273
## BB HBP BPF CG E
## 2.451901 1.241851 1.225950 0.951660 0.475830
The most important variables in this model are still runs scored, ERA, runs allowed, and earned runs, the same as before.
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 903 76
## 1 63 146
##
## Accuracy : 0.883
## 95% CI : (0.8633, 0.9007)
## No Information Rate : 0.8131
## P-Value [Acc > NIR] : 4.307e-11
##
## Kappa : 0.6061
##
## Mcnemar's Test P-Value : 0.3088
##
## Sensitivity : 0.6577
## Specificity : 0.9348
## Pos Pred Value : 0.6986
## Neg Pred Value : 0.9224
## Prevalence : 0.1869
## Detection Rate : 0.1229
## Detection Prevalence : 0.1759
## Balanced Accuracy : 0.7962
##
## 'Positive' Class : 1
##
##
## Call:
## roc.default(response = baseball.data4$DivWin, predictor = as.numeric(new_baseball_data_fitted_model), plot = TRUE)
##
## Data: as.numeric(new_baseball_data_fitted_model) in 966 controls (baseball.data4$DivWin 0) < 222 cases (baseball.data4$DivWin 1).
## Area under the curve: 0.7962
As is visible above, a decision tree with five splits was determined to have the lowest cross-validated error. Thus, for our second model we chose to fit a new decision tree with five splits, determined by our analysis to be the optimal number. Predictably, this tree was much simpler, making it far easier to interpret than the first model. In evaluating the model’s performance, we notice first that its sensitivity improved relative to the first model, rising to 0.658 from 0.644. However, this was accompanied by a decrease in specificity, which lowered the overall model accuracy and caused a slight decrease in AUC. With that said, we feel the increase in sensitivity likely outweighs the decreases in specificity, overall accuracy, and AUC, given that those metrics suffered only marginally and remain fairly strong.
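A sketch of this pruning step, using the common rule of pruning at the complexity parameter (cp) with the lowest cross-validated error; object names carry over from the earlier sketch, and the row selected by this rule may differ slightly from the five-split tree chosen above:

```r
library(rpart)

# Pick the cp value with the lowest cross-validated error (xerror) in the CP table
cp.table <- baseball.tree$cptable
best.cp  <- cp.table[which.min(cp.table[, "xerror"]), "CP"]

# Prune the original tree back to that complexity
baseball.tree.pruned <- prune(baseball.tree, cp = best.cp)
```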
We built our first random forest classification model using the recommended number of predictors to consider at each split, six (mtry = 6), which we calculated by taking the square root of the total number of predictors in the data (34).
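A minimal sketch of this first Random Forest fit, assuming the cleaned data have been split into training and test sets named baseball.train and baseball.test (hypothetical names; the split itself is not shown in this report):

```r
library(randomForest)

set.seed(1)
# 500 trees (the randomForest default), with 6 predictors tried at each split
baseball.rf <- randomForest(DivWin ~ ., data = baseball.train,
                            mtry = 6, ntree = 500, importance = TRUE)

# OOB and per-class error rates as a function of the number of trees
plot(baseball.rf)
```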
The above graph depicts the error rates for each of the random forest trees produced in the procedure, with the x-axis representing the number of trees (in the order they were grown) and the y-axis representing the error rate. The green and orange lines represent the class errors in predicting the individual classes of the training data observations, with green representing the false negative rate and orange representing the false positive rate. The blue line displays the out-of-bag (OOB) error rate. The OOB error rate is lowest at 109 trees and then levels off at around 0.16, corresponding to an overall OOB accuracy of roughly 0.84, which is better than our base rate of 0.81.
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 765 137
## 1 12 37
##
## Accuracy : 0.8433
## 95% CI : (0.8186, 0.8659)
## No Information Rate : 0.817
## P-Value [Acc > NIR] : 0.0185
##
## Kappa : 0.2734
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.21264
## Specificity : 0.98456
## Pos Pred Value : 0.75510
## Neg Pred Value : 0.84812
## Prevalence : 0.18297
## Detection Rate : 0.03891
## Detection Prevalence : 0.05152
## Balanced Accuracy : 0.59860
##
## 'Positive' Class : 1
##
## [1] 0.888826
In observing the confusion matrix and displayed model performance figures, we notice that the overall accuracy of this initial Random Forest model, 84.33%, is below that of our previous decision tree models. A further examination of the model’s performance reveals that the cause of this decrease in overall accuracy is a very poor sensitivity of 0.2126, which tells us that this Random Forest model correctly classifies only 21.26% of division-winning teams. Despite the acceptable overall accuracy (even though it represents only a small improvement over the base rate) and a strong AUC, we have to conclude on the basis of its sensitivity that this model is unfit to achieve our goal of accurately predicting whether MLB teams will win their divisions. We will look for better sensitivities in the next models we fit as we tweak certain parameters in an attempt to improve upon this first model.
An examination of the variable importance plot reveals that the most important explanatory variables in predicting whether a team will win its division are R, attendance, SV, ERA, RA, and ER. None of these are surprising, as each of R, SV, ERA, RA, and ER are common baseball metrics often used to assess team performance, and it makes sense that attendance would be highly correlated with team success.
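The importance plot referenced above can be produced directly from the fitted forest; a short sketch using the hypothetical object name from earlier:

```r
library(randomForest)

# Variable importance: mean decrease in accuracy and mean decrease in Gini
varImpPlot(baseball.rf)
importance(baseball.rf)
```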
## Number of Trees Out of Bag Don't Win Division Win Division
## 80 80 0.1503680 0.01673102 0.7471264
## 81 81 0.1503680 0.01544402 0.7528736
## 84 84 0.1503680 0.01801802 0.7413793
## 67 67 0.1514196 0.01930502 0.7413793
## 74 74 0.1514196 0.01801802 0.7471264
After determining from our initial Random Forest model that the ideal number of trees for the Random Forest procedure to fit is 109, we decided to fit a second Random Forest model with 109 trees, as opposed to the previous model’s 500.
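A sketch of how the optimal tree count can be read off the fitted forest's OOB error rates and used to refit the model (object names are assumptions, carried over from the earlier sketch):

```r
library(randomForest)

# Number of trees at which the OOB error rate is lowest (109 in our case)
which.min(baseball.rf$err.rate[, "OOB"])

# Refit the forest with ntree = 109 and the same mtry
set.seed(1)
baseball.rf2 <- randomForest(DivWin ~ ., data = baseball.train,
                             mtry = 6, ntree = 109, importance = TRUE)
```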
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 766 138
## 1 11 36
##
## Accuracy : 0.8433
## 95% CI : (0.8186, 0.8659)
## No Information Rate : 0.817
## P-Value [Acc > NIR] : 0.0185
##
## Kappa : 0.2689
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.20690
## Specificity : 0.98584
## Pos Pred Value : 0.76596
## Neg Pred Value : 0.84735
## Prevalence : 0.18297
## Detection Rate : 0.03785
## Detection Prevalence : 0.04942
## Balanced Accuracy : 0.59637
##
## 'Positive' Class : 1
##
## [1] 0.8804235
As discussed above, upon examination of the OOB error rates, we determined the optimal number of trees to be fit within the Random Forest procedure to be 109. We then fit a second model with 109 trees in the hope that it would improve on the classification accuracy of our initial Random Forest model. In observing this model’s performance, we can see that the overall accuracy is essentially unchanged, while the sensitivity and AUC are both marginally lower. Unfortunately, this model performed no better than our initial model, and we still do not feel comfortable with the classification power of a model with a sensitivity of roughly 21%.
An examination of the variable importance plot reveals that this second Random Forest model found the exact same variables to be most important in predicting whether a team will win its division. There are a couple of slight differences between the plots, but these just represent slightly different variables being named as the 8th through 10th most important variables.
## mtry = 5 OOB error = 15.14%
## Searching left ...
## mtry = 3 OOB error = 14.51%
## 0.04166667 0.05
## Searching right ...
## mtry = 10 OOB error = 13.35%
## 0.1180556 0.05
## mtry = 20 OOB error = 13.88%
## -0.03937008 0.05
## mtry OOBError
## 3.OOB 3 0.1451104
## 5.OOB 5 0.1514196
## 10.OOB 10 0.1335436
## 20.OOB 20 0.1388013
As can be seen above, the tuneRF procedure compares the OOB error rate at several mtry values for our data. Based on this analysis, we set mtry to 3 for our final Random Forest model, which we fit using the better-performing ntree value of 109 from our second Random Forest model. An mtry of 3 means that at each split of each decision tree the Random Forest procedure fits, it randomly selects three explanatory variables from the predictor space as candidate splitting variables.
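A sketch of this tuning step with randomForest::tuneRF, followed by the final model; the predictor/response selection and object names are assumptions, and mtry = 3 is the value selected above:

```r
library(randomForest)

set.seed(1)
# Search over mtry values, doubling/halving from the starting value and keeping
# a step only if it improves the OOB error by at least 5%
predictors   <- setdiff(names(baseball.train), "DivWin")
tune.results <- tuneRF(x = baseball.train[, predictors],
                       y = baseball.train$DivWin,
                       mtryStart = 5, ntreeTry = 109,
                       stepFactor = 2, improve = 0.05)

# Final model: mtry = 3 with ntree = 109
baseball.rf3 <- randomForest(DivWin ~ ., data = baseball.train,
                             mtry = 3, ntree = 109, importance = TRUE)
```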
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 767 128
## 1 10 46
##
## Accuracy : 0.8549
## 95% CI : (0.8309, 0.8767)
## No Information Rate : 0.817
## P-Value [Acc > NIR] : 0.001125
##
## Kappa : 0.3413
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.26437
## Specificity : 0.98713
## Pos Pred Value : 0.82143
## Neg Pred Value : 0.85698
## Prevalence : 0.18297
## Detection Rate : 0.04837
## Detection Prevalence : 0.05889
## Balanced Accuracy : 0.62575
##
## 'Positive' Class : 1
##
## [1] 0.8804235
This model performs very similarly to the first two Random Forest models, with an accuracy rate of 85.49% that marks a small improvement over the base rate, but a still very low sensitivity that lowers our confidence in the model’s predictive power. After having reviewed each of the three Random Forest models, we can say that the differences in performance among them are marginal. An extremely low sensitivity plagued each of the models, and will limit their usefulness in predicting MLB teams’ success.
Interestingly, this final Random Forest model’s variable importance plot identified slightly different variables as being most important in predicting whether a team will win its division. Specifically, this model found SV, R, HR, BB, and attendance to be the most important variables as measured by the mean decrease in accuracy in the absence of those variables and R, SV, attendance, ERA, and HR to be the most important variables as measured by the mean decrease in Gini in the absence of those variables. Like with the variables found to be most significant by the previous two Random Forest models, we are unsurprised to find these variables significant, as each of them aside from attendance is a commonly referenced baseball statistic often used to measure team performance.
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 673 49
## 1 104 125
##
## Accuracy : 0.8391
## 95% CI : (0.8142, 0.8619)
## No Information Rate : 0.817
## P-Value [Acc > NIR] : 0.04111
##
## Kappa : 0.5207
##
## Mcnemar's Test P-Value : 1.268e-05
##
## Sensitivity : 0.7184
## Specificity : 0.8662
## Pos Pred Value : 0.5459
## Neg Pred Value : 0.9321
## Prevalence : 0.1830
## Detection Rate : 0.1314
## Detection Prevalence : 0.2408
## Balanced Accuracy : 0.7923
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 667 55
## 1 110 119
##
## Accuracy : 0.8265
## 95% CI : (0.8009, 0.8501)
## No Information Rate : 0.817
## P-Value [Acc > NIR] : 0.2393
##
## Kappa : 0.4831
##
## Mcnemar's Test P-Value : 2.624e-05
##
## Sensitivity : 0.6839
## Specificity : 0.8584
## Pos Pred Value : 0.5197
## Neg Pred Value : 0.9238
## Prevalence : 0.1830
## Detection Rate : 0.1251
## Detection Prevalence : 0.2408
## Balanced Accuracy : 0.7712
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 707 71
## 1 70 103
##
## Accuracy : 0.8517
## 95% CI : (0.8275, 0.8737)
## No Information Rate : 0.817
## P-Value [Acc > NIR] : 0.002646
##
## Kappa : 0.503
##
## Mcnemar's Test P-Value : 1.000000
##
## Sensitivity : 0.5920
## Specificity : 0.9099
## Pos Pred Value : 0.5954
## Neg Pred Value : 0.9087
## Prevalence : 0.1830
## Detection Rate : 0.1083
## Detection Prevalence : 0.1819
## Balanced Accuracy : 0.7509
##
## 'Positive' Class : 1
##
In altering the decision thresholds, we examined a number of potential threshold values, but settled on 0.3, as it offered the most significant increase in sensitivity without compromising specificity and overall model accuracy too severely.
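A sketch of how the 0.3 threshold can be applied to the first Random Forest model's predicted probabilities (object names are assumptions carried over from the earlier sketches):

```r
library(randomForest)
library(caret)

# Predicted probability of winning the division for each test-set team
rf.probs <- predict(baseball.rf, newdata = baseball.test, type = "prob")[, "1"]

# Classify a team as a division winner when that probability exceeds 0.3
rf.pred.30 <- factor(ifelse(rf.probs > 0.3, "1", "0"), levels = c("0", "1"))

confusionMatrix(data = rf.pred.30,
                reference = baseball.test$DivWin,
                positive = "1")
```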
As we examine the new confusion matrices and model performance statistics, it is evident that lowering the decision threshold to 0.3 significantly improved the models’ sensitivity, which is exactly what we had hoped. Interestingly, in examining the model performance metrics at the new threshold, we notice that it is now the first Random Forest model which we feel performs the best in terms of classification accuracy. With an overall accuracy of 83.91%, which is right in line with the other two models, the first Random Forest model distinguishes itself in our mind with its improved sensitivity of 0.718, in line with our decision tree models. This higher power in successfully predicting division winners increases our confidence in the first RF model’s ability to achieve our research objective relative to the second and third RF models we fit.
When examining the data, our data set had an observed base rate of 0.813, meaning that 81.3% of teams in our data did not win their division in a given season. In an effort to create a model that could accurately predict whether or not an MLB team would win its division, we fit a series of decision tree and Random Forest models. Our analysis reveals that our decision tree models, fit using 10-fold cross-validation, actually performed significantly better than the Random Forest models at classification. This is surprising, as the idea behind a Random Forest model is to lower the variance of its predictions by averaging predictions across many uncorrelated decision trees. However, we found that our Random Forest models suffered from extremely low sensitivity, even after altering the ntree and mtry parameters in an attempt to improve them. We feel that a primary cause of this issue is the razor-thin difference between highly successful teams and middling teams: a handful of games can make all the difference in the final standings, and that small a margin may not consistently show up in statistics aggregated over the course of an entire season. This, combined with the unbalanced nature of our variable of interest, clearly hurt the ability of our Random Forest models to accurately identify potential division winners. In an attempt to improve the models’ power in successfully predicting division winners, we lowered the classification threshold from 0.5 to 0.3. This made a significant difference in lowering the models’ false negative rate while keeping overall accuracy fairly consistent. The first Random Forest model performed particularly well at the new decision threshold, producing a sensitivity of roughly 72% (compared to its initial sensitivity of about 21%) while maintaining essentially the same overall accuracy (83.9% versus 84.3%). However, despite this marked improvement, we still feel that our decision tree models outperformed our first Random Forest model, as both boasted superior overall performance as measured by overall accuracy, Kappa, F1, and AUC.
Between the two decision tree models, we would likely choose the second, pruned tree as our preferred classification model due to its simplicity and improved sensitivity as compared to the original tree fit using recursive binary splitting.
Both of the decision tree models we fit on the data showed that runs scored, ERA, runs allowed, and earned runs were the four most significant variables we tested. The runs scored variable is not a surprise, as the more runs a team scores, the more likely it is to outscore the other team and win the game. The other three variables are more interesting because they are pitching and defensive statistics. This could suggest that MLB teams might want to weigh a player’s defensive ability more heavily when deciding whom to add to their roster. Furthermore, MLB teams may want to target players who can improve the team’s performance in these metrics, likely giving them a better chance of winning their division.
Our final selected model (the pruned decision tree) offers a significant improvement over random classification when one compares the data’s base rate of 0.81 to the model’s accuracy of 0.883. Among the models fit at the default classification threshold, it also had the highest sensitivity, at 0.658, a metric we felt was one of the more important to track across models. Overall, we are pleased with the performance of this model, and feel it successfully achieves our research goal of accurately predicting MLB division winners while also providing insight into which variables may be most significant in identifying such successful teams.
As machine learning models and data algorithms become more complex, so do the ethical concerns surrounding them. Discrimination and bias are two issues of growing importance in the data science community, and our MLB data is no exception. The book Weapons of Math Destruction outlines how algorithms claim to accurately quantify traits that humans want to maximize or minimize, and the Equality of Opportunity in Supervised Learning paper from class introduced concerns about discrimination and bias entering modern algorithms and machine learning models. The former’s author, O’Neil, notes that metrics used to quantify human traits include creditworthiness, an individual’s risk of recidivism, and value added to schools. These concerns are important to keep in mind when tackling the question of machine learning and human bias in the sports data world. In relation to our dataset, our classification trees took into account every baseball statistic variable to see which factors indicate an MLB team’s chance of winning its division (and thus making the playoffs). On a surface level, then, our models did not rely on preconceived notions or discriminate against specific variables, since we used all statistical categories, making our algorithms as close to “unbiased” as we can. The areas of concern relate more to the data cleaning process, the dataset, and the source of the data than to the ML algorithms themselves. In terms of the dataset, we pulled data from a respected baseball statistician, Sean Lahman, who records team statistics for all of MLB. One potential source of bias in our data cleaning process was our choice to use only data from 1977 onward, given that our data set included every year of baseball statistics dating back to 1871. Our goal in doing so was to use only data from the so-called modern era of baseball, but excluding years in a somewhat arbitrary manner could be seen as human interference.
As O’Neil points out in Weapons of Math Destruction, the widespread availability, transparency, and continually updated nature of baseball statistics eliminates many of the risks associated with other types of data. Additionally, one interesting aspect of baseball data that O’Neil neglects to consider is that in baseball, as in most sports, there is a multitude of metrics which measure a team’s performance in every area of the game, including pitching, hitting, and fielding, with great accuracy. Thus, when analyzing baseball data one does not have to rely on poor proxy measures to estimate variables of interest, exempting baseball “sabermetrics” from another of the dangerous practices that O’Neil identifies as a key cause of bias in statistical models.
Overall, we believe our algorithms for predicting MLB division winners from statistical factors are fair and unbiased, and that our data cleaning process was justified in order to create consistency across the observations fed into our models.
The clearest immediate extension of our work in this paper would be to consider the effect of adjusting the decision thresholds on our decision tree models. Given the impact adjusting the thresholds had on the performance of our Random Forest models, we feel that tweaking the classification thresholds could have a positive impact on the decision trees’ performance as well.
It should be noted that the impact of lowering the decision threshold on our decision tree models is less clear than with our three Random Forest models, as both tree models already have a sensitivity high enough that significantly lowering the threshold could lower specificity more than it improves sensitivity, thus worsening overall performance. Nonetheless, investigating these effects and whether they change which model we feel best predicts whether a team will win its division would be a relevant and interesting extension of our work in this paper.
Another interesting extension of this project would be to fit models for the same question using other classification techniques in an attempt to develop a superior model. Possible techniques include logistic regression, linear and quadratic discriminant analysis, and K-Nearest Neighbors. While we are pleased with the performance of our pruned decision tree model, it is always possible to fit a better model, and each of these techniques has its own strengths that could yield a powerful model in terms of classification accuracy. Additionally, if we felt that multicollinearity presented issues in fitting these more sensitive models, we could use shrinkage methods such as ridge and lasso regression to fit a more accurate model.
Finally, performing unsupervised learning methods, such as principal component analysis or clustering, on this data would be an interesting way to better understand the relationships between the variables in the predictor space, as well as how the data separate on the basis of different variables.