Nimon Dong
STAT 301-2 Data Science II Final Project

I've always been an avid professional and college basketball fan with an interest in how data analytics is applied in the sports world. With the March Madness Tournament coming up this year, I wanted to apply the concepts we learned in STAT 301-2 Data Science II to predicting the tournament. Predicting the tournament comes down to predicting the outcomes of college basketball games given historical performance statistics. Because the outcome of a basketball game is binary, i.e., win or lose, this is a classification problem. Overall, this final report is a practical application and assessment of the model methodologies I learned in class. I have also included a brief Exploratory Data Analysis (EDA) in the Appendix.
For training and testing my models, I used datasets from the 2020 Google Cloud & NCAA ML Competition on Kaggle.com. These datasets include extensive historical data on D1 college basketball teams from 2003-2019 (366 teams and over 80,000 basketball games). Taken together, they cover important game performance metrics for each team, past tournaments, regular season results, and more. Given the nature of these datasets, there is no missingness. The specific performance metrics I used are defined in the codebook referenced below.
The formal citation is as follows:
Kaggle.com. "Google Cloud & NCAA® ML Competition 2020 - Men's Dataset." Available online at: https://www.kaggle.com/c/google-cloud-ncaa-march-madness-2020-division-1-mens-tournament/data
First, I started with the dataset containing detailed regular season game results. Using this dataset, I calculated end-of-regular-season average statistics for each team for each season. These statistics are all continuous variables. Below is a codebook defining and describing the statistics I chose:
Then, I took the historical tournament results and, for each tournament matchup, appended the first team's (Team 1) regular season statistics, its opponent's (Team 2) regular season statistics, and the difference between each pair of statistics. In addition, I created the reverse matchup for each game, resulting in the data frame outlined below:
| Win | Team_1 | Team_2 | Team_1_Regular_Season_Stats | Team_2_Regular_Season_Stats | Stats_Differential |
|---|---|---|---|---|---|
| 1 | Team A | Team B | Team A regular season stats... | Team B regular season stats... | Team A - Team B stats... |
| 0 | Team B | Team A | Team B regular season stats... | Team A regular season stats... | Team B - Team A stats... |
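As a rough illustration, this construction could look something like the sketch below in R with dplyr; the object names (`tourney_results`, `season_stats`) and their exact columns are hypothetical stand-ins for my actual cleaning code.

```r
library(dplyr)

# Hypothetical inputs:
#   tourney_results: one row per tournament game (season, w_team, l_team)
#   season_stats:    one row per team-season of end-of-season averages,
#                    with an identifier column team_id
build_matchups <- function(tourney_results, season_stats) {
  # Winner listed as Team 1 (win = 1) ...
  forward <- transmute(tourney_results,
                       season, win = 1, team_1 = w_team, team_2 = l_team)
  # ... plus the reverse matchup (win = 0)
  reverse <- transmute(tourney_results,
                       season, win = 0, team_1 = l_team, team_2 = w_team)

  bind_rows(forward, reverse) %>%
    # Append each side's regular season stats with team_1_ / team_2_ prefixes
    left_join(rename_with(season_stats, ~ paste0("team_1_", .x), -c(season, team_id)),
              by = c("season", "team_1" = "team_id")) %>%
    left_join(rename_with(season_stats, ~ paste0("team_2_", .x), -c(season, team_id)),
              by = c("season", "team_2" = "team_id")) %>%
    # Differential columns: Team 1 stat minus Team 2 stat (one shown here)
    mutate(diff_win_pct = team_1_win_pct - team_2_win_pct)
}
```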
Overall, my final cleaned dataset had 48 predictors and 2,230 observations. It is important to note that multicollinearity exists: some of my predictor variables are correlated with one another. With this in mind, I took a deeper dive into addressing this issue in the Model Fitting section of this report.
For my project, I first split my cleaned data into one set with regular season statistics and another set with only the statistic differentials. Given that the statistic differentials are derived from regular season statistics, this dataset separation partially resolved the multicollinearity issues between these two sets of predictors.
To split each dataset into training and test sets, I used two different resampling methods depending on the model. Ideally, every model would be trained and tested with the same resampling method; however, I wanted to apply a variety of the resampling methods we learned in class. The first was a validation set approach, in which I set aside the 2019 season as the test set and used the 2003-2018 seasons as the training set. The second was 10-fold cross-validation, which randomly divides the observations into 10 groups, or folds, of approximately equal size; each fold in turn serves as the validation set while the model is fit on the remaining 9 folds.
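A minimal sketch of the two splits, assuming the cleaned data frame is called `model_data` and carries a `season` column (names are placeholders):

```r
library(dplyr)
library(rsample)

set.seed(301)

# Validation set approach: hold out the 2019 season as the test set
train_data <- filter(model_data, season <= 2018)
test_data  <- filter(model_data, season == 2019)

# 10-fold cross-validation: randomly assign training observations to 10 folds
folds <- vfold_cv(train_data, v = 10)
```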
First, I used tidyverse techniques to develop linear models and conduct feature selection. Feature selection is advantageous because it 1) keeps the model simple (KISS: Keep It Simple, Stupid), 2) mitigates multicollinearity issues, and 3) reduces overfitting. Feature selection can be accomplished via a few different methods, mainly best subset selection, forward stepwise selection, and backward stepwise selection.
Best subset selection is not ideal here because of the large number of predictors in my dataset. As the search space grows, the chance of finding models that fit the training data well also increases, yet these models might not have any predictive power on future data. This method tends to lead to overfitting and high variance in the coefficient estimates. Instead, I used stepwise selection methods, which search a much more restricted set of models and are therefore a more attractive alternative to best subset selection.
First, I ran a forward stepwise selection on all non-differential regular season statistics.
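A sketch of how the forward fit and its test MSE could be computed with `leaps::regsubsets()`; since `regsubsets` objects have no built-in `predict()` method, the helper below is a standard workaround rather than my exact code, and I assume `train_data` and `test_data` contain only `win` and the non-differential predictors.

```r
library(leaps)

# Forward stepwise selection over all available predictors
fwd_fit <- regsubsets(win ~ ., data = train_data,
                      nvmax = ncol(train_data) - 1, method = "forward")

# Helper: predictions from a regsubsets fit using the model with `id` predictors
predict_regsubsets <- function(object, newdata, id) {
  form  <- as.formula(object$call[[2]])
  mat   <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  drop(mat[, names(coefi)] %*% coefi)
}

# Test MSE for the 18-predictor model on the held-out 2019 season
fwd_mse <- mean((test_data$win - predict_regsubsets(fwd_fit, test_data, 18))^2)
```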
Looking at the test MSE, even though the forward selection method suggested a model with 23 predictors, I chose the model with 18 predictors because the difference in test MSE is marginal and I wanted to keep the model as simple as possible.
Next, I ran a backward stepwise selection on all non-differential regular season statistics.
Looking at the test MSE, even though the backward selection method suggested a model with 31 predictors, I chose the model with 16 predictors because, again, the difference in test MSE is marginal and I wanted to keep the model as simple as possible.
I then implemented each selection method on the entire training dataset, extracted the best model that uses the indicated number of predictors, and inspected the stepwise model coefficients.
| name | fwd | back |
|---|---|---|
| (Intercept) | 0.5165 | 0.5000 |
| team_1_seed | -0.0257 | -0.0267 |
| team_2_seed | 0.0258 | 0.0267 |
| team_1_win_pct | -0.2964 | NA |
| team_1_pace | -0.0036 | NA |
| team_1_ts_pct | -0.1479 | 2.1220 |
| team_1_or_pct | 0.3482 | 3.1996 |
| team_1_ast_pct | -0.5369 | -0.5742 |
| team_1_stl_pct | 1.2351 | NA |
| team_1_ftr | -0.4499 | -0.8900 |
| team_1_ORtg | 0.0129 | NA |
| team_1_DRtg | -0.0096 | -0.0100 |
| team_2_win_pct | 0.2881 | NA |
| team_2_ts_pct | 0.5948 | -2.1220 |
| team_2_ast_pct | 0.4689 | 0.5742 |
| team_2_stl_pct | -1.1469 | NA |
| team_2_ftr | 0.3147 | 0.8900 |
| team_2_ORtg | -0.0146 | NA |
| team_2_DRtg | 0.0107 | 0.0100 |
| team_1_dr_pct | NA | 2.2224 |
| team_1_trb_pct | NA | -5.0481 |
| team_2_or_pct | NA | -3.1996 |
| team_2_dr_pct | NA | -2.2224 |
| team_2_trb_pct | NA | 5.0481 |
Finally, I compared the test errors.
| method | test_mse |
|---|---|
| fwd_selection | 0.1698318 |
| back_selection | 0.1698318 |
Next, I ran a forward stepwise selection on all differential regular season statistics.
Looking at the test MSE, the forward selection method suggested a model with 10 predictors.
I then ran a backward stepwise selection on all differential regular season statistics.
Looking at the test MSE, the backward selection method suggested a model with 13 predictors.
I implemented each selection method on the entire training dataset, extracted the best model that uses the indicated number of predictors, and inspected the stepwise model coefficients.
| name | fwd | back |
|---|---|---|
| (Intercept) | 0.5000 | 0.5000 |
| diff_seed | -0.0257 | -0.0257 |
| diff_win_pct | -0.2832 | -0.2814 |
| diff_pace | -0.0042 | -0.0045 |
| diff_ts_pct | -0.0180 | 2.5415 |
| diff_or_pct | 0.4312 | 2.9744 |
| diff_ast_pct | -0.5407 | -0.5369 |
| diff_stl_pct | 1.3219 | 0.8073 |
| diff_ftr | -0.4637 | -0.7390 |
| diff_ORtg | 0.0120 | NA |
| diff_DRtg | -0.0095 | -0.0104 |
| diff_dr_pct | NA | 1.9536 |
| diff_trb_pct | NA | -3.9396 |
| diff_to_pct | NA | -1.2475 |
| diff_blk_pct | NA | 0.6103 |
Finally, I compared the test errors.
| method | test_mse |
|---|---|
| fwd_selection_diff | 0.1698318 |
| back_selection_diff | 0.1698318 |
The test errors from the non-differential and differential statistics were essentially identical. When picking the best model, I therefore selected the model chosen by the forward selection method on the differential statistics dataset, as it involves the fewest predictors. Because the differential statistics are derived from the non-differential statistics, they appear to bake in all of the necessary predictive information while reducing the overall number of features. In the following models, I used the differential statistics dataset exclusively.
Next, I fit logistic regression models. Overall, I ran five different logistic regressions using 10-fold cross-validation. I picked the predictor combinations based on the features chosen by the forward stepwise selection on the differential statistics above. The formulas are as follows:
The test errors from each model are shown below.
| method | test_mse |
|---|---|
| log_mod_3 | 0.2883408 |
| log_mod_5 | 0.2923767 |
| log_mod_1 | 0.3000000 |
| log_mod_2 | 0.3390135 |
| log_mod_4 | 0.3533632 |
Looking at the test errors, model 3 seemed to be the best logistic model. This is somewhat surprising, given that this model reduces the feature set to just three predictors: the Offensive Rating differential, the Defensive Rating differential, and the seed differential. The model also does not include all of the predictors chosen by the forward stepwise selection method.
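A sketch of how model 3 could be fit and cross-validated with `boot::cv.glm()`; `diff_train` and the misclassification cost function are illustrative, and my actual cross-validation code may differ.

```r
library(boot)

# Logistic model 3: rating differentials plus the seed differential
log_mod_3 <- glm(win ~ diff_ORtg + diff_DRtg + diff_seed,
                 data = diff_train, family = binomial)

# 10-fold CV with a 0/1 misclassification cost (threshold at 0.5)
misclass <- function(obs, pred_prob) mean(abs(obs - pred_prob) > 0.5)
cv_error <- cv.glm(diff_train, log_mod_3, cost = misclass, K = 10)$delta[1]
```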
Here, I conducted a linear discriminant analysis. I ran the same five different regression predictor combinations from the logistic regression above, but used a validation set approach instead. The test errors from each model are shown below.
| method | test_mse |
|---|---|
| lda_mod_5 | 0.2537313 |
| lda_mod_1 | 0.3134328 |
| lda_mod_3 | 0.3283582 |
| lda_mod_2 | 0.3432836 |
| lda_mod_4 | 0.3582090 |
Looking at the test errors, model 5 seemed to be the best LDA model. Aside from model 5, the test errors for the LDA models were a bit higher than those of the logistic models. This difference may stem from LDA's sensitivity to outliers and/or from some predictors being non-normal.
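A sketch of one LDA fit under the validation set split, using `MASS::lda()`; the formula shown is illustrative rather than the exact model 5 specification, and `qda()` (used in the next section) is a drop-in replacement for `lda()` here.

```r
library(MASS)

# LDA trained on 2003-2018, evaluated on the held-out 2019 season
lda_fit  <- lda(win ~ diff_ORtg + diff_DRtg + diff_seed, data = diff_train)
lda_pred <- predict(lda_fit, newdata = diff_test)$class

# Validation set misclassification rate
lda_error <- mean(lda_pred != diff_test$win)
```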
I then conducted a quadratic discriminant analysis. Again, I ran the same five regression predictor combinations from the logistic regression and used a validation set approach. The test errors from each model are shown below.
| method | test_mse |
|---|---|
| qda_mod_5 | 0.2537313 |
| qda_mod_3 | 0.3283582 |
| qda_mod_1 | 0.3358209 |
| qda_mod_2 | 0.3432836 |
| qda_mod_4 | 0.3582090 |
Looking at the test errors, model 5 again seemed to be the best model. Interestingly, the QDA models had about the same level of test error as the LDA models.
In addition, I ran K-nearest neighbors (KNN) classification. Again, I used the same five predictor combinations from the logistic regression and a validation set approach. The test errors from each model are shown below. Note that the model naming convention is knn_mod_[predictor count]_[k value].
| method | test_mse |
|---|---|
| knn_mod_3_5 | 0.2537313 |
| knn_mod_3_10 | 0.2761194 |
| knn_mod_3_1 | 0.2835821 |
| knn_mod_10_5 | 0.2835821 |
| knn_mod_10_10 | 0.2835821 |
| knn_mod_3_15 | 0.2835821 |
| knn_mod_10_15 | 0.2985075 |
| knn_mod_10_1 | 0.3134328 |
| knn_mod_1_1 | 0.3283582 |
| knn_mod_2_1 | 0.3283582 |
| knn_mod_1_10 | 0.3283582 |
| knn_mod_1_15 | 0.3358209 |
| knn_mod_1_5 | 0.3432836 |
| knn_mod_2_5 | 0.3432836 |
| knn_mod_4_10 | 0.3507463 |
| knn_mod_2_15 | 0.3582090 |
| knn_mod_4_15 | 0.3582090 |
| knn_mod_2_10 | 0.3656716 |
| knn_mod_4_5 | 0.3731343 |
| knn_mod_4_1 | 0.4029851 |
Looking at the test errors, model_3_5 seemed to be the best KNN model. This model used three predictors and a k-value of 5.
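A sketch of the three-predictor, k = 5 fit with `class::knn()`; the predictor set is illustrative, and I scale the predictors here since KNN is distance-based (my original code may have handled this differently).

```r
library(class)

vars <- c("diff_ORtg", "diff_DRtg", "diff_seed")  # illustrative predictor set

# Standardize using the training means/SDs, then apply the same to the test set
train_x <- scale(diff_train[, vars])
test_x  <- scale(diff_test[, vars],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

knn_pred  <- knn(train = train_x, test = test_x, cl = diff_train$win, k = 5)
knn_error <- mean(knn_pred != diff_test$win)
```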
I then ran ridge regression and lasso models on the dataset. Below are plots displaying the 200 estimated test MSE values from 10-fold cross-validation for the ridge (left) and lasso (right) models.
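A sketch of the cross-validated fits with `glmnet::cv.glmnet()`, assuming the differential training set is `diff_train` with response `win`:

```r
library(glmnet)

# Predictor matrix (drop the intercept column) and 0/1 response
x <- model.matrix(win ~ ., data = diff_train)[, -1]
y <- diff_train$win

set.seed(301)
ridge_cv <- cv.glmnet(x, y, alpha = 0, nfolds = 10)  # alpha = 0: ridge
lasso_cv <- cv.glmnet(x, y, alpha = 1, nfolds = 10)  # alpha = 1: lasso

# Candidate lambdas: the CV minimum and the one-standard-error choice
c(ridge_cv$lambda.min, ridge_cv$lambda.1se)
coef(lasso_cv, s = "lambda.1se")
```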
When looking for the optimal \(\lambda\), I considered the \(\lambda\) that minimized the cross-validated test error. I also considered the largest \(\lambda\) within one standard error of that minimum to safeguard against overfitting. The coefficients for the candidate models produced by ridge regression and the lasso are as follows:
| name | ridge_min | ridge_1se | lasso_min | lasso_1se |
|---|---|---|---|---|
| (Intercept) | 0.500 | 0.500 | 0.500 | 0.500 |
| diff_seed | -0.025 | -0.014 | -0.025 | -0.025 |
| diff_win_pct | -0.226 | 0.123 | -0.177 | NA |
| diff_pace | -0.004 | -0.002 | -0.004 | NA |
| diff_efg_pct | -0.230 | 0.295 | NA | NA |
| diff_ts_pct | 1.140 | 0.172 | NA | NA |
| diff_r3P | 0.074 | 0.022 | 0.022 | NA |
| diff_or_pct | 1.123 | 0.515 | 0.412 | 0.066 |
| diff_dr_pct | 0.469 | -0.071 | 0.043 | NA |
| diff_trb_pct | -0.700 | 0.501 | NA | NA |
| diff_ast_pct | -0.499 | -0.216 | -0.477 | NA |
| diff_to_pct | -0.786 | -1.026 | -0.110 | NA |
| diff_stl_pct | 1.251 | 0.895 | 1.222 | 0.406 |
| diff_blk_pct | 0.444 | 0.707 | 0.245 | NA |
| diff_ftr | -0.584 | -0.346 | -0.440 | -0.109 |
| diff_ORtg | 0.006 | 0.005 | 0.010 | 0.004 |
| diff_DRtg | -0.009 | -0.006 | -0.008 | -0.004 |
The test errors for the ridge regression and lasso models are shown below:
| method | test_mse |
|---|---|
| ridge_min | 0.1681242 |
| ridge_1se | 0.1748939 |
| lasso_min | 0.1681576 |
| lasso_1se | 0.1734699 |
Finally, I ran PCR and PLS regressions on the dataset. Below are plots of mean squared error by number of components for the PCR (left) and PLS (right) regressions, which I used to identify the optimal number of components (marked by the vertical line). The optimal numbers of components are 12 for PCR and 4 for PLS.
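A sketch of the two fits with the `pls` package; `diff_train` and `diff_test` are the hypothetical training and test frames, and the component counts follow the choices above.

```r
library(pls)

set.seed(301)
# PCR and PLS with standardized predictors and 10-fold cross-validation
pcr_fit <- pcr(win ~ ., data = diff_train, scale = TRUE, validation = "CV")
pls_fit <- plsr(win ~ ., data = diff_train, scale = TRUE, validation = "CV")

# MSE-by-component plots used to pick the number of components
validationplot(pcr_fit, val.type = "MSEP")
validationplot(pls_fit, val.type = "MSEP")

# Test MSE on the held-out season with the chosen component counts
pcr_mse <- mean((diff_test$win - predict(pcr_fit, newdata = diff_test, ncomp = 12))^2)
pls_mse <- mean((diff_test$win - predict(pls_fit, newdata = diff_test, ncomp = 4))^2)
```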
The test errors for the PCR and PLS regressions are shown below:
| method | test_mse |
|---|---|
| pcr_12m | 0.1674023 |
| pls_4m | 0.1705316 |
After running these models on the training and test datasets, I compared the test errors across all models. The comparison is shown in the table below.
Out of all the models I ran, the PCR model with 12 principal components was the best model with a test error of 0.1674. The ridge and lasso models came in at a close second and third, respectively. On the other end, the KNN models I ran, in general, performed the worst (i.e. had the highest test errors).
To improve overall model performance, I identified a couple of areas of future improvement. Foremost, I would like to include more types of predictors. In this project, I only accounted for regular season team statistics, not statistics from the tournament itself. This exclusion may introduce strong omitted variable bias because, compared to regular season games, tournament games are usually more intense, and the intensity increases as the tournament progresses. This intensity may lead to an increase or decrease in player and team performance.
I would also like to transform each regular season statistic (square, square root, log, pairwise combinations, etc.), which would roughly quadruple the number of predictors I could assess. Some of these transformed variables could prove significant and improve prediction accuracy. Moreover, I want to include predictors beyond in-game statistics, such as the conference each team plays in, the proximity of the game locations to the schools playing, or rankings from popular outside websites.
Moving forward, I would also like to explore more non-linear techniques. For example, I could use a multilayer perceptron neural network; its high degree of nonlinearity might capture some of the upsets that make March Madness famous.
Finally, I hope to further expand my scope and apply these methodologies to bracket-style tournaments in other sports. This could entail predicting the outcomes of playoff games leading up to the NBA Finals, the NFL Super Bowl, or the MLB World Series.
First, I looked at a correlogram. The plot gives a sense of the severity of the multicollinearity issues between different predictors:
Looking at the plots above, I identified a couple of highly correlated predictor pairings. For example, the plots indicate that a team's seeding is directly correlated with its regular season win percentage. Similarly, a team's Effective Field Goal Percentage is directly correlated with its True Shooting Percentage. Finally, a team's Offensive Rating is directly correlated with both its Effective Field Goal Percentage and its True Shooting Percentage. This correlation is intuitive because all three statistics ultimately measure the same thing: making baskets.
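A sketch of how such a correlogram could be produced with the `corrplot` package; the selection of numeric predictor columns here is illustrative.

```r
library(dplyr)
library(corrplot)

# Correlation matrix of the numeric regular season predictors
pred_cors <- train_data %>%
  select(where(is.numeric), -win) %>%
  cor()

corrplot(pred_cors, method = "color", type = "upper",
         order = "hclust", tl.cex = 0.6)
```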
Next, I looked at boxplots of different statistics for teams that won or lost games in the March Madness Tournament. This gave me an initial sense of which statistics are important for predicting whether a team wins or loses. My initial hypothesis was that three factors best predict the outcome of a basketball game: 1) shooting (Effective Field Goal Percentage), 2) turnovers (Turnover Percentage), and 3) rebounding (Total Rebounding Percentage). Intuitively, a team that wins a basketball game probably shoots better, turns the ball over less, and rebounds better. I also analyzed whether these differences, if any exist, are statistically significant.
Looking at the plots above, winning teams have a higher Effective Field Goal Percentage, a lower Turnover Percentage, and a higher Total Rebound Percentage. Moreover, after conducting a t-test, I found that the difference between winning and losing teams is statistically significant at the 1% level among these three predictors. Overall, this confirms my initial hypothesis.
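A sketch of these significance checks via two-sample t-tests; `matchup_data` and the column names are illustrative placeholders rather than my exact variables.

```r
# Compare Team 1's regular season stat between wins (win = 1) and losses (win = 0)
t.test(team_1_efg_pct ~ win, data = matchup_data)  # shooting
t.test(team_1_to_pct  ~ win, data = matchup_data)  # turnovers
t.test(team_1_trb_pct ~ win, data = matchup_data)  # rebounding
```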
Aside from the three predictors I explored above, I also analyzed some other predictors, most notably Pace (average possessions per game), Seeding, Offensive Rating (the number of points a team scores per 100 possessions), and Defensive Rating (the number of points a team allows per 100 opposing team possessions). My initial hypothesis was that winning teams most likely have higher pace or higher average possessions per game during the regular season. This higher pace may be indicative of a more efficient offense and/or better defense. Regarding Seeding, I expected higher seeded teams to win more often in the March Madness Tournament. Finally, for Offensive and Defensive Rating, I expected March Madness teams that win to have higher Offensive Ratings and lower Defensive Ratings.
Looking at the pace boxplots, and contrary to my initial hypothesis, there is no statistically significant difference in pace between winning and losing teams. Overall, what a team does with each possession is far more important than the total number of possessions it has. In other words, a team that wins games is usually able to score on each possession rather than turning the ball over or missing the shot.
Looking at the seeding boxplots, I found a fairly intuitive result: higher seeded teams in the March Madness Tournament tend to win more often. The difference in game outcome by seeding is statistically significant at the 1% level. With that said, it is important to note that, as shown in the boxplots, even the highest seeded teams lose games and the lowest seeded teams can win games. This is the beauty and excitement of March Madness: the highest seeded team doesn't always win, and upsets happen! This confirms my initial hypothesis.
Looking at both the Offensive Rating and Defensive Rating boxplots, teams that win have higher Offensive Ratings and lower Defensive Ratings. These differences are statistically significant at the 1% level. In other words, compared to losing teams, winning teams typically score more points per 100 possessions and allow fewer points per 100 opposing team possessions. Overall, on average, winning teams are better both offensively and defensively. This confirms my initial hypotheses.