Introduction

I’ve always been an avid professional and college basketball fan, and I’m also interested in the application of data analytics in sports. With the March Madness Tournament coming up this year, I wanted to apply the concepts we learned in STAT 301-2 Data Science II to try to predict the tournament’s results. Predicting the tournament comes down to predicting the outcomes of college basketball games given historical performance statistics. Because the outcome of a basketball game is binary, i.e. win or lose, this is a classification problem. Overall, this final report includes a practical application and assessment of the model methodologies I learned in class. I have also included a brief Exploratory Data Analysis (EDA) in the Appendix.

Data

For training and testing my models, I used datasets from a 2020 Google Cloud & NCAA ML Competition on Kaggle.com. These datasets include extensive historical data on Division I college basketball teams from 2003-2019 (366 teams and over 80,000 basketball games). Taken together, these datasets cover box score performance metrics for each team, past tournaments, regular season results, etc. Given the nature of these datasets, there is no missingness. The performance metrics are as follows:

  • FGM - field goals made
  • FGA - field goals attempted
  • FGM3 - three pointers made
  • FGA3 - three pointers attempted
  • FTM - free throws made
  • FTA - free throws attempted
  • OR - offensive rebounds
  • DR - defensive rebounds
  • Ast - assists
  • TO - turnovers committed
  • Stl - steals
  • Blk - blocks
  • PF - personal fouls committed

The formal citation is as follows:

Kaggle.com. “Google Cloud & NCAA® ML Competition 2020 - Men’s Dataset.” Available online at: https://www.kaggle.com/c/google-cloud-ncaa-march-madness-2020-division-1-mens-tournament/data

Data Cleaning

First, I started with the dataset containing detailed regular season game results. Using this dataset, I calculated end-of-regular-season average statistics for each team in each season. These statistics are all continuous variables. Below is a codebook defining and describing the statistics I chose, followed by a short sketch of how a few of them can be computed:

  • seed - Tournament Seed
  • win_pct - Win Percentage
  • pos - Possessions
  • opp_pos - Opponent Possessions
  • pace - Pace (Average Possessions Per Game)
  • efg_pct - Effective Field Goal Percentage
  • ts_percent - True Shooting Percent
  • r3P - 3-point Attempt Rate
  • or_pct - Offensive Rebounding Percentage
  • dr_pct - Defensive Rebounding Percentage
  • trb_pct - Total Rebound Percentage
  • ast_pct - Assist Percentage
  • stl_pct - Steal Percentage
  • to_pct - Turnover Percentage
  • blk_pct - Block Percentage
  • ftr - Free Throw Rate
  • ORtg - Offensive Rating
  • DRtg - Defensive Rating
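
Below is a minimal sketch of how a few of these season-average statistics could be computed. The report does not list the exact formulas it used, so these are common basketball analytics approximations; the data frame `regular_season` and its column names are illustrative, and one row per team per game is assumed.

```r
# Sketch only: common approximations, not necessarily the report's exact formulas.
library(dplyr)

season_stats <- regular_season %>%
  group_by(Season, TeamID) %>%
  summarise(across(c(FGM, FGA, FGM3, FGA3, FTM, FTA, OR, TO), mean),
            .groups = "drop") %>%
  mutate(
    pts     = 2 * FGM + FGM3 + FTM,           # 2 per field goal, +1 per three, +1 per free throw
    pos     = FGA - OR + TO + 0.44 * FTA,     # standard possession estimate
    efg_pct = (FGM + 0.5 * FGM3) / FGA,       # effective field goal percentage
    ts_pct  = pts / (2 * (FGA + 0.44 * FTA)), # true shooting percentage
    r3P     = FGA3 / FGA,                     # three-point attempt rate
    ftr     = FTA / FGA,                      # free throw rate
    ORtg    = 100 * pts / pos                 # points scored per 100 possessions
  )
```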

Then, I took the historical tournament results and, for each tournament matchup, appended the first team’s (Team 1) regular season statistics, its opponent’s (Team 2) regular season statistics, and the difference between each pair of statistics. In addition, I created the reverse matchup for each game, resulting in the data frame outlined below:

Win | Team_1 | Team_2 | Team_1_Regular_Season_Stats | Team_2_Regular_Season_Stats | Stats_Differential
1 | Team A | Team B | Team A regular season stats... | Team B regular season stats... | Team A - Team B stats...
0 | Team B | Team A | Team B regular season stats... | Team A regular season stats... | Team B - Team A stats...

Overall, my final cleaned dataset had 48 predictors and 2,230 observations. It is important to note that multicollinearity exists: some of my predictor variables are correlated with one another. With this in mind, I took a deeper dive into resolving this issue in the Model Fitting section of this final report.

Data Splitting

For my project, I first split my cleaned data into one set with regular season statistics and another set with only the statistic differentials. Given that the statistic differentials are derived from regular season statistics, this dataset separation partially resolved the multicollinearity issues between these two sets of predictors.

In order to split each dataset into training and test sets, I used two different resampling methods depending on the model I ran. Ideally, I would train and test every model using the same resampling method; however, I wanted to apply a variety of the resampling methods we learned in class. The first method was a validation set approach, where I set aside the 2019 season as the test set and used the 2003-2018 seasons as the training set. The second method was 10-fold cross-validation, which involves randomly dividing the observations into 10 groups, or folds, of approximately equal size. Each fold in turn serves as a validation set while the model is fit on the remaining 9 folds, and the 10 resulting error estimates are averaged.
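
Below is a minimal sketch of both splitting strategies. It assumes a cleaned data frame named `tourney_data` with a `season` column and a binary `win` outcome; these names are illustrative.

```r
library(dplyr)
library(rsample)

# 1) Validation set approach: hold out the 2019 tournament as the test set
train_data <- filter(tourney_data, season <= 2018)
test_data  <- filter(tourney_data, season == 2019)

# 2) 10-fold cross-validation on the training data
set.seed(301)
folds <- vfold_cv(train_data, v = 10)
```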


Model Fitting

Subset Selection Methods

First, I used tidyverse techniques and methods to develop linear models and conduct feature selection. Feature selection is advantageous because it 1) keeps the model simple (KISS: Keep It Simple, Stupid), 2) mitigates multicollinearity issues, and 3) reduces overfitting. Feature selection can be accomplished via a few different methods, chiefly Best Subset Selection, Forward Stepwise Selection, and Backward Stepwise Selection.

Best Subset Selection

Best subset selection is not practical here because of the large number of predictors in my dataset. As the search space grows, the chance of finding models that fit the training data well also increases; however, these models might not have any predictive power on future data! This method therefore tends to lead to overfitting and high variance in the coefficient estimates. Instead, I used stepwise selection methods, as they search a much more restricted set of models and are thus a more attractive alternative to best subset selection.

Forward Stepwise Selection & Backward Stepwise Selection

Non-differential Regular Season Statistics

First, I ran a forward stepwise selection on all non-differential regular season statistics.

Looking at the test MSE, even though the forward selection method suggested a model with 23 predictors, I chose the model with 18 predictors because the difference in test MSE is marginal and I wanted to keep the model as simple as possible.

Next, I ran a backward stepwise selection on all non-differential regular season statistics.

Looking at the test MSE, even though the backward selection method suggested a model with 31 predictors, I chose the model with 16 predictors because, again, the difference in test MSE is marginal and I wanted to keep the model as simple as possible.

I then implemented each selection method on the entire training dataset, extracted the best model that uses the indicated number of predictors, and inspected the stepwise model coefficients.
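
A minimal sketch of these stepwise fits with the leaps package is shown below. The formula and the `nvmax` bound are illustrative; the 18- and 16-predictor model sizes are the ones chosen from the validation results above.

```r
library(leaps)

fwd_fit  <- regsubsets(win ~ . - season - team_1 - team_2, data = train_data,
                       nvmax = 40, method = "forward")
back_fit <- regsubsets(win ~ . - season - team_1 - team_2, data = train_data,
                       nvmax = 40, method = "backward")

# Coefficients of the chosen models (18 predictors forward, 16 backward)
coef(fwd_fit,  id = 18)
coef(back_fit, id = 16)
```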

name fwd back
(Intercept) 0.5165 0.5000
team_1_seed -0.0257 -0.0267
team_2_seed 0.0258 0.0267
team_1_win_pct -0.2964 NA
team_1_pace -0.0036 NA
team_1_ts_pct -0.1479 2.1220
team_1_or_pct 0.3482 3.1996
team_1_ast_pct -0.5369 -0.5742
team_1_stl_pct 1.2351 NA
team_1_ftr -0.4499 -0.8900
team_1_ORtg 0.0129 NA
team_1_DRtg -0.0096 -0.0100
team_2_win_pct 0.2881 NA
team_2_ts_pct 0.5948 -2.1220
team_2_ast_pct 0.4689 0.5742
team_2_stl_pct -1.1469 NA
team_2_ftr 0.3147 0.8900
team_2_ORtg -0.0146 NA
team_2_DRtg 0.0107 0.0100
team_1_dr_pct NA 2.2224
team_1_trb_pct NA -5.0481
team_2_or_pct NA -3.1996
team_2_dr_pct NA -2.2224
team_2_trb_pct NA 5.0481

Finally, I compared the test errors.

method test_mse
fwd_selection 0.1698318
back_selection 0.1698318

Differential Regular Season Statistics

Next, I ran a forward stepwise selection on all differential regular season statistics.

Looking at the test MSE, it seemed like the forward selection method suggested a model with 10 predictors.

I then ran a backward stepwise selection on all differential regular season statistics.

Looking at the test MSE, it seemed like the backward selection method suggested a model with 13 predictors.

I implemented each selection method on the entire training dataset, extracted the best model that uses the indicated number of predictors, and inspected the stepwise model coefficients.

name fwd back
(Intercept) 0.5000 0.5000
diff_seed -0.0257 -0.0257
diff_win_pct -0.2832 -0.2814
diff_pace -0.0042 -0.0045
diff_ts_pct -0.0180 2.5415
diff_or_pct 0.4312 2.9744
diff_ast_pct -0.5407 -0.5369
diff_stl_pct 1.3219 0.8073
diff_ftr -0.4637 -0.7390
diff_ORtg 0.0120 NA
diff_DRtg -0.0095 -0.0104
diff_dr_pct NA 1.9536
diff_trb_pct NA -3.9396
diff_to_pct NA -1.2475
diff_blk_pct NA 0.6103

Finally, I compared the test errors.

method test_mse
fwd_selection_diff 0.1698318
back_selection_diff 0.1698318

The test errors from the non-differential statistics and the differential statistics are essentially the same. When picking the best model, I would therefore select the model chosen by the forward selection method on the differential statistics dataset, as it involves the fewest predictors. Because the differential statistics were derived from the non-differential statistics, they appear to capture all of the necessary predictive information while reducing the overall number of features. In the following models, I used the differential statistics dataset exclusively.


Logistic Regression

Next, I fit logistic regression models. Overall, I ran five different logistic regressions using 10-fold cross-validation. I picked the predictor combinations based on the features chosen by the forward stepwise selection on the differential statistics above. The formulae are as follows, with a sketch of the cross-validated fitting procedure after the list:

  1. win ~ diff_seed
  2. win ~ diff_ORtg + diff_DRtg
  3. win ~ diff_ORtg + diff_DRtg + diff_seed
  4. win ~ diff_efg_pct + diff_to_pct + diff_or_pct + diff_ftr
  5. win ~ diff_seed + diff_win_pct + diff_pace + diff_ts_pct + diff_or_pct + diff_ast_pct + diff_stl_pct + diff_ftr + diff_ORtg + diff_DRtg
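
The sketch below shows how one of these cross-validated fits (model 3) could be run, reusing the 10 folds created with `rsample::vfold_cv()` earlier; it assumes `win` is coded 0/1.

```r
library(purrr)
library(rsample)

cv_errors <- map_dbl(folds$splits, function(split) {
  fit <- glm(win ~ diff_ORtg + diff_DRtg + diff_seed,
             data = analysis(split), family = binomial)
  holdout <- assessment(split)
  pred    <- predict(fit, newdata = holdout, type = "response") > 0.5
  mean(pred != holdout$win)   # misclassification rate on the held-out fold
})
mean(cv_errors)               # average test error across the 10 folds
```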

The test errors from each model are shown below.

method test_mse
log_mod_3 0.2883408
log_mod_5 0.2923767
log_mod_1 0.3000000
log_mod_2 0.3390135
log_mod_4 0.3533632

Looking at the test errors, model 3 seemed to be the best logistic model. This is somewhat surprising given that the model uses just three features: Offensive Rating Differential, Defensive Rating Differential, and Seed Difference. It also did not include all of the predictors chosen by the forward stepwise selection method.


Linear Discriminant Analysis

Here, I conducted a linear discriminant analysis (LDA). I ran the same five predictor combinations from the logistic regression above, but used a validation set approach instead. A sketch of the model 5 fit and the test errors from each model are shown below.
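
Below is a minimal sketch of the model 5 LDA fit under the validation set split described earlier; the QDA models in the next section follow the same pattern with `MASS::qda()`.

```r
library(MASS)

lda_fit  <- lda(win ~ diff_seed + diff_win_pct + diff_pace + diff_ts_pct +
                  diff_or_pct + diff_ast_pct + diff_stl_pct + diff_ftr +
                  diff_ORtg + diff_DRtg,
                data = train_data)
lda_pred <- predict(lda_fit, newdata = test_data)$class
mean(lda_pred != test_data$win)   # validation set misclassification rate
```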

method test_mse
lda_mod_5 0.2537313
lda_mod_1 0.3134328
lda_mod_3 0.3283582
lda_mod_2 0.3432836
lda_mod_4 0.3582090

Looking at the test errors, model 5 seemed to be the best LDA model. Aside from model 5, the test errors for the LDA models seemed to be a bit higher than those of the logistic models. This difference may stem from the fact that LDA is quite sensitive to outliers and assumes normally distributed predictors, an assumption some of my predictors may violate.


Quadratic Discriminant Analysis

I then conducted a quadratic discriminant analysis. Again, I ran the same five regression predictor combinations from the logistic regression and used a validation set approach. The test errors from each model are shown below.

method test_mse
qda_mod_5 0.2537313
qda_mod_3 0.3283582
qda_mod_1 0.3358209
qda_mod_2 0.3432836
qda_mod_4 0.3582090

Looking at the test errors, model 5 seemed to be again the best model. Interestingly, the QDA models I ran had about the same level of test errors as the LDA models.


K-Nearest Neighbors

In addition, I fit K-nearest neighbors (KNN) models. Again, I ran the same five predictor combinations from the logistic regression and used a validation set approach; a sketch of one fit and the test errors from each model are shown below. Note that the model naming syntax is knn_mod_<predictor count>_<k value>.
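
Below is a minimal sketch of the best-performing KNN fit (the three model 3 predictors with k = 5). `class::knn()` works on numeric matrices, so the predictors are standardized using the training set's centering and scaling; the column names are illustrative.

```r
library(class)

vars    <- c("diff_ORtg", "diff_DRtg", "diff_seed")
train_x <- scale(train_data[, vars])
test_x  <- scale(test_data[, vars],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

knn_pred <- knn(train = train_x, test = test_x, cl = train_data$win, k = 5)
mean(knn_pred != test_data$win)   # validation set misclassification rate
```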

method test_mse
knn_mod_3_5 0.2537313
knn_mod_3_10 0.2761194
knn_mod_3_1 0.2835821
knn_mod_10_5 0.2835821
knn_mod_10_10 0.2835821
knn_mod_3_15 0.2835821
knn_mod_10_15 0.2985075
knn_mod_10_1 0.3134328
knn_mod_1_1 0.3283582
knn_mod_2_1 0.3283582
knn_mod_1_10 0.3283582
knn_mod_1_15 0.3358209
knn_mod_1_5 0.3432836
knn_mod_2_5 0.3432836
knn_mod_4_10 0.3507463
knn_mod_2_15 0.3582090
knn_mod_4_15 0.3582090
knn_mod_2_10 0.3656716
knn_mod_4_5 0.3731343
knn_mod_4_1 0.4029851

Looking at the test errors, model_3_5 seemed to be the best KNN model. This model used three predictors and a k-value of 5.


Ridge Regression & the Lasso

I then ran ridge regression and lasso models on the dataset. Below are plots displaying the 200 estimated test MSE values from 10-fold cross-validation for both the ridge (left) and lasso (right) models.

When looking for the optimal \(\lambda\), I looked for the \(\lambda\) that minimized the cross-validated test error. I also considered the largest \(\lambda\) within one standard error of that minimum to safeguard against overfitting. A sketch of the cross-validated fits is shown below, followed by the coefficients for the candidate models produced by ridge regression and the lasso:
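
A minimal sketch of the cross-validated glmnet fits is shown below; the predictor matrix construction reuses the illustrative column names from earlier.

```r
library(glmnet)

x <- model.matrix(win ~ . - season - team_1 - team_2, data = train_data)[, -1]
y <- train_data$win

ridge_cv <- cv.glmnet(x, y, alpha = 0, nfolds = 10)  # alpha = 0 gives ridge
lasso_cv <- cv.glmnet(x, y, alpha = 1, nfolds = 10)  # alpha = 1 gives the lasso

# Candidate models: lambda minimizing CV error vs. the one-standard-error rule
coef(ridge_cv, s = "lambda.min")
coef(lasso_cv, s = "lambda.1se")
```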

name ridge_min ridge_1se lasso_min lasso_1se
(Intercept) 0.500 0.500 0.500 0.500
diff_seed -0.025 -0.014 -0.025 -0.025
diff_win_pct -0.226 0.123 -0.177 NA
diff_pace -0.004 -0.002 -0.004 NA
diff_efg_pct -0.230 0.295 NA NA
diff_ts_pct 1.140 0.172 NA NA
diff_r3P 0.074 0.022 0.022 NA
diff_or_pct 1.123 0.515 0.412 0.066
diff_dr_pct 0.469 -0.071 0.043 NA
diff_trb_pct -0.700 0.501 NA NA
diff_ast_pct -0.499 -0.216 -0.477 NA
diff_to_pct -0.786 -1.026 -0.110 NA
diff_stl_pct 1.251 0.895 1.222 0.406
diff_blk_pct 0.444 0.707 0.245 NA
diff_ftr -0.584 -0.346 -0.440 -0.109
diff_ORtg 0.006 0.005 0.010 0.004
diff_DRtg -0.009 -0.006 -0.008 -0.004

The test errors for the ridge regression and lasso models are shown below:

method test_mse
ridge_min 0.1681242
ridge_1se 0.1748939
lasso_min 0.1681576
lasso_1se 0.1734699

PCR & PLS Regression

Finally, I ran PCR and PLS regressions on my dataset. Below are plots of mean squared error by number of components for the PCR (left) and PLS (right) regressions, which I used to identify the optimal number of components (shown by the vertical line). The optimal number of components is 12 for the PCR regression and 4 for the PLS regression.
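
A minimal sketch of the PCR and PLS fits with the pls package is shown below; the formula is illustrative, and the 12-component prediction mirrors the chosen PCR model.

```r
library(pls)

pcr_fit <- pcr(win ~ . - season - team_1 - team_2, data = train_data,
               scale = TRUE, validation = "CV")
pls_fit <- plsr(win ~ . - season - team_1 - team_2, data = train_data,
                scale = TRUE, validation = "CV")

validationplot(pcr_fit, val.type = "MSEP")   # inspect CV error by number of components

pcr_pred <- predict(pcr_fit, newdata = test_data, ncomp = 12)
mean((pcr_pred - test_data$win)^2)           # test MSE for the 12-component PCR model
```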

The test errors for the PCR and PLS regressions are shown below:

method test_mse
pcr_12m 0.1674023
pls_4m 0.1705316

Comparing All Model Test Errors

After running these models on the training and test data, I compared the test errors across all of the models above.

Out of all the models I ran, the PCR model with 12 principal components was the best model with a test error of 0.1674. The ridge and lasso models came in at a close second and third, respectively. On the other end, the KNN models I ran, in general, performed the worst (i.e. had the highest test errors).


Debrief and Next Steps

Areas of Improvement

In order to increase overall model performance, I identified a couple of areas for future improvement. Foremost, I would like to include more types of predictors. In this project, I only accounted for regular season team statistics, not statistics from the tournament itself. This exclusion may introduce omitted variable bias because, compared to regular season games, tournament games are usually more intense, and their intensity increases as the tournament progresses. This intensity may lead to an increase or decrease in player and team performance.

I would also like to transform each regular season statistic (square, square root, log, pairwise interactions, etc.). This would effectively quadruple the number of predictors I can assess, and some of these transformed variables could prove significant and improve prediction accuracy. Moreover, I want to include predictors beyond in-game statistics, such as the conference each team plays in, how close game locations are to the participating schools, and outside rankings from popular websites.

Future Work

Moving forward, I would like to expand to more non-linear techniques. For example, I could use a multilayer perceptron neural network. The high nonlinearity of a neural network could help capture some of the upsets that make March Madness famous.

Finally, I hope to further expand my scope and apply these methodologies to bracket-style tournaments in other sports. This could entail predicting the outcomes of playoff games leading up to the NBA Finals, the NFL Super Bowl, or the MLB World Series.


Appendix

Exploratory Data Analysis

Correlogram Analysis

First, I looked at a correlogram. The plot gives a sense of the severity of the multicollinearity issues between different predictors:
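
A minimal sketch of how such a correlogram could be produced with the corrplot package is shown below; the selection of numeric columns is illustrative.

```r
library(dplyr)
library(corrplot)

corr_matrix <- cor(select(train_data, where(is.numeric)))
corrplot(corr_matrix, method = "circle", type = "upper", tl.cex = 0.6)
```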

Looking at the plots above, I identified a couple of highly correlated predictor pairings. For example, the plots indicated that a team’s seeding is strongly correlated with its regular season win percentage. Similarly, a team’s Effective Field Goal Percentage is strongly correlated with its True Shooting Percentage. Finally, a team’s Offensive Rating is strongly correlated with both its Effective Field Goal Percentage and its True Shooting Percentage. These correlations are intuitive because all three statistics ultimately measure the same thing: how efficiently a team puts the ball in the basket.

Boxplot Analysis

Effective Field Goal Percentage, Turnover Percentage, Total Rebound Percentage

Next, I looked at boxplots of different statistics for teams that won or lost games in the March Madness Tournament. This gave me an initial understanding of which statistics are important in predicting whether a team wins or loses. My initial hypothesis was that three factors best predict the outcome of a basketball game: 1) shooting (Effective Field Goal Percentage), 2) turnovers (Turnover Percentage), and 3) rebounding (Total Rebound Percentage). Intuitively, a team that wins a basketball game probably shoots better, turns the ball over less, and rebounds better. I also analyzed whether these differences, if any exist, are statistically significant.
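
The sketch below shows one such check for Effective Field Goal Percentage; the data frame and column names are illustrative.

```r
library(ggplot2)

# Boxplot of Team 1's effective field goal percentage by game outcome
ggplot(train_data, aes(x = factor(win), y = team_1_efg_pct)) +
  geom_boxplot() +
  labs(x = "Tournament game outcome (1 = Team 1 win)", y = "Effective Field Goal %")

# Two-sample t-test of the difference between winning and losing teams
t.test(team_1_efg_pct ~ win, data = train_data)
```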

Looking at the plots above, winning teams have a higher Effective Field Goal Percentage, a lower Turnover Percentage, and a higher Total Rebound Percentage. Moreover, after conducting t-tests, I found that the differences between winning and losing teams are statistically significant at the 1% level for all three predictors. Overall, this confirms my initial hypothesis.

Pace, Seeding, Offensive Rating, Defensive Rating

Aside from the three predictors I explored above, I also analyzed some other predictors, most notably Pace (average possessions per game), Seeding, Offensive Rating (the number of points a team scores per 100 possessions), and Defensive Rating (the number of points a team allows per 100 opposing team possessions). My initial hypothesis was that winning teams most likely have higher pace or higher average possessions per game during the regular season. This higher pace may be indicative of a more efficient offense and/or better defense. Regarding Seeding, I expected higher seeded teams to win more often in the March Madness Tournament. Finally, for Offensive and Defensive Rating, I expected March Madness teams that win to have higher Offensive Ratings and lower Defensive Ratings.

Looking at the pace boxplots, contrary to my initial hypothesis, there is no statistically significant difference in pace between winning and losing teams. Overall, what a team does with each possession is far more important than the total number of times it has the basketball. In other words, a team that wins games is usually able to score on each possession rather than turning the ball over or missing the shot.

Looking at the seeding boxplots, I found a relatively intuitive result: higher seeded teams in the March Madness Tournament tend to win more often. The difference in game outcome by seeding is statistically significant at the 1% level. With this said, it is important to note that, as shown in the boxplots, even the highest seeded teams lose games and the lowest seeded teams can win games. This is the beauty and excitement of March Madness - the highest seeded team doesn’t always win and upsets happen! Overall, this confirms my initial hypothesis.

Looking at both the Offensive Rating and Defensive Rating boxplots, teams that win have higher Offensive Ratings and lower Defensive Ratings. These differences are statistically significant at the 1% level. In other words, compared to teams that lose, winning teams typically score more points per 100 possessions and allow fewer points per 100 opposing team possessions. Overall, on average, winning teams are better both offensively and defensively. This confirms my initial hypotheses.