Introduction

Before the computer was invented, statisticians relied on collecting data, computing regressions by hand, and performing inference using large books of distribution tables. Datasets were small and manageable, and assumptions were carefully checked to squeeze out as much information as possible. Fast forward to today: datasets with millions of entries are common, anyone can perform a linear regression at the press of a button, and laptop CPUs perform 100-500 billion floating-point operations per second (FLOPS)\(^1\). This rise in computational power has made ensemble learning methods and neural networks practical in real-world applications. In data science, these methods are applied to generate insights that aid decision-making, helping to determine things like which product to sell, which stock to buy, or even what color to make the 'BUY NOW' button. An increasingly prevalent pitfall, however, is skipping simple models and reaching straight for complex ones; a perfectly adequate simple method may be overlooked in favor of a flashy modern alternative. The cost can be a loss of inference (so-called "black-box" methods offer little interpretability) and unnecessary processing power. In this report, we compare traditional machine learning methods to more modern ones in the context of predicting NCAA men's basketball win percentage.

When analyzing sports data, organizations share a common goal: predicting a team's performance. The goal of any sports team is to win, so sports statisticians look for the predictors that most influence a team's odds of winning. One concrete formulation is to estimate the win percentage (the number of games won divided by the number of games played). We will compare multiple linear regression, logistic regression, gradient boosting (specifically XGBoost), and neural networks on this prediction task. For consistency, we will use the same 10-fold cross-validation splits for every model and compare the root mean squared error (RMSE). The goal of this analysis is a broad comparison of the methods, contrasting traditional statistical learning models with current computational ones.

About the Dataset

The dataset was obtained from Kaggle and consists of results and advanced metrics for men's Division 1 college basketball seasons from 2013 to 2024. A thorough description of the variables and data can be found here. Example metrics include the number of games played (G), the number of games won (W), adjusted offensive and defensive efficiency (ADJOE, ADJDE), and effective field goal percentage shot and allowed (EFG_O, EFG_D).

Exploratory Data Analysis (EDA)

Before building any models, we will get an idea of the data we have. First, we will look at how the statistics are distributed. Next, we will graph their influence on the response (win percentage). Finally, we will check how the predictors relate to one another (multicollinearity).

We begin with the histograms:

library(tidyverse)
data <- read.csv("archive/cbb.csv")
data <- data %>% dplyr::filter(G > W) # drop a few erroneous rows where wins are not less than games played
data$win_percentage <- data$W / data$G
head(data)
##             TEAM CONF  G  W ADJOE ADJDE BARTHAG EFG_O EFG_D  TOR TORD  ORB  DRB
## 1 North Carolina  ACC 40 33 123.3  94.9  0.9531  52.6  48.1 15.4 18.2 40.7 30.0
## 2      Wisconsin  B10 40 36 129.1  93.6  0.9758  54.8  47.7 12.4 15.8 32.1 23.7
## 3       Michigan  B10 40 33 114.4  90.4  0.9375  53.9  47.7 14.0 19.5 25.5 24.9
## 4     Texas Tech  B12 38 31 115.2  85.2  0.9696  53.5  43.0 17.7 22.8 27.4 28.7
## 5        Gonzaga  WCC 39 37 117.8  86.3  0.9728  56.6  41.1 16.2 17.1 30.0 26.2
## 6       Kentucky  SEC 40 29 117.2  96.2  0.9062  49.9  46.0 18.1 16.1 42.0 29.7
##    FTR FTRD X2P_O X2P_D X3P_O X3P_D ADJ_T  WAB POSTSEASON SEED YEAR
## 1 32.3 30.4  53.9  44.6  32.7  36.2  71.7  8.6        2ND    1 2016
## 2 36.2 22.4  54.8  44.7  36.5  37.5  59.3 11.3        2ND    1 2015
## 3 30.7 30.0  54.7  46.8  35.2  33.2  65.9  6.9        2ND    3 2018
## 4 32.9 36.6  52.8  41.9  36.5  29.7  67.5  7.0        2ND    3 2019
## 5 39.0 26.9  56.3  40.0  38.2  29.0  71.5  7.7        2ND    1 2017
## 6 51.8 36.8  50.0  44.9  33.2  32.2  65.9  3.9        2ND    8 2014
##   win_percentage
## 1      0.8250000
## 2      0.9000000
## 3      0.8250000
## 4      0.8157895
## 5      0.9487179
## 6      0.7250000
numeric_data <- data %>% dplyr::select(-TEAM,-POSTSEASON,-SEED)
num_rows <- nrow(numeric_data)
num_cols <- ncol(numeric_data)
cat(paste("Number of rows: ",num_rows,"\n","Number of columns: ",num_cols))
## Number of rows:  3880 
##  Number of columns:  22
# Histogram of the data
data_long <- numeric_data %>%
  pivot_longer(cols = -c(CONF,YEAR), # stack every metric column into (variable, value) pairs
               names_to = "variable",
               values_to = "value")

ggplot(data_long, aes(x = value)) +
  geom_histogram(bins = 15, fill = "skyblue", color = "black") +
  facet_wrap(~ variable, ncol = 4, scales = "free_x") # separate plot for each variable, each with its own x-axis

Most variables appear approximately normally distributed, with no major irregularities. We note that BARTHAG (Power Rating) appears roughly uniform; in fact, this metric is a logistic-style transformation of ADJOE and ADJDE, two variables that are both approximately normal. Additionally, the number of games played is left-skewed. NCAA Division 1 basketball teams play about 30 games in the regular season, and extra games are then played in the March Madness Tournament, a single-elimination bracket of 64 teams (68 counting the First Four). The lower game counts may correspond to smaller programs or outliers; in any case, more investigation could be done.

We continue our analysis with scatterplots, plotting the response (win percentage) against each predictor:

data_scatter <- numeric_data %>%
  pivot_longer(
    cols = -c(W,win_percentage,WAB,CONF,YEAR),          # all columns except the win-related and non-numeric ones
    names_to = "variable",
    values_to = "value"
  )

ggplot(data_scatter, aes(x = value, y = win_percentage)) +
  geom_point(alpha = 0.8,size=0.5,color="deepskyblue3") +
  facet_wrap(~ variable, ncol = 4, scales = "free_x") +
  theme_gray()

We notice a few things. First, ADJ_T (Adjusted Tempo) and FTR (Free Throw Rate) do not appear to contribute meaningfully to win percentage, which is surprising, as one might expect that playing at different tempos or drawing more free throws would affect the outcome of a game. One possible hypothesis is that a team with a higher free-throw rate is driving to the basket more, prioritizing play in the paint over perimeter offense; teams build their playstyle around the players they have, so coaches may simply prioritize different but equally viable types of offense. We also notice many linear trends, although many of these metrics are heavily correlated, so causal interpretation is less concrete. Many of the relationships are obvious when stated aloud: "A team with better offensive/defensive metrics wins more games," or "A team with more turnovers wins fewer games." One final thing to note is the relationship between games played and win percentage: the plot shows no clear pattern up to about 30 games, followed by a steady increase beyond that. This matches our earlier discussion of March Madness, as teams that play more games have advanced further in the tournament by winning more games.

Let us plot some more scatterplots to display the relationships between predictors. There are \({17\choose 2} = 136\) such plots, so we will only consider a few:

library(patchwork)
s1 <- ggplot(data = numeric_data, aes(y = EFG_O, x = EFG_D)) +
  geom_point(colour="cadetblue4") + ylim(20,80)


s2 <- ggplot(data = numeric_data, aes(y = ORB, x = DRB)) +
 geom_point(colour="cadetblue4") + ylim(0,60)

s3 <- ggplot(data = numeric_data, aes(y = ADJOE, x = ADJDE)) +
  geom_point(colour="cadetblue4") + ylim(60,150)

s4 <- ggplot(data=numeric_data, aes(y= X3P_O,x= X2P_O))+
             geom_point(colour="cadetblue4")

s5 <- ggplot(data=numeric_data, aes(y= TORD,x= TOR))+
             geom_point(colour="cadetblue4")

s6 <- ggplot(data=numeric_data, aes(y= X3P_D,x= X2P_D))+
             geom_point(colour="cadetblue4")


s1+s2+s3+s4+s5+s6+plot_layout(ncol = 3)

We do not see any strong relationships, though there is one interesting plot: ADJOE (Adjusted Offensive Efficiency) against ADJDE (Adjusted Defensive Efficiency). These metrics reflect how many points are scored and allowed, respectively (adjusted for opponent strength); essentially, a high ADJOE is good offense and a low ADJDE is good defense. There is a weak visible trend that better defense goes with better offense. The direction of the effect is unclear (and need not be causal at all), but it could reflect the adage that good defense is the best offense.

Baseline Model: Mean Predictor

We begin with a simple baseline model to establish a reference RMSE. We fit a linear regression with just an intercept term (none of the predictors), which simply predicts the mean win percentage of the training data for every test observation.
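For a test fold with observations \(y_1,\dots,y_m\) and predictions \(\hat{y}_1,\dots,\hat{y}_m\), the error metric we report throughout is \(\text{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(y_i-\hat{y}_i)^2}\), averaged over the 10 folds.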

set.seed(67)
model_data <- data %>% dplyr::select(-TEAM,-POSTSEASON,-SEED,-CONF,-WAB,-YEAR,-W,-G)
K <- 10
folds <- sample(rep(1:K, length.out = nrow(model_data))) # shuffle around the data
errors <- numeric(K)

for (k in 1:K) {
  train_data <- model_data[folds != k, ]
  test_data  <- model_data[folds == k, ]
  
  model <- lm(win_percentage ~ 1, data = train_data)
  
  preds <- predict(model, test_data)
  errors[k] <- mean((test_data$win_percentage - preds)^2)  # MSE
}

cat(sprintf("Average test MSE: %.4f\n",mean(errors)))
## Average test MSE: 0.0326
cat(sprintf("Average test RMSE: %.4f\n",mean(sqrt(errors))))
## Average test RMSE: 0.1805
cat(sprintf("Average test standard error: %.4f\n",sd(errors)))
## Average test standard error: 0.0019

The baseline model has poor predictive power: the cross-validated RMSE is \(0.18\) (fold-to-fold standard deviation of the MSE: \(0.0019\)), so win-percentage predictions differ from the true values by about 18 percentage points on average.

Model Building

We will build four models to predict win percentage over a season. As the metrics reflect scoring, field goal percentages, and defense, we expect high accuracy in the predictions. We will build a linear regression model, a logistic regression model, an XGBoost model, and a neural network, and compare their predictive accuracy. We will use 10-fold cross-validation for each model and report the average RMSE.

model_data <- data %>% dplyr::select(-TEAM,-POSTSEASON,-SEED,-CONF,-WAB,-YEAR,-W,-G)
head(model_data)
##   ADJOE ADJDE BARTHAG EFG_O EFG_D  TOR TORD  ORB  DRB  FTR FTRD X2P_O X2P_D
## 1 123.3  94.9  0.9531  52.6  48.1 15.4 18.2 40.7 30.0 32.3 30.4  53.9  44.6
## 2 129.1  93.6  0.9758  54.8  47.7 12.4 15.8 32.1 23.7 36.2 22.4  54.8  44.7
## 3 114.4  90.4  0.9375  53.9  47.7 14.0 19.5 25.5 24.9 30.7 30.0  54.7  46.8
## 4 115.2  85.2  0.9696  53.5  43.0 17.7 22.8 27.4 28.7 32.9 36.6  52.8  41.9
## 5 117.8  86.3  0.9728  56.6  41.1 16.2 17.1 30.0 26.2 39.0 26.9  56.3  40.0
## 6 117.2  96.2  0.9062  49.9  46.0 18.1 16.1 42.0 29.7 51.8 36.8  50.0  44.9
##   X3P_O X3P_D ADJ_T win_percentage
## 1  32.7  36.2  71.7      0.8250000
## 2  36.5  37.5  59.3      0.9000000
## 3  35.2  33.2  65.9      0.8250000
## 4  36.5  29.7  67.5      0.8157895
## 5  38.2  29.0  71.5      0.9487179
## 6  33.2  32.2  65.9      0.7250000

Linear Model Assumptions

We check the linearity of the model, the independence of the error terms, and homoscedasticity (constant error variance). We will not check normality, as the sample size is large enough that the central limit theorem guarantees approximate normality of the estimated coefficients.

Autocorrelation

To check the independence of the errors in the multiple linear regression, we run Breusch-Godfrey tests for serial correlation at lags 1 and 5.

model <- lm(win_percentage ~ ., data = model_data)
library(lmtest)
bgtest(model, order = 1) # lag 1
## 
##  Breusch-Godfrey test for serial correlation of order up to 1
## 
## data:  model
## LM test = 76.254, df = 1, p-value < 2.2e-16
bgtest(model, order = 5) # lag 5
## 
##  Breusch-Godfrey test for serial correlation of order up to 5
## 
## data:  model
## LM test = 193.99, df = 5, p-value < 2.2e-16

Significant autocorrelation is expected: the dataset spans many years, and with sports data, how a team performs in one year (and hence all of its metrics) is usually related to how it performs the following year. A thorough inferential analysis would need to account for this, but for our goal of prediction, the autocorrelation should not significantly impact predictive power.

Assumption of Linearity and Constant Variance

plot(model$fitted.values, model$residuals,
     xlab = "Fitted values",
     ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, col = "red", lty = 2)

res <- resid(model)

par(mfrow = c(3, 3))  # adjust grid as needed

for (x in names(coef(model))[-1]) {
  plot(model$model[[x]],
       res + coef(model)[x] * model$model[[x]],
       xlab = x,
       ylab = "Partial residual",
       main = paste("Partial residual vs", x))
  abline(lm(res + coef(model)[x] * model$model[[x]] ~ model$model[[x]]),
         col = "red")
}

All of the residual plots appear linear with a near-constant band; thus, linear regression is appropriate. There are some potential outliers at the top left and bottom right of the Residuals vs. Fitted plot, although the dataset is large enough (n = 3880) for this to be a non-issue. Additionally, there is no clear evidence of heteroscedasticity, so no corrective methods are necessary. One clear problem is that some fitted win percentages are less than zero or greater than one, which is not possible.

Multiple Linear Regression

For all of the predictive algorithms we will implement, we will exclude WAB (Wins above bubble), as it directly uses the win metrics.

We begin with multiple linear regression. The goal is to model the response win_percentage (\(y_i\)) as a linear function of the predictors: \(y_i = \beta_0+\beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i\).

We will use best subsets regression to select a model.

library(leaps)

subset <- regsubsets(win_percentage ~ .,data = model_data,method = "exhaustive",nbest = 3,nvmax=16)
sum_sub <- summary(subset)
output <- cbind(
  sum_sub$outmat,
  R2 = round(sum_sub$rsq, 3),
  AdjR2 = round(sum_sub$adjr2, 3),
  Cp = round(sum_sub$cp, 1),
  BIC = round(sum_sub$bic, 1)
)

output
##           ADJOE ADJDE BARTHAG EFG_O EFG_D TOR TORD ORB DRB FTR FTRD X2P_O X2P_D
## 1  ( 1 )  " "   " "   "*"     " "   " "   " " " "  " " " " " " " "  " "   " "  
## 1  ( 2 )  "*"   " "   " "     " "   " "   " " " "  " " " " " " " "  " "   " "  
## 1  ( 3 )  " "   "*"   " "     " "   " "   " " " "  " " " " " " " "  " "   " "  
## 2  ( 1 )  " "   "*"   " "     "*"   " "   " " " "  " " " " " " " "  " "   " "  
## 2  ( 2 )  " "   " "   " "     "*"   "*"   " " " "  " " " " " " " "  " "   " "  
## 2  ( 3 )  " "   " "   "*"     "*"   " "   " " " "  " " " " " " " "  " "   " "  
## 3  ( 1 )  " "   " "   " "     "*"   "*"   "*" " "  " " " " " " " "  " "   " "  
## 3  ( 2 )  " "   " "   "*"     "*"   "*"   " " " "  " " " " " " " "  " "   " "  
## 3  ( 3 )  " "   " "   " "     "*"   "*"   " " "*"  " " " " " " " "  " "   " "  
## 4  ( 1 )  " "   " "   " "     "*"   "*"   "*" "*"  " " " " " " " "  " "   " "  
## 4  ( 2 )  " "   " "   " "     "*"   "*"   "*" " "  "*" " " " " " "  " "   " "  
## 4  ( 3 )  "*"   " "   " "     "*"   "*"   " " "*"  " " " " " " " "  " "   " "  
## 5  ( 1 )  " "   " "   " "     "*"   "*"   "*" "*"  "*" " " " " " "  " "   " "  
## 5  ( 2 )  " "   " "   " "     "*"   "*"   "*" "*"  " " "*" " " " "  " "   " "  
## 5  ( 3 )  "*"   " "   " "     "*"   "*"   " " "*"  " " "*" " " " "  " "   " "  
## 6  ( 1 )  " "   " "   " "     "*"   "*"   "*" "*"  "*" "*" " " " "  " "   " "  
## 6  ( 2 )  " "   " "   " "     "*"   "*"   "*" "*"  "*" " " " " "*"  " "   " "  
## 6  ( 3 )  "*"   "*"   " "     " "   "*"   " " "*"  " " "*" " " "*"  " "   " "  
## 7  ( 1 )  " "   " "   " "     "*"   "*"   "*" "*"  "*" "*" " " "*"  " "   " "  
## 7  ( 2 )  " "   " "   " "     "*"   "*"   "*" "*"  "*" "*" " " " "  " "   " "  
## 7  ( 3 )  " "   " "   " "     "*"   "*"   "*" "*"  "*" "*" " " " "  " "   "*"  
## 8  ( 1 )  " "   " "   " "     "*"   "*"   "*" "*"  "*" "*" "*" "*"  " "   " "  
## 8  ( 2 )  " "   "*"   " "     "*"   "*"   "*" "*"  "*" "*" " " "*"  " "   " "  
## 8  ( 3 )  " "   " "   " "     "*"   "*"   "*" "*"  "*" "*" " " "*"  " "   " "  
## 9  ( 1 )  " "   "*"   " "     "*"   "*"   "*" "*"  "*" "*" "*" "*"  " "   " "  
## 9  ( 2 )  " "   " "   " "     "*"   "*"   "*" "*"  "*" "*" "*" "*"  " "   "*"  
## 9  ( 3 )  " "   " "   " "     "*"   "*"   "*" "*"  "*" "*" "*" "*"  " "   " "  
## 10  ( 1 ) " "   "*"   " "     "*"   "*"   "*" "*"  "*" "*" "*" "*"  " "   "*"  
## 10  ( 2 ) " "   "*"   " "     "*"   "*"   "*" "*"  "*" "*" "*" "*"  " "   " "  
## 10  ( 3 ) " "   "*"   " "     "*"   "*"   "*" "*"  "*" "*" "*" "*"  " "   " "  
## 11  ( 1 ) " "   "*"   " "     "*"   "*"   "*" "*"  "*" "*" "*" "*"  " "   "*"  
## 11  ( 2 ) " "   "*"   " "     "*"   "*"   "*" "*"  "*" "*" "*" "*"  " "   " "  
## 11  ( 3 ) " "   "*"   "*"     "*"   "*"   "*" "*"  "*" "*" "*" "*"  " "   "*"  
## 12  ( 1 ) " "   "*"   "*"     "*"   "*"   "*" "*"  "*" "*" "*" "*"  " "   "*"  
## 12  ( 2 ) " "   "*"   "*"     "*"   "*"   "*" "*"  "*" "*" "*" "*"  " "   " "  
## 12  ( 3 ) "*"   "*"   " "     "*"   "*"   "*" "*"  "*" "*" "*" "*"  " "   "*"  
## 13  ( 1 ) "*"   "*"   "*"     "*"   "*"   "*" "*"  "*" "*" "*" "*"  " "   "*"  
## 13  ( 2 ) "*"   "*"   "*"     "*"   "*"   "*" "*"  "*" "*" "*" "*"  " "   " "  
## 13  ( 3 ) " "   "*"   "*"     "*"   "*"   "*" "*"  "*" "*" "*" "*"  "*"   "*"  
## 14  ( 1 ) "*"   "*"   "*"     "*"   "*"   "*" "*"  "*" "*" "*" "*"  "*"   "*"  
## 14  ( 2 ) "*"   "*"   "*"     "*"   "*"   "*" "*"  "*" "*" "*" "*"  " "   "*"  
## 14  ( 3 ) "*"   "*"   "*"     "*"   "*"   "*" "*"  "*" "*" "*" "*"  " "   "*"  
## 15  ( 1 ) "*"   "*"   "*"     "*"   "*"   "*" "*"  "*" "*" "*" "*"  "*"   "*"  
## 15  ( 2 ) "*"   "*"   "*"     "*"   "*"   "*" "*"  "*" "*" "*" "*"  "*"   "*"  
## 15  ( 3 ) "*"   "*"   "*"     "*"   "*"   "*" "*"  "*" "*" "*" "*"  " "   "*"  
## 16  ( 1 ) "*"   "*"   "*"     "*"   "*"   "*" "*"  "*" "*" "*" "*"  "*"   "*"  
##           X3P_O X3P_D ADJ_T R2      AdjR2   Cp       BIC      
## 1  ( 1 )  " "   " "   " "   "0.563" "0.563" "6696.9" "-3195.8"
## 1  ( 2 )  " "   " "   " "   "0.473" "0.473" "8865.9" "-2471.8"
## 1  ( 3 )  " "   " "   " "   "0.39"  "0.39"  "10885"  "-1901.1"
## 2  ( 1 )  " "   " "   " "   "0.643" "0.643" "4752.7" "-3976.8"
## 2  ( 2 )  " "   " "   " "   "0.643" "0.643" "4759.9" "-3973.6"
## 2  ( 3 )  " "   " "   " "   "0.627" "0.627" "5151.6" "-3801.5"
## 3  ( 1 )  " "   " "   " "   "0.689" "0.689" "3654.8" "-4497.8"
## 3  ( 2 )  " "   " "   " "   "0.688" "0.688" "3677.9" "-4485.9"
## 3  ( 3 )  " "   " "   " "   "0.687" "0.687" "3703.5" "-4472.7"
## 4  ( 1 )  " "   " "   " "   "0.746" "0.746" "2270.8" "-5279.2"
## 4  ( 2 )  " "   " "   " "   "0.731" "0.73"  "2644.7" "-5049.8"
## 4  ( 3 )  " "   " "   " "   "0.727" "0.727" "2738.8" "-4994.2"
## 5  ( 1 )  " "   " "   " "   "0.779" "0.779" "1473.5" "-5811.9"
## 5  ( 2 )  " "   " "   " "   "0.777" "0.777" "1528.4" "-5772.3"
## 5  ( 3 )  " "   " "   " "   "0.768" "0.768" "1750.4" "-5615.8"
## 6  ( 1 )  " "   " "   " "   "0.813" "0.813" "650.3"  "-6454.8"
## 6  ( 2 )  " "   " "   " "   "0.791" "0.79"  "1197.2" "-6011.3"
## 6  ( 3 )  " "   " "   " "   "0.79"  "0.79"  "1211.8" "-6000.1"
## 7  ( 1 )  " "   " "   " "   "0.821" "0.82"  "473.6"  "-6603.2"
## 7  ( 2 )  " "   "*"   " "   "0.817" "0.817" "562.3"  "-6524.7"
## 7  ( 3 )  " "   " "   " "   "0.817" "0.817" "562.6"  "-6524.4"
## 8  ( 1 )  " "   " "   " "   "0.828" "0.828" "292.5"  "-6762.3"
## 8  ( 2 )  " "   " "   " "   "0.826" "0.826" "342.3"  "-6716"  
## 8  ( 3 )  " "   " "   "*"   "0.824" "0.823" "403.2"  "-6660.2"
## 9  ( 1 )  " "   " "   " "   "0.834" "0.834" "155.8"  "-6885.7"
## 9  ( 2 )  " "   " "   " "   "0.832" "0.831" "209"    "-6834.7"
## 9  ( 3 )  " "   "*"   " "   "0.832" "0.831" "210"    "-6833.7"
## 10  ( 1 ) " "   " "   " "   "0.837" "0.837" "86.9"   "-6946.6"
## 10  ( 2 ) " "   "*"   " "   "0.837" "0.836" "88.8"   "-6944.8"
## 10  ( 3 ) " "   " "   "*"   "0.836" "0.836" "110.4"  "-6923.5"
## 11  ( 1 ) " "   " "   "*"   "0.839" "0.838" "46.9"   "-6979.8"
## 11  ( 2 ) " "   "*"   "*"   "0.839" "0.838" "49.4"   "-6977.4"
## 11  ( 3 ) " "   " "   " "   "0.837" "0.837" "76.9"   "-6950.2"
## 12  ( 1 ) " "   " "   "*"   "0.839" "0.839" "38.5"   "-6982"  
## 12  ( 2 ) " "   "*"   "*"   "0.839" "0.839" "41.5"   "-6978.9"
## 12  ( 3 ) " "   " "   "*"   "0.839" "0.838" "47.4"   "-6973.1"
## 13  ( 1 ) " "   " "   "*"   "0.84"  "0.84"  "13.2"   "-7001"  
## 13  ( 2 ) " "   "*"   "*"   "0.84"  "0.84"  "15.9"   "-6998.3"
## 13  ( 3 ) " "   " "   "*"   "0.839" "0.839" "39.4"   "-6974.8"
## 14  ( 1 ) " "   " "   "*"   "0.84"  "0.84"  "13.1"   "-6994.9"
## 14  ( 2 ) "*"   " "   "*"   "0.84"  "0.84"  "13.4"   "-6994.5"
## 14  ( 3 ) " "   "*"   "*"   "0.84"  "0.84"  "15.2"   "-6992.7"
## 15  ( 1 ) "*"   " "   "*"   "0.84"  "0.84"  "15"     "-6986.7"
## 15  ( 2 ) " "   "*"   "*"   "0.84"  "0.84"  "15.1"   "-6986.6"
## 15  ( 3 ) "*"   "*"   "*"   "0.84"  "0.84"  "15.4"   "-6986.3"
## 16  ( 1 ) "*"   "*"   "*"   "0.84"  "0.84"  "17"     "-6978.4"

Using best subsets regression, the adjusted \(R^2\) value plateaus at about 0.84 around 13 predictors. Additionally, we wish to minimize the Bayesian Information Criterion (BIC) and aim for a Mallows' \(C_p\) value near \(p+1\), where \(p\) is the number of predictors in the model. In terms of prediction, the models with roughly 13-16 predictors (and perhaps fewer) should perform similarly, so we will use the model with all predictors. An interesting thing to note is that the offensive and defensive 2-point and 3-point percentages are among the last variables to enter. This is likely because they are already captured by the EFG_O and EFG_D (Effective Field Goal Offensive/Defensive) statistics, which are, in turn, used in several other metrics, so these less informationally dense variables add little to the model.
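
As a small sketch (separate from the model we carry forward), the single best subset by BIC can be pulled directly from the search above using the sum_sub and subset objects:

# Sketch: identify the subset with the lowest BIC from the exhaustive search
best_id <- which.min(sum_sub$bic)  # index over all (size, rank) candidates
sum_sub$outmat[best_id, ]          # predictors included in that subset
coef(subset, id = best_id)         # its fitted coefficients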

We will use 10-fold cross-validation to evaluate the model’s accuracy, noting that for adequate comparison, we should use the same cross-validation train and test sets.

K <- 10
set.seed(67)
folds <- sample(rep(1:K, length.out = nrow(model_data))) 

errors <- numeric(K)

for (k in 1:K) {
  train_data <- model_data[folds != k, ]
  test_data  <- model_data[folds == k, ]
  
  model <- lm(win_percentage ~ ., data = train_data)
  
  preds <- predict(model, test_data)
  errors[k] <- mean((test_data$win_percentage - preds)^2)  # MSE
}
cat(sprintf("Average test MSE: %.4f\n",mean(errors)))
## Average test MSE: 0.0052
cat(sprintf("Average test RMSE: %.4f\n",mean(sqrt(errors))))
## Average test RMSE: 0.0724
cat(sprintf("Average test standard error: %.4f\n",sd(errors)))
## Average test standard error: 0.0002

The multiple linear regression fit performs very well compared to the baseline, with a test RMSE of \(0.072\) (fold-to-fold standard deviation of the MSE: \(0.0002\)), so predictions are off by about 7.2 percentage points on average, compared to 18 percentage points for the baseline model. This difference indicates that the fit is substantially better than simply predicting the average; hence, the model is useful.

Now that we have an idea of the test accuracy, we build the final model using all of the training data:

model <- lm(win_percentage ~ ., data = model_data)
summary(model)
## 
## Call:
## lm(formula = win_percentage ~ ., data = model_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.272712 -0.048799 -0.000444  0.046801  0.311129 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.1483581  0.0611744   2.425   0.0153 *  
## ADJOE       -0.0043792  0.0008222  -5.326 1.06e-07 ***
## ADJDE        0.0074888  0.0008416   8.898  < 2e-16 ***
## BARTHAG      0.1700799  0.0281529   6.041 1.67e-09 ***
## EFG_O        0.0296321  0.0046850   6.325 2.82e-10 ***
## EFG_D       -0.0360528  0.0062615  -5.758 9.18e-09 ***
## TOR         -0.0248283  0.0009495 -26.149  < 2e-16 ***
## TORD         0.0289746  0.0008669  33.424  < 2e-16 ***
## ORB          0.0099737  0.0004650  21.450  < 2e-16 ***
## DRB         -0.0142354  0.0005323 -26.743  < 2e-16 ***
## FTR          0.0033457  0.0002537  13.187  < 2e-16 ***
## FTRD        -0.0043555  0.0002403 -18.129  < 2e-16 ***
## X2P_O       -0.0019244  0.0029596  -0.650   0.5156    
## X2P_D        0.0065564  0.0040003   1.639   0.1013    
## X3P_O       -0.0006745  0.0025120  -0.268   0.7883    
## X3P_D       -0.0002127  0.0033598  -0.063   0.9495    
## ADJ_T        0.0026970  0.0004045   6.667 2.98e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07228 on 3863 degrees of freedom
## Multiple R-squared:  0.8403, Adjusted R-squared:  0.8397 
## F-statistic:  1271 on 16 and 3863 DF,  p-value: < 2.2e-16
anova(model)
## Analysis of Variance Table
## 
## Response: win_percentage
##             Df Sum Sq Mean Sq    F value    Pr(>F)    
## ADJOE        1 59.848  59.848 11454.6428 < 2.2e-16 ***
## ADJDE        1 13.650  13.650  2612.4944 < 2.2e-16 ***
## BARTHAG      1  0.002   0.002     0.3772    0.5391    
## EFG_O        1  9.055   9.055  1733.1714 < 2.2e-16 ***
## EFG_D        1  4.757   4.757   910.5520 < 2.2e-16 ***
## TOR          1  2.767   2.767   529.5709 < 2.2e-16 ***
## TORD         1  5.198   5.198   994.9652 < 2.2e-16 ***
## ORB          1  4.655   4.655   891.0149 < 2.2e-16 ***
## DRB          1  3.429   3.429   656.2183 < 2.2e-16 ***
## FTR          1  0.406   0.406    77.7888 < 2.2e-16 ***
## FTRD         1  1.874   1.874   358.7394 < 2.2e-16 ***
## X2P_O        1  0.000   0.000     0.0132    0.9084    
## X2P_D        1  0.363   0.363    69.4581 < 2.2e-16 ***
## X3P_O        1  0.000   0.000     0.0591    0.8080    
## X3P_D        1  0.000   0.000     0.0643    0.7998    
## ADJ_T        1  0.232   0.232    44.4512 2.977e-11 ***
## Residuals 3863 20.183   0.005                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As expected, the less informationally dense variables (the raw 2-point and 3-point shooting percentages) are not deemed significant once the related composite variables are in the model. Similarly, BARTHAG is computed directly from ADJOE and ADJDE, and in the sequential ANOVA it yields little additional explanatory power once those two predictors have entered. The adjusted \(R^2\) value is 0.8397, which indicates that the linear fit is excellent.

It is important to note that our goal is prediction, not inference. In that light, the substantial multicollinearity present is not a major concern:

library(car)
vif(model)
##      ADJOE      ADJDE    BARTHAG      EFG_O      EFG_D        TOR       TORD 
##  26.814180  22.137871  38.356053 154.053971 237.271047   3.077849   2.913227 
##        ORB        DRB        FTR       FTRD      X2P_O      X2P_D      X3P_O 
##   2.834941   2.162128   1.428778   1.720868  75.545428 128.747375  34.489687 
##      X3P_D      ADJ_T 
##  48.469203   1.119980

The lowest levels of multicollinearity belong to the simpler statistics that are not used to construct the composite metrics, such as the turnover rates (TOR, TORD), rebound rates (ORB, DRB), free throw rates (FTR, FTRD), and tempo (ADJ_T).

In summary, the multiple linear regression model performs significantly better than the baseline model; thus, it is useful for prediction. We now seek to improve the accuracy on the same train-test splits.

Logistic Regression

A potential drawback of multiple linear regression in our case is that the predicted response is unbounded; that is, a team's win percentage can be predicted to be less than zero or greater than one. This can be addressed by modelling the win probability through a logit link, i.e., logistic regression (a binomial GLM with a logit link function). Logistic regression is an intuitive choice for modelling win percentage, as the response naturally arrives as a proportion of successes (wins) out of trials (games).
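Concretely, with \(W_i\) wins in \(G_i\) games, the model assumes \(W_i \sim \text{Binomial}(G_i, p_i)\) with \(\log\left(\frac{p_i}{1-p_i}\right) = \beta_0+\beta_1 x_{i1} + \dots + \beta_p x_{ip}\), so the fitted win probability \(\hat{p}_i = \left(1+e^{-(\hat{\beta}_0+\hat{\beta}_1 x_{i1}+\dots+\hat{\beta}_p x_{ip})}\right)^{-1}\) always lies between 0 and 1.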

We fit the model:

data$L <- data$G-data$W # losses
model_data <- data %>% dplyr::select(-TEAM,-POSTSEASON,-SEED,-CONF,-WAB,-YEAR,-G,-win_percentage)

K <- 10
set.seed(67)
folds <- sample(rep(1:K, length.out = nrow(model_data)))

errors <- numeric(K)

for (k in 1:K) {
  
  train_data <- model_data[folds != k, ]
  test_data  <- model_data[folds == k, ]
  
  model <- glm(
    cbind(W, L) ~ .,
    family = binomial,
    data = train_data
  )
  
  preds <- predict(model, test_data, type = "response")
  
  true_probs <- test_data$W / (test_data$W + test_data$L)
  
  errors[k] <- mean((true_probs - preds)^2)
  
}

cat(sprintf("Average test MSE: %.4f\n",mean(errors)))
## Average test MSE: 0.0050
cat(sprintf("Average test RMSE: %.4f\n",mean(sqrt((errors)))))
## Average test RMSE: 0.0709
cat(sprintf("Average test standard error: %.4f\n",sd(errors)))
## Average test standard error: 0.0003

The logistic model shows a marginal improvement compared to the linear one, with an average RMSE of \(0.071\) (fold-to-fold standard deviation of the MSE: \(0.0003\)), meaning predictions differ from the truth by about 7 percentage points on average. While there is no direct equivalent of an \(R^2\) value, we will check McFadden's pseudo \(R^2\), noting that a value of 0.2-0.4 is already considered an excellent fit \(^2\).
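McFadden's pseudo \(R^2\) compares the log-likelihood of the fitted model, \(\ell(\hat{\beta})\), to that of an intercept-only null model, \(\ell_0\): \(R^2_{\text{McFadden}} = 1 - \ell(\hat{\beta})/\ell_0\), which is exactly what the code below computes.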

library(performance)
data$L <- data$G-data$W
model_data <- data %>% dplyr::select(-TEAM,-POSTSEASON,-SEED,-CONF,-WAB,-YEAR,-G,-win_percentage)

model <- glm(
  cbind(W, L) ~ .,
  family = binomial,
  data = model_data
)

null_model <- glm( # fit the null model
  cbind(W, L) ~ 1,
  family = binomial,
  data = model_data
)

# Use log likelihoods
ll_full <- as.numeric(logLik(model))
ll_null <- as.numeric(logLik(null_model))

# McFadden's pseudo R-squared
R2_McFadden <- 1 - ll_full / ll_null

cat(sprintf("McFadden's pseudo R²: %.4f\n", R2_McFadden))
## McFadden's pseudo R²: 0.4465

The pseudo \(R^2\) value is 0.4465, which indicates an excellent fit. We conclude that both the linear and logistic regression models perform well in this example, each with its own advantages and disadvantages. The data appear quite linear, and the linear model's \(R^2\) value is very high; however, the logistic model guarantees a response between 0 and 1, whereas the linear fit has no such guarantee. In addition, the logistic model performed slightly better than linear regression (average RMSE 0.071 vs 0.072). We now turn to the modern, more computational methods.

XGBoost Model

Ensemble learning methods combine many simple models into a stronger predictor. Decision trees are a good choice of base model, as they are easy to train and to combine; popular tree ensembles include random forests, bagging (bootstrap aggregating), and gradient boosting. We have a large dataset, which makes overfitting less of a concern and suits a gradient boosting model well. One of the most popular implementations is XGBoost (eXtreme Gradient Boosting), which we will use for comparison. The model works by sequentially training weak learners, each one correcting the errors of the models before it. We begin with a demonstration of a single decision tree.

library(xgboost)
library(rpart)

model_data <- data %>% dplyr::select(-TEAM,-POSTSEASON,-SEED,-CONF,-WAB,-YEAR,-G,-W,-L)
# Build the regression tree
tree_model <- rpart(win_percentage ~ ., data = model_data, method = "anova",cp=0.01)

# Visualization
library(rpart.plot)
prp(tree_model)

Here we have a simple tree-based model that uses a few key metrics. Notably, the splits use BARTHAG (Basketball Adjusted Rating Through Games), which combines the overall adjusted offensive and defensive statistics into a chance of winning, together with the offensive and defensive EFG (Effective Field Goal Percentage) metrics. We could build a more complex tree by reducing the complexity parameter, although we will not explore that in depth here; a sketch of what it could look like follows.
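
As a hedged sketch (not run as part of this analysis), a deeper tree could be grown by lowering cp and then pruned back using rpart's built-in cross-validation table:

# Sketch only: grow a deeper tree, then prune it using rpart's internal CV results
deep_tree <- rpart(win_percentage ~ ., data = model_data, method = "anova", cp = 0.001)
printcp(deep_tree) # cross-validated error for each subtree size
best_cp <- deep_tree$cptable[which.min(deep_tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(deep_tree, cp = best_cp)
prp(pruned_tree)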

We proceed with fitting a gradient-boosted model using XGBoost and testing its effectiveness, printing progress every 50 iterations.
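
For the squared-error objective used here, each boosting round fits a small tree \(h_m\) to the current residuals and adds a damped copy of it to the ensemble: \(F_m(x) = F_{m-1}(x) + \eta\, h_m(x)\), where the learning rate \(\eta\) (the eta parameter below) controls how aggressively each new tree corrects the previous fit.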

library(xgboost)

input_data <- data %>% dplyr::select(-TEAM,-POSTSEASON,-SEED,-CONF,-WAB,-YEAR,-G,-W,-L,-win_percentage)

data_matrix <- as.matrix(input_data)

dtrain <- xgb.DMatrix(data=data_matrix,label=data$win_percentage)

# parameters:

params <- list(
  objective = "reg:squarederror", # what we want to minimize
  eval_metric = "rmse",
  eta = 0.02, # learning rate
  max_depth = 4, # max depth of each decision tree
  subsample = 0.7, # fraction of training rows used per tree (controls overfitting)
  colsample_bytree = 0.7 #fraction of columns used per tree (controls overfitting)
)

# we need to compare the same runs, so the xgboost model must cross validate on the same folds:
set.seed(67)
folds <- sample(rep(1:K, length.out = nrow(model_data)))
folds_list <- lapply(1:K, function(k) which(folds == k))


cv <- xgb.cv(
  params = params,
  data = dtrain,
  nrounds = 1500, # maximum number of boosting rounds
  folds = folds_list,
  early_stopping_rounds = 20, # stop if test RMSE has not improved in 20 rounds
  verbose = 1,
  print_every_n = 50
)
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 20 rounds.
## 
## [1]  train-rmse:0.178161±0.000583    test-rmse:0.178211±0.005180 
## [51] train-rmse:0.112825±0.000239    test-rmse:0.116135±0.002804 
## [101]    train-rmse:0.091161±0.000251    test-rmse:0.097393±0.002658 
## [151]    train-rmse:0.081297±0.000266    test-rmse:0.090040±0.002652 
## [201]    train-rmse:0.075777±0.000266    test-rmse:0.086399±0.002761 
## [251]    train-rmse:0.071775±0.000242    test-rmse:0.084037±0.002814 
## [301]    train-rmse:0.068486±0.000230    test-rmse:0.082156±0.002729 
## [351]    train-rmse:0.065693±0.000235    test-rmse:0.080662±0.002664 
## [401]    train-rmse:0.063355±0.000202    test-rmse:0.079477±0.002659 
## [451]    train-rmse:0.061341±0.000221    test-rmse:0.078567±0.002585 
## [501]    train-rmse:0.059558±0.000222    test-rmse:0.077790±0.002563 
## [551]    train-rmse:0.058008±0.000249    test-rmse:0.077147±0.002518 
## [601]    train-rmse:0.056624±0.000249    test-rmse:0.076630±0.002518 
## [651]    train-rmse:0.055398±0.000252    test-rmse:0.076221±0.002516 
## [701]    train-rmse:0.054279±0.000249    test-rmse:0.075904±0.002528 
## [751]    train-rmse:0.053262±0.000252    test-rmse:0.075573±0.002521 
## [801]    train-rmse:0.052318±0.000269    test-rmse:0.075388±0.002490 
## [851]    train-rmse:0.051446±0.000271    test-rmse:0.075193±0.002480 
## [901]    train-rmse:0.050607±0.000239    test-rmse:0.075035±0.002450 
## [951]    train-rmse:0.049819±0.000240    test-rmse:0.074897±0.002454 
## [1001]   train-rmse:0.049082±0.000237    test-rmse:0.074817±0.002477 
## [1051]   train-rmse:0.048359±0.000237    test-rmse:0.074762±0.002470 
## [1101]   train-rmse:0.047654±0.000234    test-rmse:0.074702±0.002483 
## [1151]   train-rmse:0.046974±0.000240    test-rmse:0.074648±0.002471 
## Stopping. Best iteration:
## [1185]   train-rmse:0.046513±0.000241    test-rmse:0.074625±0.002482
## 
## [1185]   train-rmse:0.046513±0.000241    test-rmse:0.074625±0.002482
log <- cv$evaluation_log
best_nrounds <- which.min(log$test_rmse_mean) # best number of trees

cv$evaluation_log[best_nrounds] # cross validated RMSE 
##     iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std
##    <int>           <num>          <num>          <num>         <num>
## 1:  1165      0.04678253   0.0002426612     0.07462341   0.002478078

The XGBoost model performs slightly worse than both linear and logistic regression, with a test RMSE of \(0.0746 \pm 0.002\), or about 7.5 percentage points. Note that different combinations of parameters were attempted (not shown here), including different learning rates, maximum depths, and overfitting controls, and they yielded worse results. XGBoost shines when there are many predictors and strong nonlinear effects; here the predictor set is small and a linear model already fits very well, so gradient boosting offers little advantage.
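
As a follow-up sketch (not part of the runs above), one could refit a single model at the cross-validated best number of rounds and inspect which metrics the boosted trees rely on most:

# Sketch: final fit at the CV-selected number of rounds, plus gain-based feature importance
final_xgb <- xgb.train(params = params, data = dtrain, nrounds = best_nrounds)
importance <- xgb.importance(model = final_xgb)
head(importance)                            # predictors ranked by gain
xgb.plot.importance(importance, top_n = 10)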

Neural Network

Finally, we turn to the star of modern developments in artificial intelligence: the neural network. We use the Keras library for our model. For neural networks, we scale the features so that all contribute comparably during training. We fit a multilayer perceptron with two hidden layers and ReLU activations, using an L2 regularizer on the weights to prevent overfitting and adaptive moment estimation (Adam) to adjust the learning rate during training. As before, we use 10-fold cross-validation to estimate the average test error.

library(keras3)
input_data <- data %>% dplyr::select(-TEAM,-POSTSEASON,-SEED,-CONF,-WAB,-YEAR,-G,-W,-L,-win_percentage)
X <- as.matrix(input_data)
Y <- model_data$win_percentage

X_scaled <- scale(X)

build_model <- function(input_dim) {
  model <- keras_model_sequential() %>%
    layer_dense(
      units = 16,
      activation = "relu",
      input_shape = input_dim,
      kernel_regularizer = regularizer_l2(0.001)
    ) %>%
    layer_dense(
      units = 8,
      activation = "relu",
      kernel_regularizer = regularizer_l2(0.001)
    ) %>%
    layer_dense(units = 1)   # linear output for regression
  
  model %>% compile(
    optimizer = optimizer_adam(learning_rate = 0.001),
    loss = "mse",
    metrics = list("mse")
  )
  
  model
}

K <- 10
n <- nrow(X_scaled)

set.seed(67)
folds <- sample(rep(1:K, length.out = n))

rmse_per_fold <- numeric(K)

for (k in 1:K) {
  
  cat("Fold", k, "\n")
  
  train_idx <- which(folds != k)
  val_idx   <- which(folds == k)
  
  X_train <- X_scaled[train_idx, ]
  Y_train <- Y[train_idx]
  
  X_val <- X_scaled[val_idx, ]
  Y_val <- Y[val_idx]
  
  model <- build_model(ncol(X_train))
  
  history <- model %>% fit(
    X_train, Y_train,
    epochs = 100,
    batch_size = 32,
    validation_data = list(X_val, Y_val),
    verbose = 0,
    callbacks = list(
      callback_early_stopping(
        monitor = "val_loss",
        patience = 10,
        restore_best_weights = TRUE
      )
    )
  )
  
  preds <- model %>% predict(X_val)
  
  rmse_per_fold[k] <- sqrt(mean((preds - Y_val)^2))
}
## Fold 1 
## 13/13 - 0s - 6ms/step
## Fold 2 
## 13/13 - 0s - 5ms/step
## Fold 3 
## 13/13 - 0s - 5ms/step
## Fold 4 
## 13/13 - 0s - 5ms/step
## Fold 5 
## 13/13 - 0s - 5ms/step
## Fold 6 
## 13/13 - 0s - 5ms/step
## Fold 7 
## 13/13 - 0s - 5ms/step
## Fold 8 
## 13/13 - 0s - 5ms/step
## Fold 9 
## 13/13 - 0s - 5ms/step
## Fold 10 
## 13/13 - 0s - 5ms/step
rmse_per_fold
##  [1] 0.06894099 0.06927144 0.06971607 0.06464198 0.07119575 0.07054971
##  [7] 0.07375508 0.07134856 0.07050159 0.07171752
cat(sprintf("Average test RMSE: %.4f\n",mean(rmse_per_fold)))
## Average test RMSE: 0.0702
cat(sprintf("Average test standard error: %.4f\n",sd(rmse_per_fold)))
## Average test standard error: 0.0024

The neural network has an average RMSE of \(0.070 \pm 0.002\), tied for the best result. We note that neural networks are computationally more involved than the classical methods: in this example, the simple methods ran in seconds while the neural network took several minutes, and a model with more nodes, layers, or data would take longer still. For reference, here are some error rates (not displayed above) for different hidden-layer sizes: for layers with (x, y) nodes, the resulting average RMSEs were (2,1): 0.094, (4,2): 0.071, (8,4): 0.070, (16,8): 0.070, and (32,16): 0.070.

Comparison of the Different Methods

Viewing the four models, we see that all perform quite similarly, with average RMSEs around 0.07, or 7 percentage points. This likely indicates that even the linear and logistic regression models capture the true functional form of the data well, so the more complex models, such as gradient boosting and neural networks, add little to no predictive power. An important point not explored in this analysis is that both linear and logistic regression can also be used for inference, not just prediction, to determine which statistics matter more or less for winning. Ideally, one would compare the base statistics rather than the composite ones, so in this specific analysis inference is less useful. Nevertheless, this serves as an important reminder that fitting a good, simple model can be far more useful to organizations than flashier modern counterparts.

Additional Approaches

Poisson Regression

Poisson regression assumes the functional form \(E[y_i] = e^{\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\dots+\beta_px_{ip}}\) and is best suited to count data; the exponential (log-link) form guarantees non-negative predictions, and the approach connects to our setting through the Poisson approximation to the binomial distribution. Since we are modelling win percentage, one could instead model the number of wins as the response and consider Poisson regression. We did not explore this approach because different teams play different numbers of games in a season (some conferences play fewer games than others, and the tournament games are included), which would introduce error unless explicitly accounted for, e.g., with an offset for games played. Additionally (and more importantly), Poisson regression relies on the assumption that the counted events are independent, which is certainly not the case here, as wins within a basketball season influence the results of other games.\(^3\)
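
For illustration only, here is a minimal sketch of what such a model could look like, using \(\log(G)\) as an offset so that teams with different schedule lengths are comparable (this was not fit in our analysis):

# Hypothetical sketch: Poisson regression for win counts with log(games) as an offset
pois_data <- data %>% dplyr::select(-TEAM,-POSTSEASON,-SEED,-CONF,-WAB,-YEAR,-L,-win_percentage)
pois_model <- glm(W ~ . - G + offset(log(G)), family = poisson, data = pois_data)
summary(pois_model)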

Bootstrap Aggregating (Bagging)

We used gradient boosting above, but we can also consider other tree-based methods, such as bootstrap aggregating or one of its implementations (such as random forest). Bootstrap aggregating with decision trees involves sampling with replacement from the training set to build many different decision trees, then averaging them together. This approach is used to reduce variance in a noisy dataset and prevent overfitting \(^4\). When deciding between the ensemble methods of bagging and boosting, we note that bagging aims to reduce variance, whereas boosting aims to reduce bias \(^5\). We have a large amount of data, so we are less concerned with overfitting, and as such, we opt for the boosted model for the analysis.
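
A minimal sketch of what this could look like with the randomForest package (hypothetical, not run in this report) is given below; setting mtry equal to the number of predictors would correspond to pure bagging rather than a random forest:

# Hypothetical sketch: random forest (bagged trees with random feature subsets)
library(randomForest)
rf_data <- data %>% dplyr::select(-TEAM,-POSTSEASON,-SEED,-CONF,-WAB,-YEAR,-G,-W,-L)
set.seed(67)
rf_model <- randomForest(win_percentage ~ ., data = rf_data, ntree = 500, importance = TRUE)
rf_model             # reports the out-of-bag mean squared error
varImpPlot(rf_model) # variable importance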

Regularization Methods

For the linear and logistic regression cases, we could have applied ridge (L2) or lasso (L1) regularization to avoid overfitting and improve predictive performance. These methods add a penalty on the size of the coefficients to the loss function being minimized. For example, ridge regression finds the estimate \(\hat{\beta}\) that minimizes the penalized loss \(\sum_{i=1}^{n} (y_i-\hat{y}_i)^2 + \lambda \sum_{j=1}^p \beta_j^2\) for a pre-chosen constant \(\lambda\) \(^6\). In the case of the NCAA analysis, we did not implement these methods in the linear and logistic cases for simplicity. Additionally, as \(n\) is large and the response has a linear appearance, overfitting is unlikely; a common recommendation is 10-20 observations per predictor variable, a criterion easily met with 3880 observations and 16 predictors\(^7\). The lasso performs feature selection by shrinking less important coefficients exactly to zero, which would likely have happened here due to the multicollinearity, yielding a simpler model with probably similar performance. Ridge regression, in contrast, shrinks all coefficients toward zero to limit overfitting but does not perform feature selection. For inference and for preventing overfitting, these regularization methods should certainly be considered when deciding on a model\(^8\); a sketch of a lasso fit follows.
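
As a hedged sketch (not part of the analysis above), the lasso could be fit with the glmnet package, using the same predictor set as the other models and choosing \(\lambda\) by cross-validation:

# Hypothetical sketch: lasso (alpha = 1) with lambda chosen by cross-validation
library(glmnet)
X_lasso <- as.matrix(model_data %>% dplyr::select(-win_percentage))
y_lasso <- model_data$win_percentage
set.seed(67)
lasso_cv <- cv.glmnet(X_lasso, y_lasso, alpha = 1, nfolds = 10)
coef(lasso_cv, s = "lambda.min") # some coefficients shrink exactly to zero
sqrt(min(lasso_cv$cvm))          # cross-validated RMSE at the best lambda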

Conclusion

We performed data analysis on NCAA men's basketball season data from 2013 to 2024 and noted key observations. Next, we fit four different models and evaluated their performance against a baseline and against each other using 10-fold cross-validation. All four models performed similarly, with average RMSEs around 0.07, compared to the baseline mean model's RMSE of 0.18. For runtime, both the linear and logistic regression models took a few seconds, the XGBoost model took about one minute, and the neural network took several minutes. This large difference in runtime implies that for far larger, well-structured datasets, traditional methods can match the performance of modern computationally intensive methods in much less time, while also providing key inference that can be used in business applications. The key takeaway of this report is a real-world example of how keeping things simple in data science can yield the best results in the least time. While there are certainly many places for more complex models, it is never a bad idea to start with the simplest possible approach and then gradually increase the complexity.