The goal of this analysis is to develop and evaluate a vector autoregression model to forecast point differential and all associated predictors. The data to be modeled is college basketball score differentials based on data downloaded from sports-reference.com. Each college basketball team has a page containing a csv file of their game logs for the season. Game logs are downloaded for 351 teams between the 2012-2013 and 2016-2017 seasons. All regular season and post-season games are included. Fields such as points for and points again are condensed to differentials (for each variable we use n_for - n_against
). The data is then scaled and centered as it will be for the actual model building process.
Next we want to try to determine the variables that correlate well with one another (as well as with the target variable - point differential). Most variables, to varying degrees, correlate predictably with point differential. For example, teams that turn over the ball less frequently tend to have higher point differentials as do teams that score more three pointers than their opponents. Offensive rebounding and free throw shooting have the weakest correlations with point differential.
Basketball has some different properties than stock data (which we typically use for these projects) particularly in scheduling. The market has 252 trading days per year but the best two teams in the country play 40 games. Additionally, investors, for the most part, make projections based on market factors that transcend years while basketball teams gain and lose players and coaching talent annually. So instead of training a VAR model on the whole dataset, the models are trained separately on 10 team-seasons for the first 20 games per season. The teams are randomly selected, but include Virginia Tech because go Hokies!
library(vars)
library(forecast)
# Generate dataframes and ts objects for each team/season combo
teams <- c(sample(game_data$team, 9), 'virginia tech')
team_dfs <- list()
team_ts <- list()
for(i in seq(1,length(teams))){
team_dfs[[i]] <- game_data%>%
filter(team == teams[i] & g <= 20 & year == 2015)%>%
dplyr::select(-g, -team, -win, -year, -loc.away, -orb.d, -ft.d)
team_ts[[i]] <- ts(team_dfs[[i]])
}
autoplot_lists <- list()
for(i in 1:10){
autoplot_lists[[i]] <- autoplot(team_ts[[i]][,'point.d'])+
theme_yaz()+
labs(title = teams[i], y = 'Point Differential', x = 'Games Played')+
geom_hline(yintercept = 0, linetype = 'dashed')
}
library(gridExtra)
do.call("grid.arrange", c(autoplot_lists, ncol=5, top = 'Time Series of Point Differential By Game'))

All of the teams’ point differentials represent stationary white noise time series. That means no differencing is required!
acf_list <- list()
for(i in 1:10){
acf_list[[i]] <- ggAcf(team_ts[[i]][,'point.d'])+
theme_yaz()+
labs(title = teams[i], y = 'ACF', x = 'Lag')
}
do.call("grid.arrange", c(acf_list, ncol=5, top = 'AutoCorrelations by Team'))

The VARselect
function output calculates several information criteria to determine the optimal lag point for a given set of variables. In this case, we’ll use all variables in the above graphic except for free throw differential and offensive rebound differential. Number of lags (\(p\)) is selected using Final Prediction Error. For most teams, the preferred lag is \(p = 1\), but some use \(p = 2\), but for some reason, using \(p = 2\) resulted in null forecasts so \(p = 1\) was applited to all teams.
var.mods <- list()
for(i in 1:10){var.mods[[i]] <- VAR(team_ts[[i]], p = 1, type = 'const')}
The forecast plots are presented below looking five games into the future. The confidence intervals of the projections are narrow at first and grow as the forecast gets farther from the final game of the dataset, but that’s OK because a model can be retrained on a daily basis during the season. While this is a good start, the individual predictor variables are not all strong enough predictors of one another to feed consistent, trustworthy forecasts. My next step will be to try some other approaches like a naiive model or simple exponential smoothing that can rely on just the individual time series.
fcast.plots <- list()
for(i in 1:10){
fcast.plots[[i]] <- autoplot(forecast(var.mods[[i]], h = 5))+
theme_yaz()+
labs(x = 'Games Played', title = teams[i])
}
fcast.plots
[[1]]
[[2]]
[[3]]
[[4]]
[[5]]
[[6]]
[[7]]
[[8]]
[[9]]
[[10]]










---
title: "R Notebook"
output: html_notebook
---

The goal of this analysis is to develop and evaluate a vector autoregression model to forecast point differential and all associated predictors. The data to be modeled is college basketball score differentials based on data downloaded from sports-reference.com. Each college basketball team has a page containing a csv file of their game logs for the season. Game logs are downloaded for 351 teams between the 2012-2013 and 2016-2017 seasons. All regular season and post-season games are included. Fields such as points for and points again are condensed to differentials (for each variable we use `n_for - n_against`). The data is then scaled and centered as it will be for the actual model building process.


```{r, echo = TRUE}
library(readr)
library(dplyr)
library(lubridate)
game_data.raw <- read_csv('C:/Users/joshy/Desktop/cbb-stats/cleaned_games.csv')%>%
  mutate(loc.away = ifelse(location == 'away', 1, 0),
         loc.home = ifelse(location == 'home', 1, 0),
         win = ifelse(outcome == 'W', 1, 0),
         year = year(Date))

game_data <- game_data.raw%>%
  mutate(point.d = points_for - points_against,
         fg.d = FGpct.for - FGpct.against,
         three.d = `3Ppct.for` - `3Ppct.against`,
         ft.d = FTpct.for - FTpct.against,
         orb.d = ORB.for - ORB.against,
         drb.d = (TRB.for - ORB.for) - (TRB.against - ORB.against),
         ast.d = AST.for - AST.against,
         stl.d = STL - STL.against,
         blk.d = BLK - BLK.against,
         tov.d = TOV - TOV.against,
         pf.d = PF - PF.against)%>%
  dplyr::select(point.d, fg.d, three.d, ft.d, 
         orb.d, drb.d, ast.d, stl.d, blk.d, tov.d, pf.d)%>%
  scale(scale = T, center = T)%>%
  data.frame()%>%
  bind_cols(game_data.raw%>%
              dplyr::select(g = G, team, loc.away, loc.home, win, year))

head(game_data, 100)
```

Next we want to try to determine the variables that correlate well with one another (as well as with the target variable - point differential). Most variables, to varying degrees, correlate predictably with point differential. For example, teams that turn over the ball less frequently tend to have higher point differentials as do teams that score more three pointers than their opponents. Offensive rebounding and free throw shooting have the weakest correlations with point differential. 

```{r, fig.height=10, fig.width=10, message=FALSE, warning = FALSE, results = 'hide', echo = TRUE}
library(ggplot2)
library(yaztheme)
pm <- GGally::ggpairs(game_data%>%dplyr::select(-g, -team, -win, -year, -loc.away),
                      title = 'Distribution and Relationships of Variables')+
  yaztheme::theme_yaz()
pm
```

Basketball has some different properties than stock data (which we typically use for these projects) particularly in scheduling. The market has 252 trading days per year but the best two teams in the country play 40 games. Additionally, investors, for the most part, make projections based on market factors that transcend years while basketball teams gain and lose players and coaching talent annually. So instead of training a VAR model on the whole dataset, the models are trained separately on 10 team-seasons for the first 20 games per season. The teams are randomly selected, but include Virginia Tech because go Hokies!

```{r, fig.width=11, fig.height=5, echo = TRUE}
library(vars)
library(forecast)
# Generate dataframes and ts objects for each team/season combo
teams <- c(sample(game_data$team, 9), 'virginia tech')

team_dfs <- list()
team_ts <- list()

for(i in seq(1,length(teams))){
  team_dfs[[i]] <- game_data%>%
    filter(team == teams[i] & g <= 20 & year == 2015)%>%
    dplyr::select(-g, -team, -win, -year, -loc.away, -orb.d, -ft.d)
  team_ts[[i]] <- ts(team_dfs[[i]])
}

autoplot_lists <- list()

for(i in 1:10){
  autoplot_lists[[i]] <- autoplot(team_ts[[i]][,'point.d'])+
    theme_yaz()+
    labs(title = teams[i], y = 'Point Differential', x = 'Games Played')+
    geom_hline(yintercept = 0, linetype = 'dashed')
}  

library(gridExtra)

do.call("grid.arrange", c(autoplot_lists, ncol=5, top = 'Time Series of Point Differential By Game'))
```

All of the teams' point differentials represent stationary white noise time series. That means no differencing is required!
```{r, fig.width=11, fig.height=5, echo = TRUE}
acf_list <- list()

for(i in 1:10){
  acf_list[[i]] <- ggAcf(team_ts[[i]][,'point.d'])+
    theme_yaz()+
    labs(title = teams[i], y = 'ACF', x = 'Lag')
}  

do.call("grid.arrange", c(acf_list, ncol=5, top = 'AutoCorrelations by Team'))
```

The `VARselect` function output calculates several information criteria to determine the optimal lag point for a given set of variables. In this case, we'll use all variables in the above graphic except for free throw differential and offensive rebound differential. Number of lags ($p$) is selected using Final Prediction Error. For most teams, the preferred lag is $p = 1$, but some use $p = 2$, but for some reason, using $p = 2$ resulted in null forecasts so $p = 1$ was applited to all teams.

```{r, echo = TRUE}
var.mods <- list()
for(i in 1:10){var.mods[[i]] <- VAR(team_ts[[i]], p = 1, type = 'const')}
```

The forecast plots are presented below looking five games into the future. The confidence intervals of the projections are narrow at first and grow as the forecast gets farther from the final game of the dataset, but that's OK because a model can be retrained on a daily basis during the season. While this is a good start, the individual predictor variables are not all strong enough predictors of one another to feed consistent, trustworthy forecasts. My next step will be to try some other approaches like a naiive model or simple exponential smoothing that can rely on just the individual time series. 
```{r, fig.height=12, fig.width = 6, echo = TRUE}
fcast.plots <- list()
for(i in 1:10){
  fcast.plots[[i]] <- autoplot(forecast(var.mods[[i]], h = 5))+
    theme_yaz()+
    labs(x = 'Games Played', title = teams[i])
}

fcast.plots
```

