We will perform a time series analysis of NBA search interest. The data was collected from Google Trends and covers the past five years up to the present day.
library(xgboost)
library(tidymodels)
library(modeltime)
library(tidyverse)
library(lubridate)
library(timetk)
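# Read the Google Trends export; as_tibble() replaces the deprecated as.tibble()
nba <- read.csv("Data/NBA.csv") %>% as_tibble()
# Parse the day-month-year strings into Date values and keep only the columns we need
nba <- nba %>%
mutate(date = dmy(DMY)) %>%
select(date, NBA)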
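A quick sanity check on the parsed dates can catch format problems before any plotting. The sketch below uses a small made-up data frame in place of the real export, and assumes the raw strings look like `05/01/2020`; the actual layout of the `DMY` column may differ. It uses base R's `as.Date()` as a stand-in for `dmy()` for that one format.

```r
# Toy rows standing in for the raw export (values are invented for illustration)
raw <- data.frame(
  DMY = c("05/01/2020", "12/01/2020", "19/01/2020"),
  NBA = c(40, 42, 45)
)

# Base-R stand-in for lubridate::dmy(), assuming a dd/mm/yyyy layout
raw$date <- as.Date(raw$DMY, format = "%d/%m/%Y")

# Every string parsed to a date (unparseable strings become NA)
all_parsed <- !anyNA(raw$date)

# Dates are strictly increasing, so there are no duplicates or out-of-order rows
strictly_increasing <- all(diff(as.numeric(raw$date)) > 0)

print(c(all_parsed = all_parsed, strictly_increasing = strictly_increasing))
```

If either check fails on the real data, the date format passed to the parser is the first thing to look at.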
After cleaning and ensuring all data is in the correct format, we can now perform some data exploration.
nba %>%
plot_time_series(
.date_var = date,
.value = NBA,
.title = 'NBA Time Series Plot'
)
plot_time_series shows how search interest varies across the years we have collected. As seen in the plot above, there is clear seasonality: the highs most likely correspond to the finals series, and the lows to the off-season, when no games are played.
nba %>%
plot_seasonal_diagnostics(
.date_var = date,
.value = NBA,
.feature_set = c('month.lbl','year'),
.title = 'NBA Seasonal Diagnostics'
)
plot_seasonal_diagnostics gives us a more detailed view of the seasonal patterns. Here we have restricted the feature set to monthly and yearly views.
Looking at the monthly diagnostics, we can see that the low months of July to September are the off-season, October to April is the regular season, and April to June is when the finals series takes place.
Looking at the yearly diagnostics from 2020 to 2025, we can see that the years are quite consistent. 2020 and 2021 are the only two abnormal years: the 2020 season was disrupted by COVID, with all teams moved into a bubble, and the 2021 season had a reduced number of games, which changed the season schedule.
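The monthly pattern described above can also be checked numerically by averaging the series per calendar month. The sketch below is only an illustration on synthetic data: the `toy` data frame and its values are invented, with a deliberately low July to September period, and it uses base R rather than timetk.

```r
# Synthetic weekly series (hypothetical values): low in July-September, high otherwise
dates <- seq(as.Date("2022-01-02"), as.Date("2023-12-31"), by = "week")
month_num <- as.integer(format(dates, "%m"))
toy <- data.frame(
  date  = dates,
  month = month_num,
  NBA   = ifelse(month_num %in% 7:9, 10, 50)
)

# Mean search interest per calendar month, pooled across years
monthly_mean <- aggregate(NBA ~ month, data = toy, FUN = mean)
print(monthly_mean)
```

On the real `nba` tibble the same aggregation would give one number per month, which makes the off-season dip easy to quantify.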
nba %>%
plot_anomaly_diagnostics(
.date_var = date,
.value = NBA,
.title = 'NBA Anomalies'
)
plot_anomaly_diagnostics helps us see whether any values are abnormally high or low. This allows us to research why an anomaly occurred and whether it affected our data positively or negatively.
As seen in the plot, with .alpha left at its default of 0.05, there are no anomalies to report.
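To give a feel for what .alpha controls, here is a rough base-R sketch of an IQR-style outlier rule. It is not timetk's implementation: the helper `iqr_anomalies()` is hypothetical, the multiplier `0.15 / alpha` (3 times the IQR at alpha = 0.05) is an assumption modelled on how this kind of rule is commonly described, and the input is a made-up remainder series rather than the NBA data.

```r
# Hypothetical helper: flag values far outside the interquartile range.
# Assumption: the fence width scales as 0.15 / alpha, i.e. 3 x IQR at alpha = 0.05.
iqr_anomalies <- function(x, alpha = 0.05) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  k <- 0.15 / alpha
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Made-up remainder series: evenly spread values plus one obvious spike at the end
remainder <- c(seq(-2, 2, length.out = 101), 15)

flags_default <- iqr_anomalies(remainder)               # wide fences
flags_loose   <- iqr_anomalies(remainder, alpha = 0.5)  # narrow fences

print(sum(flags_default))
print(sum(flags_loose))
```

Under this rule a smaller alpha widens the fences and flags fewer points, which is consistent with seeing no anomalies at the default setting.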
nba %>%
plot_stl_diagnostics(
.date_var = date,
.value = NBA,
.facet_scales = 'free',
.feature_set = c('observed','trend')
)
plot_stl_diagnostics further confirms that the data set is strongly seasonal within each year.
The next step is to create train and test sets so that we can train our forecasting models and see how they perform on unseen data.
splits <- initial_time_split(nba)
train <- training(splits)
test <- testing(splits)
splits %>%
tk_time_series_cv_plan() %>%
plot_time_series_cv_plan(
.date_var = date,
.title = 'NBA Data Training & Testing Split',
.value = NBA)
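For intuition, a chronological split simply keeps the earliest rows for training and the most recent rows for testing, with no shuffling. The base-R sketch below mimics that on a made-up 20-row series; the 0.75 proportion matches what we understand to be the default of initial_time_split(), and the `toy` data frame is invented for illustration.

```r
# Made-up weekly series with 20 observations
toy <- data.frame(
  date = seq(as.Date("2024-01-07"), by = "week", length.out = 20),
  NBA  = 1:20
)

# Assumed proportion: 0.75, taken to be the default of initial_time_split()
prop <- 0.75
n_train <- floor(nrow(toy) * prop)

# Earliest rows go to training, the remaining most recent rows to testing
train_toy <- toy[seq_len(n_train), ]
test_toy  <- toy[(n_train + 1):nrow(toy), ]

print(c(train = nrow(train_toy), test = nrow(test_toy)))
```

Because the test rows all come after the training rows, the models are always evaluated on data that is later than anything they were fitted on.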
Now that we have created our train and test data, we can build our forecasting models and visualise how accurate each one is on the test data.
arima_fit <- arima_reg() %>%
set_engine("auto_arima") %>%
fit(NBA ~ date, data = train)
arima_boost_fit <- arima_boost() %>%
set_engine("auto_arima_xgboost") %>%
fit(NBA ~ date, data = train)
ets_fit <- exp_smoothing() %>%
set_engine("ets") %>%
fit(NBA ~ date, data = train)
prophet_fit <- prophet_reg() %>%
set_engine("prophet") %>%
fit(NBA ~ date, data = train)
model_tbl <- modeltime_table(
arima_fit,
arima_boost_fit,
ets_fit,
prophet_fit)
calibrate_tbl <- model_tbl %>%
modeltime_calibrate(new_data = test)
calibrate_tbl %>%
modeltime_forecast(
actual_data = nba,
new_data = test
) %>%
plot_modeltime_forecast()
The forecast plot is interactive (it is rendered with plotly). We can see that only the Prophet model captures the seasonality well.
calibrate_tbl %>%
modeltime_accuracy()
## # A tibble: 4 × 9
## .model_id .model_desc .type mae mape mase smape rmse rsq
## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 ARIMA(2,0,3) WITH NON-Z… Test 14.2 117. 3.27 57.4 18.0 0.0821
## 2 2 ARIMA(2,0,0) WITH NON-Z… Test 12.9 96.8 2.95 54.3 16.5 0.176
## 3 3 ETS(M,A,N) Test 28.5 113. 6.54 166. 34.3 0.00825
## 4 4 PROPHET Test 4.38 21.1 1.01 22.2 5.68 0.900
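For reference, the main error metrics in a table like this can be computed by hand. The sketch below uses four invented actual and predicted values, not the NBA data, and computes R2 as the squared correlation between actual and predicted values, which is our understanding of how the rsq column is defined.

```r
# Invented values for illustration only
actual    <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 37)

err <- actual - predicted

mae  <- mean(abs(err))                 # mean absolute error
rmse <- sqrt(mean(err^2))              # root mean squared error
mape <- mean(abs(err / actual)) * 100  # mean absolute percentage error, in percent
rsq  <- cor(actual, predicted)^2       # squared correlation (assumed rsq definition)

print(c(mae = mae, rmse = rmse, mape = mape, rsq = rsq))
```

Lower is better for MAE, RMSE and MAPE, while R2 lies between 0 and 1 and higher is better.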
modeltime_accuracy generates a table of accuracy statistics (such as R2 and RMSE) for each model. We can see that Prophet performed best, with an RMSE of 5.7 and an R2 of 0.90. The table also tells us that ETS, auto ARIMA, and the XGBoost-boosted ARIMA do not forecast this series well. Manually specified ARIMA models could potentially produce better results, as you can tune parameters such as the moving-average and autoregressive orders yourself.
So far we have analysed the NBA data set, fitted models on the training data, and analysed how each model performed on the test data.
The next step is to use these models to forecast into the future. This gives us a general idea of how the series may behave going forward, based on the trends in the historical data.
refit_tbl <- calibrate_tbl %>%
modeltime_refit(data = nba)
refit_tbl %>%
modeltime_forecast(h = '5 years', actual_data = nba) %>%
plot_modeltime_forecast()