Data from: https://data.world/datatouille/stephen-curry-stats
Test set from: https://www.basketball-reference.com/players/c/curryst01/gamelog/2019
## Data set
library(forecast)  # ets(), auto.arima(), forecast(), accuracy()
library(fpp3)      # loads tsibble, fable, feasts, and ggplot2 for the tidy workflow
sc <- read.csv("/Users/kevinclifford/Downloads/Stephen Curry Stats.csv", header=TRUE)
actual <- read.csv("/Users/kevinclifford/Downloads/sportsref_download-2.csv", header=TRUE)
ts <- ts(sc$PTS)             # points scored per game as a time series
time <- ts %>% as_tsibble()  # tsibble version for the fable/feasts functions
plot(ts)
autoplot(ACF(time))
## Response variable not specified, automatically selected `var = value`
autoplot(PACF(time))
## Response variable not specified, automatically selected `var = value`
test <- ts(actual$PTS)  # points scored in the 2018-19 test games
The data set covers 878 games of Stephen Curry's career (link above), from 2009 to 2018, and I cleaned it up to include only the points scored in each game. The plot of the data shows no clear trend, as his scoring is fairly constant. Some outliers are non-scoring games, most likely games he did not play. I decided to keep those in, mainly because injuries do occur over the course of a season and I was interested to see how the non-scoring games would affect the models. I go into more detail at the end, but if I were to do an in-depth dive of this data in the future, I would remove those non-scoring games.
The ACF and PACF plots show that the series is not white noise, which suggests some differencing will be required, especially for the ARIMA model.
I also pulled Curry's scoring for the next 69 games he played to use as a test set.
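Since I mention removing the non-scoring games later, here is a minimal sketch of what that cleanup could look like, assuming the PTS column read in above (not run in this analysis):
sc_played <- subset(sc, PTS > 0)  # drop the non-scoring (likely unplayed) games
ts_played <- ts(sc_played$PTS)    # rebuild the series without the zeros
plot(ts_played)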
## ETS Models
fit1 <- ets(ts)
fit1
## ETS(A,N,N)
##
## Call:
## ets(y = ts)
##
## Smoothing parameters:
## alpha = 0.2078
##
## Initial states:
## l = 9.9527
##
## sigma: 10.2654
##
## AIC AICc BIC
## 10044.10 10044.13 10058.43
summary(fit1)
## ETS(A,N,N)
##
## Call:
## ets(y = ts)
##
## Smoothing parameters:
## alpha = 0.2078
##
## Initial states:
## l = 9.9527
##
## sigma: 10.2654
##
## AIC AICc BIC
## 10044.10 10044.13 10058.43
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.1093757 10.25367 7.919077 -Inf Inf 0.8394815 0.08381786
plot(fit1)
accuracy(fit1)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.1093757 10.25367 7.919077 -Inf Inf 0.8394815 0.08381786
fe <- forecast(fit1, 82)               # forecast a full 82-game season
plot(fe, main="ETS")
autoplot(fit1$residuals)
acc1 <- accuracy(fe$mean[1:69], test)  # only 69 test games are available
acc1
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set -2.648204 10.01969 7.994726 -33.57537 46.94167 -0.1161203 0.5181533
## ARIMA
arima1 <- auto.arima(ts)
arima1
## Series: ts
## ARIMA(1,1,2)
##
## Coefficients:
## ar1 ma1 ma2
## 0.8373 -1.5820 0.5893
## s.e. 0.0593 0.0799 0.0759
##
## sigma^2 = 102.4: log likelihood = -3273.55
## AIC=6555.11 AICc=6555.15 BIC=6574.21
summary(arima1)
## Series: ts
## ARIMA(1,1,2)
##
## Coefficients:
## ar1 ma1 ma2
## 0.8373 -1.5820 0.5893
## s.e. 0.0593 0.0799 0.0759
##
## sigma^2 = 102.4: log likelihood = -3273.55
## AIC=6555.11 AICc=6555.15 BIC=6574.21
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.1849934 10.09536 7.977679 -Inf Inf 0.8456938 0.01201767
plot(arima1)
accuracy(arima1)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.1849934 10.09536 7.977679 -Inf Inf 0.8456938 0.01201767
autoplot(arima1$residuals)
fe2 <- forecast(arima1, 82)
plot(fe2, main="Auto-ARIMA")
acc2 <- accuracy(fe2$mean[1:69], test)
acc2
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 6.714087 11.64913 9.506814 8.747294 38.48804 -0.1465925 0.709102
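The residual plot above only hints at whiteness, so a Ljung-Box test would help confirm it. A minimal sketch using forecast::checkresiduals, which was not run in the original analysis:
checkresiduals(arima1)  # residual plot, ACF, and Ljung-Box test for leftover autocorrelation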
## Neural networks
nn1 <- time %>% model(NNETAR(value))
nn1
## # A mable: 1 x 1
## `NNETAR(value)`
## <model>
## 1 <NNAR(8,4)>
fe3 <- nn1 %>%
forecast(h = 82)
nn1 %>%
forecast(h = 82) %>%
autoplot(time) +
labs(x = "Games", y = "Points Scored", title = "Steph Curry Scoring Statistics")
accuracy(nn1)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NNETAR(value) Training 0.0109 9.00 6.93 NaN Inf 0.735 0.717 0.0116
gg_tsresiduals(nn1)
## Warning: Removed 8 row(s) containing missing values (geom_path).
## Warning: Removed 8 rows containing missing values (geom_point).
## Warning: Removed 8 rows containing non-finite values (stat_bin).
acc3 <- accuracy(fe3$.mean[1:69], test)
acc3
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 26.65401 31.02527 26.99635 107.438 110.8494 0.590296 1.971329
acc_nn <- accuracy(nn1)
acc_nn$.model <- NULL  # drop the model-name column
# Accuracy
# Train
rbind(ETS = accuracy(fit1), ARIMA = accuracy(arima1))
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.1093757 10.25367 7.919077 -Inf Inf 0.8394815 0.08381786
## Training set 0.1849934 10.09536 7.977679 -Inf Inf 0.8456938 0.01201767
accuracy(nn1)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NNETAR(value) Training 0.0109 9.00 6.93 NaN Inf 0.735 0.717 0.0116
# Test
rbind(ETS = acc1, ARIMA = acc2, NNAR = acc3)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set -2.648204 10.01969 7.994726 -33.575369 46.94167 -0.1161203 0.5181533
## Test set 6.714087 11.64913 9.506814 8.747294 38.48804 -0.1465925 0.7091020
## Test set 26.654013 31.02527 26.996345 107.437973 110.84939 0.5902960 1.9713286
## Analysis
### Models
The three models I chose to forecast the data are ETS, ARIMA, and neural network models. These are the three from the course that I have felt most comfortable working with. I briefly considered using a VAR model with the scoring or assist statistics of one of Stephen Curry's teammates, but decided to go the route I was more comfortable with. The ets() function in R produced an ETS(A,N,N) model, simple exponential smoothing. This is intuitive, because a basketball season would not have a clear trend or seasonal pattern. There might be differing stretches of scoring for Curry, due to opponent, minutes played, injury, etc., but his production will not change drastically from game to game. If I included more seasons or more players, then perhaps a trend would be clearer. The auto.arima() function in R produced an ARIMA(1,1,2) model, meaning p and d equal to 1 and q equal to 2. I thought the differencing involved might be a bit more; if I were to dive deeper, I would experiment more with differencing of the data set, as in the sketch below. The NNETAR() function in R produced an NNAR(8,4) model, indicating 8 lagged inputs and 4 nodes in the hidden layer. I think this shows some variability that has to be dealt with, especially with the non-scoring games still included.
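As a sketch of that differencing experiment, using the forecast package loaded above (keeping in mind that AICc values are only comparable between models with the same order of differencing):
ndiffs(ts)  # unit-root estimate of the differencing order; auto.arima chose d = 1
arima_d0 <- Arima(ts, order = c(1, 0, 2))  # an undifferenced alternative, fit by hand
arima_d0$aicc  # not directly comparable to arima1's AICc (different series after differencing)
arima1$aicc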
### Fit statistics
Looking at the fit of the three models, the ARIMA model has a lower AIC, AICc, and BIC than the ETS model, although the two are not strictly comparable, since the ARIMA model is fit to the differenced series while the ETS model is fit to the original one. I am still unsure how to examine fit statistics for models from the NNETAR() function (a sketch of what is available follows). However, judging the fit from the plots, the neural network model seems to fit better at the beginning of the forecast period, but it eventually drops and becomes a much worse fit than the other two models. So, by these criteria, the best-fitting model is the ARIMA model.
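As a sketch of what fable can report for the neural network (NNETAR has no likelihood, so AIC-type statistics do not exist for it):
glance(nn1)  # fit statistics fable computes for each model; essentially the residual variance here
report(nn1)  # prints the fitted network structure, e.g. NNAR(8,4)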
### Accuracy Statistics
#### Training set
Examining the accuracy statistics, specifically the RMSE, of all three models, the best performer on the training set is the neural network model, followed closely by the ARIMA model and then the ETS model. All three perform very similarly on the training set.
#### Test set
Introducing the test set is where it becomes apparent that the neural network model does not forecast well. Its forecasts decrease until it actually begins predicting negative points. Perhaps this indicates how much of a mistake it was to keep the non-scoring games in the data set (a possible fix is sketched below). In any case, the neural network becomes the worst performing forecasting model on the test set. Against the test set, the ETS model performs best, followed closely by the ARIMA model.
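One possible fix, sketched below: fable allows transformations inside the model formula and back-transforms the forecasts automatically, so fitting the network on log(value + 1) would keep the point forecasts non-negative (the + 1 handles the non-scoring games). This was not part of the original analysis.
nn_pos <- time %>% model(NNETAR(log(value + 1)))  # refit on a log scale
nn_pos %>%
  forecast(h = 82) %>%  # forecasts are back-transformed, so they cannot go negative
  autoplot(time) +
  labs(x = "Games", y = "Points Scored", title = "NNAR on log(points + 1)")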
### Plots
As mentioned earlier, the plots go a long way toward showing that the initial RMSE performance of the neural network model is not indicative of its forecasting performance. Its forecasts dip, whereas the ARIMA and ETS forecasts are more constant. None of the three models shows a great fit to the data, though the ARIMA forecast is slightly less constant than the ETS forecast. To be honest, based on all of this evaluation it is difficult to choose between the ARIMA and ETS models. The ETS model performing close to the ARIMA on the training data and outperforming it on the test set makes a compelling argument for simple exponential smoothing, while the ARIMA's AIC indicates a better in-sample fit. Examining this data and the overall performance of all three models, I would not be totally satisfied with any of them, and I would look to explore how I could improve these methods (one option is sketched below) or move on to other methods.
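One way to break the tie would be time series cross-validation rather than a single train/test split. A minimal sketch with forecast::tsCV (slow, since the model is refit at every forecast origin; not run here):
e_ets <- tsCV(ts, function(x, h) forecast(ets(x), h = h), h = 1)          # one-step CV errors, ETS
e_arima <- tsCV(ts, function(x, h) forecast(auto.arima(x), h = h), h = 1) # one-step CV errors, ARIMA
sqrt(mean(e_ets^2, na.rm = TRUE))    # CV RMSE, ETS
sqrt(mean(e_arima^2, na.rm = TRUE))  # CV RMSE, ARIMA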
## Limitations/What I would do to build off these initial models
As I said before, there is a lot more I would like to do with this data set. First, I could have removed the non-scoring games. The main reason I did not was that I wanted to see whether injury played a part in creating some sort of trend in scoring: the games following a non-scoring game come right after recovering from an injury, so would scoring go from zero, to an initial period of lower scoring, then a return to the norm (or no return to the norm at all, which would be even more interesting to me)? I also want to investigate the neural network methods further, specifically how to examine model fit compared with the other methods, where I can write a line of code and see AIC fit statistics. I could have included more players and seasons, or introduced different statistics. I am definitely interested in using VAR models with teammates, as well as opposing team defenses (a minimal sketch follows). I wonder if you could produce good models that describe how one player's passing ability helps scorers, and I am becoming increasingly interested in exploring the other avenues these types of forecasting models could go down.
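As a minimal sketch of the VAR idea with the vars package: the teammate series below is a simulated placeholder (the data set above only contains Curry's statistics), so this only shows the shape of the workflow.
library(vars)
set.seed(1)
duo <- data.frame(
  curry_pts = sc$PTS,
  teammate_ast = rpois(nrow(sc), lambda = 7)  # hypothetical placeholder, not real data
)
var_fit <- VAR(duo, lag.max = 8, ic = "AIC")  # lag order chosen by AIC
preds <- predict(var_fit, n.ahead = 82)       # joint 82-game-ahead forecasts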