We designed prediction models using R to predict expected points accumulated throughout the season for teams. We used the AIC and BIC variable selection criteria to select variables, constructed linear models using 2013-2014 to 2017-2018 season data from NHL, and tested them with 2018-2019 season data.
We obtained NHL data from http://www.nhl.com/stats/teams in forms of multiple Excel files and used R to clean them and create the ‘nhl_data.csv’ file that we used for this project. To start with as many variables as possible, we downloaded most of the available data, excluding most of the percentage data, and divided all non-percentage value by 82 to calculate the average-value per game to have normally distributed data. As a result, we were able to start this project with 59 variables.
library(dplyr)
library(MASS)
For this project, we used two libraries; dplyr and MASS. The MASS library has a boxcox function that we used to determine the necessity of linear model transformation, and the dplyr library has functions that made data manipulation much easier.
Total = read.csv("nhl_data.csv")
Total = Total %>%
dplyr::select(-c("X" ,"Team"))
Total = mutate_all(Total, function(x) as.numeric(as.character(x)))
Total = Total[complete.cases(Total), ]
First, we imported the csv file, which contains NHL data we used to make a prediction model. After we imported data, we dropped non-numerical columns and converted all data to a numeric type. In the end, we used complete.case() function to eliminate rows with missing values.
Model_data = Total[Total$Season != 20182019,] %>%
dplyr::select(-c("Season"))
Test_data = Total[Total$Season == 20182019,] %>%
dplyr::select(-c("Season"))
We decided to use 2013-2014 to 2017-2018 season data to build a model and to use the 2018-2019 season data to test the model.
lm.nhl = lm(P~.,data=Model_data)
alias(lm.nhl)
## Model :
## P ~ FOW. + GF + GA + GA.5v5 + GA.4v4 + GA.3v3 + GA.6v5 + GA.6v4 +
## GA.6v3 + GA.5v4 + GA.5v3 + GA.4v3 + GA.3v4 + GA.3v5 + GA.3v6 +
## GA.4v5 + GA.4v6 + GA.5v6 + PS.GA + GF.5v5 + GF.4v4 + GF.3v3 +
## GF.6v5 + GF.6v4 + GF.6v3 + GF.5v4 + GF.5v3 + GF.4v3 + GF.3v4 +
## GF.3v5 + GF.3v6 + GF.4v5 + GF.4v6 + GF.5v6 + PS.GF + Shots.GP +
## SA.GP + Net.Pen.60 + Pen.Taken.60 + TOI_4v5 + TOI_3v5 + TOI_5v4 +
## TOI_5v3 + SAT + SAT.Tied + SAT.Ahead + SAT.Behind + SAT.Close +
## USAT + USAT.Tied + USAT.Ahead + USAT.Behind + USAT.Close +
## Shots + SAT_For + SAT_Agst + USAT_For + USAT_Agst
##
## Complete :
## (Intercept) FOW. GF GA GA.5v5 GA.4v4 GA.3v3 GA.6v5 GA.6v4 GA.5v4
## GA.6v3 0 0 0 0 0 0 0 0 0 0
## GF.3v6 0 0 0 0 0 0 0 0 0 0
## SAT.Behind 0 0 0 0 0 0 0 0 0 0
## USAT.Behind 0 0 0 0 0 0 0 0 0 0
## SAT_Agst 0 0 0 0 0 0 0 0 0 0
## USAT_Agst 0 0 0 0 0 0 0 0 0 0
## GA.5v3 GA.4v3 GA.3v4 GA.3v5 GA.3v6 GA.4v5 GA.4v6 GA.5v6 PS.GA
## GA.6v3 0 0 0 0 0 0 0 0 0
## GF.3v6 0 0 0 0 0 0 0 0 0
## SAT.Behind 0 0 0 0 0 0 0 0 0
## USAT.Behind 0 0 0 0 0 0 0 0 0
## SAT_Agst 0 0 0 0 0 0 0 0 0
## USAT_Agst 0 0 0 0 0 0 0 0 0
## GF.5v5 GF.4v4 GF.3v3 GF.6v5 GF.6v4 GF.6v3 GF.5v4 GF.5v3 GF.4v3
## GA.6v3 0 0 0 0 0 0 0 0 0
## GF.3v6 0 0 0 0 0 0 0 0 0
## SAT.Behind 0 0 0 0 0 0 0 0 0
## USAT.Behind 0 0 0 0 0 0 0 0 0
## SAT_Agst 0 0 0 0 0 0 0 0 0
## USAT_Agst 0 0 0 0 0 0 0 0 0
## GF.3v4 GF.3v5 GF.4v5 GF.4v6 GF.5v6 PS.GF Shots.GP SA.GP Net.Pen.60
## GA.6v3 0 0 0 0 0 0 0 0 0
## GF.3v6 0 0 0 0 0 0 0 0 0
## SAT.Behind 0 0 0 0 0 0 0 0 0
## USAT.Behind 0 0 0 0 0 0 0 0 0
## SAT_Agst 0 0 0 0 0 0 0 0 0
## USAT_Agst 0 0 0 0 0 0 0 0 0
## Pen.Taken.60 TOI_4v5 TOI_3v5 TOI_5v4 TOI_5v3 SAT SAT.Tied SAT.Ahead
## GA.6v3 0 0 0 0 0 0 0 0
## GF.3v6 0 0 0 0 0 0 0 0
## SAT.Behind 0 0 0 0 0 1 -1 -1
## USAT.Behind 0 0 0 0 0 0 0 0
## SAT_Agst 0 0 0 0 0 -1 0 0
## USAT_Agst 0 0 0 0 0 0 0 0
## SAT.Close USAT USAT.Tied USAT.Ahead USAT.Close Shots SAT_For
## GA.6v3 0 0 0 0 0 0 0
## GF.3v6 0 0 0 0 0 0 0
## SAT.Behind 0 0 0 0 0 0 0
## USAT.Behind 0 1 -1 -1 0 0 0
## SAT_Agst 0 0 0 0 0 0 1
## USAT_Agst 0 -1 0 0 0 0 0
## USAT_For
## GA.6v3 0
## GF.3v6 0
## SAT.Behind 0
## USAT.Behind 0
## SAT_Agst 0
## USAT_Agst 1
First, we created a linear model using all variables available. Then, we used the alias function to find linear dependency between variables. This step is vital since it is impossible to evaluate the linear model when variables have a dependency. Alias function suggested that “GA.6v3”, “GF.3v6”, “SAT.Behind”, “USAT.Behind”, “SAT_Agst”, and “USAT_Agst” are depended on other variables.
New_model_data = Model_data %>%
dplyr::select(-c("GA.6v3","GF.3v6","SAT.Behind","USAT.Behind","SAT_Agst","USAT_Agst"))
lm.nhl_new = lm(P~.,data=New_model_data)
Now that we found variables with dependency, we removed them and created a new linear model.
Usually, the next step would be checking the variance inflation factor(VIF) of each variable and eliminating the ones with VIF higher than 5. However, we decided to skip this step and let the AIC and BIC method exclude variables with a highly inflated coefficient.
Side Note: A principle behind the forward AIC or forward BIC method is estimating/comparing the model’s quality relative to other models with one extra variable and selecting the best performing model at each step. Naturally, this process will remove variables with high VIF since such a model with variables with high VIF will show worse performance.
par(mfrow=c(2,2))
plot(lm.nhl_new)
Before analyzing the model further, we checked two assumptions; normality assumption and constant variance assumption. The approximately straight line on the Normal Q-Q plot indicated that the normality assumption is not violated. The roughly horizontal line on the Residual vs. Fitted plot pointed out that the constant variance assumption is plausible.
Side Note: The Normal Q-Q plot shows the distribution of the data against the normal distribution, and the Residual vs. Fitted plot shows residuals against fitted values. These two plots are widly used to check normality and constant variance assumptions, respectively.
boxcox(lm.nhl_new)
Even though the normality assumption was not violated, we checked whether a transformation is needed using the boxcox function. 95% C.I. of lamda was about [0.9, 1.6], and since the interval contained 1, we decided that further transformation is not needed
forward.aic = step(lm(P~1,data=Model_data),direction="forward",
trace=1,scope=list(upper = lm.nhl_new),k=2)
AIC variable selection criteria method suggested that the following model showed the best performance: ‘P ~ GF + GA + GA.3v5 + GF.4v5 + GA.6v5 + GF.4v6 + GF.5v4 + Pen.Taken.60 + SAT.Ahead + USAT_For + SAT_For + GF.5v5 + GF.3v4 + FOW.’.
lm.nhl_selected = lm(P ~ GF + GA + GA.3v5 + GF.4v5 + GA.6v5 + GF.4v6 + GF.5v4 + Pen.Taken.60 + SAT.Ahead + USAT_For + SAT_For + GF.5v5 + GF.3v4 + FOW. ,data=Test_data)
plot(x = Test_data$P,
y = lm.nhl_selected$fitted.values,
xlab = "True Values",
ylab = "Model Fitted Values")
abline(b = 1, a = 0)
mse = sum((Test_data$P-lm.nhl_selected$fitted.values)^2)/length(Test_data$P)
mse
## [1] 0.0006655701
summary(lm.nhl_selected)$adj.r.squared
## [1] 0.9534819
To evaluate the performance of this model, we plotted model fitted values against actual values. From the plot, we noticed that the model showed excellent accuracy. Then, we calculated mean squared error (MSE) to find numerical evidence to support our model accuracy, and MSE turned out to be about 0.0006656 Also, the R^2 value of 0.953 indicated that the model’s inputs explained about 96.2% of the observed variation. Overall, our AIC model was accurate and explained almost all of the variation.
forward.bic <- step(lm(P~1,data=Model_data),direction="forward",
trace=1,scope=list(upper = lm.nhl_new),k=log(nrow(Model_data)))
BIC variable selection criteria method suggested that the following model showed the best performance: ‘P ~ GF + GA + GA.3v5 + GF.4v5 + GA.6v5 + GF.4v6’. Compared to the constructed model using the AIC method, the BIC model had fewer variables, making it easier to interpret.
lm.nhl_selected = lm(P ~ GF + GA + GA.3v5 + GF.4v5 + GA.6v5 + GF.4v6 ,data=Test_data)
plot(x = Test_data$P,
y = lm.nhl_selected$fitted.values,
xlab = "True Values",
ylab = "Model Fitted Values")
abline(b = 1, a = 0)
mse = sum((Test_data$P-lm.nhl_selected$fitted.values)^2)/length(Test_data$P)
mse
## [1] 0.001021989
summary(lm.nhl_selected)$adj.r.squared
## [1] 0.9523806
We applied the same method to evaluate the BIC model, and we got an MSE of 0.001021989 and an R^2 value of 0.95238. Compared to the AIC model, it was less accurate; however, it still was a great model to predict the expected points. Considering that the BIC model has fewer variables and easier to interpret, we could use these two models for different situations.
In this project, we built two different models to predict NHL teams’ season performance and those two models had a clear advantage and disadvantage. The AIC model had higher accuracy and R^2 value; however, it contained variables that were hard to understand. Whereas the BIC model was slightly less accurate than the AIC model, it still was precise and the selected variables were easier to comprehend.
This difference is due to the different approaches of those two variable selection criteria. The AIC approach aims to find the most accurate model, even if it risks overfitting. Meanwhile, the BIC approach is a bit more conservative and only includes essential variables.
Knowing the difference between those two methods, we can use them for a different purpose. If our goal is to find the most numerically accurate model, AIC is the way to go. If our goal is to build a model that will help us understand data and significant variables, BIC is your go-to.