Previously, we designed prediction models in R that predict the points a team accumulates over an NHL season, using AIC/BIC criteria for variable selection. This time, we used Boruta feature selection to choose the variables, constructed a linear model using NHL data from the 2013-2014 through 2017-2018 seasons, and tested it with 2018-2019 season data.
library(dplyr)
library(Boruta)
For this project, we used two libraries: dplyr and Boruta. The Boruta library provides the Boruta() function for feature selection, and the dplyr library provides functions for data manipulation.
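Both packages are on CRAN, so if either is missing it can be installed with a one-time setup along these lines:
install.packages(c("dplyr", "Boruta"))  # one-time install from CRAN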
Total = read.csv("nhl_data.csv")
# Drop the non-numeric identifier columns, coerce everything else to numeric,
# and keep only the rows with no missing values
Total = Total %>%
  dplyr::select(-c("X", "Team"))
Total = mutate_all(Total, function(x) as.numeric(as.character(x)))
Total = Total[complete.cases(Total), ]
First, we imported the CSV file containing the NHL data used to build the prediction model. After importing the data, we dropped the non-numeric columns and converted all remaining columns to numeric type. Finally, we used the complete.cases() function to eliminate rows with missing values.
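As a quick sanity check that we are adding here (it was not part of the original analysis), one can confirm that the cleaning left no missing values and that every column is numeric:
stopifnot(!anyNA(Total))  # all remaining rows should be complete
str(Total)                # every column should now be numeric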
Model_data = Total[Total$Season != 20182019, ] %>%
  dplyr::select(-c("Season"))
Test_data = Total[Total$Season == 20182019, ] %>%
  dplyr::select(-c("Season"))
We decided to use the 2013-2014 through 2017-2018 season data to build the model and the 2018-2019 season data to test it.
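To confirm the split, one can tabulate the seasons present in the cleaned data and check the resulting row counts (a quick check we are adding here, not part of the original write-up):
table(Total$Season)                   # number of team-rows per season
c(nrow(Model_data), nrow(Test_data))  # training rows vs. test rows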
point_Boruta = Boruta(P ~ ., data = Model_data)
plot(point_Boruta, las = 2, xlab = "")
getSelectedAttributes(point_Boruta, withTentative = F)
## [1] "GF" "GA" "GA.5v5" "GA.6v5" "GA.4v5"
## [6] "GF.5v5" "GF.4v4" "GF.5v4" "GF.5v6" "SA.GP"
## [11] "SAT" "SAT.Behind" "SAT.Close" "USAT" "USAT.Tied"
## [16] "USAT.Behind" "USAT.Close" "USAT_Agst"
Unlike with the AIC/BIC method, we did not have to construct a linear model and check multicollinearity and other assumptions before proceeding; the Boruta method gave us a much more straightforward variable selection process. The main reason we could skip those steps is that Boruta aims to find all relevant predictors rather than an optimal subset of predictors, so multicollinearity is not the main concern when building a model with Boruta.
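Boruta also reports per-variable importance statistics, and some attributes can be left as "tentative" when a run is too short to decide. A short sketch of how one might inspect and resolve these with the Boruta package's helper functions (not part of the original run):
attStats(point_Boruta)                         # importance statistics and decision per variable
point_final = TentativeRoughFix(point_Boruta)  # force a decision on any tentative attributes
getSelectedAttributes(point_final)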
The Boruta feature selection method selected the following variables: “GF”, “GA”, “GA.5v5”, “GA.6v5”, “GA.4v5”, “GF.5v5”, “GF.4v4”, “GF.5v4”, “GF.5v6”, “SA.GP”, “SAT”, “SAT.Behind”, “SAT.Close”, “USAT”, “USAT.Tied”, “USAT.Behind”, “USAT.Close”, and “USAT_Agst”.
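Rather than typing the long formula below by hand, the same formula can be built programmatically from the selected attributes; a minimal sketch using base R's reformulate() (the variable names here are our own):
selected_vars = getSelectedAttributes(point_Boruta, withTentative = F)
boruta_formula = reformulate(selected_vars, response = "P")  # P ~ GF + GA + ...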
lm.nhl_selected = lm(P ~ GF + GA + GA.5v5 + GA.6v5 + GA.4v5 + GF.5v5 + GF.4v4 +
                       GF.5v4 + GF.5v6 + SA.GP + SAT + SAT.Behind + SAT.Close +
                       USAT + USAT.Tied + USAT.Behind + USAT.Close + USAT_Agst,
                     data = Model_data)
# Predict the held-out 2018-2019 season and compare against the true point totals
pred = predict(lm.nhl_selected, newdata = Test_data)
plot(x = Test_data$P,
     y = pred,
     xlab = "True Values",
     ylab = "Model Predicted Values")
abline(a = 0, b = 1)  # reference line: perfect predictions
mse = mean((Test_data$P - pred)^2)
mse
## [1] 0.0005133821
summary(lm.nhl_selected)$adj.r.squared
## [1] 0.9521581
To evaluate the performance of this model, we plotted its predicted values against the actual values. The plot showed excellent accuracy, even compared to the AIC method we used previously. The mean squared error (MSE) was about 0.000513, better than the AIC method's MSE of 0.0006656, while the adjusted R^2 value was 0.9522, only slightly lower than the AIC method's 0.953.
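For a side-by-side view, the reported metrics from both runs can be collected into a small data frame (the values are taken directly from this post and the previous one):
comparison = data.frame(
  Method = c("Boruta", "AIC"),
  MSE = c(0.0005134, 0.0006656),
  Adj_R2 = c(0.9522, 0.953)
)
comparison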
In this project, we built a model to predict NHL teams' season performance using Boruta feature selection. What was surprising about the result is that the Boruta model had an even lower MSE than the model built with the AIC method. Even though it is hard to say that the Boruta method is better than the AIC method, since its model has a slightly lower adjusted R^2 value, it was interesting to use another tool to create a prediction model.
If we were to point out a problem, it would be that the Boruta model's interpretation can be challenging, just like the AIC model's, since many variables in the model do not seem important from a hockey point of view. Hence this method is more suitable for finding a general relationship between the predictors and the outcome than for building a parsimonious model. The next step for this project would be to use a much bigger dataset to test whether one method consistently outperforms the other.
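Until a bigger dataset is available, a season-wise cross-validation over the data we already have would give a fairer comparison than a single held-out season. A minimal leave-one-season-out sketch (our own suggestion, reusing the hypothetical boruta_formula built above):
# Refit the Boruta-selected model with each season held out in turn
seasons = unique(Total$Season)
cv_mse = sapply(seasons, function(s) {
  train = Total[Total$Season != s, ] %>% dplyr::select(-c("Season"))
  test = Total[Total$Season == s, ] %>% dplyr::select(-c("Season"))
  fit = lm(boruta_formula, data = train)
  mean((test$P - predict(fit, newdata = test))^2)
})
mean(cv_mse)  # average out-of-sample MSE across seasons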