Hockey has been a cornerstone of culture at R.I.T since the team founding in 1957. The purpose of this project is to predict the outcome of a RIT Men`s hockey match based on key indicators in the 1st and 2nd periods. These indicators include point differences, time in power play differences and whether the game is played at home or away. The full project can be accessed on github here.

require(ggplot2)
require(dplyr)
require(corrplot)
require(MASS)
require(class)

1-Data Preperation

1.1 Load Data

The RIT men’s hockey team website displays data of all past games in an unstructured format across multiple pages. The data shows the game location, the date the game was played and various in-game statistics. The data was collected into an Excel CSV file. Our training data includes all matches from the 2016 - 2017 up to the 2018 - 2019 hockey seasons. The test data is from the 2019 - 2020 hockey season. The full data dictionary can be found here.. A glimpse of the raw training data is found below.

dirdata <- "/Users/endererimhan/Documents/Capstone/"
hockey_raw <- read.table(paste0(dirdata,"HData2.CSV"), header = TRUE, sep = ",")
hockey_raw_test <- read.table(paste0(dirdata,"TestHData2.CSV"), header = TRUE, sep = ",")
glimpse(hockey_raw)
Rows: 114
Columns: 14
$ Date     <chr> "10/1/16", "10/7/16", "10/8/16", "10/15/16", "10/21/16", "10/22/16", "10/28/16", "10/29/16", "…
$ Location <chr> "H", "H", "H", "H", "A", "A", "A", "A", "A", "H", "A", "A", "H", "H", "A", "A", "A", "H", "H",…
$ p1T      <int> 1, 1, 3, 0, 0, 1, 2, 3, 0, 0, 0, 1, 0, 1, 1, 1, 2, 0, 0, 0, 1, 1, 2, 1, 1, 0, 0, 1, 1, 0, 0, 1…
$ p2T      <int> 1, 3, 2, 0, 1, 2, 0, 2, 0, 1, 1, 2, 1, 0, 1, 4, 3, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 1, 0, 0, 0…
$ p3T      <int> 1, 2, 0, 1, 1, 0, 2, 1, 0, 0, 2, 0, 5, 0, 2, 2, 1, 1, 1, 0, 2, 0, 3, 2, 0, 0, 0, 0, 2, 1, 0, 1…
$ p1G      <int> 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 2, 3, 1, 0, 1, 0, 1, 2, 2, 2, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1…
$ p2G      <int> 0, 2, 4, 0, 2, 3, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 3, 1, 2, 1, 0, 0, 0…
$ p3G      <int> 0, 1, 1, 1, 3, 2, 0, 2, 0, 1, 1, 0, 0, 0, 1, 1, 1, 2, 1, 1, 0, 3, 0, 3, 1, 2, 1, 3, 0, 3, 3, 0…
$ POW1T    <int> 259, 118, 200, 146, 246, 32, 31, 240, 72, 0, 120, 312, 120, 120, 197, 178, 162, 0, 0, 0, 95, 1…
$ POW2T    <int> 187, 120, 318, 346, 226, 0, 0, 240, 301, 120, 120, 105, 120, 0, 180, 305, 374, 120, 120, 209, …
$ POW3T    <int> 136, 434, 195, 0, 115, 120, 0, 60, 59, 240, 80, 239, 84, 0, 100, 249, 0, 0, 240, 31, 469, 240,…
$ POW1G    <int> 120, 242, 148, 266, 240, 128, 60, 288, 192, 47, 240, 258, 331, 211, 240, 240, 291, 162, 73, 11…
$ POW2G    <int> 236, 242, 342, 0, 240, 151, 236, 25, 0, 313, 240, 120, 329, 29, 221, 195, 0, 0, 0, 120, 77, 12…
$ POW3G    <int> 212, 120, 0, 120, 162, 61, 0, 0, 120, 120, 299, 120, 146, 0, 220, 173, 240, 120, 120, 420, 109…

1.2 Preprocess Data

Before any analysis is done, we need to define some new variables. The pre-processing is done for both the training and test data. A glimpse of the transformed training data is shown below.

hockey_train <- hockey_raw 

hockey_train <- hockey_train %>% mutate(Location = as.character(Location)) 

#Cleaning up training data
hockey_train <- hockey_train %>% 
  mutate(scorediffp1 = p1T - p1G, 
         scorediffp2 = scorediffp1 + p2T  - p2G, 
         scoreddiffFin = scorediffp2 + p3T - p3G,
         ritp1pwr = POW1T - POW1G, 
         ritp2pwr = ritp1pwr + POW2T - POW2G,
         rit3pwr = ritp2pwr + POW3T - POW3G,
         rit_final = p1T + p2T + p3T,
         op_final = p1G + p2G + p3G,
         Location = ifelse(Location == "H", "Home", "Away"),
         outcome = ifelse(scoreddiffFin > 0, "Win", "Loss")
         ) 

hockey_train_full <- hockey_train 

hockey_train <- hockey_train[,-c(1,3:14)]

#Cleaning up the test data
hockey_test <- hockey_raw_test 

hockey_test <- hockey_test %>% 
  mutate(Location = as.character(Location),
         scorediffp1 = p1T - p1G, 
         scorediffp2 = p1T + p2T - p1G - p2G, 
         scoreddiffFin = p1T + p2T + p3T - p1G - p2G -p3G,
         ritp1pwr = POW1T - POW1G, 
         ritp2pwr = ritp1pwr + POW2T - POW2G,
         rit_final = p1T + p2T + p3T,
         op_final = p1G + p2G + p3G,
         outcome = ifelse(scoreddiffFin > 0, "Win", "Loss"),
         Location = ifelse(Location == "H", "Home", "Away"))

hockey_test_full <- hockey_test 
hockey_test <- hockey_test[,-c(1,3:12)]

#For future correlation analyis and model training
hockey_final <- hockey_train %>% 
  mutate(Location = ifelse(Location == "Home", 1, 0) , 
                        outcome = ifelse(outcome == "Win", 1 ,0))

glimpse(hockey_train)
Rows: 114
Columns: 10
$ Location      <chr> "Home", "Home", "Home", "Home", "Away", "Away", "Away", "Away", "Away", "Home", "Away", "…
$ scorediffp1   <int> 1, 1, 2, 0, 0, 0, 2, 3, -1, -1, -2, -2, -1, 1, 0, 1, 1, -2, -2, -2, 1, 1, 2, 0, 1, -1, -1…
$ scorediffp2   <int> 2, 2, 0, 0, -1, -1, 1, 4, -1, 0, -2, -1, -1, 0, 0, 5, 4, -3, -2, -2, 0, -1, 2, -1, -1, -3…
$ scoreddiffFin <int> 3, 3, -1, 0, -3, -3, 3, 3, -1, -1, -1, -1, 4, 0, 1, 6, 4, -4, -2, -3, 2, -4, 5, -2, -2, -…
$ ritp1pwr      <int> 139, -124, 52, -120, 6, -96, -29, -48, -120, -47, -120, 54, -211, -91, -43, -62, -129, -1…
$ ritp2pwr      <int> 90, -246, 28, 226, -8, -247, -265, 167, 181, -240, -240, 39, -420, -120, -84, 48, 245, -4…
$ rit3pwr       <int> 14, 68, 223, 106, -55, -188, -265, 227, 120, -120, -459, 158, -482, -120, -204, 124, 5, -…
$ rit_final     <int> 3, 6, 5, 1, 2, 3, 4, 6, 0, 1, 3, 3, 6, 1, 4, 7, 6, 1, 1, 0, 3, 1, 5, 3, 1, 1, 2, 3, 4, 1,…
$ op_final      <int> 0, 3, 6, 1, 5, 6, 1, 3, 1, 2, 4, 4, 2, 1, 3, 1, 2, 5, 3, 3, 1, 5, 0, 5, 3, 6, 3, 6, 1, 3,…
$ outcome       <chr> "Win", "Win", "Loss", "Loss", "Loss", "Loss", "Win", "Win", "Loss", "Loss", "Loss", "Loss…

2 - Data Visualization

2.1 How good is the RIT Mens Hockey Team?

Point differences between RIT and the rival team were calculated for each period. The final score difference for each match is shown visually in the bar chart below. A positive score difference indicates an RIT win.

scoreddiffFin.counts <- as.data.frame(table(hockey_train$scoreddiffFin))
ggplot(data = scoreddiffFin.counts, aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity", fill = "steelblue")+
  geom_text(aes(label = Freq), vjust=1.6, color = "white", size = 3.5)+
  theme_minimal() +xlab("Difference") + ylab("Frequency") +
  ggtitle("Final Score Difference")

RIT won 41% of games and tied 7% of them. When RIT loses, 44% of the time its only by a 1 point difference. RIT wins by an average of 2.5 points and loses by an average of 2.2 points.

2.2 Can score difference predict game outcome?

The point differences for period 1 and period 2 are displayed in the scatter diagram below.

ggplot(hockey_train, aes(x = scorediffp1, y = scorediffp2, color = outcome)) + 
    geom_jitter(size = 3, width = 0.1 , height = 0.2) + 
    xlab("Score Difference at Period 1") + 
    ylab("Score Difference at Period 2") + 
    ggtitle("Game Outcome")

There seems to be a positive relationship between period 1 and period 2 point differences. The lower left section of the plot has a unanimous outcome of an RIT loss while the upper right section has the opposite effect. The middle has a mix of both wins and losses. There does seem to be some degree of separation between wins and loss which is good for predicting game outcome.

2.3 Can game location predict game outcome?

A double bar chart of game outcome based on location is shown below.

win_location_counts <- hockey_train %>% 
  group_by(Location, outcome) %>% 
  summarise(count = n())

ggplot(win_location_counts, aes(Location, count, fill = outcome)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Outcome of Game by Location")

The games location does not seem to effect the outcome of a match.

2.4 Can difference of time in powerplay predict game outcome?

Time in power play difference between RIT and the rival team for each period was calculated.For each period, two histograms of these differences were generated and overlapped. One indicating an RIT win and the other a loss.

ggplot(data=hockey_train, aes(x = ritp1pwr, group = outcome, fill = outcome)) +
  geom_density(adjust = 1.3, alpha = 0.4) +
  labs(title = "Histogram of powerplay difference, period one")


ggplot(data=hockey_train, aes(x=ritp2pwr, group=outcome, fill = outcome)) +
  geom_density(adjust = 1.3, alpha = 0.4) +
  labs(title = "Histogram of powerplay difference, period two")

For each period, there seems to be a lot of overlap in the distribution of power play difference between a lost and won game. This seems to suggest that difference in power play between the teams has no effect on the result of the game.

2.5 What variables are correlated?

A correlation plot between all the variables in the training data is shown below.

corrplot(cor(hockey_final[ ,c(1:3,5:6,10)]), method = "number" , type = "lower")

We seem to have moderate correlations with game outcome and point differences in period 1 and 2. It is interesting to note that the point-difference in period 2 is more correlated with outcome than in period 1. This result is to be expected. It is surprising to note that home does not seem to correlate with game outcome. This will surprise any hockey enthusiast who will gladly tell you that the home team has a higher chance of winning in their own turf.

Now lets do some machine learning!

3 - Training the Classification Machines

These next two sections are for a technical audience. A knowledge of statistical machine learning is assumed. Please skip to the conclusion for a non-technical solution.

We will now train 4 classic classification algorithms. We optimize each algorithm using cross-validation. In Chapter 4, we will compare each method using ROC Analysis and choose the best one.

3.1 Logistic Regression

We begin with logistic regression. Using Stepwise Logistic Regression, we find the combination of predictors that output the best model.

glm.fit.all <- glm(outcome ~ scorediffp1 + scorediffp2 + Location, data = hockey_final, family = binomial)
step.model <- glm.fit.all %>% stepAIC(trace = FALSE)
step.model

Call:  glm(formula = outcome ~ scorediffp2, family = binomial, data = hockey_final)

Coefficients:
(Intercept)  scorediffp2  
     -0.958        1.689  

Degrees of Freedom: 113 Total (i.e. Null);  112 Residual
Null Deviance:      154.5 
Residual Deviance: 77.37    AIC: 81.37

A Logistic Regression model with the 2nd period score difference as a predictor is the best model to use for optimal prediction. Lets look at the diagnostics.

summary(glm(outcome ~ scorediffp2, data = hockey_final, family = binomial))

Call:
glm(formula = outcome ~ scorediffp2, family = binomial, data = hockey_final)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.23850  -0.37000  -0.06947   0.41271   2.33052  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.9580     0.3117  -3.074  0.00211 ** 
scorediffp2   1.6892     0.3221   5.244 1.57e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 154.511  on 113  degrees of freedom
Residual deviance:  77.369  on 112  degrees of freedom
AIC: 81.369

Number of Fisher Scoring iterations: 6

The intercept and predictor are both statistically significant. Overall the model seems to do very well. The graph of the logistic curve is shown below.

ggplot(hockey_final, aes(x = scorediffp2, y = outcome)) + geom_point() + 
  stat_smooth(method = "glm", method.args = list(family="binomial"), se = FALSE) +
  ylab("Probability of Win") + xlab("Period 2 difference") +
  labs(title = "Logistic Regression Model")

When tied after the 2nd period, RIT has about a 25% chance of winning. Note that a tie is indicated as a loss for both teams in this model. This probability jumps to 62.5% when RIT is up by one and around 90% when up by two. We will now use 6-fold cross-validation to find the optimal probability cut-off value that switches a prediction to a win. We use 1 - accuracy or error as a performance indicator for our model.

set.seed(123)
fold <- sample(rep(1:6, length = 114)) 
cutoff <- seq(0, 1, length.out = 51) 
glm.accur <- matrix(0, 6, 51)

#6-fold cross validation for probability cutoff
for (i in 1:6) {
  glm <- glm(outcome ~ scorediffp2, data = hockey_final[fold!=i,], family = binomial)
  test <- hockey_final[fold == i, 3]
  test <- as.data.frame(test)
  colnames(test) <- c("scorediffp2")
  glm.probs <- predict(glm ,test , type = 'response' ) 
  for (j in 1:51) {
    glm.pred <- ifelse(glm.probs > cutoff[j] ,1 ,0)
    glm.accur[i,j] <- 1 - mean(glm.pred == hockey_final[fold == i,10]) 
    }
}

error.log <- rep(0, 51) 

for (k in 1:51) {
  error.log[k] <- mean(glm.accur[ , k]) 
  }

min.log <- round(min(error.log), 3)
title.log <- paste0("Minimum error: " , min.log)
df.log <- data.frame(cutoff, error.log) 

ggplot(data = df.log, aes(x = cutoff, y = error.log, group = 1)) +
  geom_line(color = "red") + xlab("Cut-off Probabilty") + ylab("Mean Error rate") +
  ggtitle("6-fold cross validation: Logistic regression" , title.log) #Choose 0.5 as optimal probability cutoff

The error is minimized when the cut-off probability is at 0.5. This means that if RIT has a 50% or more chance of winning, the prediction would be a win.

We conclude that the best logistic model is the one that uses the 2nd period point difference as a predictor with a cut-off probability of 0.5.

3.2 K-Nearest Neighbors

We now train a K-Nearest Neighbors model with location and point differences in both periods as predictors. Since all three are integers with about the same mean and variance, normalizing the data is not needed. 6-fold cross-validation is used to find the optimal K.

set.seed(105)
fold <- sample(rep(1:6, length = 114)) 
K_choice <- seq(1, 95, length.out = 95) 
knn_accur <- matrix(0, 6, 95)

for (i in 1:6) {
  train2 <- data.frame(hockey_final[fold != i , c(1,2,3)]) 
  test2 <- data.frame(hockey_final[fold == i , c(1,2,3)]) 
  test2real <- data.frame(hockey_final[fold == i, 10]) 
  train2real <- as.vector(hockey_final[fold != i, 10])
  for (j in 1:95) {
    knn_model <- knn(train = train2, test = test2, cl = train2real, k = K_choice[j])
    knn_accur[i,j] <- 1 - mean(as.numeric(levels(knn_model)[as.integer(knn_model)]) == test2real)
  }
}

error_knn = rep(0, 95) 
for (i in 1:95) {
  error_knn[i] = mean(knn_accur[, i]) 
  }

min_knn <- round(min(error_knn), 3)
title_knn <- paste0("Minimum error: ", min_knn)
df_knn <- data.frame(K_choice, error_knn) 

ggplot(data=df_knn, aes(x = K_choice, y = error_knn, group = 1)) +
  geom_line(color = "red") + 
  xlab("Choice of K") + ylab("Mean Error rate") + 
  ggtitle("6-fold cross validation: KNN", title_knn)

Lets zoom in between k=1 and k=40. We will also output the variance of the error rates to get a closer look at how KNN is behaving.

set.seed(105)
fold <- sample(rep(1:6,length = 114)) 
K_choice <- seq(1 , 40 , length.out = 40) 
knn_accur <- matrix(0 , 6 , 40)

for (i in 1:6) {
  train2 <- data.frame(hockey_final[fold != i, c(1,2,3)]) 
  test2 <- data.frame(hockey_final[fold == i, c(1,2,3)]) 
  test2real <- data.frame(hockey_final[fold == i, 10]) 
  train2real <- as.vector(hockey_final[fold != i, 10])
  for (j in 1:40) {
    knn_model <- knn(train = train2, test = test2, cl = train2real, k = K_choice[j])
    knn_accur[i,j] <- 1 - mean(as.numeric(levels(knn_model)[as.integer(knn_model)]) == test2real)
  }
}

error_knn <- rep(0, 40) 
for (i in 1:40) {
  error_knn[i] = mean(knn_accur[, i]) 
}

sd_knn <- rep(0,40) 
for (i in 1:40) {
  sd_knn[i] = sd(knn_accur[, i]) 
  }

min_knn <- round(min(error_knn), 3)
title_knn <- paste0("Minimum error: ", min_knn)
df_knn <- data.frame(K_choice, error_knn) 

df_knn2 <- data.frame(K_choice, sd_knn)

ggplot(data = df_knn, aes(x = K_choice, y = error_knn, group = 1)) +
  geom_line(color = "red") + 
  xlab("Choice of K") + ylab("Mean Error rate") + 
  ggtitle("6-fold cross validation: KNN", title_knn)


ggplot(data = df_knn2, aes(x = K_choice, y = sd_knn, group = 1)) + 
  geom_line(color = "red") +
  xlab("Choice of K") + ylab("Standard deviation of Error rate") +
  ggtitle("6-fold cross validation: KNN", title_knn) #Choose K=10

10 is the best number of neighbors to consider since it achieves a low mean error rate with the lowest variance. We conclude that the best nearest neighbor model is the one that considers 10 neighbors.

3.3 Discriminant Analysis

Finally, we consider both Linear & Quadratic Discriminant Analysis. We use 6-fold cross validation to find the optimal prior probability that maximizes accuracy and minimizes error.

set.seed(121)
fold <- sample(rep(1:6, length = 114)) 
Priors <- seq(0.01, 0.99, length.out = 99) 
lda_accur <- matrix(0, 6, 99)

for (i in 1:6) {
  test <- data.frame(hockey_final[fold == i, ]) 
  train <- data.frame(hockey_final[fold != i, ]) 
  for (j in 1:99) {
    prior1 <- Priors[j]
    lda_model <- lda(outcome ~ scorediffp1 + scorediffp2, data = test, prior = c(prior1, 1 - prior1))
    predictions <- lda_model %>% predict(test)
    lda_accur[i,j] <- 1 - mean(predictions$class == test$outcome) 
    }
}

error_lda <- rep(0, 99) 
for (i in 1:99) {
  error_lda[i] = mean(lda_accur[, i])
}

min_lda <- round(min(error_lda), 3)
title_lda <- paste0("Minimum error: ", min_lda)

df_lda <- data.frame(Priors , error_lda) 

ggplot(data = df_lda, aes(x = Priors, y = error_lda, group = 1)) +
  geom_line(color = "red") + 
  xlab("Choice of Prior") + ylab("Mean Error rate") +
  ggtitle("6-fold cross validation: LDA", title_lda) #Choose 0.57 as optimal probability cut-off

We conclude that the best linear discriminant model is the one that uses 1st and 2nd period point difference as predictors with a cut-off prior probability of 0.57

set.seed(121)
fold <- sample(rep(1:6,length = 114)) 
Priors <- seq(0.01, 0.99, length.out = 99) 
qda_accur <- matrix(0, 6, 99)

for (i in 1:6) {
  test <- data.frame(hockey_final[fold == i, ]) 
  train <- data.frame(hockey_final[fold != i, ])
  for (j in 1:99) {
    prior1 <- Priors[j]
    qda_model <- qda(outcome ~ scorediffp1 + scorediffp2, data = test, prior = c(prior1 , 1 - prior1))
    predictions <- qda_model %>% predict(test)
    qda_accur[i,j] <- 1 - mean(predictions$class == test$outcome)
    } 
  }


error_qda <- rep( 0 , 99) 
for (i in 1:99) {
  error_qda[i] = mean(qda_accur[, i]) }

min_qda <- round(min(error_qda),3)
title_qda <- paste0("Minimum error: ", min_qda)
df_qda <- data.frame(Priors, error_qda) 

ggplot(data = df_qda, aes(x = Priors, y = error_qda, group = 1)) +
  geom_line(color = "red") + 
  xlab("Choice of Prior") + ylab("Mean Error rate") +
  ggtitle("6-fold cross validation: QDA" , title_qda) #Choose 0.67 as optimal probabilty cut-off

We conclude that the best quadratic discriminant model is the one that uses 1st and 2nd period point difference as predictors with a cut-off prior probability of 0.67

4 - ROC Analysis

Now that we have four classification algorithms that are the best in their respected classes, lets compare and contrast them.


hockey_test2 <- hockey_test %>% 
  mutate(Location = ifelse(Location == "Home" , 1, 0) , 
         outcome = ifelse(outcome == "Win" , 1 ,0))

glm_final <- glm(outcome ~ scorediffp2, data = hockey_final, family = binomial) 
test <- data.frame(hockey_test2[ ,3])
test <- as.data.frame(test)
colnames(test) <- c("scorediffp2")
glm_final_probs <- predict(glm_final, test, type = 'response') 
glm_final_pred <- ifelse(glm_final_probs > 0.5, 1, 0)
glm_table <- table(predicted = as.factor(glm_final_pred), actual = as.factor(hockey_test2$outcome))


set.seed(100)
train_knn <- hockey_final[c(1, 2, 3)]
test_knn <- hockey_test2[c(1, 2, 3)]
cl_knn <- as.factor(hockey_final$outcome)
knn_final <- knn(train = train_knn, test = test_knn, cl = cl_knn, k = 10)
knn_table <- table(predicted = as.factor(knn_final), actual = as.factor(hockey_test2$outcome) )


lda_final <- lda(outcome ~ scorediffp1 + scorediffp2, data = hockey_final, prior = c(0.57, 0.43))
predictions_lda <- lda_final %>% predict(hockey_test2)
lda_table <- table(predicted = predictions_lda$class, actual = hockey_test2$outcome )
#error_lda = 1 - mean(predictions.lda$class == testdata$outcome) 


qda_final <- qda(outcome ~ scorediffp1 + scorediffp2, data = hockey_final , prior = c(0.67 ,0.33))
predictions_qda <- qda_final %>% predict(hockey_test2)
qda_table <- table(predicted = predictions_qda$class, actual = hockey_test2$outcome)

log_sen <- glm_table[2,2] / (glm_table[2,2] + glm_table[1,2])
log_fp <- glm_table[2,1] / (glm_table[2,1] + glm_table[1,1]) 
log_act <- (glm_table[1,1] + glm_table[2,2]) / sum(glm_table)

knn_sen <- knn_table[2,2] / (knn_table[2,2] + knn_table[1,2]) 
knn_fp <- knn_table[2,1] / (knn_table[2,1] + knn_table[1,1]) 
knn_act <- (knn_table[1,1] + knn_table[2,2]) / sum(knn_table) 

lda_sen <- lda_table[2,2] / (lda_table[2,2] + lda_table[1,2]) 
lda_fp <- lda_table[2,1] / (lda_table[2,1] + lda_table[1,1]) 
lda_act <- (lda_table[1,1] + lda_table[2,2]) / sum(lda_table) 

qda_sen <- qda_table[2,2] / (qda_table[2,2] + qda_table[1,2]) 
qda_fp <- qda_table[2,1] / (qda_table[2,1] + qda_table[1,1]) 
qda_act <- (qda_table[1,1] + qda_table[2,2]) / sum(qda_table) 

Methods <- c("Logistic", "KNN" , "LDA" , "QDA")
Methods_roc <- c("Logistic/LDA" , "KNN" , "QDA")
Accuracy <- c(log_act, knn_act, lda_act, qda_act) 
Sensitivity <- c(log_sen , knn_sen , lda_sen , qda_sen) 
Specificity = c(1-log_fp , 1-knn_fp , 1-lda_fp , 1-qda_fp)
compare_roc <- data.frame(Methods_roc, Sensitivity[c(1,2,4)], 1-Specificity[c(1,2,4)])
colnames(compare_roc) <- c("Methods" , "tp_rate" , "fp_rate")

comparisions <- data.frame(Methods, Accuracy, Sensitivity, Specificity) 
comparisions

It seems that QDA is the most accurate algorithm while KNN performs the worst.

ggplot(compare_roc, aes(x = fp_rate, y = tp_rate , col = Methods)) + 
geom_point() + xlim(0,1) + ylim(0,1) + geom_abline(intercept = 0, slope = 1) + 
ggtitle("ROC graph") +xlab("1 - Specificity") + ylab("Sensitivity")

All 4 classifiers do much better than “random guessing” which is indicated by the black line. The closer towards the upper left corner, the better the classifier is. KNN is the closest towards the black line. This means that KNN under performs compared to the other classifiers. QDA seems to perform the best. QDA is the classifier to go in terms of predictability however if one wants a simple interpretable model with some loss of predictability, Logistic Regression is king.

5 - Conclusion

We conclude that the best predictors that determine the outcome of a game is the point-difference at period 1 and period 2. Time in Power play differences and location does not seem to predict very well. A Logistic Regression Model is the most interpretable but if one wants predictability, Quadratic Discriminant Analysis is the way to go.

---
title: "Hockey in-game prediction"
output: html_notebook
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Hockey has been a cornerstone of culture at R.I.T since the team founding in 1957. The purpose of this project is to predict the outcome of a RIT Men`s hockey match based on key indicators in the 1st and 2nd periods. These indicators include point differences, time in power play differences and whether the game is played at home or away. The full project can be accessed on github [here](https://github.com/EnderErimhan/Hockey).

```{r}
require(ggplot2)
require(dplyr)
require(corrplot)
require(MASS)
require(class)

```

## 1-Data Preperation

### 1.1 Load Data

The RIT men’s hockey team website displays data of all past games in an unstructured format across multiple pages. The data shows the game location, the date the game was played and various in-game statistics. The data was collected into an Excel CSV file. Our training data includes all matches from the 2016 - 2017 up to the 2018 - 2019 hockey seasons. The test data is from the 2019 - 2020 hockey season. The full data dictionary can be found [here.](https://github.com/EnderErimhan/Hockey#data-dictionary). A glimpse of the raw training data is found below.

```{r, echo = T}
dirdata <- "/Users/endererimhan/Documents/Capstone/"
hockey_raw <- read.table(paste0(dirdata,"HData2.CSV"), header = TRUE, sep = ",")
hockey_raw_test <- read.table(paste0(dirdata,"TestHData2.CSV"), header = TRUE, sep = ",")
glimpse(hockey_raw)
```
### 1.2 Preprocess Data

Before any analysis is done, we need to define some new variables. The pre-processing is done for both the training and test data. A glimpse of the transformed training data is shown below.  

```{r, echo = T}
hockey_train <- hockey_raw 

hockey_train <- hockey_train %>% mutate(Location = as.character(Location)) 

#Cleaning up training data
hockey_train <- hockey_train %>% 
  mutate(scorediffp1 = p1T - p1G, 
         scorediffp2 = scorediffp1 + p2T  - p2G, 
         scoreddiffFin = scorediffp2 + p3T - p3G,
         ritp1pwr = POW1T - POW1G, 
         ritp2pwr = ritp1pwr + POW2T - POW2G,
         rit3pwr = ritp2pwr + POW3T - POW3G,
         rit_final = p1T + p2T + p3T,
         op_final = p1G + p2G + p3G,
         Location = ifelse(Location == "H", "Home", "Away"),
         outcome = ifelse(scoreddiffFin > 0, "Win", "Loss")
         ) 

hockey_train_full <- hockey_train 

hockey_train <- hockey_train[,-c(1,3:14)]

#Cleaning up the test data
hockey_test <- hockey_raw_test 

hockey_test <- hockey_test %>% 
  mutate(Location = as.character(Location),
         scorediffp1 = p1T - p1G, 
         scorediffp2 = p1T + p2T - p1G - p2G, 
         scoreddiffFin = p1T + p2T + p3T - p1G - p2G -p3G,
         ritp1pwr = POW1T - POW1G, 
         ritp2pwr = ritp1pwr + POW2T - POW2G,
         rit_final = p1T + p2T + p3T,
         op_final = p1G + p2G + p3G,
         outcome = ifelse(scoreddiffFin > 0, "Win", "Loss"),
         Location = ifelse(Location == "H", "Home", "Away"))

hockey_test_full <- hockey_test 
hockey_test <- hockey_test[,-c(1,3:12)]

#For future correlation analysis and model training
hockey_final <- hockey_train %>% 
  mutate(Location = ifelse(Location == "Home", 1, 0) , 
                        outcome = ifelse(outcome == "Win", 1 ,0))

glimpse(hockey_train)

```

## 2 - Data Visualization

### 2.1 How good is the RIT Mens Hockey Team?

Point differences between RIT and the rival team were calculated for each period. The final score difference for each match is shown visually in the bar chart below. A positive score difference indicates an RIT win. 

```{r, echo = T }
scoreddiffFin.counts <- as.data.frame(table(hockey_train$scoreddiffFin))
ggplot(data = scoreddiffFin.counts, aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity", fill = "steelblue")+
  geom_text(aes(label = Freq), vjust=1.6, color = "white", size = 3.5)+
  theme_minimal() +xlab("Difference") + ylab("Frequency") +
  ggtitle("Final Score Difference")
```

RIT won 41% of games and tied 7% of them. When RIT loses, 44% of the time its only by a 1 point difference. RIT wins by an average of 2.5 points and loses by an average of 2.2 points. 

### 2.2 Can score difference predict game outcome?

The point differences for period 1 and period 2 are displayed in the scatter diagram below.

```{r, echo = T}
ggplot(hockey_train, aes(x = scorediffp1, y = scorediffp2, color = outcome)) + 
    geom_jitter(size = 3, width = 0.1 , height = 0.2) + 
    xlab("Score Difference at Period 1") + 
    ylab("Score Difference at Period 2") + 
    ggtitle("Game Outcome")
```

There seems to be a positive relationship between period 1 and period 2 point differences. The lower left section of the plot has a unanimous outcome of an RIT loss while the upper right section has the opposite effect. The middle has a mix of both wins and losses. There does seem to be some degree of separation between wins and loss which is good for predicting game outcome. 

### 2.3 Can game location predict game outcome?

A double bar chart of game outcome based on location is shown below.

```{r, message = FALSE, echo = T}
win_location_counts <- hockey_train %>% 
  group_by(Location, outcome) %>% 
  summarise(count = n())

ggplot(win_location_counts, aes(Location, count, fill = outcome)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Outcome of Game by Location")

```

The games location does not seem to effect the outcome of a match.

### 2.4 Can difference of time in powerplay predict game outcome?

Time in power play difference between RIT and the rival team for each period was calculated.For each period, two histograms of these differences were generated and overlapped. One indicating an RIT win and the other a loss.    

```{r, fig.show='hold', out.width = '25%', fig.align='center', echo = T }
ggplot(data=hockey_train, aes(x = ritp1pwr, group = outcome, fill = outcome)) +
  geom_density(adjust = 1.3, alpha = 0.4) +
  labs(title = "Histogram of power play difference, period one")

ggplot(data=hockey_train, aes(x=ritp2pwr, group=outcome, fill = outcome)) +
  geom_density(adjust = 1.3, alpha = 0.4) +
  labs(title = "Histogram of power play difference, period two")
```
For each period, there seems to be a lot of overlap in the distribution of power play difference between a lost and won game. This seems to suggest that difference in power play between the teams has no effect on the result of the game.

### 2.5 What variables are correlated?

A correlation plot between all the variables in the training data is shown below. 

```{r, echo = T }
corrplot(cor(hockey_final[ ,c(1:3,5:6,10)]), method = "number" , type = "lower")
```

We seem to have moderate correlations with game outcome and point differences in period 1 and 2. It is interesting to note that the point-difference in period 2 is more correlated with outcome than in period 1. This result is to be expected. It is surprising to note that home does not seem to correlate with game outcome. This will surprise any hockey enthusiast who will gladly tell you that the home team has a higher chance of winning in their own turf.

Now lets do some machine learning!

## 3 - Training the Classification Machines

These next two sections are for a technical audience. A knowledge of statistical machine learning is assumed. Please skip to the conclusion for a non-technical solution.

We will now train 4 classic classification algorithms. We optimize each algorithm using cross-validation. In Chapter 4, we will compare each method using ROC Analysis and choose the best one.

### 3.1 Logistic Regression 

We begin with logistic regression. Using Stepwise Logistic Regression, we find the combination of predictors that output the best model. 

```{r, echo = TRUE }
glm.fit.all <- glm(outcome ~ scorediffp1 + scorediffp2 + Location, data = hockey_final, family = binomial)
step.model <- glm.fit.all %>% stepAIC(trace = FALSE)
step.model

```

A Logistic Regression model with the 2nd period score difference as a predictor is the best model to use for optimal prediction. Lets look at the diagnostics. 

```{r, message=FALSE, echo = TRUE }
summary(glm(outcome ~ scorediffp2, data = hockey_final, family = binomial))
```

The intercept and predictor are both statistically significant. Overall the model seems to do very well. The graph of the logistic curve is shown below.

```{r, echo = T, message=F }
ggplot(hockey_final, aes(x = scorediffp2, y = outcome)) + geom_point() + 
  stat_smooth(method = "glm", method.args = list(family="binomial"), se = FALSE) +
  ylab("Probability of Win") + xlab("Period 2 difference") +
  labs(title = "Logistic Regression Model")

```

When tied after the 2nd period, RIT has about a 25% chance of winning. Note that a tie is indicated as a loss for both teams in this model. This probability jumps to 62.5% when RIT is up by one and around 90% when up by two. We will now use 6-fold cross-validation to find the optimal probability cut-off value that switches a prediction to a win. We use 1 - accuracy or error as a performance indicator for our model.

```{r, echo = TRUE} 
set.seed(123)
fold <- sample(rep(1:6, length = 114)) 
cutoff <- seq(0, 1, length.out = 51) 
glm.accur <- matrix(0, 6, 51)

#6-fold cross validation for probability cutoff
for (i in 1:6) {
  glm <- glm(outcome ~ scorediffp2, data = hockey_final[fold!=i,], family = binomial)
  test <- hockey_final[fold == i, 3]
  test <- as.data.frame(test)
  colnames(test) <- c("scorediffp2")
  glm.probs <- predict(glm ,test , type = 'response' ) 
  for (j in 1:51) {
    glm.pred <- ifelse(glm.probs > cutoff[j] ,1 ,0)
    glm.accur[i,j] <- 1 - mean(glm.pred == hockey_final[fold == i,10]) 
    }
}

error.log <- rep(0, 51) 

for (k in 1:51) {
  error.log[k] <- mean(glm.accur[ , k]) 
  }

min.log <- round(min(error.log), 3)
title.log <- paste0("Minimum error: " , min.log)
df.log <- data.frame(cutoff, error.log) 

ggplot(data = df.log, aes(x = cutoff, y = error.log, group = 1)) +
  geom_line(color = "red") + xlab("Cut-off Probabilty") + ylab("Mean Error rate") +
  ggtitle("6-fold cross validation: Logistic regression" , title.log) #Choose 0.5 as optimal probability cutoff
```

The error is minimized when the cut-off probability is at 0.5. This means that if RIT has a 50% or more chance of winning, the prediction would be a win. 

We conclude that the best logistic model is the one that uses the 2nd period point difference as a predictor with a cut-off probability of 0.5.

### 3.2 K-Nearest Neighbors 

We now train a K-Nearest Neighbors model with location and point differences in both periods as predictors. Since all three are integers with about the same mean and variance, normalizing the data is not needed. 6-fold cross-validation is used to find the optimal K.

```{r, echo = TRUE }
set.seed(105)
fold <- sample(rep(1:6, length = 114)) 
K_choice <- seq(1, 95, length.out = 95) 
knn_accur <- matrix(0, 6, 95)

for (i in 1:6) {
  train2 <- data.frame(hockey_final[fold != i , c(1,2,3)]) 
  test2 <- data.frame(hockey_final[fold == i , c(1,2,3)]) 
  test2real <- data.frame(hockey_final[fold == i, 10]) 
  train2real <- as.vector(hockey_final[fold != i, 10])
  for (j in 1:95) {
    knn_model <- knn(train = train2, test = test2, cl = train2real, k = K_choice[j])
    knn_accur[i,j] <- 1 - mean(as.numeric(levels(knn_model)[as.integer(knn_model)]) == test2real)
  }
}

error_knn = rep(0, 95) 
for (i in 1:95) {
  error_knn[i] = mean(knn_accur[, i]) 
  }

min_knn <- round(min(error_knn), 3)
title_knn <- paste0("Minimum error: ", min_knn)
df_knn <- data.frame(K_choice, error_knn) 

ggplot(data=df_knn, aes(x = K_choice, y = error_knn, group = 1)) +
  geom_line(color = "red") + 
  xlab("Choice of K") + ylab("Mean Error rate") + 
  ggtitle("6-fold cross validation: KNN", title_knn)

```

Lets zoom in between k=1 and k=40. We will also output the variance of the error rates to get a closer look at how KNN is behaving. 

```{r, echo = TRUE, fig.show='hold', out.width = '25%', fig.align='center' }
set.seed(105)
fold <- sample(rep(1:6,length = 114)) 
K_choice <- seq(1 , 40 , length.out = 40) 
knn_accur <- matrix(0 , 6 , 40)

for (i in 1:6) {
  train2 <- data.frame(hockey_final[fold != i, c(1,2,3)]) 
  test2 <- data.frame(hockey_final[fold == i, c(1,2,3)]) 
  test2real <- data.frame(hockey_final[fold == i, 10]) 
  train2real <- as.vector(hockey_final[fold != i, 10])
  for (j in 1:40) {
    knn_model <- knn(train = train2, test = test2, cl = train2real, k = K_choice[j])
    knn_accur[i,j] <- 1 - mean(as.numeric(levels(knn_model)[as.integer(knn_model)]) == test2real)
  }
}

error_knn <- rep(0, 40) 
for (i in 1:40) {
  error_knn[i] = mean(knn_accur[, i]) 
}

sd_knn <- rep(0,40) 
for (i in 1:40) {
  sd_knn[i] = sd(knn_accur[, i]) 
  }

min_knn <- round(min(error_knn), 3)
title_knn <- paste0("Minimum error: ", min_knn)
df_knn <- data.frame(K_choice, error_knn) 

df_knn2 <- data.frame(K_choice, sd_knn)

ggplot(data = df_knn, aes(x = K_choice, y = error_knn, group = 1)) +
  geom_line(color = "red") + 
  xlab("Choice of K") + ylab("Mean Error rate") + 
  ggtitle("6-fold cross validation: KNN", title_knn)

ggplot(data = df_knn2, aes(x = K_choice, y = sd_knn, group = 1)) + 
  geom_line(color = "red") +
  xlab("Choice of K") + ylab("Standard deviation of Error rate") +
  ggtitle("6-fold cross validation: KNN", title_knn) #Choose K=10

```

10 is the best number of neighbors to consider since it achieves a low mean error rate with the lowest variance. We conclude that the best nearest neighbor model is the one that considers 10 neighbors. 

### 3.3 Discriminant Analysis 

Finally, we consider both Linear & Quadratic Discriminant Analysis. We use 6-fold cross validation to find the optimal prior probability that maximizes accuracy and minimizes error. 


```{r, echo = TRUE }
set.seed(121)
fold <- sample(rep(1:6, length = 114)) 
Priors <- seq(0.01, 0.99, length.out = 99) 
lda_accur <- matrix(0, 6, 99)

for (i in 1:6) {
  test <- data.frame(hockey_final[fold == i, ]) 
  train <- data.frame(hockey_final[fold != i, ]) 
  for (j in 1:99) {
    prior1 <- Priors[j]
    lda_model <- lda(outcome ~ scorediffp1 + scorediffp2, data = test, prior = c(prior1, 1 - prior1))
    predictions <- lda_model %>% predict(test)
    lda_accur[i,j] <- 1 - mean(predictions$class == test$outcome) 
    }
}

error_lda <- rep(0, 99) 
for (i in 1:99) {
  error_lda[i] = mean(lda_accur[, i])
}

min_lda <- round(min(error_lda), 3)
title_lda <- paste0("Minimum error: ", min_lda)

df_lda <- data.frame(Priors , error_lda) 

ggplot(data = df_lda, aes(x = Priors, y = error_lda, group = 1)) +
  geom_line(color = "red") + 
  xlab("Choice of Prior") + ylab("Mean Error rate") +
  ggtitle("6-fold cross validation: LDA", title_lda) #Choose 0.57 as optimal probability cut-off

```

We conclude that the best linear discriminant model is the one that uses 1st and 2nd period point difference as predictors with a cut-off prior probability of 0.57

```{r, echo = TRUE}
set.seed(121)
fold <- sample(rep(1:6,length = 114)) 
Priors <- seq(0.01, 0.99, length.out = 99) 
qda_accur <- matrix(0, 6, 99)

for (i in 1:6) {
  test <- data.frame(hockey_final[fold == i, ]) 
  train <- data.frame(hockey_final[fold != i, ])
  for (j in 1:99) {
    prior1 <- Priors[j]
    qda_model <- qda(outcome ~ scorediffp1 + scorediffp2, data = test, prior = c(prior1 , 1 - prior1))
    predictions <- qda_model %>% predict(test)
    qda_accur[i,j] <- 1 - mean(predictions$class == test$outcome)
    } 
  }


error_qda <- rep( 0 , 99) 
for (i in 1:99) {
  error_qda[i] = mean(qda_accur[, i]) }

min_qda <- round(min(error_qda),3)
title_qda <- paste0("Minimum error: ", min_qda)
df_qda <- data.frame(Priors, error_qda) 

ggplot(data = df_qda, aes(x = Priors, y = error_qda, group = 1)) +
  geom_line(color = "red") + 
  xlab("Choice of Prior") + ylab("Mean Error rate") +
  ggtitle("6-fold cross validation: QDA" , title_qda) #Choose 0.67 as optimal probabilty cut-off

```

We conclude that the best quadratic discriminant model is the one that uses 1st and 2nd period point difference as predictors with a cut-off prior probability of 0.67

## 4 - ROC Analysis 

Now that we have four classification algorithms that are the best in their respected classes, lets compare and contrast them.

```{r, echo = TRUE} 

hockey_test2 <- hockey_test %>% 
  mutate(Location = ifelse(Location == "Home" , 1, 0) , 
         outcome = ifelse(outcome == "Win" , 1 ,0))

glm_final <- glm(outcome ~ scorediffp2, data = hockey_final, family = binomial) 
test <- data.frame(hockey_test2[ ,3])
test <- as.data.frame(test)
colnames(test) <- c("scorediffp2")
glm_final_probs <- predict(glm_final, test, type = 'response') 
glm_final_pred <- ifelse(glm_final_probs > 0.5, 1, 0)
glm_table <- table(predicted = as.factor(glm_final_pred), actual = as.factor(hockey_test2$outcome))


set.seed(100)
train_knn <- hockey_final[c(1, 2, 3)]
test_knn <- hockey_test2[c(1, 2, 3)]
cl_knn <- as.factor(hockey_final$outcome)
knn_final <- knn(train = train_knn, test = test_knn, cl = cl_knn, k = 10)
knn_table <- table(predicted = as.factor(knn_final), actual = as.factor(hockey_test2$outcome) )


lda_final <- lda(outcome ~ scorediffp1 + scorediffp2, data = hockey_final, prior = c(0.57, 0.43))
predictions_lda <- lda_final %>% predict(hockey_test2)
lda_table <- table(predicted = predictions_lda$class, actual = hockey_test2$outcome )
#error_lda = 1 - mean(predictions.lda$class == testdata$outcome) 


qda_final <- qda(outcome ~ scorediffp1 + scorediffp2, data = hockey_final , prior = c(0.67 ,0.33))
predictions_qda <- qda_final %>% predict(hockey_test2)
qda_table <- table(predicted = predictions_qda$class, actual = hockey_test2$outcome)

log_sen <- glm_table[2,2] / (glm_table[2,2] + glm_table[1,2])
log_fp <- glm_table[2,1] / (glm_table[2,1] + glm_table[1,1]) 
log_act <- (glm_table[1,1] + glm_table[2,2]) / sum(glm_table)

knn_sen <- knn_table[2,2] / (knn_table[2,2] + knn_table[1,2]) 
knn_fp <- knn_table[2,1] / (knn_table[2,1] + knn_table[1,1]) 
knn_act <- (knn_table[1,1] + knn_table[2,2]) / sum(knn_table) 

lda_sen <- lda_table[2,2] / (lda_table[2,2] + lda_table[1,2]) 
lda_fp <- lda_table[2,1] / (lda_table[2,1] + lda_table[1,1]) 
lda_act <- (lda_table[1,1] + lda_table[2,2]) / sum(lda_table) 

qda_sen <- qda_table[2,2] / (qda_table[2,2] + qda_table[1,2]) 
qda_fp <- qda_table[2,1] / (qda_table[2,1] + qda_table[1,1]) 
qda_act <- (qda_table[1,1] + qda_table[2,2]) / sum(qda_table) 

Methods <- c("Logistic", "KNN" , "LDA" , "QDA")
Methods_roc <- c("Logistic/LDA" , "KNN" , "QDA")
Accuracy <- c(log_act, knn_act, lda_act, qda_act) 
Sensitivity <- c(log_sen , knn_sen , lda_sen , qda_sen) 
Specificity = c(1-log_fp , 1-knn_fp , 1-lda_fp , 1-qda_fp)
compare_roc <- data.frame(Methods_roc, Sensitivity[c(1,2,4)], 1-Specificity[c(1,2,4)])
colnames(compare_roc) <- c("Methods" , "tp_rate" , "fp_rate")

comparisions <- data.frame(Methods, Accuracy, Sensitivity, Specificity) 
comparisions
```

It seems that QDA is the most accurate algorithm while KNN performs the worst.

```{r, echo = TRUE}
ggplot(compare_roc, aes(x = fp_rate, y = tp_rate , col = Methods)) + 
geom_point() + xlim(0,1) + ylim(0,1) + geom_abline(intercept = 0, slope = 1) + 
ggtitle("ROC graph") +xlab("1 - Specificity") + ylab("Sensitivity")
```

All 4 classifiers do much better than "random guessing" which is indicated by the black line. The closer towards the upper left corner, the better the classifier is. KNN is the closest towards the black line. This means that KNN under performs compared to the other classifiers. QDA seems to perform the best. QDA is the classifier to go in terms of predictability however if one wants a simple interpretable model with some loss of predictability, Logistic Regression is king.  

## 5 - Conclusion 

We conclude that the best predictors that determine the outcome of a game is the point-difference at period 1 and period 2.  Time in Power play differences and location does not seem to predict very well. A Logistic Regression Model is the most interpretable but if one wants predictability, Quadratic Discriminant Analysis is the way to go.

