BANL 6900 Business Analytics Capstone Course

Final Project Report

Information on the Dataset used in this Project

The dataset used in this project comprises data from the 2021-22 season of the English Premier League (UK). The data is extracted from FBRef.com, a website that collects soccer data for leagues across the globe. The final dataset was collected from several pages of the website, then cleaned and merged in an Excel file. It contains numerous statistics recorded during each game. For the project, I chose the statistics that I felt made the most difference to the outcome (end result) of a game. The final dataset can be found in the GitHub repository: aliRep1.

The dataset has 760 rows and 32 columns. The column names and types, followed by the first 6 and last 6 rows of the dataset, are listed below.

## 'data.frame':    760 obs. of  32 variables:
##  $ Date        : POSIXct, format: "2021-08-13" "2021-08-22" ...
##  $ Time        : chr  "20:00 (15:00)" "16:30 (11:30)" "12:30 (07:30)" "15:00 (10:00)" ...
##  $ Round       : chr  "Matchweek 1" "Matchweek 2" "Matchweek 3" "Matchweek 4" ...
##  $ Day         : chr  "Fri" "Sun" "Sat" "Sat" ...
##  $ Venue       : num  0 1 0 1 0 1 0 1 1 0 ...
##  $ Result      : chr  "L" "L" "L" "W" ...
##  $ GF          : num  0 0 0 1 1 3 0 2 3 2 ...
##  $ GA          : num  2 2 5 0 0 1 0 2 1 0 ...
##  $ Team        : chr  "Arsenal" "Arsenal" "Arsenal" "Arsenal" ...
##  $ Opponent    : chr  "Brentford" "Chelsea" "Manchester City" "Norwich City" ...
##  $ xG          : num  1.4 0.3 0.1 2.8 1.2 0.8 0.5 1.4 2.7 0.9 ...
##  $ xGA         : num  1.3 2.9 3.8 0.6 1.3 1 1.4 0.9 1.4 1.2 ...
##  $ Poss        : num  66 35 20 52 54 46 41 55 54 36 ...
##  $ Attendance  : num  16479 58729 52276 58000 20000 ...
##  $ Captain     : chr  "Granit Xhaka" "Granit Xhaka" "Pierre-Emerick Aubameyang" "Pierre-Emerick Aubameyang" ...
##  $ Referee     : chr  "Michael Oliver" "Paul Tierney" "Martin Atkinson" "Michael Oliver" ...
##  $ SoTP        : num  18.2 50 0 20 23.1 58.3 25 35.3 38.1 55.6 ...
##  $ ShotsFK     : num  1 0 0 1 1 0 1 1 1 0 ...
##  $ ShotsPK     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ GoalieSaveP : num  33.3 60 50 100 100 75 100 66.7 75 100 ...
##  $ PassCmpP    : num  87.2 78.7 65.4 82 80.9 84 76.3 86.4 80.5 75.6 ...
##  $ KP          : num  20 5 1 23 9 11 5 10 15 7 ...
##  $ Pass1/3     : num  37 20 9 30 30 30 14 48 29 12 ...
##  $ Pass18yard  : num  19 8 2 16 5 4 5 15 8 4 ...
##  $ Cross18yard : num  6 0 1 5 1 0 0 4 1 1 ...
##  $ ProgPass    : num  51 22 11 35 31 23 19 39 21 12 ...
##  $ ShotCA      : num  42 9 2 48 21 23 13 26 32 13 ...
##  $ GoalCA      : num  0 0 0 1 2 6 0 2 5 1 ...
##  $ DribbleTklP : num  36.4 46.7 42.9 46.2 40 53.3 33.3 63.6 44.4 43.8 ...
##  $ Blocks      : num  14 24 23 22 24 12 19 11 14 19 ...
##  $ Intercepts  : num  19 16 16 6 20 14 18 11 10 13 ...
##  $ DribbleSuccP: num  50 60 30 52.2 42.9 45.5 58.8 58.8 41.7 58.8 ...

##         Date          Time       Round Day Venue Result GF GA    Team
## 1 2021-08-13 20:00 (15:00) Matchweek 1 Fri     0      L  0  2 Arsenal
## 2 2021-08-22 16:30 (11:30) Matchweek 2 Sun     1      L  0  2 Arsenal
## 3 2021-08-28 12:30 (07:30) Matchweek 3 Sat     0      L  0  5 Arsenal
## 4 2021-09-11 15:00 (10:00) Matchweek 4 Sat     1      W  1  0 Arsenal
## 5 2021-09-18 15:00 (10:00) Matchweek 5 Sat     0      W  1  0 Arsenal
## 6 2021-09-26 16:30 (11:30) Matchweek 6 Sun     1      W  3  1 Arsenal
##          Opponent  xG xGA Poss Attendance                   Captain
## 1       Brentford 1.4 1.3   66      16479              Granit Xhaka
## 2         Chelsea 0.3 2.9   35      58729              Granit Xhaka
## 3 Manchester City 0.1 3.8   20      52276 Pierre-Emerick Aubameyang
## 4    Norwich City 2.8 0.6   52      58000 Pierre-Emerick Aubameyang
## 5         Burnley 1.2 1.3   54      20000 Pierre-Emerick Aubameyang
## 6       Tottenham 0.8 1.0   46      59919 Pierre-Emerick Aubameyang
##           Referee SoTP ShotsFK ShotsPK GoalieSaveP PassCmpP KP Pass1/3
## 1  Michael Oliver 18.2       1       0        33.3     87.2 20      37
## 2    Paul Tierney 50.0       0       0        60.0     78.7  5      20
## 3 Martin Atkinson  0.0       0       0        50.0     65.4  1       9
## 4  Michael Oliver 20.0       1       0       100.0     82.0 23      30
## 5  Anthony Taylor 23.1       1       0       100.0     80.9  9      30
## 6    Craig Pawson 58.3       0       0        75.0     84.0 11      30
##   Pass18yard Cross18yard ProgPass ShotCA GoalCA DribbleTklP Blocks Intercepts
## 1         19           6       51     42      0        36.4     14         19
## 2          8           0       22      9      0        46.7     24         16
## 3          2           1       11      2      0        42.9     23         16
## 4         16           5       35     48      1        46.2     22          6
## 5          5           1       31     21      2        40.0     24         20
## 6          4           0       23     23      6        53.3     12         14
##   DribbleSuccP
## 1         50.0
## 2         60.0
## 3         30.0
## 4         52.2
## 5         42.9
## 6         45.5

##           Date          Time        Round Day Venue Result GF GA         Team
## 755 2022-04-23 15:00 (10:00) Matchweek 34 Sat     1      L  0  3 Norwich City
## 756 2022-04-30 15:00 (10:00) Matchweek 35 Sat     0      L  0  2 Norwich City
## 757 2022-05-08 14:00 (09:00) Matchweek 36 Sun     1      L  0  4 Norwich City
## 758 2022-05-11 19:45 (14:45) Matchweek 21 Wed     0      L  0  3 Norwich City
## 759 2022-05-15 14:00 (09:00) Matchweek 37 Sun     0      D  1  1 Norwich City
## 760 2022-05-22 16:00 (11:00) Matchweek 38 Sun     1      L  0  5 Norwich City
##           Opponent  xG xGA Poss Attendance      Captain         Referee SoTP
## 755  Newcastle Utd 0.7 2.4   46      26910 Grant Hanley  Chris Kavanagh 40.0
## 756    Aston Villa 0.6 2.2   55      40290 Grant Hanley     John Brooks 30.0
## 757       West Ham 0.7 3.1   37      26428 Grant Hanley    Robert Jones 25.0
## 758 Leicester City 1.2 2.5   36      38092 Grant Hanley    Simon Hooper 55.6
## 759         Wolves 1.3 0.9   37      31219 Grant Hanley Tony Harrington 18.2
## 760      Tottenham 0.3 3.7   40      27022 Grant Hanley  Chris Kavanagh  0.0
##     ShotsFK ShotsPK GoalieSaveP PassCmpP KP Pass1/3 Pass18yard Cross18yard
## 755       1       0        57.1     75.0  4      20          9           0
## 756       0       0        66.7     82.5  6      13          4           0
## 757       1       0        25.0     81.3  5      19          6           1
## 758       0       0        62.5     80.4  5      16          7           1
## 759       0       0        75.0     75.8  9      12          2           0
## 760       0       0        58.3     79.4  5      18          4           1
##     ProgPass ShotCA GoalCA DribbleTklP Blocks Intercepts DribbleSuccP
## 755       31     10      0        38.9     14         17         41.2
## 756       23     12      0         0.0     18         20         35.7
## 757       25     12      0        21.4     16         12         87.5
## 758       22     12      0        43.8     19         16         40.0
## 759       14     19      2        36.4     10         22         75.0
## 760       21     12      0        18.8      8         16         46.2

The Business Question Investigated using the Dataset.

The question investigated in this project is:
“Was there a home-ground advantage in the English Premier League season 2021-22?” It is often believed in sports that a team has an advantage when playing at its home ground or court. To answer this question, we need to know the determinants (i.e. predictors) that can influence a match result. The dataset provides these predictors for each match. The project aims to predict a win or loss result based on these contributing factors.

Therefore, my target variable in this project is the “Result” column.

Initial Prediction about Answering the Business Question

The dataset contains the matches of all 20 teams. Every team plays every other team twice in a season, once at home and once away, and each match appears in the data once for each of the two teams involved. For this reason, the dataset was split by team. As the data is from last season (2021-22), I use dplyr to select the top 5 and bottom 5 teams based on their results. The first output below shows the number of rows per team; the next two show Result against Venue for the top 5 and the bottom 5 teams, respectively:

## 
##           Arsenal       Aston Villa         Brentford          Brighton 
##                38                38                38                38 
##           Burnley           Chelsea    Crystal Palace           Everton 
##                38                38                38                38 
##      Leeds United    Leicester City         Liverpool   Manchester City 
##                38                38                38                38 
## Manchester United         Newcastle      Norwich City       Southampton 
##                38                38                38                38 
##         Tottenham           Watford          West Ham            Wolves 
##                38                38                38                38

##    
##      0  1
##   D 17 16
##   L 21 14
##   W 57 65

##    
##      0  1
##   D 24 19
##   L 56 53
##   W 15 23

From the outputs above, the top 5 teams won more matches at home than away (65 versus 57). The bottom 5 teams also won more matches at home than away (23 versus 15). The gap is 8 wins in both groups, but because the top 5 teams have far more wins overall, the gap is proportionally larger for the bottom 5.

The output above therefore gives an initial indication that there is a home-ground advantage for the home team. If that is true, it will be interesting to know which factors contribute most to it.
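
The comparison can be made more concrete by turning the counts into within-venue shares. The sketch below is illustrative only: it re-enters the counts from the two tables above by hand, and it assumes that Venue = 1 denotes home matches, which is the reading used in the paragraph above.

```r
# Counts copied from the two Result-by-Venue tables above.
# matrix() fills column by column: first the Venue 0 column, then Venue 1.
top5 <- matrix(c(17, 21, 57, 16, 14, 65), nrow = 3,
               dimnames = list(Result = c("D", "L", "W"), Venue = c("0", "1")))
bottom5 <- matrix(c(24, 56, 15, 19, 53, 23), nrow = 3,
                  dimnames = list(Result = c("D", "L", "W"), Venue = c("0", "1")))

# Share of each result within a venue (every column sums to 1)
top5_share    <- prop.table(top5, margin = 2)
bottom5_share <- prop.table(bottom5, margin = 2)

round(top5_share["W", ], 3)      # win share by venue, top 5 teams
round(bottom5_share["W", ], 3)   # win share by venue, bottom 5 teams

# Absolute difference in wins between Venue 1 and Venue 0
top5_gap    <- top5["W", "1"] - top5["W", "0"]
bottom5_gap <- bottom5["W", "1"] - bottom5["W", "0"]
c(top5 = top5_gap, bottom5 = bottom5_gap)
```

Each venue column sums to 95 because five teams play 19 matches at each venue.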

The method used in this project

Looking at my dataset, my target variable is categorical. I plan to use logistic regression in this project to identify the factors responsible for a home-ground advantage and to predict the match result. As logistic regression is a well-established prediction technique for categorical target variables, I will try this first.

Summary statistics of 10 most important numerical columns

Most of the columns used in this project are numerical. There are 23 numerical columns in this dataset. The 10 most important numerical columns are:
Venue, Poss, SoTP, GoalieSaveP, PassCmpP, KP, Pass1/3, ProgPass, ShotCA, GoalCA

The output below shows the results of this summary.

##      Venue          Poss         SoTP       GoalieSaveP         PassCmpP    
##  Min.   :0.0   Min.   :19   Min.   : 0.0   Min.   :-100.00   Min.   :52.00  
##  1st Qu.:0.0   1st Qu.:40   1st Qu.:22.2   1st Qu.:  50.00   1st Qu.:73.20  
##  Median :0.5   Median :50   Median :33.3   Median :  66.70   Median :79.10  
##  Mean   :0.5   Mean   :50   Mean   :33.0   Mean   :  67.21   Mean   :77.92  
##  3rd Qu.:1.0   3rd Qu.:60   3rd Qu.:42.9   3rd Qu.: 100.00   3rd Qu.:83.30  
##  Max.   :1.0   Max.   :81   Max.   :83.3   Max.   : 100.00   Max.   :92.70  
##        KP            Pass1/3         ProgPass         ShotCA    
##  Min.   : 0.000   Min.   : 5.00   Min.   : 5.00   Min.   : 0.0  
##  1st Qu.: 6.000   1st Qu.:18.00   1st Qu.:22.00   1st Qu.:13.0  
##  Median : 9.000   Median :26.00   Median :29.00   Median :19.0  
##  Mean   : 9.157   Mean   :28.43   Mean   :31.03   Mean   :19.7  
##  3rd Qu.:12.000   3rd Qu.:36.00   3rd Qu.:39.00   3rd Qu.:25.0  
##  Max.   :25.000   Max.   :95.00   Max.   :79.00   Max.   :55.0  
##      GoalCA      
##  Min.   : 0.000  
##  1st Qu.: 0.000  
##  Median : 2.000  
##  Mean   : 2.189  
##  3rd Qu.: 3.000  
##  Max.   :11.000

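Based on the summary values, there are no missing values (NAs). Apart from the negative minimum of GoalieSaveP (-100), which shows up again as an outlier in the boxplots, nothing looks unusual or irregular enough to indicate a typo.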
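
A check of this kind can be sketched as follows. The data frame here is a small stand-in built from the first six rows listed earlier in the report; the name `matches` is hypothetical, and the real analysis would run the same calls on the full 760-row dataset.

```r
# First six values of three columns, copied from the head() listing above
matches <- data.frame(
  Poss        = c(66, 35, 20, 52, 54, 46),
  GoalieSaveP = c(33.3, 60, 50, 100, 100, 75),
  GoalCA      = c(0, 0, 0, 1, 2, 6)
)

summary(matches)                       # min, quartiles, mean and max per column
na_counts <- colSums(is.na(matches))   # number of missing values per column
na_counts
sapply(matches, range)                 # quick look for impossible values
```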

Frequency tables of 5 most important categorical columns

In this dataset, the target variable is categorical, and the only other categorical column that I expect to influence the target variable is the “Captain” column. I therefore picked these 2 categorical columns:
Result, Captain

The output below shows the results of this summary.

## 
##   D   L   W 
## 176 292 292

## 
##           Aaron Cresswell              Adam Lallana              Adam Webster 
##                         1                         5                         2 
##       Alexandre Lacazette             Asmir Begović                Ben Foster 
##                        16                         1                         1 
##                Ben Gibson                   Ben Mee           Bruno Fernandes 
##                         5                        21                         9 
##             Callum Wilson         César Azpilicueta        Christian Nørgaard 
##                         2                        24                         1 
##               Conor Coady            Craig Cathcart         Cristiano Ronaldo 
##                        38                         1                         1 
##                  Dan Burn               Declan Rice                Ezri Konsa 
##                         1                        34                         1 
##              Fabian Schär        Federico Fernández               Fernandinho 
##                         1                         3                        10 
##              Granit Xhaka              Grant Hanley             Harry Maguire 
##                         2                        33                        28 
##               Hugo Lloris            İlkay Gündoğan                 Jack Cork 
##                        38                        15                         1 
##          Jamaal Lascelles            James McArthur              James Milner 
##                        22                        10                         3 
##           James Tarkowski         James Ward-Prowse                 Joel Ward 
##                        16                        36                        11 
##               John McGinn             Jonjo Shelvey               Jonny Evans 
##                         2                         9                         1 
##          Jordan Henderson           Jordan Pickford                  Jorginho 
##                        29                         1                        12 
##         Kasper Schmeichel           Kevin De Bruyne           Kieran Trippier 
##                        37                         5                         1 
##                Lewis Dunk               Liam Cooper               Lucas Digne 
##                        29                        21                         3 
##          Luka Milivojević               Luke Ayling                Marc Guéhi 
##                         9                        17                         7 
##             Marcos Alonso                Mark Noble           Martin Ødegaard 
##                         1                         3                         8 
##             Michael Keane            Moussa Sissoko              N'Golo Kanté 
##                         2                        31                         1 
##               Oriol Romeu Pierre-Emerick Aubameyang            Pontus Jansson 
##                         2                        12                        37 
##                Rúben Dias            Séamus Coleman               Shane Duffy 
##                         8                        30                         1 
##             Tom Cleverley              Tyrone Mings           Virgil van Dijk 
##                         4                        35                         6 
##             Wilfried Zaha      William Troost-Ekong                Yerry Mina 
##                         1                         1                         1

Frequency tables are generated as shown above. Please note that the Result table counts every match twice, because each match appears once for each of the two teams involved. The 760 rows (176+292+292 = 760) therefore correspond to 380 matches (total divided by 2).
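
The arithmetic behind this note can be written out explicitly. The snippet below only uses the counts from the Result table above; the variable names are illustrative.

```r
result_counts <- c(D = 176, L = 292, W = 292)   # from the frequency table above

team_rows <- sum(result_counts)          # one row per team per match
n_matches <- team_rows / 2               # each match appears in two rows
drawn     <- result_counts[["D"]] / 2    # a draw is recorded for both teams
decisive  <- result_counts[["W"]]        # a decisive match has exactly one winner

c(team_rows = team_rows, matches = n_matches, drawn = drawn, decisive = decisive)
```

The number of wins equals the number of losses because every decisive match contributes one of each.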

Histogram of 5 most important numerical variables

I picked up 5 numerical columns in the dataset. They are:
Venue, GoalCA, ShotCA, KP, GoalieSaveP

The charts below show the histograms.

The first four histograms (Shot on Target Percentage, Goal Creating Actions, Shot Creating Actions, and Key Passes) do not follow a known distribution and are all skewed to the right. The last histogram (Goalie Save Percentage) is skewed to the left.

Boxplot of 5 most important numerical columns

The charts below show the boxplots.

The first boxplot (Shot on Target Percentage) has about 4 outliers; otherwise its values are evenly distributed. The second boxplot (Goal Creating Actions) also has 4 outliers, but most of the values are in the upper quartile. The third boxplot (Shot Creating Actions) has many outliers, but all other values are evenly distributed. The fourth boxplot (Key Passes) has 2 outliers, and all the other values are evenly distributed. The last boxplot (Goalie Save Percentage) has 2 negative outliers, and its other values are heavily concentrated in the lower quartile.
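
The side on which outliers appear can be cross-checked against the summary statistics reported earlier, using the usual 1.5 × IQR rule. This is an approximation under stated assumptions: the quartiles are copied by hand from the summary output, and R's `boxplot()` uses hinges that can differ slightly from these quartiles.

```r
# Min, quartiles and max copied from the summary output earlier in the report
quartiles <- data.frame(
  column = c("SoTP", "GoalCA", "ShotCA", "KP", "GoalieSaveP"),
  min    = c(0, 0, 0, 0, -100),
  q1     = c(22.2, 0, 13, 6, 50),
  q3     = c(42.9, 3, 25, 12, 100),
  max    = c(83.3, 11, 55, 25, 100)
)

iqr <- quartiles$q3 - quartiles$q1
quartiles$lower_fence <- quartiles$q1 - 1.5 * iqr
quartiles$upper_fence <- quartiles$q3 + 1.5 * iqr

# A column has outliers on a side if its extreme value lies beyond the fence
quartiles$has_low_outlier  <- quartiles$min < quartiles$lower_fence
quartiles$has_high_outlier <- quartiles$max > quartiles$upper_fence
quartiles
```

Under this rule only Goalie Save Percentage has outliers on the low side, which agrees with the negative outliers described above.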

Scatter plot of 6 most important numerical columns

My target variable is Result, which holds the result of every match played in the English Premier League in season 2021-22. To learn more about the dataset, the Result column, which is categorical, is mutated into a new numeric column: a win is changed to 3, a draw to 1, and a loss to 0. I look at the scatter plot matrix with Result on the y-axis. For this step, six numerical columns are chosen, as this dataset contains far more numerical columns than categorical ones. In the scatter plots, Venue appears to be related to Result, although the correlation matrix in the next section puts that correlation at only 0.09; the other columns show some correlation as well.
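
The recoding described above can be sketched in base R as shown below. The input vector is illustrative only. The second, binary coding is an assumption on my part: it is the form a binomial `glm()` expects, and the variable names are hypothetical.

```r
result <- c("L", "L", "L", "W", "W", "W", "D")   # illustrative values only

# Points-style coding described in the text: win = 3, draw = 1, loss = 0
points_map <- c(W = 3, D = 1, L = 0)
resultnum_points <- unname(points_map[result])
resultnum_points

# Binary win indicator (assumption): 1 = win, 0 = loss or draw
resultnum_binary <- as.integer(result == "W")
resultnum_binary
```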

Suggesting some numerical predictors for the target variable

I propose Venue, GoalCA, ShotCA, KP, and GoalieSaveP as predictors for my target variable Result. I also provide the correlation matrix between these variables below:

##              Resultnum      Venue      GoalCA       ShotCA          KP
## Resultnum   1.00000000 0.09197389  0.61206217  0.338428818  0.31072999
## Venue       0.09197389 1.00000000  0.06138718  0.181544928  0.17709102
## GoalCA      0.61206217 0.06138718  1.00000000  0.407233778  0.39108041
## ShotCA      0.33842882 0.18154493  0.40723378  1.000000000  0.96113325
## KP          0.31072999 0.17709102  0.39108041  0.961133249  1.00000000
## GoalieSaveP 0.27256265 0.01133267 -0.01560726 -0.003226178 -0.01276763
##              GoalieSaveP
## Resultnum    0.272562650
## Venue        0.011332673
## GoalCA      -0.015607259
## ShotCA      -0.003226178
## KP          -0.012767633
## GoalieSaveP  1.000000000
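
A matrix like the one above is what `cor()` returns for a numeric data frame. As a small illustration, the sketch below applies it to the first ten values of three columns, copied from the structure listing earlier in the report. Ten rows are far too few to reproduce the figures above, so the numbers will differ; the point is only the mechanics.

```r
# First ten values of each column, copied from the str() output above
goal_ca <- c(0, 0, 0, 1, 2, 6, 0, 2, 5, 1)
shot_ca <- c(42, 9, 2, 48, 21, 23, 13, 26, 32, 13)
kp      <- c(20, 5, 1, 23, 9, 11, 5, 10, 15, 7)

sample_df <- data.frame(GoalCA = goal_ca, ShotCA = shot_ca, KP = kp)
cor_mat <- cor(sample_df)   # Pearson correlation by default
round(cor_mat, 2)
```

Even in this small sample, ShotCA and KP move almost in lockstep, in line with the 0.96 reported for the full dataset.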

Boxplot of 4 numerical variables versus the target variable

The categorical target variable column was chosen along with the top 4 predictor variables from the previous section. The numerical columns I selected are Resultnum, GoalCA, ShotCA, KP, GoalieSaveP
The boxplots are below:

Looking at the output above, Result appears to be most strongly related to Goal Creating Actions and to Goalie Save Percentage.

Creating the prediction model

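Based on the analysis explained above, I decided to use the following numerical variables as predictors for the “Resultnum” variable: Venue, GoalCA, ShotCA, KP, GoalieSaveP, and SoTP.

My logistic regression model is: Resultnum ~ Venue + GoalCA + ShotCA + KP + GoalieSaveP + SoTP. Because the model is fitted with family = "binomial", Resultnum has to be binary at this stage; the confusion matrices and the “Loss/Draw” prediction label later in the report are consistent with 1 for a win and 0 for a loss or draw.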

The following shows how I create this model using glm():
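
The call itself is visible in the regression report further down. A minimal, self-contained sketch of the same call on synthetic data is given here. Everything about the data is an assumption made for illustration: the data frame `sim`, the seed, the distributions, and the coefficients used to simulate the outcome are invented and do not come from the Premier League dataset.

```r
set.seed(42)   # hypothetical seed, only to make the sketch reproducible
n <- 400

# Synthetic predictors with roughly plausible ranges (assumption)
sim <- data.frame(
  Venue       = rbinom(n, 1, 0.5),
  GoalCA      = rpois(n, 2),
  ShotCA      = rpois(n, 20),
  KP          = rpois(n, 9),
  GoalieSaveP = runif(n, 0, 100),
  SoTP        = runif(n, 0, 80)
)

# Synthetic binary outcome driven by GoalCA and GoalieSaveP only
sim$Resultnum <- rbinom(n, 1, plogis(-4 + 1.0 * sim$GoalCA + 0.03 * sim$GoalieSaveP))

# Same formula and family as the model in this report
fit <- glm(Resultnum ~ Venue + GoalCA + ShotCA + KP + GoalieSaveP + SoTP,
           family = "binomial", data = sim)
summary(fit)$coefficients
```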

Creating the training and validation partitions

I use 70% for training and 30% for validation partitions.
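
A random 70/30 split of 760 rows gives 532 training rows and 228 validation rows, which matches the totals of the two confusion matrices later in the report. One way to draw such a split is sketched below; the seed and the data frame name `matches` are assumptions, since the report does not state them.

```r
set.seed(1)   # hypothetical seed; the report does not state one
n <- 760      # rows in the dataset

train.index <- sample(n, size = round(0.7 * n))    # 70% of the row numbers
valid.index <- setdiff(seq_len(n), train.index)    # the remaining 30%

length(train.index)
length(valid.index)

# With the full data frame loaded under the hypothetical name `matches`:
# train.df <- matches[train.index, ]
# valid.df <- matches[valid.index, ]
```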

Develop the model using the training partition

I now develop my regression model on the training dataset. I will later use the validation partition to see how good my model is.

## 
## Call:
## glm(formula = Resultnum ~ Venue + GoalCA + ShotCA + KP + GoalieSaveP + 
##     SoTP, family = "binomial", data = train.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6267  -0.4592  -0.1353   0.3185   3.5097  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.389003   0.983356  -9.548  < 2e-16 ***
## Venue        0.214892   0.285415   0.753   0.4515    
## GoalCA       1.231663   0.126174   9.762  < 2e-16 ***
## ShotCA       0.088516   0.055237   1.602   0.1090    
## KP          -0.064559   0.111917  -0.577   0.5640    
## GoalieSaveP  0.057164   0.007099   8.052 8.13e-16 ***
## SoTP         0.023691   0.010838   2.186   0.0288 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 707.38  on 531  degrees of freedom
## Residual deviance: 334.77  on 525  degrees of freedom
## AIC: 348.77
## 
## Number of Fisher Scoring iterations: 6

Some comments on the regression report printed above

The following variables are found to be significant: GoalCA, GoalieSaveP and SoTP

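Venue, Key Passes (KP), and Shot Creating Actions (ShotCA) are not significant. These variables can be dropped from the prediction model. KP and ShotCA are also almost perfectly correlated with each other (0.96 in the correlation matrix above), so the model cannot separate their effects well.

The model as a whole is highly significant: the deviance falls from 707.38 (null) to 334.77 (residual), a drop of 372.61 on 6 degrees of freedom, and the strongest predictor, GoalCA, has a p-value below 2e-16.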
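
The overall significance can be checked with a likelihood-ratio test built from the two deviances in the report, and the coefficients can be read as odds ratios by exponentiating them. The sketch below uses only numbers printed in the regression output; the variable names are illustrative.

```r
# Deviances and degrees of freedom from the regression report
null_dev  <- 707.38; null_df  <- 531
resid_dev <- 334.77; resid_df <- 525

lr_stat <- null_dev - resid_dev   # drop in deviance
lr_df   <- null_df - resid_df     # number of predictors added
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
c(lr_stat = lr_stat, lr_df = lr_df, p_value = p_value)

# For a 0/1 response, AIC = residual deviance + 2 * number of coefficients
aic <- resid_dev + 2 * 7
aic

# Odds ratios: multiplicative change in the odds of a win per one-unit increase
coefs <- c(Venue = 0.214892, GoalCA = 1.231663,
           GoalieSaveP = 0.057164, SoTP = 0.023691)
odds_ratios <- exp(coefs)
round(odds_ratios, 3)
```

The AIC computed this way reproduces the 348.77 shown in the report, which is a useful consistency check on the copied numbers.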

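Finally, I inspect the distribution of the model residuals. Logistic regression does not assume Normally distributed residuals, so this plot serves only as a rough check for poorly fitted observations.

The chart above looks roughly Normally distributed, with three outliers.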

Applying the model on the validation and training sets and calculating performance measures

The model developed needs to be assessed for predictive power. I applied the model to the training and validation sets separately and calculated the model’s performance measures on each.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 304  37
##          1  25 166
##                                           
##                Accuracy : 0.8835          
##                  95% CI : (0.8531, 0.9095)
##     No Information Rate : 0.6184          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7502          
##                                           
##  Mcnemar's Test P-Value : 0.1624          
##                                           
##             Sensitivity : 0.8177          
##             Specificity : 0.9240          
##          Pos Pred Value : 0.8691          
##          Neg Pred Value : 0.8915          
##              Prevalence : 0.3816          
##          Detection Rate : 0.3120          
##    Detection Prevalence : 0.3590          
##       Balanced Accuracy : 0.8709          
##                                           
##        'Positive' Class : 1               
##

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 125  21
##          1  14  68
##                                          
##                Accuracy : 0.8465         
##                  95% CI : (0.793, 0.8907)
##     No Information Rate : 0.6096         
##     P-Value [Acc > NIR] : 4.799e-15      
##                                          
##                   Kappa : 0.6728         
##                                          
##  Mcnemar's Test P-Value : 0.3105         
##                                          
##             Sensitivity : 0.7640         
##             Specificity : 0.8993         
##          Pos Pred Value : 0.8293         
##          Neg Pred Value : 0.8562         
##              Prevalence : 0.3904         
##          Detection Rate : 0.2982         
##    Detection Prevalence : 0.3596         
##       Balanced Accuracy : 0.8317         
##                                          
##        'Positive' Class : 1              
##

Some comments on the performance measures

The performance measures show the prediction performance first on the training set and then on the validation set. The most important measure here is Accuracy, the share of matches the model classifies correctly. Accuracy is 88% on the training set and 85% on the validation set, against no-information rates of 62% and 61% respectively. Both values are good and acceptable for this dataset.

The gap between the two sets is small, so there is nothing alarming. Therefore my model is good to use for actual predictions.
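
The headline measures can be recomputed directly from the four cells of each confusion matrix. The sketch below re-enters the counts printed above; `cm_metrics` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Rows = predicted class, columns = actual class, as in the printed matrices.
# matrix() fills column by column: first the Reference 0 column, then Reference 1.
labels <- list(Prediction = c("0", "1"), Reference = c("0", "1"))
train_cm <- matrix(c(304, 25, 37, 166), nrow = 2, dimnames = labels)
valid_cm <- matrix(c(125, 14, 21, 68), nrow = 2, dimnames = labels)

# Hypothetical helper: accuracy, sensitivity and specificity with class 1 as positive
cm_metrics <- function(cm) {
  c(accuracy    = sum(diag(cm)) / sum(cm),
    sensitivity = cm["1", "1"] / sum(cm[, "1"]),   # true positives / actual positives
    specificity = cm["0", "0"] / sum(cm[, "0"]))   # true negatives / actual negatives
}

train_metrics <- cm_metrics(train_cm)
valid_metrics <- cm_metrics(valid_cm)
round(rbind(training = train_metrics, validation = valid_metrics), 4)
```

Sensitivity is lower than specificity on both sets, so the model is better at recognising a loss or draw than at recognising a win.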

A prediction demonstration using test data

Suppose the following new match data is given to us. Our job is to predict its result without the match being played.

GoalCA=2, ShotCA=20, KP=9, SoTP=33, GoalieSaveP=67

##           1 
## "Loss/Draw"

The model predicts that this match will end in a loss or draw.
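
The prediction can be traced by hand from the coefficients in the regression report. Two things in the sketch below are assumptions: the new match data does not list Venue, so both possible values are tried, and a probability cutoff of 0.5 is used to turn the predicted probability into a class label.

```r
# Coefficients copied from the regression report
coefs <- c("(Intercept)" = -9.389003, Venue = 0.214892, GoalCA = 1.231663,
           ShotCA = 0.088516, KP = -0.064559, GoalieSaveP = 0.057164,
           SoTP = 0.023691)

# Predicted win probability for the new match, for a given Venue value
predict_win_prob <- function(venue) {
  # Order matches coefs: intercept, Venue, GoalCA, ShotCA, KP, GoalieSaveP, SoTP
  x <- c(1, venue, 2, 20, 9, 67, 33)
  plogis(sum(coefs * x))   # inverse logit of the linear predictor
}

p_venue0 <- predict_win_prob(0)
p_venue1 <- predict_win_prob(1)
round(c(Venue0 = p_venue0, Venue1 = p_venue1), 3)

# Assumed cutoff of 0.5 for the class label
labels <- ifelse(c(p_venue0, p_venue1) > 0.5, "Win", "Loss/Draw")
labels
```

Both probabilities fall below 0.5, so the label is the same whichever Venue value is used, in agreement with the output above.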

Conclusion and Final Remarks

In this project, a dataset was selected that gives specific and in-depth information about a Premier League season. The match result was investigated using the data. Using exploratory analysis and visualizations, we identified the most suitable predictors for the match result. The model was developed using the training partition and tested using the validation partition. We have seen that there is not much performance difference between these two sets. Finally, the model was used for a prediction to demonstrate that it works.

The dataset and model used in this project lead to the conclusion that there was no home-ground advantage in the Premier League season 2021-22. Thus, our business question is answered as a no. The initial analysis shows that teams win somewhat more often on their home ground, but the model suggests otherwise: once in-game performance is accounted for, Venue is not a significant predictor (p = 0.45). My model says that if a team has a high number of Goal Creating Actions (GoalCA), a high Goalie Save Percentage (GoalieSaveP), and a high Shots on Target Percentage (SoTP) during the game, it is likely to win the game regardless of venue (home or away).

Future studies may use a larger dataset, different methods to select the predictors, and alternative prediction techniques such as decision trees.

I did not encounter any major difficulty in this project because the data was cleaned and ready to use.