The dataset used in this project comprises data from the 2021-22 season of the English Premier League (UK). The data was extracted from FBRef.com, a website that collects soccer data for leagues across the globe. The final dataset draws on several pages within the website, which were collected, cleaned, and merged in an Excel file. The source provides numerous stats that are recorded during a game. For the project, I chose the stats that I felt made the most difference to the outcome (end result) of a game. The final dataset can be found in the GitHub repository: aliRep1.
The dataset has 760 rows and 32 columns. The names and types of the columns, along with the first 6 and last 6 rows of the dataset, are listed below.
## 'data.frame': 760 obs. of 32 variables:
## $ Date : POSIXct, format: "2021-08-13" "2021-08-22" ...
## $ Time : chr "20:00 (15:00)" "16:30 (11:30)" "12:30 (07:30)" "15:00 (10:00)" ...
## $ Round : chr "Matchweek 1" "Matchweek 2" "Matchweek 3" "Matchweek 4" ...
## $ Day : chr "Fri" "Sun" "Sat" "Sat" ...
## $ Venue : num 0 1 0 1 0 1 0 1 1 0 ...
## $ Result : chr "L" "L" "L" "W" ...
## $ GF : num 0 0 0 1 1 3 0 2 3 2 ...
## $ GA : num 2 2 5 0 0 1 0 2 1 0 ...
## $ Team : chr "Arsenal" "Arsenal" "Arsenal" "Arsenal" ...
## $ Opponent : chr "Brentford" "Chelsea" "Manchester City" "Norwich City" ...
## $ xG : num 1.4 0.3 0.1 2.8 1.2 0.8 0.5 1.4 2.7 0.9 ...
## $ xGA : num 1.3 2.9 3.8 0.6 1.3 1 1.4 0.9 1.4 1.2 ...
## $ Poss : num 66 35 20 52 54 46 41 55 54 36 ...
## $ Attendance : num 16479 58729 52276 58000 20000 ...
## $ Captain : chr "Granit Xhaka" "Granit Xhaka" "Pierre-Emerick Aubameyang" "Pierre-Emerick Aubameyang" ...
## $ Referee : chr "Michael Oliver" "Paul Tierney" "Martin Atkinson" "Michael Oliver" ...
## $ SoTP : num 18.2 50 0 20 23.1 58.3 25 35.3 38.1 55.6 ...
## $ ShotsFK : num 1 0 0 1 1 0 1 1 1 0 ...
## $ ShotsPK : num 0 0 0 0 0 0 0 0 0 0 ...
## $ GoalieSaveP : num 33.3 60 50 100 100 75 100 66.7 75 100 ...
## $ PassCmpP : num 87.2 78.7 65.4 82 80.9 84 76.3 86.4 80.5 75.6 ...
## $ KP : num 20 5 1 23 9 11 5 10 15 7 ...
## $ Pass1/3 : num 37 20 9 30 30 30 14 48 29 12 ...
## $ Pass18yard : num 19 8 2 16 5 4 5 15 8 4 ...
## $ Cross18yard : num 6 0 1 5 1 0 0 4 1 1 ...
## $ ProgPass : num 51 22 11 35 31 23 19 39 21 12 ...
## $ ShotCA : num 42 9 2 48 21 23 13 26 32 13 ...
## $ GoalCA : num 0 0 0 1 2 6 0 2 5 1 ...
## $ DribbleTklP : num 36.4 46.7 42.9 46.2 40 53.3 33.3 63.6 44.4 43.8 ...
## $ Blocks : num 14 24 23 22 24 12 19 11 14 19 ...
## $ Intercepts : num 19 16 16 6 20 14 18 11 10 13 ...
## $ DribbleSuccP: num 50 60 30 52.2 42.9 45.5 58.8 58.8 41.7 58.8 ...
## Date Time Round Day Venue Result GF GA Team
## 1 2021-08-13 20:00 (15:00) Matchweek 1 Fri 0 L 0 2 Arsenal
## 2 2021-08-22 16:30 (11:30) Matchweek 2 Sun 1 L 0 2 Arsenal
## 3 2021-08-28 12:30 (07:30) Matchweek 3 Sat 0 L 0 5 Arsenal
## 4 2021-09-11 15:00 (10:00) Matchweek 4 Sat 1 W 1 0 Arsenal
## 5 2021-09-18 15:00 (10:00) Matchweek 5 Sat 0 W 1 0 Arsenal
## 6 2021-09-26 16:30 (11:30) Matchweek 6 Sun 1 W 3 1 Arsenal
## Opponent xG xGA Poss Attendance Captain
## 1 Brentford 1.4 1.3 66 16479 Granit Xhaka
## 2 Chelsea 0.3 2.9 35 58729 Granit Xhaka
## 3 Manchester City 0.1 3.8 20 52276 Pierre-Emerick Aubameyang
## 4 Norwich City 2.8 0.6 52 58000 Pierre-Emerick Aubameyang
## 5 Burnley 1.2 1.3 54 20000 Pierre-Emerick Aubameyang
## 6 Tottenham 0.8 1.0 46 59919 Pierre-Emerick Aubameyang
## Referee SoTP ShotsFK ShotsPK GoalieSaveP PassCmpP KP Pass1/3
## 1 Michael Oliver 18.2 1 0 33.3 87.2 20 37
## 2 Paul Tierney 50.0 0 0 60.0 78.7 5 20
## 3 Martin Atkinson 0.0 0 0 50.0 65.4 1 9
## 4 Michael Oliver 20.0 1 0 100.0 82.0 23 30
## 5 Anthony Taylor 23.1 1 0 100.0 80.9 9 30
## 6 Craig Pawson 58.3 0 0 75.0 84.0 11 30
## Pass18yard Cross18yard ProgPass ShotCA GoalCA DribbleTklP Blocks Intercepts
## 1 19 6 51 42 0 36.4 14 19
## 2 8 0 22 9 0 46.7 24 16
## 3 2 1 11 2 0 42.9 23 16
## 4 16 5 35 48 1 46.2 22 6
## 5 5 1 31 21 2 40.0 24 20
## 6 4 0 23 23 6 53.3 12 14
## DribbleSuccP
## 1 50.0
## 2 60.0
## 3 30.0
## 4 52.2
## 5 42.9
## 6 45.5
## Date Time Round Day Venue Result GF GA Team
## 755 2022-04-23 15:00 (10:00) Matchweek 34 Sat 1 L 0 3 Norwich City
## 756 2022-04-30 15:00 (10:00) Matchweek 35 Sat 0 L 0 2 Norwich City
## 757 2022-05-08 14:00 (09:00) Matchweek 36 Sun 1 L 0 4 Norwich City
## 758 2022-05-11 19:45 (14:45) Matchweek 21 Wed 0 L 0 3 Norwich City
## 759 2022-05-15 14:00 (09:00) Matchweek 37 Sun 0 D 1 1 Norwich City
## 760 2022-05-22 16:00 (11:00) Matchweek 38 Sun 1 L 0 5 Norwich City
## Opponent xG xGA Poss Attendance Captain Referee SoTP
## 755 Newcastle Utd 0.7 2.4 46 26910 Grant Hanley Chris Kavanagh 40.0
## 756 Aston Villa 0.6 2.2 55 40290 Grant Hanley John Brooks 30.0
## 757 West Ham 0.7 3.1 37 26428 Grant Hanley Robert Jones 25.0
## 758 Leicester City 1.2 2.5 36 38092 Grant Hanley Simon Hooper 55.6
## 759 Wolves 1.3 0.9 37 31219 Grant Hanley Tony Harrington 18.2
## 760 Tottenham 0.3 3.7 40 27022 Grant Hanley Chris Kavanagh 0.0
## ShotsFK ShotsPK GoalieSaveP PassCmpP KP Pass1/3 Pass18yard Cross18yard
## 755 1 0 57.1 75.0 4 20 9 0
## 756 0 0 66.7 82.5 6 13 4 0
## 757 1 0 25.0 81.3 5 19 6 1
## 758 0 0 62.5 80.4 5 16 7 1
## 759 0 0 75.0 75.8 9 12 2 0
## 760 0 0 58.3 79.4 5 18 4 1
## ProgPass ShotCA GoalCA DribbleTklP Blocks Intercepts DribbleSuccP
## 755 31 10 0 38.9 14 17 41.2
## 756 23 12 0 0.0 18 20 35.7
## 757 25 12 0 21.4 16 12 87.5
## 758 22 12 0 43.8 19 16 40.0
## 759 14 19 2 36.4 10 22 75.0
## 760 21 12 0 18.8 8 16 46.2
The question investigated in this project is:
“Was there a home-ground advantage in the English Premier League season
2021-22?” It is often believed in sports that a team has an advantage
when playing at its home ground/court. To answer this question, we
need to know the determinants (i.e. predictors) that can influence a
match result, and the dataset provides these match-level specifics.
The project aims to predict a win or loss based on these contributing
factors.
Therefore, my target variable in this project is the “Result” column.
The dataset lists every match once for each of the two teams involved, and every team plays each other team twice in one season, once home and once away, so each team has 38 rows. For this reason, the dataset was split according to teams. As the dataset is from last season (2021-22), I sort the data using dplyr and tabulate Result by Venue (0 = away, 1 = home). The output below shows the number of rows per team, followed by the table for the top 5 teams and then the table for the bottom 5 teams:
##
## Arsenal Aston Villa Brentford Brighton
## 38 38 38 38
## Burnley Chelsea Crystal Palace Everton
## 38 38 38 38
## Leeds United Leicester City Liverpool Manchester City
## 38 38 38 38
## Manchester United Newcastle Norwich City Southampton
## 38 38 38 38
## Tottenham Watford West Ham Wolves
## 38 38 38 38
##
## 0 1
## D 17 16
## L 21 14
## W 57 65
##
## 0 1
## D 24 19
## L 56 53
## W 15 23
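To make the comparison between the two tables concrete, the counts can be turned into win rates per venue. The sketch below re-enters the counts from the two tables above by hand; the object names (`top5`, `bottom5`, `win_rate`) are illustrative and are not taken from the original analysis code.

```r
# Result-by-Venue counts copied from the two tables above
# (columns: Venue 0 = away, 1 = home; rows: D, L, W)
top5 <- matrix(c(17, 21, 57, 16, 14, 65), nrow = 3,
               dimnames = list(Result = c("D", "L", "W"), Venue = c("0", "1")))
bottom5 <- matrix(c(24, 56, 15, 19, 53, 23), nrow = 3,
                  dimnames = list(Result = c("D", "L", "W"), Venue = c("0", "1")))

# Share of matches won at each venue (column-wise proportions)
win_rate <- function(tab) prop.table(tab, margin = 2)["W", ]

win_top    <- win_rate(top5)
win_bottom <- win_rate(bottom5)

print(round(win_top, 3))
print(round(win_bottom, 3))

# Absolute home-minus-away difference in wins for each group
gap_top    <- top5["W", "1"] - top5["W", "0"]
gap_bottom <- bottom5["W", "1"] - bottom5["W", "0"]
```

Each group plays 95 home and 95 away matches, so the win rates are directly comparable across the two tables.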
From the outputs above, the top 5 teams won more matches at home (65) than away (57). The bottom 5 teams also won more matches at home (23) than away (15). The absolute gap is the same in both groups (8 wins), but because the bottom 5 teams win far fewer matches overall, the gap is proportionally larger for them.
The output above suggests that there is a home-ground advantage for the home team. If that is true, it will be interesting to know which factors contribute most to it.
My target variable is categorical, so I plan to use logistic regression in this project to identify the factors responsible for a home-ground advantage. Logistic regression is a well-established prediction technique for categorical target variables, so I will try it first.
This project relies mostly on numerical columns. There are 23
numerical columns in this dataset. The 10 most
important numerical columns are:
Venue, Poss, SoTP, GoalieSaveP, PassCmpP, KP, Pass1/3, ProgPass, ShotCA,
GoalCA
The output below shows the results of this summary.
## Venue Poss SoTP GoalieSaveP PassCmpP
## Min. :0.0 Min. :19 Min. : 0.0 Min. :-100.00 Min. :52.00
## 1st Qu.:0.0 1st Qu.:40 1st Qu.:22.2 1st Qu.: 50.00 1st Qu.:73.20
## Median :0.5 Median :50 Median :33.3 Median : 66.70 Median :79.10
## Mean :0.5 Mean :50 Mean :33.0 Mean : 67.21 Mean :77.92
## 3rd Qu.:1.0 3rd Qu.:60 3rd Qu.:42.9 3rd Qu.: 100.00 3rd Qu.:83.30
## Max. :1.0 Max. :81 Max. :83.3 Max. : 100.00 Max. :92.70
## KP Pass1/3 ProgPass ShotCA
## Min. : 0.000 Min. : 5.00 Min. : 5.00 Min. : 0.0
## 1st Qu.: 6.000 1st Qu.:18.00 1st Qu.:22.00 1st Qu.:13.0
## Median : 9.000 Median :26.00 Median :29.00 Median :19.0
## Mean : 9.157 Mean :28.43 Mean :31.03 Mean :19.7
## 3rd Qu.:12.000 3rd Qu.:36.00 3rd Qu.:39.00 3rd Qu.:25.0
## Max. :25.000 Max. :95.00 Max. :79.00 Max. :55.0
## GoalCA
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 2.000
## Mean : 2.189
## 3rd Qu.: 3.000
## Max. :11.000
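Based on the summary values, there are no missing values (NAs). The one value that stands out is the minimum GoalieSaveP of -100, which lies outside the usual 0 to 100 range for a percentage; nothing else looks irregular enough to indicate a typo.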
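A check of this kind can be automated. The sketch below runs on a tiny made-up data frame (`stats` is a hypothetical stand-in, not the project data) and counts missing values per column as well as percentage values that fall outside 0 to 100, which is the kind of check that would surface the GoalieSaveP minimum of -100 reported in the summary.

```r
# Hypothetical stand-in for two percentage columns of the match data
stats <- data.frame(
  Poss        = c(66, 35, 20),
  GoalieSaveP = c(33.3, -100, 100)  # -100 mirrors the minimum in the summary
)

# Number of missing values in each column
na_counts <- colSums(is.na(stats))

# Number of values outside the expected 0-100 range in each column
out_of_range <- sapply(stats, function(x) sum(x < 0 | x > 100, na.rm = TRUE))

print(na_counts)
print(out_of_range)
```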
In this dataset, the target variable is categorical, and the only
other categorical column that I expect to have an influence on the
target variable is the “Captain” column. I therefore picked these 2
categorical columns in the dataset. They are:
Result, Captain
The output below shows the results of this summary.
##
## D L W
## 176 292 292
##
## Aaron Cresswell Adam Lallana Adam Webster
## 1 5 2
## Alexandre Lacazette Asmir Begović Ben Foster
## 16 1 1
## Ben Gibson Ben Mee Bruno Fernandes
## 5 21 9
## Callum Wilson César Azpilicueta Christian Nørgaard
## 2 24 1
## Conor Coady Craig Cathcart Cristiano Ronaldo
## 38 1 1
## Dan Burn Declan Rice Ezri Konsa
## 1 34 1
## Fabian Schär Federico Fernández Fernandinho
## 1 3 10
## Granit Xhaka Grant Hanley Harry Maguire
## 2 33 28
## Hugo Lloris İlkay Gündoğan Jack Cork
## 38 15 1
## Jamaal Lascelles James McArthur James Milner
## 22 10 3
## James Tarkowski James Ward-Prowse Joel Ward
## 16 36 11
## John McGinn Jonjo Shelvey Jonny Evans
## 2 9 1
## Jordan Henderson Jordan Pickford Jorginho
## 29 1 12
## Kasper Schmeichel Kevin De Bruyne Kieran Trippier
## 37 5 1
## Lewis Dunk Liam Cooper Lucas Digne
## 29 21 3
## Luka Milivojević Luke Ayling Marc Guéhi
## 9 17 7
## Marcos Alonso Mark Noble Martin Ødegaard
## 1 3 8
## Michael Keane Moussa Sissoko N'Golo Kanté
## 2 31 1
## Oriol Romeu Pierre-Emerick Aubameyang Pontus Jansson
## 2 12 37
## Rúben Dias Séamus Coleman Shane Duffy
## 8 30 1
## Tom Cleverley Tyrone Mings Virgil van Dijk
## 4 35 6
## Wilfried Zaha William Troost-Ekong Yerry Mina
## 1 1 1
Frequency tables are generated as shown above. Please note that the Result table counts every match twice, once from each team's perspective, so the 760 rows (176 + 292 + 292 = 760) correspond to 380 matches (total divided by 2).
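The double counting can be checked with a few lines of arithmetic. The sketch below only uses the counts from the Result table above; the variable names are illustrative.

```r
# Row counts from the Result frequency table above
result_counts <- c(D = 176, L = 292, W = 292)

total_rows <- sum(result_counts)   # one row per team per match
n_matches  <- total_rows / 2       # each match appears in two rows

# A drawn match contributes two "D" rows; a decisive match contributes
# one "W" row and one "L" row
n_draws    <- result_counts[["D"]] / 2
n_decisive <- result_counts[["W"]]

print(c(matches = n_matches, draws = n_draws, decisive = n_decisive))
```

This also explains why the W and L counts are identical: every win recorded for one team is a loss recorded for its opponent.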
I picked 5 numerical columns in the dataset. They are:
Venue, GoalCA, ShotCA, KP, GoalieSaveP
The charts below show the histograms.
The first four histograms (Shot on Target Percentage, Goal Creating Actions, Shot Creating Actions, and Key Passes) do not follow a recognizable standard distribution, and all are skewed to the right. The last histogram (Goalie Save Percentage) is skewed to the left.
The charts below show the boxplots.
The first boxplot (Shot on Target Percentage) has about 4 outliers; otherwise its values are evenly spread. The second boxplot (Goal Creating Actions) also has 4 outliers, but most of the values are in the upper quartile. The third boxplot (Shot Creating Actions) has a lot of outliers, but all other values are evenly spread. The fourth boxplot (Key Passes) has 2 outliers, and all the other values are evenly spread. The last boxplot (Goalie Save Percentage) has 2 negative outliers, and its other values are heavily concentrated in the lower quartile.
My target variable is Result, which is the result of each match played in the English Premier League in season 2021-22. To answer the project question and learn more about the dataset, the categorical Result column is mutated into a new numeric column, Resultnum: a win is changed to 3, a draw to 1, and a loss to 0. I look at the scatter plot matrix with Result on the y-axis. For this step, six numerical columns are chosen, as this dataset contains far more numerical columns than categorical ones. Venue appears to be related to Result, and all other columns show some correlation as well.
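The recoding described above can be sketched in base R with a named lookup vector. The short `result` vector is made-up example data. The second line, a binary win flag, is an assumption added for illustration: a model fit with `family = "binomial"` needs a 0/1 target, and the report does not show how that coding was produced.

```r
# Made-up example of the Result column
result <- c("L", "L", "W", "D", "W")

# Points-style coding described in the text: W = 3, D = 1, L = 0
points_map <- c(W = 3, D = 1, L = 0)
resultnum  <- unname(points_map[result])

# Assumed binary coding for a binomial model: 1 = win, 0 = loss or draw
win_flag <- as.integer(result == "W")

print(data.frame(result, resultnum, win_flag))
```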
I propose Venue, GoalCA, ShotCA, KP, and GoalieSaveP as predictors for my target variable Result. The correlation matrix between these variables is shown below:
## Resultnum Venue GoalCA ShotCA KP
## Resultnum 1.00000000 0.09197389 0.61206217 0.338428818 0.31072999
## Venue 0.09197389 1.00000000 0.06138718 0.181544928 0.17709102
## GoalCA 0.61206217 0.06138718 1.00000000 0.407233778 0.39108041
## ShotCA 0.33842882 0.18154493 0.40723378 1.000000000 0.96113325
## KP 0.31072999 0.17709102 0.39108041 0.961133249 1.00000000
## GoalieSaveP 0.27256265 0.01133267 -0.01560726 -0.003226178 -0.01276763
## GoalieSaveP
## Resultnum 0.272562650
## Venue 0.011332673
## GoalCA -0.015607259
## ShotCA -0.003226178
## KP -0.012767633
## GoalieSaveP 1.000000000
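One way to read the matrix above is to rank the predictors by their correlation with Resultnum and to flag predictor pairs that are nearly collinear. The sketch below re-enters the reported correlations rounded to three decimals; the 0.9 threshold is an arbitrary choice for illustration, not a value from the original analysis.

```r
vars <- c("Resultnum", "Venue", "GoalCA", "ShotCA", "KP", "GoalieSaveP")

# Correlations copied from the matrix above, rounded to 3 decimals
r <- matrix(c(
  1.000, 0.092,  0.612,  0.338,  0.311,  0.273,
  0.092, 1.000,  0.061,  0.182,  0.177,  0.011,
  0.612, 0.061,  1.000,  0.407,  0.391, -0.016,
  0.338, 0.182,  0.407,  1.000,  0.961, -0.003,
  0.311, 0.177,  0.391,  0.961,  1.000, -0.013,
  0.273, 0.011, -0.016, -0.003, -0.013,  1.000
), nrow = 6, byrow = TRUE, dimnames = list(vars, vars))

# Predictors ordered by absolute correlation with the target
with_target <- sort(abs(r["Resultnum", -1]), decreasing = TRUE)
print(with_target)

# Predictor pairs with |r| above the (arbitrary) 0.9 threshold
pred <- r[-1, -1]
high <- which(abs(pred) > 0.9 & upper.tri(pred), arr.ind = TRUE)
collinear <- data.frame(var1 = rownames(pred)[high[, "row"]],
                        var2 = colnames(pred)[high[, "col"]])
print(collinear)
```

With the reported values, GoalCA ranks first and Venue last, and ShotCA and KP are the only pair flagged, which suggests the two carry largely the same information.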
The target variable was plotted against the top 4 predictor
variables identified in the previous step. The
numerical columns I selected are Resultnum, GoalCA, ShotCA, KP,
GoalieSaveP
The boxplots are below:
Judging from the output above, Result and Goal Creating Actions, and Result and Goalie Save Percentage, are the pairs that are most strongly correlated.
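Based on the analysis explained above, I decided to use the following variables as predictors for the “Resultnum” variable. Numerical predictors: Venue, GoalCA, ShotCA, KP, GoalieSaveP, and SoTP
My logistic regression model is: Resultnum ~ Venue + GoalCA + ShotCA + KP + GoalieSaveP + SoTP. The model is fit with the binomial family, and the confusion matrices and predicted label reported later suggest that Resultnum is coded here as 1 for a win and 0 for a loss or draw.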
The following shows how I create this model using glm():
I use 70% of the data as the training partition and 30% as the validation partition.
I now develop my regression model on the training dataset. I will later use the validation partition to see how good my model is.
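The partition and fit described above could be sketched as follows. This is a minimal illustration on simulated data, not the code that produced the output in this report: the data frame `sim`, the seed, the simulated coefficients, and the reduced set of three predictors are all assumptions made to keep the example self-contained. Only `train.df`, `Resultnum`, and `family = "binomial"` are names taken from the model call.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in for the match data (all values are made up)
n <- 200
sim <- data.frame(
  Venue       = rep(c(0, 1), length.out = n),
  GoalCA      = rpois(n, 2),
  GoalieSaveP = runif(n, 0, 100)
)

# Binary target: 1 = win, 0 = loss or draw
p_win <- plogis(-4 + 1.0 * sim$GoalCA + 0.03 * sim$GoalieSaveP)
sim$Resultnum <- rbinom(n, 1, p_win)

# 70% training / 30% validation partition
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train.df  <- sim[train_idx, ]
valid.df  <- sim[-train_idx, ]

# Logistic regression fitted on the training partition only
fit <- glm(Resultnum ~ Venue + GoalCA + GoalieSaveP,
           family = "binomial", data = train.df)
print(summary(fit))

# Predicted win probabilities for the held-out validation rows
valid_prob <- predict(fit, newdata = valid.df, type = "response")
```

Sampling row indices once and using the complement for validation guarantees that the two partitions do not overlap.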
##
## Call:
## glm(formula = Resultnum ~ Venue + GoalCA + ShotCA + KP + GoalieSaveP +
## SoTP, family = "binomial", data = train.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6267 -0.4592 -0.1353 0.3185 3.5097
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.389003 0.983356 -9.548 < 2e-16 ***
## Venue 0.214892 0.285415 0.753 0.4515
## GoalCA 1.231663 0.126174 9.762 < 2e-16 ***
## ShotCA 0.088516 0.055237 1.602 0.1090
## KP -0.064559 0.111917 -0.577 0.5640
## GoalieSaveP 0.057164 0.007099 8.052 8.13e-16 ***
## SoTP 0.023691 0.010838 2.186 0.0288 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 707.38 on 531 degrees of freedom
## Residual deviance: 334.77 on 525 degrees of freedom
## AIC: 348.77
##
## Number of Fisher Scoring iterations: 6
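The following variables are found to be significant at the 5% level: GoalCA, GoalieSaveP and SoTP.
Venue, Key Passes (KP) and Shot Creating Actions (ShotCA) are not significant. These variables can be dropped from the prediction model.
The model as a whole is significant: the deviance drops from 707.38 (null, 531 degrees of freedom) to 334.77 (residual, 525 degrees of freedom), a reduction of 372.61 on 6 degrees of freedom.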
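Overall significance of a logistic regression is usually judged with a likelihood-ratio test that compares the null and residual deviance. The sketch below applies that test to the deviance values printed in the summary output above; the variable names are illustrative.

```r
# Deviance values and degrees of freedom from the glm summary above
null_dev  <- 707.38; null_df  <- 531
resid_dev <- 334.77; resid_df <- 525

# Likelihood-ratio statistic and its degrees of freedom
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df

# Upper-tail chi-squared probability for the statistic
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(c(statistic = lr_stat, df = lr_df, p = p_value))
```

The degrees of freedom equal the number of predictors in the model (6), and the resulting p-value is far below any conventional significance level.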
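Finally, I check the distribution of the model residuals. Logistic regression does not strictly require normally distributed residuals, so this serves as an informal diagnostic.
The residuals in the chart above look approximately normally distributed, with three outliers.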
The model developed needs to be assessed for predictive power. I applied the model to the training and validation sets separately and calculated the model’s performance measures on each.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 304 37
## 1 25 166
##
## Accuracy : 0.8835
## 95% CI : (0.8531, 0.9095)
## No Information Rate : 0.6184
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7502
##
## Mcnemar's Test P-Value : 0.1624
##
## Sensitivity : 0.8177
## Specificity : 0.9240
## Pos Pred Value : 0.8691
## Neg Pred Value : 0.8915
## Prevalence : 0.3816
## Detection Rate : 0.3120
## Detection Prevalence : 0.3590
## Balanced Accuracy : 0.8709
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 125 21
## 1 14 68
##
## Accuracy : 0.8465
## 95% CI : (0.793, 0.8907)
## No Information Rate : 0.6096
## P-Value [Acc > NIR] : 4.799e-15
##
## Kappa : 0.6728
##
## Mcnemar's Test P-Value : 0.3105
##
## Sensitivity : 0.7640
## Specificity : 0.8993
## Pos Pred Value : 0.8293
## Neg Pred Value : 0.8562
## Prevalence : 0.3904
## Detection Rate : 0.2982
## Detection Prevalence : 0.3596
## Balanced Accuracy : 0.8317
##
## 'Positive' Class : 1
##
The performance measures show the prediction performance first on the training set and then on the validation set. The most important measure here is Accuracy, the share of matches that the model classifies correctly. The Accuracy for the training dataset is 88%, whereas the Accuracy for the validation dataset is 85%. Both are well above the corresponding No Information Rates (62% and 61%) and are acceptable for this dataset.
There is nothing alarming. Therefore my model is good to use for actual predictions.
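The headline measures can be recomputed directly from the four cells of each confusion matrix. The sketch below uses the counts printed above, with class 1 as the positive class as in the output; the helper `metrics()` is a hypothetical name introduced for this example.

```r
# Accuracy, sensitivity and specificity from confusion-matrix cells
# tn: predicted 0, actual 0   fn: predicted 0, actual 1
# fp: predicted 1, actual 0   tp: predicted 1, actual 1
metrics <- function(tn, fn, fp, tp) {
  c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# Counts copied from the two confusion matrices above
train <- metrics(tn = 304, fn = 37, fp = 25, tp = 166)
valid <- metrics(tn = 125, fn = 21, fp = 14, tp = 68)

print(round(rbind(train, valid), 4))
```

The drop of about four percentage points in accuracy from training to validation is modest, which is consistent with the model not overfitting badly.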
Suppose the following new match data is given to us. Our job is to predict the result before the match is played.
GoalCA=2, ShotCA=20, KP=9, SoTP=33, GoalieSaveP=67
## 1
## "Loss/Draw"
The model predicts that the match will end in a Loss or Draw.
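The prediction can be reproduced by hand from the coefficient table. The sketch below plugs the new match values into the fitted coefficients reported in the model summary. Two points are assumptions: the new match data does not state a Venue, so both values are evaluated, and the 0.5 cutoff and the "Win" label are illustrative choices that are not confirmed by the report.

```r
# Coefficient estimates copied from the glm summary
coefs <- c("(Intercept)" = -9.389003, Venue = 0.214892, GoalCA = 1.231663,
           ShotCA = 0.088516, KP = -0.064559, GoalieSaveP = 0.057164,
           SoTP = 0.023691)

# New match data from the text (Venue is not given there)
new_match <- c(GoalCA = 2, ShotCA = 20, KP = 9, GoalieSaveP = 67, SoTP = 33)

# Win probability for a given venue: inverse logit of the linear predictor
win_prob <- function(venue) {
  x   <- c("(Intercept)" = 1, Venue = venue, new_match)
  eta <- sum(coefs[names(x)] * x)
  plogis(eta)
}

p_away <- win_prob(0)
p_home <- win_prob(1)

# Assumed classification rule: probability above 0.5 counts as a win
label <- function(p) ifelse(p > 0.5, "Win", "Loss/Draw")

print(round(c(away = p_away, home = p_home), 3))
print(c(away = label(p_away), home = label(p_home)))
```

Under these assumptions the predicted win probability stays below 0.5 for either venue, which agrees with the "Loss/Draw" label in the output above.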
In this project, a dataset was selected that provides specific and in-depth information about a Premier League season. The match result was investigated using the data. Using exploratory analysis and visualizations, we identified the most suitable predictors for the match result. The model was developed using the training partition and tested using the validation partition. We have seen that there is not much performance difference between these two sets. Finally, the model was used for a prediction to demonstrate that it works.
The model used in this project does not find a significant home-ground advantage in the Premier League season 2021-22. Thus, our business question is answered as a no. The initial analysis shows that the team playing on its home ground wins more often, but once the other predictors are taken into account, Venue is not significant. My model says that if a team has a high number of Goal Creating Actions (GoalCA), a high Goalie Save Percentage (GoalieSaveP) and a high Shots on Target Percentage (SoTP) during the game, it is likely to win the game regardless of venue (home or away).
Future studies may use a larger dataset, different methods to select the predictors, and alternative prediction techniques such as decision trees.
I did not encounter any major difficulty in this project because the data was cleaned and ready to use.