Manchester United Football Club is one of the most famous and successful football clubs in the world. Based in Manchester, England, the club was founded in 1878 as Newton Heath LYR Football Club before changing its name to Manchester United in 1902. The team plays its home matches at Old Trafford, a stadium with a seating capacity of over 74,000.
The club has won numerous domestic and international trophies, including 20 English league titles, 12 FA Cups, and 3 UEFA Champions League titles. Manchester United has been known for its attacking style of play and its legendary players, such as Cristiano Ronaldo, Bobby Charlton, George Best, Eric Cantona, and Ryan Giggs. The club has a huge global fanbase and is often considered one of the wealthiest and most commercially successful sports teams in the world.
Historically, the club enjoyed great success under manager Sir Alex Ferguson, who led Manchester United to numerous titles during his 26-year tenure before retiring in 2013. The club’s fortunes have varied since then, with several managerial changes and attempts to return to its former glory.
This dataset contains statistics for all Manchester United players in the 2022-23 season. It covers all competitions and includes only players who made at least one appearance. For more detailed statistics, please refer to https://fbref.com/en/squads/19538871/2022-2023/all_comps/Manchester-United-Stats-All-Competitions.
A second dataset contains Manchester United's match statistics for the 2022-23 season. It is used to highlight which traits contribute most to winning or losing matches.
Over the last 10 years, Manchester United has not been in contention for the Premier League title. After more than £1 billion spent on transfer fees, questions need to be asked about how the team plays. Personally, I believe the team has not played consistently well despite the money injected into the squad. I aim to predict what leads Man Utd to win football matches based on match statistics, while providing some insight into the key attributes of a football match.
#Importing required libraries
library(ggplot2)
library(tidyverse)
library(glmnet)
library(car)
library(detectseparation)
library(ResourceSelection)
library(pROC)
library(Metrics)
library(caret)
Reading in Man Utd player dataset
manutd_players = read.csv("ManUtdplayer.csv", header=TRUE)
#Plotting Age vs Matches Played
average.age = mean(manutd_players$Age) #Average Age of Man Utd's squad
print(average.age) #25.8
## [1] 25.8
ggplot(data = manutd_players, mapping=aes(x=Age, y=Match.Played))+
geom_col(fill = "orange")+
geom_vline(xintercept = 25.8, # Vertical line at the average age (25.8)
color = "blue", # Line color
linetype = "dashed", # Line type (optional)
size = 0.7)+
geom_text(aes(x = 25.9, y = max(Match.Played) + 40, label = "Average age"),
color = "blue", # Text color
angle = 0, # Angle of text
hjust = 0, # Horizontal alignment
vjust = 0) + # Vertical alignment
labs(x = "Age", y = "Matches Played", title = "Age vs Matches Played")+
theme_linedraw()+
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(size = 7) # Adjust text size
)+ylim(0,100)
As per the plot above, Man Utd's squad has a mixture of youth and experience, with two noticeable peaks in matches played at around ages 23 and 30. The average age of a player in the Man Utd squad is 25.8.
#Who has been scoring the goals?
# Keep only players with at least 6 goals
manutd_scored = manutd_players %>%
filter(Goals > 5)
ggplot(data=manutd_scored, mapping = aes(x=Player, y=Goals))+
geom_col(fill = "orange")+
labs(x = "Player", y = "Goals Scored", title = "Players vs Goals Scored")+
theme_linedraw()+
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(size = 7) # Adjust text size
)+ylim(0,30)
During the 2022/23 campaign, Marcus Rashford (Forward) was Manchester United's most prolific goalscorer across all competitions with 30 goals. Bruno Fernandes is second in that category but far behind with 14 goals, less than half of Rashford's tally.
#Who has been creating goals?
# Keep only players with at least 4 assists
manutd_assist = manutd_players %>%
filter(Assist >= 4)
ggplot(data=manutd_assist, mapping = aes(x=Player, y=Assist))+
geom_col(fill = "orange")+
labs(x = "Player", y = "Assists", title = "Players vs Assists")+
theme_linedraw()+
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(size = 7) # Adjust text size
)+
scale_y_continuous(breaks = seq(0, ceiling(max(manutd_assist$Assist)), by=1))
Here, Bruno Fernandes (Midfielder) shines as he tops the assist charts for Manchester United, a testament to his goal creation. Christian Eriksen (Midfielder) comes in close behind with 10 assists, while Rashford is third with 9.
#Has each player been clinical in front of goal?
ggplot(data=manutd_players, mapping = aes(x=Expected.Goals, y=Goals))+
geom_point()+
geom_abline(slope = 1, intercept = 0, # y = x line
color = "red", linetype = "dashed")+
labs(x = "Expected Goals", y = "Goals", title = "Expected Goals vs Goals")+
theme_linedraw()+
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(size = 7)# Adjust text size
)
Individually, Man Utd players show only a moderate relationship between expected goals and goals scored. In other words, there is no consistent pattern of players scoring in line with their expected goals: roughly half of the points fall below the y = x line, meaning those players scored fewer goals than expected.
This plot identifies that Man Utd struggled to capitalise on their goal-scoring opportunities in the 22/23 season.
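To quantify this rather than eyeballing the scatter, the under-performance can be summarised directly from the two columns used in the plot; a minimal sketch using the manutd_players data frame loaded above:
# How many players sit below the y = x line (scored fewer goals than their expected goals)?
sum(manutd_players$Goals < manutd_players$Expected.Goals)
# Squad-level finishing: total goals minus total expected goals (negative = under-performance)
sum(manutd_players$Goals) - sum(manutd_players$Expected.Goals)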
The match dataset contains variables that are outside the scope of this analysis and of my domain knowledge of patterns of play. I deleted the null and irrelevant columns in Excel and renamed 'Man Utd Statistics' to 'Man Utd Tidy'.
#Reading in the datasets
manutd.data = read.csv("ManUtdTidy.csv", header=TRUE)
manutd.data = manutd.data %>%
select(-X, -X.1, -X.2, -X.3, -X.4, -X.5, -X.6, -X.7, -X.8, -X.9, -X.10, -X.11, -X.12,
-X.13, -X.14, -X.15, -X.16, -X.17, -X.18, -X.19, -X.20, -X.21, -X.22, -X.23, -X.24,
-Points, -Post.shot.Expected.Goals, -Crosses.Faced.By.Goalkeeper, -Crosses.Into.Penalty.Area,
-Short.Passes.Completion.Percentage, -Medium.Passes.Completion.Percentage,
-Long.Passes.Completion.Percentage, -Errors)
manutd.data = manutd.data[-62,]
#Result is binary so let a Win = 1, and not a win (D or L) = 0
manutd.data$Result[manutd.data$Result == "W"] = 1
manutd.data$Result[manutd.data$Result == "D"] = 0
manutd.data$Result[manutd.data$Result == "L"] = 0
#Possession inputs are in % terms, so let's convert to decimal form
manutd.data$Possession....=manutd.data$Possession..../100
#Same for venue: Let home = 1, and not home = 0
manutd.data$Venue[manutd.data$Venue == "Home"] = 1
manutd.data$Venue[manutd.data$Venue == "Away"] = 0
manutd.data$Venue[manutd.data$Venue == "Neutral"] = 0
manutd.data = as_tibble(manutd.data)
manutd.data$Result = as.numeric(manutd.data$Result)
manutd.data = data.frame(lapply(manutd.data, as.numeric))
manutd.data$Venue = as.factor(manutd.data$Venue)
manutd.data$Clean.Sheets = as.factor(manutd.data$Clean.Sheets)
#Replacing NA values with the mean of each respective column
manutd.data[] = lapply(manutd.data, function(x) {
x[is.na(x)] = mean(x, na.rm = TRUE)
return(x)
})
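A quick sanity check on the imputation above; a minimal sketch:
#Should return 0 if the mean-imputation filled every missing value
sum(is.na(manutd.data))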
Checking the dimensions of the dataset
print(dim(manutd.data))
## [1] 61 69
Initial look at the dataset
#Initial look at the dataset
summary(manutd.data)
## Venue Result Goals Goals.Against Shots
## 0:27 Min. :0.0000 Min. :0.000 Min. :0.000 Min. : 4.00
## 1:34 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:0.000 1st Qu.:12.00
## Median :1.0000 Median :2.000 Median :1.000 Median :16.00
## Mean :0.6721 Mean :1.783 Mean :1.017 Mean :16.21
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.:19.00
## Max. :1.0000 Max. :4.000 Max. :7.000 Max. :34.00
## Shots.on.Target Average.Shot.Distance Shots.on.Target.Against
## Min. : 1.00 Min. : 1.00 Min. :0.000
## 1st Qu.: 4.00 1st Qu.:14.80 1st Qu.:2.000
## Median : 6.00 Median :16.50 Median :3.000
## Mean : 5.82 Mean :14.97 Mean :3.475
## 3rd Qu.: 8.00 3rd Qu.:18.80 3rd Qu.:5.000
## Max. :13.00 Max. :22.20 Max. :9.000
## Saves.By.Goalkeeper Clean.Sheets
## Min. :0.00 0:28
## 1st Qu.:1.00 1:33
## Median :2.00
## Mean :2.18
## 3rd Qu.:3.00
## Max. :6.00
## Passes.Completed..Longer.than.40.yards..By.Goalkeeper
## Min. : 0.000
## 1st Qu.: 2.000
## Median : 4.000
## Mean : 4.246
## 3rd Qu.: 6.000
## Max. :15.000
## Passes.Attempted..Longer.than.40.yards..By.Goalkeeper
## Min. : 0.00
## 1st Qu.: 8.00
## Median :11.00
## Mean :19.62
## 3rd Qu.:22.00
## Max. :78.00
## Passes.Attempted.By.Goalkeeper Average.Passes.Length.By.Goalkeeper
## Min. : 1.00 Min. : 0.00
## 1st Qu.:16.00 1st Qu.:23.60
## Median :25.00 Median :30.20
## Mean :22.39 Mean :26.03
## 3rd Qu.:32.00 3rd Qu.:34.70
## Max. :39.00 Max. :47.90
## Goals.Kicks.Attempted.By.Goalkeeper
## Min. : 0.000
## 1st Qu.: 2.000
## Median : 5.000
## Mean : 5.738
## 3rd Qu.: 9.000
## Max. :16.000
## Goals.Kicks.Attempted..Longer.than.40.yards.
## Min. : 0.00
## 1st Qu.: 8.00
## Median : 40.00
## Mean : 46.73
## 3rd Qu.: 87.50
## Max. :100.00
## Average.Goal.Kicks.Length.By.Goalkeeper
## Min. : 6.00
## 1st Qu.:19.00
## Median :36.30
## Mean :38.56
## 3rd Qu.:58.40
## Max. :72.00
## Average.Distance.of.Defensive.Action.By.Goalkeeper Passes.Completed
## Min. : 3.00 Min. : 78.4
## 1st Qu.: 10.00 1st Qu.:389.0
## Median : 15.50 Median :440.1
## Mean : 56.99 Mean :440.1
## 3rd Qu.: 19.00 3rd Qu.:512.0
## Max. :535.00 Max. :698.0
## Passes.Attemped Passes.Completion.Percentage Total.Passing.Distance
## Min. : 77.2 Min. : 64.3 Min. : 189
## 1st Qu.: 464.0 1st Qu.: 80.0 1st Qu.: 5949
## Median : 596.0 Median : 83.6 Median : 6820
## Mean : 750.6 Mean : 719.1 Mean : 6820
## 3rd Qu.: 750.6 3rd Qu.: 719.1 3rd Qu.: 8645
## Max. :6837.0 Max. :8197.0 Max. :11752
## Progressive.Passing.Distance Short.Passes.Completed Short.Passes.Attemped
## Min. : 189 Min. : 85.5 Min. : 82.7
## 1st Qu.:2254 1st Qu.:189.0 1st Qu.:189.0
## Median :2385 Median :218.5 Median :229.8
## Mean :2344 Mean :218.5 Mean :229.8
## 3rd Qu.:2842 3rd Qu.:244.0 3rd Qu.:274.0
## Max. :3595 Max. :367.0 Max. :384.0
## Medium.Passes.Completed Medium.Passes.Attemped Long.Passes.Completed
## Min. : 54.0 Min. : 35.0 Min. :22.00
## 1st Qu.:148.0 1st Qu.:159.0 1st Qu.:36.00
## Median :169.2 Median :182.1 Median :43.11
## Mean :169.2 Mean :182.1 Mean :43.11
## 3rd Qu.:195.0 3rd Qu.:220.0 3rd Qu.:51.00
## Max. :272.0 Max. :304.0 Max. :68.00
## Long.Passes.Attemped Passes.Into.Final.Third Passes.Into.Penalty.Area
## Min. : 20.00 Min. : 1.0 Min. : 0.00
## 1st Qu.: 60.00 1st Qu.:25.0 1st Qu.: 8.00
## Median : 67.97 Median :33.4 Median :11.00
## Mean : 67.97 Mean :33.4 Mean :11.54
## 3rd Qu.: 77.00 3rd Qu.:40.0 3rd Qu.:13.00
## Max. :103.00 Max. :91.0 Max. :52.00
## Progressive.Passes Tackles Tackles.Won Tackles.in.Defensive.Third
## Min. : 5.00 Min. : 5.0 Min. : 3.0 Min. : 0.00
## 1st Qu.:33.00 1st Qu.:15.0 1st Qu.: 8.0 1st Qu.: 7.00
## Median :41.04 Median :16.7 Median :10.3 Median : 8.78
## Mean :41.04 Mean :16.7 Mean :10.3 Mean : 8.78
## 3rd Qu.:50.00 3rd Qu.:19.0 3rd Qu.:12.0 3rd Qu.:11.00
## Max. :99.00 Max. :30.0 Max. :21.0 Max. :15.00
## Tackles.in.Middle.Third Tackles.in.Attacking.Third Blocks
## Min. : 0.00 Min. : 0.0 Min. : 1.00
## 1st Qu.: 4.00 1st Qu.: 1.0 1st Qu.: 9.00
## Median : 6.08 Median : 3.0 Median :12.38
## Mean : 6.08 Mean : 3.1 Mean :12.38
## 3rd Qu.: 8.00 3rd Qu.: 3.1 3rd Qu.:15.00
## Max. :14.00 Max. :15.0 Max. :23.00
## Shots.Blocked Shot.Against Interception Clearance
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 2.00 1st Qu.: 8.00 1st Qu.: 7.00 1st Qu.:13.00
## Median : 4.64 Median :12.42 Median : 9.28 Median :18.42
## Mean : 4.64 Mean :12.42 Mean : 9.28 Mean :18.42
## 3rd Qu.: 6.00 3rd Qu.:15.00 3rd Qu.:10.00 3rd Qu.:24.00
## Max. :22.00 Max. :38.00 Max. :29.00 Max. :54.00
## Possession.... Touches Touches.in.Defensive.Penalty.Area
## Min. :0.300 Min. : 63.0 Min. : 19.00
## 1st Qu.:0.520 1st Qu.:569.0 1st Qu.: 59.00
## Median :0.610 Median :619.0 Median : 75.00
## Mean :1.036 Mean :598.9 Mean : 86.44
## 3rd Qu.:1.036 3rd Qu.:708.0 3rd Qu.: 86.44
## Max. :8.020 Max. :915.0 Max. :269.00
## Touches.in.Defensive.Third Touches.in.Middle.Third Touches.in.Attacking.Third
## Min. :106.0 Min. : 25.0 Min. : 12.0
## 1st Qu.:189.0 1st Qu.:221.0 1st Qu.:125.0
## Median :209.9 Median :256.7 Median :165.7
## Mean :209.9 Mean :256.7 Mean :165.7
## 3rd Qu.:229.0 3rd Qu.:302.0 3rd Qu.:205.0
## Max. :419.0 Max. :440.0 Max. :404.0
## Touches.in.Attacking.Penalty.Area Take.on..Dribble..Attempted
## Min. : 6.30 Min. : 11.00
## 1st Qu.:21.00 1st Qu.: 16.00
## Median :27.35 Median : 21.00
## Mean :27.35 Mean : 33.97
## 3rd Qu.:33.00 3rd Qu.: 33.97
## Max. :53.00 Max. :383.00
## Take.on..Dribble..Success.Percentage Carries Total.Carries.Distance
## Min. : 14.3 Min. : 178.0 Min. : 15
## 1st Qu.: 40.6 1st Qu.: 327.0 1st Qu.:1411
## Median : 52.4 Median : 441.0 Median :1751
## Mean : 137.5 Mean : 545.4 Mean :1751
## 3rd Qu.: 137.5 3rd Qu.: 545.4 3rd Qu.:2089
## Max. :1799.0 Max. :2971.0 Max. :3042
## Progressive.Carries.Distance Progressive.Carries Carries.into.Final.Third
## Min. : 11.0 Min. : 7.00 Min. : 2.00
## 1st Qu.: 728.0 1st Qu.:15.00 1st Qu.:10.00
## Median : 898.4 Median :19.04 Median :13.14
## Mean : 898.4 Mean :19.04 Mean :13.14
## 3rd Qu.:1103.0 3rd Qu.:23.00 3rd Qu.:16.00
## Max. :1957.0 Max. :44.00 Max. :31.00
## Carries.Into.Penalty.Area Miscontrols Dispossesed Yellow.Card
## Min. : 2.00 Min. : 1.00 Min. : 0.0 Min. : 0.00
## 1st Qu.: 5.00 1st Qu.:11.00 1st Qu.: 6.0 1st Qu.: 1.00
## Median : 7.00 Median :12.58 Median : 7.0 Median : 2.14
## Mean : 7.18 Mean :12.58 Mean : 7.1 Mean : 2.14
## 3rd Qu.: 9.00 3rd Qu.:15.00 3rd Qu.: 8.0 3rd Qu.: 3.00
## Max. :15.00 Max. :22.00 Max. :16.0 Max. :11.00
## Red.Cards Fouls.Commited Fouls.Drawn Crosses Interception.1
## Min. : 0.00 Min. : 3.00 Min. : 3.0 Min. : 4.00 Min. : 1.0
## 1st Qu.: 0.00 1st Qu.: 9.00 1st Qu.: 7.0 1st Qu.: 9.00 1st Qu.: 7.0
## Median : 0.00 Median :10.94 Median : 8.7 Median :13.88 Median :10.0
## Mean : 1.26 Mean :10.94 Mean : 8.7 Mean :13.88 Mean :11.1
## 3rd Qu.: 1.26 3rd Qu.:13.00 3rd Qu.:10.0 3rd Qu.:18.00 3rd Qu.:11.1
## Max. :13.00 Max. :16.00 Max. :16.0 Max. :30.00 Max. :63.0
## Tackles.Won.1 Ball.Recovery Aerial.Duel.Won Aerial.Duel.Lost
## Min. : 4.00 Min. : 6.0 Min. : 4.00 Min. : 3.00
## 1st Qu.: 9.00 1st Qu.:51.0 1st Qu.:11.00 1st Qu.: 9.00
## Median :12.00 Median :52.0 Median :12.71 Median :11.32
## Mean :14.72 Mean :51.2 Mean :12.71 Mean :11.32
## 3rd Qu.:14.72 3rd Qu.:59.0 3rd Qu.:14.00 3rd Qu.:13.00
## Max. :91.00 Max. :74.0 Max. :26.00 Max. :23.00
Now, is the dataset balanced or not?
ggplot(data = manutd.data, mapping = aes(y=Result))+
geom_bar(fill="orange")+
labs(x="Count", y="Result Type", title="Number of Wins vs Losses in the dataset")+
scale_y_continuous(breaks = c(0, 1))+
theme(plot.title = element_text(hjust = 0.5))
As per the plot above, there are noticeably more observations for a Man Utd win (Result = 1) than for a non-win (draw or loss, Result = 0).
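The imbalance can be quantified rather than eyeballed; a quick tabulation of the Result column prepared above:
#Counts and proportions of non-wins (0) versus wins (1)
table(manutd.data$Result)
prop.table(table(manutd.data$Result))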
Since the desired outcome is to predict a win or a non-win, and Result is binary, a logistic regression model seems an appropriate fit for the data.
Fitting the Logistic Regression model
#Fitting the model with all possible variables.
#Weights are included because the dataset is imbalanced (non-wins are up-weighted)
model.fit = glm(Result~.,data=manutd.data, family = "binomial"(link = logit),weights = ifelse(manutd.data$Result == 0, 20, 1))
summary(model.fit) #A bit dreadful at the moment
##
## Call:
## glm(formula = Result ~ ., family = binomial(link = logit), data = manutd.data,
## weights = ifelse(manutd.data$Result == 0, 20, 1))
##
## Coefficients: (8 not defined because of singularities)
## Estimate Std. Error
## (Intercept) -1.177e+02 5.139e+06
## Venue1 2.611e+01 6.903e+05
## Goals 7.801e+01 7.521e+05
## Goals.Against -5.059e+01 6.082e+05
## Shots -4.046e+00 5.516e+04
## Shots.on.Target -8.135e-01 2.011e+05
## Average.Shot.Distance -5.583e+01 6.991e+05
## Shots.on.Target.Against 5.414e+01 6.655e+05
## Saves.By.Goalkeeper -6.624e+01 9.936e+05
## Clean.Sheets1 -4.269e+02 6.130e+06
## Passes.Completed..Longer.than.40.yards..By.Goalkeeper -1.484e+00 7.431e+04
## Passes.Attempted..Longer.than.40.yards..By.Goalkeeper -1.892e+00 3.503e+04
## Passes.Attempted.By.Goalkeeper 1.193e+01 1.805e+05
## Average.Passes.Length.By.Goalkeeper 2.652e+00 1.069e+05
## Goals.Kicks.Attempted.By.Goalkeeper -1.833e+00 1.329e+05
## Goals.Kicks.Attempted..Longer.than.40.yards. 1.865e+00 8.249e+04
## Average.Goal.Kicks.Length.By.Goalkeeper 7.184e+00 7.840e+04
## Average.Distance.of.Defensive.Action.By.Goalkeeper -9.194e-01 6.066e+04
## Passes.Completed -2.641e+01 5.806e+05
## Passes.Attemped 1.673e-01 6.538e+03
## Passes.Completion.Percentage -1.126e+00 1.507e+04
## Total.Passing.Distance 1.195e+00 1.590e+04
## Progressive.Passing.Distance -1.650e-02 2.280e+03
## Short.Passes.Completed 4.986e+01 9.501e+05
## Short.Passes.Attemped -1.642e+01 2.832e+05
## Medium.Passes.Completed -3.684e+00 2.930e+05
## Medium.Passes.Attemped 9.836e+00 1.084e+05
## Long.Passes.Completed 2.538e+01 6.056e+05
## Long.Passes.Attemped -3.704e+00 9.614e+04
## Passes.Into.Final.Third -6.965e+01 1.006e+06
## Passes.Into.Penalty.Area -6.781e+01 1.094e+06
## Progressive.Passes 3.341e+01 4.977e+05
## Tackles -1.476e+02 1.864e+06
## Tackles.Won -1.442e+01 3.793e+05
## Tackles.in.Defensive.Third 1.567e+02 2.122e+06
## Tackles.in.Middle.Third 1.781e+02 2.205e+06
## Tackles.in.Attacking.Third 1.412e+02 1.899e+06
## Blocks 2.512e+01 4.347e+05
## Shots.Blocked -1.963e+01 3.501e+05
## Shot.Against 9.658e+00 8.393e+04
## Interception 8.632e+00 1.834e+05
## Clearance 2.952e+01 3.732e+05
## Possession.... 3.189e+03 4.543e+07
## Touches -1.459e+01 2.127e+05
## Touches.in.Defensive.Penalty.Area -8.682e+00 1.415e+05
## Touches.in.Defensive.Third 3.720e+00 1.015e+05
## Touches.in.Middle.Third 4.141e+00 9.321e+04
## Touches.in.Attacking.Third 1.481e+01 2.353e+05
## Touches.in.Attacking.Penalty.Area 1.848e+01 3.341e+05
## Take.on..Dribble..Attempted 3.599e+01 5.423e+05
## Take.on..Dribble..Success.Percentage -1.542e+00 3.200e+04
## Carries -5.488e+00 8.983e+04
## Total.Carries.Distance -1.214e-01 5.293e+03
## Progressive.Carries.Distance 9.256e-01 2.085e+04
## Progressive.Carries -5.270e-01 1.248e+05
## Carries.into.Final.Third -4.511e+01 6.941e+05
## Carries.Into.Penalty.Area -5.571e+01 9.023e+05
## Miscontrols -6.245e+00 2.401e+05
## Dispossesed 9.251e+00 2.012e+05
## Yellow.Card 8.056e+01 9.534e+05
## Red.Cards -2.594e+02 3.740e+06
## Fouls.Commited NA NA
## Fouls.Drawn NA NA
## Crosses NA NA
## Interception.1 NA NA
## Tackles.Won.1 NA NA
## Ball.Recovery NA NA
## Aerial.Duel.Won NA NA
## Aerial.Duel.Lost NA NA
## z value Pr(>|z|)
## (Intercept) 0 1
## Venue1 0 1
## Goals 0 1
## Goals.Against 0 1
## Shots 0 1
## Shots.on.Target 0 1
## Average.Shot.Distance 0 1
## Shots.on.Target.Against 0 1
## Saves.By.Goalkeeper 0 1
## Clean.Sheets1 0 1
## Passes.Completed..Longer.than.40.yards..By.Goalkeeper 0 1
## Passes.Attempted..Longer.than.40.yards..By.Goalkeeper 0 1
## Passes.Attempted.By.Goalkeeper 0 1
## Average.Passes.Length.By.Goalkeeper 0 1
## Goals.Kicks.Attempted.By.Goalkeeper 0 1
## Goals.Kicks.Attempted..Longer.than.40.yards. 0 1
## Average.Goal.Kicks.Length.By.Goalkeeper 0 1
## Average.Distance.of.Defensive.Action.By.Goalkeeper 0 1
## Passes.Completed 0 1
## Passes.Attemped 0 1
## Passes.Completion.Percentage 0 1
## Total.Passing.Distance 0 1
## Progressive.Passing.Distance 0 1
## Short.Passes.Completed 0 1
## Short.Passes.Attemped 0 1
## Medium.Passes.Completed 0 1
## Medium.Passes.Attemped 0 1
## Long.Passes.Completed 0 1
## Long.Passes.Attemped 0 1
## Passes.Into.Final.Third 0 1
## Passes.Into.Penalty.Area 0 1
## Progressive.Passes 0 1
## Tackles 0 1
## Tackles.Won 0 1
## Tackles.in.Defensive.Third 0 1
## Tackles.in.Middle.Third 0 1
## Tackles.in.Attacking.Third 0 1
## Blocks 0 1
## Shots.Blocked 0 1
## Shot.Against 0 1
## Interception 0 1
## Clearance 0 1
## Possession.... 0 1
## Touches 0 1
## Touches.in.Defensive.Penalty.Area 0 1
## Touches.in.Defensive.Third 0 1
## Touches.in.Middle.Third 0 1
## Touches.in.Attacking.Third 0 1
## Touches.in.Attacking.Penalty.Area 0 1
## Take.on..Dribble..Attempted 0 1
## Take.on..Dribble..Success.Percentage 0 1
## Carries 0 1
## Total.Carries.Distance 0 1
## Progressive.Carries.Distance 0 1
## Progressive.Carries 0 1
## Carries.into.Final.Third 0 1
## Carries.Into.Penalty.Area 0 1
## Miscontrols 0 1
## Dispossesed 0 1
## Yellow.Card 0 1
## Red.Cards 0 1
## Fouls.Commited NA NA
## Fouls.Drawn NA NA
## Crosses NA NA
## Interception.1 NA NA
## Tackles.Won.1 NA NA
## Ball.Recovery NA NA
## Aerial.Duel.Won NA NA
## Aerial.Duel.Lost NA NA
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2.7285e+02 on 60 degrees of freedom
## Residual deviance: 4.9881e-10 on 0 degrees of freedom
## AIC: 122
##
## Number of Fisher Scoring iterations: 25
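The "(8 not defined because of singularities)" note means eight predictors are perfectly collinear with other columns (for example the duplicated Interception.1 and Tackles.Won.1). They can be listed directly before moving on; a quick check:
#List the aliased (perfectly collinear) predictors that glm() dropped
names(coef(model.fit))[is.na(coef(model.fit))]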
Trying Backwards Stepwise Regression
backward_model = step(model.fit, direction = "backward", trace = 1)
Inspecting the Stepwise model plus an interaction between Shot.Against and Blocks (Based on AIC)
model = glm(Result ~ Venue + Clean.Sheets+Goals + Passes.Attempted.By.Goalkeeper +
Medium.Passes.Attemped + Tackles + Tackles.in.Defensive.Third +
Tackles.in.Middle.Third + Blocks + Shot.Against + Touches +Shot.Against*Blocks, family = "binomial"(link = logit),
data = manutd.data)
vif(model, type="predictor")
## Venue Clean.Sheets
## 9.149248 22.735538
## Goals Passes.Attempted.By.Goalkeeper
## 15.590206 17.181876
## Medium.Passes.Attemped Tackles
## 62.704391 86.798798
## Tackles.in.Defensive.Third Tackles.in.Middle.Third
## 66.070992 82.694777
## Blocks Shot.Against
## 52.664070 43.000093
## Touches Blocks:Shot.Against
## 47.392682 83.462955
There is high multicollinearity in the model (several VIFs far exceed 10).
Checking for perfect separation
#Note: inside select(), Shot.Against:Blocks selects the column range between them
#(which also picks up Shots.Blocked) rather than an interaction term
regressors = manutd.data %>%
select(Goals, Passes.Attempted.By.Goalkeeper,
Medium.Passes.Attemped, Tackles, Tackles.in.Defensive.Third,
Tackles.in.Middle.Third, Blocks, Shot.Against, Touches, Shot.Against:Blocks)
separation.result = detect_separation(y = manutd.data$Result, x = regressors, family = binomial())
print(separation.result)
## Implementation: ROI | Solver: lpsolve
## Separation: FALSE
## Existence of maximum likelihood estimates
## Goals Passes.Attempted.By.Goalkeeper
## 0 0
## Medium.Passes.Attemped Tackles
## 0 0
## Tackles.in.Defensive.Third Tackles.in.Middle.Third
## 0 0
## Blocks Shot.Against
## 0 0
## Touches Shots.Blocked
## 0 0
## 0: finite value, Inf: infinity, -Inf: -infinity
table(manutd.data$Venue, manutd.data$Result) #No perfect separation between Venue and Result
##
## 0 1
## 0 14 13
## 1 6 28
table(manutd.data$Clean.Sheets, manutd.data$Result) #No perfect separation between Clean Sheets and Result
##
## 0 1
## 0 17 11
## 1 3 30
There is no perfect separation within the model.
Performing a Ridge logistic regression to reduce the effect of multicollinearity
#Creating a new dataset for Ridge regression, including only variables from the backwards stepwise regression
ridge_dataset = manutd.data %>%
select(Venue, Clean.Sheets, Goals, Passes.Attempted.By.Goalkeeper,
Medium.Passes.Attemped, Tackles, Tackles.in.Defensive.Third,
Tackles.in.Middle.Third, Blocks, Shot.Against, Touches, Result,
Shot.Against:Blocks) #column range; also brings in Shots.Blocked
#Adding the Shot.Against x Blocks interaction explicitly
ridge_dataset = ridge_dataset %>%
mutate(Shot.Against.Blocks = Shot.Against * Blocks)
# Dataset with only predictors
ridge_regressors = ridge_dataset %>%
select(-Result)
#Creating a matrix for regressors
x = as.matrix(ridge_regressors) # Predictors (all columns except the response)
y = ridge_dataset$Result
#Defining weights for imbalanced data
weights = ifelse(y == 0, length(y) / sum(y == 0), 1) # Inverse class frequencies
#Fitting Ridge model
ridge_model = glmnet(x, y, family="binomial",alpha = 0,weights=weights)
plot(ridge_model) # Draw plot of coefficients
#Cross-validating lambda for the Ridge model; 5-fold CV is used here as the dataset is relatively small
#(true leave-one-out CV would assign each observation its own fold; see the sketch below)
cv_loocv = cv.glmnet(x, y, alpha = 0, family = "binomial", nfolds = 5, weights = weights)
best_lambda = cv_loocv$lambda.min
ridge_final = glmnet(x, y, family = "binomial", alpha = 0, weights = weights, lambda = best_lambda)
#Plot of the lambdas
plot(cv_loocv)
#Coefficients of Ridge logistic model
coef(ridge_final, s = "lambda.min")
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -3.322798456
## Venue 0.586798893
## Clean.Sheets 2.143316912
## Goals 1.356862438
## Passes.Attempted.By.Goalkeeper 0.018899575
## Medium.Passes.Attemped -0.008709140
## Tackles -0.034372231
## Tackles.in.Defensive.Third 0.030683168
## Tackles.in.Middle.Third 0.214350368
## Blocks -0.035104513
## Shot.Against 0.009404932
## Touches 0.001929199
## Shots.Blocked -0.125238305
## Shot.Against.Blocks -0.002111422
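For reference, the cross-validation above uses 5 folds; genuine leave-one-out CV assigns each match its own fold. A hedged sketch of how that could be set up with cv.glmnet (grouped = FALSE is required for single-observation folds):
#Leave-one-out CV: one fold per observation
cv_true_loocv = cv.glmnet(x, y, alpha = 0, family = "binomial", weights = weights,
foldid = seq_len(nrow(x)), grouped = FALSE)
cv_true_loocv$lambda.min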
Null Model (Baseline Model) for Comparison
# Fit null logistic model
null_model = glm(manutd.data$Result ~ 1, family = "binomial", data=manutd.data)
summary(null_model)
##
## Call:
## glm(formula = manutd.data$Result ~ 1, family = "binomial", data = manutd.data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.7178 0.2727 2.632 0.00849 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 77.184 on 60 degrees of freedom
## Residual deviance: 77.184 on 60 degrees of freedom
## AIC: 79.184
##
## Number of Fisher Scoring iterations: 4
The baseline model predicts a constant probability for all observations. Note that the p-value for \(\beta_0\) is < 0.05, so we reject the null hypothesis that the intercept equals 0. The implied constant probability is \(1/(1+e^{-0.7178}) \approx 0.672\), which is simply the observed win rate (41/61).
Extracting the intercept value
# Extract the intercept
beta_0 = coef(null_model)[1]
# Compute constant predicted probability
p_hat = 1 / (1 + exp(-beta_0)) #Inverse of logit
# Predicted classes based on threshold
predicted_class = ifelse(p_hat >= 0.5, 1, 0)
#Calculating predicted values of the logistic model with ridge penalisation
predictions = predict(ridge_final,newx = x, s=best_lambda, type="response")
#Transforming predictions into binary outcomes
predicted_classes = ifelse(predictions >= 0.5, 1,0)
Confusion Matrix
# Confusion matrix
Confusion.matrix = table(Predicted = predicted_classes, Actual = y)
print(Confusion.matrix)
## Actual
## Predicted 0 1
## 0 19 4
## 1 1 37
#Estimated specificity, sensitivity, and prediction error
specificity = 19/(19+1); sensitivity = 37/(37+4)
print(specificity); print(sensitivity) #Specificity and sensitivity respectively
## [1] 0.95
## [1] 0.902439
prediction.error = (1+4)/(19+4+1+37)
print(prediction.error)
## [1] 0.08196721
From the confusion matrix, the probability that the model predicts a win, given that the Result is a win (sensitivity), is 0.90. Moreover, the probability that the model predicts a non-win, given that the Result is a non-win (specificity), is 0.95.
Calculating Accuracy, Precision, Recall, and F1-Score of the model from the Confusion Matrix
accuracy = (19+37)/(19+4+37+1) #proportion of correct predictions (both true positives and true negatives) out of all predictions made
precision = 37/(37+1) #accuracy of the positive (win) predictions made by the model
recall = 37/(37+4) #how well the model identifies positive instances
#Computing the F1-score since dataset is imbalanced
f1.score = 2*(precision*recall)/(precision+recall) #the harmonic mean of precision and recall
print(precision)
## [1] 0.9736842
A precision of roughly 97% indicates that when the model predicted a win, the actual Result was indeed a win about 97% of the time.
print(recall)
## [1] 0.902439
A recall of 0.90 indicates the model is strong at identifying the matches that Man Utd actually won.
print(f1.score)
## [1] 0.9367089
The F1-score indicates the model achieves a good balance between identifying most wins (recall) and not over-predicting wins (precision).
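All of these quantities can be cross-checked in one call with the caret package (loaded earlier but not yet used); a minimal sketch, treating a win (1) as the positive class:
#Cross-check of the confusion-matrix rates with caret
cm = confusionMatrix(data = factor(predicted_classes, levels = c(0, 1)),
reference = factor(y, levels = c(0, 1)), positive = "1")
cm$byClass[c("Sensitivity", "Specificity", "Precision", "Recall", "F1")]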
# Estimate of AUC
roc_curve = roc(y, predictions) # ROC curve
plot(roc_curve, grid=TRUE, col="orange", print.thres = "best") #Plot ROC curve with the threshold that maximises the sum of sensitivity and specificity (Youden's J)
Area under ROC curve
auc(y, predictions)
## [1] 0.9804878
An AUC of 0.98 indicates the predictors discriminate wins from non-wins very well for an imbalanced, moderately sized dataset.
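The "best" point printed on the ROC plot is the threshold maximising Youden's J (sensitivity + specificity - 1); it can be extracted directly with pROC's coords(); a minimal sketch:
#Threshold that maximises sensitivity + specificity (Youden's J)
coords(roc_curve, x = "best", best.method = "youden",
ret = c("threshold", "sensitivity", "specificity"))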
Hosmer and Lemeshow GOF test
hoslem.test(y, predict(ridge_final, newx = x, type = "response"))
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: y, predict(ridge_final, newx = x, type = "response")
## X-squared = 10.537, df = 8, p-value = 0.2293
The p-value > 0.05 means we fail to reject the null hypothesis that the model fits the data; there is no evidence of a lack of fit.
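The Hosmer-Lemeshow statistic can be sensitive to the number of groups g (10 by default), so it is worth confirming the conclusion is not an artefact of that choice; a quick sketch:
#p-values of the HL test across a few choices of g
sapply(c(8, 10, 12), function(g) hoslem.test(y, as.numeric(predictions), g = g)$p.value)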
Checking the mean residuals of the logistic model without Ridge penalisation
residuals.deviance = residuals(model, type = "deviance")
print(mean(residuals.deviance)) #Mean of deviance residuals close to 0
## [1] 1.102836e-06
residuals.pearson = residuals(model, type="pearson")
print(mean(residuals.pearson)) #Mean of pearson residuals close to 0
## [1] 7.798223e-07
Means of both Deviance and Pearson residuals are close to 0, indicating a good fit.
Residual vs Fitted values plot of Logistic model without Ridge penalisation
plot(fitted(model), residuals(model, type="response"), main = "Residuals vs Fitted")
abline(h = 0, col = "red")
The residuals form bands above and below zero, as expected with a binary response, and several residuals sit almost exactly at zero, which could indicate overfitting.
Deviance of the stepwise model (without Ridge penalisation)
deviance(model)
## [1] 8.256683e-09
1 - pchisq(deviance(model), 48)
## [1] 1
The deviance is essentially zero, indicating the model is close to saturated. The large p-value under the chi-square distribution means we fail to reject the null hypothesis that the model is adequate.
Checking the Deviance and Pearson residuals of the final model (Ridge)
# Deviance residuals
deviance_residuals = sign(y - predictions) *
sqrt(-2 * (y * log(predictions) + (1 - y) * log(1 - predictions)))
summary(deviance_residuals) #Residuals are between -2 and 2
## s1
## Min. :-1.4590
## 1st Qu.:-0.2693
## Median : 0.3368
## Mean : 0.2865
## 3rd Qu.: 0.8179
## Max. : 1.5401
# Pearson residuals
pearson_residuals = (y - predictions) /
sqrt(predictions * (1 - predictions))
# Plot residuals of ridge penalised model
par(mfrow = c(1, 2))
plot(deviance_residuals, main = "Deviance Residuals", ylab = "Residuals", xlab = "Index")
plot(pearson_residuals, main = "Pearson Residuals", ylab = "Residuals", xlab = "Index")
Both the deviance and Pearson residuals scatter around zero with no noticeable trend or pattern. Moreover, the residuals lie between -2 and 2, indicating no outliers, and their variability does not appear to increase with the observation index.
Log-loss of the baseline model vs the fitted Ridge logistic model
# Calculate log-loss of fitted model
log_loss.model = logLoss(actual = y,predicted = predictions)
# Print log-loss
print(log_loss.model)
## [1] 0.2606354
# Calculate log-loss of logistic null model (baseline model)
log_loss_baseline = logLoss(manutd.data$Result, predicted = p_hat)
print(log_loss_baseline)
## [1] 0.6326591
The log-loss of the fitted model is lower than that of the baseline logistic model, indicating that the Ridge model is the better predictor and that Ridge regularisation improved the model's performance.
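As a sanity check, the baseline log-loss is simply the null deviance divided by 2n (77.184 / 122 ≈ 0.633), matching the value above; a one-line confirmation:
#Baseline log-loss equals the null deviance divided by 2n
deviance(null_model) / (2 * nrow(manutd.data))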
Histogram and Q-Q plot of residuals
hist(pearson_residuals, main = "Histogram of Pearson Residuals", xlab = "Residuals")
qqnorm(pearson_residuals); qqline(pearson_residuals)
The residuals are centred between roughly 0 and 0.5 and look approximately normally distributed. The Q-Q plot lies close to the 45-degree line, understandably so, as the dataset is quite small.
Coefficients of the model
#Coefficients of the model
coefs = coef(ridge_final)
print(coefs)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -3.322798456
## Venue 0.586798893
## Clean.Sheets 2.143316912
## Goals 1.356862438
## Passes.Attempted.By.Goalkeeper 0.018899575
## Medium.Passes.Attemped -0.008709140
## Tackles -0.034372231
## Tackles.in.Defensive.Third 0.030683168
## Tackles.in.Middle.Third 0.214350368
## Blocks -0.035104513
## Shot.Against 0.009404932
## Touches 0.001929199
## Shots.Blocked -0.125238305
## Shot.Against.Blocks -0.002111422
The equation for the fitted logistic regression model (coefficients rounded) is: \[\text{logit}(p_i)=-3.32+0.59\,\text{Venue}+2.14\,\text{CleanSheet}+1.36\,\text{Goals}+0.02\,\text{GKPassesAttempted}-0.01\,\text{MediumPassesAttempted}-0.03\,\text{Tackles}+0.03\,\text{TacklesDefensiveThird}+0.21\,\text{TacklesMiddleThird}-0.04\,\text{Blocks}+0.01\,\text{ShotsAgainst}+0.002\,\text{Touches}-0.13\,\text{ShotsBlocked}-0.002\,\text{ShotsAgainst}\times\text{Blocks}\]
The coefficients indicate the impact of each predictor on the log-odds of a Man Utd win.
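Exponentiating the coefficients gives odds ratios, which are often easier to interpret than log-odds (values above 1 increase the odds of a win, values below 1 decrease them); a minimal sketch:
#Odds ratios: multiplicative change in the odds of a win per one-unit increase in each predictor
round(exp(as.matrix(coef(ridge_final))), 3)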
The loss function for logistic regression with ridge penalisation, where \(\lambda = 0.03\), is: \[ \mathcal{L}(\beta) = - \frac{1}{61} \sum_{i=1}^{61} \left[ \text{Result}_i \log \hat{p}_i + (1 - \text{Result}_i) \log (1 - \hat{p}_i) \right] + \frac{0.03}{2} \sum_{j=1}^{13} \beta_j^2 \] with 61 matches, 13 predictors, and the intercept left unpenalised.
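To make the objective concrete, it can be evaluated at the fitted ridge solution; a rough sketch using the objects defined above (glmnet's internal objective also involves the observation weights and standardisation, which are ignored here for simplicity):
#Rough evaluation of the unweighted penalised negative log-likelihood at the ridge solution
p_hat_ridge = as.numeric(predict(ridge_final, newx = x, type = "response"))
beta_hat = as.numeric(as.matrix(coef(ridge_final)))[-1] #drop the unpenalised intercept
-mean(y * log(p_hat_ridge) + (1 - y) * log(1 - p_hat_ridge)) + (best_lambda / 2) * sum(beta_hat^2)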
The intercept indicates that, with all predictors at zero, the odds of a win are less than 1 (the log-odds are negative). Whenever Man Utd keep a clean sheet in a match, the log-odds of winning increase. Similarly, the more goals Man Utd score in a match, the greater the log-odds of winning, and playing at home also increases the log-odds of winning.
Tackles in Defensive Third and Tackles in Middle Third are positively associated with the log-odds of a Win. This suggests that successful tackles, especially in the middle third and defensive zones, are beneficial. The fact that tackles in the middle third have a relatively larger coefficient indicates that they may be more impactful than tackles in other areas, potentially because they break up opposition attacks before they reach the defensive zone.
Blocks and Shots Blocked have negative coefficients, implying that blocking shots and making defensive interventions might not always be favorable for the outcome. It could be due to the nature of these defensive actions (e.g., blocking shots might indicate a defensive team under pressure, or blocked shots could lead to dangerous rebounds).
Tackling in the defensive third is slightly positive, indicating that strong defense in the team’s own half is beneficial. Tackling in the middle third is more strongly positive, suggesting that controlling the midfield and breaking up opposition play in the center of the field is particularly important for the outcome.
Shots Blocked and the interaction between Shots Against and Blocks are negatively associated with the outcome, suggesting that increased defensive actions like blocking shots go hand in hand with a lower likelihood of winning. This could be due to increased pressure on the defence, with the team pinned back and defending from a vulnerable position.
A team that maintains clean sheets, scores more goals, and makes tackles in the middle third is likely to be more successful, according to the model. This suggests a playing style focused on defensive stability (clean sheets, tackles) and attacking efficiency (goals, passes).
Improvement through offense: The model suggests focusing on improving goal-scoring ability and maintaining clean sheets as these variables have large positive coefficients. Potentially, improving current attacking positions’ goal output or buying a goalscorer (Striker or Winger or Attacking Midfielder) with consistent goal output would improve the log-odds of Man Utd winning matches.
Improvement through defense: Tackling in the middle third has a relatively strong positive effect, so improving midfield defence could also be beneficial. Conversely, reducing the number of shots that have to be blocked might be another focus, as blocked shots are negatively associated with the outcome. Man Utd could improve defensively by keeping a shot-stopping goalkeeper (retaining De Gea) and/or buying ball-winning midfielders, since tackling in the middle third is strongly positive and may interact with the negatively associated shots blocked and with keeping a clean sheet.
I am proud of being able to apply statistical knowledge to Manchester United and provide some insight into how they could improve based on their 22/23 statistics. Specifically, I am proud of handling multicollinearity by adding a Ridge penalisation, a concept I had not known previously. Moreover, I am proud of learning about Leave-One-Out Cross-Validation when assessing my model, as I had also not known this method of cross-validation previously; I had only known cross-validation with an 80/20 split of the data into a training and test set. Other methodologies I learned through this project include: adding weights to my glm() due to the imbalanced dataset, checking for perfect separation, and model metrics such as precision, recall, and F1-score.
One thing I want to improve on is bootstrapping. Since the dataset is relatively small, and my football domain knowledge suggests Man Utd performed fairly consistently across 22/23, non-parametric bootstrapping could be used to resample additional observations. For this project, I was content to keep only the real results from the season.
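A hedged sketch of what that non-parametric bootstrap could look like, resampling matches with replacement and refitting the ridge model at the chosen lambda (assuming the x, y, weights, and best_lambda objects created above):
#Non-parametric bootstrap sketch: resample matches, refit, and collect the in-sample AUC
set.seed(123)
boot_auc = replicate(500, {
idx = sample(seq_len(nrow(x)), replace = TRUE)
fit = glmnet(x[idx, ], y[idx], family = "binomial", alpha = 0,
weights = weights[idx], lambda = best_lambda)
preds = as.numeric(predict(fit, newx = x[idx, ], type = "response"))
as.numeric(auc(y[idx], preds)) #assumes both classes appear in the resample
})
quantile(boot_auc, c(0.025, 0.975)) #rough bootstrap interval for the in-sample AUC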