Premier-League-Analysis-and-insights

This project analyses Premier League results from the 1992-93 season to the 2021-22 season. It sets out to answer questions such as:

1- Does playing at the home ground give a team any advantage? If so, how large is that advantage?

2- What is the better way to collect points: defensive or attacking play?

Introduction

The packages used for this analysis are tidyverse and gdata.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 1.0.0
## Warning: package 'ggplot2' was built under R version 4.2.2
## Warning: package 'tibble' was built under R version 4.2.2
## Warning: package 'tidyr' was built under R version 4.2.2
## Warning: package 'readr' was built under R version 4.2.2
## Warning: package 'purrr' was built under R version 4.2.2
## Warning: package 'dplyr' was built under R version 4.2.2
## Warning: package 'stringr' was built under R version 4.2.2
## Warning: package 'forcats' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(gdata)
## gdata: Unable to locate valid perl interpreter
## gdata: 
## gdata: read.xls() will be unable to read Excel XLS and XLSX files
## gdata: unless the 'perl=' argument is used to specify the location of a
## gdata: valid perl intrpreter.
## gdata: 
## gdata: (To avoid display of this message in the future, please ensure
## gdata: perl is installed and available on the executable search path.)
## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLX' (Excel 97-2004) files.
## 
## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLSX' (Excel 2007+) files.
## 
## gdata: Run the function 'installXLSXsupport()'
## gdata: to automatically download and install the perl
## gdata: libaries needed to support Excel XLS and XLSX formats.
## 
## Attaching package: 'gdata'
## 
## The following objects are masked from 'package:dplyr':
## 
##     combine, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     keep
## 
## The following object is masked from 'package:stats':
## 
##     nobs
## 
## The following object is masked from 'package:utils':
## 
##     object.size
## 
## The following object is masked from 'package:base':
## 
##     startsWith

Loading EPL data

library(worldfootballR)
## Warning: package 'worldfootballR' was built under R version 4.2.2
PL<-fb_match_results(country = "ENG", gender = "M", season_end_year = c(1993:2022), tier = "1st")
names(PL)
##  [1] "Competition_Name" "Gender"           "Country"          "Season_End_Year" 
##  [5] "Round"            "Wk"               "Day"              "Date"            
##  [9] "Time"             "Home"             "HomeGoals"        "Away"            
## [13] "AwayGoals"        "Attendance"       "Venue"            "Referee"         
## [17] "Notes"            "MatchURL"         "Home_xG"          "Away_xG"
PL_r<-PL %>%
  arrange(Season_End_Year,Wk)%>%
  select(Season_End_Year,Wk,Date,Home,HomeGoals,AwayGoals,Away)
PL_r$Date<-as.Date(as.character(PL_r$Date))
PL_r$FTR<- case_when(
                  PL_r$HomeGoals>PL_r$AwayGoals~"H",
                  PL_r$HomeGoals<PL_r$AwayGoals~"A",
                  PL_r$HomeGoals==PL_r$AwayGoals~"D"
                     )
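
The result coding above can be checked on a few hand-made scores. This is only a minimal sketch: the `toy` data frame and its scores are hypothetical, chosen to cover all three outcomes, and are not taken from the real data.

```r
library(dplyr)

# Hypothetical scores (not real matches): a home win, an away win and a draw
toy <- data.frame(HomeGoals = c(2, 0, 1), AwayGoals = c(1, 3, 1))

# Same three conditions as in the analysis
toy$FTR <- case_when(
  toy$HomeGoals > toy$AwayGoals ~ "H",
  toy$HomeGoals < toy$AwayGoals ~ "A",
  toy$HomeGoals == toy$AwayGoals ~ "D"
)

toy$FTR
```

Spelling out the draw condition, instead of using a catch-all branch, means a match with a missing score gets `NA` rather than being counted as a draw.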

Data processing

In this stage we compute descriptive statistics to summarize the data before cleaning it.

summary(PL_r)
##  Season_End_Year      Wk                 Date                Home          
##  Min.   :1993    Length:11646       Min.   :1992-08-15   Length:11646      
##  1st Qu.:2000    Class :character   1st Qu.:1999-08-07   Class :character  
##  Median :2007    Mode  :character   Median :2007-02-07   Mode  :character  
##  Mean   :2007                       Mean   :2007-03-21                     
##  3rd Qu.:2015                       3rd Qu.:2014-11-29                     
##  Max.   :2022                       Max.   :2022-05-22                     
##    HomeGoals       AwayGoals        Away               FTR           
##  Min.   :0.000   Min.   :0.00   Length:11646       Length:11646      
##  1st Qu.:1.000   1st Qu.:0.00   Class :character   Class :character  
##  Median :1.000   Median :1.00   Mode  :character   Mode  :character  
##  Mean   :1.521   Mean   :1.14                                        
##  3rd Qu.:2.000   3rd Qu.:2.00                                        
##  Max.   :9.000   Max.   :9.00
table(PL_r$Season_End_Year)
## 
## 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 
##  462  462  462  380  380  380  380  380  380  380  380  380  380  380  380  380 
## 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 
##  380  380  380  380  380  380  380  380  380  380  380  380  380  380

As we can see, the 1993, 1994 and 1995 seasons each have 462 matches. That is because the EPL had 22 teams at the time (22 × 21 = 462 matches per season). The league was reduced to 20 teams (20 × 19 = 380 matches per season) from the season ending in 1996.
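
The match counts follow from the double round-robin format, in which every team hosts every other team once. A small sketch of that arithmetic, where `matches_per_season` is a helper name introduced here for illustration only:

```r
# Matches in a double round-robin league:
# each of the n teams plays one home match against each of the other n - 1 teams
matches_per_season <- function(n_teams) {
  n_teams * (n_teams - 1)
}

matches_per_season(22)  # 462
matches_per_season(20)  # 380

# Three 22-team seasons plus twenty-seven 20-team seasons
total_matches <- 3 * matches_per_season(22) + 27 * matches_per_season(20)
total_matches  # 11646
```

The total agrees with the 11646 rows reported by `summary(PL_r)`.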

Analysis Phase

We will now answer the first question: does the home ground favor the home team, and if so, to what extent?

## we will create a new dataframe with percentage of the game result
## (H: for home winning , A: for away team winning, and D for draw)

home_vs_away<-count(PL_r,FTR)%>%
  arrange(desc(n))

home_vs_away$percentage<-(home_vs_away$n/sum((home_vs_away$n)))*100
home_vs_away
##   FTR    n percentage
## 1   H 5335   45.80972
## 2   A 3301   28.34450
## 3   D 3010   25.84578
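
The counts and percentages can also be built in a single pipeline. The following is a minimal sketch on toy data (the `toy` data frame is made up and is not the real result set), using `mutate()` to add the percentage column:

```r
library(dplyr)

# Toy results, not the real data: 5 home wins, 3 away wins, 2 draws
toy <- data.frame(FTR = c(rep("H", 5), rep("A", 3), rep("D", 2)))

result_share <- toy %>%
  count(FTR) %>%                              # one row per result with its count n
  mutate(percentage = n / sum(n) * 100) %>%   # share of all matches
  arrange(desc(n))                            # most frequent result first

result_share
```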

Plotting the results

ggplot(data = home_vs_away,aes(x=reorder(FTR,-percentage),y=percentage))+
    geom_col()+
    ggtitle("The percentage of results at Home ground")+
    xlab("who wins the Home games")

We can see that the home team wins far more often than the away team. Now let us find out the extent to which the home ground favors the home team.

## creating a new data frame to hold the average points per match, home and away
home_vs_away_points<-matrix(nrow = 2,ncol = 2)
home_vs_away_points<-as.data.frame(home_vs_away_points)
names(home_vs_away_points)<-c("where_to_play","Average_points")
home_vs_away_points$where_to_play<-c("Home","Away")

## home_vs_away is sorted by n, so row 1 is "H", row 2 is "A" and row 3 is "D"; column 2 is n
## average points per match for the home team: 3 per home win, 1 per draw
home_vs_away_points[1,2]<-((home_vs_away[1,2])*3+(home_vs_away[3,2])*1)/sum(home_vs_away$n)

## average points per match for the away team: 3 per away win, 1 per draw
home_vs_away_points[2,2]<-((home_vs_away[2,2])*3+(home_vs_away[3,2])*1)/sum(home_vs_away$n)

Plotting the results

ggplot(data = home_vs_away_points,aes(x=where_to_play,y=Average_points))+
  geom_col()+
  ggtitle("Home Vs Away, average points")+
  xlab("where to play")+ylab("Average points")

From the plot above we can see that teams collect more points per match at home than away. The home advantage is about 0.52 points per match (roughly 1.63 at home against 1.11 away).
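
The size of the advantage can be checked by hand from the result counts printed earlier (5335 home wins, 3301 away wins, 3010 draws). The variable names below are introduced for this sketch only:

```r
# Result counts taken from the output earlier in this analysis
home_wins <- 5335
away_wins <- 3301
draws     <- 3010
matches   <- home_wins + away_wins + draws  # 11646

# 3 points for a win, 1 point for a draw
home_ppm <- (3 * home_wins + draws) / matches
away_ppm <- (3 * away_wins + draws) / matches

# home 1.63, away 1.11, advantage 0.52
round(c(home = home_ppm, away = away_ppm, advantage = home_ppm - away_ppm), 2)
```

Draws contribute equally to both sides, so the gap depends only on the difference between home and away wins: 3 × (5335 − 3301) / 11646.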

Let’s track this advantage through the years.

First, we will look at how many points teams have collected at home and away in each season.

## new dataframe counting final results by every year
point_year<-PL_r%>%
  group_by(Season_End_Year)%>%
  count(FTR)

## calculating the points each result is worth: 3 for a home or away win, 1 per team for a draw
point_year$points<-case_when(point_year$FTR=="H"~point_year$n*3,
                             point_year$FTR=="A"~point_year$n*3,
                             point_year$FTR=="D"~point_year$n*1
                             )

## creating two columns: h_points is the points collected by home teams (home wins plus draws),
## a_points is the points collected by away teams (away wins plus draws)
point_year2<-point_year%>%
  group_by(Season_End_Year)%>%
  summarize(h_points=points[FTR=="H"]+points[FTR=="D"],a_points=(points[FTR=="A"])+points[FTR=="D"])

## tidying the data
point_year3<-point_year2%>%
  pivot_longer(c(`h_points`, `a_points`), names_to = "Home_vs_Away", values_to = "Points")
ggplot(data = point_year3,aes(x=Season_End_Year,y=Points,col=Home_vs_Away))+
  geom_line()+
  theme(legend.title=element_blank())+
  theme(legend.position=c(0.9,0.9))+
  scale_color_manual(labels = c("Away", "Home"),
                     values = c( "red", "blue"))+
  ggtitle("Total points collected by the teams Home vs Away")+
  xlab("Season end year")+
  ylab("Total points")

As we have seen, playing at home gives the home team an advantage. In the 2021 season, however, things changed, because crowds were not allowed in stadiums due to COVID-19.
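
Because the 22-team seasons had 462 matches instead of 380, season totals are easier to compare when expressed per match. A minimal sketch of that normalization, on toy data (the `toy` rows are invented and do not reflect real seasons):

```r
library(dplyr)

# Toy data, not real results: two seasons with different numbers of matches
toy <- data.frame(
  Season_End_Year = c(rep(1993, 6), rep(2021, 4)),
  FTR = c("H", "H", "H", "D", "A", "A",   # toy 1993 season
          "H", "A", "A", "D")             # toy 2021 season
)

ppm_by_season <- toy %>%
  group_by(Season_End_Year) %>%
  summarize(
    matches   = n(),
    home_ppm  = (3 * sum(FTR == "H") + sum(FTR == "D")) / matches,
    away_ppm  = (3 * sum(FTR == "A") + sum(FTR == "D")) / matches,
    advantage = home_ppm - away_ppm   # positive means home teams did better
  )

ppm_by_season
```

Applied to `PL_r` instead of `toy`, the `advantage` column would give one home-advantage figure per season that does not depend on the league size.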

Now let us see how the absence of crowds changed the results.

## new dataframe with only the results of the 2020-2021 season
season_2021<-subset(PL_r,Season_End_Year==2021)

Home vs Away winning in 2021 season

point_2021<-season_2021%>%
  group_by(Season_End_Year)%>%
  count(FTR)

Plotting the result

ggplot(data =point_2021,aes(FTR,n))+
  geom_col()+
  ggtitle("Home vs Away winning in 2021 season")+
  ylab("Numbers of matches")+
  xlab("The Results")

The plot suggests that the home advantage largely disappeared when crowds were absent.

Now we turn to team-level analysis, starting with the team that has collected the most points at its home stadium.

## new dataframe to calculate the home results for each PL team
home_point<-PL_r %>%
  group_by(Home)%>%
  count(FTR)

## replacing "H" with "W" for a win and "A" with "L" for a loss; "D" remains "D" for a draw
home_point$FTR[home_point$FTR=="H"]<-"W"
home_point$FTR[home_point$FTR=="A"]<-"L"

## calculating the points collected at the home ground by each team
## the winning team gets 3 points, the losing team gets none, and each team gets 1 point for a draw
home_point$points<- case_when(home_point$FTR=="L"~home_point$n*0,
                               home_point$FTR=="W"~home_point$n*3,
                               home_point$FTR=="D"~home_point$n*1
                               )
                               
home_point2<-home_point%>%
  group_by(Home)%>%
  summarize(T_point=sum(points))
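
The same totals can also be computed in one step, without recoding the results first. This is a sketch on toy fixtures with made-up team names, not the author's pipeline:

```r
library(dplyr)

# Toy fixtures with made-up teams, not real results
toy <- data.frame(
  Home = c("Team A", "Team A", "Team A", "Team B", "Team B"),
  FTR  = c("H", "D", "A", "H", "H")
)

home_totals <- toy %>%
  group_by(Home) %>%
  # 3 points per home win, 1 per draw, 0 per home defeat
  summarize(T_point = sum(3 * (FTR == "H") + (FTR == "D")))

home_totals
```

Here the logical comparisons are treated as 0 or 1, so each match contributes 3, 1 or 0 points to its home team.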

Plotting the result

ggplot(data = home_point2,aes(x=T_point,y=reorder(Home,T_point),fill=Home))+
  geom_col()+
  ggtitle("Points by team in Home ground")+
  ylab("Team")+
  xlab("Points")+
  theme(legend.position="none")

As we see, Manchester United is the team with the most points collected in their home stadium, followed by Arsenal, Liverpool, and Chelsea.

Because total points favor clubs that have played more seasons in the league, let’s look at the points collected at home per match for every team.