Introduction

This analysis is conducted on a dataset containing match statistics for English Premier League for 18 seasons - for the years 2000-2018.

Data set has been imported from the following link: https://www.kaggle.com/datasets/saife245/english-premier-league?resource=download

My objective is to analyze this data to extract interesting milestones related to teams, seasons, and final match results. Also to discover any underlying relationship between the final match results and other variables.

Data Structure Analysis of Dataset

This dataset contains 6840 rows (observations) and 41 columns (variables). Although our observations indicate that we have a good amount of observations to conduct a meaningful analysis, our set of columns are too large and should be assessed it normalization is required.

Following are the column names defined in our dataset.

#Extract column names from our dataset
dimnames(eplMatches)[[2]]
##  [1] "MatchID"       "Date"          "HomeTeam"      "AwayTeam"     
##  [5] "FTHG"          "FTAG"          "FTR"           "HTGS"         
##  [9] "ATGS"          "HTGC"          "ATGC"          "HTP"          
## [13] "ATP"           "HM1"           "HM2"           "HM3"          
## [17] "HM4"           "HM5"           "AM1"           "AM2"          
## [21] "AM3"           "AM4"           "AM5"           "MW"           
## [25] "HTFormPtsStr"  "ATFormPtsStr"  "HTFormPts"     "ATFormPts"    
## [29] "HTWinStreak3"  "HTWinStreak5"  "HTLossStreak3" "HTLossStreak5"
## [33] "ATWinStreak3"  "ATWinStreak5"  "ATLossStreak3" "ATLossStreak5"
## [37] "HTGD"          "ATGD"          "DiffPts"       "DiffFormPts"  
## [41] "Season"
Column Name Description Problems/Recommended Action
MatchID Uniquely identifies a match - Primary Key None
Date Date when the match took place Some dates were identified to be outside of the season calendar - issue of bad entry.
HomeTeam Name of team playing in own stadium Proposed for Normalization - Team table. This value should just contain TeamID.
AwayTeam Name of team playing in away stadium Proposed for Normalization - Team table. This value should just contain TeamID.
FTHG/FTAG Number of goals scored at the end of the match by home/away team None
FTR Containing H if home team won, A if away team won, and D if its a draw (goals scored is equal) Data contained incorrect values - only H (if home team won) and NH (if home team did not win). It was corrected using the update query in SQL before further analysis.
HTGS/ATGS Goals scored by home/away team (commutative for current season) None
HTGC/ATGC Goals conceded by home/away team (commutative for current season) None
HTP/ATP Points won from this match by home/away team None
HM1 till HM5 Match results for the previous five games (indicated by 1-5) for the home team Proposed for Normalization - TeamForm table. There should just be a TeamFormID in this Match table.
AM1 till AM5 Match results for the previous five games (indicated by 1-5) for the away team Proposed for Normalization - TeamForm table. There should just be a TeamFormID in this Match table.
MW Indicates what matchweek for current season. Usually 38 MWs are there in a season None

HTFormPtsStr/

ATFormPtsStr

A 5-character string indicating home/away team’s form Proposed for Normalization - TeamForm table. There should just be a TeamFormID in this Match table.

HTFormPts/

ATFormPts

Current season’s tally of points for home/away team before this match Proposed for Normalization - TeamForm table. There should just be a TeamFormID in this Match table.

HTWinStreak3/

HTWinStreak5

Boolean value - 1 if home team won last 3 games / 5 games Proposed for Normalization - TeamForm table. There should just be a TeamFormID in this Match table.

HTLossStreak3/

HTLossStreak5

Boolean value - 1 if home team lost last 3 games / 5 games Proposed for Normalization - TeamForm table. There should just be a TeamFormID in this Match table.

ATWinStreak3/

ATWinStreak5

Boolean value - 1 if away team won last 3 games / 5 games Proposed for Normalization - TeamForm table. There should just be a TeamFormID in this Match table.

ATLossStreak3/

ATLossStreak5

Boolean value - 1 if away team lost last 3 games / 5 games Proposed for Normalization - TeamForm table. There should just be a TeamFormID in this Match table.
HTGD/ATGD Goal Difference i.e Goals scored - Goals conceded; for the home/away team None
DiffPts Difference in Points None
DiffFormPts Difference in Form Points None

Changes Recommended

It is recommended to introduce following two new tables (entities) in the current data structure:

  1. Team - Containing TeamID (PK) and TeamName. PK of this table will be kept as FK in Match table in relevant columns such as Home Team or Away Team.
  2. TeamForm - Containing TeamFormID(PK) with above highlighted columns. PK of this table will be kept as FK in Match table resolving the many-to-many relationship between Team and Form.

Changes Made

Before proceeding following changes were made in the dataset:

General Statistics

In a typical season there are 380 matches - 20 teams play each other twice (home and away). The team which accumulates the highest points wins the league for that particular season.

To check the quality of the data, lets find out the number of matches played in each season.

Above table shows that we successfully altered database table introducing the Season variable. The number of matches in our database corroborates with our understanding as well.

Following interesting statistics can be extracted using SQL queries via R:

1. Match Result with the most occurences:

Most occurrences found for Home wins over 18 seasons - indicating the significance of home advantage in a match.

2. Which teams were most successful at Home and Away?

The team with most Home wins is **
** with matches won. The team with most Away wins is **

** with matches won.

3. Highest margin of victory over 18 seasons?

Tottenham won by the biggest margin against Wigan on 2009-11-22 by scoring 9 goals.

4. Team with the 5-match winning streak with most occurrences?

##      HomeTeam HForm   AwayTeam AForm
## 1  Man United    31 Man United    32
## 2     Arsenal    18    Arsenal    15
## 3     Chelsea    18    Chelsea    22
## 4   Liverpool    17  Liverpool    12
## 5    Man City    16   Man City    14
## 6   Tottenham     8  Tottenham     9
## 7     Everton     3    Everton     3
## 8   Newcastle     2  Newcastle     1
## 9       Leeds     1      Leeds     2
## 10 Birmingham     1 Birmingham     1
## 11      Wigan     1      Wigan     1

Data Visualizations

1. Breakup of Full-Time Results in each Season

Looking at the above chart, it becomes apparent that wins for the home team is a consistent feature over the seasons.

Highest Away wins were recorded in Season 2013-14.

Most Draws were recorded in Season2010-11.

Highest Home wins were recorded in Season 2005-06.

2. Goals Conceded in Each Season

Throughout our analysis we observed that Home wins have been higher than Away wins or draws. In tandem with this trend, we would expect home team to score more goals and concede less on most occasions.

Contrary to our expectations, it was interesting to discover that in 7 out of 18 seasons teams conceded more goals at home than away.

3. Goals Scored By Teams

Above Size map shows the number of goals scored by teams in home and away settings. There are eight teams which hold the honor of scoring a large proportion of goals compared to others.

In a Home setting, top 8 teams scored 48% of the goals - almost half of all goals were scored by this group.

In an Away setting, the same 8 teams scored 44% of the goals.

Conclusion

In this dataset, it was observed that home advantage is a real advantage - Almost half of all match results observed over 18 seasons ended in favor of the home team. In another graph it was discovered that home teams conceded more goals than away teams. This was contrary to our expectations since the more goals are conceded by home teams their probability to win goes down. Something which may not have been part of this data would have to used to explain such a phenomenon.

Using queries and visualizations, information about matches and individual teams can be discovered using the data. For example, the best team for home and away was highlighted and the match with highest goal margin was also discovered. Top eight teams, either home or away, account for more than half of all goals scored in the league.

Some improvements can be made in the data structure to obtain more detailed analysis. For example, if Team and TeamForm were declared as separate entities we could conduct analysis on individual teams without the bias of Home or Away. Additional columns had to be introduced such as Season to group the analysis at that level. Moreover, FTR (Full-time result) variable was corrected in the dataset before conducting analysis. To further improve the analysis, more statistics like shots on target, fouls, crowd attendance %, player values/ratings, etc. could have been included.