This analysis is conducted on a dataset containing match statistics for English Premier League for 18 seasons - for the years 2000-2018.
Data set has been imported from the following link: https://www.kaggle.com/datasets/saife245/english-premier-league?resource=download
My objective is to analyze this data to extract interesting milestones related to teams, seasons, and final match results. Also to discover any underlying relationship between the final match results and other variables.
This dataset contains 6840 rows (observations) and 41 columns (variables). Although our observations indicate that we have a good amount of observations to conduct a meaningful analysis, our set of columns are too large and should be assessed it normalization is required.
Following are the column names defined in our dataset.
#Extract column names from our dataset
dimnames(eplMatches)[[2]]
## [1] "MatchID" "Date" "HomeTeam" "AwayTeam"
## [5] "FTHG" "FTAG" "FTR" "HTGS"
## [9] "ATGS" "HTGC" "ATGC" "HTP"
## [13] "ATP" "HM1" "HM2" "HM3"
## [17] "HM4" "HM5" "AM1" "AM2"
## [21] "AM3" "AM4" "AM5" "MW"
## [25] "HTFormPtsStr" "ATFormPtsStr" "HTFormPts" "ATFormPts"
## [29] "HTWinStreak3" "HTWinStreak5" "HTLossStreak3" "HTLossStreak5"
## [33] "ATWinStreak3" "ATWinStreak5" "ATLossStreak3" "ATLossStreak5"
## [37] "HTGD" "ATGD" "DiffPts" "DiffFormPts"
## [41] "Season"
| Column Name | Description | Problems/Recommended Action |
|---|---|---|
| MatchID | Uniquely identifies a match - Primary Key | None |
| Date | Date when the match took place | Some dates were identified to be outside of the season calendar - issue of bad entry. |
| HomeTeam | Name of team playing in own stadium | Proposed for Normalization - Team table. This value should just contain TeamID. |
| AwayTeam | Name of team playing in away stadium | Proposed for Normalization - Team table. This value should just contain TeamID. |
| FTHG/FTAG | Number of goals scored at the end of the match by home/away team | None |
| FTR | Containing H if home team won, A if away team won, and D if its a draw (goals scored is equal) | Data contained incorrect values - only H (if home team won) and NH (if home team did not win). It was corrected using the update query in SQL before further analysis. |
| HTGS/ATGS | Goals scored by home/away team (commutative for current season) | None |
| HTGC/ATGC | Goals conceded by home/away team (commutative for current season) | None |
| HTP/ATP | Points won from this match by home/away team | None |
| HM1 till HM5 | Match results for the previous five games (indicated by 1-5) for the home team | Proposed for Normalization - TeamForm table. There should just be a TeamFormID in this Match table. |
| AM1 till AM5 | Match results for the previous five games (indicated by 1-5) for the away team | Proposed for Normalization - TeamForm table. There should just be a TeamFormID in this Match table. |
| MW | Indicates what matchweek for current season. Usually 38 MWs are there in a season | None |
HTFormPtsStr/ ATFormPtsStr |
A 5-character string indicating home/away team’s form | Proposed for Normalization - TeamForm table. There should just be a TeamFormID in this Match table. |
HTFormPts/ ATFormPts |
Current season’s tally of points for home/away team before this match | Proposed for Normalization - TeamForm table. There should just be a TeamFormID in this Match table. |
HTWinStreak3/ HTWinStreak5 |
Boolean value - 1 if home team won last 3 games / 5 games | Proposed for Normalization - TeamForm table. There should just be a TeamFormID in this Match table. |
HTLossStreak3/ HTLossStreak5 |
Boolean value - 1 if home team lost last 3 games / 5 games | Proposed for Normalization - TeamForm table. There should just be a TeamFormID in this Match table. |
ATWinStreak3/ ATWinStreak5 |
Boolean value - 1 if away team won last 3 games / 5 games | Proposed for Normalization - TeamForm table. There should just be a TeamFormID in this Match table. |
ATLossStreak3/ ATLossStreak5 |
Boolean value - 1 if away team lost last 3 games / 5 games | Proposed for Normalization - TeamForm table. There should just be a TeamFormID in this Match table. |
| HTGD/ATGD | Goal Difference i.e Goals scored - Goals conceded; for the home/away team | None |
| DiffPts | Difference in Points | None |
| DiffFormPts | Difference in Form Points | None |
Changes Recommended
It is recommended to introduce following two new tables (entities) in the current data structure:
Changes Made
Before proceeding following changes were made in the dataset:
Added a column for Season (a unique string) so that analysis can be aggregated at that level
Updated DB values for FTR(Full-Time Result); H for Home Win, A for Away win, and D for Draw
In a typical season there are 380 matches - 20 teams play each other twice (home and away). The team which accumulates the highest points wins the league for that particular season.
To check the quality of the data, lets find out the number of matches played in each season.
Above table shows that we successfully altered database table introducing the Season variable. The number of matches in our database corroborates with our understanding as well.
Following interesting statistics can be extracted using SQL queries via R:
Most occurrences found for Home wins over 18 seasons - indicating the significance of home advantage in a match.
** with matches won.
Tottenham won by the biggest margin against Wigan on 2009-11-22 by scoring 9 goals.
## HomeTeam HForm AwayTeam AForm
## 1 Man United 31 Man United 32
## 2 Arsenal 18 Arsenal 15
## 3 Chelsea 18 Chelsea 22
## 4 Liverpool 17 Liverpool 12
## 5 Man City 16 Man City 14
## 6 Tottenham 8 Tottenham 9
## 7 Everton 3 Everton 3
## 8 Newcastle 2 Newcastle 1
## 9 Leeds 1 Leeds 2
## 10 Birmingham 1 Birmingham 1
## 11 Wigan 1 Wigan 1
Looking at the above chart, it becomes apparent that wins for the home team is a consistent feature over the seasons.
Highest Away wins were recorded in Season 2013-14.
Most Draws were recorded in Season2010-11.
Highest Home wins were recorded in Season 2005-06.
Throughout our analysis we observed that Home wins have been higher than Away wins or draws. In tandem with this trend, we would expect home team to score more goals and concede less on most occasions.
Contrary to our expectations, it was interesting to discover that in 7 out of 18 seasons teams conceded more goals at home than away.
Above Size map shows the number of goals scored by teams in home and away settings. There are eight teams which hold the honor of scoring a large proportion of goals compared to others.
In a Home setting, top 8 teams scored 48% of the goals - almost half of all goals were scored by this group.
In an Away setting, the same 8 teams scored 44% of the goals.
In this dataset, it was observed that home advantage is a real advantage - Almost half of all match results observed over 18 seasons ended in favor of the home team. In another graph it was discovered that home teams conceded more goals than away teams. This was contrary to our expectations since the more goals are conceded by home teams their probability to win goes down. Something which may not have been part of this data would have to used to explain such a phenomenon.
Using queries and visualizations, information about matches and individual teams can be discovered using the data. For example, the best team for home and away was highlighted and the match with highest goal margin was also discovered. Top eight teams, either home or away, account for more than half of all goals scored in the league.
Some improvements can be made in the data structure to obtain more detailed analysis. For example, if Team and TeamForm were declared as separate entities we could conduct analysis on individual teams without the bias of Home or Away. Additional columns had to be introduced such as Season to group the analysis at that level. Moreover, FTR (Full-time result) variable was corrected in the dataset before conducting analysis. To further improve the analysis, more statistics like shots on target, fouls, crowd attendance %, player values/ratings, etc. could have been included.