Introduction

English Premier League: The League of Uncertainties

The Premier League is an English professional league for men’s association football clubs. At the top of the English football league system, it is the country’s primary football competition. Contested by 20 clubs, it operates on a system of promotion and relegation with the English Football League (EFL; known as “The Football League” before 2016-17). Welsh clubs that compete in the English football league system can also qualify.

The Premier League is the most-watched sports league in the world, broadcast in 212 territories to 643 million homes and a potential TV audience of 4.7 billion people. In the 2014-15 season, the average Premier League match attendance exceeded 36,000, second highest of any professional football league behind the Bundesliga’s 43,500. Most stadium occupancies are near capacity. The Premier League ranks third in the UEFA coefficients of leagues based on performances in European competitions over the past five seasons.

The Premier League is considered the toughest league to predict, as the gap in quality between the top teams and the bottom-of-the-table teams is not as wide as in other European leagues. Seeing the top-ranked team beaten by the last-ranked team is not a big surprise here. In this project, I will try to make sense of the results from the last 8 years of action, based on the team attributes for each season taken from EA Sports’ FIFA series.

Packages Required

The following packages are required for the project:

library(tidyverse)  #For data cleaning (also attaches dplyr, ggplot2 and tidyr)
library(dplyr)      #For data transformation
library(ggplot2)    #For plotting graphs
library(RSQLite)    #For importing SQLite data
library(DT)         #For interactive data tables
library(knitr)      #For the dataset summary table (kable)
library(ggExtra)    #For ggMarginal
library(plotly)     #For the interactive boxplot
library(wordcloud)  #For the word cloud

Data Preparation

Datasets Used

I have used the following Kaggle dataset for this project:

European Soccer Database

The dataset contains details of:

  • More than 25,000 matches
  • More than 10,000 players
  • 11 European countries with their lead championship
  • Seasons 2008 to 2016
  • Player and team attributes sourced from EA Sports’ FIFA video game series, including the weekly updates
  • Team line-ups with squad formation (X, Y coordinates)
  • Betting odds from up to 10 providers
  • Detailed match events (goal types, possession, corners, crosses, fouls, cards, etc.) for more than 10,000 matches

Our dataset consists of 8 tables in total, with the following dimensions (rows × columns):

  • Country (11 × 2): details of each country
  • League (11 × 3): details of each league
  • Match (25979 × 115): details and events of every match
  • Player (11060 × 7): player details
  • Player_Attributes (183978 × 42): player attributes from FIFA
  • Team (299 × 5): basic team details
  • Team_Attributes (1458 × 25): team attributes taken from FIFA
  • sqlite_sequence (7 × 2): SQLite’s internal bookkeeping table

Data Importing

We have downloaded the dataset from Kaggle to a local drive. Each table in the SQLite file needs to be read in separately. Importing the SQLite dataset in R:

con <- dbConnect(RSQLite::SQLite(), dbname = "database.sqlite")
dbListTables(con)
## [1] "Country"           "League"            "Match"            
## [4] "Player"            "Player_Attributes" "Team"             
## [7] "Team_Attributes"   "sqlite_sequence"
Country <- dbGetQuery(con, 'select * from Country')
League <- dbGetQuery(con, 'select * from League')
Match <- dbGetQuery(con, 'select * from Match')
Player <- dbGetQuery(con, 'select * from Player')
Player_Attributes <- dbGetQuery(con, 'select * from Player_Attributes')
Team <- dbGetQuery(con, 'select * from Team')
Team_Attributes <- dbGetQuery(con, 'select * from Team_Attributes')
sqlite_sequence <- dbGetQuery(con, 'select * from sqlite_sequence')

#Close the connection once all tables are in memory
dbDisconnect(con)

Merging datasets

We need to merge the filtered data from the individual tables to create the final dataset used for our analysis.

The following steps were taken:

  • Getting the league details for the EPL from the Country and League tables.
  • Getting the match event details from the Match table.
  • Getting the team details for EPL teams only from the Team table and joining them with the Team_Attributes table to get the FIFA attributes.
  • Joining the match data with the team details for both the home team and the away team.
#Getting League details for England
Country_League <- Country %>% 
  inner_join(League, by = "id") %>%
  rename(countryName = name.x, leagueName = name.y) %>%
  select(-country_id) %>% 
  filter(countryName == "England")

#Getting match details for EPL
League_Matches <- Country_League %>% 
  inner_join(Match, by = c("id" = "country_id")) %>%
  select(-(home_player_X1:BSA)) %>%
  mutate(home_team_win = ifelse(home_team_goal > away_team_goal,1,0)) %>%
  mutate(year_match = substring(season,1,4))

#getting team ids from match data
home_teams <- League_Matches %>% 
  distinct(home_team_api_id) %>% 
  rename(team_id = home_team_api_id)

away_teams <- League_Matches %>% 
  distinct(away_team_api_id) %>% 
  rename(team_id = away_team_api_id)

epl_teams <- union(home_teams,away_teams)

#Getting details of teams
team_data <- epl_teams %>% 
  inner_join(Team, by = c("team_id" = "team_api_id")) %>%
  select(team_id,team_long_name,team_short_name) %>%
  inner_join(Team_Attributes, by = c("team_id" = "team_api_id")) %>%
  select(-ends_with("Class"), -id, -team_fifa_api_id) %>%
  mutate(year = substring(date,1,4))

#Getting details of home team
home_team_data <- League_Matches %>% 
  inner_join(team_data, by = (c("year_match" = "year", "home_team_api_id" = "team_id"))) %>%
  rename(home_team_long_name = team_long_name, home_team_short_name = team_short_name, 
         home_buildUpPlaySpeed = buildUpPlaySpeed, home_buildUpPlayDribbling = buildUpPlayDribbling,
         home_buildUpPlayPassing = buildUpPlayPassing, home_chanceCreationPassing = chanceCreationPassing, 
         home_chanceCreationCrossing = chanceCreationCrossing, home_chanceCreationShooting = chanceCreationShooting,
         home_defencePressure = defencePressure, home_defenceAggression = defenceAggression,
         home_defenceTeamWidth = defenceTeamWidth)

#Getting details of away team
total_team_data <- home_team_data %>% 
  inner_join(team_data, by = (c("year_match" = "year", "away_team_api_id" = "team_id"))) %>%
  rename(away_team_long_name = team_long_name, away_team_short_name = team_short_name, 
         away_buildUpPlaySpeed = buildUpPlaySpeed, away_buildUpPlayDribbling = buildUpPlayDribbling,
         away_buildUpPlayPassing = buildUpPlayPassing, away_chanceCreationPassing = chanceCreationPassing, 
         away_chanceCreationCrossing = chanceCreationCrossing, away_chanceCreationShooting = chanceCreationShooting, 
         away_defencePressure = defencePressure, away_defenceAggression = defenceAggression,
         away_defenceTeamWidth = defenceTeamWidth)
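
The joins above match each game to a team attribute snapshot through the first four characters of the season (year_match) and of the snapshot date (year). Here is a minimal base-R sketch of that key on made-up rows; the team ids, dates and values are illustrative assumptions, not taken from the database. It shows that matches from a season without a matching snapshot year are silently dropped by an inner join.

```r
# Toy match table: two seasons, hypothetical team ids
matches <- data.frame(
  season = c("2009/2010", "2010/2011", "2010/2011"),
  home_team_api_id = c(1, 1, 2),
  away_team_api_id = c(2, 2, 1),
  stringsAsFactors = FALSE
)
matches$year_match <- substring(matches$season, 1, 4)

# Toy attribute snapshots: only the year 2010 is available
team_attrs <- data.frame(
  team_id = c(1, 2),
  date = c("2010-02-22 00:00:00", "2010-02-22 00:00:00"),
  buildUpPlaySpeed = c(60, 45),
  stringsAsFactors = FALSE
)
team_attrs$year <- substring(team_attrs$date, 1, 4)

# Inner join on (year, home team id), mirroring the keys used in the pipeline
joined <- merge(matches, team_attrs,
                by.x = c("year_match", "home_team_api_id"),
                by.y = c("year", "team_id"))

# Number of matches lost because no snapshot exists for their year
dropped <- nrow(matches) - nrow(joined)
```

On the real data, comparing nrow(League_Matches) with nrow(total_team_data) is a quick way to see how many matches were lost this way. The 2280 rows reported for the final dataset correspond to six full seasons of 380 matches each.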

Data Cleaning

  • Removing Null Values: I looked at the basic structure of all the datasets and noticed missing values in the buildUpPlayDribbling variable for both the home team and the away team. I judged the missing values to be limited enough to impute, so I replaced them with the mean of the observed values for that variable (truncated to an integer). I also created a new set of variables holding the difference in each attribute between the home team and the away team.
#Replacing null values in home_buildUpPlayDribbling with the integer mean of the observed values
total_team_data$home_buildUpPlayDribbling <- coalesce(
  total_team_data$home_buildUpPlayDribbling,
  as.integer(mean(total_team_data$home_buildUpPlayDribbling, na.rm = TRUE)))

#Replacing null values in away_buildUpPlayDribbling with the integer mean of the observed values
total_team_data$away_buildUpPlayDribbling <- coalesce(
  total_team_data$away_buildUpPlayDribbling,
  as.integer(mean(total_team_data$away_buildUpPlayDribbling, na.rm = TRUE)))

#Creating Columns for attribute difference
final_match_data <- total_team_data %>% 
  mutate(diff_buildUpPlaySpeed = home_buildUpPlaySpeed - away_buildUpPlaySpeed,
  diff_buildUpPlayDribbling = home_buildUpPlayDribbling - away_buildUpPlayDribbling,
  diff_buildUpPlayPassing = home_buildUpPlayPassing - away_buildUpPlayPassing, 
  diff_chanceCreationPassing = home_chanceCreationPassing - away_chanceCreationPassing,
  diff_chanceCreationCrossing = home_chanceCreationCrossing - away_chanceCreationCrossing, 
  diff_chanceCreationShooting = home_chanceCreationShooting - away_chanceCreationShooting, 
  diff_defencePressure = home_defencePressure - away_defencePressure, 
  diff_defenceAggression = home_defenceAggression - away_defenceAggression,
  diff_defenceTeamWidth = home_defenceTeamWidth - away_defenceTeamWidth) %>%
  select(-c(id,id.y,league_id,year_match,date.y,date)) %>% 
  rename(match_date = date.x) %>%
  arrange(stage, season)
  • Removing Outliers: None of the observations shows an outlier, so this step is not required. All the ratings lie within the 0-100 range.
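
The two cleaning steps can be illustrated on a toy vector. This is a minimal base-R sketch with made-up ratings, using ifelse() in place of coalesce() so that it runs without dplyr; the variable names are illustrative.

```r
# Toy ratings with gaps (values are made up)
dribbling <- c(40L, NA, 55L, NA, 36L)

# Integer mean of the observed values; as.integer() truncates toward zero
fill_value <- as.integer(mean(dribbling, na.rm = TRUE))

# Replace only the missing entries
imputed <- ifelse(is.na(dribbling), fill_value, dribbling)

# Range check: every rating should lie within 0-100
in_range <- all(imputed >= 0 & imputed <= 100)
```

One side effect worth checking on the real data: imputing a single constant pulls the distribution toward that value, which shows up in summary statistics when many entries were missing.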

Data after cleaning

Our Final Dataset consists of 2280 observations and 42 variables.

#Dataset preview
head(final_match_data,200) %>%
  datatable(caption = "Match Data")

Data Summary

final_match_data

The final dataset: final_match_data has the following characteristics:

  • Dimensions:
## [1] 2280   42
  • Column Names:
##  [1] "countryName"                 "leagueName"                 
##  [3] "season"                      "stage"                      
##  [5] "match_date"                  "match_api_id"               
##  [7] "home_team_api_id"            "away_team_api_id"           
##  [9] "home_team_goal"              "away_team_goal"             
## [11] "home_team_win"               "home_team_long_name"        
## [13] "home_team_short_name"        "home_buildUpPlaySpeed"      
## [15] "home_buildUpPlayDribbling"   "home_buildUpPlayPassing"    
## [17] "home_chanceCreationPassing"  "home_chanceCreationCrossing"
## [19] "home_chanceCreationShooting" "home_defencePressure"       
## [21] "home_defenceAggression"      "home_defenceTeamWidth"      
## [23] "away_team_long_name"         "away_team_short_name"       
## [25] "away_buildUpPlaySpeed"       "away_buildUpPlayDribbling"  
## [27] "away_buildUpPlayPassing"     "away_chanceCreationPassing" 
## [29] "away_chanceCreationCrossing" "away_chanceCreationShooting"
## [31] "away_defencePressure"        "away_defenceAggression"     
## [33] "away_defenceTeamWidth"       "diff_buildUpPlaySpeed"      
## [35] "diff_buildUpPlayDribbling"   "diff_buildUpPlayPassing"    
## [37] "diff_chanceCreationPassing"  "diff_chanceCreationCrossing"
## [39] "diff_chanceCreationShooting" "diff_defencePressure"       
## [41] "diff_defenceAggression"      "diff_defenceTeamWidth"
  • Column Summary:
kable(summary(final_match_data))
Character columns (Length: 2280, Class: character, Mode: character): countryName, leagueName, season, match_date, home_team_long_name, home_team_short_name, away_team_long_name, away_team_short_name

Variable                      Min.     1st Qu.   Median    Mean      3rd Qu.   Max.
stage                         1.0      10.0      19.5      19.5      29.0      38.0
match_api_id                  839796   1025182   1351780   1380333   1724171   1989079
home_team_api_id              8191     8551      8663      9195      10003     10261
away_team_api_id              8191     8551      8663      9195      10003     10261
home_team_goal                0.000    1.000     1.000     1.552     2.000     8.000
away_team_goal                0.000    0.000     1.000     1.187     2.000     6.000
home_team_win                 0.0000   0.0000    0.0000    0.4491    1.0000    1.0000
home_buildUpPlaySpeed         25.00    48.00     58.50     56.57     65.00     77.00
home_buildUpPlayDribbling     24.0     38.0      38.0      38.3      38.0      60.0
home_buildUpPlayPassing       24.00    43.50     51.00     52.14     61.00     80.00
home_chanceCreationPassing    28.00    42.00     49.50     50.73     59.25     72.00
home_chanceCreationCrossing   31.00    50.00     59.00     57.28     68.25     76.00
home_chanceCreationShooting   24.00    45.75     54.00     52.37     60.00     80.00
home_defencePressure          25.00    38.00     44.00     44.77     50.25     70.00
home_defenceAggression        31.00    41.00     50.00     50.39     58.00     70.00
home_defenceTeamWidth         30.00    45.00     51.00     50.83     56.00     70.00
away_buildUpPlaySpeed         25.00    48.00     58.50     56.57     65.00     77.00
away_buildUpPlayDribbling     24.0     38.0      38.0      38.3      38.0      60.0
away_buildUpPlayPassing       24.00    43.50     51.00     52.14     61.00     80.00
away_chanceCreationPassing    28.00    42.00     49.50     50.73     59.25     72.00
away_chanceCreationCrossing   31.00    50.00     59.00     57.28     68.25     76.00
away_chanceCreationShooting   24.00    45.75     54.00     52.37     60.00     80.00
away_defencePressure          25.00    38.00     44.00     44.77     50.25     70.00
away_defenceAggression        31.00    41.00     50.00     50.39     58.00     70.00
away_defenceTeamWidth         30.00    45.00     51.00     50.83     56.00     70.00
diff_buildUpPlaySpeed         -50      -10       0         0         10        50
diff_buildUpPlayDribbling     -29      0         0         0         0         29
diff_buildUpPlayPassing       -51      -11       0         0         11        51
diff_chanceCreationPassing    -44      -10       0         0         10        44
diff_chanceCreationCrossing   -43      -9        0         0         9         43
diff_chanceCreationShooting   -45      -10       0         0         10        45
diff_defencePressure          -40      -10       0         0         10        40
diff_defenceAggression        -35      -10       0         0         10        35
diff_defenceTeamWidth         -40      -7        0         0         7         40
  • Column Details: The dataset consists of 2280 observations and 42 variables. The variables can be divided into 4 categories:
  1. Match metadata: This set of variables contains details about the match, such as the match date, season, round number and goals scored.
  2. Home team attributes: This set of variables contains the team attributes sourced from FIFA for the particular year. The team attributes give a rough quantitative estimate of the characteristics of the team, such as defensive capabilities, attacking strength and passing quality.
  3. Away team attributes: This set contains the same variables as the home team attributes, but for the away team.
  4. Attribute difference: This set of derived variables holds the difference between the home team attributes and the away team attributes, and will be used primarily for trying to make sense of the results.

Data Exploration

plot 1

Word cloud of the most successful teams

The word cloud below depicts the team-wise rankings by total points, taking all the seasons into consideration. We observe that Manchester City and Manchester United top the leaderboard, followed by Chelsea and Arsenal.

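#team_wise_points:

#Points per match (assumption: standard league scoring of 3 for a win, 1 for a draw, 0 for a loss)
final_match_data_points <- final_match_data %>%
  mutate(home_team_points = case_when(home_team_goal > away_team_goal ~ 3,
                                      home_team_goal == away_team_goal ~ 1,
                                      TRUE ~ 0),
         away_team_points = case_when(away_team_goal > home_team_goal ~ 3,
                                      away_team_goal == home_team_goal ~ 1,
                                      TRUE ~ 0))

points_wordcloud_home <- final_match_data_points %>%
  select(home_team_long_name, home_team_points) %>% rename(team = home_team_long_name, points = home_team_points)
points_wordcloud_away <- final_match_data_points %>%
  select(away_team_long_name, away_team_points) %>% rename(team = away_team_long_name, points = away_team_points)

#Shorten long names by value rather than by row position
#(the full names are assumed to match the Team table)
points_wordcloud <- union_all(points_wordcloud_home,points_wordcloud_away) %>%
  group_by(team) %>% summarize(total_points = sum(points)) %>% arrange(desc(total_points)) %>%
  mutate(team = recode(team, "Manchester City" = "Man City", "Manchester United" = "Man Utd",
                       "Tottenham Hotspur" = "Tottenham"))

set.seed(1234)
wordcloud(words = points_wordcloud$team, freq = (points_wordcloud$total_points)^2, 
          min.freq = 100, random.order = FALSE, rot.per = 0.4,
          colors = brewer.pal(8, "Dark2"))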

plot 2

Playing styles across seasons

The figure below is an interactive grouped boxplot of the combined ratings of all teams across six seasons. The ratings are split into attack points, defense points and midfield points. We observe that after the first two seasons, the attack points have remained fairly constant. The range of midfield points appears to shrink, as the minimum rating of the teams has improved over the seasons. In contrast, the range of defense points has increased over the seasons.

Together, the attack, midfield and defense values give an indication of a team’s playing style.

final_match_data_ratings <- final_match_data %>% 
  mutate(home_defense_points = home_defencePressure + home_defenceAggression + home_defenceTeamWidth,
         away_defense_points = away_defencePressure + away_defenceAggression + away_defenceTeamWidth,
         home_midfield_points = home_buildUpPlaySpeed + home_buildUpPlayDribbling + home_buildUpPlayPassing,
         away_midfield_points = away_buildUpPlaySpeed + away_buildUpPlayDribbling + away_buildUpPlayPassing,
         home_attack_points = home_chanceCreationPassing + home_chanceCreationCrossing + home_chanceCreationShooting,
         away_attack_points = away_chanceCreationPassing + away_chanceCreationCrossing + away_chanceCreationShooting)

ratings_season_home <- final_match_data_ratings %>%
  select(season, home_team_long_name, home_defense_points, home_midfield_points, home_attack_points) %>% 
  rename(team = home_team_long_name, defense_points = home_defense_points, 
         midfield_points = home_midfield_points, attack_points = home_attack_points)

ratings_season_away <- final_match_data_ratings %>%
  select(season, away_team_long_name, away_defense_points, away_midfield_points, away_attack_points) %>% 
  rename(team = away_team_long_name, defense_points = away_defense_points, 
         midfield_points = away_midfield_points, attack_points = away_attack_points)

#One row per team, season and rating combination, reshaped to long format
rating_season <- union_all(ratings_season_home,ratings_season_away) %>%
  distinct(season, team, defense_points, midfield_points, attack_points) %>%
  gather(category, points, defense_points:attack_points)

#Grouped boxplot of points per season, one box per rating category
plot_ly(rating_season, x = ~season, y = ~points, color = ~category, type = "box") %>%
  layout(title = "Playing styles across seasons", boxmode = "group")
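
The statements about a shrinking or growing range can also be checked numerically. The sketch below, in base R on made-up values, builds the composite midfield score as the sum of the three build-up attributes and then measures its spread (max minus min) within each season. The data frame name ratings and all numbers are illustrative assumptions.

```r
# Toy team-season attributes (made-up values)
ratings <- data.frame(
  season = c("2010/2011", "2010/2011", "2010/2011",
             "2011/2012", "2011/2012", "2011/2012"),
  buildUpPlaySpeed     = c(30, 60, 70, 50, 60, 65),
  buildUpPlayDribbling = c(38, 38, 38, 38, 38, 38),
  buildUpPlayPassing   = c(35, 55, 75, 50, 55, 60),
  stringsAsFactors = FALSE
)

# Composite midfield score: sum of the three build-up attributes
ratings$midfield_points <- with(ratings,
  buildUpPlaySpeed + buildUpPlayDribbling + buildUpPlayPassing)

# Spread (max - min) of the composite within each season
midfield_range <- tapply(ratings$midfield_points, ratings$season,
                         function(x) diff(range(x)))
```

Applying the same tapply() call to the points column of rating_season, one category at a time, would give the per-season ranges that the boxplot shows visually.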

plot 3

Goal Difference

The figure below is a lollipop plot of the difference between goals scored and goals conceded by each team over all the seasons. A positive goal difference means the team scored more goals than it conceded, while a negative difference means the opposite. We observe that Manchester City and Manchester United have the largest positive goal difference, which is consistent with their place at the top of the leaderboard in plot 1.

home_team_goals <- final_match_data %>% select(home_team_long_name, home_team_goal, away_team_goal) %>%
  rename(team = home_team_long_name, goals_for = home_team_goal, goals_against = away_team_goal)
away_team_goals <- final_match_data %>% select(away_team_long_name, away_team_goal, home_team_goal) %>%
  rename(team = away_team_long_name, goals_for = away_team_goal, goals_against = home_team_goal)

team_goals <- union_all(home_team_goals,away_team_goals) %>%
  mutate(goal_difference = goals_for - goals_against) %>% select(team, goal_difference) %>%
  group_by(team) %>% summarise(total_goal_difference = sum(goal_difference)) %>% arrange(total_goal_difference) %>%
  mutate(Avg = mean(total_goal_difference, na.rm = TRUE),
         Above = ifelse(total_goal_difference - Avg > 0, TRUE, FALSE),
         team_name = factor(team, levels = .$team))

ggplot(team_goals, aes(total_goal_difference, team_name, color = Above)) +
  geom_segment(aes(x = Avg, y = team_name, xend = total_goal_difference, yend = team_name), color = "grey50") +
  geom_point()
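
The key step in the code above is stacking a home view and an away view of every match so that each team gets one row per game. A minimal base-R sketch on made-up results (team names and scores are illustrative) makes the arithmetic explicit:

```r
# Toy results (made-up teams and scores)
matches <- data.frame(
  home = c("A", "B", "C"),
  away = c("B", "C", "A"),
  home_goal = c(2, 0, 1),
  away_goal = c(1, 0, 3),
  stringsAsFactors = FALSE
)

# One row per team per match: the home view and the away view
home_view <- data.frame(team = matches$home,
                        goal_difference = matches$home_goal - matches$away_goal)
away_view <- data.frame(team = matches$away,
                        goal_difference = matches$away_goal - matches$home_goal)
long <- rbind(home_view, away_view)

# Total goal difference per team
totals <- aggregate(goal_difference ~ team, data = long, FUN = sum)
```

Because every goal counts once for one team and once against another, the totals across all teams sum to zero, which is a handy sanity check on the real data.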

plot 4

Goals scored/conceded vs team attributes

The plots below are scatterplots of goals scored and goals conceded against attack, midfield and defense ratings, season by season. The plots for all seasons are combined using ggplot’s facet_wrap().

final_match_data_ratings <- final_match_data %>% 
  mutate(home_defense_points = home_defencePressure + home_defenceAggression + home_defenceTeamWidth,
         away_defense_points = away_defencePressure + away_defenceAggression + away_defenceTeamWidth,
         home_midfield_points = home_buildUpPlaySpeed + home_buildUpPlayDribbling + home_buildUpPlayPassing,
         away_midfield_points = away_buildUpPlaySpeed + away_buildUpPlayDribbling + away_buildUpPlayPassing,
         home_attack_points = home_chanceCreationPassing + home_chanceCreationCrossing + home_chanceCreationShooting,
         away_attack_points = away_chanceCreationPassing + away_chanceCreationCrossing + away_chanceCreationShooting)

ratings_goals_season_home <- final_match_data_ratings %>%
  select(season, home_team_long_name, home_defense_points, home_midfield_points, home_attack_points, home_team_goal, away_team_goal) %>% 
  rename(team = home_team_long_name, defense_points = home_defense_points, 
         midfield_points = home_midfield_points, attack_points = home_attack_points,
         goals_for = home_team_goal, goals_against = away_team_goal)

ratings_goals_season_away <- final_match_data_ratings %>%
  select(season, away_team_long_name, away_defense_points, away_midfield_points, away_attack_points,  away_team_goal, home_team_goal) %>% 
  rename(team = away_team_long_name, defense_points = away_defense_points, 
         midfield_points = away_midfield_points, attack_points = away_attack_points,
         goals_for = away_team_goal, goals_against = home_team_goal)

rating_goals_season <- union_all(ratings_goals_season_home,ratings_goals_season_away) %>%
  group_by(season,team,defense_points,midfield_points,attack_points) %>% 
  summarise(total_goals_for = sum(goals_for), total_goals_against = sum(goals_against)) %>%
  arrange(total_goals_for)
  1. The first set of plots shows the relationship between attack points and total_goals_for across the seasons. An interesting observation is that, apart from the 2011 season, the fitted line slopes downward: goals scored tend to decrease as attack ratings increase.
ggplot(data = rating_goals_season, aes(x = attack_points, y = total_goals_for)) +
  geom_point() +
  facet_wrap(~ season, nrow = 2) + geom_smooth(method = "lm")

  2. We see a similar pattern in the relationship between midfield_points and total goals for.
ggplot(data = rating_goals_season, aes(x = midfield_points, y = total_goals_for)) +
  geom_point() +
  facet_wrap(~ season, nrow = 2) + geom_smooth(method = "lm")

  3. The plots below relate total goals for to defense points. Teams with higher defense attributes tend to score more goals.
ggplot(data = rating_goals_season, aes(x = defense_points, y = total_goals_for)) +
  geom_point() +
  facet_wrap(~ season, nrow = 2) + geom_smooth(method = "lm")

The next set of plots relates total goals against to the same three ratings.

  4. The plots below relate total goals conceded to attack points. The trend between the two variables is positive.
ggplot(data = rating_goals_season, aes(x = attack_points, y = total_goals_against)) +
  geom_point() +
  facet_wrap(~ season, nrow = 2) + geom_smooth(method = "lm")

  5. The plots below relate total goals conceded to midfield points. They follow the same trend as the previous set of plots.
ggplot(data = rating_goals_season, aes(x = midfield_points, y = total_goals_against)) +
  geom_point() +
  facet_wrap(~ season, nrow = 2) + geom_smooth(method = "lm")

  6. The plots below relate total goals conceded to defense points. As defense ratings increase, goals conceded decrease.
ggplot(data = rating_goals_season, aes(x = defense_points, y = total_goals_against)) +
  geom_point() +
  facet_wrap(~ season, nrow = 2) + geom_smooth(method = "lm")

Hence the plots suggest that teams with higher defense ratings and lower attack ratings tend to perform better.
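
The direction of each fitted line can be backed up with a number. As a minimal sketch, assuming a data frame shaped like rating_goals_season, the base-R code below computes the Pearson correlation between a rating and goals scored within each season. The values are made up for illustration and are not results from the dataset.

```r
# Toy team-season totals (made-up values)
team_seasons <- data.frame(
  season = rep(c("2010/2011", "2011/2012"), each = 4),
  attack_points = c(150, 160, 170, 180, 150, 160, 170, 180),
  total_goals_for = c(60, 55, 50, 45, 40, 50, 45, 60),
  stringsAsFactors = FALSE
)

# Pearson correlation between rating and goals within each season
cor_by_season <- sapply(split(team_seasons, team_seasons$season), function(d) {
  cor(d$attack_points, d$total_goals_for)
})
```

With only 20 teams per season, such correlations are noisy; cor.test() reports a confidence interval that helps judge whether a slope is distinguishable from zero.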

plot 5

Goals scored/ conceded vs overall ratings

The following plots show the overall trend of goals scored and goals conceded against the overall rating (attack + midfield + defense ratings). Each plot shows the data for all years combined in a single graph.

ratings_combined <- rating_goals_season %>% 
  mutate(total_rating = attack_points + midfield_points + defense_points)

g1 <- ggplot(data = ratings_combined, aes(x = total_rating, y = total_goals_for, color = season)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)


g2 <- ggplot(data = ratings_combined, aes(x = total_rating, y = total_goals_against, color = season)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

This graph shows the number of goals scored against the overall rating. The general trend is that as the overall rating increases, the number of goals scored by the team decreases.

ggMarginal(g1, type = "histogram", fill = "transparent")

This graph shows the number of goals conceded against the overall rating. The general trend is that as the overall rating increases, the number of goals conceded by the team also increases.

ggMarginal(g2, type = "histogram", fill = "transparent")

The plots suggest that teams with lower overall attributes tend to perform better.

plot 6

Team rankings and league points through the seasons

This is a scatter plot of the number of points each team earned per season and how those points vary between years. It also shows the change in a team’s league position from season to season and gives some good insights into jumps and falls in the rankings. It shows how unpredictable the league is, as teams can rise or fall multiple places in consecutive years. A low-ranked team in one season can end up winning the league the next.

#Reuses final_match_data_points (match data with points columns) from plot 1
points_home <- final_match_data_points %>%
  select(season,home_team_long_name, home_team_points) %>% rename(team = home_team_long_name, points = home_team_points)
points_away <- final_match_data_points %>%
  select(season,away_team_long_name, away_team_points) %>% rename(team = away_team_long_name, points = away_team_points)

points_team <- union_all(points_home,points_away) %>%
  group_by(season,team) %>% summarize(season_total_points = sum(points)) %>% arrange(desc(season_total_points))

ggplot(data = points_team, aes(x = season, y = season_total_points, group = team, color = team)) +
  geom_point() + geom_line() +
  labs(x = "Season", y = "Season points", title = "Season points data")
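
The plot shows points, while the discussion is about league positions. As a minimal sketch, assuming a table shaped like points_team, positions and year-on-year position changes could be derived in base R as follows. The teams and points are made up, and ties are ranked with ties.method = "min" as a simplification, whereas the real league separates ties on goal difference.

```r
# Toy season totals (made-up teams and points)
points_team <- data.frame(
  season = rep(c("2013/2014", "2014/2015"), each = 3),
  team = rep(c("A", "B", "C"), times = 2),
  season_total_points = c(80, 70, 60, 65, 72, 85),
  stringsAsFactors = FALSE
)

# Position within each season: 1 = most points
points_team$position <- ave(-points_team$season_total_points,
                            points_team$season,
                            FUN = function(x) rank(x, ties.method = "min"))

# Change in position per team between consecutive seasons
# (negative = moved up the table)
points_team <- points_team[order(points_team$team, points_team$season), ]
points_team$position_change <- ave(points_team$position, points_team$team,
                                   FUN = function(x) c(NA, diff(x)))
```

Sorting the result by position_change would surface the biggest climbers and fallers between two seasons.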

Summary

Exploratory Analysis:

The Premier League is considered the toughest league to predict, as the gap in quality between the top teams and the bottom-of-the-table teams is not as wide as in other European leagues. Using the team rating stats, this project looked for relationships between ratings and team performance, and also identified the best teams in England during this period.

With the extensive graphical capabilities available in R, we provided some key insights about the data:

  • The most successful team in England from 2010-2016.
  • The variation in attack, midfield and defense points of teams through the years, describing how the teams have tried to change their playing styles.
  • The teams with the best and worst goal difference in the league through the years. Predictably, the team with the highest positive goal difference also turned out to be the most successful team in this period.
  • The dependency of goals scored and goals conceded on the 3 categories of points: attack points, midfield points and defense points. In general, teams with lower attack attribute points tend to perform better, and teams with higher defense attribute points also perform better.
  • The league points of teams through the seasons, showing the improvement or decline of scores in consecutive seasons. This also showed how a team could jump from sixth to first position in the space of one year, or fall down the table the following season.

The trends in the data have been unearthed through visualization only; data mining techniques such as clustering algorithms could be used to draw further insights from the data.
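
As a rough illustration of that next step, the sketch below runs k-means from base R’s stats package on made-up composite ratings. The team labels, the values and the choice of two clusters are all assumptions for illustration, not results from this analysis.

```r
# Toy team profiles (made-up composite ratings): two visibly different styles
profiles <- data.frame(
  team = c("A", "B", "C", "D", "E", "F"),
  attack_points  = c(190, 185, 195, 140, 145, 150),
  defense_points = c(120, 125, 115, 170, 165, 175),
  stringsAsFactors = FALSE
)

# Standardise so both ratings contribute equally to the distance
features <- scale(profiles[, c("attack_points", "defense_points")])

# k-means with k = 2; nstart repeats the random initialisation
set.seed(1234)
fit <- kmeans(features, centers = 2, nstart = 10)
profiles$cluster <- fit$cluster
```

On the real data, the number of clusters would need to be chosen with care, for example by comparing the within-cluster sum of squares (fit$tot.withinss) across several values of k.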