The Premier League is the organising body of the Premier League with responsibility for the competition, its Rule Book and the centralised broadcast and other commercial rights.
Each individual club is independent, working within the rules of football, as defined by the Premier League, The FA, UEFA and FIFA, as well as being subject to English and European law.
Each of the 20 clubs is a Shareholder in the Premier League. Consultation is at the heart of the Premier League and Shareholder meetings are the ultimate decision-making forum for Premier League policy and are held at regular intervals during the course of the season.
The Premier League AGM takes place at the close of each season, at which time the relegated clubs transfer their shares to the clubs promoted into the Premier League from the Football League Championship.
Note: the Premier League is the top division of the English football league system.
In the previous post, we analyzed Premier League stats data. The source provides two datasets, stats.csv and results.csv. Here we work with results.csv and continue the analysis from the previous post, adding visualizations to make the results easier to interpret.
The main goals here are to find out:
In this step we prepare the data before the analysis.
# Load the libraries used in the analysis below
library(scales)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(tidyr)
library(glue)
The first thing we should do is import the data into our notebook.
results <- read.csv("results.csv")
results
As we can see, the data has 6 columns and 4560 rows. We will use this data for our analysis and exploration.
The columns in the data are:
🟥IMPORTANT NOTE :🟥
HOW TO READ RESULT COLUMNS :
str(results)
## 'data.frame': 4560 obs. of 6 variables:
## $ home_team : chr "Sheffield United" "Arsenal" "Everton" "Newcastle United" ...
## $ away_team : chr "Liverpool" "Aston Villa" "Watford" "Wigan Athletic" ...
## $ home_goals: num 1 1 2 2 3 3 3 2 5 3 ...
## $ away_goals: num 1 1 1 1 0 2 1 0 1 0 ...
## $ result : chr "D" "D" "H" "H" ...
## $ season : chr "2006-2007" "2006-2007" "2006-2007" "2006-2007" ...
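The `result` codes can be cross-checked against the two goal columns. The sketch below assumes that `H` means a home win, `A` an away win and `D` a draw (an interpretation inferred from how points are computed later in this post), and it uses a small made-up data frame rather than the real file.

```r
# Toy data (hypothetical rows, not taken from results.csv)
toy <- data.frame(
  home_goals = c(1, 2, 0, 3),
  away_goals = c(1, 1, 2, 0),
  result     = c("D", "H", "A", "H")
)

# Derive the expected code from the score (assumed meaning: H / A / D)
expected <- ifelse(toy$home_goals > toy$away_goals, "H",
            ifelse(toy$home_goals < toy$away_goals, "A", "D"))

# TRUE when every recorded result agrees with the score
all(expected == toy$result)
```

Running the same comparison on `results` would be a quick way to confirm that the assumed reading of the column holds for the real data.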
Let’s check for missing values to make sure.
colSums(is.na(results))
## home_team away_team home_goals away_goals result season
## 0 0 0 0 0 0
There are no missing values, which is good. We can continue.
The data contains 2 data types, num and chr. For the analysis, we convert the “home_team”, “away_team”, “result”, and “season” columns to factor (category), because their values repeat across rows. This makes the analysis easier.
results$home_team <- as.factor(results$home_team)
results$away_team <- as.factor(results$away_team)
results$result <- as.factor(results$result)
results$season <- as.factor(results$season)
str(results)
## 'data.frame': 4560 obs. of 6 variables:
## $ home_team : Factor w/ 39 levels "AFC Bournemouth",..: 29 2 15 24 26 28 37 7 22 12 ...
## $ away_team : Factor w/ 39 levels "AFC Bournemouth",..: 20 3 35 38 5 23 11 34 16 21 ...
## $ home_goals: num 1 1 2 2 3 3 3 2 5 3 ...
## $ away_goals: num 1 1 1 1 0 2 1 0 1 0 ...
## $ result : Factor w/ 3 levels "A","D","H": 2 2 3 3 3 3 3 3 3 3 ...
## $ season : Factor w/ 12 levels "2006-2007","2007-2008",..: 1 1 1 1 1 1 1 1 1 1 ...
Now every column has the correct data type.
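As a side note, the four conversions can also be written in one step. This is a minimal sketch on a made-up data frame (the `toy` object and `cat_cols` vector are illustrative names, not part of the original analysis):

```r
# Toy data frame (hypothetical values) with character columns
toy <- data.frame(
  home_team = c("Arsenal", "Everton", "Arsenal"),
  away_team = c("Everton", "Arsenal", "Chelsea"),
  result    = c("H", "D", "A"),
  season    = c("2006-2007", "2006-2007", "2007-2008"),
  stringsAsFactors = FALSE
)

# Convert all repeated-value columns to factors at once
cat_cols <- c("home_team", "away_team", "result", "season")
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

# Show the class of every column
sapply(toy, class)
```

Keeping the column names in one vector makes it easy to add or remove a column later without repeating the assignment.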
EDA stands for Exploratory Data Analysis: we explore the data to find something interesting.
summary(results)
## home_team away_team home_goals
## Arsenal : 228 Arsenal : 228 Min. :0.000
## Chelsea : 228 Chelsea : 228 1st Qu.:1.000
## Everton : 228 Everton : 228 Median :1.000
## Liverpool : 228 Liverpool : 228 Mean :1.543
## Manchester City : 228 Manchester City : 228 3rd Qu.:2.000
## Manchester United: 228 Manchester United: 228 Max. :9.000
## (Other) :3192 (Other) :3192
## away_goals result season
## Min. :0.000 A:1288 2006-2007: 380
## 1st Qu.:0.000 D:1164 2007-2008: 380
## Median :1.000 H:2108 2008-2009: 380
## Mean :1.144 2009-2010: 380
## 3rd Qu.:2.000 2010-2011: 380
## Max. :7.000 2011-2012: 380
## (Other) :2280
📌 Short Summary :
Now let’s explore the data further and ask some questions to dig into the details behind the summary above!
top10_home <- aggregate(data =results, x = home_goals ~ home_team, FUN = sum )
top10_home_order <- top10_home[order(top10_home$home_goals, decreasing = T),][1:10,]
ggplot(data = top10_home_order, mapping = aes(x = home_goals , y = reorder(home_team, home_goals))) +
geom_col(aes(fill=home_goals), show.legend = F) +
geom_col(data = top10_home_order[2,],fill="#faeb1e") +
scale_fill_continuous(low = "#037ffc", high = "#183654") + # colors are given as hex codes
geom_label(aes(label=home_goals)) +
labs(
title = "Top 10 Teams by Home Goals",
subtitle = "Over 12 seasons (2006-2007 to 2017-2018)",
y = "Team",
x = "Goals"
) +
theme_minimal()
📌 Insight :
top10_away <- aggregate(data =results, x = away_goals ~ away_team, FUN = sum )
top10_away_order <- top10_away[order(top10_away$away_goals, decreasing = T),][1:10,]
ggplot(data = top10_away_order, mapping = aes(x = away_goals , y = reorder(away_team, away_goals))) +
geom_col(aes(fill=away_goals), show.legend = F) +
geom_col(data = top10_away_order[2,],fill="#faeb1e") +
scale_fill_continuous(low = "#ff0000", high = "#592222") + # colors are given as hex codes
geom_label(aes(label=away_goals)) +
labs(
title = "Top 10 Teams by Away Goals",
subtitle = "Over 12 seasons (2006-2007 to 2017-2018)",
y = "Team",
x = "Goals"
) +
theme_minimal()
📌 Insight :
Let’s explore Manchester United further and look at the match results of the team with the most wins over 12 seasons!
mu_h <- results[results$home_team == "Manchester United",]
mu_h$home_team <- droplevels(mu_h$home_team)
muh_result <- as.data.frame(table(mu_h$result))
mu_a <- results[results$away_team == "Manchester United",]
mu_a$away_team <- droplevels(mu_a$away_team)
mua_result <- as.data.frame(table(mu_a$result))
colnames(muh_result)[colnames(muh_result)=="Freq"] = "As Home Team"
colnames(mua_result)[colnames(mua_result)=="Freq"] = "As Away Team"
muha_join <- left_join(muh_result, mua_result)
## Joining, by = "Var1"
muha_join <- pivot_longer(data = muha_join,
cols = c("As Home Team", "As Away Team"))
colnames(muha_join)[colnames(muha_join)=="Var1"] = "match_result"
muha_join
Now we visualize it!
ggplot(data = muha_join, mapping = aes(x = match_result, y = value, fill = match_result)) +
geom_col() +
geom_label(aes(label=value), fill = "white") +
facet_wrap(~ name) +
labs(
title='Manchester United Match Results',
subtitle='Over 12 seasons (2006-2007 to 2017-2018)',
x = 'Result',
y = 'Result Count'
) +
theme_minimal()
🟥IMPORTANT NOTE :🟥
📌 Insight :
What about their goals per season? Let’s see!
mu_home <- results[results$home_team == "Manchester United" & results$result== "H",]
mu_home <- aggregate(data = mu_home, x = home_goals ~ season, FUN = sum)
mu_away <- results[results$away_team == "Manchester United" & results$result== "A",]
mu_away <- aggregate(data = mu_away, x = away_goals ~ season , FUN = sum)
# There are 2 data frames: Manchester United's goals at home and away.
# Both are filtered to matches Manchester United won (result "H" at home, "A" away).
# To compare them side by side, we join them with a left join
join_mu <- left_join(mu_home, mu_away)
## Joining, by = "season"
join_mu <- pivot_longer(data = join_mu,
cols = c("home_goals", "away_goals"))
join_mu
Now we visualize it!
ggplot(data = join_mu, mapping = aes(x = season, y = value, group = name, col = name)) +
geom_line() +
geom_point() +
labs(
title='Manchester United Home and Away Goal Performance',
subtitle='Over 12 seasons (2006-2007 to 2017-2018)',
x = 'Season',
y = 'Goal',
col = 'Result'
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 35))
📌 Insight :
As noted in the previous post, it’s interesting that Chelsea and Manchester City differ only slightly in average goals, by just 0.58, yet Chelsea has 20 more wins than Manchester City. What happened? Let’s explore it!
First, we find their goal totals.
# FIND THEIR GOAL TOTAL
# CHELSEA AS HOME AND AWAY TEAM SUBSETTING
c_home <- results[results$home_team == "Chelsea" & results$result == "H",]
c_home <- aggregate(data = c_home, x = home_goals ~ season + home_team, FUN = sum)
c_away <- results[results$away_team == "Chelsea" & results$result == "A",]
c_away <- aggregate(data = c_away, x = away_goals ~ season +away_team , FUN = sum)
c_result <- left_join(c_home, c_away)
## Joining, by = "season"
c_result <- pivot_longer(data = c_result,
cols = c("home_goals", "away_goals"))
# MANCHESTER CITY AS HOME AND AWAY TEAM SUBSETTING
mc_home <- results[results$home_team == "Manchester City" & results$result == "H",]
mc_home <- aggregate(data = mc_home, x = home_goals ~ season+ home_team , FUN = sum)
mc_away <- results[results$away_team == "Manchester City" & results$result == "A",]
mc_away <- aggregate(data = mc_away, x = away_goals ~ season+ away_team , FUN = sum)
mc_result <- left_join(mc_home, mc_away)
## Joining, by = "season"
mc_result <- pivot_longer(data = mc_result,
cols = c("home_goals", "away_goals"))
cmc_result <- full_join(c_result,mc_result)
## Joining, by = c("season", "home_team", "away_team", "name", "value")
cmc_result <- cmc_result[,-2]
colnames(cmc_result)[colnames(cmc_result)=="away_team"] = "team"
cmc_result <- spread(data = cmc_result, name, value)
cmc_result <- pivot_longer(data = cmc_result,
cols = c("away_goals", "home_goals"))
colnames(cmc_result)[colnames(cmc_result)=="name"] = "goal"
head(cmc_result)
Now let’s visualize it!
cmc_agg <- aggregate( value ~ goal + team , data = cmc_result, FUN = sum)
ggplot(data = cmc_agg, aes(x = team, y = value)) +
geom_col(aes(fill = team), position = "dodge") +
facet_wrap(~ goal) +
geom_label(aes(label=value))
📌 Insight :
chelsea <- 297 + 411
mc <- 266 + 447
chelsea
## [1] 708
mc
## [1] 713
As we can see, Manchester City still has more goals than Chelsea. To find out why Chelsea has more wins, let’s look at their match results.
# FIND THEIR MATCH RESULT
# CHELSEA AND MANCHESTER CITY AS HOME TEAM SUBSETTING
c_mc <- results[results$home_team == "Chelsea" | results$home_team == "Manchester City",]
c_mc$home_team <- droplevels(c_mc$home_team)
c_mc_result <- as.data.frame(table(c_mc$home_team, c_mc$result))
colnames(c_mc_result)[colnames(c_mc_result)=="Var2"] = "Result"
colnames(c_mc_result)[colnames(c_mc_result)=="Freq"] = "Home Team"
# CHELSEA AND MANCHESTER CITY AS AWAY TEAM SUBSETTING
c_mc_a <- results[results$away_team == "Chelsea" | results$away_team == "Manchester City",]
c_mc_a$away_team <- droplevels(c_mc_a$away_team)
c_mca_result <- as.data.frame(table(c_mc_a$away_team, c_mc_a$result))
colnames(c_mca_result)[colnames(c_mca_result)=="Var2"] = "Result"
colnames(c_mca_result)[colnames(c_mca_result)=="Freq"] = "Away Team"
# CHELSEA AND MANCHESTER CITY JOINING HOME AND AWAY TEAM SUBSETTING
cmc_join <- left_join(c_mc_result, c_mca_result)
## Joining, by = c("Var1", "Result")
cmc_join <- pivot_longer(data = cmc_join,
cols = c("Home Team", "Away Team"))
cmc_agg <- aggregate(value ~ Var1 + Result + name, data = cmc_join, FUN = sum)
cmc_agg
cmc_agg <- cmc_agg %>%
mutate(
label = glue(
"Result: {Result}
Match Result Count: {value}"
)
)
cmc_agg
We now have the results for both teams. Now we visualize them.
cmc_viz <- ggplot(data = cmc_agg, mapping = aes(x = Var1, y = value, fill = Result, text = label))+
geom_col(aes(fill = Result), position = "dodge") +
facet_wrap(~name) +
labs(
title='Chelsea vs Manchester City Match Results',
subtitle='Over 12 seasons (2006-2007 to 2017-2018)',
x = 'Team',
y = 'Total'
) +
theme_minimal()
ggplotly(cmc_viz, tooltip = "text")
🟥IMPORTANT NOTE :🟥
📌 Insight :
For one more exploration: as mentioned in the previous post, a team has to finish in the top 4 to qualify for the Champions League. Let’s see who qualified in the 2017-2018 season!
Before that, we need to know the Premier League standings point rules:
- Three points are awarded for a win
- One point for a draw
- None for a defeat
- The team with the most points at the end of the season wins the Premier League title
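The point rules above amount to simple arithmetic. As a small illustration, here is a hypothetical helper (`league_points` is a name made up for this sketch) applied to an invented record:

```r
# Points from wins, draws and losses under the 3-1-0 rule
league_points <- function(wins, draws, losses = 0) {
  3 * wins + 1 * draws + 0 * losses
}

# Hypothetical record: 10 wins, 5 draws, 4 losses -> 3*10 + 5 = 35
league_points(10, 5, 4)
```

Losses contribute nothing, so the argument is kept only to make the full record explicit.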
So, we will build a simple standings table from our data, which contains match results over 12 seasons. Let’s take only 2017-2018 for our analysis.
# HOME
clas_h <- results[results$season =="2017-2018",]
clas_h <- clas_h[,c("home_team","home_goals","result","season")]
clas_h <- as.data.frame.matrix(table(clas_h$home_team,clas_h$result))
clas_h$point <- (clas_h$D * 1) + (clas_h$H * 3)
clas_h <- clas_h[order(clas_h$point, decreasing = TRUE),]
clas_h <- cbind(team = rownames(clas_h), clas_h)
rownames(clas_h) <- NULL
# AWAY
clas_a <- results[results$season =="2017-2018",]
clas_a <- clas_a[,c("away_team","away_goals","result","season")]
clas_a <- as.data.frame.matrix(table(clas_a$away_team,clas_a$result))
clas_a$point <- (clas_a$D * 1) + (clas_a$A * 3)
clas_a <- clas_a[order(clas_a$point, decreasing = TRUE),]
clas_a <- cbind(team = rownames(clas_a), clas_a)
rownames(clas_a) <- NULL
clas_h
clas_a
We have standings, but they are not final yet: one table covers home matches and the other covers away matches. We now have 2 data frames, so we have to join them!
stands <- full_join(clas_h,clas_a, "team")
stands$total_points <- stands$point.x + stands$point.y
stands <- stands[order(stands$total_points, decreasing = T),]
row.names(stands) <- NULL
stands <- stands[,-c(2:9)]
stands <- stands[apply(stands!=0, 1, all),]
stands
These are the final standings. Let’s visualize them.
ggplot(data = stands, aes(x = total_points, y = reorder(team , total_points))) +
geom_col( position = "dodge") +
geom_col(aes(fill = total_points), show.legend = FALSE) +
geom_col(data = stands[1:4,], fill = "#f18c8e") +
scale_fill_continuous(low = "#568ea6", high = "#f0b7a4") +
geom_label(aes(label=total_points)) +
labs(title = "Premier League Standings",
subtitle = "Season 2017-2018",
caption = "Top 4 teams are eligible for the Champions League",
x = "Total Points Standings",
y = NULL) +
theme_minimal() +
theme(legend.position = "top", # change the legend position
plot.title.position = "plot") # change the plot title position
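For comparison, the home and away tables built earlier can also be derived by assigning points per match and summing per team. This is only a sketch under the same 3-1-0 rule, using invented fixtures with placeholder team names (`toy`, `home_pts`, `away_pts` and `toy_stands` are illustrative names, not objects from the analysis above):

```r
# Toy fixtures (hypothetical, not from results.csv)
toy <- data.frame(
  home_team = c("A", "B", "C", "A"),
  away_team = c("B", "C", "A", "C"),
  result    = c("H", "D", "A", "D")
)

# Points earned by the home side of each match
home_pts <- data.frame(
  team   = toy$home_team,
  points = ifelse(toy$result == "H", 3, ifelse(toy$result == "D", 1, 0))
)

# Points earned by the away side of each match
away_pts <- data.frame(
  team   = toy$away_team,
  points = ifelse(toy$result == "A", 3, ifelse(toy$result == "D", 1, 0))
)

# Stack both views, sum points per team and sort descending
all_pts    <- rbind(home_pts, away_pts)
toy_stands <- aggregate(points ~ team, data = all_pts, FUN = sum)
toy_stands <- toy_stands[order(toy_stands$points, decreasing = TRUE), ]
toy_stands
```

Because every team appears only in the matches it actually played, this variant needs no step to drop teams with zero rows from other seasons.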
📌 Insight :
Congrats!
Based on the analysis above, we found that: