The Premier League is the organising body of the Premier League with responsibility for the competition, its Rule Book and the centralised broadcast and other commercial rights.
Each individual club is independent, working within the rules of football, as defined by the Premier League, The FA, UEFA and FIFA, as well as being subject to English and European law.
Each of the 20 clubs is a Shareholder in the Premier League. Consultation is at the heart of the Premier League and Shareholder meetings are the ultimate decision-making forum for Premier League policy and are held at regular intervals during the course of the season.
The Premier League AGM takes place at the close of each season, at which time the relegated clubs transfer their shares to the clubs promoted into the Premier League from the Football League Championship.
Note : The Premier League is the top division of the English football league system.
There are 2 datasets from the source, stats.csv and results.csv. Here we use stats.csv, and we will try to analyze this data.
The main goal here is to find out how a club can give itself the best chance of winning and of staying in the Premier League.
In this step we prepare the data before the analysis.
The first thing we should do is import the data into our notebook.
df <- read.csv("stats.csv")
df
As we can see, the data has 240 rows and 42 columns. We will use it for the analysis and exploration below.
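Printing a whole data frame is not always the easiest way to read off its size. Here is a minimal sketch of checking the shape directly with `dim()` and peeking at a few rows with `head()`. The tiny CSV below is a made-up stand-in (hypothetical teams and values), not the real stats.csv.

```r
# Toy stand-in for stats.csv (hypothetical values), written to a temp file
tmp <- tempfile(fileext = ".csv")
writeLines(c("team,wins,losses",
             "Team A,28,5",
             "Team B,24,3",
             "Team C,20,10"), tmp)

toy <- read.csv(tmp)

dim(toy)       # number of rows, then number of columns
head(toy, 2)   # first two rows only
```

On the real data, `dim(df)` should report the same 240 rows and 42 columns read off the printed table.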
To reach our goal, we don’t need all the columns, only the relevant ones. Let’s select them :
stats <- df[, c("team", "wins", "losses", "goals", "total_yel_card", "total_red_card", "clean_sheet", "total_pass", "interception", "touches", "penalty_save", "season")]
stats
str(stats)
## 'data.frame': 240 obs. of 12 variables:
## $ team : chr "Manchester United" "Chelsea" "Liverpool" "Arsenal" ...
## $ wins : num 28 24 20 19 17 16 16 15 15 14 ...
## $ losses : num 5 3 10 8 12 14 15 16 10 12 ...
## $ goals : num 83 64 57 63 57 47 52 52 52 45 ...
## $ total_yel_card: num 60 62 44 59 48 84 38 77 65 48 ...
## $ total_red_card: num 1 4 0 3 3 4 3 6 2 1 ...
## $ clean_sheet : num 16 22 20 12 6 12 13 8 14 12 ...
## $ total_pass : num 18723 16759 17154 18458 14914 ...
## $ interception : num 254 292 246 214 276 235 277 282 303 220 ...
## $ touches : num 25686 24010 24150 25592 22200 ...
## $ penalty_save : num 2 1 0 0 0 2 0 5 1 2 ...
## $ season : chr "2006-2007" "2006-2007" "2006-2007" "2006-2007" ...
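The selected columns include wins and losses, but no draws or points. Here is a rough sketch of how both could be derived, assuming a 38-match season (20 clubs playing each other home and away) and the usual 3 points for a win and 1 for a draw. The four rows are the first values shown in the `str()` output above; the `draws` and `points` columns are additions for illustration only, not columns of the dataset.

```r
# First four rows of the 2006-2007 season, taken from the str() output above
top4 <- data.frame(
  team   = c("Manchester United", "Chelsea", "Liverpool", "Arsenal"),
  wins   = c(28, 24, 20, 19),
  losses = c(5, 3, 10, 8)
)

matches <- 38  # assumption: each club plays 38 matches per season

top4$draws  <- matches - top4$wins - top4$losses  # whatever is neither a win nor a loss
top4$points <- 3 * top4$wins + top4$draws         # assumption: 3 per win, 1 per draw

top4[order(top4$points, decreasing = TRUE), ]
```

A points column like this would give a direct view of the league table, which is what decides relegation in the end.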
Let’s check for missing values to make sure.
colSums(is.na(stats))
## team wins losses goals total_yel_card
## 0 0 0 0 0
## total_red_card clean_sheet total_pass interception touches
## 0 0 0 0 0
## penalty_save season
## 0 0
There are no missing values, which is good. We can continue now.
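For comparison, this is roughly what the same check looks like when values are missing. The sketch uses a made-up toy data frame (hypothetical values) and shows one possible way to handle it, dropping incomplete rows with `complete.cases()`.

```r
# Toy data with two missing values (hypothetical, not from stats.csv)
toy <- data.frame(
  team  = c("Team A", "Team B", "Team C"),
  wins  = c(28, NA, 20),
  goals = c(83, 64, NA)
)

na_counts <- colSums(is.na(toy))  # missing values per column
na_counts

complete <- toy[complete.cases(toy), ]  # keep only rows without any NA
nrow(complete)
```

Dropping rows is only one option; whether it is appropriate depends on how much data would be lost.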
str(stats)
## 'data.frame': 240 obs. of 12 variables:
## $ team : chr "Manchester United" "Chelsea" "Liverpool" "Arsenal" ...
## $ wins : num 28 24 20 19 17 16 16 15 15 14 ...
## $ losses : num 5 3 10 8 12 14 15 16 10 12 ...
## $ goals : num 83 64 57 63 57 47 52 52 52 45 ...
## $ total_yel_card: num 60 62 44 59 48 84 38 77 65 48 ...
## $ total_red_card: num 1 4 0 3 3 4 3 6 2 1 ...
## $ clean_sheet : num 16 22 20 12 6 12 13 8 14 12 ...
## $ total_pass : num 18723 16759 17154 18458 14914 ...
## $ interception : num 254 292 246 214 276 235 277 282 303 220 ...
## $ touches : num 25686 24010 24150 25592 22200 ...
## $ penalty_save : num 2 1 0 0 0 2 0 5 1 2 ...
## $ season : chr "2006-2007" "2006-2007" "2006-2007" "2006-2007" ...
The data contains 2 data types, num and chr. For the analysis, we change the data type of the “team” and “season” columns. Why? Because their values repeat across rows, so they are really categories. We convert them to factor (category) to make the analysis easier.
stats$team <- as.factor(stats$team)
stats$season <- as.factor(stats$season)
str(stats)
## 'data.frame': 240 obs. of 12 variables:
## $ team : Factor w/ 39 levels "AFC Bournemouth",..: 22 12 20 2 34 7 28 5 15 26 ...
## $ wins : num 28 24 20 19 17 16 16 15 15 14 ...
## $ losses : num 5 3 10 8 12 14 15 16 10 12 ...
## $ goals : num 83 64 57 63 57 47 52 52 52 45 ...
## $ total_yel_card: num 60 62 44 59 48 84 38 77 65 48 ...
## $ total_red_card: num 1 4 0 3 3 4 3 6 2 1 ...
## $ clean_sheet : num 16 22 20 12 6 12 13 8 14 12 ...
## $ total_pass : num 18723 16759 17154 18458 14914 ...
## $ interception : num 254 292 246 214 276 235 277 282 303 220 ...
## $ touches : num 25686 24010 24150 25592 22200 ...
## $ penalty_save : num 2 1 0 0 0 2 0 5 1 2 ...
## $ season : Factor w/ 12 levels "2006-2007","2007-2008",..: 1 1 1 1 1 1 1 1 1 1 ...
Now every column has the correct data type.
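To make the conversion a little less abstract, here is a small sketch of what a factor stores: a set of distinct levels plus an integer code per row. The vector below is a toy example, not the real season column.

```r
# Toy character vector with repeating values
season <- c("2006-2007", "2006-2007", "2007-2008", "2007-2008")
season_f <- as.factor(season)

levels(season_f)      # the distinct categories
as.integer(season_f)  # the integer code stored for each row
table(season_f)       # number of rows per category
```

This is why the `str()` output above shows numbers such as `1 1 1 ...` next to the factor columns.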
It’s time to explore the data!
summary(stats)
## team wins losses goals
## Arsenal : 12 Min. : 1.00 Min. : 2.00 Min. : 20.00
## Chelsea : 12 1st Qu.:10.00 1st Qu.:10.00 1st Qu.: 40.00
## Everton : 12 Median :12.00 Median :15.00 Median : 47.00
## Liverpool : 12 Mean :14.15 Mean :14.15 Mean : 51.06
## Manchester City : 12 3rd Qu.:18.00 3rd Qu.:19.00 3rd Qu.: 61.00
## Manchester United: 12 Max. :32.00 Max. :29.00 Max. :106.00
## (Other) :168
## total_yel_card total_red_card clean_sheet total_pass
## Min. :38.00 Min. :0.000 Min. : 2.00 Min. : 9478
## 1st Qu.:54.00 1st Qu.:1.000 1st Qu.: 8.00 1st Qu.:13380
## Median :60.50 Median :3.000 Median :10.00 Median :14937
## Mean :61.08 Mean :2.862 Mean :10.95 Mean :15692
## 3rd Qu.:67.00 3rd Qu.:4.000 3rd Qu.:14.00 3rd Qu.:18250
## Max. :94.00 Max. :9.000 Max. :24.00 Max. :28241
##
## interception touches penalty_save season
## Min. :198.0 Min. :16772 Min. :0.0000 2006-2007: 20
## 1st Qu.:472.2 1st Qu.:21577 1st Qu.:0.0000 2007-2008: 20
## Median :558.5 Median :23169 Median :1.0000 2008-2009: 20
## Mean :555.2 Mean :23909 Mean :0.8375 2009-2010: 20
## 3rd Qu.:654.0 3rd Qu.:26294 3rd Qu.:1.0000 2010-2011: 20
## Max. :872.0 Max. :35130 Max. :5.0000 2011-2012: 20
## (Other) :120
length(levels(stats$season)) # Number of distinct seasons
## [1] 12
length(levels(stats$team)) # Number of distinct teams
## [1] 39
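There are a few equivalent ways to count distinct values. A quick sketch on a toy factor (hypothetical team names):

```r
team <- factor(c("Team A", "Team B", "Team A", "Team C"))

n1 <- length(levels(team))                # the approach used above
n2 <- nlevels(team)                       # same result, shorter
n3 <- length(unique(as.character(team)))  # also works on plain character data

c(n1, n2, n3)
```

Note that `levels()` also counts levels that no longer appear in the data, for example after filtering rows, while `unique()` only counts values that are actually present.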
📌 Short Summary :
Now let’s explore the data further by asking some questions about the details behind the summary above!
team_table <- as.data.frame(table(stats$team))
team_table[team_table$Freq == 12, ]
team_table[team_table$Freq < 12, ]
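The filter above hard-codes 12 as the number of seasons. A possible variation, sketched on made-up toy data, derives that number from the data instead, so the same code still works if more seasons are added. The `n_seasons` and `ever_present` names are only illustrative.

```r
# Toy data: three seasons, only Team A appears in all of them (hypothetical)
toy <- data.frame(
  team   = c("Team A", "Team B", "Team A", "Team C", "Team A", "Team B"),
  season = c("S1", "S1", "S2", "S2", "S3", "S3")
)

n_seasons <- length(unique(toy$season))  # instead of a hard-coded number

team_table <- as.data.frame(table(toy$team))  # columns: Var1 (team), Freq (count)
ever_present <- team_table[team_table$Freq == n_seasons, ]
ever_present
```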
📌 Insight :
win_agg <- aggregate(data = stats, x = wins ~ team, FUN = sum)
win_agg[order(win_agg$wins, decreasing = TRUE), ]
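To see what the aggregate-then-sort pattern does step by step, here is a minimal sketch on toy data (hypothetical teams and win counts). The formula is passed positionally here, which avoids relying on the name of that argument.

```r
# Toy data: two seasons for Team A and Team B, one for Team C (hypothetical)
toy <- data.frame(
  team = c("Team A", "Team B", "Team A", "Team B", "Team C"),
  wins = c(28, 24, 27, 25, 10)
)

# Sum wins per team, then sort from most to fewest
win_agg <- aggregate(wins ~ team, data = toy, FUN = sum)
win_sorted <- win_agg[order(win_agg$wins, decreasing = TRUE), ]

win_sorted
head(win_sorted, 1)  # the team with the most total wins
```

Keep in mind that a sum favours teams that played more seasons; swapping `sum` for `mean` gives a per-season view.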
📌 Insight :
What makes Manchester United win so many matches? Let’s look at the average goals!
mean_goal <- aggregate(data = stats, x = goals ~ team, FUN = mean)
mean_goal[order(mean_goal$goals, decreasing = TRUE), ]
📌 Insight :
loss_agg <- aggregate(data = stats, x = losses ~ team, FUN = sum)
loss_agg[order(loss_agg$losses, decreasing = TRUE), ]
📌 Insight :
Let’s see which team commits the most fouls, using total_yel_card and total_red_card.
foul_agg <- aggregate(data = stats, x = total_yel_card ~ team + total_red_card, FUN = sum)
foul_agg[order(foul_agg$total_yel_card, decreasing = TRUE), ]
📌 Insight :
foul_agg[order(foul_agg$total_red_card, decreasing = TRUE), ]
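One thing to be aware of: in an `aggregate()` formula, everything to the right of `~` is a grouping variable. So `team + total_red_card` groups by each combination of team and red-card count rather than summing the red cards. If the aim is a total of both card types per team, one possible alternative is to put both columns on the left-hand side with `cbind()`. A sketch on toy data (hypothetical values):

```r
# Toy data: two seasons each for two teams (hypothetical)
toy <- data.frame(
  team           = c("Team A", "Team B", "Team A", "Team B"),
  total_yel_card = c(60, 62, 50, 70),
  total_red_card = c(1, 4, 2, 3)
)

# Sum both card columns per team: one row per team
foul_agg <- aggregate(cbind(total_yel_card, total_red_card) ~ team,
                      data = toy, FUN = sum)

foul_agg[order(foul_agg$total_yel_card, decreasing = TRUE), ]
foul_agg[order(foul_agg$total_red_card, decreasing = TRUE), ]
```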
📌 Insight :
Note :
A correlation coefficient measures the strength and direction of the linear relationship between two variables. It ranges from -1 (perfect negative correlation) through 0 (no correlation) to 1 (perfect positive correlation).
As we can see, Manchester United has the most wins and the highest goal average. Does that mean that scoring more goals brings more wins? Let’s check this with a correlation.
cor(stats$wins, stats$goals)
## [1] 0.8992363
plot(stats$goals,stats$wins)
abline(lm(stats$wins ~ stats$goals), col="red")
We can see that the correlation of goals and wins is 0.8992363, which is close to 1. That is, goals and wins have a Positive Correlation. So if we score more goals, it can result in more wins!
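To show what `cor()` computes, here is a small sketch that calculates the Pearson correlation by hand and compares it with the built-in function. It uses only the first ten rows shown in the `str()` output above, so the value will differ from the full-data result.

```r
# First ten rows of wins and goals, taken from the str() output above
wins  <- c(28, 24, 20, 19, 17, 16, 16, 15, 15, 14)
goals <- c(83, 64, 57, 63, 57, 47, 52, 52, 52, 45)

# Pearson r: co-variation of the two variables, scaled by their spreads
dx <- goals - mean(goals)
dy <- wins - mean(wins)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

r_manual
cor(wins, goals)  # built-in Pearson correlation, same value
```

A high correlation tells us the two move together, not that one causes the other, so the conclusion is best read as "teams that score more also tend to win more".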
If we want to win more, is scoring goals all that matters? Goals are the job of the front line. The midfield and the back could also be part of the reason for a victory. Let’s examine some other columns of the data.
The columns related to the midfield and the back are total_pass, clean_sheet, interception, touches, and penalty_save.
cor(stats$wins, stats$total_pass)
## [1] 0.7017069
cor(stats$wins, stats$clean_sheet)
## [1] 0.7680525
cor(stats$wins, stats$interception)
## [1] -0.04731535
cor(stats$wins, stats$touches)
## [1] 0.7046452
cor(stats$wins, stats$penalty_save)
## [1] -0.03236519
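The five calls above can also be written as one loop. Here is a minimal sketch using `sapply()` over the column names. It runs on only the first five rows shown in the `str()` output, so the numbers will not match the full-data results; the `first5` and `cors` names are just for illustration.

```r
# First five rows of the relevant columns, taken from the str() output above
first5 <- data.frame(
  wins         = c(28, 24, 20, 19, 17),
  total_pass   = c(18723, 16759, 17154, 18458, 14914),
  clean_sheet  = c(16, 22, 20, 12, 6),
  interception = c(254, 292, 246, 214, 276),
  touches      = c(25686, 24010, 24150, 25592, 22200),
  penalty_save = c(2, 1, 0, 0, 0)
)

cols <- c("total_pass", "clean_sheet", "interception", "touches", "penalty_save")

# Correlation of wins with each column, returned as a named vector
cors <- sapply(first5[cols], function(col) cor(first5$wins, col))

sort(cors, decreasing = TRUE)
```

On the full data, replacing `first5` with `stats` should reproduce the five values printed above in a single sorted vector.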
📌 Insight :
total_pass, clean_sheet, and touches have fairly strong positive correlations with wins (about 0.70 to 0.77). We can say that these three are associated with winning matches.
interception and penalty_save have slightly negative correlations, far from -1 and very close to 0. We can say that these two show almost no relationship with winning matches.
But what about fouls? How do yellow cards correlate with wins?
cor(stats$wins, stats$total_yel_card)
## [1] -0.2112763
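A correlation of about -0.21 is weak. One way to check whether such a value is distinguishable from zero is `cor.test()`, which reports a p-value alongside the estimate. The sketch below uses only the first ten rows shown in the `str()` output above, so its result says nothing about the full dataset.

```r
# First ten rows of wins and yellow cards, taken from the str() output above
wins     <- c(28, 24, 20, 19, 17, 16, 16, 15, 15, 14)
yel_card <- c(60, 62, 44, 59, 48, 84, 38, 77, 65, 48)

ct <- cor.test(wins, yel_card)  # Pearson correlation test by default

ct$estimate  # the correlation coefficient, same as cor(wins, yel_card)
ct$p.value   # small values suggest the correlation is unlikely to be zero
```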
Our main goal is to find out how a club can give itself more chances to win and keep playing in the Premier League. Many teams fight to keep their place there: the clubs at the bottom of the table are relegated to the Championship, the second tier of English football, while a top-four finish earns qualification for the most prestigious competition in European football, the Champions League. So, based on the analysis above, we can conclude that :
If a team wants to win more, it has to :