Introduction

The Premier League is the organising body of the Premier League with responsibility for the competition, its Rule Book and the centralised broadcast and other commercial rights.

Each individual club is independent, working within the rules of football, as defined by the Premier League, The FA, UEFA and FIFA, as well as being subject to English and European law.

Each of the 20 clubs is a Shareholder in the Premier League. Consultation is at the heart of the Premier League. Shareholder meetings are the ultimate decision-making forum for Premier League policy and are held at regular intervals during the course of the season.

The Premier League AGM takes place at the close of each season, at which time the relegated clubs transfer their shares to the clubs promoted into the Premier League from the Football League Championship.

Note: The Premier League is the top division of the English football league system.

The source provides two datasets, stats.csv and results.csv. Here we use stats.csv and base our analysis on it.

Our main goal is to find out how a club can give itself the best chance of winning and of staying in the Premier League.

Data Preprocessing

In this step we prepare the data for analysis.

Import Data

The first step is to import the data into our notebook.

df <- read.csv("stats.csv")
df

As we can see, the data has 240 rows and 42 columns. We will use it for the exploration and analysis below.
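A quick way to confirm the shape of a data frame without printing all of it is dim(), together with names() and head(). The sketch below uses a small made-up data frame (toy_df and its values are hypothetical) so that it runs without stats.csv; on the real data, dim(df) should report the 240 rows and 42 columns described above.

```r
# Hypothetical toy data standing in for stats.csv
toy_df <- data.frame(
  team   = c("Team A", "Team B", "Team C"),
  wins   = c(20, 12, 7),
  losses = c(8, 14, 22),
  season = c("2006-2007", "2006-2007", "2006-2007"),
  stringsAsFactors = FALSE
)

shape <- dim(toy_df)  # c(number of rows, number of columns)
shape
names(toy_df)         # column names, handy when choosing a subset
head(toy_df, 2)       # first rows only, instead of printing everything
```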

To reach our goal, we don’t need every column, only the ones relevant to the analysis. Let’s take the following:

  1. team = name of the team
  2. wins = number of matches the team won in the season
  3. losses = number of matches the team lost in the season
  4. goals = number of goals the team scored in the season
  5. total_yel_card = total number of yellow cards given to the team
  6. total_red_card = total number of red cards given to the team
  7. clean_sheet = number of matches in which the team conceded no goals
  8. total_pass = total number of passes made by the team’s players
  9. interception = number of times the team intercepted the ball from the opposition
  10. touches = total number of times the team’s players touched the ball
  11. penalty_save = number of penalties saved by the goalkeeper
  12. season = the season (for example 2006-2007) that the row refers to

Get the columns we need

stats <- df[,c("team","wins","losses","goals","total_yel_card","total_red_card","clean_sheet","total_pass","interception","touches","penalty_save","season")]
stats

Data Cleansing

str(stats)
## 'data.frame':    240 obs. of  12 variables:
##  $ team          : chr  "Manchester United" "Chelsea" "Liverpool" "Arsenal" ...
##  $ wins          : num  28 24 20 19 17 16 16 15 15 14 ...
##  $ losses        : num  5 3 10 8 12 14 15 16 10 12 ...
##  $ goals         : num  83 64 57 63 57 47 52 52 52 45 ...
##  $ total_yel_card: num  60 62 44 59 48 84 38 77 65 48 ...
##  $ total_red_card: num  1 4 0 3 3 4 3 6 2 1 ...
##  $ clean_sheet   : num  16 22 20 12 6 12 13 8 14 12 ...
##  $ total_pass    : num  18723 16759 17154 18458 14914 ...
##  $ interception  : num  254 292 246 214 276 235 277 282 303 220 ...
##  $ touches       : num  25686 24010 24150 25592 22200 ...
##  $ penalty_save  : num  2 1 0 0 0 2 0 5 1 2 ...
##  $ season        : chr  "2006-2007" "2006-2007" "2006-2007" "2006-2007" ...

Let’s check for missing values to make sure.

Check Missing Value

colSums(is.na(stats))
##           team           wins         losses          goals total_yel_card 
##              0              0              0              0              0 
## total_red_card    clean_sheet     total_pass   interception        touches 
##              0              0              0              0              0 
##   penalty_save         season 
##              0              0

There are no missing values, which is good. We can continue.
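For comparison, here is a minimal sketch of what the same check reports when values are missing, and one possible way to keep only complete rows. The data frame toy_stats and its values are made up for illustration; they are not taken from stats.csv.

```r
# Hypothetical toy data with two missing values
toy_stats <- data.frame(
  team  = c("Team A", "Team B", "Team C"),
  wins  = c(20, NA, 7),
  goals = c(60, 41, NA)
)

na_per_column <- colSums(is.na(toy_stats))  # missing values per column
any_missing   <- any(is.na(toy_stats))      # TRUE if anything is missing

# Keep only the rows that have no missing value in any column
complete_rows <- toy_stats[complete.cases(toy_stats), ]

na_per_column
complete_rows
```

Whether dropping incomplete rows is appropriate depends on the analysis; it is shown here only as one option.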

Check Data Type

str(stats)
## 'data.frame':    240 obs. of  12 variables:
##  $ team          : chr  "Manchester United" "Chelsea" "Liverpool" "Arsenal" ...
##  $ wins          : num  28 24 20 19 17 16 16 15 15 14 ...
##  $ losses        : num  5 3 10 8 12 14 15 16 10 12 ...
##  $ goals         : num  83 64 57 63 57 47 52 52 52 45 ...
##  $ total_yel_card: num  60 62 44 59 48 84 38 77 65 48 ...
##  $ total_red_card: num  1 4 0 3 3 4 3 6 2 1 ...
##  $ clean_sheet   : num  16 22 20 12 6 12 13 8 14 12 ...
##  $ total_pass    : num  18723 16759 17154 18458 14914 ...
##  $ interception  : num  254 292 246 214 276 235 277 282 303 220 ...
##  $ touches       : num  25686 24010 24150 25592 22200 ...
##  $ penalty_save  : num  2 1 0 0 0 2 0 5 1 2 ...
##  $ season        : chr  "2006-2007" "2006-2007" "2006-2007" "2006-2007" ...

The data contains two types, num and chr. The “team” and “season” columns hold a limited set of repeating values, so we convert them to factors (categories) to make grouping and summarising easier.

stats$team <- as.factor(stats$team)
stats$season <- as.factor(stats$season)
str(stats)
## 'data.frame':    240 obs. of  12 variables:
##  $ team          : Factor w/ 39 levels "AFC Bournemouth",..: 22 12 20 2 34 7 28 5 15 26 ...
##  $ wins          : num  28 24 20 19 17 16 16 15 15 14 ...
##  $ losses        : num  5 3 10 8 12 14 15 16 10 12 ...
##  $ goals         : num  83 64 57 63 57 47 52 52 52 45 ...
##  $ total_yel_card: num  60 62 44 59 48 84 38 77 65 48 ...
##  $ total_red_card: num  1 4 0 3 3 4 3 6 2 1 ...
##  $ clean_sheet   : num  16 22 20 12 6 12 13 8 14 12 ...
##  $ total_pass    : num  18723 16759 17154 18458 14914 ...
##  $ interception  : num  254 292 246 214 276 235 277 282 303 220 ...
##  $ touches       : num  25686 24010 24150 25592 22200 ...
##  $ penalty_save  : num  2 1 0 0 0 2 0 5 1 2 ...
##  $ season        : Factor w/ 12 levels "2006-2007","2007-2008",..: 1 1 1 1 1 1 1 1 1 1 ...

Now every column has the correct data type.
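To see what the conversion to a factor actually does, here is a small sketch on a made-up character vector (the seasons vector is hypothetical). A factor stores each distinct value once as a level and keeps an integer code per observation, which is why str() prints codes after the level names.

```r
# Hypothetical repeating values, similar in shape to the season column
seasons <- c("2006-2007", "2006-2007", "2007-2008", "2007-2008", "2007-2008")

season_f <- factor(seasons)

levels(season_f)                  # the distinct categories, sorted
n_levels <- nlevels(season_f)     # same as length(levels(season_f))
codes    <- as.integer(season_f)  # integer code of each observation
table(season_f)                   # observations per category
```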

EDA (Exploratory Data Analysis)

It’s time to explore the data!

summary(stats)
##                 team          wins           losses          goals       
##  Arsenal          : 12   Min.   : 1.00   Min.   : 2.00   Min.   : 20.00  
##  Chelsea          : 12   1st Qu.:10.00   1st Qu.:10.00   1st Qu.: 40.00  
##  Everton          : 12   Median :12.00   Median :15.00   Median : 47.00  
##  Liverpool        : 12   Mean   :14.15   Mean   :14.15   Mean   : 51.06  
##  Manchester City  : 12   3rd Qu.:18.00   3rd Qu.:19.00   3rd Qu.: 61.00  
##  Manchester United: 12   Max.   :32.00   Max.   :29.00   Max.   :106.00  
##  (Other)          :168                                                   
##  total_yel_card  total_red_card   clean_sheet      total_pass   
##  Min.   :38.00   Min.   :0.000   Min.   : 2.00   Min.   : 9478  
##  1st Qu.:54.00   1st Qu.:1.000   1st Qu.: 8.00   1st Qu.:13380  
##  Median :60.50   Median :3.000   Median :10.00   Median :14937  
##  Mean   :61.08   Mean   :2.862   Mean   :10.95   Mean   :15692  
##  3rd Qu.:67.00   3rd Qu.:4.000   3rd Qu.:14.00   3rd Qu.:18250  
##  Max.   :94.00   Max.   :9.000   Max.   :24.00   Max.   :28241  
##                                                                 
##   interception      touches       penalty_save          season   
##  Min.   :198.0   Min.   :16772   Min.   :0.0000   2006-2007: 20  
##  1st Qu.:472.2   1st Qu.:21577   1st Qu.:0.0000   2007-2008: 20  
##  Median :558.5   Median :23169   Median :1.0000   2008-2009: 20  
##  Mean   :555.2   Mean   :23909   Mean   :0.8375   2009-2010: 20  
##  3rd Qu.:654.0   3rd Qu.:26294   3rd Qu.:1.0000   2010-2011: 20  
##  Max.   :872.0   Max.   :35130   Max.   :5.0000   2011-2012: 20  
##                                                   (Other)  :120
length(levels(stats$season)) # Number of distinct seasons
## [1] 12
length(levels(stats$team)) # Number of distinct teams
## [1] 39

📌 Short Summary :

  • 2006-2007 to 2017-2018 covers 12 seasons.
  • 39 different teams played in the Premier League over those 12 seasons!
  • The most wins by a team in a single season is 32.
  • The most losses by a team in a single season is 29.
  • The most yellow cards given to a team in a single season is 94.

Case Questions

Now let’s explore the data further by asking some questions that dig into the details behind the summary above!

1. How many times has each team played in the Premier League over the 12 seasons?

team_table <- as.data.frame(table(stats$team))
team_table[team_table$Freq== 12,]
team_table[team_table$Freq< 12,]

📌 Insight :

  • Teams with Freq = 12 played in the Premier League in every one of the 12 seasons; there are 7 such teams.
  • Teams with Freq < 12 were relegated (dropped to the second tier) or promoted to the Premier League at some point; there are 32 such teams.
    • Example (promotion): Blackpool played in the Premier League only once, which means they were promoted from the second tier and spent a single season in the Premier League.
    • Example (relegation): West Ham United played in the Premier League 11 times, which means they spent one of the 12 seasons outside it after being relegated.
  • In total, 39 teams played in the Premier League over the 12 seasons.
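The counting logic behind these insights can be sketched on a small made-up table (the teams, seasons, and column names below are hypothetical). Comparing each team’s number of appearances with the number of seasons in the data avoids hard-coding the value 12.

```r
# Hypothetical appearances: one row per team per season
toy <- data.frame(
  team   = c("A", "B", "A", "C", "A", "B"),
  season = c("s1", "s1", "s2", "s2", "s3", "s3")
)

n_seasons <- length(unique(toy$season))

appearances <- as.data.frame(table(toy$team), stringsAsFactors = FALSE)
names(appearances) <- c("team", "seasons_played")

# Teams present in every season vs. teams that missed at least one
ever_present <- appearances$team[appearances$seasons_played == n_seasons]
not_always   <- appearances$team[appearances$seasons_played <  n_seasons]

ever_present
not_always
```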

2. Which team has the most wins over 12 seasons?

win_agg <- aggregate(wins ~ team, data = stats, FUN = sum)  # total wins per team
win_agg[order(win_agg$wins, decreasing = TRUE),]

📌 Insight :

  • Manchester United look like the strongest club: they have 290 wins over the 12 seasons!
  • Derby County is the team with the fewest wins.

3. What is Manchester United’s average number of goals per season over 12 seasons?

What makes Manchester United win so often? Let’s look at their average goals per season!

mean_goal <- aggregate(goals ~ team, data = stats, FUN = mean)  # mean goals per season, per team
mean_goal[order(mean_goal$goals, decreasing = TRUE),]

📌 Insight :

  • Manchester United has the highest average goals per season, at 72.25.
  • Something interesting here:
    • Manchester City averages slightly more goals per season than Chelsea; the difference is only 0.58334.
    • But Chelsea has 20 more wins than Manchester City.
  • Derby County has the lowest average goals per season, at 20, which is 52.25 fewer than Manchester United.
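Comparisons like the one between Manchester City and Chelsea are easier to read when total wins and average goals sit in the same table. The following is a sketch of one way to do that with merge(), using made-up numbers (the toy data frame and the name by_team are hypothetical).

```r
# Hypothetical per-season rows for three teams
toy <- data.frame(
  team  = c("A", "A", "B", "B", "C"),
  wins  = c(28, 27, 24, 25, 1),
  goals = c(83, 80, 64, 70, 20)
)

total_wins <- aggregate(wins ~ team, data = toy, FUN = sum)
mean_goals <- aggregate(goals ~ team, data = toy, FUN = mean)

# One row per team with both summaries, sorted by total wins
by_team <- merge(total_wins, mean_goals, by = "team")
by_team <- by_team[order(by_team$wins, decreasing = TRUE), ]
by_team
```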

4. Which team has the most losses over 12 seasons?

loss_agg <- aggregate(losses ~ team, data = stats, FUN = sum)  # total losses per team
loss_agg[order(loss_agg$losses, decreasing = TRUE),]

📌 Insight :

  • Sunderland and West Ham United are the teams that have lost the most matches.

5. Which team has committed the most fouls over 12 seasons?

Let’s see which team fouls the most. We use total_yel_card and total_red_card as a proxy for fouls.

# Note: total_red_card on the right-hand side is treated as a grouping variable
foul_agg <- aggregate(total_yel_card ~ team + total_red_card, data = stats, FUN = sum)
foul_agg[order(foul_agg$total_yel_card, decreasing = TRUE),]

📌 Insight :

  • Everton is the team with the most yellow cards over 12 seasons.
  • There is something interesting here! Manchester United, the team with the most goals, is also second for yellow cards over 12 seasons.

foul_agg[order(foul_agg$total_red_card, decreasing = TRUE),]

📌 Insight :

  • Queens Park Rangers and Sunderland are the teams with the most red cards.
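One detail of the aggregate() call in this question is worth spelling out: every variable on the right-hand side of the formula is a grouping variable, so total_yel_card ~ team + total_red_card produces one row per combination of team and red-card count, not one row per team. If per-team totals of both card types are wanted, a possible alternative is cbind() on the left-hand side. The sketch below shows the difference on made-up numbers (the toy data frame is hypothetical, and this is not the call used for the insights above).

```r
# Hypothetical per-season card counts for two teams
toy <- data.frame(
  team           = c("A", "A", "B", "B"),
  total_yel_card = c(60, 70, 50, 55),
  total_red_card = c(1, 4, 3, 3)
)

# Grouped by team AND red-card count: (A,1), (A,4), (B,3)
by_team_and_red <- aggregate(total_yel_card ~ team + total_red_card,
                             data = toy, FUN = sum)

# Grouped by team only, summing both card columns
card_totals <- aggregate(cbind(total_yel_card, total_red_card) ~ team,
                         data = toy, FUN = sum)

by_team_and_red
card_totals
```

With the second form, sorting by total_red_card ranks teams by their summed red cards rather than by their highest single-season count.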

Descriptive Analysis (Correlation)

Note :

The correlation coefficient measures the strength and direction of the linear relationship between two variables. It ranges from -1 (perfect negative correlation) through 0 (no correlation) to 1 (perfect positive correlation).

As we have seen, Manchester United has the most wins and the highest average goals. Does that mean that scoring more goals brings more wins? We can check this with the correlation.

cor(stats$wins, stats$goals)
## [1] 0.8992363
plot(stats$goals,stats$wins)
abline(lm(stats$wins ~ stats$goals), col="red")

The correlation between goals and wins is 0.8992363, which is close to 1. That is, goals and wins have a strong positive correlation: teams that score more goals tend to win more matches!
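For readers who want to see what cor() computes, here is a sketch of the Pearson coefficient written out by hand and compared with the built-in function. It uses only the first ten wins and goals values visible in the str() output above, so its result will differ from the full-data value of 0.8992363.

```r
# First ten values of wins and goals as shown in the str() output
wins  <- c(28, 24, 20, 19, 17, 16, 16, 15, 15, 14)
goals <- c(83, 64, 57, 63, 57, 47, 52, 52, 52, 45)

# Pearson r: covariation of the two variables divided by
# the product of their individual variations
dev_goals <- goals - mean(goals)
dev_wins  <- wins - mean(wins)
r_manual  <- sum(dev_goals * dev_wins) /
  sqrt(sum(dev_goals^2) * sum(dev_wins^2))

r_builtin <- cor(goals, wins)  # default method is "pearson"

all.equal(r_manual, r_builtin)
```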

If we want to win more, is scoring goals the only thing that matters? Goals are mostly the work of the front line. The midfield and the back line could also contribute to victories, so let’s examine some other columns.

The columns related to the midfield and the back line are total_pass, clean_sheet, interception, touches, and penalty_save.

cor(stats$wins, stats$total_pass)
## [1] 0.7017069
cor(stats$wins, stats$clean_sheet)
## [1] 0.7680525
cor(stats$wins, stats$interception)
## [1] -0.04731535
cor(stats$wins, stats$touches)
## [1] 0.7046452
cor(stats$wins, stats$penalty_save)
## [1] -0.03236519

📌 Insight :

  • total_pass, clean_sheet, and touches have a fairly strong positive correlation with wins (about 0.70 to 0.77). Teams that pass more, touch the ball more, and keep more clean sheets tend to win more.
  • interception and penalty_save have slightly negative correlations that are very close to 0. We can say that these two show practically no linear relationship with wins.
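The five separate cor() calls can also be written as a single step that loops over the columns of interest. The sketch below does this with sapply() on a small made-up data frame (the values and the name toy are hypothetical); on the real data the same pattern could be applied to stats.

```r
# Hypothetical per-season values for six teams
toy <- data.frame(
  wins         = c(28, 24, 20, 15, 10, 6),
  total_pass   = c(18700, 16800, 17200, 14900, 13000, 11500),
  clean_sheet  = c(16, 22, 20, 12, 8, 5),
  interception = c(254, 292, 246, 303, 220, 280)
)

predictors <- setdiff(names(toy), "wins")

# Correlation of wins with each remaining column, as a named vector
cors <- sapply(toy[predictors], function(column) cor(toy$wins, column))

sort(cors, decreasing = TRUE)
```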

But what about fouls? How do yellow cards correlate with wins?

cor(stats$wins, stats$total_yel_card)
## [1] -0.2112763
  • Yellow cards have a weak negative correlation with wins.
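The same check can be run against losses instead of wins, and cor.test() additionally reports a p-value for the coefficient. The sketch below uses only the first eight losses and total_yel_card values visible in the str() output above, so it illustrates the calls and says nothing reliable about the full dataset.

```r
# First eight values of losses and total_yel_card from the str() output
losses <- c(5, 3, 10, 8, 12, 14, 15, 16)
yellow <- c(60, 62, 44, 59, 48, 84, 38, 77)

r_losses <- cor(losses, yellow)

# cor.test() returns the estimate together with a significance test
test_result <- cor.test(losses, yellow)
test_result$estimate  # same value as r_losses
test_result$p.value   # probability of a coefficient this extreme under no correlation
```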

Conclusion

Our main goal is to find out how a club can give itself the best chance of winning and of staying in the Premier League. Many teams fight to keep their place: they compete with each other to stay up, and the bottom three in the standings are relegated to the second tier. Teams also need to finish in the top four to qualify for the most prestigious competition in European football, the Champions League. Based on the analysis above, we can conclude the following:

If a team wants to win more, it should:

  • Front Line
    • Score goals; this has the strongest association with wins.
  • Midfield and Back Line
    • Make more passes.
    • Keep clean sheets, with the goalkeeper and defence stopping every attempt on goal.
    • Keep possession, with players touching the ball often and passing it to team-mates.
  • Penalty saves and interceptions show almost no relationship with wins, so they don’t guarantee a win.
  • Yellow cards have only a weak negative relationship with wins, so the number of fouls alone does not make a team lose.