1 INTERNATIONAL FOOTBALL RESULTS FROM 1872 TO 2019

Hello, back again with me Imad. Now, I want to talk about result of football matches for national teams around the world (Disclaimer : this is official mathces supervised by FIFA and this is only male football national team).

In this project i want to know who is the best national footbal team based on their result from 1872 to 2019, so i will do data wrangling and analyzing wining ratio, losing ratio, home wining and lose, and away wining and lose, etc.

Enjoy your coffee…

First of all we look at our working directory :

getwd()
## [1] "D:/IMAD/Data Science/01 Algoritma/Day 9/LBB by Imaduddin 2/lbb2_by_imaduddin/lbb2_by_imaduddin"

from the above we see that …/lbb2_by_imaduddin and it is same with

D:\IMAD\Data Science\01 Algoritma\Day 9\LBB by Imaduddin 2\lbb2_by_imaduddin\lbb2_by_imaduddin

so we can read the data and take a look

1.1 A glimpse of data

results <- read.csv("results.csv")

str(results)
## 'data.frame':    40773 obs. of  9 variables:
##  $ date      : Factor w/ 15106 levels "1872-11-30","1873-03-08",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ home_team : Factor w/ 309 levels "Ã…land Islands",..: 242 87 242 87 242 242 87 297 242 242 ...
##  $ away_team : Factor w/ 306 levels "Ã…land Islands",..: 87 237 87 237 87 293 237 237 87 293 ...
##  $ home_score: int  0 4 2 2 3 4 1 0 7 9 ...
##  $ away_score: int  0 2 1 2 0 0 3 2 2 0 ...
##  $ tournament: Factor w/ 107 levels "ABCS Tournament",..: 55 55 55 55 55 55 55 55 55 55 ...
##  $ city      : Factor w/ 1959 levels "6th of October City",..: 663 1011 663 1011 663 663 1011 1914 663 663 ...
##  $ country   : Factor w/ 265 levels "Éire","Afghanistan",..: 206 71 206 71 206 206 71 256 206 206 ...
##  $ neutral   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

The data contains 40773 observation and 9 column, which is : 1. date = Date the match took place on 2. home_team = Name of the home team. 3. away_team = Name of the away team. 4. home_score = Home team full time score. 5. away_score = Away team full time score. 6. tournament = Name of the tournament. 7. city = City where the match took place. 8. country = Country where the match took place. 9. neutral = TRUE if the match took place outside the home team country, FALSE otherwise.

1.2 Number of Football Matches

1.2.1 Total Football Matches

Firstly, we want to know how many football matches every year (1872 - 2018), so we need to change the column date into date class

check the date

head(results$date)
## [1] 1872-11-30 1873-03-08 1874-03-07 1875-03-06 1876-03-04 1876-03-25
## 15106 Levels: 1872-11-30 1873-03-08 1874-03-07 1875-03-06 ... 2019-07-19

The structure date is %Y-%m-%d, so we use lubridate to change the date class

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
results$date <- ymd(results$date)

# check the date structure and class

head(results$date)
## [1] "1872-11-30" "1873-03-08" "1874-03-07" "1875-03-06" "1876-03-04"
## [6] "1876-03-25"
class(results$date)
## [1] "Date"

after that we add one column for year that match played -> results$year

results$year <- year(results$date)

we will create new data frame to look matches year by year -> yearmatches

yearmatches <- data.frame(table("year" = results$year))

colnames(yearmatches) <- c("year","match")

we just need to see total match until 2018, so we delete total match in 2019

yearmatches <- yearmatches[-c(148),]

tail(yearmatches)
##     year match
## 142 2013   966
## 143 2014   859
## 144 2015  1003
## 145 2016   926
## 146 2017   958
## 147 2018   854
yearmatches[147,2]/yearmatches[127,2]
## [1] 1.223496
yearmatches$year <- as.numeric(as.character(yearmatches$year))
plot(yearmatches[yearmatches$year >= 1918 & yearmatches$year <= 2018, ] , yearmatches$match, main = "Total Footbal Matches in The Past Century", ylab = "number of match", type = "h")
abline(lm(yearmatches$match~yearmatches$year), lty = 5, lwd = 3, col = "red")

result : if we see in the past 20 years the number of football matchs increase to 122% and also if we shows the graphic number of footbal matches in the past century, it shows positive trend. The result is number of football match around the world will increase steadily.

1.2.2 Number of matches each team

Knowing total number of football matches isn’t enough, because if we want to know the top performer team and the weakest one we need to know the number of football match each national team, so i will write some code to know that

results[,c("home_team","away_team")] <- lapply(results[,c("home_team","away_team")], as.character)

change class of home_team and away_team into character using lappy() then create new data frame -> nation.match

nation.match <- cbind.data.frame("team" = c(results$home_team, results$away_team))

nation.match <- data.frame(table("team" = nation.match$team))

nation.match <- nation.match[order(nation.match$Freq, decreasing = T), ]

nation.match$team <- as.character(nation.match$team)

colnames(nation.match) <- c("team","number.match")

str(nation.match)
## 'data.frame':    313 obs. of  2 variables:
##  $ team        : chr  "Sweden" "England" "Brazil" "Argentina" ...
##  $ number.match: int  1010 994 977 975 942 922 878 862 853 834 ...
theme_algoritma <- readRDS("theme_algoritma.rds")
library(ggplot2)
ggplot(nation.match[nation.match$number.match > 810, ], aes(reorder(team, number.match), number.match))+
  geom_col(color = "blue")+
  labs(y = "Number of match", x = "National Team", title = "Top 10 of National Football Team Matches from 1872 to 2019", caption = "Source : International Football Results From 1872 To 2019" )+
  coord_flip()+
  theme_algoritma

Take a look at chart above, Sweden Team has the most number of football matches in the world. So Sweden Team is the strongest team in the world? we don’t know the answer yet, until we find which team has the most winning all this time

1.3 The Winer, The loser, and even

results is also giving us information all match scores, so we need to know which team that win or lose.

we will add 2 new columns in data frame results -> winer and loser.

To know which team that win or lose, we have to giving terms, which is :

for column winer

  1. home_score - away_score > 0 the winner is home_team
  2. home_score - away_score < 0 the winner is away_team
  3. home_score - away_score = 0 its draw

and

for column loser

  1. home_score - away_score > 0 the loser is away_team
  2. home_score - away_score < 0 the loser is home_team
  3. home_score - away_score = 0 its draw

so take a look at this, then

results$winer <- ifelse(results$home_score - results$away_score > 0, results$home_team,ifelse(results$home_score - results$away_score < 0, results$away_team, "draw"))

results$loser <- ifelse(results$home_score - results$away_score < 0, results$home_team,ifelse(results$home_score - results$away_score > 0, results$away_team, "draw"))

head(results,3)
##         date home_team away_team home_score away_score tournament    city
## 1 1872-11-30  Scotland   England          0          0   Friendly Glasgow
## 2 1873-03-08   England  Scotland          4          2   Friendly  London
## 3 1874-03-07  Scotland   England          2          1   Friendly Glasgow
##    country neutral year    winer    loser
## 1 Scotland   FALSE 1872     draw     draw
## 2  England   FALSE 1873  England Scotland
## 3 Scotland   FALSE 1874 Scotland  England

1.3.1 Winer Team

Now we add new data frame for wining team -> winer.team

And order data frame from the highest to the lowest

winer.team <- data.frame(table("team" = results$winer))

winer.team$team <- as.character(winer.team$team)

winer.team <-winer.team[winer.team$team != "draw",]

winer.team <- winer.team[order(winer.team$Freq, decreasing = T),]

colnames(winer.team) <- c("team","match.win")

head(winer.team)
##            team match.win
## 38       Brazil       625
## 88      England       563
## 107     Germany       551
## 14    Argentina       524
## 258      Sweden       494
## 250 South Korea       449

1.3.2 Loser Team

Also add new data frame for loser team -> loser.team

loser.team <- data.frame(table("team" = results$loser))

loser.team$team <- as.character(loser.team$team)

loser.team <-loser.team[loser.team$team != "draw",]

loser.team <- loser.team[order(loser.team$Freq, decreasing = T),]

colnames(loser.team) <- c("team","match.lose")

head(loser.team)
##                 team match.lose
## 98           Finland        397
## 267      Switzerland        343
## 201           Norway        328
## 199 Northern Ireland        327
## 160       Luxembourg        311
## 58             Chile        309

1.3.3 Draw

Not only the winer and the loser but also to give us all information from the data, we will find the number of draw match of each team, we create new data frame -> draw.results

draw.results <- results[results$winer == "draw",]

str(draw.results)
## 'data.frame':    9423 obs. of  12 variables:
##  $ date      : Date, format: "1872-11-30" "1875-03-06" ...
##  $ home_team : chr  "Scotland" "England" "Northern Ireland" "England" ...
##  $ away_team : chr  "England" "Scotland" "Wales" "Wales" ...
##  $ home_score: int  0 2 1 1 1 1 0 1 1 2 ...
##  $ away_score: int  0 2 1 1 1 1 0 1 1 2 ...
##  $ tournament: Factor w/ 107 levels "ABCS Tournament",..: 55 55 55 17 17 17 17 17 17 17 ...
##  $ city      : Factor w/ 1959 levels "6th of October City",..: 663 1011 228 263 1011 663 1914 663 189 228 ...
##  $ country   : Factor w/ 265 levels "Éire","Afghanistan",..: 206 71 192 71 71 206 256 206 256 192 ...
##  $ neutral   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ year      : num  1872 1875 1883 1885 1885 ...
##  $ winer     : chr  "draw" "draw" "draw" "draw" ...
##  $ loser     : chr  "draw" "draw" "draw" "draw" ...
draw.team <- cbind.data.frame("team" = c(draw.results$home_team, draw.results$away_team))

draw.team <- data.frame(table("team" = draw.team$team))

draw.team <- draw.team[order(draw.team$Freq, decreasing = T), ]

draw.team$team <- as.character(draw.team$team)

colnames(draw.team) <- c("team","match.draw")

head(draw.team)
##            team match.draw
## 81      England        241
## 13    Argentina        240
## 128       Italy        224
## 245      Sweden        223
## 238 South Korea        219
## 269     Uruguay        213

1.3.4 Data Merging

Now we have 4 data :

  1. winer.team
  2. loser.team
  3. draw.team
  4. nation.match

we will merge all the data

combine.data <- merge(loser.team[, c("team", "match.lose")], 
                  winer.team[, c("team", "match.win")])
combine.data <- merge(combine.data[,c("team","match.lose","match.win")],
                  draw.team[,c("team","match.draw")])
combine.data <- merge(combine.data[,c("team","match.lose","match.win", "match.draw")],
                      nation.match[, c("team","number.match")])

tail(combine.data)
##           team match.lose match.win match.draw number.match
## 278  Ynys Môn         18        32         11           61
## 279  Yorkshire          1         5          1            7
## 280 Yugoslavia        159       223         98          480
## 281     Zambia        197       324        186          707
## 282   Zanzibar        123        41         41          205
## 283   Zimbabwe        141       184        122          447

1.4 Analyzing Data 1

To analyze data we put some data like winning ratio, losing ratio, and draw ratio. From those data, we will know who is the best National Footbal Team and who is the worst.

combine.data$wining.ratio <- combine.data$match.win/combine.data$number.match
combine.data$losing.ratio <- combine.data$match.lose/combine.data$number.match
combine.data$drawing.ratio <- combine.data$match.draw/combine.data$number.match

1.4.1 The Best and The Worst team

We got combine.data that have many columns that will give us some information. I will subset the data to know which team has many number of wining and number of losing. The rule is : 1. > 700 number of match 2. > 400 wining match for the most wining team 3. > 270 losing match for the most losing team

combine.data.wining <- combine.data[combine.data$match.win > 400 & combine.data$number.match >= 700,]
combine.data.losing <- combine.data[combine.data$match.lose > 270 & combine.data$number.match >= 700, ]
ggplot(combine.data.wining, aes(reorder(team, wining.ratio/losing.ratio),wining.ratio/losing.ratio))+
  geom_boxplot()+
  geom_point(aes (size = number.match), col ="blue4") +
  labs(title = "Top 10 Best Performance National Football Team", caption = "Source : Internationa Football Results From 1872 - 2019", x ="Team", y = "Wining Ratio") +
  theme(legend.position = "bottom", 
        axis.text.y = element_text(size = 8, face = "bold", colour = "white"))+
  coord_flip()+
  theme_algoritma

ggplot(combine.data.losing, aes(reorder(team, losing.ratio/wining.ratio),losing.ratio/wining.ratio))+
  geom_boxplot()+
  geom_point(aes (size = number.match), col ="red4") +
  labs(title = "Top 10 Worst Performance National Football Team", caption = "Source : Internationa Football Results From 1872 - 2019", x ="Team", y = "Losing Ratio") +
  theme(legend.position = "bottom", 
        axis.text.y = element_text(size = 8, face = "bold", colour = "white"))+
  coord_flip()+
  theme_algoritma

results : from the graphic above we see that the top performer team is Brazil, and England and Germany is in 2nd and 3rd place respectively. On the other hand Finland is the worst performer, followed by Switzerland and Norway team.

1.5 Analyzing Data 2

In this section, we will analyze the efect of home or away match, so we will omit the result draw and take out neutral game. create new data frame results2

results2 <- results[results$winer != "draw" & results$neutral == F,]

1.5.1 Home and Away Analyzing

we add some columns to results2.The rule is :

  1. home.wining column is TRUE if home_team is the same as winer, otherwise is FALSE
  2. home.lose column is TRUE if home_team is not the same as winer, otherwise is FALSE
  3. away.wining column is TRUE if away_team is the same as winer, otherwise is FALSE
  4. away.lose column is TRUE if away_team is not the same as winer, otherwise is FALSE

the code is :

results2$home.wining <- ifelse(results2$home_team == results2$winer,T,F)
results2$home.lose <- ifelse(results2$home_team != results2$winer,T,F)
results2$away.wining <- ifelse(results2$away_team == results2$winer,T,F)
results2$away.lose <- ifelse(results2$away_team != results2$winer,T,F)
home.or.away <- data.frame(prop.table(table(results2$home.wining)))

colnames(home.or.away) <- c("home_or_away", "Percentage")

home.or.away$home_or_away <- as.character(home.or.away$home_or_away)

results : form the information above, we see that home matches are very influential on winning results.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## v tibble  2.1.3     v purrr   0.3.2
## v tidyr   0.8.3     v dplyr   0.8.3
## v readr   1.3.1     v stringr 1.4.0
## v tibble  2.1.3     v forcats 0.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date()        masks base::date()
## x dplyr::filter()          masks stats::filter()
## x lubridate::intersect()   masks base::intersect()
## x dplyr::lag()             masks stats::lag()
## x lubridate::setdiff()     masks base::setdiff()
## x lubridate::union()       masks base::union()

to analyze more deeply, we need to know result home and away wining in each team, so we write this code :

home.wining.team <- results2 %>%
  filter(home.wining == T) %>%
  group_by(home_team) %>%
  summarise(home.wining = n()) %>%
  ungroup() %>% 
  arrange(desc(home.wining))

colnames(home.wining.team) <- c("team","home.wining")
home.lose.team <- results2 %>%
  filter(home.lose == T) %>%
  group_by(home_team) %>%
  summarise(home.lose = n()) %>%
  ungroup() %>% 
  arrange(desc(home.lose))

colnames(home.lose.team) <- c("team","home.lose")
away.wining.team <- results2 %>%
  filter(away.wining == T) %>%
  group_by(away_team) %>%
  summarise(away.wining = n()) %>%
  ungroup() %>% 
  arrange(desc(away.wining))

colnames(away.wining.team) <- c("team","away.wining")
away.lose.team <- results2 %>%
  filter(away.lose == T) %>%
  group_by(away_team) %>%
  summarise(away.lose = n()) %>%
  ungroup() %>% 
  arrange(desc(away.lose))

colnames(away.lose.team) <- c("team","away.lose")
combine.data2 <- merge(home.wining.team[, c("team", "home.wining")], 
                  home.lose.team[, c("team", "home.lose")])
combine.data2 <- merge(combine.data2[,c("team","home.wining","home.lose")],
                  away.wining.team[,c("team","away.wining")])
combine.data2 <- merge(combine.data2[,c("team","home.wining","home.lose", "away.wining")],
                      away.lose.team[, c("team","away.lose")])
combine.data2 <- merge(combine.data2[,c("team","home.wining","home.lose", "away.wining","away.lose")],
                      nation.match[, c("team","number.match")])
head(combine.data2)
##                  team home.wining home.lose away.wining away.lose
## 1      Ã…land Islands           5         3           2         2
## 2         Afghanistan           2         3           4        29
## 3             Albania          54        61          20       105
## 4             Algeria         115        26          39        79
## 5              Angola          68        16          31        62
## 6 Antigua and Barbuda          32        28          10        37
##   number.match
## 1           46
## 2          106
## 3          326
## 4          497
## 5          346
## 6          184

we add home and away match ratio

combine.data2$away.wining.ratio <- combine.data2$away.wining/combine.data2$away.lose
combine.data2$home.lose.ratio <- combine.data2$home.lose/combine.data2$home.wining
combine.data2.wining <- combine.data2[combine.data2$away.wining.ratio > 1.3 & combine.data2$number.match >= 700,]

combine.data2.losing <- combine.data2[combine.data2$home.lose.ratio > 0.5 & combine.data2$number.match >= 700,]
ggplot(combine.data2.losing,aes(reorder(team, home.lose.ratio),home.lose.ratio))+
  geom_col(aes(fill = number.match))+
  labs(title = "Top 5 National Football Team Home Loser", caption = "Source : Internationa Football Results From 1872 - 2019", x ="Team", y = "Home Lose Ratio") +
  theme(legend.position = "bottom", 
        axis.text.y = element_text(size = 8, face = "bold", colour = "white"))+
  coord_flip()+
  theme_algoritma

ggplot(combine.data2.wining,aes(reorder(team, away.wining.ratio),away.wining.ratio))+
  geom_col(aes(fill = number.match))+
  labs(title = "Top 5 National Football Team Away Winer", caption = "Source : Internationa Football Results From 1872 - 2019", x ="Team", y = "Away Wining Ratio") +
  theme(legend.position = "bottom", 
        axis.text.y = element_text(size = 8, face = "bold", colour = "white"))+
  coord_flip()+
  theme_algoritma

The Final results : from the first graphic, we see that finland is the most-home-loser-team, therefore, based on previous graphic finland is the worst national footbal team ever in history. The second graphic shows England is the the most-away-winer-team, eventhough Brazil which is the top performer team place on 3rd place