Hello, back again with me Imad. Now, I want to talk about result of football matches for national teams around the world (Disclaimer : this is official mathces supervised by FIFA and this is only male football national team).
In this project i want to know who is the best national footbal team based on their result from 1872 to 2019, so i will do data wrangling and analyzing wining ratio, losing ratio, home wining and lose, and away wining and lose, etc.
Enjoy your coffee…
First of all we look at our working directory :
getwd()
## [1] "D:/IMAD/Data Science/01 Algoritma/Day 9/LBB by Imaduddin 2/lbb2_by_imaduddin/lbb2_by_imaduddin"
from the above we see that …/lbb2_by_imaduddin and it is same with
D:\IMAD\Data Science\01 Algoritma\Day 9\LBB by Imaduddin 2\lbb2_by_imaduddin\lbb2_by_imaduddin
so we can read the data and take a look
results <- read.csv("results.csv")
str(results)
## 'data.frame': 40773 obs. of 9 variables:
## $ date : Factor w/ 15106 levels "1872-11-30","1873-03-08",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ home_team : Factor w/ 309 levels "Ã…land Islands",..: 242 87 242 87 242 242 87 297 242 242 ...
## $ away_team : Factor w/ 306 levels "Ã…land Islands",..: 87 237 87 237 87 293 237 237 87 293 ...
## $ home_score: int 0 4 2 2 3 4 1 0 7 9 ...
## $ away_score: int 0 2 1 2 0 0 3 2 2 0 ...
## $ tournament: Factor w/ 107 levels "ABCS Tournament",..: 55 55 55 55 55 55 55 55 55 55 ...
## $ city : Factor w/ 1959 levels "6th of October City",..: 663 1011 663 1011 663 663 1011 1914 663 663 ...
## $ country : Factor w/ 265 levels "Éire","Afghanistan",..: 206 71 206 71 206 206 71 256 206 206 ...
## $ neutral : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
The data contains 40773 observation and 9 column, which is : 1. date = Date the match took place on 2. home_team = Name of the home team. 3. away_team = Name of the away team. 4. home_score = Home team full time score. 5. away_score = Away team full time score. 6. tournament = Name of the tournament. 7. city = City where the match took place. 8. country = Country where the match took place. 9. neutral = TRUE if the match took place outside the home team country, FALSE otherwise.
Firstly, we want to know how many football matches every year (1872 - 2018), so we need to change the column date into date class
check the date
head(results$date)
## [1] 1872-11-30 1873-03-08 1874-03-07 1875-03-06 1876-03-04 1876-03-25
## 15106 Levels: 1872-11-30 1873-03-08 1874-03-07 1875-03-06 ... 2019-07-19
The structure date is %Y-%m-%d, so we use lubridate to change the date class
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
results$date <- ymd(results$date)
# check the date structure and class
head(results$date)
## [1] "1872-11-30" "1873-03-08" "1874-03-07" "1875-03-06" "1876-03-04"
## [6] "1876-03-25"
class(results$date)
## [1] "Date"
after that we add one column for year that match played -> results$year
results$year <- year(results$date)
we will create new data frame to look matches year by year -> yearmatches
yearmatches <- data.frame(table("year" = results$year))
colnames(yearmatches) <- c("year","match")
we just need to see total match until 2018, so we delete total match in 2019
yearmatches <- yearmatches[-c(148),]
tail(yearmatches)
## year match
## 142 2013 966
## 143 2014 859
## 144 2015 1003
## 145 2016 926
## 146 2017 958
## 147 2018 854
yearmatches[147,2]/yearmatches[127,2]
## [1] 1.223496
yearmatches$year <- as.numeric(as.character(yearmatches$year))
plot(yearmatches[yearmatches$year >= 1918 & yearmatches$year <= 2018, ] , yearmatches$match, main = "Total Footbal Matches in The Past Century", ylab = "number of match", type = "h")
abline(lm(yearmatches$match~yearmatches$year), lty = 5, lwd = 3, col = "red")
result : if we see in the past 20 years the number of football matchs increase to 122% and also if we shows the graphic number of footbal matches in the past century, it shows positive trend. The result is number of football match around the world will increase steadily.
Knowing total number of football matches isn’t enough, because if we want to know the top performer team and the weakest one we need to know the number of football match each national team, so i will write some code to know that
results[,c("home_team","away_team")] <- lapply(results[,c("home_team","away_team")], as.character)
change class of home_team and away_team into character using lappy() then create new data frame -> nation.match
nation.match <- cbind.data.frame("team" = c(results$home_team, results$away_team))
nation.match <- data.frame(table("team" = nation.match$team))
nation.match <- nation.match[order(nation.match$Freq, decreasing = T), ]
nation.match$team <- as.character(nation.match$team)
colnames(nation.match) <- c("team","number.match")
str(nation.match)
## 'data.frame': 313 obs. of 2 variables:
## $ team : chr "Sweden" "England" "Brazil" "Argentina" ...
## $ number.match: int 1010 994 977 975 942 922 878 862 853 834 ...
theme_algoritma <- readRDS("theme_algoritma.rds")
library(ggplot2)
ggplot(nation.match[nation.match$number.match > 810, ], aes(reorder(team, number.match), number.match))+
geom_col(color = "blue")+
labs(y = "Number of match", x = "National Team", title = "Top 10 of National Football Team Matches from 1872 to 2019", caption = "Source : International Football Results From 1872 To 2019" )+
coord_flip()+
theme_algoritma
Take a look at chart above, Sweden Team has the most number of football matches in the world. So Sweden Team is the strongest team in the world? we don’t know the answer yet, until we find which team has the most winning all this time
results is also giving us information all match scores, so we need to know which team that win or lose.
we will add 2 new columns in data frame results -> winer and loser.
To know which team that win or lose, we have to giving terms, which is :
for column winer
drawand
for column loser
drawso take a look at this, then
results$winer <- ifelse(results$home_score - results$away_score > 0, results$home_team,ifelse(results$home_score - results$away_score < 0, results$away_team, "draw"))
results$loser <- ifelse(results$home_score - results$away_score < 0, results$home_team,ifelse(results$home_score - results$away_score > 0, results$away_team, "draw"))
head(results,3)
## date home_team away_team home_score away_score tournament city
## 1 1872-11-30 Scotland England 0 0 Friendly Glasgow
## 2 1873-03-08 England Scotland 4 2 Friendly London
## 3 1874-03-07 Scotland England 2 1 Friendly Glasgow
## country neutral year winer loser
## 1 Scotland FALSE 1872 draw draw
## 2 England FALSE 1873 England Scotland
## 3 Scotland FALSE 1874 Scotland England
Now we add new data frame for wining team -> winer.team
And order data frame from the highest to the lowest
winer.team <- data.frame(table("team" = results$winer))
winer.team$team <- as.character(winer.team$team)
winer.team <-winer.team[winer.team$team != "draw",]
winer.team <- winer.team[order(winer.team$Freq, decreasing = T),]
colnames(winer.team) <- c("team","match.win")
head(winer.team)
## team match.win
## 38 Brazil 625
## 88 England 563
## 107 Germany 551
## 14 Argentina 524
## 258 Sweden 494
## 250 South Korea 449
Also add new data frame for loser team -> loser.team
loser.team <- data.frame(table("team" = results$loser))
loser.team$team <- as.character(loser.team$team)
loser.team <-loser.team[loser.team$team != "draw",]
loser.team <- loser.team[order(loser.team$Freq, decreasing = T),]
colnames(loser.team) <- c("team","match.lose")
head(loser.team)
## team match.lose
## 98 Finland 397
## 267 Switzerland 343
## 201 Norway 328
## 199 Northern Ireland 327
## 160 Luxembourg 311
## 58 Chile 309
Not only the winer and the loser but also to give us all information from the data, we will find the number of draw match of each team, we create new data frame -> draw.results
draw.results <- results[results$winer == "draw",]
str(draw.results)
## 'data.frame': 9423 obs. of 12 variables:
## $ date : Date, format: "1872-11-30" "1875-03-06" ...
## $ home_team : chr "Scotland" "England" "Northern Ireland" "England" ...
## $ away_team : chr "England" "Scotland" "Wales" "Wales" ...
## $ home_score: int 0 2 1 1 1 1 0 1 1 2 ...
## $ away_score: int 0 2 1 1 1 1 0 1 1 2 ...
## $ tournament: Factor w/ 107 levels "ABCS Tournament",..: 55 55 55 17 17 17 17 17 17 17 ...
## $ city : Factor w/ 1959 levels "6th of October City",..: 663 1011 228 263 1011 663 1914 663 189 228 ...
## $ country : Factor w/ 265 levels "Éire","Afghanistan",..: 206 71 192 71 71 206 256 206 256 192 ...
## $ neutral : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ year : num 1872 1875 1883 1885 1885 ...
## $ winer : chr "draw" "draw" "draw" "draw" ...
## $ loser : chr "draw" "draw" "draw" "draw" ...
draw.team <- cbind.data.frame("team" = c(draw.results$home_team, draw.results$away_team))
draw.team <- data.frame(table("team" = draw.team$team))
draw.team <- draw.team[order(draw.team$Freq, decreasing = T), ]
draw.team$team <- as.character(draw.team$team)
colnames(draw.team) <- c("team","match.draw")
head(draw.team)
## team match.draw
## 81 England 241
## 13 Argentina 240
## 128 Italy 224
## 245 Sweden 223
## 238 South Korea 219
## 269 Uruguay 213
Now we have 4 data :
we will merge all the data
combine.data <- merge(loser.team[, c("team", "match.lose")],
winer.team[, c("team", "match.win")])
combine.data <- merge(combine.data[,c("team","match.lose","match.win")],
draw.team[,c("team","match.draw")])
combine.data <- merge(combine.data[,c("team","match.lose","match.win", "match.draw")],
nation.match[, c("team","number.match")])
tail(combine.data)
## team match.lose match.win match.draw number.match
## 278 Ynys Môn 18 32 11 61
## 279 Yorkshire 1 5 1 7
## 280 Yugoslavia 159 223 98 480
## 281 Zambia 197 324 186 707
## 282 Zanzibar 123 41 41 205
## 283 Zimbabwe 141 184 122 447
To analyze data we put some data like winning ratio, losing ratio, and draw ratio. From those data, we will know who is the best National Footbal Team and who is the worst.
combine.data$wining.ratio <- combine.data$match.win/combine.data$number.match
combine.data$losing.ratio <- combine.data$match.lose/combine.data$number.match
combine.data$drawing.ratio <- combine.data$match.draw/combine.data$number.match
We got combine.data that have many columns that will give us some information. I will subset the data to know which team has many number of wining and number of losing. The rule is : 1. > 700 number of match 2. > 400 wining match for the most wining team 3. > 270 losing match for the most losing team
combine.data.wining <- combine.data[combine.data$match.win > 400 & combine.data$number.match >= 700,]
combine.data.losing <- combine.data[combine.data$match.lose > 270 & combine.data$number.match >= 700, ]
ggplot(combine.data.wining, aes(reorder(team, wining.ratio/losing.ratio),wining.ratio/losing.ratio))+
geom_boxplot()+
geom_point(aes (size = number.match), col ="blue4") +
labs(title = "Top 10 Best Performance National Football Team", caption = "Source : Internationa Football Results From 1872 - 2019", x ="Team", y = "Wining Ratio") +
theme(legend.position = "bottom",
axis.text.y = element_text(size = 8, face = "bold", colour = "white"))+
coord_flip()+
theme_algoritma
ggplot(combine.data.losing, aes(reorder(team, losing.ratio/wining.ratio),losing.ratio/wining.ratio))+
geom_boxplot()+
geom_point(aes (size = number.match), col ="red4") +
labs(title = "Top 10 Worst Performance National Football Team", caption = "Source : Internationa Football Results From 1872 - 2019", x ="Team", y = "Losing Ratio") +
theme(legend.position = "bottom",
axis.text.y = element_text(size = 8, face = "bold", colour = "white"))+
coord_flip()+
theme_algoritma
results : from the graphic above we see that the top performer team is Brazil, and England and Germany is in 2nd and 3rd place respectively. On the other hand Finland is the worst performer, followed by Switzerland and Norway team.
In this section, we will analyze the efect of home or away match, so we will omit the result draw and take out neutral game. create new data frame results2
results2 <- results[results$winer != "draw" & results$neutral == F,]
we add some columns to results2.The rule is :
home.wining column is TRUE if home_team is the same as winer, otherwise is FALSEhome.lose column is TRUE if home_team is not the same as winer, otherwise is FALSEaway.wining column is TRUE if away_team is the same as winer, otherwise is FALSEaway.lose column is TRUE if away_team is not the same as winer, otherwise is FALSEthe code is :
results2$home.wining <- ifelse(results2$home_team == results2$winer,T,F)
results2$home.lose <- ifelse(results2$home_team != results2$winer,T,F)
results2$away.wining <- ifelse(results2$away_team == results2$winer,T,F)
results2$away.lose <- ifelse(results2$away_team != results2$winer,T,F)
home.or.away <- data.frame(prop.table(table(results2$home.wining)))
colnames(home.or.away) <- c("home_or_away", "Percentage")
home.or.away$home_or_away <- as.character(home.or.away$home_or_away)
results : form the information above, we see that home matches are very influential on winning results.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## v tibble 2.1.3 v purrr 0.3.2
## v tidyr 0.8.3 v dplyr 0.8.3
## v readr 1.3.1 v stringr 1.4.0
## v tibble 2.1.3 v forcats 0.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date() masks base::date()
## x dplyr::filter() masks stats::filter()
## x lubridate::intersect() masks base::intersect()
## x dplyr::lag() masks stats::lag()
## x lubridate::setdiff() masks base::setdiff()
## x lubridate::union() masks base::union()
to analyze more deeply, we need to know result home and away wining in each team, so we write this code :
home.wining.team <- results2 %>%
filter(home.wining == T) %>%
group_by(home_team) %>%
summarise(home.wining = n()) %>%
ungroup() %>%
arrange(desc(home.wining))
colnames(home.wining.team) <- c("team","home.wining")
home.lose.team <- results2 %>%
filter(home.lose == T) %>%
group_by(home_team) %>%
summarise(home.lose = n()) %>%
ungroup() %>%
arrange(desc(home.lose))
colnames(home.lose.team) <- c("team","home.lose")
away.wining.team <- results2 %>%
filter(away.wining == T) %>%
group_by(away_team) %>%
summarise(away.wining = n()) %>%
ungroup() %>%
arrange(desc(away.wining))
colnames(away.wining.team) <- c("team","away.wining")
away.lose.team <- results2 %>%
filter(away.lose == T) %>%
group_by(away_team) %>%
summarise(away.lose = n()) %>%
ungroup() %>%
arrange(desc(away.lose))
colnames(away.lose.team) <- c("team","away.lose")
combine.data2 <- merge(home.wining.team[, c("team", "home.wining")],
home.lose.team[, c("team", "home.lose")])
combine.data2 <- merge(combine.data2[,c("team","home.wining","home.lose")],
away.wining.team[,c("team","away.wining")])
combine.data2 <- merge(combine.data2[,c("team","home.wining","home.lose", "away.wining")],
away.lose.team[, c("team","away.lose")])
combine.data2 <- merge(combine.data2[,c("team","home.wining","home.lose", "away.wining","away.lose")],
nation.match[, c("team","number.match")])
head(combine.data2)
## team home.wining home.lose away.wining away.lose
## 1 Ã…land Islands 5 3 2 2
## 2 Afghanistan 2 3 4 29
## 3 Albania 54 61 20 105
## 4 Algeria 115 26 39 79
## 5 Angola 68 16 31 62
## 6 Antigua and Barbuda 32 28 10 37
## number.match
## 1 46
## 2 106
## 3 326
## 4 497
## 5 346
## 6 184
we add home and away match ratio
combine.data2$away.wining.ratio <- combine.data2$away.wining/combine.data2$away.lose
combine.data2$home.lose.ratio <- combine.data2$home.lose/combine.data2$home.wining
combine.data2.wining <- combine.data2[combine.data2$away.wining.ratio > 1.3 & combine.data2$number.match >= 700,]
combine.data2.losing <- combine.data2[combine.data2$home.lose.ratio > 0.5 & combine.data2$number.match >= 700,]
ggplot(combine.data2.losing,aes(reorder(team, home.lose.ratio),home.lose.ratio))+
geom_col(aes(fill = number.match))+
labs(title = "Top 5 National Football Team Home Loser", caption = "Source : Internationa Football Results From 1872 - 2019", x ="Team", y = "Home Lose Ratio") +
theme(legend.position = "bottom",
axis.text.y = element_text(size = 8, face = "bold", colour = "white"))+
coord_flip()+
theme_algoritma
ggplot(combine.data2.wining,aes(reorder(team, away.wining.ratio),away.wining.ratio))+
geom_col(aes(fill = number.match))+
labs(title = "Top 5 National Football Team Away Winer", caption = "Source : Internationa Football Results From 1872 - 2019", x ="Team", y = "Away Wining Ratio") +
theme(legend.position = "bottom",
axis.text.y = element_text(size = 8, face = "bold", colour = "white"))+
coord_flip()+
theme_algoritma
The Final results : from the first graphic, we see that finland is the most-home-loser-team, therefore, based on previous graphic finland is the worst national footbal team ever in history. The second graphic shows England is the the most-away-winer-team, eventhough Brazil which is the top performer team place on 3rd place