library(rvest)
library(readr)
library(magrittr)
library(tidyr)
library(dplyr)
library(Hmisc)
library(deductive)
library(validate)
library(outliers)
library(stringr)
FIFA ranking prior to the World Cup Russia 2018 has been published in the Organization website. Combined with countries population, the aim of the data preprocessing is to analyse in depth the efficiency of countries and regions over their assigned ranking. Considering that the most influential countries in football are from Europe and South America, the data to analyse has been filtered to UEFA and CONMEBOL organizations. Missing values check and proper transformations have been performed in order to achieve a complete and clean dataset.
The ‘fifa’ dataset has been obtained from the FIFA website and we only keep the following variables: “Rank”, “Country”, “CountryCode”, “Points”, “PrevPoints”, “Organization”.
The ‘pop’ dataset has been obtained from Wikipedia. It is basically the popupation for all the countries in the world. For this case, we keep only four variables: “Country”,“Region”,“Subregion”,“Population”.
popdata <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)")
length(html_nodes(popdata, "table"))
## [1] 4
pop <- html_table(html_nodes(popdata,"table")[[3]])
fifadata <- read_html("http://www.fifa.com/fifa-world-ranking/ranking-table/men/index.html")
length(html_nodes(fifadata, "table"))
## [1] 1
fifa <- html_table(html_nodes(fifadata,"table")[[1]])
pop <- pop[ -c(1,5, 7) ]
pop <- pop[-1,] #deletes first useless row.
colnames(pop)[c(1:4)] <- c("Country","Region","Subregion","Population")
head(pop)
fifa <- fifa[ c(2,3,5:7,20) ]
colnames(fifa)[c(1:6)] <- c("Rank","Country","CountryCode","Points","PrevPoints","Organization")
head(fifa)
We define for this section the desired data type for each variable. In the case of “Points” variable, the number between parenthesis makes it impossible to transform the field to int, so we get rid of that additional number. A similar issue occurs with “Population”, which contains commas for thousand and millions; we solve it by getting rid of those commas. Some records from “Country” variable contain a reference number between brakects. We also remove those in order to avoid future problems while merging datasets.
pop$Country <- gsub("\\[.+?]","",pop$Country) #removes data within brackets from the field
pop$Country <- as.factor(pop$Country)
pop$Region <- as.factor(pop$Region)
pop$Subregion <- as.factor(pop$Subregion)
pop$Population <- as.numeric(gsub(",", "", pop$Population)) #this field contains commas for thousands and millions
pop$Population <- as.integer(pop$Population)
str(pop)
## 'data.frame': 233 obs. of 4 variables:
## $ Country : Factor w/ 233 levels "Afghanistan",..: 43 95 221 96 28 156 150 17 171 133 ...
## $ Region : Factor w/ 5 levels "Africa","Americas",..: 3 3 2 3 2 3 1 3 4 2 ...
## $ Subregion : Factor w/ 22 levels "Australia and New Zealand",..: 6 18 12 15 16 18 20 18 7 3 ...
## $ Population: int 1409517397 1339180127 324459463 263991379 209288278 197015955 190886311 164669751 143989754 129163276 ...
fifa$Country <- as.factor(fifa$Country)
fifa$Organization <- as.factor(fifa$Organization)
fifa$Points <- gsub("\\s*\\([^\\)]+\\)","",as.character(fifa$Points)) #removes data within paretheses from the field
fifa$Points <- as.integer(fifa$Points)
str(fifa)
## 'data.frame': 211 obs. of 6 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Country : Factor w/ 211 levels "Afghanistan",..: 76 28 20 152 9 184 71 151 41 175 ...
## $ CountryCode : chr "GER" "BRA" "BEL" "POR" ...
## $ Points : int 1558 1431 1298 1274 1241 1199 1198 1183 1135 1126 ...
## $ PrevPoints : int 1544 1384 1346 1306 1254 1179 1166 1128 1146 1162 ...
## $ Organization: Factor w/ 6 levels "AFC","CAF","CONCACAF",..: 6 4 6 6 4 6 6 6 4 6 ...
The dataset has been also filtered by the most representative “Organizations”: UEFA and CONMEBOL. In addition, we join ‘fifa’ and ‘pop’ datasets by country, creating ‘ranking’ dataset.
fifa <- fifa %>% filter(Organization=='CONMEBOL'|Organization=='UEFA')
## Warning: package 'bindrcpp' was built under R version 3.4.4
ranking <- fifa %>% left_join(pop,by='Country')
## Warning: Column `Country` joining factors with different levels, coercing
## to character vector
head(ranking)
Data is already tidy.
After merging our datasets, we check missing values by column. This provides an idea of how many records have missmatched. From finding that there are 7 NA cases from our dataset merging, we detect the location and the country name. This particular case is not significant, hence we fix it manually.
colSums(is.na(ranking)) #detecs whether there are missmatches or not.
## Rank Country CountryCode Points PrevPoints
## 0 0 0 0 0
## Organization Region Subregion Population
## 0 7 7 7
w <- which(is.na(ranking$Population))
ranking$Country[w] #detect which countries to correct
## [1] "England" "Wales" "Northern Ireland"
## [4] "Republic of Ireland" "Scotland" "FYR Macedonia"
## [7] "Kosovo"
# England, Wales, Norhtern Ireland and Scotland are grouped under United Kingdom.
# Kosovo doesn't appear in 'pop' dataset.
# Republic of Ireland and FYR Macedonia are spelled as Ireland and Republic of Macedonia in 'pop' dataset.
# As we already detected the location of missing values, we will manually add the missing information to 'ranking' dataset.
w
## [1] 13 17 24 26 33 46 60
ranking[13,7:9] <- c('Europe','Northern Europe',55268100) #England
ranking[17,7:9] <- c('Europe','Northern Europe',3113200) #Wales
ranking[24,7:9] <- c('Europe','Northern Europe',1862100) #Northern Ireland
ranking[26,7:9] <- c('Europe','Northern Europe',4761657) #Republic of Ireland
ranking[33,7:9] <- c('Europe','Northern Europe',5404700) #Scotland
ranking[46,7:9] <- c('Europe','Southern Europe',2083160) #Macedonia
ranking[60,7:9] <- c('Europe','Southern Europe',1739825) #Kosovo
ranking$Population <- as.integer(ranking$Population)
colSums(is.na(ranking)) #now there are no missing values
## Rank Country CountryCode Points PrevPoints
## 0 0 0 0 0
## Organization Region Subregion Population
## 0 0 0 0
Two calculated fields are created. “PointsDiff” is the difference between “Points” and “PrevPoints” (previous points). “EfficiencyRate” is an efficiency index created with ranking “Points” by “Population” and multiplied by one million in order to achieve a clear insight easier to read.
ranking <- mutate(ranking,
PointsDiff=Points-PrevPoints,
EfficiencyRate=(Points/Population)*1000000)
head(ranking)
There are no special values for numeric variables in the dataset.
sum(is.infinite(ranking$Points))
## [1] 0
sum(is.nan(ranking$Points))
## [1] 0
sum(is.infinite(ranking$PrevPoints))
## [1] 0
sum(is.nan(ranking$PrevPoints))
## [1] 0
sum(is.infinite(ranking$Population))
## [1] 0
sum(is.nan(ranking$Population))
## [1] 0
sum(is.infinite(ranking$EfficiencyRate))
## [1] 0
sum(is.nan(ranking$EfficiencyRate))
## [1] 0
#there are no special values
When we check the “EfficiencyRate” histogram, it is easy to find a right skewed distribution. After applying log transformation, we achieve a simetric normal distribution.
hist(ranking$EfficiencyRate) #this histogram is highly right skewed, so applying log transformation would be accurate.
logefficiency <- log(ranking$EfficiencyRate)
hist(logefficiency) #now the efficiency rate follows a clear simetric normal distribution