Required packages

library(dplyr)
library(lubridate)
library(tidyr)
library(Hmisc)
library(infotheo)

Executive Summary

Data Preprocessing is a process and the collection of operations needed to prepare all forms of untidy data for statistical analysis. The 5 major tasks for data include Get, Understand, Tidy & Manipulate, Scan and Transform. To achieve the first task, we located the right data sets and imported them using the Base R functions. Then, we merged the data sets to create a single data set for further preprocessing.

Next, we inspected the data set to understand the various aspects of it. We checked the attributes (metadata) as well as the structure of the data set to find out details such as types of variables, data structures, dimensions, etc. We made some appropriate changes in our data set such as data type conversions and labelling factor variables to make more sense of the data.

We then checked whether the data followed the tidy data principles. As each variable in our data did not have its own column, it was not in a tidy format. Hence, we reshaped it into a tidy format using appropriate method. Then, we created two new variables from the existing variables using the mutate function.

Afterwards, we scanned the data for missing values and inconsistencies. Since one of the variables in our data set had missing values, we applied basic imputation technique to replace them. Next, we scanned all numeric variables for outliers using box plot. We used multiple techniques to deal with the outliers including imputation, excluding outliers and capping. Finally, we applied transformation to one of the variables to change the scale for better understanding of the variable. We applied both equal width (distance) binning and equal depth (frequency) binning methods to discretize/transform the numerical variable into categorical counterpart.

Data

We are going to use three data sets: World Cups, World Cup Players, World Cup Matches. The World Cups data set contains information about all the World Cups held to date. It consists of the following variables:

The World Cup Matches data set contains information regarding all of the matches played in the World Cups held to date. It consists of the following variables:

The World Cup Players data set contains information regarding all of the players who played in the World Cup matches held to date. It consists of the following variables:

Data Source: https://www.kaggle.com/abecklas/fifa-world-cup

WorldCup_data <- read.csv("WorldCups.csv")
head(WorldCup_data)
WorldCupMatches_data <- read.csv("WorldCupMatches.csv")
head(WorldCupMatches_data)
WorldCupPlayers_data <- read.csv("WorldCupPlayers.csv")
head(WorldCupPlayers_data)
merged_data <- inner_join(WorldCupPlayers_data, WorldCupMatches_data, by = "MatchID")
merged_data <- inner_join(merged_data, WorldCup_data, by = "Year")
head(merged_data)

Explanation:

Our first step was to import the data sets using read.csv (Base R function) since the data was stored in csv files. Then, we used inner_join to merge World Cup Players data set with World Cup Matches data set by the key variable: MatchID, which is common in both data sets. Inner join retains the rows with matching key variable values in both data sets. Finally, using inner_join once again, we merged the resulting data set (World Cup Players + World Cup Matches) with World Cups data set by the common variable: Year. As a result, a single data set was created for our report.

Understand

attributes(merged_data)$class
[1] "data.frame"
attributes(merged_data)$names
 [1] "RoundID.x"            "MatchID"              "Team.Initials"        "Coach.Name"          
 [5] "Line.up"              "Shirt.Number"         "Player.Name"          "Position"            
 [9] "Event"                "Year"                 "Datetime"             "Stage"               
[13] "Stadium"              "City"                 "Home.Team.Name"       "Home.Team.Goals"     
[17] "Away.Team.Goals"      "Away.Team.Name"       "Win.conditions"       "Attendance.x"        
[21] "Half.time.Home.Goals" "Half.time.Away.Goals" "Referee"              "Assistant.1"         
[25] "Assistant.2"          "RoundID.y"            "Home.Team.Initials"   "Away.Team.Initials"  
[29] "Country"              "Winner"               "Runners.Up"           "Third"               
[33] "Fourth"               "GoalsScored"          "QualifiedTeams"       "MatchesPlayed"       
[37] "Attendance.y"        
str(merged_data)
'data.frame':   39256 obs. of  37 variables:
 $ RoundID.x           : int  201 201 201 201 201 201 201 201 201 201 ...
 $ MatchID             : int  1096 1096 1096 1096 1096 1096 1096 1096 1096 1096 ...
 $ Team.Initials       : Factor w/ 82 levels "ALG","ANG","ARG",..: 26 47 26 47 26 47 26 47 26 47 ...
 $ Coach.Name          : Factor w/ 335 levels "ACOSTA Nelson (URU)",..: 50 169 50 169 50 169 50 169 50 169 ...
 $ Line.up             : Factor w/ 2 levels "N","S": 2 2 2 2 2 2 2 2 2 2 ...
 $ Shirt.Number        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Player.Name         : Factor w/ 7663 levels "?URI?I?","A BAUTISTA",..: 317 5519 4713 3865 1988 5988 482 2990 2035 1668 ...
 $ Position            : Factor w/ 4 levels "","C","GK","GKC": 3 3 1 1 1 2 1 1 1 1 ...
 $ Event               : Factor w/ 1894 levels "","G1'","G1' G42'",..: 1 1 308 527 1 1 325 1 1 1 ...
 $ Year                : int  1930 1930 1930 1930 1930 1930 1930 1930 1930 1930 ...
 $ Datetime            : Factor w/ 602 levels "01 Jul 1950 - 15:00 ",..: 205 205 205 205 205 205 205 205 205 205 ...
 $ Stage               : Factor w/ 23 levels "Final","First round",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ Stadium             : Factor w/ 181 levels "Arena Amazonia",..: 125 125 125 125 125 125 125 125 125 125 ...
 $ City                : Factor w/ 151 levels "Alicante ","Antibes ",..: 84 84 84 84 84 84 84 84 84 84 ...
 $ Home.Team.Name      : Factor w/ 78 levels "Algeria","Angola",..: 24 24 24 24 24 24 24 24 24 24 ...
 $ Home.Team.Goals     : int  4 4 4 4 4 4 4 4 4 4 ...
 $ Away.Team.Goals     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Away.Team.Name      : Factor w/ 83 levels "Algeria","Angola",..: 46 46 46 46 46 46 46 46 46 46 ...
 $ Win.conditions      : Factor w/ 43 levels " "," win on penalties (2 - 3) ",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Attendance.x        : int  4444 4444 4444 4444 4444 4444 4444 4444 4444 4444 ...
 $ Half.time.Home.Goals: int  3 3 3 3 3 3 3 3 3 3 ...
 $ Half.time.Away.Goals: int  0 0 0 0 0 0 0 0 0 0 ...
 $ Referee             : Factor w/ 366 levels "ABD EL FATAH Essam (EGY)",..: 192 192 192 192 192 192 192 192 192 192 ...
 $ Assistant.1         : Factor w/ 387 levels "ABDUL HAMID Halim (MAS)",..: 87 87 87 87 87 87 87 87 87 87 ...
 $ Assistant.2         : Factor w/ 408 levels "ABDUL HAMID Halim (MAS)",..: 292 292 292 292 292 292 292 292 292 292 ...
 $ RoundID.y           : int  201 201 201 201 201 201 201 201 201 201 ...
 $ Home.Team.Initials  : Factor w/ 77 levels "ALG","ANG","ARG",..: 25 25 25 25 25 25 25 25 25 25 ...
 $ Away.Team.Initials  : Factor w/ 82 levels "ALG","ANG","ARG",..: 47 47 47 47 47 47 47 47 47 47 ...
 $ Country             : Factor w/ 15 levels "Argentina","Brazil",..: 14 14 14 14 14 14 14 14 14 14 ...
 $ Winner              : Factor w/ 9 levels "Argentina","Brazil",..: 9 9 9 9 9 9 9 9 9 9 ...
 $ Runners.Up          : Factor w/ 10 levels "Argentina","Brazil",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Third               : Factor w/ 14 levels "Austria","Brazil",..: 14 14 14 14 14 14 14 14 14 14 ...
 $ Fourth              : Factor w/ 16 levels "Austria","Belgium",..: 16 16 16 16 16 16 16 16 16 16 ...
 $ GoalsScored         : int  70 70 70 70 70 70 70 70 70 70 ...
 $ QualifiedTeams      : int  13 13 13 13 13 13 13 13 13 13 ...
 $ MatchesPlayed       : int  18 18 18 18 18 18 18 18 18 18 ...
 $ Attendance.y        : Factor w/ 20 levels "1.045.246","1.545.791",..: 17 17 17 17 17 17 17 17 17 17 ...
merged_data$Shirt.Number <- as.factor(merged_data$Shirt.Number)
class(merged_data$Shirt.Number)
[1] "factor"
merged_data$Event <- as.character(merged_data$Event)
class(merged_data$Event)
[1] "character"
merged_data$Year <- as.factor(merged_data$Year)
class(merged_data$Year)
[1] "factor"
merged_data$Datetime <- dmy_hm(merged_data$Datetime)
class(merged_data$Datetime)
[1] "POSIXct" "POSIXt" 
merged_data$Win.conditions <- as.character(merged_data$Win.conditions)
class(merged_data$Win.conditions)
[1] "character"
merged_data$Attendance.y <- as.character(merged_data$Attendance.y)
merged_data$Attendance.y <- gsub("\\.", "", merged_data$Attendance.y)
merged_data$Attendance.y <- as.integer(merged_data$Attendance.y)
class(merged_data$Attendance.y)
[1] "integer"
merged_data$Line.up <- factor(merged_data$Line.up, levels = c("N", "S"), labels = c("No", "Yes"))
class(merged_data$Line.up)
[1] "factor"
levels(merged_data$Line.up)
[1] "No"  "Yes"
merged_data <- merged_data[, -26]
colnames(merged_data) <- c("RoundID", "MatchID", "Player_Team_Initials", "Player_Team_Coach", "Starting_Line-up", "Player_Shirt_Number", "Player_Name", "Player_Position", "Player_Event", "World_Cup_Year", "Match_Datetime", "Match_Stage", "Match_Stadium", "Match_City", "Home_Team_Name", "Home_Team_Goals", "Away_Team_Goals", "Away_Team_Name", "Match_Detail", "Match_Attendance", "Half-time_Home_Goals", "Half-time_Away_Goals", "Match_Referee", "Match_Assistant_Referee_1", "Match_Assistant_Referee_2", "Home_Team_Initials", "Away_Team_Initials", "World_Cup_Host_Country", "Winner", "Runners-up", "Third_Position", "Fourth_Position", "World_Cup_Total_Goals", "Number_of_Qualified_Teams", "World_Cup_Total_Matches", "World_Cup_Total_Attendance")
colnames(merged_data)
 [1] "RoundID"                    "MatchID"                    "Player_Team_Initials"      
 [4] "Player_Team_Coach"          "Starting_Line-up"           "Player_Shirt_Number"       
 [7] "Player_Name"                "Player_Position"            "Player_Event"              
[10] "World_Cup_Year"             "Match_Datetime"             "Match_Stage"               
[13] "Match_Stadium"              "Match_City"                 "Home_Team_Name"            
[16] "Home_Team_Goals"            "Away_Team_Goals"            "Away_Team_Name"            
[19] "Match_Detail"               "Match_Attendance"           "Half-time_Home_Goals"      
[22] "Half-time_Away_Goals"       "Match_Referee"              "Match_Assistant_Referee_1" 
[25] "Match_Assistant_Referee_2"  "Home_Team_Initials"         "Away_Team_Initials"        
[28] "World_Cup_Host_Country"     "Winner"                     "Runners-up"                
[31] "Third_Position"             "Fourth_Position"            "World_Cup_Total_Goals"     
[34] "Number_of_Qualified_Teams"  "World_Cup_Total_Matches"    "World_Cup_Total_Attendance"

Explanation:

First, we checked the attributes (metadata) of the merged data set which returned variable names, object class and row names. We skipped printing the row names as they were meaningless and took a lot of space. Next, we inspected the structure of the data frame to find out the details of each variable such as data type, number of levels (for factor variables), etc. The str function also returned the total number of observations and variables in the data set.

Inspection of data set showed that the data consisted of multiple data types including numerics, characters, factors, date. However, some variables were not imported as correct data type. Hence, we made the following data type conversions to make more sense of the data:

  • Shirt.Number variable from integer to factor using as.factor() function.
  • Events variable from factor to character using as.character() function.
  • Year variable from integer to factor using as.factor() function.
  • Datetime variable from factor to POSIXct (date-time). As the order of elements in date was day, month, year, hour, minute we used the dmy_hm() function of lubridate package.
  • Win.conditions variable from factor to character using as.character() function.
  • Attendance.y variable was first changed from factor to character and then all of the periods were removed. Then, we converted the variable to integer data type.

Next, we labelled the levels of Line.up factor variable from “N” and “S” to “No” and “Yes” respectively. We then removed the redundant RoundID.y column (same as RoundID.x) by simply filtering the data frame. Finally, column names of the data frame were renamed to more suitable and meaningful names using colnames() function.

Tidy & Manipulate Data I

merged_data <- gather(merged_data, World_Cup_Position, Position_Holder, 29:32)
merged_data$World_Cup_Position <- as.factor(merged_data$World_Cup_Position)
merged_data$Position_Holder <- as.factor(merged_data$Position_Holder)
head(merged_data[, c("World_Cup_Position", "Position_Holder")])

Explanation:

The data did not conform to the tidy data principles as each variable did not have its own column: each World_Cup_Position variable value (i.e. winner, runners up, etc.) had a separate column. Hence, we reshaped the data using gather function (tidyr package) to create a new column for World_Cup_Position values.

The new columns created had character data type. We converted them to factor data type which was more appropriate.

Tidy & Manipulate Data II

merged_data <- mutate(merged_data, Match_Goal_Difference = Home_Team_Goals - Away_Team_Goals)
merged_data <- mutate(merged_data, World_Cup_Average_Goals = World_Cup_Total_Goals / World_Cup_Total_Matches)
head(merged_data[, c("Match_Goal_Difference", "World_Cup_Average_Goals")])

Explanation:

We created two new variables: Match_Goal_Difference and World_Cup_Average_Goals. For Match_Goal_Difference variable, we calculated the difference between Home_Team_Goals variable and Away_Team-Goals variable. For World_Cup_Average_Goals variable, we divided World_Cup_Total_Goals variable by World_Cup_Total_Matches variable. To complete this task, we used the mutate function (dplyr package).

Scan I

sum(is.na(merged_data))
[1] 736
colSums(is.na(merged_data))
                   RoundID                    MatchID       Player_Team_Initials 
                         0                          0                          0 
         Player_Team_Coach           Starting_Line-up        Player_Shirt_Number 
                         0                          0                          0 
               Player_Name            Player_Position               Player_Event 
                         0                          0                          0 
            World_Cup_Year             Match_Datetime                Match_Stage 
                         0                          0                          0 
             Match_Stadium                 Match_City             Home_Team_Name 
                         0                          0                          0 
           Home_Team_Goals            Away_Team_Goals             Away_Team_Name 
                         0                          0                          0 
              Match_Detail           Match_Attendance       Half-time_Home_Goals 
                         0                        736                          0 
      Half-time_Away_Goals              Match_Referee  Match_Assistant_Referee_1 
                         0                          0                          0 
 Match_Assistant_Referee_2         Home_Team_Initials         Away_Team_Initials 
                         0                          0                          0 
    World_Cup_Host_Country      World_Cup_Total_Goals  Number_of_Qualified_Teams 
                         0                          0                          0 
   World_Cup_Total_Matches World_Cup_Total_Attendance         World_Cup_Position 
                         0                          0                          0 
           Position_Holder      Match_Goal_Difference    World_Cup_Average_Goals 
                         0                          0                          0 
merged_data$Match_Attendance <- as.integer(impute(merged_data$Match_Attendance, fun = mean))
sum(is.na(merged_data))
[1] 0
is.specialInf <- function(x){
    sum(is.infinite(x))
}
sapply(merged_data, is.specialInf)
                   RoundID                    MatchID       Player_Team_Initials 
                         0                          0                          0 
         Player_Team_Coach           Starting_Line-up        Player_Shirt_Number 
                         0                          0                          0 
               Player_Name            Player_Position               Player_Event 
                         0                          0                          0 
            World_Cup_Year             Match_Datetime                Match_Stage 
                         0                          0                          0 
             Match_Stadium                 Match_City             Home_Team_Name 
                         0                          0                          0 
           Home_Team_Goals            Away_Team_Goals             Away_Team_Name 
                         0                          0                          0 
              Match_Detail           Match_Attendance       Half-time_Home_Goals 
                         0                          0                          0 
      Half-time_Away_Goals              Match_Referee  Match_Assistant_Referee_1 
                         0                          0                          0 
 Match_Assistant_Referee_2         Home_Team_Initials         Away_Team_Initials 
                         0                          0                          0 
    World_Cup_Host_Country      World_Cup_Total_Goals  Number_of_Qualified_Teams 
                         0                          0                          0 
   World_Cup_Total_Matches World_Cup_Total_Attendance         World_Cup_Position 
                         0                          0                          0 
           Position_Holder      Match_Goal_Difference    World_Cup_Average_Goals 
                         0                          0                          0 
is.specialNaN <- function(x){
    sum(is.nan(x))
}
sapply(merged_data, is.specialNaN)
                   RoundID                    MatchID       Player_Team_Initials 
                         0                          0                          0 
         Player_Team_Coach           Starting_Line-up        Player_Shirt_Number 
                         0                          0                          0 
               Player_Name            Player_Position               Player_Event 
                         0                          0                          0 
            World_Cup_Year             Match_Datetime                Match_Stage 
                         0                          0                          0 
             Match_Stadium                 Match_City             Home_Team_Name 
                         0                          0                          0 
           Home_Team_Goals            Away_Team_Goals             Away_Team_Name 
                         0                          0                          0 
              Match_Detail           Match_Attendance       Half-time_Home_Goals 
                         0                          0                          0 
      Half-time_Away_Goals              Match_Referee  Match_Assistant_Referee_1 
                         0                          0                          0 
 Match_Assistant_Referee_2         Home_Team_Initials         Away_Team_Initials 
                         0                          0                          0 
    World_Cup_Host_Country      World_Cup_Total_Goals  Number_of_Qualified_Teams 
                         0                          0                          0 
   World_Cup_Total_Matches World_Cup_Total_Attendance         World_Cup_Position 
                         0                          0                          0 
           Position_Holder      Match_Goal_Difference    World_Cup_Average_Goals 
                         0                          0                          0 

Explanation:

We scanned all variables for missing values using the is.na function. After identifying the variable which had missing values (i.e. Match_Attendance), we decided to apply basic imputation technique to replace the missing data. Since it was a numerical variable and a very small fraction of the data set was missing, we replaced the missing values with the mean value of that variable. We used impute function (Hmisc package) for this purpose.

Next, we checked each column of the data frame for special values (i.e. -Inf, Inf and NaN) using is.infinite and is.nan functions. We did not find any special values in our data set.

Scan II

length(boxplot(merged_data$Home_Team_Goals, plot=FALSE)$out)
[1] 18760
merged_data$Home_Team_Goals[merged_data$Home_Team_Goals > 3] <- median(merged_data$Home_Team_Goals)
length(boxplot(merged_data$Home_Team_Goals, plot=FALSE)$out)
[1] 0
length(boxplot(merged_data$Away_Team_Goals, plot=FALSE)$out)
[1] 14328
merged_data$Away_Team_Goals[merged_data$Away_Team_Goals > 2] <- median(merged_data$Away_Team_Goals)
length(boxplot(merged_data$Away_Team_Goals, plot=FALSE)$out)
[1] 0
length(boxplot(merged_data$Match_Attendance, plot=FALSE)$out)
[1] 1744
merged_data <- merged_data[merged_data$Match_Attendance < 100000, ]
length(boxplot(merged_data$Match_Attendance, plot=FALSE)$out)
[1] 0
length(boxplot(merged_data$`Half-time_Home_Goals`, plot=FALSE)$out)
[1] 7920
merged_data$`Half-time_Home_Goals`[merged_data$`Half-time_Home_Goals` > 2] <- median(merged_data$`Half-time_Home_Goals`)
length(boxplot(merged_data$`Half-time_Home_Goals`, plot=FALSE)$out)
[1] 0
length(boxplot(merged_data$`Half-time_Away_Goals`, plot=FALSE)$out)
[1] 2184
merged_data$`Half-time_Away_Goals`[merged_data$`Half-time_Away_Goals` > 2] <- median(merged_data$`Half-time_Away_Goals`)
length(boxplot(merged_data$`Half-time_Away_Goals`, plot=FALSE)$out)
[1] 0
length(boxplot(merged_data$Match_Goal_Difference, plot=FALSE)$out)
[1] 4748
cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}
merged_data$Match_Goal_Difference <- cap(merged_data$Match_Goal_Difference)
length(boxplot(merged_data$Match_Goal_Difference, plot=FALSE)$out)
[1] 0
length(boxplot(merged_data$World_Cup_Average_Goals, plot=FALSE)$out)
[1] 22308
merged_data$World_Cup_Average_Goals[merged_data$World_Cup_Average_Goals > 3] <- mean(merged_data$World_Cup_Average_Goals)
length(boxplot(merged_data$World_Cup_Average_Goals, plot=FALSE)$out)
[1] 0
length(boxplot(merged_data$World_Cup_Total_Goals, plot=FALSE)$out)
[1] 0
length(boxplot(merged_data$Number_of_Qualified_Teams, plot=FALSE)$out)
[1] 0
length(boxplot(merged_data$World_Cup_Total_Matches, plot=FALSE)$out)
[1] 0
length(boxplot(merged_data$World_Cup_Total_Attendance, plot=FALSE)$out)
[1] 0

Explanation:

In this section, we scanned all of the numeric variables for outliers using box plots. Since there were a lot of numeric variables and the space was limited, we did not plot the box plots. We used multiple techniques to deal with the outliers: 1. For variables that contained significant number of outliers, we used mean or median imputation. 2. For variables that had small number of outliers, we excluded/deleted the outliers by simply filtering the data set. 3. For Match_Goal_Difference variable, we used the Capping technique which involves replacing the outliers with the nearest neighbours that are not outliers.

The remaining numerical variables did not have any outliers (as shown in the above code output).

Transform

average_goals <- select(merged_data, World_Cup_Average_Goals)
average_goals_binned_width <- discretize(merged_data$World_Cup_Average_Goals, disc = "equalwidth")
combined1 <- bind_cols(average_goals, average_goals_binned_width)
colnames(combined1) <- c("World_Cup_Average_Goals", "Equal Width Bin")
head(unique(combined1))
average_goals_binned_freq <- discretize(merged_data$World_Cup_Average_Goals, disc = "equalfreq")
combined2 <- bind_cols(average_goals, average_goals_binned_freq)
colnames(combined2) <- c("World_Cup_Average_Goals", "Equal Frequency Bin")
head(unique(combined2))
hist(merged_data$World_Cup_Average_Goals, main = "Histogram of World Cup Average Goals", xlab = "World Cup Average Goals")

hist(combined1$`Equal Width Bin`, main = "Histogram of Equal Width Binning", xlab = "Equal Width Bins")

hist(combined2$`Equal Frequency Bin`, main = "Histogram of Equal Frequency Binning", xlab = "Equal Frequency Bins")


Explanation:

In this section of the report, we applied transformation to World_Cup_Average_Goals variable. The purpose of the transformation was to change the scale for better understanding of the variable. We applied both equal width (distance) binning and equal depth (frequency) binning methods to discretise/transform the numerical variable into categorical counterpart.

In equal-width binning, the variable is divided into n intervals of equal size whereas in equal-depth binning method, the variable is divided into n intervals, each containing approximately the same number of observations. We applied these methods using the discretize function under the infotheo package. By binning the data, the scale of the continuous data was converted into discrete categories/bins. As a result, the distributional properties of the variable also changed which is visible in the above histograms.


---
title: "MATH2349 Semester 1, 2018"
author: Abdullah Abid (s3697995), Muhammad Umer Mazhar (s3677768), Waleed Shafqat
  (s3669576)
subtitle: Assignment 3
output:
  html_notebook: default
  html_document:
    df_print: paged
---


## Required packages 

```{r message=FALSE, warning=FALSE}
library(dplyr)
library(lubridate)
library(tidyr)
library(Hmisc)
library(infotheo)
```


## Executive Summary 

Data Preprocessing is a process and the collection of operations needed to prepare all forms of untidy data for statistical analysis. The 5 major tasks for data include Get, Understand, Tidy & Manipulate, Scan and Transform. To achieve the first task, we located the right data sets and imported them using the Base R functions. Then, we merged the data sets to create a single data set for further preprocessing.

Next, we inspected the data set to understand the various aspects of it. We checked the attributes (metadata) as well as the structure of the data set to find out details such as types of variables, data structures, dimensions, etc. We made some appropriate changes in our data set such as data type conversions and labelling factor variables to make more sense of the data.

We then checked whether the data followed the tidy data principles. As each variable in our data did not have its own column, it was not in a tidy format. Hence, we reshaped it into a tidy format using appropriate method. Then, we created two new variables from the existing variables using the mutate function.

Afterwards, we scanned the data for missing values and inconsistencies. Since one of the variables in our data set had missing values, we applied basic imputation technique to replace them. Next, we scanned all numeric variables for outliers using box plot. We used multiple techniques to deal with the outliers including imputation, excluding outliers and capping.
Finally, we applied transformation to one of the variables to change the scale for better understanding of the variable. We applied both equal width (distance) binning and equal depth (frequency) binning methods to discretize/transform the numerical variable into categorical counterpart.


## Data 

We are going to use three data sets: World Cups, World Cup Players, World Cup Matches. The World Cups data set contains information about all the World Cups held to date. It consists of the following variables:

* Year: year in which the world cup was held 
* Country: country where the world cup was held
* Winner: team that won the world cup 
* Runners-Up: team that came second
* Third: team that came third
* Fourth: team that came fourth
* GoalsScored: total goals scored in the world cup
* QualifiedTeams: number of teams that qualified for the world cup
* MatchesPlayed: total matches played in the world cup
* Attendance: total attendance in the world cup

The World Cup Matches data set contains information regarding all of the matches played in the World Cups held to date. It consists of the following variables:

* Year: year in which the world cup was held 
* Datetime: date and time of the match
* Stage: stage of the world cup when the match was played (e.g. quarter-final, semi-final)
* stadium: stadium name where the match was played
* City: city name where the match was played
* Home Team Name: name of the home team
* Home Team Goals: number of goals scored by the home team
* Away Team Goals: number of goals scored by the away team
* Away Team Name: name of the away team
* Win Conditions: special detail regarding the win (if any)
* Attendance: match attendance 
* Half-time Home Goals: number of goals scored by the home team at half-time
* Half-time Away Goals: number of goals scored by the away team at half-time
* Referee: name of match referee
* Assistant 1: name of assistant referee 1
* Assistant 2: name of assistant referee 2
* RoundID: ID of the round
* MatchID: unique ID of the match
* Home Team Initials: home team's three letter initials
* Away Team Initials: away team's three letter initials

The World Cup Players data set contains information regarding all of the players who played in the World Cup matches held to date. It consists of the following variables:

* RoundID: ID of the round 
* MatchID: unique ID of the match
* Team Initials: three letter initials of the player's team
* Coach Name: name of player's coach
* Line-up: indicates whether the player was in the starting line-up or not
* Shirt Number: shirt number of the player
* Player Name: name of the player
* Position: indicator for goalkeepers and captains
* Event: important event involving the player (if any)

**Data Source**: https://www.kaggle.com/abecklas/fifa-world-cup

```{r}
WorldCup_data <- read.csv("WorldCups.csv")
head(WorldCup_data)

WorldCupMatches_data <- read.csv("WorldCupMatches.csv")
head(WorldCupMatches_data)

WorldCupPlayers_data <- read.csv("WorldCupPlayers.csv")
head(WorldCupPlayers_data)

merged_data <- inner_join(WorldCupPlayers_data, WorldCupMatches_data, by = "MatchID")
merged_data <- inner_join(merged_data, WorldCup_data, by = "Year")
head(merged_data)
```

#### Explanation:

Our first step was to import the data sets using read.csv (Base R function) since the data was stored in csv files. Then, we used inner_join to merge World Cup Players data set with World Cup Matches data set by the key variable: MatchID, which is common in both data sets. Inner join retains the rows with matching key variable values in both data sets. Finally, using inner_join once again, we merged the resulting data set (World Cup Players + World Cup Matches) with World Cups data set by the common variable: Year. As a result, a single data set was created for our report.


## Understand 

```{r}
attributes(merged_data)$class
attributes(merged_data)$names

str(merged_data)

merged_data$Shirt.Number <- as.factor(merged_data$Shirt.Number)
class(merged_data$Shirt.Number)

merged_data$Event <- as.character(merged_data$Event)
class(merged_data$Event)

merged_data$Year <- as.factor(merged_data$Year)
class(merged_data$Year)

merged_data$Datetime <- dmy_hm(merged_data$Datetime)
class(merged_data$Datetime)

merged_data$Win.conditions <- as.character(merged_data$Win.conditions)
class(merged_data$Win.conditions)

merged_data$Attendance.y <- as.character(merged_data$Attendance.y)
merged_data$Attendance.y <- gsub("\\.", "", merged_data$Attendance.y)
merged_data$Attendance.y <- as.integer(merged_data$Attendance.y)
class(merged_data$Attendance.y)

merged_data$Line.up <- factor(merged_data$Line.up, levels = c("N", "S"), labels = c("No", "Yes"))
class(merged_data$Line.up)
levels(merged_data$Line.up)

merged_data <- merged_data[, -26]

colnames(merged_data) <- c("RoundID", "MatchID", "Player_Team_Initials", "Player_Team_Coach", "Starting_Line-up", "Player_Shirt_Number", "Player_Name", "Player_Position", "Player_Event", "World_Cup_Year", "Match_Datetime", "Match_Stage", "Match_Stadium", "Match_City", "Home_Team_Name", "Home_Team_Goals", "Away_Team_Goals", "Away_Team_Name", "Match_Detail", "Match_Attendance", "Half-time_Home_Goals", "Half-time_Away_Goals", "Match_Referee", "Match_Assistant_Referee_1", "Match_Assistant_Referee_2", "Home_Team_Initials", "Away_Team_Initials", "World_Cup_Host_Country", "Winner", "Runners-up", "Third_Position", "Fourth_Position", "World_Cup_Total_Goals", "Number_of_Qualified_Teams", "World_Cup_Total_Matches", "World_Cup_Total_Attendance")
colnames(merged_data)
```

#### Explanation:

First, we checked the attributes (metadata) of the merged data set which returned variable names, object class and row names. We skipped printing the row names as they were meaningless and took a lot of space. Next, we inspected the structure of the data frame to find out the details of each variable such as data type, number of levels (for factor variables), etc. The str function also returned the total number of observations and variables in the data set.

Inspection of data set showed that the data consisted of multiple data types including numerics, characters, factors, date. However, some variables were not imported as correct data type. Hence, we made the following data type conversions to make more sense of the data:

* Shirt.Number variable from integer to factor using as.factor() function.
* Events variable from factor to character using as.character() function.
* Year variable from integer to factor using as.factor() function.
* Datetime variable from factor to POSIXct (date-time). As the order of elements in date was day, month, year, hour, minute we used the dmy_hm() function of lubridate package.
* Win.conditions variable from factor to character using as.character() function.
* Attendance.y variable was first changed from factor to character and then all of the periods were removed. Then, we converted the variable to integer data type.

Next, we labelled the levels of Line.up factor variable from "N" and "S" to "No" and "Yes" respectively. We then removed the redundant RoundID.y column (same as RoundID.x) by simply filtering the data frame. Finally, column names of the data frame were renamed to more suitable and meaningful names using colnames() function.


##	Tidy & Manipulate Data I 

```{r warning=FALSE}
merged_data <- gather(merged_data, World_Cup_Position, Position_Holder, 29:32)

merged_data$World_Cup_Position <- as.factor(merged_data$World_Cup_Position)
merged_data$Position_Holder <- as.factor(merged_data$Position_Holder)

head(merged_data[, c("World_Cup_Position", "Position_Holder")])
```

#### Explanation:

The data did not conform to the tidy data principles as each variable did not have its own column: each World_Cup_Position variable value (i.e. winner, runners up, etc.) had a separate column. Hence, we reshaped the data using gather function (tidyr package) to create a new column for World_Cup_Position values.

The new columns created had character data type. We converted them to factor data type which was more appropriate.


##	Tidy & Manipulate Data II 

```{r warning=FALSE}
merged_data <- mutate(merged_data, Match_Goal_Difference = Home_Team_Goals - Away_Team_Goals)
merged_data <- mutate(merged_data, World_Cup_Average_Goals = World_Cup_Total_Goals / World_Cup_Total_Matches)
head(merged_data[, c("Match_Goal_Difference", "World_Cup_Average_Goals")])
```

#### Explanation:

We created two new variables: Match_Goal_Difference and World_Cup_Average_Goals. For Match_Goal_Difference variable, we calculated the difference between Home_Team_Goals variable and Away_Team-Goals variable. For World_Cup_Average_Goals variable, we divided World_Cup_Total_Goals variable by World_Cup_Total_Matches variable. To complete this task, we used the mutate function (dplyr package).


##	Scan I 

```{r}
sum(is.na(merged_data))
colSums(is.na(merged_data))

merged_data$Match_Attendance <- as.integer(impute(merged_data$Match_Attendance, fun = mean))
sum(is.na(merged_data))

is.specialInf <- function(x){
    sum(is.infinite(x))
}
sapply(merged_data, is.specialInf)

is.specialNaN <- function(x){
    sum(is.nan(x))
}
sapply(merged_data, is.specialNaN)
```

#### Explanation:

We scanned all variables for missing values using the is.na function. After identifying the variable which had missing values (i.e. Match_Attendance), we decided to apply basic imputation technique to replace the missing data. Since it was a numerical variable and a very small fraction of the data set was missing, we replaced the missing values with the mean value of that variable. We used impute function (Hmisc package) for this purpose.

Next, we checked each column of the data frame for special values (i.e. -Inf, Inf and NaN) using is.infinite and is.nan functions. We did not find any special values in our data set.


##	Scan II

```{r}
length(boxplot(merged_data$Home_Team_Goals, plot=FALSE)$out)
merged_data$Home_Team_Goals[merged_data$Home_Team_Goals > 3] <- median(merged_data$Home_Team_Goals)
length(boxplot(merged_data$Home_Team_Goals, plot=FALSE)$out)

length(boxplot(merged_data$Away_Team_Goals, plot=FALSE)$out)
merged_data$Away_Team_Goals[merged_data$Away_Team_Goals > 2] <- median(merged_data$Away_Team_Goals)
length(boxplot(merged_data$Away_Team_Goals, plot=FALSE)$out)

length(boxplot(merged_data$Match_Attendance, plot=FALSE)$out)
merged_data <- merged_data[merged_data$Match_Attendance < 100000, ]
length(boxplot(merged_data$Match_Attendance, plot=FALSE)$out)

length(boxplot(merged_data$`Half-time_Home_Goals`, plot=FALSE)$out)
merged_data$`Half-time_Home_Goals`[merged_data$`Half-time_Home_Goals` > 2] <- median(merged_data$`Half-time_Home_Goals`)
length(boxplot(merged_data$`Half-time_Home_Goals`, plot=FALSE)$out)

length(boxplot(merged_data$`Half-time_Away_Goals`, plot=FALSE)$out)
merged_data$`Half-time_Away_Goals`[merged_data$`Half-time_Away_Goals` > 2] <- median(merged_data$`Half-time_Away_Goals`)
length(boxplot(merged_data$`Half-time_Away_Goals`, plot=FALSE)$out)

length(boxplot(merged_data$Match_Goal_Difference, plot=FALSE)$out)
cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}
merged_data$Match_Goal_Difference <- cap(merged_data$Match_Goal_Difference)
length(boxplot(merged_data$Match_Goal_Difference, plot=FALSE)$out)

length(boxplot(merged_data$World_Cup_Average_Goals, plot=FALSE)$out)
merged_data$World_Cup_Average_Goals[merged_data$World_Cup_Average_Goals > 3] <- mean(merged_data$World_Cup_Average_Goals)
length(boxplot(merged_data$World_Cup_Average_Goals, plot=FALSE)$out)

length(boxplot(merged_data$World_Cup_Total_Goals, plot=FALSE)$out)

length(boxplot(merged_data$Number_of_Qualified_Teams, plot=FALSE)$out)

length(boxplot(merged_data$World_Cup_Total_Matches, plot=FALSE)$out)

length(boxplot(merged_data$World_Cup_Total_Attendance, plot=FALSE)$out)
```

#### Explanation:

In this section, we scanned all of the numeric variables for outliers using box plots. Since there were a lot of numeric variables and the space was limited, we did not plot the box plots. We used multiple techniques to deal with the outliers: 1. For variables that contained significant number of outliers, we used mean or median imputation. 2. For variables that had small number of outliers, we excluded/deleted the outliers by simply filtering the data set. 3. For Match_Goal_Difference variable, we used the Capping technique which involves replacing the outliers with the nearest neighbours that are not outliers.

The remaining numerical variables did not have any outliers (as shown in the above code output).


##	Transform 

```{r}
average_goals <- select(merged_data, World_Cup_Average_Goals)

average_goals_binned_width <- discretize(merged_data$World_Cup_Average_Goals, disc = "equalwidth")
combined1 <- bind_cols(average_goals, average_goals_binned_width)
colnames(combined1) <- c("World_Cup_Average_Goals", "Equal Width Bin")
head(unique(combined1))

average_goals_binned_freq <- discretize(merged_data$World_Cup_Average_Goals, disc = "equalfreq")
combined2 <- bind_cols(average_goals, average_goals_binned_freq)
colnames(combined2) <- c("World_Cup_Average_Goals", "Equal Frequency Bin")
head(unique(combined2))

hist(merged_data$World_Cup_Average_Goals, main = "Histogram of World Cup Average Goals", xlab = "World Cup Average Goals")
hist(combined1$`Equal Width Bin`, main = "Histogram of Equal Width Binning", xlab = "Equal Width Bins")
hist(combined2$`Equal Frequency Bin`, main = "Histogram of Equal Frequency Binning", xlab = "Equal Frequency Bins")
```

<br>

#### Explanation:

In this section of the report, we applied transformation to World_Cup_Average_Goals variable. The purpose of the transformation was to change the scale for better understanding of the variable. We applied both equal width (distance) binning and equal depth (frequency) binning methods to discretise/transform the numerical variable into categorical counterpart.

In equal-width binning, the variable is divided into n intervals of equal size whereas in equal-depth binning method, the variable is divided into n intervals, each containing approximately the same number of observations. We applied these methods using the discretize function under the infotheo package. By binning the data, the scale of the continuous data was converted into discrete categories/bins. As a result, the distributional properties of the variable also changed which is visible in the above histograms.

<br>
