R provides extensive range of functions which is used to perform various preprocessing tasks.Below are the packages of functions that has been installed in order to carry out the given tasks: 1.readr package is used to import files in r using read_csv() function.
2.dplyr package is used for data manipulations tasks such as mutating,joining etc.
3.tidyr package is used to tidy the datasets which includes functions like separate(),unite().
4.hmisc package allows us to compute imputed values from the non-missing values.
library(readr)
library(dplyr)
Registered S3 method overwritten by 'dplyr':
method from
print.rowwise_df
Attaching package: 㤼㸱dplyr㤼㸲
The following objects are masked from 㤼㸱package:stats㤼㸲:
filter, lag
The following objects are masked from 㤼㸱package:base㤼㸲:
intersect, setdiff, setequal, union
library(tidyr)
library(Hmisc)
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2
Registered S3 methods overwritten by 'htmltools':
method from
print.html tools:rstudio
print.shiny.tag tools:rstudio
print.shiny.tag.list tools:rstudio
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Attaching package: 㤼㸱Hmisc㤼㸲
The following objects are masked from 㤼㸱package:dplyr㤼㸲:
src, summarize
The following objects are masked from 㤼㸱package:base㤼㸲:
format.pval, units
library(lubridate)
Attaching package: 㤼㸱lubridate㤼㸲
The following object is masked from 㤼㸱package:base㤼㸲:
date
library(outliers)
Data preprocessing is done to get the consistent and understandable data so that statistical analysis can be performed further on it .For this purpose,we have to do five major tasks which is as follows: 1.getting data -> importing from the available source 2.understanding the structure of data 3.tidying and manipulating the data 4.scanning the data -> scanning for missing values and outliers 5.transforming the data -> scaling or standardising the data
In this assignment,the unpreprocessed data sets are taken and above steps are followed to give a preprocessed data.
The World Cup Matches dataset shows all the results from the matches contested as part of the cups and World Cup Players dataset shows all the result of players participated in the worldcup.
Two data sets namely worldcupmatches and worldcupplayers are taken from kaggle source. https://www.kaggle.com/abecklas/fifa-world-cup. These datasets are imported into R using read_csv function after which subsetting is done by taking main variables from each data sets.’WorldcupMatches’ is subsetted into matches1 with 11 variables and 853 observations , MatchID - unique ID of the match Round ID -unique Id of the match Datetime - The Date on which the match was played along with a 24 hour format time Stage - The stage at which the match was played Stadium - Stadium name where the match was held City - The city name, where the match was played Home Team Name - Home team country name Away Team Name - Away team country name Home Team Goals - Total goals scored by the home team by the end of the match Away Team Goals - Total goals scored by the away team by the end of the match Attendance - Total crowd present at the satdium
‘WorlcupPlayers’ is subsetted into players1 with 6 variables and 37785 observations, MatchID - unique ID of the match Round ID -unique Id of the match Line-up - eleven players who takes first play of a game Shirt Number player name position - position of player in the team
These two subsetted dataset is merged into ‘world_cup_matches’ using inner join function which will return rows in both sets.After combining ,the dataset will have 39256 observations and 17 variables.
matches <- read_csv("WorldCupMatches.CSV")
Parsed with column specification:
cols(
.default = col_character(),
Year = [32mcol_double()[39m,
`Home Team Goals` = [32mcol_double()[39m,
`Away Team Goals` = [32mcol_double()[39m,
Attendance = [32mcol_double()[39m,
`Half-time Home Goals` = [32mcol_double()[39m,
`Half-time Away Goals` = [32mcol_double()[39m,
RoundID = [32mcol_double()[39m,
MatchID = [32mcol_double()[39m
)
See spec(...) for full column specifications.
players <- read_csv("WorldCupPlayers.CSV")
Parsed with column specification:
cols(
RoundID = [32mcol_double()[39m,
MatchID = [32mcol_double()[39m,
`Team Initials` = [31mcol_character()[39m,
`Coach Name` = [31mcol_character()[39m,
`Line-up` = [31mcol_character()[39m,
`Shirt Number` = [32mcol_double()[39m,
`Player Name` = [31mcol_character()[39m,
Position = [31mcol_character()[39m,
Event = [31mcol_character()[39m
)
matches1 <- matches[,c("MatchID","RoundID","Datetime","Stage","Stadium","City","Home Team Name","Away Team Name","Home Team Goals","Away Team Goals","Attendance")]
players1 <- players[,c("MatchID","RoundID","Line-up","Shirt Number","Player Name","Position")]
world_cup_matches <- inner_join(matches1,players1,by = c("MatchID","RoundID"))
Data type is determined using sapply function and structure of data type is found by str() function.Similarly,class and dimensions of dataset is found using class & dim function respectively.
The Word_cup_matches datasets contains char,numeric data types.Appropriate data type conversions is done for three variables Line-up,position and datetime which meets (2-4) tasks requirements.
First, Line-up variable is renamed into starting_11 to avoid confusion of same column and value name after factorizing.
Then starting_11 is converted from character to factor and labelled as N->line-up & S-> Substitute.
Likewise, position variable is converted into factor and labelled as Gk -> goalkeeper,C -> captain, GKC -> Goalkeeper&captain.
For converting the datatype of ‘datetime’ as date,few tidying steps are handled . Firstly ,the ‘datetime’ is separated into ‘date’ and ‘time’ variable using separate function followed by which ‘date’ variable is again separated into Date,Month and year in order to label month as 1,2,3 etc through which we can achieve format as (ie.01/07/1930).Then all these three variables are united into single variable ‘date’.
Before labelling and uniting , month variable which has two types of format June and Jun is limited to 3 char(ie Jun) with the help of for function .
This ‘for’ function will read all the months row wise which has 4 char and substr function will take first three position of month.
Finally ,the obtained ‘date’ variable is converted into date data type.
sapply(world_cup_matches,typeof)
MatchID RoundID Datetime Stage Stadium
"double" "double" "character" "character" "character"
City Home Team Name Away Team Name Home Team Goals Away Team Goals
"character" "character" "character" "double" "double"
Attendance Line-up Shirt Number Player Name Position
"double" "character" "double" "character" "character"
str(world_cup_matches)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 39256 obs. of 15 variables:
$ MatchID : num 1096 1096 1096 1096 1096 ...
$ RoundID : num 201 201 201 201 201 201 201 201 201 201 ...
$ Datetime : chr "13 Jul 1930 - 15:00" "13 Jul 1930 - 15:00" "13 Jul 1930 - 15:00" "13 Jul 1930 - 15:00" ...
$ Stage : chr "Group 1" "Group 1" "Group 1" "Group 1" ...
$ Stadium : chr "Pocitos" "Pocitos" "Pocitos" "Pocitos" ...
$ City : chr "Montevideo" "Montevideo" "Montevideo" "Montevideo" ...
$ Home Team Name : chr "France" "France" "France" "France" ...
$ Away Team Name : chr "Mexico" "Mexico" "Mexico" "Mexico" ...
$ Home Team Goals: num 4 4 4 4 4 4 4 4 4 4 ...
$ Away Team Goals: num 1 1 1 1 1 1 1 1 1 1 ...
$ Attendance : num 4444 4444 4444 4444 4444 ...
$ Line-up : chr "S" "S" "S" "S" ...
$ Shirt Number : num 0 0 0 0 0 0 0 0 0 0 ...
$ Player Name : chr "Alex THEPOT" "Oscar BONFIGLIO" "Marcel LANGILLER" "Juan CARRENO" ...
$ Position : chr "GK" "GK" NA NA ...
class(world_cup_matches)
[1] "tbl_df" "tbl" "data.frame"
dim(world_cup_matches)
[1] 39256 15
colnames(world_cup_matches)[colnames(world_cup_matches) == "Line-up"] <- "starting_11"
world_cup_matches$starting_11 <- world_cup_matches$starting_11 %>% factor(levels = c("N","S"),labels = c("Line-up","Substitute"),ordered = TRUE)
world_cup_matches$Position <- world_cup_matches$Position %>% factor(levels = c("GK","C","GKC"),labels = c("Goalkeeper","Captain","GoalKeeper&captain"),ordered = TRUE)
world_cup_matches <- world_cup_matches %>% separate('Datetime',into = c("date","time"),sep = "-")
world_cup_matches <- world_cup_matches %>% separate('date',into = c("Date","Month","Year"),sep = " ")
Expected 3 pieces. Additional pieces discarded in 39256 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
for (i in c(1:nrow(world_cup_matches)))
{
if(nchar(world_cup_matches$Month[i]) == 4)
world_cup_matches$Month[i] <- substr(world_cup_matches$Month[i],1,3)
}
world_cup_matches$Month <- world_cup_matches$Month %>% factor(levels = c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"),labels = c(1:12),ordered =TRUE)
world_cup_matches <- world_cup_matches %>% unite('date',Date,Month,Year,sep = "/")
world_cup_matches$date <- as.Date(world_cup_matches$date)
str(world_cup_matches)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 39256 obs. of 16 variables:
$ MatchID : num 1096 1096 1096 1096 1096 ...
$ RoundID : num 201 201 201 201 201 201 201 201 201 201 ...
$ date : Date, format: "0013-07-19" "0013-07-19" ...
$ time : chr " 15:00" " 15:00" " 15:00" " 15:00" ...
$ Stage : chr "Group 1" "Group 1" "Group 1" "Group 1" ...
$ Stadium : chr "Pocitos" "Pocitos" "Pocitos" "Pocitos" ...
$ City : chr "Montevideo" "Montevideo" "Montevideo" "Montevideo" ...
$ Home Team Name : chr "France" "France" "France" "France" ...
$ Away Team Name : chr "Mexico" "Mexico" "Mexico" "Mexico" ...
$ Home Team Goals: num 4 4 4 4 4 4 4 4 4 4 ...
$ Away Team Goals: num 1 1 1 1 1 1 1 1 1 1 ...
$ Attendance : num 4444 4444 4444 4444 4444 ...
$ starting_11 : Ord.factor w/ 2 levels "Line-up"<"Substitute": 2 2 2 2 2 2 2 2 2 2 ...
$ Shirt Number : num 0 0 0 0 0 0 0 0 0 0 ...
$ Player Name : chr "Alex THEPOT" "Oscar BONFIGLIO" "Marcel LANGILLER" "Juan CARRENO" ...
$ Position : Ord.factor w/ 3 levels "Goalkeeper"<"Captain"<..: 1 1 NA NA NA 2 NA NA NA NA ...
head(world_cup_matches)
NA
NA
NA
NA
The tidy dataset should satisfy three rules which are : 1.Each variable must have its own column. 2.Each observation must have its own row. 3.Each value have its own cell.
(The world_cup_matches dataset which has ‘datetime’ variable in untidy format is tidied in the above task.)
Therefore,the dataset is in tidy format as it satisfies the above mentioned rules.
‘winner’ variable which is created with the help of existing variables Home Team Goals,Away Team Goals, Home Team Name and Away Team Name will display the team name of who has got highest goal.
The below code will compare the goals taken by the Home team & Away team and will print the name of the team who has got the highest score in the ‘winner’ variable.
This logic is done using for and if else function.
Then ‘winner’ Column variable is added to the dataset by Cbind function.
Winner <- vector(mode = "character",length = nrow(world_cup_matches))
for (row in c(1:nrow(world_cup_matches))) {
if(world_cup_matches$`Home Team Goals`[row] > world_cup_matches$`Away Team Goals`[row])
Winner = world_cup_matches$`Home Team Name`
else
Winner = world_cup_matches$`Away Team Name`
}
world_cup_matches <- cbind(world_cup_matches,Winner)
World_cup_mathes is scanned for missing values ,inconsistencies and obvious errors.
Colsums function is used to find missing values in all variables .
sum (is.nan()) is used to find nan values and sum(is.infinite()) function is used to find infinite values in all the variables.
Position and Attendance variables has missing values in the datasets ,Since attendance variable is numeric we have imputed it with mean .Likewise mode imputation is done for position variable as it is char.
colSums(is.na(world_cup_matches))
MatchID RoundID date time Stage
0 0 0 0 0
Stadium City Home Team Name Away Team Name Home Team Goals
0 0 0 0 0
Away Team Goals Attendance starting_11 Shirt Number Player Name
0 0 0 0 0
Position Winner
0 0
for (col in ncol(world_cup_matches)) {
print(sum(is.nan(world_cup_matches[,col])))
print(sum(is.infinite(world_cup_matches[,col])))
}
[1] 0
[1] 0
world_cup_matches$Position <- impute(world_cup_matches$Position,fun = mode)
world_cup_matches$Attendance <- impute(world_cup_matches$Attendance,fun = mean)
The dataset is being scanned for outliers . using boxplot function ,the outliers for the numeric variables Home Team Goals and Away Team Goalsare shown.
There are very few outliers in the both Home Team Goals and Away Team Goals.To handle this,capping method is used . In capping,for outliers that lie outside the outlier fences on a box-plot hence, observations outside the lower limit are replaced with the 5th percentile and those observations that lie above the upper limit are replaced with 95th percentile.
boxplot(world_cup_matches$`Home Team Goals`,world_cup_matches$`Away Team Goals`)
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
world_cup_matches$`Home Team Goals` <- world_cup_matches$`Home Team Goals` %>% cap()
world_cup_matches$`Away Team Goals` <- world_cup_matches$`Away Team Goals` %>% cap()
boxplot(world_cup_matches$`Home Team Goals`,world_cup_matches$`Away Team Goals`)
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
Transformation is done for variable ‘Attendance’ in the dataset.
Histogram of that variable is found using hist() function,through which we can see that it is right skewed.
To transform the right skewed distribution to normal distribution,sqrt tranformation and log tranformation is applied.
By comparing the both histogram of transformed variable,its seen that sqrt transformation has normalized the data properly.
(i.e) sqrt function has reduced the right skewness.
Summary staistics is also performed for this variable.
hist(world_cup_matches$Attendance)
sqrt_Attendance <- sqrt(world_cup_matches$Attendance)
log_Attendance <- log(world_cup_matches$Attendance)
hist(sqrt_Attendance)
hist(log_Attendance)
summary(sqrt_Attendance)
184 values imputed to 214.4594
Min. 1st Qu. Median Mean 3rd Qu. Max.
44.72 174.93 207.36 207.08 249.20 416.95