library(readr)
library(readxl)
library(magrittr)
library(lubridate)
library(tidyr)
library(dplyr)
library(forecast)
The aim of this assignment is to apply the principles of data pre- processing, data integrity, ,data manipulation techniques, use techniques to identify and handle missing values and outliers. In short, we will be applying the data wrangling and manipulation techniques to a real-world data problem and demonstrate our skills in data pre-processing.
In this investigation we have taken an open dataset from Kaggle. The dataset is about IPL(Indian Premier League, a professional cricket league) from Seasons 1 to 9.
Dataset source(https://www.kaggle.com/harsha547/indian-premier-league-csv-dataset)
The dataset source consists of 4 files, which gives us an insight about the performance of teams as well as Individual players for IPL. The dataset Match.csv file has been considered as the main file for The objective is to make the data more meaningful for any further analysis. In order to make Match.csv as a single point of consideration, Player, Season, Team and ball have been merged to the Match dataset with their corresponding variables. This will result in one data set that can be used for further analysis. However, the data is untidy, so pre processing steps will be applied.
Preprocessing involved applying tidy data principles to the merged dataset Match, removing unwanted and repeating variables.
This was followed by converting variables into suitable data types and labelling factors, renaming variables.
The dataset contains 9 character variables, 1 date variable, 4 numeric variables and 4 factor variables. The dataset was scanned for missing values and no missing values were found. The dataset was also inspected for outliers and suitable outliers were capped using the capping function. Finally, a new variable (run_rate: type numeric) was created. Visualisation of run_rate using a histogram shows that the data is skewed to the left. BoxCox transformation has been applied to make the data normal.
Source of the Dataset : https://www.kaggle.com/harsha547/indian-premier-league-csv-dataset
The dataset consist of details about Indian Premier League from Seasons 1 to 9.
Overall, the dataset consist of 4 files, which gives us an insight about the performance of Teams as well as Individual players.
Details of the dataset are as follows: 1. Parent source: www.kaggle.com
Dataset Name : Indian Premier League CSV Dataset
Following files are considered for the purpose of the assignment: (Minimum Req - 2 & 3)
a.Match : Match by match data for all seasons including details such as performing teams, toss, runs, winner etc. Each Match is uniquely identified by Match_ID.
b.Player : Details of players such as Name, Date of Birth, Role etc. Each player is uniquely identified by Player_ID.
c.Season : Details of total number of seasons played. Each Season is uniquely identified by Season_ID.
d.Team : The performing teams are listed and each team is uniquely identified by Team_ID.
The dataset Match.csv file has been considered as the main file for analysis and preprocessing the datasets. In order to make Match.csv as our main file, all other datasets i.e Player, Season, and Team have been merged to the Match dataset with their corresponding variables.
For example, in order to get description of the Team name and opponent team name, we have merged the Team.csv to Match.csv using the common and unique variable Team_ID.
Similar procedure was carried out to get details of Toss winning team and man of the match. (Minimum Requirement-1)
Match <- read_csv("Match.csv")
Parsed with column specification:
cols(
Match_Id = col_integer(),
Match_Date = col_character(),
Team_Name_Id = col_integer(),
Opponent_Team_Id = col_integer(),
Season_Id = col_integer(),
Venue_Name = col_character(),
Toss_Winner_Id = col_integer(),
Toss_Decision = col_character(),
IS_Superover = col_integer(),
IS_Result = col_integer(),
Is_DuckWorthLewis = col_integer(),
Win_Type = col_character(),
Won_By = col_character(),
Match_Winner_Id = col_integer(),
Man_Of_The_Match_Id = col_integer(),
First_Umpire_Id = col_integer(),
Second_Umpire_Id = col_integer(),
City_Name = col_character(),
Host_Country = col_character()
)
Player <- read_csv("Player.csv")
Missing column names filled in: 'X8' [8]Parsed with column specification:
cols(
Player_Id = col_integer(),
Player_Name = col_character(),
DOB = col_character(),
Batting_Hand = col_character(),
Bowling_Skill = col_character(),
Country = col_character(),
Is_Umpire = col_integer(),
X8 = col_integer()
)
Team <- read_csv("Team.csv")
Parsed with column specification:
cols(
Team_Id = col_integer(),
Team_Name = col_character(),
Team_Short_Code = col_character()
)
Ball <- read_excel("Ball_by_Ball.csv.xlsx")
head(Match)
head(Player)
head(Ball)
head(Team)
#Min Req 1- Merging datasets#
mergedData1 <- merge(Match, Team, by.x = c("Team_Name_Id"), by.y = c("Team_Id"))
mergedData2 <- merge(mergedData1, Team, by.x = c("Opponent_Team_Id"), by.y = c("Team_Id"))
mergedData3 <- merge(mergedData2, Team, by.x = c("Toss_Winner_Id"), by.y = c("Team_Id"))
mergedData4 <- merge(mergedData3, Team, by.x = c("Match_Winner_Id"), by.y = c("Team_Id"))
column names <U+393C><U+3E31>Team_Name.x<U+393C><U+3E32>, <U+393C><U+3E31>Team_Short_Code.x<U+393C><U+3E32>, <U+393C><U+3E31>Team_Name.y<U+393C><U+3E32>, <U+393C><U+3E31>Team_Short_Code.y<U+393C><U+3E32> are duplicated in the result
mergedData5 <- merge(mergedData4, Player, by.x = c("Man_Of_The_Match_Id"), by.y = c("Player_Id"))
column names <U+393C><U+3E31>Team_Name.x<U+393C><U+3E32>, <U+393C><U+3E31>Team_Short_Code.x<U+393C><U+3E32>, <U+393C><U+3E31>Team_Name.y<U+393C><U+3E32>, <U+393C><U+3E31>Team_Short_Code.y<U+393C><U+3E32> are duplicated in the result
The merged output “mergerdData5”,dataset is untidy and difficult to understand. There are some repeating variables, some variables contain binary information, unstructured date variables which are making the data untidy. Tidying it up is necessary for clear visualisation and understanding. (Minimum Req-5)
Following are the issues with the merged data:
The variable names are repeating and do not appear meaningful. Renaming the variables will make the variable names more appropriate.
There are redundant variables in the data, for example Team_ID and Host_Team have the same information. This is causing the data to spread out unnecessarily and it is increasing the number of variables.
Data variables need to be rearranged to make the data unambiguous.
The date needs to be converted from Character to Date format. (Minimum Req - 3)
The dataset needs to be factored and labelled to have a better understanding and make it more usable for any further anaylsis.(Minimum Req - 4)
The data needs to be ordered to have a better understanding. (Minimum Req - 4)
Match_ID uniquely identifies a match. The variable is in numeric format and is not required for any further calculations. Since it is an identifier only it can be converted to a charecter format.(Min Req - 3)
The won_by variable needs to be tidied up and eventually converte to a numeric variable. (Min Req - 3 & 5 & 8)
Gather and tidy up the Venue, City and Host Country. (Min Req - 5)
Following are the R-Codes used to tidy up the data as mentioned in the minimum requirements.
#1.The Column names are repeating and not clear, so renaming the variables#
colnames(mergedData5) <- c("Man_Of_The_Match_Id", "Match_Winner_Id", "Toss_Winnder_Id", "Opponent_Team_Id", "Team_Name_Id", "Match_Id", "Match_Date", "Season_Id", "Venue_Name", "Toss_Decision", "SuperOver","Result", "Duckworth_Lewis", "Win_Type", "Won_by", "First_Umpire", "Second_Umpire", "City_Name","Host_Country", "Host_Team", "Host_Code", "Opponent_Team", "Opponent_Code", "Toss_won_by", "Toss_Code", "Match_won_by", "winner_code", "Man_of_the_Match", "DOB", "B_Hand", "Bo_Hand", "Origin_Country", "Is_Umpire", "x8")
#2.cleaning up the data by removing unwanted variables#
finaldata <- mergedData5 %>% select(-1,-2,-3,-4,-5,-16,-17,-21,-23,-25,-27,-29,-30,-31,-32,-33,-34)
head(finaldata)
#3.Data is still difficult to understand, hence rearranging the coulmns from the data set to have better understanding of the data#
IPL_Data <- finaldata[,c(3,1,2,4,11,12,13,14,15,5,6,7,8,16,9,10,17)]
#Final Dataset is IPL_Data to pre-process- min req -2 &3#
IPL_Data$Match_Date = dmy(IPL_Data$Match_Date)
str(IPL_Data$Match_Date)
Date[1:574], format: "2008-05-11" "2008-05-08" "2010-04-07" "2010-04-01" "2012-04-21" "2015-04-25" "2011-04-15" "2015-04-11" "2011-04-18" "2008-04-18" "2008-04-20" ...
#6.factoring - min req 3 & 4 #
IPL_Data$Result <- factor(IPL_Data$Result,levels = c("1","0"),labels = c("yes","no"))
IPL_Data$SuperOver<- factor(IPL_Data$SuperOver,levels = c("1","0"),labels = c("yes","no"))
IPL_Data$Duckworth_Lewis<- factor(IPL_Data$Duckworth_Lewis,levels = c("1","0"),labels = c("D/L","No"))
#7 Converting won_by to numeric, converting character null into character 0 and then converting variable to numeric#
IPL_Data$Won_by[IPL_Data$Won_by == "NULL"] <- "0"
IPL_Data$Won_by <- as.numeric(IPL_Data$Won_by)
str(IPL_Data$Won_by)
num [1:574] 23 5 14 24 20 97 8 45 7 140 ...
#9.ordering by season id
IPL_Data$Season_Id <- factor(IPL_Data$Season_Id,levels = c("1","2","3","4","5","6","7","8","9","10"),ordered = TRUE)
IPL_Data$Season_Id <- sort(IPL_Data$Season_Id, decreasing = FALSE)
str(IPL_Data$Season_Id)
Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 1 1 1 1 1 1 1 1 1 ...
#10.Unite the columns venue, city and country- min req5#
IPL_Data <- IPL_Data%>%unite(venue_name, Venue_Name,City_Name,Host_Country,sep = ",")
head(IPL_Data)
For minimmum req 6, a new variable needs to be created (mutated) in the data. The new variable would would contain total runs scored in a particular match.
Therefore, a dataframe has been created summarising the total runs scored grouped by match and adding it as ‘Total_Runs_in_match’ to the IPL_data. (Min Req - 1 & 7). The R-Codes are as follows.
#Creating new variable from existing one- min req 6#
#for that, creating a new data frame and joining to the data set#
Ball$Batsman_Scored<- as.numeric(Ball$Batsman_Scored)
df_run2<- Ball%>% select(Match_Id,Batsman_Scored)
Comp_Run<- tbl_df(df_run2)
Comp_Run<- Comp_Run %>% group_by(Match_Id) %>% summarise(Total_Runs_in_match = sum(Batsman_Scored,na.rm = TRUE))
new_Ipl<-left_join(IPL_Data,Comp_Run,by = "Match_Id")
#Changing the Match_ID as character, as it has no numeric significance.
new_Ipl$Match_Id <- new_Ipl$Match_Id %>% as.character()
class(new_Ipl$Match_Id)
[1] "character"
head(new_Ipl)
The new dataframe new_IPL created in the above step has 574 rows and 18 variables. By analysing the structure, we deduce that this dataset has 9 character variables, 1 date variable, 4 numeric variables and 4 factor variables which meets the minimum requirement 2,3 and 4.
#Data Inspection#
dim(new_Ipl)
[1] 574 16
#Checking Structure of the data, The data has satisfied minimun req 2 to 4, Ordered, Numeric, Character, Labelled Factors#
str(new_Ipl)
'data.frame': 574 obs. of 16 variables:
$ Season_Id : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 1 1 1 1 1 1 1 1 1 ...
$ Match_Id : chr "336019" "336015" "419149" "419140" ...
$ Match_Date : Date, format: "2008-05-11" "2008-05-08" "2010-04-07" "2010-04-01" ...
$ venue_name : chr "Rajiv Gandhi International Stadium, Uppal,Hyderabad,India" "Eden Gardens,Kolkata,India" "Eden Gardens,Kolkata,India" "Eden Gardens,Kolkata,India" ...
$ Host_Team : chr "Deccan Chargers" "Kolkata Knight Riders" "Kolkata Knight Riders" "Kolkata Knight Riders" ...
$ Opponent_Team : chr "Kolkata Knight Riders" "Royal Challengers Bangalore" "Delhi Daredevils" "Deccan Chargers" ...
$ Toss_won_by : chr "Kolkata Knight Riders" "Kolkata Knight Riders" "Kolkata Knight Riders" "Kolkata Knight Riders" ...
$ Toss_Decision : chr "bat" "bat" "bat" "bat" ...
$ SuperOver : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
$ Result : Factor w/ 2 levels "yes","no": 1 1 1 1 1 1 1 1 1 1 ...
$ Duckworth_Lewis : Factor w/ 2 levels "D/L","No": 2 2 2 2 2 2 2 2 1 2 ...
$ Match_won_by : chr "Kolkata Knight Riders" "Kolkata Knight Riders" "Kolkata Knight Riders" "Kolkata Knight Riders" ...
$ Win_Type : chr "by runs" "by runs" "by runs" "by runs" ...
$ Won_by : num 23 5 14 24 20 97 8 45 7 140 ...
$ Man_of_the_Match : chr "SC Ganguly" "SC Ganguly" "SC Ganguly" "SC Ganguly" ...
$ Total_Runs_in_match: num 358 242 332 317 344 267 338 347 260 268 ...
#summary#
summary(new_Ipl)
Season_Id Match_Id Match_Date venue_name Host_Team Opponent_Team Toss_won_by Toss_Decision
6 : 76 Length:574 Min. :2008-04-18 Length:574 Length:574 Length:574 Length:574 Length:574
5 : 74 Class :character 1st Qu.:2010-03-31 Class :character Class :character Class :character Class :character Class :character
4 : 72 Mode :character Median :2012-05-01 Mode :character Mode :character Mode :character Mode :character Mode :character
3 : 60 Mean :2012-05-06
7 : 60 3rd Qu.:2014-05-10
9 : 60 Max. :2016-05-29
(Other):172
SuperOver Result Duckworth_Lewis Match_won_by Win_Type Won_by Man_of_the_Match Total_Runs_in_match
yes: 6 yes:574 D/L: 15 Length:574 Length:574 Min. : 0.00 Length:574 Min. :117.0
no :568 no : 0 No :559 Class :character Class :character 1st Qu.: 6.00 Class :character 1st Qu.:260.0
Mode :character Mode :character Median : 8.00 Mode :character Median :291.0
Mean : 17.17 Mean :288.4
3rd Qu.: 20.75 3rd Qu.:321.0
Max. :144.00 Max. :448.0
is.na() is used to find out the missing values in the data. On scanning using is.na no missing values have been found in the tidy data. (Min Req -7)
The null values were already treated in Tidy and Manipulate I.
#There are no missing values in the tidy data- min req7#
colSums(is.na(new_Ipl))
Season_Id Match_Id Match_Date venue_name Host_Team Opponent_Team Toss_won_by
0 0 0 0 0 0 0
Toss_Decision SuperOver Result Duckworth_Lewis Match_won_by Win_Type Won_by
0 0 0 0 0 0 0
Man_of_the_Match Total_Runs_in_match
0 0
The numeric variable ‘Total runs in a match’ has been scanned for outliers. In this case, the numeric outliers are a representation of matches with unusual statistics, which can be visualised by the boxplot.
To satisfy Min Req-8, we have used capping function to treat the outliers. After capping the values, plotting a BoxPlot confirms there there are no outliers in the capped data.
Following variables have not been considered while identifying outliers:
1.Match_ID : Match_ID has no numeric significance and hence, converted into character.
y <- new_Ipl$Total_Runs_in_match %>% boxplot(main="Total runs in a match", ylab="Runs", col = "grey")
y$out
[1] 430 153 418 117 420 161 165 150 149 448 165 152 134
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
new_Ipl["cap_run"]<- new_Ipl$Total_Runs_in_match %>% cap()
new_Ipl$cap_run %>% boxplot(main="cap run", ylab="Runs", col = "grey")
A new variable Run_rate has been created to contain total runs divided by total number of overs in a match(40). (Min Req - 6).
From the histogram of run_rate we can see that the data is skewed to the left - not normal. But the margin is not so significant that any mathematical transformation would give us a normal data. Applying mathematical transformation is leading to more skewed data. Hence, BoxCox transformation has been applied for the variable run_rate.
After BoxCox transformation, on plotting the histogram we can see that the skewness has decreased and the distribution is fitting normal distribution. (Min Req - 9)
#Transform#
#Adding new variable run rate as total runs divided by total overs 40#
new_Ipl["Run_Rate"] <- (new_Ipl$Total_Runs_in_match/40)
new_Ipl
hist(new_Ipl$Run_Rate,main = "Run Rate" , col = c("lightblue"))
#not significant but the histogram is skewed to the left#
#applying box cox transformation because the mathematical operations did not produce significant normal results#
#Purpose of the transformation is to decrease the skewness and covert the distribution into a normal distribution#
boxcox_rr <- BoxCox(new_Ipl$Run_Rate, lambda = "auto")
hist(boxcox_rr, main = "Transformed run rate" , col = c("grey"))