# This is the R chunk for the required packages
library(readr)
library(dplyr)
library(tidyr)
library(lubridate)
library(outliers)
library(forecast)
library(readxl)
The aim of this assignment is to preprocess two dataset of Indian Premier Legue (IPL), which is a proffessional cricket legue which occurs in India every year. The two dataset I am using for this assignment are Players dataset which consist of the information about players and most_runs_average_strikerate dataset which consist of data related to the batting score who played under IPL from 2008-2019.
The two dataset which consist of players and most_runs_average_strikerate from Indian Premier league during the period 2009-2019 was extracted from kaggle : https://www.kaggle.com/ramjidoolla/ipl-data-set.
Player dataset
Player_Name : Name of the Players
DOB : Date of Birth Batting_Hand: Batting style Bowling_Skill : Bowling skill *Country: Country they represent in the International format
most_runs_average_strikerate dataset
Additional 2 variables were created using mutate function
Coloumn name batman from most_runs_average_strikerate dataset was changed to “Player_Name” for merging purpose
# This is the R chunk for the Data Section
Player <- read_excel("C:/Users/ADMIN/Downloads/archive/Players.xlsx")
runs <- read.csv("C:/Users/ADMIN/Downloads/archive/most_runs_average_strikerate.csv")
colnames(runs)[colnames(runs) == "batsman"] <- "Player_Name"
New_Data <- merge(runs, Player, by= "Player_Name")
head(New_Data)
NA
NA
The New dataset New_Data has been created using merge() function. The new dataset has 516 observations and 10 variables.
Below we checked the head of the data using head() function and structure of the data using str() function.
# This is the R chunk for the Understand Section
head(New_Data)
str(New_Data)
'data.frame': 516 obs. of 10 variables:
$ Player_Name : chr "A Ashish Reddy" "A Chandila" "A Chopra" "A Choudhary" ...
$ total_runs : int 280 4 53 25 4 62 150 15 35 366 ...
$ out : int 15 1 5 2 0 2 6 0 2 29 ...
$ numberofballs: int 191 7 71 20 5 53 118 13 46 385 ...
$ average : num 18.7 4 10.6 12.5 NA ...
$ strikerate : num 146.6 57.1 74.6 125 80 ...
$ DOB : POSIXct, format: "1991-02-24" "1983-12-05" ...
$ Batting_Hand : chr "Right_Hand" "Right_Hand" "Right_Hand" "Right_hand" ...
$ Bowling_Skill: chr "Right-arm medium" "Right-arm offbreak" "Right-arm offbreak" "Left-arm fast-medium" ...
$ Country : chr "India" "India" "India" NA ...
Frrom above we can see that the New_Data dataset consist of multiple data types such numerics, characters, factors (not formatted appropriately).
To consider a dataset to be tidy the dataset needs to satisfy the below crieteria:- Each variable have its own column Each observation have its own row *Each value have its own cell.
As we can observe in the New_data dataset, Some cells of the variable country and DOB does not have values in it. str() suggest that the variables like DOB, Country, Batting_Hand, Bowling_skill etc are not formatted correctly.
str(New_Data)
'data.frame': 516 obs. of 10 variables:
$ Player_Name : chr "A Ashish Reddy" "A Chandila" "A Chopra" "A Choudhary" ...
$ total_runs : int 280 4 53 25 4 62 150 15 35 366 ...
$ out : int 15 1 5 2 0 2 6 0 2 29 ...
$ numberofballs: int 191 7 71 20 5 53 118 13 46 385 ...
$ average : num 18.7 4 10.6 12.5 NA ...
$ strikerate : num 146.6 57.1 74.6 125 80 ...
$ DOB : Date, format: "1991-02-24" "1983-12-05" ...
$ Batting_Hand : Factor w/ 3 levels "Left_Hand","Right_hand",..: 3 3 3 2 3 3 2 3 3 3 ...
$ Bowling_Skill: Factor w/ 16 levels "Left-arm fast",..: 12 14 14 2 14 11 NA 9 6 5 ...
$ Country : Factor w/ 11 levels "Australia","Bangladesh",..: 4 4 4 NA NA 3 NA NA 4 4 ...
Here we will create 2 new variable with mutate() function
# This is the R chunk for the Tidy & Manipulate Data II
New_Data<- New_Data%>%mutate(Overs_Played =(numberofballs/6))
New_Data$Overs_Played <- round(New_Data$Overs_Played, digits = 2)
New_Data<- New_Data%>%mutate(Avg_runs_per_Over =(total_runs/Overs_Played))
head(New_Data)
NA
Now, the New_Data dataset consist of 516 observation and 12 variables
We will now scan all the variable for missing values using special values and obvious errors using is.na() , is.infinte() and is.nan() fuction.
# This is the R chunk for the Scan I
colSums(sapply(New_Data,is.infinite))
Player_Name total_runs out numberofballs
0 0 0 0
average strikerate DOB Batting_Hand
0 0 0 0
Bowling_Skill Country Overs_Played Avg_runs_per_Over
0 0 0 0
colSums(sapply(New_Data,is.nan))
Player_Name total_runs out numberofballs
0 0 0 0
average strikerate DOB Batting_Hand
0 0 0 0
Bowling_Skill Country Overs_Played Avg_runs_per_Over
0 0 0 0
colSums(is.na(New_Data))
Player_Name total_runs out numberofballs
0 0 0 0
average strikerate DOB Batting_Hand
34 0 76 0
Bowling_Skill Country Overs_Played Avg_runs_per_Over
24 76 0 0
As we can se in the above output, there are no special values and obvious errors in the dataset, but the variables - average, DOB, Bowling_skill, Country consist of missing values.
We can replace the missing values of average with mean of the rest of the data. remaining variable with missing data can be omitted using na.omit() function
New_Data$average[is.na(New_Data$average)] <-mean(New_Data$average, na.rm = TRUE)
New_Data<- na.omit(New_Data)
colSums(is.na(New_Data))
Player_Name total_runs out numberofballs
0 0 0 0
average strikerate DOB Batting_Hand
0 0 0 0
Bowling_Skill Country Overs_Played Avg_runs_per_Over
0 0 0 0
Here we will check the outliers if 4 variables from the dataset New_Data
average Overs_played Avg_runs_per_over Overs_Played
We will be using the box plot to identify the outliers. Separate boxplot will be created to identify the outliers. and z.score to the the outlier values.
# This is the R chunk for the Scan II
boxplot(New_Data$average,
outcol = "red", col = "green", main="Average Batting Score", ylab= "Average")
z.score<- New_Data$average %>% scores(type = "z")
which (abs(z.score)>3)
[1] 150 246 281
boxplot(New_Data$strikerate,
outcol = "red", col = "green", main="Batting Strikerate", ylab= "Strikerate")
z.score<- New_Data$strikerate %>% scores(type = "z")
which (abs(z.score)>3)
[1] 194
boxplot(New_Data$Avg_runs_per_Over,
outcol = "red", col = "green", main="Average Runs Per Over", ylab= "Strikerate")
z.score<- New_Data$Avg_runs_per_Over %>% scores(type = "z")
which (abs(z.score)>3)
[1] 194
boxplot(New_Data$Overs_Played,
outcol = "red", col = "green", main="Overs Played", ylab= "Strikerate")
z.score<- New_Data$Overs_Played %>% scores(type = "z")
which (abs(z.score)>3)
[1] 20 38 50 78 95 128 187 254 308 325 331 367 383 413
There are many ways to treat the outliers like imputing with mean, median or mode, Ignoring or excluding outliers etc can be used. To prevent biasness of the data while performing the data analysis imputing with mean is not suitable here.
To treat the outliers here Capping funtion will be used.
cap<- function(x){
quantiles <- quantile(x, c(.05, 0.25, 0.75, 0.95))
x[ x< quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
New_Data$average<- New_Data$average %>% cap()
New_Data$strikerate<-New_Data$strikerate %>% cap()
New_Data$Avg_runs_per_Over <- New_Data$Avg_runs_per_Over%>% cap()
New_Data$Overs_Played <- New_Data$Overs_Played %>% cap()
boxplot(New_Data$average,
outcol = "red", col = "green", main="Average Batting Score", ylab= "Average")
boxplot(New_Data$strikerate,
outcol = "red", col = "green", main="Batting Strikerate", ylab= "Strikerate")
boxplot(New_Data$Avg_runs_per_Over,
outcol = "red", col = "green", main="Average Runs Per Over", ylab= "Average")
boxplot(New_Data$Overs_Played,
outcol = "red", col = "green", main="Overs Played", ylab= "Overs")
NA
NA
Apply an appropriate transformation for at least one of the variables. In addition to the R codes and outputs, explain everything that you do in this step. In this step, you should fulfil the minimum requirement #9.
Here, we will check the normality of the variable Overs_Played using histogram. We will use the function hist() to plot the histogram
# This is the R chunk for the Transform Section
hist(New_Data$Overs_Played)
As we can observe the histogram is Right skewed. To reduce the skewness, and to tranform to a normal distribution we will use log(10) transformation below
hist(logdata)