MATH2349 Data Wrangling

Required packages

# This is the R chunk for the required packages

library(readr)
library(dplyr)
library(tidyr)
library(lubridate)
library(outliers)
library(forecast)
library(readxl)

Executive Summary

The aim of this assignment is to preprocess two dataset of Indian Premier Legue (IPL), which is a proffessional cricket legue which occurs in India every year. The two dataset I am using for this assignment are Players dataset which consist of the information about players and most_runs_average_strikerate dataset which consist of data related to the batting score who played under IPL from 2008-2019.

First, the two dataset were imported and merged together using merge() function to create a new dataset.
Then, proceeded to inspect the variable to check whether the data is tidy or not.
Then some of variable were converted to factor make the analysis easier
Next, two new variable were created using mutate(). An over in a cricket consist of 6 balls. new variable Overs_played was created by dividing the numberofballs value by 6 and another variable Avg_runs_per_over was created by dividing total_runs/Overs_Played
Next, We scanned for missing values, special values and obvious errors using is.na(), is.infinite() and is.nan() function respectively. missing values were omited or replaced by mean value accordingly.
In the next step, we identified the outliers using boxplot. The capping method is used to deal with the outliers.
Final step was to transform the Overs_played data. After inspection we found out that the data was right skewed. Transformation was made to make the date symmetrical

Data

The two dataset which consist of players and most_runs_average_strikerate from Indian Premier league during the period 2009-2019 was extracted from kaggle : https://www.kaggle.com/ramjidoolla/ipl-data-set.

Player dataset

Player_Name : Name of the Players
DOB : Date of Birth Batting_Hand: Batting style Bowling_Skill : Bowling skill *Country: Country they represent in the International format

most_runs_average_strikerate dataset

batsman : Name of Batsman
total_runs: Total runs scored
out: Total number of time he got out
numberofballs: Number of balls played
average : Average runs
strikerate: Strike rate- runs per 100 ball

Additional 2 variables were created using mutate function

Overs_played: Total number of overs played
Avg_runs_per_over: Average number of runs per over

Coloumn name batman from most_runs_average_strikerate dataset was changed to “Player_Name” for merging purpose

# This is the R chunk for the Data Section

Player <- read_excel("C:/Users/ADMIN/Downloads/archive/Players.xlsx")

runs <- read.csv("C:/Users/ADMIN/Downloads/archive/most_runs_average_strikerate.csv")

colnames(runs)[colnames(runs) == "batsman"] <- "Player_Name"

New_Data <- merge(runs, Player, by= "Player_Name")

head(New_Data)

NA
NA

Understand

The New dataset New_Data has been created using merge() function. The new dataset has 516 observations and 10 variables.

Below we checked the head of the data using head() function and structure of the data using str() function.

# This is the R chunk for the Understand Section

head(New_Data)


str(New_Data)

'data.frame':   516 obs. of  10 variables:
 $ Player_Name  : chr  "A Ashish Reddy" "A Chandila" "A Chopra" "A Choudhary" ...
 $ total_runs   : int  280 4 53 25 4 62 150 15 35 366 ...
 $ out          : int  15 1 5 2 0 2 6 0 2 29 ...
 $ numberofballs: int  191 7 71 20 5 53 118 13 46 385 ...
 $ average      : num  18.7 4 10.6 12.5 NA ...
 $ strikerate   : num  146.6 57.1 74.6 125 80 ...
 $ DOB          : POSIXct, format: "1991-02-24" "1983-12-05" ...
 $ Batting_Hand : chr  "Right_Hand" "Right_Hand" "Right_Hand" "Right_hand" ...
 $ Bowling_Skill: chr  "Right-arm medium" "Right-arm offbreak" "Right-arm offbreak" "Left-arm fast-medium" ...
 $ Country      : chr  "India" "India" "India" NA ...

Frrom above we can see that the New_Data dataset consist of multiple data types such numerics, characters, factors (not formatted appropriately).

Tidy & Manipulate Data I

To consider a dataset to be tidy the dataset needs to satisfy the below crieteria:- Each variable have its own column Each observation have its own row *Each value have its own cell.

As we can observe in the New_data dataset, Some cells of the variable country and DOB does not have values in it. str() suggest that the variables like DOB, Country, Batting_Hand, Bowling_skill etc are not formatted correctly.

str(New_Data)

'data.frame':   516 obs. of  10 variables:
 $ Player_Name  : chr  "A Ashish Reddy" "A Chandila" "A Chopra" "A Choudhary" ...
 $ total_runs   : int  280 4 53 25 4 62 150 15 35 366 ...
 $ out          : int  15 1 5 2 0 2 6 0 2 29 ...
 $ numberofballs: int  191 7 71 20 5 53 118 13 46 385 ...
 $ average      : num  18.7 4 10.6 12.5 NA ...
 $ strikerate   : num  146.6 57.1 74.6 125 80 ...
 $ DOB          : Date, format: "1991-02-24" "1983-12-05" ...
 $ Batting_Hand : Factor w/ 3 levels "Left_Hand","Right_hand",..: 3 3 3 2 3 3 2 3 3 3 ...
 $ Bowling_Skill: Factor w/ 16 levels "Left-arm fast",..: 12 14 14 2 14 11 NA 9 6 5 ...
 $ Country      : Factor w/ 11 levels "Australia","Bangladesh",..: 4 4 4 NA NA 3 NA NA 4 4 ...

Tidy & Manipulate Data II

Here we will create 2 new variable with mutate() function

Overs_played: Total number of overs played by a player
Avg_runs_per_over: Average number of runs per over by a player

# This is the R chunk for the Tidy & Manipulate Data II 
New_Data<- New_Data%>%mutate(Overs_Played =(numberofballs/6))

New_Data$Overs_Played  <- round(New_Data$Overs_Played, digits = 2)

New_Data<- New_Data%>%mutate(Avg_runs_per_Over =(total_runs/Overs_Played))

head(New_Data)

NA

Now, the New_Data dataset consist of 516 observation and 12 variables

Scan I

We will now scan all the variable for missing values using special values and obvious errors using is.na() , is.infinte() and is.nan() fuction.

# This is the R chunk for the Scan I
colSums(sapply(New_Data,is.infinite))

      Player_Name        total_runs               out     numberofballs 
                0                 0                 0                 0 
          average        strikerate               DOB      Batting_Hand 
                0                 0                 0                 0 
    Bowling_Skill           Country      Overs_Played Avg_runs_per_Over 
                0                 0                 0                 0

colSums(sapply(New_Data,is.nan))

      Player_Name        total_runs               out     numberofballs 
                0                 0                 0                 0 
          average        strikerate               DOB      Batting_Hand 
                0                 0                 0                 0 
    Bowling_Skill           Country      Overs_Played Avg_runs_per_Over 
                0                 0                 0                 0

colSums(is.na(New_Data))

      Player_Name        total_runs               out     numberofballs 
                0                 0                 0                 0 
          average        strikerate               DOB      Batting_Hand 
               34                 0                76                 0 
    Bowling_Skill           Country      Overs_Played Avg_runs_per_Over 
               24                76                 0                 0

As we can se in the above output, there are no special values and obvious errors in the dataset, but the variables - average, DOB, Bowling_skill, Country consist of missing values.

We can replace the missing values of average with mean of the rest of the data. remaining variable with missing data can be omitted using na.omit() function

New_Data$average[is.na(New_Data$average)] <-mean(New_Data$average, na.rm = TRUE)

New_Data<- na.omit(New_Data)

colSums(is.na(New_Data))

      Player_Name        total_runs               out     numberofballs 
                0                 0                 0                 0 
          average        strikerate               DOB      Batting_Hand 
                0                 0                 0                 0 
    Bowling_Skill           Country      Overs_Played Avg_runs_per_Over 
                0                 0                 0                 0

Scan II

Here we will check the outliers if 4 variables from the dataset New_Data

average Overs_played Avg_runs_per_over Overs_Played

We will be using the box plot to identify the outliers. Separate boxplot will be created to identify the outliers. and z.score to the the outlier values.

# This is the R chunk for the Scan II
boxplot(New_Data$average, 
        outcol = "red", col = "green", main="Average Batting Score", ylab= "Average")


z.score<- New_Data$average %>% scores(type = "z")
which (abs(z.score)>3)

[1] 150 246 281

boxplot(New_Data$strikerate, 
        outcol = "red", col = "green", main="Batting Strikerate", ylab= "Strikerate")


z.score<- New_Data$strikerate %>% scores(type = "z")
which (abs(z.score)>3)

[1] 194

boxplot(New_Data$Avg_runs_per_Over, 
        outcol = "red", col = "green", main="Average Runs Per Over", ylab= "Strikerate")


z.score<- New_Data$Avg_runs_per_Over %>% scores(type = "z")
which (abs(z.score)>3)

[1] 194

boxplot(New_Data$Overs_Played, 
        outcol = "red", col = "green", main="Overs Played", ylab= "Strikerate")


z.score<- New_Data$Overs_Played %>% scores(type = "z")
which (abs(z.score)>3)

 [1]  20  38  50  78  95 128 187 254 308 325 331 367 383 413

There are many ways to treat the outliers like imputing with mean, median or mode, Ignoring or excluding outliers etc can be used. To prevent biasness of the data while performing the data analysis imputing with mean is not suitable here.

To treat the outliers here Capping funtion will be used.

cap<- function(x){
quantiles <- quantile(x, c(.05, 0.25, 0.75, 0.95))
    x[ x< quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}

New_Data$average<- New_Data$average %>% cap()
New_Data$strikerate<-New_Data$strikerate %>% cap()
New_Data$Avg_runs_per_Over <- New_Data$Avg_runs_per_Over%>% cap()
New_Data$Overs_Played <- New_Data$Overs_Played %>% cap()

boxplot(New_Data$average, 
        outcol = "red", col = "green", main="Average Batting Score", ylab= "Average")


boxplot(New_Data$strikerate, 
        outcol = "red", col = "green", main="Batting Strikerate", ylab= "Strikerate")



boxplot(New_Data$Avg_runs_per_Over, 
        outcol = "red", col = "green", main="Average Runs Per Over", ylab= "Average")



boxplot(New_Data$Overs_Played, 
        outcol = "red", col = "green", main="Overs Played", ylab= "Overs")

NA
NA

Transform

Apply an appropriate transformation for at least one of the variables. In addition to the R codes and outputs, explain everything that you do in this step. In this step, you should fulfil the minimum requirement #9.

Here, we will check the normality of the variable Overs_Played using histogram. We will use the function hist() to plot the histogram

# This is the R chunk for the Transform Section
hist(New_Data$Overs_Played)

As we can observe the histogram is Right skewed. To reduce the skewness, and to tranform to a normal distribution we will use log(10) transformation below


hist(logdata)