Required packages

# This is the R chunk for the required packages

library(readr)
library(dplyr)
library(tidyr)
library(lubridate)
library(outliers)
library(forecast)
library(readxl)

Executive Summary

The aim of this assignment is to preprocess two dataset of Indian Premier Legue (IPL), which is a proffessional cricket legue which occurs in India every year. The two dataset I am using for this assignment are Players dataset which consist of the information about players and most_runs_average_strikerate dataset which consist of data related to the batting score who played under IPL from 2008-2019.

Data

The two dataset which consist of players and most_runs_average_strikerate from Indian Premier league during the period 2009-2019 was extracted from kaggle : https://www.kaggle.com/ramjidoolla/ipl-data-set.

Player dataset

Player_Name : Name of the Players
DOB : Date of Birth Batting_Hand: Batting style Bowling_Skill : Bowling skill *Country: Country they represent in the International format

most_runs_average_strikerate dataset

Additional 2 variables were created using mutate function

Coloumn name batman from most_runs_average_strikerate dataset was changed to “Player_Name” for merging purpose

# This is the R chunk for the Data Section

Player <- read_excel("C:/Users/ADMIN/Downloads/archive/Players.xlsx")

runs <- read.csv("C:/Users/ADMIN/Downloads/archive/most_runs_average_strikerate.csv")

colnames(runs)[colnames(runs) == "batsman"] <- "Player_Name"

New_Data <- merge(runs, Player, by= "Player_Name")

head(New_Data)
NA
NA

Understand

The New dataset New_Data has been created using merge() function. The new dataset has 516 observations and 10 variables.

Below we checked the head of the data using head() function and structure of the data using str() function.

# This is the R chunk for the Understand Section

head(New_Data)

str(New_Data)
'data.frame':   516 obs. of  10 variables:
 $ Player_Name  : chr  "A Ashish Reddy" "A Chandila" "A Chopra" "A Choudhary" ...
 $ total_runs   : int  280 4 53 25 4 62 150 15 35 366 ...
 $ out          : int  15 1 5 2 0 2 6 0 2 29 ...
 $ numberofballs: int  191 7 71 20 5 53 118 13 46 385 ...
 $ average      : num  18.7 4 10.6 12.5 NA ...
 $ strikerate   : num  146.6 57.1 74.6 125 80 ...
 $ DOB          : POSIXct, format: "1991-02-24" "1983-12-05" ...
 $ Batting_Hand : chr  "Right_Hand" "Right_Hand" "Right_Hand" "Right_hand" ...
 $ Bowling_Skill: chr  "Right-arm medium" "Right-arm offbreak" "Right-arm offbreak" "Left-arm fast-medium" ...
 $ Country      : chr  "India" "India" "India" NA ...

Frrom above we can see that the New_Data dataset consist of multiple data types such numerics, characters, factors (not formatted appropriately).

Tidy & Manipulate Data I

To consider a dataset to be tidy the dataset needs to satisfy the below crieteria:- Each variable have its own column Each observation have its own row *Each value have its own cell.

As we can observe in the New_data dataset, Some cells of the variable country and DOB does not have values in it. str() suggest that the variables like DOB, Country, Batting_Hand, Bowling_skill etc are not formatted correctly.

str(New_Data)
'data.frame':   516 obs. of  10 variables:
 $ Player_Name  : chr  "A Ashish Reddy" "A Chandila" "A Chopra" "A Choudhary" ...
 $ total_runs   : int  280 4 53 25 4 62 150 15 35 366 ...
 $ out          : int  15 1 5 2 0 2 6 0 2 29 ...
 $ numberofballs: int  191 7 71 20 5 53 118 13 46 385 ...
 $ average      : num  18.7 4 10.6 12.5 NA ...
 $ strikerate   : num  146.6 57.1 74.6 125 80 ...
 $ DOB          : Date, format: "1991-02-24" "1983-12-05" ...
 $ Batting_Hand : Factor w/ 3 levels "Left_Hand","Right_hand",..: 3 3 3 2 3 3 2 3 3 3 ...
 $ Bowling_Skill: Factor w/ 16 levels "Left-arm fast",..: 12 14 14 2 14 11 NA 9 6 5 ...
 $ Country      : Factor w/ 11 levels "Australia","Bangladesh",..: 4 4 4 NA NA 3 NA NA 4 4 ...

Tidy & Manipulate Data II

Here we will create 2 new variable with mutate() function

# This is the R chunk for the Tidy & Manipulate Data II 
New_Data<- New_Data%>%mutate(Overs_Played =(numberofballs/6))

New_Data$Overs_Played  <- round(New_Data$Overs_Played, digits = 2)

New_Data<- New_Data%>%mutate(Avg_runs_per_Over =(total_runs/Overs_Played))

head(New_Data)
NA

Now, the New_Data dataset consist of 516 observation and 12 variables

Scan I

We will now scan all the variable for missing values using special values and obvious errors using is.na() , is.infinte() and is.nan() fuction.

# This is the R chunk for the Scan I
colSums(sapply(New_Data,is.infinite))
      Player_Name        total_runs               out     numberofballs 
                0                 0                 0                 0 
          average        strikerate               DOB      Batting_Hand 
                0                 0                 0                 0 
    Bowling_Skill           Country      Overs_Played Avg_runs_per_Over 
                0                 0                 0                 0 
colSums(sapply(New_Data,is.nan))
      Player_Name        total_runs               out     numberofballs 
                0                 0                 0                 0 
          average        strikerate               DOB      Batting_Hand 
                0                 0                 0                 0 
    Bowling_Skill           Country      Overs_Played Avg_runs_per_Over 
                0                 0                 0                 0 
colSums(is.na(New_Data))
      Player_Name        total_runs               out     numberofballs 
                0                 0                 0                 0 
          average        strikerate               DOB      Batting_Hand 
               34                 0                76                 0 
    Bowling_Skill           Country      Overs_Played Avg_runs_per_Over 
               24                76                 0                 0 

As we can se in the above output, there are no special values and obvious errors in the dataset, but the variables - average, DOB, Bowling_skill, Country consist of missing values.

We can replace the missing values of average with mean of the rest of the data. remaining variable with missing data can be omitted using na.omit() function

New_Data$average[is.na(New_Data$average)] <-mean(New_Data$average, na.rm = TRUE)

New_Data<- na.omit(New_Data)

colSums(is.na(New_Data))
      Player_Name        total_runs               out     numberofballs 
                0                 0                 0                 0 
          average        strikerate               DOB      Batting_Hand 
                0                 0                 0                 0 
    Bowling_Skill           Country      Overs_Played Avg_runs_per_Over 
                0                 0                 0                 0 

Scan II

Here we will check the outliers if 4 variables from the dataset New_Data

average Overs_played Avg_runs_per_over Overs_Played

We will be using the box plot to identify the outliers. Separate boxplot will be created to identify the outliers. and z.score to the the outlier values.

# This is the R chunk for the Scan II
boxplot(New_Data$average, 
        outcol = "red", col = "green", main="Average Batting Score", ylab= "Average")


z.score<- New_Data$average %>% scores(type = "z")
which (abs(z.score)>3)
[1] 150 246 281
boxplot(New_Data$strikerate, 
        outcol = "red", col = "green", main="Batting Strikerate", ylab= "Strikerate")


z.score<- New_Data$strikerate %>% scores(type = "z")
which (abs(z.score)>3)
[1] 194
boxplot(New_Data$Avg_runs_per_Over, 
        outcol = "red", col = "green", main="Average Runs Per Over", ylab= "Strikerate")


z.score<- New_Data$Avg_runs_per_Over %>% scores(type = "z")
which (abs(z.score)>3)
[1] 194
boxplot(New_Data$Overs_Played, 
        outcol = "red", col = "green", main="Overs Played", ylab= "Strikerate")


z.score<- New_Data$Overs_Played %>% scores(type = "z")
which (abs(z.score)>3)
 [1]  20  38  50  78  95 128 187 254 308 325 331 367 383 413

There are many ways to treat the outliers like imputing with mean, median or mode, Ignoring or excluding outliers etc can be used. To prevent biasness of the data while performing the data analysis imputing with mean is not suitable here.

To treat the outliers here Capping funtion will be used.

cap<- function(x){
quantiles <- quantile(x, c(.05, 0.25, 0.75, 0.95))
    x[ x< quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}

New_Data$average<- New_Data$average %>% cap()
New_Data$strikerate<-New_Data$strikerate %>% cap()
New_Data$Avg_runs_per_Over <- New_Data$Avg_runs_per_Over%>% cap()
New_Data$Overs_Played <- New_Data$Overs_Played %>% cap()
boxplot(New_Data$average, 
        outcol = "red", col = "green", main="Average Batting Score", ylab= "Average")


boxplot(New_Data$strikerate, 
        outcol = "red", col = "green", main="Batting Strikerate", ylab= "Strikerate")



boxplot(New_Data$Avg_runs_per_Over, 
        outcol = "red", col = "green", main="Average Runs Per Over", ylab= "Average")



boxplot(New_Data$Overs_Played, 
        outcol = "red", col = "green", main="Overs Played", ylab= "Overs")

NA
NA

Transform

Apply an appropriate transformation for at least one of the variables. In addition to the R codes and outputs, explain everything that you do in this step. In this step, you should fulfil the minimum requirement #9.

Here, we will check the normality of the variable Overs_Played using histogram. We will use the function hist() to plot the histogram

# This is the R chunk for the Transform Section
hist(New_Data$Overs_Played)

As we can observe the histogram is Right skewed. To reduce the skewness, and to tranform to a normal distribution we will use log(10) transformation below


hist(logdata)