Required packages .

Loading the required packages needed for the whole data pre-processing:

library(readr)
library(magrittr)
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
## 
##     extract
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(MVN)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## sROC 0.1-2 loaded
library(infotheo)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

Executive Summary

The two datasets ODI data.csv and test.csv (containing players’ batting statistics in ODI and test matches) were loaded into R studio. The unwanted columns were removed from the datasets. For a proper understanding of the variables, each of the variables was renamed using colnames(). The datasets contained a common column Player Name which has another column Team embedded in the bracket values of each Player Name.

The two variables were separated using separate(). The datasets were then merged using inner join on the basis of players who were common in both the datasets (i.e. who played both ODI and tests) and stored in variable cricket. After that structure and class of each variable were analyzed and appropriate type conversions were performed to convert Team variables (ODI_Team and Test_Team) to factors. The levels in each variable were also labeled to display Team values (ODI_Team and Test_Team) properly for better understanding.

Three new columns (Total_Hundreds, Total_Fifties, and Total_Runs were mutated in the cricket dataset. The dataset was then scanned for missing/infinite and special values by a user-defined function is.specialorNA.The special values ‘-’ were replaced by NA and all missing values were omitted out due to their very low percentage in the dataset (only 4%).

The cricket dataset was subsetted to contain only numerical variables and stored in variable cricket_outlier. The outliers were detected using the Mahanabolis distance multivariate outlier detection method. Outliers were found to be in large numbers (151). To mitigate their effect, equal-width binning (Discretisation) was performed on cricket_outlier and stored in cricket_outlier1. Finally, the distribution of Total_Runs variable was checked by plotting a histogram which was found to be right-skewed. Thus, to normalize the distribution of Total_Runs, reciprocal, square-root, Box-cox and log10 transformations were applied. The normalized distribution was achieved by log10 transformation.

Data

The datasets used for the assignment has been taken from Kaggle Website provided by the user Mahendran Narayan. The original data present on the website consist of 3 folders- Batting, Bowling and Fielding. These folders consists of batting,bowling and fielding statistics of various cricketers recorded from the year 1987. These folders contain various datasets.

The two datasets used for the assignment are ODI data.csv and test.csv that has been taken from Batting folder. These two datasets consist of player names along with their batting statistics in ODI and Test matches respectively.

The datasets can be found on https://www.kaggle.com/mahendran1/icc-cricket

ODI data.csv:

The ODI data dataset consist of 2500 records of players with 13 variables. The variables are:

1. Player(character): Name of the player/batsman/cricketer. This column contains another column as embedded values in brackets of the team played for.The bracket contains the team which is a factor variable.

2. Span(character): The time period/span played for.

3. Mat(numeric): Total ODI matches played.

4. Innings(numeric): Number of innings played/ total number of times the player came out to bat.

5. NO(numeric): Number of times remained not out in a match.

6. Runs(numeric): Total number of runs scored.

7. HS(numeric): Highest score in an ODI match. The astrick(*) values indicates the player remained not out.

8. Ave(numeric): The batting average. It is Total Runs Scored divided by Number of times they have been out.

9. BF(numeric): The total number of balls faced in all the innings.

10. SR(numeric): The batsman or player’s strike rate is total runs scored divided by number of balls faced multiplied by 100.

11. 100(numeric): Total number of hundreds/centuries scored by the player in ODI matches played.

12. 50(numeric): Total number of fifties/ half centuries scored by the player in ODI matches played.

13. 0(numeric): Total number of times player scored a duck(0). It is the innings in which a player scored 0 runs.

test.csv:

The test data dataset consist of 3001 records of players with 11 variables (2 character and 9 numeric). The variables are:

1. Player(character): Name of the player/batsman/cricketer. This column contains another column as embedded values in brackets of the team played for. The bracket contains the team which is a factor variable.

2. Span(character): The time period/span played for.

3. Mat(numeric): Total ODI matches played.

4. Innings(numeric): Number of innings played/ total number of times the player came out to bat.

5. NO(numeric): Number of times remained not out in a match. 6. Runs(numeric): Total number of runs scored.

7. HS(numeric): Highest score in an ODI match. The astrick(*) values indicates the player remained not out.

8. Av(numeric): The batting average. It is Total Runs Scored divided by Number of times they have been out.

11. 100(numeric): Total number of hundreds/centuries scored by the player in ODI matches played.

12. 50(numeric): Total number of fifties/ half centuries scored by the player in ODI matches played.

13. 0(numeric): Total number of times player scored a duck(0). It is the innings in which a player batted and scored 0 runs.

The data-preprocessing is explained in below steps:

Step 1. Importing dataset in Rstudio:The datasets odi data.csv and test.csv are loaded in R studio and stored in variables odi and test respectively using read_csv(). After loading the datasets, it is verified that the data sets has been loaded correctly using head().

# Loading the datasets ODI and Tests in R studio using read_csv()

odi <- read_csv("ODI data.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   Player = col_character(),
##   Span = col_character(),
##   Mat = col_double(),
##   Inns = col_double(),
##   NO = col_double(),
##   Runs = col_double(),
##   HS = col_character(),
##   Ave = col_double(),
##   BF = col_double(),
##   SR = col_double(),
##   `100` = col_double(),
##   `50` = col_double(),
##   `0` = col_double(),
##   `Unnamed: 13` = col_logical()
## )
## Warning: 219 parsing failures.
##  row col expected actual           file
## 1467 Ave a double      - 'ODI data.csv'
## 1494 Ave a double      - 'ODI data.csv'
## 1509 Ave a double      - 'ODI data.csv'
## 1571 Ave a double      - 'ODI data.csv'
## 1683 Ave a double      - 'ODI data.csv'
## .... ... ........ ...... ..............
## See problems(...) for more details.
test <- read_csv("test.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   Player = col_character(),
##   Span = col_character(),
##   Mat = col_double(),
##   Inns = col_double(),
##   NO = col_double(),
##   Runs = col_double(),
##   HS = col_character(),
##   Ave = col_double(),
##   `100` = col_double(),
##   `50` = col_double(),
##   `0` = col_double(),
##   `Unnamed: 11` = col_logical()
## )
## Warning: 208 parsing failures.
##  row col expected actual       file
## 1959 Ave a double      - 'test.csv'
## 2068 Ave a double      - 'test.csv'
## 2309 Ave a double      - 'test.csv'
## 2336 Ave a double      - 'test.csv'
## 2364 Ave a double      - 'test.csv'
## .... ... ........ ...... ..........
## See problems(...) for more details.
#Checking if dataset has been loaded correctly or not using head()

head(odi)
head(test)

Step 2. Removing unnecessary columns from the datasets: After importing the datasets, some unwanted columns are generated in both the datasets which are removed by subsetting them out. After removing,colnames() function is called to verify the removal has been done.

#Deleting the unnecessary columns in ODI and Test datasets by subsetting them out

odi <- odi[,c(-1,-15)]
test <- test[,c(-1,-13)]

#Verifying that the unnecessary columns have been removed successfully
colnames(odi)
##  [1] "Player" "Span"   "Mat"    "Inns"   "NO"     "Runs"   "HS"     "Ave"   
##  [9] "BF"     "SR"     "100"    "50"     "0"
colnames(test)
##  [1] "Player" "Span"   "Mat"    "Inns"   "NO"     "Runs"   "HS"     "Ave"   
##  [9] "100"    "50"     "0"

Step 3. Renaming the columns for proper understanding: The datasets are to be merged to display only statistics of those players who have played both tests and ODI matches. Therefore, the column names in both the datasets were renamed using colnames to differentiate between ODI statistics and Test statistics of the players.

#Renaming the columns using colnames()

colnames(odi) <- c("Player_Name","ODI_Span","ODI_Matches","ODI_Innings","Not_Outs(ODI)","ODI_Runs","ODI_HS", "ODI_Average","Balls_Faced(ODIs)","ODI_SR","ODI_Hundreds","ODI_Fifties","ODI_Ducks") 

colnames(test) <- c("Player_Name","Test_Span","Test_Matches","Test_Innings","Not_Outs(Test)","Test_Runs","Test_HS", "Test_Average","Test_Hundreds","Test_Fifties","Test_Ducks")

#Checking column names have been updated successfully 
colnames(odi)
##  [1] "Player_Name"       "ODI_Span"          "ODI_Matches"      
##  [4] "ODI_Innings"       "Not_Outs(ODI)"     "ODI_Runs"         
##  [7] "ODI_HS"            "ODI_Average"       "Balls_Faced(ODIs)"
## [10] "ODI_SR"            "ODI_Hundreds"      "ODI_Fifties"      
## [13] "ODI_Ducks"
colnames(test)
##  [1] "Player_Name"    "Test_Span"      "Test_Matches"   "Test_Innings"  
##  [5] "Not_Outs(Test)" "Test_Runs"      "Test_HS"        "Test_Average"  
##  [9] "Test_Hundreds"  "Test_Fifties"   "Test_Ducks"

Tidy & Manipulate Data I

Step 4. Separating Team variable from Player Name in both odi and test datasets: The Player name variable contains another variable Team which is embedded in bracket values of the Player Name in each record in both the datasets, Since two variables are stored in one column, we need to separate them first before we merge the two datasets odi and test. This is done using separate() function of tidyr package and the Player Name column is separated in two columns- Player Name and ODI/test team played for in both the datasets. The sep parameter defines the value from where the separation is to be done. The extra parameter merge all the extra values after bracket (. The resulting datasets are stored in odi_new and test_new variables respectively.

#Separating odi dataset using separate() and storing the new dataset with additional column ODI_Team in odi_new

odi_new <- odi %>% separate(Player_Name, into= c("Player_Name", "ODI_Team"), sep="[(]" ,extra="merge")

#Verifying that a new column has been successfully created in odi_new using head()
head(odi_new)
#Separating test dataset using separate() and storing the new dataset with additional column Test_Team in test_new

test_new <- test %>% separate(Player_Name, into= c("Player_Name", "Test_Team"), sep="[(]", extra="merge")


#Verifying that a new column has been successfully created in test_new using head()

head(test_new)

Step 5. Merging the datasets odi_new and test_new: The two datasets are merged to contain final records of only those players that have played both ODI and test matches in their career. The inner_join() function is used as it joins common observations in two tables. Thus, the players common in both the datasets along with their ODI and Test statistics are stored in a new variable cricket.

#Merging the odi_new and test_new dataset using inner join to display data of only those players who played both ODI and Test matches and storing the merged data in new variable Cricket

cricket <- inner_join(odi_new,test_new, by="Player_Name")

#Verifying that the datasets have been successfully merged by checking the header of the merged dataset cricket

head(cricket)

Understand

Step 6. Check structure and class of dataset cricket: Using str() and class(), the structure of dataset cricket and the class of its variables are inspected. A new function, check.class() has been created to check class of each variable in a dataframe/dataset. Here, class of each variable of cricket dataset is checked using sapply() and check.class().

#Checking structure of cricket dataset

str(cricket)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1462 obs. of  25 variables:
##  $ Player_Name      : chr  "SR Tendulkar " "KC Sangakkara " "RT Ponting " "ST Jayasuriya " ...
##  $ ODI_Team         : chr  "INDIA)" "Asia/ICC/SL)" "AUS/ICC)" "Asia/SL)" ...
##  $ ODI_Span         : chr  "1989-2012" "2000-2015" "1995-2012" "1989-2011" ...
##  $ ODI_Matches      : num  463 404 375 445 448 378 242 328 311 344 ...
##  $ ODI_Innings      : num  452 380 365 433 418 350 233 314 300 318 ...
##  $ Not_Outs(ODI)    : num  41 41 39 18 39 53 39 53 23 40 ...
##  $ ODI_Runs         : num  18426 14234 13704 13430 12650 ...
##  $ ODI_HS           : chr  "200*" "169" "164" "189" ...
##  $ ODI_Average      : num  44.8 42 42 32.4 33.4 ...
##  $ Balls_Faced(ODIs): num  21367 18048 17046 14725 16020 ...
##  $ ODI_SR           : num  86.2 78.9 80.4 91.2 79 ...
##  $ ODI_Hundreds     : num  49 25 30 28 19 10 43 17 22 12 ...
##  $ ODI_Fifties      : num  96 93 82 68 77 83 55 86 72 83 ...
##  $ ODI_Ducks        : num  20 15 20 34 28 20 13 17 16 13 ...
##  $ Test_Team        : chr  "INDIA)" "SL)" "AUS)" "SL)" ...
##  $ Test_Span        : chr  "1989-2013" "2000-2015" "1995-2012" "1991-2007" ...
##  $ Test_Matches     : num  200 134 168 110 149 120 84 166 113 164 ...
##  $ Test_Innings     : num  329 233 287 188 252 200 141 280 188 286 ...
##  $ Not_Outs(Test)   : num  33 17 29 14 15 22 10 40 17 32 ...
##  $ Test_Runs        : num  15921 12400 13378 6973 11814 ...
##  $ Test_HS          : chr  "248*" "319" "257" "340" ...
##  $ Test_Average     : num  53.8 57.4 51.9 40.1 49.8 ...
##  $ Test_Hundreds    : num  51 38 41 14 34 25 27 45 16 36 ...
##  $ Test_Fifties     : num  68 52 62 31 50 46 22 58 35 63 ...
##  $ Test_Ducks       : num  14 11 17 15 15 15 10 16 13 8 ...
#Checking datatype of each variable in the two datasets by using a user-defined check.class() function

check.class <- function(x){
  class(x)
}

sapply(cricket,check.class)
##       Player_Name          ODI_Team          ODI_Span       ODI_Matches 
##       "character"       "character"       "character"         "numeric" 
##       ODI_Innings     Not_Outs(ODI)          ODI_Runs            ODI_HS 
##         "numeric"         "numeric"         "numeric"       "character" 
##       ODI_Average Balls_Faced(ODIs)            ODI_SR      ODI_Hundreds 
##         "numeric"         "numeric"         "numeric"         "numeric" 
##       ODI_Fifties         ODI_Ducks         Test_Team         Test_Span 
##         "numeric"         "numeric"       "character"       "character" 
##      Test_Matches      Test_Innings    Not_Outs(Test)         Test_Runs 
##         "numeric"         "numeric"         "numeric"         "numeric" 
##           Test_HS      Test_Average     Test_Hundreds      Test_Fifties 
##       "character"         "numeric"         "numeric"         "numeric" 
##        Test_Ducks 
##         "numeric"

Step 7. Converting the newly created ODI_Team and Test_Team variables to factor: The variables ODI_Team and Test_Team represent the teams that the player has played for. Some players may have played test matches for one team and ODI matches for a different team. Hence, these variables can have different team names for the same player record.

These variables are converted to factor using as.factor(). The levels() function checks the levels in both ODI_Team and Test_Team. The levels() function has also been used to re-define or relabel the values in proper order. The ordering of the factor variables have been avoided as it was creating null values in the dataset. The check.class() function is called to verify successful type conversion has occurred in the dataset cricket.

#Converting the ODI_Team  variable to factor using as.factor()

cricket$ODI_Team <- as.factor(cricket$ODI_Team)

#Checking the current levels in the variable ODI_Team using levels()

levels(cricket$ODI_Team)
##  [1] "1) (UAE)"        "3) (PAK)"        "AFG)"            "Afr/ICC/SA)"    
##  [5] "Afr/SA)"         "Afr/ZIM)"        "Asia/BDESH)"     "Asia/ICC/INDIA)"
##  [9] "Asia/ICC/PAK)"   "Asia/ICC/SL)"    "Asia/INDIA)"     "Asia/PAK)"      
## [13] "Asia/SL)"        "AUS)"            "AUS/ICC)"        "AUS/NZ)"        
## [17] "AUS/SA)"         "BDESH)"          "CAN)"            "CAN/WI)"        
## [21] "ENG)"            "ENG/ICC)"        "ENG/IRE)"        "ENG/PNG)"       
## [25] "ICC/NZ)"         "ICC/SA)"         "ICC/WI)"         "INDIA)"         
## [29] "IRE)"            "NZ)"             "PAK)"            "SA)"            
## [33] "SCOT)"           "SL)"             "UAE)"            "USA/WI)"        
## [37] "WI)"             "ZIM)"
#Labelling and defining the levels in the variable ODI_Team using levels()

levels(cricket$ODI_Team) = c("UAE ","Pakistan ","Afghanistan","Africa/ICC/South Africa","Africa/South Africa","Africa/Zimbabwe","Asia/Bangladesh","Asia/ICC/India","Asia/ICC/Pakistan","Asia/ICC/Sri Lanka","Asia/India","Asia/Pakistan","Asia/Sri Lanka","Australia","Australia/ICC","Australia/New Zealand","Australia/South Africa","Bangladesh","Canada","Canada/West Indies","England","England/ICC","England/Ireland","England/Papua New Guinea","ICC/New Zealand","ICC/South Africa","ICC/West Indies","India","Ireland","New Zealand","Pakistan","South Africa","Scotland","Sri Lanka","UAE","USA/West Indies","West Indies","Zimbabwe")


#Verifying that ordering of factors has been applied

levels(cricket$ODI_Team)
##  [1] "UAE "                     "Pakistan "               
##  [3] "Afghanistan"              "Africa/ICC/South Africa" 
##  [5] "Africa/South Africa"      "Africa/Zimbabwe"         
##  [7] "Asia/Bangladesh"          "Asia/ICC/India"          
##  [9] "Asia/ICC/Pakistan"        "Asia/ICC/Sri Lanka"      
## [11] "Asia/India"               "Asia/Pakistan"           
## [13] "Asia/Sri Lanka"           "Australia"               
## [15] "Australia/ICC"            "Australia/New Zealand"   
## [17] "Australia/South Africa"   "Bangladesh"              
## [19] "Canada"                   "Canada/West Indies"      
## [21] "England"                  "England/ICC"             
## [23] "England/Ireland"          "England/Papua New Guinea"
## [25] "ICC/New Zealand"          "ICC/South Africa"        
## [27] "ICC/West Indies"          "India"                   
## [29] "Ireland"                  "New Zealand"             
## [31] "Pakistan"                 "South Africa"            
## [33] "Scotland"                 "Sri Lanka"               
## [35] "UAE"                      "USA/West Indies"         
## [37] "West Indies"              "Zimbabwe"
#Converting the Test_Team variable to factor using as.factor

cricket$Test_Team <- as.factor(cricket$Test_Team)

#Checking the current levels in the variable Test_Team using levels()

levels(cricket$Test_Team)
##  [1] "3) (PAK)"   "AFG)"       "AUS)"       "AUS/SA)"    "BDESH)"    
##  [6] "ENG)"       "ENG/ICC)"   "ENG/IRE)"   "ICC/INDIA)" "ICC/NZ)"   
## [11] "ICC/PAK)"   "ICC/SA)"    "ICC/SL)"    "ICC/WI)"    "INDIA)"    
## [16] "IRE)"       "NZ)"        "PAK)"       "SA)"        "SA/ZIM)"   
## [21] "SL)"        "WI)"        "ZIM)"
#Defining/labelling the levels in the variable Test_Team using levels()

levels(cricket$Test_Team) =c("Pakistan ","Afghanistan","Australia","Australia/South Africa","Bangladesh","England","England/ICC","England/Ireland","ICC/India","ICC/New Zealand","ICC/Pakistan","ICC/South Africa","ICC/Sri Lanka", "ICC/West Indies","India","Ireland","New Zealand","Pakistan","South Africa","South Africa/Zimbabwe","Sri Lanka","West Indies","Zimbabwe")


#Checking the levels again using levels()

levels(cricket$Test_Team)
##  [1] "Pakistan "              "Afghanistan"            "Australia"             
##  [4] "Australia/South Africa" "Bangladesh"             "England"               
##  [7] "England/ICC"            "England/Ireland"        "ICC/India"             
## [10] "ICC/New Zealand"        "ICC/Pakistan"           "ICC/South Africa"      
## [13] "ICC/Sri Lanka"          "ICC/West Indies"        "India"                 
## [16] "Ireland"                "New Zealand"            "Pakistan"              
## [19] "South Africa"           "South Africa/Zimbabwe"  "Sri Lanka"             
## [22] "West Indies"            "Zimbabwe"
#Verifying the datatype of each variable after successful type conversions using the defined functioncheck.class()

sapply(cricket,check.class)
##       Player_Name          ODI_Team          ODI_Span       ODI_Matches 
##       "character"          "factor"       "character"         "numeric" 
##       ODI_Innings     Not_Outs(ODI)          ODI_Runs            ODI_HS 
##         "numeric"         "numeric"         "numeric"       "character" 
##       ODI_Average Balls_Faced(ODIs)            ODI_SR      ODI_Hundreds 
##         "numeric"         "numeric"         "numeric"         "numeric" 
##       ODI_Fifties         ODI_Ducks         Test_Team         Test_Span 
##         "numeric"         "numeric"          "factor"       "character" 
##      Test_Matches      Test_Innings    Not_Outs(Test)         Test_Runs 
##         "numeric"         "numeric"         "numeric"         "numeric" 
##           Test_HS      Test_Average     Test_Hundreds      Test_Fifties 
##       "character"         "numeric"         "numeric"         "numeric" 
##        Test_Ducks 
##         "numeric"

Tidy & Manipulate Data II

Step 8. Mutating new columns Total_Hundreds, Total_Fifties and Total_Runs: The mutate() function is used to mutate new columns Total_Hundreds, Total_Fifties and Total_Runs in the existing dataset cricket.The Total hundreds is calculated by adding number of hundreds scored in both ODI and Test matches. The Total_Fifties is calculated by adding number of fifties scored in both ODI and test matches. In a similar way, Total_Runs scored by the player is computed by adding runs scored in ODI and test matches.

After mutating, the successful mutation is verified by checking the header (head) of the dataset and also the column names using colnames().

# Mutating three columns Total_Hundreds,Total_Fifties and Total_Runs in the dataset cricket using mutate() 

cricket <- cricket %>% mutate(Total_Hundreds = ODI_Hundreds + Test_Hundreds,
       Total_Fifties= ODI_Fifties + Test_Fifties, Total_Runs = ODI_Runs + Test_Runs)

#Displaying the updated dataset cricket's column names  with three new columns Total_Hundreds and Total_Fifties and Total_Runs

head(cricket)
colnames(cricket)
##  [1] "Player_Name"       "ODI_Team"          "ODI_Span"         
##  [4] "ODI_Matches"       "ODI_Innings"       "Not_Outs(ODI)"    
##  [7] "ODI_Runs"          "ODI_HS"            "ODI_Average"      
## [10] "Balls_Faced(ODIs)" "ODI_SR"            "ODI_Hundreds"     
## [13] "ODI_Fifties"       "ODI_Ducks"         "Test_Team"        
## [16] "Test_Span"         "Test_Matches"      "Test_Innings"     
## [19] "Not_Outs(Test)"    "Test_Runs"         "Test_HS"          
## [22] "Test_Average"      "Test_Hundreds"     "Test_Fifties"     
## [25] "Test_Ducks"        "Total_Hundreds"    "Total_Fifties"    
## [28] "Total_Runs"

Scan I

Step 9. Defining is.specialorNA function to detect missing, infinite and special values: A function is.specialorNA is created to detect missing, infinite and NAN values (if present), in the dataset cricket. Using sapply(), the function is applied to cricket dataset. The output shows presence of NA/NAN/infinite values in the cricket dataset for each variable.

#Defining a user-defined function is.specialorNA to detect NA, NAN and infinite values in cricket dataset

is.specialorNA <- function(x){
  if (is.numeric(x)) (is.infinite(x) | is.nan(x) | is.na(x))
}

#Checking for infinite, NAN and NA values in cricket dataset

sapply(cricket, function(x) sum( is.specialorNA(x) ))
##       Player_Name          ODI_Team          ODI_Span       ODI_Matches 
##                 0                 0                 0                 0 
##       ODI_Innings     Not_Outs(ODI)          ODI_Runs            ODI_HS 
##                 7                 7                 7                 0 
##       ODI_Average Balls_Faced(ODIs)            ODI_SR      ODI_Hundreds 
##                47                 7                11                 7 
##       ODI_Fifties         ODI_Ducks         Test_Team         Test_Span 
##                 7                 7                 0                 0 
##      Test_Matches      Test_Innings    Not_Outs(Test)         Test_Runs 
##                 0                 5                 5                 5 
##           Test_HS      Test_Average     Test_Hundreds      Test_Fifties 
##                 0                21                 5                 5 
##        Test_Ducks    Total_Hundreds     Total_Fifties        Total_Runs 
##                 5                12                12                12

Step 10. Replacing special value ‘-’ with NA and omitting the records with NA values: After manual scanning of the dataset,it is found to consist some special values ‘-’ in many records which only indicate missing data. Therefore, these values are replaced by NA.

The complete records are subsetted using complete() to find the total records that are complete which are stored in cricket_complete variable. The total records that contained NA values are stored in cricket_incomplete.

#Replacing the special value '-' with NA in dataset cricket

cricket[cricket == "-" ] <-  NA

#Subsetting the data to display only complete observations in cricket dataset using complete[]

cricket_complete <- cricket[complete.cases(cricket), ]
cricket_incomplete <- cricket[!complete.cases(cricket), ]

Step 11. Counting the records and omitting the records with NA values: The count() function is used to compute the total records that were complete and incomplete in dataset cricket. It is found that 62 observations contained NA values while 1400 records were complete.

As the total number of records with missing values was just 62 (which is only 4%),it was decided to omit them out of the dataset. This is done because the dataset we have

consist of factual records that should not be manipulated by recoding/imputing new values in missing records of players.

Secondly, some authors recommend that if the amount of missing data is very small relatively to the size of the data set (up to 5%), then leaving out the few values with missing features would be the best strategy in order not to bias the analysis. Hence, the missing records are omitted out using na.omit(). The is.specialorNA function is called again to verify that all the missing, special and infinite values have been successfully omitted from the dataset cricket.

#Computing total complete observations using count()
count(cricket_complete)
count(cricket_incomplete)
#Omitting the observations with NA values in cricket dataset using na.omit()

cricket <- na.omit(cricket)

#Verifying that special, missing and infinite values have successfully been removed by calling is.specialorNA

sapply(cricket, function(x) sum( is.specialorNA(x) ))
##       Player_Name          ODI_Team          ODI_Span       ODI_Matches 
##                 0                 0                 0                 0 
##       ODI_Innings     Not_Outs(ODI)          ODI_Runs            ODI_HS 
##                 0                 0                 0                 0 
##       ODI_Average Balls_Faced(ODIs)            ODI_SR      ODI_Hundreds 
##                 0                 0                 0                 0 
##       ODI_Fifties         ODI_Ducks         Test_Team         Test_Span 
##                 0                 0                 0                 0 
##      Test_Matches      Test_Innings    Not_Outs(Test)         Test_Runs 
##                 0                 0                 0                 0 
##           Test_HS      Test_Average     Test_Hundreds      Test_Fifties 
##                 0                 0                 0                 0 
##        Test_Ducks    Total_Hundreds     Total_Fifties        Total_Runs 
##                 0                 0                 0                 0

Scan II

Step 12. Subsetting the dataset to contain only numerical variables: The dataset has been subsetted to contain only numerical variables for the purpose of detecting any outliers in them. This is done using select() function of dplyr package. The resulting dataset is stored in cricket_outlier variable.

The class of each variable is validated after subsetting them to verify that the new dataset contains only numerical variables. This is done using check.class function in sapply().

#Subsetting dataset cricket to display only numerical variables using select() and storing the resulting dataser in cricket_outlier

cricket_outlier <- cricket %>% dplyr::select(c("ODI_Matches","ODI_Innings","Not_Outs(ODI)","ODI_Runs",
                                                       "ODI_Average","Balls_Faced(ODIs)","ODI_SR","ODI_Hundreds","ODI_Fifties",
                                                    "ODI_Ducks","Test_Matches","Test_Innings","Not_Outs(Test)","Test_Runs",
                                                       "Test_Average","Test_Hundreds","Test_Fifties","Test_Ducks","Total_Hundreds","Total_Fifties","Total_Runs"))

#Validating the datatype of each variable in the new dataset using check.class()

sapply(cricket_outlier,check.class)
##       ODI_Matches       ODI_Innings     Not_Outs(ODI)          ODI_Runs 
##         "numeric"         "numeric"         "numeric"         "numeric" 
##       ODI_Average Balls_Faced(ODIs)            ODI_SR      ODI_Hundreds 
##         "numeric"         "numeric"         "numeric"         "numeric" 
##       ODI_Fifties         ODI_Ducks      Test_Matches      Test_Innings 
##         "numeric"         "numeric"         "numeric"         "numeric" 
##    Not_Outs(Test)         Test_Runs      Test_Average     Test_Hundreds 
##         "numeric"         "numeric"         "numeric"         "numeric" 
##      Test_Fifties        Test_Ducks    Total_Hundreds     Total_Fifties 
##         "numeric"         "numeric"         "numeric"         "numeric" 
##        Total_Runs 
##         "numeric"

Step 13. Detecting outliers using The Mahalanobis distance multivariate outlier detection method: As the number of numerical variables in the dataset are more than two, the Mahalanobis distance multivariate outlier detection method was used to detect outliers. Mahalanobis distance follows a Chi-square distribution with n (number of variables) degrees of freedom, therefore any Mahalanobis distance greater than the critical chi-square value is treated as outliers.

The mvn() function define the multivariate outlier detection method using the multivariateOutlierMethod argument. When we use multivariateOutlierMethod = “quan” argument, it detects the multivariate outliers using the chi-square distribution critical value. The showOutliers = TRUE argument will depict any multivariate outliers and show them on the QQ plot.

The number of outliers is shown to be 151. It is more than 5% so we will use appropriate technique to handle outliers in cricket_outlier dataset.

# Multivariate outlier detection using Mahalanobis distance with QQ plots in the dataset cricket_outlier

results <- mvn(data = cricket_outlier, multivariateOutlierMethod = "quan", showOutliers = TRUE)
## Warning in covMcd(data, alpha = alpha): The covariance matrix of the data is singular.
## There are 1400 observations (in the entire dataset of 1400 obs.) lying
## on the hyperplane with equation a_1*(x_i1 - m_1) + ... + a_p*(x_ip -
## m_p) = 0 with (m_1, ..., m_p) the mean of these observations and
## coefficients a_i from the vector a <- c(0, 0, 0, -0.0573701, 0, 0, 0,
## -0.0661818, 0.570668, 0, 0, 0, 0, -0.0573701, 0, -0.0661818, 0.570668,
## 0, 0.0661818, -0.570668, 0.0573701)

Step 14. Applying binning technique to mitigate the effect of outliers: Binning (Discretisation) is applied to mitigate the effects of outliers. Binning or discretisation methods transform numerical variables into categorical counterparts. It controls or mitigates the impact of outliers over the model by placing them to the first or last category.

The equal-width binning strategy is applied where the variables are divided into n intervals of equal size. Using discretize() of infotheo package, we perform the binning to mitigate the effect of outliers in cricket_outlier dataset. After binning, the resulting dataset is stored in cricket_outlier1.

#Applying equal-width binning using discretize()

ew_binned<-discretize(cricket_outlier, disc = "equalwidth")
cricket_outlier1 <- cricket_outlier %>% bind_cols(ew_binned)
cricket_outlier1 %>% head(5)

Transform

Step 15. Checking distribution of variable Total_Runs from cricket_outlier1 dataset: The distribution of values in the variable Total_Runs is checked to see if its symmetric/normal.

#Checking the distribution of values in the variable Total_Runs

hist(cricket_outlier1$Total_Runs)

Step 16. Applying appropriate transformation to normalise the distribution of values in Total_Runs: As the distribution of values is right-skewed for Total_Runs variable, it is assumed that taking roots or logarithms or reciprocals work will work well in transforming the distribution to normal. The boxcox transformation is also applied as its considered a powerful transformation to normalize distribution.

After applying square-root, reciprocal,Box-Cox and log10 transformation, it is found that log10 transformation works best in normalizing the distribution of values in variable Total_Runs.

#Applying square root transformation to variable Total_Runs 

Total_Runs1 <- sqrt(cricket_outlier1$Total_Runs)
hist(Total_Runs1)

#Applying reciprocal tranformation to variable Total_Runs

Total_Runs2 <- 1/cricket_outlier1$Total_Runs

hist(Total_Runs2)

#Applying Box-Cox tranformation to variable Total_Runs

Total_Runs3<- BoxCox(cricket_outlier1$Total_Runs,lambda = "auto")

hist(Total_Runs3)

#Applying log10 transformation to variable Total_Runs 

Total_Runs <- log10(cricket_outlier1$Total_Runs)
hist(Total_Runs)