Problem Statement:

To apply various data preprocessing techniques to verify if the given dataset is in a tidy format for carrying out various analytical and statiscal studies. The assignment aims to inculcate a sense of optimal preprocessing of data to study its different aspects in terms of its characteristics, missing values, outliers, normalization and visual approaches.

Required packages

The following packages are going to help us in importing data, making the data tidy and handling the variables efficiently for reading and understanding the data for statistical studies. They are also going to help us in handling the unusual special values and outliers in the variable and provide techniques to exclude these values. We are going to use the library() function to call these packages.

library(readr)
library(tidyr)
library(dplyr)
library(Hmisc)
library(outliers)
library(lubridate)

Executive Summary

We are going to import dataset of cricket which depicts international cricket matches played across the world. We are required to combine two datasets and apply various preprocessing techniques in order to make the data tidy for manipulation. We are going to consider various numerical values such as scores and run rates to detect the missing values and outliers in the dataset. We are also required to tranform the variables to inorder to fit a normalized dataset.

Data

The two datsets Match Results and Match Totals are going to used for preprocessing:

Match Totals Dataset :

Following is the url for fetching the dataset:

https://www.kaggle.com/saivamshi/cricket-world-cup-2019-players-data#ODI_Match_Totals.csv

Following are the variables in the dataset:

Score: gives the scores the team have scored

Overs: number of overs played the teams

Target: the target score opposition team needs to score

Inns: the innings number played the team

Result: the result of the match

Opposition: the opposition team

Ground: the venue of the match

Match_ID: The identity number of the match

Country: the country of the team

Country_ID: the identity number of the country

Match Results Dataset :

Following is the url for fetching the dataset:

https://www.kaggle.com/saivamshi/cricket-world-cup-2019-players-data#ODI_Match_Results.csv

Result: the result of the match

Margin: the difference of scores after final results

Toss: the result of toss

Bat: the innings the team batted, whether first or second

Opposition: the opposition team

Ground: the venue of the match

Start.Date: the date of the match

Match_ID: the identity number of the match

Country: the country of the team

Country_ID: the identity number of the country

We are going to use Base R functions to import data into the R Studio. As the downloaded file is saved in csv format in the path : C:/Users/Anup/Desktop/Assignment 3, we will use the read.csv() function to import the dataset. The argument stringAsFactors = FALSE will remove the unnecessary factor created during data importation.

The head() function will give us quick insight of the data.

We have created a unique column like a composite key in databases using two columns which will help us in joining two datasets.

Combining two datasets:

We are going to combine the datasets using a left join. We have considered ODI_Totals_rev1 to be the left dataset as it has most the relevant data and also ODI_Results_rev1 has data of abandoned and no result(n/r) matches, which is not giving us any valuable output. Hence a left_join will give us the desired dataset removing the data of abadoned or n/r matches.

#Importing ODI_Match_Totals:
ODI_Totals  <- read.csv("C:/Users/Anup/Desktop/Assignment 3/ODI_Match_Totals.csv",stringsAsFactors = FALSE)

head(ODI_Totals)
#Created new column for unique identification:
ODI_Totals$MatchTeam_ID <- paste(ODI_Totals$Match_ID,'-',ODI_Totals$Country_ID)


#Remove irrelevant Column:
ODI_Totals_rev1 <- select(ODI_Totals, -X2)

#Importing ODI_Match_Result:
ODI_Results <- read.csv("C:/Users/Anup/Desktop/Assignment 3/ODI_Match_Results.csv",   stringsAsFactors = FALSE)

head(ODI_Results)
#Created new column for unique identification:
ODI_Results$MatchTeam_ID <- paste(ODI_Results$Match_ID,'-',ODI_Results$Country_ID)



#Remove Repeated columns:
ODI_Results_rev1 <- select(ODI_Results, -Opposition, -Ground, -Start.Date, -Country, -Match_ID,-Country_ID, -X1, -Result, -BR )


head(ODI_Totals_rev1)
head(ODI_Results_rev1)
#COMBING DATASET USING LEFT JOIN AND REMOVE UNWANTED COLUMN:

Cricket_Dataset_rev0<- left_join(ODI_Totals_rev1,ODI_Results_rev1 , by = c("MatchTeam_ID"="MatchTeam_ID"))

#Dataset before datatype conversions:
Cricket_Dataset_rev1 <- select(Cricket_Dataset_rev0, -Match_ID)


#Dataset used for datatype conversions:
Cricket_Dataset_rev2 <- select(Cricket_Dataset_rev0, -Match_ID)

Understand

Following are the summaries of the dataset before and after the data type conversions. Most the variables like RPO, Result, Start.Date and Toss were imported as character. We have provided suitable conversions in the Cricket_Dataset_rev2.

Following are the data type conversions required in the the dataset:

RPO : converted to double.

Result : converted as factor.

Start.Date : converted to date.

Toss : coverted to factor.

Inns: converted to factor.

#Before datatype conversions
summary(Cricket_Dataset_rev1)
##     Score               Overs            RPO             Target     
##  Length:1296        Min.   : 0.00   Min.   : 0.000   Min.   : 68.0  
##  Class :character   1st Qu.:40.20   1st Qu.: 4.700   1st Qu.:208.8  
##  Mode  :character   Median :48.20   Median : 5.400   Median :258.0  
##                     Mean   :43.09   Mean   : 5.409   Mean   :253.4  
##                     3rd Qu.:50.00   3rd Qu.: 6.112   3rd Qu.:301.0  
##                     Max.   :50.00   Max.   :14.160   Max.   :482.0  
##                                                      NA's   :676    
##       Inns          Result           Opposition           Ground         
##  Min.   :0.000   Length:1296        Length:1296        Length:1296       
##  1st Qu.:1.000   Class :character   Class :character   Class :character  
##  Median :1.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1.463                                                           
##  3rd Qu.:2.000                                                           
##  Max.   :2.000                                                           
##                                                                          
##   Start.Date          Country            Country_ID     MatchTeam_ID      
##  Length:1296        Length:1296        Min.   : 1.000   Length:1296       
##  Class :character   Class :character   1st Qu.: 3.000   Class :character  
##  Mode  :character   Mode  :character   Median : 5.000   Mode  :character  
##                                        Mean   : 8.553                     
##                                        3rd Qu.: 8.000                     
##                                        Max.   :40.000                     
##                                                                           
##     Margin              Toss               Bat           
##  Length:1296        Length:1296        Length:1296       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
## 
#Before datatype conversions
head(Cricket_Dataset_rev1)
#RPO to double
Cricket_Dataset_rev2$RPO <- as.double(Cricket_Dataset_rev2$RPO)

#Result to factor
Cricket_Dataset_rev2$Result <- as.factor(Cricket_Dataset_rev2$Result)

#StringDate to Date
Cricket_Dataset_rev2$Start.Date <- dmy(Cricket_Dataset_rev2$Start.Date)

#Toss to factor
Cricket_Dataset_rev2$Toss <-  as.factor(Cricket_Dataset_rev2$Toss)

#Inns to factor
Cricket_Dataset_rev2$Inns <-  as.factor(Cricket_Dataset_rev2$Inns)

#Bat to Factor
Cricket_Dataset_rev2$Bat <-  as.factor(Cricket_Dataset_rev2$Bat)

#CountryID to Factor
Cricket_Dataset_rev2$Country_ID <-  as.factor(Cricket_Dataset_rev2$Country_ID)


#Post datatype conversions
summary(Cricket_Dataset_rev2)
##     Score               Overs            RPO             Target     
##  Length:1296        Min.   : 0.00   Min.   : 0.000   Min.   : 68.0  
##  Class :character   1st Qu.:40.20   1st Qu.: 4.700   1st Qu.:208.8  
##  Mode  :character   Median :48.20   Median : 5.400   Median :258.0  
##                     Mean   :43.09   Mean   : 5.409   Mean   :253.4  
##                     3rd Qu.:50.00   3rd Qu.: 6.112   3rd Qu.:301.0  
##                     Max.   :50.00   Max.   :14.160   Max.   :482.0  
##                                                      NA's   :676    
##  Inns     Result     Opposition           Ground         
##  0: 20   -   :  3   Length:1296        Length:1296       
##  1:656   lost:562   Class :character   Class :character  
##  2:620   n/r : 54   Mode  :character   Mode  :character  
##          tied: 14                                        
##          won :663                                        
##                                                          
##                                                          
##    Start.Date           Country            Country_ID  MatchTeam_ID      
##  Min.   :2013-01-03   Length:1296        8      :160   Length:1296       
##  1st Qu.:2014-08-21   Class :character   6      :156   Class :character  
##  Median :2015-12-31   Mode  :character   1      :145   Mode  :character  
##  Mean   :2016-02-15                      7      :139                     
##  3rd Qu.:2017-09-29                      3      :135                     
##  Max.   :2019-05-19                      2      :131                     
##                                          (Other):430                     
##     Margin            Toss      Bat     
##  Length:1296        lost:657   -  :  4  
##  Class :character   won :639   1st:656  
##  Mode  :character              2nd:636  
##                                         
##                                         
##                                         
## 
head(Cricket_Dataset_rev2)

Tidy & Manipulate Data I

We have already made the dataset tidy by making columns unique, now we are going to arrange the dataset to make the data more readable. Each column and row is unique which ensures that the dataset is tidy and readable

The MatchTeam_ID can be seperated into two variables, that is, MatchFormat and MatchTeam_ID

Also we have renamed the column Start.Date to Match_Date.

#Arrange Columns

Cricket_Dataset_rev3 <- select(Cricket_Dataset_rev2, MatchTeam_ID, Start.Date, Country_ID,Country, Opposition, Toss,Bat, Inns, Score, Target, Result,Overs, RPO, Margin, Ground)

#Rename Column:
colnames(Cricket_Dataset_rev3)[colnames(Cricket_Dataset_rev3)=="Start.Date"] <- "Match_Date"

head(Cricket_Dataset_rev3)
#Making the data tidy:

Cricket_Dataset_rev4<- separate(Cricket_Dataset_rev3, col = MatchTeam_ID, into= c("Match_Format","MatchTeam_Id"), sep = '#')

head(Cricket_Dataset_rev4)

Tidy & Manipulate Data II

Let us introduce new variable for prediction of score for 50 overs for every match. As we know, runrate is double variable and target cannot be double, we have converted the Predicted_Score as integer.

Let us remove the irrevelant columns, as we are now only focussed on required statiscal data.

#Creating new variable of Predicted score for 50 overs:
Cricket_Dataset_rev4$Predicted_Score <- as.integer(50*Cricket_Dataset_rev4$RPO)

head(Cricket_Dataset_rev4)
#Removing unneccesary columns:
Cricket_Dataset_rev5 <- select(Cricket_Dataset_rev4, -Toss, -Inns )

head(Cricket_Dataset_rev5)

Scan I

We are going to create functions to detect NA, NAN and Inf values. We have created following functions to detect these values according to the datatype:

is.specialorNA_chr : to scan values in character variable

is.specialorNA_nr : to scan values in numeric variable

is.specialorNA_fc : to scan values in factor variable

is.specialorNA_dt : to scan values in date variable

We found NA values in the Target. The Target variable is bound to have NA values, as only the team which will bat second will have a target to chase. Hence we have replace the NA values in Target with "Not Applicable."

#Functions to detect special values

#for character variable
is.specialorNA_chr <- function(x){
if(is.character(x))(is.infinite(x) | is.nan(x) | is.na(x))}

#for numeric variable
is.specialorNA_nr <- function(x){
if(is.numeric(x))(is.infinite(x) | is.nan(x) | is.na(x))}

#for factor variable
is.specialorNA_fc <- function(x){
if(is.factor(x))(is.infinite(x) | is.nan(x) | is.na(x))}

#for date variable
is.specialorNA_dt <- function(x){
if(is.Date(x))(is.infinite(x) | is.nan(x) | is.na(x))}

which(is.na(Cricket_Dataset_rev5$RPO))
## integer(0)
#Application of functions:

#Character variables
which(sapply(Cricket_Dataset_rev5$Match_Format , is.specialorNA_chr))
## named integer(0)
which(sapply(Cricket_Dataset_rev5$MatchTeam_Id, is.specialorNA_chr))
## named integer(0)
which(sapply(Cricket_Dataset_rev5$Country, is.specialorNA_chr))
## named integer(0)
which(sapply(Cricket_Dataset_rev5$Opposition, is.specialorNA_chr))
## named integer(0)
#Numeric variables
which(sapply(Cricket_Dataset_rev5$RPO, is.specialorNA_nr))
## integer(0)
which(sapply(Cricket_Dataset_rev5$Overs, is.specialorNA_nr))
## integer(0)
which(sapply(Cricket_Dataset_rev5$Predicted_Score , is.specialorNA_nr))
## integer(0)
which(sapply(Cricket_Dataset_rev5$Target , is.specialorNA_nr))
##   [1]    1    4    6    8   10   11   14   16   17   20   22   24   26   28
##  [15]   29   31   34   36   37   40   42   43   45   46   52   54   56   58
##  [29]   61   63   64   65   66   68   69   70   71   72   74   76   77   80
##  [43]   82   84   85   87   89   91   94   96   98   99  102  103  105  107
##  [57]  109  112  114  115  118  120  122  123  126  127  130  131  133  135
##  [71]  138  140  142  145  146  150  151  152  154  156  157  158  160  162
##  [85]  166  168  169  172  174  176  177  180  182  183  185  187  189  191
##  [99]  193  195  197  199  202  204  205  208  210  211  213  215  216  218
## [113]  220  222  223  226  228  230  232  233  235  237  239  241  243  245
## [127]  247  250  252  253  255  257  260  262  265  266  269  270  273  275
## [141]  276  278  281  283  285  287  289  291  292  293  295  296  298  300
## [155]  303  305  306  308  309  311  312  315  316  317  319  320  321  323
## [169]  325  327  329  330  332  335  336  338  340  342  343  345  347  348
## [183]  350  352  354  355  357  360  362  364  366  367  369  370  372  373
## [197]  375  378  380  382  383  386  387  389  390  392  393  394  396  397
## [211]  399  400  403  407  409  410  412  414  416  418  421  423  425  426
## [225]  430  431  434  435  436  437  438  441  442  444  446  449  450  453
## [239]  455  456  457  459  460  462  464  467  469  471  473  474  476  477
## [253]  479  482  483  486  488  489  490  492  495  497  500  501  502  503
## [267]  504  507  510  512  514  516  518  519  521  523  528  530  532  535
## [281]  536  539  540  543  544  546  548  549  551  553  554  557  558  561
## [295]  562  564  566  568  570  571  573  574  576  577  580  582  583  585
## [309]  587  589  590  592  594  596  598  600  601  604  605  607  610  613
## [323]  614  615  616  618  621  622  623  625  627  629  630  632  633  635
## [337]  637  638  641  642  643  645  649  650  651  655  657  659  661  664
## [351]  666  668  669  671  674  675  677  679  682  684  686  688  689  691
## [365]  695  698  701  702  703  704  705  706  708  710  712  714  717  718
## [379]  721  722  723  724  725  726  728  729  731  733  735  737  740  741
## [393]  743  745  748  749  750  753  756  758  759  761  764  766  768  769
## [407]  772  774  776  777  779  781  783  785  788  790  793  795  799  801
## [421]  803  804  806  809  811  813  815  816  818  821  823  825  826  828
## [435]  831  832  834  837  839  840  842  843  846  847  848  850  851  854
## [449]  856  857  860  861  862  863  864  867  868  869  870  873  875  876
## [463]  879  880  881  883  885  886  889  891  893  895  896  899  901  902
## [477]  905  906  909  910  913  915  916  919  920  923  924  925  927  928
## [491]  930  931  933  934  936  938  941  943  944  945  947  949  952  953
## [505]  955  957  959  961  962  964  966  967  970  971  974  976  978  980
## [519]  981  983  985  988  989  991  993  996  997 1000 1001 1003 1004 1006
## [533] 1009 1011 1013 1014 1017 1019 1021 1025 1027 1029 1030 1033 1035 1036
## [547] 1038 1040 1042 1044 1047 1048 1050 1054 1056 1058 1060 1061 1064 1065
## [561] 1066 1069 1072 1074 1075 1079 1081 1085 1086 1088 1091 1093 1095 1096
## [575] 1098 1100 1103 1104 1106 1107 1110 1111 1113 1116 1117 1119 1121 1122
## [589] 1125 1128 1129 1130 1133 1135 1137 1138 1140 1143 1145 1147 1149 1151
## [603] 1152 1154 1155 1157 1159 1161 1162 1165 1168 1171 1172 1175 1177 1179
## [617] 1181 1182 1184 1186 1189 1190 1193 1195 1197 1199 1201 1203 1204 1207
## [631] 1208 1211 1213 1214 1216 1219 1221 1223 1225 1227 1228 1230 1232 1234
## [645] 1235 1237 1239 1241 1243 1244 1246 1247 1250 1251 1253 1254 1255 1258
## [659] 1260 1261 1263 1265 1267 1270 1272 1274 1276 1277 1279 1280 1284 1285
## [673] 1287 1290 1292 1296
#Factor Variables
which(sapply(Cricket_Dataset_rev5$Country_ID , is.specialorNA_fc))
## integer(0)
which(sapply(Cricket_Dataset_rev5$Result , is.specialorNA_fc))
## integer(0)
#Imputed NA with "Not Applicable"
Cricket_Dataset_rev6 <-  
Cricket_Dataset_rev5 %>% group_by(Country_ID)  %>%
mutate(Target = ifelse(is.na(Target),"Not Applicable", (Target) ))


head(Cricket_Dataset_rev6)

Scan II

We are going to use the Z-score approach to detect the outliers.

The z-scores in this study are not the unwanted values. Infact they help us in understanding which teams performed exceptionally well or poor in the match.

The RPO z-score will tell us which batted really well if z-score is greater than 3 or extremely poor if the z-score is less than -3.

For example the which((z.scores_RPO) >3 ) shows observation no. 230 RPO as 13.47, which is exceptional well maintained by New Zealand against West Indies which resulted in their victory.

On the other hand, which((z.scores_RPO)< (-3)) will show countries who RPO are poor in their performance. But this dataset there are no such values.

Similar results can be deduced using the Overs variable, showing the performance of the teams. Hence we are not going to eliminate these values.

#Z-scores in RPO:
z.scores_RPO <- Cricket_Dataset_rev6$RPO %>% scores(type = "z")

which((z.scores_RPO) >3 )
## [1]  230  481  646 1088 1242
which((z.scores_RPO)< (-3))
##  [1]   65  158  216  309  370  435  457  548  650  705  793  869  924  925
## [15]  930  961 1151 1234 1235 1280
#Z-scores in Overs:
z.scores_Overs <- Cricket_Dataset_rev6$Overs %>% scores(type = "z")

which((z.scores_Overs) >3 )
## integer(0)
which((z.scores_Overs)< (-3))
##  [1]   19   32   64   65  158  171  192  216  309  370  435  457  548  646
## [15]  650  651  704  705  715  793  869  897  924  925  930  961  962 1102
## [29] 1151 1185 1234 1235 1280

Transform

The dataset is our study has definite range of numerical data and does not vary much. Hence the data is normalized. However we can going to us loge and log10 transformation to check the variations in transformation of variables in dataset.

In our case, applying the loge and log10 transformation did not make any difference to the RPO variable. We have also applied log10 transformation to Predicted_Score variable.

The log transformation compresses high values and spreads low values by expressing the values as orders of magnitude. Hence the curve plotted on the histogram is more smooth in logarithmic transformation.

#Transformation of Run Rate per Over
hist(Cricket_Dataset_rev6$RPO)

log_e_RPO <- log(Cricket_Dataset_rev6$RPO)

log_10_RPO <- log10(Cricket_Dataset_rev6$RPO)

hist(log_e_RPO)

hist(log_10_RPO)

#Transformation of Predicted_Score
hist(Cricket_Dataset_rev6$Predicted_Score )

log_Predicted_Score <- log10(Cricket_Dataset_rev6$Predicted_Score)

hist(log_Predicted_Score)

Conclusion

We have successfully used different data preprocessing techniques to tidy data, find outliers, deal with special values, transform and normalise data according to our needs and objectives. Following are tasks carried using the Cricket dataset:

  1. Imported and joined two dataset using left join.

  2. Created necessary variables and made the data tidy.

  3. Used filtering and selecting techniques to handle dataset.

  4. Creating functions to detect NA and Special values and imputing relevant values.

  5. Using z-score technique in analysing the performance of the team.

  6. Applying transformation to make the data more normalized on few select variables of interest.

References

Cricket Dataset :

https://www.kaggle.com/saivamshi/cricket-world-cup-2019-players-data#ODI_Match_Totals.csv

For methods and implementation:

Course website modules of MATH2349 Data Preprocessing, by Dr. Anil Dolgun, RMIT University