To apply various data preprocessing techniques to verify if the given dataset is in a tidy format for carrying out various analytical and statiscal studies. The assignment aims to inculcate a sense of optimal preprocessing of data to study its different aspects in terms of its characteristics, missing values, outliers, normalization and visual approaches.
The following packages are going to help us in importing data, making the data tidy and handling the variables efficiently for reading and understanding the data for statistical studies. They are also going to help us in handling the unusual special values and outliers in the variable and provide techniques to exclude these values. We are going to use the library() function to call these packages.
library(readr)
library(tidyr)
library(dplyr)
library(Hmisc)
library(outliers)
library(lubridate)
We are going to import dataset of cricket which depicts international cricket matches played across the world. We are required to combine two datasets and apply various preprocessing techniques in order to make the data tidy for manipulation. We are going to consider various numerical values such as scores and run rates to detect the missing values and outliers in the dataset. We are also required to tranform the variables to inorder to fit a normalized dataset.
The two datsets Match Results and Match Totals are going to used for preprocessing:
Match Totals Dataset :
Following is the url for fetching the dataset:
https://www.kaggle.com/saivamshi/cricket-world-cup-2019-players-data#ODI_Match_Totals.csv
Following are the variables in the dataset:
Score: gives the scores the team have scored
Overs: number of overs played the teams
Target: the target score opposition team needs to score
Inns: the innings number played the team
Result: the result of the match
Opposition: the opposition team
Ground: the venue of the match
Match_ID: The identity number of the match
Country: the country of the team
Country_ID: the identity number of the country
Match Results Dataset :
Following is the url for fetching the dataset:
https://www.kaggle.com/saivamshi/cricket-world-cup-2019-players-data#ODI_Match_Results.csv
Result: the result of the match
Margin: the difference of scores after final results
Toss: the result of toss
Bat: the innings the team batted, whether first or second
Opposition: the opposition team
Ground: the venue of the match
Start.Date: the date of the match
Match_ID: the identity number of the match
Country: the country of the team
Country_ID: the identity number of the country
We are going to use Base R functions to import data into the R Studio. As the downloaded file is saved in csv format in the path : C:/Users/Anup/Desktop/Assignment 3, we will use the read.csv() function to import the dataset. The argument stringAsFactors = FALSE will remove the unnecessary factor created during data importation.
The head() function will give us quick insight of the data.
We have created a unique column like a composite key in databases using two columns which will help us in joining two datasets.
Combining two datasets:
We are going to combine the datasets using a left join. We have considered ODI_Totals_rev1 to be the left dataset as it has most the relevant data and also ODI_Results_rev1 has data of abandoned and no result(n/r) matches, which is not giving us any valuable output. Hence a left_join will give us the desired dataset removing the data of abadoned or n/r matches.
#Importing ODI_Match_Totals:
ODI_Totals <- read.csv("C:/Users/Anup/Desktop/Assignment 3/ODI_Match_Totals.csv",stringsAsFactors = FALSE)
head(ODI_Totals)
#Created new column for unique identification:
ODI_Totals$MatchTeam_ID <- paste(ODI_Totals$Match_ID,'-',ODI_Totals$Country_ID)
#Remove irrelevant Column:
ODI_Totals_rev1 <- select(ODI_Totals, -X2)
#Importing ODI_Match_Result:
ODI_Results <- read.csv("C:/Users/Anup/Desktop/Assignment 3/ODI_Match_Results.csv", stringsAsFactors = FALSE)
head(ODI_Results)
#Created new column for unique identification:
ODI_Results$MatchTeam_ID <- paste(ODI_Results$Match_ID,'-',ODI_Results$Country_ID)
#Remove Repeated columns:
ODI_Results_rev1 <- select(ODI_Results, -Opposition, -Ground, -Start.Date, -Country, -Match_ID,-Country_ID, -X1, -Result, -BR )
head(ODI_Totals_rev1)
head(ODI_Results_rev1)
#COMBING DATASET USING LEFT JOIN AND REMOVE UNWANTED COLUMN:
Cricket_Dataset_rev0<- left_join(ODI_Totals_rev1,ODI_Results_rev1 , by = c("MatchTeam_ID"="MatchTeam_ID"))
#Dataset before datatype conversions:
Cricket_Dataset_rev1 <- select(Cricket_Dataset_rev0, -Match_ID)
#Dataset used for datatype conversions:
Cricket_Dataset_rev2 <- select(Cricket_Dataset_rev0, -Match_ID)
Following are the summaries of the dataset before and after the data type conversions. Most the variables like RPO, Result, Start.Date and Toss were imported as character. We have provided suitable conversions in the Cricket_Dataset_rev2.
Following are the data type conversions required in the the dataset:
RPO : converted to double.
Result : converted as factor.
Start.Date : converted to date.
Toss : coverted to factor.
Inns: converted to factor.
#Before datatype conversions
summary(Cricket_Dataset_rev1)
## Score Overs RPO Target
## Length:1296 Min. : 0.00 Min. : 0.000 Min. : 68.0
## Class :character 1st Qu.:40.20 1st Qu.: 4.700 1st Qu.:208.8
## Mode :character Median :48.20 Median : 5.400 Median :258.0
## Mean :43.09 Mean : 5.409 Mean :253.4
## 3rd Qu.:50.00 3rd Qu.: 6.112 3rd Qu.:301.0
## Max. :50.00 Max. :14.160 Max. :482.0
## NA's :676
## Inns Result Opposition Ground
## Min. :0.000 Length:1296 Length:1296 Length:1296
## 1st Qu.:1.000 Class :character Class :character Class :character
## Median :1.000 Mode :character Mode :character Mode :character
## Mean :1.463
## 3rd Qu.:2.000
## Max. :2.000
##
## Start.Date Country Country_ID MatchTeam_ID
## Length:1296 Length:1296 Min. : 1.000 Length:1296
## Class :character Class :character 1st Qu.: 3.000 Class :character
## Mode :character Mode :character Median : 5.000 Mode :character
## Mean : 8.553
## 3rd Qu.: 8.000
## Max. :40.000
##
## Margin Toss Bat
## Length:1296 Length:1296 Length:1296
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
#Before datatype conversions
head(Cricket_Dataset_rev1)
#RPO to double
Cricket_Dataset_rev2$RPO <- as.double(Cricket_Dataset_rev2$RPO)
#Result to factor
Cricket_Dataset_rev2$Result <- as.factor(Cricket_Dataset_rev2$Result)
#StringDate to Date
Cricket_Dataset_rev2$Start.Date <- dmy(Cricket_Dataset_rev2$Start.Date)
#Toss to factor
Cricket_Dataset_rev2$Toss <- as.factor(Cricket_Dataset_rev2$Toss)
#Inns to factor
Cricket_Dataset_rev2$Inns <- as.factor(Cricket_Dataset_rev2$Inns)
#Bat to Factor
Cricket_Dataset_rev2$Bat <- as.factor(Cricket_Dataset_rev2$Bat)
#CountryID to Factor
Cricket_Dataset_rev2$Country_ID <- as.factor(Cricket_Dataset_rev2$Country_ID)
#Post datatype conversions
summary(Cricket_Dataset_rev2)
## Score Overs RPO Target
## Length:1296 Min. : 0.00 Min. : 0.000 Min. : 68.0
## Class :character 1st Qu.:40.20 1st Qu.: 4.700 1st Qu.:208.8
## Mode :character Median :48.20 Median : 5.400 Median :258.0
## Mean :43.09 Mean : 5.409 Mean :253.4
## 3rd Qu.:50.00 3rd Qu.: 6.112 3rd Qu.:301.0
## Max. :50.00 Max. :14.160 Max. :482.0
## NA's :676
## Inns Result Opposition Ground
## 0: 20 - : 3 Length:1296 Length:1296
## 1:656 lost:562 Class :character Class :character
## 2:620 n/r : 54 Mode :character Mode :character
## tied: 14
## won :663
##
##
## Start.Date Country Country_ID MatchTeam_ID
## Min. :2013-01-03 Length:1296 8 :160 Length:1296
## 1st Qu.:2014-08-21 Class :character 6 :156 Class :character
## Median :2015-12-31 Mode :character 1 :145 Mode :character
## Mean :2016-02-15 7 :139
## 3rd Qu.:2017-09-29 3 :135
## Max. :2019-05-19 2 :131
## (Other):430
## Margin Toss Bat
## Length:1296 lost:657 - : 4
## Class :character won :639 1st:656
## Mode :character 2nd:636
##
##
##
##
head(Cricket_Dataset_rev2)
We have already made the dataset tidy by making columns unique, now we are going to arrange the dataset to make the data more readable. Each column and row is unique which ensures that the dataset is tidy and readable
The MatchTeam_ID can be seperated into two variables, that is, MatchFormat and MatchTeam_ID
Also we have renamed the column Start.Date to Match_Date.
#Arrange Columns
Cricket_Dataset_rev3 <- select(Cricket_Dataset_rev2, MatchTeam_ID, Start.Date, Country_ID,Country, Opposition, Toss,Bat, Inns, Score, Target, Result,Overs, RPO, Margin, Ground)
#Rename Column:
colnames(Cricket_Dataset_rev3)[colnames(Cricket_Dataset_rev3)=="Start.Date"] <- "Match_Date"
head(Cricket_Dataset_rev3)
#Making the data tidy:
Cricket_Dataset_rev4<- separate(Cricket_Dataset_rev3, col = MatchTeam_ID, into= c("Match_Format","MatchTeam_Id"), sep = '#')
head(Cricket_Dataset_rev4)
Let us introduce new variable for prediction of score for 50 overs for every match. As we know, runrate is double variable and target cannot be double, we have converted the Predicted_Score as integer.
Let us remove the irrevelant columns, as we are now only focussed on required statiscal data.
#Creating new variable of Predicted score for 50 overs:
Cricket_Dataset_rev4$Predicted_Score <- as.integer(50*Cricket_Dataset_rev4$RPO)
head(Cricket_Dataset_rev4)
#Removing unneccesary columns:
Cricket_Dataset_rev5 <- select(Cricket_Dataset_rev4, -Toss, -Inns )
head(Cricket_Dataset_rev5)
We are going to create functions to detect NA, NAN and Inf values. We have created following functions to detect these values according to the datatype:
is.specialorNA_chr : to scan values in character variable
is.specialorNA_nr : to scan values in numeric variable
is.specialorNA_fc : to scan values in factor variable
is.specialorNA_dt : to scan values in date variable
We found NA values in the Target. The Target variable is bound to have NA values, as only the team which will bat second will have a target to chase. Hence we have replace the NA values in Target with "Not Applicable."
#Functions to detect special values
#for character variable
is.specialorNA_chr <- function(x){
if(is.character(x))(is.infinite(x) | is.nan(x) | is.na(x))}
#for numeric variable
is.specialorNA_nr <- function(x){
if(is.numeric(x))(is.infinite(x) | is.nan(x) | is.na(x))}
#for factor variable
is.specialorNA_fc <- function(x){
if(is.factor(x))(is.infinite(x) | is.nan(x) | is.na(x))}
#for date variable
is.specialorNA_dt <- function(x){
if(is.Date(x))(is.infinite(x) | is.nan(x) | is.na(x))}
which(is.na(Cricket_Dataset_rev5$RPO))
## integer(0)
#Application of functions:
#Character variables
which(sapply(Cricket_Dataset_rev5$Match_Format , is.specialorNA_chr))
## named integer(0)
which(sapply(Cricket_Dataset_rev5$MatchTeam_Id, is.specialorNA_chr))
## named integer(0)
which(sapply(Cricket_Dataset_rev5$Country, is.specialorNA_chr))
## named integer(0)
which(sapply(Cricket_Dataset_rev5$Opposition, is.specialorNA_chr))
## named integer(0)
#Numeric variables
which(sapply(Cricket_Dataset_rev5$RPO, is.specialorNA_nr))
## integer(0)
which(sapply(Cricket_Dataset_rev5$Overs, is.specialorNA_nr))
## integer(0)
which(sapply(Cricket_Dataset_rev5$Predicted_Score , is.specialorNA_nr))
## integer(0)
which(sapply(Cricket_Dataset_rev5$Target , is.specialorNA_nr))
## [1] 1 4 6 8 10 11 14 16 17 20 22 24 26 28
## [15] 29 31 34 36 37 40 42 43 45 46 52 54 56 58
## [29] 61 63 64 65 66 68 69 70 71 72 74 76 77 80
## [43] 82 84 85 87 89 91 94 96 98 99 102 103 105 107
## [57] 109 112 114 115 118 120 122 123 126 127 130 131 133 135
## [71] 138 140 142 145 146 150 151 152 154 156 157 158 160 162
## [85] 166 168 169 172 174 176 177 180 182 183 185 187 189 191
## [99] 193 195 197 199 202 204 205 208 210 211 213 215 216 218
## [113] 220 222 223 226 228 230 232 233 235 237 239 241 243 245
## [127] 247 250 252 253 255 257 260 262 265 266 269 270 273 275
## [141] 276 278 281 283 285 287 289 291 292 293 295 296 298 300
## [155] 303 305 306 308 309 311 312 315 316 317 319 320 321 323
## [169] 325 327 329 330 332 335 336 338 340 342 343 345 347 348
## [183] 350 352 354 355 357 360 362 364 366 367 369 370 372 373
## [197] 375 378 380 382 383 386 387 389 390 392 393 394 396 397
## [211] 399 400 403 407 409 410 412 414 416 418 421 423 425 426
## [225] 430 431 434 435 436 437 438 441 442 444 446 449 450 453
## [239] 455 456 457 459 460 462 464 467 469 471 473 474 476 477
## [253] 479 482 483 486 488 489 490 492 495 497 500 501 502 503
## [267] 504 507 510 512 514 516 518 519 521 523 528 530 532 535
## [281] 536 539 540 543 544 546 548 549 551 553 554 557 558 561
## [295] 562 564 566 568 570 571 573 574 576 577 580 582 583 585
## [309] 587 589 590 592 594 596 598 600 601 604 605 607 610 613
## [323] 614 615 616 618 621 622 623 625 627 629 630 632 633 635
## [337] 637 638 641 642 643 645 649 650 651 655 657 659 661 664
## [351] 666 668 669 671 674 675 677 679 682 684 686 688 689 691
## [365] 695 698 701 702 703 704 705 706 708 710 712 714 717 718
## [379] 721 722 723 724 725 726 728 729 731 733 735 737 740 741
## [393] 743 745 748 749 750 753 756 758 759 761 764 766 768 769
## [407] 772 774 776 777 779 781 783 785 788 790 793 795 799 801
## [421] 803 804 806 809 811 813 815 816 818 821 823 825 826 828
## [435] 831 832 834 837 839 840 842 843 846 847 848 850 851 854
## [449] 856 857 860 861 862 863 864 867 868 869 870 873 875 876
## [463] 879 880 881 883 885 886 889 891 893 895 896 899 901 902
## [477] 905 906 909 910 913 915 916 919 920 923 924 925 927 928
## [491] 930 931 933 934 936 938 941 943 944 945 947 949 952 953
## [505] 955 957 959 961 962 964 966 967 970 971 974 976 978 980
## [519] 981 983 985 988 989 991 993 996 997 1000 1001 1003 1004 1006
## [533] 1009 1011 1013 1014 1017 1019 1021 1025 1027 1029 1030 1033 1035 1036
## [547] 1038 1040 1042 1044 1047 1048 1050 1054 1056 1058 1060 1061 1064 1065
## [561] 1066 1069 1072 1074 1075 1079 1081 1085 1086 1088 1091 1093 1095 1096
## [575] 1098 1100 1103 1104 1106 1107 1110 1111 1113 1116 1117 1119 1121 1122
## [589] 1125 1128 1129 1130 1133 1135 1137 1138 1140 1143 1145 1147 1149 1151
## [603] 1152 1154 1155 1157 1159 1161 1162 1165 1168 1171 1172 1175 1177 1179
## [617] 1181 1182 1184 1186 1189 1190 1193 1195 1197 1199 1201 1203 1204 1207
## [631] 1208 1211 1213 1214 1216 1219 1221 1223 1225 1227 1228 1230 1232 1234
## [645] 1235 1237 1239 1241 1243 1244 1246 1247 1250 1251 1253 1254 1255 1258
## [659] 1260 1261 1263 1265 1267 1270 1272 1274 1276 1277 1279 1280 1284 1285
## [673] 1287 1290 1292 1296
#Factor Variables
which(sapply(Cricket_Dataset_rev5$Country_ID , is.specialorNA_fc))
## integer(0)
which(sapply(Cricket_Dataset_rev5$Result , is.specialorNA_fc))
## integer(0)
#Imputed NA with "Not Applicable"
Cricket_Dataset_rev6 <-
Cricket_Dataset_rev5 %>% group_by(Country_ID) %>%
mutate(Target = ifelse(is.na(Target),"Not Applicable", (Target) ))
head(Cricket_Dataset_rev6)
We are going to use the Z-score approach to detect the outliers.
The z-scores in this study are not the unwanted values. Infact they help us in understanding which teams performed exceptionally well or poor in the match.
The RPO z-score will tell us which batted really well if z-score is greater than 3 or extremely poor if the z-score is less than -3.
For example the which((z.scores_RPO) >3 ) shows observation no. 230 RPO as 13.47, which is exceptional well maintained by New Zealand against West Indies which resulted in their victory.
On the other hand, which((z.scores_RPO)< (-3)) will show countries who RPO are poor in their performance. But this dataset there are no such values.
Similar results can be deduced using the Overs variable, showing the performance of the teams. Hence we are not going to eliminate these values.
#Z-scores in RPO:
z.scores_RPO <- Cricket_Dataset_rev6$RPO %>% scores(type = "z")
which((z.scores_RPO) >3 )
## [1] 230 481 646 1088 1242
which((z.scores_RPO)< (-3))
## [1] 65 158 216 309 370 435 457 548 650 705 793 869 924 925
## [15] 930 961 1151 1234 1235 1280
#Z-scores in Overs:
z.scores_Overs <- Cricket_Dataset_rev6$Overs %>% scores(type = "z")
which((z.scores_Overs) >3 )
## integer(0)
which((z.scores_Overs)< (-3))
## [1] 19 32 64 65 158 171 192 216 309 370 435 457 548 646
## [15] 650 651 704 705 715 793 869 897 924 925 930 961 962 1102
## [29] 1151 1185 1234 1235 1280
The dataset is our study has definite range of numerical data and does not vary much. Hence the data is normalized. However we can going to us loge and log10 transformation to check the variations in transformation of variables in dataset.
In our case, applying the loge and log10 transformation did not make any difference to the RPO variable. We have also applied log10 transformation to Predicted_Score variable.
The log transformation compresses high values and spreads low values by expressing the values as orders of magnitude. Hence the curve plotted on the histogram is more smooth in logarithmic transformation.
#Transformation of Run Rate per Over
hist(Cricket_Dataset_rev6$RPO)
log_e_RPO <- log(Cricket_Dataset_rev6$RPO)
log_10_RPO <- log10(Cricket_Dataset_rev6$RPO)
hist(log_e_RPO)
hist(log_10_RPO)
#Transformation of Predicted_Score
hist(Cricket_Dataset_rev6$Predicted_Score )
log_Predicted_Score <- log10(Cricket_Dataset_rev6$Predicted_Score)
hist(log_Predicted_Score)
We have successfully used different data preprocessing techniques to tidy data, find outliers, deal with special values, transform and normalise data according to our needs and objectives. Following are tasks carried using the Cricket dataset:
Imported and joined two dataset using left join.
Created necessary variables and made the data tidy.
Used filtering and selecting techniques to handle dataset.
Creating functions to detect NA and Special values and imputing relevant values.
Using z-score technique in analysing the performance of the team.
Applying transformation to make the data more normalized on few select variables of interest.
Cricket Dataset :
https://www.kaggle.com/saivamshi/cricket-world-cup-2019-players-data#ODI_Match_Totals.csv
For methods and implementation:
Course website modules of MATH2349 Data Preprocessing, by Dr. Anil Dolgun, RMIT University