The following packages were required to preprocess the dataset.
library(readr)
library(dplyr)
library(Hmisc)
library(lubridate)
library(outliers)
library(forecast)
library(knitr)
In this assignment we preprocess ball-by-ball data for every IPL cricket match played between 2008 and 2019. The dataset records the statistics of every ball bowled in the IPL over that period.
To preprocess the data, two datasets (IPL matches and Ball summary) were merged on the match identifier (id in the matches data, match_id in the ball data). The variables were then inspected closely to understand their specification and contents, and suitable data type conversions were applied.
Because the merged dataset already satisfies Hadley Wickham's tidy data principles, no tidying functions were needed. A new variable, Run_Rate, was created using the mutate() function.
While scanning the data we found missing values in several variables. Some were imputed with the most frequent value (the mode) and the rest were recoded to descriptive labels. Mean or median imputation could have been used for numeric variables, but no numeric variable in this dataset had missing values. No special values (infinite or NaN) were present either. Boxplots showed outliers in total_runs and victory_by_runs, which were then located using z-scores. Since the outliers made up less than 5% of the dataset, it was assumed that they were not data entry errors. They were left untreated, as they may be a meaningful part of the dataset.
The numeric variable victory_by_runs was transformed to reduce the skewness of its distribution and bring it closer to normal.
The class() function was used to check that both datasets were imported as data frames. A full join was used to merge the two datasets on the key (id = match_id).
The dataset was collected from an open source (https://www.kaggle.com/nowe9/ipldata). The IPL is a T20-format cricket league held in India every year since 2008. The data covers 12 seasons of the IPL (2008-2019).
Ball summary is a ball-by-ball dataset containing 179078 observations and 20 variables. IPL dataset is a match-by-match dataset containing 756 observations and 18 variables.
This step consists of the following:
Importing both datasets and naming them Ball_summary and Ipl_dataset.
Checking the class of both datasets.
Performing a full join on the common variable ("id" = "match_id") and naming the result Ipl_summary.
#import the dataset
Ball_summary <- read_csv("Ball_Summary.csv")
Parsed with column specification:
cols(
.default = col_double(),
team_batting = col_character(),
team_bowling = col_character(),
batsman = col_character(),
non_striker = col_character(),
bowler = col_character(),
Batsman_dismissed = col_character(),
way_of_dismissal = col_character(),
fielder = col_character()
)
See spec(...) for full column specifications.
View(Ball_summary)
#checking the class of Ball_summary
class(Ball_summary)
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
#import the other dataset
Ipl_dataset <- read_csv("Ipl dataset.csv")
Parsed with column specification:
cols(
id = col_double(),
`Year(season)` = col_double(),
city = col_character(),
Match_date = col_character(),
teamA = col_character(),
teamB = col_character(),
Winner_toss = col_character(),
toss_decision = col_character(),
result = col_character(),
`D/L_applied` = col_double(),
Match_winner = col_character(),
victory_by_runs = col_double(),
victory_by_wickets = col_double(),
man_of_match = col_character(),
venue = col_character(),
`1st_umpire` = col_character(),
`2nd_umpire` = col_character(),
`3rd_umpire` = col_character()
)
View(Ipl_dataset)
#checking the class of Ipl_dataset
class(Ipl_dataset)
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
#full join the dataset
Ipl_summary <- full_join(Ipl_dataset, Ball_summary, by = c("id" = "match_id"))
View(Ipl_summary)
head(Ipl_summary)
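Because a full join keeps unmatched rows from both tables, it can be worth confirming that every match id appears on both sides before relying on the merged data. The following is only a minimal sketch on toy data: the `matches` and `balls` tables are hypothetical stand-ins, not the real files.

```r
library(dplyr)

# Hypothetical toy tables standing in for Ipl_dataset and Ball_summary
matches <- tibble(id = c(1, 2, 3), city = c("Hyderabad", "Pune", "Delhi"))
balls   <- tibble(match_id = c(1, 1, 2, 4), total_runs = c(0, 4, 1, 6))

# Rows of one table that have no partner in the other
unmatched_matches <- anti_join(matches, balls, by = c("id" = "match_id"))
unmatched_balls   <- anti_join(balls, matches, by = c("match_id" = "id"))

# The full join keeps every row from both sides, filling gaps with NA
joined <- full_join(matches, balls, by = c("id" = "match_id"))

print(unmatched_matches)  # match 3 has no ball records
print(unmatched_balls)    # match_id 4 has no match record
print(nrow(joined))
```

If both anti joins return zero rows, a full join and an inner join give the same result.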
In this step we use the dim() function to check the dimensions of the dataset.
While going through the dataset we found that some variables had been imported with unsuitable data types, so they were converted using the as.factor() function. Two variables were handled with the factor() function instead: Year(season) was converted to an ordered factor and super_over was given the labels "No" and "Yes".
#understand
dim(Ipl_summary)
[1] 179078 37
Ipl_summary$`Year(season)` <- factor(Ipl_summary$`Year(season)`, levels = 2008:2019, ordered = TRUE)
Ipl_summary$toss_decision<- as.factor(Ipl_summary$toss_decision)
Ipl_summary$result <- as.factor(Ipl_summary$result)
Ipl_summary$`D/L_applied` <- as.factor(Ipl_summary$`D/L_applied`)
Ipl_summary$over <- as.factor(Ipl_summary$over)
Ipl_summary$super_over <- factor(Ipl_summary$super_over,levels = c(0,1), labels = c("No","Yes"))
Ipl_summary$way_of_dismissal <- as.factor(Ipl_summary$way_of_dismissal)
str(Ipl_summary)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 179078 obs. of 37 variables:
$ id : num 1 1 1 1 1 1 1 1 1 1 ...
$ Year(season) : Ord.factor w/ 12 levels "2008"<"2009"<..: 10 10 10 10 10 10 10 10 10 10 ...
$ city : chr "Hyderabad" "Hyderabad" "Hyderabad" "Hyderabad" ...
$ Match_date : chr "05-04-17" "05-04-17" "05-04-17" "05-04-17" ...
$ teamA : chr "SRH" "SRH" "SRH" "SRH" ...
$ teamB : chr "RCB" "RCB" "RCB" "RCB" ...
$ Winner_toss : chr "RCB" "RCB" "RCB" "RCB" ...
$ toss_decision : Factor w/ 2 levels "bat","field": 2 2 2 2 2 2 2 2 2 2 ...
$ result : Factor w/ 3 levels "no result","normal",..: 2 2 2 2 2 2 2 2 2 2 ...
$ D/L_applied : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Match_winner : chr "SRH" "SRH" "SRH" "SRH" ...
$ victory_by_runs : num 35 35 35 35 35 35 35 35 35 35 ...
$ victory_by_wickets: num 0 0 0 0 0 0 0 0 0 0 ...
$ man_of_match : chr "Yuvraj Singh" "Yuvraj Singh" "Yuvraj Singh" "Yuvraj Singh" ...
$ venue : chr "Rajiv Gandhi International Stadium, Uppal" "Rajiv Gandhi International Stadium, Uppal" "Rajiv Gandhi International Stadium, Uppal" "Rajiv Gandhi International Stadium, Uppal" ...
$ 1st_umpire : chr "AY Dandekar" "AY Dandekar" "AY Dandekar" "AY Dandekar" ...
$ 2nd_umpire : chr "NJ Llong" "NJ Llong" "NJ Llong" "NJ Llong" ...
$ 3rd_umpire : chr NA NA NA NA ...
$ innings : num 1 1 1 1 1 1 1 1 1 1 ...
$ team_batting : chr "SRH" "SRH" "SRH" "SRH" ...
$ team_bowling : chr "RCB" "RCB" "RCB" "RCB" ...
$ over : Factor w/ 20 levels "1","2","3","4",..: 1 1 1 1 1 1 1 2 2 2 ...
$ ball : num 1 2 3 4 5 6 7 1 2 3 ...
$ batsman : chr "DA Warner" "DA Warner" "DA Warner" "DA Warner" ...
$ non_striker : chr "S Dhawan" "S Dhawan" "S Dhawan" "S Dhawan" ...
$ bowler : chr "TS Mills" "TS Mills" "TS Mills" "TS Mills" ...
$ super_over : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ wide : num 0 0 0 0 2 0 0 0 0 0 ...
$ bye : num 0 0 0 0 0 0 0 0 0 0 ...
$ legbye : num 0 0 0 0 0 0 1 0 0 0 ...
$ noball : num 0 0 0 0 0 0 0 0 0 1 ...
$ batsman_runs : num 0 0 4 0 0 0 0 1 4 0 ...
$ extra : num 0 0 0 0 2 0 1 0 0 1 ...
$ total_runs : num 0 0 4 0 2 0 1 1 4 1 ...
$ Batsman_dismissed : chr NA NA NA NA ...
$ way_of_dismissal : Factor w/ 9 levels "bowled","caught",..: NA NA NA NA NA NA NA NA NA NA ...
$ fielder : chr NA NA NA NA ...
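The two factor() conversions above can be illustrated on small vectors. This is a sketch with hypothetical values (`flag` and `season` are made up for the example), showing how labels are attached and what the ordering makes possible.

```r
# Hypothetical 0/1 indicator, similar to super_over in the merged data
flag <- c(0, 1, 0, 0, 1)
flag_f <- factor(flag, levels = c(0, 1), labels = c("No", "Yes"))

# Hypothetical season values; ordered = TRUE makes comparisons meaningful
season <- c(2017, 2008, 2019)
season_f <- factor(season, levels = 2008:2019, ordered = TRUE)

print(table(flag_f))       # counts per label
print(season_f < "2018")   # element-wise comparison against a level
```

An unordered factor would not support the `<` comparison, which is the practical reason for ordering the season variable.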
To check whether the data follows Hadley Wickham's tidy data principles, we inspect it using the head(), tail() and glimpse() functions.
After the inspection we found that the data is tidy, as it follows all three principles:
1. Each variable has its own column.
2. Each observation has its own row.
3. Each value has its own cell.
head(Ipl_summary)
tail(Ipl_summary)
glimpse(Ipl_summary)
Observations: 179,078
Variables: 37
$ id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ `Year(season)` <ord> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 20...
$ city <chr> "Hyderabad", "Hyderabad", "Hyderabad", "Hyderabad", "Hyderabad", "Hyderabad", "H...
$ Match_date <chr> "05-04-17", "05-04-17", "05-04-17", "05-04-17", "05-04-17", "05-04-17", "05-04-1...
$ teamA <chr> "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SR...
$ teamB <chr> "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RC...
$ Winner_toss <chr> "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RC...
$ toss_decision <fct> field, field, field, field, field, field, field, field, field, field, field, fie...
$ result <fct> normal, normal, normal, normal, normal, normal, normal, normal, normal, normal, ...
$ `D/L_applied` <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ Match_winner <chr> "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SR...
$ victory_by_runs <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, ...
$ victory_by_wickets <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ man_of_match <chr> "Yuvraj Singh", "Yuvraj Singh", "Yuvraj Singh", "Yuvraj Singh", "Yuvraj Singh", ...
$ venue <chr> "Rajiv Gandhi International Stadium, Uppal", "Rajiv Gandhi International Stadium...
$ `1st_umpire` <chr> "AY Dandekar", "AY Dandekar", "AY Dandekar", "AY Dandekar", "AY Dandekar", "AY D...
$ `2nd_umpire` <chr> "NJ Llong", "NJ Llong", "NJ Llong", "NJ Llong", "NJ Llong", "NJ Llong", "NJ Llon...
$ `3rd_umpire` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
$ innings <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ team_batting <chr> "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SR...
$ team_bowling <chr> "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RC...
$ over <fct> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5,...
$ ball <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1,...
$ batsman <chr> "DA Warner", "DA Warner", "DA Warner", "DA Warner", "DA Warner", "S Dhawan", "S ...
$ non_striker <chr> "S Dhawan", "S Dhawan", "S Dhawan", "S Dhawan", "S Dhawan", "DA Warner", "DA War...
$ bowler <chr> "TS Mills", "TS Mills", "TS Mills", "TS Mills", "TS Mills", "TS Mills", "TS Mill...
$ super_over <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, ...
$ wide <dbl> 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ bye <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ legbye <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ noball <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ batsman_runs <dbl> 0, 0, 4, 0, 0, 0, 0, 1, 4, 0, 6, 0, 0, 4, 1, 0, 0, 3, 1, 1, 0, 1, 0, 1, 1, 1, 1,...
$ extra <dbl> 0, 0, 0, 0, 2, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ total_runs <dbl> 0, 0, 4, 0, 2, 0, 1, 1, 4, 1, 6, 0, 0, 4, 1, 0, 0, 3, 1, 1, 0, 1, 0, 1, 1, 1, 1,...
$ Batsman_dismissed <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "DA Warner", NA, NA, NA, NA, NA, NA,...
$ way_of_dismissal <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, caught, NA, NA, NA, NA, NA, NA, NA, ...
$ fielder <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Mandeep Singh", NA, NA, NA, NA, NA,...
In this step we calculate the run rate of each team in each match. To get the total runs scored by a team in a match, the dataset is grouped by team_batting, Year(season) and Match_date, and total_runs is summed within each group, so the ball-level total_runs column is overwritten with the team's match total. The run rate is then the total runs divided by the total number of overs (20), which assumes that every innings lasts the full 20 overs. So that the grouping works properly, rows with a missing Match_date are filtered out. The new variable Run_Rate is created with the mutate() function.
Ipl_summary <- Ipl_summary %>% group_by(team_batting,`Year(season)`,Match_date) %>% mutate(total_runs = sum(total_runs), Run_Rate = total_runs/20)%>% filter(!is.na(Match_date) )
head(Ipl_summary)
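Dividing by a fixed 20 overs overstates the denominator whenever an innings ends early. As a rough alternative, the sketch below computes one row per innings on toy data and divides by the number of overs actually started. The `balls` table and its values are hypothetical, and counting a started over as a full over is itself an approximation.

```r
library(dplyr)

# Hypothetical ball-by-ball records for two innings of one match
balls <- tibble(
  match_id   = c(1, 1, 1, 1, 1, 1),
  innings    = c(1, 1, 1, 2, 2, 2),
  over       = c(1, 1, 2, 1, 1, 1),
  total_runs = c(4, 0, 6, 1, 1, 2)
)

run_rates <- balls %>%
  group_by(match_id, innings) %>%
  summarise(
    innings_runs = sum(total_runs),    # runs scored in the innings
    overs_bowled = n_distinct(over),   # overs that were at least started
    .groups = "drop"
  ) %>%
  mutate(run_rate = innings_runs / overs_bowled)

print(run_rates)
```

Using summarise() rather than mutate() returns one row per innings, which avoids repeating the same run rate on every ball.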
In this step we used the colSums() function together with is.na() to obtain the number of missing values in each variable.
Missing values in city, way_of_dismissal and man_of_match were imputed with the most frequent value (the mode). In the umpire name variables 1st_umpire and 2nd_umpire, missing values were recoded as “NO RECORDED”, on the assumption that they result from a data entry error. Missing values in Match_winner were recoded as “MATCH_TIED”.
In the 3rd_umpire variable, missing values were recoded as “NOT ASKED FOR 3rd Umpire”, as the decisions on those deliveries were made by the 1st or 2nd umpire only.
In the Batsman_dismissed and fielder variables, missing values were recoded as “Still Playing”, as the batsman was not out on that delivery.
colSums() was run again after all the imputation and recoding and, as expected, no missing values remained in the dataset.
Special values were also inspected, but none were present in the dataset.
colSums(is.na(Ipl_summary))
                id       Year(season)               city         Match_date              teamA
                 0                  0               1700                  0                  0
             teamB        Winner_toss      toss_decision             result        D/L_applied
                 0                  0                  0                  0                  0
      Match_winner    victory_by_runs victory_by_wickets       man_of_match              venue
               372                  0                  0                372                  0
        1st_umpire         2nd_umpire         3rd_umpire            innings       team_batting
               500                500             150712                  0                  0
      team_bowling               over               ball            batsman        non_striker
                 0                  0                  0                  0                  0
            bowler         super_over               wide                bye             legbye
                 0                  0                  0                  0                  0
            noball       batsman_runs              extra         total_runs  Batsman_dismissed
                 0                  0                  0                  0             170244
  way_of_dismissal            fielder           Run_Rate
            170244             172630                  0
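#impute() applies fun to the non-missing values; base mode() returns the storage type (e.g. "character"), not the most frequent value
most_common <- function(x) names(which.max(table(x)))
Ipl_summary$city <- impute(Ipl_summary$city, fun = most_common)
Ipl_summary$toss_decision <- impute(Ipl_summary$toss_decision, fun = most_common)
Ipl_summary$way_of_dismissal <- impute(Ipl_summary$way_of_dismissal, fun = most_common)
Ipl_summary$man_of_match <- impute(Ipl_summary$man_of_match, fun = most_common)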
Ipl_summary$Match_winner [is.na(Ipl_summary$Match_winner)] <- "MATCH_TIED"
Ipl_summary$`1st_umpire` [is.na(Ipl_summary$`1st_umpire`)] <- "NO RECORDED"
Ipl_summary$`2nd_umpire` [is.na(Ipl_summary$`2nd_umpire`)] <- "NO RECORDED"
Ipl_summary$`3rd_umpire` [is.na(Ipl_summary$`3rd_umpire`)] <- "NOT ASKED FOR 3rd Umpire"
Ipl_summary$Batsman_dismissed [is.na(Ipl_summary$Batsman_dismissed)] <- "Still Playing"
Ipl_summary$fielder [is.na(Ipl_summary$fielder)] <- "Still Playing"
colSums(is.na(Ipl_summary))
                id       Year(season)               city         Match_date              teamA
                 0                  0                  0                  0                  0
             teamB        Winner_toss      toss_decision             result        D/L_applied
                 0                  0                  0                  0                  0
      Match_winner    victory_by_runs victory_by_wickets       man_of_match              venue
                 0                  0                  0                  0                  0
        1st_umpire         2nd_umpire         3rd_umpire            innings       team_batting
                 0                  0                  0                  0                  0
      team_bowling               over               ball            batsman        non_striker
                 0                  0                  0                  0                  0
            bowler         super_over               wide                bye             legbye
                 0                  0                  0                  0                  0
            noball       batsman_runs              extra         total_runs  Batsman_dismissed
                 0                  0                  0                  0                  0
  way_of_dismissal            fielder           Run_Rate
                 0                  0                  0
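One point to be careful about when imputing "the mode" in R: the base function mode() reports how a vector is stored and does not compute the most frequent value. The sketch below shows the difference on a toy vector. The `most_frequent` helper and the `city` values are hypothetical and only illustrate the idea.

```r
# Hypothetical character column with missing values
city <- c("Mumbai", "Delhi", "Mumbai", NA, "Pune", NA)

# base::mode() describes the storage type, not the most common value
print(mode(city))   # "character"

# Hypothetical helper: most frequent non-missing value (first one on ties)
most_frequent <- function(x) {
  counts <- table(x)                  # table() drops NA by default
  names(counts)[which.max(counts)]
}

# Replace the missing entries with the most frequent city
city_filled <- city
city_filled[is.na(city_filled)] <- most_frequent(city)
print(city_filled)
```

Whether the most frequent value is a sensible replacement depends on the variable. For a column such as way_of_dismissal, where a missing value simply means no wicket fell, a descriptive label may be more faithful than the mode.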
#Inspection for special values in data frame
is.special <- function(x){ if (is.numeric(x)) (is.infinite(x) | is.nan(x))}
# apply this function to the data frame.
sapply(Ipl_summary, function(x) sum( is.special(x)))
                id       Year(season)               city         Match_date              teamA
                 0                  0                  0                  0                  0
             teamB        Winner_toss      toss_decision             result        D/L_applied
                 0                  0                  0                  0                  0
      Match_winner    victory_by_runs victory_by_wickets       man_of_match              venue
                 0                  0                  0                  0                  0
        1st_umpire         2nd_umpire         3rd_umpire            innings       team_batting
                 0                  0                  0                  0                  0
      team_bowling               over               ball            batsman        non_striker
                 0                  0                  0                  0                  0
            bowler         super_over               wide                bye             legbye
                 0                  0                  0                  0                  0
            noball       batsman_runs              extra         total_runs  Batsman_dismissed
                 0                  0                  0                  0                  0
  way_of_dismissal            fielder           Run_Rate
                 0                  0                  0
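To see what a non-zero result from this check would look like, the same idea can be tried on a toy data frame that does contain special values. This is a sketch: `df` is hypothetical, and `is_special` is a variant of the function above that returns FALSE for non-numeric columns instead of NULL.

```r
# Hypothetical data frame with one infinite and one NaN entry
df <- data.frame(
  runs = c(10, Inf, 3),
  rate = c(NaN, 7.5, 6),
  team = c("SRH", "RCB", "MI")
)

# Returns a logical vector for every column type
is_special <- function(x) {
  if (is.numeric(x)) is.infinite(x) | is.nan(x) else rep(FALSE, length(x))
}

# Count special values per column
special_counts <- sapply(df, function(x) sum(is_special(x)))
print(special_counts)
```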
In this step, boxplots were drawn for the main numeric variables that decide a match. Boxplots are used because they are a convenient way of detecting univariate outliers. No outliers were observed in victory_by_wickets. The boxplots of victory_by_runs and total_runs did show outliers, which were then examined using z-scores.
The which() function was used to find the positions of the observations whose absolute z-score is greater than 3.
The proportion of such outliers is below 5% for both variables: 4612 of 179078 rows (about 2.6%) for victory_by_runs and 900 rows (about 0.5%) for total_runs. It was therefore assumed that the outliers in the boxplots reflect exceptional team performances rather than data entry errors.
The outliers were left in the dataset unchanged.
boxplot(Ipl_summary$victory_by_wickets, main = "Box plot showing victory by wickets", ylab = "Number of wickets", col = "skyblue")
boxplot(Ipl_summary$total_runs, main = "Box plot showing total runs", ylab = "Total runs", col = "skyblue")
boxplot(Ipl_summary$victory_by_runs, main = "Box plot showing runs by which teams won in IPL", ylab = "Runs", col = "skyblue")
#Z scores by victory runs
zscores_victory_by_runs <- Ipl_summary$victory_by_runs %>% scores(type ="z")
zscores_victory_by_runs %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.5762 -0.5762 -0.5762 0.0000 0.2406 5.7004
#where are outliers
which( abs(zscores_victory_by_runs) >3 )
[1] 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907
[18] 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924
[35] 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941
[52] 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958
[69] 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975
[86] 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992
[103] 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
[120] 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026
[137] 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043
[154] 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060
[171] 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077
[188] 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094
[205] 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111
[222] 2112 2113 2114 2115 2116 10197 10198 10199 10200 10201 10202 10203 10204 10205 10206 10207 10208
[239] 10209 10210 10211 10212 10213 10214 10215 10216 10217 10218 10219 10220 10221 10222 10223 10224 10225
[256] 10226 10227 10228 10229 10230 10231 10232 10233 10234 10235 10236 10237 10238 10239 10240 10241 10242
[273] 10243 10244 10245 10246 10247 10248 10249 10250 10251 10252 10253 10254 10255 10256 10257 10258 10259
[290] 10260 10261 10262 10263 10264 10265 10266 10267 10268 10269 10270 10271 10272 10273 10274 10275 10276
[307] 10277 10278 10279 10280 10281 10282 10283 10284 10285 10286 10287 10288 10289 10290 10291 10292 10293
[324] 10294 10295 10296 10297 10298 10299 10300 10301 10302 10303 10304 10305 10306 10307 10308 10309 10310
[341] 10311 10312 10313 10314 10315 10316 10317 10318 10319 10320 10321 10322 10323 10324 10325 10326 10327
[358] 10328 10329 10330 10331 10332 10333 10334 10335 10336 10337 10338 10339 10340 10341 10342 10343 10344
[375] 10345 10346 10347 10348 10349 10350 10351 10352 10353 10354 10355 10356 10357 10358 10359 10360 10361
[392] 10362 10363 10364 10365 10366 10367 10368 10369 10370 10371 10372 10373 10374 10375 10376 10377 10378
[409] 10379 10380 10381 10382 10383 10384 10385 10386 10387 10388 10389 10390 10391 10392 10393 10394 10395
[426] 10396 10397 10398 10399 10400 10401 10402 10403 10404 10405 10406 10407 13863 13864 13865 13866 13867
[443] 13868 13869 13870 13871 13872 13873 13874 13875 13876 13877 13878 13879 13880 13881 13882 13883 13884
[460] 13885 13886 13887 13888 13889 13890 13891 13892 13893 13894 13895 13896 13897 13898 13899 13900 13901
[477] 13902 13903 13904 13905 13906 13907 13908 13909 13910 13911 13912 13913 13914 13915 13916 13917 13918
[494] 13919 13920 13921 13922 13923 13924 13925 13926 13927 13928 13929 13930 13931 13932 13933 13934 13935
[511] 13936 13937 13938 13939 13940 13941 13942 13943 13944 13945 13946 13947 13948 13949 13950 13951 13952
[528] 13953 13954 13955 13956 13957 13958 13959 13960 13961 13962 13963 13964 13965 13966 13967 13968 13969
[545] 13970 13971 13972 13973 13974 13975 13976 13977 13978 13979 13980 13981 13982 13983 13984 13985 13986
[562] 13987 13988 13989 13990 13991 13992 13993 13994 13995 13996 13997 13998 13999 14000 14001 14002 14003
[579] 14004 14005 14006 14007 14008 14009 14010 14011 14012 14013 14014 14015 14016 14017 14018 14019 14020
[596] 14021 14022 14023 14024 14025 14026 14027 14028 14029 14030 14031 14032 14033 14034 14035 14036 14037
[613] 14038 14039 14040 14041 14042 14043 14044 14045 14046 14047 14048 14049 14050 14051 14052 14053 14054
[630] 14055 14056 14057 14058 14059 14060 14061 14062 14063 14064 14065 14066 14067 14068 14069 14070 14071
[647] 14072 14073 14074 14075 14076 14077 14078 14079 14080 14081 14082 14083 14084 14085 14086 14087 26664
[664] 26665 26666 26667 26668 26669 26670 26671 26672 26673 26674 26675 26676 26677 26678 26679 26680 26681
[681] 26682 26683 26684 26685 26686 26687 26688 26689 26690 26691 26692 26693 26694 26695 26696 26697 26698
[698] 26699 26700 26701 26702 26703 26704 26705 26706 26707 26708 26709 26710 26711 26712 26713 26714 26715
[715] 26716 26717 26718 26719 26720 26721 26722 26723 26724 26725 26726 26727 26728 26729 26730 26731 26732
[732] 26733 26734 26735 26736 26737 26738 26739 26740 26741 26742 26743 26744 26745 26746 26747 26748 26749
[749] 26750 26751 26752 26753 26754 26755 26756 26757 26758 26759 26760 26761 26762 26763 26764 26765 26766
[766] 26767 26768 26769 26770 26771 26772 26773 26774 26775 26776 26777 26778 26779 26780 26781 26782 26783
[783] 26784 26785 26786 26787 26788 26789 26790 26791 26792 26793 26794 26795 26796 26797 26798 26799 26800
[800] 26801 26802 26803 26804 26805 26806 26807 26808 26809 26810 26811 26812 26813 26814 26815 26816 26817
[817] 26818 26819 26820 26821 26822 26823 26824 26825 26826 26827 26828 26829 26830 26831 26832 26833 26834
[834] 26835 26836 26837 26838 26839 26840 26841 26842 26843 26844 26845 26846 26847 26848 26849 26850 26851
[851] 26852 26853 26854 26855 26856 26857 26858 26859 26860 26861 26862 26863 26864 26865 26866 26867 26868
[868] 26869 26870 26871 26872 26873 26874 26875 26876 26877 26878 26879 26880 26881 26882 26883 26884 26885
[885] 26886 28124 28125 28126 28127 28128 28129 28130 28131 28132 28133 28134 28135 28136 28137 28138 28139
[902] 28140 28141 28142 28143 28144 28145 28146 28147 28148 28149 28150 28151 28152 28153 28154 28155 28156
[919] 28157 28158 28159 28160 28161 28162 28163 28164 28165 28166 28167 28168 28169 28170 28171 28172 28173
[936] 28174 28175 28176 28177 28178 28179 28180 28181 28182 28183 28184 28185 28186 28187 28188 28189 28190
[953] 28191 28192 28193 28194 28195 28196 28197 28198 28199 28200 28201 28202 28203 28204 28205 28206 28207
[970] 28208 28209 28210 28211 28212 28213 28214 28215 28216 28217 28218 28219 28220 28221 28222 28223 28224
[987] 28225 28226 28227 28228 28229 28230 28231 28232 28233 28234 28235 28236 28237 28238
[ reached getOption("max.print") -- omitted 3612 entries ]
#length of outliers
length(which(abs(zscores_victory_by_runs)>3))
[1] 4612
#total runs
zscores_total_runs <- Ipl_summary$total_runs %>% scores(type ="z")
zscores_total_runs %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-5.14494 -0.62091 0.03953 0.00000 0.69997 3.47383
#where
which(abs(zscores_total_runs)>3)
[1] 6376 6377 6378 6379 6380 6381 6382 6383 6384 6385 6386 6387 6388 6389 6390
[16] 6391 6392 6393 6394 6395 6396 6397 6398 6399 6400 6401 6402 6403 6404 6405
[31] 6406 6407 6408 6409 6410 6411 6412 6413 6414 6415 6416 6417 6418 6419 6420
[46] 6421 6422 6423 6424 6425 6426 6427 6428 6429 6430 6431 6432 6433 6434 6435
[61] 6436 6437 10325 10326 10327 10328 10329 10330 10331 10332 10333 10334 10335 10336 10337
[76] 10338 10339 10340 10341 10342 10343 10344 10345 10346 10347 10348 10349 10350 10351 10352
[91] 10353 10354 10355 10356 10357 10358 10359 10360 10361 10362 10363 10364 10365 10366 10367
[106] 10368 10369 10370 10371 10372 10373 10374 10375 10376 10377 10378 10379 10380 10381 10382
[121] 10383 10384 10385 10386 10387 10388 10389 10390 10391 10392 10393 10394 10395 10396 10397
[136] 10398 10399 10400 10401 10402 10403 10404 10405 10406 10407 13373 13374 13375 13376 13377
[151] 13378 13379 13380 13381 13382 13383 13384 13385 13386 13387 13388 13389 13390 13391 13392
[166] 13393 13394 13395 13396 13397 13398 13399 13400 13401 13402 13403 13404 13405 13406 13407
[181] 23924 23925 23926 23927 23928 23929 23930 23931 23932 23933 23934 23935 23936 23937 23938
[196] 23939 23940 23941 23942 23943 23944 23945 23946 23947 23948 23949 23950 23951 23952 23953
[211] 23954 23955 23956 23957 23958 23959 23960 23961 23962 23963 23964 23965 23966 23967 23968
[226] 23969 23970 23971 23972 23973 23974 27718 27719 27720 27721 27722 27723 27724 27725 27726
[241] 27727 27728 27729 27730 27731 27732 27733 27734 27735 27736 27737 27738 27739 27740 27741
[256] 27742 27743 27744 27745 27746 27747 27748 27749 27750 27751 27752 27753 27754 27755 27756
[271] 27757 27758 27759 27760 27761 27762 27763 27764 27765 27766 27767 27768 27769 27770 27771
[286] 27772 27773 27774 27775 27776 27777 27778 27779 27780 27781 27782 27783 27784 27785 27786
[301] 27787 27788 27789 27790 27791 27792 27793 27794 27795 27796 27797 27798 27799 27800 27801
[316] 27802 27803 27804 27805 27806 27807 27808 27809 27810 27811 27812 27813 27814 27815 27816
[331] 27893 27894 27895 27896 27897 27898 27899 27900 27901 27902 27903 27904 27905 27906 27907
[346] 27908 27909 27910 27911 27912 27913 27914 27915 27916 27917 27918 27919 27920 27921 27922
[361] 27923 27924 66559 66560 66561 66562 66563 66564 66565 66566 66567 66568 66569 66570 66571
[376] 66572 66573 66574 66575 66576 66577 66578 66579 66580 66581 66582 66583 66584 66585 66586
[391] 66587 66588 66589 66590 66591 66592 66593 66594 66595 66596 66597 66598 66599 66600 66601
[406] 66602 66603 66604 66605 66606 66607 66608 66609 66610 66611 66612 66613 66614 66615 66616
[421] 66617 66618 66619 66620 66621 70931 70932 70933 70934 70935 70936 70937 70938 70939 70940
[436] 70941 70942 70943 70944 70945 70946 70947 70948 70949 70950 70951 70952 70953 70954 70955
[451] 70956 70957 70958 70959 70960 70961 70962 70963 70964 70965 70966 70967 70968 70969 70970
[466] 70971 70972 70973 70974 70975 70976 70977 70978 70979 70980 70981 70982 70983 70984 70985
[481] 70986 70987 70988 70989 70990 70991 70992 70993 97225 97226 97227 97228 97229 97230 97231
[496] 97232 97233 97234 97235 97236 97237 97238 97239 97240 97241 97242 97243 97244 97245 97246
[511] 97247 97248 97249 97250 97251 97252 97253 97254 97255 97256 97257 97258 97259 97260 97261
[526] 97262 97263 97264 97265 97266 97267 97268 97269 97270 97271 97272 97273 97274 97275 97276
[541] 97277 97278 97279 97280 97281 97282 97283 97284 97285 97286 97287 97288 97289 97290 97291
[556] 97292 97293 97294 97295 97296 97297 97298 97299 97300 97301 97302 97303 97304 97305 97306
[571] 97307 97308 97309 97310 97311 97312 97313 97314 97315 97316 97317 97318 97319 97320 97321
[586] 97322 97323 97324 97325 97326 97327 97328 97329 97330 97331 97332 97333 97334 97335 97336
[601] 97337 97338 97339 97340 97341 97342 97343 97344 97345 97346 97347 97348 97349 97350 97351
[616] 97352 115958 115959 115960 115961 115962 115963 115964 115965 115966 115967 115968 115969 115970 115971
[631] 115972 115973 115974 115975 115976 115977 115978 115979 115980 115981 115982 115983 115984 115985 115986
[646] 115987 135163 135164 135165 135166 135167 135168 135169 146923 146924 146925 146926 146927 146928 146929
[661] 146930 146931 146932 146933 146934 146935 146936 146937 146938 146939 146940 146941 146942 146943 146944
[676] 146945 146946 146947 146948 146949 146950 146951 146952 146953 146954 146955 146956 151769 151770 151771
[691] 151772 151773 151774 151775 151776 151777 151778 151779 151780 151781 151782 151783 151784 151785 151786
[706] 151787 151788 151789 151790 151791 151792 151793 151794 151795 151796 151797 151798 151799 151800 151801
[721] 151802 151803 151804 151805 160692 160693 160694 160695 160696 160697 160698 160699 160700 160701 160702
[736] 160703 160704 160705 160706 160707 160708 160709 160710 160711 160712 160713 160714 160715 160716 160717
[751] 160718 160719 160720 160721 160722 160723 160724 160725 160726 160727 160728 160729 160730 160731 160732
[766] 160733 160734 160735 160736 160737 160738 160739 160740 160741 160742 160743 160744 160745 160746 160747
[781] 160748 160749 160750 160751 160752 160753 160754 160755 160756 160757 160758 160759 160760 160761 160762
[796] 160763 160764 160765 160766 160767 160768 160769 160770 160771 160772 160773 160774 160775 160776 160777
[811] 160778 160779 160780 160781 160782 160783 160784 160785 160786 160787 160788 160789 160790 160791 160792
[826] 160793 160794 160795 160796 160797 160798 160799 160800 160801 160802 160803 160804 160805 160806 160807
[841] 160808 160809 160810 160811 160812 160813 160814 160815 160816 176413 176414 176415 176416 176417 176418
[856] 176419 176420 176421 176422 176423 176424 176425 176426 176427 176428 176429 176430 176431 176432 176433
[871] 176434 176435 176436 176437 176438 176439 176440 176441 176442 176443 176444 176445 176446 176447 176448
[886] 176449 176450 176451 176452 176453 176454 176455 176456 176457 176458 176459 176460 176461 176462 176463
#number
length(which(abs(zscores_total_runs) >3))
[1] 900
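The z-score rule used above can be reproduced with base R. The sketch below assumes the usual definition, (x - mean) / sd, and uses a hypothetical vector of victory margins rather than the real data.

```r
# Hypothetical victory margins: 39 ordinary wins and one very large one
runs <- c(rep(10, 39), 100)

# z-score: distance from the mean in units of standard deviation
z <- (runs - mean(runs)) / sd(runs)

# Positions and share of observations with |z| > 3
outlier_idx <- which(abs(z) > 3)
outlier_pct <- 100 * length(outlier_idx) / length(runs)

print(outlier_idx)
print(outlier_pct)
```

In the merged data each match-level value is repeated on every ball of that match, so a percentage computed over rows counts balls, not matches. Reducing the data to one row per match first would give the share of outlying matches.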
In this step, several transformations, including Box-Cox, were applied to victory_by_runs to reduce its right skewness and bring the distribution closer to normal.
The following transformations were applied:
Log transformation (base 10 and natural).
Square root transformation.
Square transformation.
Reciprocal transformation.
Box-Cox transformation.
Comparing the histograms, the log transformation produced the distribution closest to normal among the transformations above.
#Transform
hist(Ipl_summary$victory_by_runs, main="Histogram of Victory runs")
log_vic_runs <- log10(Ipl_summary$victory_by_runs)
hist(log_vic_runs)
ln_vic_runs <- log(Ipl_summary$victory_by_runs)
hist(ln_vic_runs)
BoxCox_vic_runs <- BoxCox(Ipl_summary$victory_by_runs, lambda = "auto")
hist(BoxCox_vic_runs)
sqrt_vic_runs <- sqrt(Ipl_summary$victory_by_runs)
hist(sqrt_vic_runs)
square_vic_runs <- Ipl_summary$victory_by_runs^2
hist(square_vic_runs)
reciprocal_vic_runs <- Ipl_summary$victory_by_runs^(-1)
hist(reciprocal_vic_runs)
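A caveat for the log and reciprocal transformations: victory_by_runs is 0 for matches won by wickets, and log10(0) and 0^(-1) are not finite, so those rows cannot be placed in a histogram bin. The sketch below shows two common ways of handling this on a hypothetical vector. Neither is necessarily what was intended in the analysis above.

```r
# Hypothetical victory margins; 0 stands for matches won by wickets
victory_by_runs <- c(0, 0, 5, 12, 35, 0, 146)

print(log10(0))   # -Inf

# Option 1: transform only the matches that were actually won by runs
won_by_runs <- victory_by_runs[victory_by_runs > 0]
log_runs <- log10(won_by_runs)

# Option 2: log1p(x) = log(1 + x) is defined at zero
log1p_runs <- log1p(victory_by_runs)

print(all(is.finite(log_runs)))
print(all(is.finite(log1p_runs)))
```

Option 1 changes the question to "how large are wins by runs", while Option 2 keeps every match but leaves a spike at zero.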