Required packages

The following packages were required to preprocess the dataset.

library(readr)
library(dplyr)
library(Hmisc)
library(lubridate)
library(outliers)
library(forecast)
library(knitr)

Executive Summary

This assignment preprocesses data on every IPL cricket match played between 2008 and 2019. The dataset contains the statistics of every ball bowled in the IPL.

For preprocessing, two datasets, IPL matches and Ball summary, were merged on the match identifier (id in the match data, match_id in the ball data). Every variable was then inspected to understand its meaning and contents, and suitable data type conversions were applied.

As the dataset already satisfies Hadley Wickham’s tidy data principles, no tidying functions were needed. A new variable, Run_Rate, was created using the mutate() function.

While scanning the data we found missing values in several factor and character variables. Some were imputed with the mode and the rest were recoded with suitable labels. Mean or median imputation could have been used for missing values in numeric variables, but no numeric variable in this dataset had missing values. There were no special values in the dataset either. Boxplots and z-scores revealed outliers in total_runs and victory_by_runs. Since the outliers made up less than 5% of the dataset, it was assumed that they were not due to data entry errors. Therefore, they were left untreated, as they can be a significant part of the dataset.

The numeric variable victory_by_runs was transformed to reduce its skewness and bring its distribution closer to normal.

Data

The class() function was used to check that both datasets were imported as data frames. A full join was then used to merge the two datasets on the key variable id.

The dataset was collected from an open source (https://www.kaggle.com/nowe9/ipldata). The IPL is a T20-format cricket league held in India every year since 2008. The data covers 12 seasons of the IPL (2008-2019).

Ball summary is a ball-by-ball dataset containing 179078 observations and 20 variables. The IPL dataset is a match-by-match dataset containing 756 observations and 18 variables.

This step consists of the following:

  1. Importing both datasets and naming them Ball_summary and Ipl_dataset.

  2. Checking the class of both datasets.

  3. Performing a full join on the common key ("id" = "match_id") and naming the result Ipl_summary.
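The join step can be illustrated on a toy example. This is only a sketch: the `matches` and `balls` tibbles and their values are hypothetical stand-ins for the two imported files, and only the key mapping `c("id" = "match_id")` is taken from the report.

```r
library(dplyr)

# Hypothetical miniature versions of the match-level and ball-level tables
matches <- tibble(id = c(1, 2), Match_winner = c("SRH", "RCB"))
balls <- tibble(
  match_id   = c(1, 1, 1, 2, 2),
  ball       = c(1, 2, 3, 1, 2),
  total_runs = c(0, 4, 1, 6, 0)
)

# Both inputs should be data frames before joining
stopifnot(is.data.frame(matches), is.data.frame(balls))

# Match `id` in the left table to `match_id` in the right table;
# the key column keeps the left-hand name, so `match_id` is dropped
joined <- full_join(matches, balls, by = c("id" = "match_id"))

dim(joined)
```

The toy result has one row per ball and 2 + 3 - 1 = 4 columns, because the key appears only once. The same arithmetic applies to the real data: 18 + 20 - 1 = 37 columns.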

#import the dataset
Ball_summary <- read_csv("Ball_Summary.csv")
Parsed with column specification:
cols(
  .default = col_double(),
  team_batting = col_character(),
  team_bowling = col_character(),
  batsman = col_character(),
  non_striker = col_character(),
  bowler = col_character(),
  Batsman_dismissed = col_character(),
  way_of_dismissal = col_character(),
  fielder = col_character()
)
See spec(...) for full column specifications.
View(Ball_summary)


#checking the class of Ball
class(Ball_summary)
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 
#import the other dataset
Ipl_dataset <- read_csv("Ipl dataset.csv")
Parsed with column specification:
cols(
  id = col_double(),
  `Year(season)` = col_double(),
  city = col_character(),
  Match_date = col_character(),
  teamA = col_character(),
  teamB = col_character(),
  Winner_toss = col_character(),
  toss_decision = col_character(),
  result = col_character(),
  `D/L_applied` = col_double(),
  Match_winner = col_character(),
  victory_by_runs = col_double(),
  victory_by_wickets = col_double(),
  man_of_match = col_character(),
  venue = col_character(),
  `1st_umpire` = col_character(),
  `2nd_umpire` = col_character(),
  `3rd_umpire` = col_character()
)
View(Ipl_dataset)

#checking the class of Ipl_dataset
class(Ipl_dataset)
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 
#full join the dataset
Ipl_summary <- full_join(Ipl_dataset, Ball_summary, by = c("id" = "match_id"))
View(Ipl_summary)
head(Ipl_summary)

Understand

In this step we use the dim() function to check the dimensions of the dataset.

While going through the dataset we found that some variables had been imported with unsuitable data types, so they were converted using the as.factor() function. Two variables needed more care: Year(season) was converted to an ordered factor and super_over was given the labels "No" and "Yes", both using the factor() function.
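As a small illustration of the two kinds of factor conversion, the sketch below uses made-up vectors (`season` and `super_over` are hypothetical stand-ins for the imported columns) to show what `ordered = TRUE` and `labels` each contribute.

```r
# Hypothetical values standing in for the imported columns
season <- c(2017, 2008, 2019, 2017)
super_over <- c(0, 1, 0, 0)

# Ordered factor: the level order defines which season is "less than" another
season_f <- factor(season, levels = 2008:2019, ordered = TRUE)

# Labelled factor: map the 0/1 codes to readable labels
super_over_f <- factor(super_over, levels = c(0, 1), labels = c("No", "Yes"))

season_f[2] < season_f[1]   # compares 2008 with 2017 using the level order
table(super_over_f)
```

An ordered factor supports comparisons such as `<`, which a plain factor does not, while the labels only change how the codes are displayed.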

#understand
dim(Ipl_summary)
[1] 179078     37
Ipl_summary$`Year(season)` <- factor(Ipl_summary$`Year(season)`,
                                     levels = as.character(2008:2019),
                                     ordered = TRUE)
Ipl_summary$toss_decision<- as.factor(Ipl_summary$toss_decision)
Ipl_summary$result <- as.factor(Ipl_summary$result)
Ipl_summary$`D/L_applied` <- as.factor(Ipl_summary$`D/L_applied`)
Ipl_summary$over <- as.factor(Ipl_summary$over)
Ipl_summary$super_over <- factor(Ipl_summary$super_over,levels = c(0,1), labels = c("No","Yes"))
Ipl_summary$way_of_dismissal <- as.factor(Ipl_summary$way_of_dismissal)
str(Ipl_summary)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':    179078 obs. of  37 variables:
 $ id                : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Year(season)      : Ord.factor w/ 12 levels "2008"<"2009"<..: 10 10 10 10 10 10 10 10 10 10 ...
 $ city              : chr  "Hyderabad" "Hyderabad" "Hyderabad" "Hyderabad" ...
 $ Match_date        : chr  "05-04-17" "05-04-17" "05-04-17" "05-04-17" ...
 $ teamA             : chr  "SRH" "SRH" "SRH" "SRH" ...
 $ teamB             : chr  "RCB" "RCB" "RCB" "RCB" ...
 $ Winner_toss       : chr  "RCB" "RCB" "RCB" "RCB" ...
 $ toss_decision     : Factor w/ 2 levels "bat","field": 2 2 2 2 2 2 2 2 2 2 ...
 $ result            : Factor w/ 3 levels "no result","normal",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ D/L_applied       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ Match_winner      : chr  "SRH" "SRH" "SRH" "SRH" ...
 $ victory_by_runs   : num  35 35 35 35 35 35 35 35 35 35 ...
 $ victory_by_wickets: num  0 0 0 0 0 0 0 0 0 0 ...
 $ man_of_match      : chr  "Yuvraj Singh" "Yuvraj Singh" "Yuvraj Singh" "Yuvraj Singh" ...
 $ venue             : chr  "Rajiv Gandhi International Stadium, Uppal" "Rajiv Gandhi International Stadium, Uppal" "Rajiv Gandhi International Stadium, Uppal" "Rajiv Gandhi International Stadium, Uppal" ...
 $ 1st_umpire        : chr  "AY Dandekar" "AY Dandekar" "AY Dandekar" "AY Dandekar" ...
 $ 2nd_umpire        : chr  "NJ Llong" "NJ Llong" "NJ Llong" "NJ Llong" ...
 $ 3rd_umpire        : chr  NA NA NA NA ...
 $ innings           : num  1 1 1 1 1 1 1 1 1 1 ...
 $ team_batting      : chr  "SRH" "SRH" "SRH" "SRH" ...
 $ team_bowling      : chr  "RCB" "RCB" "RCB" "RCB" ...
 $ over              : Factor w/ 20 levels "1","2","3","4",..: 1 1 1 1 1 1 1 2 2 2 ...
 $ ball              : num  1 2 3 4 5 6 7 1 2 3 ...
 $ batsman           : chr  "DA Warner" "DA Warner" "DA Warner" "DA Warner" ...
 $ non_striker       : chr  "S Dhawan" "S Dhawan" "S Dhawan" "S Dhawan" ...
 $ bowler            : chr  "TS Mills" "TS Mills" "TS Mills" "TS Mills" ...
 $ super_over        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ wide              : num  0 0 0 0 2 0 0 0 0 0 ...
 $ bye               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ legbye            : num  0 0 0 0 0 0 1 0 0 0 ...
 $ noball            : num  0 0 0 0 0 0 0 0 0 1 ...
 $ batsman_runs      : num  0 0 4 0 0 0 0 1 4 0 ...
 $ extra             : num  0 0 0 0 2 0 1 0 0 1 ...
 $ total_runs        : num  0 0 4 0 2 0 1 1 4 1 ...
 $ Batsman_dismissed : chr  NA NA NA NA ...
 $ way_of_dismissal  : Factor w/ 9 levels "bowled","caught",..: NA NA NA NA NA NA NA NA NA NA ...
 $ fielder           : chr  NA NA NA NA ...

Tidy & Manipulate Data I

To check whether the data follows Hadley Wickham’s tidy data principles, we inspect it using the head(), tail(), and glimpse() functions.

After the inspection we found that the data is tidy, as it follows all three principles:

  1. Each variable has its own column.

  2. Each observation has its own row.

  3. Each value has its own cell.

head(Ipl_summary)
tail(Ipl_summary)
glimpse(Ipl_summary) 
Observations: 179,078
Variables: 37
$ id                 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ `Year(season)`     <ord> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 20...
$ city               <chr> "Hyderabad", "Hyderabad", "Hyderabad", "Hyderabad", "Hyderabad", "Hyderabad", "H...
$ Match_date         <chr> "05-04-17", "05-04-17", "05-04-17", "05-04-17", "05-04-17", "05-04-17", "05-04-1...
$ teamA              <chr> "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SR...
$ teamB              <chr> "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RC...
$ Winner_toss        <chr> "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RC...
$ toss_decision      <fct> field, field, field, field, field, field, field, field, field, field, field, fie...
$ result             <fct> normal, normal, normal, normal, normal, normal, normal, normal, normal, normal, ...
$ `D/L_applied`      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ Match_winner       <chr> "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SR...
$ victory_by_runs    <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, ...
$ victory_by_wickets <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ man_of_match       <chr> "Yuvraj Singh", "Yuvraj Singh", "Yuvraj Singh", "Yuvraj Singh", "Yuvraj Singh", ...
$ venue              <chr> "Rajiv Gandhi International Stadium, Uppal", "Rajiv Gandhi International Stadium...
$ `1st_umpire`       <chr> "AY Dandekar", "AY Dandekar", "AY Dandekar", "AY Dandekar", "AY Dandekar", "AY D...
$ `2nd_umpire`       <chr> "NJ Llong", "NJ Llong", "NJ Llong", "NJ Llong", "NJ Llong", "NJ Llong", "NJ Llon...
$ `3rd_umpire`       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
$ innings            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ team_batting       <chr> "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SRH", "SR...
$ team_bowling       <chr> "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RCB", "RC...
$ over               <fct> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5,...
$ ball               <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1,...
$ batsman            <chr> "DA Warner", "DA Warner", "DA Warner", "DA Warner", "DA Warner", "S Dhawan", "S ...
$ non_striker        <chr> "S Dhawan", "S Dhawan", "S Dhawan", "S Dhawan", "S Dhawan", "DA Warner", "DA War...
$ bowler             <chr> "TS Mills", "TS Mills", "TS Mills", "TS Mills", "TS Mills", "TS Mills", "TS Mill...
$ super_over         <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, ...
$ wide               <dbl> 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ bye                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ legbye             <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ noball             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ batsman_runs       <dbl> 0, 0, 4, 0, 0, 0, 0, 1, 4, 0, 6, 0, 0, 4, 1, 0, 0, 3, 1, 1, 0, 1, 0, 1, 1, 1, 1,...
$ extra              <dbl> 0, 0, 0, 0, 2, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ total_runs         <dbl> 0, 0, 4, 0, 2, 0, 1, 1, 4, 1, 6, 0, 0, 4, 1, 0, 0, 3, 1, 1, 0, 1, 0, 1, 1, 1, 1,...
$ Batsman_dismissed  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "DA Warner", NA, NA, NA, NA, NA, NA,...
$ way_of_dismissal   <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, caught, NA, NA, NA, NA, NA, NA, NA, ...
$ fielder            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Mandeep Singh", NA, NA, NA, NA, NA,...

Tidy & Manipulate Data II

In this step we calculate the run rate of each team in each match. To obtain the total runs scored by a team in a match, the dataset is grouped by team_batting, Year(season) and Match_date, and total_runs is summed within each group (this overwrites the per-ball total_runs with the team total). The run rate is then the total runs divided by the 20 overs of an innings. Rows with a missing Match_date are filtered out so that the grouping is well defined. The new variable Run_Rate is created with the mutate() function.
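The calculation can be checked by hand on a toy example. The sketch below is illustrative only: the `balls` tibble, the `season` column and the `innings_runs` name are hypothetical, and the division by 20 follows the fixed-overs assumption described above.

```r
library(dplyr)

# Hypothetical ball-by-ball rows for two innings of one match
balls <- tibble(
  team_batting = c("SRH", "SRH", "SRH", "RCB", "RCB"),
  season       = c(2017, 2017, 2017, 2017, 2017),
  Match_date   = c("05-04-17", "05-04-17", "05-04-17", "05-04-17", "05-04-17"),
  total_runs   = c(4, 6, 10, 1, 9)
)

run_rates <- balls %>%
  filter(!is.na(Match_date)) %>%
  group_by(team_batting, season, Match_date) %>%
  # Keep the per-ball runs and store the innings total in a new column
  mutate(innings_runs = sum(total_runs),
         Run_Rate     = innings_runs / 20) %>%
  ungroup()

run_rates
```

Here SRH scores 4 + 6 + 10 = 20 runs, giving a run rate of 20 / 20 = 1, and RCB scores 10 runs, giving 0.5. Writing the total to a separate column is a choice of this sketch so that the per-ball values stay available.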

Ipl_summary <- Ipl_summary %>%
  group_by(team_batting, `Year(season)`, Match_date) %>%
  mutate(total_runs = sum(total_runs), Run_Rate = total_runs / 20) %>%
  filter(!is.na(Match_date))
head(Ipl_summary)

Scan I

In this step we used the colSums() function to obtain the number of missing values in each variable.

colSums() showed which variables contain missing values. Missing values in city, toss_decision, way_of_dismissal and man_of_match were imputed with the mode (the most frequent value) using impute(). Missing values in Match_winner were recoded as “MATCH_TIED”. In the 1st_umpire and 2nd_umpire variables, missing values were recoded as “NO RECORDED” on the assumption that they result from a data entry error.
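A minimal base-R sketch of mode imputation for a character column follows. The helper `stat_mode()` and the `city` vector are hypothetical and only illustrate the idea. Note that base R's `mode()` reports the storage mode of an object rather than its most frequent value, so a helper of this kind is needed.

```r
# Most frequent non-missing value; base::mode() returns the storage mode
# (e.g. "character"), not the statistical mode
stat_mode <- function(x) {
  counts <- table(x)            # table() drops NA by default
  names(counts)[which.max(counts)]
}

# Hypothetical city column with two missing entries
city <- c("Mumbai", "Delhi", NA, "Mumbai", NA)

city_imputed <- city
city_imputed[is.na(city_imputed)] <- stat_mode(city)

city_imputed
sum(is.na(city_imputed))
```

If several values tie for most frequent, `which.max()` picks the first one in the sorted table, which is a simplification of this sketch.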

In the 3rd_umpire variable the missing values were recoded as “NOT ASKED FOR 3rd Umpire”, as the decisions on the corresponding deliveries were taken by the 1st or 2nd umpire only.

In the Batsman_dismissed and fielder variables the missing values were recoded as “Still Playing”, as the batsman was not out on the corresponding delivery.

colSums() was run again after all the imputations and recoding and, as expected, no missing values remained in the dataset.

Special values (infinite and NaN) were also checked for, but none were present in the dataset.
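To show what a non-zero result of such a check looks like, the sketch below applies a special-value counter to a made-up data frame. The data frame `df` and the helper name `is_special()` are hypothetical.

```r
# Hypothetical data frame containing one Inf and one NaN
df <- data.frame(
  runs = c(4, Inf, 6),
  rate = c(0.2, NaN, 0.3),
  team = c("SRH", "RCB", "SRH")
)

# TRUE for infinite or NaN entries; non-numeric columns have none
is_special <- function(x) {
  if (is.numeric(x)) is.infinite(x) | is.nan(x) else rep(FALSE, length(x))
}

# Count special values per column
special_counts <- sapply(df, function(x) sum(is_special(x)))
special_counts
```

Each numeric column reports one special value and the character column reports none.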

colSums(is.na(Ipl_summary))
                id       Year(season)               city         Match_date              teamA 
                 0                  0               1700                  0                  0 
             teamB        Winner_toss      toss_decision             result        D/L_applied 
                 0                  0                  0                  0                  0 
      Match_winner    victory_by_runs victory_by_wickets       man_of_match              venue 
               372                  0                  0                372                  0 
        1st_umpire         2nd_umpire         3rd_umpire            innings       team_batting 
               500                500             150712                  0                  0 
      team_bowling               over               ball            batsman        non_striker 
                 0                  0                  0                  0                  0 
            bowler         super_over               wide                bye             legbye 
                 0                  0                  0                  0                  0 
            noball       batsman_runs              extra         total_runs  Batsman_dismissed 
                 0                  0                  0                  0             170244 
  way_of_dismissal            fielder           Run_Rate 
            170244             172630                  0 
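# base::mode() returns the storage mode, so use a most-frequent-value helper
most_frequent <- function(x) names(which.max(table(x)))
Ipl_summary$city <- impute(Ipl_summary$city, fun = most_frequent)
Ipl_summary$toss_decision <- impute(Ipl_summary$toss_decision, fun = most_frequent)
Ipl_summary$way_of_dismissal <- impute(Ipl_summary$way_of_dismissal, fun = most_frequent)
Ipl_summary$man_of_match <- impute(Ipl_summary$man_of_match, fun = most_frequent)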
Ipl_summary$Match_winner [is.na(Ipl_summary$Match_winner)] <- "MATCH_TIED"
Ipl_summary$`1st_umpire` [is.na(Ipl_summary$`1st_umpire`)] <- "NO RECORDED"
Ipl_summary$`2nd_umpire` [is.na(Ipl_summary$`2nd_umpire`)] <- "NO RECORDED"
Ipl_summary$`3rd_umpire` [is.na(Ipl_summary$`3rd_umpire`)] <- "NOT ASKED FOR 3rd Umpire"
Ipl_summary$Batsman_dismissed [is.na(Ipl_summary$Batsman_dismissed)] <- "Still Playing"
Ipl_summary$fielder [is.na(Ipl_summary$fielder)] <- "Still Playing"
colSums(is.na(Ipl_summary))
                id       Year(season)               city         Match_date              teamA 
                 0                  0                  0                  0                  0 
             teamB        Winner_toss      toss_decision             result        D/L_applied 
                 0                  0                  0                  0                  0 
      Match_winner    victory_by_runs victory_by_wickets       man_of_match              venue 
                 0                  0                  0                  0                  0 
        1st_umpire         2nd_umpire         3rd_umpire            innings       team_batting 
                 0                  0                  0                  0                  0 
      team_bowling               over               ball            batsman        non_striker 
                 0                  0                  0                  0                  0 
            bowler         super_over               wide                bye             legbye 
                 0                  0                  0                  0                  0 
            noball       batsman_runs              extra         total_runs  Batsman_dismissed 
                 0                  0                  0                  0                  0 
  way_of_dismissal            fielder           Run_Rate 
                 0                  0                  0 
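# Inspection for special values in the data frame:
# flag infinite or NaN entries; non-numeric columns cannot hold them
is.special <- function(x) {
  if (is.numeric(x)) is.infinite(x) | is.nan(x) else rep(FALSE, length(x))
}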

# apply this function to the data frame.
sapply(Ipl_summary, function(x) sum( is.special(x)))
                id       Year(season)               city         Match_date              teamA 
                 0                  0                  0                  0                  0 
             teamB        Winner_toss      toss_decision             result        D/L_applied 
                 0                  0                  0                  0                  0 
      Match_winner    victory_by_runs victory_by_wickets       man_of_match              venue 
                 0                  0                  0                  0                  0 
        1st_umpire         2nd_umpire         3rd_umpire            innings       team_batting 
                 0                  0                  0                  0                  0 
      team_bowling               over               ball            batsman        non_striker 
                 0                  0                  0                  0                  0 
            bowler         super_over               wide                bye             legbye 
                 0                  0                  0                  0                  0 
            noball       batsman_runs              extra         total_runs  Batsman_dismissed 
                 0                  0                  0                  0                  0 
  way_of_dismissal            fielder           Run_Rate 
                 0                  0                  0 

Scan II

In this step, boxplots were drawn for the main numeric variables that decide a match. Boxplots were used because they are one of the most suitable ways of detecting univariate outliers. No outliers were observed in victory_by_wickets. The boxplots of victory_by_runs and total_runs did show outliers, which were examined further using z-scores.

The which() function was used to locate the observations whose absolute z-score is greater than 3.

The share of outliers in both variables is less than 5% (for victory_by_runs, 4612 flagged rows out of roughly 179,000, about 2.6%). Hence, it was concluded that the outliers in the boxplots are due to extraordinary performances by the winning team rather than to data entry errors.

Therefore, the outliers are left untreated and the values remain unchanged in the dataset.
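The z-score rule can be sketched in base R. The example assumes that a z-score is computed as (x - mean) / sd, which is what scores(type = "z") is understood to return, and the `margin` vector is made up for illustration.

```r
# Hypothetical winning margins: mostly small, one very large
margin <- c(rep(0, 10), rep(10, 10), 150)

# z-score: distance from the mean in units of standard deviation
z <- (margin - mean(margin)) / sd(margin)

outlier_idx <- which(abs(z) > 3)          # positions of flagged values
outlier_pct <- 100 * length(outlier_idx) / length(margin)

outlier_idx
outlier_pct
```

Only the last value is flagged, which is 1 of 21 observations (under 5%). This is the same kind of share that was compared against the 5% threshold above.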

boxplot(Ipl_summary$victory_by_wickets, main= "Box plot showing Victory by wickets", ylab= "Number of Wickets", col = "sky blue")


boxplot(Ipl_summary$total_runs, main = "Box plot showing total runs", ylab = "Total runs", col = "sky blue")


boxplot(Ipl_summary$victory_by_runs, main = "Box plot showing runs by which teams won in IPL", ylab = "Runs", col = "sky blue")



#Z scores by victory runs
zscores_victory_by_runs <- Ipl_summary$victory_by_runs %>% scores(type ="z")
zscores_victory_by_runs %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.5762 -0.5762 -0.5762  0.0000  0.2406  5.7004 
#where are outliers
which( abs(zscores_victory_by_runs) >3 )
   [1]  1891  1892  1893  1894  1895  1896  1897  1898  1899  1900  1901  1902  1903  1904  1905  1906  1907
  [18]  1908  1909  1910  1911  1912  1913  1914  1915  1916  1917  1918  1919  1920  1921  1922  1923  1924
  [35]  1925  1926  1927  1928  1929  1930  1931  1932  1933  1934  1935  1936  1937  1938  1939  1940  1941
  [52]  1942  1943  1944  1945  1946  1947  1948  1949  1950  1951  1952  1953  1954  1955  1956  1957  1958
  [69]  1959  1960  1961  1962  1963  1964  1965  1966  1967  1968  1969  1970  1971  1972  1973  1974  1975
  [86]  1976  1977  1978  1979  1980  1981  1982  1983  1984  1985  1986  1987  1988  1989  1990  1991  1992
 [103]  1993  1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009
 [120]  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021  2022  2023  2024  2025  2026
 [137]  2027  2028  2029  2030  2031  2032  2033  2034  2035  2036  2037  2038  2039  2040  2041  2042  2043
 [154]  2044  2045  2046  2047  2048  2049  2050  2051  2052  2053  2054  2055  2056  2057  2058  2059  2060
 [171]  2061  2062  2063  2064  2065  2066  2067  2068  2069  2070  2071  2072  2073  2074  2075  2076  2077
 [188]  2078  2079  2080  2081  2082  2083  2084  2085  2086  2087  2088  2089  2090  2091  2092  2093  2094
 [205]  2095  2096  2097  2098  2099  2100  2101  2102  2103  2104  2105  2106  2107  2108  2109  2110  2111
 [222]  2112  2113  2114  2115  2116 10197 10198 10199 10200 10201 10202 10203 10204 10205 10206 10207 10208
 [239] 10209 10210 10211 10212 10213 10214 10215 10216 10217 10218 10219 10220 10221 10222 10223 10224 10225
 [256] 10226 10227 10228 10229 10230 10231 10232 10233 10234 10235 10236 10237 10238 10239 10240 10241 10242
 [273] 10243 10244 10245 10246 10247 10248 10249 10250 10251 10252 10253 10254 10255 10256 10257 10258 10259
 [290] 10260 10261 10262 10263 10264 10265 10266 10267 10268 10269 10270 10271 10272 10273 10274 10275 10276
 [307] 10277 10278 10279 10280 10281 10282 10283 10284 10285 10286 10287 10288 10289 10290 10291 10292 10293
 [324] 10294 10295 10296 10297 10298 10299 10300 10301 10302 10303 10304 10305 10306 10307 10308 10309 10310
 [341] 10311 10312 10313 10314 10315 10316 10317 10318 10319 10320 10321 10322 10323 10324 10325 10326 10327
 [358] 10328 10329 10330 10331 10332 10333 10334 10335 10336 10337 10338 10339 10340 10341 10342 10343 10344
 [375] 10345 10346 10347 10348 10349 10350 10351 10352 10353 10354 10355 10356 10357 10358 10359 10360 10361
 [392] 10362 10363 10364 10365 10366 10367 10368 10369 10370 10371 10372 10373 10374 10375 10376 10377 10378
 [409] 10379 10380 10381 10382 10383 10384 10385 10386 10387 10388 10389 10390 10391 10392 10393 10394 10395
 [426] 10396 10397 10398 10399 10400 10401 10402 10403 10404 10405 10406 10407 13863 13864 13865 13866 13867
 [443] 13868 13869 13870 13871 13872 13873 13874 13875 13876 13877 13878 13879 13880 13881 13882 13883 13884
 [460] 13885 13886 13887 13888 13889 13890 13891 13892 13893 13894 13895 13896 13897 13898 13899 13900 13901
 [477] 13902 13903 13904 13905 13906 13907 13908 13909 13910 13911 13912 13913 13914 13915 13916 13917 13918
 [494] 13919 13920 13921 13922 13923 13924 13925 13926 13927 13928 13929 13930 13931 13932 13933 13934 13935
 [511] 13936 13937 13938 13939 13940 13941 13942 13943 13944 13945 13946 13947 13948 13949 13950 13951 13952
 [528] 13953 13954 13955 13956 13957 13958 13959 13960 13961 13962 13963 13964 13965 13966 13967 13968 13969
 [545] 13970 13971 13972 13973 13974 13975 13976 13977 13978 13979 13980 13981 13982 13983 13984 13985 13986
 [562] 13987 13988 13989 13990 13991 13992 13993 13994 13995 13996 13997 13998 13999 14000 14001 14002 14003
 [579] 14004 14005 14006 14007 14008 14009 14010 14011 14012 14013 14014 14015 14016 14017 14018 14019 14020
 [596] 14021 14022 14023 14024 14025 14026 14027 14028 14029 14030 14031 14032 14033 14034 14035 14036 14037
 [613] 14038 14039 14040 14041 14042 14043 14044 14045 14046 14047 14048 14049 14050 14051 14052 14053 14054
 [630] 14055 14056 14057 14058 14059 14060 14061 14062 14063 14064 14065 14066 14067 14068 14069 14070 14071
 [647] 14072 14073 14074 14075 14076 14077 14078 14079 14080 14081 14082 14083 14084 14085 14086 14087 26664
 [664] 26665 26666 26667 26668 26669 26670 26671 26672 26673 26674 26675 26676 26677 26678 26679 26680 26681
 [681] 26682 26683 26684 26685 26686 26687 26688 26689 26690 26691 26692 26693 26694 26695 26696 26697 26698
 [698] 26699 26700 26701 26702 26703 26704 26705 26706 26707 26708 26709 26710 26711 26712 26713 26714 26715
 [715] 26716 26717 26718 26719 26720 26721 26722 26723 26724 26725 26726 26727 26728 26729 26730 26731 26732
 [732] 26733 26734 26735 26736 26737 26738 26739 26740 26741 26742 26743 26744 26745 26746 26747 26748 26749
 [749] 26750 26751 26752 26753 26754 26755 26756 26757 26758 26759 26760 26761 26762 26763 26764 26765 26766
 [766] 26767 26768 26769 26770 26771 26772 26773 26774 26775 26776 26777 26778 26779 26780 26781 26782 26783
 [783] 26784 26785 26786 26787 26788 26789 26790 26791 26792 26793 26794 26795 26796 26797 26798 26799 26800
 [800] 26801 26802 26803 26804 26805 26806 26807 26808 26809 26810 26811 26812 26813 26814 26815 26816 26817
 [817] 26818 26819 26820 26821 26822 26823 26824 26825 26826 26827 26828 26829 26830 26831 26832 26833 26834
 [834] 26835 26836 26837 26838 26839 26840 26841 26842 26843 26844 26845 26846 26847 26848 26849 26850 26851
 [851] 26852 26853 26854 26855 26856 26857 26858 26859 26860 26861 26862 26863 26864 26865 26866 26867 26868
 [868] 26869 26870 26871 26872 26873 26874 26875 26876 26877 26878 26879 26880 26881 26882 26883 26884 26885
 [885] 26886 28124 28125 28126 28127 28128 28129 28130 28131 28132 28133 28134 28135 28136 28137 28138 28139
 [902] 28140 28141 28142 28143 28144 28145 28146 28147 28148 28149 28150 28151 28152 28153 28154 28155 28156
 [919] 28157 28158 28159 28160 28161 28162 28163 28164 28165 28166 28167 28168 28169 28170 28171 28172 28173
 [936] 28174 28175 28176 28177 28178 28179 28180 28181 28182 28183 28184 28185 28186 28187 28188 28189 28190
 [953] 28191 28192 28193 28194 28195 28196 28197 28198 28199 28200 28201 28202 28203 28204 28205 28206 28207
 [970] 28208 28209 28210 28211 28212 28213 28214 28215 28216 28217 28218 28219 28220 28221 28222 28223 28224
 [987] 28225 28226 28227 28228 28229 28230 28231 28232 28233 28234 28235 28236 28237 28238
 [ reached getOption("max.print") -- omitted 3612 entries ]
#length of outliers
length(which(abs(zscores_victory_by_runs)>3))
[1] 4612
#total runs
zscores_total_runs <- Ipl_summary$total_runs %>% scores(type ="z")
zscores_total_runs %>% summary()
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-5.14494 -0.62091  0.03953  0.00000  0.69997  3.47383 
#where
which(abs(zscores_total_runs)>3)
  [1]   6376   6377   6378   6379   6380   6381   6382   6383   6384   6385   6386   6387   6388   6389   6390
 [16]   6391   6392   6393   6394   6395   6396   6397   6398   6399   6400   6401   6402   6403   6404   6405
 [31]   6406   6407   6408   6409   6410   6411   6412   6413   6414   6415   6416   6417   6418   6419   6420
 [46]   6421   6422   6423   6424   6425   6426   6427   6428   6429   6430   6431   6432   6433   6434   6435
 [61]   6436   6437  10325  10326  10327  10328  10329  10330  10331  10332  10333  10334  10335  10336  10337
 [76]  10338  10339  10340  10341  10342  10343  10344  10345  10346  10347  10348  10349  10350  10351  10352
 [91]  10353  10354  10355  10356  10357  10358  10359  10360  10361  10362  10363  10364  10365  10366  10367
[106]  10368  10369  10370  10371  10372  10373  10374  10375  10376  10377  10378  10379  10380  10381  10382
[121]  10383  10384  10385  10386  10387  10388  10389  10390  10391  10392  10393  10394  10395  10396  10397
[136]  10398  10399  10400  10401  10402  10403  10404  10405  10406  10407  13373  13374  13375  13376  13377
[151]  13378  13379  13380  13381  13382  13383  13384  13385  13386  13387  13388  13389  13390  13391  13392
[166]  13393  13394  13395  13396  13397  13398  13399  13400  13401  13402  13403  13404  13405  13406  13407
[181]  23924  23925  23926  23927  23928  23929  23930  23931  23932  23933  23934  23935  23936  23937  23938
[196]  23939  23940  23941  23942  23943  23944  23945  23946  23947  23948  23949  23950  23951  23952  23953
[211]  23954  23955  23956  23957  23958  23959  23960  23961  23962  23963  23964  23965  23966  23967  23968
[226]  23969  23970  23971  23972  23973  23974  27718  27719  27720  27721  27722  27723  27724  27725  27726
[241]  27727  27728  27729  27730  27731  27732  27733  27734  27735  27736  27737  27738  27739  27740  27741
[256]  27742  27743  27744  27745  27746  27747  27748  27749  27750  27751  27752  27753  27754  27755  27756
[271]  27757  27758  27759  27760  27761  27762  27763  27764  27765  27766  27767  27768  27769  27770  27771
[286]  27772  27773  27774  27775  27776  27777  27778  27779  27780  27781  27782  27783  27784  27785  27786
[301]  27787  27788  27789  27790  27791  27792  27793  27794  27795  27796  27797  27798  27799  27800  27801
[316]  27802  27803  27804  27805  27806  27807  27808  27809  27810  27811  27812  27813  27814  27815  27816
[331]  27893  27894  27895  27896  27897  27898  27899  27900  27901  27902  27903  27904  27905  27906  27907
[346]  27908  27909  27910  27911  27912  27913  27914  27915  27916  27917  27918  27919  27920  27921  27922
[361]  27923  27924  66559  66560  66561  66562  66563  66564  66565  66566  66567  66568  66569  66570  66571
[376]  66572  66573  66574  66575  66576  66577  66578  66579  66580  66581  66582  66583  66584  66585  66586
[391]  66587  66588  66589  66590  66591  66592  66593  66594  66595  66596  66597  66598  66599  66600  66601
[406]  66602  66603  66604  66605  66606  66607  66608  66609  66610  66611  66612  66613  66614  66615  66616
[421]  66617  66618  66619  66620  66621  70931  70932  70933  70934  70935  70936  70937  70938  70939  70940
[436]  70941  70942  70943  70944  70945  70946  70947  70948  70949  70950  70951  70952  70953  70954  70955
[451]  70956  70957  70958  70959  70960  70961  70962  70963  70964  70965  70966  70967  70968  70969  70970
[466]  70971  70972  70973  70974  70975  70976  70977  70978  70979  70980  70981  70982  70983  70984  70985
[481]  70986  70987  70988  70989  70990  70991  70992  70993  97225  97226  97227  97228  97229  97230  97231
[496]  97232  97233  97234  97235  97236  97237  97238  97239  97240  97241  97242  97243  97244  97245  97246
[511]  97247  97248  97249  97250  97251  97252  97253  97254  97255  97256  97257  97258  97259  97260  97261
[526]  97262  97263  97264  97265  97266  97267  97268  97269  97270  97271  97272  97273  97274  97275  97276
[541]  97277  97278  97279  97280  97281  97282  97283  97284  97285  97286  97287  97288  97289  97290  97291
[556]  97292  97293  97294  97295  97296  97297  97298  97299  97300  97301  97302  97303  97304  97305  97306
[571]  97307  97308  97309  97310  97311  97312  97313  97314  97315  97316  97317  97318  97319  97320  97321
[586]  97322  97323  97324  97325  97326  97327  97328  97329  97330  97331  97332  97333  97334  97335  97336
[601]  97337  97338  97339  97340  97341  97342  97343  97344  97345  97346  97347  97348  97349  97350  97351
[616]  97352 115958 115959 115960 115961 115962 115963 115964 115965 115966 115967 115968 115969 115970 115971
[631] 115972 115973 115974 115975 115976 115977 115978 115979 115980 115981 115982 115983 115984 115985 115986
[646] 115987 135163 135164 135165 135166 135167 135168 135169 146923 146924 146925 146926 146927 146928 146929
[661] 146930 146931 146932 146933 146934 146935 146936 146937 146938 146939 146940 146941 146942 146943 146944
[676] 146945 146946 146947 146948 146949 146950 146951 146952 146953 146954 146955 146956 151769 151770 151771
[691] 151772 151773 151774 151775 151776 151777 151778 151779 151780 151781 151782 151783 151784 151785 151786
[706] 151787 151788 151789 151790 151791 151792 151793 151794 151795 151796 151797 151798 151799 151800 151801
[721] 151802 151803 151804 151805 160692 160693 160694 160695 160696 160697 160698 160699 160700 160701 160702
[736] 160703 160704 160705 160706 160707 160708 160709 160710 160711 160712 160713 160714 160715 160716 160717
[751] 160718 160719 160720 160721 160722 160723 160724 160725 160726 160727 160728 160729 160730 160731 160732
[766] 160733 160734 160735 160736 160737 160738 160739 160740 160741 160742 160743 160744 160745 160746 160747
[781] 160748 160749 160750 160751 160752 160753 160754 160755 160756 160757 160758 160759 160760 160761 160762
[796] 160763 160764 160765 160766 160767 160768 160769 160770 160771 160772 160773 160774 160775 160776 160777
[811] 160778 160779 160780 160781 160782 160783 160784 160785 160786 160787 160788 160789 160790 160791 160792
[826] 160793 160794 160795 160796 160797 160798 160799 160800 160801 160802 160803 160804 160805 160806 160807
[841] 160808 160809 160810 160811 160812 160813 160814 160815 160816 176413 176414 176415 176416 176417 176418
[856] 176419 176420 176421 176422 176423 176424 176425 176426 176427 176428 176429 176430 176431 176432 176433
[871] 176434 176435 176436 176437 176438 176439 176440 176441 176442 176443 176444 176445 176446 176447 176448
[886] 176449 176450 176451 176452 176453 176454 176455 176456 176457 176458 176459 176460 176461 176462 176463
# Number of observations whose absolute z-score exceeds 3
length(which(abs(zscores_total_runs) > 3))
[1] 900

Transform

In this step, several transformations, including Box-Cox, were applied to victory_by_runs to reduce its right skewness and bring its distribution closer to a normal distribution.

The following transformations were applied:

  1. Log transformation (base 10 and natural log).

  2. Square root transformation.

  3. Square transformation.

  4. Reciprocal transformation.

  5. Box-Cox transformation.

Comparing the histograms after each transformation, the log() function produced the distribution closest to a normal curve among the transformations listed above.
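The visual comparison of histograms can be backed by a number. The sketch below is one possible way to do that, not the method used in this report: it computes the sample skewness of each transformed vector and picks the one closest to zero. The vector x is synthetic log-normal data standing in for a right-skewed, strictly positive variable, and the helper skewness() is a hypothetical function written here in base R. Because the data are log-normal by construction, the log transformation is expected to win in this example; on the real data the ranking could differ. Box-Cox is left out only to keep the sketch free of package dependencies.

```r
# Synthetic stand-in for a right-skewed, strictly positive variable.
# (Assumption: illustrative data only, not the IPL dataset.)
set.seed(42)
x <- rlnorm(1000, meanlog = 3, sdlog = 0.8)

# Hypothetical helper: sample skewness (third standardized moment).
# A value near 0 indicates a roughly symmetric distribution.
skewness <- function(v) {
  v <- v[is.finite(v)]
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

# Apply the same family of transformations used in this section.
transformed <- list(
  original   = x,
  log        = log(x),
  sqrt       = sqrt(x),
  square     = x^2,
  reciprocal = 1 / x
)

# Skewness of each version, rounded for display.
skew <- sapply(transformed, skewness)
print(round(skew, 2))

# The transformation whose skewness is closest to zero.
best <- names(which.min(abs(skew)))
print(best)
```

The same comparison could be run on Ipl_summary$victory_by_runs by replacing x, provided non-finite results are handled first.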


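# Transform
# Original distribution of victory_by_runs
hist(Ipl_summary$victory_by_runs, main = "Histogram of Victory runs")

# Log transformation, base 10
log_vic_runs <- log10(Ipl_summary$victory_by_runs)
hist(log_vic_runs)

# Log transformation, natural log
ln_vic_runs <- log(Ipl_summary$victory_by_runs)
hist(ln_vic_runs)

# Box-Cox transformation with lambda chosen automatically (forecast package)
BoxCox_vic_runs <- BoxCox(Ipl_summary$victory_by_runs, lambda = "auto")
hist(BoxCox_vic_runs)

# Square root transformation
sqrt_vic_runs <- sqrt(Ipl_summary$victory_by_runs)
hist(sqrt_vic_runs)

# Square transformation
square_vic_runs <- Ipl_summary$victory_by_runs^2
hist(square_vic_runs)

# Reciprocal transformation
reciprocal_vic_runs <- Ipl_summary$victory_by_runs^(-1)
hist(reciprocal_vic_runs)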
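One caveat applies to the log and reciprocal transformations if victory_by_runs contains zeros, which seems likely for matches decided by wickets but is an assumption about this dataset. log(0) is -Inf and 0^(-1) is Inf, and hist() drops non-finite values without a warning, so the histograms would describe only the non-zero rows. The sketch below uses a small hypothetical vector, runs, to show the effect and two common ways of handling it.

```r
# Hypothetical margins; the zeros stand for matches not decided by runs.
runs <- c(0, 0, 0, 5, 12, 23, 40, 71, 146)

# Zeros become non-finite under log and reciprocal.
print(log(runs))
print(1 / runs)

# How many values hist() would silently drop from the log histogram.
n_dropped <- sum(!is.finite(log(runs)))

# Option 1: restrict the analysis to matches actually won by runs.
positive_runs <- runs[runs > 0]
log_positive <- log(positive_runs)

# Option 2: keep every row by using log(1 + x), which maps 0 to 0.
log1p_runs <- log1p(runs)
```

Option 1 changes the population being described, while option 2 keeps all rows but leaves a spike at zero, so the choice depends on which question the analysis is meant to answer.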

