library(readr)
library(dplyr)
library(Hmisc)
library(lubridate)
library(outliers)
library(forecast)
library(knitr)
For assignment 3, we are using the data of All Indian Premier League Cricket matches held from 2008 and 2016.This is a ball by ball data of all the IPL cricket matches till season 9.
The two data sets were merged to join the ipl matches and with the corresponding deliveries with a unique identifier “ID” (Identified as match_id in matches.csv). Then a complete inspection of variables was done to understand each variables specification and related data. After this, appropriate data type conversions were applied.The ‘Date’ variable while importing was of character data type.It was converted to Date by as.Date().
As the dataset followed Hadley Wikham’s tidy principles, so no tidy function on the dataset was performed.A new variable ‘run_rate’ was calculated using mutate function.
Missing values in factor variables were imputed with mode and missing values in categoical variables were recoded appropriately.The missing values in numeric data type can be imputed with mean or median.For this dataset there were no missing values found for numeric variables.Dataset was checked for variables having any special values.
Outliers of total_runs, win_by_runs, win_by_wickets was observed by plotting individual box plot and then outliers were scanned using z-scores.Since the outliers were less than 5% in the dataset, it was concluded that the outliers are not due to data entry errors.So,the scanned value were left as it is in dataset as it can prove to be of significant in our dataset.
Numerical variable ‘win_by_runs’ was transformed to decrease the skewness and convert the distribution to a more normal distribution.
Two data sets were obtained from the following link (https://www.kaggle.com/nowke9/ipldata/download).Indian Premier League (IPL) is a Twenty20 cricket format league in India.The Data gives information till Season 11 (2008 - 2019).
matches.csv -It is Match by match data.Matches has 18 variables.Below is the specification of ever variables: -id: Unique Id for a match -season: Year of the match -city: City where the match took place -date: Date of match (DD/MM/YY) -team1: Team batting first -team2: Team batting second -toss_winner: Toss winner -toss_decision: Toss decision - bat/field -result: Match result -dl_applied: Is Duckworth Lewis (DL) rule applied -winner: Winner of the match -win_by_runs: Win by runs -win_by_wickets: Win by wickets -player_of_match: Player of the match award (Man of the match) -venue:Match venue -umpire1:Name of Umpire 1 -umpire2:Name of Umpire 2 -umpire3:Name ofUmpire 3
deliveries.csv - Ball by ball data.Deliveries has total 21 variablesBelow is the specification of variables: -match_id :Unique Identifier for a match -innings - current inings of the match -batting_team :Name of the batting team -bowling_team: Name of the bowling team -over:Current over -ball:Current ball of the over -batsman: Name of the batsman on strike -non_striker: Name of the batsman on non-striker’s end -bowler: Name of the bowler -is_super_over: Is this a super-over (0 or 1) -wide_runs :Runs given as wide -bye_runs: Runs given as bye -legbye_runs: Runs given as leg-bye -noball_runs: Runs given as no-ball -penalty_runs:Runs given as penalty -batsman_runs:Runs scored by the batsman -extra_runs:Total extra runs (Wide, Bye, No-ball, Penalty) -total_runs:Total runs from the ball (extra_runs, batsman_runs) -player_dismissed:Name of the player dismissed (If out) -dismissal_kind: How the player was dismissed (If out) -fielder: Fielder involved in the dismissal (If any)
It was checked if the datasets were impored as data frame or not using class() function. Full join is used to merge both the datasets by the key variable id.
deliveries <- read_csv("ipldata/deliveries.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## batting_team = col_character(),
## bowling_team = col_character(),
## batsman = col_character(),
## non_striker = col_character(),
## bowler = col_character(),
## player_dismissed = col_character(),
## dismissal_kind = col_character(),
## fielder = col_character()
## )
## See spec(...) for full column specifications.
matches <- read_csv("ipldata/matches.csv")
## Parsed with column specification:
## cols(
## id = col_double(),
## season = col_double(),
## city = col_character(),
## date = col_character(),
## team1 = col_character(),
## team2 = col_character(),
## toss_winner = col_character(),
## toss_decision = col_character(),
## result = col_character(),
## dl_applied = col_double(),
## winner = col_character(),
## win_by_runs = col_double(),
## win_by_wickets = col_double(),
## player_of_match = col_character(),
## venue = col_character(),
## umpire1 = col_character(),
## umpire2 = col_character(),
## umpire3 = col_character()
## )
class(deliveries)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
class(matches)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
ipl <- full_join(matches,deliveries, c("id" = "match_id"))
head(ipl)
dim() is used to see the dimensions of the dataset.
Some of the variables were imported as incorrect data types. Therefore, they were converted using: as.factor(), for example,toss_decision ,result ,dismissal_kind factor() for ordered factor variables, for example season. as.date for date variables , for example date.
str() function is used to check the converted data type of the variables.
dim(ipl)
## [1] 179078 38
ipl$season<- factor(ipl$season, levels = c("2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019"), labels = c("2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019"), ordered = TRUE)
ipl$date <- as.Date(ipl$date)
ipl$toss_decision<- as.factor(ipl$toss_decision)
ipl$result <- as.factor(ipl$result)
ipl$dl_applied <- as.factor(ipl$dl_applied)
ipl$over <- as.factor(ipl$over)
ipl$is_super_over <- factor(ipl$is_super_over,levels = c(0,1), labels = c("No","Yes"))
ipl$dismissal_kind <- as.factor(ipl$dismissal_kind)
str(ipl)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 179078 obs. of 38 variables:
## $ id : num 1 1 1 1 1 1 1 1 1 1 ...
## $ season : Ord.factor w/ 12 levels "2008"<"2009"<..: 10 10 10 10 10 10 10 10 10 10 ...
## $ city : chr "Hyderabad" "Hyderabad" "Hyderabad" "Hyderabad" ...
## $ date : Date, format: "2017-04-05" "2017-04-05" ...
## $ team1 : chr "Sunrisers Hyderabad" "Sunrisers Hyderabad" "Sunrisers Hyderabad" "Sunrisers Hyderabad" ...
## $ team2 : chr "Royal Challengers Bangalore" "Royal Challengers Bangalore" "Royal Challengers Bangalore" "Royal Challengers Bangalore" ...
## $ toss_winner : chr "Royal Challengers Bangalore" "Royal Challengers Bangalore" "Royal Challengers Bangalore" "Royal Challengers Bangalore" ...
## $ toss_decision : Factor w/ 2 levels "bat","field": 2 2 2 2 2 2 2 2 2 2 ...
## $ result : Factor w/ 3 levels "no result","normal",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ dl_applied : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ winner : chr "Sunrisers Hyderabad" "Sunrisers Hyderabad" "Sunrisers Hyderabad" "Sunrisers Hyderabad" ...
## $ win_by_runs : num 35 35 35 35 35 35 35 35 35 35 ...
## $ win_by_wickets : num 0 0 0 0 0 0 0 0 0 0 ...
## $ player_of_match : chr "Yuvraj Singh" "Yuvraj Singh" "Yuvraj Singh" "Yuvraj Singh" ...
## $ venue : chr "Rajiv Gandhi International Stadium, Uppal" "Rajiv Gandhi International Stadium, Uppal" "Rajiv Gandhi International Stadium, Uppal" "Rajiv Gandhi International Stadium, Uppal" ...
## $ umpire1 : chr "AY Dandekar" "AY Dandekar" "AY Dandekar" "AY Dandekar" ...
## $ umpire2 : chr "NJ Llong" "NJ Llong" "NJ Llong" "NJ Llong" ...
## $ umpire3 : chr NA NA NA NA ...
## $ inning : num 1 1 1 1 1 1 1 1 1 1 ...
## $ batting_team : chr "Sunrisers Hyderabad" "Sunrisers Hyderabad" "Sunrisers Hyderabad" "Sunrisers Hyderabad" ...
## $ bowling_team : chr "Royal Challengers Bangalore" "Royal Challengers Bangalore" "Royal Challengers Bangalore" "Royal Challengers Bangalore" ...
## $ over : Factor w/ 20 levels "1","2","3","4",..: 1 1 1 1 1 1 1 2 2 2 ...
## $ ball : num 1 2 3 4 5 6 7 1 2 3 ...
## $ batsman : chr "DA Warner" "DA Warner" "DA Warner" "DA Warner" ...
## $ non_striker : chr "S Dhawan" "S Dhawan" "S Dhawan" "S Dhawan" ...
## $ bowler : chr "TS Mills" "TS Mills" "TS Mills" "TS Mills" ...
## $ is_super_over : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ wide_runs : num 0 0 0 0 2 0 0 0 0 0 ...
## $ bye_runs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ legbye_runs : num 0 0 0 0 0 0 1 0 0 0 ...
## $ noball_runs : num 0 0 0 0 0 0 0 0 0 1 ...
## $ penalty_runs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ batsman_runs : num 0 0 4 0 0 0 0 1 4 0 ...
## $ extra_runs : num 0 0 0 0 2 0 1 0 0 1 ...
## $ total_runs : num 0 0 4 0 2 0 1 1 4 1 ...
## $ player_dismissed: chr NA NA NA NA ...
## $ dismissal_kind : Factor w/ 9 levels "bowled","caught",..: NA NA NA NA NA NA NA NA NA NA ...
## $ fielder : chr NA NA NA NA ...
Upon inspection using glimpse(), head() and tail(). This data is tidy since it follows Hadley Wickham’s tidy data principles where:
Each variable have its own column. Each observation have its own row. Each value have its own cell.
glimpse(ipl)
## Observations: 179,078
## Variables: 38
## $ id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ season <ord> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017,…
## $ city <chr> "Hyderabad", "Hyderabad", "Hyderabad", "Hyderab…
## $ date <date> 2017-04-05, 2017-04-05, 2017-04-05, 2017-04-05…
## $ team1 <chr> "Sunrisers Hyderabad", "Sunrisers Hyderabad", "…
## $ team2 <chr> "Royal Challengers Bangalore", "Royal Challenge…
## $ toss_winner <chr> "Royal Challengers Bangalore", "Royal Challenge…
## $ toss_decision <fct> field, field, field, field, field, field, field…
## $ result <fct> normal, normal, normal, normal, normal, normal,…
## $ dl_applied <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ winner <chr> "Sunrisers Hyderabad", "Sunrisers Hyderabad", "…
## $ win_by_runs <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35,…
## $ win_by_wickets <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ player_of_match <chr> "Yuvraj Singh", "Yuvraj Singh", "Yuvraj Singh",…
## $ venue <chr> "Rajiv Gandhi International Stadium, Uppal", "R…
## $ umpire1 <chr> "AY Dandekar", "AY Dandekar", "AY Dandekar", "A…
## $ umpire2 <chr> "NJ Llong", "NJ Llong", "NJ Llong", "NJ Llong",…
## $ umpire3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ inning <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ batting_team <chr> "Sunrisers Hyderabad", "Sunrisers Hyderabad", "…
## $ bowling_team <chr> "Royal Challengers Bangalore", "Royal Challenge…
## $ over <fct> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3,…
## $ ball <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2,…
## $ batsman <chr> "DA Warner", "DA Warner", "DA Warner", "DA Warn…
## $ non_striker <chr> "S Dhawan", "S Dhawan", "S Dhawan", "S Dhawan",…
## $ bowler <chr> "TS Mills", "TS Mills", "TS Mills", "TS Mills",…
## $ is_super_over <fct> No, No, No, No, No, No, No, No, No, No, No, No,…
## $ wide_runs <dbl> 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bye_runs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ legbye_runs <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ noball_runs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ penalty_runs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ batsman_runs <dbl> 0, 0, 4, 0, 0, 0, 0, 1, 4, 0, 6, 0, 0, 4, 1, 0,…
## $ extra_runs <dbl> 0, 0, 0, 0, 2, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ total_runs <dbl> 0, 0, 4, 0, 2, 0, 1, 1, 4, 1, 6, 0, 0, 4, 1, 0,…
## $ player_dismissed <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "DA…
## $ dismissal_kind <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, cau…
## $ fielder <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Ma…
head(ipl)
tail(ipl)
Run rate of each match of every team is calculated.Here,I have grouped the dataset by team , year and date to calculate the total runs made by a team in every match.Then, divided the runs/total over (total over =20) to calculate the run rate of every match.We have filtered the data where the is no missing values in date variable, so the grouping is performed appropriately. Mutate() function is used to create a new variable ‘Run Rate’.
ipl <- ipl %>% group_by(batting_team,season,date) %>% mutate(sum_run = sum(total_runs), run_rate = sum_run/20) %>% filter(!is.na(date))
head(ipl)
colSums() was used to get total number of missing values in each variables. For variable types that are factor, mode was used to replace the missing values.The charater type variables were recoded as “Not recorded” assuming that there would be some data entry error. The missing value in Umpire3 variable is recoded as “NO CALL FOR UMPIRE 3” as the deciscion for the respective balls were made by umpire1 & umpire2 and Umpire 3 wasn’t required The missing value of variables player_dismissed and fielder is recoded as “NOT OUT” as the player is still batting and is not out yet on respective balls.
After all the imputations and recoding for missing values, colSums(is.na(ipl)) was checked again and as expected, now there are no missing values.
Inspection for special values was done but No special type (Inf, -Inf, Nan) values was found in the dataset.
colSums(is.na(ipl))
## id season city date
## 0 0 1700 0
## team1 team2 toss_winner toss_decision
## 0 0 0 0
## result dl_applied winner win_by_runs
## 0 0 321 0
## win_by_wickets player_of_match venue umpire1
## 0 321 0 248
## umpire2 umpire3 inning batting_team
## 248 150460 0 0
## bowling_team over ball batsman
## 0 0 0 0
## non_striker bowler is_super_over wide_runs
## 0 0 0 0
## bye_runs legbye_runs noball_runs penalty_runs
## 0 0 0 0
## batsman_runs extra_runs total_runs player_dismissed
## 0 0 0 143022
## dismissal_kind fielder sum_run run_rate
## 143022 145091 0 0
ipl$city <-impute(ipl$city, fun = mode)
ipl$toss_decision <-impute(ipl$toss_decision, fun = mode)
ipl$dismissal_kind <-impute(ipl$dismissal_kind, fun = mode)
ipl$dismissal_kind <- impute(ipl$dismissal_kind , fun = mode )
ipl$winner [is.na(ipl$winner)] <- "TIE"
ipl$umpire1 [is.na(ipl$umpire1)] <- "NOT RECORDED"
ipl$umpire2 [is.na(ipl$umpire2)] <- "NOT RECORDED"
ipl$umpire3 [is.na(ipl$umpire3)] <- "NO CALLL FOR UMPIRE 3"
ipl$player_dismissed [is.na(ipl$player_dismissed)] <- "NOT OUT"
ipl$fielder [is.na(ipl$fielder)] <- "NOT OUT"
colSums(is.na(ipl))
## id season city date
## 0 0 0 0
## team1 team2 toss_winner toss_decision
## 0 0 0 0
## result dl_applied winner win_by_runs
## 0 0 0 0
## win_by_wickets player_of_match venue umpire1
## 0 321 0 0
## umpire2 umpire3 inning batting_team
## 0 0 0 0
## bowling_team over ball batsman
## 0 0 0 0
## non_striker bowler is_super_over wide_runs
## 0 0 0 0
## bye_runs legbye_runs noball_runs penalty_runs
## 0 0 0 0
## batsman_runs extra_runs total_runs player_dismissed
## 0 0 0 0
## dismissal_kind fielder sum_run run_rate
## 0 0 0 0
#Inspection for special values in data frame
is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
# apply this function to the data frame.
sapply(ipl, function(x) sum( is.special(x)))
## id season city date
## 0 0 0 0
## team1 team2 toss_winner toss_decision
## 0 0 0 0
## result dl_applied winner win_by_runs
## 0 0 0 0
## win_by_wickets player_of_match venue umpire1
## 0 0 0 0
## umpire2 umpire3 inning batting_team
## 0 0 0 0
## bowling_team over ball batsman
## 0 0 0 0
## non_striker bowler is_super_over wide_runs
## 0 0 0 0
## bye_runs legbye_runs noball_runs penalty_runs
## 0 0 0 0
## batsman_runs extra_runs total_runs player_dismissed
## 0 0 0 0
## dismissal_kind fielder sum_run run_rate
## 0 0 0 0
We have plotted box plot for the main numeric variables of our dataset which are the deciding factors of cricket match.Box plot is one of the best methods to detect univariate outliers.If outliers are present in these variable ,the statistical study and results will largerly deviate from the expected conclusions. When we inspect win_by_wickets, no outliers are observed. When the box plot of win_by_runs and total_runs is plotted outliers are observed and further we detect outliers with the z-score.
Using which() we find the locations of z-score whose absolute value is greater than 3.
As the percentage of the outliers in both the variables are less than 5% (win_by_runs= 2.575% and total_runs=0.06%) , It can be concluded that the outliers are not observed in the box plot is not to due to data entry errors.It might be case of exceptional performance by the winning team.So, No handling of outliers will be performed and the values will remain as it is in the data frame.
ipl$win_by_wickets %>% boxplot(main ="Box plot of Winning Wickets in match", ylab= "No. of wickets" ,col = "grey")
ipl$total_runs %>% boxplot(main ="Box plot of total runs", ylab= "Total runs" ,col = "grey")
ipl$win_by_runs %>% boxplot(main ="Box plot of Run by which the Team won", ylab= "RUNS" ,col = "grey")
z.scores <- ipl$win_by_runs %>% scores(type ="z")
z.scores %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.5833 -0.5833 -0.5833 0.0000 0.2615 5.5839
length(which(abs(z.scores)>3))
## [1] 4134
z.scores_total_runs <- ipl$total_runs %>% scores(type ="z")
z.scores_total_runs %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.8157 -0.8157 -0.1841 0.0000 -0.1841 3.6056
length(which(abs(z.scores_total_runs) >3))
## [1] 38
ForTransformation, several types of transformation and Box Cox was to applied to reduce right skewness and transform the distribution to normal distribution. log() proved to produce the best normal curve among all other transformations.
hist(ipl$win_by_runs, main="Histogram of Winning runs")
log_win_by_runs <- log10(ipl$win_by_runs)
hist(log_win_by_runs)
ln_win_by_runs <- log(ipl$win_by_runs)
hist(ln_win_by_runs)
BoxCox_win_by_runs <- BoxCox(ipl$win_by_runs, lambda = "auto")
hist(BoxCox_win_by_runs)
sqrt_win_by_runs <- sqrt(ipl$win_by_runs)
hist(sqrt_win_by_runs)
square_win_by_runs <- ipl$win_by_runs^2
hist(square_win_by_runs)
reciprocal_win_by_runs <- ipl$win_by_runs^(-1)
hist(reciprocal_win_by_runs)