The Indian Premier League (IPL) is a men’s T20 franchise cricket league of India. It is annually contested by ten teams based out of seven Indian cities and three Indian states. The league was founded by the Board of Control for Cricket in India (BCCI) in 2007. Fantasy cricket is gaining popularity day by day. Here, a user can choose real life players based on their various performance.
To create the ideal Super-11 team by analysing the IPL’s ball-by-ball and match-by-match datasets. The team should have strong openers, middle-order hitters, all-rounders, a reliable wicketkeeper, and a collection of quick and spin bowlers. Players who have played or bowled 300 balls should be considered.
Given Dataset: IPL Complete Dataset (2008-2020).
This dataset consists of two separate CSV files.
Information of each match summary
No of features- 17 such as inning, over, ball,batsman, etc
Data points- 816
Ball by ball details and deliveries
No of features- 18 such as inning, over, ball,batsman, etc
Data points- 193468
library(tidyverse)
library(forcats)
library(tidyr)
library(lubridate)
library(ggplot2)
library(dplyr)
mm <- read.csv("IPL Matches 2008-2020.csv")
bb <- read.csv("IPL Ball-by-Ball 2008-2020.csv")
str(mm)
## 'data.frame': 816 obs. of 17 variables:
## $ id : int 335982 335983 335984 335985 335986 335987 335988 335989 335990 335991 ...
## $ city : chr "Bangalore" "Chandigarh" "Delhi" "Mumbai" ...
## $ date : chr "2008-04-18" "2008-04-19" "2008-04-19" "2008-04-20" ...
## $ player_of_match: chr "BB McCullum" "MEK Hussey" "MF Maharoof" "MV Boucher" ...
## $ venue : chr "M Chinnaswamy Stadium" "Punjab Cricket Association Stadium, Mohali" "Feroz Shah Kotla" "Wankhede Stadium" ...
## $ neutral_venue : int 0 0 0 0 0 0 0 0 0 0 ...
## $ team1 : chr "Royal Challengers Bangalore" "Kings XI Punjab" "Delhi Daredevils" "Mumbai Indians" ...
## $ team2 : chr "Kolkata Knight Riders" "Chennai Super Kings" "Rajasthan Royals" "Royal Challengers Bangalore" ...
## $ toss_winner : chr "Royal Challengers Bangalore" "Chennai Super Kings" "Rajasthan Royals" "Mumbai Indians" ...
## $ toss_decision : chr "field" "bat" "bat" "bat" ...
## $ winner : chr "Kolkata Knight Riders" "Chennai Super Kings" "Delhi Daredevils" "Royal Challengers Bangalore" ...
## $ result : chr "runs" "runs" "wickets" "wickets" ...
## $ result_margin : int 140 33 9 5 5 6 9 6 3 66 ...
## $ eliminator : chr "N" "N" "N" "N" ...
## $ method : chr NA NA NA NA ...
## $ umpire1 : chr "Asad Rauf" "MR Benson" "Aleem Dar" "SJ Davis" ...
## $ umpire2 : chr "RE Koertzen" "SL Shastri" "GA Pratapkumar" "DJ Harper" ...
colnames(mm)
## [1] "id" "city" "date" "player_of_match"
## [5] "venue" "neutral_venue" "team1" "team2"
## [9] "toss_winner" "toss_decision" "winner" "result"
## [13] "result_margin" "eliminator" "method" "umpire1"
## [17] "umpire2"
library(tidyverse)
library(forcats)
library(tidyr)
library(lubridate)
library(ggplot2)
library(dplyr)
plt <- ggplot(data = mm) +
geom_bar(mapping = aes(y = fct_infreq(winner), fill = winner))
plot_data <- layer_data(plt)
plt + annotate(geom = "text", label = plot_data$count, x = plot_data$x +5, y = plot_data$y ) +
scale_fill_manual(values = c("yellow","lightblue","#00008B","darkred","orange","red","orange","purple","blue","cyan","#D11D9B","magenta","#D1AB3E","#FF822A","grey")) +
labs(title = "Number of matches won by each team from 2008-2020", x = "Number of wins", y = "Team") +
guides(fill=guide_legend(title="Team"))
Here, we can observed that Mumbai Indians win most number of matches and Rising Pune superstars won least number of matches
mm %>%
count(player_of_match,sort = TRUE,name = "Number of MOTMs") %>%
top_n(15)
AS de villers won the man of the match tittle the most number of times fallowed by CH Gayle and RG Sharma
runscorers <- bb %>%
group_by(batsman) %>%
summarise(runs = sum(batsman_runs), balls = n(), innings = sum(is_wicket == 1),
fours = sum(batsman_runs == 4), sixes = sum(batsman_runs == 6))%>%
mutate(boundary_pct = round(((4*fours + 6*sixes)/runs)*100,2),
strike_rate = round((runs/balls)*100,2), avg = round(runs/innings, 2)) %>%
subset(select = -c(balls, fours, sixes, innings)) %>%
arrange(desc(runs)) %>%
filter(runs > 1000)
summary(runscorers)
top_10_runscorers <- runscorers%>%
top_n(10, runs) %>%
subset(select = -c(runs)) %>%
gather(key = "stats", value = "val", 2:4) %>%
ggplot(aes(x = batsman, y = val, fill = stats)) +
geom_bar(stat = "identity") +
labs(title = "Top 10 Run Scorers", x = "Name", y = "Value", fill = "Stats") +
theme(plot.title = element_text(hjust = 0.5, size = 20), panel.background = element_rect(fill = "white"),
panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
axis.title = element_text(size = 20), axis.text = element_text(size = 15),
legend.title = element_text(size = 15), legend.text = element_text(size = 15),
strip.text = element_text(size = 20)) +
facet_wrap(~stats, ncol = 1,
labeller = as_labeller(c("avg" = "Average", "boundary_pct" = "Boundary Percent", "strike_rate" = "Strike Rate"))) +
geom_text(aes(label = val), vjust = 2, size = 5) +
scale_fill_hue(h = c(180, 300), labels = c("Average", "Boundary Percent", "Strike Rate"))
top_10_runscorers
This graph gives the complete idea about the batman’s performance. Ab de villers has highest strike rate and highest average run scored whereas CH gaye hits most number of boundaries.
top_10_wickettakers <- bb %>%
group_by(bowler) %>%
summarise(wickets = sum(is_wicket), runs = sum(batsman_runs),
boundaries = sum(batsman_runs == 4|batsman_runs == 6), balls = n()) %>%
mutate(economy_rate = round(runs/(balls/6),2), boundary_pct = round((boundaries/balls)*100,2)) %>%
subset(select = -c(runs, balls, boundaries)) %>%
filter(wickets > 50) %>%
arrange(desc(wickets)) %>%
top_n(10, wickets)
summary(top_10_wickettakers)
g2 <- gather(top_10_wickettakers, key = "features", value = "val", 2:4) %>%
ggplot(aes(x = bowler, y = val, fill = features)) +
geom_bar(stat = "identity") +
labs(title = "Top 10 Wicket Takers", x = "Bowler", y = "Values", fill = "Stats") +
theme(plot.title = element_text(hjust = 0.5, size = 20), panel.grid = element_blank(),
panel.background = element_rect(fill = "white"), axis.title = element_text(size = 15),
axis.text = element_text(size = 13), legend.title = element_text(size = 15),
legend.text = element_text(size = 15), strip.text = element_text(size = 20)) +
scale_fill_manual(values = c("skyblue", "Green", "violet"), labels = c("Boundary Percent", "Economy", "Wickets")) +
geom_text(aes(label = val), vjust = -0.15, size = 5) +
facet_wrap(~features, ncol = 1, labeller = as_labeller(c("boundary_pct" = "Boundary Percent","economy_rate" = "Economy", "wickets" = "Wickets")))
g2
This graph gives the complete idea about the bowler’s performance. UT Yadav has highest Economy rate and highest wickets is taken by SL Malinga and SP Narine gives lowest boundaries.
HIS <- bb %>%
select(id , batsman,batsman_runs )%>%
group_by(id ,batsman) %>%
summarise(iHS = sum(batsman_runs)) %>%
arrange(-iHS)%>% head(10)
ggplot(data = HIS , aes(reorder(id ,-iHS), y = iHS , fill = batsman)) +
geom_bar(stat = "identity") +
geom_text(aes(label = iHS),color="black",fontface = "bold",size =2.5,vjust = 1.8)+
theme(axis.text.x = element_text(angle = 40, hjust = 1))+
labs(x = "Match_ID" , y = "Scores",
title = "Highest indivisual Scores by Batsmen in IPL History")
Highest number of runs in single inning is scored by AB De Villers fallowed by BB McCullum and CH Gayle
library(readxl)
library(tidyverse)
library(ggplot2)
library(lubridate)
library(dplyr)
bb <- read.csv("IPL Ball-by-Ball 2008-2020.csv")
mm<-read.csv("IPL Matches 2008-2020.csv")
bb %>%
left_join(mm[c("id","date")]) -> bb_mm
str(bb_mm)
t<-dmy(bb_mm$date)
y<-year(t)
bb_mm$year=y
bb_mm <- subset(bb_mm, select = -c(date))
bb_mm_2020 <- subset(bb_mm, year == 2020)
bb_copy=bb
bb_mm_copy=bb_mm
# Summarizing data for batsmen
# Batting Points
# Run -> +1
# Boundary_Bonus -> +2
# Six_Bonus -> +3
# thirty_Run_Bonus -> +5
# Half-century_Bonus -> +9
# Century_Bonus -> +18
batsman_score <- function(x) {
x$bat_score <- 1*x$runs + 2*x$fours + 3*x$sixes + 5*x$thirty + 9*x$fifty + 18*x$century
return(x)
}
Here,we define function ‘get_bat_stats’ and taking the no of times a player has scored centuries, fifties, thirties, total sixes and fours and total runs as inputs and attaching some weights to each variables.
We are calculating the batscore using wights as follows:
runs<-1
fours <-2
sixes<- 3
thirty<- 5
fifty<-9
century<-18
Then we are printing the batscore in decreasing order.
## Batsman Analysis
get_bat_stats <- function(df) {
# Initialize batting dataframe
batting_df <- data.frame(player = character(0), runs = integer(0), fours = integer(0),
sixes = integer(0), thirty = integer(0), fifty = integer(0), century = integer(0))
# Group data by batsman and compute sums
batting_df <- df %>%
group_by(batsman) %>%
summarize(runs = sum(batsman_runs),
fours = sum(batsman_runs == 4),
sixes = sum(batsman_runs == 6))
# Group data by match id and compute thirties, fifties, and centuries
df %>%
group_by(batsman) %>%
summarize(score = sum(batsman_runs)) %>%
group_by(batsman) %>%
mutate(thirty = sum(score >= 30),
fifty = sum(score >= 50),
century = sum(score >= 100)) %>%
select(-score) %>%
right_join(batting_df, by = "batsman") -> batting_df
# Compute bat_score using the batsman_score function
batting_df$bat_score <- 1 * batting_df$runs + 2 * batting_df$fours + 3 * batting_df$sixes +
5 * batting_df$thirty + 9 * batting_df$fifty + 18 * batting_df$century
return(batting_df)
}
batsmen_results=get_bat_stats(bb)
batsmen_results$player=batsmen_results$batsman
## Finding the top batsmen based on their batting score.
batsmen_results[order(batsmen_results$bat_score, decreasing = TRUE),]
## # A tibble: 537 x 9
## # Groups: batsman [537]
## batsman thirty fifty century runs fours sixes bat_score player
## <chr> <int> <int> <int> <int> <int> <int> <dbl> <chr>
## 1 V Kohli 1 1 1 5878 504 202 7524 V Kohli
## 2 SK Raina 1 1 1 5368 493 194 6968 SK Raina
## 3 DA Warner 1 1 1 5254 510 195 6891 DA Warner
## 4 RG Sharma 1 1 1 5230 458 214 6820 RG Sharma
## 5 S Dhawan 1 1 1 5197 591 109 6738 S Dhawan
## 6 CH Gayle 1 1 1 4772 384 349 6619 CH Gayle
## 7 AB de Villiers 1 1 1 4849 390 235 6366 AB de Villie~
## 8 RV Uthappa 1 1 1 4607 454 163 6036 RV Uthappa
## 9 MS Dhoni 1 1 1 4632 313 216 5938 MS Dhoni
## 10 G Gambhir 1 1 1 4217 492 59 5410 G Gambhir
## # ... with 527 more rows
Here,we define function ‘bowler_score’ and taking the no of times a player took wickets, lbw_or_bowled, three wickets, four wickets and five wickets and total runs as inputs and attaching some weights to each variables.
We are calculating the bowler score using wights as follows:
wickets<-25
lbw_or_bowled <-8
three wickets<- 4
four_wickets<- 8
five_wickets<-16
Then we are printing the bowlerscore in decreasing order.
## Bowler Analysis
bowler_score <- function(x) {
x[["bowl_score"]] <- 25 * as.numeric(x[["wickets"]]) + 8 * as.numeric(x[["lbw_or_bowled"]]) + 4 * as.numeric(x[["three_wickets"]]) + 8 * as.numeric(x[["four_wickets"]]) + 16 * as.numeric(x[["five_wickets"]])
return(x)
}
get_bowl_stats <- function(df) {
# Initialize bowling data frame.
bowling_df <- data.frame(player = character(0), wickets = numeric(0), lbw_or_bowled = numeric(0), three_wickets = numeric(0), four_wickets = numeric(0), five_wickets = numeric(0))
# Let's use group_by function from dplyr package. Our aim is to find bowling score so we'll group our data by bowler.
for (player in unique(df$bowler)) {
details <- df[df$bowler == player, ]
# It finds the length of data frame where dismissal kind is either lbw or bowled
lbw_or_bowled <- sum(details[["dismissal_kind"]] == "lbw" | details[["dismissal_kind"]] == "bowled")
total_wickets <- 0
three_wickets <- 0
four_wickets <- 0
fifer <- 0
# To find three wicket haul, four wicket haul and five wicket haul we should group data frame (details) again by match id.
for (match in unique(details[["id"]])) {
match_details <- details[details[["id"]] == match, ]
wickets <- sum(match_details[["dismissal_kind"]] %in% c("caught", "bowled", "lbw", "stumped", "caught and bowled"))
if (wickets >= 3) {
three_wickets <- three_wickets + 1
} else if (wickets >= 4) {
four_wickets <- four_wickets + 1
} else if (wickets >= 5) {
fifer <- fifer + 1
}
# Need to add every match wickets into total_wickets to find total wickets of bowler.
total_wickets <- total_wickets + wickets
}
# Store all the data into bowling data frame.
bowling_df <- rbind(bowling_df, data.frame(player = player, wickets = total_wickets, three_wickets = three_wickets, four_wickets = four_wickets, five_wickets = fifer))
}
# To find bowl_score we apply bowler_score function.
#bowling_df <- apply(bowling_df, 1, bowler_score)
bowling_df$bowl_score= 30*bowling_df$wickets+5*bowling_df$three_wickets+9*bowling_df$four_wickets+18*bowling_df$five_wickets
return(bowling_df)
}
bb_2=bb
library(tidyr)
bb_2 %>% drop_na()
bowler_results=(get_bowl_stats(bb_2))
## Finding the top bowlers based on their batting score.
br_2<-bowler_results[order(bowler_results$bowl_score, decreasing = TRUE),][]
head(select(br_2, c('player','bowl_score')))
## player bowl_score
## 101 SL Malinga 5195
## 88 A Mishra 4880
## 16 PP Chawla 4740
## 57 DJ Bravo 4650
## 41 Harbhajan Singh 4525
## 146 R Ashwin 4175
We have not define any function for wicket keeper category. We sorted the players according the number of stumps.
## Wicket Keeper Analysis
get_wicketkeepers <- function(df) {
# Let's find out wicket keepers from this dataset. To achieve this we'll check dismissal kind, if it is stumpout then store such fielders name into list
wicketkeepers <- df[df$dismissal_kind == "stumped", "fielder"]
return(table(wicketkeepers))
}
keeper<-get_wicketkeepers(bb_2)
max(keeper)
keeper_2=data.frame(keeper)
str(keeper_2)
keeper_2$player=keeper_2$wicketkeepers
head(keeper_2)
## wicketkeepers Freq player
## 1 AB de Villiers 8 AB de Villiers
## 2 AC Gilchrist 16 AC Gilchrist
## 3 AP Tare 5 AP Tare
## 4 AP Tare (sub) 1 AP Tare (sub)
## 5 AT Rayudu 2 AT Rayudu
## 6 BB McCullum 6 BB McCullum
We calculated the all-rounder score using the bat score and ball score. The formula is given bellow.
## Allrounder_Score= (bat_score + bowl_score) / 1000
# Merge the batsmen_tournament and bowlers_tournament data frames, using the "player" column as the merge key
batsmen_results %>%
full_join(bowler_results) -> merge_data_2
merge_data_2 %>%
full_join(keeper_2) -> merge_data_3
merge_data_3 <- merge_data_3 %>%
mutate(allrounder_score = (bat_score * bowl_score) / 1000)
# View the first few rows of the merged data frame
head(merge_data_3)
## # A tibble: 6 x 17
## # Groups: batsman [6]
## batsman thirty fifty century runs fours sixes bat_s~1 player wickets three~2
## <chr> <int> <int> <int> <int> <int> <int> <dbl> <chr> <dbl> <dbl>
## 1 A Ashis~ 1 1 1 280 16 15 389 A Ash~ 18 1
## 2 A Chand~ 0 0 0 4 0 0 4 A Cha~ 11 1
## 3 A Chopra 1 1 0 53 7 0 81 A Cho~ NA NA
## 4 A Choud~ 0 0 0 25 1 1 30 A Cho~ 5 0
## 5 A Danan~ 0 0 0 4 0 0 4 A Dan~ 0 0
## 6 A Flint~ 1 1 0 62 5 2 92 A Fli~ 2 0
## # ... with 6 more variables: four_wickets <dbl>, five_wickets <dbl>,
## # bowl_score <dbl>, wicketkeepers <fct>, Freq <int>, allrounder_score <dbl>,
## # and abbreviated variable names 1: bat_score, 2: three_wickets
names(merge_data_3)
## [1] "batsman" "thirty" "fifty" "century"
## [5] "runs" "fours" "sixes" "bat_score"
## [9] "player" "wickets" "three_wickets" "four_wickets"
## [13] "five_wickets" "bowl_score" "wicketkeepers" "Freq"
## [17] "allrounder_score"
merged_score_4<- merge_data_3[order(merge_data_3$allrounder_score, decreasing = TRUE),]
select(merged_score_4, c('player','allrounder_score'))
## Adding missing grouping variables: `batsman`
## # A tibble: 582 x 3
## # Groups: batsman [538]
## batsman player allrounder_score
## <chr> <chr> <dbl>
## 1 SR Watson SR Watson 14638.
## 2 RA Jadeja RA Jadeja 9497.
## 3 DJ Bravo DJ Bravo 9007.
## 4 KA Pollard KA Pollard 7334.
## 5 JH Kallis JH Kallis 6078.
## 6 YK Pathan YK Pathan 5402.
## 7 SK Raina SK Raina 5226
## 8 Harbhajan Singh Harbhajan Singh 5181.
## 9 SP Narine SP Narine 4977.
## 10 Yuvraj Singh Yuvraj Singh 4029.
## # ... with 572 more rows
# Set the number of players for each role
total_batsmen <- 4
total_bowlers <- 3
total_allrounders <- 3
total_wicketkeepers <- 1
# Initialize the playing_XI_tournament list
playing_XI_tournament <- list()
# Sort the data by bat_score and select the top total_batsmen rows
playing_XI_tournament[['BAT']] <- merge_data_3[order(merge_data_3$bat_score, decreasing = TRUE), ][1:total_batsmen, ]$player
# Select the rows where player is in wicketkeeper_tournament, sort by bat_score, and select the top total_wicketkeepers rows
playing_XI_tournament[['WK']] <- merge_data_3[order(merge_data_3$Freq, decreasing = TRUE), ][1:total_wicketkeepers, ]$player
# Sort the data by allrounder_score and select the top total_allrounders rows
playing_XI_tournament[['AR']] <- merge_data_3[order(merge_data_3$allrounder_score, decreasing = TRUE), ][1:total_allrounders, ]$player
# Sort the data by bowl_score and select the top total_bowlers rows
playing_XI_tournament[['BWL']] <- merge_data_3[order(merge_data_3$bowl_score, decreasing = TRUE), ][1:total_bowlers, ]$player
# Print the header message
cat("Fantasy dream team for the whole tournament\n")
## Fantasy dream team for the whole tournament
# Iterate over the elements in the playing_XI_tournament list
for (player_role in names(playing_XI_tournament)) {
# Extract the players vector for the current role
players <- playing_XI_tournament[[player_role]]
# Iterate over the players in the current role
for (player in players) {
# Print the player name and role
cat(sprintf("%-15s %s\n", player, player_role))
}
}
## V Kohli BAT
## SK Raina BAT
## DA Warner BAT
## RG Sharma BAT
## MS Dhoni WK
## SR Watson AR
## RA Jadeja AR
## DJ Bravo AR
## SL Malinga BWL
## A Mishra BWL
## PP Chawla BWL
This is our final team. We choose M.S.Dhnoi as the team captian. The team consists of 4 batsmen, 1 wicket keeper, 3 all-rounders and 3 bowlers.