Introduction

The Indian Premier League (IPL) is a men’s T20 franchise cricket league of India. It is annually contested by ten teams based out of seven Indian cities and three Indian states. The league was founded by the Board of Control for Cricket in India (BCCI) in 2007. Fantasy cricket is gaining popularity day by day. Here, a user can choose real life players based on their various performance.

Business Objective​

To create the ideal Super-11 team by analysing the IPL’s ball-by-ball and match-by-match datasets. The team should have strong openers, middle-order hitters, all-rounders, a reliable wicketkeeper, and a collection of quick and spin bowlers. Players who have played or bowled 300 balls should be considered. ​

Squad Name – Ball Burners​

Squad’s Captain- MS Dhoni

Data Understanding​

Given Dataset: IPL Complete Dataset (2008-2020).

This dataset consists of two separate CSV files.

IPL Matches 2008-2020:​

Information of each match summary

No of features- 17 such as inning, over, ball,batsman, etc

Data points- 816

IPL Ball-by-Ball 2008-2020:

Ball by ball details and deliveries​

No of features- 18​ such as inning, over, ball,batsman, etc

Data points- 193468

library(tidyverse)
library(forcats)
library(tidyr)
library(lubridate) 
library(ggplot2) 
library(dplyr)
mm <- read.csv("IPL Matches 2008-2020.csv")
bb <- read.csv("IPL Ball-by-Ball 2008-2020.csv")
str(mm)
## 'data.frame':    816 obs. of  17 variables:
##  $ id             : int  335982 335983 335984 335985 335986 335987 335988 335989 335990 335991 ...
##  $ city           : chr  "Bangalore" "Chandigarh" "Delhi" "Mumbai" ...
##  $ date           : chr  "2008-04-18" "2008-04-19" "2008-04-19" "2008-04-20" ...
##  $ player_of_match: chr  "BB McCullum" "MEK Hussey" "MF Maharoof" "MV Boucher" ...
##  $ venue          : chr  "M Chinnaswamy Stadium" "Punjab Cricket Association Stadium, Mohali" "Feroz Shah Kotla" "Wankhede Stadium" ...
##  $ neutral_venue  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ team1          : chr  "Royal Challengers Bangalore" "Kings XI Punjab" "Delhi Daredevils" "Mumbai Indians" ...
##  $ team2          : chr  "Kolkata Knight Riders" "Chennai Super Kings" "Rajasthan Royals" "Royal Challengers Bangalore" ...
##  $ toss_winner    : chr  "Royal Challengers Bangalore" "Chennai Super Kings" "Rajasthan Royals" "Mumbai Indians" ...
##  $ toss_decision  : chr  "field" "bat" "bat" "bat" ...
##  $ winner         : chr  "Kolkata Knight Riders" "Chennai Super Kings" "Delhi Daredevils" "Royal Challengers Bangalore" ...
##  $ result         : chr  "runs" "runs" "wickets" "wickets" ...
##  $ result_margin  : int  140 33 9 5 5 6 9 6 3 66 ...
##  $ eliminator     : chr  "N" "N" "N" "N" ...
##  $ method         : chr  NA NA NA NA ...
##  $ umpire1        : chr  "Asad Rauf" "MR Benson" "Aleem Dar" "SJ Davis" ...
##  $ umpire2        : chr  "RE Koertzen" "SL Shastri" "GA Pratapkumar" "DJ Harper" ...
colnames(mm)
##  [1] "id"              "city"            "date"            "player_of_match"
##  [5] "venue"           "neutral_venue"   "team1"           "team2"          
##  [9] "toss_winner"     "toss_decision"   "winner"          "result"         
## [13] "result_margin"   "eliminator"      "method"          "umpire1"        
## [17] "umpire2"

Exploratory Data Analysis

library(tidyverse)
library(forcats)
library(tidyr)
library(lubridate) 
library(ggplot2) 
library(dplyr)


plt <- ggplot(data = mm) +
  geom_bar(mapping = aes(y = fct_infreq(winner), fill = winner))

plot_data <- layer_data(plt)
plt + annotate(geom = "text", label = plot_data$count, x = plot_data$x +5, y = plot_data$y ) +
  scale_fill_manual(values = c("yellow","lightblue","#00008B","darkred","orange","red","orange","purple","blue","cyan","#D11D9B","magenta","#D1AB3E","#FF822A","grey")) +
  labs(title = "Number of matches won by each team from 2008-2020", x = "Number of wins", y = "Team") +
  guides(fill=guide_legend(title="Team"))

Here, we can observed that Mumbai Indians win most number of matches and Rising Pune superstars won least number of matches

Top 10 Batsman by Run Scored

mm %>% 
  count(player_of_match,sort = TRUE,name = "Number of MOTMs") %>% 
  top_n(15)

AS de villers won the man of the match tittle the most number of times fallowed by CH Gayle and RG Sharma

Top 10 Batsman by Run Scored

runscorers <- bb %>%
  group_by(batsman) %>%
  summarise(runs = sum(batsman_runs), balls = n(), innings = sum(is_wicket == 1),
            fours = sum(batsman_runs == 4), sixes = sum(batsman_runs == 6))%>%
  mutate(boundary_pct = round(((4*fours + 6*sixes)/runs)*100,2),
         strike_rate = round((runs/balls)*100,2), avg = round(runs/innings, 2)) %>%
  subset(select = -c(balls, fours, sixes, innings)) %>%
  arrange(desc(runs)) %>%
  filter(runs > 1000)

summary(runscorers)

top_10_runscorers <- runscorers%>%
  top_n(10, runs) %>%
  subset(select = -c(runs)) %>%
  gather(key = "stats", value = "val", 2:4) %>%
  ggplot(aes(x = batsman, y = val, fill = stats)) +
  geom_bar(stat = "identity") +
  labs(title = "Top 10 Run Scorers", x = "Name", y = "Value", fill = "Stats") +
  theme(plot.title = element_text(hjust = 0.5, size = 20), panel.background = element_rect(fill = "white"),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.title = element_text(size = 20), axis.text = element_text(size = 15),
        legend.title = element_text(size = 15), legend.text = element_text(size = 15),
        strip.text = element_text(size = 20)) +
  facet_wrap(~stats, ncol = 1, 
             labeller = as_labeller(c("avg" = "Average", "boundary_pct" = "Boundary Percent", "strike_rate" = "Strike Rate"))) +
  geom_text(aes(label = val), vjust = 2, size = 5) +
  scale_fill_hue(h = c(180, 300), labels = c("Average", "Boundary Percent", "Strike Rate"))

top_10_runscorers

This graph gives the complete idea about the batman’s performance. Ab de villers has highest strike rate and highest average run scored whereas CH gaye hits most number of boundaries.

Top 10 Blowers by number of wickets taken

top_10_wickettakers <- bb %>%
  group_by(bowler) %>%
  summarise(wickets = sum(is_wicket), runs = sum(batsman_runs), 
            boundaries = sum(batsman_runs == 4|batsman_runs == 6), balls = n()) %>%
  mutate(economy_rate = round(runs/(balls/6),2), boundary_pct = round((boundaries/balls)*100,2)) %>%
  subset(select = -c(runs, balls, boundaries)) %>%
  filter(wickets > 50) %>%
  arrange(desc(wickets)) %>%
  top_n(10, wickets) 

summary(top_10_wickettakers)

g2 <- gather(top_10_wickettakers, key = "features", value = "val", 2:4) %>%
  ggplot(aes(x = bowler, y = val, fill = features)) +
  geom_bar(stat = "identity") +
  labs(title = "Top 10 Wicket Takers", x = "Bowler", y = "Values", fill = "Stats") +
  theme(plot.title = element_text(hjust = 0.5, size = 20), panel.grid = element_blank(),
        panel.background = element_rect(fill = "white"), axis.title = element_text(size = 15),
        axis.text = element_text(size = 13), legend.title = element_text(size = 15),
        legend.text = element_text(size = 15), strip.text = element_text(size = 20)) +
  scale_fill_manual(values = c("skyblue", "Green", "violet"), labels = c("Boundary Percent", "Economy", "Wickets")) +
  geom_text(aes(label = val), vjust = -0.15, size = 5) +
  facet_wrap(~features, ncol = 1, labeller = as_labeller(c("boundary_pct" = "Boundary Percent","economy_rate" = "Economy", "wickets" = "Wickets")))

g2

This graph gives the complete idea about the bowler’s performance. UT Yadav has highest Economy rate and highest wickets is taken by SL Malinga and SP Narine gives lowest boundaries.

Highest Individual scorers

HIS <- bb %>% 
  select(id , batsman,batsman_runs )%>%
  group_by(id ,batsman) %>% 
  summarise(iHS = sum(batsman_runs)) %>% 
  arrange(-iHS)%>% head(10)


ggplot(data = HIS , aes(reorder(id ,-iHS), y = iHS , fill = batsman)) + 
  geom_bar(stat = "identity") + 
  geom_text(aes(label = iHS),color="black",fontface = "bold",size =2.5,vjust = 1.8)+
  theme(axis.text.x = element_text(angle = 40, hjust = 1))+
  labs(x = "Match_ID" , y = "Scores", 
       title = "Highest indivisual Scores by Batsmen in IPL History")

Highest number of runs in single inning is scored by AB De Villers fallowed by BB McCullum and CH Gayle

Super-11 Team Selections

Methology

  1. Team should consists of mix of all rounders, batsman, bowlers and wicket keepers so, we have taken subjective judgement based on secondary research about the IPL teams wining history and came to the conclusion that team should have 4 batsman, 1 wicket keeper, 2 All rounders and 3 Bowlers.
  2. Top Four batsman are selected by applying weightage to different attributes such as Runs, Boundary Bonus, Six Bonus, etc.
  3. Top Three Blowers are selected by applying weightage to different attributes such as wickets and wickets Bonuses.
  4. Top two all rounder are selected by observing resulted score form batting and bowling performance.
  5. Wicket Keeper is selected on the base of his performance history.
  6. Team Captain is also choosen based on the performance history and subjective judgement.

Loading and merging two datasets:

library(readxl)
library(tidyverse)
library(ggplot2)
library(lubridate)
library(dplyr)

bb <- read.csv("IPL Ball-by-Ball 2008-2020.csv")
mm<-read.csv("IPL Matches 2008-2020.csv")

bb %>% 
  left_join(mm[c("id","date")]) -> bb_mm

We are joining by “id” column which is present in both the dataset.

str(bb_mm)

t<-dmy(bb_mm$date)
y<-year(t)

bb_mm$year=y

unique(bb_mm$year)

bb_mm <- subset(bb_mm, select = -c(date))

bb_mm_2020 <- subset(bb_mm, year == 2020)

bb_copy=bb

bb_mm_copy=bb_mm



# Summarizing data for batsmen
# Batting Points
# Run -> +1
# Boundary_Bonus -> +2
# Six_Bonus -> +3
# thirty_Run_Bonus -> +5
# Half-century_Bonus -> +9
# Century_Bonus -> +18

batsman_score <- function(x) {
  x$bat_score <- 1*x$runs + 2*x$fours + 3*x$sixes + 5*x$thirty + 9*x$fifty + 18*x$century
  return(x)
}

Batsman Analysis

Here,we define function ‘get_bat_stats’ and taking the no of times a player has scored centuries, fifties, thirties, total sixes and fours and total runs as inputs and attaching some weights to each variables.

We are calculating the batscore using wights as follows:
runs<-1
fours <-2
sixes<- 3
thirty<- 5
fifty<-9
century<-18

Then we are printing the batscore in decreasing order.

## Batsman Analysis
 get_bat_stats <- function(df) {
   # Initialize batting dataframe
   batting_df <- data.frame(player = character(0), runs = integer(0), fours = integer(0),
                            sixes = integer(0), thirty = integer(0), fifty = integer(0), century = integer(0))
   
   # Group data by batsman and compute sums
   batting_df <- df %>%
     group_by(batsman) %>%
     summarize(runs = sum(batsman_runs),
               fours = sum(batsman_runs == 4),
               sixes = sum(batsman_runs == 6))
   
   # Group data by match id and compute thirties, fifties, and centuries
   df %>%
     group_by(batsman) %>%
     summarize(score = sum(batsman_runs)) %>%
     group_by(batsman) %>%
     mutate(thirty = sum(score >= 30),
            fifty = sum(score >= 50),
            century = sum(score >= 100)) %>%
     select(-score) %>%
     right_join(batting_df, by = "batsman") -> batting_df
   
   # Compute bat_score using the batsman_score function
   batting_df$bat_score <- 1 * batting_df$runs + 2 * batting_df$fours + 3 * batting_df$sixes +
     5 * batting_df$thirty + 9 * batting_df$fifty + 18 * batting_df$century
   
   return(batting_df)
 }
 
 batsmen_results=get_bat_stats(bb)
 batsmen_results$player=batsmen_results$batsman
 
 ## Finding the top batsmen based on their batting score.
 batsmen_results[order(batsmen_results$bat_score, decreasing = TRUE),] 
## # A tibble: 537 x 9
## # Groups:   batsman [537]
##    batsman        thirty fifty century  runs fours sixes bat_score player       
##    <chr>           <int> <int>   <int> <int> <int> <int>     <dbl> <chr>        
##  1 V Kohli             1     1       1  5878   504   202      7524 V Kohli      
##  2 SK Raina            1     1       1  5368   493   194      6968 SK Raina     
##  3 DA Warner           1     1       1  5254   510   195      6891 DA Warner    
##  4 RG Sharma           1     1       1  5230   458   214      6820 RG Sharma    
##  5 S Dhawan            1     1       1  5197   591   109      6738 S Dhawan     
##  6 CH Gayle            1     1       1  4772   384   349      6619 CH Gayle     
##  7 AB de Villiers      1     1       1  4849   390   235      6366 AB de Villie~
##  8 RV Uthappa          1     1       1  4607   454   163      6036 RV Uthappa   
##  9 MS Dhoni            1     1       1  4632   313   216      5938 MS Dhoni     
## 10 G Gambhir           1     1       1  4217   492    59      5410 G Gambhir    
## # ... with 527 more rows

Bowler Analysis

Here,we define function ‘bowler_score’ and taking the no of times a player took wickets, lbw_or_bowled, three wickets, four wickets and five wickets and total runs as inputs and attaching some weights to each variables.

We are calculating the bowler score using wights as follows:
wickets<-25
lbw_or_bowled <-8
three wickets<- 4
four_wickets<- 8
five_wickets<-16

Then we are printing the bowlerscore in decreasing order.

## Bowler Analysis
 bowler_score <- function(x) {
   x[["bowl_score"]] <- 25 * as.numeric(x[["wickets"]]) + 8 * as.numeric(x[["lbw_or_bowled"]]) + 4 * as.numeric(x[["three_wickets"]]) + 8 * as.numeric(x[["four_wickets"]]) + 16 * as.numeric(x[["five_wickets"]])
   return(x)
 }
 
 get_bowl_stats <- function(df) {
   # Initialize bowling data frame.
   bowling_df <- data.frame(player = character(0), wickets = numeric(0), lbw_or_bowled = numeric(0), three_wickets = numeric(0), four_wickets = numeric(0), five_wickets = numeric(0))
   
   # Let's use group_by function from dplyr package. Our aim is to find bowling score so we'll group our data by bowler.
   for (player in unique(df$bowler)) {
     details <- df[df$bowler == player, ]
     
     # It finds the length of data frame where dismissal kind is either lbw or bowled
     lbw_or_bowled <- sum(details[["dismissal_kind"]] == "lbw" | details[["dismissal_kind"]] == "bowled")
     
     total_wickets <- 0
     three_wickets <- 0
     four_wickets <- 0
     fifer <- 0
     
     # To find three wicket haul, four wicket haul and five wicket haul we should group data frame (details) again by match id.
     for (match in unique(details[["id"]])) {
       match_details <- details[details[["id"]] == match, ]
       wickets <- sum(match_details[["dismissal_kind"]] %in% c("caught", "bowled", "lbw", "stumped", "caught and bowled"))
       if (wickets >= 3) {
         three_wickets <- three_wickets + 1
       } else if (wickets >= 4) {
         four_wickets <- four_wickets + 1
       } else if (wickets >= 5) {
         fifer <- fifer + 1
       }
       # Need to add every match wickets into total_wickets to find total wickets of bowler.
       total_wickets <- total_wickets + wickets
     }
     
     # Store all the data into bowling data frame.
     bowling_df <- rbind(bowling_df, data.frame(player = player, wickets = total_wickets,  three_wickets = three_wickets, four_wickets = four_wickets, five_wickets = fifer))
   }
   
   # To find bowl_score we apply bowler_score function.
   #bowling_df <- apply(bowling_df, 1, bowler_score)
   
   bowling_df$bowl_score= 30*bowling_df$wickets+5*bowling_df$three_wickets+9*bowling_df$four_wickets+18*bowling_df$five_wickets
     

   
   return(bowling_df)
 }
 
 bb_2=bb
 
 library(tidyr)
 bb_2 %>% drop_na()
 bowler_results=(get_bowl_stats(bb_2))
 ## Finding the top bowlers based on their batting score.
 br_2<-bowler_results[order(bowler_results$bowl_score, decreasing = TRUE),][]
 
 head(select(br_2, c('player','bowl_score')))
##              player bowl_score
## 101      SL Malinga       5195
## 88         A Mishra       4880
## 16        PP Chawla       4740
## 57         DJ Bravo       4650
## 41  Harbhajan Singh       4525
## 146        R Ashwin       4175

Wicket Keeper Analysis

We have not define any function for wicket keeper category. We sorted the players according the number of stumps.

## Wicket Keeper Analysis
 
 get_wicketkeepers <- function(df) {
   # Let's find out wicket keepers from this dataset. To achieve this we'll check dismissal kind, if it is stumpout then store such fielders name into list
   wicketkeepers <- df[df$dismissal_kind == "stumped", "fielder"]
   return(table(wicketkeepers))
 }
 
 keeper<-get_wicketkeepers(bb_2)
 max(keeper)
 keeper_2=data.frame(keeper)
 
str(keeper_2)
keeper_2$player=keeper_2$wicketkeepers

head(keeper_2)
##    wicketkeepers Freq         player
## 1 AB de Villiers    8 AB de Villiers
## 2   AC Gilchrist   16   AC Gilchrist
## 3        AP Tare    5        AP Tare
## 4  AP Tare (sub)    1  AP Tare (sub)
## 5      AT Rayudu    2      AT Rayudu
## 6    BB McCullum    6    BB McCullum

Allrounder_Score_Calculation

We calculated the all-rounder score using the bat score and ball score. The formula is given bellow.

## Allrounder_Score= (bat_score + bowl_score) / 1000

 # Merge the batsmen_tournament and bowlers_tournament data frames, using the "player" column as the merge key
 
 batsmen_results %>% 
   full_join(bowler_results) -> merge_data_2
 merge_data_2 %>% 
   full_join(keeper_2) -> merge_data_3
 merge_data_3 <- merge_data_3 %>%
   mutate(allrounder_score = (bat_score * bowl_score) / 1000)
 
 # View the first few rows of the merged data frame
 head(merge_data_3)
## # A tibble: 6 x 17
## # Groups:   batsman [6]
##   batsman  thirty fifty century  runs fours sixes bat_s~1 player wickets three~2
##   <chr>     <int> <int>   <int> <int> <int> <int>   <dbl> <chr>    <dbl>   <dbl>
## 1 A Ashis~      1     1       1   280    16    15     389 A Ash~      18       1
## 2 A Chand~      0     0       0     4     0     0       4 A Cha~      11       1
## 3 A Chopra      1     1       0    53     7     0      81 A Cho~      NA      NA
## 4 A Choud~      0     0       0    25     1     1      30 A Cho~       5       0
## 5 A Danan~      0     0       0     4     0     0       4 A Dan~       0       0
## 6 A Flint~      1     1       0    62     5     2      92 A Fli~       2       0
## # ... with 6 more variables: four_wickets <dbl>, five_wickets <dbl>,
## #   bowl_score <dbl>, wicketkeepers <fct>, Freq <int>, allrounder_score <dbl>,
## #   and abbreviated variable names 1: bat_score, 2: three_wickets
 names(merge_data_3)
##  [1] "batsman"          "thirty"           "fifty"            "century"         
##  [5] "runs"             "fours"            "sixes"            "bat_score"       
##  [9] "player"           "wickets"          "three_wickets"    "four_wickets"    
## [13] "five_wickets"     "bowl_score"       "wicketkeepers"    "Freq"            
## [17] "allrounder_score"
 merged_score_4<- merge_data_3[order(merge_data_3$allrounder_score, decreasing = TRUE),] 
 select(merged_score_4, c('player','allrounder_score'))
## Adding missing grouping variables: `batsman`
## # A tibble: 582 x 3
## # Groups:   batsman [538]
##    batsman         player          allrounder_score
##    <chr>           <chr>                      <dbl>
##  1 SR Watson       SR Watson                 14638.
##  2 RA Jadeja       RA Jadeja                  9497.
##  3 DJ Bravo        DJ Bravo                   9007.
##  4 KA Pollard      KA Pollard                 7334.
##  5 JH Kallis       JH Kallis                  6078.
##  6 YK Pathan       YK Pathan                  5402.
##  7 SK Raina        SK Raina                   5226 
##  8 Harbhajan Singh Harbhajan Singh            5181.
##  9 SP Narine       SP Narine                  4977.
## 10 Yuvraj Singh    Yuvraj Singh               4029.
## # ... with 572 more rows

Setting the number of players for each role

 # Set the number of players for each role
 total_batsmen <- 4
 total_bowlers <- 3
 total_allrounders <- 3
 total_wicketkeepers <- 1

Final Team

# Initialize the playing_XI_tournament list
 playing_XI_tournament <- list()
 
 # Sort the data by bat_score and select the top total_batsmen rows
 playing_XI_tournament[['BAT']] <- merge_data_3[order(merge_data_3$bat_score, decreasing = TRUE), ][1:total_batsmen, ]$player
 
 # Select the rows where player is in wicketkeeper_tournament, sort by bat_score, and select the top total_wicketkeepers rows
 playing_XI_tournament[['WK']] <- merge_data_3[order(merge_data_3$Freq, decreasing = TRUE), ][1:total_wicketkeepers, ]$player
 
 # Sort the data by allrounder_score and select the top total_allrounders rows
 playing_XI_tournament[['AR']] <- merge_data_3[order(merge_data_3$allrounder_score, decreasing = TRUE), ][1:total_allrounders, ]$player
 
 # Sort the data by bowl_score and select the top total_bowlers rows
 playing_XI_tournament[['BWL']] <- merge_data_3[order(merge_data_3$bowl_score, decreasing = TRUE), ][1:total_bowlers, ]$player
 # Print the header message
 cat("Fantasy dream team for the whole tournament\n")
## Fantasy dream team for the whole tournament
 # Iterate over the elements in the playing_XI_tournament list
 for (player_role in names(playing_XI_tournament)) {
   # Extract the players vector for the current role
   players <- playing_XI_tournament[[player_role]]
   
   # Iterate over the players in the current role
   for (player in players) {
     # Print the player name and role
     cat(sprintf("%-15s %s\n", player, player_role))
   }
 }
## V Kohli         BAT
## SK Raina        BAT
## DA Warner       BAT
## RG Sharma       BAT
## MS Dhoni        WK
## SR Watson       AR
## RA Jadeja       AR
## DJ Bravo        AR
## SL Malinga      BWL
## A Mishra        BWL
## PP Chawla       BWL

Conclusion

This is our final team. We choose M.S.Dhnoi as the team captian. The team consists of 4 batsmen, 1 wicket keeper, 3 all-rounders and 3 bowlers.

  1. V Kohli (Batsman)
  2. SK Raina (Batsman)
  3. DA Warner (Batsman)
  4. RG Sharma (Batsman)
  5. MS Dhoni (Wicket Keeper)
  6. SR Watson (All Rounder)
  7. RA Jadeja (All Rounder)
  8. DJ Bravo (All Rounder)
  9. SL Malinga (Bowler)
  10. A Mishra (Bowler)
  11. PP Chawla (Bowler)