Cricket Data Analysis

If you have ever wondered about cricket, watch this introductory video that nicely sums up the game. Cricket has three main formats: the Test match (up to \(5\) days long, played in sessions throughout each day); the \(50\)-over match (\(6\) balls \(=1\) over, i.e. the bowler bowls \(6\) times to complete an over), which typically takes about \(7\) hours to finish; and the shortest and currently most popular format, the \(20\)-over match (commonly known as T20 or \(20\)-\(20\) cricket), which takes about \(3\) hours to complete. In this project, I explore a dataset of international T20 matches played since the format’s international inception (the earliest match in the data is from February 2005). I’ve attached another short video explaining T20 cricket. The dataset is not quite up to date: the most recent match in it is from November 2019.

Load packages

library(tidyverse)

## ── Attaching packages ─────────────────────────────────── tidyverse 1.3.0 ──

## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Load data

dat <- read.csv("T20_Cricket.csv")
dim(dat)

## [1] 33378    28

The dataset contains \(33378\) rows and \(28\) columns.

head(dat)

##   Innings.Player Innings.Runs.Scored Innings.Runs.Scored.Num
## 1       AD Hales                116*                     116
## 2      LJ Wright                 99*                      99
## 3       AD Hales                  99                      99
## 4       AD Hales                  94                      94
## 5        JE Root                 90*                      90
## 6    SW Billings                  87                      87
##   Innings.Minutes.Batted Innings.Batted.Flag Innings.Not.Out.Flag
## 1                     97                   1                    1
## 2                     83                   1                    1
## 3                     84                   1                    0
## 4                     80                   1                    0
## 5                     77                   1                    1
## 6                      -                   1                    0
##   Innings.Balls.Faced Innings.Boundary.Fours Innings.Boundary.Sixes
## 1                  64                     11                      6
## 2                  55                      8                      6
## 3                  68                      6                      4
## 4                  61                     11                      2
## 5                  49                     13                      1
## 6                  47                     10                      3
##   Innings.Batting.Strike.Rate Innings.Number    Opposition            Ground
## 1                      181.25              2   v Sri Lanka        Chattogram
## 2                         180              1 v Afghanistan     Colombo (RPS)
## 3                      145.58              2 v West Indies        Nottingham
## 4                      154.09              1   v Australia Chester-le-Street
## 5                      183.67              2   v Australia       Southampton
## 6                       185.1              1 v West Indies        Basseterre
##   Innings.Date Country X50.s X100.s Innings.Runs.Scored.Buckets
## 1   2014-03-27 England     0      1                     100-149
## 2   2012-09-21 England     1      0                       50-99
## 3   2012-06-24 England     1      0                       50-99
## 4   2013-08-31 England     1      0                       50-99
## 5   2013-08-29 England     1      0                       50-99
## 6   2019-03-08 England     1      0                       50-99
##   Innings.Overs.Bowled Innings.Bowled.Flag Innings.Maidens.Bowled
## 1                                       NA                       
## 2                                       NA                       
## 3                                       NA                       
## 4                                       NA                       
## 5                                       NA                       
## 6                                       NA                       
##   Innings.Runs.Conceded Innings.Wickets.Taken X4.Wickets X5.Wickets X10.Wickets
## 1                                                     NA         NA          NA
## 2                                                     NA         NA          NA
## 3                                                     NA         NA          NA
## 4                                                     NA         NA          NA
## 5                                                     NA         NA          NA
## 6                                                     NA         NA          NA
##   Innings.Wickets.Taken.Buckets Innings.Economy.Rate
## 1                                                   
## 2                                                   
## 3                                                   
## 4                                                   
## 5                                                   
## 6

Let’s look at the column headings. I will explain what each column means when I create the necessary subsets of this dataset.

colnames(dat)

##  [1] "Innings.Player"                "Innings.Runs.Scored"          
##  [3] "Innings.Runs.Scored.Num"       "Innings.Minutes.Batted"       
##  [5] "Innings.Batted.Flag"           "Innings.Not.Out.Flag"         
##  [7] "Innings.Balls.Faced"           "Innings.Boundary.Fours"       
##  [9] "Innings.Boundary.Sixes"        "Innings.Batting.Strike.Rate"  
## [11] "Innings.Number"                "Opposition"                   
## [13] "Ground"                        "Innings.Date"                 
## [15] "Country"                       "X50.s"                        
## [17] "X100.s"                        "Innings.Runs.Scored.Buckets"  
## [19] "Innings.Overs.Bowled"          "Innings.Bowled.Flag"          
## [21] "Innings.Maidens.Bowled"        "Innings.Runs.Conceded"        
## [23] "Innings.Wickets.Taken"         "X4.Wickets"                   
## [25] "X5.Wickets"                    "X10.Wickets"                  
## [27] "Innings.Wickets.Taken.Buckets" "Innings.Economy.Rate"

str(dat)

## 'data.frame':    33378 obs. of  28 variables:
##  $ Innings.Player               : Factor w/ 1064 levels "A Balbirnie",..: 37 532 37 37 420 957 275 420 37 275 ...
##  $ Innings.Runs.Scored          : Factor w/ 231 levels "","0","0*","1",..: 24 227 226 218 212 203 200 195 190 190 ...
##  $ Innings.Runs.Scored.Num      : Factor w/ 124 levels "","-","0","1",..: 18 124 124 120 116 112 110 108 105 105 ...
##  $ Innings.Minutes.Batted       : Factor w/ 103 levels "","-","0","1",..: 101 86 87 83 79 2 54 72 50 77 ...
##  $ Innings.Batted.Flag          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Innings.Not.Out.Flag         : int  1 1 0 0 1 0 1 0 1 1 ...
##  $ Innings.Balls.Faced          : Factor w/ 75 levels "","-","0","1",..: 64 54 68 61 47 45 43 42 40 44 ...
##  $ Innings.Boundary.Fours       : Factor w/ 18 levels "","-","0","1",..: 6 17 15 6 8 5 16 15 18 13 ...
##  $ Innings.Boundary.Sixes       : Factor w/ 18 levels "","-","0","1",..: 15 15 13 11 4 12 14 13 13 15 ...
##  $ Innings.Batting.Strike.Rate  : Factor w/ 1104 levels "","-","0","10",..: 577 572 345 409 591 599 618 617 625 538 ...
##  $ Innings.Number               : Factor w/ 3 levels "-","1","2": 3 2 3 2 3 2 2 3 3 2 ...
##  $ Opposition                   : Factor w/ 33 levels "v Afghanistan",..: 27 1 31 2 2 31 26 26 19 19 ...
##  $ Ground                       : Factor w/ 107 levels "Aberdeen","Abu Dhabi",..: 23 28 81 25 96 8 54 74 106 48 ...
##  $ Innings.Date                 : Factor w/ 635 levels "2005-02-17","2005-06-13",..: 293 198 179 249 248 571 85 406 226 499 ...
##  $ Country                      : Factor w/ 17 levels "Afghanistan",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ X50.s                        : int  0 1 1 1 1 1 1 1 1 1 ...
##  $ X100.s                       : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Innings.Runs.Scored.Buckets  : Factor w/ 6 levels "","-","0-49",..: 4 6 6 6 6 6 6 6 6 6 ...
##  $ Innings.Overs.Bowled         : Factor w/ 28 levels "","0.1","0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Innings.Bowled.Flag          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Innings.Maidens.Bowled       : Factor w/ 5 levels "","-","0","1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Innings.Runs.Conceded        : Factor w/ 69 levels "","-","0","1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Innings.Wickets.Taken        : Factor w/ 9 levels "","-","0","1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ X4.Wickets                   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ X5.Wickets                   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ X10.Wickets                  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Innings.Wickets.Taken.Buckets: Factor w/ 4 levels "","-","0-4","5+": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Innings.Economy.Rate         : Factor w/ 318 levels "","-","0","0.85",..: 1 1 1 1 1 1 1 1 1 1 ...
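Many columns in the str() output are factors whose levels include "" and "-" alongside the numbers. One option, sketched below, is to declare those strings as missing when the file is read, so that R parses the columns as numbers straight away. The sketch uses a small made-up CSV passed through the text argument rather than T20_Cricket.csv, the player "Some Bowler" is hypothetical, and the assumption that "" and "-" are the only placeholders in the numeric columns is mine.

```r
# Toy CSV mimicking three columns of the real file (values are illustrative)
csv_text <- "Innings.Player,Innings.Balls.Faced,Innings.Minutes.Batted
AD Hales,64,97
SW Billings,47,-
Some Bowler,,
"

toy <- read.csv(
  text = csv_text,
  na.strings = c("", "-"),    # treat blanks and dashes as NA
  stringsAsFactors = FALSE    # keep names as character, not factor
)

str(toy)
```

With the real file the same two arguments would be added to the read.csv("T20_Cricket.csv") call; columns such as Innings.Runs.Scored, which hold values like 116*, would still come through as text.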

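Some variables need to be converted to numeric. They have to go through as.character() first, because as.numeric() applied directly to a factor returns the internal level codes rather than the printed values. The placeholder entries ("" and "-") become NA, hence the coercion warnings below.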

dat$Innings.Runs.Scored.Num <- as.numeric(as.character(dat$Innings.Runs.Scored.Num))

## Warning: NAs introduced by coercion

dat$Innings.Minutes.Batted <- as.numeric(as.character(dat$Innings.Minutes.Batted))

## Warning: NAs introduced by coercion

dat$Innings.Balls.Faced <- as.numeric(as.character(dat$Innings.Balls.Faced))

## Warning: NAs introduced by coercion

dat$Innings.Boundary.Fours <- as.numeric(as.character(dat$Innings.Boundary.Fours))

## Warning: NAs introduced by coercion

dat$Innings.Boundary.Sixes <- as.numeric(as.character(dat$Innings.Boundary.Sixes))

## Warning: NAs introduced by coercion

dat$Innings.Batting.Strike.Rate <- as.numeric(as.character(dat$Innings.Batting.Strike.Rate))

## Warning: NAs introduced by coercion

dat$Innings.Overs.Bowled <- as.numeric(as.character(dat$Innings.Overs.Bowled))

## Warning: NAs introduced by coercion

dat$Innings.Maidens.Bowled <- as.numeric(as.character(dat$Innings.Maidens.Bowled))

## Warning: NAs introduced by coercion

dat$Innings.Runs.Conceded <- as.numeric(as.character(dat$Innings.Runs.Conceded))

## Warning: NAs introduced by coercion

dat$Innings.Wickets.Taken <- as.numeric(as.character(dat$Innings.Wickets.Taken))

## Warning: NAs introduced by coercion

dat$Innings.Economy.Rate <- as.numeric(as.character(dat$Innings.Economy.Rate))

## Warning: NAs introduced by coercion

dat$Innings.Date <- as.Date(as.character(dat$Innings.Date))

str(dat)

## 'data.frame':    33378 obs. of  28 variables:
##  $ Innings.Player               : Factor w/ 1064 levels "A Balbirnie",..: 37 532 37 37 420 957 275 420 37 275 ...
##  $ Innings.Runs.Scored          : Factor w/ 231 levels "","0","0*","1",..: 24 227 226 218 212 203 200 195 190 190 ...
##  $ Innings.Runs.Scored.Num      : num  116 99 99 94 90 87 85 83 80 80 ...
##  $ Innings.Minutes.Batted       : num  97 83 84 80 77 NA 54 70 50 75 ...
##  $ Innings.Batted.Flag          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Innings.Not.Out.Flag         : int  1 1 0 0 1 0 1 0 1 1 ...
##  $ Innings.Balls.Faced          : num  64 55 68 61 49 47 45 44 42 46 ...
##  $ Innings.Boundary.Fours       : num  11 8 6 11 13 10 7 6 9 4 ...
##  $ Innings.Boundary.Sixes       : num  6 6 4 2 1 3 5 4 4 6 ...
##  $ Innings.Batting.Strike.Rate  : num  181 180 146 154 184 ...
##  $ Innings.Number               : Factor w/ 3 levels "-","1","2": 3 2 3 2 3 2 2 3 3 2 ...
##  $ Opposition                   : Factor w/ 33 levels "v Afghanistan",..: 27 1 31 2 2 31 26 26 19 19 ...
##  $ Ground                       : Factor w/ 107 levels "Aberdeen","Abu Dhabi",..: 23 28 81 25 96 8 54 74 106 48 ...
##  $ Innings.Date                 : Date, format: "2014-03-27" "2012-09-21" ...
##  $ Country                      : Factor w/ 17 levels "Afghanistan",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ X50.s                        : int  0 1 1 1 1 1 1 1 1 1 ...
##  $ X100.s                       : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Innings.Runs.Scored.Buckets  : Factor w/ 6 levels "","-","0-49",..: 4 6 6 6 6 6 6 6 6 6 ...
##  $ Innings.Overs.Bowled         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Innings.Bowled.Flag          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Innings.Maidens.Bowled       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Innings.Runs.Conceded        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Innings.Wickets.Taken        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X4.Wickets                   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ X5.Wickets                   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ X10.Wickets                  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Innings.Wickets.Taken.Buckets: Factor w/ 4 levels "","-","0-4","5+": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Innings.Economy.Rate         : num  NA NA NA NA NA NA NA NA NA NA ...
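The near-identical conversion lines above could also be written as one loop over a vector of column names. This is only a sketch of the same idea on a made-up two-column data frame; the helper name to_num is hypothetical.

```r
# Toy data frame with factor columns that hide numbers (illustrative values)
toy <- data.frame(
  Innings.Balls.Faced    = factor(c("64", "55", "-")),
  Innings.Minutes.Batted = factor(c("97", "", "84"))
)

# Hypothetical helper: factor -> character -> numeric,
# with the "NAs introduced by coercion" warning silenced
to_num <- function(x) suppressWarnings(as.numeric(as.character(x)))

num_cols <- c("Innings.Balls.Faced", "Innings.Minutes.Batted")
toy[num_cols] <- lapply(toy[num_cols], to_num)

str(toy)
```

For the real data, num_cols would list the eleven columns converted above.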

summary(dat)

##        Innings.Player  Innings.Runs.Scored Innings.Runs.Scored.Num
##  Shoaib Malik :  220          :16689       Min.   :  0.00         
##  RG Sharma    :  199   DNB    : 4451       1st Qu.:  3.00         
##  Shahid Afridi:  196   0      : 1038       Median : 11.00         
##  MS Dhoni     :  194   1      :  624       Mean   : 17.14         
##  LRPL Taylor  :  187   2      :  456       3rd Qu.: 25.00         
##  KJ O'Brien   :  181   4      :  418       Max.   :172.00         
##  (Other)      :32201   (Other): 9702       NA's   :21316          
##  Innings.Minutes.Batted Innings.Batted.Flag Innings.Not.Out.Flag
##  Min.   :  0.0          Min.   :0.000       Min.   :0.000       
##  1st Qu.:  6.0          1st Qu.:0.000       1st Qu.:0.000       
##  Median : 14.0          Median :1.000       Median :0.000       
##  Mean   : 20.1          Mean   :0.723       Mean   :0.157       
##  3rd Qu.: 29.0          3rd Qu.:1.000       3rd Qu.:0.000       
##  Max.   :101.0          Max.   :1.000       Max.   :1.000       
##  NA's   :24821          NA's   :16689       NA's   :16689       
##  Innings.Balls.Faced Innings.Boundary.Fours Innings.Boundary.Sixes
##  Min.   : 0.00       Min.   : 0.000         Min.   : 0.000        
##  1st Qu.: 4.00       1st Qu.: 0.000         1st Qu.: 0.000        
##  Median :10.00       Median : 1.000         Median : 0.000        
##  Mean   :13.97       Mean   : 1.491         Mean   : 0.598        
##  3rd Qu.:20.00       3rd Qu.: 2.000         3rd Qu.: 1.000        
##  Max.   :76.00       Max.   :16.000         Max.   :16.000        
##  NA's   :21316       NA's   :21316          NA's   :21316         
##  Innings.Batting.Strike.Rate Innings.Number          Opposition   
##  Min.   :  0.0               -:  341        v Pakistan    : 3146  
##  1st Qu.: 62.5               1:16590        v New Zealand : 2730  
##  Median :100.0               2:16447        v Sri Lanka   : 2662  
##  Mean   :106.1                              v Australia   : 2640  
##  3rd Qu.:143.8                              v India       : 2640  
##  Max.   :600.0                              v South Africa: 2532  
##  NA's   :21564                              (Other)       :17028  
##            Ground       Innings.Date                Country     
##  Dubai (DSC)  : 2266   Min.   :2005-02-17   Pakistan    : 3256  
##  Dhaka        : 1694   1st Qu.:2011-01-09   New Zealand : 2730  
##  Colombo (RPS): 1584   Median :2014-03-31   Sri Lanka   : 2706  
##  Johannesburg : 1236   Mean   :2014-04-27   Australia   : 2662  
##  Harare       : 1012   3rd Qu.:2017-10-10   India       : 2662  
##  Abu Dhabi    :  946   Max.   :2019-11-05   South Africa: 2532  
##  (Other)      :24640                        (Other)     :16830  
##      X50.s           X100.s      Innings.Runs.Scored.Buckets
##  Min.   :0.000   Min.   :0.000          :16689              
##  1st Qu.:0.000   1st Qu.:0.000   -      : 4627              
##  Median :0.000   Median :0.000   0-49   :11146              
##  Mean   :0.052   Mean   :0.002   100-149:   37              
##  3rd Qu.:0.000   3rd Qu.:0.000   150-199:    3              
##  Max.   :1.000   Max.   :1.000   50-99  :  876              
##  NA's   :16689   NA's   :16689                              
##  Innings.Overs.Bowled Innings.Bowled.Flag Innings.Maidens.Bowled
##  Min.   :0.100        Min.   :0.000       Min.   :0.000         
##  1st Qu.:2.000        1st Qu.:0.000       1st Qu.:0.000         
##  Median :4.000        Median :1.000       Median :0.000         
##  Mean   :3.146        Mean   :0.532       Mean   :0.036         
##  3rd Qu.:4.000        3rd Qu.:1.000       3rd Qu.:0.000         
##  Max.   :4.000        Max.   :1.000       Max.   :2.000         
##  NA's   :24492        NA's   :16689       NA's   :24492         
##  Innings.Runs.Conceded Innings.Wickets.Taken   X4.Wickets      X5.Wickets   
##  Min.   : 0.00         Min.   :0.000         Min.   :0.000   Min.   :0.000  
##  1st Qu.:16.00         1st Qu.:0.000         1st Qu.:1.000   1st Qu.:0.000  
##  Median :23.00         Median :1.000         Median :1.000   Median :0.000  
##  Mean   :23.87         Mean   :0.991         Mean   :0.998   Mean   :0.002  
##  3rd Qu.:31.00         3rd Qu.:2.000         3rd Qu.:1.000   3rd Qu.:0.000  
##  Max.   :75.00         Max.   :6.000         Max.   :1.000   Max.   :1.000  
##  NA's   :24492         NA's   :24492         NA's   :16689   NA's   :16689  
##   X10.Wickets    Innings.Wickets.Taken.Buckets Innings.Economy.Rate
##  Min.   :0          :16689                     Min.   : 0.000      
##  1st Qu.:0       -  : 7803                     1st Qu.: 5.750      
##  Median :0       0-4: 8855                     Median : 7.500      
##  Mean   :0       5+ :   31                     Mean   : 7.888      
##  3rd Qu.:0                                     3rd Qu.: 9.500      
##  Max.   :0                                     Max.   :36.000      
##  NA's   :16689                                 NA's   :24492

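We do not need to worry about the missing values, as they are a normal feature of a cricket dataset. They arise because, in a cricket match, not every player gets a chance to bat (that depends on how many batters get out in an innings), and not every player is a bowler: only the players who are good at bowling are asked to bowl. The identical NA counts for the batting and bowling flags (\(16689\) each, exactly half of the \(33378\) rows) also suggest that batting and bowling records sit on separate rows, so each row is blank for the other discipline.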
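To see how much is missing in each column, the NA values can be counted column by column. Here is a minimal sketch on made-up data; with the real data the equivalent call would be colSums(is.na(dat)).

```r
# Toy data: batting rows have no wickets, bowling rows have no runs
toy <- data.frame(
  Runs    = c(116, NA, 99, NA),
  Wickets = c(NA, 2, NA, 0)
)

na_counts <- colSums(is.na(toy))   # number of NA values per column
na_share  <- colMeans(is.na(toy))  # proportion of NA values per column

na_counts
na_share
```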

print(paste('Highest runs scored by an individual player:', max(dat$Innings.Runs.Scored.Num, na.rm = TRUE)))

## [1] "Highest runs scored by an individual player: 172"

print(paste('Max number of balls faced by an individual player:', max(dat$Innings.Balls.Faced, na.rm = TRUE)))

## [1] "Max number of balls faced by an individual player: 76"

print(paste('Most wickets taken by a bowler in an innings:', max(dat$Innings.Wickets.Taken, na.rm = TRUE)))

## [1] "Most wickets taken by a bowler in an innings: 6"

Batting Stats

Here, I will create a subset of the dataset containing all the necessary batting statistics. I will also rename the columns to make them easier to read and understand.

subset1 <- c( "Innings.Player",
  "Country","Innings.Batted.Flag","Innings.Runs.Scored.Num","Innings.Balls.Faced", "Innings.Batting.Strike.Rate","Innings.Boundary.Fours", "Innings.Boundary.Sixes","X50.s", "X100.s", "Opposition", "Ground", "Innings.Date")
batdata <- select(dat, subset1)
batdata <- batdata %>%
  rename(Player = Innings.Player, Matches=Innings.Batted.Flag, Runs=Innings.Runs.Scored.Num, Balls=Innings.Balls.Faced, StrikeRate=Innings.Batting.Strike.Rate, Fours=Innings.Boundary.Fours, Sixes=Innings.Boundary.Sixes, Fifties= X50.s, Hundreds=X100.s, MatchDay= Innings.Date)
head(batdata)

##        Player Country Matches Runs Balls StrikeRate Fours Sixes Fifties
## 1    AD Hales England       1  116    64     181.25    11     6       0
## 2   LJ Wright England       1   99    55     180.00     8     6       1
## 3    AD Hales England       1   99    68     145.58     6     4       1
## 4    AD Hales England       1   94    61     154.09    11     2       1
## 5     JE Root England       1   90    49     183.67    13     1       1
## 6 SW Billings England       1   87    47     185.10    10     3       1
##   Hundreds    Opposition            Ground   MatchDay
## 1        1   v Sri Lanka        Chattogram 2014-03-27
## 2        0 v Afghanistan     Colombo (RPS) 2012-09-21
## 3        0 v West Indies        Nottingham 2012-06-24
## 4        0   v Australia Chester-le-Street 2013-08-31
## 5        0   v Australia       Southampton 2013-08-29
## 6        0 v West Indies        Basseterre 2019-03-08
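Selecting and renaming can also be done in one step with a named vector that maps new names to old ones. Below is a base-R sketch on a made-up two-row data frame; the name map keep is hypothetical and covers only two of the columns.

```r
# Toy data with the original column names (values taken from head(dat))
toy <- data.frame(
  Innings.Player          = c("AD Hales", "LJ Wright"),
  Innings.Runs.Scored.Num = c(116, 99),
  Innings.Minutes.Batted  = c(97, 83)
)

# New name = old name
keep <- c(Player = "Innings.Player", Runs = "Innings.Runs.Scored.Num")

bat <- toy[keep]            # keep only the mapped columns, in this order
names(bat) <- names(keep)   # apply the new names

bat
```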

Now that we have created the desired batting dataset, let’s sort it in descending order in terms of runs scored by individual players.

batdata1 <- batdata %>%
  arrange(desc(Runs))
head(batdata1,5)

##              Player     Country Matches Runs Balls StrikeRate Fours Sixes
## 1          AJ Finch   Australia       1  172    76     226.31    16    10
## 2 Hazratullah Zazai Afghanistan       1  162    62     261.29    11    16
## 3          AJ Finch   Australia       1  156    63     247.61    11    14
## 4        GJ Maxwell   Australia       1  145    65     223.07    14     9
## 5         HG Munsey    Scotland       1  127    56     226.78     5    14
##   Fifties Hundreds    Opposition            Ground   MatchDay
## 1       0        1    v Zimbabwe            Harare 2018-07-03
## 2       0        1     v Ireland          Dehradun 2019-02-23
## 3       0        1     v England       Southampton 2013-08-29
## 4       0        1   v Sri Lanka         Pallekele 2016-09-06
## 5       0        1 v Netherlands Dublin (Malahide) 2019-09-16

We can clearly see that Aaron Finch of Australia holds the highest individual score: \(172\) runs from \(76\) balls against Zimbabwe. He hit \(16\) fours and \(10\) sixes in that innings, which was played in Harare, Zimbabwe, in July 2018. The other rows can be read the same way.

As you can imagine, batters are expected to score more runs as they face more and more balls. Let’s see if our data reflects it.

a <- ggplot(data=batdata1, aes(x=Balls, y=Runs))
a+geom_point(color="Blue")+geom_smooth(se=TRUE, color="Red")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## Warning: Removed 21316 rows containing non-finite values (stat_smooth).

## Warning: Removed 21316 rows containing missing values (geom_point).

Yes, the data reflect this. The uppermost, rightmost dot is the highest individual score, which also came from the highest number of balls faced. That said, there are also many examples where batters underperformed (scored fewer runs than balls faced).
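Both observations can be put into numbers rather than read off the plot: the correlation between runs and balls, and the share of innings in which a batter scored fewer runs than balls faced. The sketch below uses a made-up data frame with the same Runs and Balls column names as batdata1; the values are illustrative only.

```r
# Toy innings (illustrative values); the last row mimics a player who did not bat
toy <- data.frame(
  Runs  = c(116, 12, 3, 40, NA),
  Balls = c(64, 15, 7, 25, NA)
)

played <- toy[complete.cases(toy), ]              # drop innings with missing values

cor_runs_balls <- cor(played$Runs, played$Balls)  # strength of the linear relationship
share_below    <- mean(played$Runs < played$Balls) # share with fewer runs than balls

cor_runs_balls
share_below
```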

It is natural to want to see key performance indicators (such as total runs, strike rate, and average runs) for individual players. For example, strike rate, the number of runs scored per \(100\) balls faced, indicates a batter’s hard-hitting ability. A player who plays a substantial number of matches with a good strike rate (typically over \(100\)) and a good average (typically above \(20\)) is in high demand in T20 tournaments around the world.
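As a quick check of how strike rate is defined, it can be recomputed from runs and balls. The sketch below uses the first two rows of the head(dat) output shown earlier and reproduces their Innings.Batting.Strike.Rate values.

```r
# Runs and balls faced from the first two rows of head(dat)
runs  <- c(116, 99)
balls <- c(64, 55)

# Strike rate = runs scored per 100 balls faced
strike_rate <- 100 * runs / balls
strike_rate
```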

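# One row per player with their country
coun <- batdata1[1:2]
country <- coun[!duplicated(coun$Player),]
# Career totals; Matches counts innings batted. The formula interface
# drops rows with NA in any of the listed columns.
bag1 <- aggregate(cbind(Matches, Runs, Balls) ~ Player, data=batdata1, sum)
# Mean of the per-innings strike rates (not total runs per 100 balls)
bag2 <- aggregate(StrikeRate ~ Player, data=batdata1, mean)
# Mean runs per innings batted (not runs per dismissal)
Average <- aggregate(Runs ~ Player, data = batdata1, mean)
bag3 <- aggregate(cbind(Fours, Sixes) ~ Player, data=batdata1, sum)
bag4 <- aggregate(cbind(Fifties, Hundreds) ~ Player, data=batdata1, sum)
# Highest individual score
Highest <-  aggregate(Runs ~ Player, data = batdata1, max)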


mvr1 <- merge(bag1, bag2, by="Player")
mvr2 <- merge(mvr1, Average, by="Player")
mvr3 <- merge(mvr2, bag3, by ="Player")
mvr4 <- merge(mvr3, bag4, by ="Player")
mvr5 <- merge(mvr4, Highest, by ="Player" )
mvr6 <- merge(country, mvr5, by ="Player")


mvr6 <- mvr6 %>%
  rename(Runs=Runs.x, Average =Runs.y, HighestScore= Runs)
mvr6 <- mvr6 %>%
  arrange(desc(Runs))
mvr7 <- mvr6 %>% mutate_if(is.numeric, ~round(., 1))
head(mvr7,10)

##              Player      Country Matches Runs Balls StrikeRate Average Fours
## 1         RG Sharma        India      91 2452  1794      116.4    26.9   219
## 2           V Kohli        India      67 2450  1811      123.8    36.6   235
## 3        MJ Guptill  New Zealand      78 2359  1776      116.7    30.2   211
## 4      Shoaib Malik     Pakistan     103 2251  1814      121.7    21.9   185
## 5       BB McCullum  New Zealand      70 2140  1571      123.8    30.6   199
## 6         DA Warner    Australia      75 2031  1441      122.7    27.1   199
## 7  Mohammad Shahzad  Afghanistan      65 1936  1436      120.3    29.8   218
## 8         JP Duminy South Africa      75 1934  1532      115.8    25.8   138
## 9       PR Stirling      Ireland      71 1929  1401      111.4    27.2   233
## 10  Mohammad Hafeez     Pakistan      86 1908  1643      103.1    22.2   196
##    Sixes Fifties Hundreds HighestScore
## 1    109      17        4          118
## 2     58      22        0           90
## 3    105      14        2          105
## 4     61       7        0           75
## 5     91      13        2          123
## 6     84      15        1          100
## 7     72      12        1          118
## 8     71      11        0           96
## 9     59      16        0           91
## 10    51      10        0           86
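One thing to keep in mind when reading this table: StrikeRate and Average are simple means over a player’s innings. The conventional career figures are total runs per \(100\) balls faced and total runs per dismissal, and these can differ noticeably. For instance, RG Sharma’s totals above give \(100 \times 2452 / 1794 \approx 136.7\), against the \(116.4\) shown. The sketch below illustrates the difference on made-up innings for one hypothetical player; the NotOut column plays the role of Innings.Not.Out.Flag.

```r
# Made-up innings for one hypothetical player
inn <- data.frame(
  Player = "Player A",
  Runs   = c(4, 20, 50),
  Balls  = c(1, 30, 25),
  NotOut = c(1, 0, 0)   # 1 = not out, 0 = dismissed
)

# Strike rate: mean of per-innings rates vs. rate over career totals
mean_of_rates <- mean(100 * inn$Runs / inn$Balls)
career_sr     <- 100 * sum(inn$Runs) / sum(inn$Balls)

# Average: runs per innings vs. runs per dismissal
runs_per_innings <- mean(inn$Runs)
dismissals       <- sum(inn$NotOut == 0)
batting_average  <- sum(inn$Runs) / dismissals

round(c(mean_of_rates = mean_of_rates, career_sr = career_sr,
        runs_per_innings = runs_per_innings,
        batting_average = batting_average), 1)
```

A very short innings (4 runs off 1 ball) pulls the mean of per-innings rates far above the career figure, which is why the two approaches can disagree.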

Let’s visualize the top ten run scorers in terms of total runs and average runs.

top10 <- mvr7 %>%
        slice(1:10)
viz1 <- ggplot(data=top10, aes(x=reorder(Player,-Runs),
                              y=Runs,
                       fill=Player)) + geom_col(show.legend = FALSE)
viz1 + ggtitle("Top 10 Scorers")+
  geom_text(aes(label = Runs), vjust = -0.1)+
  theme(axis.text.x = element_text(angle = 60, hjust = 1),
        plot.title = element_text(hjust=0.5, colour="Black",
                                  size=20))

top10a <- mvr7 %>%
        slice(1:10)
viz2 <- ggplot(data=top10a, aes(x=reorder(Player,-Average),
                              y=Average,
                       fill=Player)) + geom_col(show.legend = FALSE)
viz2 + ggtitle("Average Runs of the Top 10 Scorers")+
  geom_text(aes(label = Average), vjust = -0.1)+
  theme(axis.text.x = element_text(angle = 60, hjust = 1),
        plot.title = element_text(hjust=0.5, colour="Black",
                                  size=20))

Comparing the two charts above, we see that even though Rohit Sharma is the highest scorer, he is only \(7\)th in terms of average runs. Virat Kohli, on the other hand, tops the list for average runs and is only \(2\) runs short of being the highest run scorer. Similar comparisons can be made for the other players.
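The positions quoted here can be checked by ranking the ten players on both measures. The sketch below re-enters the Runs and Average columns by hand from the table printed above, so it does not depend on the CSV file.

```r
# Runs and Average copied from the head(mvr7, 10) output above
top <- data.frame(
  Player  = c("RG Sharma", "V Kohli", "MJ Guptill", "Shoaib Malik",
              "BB McCullum", "DA Warner", "Mohammad Shahzad",
              "JP Duminy", "PR Stirling", "Mohammad Hafeez"),
  Runs    = c(2452, 2450, 2359, 2251, 2140, 2031, 1936, 1934, 1929, 1908),
  Average = c(26.9, 36.6, 30.2, 21.9, 30.6, 27.1, 29.8, 25.8, 27.2, 22.2)
)

# Rank 1 = highest value
top$RunsRank    <- rank(-top$Runs)
top$AverageRank <- rank(-top$Average)

top[order(top$AverageRank), ]
```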

At this point, I am interested in exploring how the runs per match are spread out for the top ten scorers.

# Keep only the innings of the ten leading run scorers (the players in top10)
filt1 <- batdata1$Player %in% top10$Player

box <- batdata1[filt1,]
head(box,10)

##              Player     Country Matches Runs Balls StrikeRate Fours Sixes
## 9       BB McCullum New Zealand       1  123    58     212.06    11     7
## 11        RG Sharma       India       1  118    43     274.41    12    10
## 12 Mohammad Shahzad Afghanistan       1  118    67     176.11    10     8
## 16      BB McCullum New Zealand       1  116    56     207.14    12     8
## 19        RG Sharma       India       1  111    61     181.96     8     7
## 23        RG Sharma       India       1  106    66     160.60    12     5
## 24       MJ Guptill New Zealand       1  105    54     194.44     6     9
## 30       MJ Guptill New Zealand       1  101    69     146.37     9     6
## 35        DA Warner   Australia       1  100    56     178.57    10     4
## 38        RG Sharma       India       1  100    56     178.57    11     5
##    Fifties Hundreds     Opposition       Ground   MatchDay
## 9        0        1   v Bangladesh    Pallekele 2012-09-21
## 11       0        1    v Sri Lanka       Indore 2017-12-22
## 12       0        1     v Zimbabwe      Sharjah 2016-01-10
## 16       0        1    v Australia Christchurch 2010-02-28
## 19       0        1  v West Indies      Lucknow 2018-11-06
## 23       0        1 v South Africa   Dharamsala 2015-10-02
## 24       0        1    v Australia     Auckland 2018-02-16
## 30       0        1 v South Africa  East London 2012-12-23
## 35       0        1    v Sri Lanka     Adelaide 2019-10-27
## 38       0        1      v England      Bristol 2018-07-08

boxp <- ggplot(data=box, aes(x=Player, y=Runs))
boxp1 <- boxp + 
  geom_jitter(aes( colour=Player), show.legend = FALSE) + 
  geom_boxplot(alpha = 0.7, outlier.colour = NA) +
  xlab("") + 
  ylab("Runs/Match") + 
  ggtitle("Runs/match by Individual Players") + 
  
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1),
    axis.text.y = element_text(),  
    plot.title = element_text(hjust=0.5, colour="Black",
                                  size=20)
    )
boxp1

## Warning: Removed 851 rows containing non-finite values (stat_boxplot).

## Warning: Removed 851 rows containing missing values (geom_point).

Notice that,

Rohit Sharma has the lowest median score, about \(11\), followed by Mohammad Hafeez and Shoaib Malik. However, about \(25\)% of the time Sharma scored between \(40\) and \(90\) runs, which lifts his average. He also played some \(90\)-plus innings, shown here as outliers.
On the other hand, Virat Kohli has the highest median score, followed by Martin Guptill and Mohammad Shahzad. McCullum, Warner, Duminy, and Stirling have decent medians.
Hafeez and Malik are among the most predictable batsmen. Part of the reason is that both are all-rounders (players who specialize in both batting and bowling, and who are a great asset to their team).
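The medians and spreads that the boxplot shows can also be computed per player. This is a sketch on made-up scores for two hypothetical players, not the actual figures behind the plot.

```r
# Made-up scores: Player A is erratic, Player B is steady
toy <- data.frame(
  Player = rep(c("Player A", "Player B"), each = 5),
  Runs   = c(0, 5, 11, 40, 90,   20, 25, 30, 35, 40)
)

# Median and interquartile range of runs for each player
medians <- tapply(toy$Runs, toy$Player, median, na.rm = TRUE)
iqrs    <- tapply(toy$Runs, toy$Player, IQR, na.rm = TRUE)

medians
iqrs
```

With the real data, box$Runs and box$Player would take the place of the toy columns; a small interquartile range corresponds to a "predictable" batter in the sense used above.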

Bowling Stats

Similar to the batting stats, let’s create a dataset containing only bowling statistics.

subset2 <- c("Innings.Player", "Country", "Innings.Bowled.Flag", "Innings.Overs.Bowled", "Innings.Maidens.Bowled", "Innings.Runs.Conceded", "Innings.Wickets.Taken", "X4.Wickets", "X5.Wickets", "Innings.Economy.Rate")

bowldata <- select(dat, subset2)
bowldata <- bowldata %>%
  rename(Player=Innings.Player, Matches = Innings.Bowled.Flag, Overs=Innings.Overs.Bowled, Maidens =Innings.Maidens.Bowled, RunsConceded=Innings.Runs.Conceded, Wickets=Innings.Wickets.Taken, FourWickets= X4.Wickets, FiveWickets = X5.Wickets, Economy=Innings.Economy.Rate)

bowldata1 <- bowldata %>%
  arrange(desc(Wickets))
head(bowldata1,10)

##         Player      Country Matches Overs Maidens RunsConceded Wickets
## 1    YS Chahal        India       1   4.0       0           25       6
## 2   BAW Mendis    Sri Lanka       1   4.0       2            8       6
## 3   BAW Mendis    Sri Lanka       1   4.0       1           16       6
## 4  JP Faulkner    Australia       1   4.0       0           27       5
## 5    R McLaren South Africa       1   3.5       0           19       5
## 6      D Wiese South Africa       1   4.0       0           23       5
## 7  Imran Tahir South Africa       1   4.0       0           23       5
## 8  Imran Tahir South Africa       1   3.5       0           24       5
## 9     KMA Paul  West Indies       1   4.0       0           15       5
## 10   DJG Sammy  West Indies       1   3.5       0           26       5
##    FourWickets FiveWickets Economy
## 1            0           1    6.25
## 2            0           1    2.00
## 3            0           1    4.00
## 4            0           1    6.75
## 5            0           1    4.95
## 6            0           1    5.75
## 7            0           1    5.75
## 8            0           1    6.26
## 9            0           1    3.75
## 10           0           1    6.78

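Next, let’s aggregate the statistics by individual player. One caveat: overs are recorded in cricket notation, where \(3.5\) means \(3\) overs and \(5\) balls, so the summed Overs column below is only approximate. Likewise, Economy is the mean of the per-innings economy rates rather than total runs conceded per over bowled.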

# One row per player with their country
bowl_players <- bowldata1[1:2]
c1 <- bowl_players[!duplicated(bowl_players$Player),]
bal1 <- aggregate(cbind(Matches, Overs, Maidens, RunsConceded, Wickets, FiveWickets) ~ Player, data=bowldata1, sum)
bal2 <- aggregate(Economy ~ Player, data=bowldata1, mean)
MW <-  aggregate(Wickets ~ Player, data = bowldata1, max)

m1 <- merge(c1, bal1, by="Player")
m2 <- merge(m1, bal2, by="Player")
m3 <- merge(m2, MW, by="Player")

m3 <- m3 %>%
  rename(Wickets =Wickets.x, Highest = Wickets.y)
m3 <- m3 %>%
  arrange(desc(Wickets))
m4 <- m3 %>% mutate_if(is.numeric, ~round(., 2))
head(m4,10)

##             Player     Country Matches Overs Maidens RunsConceded Wickets
## 1       SL Malinga   Sri Lanka      79 282.9       1         2061     106
## 2    Shahid Afridi    Pakistan      96 356.4       4         2362      97
## 3  Shakib Al Hasan  Bangladesh      75 277.1       2         1894      92
## 4      Saeed Ajmal    Pakistan      63 237.0       2         1516      85
## 5         Umar Gul    Pakistan      60 198.3       2         1443      85
## 6      Rashid Khan Afghanistan      41 155.0       1          927      79
## 7      GH Dockrell     Ireland      69 220.4       1         1539      75
## 8       TG Southee New Zealand      62 226.9       2         1885      73
## 9    Mohammad Nabi Afghanistan      72 249.4       5         1789      69
## 10      BAW Mendis   Sri Lanka      39 147.3       5          952      66
##    FiveWickets Economy Highest
## 1            2    7.23       5
## 2            0    6.77       4
## 3            1    6.89       5
## 4            0    6.38       4
## 5            2    7.29       5
## 6            2    5.88       5
## 7            0    7.36       4
## 8            1    8.47       5
## 9            0    7.20       4
## 10           2    6.42       6
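A caveat on the aggregated table: the Overs column appears to use cricket notation, where \(3.5\) means \(3\) overs and \(5\) balls (the per-match output earlier is consistent with this, e.g. \(24\) runs in \(3.5\) overs gives an economy of \(6.26\)). Summing these values as decimals produces totals such as \(282.9\) that are not valid overs, and the mean of per-match economy rates is not the same thing as a career economy rate. Below is a minimal sketch of an alternative, assuming the notation holds for the whole column. The helper `overs_to_balls` and the toy data frame are hypothetical and not part of the original analysis.

```r
# Convert cricket overs notation (3.5 = 3 overs + 5 balls) to a ball count.
# Assumption: the digit after the decimal point is a number of balls (0-5).
overs_to_balls <- function(overs) {
  whole <- floor(overs)
  extra <- round((overs - whole) * 10)
  whole * 6 + extra
}

# Toy data in the same shape as bowldata1 (values are illustrative only)
toy <- data.frame(
  Player = c("A", "A", "B"),
  Overs = c(3.5, 4.0, 2.2),
  RunsConceded = c(24, 30, 6)
)
toy$Balls <- overs_to_balls(toy$Overs)

# Career totals, then economy as total runs per six balls bowled
career <- aggregate(cbind(Balls, RunsConceded) ~ Player, data = toy, sum)
career$Economy <- round(career$RunsConceded / (career$Balls / 6), 2)
career
```

Applied to `bowldata1`, the same two steps would replace the summed Overs and the averaged Economy columns, at the cost of no longer matching the table printed above.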

We see that,

Sri Lankan legend Lasith Malinga tops the list with \(106\) wickets at an economy rate of \(7.23\). An economy rate of \(7.23\) (i.e., he conceded on average \(7.23\) runs per over of \(6\) balls) is in fact pretty decent in the context of T20 cricket.
Rashid Khan (\(6\)th in this list) seems to be the most impressive bowler with \(79\) wickets in just \(41\) matches. Rashid is also the most economical (economy \(5.88\)) of these \(10\) bowlers.
On the other hand, Tim Southee is the most expensive bowler with an economy of \(8.47\).
An interesting thing to notice is that only \(3\) of these \(10\) bowlers are pacers (a pacer is a bowler who typically bowls at a speed over \(130\) km/h), namely Lasith Malinga, Umar Gul, and Tim Southee.
The rest of the bowlers are spinners (a spinner is a bowler who typically bowls at a speed below \(100\) km/h and can spin the ball in one direction, if not both).
Spinners dominate this top-\(10\) list, which suggests they are very effective in T20 cricket. But not all spinners are destined to be effective, as it also depends on their skill level. Nor does this establish that spinners are superior to pacers; in fact, pacers often (if not always) outnumber spinners in a team.
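The pace versus spin comparison can be made a little more concrete. The sketch below rebuilds the top-\(10\) figures from the table printed above and groups them by bowling type; the `Type` labels follow the classification in the text, and the names `top10`, `pacers`, and `by_type` are introduced here purely for illustration.

```r
# Figures copied from the top-10 table above
top10 <- data.frame(
  Player  = c("SL Malinga", "Shahid Afridi", "Shakib Al Hasan", "Saeed Ajmal",
              "Umar Gul", "Rashid Khan", "GH Dockrell", "TG Southee",
              "Mohammad Nabi", "BAW Mendis"),
  Matches = c(79, 96, 75, 63, 60, 41, 69, 62, 72, 39),
  Wickets = c(106, 97, 92, 85, 85, 79, 75, 73, 69, 66),
  Economy = c(7.23, 6.77, 6.89, 6.38, 7.29, 5.88, 7.36, 8.47, 7.20, 6.42)
)

# Label each bowler using the classification given in the text
pacers <- c("SL Malinga", "Umar Gul", "TG Southee")
top10$Type <- ifelse(top10$Player %in% pacers, "Pace", "Spin")

# Pooled wickets per match and mean economy for each type
by_type <- aggregate(cbind(Matches, Wickets) ~ Type, data = top10, sum)
by_type$WicketsPerMatch <- round(by_type$Wickets / by_type$Matches, 2)
mean_econ <- tapply(top10$Economy, top10$Type, mean)
by_type$MeanEconomy <- round(as.vector(mean_econ[as.character(by_type$Type)]), 2)
by_type
```

On these ten bowlers the spinners are cheaper on average (mean economy \(6.70\) against \(7.66\)), while the pacers take slightly more wickets per match (\(1.31\) against \(1.24\)), so which group looks more effective depends on the measure. With only three pacers in the sample, this is a rough indication at best.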

top10b <- m4 %>%
        slice(1:10)
viz3 <- ggplot(data=top10b, aes(x=reorder(Player,Economy),
                              y=Economy,
                       fill=Player)) + geom_col(show.legend = FALSE)
viz3 + ggtitle("Economy rate by Individual Bowlers")+
  geom_text(aes(label = Economy), vjust = -0.1)+
  theme(axis.text.x = element_text(angle = 60, hjust = 1),
        plot.title = element_text(hjust=0.5, colour="Black",
                                  size=20))

# Keep only the per-match rows of the ten bowlers in top10b
filt2 <- bowldata1$Player %in% top10b$Player

boxbowl <- bowldata1[filt2,]
head(boxbowl,5)

##        Player     Country Matches Overs Maidens RunsConceded Wickets
## 2  BAW Mendis   Sri Lanka       1   4.0       2            8       6
## 3  BAW Mendis   Sri Lanka       1   4.0       1           16       6
## 11 TG Southee New Zealand       1   4.0       1           18       5
## 14   Umar Gul    Pakistan       1   3.0       0            6       5
## 15   Umar Gul    Pakistan       1   2.2       0            6       5
##    FourWickets FiveWickets Economy
## 2            0           1    2.00
## 3            0           1    4.00
## 11           0           1    4.50
## 14           0           1    2.00
## 15           0           1    2.57

bbox <- ggplot(data=boxbowl, aes(x=Player, y=RunsConceded))
boxp2 <- bbox + 
  geom_jitter(aes( colour=Player), show.legend = FALSE) + 
  geom_boxplot(alpha = 0.7, outlier.colour = NA) +
  xlab("") + 
  ylab("RunsConceded/Match") + 
  ggtitle("Runs Conceded/match by Individual Bowlers") + 
  
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1),
    axis.text.y = element_text(),  
    plot.title = element_text(hjust=0.5, colour="Black",
                                  size=20)
    )
boxp2

## Warning: Removed 675 rows containing non-finite values (stat_boxplot).

## Warning: Removed 675 rows containing missing values (geom_point).

Notice that,

Saeed Ajmal, Shahid Afridi, and Shakib Al Hasan are amongst the most consistent bowlers.
Malinga is more consistent than the other two pacers.
In \(50\)% of his matches, Dockrell concedes fewer than \(20\) runs.
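"Consistency" here is read off the boxplot by eye. One way to put numbers on it is to compute the median and interquartile range (IQR) of runs conceded for each bowler, where a smaller IQR means a tighter box. The sketch below uses a small hypothetical data frame in place of `boxbowl`; the bowler names and values are made up for illustration, and the `NA` stands in for a match in which the player did not bowl (the kind of row the warnings above report as removed).

```r
# Hypothetical per-match figures standing in for boxbowl
toy <- data.frame(
  Player = rep(c("Bowler A", "Bowler B"), each = 5),
  RunsConceded = c(20, 22, 24, 26, NA, 10, 18, 25, 33, 40)
)

# Median and IQR per player, ignoring matches with no bowling figures
med <- tapply(toy$RunsConceded, toy$Player, median, na.rm = TRUE)
iqr <- tapply(toy$RunsConceded, toy$Player, IQR, na.rm = TRUE)

spread <- data.frame(Player = names(med),
                     Median = as.vector(med),
                     IQR = as.vector(iqr))
spread[order(spread$IQR), ]
```

Running the same two `tapply` calls on `boxbowl$RunsConceded` and `boxbowl$Player` would give a ranking to check against the visual impression from the plot.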

Based on the above analysis and my cricketing knowledge, I am going to pick my best \(11\) for a hypothetical T20 cricket match. I am going to choose \(4\) specialist bowlers, an all-rounder, and \(6\) solid batsmen.

My team, listed in batting order:

BB McCullum (Captain + Wicket-keeper)
DA Warner
MJ Guptill
V Kohli
JP Duminy
PR Stirling
Shahid Afridi
Rashid Khan
Saeed Ajmal
Umar Gul
SL Malinga

Ismail Firoz

23/04/2020
