If you have ever wondered about cricket, watch this introductory video, which nicely sums up the game. Cricket has three main formats: the Test match (up to \(5\) days long, played in sessions throughout each day); the \(50\)-over match (\(6\) balls \(=1\) over, i.e. the bowler bowls \(6\) times to complete an over), which typically takes about \(7\) hours to finish; and the shortest and nowadays most popular format, the \(20\)-over match (commonly known as T20/\(20-20\) cricket), which takes about \(3\) hours to complete. In this project, I am going to explore a dataset of international T20 cricket matches played since the format’s inception in 2004. I’ve attached another short video explaining T20 cricket. This dataset is not quite up to date: the most recent innings in it dates from November 2019.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
dat <- read.csv("T20_Cricket.csv")
dim(dat)
## [1] 33378 28
The dataset contains \(33378\) rows and \(28\) columns.
head(dat)
## Innings.Player Innings.Runs.Scored Innings.Runs.Scored.Num
## 1 AD Hales 116* 116
## 2 LJ Wright 99* 99
## 3 AD Hales 99 99
## 4 AD Hales 94 94
## 5 JE Root 90* 90
## 6 SW Billings 87 87
## Innings.Minutes.Batted Innings.Batted.Flag Innings.Not.Out.Flag
## 1 97 1 1
## 2 83 1 1
## 3 84 1 0
## 4 80 1 0
## 5 77 1 1
## 6 - 1 0
## Innings.Balls.Faced Innings.Boundary.Fours Innings.Boundary.Sixes
## 1 64 11 6
## 2 55 8 6
## 3 68 6 4
## 4 61 11 2
## 5 49 13 1
## 6 47 10 3
## Innings.Batting.Strike.Rate Innings.Number Opposition Ground
## 1 181.25 2 v Sri Lanka Chattogram
## 2 180 1 v Afghanistan Colombo (RPS)
## 3 145.58 2 v West Indies Nottingham
## 4 154.09 1 v Australia Chester-le-Street
## 5 183.67 2 v Australia Southampton
## 6 185.1 1 v West Indies Basseterre
## Innings.Date Country X50.s X100.s Innings.Runs.Scored.Buckets
## 1 2014-03-27 England 0 1 100-149
## 2 2012-09-21 England 1 0 50-99
## 3 2012-06-24 England 1 0 50-99
## 4 2013-08-31 England 1 0 50-99
## 5 2013-08-29 England 1 0 50-99
## 6 2019-03-08 England 1 0 50-99
## Innings.Overs.Bowled Innings.Bowled.Flag Innings.Maidens.Bowled
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## Innings.Runs.Conceded Innings.Wickets.Taken X4.Wickets X5.Wickets X10.Wickets
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## Innings.Wickets.Taken.Buckets Innings.Economy.Rate
## 1
## 2
## 3
## 4
## 5
## 6
Let’s look at the column headings. I will explain what each column means when I create the necessary subsets of this dataset.
colnames(dat)
## [1] "Innings.Player" "Innings.Runs.Scored"
## [3] "Innings.Runs.Scored.Num" "Innings.Minutes.Batted"
## [5] "Innings.Batted.Flag" "Innings.Not.Out.Flag"
## [7] "Innings.Balls.Faced" "Innings.Boundary.Fours"
## [9] "Innings.Boundary.Sixes" "Innings.Batting.Strike.Rate"
## [11] "Innings.Number" "Opposition"
## [13] "Ground" "Innings.Date"
## [15] "Country" "X50.s"
## [17] "X100.s" "Innings.Runs.Scored.Buckets"
## [19] "Innings.Overs.Bowled" "Innings.Bowled.Flag"
## [21] "Innings.Maidens.Bowled" "Innings.Runs.Conceded"
## [23] "Innings.Wickets.Taken" "X4.Wickets"
## [25] "X5.Wickets" "X10.Wickets"
## [27] "Innings.Wickets.Taken.Buckets" "Innings.Economy.Rate"
str(dat)
## 'data.frame': 33378 obs. of 28 variables:
## $ Innings.Player : Factor w/ 1064 levels "A Balbirnie",..: 37 532 37 37 420 957 275 420 37 275 ...
## $ Innings.Runs.Scored : Factor w/ 231 levels "","0","0*","1",..: 24 227 226 218 212 203 200 195 190 190 ...
## $ Innings.Runs.Scored.Num : Factor w/ 124 levels "","-","0","1",..: 18 124 124 120 116 112 110 108 105 105 ...
## $ Innings.Minutes.Batted : Factor w/ 103 levels "","-","0","1",..: 101 86 87 83 79 2 54 72 50 77 ...
## $ Innings.Batted.Flag : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Innings.Not.Out.Flag : int 1 1 0 0 1 0 1 0 1 1 ...
## $ Innings.Balls.Faced : Factor w/ 75 levels "","-","0","1",..: 64 54 68 61 47 45 43 42 40 44 ...
## $ Innings.Boundary.Fours : Factor w/ 18 levels "","-","0","1",..: 6 17 15 6 8 5 16 15 18 13 ...
## $ Innings.Boundary.Sixes : Factor w/ 18 levels "","-","0","1",..: 15 15 13 11 4 12 14 13 13 15 ...
## $ Innings.Batting.Strike.Rate : Factor w/ 1104 levels "","-","0","10",..: 577 572 345 409 591 599 618 617 625 538 ...
## $ Innings.Number : Factor w/ 3 levels "-","1","2": 3 2 3 2 3 2 2 3 3 2 ...
## $ Opposition : Factor w/ 33 levels "v Afghanistan",..: 27 1 31 2 2 31 26 26 19 19 ...
## $ Ground : Factor w/ 107 levels "Aberdeen","Abu Dhabi",..: 23 28 81 25 96 8 54 74 106 48 ...
## $ Innings.Date : Factor w/ 635 levels "2005-02-17","2005-06-13",..: 293 198 179 249 248 571 85 406 226 499 ...
## $ Country : Factor w/ 17 levels "Afghanistan",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ X50.s : int 0 1 1 1 1 1 1 1 1 1 ...
## $ X100.s : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Innings.Runs.Scored.Buckets : Factor w/ 6 levels "","-","0-49",..: 4 6 6 6 6 6 6 6 6 6 ...
## $ Innings.Overs.Bowled : Factor w/ 28 levels "","0.1","0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Innings.Bowled.Flag : int NA NA NA NA NA NA NA NA NA NA ...
## $ Innings.Maidens.Bowled : Factor w/ 5 levels "","-","0","1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Innings.Runs.Conceded : Factor w/ 69 levels "","-","0","1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Innings.Wickets.Taken : Factor w/ 9 levels "","-","0","1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ X4.Wickets : int NA NA NA NA NA NA NA NA NA NA ...
## $ X5.Wickets : int NA NA NA NA NA NA NA NA NA NA ...
## $ X10.Wickets : int NA NA NA NA NA NA NA NA NA NA ...
## $ Innings.Wickets.Taken.Buckets: Factor w/ 4 levels "","-","0-4","5+": 1 1 1 1 1 1 1 1 1 1 ...
## $ Innings.Economy.Rate : Factor w/ 318 levels "","-","0","0.85",..: 1 1 1 1 1 1 1 1 1 1 ...
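Some variables need to be converted to numeric. Because they were read in as factors, each one is first converted to character, so that the labels rather than the internal level codes are turned into numbers. Entries such as "-" cannot be parsed and become NA, which is what the coercion warnings below report.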
dat$Innings.Runs.Scored.Num <- as.numeric(as.character(dat$Innings.Runs.Scored.Num))
## Warning: NAs introduced by coercion
dat$Innings.Minutes.Batted <- as.numeric(as.character(dat$Innings.Minutes.Batted))
## Warning: NAs introduced by coercion
dat$Innings.Balls.Faced <- as.numeric(as.character(dat$Innings.Balls.Faced))
## Warning: NAs introduced by coercion
dat$Innings.Boundary.Fours <- as.numeric(as.character(dat$Innings.Boundary.Fours))
## Warning: NAs introduced by coercion
dat$Innings.Boundary.Sixes <- as.numeric(as.character(dat$Innings.Boundary.Sixes))
## Warning: NAs introduced by coercion
dat$Innings.Batting.Strike.Rate <- as.numeric(as.character(dat$Innings.Batting.Strike.Rate))
## Warning: NAs introduced by coercion
dat$Innings.Overs.Bowled <- as.numeric(as.character(dat$Innings.Overs.Bowled))
## Warning: NAs introduced by coercion
dat$Innings.Maidens.Bowled <-as.numeric(as.character(dat$Innings.Maidens.Bowled))
## Warning: NAs introduced by coercion
dat$Innings.Runs.Conceded <- as.numeric(as.character(dat$Innings.Runs.Conceded))
## Warning: NAs introduced by coercion
dat$Innings.Wickets.Taken <- as.numeric(as.character(dat$Innings.Wickets.Taken))
## Warning: NAs introduced by coercion
dat$Innings.Economy.Rate <- as.numeric(as.character(dat$Innings.Economy.Rate))
## Warning: NAs introduced by coercion
dat$Innings.Date <- as.Date(as.character(dat$Innings.Date))
str(dat)
## 'data.frame': 33378 obs. of 28 variables:
## $ Innings.Player : Factor w/ 1064 levels "A Balbirnie",..: 37 532 37 37 420 957 275 420 37 275 ...
## $ Innings.Runs.Scored : Factor w/ 231 levels "","0","0*","1",..: 24 227 226 218 212 203 200 195 190 190 ...
## $ Innings.Runs.Scored.Num : num 116 99 99 94 90 87 85 83 80 80 ...
## $ Innings.Minutes.Batted : num 97 83 84 80 77 NA 54 70 50 75 ...
## $ Innings.Batted.Flag : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Innings.Not.Out.Flag : int 1 1 0 0 1 0 1 0 1 1 ...
## $ Innings.Balls.Faced : num 64 55 68 61 49 47 45 44 42 46 ...
## $ Innings.Boundary.Fours : num 11 8 6 11 13 10 7 6 9 4 ...
## $ Innings.Boundary.Sixes : num 6 6 4 2 1 3 5 4 4 6 ...
## $ Innings.Batting.Strike.Rate : num 181 180 146 154 184 ...
## $ Innings.Number : Factor w/ 3 levels "-","1","2": 3 2 3 2 3 2 2 3 3 2 ...
## $ Opposition : Factor w/ 33 levels "v Afghanistan",..: 27 1 31 2 2 31 26 26 19 19 ...
## $ Ground : Factor w/ 107 levels "Aberdeen","Abu Dhabi",..: 23 28 81 25 96 8 54 74 106 48 ...
## $ Innings.Date : Date, format: "2014-03-27" "2012-09-21" ...
## $ Country : Factor w/ 17 levels "Afghanistan",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ X50.s : int 0 1 1 1 1 1 1 1 1 1 ...
## $ X100.s : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Innings.Runs.Scored.Buckets : Factor w/ 6 levels "","-","0-49",..: 4 6 6 6 6 6 6 6 6 6 ...
## $ Innings.Overs.Bowled : num NA NA NA NA NA NA NA NA NA NA ...
## $ Innings.Bowled.Flag : int NA NA NA NA NA NA NA NA NA NA ...
## $ Innings.Maidens.Bowled : num NA NA NA NA NA NA NA NA NA NA ...
## $ Innings.Runs.Conceded : num NA NA NA NA NA NA NA NA NA NA ...
## $ Innings.Wickets.Taken : num NA NA NA NA NA NA NA NA NA NA ...
## $ X4.Wickets : int NA NA NA NA NA NA NA NA NA NA ...
## $ X5.Wickets : int NA NA NA NA NA NA NA NA NA NA ...
## $ X10.Wickets : int NA NA NA NA NA NA NA NA NA NA ...
## $ Innings.Wickets.Taken.Buckets: Factor w/ 4 levels "","-","0-4","5+": 1 1 1 1 1 1 1 1 1 1 ...
## $ Innings.Economy.Rate : num NA NA NA NA NA NA NA NA NA NA ...
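To see why the conversion goes through as.character(), here is a minimal sketch on a toy data frame (the toy columns and the helper name to_num are my own illustrations, not part of the dataset). It also shows how the same conversion could be applied to several columns in one step instead of one line per column.

```r
# Toy stand-in for the real data: factor columns whose labels are numbers,
# with "-" and "" marking missing entries (as in the str() output above).
toy <- data.frame(
  Runs  = factor(c("116", "99", "-", "")),
  Balls = factor(c("64", "55", "-", "")),
  stringsAsFactors = TRUE
)

# Pitfall: as.numeric() on a factor returns the internal level codes,
# not the numbers printed in the labels.
codes <- as.numeric(toy$Runs)

# Hypothetical helper: convert via the labels; unparseable labels become NA.
to_num <- function(x) suppressWarnings(as.numeric(as.character(x)))

# Apply the helper to every column that should be numeric.
num_cols <- c("Runs", "Balls")
toy[num_cols] <- lapply(toy[num_cols], to_num)
str(toy)
```

Suppressing the warnings is a choice I made for the sketch; in a real analysis it may be better to leave them visible so that unexpected labels are noticed.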
summary(dat)
## Innings.Player Innings.Runs.Scored Innings.Runs.Scored.Num
## Shoaib Malik : 220 :16689 Min. : 0.00
## RG Sharma : 199 DNB : 4451 1st Qu.: 3.00
## Shahid Afridi: 196 0 : 1038 Median : 11.00
## MS Dhoni : 194 1 : 624 Mean : 17.14
## LRPL Taylor : 187 2 : 456 3rd Qu.: 25.00
## KJ O'Brien : 181 4 : 418 Max. :172.00
## (Other) :32201 (Other): 9702 NA's :21316
## Innings.Minutes.Batted Innings.Batted.Flag Innings.Not.Out.Flag
## Min. : 0.0 Min. :0.000 Min. :0.000
## 1st Qu.: 6.0 1st Qu.:0.000 1st Qu.:0.000
## Median : 14.0 Median :1.000 Median :0.000
## Mean : 20.1 Mean :0.723 Mean :0.157
## 3rd Qu.: 29.0 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :101.0 Max. :1.000 Max. :1.000
## NA's :24821 NA's :16689 NA's :16689
## Innings.Balls.Faced Innings.Boundary.Fours Innings.Boundary.Sixes
## Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median :10.00 Median : 1.000 Median : 0.000
## Mean :13.97 Mean : 1.491 Mean : 0.598
## 3rd Qu.:20.00 3rd Qu.: 2.000 3rd Qu.: 1.000
## Max. :76.00 Max. :16.000 Max. :16.000
## NA's :21316 NA's :21316 NA's :21316
## Innings.Batting.Strike.Rate Innings.Number Opposition
## Min. : 0.0 -: 341 v Pakistan : 3146
## 1st Qu.: 62.5 1:16590 v New Zealand : 2730
## Median :100.0 2:16447 v Sri Lanka : 2662
## Mean :106.1 v Australia : 2640
## 3rd Qu.:143.8 v India : 2640
## Max. :600.0 v South Africa: 2532
## NA's :21564 (Other) :17028
## Ground Innings.Date Country
## Dubai (DSC) : 2266 Min. :2005-02-17 Pakistan : 3256
## Dhaka : 1694 1st Qu.:2011-01-09 New Zealand : 2730
## Colombo (RPS): 1584 Median :2014-03-31 Sri Lanka : 2706
## Johannesburg : 1236 Mean :2014-04-27 Australia : 2662
## Harare : 1012 3rd Qu.:2017-10-10 India : 2662
## Abu Dhabi : 946 Max. :2019-11-05 South Africa: 2532
## (Other) :24640 (Other) :16830
## X50.s X100.s Innings.Runs.Scored.Buckets
## Min. :0.000 Min. :0.000 :16689
## 1st Qu.:0.000 1st Qu.:0.000 - : 4627
## Median :0.000 Median :0.000 0-49 :11146
## Mean :0.052 Mean :0.002 100-149: 37
## 3rd Qu.:0.000 3rd Qu.:0.000 150-199: 3
## Max. :1.000 Max. :1.000 50-99 : 876
## NA's :16689 NA's :16689
## Innings.Overs.Bowled Innings.Bowled.Flag Innings.Maidens.Bowled
## Min. :0.100 Min. :0.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:0.000 1st Qu.:0.000
## Median :4.000 Median :1.000 Median :0.000
## Mean :3.146 Mean :0.532 Mean :0.036
## 3rd Qu.:4.000 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :4.000 Max. :1.000 Max. :2.000
## NA's :24492 NA's :16689 NA's :24492
## Innings.Runs.Conceded Innings.Wickets.Taken X4.Wickets X5.Wickets
## Min. : 0.00 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:16.00 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.000
## Median :23.00 Median :1.000 Median :1.000 Median :0.000
## Mean :23.87 Mean :0.991 Mean :0.998 Mean :0.002
## 3rd Qu.:31.00 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :75.00 Max. :6.000 Max. :1.000 Max. :1.000
## NA's :24492 NA's :24492 NA's :16689 NA's :16689
## X10.Wickets Innings.Wickets.Taken.Buckets Innings.Economy.Rate
## Min. :0 :16689 Min. : 0.000
## 1st Qu.:0 - : 7803 1st Qu.: 5.750
## Median :0 0-4: 8855 Median : 7.500
## Mean :0 5+ : 31 Mean : 7.888
## 3rd Qu.:0 3rd Qu.: 9.500
## Max. :0 Max. :36.000
## NA's :16689 NA's :24492
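We do not need to worry about the missing values, as they are a common feature of cricket datasets. The missing values come from the fact that, in a cricket match, not every player gets to bat or bowl; for example, the DNB entries in Innings.Runs.Scored mark players who did not bat.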
print(paste('Highest runs scored by an individual player:', max(dat$Innings.Runs.Scored.Num, na.rm = TRUE)))
## [1] "Highest runs scored by an individual player: 172"
print(paste('Max number of balls faced by an individual player:', max(dat$Innings.Balls.Faced, na.rm = TRUE)))
## [1] "Max number of balls faced by an individual player: 76"
print(paste('Most wickets taken by a bowler in an innings:', max(dat$Innings.Wickets.Taken, na.rm = TRUE)))
## [1] "Most wickets taken by a bowler in an innings: 6"
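The na.rm = TRUE argument in the calls above matters because of the missing values reported by summary(). As a small sketch on made-up numbers (the toy data frame is hypothetical), this is how the missing values per column can be counted and what happens to max() without na.rm:

```r
# Hypothetical toy data with missing entries in both columns.
toy <- data.frame(
  Runs    = c(116, NA, 5),
  Wickets = c(NA, 2, NA)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Without na.rm the result is NA; with na.rm the NAs are ignored.
max_default <- max(toy$Runs)
max_clean   <- max(toy$Runs, na.rm = TRUE)
print(c(default = max_default, clean = max_clean))
```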
Here, I will create a subset of the dataset containing all the necessary batting statistics. I will also rename the columns to make them easier to read and understand.
subset1 <- c( "Innings.Player",
"Country","Innings.Batted.Flag","Innings.Runs.Scored.Num","Innings.Balls.Faced", "Innings.Batting.Strike.Rate","Innings.Boundary.Fours", "Innings.Boundary.Sixes","X50.s", "X100.s", "Opposition", "Ground", "Innings.Date")
batdata <- select(dat, subset1)
batdata <- batdata %>%
rename(Player = Innings.Player, Matches=Innings.Batted.Flag, Runs=Innings.Runs.Scored.Num, Balls=Innings.Balls.Faced, StrikeRate=Innings.Batting.Strike.Rate, Fours=Innings.Boundary.Fours, Sixes=Innings.Boundary.Sixes, Fifties= X50.s, Hundreds=X100.s, MatchDay= Innings.Date)
head(batdata)
## Player Country Matches Runs Balls StrikeRate Fours Sixes Fifties
## 1 AD Hales England 1 116 64 181.25 11 6 0
## 2 LJ Wright England 1 99 55 180.00 8 6 1
## 3 AD Hales England 1 99 68 145.58 6 4 1
## 4 AD Hales England 1 94 61 154.09 11 2 1
## 5 JE Root England 1 90 49 183.67 13 1 1
## 6 SW Billings England 1 87 47 185.10 10 3 1
## Hundreds Opposition Ground MatchDay
## 1 1 v Sri Lanka Chattogram 2014-03-27
## 2 0 v Afghanistan Colombo (RPS) 2012-09-21
## 3 0 v West Indies Nottingham 2012-06-24
## 4 0 v Australia Chester-le-Street 2013-08-31
## 5 0 v Australia Southampton 2013-08-29
## 6 0 v West Indies Basseterre 2019-03-08
Now that we have created the desired batting dataset, let’s sort it in descending order in terms of runs scored by individual players.
batdata1 <- batdata %>%
arrange(desc(Runs))
head(batdata1,5)
## Player Country Matches Runs Balls StrikeRate Fours Sixes
## 1 AJ Finch Australia 1 172 76 226.31 16 10
## 2 Hazratullah Zazai Afghanistan 1 162 62 261.29 11 16
## 3 AJ Finch Australia 1 156 63 247.61 11 14
## 4 GJ Maxwell Australia 1 145 65 223.07 14 9
## 5 HG Munsey Scotland 1 127 56 226.78 5 14
## Fifties Hundreds Opposition Ground MatchDay
## 1 0 1 v Zimbabwe Harare 2018-07-03
## 2 0 1 v Ireland Dehradun 2019-02-23
## 3 0 1 v England Southampton 2013-08-29
## 4 0 1 v Sri Lanka Pallekele 2016-09-06
## 5 0 1 v Netherlands Dublin (Malahide) 2019-09-16
We clearly see that Aaron Finch of Australia scored the highest-ever individual total, 172 runs from 76 balls, against Zimbabwe. He hit \(16\) fours and \(10\) sixes in that innings. The match was held in Harare, Zimbabwe, in July 2018. The other rows can be read in the same way.
As you can imagine, batters are expected to score more runs as they face more and more balls. Let’s see if our data reflects it.
a <- ggplot(data=batdata1, aes(x=Balls, y=Runs))
a+geom_point(color="Blue")+geom_smooth(se=TRUE, color="Red")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 21316 rows containing non-finite values (stat_smooth).
## Warning: Removed 21316 rows containing missing values (geom_point).
Yes, it does. See the upper rightmost dot: that is the highest individual score, which also came with the highest number of balls faced. That said, there are also many examples where batters underperformed (scored fewer runs than balls faced).
Naturally, anyone would like to see key performance indicators (such as total runs, strike rate, and average runs) for individual players. For example, strike rate, the number of runs scored per \(100\) balls faced, indicates a batter’s hard-hitting ability. A player who plays a substantial number of matches with a good strike rate (typically over \(100\)) along with a good average (typically above \(20\)) is in hot demand in T20 tournaments around the world.
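As a quick worked example of the strike rate, the sketch below computes runs per \(100\) balls for the first two innings shown in the head() output earlier (116 off 64 balls and 99 off 55 balls). The function name strike_rate is my own, not something from the dataset.

```r
# Strike rate: runs scored per 100 balls faced.
# strike_rate is a hypothetical helper name used only for this illustration.
strike_rate <- function(runs, balls) 100 * runs / balls

sr_hales  <- strike_rate(116, 64)  # 181.25, matching the dataset
sr_wright <- strike_rate(99, 55)   # 180, matching the dataset
print(c(sr_hales, sr_wright))
```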
coun <- batdata1[1:2]
country <- coun[!duplicated(coun$Player),]
bag1 <- aggregate(cbind(Matches, Runs, Balls) ~ Player, data=batdata1, sum)
bag2 <- aggregate(StrikeRate ~ Player, data=batdata1, mean)
Average <- aggregate(Runs ~ Player, data = batdata1, mean)
bag3 <- aggregate(cbind(Fours, Sixes) ~ Player, data=batdata1, sum)
bag4 <- aggregate(cbind(Fifties, Hundreds) ~ Player, data=batdata1, sum)
Highest <- aggregate(Runs ~ Player, data = batdata1, max)
mvr1 <- merge(bag1, bag2, by="Player")
mvr2 <- merge(mvr1, Average, by="Player")
mvr3 <- merge(mvr2, bag3, by ="Player")
mvr4 <- merge(mvr3, bag4, by ="Player")
mvr5 <- merge(mvr4, Highest, by ="Player" )
mvr6 <- merge(country, mvr5, by ="Player")
mvr6 <- mvr6 %>%
rename(Runs=Runs.x, Average =Runs.y, HighestScore= Runs)
mvr6 <- mvr6 %>%
arrange(desc(Runs))
mvr7 <- mvr6 %>% mutate_if(is.numeric, ~round(., 1))
head(mvr7,10)
## Player Country Matches Runs Balls StrikeRate Average Fours
## 1 RG Sharma India 91 2452 1794 116.4 26.9 219
## 2 V Kohli India 67 2450 1811 123.8 36.6 235
## 3 MJ Guptill New Zealand 78 2359 1776 116.7 30.2 211
## 4 Shoaib Malik Pakistan 103 2251 1814 121.7 21.9 185
## 5 BB McCullum New Zealand 70 2140 1571 123.8 30.6 199
## 6 DA Warner Australia 75 2031 1441 122.7 27.1 199
## 7 Mohammad Shahzad Afghanistan 65 1936 1436 120.3 29.8 218
## 8 JP Duminy South Africa 75 1934 1532 115.8 25.8 138
## 9 PR Stirling Ireland 71 1929 1401 111.4 27.2 233
## 10 Mohammad Hafeez Pakistan 86 1908 1643 103.1 22.2 196
## Sixes Fifties Hundreds HighestScore
## 1 109 17 4 118
## 2 58 22 0 90
## 3 105 14 2 105
## 4 61 7 0 75
## 5 91 13 2 123
## 6 84 15 1 100
## 7 72 12 1 118
## 8 71 11 0 96
## 9 59 16 0 91
## 10 51 10 0 86
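One caveat about the StrikeRate column above: it is the mean of the per-innings strike rates, so a short innings counts as much as a long one. A career strike rate is more commonly computed from the totals, as \(100 \times\) total runs \(/\) total balls. The sketch below contrasts the two on a made-up toy data frame and then applies the totals formula to Rohit Sharma’s Runs and Balls from the table.

```r
# Hypothetical toy data: player A has one long, fast innings and one short duck.
toy <- data.frame(
  Player = c("A", "A", "B"),
  Runs   = c(100, 0, 30),
  Balls  = c(50, 10, 30)
)
toy$StrikeRate <- 100 * toy$Runs / toy$Balls

# Mean of per-innings strike rates (the approach used above).
mean_sr <- aggregate(StrikeRate ~ Player, data = toy, mean)

# Career strike rate from totals.
totals <- aggregate(cbind(Runs, Balls) ~ Player, data = toy, sum)
totals$CareerSR <- 100 * totals$Runs / totals$Balls

print(merge(mean_sr, totals, by = "Player"))

# Same formula with RG Sharma's totals from the table: 2452 runs, 1794 balls.
sharma_career_sr <- round(100 * 2452 / 1794, 1)
print(sharma_career_sr)
```

For player A the two measures differ a lot (a mean of 100 versus roughly 166.7 from totals), which is why the choice of definition is worth stating next to any ranking.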
Let’s visualize the top ten run scorers in terms of total runs and average runs.
top10 <- mvr7 %>%
slice(1:10)
viz1 <- ggplot(data=top10, aes(x=reorder(Player,-Runs),
y=Runs,
fill=Player)) + geom_col(show.legend = FALSE)
viz1 + ggtitle("Top 10 Scorers")+
geom_text(aes(label = Runs), vjust = -0.1)+
theme(axis.text.x = element_text(angle = 60, hjust = 1),
plot.title = element_text(hjust=0.5, colour="Black",
size=20))
top10a <- mvr7 %>%
slice(1:10)
viz2 <- ggplot(data=top10a, aes(x=reorder(Player,-Average),
y=Average,
fill=Player)) + geom_col(show.legend = FALSE)
viz2 + ggtitle("Average Runs of the Top 10 Scorers")+
geom_text(aes(label = Average), vjust = -0.1)+
theme(axis.text.x = element_text(angle = 60, hjust = 1),
plot.title = element_text(hjust=0.5, colour="Black",
size=20))
Comparing the two charts above, we see that even though Rohit Sharma is the highest scorer, he is only in \(7\)th position among these ten players in terms of average runs. On the other hand, Virat Kohli tops the list in terms of average runs and is only \(2\) runs short of being the highest run-getter. The other players can be compared in the same way.
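The positions quoted here can be checked numerically rather than read off the chart. The sketch below re-enters the Runs and Average values from the table above by hand (so it runs on its own) and ranks the ten players by average; the column name AvgRank is my own.

```r
# Values copied from the top-ten table above.
top10_check <- data.frame(
  Player  = c("RG Sharma", "V Kohli", "MJ Guptill", "Shoaib Malik",
              "BB McCullum", "DA Warner", "Mohammad Shahzad", "JP Duminy",
              "PR Stirling", "Mohammad Hafeez"),
  Runs    = c(2452, 2450, 2359, 2251, 2140, 2031, 1936, 1934, 1929, 1908),
  Average = c(26.9, 36.6, 30.2, 21.9, 30.6, 27.1, 29.8, 25.8, 27.2, 22.2)
)

# Rank 1 = highest average among these ten players.
top10_check$AvgRank <- rank(-top10_check$Average)

print(top10_check[order(top10_check$AvgRank), ])
```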
At this point, I am interested in exploring how the runs per match are spread out for the top ten scorers.
top_batters <- c("RG Sharma", "V Kohli", "MJ Guptill", "Shoaib Malik", "BB McCullum", "DA Warner", "Mohammad Shahzad", "JP Duminy", "PR Stirling", "Mohammad Hafeez")
filt1 <- batdata1$Player %in% top_batters
box <- batdata1[filt1,]
head(box,10)
## Player Country Matches Runs Balls StrikeRate Fours Sixes
## 9 BB McCullum New Zealand 1 123 58 212.06 11 7
## 11 RG Sharma India 1 118 43 274.41 12 10
## 12 Mohammad Shahzad Afghanistan 1 118 67 176.11 10 8
## 16 BB McCullum New Zealand 1 116 56 207.14 12 8
## 19 RG Sharma India 1 111 61 181.96 8 7
## 23 RG Sharma India 1 106 66 160.60 12 5
## 24 MJ Guptill New Zealand 1 105 54 194.44 6 9
## 30 MJ Guptill New Zealand 1 101 69 146.37 9 6
## 35 DA Warner Australia 1 100 56 178.57 10 4
## 38 RG Sharma India 1 100 56 178.57 11 5
## Fifties Hundreds Opposition Ground MatchDay
## 9 0 1 v Bangladesh Pallekele 2012-09-21
## 11 0 1 v Sri Lanka Indore 2017-12-22
## 12 0 1 v Zimbabwe Sharjah 2016-01-10
## 16 0 1 v Australia Christchurch 2010-02-28
## 19 0 1 v West Indies Lucknow 2018-11-06
## 23 0 1 v South Africa Dharamsala 2015-10-02
## 24 0 1 v Australia Auckland 2018-02-16
## 30 0 1 v South Africa East London 2012-12-23
## 35 0 1 v Sri Lanka Adelaide 2019-10-27
## 38 0 1 v England Bristol 2018-07-08
boxp <- ggplot(data=box, aes(x=Player, y=Runs))
boxp1 <- boxp +
geom_jitter(aes( colour=Player), show.legend = FALSE) +
geom_boxplot(alpha = 0.7, outlier.colour = NA) +
xlab("") +
ylab("Runs/Match") +
ggtitle("Runs/match by Individual Players") +
theme(
axis.text.x = element_text(angle = 90, hjust = 1),
axis.text.y = element_text(),
plot.title = element_text(hjust=0.5, colour="Black",
size=20)
)
boxp1
## Warning: Removed 851 rows containing non-finite values (stat_boxplot).
## Warning: Removed 851 rows containing missing values (geom_point).
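The two warnings come from rows where Runs is missing; ggplot2 drops them before drawing. A minimal sketch of removing such rows beforehand and computing a per-player median as a numeric companion to the box plot is shown below. The toy data frame is hypothetical.

```r
# Hypothetical toy data with one missing Runs value.
toy <- data.frame(
  Player = c("A", "A", "A", "B", "B"),
  Runs   = c(10, NA, 30, 5, 15)
)

# Drop rows with missing Runs so that plotting code has nothing to remove.
toy_clean <- toy[!is.na(toy$Runs), ]

# Median runs per player (the middle line of each box in a box plot).
medians <- tapply(toy_clean$Runs, toy_clean$Player, median)
print(medians)
```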
Notice that,
Similar to the batting stats, let’s create a dataset containing only bowling statistics.
subset2 <- c("Innings.Player", "Country", "Innings.Bowled.Flag", "Innings.Overs.Bowled", "Innings.Maidens.Bowled", "Innings.Runs.Conceded", "Innings.Wickets.Taken", "X4.Wickets", "X5.Wickets", "Innings.Economy.Rate")
bowldata <- select(dat, subset2)
bowldata <- bowldata %>%
rename(Player=Innings.Player, Matches = Innings.Bowled.Flag, Overs=Innings.Overs.Bowled, Maidens =Innings.Maidens.Bowled, RunsConceded=Innings.Runs.Conceded, Wickets=Innings.Wickets.Taken, FourWickets= X4.Wickets, FiveWickets = X5.Wickets, Economy=Innings.Economy.Rate)
bowldata1 <- bowldata %>%
arrange(desc(Wickets))
head(bowldata1,10)
## Player Country Matches Overs Maidens RunsConceded Wickets
## 1 YS Chahal India 1 4.0 0 25 6
## 2 BAW Mendis Sri Lanka 1 4.0 2 8 6
## 3 BAW Mendis Sri Lanka 1 4.0 1 16 6
## 4 JP Faulkner Australia 1 4.0 0 27 5
## 5 R McLaren South Africa 1 3.5 0 19 5
## 6 D Wiese South Africa 1 4.0 0 23 5
## 7 Imran Tahir South Africa 1 4.0 0 23 5
## 8 Imran Tahir South Africa 1 3.5 0 24 5
## 9 KMA Paul West Indies 1 4.0 0 15 5
## 10 DJG Sammy West Indies 1 3.5 0 26 5
## FourWickets FiveWickets Economy
## 1 0 1 6.25
## 2 0 1 2.00
## 3 0 1 4.00
## 4 0 1 6.75
## 5 0 1 4.95
## 6 0 1 5.75
## 7 0 1 5.75
## 8 0 1 6.26
## 9 0 1 3.75
## 10 0 1 6.78
Next, I aggregate the statistics by individual player.
bowl_country <- bowldata1[1:2]  # avoid naming an object c, which shadows base::c()
c1 <- bowl_country[!duplicated(bowl_country$Player),]
bal1 <- aggregate(cbind(Matches, Overs, Maidens, RunsConceded, Wickets, FiveWickets) ~ Player, data=bowldata1, sum)
bal2 <- aggregate(Economy ~ Player, data=bowldata1, mean)
MW <- aggregate(Wickets ~ Player, data = bowldata1, max)
m1 <- merge(c1, bal1, by="Player")
m2 <- merge(m1, bal2, by="Player")
m3 <- merge(m2, MW, by="Player")
m3 <- m3 %>%
rename(Wickets =Wickets.x, Highest = Wickets.y)
m3 <- m3 %>%
arrange(desc(Wickets))
m4 <- m3 %>% mutate_if(is.numeric, ~round(., 2))
head(m4,10)
## Player Country Matches Overs Maidens RunsConceded Wickets
## 1 SL Malinga Sri Lanka 79 282.9 1 2061 106
## 2 Shahid Afridi Pakistan 96 356.4 4 2362 97
## 3 Shakib Al Hasan Bangladesh 75 277.1 2 1894 92
## 4 Saeed Ajmal Pakistan 63 237.0 2 1516 85
## 5 Umar Gul Pakistan 60 198.3 2 1443 85
## 6 Rashid Khan Afghanistan 41 155.0 1 927 79
## 7 GH Dockrell Ireland 69 220.4 1 1539 75
## 8 TG Southee New Zealand 62 226.9 2 1885 73
## 9 Mohammad Nabi Afghanistan 72 249.4 5 1789 69
## 10 BAW Mendis Sri Lanka 39 147.3 5 952 66
## FiveWickets Economy Highest
## 1 2 7.23 5
## 2 0 6.77 4
## 3 1 6.89 5
## 4 0 6.38 4
## 5 2 7.29 5
## 6 2 5.88 5
## 7 0 7.36 4
## 8 1 8.47 5
## 9 0 7.20 4
## 10 2 6.42 6
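A note on the Overs column: totals such as 282.9 are not valid overs figures. I am assuming the dataset uses the usual scorecard notation, in which 3.5 means \(3\) overs and \(5\) balls, so adding the values as ordinary decimals is only approximate. The sketch below shows one way to convert overs to balls before summing, and to recompute economy as runs per over. The helper names are my own.

```r
# Hypothetical helpers, assuming "3.5" means 3 overs and 5 balls.
overs_to_balls <- function(overs) {
  whole <- floor(overs)
  extra <- round((overs - whole) * 10)  # digit after the point = extra balls
  whole * 6 + extra
}

balls_to_overs <- function(balls) balls %/% 6 + (balls %% 6) / 10

# Economy rate: runs conceded per over (6 balls).
economy <- function(runs, overs) 6 * runs / overs_to_balls(overs)

balls <- overs_to_balls(c(4.0, 3.5, 2.2))
print(balls)

# Two spells of 3.5 overs are 46 balls, i.e. 7.4 in overs notation, not 7.0.
total_overs <- balls_to_overs(sum(overs_to_balls(c(3.5, 3.5))))
print(total_overs)

# 6 runs conceded in 2.2 overs, as in one of Umar Gul's spells listed further down.
econ_example <- round(economy(6, 2.2), 2)
print(econ_example)
```

In the same spirit, a career economy rate would be total runs conceded divided by total overs, rather than the mean of the per-innings economy rates used above.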
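We see that SL Malinga of Sri Lanka leads the list with \(106\) wickets, followed by Shahid Afridi (\(97\)) and Shakib Al Hasan (\(92\)).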
top10b <- m4 %>%
slice(1:10)
viz3 <- ggplot(data=top10b, aes(x=reorder(Player,Economy),
y=Economy,
fill=Player)) + geom_col(show.legend = FALSE)
viz3 + ggtitle("Economy rate by Individual Bowlers")+
geom_text(aes(label = Economy), vjust = -0.1)+
theme(axis.text.x = element_text(angle = 60, hjust = 1),
plot.title = element_text(hjust=0.5, colour="Black",
size=20))
top_bowlers <- c("SL Malinga", "Shahid Afridi", "Shakib Al Hasan", "Saeed Ajmal", "Umar Gul", "Rashid Khan", "GH Dockrell", "TG Southee", "Mohammad Nabi", "BAW Mendis")
filt2 <- bowldata1$Player %in% top_bowlers
boxbowl <- bowldata1[filt2,]
head(boxbowl,5)
## Player Country Matches Overs Maidens RunsConceded Wickets
## 2 BAW Mendis Sri Lanka 1 4.0 2 8 6
## 3 BAW Mendis Sri Lanka 1 4.0 1 16 6
## 11 TG Southee New Zealand 1 4.0 1 18 5
## 14 Umar Gul Pakistan 1 3.0 0 6 5
## 15 Umar Gul Pakistan 1 2.2 0 6 5
## FourWickets FiveWickets Economy
## 2 0 1 2.00
## 3 0 1 4.00
## 11 0 1 4.50
## 14 0 1 2.00
## 15 0 1 2.57
bbox <- ggplot(data=boxbowl, aes(x=Player, y=RunsConceded))
boxp2 <- bbox +
geom_jitter(aes( colour=Player), show.legend = FALSE) +
geom_boxplot(alpha = 0.7, outlier.colour = NA) +
xlab("") +
ylab("RunsConceded/Match") +
ggtitle("Runs Conceded/match by Individual Bowlers") +
theme(
axis.text.x = element_text(angle = 90, hjust = 1),
axis.text.y = element_text(),
plot.title = element_text(hjust=0.5, colour="Black",
size=20)
)
boxp2
## Warning: Removed 675 rows containing non-finite values (stat_boxplot).
## Warning: Removed 675 rows containing missing values (geom_point).
Notice that,
Based on the above analysis and my cricketing knowledge, I am going to pick my best \(11\) for a hypothetical T20 cricket match. I am going to choose \(4\) specialist bowlers, an all-rounder, and \(6\) solid batsmen.
My team list below follows the batting order: