## Load data
deliveries <- read_csv("C:/Users/tomjr/Downloads/deliveries.csv (2).zip")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## batting_team = col_character(),
## bowling_team = col_character(),
## batsman = col_character(),
## non_striker = col_character(),
## bowler = col_character(),
## player_dismissed = col_character(),
## dismissal_kind = col_character(),
## fielder = col_character()
## )
## i Use `spec()` for the full column specifications.
matches <- read_csv("C:/Users/tomjr/Downloads/matches (1).csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## id = col_double(),
## season = col_double(),
## city = col_character(),
## date = col_date(format = ""),
## team1 = col_character(),
## team2 = col_character(),
## toss_winner = col_character(),
## toss_decision = col_character(),
## result = col_character(),
## dl_applied = col_double(),
## winner = col_character(),
## win_by_runs = col_double(),
## win_by_wickets = col_double(),
## player_of_match = col_character(),
## venue = col_character(),
## umpire1 = col_character(),
## umpire2 = col_character(),
## umpire3 = col_logical()
## )
## create a third data set which tells us who actually won the tournament
# takes thirty seconds to find on wikipedia
champions <- tibble(season=c(2008,2009,2010,2011,2012,2013,2014,2015,2016),
champions = c("Rajasthan Royals","Deccan Chargers",
"Chennai Super Kings","Chennai Super Kings",
"Kolkata Knight Riders","Mumbai Indians","Kolkata Knight Riders",
"Mumbai Indians","Sunrisers Hyderabad"))
# merge data sets
ip <- merge(champions, matches, by.x = "season", by.y = "season")
#create ipl
ipl <- merge(ip, deliveries, by.x = "id", by.y = "match_id")
rm(ip)
rm(champions)
The Indian Premier League (IPL) is the biggest club cricket tournament in the world. It is the cricketing equivalent of the UEFA Champions League, the NFL and Wimbledon. Every year there is an auction in which the franchises choose their players and pay them insane wages for under two months’ work. This analysis will hopefully both explain the sport and show why it is so popular.
So, here is the quickest and briefest summary of the tournament.
## `summarise()` regrouping output by 'winner' (override with `.groups` argument)
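The code behind this summary is not shown, so here is a minimal sketch of how matches won per team could be counted. It assumes one row per match with a `winner` column, as in `matches`; the toy rows are made up purely for illustration.

```r
library(dplyr)

# Toy stand-in for the `matches` table; these rows are invented, not real results.
matches_toy <- data.frame(
  id     = 1:6,
  season = c(2008, 2008, 2009, 2009, 2010, 2010),
  winner = c("Mumbai Indians", "Chennai Super Kings", "Mumbai Indians",
             "Mumbai Indians", "Rajasthan Royals", "Chennai Super Kings")
)

# One row per match, so counting rows per winner gives matches won.
wins <- matches_toy %>%
  filter(!is.na(winner)) %>%                 # drop matches without a winner
  count(winner, name = "wins", sort = TRUE)  # most successful team first

print(wins)
```

On the ball-by-ball `ipl` table each match appears many times, so the same count would first need one row per match, for example via `distinct(id, winner)`.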
The Mumbai Indians have won the most matches in the tournament’s history. This is useful to know, as the happiness of the people of Mumbai relies heavily on how successful their cricket team is. Although there is a massive amount of data to explore here, the best way to explain cricket is to dive right in.
It’s simple: to win a cricket match, score more runs than the opposition. The batting team’s job is to score runs, and the more runs the better. As Mumbai Indians won the most matches, I will explore their batting performance.
## `summarise()` ungrouping output (override with `.groups` argument)
From this we should expect them to have done very well in the 2010, 2013, and 2015 seasons.
Next, each team is made up of 11 players. Some players’ job is primarily to bowl, while other players’ job is primarily to bat, hence the term ‘batsman’. Every player can bowl and bat in a game, and teams can use all 11 players when they bat. When the bowling side get a player out, another player replaces him until either the batting team runs out of players or the bowling team have bowled 20 overs.
ipl %>%
  filter(batting_team == "Mumbai Indians") %>%
  group_by(batsman) %>%
  summarise(runs = sum(batsman_runs)) %>%
  arrange(desc(runs)) %>%
  head(10) %>%
  ggplot() +
  geom_bar(stat = "identity", show.legend = FALSE,
           mapping = aes(x = reorder(batsman, -runs), y = runs, fill = batsman)) +
  labs(x = "Batter", y = "Total Runs Scored", title = "Top Ten Batsmen for Mumbai Indians") + # batsmen score runs
  theme_classic() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
  scale_y_continuous(expand = c(0, 0))
## `summarise()` ungrouping output (override with `.groups` argument)
For cricket fans, the name that jumps out is SR Tendulkar. He is now retired but is a living legend in both world and Indian cricket. RG Sharma, however, is their top run scorer.
Tactically, you usually put your best batsmen in first so they face the most balls and can therefore score lots of runs. To make the information easier to digest, I have chosen four batsmen from the original chart and added SL Malinga, who is a bowler, for contrast.
## `summarise()` regrouping output by 'season', 'batsman' (override with `.groups` argument)
And that seems like a good idea. The batsmen who have scored the most runs have indeed faced a lot of balls.
Cricket has three main formats: Test cricket, which is played over five days and is therefore not time constrained; one-day cricket, which takes a day; and Twenty20, the newest and arguably most exciting format, which takes about three hours. This time-constrained format means batsmen have been forced to change their style. They no longer have five days to make runs, but mere hours, so a key metric here will be strike rate, which measures how quickly a batsman is able to make runs. The greater the batsman’s strike rate, the better.
## `summarise()` ungrouping output (override with `.groups` argument)
There is a loophole, though. Face one ball, smash it for a lot of runs, go off, and you have yourself an amazing strike rate. So this chart can be extremely misleading.
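One way to blunt that loophole is to require a minimum number of balls faced before trusting a strike rate. The sketch below is only an illustration: the toy deliveries are made up, strike rate is taken as runs per 100 balls, and the cut-off `min_balls` is a hypothetical value rather than anything used in this analysis.

```r
library(dplyr)

# Made-up deliveries: one row per ball faced.
balls_toy <- data.frame(
  season       = 2015,
  batsman      = c(rep("Batter A", 30), "Batter B"),
  batsman_runs = c(rep(c(1, 0, 4), 10), 6)
)

strike_rates <- balls_toy %>%
  group_by(season, batsman) %>%
  summarise(
    runs        = sum(batsman_runs),
    balls_faced = n(),                       # every row counts as a ball faced
    strikerate  = 100 * runs / balls_faced,  # runs per 100 balls
    .groups     = "drop"
  )

# Batter B has a strike rate of 600 from a single ball, which says very little.
min_balls <- 20  # hypothetical cut-off
reliable  <- filter(strike_rates, balls_faced >= min_balls)

print(reliable)
```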
batters.s %>%
  filter(batsman %in% c("RG Sharma", "KA Pollard", "SR Tendulkar", "HH Pandya", "SL Malinga")) %>%
  ggplot() +
  geom_bar(stat = "identity", position = "dodge",
           aes(x = season, y = strikerate, fill = batsman)) +
  labs(x = "Season", y = "Strike Rate", title = "Strike Rate Analysis per Season") +
  theme_classic() +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_continuous(breaks = seq(2008, 2016, 1))
## effect of hitting boundaries on strike rate
HH Pandya and SL Malinga don’t appear in Mumbai Indians’ top ten run scorers, but their strike rates are at times similar and even superior.
So how does a batsman achieve a high strike rate?
Scoring a lot of runs off as few balls as possible is the simple answer. The detailed answer needs a little more understanding of cricket. When the batsman hits the ball he can run to the other end to get one run. If he can run back again before the ball is fielded, he gets two. Usually, the maximum number of runs gained by running back and forth is two, occasionally three. However, around the edge of the field is a rope, called the boundary. If a batsman hits the ball all the way to the boundary he gets four, and if he hits it over the boundary without it bouncing he gets six. Cricket has creatively named this way of scoring ‘hitting a boundary’!
A simple theory would be that if a batsman hits a lot of boundaries he will have a high strike rate. From now on I will drop SL Malinga from the analysis, because he is one of the world’s greatest bowlers and it seems unfair to ridicule his batting ability, something no team at any level would pay for.
In 2015 Mumbai Indians were hitting a huge percentage of balls to the boundary. It must be said, though, that whatever HH Pandya did between the 2015 and 2016 seasons needs to be called into question. Going from hitting one in four balls to the boundary to one in twenty is a hefty drop.
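The boundary chart itself is not reproduced, but the percentage behind it could be computed roughly as follows. This is a sketch on invented deliveries, assuming a boundary is any ball where `batsman_runs` is 4 or 6; the toy figures are chosen only to mirror a one-in-four versus one-in-twenty contrast.

```r
library(dplyr)

# Invented deliveries for one batter across two seasons.
balls_toy <- data.frame(
  season       = rep(c(2015, 2016), each = 20),
  batsman      = "Batter A",
  batsman_runs = c(rep(c(4, 1, 0, 1), 5),   # 2015: 5 boundaries in 20 balls
                   6, rep(1, 19))           # 2016: 1 boundary in 20 balls
)

boundary_share <- balls_toy %>%
  group_by(season, batsman) %>%
  summarise(
    balls_faced  = n(),
    boundaries   = sum(batsman_runs %in% c(4, 6)),  # fours and sixes
    boundary_pct = 100 * boundaries / balls_faced,
    .groups      = "drop"
  )

print(boundary_share)
```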
While strike rate is important, as previously mentioned, it can be manipulated and paint a false picture of a batsman’s talents. There is another way to see how effective a batsman is: the batsman’s average, which is the number of runs he scores for each time he is dismissed. The greater the average, the better the batsman. For simplicity I will just use KA Pollard and RG Sharma from now on.
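As a rough illustration of the batting average (runs scored divided by times dismissed), here is a sketch on made-up deliveries. It counts dismissals from `player_dismissed` rather than from the striker, since the player who is out is not always the one facing the ball; whether the original chart did the same is an assumption on my part.

```r
library(dplyr)

# Made-up deliveries; player_dismissed is NA when nobody is out.
deliveries_toy <- data.frame(
  batsman          = c("Batter A", "Batter A", "Batter A",
                       "Batter B", "Batter B", "Batter A"),
  batsman_runs     = c(4, 6, 0, 1, 0, 2),
  player_dismissed = c(NA, NA, "Batter A", NA, "Batter B", "Batter B")
)

# Runs are credited to the striker.
runs <- deliveries_toy %>%
  group_by(batsman) %>%
  summarise(runs = sum(batsman_runs), .groups = "drop")

# Dismissals are counted for whoever was actually out.
outs <- deliveries_toy %>%
  filter(!is.na(player_dismissed)) %>%
  count(player_dismissed, name = "dismissals")

batting_avg <- runs %>%
  left_join(outs, by = c("batsman" = "player_dismissed")) %>%
  mutate(average = runs / dismissals)

print(batting_avg)
```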
## `summarise()` regrouping output by 'season', 'batsman' (override with `.groups` argument)
So we see that RG Sharma’s average is more consistent, compared with how volatile KA Pollard’s is. However, Mumbai Indians won in both 2013 and 2015, the two seasons in which KA Pollard’s average peaked.
“Batting for his Average”
Finally, on the batting analysis, I will explain the “Batting for his Average” phenomenon. This is when a batsman is accused of deliberately playing low-risk, low-scoring, defensive shots to preserve his average: accumulating runs slowly with minimal risk. The phrase is often used by frustrated fans who want to see big boundaries. This style of play is not suited to Twenty20 cricket, a format which is high risk, demands high strike rates and is based upon scoring big boundaries. RG Sharma’s average is very high and consistent, so it is not at all surprising that he is the highest run scorer for Mumbai Indians. However, can a disgruntled fan accuse him of batting for his average?
No. A strike rate of over 100 means that a batsman is scoring over a run per ball which in cricketing terms is not defensive!
It is, however, KA Pollard who has the most astonishing results, and they need explaining because it is hard to believe KA Pollard is just one player. Firstly, in 2010 and 2012 he has roughly the same average, but his strike rate in 2010 is almost 2 runs per ball, compared with under 1.4 runs per ball in 2012, and 1.4 is still high. It’s almost as if the 2010 version of Pollard was the 2012 version turbocharged. Secondly, after 2010 his average drops to around five, which is so low that it is almost pointless him batting, and while his strike rate is also low, he definitely cannot be accused of “batting for his average”. Then in 2014 he bats more like RG Sharma than RG Sharma does! Thirdly, the final batting colloquialism explained in this paper is the “Gun Batsman”. This is the term for a batsman who scores a hell of a lot of runs fast. There is no other way to describe KA Pollard in 2015 than as an absolute Gun of a Batsman. Let’s see which bowlers he hit his runs against.
## `summarise()` regrouping output by 'bowler' (override with `.groups` argument)
The quick explanation of this treemap is that DJ Bravo won’t want to play KA Pollard anytime soon. So, a concise summary of how to win a cricket match via batting: have a 2015 version of KA Pollard (the 2013 version was very good as well), and it helps to have a couple more high-scoring batsmen such as HH Pandya and RG Sharma in there too.
When one team bats, the other team bowls. The bowling team selects a bowler, who bowls (throws the ball) at the batsman. Behind the batsman are three sticks, called the wickets. If the bowler hits the wickets with the ball, then the batsman is “out”. His wicket has been ‘taken’. He must leave the pitch and is replaced by a new batsman, until there are no members of the batting team left to bat. There are 10 ways the bowling team can take a batsman’s wicket.
Caught is the most common way of taking a wicket. This is not surprising, as batsmen in Twenty20 cricket hit the ball in the air a lot to try and hit sixes. As hitting a six is a very difficult skill, there are a lot of opportunities to get the batsman out. ‘Getting the batsman out’ is another way of saying ‘taking the batsman’s wicket’.
The bowler is officially credited with certain types of wicket, although they will selfishly claim responsibility for all types. Bowled: the bowler bowls the ball and it hits the wickets. Caught: the batsman hits the ball and it is caught before it bounces. LBW (leg before wicket): the ball hits the batsman’s leg rather than his bat when it would otherwise have gone on to hit the wickets.
Although “caught and bowled” is listed separately here, it is really a special case of caught. Caught means another fielder caught the ball, while caught and bowled means the bowler who bowled the ball also made the catch. It is close to the holy grail for bowlers.
However, as any bowler will proclaim, bowling is an art form and wicket taking isn’t the only important skill, especially in Twenty20, where the probability of taking all ten wickets is slim. The other key skill is economy: the average number of runs a bowler concedes per over. The bowler’s economy may confuse an economist, because a bowler wants their economy to be as low as possible.
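As a quick worked example of economy, here is a small helper. The function name `economy_rate` is my own for illustration, and it uses the simple definition of runs conceded per six balls bowled.

```r
# Hypothetical helper: runs conceded per over, with six balls to an over.
economy_rate <- function(runs_conceded, balls) {
  runs_conceded / (balls / 6)
}

# A bowler who concedes 30 runs from 24 balls (four overs):
economy_rate(30, 24)

# The same 30 runs spread over only 12 balls (two overs) is twice as expensive:
economy_rate(30, 12)
```

Lower is better: the first bowler gives away 7.5 runs an over, the second 15.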
## `summarise()` ungrouping output (override with `.groups` argument)
Mumbai have the fourth best economy in the history of the IPL.
A bowling attack is a group of four or five players who take responsibility for doing the majority of the bowling. They are usually pretty poor batsmen, so this is their contribution to the team. Time to discover the Mumbai Indians bowling attack in 2015, KA Pollard’s Gun season.
## `summarise()` ungrouping output (override with `.groups` argument)
From this chart it looks pretty simple. Based on conceding as few runs as possible, the bowling attack should be de Lange, Malinga, Singh, McClenaghan and Anderson. However, wickets are important. Getting the opposition all out (taking every wicket of the opposition) may not be crucial in Twenty20, but getting their best batsmen out can be.
## `summarise()` ungrouping output (override with `.groups` argument)
Suddenly picking a bowling attack isn’t so simple: de Lange hasn’t taken a single wicket. Not picking HH Pandya would be a start, though. Both his economy and strike rate are high, and bowlers want both numbers to be low.
However, there is still hope. As with batsmen, bowlers’ stats can be easily manipulated. Bowl one great over, fake a shoulder injury, and keep your great statistics. Luckily cricket loves metrics, especially for bowlers, so we can explore fully through the data how the bowlers were used and who the captain chose to ‘throw the ball to’ (ask to bowl).
To begin with we will look at a bowler’s strike rate. This tells us how often a bowler takes wickets, and is vital to know because sometimes a bowling team needs a wicket fast. Other times it is enough to concede as few runs as possible and frustrate the batsman.
## create a 'full' data set without merging loads.
## this data set gives the metrics for every bowler who bowled a ball for Mumbai Indians in 2015.
# explain metrics.
# balls = the number of balls a bowler bowled in the tournament - higher number the better
# runs_conceded = the number of runs the bowler conceded - lower number the better
# economy = number of runs conceded on average per over - lower number the better
# strike = how many balls a bowler must bowl to get a wicket - lower number the better
# bowl_avg = how many runs conceded for each wicket taken - lower number the better
mummy <- mumbai_bowl %>%
  select(bowler, dismissal_kind, player_dismissed, over, total_runs) %>%
  group_by(bowler) %>%
  # per-bowler totals, computed over every ball before any filtering
  mutate(balls = n(),
         runs_conceded = sum(total_runs),
         economy = runs_conceded / (balls / 6)) %>%
  # keep only dismissal rows; only "caught" and "bowled" are counted as wickets here,
  # so bowlers with no such dismissals drop out of the result
  filter(dismissal_kind %in% c("caught", "bowled")) %>%
  mutate(wickets = n(),
         strike = balls / wickets,
         bowl_avg = runs_conceded / wickets) %>%
  # collapse to one row per bowler
  summarise(bowler, bowl_avg, wickets, runs_conceded, strike, economy, balls) %>%
  distinct()
## `summarise()` regrouping output by 'bowler' (override with `.groups` argument)
## Plot strike rate against economy
ggplot(mummy, aes(x = economy, y = strike, color = bowler)) +
  geom_point(size = 3) +
  geom_text_repel(aes(label = bowler)) +
  labs(x = "Economy", y = "Strike Rate", title = "Wicket Taker or Batsman Frustrater?") +
  theme_classic()
HH Pandya is neither. In fact this plot tells the captain that the best option is to throw the ball to SL Malinga no matter the circumstances, and failing that, McClenaghan or Singh. If desperate and a wicket is a must, S Gopal would be a defensible decision. He is a wicket taker.
In other circumstances the captain needs to make a more careful decision: he needs a wicket but cannot go all out for one. He then needs to take into account how many runs he must give up to get a wicket. Luckily there is a metric for that, called the bowler’s average. For clarity, the strike rate is the average number of balls bowled to get a wicket, whilst the bowler’s average is the average number of runs conceded to get a wicket. Is there much difference between the two, however?
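The three bowling metrics are tied together by simple arithmetic, which a small worked example makes clear. The figures below are hypothetical, not taken from the data.

```r
# Hypothetical season figures for one bowler.
balls         <- 120
runs_conceded <- 150
wickets       <- 8

economy  <- runs_conceded / (balls / 6)  # runs per over:   7.5
strike   <- balls / wickets              # balls per wicket: 15
bowl_avg <- runs_conceded / wickets      # runs per wicket:  18.75

# The average is just the strike rate scaled by runs per ball (economy / 6).
all.equal(bowl_avg, strike * economy / 6)
```

So for bowlers with similar economies, a low strike rate and a low average will tend to go together.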
A bowler’s average and strike rate seem correlated, so it seems not. As before, the lower the better for both strike rate and average. So far the data is showing us that S Gopal could be a very useful bowler and the captain of the Mumbai Indians should use him. Did he?
He bowled the least. Unsurprisingly SL Malinga, a bowler for all circumstances, was used a lot.
In cricket one team bats and sets a score. Then the other team bats and tries to chase it down. This is called a run chase. Bowlers really earn their money when they are defending against a run chase. Their team has already batted and set a score, and now they are bowling to protect it, either by getting their opponents all out or by keeping them to a score which is less than their own team’s score.
In numbers: Mumbai Indians have batted and scored 150. Now they must make sure their opponents don’t score more than 150. The last overs of the match, usually the last five, are called the death overs, and they are the most pressurised. The batsmen know exactly how many balls they have left to make the runs they need. For example, they might need 12 runs off the last 12 balls (two overs) to win. These are the most nail-biting, tense and exciting overs in cricket. Compulsive viewing.
This section is about bowling in these overs, when the whole of Mumbai sits nervously, praying for a Mumbai win. So who did the captain in 2015 choose to bowl in these overs?
## `summarise()` regrouping output by 'bowler' (override with `.groups` argument)
It is no surprise he chose Malinga. Excitingly, KA Pollard pops up again. Few cricketers both bat and bowl, and even fewer are both Gun batsmen and chosen to bowl at the death. It’s beginning to feel like KA Pollard was the ultimate cricketer in 2015.
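For anyone wanting to reproduce the death-overs breakdown, a minimal sketch follows. It assumes overs are numbered 1 to 20, so the last five are overs 16 to 20, and it uses invented deliveries with the `bowler`, `over` and `total_runs` columns.

```r
library(dplyr)

# Invented deliveries: one row per ball bowled.
bowl_toy <- data.frame(
  bowler     = c(rep("Bowler X", 12), rep("Bowler Y", 12)),
  over       = c(rep(18, 6), rep(20, 6), rep(5, 6), rep(17, 6)),
  total_runs = c(rep(1, 12), rep(0, 6), rep(2, 6))
)

death <- bowl_toy %>%
  filter(over >= 16) %>%                       # assumed definition of the death overs
  group_by(bowler) %>%
  summarise(
    balls         = n(),
    runs_conceded = sum(total_runs),
    economy       = runs_conceded / (balls / 6),
    .groups       = "drop"
  ) %>%
  arrange(desc(balls))                         # most-used death bowler first

print(death)
```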
So, imagine the death overs are about to start and I can show the captain of Mumbai Indians one chart to help him decide who should bowl. He wants to know both how economical each bowler is and how able they are to take a wicket.
It has to be MJ McClenaghan, surely. Well… it could be a bit more complicated than just choosing the best bowler, because bowlers are limited to four overs each. This causes the captain a few headaches across the course of a match. Teams usually send out their best batsmen first, so to prevent them scoring heavily, and to create pressure on the batting team, captains want to bowl good bowlers at them too. So it’s possible McClenaghan bowled a lot of his overs at the start and in effect used them up.
The graph below shows the bowlers’ performance in terms of economy across the course of a match in the 2015 season.
## `summarise()` regrouping output by 'bowler', 'over' (override with `.groups` argument)
This is so interesting (at least for a cricket fan!). First, amazingly, H Singh, who bowled the second most for Mumbai Indians, never bowled after over 14. In fact, he was predominantly used in the middle overs, and his economy there is remarkably consistent. Secondly, R Vinay Kumar’s economy when he bowls early is much better than at the death. KA Pollard bowled rarely, and his outlier in over 17 suggests the captain made a mistake. However, his best over economy-wise is the last and most pressured over. Incredible. McClenaghan and SL Malinga are used in the first and last overs, and while SL Malinga’s economy seems slightly better in the first overs, there is no drastic change. Simply put, this graph is a representation of Mumbai Indians’ bowling performance in 2015.
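The table behind an over-by-over economy chart could be built along these lines. This is a sketch on made-up deliveries, not the code that produced the graph above.

```r
library(dplyr)

# Made-up deliveries: one row per ball bowled.
bowl_toy <- data.frame(
  bowler     = c(rep("Bowler X", 12), rep("Bowler Y", 6)),
  over       = c(rep(1, 6), rep(20, 6), rep(10, 6)),
  total_runs = c(0, 1, 0, 4, 0, 1,
                 1, 1, 2, 0, 6, 2,
                 1, 1, 1, 0, 1, 2)
)

# Economy for each bowler in each over number of the innings.
economy_by_over <- bowl_toy %>%
  group_by(bowler, over) %>%
  summarise(
    balls   = n(),
    runs    = sum(total_runs),
    economy = runs / (balls / 6),
    .groups = "drop"
  )

print(economy_by_over)
```

Plotting `economy` against `over`, one line or facet per bowler, would then show where in the innings each bowler is cheap or expensive.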
This will be brief. When watching cricket on TV, metrics such as bowlers’ economy are constantly presented. The goal of this predictive exercise is to work out, primitively, which of the metrics introduced in this paper is the most important in winning a cricket match. The metrics tested also include whether a team won the coin toss or not, as commentators enjoy talking about who won the toss.
The data frame metrics_final contained, for each team, the bowling economy, bowling strike rate, batting average, batting strike rate, whether they won the toss, and whether they won the match.
Both the lm model and the random forest model indicated that the economy metric was the most important in terms of winning a match. This makes the over-by-over bowling economy chart even more interesting, in my opinion. According to this brief modelling, the decisions a captain makes to keep the bowlers’ economy as low as possible have the greatest bearing of all the metrics on the outcome of the match.
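The modelling code is not shown, so the following is only a sketch of the linear-model half of the comparison, run on simulated data. The column names are hypothetical stand-ins for those in metrics_final, and the outcome is simulated so that only economy matters, purely to show how standardised coefficients can be ranked. The random forest step is left out.

```r
set.seed(1)
n <- 200

# Simulated team-level metrics; names and values are illustrative only.
metrics_toy <- data.frame(
  economy     = rnorm(n, 8, 1),
  bowl_strike = rnorm(n, 20, 4),
  bat_avg     = rnorm(n, 25, 5),
  bat_strike  = rnorm(n, 130, 15),
  win_toss    = rbinom(n, 1, 0.5)
)

# By construction, a lower economy makes a win more likely; nothing else matters.
metrics_toy$win <- rbinom(n, 1, plogis(-1.5 * (metrics_toy$economy - 8)))

# Standardise the numeric predictors so coefficient sizes are comparable.
num_cols <- c("economy", "bowl_strike", "bat_avg", "bat_strike")
metrics_toy[num_cols] <- lapply(metrics_toy[num_cols],
                                function(x) as.numeric(scale(x)))

fit <- lm(win ~ economy + bowl_strike + bat_avg + bat_strike + win_toss,
          data = metrics_toy)

# Rank predictors by absolute coefficient size (intercept dropped).
importance <- sort(abs(coef(fit)[-1]), decreasing = TRUE)
print(importance)
```

Because win is a 0/1 outcome, a logistic model via `glm(..., family = binomial)` would be the more conventional choice; `lm` is used here only because it is the model named in the text.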