“There are two ways to write error-free programs; only the third one works.” Alan J. Perlis
“Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the universe trying to produce bigger and better idiots. So far, the universe is winning.” Rick Cook
“My software never has bugs. It just develops random features.” Anon
“If you make an ass out of yourself, there will always be someone to ride you.” Bruce Lee
This is the 3rd and final post on cricpy, and is a continuation of my 2 earlier posts
Cricpy is the Python avatar of my R package ‘cricketr’. To know more about my R package cricketr, see Re-introducing cricketr! : An R package to analyze performances of cricketers
With this post, cricpy becomes omnipotent and is now capable of handling Test, ODI and T20 matches.
Cricpy uses the statistics info available in ESPN Cricinfo Statsguru.
You should be able to install the package using pip install cricpy and use the many functions available in the package. Please be mindful of the ESPN Cricinfo Terms of Use.
This post is also hosted on Rpubs at Cricpy takes guard for the Twenty 20s. You can also download the pdf version of this post at cricpy-TT.pdf
You can fork/clone the package at Github cricpy
The data for a particular player in Twenty20s can be obtained with the getPlayerDataTT() function. To do this, go to T20 Batting (http://stats.espncricinfo.com/wi/content/records/283194.html) or T20 Bowling and click the player you are interested in. This will bring up a page which has the profile number for the player, e.g. for Virat Kohli this would be http://www.espncricinfo.com/india/content/player/253802.html. Hence, 253802 can be used to get the data for Virat Kohli, as shown below.
The cricpy package is a clone of my R package cricketr. The signatures of all the Python functions are identical to those of its clone ‘cricketr’, with only the necessary variations between Python and R. It may be useful to look at my post R vs Python: Different similarities and similar differences. In fact, if you are familiar with one of the languages you can look up the package in the other and you will notice the parallel constructs.
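The profile number is simply the numeric part at the end of the player URL. As a small illustration, it could be pulled out with a regular expression. The helper name `profile_from_url` is hypothetical and not part of cricpy, and the sketch assumes the URL layout shown above.

```python
import re

def profile_from_url(url):
    """Return the numeric profile ID at the end of a player URL, or None."""
    match = re.search(r"/player/(\d+)\.html$", url)
    return int(match.group(1)) if match else None

kohli_profile = profile_from_url(
    "http://www.espncricinfo.com/india/content/player/253802.html"
)
print(kohli_profile)  # 253802
```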
Note: The charts are self-explanatory and I have not added much of my own interpretation to them. Do look at the plots closely and check out the performances for yourself.
# Install the package
# Do a pip install cricpy
# Import cricpy
import cricpy.analytics as ca
## C:\Users\Ganesh\ANACON~1\lib\site-packages\statsmodels\compat\pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
## from pandas.core import datetools
import cricpy.analytics as ca
ca.batsman4s("./kohli.csv","Virat Kohli")
## C:\Users\Ganesh\ANACON~1\lib\site-packages\cricpy\analytics.py:83: FutureWarning: reshape is deprecated and will raise in a subsequent release. Please use .values.reshape(...) instead
## runsPoly = poly.fit_transform(runs.reshape(-1,1))
import cricpy.analytics as ca
help(ca.getPlayerDataTT)
## Help on function getPlayerDataTT in module cricpy.analytics:
##
## getPlayerDataTT(profile, opposition='', host='', dir='./data', file='player001.csv', type='batting', homeOrAway=[1, 2, 3], result=[1, 2, 3, 5], create=True)
## Get the Twenty20 International player data from ESPN Cricinfo based on specific inputs and store in a file in a given directory~
##
## Description
##
## Get the Twenty20 player data given the profile of the batsman/bowler. The allowed inputs are home,away, neutralboth and won,lost,tied or no result of matches. The data is stored in a <player>.csv file in a directory specified. This function also returns a data frame of the player
##
## Usage
##
## getPlayerDataTT(profile, opposition="",host="",dir = "./data", file = "player001.csv",
## type = "batting", homeOrAway = c(1, 2, 3), result = c(1, 2, 3,5))
## Arguments
##
## profile
## This is the profile number of the player to get data. This can be obtained from http://www.espncricinfo.com/ci/content/player/index.html. Type the name of the player and click search. This will display the details of the player. Make a note of the profile ID. For e.g For Virat Kohli this turns out to be 253802 http://www.espncricinfo.com/india/content/player/35263.html. Hence the profile for Sehwag is 35263
## opposition
## The numerical value of the opposition country e.g.Australia,India, England etc. The values are Afghanistan:40,Australia:2,Bangladesh:25,England:1,Hong Kong:19,India:6,Ireland:29, New Zealand:5,Pakistan:7,Scotland:30,South Africa:3,Sri Lanka:8,United Arab Emirates:27, West Indies:4, Zimbabwe:9; Note: If no value is entered for opposition then all teams are considered
## host
## The numerical value of the host country e.g.Australia,India, England etc. The values are Australia:2,Bangladesh:25,England:1,India:6,New Zealand:5, South Africa:3,Sri Lanka:8,United States of America:11,West Indies:4, Zimbabwe:9 Note: If no value is entered for host then all host countries are considered
## dir
## Name of the directory to store the player data into. If not specified the data is stored in a default directory "./data". Default="./data"
## file
## Name of the file to store the data into for e.g. kohli.csv. This can be used for subsequent functions. Default="player001.csv"
## type
## type of data required. This can be "batting" or "bowling"
## homeOrAway
## This is vector with either or all 1,2, 3. 1 is for home 2 is for away, 3 is for neutral venue
## result
## This is a vector that can take values 1,2,3,5. 1 - won match 2- lost match 3-tied 5- no result
## Details
##
## More details can be found in my short video tutorial in Youtube https://www.youtube.com/watch?v=q9uMPFVsXsI
##
## Value
##
## Returns the player's dataframe
##
## Note
##
## Maintainer: Tinniam V Ganesh <tvganesh.85@gmail.com>
##
## Author(s)
##
## Tinniam V Ganesh
##
## References
##
## http://www.espncricinfo.com/ci/content/stats/index.html
## https://gigadom.wordpress.com/
##
## See Also
##
## bowlerWktRateTT getPlayerData
##
## Examples
##
## ## Not run:
## # Only away. Get data only for won and lost innings
## kohli =getPlayerDataTT(253802,dir="../cricketr/data", file="kohli1.csv",
## type="batting")
##
## # Get bowling data and store in file for future
## ashwin = getPlayerDataTT(26421,dir="../cricketr/data",file="ashwin1.csv",
## type="bowling")
##
## kohli =getPlayerDataTT(253802,opposition = 2,host=2,dir="../cricketr/data",
## file="kohli1.csv",type="batting")
The details below will introduce the different functions that are available in cricpy.
Important Note: This needs to be done only once for a player. This function stores the player’s data in the specified CSV file (for e.g. kohli.csv as above), which can then be reused for all other functions. Once we have the data for the players, many analyses can be done. This post will use the stored CSV file obtained with a prior getPlayerDataTT for all subsequent analyses.
import cricpy.analytics as ca
#kohli=ca.getPlayerDataTT(253802,dir=".",file="kohli.csv",type="batting")
#guptill=ca.getPlayerDataTT(226492,dir=".",file="guptill.csv",type="batting")
#shahzad=ca.getPlayerDataTT(419873,dir=".",file="shahzad.csv",type="batting")
#mccullum=ca.getPlayerDataTT(37737,dir=".",file="mccullum.csv",type="batting")
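Since the download only needs to happen once per player, one option is to guard the call with a file check. The sketch below is an assumption about how you might organise this, not something cricpy provides: `fetch_once` and `fake_download` are hypothetical names, and in real use the callable would wrap the `ca.getPlayerDataTT` call shown above.

```python
import os
import tempfile

def fetch_once(path, fetch):
    """Run fetch(path) only if `path` is missing; return True when it ran."""
    if os.path.exists(path):
        return False
    fetch(path)
    return True

def fake_download(path):
    # Stand-in for the real download; writes a tiny made-up CSV.
    with open(path, "w") as handle:
        handle.write("Runs,BF\n26,21\n")

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "player.csv")
    first_call = fetch_once(target, fake_download)   # file missing, so it downloads
    second_call = fetch_once(target, fake_download)  # file present, so it is skipped

print(first_call, second_call)  # True False
```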
Included below are some of the functions that can be used for Twenty20 batsmen and bowlers. For this I have chosen Virat Kohli, ‘the run machine’, who is on track to break many of the Test, ODI and Twenty20 records.
The 3 plots below provide the following for Virat Kohli in T20s
import cricpy.analytics as ca
import matplotlib.pyplot as plt
ca.batsmanRunsFreqPerf("./kohli.csv","Virat Kohli")
ca.batsmanMeanStrikeRate("./kohli.csv","Virat Kohli")
ca.batsmanRunsRanges("./kohli.csv","Virat Kohli")
import cricpy.analytics as ca
ca.batsman4s("./kohli.csv","Virat Kohli")
ca.batsman6s("./kohli.csv","Virat Kohli")
ca.batsmanDismissals("./kohli.csv","Virat Kohli")
ca.batsmanScoringRateODTT("./kohli.csv","Virat Kohli")
## C:\Users\Ganesh\ANACON~1\lib\site-packages\cricpy\analytics.py:3620: FutureWarning: reshape is deprecated and will raise in a subsequent release. Please use .values.reshape(...) instead
## bfPoly = poly.fit_transform(bf.reshape(-1,1))
The plot below shows a 3D scatter plot of Kohli’s Runs versus Balls Faced and Minutes at crease. A linear regression plane is then fitted between Runs and Balls Faced + Minutes at crease.
import cricpy.analytics as ca
ca.battingPerf3d("./kohli.csv","Virat Kohli")
## C:\Users\Ganesh\ANACON~1\lib\site-packages\cricpy\analytics.py:1569: SettingWithCopyWarning:
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
##
## See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
## df2['BF']=pd.to_numeric(df2['BF'])
## C:\Users\Ganesh\ANACON~1\lib\site-packages\cricpy\analytics.py:1570: SettingWithCopyWarning:
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
##
## See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
## df2['Mins']=pd.to_numeric(df2['Mins'])
## C:\Users\Ganesh\ANACON~1\lib\site-packages\cricpy\analytics.py:1571: SettingWithCopyWarning:
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
##
## See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
## df2['Runs']=pd.to_numeric(df2['Runs'])
The plot below gives the average runs scored by Kohli at different grounds. The plot also shows the number of innings at each ground as a label on the x-axis.
import cricpy.analytics as ca
ca.batsmanAvgRunsGround("./kohli.csv","Virat Kohli")
This plot computes the average runs scored by Kohli against different countries.
import cricpy.analytics as ca
ca.batsmanAvgRunsOpposition("./kohli.csv","Virat Kohli")
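To make the grouping behind these per-ground and per-opposition averages concrete, here is a minimal plain-Python sketch. It is not cricpy's code: the function name, the `Opposition` and `Runs` keys, and the sample innings are all made up for illustration, and "average" here means mean runs per innings.

```python
from collections import defaultdict

def average_runs_by_group(innings, key):
    """Map each group (e.g. opposition) to (mean runs, number of innings)."""
    totals = defaultdict(lambda: [0, 0])  # group -> [total runs, innings]
    for row in innings:
        bucket = totals[row[key]]
        bucket[0] += row["Runs"]
        bucket[1] += 1
    return {name: (total / count, count) for name, (total, count) in totals.items()}

# Made-up sample innings, for illustration only.
sample = [
    {"Opposition": "Australia", "Runs": 90},
    {"Opposition": "Australia", "Runs": 50},
    {"Opposition": "Pakistan", "Runs": 55},
]
by_opposition = average_runs_by_group(sample, "Opposition")
print(by_opposition)  # {'Australia': (70.0, 2), 'Pakistan': (55.0, 1)}
```

Passing a different key, such as a ground column, would give the per-ground version.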
The plot below shows the Runs Likelihood for a batsman. For this, the performance of Kohli is plotted as a 3D scatter plot with Runs versus Balls Faced + Minutes at crease. Using K-Means, the centroids of 3 clusters are computed and plotted; these centroids represent Kohli’s highest tendencies.
import cricpy.analytics as ca
ca.batsmanRunsLikelihood("./kohli.csv","Virat Kohli")
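For readers unfamiliar with K-Means, here is a bare-bones version of the algorithm (Lloyd's iteration) in plain Python. It is only meant to show what a centroid is. It is not cricpy's implementation, the (Balls Faced, Minutes, Runs) points are made up, and seeding the centroids with the first k points is a simplification.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: put every point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids

# Made-up (BF, Mins, Runs) triples forming three obvious groups.
points = [(5, 8, 4), (25, 35, 30), (50, 70, 80),
          (7, 10, 6), (27, 37, 34), (52, 74, 84)]
centres = kmeans(points, 3)
print(centres)  # [(6.0, 9.0, 5.0), (26.0, 36.0, 32.0), (51.0, 72.0, 82.0)]
```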
The following batsmen have been very prolific in Twenty20 cricket and will be used for the analyses
The following plots take a closer look at their performances. The box plots show the median and the 1st and 3rd quartiles of the runs.
This plot shows a combined boxplot of the Runs ranges and a histogram of the Runs Frequency
import cricpy.analytics as ca
ca.batsmanPerfBoxHist("./kohli.csv","Virat Kohli")
ca.batsmanPerfBoxHist("./guptill.csv","M J Guptill")
ca.batsmanPerfBoxHist("./shahzad.csv","M Shahzad")
ca.batsmanPerfBoxHist("./mccullum.csv","BB McCullum")
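The three numbers a box plot is built on can be computed with the standard library alone. This is a sketch on made-up scores, using the default ("exclusive") method of `statistics.quantiles`; plotting libraries may interpolate slightly differently, so the box edges in the charts need not match exactly.

```python
import statistics

# Made-up innings scores, for illustration only.
runs = [4, 12, 26, 33, 45, 58, 90]

# quantiles(n=4) returns the three cut points: Q1, median, Q3.
q1, median, q3 = statistics.quantiles(runs, n=4)
print(q1, median, q3)  # 12.0 33.0 58.0
```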
Take a look at the Moving Average across the career of the Top 4 Twenty20 batsmen.
import cricpy.analytics as ca
ca.batsmanMovingAverage("./kohli.csv","Virat Kohli")
ca.batsmanMovingAverage("./guptill.csv","M J Guptill")
#ca.batsmanMovingAverage("./shahzad.csv","M Shahzad") # Gives error. Check!
ca.batsmanMovingAverage("./mccullum.csv","BB McCullum")
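As a reminder of what a moving average does, here is a minimal sketch of a simple (unweighted) one over a list of innings scores. The function name and the scores are made up, and cricpy's own smoothing may well differ from this.

```python
def moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i - window:i]) / window
        for i in range(window, len(values) + 1)
    ]

runs = [10, 40, 70, 20, 60]  # made-up scores
smoothed = moving_average(runs, 3)
print(len(smoothed), smoothed[0], smoothed[-1])  # 3 40.0 50.0
```

A wider window gives a smoother curve at the cost of fewer points and a slower reaction to changes in form.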
This function provides the cumulative average runs of the batsman over the career. Kohli’s average peaks at about 45 runs at around 43 innings, though there is a downward dip.
import cricpy.analytics as ca
ca.batsmanCumulativeAverageRuns("./kohli.csv","Virat Kohli")
ca.batsmanCumulativeAverageRuns("./guptill.csv","M J Guptill")
ca.batsmanCumulativeAverageRuns("./shahzad.csv","M Shahzad")
ca.batsmanCumulativeAverageRuns("./mccullum.csv","BB McCullum")
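The cumulative average at innings n is just the mean of the first n scores. A minimal sketch, with a made-up function name and made-up scores (this treats every innings as a completed one, which may not match how cricpy handles not-outs):

```python
from itertools import accumulate

def cumulative_average(values):
    """Running mean: element n is the mean of the first n values."""
    return [total / n for n, total in enumerate(accumulate(values), start=1)]

career = cumulative_average([20, 40, 90])
print(career)  # [20.0, 30.0, 50.0]
```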
Kohli, Guptill and McCullum average a strike rate of 125+
import cricpy.analytics as ca
ca.batsmanCumulativeStrikeRate("./kohli.csv","Virat Kohli")
ca.batsmanCumulativeStrikeRate("./guptill.csv","M J Guptill")
ca.batsmanCumulativeStrikeRate("./shahzad.csv","M Shahzad")
ca.batsmanCumulativeStrikeRate("./mccullum.csv","BB McCullum")
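Strike rate is runs scored per 100 balls faced. One way to accumulate it over a career is to divide running totals, as sketched below. The function name and numbers are made up, and I am assuming the ratio-of-totals definition here; averaging the per-innings strike rates instead would give different values.

```python
def cumulative_strike_rate(runs, balls):
    """Running strike rate: 100 * total runs / total balls faced so far."""
    rates = []
    total_runs = total_balls = 0
    for scored, faced in zip(runs, balls):
        total_runs += scored
        total_balls += faced
        rates.append(100 * total_runs / total_balls if total_balls else 0.0)
    return rates

rates = cumulative_strike_rate([10, 30, 50], [10, 20, 30])
print(rates[0], rates[-1])  # 100.0 150.0
```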
The plot below compares the relative cumulative average runs of the batsmen. Kohli is way above the other 3 batsmen. Behind Kohli is McCullum and then Guptill.
import cricpy.analytics as ca
frames = ["./kohli.csv","./guptill.csv","./shahzad.csv","./mccullum.csv"]
names = ["Kohli","Guptill","Shahzad","McCullum"]
ca.relativeBatsmanCumulativeAvgRuns(frames,names)
The plot below compares the relative cumulative strike rates of the batsmen. It shows that Kohli tops the overall strike rate, followed by McCullum and then Guptill.
import cricpy.analytics as ca
frames = ["./kohli.csv","./guptill.csv","./shahzad.csv","./mccullum.csv"]
names = ["Kohli","Guptill","Shahzad","McCullum"]
ca.relativeBatsmanCumulativeStrikeRate(frames,names)
The plot is a scatter plot of Runs vs Balls faced and Minutes at Crease. A 3D prediction plane is fitted
import cricpy.analytics as ca
ca.battingPerf3d("./kohli.csv","Virat Kohli")
ca.battingPerf3d("./guptill.csv","M J Guptill")
ca.battingPerf3d("./shahzad.csv","M Shahzad")
ca.battingPerf3d("./mccullum.csv","BB McCullum")
Guptill and McCullum have a large percentage of sixes in comparison to the 4s. Kohli has a relatively lower number of 6s.
import cricpy.analytics as ca
frames = ["./kohli.csv","./guptill.csv","./shahzad.csv","./mccullum.csv"]
names = ["Kohli","Guptill","Shahzad","McCullum"]
ca.batsman4s6s(frames,names)
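To show the arithmetic behind a 4s-versus-6s comparison, here is a sketch that computes the percentage of a batsman's runs that came from fours and from sixes. The function name and the per-innings figures are made up, and this is my reading of the comparison rather than cricpy's actual code.

```python
def boundary_share(runs, fours, sixes):
    """Return (% of runs from 4s, % of runs from 6s) over all innings."""
    total = sum(runs)
    if total == 0:
        return (0.0, 0.0)
    return (100 * 4 * sum(fours) / total, 100 * 6 * sum(sixes) / total)

# Made-up figures: three innings with runs, number of 4s, number of 6s.
share = boundary_share(runs=[50, 30, 20], fours=[5, 2, 1], sixes=[2, 1, 0])
print(share)  # (32.0, 18.0)
```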
A multivariate regression plane is fitted between Runs and Balls faced + Minutes at crease.
import cricpy.analytics as ca
import numpy as np
import pandas as pd
BF = np.linspace( 10, 400,15)
Mins = np.linspace( 30,600,15)
newDF= pd.DataFrame({'BF':BF,'Mins':Mins})
kohli= ca.batsmanRunsPredict("./kohli.csv",newDF,"Kohli")
## C:\Users\Ganesh\ANACON~1\lib\site-packages\cricpy\analytics.py:1398: SettingWithCopyWarning:
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
##
## See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
## df['BF']=pd.to_numeric(df['BF'])
## C:\Users\Ganesh\ANACON~1\lib\site-packages\cricpy\analytics.py:1399: SettingWithCopyWarning:
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
##
## See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
## df['Runs']=pd.to_numeric(df['Runs'])
print(kohli)
## BF Mins Runs
## 0 10.000000 30.000000 14.753153
## 1 37.857143 70.714286 55.963333
## 2 65.714286 111.428571 97.173513
## 3 93.571429 152.142857 138.383693
## 4 121.428571 192.857143 179.593873
## 5 149.285714 233.571429 220.804053
## 6 177.142857 274.285714 262.014233
## 7 205.000000 315.000000 303.224414
## 8 232.857143 355.714286 344.434594
## 9 260.714286 396.428571 385.644774
## 10 288.571429 437.142857 426.854954
## 11 316.428571 477.857143 468.065134
## 12 344.285714 518.571429 509.275314
## 13 372.142857 559.285714 550.485494
## 14 400.000000 600.000000 591.695674
The fitted model is then used to predict the runs that the batsmen will score for a given Balls faced and Minutes at crease.
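For the curious, fitting such a plane amounts to ordinary least squares with two predictors. The sketch below solves the normal equations in plain Python on made-up data that lies exactly on a known plane. The names `fit_plane` and `predict` are hypothetical, and this is not how cricpy fits its model internally. Keep in mind that predictions for inputs far outside the range seen in the data are extrapolations.

```python
def fit_plane(bf, mins, runs):
    """Least-squares fit of runs = a + b*bf + c*mins via the normal equations."""
    rows = [(1.0, x, y) for x, y in zip(bf, mins)]
    # Augmented 3x4 matrix for (X^T X) beta = X^T y.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * target for r, target in zip(rows, runs))]
        for i in range(3)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    # Fails with ZeroDivisionError if bf and mins are perfectly collinear.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(3):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][3] / aug[i][i] for i in range(3)]

def predict(coeffs, bf, mins):
    a, b, c = coeffs
    return a + b * bf + c * mins

# Made-up data generated from runs = 2 + 1.0*bf + 0.5*mins.
bf = [10, 20, 30, 40]
mins = [15, 45, 35, 70]
runs = [2 + x + 0.5 * y for x, y in zip(bf, mins)]
coeffs = fit_plane(bf, mins, runs)
estimate = predict(coeffs, 25, 40)  # close to 2 + 25 + 20 = 47
```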
The following 4 bowlers have had an excellent career and will be used for the analysis
The plot below computes the percentage frequency of the number of wickets taken, e.g. 1 wicket x%, 2 wickets y%, etc., and plots them as a continuous line
import cricpy.analytics as ca
#shakib=ca.getPlayerDataTT(56143,dir=".",file="shakib.csv",type="bowling")
#nabi=ca.getPlayerDataTT(25913,dir=".",file="nabi.csv",type="bowling")
#rashid=ca.getPlayerDataTT(793463,dir=".",file="rashid.csv",type="bowling")
#tahir=ca.getPlayerDataTT(40618,dir=".",file="tahir.csv",type="bowling")
This plot below plots the frequency of wickets taken for each of the bowlers
import cricpy.analytics as ca
ca.bowlerWktsFreqPercent("./shakib.csv","Shakib Al Hasan")
ca.bowlerWktsFreqPercent("./nabi.csv","Mohammad Nabi")
ca.bowlerWktsFreqPercent("./rashid.csv","Rashid Khan")
ca.bowlerWktsFreqPercent("./tahir.csv","Imran Tahir")
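The percentage frequency itself is a simple count. Here is a minimal sketch on made-up wicket hauls; the function name is hypothetical and this is not cricpy's code.

```python
from collections import Counter

def wicket_frequency_percent(wickets):
    """Map each wicket haul to the percentage of innings in which it occurred."""
    counts = Counter(wickets)
    innings = len(wickets)
    return {haul: 100 * count / innings for haul, count in sorted(counts.items())}

# Made-up wickets per innings, for illustration only.
freq = wicket_frequency_percent([0, 1, 1, 2, 1, 3, 0, 2])
print(freq)  # {0: 25.0, 1: 37.5, 2: 25.0, 3: 12.5}
```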
The plot below creates a box plot showing the 1st and 3rd quartiles of runs conceded versus the number of wickets taken.
import cricpy.analytics as ca
ca.bowlerWktsRunsPlot("./shakib.csv","Shakib Al Hasan")
ca.bowlerWktsRunsPlot("./nabi.csv","Mohammad Nabi")
ca.bowlerWktsRunsPlot("./rashid.csv","Rashid Khan")
ca.bowlerWktsRunsPlot("./tahir.csv","Imran Tahir")
The plots give the average wickets taken by each bowler at different venues.
import cricpy.analytics as ca
ca.bowlerAvgWktsGround("./shakib.csv","Shakib Al Hasan")
ca.bowlerAvgWktsGround("./nabi.csv","Mohammad Nabi")
ca.bowlerAvgWktsGround("./rashid.csv","Rashid Khan")
ca.bowlerAvgWktsGround("./tahir.csv","Imran Tahir")
The plots give the average wickets taken by each bowler against different countries. The x-axis also includes the number of innings against each team.
import cricpy.analytics as ca
ca.bowlerAvgWktsOpposition("./shakib.csv","Shakib Al Hasan")
ca.bowlerAvgWktsOpposition("./nabi.csv","Mohammad Nabi")
ca.bowlerAvgWktsOpposition("./rashid.csv","Rashid Khan")
ca.bowlerAvgWktsOpposition("./tahir.csv","Imran Tahir")
The plots below show the moving average of wickets taken by each bowler over his career.
import cricpy.analytics as ca
ca.bowlerMovingAverage("./shakib.csv","Shakib Al Hasan")
ca.bowlerMovingAverage("./nabi.csv","Mohammad Nabi")
ca.bowlerMovingAverage("./rashid.csv","Rashid Khan")
ca.bowlerMovingAverage("./tahir.csv","Imran Tahir")
The plots below give the cumulative average wickets taken by the bowlers. Rashid Khan has been the most effective with almost 2.28 wickets per match
import cricpy.analytics as ca
ca.bowlerCumulativeAvgWickets("./shakib.csv","Shakib Al Hasan")
ca.bowlerCumulativeAvgWickets("./nabi.csv","Mohammad Nabi")
ca.bowlerCumulativeAvgWickets("./rashid.csv","Rashid Khan")
ca.bowlerCumulativeAvgWickets("./tahir.csv","Imran Tahir")
The plots below give the cumulative average economy rate of the bowlers. Rashid Khan has the best economy rate, followed by Mohammad Nabi.
import cricpy.analytics as ca
ca.bowlerCumulativeAvgEconRate("./shakib.csv","Shakib Al Hasan")
ca.bowlerCumulativeAvgEconRate("./nabi.csv","Mohammad Nabi")
ca.bowlerCumulativeAvgEconRate("./rashid.csv","Rashid Khan")
ca.bowlerCumulativeAvgEconRate("./tahir.csv","Imran Tahir")
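Economy rate is runs conceded per over. The sketch below accumulates it over a career from running totals, converting scorecard notation such as 3.4 (3 overs and 4 balls) into balls first. The function names and figures are made up, six-ball overs are assumed, and the ratio-of-totals definition is my assumption rather than a statement of what cricpy computes.

```python
def overs_to_balls(overs):
    """Convert scorecard notation such as '3.4' (3 overs, 4 balls) to balls."""
    whole, _, part = str(overs).partition(".")
    return int(whole) * 6 + int(part or 0)

def cumulative_economy(overs, runs_conceded):
    """Running economy rate: total runs conceded per six balls bowled so far."""
    rates = []
    total_balls = total_runs = 0
    for spell, conceded in zip(overs, runs_conceded):
        total_balls += overs_to_balls(spell)
        total_runs += conceded
        rates.append(6 * total_runs / total_balls if total_balls else 0.0)
    return rates

# Made-up spells: overs bowled and runs conceded in three matches.
economy = cumulative_economy(["4", "3.4", "4"], [24, 22, 30])
print(overs_to_balls("3.4"), economy[0], economy[1])  # 22 6.0 6.0
```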
The relative cumulative economy rate is given below. It can be seen that Rashid Khan has the best economy rate, followed by Mohammad Nabi and then Imran Tahir.
import cricpy.analytics as ca
frames = ["./shakib.csv","./nabi.csv","./rashid.csv","tahir.csv"]
names = ["Shakib Al Hasan","Mohammad Nabi","Rashid Khan", "Imran Tahir"]
ca.relativeBowlerCumulativeAvgEconRate(frames,names)
Rashid Khan has the best figures for hauls of between 2 and 3.5 wickets. Mohammad Nabi pips Rashid Khan when he takes a haul of 4 wickets.
import cricpy.analytics as ca
frames = ["./shakib.csv","./nabi.csv","./rashid.csv","tahir.csv"]
names = ["Shakib Al Hasan","Mohammad Nabi","Rashid Khan", "Imran Tahir"]
ca.relativeBowlingER(frames,names)
Rashid has the best performance with cumulative average wickets. He is followed by Imran Tahir in the wicket haul, followed by Shakib Al Hasan
import cricpy.analytics as ca
frames = ["./shakib.csv","./nabi.csv","./rashid.csv","tahir.csv"]
names = ["Shakib Al Hasan","Mohammad Nabi","Rashid Khan", "Imran Tahir"]
ca.relativeBowlerCumulativeAvgWickets(frames,names)
The plots above capture some of the capabilities and features of my cricpy package. Feel free to install the package and try it out. Please do keep in mind ESPN Cricinfo’s Terms of Use.
Here are the main findings from the analysis above
The analysis of the Top 4 Twenty20 batsmen Kohli, Guptill, Shahzad and McCullum