The way that data is shared online is a constant source of unnecessary frustration. Last weekend, while watching the Test match between England and South Africa, I became curious about the characteristics of batsmen’s innings (see my posts on RPubs), and I decided to resolve my curiosity by getting the original data. Now, anyone working with such data must be keeping it neat and tidy in a relational database, with, as a minimum, tables for the individual players and for batting, bowling, fielding and match statistics. Querying the database directly would require some knowledge of the relational structure, so it makes sense to help users by providing a web interface to run the queries. ESPN’s Cricinfo does just that.

But why does it have to produce tables that must be downloaded page by page? And why does the formatted data mix numbers with text? Instead of giving the total runs scored, a column with the number of wickets, and a flag for whether the innings was declared or abandoned, Cricinfo “helpfully” gives the score in the conventional format, such as 625/7 d for 625 for seven declared. The result was that it took me the best part of last Sunday afternoon to get some very routine data tables into shape using regular expressions, grep and gsub.
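To give a flavour of the kind of clean-up involved, here is a minimal base R sketch that splits a conventional score string into numbers and a flag. It is not the code used to build the package; the function name parse_score and the Wickets column are my own illustrative choices, and it assumes scores look like "849", "952/6d" or "625/7 d".

```r
# Hypothetical helper: split a score such as "625/7 d" into its parts.
parse_score <- function(score) {
  score <- trimws(score)
  # Leading digits are the total runs scored
  total <- as.integer(sub("^([0-9]+).*$", "\\1", score))
  # Digits after the slash are the wickets lost; left as NA when no slash is given
  has_wkts <- grepl("/", score, fixed = TRUE)
  wickets <- ifelse(has_wkts,
                    as.integer(sub("^[0-9]+/([0-9]+).*$", "\\1", score)),
                    NA_integer_)
  # A trailing "d" marks a declaration
  declared <- grepl("d$", score)
  data.frame(Score = score, Total = total, Wickets = wickets,
             declared = declared, stringsAsFactors = FALSE)
}

parsed <- parse_score(c("952/6d", "849", "625/7 d"))
parsed
```

Wickets is left as NA when the score has no slash, because a bare total does not always mean that ten wickets fell.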

As I now have this data to hand, this Sunday I took another hour out to build a small R package to share the data easily. It has three data objects: batting, bowling and innings. All are up to date as of 5 August 2017 and contain all the relevant data from ESPN’s Cricinfo in a fairly self-explanatory format (if you refer to the original data tables).

Use devtools to install the package, with its very rudimentary documentation.

library(devtools)
install_github("dgolicher/crickdata")

Data tables

library(crickdata)
data(bowling)
head(bowling)
##                Player Overs BPO Mdns Runs Wkts Econ Inns     Opposition
## 1      JC Laker (ENG)  51.2   6   23   53   10 1.03    3    v Australia
## 2    A Kumble (INDIA)  26.3   6    9   74   10 2.79    4     v Pakistan
## 3    GA Lohmann (ENG)  14.2   5    6   28    9 2.33    2 v South Africa
## 4      JC Laker (ENG)  16.4   6    4   37    9 2.22    2    v Australia
## 5 M Muralitharan (SL)  40.0   6   19   51    9 1.27    1     v Zimbabwe
## 6  Sir RJ Hadlee (NZ)  23.4   6    4   52    9 2.19    1    v Australia
##         Ground  Start.Date Country       Date Year Day Month Yday type
## 1   Manchester 26 Jul 1956     ENG 1956-07-26 1956  26     7  208 Test
## 2        Delhi  4 Feb 1999   INDIA 1999-02-04 1999   4     2   35 Test
## 3 Johannesburg  2 Mar 1896     ENG 1896-03-02 1896   2     3   62 Test
## 4   Manchester 26 Jul 1956     ENG 1956-07-26 1956  26     7  208 Test
## 5        Kandy  4 Jan 2002      SL 2002-01-04 2002   4     1    4 Test
## 6     Brisbane  8 Nov 1985      NZ 1985-11-08 1985   8    11  312 Test
data(batting)
head(batting)
##                  Player Runs Mins  BF    SR Inns     Opposition
## 1          BC Lara (WI)  400  778 582 68.72    1      v England
## 2       ML Hayden (AUS)  380  622 437 86.95    1     v Zimbabwe
## 3          BC Lara (WI)  375  766 538 69.70    1      v England
## 4 DPMD Jayawardene (SL)  374  752 572 65.38    2 v South Africa
## 5        GS Sobers (WI)  365  614  NA    NA    2     v Pakistan
## 6        L Hutton (ENG)  364  797 847 42.97    1    v Australia
##          Ground  Start.Date Country Notout       Date Year Day Month Yday
## 1     St John's 10 Apr 2004      WI   TRUE 2004-04-10 2004  10     4  101
## 2         Perth  9 Oct 2003     AUS        2003-10-09 2003   9    10  282
## 3     St John's 16 Apr 1994      WI        1994-04-16 1994  16     4  106
## 4 Colombo (SSC) 27 Jul 2006      SL        2006-07-27 2006  27     7  208
## 5      Kingston 26 Feb 1958      WI   TRUE 1958-02-26 1958  26     2   57
## 6      The Oval 20 Aug 1938     ENG        1938-08-20 1938  20     8  232
##   Fours Sixs type
## 1    43    4 Test
## 2    38   11 Test
## 3    45    0 Test
## 4    43    1 Test
## 5    38    0 Test
## 6    35    0 Test
data(innings)
head(innings)
##          Team  Score Overs  RPO Lead Inns Result    Opposition
## 1   Sri Lanka 952/6d 271.0 3.51  415    2   draw       v India
## 2     England 903/7d 335.2 2.69  903    1    won   v Australia
## 3     England    849 258.2 3.28  849    1   draw v West Indies
## 4 West Indies 790/3d 208.1 3.79  462    2    won    v Pakistan
## 5    Pakistan 765/6d 248.5 3.07  121    2   draw   v Sri Lanka
## 6   Sri Lanka 760/7d 202.4 3.75  334    2   draw       v India
##          Ground  Start.Date Total       Date Year Day Month Yday type
## 1 Colombo (RPS)  2 Aug 1997   952 1997-08-02 1997   2     8  214 Test
## 2      The Oval 20 Aug 1938   903 1938-08-20 1938  20     8  232 Test
## 3      Kingston  3 Apr 1930   849 1930-04-03 1930   3     4   93 Test
## 4      Kingston 26 Feb 1958   790 1958-02-26 1958  26     2   57 Test
## 5       Karachi 21 Feb 2009   765 2009-02-21 2009  21     2   52 Test
## 6     Ahmedabad 16 Nov 2009   760 2009-11-16 2009  16    11  320 Test
##   declared
## 1     TRUE
## 2     TRUE
## 3    FALSE
## 4     TRUE
## 5     TRUE
## 6     TRUE
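
The extra columns in these tables (Country, Date, Year, Day, Month, Yday) can all be derived from Player and Start.Date. The sketch below shows one way to do that in base R, using three rows copied from the bowling output above. It is an assumption about how such columns can be built, not the package's actual code, and it assumes an English locale so that %b recognises month abbreviations such as "Jul".

```r
# Rows copied from the head(bowling) output above
players <- c("JC Laker (ENG)", "A Kumble (INDIA)", "M Muralitharan (SL)")
start_date <- c("26 Jul 1956", "4 Feb 1999", "4 Jan 2002")

# The team code is the text inside the final pair of parentheses
country <- sub("^.*\\(([^()]*)\\)$", "\\1", players)

# %b parses abbreviated month names in the current locale (English assumed)
date <- as.Date(start_date, format = "%d %b %Y")

derived <- data.frame(
  Player = players,
  Country = country,
  Date = date,
  Year = as.integer(format(date, "%Y")),
  Month = as.integer(format(date, "%m")),
  Day = as.integer(format(date, "%d")),
  Yday = as.integer(format(date, "%j")),  # day of the year, 1 to 366
  stringsAsFactors = FALSE
)
derived
```

With real dates rather than text, filtering by season or plotting against time becomes straightforward.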

Updating

I also added three functions for obtaining the latest data. They all take two arguments, the year and the number of additional pages (each page is fifty records) that will be needed to obtain all the data since 5 August 2017.

d<-latest_batting(2017,npages=1)
head(d)
##              Player Runs Mins  BF 4s 6s     SR Inns   Opposition    Ground
## 1  S Dhawan (INDIA)  119    - 123 17  0  96.74    1  v Sri Lanka Pallekele
## 2 HH Pandya (INDIA)  108    -  96  8  7 112.50    1  v Sri Lanka Pallekele
## 3  KL Rahul (INDIA)   85    - 135  8  0  62.96    1  v Sri Lanka Pallekele
## 4 LD Chandimal (SL)   48    -  87  6  0  55.17    2      v India Pallekele
## 5   V Kohli (INDIA)   42    -  84  3  0  50.00    1  v Sri Lanka Pallekele
## 6  R Ashwin (INDIA)   31    -  75  1  0  41.33    1  v Sri Lanka Pallekele
##    Start Date  Country Notout       Date Year Day Month Yday Fours Sixs
## 1 12 Aug 2017    INDIA        2017-08-12 2017  12     8  224    17    0
## 2 12 Aug 2017    INDIA        2017-08-12 2017  12     8  224     8    7
## 3 12 Aug 2017    INDIA        2017-08-12 2017  12     8  224     8    0
## 4 12 Aug 2017       SL        2017-08-12 2017  12     8  224     6    0
## 5 12 Aug 2017    INDIA        2017-08-12 2017  12     8  224     3    0
## 6 12 Aug 2017    INDIA        2017-08-12 2017  12     8  224     1    0
##   type
## 1 Test
## 2 Test
## 3 Test
## 4 Test
## 5 Test
## 6 Test

The same goes for latest_bowling() and latest_innings(). These are very messy functions, as can be seen from their source, but it’s not worth investing any more time tidying them up, as they work at the moment and the site itself is likely to change its interface.

The data can be merged with the main data set using dplyr.

library(dplyr)
data(batting)
d$Mins<-NA_integer_ ## The latest data has no minutes batted ("-"), so Mins is text there; set it to missing so that it binds with the numeric column
batting<-bind_rows(d,batting)
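
Judging by the printed output, the latest data also differs from the stored table in its column names ("Start Date" rather than Start.Date, plus the extra 4s and 6s columns). Below is a minimal base R sketch of tidying the new rows before combining them. The harmonise function is hypothetical and not part of the package, and the two small data frames are toy stand-ins built from rows shown above rather than the full data.

```r
# Toy stand-ins built from rows shown above (not the full package data)
batting_old <- data.frame(
  Player = c("BC Lara (WI)", "ML Hayden (AUS)"),
  Runs = c(400L, 380L),
  Mins = c(778L, 622L),
  Start.Date = c("10 Apr 2004", "9 Oct 2003"),
  stringsAsFactors = FALSE
)
latest <- data.frame(
  Player = c("S Dhawan (INDIA)", "HH Pandya (INDIA)"),
  Runs = c(119L, 108L),
  Mins = c("-", "-"),
  `Start Date` = c("12 Aug 2017", "12 Aug 2017"),
  `4s` = c(17L, 8L),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: make the new rows match the stored table
harmonise <- function(new, old) {
  # "Start Date" becomes "Start.Date"; "4s" becomes "X4s"
  names(new) <- make.names(names(new))
  # "-" means the minutes were not recorded, so store it as missing
  new$Mins[new$Mins == "-"] <- NA
  new$Mins <- as.integer(new$Mins)
  # Keep only the columns the stored table has, in the same order
  new[, intersect(names(old), names(new)), drop = FALSE]
}

combined <- rbind(harmonise(latest, batting_old), batting_old)
combined
```

The harmonised rows can equally be passed to bind_rows() in place of rbind().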

So I hope that might save someone an afternoon’s work if they are looking for a quick way to pull down all the available cricket data.