Introduction, Background


Cricket has a massive following in the countries of the British Commonwealth and in some other countries as well. The game has several formats, and the one day international (ODI) is one of them. ODIs are mainly played in bilateral series and other tournaments. In a match, one team bats first and accumulates runs while the other team bowls and fields. Then the teams switch roles. The team with the higher total wins.

Some players specialize in batting, that is, in scoring runs. These players are traditionally known as “batsmen”. The batting records of batsmen can be a gold mine for a statistician or data scientist, with an endless supply of research questions. Some examples:

The above list is not comprehensive. Anybody can come up with their own research question and apply a suitable statistical method to it. But regardless of the research topic and the method of analysis, one certainly needs data.

ESPNcricinfo [1] (also known as Cricinfo) is a website dedicated to the game of cricket. It contains news, columns, blogs, live scores and live commentary: almost everything related to the game. It hosts a large amount of data, player profiles, interactive insights and so on. Additionally, it has a searchable cricket statistics database (called Statsguru [2]) covering Tests, ODIs, Twenty20 Internationals and women’s cricket. One can submit a query to pull the ODI batting data for a certain player. Several options are offered for applying specific filters or for getting the data in a specific format.

In this note, I share how I pulled up-to-date records for about 100 batsmen from Statsguru. This is part of a larger project in which I am applying survival analysis to predict career milestones of certain batsmen. I wrote R code that connects to the Statsguru query page and downloads the data. I then manipulated the data so that the final product (a clean csv file) has the shape I want.

Acknowledgments: My research supervisor, Dr. Brandenburger [3], has provided constant guidance and support throughout the process.


Program Architecture


Manual inspection revealed that Cricinfo assigns a unique ID to each player. This program revolves around that ID. Information about a single player is spread over several Cricinfo pages, but the urls of those pages share a generic structure, and the ID is usually the only part that differs between players. For example, 303669 is the ID of Joe Root (an English international cricketer) and 7133 is the ID of Ricky Ponting (a former Australian international cricketer).

For each player, Cricinfo has a main page. This page contains general information about the player and a career summary. Detailed records are hosted in the Statsguru database. This is an interactive page, and the user may submit a query using different filters.

The program presented here takes the player ID (set by Cricinfo) as input. It then pulls data from two different sources: the player’s main page and the Statsguru database.

When submitting a query to the Statsguru database, one can impose filters on the following criteria:

For the current work, we are interested in the ODI batting records of the players. The table returned by the query does not have a variable for venue type or match result. For the time being, we do not use the match result, but we are interested in the venue type. So the program was written in such a way that it retains that information.


R Code


Separate functions were written for specific jobs. Most of them take the player ID (set by Cricinfo) as argument. These functions are called from the main function. Some data manipulation is also performed in the main function, so that the result is a clean csv file in the shape we want.


Including the venue variable


As stated above, the table returned by the submitted query does not say whether an innings was played at a home, away or neutral ground for the player of interest. But if a venue type is checked in the query form, the query returns only the innings for that venue type. Fortunately, the urls have a generic format and only a single number changes between home, away and neutral venues. So the program constructs the url for a specific venue type and tags the pulled data with that venue type. The procedure is then repeated for the other venue types.

Let us start the code:

library(rvest)
library(stringr)
library(RCurl)
library(XML)
## function written for venue type
## the argument must be 1, 2 or 3
#########################################################################
venueType=function(x){
  if(x %in% 1:3){
    return(ifelse(x==1, "Home", ifelse(x==2, "Away", "Neutral")))
  }else{
    message("warning: venue type not defined for ",x)
    return(NA)
  }
}

First we load the required packages. The venueType function takes 1, 2 or 3 as argument and returns Home, Away or Neutral respectively; for any other value it prints a message and returns NA. Later we will loop over the three venue types and use this function to fill in the venue variable we want to create.
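
The same mapping can also be written as a vectorised lookup, which may be convenient when a whole column of venue codes has to be labelled at once. The following is only a minimal sketch and not part of the original program; the name venueLabels is hypothetical.

```r
## hypothetical helper, not part of the original program
## maps venue codes 1, 2, 3 to labels; any other code becomes NA
venueLabels <- function(x) {
  labels <- c("Home", "Away", "Neutral")
  out <- rep(NA_character_, length(x))
  ok <- x %in% 1:3          # TRUE only for valid venue codes
  out[ok] <- labels[x[ok]]  # index the label vector by code
  out
}

venueLabels(c(1, 3, 2, 7))
```

Unlike venueType, this version stays silent for undefined codes and simply returns NA for them.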


Construct url of the main page associated with the player of interest


A function was made to construct the main page url of the player of interest. It takes the ID (player ID set by Cricinfo) as input.

consURLmainpage=function(ID){
  url=str_c("http://www.espncricinfo.com/australia/content/player/",ID,".html")
  return(url)
}

One thing to note here. If we go directly to an Indian player’s page, we see india right after http://www.espncricinfo.com/ in the url. But a careful investigation revealed that the site does not depend on this part of the url. For example, if we go to Ricky Ponting’s main page, we see that the url is http://www.espncricinfo.com/australia/content/player/7133.html. But if we change the url to http://www.espncricinfo.com/bangladesh/content/player/7133.html or even http://www.espncricinfo.com/a/content/player/7133.html, we are directed to the same page (Ricky Ponting’s main page). This is why consURLmainpage can hard-code australia for every player.
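
To make the url pattern concrete, here is a small sketch that builds the three variants mentioned above and recovers the player ID from each of them. It only manipulates strings and does not contact the site, so it says nothing about how the site responds today. The helper names consURLwithTeam and idFromURL are hypothetical and are not used elsewhere in the program.

```r
## hypothetical helper: same pattern as consURLmainpage, but with the
## team segment exposed as an argument
consURLwithTeam <- function(ID, team = "australia") {
  paste0("http://www.espncricinfo.com/", team, "/content/player/", ID, ".html")
}

## hypothetical helper: pull the numeric ID back out of a main-page url
idFromURL <- function(url) {
  as.numeric(sub("^.*/content/player/([0-9]+)\\.html$", "\\1", url))
}

urls <- c(consURLwithTeam(7133, "australia"),
          consURLwithTeam(7133, "bangladesh"),
          consURLwithTeam(7133, "a"))
urls
idFromURL(urls)  # the ID is the same whatever the team segment is
```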


Information extracted from the main page


Four pieces of information about a player (full name, national side, batting style, i.e. right-handed or left-handed, and date of birth) are extracted from the main page. Two functions do the job. The following chunk shows the first one.

getConNameStyle=function(batterID){
  ## download and parse the main page of the player
  html=getURL(consURLmainpage(batterID))
  doc = htmlParse(html, asText=TRUE)
  ## text of all paragraph nodes
  plainText=xpathSApply(doc, "//p", xmlValue)
  ## country: first entry after the "Major teams" label, abbreviated
  ConIndex=which(str_detect(plainText, "Major teams"))
  big_name=str_split(plainText[ConIndex],",")[[1]][1]
  Ret1=abbreviate(str_sub(big_name,13,-1),3) ## Country
  ## full name: text after the "Full name" label
  NameIndex=which(str_detect(plainText, "Full name"))
  Ret2=str_sub(plainText[NameIndex],12) ## Name
  ## batting style: text after the "Batting style" label
  StyleIndex=which(str_detect(plainText, "Batting style"))
  Ret3=str_sub(plainText[StyleIndex],15) ## Batting Style
  return(as.list(c(Con=Ret1, Name=Ret2, Style=Ret3)))
}

The getConNameStyle function above takes the player ID as input and returns the country, full name and batting style as a list. It calls consURLmainpage internally, so that function must be defined first.
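
The fixed character offsets used in getConNameStyle depend on the exact layout of the page text. As a possible alternative, the sketch below strips the label with a regular expression, so it does not matter how many separator characters follow the label. The strings in plainText are made up and only imitate the layout; the helper name fieldAfter is hypothetical. Base R is used so that the sketch runs without extra packages.

```r
## made-up strings that only imitate the layout of the page text
plainText <- c("Full name Sanath Teran Jayasuriya",
               "Major teams Sri Lanka, Asia XI",
               "Batting style Left-hand bat")

## hypothetical helper: return the text that follows a given label
fieldAfter <- function(txt, label) {
  hit <- txt[grepl(label, txt, fixed = TRUE)][1]
  trimws(sub(paste0("^.*", label, "[[:space:]]*"), "", hit))
}

fullName <- fieldAfter(plainText, "Full name")
style <- fieldAfter(plainText, "Batting style")
## the first entry of the comma-separated team list is taken as the country
country <- trimws(strsplit(fieldAfter(plainText, "Major teams"), ",")[[1]][1])

c(Con = country, Name = fullName, Style = style)
```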

The second function extracts the player’s date of birth from the main page. This could have been done in the previous function, but we wanted the date of birth as a Date object. When all four values were returned together in one list, the date of birth did not come back in the desired format. So we kept this part as a separate function. It downloads the main page a second time, which is not efficient, but for this job it makes very little difference.

## month abbreviation to month number
swith_Fun= function(x){
  switch(x,Jan=1, Feb=2, Mar=3, Apr=4, May=5, Jun=6, Jul=7,
         Aug=8,Sep=9, Oct=10, Nov=11, Dec=12)
}
getDOBfromID=function(batterID){
  html=getURL(consURLmainpage(batterID))
  doc = htmlParse(html, asText=TRUE)
  plainText=xpathSApply(doc, "//p", xmlValue)
  ## paragraph holding the date of birth
  BornIndex=which(str_detect(plainText,"Born\n \n"))
  complexDate=str_sub(plainText[BornIndex],8)
  ## the year follows the first comma; month and day precede it
  DOB_Year=as.numeric(str_split(complexDate,",")[[1]][2])
  MonDay=str_split(complexDate,",")[[1]][1]
  DOB_Mon=swith_Fun(str_sub(str_split(MonDay," ")[[1]][1],1,3))
  DOB_Day=as.numeric(str_sub(MonDay,-2,-1))
  return(as.Date(str_c(DOB_Year,"-",DOB_Mon,"-",DOB_Day)))
}
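
The month lookup can also be done with the built-in constant month.abb instead of a switch statement. The sketch below parses a date string of the form "Month day, year, place" in base R. The sample string is an assumption about how the text looks after the "Born" label has been removed, and the helper name parseDOB is hypothetical.

```r
## hypothetical helper: parse "Month day, year, ..." into a Date
parseDOB <- function(x) {
  parts <- strsplit(x, ",")[[1]]
  monDay <- strsplit(trimws(parts[1]), " +")[[1]]   # c("June", "30")
  mon <- match(substr(monDay[1], 1, 3), month.abb)  # "Jun" -> 6
  day <- as.integer(monDay[2])
  year <- as.integer(trimws(parts[2]))
  as.Date(sprintf("%04d-%02d-%02d", year, mon, day))
}

## assumed layout of the text, not copied from the site
parseDOB("June 30, 1969, Matara")
```

Because month.abb always holds the English abbreviations, this lookup does not depend on the locale of the R session.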

Construct url for Statsguru database


As stated earlier, we wanted to include the venue variable. We handled this by downloading the data separately for each venue type and filling in the venue field accordingly. The url changes with the venue type, so we wrote a function that constructs the url from the player ID and the venue type.

## make a function for URL
## takes three arguments:
## player ID, venue type and result type
## (resultType is accepted but not used in the url at present)
## returns the url of the data table
MakeURL=function(playerID, venuetype,resultType){
  A="http://stats.espncricinfo.com/ci/engine/player/"
  B=playerID
  C=".html?class=2;home_or_away="
  D=venuetype
  E=";template=results;type=batting;view=innings"
  return(str_c(A,B,C,D,E))
}
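
As an illustration of how the url changes with the venue code, the sketch below builds the three urls for one player and names them by venue label. It restates the url pattern of MakeURL in base R so that it runs on its own; the name statsURL is hypothetical, and nothing is downloaded.

```r
## base-R restatement of the url pattern used in MakeURL
statsURL <- function(playerID, venuetype) {
  paste0("http://stats.espncricinfo.com/ci/engine/player/", playerID,
         ".html?class=2;home_or_away=", venuetype,
         ";template=results;type=batting;view=innings")
}

## one url per venue code, named by the label that will tag the rows
venueURLs <- vapply(1:3, function(v) statsURL(49209, v), character(1))
names(venueURLs) <- c("Home", "Away", "Neutral")
venueURLs
```

Only the value after home_or_away= differs between the three urls.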

Main program


Now we are ready to write the main program. The whole program is wrapped in a single function that takes only the player ID as input. The comments in the code are largely self-explanatory. Some lines that were used for checking during development are left commented out.

makeCSVfromID=function(batterID){
  ## this is the main table
  ## the table for each venue is row-bound to it
  append_Table=NA
  possibleVenues=c(1:3) ## venue codes: 1=Home, 2=Away, 3=Neutral
  
  ## scraping the data from cricinfo
  ############################################
  for(ven in possibleVenues){
    url=MakeURL(batterID, ven)
    webpage=read_html(url)
    sb_table=html_nodes(webpage, 'table')
    sb=html_table(sb_table,fill = T)[[4]]
    ## sb is my table
    sb$playerID=batterID
    sb$Venue=venueType(ven)
    append_Table=rbind(append_Table,sb)
  }
  trim_Table=append_Table[-1,] ## removing the NA's added first
  rm(append_Table)
  
  
  ## append_Table has been removed;
  ## only trim_Table is kept from here on
  
  # names(trim_Table)
  # dim(trim_Table)
  # head(trim_Table)
  ## column 2 (Min) and column 10 carry nothing we need, so drop them
  ## the 3rd last column should be named ODI_ID (done below)
  round2_table=trim_Table[,-c(2,10)]
  #rm(trim_Table)
  #head(round2_table)
  
  ## fix the Opposition and ODI_ID columns
  names(round2_table)[12]="ODI_ID"
  # head(round2_table)
  ## take care of the ODI_ID
  round2_table$ODI_ID=as.numeric(str_sub(round2_table$ODI_ID,6))
  #head(round2_table)
  round2_table$Opposition=str_sub(round2_table$Opposition, 2)
  #head(round2_table)
  ## Fix the runs
  #View(round2_table)
  ## create a new variable for not out
  round2_table$NotOut=rep(NA, nrow(round2_table))
  for(i in 1:nrow(round2_table)){
    if(str_sub(round2_table$Runs[i], -1)=="*"){
      round2_table$NotOut[i]="Yes"
      round2_table$Runs[i]=as.numeric(str_sub(round2_table$Runs[i],0,-2))
    }
  }
  # head(round2_table)
  
  ## take care of the out ones
  indexes=which(is.na(round2_table$NotOut))
  for(i in indexes){
    if(round2_table$Runs[i]=="B"){next}else{
      round2_table$Runs[i]=as.numeric(round2_table$Runs[i])
      round2_table$NotOut[i]="No"
    }
  }
  
  index2=which(is.na(round2_table$Runs))
  round2_table$NotOut[index2]=NA
  
  round2_table$Runs=as.numeric(round2_table$Runs)
  
  
  # View(round2_table)
  ## now fix the Date
  ## after fixing the date, the data are sorted in chronological order
  
  ## first rename the column
  names(round2_table)[11]="Date"
  ## numeric codes of the months
  theYears=str_sub(round2_table$Date,-4,-1)
  theMonths=rep(NA, nrow(round2_table))
  for(i in 1:nrow(round2_table)){
    theMonths[i]=swith_Fun(str_sub(round2_table$Date[i],-8,-6))
  }
  theDays=str_sub(round2_table$Date,1,-10)
  theDates=as.Date(str_c(theYears,"-",theMonths,"-",theDays))
  
  round2_table$Date=theDates
  ## rename some variables
  names(round2_table)[3]="Fours"
  names(round2_table)[4]="Sixes"
  # lapply(round2_table,class)
  
  round2_table$BF=as.numeric(round2_table$BF)
  round2_table$Fours=as.numeric(round2_table$Fours)
  round2_table$Sixes=as.numeric(round2_table$Sixes)
  round2_table$SR=as.numeric(round2_table$SR)
  round2_table$Pos=as.numeric(round2_table$Pos)
  round2_table$Inns=as.numeric(round2_table$Inns)
  # lapply(round2_table,class)
  #View(round2_table)
  ## now sort by date
  round3_table=round2_table[order(round2_table$Date),]
  # head(round3_table)
  ## Add Name, Debut and Born column
  ConNameStyle=getConNameStyle(batterID)
  Fname=ConNameStyle$Name
  Cont=ConNameStyle$Con
  Style=ConNameStyle$Style
  deb=round3_table$Date[1]
  round3_table$FullName=Fname
  round3_table$Country=Cont
  round3_table$Style=Style
  round3_table$Debut=deb
  round3_table$DOB=getDOBfromID(batterID)
  round3_table$CID=batterID
  ## write the csv
  namePart=str_replace_all(Fname," ", "")
  write.csv(round3_table,file=str_c("player",namePart,".csv"))
}

## Test case
## 49209 is the ID of Sanath Jayasuriya 
makeCSVfromID(49209)

The makeCSVfromID function above takes the player ID as input, downloads the data from Cricinfo and writes it to a csv file. As a test case, we downloaded the ODI batting records of former Sri Lankan international Sanath Jayasuriya. The file name is playerSanathTeranJayasuriya.csv.
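
The loop in makeCSVfromID starts from NA and drops the first row afterwards. A common alternative is to collect the per-venue tables in a list and bind them once at the end. The sketch below shows that pattern with invented toy tables in place of the scraped ones, so it needs no network access; fakeScrape is a hypothetical stand-in for the scraping step.

```r
## hypothetical stand-in for the scraped table of one venue
## (the values are invented)
fakeScrape <- function(ven) {
  data.frame(Runs = c("10", "25*"), BF = c("12", "30"),
             stringsAsFactors = FALSE)
}
venueNames <- c("Home", "Away", "Neutral")

## build one tagged table per venue code
pieces <- lapply(1:3, function(ven) {
  sb <- fakeScrape(ven)
  sb$playerID <- 49209
  sb$Venue <- venueNames[ven]
  sb
})

## bind all venue tables in one step; no placeholder row is needed
allVenues <- do.call(rbind, pieces)
dim(allVenues)
table(allVenues$Venue)
```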

The data are saved in csv format. Each row corresponds to an innings. If the player did not bat in a particular innings, Runs and the related columns are NA for that row.
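
The handling of not-out scores and of innings without a score can be sketched in a vectorised form. In the scraped table a trailing * marks a not-out innings. The sample values below are invented, and the marker DNB for "did not bat" is an assumption about how such rows appear in the raw table.

```r
## invented Runs values in the style of the scraped table
runsRaw <- c("3", "13*", "DNB", "0", "101*")

## TRUE where the score ends with a star
isNotOut <- grepl("\\*$", runsRaw)
## strip the star, then convert; non-numeric entries become NA
runs <- suppressWarnings(as.numeric(sub("\\*$", "", runsRaw)))
## NotOut is left NA when there is no score at all
notOut <- ifelse(is.na(runs), NA_character_, ifelse(isNotOut, "Yes", "No"))

data.frame(Runs = runs, NotOut = notOut)
```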

For example, we see that Sanath Jayasuriya did not bat in the fifth ODI of his career. There could be a number of reasons: for instance, the match may have been washed out by rain, or he may not have needed to bat because the earlier batsmen played out all 50 overs.

mydata=read.csv("playerSanathTeranJayasuriya.csv")[,-1]
dim(mydata)
## [1] 445  21
names(mydata)
##  [1] "Runs"       "BF"         "Fours"      "Sixes"      "SR"        
##  [6] "Pos"        "Dismissal"  "Inns"       "Opposition" "Ground"    
## [11] "Date"       "ODI_ID"     "playerID"   "Venue"      "NotOut"    
## [16] "FullName"   "Country"    "Style"      "Debut"      "DOB"       
## [21] "CID"
head(mydata)
##   Runs BF Fours Sixes    SR Pos Dismissal Inns Opposition    Ground
## 1    3  5     0     0 60.00   5    caught    2  Australia Melbourne
## 2   13 16     0     0 81.25   7    caught    1  Australia     Perth
## 3   24 40     0     0 60.00   7    caught    2   Pakistan     Perth
## 4    0  3     0     0  0.00   8    bowled    2  Australia Melbourne
## 5   NA NA    NA    NA    NA  NA         -    1   Pakistan  Brisbane
## 6   31 66     1     0 46.96   7    caught    1  Australia  Adelaide
##         Date ODI_ID playerID   Venue NotOut                FullName
## 1 1989-12-26    596    49209    Away     No Sanath Teran Jayasuriya
## 2 1989-12-30    597    49209    Away     No Sanath Teran Jayasuriya
## 3 1989-12-31    598    49209 Neutral     No Sanath Teran Jayasuriya
## 4 1990-01-04    600    49209    Away     No Sanath Teran Jayasuriya
## 5 1990-02-10    601    49209 Neutral   <NA> Sanath Teran Jayasuriya
## 6 1990-02-18    608    49209    Away     No Sanath Teran Jayasuriya
##   Country         Style      Debut        DOB   CID
## 1     SrL Left-hand bat 1989-12-26 1969-06-30 49209
## 2     SrL Left-hand bat 1989-12-26 1969-06-30 49209
## 3     SrL Left-hand bat 1989-12-26 1969-06-30 49209
## 4     SrL Left-hand bat 1989-12-26 1969-06-30 49209
## 5     SrL Left-hand bat 1989-12-26 1969-06-30 49209
## 6     SrL Left-hand bat 1989-12-26 1969-06-30 49209
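
As a quick check on the shape of the data, the innings in which the player did not bat can be located through the NA values in Runs. The sketch below rebuilds three columns of the first six rows by hand from the head(mydata) output in this note, so it runs without the csv file; the name firstSix is hypothetical.

```r
## first six innings, copied from the head(mydata) output in this note
firstSix <- data.frame(
  Runs = c(3, 13, 24, 0, NA, 31),
  Opposition = c("Australia", "Australia", "Pakistan",
                 "Australia", "Pakistan", "Australia"),
  Date = as.Date(c("1989-12-26", "1989-12-30", "1989-12-31",
                   "1990-01-04", "1990-02-10", "1990-02-18")),
  stringsAsFactors = FALSE
)

## rows where the player did not bat
didNotBat <- which(is.na(firstSix$Runs))
firstSix[didNotBat, c("Opposition", "Date")]
```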

  1. http://www.espncricinfo.com/

  2. http://stats.espncricinfo.com/ci/engine/stats/index.html

  3. Thomas Brandenburger, PhD, Associate Professor, Math and Stat, South Dakota State University