Introduction, Background
Cricket has a massive following in the British Commonwealth countries and in some other countries as well. The game has several formats, and the one day international (ODI) is one of them. ODIs are mainly played in bilateral series and other tournaments. In a match, one team bats first and scores runs while the other team bowls and fields. Then the teams switch roles. Whichever team scores more runs is the winner.
Certain players specialize in batting, that is, in scoring runs. These players are traditionally known as “batsmen”. The batting records of batsmen can be a gold mine for a statistician or data scientist, with endless research questions. Some examples:
What is the likelihood that a certain batsman will make a century (a score of 100 or more; a unit of score is known as a run) in the next game?
What is the likelihood that a certain batsman will have played at least 300 international matches by the end of his career?
How many runs will a certain batsman have tallied by the end of his career?
How many centuries will a certain batsman have tallied by the end of his career?
What is the likelihood that a certain batsman will break a certain batting record?
The above list is not comprehensive. Anybody can come up with their own research question and apply a suitable statistical method for the analysis. But regardless of the research topic and analysis method, one certainly needs data.
ESPNcricinfo (also known as Cricinfo) is a website dedicated to the game of cricket. It contains news, columns, blogs, live scores, live commentary: almost everything related to the game. It hosts a large amount of data, player profiles, interactive insights and so on. Additionally, it has a searchable cricket statistics database (called Statsguru) covering Tests, ODIs, Twenty20 Internationals and women’s cricket. One can submit a query to pull the ODI batting data for a certain player. Different options are offered if one wants to apply a specific filter or wants the data in a specific format.
In this note, I share how I pulled up-to-date records for about 100 batsmen from Statsguru. This is part of a larger project in which I am applying survival analysis to predict career milestones of certain batsmen. I wrote R code that connects to the Statsguru query page and downloads the data. I then manipulated the data so that the final product (a clean csv file) is shaped the way I want.
Acknowledgments: My research supervisor Dr. Brandenburger has been providing constant guidance and support throughout the process.
Program Architecture
A manual inspection revealed that Cricinfo assigns a unique ID to each player. This program revolves around this ID. Information about a single player can be found on different pages of Cricinfo, but the urls of those pages have a generic structure, with the ID usually being the only difference between players. For example, 303669 is the ID of Joe Root (an English international cricketer) and 7133 is the ID of Ricky Ponting (a former Australian international cricketer).
For each player, Cricinfo has a main page. This page contains general information about the player and a career summary. Detailed records are hosted in the Statsguru database. This is an interactive page where the user may submit a query using different filters.
The program presented here takes the player ID (set by Cricinfo) as input. Then it pulls data from these two sources: the player's main page and the Statsguru database.
When submitting a query to the Statsguru database, one can filter on the following criteria:
- Version of game (Test, ODI, T20, List A etc)
- Venue Type
- Host Country
- Ground
- Date
- Season
- Match Result
- View format
For the current work, we are interested in the ODI batting records of the players. The table returned by the query does not have a variable for venue type or match result. For the time being, we do not use the match result, but we are interested in the venue type. So the program was written in a way that retains that information.
R Code
Different functions were written for specific jobs. Most of them take the user-supplied ID (the player ID set by Cricinfo) as an argument. These functions are called in the main function. Some data manipulation is also performed in the main function, so that we end up with a clean csv file in the shape we want.
Including the venue variable
As stated above, the table returned by the submitted query does not say whether an innings was played at a home, away or neutral ground for the player of interest. But if the desired venue type is checked in the query, it returns only the innings of that type. Fortunately the urls have a generic format and only a number changes between home, away and neutral venues. So the program constructs the url for a specific venue type and tags the pulled data with that venue type. The procedure is then repeated for the other two venue types.
Let us start the code:
library(rvest)
library(stringr)
library(RCurl)
library(XML)
## function written for venue type
## the argument can be just 1, 2 or 3
#########################################################################
venueType=function(x){
if(is.element(x, 1:3)==T){
return(ifelse(x==1, "Home", ifelse(x==2, "Away", "Neutral")))
}else{
message("warning: venue type not defined for ",x)
return(NA)
}
}
First we loaded the required packages. The venueType function takes 1, 2, or 3 as its argument and returns Home, Away or Neutral respectively. Later we will see a loop over the three types of venue. We will use this function to fill in the variable we want to create (the venue type).
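As a small illustration, the same mapping can be sketched with a lookup vector, which also handles several codes at once. The name venueLookup is hypothetical and is not part of the original program; it is only one possible way to write the mapping.

```r
## Hypothetical vectorized alternative to venueType():
## index a lookup vector by the numeric venue code.
venueLookup <- function(x) {
  labels <- c("Home", "Away", "Neutral")
  out <- rep(NA_character_, length(x))
  ok <- x %in% 1:3          # codes outside 1:3 stay NA
  out[ok] <- labels[x[ok]]
  out
}

venueLookup(1:3)        # "Home" "Away" "Neutral"
venueLookup(c(2, 5))    # "Away" NA
```

Unlike venueType, this sketch returns NA silently for an unknown code instead of printing a message.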
Construct url of the main page associated with the player of interest
A function was made to construct the main page url of the player of interest. It takes the ID (player ID set by Cricinfo) as input.
consURLmainpage=function(ID){
url=str_c("http://www.espncricinfo.com/australia/content/player/",ID,".html")
return(url)
}
Here is one thing to note. If we go directly to an Indian player’s page, we would see that http://www.espncricinfo.com/ is followed by india in the url. But a careful investigation revealed that the url is not sensitive to this part. For example, if we go to Ricky Ponting’s main page, we would see the url is http://www.espncricinfo.com/australia/content/player/7133.html. But if we change the url to http://www.espncricinfo.com/bangladesh/content/player/7133.html or even http://www.espncricinfo.com/a/content/player/7133.html, it will direct us to the same page (Ricky Ponting’s main page).
Information extracted from the main page
Four pieces of information about a player (full name, the national side he plays for, batting style (right-handed or left-handed) and date of birth) are extracted from the main page. Two functions do the job. The following chunk shows the first one.
getConNameStyle=function(batterID){
html=getURL(consURLmainpage(batterID))
doc = htmlParse(html, asText=TRUE)
plainText=xpathSApply(doc, "//p", xmlValue)
ConIndex=which(str_detect(plainText, "Major teams"))
big_name=str_split(plainText[ConIndex],",")[[1]][1]
Ret1=abbreviate(str_sub(big_name,13,-1),3) ## Country
NameIndex=which(str_detect(plainText, "Full name"))
Ret2=str_sub(plainText[NameIndex],12) ## Name
StyleIndex=which(str_detect(plainText, "Batting style"))
Ret3=str_sub(plainText[StyleIndex],15) ## Batting Style
return(as.list(c(Con=Ret1, Name=Ret2, Style=Ret3)))
}
The getConNameStyle function above takes the player ID as input and returns the country, full name and batting style as a list. This function calls the consURLmainpage function, so the code for that function must be run first.
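The fixed offsets passed to str_sub above depend on the exact whitespace in the page text. As a hedged sketch, the label can instead be stripped by name and the remainder trimmed. The sample strings and the helper name fieldValue are hypothetical; the real page text may look different.

```r
## Sample strings are hypothetical stand-ins for the <p> text on a player page;
## the real page may use different whitespace.
plainText <- c("Full name\n Sanath Teran Jayasuriya",
               "Batting style\n Left-hand bat")

## Hypothetical helper: find the paragraph that starts with `label`
## and return the rest, with surrounding whitespace removed.
fieldValue <- function(txt, label) {
  hit <- txt[startsWith(trimws(txt), label)][1]
  trimws(sub(label, "", hit, fixed = TRUE))
}

fieldValue(plainText, "Full name")      # "Sanath Teran Jayasuriya"
fieldValue(plainText, "Batting style")  # "Left-hand bat"
```

This uses only base R, so it can be tried without downloading anything.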
The second function extracts the date of birth of the player from the main page. This part could have been added to the previous function. But we wanted the date of birth in date format, and when we returned all four pieces of information as a list, the date of birth did not come back in the desired format. So we left this part as a separate function. It means the main page is downloaded twice, which is not efficient, but for this job it makes very little difference.
swith_Fun= function(x){
switch(x,Jan=1, Feb=2, Mar=3, Apr=4, May=5, Jun=6, Jul=7,
Aug=8,Sep=9, Oct=10, Nov=11, Dec=12)
}
getDOBfromID=function(batterID){
html=getURL(consURLmainpage(batterID))
doc = htmlParse(html, asText=TRUE)
plainText=xpathSApply(doc, "//p", xmlValue)
BornIndex=which(str_detect(plainText,"Born\n \n"))
complexDate=str_sub(plainText[BornIndex],8)
DOB_Year=as.numeric(str_split(complexDate,",")[[1]][2])
MonDay=str_split(complexDate,",")[[1]][1]
DOB_Mon=swith_Fun(str_sub(str_split(MonDay," ")[[1]][1],1,3))
DOB_Day=as.numeric(str_sub(MonDay,-2,-1))
return(as.Date(str_c(DOB_Year,"-",DOB_Mon,"-",DOB_Day)))
}
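The month lookup done by swith_Fun can also be sketched with base R's built-in month.abb constant, which works on a whole vector at once. The helper name monthNumber is hypothetical, and the year, month and day used below are illustrative inputs rather than values read from the page.

```r
## month.abb is a base R constant: "Jan", "Feb", ..., "Dec"
monthNumber <- function(mon) match(mon, month.abb)

monthNumber("Jun")                    # 6
monthNumber(c("Dec", "Jan", "Xyz"))   # 12 1 NA

## Building a Date from illustrative year/month/day pieces
dob <- as.Date(sprintf("%d-%02d-%02d", 1969L, monthNumber("Jun"), 30L))
dob
```

An unknown month abbreviation gives NA here, whereas switch() would return NULL.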
Construct url for statguru database
As said above, we wanted to include the venue variable. We handled this by downloading the data separately for each type of venue and filling in the field accordingly. We observed that the url changes with the venue type. So we wrote a function that constructs the url, taking the venue type and player ID as input.
## make a function for the URL
## takes three arguments:
## player ID, venue type and result type
## (resultType is not used in the URL yet)
## returns the appropriate URL
## this url is for the data table
MakeURL=function(playerID, venuetype,resultType){
A="http://stats.espncricinfo.com/ci/engine/player/"
B=playerID
C=".html?class=2;home_or_away="
D=venuetype
E=";template=results;type=batting;view=innings"
return(str_c(A,B,C,D,E))
}
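To see what the constructed urls look like, here is a minimal sketch that assembles the same string pieces with base R's paste0 and prints one url per venue code. The function name statsURL is hypothetical; 7133 is Ricky Ponting's ID mentioned earlier.

```r
## Same string pieces as MakeURL(), assembled with base R's paste0()
statsURL <- function(playerID, venuetype) {
  paste0("http://stats.espncricinfo.com/ci/engine/player/", playerID,
         ".html?class=2;home_or_away=", venuetype,
         ";template=results;type=batting;view=innings")
}

## One URL per venue code (1 = Home, 2 = Away, 3 = Neutral) for ID 7133
urls <- vapply(1:3, function(v) statsURL(7133, v), character(1))
urls
```

Only the number after home_or_away= differs between the three urls.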
Main program
Now we are ready to write the main program. The whole program is encapsulated in a function that takes only the player ID as input. The comments in the program are mostly self-explanatory. Some lines that were used for checking during development are left commented out.
makeCSVfromID=function(batterID){
## this is the main table
## all tables will be row-bound to it
append_Table=NA
possibleVenues=c(1:3) ## possible venue in numerics
## scraping the data from cricinfo
############################################
for(ven in possibleVenues){
url=MakeURL(batterID, ven)
webpage=read_html(url)
sb_table=html_nodes(webpage, 'table')
sb=html_table(sb_table,fill = T)[[4]]
## sb is my table
sb$playerID=batterID
sb$Venue=venueType(ven)
append_Table=rbind(append_Table,sb)
}
trim_Table=append_Table[-1,] ## removing the NA's added first
rm(append_Table)
## clearing workspace
## this only keeps the trim_table data
# names(trim_Table)
# dim(trim_Table)
# head(trim_Table)
## column 10 is useless
## Min (column 2) is useless
## the 3rd last column should be named ODI_ID
## drop these two columns first
round2_table=trim_Table[,-c(2,10)]
#rm(trim_Table)
#head(round2_table)
## fix the opposition and ODi_ID column
names(round2_table)[12]="ODI_ID"
# head(round2_table)
## take care of the ODI_ID
round2_table$ODI_ID=as.numeric(str_sub(round2_table$ODI_ID,6))
#head(round2_table)
round2_table$Opposition=str_sub(round2_table$Opposition, 2)
#head(round2_table)
## Fix the runs
#View(round2_table)
## Create a new variable for not out
round2_table$NotOut=rep(NA, nrow(round2_table))
for(i in 1:nrow(round2_table)){
if(str_sub(round2_table$Runs[i], -1)=="*"){
round2_table$NotOut[i]="Yes"
round2_table$Runs[i]=as.numeric(str_sub(round2_table$Runs[i],0,-2))
}
}
# head(round2_table)
## take care of the out ones
indexes=which(is.na(round2_table$NotOut))
for(i in indexes){
if(round2_table$Runs[i]=="B"){next}else{
round2_table$Runs[i]=as.numeric(round2_table$Runs[i])
round2_table$NotOut[i]="No"
}
}
index2=which(is.na(round2_table$Runs))
round2_table$NotOut[index2]=NA
round2_table$Runs=as.numeric(round2_table$Runs)
# View(round2_table)
## Now fix the Date
## after fixing the date, the data should be sorted in chronological order
## first rename the column
names(round2_table)[11]="Date"
## numcode of the months
theYears=str_sub(round2_table$Date,-4,-1)
theMonths=rep(NA, nrow(round2_table))
for(i in 1:nrow(round2_table)){
theMonths[i]=swith_Fun(str_sub(round2_table$Date[i],-8,-6))
}
theDays=str_sub(round2_table$Date,1,-10)
theDates=as.Date(str_c(theYears,"-",theMonths,"-",theDays))
round2_table$Date=theDates
## Renames some variable name
names(round2_table)[3]="Fours"
names(round2_table)[4]="Sixes"
# lapply(round2_table,class)
round2_table$BF=as.numeric(round2_table$BF)
round2_table$Fours=as.numeric(round2_table$Fours)
round2_table$Sixes=as.numeric(round2_table$Sixes)
round2_table$SR=as.numeric(round2_table$SR)
round2_table$Pos=as.numeric(round2_table$Pos)
round2_table$Inns=as.numeric(round2_table$Inns)
# lapply(round2_table,class)
#View(round2_table)
## now sort by date
round3_table=round2_table[order(round2_table$Date),]
# head(round3_table)
## Add Name, Debut and Born column
ConNameStyle=getConNameStyle(batterID)
Fname=ConNameStyle$Name
Cont=ConNameStyle$Con
Style=ConNameStyle$Style
deb=round3_table$Date[1]
round3_table$FullName=Fname
round3_table$Country=Cont
round3_table$Style=Style
round3_table$Debut=deb
round3_table$DOB=getDOBfromID(batterID)
round3_table$CID=batterID
## write the csv
namePart=str_replace_all(Fname," ", "")
write.csv(round3_table,file=str_c("player",namePart,".csv"))
}
## Test case
## 49209 is the ID of Sanath Jayasuriya
makeCSVfromID(49209)
The makeCSVfromID function above takes the player ID as input and downloads the data from Cricinfo. As a test case, we downloaded the ODI batting records of former Sri Lankan international Sanath Jayasuriya. The file name is playerSanathTeranJayasuriya.csv.
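Inside makeCSVfromID, the three venue tables are stacked by starting from NA and dropping the first row afterwards. One alternative, sketched here on made-up toy tables, is to collect the pieces in a list and bind them in a single call. The function fakeTable and its values are hypothetical stand-ins for the scraped tables.

```r
## Toy stand-ins for the three scraped tables (values are made up)
fakeTable <- function(ven) {
  data.frame(Runs = c("10", "25*"), Venue = ven, stringsAsFactors = FALSE)
}

venues <- c("Home", "Away", "Neutral")
pieces <- lapply(venues, fakeTable)   # one data frame per venue
stacked <- do.call(rbind, pieces)     # stack them in a single call

dim(stacked)            # 6 rows, 2 columns
table(stacked$Venue)
```

With this pattern there is no placeholder row to remove afterwards.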
The data is saved in csv format. Each row corresponds to an innings. If the player did not bat in a particular innings, the Runs (and related) columns are NA.
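The not-out handling in the main program uses two loops. The same idea can be sketched in vectorized form on a small sample. The sample values, including "DNB" as a marker for an innings in which the player did not bat, are assumptions for illustration and may not match what Statsguru actually returns.

```r
## Hypothetical raw Runs values: a trailing "*" marks a not-out innings,
## and "DNB" stands in for "did not bat" (an assumed marker).
runs_raw <- c("3", "13*", "0", "DNB")

## Flag not-out innings, then strip the "*" and convert to numbers
not_out <- ifelse(endsWith(runs_raw, "*"), "Yes", "No")
runs <- suppressWarnings(as.numeric(sub("\\*$", "", runs_raw)))

## Innings without a numeric score get NA in both columns
not_out[is.na(runs)] <- NA

data.frame(Runs = runs, NotOut = not_out)
```

Anything that cannot be read as a number ends up as NA in both Runs and NotOut, which matches the shape of the csv described above.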
For example, we see that Sanath Jayasuriya did not bat in the fifth ODI of his career. There could be a number of reasons: the match may have been washed out by rain, or he may not have needed to bat because the earlier batsmen played out all 50 overs.
mydata=read.csv("playerSanathTeranJayasuriya.csv")[,-1]
dim(mydata)
## [1] 445 21
names(mydata)
## [1] "Runs" "BF" "Fours" "Sixes" "SR"
## [6] "Pos" "Dismissal" "Inns" "Opposition" "Ground"
## [11] "Date" "ODI_ID" "playerID" "Venue" "NotOut"
## [16] "FullName" "Country" "Style" "Debut" "DOB"
## [21] "CID"
head(mydata)
## Runs BF Fours Sixes SR Pos Dismissal Inns Opposition Ground
## 1 3 5 0 0 60.00 5 caught 2 Australia Melbourne
## 2 13 16 0 0 81.25 7 caught 1 Australia Perth
## 3 24 40 0 0 60.00 7 caught 2 Pakistan Perth
## 4 0 3 0 0 0.00 8 bowled 2 Australia Melbourne
## 5 NA NA NA NA NA NA - 1 Pakistan Brisbane
## 6 31 66 1 0 46.96 7 caught 1 Australia Adelaide
## Date ODI_ID playerID Venue NotOut FullName
## 1 1989-12-26 596 49209 Away No Sanath Teran Jayasuriya
## 2 1989-12-30 597 49209 Away No Sanath Teran Jayasuriya
## 3 1989-12-31 598 49209 Neutral No Sanath Teran Jayasuriya
## 4 1990-01-04 600 49209 Away No Sanath Teran Jayasuriya
## 5 1990-02-10 601 49209 Neutral <NA> Sanath Teran Jayasuriya
## 6 1990-02-18 608 49209 Away No Sanath Teran Jayasuriya
## Country Style Debut DOB CID
## 1 SrL Left-hand bat 1989-12-26 1969-06-30 49209
## 2 SrL Left-hand bat 1989-12-26 1969-06-30 49209
## 3 SrL Left-hand bat 1989-12-26 1969-06-30 49209
## 4 SrL Left-hand bat 1989-12-26 1969-06-30 49209
## 5 SrL Left-hand bat 1989-12-26 1969-06-30 49209
## 6 SrL Left-hand bat 1989-12-26 1969-06-30 49209