Recently, I wrote a web scraper that pulls tables from Basketball Reference and writes them to spreadsheets. Each year gets its own workbook, and each tab in a workbook documents one league metric for that year. After writing the scraper function, I use a loop to show how to scrape multiple seasons' worth of data.
The function can be found below. Sometimes you have to play with the encoding so that the team populates correctly in the third column. The "â€¢" pattern (most likely the bullet character "•" decoded with the wrong encoding) doesn't always match if your encoding isn't set up the same way.
require(XML)
## Loading required package: XML
require(xlsx)
## Loading required package: xlsx
## Loading required package: rJava
## Loading required package: xlsxjars
require(httr)
## Loading required package: httr
setwd("C://Users//Luke//Documents//R")
NBAscrape <- function(year) {
  # Set URL based on the year you want to scrape
  URL <- paste0("http://www.basketball-reference.com/leagues/NBA_", year, "_leaders.html")
  tabs <- GET(URL)
  # Scrape the URL: parse every HTML table on the page into a list of data frames
  scrape <- readHTMLTable(rawToChar(tabs$content), header = FALSE, stringsAsFactors = FALSE)
  # Clean the data by getting rid of random strings pulled in
  clean <- lapply(scrape, function(x) {
    x[, 2] <- gsub("â€¢", "", x[, 2])
    return(x)
  })
  # Clean by separating player name and team into separate columns
  clean.final <- lapply(clean, function(x) {
    x[, 4] <- sapply(strsplit(x[, 2], " "), function(y) {y[1]})
    x[, 5] <- sapply(strsplit(x[, 2], " "), function(y) {y[2]})
    x[, 6] <- as.numeric(as.character(x[, 3]))
    return(x[, 4:6])
  })
  # Clean up some of the list names, as the xlsx package doesn't like "/" in sheet names
  names(clean.final) <- gsub("/", " ", names(clean))
  # Write one sheet per table; the output3 folder must already exist
  for (i in seq_along(clean.final)) {
    write.xlsx(clean.final[[i]], file = paste0("output3//NBA_Leaders", year, ".xlsx"), sheetName = names(clean.final)[i], append = TRUE)
  }
}
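If the hard-coded debris string does not match on your machine, one option is to strip both the correctly decoded bullet and its garbled form. The sketch below is only an illustration: clean_name is a hypothetical helper that is not part of the scraper above, and the sample strings are made up rather than copied from the site. Decoding the response with httr's content(tabs, as = "text", encoding = "UTF-8") instead of rawToChar may also avoid the problem, though I have not verified that against the live page.

```r
# Hypothetical helper: remove the bullet separator whether it arrived
# correctly decoded or as three-character encoding debris.
clean_name <- function(x) {
  x <- gsub("\u00e2\u20ac\u00a2", "", x, fixed = TRUE)  # garbled form: a-circumflex, euro sign, cent sign
  x <- gsub("\u2022", "", x, fixed = TRUE)              # correctly decoded bullet
  trimws(x)                                             # drop leading/trailing whitespace
}

# Made-up sample cells, one per decoding outcome
samples <- c("LeBron James \u2022 CLE", "LeBron James \u00e2\u20ac\u00a2 CLE")
cleaned <- clean_name(samples)
print(cleaned)
```

Using fixed = TRUE treats each pattern as a literal string, so none of the characters are interpreted as regular expression syntax.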
You can then loop over the seasons. First create a vector with the years you want, then run a simple loop. The loop takes some time to run: it took about 20 minutes to scrape 30-some years.
# Scrape every year this millennium
season_list <- as.character(2000:2017)
for (i in season_list) {
  NBAscrape(i)
}
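Since the loop runs for a while, a single failed request can cost you the rest of the run. Here is a minimal sketch of a more forgiving loop, assuming the scraper is passed in as a function. The name scrape_seasons and the pause argument are hypothetical and not part of the code above, and the two-second default is an arbitrary choice.

```r
# Hypothetical helper: run a scraper over several seasons, pausing between
# requests and recording failures instead of stopping at the first error.
scrape_seasons <- function(years, scraper, pause = 2) {
  ok <- vapply(years, function(yr) {
    result <- tryCatch({
      scraper(yr)
      TRUE
    }, error = function(e) {
      message("Season ", yr, " failed: ", conditionMessage(e))
      FALSE
    })
    Sys.sleep(pause)  # wait between requests to go easy on the site
    result
  }, logical(1))
  names(ok) <- years
  ok
}

# Example call (not run here, since it needs network access):
# status <- scrape_seasons(season_list, NBAscrape)
# names(status)[!status]  # seasons that need to be retried
```

The function returns a named logical vector, so you can see at a glance which seasons succeeded and rerun only the ones that failed.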