Basketball Reference Web Scraper

Recently, I wrote a web scraper that pulls tables from Basketball Reference and writes them to spreadsheets. Each year gets its own workbook, and each tab in a workbook holds the leaders for one league metric in that year. After writing the scraper function, I use a loop to show how to scrape multiple seasons' worth of data.

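The function can be found below. Sometimes you have to play with the encoding so that the team populates correctly in the third column. The string “â€¢” in the cleaning step appears to be the bullet character (•) that separates player and team on the page, as it shows up when UTF-8 text is decoded with a Windows encoding. If your session decodes the page differently, that pattern won't match and the team won't be split out.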
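To make the encoding issue concrete, here is a minimal sketch in base R. It assumes the cell text has the shape "Player • TEAM" and arrives as UTF-8 bytes; the sample value and the variable names are made up for illustration. Instead of matching the mis-decoded form, it marks the text as UTF-8 and splits on the bullet itself.

```r
# Hypothetical cell value, shaped like "Player • TEAM" (\u2022 is the bullet)
cell <- "LeBron James \u2022 CLE"

# The UTF-8 bytes, standing in for what comes back in the HTTP response body
bytes <- charToRaw(enc2utf8(cell))

# rawToChar() does not record an encoding, so declare it explicitly
txt <- rawToChar(bytes)
Encoding(txt) <- "UTF-8"

# Split on the bullet character, then trim the surrounding spaces
parts <- trimws(strsplit(txt, "\u2022", fixed = TRUE)[[1]])
player <- parts[1]
team <- parts[2]
```

If this approach works for your setup, the cleaning step no longer depends on how your machine happens to display the bullet.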

Function

require(XML)
## Loading required package: XML
require(xlsx)
## Loading required package: xlsx
## Loading required package: rJava
## Loading required package: xlsxjars
require(httr)
## Loading required package: httr
setwd("C:/Users/Luke/Documents/R")

NBAscrape<-function(year){
    #Set URL based on the year you want to scrape
    URL<-paste0("http://www.basketball-reference.com/leagues/NBA_",year,"_leaders.html")
    
    tabs<-GET(URL)
    
    #Scrape the URL
    scrape<-readHTMLTable(rawToChar(tabs$content), header=FALSE, stringsAsFactors=FALSE)
    
    #Clean the data by getting rid of random strings pulled in
    clean<-lapply(scrape, function(x){
        x[,2]<-gsub("â€¢","",x[,2])
        return(x)})
    
    #Clean by separating player name and team into separate columns
    clean.final<-lapply(clean, function(x){
        x[,4]<-sapply(strsplit(x[,2],"  "), function(y){y[1]})
        x[,5]<-sapply(strsplit(x[,2],"  "), function(y){y[2]})
        x[,6]<-as.numeric(as.character(x[,3]))
        return(x[,4:6])
    })
    
    #clean up some of the list names as the xlsx package doesn't like "/"
    names(clean.final)<-gsub("/", " ", names(clean))
    
    #Write each table to its own sheet; the output3 folder must already exist
    for(i in seq_along(clean.final)){
        write.xlsx(clean.final[[i]], file=paste0("output3//NBA_Leaders",year,".xlsx"), sheetName=names(clean.final)[i], append=TRUE)
    }
}
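
The function only replaces "/" in the table names, but Excel also rejects sheet names that contain `\ ? * : [ ]` or run past 31 characters. A minimal sketch of a more general cleanup, assuming a hypothetical helper called `safe_sheet_name` that is not part of the original function:

```r
# Hypothetical helper: make a table name safe to use as an Excel sheet name
safe_sheet_name <- function(x) {
  # Replace the characters Excel forbids in sheet names with a space
  x <- gsub("[\\\\/?*:\\[\\]]", " ", x, perl = TRUE)
  # Excel limits sheet names to 31 characters
  substr(x, 1, 31)
}

# Made-up names, for illustration only
short_name <- safe_sheet_name("Wins/Losses")
long_name <- safe_sheet_name(strrep("a", 40))
```

Note that truncating can produce duplicate names, so a real run might still need a uniqueness check.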

Loop

You can then loop over the seasons you want. First create a vector of years, then run a simple loop that calls the function once per year. The loop takes some time to run: it took ~20 min to scrape 30-some years.

#scrape every year this millennium
season_list<-as.character(2000:2017)

for(i in season_list){
    NBAscrape(i)
}
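
Since a long run can be interrupted by a single failed download, one option is to wrap each call in `tryCatch` and pause briefly between requests. The sketch below is only an illustration: `scrape_one` is a stand-in for `NBAscrape` that fails on purpose for one season, and the pause length is an arbitrary placeholder.

```r
# Stand-in for NBAscrape(): fails for one season to show the error handling
scrape_one <- function(year) {
  if (year == "2002") stop("simulated download failure")
  invisible(year)
}

season_list <- as.character(2000:2004)
failed <- character(0)

for (season in season_list) {
  ok <- tryCatch({
    scrape_one(season)
    TRUE
  }, error = function(e) {
    # Report the problem and keep going with the next season
    message("Season ", season, " failed: ", conditionMessage(e))
    FALSE
  })
  if (!ok) failed <- c(failed, season)
  Sys.sleep(0.1)  # short pause between requests; placeholder value
}
```

After the loop, `failed` holds the seasons that need to be retried.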