I spent a little time this weekend finding ways to programmatically access the website basketball-reference.com.
What follows is a work in progress. The website has a simple enough structure, but there are exceptions (“per 48 minutes” tables being passed as “per game” tables, for instance).

You will need to load a few libraries:

library(XML)
library(plyr)
library(ggplot2)
library(reshape2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

I wrote a basic function for pulling a player’s career “per game” statistics into a data frame. You can specify as many players as you want, and it will load all their stats into the data frame. The function takes no arguments; when called, it prompts the user for the number of queries and then for each player code. The codes are in this format: “lllllff##”, meaning the first five letters of the last name, the first two letters of the first name, and a two-digit number. So, Michael Jordan would be “jordami01”. You don’t need to type the quotes at the prompt. Also: “01” will not always be correct, for instance in the case of Ray Allen (“allenra02”) or the Blazers’ Cliff Robinson (“robincl02”). I’m working on a follow-up function that will generate the full directory of players, with the proper code for each, as a data frame.
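To make the code format concrete, here is a minimal sketch of building a candidate code from a name. The helper `guessPlayerCode` and its `n` argument are hypothetical and not part of the script below; stripping punctuation and lower-casing are assumptions about how the site forms its codes, and the numeric suffix still has to be checked by hand.

```r
# Hypothetical helper: guess a player code following the "lllllff##" pattern.
# Assumes the site lower-cases names and drops anything that is not a letter.
guessPlayerCode <- function(first, last, n = 1) {
        clean <- function(x) gsub("[^a-z]", "", tolower(x))
        paste0(substr(clean(last), 1, 5),   # up to five letters of the last name
               substr(clean(first), 1, 2),  # first two letters of the first name
               sprintf("%02d", n))          # two-digit suffix, 01 by default
}

jordan <- guessPlayerCode("Michael", "Jordan")
allen  <- guessPlayerCode("Ray", "Allen", n = 2)
print(c(jordan, allen))
```

The suffix cannot be derived from the name alone, which is why a full player directory would still be needed to get it right every time.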

First, I define the column names and classes, plus an empty data frame to collect the results.

pHeader=c("seas", "age", "team", "leag", "pos", "games", "strt",
          "min", "fg", "fga", "fgp", "tp", "tpa", "tpp", "duo", 
          "duoa", "duop", "ft", "fta", "ftp", "orb", "drb", "trb", 
          "ast", "stl", "blk", "tov", "pf", "pts")

pClasses=c("character", "numeric", rep("character", 3), 
           rep("numeric", 24))

pmain <- data.frame(namecol=character(), seas=character(), age=numeric(), 
                    team=character(), leag=character(), pos=character(), 
                    games=numeric(), strt=numeric(),min=numeric(), fg=numeric(), 
                    fga=numeric(), fgp=numeric(), tp=numeric(), tpa=numeric(), 
                    tpp=numeric(), duo=numeric(), duoa=numeric(), duop=numeric(), 
                    ft=numeric(), fta=numeric(), ftp=numeric(), orb=numeric(), 
                    drb=numeric(), trb=numeric(), ast=numeric(), stl=numeric(), 
                    blk=numeric(), tov=numeric(), pf=numeric(), pts=numeric())

Then, I declare some helper functions. The first one just gets the desired number of players.

inputQuantity <- function(){
        # ask for number of players to analyze
        pquant <- readline(prompt="Enter number of players to compare: ")
        # readline() returns a string, so convert it before using it as a count
        return(as.integer(pquant))
}

The next one prompts for the player code(s).

inputCode <- function() {
        # ask for name of player
        NAME <- readline(prompt="Enter player code: ")
        return(NAME)
}

This one assembles the URL for the player’s page on basketball-reference.com.

makeURL <- function(NAME){
        purl <- paste("http://www.basketball-reference.com/players/", 
                      substr(NAME, 1, 1), "/", NAME, ".html", sep="")
        return (purl)
}
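Because `paste()` and `substr()` are both vectorized, the same function can build several URLs in one call. A small sketch, with the definition repeated so the snippet runs on its own:

```r
# Same definition as above, repeated here so the snippet is self-contained.
makeURL <- function(NAME){
        paste("http://www.basketball-reference.com/players/",
              substr(NAME, 1, 1), "/", NAME, ".html", sep="")
}

# Three player codes in, three URLs out; no network access is needed.
codes <- c("jordami01", "drexlcl01", "thomais01")
urls <- makeURL(codes)
print(urls)
```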

This chunk does the HTML parsing with functions from the XML package.

importData <- function(purl){
        pdoc <- htmlParse(purl) # parse document
        pnode <- getNodeSet(pdoc, "//table") # collect every table on the page
        # the third table is assumed to be the "per game" table
        pdata <- readHTMLTable(pnode[[3]], header=pHeader, colClasses=pClasses) 
        return(pdata)
}
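Picking the table by position is fragile, since pages do not always put the same table in the same place. One possible alternative, sketched here, is to select the table by its `id` attribute with XPath. The id `per_game` and the tiny stand-in page are assumptions made for illustration; check the real page source before relying on them.

```r
library(XML)

# Stand-in page with two tables; the ids and values are made up for this sketch.
html <- paste0(
        '<html><body>',
        '<table id="totals"><tr><th>Season</th><th>PTS</th></tr>',
        '<tr><td>1984-85</td><td>2313</td></tr></table>',
        '<table id="per_game"><tr><th>Season</th><th>PTS</th></tr>',
        '<tr><td>1984-85</td><td>28.2</td></tr></table>',
        '</body></html>')

pdoc <- htmlParse(html, asText = TRUE)

# Select by id instead of by position in the list of tables.
pnode <- getNodeSet(pdoc, "//table[@id='per_game']")
if (length(pnode) == 0) stop("no table with id 'per_game' found")

pdata <- readHTMLTable(pnode[[1]], stringsAsFactors = FALSE)
print(pdata)
```

Failing loudly when the table is missing makes it easier to spot a page that does not follow the usual layout.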

This function creates a name ID column for the player in question.

makeTidyDF <- function(data, NAME){
        num <- nrow(data)
        namecol <- rep(NAME, num)
        tidydf <- cbind(namecol, data)
        return(tidydf)
}

This all gets chained together to form the main function.

#****************************************
# Main Function
#****************************************
nbadata <- function(){
        pquant <- inputQuantity()
        for (i in seq_len(as.integer(pquant))){
                playerID <- inputCode()
                playerURL <- makeURL(playerID)
                playerTable <- importData(playerURL)
                playerTidyFrame <- makeTidyDF(playerTable, playerID)
                pmain <- rbind(pmain, playerTidyFrame)
        }
        # turn the season label (e.g. "1984-85") into its ending year (1985)
        pmain[,2] <- as.numeric(substr(pmain[,2], 1, 4)) + 1
        return(pmain)
}
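The last step before `return(pmain)` rewrites the season column as a single number. A short sketch of that conversion on its own, using made-up season labels in the site's “yyyy-yy” style:

```r
# Season labels as they appear in the table (example values only).
seas <- c("1984-85", "1999-00", "2012-13")

# Take the four-digit starting year and add one to get the year the season ends.
endYear <- as.numeric(substr(seas, 1, 4)) + 1
print(endYear)
```

Working from the first four characters avoids having to parse the two-digit ending year, which would be ambiguous across the turn of the century.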

Sourcing the code into R gives you the nbadata() function, which prompts you for input and then returns a data frame.

# guards <- nbadata()
# then I would choose 3
# then, say, jordami01, drexlcl01, thomais01.

I’ll update these as I continue to explore the XML package.