I spent a little time this weekend finding ways to programmatically access the website basketball-reference.com.
What follows is a work in progress. The website has a simple enough structure, but there are exceptions (“per 48 minutes” tables being passed as “per game” tables, for instance).
You will need to load a few libraries:
library(XML)
library(plyr)
library(ggplot2)
library(reshape2)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
I wrote a basic function for pulling a player’s career “per game” statistics into a data frame. You can request as many players as you want, and it will stack all their stats in one data frame. The function takes no arguments; when called, it prompts for the number of players and then for each player code. The codes are in this format: “lllllff##”, which, judging from the examples, is the first five letters of the last name, the first two letters of the first name, and a two-digit number. So, Michael Jordan would be “jordami01”. You don’t need to type the quotes at the prompt. Also: “01” will not always be correct, for instance in the case of Ray Allen (“allenra02”) or the Blazers’ Cliff Robinson (“robincl02”). I’m working on a follow-up function that will generate the full directory of players, with the proper code for each, in a data frame.
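To make the code format concrete, here is a rough sketch of a helper that builds a code from a name. `guessPlayerCode` is a hypothetical name and not part of the code below. Stripping non-letters is my assumption. The trailing number still has to be guessed or looked up, since nothing in the name itself determines it.

```r
# Hypothetical helper: build a player code from a name.
# Pattern inferred from the examples in the text: first five letters of the
# last name, first two letters of the first name, two-digit number.
guessPlayerCode <- function(first, last, n = 1) {
  # lower-case and keep letters only (assumption, not confirmed by the site)
  clean <- function(x) gsub("[^a-z]", "", tolower(x))
  paste0(substr(clean(last), 1, 5),
         substr(clean(first), 1, 2),
         sprintf("%02d", as.integer(n)))
}

guessPlayerCode("Michael", "Jordan")       # "jordami01"
guessPlayerCode("Ray", "Allen", n = 2)     # "allenra02"
```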
First I define the column names and classes, plus an empty data frame to collect the results.
pHeader=c("seas", "age", "team", "leag", "pos", "games", "strt",
"min", "fg", "fga", "fgp", "tp", "tpa", "tpp", "duo",
"duoa", "duop", "ft", "fta", "ftp", "orb", "drb", "trb",
"ast", "stl", "blk", "tov", "pf", "pts")
pClasses=c("character", "numeric", rep("character", 3),
rep("numeric", 24))
pmain <- data.frame(namecol=character(), seas=character(), age=numeric(),
team=character(), leag=character(), pos=character(),
games=numeric(), strt=numeric(),min=numeric(), fg=numeric(),
fga=numeric(), fgp=numeric(), tp=numeric(), tpa=numeric(),
tpp=numeric(), duo=numeric(), duoa=numeric(), duop=numeric(),
ft=numeric(), fta=numeric(), ftp=numeric(), orb=numeric(),
drb=numeric(), trb=numeric(), ast=numeric(), stl=numeric(),
blk=numeric(), tov=numeric(), pf=numeric(), pts=numeric())
Then, I declare some helper functions. The first one just gets the desired number of players.
inputQuantity <- function(){
  # ask for number of players to analyze
  pquant <- readline(prompt="Enter number of players to compare: ")
  # readline() returns a string, so convert it before using it as a count
  return(as.integer(pquant))
}
The next one prompts for the player code(s).
inputCode <- function() {
# ask for name of player
NAME <- readline(prompt="Enter player code: ")
return(NAME)
}
This one assembles the URL for the player’s page on basketball-reference.com.
makeURL <- function(NAME){
purl <- paste("http://www.basketball-reference.com/players/",
substr(NAME, 1, 1), "/", NAME, ".html", sep="")
return (purl)
}
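Because `paste()` and `substr()` are both vectorized, the same function also handles several codes at once. The snippet repeats the definition of `makeURL()` only so that it runs on its own.

```r
# Copy of makeURL() from above so this snippet is self-contained
makeURL <- function(NAME){
  paste("http://www.basketball-reference.com/players/",
        substr(NAME, 1, 1), "/", NAME, ".html", sep="")
}

codes <- c("jordami01", "drexlcl01", "thomais01")
urls <- makeURL(codes)   # one URL per code, in the same order
print(urls)
# [1] "http://www.basketball-reference.com/players/j/jordami01.html"
# [2] "http://www.basketball-reference.com/players/d/drexlcl01.html"
# [3] "http://www.basketball-reference.com/players/t/thomais01.html"
```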
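This chunk does the HTML parsing with commands from the XML package.
importData <- function(purl){
  pdoc <- htmlParse(purl) # parse document
  pnode <- getNodeSet(pdoc, "//table") # collect every <table> node
  # the code assumes the "per game" table is the third table on the page;
  # this hard-coded index is where the exceptions mentioned earlier bite
  pdata <- readHTMLTable(pnode[[3]], header=pHeader, colClasses=pClasses)
  return(pdata)
}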
This function creates a name ID column for the player in question.
makeTidyDF <- function(data, NAME){
num <- nrow(data)
namecol <- rep(NAME, num)
tidydf <- cbind(namecol, data)
return(tidydf)
}
This all gets chained together to form the main function.
#****************************************
# Main Function
#****************************************
nbadata <- function(){
  pquant <- as.integer(inputQuantity()) # make sure the count is a number
  for (i in seq_len(pquant)){
    playerID <- inputCode()
    playerURL <- makeURL(playerID)
    playerTable <- importData(playerURL)
    playerTidyFrame <- makeTidyDF(playerTable, playerID)
    pmain <- rbind(pmain, playerTidyFrame)
  }
  # column 2 is the season label, e.g. "1984-85"; keep the year it ended (1985)
  pmain[,2] <- as.numeric(substr(pmain[,2], 1, 4)) + 1
  return(pmain)
}
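The last step in `nbadata()` rewrites the season column, and it is easy to check in isolation. The labels below are just sample strings in the site's "YYYY-YY" style. A row whose first column is not a season label (a summary row, say) would come out as `NA` with a warning.

```r
# Season labels in the "YYYY-YY" style used on the player pages
seas <- c("1984-85", "1985-86", "1999-00")

# Take the starting year and add one to get the year the season ended
endYear <- as.numeric(substr(seas, 1, 4)) + 1
print(endYear)   # 1985 1986 2000

# A non-season label cannot be converted and becomes NA
notSeason <- suppressWarnings(as.numeric(substr("Career", 1, 4)) + 1)
print(notSeason) # NA
```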
Sourcing the code into R gives you the nbadata() function, which prompts you for input and then returns a data frame.
# guards <- nbadata()
# then I would choose 3
# then, say, jordami01, drexlcl01, thomais01.
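The prompts make the function awkward to use from a script. One possible alternative, sketched here and not part of the code above, takes the codes as a vector. `nbadataFor`, `fetch`, and `fakeFetch` are hypothetical names. `fetch` is any function that turns a player code into a data frame, which in real use might be `function(id) importData(makeURL(id))`. The stand-in below uses made-up numbers so the sketch runs offline.

```r
# Hypothetical non-interactive variant: codes are passed in, not typed
nbadataFor <- function(codes, fetch) {
  frames <- lapply(codes, function(id) {
    tab <- fetch(id)
    # tag every row with the player code, as makeTidyDF() does
    data.frame(namecol = rep(id, nrow(tab)), tab, stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, frames)                      # stack all players
  out$seas <- as.numeric(substr(out$seas, 1, 4)) + 1 # "1990-91" -> 1991
  out
}

# Stand-in fetcher with made-up values, for illustration only
fakeFetch <- function(id) {
  data.frame(seas = c("1990-91", "1991-92"), pts = c(1, 2),
             stringsAsFactors = FALSE)
}

demo <- nbadataFor(c("jordami01", "drexlcl01"), fakeFetch)
print(demo)
```

Passing the fetcher in as an argument keeps the network access separate from the stacking logic, so the latter can be tried out without hitting the website.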
I’ll update these as I continue to explore the XML package.