Web scraping in R: A tutorial using Super Bowl Data

Hadley Wickham created an R package called rvest that makes it easy to scrape static web pages. In this note, I will show how to use rvest to extract a table with Super Bowl data from a web page.

We will use the optional R packages rvest, stringr, and tidyr. These packages must be installed before running this tutorial.

install.packages('rvest')
install.packages('stringr')
install.packages('tidyr')

Once the packages have been installed, they can be loaded using the library command.

library(rvest)
library(stringr)
library(tidyr)

We use the read_html function to read a web page. This function is provided by the xml2 package, which was loaded automatically when we loaded rvest.

url <- 'http://espn.go.com/nfl/superbowl/history/winners'
webpage <- read_html(url)

Next, we use the functions html_nodes and html_table (from rvest) to extract the HTML table element and convert it to a data frame.

sb_table <- html_nodes(webpage, 'table')
sb <- html_table(sb_table)[[1]]
head(sb)

##                               X1            X2
## 1 Super Bowl Winners and Results          <NA>
## 2                            NO.          DATE
## 3                              I Jan. 15, 1967
## 4                             II Jan. 14, 1968
## 5                            III Jan. 12, 1969
## 6                             IV Jan. 11, 1970
##                              X3                            X4
## 1                          <NA>                          <NA>
## 2                          SITE                        RESULT
## 3 Los Angeles Memorial Coliseum  Green Bay 35, Kansas City 10
## 4           Orange Bowl (Miami)      Green Bay 33, Oakland 14
## 5           Orange Bowl (Miami) New York Jets 16, Baltimore 7
## 6  Tulane Stadium (New Orleans)   Kansas City 23, Minnesota 7

We remove the first two rows, and set the column names.

sb <- sb[-(1:2), ]
names(sb) <- c("number", "date", "site", "result")
head(sb)

##   number          date                          site
## 3      I Jan. 15, 1967 Los Angeles Memorial Coliseum
## 4     II Jan. 14, 1968           Orange Bowl (Miami)
## 5    III Jan. 12, 1969           Orange Bowl (Miami)
## 6     IV Jan. 11, 1970  Tulane Stadium (New Orleans)
## 7      V Jan. 17, 1971           Orange Bowl (Miami)
## 8     VI Jan. 16, 1972  Tulane Stadium (New Orleans)
##                          result
## 3  Green Bay 35, Kansas City 10
## 4      Green Bay 33, Oakland 14
## 5 New York Jets 16, Baltimore 7
## 6   Kansas City 23, Minnesota 7
## 7       Baltimore 16, Dallas 13
## 8            Dallas 24, Miami 3

It is traditional to use Roman numerals to refer to Super Bowls, but Arabic numerals are more convenient to work with. We will also convert the date to a standard format.

sb$number <- 1:49
sb$date <- as.Date(sb$date, "%B. %d, %Y")
head(sb)

##   number       date                          site
## 3      1 1967-01-15 Los Angeles Memorial Coliseum
## 4      2 1968-01-14           Orange Bowl (Miami)
## 5      3 1969-01-12           Orange Bowl (Miami)
## 6      4 1970-01-11  Tulane Stadium (New Orleans)
## 7      5 1971-01-17           Orange Bowl (Miami)
## 8      6 1972-01-16  Tulane Stadium (New Orleans)
##                          result
## 3  Green Bay 35, Kansas City 10
## 4      Green Bay 33, Oakland 14
## 5 New York Jets 16, Baltimore 7
## 6   Kansas City 23, Minnesota 7
## 7       Baltimore 16, Dallas 13
## 8            Dallas 24, Miami 3

The result column should be split into four columns – the winning team’s name, the winner’s score, the losing team’s name, and the loser’s score. We start by splitting the results column into two columns at the comma. This operation uses the separate function from the tidyr package.

sb <- separate(sb, result, c('winner', 'loser'), sep=', ', remove=TRUE)
head(sb)

##   number       date                          site           winner
## 3      1 1967-01-15 Los Angeles Memorial Coliseum     Green Bay 35
## 4      2 1968-01-14           Orange Bowl (Miami)     Green Bay 33
## 5      3 1969-01-12           Orange Bowl (Miami) New York Jets 16
## 6      4 1970-01-11  Tulane Stadium (New Orleans)   Kansas City 23
## 7      5 1971-01-17           Orange Bowl (Miami)     Baltimore 16
## 8      6 1972-01-16  Tulane Stadium (New Orleans)        Dallas 24
##            loser
## 3 Kansas City 10
## 4     Oakland 14
## 5    Baltimore 7
## 6    Minnesota 7
## 7      Dallas 13
## 8        Miami 3

Finally, we split off the scores from the winner and loser columns. The function str_extract from the stringr package finds a substring matching a pattern. In this case, the pattern is a sequence of 1 or more digits at the end of a line.

pattern <- " \\d+$"
sb$winnerScore <- as.numeric(str_extract(sb$winner, pattern))
sb$loserScore <- as.numeric(str_extract(sb$loser, pattern))
sb$winner <- gsub(pattern, "", sb$winner)
sb$loser <- gsub(pattern, "", sb$loser)
head(sb)

##   number       date                          site        winner
## 3      1 1967-01-15 Los Angeles Memorial Coliseum     Green Bay
## 4      2 1968-01-14           Orange Bowl (Miami)     Green Bay
## 5      3 1969-01-12           Orange Bowl (Miami) New York Jets
## 6      4 1970-01-11  Tulane Stadium (New Orleans)   Kansas City
## 7      5 1971-01-17           Orange Bowl (Miami)     Baltimore
## 8      6 1972-01-16  Tulane Stadium (New Orleans)        Dallas
##         loser winnerScore loserScore
## 3 Kansas City          35         10
## 4     Oakland          33         14
## 5   Baltimore          16          7
## 6   Minnesota          23          7
## 7      Dallas          16         13
## 8       Miami          24          3

Our data frame is looking pretty good, so we write it to a CSV (comma-separated value) file.

write.csv(sb, 'superbowl.csv', row.names=F)

Web scraping in R: A tutorial using Super Bowl Data

David Radcliffe

January 18, 2016