Hadley Wickham created an R package called rvest that makes it easy to scrape static web pages. In this note, I will show how to use rvest to extract a table with Super Bowl data from a web page.
We will use the optional R packages rvest, stringr, and tidyr. These packages must be installed before running this tutorial.
install.packages('rvest')
install.packages('stringr')
install.packages('tidyr')
Once the packages have been installed, they can be loaded using the library command.
library(rvest)
library(stringr)
library(tidyr)
We use the read_html function to read a web page. This function is provided by the xml2 package, which was loaded automatically when we loaded rvest.
url <- 'http://espn.go.com/nfl/superbowl/history/winners'
webpage <- read_html(url)
Next, we use the functions html_nodes and html_table (from rvest) to extract the HTML table element and convert it to a data frame.
sb_table <- html_nodes(webpage, 'table')
sb <- html_table(sb_table)[[1]]
head(sb)
## X1 X2
## 1 Super Bowl Winners and Results <NA>
## 2 NO. DATE
## 3 I Jan. 15, 1967
## 4 II Jan. 14, 1968
## 5 III Jan. 12, 1969
## 6 IV Jan. 11, 1970
## X3 X4
## 1 <NA> <NA>
## 2 SITE RESULT
## 3 Los Angeles Memorial Coliseum Green Bay 35, Kansas City 10
## 4 Orange Bowl (Miami) Green Bay 33, Oakland 14
## 5 Orange Bowl (Miami) New York Jets 16, Baltimore 7
## 6 Tulane Stadium (New Orleans) Kansas City 23, Minnesota 7
We remove the first two rows, and set the column names.
sb <- sb[-(1:2), ]
names(sb) <- c("number", "date", "site", "result")
head(sb)
## number date site
## 3 I Jan. 15, 1967 Los Angeles Memorial Coliseum
## 4 II Jan. 14, 1968 Orange Bowl (Miami)
## 5 III Jan. 12, 1969 Orange Bowl (Miami)
## 6 IV Jan. 11, 1970 Tulane Stadium (New Orleans)
## 7 V Jan. 17, 1971 Orange Bowl (Miami)
## 8 VI Jan. 16, 1972 Tulane Stadium (New Orleans)
## result
## 3 Green Bay 35, Kansas City 10
## 4 Green Bay 33, Oakland 14
## 5 New York Jets 16, Baltimore 7
## 6 Kansas City 23, Minnesota 7
## 7 Baltimore 16, Dallas 13
## 8 Dallas 24, Miami 3
It is traditional to use Roman numerals to refer to Super Bowls, but Arabic numerals are more convenient to work with. We will also convert the date to a standard format.
sb$number <- 1:49
sb$date <- as.Date(sb$date, "%B. %d, %Y")
head(sb)
## number date site
## 3 1 1967-01-15 Los Angeles Memorial Coliseum
## 4 2 1968-01-14 Orange Bowl (Miami)
## 5 3 1969-01-12 Orange Bowl (Miami)
## 6 4 1970-01-11 Tulane Stadium (New Orleans)
## 7 5 1971-01-17 Orange Bowl (Miami)
## 8 6 1972-01-16 Tulane Stadium (New Orleans)
## result
## 3 Green Bay 35, Kansas City 10
## 4 Green Bay 33, Oakland 14
## 5 New York Jets 16, Baltimore 7
## 6 Kansas City 23, Minnesota 7
## 7 Baltimore 16, Dallas 13
## 8 Dallas 24, Miami 3
The result column should be split into four columns – the winning team’s name, the winner’s score, the losing team’s name, and the loser’s score. We start by splitting the results column into two columns at the comma. This operation uses the separate function from the tidyr package.
sb <- separate(sb, result, c('winner', 'loser'), sep=', ', remove=TRUE)
head(sb)
## number date site winner
## 3 1 1967-01-15 Los Angeles Memorial Coliseum Green Bay 35
## 4 2 1968-01-14 Orange Bowl (Miami) Green Bay 33
## 5 3 1969-01-12 Orange Bowl (Miami) New York Jets 16
## 6 4 1970-01-11 Tulane Stadium (New Orleans) Kansas City 23
## 7 5 1971-01-17 Orange Bowl (Miami) Baltimore 16
## 8 6 1972-01-16 Tulane Stadium (New Orleans) Dallas 24
## loser
## 3 Kansas City 10
## 4 Oakland 14
## 5 Baltimore 7
## 6 Minnesota 7
## 7 Dallas 13
## 8 Miami 3
Finally, we split off the scores from the winner and loser columns. The function str_extract from the stringr package finds a substring matching a pattern. In this case, the pattern is a sequence of 1 or more digits at the end of a line.
pattern <- " \\d+$"
sb$winnerScore <- as.numeric(str_extract(sb$winner, pattern))
sb$loserScore <- as.numeric(str_extract(sb$loser, pattern))
sb$winner <- gsub(pattern, "", sb$winner)
sb$loser <- gsub(pattern, "", sb$loser)
head(sb)
## number date site winner
## 3 1 1967-01-15 Los Angeles Memorial Coliseum Green Bay
## 4 2 1968-01-14 Orange Bowl (Miami) Green Bay
## 5 3 1969-01-12 Orange Bowl (Miami) New York Jets
## 6 4 1970-01-11 Tulane Stadium (New Orleans) Kansas City
## 7 5 1971-01-17 Orange Bowl (Miami) Baltimore
## 8 6 1972-01-16 Tulane Stadium (New Orleans) Dallas
## loser winnerScore loserScore
## 3 Kansas City 35 10
## 4 Oakland 33 14
## 5 Baltimore 16 7
## 6 Minnesota 23 7
## 7 Dallas 16 13
## 8 Miami 24 3
Our data frame is looking pretty good, so we write it to a CSV (comma-separated value) file.
write.csv(sb, 'superbowl.csv', row.names=F)