rvest is another excellent package by Hadley Wickham, this one for web scraping. I had been looking for a way to scrape tabular data off Wikipedia, and this package seemed like a good place to start. I will try a few examples from the package documentation and one I found online for extracting tables from Wikipedia, then conclude with my own use case: extracting the Olympic medal tally and rank by country from the Rio 2016 Wikipedia page.
A good place to start is the vignette on SelectorGadget, a tool that helps identify the CSS selector needed to extract the desired components from a page:
library(rvest)
html <- read_html("http://www.imdb.com/title/tt1490017/")
cast <- html_nodes(html, ".itemprop .itemprop")
length(cast)
[1] 15
html_text(cast)
[1] "Will Arnett" "Elizabeth Banks" "Craig Berry" "Alison Brie" "David Burrows" "Anthony Daniels"
[7] "Charlie Day" "Amanda Farinos" "Keith Ferguson" "Will Ferrell" "Will Forte" "Dave Franco"
[13] "Morgan Freeman" "Todd Hansen" "Jonah Hill"
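To see what a selector like ".itemprop .itemprop" actually matches without depending on the live IMDb page, here is a minimal sketch on an inline HTML string. The markup is made up for illustration and is not IMDb's real HTML; it only assumes that the names sit in an element of class itemprop nested inside another element of the same class.

```r
library(rvest)

# Hypothetical markup, loosely modelled on a cast list (not IMDb's real HTML).
page <- read_html('
<html><body>
  <table>
    <tr><td class="itemprop"><a href="#"><span class="itemprop">Actor One</span></a></td></tr>
    <tr><td class="itemprop"><a href="#"><span class="itemprop">Actor Two</span></a></td></tr>
    <tr><td class="character">Some Role</td></tr>
  </table>
</body></html>')

# ".itemprop .itemprop" is a descendant selector: it matches an element of
# class itemprop that sits inside another element of class itemprop.
cast <- html_nodes(page, ".itemprop .itemprop")
cast_names <- html_text(cast)
cast_names
```

Only the two inner spans match; the outer cells have the class but are not nested inside another itemprop element, and the "character" cell has a different class.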
We can now try the example from the package's README on GitHub:
library(rvest)
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
rating <- lego_movie %>%
html_nodes("strong span") %>%
html_text() %>%
as.numeric()
rating
[1] 7.8
poster <- lego_movie %>%
html_nodes(".poster img") %>%
html_attr("src")
poster
[1] "http://ia.media-imdb.com/images/M/MV5BMTg4MDk1ODExN15BMl5BanBnXkFtZTgwNzIyNjg3MDE@._V1_UX182_CR0,0,182,268_AL_.jpg"
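The poster example combines a selector with html_attr(), which returns an attribute value rather than the text of a node. A small sketch of the same pattern on an inline page follows; the markup and the example.com URLs are placeholders I made up, not what IMDb serves.

```r
library(rvest)

# Hypothetical page with one image inside a .poster container and one outside.
page <- read_html('
<html><body>
  <div class="poster"><a href="#"><img src="https://example.com/poster.jpg" alt="Poster"></a></div>
  <img src="https://example.com/other.png" alt="Unrelated">
</body></html>')

# Select only images nested under class "poster", then read their src attribute.
poster_src <- page %>%
  html_nodes(".poster img") %>%
  html_attr("src")
poster_src
```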
Another related example, from the RStudio blog, pulls the titles and authors of recent message board postings, which sit in the second table on the page:
lego_movie %>%
html_nodes("table") %>%
.[[2]] %>%
html_table()
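The pattern here is: collect every table node, pick one by position, and convert it to a data frame. A minimal sketch with two made-up tables shows the same steps without relying on the layout of the live page; the table contents and the names tables and posts are assumptions for illustration.

```r
library(rvest)

# Hypothetical page with two tables; the second one holds the postings.
page <- read_html('
<html><body>
  <table>
    <tr><th>Key</th><th>Value</th></tr>
    <tr><td>a</td><td>1</td></tr>
  </table>
  <table>
    <tr><th>Title</th><th>Author</th></tr>
    <tr><td>First post</td><td>user_one</td></tr>
    <tr><td>Second post</td><td>user_two</td></tr>
  </table>
</body></html>')

# html_nodes() returns all matching tables in document order.
tables <- html_nodes(page, "table")

# tables[[2]] is equivalent to piping into .[[2]]; html_table() uses the
# header row for the column names.
posts <- html_table(tables[[2]])
posts
```

Selecting by position is fragile: if the page gains or loses a table, the index silently points at something else, so it is worth checking the column names of the result.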
For extracting tables from Wikipedia, I did not have much success with the SelectorGadget approach above. Instead, I will work from the example on Cory Nissen’s blog, where he describes how to get the XPath of the table you are interested in using Inspect Element in the Chrome browser. He uses the population table on the following Wikipedia page as an example:
library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
population <- url %>%
read_html() %>%
html_nodes(xpath='//*[@id="mw-content-text"]/table[2]') %>%
html_table()
population <- population[[1]]
head(population)
Rank in the fifty states, 2015 Rank in all states & territories, 2010 State or territory
1 7000100000000000000♠1 7000100000000000000♠1 California
2 7000200000000000000♠2 7000200000000000000♠2 Texas
3 7000300000000000000♠3 7000400000000000000♠4 Florida
4 7000400000000000000♠4 7000300000000000000♠3 New York
5 7000500000000000000♠5 7000500000000000000♠5 Illinois
6 7000600000000000000♠6 7000600000000000000♠6 Pennsylvania
Population estimate, July 1, 2015 Census population, April 1, 2010
1 39,144,818 37,254,503
2 27,469,114 25,146,105
3 20,271,272 18,804,623
4 19,795,791 19,378,087
5 12,859,995 12,831,549
6 12,802,503 12,702,887
Total seats in House of Representatives, 2013–2023 Estimated pop. per House seat, 2015
1 7001530000000000000♠53 738,581
2 7001360000000000000♠36 763,031
3 7001270000000000000♠27 750,788
4 7001270000000000000♠27 733,177
5 7001180000000000000♠18 714,444
6 7001180000000000000♠18 711,250
Census pop. per House seat, 2010 Percent of total U.S. pop., 2015[note 1]
1 702,905 12.18%
2 698,487 8.55%
3 696,345 6.31%
4 717,707 6.16%
5 712,813 4.00%
6 705,688 3.98%
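Some clean-up of the text is required: the rank and House-seat columns carry what look like hidden sort keys (the digits before the ♠), and the population counts are read in as text with thousands separators.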
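One way to do that clean-up is with base R string functions. The sketch below works on a small data frame whose values are copied from the output above; the data frame raw and its short column names are hypothetical stand-ins for the scraped table, and the approach assumes every affected value has the form "digits, ♠, real value".

```r
# Sample values copied from the printed output; "\u2660" is the spade character.
raw <- data.frame(
  rank  = c("7000100000000000000\u26601", "7000200000000000000\u26602"),
  state = c("California", "Texas"),
  pop   = c("39,144,818", "27,469,114"),
  stringsAsFactors = FALSE
)

# Drop everything up to and including the spade, then convert to integer.
raw$rank <- as.integer(sub("^.*\u2660", "", raw$rank))

# Remove the thousands separators before converting to numeric.
raw$pop <- as.numeric(gsub(",", "", raw$pop))

str(raw)
```

The same two substitutions could be applied to the other affected columns of the real table, for example with lapply() over the relevant column names.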
I will now apply the same approach to the case I am interested in: getting the medal tally by country for the recently concluded 2016 Rio Olympics. The nice part about ongoing events is that Wikipedia is updated continuously, so you can pull the latest figures while an event is still in progress.
url <- "https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table"
medal_tally <- url %>%
read_html() %>%
html_nodes(xpath='//*[@id="mw-content-text"]/table[2]') %>%
html_table(fill=TRUE)
medal_tally <- medal_tally[[1]]
head(medal_tally)
Rank NOC Gold Silver Bronze Total
1 1 United States (USA) 46 37 38 121
2 2 Great Britain (GBR) 27 23 17 67
3 3 China (CHN) 26 18 26 70
4 4 Russia (RUS) 19 18 19 56
5 5 Germany (GER) 17 10 15 42
6 6 Japan (JPN) 12 8 21 41
What I had hoped for!
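As a possible next step, the NOC column can be split into a country name and its three-letter code, which makes it easier to join the tally with other data. This is a sketch on a few rows copied from the output above; the data frame medals and the new columns Country and Code are names I chose, and it assumes every entry looks like "Name (XXX)" with plain spaces.

```r
# A few rows copied from the medal table output.
medals <- data.frame(
  NOC  = c("United States (USA)", "Great Britain (GBR)", "China (CHN)"),
  Gold = c(46, 27, 26),
  stringsAsFactors = FALSE
)

# Country: remove the trailing " (XXX)" part.
medals$Country <- sub("[[:space:]]*\\([A-Z]{3}\\)$", "", medals$NOC)

# Code: keep only the three capital letters inside the parentheses.
medals$Code <- sub("^.*\\(([A-Z]{3})\\)$", "\\1", medals$NOC)

medals[, c("Country", "Code", "Gold")]
```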