rvest is another excellent package by Hadley Wickham for web scraping. I had been looking for a way to scrape tabular data off Wikipedia, and this package seemed like a good place to start. I will try a few examples from the package documentation and one I found online for extracting tables from Wikipedia, and conclude with my use case: extracting the Olympics medal tally and rank by country from the Rio 2016 Wikipedia page.

Vignette example

A good place to start is the vignette that explains how to use the selectorgadget tool, which helps identify the CSS selector needed to extract the desired components from a page:

library(rvest)
html <- read_html("http://www.imdb.com/title/tt1490017/")
cast <- html_nodes(html, ".itemprop .itemprop")
length(cast)
[1] 15
html_text(cast)
 [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"     "Alison Brie"     "David Burrows"   "Anthony Daniels"
 [7] "Charlie Day"     "Amanda Farinos"  "Keith Ferguson"  "Will Ferrell"    "Will Forte"      "Dave Franco"    
[13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"     
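
To see what the selector ".itemprop .itemprop" is actually doing, it helps to try it on a small piece of HTML that does not depend on the live page. The sketch below is only an illustration: the markup is made up (it is not copied from IMDb), and it assumes that read_html() is given a literal HTML string rather than a URL. The selector matches elements of class itemprop that sit inside another element of class itemprop, so the outer cells are skipped and only the inner name nodes are returned.

```r
library(rvest)

# Hypothetical markup, loosely modelled on a cast list; not copied from IMDb
snippet <- read_html(paste0(
  '<table>',
  '<tr><td class="itemprop"><a href="#"><span class="itemprop">Will Arnett</span></a></td></tr>',
  '<tr><td class="itemprop"><a href="#"><span class="itemprop">Elizabeth Banks</span></a></td></tr>',
  '<tr><td class="character">Batman</td></tr>',
  '</table>'
))

# Descendant selector: an .itemprop element nested inside another .itemprop element
cast <- html_nodes(snippet, ".itemprop .itemprop")

length(cast)
html_text(cast)
```

Working on a local snippet like this is also a quick way to test a selector before pointing it at a page whose markup may change over time.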

We can now try the example provided in the README file for the package on GitHub:

library(rvest)
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
rating <- lego_movie %>% 
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()
rating
[1] 7.8
poster <- lego_movie %>%
  html_nodes(".poster img") %>%
  html_attr("src")
poster
[1] "http://ia.media-imdb.com/images/M/MV5BMTg4MDk1ODExN15BMl5BanBnXkFtZTgwNzIyNjg3MDE@._V1_UX182_CR0,0,182,268_AL_.jpg"
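
One thing to keep in mind with pipelines like these is that a selector that matches nothing returns a zero-length result rather than an error. The following sketch shows one way to guard against that. The HTML fragment and the helper first_or_na() are my own inventions for illustration and are not part of rvest.

```r
library(rvest)

# Hypothetical fragment standing in for the rating and poster markup
fragment <- read_html(paste0(
  '<div><strong><span>7.8</span></strong>',
  '<div class="poster"><img src="poster.jpg" alt="Poster"></div></div>'
))

# Hypothetical helper: return NA instead of a zero-length vector when nothing matches
first_or_na <- function(x) if (length(x) == 0) NA_character_ else x[[1]]

# Text content of the matched node, converted to a number
rating <- fragment %>%
  html_nodes("strong span") %>%
  html_text() %>%
  first_or_na() %>%
  as.numeric()

# Value of the src attribute of the matched image
poster <- fragment %>%
  html_nodes(".poster img") %>%
  html_attr("src") %>%
  first_or_na()

# A selector that matches nothing now yields NA rather than character(0)
absent <- fragment %>%
  html_nodes(".no-such-class") %>%
  html_text() %>%
  first_or_na()
```

html_text() reads the text inside a node, while html_attr() reads one of its attributes, which is why the poster URL comes from "src" and not from the node text.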

Another related example, from the RStudio blog, gets the titles and authors of recent message board postings:

lego_movie %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_table()

Example of extracting a table from Wikipedia

For extracting tables from Wikipedia, I did not have much success with the selectorgadget approach above. I will instead work with the example from Cory Nissen’s blog, where he describes how to get the XPath for the table you are interested in using Inspect Element in the Chrome browser. He uses the population table on the following Wikipedia page as an example:

library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
population <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/table[2]') %>%
  html_table()
population <- population[[1]]
head(population)
  Rank in the fifty states, 2015 Rank in all states & territories, 2010 State or territory
1          7000100000000000000♠1                  7000100000000000000♠1         California
2          7000200000000000000♠2                  7000200000000000000♠2              Texas
3          7000300000000000000♠3                  7000400000000000000♠4            Florida
4          7000400000000000000♠4                  7000300000000000000♠3           New York
5          7000500000000000000♠5                  7000500000000000000♠5           Illinois
6          7000600000000000000♠6                  7000600000000000000♠6       Pennsylvania
  Population estimate, July 1, 2015 Census population, April 1, 2010
1                        39,144,818                       37,254,503
2                        27,469,114                       25,146,105
3                        20,271,272                       18,804,623
4                        19,795,791                       19,378,087
5                        12,859,995                       12,831,549
6                        12,802,503                       12,702,887
  Total seats in House of Representatives, 2013–2023 Estimated pop. per House seat, 2015
1                             7001530000000000000♠53                             738,581
2                             7001360000000000000♠36                             763,031
3                             7001270000000000000♠27                             750,788
4                             7001270000000000000♠27                             733,177
5                             7001180000000000000♠18                             714,444
6                             7001180000000000000♠18                             711,250
  Census pop. per House seat, 2010 Percent of total U.S. pop., 2015[note 1]
1                          702,905                                   12.18%
2                          698,487                                    8.55%
3                          696,345                                    6.31%
4                          717,707                                    6.16%
5                          712,813                                    4.00%
6                          705,688                                    3.98%
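
The XPath expression used above can be read as "the second table that is a direct child of the element whose id is mw-content-text". A minimal sketch on made-up HTML (the page content here is hypothetical, only the XPath is taken from the example) shows what that means in practice: tables nested deeper inside other elements are not counted.

```r
library(rvest)

# Hypothetical page: two tables directly under the content div, one nested deeper
page <- read_html(paste0(
  '<html><body><div id="mw-content-text">',
  '<table><tr><th>Note</th></tr><tr><td>first table</td></tr></table>',
  '<p>Some text between the tables</p>',
  '<table><tr><th>State</th><th>Rank</th></tr>',
  '<tr><td>California</td><td>1</td></tr>',
  '<tr><td>Texas</td><td>2</td></tr></table>',
  '<div><table><tr><th>Nested</th></tr><tr><td>ignored</td></tr></table></div>',
  '</div></body></html>'
))

# Same XPath as in the Wikipedia example
matched <- html_nodes(page, xpath = '//*[@id="mw-content-text"]/table[2]')
length(matched)

# html_table() on a node set returns a list, hence the [[1]]
states <- html_table(matched)[[1]]
states
```

Because the path relies on the table being a direct child and on its position, it is fragile: if a page wraps its tables in another element or adds a table earlier in the content, the same expression can match nothing or a different table. Checking length(matched) before indexing is a cheap safeguard.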

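Some clean-up of the text may be required. Several columns carry a long numeric prefix ending in ♠ ahead of the actual value (for example, 7000100000000000000♠1 for rank 1), which appears to be a hidden sort key from the Wikipedia table, and the population counts come through as text with thousands separators.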
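
As a rough sketch of what that clean-up could look like, the code below works on a few values copied from the head(population) output above. The column names and the two helper functions are hypothetical names of mine, and the approach assumes that the prefix is always a run of digits followed by the ♠ character (U+2660).

```r
# Sample values copied from the head(population) output above
population_sample <- data.frame(
  rank_2015 = c("7000100000000000000\u26601",
                "7000200000000000000\u26602",
                "7000300000000000000\u26603"),
  state = c("California", "Texas", "Florida"),
  estimate_2015 = c("39,144,818", "27,469,114", "20,271,272"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop the leading digits up to and including the spade character
strip_sort_key <- function(x) sub("^[0-9]+\u2660", "", x)

# Hypothetical helper: remove thousands separators and convert to numeric
parse_count <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

population_sample$rank_2015 <- as.integer(strip_sort_key(population_sample$rank_2015))
population_sample$estimate_2015 <- parse_count(population_sample$estimate_2015)

str(population_sample)
```

The same two helpers could be applied to the other affected columns of the full table, after checking that each column follows the same pattern.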

Extracting the Rio 2016 Olympics rank and medal tally by country

I will try the above for the case I am interested in: getting the medal tally by country for the recently concluded 2016 Rio Olympics. The cool part about ongoing events is that the data is constantly updated on Wikipedia, so you can pull the latest information while an event is active.

url <- "https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table"
medal_tally <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/table[2]') %>%
  html_table(fill=TRUE)
medal_tally <- medal_tally[[1]]
head(medal_tally)
   Rank                   NOC Gold Silver Bronze Total
1      1  United States (USA)   46     37     38   121
2      2  Great Britain (GBR)   27     23     17    67
3      3          China (CHN)   26     18     26    70
4      4         Russia (RUS)   19     18     19    56
5      5        Germany (GER)   17     10     15    42
6      6          Japan (JPN)   12      8     21    41

What I had hoped for!
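
From here the table is easy to work with. As a small, hedged example of a next step, the sketch below takes the first three rows from the output above, splits the NOC column into a country name and a three-letter code, checks that the medal counts add up, and re-orders by total medals. The new column names Country and Code are my own choice, and the pattern assumes every NOC entry ends with a three-letter code in parentheses.

```r
# First three rows copied from the head(medal_tally) output above
medal_sample <- data.frame(
  Rank = 1:3,
  NOC = c("United States (USA)", "Great Britain (GBR)", "China (CHN)"),
  Gold = c(46, 27, 26),
  Silver = c(37, 23, 18),
  Bronze = c(38, 17, 26),
  Total = c(121, 67, 70),
  stringsAsFactors = FALSE
)

# Trim stray whitespace, then split "Name (XXX)" into name and code
noc <- trimws(medal_sample$NOC)
medal_sample$Country <- sub("[[:space:]]*\\([A-Z]{3}\\)$", "", noc)
medal_sample$Code <- sub("^.*\\(([A-Z]{3})\\)$", "\\1", noc)

# Sanity check: gold + silver + bronze should equal the reported total
totals_ok <- all(with(medal_sample, Gold + Silver + Bronze == Total))

# The Wikipedia rank follows gold medals; ordering by total gives a different view
by_total <- medal_sample[order(-medal_sample$Total), c("Country", "Code", "Total")]
by_total
```

In this sample, ordering by total places China ahead of Great Britain, even though Great Britain ranks higher in the table on gold medals.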
