Illustration of the rvest package

Some data on the internet is already in .csv format, so we can point R directly at the URL:

dow <- read.csv("http://ichart.finance.yahoo.com/table.csv?s=^DJI")
head(dow)

##         Date     Open     High      Low    Close    Volume Adj.Close
## 1 2016-12-14 19876.13 19966.43 19748.67 19792.53 408430000  19792.53
## 2 2016-12-13 19852.21 19953.75 19846.45 19911.21 388420000  19911.21
## 3 2016-12-12 19770.20 19824.59 19747.74 19796.43 333660000  19796.43
## 4 2016-12-09 19631.35 19757.74 19623.19 19756.85 334470000  19756.85
## 5 2016-12-08 19559.94 19664.97 19527.83 19614.81 324570000  19614.81
## 6 2016-12-07 19241.99 19558.42 19229.83 19549.62 385200000  19549.62
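Once the file is read, the Date column is still text (or a factor in R versions before 4.0), so it usually needs converting before any time-series work. The following is a minimal sketch, assuming a few rows copied from the output above so that it runs without a network connection; the names `csv_text` and `dow_sample` are made up for illustration.

```r
# Three rows copied from the output above (not a live download)
csv_text <- "Date,Open,High,Low,Close,Volume,Adj.Close
2016-12-14,19876.13,19966.43,19748.67,19792.53,408430000,19792.53
2016-12-13,19852.21,19953.75,19846.45,19911.21,388420000,19911.21
2016-12-12,19770.20,19824.59,19747.74,19796.43,333660000,19796.43"

# read.csv() accepts literal text through the text= argument
dow_sample <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Convert the Date column from character to Date
dow_sample$Date <- as.Date(dow_sample$Date)

# The file lists the newest day first; sort into chronological order
dow_sample <- dow_sample[order(dow_sample$Date), ]
str(dow_sample)
```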

Most data, however, sits in an HTML table. I have used the XML package successfully to download pages that consist mostly of HTML tables.

library(XML)
salaries <- readHTMLTable("http://www.usatoday.com/sports/mlb/salaries/")[[1]]
head(salaries[,1:7])

##   rank             Name Team POS       Salary        Years   Total Value
## 1   --  Clayton Kershaw  LAD  SP $ 33,000,000  7 (2014-20) $ 215,000,000
## 2   --     Zack Greinke  ARI  SP $ 31,876,966  6 (2016-21) $ 206,500,000
## 3   --      David Price  BOS  SP $ 30,000,000  7 (2016-22) $ 217,000,000
## 4   --   Miguel Cabrera  DET  1B $ 28,000,000 10 (2014-23) $ 292,000,000
## 5   -- Justin Verlander  DET  SP $ 28,000,000  7 (2013-19) $ 180,000,000
## 6   --   Hanley Ramirez  BOS  1B $ 27,500,000  4 (2015-18)  $ 88,000,000

#or

teamwins <- readHTMLTable("http://www.baseball-reference.com/leagues/MLB/#teams_team_wins3000::none", stringsAsFactors = FALSE)[[1]]
head(teamwins[,1:18])

##   Year   G ARI ATL BLA BAL BOS CHC CHW CIN CLE COL DET HOU KCR ANA LAD FLA
## 1 2016 162  69  68      89  93 103  78  68  94  75  86  84  81  74  91  79
## 2 2015 162  79  67      81  78  97  76  64  81  68  74  86  95  85  92  71
## 3 2014 162  64  79      96  71  73  73  76  85  66  90  70  89  98  94  77
## 4 2013 163  81  96      85  97  66  63  90  92  74  93  51  86  78  92  62
## 5 2012 162  81  94      93  69  61  85  97  68  64  88  55  72  89  86  69
## 6 2011 162  94  89      69  90  71  79  79  80  73  95  56  71  86  82  72
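Because of stringsAsFactors = FALSE, every column of teamwins comes back as character, and the empty cells (the BLA column above) are empty strings. A minimal sketch of converting such a table to numbers is shown here, assuming a small hand-typed copy of the first rows; `teamwins_sample` is a hypothetical name, not an object created above.

```r
# Hand-typed copy of a few cells from the output above
teamwins_sample <- data.frame(
  Year = c("2016", "2015", "2014"),
  G    = c("162", "162", "162"),
  ARI  = c("69", "79", "64"),
  ATL  = c("68", "67", "79"),
  BLA  = c("", "", ""),
  stringsAsFactors = FALSE
)

# Convert every column to numeric; empty strings become NA
teamwins_sample[] <- lapply(teamwins_sample, as.numeric)
str(teamwins_sample)
```

Assigning into `teamwins_sample[]` keeps the object a data frame instead of replacing it with the list that lapply() returns.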

With other pages I couldn’t get XML to work, so I used rvest instead.

library(rvest)
page <- read_html("http://www.nytimes.com/elections/results/michigan")
node <- html_nodes(page, xpath = '//*[@id="mi-0-2016-11-08"]/div/div[2]/div[3]/div/table')
rMI <- html_table(node[[1]], fill=TRUE)
head(rMI)

##   Vote by county   Trump Clinton
## 1          Wayne 228,908 517,842
## 2        Oakland 289,127 342,976
## 3         Macomb 224,589 176,238
## 4           Kent 147,959 138,567
## 5        Genesee  84,174 102,744
## 6      Washtenaw  50,335 128,025
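The vote counts contain thousands separators, so they are text rather than numbers. A minimal sketch of cleaning them follows, assuming a hand-typed copy of the first three rows; the names `rMI_sample`, `county`, `to_number`, and `margin` are hypothetical and only serve the illustration.

```r
# Hand-typed copy of the first three rows shown above
rMI_sample <- data.frame(
  county  = c("Wayne", "Oakland", "Macomb"),
  Trump   = c("228,908", "289,127", "224,589"),
  Clinton = c("517,842", "342,976", "176,238"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip the commas, then convert to numeric
to_number <- function(x) as.numeric(gsub(",", "", x))

rMI_sample$Trump   <- to_number(rMI_sample$Trump)
rMI_sample$Clinton <- to_number(rMI_sample$Clinton)

# With numeric columns, arithmetic such as a vote margin becomes possible
rMI_sample$margin <- rMI_sample$Trump - rMI_sample$Clinton
rMI_sample
```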

The key is to find the correct ‘node’ on the page. I use the Chrome inspector (right-click on the page in Chrome and choose ‘Inspect’). Then browse through the HTML in the inspector and watch which parts of the web page are highlighted. You may need to click the little black triangles to expand nested elements. Once the table you are interested in is highlighted, right-click it and select ‘Copy XPath’, as illustrated in the image below.

The XPath is then passed to the html_nodes() function. Finally, html_table() reads the HTML table into a data frame:

page <- read_html("http://www.nytimes.com/elections/results/new-york")
node <- html_nodes(page, xpath = '//*[@id="ny-0-2016-11-08"]/div/div[2]/div[4]/div/table')
rNY <- html_table(node[[1]], fill=TRUE)
head(rNY)

##   Vote by county Clinton   Trump
## 1       Brooklyn 595,086 133,653
## 2         Queens 473,389 138,550
## 3        Suffolk 276,953 328,403
## 4      Manhattan 515,481  58,935
## 5         Nassau 307,326 275,479
## 6           Erie 192,065 173,817
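The same three steps (read the page, select the node by XPath, convert it to a table) can be tried offline on a small HTML string. This is a minimal sketch under stated assumptions: the id "demo-results" and the div nesting are invented for illustration and do not come from the New York Times pages, and the names ending in `_demo` are hypothetical.

```r
library(rvest)

# Invented page: the id and the nesting are made up for this sketch
html_text <- '<html><body>
  <div id="demo-results">
    <div><p>Some text before the table</p></div>
    <div>
      <table>
        <tr><th>County</th><th>Clinton</th><th>Trump</th></tr>
        <tr><td>Brooklyn</td><td>595,086</td><td>133,653</td></tr>
        <tr><td>Queens</td><td>473,389</td><td>138,550</td></tr>
      </table>
    </div>
  </div>
</body></html>'

# read_html() also accepts a string of HTML instead of a URL
page_demo <- read_html(html_text)

# Select the table inside the second div under the element with the given id
node_demo <- html_nodes(page_demo, xpath = '//*[@id="demo-results"]/div[2]/table')

# An XPath that matches nothing returns an empty set, and [[1]] would then fail
stopifnot(length(node_demo) >= 1)

tab_demo <- html_table(node_demo[[1]])
tab_demo
```

Checking the length of the node set before indexing gives a clearer failure when the page layout changes and the copied XPath no longer matches anything.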

Tomas Dvorak

December 15, 2016