rvest package

Some data on the internet is already in .csv format, so we can just point R to it, like this:
dow <- read.csv("http://ichart.finance.yahoo.com/table.csv?s=^DJI")
head(dow)
## Date Open High Low Close Volume Adj.Close
## 1 2016-12-14 19876.13 19966.43 19748.67 19792.53 408430000 19792.53
## 2 2016-12-13 19852.21 19953.75 19846.45 19911.21 388420000 19911.21
## 3 2016-12-12 19770.20 19824.59 19747.74 19796.43 333660000 19796.43
## 4 2016-12-09 19631.35 19757.74 19623.19 19756.85 334470000 19756.85
## 5 2016-12-08 19559.94 19664.97 19527.83 19614.81 324570000 19614.81
## 6 2016-12-07 19241.99 19558.42 19229.83 19549.62 385200000 19549.62
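If that URL is unavailable, the same call can be tried on inline text. The sketch below is only an illustration: it passes three rows copied from the output above to read.csv() through its text argument, then converts the Date column to a proper Date. The header line, including "Adj Close", is an assumption about how the downloaded file is laid out, and dow_sample is a name made up for this example.

```r
# Three rows copied from the output above; the header line is an assumption
# about how the downloaded file is laid out.
csv_text <- "Date,Open,High,Low,Close,Volume,Adj Close
2016-12-14,19876.13,19966.43,19748.67,19792.53,408430000,19792.53
2016-12-13,19852.21,19953.75,19846.45,19911.21,388420000,19911.21
2016-12-12,19770.20,19824.59,19747.74,19796.43,333660000,19796.43"

# read.csv() accepts literal text in place of a file name or URL
dow_sample <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Dates arrive as text; convert them so they sort and plot correctly
dow_sample$Date <- as.Date(dow_sample$Date)

# Put the oldest observation first
dow_sample <- dow_sample[order(dow_sample$Date), ]
str(dow_sample)
```

Note that read.csv() turns the space in "Adj Close" into a dot, which is why the column shows up as Adj.Close.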
Most data, however, sits in an HTML table. I have used the XML package successfully to download pages that consist mostly of an HTML table.
library(XML)
salaries <- readHTMLTable("http://www.usatoday.com/sports/mlb/salaries/")[[1]]
head(salaries[,1:7])
## rank Name Team POS Salary Years Total Value
## 1 -- Clayton Kershaw LAD SP $ 33,000,000 7 (2014-20) $ 215,000,000
## 2 -- Zack Greinke ARI SP $ 31,876,966 6 (2016-21) $ 206,500,000
## 3 -- David Price BOS SP $ 30,000,000 7 (2016-22) $ 217,000,000
## 4 -- Miguel Cabrera DET 1B $ 28,000,000 10 (2014-23) $ 292,000,000
## 5 -- Justin Verlander DET SP $ 28,000,000 7 (2013-19) $ 180,000,000
## 6 -- Hanley Ramirez BOS 1B $ 27,500,000 4 (2015-18) $ 88,000,000
#or
teamwins <- readHTMLTable("http://www.baseball-reference.com/leagues/MLB/#teams_team_wins3000::none", stringsAsFactors = FALSE)[[1]]
head(teamwins[,1:18])
## Year G ARI ATL BLA BAL BOS CHC CHW CIN CLE COL DET HOU KCR ANA LAD FLA
## 1 2016 162 69 68 89 93 103 78 68 94 75 86 84 81 74 91 79
## 2 2015 162 79 67 81 78 97 76 64 81 68 74 86 95 85 92 71
## 3 2014 162 64 79 96 71 73 73 76 85 66 90 70 89 98 94 77
## 4 2013 163 81 96 85 97 66 63 90 92 74 93 51 86 78 92 62
## 5 2012 162 81 94 93 69 61 85 97 68 64 88 55 72 89 86 69
## 6 2011 162 94 89 69 90 71 79 79 80 73 95 56 71 86 82 72
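The [[1]] in both calls is needed because readHTMLTable() returns a list with one data frame per table it finds on the page. A small sketch on a made-up HTML string (the markup, the ids and the numbers are all invented for illustration) shows one way to check what came back before picking an element:

```r
library(XML)

# Hypothetical page with two tables; ids and values are invented
html <- '<html><body>
<table id="notes"><tr><th>Note</th></tr><tr><td>placeholder</td></tr></table>
<table id="wins">
  <tr><th>Year</th><th>Team</th><th>W</th></tr>
  <tr><td>2016</td><td>AAA</td><td>90</td></tr>
  <tr><td>2015</td><td>BBB</td><td>85</td></tr>
</table>
</body></html>'

# Parse the string (asText = TRUE: it is markup, not a file name or URL)
doc <- htmlParse(html, asText = TRUE)

# One data frame per <table> element in the document
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

length(tables)          # how many tables were found
sapply(tables, ncol)    # column counts help identify the one you want
wins <- tables[[2]]     # select by position, as [[1]] does above
wins
```

If the table you want is not the first one, changing the index is usually all that is required.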
With other pages I couldn’t get XML to work, so I used rvest.
library(rvest)
page <- read_html("http://www.nytimes.com/elections/results/michigan")
node <- html_nodes(page, xpath = '//*[@id="mi-0-2016-11-08"]/div/div[2]/div[3]/div/table')
rMI <- html_table(node[[1]], fill=TRUE)
head(rMI)
## Vote by county Trump Clinton
## 1 Wayne 228,908 517,842
## 2 Oakland 289,127 342,976
## 3 Macomb 224,589 176,238
## 4 Kent 147,959 138,567
## 5 Genesee 84,174 102,744
## 6 Washtenaw 50,335 128,025
The key is to find the correct ‘node’ on the page. I use the Chrome inspector (right-click in Chrome and choose ‘Inspect’). Then browse through the HTML in the inspector and watch which parts of the web page are highlighted. You may need to click the little black triangles to expand all of the code. Once the table that you are interested in is highlighted, right-click it and select ‘Copy XPath’, as illustrated in the image below.
The XPath is then passed to the html_nodes() function. Finally, we read the HTML table with html_table():
page <- read_html("http://www.nytimes.com/elections/results/new-york")
node <- html_nodes(page, xpath = '//*[@id="ny-0-2016-11-08"]/div/div[2]/div[4]/div/table')
rNY <- html_table(node[[1]], fill=TRUE)
head(rNY)
## Vote by county Clinton Trump
## 1 Brooklyn 595,086 133,653
## 2 Queens 473,389 138,550
## 3 Suffolk 276,953 328,403
## 4 Manhattan 515,481 58,935
## 5 Nassau 307,326 275,479
## 6 Erie 192,065 173,817
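The workflow above can be tried without a live page. The sketch below runs the same three calls on a small invented HTML string (the id, the nesting and the vote counts are made up, and results is a hypothetical name). It also uses a shorter XPath that anchors on the id and then matches any table beneath it, which may be less brittle than a long copied path when a site changes its layout. The last step converts the comma-formatted counts to numbers.

```r
library(rvest)

# Invented markup standing in for a results page; counts are not real
html <- '<html><body>
<div id="xx-0-2016-11-08"><div><div>
  <table>
    <tr><th>Vote by county</th><th>A</th><th>B</th></tr>
    <tr><td>North</td><td>1,200</td><td>3,450</td></tr>
    <tr><td>South</td><td>10,001</td><td>980</td></tr>
  </table>
</div></div></div>
</body></html>'

page <- read_html(html)   # read_html() also accepts literal HTML

# Anchor on the id, then match any <table> nested below it
node <- html_nodes(page, xpath = '//*[@id="xx-0-2016-11-08"]//table')

results <- html_table(node[[1]])

# Counts such as "1,200" are text; strip the commas before converting
results[[2]] <- as.numeric(gsub(",", "", as.character(results[[2]])))
results[[3]] <- as.numeric(gsub(",", "", as.character(results[[3]])))
results
```

If html_nodes() returns an empty set, the XPath did not match anything, so checking length(node) before calling html_table() is a cheap safeguard.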