Here I will demonstrate a simple web scraping example using a table of exchange rates from Peru’s tax agency, SUNAT. Today that page looked like this:
[Figure: SUNAT currency exchange rates for October 2017]
First we load the rvest library and read the webpage we want to scrape using its URL.
library(rvest)
## Loading required package: xml2
url <- 'http://www.sunat.gob.pe/cl-at-ittipcam/tcS01Alias'
webpage <- read_html(url)
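Network reads can fail (the site may be down, or the URL may change), so it can help to guard the call. The following is only a minimal sketch: `safe_read_html` is a hypothetical helper name, not part of rvest, and it returns NULL instead of stopping the script. To keep the snippet runnable offline it is exercised on a missing local file rather than the SUNAT URL.

```r
library(rvest)

# Hypothetical helper (not part of rvest): returns NULL when the read fails
safe_read_html <- function(x) {
  tryCatch(
    read_html(x),
    error = function(e) {
      message("Could not read ", x, ": ", conditionMessage(e))
      NULL
    }
  )
}

# A path that does not exist triggers the error branch
page <- safe_read_html("no-such-file.html")
if (is.null(page)) message("No page to parse; try again later.")
```

In the example above, calling `safe_read_html(url)` in place of `read_html(url)` would let the script check for NULL before trying to extract tables.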
We use html_nodes to extract every table element from the page.
tbls <- html_nodes(webpage, "table")
length(tbls)
## [1] 6
There are 6 tables. Through trial and error I found that my table of interest is table 2.
tbl2<-html_table(tbls[[2]])
print(tbl2)
##    X1     X2    X3  X4     X5    X6  X7     X8    X9 X10    X11   X12
## 1 Día Compra Venta Día Compra Venta Día Compra Venta Día Compra Venta
## 2   3  3.267 3.271   4  3.266 3.268   5  3.258 3.260   6  3.254 3.256
## 3   7  3.266 3.268  10  3.270 3.273  11  3.265 3.267  12  3.260 3.262
## 4  13  3.254 3.256  14  3.248 3.251  17  3.244 3.247  18  3.244 3.246
## 5  19  3.242 3.244  20  3.235 3.237  21  3.237 3.240  24  3.238 3.241
## 6  25  3.238 3.242  26  3.233 3.235  27  3.236 3.239  28  3.244 3.248
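Trial and error works, but the table index can shift if the page layout changes. Here is a minimal sketch of picking the table by its content instead. The inline HTML is a made-up stand-in for the real page (with the accent in "Día" dropped to keep the snippet ASCII-only), and the assumption is that only the table of interest contains a cell reading "Compra".

```r
library(rvest)

# Made-up stand-in page: a navigation table plus a small rates table
page <- read_html('
<html><body>
<table><tr><td>Menu</td></tr></table>
<table>
  <tr><td>Dia</td><td>Compra</td><td>Venta</td></tr>
  <tr><td>3</td><td>3.267</td><td>3.271</td></tr>
</table>
</body></html>')

tbls <- html_nodes(page, "table")
parsed <- html_table(tbls)  # one data frame per table

# Flag tables where some cell is exactly "Compra"
has_compra <- vapply(parsed, function(t) "Compra" %in% unlist(t), logical(1))
idx <- which(has_compra)
rates <- parsed[[idx[1]]]
print(idx)
```

If more than one table matched, `idx` would have several entries and the choice would need a stricter test, so this is a starting point rather than a guarantee.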
dim(tbl2)
## [1] 6 12
num.cols<-dim(tbl2)[2]
num.rows<-dim(tbl2)[1]
num.cols
## [1] 12
num.rows
## [1] 6
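Now that we have the number of rows and columns, we use them to build three vectors (day, buy rate, sell rate) that we then combine into a data.frame. Each row of the scraped table holds four Día/Compra/Venta groups side by side, and the first row holds the headers, so we loop from row 2 and walk across the columns three at a time.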
dia<-c()
compra<-c()
venta<-c()
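# Row 1 holds the column headers, so the data starts at row 2
for(i in 2:num.rows){
  # Each row holds num.cols/3 groups of (Día, Compra, Venta)
  for(j in 1:(num.cols/3)){
    first<-(j-1)*3+1  # column index of the j-th Día
    dia<-c(dia,as.numeric(tbl2[i,first]))
    compra<-c(compra,as.numeric(tbl2[i,first+1]))
    venta<-c(venta,as.numeric(tbl2[i,first+2]))
  }
}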
output<-data.frame(dia,compra,venta)
print(output)
##    dia compra venta
## 1    3  3.267 3.271
## 2    4  3.266 3.268
## 3    5  3.258 3.260
## 4    6  3.254 3.256
## 5    7  3.266 3.268
## 6   10  3.270 3.273
## 7   11  3.265 3.267
## 8   12  3.260 3.262
## 9   13  3.254 3.256
## 10  14  3.248 3.251
## 11  17  3.244 3.247
## 12  18  3.244 3.246
## 13  19  3.242 3.244
## 14  20  3.235 3.237
## 15  21  3.237 3.240
## 16  24  3.238 3.241
## 17  25  3.238 3.242
## 18  26  3.233 3.235
## 19  27  3.236 3.239
## 20  28  3.244 3.248
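The nested loop can also be written without growing vectors one element at a time. Below is a minimal sketch of a vectorized alternative in base R. It runs on a toy data frame, `tbl`, which is a made-up stand-in shaped like tbl2 (a header row, then rows of side-by-side triplets), not the scraped table itself. Dropping rows with a missing day is an assumption about how a partly filled last row would look.

```r
# Toy stand-in for tbl2: header row plus two data rows, two triplets per row
tbl <- data.frame(
  X1 = c("Dia", "3", "5"),
  X2 = c("Compra", "3.267", "3.258"),
  X3 = c("Venta", "3.271", "3.260"),
  X4 = c("Dia", "4", "6"),
  X5 = c("Compra", "3.266", "3.254"),
  X6 = c("Venta", "3.268", "3.256"),
  stringsAsFactors = FALSE
)

body <- tbl[-1, ]  # drop the header row

# Transpose so the cells are read row by row, then fold into 3 columns
vals <- as.numeric(t(as.matrix(body)))
out <- as.data.frame(matrix(vals, ncol = 3, byrow = TRUE))
names(out) <- c("dia", "compra", "venta")

# Assumption: empty trailing cells would parse as NA, so drop them
out <- out[!is.na(out$dia), ]
print(out)
```

The rows come out in the same left-to-right, top-to-bottom order as the loop version, so the two approaches should be interchangeable on a table with this layout.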