Olgun AYDIN - olgunaydinn@gmail.com
2016/10/13 - Erum2016/Poland
Official statistics institutes and central banks calculate the inflation rate on a monthly basis. Inflation is measured by the consumer price index (CPI), which captures the percentage change in the cost to the average consumer of acquiring a basket of goods and services over a specific interval, such as a year. The central banks and statistics institutes publish this information publicly, most of them monthly. This makes it hard to track inflation rates for “Fast-Moving Consumer Goods (FMCG)” daily.
To monitor this, I created a SparkR-Shiny app. The application scrapes top FMCG-related web sites daily and stores the data on a Spark standalone cluster running on Amazon Web Services (AWS) Elastic Compute Cloud (EC2). With this application, people can filter the consumer price index for any time interval and any category, and compare price changes of goods across categories.
rvest is a new package that makes it easy to scrape (or harvest) data from HTML web pages, inspired by libraries like Beautiful Soup. It is designed to work with magrittr, so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces.
load("inflation.RData")
library(rvest)
#read htmls
htmls<-read_html(links[i])
##getting price##
p<-html_nodes(htmls, xpath = '//div[@class="span25"]//span[@class="price"]
/text()[1]')
The results from the data I wanted to scrape:
head(p)
{xml_nodeset (6)}
[1] \n\t\t\t\t\t\t\t\t\t\t0,35
[2] \n\t\t\t\t\t\t\t\t\t\t0,65
[3] \n\t\t\t\t\t\t\t\t\t\t3,20
[4] \n\t\t\t\t\t\t\t\t\t\t5,95
[5] \n\t\t\t\t\t\t\t\t\t\t1,75
[6] \n\t\t\t\t\t\t\t\t\t\t3,10
Data analysis in R is limited by the amount of memory available on a single machine, and because R is single-threaded, it is often impractical to use it on large datasets. SparkR lets R programs scale to such workloads while remaining easy to use and deploy.
SparkR is an R frontend for Apache Spark, a widely deployed cluster computing engine. There are a number of benefits to designing an R frontend that is tightly integrated with Spark.
To improve performance over large datasets, SparkR performs lazy evaluation on data frame operations and uses Spark’s relational query optimizer to optimize execution.
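As a minimal sketch of this laziness (assuming a SparkR 1.x session where sc and sqlContext are already initialized, and using the ddset data frame built later in this post; the "Meyve" category is a hypothetical filter value), the filter and aggregation below only build a query plan, and nothing is computed until collect() runs:

library(SparkR)

# distribute the locally scraped data as a SparkDataFrame
prices <- createDataFrame(sqlContext, ddset)

# pp was scraped as text, so cast it to a numeric column first
prices$pp_num <- cast(prices$pp, "double")

# lazily filter one sub-category and average its prices per day
fruit <- filter(prices, prices$revscc == "Meyve")
daily_avg <- agg(groupBy(fruit, fruit$date), avg(fruit$pp_num))

# execution happens only here, when the result is collected back into R
head(collect(daily_avg))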
SparkR was initially developed at the AMPLab, UC Berkeley and has been a part of the Apache Spark project for the past eight months.
SparkR’s architecture consists of two main components: an R-to-JVM binding on the driver that allows R programs to submit jobs to a Spark cluster, and support for running R on the Spark executors.
You can download Spark directly from the official site:
http://spark.apache.org/
You need an AWS account, a secret access key, and your private key pair downloaded as a .pem file.
With the spark-ec2 script you can set up Spark, HDFS, Tachyon, and RStudio on your cluster directly.
For detailed information, see the article at the following link:
https://www.r-bloggers.com/launch-apache-spark-on-aws-ec2-and-initialize-sparkr-using-rstudio-2/
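Once the cluster is running, SparkR can be initialized from RStudio on the master node. A minimal sketch, assuming the default install location used by spark-ec2 (/root/spark) and the standalone master on its default port 7077 (<master-public-dns> is a placeholder):

# point R at the Spark installation and load the bundled SparkR package
Sys.setenv(SPARK_HOME = "/root/spark")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)

# connect to the standalone cluster and create a SQL context (SparkR 1.x API)
sc <- sparkR.init(master = "spark://<master-public-dns>:7077")
sqlContext <- sparkRSQL.init(sc)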
Shiny makes it incredibly easy to build interactive web applications with R. Automatic “reactive” binding between inputs and outputs and extensive pre-built widgets make it possible to build beautiful, responsive, and powerful applications with minimal effort.
https://cran.r-project.org/web/packages/shiny/index.html
To get started building the application, create a new empty directory wherever you’d like, then create empty ui.R and server.R files within it.
http://shiny.rstudio.com/articles/build.html
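As a minimal sketch of that reactive binding (the input and output names here are hypothetical), the histogram below re-renders automatically whenever the input value changes:

library(shiny)

ui <- fluidPage(
  numericInput("n", "Sample size", value = 100),  # input widget
  plotOutput("hist")                              # output placeholder
)

server <- function(input, output) {
  # re-runs whenever input$n changes
  output$hist <- renderPlot(hist(rnorm(input$n)))
}

shinyApp(ui, server)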
First of all, I needed to select effective web sites for my study. I collected the links manually and then stored them:
links<-rbind("https://www.carrefoursa.com/r/meyve-sebze/meyve?count=120&page=1",
"https://www.carrefoursa.com/r/meyve-sebze/sebze?count=120&page=1",
"https://www.carrefoursa.com/r/meyve-sebze/salata?count=120&page=1",
"https://www.carrefoursa.com/r/gida-yemek-malzemeleri/makarna-pirinc-ve-bakliyat/pirinc?count=120&page=1",
"https://www.carrefoursa.com/r/gida-yemek-malzemeleri/makarna-pirinc-ve-bakliyat/bulgur?count=120&page=1",
"https://www.carrefoursa.com/r/gida-yemek-malzemeleri/makarna-pirinc-ve-bakliyat/mercimek?count=120&page=1",
"https://www.carrefoursa.com/r/gida-yemek-malzemeleri/makarna-pirinc-ve-bakliyat/nohut?count=120&page=1",
"https://www.carrefoursa.com/r/gida-yemek-malzemeleri/makarna-pirinc-ve-bakliyat/fasulye?count=120&page=1",
"https://www.carrefoursa.com/r/gida-yemek-malzemeleri/makarna-pirinc-ve-bakliyat/bugday?count=120&page=1",
"https://www.carrefoursa.com/r/sut-kahvaltilik/sutler/tam-yagli-uzun-omurlu-sutler?count=120&page=1",
"https://www.carrefoursa.com/r/sut-kahvaltilik/peynirler/beyaz-peynir?count=120&page=1",
"https://www.carrefoursa.com/r/gida-yemek-malzemeleri/seker-tuz-ve-baharat/mutfak-sekerleri?count=120&page=1",
"https://www.carrefoursa.com/r/gida-yemek-malzemeleri/seker-tuz-ve-baharat/tuz?count=120&page=1",
"https://www.carrefoursa.com/r/icecek/sular?count=120&page=1")
Then, I collected the HTML from each of the links I stored.
load("inflation.RData")
library(rvest)
#read htmls
htmls<-read_html(links[i])
After that, I needed to decide which data to scrape. To do that, I wrote some XPath expressions and verified them in the Google Chrome console (for example, $x('//div[@class="span25"]//span[@class="price"]') evaluates an XPath directly in the DevTools console).
Using the html_nodes() function, I parsed the data I needed.
library(stringr)
##getting price##
p<-html_nodes(htmls, xpath = '//div[@class="span25"]//span[@class="price"]/text()[1]')
pp<-as.character(p)
pp<-str_replace_all(pp,"[\n\t]","")   #strip the newline/tab padding around the price text
price<-rbind(price,as.data.frame(pp))
##getting old price##
op<-html_nodes(htmls, xpath='//div[@class="span25"]//span[@class="oldprice"]/text()[1]')
opp<-as.character(op)
opp<-str_replace_all(opp,"[\n\t]","")
old_price<-rbind(old_price,as.data.frame(opp))
In the same way, I parsed the product names and the categories from the breadcrumb.
##getting product names##
n<-html_nodes(htmls, xpath = '//div[@class="span25"]//a[@class="title"]/text()')
nn<-as.character(n)
nn<-str_replace_all(nn,"[\n\t]","")
name<-rbind(name,as.data.frame(nn))
##getting categories##
cats<-html_nodes(htmls,xpath = '//h5[@id="breadcrumb"]//span[@itemprop="title"]/text()')
cc<-as.character(cats)
cc<-str_replace_all(cc," &","")
mc<-cc[1]   #the first breadcrumb element is the main category; later elements give the sub-category
main_cat<-as.data.frame(mc)
Then I created the final data frame.
##getting the system date and time when the data was collected##
now<-Sys.time()
date<-format(now,"%Y-%m-%d")
time<-format(now,"%H:%M:%S")
##reversing mcc and scc (the accumulated main- and sub-category data frames)##
revmcc <- mcc[rev(rownames(mcc)),]
revmccdf<-as.data.frame(revmcc)
revscc <- scc[rev(rownames(scc)),]
revsccdf<-as.data.frame(revscc)
##creating final dataset by using all variables##
d_set<-data.frame(revmccdf,revsccdf,name,price,old_price,date,time)
##merging datasets day by day##
ddset<-rbind(ddset,d_set)
##prices use a comma as the decimal separator; switch to "." so they can be converted to numeric##
ddset$pp<-str_replace_all(ddset$pp,',','.')
ddset$opp<-str_replace_all(ddset$opp,',','.')
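Putting the pieces together, the extraction steps above run once per link inside a daily loop. A sketch (the accumulator data frames are an assumption about how the full script is organized):

library(rvest)
library(stringr)

# accumulators filled once per link, per day
price <- old_price <- name <- data.frame()

for (i in seq_along(links)) {
  htmls <- read_html(links[i])
  # ... price, old price, name and category extraction shown above ...
}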
Time to create my shiny app.
library(shiny)
library(shinydashboard)

ui <- dashboardPage(
  skin='green',
  dashboardHeader(title = "Inflation Checker"),
  dashboardSidebar(sidebarMenu(
    menuItem("Dashboard", tabName = "dashboard", icon = icon("dashboard")),
    menuItem("Data", tabName = "widgets", icon = icon("th"))
  )),
  dashboardBody(
    # Boxes need to be put in a row (or column)
    fluidRow(
      box(plotOutput("plot1", height = 500)),
      # the category selector and summary table are assumed from the server code below
      box(selectInput("category", "Category", choices = unique(ddset$revscc))),
      box(tableOutput("summary"))
    )
  )
)
server <- function(input, output) {
  output$plot1 <- renderPlot({
    boxplot(as.numeric(ddset$pp[ddset$revscc == input$category]),
            main = "Box Plot of Prices", xlab = input$category, ylab = "TL")
  })
  output$summary <- renderTable({
    # one-row summary (Min., Median, Mean, Max., ...) of prices in the selected category
    t(as.matrix(summary(as.numeric(ddset$pp[ddset$revscc == input$category]))))
  })
}
shinyApp(ui, server)
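The finished app can then be published to shinyapps.io. A sketch using the rsconnect package (the account setup and the app directory name are assumptions):

library(rsconnect)

# authorize once with the token from your shinyapps.io account, then deploy
# setAccountInfo(name = "...", token = "...", secret = "...")
deployApp(appDir = "Erum")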
You can visit my application at the following link:
https://olgunaydin.shinyapps.io/Erum/