Dynamic Inflation Rate Calculation of Fast-Moving Consumer Goods: Shiny-SparkR App

Olgun AYDIN - olgunaydinn@gmail.com

2016/10/13 - Erum2016/Poland

Outline


  • Introduction
  • Scraping Data With R - rvest
  • SparkR
  • SparkR on EC2
  • Shiny
  • Application

Introduction


Official statistics institutes and central banks calculate the inflation rate on a monthly basis. Inflation is measured by the consumer price index (CPI), which reflects the percentage change in the cost to the average consumer of acquiring a basket of goods and services over a specific interval, such as a year. Central banks and statistics institutes publish this information publicly, most of them monthly. As a result, it is hard to track inflation rates for Fast-Moving Consumer Goods (FMCG) on a daily basis.

Introduction


To monitor this, I created a SparkR-Shiny app. The application scrapes top FMCG-related web sites daily and stores the data on a Spark standalone cluster on Amazon Web Services (AWS) Elastic Compute Cloud (EC2). With this application, people can filter the consumer price index for any time interval and any category, and compare price changes of goods across different categories.

Scraping Data With R - rvest

rvest is a new package that makes it easy to scrape (or harvest) data from HTML web pages, inspired by libraries like Beautiful Soup. It is designed to work with magrittr so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces.

load("inflation.RData")
library(rvest)
#read htmls
htmls<-read_html(links[i])
##getting price##
p<-html_nodes(htmls, xpath = '//div[@class="span25"]//span[@class="price"]
                                  /text()[1]')
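
The same extraction can also be written in rvest's pipeline style; a short sketch, reusing one of the category URLs scraped later in this talk:

library(rvest)
# pipeline style: fetch the page, select the price text nodes, coerce to character
prices <- read_html("https://www.carrefoursa.com/r/icecek/sular?count=120&page=1") %>%
  html_nodes(xpath = '//div[@class="span25"]//span[@class="price"]/text()[1]') %>%
  as.character()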

Scraping Data With R - rvest


The results of the data I want to scrape:

head(p)
{xml_nodeset (6)}
[1] \n\t\t\t\t\t\t\t\t\t\t0,35
[2] \n\t\t\t\t\t\t\t\t\t\t0,65
[3] \n\t\t\t\t\t\t\t\t\t\t3,20
[4] \n\t\t\t\t\t\t\t\t\t\t5,95
[5] \n\t\t\t\t\t\t\t\t\t\t1,75
[6] \n\t\t\t\t\t\t\t\t\t\t3,10

SparkR


Data analysis using R is limited by the amount of memory available on a single machine; moreover, since R is single-threaded, it is often impractical to use it on large datasets. SparkR lets R programs scale while remaining easy to use and deploy across a number of workloads.

SparkR is an R frontend for Apache Spark, a widely deployed cluster computing engine. There are a number of benefits to designing an R frontend that is tightly integrated with Spark.

SparkR


To improve performance over large datasets, SparkR performs lazy evaluation on data frame operations and uses Spark’s relational query optimizer to optimize execution.

SparkR was initially developed at the AMPLab, UC Berkeley and has been a part of the Apache Spark project for the past eight months.

SparkR’s architecture consists of two main components: an R-to-JVM binding on the driver that allows R programs to submit jobs to a Spark cluster, and support for running R on the Spark executors.

SparkR: https://spark.apache.org/docs/latest/sparkr.html
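
A minimal sketch of this lazy-evaluation model, assuming Spark 2.x's sparkR.session() API and R's built-in faithful dataset:

library(SparkR)

# start a SparkR session (assumes SPARK_HOME points at a Spark installation)
sparkR.session(appName = "InflationChecker")

# convert a local R data frame into a distributed Spark DataFrame
df <- createDataFrame(faithful)

# filter() only builds a logical plan; Spark's query optimizer plans and runs
# the job when an action such as head() is called
head(filter(df, df$waiting > 70))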


SparkR on EC2


You can download Spark directly from the official site: http://spark.apache.org/

You need an AWS account, an access key ID with its secret access key, and your private key pair downloaded as a .pem file.

With the spark-ec2 script you can directly set up Spark, HDFS, Tachyon, and RStudio on your cluster.

For detailed information, see the article at the following link: https://www.r-bloggers.com/launch-apache-spark-on-aws-ec2-and-initialize-sparkr-using-rstudio-2/
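
Once the cluster is up, SparkR can be initialized from RStudio on the master node; a minimal sketch, assuming Spark 2.x, the default /root/spark install location used by spark-ec2, and a placeholder master address:

# point R at the Spark installation created by the spark-ec2 script
Sys.setenv(SPARK_HOME = "/root/spark")
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

# connect to the standalone cluster (replace <master-public-dns> with yours)
sparkR.session(master = "spark://<master-public-dns>:7077",
               appName = "InflationChecker")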

Shiny



Shiny makes it incredibly easy to build interactive web applications with R. Automatic “reactive” binding between inputs and outputs and extensive pre-built widgets make it possible to build beautiful, responsive, and powerful applications with minimal effort.

https://cran.r-project.org/web/packages/shiny/index.html


Shiny



To get started building the application, create a new empty directory wherever you’d like, then create empty ui.R and server.R files within it.

http://shiny.rstudio.com/articles/build.html
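
A minimal sketch of that two-file layout (the widget names here are illustrative assumptions, not the app shown later):

# ui.R -- describes the page layout
library(shiny)
fluidPage(
  titlePanel("Hello Shiny"),
  plotOutput("plot1")
)

# server.R -- describes the reactive logic behind the outputs
library(shiny)
function(input, output) {
  output$plot1 <- renderPlot({
    hist(rnorm(100))  # placeholder plot
  })
}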

Application

First of all, I needed to select relevant web sites for my study. I collected the links manually and then stored them:

links<-rbind("https://www.carrefoursa.com/r/meyve-sebze/meyve?count=120&page=1",
        "https://www.carrefoursa.com/r/meyve-sebze/sebze?count=120&page=1",
        "https://www.carrefoursa.com/r/meyve-sebze/salata?count=120&page=1",
        "https://www.carrefoursa.com/r/gida-yemek-malzemeleri/makarna-pirinc-ve-bakliyat/pirinc?count=120&page=1",
        "https://www.carrefoursa.com/r/gida-yemek-malzemeleri/makarna-pirinc-ve-bakliyat/bulgur?count=120&page=1",
        "https://www.carrefoursa.com/r/gida-yemek-malzemeleri/makarna-pirinc-ve-bakliyat/mercimek?count=120&page=1",
        "https://www.carrefoursa.com/r/gida-yemek-malzemeleri/makarna-pirinc-ve-bakliyat/nohut?count=120&page=1",
        "https://www.carrefoursa.com/r/gida-yemek-malzemeleri/makarna-pirinc-ve-bakliyat/fasulye?count=120&page=1",
        "https://www.carrefoursa.com/r/gida-yemek-malzemeleri/makarna-pirinc-ve-bakliyat/bugday?count=120&page=1",
        "https://www.carrefoursa.com/r/sut-kahvaltilik/sutler/tam-yagli-uzun-omurlu-sutler?count=120&page=1",
        "https://www.carrefoursa.com/r/sut-kahvaltilik/peynirler/beyaz-peynir?count=120&page=1",
        "https://www.carrefoursa.com/r/gida-yemek-malzemeleri/seker-tuz-ve-baharat/mutfak-sekerleri?count=120&page=1",
        "https://www.carrefoursa.com/r/gida-yemek-malzemeleri/seker-tuz-ve-baharat/tuz?count=120&page=1",
        "https://www.carrefoursa.com/r/icecek/sular?count=120&page=1")

Application



Then I fetched the HTML content of each of the links I stored:

load("inflation.RData")
library(rvest)
#read htmls
htmls<-read_html(links[i])

Application


After that, I needed to decide which data on each page mattered for my study. To do that, I wrote XPath expressions and verified them, using the Google Chrome console.
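
The same XPath can also be sanity-checked from R before scraping everything; a quick sketch, assuming the links matrix from above:

library(rvest)
# fetch a single page and count how many nodes the XPath matches
test_page <- read_html(links[1])
length(html_nodes(test_page,
                  xpath = '//div[@class="span25"]//span[@class="price"]/text()[1]'))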

Application


With the html_nodes() function, I parsed the data I needed:

library(stringr)
##getting price##
p<-html_nodes(htmls, xpath = '//div[@class="span25"]//span[@class="price"]/text()[1]')
pp<-as.character(p)
pp<-str_replace_all(pp,"[\n\t]","")   # strip layout whitespace (note: pp, not p)
price<-rbind(price,as.data.frame(pp)) # accumulate prices across links

Application

In the same way, I parsed the old (pre-discount) price:

##getting old price##
op<-html_nodes(htmls, xpath = '//div[@class="span25"]//span[@class="oldprice"]/text()[1]')
opp<-as.character(op)
opp<-str_replace_all(opp,"[\n\t]","")
old_price<-rbind(old_price,as.data.frame(opp))

Application

The same approach extracts the product names and the category information from the breadcrumb:

##getting product names##
n<-html_nodes(htmls, xpath = '//div[@class="span25"]//a[@class="title"]/text()')
nn<-as.character(n)
nn<-str_replace_all(nn,"[\n\t]","")
name<-rbind(name,as.data.frame(nn))
##getting categories##
cats<-html_nodes(htmls, xpath = '//h5[@id="breadcrumb"]//span[@itemprop="title"]/text()')
cc<-as.character(cats)
cc<-str_replace_all(cc," &amp;","")   # drop HTML-encoded ampersands
mc<-cc[1]                             # first breadcrumb entry is the main category
main_cat<-as.data.frame(mc)

Application

Then I created the final data frame:

##getting the system time when the data was collected##
date<-strsplit(str_replace_all(Sys.time(),"[ ]",","),",")[[1]][1]
time<-strsplit(str_replace_all(Sys.time(),"[ ]",","),",")[[1]][2]

##reversing mcc and scc (the accumulated main and sub categories)##
revmcc <- mcc[rev(rownames(mcc)),]
revmccdf<-as.data.frame(revmcc)

revscc <- scc[rev(rownames(scc)),]
revsccdf<-as.data.frame(revscc)

##creating the final dataset from all variables##
d_set<-data.frame(revmccdf,revsccdf,name,price,old_price,date,time)

##merging datasets day by day##
ddset<-rbind(ddset,d_set)
ddset$pp<-str_replace_all(ddset$pp,',','.')    # Turkish decimal comma -> dot
ddset$opp<-str_replace_all(ddset$opp,',','.')
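
At this point the daily dataset can be pushed to the Spark cluster mentioned earlier; a minimal sketch, assuming an active SparkR session (the HDFS path is a hypothetical example):

# convert the daily scrape to a Spark DataFrame and append it as Parquet
sdf <- createDataFrame(ddset)
write.df(sdf, path = "hdfs:///inflation/ddset", source = "parquet", mode = "append")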

Application

Time to create my Shiny app. First, the UI:

library(shiny)
library(shinydashboard)

ui <- dashboardPage(
  skin='green',
  dashboardHeader(title = "Inflation Checker"),
  dashboardSidebar(sidebarMenu(
    menuItem("Dashboard", tabName = "dashboard", icon = icon("dashboard")),
    menuItem("Data", tabName = "widgets", icon = icon("th"))
  )),
  dashboardBody(
    # Boxes need to be put in a row (or column)
    fluidRow(
      box(plotOutput("plot1", height = 500)),
      # widgets referenced by the server below: input$category and output$summary
      box(selectInput("category", "Category:", choices = unique(ddset$revscc))),
      box(tableOutput("summary"))
    )
  )
)

Application

And the server function:

server <- function(input, output) {

  # box plot of prices for the selected category
  output$plot1 <- renderPlot({
    boxplot(as.numeric(ddset$pp[ddset$revscc==input$category]),
            main="Box Plot of Prices", xlab=input$category, ylab="TL")
  })

  # summary statistics of prices for the selected category
  output$summary <- renderTable({
    t(as.matrix(summary(as.numeric(ddset$pp[ddset$revscc==input$category]))))
  })
}

shinyApp(ui, server)

Application



You can visit my application at the following link:


https://olgunaydin.shinyapps.io/Erum/
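
The app can be published to shinyapps.io with the rsconnect package; a minimal sketch (the token and secret placeholders come from your shinyapps.io dashboard):

library(rsconnect)
# one-time account setup
setAccountInfo(name = "olgunaydin", token = "<TOKEN>", secret = "<SECRET>")
# deploy the directory containing ui.R and server.R
deployApp(appName = "Erum")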

Thank you!