Data Science Capstone
========================================================
author: Eric Dolan
date: 19 November 2015

STL analysis of smoking business
========================================================

This is a study of the popularity of smoking businesses, measured by the number of reviews (talking points) in the Yelp Dataset Challenge data. The aim is to understand the economic impact, and the very real economic trade-offs, that we should consider when voting on a smoking ban.

The Yelp data can be found at [yelp_dataset_challenge_academic_dataset.zip](https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/yelp_dataset_challenge_academic_dataset.zip)

More on the Yelp Dataset Challenge and the details of the data structure can be found at http://www.yelp.com/dataset_challenge

Background
========================================================

From the data set provided by Yelp, there are:

- a total of 61164 businesses
- only 2754 businesses with the smoking attribute (covering both indoor and outdoor smoking)
- a total of 1569264 reviews, covering the period from Oct 2004 to Jan 2015
- only 144080 reviews for the 2754 smoking businesses

That review count is surprisingly high for only 2754 businesses: about 52 reviews per business, roughly double the average of about 26 across all businesses.

Exploratory analysis
========================================================

After understanding the data, I converted it into a time series object. Only the 2006 to 2014 data are used, because those are the years with full-year data.

The final transformed data looks like fig.1

```{r, echo=FALSE}
# Load the prepared time series object and show its structure
load("tsdata.RData")
str(tsdata)
```
*fig.1 - Final transformed data*

Next, we move on to build a Seasonal Decomposition of Time Series by Loess (STL) model.

Modelling
========================================================

After a few rounds of tuning and experimenting with different settings, we were able to produce a final model for forecasting.
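The tuning mentioned above could look something like the following minimal sketch. It uses a synthetic monthly series rather than `tsdata`, and both the monthly frequency and the candidate `t.window` values are assumptions chosen for illustration, not the settings actually tried in this study.

```r
# Synthetic monthly series for 2006-2014 (assumption: stands in for tsdata)
set.seed(42)
n <- 108
series <- ts(100 + 0.5 * seq_len(n) +
               10 * sin(2 * pi * seq_len(n) / 12) +
               rnorm(n, sd = 3),
             start = c(2006, 1), frequency = 12)

# Fit one STL model per candidate trend window (hypothetical candidates)
windows <- c(7, 15, 31)
fits <- lapply(windows, function(w) {
  stl(series, t.window = w, s.window = "periodic", robust = TRUE)
})

# Compare the spread of the remainder component across settings
remainder_iqr <- sapply(fits, function(f) IQR(f$time.series[, "remainder"]))
names(remainder_iqr) <- windows
```

A smaller `t.window` lets the trend follow short-term movement more closely, while a larger one gives a smoother trend, so comparing the remainder spread is only one of several reasonable ways to choose between them.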
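```{r, echo=FALSE, fig.width=12, fig.height=6}
# Plot the raw review counts
plot(tsdata, ylab = "Total Review", col = "grey")
# Fit the STL model with a periodic seasonal component and robust fitting
fit1 <- stl(tsdata, t.window = 15, s.window = "periodic", robust = TRUE)
# Overlay the fitted trend component
lines(fit1$time.series[, "trend"], col = "red")
```
*fig.2 - Final model plot*

Forecast
========================================================

The two-year forecast in fig.3 shows review volume staying in the same range as today. This suggests that interest in these businesses is holding steady, so a smoking ban would carry an economic impact.

```{r, echo=FALSE, fig.width=12, fig.height=6}
library(forecast)
# Naive forecast on the STL fit; the default horizon is two seasonal periods
fcast <- forecast(fit1, method = "naive")
plot(fcast, main = "")
```
*fig.3 - Forecast result*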
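To make the forecast step easier to follow, here is a rough base R sketch of how a naive forecast on an STL fit produces its point forecasts: the last seasonally adjusted value is carried forward and the last year of the seasonal component is added back. The series is synthetic, and the monthly frequency, the 24-month horizon and the 2015 start date are assumptions for illustration. Prediction intervals, which `forecast()` also computes, are left out.

```r
# Synthetic monthly series for 2006-2014 (assumption: stands in for tsdata)
set.seed(7)
n <- 108
series <- ts(100 + 0.5 * seq_len(n) +
               10 * sin(2 * pi * seq_len(n) / 12) +
               rnorm(n, sd = 3),
             start = c(2006, 1), frequency = 12)

fit <- stl(series, t.window = 15, s.window = "periodic", robust = TRUE)

# Seasonally adjusted series: the data minus the seasonal component
seasonal <- fit$time.series[, "seasonal"]
adjusted <- series - seasonal

# Naive step: repeat the last adjusted value over the horizon
h <- 24
naive_adjusted <- rep(tail(as.numeric(adjusted), 1), h)

# Add back the last full year of the seasonal pattern, recycled over h
seasonal_future <- rep(tail(as.numeric(seasonal), 12), length.out = h)

point_forecast <- ts(naive_adjusted + seasonal_future,
                     start = c(2015, 1), frequency = 12)
```

Because the naive method carries the last seasonally adjusted value forward, the point forecast is level apart from the seasonal pattern, which is worth keeping in mind when reading fig.3.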