NY City Restaurant Data Analysis

[Image: restaurant ratings]

Author: Sekhar Mekala

Date: 05/26/2015

Business requirements

Analyze NY City's restaurant inspection data to answer the following:

  • Which factor(s) affect the number of days between a restaurant's inspections?
  • Which factor(s) affect the closure of a restaurant in the current inspection?
  • Which factor(s) affect the closure of a restaurant in the next inspection?
  • Can we predict the above three with some degree of certainty?

Data Source

Data Munging

  • Data is loaded into normalized RDBMS tables, which help manage the data efficiently and can be updated daily with new data
  • The data needed for analysis is exported from the RDBMS tables as CSV file(s)
  • The CSV files are uploaded to www.github.com, where R programs access them for data analysis
  • The data is transformed, and two data sets are produced: one for model training and one for model testing (…Cont)

Data Munging (cont...)

  • R is used to transform the data into the required format (data frames)
  • The model training data comes from all years except 2015
  • The model testing data comes from 2015
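
The year-based split described above can be sketched in R as follows. This is only an illustration on toy data: the data frame and its column names (restaurant_id, inspection_date, score) are assumptions, not the project's actual schema.

```r
# Toy inspection records; the column names are hypothetical.
inspections <- data.frame(
  restaurant_id   = c(1, 1, 2, 2, 3),
  inspection_date = as.Date(c("2013-04-02", "2015-01-15", "2014-06-30",
                              "2015-03-11", "2012-11-20")),
  score           = c(12, 27, 9, 13, 40)
)

# Extract the calendar year of each inspection.
inspection_year <- as.integer(format(inspections$inspection_date, "%Y"))

# Train on all years except 2015; test on 2015 only.
train <- inspections[inspection_year != 2015, ]
test  <- inspections[inspection_year == 2015, ]
```

Splitting by year rather than at random keeps the test set strictly later in time than the training set, which is closer to how the model would be used on new inspections.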

RDBMS Data Model

[Figure: RDBMS data model (db)]

Model development and evaluation - I(A)

[Figure: fwd_bkwd]

Model development and evaluation - I(B)

  • The “score” variable alone shows a strong relationship with the number of days between successive inspections

[Figure: Score]

\[ \text{days} = 12.19 - \frac{2.859 \times \text{score}^4}{10^6} \]
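
As a quick worked example, the fitted equation can be wrapped in a small R function. The function name predict_days is hypothetical, and the coefficients are taken exactly as printed on the slide; the slide does not say whether days is on a transformed scale, so the outputs should be read with that caveat.

```r
# Evaluate the fitted equation: days = 12.19 - (2.859 * score^4) / 10^6.
# predict_days is a hypothetical helper name, not part of the project code.
predict_days <- function(score) {
  12.19 - (2.859 * score^4) / 1e6
}

# Vectorized over score, so several scores can be evaluated at once.
predicted <- predict_days(c(0, 10, 20))
```

Because the score enters as a fourth power, the predicted value barely changes for low scores and drops quickly as the score grows.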

Model development and evaluation - II

  • Predicting whether a restaurant will be closed, based on the current citations, score and other variables

[Figure: LDA]

LDA has better performance
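
A minimal sketch of fitting an LDA classifier in R, assuming the lda() function from the MASS package. The data here is synthetic and the variable names (closed, score, citations) are placeholders; this is not the project's data or its exact model.

```r
library(MASS)  # provides lda()

set.seed(1)
n <- 200

# Synthetic data: closed restaurants get higher scores and more citations.
d <- data.frame(
  closed    = factor(rep(c("no", "yes"), each = n / 2)),
  score     = c(rnorm(n / 2, mean = 15, sd = 5), rnorm(n / 2, mean = 35, sd = 5)),
  citations = c(rpois(n / 2, 2), rpois(n / 2, 6))
)

# Fit linear discriminant analysis with closure as the response.
fit <- lda(closed ~ score + citations, data = d)

# Predict classes and summarize them in a confusion matrix.
pred      <- predict(fit, newdata = d)$class
confusion <- table(actual = d$closed, predicted = pred)
accuracy  <- mean(pred == d$closed)
```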

Model development and evaluation - III

  • Predicting whether a restaurant will be closed at the next visit, based on the current citations, score and other variables

[Figure: QDA]

QDA has better performance
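
QDA differs from LDA in that each class gets its own covariance matrix. A minimal sketch with qda() from the MASS package, evaluated on a held-out set, might look like this. The data is synthetic, and closed_next, score and citations are hypothetical names rather than the project's variables.

```r
library(MASS)  # provides qda()

set.seed(2)

# Synthetic data in which the two classes have different spreads,
# the situation QDA is designed for.
make_data <- function(n) {
  data.frame(
    closed_next = factor(rep(c("no", "yes"), each = n / 2)),
    score       = c(rnorm(n / 2, 15, 4), rnorm(n / 2, 35, 10)),
    citations   = c(rnorm(n / 2, 2, 1), rnorm(n / 2, 6, 3))
  )
}

train <- make_data(200)
test  <- make_data(100)

# Fit on the training set, then score the held-out test set.
fit      <- qda(closed_next ~ score + citations, data = train)
pred     <- predict(fit, newdata = test)$class
accuracy <- mean(pred == test$closed_next)
```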

Roadblocks

  • With more than 96 variables and around 150 MB of data, the algorithms (developed in R) ran for a while
  • Creating the logic to compute the difference in days between consecutive inspections (using vectorized operations only)
  • Deciding which functional forms to use (parametric approach)
  • The kable function did not work properly in some places, and it took a while to figure out the problem
  • Publishing the document to RPUBS also took a while, due to the volume of data
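
One way to compute the days between consecutive inspections using only vectorized operations is sketched below. This is an illustration on toy data, not the project's actual code; the column names are assumptions.

```r
# Toy data; restaurant_id and inspection_date are hypothetical column names.
inspections <- data.frame(
  restaurant_id   = c(1, 1, 1, 2, 2),
  inspection_date = as.Date(c("2014-01-10", "2014-07-09", "2015-01-05",
                              "2014-03-01", "2014-03-31"))
)

# Sort so that each restaurant's inspections are adjacent and in date order.
inspections <- inspections[order(inspections$restaurant_id,
                                 inspections$inspection_date), ]
n <- nrow(inspections)

# Shift both columns down by one row to line up each row with its predecessor.
prev_date <- c(as.Date(NA), inspections$inspection_date[-n])
prev_id   <- c(NA, inspections$restaurant_id[-n])

# A difference is only meaningful when the previous row is the same restaurant.
same_restaurant <- !is.na(prev_id) & prev_id == inspections$restaurant_id
day_diff        <- as.numeric(inspections$inspection_date - prev_date)

# The first inspection of each restaurant gets NA.
inspections$days_since_prev <- ifelse(same_restaurant, day_diff, NA)
```

The shift-and-compare approach avoids an explicit loop over restaurants, which matters at the data volumes mentioned above.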

R Code

RPUBS at http://www.rpubs.com/msekhar12/MSDA_607_Final_Project

The source code and data files are available at: https://github.com/msekhar12/MSDA_FINAL_PROJECT/tree/master

The image on the first page (restaurant ratings) was found on Google Images