Zomato: Know your restaurant better!!

Introduction

“I really get fascinated by good quality food being served in the restaurants, do you?”

“Who would not want to tease their palate by tasting the best cuisines from a total of 15 countries?”

Zomato API Analysis is one of the most useful analyses for foodies.

The basic idea of analyzing the Zomato dataset is to get a fair idea of the factors affecting the aggregate rating of each restaurant, and to answer questions such as:

The best cuisines in every part of the world that fit within a diner's budget
The value-for-money restaurants for each cuisine in various parts of a country
Which locality of a country has the maximum number of restaurants serving a given cuisine
The needs of people who are striving to get the best cuisine of the country

“Just so that you have a good meal the next time you step out”

Business Problem

Factors that help you dine

The bigger picture is to analyze the various factors that help a customer decide on the appropriate restaurant, such as

Cuisine

Location

Pocket money etc.

Some of the issues that we had to address through this analysis are:

Standardizing Currency

Converting the various local currencies to US Dollars and thereby determining a standardized average cost for two, so that you know how much you need to shell out no matter where you are
Analyze and understand the relationship between the Price Range and the Rating to determine the value for money the restaurant provides

Cuisine

Jack of all but Master of One.
Multiple cuisines are served by each restaurant and are recorded as comma-separated values in a single column.
Does it really matter what you serve and where? We answer this question by analyzing the interdependence of the cuisine a restaurant serves, its location and its final rating

Votes

Is the number of votes synonymous with the restaurant being considered a poor or an excellent choice?
For example, a restaurant with only 3 votes is categorized as a Poor choice.
The question we need to answer is whether we can rate a restaurant solely based on the number of votes it has garnered.

Requirements

The following packages were used for the analysis of our Zomato data set. The table below lists each library and what we used it for.

# to preload necessary packages 
packages <- c("tidyverse", "readr", "maps", "DT", "knitr", "rmarkdown", "jsonlite", "httr", "mapproj","lubridate","lucr","kableExtra","ggplot2","gridExtra","RCurl")
requiredpackages <- packages[!(packages %in% installed.packages()[,"Package"])]
if (length(requiredpackages)) install.packages(requiredpackages, repos = "http://cran.us.r-project.org" )

library(knitr)
library(kableExtra)
library(lucr) #loading lucr library
library(DT)
library(jsonlite)
library(httr)
library(lubridate)
library(RCurl)
library(ggplot2)
library(gridExtra)
text_tbl <- data.frame(
  Libraries = c("tidyverse", "readr", "maps","mapproj", 
                "DT", "knitr", "rmarkdown", "jsonlite", 
                "httr","lucr","kableExtra","ggplot2","gridExtra","RCurl"),
  Functions = c("Easy installation of packages",
                "To easily import delimited data",
                "For geographical data",
                "To convert latitude/longitude into projected coordinates",
                "To create functional tables in HTML",
                "For dynamic report generation",
                "To convert R Markdown documents into a variety of formats",
                "A Robust, High Performance JSON Parser and Generator for R",
                "Tools for Working with URLs and HTTP organised by HTTP verbs",
                "Reformat currency-based data as numeric and convert between currencies",
                "Build complex HTML or LaTeX tables using kable()",
                "To create beautiful visualizations",
                "To arrange the plots generated grid-wise",
                "To read data from web url"
                
  )
)

kable(text_tbl, "html") %>%
  kable_styling(full_width = F) %>%
  column_spec(1, bold = T, border_right = T) %>%
  column_spec(2, width = "30em")

Libraries	Functions
tidyverse	Easy installation of packages
readr	To easily import delimited data
maps	For geographical data
mapproj	To convert latitude/longitude into projected coordinates
DT	To create functional tables in HTML
knitr	For dynamic report generation
rmarkdown	To convert R Markdown documents into a variety of formats
jsonlite	A Robust, High Performance JSON Parser and Generator for R
httr	Tools for Working with URLs and HTTP organised by HTTP verbs
lucr	Reformat currency-based data as numeric and convert between currencies
kableExtra	Build complex HTML or LaTeX tables using kable()
ggplot2	To create beautiful visualizations
gridExtra	To arrange the plots generated grid-wise
RCurl	To read data from web url

Proposed Solution

Some of the approaches to address the above issues are:

We have converted the currency values to standardized abbreviations, such as INR for Indian Rupees, and used the lucr package to convert the given currency to USD
Based on the converted currency value we also re-mapped the pre-assigned price range onto the Zomato-defined USD scale shown below:

text_table1 <- data.frame(AveragePrice = c("0 < Price <= 10","10 < Price <= 25","25 < Price <= 50","50 < Price"),
                          PriceRange = c("1","2","3","4"))
kable(text_table1, "html") %>%
  kable_styling(full_width = F) %>%
  column_spec(1, bold = T, border_right = T) %>%
  column_spec(2, width = "15em")

AveragePrice	PriceRange
0 < Price <= 10	1
10 < Price <= 25	2
25 < Price <= 50	3
50 < Price	4
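
The banding in the table above can be expressed with base R's cut(). The sketch below is only an illustration of that mapping, not the code used for this report; the helper name price_band and the sample costs are made up for the example. Costs that are missing or not positive get band 0, which mirrors the fallback used in the data cleaning step.

```r
# Hypothetical helper: map a cost for two (in USD) to the 1-4 price range.
price_band <- function(usd) {
  # Right-closed intervals: (0,10], (10,25], (25,50], (50,Inf]
  band <- cut(usd, breaks = c(0, 10, 25, 50, Inf), labels = FALSE, right = TRUE)
  band[is.na(band)] <- 0L  # missing or non-positive costs fall outside every interval
  band
}

sample_costs <- c(7.5, 10, 10.01, 25, 49.99, 50, 120, 0, NA)  # made-up values
price_band(sample_costs)
```

Because the intervals are right-closed, a cost of exactly 10 USD stays in range 1 and a cost of exactly 50 USD stays in range 3, as the table specifies.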

To address the cuisine problem, we propose to separate the comma-separated values, spread them into individual columns, count their frequency and determine the favorite cuisine for each location.
Speaking of location, since our data set contains both latitude and longitude, we will attempt to create a color-coded map that highlights the cities with the maximum number of highly rated restaurants.
Fit a linear model to find out whether votes really decide the rating, by comparing a full model with a reduced model.
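
The full-versus-reduced comparison could look roughly like the sketch below. It runs on simulated data, because the fitted models are not part of this report yet; the column names votes, price_range and rating, and the relationship used to simulate the rating, are assumptions made for illustration. For nested lm() fits, anova() performs the F-test of whether the extra term (here votes) reduces the residual error.

```r
set.seed(42)
n <- 200

# Simulated stand-in for the restaurant data (not the Zomato values)
sim <- data.frame(
  votes = rpois(n, lambda = 150),
  price_range = sample(1:4, n, replace = TRUE)
)
sim$rating <- 2.5 + 0.01 * sim$votes + 0.2 * sim$price_range + rnorm(n, sd = 0.3)

reduced <- lm(rating ~ price_range, data = sim)          # model without votes
full    <- lm(rating ~ price_range + votes, data = sim)  # model with votes

# F-test for nested models: does adding votes explain significantly more variation?
comparison <- anova(reduced, full)
print(comparison)
p_value <- comparison[["Pr(>F)"]][2]
```

A small p-value would suggest that votes carry information about the rating beyond the price range. Whether that holds for the real data is something only the models fitted on the Zomato data set can show.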

Data CodeBook

Documentation Guide

A glance at the following documentation source will give us a fair understanding about how Zomato has gone about preparing the data and help us analyze it better.

https://developers.zomato.com/documentation

Fetching the data

The data has been collected from Kaggle in the form of a .csv file.

Data Source

https://www.kaggle.com/shrutimehta/zomato-restaurants-data

MetaData

Each restaurant in the dataset is uniquely identified by its Restaurant Id. Every Restaurant is analyzed based on the following variables:

text_tbl2 <- data.frame(
  Attributes = c("Restaurant Id", "Restaurant Name", "Country Code","City", 
                "Address","Locality", "Locality Verbose", "Longitude", "Latitude", 
                "Cuisines","Average Cost for two","Currency","Has Table booking","Has Online delivery","Price range",
                "Aggregate Rating","Rating color","Rating text","Votes"),
  Description = c("Unique id of every restaurant",
                  "Name of the restaurant",
                  "Country in which restaurant is located",
                  "City in which restaurant is located",
                  "Address of the restaurant","Location in the city",
                  "Detailed description of the locality",
                  "Longitude coordinate",
                  "Latitude coordinate",
                  "Cuisines offered by the restaurant",
                  "Cost for two people in different currencies",
                  "Currency of the country", 
                  "Can you reserve a table?",  
                  "Do they deliver at your doorstep?",
                  "Range of price of food",
                  "Average rating out of 5",
                  "Depending upon the average rating color", 
                  "Text on the basis of rating",    
                  "Number of ratings casted by people"
  )
)

kable(text_tbl2, "html") %>%
  kable_styling(full_width = F) %>%
  column_spec(1, bold = T, border_right = T) %>%
  column_spec(2, width = "30em")

Attributes	Description
Restaurant Id	Unique id of every restaurant
Restaurant Name	Name of the restaurant
Country Code	Country in which restaurant is located
City	City in which restaurant is located
Address	Address of the restaurant
Locality	Location in the city
Locality Verbose	Detailed description of the locality
Longitude	Longitude coordinate
Latitude	Latitude coordinate
Cuisines	Cuisines offered by the restaurant
Average Cost for two	Cost for two people in different currencies
Currency	Currency of the country
Has Table booking	Can you reserve a table?
Has Online delivery	Do they deliver at your doorstep?
Price range	Range of price of food
Aggregate Rating	Average rating out of 5
Rating color	Depending upon the average rating color
Rating text	Text on the basis of rating
Votes	Number of ratings casted by people

Data Cleaning

To keep the analysis crisp, we dropped two columns from the original dataset, “Is delivering now?” and “Switch to order menu”, because our analysis showed that they did not contribute significantly.

Keeping the business requirements in mind, data processing has been carried out for the following categories.

Currency

We have converted the local currencies to USD with the help of the lucr package.
Based on the converted values, we have defined a new standardized price range.
Both of these values have been added to our existing dataset.

library(lucr)
library(DT)
rates <- conversion_rates(currency = "USD", key = "91df1e9c6b6d487180fbe7bf19c7cc0d") #extracting conversion rates based on USD
#list_currencies(key = "91df1e9c6b6d487180fbe7bf19c7cc0d")
df <- read.csv("zomato.csv",stringsAsFactors = FALSE)
df["converted_rates"] <- NA #created new column for converted price and initialized with "NA"
df["new_price_range"] <- NA #created new column for standardized price range and initialized with "NA"
arr <- numeric(nrow(df)) #standardized price range; stays 0 when the cost is missing or not positive
for (i in seq_len(nrow(df))) { #looping through all rows
  val <- df[i, "Currency"] #finding currency values
  usd <- df[["Average.Cost.for.two"]][i] / rates$rates[[val]] #cost for two converted to USD, computed once per row
  df[["converted_rates"]][i] <- round(usd, 2) #find converted currency rate

  #finding standardized price range
  if (is.na(usd) || usd <= 0) {
    arr[i] <- 0
  } else if (usd <= 10) {
    arr[i] <- 1
  } else if (usd <= 25) {
    arr[i] <- 2
  } else if (usd <= 50) {
    arr[i] <- 3
  } else {
    arr[i] <- 4
  }
}

df[["new_price_range"]] <- arr #updating values in new column created
write.csv(df, "updatedZomato.csv")

datatable(df[,-c(3:19)], caption = "Updated Currency Value", colnames=c("Restaurant Id","Restaurant Name","Converted Rates in USD","New Price Range"))
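
The row-by-row conversion above can also be written as a vectorised lookup. The sketch below assumes the rates are held as a named numeric vector giving units of local currency per US dollar. Both the rates and the sample rows are made-up illustrations, not values returned by lucr.

```r
# Made-up rates: units of local currency per 1 USD (assumption for illustration)
usd_rates <- c(INR = 65, USD = 1, GBP = 0.75)

restaurants <- data.frame(
  Average.Cost.for.two = c(1300, 40, 60, 500),
  Currency = c("INR", "USD", "GBP", "XXX"),  # "XXX" has no rate on purpose
  stringsAsFactors = FALSE
)

# Indexing by name looks up every row's rate at once; unknown currencies give NA
rate_per_row <- unname(usd_rates[restaurants$Currency])
restaurants$converted_rates <- round(restaurants$Average.Cost.for.two / rate_per_row, 2)
restaurants
```

A currency without a rate ends up as NA instead of stopping the run, which makes unmatched currency codes easy to spot afterwards.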

Country

Because country-based analysis is one of our major goals and the Zomato data contains only a country ID (based on Zomato standards), we used the Zomato REST API to collect the country information corresponding to each city.

REST API consumed: https://developers.zomato.com/api/v2.1/locations

By making a GET call to the above endpoint for each city, we received information about that city, including the country name, in JSON format.

library(DT)
library(jsonlite)
library(httr)
#reading the zomato  csv file and storing it in dataframe df

df <- read.csv("zomato.csv")
#finding the list of unique cities
list_of_cities <- as.list(unique(df$City))
iter <- length(list_of_cities)
out <- matrix(NA, nrow = iter, ncol = 2)

for (i in seq_along(list_of_cities)) {

  url <- 'https://developers.zomato.com/api/v2.1/locations?query=city_name=' #request url to call the REST API

  place <- list_of_cities[[i]] #finding city name to pass in GET request (mandatory parameter)

  newURL <- paste0(url, place) #creating new URL; paste0() avoids the stray space that paste() would insert

  req <- GET(newURL, add_headers(User_key = "74af46c228e902e021c0d31516308ecb"), query = list(entity_type = "city")) #added headers and made a REST call

  response <- content(req) #storing the content received

  if (length(response$location_suggestions) != 0) { #checking if the response contains country information
    #saving the country ID and name in a matrix
    out[i,] <- c(response$location_suggestions[[1]]$country_id, response$location_suggestions[[1]]$country_name)
  }

}

#printing the final matrix containing unique records for country IDs
datatable(unique(out), caption = 'Distinct Country ID and Country Name', colnames = c("Country Code","Country Name"))
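
Once the lookup matrix is filled, attaching the country names to the restaurants is a join on the country code. The sketch below uses base R's merge() on toy values shaped like out and df. The contents are stand-ins, and the column name Country.Code assumes the way read.csv() renames the "Country Code" header.

```r
# Toy stand-in for the `out` matrix built by the loop (NA row = city with no API match)
out_demo <- rbind(c("1", "India"), c("216", "United States"),
                  c("1", "India"), c(NA, NA))

# Drop unmatched rows and duplicates, then name the columns
lookup <- unique(as.data.frame(out_demo[!is.na(out_demo[, 1]), , drop = FALSE],
                               stringsAsFactors = FALSE))
names(lookup) <- c("Country.Code", "Country.Name")
lookup$Country.Code <- as.integer(lookup$Country.Code)

# Toy restaurants keyed by country code
restaurants <- data.frame(Restaurant.Name = c("A", "B", "C"),
                          Country.Code = c(1L, 216L, 1L),
                          stringsAsFactors = FALSE)

# Left join: keep every restaurant even if its country is missing from the lookup
with_country <- merge(restaurants, lookup, by = "Country.Code", all.x = TRUE)
with_country
```

Using all.x = TRUE keeps restaurants whose city returned no suggestion, leaving their country name as NA rather than silently dropping the rows.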

EDA and Visualizations

The following plots and graphs help us unravel the unsaid information that our data set holds.

We have tried visualizing some of the important attributes of restaurants in India compared to those worldwide.

Restaurants and their Birth Place

This visualization gives us a fair idea of which countries have the maximum number of restaurants, and then of how the restaurants are spread across the various cities in India.

Going by the number of restaurants listed on Zomato, the most voracious eaters hail from India and the USA.

Within India, restaurants are a major hit in the northern cities like New Delhi and Noida.

library(ggplot2)
library(gridExtra)
library(RCurl) # provides getURL()
df <- read.csv(text=getURL("https://raw.githubusercontent.com/Akanksha061990/DataWrangling/master/FinalData.csv"), header=T)
# Extracting only Indian restaurants
df_india = subset(df, Country.Name == "India")
df_other = subset(df, Country.Name != "India")

# Plotting distribution of registered restaurants by country
# labs() names the aesthetics, so x is the category even though coord_flip() draws it on the vertical axis
p1 = ggplot(df) + geom_bar(aes(Country.Name), fill = "tomato3") + coord_flip() + labs(title="Country wise distribution", x="Country", y = "Number of outlets") + theme(plot.title = element_text(size = 10, face = "bold"))
p2 = ggplot(df_india) + geom_bar(aes(City), fill = "red") + coord_flip() + labs(title="City-wise distribution in India", x="City", y = "Number of outlets") + theme(plot.title = element_text(size = 10, face = "bold"))
grid.arrange(p1, p2, ncol=2)

India versus Other Countries

With the help of this visualization we can understand how restaurants from India and from other countries fare on the basis of the aggregate ratings they have garnered.

In all, we can see that more than 2000 restaurants from India alone have not been given a rating. This is a disparity we may look into and address while presenting our final analysis. Diners outside India seem to take providing ratings a bit more seriously, and the no-rating count there is far lower.

#Plot to have a look at aggregate ratings of restaurants in India as well as worldwide

p3 = ggplot(df_india) + geom_bar(aes(Aggregate.rating, fill=Rating.text))  + labs(title="Distribution of Aggregate Rating in India", x = "Aggregate Rating", y="Number of outlets") + theme(plot.title = element_text(size=10, face = "bold"))

p4 = ggplot(df_other) + geom_bar(aes(Aggregate.rating, fill=Rating.text))  + labs(title="Distribution of Aggregate Rating in other countries", x = "Aggregate rating", y = "Number of outlets") + theme(plot.title = element_text(size=10, face = "bold"))

grid.arrange(p3, p4, nrow=2)
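
Since India contributes far more restaurants than any other country, the gap in missing ratings is easier to judge as a share than as a raw count. The sketch below assumes that unrated restaurants are labelled "Not rated" in Rating.text; the rows are toy values, not figures from the data set.

```r
# Toy data shaped like the report's data frame (values are made up)
toy <- data.frame(
  Country.Name = c("India", "India", "India", "India",
                   "United States", "United States", "United States"),
  Rating.text  = c("Not rated", "Good", "Not rated", "Average",
                   "Excellent", "Not rated", "Very Good"),
  stringsAsFactors = FALSE
)

toy$group   <- ifelse(toy$Country.Name == "India", "India", "Other")
toy$unrated <- toy$Rating.text == "Not rated"  # assumed label for missing ratings

# The mean of a logical vector is the proportion of TRUE values in each group
unrated_share <- tapply(toy$unrated, toy$group, mean)
unrated_share
```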

Metro Cities

The next time you plan a visit to India, you should know which cities to visit for a good meal and have a better idea of how much you would shell out in US Dollars.

In New Delhi the average cost for two falls in the range of $0-75, whereas in Mumbai, Chennai and Kolkata the average spend can be estimated at $15-20. However, the maximum number of restaurants are from New Delhi, so its picture is clearer than that of the other cities.

#plot to compare average cost for two in the major Metros of India

df_newdelhi = subset(df_india, City=="New Delhi")
df_chennai = subset(df_india, City=="Chennai")
df_mumbai = subset(df_india, City=="Mumbai")
df_kolkata = subset(df_india, City=="Kolkata")

p5 = ggplot(df_newdelhi) + geom_bar(aes(Converted_Rates), color="green4", fill="green4") + labs(title="Avg.Money Spent - New Delhi", x="Avg. Cost", y="Number of outlets") + theme(plot.title = element_text(size=10, face = "bold"))

p6 = ggplot(df_mumbai) + geom_bar(aes(Converted_Rates), color="green4", fill="green4") + labs(title="Avg.Money Spent - Mumbai", x="Avg. Cost", y="Number of outlets") + theme(plot.title = element_text(size=10, face = "bold"))

p7 = ggplot(df_chennai) + geom_bar(aes(Converted_Rates), color="green4", fill="green4") + labs(title="Avg.Money Spent - Chennai", x="Avg. Cost", y="Number of outlets") + theme(plot.title = element_text(size=10, face = "bold"))

p8 = ggplot(df_kolkata) + geom_bar(aes(Converted_Rates), color="green4", fill="green4") + labs(title="Avg.Money Spent - Kolkata", x="Avg. Cost", y="Number of outlets") + theme(plot.title = element_text(size=10, face = "bold"))

grid.arrange(p5, p6, p7, p8, nrow=4)
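
A numeric summary can sit alongside the four bar charts. The sketch below computes the number of outlets and the median converted cost per city on toy rows. The values are made up, and only the column names City and Converted_Rates follow the data frame used above.

```r
# Toy rows (made-up costs in USD)
toy <- data.frame(
  City = c("New Delhi", "New Delhi", "New Delhi", "Mumbai", "Mumbai", "Chennai"),
  Converted_Rates = c(5, 12, 40, 15, 20, 18),
  stringsAsFactors = FALSE
)

# tapply() groups the costs by city; both calls return the cities in the same sorted order
outlets    <- tapply(toy$Converted_Rates, toy$City, length)
median_usd <- tapply(toy$Converted_Rates, toy$City, median)

city_summary <- data.frame(City = names(outlets),
                           outlets = as.vector(outlets),
                           median_usd = as.vector(median_usd),
                           stringsAsFactors = FALSE)
city_summary
```

Reporting the outlet count next to the median makes it visible when a city's figure rests on only a handful of restaurants.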

Online Delivery or Reservation

A visualization of how many restaurants provide home delivery and table reservation options to their diners.

Which one is preferred more in India and outside India? Comparatively, Indians like their food delivered to the doorstep, whereas diners from the rest of the world prefer to book a table to avoid last-minute hassles.

# Plots to compare count of restaurants providing online delivery in India versus restaurants worldwide

p9 = ggplot(df_india) + geom_bar(aes(Has.Online.delivery, fill=Has.Online.delivery)) + labs(title="Online Delivery Option - India", x="Has Online Delivery?", y="Number of outlets") + theme(plot.title = element_text(size=10, face = "bold"), legend.position = "none")

p10 = ggplot(df_other) + geom_bar(aes(Has.Online.delivery, fill=Has.Online.delivery)) + labs(title="Online Delivery Option - Other Countries", x="Has Online Delivery?", y="Number of outlets") + theme(plot.title = element_text(size=10, face = "bold"), legend.position = "none")

grid.arrange(p9, p10, nrow=2, ncol=2)

# Plots to compare Table reservation counts in restaurants in India versus restaurants worldwide
p45 = ggplot(df_india) + geom_bar(aes(Has.Table.booking, fill=Has.Table.booking)) + labs(title="Table Reservation - India", x="Has Table Booking?", y="Number of outlets") + theme(plot.title = element_text(size=10, face = "bold"), legend.position = "none")

p46 = ggplot(df_other) + geom_bar(aes(Has.Table.booking, fill=Has.Table.booking)) + labs(title="Table Reservation - Other Countries", x="Has Table Booking?", y="Number of outlets") + theme(plot.title = element_text(size=10, face = "bold"), legend.position = "none")

grid.arrange(p45, p46, nrow=2, ncol=2)
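
The raw counts in these charts depend on how many restaurants each group has, so a comparison of preferences is fairer on proportions. The sketch below uses prop.table() on a two-way table; the rows are toy values and the group column is a helper introduced only for this example.

```r
# Toy rows (made-up values); `group` is a helper column for this example
toy <- data.frame(
  group = c("India", "India", "India", "India", "Other", "Other"),
  Has.Online.delivery = c("Yes", "Yes", "No", "No", "No", "No"),
  stringsAsFactors = FALSE
)

counts <- table(toy$group, toy$Has.Online.delivery)

# margin = 1 normalises each row, so the shares within every group sum to 1
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

The same two lines work for Has.Table.booking, which makes the delivery and reservation preferences directly comparable.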

Future Enhancements and Disclaimer

The REST API provided by Zomato is a GET call that takes a single city and returns the corresponding results. Because it runs once per city and we have 141 unique cities, it takes some time to fetch the country names corresponding to the country codes on the first load. Subsequent updates will be faster, since the call will only run for newly added restaurants/cities. Nevertheless, we are trying to improve the performance of this REST call.
We will be adding geomaps based on the latitude and longitude information provided for each restaurant.
We will separate the comma-separated values in the cuisine column and spread them into individual columns to count their frequency and determine the favorite cuisine based on the location.
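
The cuisine step could be sketched in base R as follows. This is one possible approach on toy rows, not the final implementation. It reshapes the data into one row per city and cuisine instead of spreading cuisines into separate columns, because counting is simpler on that shape.

```r
# Toy rows (made-up restaurants)
toy <- data.frame(
  City = c("New Delhi", "New Delhi", "Mumbai"),
  Cuisines = c("North Indian, Chinese", "North Indian, Mughlai", "Chinese, Seafood"),
  stringsAsFactors = FALSE
)

# Split on commas and trim the surrounding spaces
cuisine_list <- lapply(strsplit(toy$Cuisines, ","), trimws)

# One row per (city, cuisine) pair
long <- data.frame(
  City = rep(toy$City, lengths(cuisine_list)),
  Cuisine = unlist(cuisine_list),
  stringsAsFactors = FALSE
)

# Frequency of each cuisine within each city
freq <- table(long$City, long$Cuisine)

# Most frequent cuisine per city; which.max() keeps the first one on a tie
favorite <- apply(freq, 1, function(row) names(row)[which.max(row)])
favorite
```

Ties are resolved by column order here (Mumbai has one Chinese and one Seafood outlet), so a real analysis would need an explicit tie-breaking rule.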