“I really get fascinated by good quality food being served in the restaurants, do you?”
“Who would not want to tease their palate by tasting the best cuisines from across 15 countries?”
Zomato API Analysis is one of the most useful analyses for foodies.
The basic idea of analyzing the Zomato dataset is to get a fair idea of the factors affecting the aggregate rating of each restaurant and to answer questions such as:
The best cuisines in every part of the world that fit within a diner's budget
The value-for-money restaurants in various parts of a country for a given cuisine
The localities of a country with the maximum number of restaurants serving a given cuisine
The needs of people who are striving to find the best cuisine of the country
“Just so that you have a good meal the next time you step out”
Factors that help you dine
The bigger picture is to analyze the various factors that help a customer decide on the appropriate restaurant, such as
Cuisine
Location
Budget, etc.
Some of the issues that we had to address through this analysis are:
Jack of all but Master of One.
Each restaurant serves multiple cuisines, which are recorded as comma-separated values in a single column.
Does it really matter what you serve and where? We answer this question by analyzing the interdependence of the cuisines a restaurant serves, its location and its final rating.
Is the number of votes synonymous with a restaurant being considered a poor or an excellent choice?
For example, a restaurant with only 3 votes is categorized as a poor choice.
The question we need to answer is whether we can rate a restaurant solely on the number of votes it has garnered.
# to preload necessary packages
packages <- c("tidyverse", "readr", "maps", "DT", "knitr", "rmarkdown", "jsonlite", "httr", "mapproj","lubridate","lucr","kableExtra","ggplot2","gridExtra","RCurl","ggmap","tm","wordcloud")
requiredpackages <- packages[!(packages %in% installed.packages()[,"Package"])]
if (length(requiredpackages)) install.packages(requiredpackages, repos = "http://cran.us.r-project.org" )
library(knitr)
library(kableExtra)
library(lucr) #loading lucr library
library(DT)
library(jsonlite)
library(httr)
library(lubridate)
library(RCurl)
library(ggplot2)
library(gridExtra)
library(ggmap)
#library(tm)
library(wordcloud)
text_tbl <- data.frame(
Libraries = c("tidyverse", "readr", "maps","mapproj",
"DT", "knitr", "rmarkdown", "jsonlite",
"httr","lucr","kableExtra","ggplot2","gridExtra","RCurl","ggmap","tm","wordcloud","data.table"),
Functions = c("Easy installation of packages",
"To easily import delimited data",
"For geographical data",
"To convert latitude/longitude into projected coordinates",
"To create functional tables in HTML",
"For dynamic report generation",
"To convert R Markdown documents into a variety of formats",
"A Robust, High Performance JSON Parser and Generator for R",
"Tools for Working with URLs and HTTP organised by HTTP verbs",
"Reformat currency-based data as numeric and convert between currencies",
"Build complex HTML or LaTeX tables using kable()",
"To create beautiful visualizations",
"To arrange the plots generated grid-wise",
"To read data from web url",
"To plot maps",
"To help us create a matrix",
"To help us plot the word cloud",
"Fast file reader and parallel file writer. Enables Fast aggregation and column modification"
)
)
kable(text_tbl, "html") %>%
kable_styling(full_width = F) %>%
column_spec(1, bold = T, border_right = T) %>%
column_spec(2, width = "30em")
Libraries | Functions |
---|---|
tidyverse | Easy installation of packages |
readr | To easily import delimited data |
maps | For geographical data |
mapproj | To convert latitude/longitude into projected coordinates |
DT | To create functional tables in HTML |
knitr | For dynamic report generation |
rmarkdown | To convert R Markdown documents into a variety of formats |
jsonlite | A Robust, High Performance JSON Parser and Generator for R |
httr | Tools for Working with URLs and HTTP organised by HTTP verbs |
lucr | Reformat currency-based data as numeric and convert between currencies |
kableExtra | Build complex HTML or LaTeX tables using kable() |
ggplot2 | To create beautiful visualizations |
gridExtra | To arrange the plots generated grid-wise |
RCurl | To read data from web url |
ggmap | To plot maps |
tm | To help us create a matrix |
wordcloud | To help us plot the word cloud |
data.table | Fast file reader and parallel file writer. Enables Fast aggregation and column modification |
Some of the approaches we took to address the above issues are:
We converted the currency values to standardized abbreviations, such as INR for Indian Rupees, and used an existing package to convert each local currency to USD.
Based on the converted currency value, we also re-evaluated the pre-assigned price range against the Zomato-defined USD scale shown below:
text_table1 <- data.frame(AveragePrice = c("0 < Price <= 10","10 < Price <= 25","25 < Price <= 50","50 < Price"),
PriceRange = c("1","2","3","4"))
kable(text_table1, "html") %>%
kable_styling(full_width = F) %>%
column_spec(1, bold = T, border_right = T) %>%
column_spec(2, width = "15em")
AveragePrice | PriceRange |
---|---|
0 < Price <= 10 | 1 |
10 < Price <= 25 | 2 |
25 < Price <= 50 | 3 |
50 < Price | 4 |
To address the cuisine problem, we split the comma-separated values into individual columns, aggregated their frequencies and determined the most favored cuisine for each location.
As for location, since our data set contains both latitude and longitude, we generated maps for three continents (Asia, Europe and North America) and plotted the restaurant locations on them. The points are color coded according to the Rating color, and the legend shows the rating text for each restaurant.
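The cuisine step described above can be sketched roughly as follows. This is a minimal illustration on made-up rows, not the code used for the report: it reshapes the comma-separated values into a long table (one row per city and cuisine) rather than into separate columns, and the data frame `restaurants_toy` is hypothetical.

```r
# Toy data: the column names mirror the dataset, the rows are made up.
restaurants_toy <- data.frame(
  City     = c("New Delhi", "New Delhi", "Mumbai", "Mumbai"),
  Cuisines = c("North Indian, Chinese", "North Indian, Mughlai",
               "Seafood, North Indian", "Seafood"),
  stringsAsFactors = FALSE
)

# Split each comma-separated string into a vector of cuisines.
cuisine_list <- strsplit(restaurants_toy$Cuisines, ",\\s*")

# One row per (city, cuisine) pair.
long <- data.frame(
  City    = rep(restaurants_toy$City, lengths(cuisine_list)),
  Cuisine = unlist(cuisine_list),
  stringsAsFactors = FALSE
)

# Frequency of every cuisine within every city.
counts <- as.data.frame(table(City = long$City, Cuisine = long$Cuisine),
                        stringsAsFactors = FALSE)

# Keep the most frequent cuisine per city (ties go to the first match).
by_city <- split(counts, counts$City)
top_cuisine <- do.call(rbind, lapply(by_city, function(d) d[which.max(d$Freq), ]))
top_cuisine
```

Splitting on the comma keeps multi-word cuisines such as "North Indian" intact, which matters when the counts are compared across locations.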
Documentation Guide
A glance at the following documentation source will give us a fair understanding about how Zomato has gone about preparing the data and help us analyze it better.
Fetching the data
Data has been collected from Kaggle in the form of .csv file
Data Source
https://www.kaggle.com/shrutimehta/zomato-restaurants-data
Local File - zomato.csv local file is the unprepared file and is used for the zomato API call to fetch country names based on country codes
Github File - This is the final data file, which is updated periodically and is used for further analysis and EDA
MetaData
Each restaurant in the dataset is uniquely identified by its Restaurant Id. Every Restaurant is analyzed based on the following variables:
text_tbl2 <- data.frame(
Attributes = c("Restaurant Id", "Restaurant Name", "Country Code","City",
"Address","Locality", "Locality Verbose", "Longitude", "Latitude",
"Cuisines","Average Cost for two","Currency","Has Table booking","Has Online delivery","Price range",
"Aggregate Rating","Rating color","Rating text","Votes"),
Description = c("Unique id of every restaurant",
"Name of the restaurant",
"Country in which restaurant is located",
"City in which restaurant is located",
"Address of the restaurant","Location in the city",
"Detailed description of the locality",
"Longitude coordinate",
"Latitude coordinate",
"Cuisines offered by the restaurant",
"Cost for two people in different currencies",
"Currency of the country",
"Can you reserve a table?",
"Do they deliver at your doorstep?",
"Range of price of food",
"Average rating out of 5",
"Depending upon the average rating color",
"Text on the basis of rating",
"Number of ratings casted by people"
)
)
kable(text_tbl2, "html") %>%
kable_styling(full_width = F) %>%
column_spec(1, bold = T, border_right = T) %>%
column_spec(2, width = "30em")
Attributes | Description |
---|---|
Restaurant Id | Unique id of every restaurant |
Restaurant Name | Name of the restaurant |
Country Code | Country in which restaurant is located |
City | City in which restaurant is located |
Address | Address of the restaurant |
Locality | Location in the city |
Locality Verbose | Detailed description of the locality |
Longitude | Longitude coordinate |
Latitude | Latitude coordinate |
Cuisines | Cuisines offered by the restaurant |
Average Cost for two | Cost for two people in different currencies |
Currency | Currency of the country |
Has Table booking | Can you reserve a table? |
Has Online delivery | Do they deliver at your doorstep? |
Price range | Range of price of food |
Aggregate Rating | Average rating out of 5 |
Rating color | Depending upon the average rating color |
Rating text | Text on the basis of rating |
Votes | Number of ratings casted by people |
For the purpose of crisp analysis we have not considered two columns from the original dataset since our analysis showed those columns did not contribute significantly. They are “Is delivering now?” & “Switch to order menu”
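Dropping the two columns could look something like the sketch below. The frame `df_raw` is a made-up stand-in, and the exact column spellings in the real file are an assumption taken from the names quoted above.

```r
# Toy frame; the two dropped column names are taken from the text above
# and may be spelled differently in the actual file.
df_raw <- data.frame(
  `Restaurant ID`        = c(1, 2),
  `Is delivering now`    = c("No", "No"),
  `Switch to order menu` = c("No", "No"),
  Votes                  = c(10, 250),
  check.names = FALSE
)

# Keep every column whose name is not in the drop list.
drop_cols  <- c("Is delivering now", "Switch to order menu")
df_trimmed <- df_raw[, !(names(df_raw) %in% drop_cols), drop = FALSE]
names(df_trimmed)
```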
Going ahead and keeping the business requirements in mind ‘Data Processing’ has been done on the following categories.
Currency
We have converted the local currencies to USD with the help of the lucr package.
Based on the converted rates, we have defined a new standardized price range.
Both the above values have been integrated into our existing dataset.
library(lucr)
library(DT)
library(data.table)
rates <- conversion_rates(currency = "USD", key = "91df1e9c6b6d487180fbe7bf19c7cc0d") #extracting conversion rates based on USD
#list_currencies(key = "91df1e9c6b6d487180fbe7bf19c7cc0d")
df <- fread("https://raw.githubusercontent.com/Akanksha061990/DataWrangling/master/zomato.csv", data.table = FALSE)
df["converted_rates"] <- NA #created new column for converted price and initialized with "NA"
df["new_price_range"] <- NA #created new column for standardized price range and initialized with "NA"
arr <- array()
#df$Currency
for (i in seq_len(nrow(df))) { #looping through all rows
  val <- df[i, "Currency"] #finding currency values
  usd <- df[["Average Cost for two"]][i] / rates$rates[[val]] #cost for two converted to USD
  df[["converted_rates"]][i] <- round(usd, 2) #find converted currency rate
  #finding standardized price range (0 when the cost is missing or not positive)
  if (is.na(usd) || usd <= 0) {
    arr[i] <- 0
  } else if (usd <= 10) {
    arr[i] <- 1
  } else if (usd <= 25) {
    arr[i] <- 2
  } else if (usd <= 50) {
    arr[i] <- 3
  } else {
    arr[i] <- 4
  }
}
vec <- as.vector(arr)
df[["new_price_range"]] <- vec #updating values in new column created
#write.csv(df, "updatedZomato.csv")
datatable(df[,-c(3:19)], caption = "Updated Currency Value", colnames=c("Restaurant Id","Restaurant Name","Converted Rates in USD","New Price Range"))
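As an alternative to the row-by-row loop, the same conversion and bucketing can be sketched in vectorized form. This is only an illustration: the rates in `rates_per_usd` are hypothetical placeholders (not real quotes and not the values returned by lucr), and `costs_toy` is made-up data.

```r
# Hypothetical rates: units of local currency per 1 USD (not real quotes).
rates_per_usd <- c(INR = 65, USD = 1, GBP = 0.75)

costs_toy <- data.frame(
  cost_for_two = c(500, 40, 0, 3000, 60),
  currency     = c("INR", "USD", "INR", "INR", "GBP"),
  stringsAsFactors = FALSE
)

# Look up every row's rate at once through the named vector.
usd <- costs_toy$cost_for_two / unname(rates_per_usd[costs_toy$currency])
costs_toy$converted_rates <- round(usd, 2)

# cut() uses right-closed intervals: (0,10], (10,25], (25,50], (50,Inf].
# Values outside them (a cost of 0, or NA) become NA and are recoded to 0.
price_range <- cut(usd, breaks = c(0, 10, 25, 50, Inf), labels = FALSE)
price_range[is.na(price_range)] <- 0
costs_toy$new_price_range <- price_range
costs_toy
```

The breaks follow the price scale tabulated earlier; `cut()` replaces the chain of `if`/`else if` tests, and the named-vector lookup replaces the per-row `rates$rates[[val]]` access.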
Country
As country based analysis is one of the major factors and Zomato data contained country ID (based on Zomato standards), we used the Zomato REST APIs to collect the country information corresponding to each city.
REST API consumed: https://developers.zomato.com/api/v2.1/locations
By making a GET call to the above endpoint, we received information about each city, including the country name, in JSON format.
library(DT)
library(jsonlite)
library(httr)
#reading the zomato csv file and storing it in dataframe df
df <- read.csv("zomato.csv",stringsAsFactors = FALSE)
#finding the list of unique cities
list_of_cities <- as.list(unique(df$City))
iter <- length(list_of_cities)
out <- matrix(NA, nrow = iter, ncol = 2)
for (i in 1:length(list_of_cities)) {
url <- 'https://developers.zomato.com/api/v2.1/locations?query=city_name=' #request url to call the REST API
place <- list_of_cities[[i]] #finding city name to pass in Get request(mandatory parameter)
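newURL <- paste0(url, URLencode(place, reserved = TRUE)) #creating new URL (paste0 adds no separating space; URLencode escapes spaces in city names)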
req <- GET(newURL,add_headers(User_key = "74af46c228e902e021c0d31516308ecb"),query = list(entity_type = "city")) #added headers and made a REST Call
response <- content(req) #storing the content received
if (length(response$location_suggestions) != 0) { # checking if the response contains country information
out[i,] <- c(response$location_suggestions[[1]]$country_id,response$location_suggestions[[1]]$country_name)
# saving the country ID and name in a matrix
}
}
datatable(unique(out,incomparables = FALSE, MARGIN = 1, fromLast = FALSE, signif = Inf), caption = 'Distinct Country ID and Country Name',colnames = c("Country Code","Country Name"))
#printing the final matrix containing unique records for country IDs
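Once the distinct country IDs and names are available, attaching the names to the restaurant rows might look like the sketch below. Both data frames are made-up stand-ins: `country_lookup` plays the role of the two-column matrix built above, and the IDs and names in it are illustrative only.

```r
# Hypothetical lookup table, shaped like the two-column result built above.
country_lookup <- data.frame(
  country_id   = c(1, 216),
  country_name = c("India", "United States"),
  stringsAsFactors = FALSE
)

# Made-up restaurant rows carrying only the country code.
restaurants_cc <- data.frame(
  `Restaurant Name` = c("A", "B", "C"),
  `Country Code`    = c(216, 1, 1),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# match() returns, for each restaurant, the row of the lookup table with
# the same ID, so the original row order is preserved.
idx <- match(restaurants_cc$`Country Code`, country_lookup$country_id)
restaurants_cc$`Country Name` <- country_lookup$country_name[idx]
restaurants_cc
```

A code that is missing from the lookup table yields `NA` rather than dropping the row, which makes unmatched codes easy to spot.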
The following plots and graphs help us unravel the information hidden in our data set.
We have visualized some of the important attributes of restaurants in India and compared them with those worldwide.
We have created a word cloud to identify the most popular cuisine and have also plotted the locations of the restaurants, along with their ratings, for Asia, Europe and North America.
This visualization gives us a fair idea of which country has the maximum number of restaurants, and of how they are spread across the various cities in India.
We can conclude that, as per Zomato, India and the USA have the most listed restaurants.
Within India, restaurants are a major hit in northern cities such as New Delhi and Noida.
library(ggplot2)
library(gridExtra)
library(RCurl)
df <- fread("https://raw.githubusercontent.com/Akanksha061990/DataWrangling/master/FinalData.csv", data.table = FALSE)
# Extracting only Indian restaurants
df_india = subset(df, `Country Name` == "India")
df_other = subset(df, `Country Name` != "India") # all restaurants outside India
# Plotting distribution of registered restaurants by country
# With coord_flip() the x aesthetic (country or city) is drawn on the vertical axis,
# so the x label names the category and the y label names the count
p1 = ggplot(df) + geom_bar(aes(`Country Name`), fill = "tomato3") +
coord_flip() +
labs(title="Country wise distribution", x="Country", y = "Number of outlets") +
theme(plot.title = element_text(size = 10, face = "bold"))
p2 = ggplot(df_india) + geom_bar(aes(City), fill = "red") +
coord_flip() +
labs(title="City-wise distribution in India", x="City", y = "Number of outlets") +
theme(plot.title = element_text(size = 10, face = "bold"))
grid.arrange(p1, p2, ncol=2)
This visualization helps us understand how restaurants in India and in other countries fare on the basis of the aggregate ratings they have garnered.
Overall, we can see that more than 2000 restaurants from India alone have not been given a rating. This is a disparity we may look into and address while presenting our final analysis. Diners outside India seem to take rating restaurants more seriously, and the count of unrated restaurants there is far lower.
#Plot to have a look at aggregate ratings of restaurants in India as well as worldwide
p3 = ggplot(df_india) + geom_bar(aes(Aggregate.rating, fill=Rating.text)) +
labs(title="Distribution of Aggregate Rating in India", x = "Aggregate Rating", y="Number of outlets") +
theme(plot.title = element_text(size=10, face = "bold"))
p4 = ggplot(df_other) + geom_bar(aes(Aggregate.rating, fill=Rating.text)) +
labs(title="Distribution of Aggregate Rating in other countries", x = "Aggregate rating", y = "Number of outlets") +
theme(plot.title = element_text(size=10, face = "bold"))
grid.arrange(p3, p4, nrow=2)
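To put a number on the share of unrated restaurants, a comparison along these lines could be used. This is a sketch on made-up rows; treating an aggregate rating of 0 as "not rated" is an assumption about how the data encodes missing ratings, and `ratings_toy` is hypothetical.

```r
# Made-up rows; the column names follow the plotting code above.
ratings_toy <- data.frame(
  Country.Name     = c("India", "India", "India", "United States", "United States"),
  Aggregate.rating = c(0, 3.9, 0, 4.2, 0),
  stringsAsFactors = FALSE
)

ratings_toy$region  <- ifelse(ratings_toy$Country.Name == "India", "India", "Other")
ratings_toy$unrated <- ratings_toy$Aggregate.rating == 0  # assumption: 0 means no rating

# Mean of a logical vector is the proportion of TRUE values,
# i.e. the share of unrated restaurants in each region.
unrated_share <- tapply(ratings_toy$unrated, ratings_toy$region, mean)
unrated_share
```

Comparing shares rather than raw counts avoids the distortion caused by India having far more listed restaurants than any other country.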
The next time you plan a visit to India, you should know which cities to visit for a good meal and have a better understanding of how much you would shell out in US Dollars.
In New Delhi the average cost lies in the range of $0-75, followed by Mumbai, Chennai and Kolkata, where the average spend can be estimated at $15-20. However, the maximum number of restaurants are from New Delhi, hence it gives a clearer picture than the other cities.
#plot to compare average cost for two in the major Metros of India
df_newdelhi = subset(df_india, City=="New Delhi")
df_chennai = subset(df_india, City=="Chennai")
df_mumbai = subset(df_india, City=="Mumbai")
df_kolkata = subset(df_india, City=="Kolkata")
p5 = ggplot(df_newdelhi) + geom_bar(aes(Converted_Rates), color="green4", fill="green4") +
labs(title="Avg.Money Spent - New Delhi", x="Avg. Cost", y="Number of outlets") +
theme(plot.title = element_text(size=10, face = "bold"))
p6 = ggplot(df_mumbai) + geom_bar(aes(Converted_Rates), color="green4", fill="green4") +
labs(title="Avg.Money Spent - Mumbai", x="Avg. Cost", y="Number of outlets") +
theme(plot.title = element_text(size=10, face = "bold"))
p7 = ggplot(df_chennai) + geom_bar(aes(Converted_Rates), color="green4", fill="green4") +
labs(title="Avg.Money Spent - Chennai", x="Avg. Cost", y="Number of outlets") +
theme(plot.title = element_text(size=10, face = "bold"))
p8 = ggplot(df_kolkata) + geom_bar(aes(Converted_Rates), color="green4", fill="green4") +
labs(title="Avg.Money Spent - Kolkata", x="Avg. Cost", y="Number of outlets") +
theme(plot.title = element_text(size=10, face = "bold"))
grid.arrange(p5, p6, p7, p8, nrow=4)
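Alongside the bar charts, a per-city summary can make the comparison between the metros easier to read. The sketch below uses made-up costs in a hypothetical frame `city_costs`; only the column names follow the plotting code above.

```r
# Made-up converted costs (USD) for a handful of outlets.
city_costs <- data.frame(
  City            = c("New Delhi", "New Delhi", "Mumbai", "Mumbai", "Chennai"),
  Converted_Rates = c(5, 15, 12, 20, 10),
  stringsAsFactors = FALSE
)

# Median cost for two per city, and the number of outlets behind each median.
median_cost <- tapply(city_costs$Converted_Rates, city_costs$City, median)
n_outlets   <- table(city_costs$City)

median_cost
n_outlets
```

Reporting the outlet count next to the median shows how much data supports each city's figure, which is the point made above about New Delhi having the most restaurants.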
A visualization of how many restaurants provide home delivery and table reservation options to their diners.
Which one is more common in India and outside India? Comparatively, Indians like their food delivered to the doorstep, whereas diners from the rest of the world prefer to book a table to avoid last-minute hassles.
# Plots to compare count of restaurants providing online delivery in India versus restaurants worldwide
p9 = ggplot(df_india) + geom_bar(aes(Has.Online.delivery, fill=Has.Online.delivery)) +
labs(title="Online Delivery Option - India", x="Has Online Delivery?", y="Number of outlets") +
theme(plot.title = element_text(size=10, face = "bold"), legend.position = "none")
p10 = ggplot(df_other) + geom_bar(aes(Has.Online.delivery, fill=Has.Online.delivery)) +
labs(title="Online Delivery Option - Other Countries", x="Has Online Delivery?", y="Number of outlets") +
theme(plot.title = element_text(size=10, face = "bold"), legend.position = "none")
grid.arrange(p9, p10, nrow=2, ncol=2)
# Plots to compare Table reservation counts in restaurants in India versus restaurants worldwide
p45 = ggplot(df_india) + geom_bar(aes(Has.Table.booking, fill=Has.Table.booking)) +
labs(title="Table Reservation - India", x="Has Table Booking?", y="Number of outlets") +
theme(plot.title = element_text(size=10, face = "bold"), legend.position = "none")
p46 = ggplot(df_other) + geom_bar(aes(Has.Table.booking, fill=Has.Table.booking)) +
labs(title="Table Reservation - Other Countries", x="Has Table Booking?", y="Number of outlets") +
theme(plot.title = element_text(size=10, face = "bold"), legend.position = "none")
grid.arrange(p45, p46, nrow=2, ncol=2)
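Because India has many more restaurants than the other countries, raw counts in the bar charts are hard to compare directly. One way to compare the two groups is by proportion, as in this sketch. The rows in `services_toy` are made up purely to show the mechanics and say nothing about the real shares.

```r
# Made-up rows; column names follow the plotting code above.
services_toy <- data.frame(
  region              = c("India", "India", "India", "Other", "Other"),
  Has.Online.delivery = c("Yes", "Yes", "No", "No", "No"),
  Has.Table.booking   = c("No", "Yes", "No", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

# margin = 1 normalizes each row, so every region's shares sum to 1.
delivery_share <- prop.table(
  table(services_toy$region, services_toy$Has.Online.delivery), margin = 1)
booking_share <- prop.table(
  table(services_toy$region, services_toy$Has.Table.booking), margin = 1)

delivery_share
booking_share
```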
This set of visualizations helps us determine how the restaurants are clustered across three continents, i.e. Asia, Europe and North America.
The restaurants are colored according to their Rating color, and the legend maps each color to its Rating text.
library(ggmap)
library(plyr)
zomato <- fread("https://raw.githubusercontent.com/Akanksha061990/DataWrangling/master/FinalData.csv", data.table = FALSE)
zomato$Rating.color <- revalue(zomato$Rating.color, c("Dark Green" = "darkgreen","Green" = "green","Yellow" = "yellow",
"White" = "white","Red" = "red","Orange" = "orange"))
col = as.character(zomato$Rating.color)
names(col) <- as.character(zomato$Rating.text)
map_Asia <- get_map(location = 'Asia',zoom = 3)
mapPoints <- ggmap(map_Asia) +
geom_point(aes(x = zomato$Longitude, y = zomato$Latitude,colour = Rating.text), data = zomato, alpha = .6) +
scale_color_manual(values = col)
mapPoints
library(ggmap)
library(plyr)
zomato <- fread("https://raw.githubusercontent.com/Akanksha061990/DataWrangling/master/FinalData.csv", data.table = FALSE)
zomato$Rating.color <- revalue(zomato$Rating.color, c("Dark Green" = "darkgreen","Green" = "green","Yellow" = "yellow",
"White" = "white","Red" = "red","Orange" = "orange"))
col = as.character(zomato$Rating.color)
names(col) <- as.character(zomato$Rating.text)
map_Europe <- get_map(location = 'Europe',zoom = 3)
mapPoints1 <- ggmap(map_Europe) +
geom_point(aes(x = zomato$Longitude, y = zomato$Latitude,colour = Rating.text), data = zomato, alpha = .6) +
scale_color_manual(values = col)
mapPoints1
library(ggmap)
library(plyr)
zomato <- fread("https://raw.githubusercontent.com/Akanksha061990/DataWrangling/master/FinalData.csv", data.table = FALSE)
zomato$Rating.color <- revalue(zomato$Rating.color, c("Dark Green" = "darkgreen","Green" = "green","Yellow" = "yellow",
"White" = "white","Red" = "red","Orange" = "orange"))
col = as.character(zomato$Rating.color)
names(col) <- as.character(zomato$Rating.text)
map_NA <- get_map(location = 'North America',zoom = 3)
mapPoints2 <- ggmap(map_NA) +
geom_point(aes(x = zomato$Longitude, y = zomato$Latitude,colour = Rating.text), data = zomato, alpha = .6) +
scale_color_manual(values = col)
mapPoints2
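The map code above builds its color vector with one entry per restaurant, so the same rating text appears as a name many times. A more compact palette, with one entry per rating text, could be built as sketched below. The frame `zomato_toy` and the text-to-color pairing in it are illustrative assumptions, not values read from the dataset.

```r
# Made-up rows; the pairing of text and color here is only illustrative.
zomato_toy <- data.frame(
  Rating.text  = c("Excellent", "Good", "Excellent", "Poor"),
  Rating.color = c("darkgreen", "yellow", "darkgreen", "red"),
  stringsAsFactors = FALSE
)

# Keep each (text, color) pair once, then name the colors by their text.
pal_pairs      <- unique(zomato_toy[, c("Rating.text", "Rating.color")])
rating_palette <- setNames(pal_pairs$Rating.color, pal_pairs$Rating.text)
rating_palette
```

A named vector like `rating_palette` can be passed to `scale_color_manual(values = rating_palette)` in place of `col`, and it can be built once and reused for all three continent maps.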
Through this visualization we have identified the cuisine that is most commonly prepared across all highly rated restaurants. Since the maximum number of restaurants are from India, the most popular cuisine turned out to be the one shown largest in the word cloud.
library(tm)
library(wordcloud)
zomato <- fread("https://raw.githubusercontent.com/Akanksha061990/DataWrangling/master/FinalData.csv", data.table = FALSE)
zomato$Cuisines2 <- (sapply(zomato$Cuisines,gsub,pattern = "\\,",replacement = " "))
cuisine <- Corpus(VectorSource(zomato$Cuisines2))
cuisine_dtm <- DocumentTermMatrix(cuisine)
cuisine_freq <- colSums(as.matrix(cuisine_dtm))
freq <- sort(colSums(as.matrix(cuisine_dtm)), decreasing=TRUE)
cuisine_wf <- data.frame(word = names(cuisine_freq), freq=cuisine_freq)
pal2 <- brewer.pal(8,"Dark2")
wordcloud(cuisine_wf$word,cuisine_wf$freq,random.order=FALSE,
rot.per =.15, colors=pal2,scale=c(4,.9))
The country codes unique to Zomato have been re-valued to the names of the countries they stand for by using the Zomato API, which helped us analyze the data set at a country level.
Standardizing the currency by looking up the currency code and converting all local currencies to US Dollars helped us re-define the price range and thus establish a relationship between the final rating and the price range.
The comma-separated cuisines have been spread into individual columns, aggregated for restaurants with a rating greater than 4 and filtered by country. This helped us understand what you should serve in order to be popular in a particular place.
Whether you plan your international vacation to Europe or Asia, you will never go hungry if you take a good look at our maps to find some excellent restaurants.
FoodCloud, as we like to call it, helps us understand which cuisine is the best selling.
We have computed the correlation of the aggregate rating with the new price range and with the number of votes, and found it to be around 39% and 31% respectively. This means there is a positive relation, but not one strong enough to judge a restaurant on these two factors alone.
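A correlation check of this kind can be sketched with base R's `cor()`. The numbers in `corr_toy` are made up and will not reproduce the figures quoted above; whether the report used Pearson or another method is not stated, so both Pearson and Spearman are shown here as an assumption.

```r
# Made-up values; column names follow the ones used earlier in the report.
corr_toy <- data.frame(
  Aggregate.rating = c(2.5, 3.0, 3.4, 3.9, 4.1, 4.5),
  new_price_range  = c(1, 1, 2, 2, 3, 4),
  Votes            = c(12, 40, 35, 150, 90, 600)
)

# Pearson correlation (the default) of rating with price range and with votes.
r_price <- cor(corr_toy$Aggregate.rating, corr_toy$new_price_range)
r_votes <- cor(corr_toy$Aggregate.rating, corr_toy$Votes)

# Spearman rank correlation is less sensitive to heavily skewed vote counts.
rho_votes <- cor(corr_toy$Aggregate.rating, corr_toy$Votes, method = "spearman")

c(price = r_price, votes = r_votes, votes_spearman = rho_votes)
```

On the real data, unrated restaurants would likely need to be excluded first, since a placeholder rating would otherwise pull the coefficients around.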
“LIVE TO EAT & NOT EAT TO LIVE”