“I really get fascinated by good quality food being served in the restaurants, do you?”
“Who would not want to tease their palate by tasting the best cuisines across a total of 15 countries?”
Zomato API Analysis is one of the most useful analyses for foodies.
The basic idea of analyzing the Zomato dataset is to get a fair idea of the factors affecting the aggregate rating of each restaurant, such as:
The best cuisines in every part of the world that fit within a diner's budget
The value-for-money restaurants in various parts of a country for a given cuisine
The locality in a country that has the maximum number of restaurants serving a given cuisine
The needs of people who are striving to get the best cuisine of the country
“Just so that you have a good meal the next time you step out”
Factors that help you dine
The bigger picture is to analyze the various factors that help the customer decide on the appropriate restaurant, such as:
Cuisine
Location
Pocket money etc.
Some of the issues that we had to address through this analysis are:
Jack of all trades, but master of one.
Each restaurant serves multiple cuisines, which are recorded as comma-separated values in a single column.
Does it really matter what you serve and where? We answer this question by analyzing the interdependence of the cuisines a restaurant serves, its location and its final rating.
Is the number of votes synonymous with a restaurant being considered a poor or an excellent choice?
For example, a restaurant with only 3 votes is categorized as a poor choice.
The question we need to answer is whether we can rate a restaurant solely on the number of votes it has garnered.
# to preload necessary packages
packages <- c("tidyverse", "readr", "maps", "DT", "knitr", "rmarkdown", "jsonlite", "httr", "mapproj","lubridate","lucr","kableExtra","ggplot2","gridExtra","RCurl")
requiredpackages <- packages[!(packages %in% installed.packages()[,"Package"])]
if (length(requiredpackages)) install.packages(requiredpackages, repos = "http://cran.us.r-project.org" )
library(knitr)
library(kableExtra)
library(lucr) #loading lucr library
library(DT)
library(jsonlite)
library(httr)
library(lubridate)
library(DT)
library(RCurl)
library(ggplot2)
library(gridExtra)
text_tbl <- data.frame(
Libraries = c("tidyverse", "readr", "maps","mapproj",
"DT", "knitr", "rmarkdown", "jsonlite",
"httr","lucr","kableExtra","ggplot2","gridExtra","RCurl"),
Functions = c("Easy installation of packages",
"To easily import delimited data",
"For geographical data",
"To convert latitude/longitude into projected coordinates",
"To create functional tables in HTML",
"For dynamic report generation",
"To convert R Markdown documents into a variety of formats",
"A Robust, High Performance JSON Parser and Generator for R",
"Tools for Working with URLs and HTTP organised by HTTP verbs",
"Reformat currency-based data as numeric and convert between currencies",
"Build complex HTML or LaTeX tables using kable()",
"To create beautiful visualizations",
"To arrange the plots generated grid-wise",
"To read data from web url"
)
)
kable(text_tbl, "html") %>%
kable_styling(full_width = F) %>%
column_spec(1, bold = T, border_right = T) %>%
column_spec(2, width = "30em")
| Libraries | Functions |
|---|---|
| tidyverse | Easy installation of packages |
| readr | To easily import delimited data |
| maps | For geographical data |
| mapproj | To convert latitude/longitude into projected coordinates |
| DT | To create functional tables in HTML |
| knitr | For dynamic report generation |
| rmarkdown | To convert R Markdown documents into a variety of formats |
| jsonlite | A Robust, High Performance JSON Parser and Generator for R |
| httr | Tools for Working with URLs and HTTP organised by HTTP verbs |
| lucr | Reformat currency-based data as numeric and convert between currencies |
| kableExtra | Build complex HTML or LaTeX tables using kable() |
| ggplot2 | To create beautiful visualizations |
| gridExtra | To arrange the plots generated grid-wise |
| RCurl | To read data from web url |
Some of the approaches to address the above issues are:
We have converted the currency values to standardized abbreviations, such as INR for Indian Rupees, and used an existing package to convert the given currency to USD.
Based on the converted currency value, we also re-evaluated the pre-assigned price range against the Zomato-defined USD scale shown below:
text_table1 <- data.frame(AveragePrice = c("0 < Price <= 10","10 < Price <= 25","25 < Price <= 50","50 < Price"),
PriceRange = c("1","2","3","4"))
kable(text_table1, "html") %>%
kable_styling(full_width = F) %>%
column_spec(1, bold = T, border_right = T) %>%
column_spec(2, width = "15em")
| AveragePrice | PriceRange |
|---|---|
| 0 < Price <= 10 | 1 |
| 10 < Price <= 25 | 2 |
| 25 < Price <= 50 | 3 |
| 50 < Price | 4 |
The proposed solution to the cuisine problem is to split the comma-separated values, spread them into individual columns, count their frequency and determine the favorite cuisine based on the location.
Speaking of location, since our data set contains latitude as well as longitude, we will attempt to create a map and use color coding to highlight the cities with the maximum number of highly rated restaurants.
Fit a linear model and compare a full and a reduced model to uncover whether votes really decide the rating.
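The full-versus-reduced comparison mentioned above can be sketched with a partial F-test. This is only an illustration under stated assumptions: the data frame `toy` is simulated, the coefficients are made up so that there is something to detect, and the column names simply mirror those of the Zomato file. With the real data, the same two `lm()` calls would be fitted on the actual columns.

```r
# Simulated stand-in for the restaurant data (all values are made up)
set.seed(42)
n <- 200
toy <- data.frame(
  Votes       = sample(0:500, n, replace = TRUE),
  Price.range = sample(1:4, n, replace = TRUE)
)
# Assumed relationship, chosen only for illustration
toy$Aggregate.rating <- 2 + 0.004 * toy$Votes + 0.2 * toy$Price.range +
  rnorm(n, sd = 0.4)

# Full model includes Votes; reduced model leaves it out
full    <- lm(Aggregate.rating ~ Votes + Price.range, data = toy)
reduced <- lm(Aggregate.rating ~ Price.range, data = toy)

# Partial F-test: does adding Votes significantly reduce the residual error?
comparison <- anova(reduced, full)
print(comparison)
```

A small p-value in the second row of the table suggests that votes carry information about the rating beyond what the reduced model explains. It does not by itself show that votes alone are enough to rate a restaurant.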
Documentation Guide
A glance at the following documentation source will give us a fair understanding about how Zomato has gone about preparing the data and help us analyze it better.
Fetching the data
The data was collected from Kaggle in the form of a .csv file.
Data Source
MetaData
Each restaurant in the dataset is uniquely identified by its Restaurant Id. Every Restaurant is analyzed based on the following variables:
text_tbl2 <- data.frame(
Attributes = c("Restaurant Id", "Restaurant Name", "Country Code","City",
"Address","Locality", "Locality Verbose", "Longitude", "Latitude",
"Cuisines","Average Cost for two","Currency","Has Table booking","Has Online delivery","Price range",
"Aggregate Rating","Rating color","Rating text","Votes"),
Description = c("Unique id of every restaurant",
"Name of the restaurant",
"Country in which restaurant is located",
"City in which restaurant is located",
"Address of the restaurant","Location in the city",
"Detailed description of the locality",
"Longitude coordinate",
"Latitude coordinate",
"Cuisines offered by the restaurant",
"Cost for two people in different currencies",
"Currency of the country",
"Can you reserve a table?",
"Do they deliver at your doorstep?",
"Range of price of food",
"Average rating out of 5",
"Color assigned on the basis of the average rating",
"Text on the basis of rating",
"Number of ratings cast by people"
)
)
kable(text_tbl2, "html") %>%
kable_styling(full_width = F) %>%
column_spec(1, bold = T, border_right = T) %>%
column_spec(2, width = "30em")
| Attributes | Description |
|---|---|
| Restaurant Id | Unique id of every restaurant |
| Restaurant Name | Name of the restaurant |
| Country Code | Country in which restaurant is located |
| City | City in which restaurant is located |
| Address | Address of the restaurant |
| Locality | Location in the city |
| Locality Verbose | Detailed description of the locality |
| Longitude | Longitude coordinate |
| Latitude | Latitude coordinate |
| Cuisines | Cuisines offered by the restaurant |
| Average Cost for two | Cost for two people in different currencies |
| Currency | Currency of the country |
| Has Table booking | Can you reserve a table? |
| Has Online delivery | Do they deliver at your doorstep? |
| Price range | Range of price of food |
| Aggregate Rating | Average rating out of 5 |
| Rating color | Color assigned on the basis of the average rating |
| Rating text | Text on the basis of rating |
| Votes | Number of ratings cast by people |
To keep the analysis crisp, we have left out two columns from the original dataset, since our analysis showed that they did not contribute significantly: “Is delivering now?” and “Switch to order menu”.
Going ahead, and keeping the business requirements in mind, data processing has been done on the following categories.
Currency
We have converted the local currencies to USD with the help of the lucr package.
Based on the converted rates, we have defined a new standardized price range.
Both of these values have been added to our existing dataset.
library(lucr)
library(DT)
rates <- conversion_rates(currency = "USD", key = "91df1e9c6b6d487180fbe7bf19c7cc0d") # extracting conversion rates based on USD
#list_currencies(key = "91df1e9c6b6d487180fbe7bf19c7cc0d")
df <- read.csv("zomato.csv", stringsAsFactors = FALSE)
# look up the conversion rate for each row's currency code (NA if the code is not in the rates list)
rate <- vapply(df$Currency, function(cur) {
  r <- rates$rates[[cur]]
  if (is.null(r)) NA_real_ else as.numeric(r)
}, numeric(1), USE.NAMES = FALSE)
usd_cost <- df[["Average.Cost.for.two"]] / rate # average cost for two converted to USD
df[["converted_rates"]] <- round(usd_cost, 2) # new column for the converted price
# standardized price range: (0,10] -> 1, (10,25] -> 2, (25,50] -> 3, above 50 -> 4
price_range <- cut(usd_cost, breaks = c(0, 10, 25, 50, Inf), labels = FALSE)
price_range[is.na(price_range)] <- 0 # missing or non-positive costs fall into range 0
df[["new_price_range"]] <- price_range # new column for the standardized price range
write.csv(df, "updatedZomato.csv")
datatable(df[,-c(3:19)], caption = "Updated Currency Value", colnames=c("Restaurant Id","Restaurant Name","Converted Rates in USD","New Price Range"))
Country
Country-based analysis is one of the major factors, and the Zomato data contains a country ID (based on Zomato standards), so we used the Zomato REST APIs to collect the country information corresponding to each city.
REST API consumed: https://developers.zomato.com/api/v2.1/locations
By making a GET call to the above endpoint, we received information about each city, including the country name, in JSON format.
library(DT)
library(jsonlite)
library(httr)
#reading the zomato csv file and storing it in dataframe df
df <- read.csv("zomato.csv")
#finding the list of unique cities
list_of_cities <- as.list(unique(df$City))
iter <- length(list_of_cities)
out <- matrix(NA, nrow = iter, ncol = 2)
for (i in 1:length(list_of_cities)) {
url <- 'https://developers.zomato.com/api/v2.1/locations?query=city_name=' #request url to call the REST API
place <- list_of_cities[[i]] #finding city name to pass in Get request(mandatory parameter)
newURL <- paste(url, place) #creating new URL
req <- GET(newURL,add_headers(User_key = "74af46c228e902e021c0d31516308ecb"),query = list(entity_type = "city")) #added headers and made a REST Call
response <- content(req) #storing the content received
if (length(response$location_suggestions) != 0) { # checking if the response contains country information
out[i,] <- c(response$location_suggestions[[1]]$country_id,response$location_suggestions[[1]]$country_name)
# saving the country ID and name in a matrix
}
}
#printing the final matrix containing unique records for country IDs
datatable(unique(out), caption = 'Distinct Country ID and Country Name', colnames = c("Country Code","Country Name"))
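The loop above issues one request per unique city, so every run repeats lookups that were already done. One possible way to avoid this is to keep a cache of city-to-country results and only query cities that are not yet in it. The sketch below is an assumption, not part of the original analysis: `fake_lookup()` is a hypothetical stand-in for the real GET request, and `lookup_countries()` is an illustrative helper name.

```r
# Hypothetical stand-in for the per-city REST call; swap in the real request
fake_lookup <- function(city) {
  known <- c("New Delhi" = "India", "Noida" = "India", "Albany" = "United States")
  unname(known[city])  # NA for cities that are not known
}

# Look up only the cities that are missing from the cache.
# `cache` is a named character vector: names are cities, values are countries.
lookup_countries <- function(cities, cache = character(0), lookup = fake_lookup) {
  new_cities <- setdiff(unique(cities), names(cache))
  for (city in new_cities) {
    cache[city] <- lookup(city)
  }
  cache
}

# First load: two distinct cities, so two lookups
cache <- lookup_countries(c("New Delhi", "Noida", "New Delhi"))
# Later update: only "Albany" is new, so only one more lookup is made
cache <- lookup_countries(c("New Delhi", "Albany"), cache)
print(cache)
```

The cache could be stored between runs with `saveRDS()` and restored with `readRDS()`, so that only newly added cities trigger a request.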
The following plots and graphs help us unravel the unsaid information that our data set holds.
We have tried visualizing some of the important attributes for restaurants in India compared to those worldwide.
This visualization gives a fair idea of which country has the maximum number of restaurants, and how they are spread across the cities of India.
As per the Zomato data, India and the USA account for the most listed restaurants.
Within India, restaurants are a major hit in northern cities such as New Delhi and Noida.
library(ggplot2)
library(gridExtra)
df <- read.csv(text=getURL("https://raw.githubusercontent.com/Akanksha061990/DataWrangling/master/FinalData.csv"), header=T)
# Splitting the data into Indian restaurants and restaurants from all other countries
df_india = subset(df, Country.Name == "India")
df_other = subset(df, Country.Name != "India")
# Plotting distribution of registered restaurants by country
# (labs() names the aesthetics, so x is the country even though coord_flip() draws it vertically)
p1 = ggplot(df) + geom_bar(aes(Country.Name), fill = "tomato3") + coord_flip() + labs(title="Country wise distribution", x="Country", y = "Number of outlets") + theme(plot.title = element_text(size = 10, face = "bold"))
p2 = ggplot(df_india) + geom_bar(aes(City), fill = "red") + coord_flip() + labs(title="City-wise distribution in India", x="City", y = "Number of outlets") + theme(plot.title = element_text(size = 10, face = "bold"))
grid.arrange(p1, p2, ncol=2)
This visualization helps us understand how restaurants from India and other countries fare on the basis of the aggregate ratings they have garnered.
In all, we can see that more than 2000 restaurants from India alone have not been given a rating. This is a disparity we can look into and address while presenting our final analysis. Diners outside India appear to take ratings a bit more seriously, and the no-rating count there is far lower.
#Plot to have a look at aggregate ratings of restaurants in India as well as worldwide
p3 = ggplot(df_india) + geom_bar(aes(Aggregate.rating, fill=Rating.text)) + labs(title="Distribution of Aggregate Rating in India", x = "Aggregate Rating", y="Number of outlets") + theme(plot.title = element_text(size=10, face = "bold"))
p4 = ggplot(df_other) + geom_bar(aes(Aggregate.rating, fill=Rating.text)) + labs(title="Distribution of Aggregate Rating in other countries", x = "Aggregate rating", y = "Number of outlets") + theme(plot.title = element_text(size=10, face = "bold"))
grid.arrange(p3, p4, nrow=2)
The next time you plan a visit to India, you should know which cities to visit for a good meal, along with a better understanding of how much you would shell out in US dollars.
In New Delhi the average cost for two falls in the range of $0-75, followed by Mumbai, Chennai and Kolkata, where the average spend can be estimated at $15-20. However, the maximum number of restaurants are from New Delhi, so it provides better clarity than the other cities.
#plot to compare average cost for two in the major Metros of India
df_newdelhi = subset(df_india, City=="New Delhi")
df_chennai = subset(df_india, City=="Chennai")
df_mumbai = subset(df_india, City=="Mumbai")
df_kolkata = subset(df_india, City=="Kolkata")
p5 = ggplot(df_newdelhi) + geom_bar(aes(Converted_Rates), color="green4", fill="green4") + labs(title="Avg.Money Spent - New Delhi", x="Avg. Cost", y="Number of outlets") + theme(plot.title = element_text(size=10, face = "bold"))
p6 = ggplot(df_mumbai) + geom_bar(aes(Converted_Rates), color="green4", fill="green4") + labs(title="Avg.Money Spent - Mumbai", x="Avg. Cost", y="Number of outlets") + theme(plot.title = element_text(size=10, face = "bold"))
p7 = ggplot(df_chennai) + geom_bar(aes(Converted_Rates), color="green4", fill="green4") + labs(title="Avg.Money Spent - Chennai", x="Avg. Cost", y="Number of outlets") + theme(plot.title = element_text(size=10, face = "bold"))
p8 = ggplot(df_kolkata) + geom_bar(aes(Converted_Rates), color="green4", fill="green4") + labs(title="Avg.Money Spent - Kolkata", x="Avg. Cost", y="Number of outlets") + theme(plot.title = element_text(size=10, face = "bold"))
grid.arrange(p5, p6, p7, p8, nrow=4)
A visualization of how many restaurants provide home delivery and table reservation options to their diners.
Which one is preferred more in India and outside India? Comparatively, Indian diners like food delivered at the doorstep, whereas diners from the rest of the world like a table booked to avoid last-minute hassles.
# Plots to compare count of restaurants providing online delivery in India versus restaurants worldwide
p9 = ggplot(df_india) + geom_bar(aes(Has.Online.delivery, fill=Has.Online.delivery)) + labs(title="Online Delivery Option - India", x="Has Online Delivery?", y="Number of outlets") + theme(plot.title = element_text(size=10, face = "bold"), legend.position = "none")
p10 = ggplot(df_other) + geom_bar(aes(Has.Online.delivery, fill=Has.Online.delivery)) + labs(title="Online Delivery Option - Other Countries", x="Has Online Delivery?", y="Number of outlets") + theme(plot.title = element_text(size=10, face = "bold"), legend.position = "none")
grid.arrange(p9, p10, nrow=2, ncol=2)
# Plots to compare Table reservation counts in restaurants in India versus restaurants worldwide
p45 = ggplot(df_india) + geom_bar(aes(Has.Table.booking, fill=Has.Table.booking)) + labs(title="Table Reservation - India", x="Has Table Booking?", y="Number of outlets") + theme(plot.title = element_text(size=10, face = "bold"), legend.position = "none")
p46 = ggplot(df_other) + geom_bar(aes(Has.Table.booking, fill=Has.Table.booking)) + labs(title="Table Reservation - Other Countries", x="Has Table Booking?", y="Number of outlets") + theme(plot.title = element_text(size=10, face = "bold"), legend.position = "none")
grid.arrange(p45, p46, nrow=2, ncol=2)
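Raw counts can be hard to compare when the India and non-India subsets differ in size, so it may help to look at shares within each group as well. The sketch below is only an illustration: the data frame `toy` and its values are made up, the `Region` column is a hypothetical helper, and the other column names mirror those used in the plots above.

```r
# Made-up rows standing in for the restaurant data
toy <- data.frame(
  Country.Name = c("India", "India", "India", "India",
                   "United States", "United States", "UAE", "Brazil"),
  Has.Online.delivery = c("Yes", "Yes", "No", "No", "No", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Group restaurants into India versus all other countries
toy$Region <- ifelse(toy$Country.Name == "India", "India", "Other")

# Row-wise proportions: the share of Yes/No within each region
delivery_share <- prop.table(table(toy$Region, toy$Has.Online.delivery), margin = 1)
print(round(delivery_share, 2))
```

The same call with `Has.Table.booking` in place of `Has.Online.delivery` would give the corresponding shares for table reservations.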
The REST API provided by Zomato is a GET call that takes a single city and returns the corresponding results. Since it runs once per city and we have 141 unique cities, it takes some time to get the country names corresponding to the country codes on the first load. In subsequent updates it will only run for newly added restaurants/cities. However, we are still trying to improve the performance of this REST call.
We will be adding geomaps based on the latitude and longitude information provided for each restaurant.
We will separate the comma-separated values in the cuisine column and spread them into individual columns to count their frequency and determine the favorite cuisine based on the location.
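The cuisine step could look roughly like the sketch below. It is an assumption about one way to do it rather than the final implementation: the data frame `toy` is made up, and instead of spreading cuisines into separate columns it uses a long format (one row per restaurant and cuisine), which makes the counting simpler in base R.

```r
# Made-up rows; City and Cuisines mirror the column names in the dataset
toy <- data.frame(
  City = c("New Delhi", "New Delhi", "Mumbai", "Mumbai"),
  Cuisines = c("North Indian, Chinese", "North Indian, Mughlai",
               "Chinese", "Seafood, Chinese"),
  stringsAsFactors = FALSE
)

# Split each comma-separated string, trimming whitespace around the commas
cuisine_list <- strsplit(toy$Cuisines, "\\s*,\\s*")

# Long format: repeat the city once for every cuisine the restaurant serves
long <- data.frame(
  City = rep(toy$City, lengths(cuisine_list)),
  Cuisine = unlist(cuisine_list),
  stringsAsFactors = FALSE
)

# Count how often each cuisine occurs in each city
counts <- as.data.frame(table(City = long$City, Cuisine = long$Cuisine),
                        stringsAsFactors = FALSE)

# Keep the most frequent cuisine per city (first one wins on ties)
top <- do.call(rbind, lapply(split(counts, counts$City),
                             function(d) d[which.max(d$Freq), ]))
print(top)
```

On the toy rows this picks North Indian for New Delhi and Chinese for Mumbai. Ties are resolved by taking the first cuisine in alphabetical order, which may or may not be the desired behavior for the real data.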