HU Minor: ML Predictive Analysis for Zomato Restaurants
Phase 1: Challenge Proposal
Domain Understanding
Introduction
I have always been intrigued by Bengaluru’s culinary scene. Bengaluru (Bangalore) is home to restaurants from all across the world, offering a wide variety of cuisines and delicacies. Whatever you want, Bengaluru has it: delivery, dine-in, taverns, cafés, cocktails, buffets, and sweets. For gastronomy lovers and travellers, Bengaluru is the best city to visit.
Why must I do the challenge?
Bengaluru is home to an expanding number of restaurants. At the moment, there are about 12,000 establishments. Even with so many eateries, the sector has not yet reached saturation, and new food vendors are opening day after day. However, new establishments find it challenging to compete with those that have already achieved success. As Bengaluru is the heart of India’s information technology industry, its residents rely primarily on casual dining because they lack the time to prepare meals at home. Because demand for eateries is so high, it is crucial for me to understand the local demographics.
What is it exactly that I am going to do?
To address the aforementioned issues, I intend to use my predictive analysis to provide valuable insights into the variables that influence a new establishment’s success, and to predict ratings for new start-up businesses in Bengaluru.
These insights will help potential investors and restaurant owners find out whether a start-up business would be successful and sustainable in their market (the catering industry).
Data Sourcing
Zomato is one of the most popular applications in Bengaluru for ordering meals,
finding eateries, and reading and writing reviews.
It is the one and only source from which
I can obtain further information about the overall rating of
each restaurant in Bengaluru.
Enriching Data
With the aid of restaurant demographics and the information on
independent variables associated with restaurant ambience, food and
service quality offered on Zomato, it is
feasible to predict a rating for a new start-up
business. To accomplish that, a conceptual
framework will be constructed, which will help me
define the dependent variable for this predictive
analysis.
The Zomato Dataset does not
need to be integrated with other sources or further enriched, for two
reasons. First, the dataset is large and has
enough data in total to reveal potential correlations and
patterns that a machine learning model can learn during
modelling. Second, the dataset has sufficient variables
to support my conceptual framework as displayed above.
Additional information and a description of the data will be included in the Data Understanding Section.
Interview: Guest Behaviour & Experience
The conceptual framework’s moderator variable illustrates how changes in guest behaviour and experience affect the relationship between an independent variable and a dependent variable. For instance, a guest may already be frustrated when they visit a restaurant, or they may have received the best service and yet give a poor rating.
It can be challenging to find reliable information
on guest behaviour and experience, or even to
conduct this type of research online. Therefore,
interviewing someone who has dined at a
Zomato restaurant is an
ideal approach, since they can provide
considerably more in-depth and reliable
information.
Interview Insights:
Particularly in Bengaluru, where there is fierce competition in the food market, Zomato serves as a platform for all of these businesses, enabling them to draw in more customers and thereby increase their revenue. As a result, these restaurants would prefer to collaborate with Zomato, since it brings them into the city’s spotlight and boosts their profitability and overall rating.
When people in Bengaluru open the Zomato application on their phone, they will probably notice the restaurants that currently offer special deals and large discounts at the top of the application dashboard. Guests therefore visit those restaurants the most, since their offers meet guests’ needs, whether those relate to price or to customer satisfaction. Those restaurants stand out from the competition.
The majority of their customers are attracted, and kept loyal, by the sales and offers available on the application.
Exploratory Research: Zomato Business Model
In addition, I have performed exploratory research
on Zomato to learn more about how
the firm operates and how it maintains a strategic
advantage.
There are three primary stakeholders in the Zomato Company:
- 1. Restaurant
- 2. Customer
- 3. Delivery Partner
Zomato has over 35 million
active users, 15 million registered
establishments, and more than 165,000 delivery
partners.
Zomato earns over 72% of its
revenue from advertising and delivery commissions. It
collects fees from restaurants in exchange for promoting
their offers. In particular, when a
restaurant pays advertising fees to
Zomato, Zomato runs that restaurant's offers and
places those with the highest fees at the top of its offer
list.
1. Restaurant: Restaurants that have
registered with Zomato are those that
wish to have their meals or dishes delivered to
customers. A restaurant benefits more from
a delivery order, for which it only prepares, packs, and delivers the food,
than from dine-in customers, who occupy a table, must be attended to
by waiters, and use cutlery that has to be cleaned.
Taking all the expenses and labour into account, having customers order
online is more advantageous for the restaurant.
2. Customer: The customer's goal is to get food
delivered to their home. Zomato attracts
most of its customers through its application: a
customer who buys the same dish directly from the restaurant, or
elsewhere online, does not get the price that
Zomato offers.
3. Delivery Partner:
Zomato gives its delivery partners a higher
payment, considering that their primary responsibility
is delivering meals to customers’ homes.
Zomato strives to attract additional delivery
partners by offering better compensation and stable
employment.
Analytic Approach
How am I going to do the challenge?
Now that I have identified the right data to use, I must follow a structured approach, consisting of the following steps, in order to make predictions with greater precision:
- Step 1: Collecting the relevant data to my target of predictive analysis.
- Step 2: Organizing the relevant data into a single dataset.
- Step 3: Cleaning the relevant data to prevent an inaccurate model and misleading prediction.
- Step 4: Utilizing exploratory data analysis to gain knowledge about the data.
- Step 5: Establishing useful variables to comprehend the records.
- Step 6: Choosing an appropriate machine learning algorithm.
- Step 7: Constructing and employing a successful model.
When am I going to do the challenge?
I plan to accomplish the challenge within the three blocks of my Minor Program in Big Data & Design. Although this challenge hasn’t been officially assigned to me, I intend to pursue it at my own pace. By the end of this semester, I aim to showcase the prediction analysis I will have conducted. To successfully complete the challenge, I will regularly submit my documentation to my Major studies at Hogeschool Fontys, ensuring I constantly receive feedback and make appropriate improvements.
Who is responsible for doing the Challenge?
For this predictive analysis report, I am solely
accountable for the challenge of developing a
successful machine learning model that
predicts the success of a new establishment, using the
restaurant demographics and the information on independent variables
associated with restaurant ambience, food and service quality
provided in the Zomato Dataset.
Should I do the challenge or not?
I am confident that I am capable of completing the challenge
by making use of the various resources available from both my Minor
Program
(Data Science & AI Library from Canvas)
and my Major
(AI Project Methodology from Canvas).
These resources, along with the guidelines I have previously laid out in
this challenge proposal, will support my efforts to achieve success.
Summary
1. This predictive analysis intends to offer significant insights into the variables that influence the success of a new establishment and predict the rating for new start-up businesses.
2. Main Research Question: Predict the success rate of a
new restaurant based on different features like
(online_order, book_table, votes,
location, rest_type, cuisines,
two_people_cost and
meal_type).
3. Target Variable: The feature rate is my
target variable that contains the overall star rating out of
five.
4. Machine Learning Model: The nature of the target
variable rate is suitable for Regression
Analysis.
Phase 2: Data Provisioning
Data Requirements
Most of the data requirements have already been satisfied for a predictive model to make an accurate prediction.
- I have already confirmed that I am completely aware of the domains for which the data will be retrieved.
I have obtained the Zomato Dataset from
the Kaggle Online Machine Learning Repository
Platform where I downloaded the data folder including the
Zomato.data dataset and its
dictionary_data description for more
explanation of the dataset.
- I have already compiled and analyzed a list of the
significant stakeholders whose establishments will benefit from
this projection based on the available information on
Zomato dataset.
This projection will be helpful for potential investors and restaurant (owners) to find out whether a start-up business would be successful and sustainable in their target market.
I have already gathered all the potential attributes (for relevant tables) but I will determine the data types they might possess in the upcoming sections.
I have already collected all the potential attributes (for relevant tables) but I will establish relationships between the attributes of a dataframe in the upcoming sections.
Data Collection
Load Libraries & Packages
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import warnings
# Pip: Package Installer for Python (PIP) is a program
# that helped me to install and manage python software packages.
#!pip install missingno
import missingno as msno
# I will use display.max_columns to adjust the maximum number of displayed columns.
pd.set_option('display.max_columns', 20)
# Suppress all Python warnings, including those raised by machine learning libraries.
warnings.filterwarnings("ignore")
Data Understanding
Zomato is an Indian multinational restaurant aggregator and food
delivery company that was founded in 2008 by Deepinder
Goyal and Pankaj Chaddah. Zomato provides
information about restaurants, their menus and customer
ratings. In Bangalore, it also provides several options for
food delivery from partner restaurants. As of 2019,
the service is accessible in more than 10,000
cities across 24 different countries.
Loading Data
I have the Zomato.data dataset saved in
the same directory as my (Python) notebook. This makes it
easier to specify the file path: as
you can see in the code itself, I did not give the full location of
the file, only its name.
# The encoding argument tells pandas how to decode the file's bytes into characters.
# If the file is not valid UTF-8 (the default), omitting it may raise a UnicodeDecodeError.
df = pd.read_csv("Zomato.data.csv", encoding = "latin-1")
# Obtain a glimpse of the data by selecting three sample rows from the DataFrame.
df_sample = df.loc[24830:24832]
| | url | address | name | online_order | book_table | rate | votes | phone | location | rest_type | dish_liked | cuisines | approx_cost(for two people) | reviews_list | menu_item | listed_in(type) | listed_in(city) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24830 | https://www.zomato.com/bangalore/nine2twelve-kalyan-nagar-bangalore?context=eyJzZSI6eyJlIjpbIjU0NzM5IiwiNTYxOTQiLCI1NzEwMSIsIjE4Nzg2NTcyIiwxODk2ODA5OCwiMTg5NjczNzEiLCIxODQyNjM1NiIsIjUwNzEzIiwiMTg5Mjk0NDQiLCIxODM2NjY1NCIsIjU4ODQ3IiwiNTY1NTUiLCIxODU5MzM3NyIsIjE4NTM3ODE2IiwiNTc2NDMiLCIxODQ3NjgyMSJdLCJ0IjoiRGluZS1PdXQgUmVzdGF1cmFudHMgaW4gS2FseWFuIE5hZ2FyIn19 | Flat 302, 403, 2nd Main, Kasturinagar, East Of NGEF Layout Kalyan Nagar, Bangalore | Nine2twelve | No | No | NEW | 0 | +91 9606166379 | Kalyan Nagar | Quick Bites | nan | South Indian | 300.000000 | [] | [] | Dine-out | Kalyan Nagar |
| 24831 | https://www.zomato.com/bangalore/new-taj-biryani-centre-kalyan-nagar-bangalore?context=eyJzZSI6eyJlIjpbIjU2MTk0IiwiNTcxMDEiLCIxODc4NjU3MiIsIjE4OTY4MDk4IiwxODk2NzM3MSwiMTg0MjYzNTYiLCI1MDcxMyIsIjE4OTI5NDQ0IiwiMTgzNjY2NTQiLCI1ODg0NyIsIjU2NTU1IiwiMTg1OTMzNzciLCIxODUzNzgxNiIsIjU3NjQzIiwiMTg0NzY4MjEiXSwidCI6IkRpbmUtT3V0IFJlc3RhdXJhbnRzIGluIEthbHlhbiBOYWdhciJ9fQ== | IB Road, Lorry Stand, Kushaal Nagar, Ward 10, Kalyan Nagar, Bangalore | New Taj Biryani Centre | No | No | NEW | 0 | +91 8979052325 | Kalyan Nagar | Quick Bites | nan | Biryani | 300.000000 | [] | [] | Dine-out | Kalyan Nagar |
| 24832 | https://www.zomato.com/bangalore/ss-bucket-biryani-kammanahalli?context=eyJzZSI6eyJlIjpbIjU3MTAxIiwiMTg3ODY1NzIiLCIxODk2ODA5OCIsIjE4OTY3MzcxIiwxODQyNjM1NiwiNTA3MTMiLCIxODkyOTQ0NCIsIjE4MzY2NjU0IiwiNTg4NDciLCI1NjU1NSIsIjE4NTkzMzc3IiwiMTg1Mzc4MTYiLCI1NzY0MyIsIjE4NDc2ODIxIiwiMTg2MTYwMDMiXSwidCI6IkRpbmUtT3V0IFJlc3RhdXJhbnRzIGluIEthbHlhbiBOYWdhciJ9fQ== | 15, 5th Main Road, KEB Road, Near Kullappa Circle, HRBR Layout, Kammanahalli, Bangalore | SS Bucket Biryani | No | No | 4.0/5 | 161 | +91 9886974444 | Kammanahalli | Casual Dining | Brinjal Curry, Basmati Rice, Mutton Biryani | Biryani, North Indian, Chinese | 600.000000 | [('Rated 3.0', 'RATED\n Visited this place today in the afternoon around 2 PM. We wanted chicken biriyani and it was not available. The only option was mutton biriyani. So we ordered that. Even mutton biriyani was not available here and they have sent their delivery boys to get it from OMBR layout branch, so we had to wait until he got it.\n\nComing to the biryani it was good. There is nothing extra ordinary about the taste but it was quite OK.\nWe ordered double pack and it was more than enough for 4 people. The raita given was tasty but the quantity was very less.\n\nOverall this is a budget friendly place and if you are a biriyani lover do try this out once.\n\nFood - 3.5\nAmbiance - 3\nValue for money - 4'), ('Rated 3.0', 'RATED\n Really surprised to find biryani in a bucket, but it was really great. They give a large quantity of biryani and kebabs for a very reasonable price.')] | [] | Dine-out | Kalyan Nagar |
The main objective of analyzing the
Zomato Dataset is to gain a clear
understanding of the variables influencing the establishment of various
types of dining establishments in numerous places throughout
Bengaluru, as well as the overall rating of each
restaurant. Bengaluru is one such metropolitan
area, with more than 12,000 eateries serving
cuisine from across the world.
## url object
## address object
## name object
## online_order object
## book_table object
## rate object
## votes int64
## phone object
## location object
## rest_type object
## dish_liked object
## cuisines object
## approx_cost(for two people) float64
## reviews_list object
## menu_item object
## listed_in(type) object
## listed_in(city) object
## dtype: object
In the Zomato Dataset recognizing and
comprehending the significance of each data type is
essential. Depending on the data types, a particular
analysis will be necessary.
These data types will similarly guarantee that data is gathered in the preferred format and that the values of each feature are as anticipated.
Understanding the different data types can also assist me in deciding how they might be combined before feeding them into a machine-learning model.
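As a minimal sketch of how the data types can guide the analysis, the snippet below separates numeric columns from text columns with `select_dtypes()`. The toy DataFrame and its values are hypothetical; only the column names are taken from the Zomato Dataset.

```python
import pandas as pd

# Hypothetical toy frame that mirrors a few Zomato columns and their dtypes.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "rate": ["4.1/5", "NEW", "3.8 /5"],  # stored as text, as in the raw data
    "votes": [775, 0, 918],
    "approx_cost(for two people)": [800.0, 300.0, 650.0],
})

# Numeric columns can be used in calculations directly;
# text columns need cleaning or encoding first.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("text:", text_cols)
```

Note that rate lands in the text group here, which signals that it must be converted before it can serve as a regression target.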
Data Characteristic & Description
# Quickly checks for the names of the columns, the data types that it contains and
# Whether they have any odd or striking missing data.
df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 51717 entries, 0 to 51716
## Data columns (total 17 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 url 51717 non-null object
## 1 address 51717 non-null object
## 2 name 51717 non-null object
## 3 online_order 51717 non-null object
## 4 book_table 51717 non-null object
## 5 rate 43942 non-null object
## 6 votes 51717 non-null int64
## 7 phone 50509 non-null object
## 8 location 51696 non-null object
## 9 rest_type 51490 non-null object
## 10 dish_liked 23639 non-null object
## 11 cuisines 51672 non-null object
## 12 approx_cost(for two people) 51371 non-null float64
## 13 reviews_list 51717 non-null object
## 14 menu_item 51717 non-null object
## 15 listed_in(type) 51717 non-null object
## 16 listed_in(city) 51717 non-null object
## dtypes: float64(1), int64(1), object(15)
## memory usage: 6.7+ MB
- url: includes a link to the restaurant’s page on the Zomato website.
- address: contains the restaurant’s Bengaluru address.
- name: contains the restaurant’s name.
- online_order: whether the restaurant offers online ordering.
- book_table: whether the restaurant offers table booking.
- rate: contains the restaurant’s overall rating out of five.
- votes: contains the total number of ratings the restaurant had received as of the specified date.
- phone: includes the restaurant’s phone number.
- location: contains information on the area where the restaurant is situated.
- rest_type: type of restaurant.
- dish_liked: meals that guests liked in the establishment.
- cuisines: comma-separated list of food genres.
- approx_cost(for two people): includes an estimate of the cost of a meal for two people.
- reviews_list: a collection of tuples with restaurant reviews; each tuple has two values: the customer’s rating and the review.
- menu_item: contains a list of the menu options for the restaurant.
- listed_in(type): type of meal.
- listed_in(city): the city neighbourhood in which the restaurant is listed.
## (51717, 17)
The Zomato Dataset has 51717
observations and 17 columns, with each row including details about a
specific restaurant in Bangalore. There is one
target variable, rate, and 16
features (restaurant attributes).
- Both the isna() and isnull() functions are used to find the missing values in the pandas dataframe. (Aruchamy, 2022)
## url 0
## address 0
## name 0
## online_order 0
## book_table 0
## rate 7775
## votes 0
## phone 1208
## location 21
## rest_type 227
## dish_liked 28078
## cuisines 45
## approx_cost(for two people) 346
## reviews_list 0
## menu_item 0
## listed_in(type) 0
## listed_in(city) 0
## dtype: int64
- Although the two lines of code use different function names, both provide the same results. I will list both of them here for academic purposes.
## url 0
## address 0
## name 0
## online_order 0
## book_table 0
## rate 7775
## votes 0
## phone 1208
## location 21
## rest_type 227
## dish_liked 28078
## cuisines 45
## approx_cost(for two people) 346
## reviews_list 0
## menu_item 0
## listed_in(type) 0
## listed_in(city) 0
## dtype: int64
In the Zomato Dataset there are
missing values in the columns:
- rate: This column has 7775 missing values.
- phone: This column has 1208 missing values.
- location: This column has 21 missing values.
- rest_type: This column has 227 missing values.
- dish_liked: This column has 28078 missing values.
- cuisines: This column has 45 missing values.
- approx_cost(for two people): This column has 346 missing values.
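The per-column counts above can be reproduced with either function. The sketch below uses a hypothetical toy DataFrame rather than the real file, and shows that `isna()` and `isnull()` (which pandas documents as aliases) return identical counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries.
toy = pd.DataFrame({
    "rate": ["4.1/5", np.nan, "3.8/5", np.nan],
    "location": ["Banashankari", "Whitefield", None, "Whitefield"],
    "votes": [775, 0, 918, 12],
})

# Count the missing values per column with both functions.
missing_isna = toy.isna().sum()
missing_isnull = toy.isnull().sum()

print(missing_isna)
print("identical results:", missing_isna.equals(missing_isnull))
```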
Data Preparation
Now that the Zomato Dataset has been
successfully imported and its essential
information is clearly understood, it is
time to begin the Data Preparation Process.
This entails eliminating meaningless columns, assigning columns
meaningful names, inspecting and eliminating duplicate & missing
values, and ultimately performing a small exploratory data
analysis to make certain that there are no evident
discrepancies, to identify patterns and
to discover relationships between the
features.
These transformations and preparations are required to ensure that
the Zomato Dataset is properly prepared
for the forthcoming Phase 3 Prediction.
Removal of Specific Columns
# Drop any specified column within the dataset.
df.drop(["address", "dish_liked", "menu_item", "phone", "reviews_list", "url"], axis = 1, inplace = True)
The Zomato Dataset contained
several features, like address, phone
and url, that were irrelevant to my
predictive analysis. They have
no bearing on my target variable
rate or on the challenge that my modelling
approach is intended to tackle. By dropping these
irrelevant columns, I can prevent the algorithm from picking
up misleading associations and reduce the risk of
overfitting.
Moreover, the Zomato Dataset contained
several other features, like menu_item and
reviews_list, that were redundant for my
predictive analysis. These
features share much of their information with other features,
like cuisines and rate, so they can
be safely dropped without losing
information.
Finally, the Zomato Dataset contained
one important feature called dish_liked that I
also treat as redundant. Because of the numerous issues
this feature caused during training, and would cause in
real-life settings when retraining is expected, I decided to
keep only the most useful
features and drop the others.
Meaningful Column Names
# Rename any specified column within the dataset.
df.rename(columns={"approx_cost(for two people)": "two_people_cost",
"listed_in(type)": "meal_type",
"listed_in(city)": "city_neighborhood"}, inplace = True)
df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 51717 entries, 0 to 51716
## Data columns (total 11 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 name 51717 non-null object
## 1 online_order 51717 non-null object
## 2 book_table 51717 non-null object
## 3 rate 43942 non-null object
## 4 votes 51717 non-null int64
## 5 location 51696 non-null object
## 6 rest_type 51490 non-null object
## 7 cuisines 51672 non-null object
## 8 two_people_cost 51371 non-null float64
## 9 meal_type 51717 non-null object
## 10 city_neighborhood 51717 non-null object
## dtypes: float64(1), int64(1), object(9)
## memory usage: 4.3+ MB
The Zomato Dataset must have
distinct, accessible, and meaningful names in order for
me as a user to recognize and distinguish the data
columns on a DataFrame.
Inspection & Removal of Duplicate Values
# Flag every row that is identical to another row across all columns (keep=False marks all copies),
# then count how many rows are flagged (True) and how many are not (False).
print(df.duplicated(keep=False).value_counts(ascending = True))
## True 215
## False 51502
## Name: count, dtype: int64
There are 215 rows that have at least one identical copy in the dataset, and 51502 rows that are unique.
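How the count is read depends on the `keep` argument of `duplicated()`. The sketch below, on a hypothetical toy DataFrame, contrasts `keep=False` (every member of a duplicate group is flagged) with the default `keep="first"` (only the extra copies are flagged).

```python
import pandas as pd

# Hypothetical toy frame: one pair and one triple of identical rows.
toy = pd.DataFrame({
    "name": ["My Tea House", "My Tea House", "Shiv Sagar",
             "House Of Candy", "House Of Candy", "House Of Candy"],
    "votes": [0, 0, 10, 0, 0, 0],
})

# keep=False flags all rows that have an identical copy: 2 + 3 rows.
all_members = int(toy.duplicated(keep=False).sum())

# The default keep="first" flags only the surplus copies: 1 + 2 rows.
extra_copies = int(toy.duplicated().sum())

print(all_members, extra_copies)
```

The second number is the one that matters for cleaning, because it equals the number of rows that a de-duplication step would remove.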
# Print every row that is identical to another row across all columns.
print(df[df.duplicated(keep = False)])
## name online_order book_table rate votes \
## 553 My Tea House Yes Yes NEW 0
## 596 My Tea House Yes Yes NEW 0
## 2195 Shiv Sagar No No 3.6/5 10
## 2235 Shiv Sagar No No 3.6/5 10
## 3747 The Fisherman's Wharf Yes Yes 4.4/5 4099
## ... ... ... ... ... ...
## 50366 House Of Candy Yes No NaN 0
## 50379 House Of Candy Yes No NaN 0
## 50405 House Of Candy Yes No NaN 0
## 50900 Nawab Di Biryani Yes No NEW 0
## 50904 Nawab Di Biryani Yes No NEW 0
##
## location rest_type \
## 553 Banashankari Casual Dining
## 596 Banashankari Casual Dining
## 2195 Bannerghatta Road Food Court
## 2235 Bannerghatta Road Food Court
## 3747 Sarjapur Road Casual Dining, Bar
## ... ... ...
## 50366 Whitefield Confectionery
## 50379 Whitefield Confectionery
## 50405 Whitefield Confectionery
## 50900 Whitefield Takeaway, Delivery
## 50904 Whitefield Takeaway, Delivery
##
## cuisines two_people_cost \
## 553 Continental, Asian, North Indian, Tea 500.0
## 596 Continental, Asian, North Indian, Tea 500.0
## 2195 South Indian, Beverages 400.0
## 2235 South Indian, Beverages 400.0
## 3747 Seafood, Goan, North Indian, Continental, Asian 1400.0
## ... ... ...
## 50366 Desserts 200.0
## 50379 Desserts 200.0
## 50405 Desserts 200.0
## 50900 Biryani, Mughlai 400.0
## 50904 Biryani, Mughlai 400.0
##
## meal_type city_neighborhood
## 553 Dine-out Banashankari
## 596 Dine-out Banashankari
## 2195 Dine-out Bannerghatta Road
## 2235 Dine-out Bannerghatta Road
## 3747 Buffet Bellandur
## ... ... ...
## 50366 Delivery Whitefield
## 50379 Delivery Whitefield
## 50405 Delivery Whitefield
## 50900 Delivery Whitefield
## 50904 Delivery Whitefield
##
## [215 rows x 11 columns]
The Zomato Dataset contains 215
duplicated rows that have the potential to corrupt the
training or test sets. Duplicates give some restaurants
extra weight during training, which can lead my
model to learn trends that do not occur in reality, and
the same record can end up in both the training and the test set,
which makes the evaluation look better than it is.
I will therefore drop all the duplicate rows.
- 1. To get rid of all the duplicate rows, I can either use the duplicated() function in combination with the ~ negation operator.
- 2. Or I can use the drop_duplicates() function, which also drops all the duplicate rows within the dataset.
Although the two lines of code differ in function and
form, both provide the same results. I will list
both of them here for academic purposes, but I will
only utilize the second, the drop_duplicates()
function.
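The two approaches can be sketched as follows. This is a minimal illustration on a hypothetical toy DataFrame, not the notebook's actual cell; the variable names `via_mask` and `via_method` are introduced only for this example.

```python
import pandas as pd

# Hypothetical toy frame with two identical rows.
toy = pd.DataFrame({
    "name": ["My Tea House", "My Tea House", "Shiv Sagar"],
    "rate": ["NEW", "NEW", "3.6/5"],
    "votes": [0, 0, 10],
})

# Option 1: build a boolean mask of repeated rows and negate it with ~.
via_mask = toy[~toy.duplicated()]

# Option 2: let pandas drop the repeated rows directly.
via_method = toy.drop_duplicates()

# Both keep the first occurrence of each row, so the results match.
print(via_mask.equals(via_method))
print(via_method)
```

Both options keep the original index labels, so the index is no longer contiguous afterwards.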
Every duplicate row should now have been eliminated, so I explicitly verify with Python code that they are indeed gone from the de-duplicated dataset.
# The duplicated() function flags each repeated row, and
# sum() counts how many rows are still flagged as duplicates.
df.duplicated().sum()
## 0
The Zomato Dataset no longer contains
any duplicate records.
Graphical Inspection of Missing Values
# The basic aesthetic of the plots will be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))
# The msno.matrix() function provides a visual summary of the completeness
# or absence of data in my Zomato Dataset.
msno.matrix(df, fontsize=15, color=(0.99, 0.76, 0.8))
plt.show()
From the aforementioned matrix, I can get more information about the data that is missing from each column, as well as the overall number of rows and columns.
# The basic aesthetic of the plots will then be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))
# The msno.heatmap() function provides a visual summary of how the presence
# or absence of one feature in my Zomato Dataset relates to the presence
# or absence of another feature (Null Correlation).
msno.heatmap(df, fontsize=15)
plt.show()
From the aforementioned heatmap visual, I can get more information about how the missing values in different columns correlate with one another.
Additionally, I can see that some variables and missing data
are related. For instance, there is a 70%
positive null correlation between the features
cuisines and location.
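A null correlation can also be computed numerically. The sketch below is an approximation of what the heatmap displays, under the assumption that it correlates the per-column missing-value flags; the toy DataFrame is hypothetical and was built so that the result lands near 0.7.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: location and cuisines are often missing together.
toy = pd.DataFrame({
    "location": ["A", np.nan, "C", np.nan, "E", "F"],
    "cuisines": ["x", np.nan, "y", np.nan, "z", np.nan],
    "votes": [1, 2, 3, 4, 5, 6],
})

# 1 where a value is missing, 0 where it is present.
null_flags = toy.isnull().astype(int)

# Columns without any variation in missingness carry no correlation information.
null_flags = null_flags.loc[:, null_flags.nunique() > 1]

# Pearson correlation between the missing-value flags of each column pair.
null_corr = null_flags.corr()
print(null_corr.round(2))
```

A value close to 1 means that when one column is missing, the other tends to be missing in the same row.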
Elimination of NA Values
## name 0
## online_order 0
## book_table 0
## rate 7755
## votes 0
## location 21
## rest_type 227
## cuisines 45
## two_people_cost 344
## meal_type 0
## city_neighborhood 0
## dtype: int64
In the Zomato Dataset there are now
missing values in the columns:
- rate: This column has 7755 missing values.
- location: This column has 21 missing values.
- rest_type: This column has 227 missing values.
- cuisines: This column has 45 missing values.
- two_people_cost: This column has 344 missing values.
# The two coding lines below provide the same results.
df.dropna(inplace = True)
# df.dropna(how = "any", inplace = True)
df.info()
## <class 'pandas.core.frame.DataFrame'>
## Index: 43447 entries, 0 to 51716
## Data columns (total 11 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 name 43447 non-null object
## 1 online_order 43447 non-null object
## 2 book_table 43447 non-null object
## 3 rate 43447 non-null object
## 4 votes 43447 non-null int64
## 5 location 43447 non-null object
## 6 rest_type 43447 non-null object
## 7 cuisines 43447 non-null object
## 8 two_people_cost 43447 non-null float64
## 9 meal_type 43447 non-null object
## 10 city_neighborhood 43447 non-null object
## dtypes: float64(1), int64(1), object(9)
## memory usage: 4.0+ MB
The Machine Learning Model that I want to use raises an error if I pass Null Values into it. The usual way to deal with those values is to impute them with the mean, median or mode or, in more challenging situations, to employ advanced imputation techniques such as MICE, chosen according to the type of missingness (MCAR/MAR/MNAR). However, imputed values are estimates rather than observations, and I expect that they would reduce the precision and dependability of my model, particularly because most of the missing values are in the target variable rate.
I therefore made a conscious decision to drop all the rows with Null Values so that the dataset:
- Has less misleading information, which
improves the precision of my machine learning
model.
- Has fewer incomplete records, which reduces the likelihood that predictions rely on noise or corrupted data.
- Has fewer rows, which speeds up model training.
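To make the trade-off concrete, the sketch below compares dropping incomplete rows with a simple median and mode imputation. It runs on a hypothetical toy DataFrame and is not part of the analysis itself; the imputation strategy shown is only one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in the target and in two features.
toy = pd.DataFrame({
    "rate": [4.1, np.nan, 3.8, 3.6],
    "two_people_cost": [800.0, 300.0, np.nan, 600.0],
    "rest_type": ["Casual Dining", "Quick Bites", None, "Quick Bites"],
})

# Approach used in this report: drop every row that has a missing value.
dropped = toy.dropna()

# Alternative: fill feature gaps with the median (numeric) or mode (categorical).
imputed = toy.copy()
imputed["two_people_cost"] = imputed["two_people_cost"].fillna(
    imputed["two_people_cost"].median())
imputed["rest_type"] = imputed["rest_type"].fillna(imputed["rest_type"].mode()[0])

# A missing target is not imputed here; those rows are removed in both approaches.
imputed = imputed.dropna(subset=["rate"])

print(len(dropped), "rows after dropna,", len(imputed), "rows after imputation")
```

Imputation keeps more rows, but the filled-in values are estimates, which is the concern raised above.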
Graphical Verification of Missing Values
# The basic aesthetic of the plots will then be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))
# The msno.matrix() function provides a visual summary of the completeness
# or absence of data in my Zomato Dataset.
msno.matrix(df, fontsize=15, color=(0.57, 0.36, 0.51))
plt.show()
I used msno.matrix() for
visualization to verify whether any missing
values remained. There is no white
space whatsoever in any of the bars. This indicates that
all of the rows with missing data have now been successfully
identified and deleted.
The Zomato Dataset has 43447
observations with each row including details
about a specific restaurant in Bangalore. There is
one target variable rate
and 10 features with restaurant attributes.
Data Transformations For Columns
# The unique() function looks for distinct values in an array
# and returns them in order of first appearance.
pd.unique(df[["rate"]].values.ravel("K"))
## array(['4.1/5', '3.8/5', '3.7/5', '3.6/5', '4.6/5', '4.0/5', '4.2/5',
## '3.9/5', '3.1/5', '3.0/5', '3.2/5', '3.3/5', '2.8/5', '4.4/5',
## '4.3/5', 'NEW', '2.9/5', '3.5/5', '2.6/5', '3.8 /5', '3.4/5',
## '4.5/5', '2.5/5', '2.7/5', '4.7/5', '2.4/5', '2.2/5', '2.3/5',
## '3.4 /5', '-', '3.6 /5', '4.8/5', '3.9 /5', '4.2 /5', '4.0 /5',
## '4.1 /5', '3.7 /5', '3.1 /5', '2.9 /5', '3.3 /5', '2.8 /5',
## '3.5 /5', '2.7 /5', '2.5 /5', '3.2 /5', '2.6 /5', '4.5 /5',
## '4.3 /5', '4.4 /5', '4.9/5', '2.1/5', '2.0/5', '1.8/5', '4.6 /5',
## '4.9 /5', '3.0 /5', '4.8 /5', '2.3 /5', '4.7 /5', '2.4 /5',
## '2.1 /5', '2.2 /5', '2.0 /5', '1.8 /5'], dtype=object)
# The values.ravel() function flattens the selected values into a one-dimensional array.
# The letter "K" means the elements are read in the order in which they occur in memory.
The rate column in the
Zomato Dataset marks several
restaurants with - because
they do not have any ratings yet.
There are also newly opened establishments for which
no guests have yet
posted a rating. These establishments
are marked with the term
NEW.
Due to these unexpected ratings presented in the array, some adjustment is required in order to generate a reliable dataset at the end, which can then be fed into a Machine Learning Model.
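One way to make this adjustment is to filter out the rows whose rating is a placeholder. The sketch below is an assumption about how this could be done, shown on a hypothetical toy DataFrame; the list name `placeholders` is introduced only for this example.

```python
import pandas as pd

# Hypothetical toy frame containing both placeholder values.
toy = pd.DataFrame({
    "name": ["Cafe A", "Cafe B", "Cafe C", "Cafe D"],
    "rate": ["4.1/5", "NEW", "-", "3.8 /5"],
})

# Ratings that carry no numeric information.
placeholders = ["NEW", "-"]

# Keep only the rows whose rating is not a placeholder.
toy = toy[~toy["rate"].isin(placeholders)]

print(toy)
```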
# The pd.unique() function in combination with
# values.ravel("K") returns the distinct values of the rate column.
pd.unique(df[["rate"]].values.ravel("K"))
## array(['4.1/5', '3.8/5', '3.7/5', '3.6/5', '4.6/5', '4.0/5', '4.2/5',
## '3.9/5', '3.1/5', '3.0/5', '3.2/5', '3.3/5', '2.8/5', '4.4/5',
## '4.3/5', '2.9/5', '3.5/5', '2.6/5', '3.8 /5', '3.4/5', '4.5/5',
## '2.5/5', '2.7/5', '4.7/5', '2.4/5', '2.2/5', '2.3/5', '3.4 /5',
## '3.6 /5', '4.8/5', '3.9 /5', '4.2 /5', '4.0 /5', '4.1 /5',
## '3.7 /5', '3.1 /5', '2.9 /5', '3.3 /5', '2.8 /5', '3.5 /5',
## '2.7 /5', '2.5 /5', '3.2 /5', '2.6 /5', '4.5 /5', '4.3 /5',
## '4.4 /5', '4.9/5', '2.1/5', '2.0/5', '1.8/5', '4.6 /5', '4.9 /5',
## '3.0 /5', '4.8 /5', '2.3 /5', '4.7 /5', '2.4 /5', '2.1 /5',
## '2.2 /5', '2.0 /5', '1.8 /5'], dtype=object)
The way restaurant ratings are stored in the
Zomato Dataset (as strings such as '4.1/5') is
unsuitable for regression models. Because of this, I must
remove the '/5' suffix and any surrounding white space
from the field and only keep the
numeric rate values.
# The str.replace() function removes the "/5" suffix
# so that only the numeric part of each rating remains.
df["rate"] = df["rate"].str.replace('/5','')
# The str.strip() function removes leading and trailing white spaces
# and other invisible characters.
# The astype() function then converts the entire
# column to the indicated data type (float).
df["rate"] = df["rate"].str.strip().astype("float")
- I now have to verify that I got rid of all the leading and trailing
spaces and character-string-related objects in the
rate column.
## array([4.1, 3.8, 3.7, 3.6, 4.6, 4. , 4.2, 3.9, 3.1, 3. , 3.2, 3.3, 2.8,
## 4.4, 4.3, 2.9, 3.5, 2.6, 3.4, 4.5, 2.5, 2.7, 4.7, 2.4, 2.2, 2.3,
## 4.8, 4.9, 2.1, 2. , 1.8])
The Zomato Dataset's column
rate no longer contains any
misleading or inappropriate values.
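The conversion above can be tried in isolation. A minimal sketch on a made-up series (`raw` and its values are hypothetical), which passes regex=False so that "/5" is treated as a literal string regardless of the pandas version:

```python
import pandas as pd

# Hypothetical rating strings, with and without a space before "/5".
raw = pd.Series(["4.1/5", "3.8 /5", "4.0/5", "2.9 /5"], name="rate")

parsed = (
    raw.str.replace("/5", "", regex=False)  # drop the literal "/5" suffix
       .str.strip()                         # remove leftover white space
       .astype("float")                     # convert the strings to floats
)

print(parsed.unique())
```

If any placeholder such as NEW were still present, astype("float") would raise a ValueError, which makes this step a useful check that the earlier filtering was complete.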
Addition of Row Number
# The insert() function adds a new column to a DataFrame at a specific position.
# The np.arange() function returns evenly spaced values within a specified range.
# The len() function returns the number of rows in the DataFrame.
df.insert(loc=0, column="row_num", value = np.arange(len(df)))
# The above code numbers the rows from 0 and places the result as the first column.
df_tail = df.tail(5)
| | row_num | name | online_order | book_table | rate | votes | location | rest_type | cuisines | two_people_cost | meal_type | city_neighborhood |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 51709 | 41185 | The Farm House Bar n Grill | No | No | 3.7 | 34 | Whitefield | Casual Dining, Bar | North Indian, Continental | 800.0 | Pubs and bars | Whitefield |
| 51711 | 41186 | Bhagini | No | No | 2.5 | 81 | Whitefield | Casual Dining, Bar | Andhra, South Indian, Chinese, North Indian | 800.0 | Pubs and bars | Whitefield |
| 51712 | 41187 | Best Brews - Four Points by Sheraton Bengaluru... | No | No | 3.6 | 27 | Whitefield | Bar | Continental | 1500.0 | Pubs and bars | Whitefield |
| 51715 | 41188 | Chime - Sheraton Grand Bengaluru Whitefield Hotel &... | No | Yes | 4.3 | 236 | ITPL Main Road, Whitefield | Bar | Finger Food | 2500.0 | Pubs and bars | Whitefield |
| 51716 | 41189 | The Nest - The Den Bengaluru | No | No | 3.4 | 13 | ITPL Main Road, Whitefield | Bar, Casual Dining | Finger Food, North Indian, Continental | 1500.0 | Pubs and bars | Whitefield |
I chose to add a distinct row number because it lets me refer to rows without getting lost or accidentally replacing data from other rows: after cleaning, the original index has gaps (for example, the last row has index 51716 but row number 41189). I have incorporated it for my own convenience and understanding.
Exploratory Data Analysis
According to my Conceptual Framework, there are many
features and factors that can influence what constitutes a good
eatery. For the purposes of my predictive
analysis, I assume that restaurants with a high rate
tend to be the good ones (Assumption). I will perform
initial investigations on the Zomato Dataset
to discover patterns and to check
assumptions with the help of tabular and graphical
analytics.
# Provide some summary statistics for the numerical columns.
df_describe = df.describe()
| | row_num | rate | votes | two_people_cost |
|---|---|---|---|---|
| count | 41190.00 | 41190.00 | 41190.00 | 41190.00 |
| mean | 20594.50 | 3.70 | 352.07 | 603.55 |
| std | 11890.67 | 0.44 | 883.46 | 464.65 |
| min | 0.00 | 1.80 | 0.00 | 40.00 |
| 25% | 10297.25 | 3.40 | 21.00 | 300.00 |
| 50% | 20594.50 | 3.70 | 73.00 | 500.00 |
| 75% | 30891.75 | 4.00 | 277.00 | 750.00 |
| max | 41189.00 | 4.90 | 16832.00 | 6000.00 |
An average rating of 3.7 out of 5 is quite respectable, especially considering that no restaurant in the dataset reaches a full 5.0 (the maximum is 4.9).
It is interesting that 75% of the restaurants charge no more than ₹750 for two people (about €4.35 per person), and this is worth examining more closely.
I am going to further examine the relation between the features and discover how it correlates to the success of a start-up business. In addition, I am going to dive into some of the features to determine what narrative these might have for me.
1. Bangalore’s Franchise Restaurants
- I will group by the names of the restaurants and retrieve statistics such as the mean rating, the total number of votes, and the average cost for two guests for the restaurant names with the most outlets.
# The as_index=False parameter of groupby()
# keeps the grouping column (name) as a regular column instead of using it as the index.
Franchise_Restaurant = df.groupby(by='name', as_index=False).agg({'rate': 'mean',
'votes': 'sum',
'two_people_cost': 'mean',
'row_num': 'count'})
# I renamed the aggregated columns to make them easier to understand.
Franchise_Restaurant.columns = ['Restaurant_Name', 'Average_Rating', 'Total_Votes', 'Average_Two_People_Cost', 'Total_Restaurants']
# I sorted the values in descending order to identify the ten names with the most outlets in Bangalore.
Franchise_Restaurant = Franchise_Restaurant.sort_values(by='Total_Restaurants', ascending=False)[:10]
# I reordered the columns for readability.
Franchise_Restaurant = Franchise_Restaurant.loc[:, ['Restaurant_Name', 'Total_Restaurants', 'Total_Votes', 'Average_Two_People_Cost', 'Average_Rating']]
# I round the numbers to two decimal places for readability.
Franchise_Restaurant = Franchise_Restaurant.round(decimals = 2)
| | Restaurant_Name | Total_Restaurants | Total_Votes | Average_Two_People_Cost | Average_Rating |
|---|---|---|---|---|---|
| 987 | Cafe Coffee Day | 86 | 3089 | 838.37 | 3.26 |
| 4191 | Onesta | 85 | 347520 | 600.00 | 4.41 |
| 1869 | Empire Restaurant | 69 | 229808 | 693.48 | 4.03 |
| 2978 | Kanti Sweets | 68 | 7336 | 400.00 | 3.90 |
| 1975 | Five Star Chicken | 68 | 3134 | 259.56 | 3.42 |
| 2848 | Just Bake | 67 | 2898 | 400.00 | 3.41 |
| 596 | Baskin Robbins | 62 | 2487 | 250.81 | 3.57 |
| 4393 | Pizza Hut | 60 | 20161 | 747.50 | 3.38 |
| 4371 | Petoo | 60 | 4242 | 675.83 | 3.83 |
| 2892 | KFC | 60 | 23495 | 422.50 | 3.65 |
2. Bangalore’s Independent Restaurants
- I will group by the names of the restaurants and retrieve statistics such as the mean rating, the total number of votes, and the average cost for two guests for restaurant names with the fewest outlets.
# The as_index=False parameter of groupby()
# keeps the grouping column (name) as a regular column instead of using it as the index.
Independent_Restaurant = df.groupby(by='name', as_index=False).agg({'rate': 'mean',
'votes': 'sum',
'two_people_cost': 'mean',
'row_num': 'count'})
# I renamed the aggregated columns to make them easier to understand.
Independent_Restaurant.columns = ['Restaurant_Name', 'Average_Rating', 'Total_Votes', 'Average_Two_People_Cost', 'Total_Restaurants']
# I sorted the values in ascending order to find ten restaurant names with only one outlet in Bangalore.
Independent_Restaurant = Independent_Restaurant.sort_values(by='Total_Restaurants', ascending=True)[:10]
# I reordered the columns for readability.
Independent_Restaurant = Independent_Restaurant.loc[:, ['Restaurant_Name', 'Total_Restaurants', 'Total_Votes',
                                                        'Average_Two_People_Cost', 'Average_Rating']]
| | Restaurant_Name | Total_Restaurants | Total_Votes | Average_Two_People_Cost | Average_Rating |
|---|---|---|---|---|---|
| 3520 | Mangalore Kitchen | 1 | 42 | 500.00 | 3.80 |
| 4067 | NightOwl | 1 | 31 | 400.00 | 2.70 |
| 5236 | South Grand | 1 | 102 | 400.00 | 3.40 |
| 5233 | Soup N Grill | 1 | 14 | 600.00 | 3.50 |
| 2556 | Hotel Shri Raghavendra | 1 | 6 | 150.00 | 3.00 |
| 5223 | SoopeRolls | 1 | 11 | 250.00 | 3.50 |
| 5222 | Soo Ra Sang | 1 | 290 | 1500.00 | 4.10 |
| 5217 | Soham Bombay Masti Magic | 1 | 16 | 150.00 | 3.60 |
| 2562 | Hotel Thalassery | 1 | 8 | 200.00 | 3.40 |
| 4062 | Night Fox | 1 | 22 | 500.00 | 3.20 |
I discovered from these two tables that a restaurant can either be an independent or a franchise.
# I will first output a set of counts of unique restaurant values.
Franchise_Restaurant = df['name'].value_counts(sort=True, dropna=False, ascending=False)[:10]
# The basic aesthetic of the plots will then be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))
# Finally, I will use a barplot to show the number of outlets (numerical) per restaurant name (categorical).
sns.barplot(x = Franchise_Restaurant, y = Franchise_Restaurant.index, palette= "gnuplot")
plt.title("Top Ten Restaurant Franchises in Bangalore", fontsize=20)
plt.xlabel("Count of Restaurant Name", fontsize=15)
plt.ylabel("Restaurant Name", fontsize=15)
plt.show()
For instance, Cafe Coffee Day may represent a franchise, where the owner grants operators permission to use the business’s name and model in exchange for royalties and support, whereas Mangalore Kitchen appears to be an independent with only one establishment by that name.
Based on the bar chart, I am not certain which business model would be more beneficial for a restaurant owner or potential investor. However, I can confirm that Bangalore has both franchised and independently owned restaurants, which is rather intriguing.
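The distinction between the two groups could also be made explicit per row. A minimal sketch, assuming that a name occurring more than once indicates a franchise (the frame `outlets_demo` and the column `is_franchise` are hypothetical and not part of the Zomato Dataset; a repeated name is only a proxy, since unrelated restaurants can share a name):

```python
import pandas as pd

# Toy stand-in for the name column; the rows are illustrative only.
outlets_demo = pd.DataFrame({
    "name": ["Cafe Coffee Day", "Cafe Coffee Day", "Mangalore Kitchen",
             "KFC", "KFC", "KFC"]
})

# Number of rows (outlets) that share each restaurant name,
# broadcast back to every row of the frame.
outlet_count = outlets_demo.groupby("name")["name"].transform("count")

# Hypothetical flag: True when the name occurs more than once.
outlets_demo["is_franchise"] = outlet_count > 1

print(outlets_demo)
```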
3. Bangalore Restaurant Cuisines
# I will first output a set of counts of unique restaurant cuisines values.
Restaurant_Cuisine_Type = df['cuisines'].value_counts()[:10]
# The basic aesthetic of the plots will then be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))
# Finally, I will make use of a barplot to demonstrate how a numerical and a category variable interact.
sns.barplot(x = Restaurant_Cuisine_Type.index, y = Restaurant_Cuisine_Type, palette= "inferno")
plt.title("Top Ten Bangalore Restaurant Cuisines", fontsize=20)
plt.xlabel("Restaurant Cuisine", fontsize=15)
plt.ylabel("Count of Cuisines", fontsize=15)
plt.xticks(rotation=90)
## ([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [Text(0, 0, 'North Indian'), Text(1, 0, 'North Indian, Chinese'), Text(2, 0, 'South Indian'), Text(3, 0, 'Cafe'), Text(4, 0, 'Bakery, Desserts'), Text(5, 0, 'Biryani'), Text(6, 0, 'South Indian, North Indian, Chinese'), Text(7, 0, 'Desserts'), Text(8, 0, 'Fast Food'), Text(9, 0, 'Chinese')])
The most common cuisines in Bengaluru are North Indian, Chinese and South Indian. I am surprised that the local cuisine (South Indian) does not rank first.
After some brief research, I found that South Indian food is considered healthier than North Indian food: the inherent flavors of the vegetables are preserved, and comparatively less fat is used in South Indian recipes. However, North Indian cuisine is simpler and quicker to cook in restaurants, and is more familiar to foreign visitors and Indians at large. As a result, guests tend to dine in or order online from restaurants that provide these cuisines. (Campbell, 2022)
Nevertheless, operating a South Indian food restaurant can be profitable for restaurant owners or potential investors who wish to establish their own business with minimum investment.
4. Bangalore’s Restaurant Online Ordering Service
# The basic aesthetic of the plots will be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))
# The data labels hold the two categories (Yes/No), most frequent first.
data_labels = df.online_order.value_counts().index
data_colors = ['lightskyblue','lightcoral']
# The explode setting offsets the second slice from the center of the pie.
data_explode = (0, 0.1)
# The slices are sized based on the counts in the relevant column.
sizes = df.online_order.value_counts().values
# autopct prints the percentage value on each slice.
# startangle=90 rotates the start of the pie by 90 degrees counterclockwise from the x-axis.
plt.pie(sizes, explode=data_explode, labels=data_labels, colors= data_colors, autopct='%1.1f%%', startangle=90, textprops={'fontsize': 15})
## ([<matplotlib.patches.Wedge object at 0x000002279FEAEF90>, <matplotlib.patches.Wedge object at 0x000002279FE7F250>], [Text(-0.9695170559757531, -0.5196505346596967, 'Yes'), Text(1.057654970155367, 0.5668914923560326, 'No')], [Text(-0.5288274850776834, -0.2834457461780164, '65.7%'), Text(0.6169653992572974, 0.33068670387435234, '34.3%')])
# A solid circle is created at its central position.
donut_graph = plt.Circle( (0,0), 0.4, color='white')
# A new donut chart is created by matching the provided parameters of the previous selected pie chart.
current_graph = plt.gcf()
current_graph.gca().add_artist(donut_graph)
plt.title("Bangalore's Restaurant Online Ordering Service", fontsize=20)
plt.xlabel("Online Orders", fontsize=15)
plt.show()
Around 34% of the eateries in Bangalore do not accept online orders, compared to around 66% that do. Let’s dive more into that…
# The basic aesthetic of the plots will be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))
# query() filters the DataFrame to the rows for which the expression is True.
# fill=True shades the area under the curve (it replaces the deprecated shade=True).
sns.kdeplot(df.query('online_order == "Yes" & rate > 0')['rate'],
            color='orange', label='Online Ordering Service Provided', fill=True)
sns.kdeplot(df.query('online_order == "No" & rate > 0')['rate'],
            color='blueviolet', label='Online Ordering Service NOT Provided', fill=True)
plt.title("Bangalore's Restaurants Rate Distribution by Online Ordering Service", fontsize=20)
plt.xlabel("Rate", fontsize=15)
plt.ylabel("Density", fontsize=15)
plt.legend(loc="upper right")
plt.show()
The rate distribution reveals that, although the difference is small, guests tend to give higher ratings to establishments that accept online orders. Potential investors and business owners may find this graph helpful when considering whether to add an online ordering system to their business. It appears that the majority of restaurants offer that service, and based on my research and interviews, I also have the impression that Bangaloreans enjoy placing orders online on a regular basis.
5. Bangalore’s Restaurant Table Booking Service
# The basic aesthetic of the plots will be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))
# The data labels hold the two categories (Yes/No), most frequent first.
labels = df.book_table.value_counts().index
colors = ['aquamarine','lightseagreen']
# The explode setting offsets the second slice from the center of the pie.
explode = (0, 0.1)
# The slices are sized based on the counts in the relevant column.
sizes = df.book_table.value_counts().values
# autopct prints the percentage value on each slice.
# startangle=90 rotates the start of the pie by 90 degrees counterclockwise from the x-axis.
plt.pie(sizes, explode=explode, labels=labels, colors= colors, autopct='%1.1f%%', startangle=90, textprops={'fontsize': 15})
## ([<matplotlib.patches.Wedge object at 0x000002279FF96F90>, <matplotlib.patches.Wedge object at 0x000002279FC95850>], [Text(-0.5065896566917578, -0.9764051002186168, 'No'), Text(0.5526432618455541, 1.0651692002384912, 'Yes')], [Text(-0.27632163092277695, -0.5325846001192455, '84.8%'), Text(0.32237523607657315, 0.6213487001391197, '15.2%')])
plt.title("Bangalore's Restaurant Table Booking Service", fontsize=20)
plt.xlabel("Table Booking", fontsize=15)
plt.show()
In Bangalore, around 85% of eateries don’t offer a way to reserve a table. It could be fascinating to examine this further…
# The basic aesthetic of the plots will be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))
# query() filters the DataFrame to the rows for which the expression is True.
# fill=True shades the area under the curve (it replaces the deprecated shade=True).
sns.kdeplot(df.query('book_table == "Yes" & two_people_cost > 0')['two_people_cost'],
            color='hotpink', label='Table Booking Service Provided', fill=True)
sns.kdeplot(df.query('book_table == "No" & two_people_cost > 0')['two_people_cost'],
            color='lightseagreen', label='Table Booking Service NOT Provided', fill=True)
plt.title("Bangalore's Restaurants Cost Distribution by Table Booking Service", fontsize=20)
plt.xlabel("Two People Cost", fontsize=15)
plt.ylabel("Density", fontsize=15)
plt.legend(loc="upper right")
plt.show()
In India, typically only upscale restaurants with enough floor area allow guests to reserve a table, whereas smaller eateries do not. The pink curve in the graph confirms that restaurants offering table booking are considerably more expensive.
# The basic aesthetic of the plots will be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))
# query() filters the DataFrame to the rows for which the expression is True.
# fill=True shades the area under the curve (it replaces the deprecated shade=True).
sns.kdeplot(df.query('book_table == "Yes" & rate > 0')['rate'],
            color='royalblue', label='Table Booking Service Provided', fill=True)
sns.kdeplot(df.query('book_table == "No" & rate > 0')['rate'],
            color='gold', label='Table Booking Service NOT Provided', fill=True)
plt.title("Bangalore's Restaurants Rate Distribution by Table Booking Service", fontsize=20)
plt.xlabel("Rate", fontsize=15)
plt.ylabel("Density", fontsize=15)
plt.legend(loc="upper right")
plt.show()
The above figure suggests that the option to reserve a table matters to restaurant guests: establishments with a table booking service tend to get higher ratings. Potential investors and business owners may find this graph valuable when deciding where to establish the eatery, since the location and overall floor area will determine whether or not the restaurant can offer a table booking service.
I believe they should consider locations with more floor space, since it not only helps customers by ensuring they will have a seat on the occasion they have planned, but also helps restaurant managers to schedule sufficient personnel for preparation and outstanding service, which can result in higher daily revenues and ultimately a higher rating.
6. Prominent Restaurant Types in Bangalore
# I will first output a set of counts of unique restaurant types values.
Restaurant_Type = df['rest_type'].value_counts()[:10]
# The basic aesthetic of the plots will then be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))
# Finally, I will make use of a barplot to demonstrate how a numerical and a category variable interact.
sns.barplot(x = Restaurant_Type, y = Restaurant_Type.index, palette= "hsv")
plt.title("Prominent Restaurant Types in Bangalore", fontsize=20)
plt.xlabel("Count of Restaurant Type", fontsize=15)
plt.ylabel("Restaurant Type", fontsize=15)
# The values on the y-axis are reversed.
plt.gca().invert_yaxis()
plt.show()
Bangalore is considered the technological capital of India, and residents there favor Quick Bites because of their busy schedules. This type of restaurant leads the market not only because people can afford it on a regular basis, but also because they lack the time to prepare their own lunch and bring it to the office.
7. Prominent Restaurant Meal Types in Bangalore
# I will first collect the distinct restaurant meal types, most frequent first.
meal_types = list(df['meal_type'].value_counts().index)
# The basic aesthetic of the plots will then be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))
# One color per meal type; the list must be at least as long as meal_types.
colors_code = ['brown', 'mediumorchid', 'mediumblue', 'darkcyan', 'yellow', 'orange', 'red']
# I will go through each meal type and draw its rating density.
for restaurant_meal_index, restaurant_type in enumerate(meal_types):
    restaurant_meal_types = df[(df['meal_type'] == restaurant_type) & (df['rate'] > 0)]
    sns.kdeplot(restaurant_meal_types['rate'], label=restaurant_type, color=colors_code[restaurant_meal_index], fill=True)
plt.title("Bangalore's Restaurants Rate Distribution by Meal Types", fontsize=20)
plt.xlabel("Rate", fontsize=15)
plt.ylabel("Density", fontsize=15)
plt.legend(loc="upper right")
plt.show()
The rating distribution for Pubs and bars, Drinks & nightlife, Cafés, and Buffet peaks at around 4, which is valuable for potential investors and company owners to take into consideration when deciding what kind of eatery to establish. This is fairly insightful since the beverage-related establishments have the best ratings, but let’s check whether such establishments also have the lowest average cost for two people.
# I will first collect the distinct restaurant meal types, most frequent first.
meal_types = list(df['meal_type'].value_counts().index)
# The basic aesthetic of the plots will then be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))
# One color per meal type; the list must be at least as long as meal_types.
colors_code = ['brown', 'mediumorchid', 'mediumblue', 'darkcyan', 'yellow', 'orange', 'red']
# I will go through each meal type and draw its cost density.
for restaurant_meal_index, restaurant_type in enumerate(meal_types):
    restaurant_meal_types = df[(df['meal_type'] == restaurant_type) & (df['two_people_cost'] > 0)]
    sns.kdeplot(restaurant_meal_types['two_people_cost'], label=restaurant_type, color=colors_code[restaurant_meal_index], fill=True)
plt.title("Bangalore's Restaurants Cost Distribution by Meal Types", fontsize=20)
plt.xlabel("Two People Cost", fontsize=15)
plt.ylabel("Density", fontsize=15)
plt.legend(loc="upper right")
plt.show()
The cost distribution for two people is notably right-skewed (most restaurants are inexpensive, with a long tail of expensive ones), especially for food-related categories such as Delivery, Dine-out, and Desserts. This graph suggests that about 90% of restaurants offer food at under ₹1000 for two people (about €6 per person), while this is slightly higher for the beverage-related categories, where an average night out for two people costs about ₹1500 (about €9 per person). Potential investors and business owners can use this information to decide between a beverage-related and a food-related establishment and then formulate their market and financial planning accordingly.
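The direction of a skew can also be checked numerically. A minimal sketch with made-up cost values (`cost_demo` is hypothetical): a long tail of expensive restaurants pulls the mean above the median and gives a positive skewness, which matches the summary table earlier in this report, where the mean cost for two (603.55) exceeds the median (500.00).

```python
import pandas as pd

# Hypothetical costs for two people: many cheap places, a few expensive ones.
cost_demo = pd.Series([200, 300, 300, 400, 500, 600, 800, 1500, 3000])

mean_cost = cost_demo.mean()
median_cost = cost_demo.median()

# Series.skew() returns the sample skewness:
# positive values indicate a longer right tail.
skewness = cost_demo.skew()

print(f"mean={mean_cost:.1f}, median={median_cost:.1f}, skew={skewness:.2f}")
```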
8. Bangalore’s Restaurant Rate Distribution
# The basic aesthetic of the plots will then be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
fig, ax = plt.subplots(figsize=(20, 10))
sns.histplot(df, x = 'rate', kde=True, bins=20, color='navy', ax = ax, fill=True)
# I have drawn a red line through the bars to indicate the mean.
plt.axvline(df.rate.mean(), color='firebrick', linestyle='dashed', linewidth=1.5)
plt.title("Bangalore's Restaurants Rate Distribution", fontsize=20)
plt.xlabel("Rate", fontsize=15)
plt.ylabel("Count of Restaurant Rate", fontsize=15)
plt.show()
Restaurants with ratings below 2.5 or above 4.5 are extremely rare, and the majority of eateries have ratings between 3 and 4.
Exploratory Data Analysis Conclusion
I have concluded that business owners and potential investors can utilize a set of features (cuisines, restaurant type, online order, table services, two people cost, etc.), based on their own market segment and business model, to predict the rating that a newly proposed restaurant is likely to attain. This insight will be valuable since it will help them to estimate the rating of their potential new restaurant before it is even built.
Phase 3: Model Prediction
In this section, I am going to find the most efficient model that can predict ratings for new start-up businesses as well as provide insightful data about the variables that influence a new establishment’s success to help business owners and potential investors to make wise decisions.
Preprocessing
Before I start training the algorithm and creating a
model that can predict my target variable
rate, there is a preprocessing
step to be taken into consideration.
## row_num int32
## name object
## online_order object
## book_table object
## rate float64
## votes int64
## location object
## rest_type object
## cuisines object
## two_people_cost float64
## meal_type object
## city_neighborhood object
## dtype: object
As a first step, I have to encode all object columns as factorized categorical variables.
There are numerous ways to encode categorical variables. For instance:
- Transforming the string labels into a numeric form (label encoding).
- Transforming the categorical data into numeric form with the
help of indicator or dummy variables
(OneHotEncoder or get_dummies).
However, I take an easy and straightforward approach by defining a
helper function (with def) built on the pandas factorize() method. This will produce a
dataset with only numerical variables.
# I first select all the object columns and
# then use the FactorizedColumns function to encode them.
def FactorizedColumns(df):
    # Work on a copy so that the original DataFrame is left unchanged.
    df = df.copy()
    for i in df.columns[df.columns.isin(["name", "online_order", "book_table", "location", "rest_type", "cuisines", "meal_type", "city_neighborhood"])]:
        # factorize() returns (codes, uniques); only the integer codes are kept.
        df[i] = df[i].factorize()[0]
    return df
df_numeric = FactorizedColumns(df.loc[:, df.columns != "row_num"])
df_numeric.info()
## <class 'pandas.core.frame.DataFrame'>
## Index: 41190 entries, 0 to 51716
## Data columns (total 11 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 name 41190 non-null int64
## 1 online_order 41190 non-null int64
## 2 book_table 41190 non-null int64
## 3 rate 41190 non-null float64
## 4 votes 41190 non-null int64
## 5 location 41190 non-null int64
## 6 rest_type 41190 non-null int64
## 7 cuisines 41190 non-null int64
## 8 two_people_cost 41190 non-null float64
## 9 meal_type 41190 non-null int64
## 10 city_neighborhood 41190 non-null int64
## dtypes: float64(2), int64(9)
## memory usage: 3.8 MB
It is visible that all features now have a numerical form.
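To make the difference between the encoding options concrete, here is a minimal sketch on a made-up column (the frame `demo` is hypothetical). factorize() assigns integer codes in order of first appearance, while get_dummies() creates one indicator column per category:

```python
import pandas as pd

# Toy stand-in for a categorical column; the values are illustrative only.
demo = pd.DataFrame({"online_order": ["Yes", "No", "No", "Yes"]})

# Label-style encoding: "Yes" appears first, so it receives code 0.
codes, uniques = pd.factorize(demo["online_order"])

# Indicator (dummy) encoding: one column per category.
dummies = pd.get_dummies(demo["online_order"], prefix="online_order")

print(codes, list(uniques))
print(dummies.columns.tolist())
```

The integer codes imply an order between categories that does not really exist, so the sign and size of any correlation computed on them depend on which value happened to appear first. Indicator columns avoid this at the cost of a wider dataset.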
Feature Selection
Now that the variables have undergone substantial preprocessing, let’s start by selecting which features to include in my machine-learning model.
Correlation Coefficients Matrix
The first step is to use a Correlation Coefficients Matrix of the preprocessed features to display how the numerical and encoded categorical variables are related.
# The basic aesthetic of the plots will be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
# A single figure is created for the heatmap.
plt.figure(figsize=(10,6))
# Use the Pearson method to determine the coefficient of correlation.
correlation_plot = df_numeric.corr(method = "pearson")
# Display a heatmap of the correlation coefficient.
heatmap = sns.heatmap(correlation_plot, annot = True, cbar = True, fmt=".2f", cmap=sns.color_palette("rocket", as_cmap=True))
# Rotate the x-axis labels by 20 degrees.
heatmap.set_xticklabels(heatmap.get_xticklabels(), rotation = 20, horizontalalignment = "right")
plt.show()
This heatmap displays both the strength of
each relationship and its direction,
that is, whether it is positive or negative. The
fourth row depicts the correlation coefficients between
all feature values \(X\) and the target
variable rate \(y\).
The features that correlate most strongly with
rate are:
- votes: This feature has a moderate positive correlation (0.43) with rate.
- book_table: This feature has a moderate negative correlation (-0.43) with rate.
- two_people_cost: This feature has a moderate positive correlation (0.38) with rate.
The feature that does not correlate with
rate is:
- city_neighborhood: This feature has almost no correlation (0.02) with rate.
Based on my exploratory data analysis, I believe most of the features are quite interesting and important for further investigation, with the exception of:
- name: There is very little correlation with rate, and a name only determines whether a restaurant is independent or franchised;
- city_neighborhood: There is almost no correlation with rate.
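Reading one row of a large heatmap can be error-prone, so the same ranking can be extracted directly from the correlation matrix. A minimal sketch on made-up numbers (the frame `corr_demo` is hypothetical; with the real data the frame would be the encoded dataset):

```python
import pandas as pd

# Toy stand-in for the encoded dataset; the values are illustrative only.
corr_demo = pd.DataFrame({
    "rate": [3.1, 3.5, 3.9, 4.2, 4.6],
    "votes": [10, 40, 120, 300, 900],
    "book_table": [1, 1, 0, 1, 0],
})

# Pearson correlation of every feature with the target, without the target itself.
rate_corr = corr_demo.corr(method="pearson")["rate"].drop("rate")

# Order the features by the absolute strength of their correlation.
ranked = rate_corr.reindex(rate_corr.abs().sort_values(ascending=False).index)

print(ranked.round(2))
```

Ranking by absolute value keeps strong negative correlations near the top. For factorized columns the sign only reflects the arbitrary integer codes, so the magnitude is the more informative part.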
Pairplot Correlation Visualization
In light of the number of features in the
Zomato Dataset, it could be
challenging to draw conclusions from the
Correlation Coefficients Matrix alone. As a result, I
will also plot the feature values against the target
variable rate.
# The basic aesthetic of the plots will be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))
# Show a scatterplot matrix of the rate VS each significant feature.
sns.pairplot(df_numeric, hue="rate")