HU Minor: ML Predictive Analysis for Zomato Restaurants

Phase 1: Challenge Proposal

Domain Understanding

Introduction

I have always been intrigued by Bengaluru’s culinary scene. Bangalore is home to restaurants from all across the world, and you can discover virtually every type of cuisine and delicacy here. Whatever you want, Bengaluru has it: delivery, dine-in, taverns, cafés, cocktails, buffets, and sweets. For gastronomy lovers and travellers, Bengaluru is one of the best cities to visit.

Why must I do the challenge?

Bengaluru is home to an expanding number of restaurants; at the moment there are about 12,000 establishments. Even with so many eateries, the sector has not yet reached saturation, and new food vendors open every day. However, new establishments find it challenging to compete with those that have already achieved success. Because Bengaluru is the heart of India’s information technology industry, its residents rely heavily on casual dining since they lack the time to prepare meals at home. It is therefore crucial for me to understand the local demography, given the huge demand for eateries.

What is it exactly that I am going to do?

In an attempt to address the aforementioned issues, I intend to provide valuable insights in my prediction analysis about the variables that influence a new establishment’s success and predict ratings for new start-up businesses in Bangalore.

This insight will be helpful for potential investors and restaurant owners who want to find out whether a start-up business would be successful and sustainable in their target market (the catering industry).

Data Sourcing

Zomato is one of the most popular applications in Bengaluru for ordering meals, finding eateries, and reading and writing reviews. It is the single source from which I can obtain the overall rating of each restaurant in Bengaluru.

Enriching Data

With the aid of restaurant demographics and information on independent variables associated with restaurant ambience, food and service quality available on Zomato, it is feasible to predict a rating for a new start-up business. To accomplish that, a conceptual framework will be constructed, which will help me define the dependent variable for this predictive analysis.

Conceptual Framework

The Zomato Dataset does not need to be integrated or further enriched, for two reasons. First, the dataset is large enough to contain potential correlations and patterns that a machine learning model can pick up during modelling. Second, it has sufficient variables to support my conceptual framework as displayed above.

Additional information and a description of the data will be included in the Data Understanding Section.

Interview: Guest Behaviour & Experience

The conceptual framework’s moderator variable illustrates how changes in guest behaviour and experience affect the relationship between an independent variable and the dependent variable. For instance, a guest may already be frustrated when they visit a restaurant, or they may have received the best possible service and still leave a poor rating.

It can be challenging to find reliable information on guest behaviour and experience, or even to research it online. Someone who has dined at a Zomato-listed restaurant is therefore an ideal interview candidate, since they can provide considerably more in-depth and reliable information.

Interview Insights:

Particularly in Bengaluru, where competition in the food market is fierce, Zomato serves as a platform that enables all of these businesses to draw in more customers and thereby increase their revenue. As a result, these restaurants would undoubtedly prefer to collaborate with Zomato, since it raises their visibility across the city and boosts their profitability and overall rating.

When people in Bengaluru open the Zomato application on their phone, they will probably notice the restaurants that are currently providing special offers and huge discounts at the top of the application dashboard. Guests therefore visit those restaurants the most, since they have more offers that meet their needs, whether those relate to price or to customer satisfaction. Those restaurants stand out from the competition.

The sales and offers available on the application are what draw in and retain the majority of their customers.

Exploratory Research: Zomato Business Model

In addition, I have performed exploratory research on Zomato to learn more about how the firm operates and how it maintains a strategic advantage.

There are three primary stakeholders in the Zomato Company:

  1. Restaurant
  2. Customer
  3. Delivery Partner

Zomato has over 35 million active users, 15 million establishments that are registered, and more than 165,000 delivery partners.

Zomato earns over 72% of its revenue from advertising and delivery commissions. Restaurants pay Zomato fees to run more offers for them; in particular, when a restaurant pays advertising fees to Zomato, its offers are run and those who pay a higher fee are prioritized at the top of the offer list.

1. Restaurant: Restaurants register with Zomato when they wish to have their meals or dishes delivered to customers. The restaurant benefits more when an online order is placed, since it only has to prepare, pack, and hand over the food, as opposed to dine-in customers who occupy a table, must be attended by waiters, and use cutlery that has to be cleaned, and so on. Taking all the expenses and labor together, having customers order online is more advantageous for the restaurant.

2. Customer: The customer’s goal is to get food delivered to their home. Zomato attracts most of its customers through its application; when a customer buys the same food or dish directly from the restaurant or anywhere else online, they do not get the price that Zomato offers them.

3. Delivery Partners: Zomato, in turn, pays its delivery partners comparatively well, considering that their primary responsibility is delivering meals to customers’ homes. Zomato strives to attract additional delivery partners by offering better compensation and stable employment.

Analytic Approach

How am I going to do the challenge?

Now that I have identified the right data to employ, I must follow a structured data approach, which entails the following steps, in order to make predictions with greater precision:

  • Step 1: Collecting the relevant data to my target of predictive analysis.
  • Step 2: Organizing the relevant data into a single dataset.
  • Step 3: Cleaning the relevant data to prevent an inaccurate model and misleading prediction.
  • Step 4: Utilizing exploratory data analysis to gain knowledge about the data.
  • Step 5: Establishing useful variables to comprehend the records.
  • Step 6: Choosing an appropriate machine learning algorithm.
  • Step 7: Constructing and employing a successful model.

When am I going to do the challenge?

I plan to accomplish the challenge within the three blocks of my Minor Program in Big Data & Design. Although this challenge hasn’t been officially assigned to me, I intend to pursue it at my own pace. By the end of this semester, I aim to showcase the prediction analysis I will have conducted. To successfully complete the challenge, I will regularly submit my documentation to my Major studies at Hogeschool Fontys, ensuring I constantly receive feedback and make appropriate improvements.

Who is responsible for doing the Challenge?

In my predictive analysis report, I am solely accountable for the challenge of developing a successful machine learning model, which will predict the success of a new establishment by making use of the restaurant demographics and the independent variables associated with restaurant ambience, food and service quality provided by the Zomato Dataset.

Should I do the challenge or not?

I am confident that I am capable of completing the challenge by making use of the various resources available from both my Minor Program (Data Science & AI Library from Canvas) and my Major (AI Project Methodology from Canvas). These resources, along with the guidelines I have previously laid out in this challenge proposal, will support my efforts to achieve success.

Summary

1. This predictive analysis intends to offer significant insights into the variables that influence the success of a new establishment and predict the rating for new start-up businesses.

2. Main Research Question: Can the success (rating) of a new restaurant be predicted from features such as online_order, book_table, votes, location, rest_type, cuisines, two_people_cost and meal_type?

3. Target Variable: The feature rate is my target variable; it contains the overall star rating out of five.

4. Machine Learning Model: The nature of the target variable rate is suitable for Regression Analysis.

Phase 2: Data Provisioning

Data Requirements

Most of the data requirements have already been satisfied for a predictive model to make an accurate prediction.

  1. I have already confirmed that I am completely aware of the domains for which the data will be retrieved.

I have obtained the Zomato Dataset from the Kaggle Online Machine Learning Repository Platform where I downloaded the data folder including the Zomato.data dataset and its dictionary_data description for more explanation of the dataset.

  2. I have already compiled and analyzed a list of the significant stakeholders whose establishments will benefit from this projection, based on the available information in the Zomato dataset.

This projection will be helpful for potential investors and restaurant (owners) to find out whether a start-up business would be successful and sustainable in their target market.

  3. I have already gathered all the potential attributes (for relevant tables), but I will determine the data types they might possess in the upcoming sections.

  4. I have already collected all the potential attributes (for relevant tables), but I will establish relationships between the attributes of a dataframe in the upcoming sections.

Data Collection

Load Libraries & Packages

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import warnings

# Pip: Package Installer for Python (PIP) is a program
# that helped me to install and manage python software packages.

#!pip install missingno
import missingno as msno

# I will use display.max_columns to adjust the maximum number of displayed columns.
pd.set_option('display.max_columns', 20)

# It ignores most of the machine learning warnings produced. 
warnings.filterwarnings("ignore")

Data Understanding

Zomato is an Indian multinational restaurant aggregator and food delivery company that was founded in 2008 by Deepinder Goyal and Pankaj Chaddah. Zomato provides information on restaurant menus and customer ratings. In Bangalore, it also offers food delivery from partner restaurants. As of 2019, the service is accessible in more than 10,000 cities across 24 different countries.

Loading Data

I have the Zomato.data dataset saved in the same directory as my (Python) notebook. This makes specifying the file path easier: as the code shows, I only mention the file name rather than its full location.

# Encoding helps to simply convert the data's characters into binary code.
# If it's omitted, an error will be raised.
df = pd.read_csv("Zomato.data.csv", encoding = "latin-1")
# Obtain a glimpse of the data by printing information regarding the DataFrame.
df_sample = df.loc[24830:24832]
  url address name online_order book_table rate votes phone location rest_type dish_liked cuisines approx_cost(for two people) reviews_list menu_item listed_in(type) listed_in(city)
24830 https://www.zomato.com/bangalore/nine2twelve-kalyan-nagar-bangalore?context=eyJzZSI6eyJlIjpbIjU0NzM5IiwiNTYxOTQiLCI1NzEwMSIsIjE4Nzg2NTcyIiwxODk2ODA5OCwiMTg5NjczNzEiLCIxODQyNjM1NiIsIjUwNzEzIiwiMTg5Mjk0NDQiLCIxODM2NjY1NCIsIjU4ODQ3IiwiNTY1NTUiLCIxODU5MzM3NyIsIjE4NTM3ODE2IiwiNTc2NDMiLCIxODQ3NjgyMSJdLCJ0IjoiRGluZS1PdXQgUmVzdGF1cmFudHMgaW4gS2FseWFuIE5hZ2FyIn19 Flat 302, 403, 2nd Main, Kasturinagar, East Of NGEF Layout Kalyan Nagar, Bangalore Nine2twelve No No NEW 0 +91 9606166379 Kalyan Nagar Quick Bites nan South Indian 300.000000 [] [] Dine-out Kalyan Nagar
24831 https://www.zomato.com/bangalore/new-taj-biryani-centre-kalyan-nagar-bangalore?context=eyJzZSI6eyJlIjpbIjU2MTk0IiwiNTcxMDEiLCIxODc4NjU3MiIsIjE4OTY4MDk4IiwxODk2NzM3MSwiMTg0MjYzNTYiLCI1MDcxMyIsIjE4OTI5NDQ0IiwiMTgzNjY2NTQiLCI1ODg0NyIsIjU2NTU1IiwiMTg1OTMzNzciLCIxODUzNzgxNiIsIjU3NjQzIiwiMTg0NzY4MjEiXSwidCI6IkRpbmUtT3V0IFJlc3RhdXJhbnRzIGluIEthbHlhbiBOYWdhciJ9fQ== IB Road, Lorry Stand, Kushaal Nagar, Ward 10, Kalyan Nagar, Bangalore New Taj Biryani Centre No No NEW 0 +91 8979052325 Kalyan Nagar Quick Bites nan Biryani 300.000000 [] [] Dine-out Kalyan Nagar
24832 https://www.zomato.com/bangalore/ss-bucket-biryani-kammanahalli?context=eyJzZSI6eyJlIjpbIjU3MTAxIiwiMTg3ODY1NzIiLCIxODk2ODA5OCIsIjE4OTY3MzcxIiwxODQyNjM1NiwiNTA3MTMiLCIxODkyOTQ0NCIsIjE4MzY2NjU0IiwiNTg4NDciLCI1NjU1NSIsIjE4NTkzMzc3IiwiMTg1Mzc4MTYiLCI1NzY0MyIsIjE4NDc2ODIxIiwiMTg2MTYwMDMiXSwidCI6IkRpbmUtT3V0IFJlc3RhdXJhbnRzIGluIEthbHlhbiBOYWdhciJ9fQ== 15, 5th Main Road, KEB Road, Near Kullappa Circle, HRBR Layout, Kammanahalli, Bangalore SS Bucket Biryani No No 4.0/5 161 +91 9886974444 Kammanahalli Casual Dining Brinjal Curry, Basmati Rice, Mutton Biryani Biryani, North Indian, Chinese 600.000000 [('Rated 3.0', 'RATED\n Visited this place today in the afternoon around 2 PM. We wanted chicken biriyani and it was not available. The only option was mutton biriyani. So we ordered that. Even mutton biriyani was not available here and they have sent their delivery boys to get it from OMBR layout branch, so we had to wait until he got it.\n\nComing to the biryani it was good. There is nothing extra ordinary about the taste but it was quite OK.\nWe ordered double pack and it was more than enough for 4 people. The raita given was tasty but the quantity was very less.\n\nOverall this is a budget friendly place and if you are a biriyani lover do try this out once.\n\nFood - 3.5\nAmbiance - 3\nValue for money - 4'), ('Rated 3.0', 'RATED\n Really surprised to find biryani in a bucket, but it was really great. They give a large quantity of biryani and kebabs for a very reasonable price.')] [] Dine-out Kalyan Nagar

The main objective of analyzing the Zomato Dataset is to gain a clear understanding of the variables influencing the opening of different types of dining establishments across Bengaluru, as well as the overall rating of each restaurant. Bengaluru is one such metropolitan area, with more than 12,000 eateries serving cuisine from across the world.

# Quickly inspect the number of columns with their type of objects.
df.dtypes
## url                             object
## address                         object
## name                            object
## online_order                    object
## book_table                      object
## rate                            object
## votes                            int64
## phone                           object
## location                        object
## rest_type                       object
## dish_liked                      object
## cuisines                        object
## approx_cost(for two people)    float64
## reviews_list                    object
## menu_item                       object
## listed_in(type)                 object
## listed_in(city)                 object
## dtype: object

In the Zomato Dataset, recognizing and comprehending the significance of each data type is essential. Depending on the data type, a particular kind of analysis will be necessary.

These data types will similarly guarantee that data is gathered in the preferred format and that the values of each feature are as anticipated.

Understanding the different data types can also assist me in deciding how they might be combined before feeding them into a machine-learning model.
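For instance, the object (string) columns such as online_order and book_table will eventually have to be converted into a numerical representation before they can be fed into a model. The snippet below is only a minimal sketch of one possible encoding, assuming the column names shown in the dtypes listing above; the actual encoding strategy will be decided during modelling.

# Sketch only: one possible numerical encoding of a few object columns.
# The final encoding strategy will be chosen in the modelling phase.
encoded = df.copy()
encoded["online_order"] = encoded["online_order"].map({"Yes": 1, "No": 0})
encoded["book_table"] = encoded["book_table"].map({"Yes": 1, "No": 0})
encoded = pd.get_dummies(encoded, columns=["listed_in(type)"], drop_first=True)
print(encoded.dtypes.head(10))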

Data Characteristic & Description

# Quickly checks for the names of the columns, the data types that it contains and 
# Whether they have any odd or striking missing data.
df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 51717 entries, 0 to 51716
## Data columns (total 17 columns):
##  #   Column                       Non-Null Count  Dtype  
## ---  ------                       --------------  -----  
##  0   url                          51717 non-null  object 
##  1   address                      51717 non-null  object 
##  2   name                         51717 non-null  object 
##  3   online_order                 51717 non-null  object 
##  4   book_table                   51717 non-null  object 
##  5   rate                         43942 non-null  object 
##  6   votes                        51717 non-null  int64  
##  7   phone                        50509 non-null  object 
##  8   location                     51696 non-null  object 
##  9   rest_type                    51490 non-null  object 
##  10  dish_liked                   23639 non-null  object 
##  11  cuisines                     51672 non-null  object 
##  12  approx_cost(for two people)  51371 non-null  float64
##  13  reviews_list                 51717 non-null  object 
##  14  menu_item                    51717 non-null  object 
##  15  listed_in(type)              51717 non-null  object 
##  16  listed_in(city)              51717 non-null  object 
## dtypes: float64(1), int64(1), object(15)
## memory usage: 6.7+ MB
  1. url: includes a link to the restaurant’s page on the Zomato website.
  2. address: contains the restaurant’s Bengaluru address.
  3. name: contains the restaurant’s name.
  4. online_order: whether the restaurant offers online ordering.
  5. book_table: whether there is a table book option.
  6. rate: contains a five-star rating for the restaurant overall.
  7. votes: the total number of ratings the restaurant had received as of the stated date.
  8. phone: includes the restaurant’s phone number.
  9. location: contains information on the area where the restaurant is situated.
  10. rest_type: type of restaurant.
  11. dish_liked: meals that guests liked in the establishment.
  12. cuisines: comma-separated lists of food genres.
  13. approx_cost (for two people): includes an estimate of the cost of a dinner for two people.
  14. reviews_list: a collection of tuples with restaurant reviews; each tuple has two values: the customer’s rating and the review.
  15. menu_item: contains a list of the menu options for the restaurant.
  16. listed_in (type): type of meal.
  17. listed_in (city): incorporates the city-neighborhood in which the restaurant is situated.
# Quickly analyses the columns and observations in the dataset.
df.shape
## (51717, 17)

The Zomato Dataset has 51717 observations, with each row including details about a specific restaurant in Bangalore, and 17 columns. There is one target variable, rate, and 16 features (restaurant attributes).

  • Both isna() and isnull() functions are used to find the missing values in the pandas dataframe. (Aruchamy, 2022)
print(df.isna().sum())
## url                                0
## address                            0
## name                               0
## online_order                       0
## book_table                         0
## rate                            7775
## votes                              0
## phone                           1208
## location                          21
## rest_type                        227
## dish_liked                     28078
## cuisines                          45
## approx_cost(for two people)      346
## reviews_list                       0
## menu_item                          0
## listed_in(type)                    0
## listed_in(city)                    0
## dtype: int64
  • Although the two coding lines vary in function name and form, both of them provide the same results. I will list both of them here for academic purposes.
print(df.isnull().sum())
## url                                0
## address                            0
## name                               0
## online_order                       0
## book_table                         0
## rate                            7775
## votes                              0
## phone                           1208
## location                          21
## rest_type                        227
## dish_liked                     28078
## cuisines                          45
## approx_cost(for two people)      346
## reviews_list                       0
## menu_item                          0
## listed_in(type)                    0
## listed_in(city)                    0
## dtype: int64

In the Zomato Dataset there are missing values in the columns:

  • rate: This column has 7775 missing values.
  • phone: This column has 1208 missing values.
  • location: This column has 21 missing values.
  • rest_type: This column has 227 missing values.
  • dish_liked: This column has 28078 missing values.
  • cuisines: This column has 45 missing values.
  • approx_cost(for two people): This column has 346 missing values.

Data Preparation

Since the Zomato Dataset has been successfully imported and its essential information is clearly understood, it is time to begin the Data Preparation process. This entails eliminating meaningless columns, assigning meaningful column names, inspecting and eliminating duplicate and missing values, and finally performing a small exploratory data analysis to make certain that there are no evident discrepancies, to identify patterns, and to discover relationships between the features.

These transformations and preparations are required to ensure that the Zomato Dataset is properly prepared for the forthcoming Phase 3 Prediction.

Removal of Specific Columns

# Drop any specified column within the dataset.
df.drop(["address", "dish_liked", "menu_item", "phone", "reviews_list", "url"], axis = 1, inplace = True)

The Zomato Dataset contained several features like address, phone and url that were irrelevant to my predictive analysis. This indicates that they have no bearing whatsoever on my target variable rate or on the challenge that my modelling approach is intended to tackle. By dropping these irrelevant columns, I can prevent the algorithm from picking up misleading associations and reduce the risk of overfitting.

Moreover, the Zomato Dataset contained several other features like menu_item and reviews_list that were redundant to my predictive analysis. This indicates that these features share the same information with other features like cuisines and rate, and they can be safely dropped without compromising or losing information.

Finally, the Zomato Dataset contained the feature dish_liked, which is also of limited use here: it has more than 28,000 missing values, which caused numerous issues during training and would do so again in real-life settings when retraining is expected. I therefore decided to keep only the most useful features and drop this one as well.

Meaningful Column Names

# Rename any specified column within the dataset.
df.rename(columns={"approx_cost(for two people)": "two_people_cost", 
                   "listed_in(type)": "meal_type",
                   "listed_in(city)": "city_neighborhood"}, inplace = True)
df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 51717 entries, 0 to 51716
## Data columns (total 11 columns):
##  #   Column             Non-Null Count  Dtype  
## ---  ------             --------------  -----  
##  0   name               51717 non-null  object 
##  1   online_order       51717 non-null  object 
##  2   book_table         51717 non-null  object 
##  3   rate               43942 non-null  object 
##  4   votes              51717 non-null  int64  
##  5   location           51696 non-null  object 
##  6   rest_type          51490 non-null  object 
##  7   cuisines           51672 non-null  object 
##  8   two_people_cost    51371 non-null  float64
##  9   meal_type          51717 non-null  object 
##  10  city_neighborhood  51717 non-null  object 
## dtypes: float64(1), int64(1), object(9)
## memory usage: 4.3+ MB

The Zomato Dataset must have distinct, accessible, and meaningful names in order for me as a user to recognize and distinguish the data columns on a DataFrame.

Inspection & Removal of Duplicate Values

# I have printed the dataset with duplicates by selecting every row that is identical across all columns 
# And then return a series of counts of False and True based on whether a row is a duplicate.

print(df.duplicated(keep=False).value_counts(ascending = True))
## True       215
## False    51502
## Name: count, dtype: int64

With keep=False, every row that has at least one identical copy is flagged, so 215 rows are involved in duplication and 51502 rows are unique.

# I have printed the dataset with duplicates 
# By selecting every row that is identical across all columns.

print(df[df.duplicated(keep = False)])
##                         name online_order book_table   rate  votes  \
## 553             My Tea House          Yes        Yes    NEW      0   
## 596             My Tea House          Yes        Yes    NEW      0   
## 2195              Shiv Sagar           No         No  3.6/5     10   
## 2235              Shiv Sagar           No         No  3.6/5     10   
## 3747   The Fisherman's Wharf          Yes        Yes  4.4/5   4099   
## ...                      ...          ...        ...    ...    ...   
## 50366         House Of Candy          Yes         No    NaN      0   
## 50379         House Of Candy          Yes         No    NaN      0   
## 50405         House Of Candy          Yes         No    NaN      0   
## 50900       Nawab Di Biryani          Yes         No    NEW      0   
## 50904       Nawab Di Biryani          Yes         No    NEW      0   
## 
##                 location           rest_type  \
## 553         Banashankari       Casual Dining   
## 596         Banashankari       Casual Dining   
## 2195   Bannerghatta Road          Food Court   
## 2235   Bannerghatta Road          Food Court   
## 3747       Sarjapur Road  Casual Dining, Bar   
## ...                  ...                 ...   
## 50366         Whitefield       Confectionery   
## 50379         Whitefield       Confectionery   
## 50405         Whitefield       Confectionery   
## 50900         Whitefield  Takeaway, Delivery   
## 50904         Whitefield  Takeaway, Delivery   
## 
##                                               cuisines  two_people_cost  \
## 553              Continental, Asian, North Indian, Tea            500.0   
## 596              Continental, Asian, North Indian, Tea            500.0   
## 2195                           South Indian, Beverages            400.0   
## 2235                           South Indian, Beverages            400.0   
## 3747   Seafood, Goan, North Indian, Continental, Asian           1400.0   
## ...                                                ...              ...   
## 50366                                         Desserts            200.0   
## 50379                                         Desserts            200.0   
## 50405                                         Desserts            200.0   
## 50900                                 Biryani, Mughlai            400.0   
## 50904                                 Biryani, Mughlai            400.0   
## 
##       meal_type  city_neighborhood  
## 553    Dine-out       Banashankari  
## 596    Dine-out       Banashankari  
## 2195   Dine-out  Bannerghatta Road  
## 2235   Dine-out  Bannerghatta Road  
## 3747     Buffet          Bellandur  
## ...         ...                ...  
## 50366  Delivery         Whitefield  
## 50379  Delivery         Whitefield  
## 50405  Delivery         Whitefield  
## 50900  Delivery         Whitefield  
## 50904  Delivery         Whitefield  
## 
## [215 rows x 11 columns]

The Zomato Dataset contains 215 duplicate rows that have the potential to corrupt the training or test sets. Duplicated records give some observations extra weight during training, which can lead my model to learn trends that in reality do not occur, while incomplete records will cause my model to interpret features inaccurately.

I will therefore drop all the duplicate rows.

  1. To get rid of all the duplicate rows, I can use the duplicated() function in combination with the ~ negation operator:
#df = df[~df.duplicated()]
  2. Alternatively, I can use the drop_duplicates() function, which also drops all the duplicate rows within the dataset:
df.drop_duplicates(inplace=True)

Although the two coding lines vary in function and form, both of them provide the same results. I list both of them here for academic purposes, but I will only utilize the second one, drop_duplicates().

Every duplicate row should now have been eliminated, so I must explicitly verify with Python code that they are indeed deleted from the updated dataset.

# I can use the duplicated() function in combination with 
# The sum() returns the whole set of duplicate values in the dataset.

df.duplicated().sum()
## 0

The Zomato Dataset no longer contains any duplicate records.

Graphical Inspection of Missing Values

# The basic aesthetic of the plots will be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))

# The msno.matrix() function provide visual summary of the completeness 
# Or absence of my Zomato Dataset.
msno.matrix(df, fontsize=15, color=(0.99, 0.76, 0.8))
plt.show()

From the aforementioned matrix, I can get more information about the data that was missing from each column, as well as the overall number of rows and columns.

# The basic aesthetic of the plots will then be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))

# The msno.heatmap() function provides a visual summary of how the existence 
# Or lack of one feature on my Zomato Dataset impacts the existence 
# Or lack of another feature (Null Correlation).
msno.heatmap(df, fontsize=15)
plt.show()

From the aforementioned heatmap visual, I can get more information about the missing columns’ correlation with one another.

Additionally, I can see that the missingness of some variables is related. For instance, there is a positive nullity correlation of about 0.7 (70%) between the features cuisines and location.
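To double-check the figure reported by the heatmap, the nullity correlation can also be computed directly with pandas; the short sketch below simply correlates the missingness indicators of the columns that actually contain missing values.

# Sketch: nullity correlation computed directly, mirroring what msno.heatmap() visualizes.
null_flags = df.isnull()
null_flags = null_flags.loc[:, null_flags.any()]   # keep only columns that have missing values
print(null_flags.astype(int).corr().round(2))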

Elimination of NA Values

# Both coding lines provide the same results.
print(df.isnull().sum(axis = 0))
## name                    0
## online_order            0
## book_table              0
## rate                 7755
## votes                   0
## location               21
## rest_type             227
## cuisines               45
## two_people_cost       344
## meal_type               0
## city_neighborhood       0
## dtype: int64
# print(df.isna().sum())

In the Zomato Dataset there are now missing values in the columns:

  • rate: This column has 7755 missing values.
  • location: This column has 21 missing values.
  • rest_type: This column has 227 missing values.
  • cuisines: This column has 45 missing values.
  • two_people_cost: This column has 344 missing values.
# The two coding lines below provide the same results.
df.dropna(inplace = True)
# df.dropna(how = "any", inplace = True)
df.info()
## <class 'pandas.core.frame.DataFrame'>
## Index: 43447 entries, 0 to 51716
## Data columns (total 11 columns):
##  #   Column             Non-Null Count  Dtype  
## ---  ------             --------------  -----  
##  0   name               43447 non-null  object 
##  1   online_order       43447 non-null  object 
##  2   book_table         43447 non-null  object 
##  3   rate               43447 non-null  object 
##  4   votes              43447 non-null  int64  
##  5   location           43447 non-null  object 
##  6   rest_type          43447 non-null  object 
##  7   cuisines           43447 non-null  object 
##  8   two_people_cost    43447 non-null  float64
##  9   meal_type          43447 non-null  object 
##  10  city_neighborhood  43447 non-null  object 
## dtypes: float64(1), int64(1), object(9)
## memory usage: 4.0+ MB

The machine learning model that I want to use raises an error if null values are passed into it. One way to deal with those values is to impute them with the mean, median or mode, or, in the most challenging situations, to employ advanced imputation techniques such as MICE, depending on the type of missingness (MCAR/MAR/MNAR). However, doing so could considerably reduce the precision and dependability of my model and lead it to learn from artificial values; a sketch of such an imputation approach is shown after the list below.

I therefore made a conscious decision to omit and drop all the rows with Null Values so that the dataset:

  1. Contains less misleading information, which improves the precision of my machine learning model.
  2. Contains less data redundancy, which reduces the likelihood that decisions will rely on noise or corrupted data.
  3. Contains fewer incomplete records, which reduces the complexity of my machine learning algorithm and speeds up model training.
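For completeness, the sketch below shows roughly what the imputation route I decided against could have looked like; it would have been applied instead of the dropna() call above, filling the numerical column with its median and the categorical columns with their mode. It is an illustration of the alternative only, not part of my actual pipeline.

# Sketch only: the imputation alternative that I decided NOT to take.
# It would replace the df.dropna() call above if imputation were chosen.
imputed = df.copy()
imputed["two_people_cost"] = imputed["two_people_cost"].fillna(imputed["two_people_cost"].median())
for col in ["location", "rest_type", "cuisines"]:
    imputed[col] = imputed[col].fillna(imputed[col].mode()[0])
# The target column rate is left untouched: imputing the target would bias the model,
# which is one more reason I prefer dropping those rows instead.
print(imputed.isnull().sum())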

Graphical Verification of Missing Values

# The basic aesthetic of the plots will then be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))

# The msno.matrix() function provide a visual summary of the completeness 
# Or absence of my Zomato Dataset.
msno.matrix(df, fontsize=15, color=(0.57, 0.36, 0.51))
plt.show()

I deployed the msno.matrix() visualization to verify whether there were any further missing values that needed to be handled. There is no white space whatsoever in any of the bars, which indicates that all of the missing data have now been successfully identified and deleted.

The Zomato Dataset has 43447 observations with each row including details about a specific restaurant in Bangalore. There is one target variable rate and 10 features with restaurant attributes.

Data Transformations For Columns

# The unique() function looks for distinct values in an array 
# And returns the distinct values ordered by distinctiveness.

pd.unique(df[["rate"]].values.ravel("K"))
## array(['4.1/5', '3.8/5', '3.7/5', '3.6/5', '4.6/5', '4.0/5', '4.2/5',
##        '3.9/5', '3.1/5', '3.0/5', '3.2/5', '3.3/5', '2.8/5', '4.4/5',
##        '4.3/5', 'NEW', '2.9/5', '3.5/5', '2.6/5', '3.8 /5', '3.4/5',
##        '4.5/5', '2.5/5', '2.7/5', '4.7/5', '2.4/5', '2.2/5', '2.3/5',
##        '3.4 /5', '-', '3.6 /5', '4.8/5', '3.9 /5', '4.2 /5', '4.0 /5',
##        '4.1 /5', '3.7 /5', '3.1 /5', '2.9 /5', '3.3 /5', '2.8 /5',
##        '3.5 /5', '2.7 /5', '2.5 /5', '3.2 /5', '2.6 /5', '4.5 /5',
##        '4.3 /5', '4.4 /5', '4.9/5', '2.1/5', '2.0/5', '1.8/5', '4.6 /5',
##        '4.9 /5', '3.0 /5', '4.8 /5', '2.3 /5', '4.7 /5', '2.4 /5',
##        '2.1 /5', '2.2 /5', '2.0 /5', '1.8 /5'], dtype=object)

# The values.ravel() function enables me to retrieve input values 
# that have been compressed as an array with the same type and structure.
# The letter "K" denotes viewing each element in the memory sequence.

The rate column in the Zomato Dataset marks several restaurants with '-' because they do not have any ratings yet.

There are also newly opened establishments for which no guests have yet made a visit and posted a rating. These establishment ratings are referred to by the term NEW.

df = df[df['rate'] != 'NEW']
df = df[df['rate'] != '-']

Due to these unexpected ratings presented in the array, some adjustment is required in order to generate a reliable dataset at the end, which can then be fed into a Machine Learning Model.

# The unique() function in combination with 
# The values.ravel("K") returns distinct values in an array.
pd.unique(df[["rate"]].values.ravel("K"))
## array(['4.1/5', '3.8/5', '3.7/5', '3.6/5', '4.6/5', '4.0/5', '4.2/5',
##        '3.9/5', '3.1/5', '3.0/5', '3.2/5', '3.3/5', '2.8/5', '4.4/5',
##        '4.3/5', '2.9/5', '3.5/5', '2.6/5', '3.8 /5', '3.4/5', '4.5/5',
##        '2.5/5', '2.7/5', '4.7/5', '2.4/5', '2.2/5', '2.3/5', '3.4 /5',
##        '3.6 /5', '4.8/5', '3.9 /5', '4.2 /5', '4.0 /5', '4.1 /5',
##        '3.7 /5', '3.1 /5', '2.9 /5', '3.3 /5', '2.8 /5', '3.5 /5',
##        '2.7 /5', '2.5 /5', '3.2 /5', '2.6 /5', '4.5 /5', '4.3 /5',
##        '4.4 /5', '4.9/5', '2.1/5', '2.0/5', '1.8/5', '4.6 /5', '4.9 /5',
##        '3.0 /5', '4.8 /5', '2.3 /5', '4.7 /5', '2.4 /5', '2.1 /5',
##        '2.2 /5', '2.0 /5', '1.8 /5'], dtype=object)

The way restaurant ratings are stored in the Zomato Dataset (e.g. '4.1/5') is unsuitable for regression models. Because of this, I must remove all string characters from the field and keep only the numerical rating values.

# The str.replace() function displays numerical values 
# And removes all character-string-related objects from the column.

df["rate"] = df["rate"].str.replace('/5','')

# The str.strip() function removes all white spaces 
# And other invisible characters.

df["rate"] = df["rate"].str.strip().astype("float")

# The astype() function converts the entire pandas 
# object - column to the same data type indicated.
  • I now have to verify that I got rid of all the leading spaces and string characters in the rate column.
pd.unique(df[["rate"]].values.ravel("K"))
## array([4.1, 3.8, 3.7, 3.6, 4.6, 4. , 4.2, 3.9, 3.1, 3. , 3.2, 3.3, 2.8,
##        4.4, 4.3, 2.9, 3.5, 2.6, 3.4, 4.5, 2.5, 2.7, 4.7, 2.4, 2.2, 2.3,
##        4.8, 4.9, 2.1, 2. , 1.8])

The Zomato Dataset's rate column no longer contains any misleading or inappropriate values.

Addition of Row Number

# The insert() function enables the addition of a new element to a DataFrame at a specific index position.
# The np.arange() function obtain values that are equally spaced within a specified range.
# The len() function provides the object's length.

df.insert(loc=0, column="row_num", value = np.arange(len(df)))

# The above code will retrieve the Data Frame's row number and place it as the first column.
df_tail = df.tail(5)
  row_num name online_order book_table rate votes location rest_type cuisines two_people_cost meal_type city_neighborhood
51709 41185 The Farm House Bar n Grill No No 3.7 34 Whitefield Casual Dining, Bar North Indian, Continental 800.0 Pubs and bars Whitefield
51711 41186 Bhagini No No 2.5 81 Whitefield Casual Dining, Bar Andhra, South Indian, Chinese, North Indian 800.0 Pubs and bars Whitefield
51712 41187 Best Brews - Four Points by Sheraton Bengaluru... No No 3.6 27 Whitefield Bar Continental 1500.0 Pubs and bars Whitefield
51715 41188 Chime - Sheraton Grand Bengaluru Whitefield Hotel &... No Yes 4.3 236 ITPL Main Road, Whitefield Bar Finger Food 2500.0 Pubs and bars Whitefield
51716 41189 The Nest - The Den Bengaluru No No 3.4 13 ITPL Main Road, Whitefield Bar, Casual Dining Finger Food, North Indian, Continental 1500.0 Pubs and bars Whitefield

I chose to utilize the distinct row number because it allowed me to refer to the indices without getting lost or accidentally replacing data from other rows. I have incorporated it for my convenience and understanding purposes.

Exploratory Data Analysis

According to my Conceptual Framework, there are many features and factors that can influence what constitutes a good eatery. For the purposes of my predictive analysis, I assume that restaurants with a high rate tend to be the good ones. I will perform initial investigations on the Zomato Dataset to discover patterns and to check this assumption with the help of tabular and graphical analytics.

# Provide some summary statistics for numerical columns as indicated below.
df_describe = df.describe()
  row_num rate votes two_people_cost
count 41190.00 41190.00 41190.00 41190.00
mean 20594.50 3.70 352.07 603.55
std 11890.67 0.44 883.46 464.65
min 0.00 1.80 0.00 40.00
25% 10297.25 3.40 21.00 300.00
50% 20594.50 3.70 73.00 500.00
75% 30891.75 4.00 277.00 750.00
max 41189.00 4.90 16832.00 6000.00

An average rating of 3.7 out of 5 is quite respectable, especially considering that the maximum rating in the dataset is 4.9, so hardly any eatery in Bangalore comes close to a full five-star rating.

The fact that 75% of restaurants charge no more than ₹750 for two people (about €4.35 per person) is also quite interesting and worth examining more extensively.

I am going to further examine the relations between the features and discover how they correlate with the success of a start-up business. In addition, I am going to dive into some of the features to determine what narrative they might hold for me.

1. Bangalore’s Franchise Restaurants

  • I will group by the names of the restaurants and retrieve data statistics like the mean rating, the total number of votes, the average cost for two guests etc for the most popular restaurants.
# In the groupby() function's as index = False parameter 
# indicates that I do not intend to use the column Identifier as the index.

Franchise_Restaurant = df.groupby(by='name', as_index=False).agg({'rate': 'mean',
                                                                 'votes': 'sum',
                                                                 'two_people_cost': 'mean',
                                                                 'row_num': 'count'})
                                                                 
# I assigned the aggregated columns - names to make them easier to understand.
Franchise_Restaurant.columns = ['Restaurant_Name', 'Average_Rating', 'Total_Votes', 'Average_Two_People_Cost', 'Total_Restaurants']

# I sorted the values in decreasing order to identify Bangalore's most popular places.
Franchise_Restaurant = Franchise_Restaurant.sort_values(by='Total_Restaurants', ascending=False)[:10]

# The columns have been reorganized depending on my perspective.
Franchise_Restaurant = Franchise_Restaurant.loc[:, ['Restaurant_Name', 'Total_Restaurants', 'Total_Votes', 'Average_Two_People_Cost', 'Average_Rating']]

# I round the number up to a specific number of decimal digits for clarification purposes. 
Franchise_Restaurant = Franchise_Restaurant.round(decimals = 2)
  Restaurant_Name Total_Restaurants Total_Votes Average_Two_People_Cost Average_Rating
987 Cafe Coffee Day 86 3089 838.37 3.26
4191 Onesta 85 347520 600.00 4.41
1869 Empire Restaurant 69 229808 693.48 4.03
2978 Kanti Sweets 68 7336 400.00 3.90
1975 Five Star Chicken 68 3134 259.56 3.42
2848 Just Bake 67 2898 400.00 3.41
596 Baskin Robbins 62 2487 250.81 3.57
4393 Pizza Hut 60 20161 747.50 3.38
4371 Petoo 60 4242 675.83 3.83
2892 KFC 60 23495 422.50 3.65

2. Bangalore’s Independent Restaurants

  • I will group by the names of the restaurants and retrieve data statistics like the mean rating, the total number of votes, the average cost for two guests etc for the least popular restaurants.
# In the groupby() function's as index = False parameter 
# indicates that I do not intend to use the column Identifier as the index.

Independent_Restaurant = df.groupby(by='name', as_index=False).agg({'rate': 'mean',
                                                 'votes': 'sum',
                                                 'two_people_cost': 'mean',
                                                 'row_num': 'count'})
                                              
# I assigned the aggregated columns - names to make them easier to understand.
Independent_Restaurant.columns = ['Restaurant_Name', 'Average_Rating', 'Total_Votes', 'Average_Two_People_Cost', 'Total_Restaurants']

# I sorted the values in increasing order to identify Bangalore's least popular places.
Independent_Restaurant = Independent_Restaurant.sort_values(by='Total_Restaurants', ascending=True)[:10]

# The columns have been reorganized depending on my perspective.
Independent_Restaurant = Independent_Restaurant.loc[:, ['Restaurant_Name', 'Total_Restaurants', 'Total_Votes', 
'Average_Two_People_Cost', 'Average_Rating']]
  Restaurant_Name Total_Restaurants Total_Votes Average_Two_People_Cost Average_Rating
3520 Mangalore Kitchen 1 42 500.00 3.80
4067 NightOwl 1 31 400.00 2.70
5236 South Grand 1 102 400.00 3.40
5233 Soup N Grill 1 14 600.00 3.50
2556 Hotel Shri Raghavendra 1 6 150.00 3.00
5223 SoopeRolls 1 11 250.00 3.50
5222 Soo Ra Sang 1 290 1500.00 4.10
5217 Soham Bombay Masti Magic 1 16 150.00 3.60
2562 Hotel Thalassery 1 8 200.00 3.40
4062 Night Fox 1 22 500.00 3.20

I discovered from these two tables that a restaurant can be either independent or part of a franchise.
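A rough way to quantify that split is to count how many restaurant names occur more than once; the short check below treats repeated names as chains or franchises and single occurrences as independents, which is only a proxy, since unrelated restaurants could in principle share a name.

# Rough proxy: repeated names ~ franchises/chains, unique names ~ independent restaurants.
name_counts = df['name'].value_counts()
print("Franchise / chain names:      ", (name_counts > 1).sum())
print("Independent restaurant names: ", (name_counts == 1).sum())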

# I will first output a set of counts of unique restaurant values.
Franchise_Restaurant = df['name'].value_counts(sort=True, dropna=False, ascending=False)[:10]

# The basic aesthetic of the plots will then be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))

# Finally, I will make use of a barplot to demonstrate how a numerical and a category variable interact. 
sns.barplot(x = Franchise_Restaurant, y = Franchise_Restaurant.index, palette= "gnuplot")
plt.title("Top Ten Restaurant Franchise's in Bangalore.", fontsize=20)
plt.xlabel("Count of Restaurant Name", fontsize=15)
plt.ylabel("Restaurant Name", fontsize=15)
plt.show()

For instance, Cafe Coffee Day appears to be a franchise, where the brand owner permits operators to use the business’s name and model in exchange for royalties and support, whereas Mangalore Kitchen appears to be an independent restaurant with only one establishment under that name.

Based on the bar chart alone, I am not certain which business model would be more beneficial for a restaurant owner or potential investor. However, I can attest that Bangalore has both franchised and independently owned restaurants, which is rather intriguing.

3. Bangalore Restaurant Cuisines

# I will first output a set of counts of unique restaurant cuisines values.
Restaurant_Cuisine_Type = df['cuisines'].value_counts()[:10]

# The basic aesthetic of the plots will then be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))

# Finally, I will make use of a barplot to demonstrate how a numerical and a category variable interact. 
sns.barplot(x = Restaurant_Cuisine_Type.index, y = Restaurant_Cuisine_Type, palette= "inferno")
plt.title("Top Ten Bangalore Restaurant Cuisines", fontsize=20)
plt.xlabel("Restaurant Cuisine", fontsize=15)
plt.ylabel("Count of Cuisines", fontsize=15)
plt.xticks(rotation=90)
## ([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [Text(0, 0, 'North Indian'), Text(1, 0, 'North Indian, Chinese'), Text(2, 0, 'South Indian'), Text(3, 0, 'Cafe'), Text(4, 0, 'Bakery, Desserts'), Text(5, 0, 'Biryani'), Text(6, 0, 'South Indian, North Indian, Chinese'), Text(7, 0, 'Desserts'), Text(8, 0, 'Fast Food'), Text(9, 0, 'Chinese')])
plt.show()

The most preferred cuisines in Bengaluru are North Indian, Chinese and South Indian. I am surprised, though, that residents do not favor their local cuisine (South Indian) even more.
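Because the cuisines column stores comma-separated combinations, the bar chart above counts whole combinations rather than individual cuisines. The short sketch below splits those combinations apart, which gives a cleaner view of which individual cuisines dominate.

# Split the comma-separated cuisine combinations and count each cuisine on its own.
individual_cuisines = df['cuisines'].str.split(', ').explode()
print(individual_cuisines.value_counts().head(10))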

After conducting a small amount of research, I discovered that South Indian food is generally considered healthier than North Indian food: the inherent flavors of the vegetables are preserved, and comparatively less fat is used in South Indian recipes. However, North Indian cuisine is simpler and quicker to cook in restaurants, and is more familiar to foreign visitors and Indians at large. As a result, guests tend to dine in or order online from restaurants that provide these cuisines. (Campbell, 2022)

Nevertheless, operating a South Indian food restaurant can be profitable for restaurant owners or potential investors who wish to establish their own business with minimum investment.

4. Bangalore’s Restaurant Online Ordering Service

# The basic aesthetic of the plots will be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))

# The data labels output a set of Boolean (Yes/No) counts
data_labels = df.online_order.value_counts().index
data_colors = ['lightskyblue','lightcoral']

# The explode parameter offsets the highlighted slice from the rest of the pie.
data_explode = (0, 0.1)

# The pie is sized based on the values in the relevant column.
sizes = df.online_order.value_counts().values

# autopct displays the percentage value on each slice.

# startangle=90 rotates the first slice so that it starts at the top of the pie.
plt.pie(sizes, explode=data_explode, labels=data_labels, colors= data_colors, autopct='%1.1f%%', startangle=90, textprops={'fontsize': 15})
## ([<matplotlib.patches.Wedge object at 0x000002279FEAEF90>, <matplotlib.patches.Wedge object at 0x000002279FE7F250>], [Text(-0.9695170559757531, -0.5196505346596967, 'Yes'), Text(1.057654970155367, 0.5668914923560326, 'No')], [Text(-0.5288274850776834, -0.2834457461780164, '65.7%'), Text(0.6169653992572974, 0.33068670387435234, '34.3%')])
# A solid circle is created at its central position.
donut_graph = plt.Circle( (0,0), 0.4, color='white')

# A new donut chart is created by matching the provided parameters of the previous selected pie chart.
current_graph = plt.gcf()
current_graph.gca().add_artist(donut_graph)
plt.title("Bangalore's Restaurant Online Ordering Service", fontsize=20)
plt.xlabel("Online Orders", fontsize=15)
plt.show()

Around 66% of the eateries in Bangalore accept online orders, while roughly 34% do not. Let’s dive deeper into that…

# The basic aesthetic of the plots will be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))

# Query enables for accessing and returning DataFrame with a True/False result.
sns.kdeplot(df.query('online_order == "Yes" & rate > 0')['rate'], 
             color='orange', label='Online Ordering Service Provided', shade=True)
             
sns.kdeplot(df.query('online_order == "No" & rate > 0')['rate'], 
             color='blueviolet', label='Online Ordering Service NOT Provided', shade=True)
plt.title("Bangalore's Restaurants Rate Distribution by Online Ordering Service", fontsize=20)
plt.xlabel("Rate", fontsize=15)
plt.ylabel("Online Order", fontsize=15)
plt.legend(loc="upper right")
plt.show()

The rate probability distribution reveals that, even though the difference is small, guests tend to give slightly better ratings to establishments that accept online orders. Potential investors and business owners may find this graph helpful when considering whether to add an online ordering system to their business. It appears that the majority of restaurants offer that service, and based on my research and interviews, I also have the impression that Bangaloreans enjoy placing orders online on a regular basis.
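To put a number on how small that difference actually is, a quick aggregation such as the one below can be used.

# Compare the rating of restaurants with and without online ordering.
print(df.groupby('online_order')['rate'].agg(['mean', 'median', 'count']).round(2))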

5. Bangalore’s Restaurant Table Booking Service

# The basic aesthetic of the plots will be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))

# The data labels output a set of Boolean (Yes/No) counts
labels = df.book_table.value_counts().index
colors = ['aquamarine','lightseagreen']

# The explode parameter offsets the highlighted slice from the rest of the pie.
explode = (0, 0.1)

# The pie is sized based on the values in the relevant column.
sizes = df.book_table.value_counts().values

# autopct displays the percentage value on each slice.

# startangle=90 rotates the first slice so that it starts at the top of the pie.
plt.pie(sizes, explode=explode, labels=labels, colors= colors, autopct='%1.1f%%', startangle=90, textprops={'fontsize': 15})
## ([<matplotlib.patches.Wedge object at 0x000002279FF96F90>, <matplotlib.patches.Wedge object at 0x000002279FC95850>], [Text(-0.5065896566917578, -0.9764051002186168, 'No'), Text(0.5526432618455541, 1.0651692002384912, 'Yes')], [Text(-0.27632163092277695, -0.5325846001192455, '84.8%'), Text(0.32237523607657315, 0.6213487001391197, '15.2%')])
plt.title("Bangalore's Restaurant Table Booking Service", fontsize=20)
plt.xlabel("Table Booking", fontsize=15)
plt.show()

In Bangalore, around 85% of eateries don’t offer a way to reserve a table. It could be fascinating to examine this further…

# The basic aesthetic of the plots will be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))

# Query enables for accessing and returning DataFrame with a True/False result.
sns.kdeplot(df.query('book_table == "Yes" & two_people_cost > 0')['two_people_cost'], 
             color='hotpink', label='Table Booking Service Provided', shade=True)
sns.kdeplot(df.query('book_table == "No" & two_people_cost > 0')['two_people_cost'], 
             color='lightseagreen', label='Table Booking Service NOT Provided', shade=True)
plt.title("Bangalore's Restaurants Cost Distribution by Table Booking Service", fontsize=20)
plt.xlabel("Two People Cost", fontsize=15)
plt.ylabel("Table Booking", fontsize=15)
plt.legend(loc="upper right")
plt.show()

In India, typically only upscale restaurants with enough floor area allow guests to reserve a table, whereas smaller eateries simply do not. Based on the pink curve in the graph (table booking provided), it can be verified that such restaurants are significantly more expensive.

# The basic aesthetic of the plots will be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))

# Query enables for accessing and returning DataFrame with a True/False result.
sns.kdeplot(df.query('book_table == "Yes" & rate > 0')['rate'], 
             color='royalblue', label='Table Booking Service', shade=True)
sns.kdeplot(df.query('book_table == "No" & rate > 0')['rate'], 
             color='gold', label='Table Booking Service NOT Provided', shade=True)
plt.title("Bangalore's Restaurants Rate Distribution by Table Booking Service", fontsize=20)
plt.xlabel("Rate", fontsize=15)
plt.ylabel("Table Booking", fontsize=15)
plt.legend(loc="upper right")
plt.show()

The above figure demonstrates how important table reservation is to restaurant guests. I have observed that establishments with a table booking service typically get the highest ratings. Potential investors and business owners may find this graph valuable when deciding where to establish an eatery, since the location and overall floor area will determine whether or not the restaurant can offer a guest table booking service.

I honestly believe they should consider locations with more floor space, since this not only helps customers by ensuring they will have a seat on the occasion they have planned, but it also helps restaurant managers schedule sufficient personnel for preparation and outstanding service, which results in higher daily revenues and ultimately a higher rating.
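Both claims in this subsection, that table-booking restaurants are more expensive and that they are better rated, can be sanity-checked with a short aggregation like the sketch below.

# Compare cost and rating for restaurants with and without table booking.
print(df.groupby('book_table')[['two_people_cost', 'rate']].agg(['mean', 'median']).round(2))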

6. Prominent Restaurant Types in Bangalore

# I will first output a set of counts of unique restaurant types values.
Restaurant_Type = df['rest_type'].value_counts()[:10]

# The basic aesthetic of the plots will then be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))

# Finally, I will make use of a barplot to demonstrate how a numerical and a category variable interact. 
sns.barplot(x = Restaurant_Type, y = Restaurant_Type.index, palette= "hsv")
plt.title("Prominent Restaurant Types in Bangalore", fontsize=20)
plt.xlabel("Count of Restaurant Type", fontsize=15)
plt.ylabel("Restaurant Type", fontsize=15)

# The values on the y-axis are reversed.
plt.gca().invert_yaxis()
plt.show()

Since Bangalore is considered the technological capital of India, residents favor Quick Bites because of their busy schedules. The reason this type of restaurant leads the market is not just that people can afford it on a regular basis, but also that they lack the time to prepare their own lunch and bring it to the office.

7. Prominent Restaurant Meal Types in Bangalore

# I will first output a set of counts of unique restaurant meal type values.
meal_types = list(df['meal_type'].value_counts().index)

# The basic aesthetic of the plots will then be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))

# I will go through each of the enumerable string meal types.
for restaurant_type in meal_types:
    restaurant_mean_index = meal_types.index(restaurant_type)
    restaurant_meal_types = df[ (df['meal_type'] == restaurant_type) & (df['rate'] > 0)]
    colors_code = ['brown', 'mediumorchid', 'mediumblue', 'darkcyan', 'yellow', 'orange', 'red']
    sns.kdeplot(restaurant_meal_types['rate'], label = restaurant_type, color=colors_code[restaurant_mean_index], shade=True)
plt.title("Bangalore's Restaurants Rate Distribution by Meal Types", fontsize=20)
plt.xlabel("Rate", fontsize=15)
plt.ylabel("Meal Type", fontsize=15)
plt.legend(loc="upper right")
plt.show()

The rate probability distribution for Pubs and Bars, Drinks & nightlife, Cafés, and Buffet frequently peaks around 4, which is incredibly valuable for potential investors and company owners to take into consideration when deciding what kind of eatery to establish. This is fairly insightful, since the beverage-related establishments have the best ratings, but let's take a quick look at whether such establishments also have the lowest average cost for two people.
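To back this reading of the density curves with a number, the short check below (a sketch, assuming the same df used throughout this notebook) computes the mean rate per meal type; the beverage-related categories should come out near 4.

# Mean rate per meal type, sorted from highest to lowest.
df.query('rate > 0').groupby('meal_type')['rate'].mean().sort_values(ascending=False).round(2)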

# I will first output the list of unique restaurant meal type values.
meal_types = list(df['meal_type'].value_counts().index)

# The basic aesthetic of the plots will then be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))

# One colour per meal type, assigned in the order of the list above.
colors_code = ['brown', 'mediumorchid', 'mediumblue', 'darkcyan', 'yellow', 'orange', 'red']

# I will go through each of the meal types and plot its cost density.
for meal_index, meal_type in enumerate(meal_types):
    restaurant_meal_types = df[(df['meal_type'] == meal_type) & (df['two_people_cost'] > 0)]
    sns.kdeplot(restaurant_meal_types['two_people_cost'], label=meal_type, color=colors_code[meal_index], shade=True)
plt.title("Bangalore's Restaurants Cost Distribution by Meal Types", fontsize=20)
plt.xlabel("Two People Cost", fontsize=15)
plt.ylabel("Meal Type", fontsize=15)
plt.legend(loc="upper right")
plt.show()

The two people cost probability distribution is notably right-skewed, especially for the food-related meal types such as Delivery, Dine-out, and Desserts. The graph demonstrates that roughly 90% of restaurants offer food at under ₹1000 for two people (about 6€ per person), while the cost is slightly higher for the beverage- and drinks-related meal types: an average night out for two people costs around ₹1500 (about 9€ per person). Potential investors and business owners can use this information to determine whether they are planning a beverage-related or a food-related establishment and then formulate their market and financial planning accordingly.
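The 90% figure can be checked quickly with the sketch below (an illustrative check, assuming the same df; the meal type labels match the dataset):

# 90th percentile of the two-people cost for the food-related meal types.
food_types = ['Delivery', 'Dine-out', 'Desserts']
food_costs = df[df['meal_type'].isin(food_types) & (df['two_people_cost'] > 0)]['two_people_cost']
food_costs.quantile(0.9)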

8. Bangalore’s Restaurant Rate Distribution

# The basic aesthetic of the plots will then be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
fig, ax = plt.subplots(figsize=(20, 10))
sns.histplot(df, x = 'rate', kde=True, bins=20, color='navy', ax = ax, fill=True)

# I have drawn a red line through the bars to indicate the mean.
plt.axvline(df.rate.mean(), color='firebrick', linestyle='dashed', linewidth=1.5)
plt.title("Bangalore's Restaurants Cost Distribution", fontsize=20)
plt.xlabel("Rate", fontsize=15)
plt.ylabel("Count of Restaurant Rate", fontsize=15)
plt.show()

Restaurants rated below 2.5 or above 4.5 are extremely rare, and the majority of eateries have ratings between 3 and 4.
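A one-line check (a sketch, assuming the same df) makes the "majority between 3 and 4" observation concrete:

# Share of rated restaurants whose rate lies between 3 and 4 (in percent).
round(df.query('rate > 0')['rate'].between(3, 4).mean() * 100, 1)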

Exploratory Data Analysis Conclusion

I have concluded that business owners and potential investors can use a set of features (cuisines, restaurant type, online order, table service, two people cost, etc.), chosen according to their own market segment and business model, to predict the rating that a newly proposed restaurant is likely to attain. This insight is valuable because it helps them estimate the rating of a potential new restaurant before it is even built.

Phase 3: Model Prediction

In this section, I am going to find the most efficient model that can predict ratings for new start-up businesses as well as provide insightful data about the variables that influence a new establishment’s success to help business owners and potential investors to make wise decisions.

Preprocessing

Before I start training the algorithm and creating a model that can predict my target variable rate, there is a preprocessing step to be taken into consideration.

# Quickly inspects the number of columns with their type of objects.
df.dtypes
## row_num                int32
## name                  object
## online_order          object
## book_table            object
## rate                 float64
## votes                  int64
## location              object
## rest_type             object
## cuisines              object
## two_people_cost      float64
## meal_type             object
## city_neighborhood     object
## dtype: object

As a first step, I have to encode all object columns into factorized categorical variables.

There are numerous ways to encode categorical variables. For instance:

  1. Transforming the string labels into a numeric form (label encoding).
  2. Transforming the categorical data into numeric indicator or dummy variables (OneHotEncoder or get_dummies); see the sketch below.
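For completeness, the second option could look roughly like the sketch below (a minimal sketch, assuming the same df and using a few of its low-cardinality columns for illustration). I will not use it here, because high-cardinality columns such as cuisines would explode into thousands of dummy columns.

# A minimal sketch of option 2: indicator/dummy variables with pandas get_dummies.
# Each selected column is expanded into one 0/1 column per category.
import pandas as pd

df_dummies = pd.get_dummies(df, columns=["online_order", "book_table", "meal_type"], drop_first=True)
df_dummies.head()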

However, I will take an easy and straightforward approach by defining a small helper function. This will produce a dataset with only numerical variables.

# I first define the FactorizedColumns helper function and
# then apply it to encode the selected object columns.

def FactorizedColumns(df):
    for i in df.columns[df.columns.isin(["name", "online_order", "book_table", "location", "rest_type", "cuisines", "meal_type", "city_neighborhood"])]:
        df[i] = df[i].factorize()[0]
    return df

df_numeric = FactorizedColumns(df.loc[:, df.columns != "row_num"])
df_numeric.info()
## <class 'pandas.core.frame.DataFrame'>
## Index: 41190 entries, 0 to 51716
## Data columns (total 11 columns):
##  #   Column             Non-Null Count  Dtype  
## ---  ------             --------------  -----  
##  0   name               41190 non-null  int64  
##  1   online_order       41190 non-null  int64  
##  2   book_table         41190 non-null  int64  
##  3   rate               41190 non-null  float64
##  4   votes              41190 non-null  int64  
##  5   location           41190 non-null  int64  
##  6   rest_type          41190 non-null  int64  
##  7   cuisines           41190 non-null  int64  
##  8   two_people_cost    41190 non-null  float64
##  9   meal_type          41190 non-null  int64  
##  10  city_neighborhood  41190 non-null  int64  
## dtypes: float64(2), int64(9)
## memory usage: 3.8 MB

It is visible that all features now have a numerical form.

Feature Selection

Now that the variables have undergone substantial preprocessing, let's select which features to include in my machine-learning model.

Correlation Coefficients Matrix

The first step is to use a Correlation Coefficients Matrix with those preprocessed features to display how the numerical and categorical variables are related.

# The basic aesthetic of the plots will be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(10,6))

# Use the Pearson method to determine the correlation coefficients.
correlation_plot = df_numeric.corr(method = "pearson")

# Display a heatmap of the correlation coefficient.
heatmap = sns.heatmap(correlation_plot, annot = True, cbar = True, fmt=".2f", cmap=sns.color_palette("rocket", as_cmap=True))

# Rotation of the x-axis labels by 20 degrees.
heatmap.set_xticklabels(heatmap.get_xticklabels(), rotation = 20, horizontalalignment = "right")

plt.show()

This heatmap displays both the strength of each relationship and its direction, whether positive or negative. The fourth row depicts the correlation coefficients between all feature values \(X\) and the target variable rate \(y\).

The features that correlate most strongly with rate are:

  • votes: This feature has a quite positive correlation of (0.43) with rate.

  • book_table: This feature has a quite negative correlation of (-0.43) with rate.

  • two_people_cost: This feature has a quite positive correlation of (0.38) with rate.

The features that do not correlate with rate are:

  • city_neighborhood: This feature has an almost no correlation (0.02) with rate.

Based on my exploratory data analysis, I believe most of the features are quite interesting and important for further investigation with the exception of:

  • name: There is very little correlation with the rate, and the name only indicates whether a restaurant is independent or franchised;

  • city_neighborhood: Since there is almost no correlation with the rate.

Pairplot Correlation Visualization

In light of the quantity of features in the Zomato Dataset, it could be challenging to draw conclusions from the Correlation Coefficients Matrix alone. As a result, I will also plot the feature values against the target variable rate.

# The basic aesthetic of the plots will be defined by a number of settings that I will set.
sns.set(style='whitegrid', context='notebook')
plt.figure(figsize=(20,10))

# Show a scatterplot matrix of the rate VS each significant feature.
sns.pairplot(df_numeric, hue="rate")

This pairplot is employed to determine the most distinguishable groupings, or the optimum combination of features for describing the relationship between two variables, by looking for simple linear separations or basic boundaries in my Zomato Dataset.

Division of the Dataset

The data will now be divided into a training and a testing dataset by setting test_size to 0.2: the training set randomly receives 80% of the data rows, while the testing set receives the remaining 20%. Setting the random_state parameter to the fixed value 42 ensures that the training and testing sets contain identical rows each time the function train_test_split() is invoked.

# The X is my selection of features.
# The best model accuracy score is achieved by combining all of these features!
X = df_numeric[["online_order", "book_table", "votes", "location", "rest_type", "cuisines", "two_people_cost", "meal_type"]]

# y is the target variable.
y = df_numeric["rate"].values

# Splitting the training and testing dataset.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, y_train.shape
## ((32952, 8), (32952,))

Modeling

The nature of the target variable rate is suitable for regression analysis as validated by my Domain Understanding.

Considering that I have a continuous target variable rate, the linear regression model is the first model I choose to employ because of its widespread use, simplicity, and effectiveness for regression analysis.

Multiple Linear Regression

I will construct a multiple linear regression model because I have several features \(X\) (online_order, book_table, votes, location, rest_type, cuisines, two_people_cost and meal_type) for my target variable \(y\), rate.

from sklearn.linear_model import LinearRegression

# Determine the Machine Learning Model 
LinRegModel = LinearRegression(fit_intercept= True)

# Training the Linear Regression Algorithm on the Training Set
LinRegModel.fit(X_train, y_train)
## LinearRegression()
print('The Multiple Linear Regression Intercept is: %.2f' % LinRegModel.intercept_)
## The Multiple Linear Regression Intercept is: 3.80

When fitting the multiple regression model to the training set, I have obtained nine parameters \(b_0\), \(b_1\), \(b_2\), \(b_3\), \(b_4\), \(b_5\), \(b_6\), \(b_7\) and \(b_8\) (the intercept plus one slope per feature), as follows; a small verification sketch follows the list below:

\(y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4 + b_5 x_5 + b_6 x_6 + b_7 x_7 + b_8 x_8\)
  1. \(y\) is the predicted value for rate.
  2. \(x_1\) represents the first feature value online_order.
  3. \(x_2\) represents the second feature value book_table.
  4. \(x_3\) represents the third feature value votes.
  5. \(x_4\) represents the fourth feature value location.
  6. \(x_5\) represents the fifth feature value rest_type.
  7. \(x_6\) represents the sixth feature value cuisines.
  8. \(x_7\) represents the seventh feature value two_people_cost.
  9. \(x_8\) represents the eighth feature value meal_type.
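As a quick sanity check of the equation above (a small sketch, assuming the fitted LinRegModel and the X_test split from earlier), reconstructing \(b_0 + b_1 x_1 + \dots + b_8 x_8\) by hand for a single row should reproduce the model's own prediction:

import numpy as np

# Manually compute intercept + sum of slope * feature for the first test row
# and compare it with LinRegModel.predict(); the two values should match.
first_row = X_test.iloc[[0]]
manual_prediction = LinRegModel.intercept_ + np.dot(first_row.values[0], LinRegModel.coef_)
model_prediction = LinRegModel.predict(first_row)[0]
print(round(manual_prediction, 4), round(model_prediction, 4))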

After the multiple regression model is fitted, I have received feature values as follows:

# Take the feature column names in the same order as they were passed to the model.
col = X.columns

# Print each feature's slope with its column name accordingly.
for i in range(LinRegModel.n_features_in_):
  print("The Slope For {n}: is {s}".format(n=col[i], s=round(LinRegModel.coef_[i], 3)))
## The Slope For online_order: is -0.106
## The Slope For book_table: is -0.255
## The Slope For votes: is 0.0
## The Slope For location: is 0.0
## The Slope For rest_type: is 0.002
## The Slope For cuisines: is 0.0
## The Slope For two_people_cost: is 0.0
## The Slope For meal_type: is -0.008

The first two slopes are negative, which at first glance suggests that offering online ordering or table booking lowers the rate; this would be intriguing, as my exploratory data analysis indicated the opposite, with Bengaluru restaurants typically receiving high ratings when they offer a table booking or online ordering service. However, because the Yes/No columns were factorized, the sign of these slopes depends on which label was encoded as 0, so they must be interpreted with care. Let's evaluate the model's reliability and determine whether it is solid or not.

Multiple Linear Regression Model Evaluation Metrics

For a more comprehensive analysis of how well my multiple regression model fits the Zomato Dataset, I will employ the Root Mean Square Error (RMSE) and R2 Score measurements.

  1. Let’s first look at the Root Mean Square Error (RMSE) from sklearn metrics.
# To obtain the RMSE, I will compute the Square Root of the Mean Squared Error 
# Using the function np.sqrt from numpy.

from sklearn.metrics import mean_squared_error

# Linear Regression Model Evaluation For Training Set
Y_Training_Prediction = LinRegModel.predict(X_train)
RMSE_Training = np.sqrt(mean_squared_error(y_train, Y_Training_Prediction))

# Linear Regression Model Performance For Training Set
print("{} {:.2f}".format("The Root Mean Square Error For The Training Set is:", RMSE_Training))
## The Root Mean Square Error For The Training Set is: 0.37
# Linear Regression Model Evaluation For Testing Set
Y_Test_Prediction = LinRegModel.predict(X_test)
RMSE_Test = np.sqrt(mean_squared_error(y_test, Y_Test_Prediction))

# Linear Regression Model Performance For Testing Set
print("{} {:.2f}".format("The Root Mean Square Error For The Testing Set is:", RMSE_Test))
## The Root Mean Square Error For The Testing Set is: 0.36

This metric reveals the average distance between the projected and observed values in the Zomato Dataset. The model with the best fit is considered to be the one with the lowest Root Mean Square Error (RMSE); values between 0.1 and 0.4 are low relative to the 5-point rating scale, which is true in this instance.
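To make explicit what this metric computes (a small sketch reusing the arrays from above), the RMSE is simply the square root of the mean of the squared differences between the observed and predicted values:

# RMSE computed by hand; the result should equal RMSE_Test from above.
manual_rmse = np.sqrt(np.mean((y_test - Y_Test_Prediction) ** 2))
print("{} {:.2f}".format("The manually computed RMSE for the testing set is:", manual_rmse))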

  1. Let’s now look at the R2 Score from sklearn metrics.
from sklearn.metrics import r2_score

# Linear Regression Model Evaluation For Training Set
R2_Training = r2_score(y_train, Y_Training_Prediction)

# Model Performance For Training Set
print("{} {:.2f}".format("The R2 Score For The Training Set:", R2_Training))
## The R2 Score For The Training Set: 0.29
# Linear Regression Model Evaluation For Testing Set
R2_Test = r2_score(y_test, Y_Test_Prediction)

# Model Performance For Testing Set
print("{} {:.2f}".format("The R2 Score For The Testing Set:", R2_Test))
## The R2 Score For The Testing Set: 0.30

This metric reveals the proportion of the variance of the target variable rate (\(y\)) that is accounted for by the multiple regression model's feature variables (\(X\)). In other words, it evaluates how strongly the feature variables (\(X\)) are related to the target variable through the model.

It is also considered that the model with the best fit will be the one with the highest R2 Score, preferably above 0.7, which is Not true in this instance.
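For reference, the R2 Score can also be reproduced by hand (a small sketch using the same arrays) as one minus the ratio of the residual sum of squares to the total sum of squares:

# R2 computed by hand; the result should equal R2_Test from above.
ss_res = np.sum((y_test - Y_Test_Prediction) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot
print("{} {:.2f}".format("The manually computed R2 Score for the testing set is:", manual_r2))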

Multiple Linear Regression Residual Analysis

Let’s further look at Residual Analysis in more detail from Statistics How To.

The Residual is the deviation, or error, between each observed value and the value predicted by the multiple linear regression.

Residual = Observed Value – Predicted Value.

After evaluating my model using evaluation metrics like the Root Mean Square Error (RMSE) and R2 Score, I will employ Residual Analysis to determine whether the multiple linear regression is biased or not.

# The basic aesthetic of the plots will be defined by a number of settings that I will set.
sns.set(style = "whitegrid", context = "notebook")
plt.figure(figsize = (20,10))

# The differences between the y_train and Y_Training_Prediction are called Residuals.
sns.scatterplot(x = Y_Training_Prediction, y = y_train - Y_Training_Prediction,
    c='deepskyblue', marker='x', label='Training Data', alpha = 0.8)
sns.scatterplot(x = Y_Test_Prediction, y = y_test - Y_Test_Prediction,
    c='lightcoral', marker='<', label='Testing Data', alpha = 0.8)
plt.title("Residual Analysis", fontsize=20)    
plt.xlabel('Predicted Values', fontsize = 15)
plt.ylabel('Residuals', fontsize=15)
plt.legend(loc='upper right')
plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='red', linestyle='dashed', linewidth=1.5)
plt.xlim([3.3, 6.6])
## (3.3, 6.6)
plt.show()

The Residual plot does NOT appear to be well behaved, because the deviations (errors) associated with the projected values are scattered irregularly over the entire plot. The inverted U-shaped pattern on the left side of the illustration is evidence of non-linearity.

The fact that the Residuals are not evenly spread around 0 over the whole range of projected values is further evidence of non-linearity.

Multiple Linear Regression Prediction

Now let’s strive to provide some clarity on the data by predicting the rate for the testing set and displaying the variance between the projected and actual values.

# Predict the feature of the testing data.
y_test_pred = LinRegModel.predict(X_test)

# The basic aesthetic of the plots will be defined by a number of settings that I will set.
sns.set(style = "whitegrid", context = "notebook")
plt.figure(figsize = (20,10))

# Fitting of a linear regression line in the prediction.
sns.regplot(x = y_test, y = y_test_pred, scatter_kws={"color": "limegreen", 'alpha':0.8}, line_kws={"color": "tomato"})
plt.title("Actual VS Projected Restaurant Rate", fontsize=20)
plt.xlabel("Actual Restaurant Rate", fontsize=15)
plt.ylabel("Projected Restaurant Rate", fontsize=15)
plt.show()

The purpose of linear regression is to find the line with the lowest deviation between the projected and actual values. The linear regression line in the graph indicates that the projected values are rather low for restaurants with a higher actual rating.

Multiple Linear Regression Conclusion

  • 1. Root Mean Square Error (RMSE): The RMSE informs me that the typical deviation between the model's projected and actual restaurant rate values is 0.37. This metric suggests that my multiple linear regression model performed reasonably well, since the RMSE is quite low. However, the R2 Score is preferred over the RMSE, since the RMSE is scale-dependent (it depends on the units of the target and on the chosen features \(X\)) and is not a standardized metric. (QUITE YES)

  • 2. R2 Score: The R2 Score informs me that selected features \(X\) in the multiple linear regression can account for 30% of the variation in the restaurant rate, which is regarded as weak correlation. (NO)

  • 3. Residual Analysis: The Residual Analysis showed that my multiple linear regression model did Not perform well since the residuals were Not distributed evenly across the entire plot. (NO)

Thus, according to the aforementioned analysis and evaluation, it can be concluded that the multiple linear regression model which includes online_order, book_table, votes, location, rest_type, cuisines, two_people_cost and meal_type as features \(X\) on my target variable rate fails to accurately and successfully fit the dataset.

Model Selection

As a result of the multiple regression model's poor accuracy and, more importantly, its failure to accurately fit the Zomato Dataset, I will reevaluate the research problem and identify some additional modelling approaches.

  1. I already acknowledge the fact that the nature of the target variable rate is suitable for Regression Analysis.

  2. Since I want to predict the success rate of a new restaurant based on the different features available, Supervised Learning will be ideal.

By taking these considerations into account, I can choose from a limited selection of supervised learning models that are suited to a regression analysis problem. The main candidates are listed below (a small comparison sketch follows the list):

  1. BaggingRegressor
  2. Decision Tree Regression
  3. ExtraTreesRegressor
  4. GradientBoostingRegressor
  5. LGBM
  6. Linear SVR
  7. MLPRegressor
  8. Random Forest
  9. RidgeCV
  10. Stochastic Gradient Descent
  11. Support Vector Machines
  12. VotingRegressor

After running most of these models, I have identified the two models that fit this regression problem the best.

  1. Decision Tree Regressor
  2. Extra Trees Regressor

Decision Tree Regressor Model

I will construct a decision tree regressor model with the selected features \(X\) (online_order, book_table, votes, location, rest_type, cuisines, two_people_cost and meal_type) for my target variable \(y\), rate.

from sklearn.tree import DecisionTreeRegressor

# Determine the Machine Learning Model 
DecTreeRegModel = DecisionTreeRegressor(random_state=42, splitter = "best", max_depth=50, min_samples_split=3)

# Training the Decision Tree Regressor Algorithm on the Training Set
DecTreeRegModel.fit(X_train, y_train)
## DecisionTreeRegressor(max_depth=50, min_samples_split=3, random_state=42)

Decision Tree Regressor Evaluation Metrics

For a more comprehensive analysis of how well my Decision Tree Regressor fits the Zomato Dataset, I will employ the Root Mean Square Error (RMSE), R2 Score and K-Fold Cross Validation measurements.

  1. Let’s first look at the Root Mean Square Error (RMSE) from sklearn metrics.
# Decision Tree Regressor Model Evaluation For Testing Set
Y_Test_Prediction = DecTreeRegModel.predict(X_test)
RMSE_Test = np.sqrt(mean_squared_error(y_test, Y_Test_Prediction))

# Decision Tree Regressor Model Performance For Testing Set
print("{} {:.2f}".format("The Root Mean Square Error For The Testing Set is:", RMSE_Test))
## The Root Mean Square Error For The Testing Set is: 0.15

It is considered that the model with the best fit will be the one with the lowest Root Mean Square Error (RMSE); at 0.15, well below the multiple linear regression's 0.36, this is Absolutely True in this instance.

  1. Let’s now look at the R2 Score from sklearn metrics.
# Decision Tree Regressor Model Evaluation For Testing Set
Y_Test_Prediction = DecTreeRegModel.predict(X_test)
R2_Test  = r2_score(y_test, Y_Test_Prediction)

# Model Performance For Testing Set
print("{} {:.2f}".format("The Decision Tree Regressor Model Accuracy Score is:", R2_Test))
## The Decision Tree Regressor Model Accuracy Score is: 0.88

It is considered that the model with the best fit will be the one with the highest R2 Score, preferably above 0.7, which is Absolutely True in this instance.

  1. Let’s now look at the K-Fold Cross-Validation from sklearn metrics.

Using the K-Fold Cross-Validation technique with 17 folds, the model is trained 17 times; in each run, 16 folds are used for training and the remaining fold for validation. I can use this technique to assess how well my decision tree regressor model generalises.

from sklearn.model_selection import cross_val_score, KFold

# Divide The Dataset Into 17 Subsets of Data.
Kfold = KFold(n_splits = 17, random_state = 42, shuffle = True)

# Verify The Model's Generalization Across The Entire Dataset.
Cross_Val_Score = cross_val_score(DecTreeRegModel, X_train, y_train, cv=Kfold)

# Model Performance For Testing Set
print("{} {:.2f}".format("The Mean Cross Validation Score is:", Cross_Val_Score.mean()))
## The Mean Cross Validation Score is: 0.88

It can be observed that the mean cross-validation score improves as the number of folds grows, because each training split becomes larger and its size differs less from that of the full training dataset.

This indicates that the model's bias is decreasing and that there is little sign of over- or under-fitting on the training data. As a result, it can be stated that the model is good.
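This behaviour can be inspected directly with the short sketch below (assuming DecTreeRegModel and the training split from above; the fold counts are illustrative), which recomputes the mean cross-validation score for several fold counts:

# Mean cross-validation score for several fold counts.
for n_splits in [5, 10, 17]:
    kfold_check = KFold(n_splits=n_splits, random_state=42, shuffle=True)
    fold_scores = cross_val_score(DecTreeRegModel, X_train, y_train, cv=kfold_check)
    print("n_splits = {}: mean score = {:.3f}".format(n_splits, fold_scores.mean()))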

Decision Tree Regressor Residual Analysis

Let’s further look at Residual Analysis in more detail from Statistics How To.

Residual = Observed Value – Predicted Value.

After evaluating my model using evaluation metrics like the Root Mean Square Error (RMSE), R2 Score and K-Fold Cross Validation, I will employ Residual Analysis to verify that the Decision Tree Regressor is not biased, as the metrics above already suggest.

# The basic aesthetic of the plots will be defined by a number of settings that I will set.
sns.set(style = "whitegrid", context = "notebook")
plt.figure(figsize = (20,10))

# Decision Tree Regressor Model Evaluation For Testing & Training Set
Y_Test_Prediction = DecTreeRegModel.predict(X_test)
Y_Training_Prediction = DecTreeRegModel.predict(X_train)

# The differences between the y_train and Y_Training_Prediction are called Residuals.
sns.scatterplot(x = Y_Training_Prediction, y = y_train - Y_Training_Prediction,
    c='deepskyblue', marker='o', label='Training Data', alpha = 0.8)
sns.scatterplot(x = Y_Test_Prediction, y = y_test - Y_Test_Prediction,
    c='lightcoral', marker='s', label='Testing Data', alpha = 0.8)
plt.title("Residual Analysis", fontsize=20)    
plt.xlabel('Predicted Values', fontsize = 15)
plt.ylabel('Residuals', fontsize=15)
plt.legend(loc='upper right')
plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='red', linestyle='dashed', linewidth=1.5)
plt.xlim([2, 5])
## (2.0, 5.0)
plt.show()

The distribution of data points for positive and negative residuals appears to be balanced on the residual plot. The fact that the Residuals are quite evenly spread around 0 over the whole range of projected values is further evidence of a better-fitting model and less bias.

Decision Tree Regressor Prediction

Now let’s strive to provide some clarity on the data by visualizing the difference between Actual Values and Predicted Value in a graph.

# Predict the feature of the testing data.
y_test_pred = DecTreeRegModel.predict(X_test)

# The basic aesthetic of the plots will be defined by a number of settings that I will set.
sns.set(style = "whitegrid", context = "notebook")
plt.figure(figsize = (20,10))

# Fitting of a linear regression line in the prediction.
sns.regplot(x = y_test, y = y_test_pred, scatter_kws={"color": "limegreen", 'alpha':0.8}, line_kws={"color": "tomato"})
plt.title("Actual VS Projected Restaurant Rate", fontsize=20)
plt.xlabel("Actual Restaurant Rate", fontsize=15)
plt.ylabel("Projected Restaurant Rate", fontsize=15)
plt.show()

The linear regression line in the graph indicates that the model correctly predicts the Zomato Dataset given the minimal and unbiased difference between the actual values and the model’s projected values.

Decision Tree Regressor Conclusion

  • 1. Root Mean Square Error (RMSE): The RMSE informs me that the typical deviation between the model's projected and actual restaurant rate values is 0.15. This metric showed that my decision tree regressor performed well, since the RMSE is quite low. (DEFINITELY YES)

  • 2. R2 Score: The R2 Score informs me that selected features \(X\) in the decision tree regressor can account for 88% of the variation in the restaurant rate, which is regarded as strong correlation. (YES)

  • 3. Residual Analysis: The Residual Analysis showed that my decision tree regressor model did perform well since the residuals were fairly distributed around 0 across the entire plot. (YES)

Thus, according to the aforementioned analysis and evaluation, it can be concluded that the decision tree regressor model succeeds in accurately fitting the dataset.

Extra Trees Regressor Model

Due to the similarities between this model and the prior model, the model evaluation will be quite simple and straightforward.

from sklearn.ensemble import ExtraTreesRegressor

# Determine the Machine Learning Model 
ExtTreeRegModel = ExtraTreesRegressor(n_estimators=200,  random_state=42)

# Training the Extra Trees Regressor Algorithm on the Training Set
ExtTreeRegModel.fit(X_train, y_train)
## ExtraTreesRegressor(n_estimators=200, random_state=42)

Extra Trees Regressor Evaluation Metrics

  1. Root Mean Square Error (RMSE)
# Extra Trees Regressor Model Evaluation For Testing Set
Y_Test_Prediction = ExtTreeRegModel.predict(X_test)
RMSE_Test = np.sqrt(mean_squared_error(y_test, Y_Test_Prediction))

# Extra Trees Regressor Model Performance For Testing Set
print("{} {:.2f}".format("The Root Mean Square Error For The Testing Set is:", RMSE_Test))
## The Root Mean Square Error For The Testing Set is: 0.11

The model is considered quite good so far, since its Root Mean Square Error (RMSE) of 0.11 is even lower than that of the Decision Tree Regressor.

  1. R2 Score
# Extra Trees Regressor Model Evaluation For Testing Set
Y_Test_Prediction = ExtTreeRegModel.predict(X_test)
R2_Test  = r2_score(y_test, Y_Test_Prediction)

# Model Performance For Testing Set
print("{} {:.2f}".format("The Extra Trees Regressor Model Accuracy Score is:", R2_Test))
## The Extra Trees Regressor Model Accuracy Score is: 0.93

The model is considered quite a good fit, since it has a high R2 Score of 0.93.

  1. K-Fold Cross Validation

I have trained 20 different models using the K-Fold Cross Validation technique.

# Divide The Dataset Into 20 Subsets of Data.
Kfold = KFold(n_splits = 20, random_state = 42, shuffle = True)

# Verify The Model's Generalization Across The Entire Dataset.
Cross_Val_Score = cross_val_score(ExtTreeRegModel, X_train, y_train, cv=Kfold)

# Model Performance For Testing Set
print("{} {:.2f}".format("The Mean Cross Validation Score is:", Cross_Val_Score.mean()))
## The Mean Cross Validation Score is: 0.93

The biases are diminishing and therefore there is less chance of over-fitting or under-fitting, thus the model seems to be performing reasonably well so far.

Extra Trees Regressor Residual Analysis

Residual = Observed Value – Predicted Value.

# The basic aesthetic of the plots will be defined by a number of settings that I will set.
sns.set(style = "whitegrid", context = "notebook")
plt.figure(figsize = (20,10))

# Extra Trees Regressor Model Evaluation For Testing & Training Set
Y_Test_Prediction = ExtTreeRegModel.predict(X_test)
Y_Training_Prediction = ExtTreeRegModel.predict(X_train)

# The differences between the y_train and Y_Training_Prediction are called Residuals.
sns.scatterplot(x = Y_Training_Prediction, y = y_train - Y_Training_Prediction,
    c='deepskyblue', marker='o', label='Training Data', alpha = 0.8)
sns.scatterplot(x = Y_Test_Prediction, y = y_test - Y_Test_Prediction,
    c='lightcoral', marker='s', label='Testing Data', alpha = 0.8)
plt.title("Residual Analysis", fontsize=20)    
plt.xlabel('Predicted Values', fontsize = 15)
plt.ylabel('Residuals', fontsize=15)
plt.legend(loc='upper right')
plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='red', linestyle='dashed', linewidth=1.5)
plt.xlim([2, 5])
## (2.0, 5.0)
plt.show()

The fact that the Residuals are quite evenly spread around 0 over the whole range of projected values is solid evidence of a better-fitting and less biased model.

Extra Trees Regressor Prediction

Visualizing the difference between Actual Values and Predicted Value in a graph.

# Predict the feature of the testing data.
y_test_pred = ExtTreeRegModel.predict(X_test)

# The basic aesthetic of the plots will be defined by a number of settings that I will set.
sns.set(style = "whitegrid", context = "notebook")
plt.figure(figsize = (20,10))

# Fitting of a linear regression line in the prediction.
sns.regplot(x = y_test, y = y_test_pred, scatter_kws={"color": "limegreen", 'alpha':0.8}, line_kws={"color": "tomato"})
plt.title("Actual VS Projected Restaurant Rate", fontsize=20)
plt.xlabel("Actual Restaurant Rate", fontsize=15)
plt.ylabel("Projected Restaurant Rate", fontsize=15)
plt.show()

The model correctly predicts the Zomato Dataset given the minimal and unbiased difference between the actual values and the model’s projected values.

Model Comparison

I have built and evaluated 3 machine learning models. As far as a regression algorithm is concerned, a good fit is one where:

  1. The Root Mean Squared Error (RMSE) is low, preferably between 0.1 and 0.4 on this rating scale.
  2. The R2 Score is high, preferably above 0.7.
  3. The Residuals are quite evenly spread around 0 over the whole range of projected values.

If these conditions were met with a (YES), then I performed K-Fold Cross Validation and Model Prediction to check for Over- or Under-Fitting and Bias respectively.

  1. K-Fold Cross Validation: does the mean score stay high as the number of folds grows and each training split approaches the size of the full training set? If so, it indicates No Over- or Under-Fitting.
  2. Model Prediction: does the fitted line show a low deviation between the projected and actual values? If so, it indicates Low Bias.

Linear Regression

  1. RMSE: (QUITE YES)
  2. R2 Score: (NO)
  3. Residual: (NO)

Decision Tree Regressor

  1. RMSE: (DEFINITELY YES)
  2. R2 Score: (YES)
  3. Residual : (YES)
  4. K-Fold Cross Validation: (YES)
  5. Model Prediction: (YES)

Extra Trees Regressor:

  1. RMSE: (DEFINITELY YES)
  2. R2 Score: (DEFINITELY YES)
  3. Residual : (DEFINITELY YES)
  4. K-Fold Cross Validation: (YES)
  5. Model Prediction: (DEFINITELY YES)

Prediction Conclusion

After comparing the two models, it has been shown that the Extra Trees Regressor and Decision Tree Regressor both perform remarkably well on the Zomato Dataset. However, the Extra Trees Regressor has a slight advantage since it has a Better Model Prediction and Accuracy Score.

The Extra Trees Regressor is considered good, meaning that it can solve the main underlying problem and concern that business owners have about whether their potential new eatery in Bangalore would be successful in the market.

The analysis is greatly insightful for Business Owners and Potential Investors, as it can determine quite accurately whether their new business will attain the rating they are aiming for.

Phase 4: Challenge Delivery

Since a reliable model has been developed, I scheduled a meeting at Vers010 with the owner in attendance. The meeting took place on 10/06/2024 at Prinsendam 180; 3072 MA Rotterdam. During this meeting, I demonstrated a comprehensive report on the model’s transparency and impact.

Stakeholder Interview

In summary, this interview was conducted to provide a narrative of the AI Methodology Cycle, from the formulation of the research question through the deployment of the model.

Transparency

I began by presenting a short and straightforward overview of Bengaluru’s culinary scene to ensure that my stakeholders comprehended the topic.

Stakeholder Feedback: Your analysis of the Bengaluru food service industry is fairly extensive, and it is clear that you are attempting to help owners decide critically about the sorts of eateries that would receive high ratings based on the features you have highlighted.

After that, I clarified the objective of this prediction analysis which is to predict the success rate of a new restaurant and offer valuable insights that will help potential investors and restaurant (owners) in determining whether a new start-up business will be successful and sustainable in their Bengalorian market so that they can create their business model strategy accordingly.

Stakeholder Feedback: I appreciate that you highlighted guest behavior and experience as factors that could influence a restaurant’s rating. However, based on our expertise, we would like to emphasize how essential it is to effectively resolve guest complaints and convert them into good customer experiences.

Then I went into further detail about Zomato, the one and only service that had helped me in identifying the appropriate dataset and achieving my objective. The stakeholders, however, were unaware of Zomato’s operations and how it maintains a competitive edge in Bangalore, so I had to give them a simple explanation of Zomato’s Business Model.

Stakeholder Feedback: The research summary was indeed simple for us to understand and acknowledge.

Afterwards, I had to explain and familiarize the stakeholder with the Zomato Dataset by outlining the variables impacting the establishment and detailing the various steps I followed to guarantee that Zomato Dataset was ready (cleaned) to be utilized in the model. The most effective way to verify that was to demonstrate my Exploratory Data Analysis.

Stakeholder 1 Feedback: Thanks to the documentation you are employing, we can easily follow your thoughts and the actions you undertook to clean up your data. Your comments within the code chunk make it simple and easy for us to realize what is going on even though we are not programmers. It is obvious from your comprehensive description that you knew exactly what you were doing at every step.

Stakeholder 2 Feedback: Your exploratory data analysis has revealed to be the most effective so far since it helps us to condense large amounts of data into easily comprehensible patterns.

Model Demonstration

Lastly, I have made an interactive demonstration with the owners of Vers010 on how accurately my model predicts the rating for new restaurants in Bangalore with the features (restaurant attributes) that are available to me from the Zomato Dataset.

The demonstration has been arranged in a systematic and well-organized way, with a plan and a demonstrable technique of monitoring the stakeholders and receiving their input.

Step 1: I have demonstrated and explained my Zomato Dataset and regression model to the Vers010 owners.

Stakeholder_Demo = df.head(5)
  row_num name online_order book_table rate votes location rest_type cuisines two_people_cost meal_type city_neighborhood
0 0 Jalsa Yes Yes 4.1 775 Banashankari Casual Dining North Indian, Mughlai, Chinese 800.0 Buffet Banashankari
1 1 Spice Elephant Yes No 4.1 787 Banashankari Casual Dining Chinese, North Indian, Thai 800.0 Buffet Banashankari
2 2 San Churro Cafe Yes No 3.8 918 Banashankari Cafe, Casual Dining Cafe, Mexican, Italian 800.0 Buffet Banashankari
3 3 Addhuri Udupi Bhojana No No 3.7 88 Banashankari Quick Bites South Indian, North Indian 300.0 Buffet Banashankari
4 4 Grand Village No No 3.8 166 Basavanagudi Casual Dining North Indian, Rajasthani 600.0 Buffet Banashankari

Step 2: I let the Vers010 owners select any 10 eateries from the available Zomato Dataset. The owners manually decided which 10 eateries would be the target of the prediction.

# I collected their input in a separate spreadsheet and imported it here for the prediction.

Stakeholder_Selection = pd.read_csv("Stakeholder_Selection.csv", encoding = "latin-1")
  name online_order book_table rate votes location rest_type cuisines two_people_cost meal_type city_neighborhood
0 Chutney Chang Yes Yes 4.1 2365 Jayanagar Casual Dining North Indian, Chinese, BBQ 1500 Dine-out Basavanagudi
1 Once Upon a Rooftop No Yes 4.3 1278 Jayanagar Casual Dining, Bar Pizza, Italian, Chinese, Thai 1000 Dine-out Basavanagudi
2 Thyme & Whisk Yes No 4.2 109 Jayanagar Casual Dining Asian, Chinese, Continental, Italian 800 Dine-out Basavanagudi
3 Fusion Theory Yes Yes 4.1 336 Jayanagar Casual Dining Continental, Biryani, Desserts, Italian, North Indian, Chinese, Modern Indian, Asian 700 Dine-out Basavanagudi
4 Toscano Yes Yes 4.3 1161 Jayanagar Casual Dining Italian, Salad 1300 Dine-out Basavanagudi
5 Ranganna Military Hotel No No 4.2 760 Jayanagar Quick Bites South Indian, Biryani 350 Dine-out Basavanagudi
6 Kapoor's Cafe Yes No 4.0 201 Jayanagar Casual Dining North Indian 800 Dine-out Basavanagudi
7 Central Jail Restaurant Yes No 3.7 429 Jayanagar Casual Dining Andhra, Seafood, North Indian, Chinese 700 Dine-out Basavanagudi
8 Alchemy Coffee Roasters No No 4.2 737 Jayanagar Cafe Cafe 500 Dine-out Basavanagudi
9 Andhra Ruchulu Yes Yes 4.1 682 Jayanagar Casual Dining, Bar Andhra, Biryani, North Indian, Chinese 1000 Dine-out Basavanagudi

Step 3: I have removed the rate column from the stakeholder selection so that the restaurants would appear to be brand-new and without a rating.

Stakeholder_Selection_Pred = Stakeholder_Selection.drop(["rate"], axis = 1)
  name online_order book_table votes location rest_type cuisines two_people_cost meal_type city_neighborhood
0 Chutney Chang Yes Yes 2365 Jayanagar Casual Dining North Indian, Chinese, BBQ 1500 Dine-out Basavanagudi
1 Once Upon a Rooftop No Yes 1278 Jayanagar Casual Dining, Bar Pizza, Italian, Chinese, Thai 1000 Dine-out Basavanagudi
2 Thyme & Whisk Yes No 109 Jayanagar Casual Dining Asian, Chinese, Continental, Italian 800 Dine-out Basavanagudi
3 Fusion Theory Yes Yes 336 Jayanagar Casual Dining Continental, Biryani, Desserts, Italian, North Indian, Chinese, Modern Indian, Asian 700 Dine-out Basavanagudi
4 Toscano Yes Yes 1161 Jayanagar Casual Dining Italian, Salad 1300 Dine-out Basavanagudi
5 Ranganna Military Hotel No No 760 Jayanagar Quick Bites South Indian, Biryani 350 Dine-out Basavanagudi
6 Kapoor's Cafe Yes No 201 Jayanagar Casual Dining North Indian 800 Dine-out Basavanagudi
7 Central Jail Restaurant Yes No 429 Jayanagar Casual Dining Andhra, Seafood, North Indian, Chinese 700 Dine-out Basavanagudi
8 Alchemy Coffee Roasters No No 737 Jayanagar Cafe Cafe 500 Dine-out Basavanagudi
9 Andhra Ruchulu Yes Yes 682 Jayanagar Casual Dining, Bar Andhra, Biryani, North Indian, Chinese 1000 Dine-out Basavanagudi

This step 3 is not required in a real-world setting, because new eateries never have ratings before they even open. I have however constructed it this way so that the Vers010 owners may independently compare the predicted and actual ratings.

Step 4: I have encoded all the columns of the stakeholder selection in order to employ the model.

Stakeholder_Selection_Pred = FactorizedColumns(Stakeholder_Selection_Pred)
  name online_order book_table votes location rest_type cuisines two_people_cost meal_type city_neighborhood
0 0 0 0 2365 0 0 0 1500 0 0
1 1 1 0 1278 0 1 1 1000 0 0
2 2 0 1 109 0 0 2 800 0 0
3 3 0 0 336 0 0 3 700 0 0
4 4 0 0 1161 0 0 4 1300 0 0
5 5 1 1 760 0 2 5 350 0 0
6 6 0 1 201 0 0 6 800 0 0
7 7 0 1 429 0 0 7 700 0 0
8 8 1 1 737 0 3 8 500 0 0
9 9 0 0 682 0 1 9 1000 0 0

Step 5: I have provided the brand-new eateries listed above with a prediction rating based on the results of my model.

# The X are my selection of features same as before.
Stakeholder_Selection_X_Pred = Stakeholder_Selection_Pred[["online_order", "book_table", "votes", "location", "rest_type", "cuisines", "two_people_cost", "meal_type"]]

# Predict new restaurant rate with the feature of the testing data.
Stakeholder_Selection_y_Pred = ExtTreeRegModel.predict(Stakeholder_Selection_X_Pred)
Stakeholder_Selection_y_Pred_Rate = [round(prediction_rate, 1) for prediction_rate in Stakeholder_Selection_y_Pred]
Stakeholder_Selection_y_Pred_Rate
## [4.1, 4.2, 3.7, 4.1, 4.2, 4.1, 3.8, 3.8, 4.1, 4.0]

Step 6: The predictions from my model have been combined with the spreadsheet that the Vers010 owners previously developed.

# Put the list in a column of the Vers010 owners spreadsheet.
Stakeholder_Selection["prediction_rate"] = Stakeholder_Selection_y_Pred_Rate
Stakeholder_Selection = Stakeholder_Selection[["name", "rate", "prediction_rate"]].reset_index(drop = True)
  name rate prediction_rate
0 Chutney Chang 4.1 4.1
1 Once Upon a Rooftop 4.3 4.2
2 Thyme & Whisk 4.2 3.7
3 Fusion Theory 4.1 4.1
4 Toscano 4.3 4.2
5 Ranganna Military Hotel 4.2 4.1
6 Kapoor's Cafe 4.0 3.8
7 Central Jail Restaurant 3.7 3.8
8 Alchemy Coffee Roasters 4.2 4.1
9 Andhra Ruchulu 4.1 4.0

By comparing the actual and predicted ratings, it is noticeable that the prediction of the restaurant rate is reasonably accurate. The largest deviation is only half a point (0.5), and most predictions are within 0.1, which is not bad at all for an eatery.
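The same comparison can be summarised numerically (a small sketch using the Stakeholder_Selection frame shown above):

# Mean and maximum absolute difference between the actual and predicted ratings.
abs_diff = (Stakeholder_Selection["rate"] - Stakeholder_Selection["prediction_rate"]).abs()
print("Mean absolute difference: {:.2f}".format(abs_diff.mean()))
print("Maximum absolute difference: {:.2f}".format(abs_diff.max()))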

Additional Step 7: If Vers010 owners want to export the spreadsheet containing these predictions, they can un-comment the code by removing the # from the second line and running the whole notebook!

# Converts DataFrame into CSV Data

# Stakeholder_Selection.to_csv('Stakeholder_Prediction.csv', header=True, index=False, encoding = "latin-1")

Stakeholder Conclusion Feedback: In fact, your exploratory data research was somewhat valuable to us in identifying some data patterns and determining how we could move forward with our Vers010 strategy in an equitable way. The model, which is relatively simple, predicts the rating of a brand-new restaurant in Bangalore.

Stakeholder Summary Feedback: Guest decisions about which restaurants to eat at or utilize for online ordering can be informed by ratings and reviews. It can be difficult to choose the greatest restaurant without looking at the overall rating. But a great restaurant is one where customers always get the same food quality/taste and level of service. This builds a solid brand and encourages the visitor to pay the restaurant another visit.

Impact

I have already explained how my model operates and why the outcomes are the way they are, and I concluded my interview by discussing the expected or probable technological impacts the model will have on people’s daily lives.

1. Stakeholders

  • Business Owner: The ability to accurately predict the rate of a restaurant will help them to make educated decisions and develop data-informed strategies.
  • Potential Investor: The ability to accurately predict the rate of a restaurant will help them to manage their investments sensibly and securely.
  • Guests: The ability to accurately predict the rate of a restaurant will help them to choose mindfully where to eat/order because there are so many selections in Bangalore. A positive rate will inform the guests that the meal will be tasty, carefully cooked, and worth a visit in comparison to other eateries.
  • Employee: The ability to accurately predict the rate of a restaurant will help businesses generate an estimated rise in income and employment, which reduces social inequities and keeps long-term unemployed people out of severe poverty.

2. Impact on Society

  • Thanks to this machine learning model, potential business owners won’t have to spend their valuable time or resources creating a hastily thought-out business plan. This model will assist them in making informed choices regarding which kind of market and financial analyses to consider in an effective way.

3. Privacy

  • It is possible that a number of Zomato restaurants copied reviews from other websites and put them on their own pages. Although these are apparently genuine reviews from actual guests, the reviews rightfully belong to their authors; reusing or republishing them without permission is unlawful and constitutes a copyright infringement.

4. Data

  • Publishing a fraudulently favourable online review of a reputable eatery is a violation of legislation at the federal and state level. There might have been a lot of inaccurate information had I used reviews as a feature.
  • Data theft without consent is unethical and immoral whenever it relates to information that does not belong to Zomato.
  • This machine learning model is resilient enough to provide reasonably accurate predictions of restaurant ratings, even though subjective factors (psychological, intellectual or emotional) influence how guests rate a restaurant.

5. Inclusivity

  • Machine learning algorithms could be biased at any point in their development. Since algorithms are created and designed by humans, they cannot be completely bias-free.

Further Improvement

Stakeholder Improvement: You should have considered features such as authentic reviews, dish price, menu variety, taste score or even hygiene score because these variables affect the restaurant rating.

Personal Improvement: In order to increase the model's predictability, I should have done additional research on both subjective factors (psychological, intellectual or emotional) and matters like how complaints are handled and how to identify changing client preferences and trends.

Conclusion

Since the beginning of my minor, I have been driven by a strong interest in predictive analysis, leading me to undertake this challenge. My goal was to apply the AI methodology taught in my major program at Fontys to a real-world scenario, specifically focusing on Zomato restaurant data in Bengaluru. Over the past six months, I have worked on this project iteratively, continuously refining my approach based on feedback from teachers and peers at Fontys.

Throughout this process, I followed a structured framework, ensuring that each iteration brought me closer to a well-constructed predictive analysis. The iterative nature of my work allowed for continuous improvement and incorporation of insights from various stakeholders. This approach not only honed my technical skills but also enhanced my understanding of the predictive analysis framework.

The project has been submitted to both my minor and major studies, serving as a testament to my ability to perform predictive analysis using Python and to follow a specific analytical framework. The dual submission underscores the interdisciplinary relevance of my work and its alignment with the learning outcomes of both programs.

Embarking on this project at the start of my minor, I managed my own pace and tackled various iterations over six months. Each iteration involved refining models, improving data preprocessing techniques, and validating results to ensure robustness and accuracy. This final iteration represents the culmination of all these efforts.

Additionally, this project contributes significantly to my learning outcomes for the minor, especially within the technical portfolio. It demonstrates my capability to apply theoretical knowledge in a practical setting, addressing complex problems through data-driven solutions. The skills and insights gained from this project will undoubtedly be invaluable as I continue to explore and innovate within the field of predictive analysis.

In conclusion, this project not only showcases my proficiency in predictive analysis and Python but also highlights my commitment to continuous learning and improvement. The structured, iterative approach I adopted, combined with the invaluable feedback from Fontys, has culminated in a robust and insightful predictive analysis. This work stands as a significant milestone in my academic journey, reflecting both my technical capabilities and my dedication to achieving excellence in my studies.