Load Required Libraries

library(dplyr)
library(tidyr)
library(knitr)
library(ggplot2)
library(moments)
library(caret)

Load Dataset

data <- read.csv("data.csv", stringsAsFactors = FALSE)
data <- janitor::clean_names(data)
head(data)

##   restaurant_name dining_rating delivery_rating dining_votes delivery_votes
## 1      Doner King           3.9             4.2           39              0
## 2      Doner King           3.9             4.2           39              0
## 3      Doner King           3.9             4.2           39              0
## 4      Doner King           3.9             4.2           39              0
## 5      Doner King           3.9             4.2           39              0
## 6      Doner King           3.9             4.2           39              0
##     cuisine place_name       city                         item_name best_seller
## 1 Fast Food   Malakpet  Hyderabad               Platter Kebab Combo  BESTSELLER
## 2 Fast Food   Malakpet  Hyderabad           Chicken Rumali Shawarma  BESTSELLER
## 3 Fast Food   Malakpet  Hyderabad            Chicken Tandoori Salad            
## 4 Fast Food   Malakpet  Hyderabad                 Chicken BBQ Salad  BESTSELLER
## 5 Fast Food   Malakpet  Hyderabad          Special Doner Wrap Combo    MUST TRY
## 6 Fast Food   Malakpet  Hyderabad Chicken Tandoori Pizza [8 inches]  BESTSELLER
##   votes prices
## 1    84    249
## 2    45    129
## 3    39    189
## 4    43    189
## 5    31    205
## 6    48    199

Level 1: Understanding the Data (Basic Exploration)

Question 1.1: What is the structure of the dataset (number of rows, columns, and data types)?

str(data)

## 'data.frame':    123657 obs. of  12 variables:
##  $ restaurant_name: chr  "Doner King" "Doner King" "Doner King" "Doner King" ...
##  $ dining_rating  : num  3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 ...
##  $ delivery_rating: num  4.2 4.2 4.2 4.2 4.2 4.2 4.2 4.2 4.2 4.2 ...
##  $ dining_votes   : int  39 39 39 39 39 39 39 39 39 39 ...
##  $ delivery_votes : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cuisine        : chr  "Fast Food" "Fast Food" "Fast Food" "Fast Food" ...
##  $ place_name     : chr  "Malakpet" "Malakpet" "Malakpet" "Malakpet" ...
##  $ city           : chr  " Hyderabad" " Hyderabad" " Hyderabad" " Hyderabad" ...
##  $ item_name      : chr  "Platter Kebab Combo" "Chicken Rumali Shawarma" "Chicken Tandoori Salad" "Chicken BBQ Salad" ...
##  $ best_seller    : chr  "BESTSELLER" "BESTSELLER" "" "BESTSELLER" ...
##  $ votes          : int  84 45 39 43 31 48 27 59 29 31 ...
##  $ prices         : num  249 129 189 189 205 199 165 165 115 129 ...

#Interpretation: This dataset contains information about restaurants and their menu items across multiple cities. It includes details such as restaurant names, dining and delivery ratings, votes, cuisines, item names, best seller indicators, and prices. Each row represents a specific item offered by a restaurant, allowing analysis at both restaurant and item levels.
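The str() output above shows city values that begin with a space (" Hyderabad"). A minimal sketch of removing such whitespace with base R's trimws() follows; the `toy` data frame and its values are made up for illustration and are not taken from the dataset.

```r
# Toy data: hypothetical values that mimic the leading space seen in str(data)
toy <- data.frame(
  city = c(" Hyderabad", "Hyderabad", " Goa"),
  stringsAsFactors = FALSE
)

# Before trimming, " Hyderabad" and "Hyderabad" count as different cities
n_before <- length(unique(toy$city))

# trimws() removes leading and trailing whitespace from each string
toy$city <- trimws(toy$city)
n_after <- length(unique(toy$city))

print(c(before = n_before, after = n_after))
```

Whether trimming would change the number of distinct cities in the real data cannot be told from the output shown; at minimum it would tidy the labels used in later tables and plots.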

Question 1.2: Are there any missing values in the dataset?

data <- data %>%
  filter(!is.na(dining_rating) & !is.na(delivery_rating))
colSums(is.na(data))

## restaurant_name   dining_rating delivery_rating    dining_votes  delivery_votes 
##               0               0               0               0               0 
##         cuisine      place_name            city       item_name     best_seller 
##               0               0               0               0               0 
##           votes          prices 
##               0               0

#Interpretation: Every column reports zero missing values because the counts are taken after rows with a missing dining_rating or delivery_rating have already been filtered out. The raw data did contain missing values in those two rating columns, so all later results are based on a subset of the 123,657 rows reported by str(): only rows with both ratings present are kept.
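To see how many values are actually missing, the count has to happen before the filter. The sketch below shows that ordering on a small made-up data frame (`toy` and its values are hypothetical); it uses base R's complete.cases() in place of the dplyr filter above.

```r
# Toy data with hypothetical values; NA marks a missing rating
toy <- data.frame(
  dining_rating   = c(3.9, NA, 4.2, 4.0),
  delivery_rating = c(4.2, 4.1, NA, 3.8)
)

# Count missing values per column BEFORE dropping any rows
na_before <- colSums(is.na(toy))

# Keep only rows where both ratings are present
keep <- complete.cases(toy[, c("dining_rating", "delivery_rating")])
toy_clean <- toy[keep, ]

# Counting again after the filter necessarily gives zeros for these columns
na_after <- colSums(is.na(toy_clean))

print(na_before)
print(na_after)
print(nrow(toy) - nrow(toy_clean))  # number of rows dropped by the filter
```

Reporting the number of dropped rows alongside the counts makes it clear how much of the data the later analysis rests on.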

Question 1.3: What is the total number of unique restaurants, cities, and cuisines in the dataset?

length(unique(data$restaurant_name))

## [1] 557

length(unique(data$city))

## [1] 16

length(unique(data$cuisine))

## [1] 40

#Interpretation: After filtering, the dataset covers 557 unique restaurant names, 16 distinct city values, and 40 cuisines. This breadth supports comparisons of restaurant performance and customer preferences across regions and culinary categories.

Question 1.4: Which cuisines have the highest average dining ratings?

data %>% 
  group_by(cuisine) %>% 
  summarise(avg_rating = mean(dining_rating,na.rm = TRUE)) %>% 
  arrange(desc(avg_rating))

## # A tibble: 40 × 2
##    cuisine      avg_rating
##    <chr>             <dbl>
##  1 Awadhi             4.5 
##  2 Mexican            4.4 
##  3 Wraps              4.3 
##  4 Andhra             4.2 
##  5 Turkish            4.1 
##  6 Bakery             4.09
##  7 Seafood            4.05
##  8 Shake              3.98
##  9 Pasta              3.93
## 10 North Indian       3.90
## # ℹ 30 more rows

#Interpretation: Awadhi (4.5), Mexican (4.4), and Wraps (4.3) have the highest average dining ratings. The remaining cuisines in the top ten fall between 3.90 and 4.2, so the differences among the leading cuisines are small.

Question 1.5: What is the average dining rating and delivery rating across all restaurants?

colMeans(data[, c("dining_rating", "delivery_rating")], na.rm = TRUE)

##   dining_rating delivery_rating 
##        3.822619        3.970807

#Interpretation: The average dining rating is about 3.82, while the average delivery rating is slightly higher at 3.97. Both means are taken over item rows, so a restaurant with a long menu contributes more than one with a short menu. On that basis, delivery is rated somewhat better than dine-in.
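Because each row is a menu item, a mean over rows weights restaurants by menu size. The sketch below contrasts that with a per-restaurant mean on made-up data; `toy` is hypothetical, and keeping the first row per restaurant_name assumes the rating is constant within a restaurant name, which the real data may not satisfy.

```r
# Toy data: restaurant "A" lists three items, restaurant "B" lists one
toy <- data.frame(
  restaurant_name = c("A", "A", "A", "B"),
  dining_rating   = c(4.0, 4.0, 4.0, 3.0),
  stringsAsFactors = FALSE
)

# Item-level mean: every row counts once, so "A" is weighted three times
item_level_mean <- mean(toy$dining_rating)

# Restaurant-level mean: keep the first row per restaurant, then average
per_restaurant <- toy[!duplicated(toy$restaurant_name), ]
restaurant_level_mean <- mean(per_restaurant$dining_rating)

print(c(item_level = item_level_mean, restaurant_level = restaurant_level_mean))
```

The two figures answer different questions: the first describes the typical menu item, the second the typical restaurant.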

Level 2: Data Extraction & Filtering

Question 2.1: Which restaurants have high dining ratings but low customer engagement?

low_engagement <- data %>% 
  filter(dining_rating >= 4.5 & dining_votes < 50) %>% 
  group_by(restaurant_name) %>% 
  slice(1)
head(low_engagement)

## # A tibble: 6 × 12
## # Groups:   restaurant_name [6]
##   restaurant_name      dining_rating delivery_rating dining_votes delivery_votes
##   <chr>                        <dbl>           <dbl>        <int>          <int>
## 1 AB's - Absolute Bar…           4.7             3.7            0              0
## 2 Brik Oven                      4.6             3.9            0              0
## 3 Chaitanya                      4.5             4.4            0              0
## 4 Chili's Grill & Bar            4.5             4.1            0              0
## 5 Dastarkhwan                    4.5             4              0              0
## 6 Exotica                        4.6             4.3            0              0
## # ℹ 7 more variables: cuisine <chr>, place_name <chr>, city <chr>,
## #   item_name <chr>, best_seller <chr>, votes <int>, prices <dbl>

#Interpretation: Restaurants such as AB's - Absolute Barbecues, Brik Oven, Chaitanya, Chili's Grill & Bar, Dastarkhwan, and Exotica have high dining ratings (4.5 and above) but fewer than 50 dining votes; all six rows shown have zero. These restaurants are rated highly yet show little recorded engagement, although a rating with zero votes may also reflect missing vote counts rather than a genuine lack of feedback.

Question 2.2: Identify restaurants where delivery ratings are higher than dining ratings.

delivery <- data %>% 
  filter(delivery_rating > dining_rating) %>% 
  distinct(restaurant_name, .keep_all = TRUE )

head(delivery)

##          restaurant_name dining_rating delivery_rating dining_votes
## 1             Doner King           3.9             4.2           39
## 2              BrownBear           3.6             4.0          239
## 3 The Thickshake Factory           3.4             3.8           38
## 4             McDonald's           3.2             3.9          137
## 5      Mughal Restaurant           3.8             4.1          258
## 6     Tipsy Topsy Bakery           3.8             3.9          225
##   delivery_votes   cuisine     place_name       city
## 1              0 Fast Food       Malakpet  Hyderabad
## 2              0 Fast Food Himayath Nagar  Hyderabad
## 3              0 Beverages Himayath Nagar  Hyderabad
## 4              0 Fast Food       MPM Mall  Hyderabad
## 5              0  Desserts     Lakdikapul  Hyderabad
## 6              0   Chinese   Saroor Nagar  Hyderabad
##                                                     item_name best_seller votes
## 1                                         Platter Kebab Combo  BESTSELLER    84
## 2                                              Pineapple Cake                 0
## 3                               Belgian Chocolate Thick Shake                22
## 4 Big Spicy Paneer Wrap + Coke + Fries (M) + Veg Pizza McPuff                 0
## 5                                             Chicken Biryani  BESTSELLER   245
## 6                                        Chicken Soft Noodles  BESTSELLER   401
##   prices
## 1 249.00
## 2 500.00
## 3 241.00
## 4 367.86
## 5 229.00
## 6 165.00

#Interpretation: The results show that some restaurants, such as Doner King, BrownBear, and McDonald's, have higher delivery ratings compared to dining ratings. This indicates that these restaurants perform better in delivery services than in dine-in experiences.

Question 2.3: Extract items that are highly popular based on votes but are not marked as best sellers.

items <- data %>%
  filter(
    votes > 100,
    delivery_rating > 4.0,
    dining_rating > 4.0,
    best_seller != "BESTSELLER",
    best_seller != "MUST TRY"
  )

head(items)

##                restaurant_name dining_rating delivery_rating dining_votes
## 1        Siddique Kabab Centre           4.4             4.1          732
## 2        Siddique Kabab Centre           4.4             4.1          732
## 3        Siddique Kabab Centre           4.4             4.1          732
## 4 Shah Ghouse Special Shawarma           4.3             4.1          156
## 5 Shah Ghouse Special Shawarma           4.3             4.1          156
## 6 Shah Ghouse Special Shawarma           4.3             4.1          156
##   delivery_votes   cuisine place_name       city
## 1              0   Chinese Tolichowki  Hyderabad
## 2              0   Chinese Tolichowki  Hyderabad
## 3              0   Chinese Tolichowki  Hyderabad
## 4              0 Beverages  Charminar  Hyderabad
## 5              0 Beverages  Charminar  Hyderabad
## 6              0 Beverages  Charminar  Hyderabad
##                                     item_name    best_seller votes prices
## 1                     Boneless Butter Chicken                  114    168
## 2                     Boneless Butter Chicken                  114    168
## 3                                 Rumali Roti                  848     10
## 4                 SGC Chicken Samoli Shawarma CHEF'S SPECIAL   305    135
## 5                 SGC Chicken Kuboos Shawarma CHEF'S SPECIAL   206    135
## 6 shah ghouse Special Chicken Somali Shawarma CHEF'S SPECIAL   102    155

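#Interpretation: Items such as Boneless Butter Chicken, Rumali Roti, and several shawarma items have more than 100 votes, belong to restaurants rated above 4.0 for both dining and delivery, and are not labelled BESTSELLER or MUST TRY. Some of them still carry a different label (CHEF'S SPECIAL), and Boneless Butter Chicken appears twice, so the extract contains duplicate rows. These are popular items that are not officially highlighted as best sellers, which may represent an opportunity for the restaurants.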
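The extract above contains repeated rows. A minimal sketch of the same kind of filter followed by de-duplication is given below, using base R on made-up data; `toy`, `flagged`, and the values are hypothetical, and the rating conditions are left out to keep the example short.

```r
# Toy data: one item is listed twice, one carries a different label
toy <- data.frame(
  item_name   = c("Rumali Roti", "Butter Chicken", "Butter Chicken",
                  "Kebab Combo", "Shawarma"),
  best_seller = c("", "", "", "BESTSELLER", "CHEF'S SPECIAL"),
  votes       = c(848, 114, 114, 300, 305),
  stringsAsFactors = FALSE
)

# Labels that count as "already highlighted" in this sketch
flagged <- c("BESTSELLER", "MUST TRY")

# Popular (more than 100 votes) and not carrying one of the flagged labels
popular_unflagged <- toy[toy$votes > 100 & !(toy$best_seller %in% flagged), ]

# Drop rows that are exact duplicates of an earlier row
popular_unflagged <- popular_unflagged[!duplicated(popular_unflagged), ]

print(popular_unflagged)
```

Whether labels such as CHEF'S SPECIAL should also count as highlighted is a judgment call; adding them to `flagged` would exclude those items too.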

Question 2.4: Identify expensive items (top 10% by price) across all restaurants.

threshold <- quantile(data$prices, 0.90, na.rm = TRUE)

expensive_items <- data %>%
  filter(prices > threshold)

head(expensive_items)

##   restaurant_name dining_rating delivery_rating dining_votes delivery_votes
## 1       Taco Bell           4.3             3.7          117              0
## 2       Taco Bell           4.3             3.7          117              0
## 3       BrownBear           3.6             4.0          239              0
## 4       BrownBear           3.6             4.0          239              0
## 5       BrownBear           3.6             4.0          239              0
## 6       BrownBear           3.6             4.0          239              0
##     cuisine             place_name       city                      item_name
## 1     Wraps The Next Galleria Mall  Hyderabad         Cheese Max Box Non-Veg
## 2     Wraps The Next Galleria Mall  Hyderabad 7 Layer Burrito Meal - Non Veg
## 3 Fast Food         Himayath Nagar  Hyderabad                 Pineapple Cake
## 4 Fast Food         Himayath Nagar  Hyderabad              Black Forest Cake
## 5 Fast Food         Himayath Nagar  Hyderabad                Red Velvet Cake
## 6 Fast Food         Himayath Nagar  Hyderabad              Butterscotch Cake
##   best_seller votes prices
## 1                 0    449
## 2                 0    408
## 3                 0    500
## 4                 0    550
## 5                 0    700
## 6                 5    550

#Interpretation: The results show that items such as Cheese Max Box, Burrito Meals, and various cakes have high prices, placing them in the top 10% price category. These items represent premium offerings across restaurants, indicating higher pricing strategies for certain menu items.
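As a worked illustration of the 90th-percentile cutoff, the sketch below applies quantile() to ten made-up prices (the `prices` vector is hypothetical). With R's default quantile method, the threshold is interpolated between the ninth and tenth sorted values.

```r
# Ten hypothetical prices: 100, 200, ..., 1000
prices <- seq(100, 1000, by = 100)

# 90th percentile using R's default interpolation method
threshold <- quantile(prices, 0.90, na.rm = TRUE)

# Items strictly above the threshold form the "expensive" group
expensive <- prices[prices > threshold]

print(threshold)
print(expensive)
print(length(expensive) / length(prices))  # share of items above the cutoff
```

When many items share the same price, the share above the cutoff can fall below 10%, because a strict `>` excludes every item priced exactly at the threshold.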

Question 2.5: Which restaurants have the highest number of popular items (based on item votes)?

popular_items <- data %>%
  filter(votes > 100) %>%   
  group_by(restaurant_name) %>%
  summarise(popular_item_count = n()) %>%
  arrange(desc(popular_item_count))

head(popular_items)

## # A tibble: 6 × 2
##   restaurant_name  popular_item_count
##   <chr>                         <int>
## 1 Agarwal Caterers                216
## 2 McDonald's                      158
## 3 Burger King                     134
## 4 Bawarchi                         89
## 5 Grand Hotel                      78
## 6 Mehfil                           74

#Interpretation: Agarwal Caterers has the highest number of popular items (216), followed by McDonald's (158) and Burger King (134). Each count is the number of rows with more than 100 votes, so an item listed more than once is counted each time. With that caveat, these restaurants offer the largest number of highly voted menu entries.
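If repeated listings should count only once, distinct item names can be counted instead of rows. The sketch below compares the two counts on made-up data using base R; `toy` and its values are hypothetical.

```r
# Toy data: restaurant "A" lists "Biryani" twice
toy <- data.frame(
  restaurant_name = c("A", "A", "A", "B", "B"),
  item_name       = c("Biryani", "Biryani", "Kebab", "Burger", "Fries"),
  votes           = c(150, 150, 200, 120, 40),
  stringsAsFactors = FALSE
)

# Keep only items with more than 100 votes
popular <- toy[toy$votes > 100, ]

# Row count per restaurant (what summarise(n()) computes)
row_count <- table(popular$restaurant_name)

# Number of distinct item names per restaurant
distinct_count <- tapply(
  popular$item_name,
  popular$restaurant_name,
  function(x) length(unique(x))
)

print(row_count)
print(distinct_count)
```

In a dplyr pipeline, the same idea would be expressed with n_distinct(item_name) in place of n().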

Level 3: Grouping And Summarization

Question 3.1: What is the average dining and delivery rating for each city?

city_ratings <- data %>%
  group_by(city) %>%
  summarise(
    avg_dining = mean(dining_rating, na.rm = TRUE),
    avg_delivery = mean(delivery_rating, na.rm = TRUE)
  ) %>%
  arrange(desc(avg_dining))


head(city_ratings)

## # A tibble: 6 × 3
##   city            avg_dining avg_delivery
##   <chr>                <dbl>        <dbl>
## 1 " Goa"                4.13         3.91
## 2 " Malleshwaram"       4            4   
## 3 " New Delhi"          3.97         3.93
## 4 " Hyderabad"          3.89         3.99
## 5 " Lucknow"            3.88         3.98
## 6 " Raipur"             3.86         3.91

#Interpretation: The analysis shows that Goa has the highest average dining rating (4.13), indicating strong customer satisfaction for dine-in experiences. Malleshwaram also performs well with balanced dining and delivery ratings (4.0 each).

Question 3.2: Which cuisines have the highest average dining rating?

cuisine_ratings <- data %>%
  group_by(cuisine) %>%
  summarise(avg_rating = mean(dining_rating, na.rm = TRUE)) %>%
  arrange(desc(avg_rating))

ggplot(cuisine_ratings, aes(x = reorder(cuisine, avg_rating), y = avg_rating)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Average Dining Rating by Cuisine", x = "Cuisine", y = "Rating")

head(cuisine_ratings)

## # A tibble: 6 × 2
##   cuisine avg_rating
##   <chr>        <dbl>
## 1 Awadhi        4.5 
## 2 Mexican       4.4 
## 3 Wraps         4.3 
## 4 Andhra        4.2 
## 5 Turkish       4.1 
## 6 Bakery        4.09

#Interpretation: The analysis shows that Awadhi cuisine has the highest average dining rating (4.5), indicating strong customer preference. It is followed by Mexican (4.4) and Wraps (4.3), which also receive high ratings.

Question 3.3: What is the average price of items for each restaurant?

restaurant_price <- data %>%
  group_by(restaurant_name) %>%
  summarise(avg_price = mean(prices, na.rm = TRUE)) %>%
  arrange(desc(avg_price))

head(restaurant_price)

## # A tibble: 6 × 2
##   restaurant_name        avg_price
##   <chr>                      <dbl>
## 1 Zaffran Mataam Alarabi      806.
## 2 Barbeque Nation             657.
## 3 Khalids Biriyani            604.
## 4 Exotica                     562.
## 5 The Fatty Bao               556.
## 6 Mandi Town                  555.

#Interpretation: The analysis shows that Zaffran Mataam Alarabi has the highest average item price (806), followed by Barbeque Nation (657) and Khalids Biriyani (604). These restaurants appear to follow a premium pricing strategy compared to others.

Question 3.4: Which restaurants receive the highest total customer engagement (votes)?

restaurant_votes <- data %>%
  group_by(restaurant_name) %>%
  summarise(total_votes = sum(votes, na.rm = TRUE)) %>%
  arrange(desc(total_votes)) %>% 
  head(10)

ggplot(restaurant_votes, aes(x = reorder(restaurant_name, total_votes), y = total_votes)) +
  geom_bar(stat = "identity") +
  coord_flip()

head(restaurant_votes)

## # A tibble: 6 × 2
##   restaurant_name  total_votes
##   <chr>                  <int>
## 1 Agarwal Caterers      116249
## 2 Bawarchi               90838
## 3 Mehfil                 73798
## 4 McDonald's             69878
## 5 Lucky Restaurant       68021
## 6 Burger King            52193

#Interpretation: The analysis shows that Agarwal Caterers has the highest total votes (116,249), indicating the highest customer engagement. It is followed by Bawarchi (90,838) and Mehfil (73,798), which also receive strong customer interaction.

Question 3.5: What is the average price and average votes for each cuisine?

cuisine_analysis <- data %>%
  group_by(cuisine) %>%
  summarise(
    avg_price = mean(prices, na.rm = TRUE),
    avg_votes = mean(votes, na.rm = TRUE)
  ) %>%
  arrange(desc(avg_votes))

head(cuisine_analysis)

## # A tibble: 6 × 3
##   cuisine      avg_price avg_votes
##   <chr>            <dbl>     <dbl>
## 1 Awadhi            200.     225. 
## 2 North Indian      205.      72.0
## 3 Mandi             205.      69.7
## 4 Seafood           303.      56.9
## 5 Turkish           244.      49.7
## 6 South Indian      183.      49.3

ggplot(cuisine_analysis, aes(x = avg_price, y = avg_votes)) +
  geom_point() +
  labs(title = "Price vs Popularity by Cuisine", x = "Average Price", y = "Average Votes")

#Interpretation: The analysis shows that Awadhi cuisine has the highest average votes (225) with a moderate average price (200), indicating strong customer popularity. North Indian and Mandi cuisines also have relatively high average votes, suggesting consistent customer interest.

Level 4: Visualization

Question 4.1: Which cities have the highest average dining ratings? (Bar Chart)

city_ratings <- data %>%
  group_by(city) %>%
  summarise(avg_dining = mean(dining_rating, na.rm = TRUE)) %>%
  arrange(desc(avg_dining))

ggplot(city_ratings, aes(x = reorder(city, avg_dining), y = avg_dining, fill = city)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(
    title = "Average Dining Rating by City",
    x = "City",
    y = "Average Dining Rating"
  ) +
  theme(legend.position = "none")

#Interpretation: The bar chart shows that Goa has the highest average dining rating, followed by Malleshwaram and New Delhi. Overall, ratings across cities are quite similar with only slight variations.

Question 4.2: What is the distribution of dining ratings? (Histogram)

ggplot(data, aes(x = dining_rating)) +
  geom_histogram(binwidth = 0.2, fill = "skyblue", color = "black") +
  labs(
    title = "Distribution of Dining Ratings",
    x = "Dining Rating",
    y = "Frequency"
  )

#Interpretation: The histogram shows that most dining ratings are concentrated between 3.5 and 4.2, indicating generally good ratings. Very few restaurants have extremely low or very high ratings.

Question 4.3: What is the relationship between price and votes? (Scatter Plot)

filtered_data <- data %>%
  filter(votes < 3000)

ggplot(filtered_data, aes(x = prices, y = votes)) +
  geom_point(color = "darkblue", alpha = 0.5) +
  labs(
    title = "Price vs Votes (Filtered)",
    x = "Price",
    y = "Votes"
  )

#Interpretation: The scatter plot, restricted to items with fewer than 3,000 votes, shows no clear linear relationship between price and votes. The highest vote counts occur mostly at lower prices, so cheaper items tend to receive more engagement, but the overall relationship is weak.

Question 4.4: How does average price vary across top cuisines? (Line Chart)

top_cuisines <- data %>%
  group_by(cuisine) %>%
  summarise(avg_price = mean(prices, na.rm = TRUE)) %>%
  arrange(desc(avg_price)) %>%
  head(10)

ggplot(top_cuisines, aes(x = reorder(cuisine, avg_price), y = avg_price, group = 1)) +
  geom_line(color = "blue") +
  geom_point(color = "red", size = 2) +
  labs(
    title = "Average Price Across Top Cuisines",
    x = "Cuisine",
    y = "Average Price"
  ) +
  theme_minimal()

#Interpretation: The line chart shows that Andhra cuisine has the highest average price, followed by Continental and Seafood. Most other cuisines have relatively similar and lower price ranges.

Question 4.5: How does price distribution vary across different cuisines? (Box Plot)

top_cuisines <- data %>%
  group_by(cuisine) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  head(8)

filtered_data <- data %>%
  filter(cuisine %in% top_cuisines$cuisine)

ggplot(filtered_data, aes(x = cuisine, y = prices, fill = cuisine)) +
  geom_boxplot() +
  labs(
    title = "Price Distribution Across Cuisines",
    x = "Cuisine",
    y = "Price"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  )

#Interpretation: The box plot shows that most cuisines have similar median prices, but there is high variability with many outliers, especially in Chinese and Desserts. This indicates that while typical prices are consistent, some items are significantly more expensive.

Question 4.6: What is the proportion of best seller vs non-best seller items? (Pie Chart)

best_seller_count <- data %>%
  group_by(best_seller) %>%
  summarise(count = n())

ggplot(best_seller_count, aes(x = "", y = count, fill = best_seller)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  labs(
    title = "Proportion of Best Seller vs Non-Best Seller Items",
    fill = "Category"
  ) +
  theme_void()

#Interpretation: The pie chart shows that the majority of items carry no best-seller label, while only a small proportion are marked as "BESTSELLER". The column also holds other labels, such as MUST TRY and CHEF'S SPECIAL, each of which gets its own slice. Only a small share of menu items is therefore highlighted as popular.
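The proportions behind a chart like this can also be printed as numbers. The sketch below computes label shares with table() and prop.table() on made-up data; `toy` is hypothetical, and recoding the empty string as "Not flagged" is an illustrative naming choice.

```r
# Toy data: empty strings stand for items without any label
toy <- data.frame(
  best_seller = c("BESTSELLER", "", "", "MUST TRY", "", "BESTSELLER", "", ""),
  stringsAsFactors = FALSE
)

# Give unlabelled items an explicit category name
label <- ifelse(toy$best_seller == "", "Not flagged", toy$best_seller)

# Share of rows in each category
shares <- prop.table(table(label))

print(round(100 * shares, 1))
```

A short table of percentages is often easier to read than a pie chart when one category dominates.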

Level 5: Correlation Analysis

Question 5.1: Is there a relationship between price and votes?

cor(data$prices, data$votes)

## [1] -0.05805689

#Interpretation: There is a very weak negative linear relationship between price and votes (r ≈ -0.06). With a correlation this close to zero, price says almost nothing about how many votes an item receives.
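cor() computes the Pearson correlation by default, which is sensitive to extreme values such as a few items with very large vote counts. As a hedged alternative (not part of the original analysis), the sketch below compares Pearson with the rank-based Spearman correlation on made-up data containing one extreme value; `prices` and `votes` here are hypothetical vectors.

```r
# Toy data: votes fall as price rises, except for one extreme item
prices <- 1:10
votes  <- c(10, 9, 8, 7, 6, 5, 4, 3, 2, 500)

# Pearson correlation (the default of cor()) is pulled up by the extreme value
pearson <- cor(prices, votes)

# Spearman correlation works on ranks, so the size of the extreme value
# does not matter, only its position in the ordering
spearman <- cor(prices, votes, method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```

In this toy example the two coefficients even have opposite signs, which shows why it can be worth checking both on heavily skewed data.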

Question 5.2: Is there a relationship between dining rating and votes?

cor(data$dining_rating, data$votes)

## [1] 0.04058689

#Interpretation: There is a very weak positive linear relationship between dining rating and votes (r ≈ 0.04). Higher ratings are therefore not strongly associated with higher customer engagement.

Question 5.3: Is there a relationship between dining rating and delivery rating?

cor(data$dining_rating, data$delivery_rating)

## [1] 0.311651

#Interpretation: There is a moderate positive relationship between dining and delivery ratings (r ≈ 0.31). Restaurants that are rated well for dining tend to be rated well for delivery too, although the association is far from perfect.

Question 5.4: What is the overall correlation between numerical variables in the dataset?

cor(data[, c("prices", "votes", "dining_rating", "delivery_rating")])

##                      prices       votes dining_rating delivery_rating
## prices           1.00000000 -0.05805689    0.07347007      0.04144929
## votes           -0.05805689  1.00000000    0.04058689      0.04679406
## dining_rating    0.07347007  0.04058689    1.00000000      0.31165095
## delivery_rating  0.04144929  0.04679406    0.31165095      1.00000000

#Interpretation: The correlation matrix shows that most variables have very weak relationships with each other, especially price and votes. The strongest relationship is a moderate positive correlation between dining and delivery ratings, indicating consistency in restaurant performance.

Level 6: Regression Analysis

Question 6.1: Can we predict votes based on price using a regression model?

set.seed(123)

# 80/20 train/test split; floor() makes the training-set size an explicit integer
train_index <- sample(seq_len(nrow(data)), floor(0.8 * nrow(data)))

train_data <- data[train_index, ]
test_data <- data[-train_index, ]

model1 <- lm(votes ~ prices, data = train_data)
summary(model1)

## 
## Call:
## lm(formula = votes ~ prices, data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -38.2  -29.1  -23.1   -9.1 9717.4 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.224325   0.850591   44.94   <2e-16 ***
## prices      -0.043212   0.002797  -15.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 141.2 on 73086 degrees of freedom
## Multiple R-squared:  0.003255,   Adjusted R-squared:  0.003242 
## F-statistic: 238.7 on 1 and 73086 DF,  p-value: < 2.2e-16

predictions <- predict(model1, newdata = test_data)

ggplot(test_data, aes(x = prices, y = votes)) +
  geom_point(color = "blue", alpha = 0.5) +
  geom_line(aes(y = predictions), color = "red") +
  labs(
    title = "Actual vs Predicted Votes",
    x = "Price",
    y = "Votes"
  ) +
  theme_minimal()

#Interpretation: The regression model is statistically significant (F = 238.7, p < 2.2e-16), which is expected with roughly 73,000 training rows. However, the R-squared value is only 0.0033, so price explains about 0.3% of the variation in votes. The slope of -0.043 corresponds to about 4.3 fewer votes per 100 units of price, and the residuals are strongly right-skewed (maximum 9717.4). The scatter plot shows no clear pattern. Overall, price is not a strong predictor of item popularity.
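The test set is plotted above but not scored. One way to put a number on prediction quality is the root mean squared error (RMSE) on the held-out rows, compared with a baseline that always predicts the training mean. The sketch below does this on simulated data; `toy`, the `rmse()` helper, and the simulated relationship are all assumptions made for illustration.

```r
set.seed(42)

# Simulated data with a strong linear signal (unlike the real price/votes data)
n <- 200
toy <- data.frame(x = runif(n, 0, 100))
toy$y <- 3 * toy$x + rnorm(n, sd = 5)

# 80/20 train/test split
train_index <- sample(seq_len(n), floor(0.8 * n))
train <- toy[train_index, ]
test  <- toy[-train_index, ]

# Fit on the training rows, predict on the held-out rows
fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)

# Hypothetical helper: root mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_model    <- rmse(test$y, pred)
rmse_baseline <- rmse(test$y, mean(train$y))  # always predict the training mean

print(c(model = rmse_model, baseline = rmse_baseline))
```

Here the model beats the baseline by construction. For a model with an R-squared near zero, the two RMSE values would be expected to be close, which is another way of saying the predictor adds little.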

Question 6.2: Can we predict votes based on dining rating using a regression model?

model2 <- lm(votes ~ dining_rating, data = train_data)
summary(model2)

## 
## Call:
## lm(formula = votes ~ dining_rating, data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -41.5  -28.9  -22.0  -10.1 9718.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -25.565      4.915  -5.202 1.98e-07 ***
## dining_rating   13.977      1.279  10.931  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 141.3 on 73086 degrees of freedom
## Multiple R-squared:  0.001632,   Adjusted R-squared:  0.001618 
## F-statistic: 119.5 on 1 and 73086 DF,  p-value: < 2.2e-16

predictions2 <- predict(model2, newdata = test_data)

ggplot(test_data, aes(x = votes, y = predictions2)) +
  geom_point(color = "darkgreen", alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  labs(
    title = "Actual vs Predicted Votes (Using Rating)",
    x = "Actual Votes",
    y = "Predicted Votes"
  ) +
  theme_minimal()

#Interpretation: The regression model is statistically significant (F = 119.5, p < 2.2e-16). The positive coefficient indicates that each additional rating point is associated with about 14 more votes. However, the R-squared value is only 0.0016, so dining rating explains less than 0.2% of the variation in votes. The scatter plot further confirms poor prediction accuracy, indicating that rating alone is not a strong predictor of customer engagement.
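The R-squared reported by summary() is the share of variance explained, R² = 1 - SS_res / SS_tot. The sketch below recomputes it by hand on simulated data and checks it against the value from summary(); `toy` and the simulated coefficients are made up for illustration.

```r
set.seed(1)

# Simulated data: a weak relationship buried in noise
toy <- data.frame(rating = runif(100, 3, 5))
toy$votes <- 10 * toy$rating + rnorm(100, sd = 20)

fit <- lm(votes ~ rating, data = toy)

# Residual sum of squares: variation the model leaves unexplained
ss_res <- sum(residuals(fit)^2)

# Total sum of squares: variation around the mean of the response
ss_tot <- sum((toy$votes - mean(toy$votes))^2)

r2_manual  <- 1 - ss_res / ss_tot
r2_summary <- summary(fit)$r.squared

print(c(manual = r2_manual, summary = r2_summary))
```

An R-squared near zero means SS_res is almost as large as SS_tot, so the fitted line predicts barely better than the mean.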

Question 6.3: Can we predict votes using both price and dining rating?

model3 <- lm(votes ~ prices + dining_rating, data = train_data)
summary(model3)

## 
## Call:
## lm(formula = votes ~ prices + dining_rating, data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -50.8  -29.6  -21.8   -7.9 9712.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -20.367504   4.916184  -4.143 3.43e-05 ***
## prices         -0.045664   0.002802 -16.299  < 2e-16 ***
## dining_rating  15.485289   1.279747  12.100  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 141 on 73085 degrees of freedom
## Multiple R-squared:  0.005248,   Adjusted R-squared:  0.005221 
## F-statistic: 192.8 on 2 and 73085 DF,  p-value: < 2.2e-16

predictions3 <- predict(model3, newdata = test_data)

ggplot(test_data, aes(x = votes, y = predictions3)) +
  geom_point(color = "purple", alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  labs(
    title = "Actual vs Predicted Votes (Combined Model)",
    x = "Actual Votes",
    y = "Predicted Votes"
  ) +
  theme_minimal()

#Interpretation: The combined regression model is statistically significant (F = 192.8, p < 2.2e-16). The coefficients indicate that price is negatively associated with votes (-0.046 per unit of price), while dining rating is positively associated (about 15.5 votes per rating point). However, the R-squared value remains very low at 0.0052, so the model explains only about 0.5% of the variation in votes. The scatter plot also indicates poor prediction accuracy, suggesting that factors other than price and rating drive customer engagement.
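Since the price-only model is nested inside the combined model, the two can be compared directly. In the output above, R-squared rises from 0.003255 to 0.005248 when dining rating is added. The sketch below shows one way to test such an improvement with anova() on simulated data; `toy`, the variable names, and the simulated coefficients are assumptions for illustration only.

```r
set.seed(7)

# Simulated data in which both predictors matter
n <- 300
toy <- data.frame(
  price  = runif(n, 50, 500),
  rating = runif(n, 3, 5)
)
toy$votes <- 40 - 0.05 * toy$price + 15 * toy$rating + rnorm(n, sd = 30)

# Nested models: the second adds one predictor to the first
fit_small <- lm(votes ~ price, data = toy)
fit_full  <- lm(votes ~ price + rating, data = toy)

# F-test for whether the extra predictor reduces the residual sum of squares
comparison <- anova(fit_small, fit_full)
print(comparison)

# Gain in R-squared from adding the second predictor
r2_gain <- summary(fit_full)$r.squared - summary(fit_small)$r.squared
print(r2_gain)
```

With very large samples, such a test will flag even tiny improvements as significant, so the size of the R-squared gain is the more informative figure.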

Restaurant Performance and Menu-Level Dynamics

Dhruv Mittal

02-05-2026
