Load Dataset
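# dplyr supplies %>%, filter(), group_by(), summarise(); ggplot2 supplies the plots
library(dplyr)
library(ggplot2)
data <- read.csv("data.csv", stringsAsFactors = FALSE)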
data <- janitor::clean_names(data)
head(data)
## restaurant_name dining_rating delivery_rating dining_votes delivery_votes
## 1 Doner King 3.9 4.2 39 0
## 2 Doner King 3.9 4.2 39 0
## 3 Doner King 3.9 4.2 39 0
## 4 Doner King 3.9 4.2 39 0
## 5 Doner King 3.9 4.2 39 0
## 6 Doner King 3.9 4.2 39 0
## cuisine place_name city item_name best_seller
## 1 Fast Food Malakpet Hyderabad Platter Kebab Combo BESTSELLER
## 2 Fast Food Malakpet Hyderabad Chicken Rumali Shawarma BESTSELLER
## 3 Fast Food Malakpet Hyderabad Chicken Tandoori Salad
## 4 Fast Food Malakpet Hyderabad Chicken BBQ Salad BESTSELLER
## 5 Fast Food Malakpet Hyderabad Special Doner Wrap Combo MUST TRY
## 6 Fast Food Malakpet Hyderabad Chicken Tandoori Pizza [8 inches] BESTSELLER
## votes prices
## 1 84 249
## 2 45 129
## 3 39 189
## 4 43 189
## 5 31 205
## 6 48 199
Level 1: Understanding the Data (Basic Exploration)
Question 1.1: What is the structure of the dataset (number of rows,
columns, and data types)?
str(data)
## 'data.frame': 123657 obs. of 12 variables:
## $ restaurant_name: chr "Doner King" "Doner King" "Doner King" "Doner King" ...
## $ dining_rating : num 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 ...
## $ delivery_rating: num 4.2 4.2 4.2 4.2 4.2 4.2 4.2 4.2 4.2 4.2 ...
## $ dining_votes : int 39 39 39 39 39 39 39 39 39 39 ...
## $ delivery_votes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cuisine : chr "Fast Food" "Fast Food" "Fast Food" "Fast Food" ...
## $ place_name : chr "Malakpet" "Malakpet" "Malakpet" "Malakpet" ...
## $ city : chr " Hyderabad" " Hyderabad" " Hyderabad" " Hyderabad" ...
## $ item_name : chr "Platter Kebab Combo" "Chicken Rumali Shawarma" "Chicken Tandoori Salad" "Chicken BBQ Salad" ...
## $ best_seller : chr "BESTSELLER" "BESTSELLER" "" "BESTSELLER" ...
## $ votes : int 84 45 39 43 31 48 27 59 29 31 ...
## $ prices : num 249 129 189 189 205 199 165 165 115 129 ...
#Interpretation: This dataset contains information about restaurants and their menu items across multiple cities. It includes details such as restaurant names, dining and delivery ratings, votes, cuisines, item names, best seller indicators, and prices. Each row represents a specific item offered by a restaurant, allowing analysis at both restaurant and item levels.
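The str() output above shows city values with a leading space (" Hyderabad"). A minimal base R sketch of why that matters and how trimming could be handled; the `cities` vector is made-up toy data, not taken from data.csv.

```r
# Toy stand-in for the city column; the leading spaces mimic the str() output.
cities <- c(" Hyderabad", " Hyderabad", " Goa", "Goa")

# Without trimming, " Goa" and "Goa" are treated as two different values.
n_raw <- length(unique(cities))

# trimws() removes leading and trailing whitespace before counting or grouping.
cities_clean <- trimws(cities)
n_clean <- length(unique(cities_clean))

print(c(raw = n_raw, clean = n_clean))
```

On the real data the same idea would be `data$city <- trimws(data$city)`, applied before any grouping by city.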
Question 1.2: Are there any missing values in the dataset?
data <- data %>%
filter(!is.na(dining_rating) & !is.na(delivery_rating))
colSums(is.na(data))
## restaurant_name dining_rating delivery_rating dining_votes delivery_votes
## 0 0 0 0 0
## cuisine place_name city item_name best_seller
## 0 0 0 0 0
## votes prices
## 0 0
#Interpretation: Rows with a missing dining_rating or delivery_rating were removed before the check, so colSums(is.na(data)) reports zero missing values in every column. The zeros therefore describe the filtered data, not the raw file: rating information was unavailable for some entries, and those entries are excluded from all later analysis.
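To see how many values were missing in the first place, the count has to run before the filter as well as after it. A small base R sketch on a made-up data frame (`toy` and its values are illustrative assumptions, not rows from data.csv):

```r
# Toy data with one missing dining rating and one missing delivery rating.
toy <- data.frame(
  dining_rating   = c(3.9, NA, 4.1, 4.0),
  delivery_rating = c(4.2, 4.0, NA, 3.8),
  prices          = c(249, 129, 189, 205)
)

# Count missing values per column BEFORE dropping anything.
na_before <- colSums(is.na(toy))

# Keep only rows where both ratings are present (same rule as the filter above).
toy_complete <- toy[!is.na(toy$dining_rating) & !is.na(toy$delivery_rating), ]

# Count again AFTER filtering, and record how many rows were lost.
na_after <- colSums(is.na(toy_complete))
rows_dropped <- nrow(toy) - nrow(toy_complete)

print(na_before)
print(na_after)
print(rows_dropped)
```

Reporting `na_before` and `rows_dropped` alongside the filtered result makes the cost of the cleaning step visible.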
Question 1.3: What is the total number of unique restaurants,
cities, and cuisines in the dataset?
length(unique(data$restaurant_name))
## [1] 557
length(unique(data$city))
## [1] 16
length(unique(data$cuisine))
## [1] 40
#Interpretation: After removing rows with missing ratings, the dataset covers 557 unique restaurants, 16 distinct values in the city column, and 40 cuisines. This breadth supports comparisons of restaurant performance and customer preferences across locations and culinary categories.
Question 1.4: Which cuisines have the highest average dining
ratings?
data %>%
group_by(cuisine) %>%
summarise(avg_rating = mean(dining_rating,na.rm = TRUE)) %>%
arrange(desc(avg_rating))
## # A tibble: 40 × 2
## cuisine avg_rating
## <chr> <dbl>
## 1 Awadhi 4.5
## 2 Mexican 4.4
## 3 Wraps 4.3
## 4 Andhra 4.2
## 5 Turkish 4.1
## 6 Bakery 4.09
## 7 Seafood 4.05
## 8 Shake 3.98
## 9 Pasta 3.93
## 10 North Indian 3.90
## # ℹ 30 more rows
Question 1.5: What is the average dining rating and delivery rating
across all restaurants?
colMeans(data[, c("dining_rating", "delivery_rating")], na.rm = TRUE)
## dining_rating delivery_rating
## 3.822619 3.970807
Level 3: Grouping And Summarization
Question 3.1: What is the average dining and delivery rating for
each city?
city_ratings <- data %>%
group_by(city) %>%
summarise(
avg_dining = mean(dining_rating, na.rm = TRUE),
avg_delivery = mean(delivery_rating, na.rm = TRUE)
) %>%
arrange(desc(avg_dining))
head(city_ratings)
## # A tibble: 6 × 3
## city avg_dining avg_delivery
## <chr> <dbl> <dbl>
## 1 " Goa" 4.13 3.91
## 2 " Malleshwaram" 4 4
## 3 " New Delhi" 3.97 3.93
## 4 " Hyderabad" 3.89 3.99
## 5 " Lucknow" 3.88 3.98
## 6 " Raipur" 3.86 3.91
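#Interpretation: Goa has the highest average dining rating (4.13), which points to strong dine-in satisfaction there. Malleshwaram is balanced at 4.0 for both dining and delivery, although it is a locality in Bengaluru rather than a city, so the city column mixes cities with neighbourhoods. In Hyderabad and Lucknow, delivery ratings (3.99 and 3.98) are higher than dining ratings (3.89 and 3.88).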
Question 3.2: Which cuisines have the highest average dining
rating?
cuisine_ratings <- data %>%
group_by(cuisine) %>%
summarise(avg_rating = mean(dining_rating, na.rm = TRUE)) %>%
arrange(desc(avg_rating))
ggplot(cuisine_ratings, aes(x = reorder(cuisine, avg_rating), y = avg_rating)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Average Dining Rating by Cuisine", x = "Cuisine", y = "Rating")

head(cuisine_ratings)
## # A tibble: 6 × 2
## cuisine avg_rating
## <chr> <dbl>
## 1 Awadhi 4.5
## 2 Mexican 4.4
## 3 Wraps 4.3
## 4 Andhra 4.2
## 5 Turkish 4.1
## 6 Bakery 4.09
#Interpretation: The analysis shows that Awadhi cuisine has the highest average dining rating (4.5), indicating strong customer preference. It is followed by Mexican (4.4) and Wraps (4.3), which also receive high ratings.
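Because each row is a menu item, a cuisine average taken over rows weights every restaurant by the length of its menu, and a round value such as 4.5 may come from a single restaurant. The sketch below contrasts an item-level and a restaurant-level average in base R. The `toy` data frame is made up for illustration, and the sketch assumes a restaurant has one dining rating.

```r
# Toy data: restaurant A lists four items, B and C list one each.
toy <- data.frame(
  restaurant_name = c("A", "A", "A", "A", "B", "C"),
  cuisine         = c("Fast Food", "Fast Food", "Fast Food", "Fast Food",
                      "Fast Food", "Awadhi"),
  dining_rating   = c(3.0, 3.0, 3.0, 3.0, 4.0, 4.5)
)

# Item-level mean: restaurant A counts four times.
item_level <- aggregate(dining_rating ~ cuisine, data = toy, FUN = mean)

# Restaurant-level mean: collapse to one row per restaurant first.
per_restaurant <- unique(toy[, c("restaurant_name", "cuisine", "dining_rating")])
restaurant_level <- aggregate(dining_rating ~ cuisine, data = per_restaurant, FUN = mean)

# Number of restaurants behind each cuisine average.
n_restaurants <- table(per_restaurant$cuisine)

print(item_level)
print(restaurant_level)
print(n_restaurants)
```

Showing the restaurant count next to each average helps separate well-supported cuisines from those represented by one or two restaurants.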
Question 3.3: What is the average price of items for each
restaurant?
restaurant_price <- data %>%
group_by(restaurant_name) %>%
summarise(avg_price = mean(prices, na.rm = TRUE)) %>%
arrange(desc(avg_price))
head(restaurant_price)
## # A tibble: 6 × 2
## restaurant_name avg_price
## <chr> <dbl>
## 1 Zaffran Mataam Alarabi 806.
## 2 Barbeque Nation 657.
## 3 Khalids Biriyani 604.
## 4 Exotica 562.
## 5 The Fatty Bao 556.
## 6 Mandi Town 555.
#Interpretation: The analysis shows that Zaffran Mataam Alarabi has the highest average item price (806), followed by Barbeque Nation (657) and Khalids Biriyani (604). These restaurants appear to follow a premium pricing strategy compared to others.
Question 3.4: Which restaurants receive the highest total customer
engagement (votes)?
restaurant_votes <- data %>%
group_by(restaurant_name) %>%
summarise(total_votes = sum(votes, na.rm = TRUE)) %>%
arrange(desc(total_votes)) %>%
head(10)
ggplot(restaurant_votes, aes(x = reorder(restaurant_name, total_votes), y = total_votes)) +
geom_bar(stat = "identity") +
coord_flip()

head(restaurant_votes)
## # A tibble: 6 × 2
## restaurant_name total_votes
## <chr> <int>
## 1 Agarwal Caterers 116249
## 2 Bawarchi 90838
## 3 Mehfil 73798
## 4 McDonald's 69878
## 5 Lucky Restaurant 68021
## 6 Burger King 52193
#Interpretation: The analysis shows that Agarwal Caterers has the highest total votes (116,249), indicating the highest customer engagement. It is followed by Bawarchi (90,838) and Mehfil (73,798), which also receive strong customer interaction.
Question 3.5: What is the average price and average votes for each
cuisine?
cuisine_analysis <- data %>%
group_by(cuisine) %>%
summarise(
avg_price = mean(prices, na.rm = TRUE),
avg_votes = mean(votes, na.rm = TRUE)
) %>%
arrange(desc(avg_votes))
head(cuisine_analysis)
## # A tibble: 6 × 3
## cuisine avg_price avg_votes
## <chr> <dbl> <dbl>
## 1 Awadhi 200. 225.
## 2 North Indian 205. 72.0
## 3 Mandi 205. 69.7
## 4 Seafood 303. 56.9
## 5 Turkish 244. 49.7
## 6 South Indian 183. 49.3
ggplot(cuisine_analysis, aes(x = avg_price, y = avg_votes)) +
geom_point() +
labs(title = "Price vs Popularity by Cuisine", x = "Average Price", y = "Average Votes")

#Interpretation: The analysis shows that Awadhi cuisine has the highest average votes (225) with a moderate average price (200), indicating strong customer popularity. North Indian and Mandi cuisines also have relatively high average votes, suggesting consistent customer interest.
Level 4: Visualization
Question 4.1: Which cities have the highest average dining
ratings? (Bar Chart)
city_ratings <- data %>%
group_by(city) %>%
summarise(avg_dining = mean(dining_rating, na.rm = TRUE)) %>%
arrange(desc(avg_dining))
ggplot(city_ratings, aes(x = reorder(city, avg_dining), y = avg_dining, fill = city)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(
title = "Average Dining Rating by City",
x = "City",
y = "Average Dining Rating"
) +
theme(legend.position = "none")

#Interpretation: The bar chart shows that Goa has the highest average dining rating, followed by Malleshwaram and New Delhi. Overall, ratings across cities are quite similar with only slight variations.
Question 4.2: What is the distribution of dining
ratings? (Histogram)
ggplot(data, aes(x = dining_rating)) +
geom_histogram(binwidth = 0.2, fill = "skyblue", color = "black") +
labs(
title = "Distribution of Dining Ratings",
x = "Dining Rating",
y = "Frequency"
)

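#Interpretation: The histogram shows that most dining ratings fall between 3.5 and 4.2, indicating generally good ratings. Because each row is a menu item, the counts are item rows rather than restaurants, so restaurants with long menus weigh more heavily. Very few rows carry extremely low or very high ratings.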
Question 4.3: What is the relationship between price and
votes? (Scatter Plot)
filtered_data <- data %>%
filter(votes < 3000)
ggplot(filtered_data, aes(x = prices, y = votes)) +
geom_point(color = "darkblue", alpha = 0.5) +
labs(
title = "Price vs Votes (Filtered)",
x = "Price",
y = "Votes"
)

#Interpretation: After excluding items with 3,000 or more votes, the scatter plot shows no clear relationship between price and votes. The highest vote counts occur mostly at lower prices, but the overall association is weak.
Question 4.4: How does average price vary across top
cuisines? (Line Chart)
top_cuisines <- data %>%
group_by(cuisine) %>%
summarise(avg_price = mean(prices, na.rm = TRUE)) %>%
arrange(desc(avg_price)) %>%
head(10)
ggplot(top_cuisines, aes(x = reorder(cuisine, avg_price), y = avg_price, group = 1)) +
geom_line(color = "blue") +
geom_point(color = "red", size = 2) +
labs(
title = "Average Price Across Top Cuisines",
x = "Cuisine",
y = "Average Price"
) +
theme_minimal()

#Interpretation: The line chart shows that Andhra cuisine has the highest average price, followed by Continental and Seafood. Most other cuisines have relatively similar and lower price ranges.
Question 4.5: How does price distribution vary across different
cuisines? (Box Plot)
top_cuisines <- data %>%
group_by(cuisine) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
head(8)
filtered_data <- data %>%
filter(cuisine %in% top_cuisines$cuisine)
ggplot(filtered_data, aes(x = cuisine, y = prices, fill = cuisine)) +
geom_boxplot() +
labs(
title = "Price Distribution Across Cuisines",
x = "Cuisine",
y = "Price"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none"
)

#Interpretation: The box plot shows that most cuisines have similar median prices, but there is high variability with many outliers, especially in Chinese and Desserts. This indicates that while typical prices are consistent, some items are significantly more expensive.
Question 4.6: What is the proportion of best seller vs non-best
seller items? (Pie Chart)
best_seller_count <- data %>%
group_by(best_seller) %>%
summarise(count = n())
ggplot(best_seller_count, aes(x = "", y = count, fill = best_seller)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y") +
labs(
title = "Proportion of Best Seller vs Non-Best Seller Items",
fill = "Category"
) +
theme_void()

#Interpretation: The pie chart draws one slice per distinct value of best_seller, which includes the empty string (no label), “BESTSELLER”, and other tags such as “MUST TRY”. Most items carry no label and only a small proportion are marked “BESTSELLER”, so the tag singles out a minority of menu items.
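The proportions behind the pie chart can also be read off as numbers. A short base R sketch, with the empty string relabelled so that unlabelled items get a readable name. The `best_seller` vector is toy data and "No label" is a name chosen here for illustration.

```r
# Toy stand-in for the best_seller column; "" means the item carries no tag.
best_seller <- c("BESTSELLER", "BESTSELLER", "", "BESTSELLER",
                 "MUST TRY", "", "", "")

# Give unlabelled items an explicit category name.
label <- ifelse(best_seller == "", "No label", best_seller)

# Counts per category, then percentage shares rounded to one decimal.
counts <- table(label)
shares <- round(100 * prop.table(counts), 1)

print(counts)
print(shares)
```

A table like this is usually easier to compare across categories than slice angles, particularly when some categories are small.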
Level 5: Correlation Analysis
Question 5.1: Is there a relationship between price and votes?
cor(data$prices, data$votes)
## [1] -0.05805689
#Interpretation: The correlation of about -0.06 is a very weak negative linear relationship between price and votes. Higher-priced items receive only marginally fewer votes, so price is essentially unrelated to popularity in this data.
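cor() defaults to the Pearson coefficient, which measures linear association and can be pulled around by a few extreme values. Vote counts here are heavily skewed, so a rank-based (Spearman) coefficient may be a useful cross-check. The sketch below uses simulated data; the distributions and the injected outlier are assumptions chosen only to mimic a skewed vote column.

```r
set.seed(1)

# Simulated prices and skewed vote counts (illustrative, not the real data).
prices <- runif(200, min = 50, max = 500)
votes  <- round(rexp(200, rate = 1 / 30))
votes[1] <- 9000  # one extreme item, similar in spirit to the real outliers

# Pearson: linear association, sensitive to the extreme value.
pearson <- cor(prices, votes)

# Spearman: association between ranks, robust to the extreme value.
spearman <- cor(prices, votes, method = "spearman")

# cor.test() adds a p-value; exact = FALSE avoids a warning when ranks are tied.
rank_test <- cor.test(prices, votes, method = "spearman", exact = FALSE)

print(c(pearson = pearson, spearman = spearman))
print(rank_test$p.value)
```

If the two coefficients disagree noticeably on the real data, that would suggest the outliers rather than the bulk of the items are driving the Pearson value.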
Question 5.2: Is there a relationship between dining rating and
votes?
cor(data$dining_rating, data$votes)
## [1] 0.04058689
#Interpretation: The correlation of about 0.04 is a very weak positive linear relationship between dining rating and votes. Items from higher-rated restaurants attract only marginally more votes, so rating does not strongly influence customer engagement.
Question 5.3: Is there a relationship between dining rating and
delivery rating?
cor(data$dining_rating, data$delivery_rating)
## [1] 0.311651
#Interpretation: The correlation of about 0.31 is a weak-to-moderate positive relationship between dining and delivery ratings. Restaurants that perform well in dining tend to perform well in delivery too, though far from perfectly.
Question 5.4: What is the overall correlation between numerical
variables in the dataset?
cor(data[, c("prices", "votes", "dining_rating", "delivery_rating")])
## prices votes dining_rating delivery_rating
## prices 1.00000000 -0.05805689 0.07347007 0.04144929
## votes -0.05805689 1.00000000 0.04058689 0.04679406
## dining_rating 0.07347007 0.04058689 1.00000000 0.31165095
## delivery_rating 0.04144929 0.04679406 0.31165095 1.00000000
#Interpretation: The correlation matrix shows that most variables have very weak relationships with each other, especially price and votes. The strongest relationship is a moderate positive correlation between dining and delivery ratings, indicating consistency in restaurant performance.
Level 6: Regression Analysis
Question 6.1: Can we predict votes based on price using a regression
model?
set.seed(123)
train_index <- sample(seq_len(nrow(data)), size = floor(0.8 * nrow(data)))
train_data <- data[train_index, ]
test_data <- data[-train_index, ]
model1 <- lm(votes ~ prices, data = train_data)
summary(model1)
##
## Call:
## lm(formula = votes ~ prices, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.2 -29.1 -23.1 -9.1 9717.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.224325 0.850591 44.94 <2e-16 ***
## prices -0.043212 0.002797 -15.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 141.2 on 73086 degrees of freedom
## Multiple R-squared: 0.003255, Adjusted R-squared: 0.003242
## F-statistic: 238.7 on 1 and 73086 DF, p-value: < 2.2e-16
predictions <- predict(model1, newdata = test_data)
ggplot(test_data, aes(x = prices, y = votes)) +
geom_point(color = "blue", alpha = 0.5) +
geom_line(aes(y = predictions), color = "red") +
labs(
title = "Actual vs Predicted Votes",
x = "Price",
y = "Votes"
) +
theme_minimal()

#Interpretation: The model is statistically significant (F = 238.7, p < 2.2e-16), but with more than 73,000 training rows even tiny effects reach significance. R-squared is 0.0033, so price explains about 0.3% of the variation in votes. The slope of -0.043 means an item priced 100 units higher is predicted to receive about 4 fewer votes. The residuals are heavily right-skewed (median -23.1, maximum 9717.4), so a few very popular items sit far above the fitted line. Overall, price is not a strong predictor of item popularity.
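The test set is used above only for plotting. One way to turn it into a number is to compute the root mean squared error (RMSE) of the predictions and compare it with a baseline that always predicts the training mean. The sketch below does this on simulated data. The data-generating formula and the helper `rmse()` are assumptions made for illustration.

```r
set.seed(123)

# Simulated item data with a weak negative price effect (illustrative only).
n <- 500
toy <- data.frame(prices = runif(n, min = 50, max = 500))
toy$votes <- pmax(0, round(40 - 0.04 * toy$prices + rnorm(n, sd = 30)))

# 80/20 train-test split, mirroring the split used above.
train_index <- sample(seq_len(n), size = floor(0.8 * n))
train_data <- toy[train_index, ]
test_data  <- toy[-train_index, ]

# Fit on the training rows, predict on the held-out rows.
model <- lm(votes ~ prices, data = train_data)
pred  <- predict(model, newdata = test_data)

# Hypothetical helper: root mean squared error.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_model    <- rmse(test_data$votes, pred)
rmse_baseline <- rmse(test_data$votes, mean(train_data$votes))

print(c(model = rmse_model, baseline = rmse_baseline))
```

If the model's RMSE is close to the baseline's, the predictor adds little beyond the average, which is what a very low R-squared would lead one to expect.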
Question 6.2: Can we predict votes based on dining rating using a
regression model?
model2 <- lm(votes ~ dining_rating, data = train_data)
summary(model2)
##
## Call:
## lm(formula = votes ~ dining_rating, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.5 -28.9 -22.0 -10.1 9718.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.565 4.915 -5.202 1.98e-07 ***
## dining_rating 13.977 1.279 10.931 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 141.3 on 73086 degrees of freedom
## Multiple R-squared: 0.001632, Adjusted R-squared: 0.001618
## F-statistic: 119.5 on 1 and 73086 DF, p-value: < 2.2e-16
predictions2 <- predict(model2, newdata = test_data)
ggplot(test_data, aes(x = votes, y = predictions2)) +
geom_point(color = "darkgreen", alpha = 0.6) +
geom_abline(slope = 1, intercept = 0, color = "red") +
labs(
title = "Actual vs Predicted Votes (Using Rating)",
x = "Actual Votes",
y = "Predicted Votes"
) +
theme_minimal()

#Interpretation: The model is statistically significant (F = 119.5, p < 2.2e-16). The slope of 13.98 means that one additional rating point is associated with about 14 more votes. However, R-squared is 0.0016, so dining rating explains under 0.2% of the variation in votes. The plot of predicted against actual votes confirms poor prediction accuracy, so rating alone is not a strong predictor of customer engagement.
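The residual summaries above (median near -22, maximum near 9,700) point to a strongly right-skewed response. A common option in that situation is to model log1p(votes) instead of raw votes. The sketch below shows the mechanics on simulated data; whether it helps on the real data is untested here, and the simulation settings are assumptions.

```r
set.seed(42)

# Simulated ratings and right-skewed vote counts (illustrative only).
n <- 300
toy <- data.frame(dining_rating = round(runif(n, min = 3, max = 4.8), 1))
toy$votes <- round(rexp(n, rate = 1 / (10 * toy$dining_rating)))

# Model on the raw scale, as in the question above.
model_raw <- lm(votes ~ dining_rating, data = toy)

# Model on the log scale; log1p(x) = log(1 + x) is defined for zero votes.
model_log <- lm(log1p(votes) ~ dining_rating, data = toy)

# Back-transform a prediction to the vote scale with expm1(), the inverse of log1p().
pred_votes <- expm1(predict(model_log, newdata = data.frame(dining_rating = 4.0)))

print(coef(model_raw))
print(coef(model_log))
print(pred_votes)
```

Two cautions apply: R-squared values from the raw and log models are not directly comparable because the responses differ, and the back-transformed prediction is not an estimate of the mean vote count.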
Question 6.3: Can we predict votes using both price and dining
rating?
model3 <- lm(votes ~ prices + dining_rating, data = train_data)
summary(model3)
##
## Call:
## lm(formula = votes ~ prices + dining_rating, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.8 -29.6 -21.8 -7.9 9712.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20.367504 4.916184 -4.143 3.43e-05 ***
## prices -0.045664 0.002802 -16.299 < 2e-16 ***
## dining_rating 15.485289 1.279747 12.100 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 141 on 73085 degrees of freedom
## Multiple R-squared: 0.005248, Adjusted R-squared: 0.005221
## F-statistic: 192.8 on 2 and 73085 DF, p-value: < 2.2e-16
predictions3 <- predict(model3, newdata = test_data)
ggplot(test_data, aes(x = votes, y = predictions3)) +
geom_point(color = "purple", alpha = 0.6) +
geom_abline(slope = 1, intercept = 0, color = "red") +
labs(
title = "Actual vs Predicted Votes (Combined Model)",
x = "Actual Votes",
y = "Predicted Votes"
) +
theme_minimal()

#Interpretation: The combined model is statistically significant (F = 192.8, p < 2.2e-16). Holding the other predictor fixed, price has a negative coefficient (-0.046) and dining rating a positive one (15.49); these are associations, not causal effects. R-squared rises only to 0.0052, so the model explains about 0.5% of the variation in votes. The plot of predicted against actual votes again indicates poor prediction accuracy, suggesting that factors beyond price and rating drive customer engagement.
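Since the price-only model is nested inside the combined model, the two can be compared directly with an F-test via anova(), alongside adjusted R-squared. The sketch below uses simulated data; the coefficients in the simulation are assumptions loosely inspired by the estimates above, not a reproduction of them.

```r
set.seed(7)

# Simulated data in which both predictors carry some signal (illustrative only).
n <- 400
toy <- data.frame(
  prices        = runif(n, min = 50, max = 500),
  dining_rating = round(runif(n, min = 3, max = 4.8), 1)
)
toy$votes <- 20 - 0.04 * toy$prices + 15 * toy$dining_rating + rnorm(n, sd = 40)

# Smaller model and larger model; the first is nested in the second.
m_price <- lm(votes ~ prices, data = toy)
m_both  <- lm(votes ~ prices + dining_rating, data = toy)

# F-test for whether adding dining_rating reduces the residual sum of squares.
cmp <- anova(m_price, m_both)

# Adjusted R-squared penalises the extra predictor.
adj_r2 <- c(
  price = summary(m_price)$adj.r.squared,
  both  = summary(m_both)$adj.r.squared
)

print(cmp)
print(adj_r2)
```

On the real data, the same two calls applied to model1 and model3 would show whether the gain from adding dining rating is statistically detectable, separately from whether it is large enough to matter.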