Overview

In this notebook, explonatory data analysis is performed to get a fair idea about factors affecting the establishment of different types of restaurant at different places in Metro Manila, aggregate rating of each restaurant and many more. Three models are also compared to have an idea of the best predicting model. We compare the test MSE of the three models.

Libraries

library(tidyverse)
library(readr)
library(skimr)
library(reshape2)
library(ggcorrplot)
library(ISLR)
library(tree)
library(randomForest)
library(kableExtra)

Data Cleaning

There’s nothing so much to clean in this dataset but I will check if there are missing values and redundant observations by exploring the data and then working on it.

Reading the Dataset

data = read_csv("data.csv")
data = data %>% select(-c("X1" ))
head(data) %>% 
        rmarkdown::paged_table()

Data Exploration

dim(data)
## [1] 6830   23
# Variable types
class_type = as.matrix(lapply(data, function(x) class(x)))
colnames(class_type) = c("class_type")
class_type
##                                class_type 
## id                             "numeric"  
## address                        "character"
## city                           "character"
## city_id                        "numeric"  
## country_id                     "numeric"  
## locality                       "character"
## locality_verbose               "character"
## aggregate_rating               "numeric"  
## rating_color                   "character"
## rating_text                    "character"
## votes                          "numeric"  
## average_cost_for_two           "numeric"  
## cuisines                       "character"
## currency                       "character"
## has_online_delivery            "numeric"  
## has_table_booking              "numeric"  
## include_bogo_offers            "logical"  
## is_book_form_web_view          "numeric"  
## is_delivering_now              "numeric"  
## is_table_reservation_supported "numeric"  
## is_zomato_book_res             "numeric"  
## name                           "character"
## price_range                    "numeric"
# Missing Values
missing_val = as.matrix(lapply(data, function(x) sum(is.na(x))))
colnames(missing_val) = c("missing_values")
missing_val
##                                missing_values
## id                             0             
## address                        0             
## city                           0             
## city_id                        0             
## country_id                     0             
## locality                       0             
## locality_verbose               0             
## aggregate_rating               0             
## rating_color                   0             
## rating_text                    0             
## votes                          0             
## average_cost_for_two           0             
## cuisines                       1             
## currency                       0             
## has_online_delivery            0             
## has_table_booking              0             
## include_bogo_offers            0             
## is_book_form_web_view          2             
## is_delivering_now              0             
## is_table_reservation_supported 0             
## is_zomato_book_res             2             
## name                           0             
## price_range                    0
filter(data, is.na(cuisines) | is.na(is_book_form_web_view) |
        is.na(is_zomato_book_res)) %>% 
        rmarkdown::paged_table()
# Detect Duplicate Observations
obs_dup = duplicated(data)
subset(data, obs_dup)
## # A tibble: 0 x 23
## # … with 23 variables: id <dbl>, address <chr>, city <chr>, city_id <dbl>,
## #   country_id <dbl>, locality <chr>, locality_verbose <chr>,
## #   aggregate_rating <dbl>, rating_color <chr>, rating_text <chr>, votes <dbl>,
## #   average_cost_for_two <dbl>, cuisines <chr>, currency <chr>,
## #   has_online_delivery <dbl>, has_table_booking <dbl>,
## #   include_bogo_offers <lgl>, is_book_form_web_view <dbl>,
## #   is_delivering_now <dbl>, is_table_reservation_supported <dbl>,
## #   is_zomato_book_res <dbl>, name <chr>, price_range <dbl>
# Deleting Unnnecessary Columns for Analysis
data = data %>% select(-c('id', 'city_id', 'country_id', 'locality',
                          'locality_verbose', 'rating_color',
                          'rating_text', 'currency'))

# Reading Column Names
names(data)
##  [1] "address"                        "city"                          
##  [3] "aggregate_rating"               "votes"                         
##  [5] "average_cost_for_two"           "cuisines"                      
##  [7] "has_online_delivery"            "has_table_booking"             
##  [9] "include_bogo_offers"            "is_book_form_web_view"         
## [11] "is_delivering_now"              "is_table_reservation_supported"
## [13] "is_zomato_book_res"             "name"                          
## [15] "price_range"
# Reading uninque values on selected columns
u1 = unique(data$aggregate_rating)
u2 = unique(data$price_range)
u3 = unique(data$include_bogo_offers)
u4 = unique(data$has_online_delivery)
u5 = unique(data$is_book_form_web_view)
u6 = unique(data$is_delivering_now)
u7 = unique(data$is_zomato_book_res)

# Data with these variables
data_ret = data

# Remove variables with 0 variance
data = data %>% select(-c('include_bogo_offers', 'has_online_delivery',
                          'is_book_form_web_view', 'is_delivering_now'))

Variables is_book_form_web_view, include_bogo_offers, has_online_delivery, is_delivering_now have elements which are all equal. This would result a 0 variance and would produce NA values in my correlation matrix. So I removed these variables. This leaves me missing values for cuisines and is_zomato_book_res variables on restaurant with ID 18683703, 18376892, and 18451637. I’ll impute the missing value with the mean of these variable right after encoding. Also, I have no duplicates in data.

Encoding Categorical Variables

# Encoding
data$address = as.numeric(as.factor(data$address))
data$city = as.numeric(as.factor(data$city))
data$cuisines = as.numeric(as.factor(data$cuisines))
data$name = as.numeric(as.factor(data$name))
str(data)
## tibble [6,830 × 11] (S3: tbl_df/tbl/data.frame)
##  $ address                       : num [1:6830] 1187 1187 2352 2859 2115 ...
##  $ city                          : num [1:6830] 5 5 10 6 10 10 10 10 10 6 ...
##  $ aggregate_rating              : num [1:6830] 4.3 3.4 4.9 4.7 3.6 3.7 3.1 3.5 2.5 3.8 ...
##  $ votes                         : num [1:6830] 402 38 1014 300 51 ...
##  $ average_cost_for_two          : num [1:6830] 4000 4000 6000 2600 2500 3500 3500 3000 2700 3500 ...
##  $ cuisines                      : num [1:6830] 608 458 266 450 391 311 311 104 115 202 ...
##  $ has_table_booking             : num [1:6830] 0 0 0 0 0 0 0 0 0 1 ...
##  $ is_table_reservation_supported: num [1:6830] 0 0 1 0 1 1 0 0 0 1 ...
##  $ is_zomato_book_res            : num [1:6830] 0 0 0 0 0 0 0 0 0 0 ...
##  $ name                          : num [1:6830] 1194 2102 2619 452 1583 ...
##  $ price_range                   : num [1:6830] 4 4 4 4 4 4 4 4 4 4 ...
# Taking care of missing values
data$cuisines = ifelse(is.na(data$cuisines), ave(data$cuisines,FUN = function(x) mean(x, na.rm = TRUE)), data$cuisines)
data$is_zomato_book_res = ifelse(is.na(data$is_zomato_book_res), ave(data$is_zomato_book_res,FUN = function(x) mean(x, na.rm = TRUE)), data$is_zomato_book_res)

Summary Statistics

skim(data)
Data summary
Name data
Number of rows 6830
Number of columns 11
_______________________
Column type frequency:
numeric 11
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
address 0 1 1818.31 820.00 1 1268.25 1834.0 2551.75 3194.0 ▃▅▇▆▇
city 0 1 9.46 4.57 1 5.00 10.0 13.00 17.0 ▅▅▃▇▃
aggregate_rating 0 1 2.97 1.22 0 2.90 3.3 3.70 4.9 ▂▁▂▇▂
votes 0 1 56.28 91.72 0 7.00 26.0 67.00 1404.0 ▇▁▁▁▁
average_cost_for_two 0 1 743.72 543.76 0 400.00 600.0 1000.00 10000.0 ▇▁▁▁▁
cuisines 0 1 334.34 190.40 1 188.00 310.0 490.00 698.0 ▅▇▇▆▅
has_table_booking 0 1 0.08 0.26 0 0.00 0.0 0.00 1.0 ▇▁▁▁▁
is_table_reservation_supported 0 1 0.11 0.32 0 0.00 0.0 0.00 1.0 ▇▁▁▁▁
is_zomato_book_res 0 1 0.00 0.01 0 0.00 0.0 0.00 1.0 ▇▁▁▁▁
name 0 1 1644.43 926.25 1 833.00 1662.5 2438.00 3278.0 ▇▇▇▇▇
price_range 0 1 2.38 0.90 1 2.00 2.0 3.00 4.0 ▅▇▁▇▂

Data Visualization

Correlations

# Compute the correlation matrix
pearson_cormat = cor(data, method = "pearson")
ggcorrplot(pearson_cormat, title = "Pearson Correlation")

# kendall_cormat = cor(data, method = c("kendall"))
# ggcorrplot(kendall_cormat, title = "Kendall Correlation")

spearman_cormat = cor(data, method = "spearman")
ggcorrplot(spearman_cormat, title = "Spearman Correlation")

These are the variables with noticeable correlation:

  1. average_cost_for_two and price_range
  • I guess these two variables are redundant since they both relate to price.
  1. is_table_reservation_supported and average_cost_for_two
  • Restaurants that allow table reservation might be expensive.
  1. aggregate_rating_rating and votes
  • More votes imply high ratings, conversely high ratings imply more votes.

Restaurants Delivering Online or Not

u4 = unique(data_ret$has_online_delivery)
ggplot(data = data_ret) +
        geom_bar(mapping = aes(x = as.character(has_online_delivery), 
                               fill = as.character(has_online_delivery))) +
        labs(x = "Has online delivery or not?", y = "Count",
             title = "Restaurants Delivering Online or Not",
             fill = "has_online_delivery")

data_ret %>%
count(has_online_delivery)
## # A tibble: 1 x 2
##   has_online_delivery     n
##                 <dbl> <int>
## 1                   0  6830

All restaurants doesn’t deliver online. Why is this?

Restaurants Allowing Table Booking or Not

ggplot(data = data) +
        geom_bar(mapping = aes(x = as.character(has_table_booking), 
                               fill = as.character(has_table_booking))) +
        labs(x = "Allow table booking or not?", y = "Count",
             title = "Restaurants Allowing Table Booking or Not",
             fill = "has_table_booking")

data %>%
count(has_table_booking) %>% 
        rmarkdown::paged_table()
# Restaurants with table booking
data_ret %>% filter(has_table_booking == 1) %>% 
        select(name, address) %>% 
        rmarkdown::paged_table()

Table Booking vs Aggregate Rating

ggplot(data = data_ret) +
geom_bar(mapping = aes(x = as.character(aggregate_rating),
                       fill = as.character(has_table_booking)),
         position = "fill") + 
        theme(axis.text.x=element_text(angle=50, size=6, vjust=0.5)) +
        labs(x = "Aggregate Rating", y = "Count",
             title = "Table Booking Rate vs Aggregate Rating",
             fill = "has_table_booking")

Most of the restaurants with high rating have table bookings but quite a few of them as represented by the blue bars.

Location

ggplot(data = data_ret) +
        geom_bar(mapping = aes(x = city, 
                               fill = city)) +
        labs(x = "Location", y = "Count",
             title = "Locationwise Counts for Restaurants",
             fill = "City") + 
        theme(axis.text.x=element_text(angle=50, size=6, vjust=0.5))

Location and Rating

ggplot(data = data_ret, mapping = aes(x = city, y = aggregate_rating,
                                      fill = city)) +
        geom_boxplot() + theme(text = element_text(size = 20),
                               axis.text.x = element_text(angle = 90,
                                                         hjust = 1)) +
        labs(x = "City", y = "Aggregaate Rating",
             title = "Locationwise Rating",
             fill = "City")

Restaurant Type

There are too many restaurant cuisines (699 obs) and plotting them would be inconvenient, so what I will do is to get the top 30 cuisines only.

top_30_cuisines = data_ret %>% group_by(cuisines) %>% count() %>%
        arrange(desc(n))

top_30_cuisines = top_30_cuisines[1:30, ]
top_30_cuisines
## # A tibble: 30 x 2
## # Groups:   cuisines [30]
##    cuisines      n
##    <chr>     <int>
##  1 Filipino    852
##  2 Coffee      347
##  3 American    320
##  4 Japanese    317
##  5 Chinese     245
##  6 Fast Food   220
##  7 Bakery      181
##  8 Korean      161
##  9 Pizza       158
## 10 Italian     119
## # … with 20 more rows
ggplot(data = top_30_cuisines) +
geom_bar(mapping = aes(x = cuisines, y = n, fill = cuisines),
         stat = "identity") + 
        theme(text = element_text(size = 30),
              axis.text.x = element_text(angle = 90, hjust = 1)) +
        labs(x = "Cuisines", y = "Count",
             title = "Top 30 Restaurant Type",
             fill = "Cuisines")

Cuisines and Rating

data_for_this_plot = 
        subset(data_ret, cuisines %in% top_30_cuisines$cuisines)

ggplot(data = data_for_this_plot, mapping = aes(x = cuisines,
                                             y = aggregate_rating, 
                                             fill = cuisines)) +
        geom_boxplot() + theme(text = element_text(size = 30),
                               axis.text.x = element_text(angle = 90,
                                                         hjust = 1)) +
        labs(x = "City", y = "Aggregaate Rating",
             title = "Top 30 Cuisine Rating",
             fill = "City")

Cost of Restaurant

ggplot(data = data_ret) +
geom_bar(mapping = aes(x = as.factor(average_cost_for_two),
                       fill = as.factor(average_cost_for_two)),
         show.legend = FALSE) + labs(x = "Cost", y = "Count",
                                     title = "Cost of Restaurants") +
        theme(text = element_text(size = 40), 
              axis.text.x = element_text(angle = 90, hjust = 1))

The cost is in terms of Philippine peso. As we can see, 500php is the most frequent rate followed by 400php, and 300php.

Number of Restaurants in a Location (address)

data_for_this_plot = data_ret %>% group_by(address) %>% count() %>% 
        arrange(desc(n))

data_for_this_plot %>% rmarkdown::paged_table()
# Top 50 location having more restaurants
top_50 = data_for_this_plot[1:50, ]
ggplot(data = top_50) +
        geom_bar(aes(x = address, y = n, fill = address),
                 stat = "identity", show.legend = FALSE) + 
        theme(text = element_text(size = 35),
              axis.text.x = element_text(angle = 90, hjust = 1))

Number of Restaurant Type per Location

data_for_this_plot = data_ret %>% group_by(address, cuisines) %>%
        count() %>% arrange(desc(n))
data_for_this_plot %>% rmarkdown::paged_table()

Most famous Restaurant in Metro Manila

data_for_this_plot = data_ret %>% group_by(name) %>% count() %>%
        arrange(desc(n))
top_50 = data_for_this_plot[1:50, ]
top_50 %>% ggplot() + 
        geom_bar(aes(x = name, y = n, fill = name), stat = "identity", 
                 show.legend = FALSE) + labs(x = "Restaurant",
                 y = "No. of Oulets", title = "Top 50 Famous Restaurants") +
                 theme(text = element_text(size = 30),
                 axis.text.x = element_text(angle = 90, hjust = 1,
                                            vjust = 0.9))

Regression Analysis

In this section I’ll perform three regression models to generate predictions. I’ll compare the test MSE of these three models to arrive at a single model that has the best accuracy.

Setting the Data

Here, I will use the validation set approach to split the data into 2. One for training set and the other half is for test set.

# Retaining predictors and response variable
data_for_reg = data_ret %>% select(aggregate_rating, has_online_delivery,
                                   has_table_booking, votes, address,
                                   cuisines, average_cost_for_two)

# Converting qualitative variables to factors
data_for_reg$has_online_delivery =
        as.factor(data_for_reg$has_online_delivery)

data_for_reg$has_table_booking =
        as.factor(data_for_reg$has_table_booking)

data_for_reg$address =
        as.factor(data_for_reg$address)

data_for_reg$cuisines =
        as.factor(data_for_reg$cuisines)

class_type = as.matrix(lapply(data_for_reg, function(x) class(x)))
colnames(class_type) = c("class_type")
class_type
##                      class_type
## aggregate_rating     "numeric" 
## has_online_delivery  "factor"  
## has_table_booking    "factor"  
## votes                "numeric" 
## address              "factor"  
## cuisines             "factor"  
## average_cost_for_two "numeric"
# Removing vars with one value
u1 = unique(data_for_reg$has_online_delivery)
u2 = unique(data_for_reg$has_table_booking)
u3 = unique(data_for_reg$votes)
u4 = unique(data_for_reg$address)
u5 = unique(data_for_reg$cuisines)
u6 = unique(data_for_reg$average_cost_for_two)
data_for_reg = data_for_reg %>% select(-has_online_delivery)

# Removing 1 missing value
data_for_reg = data_for_reg[complete.cases(data_for_reg), ]
data_for_reg = data_for_reg %>% select(-c(address, cuisines))
# Data types
nrow(data_for_reg)
## [1] 6829

Splitting the Data

set.seed(1)
train = sample(6829, 3415)

Multiple Linear Regression

Given qualitative variables such as has_online_delivery, has_table_booking, address, and cuisines, R generates dummy variables automatically. Below we fit a multiple regression model.

lm.fit = lm(average_cost_for_two ~ has_table_booking + votes +
                    aggregate_rating,
            data = data_for_reg, subset = train)
summary(lm.fit)
## 
## Call:
## lm(formula = average_cost_for_two ~ has_table_booking + votes + 
##     aggregate_rating, data = data_for_reg, subset = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1009.8  -286.2   -88.8   172.9  9202.0 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        316.1991    22.0742  14.324  < 2e-16 ***
## has_table_booking1 493.9149    32.1947  15.341  < 2e-16 ***
## votes                0.6625     0.1018   6.511 8.54e-11 ***
## aggregate_rating   118.6082     7.5555  15.698  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 478.3 on 3411 degrees of freedom
## Multiple R-squared:  0.2027, Adjusted R-squared:  0.202 
## F-statistic: 289.1 on 3 and 3411 DF,  p-value: < 2.2e-16
attach(data_for_reg)
mean((aggregate_rating - predict(lm.fit))[-train]^2)
## [1] 602243

Therefore, the estimated test MSE for multiple linear regression fit is 602243. The square root of the MSE is therefore around 776, indicating that this model leads to test predictions that are within around 776php of the true median cost value.

Regression Tree

reg_tree_data = data_for_reg
reg_tree = tree(average_cost_for_two ~ ., reg_tree_data, subset = train) 
summary(reg_tree)
## 
## Regression tree:
## tree(formula = average_cost_for_two ~ ., data = reg_tree_data, 
##     subset = train)
## Number of terminal nodes:  5 
## Residual mean deviance:  218000 = 743400000 / 3410 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -952.90 -258.60  -58.62    0.00  147.10 8965.00
plot(reg_tree)
text(reg_tree, pretty = 0)

The tree indicates that restaurants with low ratings and low votes have low rates (\(aggregate\_rating<3.35\), \(votes<7.5\)). On the other hand, restaurants with high ratings and have table bookings are expensive (\(aggregate\_rating \geq 3.35\), \(has\_table\_booking:1\)). This is then followed by restaurants with no table bookings but with high rating (\(has\_table\_booking:0\), \(aggregate\_rating \geq 3.75\)).

# Tree Pruning ------------------------------------------------------------
cv.reg_tree = cv.tree(reg_tree)
names(cv.reg_tree)
## [1] "size"   "dev"    "k"      "method"
cv.reg_tree
## $size
## [1] 5 4 3 2 1
## 
## $dev
## [1] 765174974 766802495 790327462 828693288 979308052
## 
## $k
## [1]      -Inf  18034734  24228694  41937378 150893247
## 
## $method
## [1] "deviance"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
plot(cv.reg_tree$size, cv.reg_tree$dev, type = 'b')

prune.reg_tree = prune.tree(reg_tree, best = 5)
plot(prune.reg_tree)
text(prune.reg_tree, pretty = 0)

# Prediction
y_hat = predict(prune.reg_tree, newdata = reg_tree_data[-train,])
reg_tree_data.test = reg_tree_data[-train, 'average_cost_for_two']

plot(y_hat, reg_tree_data.test$average_cost_for_two)
abline(0, 1)

mean((y_hat - reg_tree_data.test$average_cost_for_two)^2)
## [1] 239351.7

In this case, the tree with 5 terminal nodes is selected by cross-validation. The test set MSE associated with the regression tree is 239351.7 The square root of the MSE is around 489, indicating that this model leads to test predictions that are within around Php489 of the true median cost value.

Random Forest

Here we use \(m = \sqrt{p}= 2\) for the subset of predictors in every split.

rndm_frst_data = data_for_reg
rndm_frst = randomForest(average_cost_for_two ~ .,
                         data = rndm_frst_data, subset = train,
                         mtry = 2, importance = TRUE)

# Prediction
yhat.rndm_frst = predict(rndm_frst,
                         newdata = rndm_frst_data[-train, ])
rndm_frst_data.test = rndm_frst_data[-train,
                                     'average_cost_for_two']
mean((yhat.rndm_frst - rndm_frst_data.test$average_cost_for_two)^2)
## [1] 254707.5

The test set MSE is 254707.5; this indicates that random forests did not yield an improvement over regression tree.

Using the importance() function, we can view the importance of each variable.

importance(rndm_frst)
##                    %IncMSE IncNodePurity
## aggregate_rating  19.65345     203131921
## has_table_booking 16.01677      43884678
## votes             23.71299     317525237
varImpPlot (rndm_frst)

The results indicate that across all of the trees considered in the random forest, votes (vote) and ratings (aggregate_rating) are by far the two most important variables.

Summary

Multiple_Linear_Regression Regression_Tree Random_Forest
602243 239351.7 254707.5

Source Code: https://github.com/EarlMacalam/Data-Analysis-with-R/blob/master/Restaurants-Price-Prediction.Rmd