In this notebook, explonatory data analysis is performed to get a fair idea about factors affecting the establishment of different types of restaurant at different places in Metro Manila, aggregate rating of each restaurant and many more. Three models are also compared to have an idea of the best predicting model. We compare the test MSE of the three models.
There’s nothing so much to clean in this dataset but I will check if there are missing values and redundant observations by exploring the data and then working on it.
data = read_csv("data.csv")
data = data %>% select(-c("X1" ))
head(data) %>%
rmarkdown::paged_table()## [1] 6830 23
# Variable types
class_type = as.matrix(lapply(data, function(x) class(x)))
colnames(class_type) = c("class_type")
class_type## class_type
## id "numeric"
## address "character"
## city "character"
## city_id "numeric"
## country_id "numeric"
## locality "character"
## locality_verbose "character"
## aggregate_rating "numeric"
## rating_color "character"
## rating_text "character"
## votes "numeric"
## average_cost_for_two "numeric"
## cuisines "character"
## currency "character"
## has_online_delivery "numeric"
## has_table_booking "numeric"
## include_bogo_offers "logical"
## is_book_form_web_view "numeric"
## is_delivering_now "numeric"
## is_table_reservation_supported "numeric"
## is_zomato_book_res "numeric"
## name "character"
## price_range "numeric"
# Missing Values
missing_val = as.matrix(lapply(data, function(x) sum(is.na(x))))
colnames(missing_val) = c("missing_values")
missing_val## missing_values
## id 0
## address 0
## city 0
## city_id 0
## country_id 0
## locality 0
## locality_verbose 0
## aggregate_rating 0
## rating_color 0
## rating_text 0
## votes 0
## average_cost_for_two 0
## cuisines 1
## currency 0
## has_online_delivery 0
## has_table_booking 0
## include_bogo_offers 0
## is_book_form_web_view 2
## is_delivering_now 0
## is_table_reservation_supported 0
## is_zomato_book_res 2
## name 0
## price_range 0
filter(data, is.na(cuisines) | is.na(is_book_form_web_view) |
is.na(is_zomato_book_res)) %>%
rmarkdown::paged_table()## # A tibble: 0 x 23
## # … with 23 variables: id <dbl>, address <chr>, city <chr>, city_id <dbl>,
## # country_id <dbl>, locality <chr>, locality_verbose <chr>,
## # aggregate_rating <dbl>, rating_color <chr>, rating_text <chr>, votes <dbl>,
## # average_cost_for_two <dbl>, cuisines <chr>, currency <chr>,
## # has_online_delivery <dbl>, has_table_booking <dbl>,
## # include_bogo_offers <lgl>, is_book_form_web_view <dbl>,
## # is_delivering_now <dbl>, is_table_reservation_supported <dbl>,
## # is_zomato_book_res <dbl>, name <chr>, price_range <dbl>
# Deleting Unnnecessary Columns for Analysis
data = data %>% select(-c('id', 'city_id', 'country_id', 'locality',
'locality_verbose', 'rating_color',
'rating_text', 'currency'))
# Reading Column Names
names(data)## [1] "address" "city"
## [3] "aggregate_rating" "votes"
## [5] "average_cost_for_two" "cuisines"
## [7] "has_online_delivery" "has_table_booking"
## [9] "include_bogo_offers" "is_book_form_web_view"
## [11] "is_delivering_now" "is_table_reservation_supported"
## [13] "is_zomato_book_res" "name"
## [15] "price_range"
# Reading uninque values on selected columns
u1 = unique(data$aggregate_rating)
u2 = unique(data$price_range)
u3 = unique(data$include_bogo_offers)
u4 = unique(data$has_online_delivery)
u5 = unique(data$is_book_form_web_view)
u6 = unique(data$is_delivering_now)
u7 = unique(data$is_zomato_book_res)
# Data with these variables
data_ret = data
# Remove variables with 0 variance
data = data %>% select(-c('include_bogo_offers', 'has_online_delivery',
'is_book_form_web_view', 'is_delivering_now'))Variables is_book_form_web_view, include_bogo_offers, has_online_delivery, is_delivering_now have elements which are all equal. This would result a 0 variance and would produce NA values in my correlation matrix. So I removed these variables. This leaves me missing values for cuisines and is_zomato_book_res variables on restaurant with ID 18683703, 18376892, and 18451637. I’ll impute the missing value with the mean of these variable right after encoding. Also, I have no duplicates in data.
# Encoding
data$address = as.numeric(as.factor(data$address))
data$city = as.numeric(as.factor(data$city))
data$cuisines = as.numeric(as.factor(data$cuisines))
data$name = as.numeric(as.factor(data$name))
str(data)## tibble [6,830 × 11] (S3: tbl_df/tbl/data.frame)
## $ address : num [1:6830] 1187 1187 2352 2859 2115 ...
## $ city : num [1:6830] 5 5 10 6 10 10 10 10 10 6 ...
## $ aggregate_rating : num [1:6830] 4.3 3.4 4.9 4.7 3.6 3.7 3.1 3.5 2.5 3.8 ...
## $ votes : num [1:6830] 402 38 1014 300 51 ...
## $ average_cost_for_two : num [1:6830] 4000 4000 6000 2600 2500 3500 3500 3000 2700 3500 ...
## $ cuisines : num [1:6830] 608 458 266 450 391 311 311 104 115 202 ...
## $ has_table_booking : num [1:6830] 0 0 0 0 0 0 0 0 0 1 ...
## $ is_table_reservation_supported: num [1:6830] 0 0 1 0 1 1 0 0 0 1 ...
## $ is_zomato_book_res : num [1:6830] 0 0 0 0 0 0 0 0 0 0 ...
## $ name : num [1:6830] 1194 2102 2619 452 1583 ...
## $ price_range : num [1:6830] 4 4 4 4 4 4 4 4 4 4 ...
# Taking care of missing values
data$cuisines = ifelse(is.na(data$cuisines), ave(data$cuisines,FUN = function(x) mean(x, na.rm = TRUE)), data$cuisines)
data$is_zomato_book_res = ifelse(is.na(data$is_zomato_book_res), ave(data$is_zomato_book_res,FUN = function(x) mean(x, na.rm = TRUE)), data$is_zomato_book_res)| Name | data |
| Number of rows | 6830 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| address | 0 | 1 | 1818.31 | 820.00 | 1 | 1268.25 | 1834.0 | 2551.75 | 3194.0 | ▃▅▇▆▇ |
| city | 0 | 1 | 9.46 | 4.57 | 1 | 5.00 | 10.0 | 13.00 | 17.0 | ▅▅▃▇▃ |
| aggregate_rating | 0 | 1 | 2.97 | 1.22 | 0 | 2.90 | 3.3 | 3.70 | 4.9 | ▂▁▂▇▂ |
| votes | 0 | 1 | 56.28 | 91.72 | 0 | 7.00 | 26.0 | 67.00 | 1404.0 | ▇▁▁▁▁ |
| average_cost_for_two | 0 | 1 | 743.72 | 543.76 | 0 | 400.00 | 600.0 | 1000.00 | 10000.0 | ▇▁▁▁▁ |
| cuisines | 0 | 1 | 334.34 | 190.40 | 1 | 188.00 | 310.0 | 490.00 | 698.0 | ▅▇▇▆▅ |
| has_table_booking | 0 | 1 | 0.08 | 0.26 | 0 | 0.00 | 0.0 | 0.00 | 1.0 | ▇▁▁▁▁ |
| is_table_reservation_supported | 0 | 1 | 0.11 | 0.32 | 0 | 0.00 | 0.0 | 0.00 | 1.0 | ▇▁▁▁▁ |
| is_zomato_book_res | 0 | 1 | 0.00 | 0.01 | 0 | 0.00 | 0.0 | 0.00 | 1.0 | ▇▁▁▁▁ |
| name | 0 | 1 | 1644.43 | 926.25 | 1 | 833.00 | 1662.5 | 2438.00 | 3278.0 | ▇▇▇▇▇ |
| price_range | 0 | 1 | 2.38 | 0.90 | 1 | 2.00 | 2.0 | 3.00 | 4.0 | ▅▇▁▇▂ |
# Compute the correlation matrix
pearson_cormat = cor(data, method = "pearson")
ggcorrplot(pearson_cormat, title = "Pearson Correlation")# kendall_cormat = cor(data, method = c("kendall"))
# ggcorrplot(kendall_cormat, title = "Kendall Correlation")
spearman_cormat = cor(data, method = "spearman")
ggcorrplot(spearman_cormat, title = "Spearman Correlation")These are the variables with noticeable correlation:
average_cost_for_two and price_rangeis_table_reservation_supported and average_cost_for_twoaggregate_rating_rating and votesu4 = unique(data_ret$has_online_delivery)
ggplot(data = data_ret) +
geom_bar(mapping = aes(x = as.character(has_online_delivery),
fill = as.character(has_online_delivery))) +
labs(x = "Has online delivery or not?", y = "Count",
title = "Restaurants Delivering Online or Not",
fill = "has_online_delivery")## # A tibble: 1 x 2
## has_online_delivery n
## <dbl> <int>
## 1 0 6830
All restaurants doesn’t deliver online. Why is this?
ggplot(data = data) +
geom_bar(mapping = aes(x = as.character(has_table_booking),
fill = as.character(has_table_booking))) +
labs(x = "Allow table booking or not?", y = "Count",
title = "Restaurants Allowing Table Booking or Not",
fill = "has_table_booking")# Restaurants with table booking
data_ret %>% filter(has_table_booking == 1) %>%
select(name, address) %>%
rmarkdown::paged_table()ggplot(data = data_ret) +
geom_bar(mapping = aes(x = as.character(aggregate_rating),
fill = as.character(has_table_booking)),
position = "fill") +
theme(axis.text.x=element_text(angle=50, size=6, vjust=0.5)) +
labs(x = "Aggregate Rating", y = "Count",
title = "Table Booking Rate vs Aggregate Rating",
fill = "has_table_booking")Most of the restaurants with high rating have table bookings but quite a few of them as represented by the blue bars.
ggplot(data = data_ret) +
geom_bar(mapping = aes(x = city,
fill = city)) +
labs(x = "Location", y = "Count",
title = "Locationwise Counts for Restaurants",
fill = "City") +
theme(axis.text.x=element_text(angle=50, size=6, vjust=0.5))ggplot(data = data_ret, mapping = aes(x = city, y = aggregate_rating,
fill = city)) +
geom_boxplot() + theme(text = element_text(size = 20),
axis.text.x = element_text(angle = 90,
hjust = 1)) +
labs(x = "City", y = "Aggregaate Rating",
title = "Locationwise Rating",
fill = "City")There are too many restaurant cuisines (699 obs) and plotting them would be inconvenient, so what I will do is to get the top 30 cuisines only.
top_30_cuisines = data_ret %>% group_by(cuisines) %>% count() %>%
arrange(desc(n))
top_30_cuisines = top_30_cuisines[1:30, ]
top_30_cuisines## # A tibble: 30 x 2
## # Groups: cuisines [30]
## cuisines n
## <chr> <int>
## 1 Filipino 852
## 2 Coffee 347
## 3 American 320
## 4 Japanese 317
## 5 Chinese 245
## 6 Fast Food 220
## 7 Bakery 181
## 8 Korean 161
## 9 Pizza 158
## 10 Italian 119
## # … with 20 more rows
ggplot(data = top_30_cuisines) +
geom_bar(mapping = aes(x = cuisines, y = n, fill = cuisines),
stat = "identity") +
theme(text = element_text(size = 30),
axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(x = "Cuisines", y = "Count",
title = "Top 30 Restaurant Type",
fill = "Cuisines")data_for_this_plot =
subset(data_ret, cuisines %in% top_30_cuisines$cuisines)
ggplot(data = data_for_this_plot, mapping = aes(x = cuisines,
y = aggregate_rating,
fill = cuisines)) +
geom_boxplot() + theme(text = element_text(size = 30),
axis.text.x = element_text(angle = 90,
hjust = 1)) +
labs(x = "City", y = "Aggregaate Rating",
title = "Top 30 Cuisine Rating",
fill = "City")ggplot(data = data_ret) +
geom_bar(mapping = aes(x = as.factor(average_cost_for_two),
fill = as.factor(average_cost_for_two)),
show.legend = FALSE) + labs(x = "Cost", y = "Count",
title = "Cost of Restaurants") +
theme(text = element_text(size = 40),
axis.text.x = element_text(angle = 90, hjust = 1))The cost is in terms of Philippine peso. As we can see, 500php is the most frequent rate followed by 400php, and 300php.
data_for_this_plot = data_ret %>% group_by(address) %>% count() %>%
arrange(desc(n))
data_for_this_plot %>% rmarkdown::paged_table()# Top 50 location having more restaurants
top_50 = data_for_this_plot[1:50, ]
ggplot(data = top_50) +
geom_bar(aes(x = address, y = n, fill = address),
stat = "identity", show.legend = FALSE) +
theme(text = element_text(size = 35),
axis.text.x = element_text(angle = 90, hjust = 1))data_for_this_plot = data_ret %>% group_by(address, cuisines) %>%
count() %>% arrange(desc(n))
data_for_this_plot %>% rmarkdown::paged_table()data_for_this_plot = data_ret %>% group_by(name) %>% count() %>%
arrange(desc(n))
top_50 = data_for_this_plot[1:50, ]
top_50 %>% ggplot() +
geom_bar(aes(x = name, y = n, fill = name), stat = "identity",
show.legend = FALSE) + labs(x = "Restaurant",
y = "No. of Oulets", title = "Top 50 Famous Restaurants") +
theme(text = element_text(size = 30),
axis.text.x = element_text(angle = 90, hjust = 1,
vjust = 0.9))In this section I’ll perform three regression models to generate predictions. I’ll compare the test MSE of these three models to arrive at a single model that has the best accuracy.
Here, I will use the validation set approach to split the data into 2. One for training set and the other half is for test set.
# Retaining predictors and response variable
data_for_reg = data_ret %>% select(aggregate_rating, has_online_delivery,
has_table_booking, votes, address,
cuisines, average_cost_for_two)
# Converting qualitative variables to factors
data_for_reg$has_online_delivery =
as.factor(data_for_reg$has_online_delivery)
data_for_reg$has_table_booking =
as.factor(data_for_reg$has_table_booking)
data_for_reg$address =
as.factor(data_for_reg$address)
data_for_reg$cuisines =
as.factor(data_for_reg$cuisines)
class_type = as.matrix(lapply(data_for_reg, function(x) class(x)))
colnames(class_type) = c("class_type")
class_type## class_type
## aggregate_rating "numeric"
## has_online_delivery "factor"
## has_table_booking "factor"
## votes "numeric"
## address "factor"
## cuisines "factor"
## average_cost_for_two "numeric"
# Removing vars with one value
u1 = unique(data_for_reg$has_online_delivery)
u2 = unique(data_for_reg$has_table_booking)
u3 = unique(data_for_reg$votes)
u4 = unique(data_for_reg$address)
u5 = unique(data_for_reg$cuisines)
u6 = unique(data_for_reg$average_cost_for_two)
data_for_reg = data_for_reg %>% select(-has_online_delivery)
# Removing 1 missing value
data_for_reg = data_for_reg[complete.cases(data_for_reg), ]
data_for_reg = data_for_reg %>% select(-c(address, cuisines))
# Data types
nrow(data_for_reg)## [1] 6829
Given qualitative variables such as has_online_delivery, has_table_booking, address, and cuisines, R generates dummy variables automatically. Below we fit a multiple regression model.
lm.fit = lm(average_cost_for_two ~ has_table_booking + votes +
aggregate_rating,
data = data_for_reg, subset = train)
summary(lm.fit)##
## Call:
## lm(formula = average_cost_for_two ~ has_table_booking + votes +
## aggregate_rating, data = data_for_reg, subset = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1009.8 -286.2 -88.8 172.9 9202.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 316.1991 22.0742 14.324 < 2e-16 ***
## has_table_booking1 493.9149 32.1947 15.341 < 2e-16 ***
## votes 0.6625 0.1018 6.511 8.54e-11 ***
## aggregate_rating 118.6082 7.5555 15.698 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 478.3 on 3411 degrees of freedom
## Multiple R-squared: 0.2027, Adjusted R-squared: 0.202
## F-statistic: 289.1 on 3 and 3411 DF, p-value: < 2.2e-16
## [1] 602243
Therefore, the estimated test MSE for multiple linear regression fit is 602243. The square root of the MSE is therefore around 776, indicating that this model leads to test predictions that are within around 776php of the true median cost value.
reg_tree_data = data_for_reg
reg_tree = tree(average_cost_for_two ~ ., reg_tree_data, subset = train)
summary(reg_tree)##
## Regression tree:
## tree(formula = average_cost_for_two ~ ., data = reg_tree_data,
## subset = train)
## Number of terminal nodes: 5
## Residual mean deviance: 218000 = 743400000 / 3410
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -952.90 -258.60 -58.62 0.00 147.10 8965.00
The tree indicates that restaurants with low ratings and low votes have low rates (\(aggregate\_rating<3.35\), \(votes<7.5\)). On the other hand, restaurants with high ratings and have table bookings are expensive (\(aggregate\_rating \geq 3.35\), \(has\_table\_booking:1\)). This is then followed by restaurants with no table bookings but with high rating (\(has\_table\_booking:0\), \(aggregate\_rating \geq 3.75\)).
# Tree Pruning ------------------------------------------------------------
cv.reg_tree = cv.tree(reg_tree)
names(cv.reg_tree)## [1] "size" "dev" "k" "method"
## $size
## [1] 5 4 3 2 1
##
## $dev
## [1] 765174974 766802495 790327462 828693288 979308052
##
## $k
## [1] -Inf 18034734 24228694 41937378 150893247
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
prune.reg_tree = prune.tree(reg_tree, best = 5)
plot(prune.reg_tree)
text(prune.reg_tree, pretty = 0)# Prediction
y_hat = predict(prune.reg_tree, newdata = reg_tree_data[-train,])
reg_tree_data.test = reg_tree_data[-train, 'average_cost_for_two']
plot(y_hat, reg_tree_data.test$average_cost_for_two)
abline(0, 1)## [1] 239351.7
In this case, the tree with 5 terminal nodes is selected by cross-validation. The test set MSE associated with the regression tree is 239351.7 The square root of the MSE is around 489, indicating that this model leads to test predictions that are within around Php489 of the true median cost value.
Here we use \(m = \sqrt{p}= 2\) for the subset of predictors in every split.
rndm_frst_data = data_for_reg
rndm_frst = randomForest(average_cost_for_two ~ .,
data = rndm_frst_data, subset = train,
mtry = 2, importance = TRUE)
# Prediction
yhat.rndm_frst = predict(rndm_frst,
newdata = rndm_frst_data[-train, ])
rndm_frst_data.test = rndm_frst_data[-train,
'average_cost_for_two']
mean((yhat.rndm_frst - rndm_frst_data.test$average_cost_for_two)^2)## [1] 254707.5
The test set MSE is 254707.5; this indicates that random forests did not yield an improvement over regression tree.
Using the importance() function, we can view the importance of each variable.
## %IncMSE IncNodePurity
## aggregate_rating 19.65345 203131921
## has_table_booking 16.01677 43884678
## votes 23.71299 317525237
The results indicate that across all of the trees considered in the random forest, votes (vote) and ratings (aggregate_rating) are by far the two most important variables.
| Multiple_Linear_Regression | Regression_Tree | Random_Forest |
|---|---|---|
| 602243 | 239351.7 | 254707.5 |
Among the three models, regression tree performed the best. This is then followed by random forest and multiple linear regression.
Decision trees outperform multiple linear regression which implies that there is a highly non-linear and complex relationship between the features and the response.
I just used validation approach in splitting the data. This can be improve by using cross-validation or any other resampling procedures.
Most of the restaurants in Metro Manila have no table bookings.
High rating restaurants have table bookings.
Quezon city has the most number of restaurants.
Most of the high rating restaurants are in Makati , Taguig, San Juan, and Pasay City.
Filipino Coffee, Filipino Asian, Bakery Coffee, and Japanese Ramen are the cuisines with the most high ratings.
500php is the most frequent rate followed by 400php, and 300php.
The most famous restaurant in Metro Manila is Starbucks followed by Jollibee and McDonalds.
Source Code: https://github.com/EarlMacalam/Data-Analysis-with-R/blob/master/Restaurants-Price-Prediction.Rmd