For this assignment, I study how the price range and text ratings of the best restaurants around the world, according to Zomato, relate to their aggregate ratings. Zomato makes its API available to the public. According to Zomato, the API provides thorough and up-to-date information on more than 2 million restaurants across 10,000 cities worldwide.
Many useful analyses are possible, for foodies as well as analysts who want to find and interpret the best restaurants in the world. This assignment examines how value for money (the Price range variable) and text ratings relate to aggregate ratings. Additionally, the dataset lets us see the most common cuisine in each country and which locality serves that cuisine in the greatest number of restaurants. I will explore the multilevel structure of the relationship between text rating, price range, and aggregate rating by locality.
The data set is called “Zomato Restaurants Data” and is taken from Kaggle.com.
I start by cleaning and renaming some variables for the analysis. The first investigation in this assignment is an ecological analysis, which summarizes the mean aggregate rating and mean price range of restaurants per locality. The dependent variable is aggregate rating and the independent variables are rating text and price range. Aggregate rating ranges from 0 to 5, with 5 being the highest. Price range is coded 1, 2, 3, or 4 and reflects the price of a meal for two in the country's currency. For example, a restaurant in India with price range 3 would mean Rs. 3,000. The text rating variable is a label such as Poor, Good, Very Good, or Excellent, and restaurants without a rating are labeled Not rated.
Let’s see the results!
library(nlme)
library(dplyr)
library(magrittr)
library(tidyr)
library(readr)
library(haven)
library(merTools)
library(lmerTest)
library(ggplot2)
library(texreg)
zomato <- read_csv("/Users/Deepakie/Documents/Queens College/SOC712/Data/zomato.csv")
Parsed with column specification:
cols(
.default = col_character(),
`Restaurant ID` = col_integer(),
`Country Code` = col_integer(),
Longitude = col_double(),
Latitude = col_double(),
`Average Cost for two` = col_integer(),
`Price range` = col_integer(),
`Aggregate rating` = col_double(),
Votes = col_integer()
)
See spec(...) for full column specifications.
head(zomato)
As we can see, the dataset includes the restaurant name, country code, city, aggregate rating, and more. Many analyses of the Zomato API data have already been done. If you are interested, visit https://www.kaggle.com/shrutimehta/zomato-restaurants-data/kernels.
zomato_data<-zomato%>%
rename(Restaurant_name = `Restaurant Name` ,
Aggregate_rating = `Aggregate rating`,
Rating_text = `Rating text`,
Price_range = `Price range`)%>%
# Column names are unquoted so that is.na() tests the column values, not literal strings
filter(!is.na(Aggregate_rating), !is.na(Rating_text), !is.na(Price_range), !is.na(Restaurant_name))
length(unique(zomato$Locality))
[1] 1208
The unique function returns the distinct values of a variable, so its length tells us how many groups the ecological analysis will have. The data contain restaurants from 1208 localities.
To better assess the ecological regression, I check how many restaurants there are in each locality.
zomato_data %>%
group_by(Locality) %>%
summarise(n_restaurants = n())
The ecological regression below regresses the mean aggregate rating of each locality on its mean price range.
ratings <- zomato_data %>%
group_by(Locality) %>%
summarise(mean_r = mean(`Aggregate_rating`, na.rm = TRUE), mean_p = mean(`Price_range`, na.rm = TRUE))
head(ratings)
ecoreg <- lm(mean_r ~ mean_p, data = ratings)
summary(ecoreg)
Call:
lm(formula = mean_r ~ mean_p, data = ratings)
Residuals:
Min 1Q Median 3Q Max
-4.3058 -0.4382 0.1618 0.5986 2.2970
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.03540 0.07793 26.12 <2e-16 ***
mean_p 0.56760 0.02912 19.49 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.951 on 1206 degrees of freedom
Multiple R-squared: 0.2395, Adjusted R-squared: 0.2389
F-statistic: 379.9 on 1 and 1206 DF, p-value: < 2.2e-16
The ecological regression relates the mean aggregate rating to the mean price range in each locality. The coefficient is statistically significant: a one-unit increase in a locality's mean price range is associated with a 0.57-point increase in its mean aggregate rating. In this simple linear model, with one independent and one dependent variable, localities with a higher average price range tend to have higher average ratings. The problem with evaluating at the ecological level is that we cannot assume the association holds for individual restaurants, because the model says nothing about them. Drawing that inference would be an ecological fallacy.
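To make the ecological fallacy concrete, here is a minimal sketch on simulated data (not the Zomato data). All variable names and parameter values are hypothetical, chosen so that the relationship between group means is positive while the relationship inside each group is negative.

```r
set.seed(42)

n_groups <- 50   # hypothetical number of localities
n_per    <- 40   # hypothetical number of restaurants per locality
g <- rep(seq_len(n_groups), each = n_per)

# Each group has its own average level of x
group_x <- rnorm(n_groups, mean = 2.5, sd = 0.7)
x <- group_x[g] + rnorm(n_groups * n_per, sd = 0.5)

# Between groups, y rises with the group's level of x;
# within a group, y falls as x moves above the group's level
y <- 1 + 1.0 * group_x[g] - 0.5 * (x - group_x[g]) +
  rnorm(n_groups * n_per, sd = 0.3)
sim <- data.frame(g = g, x = x, y = y)

# Ecological regression: one row per group, using group means
agg <- aggregate(cbind(x, y) ~ g, data = sim, FUN = mean)
eco_slope <- unname(coef(lm(y ~ x, data = agg))["x"])

# Within-group regression: subtract each group's mean from x and y
sim$x_w <- sim$x - ave(sim$x, sim$g)
sim$y_w <- sim$y - ave(sim$y, sim$g)
within_slope <- unname(coef(lm(y_w ~ x_w, data = sim))["x_w"])

c(ecological = eco_slope, within = within_slope)
```

In this simulation the slope estimated from group means is positive while the slope inside groups is negative, which is why a group-level result should not be read as a statement about individual units.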
cpool <- lm(Aggregate_rating ~ Price_range, data = zomato_data)
summary(cpool)
Call:
lm(formula = Aggregate_rating ~ Price_range, data = zomato_data)
Residuals:
Min 1Q Median 3Q Max
-4.2761 -0.7261 0.4572 0.9905 2.8238
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.34287 0.03111 43.17 <2e-16 ***
Price_range 0.73331 0.01540 47.60 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.363 on 9549 degrees of freedom
Multiple R-squared: 0.1918, Adjusted R-squared: 0.1917
F-statistic: 2266 on 1 and 9549 DF, p-value: < 2.2e-16
After the ecological regression, I examine the association between aggregate rating and price range at the restaurant level, ignoring locality altogether (complete pooling). The association is statistically significant: each one-unit increase in price range is associated with a 0.73-point increase in a restaurant's aggregate rating. The problem here is that the model treats the data as a single pool and ignores differences between localities. Next, I fit a separate regression for each locality (no pooling) using the dplyr package.
dcoef <- zomato_data %>%
group_by(Locality) %>%
do(mod = lm(Aggregate_rating ~ Price_range, data = .))
coef <- dcoef %>% do(data.frame(intc = coef(.$mod)[1]))
ggplot(coef, aes(x = intc)) + geom_histogram()
Here we fit a separate regression for the restaurants in each locality. The first step is to look at the intercepts. The histogram above shows the distribution of the estimated intercepts, that is, the predicted aggregate rating at a price range of zero in each locality. Most intercepts fall between 0 and 5, with the largest concentration between 3 and 4.9. Now we look at the slopes.
coef <- dcoef %>% do(data.frame(Price_range = coef(.$mod)[2]))
ggplot(coef, aes(x = Price_range)) + geom_histogram()
The histogram shows the estimated price-range slope in each locality. The slopes differ somewhat from one locality to another, and localities also differ in how many restaurants they have at each price range. In other words, there is variation in both the intercept and the slope parameters. To balance between the no-pooling and complete-pooling models, we can use partial pooling, which is also known as multilevel modeling.
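The difference between the three pooling strategies can be sketched on simulated data (not the Zomato data). The group sizes, means, and variances below are hypothetical. The sketch uses an intercept-only model for simplicity and assumes the nlme package, which ships with standard R installations.

```r
library(nlme)

set.seed(1)
n_groups <- 30
# Hypothetical group sizes: some groups have only a handful of observations
n_per <- sample(2:20, n_groups, replace = TRUE)
g <- factor(rep(seq_len(n_groups), times = n_per))

# Each group has its own true mean, drawn from a common distribution
true_mean <- rnorm(n_groups, mean = 3, sd = 0.5)
y <- true_mean[as.integer(g)] + rnorm(length(g), sd = 1)
sim <- data.frame(y = y, g = g)

# Complete pooling: a single mean for everyone
complete <- mean(sim$y)

# No pooling: a separate mean for every group
no_pool <- tapply(sim$y, sim$g, mean)

# Partial pooling: random-intercept model
fit <- lme(y ~ 1, random = ~ 1 | g, data = sim, method = "ML")
overall <- unname(fixef(fit)["(Intercept)"])
partial <- coef(fit)[names(no_pool), "(Intercept)"]

# Distance of each group's estimate from the overall intercept
shrink <- data.frame(
  n       = as.integer(table(sim$g)),
  no_pool = abs(as.numeric(no_pool) - overall),
  partial = abs(partial - overall)
)
head(shrink[order(shrink$n), ])
```

The partially pooled estimates are pulled toward the overall intercept, and the pull is strongest for the groups with the fewest observations.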
m1_lme <- lme(Aggregate_rating ~ Price_range, data =zomato_data, random = ~1|Locality, method = "ML")
summary(m1_lme)
Linear mixed-effects model fit by maximum likelihood
Data: zomato_data
AIC BIC logLik
30393.73 30422.38 -15192.86
Random effects:
Formula: ~1 | Locality
(Intercept) Residual
StdDev: 0.7776847 1.11356
Fixed effects: Aggregate_rating ~ Price_range
Value Std.Error DF t-value p-value
(Intercept) 2.2325228 0.04887464 8342 45.67856 0
Price_range 0.4349093 0.01639787 8342 26.52230 0
Correlation:
(Intr)
Price_range -0.779
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-3.3658488 -0.4813580 0.1527703 0.6098936 2.5549160
Number of Observations: 9551
Number of Groups: 1208
Compared with the complete-pooling model, the partial-pooling model allows variation between localities. Compared with the no-pooling model, it imposes a structure (i.e., a distribution) on that variation. The multilevel model lets us analyze the influence of price range on aggregate rating while accounting for locality. In a random intercept model the intercepts vary by locality and are assumed to follow a normal distribution, so we can predict an intercept for each locality. The fixed effects suggest that, across the 1208 localities, the average aggregate rating at a price range of zero is 2.23, and each one-unit increase in price range is associated with a 0.43-point increase in aggregate rating.
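One way to summarize how much locality matters in the random intercept model is the intraclass correlation (ICC), the share of the unexplained variance that lies between localities. The short calculation below uses the two standard deviations reported in the summary above; the variable names are only illustrative.

```r
# Standard deviations copied from the random-effects block of summary(m1_lme)
sd_locality <- 0.7776847   # between-locality SD of the intercepts
sd_residual <- 1.11356     # within-locality (residual) SD

# ICC = between-group variance / (between-group variance + residual variance)
icc <- sd_locality^2 / (sd_locality^2 + sd_residual^2)
round(icc, 3)
```

The result is roughly one third, meaning that about a third of the variation in aggregate rating left after accounting for price range is attributable to differences between localities.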
m2_lme <- lme(Aggregate_rating ~ Price_range, data =zomato_data, random = ~Price_range|Locality, method = "ML")
summary(m2_lme)
Linear mixed-effects model fit by maximum likelihood
Data: zomato_data
AIC BIC logLik
30076.69 30119.68 -15032.35
Random effects:
Formula: ~Price_range | Locality
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
(Intercept) 1.4432889 (Intr)
Price_range 0.3322745 -0.997
Residual 1.1000871
Fixed effects: Aggregate_rating ~ Price_range
Value Std.Error DF t-value p-value
(Intercept) 2.4333728 0.07117285 8342 34.18962 0
Price_range 0.3693655 0.02156101 8342 17.13118 0
Correlation:
(Intr)
Price_range -0.936
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-3.5040036 -0.4280240 0.1234795 0.5877478 2.6907813
Number of Observations: 9551
Number of Groups: 1208
The model above analyzes the influence of price range on aggregate rating with the random-effects specification ~Price_range|Locality, so both the intercept and the price-range slope vary by locality. The fixed effects indicate that, across the 1208 localities, the average intercept is 2.43 and the average price-range slope is 0.37. The random effects show that the slope varies across localities with a standard deviation of 0.33, and that the random intercepts and slopes are strongly negatively correlated (-0.997).
AIC(cpool, m1_lme, m2_lme)
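Because the random intercept model is nested in the random slope model and both were fit by maximum likelihood, they can also be compared with a likelihood-ratio test. The sketch below does the arithmetic by hand from the log-likelihoods reported in the two summaries. The degrees of freedom (2, for the added slope variance and the intercept-slope correlation) are my count, and the chi-squared reference distribution is only approximate here because a variance is being tested at the boundary of its range.

```r
# Log-likelihoods copied from summary(m1_lme) and summary(m2_lme)
ll_random_intercept <- -15192.86
ll_random_slope     <- -15032.35

# Likelihood-ratio statistic: twice the gain in log-likelihood
lr_stat <- 2 * (ll_random_slope - ll_random_intercept)

# The random slope model adds two parameters:
# the slope variance and the intercept-slope correlation
p_value <- pchisq(lr_stat, df = 2, lower.tail = FALSE)

c(lr_stat = lr_stat, p_value = p_value)
```

With the fitted objects available, anova(m1_lme, m2_lme) from nlme reports the same comparison directly.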
m3_lme <- lme(Aggregate_rating ~ Price_range + Rating_text, data =zomato_data, random = ~ Price_range|Locality, method = "ML")
summary(m3_lme)
Linear mixed-effects model fit by maximum likelihood
Data: zomato_data
AIC BIC logLik
-5854.608 -5775.799 2938.304
Random effects:
Formula: ~Price_range | Locality
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
(Intercept) 2.870236e-02 (Intr)
Price_range 7.781014e-05 -0.002
Residual 1.761786e-01
Fixed effects: Aggregate_rating ~ Price_range + Rating_text
Value Std.Error DF t-value p-value
(Intercept) 3.0308639 0.005433916 8337 557.7678 0
Price_range 0.0181632 0.002409166 8337 7.5392 0
Rating_textExcellent 1.5742489 0.011157156 8337 141.0977 0
Rating_textGood 0.6113196 0.005135141 8337 119.0463 0
Rating_textNot rated -3.0411268 0.005162638 8337 -589.0645 0
Rating_textPoor -0.7544128 0.013385470 8337 -56.3606 0
Rating_textVery Good 1.0847958 0.006762267 8337 160.4190 0
Correlation:
(Intr) Prc_rn Rtng_E Rtng_G Rtn_Nr Rtng_P
Price_range -0.771
Rating_textExcellent -0.013 -0.230
Rating_textGood -0.187 -0.204 0.222
Rating_textNot rated -0.455 0.181 0.109 0.265
Rating_textPoor -0.091 -0.034 0.064 0.131 0.110
Rating_textVery Good -0.054 -0.314 0.231 0.363 0.185 0.104
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-3.5104232 -0.4857277 0.0126301 0.6268978 2.4599705
Number of Observations: 9551
Number of Groups: 1208
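In this model, the intercept of 3.03 is the expected aggregate rating for a restaurant in the reference text-rating category (the level omitted from the output) at a price range of zero. Holding the text rating constant, each one-unit increase in price range is associated with a 0.018-point increase in aggregate rating, far smaller than in the earlier models. Relative to the reference category, a text rating of Excellent is associated with a rating 1.57 points higher, Very Good 1.08 points higher, and Good 0.61 points higher, while Poor is 0.75 points lower and Not rated 3.04 points lower. All of these coefficients are statistically significant. The values in the Correlation block (for example, -0.771 for price range) are correlations between the coefficient estimates, not effects on the rating.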
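To see how the fixed-effect estimates combine into a prediction, the sketch below plugs the values from the Model 3 summary into a small helper. The function name predict_rating and the label "reference" for the omitted text-rating level are hypothetical, and the prediction ignores the locality random effects.

```r
# Fixed-effect estimates copied from summary(m3_lme)
intercept   <- 3.0308639
b_price     <- 0.0181632
b_text <- c(
  "reference" = 0,          # omitted level: no adjustment
  "Excellent" = 1.5742489,
  "Good"      = 0.6113196,
  "Not rated" = -3.0411268,
  "Poor"      = -0.7544128,
  "Very Good" = 1.0847958
)

# Population-level prediction (locality random effects set to zero)
predict_rating <- function(price_range, rating_text) {
  unname(intercept + b_price * price_range + b_text[rating_text])
}

predict_rating(4, "Excellent")   # an expensive restaurant rated Excellent
predict_rating(1, "Not rated")   # a cheap restaurant with no rating
```

The first prediction is close to 4.7 and the second is close to zero, so the text rating accounts for nearly all of the difference and price range adds very little.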
m4_lme <- lme(Aggregate_rating ~ Price_range*Rating_text, data =zomato_data, random = ~ Price_range|Locality, method = "ML")
summary(m4_lme)
Linear mixed-effects model fit by maximum likelihood
Data: zomato_data
AIC BIC logLik
-5855.951 -5741.32 2943.975
Random effects:
Formula: ~Price_range | Locality
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
(Intercept) 2.809636e-02 (Intr)
Price_range 3.044358e-05 -0.001
Residual 1.761296e-01
Fixed effects: Aggregate_rating ~ Price_range * Rating_text
Value Std.Error DF t-value p-value
(Intercept) 3.0210225 0.00744686 8332 405.6775 0.0000
Price_range 0.0239608 0.00393880 8332 6.0833 0.0000
Rating_textExcellent 1.5745677 0.03358597 8332 46.8817 0.0000
Rating_textGood 0.6146194 0.01202324 8332 51.1193 0.0000
Rating_textNot rated -3.0073451 0.01264869 8332 -237.7595 0.0000
Rating_textPoor -0.7160665 0.03562979 8332 -20.0974 0.0000
Rating_textVery Good 1.1127355 0.01786587 8332 62.2828 0.0000
Price_range:Rating_textExcellent -0.0023444 0.01176357 8332 -0.1993 0.8420
Price_range:Rating_textGood -0.0027146 0.00569397 8332 -0.4767 0.6336
Price_range:Rating_textNot rated -0.0255218 0.00874215 8332 -2.9194 0.0035
Price_range:Rating_textPoor -0.0215607 0.01807739 8332 -1.1927 0.2330
Price_range:Rating_textVery Good -0.0126523 0.00706073 8332 -1.7919 0.0732
Correlation:
(Intr) Prc_rn Rtng_E Rtng_G Rtn_Nr Rtng_P Rtn_VG P_:R_E Pr_:R_G P_:R_r P_:R_P
Price_range -0.886
Rating_textExcellent -0.218 0.196
Rating_textGood -0.562 0.533 0.127
Rating_textNot rated -0.526 0.494 0.116 0.312
Rating_textPoor -0.185 0.177 0.041 0.115 0.106
Rating_textVery Good -0.398 0.366 0.098 0.243 0.215 0.077
Price_range:Rating_textExcellent 0.294 -0.334 -0.938 -0.180 -0.165 -0.059 -0.130
Price_range:Rating_textGood 0.584 -0.677 -0.130 -0.902 -0.334 -0.122 -0.248 0.227
Price_range:Rating_textNot rated 0.383 -0.440 -0.085 -0.233 -0.910 -0.079 -0.159 0.147 0.300
Price_range:Rating_textPoor 0.184 -0.213 -0.041 -0.114 -0.106 -0.927 -0.077 0.071 0.147 0.096
Price_range:Rating_textVery Good 0.484 -0.554 -0.113 -0.302 -0.273 -0.099 -0.918 0.190 0.383 0.245 0.119
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-3.549287625 -0.474761666 0.003430372 0.625377439 2.472914780
Number of Observations: 9551
Number of Groups: 1208
This model adds an interaction between price range and text rating, which lets the price-range slope differ across text-rating categories. The intercept is 3.02. For the reference text-rating category, each one-unit increase in price range is associated with a 0.024-point increase in aggregate rating. At a price range of zero, a text rating of Excellent is associated with an aggregate rating 1.57 points higher than the reference category, which shows a strong association between aggregate rating and text rating. The interaction between price range and Excellent is -0.002 and is not statistically significant (p = 0.84); the only interaction that is significant at the 0.01 level is the one for Not rated (-0.026, p = 0.0035).
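In a model with an interaction, the price-range slope for a given text-rating category is the main price-range coefficient plus that category's interaction coefficient. The sketch below does this addition with the estimates from the Model 4 summary; the label "reference" for the omitted level is hypothetical.

```r
# Estimates copied from summary(m4_lme)
b_price <- 0.0239608
b_interaction <- c(
  "reference" = 0,           # omitted level: slope is the main effect itself
  "Excellent" = -0.0023444,
  "Good"      = -0.0027146,
  "Not rated" = -0.0255218,
  "Poor"      = -0.0215607,
  "Very Good" = -0.0126523
)

# Price-range slope within each text-rating category
slope_by_text <- b_price + b_interaction
round(slope_by_text, 4)
```

Every category-specific slope is small, and for Not rated restaurants the slope is essentially zero, which is consistent with that interaction being the only significant one.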
AIC(cpool, m1_lme,m2_lme,m3_lme, m4_lme)
htmlreg(list(m1_lme,m2_lme,m3_lme, m4_lme))
| | Model 1 | Model 2 | Model 3 | Model 4 |
|---|---|---|---|---|
| (Intercept) | 2.23*** | 2.43*** | 3.03*** | 3.02*** |
| | (0.05) | (0.07) | (0.01) | (0.01) |
| Price_range | 0.43*** | 0.37*** | 0.02*** | 0.02*** |
| | (0.02) | (0.02) | (0.00) | (0.00) |
| Rating_textExcellent | | | 1.57*** | 1.57*** |
| | | | (0.01) | (0.03) |
| Rating_textGood | | | 0.61*** | 0.61*** |
| | | | (0.01) | (0.01) |
| Rating_textNot rated | | | -3.04*** | -3.01*** |
| | | | (0.01) | (0.01) |
| Rating_textPoor | | | -0.75*** | -0.72*** |
| | | | (0.01) | (0.04) |
| Rating_textVery Good | | | 1.08*** | 1.11*** |
| | | | (0.01) | (0.02) |
| Price_range:Rating_textExcellent | | | | -0.00 |
| | | | | (0.01) |
| Price_range:Rating_textGood | | | | -0.00 |
| | | | | (0.01) |
| Price_range:Rating_textNot rated | | | | -0.03** |
| | | | | (0.01) |
| Price_range:Rating_textPoor | | | | -0.02 |
| | | | | (0.02) |
| Price_range:Rating_textVery Good | | | | -0.01 |
| | | | | (0.01) |
| AIC | 30393.73 | 30076.69 | -5854.61 | -5855.95 |
| BIC | 30422.38 | 30119.68 | -5775.80 | -5741.32 |
| Log Likelihood | -15192.86 | -15032.35 | 2938.30 | 2943.98 |
| Num. obs. | 9551 | 9551 | 9551 | 9551 |
| Num. groups | 1208 | 1208 | 1208 | 1208 |

***p < 0.001, **p < 0.01, *p < 0.05. Standard errors in parentheses.
From the table above, the AIC and BIC values, for which lower is better, show that Models 3 and 4, which add text rating, fit far better than Models 1 and 2. AIC marginally favors Model 4 (-5855.95 versus -5854.61), while BIC, which penalizes the extra interaction terms more heavily, favors Model 3 (-5775.80 versus -5741.32). Since only one of the interaction terms is significant, Model 3 is a reasonable choice for this analysis.
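The AIC and BIC values in the table can be reproduced from the log-likelihoods, which makes the "smaller is better" comparison explicit. In the sketch below the parameter counts k are my own tally (fixed effects, plus random-effect variance and correlation parameters, plus the residual variance); they are an assumption, although they do recover the reported values.

```r
# Log-likelihoods copied from the model summaries; k is an assumed parameter count
fits <- data.frame(
  model  = c("Model 1", "Model 2", "Model 3", "Model 4"),
  logLik = c(-15192.86, -15032.35, 2938.304, 2943.975),
  k      = c(4, 6, 11, 16)
)
n_obs <- 9551

# AIC = 2k - 2 logLik;  BIC = k log(n) - 2 logLik
fits$AIC <- 2 * fits$k - 2 * fits$logLik
fits$BIC <- log(n_obs) * fits$k - 2 * fits$logLik

# The preferred model under each criterion is the one with the smallest value
best_aic <- fits$model[which.min(fits$AIC)]
best_bic <- fits$model[which.min(fits$BIC)]

fits
c(best_aic = best_aic, best_bic = best_bic)
```

BIC multiplies each parameter by log(9551), about 9.2, instead of 2, so the five extra interaction terms in Model 4 cost more under BIC than under AIC.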
zomato_data %<>% mutate(crange = Price_range - mean(Price_range))
m1 <- lme4::lmer(Aggregate_rating ~ Price_range + Rating_text + (1|Locality), data=zomato_data)
fastdisp(m1)
lme4::lmer(formula = Aggregate_rating ~ Price_range + Rating_text +
(1 | Locality), data = zomato_data)
coef.est coef.se
(Intercept) 3.03 0.01
Price_range 0.02 0.00
Rating_textExcellent 1.57 0.01
Rating_textGood 0.61 0.01
Rating_textNot rated -3.04 0.01
Rating_textPoor -0.75 0.01
Rating_textVery Good 1.08 0.01
Error terms:
Groups Name Std.Dev.
Locality (Intercept) 0.03
Residual 0.18
---
number of obs: 9551, groups: Locality, 1208
AIC = -5798.1
Using merTools, we can examine the effects that price range and text rating have on aggregate rating. The standard deviation of the locality intercepts is 0.03, compared with a residual standard deviation of 0.18.
feEx <- FEsim(m1, 1000)
cbind(feEx[,1] , round(feEx[, 2:4], 3))
Now we plot the simulated fixed effects to see how the effects of price range and text rating on aggregate rating differ.
plotFEsim(feEx) +
theme_bw() + labs(title = "Coefficient Plot",
x = "Median Effect Estimate", y = "Aggregate_rating")
feEx <- REsim(m1)
head(feEx)
Now we turn to the random effects, which capture the differences between localities. The table above shows the mean and median of the simulated random intercept for each locality.
table(feEx$term)
(Intercept)
1208
table(feEx$groupFctr)
Locality
1208
p <- plotREsim(feEx)
p
The plot shows that roughly equal numbers of localities fall above and below zero. A few localities at the low and high ends stand out and should be examined further. For most localities, though, the effects are evenly distributed around zero, suggesting that locality has little effect on aggregate rating once price range and text rating are accounted for.