RRD <- read.csv("/Users/hannahpeterson/Documents/R stuff/RestaurantRatersComplete.csv")
RRDE <- read.csv("/Users/hannahpeterson/Documents/R stuff/RRD Edited2.csv")
# Working copies of the same edited file: RRDE2 for the trees, RRDE3 for the logistic regression
RRDE2 <- RRDE
RRDE3 <- RRDE
head(RRDE)
## smoker drink_level dress_preference ambience transport marital_status
## 1 TRUE social drinker informal family public single
## 2 TRUE social drinker informal family public single
## 3 TRUE social drinker informal family public single
## 4 TRUE social drinker informal family public single
## 5 FALSE abstemious no preference family car owner single
## 6 FALSE abstemious no preference family car owner single
## hijos birth_year religion activity budget rating food_rating
## 1 independent 1989 Catholic student high 0 0
## 2 independent 1989 Catholic student high 0 0
## 3 independent 1989 Catholic student high 0 1
## 4 independent 1989 Catholic student high 0 1
## 5 independent 1943 Christian student high 1 2
## 6 independent 1943 Christian student high 1 2
## service_rating
## 1 0
## 2 0
## 3 1
## 4 1
## 5 1
## 6 0
##Linear Regression
m1=lm(rating~.,data=RRDE)
summary(m1)
##
## Call:
## lm(formula = rating ~ ., data = RRDE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.91982 -0.07269 -0.01253 0.06254 1.90696
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.8934985 1.6133807 3.653 0.000263 ***
## smokerTRUE -0.0343015 0.0259226 -1.323 0.185837
## drink_levelcasual drinker -0.0680991 0.0254250 -2.678 0.007428 **
## drink_levelsocial drinker -0.1078001 0.0253675 -4.250 2.19e-05 ***
## dress_preferenceformal -0.0806879 0.0672568 -1.200 0.230329
## dress_preferenceinformal -0.2004827 0.0695671 -2.882 0.003975 **
## dress_preferenceno preference -0.2006370 0.0712205 -2.817 0.004870 **
## ambiencefriends 0.0512738 0.0222279 2.307 0.021122 *
## ambiencesolitary -0.0339017 0.0267831 -1.266 0.205665
## transporton foot -0.0810585 0.0323891 -2.503 0.012367 *
## transportpublic -0.1638244 0.0256980 -6.375 2.04e-10 ***
## marital_statussingle 0.1276733 0.0509882 2.504 0.012321 *
## marital_statuswidow 0.5316636 0.1003315 5.299 1.23e-07 ***
## hijosindependent -0.0918057 0.0833669 -1.101 0.270866
## hijoskids -0.3145672 0.0840026 -3.745 0.000183 ***
## birth_year -0.0027530 0.0008356 -3.295 0.000994 ***
## religionChristian -0.1012856 0.0531072 -1.907 0.056569 .
## religionJewish -0.4772020 0.0978924 -4.875 1.13e-06 ***
## religionMormon -0.3960469 0.1122700 -3.528 0.000424 ***
## religionnone -0.1064383 0.0285816 -3.724 0.000199 ***
## activitystudent 0.0260464 0.0314700 0.828 0.407915
## activityunemployed -0.4237667 0.1426878 -2.970 0.002997 **
## activityworking-class 0.2436041 0.1380733 1.764 0.077758 .
## budgetlow 0.1024578 0.0643328 1.593 0.111326
## budgetmedium 0.1296879 0.0622091 2.085 0.037160 *
## food_rating 0.4019010 0.0128045 31.387 < 2e-16 ***
## service_rating 0.4430751 0.0120789 36.682 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3553 on 3916 degrees of freedom
## Multiple R-squared: 0.8291, Adjusted R-squared: 0.828
## F-statistic: 730.7 on 26 and 3916 DF, p-value: < 2.2e-16
First we ran a multiple linear regression with rating as the dependent variable. We chose overall rating because we wanted to find the biggest predictors of a restaurant's rating. The original dataset had 25 variables, but we narrowed it down to the 14 variables that we thought were the best predictors of rating and most correlated with it. With an adjusted R-squared of 0.828, we felt confident in the variables we had chosen.
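As a quick check on the adjusted R-squared quoted above, it can be recomputed from the figures printed in the summary. This is a minimal sketch: the variable names are ours, and the inputs are copied by hand from the output (multiple R-squared 0.8291, 26 model degrees of freedom, 3916 residual degrees of freedom).

```r
# Values copied from the summary(m1) output
r2     <- 0.8291           # Multiple R-squared
p      <- 26               # estimated slopes (model DF of the F-statistic)
df_res <- 3916             # residual degrees of freedom
n      <- df_res + p + 1   # number of observations

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res
round(adj_r2, 3)
```

The result rounds to 0.828, matching the summary, and `n` comes out to the 3943 rows reported later by `length(RRDE2$rating)`.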
table(RRDE$rating)
##
## 0 1 2
## 1773 968 1202
par(mfrow=c(3,3), mai=c(.3,.6,.1,.1)) # 3 x 3 grid of panels
plot(smoker ~ rating, data=RRDE, col=c(grey(.2),2:6))
plot(drink_level ~ rating, data=RRDE, col=c(grey(.2),2:6))
plot(dress_preference ~ rating, data=RRDE, col=c(grey(.2),2:6))
plot(ambience ~ rating, data=RRDE, col=c(grey(.2),2:6))
plot(transport ~ rating, data=RRDE, col=c(grey(.2),2:6))
plot(marital_status ~ rating, data=RRDE, col=c(grey(.2),2:6))
plot(hijos ~ rating, data=RRDE, col=c(grey(.2),2:6))
plot(activity ~ rating, data=RRDE, col=c(grey(.2),2:6))
plot(budget ~ rating, data=RRDE, col=c(grey(.2),2:6))
These plots show how each categorical predictor is distributed across the values of the dependent variable, rating.
##Recoding
library(car)
RRDE3$smoker=recode(RRDE3$smoker,"'FALSE'=0; 'TRUE'=1")
RRDE3$drink_level=recode(RRDE3$drink_level,"'abstemious'=0; 'casual drinker'=1; 'social drinker'=2")
RRDE3$dress_preference=recode(RRDE3$dress_preference,"'no preference'=0; 'informal'=1; 'formal'=2; 'elegant'=3")
RRDE3$ambience=recode(RRDE3$ambience,"'solitary'=0; 'friends'=1; 'family'=2")
RRDE3$transport=recode(RRDE3$transport,"'on foot'=0; 'public'=1; 'car owner'=2")
RRDE3$marital_status=recode(RRDE3$marital_status,"'single'=0; 'widowed'=1; 'married'=2")
RRDE3$hijos=recode(RRDE3$hijos,"'dependent'=0; 'independent'=1; 'kids'=2")
RRDE3$activity=recode(RRDE3$activity,"'unemployed'=0; 'student'=1; 'working-class'=2; 'professional'=3")
RRDE3$budget=recode(RRDE3$budget,"'low'=0; 'medium'=1; 'high'=2")
RRDE3$religion=recode(RRDE3$religion,"'none'=0; 'Catholic'=1; 'Christian'=1; 'Jewish'=1; 'Mormon'=1")
RRDE3$rating=recode(RRDE3$rating,"'0'=0; '1'=0; '2'=1")
##Logistic Regression
m2=glm(rating~., family=binomial,data=RRDE3)
m2
##
## Call: glm(formula = rating ~ ., family = binomial, data = RRDE3)
##
## Coefficients:
## (Intercept) smoker drink_level1
## 69.87837 -0.58323 0.22203
## drink_level2 dress_preference1 dress_preference2
## 0.01060 0.17308 0.25062
## dress_preference3 ambience1 ambience2
## 0.51823 0.91948 0.83211
## transport1 transport2 marital_status2
## -1.25954 -0.28712 0.04129
## marital_statuswidow hijos1 hijos2
## 2.75872 -0.95435 -2.33720
## birth_year religion1 activity1
## -0.04271 0.87638 10.14821
## activity2 activity3 budget1
## 12.13832 9.37795 0.19138
## budget2 food_rating service_rating
## -2.09213 1.51763 2.14975
##
## Degrees of Freedom: 3942 Total (i.e. Null); 3919 Residual
## Null Deviance: 4849
## Residual Deviance: 2025 AIC: 2073
summary(m2)
##
## Call:
## glm(formula = rating ~ ., family = binomial, data = RRDE3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1271 -0.2062 -0.1098 0.3141 3.4999
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 69.87837 266.57180 0.262 0.793216
## smoker -0.58323 0.21708 -2.687 0.007216 **
## drink_level1 0.22203 0.21634 1.026 0.304752
## drink_level2 0.01060 0.22241 0.048 0.962004
## dress_preference1 0.17308 0.23504 0.736 0.461516
## dress_preference2 0.25062 0.21875 1.146 0.251937
## dress_preference3 0.51823 0.45785 1.132 0.257691
## ambience1 0.91948 0.27926 3.293 0.000993 ***
## ambience2 0.83211 0.21899 3.800 0.000145 ***
## transport1 -1.25954 0.28488 -4.421 9.81e-06 ***
## transport2 -0.28712 0.30154 -0.952 0.341002
## marital_status2 0.04129 0.45160 0.091 0.927156
## marital_statuswidow 2.75872 0.68822 4.008 6.11e-05 ***
## hijos1 -0.95435 1.07941 -0.884 0.376620
## hijos2 -2.33719 1.11829 -2.090 0.036620 *
## birth_year -0.04271 0.00749 -5.702 1.18e-08 ***
## religion1 0.87638 0.24572 3.567 0.000362 ***
## activity1 10.14821 266.15825 0.038 0.969585
## activity2 12.13832 266.17031 0.046 0.963626
## activity3 9.37795 266.15823 0.035 0.971893
## budget1 0.19138 0.17623 1.086 0.277480
## budget2 -2.09213 0.48374 -4.325 1.53e-05 ***
## food_rating 1.51763 0.10348 14.666 < 2e-16 ***
## service_rating 2.14975 0.10568 20.342 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4849.2 on 3942 degrees of freedom
## Residual deviance: 2024.7 on 3919 degrees of freedom
## AIC: 2072.7
##
## Number of Fisher Scoring iterations: 13
Here, we recoded the categorical variables to numeric codes. Because we wanted to run a logistic regression, the dependent variable had to be binary, so we looked only at whether a rating was good or bad. We recoded 0 (bad) and 1 (average) as 0 to represent the “bad” ratings, and 2 as 1 to represent the good ratings.
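The binary recoding and the scale of the logistic coefficients can be illustrated with a short base-R sketch. It rebuilds the rating column from the counts printed by `table(RRDE$rating)` and copies two coefficients by hand from `summary(m2)`; the variable names `good`, `log_odds`, and `odds_ratio` are ours and not part of the original analysis.

```r
# Rebuild the rating column from the counts printed by table(RRDE$rating)
rating <- rep(c(0, 1, 2), times = c(1773, 968, 1202))

# Good (2) becomes 1; bad (0) and average (1) become 0
good <- as.integer(rating == 2)
table(good)

# Coefficients copied from summary(m2); they are on the log-odds scale,
# so exp() converts them to odds ratios
log_odds   <- c(food_rating = 1.51763, service_rating = 2.14975)
odds_ratio <- exp(log_odds)
round(odds_ratio, 2)
```

Under this coding, 1202 of the 3943 ratings count as good. The odds ratios suggest that a one-point increase in food rating multiplies the odds of a good overall rating by roughly 4.6, and a one-point increase in service rating by roughly 8.6, holding the other variables fixed.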
RRDE2$smoker=recode(RRDE2$smoker,"'FALSE'=0; 'TRUE'=1")
RRDE2$drink_level=recode(RRDE2$drink_level,"'abstemious'=0; 'casual drinker'=1; 'social drinker'=2")
RRDE2$dress_preference=recode(RRDE2$dress_preference,"'no preference'=0; 'informal'=1; 'formal'=2; 'elegant'=3")
RRDE2$ambience=recode(RRDE2$ambience,"'solitary'=0; 'friends'=1; 'family'=2")
RRDE2$transport=recode(RRDE2$transport,"'on foot'=0; 'public'=1; 'car owner'=2")
RRDE2$marital_status=recode(RRDE2$marital_status,"'single'=0; 'widowed'=1; 'married'=2")
RRDE2$hijos=recode(RRDE2$hijos,"'dependent'=0; 'independent'=1; 'kids'=2")
RRDE2$activity=recode(RRDE2$activity,"'unemployed'=0; 'student'=1; 'working-class'=2; 'professional'=3")
RRDE2$budget=recode(RRDE2$budget,"'low'=0; 'medium'=1; 'high'=2")
RRDE2$religion=recode(RRDE2$religion,"'none'=0; 'Catholic'=1; 'Christian'=1; 'Jewish'=1; 'Mormon'=1")
Next, we applied the same recoding to a separate copy of the dataset, RRDE2, to use for the decision trees. This time we did not recode rating because we wanted to keep its original three levels.
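Because the same recode strings are reused, it can help to confirm that every label in a recode specification actually occurs in the data. The regression output above lists a level `marital_statuswidow`, which suggests the data uses `widow` while the recode string spells it `widowed`; `car::recode` leaves values that match no specification unchanged. The sketch below is base R only, and the helper `unmatched_levels`, the toy vector, and the lookup `map` are hypothetical illustrations rather than part of the original script.

```r
# Hypothetical helper: which values in x are not covered by the recode map?
unmatched_levels <- function(x, map) {
  setdiff(unique(as.character(x)), names(map))
}

# Toy column using the level names seen in the regression output
marital_status <- c("single", "married", "widow", "single")

# Lookup written the way the recode() call spells the labels
map <- c(single = 0, widowed = 1, married = 2)

# Labels that would be left unrecoded
unmatched_levels(marital_status, map)
```

An empty result would mean every label in the column is covered by the lookup.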
library(tree)
length(RRDE2$rating)
## [1] 3943
fullrtree <- tree(rating ~., data=RRDE2, mindev=0.1, mincut=1) # coarser tree; replaced by the next fit
fullrtree <- tree(rating ~., data=RRDE2, mincut=1) # default mindev; this is the tree printed below
fullrtree
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 3943 2892.00 0.85520
## 2) food_rating < 0.5 1797 155.00 0.06233
## 4) service_rating < 0.5 1713 27.72 0.01284 *
## 5) service_rating > 0.5 84 37.57 1.07100 *
## 3) food_rating > 0.5 2146 661.70 1.51900
## 6) service_rating < 1.5 1196 329.10 1.23800
## 12) food_rating < 1.5 653 137.30 1.09300 *
## 13) food_rating > 1.5 543 161.60 1.41300 *
## 7) service_rating > 1.5 950 119.60 1.87300 *
plot(fullrtree, col=8)
text(fullrtree, pretty=1)
We ran a CART-style regression tree with all the variables, again using rating as the dependent variable. The tree split only on food rating and service rating, which makes them the two strongest predictors of overall rating. That is not surprising, but it tells us little about which customer characteristics affect rating.
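To make the numbers in the printed tree easier to read, the root node can be reproduced by hand. In a regression tree, `yval` is the mean of the response in the node and `deviance` is the sum of squared differences from that mean. The sketch below rebuilds the rating column from the counts printed by `table(RRDE$rating)` and copies the two child deviances from the tree output; the variable names are ours.

```r
# Rebuild the response from the counts printed by table(RRDE$rating)
rating <- rep(c(0, 1, 2), times = c(1773, 968, 1202))

# Root node: prediction is the mean, deviance is the sum of squares around it
root_mean <- mean(rating)
root_dev  <- sum((rating - root_mean)^2)

# Deviances of the two children of the first split, copied from the output
child_dev <- c(left = 155.00, right = 661.70)

# Share of the root deviance removed by the first split on food_rating
reduction <- 1 - sum(child_dev) / root_dev
round(c(mean = root_mean, deviance = root_dev, reduction = reduction), 4)
```

The mean and deviance agree with the root line of the printed tree up to rounding, and the first split on food rating alone removes a bit over 70% of the root deviance.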
rtree <- tree(rating ~.-food_rating-service_rating, data=RRDE2, mindev=0.1, mincut=1) # coarser tree; replaced by the next fit
rtree <- tree(rating ~.-food_rating-service_rating, data=RRDE2, mincut=1) # default mindev; this is the tree printed below
rtree
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 3943 2892.00 0.8552
## 2) hijos: 2 1633 400.90 0.1506
## 4) budget: 0 1442 0.00 0.0000 *
## 5) budget: 1 191 121.20 1.2880 *
## 3) hijos: 0,1 2310 1108.00 1.3530
## 6) transport: 1,2 2037 971.30 1.3040
## 12) birth_year < 1965.5 222 71.32 1.7340 *
## 13) birth_year > 1965.5 1815 853.90 1.2520 *
## 7) transport: 0 273 95.28 1.7180 *
plot(rtree, col=8)
text(rtree, pretty=1)
We ran another tree with food rating and service rating excluded from the predictors. This tree showed that kids (hijos), budget, mode of transportation, and birth year were the biggest predictors of overall rating.
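The formula `rating ~ . - food_rating - service_rating` is what removes the two rating columns from the predictors. A small sketch of how R expands it, using a toy data frame with made-up values (only the column names matter here, and `toy` is not part of the original analysis):

```r
# Toy data frame with hypothetical values; only the column names matter
toy <- data.frame(
  rating         = c(0, 1, 2, 2),
  food_rating    = c(0, 1, 2, 1),
  service_rating = c(0, 1, 1, 2),
  budget         = c(0, 1, 2, 1),
  hijos          = c(2, 1, 0, 1)
)

# "." expands to every column except the response; "-" then drops terms
f <- rating ~ . - food_rating - service_rating
predictors <- attr(terms(f, data = toy), "term.labels")
predictors
```

Only `budget` and `hijos` remain as predictors in this toy case, which is the same mechanism the tree call relies on.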
frtree <- tree(food_rating ~.-rating-service_rating, data=RRDE2, mindev=0.1, mincut=1) # coarser tree; replaced by the next fit
frtree <- tree(food_rating ~.-rating-service_rating, data=RRDE2, mincut=1) # default mindev; this is the tree printed below
frtree
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 3943 3085.0 0.8844
## 2) budget: 0 1980 708.3 0.2768
## 4) hijos: 1 538 301.8 1.0190
## 8) birth_year < 1988.5 226 129.4 0.6770 *
## 9) birth_year > 1988.5 312 126.9 1.2660 *
## 5) hijos: 2 1442 0.0 0.0000 *
## 3) budget: 1,2 1963 908.7 1.4970 *
plot(frtree, col=8)
text(frtree, pretty=1)
Next, we ran a decision tree with food rating as the dependent variable, leaving out overall rating and service rating. The first tree we ran showed that food rating and service rating were the biggest predictors of overall rating, so we wanted to see which specific variables affected food rating, because that could further explain overall rating.
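One way to read the printed food-rating tree is to write its splits out as plain conditions. The function below is a hand-written, hypothetical mirror of the printed `frtree` output, not a call into the `tree` package; it assumes the numeric codes from the recoding step (budget 0 = low, hijos 1 = independent, hijos 2 = kids).

```r
# Hand-written version of the printed frtree splits (hypothetical helper)
predict_food_rating <- function(budget, hijos, birth_year) {
  if (budget != 0) return(1.4970)             # node 3: budget 1 or 2
  if (hijos == 2) return(0.0000)              # node 5: low budget, kids
  if (hijos == 1) {
    if (birth_year < 1988.5) return(0.6770)   # node 8
    return(1.2660)                            # node 9
  }
  NA_real_  # hijos = 0 does not appear under budget 0 in the printed tree
}

predict_food_rating(budget = 0, hijos = 1, birth_year = 1989)
```

With a fitted model, `predict(frtree, newdata)` would normally do this lookup; the explicit version only shows which branch each customer profile follows.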
srtree <- tree(service_rating ~.-rating-food_rating, data=RRDE2, mindev=0.1, mincut=1) # coarser tree; replaced by the next fit
srtree <- tree(service_rating ~.-rating-food_rating, data=RRDE2, mincut=1) # default mindev; this is the tree printed below
srtree
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 3943 2678.00 0.7616
## 2) hijos: 2 1633 398.80 0.1488
## 4) budget: 0 1442 0.00 0.0000 *
## 5) budget: 1 191 125.80 1.2720 *
## 3) hijos: 0,1 2310 1232.00 1.1950
## 6) birth_year < 1965.5 232 91.52 1.6030 *
## 7) birth_year > 1965.5 2078 1098.00 1.1490
## 14) transport: 1,2 1815 896.90 1.1000 *
## 15) transport: 0 263 165.70 1.4900 *
plot(srtree, col=8)
text(srtree, pretty=1)
We did the same for service rating as we did for food rating. Since those were the two strongest predictors of rating, we felt we needed to run both trees to understand it fully.
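The four trees can be compared on a common scale by asking how much of the root deviance each one removes. The sketch below copies the root and terminal-node deviances by hand from the printed trees, so the results are approximate because the printed values are rounded; the list `trees` and its labels are ours.

```r
# Root and terminal-node deviances copied from the four printed trees
trees <- list(
  full    = list(root = 2892, leaves = c(27.72, 37.57, 137.30, 161.60, 119.60)),
  overall = list(root = 2892, leaves = c(0.00, 121.20, 71.32, 853.90, 95.28)),
  food    = list(root = 3085, leaves = c(129.4, 126.9, 0.0, 908.7)),
  service = list(root = 2678, leaves = c(0.00, 125.80, 91.52, 896.90, 165.70))
)

# Fraction of root deviance explained = 1 - (sum of leaf deviances / root)
explained <- sapply(trees, function(tr) 1 - sum(tr$leaves) / tr$root)
round(explained, 3)
```

On these figures the tree that uses food and service rating explains about 83% of the deviance in overall rating, while the trees built only from customer characteristics explain roughly 52% to 62% of their respective responses.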
We found that, overall, food rating is the predictor with the most effect on rating. Digging deeper into the variables that most affect food rating, we found that the most impactful were budget, whether someone had kids, and birth year; mode of transportation also appeared in the trees for overall rating and service rating. These data and decision trees can be used in the future to help restaurants better understand what they need to focus on and what their customers care about and look for.