Master Analysis 2

This dataset comes from a recommender system prototype that its creators built to produce a top-n list of restaurants based on customers’ preferences. It contains reviews filed by 138 users, for a total of 4,090 cases. Users were also asked for personal information such as drink level, birth year, weight, and religion to give a better understanding of their restaurant preferences. Together, these preferences and the restaurant ratings help determine the top-n list of restaurants.

We focused on the overall rating of restaurants and on the factors that may affect it. Ratings range from 0 to 2, with 0 being poor and 2 being excellent. We assume this direction because on Yelp and Google reviews a higher number means a better rating. Our goal for this dataset is to identify potential reasons customers rated the places the way they did. We narrowed down the variables to those we thought would have some effect on user ratings. For example, we chose the variable “Ambience” to see whether it affected users’ mindset while they were at the restaurant and thus their rating. In early regression models we found that the food and service ratings were highly significant predictors of the overall rating, and those models had a high R-squared.

rate<-read.csv("c:/users/abbey/Desktop/Data Mining/Restaurant.csv")
head(rate)

##   userID smoker drink_level ambience marital_status birth_year religion
## 1  U1040  FALSE  abstemious  friends         single       1994 Catholic
## 2  U1040  FALSE  abstemious  friends         single       1994 Catholic
## 3  U1040  FALSE  abstemious  friends         single       1994 Catholic
## 4  U1040  FALSE  abstemious  friends         single       1994 Catholic
## 5  U1110  FALSE  abstemious   family         single       1993 Catholic
## 6  U1110  FALSE  abstemious   family         single       1993 Catholic
##   activity Weight..Klio. budget rating food_rating service_rating Rcuisine
## 1  student            73 medium      1           2              1 Pizzeria
## 2  student            73 medium      2           2              2 Pizzeria
## 3  student            73 medium      1           1              1 Pizzeria
## 4  student            73 medium      1           1              1 Pizzeria
## 5  student            85 medium      1           2              1  Mexican
## 6  student            85 medium      2           2              2  Mexican
##   Upayment
## 1     cash
## 2     cash
## 3     cash
## 4     cash
## 5     cash
## 6     cash

library(textir)

## Loading required package: distrom

## Loading required package: Matrix

## Loading required package: gamlr

## Loading required package: parallel

library(MASS)
dim(rate)

## [1] 4090   15

table(rate$rating)

## 
##    0    1    2 
## 1808 1024 1258

This data set contains 4,090 cases and 15 variables, including user ID, smoker, drink level, ambience, marital status, and birth year. For the overall rating, 1,808 cases rated the restaurant poor (0), 1,024 rated it average (1), and 1,258 rated it excellent (2), based on our rating assumption.
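
The counts above can be turned into shares to make the comparison concrete. This is a minimal sketch that re-enters the printed table(rate$rating) counts by hand; the names rating_counts and rating_share are ours and are not part of the original script.

```r
# Counts copied from the table(rate$rating) output above
rating_counts <- c("0" = 1808, "1" = 1024, "2" = 1258)

# Share of all cases at each rating level, in percent
rating_share <- round(100 * rating_counts / sum(rating_counts), 1)
print(rating_share)
```

Under our rating assumption, roughly 44% of cases are poor, 25% are average, and 31% are excellent.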

x<-scale(rate[,c(2,6,9,12,13)])
x[1:10,]

##           smoker birth_year Weight..Klio. food_rating service_rating
##  [1,] -0.2828268  0.7617017     0.5918023   1.2556357      0.2797459
##  [2,] -0.2828268  0.7617017     0.5918023   1.2556357      1.4930657
##  [3,] -0.2828268  0.7617017     0.5918023   0.1232101      0.2797459
##  [4,] -0.2828268  0.7617017     0.5918023   0.1232101      0.2797459
##  [5,] -0.2828268  0.6787912     1.5864143   1.2556357      0.2797459
##  [6,] -0.2828268  0.6787912     1.5864143   1.2556357      1.4930657
##  [7,] -0.2828268  0.6787912     1.5864143   1.2556357      1.4930657
##  [8,] -0.2828268  0.6787912     1.5864143   1.2556357      1.4930657
##  [9,] -0.2828268  0.5958808    -2.1433808  -1.0092155     -0.9335740
## [10,] -0.2828268  0.5958808    -2.1433808  -1.0092155     -0.9335740

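## scale() has already standardized each column, so this loop leaves x unchanged
for (j in 1:5) {
  x[,j]=(x[,j]-mean(x[,j]))/sd(x[,j])
}

## mean and standard deviation over all cells of x (not per column)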
mean(x)

## [1] 9.404608e-18

sd(x)

## [1] 0.9999022
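
Because mean(x) and sd(x) pool every cell of the matrix, they do not confirm that each column was standardized on its own. A per-column check could look like the sketch below. It uses a made-up toy data frame (toy and its values are illustrative, not the restaurant data). The last lines show why the pooled standard deviation printed above is slightly below 1.

```r
# Toy stand-in for a few numeric columns (values are made up)
toy <- data.frame(
  birth_year = c(1994, 1993, 1988, 1975, 1960),
  weight     = c(73, 85, 40, 66, 90),
  food       = c(2, 2, 0, 1, 1)
)

z <- scale(toy)  # centers and scales every column in one step

# Per-column checks: each mean should be ~0 and each sd should be 1
col_means <- colMeans(z)
col_sds   <- apply(z, 2, sd)
print(round(col_means, 10))
print(col_sds)

# Pooling all cells of a standardized n x p matrix gives an sd just below 1,
# because the pooled sum of squares is p * (n - 1) over n * p - 1 degrees of freedom
n <- 4090; p <- 5
pooled_sd <- sqrt(p * (n - 1) / (n * p - 1))
print(round(pooled_sd, 7))
```

With n = 4090 and p = 5, the pooled value agrees with the 0.9999022 reported by sd(x).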

set.seed(1)
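## clusters on birth_year, Weight..Klio., rating, food_rating and service_rating (raw columns, not the scaled x)
grprate <- kmeans(rate[,c(-1,-2,-3,-4,-5,-7,-8,-10,-14,-15)], centers=5, nstart=10)
clusrate<-cbind(rate, cluster = grprate$cluster) ##assigns each case (row) a cluster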
head(clusrate)

##   userID smoker drink_level ambience marital_status birth_year religion
## 1  U1040  FALSE  abstemious  friends         single       1994 Catholic
## 2  U1040  FALSE  abstemious  friends         single       1994 Catholic
## 3  U1040  FALSE  abstemious  friends         single       1994 Catholic
## 4  U1040  FALSE  abstemious  friends         single       1994 Catholic
## 5  U1110  FALSE  abstemious   family         single       1993 Catholic
## 6  U1110  FALSE  abstemious   family         single       1993 Catholic
##   activity Weight..Klio. budget rating food_rating service_rating Rcuisine
## 1  student            73 medium      1           2              1 Pizzeria
## 2  student            73 medium      2           2              2 Pizzeria
## 3  student            73 medium      1           1              1 Pizzeria
## 4  student            73 medium      1           1              1 Pizzeria
## 5  student            85 medium      1           2              1  Mexican
## 6  student            85 medium      2           2              2  Mexican
##   Upayment cluster
## 1     cash       4
## 2     cash       4
## 3     cash       4
## 4     cash       4
## 5     cash       1
## 6     cash       1

To take a closer look at these characteristics, we ran a cluster analysis.

The cases separate into distinct groups. Cluster analysis is the task of grouping a set of objects so that objects in the same cluster are more similar (in some sense) to each other than to those in other clusters. Here, k-means produced five groups whose members are similar in birth year, weight, and the ratings they give.
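
k-means measures similarity with Euclidean distance, so variables recorded in large units (birth year, weight) can outweigh the 0 to 2 rating columns when the raw values are clustered. One common remedy is to standardize before clustering. The sketch below illustrates the idea on made-up toy data (toy and fit are hypothetical names). Whether standardizing would change the five clusters reported here has not been checked.

```r
set.seed(1)

# Toy data (made up): two groups that differ in rating but not in weight
toy <- data.frame(
  weight = c(rnorm(20, mean = 70, sd = 15), rnorm(20, mean = 70, sd = 15)),
  rating = c(rep(0, 20), rep(2, 20))
)

# Standardize first so that weight (large units) does not dominate rating
fit <- kmeans(scale(toy), centers = 2, nstart = 10)

# Cross-tabulate cluster membership against the rating
print(table(cluster = fit$cluster, rating = toy$rating))
```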

o=order(grprate$cluster)
head(data.frame(rate$userID[o],grprate$cluster[o])) ##shows the first 6 cases

##   rate.userID.o. grprate.cluster.o.
## 1          U1110                  1
## 2          U1110                  1
## 3          U1110                  1
## 4          U1110                  1
## 5          U1058                  1
## 6          U1058                  1

plot(rate$rating, rate$Weight..Klio.,type="n", xlim=c(0,3), xlab="Rating", ylab="Weight")
text(x=rate$rating, y=rate$Weight..Klio., labels=rate$userID, col=rainbow(6)[grprate$cluster])

grprate$centers

##   birth_year Weight..Klio.    rating food_rating service_rating
## 1   1985.075      77.52559 1.3763326   1.5095949      1.2057569
## 2   1937.435      50.06034 1.7370690   1.7241379      1.6034483
## 3   1987.788     109.93939 1.1515152   1.2878788      1.1363636
## 4   1988.453      66.05252 0.4573304   0.4691466      0.4065646
## 5   1988.736      47.18453 1.2741652   1.1810193      1.1247803

The graph plots each case by its user ID, with weight against rating and color indicating the cluster, so the clusters created in this analysis are visible. There are 5 clusters, one of which overlaps the others in the plot. From the cluster centers we can see that the raters in cluster 4 give the lowest ratings. Cluster 4 has an average weight of about 66 kilograms and an average birth year of about 1988 (roughly 29 years old). By comparison, cluster 2 gives the highest ratings, weighs about 50 kilograms on average, and has an average birth year of about 1937 (roughly 80 years old).
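
The ages quoted above follow from the birth_year column of the cluster centers. This is a minimal sketch of that arithmetic, re-entering the printed grprate$centers values by hand. It assumes age is measured in 2017, the year of this report; the names centers and approx_age are ours.

```r
# Cluster centers copied from the grprate$centers output above
centers <- data.frame(
  cluster    = 1:5,
  birth_year = c(1985.075, 1937.435, 1987.788, 1988.453, 1988.736),
  weight_kg  = c(77.52559, 50.06034, 109.93939, 66.05252, 47.18453),
  rating     = c(1.3763326, 1.7370690, 1.1515152, 0.4573304, 1.2741652)
)

# Approximate age in 2017 (assumed reference year)
centers$approx_age <- 2017 - centers$birth_year

# Order clusters from lowest to highest mean rating
print(centers[order(centers$rating), ])
```

Sorted this way, cluster 4 comes first (lowest mean rating) and cluster 2 last (highest mean rating).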

table(grprate$cluster)

## 
##    1    2    3    4    5 
##  938  232   66 2285  569

According to this cluster analysis, 2,285 cases fall in the cluster that gives the lowest ratings (cluster 4), compared with only 232 cases in the cluster that gives the highest ratings (cluster 2).

m3 <- aov(Weight..Klio. ~ as.factor(cluster), data=clusrate) 
summary(m3)

##                      Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(cluster)    4 512337  128084    6313 <2e-16 ***
## Residuals          4085  82876      20                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With 5 clusters, the ANOVA of weight on cluster gives a very high F value (6,313), and the p-value (< 2e-16) is statistically significant, so average weight differs across clusters. This is expected, since weight was one of the variables used to form the clusters. A different number of clusters could be tried to reduce the overlap between them. Also, a characteristic other than weight might better differentiate the clusters, which could be taken into account in a further analysis.
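
The F value can be reproduced from the sums of squares in the ANOVA table, and the same numbers show how much of the variation in weight lies between clusters. This is a minimal sketch using the rounded values printed by summary(m3); the variable names are ours, and eta squared is an extra summary that the original analysis did not report.

```r
# Values copied from the summary(m3) table above
ss_between <- 512337; df_between <- 4
ss_within  <- 82876;  df_within  <- 4085

# F is the ratio of the between-cluster to the within-cluster mean square
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within
f_value    <- ms_between / ms_within

# Eta squared: share of the variance in weight explained by cluster membership
eta_sq <- ss_between / (ss_between + ss_within)

print(round(c(F = f_value, eta_squared = eta_sq), 3))
```

About 86% of the variation in weight lies between clusters, which is consistent with weight being one of the clustering variables.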

Next, we created histograms to look at seven characteristics of the raters and what those characteristics could tell us about the ratings given.

library(lattice)
library(nutshell)

## Loading required package: nutshell.bbdb

## Loading required package: nutshell.audioscrobbler

rate$rating = factor(rate$rating, levels=c("0","1","2"))
levels(rate$rating) = c("Poor","Average","Great")
rate$rating <- factor(rate$rating)

par(mfrow=c(2,4), mai=c(.3,.6,.1,.1)) ##base-graphics setting; lattice's histogram() ignores it and draws each plot separately
histogram(~rating|drink_level, data=rate,layout=c(1,3),col=c(grey(.2),2:6)) ##drink level vs rating

histogram(~rating|ambience, data=rate,layout=c(1,4),col=c(grey(.2),2:6)) ##ambience vs rating

histogram(~rating|religion, data=rate,layout=c(1,5),col=c(grey(.2),2:6)) ##religion vs rating

histogram(~rating|smoker, data=rate,layout=c(1,2),col=c(grey(.2),2:6)) ##smoker vs rating

histogram(~rating|activity, data=rate,layout=c(1,5),col=c(grey(.2),2:6))  ##activity vs rating

histogram(~rating|budget, data=rate,layout=c(1,4),col=c(grey(.2),2:6)) ##budget vs rating

histogram(~rating|marital_status, data=rate,layout=c(1,4),col=c(grey(.2),2:6))  ##marital status vs rating

These seven variables all carry meaningful information for advertisers. Three discoveries surprised us most. First, ambience strongly affected how people rated the restaurant: users who were with their family were more likely to give a poor rating, while those out with friends gave better ratings. Second, people with low budgets gave a large percentage of poor ratings. We assumed people with high budgets would be more critical in their assessment, but they gave higher scores. Finally, casual drinkers, that is, people who drink alcohol with every meal, gave much lower ratings than people who described themselves as social drinkers or who did not drink with their meal. We assumed alcohol would lead to a more enjoyable experience for most, but the opposite turned out to be true.
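
The histogram panels compare groups visually; the same comparison can be read off as numbers from a table of row proportions. This is a minimal sketch on made-up toy data (the toy data frame is hypothetical and does not reproduce our results). With the real data one would tabulate, for example, rate$budget against rate$rating.

```r
# Toy data (made up) standing in for two columns of the restaurant data
toy <- data.frame(
  budget = c("low", "low", "low", "medium", "medium", "high", "high", "high"),
  rating = factor(
    c("Poor", "Poor", "Average", "Poor", "Great", "Great", "Great", "Average"),
    levels = c("Poor", "Average", "Great")
  )
)

# Counts of each rating within each budget level
counts <- table(toy$budget, toy$rating)

# Row proportions: within each budget level, the share of each rating
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Each row sums to 1, so groups of different sizes can be compared directly.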

In conclusion, the data set we were given contained customer ratings for restaurants along with the characteristics that describe each customer. We took a closer look at these characteristics and how they affected the rating. The variables we primarily focused on were drinking level, ambience, and religion. We used cluster analysis to group our data and to make similarities among the raters easier to recognize. We also used graphs to extend our analysis and compare the subcategories of each characteristic. The comparison showed us which subcategories gave the best reviews within each characteristic. Through these analysis techniques we found that, to maximize advertising impact and success, a restaurant needs to appeal to the drinking level, ambience, and religion of its niche of customers.

Ober

December 12, 2017