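Clustering shows which reviewers are similar to one another, but it does not show which personal attributes go together with which ratings. To look for relationships between a reviewer’s personal attributes and their ratings, market basket analysis (association rule mining) is used. The resulting rules describe co-occurrence in the data rather than proven causation.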

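# Load the training and test reviewer data
restaurant <- read.csv("C:/DataMining/Data/RestaurantRatersComplete.csv")
resttest <- read.csv("C:/DataMining/Data/RestaurantRatersTest.csv")
# Drop the first two columns, then the remaining columns not used for rule mining
rest <- restaurant[, c(-1, -2)]
rest <- rest[, c(-11, -13, -14, -16:-19)]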
library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:arules':
## 
##     recode
rest[["birth_year"]] <- ordered(cut(rest[["birth_year"]], c(1930,1986,1989,1994)), 
                                labels = c("old", "middle","young"))
rest$rating <- recode(rest$rating,"'0' = 'poor';'1'='okay';'2'='good'")
rest$rating = as.factor(rest$rating)
rest$food_rating = as.factor(rest$food_rating)
rest$service_rating = as.factor(rest$service_rating)
set.seed(1)
rest1 <- as(rest, "transactions")
summary(rest1)
## transactions as itemMatrix in sparse format with
##  4090 rows (elements/itemsets/transactions) and
##  163 columns (items) and a density of 0.09850451 
## 
## most frequent items:
##     marital_status=single          activity=student 
##                      3919                      3655 
##             Upayment=cash dress_preference=informal 
##                      3352                      2651 
##           ambience=family                   (Other) 
##                      2427                     49666 
## 
## element (itemset/transaction) length distribution:
## sizes
##   15   16   17 
##   68 3724  298 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   16.00   16.00   16.06   16.00   17.00 
## 
## includes extended item information - examples:
##                       labels   variables         levels
## 1                     smoker      smoker           TRUE
## 2     drink_level=abstemious drink_level     abstemious
## 3 drink_level=casual drinker drink_level casual drinker
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3
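
The birth-year binning above depends on how `cut()` treats its break points, so a small check on made-up years may be useful. By default the intervals are right-closed and the lowest break itself is excluded, so a value exactly equal to 1930 would become `NA` rather than `old`. The sketch below uses hypothetical years, not the reviewer data, and `include.lowest = TRUE` is only one possible adjustment if the data do contain 1930.

```r
# Hypothetical birth years chosen to sit on and around the break points
years  <- c(1930, 1940, 1986, 1987, 1989, 1990, 1994)
breaks <- c(1930, 1986, 1989, 1994)
labs   <- c("old", "middle", "young")

# Same construction as in the analysis: (1930,1986], (1986,1989], (1989,1994]
default_bins <- ordered(cut(years, breaks), labels = labs)

# Variant that also keeps the lowest break, so 1930 falls into "old"
closed_bins <- cut(years, breaks, labels = labs,
                   include.lowest = TRUE, ordered_result = TRUE)

# Side-by-side comparison of the two binnings
print(data.frame(years, default_bins, closed_bins))
```

The two columns differ only in the first row, where the default binning yields `NA`.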

Association rules are then mined from the full data set, using a minimum support of 0.01 and a minimum confidence of 0.6, and a summary provides basic descriptive statistics for the resulting rules.

aa=as(rest1,"matrix") # converts the transactions into a logical incidence matrix
aa[1]   # prints only the first element of the incidence matrix; aa[1, ] would give the whole first row
## [1] FALSE
rules <- apriori(rest1, parameter = list(maxlen=20 ,support = 0.01, confidence = 0.6))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target   ext
##      20  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 40 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[163 item(s), 4090 transaction(s)] done [0.00s].
## sorting and recoding items ... [71 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 done [0.56s].
## writing ... [2949837 rule(s)] done [0.66s].
## creating S4 object  ... done [1.70s].
rules
## set of 2949837 rules
summary(rules)
## set of 2949837 rules
## 
## rule length distribution (lhs + rhs):sizes
##      1      2      3      4      5      6      7      8      9     10 
##      4    557   8055  53676 190038 402239 577412 620995 521183 337817 
##     11     12     13     14     15 
## 164269  57705  13770   1987    130 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   7.000   8.000   7.929   9.000  15.000 
## 
## summary of quality measures:
##     support          confidence          lift             count       
##  Min.   :0.01002   Min.   :0.6000   Min.   : 0.6262   Min.   :  41.0  
##  1st Qu.:0.01320   1st Qu.:0.9091   1st Qu.: 1.5428   1st Qu.:  54.0  
##  Median :0.01760   Median :1.0000   Median : 2.0931   Median :  72.0  
##  Mean   :0.05023   Mean   :0.9370   Mean   : 2.6003   Mean   : 205.4  
##  3rd Qu.:0.03423   3rd Qu.:1.0000   3rd Qu.: 3.2537   3rd Qu.: 140.0  
##  Max.   :0.95819   Max.   :1.0000   Max.   :41.9487   Max.   :3919.0  
## 
## mining info:
##   data ntransactions support confidence
##  rest1          4090    0.01        0.6
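
Because only rules with a rating on the right-hand side are examined afterwards, one option for keeping the rule set smaller is to restrict the right-hand side during mining through the `appearance` argument of `apriori()`. The following is a minimal sketch on a hypothetical toy data frame (`toy`, with invented values), not a rerun of the analysis above; the thresholds are chosen only so that the toy example produces output.

```r
library(arules)

# Hypothetical toy data standing in for the reviewer table
toy <- data.frame(
  budget = factor(c("low", "low", "medium", "medium", "low", "medium")),
  rating = factor(c("poor", "poor", "good", "good", "poor", "okay"))
)
toy_trans <- as(toy, "transactions")

# Allow only rating items on the rhs; every other item may appear on the lhs only
toy_rules <- apriori(
  toy_trans,
  parameter  = list(support = 0.1, confidence = 0.6, minlen = 2),
  appearance = list(rhs = c("rating=poor", "rating=okay", "rating=good"),
                    default = "lhs"),
  control    = list(verbose = FALSE)
)

inspect(toy_rules)
```

With `minlen = 2`, rules with an empty left-hand side are excluded as well, so every remaining rule links at least one attribute to a rating.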

For each rating level (poor, okay, good), the top five rules that predict that rating are listed below, sorted by confidence. Only rules with a lift above 1.2 are considered.

rulesPoorRatings <- subset(rules, subset = rhs %in% "rating=poor" & lift > 1.2)
inspect(sort(rulesPoorRatings, by = "confidence", decreasing = TRUE)[1:5])
##     lhs                                  rhs              support confidence     lift count
## [1] {hijos=kids,                                                                           
##      personality=hunter-ostentatious} => {rating=poor} 0.35256724          1 2.262168  1442
## [2] {hijos=kids,                                                                           
##      budget=low}                      => {rating=poor} 0.35256724          1 2.262168  1442
## [3] {dress_preference=no preference,                                                       
##      birth_year=old,                                                                       
##      budget=low}                      => {rating=poor} 0.01075795          1 2.262168    44
## [4] {ambience=friends,                                                                     
##      birth_year=old,                                                                       
##      budget=low}                      => {rating=poor} 0.01075795          1 2.262168    44
## [5] {birth_year=old,                                                                       
##      budget=low,                                                                           
##      food_rating=0}                   => {rating=poor} 0.01075795          1 2.262168    44
rulesOkayRatings <- subset(rules, subset = rhs %in% "rating=okay" & lift > 1.2)
inspect(sort(rulesOkayRatings, by = "confidence", decreasing = TRUE)[1:5])
##     lhs                             rhs              support confidence     lift count
## [1] {ambience=solitary,                                                               
##      birth_year=middle,                                                               
##      food_rating=1}              => {rating=okay} 0.02029340          1 3.994141    83
## [2] {birth_year=old,                                                                  
##      food_rating=1,                                                                   
##      service_rating=1,                                                                
##      Upayment=VISA}              => {rating=okay} 0.01002445          1 3.994141    41
## [3] {drink_level=social drinker,                                                      
##      dress_preference=formal,                                                         
##      ambience=solitary,                                                               
##      food_rating=1}              => {rating=okay} 0.01711491          1 3.994141    70
## [4] {drink_level=social drinker,                                                      
##      ambience=solitary,                                                               
##      food_rating=1,                                                                   
##      service_rating=1}           => {rating=okay} 0.01613692          1 3.994141    66
## [5] {drink_level=social drinker,                                                      
##      ambience=solitary,                                                               
##      interest=technology,                                                             
##      food_rating=1}              => {rating=okay} 0.01711491          1 3.994141    70
rulesGoodRatings <- subset(rules, subset = rhs %in% "rating=good" & lift > 1.2)
inspect(sort(rulesGoodRatings, by = "confidence", decreasing = TRUE)[1:5])
##     lhs                               rhs              support confidence     lift count
## [1] {service_rating=2,                                                                  
##      Upayment=MasterCard-Eurocard} => {rating=good} 0.01418093          1 3.251192    58
## [2] {ambience=solitary,                                                                 
##      service_rating=2,                                                                  
##      Upayment=MasterCard-Eurocard} => {rating=good} 0.01320293          1 3.251192    54
## [3] {birth_year=old,                                                                    
##      service_rating=2,                                                                  
##      Upayment=MasterCard-Eurocard} => {rating=good} 0.01418093          1 3.251192    58
## [4] {interest=technology,                                                               
##      service_rating=2,                                                                  
##      Upayment=MasterCard-Eurocard} => {rating=good} 0.01320293          1 3.251192    54
## [5] {drink_level=abstemious,                                                            
##      service_rating=2,                                                                  
##      Upayment=MasterCard-Eurocard} => {rating=good} 0.01344743          1 3.251192    55

All of the listed rules have a confidence of 1, which means that in this data set every reviewer who meets all of the conditions on the left-hand side gave the rating listed on the right-hand side. All of the rules also have a lift greater than 1, meaning that the conditions on the left occur together with the rating on the right more often than would be expected if the two were independent.
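
To make these measures concrete, the quality values of the first poor-rating rule can be recomputed by hand. The sketch below uses counts taken from the printed output: 4090 transactions, a rule count of 1442, and the total number of poor ratings summed from the budget-by-rating table in the Conclusion. The left-hand-side count is not printed anywhere; it is inferred here from the reported confidence of 1.

```r
n_total <- 4090                     # transactions, from summary(rest1)
n_rule  <- 1442                     # transactions matching both sides of poor-rating rule [1]
n_lhs   <- 1442                     # inferred: confidence of 1 implies lhs count equals rule count
n_rhs   <- 13 + 12 + 1559 + 224     # all poor ratings, summed over the budget levels

support    <- n_rule / n_total                   # share of transactions containing lhs and rhs
confidence <- n_rule / n_lhs                     # share of lhs transactions that also contain rhs
lift       <- confidence / (n_rhs / n_total)     # confidence relative to the baseline rate of rhs

print(round(c(support = support, confidence = confidence, lift = lift), 6))
```

The results agree with the support (0.35256724) and lift (2.262168) reported by `inspect()`. Since every top rule has confidence 1, its lift is simply the reciprocal of the baseline share of that rating, which is why all five rules within one rating level show the same lift.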

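Three of the five rules for a poor rating (rating of 0) include the condition birth_year=old, that is, an older reviewer. This matches the conclusion of the cluster analysis above. The rules also suggest that a reviewer tends to give the overall rating the same score as a component rating: food_rating=0 appears in a poor-rating rule, food_rating=1 appears in all five okay-rating rules, and service_rating=2 appears in all five good-rating rules. The one strange connection the rules bring up is the form of payment, for example Upayment=MasterCard-Eurocard in every top rule for a good rating. It is not obvious why the form of payment would have any effect on the rating, even when combined with other conditions.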

Conclusion

These data come from an internet review site, which attracts users with certain characteristics. The reviewers tend to be younger individuals (born after 1985) and students with low or medium budgets, and they tend to describe themselves as hunter-ostentatious or thrifty-protector.

library(lattice)
barchart(rest$personality,ylab="Personality",col="black")

table(rest$activity)
## 
##             ?  professional       student    unemployed working-class 
##            25           385          3655            17             8
table(resttest$birth_year)
## 
## 1930 1940 1943 1952 1967 1969 1979 1981 1982 1983 1984 1985 1986 1987 1988 
##   73  144    5   10    6   20    8    9   40  579   20   54   64  116 1646 
## 1989 1990 1991 1992 1993 1994 
##  365  262  619   42    4    4
rate.budget.tbl=table(Budget=rest$budget,Rating=rest$rating)
rate.budget.tbl
##         Rating
## Budget   good okay poor
##   ?        18    7   13
##   high     42   32   12
##   low     194  259 1559
##   medium 1004  726  224
barchart(rate.budget.tbl,horizontal=FALSE,groups=FALSE,xlab="Budget",col="black")

plot(rating~personality,data=rest) # x-axis reads left to right: conformist, hard-worker, hunter-ostentatious, thrifty-protector

If we assume that this is a representative sample of the diners at the 130 restaurants, then it is reasonable to draw some conclusions about how to receive higher ratings. Restaurants would be expected to receive higher ratings if they put a lot of effort into their food, attract older customers, dissuade those with ostentatious personalities, and attract customers with medium to high budgets.
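
The budget part of this conclusion can be checked against the budget-by-rating table printed above by converting the counts to row-wise proportions. The sketch below re-enters the printed counts by hand so that it runs without the data files; the object names `counts` and `row_share` are introduced here for illustration only.

```r
# Counts copied from the budget-by-rating table in the output above
counts <- matrix(
  c(  18,   7,   13,
      42,  32,   12,
     194, 259, 1559,
    1004, 726,  224),
  nrow = 4, byrow = TRUE,
  dimnames = list(budget = c("?", "high", "low", "medium"),
                  rating = c("good", "okay", "poor"))
)

# Row-wise proportions: the share of each rating within a budget level
row_share <- prop.table(counts, margin = 1)
print(round(row_share, 3))
```

Among low-budget reviewers, roughly 77% of ratings are poor, whereas among medium-budget reviewers roughly 51% are good and about 11% are poor. The high-budget group contains only 86 ratings, so its proportions are less reliable.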