Applying Association Rules to Hostel Recommendation

Introduction

This paper shows how to apply association rules to hostel recommendation, based on scores in 7 categories: atmosphere, cleanliness, facilities, location, security, staff and value for money.

The main goal is to find associations between these characteristics, both in general and for a specific criterion, using the arules package. From the results we build recommendations for choosing a hostel based not on one characteristic alone but on a combination of them.

The dataset was taken from the Kaggle website. It covers 342 hostels in five big cities in Japan (“Fukuoka-City”, “Hiroshima”, “Kyoto”, “Osaka” and “Tokyo”), described by 16 variables.

Prepare the data

To mine association rules, the numeric scores must be converted to categorical factors, and unnecessary columns and rows with missing values must be removed.

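library(dplyr)
library(tidyr)
# Keep the seven rating columns (8 to 14) and drop rows with missing scores
df <- data_raw %>%
        select(8:14) %>%
        drop_na()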

lbl_clean <- c('very dirty', 'dirty', 'unclean', 'clean', 'very clean', 'pureness')
lbl_atm   <- c('horrible', 'very uncomfortable', 'uncomfortable', 'comfortable', 'very comfortable', 'cozy')
lbl_ser   <- c('bad', 'normal', 'good', 'fabulous', 'superb', 'excellent')
lbl_fac   <- c('very poor', 'poor', 'mediocre', 'useful', 'great', 'amazing')
lbl_loc   <- c('terrible', 'reachable', 'great', 'wonderful', 'exceptional', 'perfect')
lbl_sec   <- c('dangerous', 'unsafe', 'safe', 'guarded', 'protected', 'secure')
lbl_val   <- c('overpriced', 'expensive', 'affordable', 'inexpensive', 'economical', 'cheap')

# Bin each score into six ordered categories: (0,5], (5,6], (6,7], (7,8], (8,9], (9,10]
df$atmosphere    <- cut(df$atmosphere,    breaks = c(0,5,6,7,8,9,10), labels = lbl_atm)
df$cleanliness   <- cut(df$cleanliness,   breaks = c(0,5,6,7,8,9,10), labels = lbl_clean)
df$facilities    <- cut(df$facilities,    breaks = c(0,5,6,7,8,9,10), labels = lbl_fac)
df$location.y    <- cut(df$location.y,    breaks = c(0,5,6,7,8,9,10), labels = lbl_loc)
df$security      <- cut(df$security,      breaks = c(0,5,6,7,8,9,10), labels = lbl_sec)
df$staff         <- cut(df$staff,         breaks = c(0,5,6,7,8,9,10), labels = lbl_ser)
df$valueformoney <- cut(df$valueformoney, breaks = c(0,5,6,7,8,9,10), labels = lbl_val)
head(df)

##           atmosphere cleanliness facilities  location.y  security
## 1   very comfortable    pureness    amazing exceptional protected
## 2               cozy    pureness    amazing     perfect    secure
## 3        comfortable     unclean      great   wonderful    secure
## 4        comfortable       clean     useful   wonderful      safe
## 5               cozy    pureness      great exceptional    secure
## 6 very uncomfortable       clean       poor   reachable protected
##       staff valueformoney
## 1 excellent         cheap
## 2 excellent         cheap
## 3 excellent    economical
## 4  fabulous    affordable
## 5 excellent         cheap
## 6    superb    affordable
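
To make the binning concrete, here is a minimal sketch of how cut() maps scores to labels, using the same breaks and the atmosphere labels from above. The sample scores are made up for illustration and are not taken from the dataset. Note that the intervals are right-closed, so a score of exactly 0 falls outside (0,5] and becomes NA.

```r
# Same breaks and labels as used above for the atmosphere column
lbl_atm <- c('horrible', 'very uncomfortable', 'uncomfortable',
             'comfortable', 'very comfortable', 'cozy')
breaks <- c(0, 5, 6, 7, 8, 9, 10)

# Hypothetical scores, chosen only to illustrate the bin edges
scores <- c(4.2, 6, 7.5, 9, 9.1, 0)

# cut() uses right-closed intervals by default: (0,5], (5,6], ..., (9,10]
binned <- cut(scores, breaks = breaks, labels = lbl_atm)
print(data.frame(score = scores, label = binned))
```

A score on a boundary, such as 9, belongs to the lower bin ("very comfortable"), while 9.1 is already "cozy".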

Bar plot

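library(ggplot2)
# One bar chart per rating category, showing how many hostels fall under each label
for (i in names(df)) {
    # aes_string() is deprecated in recent ggplot2; the .data pronoun replaces it
    p <- ggplot(df, aes(x = .data[[i]])) +
      geom_bar(fill = "cornflowerblue", color = "black") +
      labs(x = i) +
      theme_classic() +
      theme(axis.text.x = element_text(angle = 60, hjust = 1))
    print(p)
}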

Convert to transactions

library(arules)
df_trans <- as(df, "transactions", strict = FALSE)
df_trans

## transactions in sparse format with
##  327 transactions (rows) and
##  42 items (columns)

summary(df_trans)

## transactions as itemMatrix in sparse format with
##  327 rows (elements/itemsets/transactions) and
##  42 columns (items) and a density of 0.1666667 
## 
## most frequent items:
##      staff=excellent cleanliness=pureness      security=secure 
##                  224                  215                  185 
##  valueformoney=cheap   location.y=perfect              (Other) 
##                  163                  144                 1358 
## 
## element (itemset/transaction) length distribution:
## sizes
##   7 
## 327 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       7       7       7       7       7       7 
## 
## includes extended item information - examples:
##                          labels  variables             levels
## 1           atmosphere=horrible atmosphere           horrible
## 2 atmosphere=very uncomfortable atmosphere very uncomfortable
## 3      atmosphere=uncomfortable atmosphere      uncomfortable
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3
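
The headline figures of this summary can be checked by hand. The short sketch below uses only numbers shown above: seven rating variables with six labels each give 42 items, and every transaction contains exactly seven of them, one per variable.

```r
n_variables <- 7   # rating categories
n_levels    <- 6   # labels per category after binning

# Each variable/label pair becomes one item
n_items <- n_variables * n_levels

# Density: share of all items that are present in a transaction
density <- n_variables / n_items

# Relative frequency of the most common item (staff=excellent, 224 of 327)
staff_excellent_share <- 224 / 327

print(c(items = n_items, density = round(density, 7),
        staff_excellent = round(staff_excellent_share, 4)))
```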

itemFrequencyPlot(df_trans, topN=50,  cex.names=.5)

The plot shows that the three most common characteristics among the hostels are excellent staff, the highest cleanliness rating and the highest security rating.

Association rules

association_rules <- apriori(df_trans, parameter = list(supp=0.001, conf=0.8,maxlen=10))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 0 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[42 item(s), 327 transaction(s)] done [0.00s].
## sorting and recoding items ... [42 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 done [0.00s].
## writing ... [18897 rule(s)] done [0.01s].
## creating S4 object  ... done [0.01s].

summary(association_rules)

## set of 18897 rules
## 
## rule length distribution (lhs + rhs):sizes
##    2    3    4    5    6    7 
##   23  985 5092 7471 4379  947 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   4.000   5.000   4.955   6.000   7.000 
## 
## summary of quality measures:
##     support           confidence          lift             count        
##  Min.   :0.003058   Min.   :0.8000   Min.   :  1.168   Min.   :  1.000  
##  1st Qu.:0.003058   1st Qu.:1.0000   1st Qu.:  2.477   1st Qu.:  1.000  
##  Median :0.003058   Median :1.0000   Median :  5.450   Median :  1.000  
##  Mean   :0.008247   Mean   :0.9944   Mean   : 14.352   Mean   :  2.697  
##  3rd Qu.:0.003058   3rd Qu.:1.0000   3rd Qu.: 20.438   3rd Qu.:  1.000  
##  Max.   :0.550459   Max.   :1.0000   Max.   :109.000   Max.   :180.000  
## 
## mining info:
##      data ntransactions support confidence
##  df_trans           327   0.001        0.8

There are 18897 rules, so we inspect only the first 10.

inspect(association_rules[1:10])

##      lhs                         rhs                      support    
## [1]  {staff=normal}           => {cleanliness=very dirty} 0.009174312
## [2]  {staff=normal}           => {atmosphere=horrible}    0.009174312
## [3]  {cleanliness=very dirty} => {facilities=very poor}   0.015290520
## [4]  {cleanliness=very dirty} => {atmosphere=horrible}    0.018348624
## [5]  {facilities=mediocre}    => {cleanliness=clean}      0.024464832
## [6]  {atmosphere=cozy}        => {valueformoney=cheap}    0.214067278
## [7]  {atmosphere=cozy}        => {security=secure}        0.201834862
## [8]  {atmosphere=cozy}        => {cleanliness=pureness}   0.226299694
## [9]  {atmosphere=cozy}        => {staff=excellent}        0.232415902
## [10] {facilities=amazing}     => {valueformoney=cheap}    0.342507645
##      confidence lift      count
## [1]  1.0000000  54.500000   3  
## [2]  1.0000000  29.727273   3  
## [3]  0.8333333  38.928571   5  
## [4]  1.0000000  29.727273   6  
## [5]  0.8000000   9.020690   8  
## [6]  0.8860759   1.777588  70  
## [7]  0.8354430   1.476702  66  
## [8]  0.9367089   1.424669  74  
## [9]  0.9620253   1.404385  76  
## [10] 0.8549618   1.715169 112

The first five rules are each backed by only a few hostels (counts of 3 to 8), so we focus on the others, where some associations are immediately visible. If the atmosphere is cozy, a large share of hostels are also cheap (confidence 0.89), secure (0.84), of the highest cleanliness (0.94) or have excellent staff (0.96). In the tenth rule, hostels with amazing facilities also have a high chance (0.85) of being cheap.
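
The many low-count rules follow from the support threshold. As a rough sketch, using only the numbers reported above, a support of 0.001 on 327 transactions corresponds to less than one hostel, which is consistent with the "Absolute minimum support count: 0" line and with the median rule count of 1. The target of 10 hostels in the last step is an illustrative choice, not a value used in this analysis.

```r
n_transactions <- 327
min_support    <- 0.001

# Support threshold expressed as a number of hostels: less than one
threshold_in_hostels <- min_support * n_transactions
print(threshold_in_hostels)

# Smallest support a rule can actually have: a single hostel
single_hostel_support <- 1 / n_transactions
print(round(single_hostel_support, 6))

# Support that would require at least 10 hostels per rule (illustrative target)
support_for_10 <- 10 / n_transactions
print(round(support_for_10, 4))
```

Raising the supp parameter towards such a value would be one way to discard rules that rest on only a few hostels.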

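Visualization of the top 20 rules by lift, with confidence greater than 0.4

library(arulesViz)
# All mined rules already have confidence >= 0.8, so this filter keeps every rule
subRules<-association_rules[quality(association_rules)$confidence>0.4]
subRules2<-head(subRules, n=20, by="lift")
plot(subRules2, method="paracoord")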

The most striking feature of this graph is that the top 20 rules all end in normal-rated staff. Most of them describe hostels in a wonderful location that are nevertheless very dirty and have average staff. In general, these rules combine a wide range of negative and positive ratings.

Rules for a specific criterion (some examples)

Best value for money

Suppose a customer is frugal and is looking for hostels with the best value for money. Which other qualities can we expect such hostels to have, so that we can recommend ones that are also good in other categories?

val4mon<- apriori(df_trans, parameter = list(supp=0.001, conf=0.8), 
                  appearance = list(default="rhs",lhs="valueformoney=cheap"))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 0 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[42 item(s), 327 transaction(s)] done [0.00s].
## sorting and recoding items ... [42 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [3 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

inspect(val4mon)

##     lhs                      rhs                    support   confidence
## [1] {valueformoney=cheap} => {security=secure}      0.4097859 0.8220859 
## [2] {valueformoney=cheap} => {cleanliness=pureness} 0.4587156 0.9202454 
## [3] {valueformoney=cheap} => {staff=excellent}      0.4525994 0.9079755 
##     lift     count
## [1] 1.453092 134  
## [2] 1.399629 150  
## [3] 1.325482 148
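
The quality measures of rule [1] can be reproduced from counts that appear in the outputs above. This is a minimal sketch of the standard definitions of support, confidence and lift, not a call into arules itself.

```r
n        <- 327  # transactions
n_cheap  <- 163  # hostels with valueformoney=cheap (item summary)
n_secure <- 185  # hostels with security=secure (item summary)
n_both   <- 134  # hostels with both items (count of rule [1])

support    <- n_both / n                  # P(cheap and secure)
confidence <- n_both / n_cheap            # P(secure | cheap)
lift       <- confidence / (n_secure / n) # confidence relative to P(secure)

print(round(c(support = support, confidence = confidence, lift = lift), 7))
```

A lift above 1 means that secure hostels are more common among cheap hostels than among all hostels.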

Voilà! The customer can not only find hostels with the best value for money, but can also choose among them by the best security, the highest cleanliness or excellent staff as extra criteria.

Best location

For those who care most about location, can we suggest good hostels?

bestloc <-apriori(df_trans, parameter = list(supp=0.001, conf=0.8),
                  appearance = list(default="rhs",lhs="location.y=perfect"))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 0 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[42 item(s), 327 transaction(s)] done [0.00s].
## sorting and recoding items ... [42 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [2 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

inspect(bestloc)

##     lhs                     rhs                    support   confidence
## [1] {location.y=perfect} => {cleanliness=pureness} 0.3547401 0.8055556 
## [2] {location.y=perfect} => {staff=excellent}      0.3639144 0.8263889 
##     lift     count
## [1] 1.225194 116  
## [2] 1.206380 119

Besides having a perfect location, more than 80% of these hostels also have the highest cleanliness rating or excellent staff, so adding these criteria can improve the quality of the trip.
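
One possible way to turn such rules into a recommendation is sketched below. It is only an illustration under stated assumptions: the hostels data frame contains made-up toy rows, not records from the Kaggle dataset, and recommend() is a hypothetical helper, not part of arules. It keeps hostels that match the main criterion and ranks them by how many of the associated extra criteria they also satisfy.

```r
# Toy data, made up for illustration only (not rows from the Kaggle dataset)
hostels <- data.frame(
  name        = c("A", "B", "C", "D"),
  location.y  = c("perfect", "perfect", "wonderful", "perfect"),
  cleanliness = c("pureness", "clean", "pureness", "pureness"),
  staff       = c("excellent", "excellent", "excellent", "superb"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: filter on the main criterion (the rule's lhs), then
# count how many extra criteria (the rules' rhs items) each hostel matches
recommend <- function(data, main, extras) {
  keep <- data[[names(main)]] == main[[1]]
  candidates <- data[keep, , drop = FALSE]
  matches <- lapply(names(extras), function(col) {
    as.integer(candidates[[col]] == extras[[col]])
  })
  candidates$extras_matched <- Reduce(`+`, matches)
  # Best matches first; ties keep their original order
  candidates[order(-candidates$extras_matched), , drop = FALSE]
}

result <- recommend(
  hostels,
  main   = c(location.y = "perfect"),
  extras = c(cleanliness = "pureness", staff = "excellent")
)
print(result)
```

Here the extra criteria are the two right-hand sides found for location.y=perfect, so hostel A, which satisfies both, is ranked first.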

Conclusion

By building association rules on this dataset, we have discovered connections between the 7 characteristics of hostels, especially that positive qualities often go hand in hand with high confidence. As a result, building a system to recommend hostels, even from one main criterion, is feasible.