This paper applies association rules to hostel recommendation, based on scores in seven categories: atmosphere, cleanliness, facilities, location, security, staff and value for money.
The main goal is to find associations between these characteristics, both in general and for a specific criterion, using the arules package. From the results we build recommendations for choosing a hostel based not on one characteristic alone but on combinations of them.
The dataset was taken from the Kaggle website. It contains 342 hostels in five big Japanese cities (“Fukuoka-City”, “Hiroshima”, “Kyoto”, “Osaka” and “Tokyo”), described by 16 variables.
To mine association rules, the numeric scores must be converted to categorical factors, and unnecessary columns and incomplete rows must be removed.
library(dplyr)  # select() and the %>% pipe
library(tidyr)  # drop_na()

# Keep only the seven rating columns (columns 8 to 14) and drop incomplete rows
df <- data_raw %>%
  select(8:14) %>%
  drop_na()

# Labels for the six score intervals of each category, from worst to best
lbl_clean <- c('very dirty', 'dirty', 'unclean', 'clean', 'very clean', 'pureness')
lbl_atm <- c('horrible', 'very uncomfortable', 'uncomfortable', 'comfortable', 'very comfortable', 'cozy')
lbl_ser <- c('bad', 'normal', 'good', 'fabulous', 'superb', 'excellent')
lbl_fac <- c('very poor', 'poor', 'mediocre', 'useful', 'great', 'amazing')
lbl_loc <- c('terrible', 'reachable', 'great', 'wonderful', 'exceptional', 'perfect')
lbl_sec <- c('dangerous', 'unsafe', 'safe', 'guarded', 'protected', 'secure')
lbl_val <- c('overpriced', 'expensive', 'affordable', 'inexpensive', 'economical', 'cheap')

# Bin every score into the intervals (0,5], (5,6], (6,7], (7,8], (8,9], (9,10]
score_breaks <- c(0, 5, 6, 7, 8, 9, 10)
df$atmosphere <- cut(df$atmosphere, breaks = score_breaks, labels = lbl_atm)
df$cleanliness <- cut(df$cleanliness, breaks = score_breaks, labels = lbl_clean)
df$facilities <- cut(df$facilities, breaks = score_breaks, labels = lbl_fac)
df$location.y <- cut(df$location.y, breaks = score_breaks, labels = lbl_loc)
df$security <- cut(df$security, breaks = score_breaks, labels = lbl_sec)
df$staff <- cut(df$staff, breaks = score_breaks, labels = lbl_ser)
df$valueformoney <- cut(df$valueformoney, breaks = score_breaks, labels = lbl_val)
head(df)
##           atmosphere cleanliness facilities  location.y  security
## 1   very comfortable    pureness    amazing exceptional protected
## 2               cozy    pureness    amazing     perfect    secure
## 3        comfortable     unclean      great   wonderful    secure
## 4        comfortable       clean     useful   wonderful      safe
## 5               cozy    pureness      great exceptional    secure
## 6 very uncomfortable       clean       poor   reachable protected
##       staff valueformoney
## 1 excellent         cheap
## 2 excellent         cheap
## 3 excellent    economical
## 4  fabulous    affordable
## 5 excellent         cheap
## 6    superb    affordable
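The binning step can be checked on a few values. The following is a minimal sketch in base R; the scores are hypothetical and chosen only to land on interval boundaries, they are not taken from the dataset. It shows that `cut()` uses right-closed intervals by default, so a score of exactly 5 falls into the lowest label and a score of exactly 0 would become `NA`.

```r
# Hypothetical scores for illustration only (not from the hostel dataset)
scores <- c(4.5, 5.0, 5.1, 6.8, 7.5, 8.9, 9.0, 9.6, 10)
lbl_atm <- c('horrible', 'very uncomfortable', 'uncomfortable',
             'comfortable', 'very comfortable', 'cozy')
score_breaks <- c(0, 5, 6, 7, 8, 9, 10)

# Default cut() intervals are (0,5], (5,6], (6,7], (7,8], (8,9], (9,10]
binned <- cut(scores, breaks = score_breaks, labels = lbl_atm)
print(data.frame(score = scores, label = binned))

# A score of exactly 0 lies outside (0,5] and is turned into NA
zero_is_na <- is.na(cut(0, breaks = score_breaks, labels = lbl_atm))
print(zero_is_na)
```

If scores of 0 could occur, passing `include.lowest = TRUE` to `cut()` would keep them in the lowest interval.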
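library(ggplot2)
# Draw one bar chart per rating category
for (i in names(df)) {
  # .data[[i]] replaces the deprecated aes_string(); labs() keeps the column name on the axis
  p <- ggplot(df, aes(x = .data[[i]])) +
    geom_bar(fill = "cornflowerblue", color = "black") +
    labs(x = i) +
    theme_classic() +
    theme(axis.text.x = element_text(angle = 60, hjust = 1))
  print(p)
}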
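library(arules)  # transactions class, apriori(), inspect()
# Each hostel becomes one transaction with one item per rating category
df_trans <- as(df, "transactions", strict = FALSE)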
df_trans
## transactions in sparse format with
## 327 transactions (rows) and
## 42 items (columns)
summary(df_trans)
## transactions as itemMatrix in sparse format with
## 327 rows (elements/itemsets/transactions) and
## 42 columns (items) and a density of 0.1666667
##
## most frequent items:
## staff=excellent cleanliness=pureness security=secure
## 224 215 185
## valueformoney=cheap location.y=perfect (Other)
## 163 144 1358
##
## element (itemset/transaction) length distribution:
## sizes
## 7
## 327
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7 7 7 7 7 7
##
## includes extended item information - examples:
## labels variables levels
## 1 atmosphere=horrible atmosphere horrible
## 2 atmosphere=very uncomfortable atmosphere very uncomfortable
## 3 atmosphere=uncomfortable atmosphere uncomfortable
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 2
## 3 3
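The item count and the density in this summary follow directly from the data layout. As a small worked check, using only the numbers reported above (the variable names are illustrative):

```r
n_variables <- 7                     # rating categories
n_levels <- 6                        # labels per category
n_items <- n_variables * n_levels    # 42 possible items
items_per_hostel <- 7                # every hostel has exactly one label per category

# Density is the share of filled cells in the hostel-by-item matrix
density <- items_per_hostel / n_items
print(round(density, 7))
```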
itemFrequencyPlot(df_trans, topN=50, cex.names=.5)
The plot shows that the three most common characteristics among the hostels are excellent staff (224 of 327 hostels), the highest cleanliness rating (215) and the highest security rating (185).
association_rules <- apriori(df_trans, parameter = list(supp = 0.001, conf = 0.8, maxlen = 10))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 0
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[42 item(s), 327 transaction(s)] done [0.00s].
## sorting and recoding items ... [42 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 done [0.00s].
## writing ... [18897 rule(s)] done [0.01s].
## creating S4 object ... done [0.01s].
summary(association_rules)
## set of 18897 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6 7
## 23 985 5092 7471 4379 947
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 4.000 5.000 4.955 6.000 7.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.003058 Min. :0.8000 Min. : 1.168 Min. : 1.000
## 1st Qu.:0.003058 1st Qu.:1.0000 1st Qu.: 2.477 1st Qu.: 1.000
## Median :0.003058 Median :1.0000 Median : 5.450 Median : 1.000
## Mean :0.008247 Mean :0.9944 Mean : 14.352 Mean : 2.697
## 3rd Qu.:0.003058 3rd Qu.:1.0000 3rd Qu.: 20.438 3rd Qu.: 1.000
## Max. :0.550459 Max. :1.0000 Max. :109.000 Max. :180.000
##
## mining info:
## data ntransactions support confidence
## df_trans 327 0.001 0.8
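The summary shows that most rules have a count of 1. A short calculation, based only on the 327 transactions reported above, suggests why: a rule that holds for a single hostel already has a support above the 0.001 threshold. The `min_hostels` value below is a hypothetical requirement added for illustration; it is not part of the original analysis.

```r
n <- 327                        # transactions reported by summary(df_trans)

# Support of a rule that holds for exactly one hostel
support_one <- 1 / n
print(support_one > 0.001)      # the threshold used above does not exclude such rules

# Hypothetical: support threshold that would require at least 10 hostels per rule
min_hostels <- 10
supp_threshold <- min_hostels / n
print(round(c(one_hostel = support_one, ten_hostels = supp_threshold), 6))
```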
There are 18,897 rules, so we inspect only the first 10.
inspect(association_rules[1:10])
##      lhs                         rhs                       support
## [1]  {staff=normal}           => {cleanliness=very dirty}  0.009174312
## [2]  {staff=normal}           => {atmosphere=horrible}     0.009174312
## [3]  {cleanliness=very dirty} => {facilities=very poor}    0.015290520
## [4]  {cleanliness=very dirty} => {atmosphere=horrible}     0.018348624
## [5]  {facilities=mediocre}    => {cleanliness=clean}       0.024464832
## [6]  {atmosphere=cozy}        => {valueformoney=cheap}     0.214067278
## [7]  {atmosphere=cozy}        => {security=secure}         0.201834862
## [8]  {atmosphere=cozy}        => {cleanliness=pureness}    0.226299694
## [9]  {atmosphere=cozy}        => {staff=excellent}         0.232415902
## [10] {facilities=amazing}     => {valueformoney=cheap}     0.342507645
##      confidence lift      count
## [1]  1.0000000  54.500000   3
## [2]  1.0000000  29.727273   3
## [3]  0.8333333  38.928571   5
## [4]  1.0000000  29.727273   6
## [5]  0.8000000   9.020690   8
## [6]  0.8860759   1.777588  70
## [7]  0.8354430   1.476702  66
## [8]  0.9367089   1.424669  74
## [9]  0.9620253   1.404385  76
## [10] 0.8549618   1.715169 112
Rules 1 to 5 have low counts (at most 8 hostels each), so we focus on the others, where some associations are immediately visible: if the atmosphere is cozy, a large share of hostels are also rated cheap (confidence 0.89), secure (0.84), top in cleanliness (0.94) or excellent in staff (0.96). In rule 10, hostels with amazing facilities also have a high chance (confidence 0.85) of being rated cheap.
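To make the three quality measures concrete, rule [6] can be recomputed by hand. This is a minimal sketch in base R using the counts printed above; the count of hostels with a cozy atmosphere (79) is not printed directly and is derived here from count / confidence, so treat it as an assumption.

```r
n <- 327        # total hostels (transactions)
n_both <- 70    # count of rule [6]: atmosphere=cozy and valueformoney=cheap
n_rhs <- 163    # hostels with valueformoney=cheap (from the most frequent items)
n_lhs <- 79     # hostels with atmosphere=cozy; assumption derived as 70 / 0.8860759

support <- n_both / n               # share of all hostels matching the whole rule
confidence <- n_both / n_lhs        # share of cozy hostels that are also cheap
lift <- confidence / (n_rhs / n)    # confidence relative to the overall share of cheap hostels

print(round(c(support = support, confidence = confidence, lift = lift), 6))
```

A lift above 1 means the consequent is more common among hostels matching the antecedent than among all hostels.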
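library(arulesViz)  # plot() methods for rules, including method = "paracoord"
# Every mined rule already has confidence >= 0.8, so this filter keeps all of them
subRules <- association_rules[quality(association_rules)$confidence > 0.4]
# Keep the 20 rules with the highest lift
subRules2 <- head(subRules, n = 20, by = "lift")
plot(subRules2, method = "paracoord")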
In this graph, the most striking feature is that the top 20 rules all end in staff rated normal. Most of them combine a wonderful location with a very dirty cleanliness rating and average staff. In general, these rules span a wide range of negative and positive ratings.
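One likely reason the top-lift rules share a rare consequent is arithmetic: since lift is confidence divided by the support of the consequent, and confidence is at most 1, lift cannot exceed n / count(consequent). The sketch below applies this bound to item counts from the outputs above; the count of 3 for staff=normal is derived from rule [1] (count 3 at confidence 1) and is therefore an assumption.

```r
n <- 327
# Item counts from the summary output; staff=normal is derived from rule [1]
item_counts <- c("staff=excellent" = 224, "cleanliness=pureness" = 215,
                 "security=secure" = 185, "valueformoney=cheap" = 163,
                 "location.y=perfect" = 144, "staff=normal" = 3)

# Upper bound on lift for any rule with this item as its consequent
max_lift <- n / item_counts
print(round(sort(max_lift, decreasing = TRUE), 2))
```

The bound for staff=normal equals the maximum lift of 109 reported in the rule summary, while rules ending in a frequent item such as staff=excellent cannot reach a lift of 1.5.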
Suppose a customer is frugal and looks for hostels with the best value for money. Can we recommend hostels that are also good in other categories?
val4mon <- apriori(df_trans, parameter = list(supp = 0.001, conf = 0.8),
                   appearance = list(default = "rhs", lhs = "valueformoney=cheap"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 0
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[42 item(s), 327 transaction(s)] done [0.00s].
## sorting and recoding items ... [42 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [3 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(val4mon)
##     lhs                      rhs                    support   confidence
## [1] {valueformoney=cheap} => {security=secure}      0.4097859 0.8220859
## [2] {valueformoney=cheap} => {cleanliness=pureness} 0.4587156 0.9202454
## [3] {valueformoney=cheap} => {staff=excellent}      0.4525994 0.9079755
##     lift     count
## [1] 1.453092 134
## [2] 1.399629 150
## [3] 1.325482 148
Voilà! Not only can the customer find hostels with the best value for money, but he or she can also choose among them those with the best security, the best cleanliness or excellent staff as extra criteria.
For those who care most about location, can we suggest good hostels as well?
bestloc <- apriori(df_trans, parameter = list(supp = 0.001, conf = 0.8),
                   appearance = list(default = "rhs", lhs = "location.y=perfect"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 0
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[42 item(s), 327 transaction(s)] done [0.00s].
## sorting and recoding items ... [42 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [2 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(bestloc)
##     lhs                     rhs                    support   confidence
## [1] {location.y=perfect} => {cleanliness=pureness} 0.3547401 0.8055556
## [2] {location.y=perfect} => {staff=excellent}      0.3639144 0.8263889
##     lift     count
## [1] 1.225194 116
## [2] 1.206380 119
Besides a perfect location, choosing a hostel that also has the best cleanliness and the best staff will improve the quality of the trip.
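One way such rules could feed a recommendation is to treat the antecedent as the customer's main criterion and the consequents as extra filters. The following is a rough sketch in base R, not the method used in this paper: the `hostels` data frame, its hostel names and the `recommend()` helper are all hypothetical and exist only for illustration.

```r
# Hypothetical hostels for illustration only; names and ratings are made up
hostels <- data.frame(
  name = c("Hostel A", "Hostel B", "Hostel C", "Hostel D"),
  valueformoney = c("cheap", "cheap", "economical", "cheap"),
  cleanliness = c("pureness", "very clean", "pureness", "pureness"),
  staff = c("excellent", "excellent", "superb", "excellent"),
  stringsAsFactors = FALSE
)

# recommend(): hypothetical helper that keeps hostels matching the main criterion
# and every extra criterion suggested by the mined rules
recommend <- function(data, main, extras) {
  criteria <- c(main, extras)
  keep <- rep(TRUE, nrow(data))
  for (column in names(criteria)) {
    keep <- keep & data[[column]] == criteria[[column]]
  }
  data$name[keep]
}

# Main criterion: best value for money; extras taken from the rules for cheap hostels
picks <- recommend(hostels,
                   main = c(valueformoney = "cheap"),
                   extras = c(cleanliness = "pureness", staff = "excellent"))
print(picks)
```

Because the rules report high confidence for these combinations, adding the extra filters should remove only a small share of the hostels that satisfy the main criterion.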
By building association rules on this dataset, we have discovered some connections between the seven characteristics of a hostel, especially that positive qualities often go hand in hand with each other at a high confidence rate. As a result, building a system to recommend hostels, even from one main criterion, is feasible.