Association rules are “if-then” statements that describe relationships between data items. They are typically used for market basket analysis, for example: if a customer buys butter, then they are likely to buy bread. I’d like to use them in a more unusual way - to find indicators of Airbnb Superhost listings in Madrid, inspired by a Kaggle notebook [1] that inspects similar relations in Amsterdam.
The whole dataset [2] consists of many .csv files, but only the listing attributes matter for this analysis.
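# Packages used below: data.table (data.table), dplyr (%>%, mutate),
# arules (read.transactions, apriori, inspect), arulesViz (plot methods for rules)
library(data.table)
library(dplyr)
library(arules)
library(arulesViz)

listings_detailed <- read.csv("Rules mining/listings_detailed.csv")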
sub_transactions <- listings_detailed[, c("id", "amenities")]
# Strip the list brackets and the double quotes around each amenity
sub_transactions$amenities <- gsub('[]["]', "", sub_transactions$amenities)
# Reshape to long format: one row per (listing id, amenity) pair
newdata <- data.table()
for (i in seq_len(nrow(sub_transactions))) {
  id <- sub_transactions$id[i]
  str <- sub_transactions$amenities[i]
  str <- unlist(strsplit(str, ", "))
  if (length(str) != 0) {
    tmp <- data.frame(Transaction = id, Item = str)
  } else {
    # Listings without any amenity get a placeholder item
    tmp <- data.frame(Transaction = id, Item = "Nothing")
  }
  newdata <- rbind(newdata, tmp, fill = TRUE)
}
items_df <- as.data.frame(table(newdata$Item))
most_frequent_items <- items_df[items_df$Freq > 10, 'Var1']
sub_transactions <- newdata[newdata$Item %in% most_frequent_items]
superhost <- listings_detailed[listings_detailed$id %in% unique(sub_transactions$Transaction), c('id', 'host_is_superhost')]
superhost$host_is_superhost[is.na(superhost$host_is_superhost)] <- FALSE
superhost <- superhost %>%
  mutate(Superhost = paste('Superhost =', as.character(host_is_superhost))) %>%
  `colnames<-`(c('Transaction', 'host_is_superhost', 'Item'))
sub_transactions <- rbind(sub_transactions, superhost[, c('Transaction', 'Item')])
write.csv(sub_transactions, file = 'amenities.csv', row.names = F)
trans <- read.transactions('amenities.csv', format = 'single', cols = 1:2, sep = ',', header = T)
The input data frame consists of 74 columns, but this project needs only three of them: id, amenities and host_is_superhost. Amenities are stored in a single column in the form of a list, so some initial cleaning is required to strip the extra characters (brackets and quotes) and split the amenities on the ", " separator. The next step is to reshape the data frame so that every row contains only a listing id and one amenity present in that listing. Following that, low-frequency amenities (those appearing 10 times or fewer) are removed to limit the number of items and candidate rules. Finally, a Superhost item is added for every listing (all NA values are treated as FALSE), and the result is saved to a csv file and read back as a transactions object.
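The reshaping described above can also be done without growing a table row by row. Here is a minimal base R sketch, assuming the amenities strings look like `["Wifi", "Kitchen"]`. The `toy` data frame and its values are invented for illustration and are not taken from the Madrid data.

```r
# Toy stand-in for listings_detailed[, c("id", "amenities")]; values are invented
toy <- data.frame(
  id = c(1, 2, 3),
  amenities = c('["Wifi", "Kitchen"]', '["Wifi"]', '[]'),
  stringsAsFactors = FALSE
)

# Strip brackets and quotes, then split each string on ", "
cleaned <- gsub('[]["]', "", toy$amenities)
items <- strsplit(cleaned, ", ", fixed = TRUE)

# Listings without any amenity get the placeholder item "Nothing"
items[lengths(items) == 0] <- list("Nothing")

# One row per (listing, amenity) pair
long <- data.frame(
  Transaction = rep(toy$id, lengths(items)),
  Item = unlist(items),
  stringsAsFactors = FALSE
)
print(long)
```

On the full dataset this avoids calling rbind once per listing, which gets slow as the table grows. The resulting columns have the same names as in the loop version.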
itemFrequencyPlot(trans, topN=15, type = 'relative')
Nothing really innovative here - most listings have Wifi, Essentials or Kitchen. What’s really interesting is that Superhost = f is a very frequent item across all listings, meaning that there are relatively few Superhost listings from which the algorithm could generate rules. Extending the analysis is therefore worthwhile: rules with Superhost = f on the right-hand side will be inspected alongside those with Superhost = t, so that more dependencies can be discovered.
rules <- apriori(trans, parameter = list(support = 0.75,
                                         conf = 0.8,
                                         minlen = 3))
superhost_t <- apriori(trans, parameter = list(support = 0.017,
                                               conf = 0.5,
                                               maxlen = 5),
                       appearance = list(rhs = 'Superhost = t'))
superhost_f <- apriori(trans, parameter = list(support = 0.09,
                                               conf = 0.8,
                                               maxlen = 5),
                       appearance = list(rhs = 'Superhost = f'))
inspect(sort(rules, by = 'confidence', decreasing = T))
##     lhs                      rhs          support   confidence coverage  lift     count
## [1] {Essentials, Heating} => {Wifi}       0.7716893 0.9585286  0.8050770 1.018326 15139
## [2] {Essentials, Kitchen} => {Wifi}       0.7804567 0.9514666  0.8202671 1.010824 15311
## [3] {Heating, Wifi}       => {Essentials} 0.7716893 0.9330663  0.8270466 1.023706 15139
## [4] {Kitchen, Wifi}       => {Essentials} 0.7804567 0.9318929  0.8374962 1.022419 15311
## [5] {Essentials, Wifi}    => {Kitchen}    0.7804567 0.9020266  0.8652258 1.021353 15311
## [6] {Essentials, Wifi}    => {Heating}    0.7716893 0.8918935  0.8652258 1.030518 15139
plot(rules, method="paracoord", control=list(reorder=TRUE))
plot(rules, method="graph")
To limit the number of rules, support and confidence were set really high, at 0.75 and 0.8 respectively. Quick recap:
1. Confidence - the share of observations containing the LHS items in which the RHS item is also present.
2. Support - the share of all observations in which the LHS and RHS items appear together.
The best rule seems to be that whenever Heating and Essentials are present, Wifi is present too in 95.9% of cases. All three items appear together in 77.2% of observations. Lift stays close to 1 for all six rules, so these amenities are simply common across listings rather than strongly dependent on each other.
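To make the recap concrete, here is a small base R sketch that computes support, coverage, confidence and lift by hand for a single rule. The five transactions are invented for illustration, and `support_of` is a hypothetical helper, not part of arules.

```r
# Toy transactions, invented for illustration (not the Madrid data)
transactions <- list(
  c("Essentials", "Heating", "Wifi"),
  c("Essentials", "Heating", "Wifi"),
  c("Essentials", "Heating"),
  c("Wifi", "Kitchen"),
  c("Essentials", "Wifi")
)

# Hypothetical helper: share of transactions that contain every item in `items`
support_of <- function(items, transactions) {
  mean(vapply(transactions, function(t) all(items %in% t), logical(1)))
}

lhs <- c("Essentials", "Heating")
rhs <- "Wifi"

support    <- support_of(c(lhs, rhs), transactions)       # LHS and RHS together
coverage   <- support_of(lhs, transactions)               # LHS alone
confidence <- support / coverage                          # share of LHS cases that also have RHS
lift       <- confidence / support_of(rhs, transactions)  # confidence relative to the RHS base rate

print(c(support = support, coverage = coverage,
        confidence = confidence, lift = lift))
```

The same identity can be checked against the first rule in the output above: support divided by coverage, 0.7716893 / 0.8050770, gives the reported confidence of about 0.9585.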
inspect(sort(superhost_t, by = 'confidence', decreasing = T))
## lhs rhs support confidence coverage lift count
## [1] {Air conditioning,
## Carbon monoxide alarm,
## Hair dryer,
## Room-darkening shades} => {Superhost = t} 0.01707615 0.5106707 0.03343868 2.791401 335
## [2] {Air conditioning,
## Carbon monoxide alarm,
## Room-darkening shades} => {Superhost = t} 0.01707615 0.5083460 0.03359160 2.778694 335
## [3] {Carbon monoxide alarm,
## Coffee maker,
## Iron,
## Room-darkening shades} => {Superhost = t} 0.01763686 0.5036390 0.03501886 2.752965 346
## [4] {Carbon monoxide alarm,
## Iron,
## Room-darkening shades,
## Smoke alarm} => {Superhost = t} 0.01789173 0.5035868 0.03552860 2.752679 351
## [5] {Carbon monoxide alarm,
## Coffee maker,
## Room-darkening shades,
## Smoke alarm} => {Superhost = t} 0.01768784 0.5028986 0.03517178 2.748917 347
## [6] {Carbon monoxide alarm,
## Coffee maker,
## Hair dryer,
## Room-darkening shades} => {Superhost = t} 0.01789173 0.5021459 0.03563054 2.744803 351
## [7] {Carbon monoxide alarm,
## Hair dryer,
## Iron,
## Room-darkening shades} => {Superhost = t} 0.01809563 0.5007052 0.03614028 2.736928 355
## [8] {Carbon monoxide alarm,
## Hair dryer,
## Room-darkening shades,
## Smoke alarm} => {Superhost = t} 0.01814660 0.5000000 0.03629320 2.733073 356
plot(superhost_t, method="paracoord", control=list(reorder=TRUE))
plot(superhost_t, method="graph")
Here confidence = 0.5 and support = 0.017. The rules are more complex: seven of the 8 generated rules have length 5 and one has length 4. The best rule (judging by confidence) implies that when Air conditioning, Carbon monoxide alarm, Hair dryer and Room-darkening shades are all present, the host of that flat is a Superhost in 51% of cases. If we exclude the Hair dryer from this set, as the second-best rule does, those chances decrease to 50.8%. Is there a possible explanation? Well, given that the host provides AC, it is reasonable to expect a Hair dryer too, and the lack of it reduces (by a slight margin) the chances of being a Superhost. On the other hand, that combination together with Superhost status is present in only 1.7% of all listings, meaning 335 cases. What’s worth mentioning is that Room-darkening shades (together with Carbon monoxide alarm) appear in all 8 rules. It is quite surprising that every rule contains Room-darkening shades but not AC, which is more of a marker of a high-class flat than those shades.
inspect(sort(superhost_f, by = 'confidence', decreasing = T))
## lhs rhs support confidence coverage lift count
## [1] {Lock on bedroom door} => {Superhost = f} 0.10133551 0.8297162 0.1221327 1.050018 1988
## [2] {Lock on bedroom door,
## Wifi} => {Superhost = f} 0.09802222 0.8267412 0.1185646 1.046253 1923
## [3] {Essentials,
## Lock on bedroom door} => {Superhost = f} 0.09501478 0.8233216 0.1154042 1.041925 1864
## [4] {Essentials,
## Lock on bedroom door,
## Wifi} => {Superhost = f} 0.09226221 0.8204896 0.1124478 1.038341 1810
plot(superhost_f, method="paracoord", control=list(reorder=TRUE))
plot(superhost_f, method="graph")
Here confidence = 0.8 and support = 0.09. Only 4 rules were generated, and the best one (by both confidence and support) indicates that when there is a Lock on bedroom door, the host will not be a Superhost in almost 83% of cases. The LHS (Lock) and the RHS (not a Superhost) are present together in about 10% of all cases, meaning 1988 observations. Possibly a Lock on bedroom door indicates that only a single room is for rent, not the whole place, so the chances of being a Superhost are smaller - Superhosts are usually renting the whole apartment. At the same time, Superhost = f does not mean anything negative - it does not indicate that the host is worse than average (in contrast to Superhost = t, which means that the host is better than average).
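Confidence alone can be misleading when the RHS item is very common, so it helps to compare it with the base rate of the RHS. A small sketch, using only numbers copied from the two inspect() outputs above and the definition lift = confidence / support(RHS). The variable names are illustrative.

```r
# Confidence and lift copied from the inspect() outputs above
top_t <- c(confidence = 0.5106707, lift = 2.791401)  # best rule for Superhost = t
top_f <- c(confidence = 0.8297162, lift = 1.050018)  # best rule for Superhost = f

# lift = confidence / support(RHS), so support(RHS) = confidence / lift
baseline_t <- unname(top_t["confidence"] / top_t["lift"])
baseline_f <- unname(top_f["confidence"] / top_f["lift"])

print(round(c(baseline_t = baseline_t, baseline_f = baseline_f), 3))
```

Read this way, the amenity set in the best Superhost = t rule raises the share of Superhosts from roughly 18% to 51%, whereas a Lock on bedroom door moves the share of non-Superhosts only from roughly 79% to almost 83%.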
The goal of this analysis was to determine indicators of Superhost listings in Madrid. The real challenge was wrangling the data into a form that allows association rules mining. Rules were investigated in two cases: when the host is a Superhost and when the host is not. The more general association rules were also examined briefly, with no game changer being discovered. Judging by confidence, a given set of 4 amenities (including AC) indicates a 51% chance of the host being a Superhost. On the other hand, the presence of a lock on the bedroom door gives an almost 83% chance of the host not being a Superhost, probably because those listings are highly likely to offer only a room rather than the whole apartment.