Introduction

Association rules are “if-then” statements that describe relationships between items in a dataset. They are typically used for market basket analysis; for example, if a customer buys butter, they are likely to buy bread as well. Here I’d like to use them in a less usual way: to find indicators of Airbnb Superhost listings in Madrid, inspired by a Kaggle notebook¹ that inspects similar relations in Amsterdam.

Data preprocessing

The whole dataset² consists of several .csv files, but only the listing attributes are needed for this analysis.

Airbnb listings

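# Packages used throughout the analysis
library(data.table)  # data.table() and fast row filtering
library(dplyr)       # %>% and column helpers
library(arules)      # read.transactions(), apriori(), inspect()
library(arulesViz)   # plot() methods for rule sets

listings_detailed <- read.csv("Rules mining/listings_detailed.csv")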

# Keep only the listing id and its raw amenities string
sub_transactions <- listings_detailed[, c("id", "amenities")]

# Strip the list punctuation (square brackets and double quotes) as literal characters
for (ch in c("[", "]", '"')) {
  sub_transactions$amenities <- gsub(ch, "", sub_transactions$amenities, fixed = TRUE)
}

Dataframe building

# Split each amenities string into one element per amenity
amenity_list <- strsplit(sub_transactions$amenities, ", ", fixed = TRUE)
# Listings without any amenity get a placeholder item
amenity_list[lengths(amenity_list) == 0] <- list("Nothing")
# Long format: one row per (listing, amenity) pair
newdata <- data.table(
  Transaction = rep(sub_transactions$id, lengths(amenity_list)),
  Item = unlist(amenity_list)
)

Removing low-frequency amenities

items_df <- as.data.frame(table(newdata$Item))
most_frequent_items <- items_df[items_df$Freq > 10, 'Var1']
sub_transactions <- newdata[newdata$Item %in% most_frequent_items]

Adding Superhost variable

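# Superhost flag for every listing that still has at least one item
superhost <- listings_detailed[listings_detailed$id %in% unique(sub_transactions$Transaction), c('id', 'host_is_superhost')]

# The flag is stored as "t"/"f" strings (see the rule labels below);
# missing or empty values are treated as "f"
missing_flag <- is.na(superhost$host_is_superhost) | superhost$host_is_superhost == ""
superhost$host_is_superhost[missing_flag] <- "f"

# One extra item per listing: "Superhost = t" or "Superhost = f"
superhost <- 
  superhost %>%
  transmute(Transaction = id,
            Item = paste('Superhost =', host_is_superhost))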

sub_transactions <- rbind(sub_transactions, superhost[, c('Transaction', 'Item')])

Building transaction object

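# Save the long table, then read it back in the "single" format
# (one row per transaction id / item pair)
write.csv(sub_transactions, file = 'amenities.csv', row.names = FALSE)

trans <- read.transactions('amenities.csv', format = 'single', cols = 1:2, sep = ',', header = TRUE)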

The input data frame consists of 74 columns, but this project needs only three of them: id, amenities and host_is_superhost. Amenities are stored in a single column as a bracketed, quoted list, so some initial cleaning is required to strip the surplus characters and split the amenities on the ", " separator. The next step is to reshape the data frame so that every row contains only a listing id and one amenity. After that, low-frequency amenities (those appearing in 10 listings or fewer) are removed to limit the number of items and of possible rules. Finally, a Superhost item is added for every listing (missing values are treated as not being a Superhost), and the result is saved to a csv file and read back as a transaction object.
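The steps described above can be tried on a tiny example first. The sketch below uses base R only; the three listings, their ids and the frequency threshold of 1 are made up for illustration (the real analysis uses the Madrid listings and a threshold of 10).

```r
# Toy stand-in for the listings table (hypothetical ids and amenities)
toy <- data.frame(
  id = c(101, 102, 103),
  amenities = c('["Wifi", "Kitchen"]', '["Wifi"]', '[]'),
  stringsAsFactors = FALSE
)

# Remove square brackets and double quotes as literal characters
cleaned <- toy$amenities
for (ch in c("[", "]", '"')) {
  cleaned <- gsub(ch, "", cleaned, fixed = TRUE)
}

# Split on ", " and give listings without amenities a placeholder item
items <- strsplit(cleaned, ", ", fixed = TRUE)
items[lengths(items) == 0] <- list("Nothing")

# Long format: one row per (listing, amenity) pair
toy_long <- data.frame(
  Transaction = rep(toy$id, lengths(items)),
  Item = unlist(items),
  stringsAsFactors = FALSE
)

# Drop rare items (toy threshold: keep items seen more than once)
item_counts <- table(toy_long$Item)
frequent <- names(item_counts)[item_counts > 1]
toy_frequent <- toy_long[toy_long$Item %in% frequent, ]

print(toy_long)
print(toy_frequent)
```

On real data the same filter also removes the "Nothing" placeholder whenever it is rare, so listings without any frequent amenity drop out of the transaction set altogether.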

Association Rules

Item Frequency plot

itemFrequencyPlot(trans, topN=15, type = 'relative')

Nothing really surprising here: most listings have Wifi, Essentials or Kitchen. What is more interesting is that Superhost = f is also a very frequent item, which means that Superhost listings are relatively rare and there will not be many listings from which the algorithm could generate Superhost rules. For that reason the analysis is extended: rules are inspected for both outcomes, Superhost = f as well as Superhost = t, so that more dependencies can be discovered.

Rules sets

rules <- apriori(trans, parameter = list(support = 0.75,
                                         conf = 0.8,
                                         minlen = 3))

superhost_t <- apriori(trans, parameter = list(support = 0.017,
                                         conf = 0.5,
                                         maxlen = 5),
                 appearance = list(rhs = 'Superhost = t'))

superhost_f <- apriori(trans, parameter = list(support = 0.09,
                                         conf = 0.8,
                                         maxlen = 5),
                 appearance = list(rhs = 'Superhost = f'))

General rules

Overview

inspect(sort(rules, by = 'confidence', decreasing = T))

##     lhs                      rhs          support   confidence coverage  lift     count
## [1] {Essentials, Heating} => {Wifi}       0.7716893 0.9585286  0.8050770 1.018326 15139
## [2] {Essentials, Kitchen} => {Wifi}       0.7804567 0.9514666  0.8202671 1.010824 15311
## [3] {Heating, Wifi}       => {Essentials} 0.7716893 0.9330663  0.8270466 1.023706 15139
## [4] {Kitchen, Wifi}       => {Essentials} 0.7804567 0.9318929  0.8374962 1.022419 15311
## [5] {Essentials, Wifi}    => {Kitchen}    0.7804567 0.9020266  0.8652258 1.021353 15311
## [6] {Essentials, Wifi}    => {Heating}    0.7716893 0.8918935  0.8652258 1.030518 15139

Parallel coordinates plot

plot(rules, method="paracoord", control=list(reorder=TRUE))

Graph

plot(rules, method="graph")

To limit the number of rules, support and confidence were set very high, at 0.75 and 0.8 respectively. Quick recap: 1. Confidence - of the observations that contain the LHS items, the share that also contain the RHS item. 2. Support - the share of all observations in which the LHS and RHS items appear together. The best rule by confidence is that whenever a listing has Heating and Essentials, it also has Wifi in 95.9% of cases. All three items are present together in 77.2% of observations.
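The measures in the recap are linked by simple arithmetic, which can be checked against the printed output. The sketch below recomputes them for rule [1] using only numbers copied from the table; the variable names are mine, and the total number of listings is derived from count / support rather than stated anywhere in the output.

```r
# Values copied from rule [1]: {Essentials, Heating} => {Wifi}
support  <- 0.7716893  # share of listings with Essentials, Heating and Wifi together
coverage <- 0.8050770  # share of listings with Essentials and Heating (the LHS)
lift     <- 1.018326
count    <- 15139      # number of listings behind the support value

# Total number of transactions (derived, not printed in the output)
n_listings <- round(count / support)

# Confidence is support divided by coverage, i.e. P(RHS | LHS)
confidence <- support / coverage

# Lift is confidence divided by the overall share of the RHS item,
# so the share of listings with Wifi can be recovered from it
rhs_support <- confidence / lift

print(c(n_listings = n_listings,
        confidence = round(confidence, 4),
        rhs_support = round(rhs_support, 4)))
```

Since roughly 94% of all listings offer Wifi anyway, a confidence of 95.9% adds very little, which is what a lift of about 1.02 expresses.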

Is Superhost

Overview

inspect(sort(superhost_t, by = 'confidence', decreasing = T))

##     lhs                         rhs                support confidence   coverage     lift count
## [1] {Air conditioning,                                                                         
##      Carbon monoxide alarm,                                                                    
##      Hair dryer,                                                                               
##      Room-darkening shades}  => {Superhost = t} 0.01707615  0.5106707 0.03343868 2.791401   335
## [2] {Air conditioning,                                                                         
##      Carbon monoxide alarm,                                                                    
##      Room-darkening shades}  => {Superhost = t} 0.01707615  0.5083460 0.03359160 2.778694   335
## [3] {Carbon monoxide alarm,                                                                    
##      Coffee maker,                                                                             
##      Iron,                                                                                     
##      Room-darkening shades}  => {Superhost = t} 0.01763686  0.5036390 0.03501886 2.752965   346
## [4] {Carbon monoxide alarm,                                                                    
##      Iron,                                                                                     
##      Room-darkening shades,                                                                    
##      Smoke alarm}            => {Superhost = t} 0.01789173  0.5035868 0.03552860 2.752679   351
## [5] {Carbon monoxide alarm,                                                                    
##      Coffee maker,                                                                             
##      Room-darkening shades,                                                                    
##      Smoke alarm}            => {Superhost = t} 0.01768784  0.5028986 0.03517178 2.748917   347
## [6] {Carbon monoxide alarm,                                                                    
##      Coffee maker,                                                                             
##      Hair dryer,                                                                               
##      Room-darkening shades}  => {Superhost = t} 0.01789173  0.5021459 0.03563054 2.744803   351
## [7] {Carbon monoxide alarm,                                                                    
##      Hair dryer,                                                                               
##      Iron,                                                                                     
##      Room-darkening shades}  => {Superhost = t} 0.01809563  0.5007052 0.03614028 2.736928   355
## [8] {Carbon monoxide alarm,                                                                    
##      Hair dryer,                                                                               
##      Room-darkening shades,                                                                    
##      Smoke alarm}            => {Superhost = t} 0.01814660  0.5000000 0.03629320 2.733073   356

Parallel coordinates plot

plot(superhost_t, method="paracoord", control=list(reorder=TRUE))

Graph

plot(superhost_t, method="graph")

Here confidence = 0.5 and support = 0.017. The rules are more complex: seven of the 8 generated rules have length 5 (four amenities on the LHS plus the Superhost item), and rule [2] has length 4. The two best rules (judging by confidence) say that when Air conditioning, Carbon monoxide alarm, Hair dryer and Room-darkening shades are all present, the host of that flat is a Superhost in 51.1% of cases. If the Hair dryer is excluded from this set, the chance decreases to 50.8%. Is there a possible explanation? Given that the host provides AC, it is reasonable to expect a Hair dryer as well, and its absence reduces (by a slight margin) the chances of being a Superhost. On the other hand, this combination is present in only 1.7% of all listings, that is 335 cases. It is worth noting that Room-darkening shades (and likewise the Carbon monoxide alarm) appear in all 8 rules. It is quite surprising that every rule contains Room-darkening shades but not AC, which is more of a marker of a high-class flat than the shades are.
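To put the two best rules in perspective, their printed metrics can be turned back into listing counts and compared with the overall Superhost share. This is a rough sketch: n_listings is an assumption derived from count / support (335 / 0.01707615), and the column names are hypothetical.

```r
# Assumption: total number of transactions, derived from count / support
n_listings <- 19618

# Metrics copied from rules [1] and [2] of the Superhost = t output
rules_t <- data.frame(
  rule       = c("[1] with Hair dryer", "[2] without Hair dryer"),
  count      = c(335, 335),
  coverage   = c(0.03343868, 0.03359160),
  confidence = c(0.5106707, 0.5083460),
  lift       = c(2.791401, 2.778694)
)

# Listings that match the LHS of each rule
rules_t$lhs_listings <- round(rules_t$coverage * n_listings)

# Overall share of "Superhost = t" listings implied by confidence / lift
rules_t$baseline_rate <- rules_t$confidence / rules_t$lift

print(rules_t)
```

Under these assumptions both rules rest on the same 335 Superhost listings, and dropping the Hair dryer adds only three more matching listings, so the gap between 51.1% and 50.8% should be read with caution. The implied baseline of about 18% Superhosts also shows why a confidence of roughly 50% corresponds to a lift near 2.8.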

Is not Superhost

Overview

inspect(sort(superhost_f, by = 'confidence', decreasing = T))

##     lhs                        rhs                support confidence  coverage     lift count
## [1] {Lock on bedroom door}  => {Superhost = f} 0.10133551  0.8297162 0.1221327 1.050018  1988
## [2] {Lock on bedroom door,                                                                   
##      Wifi}                  => {Superhost = f} 0.09802222  0.8267412 0.1185646 1.046253  1923
## [3] {Essentials,                                                                             
##      Lock on bedroom door}  => {Superhost = f} 0.09501478  0.8233216 0.1154042 1.041925  1864
## [4] {Essentials,                                                                             
##      Lock on bedroom door,                                                                   
##      Wifi}                  => {Superhost = f} 0.09226221  0.8204896 0.1124478 1.038341  1810

Parallel coordinates plot

plot(superhost_f, method="paracoord", control=list(reorder=TRUE))

Graph

plot(superhost_f, method="graph")

Here confidence = 0.8 and support = 0.09. Only 4 rules were generated, and the best one (by both confidence and support) says that whenever there is a Lock on bedroom door, the host is not a Superhost in 83.0% of cases. The LHS (Lock) and RHS (not a Superhost) appear together in 10% of all listings, that is 1,988 observations. A Lock on bedroom door possibly indicates that only a single room is for rent, not the whole place, so the chances of being a Superhost are smaller, as Superhosts usually rent out the whole apartment. At the same time, Superhost = f does not mean anything negative: it does not indicate that the host is worse than average (in contrast to Superhost = t, which does mean that the host is better than average).
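How much the lock actually shifts the odds can be estimated from the printed confidence and lift values alone. The sketch below is illustrative only: the data frame and its column names are hypothetical, and the baseline is inferred as confidence / lift rather than read from the data.

```r
# Metrics copied from the four Superhost = f rules
rules_f <- data.frame(
  lhs = c("Lock on bedroom door",
          "Lock on bedroom door, Wifi",
          "Essentials, Lock on bedroom door",
          "Essentials, Lock on bedroom door, Wifi"),
  confidence = c(0.8297162, 0.8267412, 0.8233216, 0.8204896),
  lift       = c(1.050018, 1.046253, 1.041925, 1.038341)
)

# Overall share of "Superhost = f" listings implied by confidence / lift
rules_f$baseline_rate <- rules_f$confidence / rules_f$lift

# Increase over that baseline, in percentage points
rules_f$gain_pp <- 100 * (rules_f$confidence - rules_f$baseline_rate)

print(rules_f)
```

With about 79% of listings labelled Superhost = f to begin with, the lock raises the share by roughly four percentage points, which matches the modest lift of 1.05.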

Summary

The goal of this analysis was to determine indicators of Superhost listings in Madrid. The real challenge was to wrangle the data into a form that allows Association rules mining. Rules were investigated in two cases: when the host is a Superhost and when the host is not. The more general Association rules were also reviewed briefly, with no game changer discovered. Judging by confidence, a particular set of four amenities (including AC) indicates a 51.1% chance of the host being a Superhost. On the other hand, the presence of a lock on the bedroom door gives an 83.0% chance of the host not being a Superhost, which is probably because those listings are highly likely to offer only a room, not the whole apartment.

Superhost determinants - based on AirBnB listings in Madrid

Artur Nowak

26/02/2022
