Introduction

Airbnb, a San Francisco-based company, is one of the most popular online marketplaces for accommodation. By connecting local hosts with travelers, the company’s innovative platform created an entirely new supply of rental real estate, and anyone can now easily access the accommodation market. Since its founding in 2008, Airbnb has expanded to more than 34,000 cities in 191 countries, serving more than 60 million users.

A few months ago, Airbnb hosted a machine learning challenge to develop a model that identifies patterns among its customers. Specifically, given data on the behavior of roughly 210,000 users, participants were asked to predict the country in which a new user will make his or her first booking. A successful model of first-time users’ booking behavior would let Airbnb better personalize its marketing and increase its booking rates.


Data Explanation

Here is the link to the original dataset: https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings/data

I downloaded 4 files. The first, train_users_2.csv, contains 213,451 observations with a variety of information such as users’ age, gender, language, date of first booking, signup method, and device type. The second dataset, countries.csv, is a table of characteristics of each destination country; the destination country is the outcome variable that I am trying to predict. The third dataset, sessions.csv, is a detailed record of each user’s online activity on the Airbnb website. Finally, age_gender_bkts.csv shows user demographics and destinations for each age-gender bucket.


Data Preparation

First, I turned off scientific notation in R to be able to read in dates and IDs.

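library(readr)      # read_csv()
library(dplyr)      # group_by(), summarize(), %>%
library(lubridate)  # ymd_hms()

options(scipen = 999)
countries <- read.csv("countries.csv")
demographics <- read.csv("age_gender_bkts.csv")
sessions <- read_csv("sessions.csv")
airbnb <- read_csv("train_users_2.csv")
dim(airbnb)
## [1] 213451     16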

As you can see, the airbnb dataset alone is quite large. Here’s a list of what the variables actually mean.

Before building the actual model, I had to clean the data to make it more coherent and remove erroneous input. Running the summary function on the airbnb dataset showed that most of the categorical variables were stored as character strings. Therefore, I went through all 3 datasets and converted each variable to the appropriate type.

##                      id    date_account_created  timestamp_first_active 
##             "character"                  "Date"               "numeric" 
##      date_first_booking                  gender                     age 
##                  "Date"             "character"               "numeric" 
##           signup_method             signup_flow                language 
##             "character"               "integer"             "character" 
##       affiliate_channel      affiliate_provider first_affiliate_tracked 
##             "character"             "character"             "character" 
##              signup_app       first_device_type           first_browser 
##             "character"             "character"             "character" 
##     country_destination 
##             "character"
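The type conversion itself is not shown in this post. A minimal base R sketch of one way it could be done follows; the toy data frame `users_toy` and the choice of which columns to convert are assumptions for illustration, not the exact steps I ran.

```r
# Toy stand-in for the airbnb data frame (values are made up for illustration)
users_toy <- data.frame(
  id = c("a1", "b2", "c3"),
  gender = c("MALE", "FEMALE", "-unknown-"),
  signup_method = c("basic", "facebook", "basic"),
  age = c(34, NA, 27),
  stringsAsFactors = FALSE
)

# Columns to treat as categorical; id stays character because it is a key
categorical_cols <- c("gender", "signup_method")
users_toy[categorical_cols] <- lapply(users_toy[categorical_cols], factor)

# Inspect the resulting column classes
sapply(users_toy, class)
```

Keeping id as a character column avoids creating a factor with one level per user.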

I also noticed a problem with the age values. Running summary on airbnb$age showed a maximum age of 2014, which looks like a year rather than an age. To clean the data, I had to convert these numbers into either a real age or NA. First, assuming that the youngest user is 12, we can conjecture that a value between 1906 and 1994 is a birth year. Values such as 2014 are most likely faulty input, so I converted values above 1994 to NA. The oldest person in the world is currently 116, so I assumed that values between 110 and 1906 are also faulty input. The remaining year values are most likely birth years, so I subtracted them from 2016 to derive the users’ real ages.

airbnb$age[airbnb$age > 1994] <- NA
airbnb$age[airbnb$age < 1906 & airbnb$age > 110] <- NA

#convert year values to age values
#bounds are inclusive so that 1906 and 1994 themselves are converted too
convertyear <- function(age){
  if (is.na(age)) {
    return (age)
  }
  else if (age >= 1906 && age <= 1994) {
    trueage <- 2016 - age
    return (trueage)
  }
  else {
    return (age)
  }
}

trueagetemp <- sapply(airbnb$age, convertyear)
airbnb$age <- trueagetemp

#Let's assume that age values less than 12 are faulty input 
airbnb$age[airbnb$age < 12] <- NA

#gender: convert unknown to NA so that R can automatically process it. 
airbnb$gender[airbnb$gender == "-unknown-"] <- NA

#change timestamp to date, time
airbnb$timestamp_first_active <- ymd_hms(airbnb$timestamp_first_active)

The other cleaning steps on the training dataset were converting ‘-unknown-’ values in the gender column to NA, so that R treats them as missing rather than as a separate category, and converting timestamp_first_active to a proper date-time format.
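The age rules above can also be written as a single vectorized function, which makes them easy to check on a handful of values. This is a sketch under the same thresholds as the text; the function name `clean_age`, the `reference_year` parameter, and treating 1906 and 1994 as inclusive bounds are assumptions made for illustration.

```r
# Vectorized version of the age rules; thresholds are the ones used in the text
clean_age <- function(age, reference_year = 2016) {
  # Values in the birth-year range are converted to an age
  is_year <- !is.na(age) & age >= 1906 & age <= 1994
  age[is_year] <- reference_year - age[is_year]
  # Anything still below 12 or above 110 is treated as faulty input
  age[!is.na(age) & (age < 12 | age > 110)] <- NA
  age
}

clean_age(c(25, 2014, 1980, 150, 5, NA, 1906))
# 25 NA 36 NA NA NA 110
```

Here 2014 and 150 fall outside both the age range and the birth-year range, so they become NA, while 1980 and 1906 are read as birth years.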


Feature Engineering

Another messy dataset was the ‘sessions’ data. First, the dataset had both “-unknown-” and NA values, so I converted NA to “-unknown-” in this case, because it makes sense that users sometimes use devices and perform actions that the website cannot recognize. Second, the dataset simply contained the activity flow of each user, so each user_id took up multiple rows. Since knowing every sequence and detail of web activity is not particularly insightful, I rearranged the dataset to show each user’s total number of actions, number of unique actions, number of unique action types, number of unique action details, most frequent action detail, most frequently used device, and total time spent on the site. This reorganized dataset is called ‘session_visitors’.
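The NA-to-“-unknown-” step is described above but its code is not shown. A minimal base R sketch might look like this; `sessions_toy` and the list of columns in `text_cols` are illustrative assumptions rather than the exact code I ran.

```r
# Toy stand-in for the sessions data (values are made up for illustration)
sessions_toy <- data.frame(
  user_id = c("u1", "u1", "u2"),
  action = c("search", NA, "show"),
  action_type = c("click", "-unknown-", NA),
  stringsAsFactors = FALSE
)

# Fold NA into the existing "-unknown-" category for each text column
text_cols <- c("action", "action_type")
for (col in text_cols) {
  missing <- is.na(sessions_toy[[col]])
  sessions_toy[[col]][missing] <- "-unknown-"
}

sessions_toy
```

Merging the two markers means later counts of unique values do not treat NA and “-unknown-” as two different categories.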

#replace NA in secs_elapsed with average time
sessions$secs_elapsed[is.na(sessions$secs_elapsed)] <- mean(sessions$secs_elapsed, na.rm = TRUE)
#average time for each action_detail
#(actiontime <- sessions %>% group_by(action_detail) %>% summarize(mean = mean(na.omit(secs_elapsed))))

#one row per user: activity counts, most frequent action detail and device, total time
session_visitors <- sessions %>%
  group_by(user_id) %>%
  summarize(totalActions = length(action),
            uniqueActions = length(unique(action)),
            uniqueAction_type = length(unique(action_type)),
            uniqueAction_detail = length(unique(action_detail)),
            freqAction_detail = names(which.max(table(action_detail))),
            device = names(which.max(table(device_type))),
            time_sec = sum(secs_elapsed))

Exploratory Analysis

I visualized the distribution of some variables to examine whether there is any apparent relationship between country_destination (the outcome variable) and the independent variables.

First, visualizing the age distribution of users revealed a very interesting fact about the data. The distribution of language showed that Airbnb users predominantly use English, so we can conjecture that language is unlikely to be a deciding factor in choosing a destination.



The distribution of country_destination shows that the most common outcome is NDF, which means that the user ended up not booking any accommodations. In fact, it seems like almost half of the outcomes are NDF.



Plotting the relationship between signup_app and country_destination shows that users on mobile devices are much more likely to end up with the NDF outcome than web users.
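The same comparison can be checked numerically with a cross-tabulation. The sketch below uses a made-up toy data frame, `bookings_toy`, so the numbers are not the real shares; it only shows how row-wise proportions by signup_app could be computed.

```r
# Toy data (made up) with the two columns being compared
bookings_toy <- data.frame(
  signup_app = c("Web", "Web", "Web", "iOS", "iOS", "Android"),
  country_destination = c("US", "NDF", "FR", "NDF", "NDF", "NDF"),
  stringsAsFactors = FALSE
)

# Counts per signup_app (rows) and destination (columns)
counts <- table(bookings_toy$signup_app, bookings_toy$country_destination)

# Convert each row to proportions so the apps are comparable
shares <- prop.table(counts, margin = 1)

# Share of NDF outcomes within each signup_app
round(shares[, "NDF"], 2)
```

Using `margin = 1` normalizes within each app, which matters because web users far outnumber mobile users.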



Decision Trees

To start off, I built an rpart decision tree on the original dataset without any feature engineering to set a benchmark. Here, id and date_first_booking are a heavy burden for R and interfere with the train function in the caret package. Therefore, I created a new dataset called airbnb.new, which contains exactly the same information as airbnb (after cleaning) minus id and date_first_booking. I used cross-validation to estimate the accuracy of my model.

library(caret)  # train(), trainControl()
library(rpart)  # decision tree used by method = "rpart"

airbnb.new <- airbnb
airbnb.new$id <- NULL
airbnb.new$date_first_booking <- NULL
#10-fold cross validation
TRAINCONTROL = trainControl(method = "cv", number = 10, verboseIter = TRUE)
airbnb.rpart.train <- train(country_destination ~ date_account_created + gender + age + signup_method + 
                        language + affiliate_channel + first_affiliate_tracked + signup_app +
                        first_device_type, data = airbnb.new, method = "rpart", 
                      trControl = TRAINCONTROL)

The accuracy is only around 53%, and what is more troubling is the fact that the model only predicts NDF and US as outcomes.

##    AU    CA    DE    ES    FR    GB    IT   NDF    NL    PT    US other 
##     0     0     0     0     0     0     0 63444     0     0 42642     0
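One way to put an accuracy figure like this in context is to compare it with the accuracy of always predicting the most common class. The sketch below uses a made-up outcome vector, `outcome_toy`, purely to show the calculation; it is not a result from the Airbnb data.

```r
# Toy outcome vector (made up for illustration)
outcome_toy <- c("NDF", "NDF", "NDF", "US", "US", "FR")

# Share of each class in the data
class_share <- prop.table(table(outcome_toy))

# Accuracy of a model that always predicts the most common class
baseline_accuracy <- max(class_share)
majority_class <- names(which.max(class_share))

majority_class
baseline_accuracy
```

A tree that only ever predicts the two largest classes, as above, should be judged against this kind of baseline rather than against zero.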

To increase the accuracy, I added some features to the original dataset. Not only did I add the columns from the ‘session_visitors’ dataset, but I also combined the ‘countries’ dataset. Each row of this new dataset, ‘airbnbcombined’, holds the session information of the corresponding user and the country information of the corresponding country_destination.
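The code for combining the datasets is not shown here. A minimal sketch of how such a join might look with base R's `merge` follows; the toy data frames, their values, and the assumption that the keys are `id`/`user_id` and `country_destination` are for illustration only.

```r
# Toy stand-ins for the three datasets (all values are made up)
train_toy <- data.frame(
  id = c("u1", "u2", "u3"),
  country_destination = c("US", "NDF", "FR"),
  stringsAsFactors = FALSE
)
visitors_toy <- data.frame(
  user_id = c("u1", "u3"),
  totalActions = c(12L, 4L),
  stringsAsFactors = FALSE
)
countries_toy <- data.frame(
  country_destination = c("US", "FR"),
  distance_km = c(0, 7000),
  stringsAsFactors = FALSE
)

# Left join session features onto users; users without sessions get NA
combined_toy <- merge(train_toy, visitors_toy,
                      by.x = "id", by.y = "user_id", all.x = TRUE)

# Left join country characteristics onto each user's destination
combined_toy <- merge(combined_toy, countries_toy,
                      by = "country_destination", all.x = TRUE)

combined_toy
```

With `all.x = TRUE` every user is kept. Columns joined on country_destination are determined by the outcome, so they are NA for destinations that have no row in the countries table (NDF in this toy data).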

Then, I built a decision tree again and used the bootstrap to calculate the mean and standard deviation of accuracy. Note: the train function in the caret package did not run once I added features, even after storing the data in a different dataset. Hence, I implemented the bootstrap manually.

library(rpart)

n <- nrow(airbnbcombined)
airbnbcombined$id <- NULL
airbnbcombined$date_first_booking <- NULL
airbnbcombined$date_account_created <- NULL
accuracy.list <- c()
for (i in 1:25) {
  #draw a bootstrap sample; rows never drawn form the test set
  index <- sample(1:n, replace = TRUE)
  boot <- airbnbcombined[index, ]
  test <- airbnbcombined[setdiff(1:n, index), ]

  bootrpart <- rpart(country_destination ~  gender + age + signup_method + affiliate_channel + 
                       signup_app + first_device_type + destination_language + distance_km + 
                       uniqueAction_type + uniqueAction_detail + time_sec + device + booking_time,
                     data = boot)
  prediction <- predict(bootrpart, newdata = test, type = "class")
  accuracy <- sum(test$country_destination == prediction) / length(prediction)
  accuracy.list <- c(accuracy.list, accuracy)
}

#mean and standard deviation of accuracy across the 25 bootstrap rounds
mean(accuracy.list)
sd(accuracy.list)

The model yielded a mean accuracy of 67% with a standard deviation of 0.11%.
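The train/test split used in each bootstrap round can be illustrated on its own. In this sketch, `n_toy` and the seed are arbitrary choices for illustration; the point is only that rows never drawn in the bootstrap sample form the test set and never overlap with the training rows.

```r
set.seed(1)  # arbitrary seed so the toy run is repeatable
n_toy <- 10

# Bootstrap row indices: sampled with replacement, so duplicates are allowed
index <- sample(1:n_toy, replace = TRUE)

# Rows that were never drawn are held out as the test set
oob <- setdiff(1:n_toy, index)

# Distinct training rows plus held-out rows always add up to n_toy
length(unique(index)) + length(oob)
```

Because the test rows are exactly those left out of the sample, each round evaluates the tree on data it was not trained on.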


Conclusion

Unfortunately, the dataset that Airbnb provided has many flaws that make it difficult to work with. Because so many of the outcomes are either NDF or US, I had trouble predicting bookings in non-US countries, although the model with feature engineering seems to do a much better job of it.

To better predict the outcome, we need more data on the web activity of individual users. The current ‘sessions’ dataset does not cover all users, but even with so many NA values, it seems to dramatically increase the accuracy of prediction.