The Airbnb dataset includes 2 major datasets on new users’ insights and app session whose details are as follows:

  1. users_train.csv
  1. sessions.csv - web sessions log for users

Load the dataset:

library(dplyr)
library(tidyr)
library(ggplot2)
library(ggcorrplot)
library(plyr)
library(psych)
library(caret)
library(tidyverse)
str(full)
## 'data.frame':    213451 obs. of  16 variables:
##  $ id                     : Factor w/ 213451 levels "00023iyk9l","0005ytdols",..: 100523 48039 26485 68504 48956 147281 129610 2144 59779 40826 ...
##  $ date_account_created   : Factor w/ 1634 levels "01/01/10","01/01/11",..: 884 715 1255 1528 1199 1 6 11 16 16 ...
##  $ timestamp_first_active : num  2.01e+13 2.01e+13 2.01e+13 2.01e+13 2.01e+13 ...
##  $ date_first_booking     : Factor w/ 1977 levels "","01/01/11",..: 1 1 1219 1406 274 7 24 69 1199 18 ...
##  $ gender                 : Factor w/ 3 levels "FEMALE","MALE",..: NA 2 1 1 NA NA 1 1 1 NA ...
##  $ age                    : int  NA 38 56 42 41 NA 46 47 50 46 ...
##  $ signup_method          : Factor w/ 3 levels "basic","facebook",..: 2 2 1 2 1 1 1 1 1 1 ...
##  $ signup_flow            : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ language               : Factor w/ 25 levels "ca","cs","da",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ affiliate_channel      : Factor w/ 8 levels "api","content",..: 3 8 3 3 3 4 4 3 4 4 ...
##  $ affiliate_provider     : Factor w/ 18 levels "baidu","bing",..: 5 9 5 5 5 13 3 5 3 3 ...
##  $ first_affiliate_tracked: Factor w/ 8 levels "","linked","local ops",..: 8 8 8 8 8 5 8 5 8 5 ...
##  $ signup_app             : Factor w/ 4 levels "Android","iOS",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ first_device_type      : Factor w/ 9 levels "Android Phone",..: 6 6 9 6 6 6 6 6 6 6 ...
##  $ first_browser          : Factor w/ 51 levels "Android Browser",..: 8 8 23 17 8 8 41 41 41 17 ...
##  $ country_destination    : Factor w/ 12 levels "AU","CA","DE",..: 8 8 12 10 12 12 12 12 12 12 ...
str(sessions)
## 'data.frame':    1048575 obs. of  6 variables:
##  $ id           : Factor w/ 15501 levels "","0035hobuyj",..: 5631 5631 5631 5631 5631 5631 5631 5631 5631 5631 ...
##  $ action       : Factor w/ 244 levels "","10","11","12",..: 123 198 123 198 123 198 123 155 104 123 ...
##  $ action_type  : Factor w/ 10 levels "","-unknown-",..: 1 5 1 5 1 5 1 6 10 1 ...
##  $ action_detail: Factor w/ 105 levels "","-unknown-",..: 1 99 1 99 1 99 1 101 99 1 ...
##  $ device_type  : Factor w/ 14 levels "-unknown-","Android App Unknown Phone/Tablet",..: 13 13 13 13 13 13 13 13 13 13 ...
##  $ secs_elapsed : int  319 67753 301 22141 435 7703 115 831 20842 683 ...

Create new binary variable “book” to see whether or not a user proceed to booking, based on country_destination (in which NDF means no booking):

full = mutate(full, book = ifelse(country_destination=="NDF", 0, 1))
head(full$book)
## [1] 0 0 1 1 1 1

Data preprocessing

Check the pattern of missing data:

missing_values = full %>% 
  summarize_all(funs(sum(is.na(.))/n()))

missing_values = gather(missing_values, key="feature", value="missing_pct")

missing_values %>% 
  ggplot(aes(x=reorder(feature,-missing_pct),y=missing_pct))+
  geom_bar(stat="identity",fill="tomato3")+
  coord_flip()+theme_bw()

The result shows that in the training set only 60% of the observations are complete with 45% missing in gender, 42% missing in age. Since the purpose of the analysis is to identify differences in user behavior among different age and gender groups, we can’t afford to remove age and gender or impute a categorical variable like gender. If a better dataset is not available, removing incomplete observations will work better.

newdata=na.omit(full) # no dummification
newdata=newdata %>%
  filter(age < 100 & age>18) %>%
  droplevels() # remove people over 100 years old and under 18 (Airbnb terms require users to be at leats 18 years of age)

Create a new variable “agegroup”, selecting age brackets based on a cluster analysis of 5 major groups of users.

newdata$agegroup[newdata$age<26]="Below 26"
newdata$agegroup[newdata$age<36 & newdata$age>25]="26 to 35"
newdata$agegroup[newdata$age<46 & newdata$age>35]="36 to 45"
newdata$agegroup[newdata$age<56 & newdata$age>45]="46 to 55"
newdata$agegroup[newdata$age>55]="Over 56"
newdata$agegroup=factor(newdata$agegroup, levels = c("Below 26", "26 to 35", "36 to 45", "46 to 55", "Over 56"))

Combine tables to join the sessions dataset, removing observations that are not included in the full dataset.

library(plyr)
join=join(newdata, sessions, by="id", type= "inner", match="all")
join=na.omit(join)
newsessions=select(join, 1:6)

Drop insignificant levels ofvariable action (there are over 200 levels)

join = join %>% group_by(join$action) %>%
  filter(n() >= 100) %>%
  droplevels()

Exploratory Data Analysis

It should be noted that the dataset is made up mostly of categorical variables, which makes it hard to create correlation plots, box plots and scatter plots. Some notable trends are shown as follows:

Demographics of users who proceeded to booking?

m = subset(newdata$gender, newdata$book==1)
barplot(prop.table(table(newdata$gender)), main = "Distribution of users who booked by gender", col="orange2")

Destination popularity

ggplot(newdata, aes(x=country_destination, fill=gender)) +
  geom_bar(position="dodge")+
  labs(title="Country destination broken down by gender")

  • Note: NDF is the people who don’t book homestays at all. Because the dataset is collected on US users and most users are female, it’s expected that America is the most popular. Other than that, France and Spain are the most preferred destinations, and females dominate males in almost all country classes.

Is there any difference in the age distribution of users who proceeded to booking and those who did not?

m = subset(newdata$age, newdata$book==1)
n = subset(newdata$age, newdata$book==0)
plot(density(m), col="red", main="Age distribution", xlab="Age")
lines(density(n))

Note: Users who were converted are younger than the other group. The golden age bracket is 25 to 35.

What are the most common affiliate channels and providers?

ggplot(newdata,
       aes(x=reorder(affiliate_channel, affiliate_channel,
                     function(x)-length(x)))) +
  geom_bar(fill = "orange2")+
  labs(title = "Affiliate channel") +
  xlab("Channel")

ggplot(newdata,
       aes(x=reorder(affiliate_provider, affiliate_provider,
                     function(x)-length(x)))) +
  geom_bar(fill="orange2")+
  labs(title = "Affiliate providers") +
  xlab("Provider")+
  theme(axis.text.x = element_text(angle = 35, hjust = 1))

What affiliate channels and providers have the highest conversion rate?

newdata$book=as.factor(newdata$book)
ggplot(newdata, aes(x=affiliate_channel, fill=book))+
  geom_bar(position="stack")+
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

Content seems to be the worst channel considering conversion rate. Other channels are almost similar in conversion power. A closer look with a proportion table:

prop.table(table(newdata$book, newdata$affiliate_channel), margin = 2)
##    
##           api   content    direct     other remarketing sem-brand
##   0 0.4976626 0.8202186 0.4433692 0.5254645   0.5478927 0.4691020
##   1 0.5023374 0.1797814 0.5566308 0.4745355   0.4521073 0.5308980
##    
##     sem-non-brand       seo
##   0     0.5019333 0.4484754
##   1     0.4980667 0.5515246

Note: Direct and SEO have the highest conversion rate, followed by SEM brand and API.

newdata$book=as.factor(newdata$book)
ggplot(newdata, aes(x=affiliate_provider, fill=book))+
  geom_bar(position="stack")+
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

Facebook doesn’t perform as well as Direct and Google as an affiliate provider. A closer look:

prop.table(table(newdata$book, newdata$affiliate_provider), margin = 2)
##    
##         baidu      bing craigslist    direct email-marketing  facebook
##   0 0.4615385 0.5231855  0.4769417 0.4435496       0.6206897 0.6068676
##   1 0.5384615 0.4768145  0.5230583 0.5564504       0.3793103 0.3931324
##    
##     facebook-open-graph    google       gsp    meetup     naver     other
##   0           0.7057292 0.4897601 0.8333333 0.7338710 0.6363636 0.4964447
##   1           0.2942708 0.5102399 0.1666667 0.2661290 0.3636364 0.5035553
##    
##     padmapper      vast      wayn     yahoo    yandex
##   0 0.5635910 0.5117845 1.0000000 0.5844749 0.7692308
##   1 0.4364090 0.4882155 0.0000000 0.4155251 0.2307692

Notable trends:

  • Direct sources bring in the highest conversion rate, which is expected because people accessing Airbnb directly often have stronger intention to book homestays than those visiting upon seeing ads or affiliate links.
  • Comparing search engines, Baidu is the best performer (54%), followed by Google and Bing. Yahoo was left behind with a conversion rate of 41%. Naver (Korean equivalent of Yahoo) and Yandex are the worst two.
  • Comparing listing sites, Craigslist is better than Padmapper.
  • Social networking sites don’t do very well in terms of conversion. Facebook, the best one, offers a rate of only 39%, while Meetup does 27% and Wayn 0%.

Do the most common affiliate channels differ across age groups?

ggplot(newdata, aes(x=agegroup, fill=affiliate_channel))+
  geom_bar(position="fill")+
  ylab("proportion")

Direct is the most popular channel for the two most active age groups, which is probably because these users already know about Airbnb and go straight to the site to browse and book homestays. Meanwhile, younger and older users have a higher chance of discovering Airbnb from content (i.e. reading travel articles/promotion emails) and SEM non-brand (i.e. typing travel related keywords on search engines).

What are the most common ‘first marketing’ with which users interact before signup?

ggplot(newdata,
       aes(x=reorder(first_affiliate_tracked, first_affiliate_tracked,
                     function(x)-length(x)))) +
  geom_bar(fill="orange2")+
  labs(title = "First affiliate tracked") +
  xlab("Device")

Airbnb doesn’t provide detailed information on each type of marketing so it’s hard to interpret this clearly, but we can see untracked dominates other types of marketing.

What are the most common signup method?

ggplot(newdata,
       aes(x=reorder(signup_method, signup_method,
                     function(x)-length(x)))) +
  geom_bar(fill="orange2")+
  labs(title = "Signup method") +
  xlab("Device")

Facebook is still the most popular with its quick and convenient access.

What are the most common browsers?

ggplot(newdata,
       aes(x=reorder(first_browser, first_browser,
                     function(x)-length(x)))) +
  geom_bar(fill="orange2")+
  theme(axis.text.x = element_text(angle = 55, hjust = 1))+
  labs(title="First browser", x = "Browser")

Given that Mac Desktop is the most prevalent device, at first glance the plot could seem weird that Chrome is the most popular browser. However, if we take both Safari and Mobile Safari into consideration, the total amount is almost equal to that of Chrome. Another point to consider is that many Mac users choose Chrome over Safari for multiple reasons.

A closer look

Users’ age distribution broken down by affiliate channels

ggplot(newdata, aes(x=affiliate_channel, y=age))+
  geom_boxplot()+
  coord_flip()

  labs(title="Affiliate channel by age")
## $title
## [1] "Affiliate channel by age"
## 
## attr(,"class")
## [1] "labels"

The plot above provides a better look on which channels are best for different age groups. Channels with a “taller” boxplot tend to be more effective with older users. For example, content, remarketing and SEM non-brand are well suited for middle-aged people.

Users’ age distribution broken down by affiliate providers

ggplot(newdata, aes(x=affiliate_provider, y=age))+
  geom_boxplot()+
  labs(title="Affiliate provider by age")+
  coord_flip()

Similarly, we could say that Bing, GSP, Vast and Yahoo are more effective with older users, while the likes of Baidu, Meetup and Yandex are more popular among young users.

Is there any clear difference in signup method among different age groups?

ggplot(newdata, aes(x=agegroup, fill=signup_method))+
  geom_bar(position="dodge")

The answer: Yes! Most young people sign up using Facebook, while older folks often go with traditional signup This also mean that it’s easier to collect data on young users (i.e. from their Facebook page) and promote new campaigns to this group of users on Facebook.

What are the most common affiliate channels for each gender?

ggplot(newdata, aes(x=gender, fill=affiliate_provider))+
  geom_bar(position="fill")+
  ylab("proportion")

What are the most common first marketing for each age group?

ggplot(newdata, aes(x=agegroup, fill=first_affiliate_tracked))+
  geom_bar(position="stack")

summary(join$secs_elapsed)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     453    1410   22938   10294 1799035

There are so many outliers that it’s hard to plot a good boxplot. Let’s transform the second data into minutes to reduce the sparseness of the data.

Time to the first booking by age

ggplot(join, aes(x=age, y=secs_elapsed))+
  geom_point()+
  labs(title = "Seconds elapsed by age")

It’s interesting that the most active user groups spend a longer time to proceed to their first booking, while most older users decide very quickly on their trip. One possible reason for this is that young users tend to check and compare the various options they have (i.e. Booking.com, Agoda, Traveloka, etc.) or wait until finding a good deal, while old users, having access to less options, make up their mind quicker.

What are the most common actions on app?

ggplot(join, aes(x=reorder(action, action,
                     function(x)-length(x)))) + 
  geom_bar(fill = "orange2")+
  labs(title = "Action on app") +
  xlab("Action")+
  coord_flip()+
  theme(text = element_text(size=13))