The Airbnb dataset includes 2 major datasets on new users’ insights and app session whose details are as follows:
Load the dataset:
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggcorrplot)
library(plyr)
library(psych)
library(caret)
library(tidyverse)
str(full)
## 'data.frame': 213451 obs. of 16 variables:
## $ id : Factor w/ 213451 levels "00023iyk9l","0005ytdols",..: 100523 48039 26485 68504 48956 147281 129610 2144 59779 40826 ...
## $ date_account_created : Factor w/ 1634 levels "01/01/10","01/01/11",..: 884 715 1255 1528 1199 1 6 11 16 16 ...
## $ timestamp_first_active : num 2.01e+13 2.01e+13 2.01e+13 2.01e+13 2.01e+13 ...
## $ date_first_booking : Factor w/ 1977 levels "","01/01/11",..: 1 1 1219 1406 274 7 24 69 1199 18 ...
## $ gender : Factor w/ 3 levels "FEMALE","MALE",..: NA 2 1 1 NA NA 1 1 1 NA ...
## $ age : int NA 38 56 42 41 NA 46 47 50 46 ...
## $ signup_method : Factor w/ 3 levels "basic","facebook",..: 2 2 1 2 1 1 1 1 1 1 ...
## $ signup_flow : int 0 0 3 0 0 0 0 0 0 0 ...
## $ language : Factor w/ 25 levels "ca","cs","da",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ affiliate_channel : Factor w/ 8 levels "api","content",..: 3 8 3 3 3 4 4 3 4 4 ...
## $ affiliate_provider : Factor w/ 18 levels "baidu","bing",..: 5 9 5 5 5 13 3 5 3 3 ...
## $ first_affiliate_tracked: Factor w/ 8 levels "","linked","local ops",..: 8 8 8 8 8 5 8 5 8 5 ...
## $ signup_app : Factor w/ 4 levels "Android","iOS",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ first_device_type : Factor w/ 9 levels "Android Phone",..: 6 6 9 6 6 6 6 6 6 6 ...
## $ first_browser : Factor w/ 51 levels "Android Browser",..: 8 8 23 17 8 8 41 41 41 17 ...
## $ country_destination : Factor w/ 12 levels "AU","CA","DE",..: 8 8 12 10 12 12 12 12 12 12 ...
str(sessions)
## 'data.frame': 1048575 obs. of 6 variables:
## $ id : Factor w/ 15501 levels "","0035hobuyj",..: 5631 5631 5631 5631 5631 5631 5631 5631 5631 5631 ...
## $ action : Factor w/ 244 levels "","10","11","12",..: 123 198 123 198 123 198 123 155 104 123 ...
## $ action_type : Factor w/ 10 levels "","-unknown-",..: 1 5 1 5 1 5 1 6 10 1 ...
## $ action_detail: Factor w/ 105 levels "","-unknown-",..: 1 99 1 99 1 99 1 101 99 1 ...
## $ device_type : Factor w/ 14 levels "-unknown-","Android App Unknown Phone/Tablet",..: 13 13 13 13 13 13 13 13 13 13 ...
## $ secs_elapsed : int 319 67753 301 22141 435 7703 115 831 20842 683 ...
Create new binary variable “book” to see whether or not a user proceed to booking, based on country_destination (in which NDF means no booking):
full = mutate(full, book = ifelse(country_destination=="NDF", 0, 1))
head(full$book)
## [1] 0 0 1 1 1 1
Check the pattern of missing data:
missing_values = full %>%
summarize_all(funs(sum(is.na(.))/n()))
missing_values = gather(missing_values, key="feature", value="missing_pct")
missing_values %>%
ggplot(aes(x=reorder(feature,-missing_pct),y=missing_pct))+
geom_bar(stat="identity",fill="tomato3")+
coord_flip()+theme_bw()
The result shows that in the training set only 60% of the observations are complete with 45% missing in gender, 42% missing in age. Since the purpose of the analysis is to identify differences in user behavior among different age and gender groups, we can’t afford to remove age and gender or impute a categorical variable like gender. If a better dataset is not available, removing incomplete observations will work better.
newdata=na.omit(full) # no dummification
newdata=newdata %>%
filter(age < 100 & age>18) %>%
droplevels() # remove people over 100 years old and under 18 (Airbnb terms require users to be at leats 18 years of age)
Create a new variable “agegroup”, selecting age brackets based on a cluster analysis of 5 major groups of users.
newdata$agegroup[newdata$age<26]="Below 26"
newdata$agegroup[newdata$age<36 & newdata$age>25]="26 to 35"
newdata$agegroup[newdata$age<46 & newdata$age>35]="36 to 45"
newdata$agegroup[newdata$age<56 & newdata$age>45]="46 to 55"
newdata$agegroup[newdata$age>55]="Over 56"
newdata$agegroup=factor(newdata$agegroup, levels = c("Below 26", "26 to 35", "36 to 45", "46 to 55", "Over 56"))
Combine tables to join the sessions dataset, removing observations that are not included in the full dataset.
library(plyr)
join=join(newdata, sessions, by="id", type= "inner", match="all")
join=na.omit(join)
newsessions=select(join, 1:6)
Drop insignificant levels ofvariable action (there are over 200 levels)
join = join %>% group_by(join$action) %>%
filter(n() >= 100) %>%
droplevels()
It should be noted that the dataset is made up mostly of categorical variables, which makes it hard to create correlation plots, box plots and scatter plots. Some notable trends are shown as follows:
m = subset(newdata$gender, newdata$book==1)
barplot(prop.table(table(newdata$gender)), main = "Distribution of users who booked by gender", col="orange2")
ggplot(newdata, aes(x=country_destination, fill=gender)) +
geom_bar(position="dodge")+
labs(title="Country destination broken down by gender")
m = subset(newdata$age, newdata$book==1)
n = subset(newdata$age, newdata$book==0)
plot(density(m), col="red", main="Age distribution", xlab="Age")
lines(density(n))
Note: Users who were converted are younger than the other group. The golden age bracket is 25 to 35.
ggplot(newdata,
aes(x=reorder(first_device_type, first_device_type,
function(x)-length(x)))) +
geom_bar(fill = "orange2")+
labs(title = "First device type frequency") +
xlab("Device")+
theme(axis.text.x = element_text(angle = 35, hjust = 1))
Desktop is still the most common device used to check Airbnb, followed by iPad and iPhone. This means that Airbnb should prioritize these platforms when it comes to A/B testing, feature launches and app optimization.
ggplot(newdata,
aes(x=reorder(affiliate_channel, affiliate_channel,
function(x)-length(x)))) +
geom_bar(fill = "orange2")+
labs(title = "Affiliate channel") +
xlab("Channel")
ggplot(newdata,
aes(x=reorder(affiliate_provider, affiliate_provider,
function(x)-length(x)))) +
geom_bar(fill="orange2")+
labs(title = "Affiliate providers") +
xlab("Provider")+
theme(axis.text.x = element_text(angle = 35, hjust = 1))
newdata$book=as.factor(newdata$book)
ggplot(newdata, aes(x=affiliate_channel, fill=book))+
geom_bar(position="stack")+
theme(axis.text.x = element_text(angle = 30, hjust = 1))
Content seems to be the worst channel considering conversion rate. Other channels are almost similar in conversion power. A closer look with a proportion table:
prop.table(table(newdata$book, newdata$affiliate_channel), margin = 2)
##
## api content direct other remarketing sem-brand
## 0 0.4976626 0.8202186 0.4433692 0.5254645 0.5478927 0.4691020
## 1 0.5023374 0.1797814 0.5566308 0.4745355 0.4521073 0.5308980
##
## sem-non-brand seo
## 0 0.5019333 0.4484754
## 1 0.4980667 0.5515246
Note: Direct and SEO have the highest conversion rate, followed by SEM brand and API.
newdata$book=as.factor(newdata$book)
ggplot(newdata, aes(x=affiliate_provider, fill=book))+
geom_bar(position="stack")+
theme(axis.text.x = element_text(angle = 30, hjust = 1))
Facebook doesn’t perform as well as Direct and Google as an affiliate provider. A closer look:
prop.table(table(newdata$book, newdata$affiliate_provider), margin = 2)
##
## baidu bing craigslist direct email-marketing facebook
## 0 0.4615385 0.5231855 0.4769417 0.4435496 0.6206897 0.6068676
## 1 0.5384615 0.4768145 0.5230583 0.5564504 0.3793103 0.3931324
##
## facebook-open-graph google gsp meetup naver other
## 0 0.7057292 0.4897601 0.8333333 0.7338710 0.6363636 0.4964447
## 1 0.2942708 0.5102399 0.1666667 0.2661290 0.3636364 0.5035553
##
## padmapper vast wayn yahoo yandex
## 0 0.5635910 0.5117845 1.0000000 0.5844749 0.7692308
## 1 0.4364090 0.4882155 0.0000000 0.4155251 0.2307692
Notable trends:
ggplot(newdata, aes(x=agegroup, fill=affiliate_channel))+
geom_bar(position="fill")+
ylab("proportion")
Direct is the most popular channel for the two most active age groups, which is probably because these users already know about Airbnb and go straight to the site to browse and book homestays. Meanwhile, younger and older users have a higher chance of discovering Airbnb from content (i.e. reading travel articles/promotion emails) and SEM non-brand (i.e. typing travel related keywords on search engines).
ggplot(newdata,
aes(x=reorder(first_affiliate_tracked, first_affiliate_tracked,
function(x)-length(x)))) +
geom_bar(fill="orange2")+
labs(title = "First affiliate tracked") +
xlab("Device")
Airbnb doesn’t provide detailed information on each type of marketing so it’s hard to interpret this clearly, but we can see untracked dominates other types of marketing.
ggplot(newdata,
aes(x=reorder(signup_method, signup_method,
function(x)-length(x)))) +
geom_bar(fill="orange2")+
labs(title = "Signup method") +
xlab("Device")
Facebook is still the most popular with its quick and convenient access.
ggplot(newdata,
aes(x=reorder(first_browser, first_browser,
function(x)-length(x)))) +
geom_bar(fill="orange2")+
theme(axis.text.x = element_text(angle = 55, hjust = 1))+
labs(title="First browser", x = "Browser")
Given that Mac Desktop is the most prevalent device, at first glance the plot could seem weird that Chrome is the most popular browser. However, if we take both Safari and Mobile Safari into consideration, the total amount is almost equal to that of Chrome. Another point to consider is that many Mac users choose Chrome over Safari for multiple reasons.
ggplot(newdata, aes(x=affiliate_channel, y=age))+
geom_boxplot()+
coord_flip()
labs(title="Affiliate channel by age")
## $title
## [1] "Affiliate channel by age"
##
## attr(,"class")
## [1] "labels"
The plot above provides a better look on which channels are best for different age groups. Channels with a “taller” boxplot tend to be more effective with older users. For example, content, remarketing and SEM non-brand are well suited for middle-aged people.
ggplot(newdata, aes(x=affiliate_provider, y=age))+
geom_boxplot()+
labs(title="Affiliate provider by age")+
coord_flip()
Similarly, we could say that Bing, GSP, Vast and Yahoo are more effective with older users, while the likes of Baidu, Meetup and Yandex are more popular among young users.
ggplot(newdata, aes(x=agegroup, fill=signup_method))+
geom_bar(position="dodge")
The answer: Yes! Most young people sign up using Facebook, while older folks often go with traditional signup This also mean that it’s easier to collect data on young users (i.e. from their Facebook page) and promote new campaigns to this group of users on Facebook.
ggplot(newdata, aes(x=gender, fill=affiliate_provider))+
geom_bar(position="fill")+
ylab("proportion")
ggplot(newdata, aes(x=agegroup, fill=first_affiliate_tracked))+
geom_bar(position="stack")
summary(join$secs_elapsed)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 453 1410 22938 10294 1799035
There are so many outliers that it’s hard to plot a good boxplot. Let’s transform the second data into minutes to reduce the sparseness of the data.
ggplot(join, aes(x=age, y=secs_elapsed))+
geom_point()+
labs(title = "Seconds elapsed by age")
It’s interesting that the most active user groups spend a longer time to proceed to their first booking, while most older users decide very quickly on their trip. One possible reason for this is that young users tend to check and compare the various options they have (i.e. Booking.com, Agoda, Traveloka, etc.) or wait until finding a good deal, while old users, having access to less options, make up their mind quicker.
ggplot(join, aes(x=reorder(action, action,
function(x)-length(x)))) +
geom_bar(fill = "orange2")+
labs(title = "Action on app") +
xlab("Action")+
coord_flip()+
theme(text = element_text(size=13))