Users on Airbnb can book a place to stay in 34,000+ cities across 190+ countries. The goal of this notebook is to analyse the data, identify problems and opportunities, and come up with insights that could increase the likelihood of bookings, decrease the average time to book, and hence lead to more revenue. This data was taken from Kaggle.

If you’re interested in similar projects, have a look at my other blog posts here.

Import the libraries.

library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(gridExtra)
library(anytime)
library(zoo)

Import the dataset.

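# stringsAsFactors = TRUE keeps the text columns as factors (the default before R 4.0), so summary() shows level counts
train_users_2 <- read.csv(file = "train_users_2.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
train_users_2 <- as_tibble(train_users_2)  # tbl_df() is deprecated in favour of as_tibble()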
head(train_users_2)

Pre-Processing

Convert to date-time format using lubridate.

train_users_2$date_account_created <- ymd(train_users_2$date_account_created)
train_users_2$date_first_booking <- ymd(train_users_2$date_first_booking)
train_users_2$timestamp_first_active  <- ymd_hms(train_users_2$timestamp_first_active)
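
Here is a quick sanity check of what these parsers do, run on made-up sample values rather than rows from the dataset. I'm assuming that the raw timestamp is a 14-digit yyyymmddHHMMSS value and that a missing booking date shows up as an empty string. Both are assumptions about the CSV, so check them against your copy of the file.

```r
library(lubridate)

# Made-up sample values, assumed to mimic the raw CSV columns
sample_date <- "2010-06-28"
sample_timestamp <- "20090319043255"   # assumed yyyymmddHHMMSS layout

parsed_date <- ymd(sample_date)            # returns a Date
parsed_time <- ymd_hms(sample_timestamp)   # returns a POSIXct in UTC

# An empty string (assumed to mean "no booking yet") cannot be parsed and becomes NA
parsed_empty <- suppressWarnings(ymd(""))

print(parsed_date)
print(parsed_time)
print(is.na(parsed_empty))
```

This is also why date_first_booking ends up with so many NA values: users without a booking have nothing to parse.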

Replace -unknown- gender by NA.

train_users_2$gender[as.character(train_users_2$gender) == "-unknown-"] <- NA
# sum(is.na(train_users_2$gender))

Summary of the data frame.

summary(train_users_2)
##           id         date_account_created timestamp_first_active       
##  00023iyk9l:     1   Min.   :2010-01-01   Min.   :2009-03-19 04:32:55  
##  0005ytdols:     1   1st Qu.:2012-12-26   1st Qu.:2012-12-25 07:33:27  
##  000guo2307:     1   Median :2013-09-11   Median :2013-09-11 06:13:08  
##  000wc9mlv3:     1   Mean   :2013-06-25   Mean   :2013-06-25 16:15:47  
##  0012yo8hu2:     1   3rd Qu.:2014-03-06   3rd Qu.:2014-03-06 08:25:14  
##  001357912w:     1   Max.   :2014-06-30   Max.   :2014-06-30 23:58:24  
##  (Other)   :213445                                                     
##  date_first_booking         gender           age         
##  Min.   :2010-01-02   FEMALE   :63041   Min.   :   1.00  
##  1st Qu.:2012-12-02   MALE     :54440   1st Qu.:  28.00  
##  Median :2013-09-11   OTHER    :  282   Median :  34.00  
##  Mean   :2013-07-04   -unknown-:    0   Mean   :  49.67  
##  3rd Qu.:2014-04-04   NA's     :95688   3rd Qu.:  43.00  
##  Max.   :2015-06-29                     Max.   :2014.00  
##  NA's   :124543                         NA's   :87990    
##   signup_method     signup_flow        language     
##  basic   :152897   Min.   : 0.000   en     :206314  
##  facebook: 60008   1st Qu.: 0.000   zh     :  1632  
##  google  :   546   Median : 0.000   fr     :  1172  
##                    Mean   : 3.267   es     :   915  
##                    3rd Qu.: 0.000   ko     :   747  
##                    Max.   :25.000   de     :   732  
##                                     (Other):  1939  
##      affiliate_channel   affiliate_provider  first_affiliate_tracked
##  direct       :137727   direct    :137426   untracked    :109232    
##  sem-brand    : 26045   google    : 51693   linked       : 46287    
##  sem-non-brand: 18844   other     : 12549   omg          : 43982    
##  other        :  8961   craigslist:  3471   tracked-other:  6156    
##  seo          :  8663   bing      :  2328                :  6065    
##  api          :  8167   facebook  :  2273   product      :  1556    
##  (Other)      :  5044   (Other)   :  3711   (Other)      :   173    
##    signup_app           first_device_type       first_browser  
##  Android:  5454   Mac Desktop    :89600   Chrome       :63845  
##  iOS    : 19019   Windows Desktop:72716   Safari       :45169  
##  Moweb  :  6261   iPhone         :20759   Firefox      :33655  
##  Web    :182717   iPad           :14339   -unknown-    :27266  
##                   Other/Unknown  :10667   IE           :21068  
##                   Android Phone  : 2803   Mobile Safari:19274  
##                   (Other)        : 2567   (Other)      : 3174  
##  country_destination
##  NDF    :124543     
##  US     : 62376     
##  other  : 10094     
##  FR     :  5023     
##  IT     :  2835     
##  GB     :  2324     
##  (Other):  6256

Note that the gender column has 95,688 NA values against 117,763 filled values. So, any conclusion we draw from the gender demographic might not hold in the real world.

Number of NA values in each column of the data frame.

colSums(is.na(train_users_2))
##                      id    date_account_created  timestamp_first_active 
##                       0                       0                       0 
##      date_first_booking                  gender                     age 
##                  124543                   95688                   87990 
##           signup_method             signup_flow                language 
##                       0                       0                       0 
##       affiliate_channel      affiliate_provider first_affiliate_tracked 
##                       0                       0                       0 
##              signup_app       first_device_type           first_browser 
##                       0                       0                       0 
##     country_destination 
##                       0

The age column contains values less than 18 and more than 80. In fact, age contains values as large as 104 and 2014. We will assign NA values to them.

filter(train_users_2, age > 80 | age < 18)

Put NA wherever the age is either below 18 or greater than 80.

train_users_2 <- mutate(train_users_2, age = ifelse(age < 18, NA, age), age = ifelse(age > 80, NA, age))

To fill in the NA values in the age column, we’ll calculate the mean and standard deviation of the age column. Then we’ll generate n random integers between (mean - sd) and (mean + sd) to fill in the NA values, where n is the number of NA values in the age column.

mean_age <- as.integer(mean(train_users_2$age, na.rm=TRUE))
sd_age <- as.integer(sd(train_users_2$age, na.rm=TRUE))
count_nan_age <- sum(is.na(train_users_2$age))
random_age <- sample((mean_age - sd_age):(mean_age + sd_age), count_nan_age, replace = TRUE)
# floor(runif(count_nan_age, min=mean_age-sd_age, max = mean_age+sd_age))    # Random number generator

Replace the NA values in the age column with the random integers we generated above.

# Assign the random ages only to the rows where age is missing (one random value per NA)
train_users_2$age[is.na(train_users_2$age)] <- random_age
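
Since the fill-in values are random, the result changes on every run. Below is a minimal sketch of the same idea on a toy vector, with a seed so that it is reproducible. The seed value and the toy ages are my own assumptions, not something from the dataset.

```r
set.seed(42)  # hypothetical seed, only there to make the sketch reproducible

age <- c(25, NA, 34, 41, NA, 29, NA, 52)   # toy ages with three missing values

mean_age <- as.integer(mean(age, na.rm = TRUE))
sd_age <- as.integer(sd(age, na.rm = TRUE))
n_missing <- sum(is.na(age))

# Draw one integer per missing value from [mean - sd, mean + sd]
random_age <- sample((mean_age - sd_age):(mean_age + sd_age), n_missing, replace = TRUE)
age[is.na(age)] <- random_age

print(age)
```

Keep in mind that this kind of imputation piles all the filled-in values into one band around the mean, so any age plot will look more concentrated there than the real data is.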

Create a new column called age_brackets and add it to the data frame.

age_breaks <- c(18, 30, 40, 50, 60, 70, 80)

age_labels <- c("20s", "30s", "40s", "50s", "60s", "70s")

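# right = FALSE makes the intervals [18, 30), [30, 40), ..., [70, 80], so a 30-year-old lands in "30s"
age_brackets <- cut(x = train_users_2$age, breaks = age_breaks,
                    labels = age_labels, right = FALSE, include.lowest = TRUE)

train_users_2 <- mutate(train_users_2, age_brackets = age_brackets)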
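
One thing worth knowing about cut(): by default its intervals are closed on the right, and that decides where boundary ages like 30 and 40 end up. Here is a small sketch on made-up ages that compares the default with right = FALSE.

```r
ages <- c(18, 29, 30, 39, 40, 80)   # made-up ages sitting on or near the breaks
age_breaks <- c(18, 30, 40, 50, 60, 70, 80)
age_labels <- c("20s", "30s", "40s", "50s", "60s", "70s")

# Default: intervals are (18, 30], (30, 40], ... so 30 is labelled "20s"
right_closed <- cut(ages, breaks = age_breaks, labels = age_labels,
                    include.lowest = TRUE)

# right = FALSE: intervals are [18, 30), [30, 40), ... so 30 is labelled "30s"
left_closed <- cut(ages, breaks = age_breaks, labels = age_labels,
                   right = FALSE, include.lowest = TRUE)

print(data.frame(ages, right_closed, left_closed))
```

With right = FALSE, include.lowest = TRUE keeps the top value (80) inside the last bracket instead of turning it into NA.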

Add a new column time_first_active_to_booking, which is the number of whole days between timestamp_first_active and date_first_booking.

train_users_2 <- mutate(train_users_2, time_first_active_to_booking = as.integer(difftime(date_first_booking, timestamp_first_active, units = "days")))

Add a new column time_signup_to_booking, which is the number of whole days between date_account_created and date_first_booking.

train_users_2 <- mutate(train_users_2, time_signup_to_booking = as.integer(difftime(date_first_booking, date_account_created, units = "days")))
train_users_2
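
To see what as.integer(difftime(...)) actually returns, here is a small sketch with made-up dates. The point is that as.integer() truncates toward zero, so a booking made on the same calendar day as the first activity comes out as 0 rather than a negative fraction.

```r
library(lubridate)

first_active <- ymd_hms("2014-01-01 12:00:00")   # made-up timestamp (UTC)
first_booking <- ymd("2014-01-03")               # made-up booking date

# The Date is treated as midnight UTC, so the raw gap is 1.5 days
gap <- difftime(first_booking, first_active, units = "days")
gap_days <- as.integer(gap)   # truncated to 1

# Booking on the same calendar day: raw gap is -0.5 days, truncated to 0
same_day_gap <- as.integer(difftime(ymd("2014-01-01"), first_active, units = "days"))

print(gap)
print(gap_days)
print(same_day_gap)
```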

View the number of NAs per column.

colSums(is.na(train_users_2))
##                           id         date_account_created 
##                            0                            0 
##       timestamp_first_active           date_first_booking 
##                            0                       124543 
##                       gender                          age 
##                        95688                            0 
##                signup_method                  signup_flow 
##                            0                            0 
##                     language            affiliate_channel 
##                            0                            0 
##           affiliate_provider      first_affiliate_tracked 
##                            0                            0 
##                   signup_app            first_device_type 
##                            0                            0 
##                first_browser          country_destination 
##                            0                            0 
##                 age_brackets time_first_active_to_booking 
##                            0                       124543 
##       time_signup_to_booking 
##                       124543

Reset the gender levels. If you don’t do this, the -unknown- level will still show up in the levels(train_users_2$gender). We don’t want that as we’ve already set all the -unknown- gender values to NA.

train_users_2$gender <- factor(train_users_2$gender)
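
If you want to convince yourself that the level really sticks around, here is a toy factor standing in for the gender column. The values are made up.

```r
# Toy factor standing in for the gender column
gender <- factor(c("MALE", "-unknown-", "FEMALE", "-unknown-"))

# Setting the values to NA leaves the level itself in place
gender[as.character(gender) == "-unknown-"] <- NA
levels_before <- levels(gender)

# Rebuilding the factor drops the unused level; droplevels(gender) does the same
gender <- factor(gender)
levels_after <- levels(gender)

print(levels_before)
print(levels_after)
```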

We are done with the pre-processing. Whew. :P

EXPLORATORY DATA ANALYSIS

train_users_2

Age, Gender, and Language

Create a data frame without the NA values in the gender column, and another one without the English-language users.

gender_male_female_others <- filter(train_users_2, is.na(gender)==FALSE)
language_no_en = filter(train_users_2, language != "en")
plot_gender <- ggplot(train_users_2, aes(gender)) + geom_bar(aes(fill=..count..))
plot_age <- ggplot(train_users_2, aes(age)) + geom_bar(aes(fill = ..count..))
plot_language <- ggplot(train_users_2, aes(language)) + geom_bar(aes(fill = ..count..)) + theme(axis.text.x = element_text(angle=90, hjust=0))
plot_language_no_en <- ggplot(language_no_en, aes(language)) + geom_bar(aes(fill=..count..))


grid.arrange(plot_gender, plot_age, plot_language, plot_language_no_en, ncol=2)

  1. We can see that there are a lot of missing values for gender. A large share of the users (about 45%) did not fill in their gender information on the platform.
  2. In the second plot, we observe that most users are between 25 and 47 years old, with the peak around the age of 30. This tells us that young and middle-aged users are dominant. Keep in mind that the missing ages were filled with random values between (mean - sd) and (mean + sd), which inflates this band.
  3. For a company based in the US, it isn’t surprising that the most used language on its portal/app is English.
  4. If we remove the English language from the plot, Chinese (zh) is the next most popular language on AirBnB, followed by French and Spanish. This suggests that AirBnB, after the US, is really popular in Chinese-, French-, and Spanish-speaking countries/communities. French is predominantly spoken in France, so we can guess that this app is popular in France. But we can’t say that for Spanish, because Spanish is spoken in a lot of countries including Spain, Colombia, and the US.

This data, along with the user’s location, could be used to identify which regions (inside countries) use which language and, maybe, to show targeted ads to those communities.

Age vs Gender

Create a new data frame with only the male and female values after removing the NA values from the gender column.

gender_male_female <- filter(train_users_2, is.na(gender)==FALSE , gender != "OTHER")
ggplot(gender_male_female, aes(age)) + geom_density(aes(fill = gender), alpha=0.7) + ggtitle("Density plot Age and Gender")

The age distributions of men and women who use AirBnB are almost identical. Men and women in their 30s are the most prominent users of AirBnB.

Affiliate Marketing a.k.a. Advertisements

Before we begin to analyze the graphs, let us understand what affiliate marketing is.

Affiliate marketing is a type of performance-based marketing in which a business rewards one or more affiliates for each visitor or customer brought by the affiliate’s own marketing efforts. Affiliate marketing is quickly becoming a powerful way to increase sales.

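Create a data frame without the direct affiliate channel and the direct affiliate provider. We do this to clearly view the other affiliate channels and affiliate providers, which have a smaller share. Rows with a missing gender are dropped as well, because the same data frame is reused for the gender plots further down.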

aff_ch_prov_no_direct <- filter(train_users_2, affiliate_channel != "direct", affiliate_provider != "direct", is.na(gender)==FALSE)
plot_affiliate_provider_channel <- ggplot(train_users_2, aes(affiliate_provider)) + geom_bar(aes(fill=affiliate_channel)) + theme(axis.text.x = element_text(angle=90, hjust=0))

plot_affiliate_provider_channel_no_direct <- ggplot(aff_ch_prov_no_direct, aes(affiliate_provider)) + geom_bar(aes(fill=affiliate_channel)) + theme(axis.text.x = element_text(angle=90, hjust=0))

grid.arrange(plot_affiliate_provider_channel, plot_affiliate_provider_channel_no_direct, ncol=2)

The 2 plots show the distribution of the affiliate channels used by different affiliate providers.

  1. Direct marketing performed by AirBnB itself has had the most outreach in terms of marketing. Direct marketing is a form of advertising where organizations communicate directly to customers through a variety of media including text messages, email, websites, online adverts, promotional letters, and targeted television.
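  2. Google is the second biggest affiliate provider, though a distant second (51,693 users against 137,426 for direct), with sem-brand (search engine marketing on branded keywords) being its most popular affiliate channel. Bing, Facebook, and Craigslist are the other “major” contributors.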
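
To put rough numbers on the provider plot, here is a small sketch that turns the affiliate_provider counts from the summary() output above into percentages. The counts are copied from that output; the variable names are mine.

```r
# Counts copied from the summary() output above
provider_counts <- c(direct = 137426, google = 51693, other = 12549,
                     craigslist = 3471, bing = 2328, facebook = 2273)
total_users <- 213451   # 152897 + 60008 + 546, from the signup_method column

# Share of all users per provider, in percent
provider_share <- round(100 * provider_counts / total_users, 1)
print(provider_share)
```

Direct comes out at roughly 64% and Google at roughly 24%; every other provider is in the low single digits.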

Targeted Marketing based on Age

These graphs compare the usage of the AirBnB platform across age groups. The user might or might not book a hotel, but he/she is scrolling around on the AirBnB platform (mobile app/web app).

age_affiliate_channel <- ggplot(train_users_2, aes(affiliate_channel)) + geom_bar(aes(fill=age_brackets)) + theme(axis.text.x = element_text(angle=90, hjust=0))

age_affiliate_provider <- ggplot(train_users_2, aes(affiliate_provider)) + geom_bar(aes(fill=age_brackets)) + theme(axis.text.x = element_text(angle=90, hjust=0))

age_affiliate_channel_no_direct <- ggplot(aff_ch_prov_no_direct, aes(affiliate_channel)) + geom_bar(aes(fill=age_brackets), position = position_dodge()) + theme(axis.text.x = element_text(angle=90, hjust=0))

age_affiliate_provider_no_direct <- ggplot(aff_ch_prov_no_direct, aes(affiliate_provider)) + geom_bar(aes(fill=age_brackets), position = position_dodge()) + theme(axis.text.x = element_text(angle=90, hjust=0))


# gender_affiliate_channel <- ggplot(gender_male_female_others, aes(affiliate_channel)) + geom_bar(aes(fill=gender)) + theme(axis.text.x = element_text(angle=90, hjust=0))
# gender_affiliate_provider <- ggplot(gender_male_female_others, aes(affiliate_provider)) + geom_bar(aes(fill=gender)) + theme(axis.text.x = element_text(angle=90, hjust=0))


grid.arrange(age_affiliate_channel, age_affiliate_provider, age_affiliate_channel_no_direct, age_affiliate_provider_no_direct, ncol=2)

  1. Direct marketing targets the majority of the people in their 20s, 30s, and 40s. This could be the reason why these age groups are more dominant on AirBnB.
  2. Google, being the second most popular affiliate_provider after AirBnB itself, caters slightly more to people in their 30s than to the other age groups.
  3. If we remove the direct affiliate channel from the plot, we observe that the majority of people in their 50s and 60s are targeted by the sem-brand and sem-non-brand channels (search engine marketing on branded and non-branded keywords).
  4. If we remove the direct affiliate provider, we observe that Google caters more to people in their 30s.

Targeted Marketing based on Gender

These graphs compare the usage of the AirBnB platform across genders. The user might or might not book a hotel, but he/she is scrolling around on the AirBnB platform (mobile app/web app).

gender_affiliate_channel <- ggplot(gender_male_female_others, aes(affiliate_channel)) + geom_bar(aes(fill=gender), position = position_dodge()) + theme(axis.text.x = element_text(angle=90, hjust=0))

gender_affiliate_provider <- ggplot(gender_male_female_others, aes(affiliate_provider)) + geom_bar(aes(fill=gender), position = position_dodge()) + theme(axis.text.x = element_text(angle=90, hjust=0))

gender_affiliate_channel_no_direct <- ggplot(aff_ch_prov_no_direct, aes(affiliate_channel)) + geom_bar(aes(fill=gender), position = position_dodge()) + theme(axis.text.x = element_text(angle=90, hjust=0))

gender_affiliate_provider_no_direct <- ggplot(aff_ch_prov_no_direct, aes(affiliate_provider)) + geom_bar(aes(fill=gender), position = position_dodge()) + theme(axis.text.x = element_text(angle=90, hjust=0))


grid.arrange(gender_affiliate_channel, gender_affiliate_provider, gender_affiliate_channel_no_direct, gender_affiliate_provider_no_direct, ncol=2)

  1. More women than men are targeted by the direct affiliate channel.
  2. Same goes for the direct affiliate provider.
  3. If we remove the direct affiliate channel, we observe that sem-brand and sem-non-brand (search engine marketing) are the two most popular channels, followed by API and SEO (Search Engine Optimisation). With the exception of the API channel, all other channels cater to more women than men.
  4. Google as an affiliate provider is more common amongst women than men.

Signup App and Signup Method

plot_signup_method_app <- ggplot(train_users_2, aes(signup_app)) + geom_bar(aes(fill=signup_method), position = position_dodge()) + ggtitle("Signup App and Signup Method")

plot_signup_method_app

  1. Signing up using email is the most popular option, followed by signing up using Facebook. Hardly anybody links their Google account to their AirBnB account (only 546 users did).
  2. An overwhelming majority of people access the AirBnB platform using browsers on their computers, followed by the iOS app.
    The fact that there are fewer Android users than iOS users might seem odd, but remember that AirBnB is an American company with its largest user base being Americans. iOS is more popular in the US than Android.
  3. People probably don’t use the app as much. This could be because they don’t like the Android/iOS app’s UI or functionality. Maybe the web version offers more functionality and is easier to use. Or people aren’t aware of the AirBnB app.

Signup App and Signup Method based on Age and Gender

signup_no_na_gender <- filter(train_users_2, is.na(gender)==FALSE)
# plot_signup_app_age <- ggplot(train_users_2, aes(signup_app)) + geom_bar(aes(fill = age_brackets))

plot_signup_app_age <- ggplot(train_users_2, aes(signup_app)) + geom_bar(aes(fill = age_brackets), position = position_dodge())
plot_signup_app_gender <- ggplot(signup_no_na_gender, aes(signup_app)) + geom_bar(aes(fill = gender), position = position_dodge())

plot_signup_method_age <- ggplot(train_users_2, aes(signup_method)) + geom_bar(aes(fill=age_brackets), position = position_dodge())
plot_signup_method_gender <- ggplot(signup_no_na_gender, aes(signup_method)) + geom_bar(aes(fill=gender), position = position_dodge())

grid.arrange(plot_signup_app_age, plot_signup_app_gender, plot_signup_method_age, plot_signup_method_gender, ncol=2)

  1. As expected, elderly people barely use smartphones to access AirBnB. A large number of people in their 20s, 30s, and 40s use their computers to access the AirBnB platform. One would expect the “tech savvy” people in their 20s to use smartphones more, but that isn’t the case (note that there are a lot more people in the 30s age bracket, so this assumption might be wrong).

  2. More women prefer signing up using their computers while more men prefer iOS/Android apps.

  3. A lot more people in their 30s prefer signing up using email as compared to Facebook. An almost weirdly equal number of people in their 20s and 30s prefer to sign up using Facebook.

  4. More women than men prefer to use the Facebook and email signup methods. Compared to the other 2, the Google signup method is like a 404 error, Does Not Exist.
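
Because the age brackets (and the genders) have very different sizes, raw counts can be misleading when comparing groups. One option is to plot shares instead of counts. Below is a minimal sketch on toy data; the toy_users data frame is hypothetical and only mimics the signup_method and age_brackets columns.

```r
library(ggplot2)

# Toy data: a hypothetical stand-in for signup_method and age_brackets
toy_users <- data.frame(
  signup_method = c("basic", "basic", "basic", "facebook", "facebook", "basic"),
  age_brackets  = c("20s", "30s", "30s", "20s", "30s", "40s")
)

# position = "fill" stacks every bar to 100%, so the bars show shares, not counts
plot_share <- ggplot(toy_users, aes(signup_method)) +
  geom_bar(aes(fill = age_brackets), position = "fill") +
  ylab("share of users")

# The same shares as a table: each row (signup method) sums to 1
share_table <- prop.table(table(toy_users$signup_method, toy_users$age_brackets), margin = 1)
print(share_table)
```

Swapping position_dodge() for position = "fill" in the plots above would give the same kind of view on the real data.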

First Device type vs Age and Gender

Create a table without NDF in the country_destination column and without NA values in the gender column.
NDF == No Destination Found, meaning the user hasn’t booked any hotel.

user_no_NDF_no_gender <- filter(train_users_2, country_destination != "NDF", !is.na(gender))
ggplot(user_no_NDF_no_gender, aes(first_device_type)) + geom_bar(aes(fill=..count..))  + theme(axis.text.x = element_text(angle=90, hjust=0)) + ggtitle("First device type vs Gender") + facet_grid(.~gender)

  1. Mac Desktops are the most preferred devices to access the AirBnB platform, followed by Windows Desktops. Again, Apple is supremely popular in the US.
  2. iPhones and iPads are the next most widely used devices to access the AirBnB platform.

ggplot(user_no_NDF_no_gender, aes(first_device_type)) + geom_bar(aes(fill=..count..))  + theme(axis.text.x = element_text(angle=90, hjust=0)) + ggtitle("First device type vs Age brackets") +facet_grid(.~age_brackets)

  1. Mac Desktops are highly popular amongst people in their 20s and 30s to access the AirBnB platform followed by the Windows Desktop.
  2. We see a decreasing trend in the usage of Macs as the age increases. There’s no disparity between Mac Desktops and Windows Desktops for people in their 60s.
  3. Smartphones, however, become less popular as age increases.

Bookings and Accounts over the years

num_booking <- train_users_2 %>%
                filter(!is.na(date_first_booking)) %>%
                group_by(date_first_booking) %>%
                summarise(num = n()) %>%
                ggplot(aes(date_first_booking, num)) + geom_line(aes(color=num))  + scale_colour_gradient(low='blue', high='red') + ggtitle("Number of bookings over the years (2010-2015)")


num_accounts <- train_users_2 %>%
                filter(!is.na(date_account_created)) %>%
                group_by(date_account_created) %>%
                summarise(num = n()) %>%
                ggplot(aes(date_account_created, num)) + geom_line(aes(color=num))  + scale_colour_gradient(low='blue', high='red') + ggtitle("Number of accounts created over the years (2010-2014)")

grid.arrange(num_booking, num_accounts, ncol=2)

  1. The number of bookings increases rapidly every year.
  2. The sharp drop in bookings for the year 2015 is because we only have data till 2015-06-29. filter(train_users_2, date_first_booking >= "2015-06-29")
  3. For the number of accounts created, we only have data till 2014-06-30. filter(train_users_2, date_account_created >= "2014-06-30")
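
The daily series is quite noisy, so the yearly trend can be easier to read when the bookings are aggregated per month. Here is a minimal sketch on toy dates, using lubridate's floor_date() to snap every date to the first day of its month. The toy_bookings tibble and the booking_month column are hypothetical names.

```r
library(dplyr)
library(lubridate)

# Toy booking dates, including one user who never booked (NA)
toy_bookings <- tibble(
  date_first_booking = ymd(c("2013-01-05", "2013-01-20", "2013-02-02",
                             NA, "2013-02-15", "2013-02-28"))
)

monthly_bookings <- toy_bookings %>%
  filter(!is.na(date_first_booking)) %>%
  mutate(booking_month = floor_date(date_first_booking, unit = "month")) %>%
  count(booking_month, name = "num")   # one row per month

print(monthly_bookings)
```

The result can go straight into ggplot(aes(booking_month, num)) + geom_line(), just like the daily version.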

Number of First-bookings per year

num_book <- function(yr) {

plot_title <- paste("Number of bookings in", yr)
num_bookings_x <- train_users_2 %>%
                filter(!is.na(date_first_booking), year(date_first_booking) == yr) %>%
                group_by(date_first_booking) %>%
                summarise(num = n()) %>%
                ggplot(aes(date_first_booking, num)) + geom_line(aes(color=num))  + scale_colour_gradient(low='blue', high='red') + ggtitle(plot_title)

num_bookings_x
}


grid.arrange(num_book(2010), num_book(2011), num_book(2012), num_book(2013), num_book(2014),num_book(2015), ncol=2)

  1. The number of first-bookings is at its lowest around January. This could possibly be because the new year has just begun, so people don’t travel anywhere so soon. Also, cold outside much?
  2. The number of first-bookings always spikes between the months of July and October. This could be in anticipation of festivals like Thanksgiving and Oktoberfest.
  3. However, we see a sharp drop in the number of first-bookings starting from July 2014 until the data ends in June 2015.

“In July 2014, Airbnb revealed design revisions to the site and mobile app and introduced a new logo. Some considered the new logo to be visually similar to genitalia, but a consumer survey by Survata showed only a minority of respondents thought this was the case.”

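Google “AirBnB 2014” if you want to read more about the redesign. A simpler explanation for the drop, though, is the dataset itself: it only contains accounts created up to 2014-06-30, so no new users (and hence far fewer first-bookings) show up after that date.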

Number of Accounts created per year

  1. This plot follows a similar trend as above.
  2. The number of new accounts created is at its lowest around January and spikes around September and October.
  3. People probably create new accounts to book as well as to compare prices with other services.

num_acc <- function(yr) {

plot_title <- paste("Number of accounts created in", yr)
num_accounts_x <- train_users_2 %>%
                filter(!is.na(date_account_created), year(date_account_created) == yr) %>%
                group_by(date_account_created) %>%
                summarise(num = n()) %>%
                ggplot(aes(date_account_created, num)) + geom_line(aes(color=num))  + scale_colour_gradient(low='blue', high='red') + ggtitle(plot_title)

num_accounts_x
}


grid.arrange(num_acc(2010), num_acc(2011), num_acc(2012), num_acc(2013), num_acc(2014), ncol=2)

AirBnB could probably reduce prices or give more discounts and offers during the months of August, September, and October so that more people book hotels.

Time between Sign up and First booking based on Age and Gender

train_users_2 %>%
    filter(!is.na(time_signup_to_booking)) %>%
    group_by(age_brackets) %>%
    ggplot(aes(x = age_brackets, y = time_signup_to_booking)) + geom_boxplot(aes(fill=gender)) + facet_grid(.~gender) + ggtitle("Time between Signup and Booking based on Gender and Age")

The colored boxes indicate the interquartile range which represents the middle 50% of the data. The whiskers extend from either side of the box. The whiskers represent the ranges for the bottom 25% and the top 25% of the data values, excluding outliers.

  1. A heavy majority of people, irrespective of age and gender, book hotels on the day they sign up. The median value is 0.
  2. You can see that the “outliers” book hotels more than 1,000 days after signing up on the platform.
  3. The “wait-time” for the middle 50% users from each age bracket generally tends to decrease with age.
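
To make the box, whiskers, and outliers concrete, here is a small sketch on made-up waiting times. It uses base R's boxplot.stats(), which, like geom_boxplot() by default, ends the whiskers at the most extreme points within 1.5 times the interquartile range of the box and reports everything beyond that as an outlier.

```r
# Made-up waiting times in days, with one very late booker
wait_days <- c(0, 0, 0, 1, 2, 3, 5, 8, 400)

quartiles <- quantile(wait_days, probs = c(0.25, 0.5, 0.75))
iqr <- IQR(wait_days)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
# out:   points beyond 1.5 * IQR from the box
box <- boxplot.stats(wait_days)

print(quartiles)
print(iqr)
print(box$stats)
print(box$out)
```

In this toy example the upper whisker stops at 8 days and the 400-day booking is drawn as a separate point, which is exactly what the dots above the boxes in the plot are.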

Time between the First booking and First activity on the AirBnB platform

train_users_2 %>%
    filter(!is.na(time_first_active_to_booking), time_first_active_to_booking !=0 ) %>%
    group_by(time_first_active_to_booking) %>%
    summarise(count = n()) %>%
    ggplot(aes(time_first_active_to_booking, count)) + geom_point(aes(color=count))  + scale_colour_gradient(low='blue', high='red') + ggtitle("Time (in days) between first booking and first activity")

  1. The time between the first booking and the first activity is close to 0 for a lot of people (same-day bookings, where the gap is exactly 0, are filtered out of this plot).
  2. There are people who’ve booked their first hotel more than 1,000 days after their first activity on the AirBnB platform. That’s close to 3 years. Damn.

Time between First booking and Signup on the AirBnB platform

train_users_2 %>%
    filter(!is.na(time_signup_to_booking)) %>%
    group_by(time_signup_to_booking) %>%
    summarise(count = n()) %>%
    ggplot(aes(time_signup_to_booking, count)) + geom_point(aes(color=count))  + scale_colour_gradient(low='blue', high='red') + ggtitle("Time (in days) between first booking and signup")

  1. Here, we see that the number of days is negative for a handful of people. Some people have booked hotels as long as a year before creating an account.
    Other than that, the data seems similar to the above plot.
  2. A huge number of people book hotels on the same day they sign up on the AirBnB platform.

Let’s analyse the negative values. How many negative values are there?

train_users_2 %>%
    filter(time_signup_to_booking < 0)

We see that there are only 29 negative values.
This means that there were 29 users who were able to book their rooms without creating an account!
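
Rather than reading the row count off the printed tibble, the counts can be computed directly. A minimal sketch on toy values (the toy_gaps tibble is hypothetical):

```r
library(dplyr)

# Toy gaps in days; NA stands for users who never booked
toy_gaps <- tibble(time_signup_to_booking = c(0L, 3L, -2L, NA, 10L, -40L))

# Number of users who booked before signing up
n_negative <- toy_gaps %>%
  filter(time_signup_to_booking < 0) %>%
  nrow()

# The same count next to the same-day and missing counts
gap_summary <- toy_gaps %>%
  summarise(
    negative = sum(time_signup_to_booking < 0, na.rm = TRUE),
    same_day = sum(time_signup_to_booking == 0, na.rm = TRUE),
    missing  = sum(is.na(time_signup_to_booking))
  )

print(n_negative)
print(gap_summary)
```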

Let’s see in which years this happened.
After keeping only the negative values of time_signup_to_booking, that is, people who booked hotels before signing up, we plot the following graph.
This graph tells us that users could book before signing up on the AirBnB platform from 2010 to 2013.

train_users_2 %>%
    filter(time_signup_to_booking < 0) %>%
    mutate(year_first_booking = year(date_first_booking)) %>%
    group_by(year_first_booking) %>%
    ggplot(aes(year_first_booking)) + geom_bar(aes(fill = ..count..))

The following comment was released by AirBnB.

“Up until early 2013 there was a handful of flows where a user was able to book before fully creating an account (by the definition of account creation we use today). After early 2013 this is no longer possible.”

# train_users_2 %>%
#     filter(!is.na(date_first_booking), year(date_first_booking) == 2012, age_brackets=="20s") %>%
#     group_by(date_first_booking) %>%
#     summarise(num = n()) %>%
#     ggplot(aes(date_first_booking, num)) + geom_point(aes(color=num))  + scale_colour_gradient(low='blue', high='green')



# 
# train_users_2 %>%
#     filter(!is.na(date_first_booking), year(date_first_booking) == 2012) %>%
#     mutate(genders = gender)
#     group_by(date_first_booking) %>%
#     summarise(num = n()) %>%
#     ggplot(aes(date_first_booking, num)) + geom_point(aes(fill=gender))

Booked vs Not Booked

book_not_book <- train_users_2 %>%
                    mutate(Booked = ifelse(country_destination == "NDF", "Not Booked", "Booked")) %>%
                    ggplot(aes(Booked)) + geom_bar(aes(fill=..count..))

book_gender <- train_users_2 %>%
                    mutate(Booked = ifelse(country_destination == "NDF", "Not Booked", "Booked")) %>%
                    ggplot(aes(Booked)) + geom_bar(aes(fill=gender), position = position_dodge())

book_age <- train_users_2 %>%
                mutate(Booked = ifelse(country_destination == "NDF", "Not Booked", "Booked")) %>%
                ggplot(aes(Booked)) + geom_bar(aes(fill=age_brackets), position = position_dodge())

book_country <- train_users_2 %>%
                # mutate(Booked = ifelse(country_destination == "NDF", "Not Booked", "Booked")) %>%
                ggplot(aes(country_destination)) + geom_bar(aes(fill=..count..))

grid.arrange(book_not_book, book_gender, book_age, book_country, ncol=2)

  1. There are more inactive users than active users: 124,543 of the 213,451 users (about 58%) never booked.
  2. The male vs female ratio is pretty much the same for users who book vs those who don’t. The NA values, however, differ. There are a lot of users who don’t provide their gender and don’t book any hotels.
  3. People in their 30s are the highest in the lot for both booking and not booking hotels. The ratio of Booked:NotBooked is less than 1 for people in their 20s, 30s, and 40s, while the same ratio is kind of constant for people in their 50s, 60s, and 70s.

The graph shows that people in their 20s, 30s, and 40s are AirBnB’s base customers.
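
The Booked:NotBooked comparison can also be expressed as a booking rate per group, which is easier to compare across brackets of different sizes. Below is a sketch on a toy tibble (hypothetical rows), followed by the overall rate computed from the counts in the summary() output above.

```r
library(dplyr)

# Toy rows standing in for country_destination and age_brackets
toy_users <- tibble(
  country_destination = c("NDF", "US", "NDF", "FR", "NDF", "US", "NDF", "other"),
  age_brackets        = c("20s", "20s", "30s", "30s", "30s", "40s", "40s", "20s")
)

# Booking rate per age bracket: anything other than NDF counts as a booking
booking_rate <- toy_users %>%
  group_by(age_brackets) %>%
  summarise(
    users  = n(),
    booked = sum(country_destination != "NDF"),
    rate   = booked / users
  )

print(booking_rate)

# Overall rate in the real data, using the counts from the summary() output above
overall_rate <- (213451 - 124543) / 213451
print(round(overall_rate, 3))
```

On the real data frame, the same group_by() and summarise() calls on train_users_2 should give one rate per age bracket.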

Frequency of the Destination Country

NDF means no booking was made.
Other means a booking was made to a country other than the 10 listed in the dataset.

# NDF means "No Destination Found". Which means there wasn't a booking.
country_dest_no_NDF_no_gender <- filter(train_users_2, country_destination != "NDF", is.na(gender)==FALSE)
country_dest_NDF_no_gender <- filter(train_users_2, is.na(gender)==FALSE)

# plot_dest_country_with_NDF <- ggplot(train_users_2, aes(country_destination)) + geom_bar(aes(fill=..count..))
# plot_dest_country_without_NDF <- ggplot(country_dest_without_NDF, aes(country_destination)) + geom_bar(aes(fill=..count..))

plot_dest_country_without_NDF_age <- ggplot(country_dest_no_NDF_no_gender, aes(country_destination)) + geom_bar(aes(fill=age_brackets), position = position_dodge())
plot_dest_country_without_NDF_gender <- ggplot(country_dest_no_NDF_no_gender, aes(country_destination)) + geom_bar(aes(fill=gender), position = position_dodge())

plot_dest_country_with_NDF_age <- ggplot(country_dest_NDF_no_gender, aes(country_destination)) + geom_bar(aes(fill=age_brackets), position = position_dodge())
plot_dest_country_with_NDF_gender <- ggplot(country_dest_NDF_no_gender, aes(country_destination)) + geom_bar(aes(fill=gender), position = position_dodge())

grid.arrange(plot_dest_country_with_NDF_age, plot_dest_country_with_NDF_gender, plot_dest_country_without_NDF_age, plot_dest_country_without_NDF_gender, ncol=2)

  1. The highest number of inactive users (people who haven’t booked a room) are in their 30s.
  2. Women travel around slightly more than men using AirBnB.
  3. After the US and “other” countries, France is the next most popular destination.
  4. More women travel to France as compared to men, while more men visit Canada as compared to women.

Note that the travel can be both national and international as the country of origin of users is not provided in the dataset.

Effect of Affiliate Channel on the Destination Country

country_affiliate_provider_no_usa <- filter(country_dest_no_NDF_no_gender, country_destination != "US")
plot_country_affiliate_channel <- ggplot(country_dest_no_NDF_no_gender, aes(country_destination)) + geom_bar(aes(fill=affiliate_channel) ,position = position_dodge()) + ggtitle("With US")
plot_country_affiliate_provider_no_usa <- ggplot(country_affiliate_provider_no_usa, aes(country_destination)) + geom_bar(aes(fill=affiliate_channel) ,position = position_dodge()) + ggtitle("Without US")


grid.arrange(plot_country_affiliate_channel, plot_country_affiliate_provider_no_usa, ncol=1)

  1. The direct affiliate channel accounts for the majority of confirmed bookings.
  2. The sem-brand affiliate channel also plays an important role in confirmed bookings, especially in the US.
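The ranking read off the bar charts can also be computed directly. The sketch below picks the channel with the most bookings for each destination. The data frame `toy_channels` and its values are hypothetical, chosen only for illustration, and the result does not reflect the real dataset.

```r
library(dplyr)

# Hypothetical toy data; the real input would be country_dest_no_NDF_no_gender
toy_channels <- data.frame(
  country_destination = c("US", "US", "US", "FR", "FR", "FR", "FR"),
  affiliate_channel = c("direct", "direct", "sem-brand",
                        "direct", "sem-brand", "sem-brand", "seo")
)

# Count bookings per destination and channel, then keep the top channel
# for each destination (ties are broken by taking the first row)
top_channel <- toy_channels %>%
  count(country_destination, affiliate_channel, name = "bookings") %>%
  group_by(country_destination) %>%
  slice_max(bookings, n = 1, with_ties = FALSE) %>%
  ungroup()

print(top_channel)
```

Swapping `affiliate_channel` for `affiliate_provider` would give the equivalent ranking by provider.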

Effect of Affiliate Provider on the Destination Country

country_affiliate_provider_no_usa <- filter(country_dest_no_NDF_no_gender, country_destination != "US")
plot_country_affiliate_provider <- ggplot(country_dest_no_NDF_no_gender, aes(country_destination)) + geom_bar(aes(fill=affiliate_provider), position = position_dodge()) + ggtitle("With US")
plot_country_affiliate_provider_no_usa <- ggplot(country_affiliate_provider_no_usa, aes(country_destination)) + geom_bar(aes(fill=affiliate_provider), position = position_dodge()) + ggtitle("Without US")


grid.arrange(plot_country_affiliate_provider, plot_country_affiliate_provider_no_usa, nrow=2 )

After the direct affiliate provider, Google plays an important role in confirming the bookings.

Monthly Booking Statistics based on Gender

# Bar chart of first bookings per month for a given year, split by gender.
num_book_monthly <- function(yr) {
  plot_title <- paste("Aggregate Monthly Bookings |  Males vs Females", yr)
  num_book_mf <- gender_male_female %>%
    filter(year(date_first_booking) == yr) %>%
    # geom_bar() does the counting, so no explicit grouping is needed
    ggplot(aes(month(date_first_booking, label = TRUE))) +
    geom_bar(aes(fill = gender), position = position_dodge()) +
    ggtitle(plot_title)
  num_book_mf
}

grid.arrange(num_book_monthly(2010), num_book_monthly(2011), num_book_monthly(2012), num_book_monthly(2013), num_book_monthly(2014), num_book_monthly(2015), ncol=2)

  1. Women have always made more bookings on Airbnb than men, except in one month: December 2013.
  2. The total number of bookings peaks around July, except in 2014 and 2015.
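Exceptions such as a single month where men out-book women are easy to miss when reading six bar charts. A possible sketch of a numeric check is shown below. The data frame `toy_gender_bookings` is hypothetical and stands in for `gender_male_female`, and `female_lead` is a helper column introduced here for illustration.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical toy data standing in for gender_male_female
toy_gender_bookings <- data.frame(
  date_first_booking = as.Date(c("2013-11-05", "2013-11-12", "2013-11-20",
                                 "2013-12-03", "2013-12-10", "2013-12-24")),
  gender = c("FEMALE", "FEMALE", "MALE", "FEMALE", "MALE", "MALE")
)

# Monthly bookings per gender, reshaped to one row per month with one
# column per gender, plus the difference between the two
monthly_gap <- toy_gender_bookings %>%
  count(yr = year(date_first_booking),
        mon = month(date_first_booking),
        gender,
        name = "bookings") %>%
  pivot_wider(names_from = gender, values_from = bookings, values_fill = 0) %>%
  mutate(female_lead = FEMALE - MALE)

# Months in which men made more bookings than women
male_led_months <- filter(monthly_gap, female_lead < 0)
print(male_led_months)
```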

Weekly Booking Statistics based on Gender

# Bar chart of first bookings per weekday for a given year, split by gender.
num_book_weekly <- function(yr) {
  plot_title <- paste("Aggregate Weekly Bookings |  Males vs Females", yr)
  num_book_wk <- gender_male_female %>%
    # Filter on the requested year (previously hard-coded to 2010)
    filter(year(date_first_booking) == yr) %>%
    ggplot(aes(weekdays(date_first_booking))) +
    geom_bar(aes(fill = gender), position = position_dodge()) +
    ggtitle(plot_title)
  num_book_wk
}

grid.arrange(num_book_weekly(2010), num_book_weekly(2011), num_book_weekly(2012), num_book_weekly(2013), num_book_weekly(2014), num_book_weekly(2015), ncol=2)

  1. The number of bookings is always lowest during the weekend, i.e. Saturday and Sunday.
  2. The number of bookings always peaks on Thursday and Friday. People probably book stays for the weekend on those days.
  3. Bookings by men drop sharply from Saturday to Sunday, while the opposite holds for women.
  4. The number of bookings keeps increasing from Monday to Friday, only to fall on the weekend.
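Monday-to-Friday trends are easier to read when the x-axis follows calendar order. `weekdays()` returns plain character strings, which ggplot2 arranges alphabetically, whereas lubridate's `wday(label = TRUE)` returns an ordered factor. The sketch below illustrates the difference on a hypothetical week of dates. It is a suggestion, not necessarily how the plots above were produced.

```r
library(lubridate)

# Hypothetical sample: seven consecutive days starting on Monday 2014-01-06
sample_dates <- as.Date("2014-01-06") + 0:6

# Character weekday names: sorting them gives alphabetical order
alphabetical <- sort(unique(weekdays(sample_dates)))

# Ordered factor whose levels run Monday to Sunday (week_start = 1)
calendar <- wday(sample_dates, label = TRUE, abbr = FALSE, week_start = 1)

print(alphabetical)
print(levels(calendar))
```

Using `wday(date_first_booking, label = TRUE, week_start = 1)` as the x aesthetic would keep the bars in Monday-to-Sunday order.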

Daily Variation in the Number of Bookings

calh <- train_users_2 %>%
    select(date_first_booking) %>%
    filter(year(date_first_booking) >= 2010, !is.na(date_first_booking)) %>%
    group_by(date_first_booking) %>%
    mutate (bookings = n(),
            year = year(date_first_booking),
            yearmonthf = factor(as.yearmon(date_first_booking)),
            monthf = as.factor(month(date_first_booking, label=TRUE, abbr = TRUE)),
            week = week(date_first_booking),
            monthweek = ceiling(day(date_first_booking)/7),
            weekdayf = wday(date_first_booking, label = TRUE, abbr = TRUE))



ggplot(calh, aes(monthweek, weekdayf, fill=bookings)) + geom_tile(colour = "white") + facet_grid(year~monthf) + 
  scale_fill_gradient(low="blue", high="red") +
  labs(x="Week of Month",
       y="",
       title = "Daily Variation in number of bookings", 
       fill="Bookings")

We observe that Airbnb went from 0-50 bookings per day in 2010 to almost 200 bookings per day in the early months of 2014.
The number dropped to around 100 bookings per day thereafter.
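The per-day figures read off the heatmap can be backed by a short aggregation. The sketch below counts bookings per day and then averages them per year. The data frame `toy_first_bookings` is hypothetical and stands in for `train_users_2`. Note that the average only covers days with at least one booking, which is an assumption of this sketch.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data standing in for train_users_2
toy_first_bookings <- data.frame(
  date_first_booking = as.Date(c("2010-01-02", "2010-01-02", "2010-01-09",
                                 "2014-02-03", "2014-02-03", "2014-02-03",
                                 "2014-02-04"))
)

# One row per calendar day with the number of first bookings on that day
daily_bookings <- toy_first_bookings %>%
  filter(!is.na(date_first_booking)) %>%
  count(date_first_booking, name = "bookings")

# Average daily bookings per year (days without any booking are not included)
avg_daily_by_year <- daily_bookings %>%
  group_by(yr = year(date_first_booking)) %>%
  summarise(mean_daily_bookings = mean(bookings),
            days_with_bookings = n())

print(avg_daily_by_year)
```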

Thank you for reading.
Suggestions and constructive criticism are welcome. :)
You can find me on LinkedIn. Find my other blogposts here.