Users on Airbnb can book a place to stay in 34,000+ cities across 190+ countries. The goal of this notebook is to analyse the data, identify problems and opportunities, and come up with insights to increase the likelihood of bookings, decrease average time for booking and hence lead to more revenue. This data was taken from Kaggle.
If you’re interested in similar projects, have a look at my other blogposts here.
Import the libraries.
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(gridExtra)
library(anytime)
library(zoo)
Import the dataset.
train_users_2 <- read.csv(file="train_users_2.csv", header = TRUE, sep=",")
# as_tibble() replaces tbl_df(), which is deprecated in recent dplyr versions
train_users_2 <- as_tibble(train_users_2)
head(train_users_2)
Convert to date-time format using lubridate.
train_users_2$date_account_created <- ymd(train_users_2$date_account_created)
train_users_2$date_first_booking <- ymd(train_users_2$date_first_booking)
train_users_2$timestamp_first_active <- ymd_hms(train_users_2$timestamp_first_active)
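For readers who want to see what these conversions do without loading lubridate, here is a minimal base R sketch on made-up values. The raw formats (a "YYYY-MM-DD" string for the dates, a 14-digit "YYYYMMDDHHMMSS" string for the timestamp, and an empty string for a missing booking date) are assumptions based on how the parsed columns look in the summary further down.

```r
# Toy inputs; the formats are assumptions, not read from the real file.
date_created_raw  <- c("2010-06-28", "2011-05-25")
first_booking_raw <- c("2010-08-02", "")            # "" = no booking yet
first_active_raw  <- c("20090319043255", "20090523174809")

# Dates: an empty string cannot be parsed and becomes NA.
date_created  <- as.Date(date_created_raw, format = "%Y-%m-%d")
first_booking <- as.Date(first_booking_raw, format = "%Y-%m-%d")

# Timestamp: spell out the compact format explicitly.
first_active <- as.POSIXct(first_active_raw, format = "%Y%m%d%H%M%S", tz = "UTC")

first_booking
format(first_active, "%Y-%m-%d %H:%M:%S")
```

The point to take away is that users without a booking end up with NA in date_first_booking, which matters for every booking-related plot later on.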
Replace -unknown- gender by NA.
train_users_2$gender[as.character(train_users_2$gender) == "-unknown-"] <- NA
# sum(is.na(train_users_2$gender))
Summary of the data frame.
summary(train_users_2)
## id date_account_created timestamp_first_active
## 00023iyk9l: 1 Min. :2010-01-01 Min. :2009-03-19 04:32:55
## 0005ytdols: 1 1st Qu.:2012-12-26 1st Qu.:2012-12-25 07:33:27
## 000guo2307: 1 Median :2013-09-11 Median :2013-09-11 06:13:08
## 000wc9mlv3: 1 Mean :2013-06-25 Mean :2013-06-25 16:15:47
## 0012yo8hu2: 1 3rd Qu.:2014-03-06 3rd Qu.:2014-03-06 08:25:14
## 001357912w: 1 Max. :2014-06-30 Max. :2014-06-30 23:58:24
## (Other) :213445
## date_first_booking gender age
## Min. :2010-01-02 FEMALE :63041 Min. : 1.00
## 1st Qu.:2012-12-02 MALE :54440 1st Qu.: 28.00
## Median :2013-09-11 OTHER : 282 Median : 34.00
## Mean :2013-07-04 -unknown-: 0 Mean : 49.67
## 3rd Qu.:2014-04-04 NA's :95688 3rd Qu.: 43.00
## Max. :2015-06-29 Max. :2014.00
## NA's :124543 NA's :87990
## signup_method signup_flow language
## basic :152897 Min. : 0.000 en :206314
## facebook: 60008 1st Qu.: 0.000 zh : 1632
## google : 546 Median : 0.000 fr : 1172
## Mean : 3.267 es : 915
## 3rd Qu.: 0.000 ko : 747
## Max. :25.000 de : 732
## (Other): 1939
## affiliate_channel affiliate_provider first_affiliate_tracked
## direct :137727 direct :137426 untracked :109232
## sem-brand : 26045 google : 51693 linked : 46287
## sem-non-brand: 18844 other : 12549 omg : 43982
## other : 8961 craigslist: 3471 tracked-other: 6156
## seo : 8663 bing : 2328 : 6065
## api : 8167 facebook : 2273 product : 1556
## (Other) : 5044 (Other) : 3711 (Other) : 173
## signup_app first_device_type first_browser
## Android: 5454 Mac Desktop :89600 Chrome :63845
## iOS : 19019 Windows Desktop:72716 Safari :45169
## Moweb : 6261 iPhone :20759 Firefox :33655
## Web :182717 iPad :14339 -unknown- :27266
## Other/Unknown :10667 IE :21068
## Android Phone : 2803 Mobile Safari:19274
## (Other) : 2567 (Other) : 3174
## country_destination
## NDF :124543
## US : 62376
## other : 10094
## FR : 5023
## IT : 2835
## GB : 2324
## (Other): 6256
NA values in the date_first_booking column mean that the user hasn’t booked any room.
NA values in the age column mean that the user hasn’t specified his/her age. We can fill dummy values in the age column.
NA values in the gender column mean that the user hasn’t specified his/her gender.
Note that there are 95,688 NA values in the gender column and 117,763 filled values. So, our analysis based on the gender demographic might not be entirely correct in the real world.
Number of NA values in each column of the data frame.
colSums(is.na(train_users_2))
## id date_account_created timestamp_first_active
## 0 0 0
## date_first_booking gender age
## 124543 95688 87990
## signup_method signup_flow language
## 0 0 0
## affiliate_channel affiliate_provider first_affiliate_tracked
## 0 0 0
## signup_app first_device_type first_browser
## 0 0 0
## country_destination
## 0
The age column contains values less than 18 and more than 80. In fact, age contains values as large as 104 and 2014. We will assign NA values to them.
filter(train_users_2, age > 100 | age <18)
Put NA wherever the age is either below 18 or greater than 80.
train_users_2 <- mutate(train_users_2, age = ifelse(age < 18, NA, age), age = ifelse(age > 80, NA, age))
To fill in the NA values in the age column, we’ll calculate the mean and standard deviation of the age column. Then we’ll generate n random integers between (mean - standard deviation) and (mean + standard deviation) to fill in the NA values, where n is the number of NA values in the age column.
mean_age <- as.integer(mean(train_users_2$age, na.rm=TRUE))
sd_age <- as.integer(sd(train_users_2$age, na.rm=TRUE))
count_nan_age <- sum(is.na(train_users_2$age))
random_age <- sample((mean_age - sd_age):(mean_age + sd_age), count_nan_age, replace = TRUE)
# floor(runif(count_nan_age, min=mean_age-sd_age, max = mean_age+sd_age)) # Random number generator
Replace the NA values in the age column with the random integers we generated above.
# replace_na(train_users_2$age, random_age)
train_users_2 <- mutate(train_users_2, age=ifelse(is.na(age), random_age, age))
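The whole age clean-up can be tried on a tiny vector first. This is only a sketch in base R: the toy ages and the fixed seed are assumptions added for reproducibility, and the draws are assigned by position to the missing entries instead of going through ifelse.

```r
set.seed(42)  # assumption: fixed seed, only so the sketch is reproducible

age <- c(25, NA, 40, 2014, 5, NA, 33)  # toy values, not from the dataset

# Treat implausible ages as missing (same 18-80 window as above).
age[!is.na(age) & (age < 18 | age > 80)] <- NA

mean_age  <- as.integer(mean(age, na.rm = TRUE))
sd_age    <- as.integer(sd(age, na.rm = TRUE))
n_missing <- sum(is.na(age))

# One random integer per missing value, drawn from [mean - sd, mean + sd].
age[is.na(age)] <- sample((mean_age - sd_age):(mean_age + sd_age),
                          n_missing, replace = TRUE)
age
```

Keep in mind that every imputed age lands within one standard deviation of the mean, so the age brackets around the mean get inflated compared to the raw data.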
Create a new column called age_brackets and add it to the data frame.
age_breaks <- c(18, 30, 40, 50, 60, 70, 80)
age_labels <- c("20s", "30s", "40s", "50s", "60s", "70s")
age_brackets <- cut(x=train_users_2$age, breaks=age_breaks,
labels=age_labels, include.lowest = TRUE)
train_users_2 <- mutate(train_users_2, age_brackets)
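It is worth checking how cut treats ages that sit exactly on a break. By default the intervals are closed on the right, so a 30-year-old lands in the "20s" bracket. The sketch below compares the default with right = FALSE on a few toy ages; the right = FALSE variant is a suggestion, not what the code above uses.

```r
age_breaks <- c(18, 30, 40, 50, 60, 70, 80)
age_labels <- c("20s", "30s", "40s", "50s", "60s", "70s")
ages <- c(18, 29, 30, 31, 40, 80)  # toy values on and around the breaks

# Default: intervals are (a, b], plus the lowest break thanks to include.lowest.
default_bins <- cut(ages, breaks = age_breaks, labels = age_labels,
                    include.lowest = TRUE)

# Alternative: intervals are [a, b), plus the highest break.
left_closed <- cut(ages, breaks = age_breaks, labels = age_labels,
                   include.lowest = TRUE, right = FALSE)

data.frame(ages, default_bins, left_closed)
```

With right = FALSE, ages 30 and 40 fall into "30s" and "40s", which matches the labels more naturally.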
Add a new column time_first_active_to_booking which is equal to number of days between date_first_booking and timestamp_first_active.
train_users_2 <- mutate(train_users_2, time_first_active_to_booking = as.integer(difftime(date_first_booking, timestamp_first_active, units = "days")))
Add a new column time_signup_to_booking which is equal to number of days between date_first_booking and date_account_created.
train_users_2 <- mutate(train_users_2, time_signup_to_booking = as.integer(difftime(date_first_booking, date_account_created, units = "days")))
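A quick sanity check of the day-difference logic on toy dates (made up for illustration) shows the three cases we will meet later: a normal gap, a missing booking, and a booking that predates the signup.

```r
signup        <- as.Date(c("2013-01-10", "2013-03-01", "2013-05-05"))
first_booking <- as.Date(c("2013-01-15", NA,           "2013-05-04"))

# Same pattern as above: difftime in days, truncated to an integer.
days_to_booking <- as.integer(difftime(first_booking, signup, units = "days"))
days_to_booking
```

A missing booking date propagates to NA, and a booking made before the signup date yields a negative number of days.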
train_users_2
View the number of NAs per column.
colSums(is.na(train_users_2))
## id date_account_created
## 0 0
## timestamp_first_active date_first_booking
## 0 124543
## gender age
## 95688 0
## signup_method signup_flow
## 0 0
## language affiliate_channel
## 0 0
## affiliate_provider first_affiliate_tracked
## 0 0
## signup_app first_device_type
## 0 0
## first_browser country_destination
## 0 0
## age_brackets time_first_active_to_booking
## 0 124543
## time_signup_to_booking
## 124543
The date_first_booking, time_first_active_to_booking, time_signup_to_booking, and gender columns have NA values. This is totally fine.
NA values in the date_first_booking column mean that the user hasn’t booked any hotels yet.
NA values in the gender column mean that the user hasn’t specified the gender.
time_first_active_to_booking and time_signup_to_booking are derived from date_first_booking, so they will have NA values too.
Reset the gender levels. If you don’t do this, the -unknown- level will still show up in levels(train_users_2$gender). We don’t want that as we’ve already set all the -unknown- gender values to NA.
train_users_2$gender <- factor(train_users_2$gender)
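To see why the levels need resetting, here is a small base R sketch on a toy factor. It uses droplevels, which for this purpose does the same job as calling factor again.

```r
gender <- factor(c("MALE", "-unknown-", "FEMALE", "-unknown-", "OTHER"))

# Setting values to NA does not remove the level itself.
gender[as.character(gender) == "-unknown-"] <- NA
levels_before <- levels(gender)

# Drop levels that no longer occur in the data.
gender <- droplevels(gender)
levels_after <- levels(gender)

levels_before
levels_after
```

Unused levels would otherwise show up as empty categories in tables and on plot axes.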
We are done with the pre-processing. Whew. :P
train_users_2
Create a data frame without NA values in the gender column in the df.
gender_male_female_others <- filter(train_users_2, is.na(gender)==FALSE)
language_no_en = filter(train_users_2, language != "en")
plot_gender <- ggplot(train_users_2, aes(gender)) + geom_bar(aes(fill=..count..))
plot_age <- ggplot(train_users_2, aes(age)) + geom_bar(aes(fill = ..count..))
plot_language <- ggplot(train_users_2, aes(language)) + geom_bar(aes(fill = ..count..)) + theme(axis.text.x = element_text(angle=90, hjust=0))
plot_language_no_en <- ggplot(language_no_en, aes(language)) + geom_bar(aes(fill=..count..))
grid.arrange(plot_gender, plot_age, plot_language, plot_language_no_en, ncol=2)
This data, along with the user’s location, could be used to identify which regions (inside countries) use which language and, maybe, to show targeted ads to those communities.
Create a new data frame with only the male and female values after removing the NA values from the gender column.
gender_male_female <- filter(train_users_2, is.na(gender)==FALSE , gender != "OTHER")
ggplot(gender_male_female, aes(age)) + geom_density(aes(fill = gender), alpha=0.7) + ggtitle("Density plot Age and Gender")
Based on age, there is almost no difference between the number of men and women who use AirBnB. Men and women in their 30s are the most prominent users of AirBnB.
Before we begin to analyze the graphs, let us understand what affiliate marketing is.
Affiliate marketing is a type of performance-based marketing in which a business rewards one or more affiliates for each visitor or customer brought by the affiliate’s own marketing efforts. Affiliate marketing is quickly becoming a powerful way to increase sales.
Create a data frame without the direct affiliate channel and the direct affiliate provider. We do this to clearly view the other affiliate channels and affiliate providers, which have a smaller percentage (contribution).
aff_ch_prov_no_direct <- filter(train_users_2, affiliate_channel != "direct", affiliate_provider != "direct", is.na(gender)==FALSE)
plot_affiliate_provider_channel <- ggplot(train_users_2, aes(affiliate_provider)) + geom_bar(aes(fill=affiliate_channel)) + theme(axis.text.x = element_text(angle=90, hjust=0))
plot_affiliate_provider_channel_no_direct <- ggplot(aff_ch_prov_no_direct, aes(affiliate_provider)) + geom_bar(aes(fill=affiliate_channel)) + theme(axis.text.x = element_text(angle=90, hjust=0))
grid.arrange(plot_affiliate_provider_channel, plot_affiliate_provider_channel_no_direct, ncol=2)
The 2 plots show the distribution of the affiliate channels used by different affiliate providers.
These graphs show the comparison of the usage of the AirBnB platform based on the age demographic. The user might or might not book a hotel, but he/she is scrolling around on the AirBnB platform (mobile app/web app).
age_affiliate_channel <- ggplot(train_users_2, aes(affiliate_channel)) + geom_bar(aes(fill=age_brackets)) + theme(axis.text.x = element_text(angle=90, hjust=0))
age_affiliate_provider <- ggplot(train_users_2, aes(affiliate_provider)) + geom_bar(aes(fill=age_brackets)) + theme(axis.text.x = element_text(angle=90, hjust=0))
age_affiliate_channel_no_direct <- ggplot(aff_ch_prov_no_direct, aes(affiliate_channel)) + geom_bar(aes(fill=age_brackets), position = position_dodge()) + theme(axis.text.x = element_text(angle=90, hjust=0))
age_affiliate_provider_no_direct <- ggplot(aff_ch_prov_no_direct, aes(affiliate_provider)) + geom_bar(aes(fill=age_brackets), position = position_dodge()) + theme(axis.text.x = element_text(angle=90, hjust=0))
# gender_affiliate_channel <- ggplot(gender_male_female_others, aes(affiliate_channel)) + geom_bar(aes(fill=gender)) + theme(axis.text.x = element_text(angle=90, hjust=0))
# gender_affiliate_provider <- ggplot(gender_male_female_others, aes(affiliate_provider)) + geom_bar(aes(fill=gender)) + theme(axis.text.x = element_text(angle=90, hjust=0))
grid.arrange(age_affiliate_channel, age_affiliate_provider, age_affiliate_channel_no_direct, age_affiliate_provider_no_direct, ncol=2)
After removing the direct affiliate channel from the plot, we observe that the majority of people in their 50s and 60s are targeted by the sem-brand and sem-non-brand channels (these names presumably stand for search engine marketing on branded and non-branded keywords).
After removing the direct affiliate provider, we observe that Google caters more to people in their 30s.
These graphs show the comparison of the usage of the AirBnB platform based on the gender demographic. The user might or might not book a hotel, but he/she is scrolling around on the AirBnB platform (mobile app/web app).
gender_affiliate_channel <- ggplot(gender_male_female_others, aes(affiliate_channel)) + geom_bar(aes(fill=gender), position = position_dodge()) + theme(axis.text.x = element_text(angle=90, hjust=0))
gender_affiliate_provider <- ggplot(gender_male_female_others, aes(affiliate_provider)) + geom_bar(aes(fill=gender), position = position_dodge()) + theme(axis.text.x = element_text(angle=90, hjust=0))
gender_affiliate_channel_no_direct <- ggplot(aff_ch_prov_no_direct, aes(affiliate_channel)) + geom_bar(aes(fill=gender), position = position_dodge()) + theme(axis.text.x = element_text(angle=90, hjust=0))
gender_affiliate_provider_no_direct <- ggplot(aff_ch_prov_no_direct, aes(affiliate_provider)) + geom_bar(aes(fill=gender), position = position_dodge()) + theme(axis.text.x = element_text(angle=90, hjust=0))
grid.arrange(gender_affiliate_channel, gender_affiliate_provider, gender_affiliate_channel_no_direct, gender_affiliate_provider_no_direct, ncol=2)
After removing the direct affiliate channel, we observe that sem-brand and sem-non-brand are the two most popular channels, followed by API and SEO (Search Engine Optimisation). With the exception of the API channel, all other channels cater to more women than men.
plot_signup_method_app <- ggplot(train_users_2, aes(signup_app)) + geom_bar(aes(fill=signup_method), position = position_dodge()) + ggtitle("Signup App and Signup Method")
plot_signup_method_app
signup_no_na_gender <- filter(train_users_2, is.na(gender)==FALSE)
# plot_signup_app_age <- ggplot(train_users_2, aes(signup_app)) + geom_bar(aes(fill = age_brackets))
plot_signup_app_age <- ggplot(train_users_2, aes(signup_app)) + geom_bar(aes(fill = age_brackets), position = position_dodge())
plot_signup_app_gender <- ggplot(signup_no_na_gender, aes(signup_app)) + geom_bar(aes(fill = gender), position = position_dodge())
plot_signup_method_age <- ggplot(train_users_2, aes(signup_method)) + geom_bar(aes(fill=age_brackets), position = position_dodge())
plot_signup_method_gender <- ggplot(signup_no_na_gender, aes(signup_method)) + geom_bar(aes(fill=gender), position = position_dodge())
grid.arrange(plot_signup_app_age, plot_signup_app_gender, plot_signup_method_age, plot_signup_method_gender, ncol=2)
As expected, elderly people rarely use smartphones to access AirBnB. A large number of people in their 20s, 30s, and 40s use their computers to access the AirBnB platform. One would expect the “tech savvy” users in their 20s to use smartphones more, but that isn’t the case (note that there are a lot more people in the 30s age bracket, so this observation might be wrong).
More women prefer signing up using their computers while more men prefer iOS/Android apps.
A lot more people in their 30s prefer signing up using email as compared to Facebook. An almost weirdly equal number of people in their 20s and 30s prefer to sign up using Facebook.
More women than men prefer to use the Facebook and email signup methods. Compared to the other 2, the Google signup method is like a 404 error: Does Not Exist.
Create a table without NDF in the country_destination column and NA values in the gender column.
NDF == No Destination Found. Meaning the user hasn’t booked any hotel.
user_no_NDF_no_gender <- filter(train_users_2, country_destination != "NDF", !is.na(gender))
ggplot(user_no_NDF_no_gender, aes(first_device_type)) + geom_bar(aes(fill=..count..)) + theme(axis.text.x = element_text(angle=90, hjust=0)) + ggtitle("First device type vs Gender") + facet_grid(.~gender)
ggplot(user_no_NDF_no_gender, aes(first_device_type)) + geom_bar(aes(fill=..count..)) + theme(axis.text.x = element_text(angle=90, hjust=0)) + ggtitle("First device type vs Age brackets") +facet_grid(.~age_brackets)
num_booking <- train_users_2 %>%
filter(!is.na(date_first_booking)) %>%
group_by(date_first_booking) %>%
summarise(num = n()) %>%
ggplot(aes(date_first_booking, num)) + geom_line(aes(color=num)) + scale_colour_gradient(low='blue', high='red') + ggtitle("Number of bookings over the years (2010-2015)")
num_accounts <- train_users_2 %>%
filter(!is.na(date_account_created)) %>%
group_by(date_account_created) %>%
summarise(num = n()) %>%
ggplot(aes(date_account_created, num)) + geom_line(aes(color=num)) + scale_colour_gradient(low='blue', high='red') + ggtitle("Number of accounts created over the years (2010-2014)")
grid.arrange(num_booking, num_accounts, ncol=2)
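The group_by plus summarise(num = n()) step boils down to counting rows per calendar day. A base R sketch of the same idea on toy dates (made up for illustration) looks like this:

```r
first_booking <- as.Date(c("2014-01-01", "2014-01-01", "2014-01-02",
                           NA, "2014-01-04"))

# Count rows per day, ignoring users without a booking.
tab <- table(first_booking[!is.na(first_booking)])

daily <- data.frame(date = as.Date(names(tab)),
                    num  = as.integer(tab))
daily
```

Note that days with no bookings at all (2014-01-03 here) simply do not appear, so a line plot connects the neighbouring days directly instead of dropping to zero.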
filter(train_users_2, date_first_booking >= "2015-06-29")
filter(train_users_2, date_account_created >= "2014-06-30")
num_book <- function(yr) {
plot_title <- paste("Number of bookings in", yr)
num_bookings_x <- train_users_2 %>%
filter(!is.na(date_first_booking), year(date_first_booking) == yr) %>%
group_by(date_first_booking) %>%
summarise(num = n()) %>%
ggplot(aes(date_first_booking, num)) + geom_line(aes(color=num)) + scale_colour_gradient(low='blue', high='red') + ggtitle(plot_title)
num_bookings_x
}
grid.arrange(num_book(2010), num_book(2011), num_book(2012), num_book(2013), num_book(2014),num_book(2015), ncol=2)
“In July 2014, Airbnb revealed design revisions to the site and mobile app and introduced a new logo. Some considered the new logo to be visually similar to genitalia, but a consumer survey by Survata showed only a minority of respondents thought this was the case.”
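Google “AirBnB 2014” to find possible reasons for the sudden drop in the number of bookings in 2014. Keep in mind, though, that the dataset only contains accounts created up to 2014-06-30 (see the summary above), so first bookings after that date can only come from users who had already signed up. At least part of the drop is likely an artifact of this cutoff.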
num_acc <- function(yr) {
plot_title <- paste("Number of accounts created in", yr)
num_accounts_x <- train_users_2 %>%
filter(!is.na(date_account_created), year(date_account_created) == yr) %>%
group_by(date_account_created) %>%
summarise(num = n()) %>%
ggplot(aes(date_account_created, num)) + geom_line(aes(color=num)) + scale_colour_gradient(low='blue', high='red') + ggtitle(plot_title)
num_accounts_x
}
grid.arrange(num_acc(2010), num_acc(2011), num_acc(2012), num_acc(2013), num_acc(2014), ncol=2)
AirBnB could probably reduce prices or give more discounts and offers during the months of August, September, and October so that more people book hotels.
train_users_2 %>%
filter(!is.na(time_signup_to_booking)) %>%
group_by(age_brackets) %>%
ggplot(aes(x = age_brackets, y = time_signup_to_booking)) + geom_boxplot(aes(fill=gender)) + facet_grid(.~gender) + ggtitle("Time between Signup and Booking based on Gender and Age")
The colored boxes indicate the interquartile range which represents the middle 50% of the data. The whiskers extend from either side of the box. The whiskers represent the ranges for the bottom 25% and the top 25% of the data values, excluding outliers.
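The numbers behind a box plot are easy to compute by hand. The sketch below uses toy values and the usual 1.5 × IQR rule, which is also the default that geom_boxplot uses to decide which points count as outliers.

```r
days <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 200)  # toy waiting times in days

q   <- quantile(days, c(0.25, 0.5, 0.75))  # box edges and median
iqr <- IQR(days)                           # height of the box

# Points beyond these fences are drawn individually as outliers.
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
outliers <- days[days < lower_fence | days > upper_fence]

q
outliers
```

The whiskers then stop at the most extreme values that are still inside the fences, which is why a few very long waiting times show up as separate dots.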
train_users_2 %>%
filter(!is.na(time_first_active_to_booking), time_first_active_to_booking !=0 ) %>%
group_by(time_first_active_to_booking) %>%
summarise(count = n()) %>%
ggplot(aes(time_first_active_to_booking, count)) + geom_point(aes(color=count)) + scale_colour_gradient(low='blue', high='red') + ggtitle("Time (in days) between first booking and first activity")
train_users_2 %>%
filter(!is.na(time_signup_to_booking)) %>%
group_by(time_signup_to_booking) %>%
summarise(count = n()) %>%
ggplot(aes(time_signup_to_booking, count)) + geom_point(aes(color=count)) + scale_colour_gradient(low='blue', high='red') + ggtitle("Time (in days) between first booking and signup")
Let’s analyse the negative values. How many negative values are there?
train_users_2 %>%
filter(time_signup_to_booking < 0)
We see that there are only 29 negative values.
This means that there were 29 users who were able to book their rooms without creating an account!
Let’s see in which years this happened.
After keeping only the negative values of time_signup_to_booking, that is, people who booked hotels before signing up, we plot the following graph.
This graph tells us that users could book before signing up on the AirBnB platform from 2010 to 2013.
train_users_2 %>%
filter(time_signup_to_booking < 0) %>%
mutate(year_first_booking = year(date_first_booking)) %>%
group_by(year_first_booking) %>%
ggplot(aes(year_first_booking)) + geom_bar(aes(fill = ..count..))
The following comment was released by AirBnB.
“Up until early 2013 there was a handful of flows where a user was able to book before fully creating an account (by the definition of account creation we use today). After early 2013 this is no longer possible.”
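The same check can be reproduced in base R on a handful of made-up rows: compute the gap, keep the negative ones, and count them per booking year.

```r
bookings <- data.frame(
  date_account_created = as.Date(c("2010-05-02", "2012-07-10",
                                   "2013-01-20", "2014-02-01")),
  date_first_booking   = as.Date(c("2010-04-30", "2012-07-08",
                                   "2013-02-01", "2014-02-03"))
)

# Days from signup to first booking; negative means "booked first".
bookings$gap_days <- as.integer(bookings$date_first_booking -
                                bookings$date_account_created)

early <- bookings[bookings$gap_days < 0, ]
early_per_year <- table(format(early$date_first_booking, "%Y"))
early_per_year
```

In this toy data, two users booked before signing up, one in 2010 and one in 2012.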
# train_users_2 %>%
# filter(!is.na(date_first_booking), year(date_first_booking) == 2012, age_brackets=="20s") %>%
# group_by(date_first_booking) %>%
# summarise(num = n()) %>%
# ggplot(aes(date_first_booking, num)) + geom_point(aes(color=num)) + scale_colour_gradient(low='blue', high='green')
#
# train_users_2 %>%
# filter(!is.na(date_first_booking), year(date_first_booking) == 2012) %>%
# mutate(genders = gender)
# group_by(date_first_booking) %>%
# summarise(num = n()) %>%
# ggplot(aes(date_first_booking, num)) + geom_point(aes(fill=gender))
book_not_book <- train_users_2 %>%
mutate(Booked = ifelse(country_destination == "NDF", "Not Booked", "Booked")) %>%
ggplot(aes(Booked)) + geom_bar(aes(fill=..count..))
book_gender <- train_users_2 %>%
mutate(Booked = ifelse(country_destination == "NDF", "Not Booked", "Booked")) %>%
ggplot(aes(Booked)) + geom_bar(aes(fill=gender), position = position_dodge())
book_age <- train_users_2 %>%
mutate(Booked = ifelse(country_destination == "NDF", "Not Booked", "Booked")) %>%
ggplot(aes(Booked)) + geom_bar(aes(fill=age_brackets), position = position_dodge())
book_country <- train_users_2 %>%
# mutate(Booked = ifelse(country_destination == "NDF", "Not Booked", "Booked")) %>%
ggplot(aes(country_destination)) + geom_bar(aes(fill=..count..))
grid.arrange(book_not_book, book_gender, book_age, book_country, ncol=2)
The graph shows that people in their 20s, 30s, and 40s are AirBnB’s base customers.
NDF means no booking was made.
Other means a booking was made to a country other than the 10 countries listed by name.
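The Booked flag used in the plots is just a relabelling of country_destination. A base R sketch on toy values also shows how to get the share of each group. For reference, in the summary at the top 124,543 of 213,451 users (about 58%) are NDF.

```r
country_destination <- c("NDF", "US", "NDF", "FR", "other", "NDF")  # toy values

# Anything that is not NDF counts as a booking, including "other".
booked <- ifelse(country_destination == "NDF", "Not Booked", "Booked")

counts <- table(booked)
shares <- prop.table(counts)

counts
shares
```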
# NDF means "No Destination Found". Which means there wasn't a booking.
country_dest_no_NDF_no_gender <- filter(train_users_2, country_destination != "NDF", is.na(gender)==FALSE)
country_dest_NDF_no_gender <- filter(train_users_2, is.na(gender)==FALSE)
# plot_dest_country_with_NDF <- ggplot(train_users_2, aes(country_destination)) + geom_bar(aes(fill=..count..))
# plot_dest_country_without_NDF <- ggplot(country_dest_without_NDF, aes(country_destination)) + geom_bar(aes(fill=..count..))
plot_dest_country_without_NDF_age <- ggplot(country_dest_no_NDF_no_gender, aes(country_destination)) + geom_bar(aes(fill=age_brackets), position = position_dodge())
plot_dest_country_without_NDF_gender <- ggplot(country_dest_no_NDF_no_gender, aes(country_destination)) + geom_bar(aes(fill=gender), position = position_dodge())
plot_dest_country_with_NDF_age <- ggplot(country_dest_NDF_no_gender, aes(country_destination)) + geom_bar(aes(fill=age_brackets), position = position_dodge())
plot_dest_country_with_NDF_gender <- ggplot(country_dest_NDF_no_gender, aes(country_destination)) + geom_bar(aes(fill=gender), position = position_dodge())
grid.arrange(plot_dest_country_with_NDF_age, plot_dest_country_with_NDF_gender, plot_dest_country_without_NDF_age, plot_dest_country_without_NDF_gender, ncol=2)
Note that the travel can be both national and international as the country of origin of users is not provided in the dataset.
country_affiliate_provider_no_usa <- filter(country_dest_no_NDF_no_gender, country_destination != "US")
plot_country_affiliate_channel <- ggplot(country_dest_no_NDF_no_gender, aes(country_destination)) + geom_bar(aes(fill=affiliate_channel) ,position = position_dodge()) + ggtitle("With US")
plot_country_affiliate_provider_no_usa <- ggplot(country_affiliate_provider_no_usa, aes(country_destination)) + geom_bar(aes(fill=affiliate_channel) ,position = position_dodge()) + ggtitle("Without US")
grid.arrange(plot_country_affiliate_channel, plot_country_affiliate_provider_no_usa, ncol=1)
country_affiliate_provider_no_usa <- filter(country_dest_no_NDF_no_gender, country_destination != "US")
plot_country_affiliate_provider <- ggplot(country_dest_no_NDF_no_gender, aes(country_destination)) + geom_bar(aes(fill=affiliate_provider), position = position_dodge()) + ggtitle("With US")
plot_country_affiliate_provider_no_usa <- ggplot(country_affiliate_provider_no_usa, aes(country_destination)) + geom_bar(aes(fill=affiliate_provider), position = position_dodge()) + ggtitle("Without US")
grid.arrange(plot_country_affiliate_provider, plot_country_affiliate_provider_no_usa, nrow=2 )
After the direct affiliate provider, Google plays an important role in confirming the bookings.
num_book_monthly <- function(yr) {
plot_title <- paste("Aggregate Monthly Bookings | Males vs Females", yr)
num_book_mf <- gender_male_female %>%
filter(year(date_first_booking) == yr) %>%
group_by(month(date_first_booking), gender) %>%
ggplot(aes(month(date_first_booking, label = T))) + geom_bar(aes(fill=gender), position = position_dodge()) + ggtitle(plot_title)
num_book_mf
}
grid.arrange(num_book_monthly(2010), num_book_monthly(2011), num_book_monthly(2012), num_book_monthly(2013), num_book_monthly(2014), num_book_monthly(2015), ncol=2)
num_book_weekly <- function(yr) {
plot_title <- paste("Aggregate Weekly Bookings | Males vs Females", yr)
num_book_wk <- gender_male_female %>%
filter(year(date_first_booking) == yr) %>%
group_by(weekdays(date_first_booking), gender) %>%
ggplot(aes(weekdays(date_first_booking))) + geom_bar(aes(fill=gender), position = position_dodge()) + ggtitle(plot_title)
num_book_wk
}
grid.arrange(num_book_weekly(2010), num_book_weekly(2011), num_book_weekly(2012), num_book_weekly(2013), num_book_weekly(2014), num_book_weekly(2015), ncol=2)
calh <- train_users_2 %>%
select(date_first_booking) %>%
filter(year(date_first_booking) >= 2010, !is.na(date_first_booking)) %>%
group_by(date_first_booking) %>%
mutate (bookings = n(),
year = year(date_first_booking),
yearmonthf = factor(as.yearmon(date_first_booking)),
monthf = as.factor(month(date_first_booking, label=TRUE, abbr = TRUE)),
week = week(date_first_booking),
monthweek = ceiling(day(date_first_booking)/7),
weekdayf = wday(date_first_booking, label = TRUE, abbr = TRUE))
ggplot(calh, aes(monthweek, weekdayf, fill=bookings)) + geom_tile(colour = "white") + facet_grid(year~monthf) +
scale_fill_gradient(low="blue", high="red") +
labs(x="Week of Month",
y="",
title = "Daily Variation in number of bookings",
fill="Bookings")
We observe that AirBnB went from 0-50 bookings per day in 2010 to almost 200 bookings per day in the early months of 2014.
The number dropped down to around 100 bookings per day thereafter.
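One detail of the calendar heatmap is the monthweek column, computed as ceiling(day / 7). A small base R sketch on toy dates shows what it produces:

```r
d <- as.Date(c("2014-03-01", "2014-03-07", "2014-03-08", "2014-03-31"))

day_of_month <- as.integer(format(d, "%d"))

# Days 1-7 -> 1, days 8-14 -> 2, ..., days 29-31 -> 5.
monthweek <- ceiling(day_of_month / 7)

data.frame(d, day_of_month, monthweek)
```

This splits each month into 7-day blocks counted from the 1st, so the blocks do not line up with calendar weeks. That is good enough for a heatmap, but a given column does not start on the same weekday in every month.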
Thank you for reading.
Suggestions and constructive criticism are welcome. :)
You can find me on LinkedIn. Find my other blogposts here.