Tucker (2016) found that targeted advertising is an emerging field in digital marketing in which companies invest in understanding user behavior on the internet. By segmenting users according to their visits to the company website, the actions they take, and their level of engagement, companies can send marketing material tailored to their business requirements (Svetlik, 2017). This exploratory study analyzes the user pathway channels of the Airbnb peer-to-peer sharing platform. Users visit the portal from different devices and applications, perform searches, send messages, and carry out other activities. We analyze this data to identify key trends and user behavior patterns.
Some of the questions to explore in this research are:
For this research project, I chose to visualize the open-source dataset available at databits.io. I have been a regular Airbnb customer for the last 3 years and wanted to explore this use case to better understand user behavior on the portal. The dataset is in .txt format and can be downloaded from the Databits website.
Source: Databits (2018)
The data frame contains user pathway data for the period 05/05/2014 to 04/23/2015.
It has 7756 rows and 21 variables describing the sessions of users who access the portal. The names and descriptions of some of the columns are provided for reference.
id_visitor: id of the visitor
id_session: id of the session
dim_session_number: the number of session on a given day for a visitor
dim_user_agent: user agent of the session
dim_device_app_combo: parsed out device/app combo from user agent
ds: date stamp of session
ts_min: time of session start
ts_max: time of session end
did_search: binary flag indicating if the visitor performed a search during the session
sent_message: binary flag indicating if the visitor sent a message during the session
sent_booking_request: binary flag indicating if the visitor sent a booking request during the session
next_id_session: Next Session ID
next_dim_session_number: Next number of sessions for visitor
next_dim_device_app_combo: Next parsed out device/application combination
next_ds: Next session date stamp
next_ts_min: Next session start time
next_ts_max: Next session end time
next_did_search: Next session - did search? (0 or 1)
next_sent_message: Next session - sent message? (0 or 1)
next_sent_booking_request: Next session - booking request? (0 or 1)
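The next_* columns appear to describe each visitor's following session. The sketch below assumes that reading and uses a small made-up data frame (the name `toy` and its values are illustrative, not taken from the dataset). It shows how such a column could be derived with dplyr's `lead()`, and why each visitor's final session would then carry NA in its next_* fields.

```r
library(dplyr)

# Hypothetical toy sessions: visitor "a" has 3 sessions, visitor "b" has 1
toy <- data.frame(
  id_visitor = c("a", "a", "a", "b"),
  id_session = c("s1", "s2", "s3", "s4"),
  dim_session_number = c(1, 2, 3, 1),
  stringsAsFactors = FALSE
)

# Within each visitor, order the sessions and look one row ahead
toy_next <- toy %>%
  group_by(id_visitor) %>%
  arrange(dim_session_number, .by_group = TRUE) %>%
  mutate(next_id_session = lead(id_session)) %>%
  ungroup()

# A visitor's last session has no successor, so the number of NA values
# equals the number of distinct visitors
n_missing_next <- sum(is.na(toy_next$next_id_session))
n_visitors <- n_distinct(toy_next$id_visitor)
```

If the real columns were built this way, the count of NA values in next_id_session would be a quick estimate of the number of distinct visitors.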
To answer the exploratory questions, we will use graphical methods such as bar charts, mapping, scatter plots, line plots and other appropriate tools. We also need to ensure that the variable classes are well defined and that data points are segregated. In particular, the combined device/application variable needs to be separated into new columns for further analysis.
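As a minimal sketch of that separation step, the example below splits a combined "Device - Application" string into two columns with tidyr's `separate()`. The data frame `toy` and the value "Unknown" are hypothetical and only illustrate what happens when a row has no separator.

```r
library(tidyr)

# Hypothetical combos in the "Device - Application" style
toy <- data.frame(
  dim_device_app_combo = c("iPhone - iOS", "Desktop - Chrome", "Unknown"),
  stringsAsFactors = FALSE
)

# sep is treated as a regular expression; fill = "right" pads a missing
# second piece with NA instead of emitting a warning
toy_split <- separate(
  toy, dim_device_app_combo,
  into = c("Device", "Application"),
  sep = " - ", fill = "right"
)
```

Checking for NA in the new Application column afterwards is a cheap way to spot rows that did not follow the expected pattern.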
library(tibble) # used to create tibbles
library(tidyr) # used to tidy up data
## Warning: package 'tidyr' was built under R version 3.4.4
library(lubridate) # used for date/time functions
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(magrittr) # used for piping
## Warning: package 'magrittr' was built under R version 3.4.4
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
##
## extract
library(ggplot2) # used for data visualization
## Warning: package 'ggplot2' was built under R version 3.4.4
library(dplyr) # used for data manipulation
## Warning: package 'dplyr' was built under R version 3.4.4
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Internetdata <- read.delim(url ("http://databits.io/static_content/challenges/airbnb-user-pathways-challenge/airbnb_session_data.txt"), sep = "|", na.strings = 'NULL')
Internetdata_tib<-as_tibble(Internetdata)
#gives dimension of the dataframe
dim(Internetdata_tib)
## [1] 7756 21
# Clean the dataset: use separate() to split the combined column
# into device and application details
Internet_Clean <- Internetdata_tib %>%
  separate(dim_device_app_combo, into = c("Device", "Application"), sep = " - ") %>%
  separate(next_dim_device_app_combo, into = c("Next_Device", "Next_Application"), sep = " - ")
# Convert the start and end times from string to date/time format
Internet_Clean$Start_Time <- ymd_hms(Internet_Clean$ts_min)
Internet_Clean$End_Time <- ymd_hms(Internet_Clean$ts_max)
Internet_Clean$Next_Start_Time <- ymd_hms(Internet_Clean$next_ts_min)
Internet_Clean$Next_End_Time <- ymd_hms(Internet_Clean$next_ts_max)
# Create a duration variable to measure user activity.
# Duration is a difftime that R reports in seconds, so dividing by 60 yields
# minutes, although the object keeps its "secs" unit label.
Internet_Clean$Duration <-
  (Internet_Clean$End_Time - Internet_Clean$Start_Time) / 60
Internet_Clean$Next_Duration <-
  (Internet_Clean$Next_End_Time - Internet_Clean$Next_Start_Time) / 60
# Summary of the cleaned dataset
summary(Internet_Clean)
## id_visitor
## f70f0c27-6af3-4fb5-8bd7-f73240465fd5: 702
## 98c352ee-12e0-4d43-99a3-97edd8dd4bb1: 431
## 46cf3a9c-43d5-471c-ada4-4227f5c27384: 412
## b03087b5-6b04-4e91-be49-5b254dd0e839: 340
## fd61c634-6876-4c76-81b3-6da2f86a92b8: 276
## 39b7ba8d-8549-429d-b0cc-65e7f55b2142: 228
## (Other) :5367
## id_session dim_session_number
## 0001d1236f9ef2b7d05ab8a1d48b94cf: 1 Min. : 1.00
## 0004d0fa8cabd64ab5f09b0684f16f3c: 1 1st Qu.: 11.00
## 000c762eb1b8a5a1b1d4860caaae0efe: 1 Median : 46.00
## 001345e6c8a1363661079c01c8dcf6b8: 1 Mean : 98.09
## 00210d71bcbceeb0bf81a8a8d7fb8d82: 1 3rd Qu.:128.00
## 0023c24ee6d106da74f4bf4131eadb24: 1 Max. :702.00
## (Other) :7750
## dim_user_agent
## Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_1 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Mobile/11D201: 292
## Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko : 241
## Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko : 225
## : 219
## Airbnb/4.2.0 iPhone/7.1.1 : 210
## Airbnb/4.7.0 iPhone/8.1.2 : 197
## (Other) :6372
## Device Application ds
## Length:7756 Length:7756 2014-12-15: 61
## Class :character Class :character 2014-09-02: 57
## Mode :character Mode :character 2014-09-05: 56
## 2014-09-06: 54
## 2014-12-23: 54
## 2014-12-01: 53
## (Other) :7421
## ts_min ts_max did_search
## 2014-08-14 11:31:12: 4 2014-09-26 21:59:56: 4 Min. :0.0000
## 2014-08-25 21:14:07: 4 2014-07-09 12:01:38: 3 1st Qu.:0.0000
## 2014-08-31 02:22:33: 4 2014-08-01 08:14:23: 3 Median :0.0000
## 2014-09-06 20:31:31: 4 2014-08-14 11:31:12: 3 Mean :0.1594
## 2014-06-03 09:00:00: 3 2014-08-18 21:32:29: 3 3rd Qu.:0.0000
## 2014-06-17 23:58:52: 3 2014-08-25 21:14:07: 3 Max. :1.0000
## (Other) :7734 (Other) :7737
## sent_message sent_booking_request
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.1649 Mean :0.0187
## 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000
##
## next_id_session next_dim_session_number
## e42bbc9b3f21b8e4415205ce85c8d809: 6 Min. : 2.0
## f0431d5843af09bb3587adc4da5563c8: 6 1st Qu.: 17.0
## ff9b22941e3e7f05a467454da5ebaca8: 6 Median : 56.0
## 012843f8953e428b9236a4a9a3393b18: 5 Mean :106.7
## 01696a8017226b7603d063cc7e6a56b5: 5 3rd Qu.:140.8
## (Other) :7098 Max. :702.0
## NA's : 630 NA's :630
## next_dim_user_agent
## Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_1 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Mobile/11D201: 292
## Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko : 238
## Airbnb/4.2.0 iPhone/7.1.1 : 210
## Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko : 207
## : 199
## (Other) :5980
## NA's : 630
## Next_Device Next_Application next_ds
## Length:7756 Length:7756 2014-12-15: 59
## Class :character Class :character 2014-09-02: 55
## Mode :character Mode :character 2014-09-05: 54
## 2014-12-23: 54
## 2014-09-06: 52
## (Other) :6852
## NA's : 630
## next_ts_min next_ts_max next_did_search
## 2014-08-14 11:31:12: 4 2014-09-26 21:59:56: 4 Min. :0.0000
## 2014-08-25 21:14:07: 4 2014-07-09 12:01:38: 3 1st Qu.:0.0000
## 2014-08-31 02:22:33: 4 2014-08-01 08:14:23: 3 Median :0.0000
## 2014-09-06 20:31:31: 4 2014-08-14 11:31:12: 3 Mean :0.1458
## 2014-06-03 09:00:00: 3 2014-08-18 21:32:29: 3 3rd Qu.:0.0000
## (Other) :7107 (Other) :7110 Max. :1.0000
## NA's : 630 NA's : 630 NA's :630
## next_sent_message next_sent_booking_request Start_Time
## Min. :0.0000 Min. :0.0000 Min. :2014-05-05 08:57:33
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:2014-09-04 01:13:34
## Median :0.0000 Median :0.0000 Median :2014-11-06 00:12:10
## Mean :0.1756 Mean :0.0194 Mean :2014-11-03 11:24:56
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:2015-01-07 04:14:09
## Max. :1.0000 Max. :1.0000 Max. :2015-04-23 06:02:24
## NA's :630 NA's :630
## End_Time Next_Start_Time
## Min. :2014-05-05 08:58:06 Min. :2014-05-06 10:58:16
## 1st Qu.:2014-09-04 01:25:12 1st Qu.:2014-09-05 19:45:30
## Median :2014-11-06 00:19:57 Median :2014-11-07 22:49:08
## Mean :2014-11-03 11:35:41 Mean :2014-11-04 22:25:09
## 3rd Qu.:2015-01-07 04:14:26 3rd Qu.:2015-01-07 18:13:59
## Max. :2015-04-23 06:02:24 Max. :2015-04-22 09:04:56
## NA's :630
## Next_End_Time Duration Next_Duration
## Min. :2014-05-06 10:58:33 Length:7756 Length:7756
## 1st Qu.:2014-09-05 19:50:34 Class :difftime Class :difftime
## Median :2014-11-07 22:58:47 Mode :numeric Mode :numeric
## Mean :2014-11-04 22:36:08
## 3rd Qu.:2015-01-07 18:33:26
## Max. :2015-04-22 09:30:29
## NA's :630
mean(Internet_Clean$Duration)
## Time difference of 10.74447 secs
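From this, we can say that there were a total of 5373 visits to the website and that 7756 sessions were recorded for 630 unique users. The mean session duration is about 10.7 minutes: Duration was divided by 60, so the value is in minutes even though the printed unit label still reads "secs". The maximum is reported as 641.2; if it comes from the same Duration variable, it is also in minutes (about 10.7 hours) rather than seconds.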
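The unit of the duration is worth checking, because subtracting two date-times lets R pick the unit automatically and later arithmetic does not update the label. The sketch below uses two made-up sessions (illustrative values only) to show the mismatch and one way to avoid it by requesting minutes explicitly with `difftime()`.

```r
# Hypothetical session start and end times (illustrative values only)
start_time <- as.POSIXct(c("2014-05-05 08:57:33", "2014-05-05 09:00:00"), tz = "UTC")
end_time   <- as.POSIXct(c("2014-05-05 08:58:06", "2014-05-05 09:30:00"), tz = "UTC")

# Subtraction chooses the unit automatically; dividing by 60 rescales the
# numbers but the object still carries the original unit label
auto_diff <- (end_time - start_time) / 60
units(auto_diff)

# Requesting the unit explicitly and dropping the difftime class avoids
# any mismatch between the numbers and the label
duration_mins <- as.numeric(difftime(end_time, start_time, units = "mins"))
mean(duration_mins)
```

With this approach the numbers and their unit are stated in one place, so summaries such as the mean can be read directly as minutes.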
Let us find out which devices are used to access the Airbnb website.
# Devices used to access the website
Webdevice <- Internet_Clean %>% count(Device, sort = TRUE)
## Warning: package 'bindrcpp' was built under R version 3.4.4
# Webdevice is already counted, so map n to the bar height directly;
# stat = "count" would draw every bar with height 1
g1 <- ggplot(Webdevice, aes(x = Device, y = n))
g1 + geom_bar(stat = "identity") + ggtitle("Bar plot showing sessions by device")
# Applications used to access the website
Webapp <- Internet_Clean %>% group_by(Application) %>% count(Application, sort = TRUE)
Webapp
## # A tibble: 9 x 2
## # Groups: Application [9]
## Application n
## <chr> <int>
## 1 iOS 2251
## 2 Web 1768
## 3 Chrome 1181
## 4 Moweb 625
## 5 Android 465
## 6 Safari 443
## 7 IE 429
## 8 Firefox 327
## 9 Other 267
From the above graph, we can conclude that iPhone, Desktop, Android Phone and Android Tablet were the four most widely used devices to access the portal. Similarly, iOS, Web, Chrome and Moweb are the top four applications from which users logged in to the portal. Firefox is the least used of the named applications, with 327 sessions.
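To put the application counts on a common scale, they can be converted to shares of all sessions. The sketch below copies the counts from the Webapp table above into a named vector; the variable names `app_counts`, `app_share` and `top4_share` are illustrative.

```r
# Session counts per application, copied from the Webapp table above
app_counts <- c(iOS = 2251, Web = 1768, Chrome = 1181, Moweb = 625,
                Android = 465, Safari = 443, IE = 429, Firefox = 327,
                Other = 267)

# Share of all sessions per application, as percentages
app_share <- round(100 * prop.table(app_counts), 1)

# Combined share of the four most used applications
top4_share <- sum(sort(app_counts, decreasing = TRUE)[1:4]) / sum(app_counts)
```

The counts sum to 7756, matching the number of rows in the data frame, and the four largest applications together account for roughly three quarters of the sessions.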
Next, we break these results down by device, starting with the number of sessions from each application on iPhone.
# For iPhone devices
DiPhone <- Internet_Clean %>% filter(Device == "iPhone") %>%
  group_by(Application) %>% count(Application, sort = TRUE)
ggplot(DiPhone,
       aes(x = Application, y = n, fill = Application)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Sessions by Application",
       x = "Application", y = "Number of Sessions")
# For Android devices
Dandroid <- Internet_Clean %>% filter(Device == "Android Phone") %>%
  group_by(Application) %>% count(Application, sort = TRUE)
ggplot(Dandroid,
       aes(x = Application, y = n, fill = Application)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Sessions by Application",
       x = "Application", y = "Number of Sessions")
# One panel per application, with devices on the x-axis
ggplot(data = Internet_Clean, aes(Device)) +
  geom_bar() +
  facet_grid(Application ~ .) +
  ggtitle("Facet grid plot showing application and device spread")
From the graphs, there is a strong association between the device and the application used. iPhone users use the iOS app to access the portal, Android users use the Android app, and desktop users prefer Chrome.
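Because Device and Application are both categorical, one way to check this relationship numerically is a cross-tabulation with row proportions. The sketch below uses a small made-up data frame (the name `toy` and its rows are illustrative, not the real sessions).

```r
# Hypothetical device/application pairs (illustrative, not the real data)
toy <- data.frame(
  Device = c("iPhone", "iPhone", "iPhone", "Android Phone",
             "Android Phone", "Desktop", "Desktop", "Desktop"),
  Application = c("iOS", "iOS", "Moweb", "Android",
                  "Android", "Chrome", "Chrome", "Firefox"),
  stringsAsFactors = FALSE
)

# Contingency table of session counts: devices in rows, applications in columns
device_app <- table(toy$Device, toy$Application)

# Row proportions: for each device, the share of sessions per application
row_share <- prop.table(device_app, margin = 1)
round(row_share, 2)
```

Applied to Internet_Clean, a row dominated by a single application would support the reading that each device has one preferred access channel.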
Next, let us look at the number of visits made by each user.
Visits <- Internet_Clean %>% group_by(id_visitor) %>% count(id_visitor, sort = TRUE)
ggplot(Visits, aes(n)) +
  geom_density(kernel = "gaussian", color = "blue") +
  labs(title = "Distribution of visits per user", x = "Number of Visits", y = "Density")
plot(Internet_Clean$dim_session_number, Internet_Clean$dim_user_agent)
We can interpret that the majority of visitors visited only once or twice. The outlier is one visitor with 702 sessions on the portal. The mean of 98 reported by summary() is the mean of dim_session_number over all session rows, so it is dominated by heavy users; dividing 7756 sessions by 630 visitors gives about 12 sessions per visitor. We can also extract the data of users who visit the portal frequently and analyze their activity to see what they are looking for and what their booking conversion rate is. This could be of great value to digital marketing, which can send relevant promotions to these frequent visitors and create cross-selling opportunities.
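The difference between "mean sessions per visitor" and "mean of the per-row session number" can be made concrete with a short calculation. The sketch below uses the totals reported above plus a made-up vector `visits`; it assumes that dim_session_number counts a visitor's sessions cumulatively (1, 2, 3, ...) and that the 630 NA values in the next_* columns correspond to 630 distinct visitors.

```r
# Figures reported by dim() and summary() above
n_sessions <- 7756
n_visitors <- 630   # assumption: one NA in next_id_session per visitor
sessions_per_visitor <- n_sessions / n_visitors

# Toy illustration: three visitors with 1 session and one with 20
visits <- c(1, 1, 1, 20)

# Session numbers as they would appear row by row (1, 1, 1, 1, 2, ..., 20)
session_numbers <- unlist(lapply(visits, seq_len))

mean(visits)            # mean sessions per visitor
mean(session_numbers)   # mean of the per-row session number
```

In the toy data the per-row mean is noticeably larger than the per-visitor mean, because a heavy user contributes many rows with high session numbers.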
Next, we look at the actions performed by these visitors on the portal to understand their expectations and plan accordingly.
Actbyvisitor <- Internet_Clean %>%
  group_by(id_visitor) %>%
  filter(n() >= 20) %>%
  summarize(Search = sum(did_search),
            Message = sum(sent_message),
            Booking = sum(sent_booking_request))
# Check for any patterns of frequent visitors with searches and messages sent on the portal
ggplot(Actbyvisitor, aes(Search, Message)) +
  geom_jitter(color = 'blue') +
  labs(title = "List of actions done by frequent visitors (Search & Messages)")
Only users with at least 20 sessions on the portal are categorized as frequent visitors. There is a weak correlation between messages sent and searches on the platform.
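The code above keeps visitors with `n() >= 20`, that is, 20 or more sessions. The sketch below uses made-up data and a smaller, purely illustrative cut-off (`threshold`) to show how the choice between `>=` and `>` changes which visitors are retained.

```r
library(dplyr)

# Hypothetical sessions: visitor "a" has 3 rows, "b" has 2, "c" has 1
toy <- data.frame(
  id_visitor = c("a", "a", "a", "b", "b", "c"),
  did_search = c(1, 0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

threshold <- 2  # illustrative cut-off; the analysis above uses 20

# Visitors with at least `threshold` sessions
at_least <- toy %>%
  group_by(id_visitor) %>%
  filter(n() >= threshold) %>%
  summarize(Search = sum(did_search))

# Visitors with strictly more than `threshold` sessions
more_than <- toy %>%
  group_by(id_visitor) %>%
  filter(n() > threshold) %>%
  summarize(Search = sum(did_search))
```

Visitors sitting exactly on the cut-off are kept by the first pipeline and dropped by the second, so the wording of the definition should match the operator used.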
ggplot(Actbyvisitor, aes(Search, Booking)) +
  geom_jitter(color = 'green') +
  labs(title = "List of actions done by frequent visitors (Search & Booking)")
There is a moderate correlation between bookings made and searches on the platform. This insight can be passed on to the marketing team so that they can learn the key words or phrases being searched on the platform and focus their efforts on quicker booking conversion.
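Booking conversion can also be measured directly at the session level. The sketch below uses made-up 0/1 flags (the data frame `toy` is illustrative) and compares the share of sessions with a booking request between sessions with and without a search.

```r
# Hypothetical per-session flags (illustrative values only)
toy <- data.frame(
  did_search           = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0),
  sent_booking_request = c(1, 0, 0, 1, 0, 0, 1, 0, 0, 0)
)

# The mean of a 0/1 flag is a proportion, so this gives the share of
# sessions with a booking request, split by whether a search took place
booking_rate <- tapply(toy$sent_booking_request, toy$did_search, mean)
booking_rate
```

Run on Internet_Clean, the same two-line comparison would show whether sessions that include a search actually convert to booking requests more often.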
ggplot(Actbyvisitor, aes(Booking, Message)) +
  geom_jitter(color = 'brown') +
  labs(title = "List of actions done by frequent visitors (Booking & Messaging)")
There is a weak correlation between messages sent and bookings made on the platform.
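The strength of these relationships is judged from the scatter plots; a correlation matrix would put numbers on them. The sketch below uses made-up per-visitor totals shaped like Actbyvisitor (the data frame `toy` is illustrative) and Spearman's rank correlation, chosen here as an assumption because it is less sensitive to a few very active visitors.

```r
# Hypothetical per-visitor totals, shaped like Actbyvisitor (illustrative)
toy <- data.frame(
  Search  = c(0, 2, 5, 8, 12, 20),
  Message = c(3, 0, 9, 1, 4, 2),
  Booking = c(0, 0, 1, 1, 2, 3)
)

# Pairwise Spearman rank correlations between the three action counts
cor_matrix <- cor(toy, method = "spearman")
round(cor_matrix, 2)
```

Calling `cor(Actbyvisitor[, c("Search", "Message", "Booking")], method = "spearman")` would give the corresponding matrix for the real data.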
To conclude, this research study analyzed various user behaviors from the pathway channels used to access the Airbnb portal. There are opportunities for digital managers to improve their marketing through targeted advertising to users who frequently visit to search, as they are more likely to make a booking. The average duration of each visit is 10.7 minutes, and this could be improved with an advertising strategy on the shortlisted applications for greater conversion. There is also a strong association between the device and the application used to access the portal. Many other variables were not considered in this project, such as the location of the users, which could be useful for geo-targeted advertising to frequent visitors to the portal.
Databits. (2018). Airbnb User Pathways Challenge. [Online]. Available at http://databits.io/challenges/airbnb-user-pathways-challenge
Svetlík, J. (2017). Integrating online advertising into integrated marketing communications. Marketing Identity, 5(1/1), 206-215.
Tucker, C. E. (2016). Social advertising: How advertising that explicitly promotes social influence can backfire. Available at SSRN 1975897.
Wilson, T., Yun, C. T., Chuan, S. B., Hong, T. T., & Bing, M. T. H. (2016). A Critique of the Advertising Consumer as “Target”. Explorations in Critical Studies of Advertising, 261.