AirBnB User Pathways - Data Description
User pathways are the routes by which people navigate a website. The AirBnB data set contains data on user pathways for user sessions in the past year in a US city. We will be using R to perform data analysis and visualization to explore and identify trends in user pathways, and uncover insights to understand how people are using the AirBnB site.
Source Data and Code Book
The source data can be found here: http://databits.io/static_content/challenges/airbnb-user-pathways-challenge/airbnb_session_data.txt
The code book can be found here: http://databits.io/static_content/challenges/airbnb-user-pathways-challenge/data_dictionary.rtf
Variables Measured
The following code was used to examine the variables in the data set. The data set contains 7,756 observations and 21 variables, which are listed below.
# Find total number of observations
nrow(AirB_tib)
# Get variable names
names(AirB_tib)
# Get summary information on the AirBnB data set variables
summary(AirB_tib)
The data set contains the following variables:
| Variable | Description |
|---|---|
| id_visitor | Visitor ID |
| id_session | Session ID |
| dim_session_number | Number of sessions on a given day for a visitor |
| dim_user_agent | User agent of the session - gives website information on the device and operating system |
| dim_device_app_combo | Parsed out device/application combination from the user agent of the session |
| ds | Session date stamp |
| ts_min | Session start time |
| ts_max | Session end time |
| did_search | Indicates whether the user performed a search during the session (0 = no search performed, 1 = search performed) |
| sent_message | Indicates whether the user sent a message during the session (0 = not sent, 1 = sent) |
| sent_booking_request | Indicates whether the user sent a booking request during the session (0 = no request sent, 1 = request sent) |
| next_id_session | Next Session ID |
| next_dim_session_number | Next number of sessions for visitor |
| next_dim_user_agent | Next session user agent |
| next_dim_device_app_combo | Next parsed out device/application combination |
| next_ds | Next session date stamp |
| next_ts_min | Next session start time |
| next_ts_max | Next session end time |
| next_did_search | Next session - did search? (0 or 1) |
| next_sent_message | Next session - sent message? (0 or 1) |
| next_sent_booking_request | Next session - booking request? (0 or 1) |
Summary of Variable Details
| Variable | Data Type | Missing Values | Value/Range |
|---|---|---|---|
| id_visitor | Varchar | None | Unique ID String |
| id_session | Varchar | None | Unique ID String |
| dim_session_number | Bigint | None | Integer; 1 to 702 |
| dim_user_agent | Varchar | None | Device/O.S. String |
| dim_device_app_combo | Varchar | None | Device/App Detail String |
| ds | Varchar | None | Date; 5/5/2014 - 4/23/2015 |
| ts_min | Varchar | None | Date Time Stamp |
| ts_max | Varchar | None | Date Time Stamp |
| did_search | Bigint | None | 0 or 1 |
| sent_message | Bigint | None | 0 or 1 |
| sent_booking_request | Bigint | None | 0 or 1 |
| next_id_session | Varchar | 630 | Unique ID String |
| next_dim_session_number | Bigint | 630 | Integer; 2 to 702 |
| next_dim_user_agent | Varchar | 630 | Device/O.S. String |
| next_dim_device_app_combo | Varchar | 630 | Device/App Detail String |
| next_ds | Varchar | 630 | Date; 5/6/2014 - 4/22/2015 |
| next_ts_min | Varchar | 630 | Date Time Stamp |
| next_ts_max | Varchar | 630 | Date Time Stamp |
| next_did_search | Bigint | 630 | 0 or 1 |
| next_sent_message | Bigint | 630 | 0 or 1 |
| next_sent_booking_request | Bigint | 630 | 0 or 1 |
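The "Missing Values" and "Value/Range" columns above can be checked with a couple of base R calls. The following is a minimal sketch on a toy data frame; the object `toy` and its values are hypothetical stand-ins, not rows from the AirBnB data. The same calls should work on `AirB_tib`, since a tibble is also a data frame.

```r
# Toy stand-in for AirB_tib (hypothetical values, not taken from the real data)
toy <- data.frame(
  id_session      = c("s1", "s2", "s3", "s4"),
  next_id_session = c("s2", NA, "s4", NA),
  next_did_search = c(1L, NA, 0L, NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
missing_counts <- colSums(is.na(toy))
print(missing_counts)

# Observed range of a numeric column, ignoring missing values
search_range <- range(toy$next_did_search, na.rm = TRUE)
print(search_range)
```

Running `colSums(is.na(AirB_tib))` on the imported data would be one way to confirm the counts reported in the table.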
Packages Used in this Proposal
library(tibble) # used to create tibbles
library(tidyr) # used to tidy up data
library(prettydoc) # document themes for R Markdown
library(DT) # used for displaying R data objects (matrices or data frames) as tables on HTML pages
library(lubridate) # date/time functions
Import Data
# Read data via the source data url provided above
AirBnB <- read.delim(
  url("http://databits.io/static_content/challenges/airbnb-user-pathways-challenge/airbnb_session_data.txt"),
  sep = "|", na.strings = "NULL"
)
# Store data as tibble
AirB_tib <- as_tibble(AirBnB)
# Preview data - show first 8 rows
datatable(head(AirB_tib, 8), options = list(scrollX = TRUE, pageLength = 4))
Data Cleaning
First, we separate the column “dim_device_app_combo” into two columns, one for the device name and one for the application name. We do the same for the column “next_dim_device_app_combo”.
# Separate Device and Application columns
AirB_tib2 <- AirB_tib %>%
  separate(dim_device_app_combo,
           into = c("Device", "Application"), sep = " - ") %>%
  separate(next_dim_device_app_combo,
           into = c("Next_Device", "Next_Application"),
           sep = " - ")
datatable(head(AirB_tib2, 8), options = list(scrollX = TRUE, pageLength = 4))
Next, we calculate the duration of each session as a new calculated field: the difference between the session end time and the session start time. To do so, we first convert the start and end time strings into date-time format using the ymd_hms() function from the lubridate package.
Then, we take the difference between the end and start times and store the session duration, in minutes, in a calculated field called “Duration”. We also calculate the duration of the next session in a calculated field called “Next_Duration”.
# Convert the start and end times from string to date/time format
AirB_tib2$Start_Time <- ymd_hms(AirB_tib2$ts_min)
AirB_tib2$End_Time <- ymd_hms(AirB_tib2$ts_max)
AirB_tib2$Next_Start_Time <- ymd_hms(AirB_tib2$next_ts_min)
AirB_tib2$Next_End_Time <- ymd_hms(AirB_tib2$next_ts_max)
# Subtracting two date-times returns a difftime whose units R picks automatically,
# so we request minutes explicitly and store the result as a plain number
AirB_tib2$Duration <-
  as.numeric(difftime(AirB_tib2$End_Time, AirB_tib2$Start_Time, units = "mins"))
AirB_tib2$Next_Duration <-
  as.numeric(difftime(AirB_tib2$Next_End_Time, AirB_tib2$Next_Start_Time, units = "mins"))
datatable(head(AirB_tib2, 8), options = list(scrollX = TRUE, pageLength = 4))
Initial Comments
We will continue to tidy the data as needed as we go through the analysis. During the initial evaluation, we noted while importing the data that some observations have null values in the “next session” fields (e.g. next session ID, next session date), which indicates that the visitor did not open another session after that one. These null values are not simply missing data; they tell us that the user did not visit the site again within the period covered by the data. Therefore, we will keep the null values in our data set.
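One way to make this information explicit is to turn the null "next session" values into a flag. The sketch below uses a toy data frame with hypothetical IDs; the column name `is_last_session` is an assumption introduced here for illustration and is not part of the source data.

```r
# Toy stand-in for the session data (hypothetical IDs)
toy <- data.frame(
  id_visitor      = c("v1", "v1", "v2"),
  id_session      = c("a", "b", "c"),
  next_id_session = c("b", NA, NA),
  stringsAsFactors = FALSE
)

# TRUE when no later session was recorded for the visitor
# (is_last_session is a hypothetical column name)
toy$is_last_session <- is.na(toy$next_id_session)

# Count and share of sessions that were not followed by another one
n_last <- sum(toy$is_last_session)
share_last <- mean(toy$is_last_session)
print(c(count = n_last, share = share_last))
```

Keeping the flag alongside the original columns preserves the null values while making "did the visitor return?" easy to use in later summaries.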
In addition, on inspecting the calculated duration fields, we noted that some users visited the site only very briefly (e.g. for 2 seconds). We may consider removing such sessions from our analysis, as a session this short may suggest that the user visited the site by accident, and it does not provide much meaningful information.
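If we do decide to drop very brief sessions, the filter could look like the sketch below. It assumes “Duration” holds minutes as a plain number (as.numeric() can convert a difftime), and the 10-second cutoff `min_minutes` is a hypothetical value chosen only for illustration; the toy rows are not real observations.

```r
# Toy stand-in for AirB_tib2 with durations in minutes (hypothetical values)
toy <- data.frame(
  id_session = c("a", "b", "c"),
  Duration   = c(2 / 60, 5.5, 42),
  stringsAsFactors = FALSE
)

# Hypothetical cutoff: treat sessions shorter than 10 seconds as accidental
min_minutes <- 10 / 60

# Flag brief sessions first so they can be reviewed before being dropped
toy$is_brief <- toy$Duration < min_minutes

# Keep only the sessions at or above the cutoff
kept <- toy[!toy$is_brief, ]
print(kept)
```

Flagging before filtering leaves room to check how many sessions a given cutoff would remove before committing to it.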
Planned Analysis
We will explore various ways of analyzing the AirBnB data set - potential ideas are listed below:
Investigate whether there is a correlation between the amount of time spent on the site, and whether the user takes action (as in, performs a search, sends a message, or sends a booking request).
Explore whether users are more likely to place a booking request based on the device they are using. Are people more likely to request a booking when they are browsing on computers vs. when they are browsing on mobile devices?
Explore whether users are more likely to place a booking request based on the application/browser they are using. For instance, are people more likely to place a booking on Chrome vs. on Firefox? A difference here could suggest that the site is easier to navigate on one application than on another.
Examine the actions of the users in each session, and determine whether users tend to visit the site multiple times before requesting a booking. If so, determine approximately how many times users typically visit the site before requesting a booking.
Examine the session information by unique visitor and note the behavior of frequent visitors (say, 10 sessions or more). How much time do frequent visitors spend on the site? Is the number of days between sessions shorter right before a booking request is placed?
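As a starting point for the first two ideas, the sketch below summarizes booking requests by device and relates session duration to booking requests. It runs on a toy data frame; the device labels and values are hypothetical, and the column names follow the “Device” and “Duration” fields created during data cleaning.

```r
# Toy stand-in for AirB_tib2 (hypothetical devices and values)
toy <- data.frame(
  Device               = c("Desktop", "Desktop", "iPhone", "iPhone", "iPhone"),
  Duration             = c(12, 30, 1, 4, 8),   # minutes
  sent_booking_request = c(0L, 1L, 0L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Share of sessions with a booking request, per device
booking_rate <- aggregate(sent_booking_request ~ Device, data = toy, FUN = mean)
print(booking_rate)

# Mean session duration for sessions without (0) and with (1) a booking request
mean_duration <- tapply(toy$Duration, toy$sent_booking_request, mean)
print(mean_duration)

# Correlation between duration and the 0/1 booking indicator
duration_booking_cor <- cor(toy$Duration, toy$sent_booking_request)
print(duration_booking_cor)
```

The same pattern could be repeated with “Application” in place of “Device”, or with did_search and sent_message in place of sent_booking_request.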