AirBnB User Pathways - Data Description

User pathways are the routes by which people navigate a website. The AirBnB data set contains data on user pathways for user sessions in the past year in a US city. We will be using R to perform data analysis and visualization to explore and identify trends in user pathways, and uncover insights to understand how people are using the AirBnB site.

Variables Measured

The code below inspects the variables in the data set. The data set has 7,756 observations and 21 variables, which are listed below.

# Find total number of observations  
nrow(AirB_tib)

# Get variable names
names(AirB_tib)

# Get summary information on the AirBnB data set variables
summary(AirB_tib)

The data set contains the following variables:

| Variable | Description |
|---|---|
| id_visitor | Visitor ID |
| id_session | Session ID |
| dim_session_number | Number of sessions on a given day for a visitor |
| dim_user_agent | User agent of the session; gives the website information on the device and operating system |
| dim_device_app_combo | Device/application combination parsed from the user agent of the session |
| ds | Session date stamp |
| ts_min | Session start time |
| ts_max | Session end time |
| did_search | Whether the user performed a search during the session (0 = no search performed, 1 = search performed) |
| sent_message | Whether the user sent a message during the session (0 = not sent, 1 = sent) |
| sent_booking_request | Whether the user sent a booking request during the session (0 = not sent, 1 = sent) |
| next_id_session | Next session ID |
| next_dim_session_number | Next number of sessions for the visitor |
| next_dim_user_agent | Next session user agent |
| next_dim_device_app_combo | Next parsed device/application combination |
| next_ds | Next session date stamp |
| next_ts_min | Next session start time |
| next_ts_max | Next session end time |
| next_did_search | Next session: did search? (0 or 1) |
| next_sent_message | Next session: sent message? (0 or 1) |
| next_sent_booking_request | Next session: sent booking request? (0 or 1) |

Summary of Variable Details

| Variable | Data Type | Missing Values | Value/Range |
|---|---|---|---|
| id_visitor | Varchar | None | Unique ID string |
| id_session | Varchar | None | Unique ID string |
| dim_session_number | Bigint | None | Integer; 1 to 702 |
| dim_user_agent | Varchar | None | Device/O.S. string |
| dim_device_app_combo | Varchar | None | Device/app detail string |
| ds | Varchar | None | Date; 5/5/2014 - 4/23/2015 |
| ts_min | Varchar | None | Date-time stamp |
| ts_max | Varchar | None | Date-time stamp |
| did_search | Bigint | None | 0 or 1 |
| sent_message | Bigint | None | 0 or 1 |
| sent_booking_request | Bigint | None | 0 or 1 |
| next_id_session | Varchar | 630 | Unique ID string |
| next_dim_session_number | Bigint | 630 | Integer; 2 to 702 |
| next_dim_user_agent | Varchar | 630 | Device/O.S. string |
| next_dim_device_app_combo | Varchar | 630 | Device/app detail string |
| next_ds | Varchar | 630 | Date; 5/6/2014 - 4/22/2015 |
| next_ts_min | Varchar | 630 | Date-time stamp |
| next_ts_max | Varchar | 630 | Date-time stamp |
| next_did_search | Bigint | 630 | 0 or 1 |
| next_sent_message | Bigint | 630 | 0 or 1 |
| next_sent_booking_request | Bigint | 630 | 0 or 1 |
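
The missing-value counts and data types in the table above can be tabulated directly in R. The following is a minimal sketch on a small hypothetical data frame (`toy` and its values are invented for illustration); on the real data the same calls would be applied to `AirB_tib`. Note that R reports its own classes (for example `character` and `integer`) rather than the source types Varchar and Bigint.

```r
# Hypothetical three-row stand-in for AirB_tib (values are invented)
toy <- data.frame(
  id_session         = c("s1", "s2", "s3"),
  dim_session_number = c(1L, 2L, 1L),
  next_id_session    = c("s2", NA, NA),
  stringsAsFactors   = FALSE
)

# Count missing values per column
missing_counts <- colSums(is.na(toy))

# Record the R class of each column
column_types <- vapply(toy, function(col) class(col)[1], character(1))

# Combine into one summary table, one row per variable
var_summary <- data.frame(
  variable = names(toy),
  type     = unname(column_types),
  missing  = unname(missing_counts)
)
print(var_summary)
```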

Packages Used in this Proposal

library(tibble) # used to create tibbles
library(tidyr) # used to tidy up data
library(prettydoc) # document themes for R Markdown
library(DT) # used for displaying R data objects (matrices or data frames) as tables on HTML pages
library(lubridate) # date/time functions

Import Data

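# Read the pipe-delimited data from the source URL;
# the literal string NULL marks a missing value
AirBnB <- read.delim(
  url("http://databits.io/static_content/challenges/airbnb-user-pathways-challenge/airbnb_session_data.txt"),
  sep = "|", na.strings = "NULL",
  stringsAsFactors = FALSE)  # keep text columns as character, not factor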

# Store data as tibble
AirB_tib <- as_tibble(AirBnB)

# Preview data - show first 8 rows
datatable(head(AirB_tib, 8),options = list(scrollX=TRUE, pageLength=4))
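
To illustrate what the `sep` and `na.strings` arguments do during import, here is a small sketch that parses an in-memory sample instead of the remote file. The sample text is hypothetical (the column names mirror the data set, the values are invented), and the `text` argument is used only so the example runs without network access.

```r
# Hypothetical pipe-delimited sample; values are invented
sample_text <- "id_session|did_search|next_id_session
s1|1|s2
s2|0|NULL"

# Same separator and missing-value marker as the real import
parsed <- read.delim(text = sample_text, sep = "|",
                     na.strings = "NULL", stringsAsFactors = FALSE)

# The literal string NULL has become a proper missing value (NA)
is.na(parsed$next_id_session)
```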

Data Cleaning

First, we separate the column “dim_device_app_combo” into two columns: one for the device name and one for the application name. We do the same for the column “next_dim_device_app_combo”.

# Separate Device and Application columns

AirB_tib2 <- AirB_tib %>% 
  separate(dim_device_app_combo, 
           into = c("Device", "Application"), sep = " - ") %>%
  separate(next_dim_device_app_combo,
           into = c("Next_Device", "Next_Application"), 
           sep = " - ") 

datatable(head(AirB_tib2, 8),options = list(scrollX=TRUE, pageLength=4))
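
The split can be tried on a few sample values first. This is a minimal sketch, assuming the combined strings look like "Device - Application"; the sample values below are hypothetical and the real strings may differ. The `extra` and `fill` arguments are optional additions, not part of the code above, that make the split tolerant of values with more or fewer separators.

```r
library(tidyr)

# Hypothetical device/application values; the real strings may differ
toy <- data.frame(
  dim_device_app_combo = c("Desktop - Chrome", "iPhone - Web", "Unknown"),
  stringsAsFactors = FALSE
)

# extra = "merge" keeps any additional " - " inside the second piece;
# fill = "right" pads a missing application with NA instead of warning
toy_split <- separate(
  toy, dim_device_app_combo,
  into = c("Device", "Application"),
  sep = " - ", extra = "merge", fill = "right"
)
print(toy_split)
```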

Next, we calculate the duration of each session by creating a new calculated field: the difference between the session end time and the session start time. To do so, we first convert the start time and end time strings into date-time format (year/month/day hour/minute/second), using the ymd_hms() function from the lubridate package.

Then, we take the difference between the end and start times and store the session duration in a calculated field called “Duration”. We also calculate the duration of the next session in a calculated field called “Next_Duration”.

# Convert the start and end times from string to date/time format

AirB_tib2$Start_Time <- ymd_hms(AirB_tib2$ts_min)
AirB_tib2$End_Time <- ymd_hms(AirB_tib2$ts_max)

AirB_tib2$Next_Start_Time <- ymd_hms(AirB_tib2$next_ts_min)
AirB_tib2$Next_End_Time <- ymd_hms(AirB_tib2$next_ts_max)

# Subtracting two date-times with `-` lets R choose the units automatically
# (seconds, minutes, hours, or days), so we call difftime() with
# units = "mins" to get the duration in minutes as a plain number

AirB_tib2$Duration <- 
  as.numeric(difftime(AirB_tib2$End_Time, AirB_tib2$Start_Time,
                      units = "mins"))

AirB_tib2$Next_Duration <- 
  as.numeric(difftime(AirB_tib2$Next_End_Time, AirB_tib2$Next_Start_Time,
                      units = "mins"))

datatable(head(AirB_tib2, 8),options = list(scrollX=TRUE, pageLength=4))
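
One caveat when computing durations in R: subtracting two date-times returns a `difftime` object whose units R selects from the data, so the result is not always in seconds. The sketch below shows this on two hypothetical sessions (the timestamps are invented) and uses base R's `as.POSIXct()` in place of `ymd_hms()` only to stay dependency-free.

```r
# Hypothetical start and end times (invented values)
start_time <- as.POSIXct(c("2014-05-05 10:00:00", "2014-05-05 11:00:00"),
                         tz = "UTC")
end_time   <- as.POSIXct(c("2014-05-05 10:00:02", "2014-05-05 11:45:00"),
                         tz = "UTC")

# Plain subtraction: R picks the units based on the smallest difference
auto_diff <- end_time - start_time
units(auto_diff)

# Explicit units give a predictable numeric result in minutes
duration_min <- as.numeric(difftime(end_time, start_time, units = "mins"))
duration_min
```

Requesting the units explicitly avoids having to check which unit R chose before rescaling.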

Initial Comments

We will continue to tidy the data as needed as we go through the analysis. During the initial evaluation, we noted while importing the data that some observations have null values in the “next session” fields (e.g., next session ID, next session date), which indicates that the visitor did not open another session after that session. These null values are not simply missing data; they tell us that the user did not visit the site again. Therefore, we will keep the null values in our data set.

In addition, on inspecting the calculated duration fields, we noted that there were times where the user visited the site only very briefly (e.g., 2 seconds). We may consider removing such sessions from our analysis, since a session this short may suggest that the user visited the site accidentally and provides little meaningful information.
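
Both observations can be sketched in a few lines of R. The data frame and the 5-second cutoff below are hypothetical, chosen only for illustration; the appropriate threshold would need to be decided from the real distribution of durations.

```r
# Hypothetical sessions; Duration is in minutes (values invented)
toy <- data.frame(
  id_session       = c("s1", "s2", "s3", "s4"),
  next_id_session  = c("s2", NA, "s4", NA),
  Duration         = c(12.5, 2 / 60, 0.5, 30),
  stringsAsFactors = FALSE
)

# Assumed cutoff: sessions shorter than 5 seconds are treated as accidental
min_duration_min <- 5 / 60

# A missing next session marks the visitor's last observed session,
# so we flag it rather than drop it
toy$is_last_session <- is.na(toy$next_id_session)

# Keep only sessions at or above the cutoff
toy_kept <- toy[toy$Duration >= min_duration_min, ]
print(toy_kept)
```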

Planned Analysis

We will explore various ways of analyzing the AirBnB data set - potential ideas are listed below:

  1. Investigate whether there is a correlation between the amount of time spent on the site, and whether the user takes action (as in, performs a search, sends a message, or sends a booking request).

  2. Explore whether users are more likely to place a booking request based on the device they are using. Are people more likely to request booking when they are browsing on computers vs. when they are browsing on mobile devices?

  3. Explore whether users are more likely to place a booking request based on the application/browser they are using. For instance, are people more likely to place a booking on Chrome vs. on Firefox? A difference could suggest that the site is easier to navigate on one application than on another.

  4. Examine the actions of the users in each session, and determine whether users tend to visit the site multiple times before requesting a booking. If so, determine approximately how many times users typically visit the site before requesting a booking.

  5. Examine the session information by unique visitor and note the behavior of frequent visitors (say, 10 sessions or more). How much time do frequent visitors spend on the site? Does the number of days between sessions shrink right before a booking request is placed?
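
As a rough sketch of how ideas 2 and 4 above might be approached, the code below computes a booking-request rate by device and the session number at which each visitor first sends a booking request. The data frame is hypothetical (values invented), and it assumes that `dim_session_number` can be read as a running session count per visitor; that assumption should be checked against the real data before relying on it.

```r
# Hypothetical sessions for two visitors (values invented)
toy <- data.frame(
  id_visitor           = c("v1", "v1", "v1", "v2", "v2"),
  dim_session_number   = c(1L, 2L, 3L, 1L, 2L),
  Device               = c("Desktop", "iPhone", "Desktop", "iPhone", "iPhone"),
  sent_booking_request = c(0L, 0L, 1L, 0L, 0L),
  stringsAsFactors     = FALSE
)

# Idea 2: share of sessions with a booking request, by device
booking_rate <- aggregate(sent_booking_request ~ Device,
                          data = toy, FUN = mean)
print(booking_rate)

# Idea 4: session number of each visitor's first booking request
# (visitors who never sent a request do not appear in the result)
booked <- toy[toy$sent_booking_request == 1, ]
first_booking <- aggregate(dim_session_number ~ id_visitor,
                           data = booked, FUN = min)
print(first_booking)
```

On the real data, the same aggregation would run on `AirB_tib2`, with `Application` substituted for `Device` to address idea 3.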