All the packages used in the analysis are listed below:
library(ggplot2)
library(GGally) # for ggpairs
library(memisc)
library(gridExtra)
library(maps)
library(mapdata)
library(dplyr)
library(tidyr)
library(rworldmap)
library(RColorBrewer)
library(caret)
library(corrplot)
library(Rtsne)
The dataset provides data about Airbnb (www.airbnb.com) users. The question to answer is: which country will a new user’s first booking destination be?
All the users in this dataset are from the USA.
There are 12 possible outcomes for the destination country: ‘US’, ‘FR’, ‘CA’, ‘GB’, ‘ES’, ‘IT’, ‘PT’, ‘NL’, ‘DE’, ‘AU’, ‘NDF’ (no destination found), and ‘other’. Note that ‘NDF’ is different from ‘other’: ‘other’ means there was a booking, but to a country not included in the list, while ‘NDF’ means there was no booking at all.
# Load the Data
df <- read.csv("train_users_2.csv")
dim(df)
## [1] 213451 16
names(df)
## [1] "id" "date_account_created"
## [3] "timestamp_first_active" "date_first_booking"
## [5] "gender" "age"
## [7] "signup_method" "signup_flow"
## [9] "language" "affiliate_channel"
## [11] "affiliate_provider" "first_affiliate_tracked"
## [13] "signup_app" "first_device_type"
## [15] "first_browser" "country_destination"
str(df)
## 'data.frame': 213451 obs. of 16 variables:
## $ id : Factor w/ 213451 levels "00023iyk9l","0005ytdols",..: 100523 48039 26485 68504 48956 147281 129610 2144 59779 40826 ...
## $ date_account_created : Factor w/ 1634 levels "2010-01-01","2010-01-02",..: 171 502 263 696 249 1 2 3 4 4 ...
## $ timestamp_first_active : num 2.01e+13 2.01e+13 2.01e+13 2.01e+13 2.01e+13 ...
## $ date_first_booking : Factor w/ 1977 levels "","2010-01-02",..: 1 1 194 960 35 2 4 10 190 3 ...
## $ gender : Factor w/ 4 levels "-unknown-","FEMALE",..: 1 3 2 2 1 1 2 2 2 1 ...
## $ age : num NA 38 56 42 41 NA 46 47 50 46 ...
## $ signup_method : Factor w/ 3 levels "basic","facebook",..: 2 2 1 2 1 1 1 1 1 1 ...
## $ signup_flow : int 0 0 3 0 0 0 0 0 0 0 ...
## $ language : Factor w/ 25 levels "ca","cs","da",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ affiliate_channel : Factor w/ 8 levels "api","content",..: 3 8 3 3 3 4 4 3 4 4 ...
## $ affiliate_provider : Factor w/ 18 levels "baidu","bing",..: 5 9 5 5 5 13 3 5 3 3 ...
## $ first_affiliate_tracked: Factor w/ 8 levels "","linked","local ops",..: 8 8 8 8 8 5 8 5 8 5 ...
## $ signup_app : Factor w/ 4 levels "Android","iOS",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ first_device_type : Factor w/ 9 levels "Android Phone",..: 6 6 9 6 6 6 6 6 6 6 ...
## $ first_browser : Factor w/ 52 levels "-unknown-","Android Browser",..: 9 9 24 18 9 9 42 42 42 18 ...
## $ country_destination : Factor w/ 12 levels "AU","CA","DE",..: 8 8 12 10 12 12 12 12 12 12 ...
Here is a summary of all the variables:
summary(df)
## id date_account_created timestamp_first_active
## 00023iyk9l: 1 2014-05-13: 674 Min. :2.009e+13
## 0005ytdols: 1 2014-06-24: 670 1st Qu.:2.012e+13
## 000guo2307: 1 2014-06-25: 636 Median :2.013e+13
## 000wc9mlv3: 1 2014-05-20: 632 Mean :2.013e+13
## 0012yo8hu2: 1 2014-05-14: 622 3rd Qu.:2.014e+13
## 001357912w: 1 2014-05-21: 602 Max. :2.014e+13
## (Other) :213445 (Other) :209615
## date_first_booking gender age signup_method
## :124543 -unknown-:95688 Min. : 1.00 basic :152897
## 2014-05-22: 248 FEMALE :63041 1st Qu.: 28.00 facebook: 60008
## 2014-06-11: 231 MALE :54440 Median : 34.00 google : 546
## 2014-06-24: 226 OTHER : 282 Mean : 49.67
## 2014-05-21: 225 3rd Qu.: 43.00
## 2014-06-10: 223 Max. :2014.00
## (Other) : 87755 NA's :87990
## signup_flow language affiliate_channel
## Min. : 0.000 en :206314 direct :137727
## 1st Qu.: 0.000 zh : 1632 sem-brand : 26045
## Median : 0.000 fr : 1172 sem-non-brand: 18844
## Mean : 3.267 es : 915 other : 8961
## 3rd Qu.: 0.000 ko : 747 seo : 8663
## Max. :25.000 de : 732 api : 8167
## (Other): 1939 (Other) : 5044
## affiliate_provider first_affiliate_tracked signup_app
## direct :137426 untracked :109232 Android: 5454
## google : 51693 linked : 46287 iOS : 19019
## other : 12549 omg : 43982 Moweb : 6261
## craigslist: 3471 tracked-other: 6156 Web :182717
## bing : 2328 : 6065
## facebook : 2273 product : 1556
## (Other) : 3711 (Other) : 173
## first_device_type first_browser country_destination
## Mac Desktop :89600 Chrome :63845 NDF :124543
## Windows Desktop:72716 Safari :45169 US : 62376
## iPhone :20759 Firefox :33655 other : 10094
## iPad :14339 -unknown- :27266 FR : 5023
## Other/Unknown :10667 IE :21068 IT : 2835
## Android Phone : 2803 Mobile Safari:19274 GB : 2324
## (Other) : 2567 (Other) : 3174 (Other): 6256
As in principal component analysis (PCA), features are only informative when they vary: a feature with (near-)zero variance is almost constant, so it does little to distinguish observations and is a candidate for removal.
zero.var = nearZeroVar(df, saveMetrics=TRUE)
zero.var
## freqRatio percentUnique zeroVar nzv
## id 1.000000 1.000000e+02 FALSE FALSE
## date_account_created 1.005970 7.655153e-01 FALSE FALSE
## timestamp_first_active 1.051801 1.035366e-01 FALSE FALSE
## date_first_booking 502.189516 9.262079e-01 FALSE TRUE
## gender 1.517869 1.873966e-03 FALSE FALSE
## age 1.017952 5.949843e-02 FALSE FALSE
## signup_method 2.547944 1.405475e-03 FALSE FALSE
## signup_flow 11.238079 7.964357e-03 FALSE FALSE
## language 126.417892 1.171229e-02 FALSE TRUE
## affiliate_channel 5.288040 3.747933e-03 FALSE FALSE
## affiliate_provider 2.658503 8.432849e-03 FALSE FALSE
## first_affiliate_tracked 2.359885 3.747933e-03 FALSE FALSE
## signup_app 9.607077 1.873966e-03 FALSE FALSE
## first_device_type 1.232191 4.216424e-03 FALSE FALSE
## first_browser 1.413469 2.436156e-02 FALSE FALSE
## country_destination 1.996649 5.621899e-03 FALSE FALSE
None of the features has zero variance (although date_first_booking and language are flagged as near-zero variance), so no feature is removed at this stage.
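For reference, if any column had been flagged as zero-variance, a minimal sketch of dropping it would look like the following (kept as a comment here, since nothing needs to be removed):
# hypothetical: drop any columns flagged by nearZeroVar as zero-variance
# df <- df[, !zero.var$zeroVar]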
Here is a summary of the age variable in the dataset.
summary(df$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 28.00 34.00 49.67 43.00 2014.00 87990
We can see that both the minimum and maximum values of age are implausible.
qplot(x = age, data = df, binwidth = 1)
qplot(x = age, data = df, binwidth = 1) +
scale_x_continuous(limits = c(18, 75), breaks = seq(18, 75, 2))
There are outliers with an obviously false age of around 2000 years, most probably a mistake made by users during registration.
The following boxplots show that these outliers exist for every gender. It is possible that some users entered the current year instead of their year of birth when registering on Airbnb.
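As a quick check of this hypothesis (a sketch only; the modelling section later simply discards such values), the year-like “ages” can be tabulated:
# how many reported ages look like calendar years rather than ages
table(df$age[!is.na(df$age) & df$age > 1000])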
a1 <- qplot(x = gender, y = age, data = df,
geom = "boxplot")
a2 <- qplot(x = gender, y = age, data = df,
geom = "boxplot") +
coord_cartesian(ylim = c(18,75))
a3 <- qplot(x = gender, y = age, data = df,
geom = "boxplot") +
coord_cartesian(ylim = c(0,18))
grid.arrange(a1, a2, a3, ncol = 2)
by(df$age, df$gender, summary)
## df$gender: -unknown-
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 29.00 35.00 46.78 45.00 2014.00 78845
## --------------------------------------------------------
## df$gender: FEMALE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 28.00 33.00 52.73 42.00 2014.00 5328
## --------------------------------------------------------
## df$gender: MALE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 29.00 34.00 47.08 43.00 2014.00 3763
## --------------------------------------------------------
## df$gender: OTHER
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 17.00 29.75 35.00 62.90 44.00 2014.00 54
Female users have a slightly lower median age than males and a slightly higher mean age, while the 1st and 3rd quartiles are very similar for both genders.
The minimum age for males, females and unknown-gender users starts at 1-2 years, which is obviously false. There are no outliers below 17 years for the gender “OTHER”, which suggests that users who select this gender during registration lie about their age less often. There are more young-age outliers among males than among females or users of unknown gender.
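A quick count of the implausibly young ages by gender (a sketch, using 18 years as the cut-off) supports this observation:
# number of users reporting an age below 18, by gender
table(df$gender[!is.na(df$age) & df$age < 18])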
by(df$age, df$country_destination, summary)
## df$country_destination: AU
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 18.00 29.00 35.00 42.81 44.00 2014.00 103
## --------------------------------------------------------
## df$country_destination: CA
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 18.00 29.00 34.00 48.99 43.00 2014.00 351
## --------------------------------------------------------
## df$country_destination: DE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 5.00 28.00 34.00 53.55 43.00 2014.00 210
## --------------------------------------------------------
## df$country_destination: ES
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 27.00 32.00 46.24 40.00 2014.00 543
## --------------------------------------------------------
## df$country_destination: FR
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 5.00 29.00 34.00 48.11 44.00 2014.00 1310
## --------------------------------------------------------
## df$country_destination: GB
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 5.00 29.00 35.00 50.16 46.00 2014.00 550
## --------------------------------------------------------
## df$country_destination: IT
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 4.00 28.00 34.00 50.65 44.00 2014.00 799
## --------------------------------------------------------
## df$country_destination: NDF
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 28.00 34.00 51.83 44.00 2014.00 67614
## --------------------------------------------------------
## df$country_destination: NL
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 18.00 28.00 33.00 53.15 41.00 2014.00 160
## --------------------------------------------------------
## df$country_destination: other
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 29.00 34.00 48.84 42.00 2014.00 2518
## --------------------------------------------------------
## df$country_destination: PT
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 19.00 28.00 32.00 48.75 41.75 2014.00 59
## --------------------------------------------------------
## df$country_destination: US
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 28.00 33.00 47.41 41.00 2014.00 13773
ggplot(df, aes(country_destination, age)) +
geom_boxplot(aes()) +
ylim(25,50)
ggplot(df, aes(age)) +
geom_histogram(binwidth = 1, color = 'black', fill = '#099DD9') +
xlim(18,75) +
facet_wrap(~country_destination, ncol = 3, scales = "free")
ggplot(df, aes(age)) +
geom_histogram(binwidth = 1, color = 'black', fill = '#099DD9') +
geom_histogram(data=subset(df,age==20), color = "black", fill="red", binwidth = 1) +
#xlim(15, 25) +
scale_x_continuous(limits = c(15, 25), breaks = seq(15, 25, 1)) +
facet_wrap(~country_destination, ncol = 3, scales = "free")
There seems to be no big difference in the overall age distribution of users across the different destination countries. However, there is a clear dip at age 20, both among users who booked accommodation (for any country) and among those who did not book at all: 20-year-olds are clearly a minority among Airbnb users.
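The dip at age 20 can also be checked numerically (a quick sketch):
# number of users reporting each age around 20
table(df$age[!is.na(df$age) & df$age %in% 18:22])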
#----- summary by age -----#
df.age.summ <- summarise(group_by(df, country_destination),
medianAge = median(age, na.rm = TRUE),
meanAge = mean(age, na.rm = TRUE),
sdAge = sd(age, na.rm = TRUE))
spdf <- joinCountryData2Map(subset(df.age.summ, country_destination != "other" & country_destination != "NDF"),
joinCode="ISO_A2",
nameJoinColumn="country_destination")
## 10 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 233 codes from the map weren't represented in your data
df_map <- spdf@data
mapCountryData(spdf,
nameColumnToPlot="medianAge",
catMethod="fixedWidth",
numCats=3,
mapTitle = "Median Age of Airbnb Users"
)
mapCountryData(spdf,
nameColumnToPlot="meanAge",
catMethod="fixedWidth",
numCats=10,
mapTitle = "Mean Age of Airbnb Users"
)
df.age.summ2 <- summarise(group_by(subset(df, age<76 & age>17), country_destination),
meanAge = mean(age, na.rm = TRUE),
sdAge = sd(age, na.rm = TRUE))
spdf2 <- joinCountryData2Map(subset(df.age.summ2,
country_destination != "other" & country_destination != "NDF"),
joinCode="ISO_A2",
nameJoinColumn="country_destination")
## 10 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 233 codes from the map weren't represented in your data
df_map <- spdf2@data
mapCountryData(spdf2,
nameColumnToPlot="meanAge",
catMethod="fixedWidth",
numCats=10,
mapTitle = "Mean Age of Airbnb Users in the Age-Group 18-75"
)
ggplot(df, aes(x = gender)) +
geom_bar()
df.gender <- spread(df, gender, age)
colnames(df.gender)[15] <- "unknown"
df.gender.summ <- summarise(group_by(df.gender, country_destination),
countF = sum(!is.na(FEMALE)),
countM = sum(!is.na(MALE)),
countO = sum(!is.na(OTHER)),
countUn = sum(!is.na(unknown)),
total_with_NA = n())
spdf <- joinCountryData2Map(subset(df.gender.summ,
country_destination != "other" &
country_destination != "NDF"),
joinCode="ISO_A2",
nameJoinColumn="country_destination")
## 10 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 233 codes from the map weren't represented in your data
df_map <- subset(spdf@data, !is.na(country_destination))
par(mai= c(0,0,0.9,0),
xaxs = "i",
yaxs = "i")
mapPies(dF =df_map,
nameX="LON",
nameY="LAT",
nameZs =c("countF",
"countM",
"countO",
"countUn") ,
zColours=c("red",
"green",
"yellow",
"blue"),
oceanCol = "lightblue",
landCol = "wheat",
addSizeLegend=T,
addCatLegend=F,
mapRegion="world",
xlim=c(-181,181),
ylim=c(-81,80),
symbolSize = 2)
## symbolMaxSize= 7.24 maxSumValues= 48603 symbolScale= 0.03284029
## List of 2
## $ x: num [1:100] -138 -138 -138 -138 -138 ...
## $ y: num [1:100] 76.6 77.5 78.4 79.3 80.2 ...
title(main=paste("Number of Airbnb Users by Gender"),
cex=3)
legend(-180.1516,90,
legend=c("Female",
"Male",
"Other",
"Unknown"),
col=c("red",
"green",
"yellow",
"blue"),
pch=16,
cex=0.8,
pt.cex=1.5,
bty="o",
box.lty=0,
horiz = F,
bg="#FFFFFF70")
I want to convert both date_account_created and date_first_booking to a date type so that I can later calculate the period between these two dates.
Empty values cannot be converted to dates, so for users without a booking I copy the value of date_account_created into date_first_booking; the calculated period between the two dates is then zero.
df$date_account_created <- as.character(df$date_account_created)
df$date_first_booking <- as.character(df$date_first_booking)
df$date_first_booking <- ifelse(df$date_first_booking == "", df$date_account_created, df$date_first_booking)
df$date_account_created <- as.Date(df$date_account_created)
df$date_first_booking <- as.Date(df$date_first_booking)
summary(df$date_account_created)
## Min. 1st Qu. Median Mean 3rd Qu.
## "2010-01-01" "2012-12-26" "2013-09-11" "2013-06-25" "2014-03-06"
## Max.
## "2014-06-30"
summary(df$date_first_booking)
## Min. 1st Qu. Median Mean 3rd Qu.
## "2010-01-02" "2013-01-14" "2013-09-24" "2013-07-13" "2014-03-22"
## Max.
## "2015-06-29"
It may be useful to see whether users create accounts or make bookings more often in specific months, on specific days of the month, or on specific weekdays. For this, I will need new variables: month_acc_created, day_acc_created, dac_weekday, month_first_book, day_first_book and weekday_first_book.
#use %b for Jan, %m for 01, %B for January, %a for Mon, %A for Monday
df$month_acc_created <- factor(format(df$date_account_created, format = "%B"),
levels = c("January", "February", "March", "April",
"May", "June", "July", "August", "September",
"October", "November", "December"))
df$day_acc_created <- factor(format(df$date_account_created, format = "%d"))
df$dac_weekday <- factor(format(df$date_account_created, format = "%A"),
levels = c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday",
"Sunday"))
df$month_first_book <- factor(format(df$date_first_booking, format = "%B"),
levels = c("January", "February", "March", "April",
"May", "June", "July", "August", "September",
"October", "November", "December"))
df$day_first_book <- factor(format(df$date_first_booking, format = "%d"))
df$weekday_first_book <- factor(format(df$date_first_booking, format = "%A"),
levels = c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday",
"Sunday"))
summary(df$month_first_book)
## January February March April May June July
## 16967 16348 20046 21414 25435 26886 15051
## August September October November December
## 15284 15741 13935 13105 13239
qplot(x = month_first_book, data = df)
ggplot(subset(df, country_destination != "NDF"), aes(x = month_first_book)) +
#geom_bar(las=2)+
geom_bar(data=subset(df,country_destination != "NDF" & country_destination != "AU" &
month_first_book %in% c("June", "May")),
color = "black", fill="green") +
geom_bar(data=subset(df,country_destination != "NDF" & country_destination != "AU" &
month_first_book %in% c("November", "December")),
color = "black", fill="red") +
geom_bar(data=subset(df,country_destination == "AU" &
month_first_book == "November"),
color = "black", fill="green") +
geom_bar(data=subset(df,country_destination == "AU" &
month_first_book %in% c("June", "July","August")),
color = "black", fill="red") +
geom_bar(data=subset(df,country_destination == "PT" &
month_first_book %in% c("July", "August", "September",
"October","November","December","January")),
color = "black", fill="red") +
geom_bar(data=subset(df,country_destination == "PT" &
month_first_book %in% c("July", "August", "September",
"October", "January")),
color = "black", fill="red") +
facet_wrap(~country_destination, scales = "free")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
The majority of bookings are made during the months of March to June. November and December are the lowest months in terms of the number of reservations.
Among the countries, Australia and Portugal look different from the others. For Portugal, reservations are very low for the whole period from June to the end of the year (and also in January). In Australia, June, July and August are all low-season months, and a higher number of reservations is made in November.
In all the other countries, May and June are the peak months for reservations.
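A quick numeric check of the Australian pattern (a sketch; counts of first bookings to Australia by month):
# first bookings to Australia by month; November should stand out
with(subset(df, country_destination == "AU"), table(month_first_book))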
Now, let’s see how the reservations are distributed by the day of booking.
qplot(x = day_first_book, data = df)
ggplot(subset(df, country_destination != "NDF"), aes(x=day_first_book)) +
geom_bar()+
facet_wrap(~country_destination, scales = "free")
I do not see anything specific here: the number of bookings made on each day of the month looks more or less the same. The 31st obviously has fewer bookings than the other days, simply because not all months have a 31st.
Let’s have a look at the reservations’ distribution over the weekdays.
qplot(x = weekday_first_book, data = df)
ggplot(subset(df, country_destination != "NDF"), aes(x=weekday_first_book)) +
geom_bar()+
facet_wrap(~country_destination, scales = "free")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
In general, it seems that fewer reservations are made during the weekend. That is probably because people are travelling over the weekend rather than planning travel.
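A simple count supports this (a sketch, restricted to users who actually booked):
# number of first bookings by weekday
table(subset(df, country_destination != "NDF")$weekday_first_book)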
df$days_to_book <- as.numeric(df$date_first_booking - df$date_account_created)
summary(df$days_to_book)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -349.00 0.00 0.00 18.48 1.00 365.00
by(df$days_to_book, df$country_destination, summary)
## df$country_destination: AU
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -77.00 1.00 5.00 46.04 41.00 364.00
## --------------------------------------------------------
## df$country_destination: CA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.00 1.00 3.00 45.73 26.25 364.00
## --------------------------------------------------------
## df$country_destination: DE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -269.00 1.00 4.00 39.53 30.00 364.00
## --------------------------------------------------------
## df$country_destination: ES
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.00 4.00 37.35 25.00 365.00
## --------------------------------------------------------
## df$country_destination: FR
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.00 4.00 37.59 25.00 364.00
## --------------------------------------------------------
## df$country_destination: GB
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -109.00 1.00 4.00 39.44 28.00 365.00
## --------------------------------------------------------
## df$country_destination: IT
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -298.00 1.00 3.00 36.68 28.00 364.00
## --------------------------------------------------------
## df$country_destination: NDF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 0
## --------------------------------------------------------
## df$country_destination: NL
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 1.0 3.0 42.1 26.0 365.0
## --------------------------------------------------------
## df$country_destination: other
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -85.00 1.00 4.00 47.88 41.00 365.00
## --------------------------------------------------------
## df$country_destination: PT
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 4.00 44.49 29.00 364.00
## --------------------------------------------------------
## df$country_destination: US
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -349.00 0.00 3.00 45.19 28.00 365.00
ggplot(df, aes(country_destination, days_to_book)) +
geom_boxplot()
ggplot(subset(df, country_destination != "NDF"), aes(gender, days_to_book)) +
geom_boxplot()
ggplot(subset(df, country_destination != "NDF"), aes(language, days_to_book)) +
geom_boxplot()
summary(df$signup_method)
## basic facebook google
## 152897 60008 546
qplot(x = signup_method, data = df)
ggplot(subset(df, country_destination != "NDF" & country_destination != "US" & signup_method == "facebook"), aes(x = signup_method)) +
geom_bar()+
facet_wrap(~country_destination)+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
It seems that signup_method may be a useful feature. Users who signed up with Facebook selected some countries, such as France, Italy or Spain, more often than others.
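One quick way to quantify this (a sketch) is the share of Facebook sign-ups within each destination country:
# proportion of Facebook sign-ups per destination country
round(prop.table(table(df$signup_method, df$country_destination), 2)["facebook", ], 3)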
First, let’s look at how languages are distributed across the users who never made a reservation.
df <- within(df, language <- factor(language,
levels=names(sort(table(language),
decreasing=TRUE))))
qAll <- ggplot(subset(df, country_destination == "NDF"),
aes(language)) +
geom_bar(aes(fill = gender)) +
ggtitle("Frequency distribution - NDF, All Languages")
qWoEng <- ggplot(subset(df, country_destination == "NDF" & language != "en"),
aes(language)) +
geom_bar(aes(fill = gender)) +
ggtitle("Frequency Distribution - NDF, W/o English")
grid.arrange(qAll, qWoEng, ncol = 1)
Now, compare this to the distribution across all users who did make a reservation through Airbnb.
qAll <- ggplot(subset(df, country_destination != "NDF"),
aes(language)) +
geom_bar(aes(fill = gender)) +
ggtitle("Frequency distribution - Reservations, All Languages")
qWoEng <- ggplot(subset(df, country_destination != "NDF" & language != "en"),
aes(language)) +
geom_bar(aes(fill = gender)) +
ggtitle("Frequency Distribution - Reservations, W/o English")
grid.arrange(qAll, qWoEng, ncol = 1)
The graphs look quite similar, although there are slight differences in the ordering. For example, German-speaking users make reservations at almost the same level as Spanish-speaking users, whereas among non-reserving users German speakers appear less frequently than Spanish speakers.
English is the main language for the majority of the users. However, there may be more insight in the other (non-English) languages, so I will look in detail at the most used user language (other than English) for each European destination country.
ggplot(subset(df, country_destination == "DE" & language != "en"),
aes(language)) +
geom_bar(aes(fill = gender)) +
ggtitle("Frequency Distribution for Germany (w/o English)") +
geom_bar(data=subset(df,country_destination == "DE" & language == "de"), color = "black", size = 1, aes(fill = gender))
ggplot(subset(df, country_destination == "FR" & language != "en"),
aes(language)) +
geom_bar(aes(fill = gender)) +
ggtitle("Frequency Distribution for France (w/o English)")+
geom_bar(data=subset(df,country_destination == "FR" & language == "fr"), color = "black", size = 1, aes(fill = gender))
ggplot(subset(df, country_destination == "ES" & language != "en"),
aes(language)) +
geom_bar(aes(fill = gender)) +
ggtitle("Frequency Distribution for Spain (w/o English)")+
geom_bar(data=subset(df,country_destination == "ES" & language == "es"), color = "black", size = 1, aes(fill = gender))
ggplot(subset(df, country_destination == "NL" & language != "en"),
aes(language)) +
geom_bar(aes(fill = gender)) +
ggtitle("Frequency Distribution for Netherlands (w/o English)")+
geom_bar(data=subset(df,country_destination == "NL" & language == "nl"), color = "black", size = 1, aes(fill = gender))
ggplot(subset(df, country_destination == "PT" & language != "en"),
aes(language)) +
geom_bar(aes(fill = gender)) +
ggtitle("Frequency Distribution for Portugal (w/o English)")+
geom_bar(data=subset(df,country_destination == "PT" & language == "pt"), color = "black", size = 1, aes(fill = gender))
It seems that there is a correlation between a user preferring a European language and the destination country they book.
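To quantify this correlation (a sketch only; lang_map is an assumed pairing of each destination with its main language), the share of users whose preferred language matches the destination’s language can be computed:
# assumed mapping of destination country to its main language
lang_map <- c(DE = "de", FR = "fr", ES = "es", IT = "it", NL = "nl", PT = "pt")
# share of users booking each country whose site language matches that country's language
sapply(names(lang_map), function(cc)
  round(mean(df$language[df$country_destination == cc] == lang_map[cc]), 3))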
library(xgboost)
library(readr)
library(stringr)
library(caret)
library(car)
library(Ckmeans.1d.dp) # to plot feature importance
New users on Airbnb can book a place to stay in 34,000+ cities across 190+ countries. Now, I will try to predict, for each new user, the five most probable countries in which they will book a place to stay.
First, I load the test data and make some manipulations with the data, e.g. replacing all missing values with -1.
rm(list = ls())
# load data
df_train = read_csv("train_users_2.csv")
df_test = read_csv("test_users.csv")
labels = df_train['country_destination']
df_train = df_train[-grep('country_destination', colnames(df_train))]
# combine train and test data
df_all = rbind(df_train,df_test)
# remove date_first_booking
df_all = df_all[-c(which(colnames(df_all) %in% c('date_first_booking')))]
# replace missing values
df_all[is.na(df_all)] <- -1
I clean the data by 1) splitting date_account_created into year, month and day, 2) doing the same with timestamp_first_active, and 3) removing unrealistic age values, i.e. those greater than 100 or smaller than 14.
# split date_account_created in year, month and day
dac = as.data.frame(str_split_fixed(df_all$date_account_created, '-', 3))
df_all['dac_year'] = dac[,1]
df_all['dac_month'] = dac[,2]
df_all['dac_day'] = dac[,3]
df_all$dac_weekday <- factor(format(df_all$date_account_created, format = "%A"),
levels = c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday",
"Sunday"))
df_all = df_all[,-c(which(colnames(df_all) %in% c('date_account_created')))]
# split timestamp_first_active in year, month and day
df_all[,'tfa_year'] = substring(as.character(df_all[,'timestamp_first_active']), 1, 4)
df_all['tfa_month'] = substring(as.character(df_all[,'timestamp_first_active']), 5, 6)
df_all['tfa_day'] = substring(as.character(df_all[,'timestamp_first_active']), 7, 8)
df_all = df_all[,-c(which(colnames(df_all) %in% c('timestamp_first_active')))]
# clean Age by removing values
df_all[df_all$age < 14 | df_all$age > 100,'age'] <- -1
# one-hot-encoding features
ohe_feats = c('gender', 'signup_method', 'signup_flow', 'language', 'affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked', 'signup_app', 'first_device_type', 'first_browser')
dummies <- dummyVars(~ gender + signup_method + signup_flow + language + affiliate_channel + affiliate_provider + first_affiliate_tracked + signup_app + first_device_type + first_browser, data = df_all)
df_all_ohe <- as.data.frame(predict(dummies, newdata = df_all))
df_all_combined <- cbind(df_all[,-c(which(colnames(df_all) %in% ohe_feats))],df_all_ohe)
# split train and test
X = df_all_combined[df_all_combined$id %in% df_train$id,]
y <- recode(labels$country_destination,"'NDF'=0; 'US'=1; 'other'=2; 'FR'=3; 'CA'=4; 'GB'=5; 'ES'=6; 'IT'=7; 'PT'=8; 'NL'=9; 'DE'=10; 'AU'=11")
X_test = df_all_combined[df_all_combined$id %in% df_test$id,]
Now I build a machine learning model to predict the destination country from the features, using the XGBoost (extreme gradient boosting) algorithm.
# train xgboost
xgb <- xgboost(data = data.matrix(X),
label = y,
eta = 0.1,
max_depth = 9,
nround=25,
subsample = 0.5,
colsample_bytree = 0.5,
seed = 1,
eval_metric = "merror",
objective = "multi:softprob",
num_class = 12,
nthread = 3,
missing=NA
)
## Warning in data.matrix(X): NAs introduced by coercion
## Warning in data.matrix(X): NAs introduced by coercion
## Warning: parameter seed is ignored, please set random seed using set.seed
## [0] train-merror:0.386515
## [1] train-merror:0.399591
## [2] train-merror:0.390975
## [3] train-merror:0.373477
## [4] train-merror:0.375407
## [5] train-merror:0.371912
## [6] train-merror:0.370389
## [7] train-merror:0.366168
## [8] train-merror:0.364946
## [9] train-merror:0.364843
## [10] train-merror:0.364416
## [11] train-merror:0.364472
## [12] train-merror:0.364660
## [13] train-merror:0.364688
## [14] train-merror:0.364205
## [15] train-merror:0.363395
## [16] train-merror:0.363535
## [17] train-merror:0.363512
## [18] train-merror:0.362041
## [19] train-merror:0.361319
## [20] train-merror:0.361193
## [21] train-merror:0.360842
## [22] train-merror:0.360664
## [23] train-merror:0.360270
## [24] train-merror:0.359553
Now, I analyse the feature importance of the trained model.
## Analyse and visualize the xgboost model
model <- xgb.dump(xgb, with.stats = T)
model[1:10]
## [1] "booster[0]"
## [2] "0:[f10<0.5] yes=1,no=2,missing=1,gain=2473.81,cover=16263.2"
## [3] "1:[f17<9] yes=3,no=4,missing=3,gain=1685.54,cover=11447.8"
## [4] "3:[f17<1.5] yes=7,no=8,missing=7,gain=1134.43,cover=9605.45"
## [5] "7:[f46<0.5] yes=15,no=16,missing=15,gain=473.749,cover=8766.24"
## [6] "15:[f77<0.5] yes=29,no=30,missing=29,gain=124.152,cover=3166.78"
## [7] "29:[f49<0.5] yes=53,no=54,missing=53,gain=134.544,cover=2719.29"
## [8] "53:[f51<0.5] yes=97,no=98,missing=97,gain=126.618,cover=1554.97"
## [9] "97:[f2<3.5] yes=155,no=156,missing=155,gain=39.8956,cover=1235.06"
## [10] "155:[f17<0.5] yes=243,no=244,missing=243,gain=12.8258,cover=290.278"
#names <- dimnames(data.matrix(X[,-1]))[[2]] # delete -1 ?
names <- dimnames(data.matrix(X))[[2]]
## Warning in data.matrix(X): NAs introduced by coercion
## Warning in data.matrix(X): NAs introduced by coercion
importance_matrix <- xgb.importance(names, model = xgb)
xgb.plot.importance(importance_matrix[1:20,])
#library(DiagrammeR)
#xgb.plot.tree(feature_names = names, model = xgb, n_first_tree = 2)
The feature importance plot is useful for selecting only the best features, i.e. those most strongly related to the outcome. In this case, age has the highest importance.
To improve model fitting (training time and overfitting) and to simplify the model, the less important features can be removed, as sketched below.
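A minimal sketch of such a reduction (hypothetical; the cut-off of 30 features and the names top_feats, X_reduced and xgb_small are arbitrary choices, not part of the pipeline above) would keep only the top-ranked features from importance_matrix and retrain on them:
# hypothetical: keep only the 30 most important features and rebuild the design matrix
top_feats <- head(importance_matrix$Feature, 30)
X_reduced <- X[, colnames(X) %in% top_feats, drop = FALSE]
# the model could then be retrained on the reduced matrix, e.g.
# xgb_small <- xgboost(data = data.matrix(X_reduced), label = y, nround = 25,
#                      objective = "multi:softprob", num_class = 12, missing = NA)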
Finally, I use the model above to predict values for the test set and to extract the 5 destination countries with the highest probabilities for each user.
# predict values in test set
y_pred <- predict(xgb, data.matrix(X_test),missing=NA)
## Warning in data.matrix(X_test): NAs introduced by coercion
## Warning in data.matrix(X_test): NAs introduced by coercion
# extract the 5 classes with highest probabilities
predictions <- as.data.frame(matrix(y_pred, nrow=12))
rownames(predictions) <- c('NDF','US','other','FR','CA','GB','ES','IT','PT','NL','DE','AU')
predictions_top5 <- as.vector(apply(predictions, 2, function(x) names(sort(x)[12:8])))
Then I create the results data frame.
# create submission
ids <- NULL
for (i in 1:NROW(X_test)) {
idx <- X_test$id[i]
ids <- append(ids, rep(idx,5))
}
results <- NULL
results$id <- ids
results$country <- predictions_top5
# generate submission file
results <- as.data.frame(results)
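The submission file itself is not written above; a minimal sketch (assuming the usual Kaggle submission format with id and country columns, and an arbitrary file name) would be:
# write the submission file
write.csv(results, "submission.csv", quote = FALSE, row.names = FALSE)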
These results correspond to a score of 0.86956 when submitted to the Airbnb New User Bookings Kaggle competition.