All the packages used in the analysis are listed below:
library(ggplot2)
library(GGally) # for ggpairs
library(memisc)
library(gridExtra)
library(maps)
library(mapdata)
library(dplyr)
library(tidyr)
library(rworldmap)
library(RColorBrewer)
library(caret)
library(corrplot)
library(Rtsne)
The dataset provides data about Airbnb (www.airbnb.com) users. The question to answer is: which country will a new user’s first booking destination be?
All the users in this dataset are from the USA.
There are 12 possible outcomes for the destination country: ‘US’, ‘FR’, ‘CA’, ‘GB’, ‘ES’, ‘IT’, ‘PT’, ‘NL’, ‘DE’, ‘AU’, ‘NDF’ (no destination found), and ‘other’. Note that ‘NDF’ is different from ‘other’: ‘other’ means there was a booking, but to a country not included in the list, while ‘NDF’ means there was no booking at all.
# Load the Data
df <- read.csv("train_users_2.csv")
dim(df)
## [1] 213451 16
names(df)
## [1] "id" "date_account_created"
## [3] "timestamp_first_active" "date_first_booking"
## [5] "gender" "age"
## [7] "signup_method" "signup_flow"
## [9] "language" "affiliate_channel"
## [11] "affiliate_provider" "first_affiliate_tracked"
## [13] "signup_app" "first_device_type"
## [15] "first_browser" "country_destination"
str(df)
## 'data.frame': 213451 obs. of 16 variables:
## $ id : Factor w/ 213451 levels "00023iyk9l","0005ytdols",..: 100523 48039 26485 68504 48956 147281 129610 2144 59779 40826 ...
## $ date_account_created : Factor w/ 1634 levels "2010-01-01","2010-01-02",..: 171 502 263 696 249 1 2 3 4 4 ...
## $ timestamp_first_active : num 2.01e+13 2.01e+13 2.01e+13 2.01e+13 2.01e+13 ...
## $ date_first_booking : Factor w/ 1977 levels "","2010-01-02",..: 1 1 194 960 35 2 4 10 190 3 ...
## $ gender : Factor w/ 4 levels "-unknown-","FEMALE",..: 1 3 2 2 1 1 2 2 2 1 ...
## $ age : num NA 38 56 42 41 NA 46 47 50 46 ...
## $ signup_method : Factor w/ 3 levels "basic","facebook",..: 2 2 1 2 1 1 1 1 1 1 ...
## $ signup_flow : int 0 0 3 0 0 0 0 0 0 0 ...
## $ language : Factor w/ 25 levels "ca","cs","da",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ affiliate_channel : Factor w/ 8 levels "api","content",..: 3 8 3 3 3 4 4 3 4 4 ...
## $ affiliate_provider : Factor w/ 18 levels "baidu","bing",..: 5 9 5 5 5 13 3 5 3 3 ...
## $ first_affiliate_tracked: Factor w/ 8 levels "","linked","local ops",..: 8 8 8 8 8 5 8 5 8 5 ...
## $ signup_app : Factor w/ 4 levels "Android","iOS",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ first_device_type : Factor w/ 9 levels "Android Phone",..: 6 6 9 6 6 6 6 6 6 6 ...
## $ first_browser : Factor w/ 52 levels "-unknown-","Android Browser",..: 9 9 24 18 9 9 42 42 42 18 ...
## $ country_destination : Factor w/ 12 levels "AU","CA","DE",..: 8 8 12 10 12 12 12 12 12 12 ...
Here is a summary of all the variables:
summary(df)
## id date_account_created timestamp_first_active
## 00023iyk9l: 1 2014-05-13: 674 Min. :2.009e+13
## 0005ytdols: 1 2014-06-24: 670 1st Qu.:2.012e+13
## 000guo2307: 1 2014-06-25: 636 Median :2.013e+13
## 000wc9mlv3: 1 2014-05-20: 632 Mean :2.013e+13
## 0012yo8hu2: 1 2014-05-14: 622 3rd Qu.:2.014e+13
## 001357912w: 1 2014-05-21: 602 Max. :2.014e+13
## (Other) :213445 (Other) :209615
## date_first_booking gender age signup_method
## :124543 -unknown-:95688 Min. : 1.00 basic :152897
## 2014-05-22: 248 FEMALE :63041 1st Qu.: 28.00 facebook: 60008
## 2014-06-11: 231 MALE :54440 Median : 34.00 google : 546
## 2014-06-24: 226 OTHER : 282 Mean : 49.67
## 2014-05-21: 225 3rd Qu.: 43.00
## 2014-06-10: 223 Max. :2014.00
## (Other) : 87755 NA's :87990
## signup_flow language affiliate_channel
## Min. : 0.000 en :206314 direct :137727
## 1st Qu.: 0.000 zh : 1632 sem-brand : 26045
## Median : 0.000 fr : 1172 sem-non-brand: 18844
## Mean : 3.267 es : 915 other : 8961
## 3rd Qu.: 0.000 ko : 747 seo : 8663
## Max. :25.000 de : 732 api : 8167
## (Other): 1939 (Other) : 5044
## affiliate_provider first_affiliate_tracked signup_app
## direct :137426 untracked :109232 Android: 5454
## google : 51693 linked : 46287 iOS : 19019
## other : 12549 omg : 43982 Moweb : 6261
## craigslist: 3471 tracked-other: 6156 Web :182717
## bing : 2328 : 6065
## facebook : 2273 product : 1556
## (Other) : 3711 (Other) : 173
## first_device_type first_browser country_destination
## Mac Desktop :89600 Chrome :63845 NDF :124543
## Windows Desktop:72716 Safari :45169 US : 62376
## iPhone :20759 Firefox :33655 other : 10094
## iPad :14339 -unknown- :27266 FR : 5023
## Other/Unknown :10667 IE :21068 IT : 2835
## Android Phone : 2803 Mobile Safari:19274 GB : 2324
## (Other) : 2567 (Other) : 3174 (Other): 6256
As in principal component analysis (PCA), features are only informative when they vary: a feature with (near-)zero variance is almost constant, so it does little to distinguish observations and is a candidate for removal.
zero.var = nearZeroVar(df, saveMetrics=TRUE)
zero.var
## freqRatio percentUnique zeroVar nzv
## id 1.000000 1.000000e+02 FALSE FALSE
## date_account_created 1.005970 7.655153e-01 FALSE FALSE
## timestamp_first_active 1.051801 1.035366e-01 FALSE FALSE
## date_first_booking 502.189516 9.262079e-01 FALSE TRUE
## gender 1.517869 1.873966e-03 FALSE FALSE
## age 1.017952 5.949843e-02 FALSE FALSE
## signup_method 2.547944 1.405475e-03 FALSE FALSE
## signup_flow 11.238079 7.964357e-03 FALSE FALSE
## language 126.417892 1.171229e-02 FALSE TRUE
## affiliate_channel 5.288040 3.747933e-03 FALSE FALSE
## affiliate_provider 2.658503 8.432849e-03 FALSE FALSE
## first_affiliate_tracked 2.359885 3.747933e-03 FALSE FALSE
## signup_app 9.607077 1.873966e-03 FALSE FALSE
## first_device_type 1.232191 4.216424e-03 FALSE FALSE
## first_browser 1.413469 2.436156e-02 FALSE FALSE
## country_destination 1.996649 5.621899e-03 FALSE FALSE
None of the features has zero variance (although date_first_booking and language are flagged as near-zero variance), so no feature is removed at this stage.
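For reference, if any column had been flagged as zero-variance, a minimal sketch of dropping it would look like the following (kept as a comment here, since nothing needs to be removed):
# hypothetical: drop any columns flagged by nearZeroVar as zero-variance
# df <- df[, !zero.var$zeroVar]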
Here is a summary of the age variable in the dataset.
summary(df$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 28.00 34.00 49.67 43.00 2014.00 87990
We can see that both the minimum and maximum values of age are implausible.
qplot(x = age, data = df, binwidth = 1)
qplot(x = age, data = df, binwidth = 1) +
scale_x_continuous(limits = c(18, 75), breaks = seq(18, 75, 2))
There are outliers with an obviously false age of around 2000 years, most probably a mistake made by users during registration.
The following boxplots show that these outliers exist for every gender. It is possible that some users entered the current year instead of their year of birth when registering on Airbnb.
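As a quick check of this hypothesis (a sketch only; the modelling section later simply discards such values), the year-like “ages” can be tabulated:
# how many reported ages look like calendar years rather than ages
table(df$age[!is.na(df$age) & df$age > 1000])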
a1 <- qplot(x = gender, y = age, data = df,
geom = "boxplot")
a2 <- qplot(x = gender, y = age, data = df,
geom = "boxplot") +
coord_cartesian(ylim = c(18,75))
a3 <- qplot(x = gender, y = age, data = df,
geom = "boxplot") +
coord_cartesian(ylim = c(0,18))
grid.arrange(a1, a2, a3, ncol = 2)
by(df$age, df$gender, summary)
## df$gender: -unknown-
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 29.00 35.00 46.78 45.00 2014.00 78845
## --------------------------------------------------------
## df$gender: FEMALE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 28.00 33.00 52.73 42.00 2014.00 5328
## --------------------------------------------------------
## df$gender: MALE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 29.00 34.00 47.08 43.00 2014.00 3763
## --------------------------------------------------------
## df$gender: OTHER
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 17.00 29.75 35.00 62.90 44.00 2014.00 54
Female users have a slightly lower median age than males and a slightly higher mean age, while the 1st and 3rd quartiles are very similar for both genders.
The minimum age for males, females and unknown-gender users starts at 1-2 years, which is obviously false. There are no outliers below 17 years for the gender “OTHER”, which suggests that users who select this gender during registration lie about their age less often. There are more young-age outliers among males than among females or users of unknown gender.
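A quick count of the implausibly young ages by gender (a sketch, using 18 years as the cut-off) supports this observation:
# number of users reporting an age below 18, by gender
table(df$gender[!is.na(df$age) & df$age < 18])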
by(df$age, df$country_destination, summary)
## df$country_destination: AU
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 18.00 29.00 35.00 42.81 44.00 2014.00 103
## --------------------------------------------------------
## df$country_destination: CA
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 18.00 29.00 34.00 48.99 43.00 2014.00 351
## --------------------------------------------------------
## df$country_destination: DE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 5.00 28.00 34.00 53.55 43.00 2014.00 210
## --------------------------------------------------------
## df$country_destination: ES
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 27.00 32.00 46.24 40.00 2014.00 543
## --------------------------------------------------------
## df$country_destination: FR
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 5.00 29.00 34.00 48.11 44.00 2014.00 1310
## --------------------------------------------------------
## df$country_destination: GB
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 5.00 29.00 35.00 50.16 46.00 2014.00 550
## --------------------------------------------------------
## df$country_destination: IT
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 4.00 28.00 34.00 50.65 44.00 2014.00 799
## --------------------------------------------------------
## df$country_destination: NDF
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 28.00 34.00 51.83 44.00 2014.00 67614
## --------------------------------------------------------
## df$country_destination: NL
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 18.00 28.00 33.00 53.15 41.00 2014.00 160
## --------------------------------------------------------
## df$country_destination: other
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 29.00 34.00 48.84 42.00 2014.00 2518
## --------------------------------------------------------
## df$country_destination: PT
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 19.00 28.00 32.00 48.75 41.75 2014.00 59
## --------------------------------------------------------
## df$country_destination: US
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 28.00 33.00 47.41 41.00 2014.00 13773
ggplot(df, aes(country_destination, age)) +
geom_boxplot(aes()) +
ylim(25,50)
ggplot(df, aes(age)) +
geom_histogram(binwidth = 1, color = 'black', fill = '#099DD9') +
xlim(18,75) +
facet_wrap(~country_destination, ncol = 3, scales = "free")
ggplot(df, aes(age)) +
geom_histogram(binwidth = 1, color = 'black', fill = '#099DD9') +
geom_histogram(data=subset(df,age==20), color = "black", fill="red", binwidth = 1) +
#xlim(15, 25) +
scale_x_continuous(limits = c(15, 25), breaks = seq(15, 25, 1)) +
facet_wrap(~country_destination, ncol = 3, scales = "free")
There seems to be no big difference in the overall age distribution of users across the different destination countries. However, there is a clear dip at age 20, both among users who booked accommodation (for any country) and among those who did not book at all: 20-year-olds are clearly a minority among Airbnb users.
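The dip at age 20 can also be checked numerically (a quick sketch):
# number of users reporting each age around 20
table(df$age[!is.na(df$age) & df$age %in% 18:22])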
#----- summary by age -----#
df.age.summ <- summarise(group_by(df, country_destination),
medianAge = median(age, na.rm = TRUE),
meanAge = mean(age, na.rm = TRUE),
sdAge = sd(age, na.rm = TRUE))
spdf <- joinCountryData2Map(subset(df.age.summ, country_destination != "other" & country_destination != "NDF"),
joinCode="ISO_A2",
nameJoinColumn="country_destination")
## 10 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 233 codes from the map weren't represented in your data
df_map <- spdf@data
mapCountryData(spdf,
nameColumnToPlot="medianAge",
catMethod="fixedWidth",
numCats=3,
mapTitle = "Median Age of Airbnb Users"
)
mapCountryData(spdf,
nameColumnToPlot="meanAge",
catMethod="fixedWidth",
numCats=10,
mapTitle = "Mean Age of Airbnb Users"
)
df.age.summ2 <- summarise(group_by(subset(df, age<76 & age>17), country_destination),
meanAge = mean(age, na.rm = TRUE),
sdAge = sd(age, na.rm = TRUE))
spdf2 <- joinCountryData2Map(subset(df.age.summ2,
country_destination != "other" & country_destination != "NDF"),
joinCode="ISO_A2",
nameJoinColumn="country_destination")
## 10 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 233 codes from the map weren't represented in your data
df_map <- spdf2@data
mapCountryData(spdf2,
nameColumnToPlot="meanAge",
catMethod="fixedWidth",
numCats=10,
mapTitle = "Mean Age of Airbnb Users in the Age-Group 18-75"
)
ggplot(df, aes(x = gender)) +
geom_bar()
df.gender <- spread(df, gender, age)
colnames(df.gender)[15] <- "unknown"
df.gender.summ <- summarise(group_by(df.gender, country_destination),
countF = sum(!is.na(FEMALE)),
countM = sum(!is.na(MALE)),
countO = sum(!is.na(OTHER)),
countUn = sum(!is.na(unknown)),
total_with_NA = n())
spdf <- joinCountryData2Map(subset(df.gender.summ,
country_destination != "other" &
country_destination != "NDF"),
joinCode="ISO_A2",
nameJoinColumn="country_destination")
## 10 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 233 codes from the map weren't represented in your data
df_map <- subset(spdf@data, !is.na(country_destination))
par(mai= c(0,0,0.9,0),
xaxs = "i",
yaxs = "i")
mapPies(dF =df_map,
nameX="LON",
nameY="LAT",
nameZs =c("countF",
"countM",
"countO",
"countUn") ,
zColours=c("red",
"green",
"yellow",
"blue"),
oceanCol = "lightblue",
landCol = "wheat",
addSizeLegend=T,
addCatLegend=F,
mapRegion="world",
xlim=c(-181,181),
ylim=c(-81,80),
symbolSize = 2)
## symbolMaxSize= 7.24 maxSumValues= 48603 symbolScale= 0.03284029
## List of 2
## $ x: num [1:100] -138 -138 -138 -138 -138 ...
## $ y: num [1:100] 76.6 77.5 78.4 79.3 80.2 ...
title(main=paste("Number of Airbnb Users by Gender"),
cex=3)
legend(-180.1516,90,
legend=c("Female",
"Male",
"Other",
"Unknown"),
col=c("red",
"green",
"yellow",
"blue"),
pch=16,
cex=0.8,
pt.cex=1.5,
bty="o",
box.lty=0,
horiz = F,
bg="#FFFFFF70")
I want to convert both date_account_created and date_first_booking to a date type so that I can later calculate the period between these two dates.
Empty values cannot be converted to dates, so for users without a booking I copy the value of date_account_created into date_first_booking; the calculated period between the two dates is then zero.
df$date_account_created <- as.character(df$date_account_created)
df$date_first_booking <- as.character(df$date_first_booking)
df$date_first_booking <- ifelse(df$date_first_booking == "", df$date_account_created, df$date_first_booking)
df$date_account_created <- as.Date(df$date_account_created)
df$date_first_booking <- as.Date(df$date_first_booking)
summary(df$date_account_created)
## Min. 1st Qu. Median Mean 3rd Qu.
## "2010-01-01" "2012-12-26" "2013-09-11" "2013-06-25" "2014-03-06"
## Max.
## "2014-06-30"
summary(df$date_first_booking)
## Min. 1st Qu. Median Mean 3rd Qu.
## "2010-01-02" "2013-01-14" "2013-09-24" "2013-07-13" "2014-03-22"
## Max.
## "2015-06-29"
It may be useful to see whether users create accounts or make bookings more often in specific months, on specific days of the month, or on specific weekdays. For this, I will need new variables: month_acc_created, day_acc_created, dac_weekday, month_first_book, day_first_book and weekday_first_book.
#use %b for Jan, %m for 01, %B for January, %a for Mon, %A for Monday
df$month_acc_created <- factor(format(df$date_account_created, format = "%B"),
levels = c("January", "February", "March", "April",
"May", "June", "July", "August", "September",
"October", "November", "December"))
df$day_acc_created <- factor(format(df$date_account_created, format = "%d"))
df$dac_weekday <- factor(format(df$date_account_created, format = "%A"),
levels = c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday",
"Sunday"))
df$month_first_book <- factor(format(df$date_first_booking, format = "%B"),
levels = c("January", "February", "March", "April",
"May", "June", "July", "August", "September",
"October", "November", "December"))
df$day_first_book <- factor(format(df$date_first_booking, format = "%d"))
df$weekday_first_book <- factor(format(df$date_first_booking, format = "%A"),
levels = c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday",
"Sunday"))
summary(df$month_first_book)
## January February March April May June July
## 16967 16348 20046 21414 25435 26886 15051
## August September October November December
## 15284 15741 13935 13105 13239
qplot(x = month_first_book, data = df)
ggplot(subset(df, country_destination != "NDF"), aes(x = month_first_book)) +
#geom_bar(las=2)+
geom_bar(data=subset(df,country_destination != "NDF" & country_destination != "AU" &
month_first_book %in% c("June", "May")),
color = "black", fill="green") +
geom_bar(data=subset(df,country_destination != "NDF" & country_destination != "AU" &
month_first_book %in% c("November", "December")),
color = "black", fill="red") +
geom_bar(data=subset(df,country_destination == "AU" &
month_first_book == "November"),
color = "black", fill="green") +
geom_bar(data=subset(df,country_destination == "AU" &
month_first_book %in% c("June", "July","August")),
color = "black", fill="red") +
geom_bar(data=subset(df,country_destination == "PT" &
month_first_book %in% c("July", "August", "September",
"October","November","December","January")),
color = "black", fill="red") +
geom_bar(data=subset(df,country_destination == "PT" &
month_first_book %in% c("July", "August", "September",
"October", "January")),
color = "black", fill="red") +
facet_wrap(~country_destination, scales = "free")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
The majority of bookings are made during the months of March to June. November and December are the lowest months in terms of the number of reservations.
Among the countries, Australia and Portugal look different from the others. For Portugal, reservations are very low for the whole period from June to the end of the year (and also in January). In Australia, June, July and August are all low-season months, and a higher number of reservations is made in November.
In all the other countries, May and June are the peak months for reservations.
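A quick numeric check of the Australian pattern (a sketch; counts of first bookings to Australia by month):
# first bookings to Australia by month; November should stand out
with(subset(df, country_destination == "AU"), table(month_first_book))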
Now, let’s see how the reservations are distributed by the day of booking.
qplot(x = day_first_book, data = df)
ggplot(subset(df, country_destination != "NDF"), aes(x=day_first_book)) +
geom_bar()+
facet_wrap(~country_destination, scales = "free")
I do not see anything specific here: the number of bookings made on each day of the month looks more or less the same. The 31st obviously has fewer bookings than the other days, simply because not all months have a 31st.
Let’s have a look at the reservations’ distribution over the weekdays.
qplot(x = weekday_first_book, data = df)
ggplot(subset(df, country_destination != "NDF"), aes(x=weekday_first_book)) +
geom_bar()+
facet_wrap(~country_destination, scales = "free")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
In general, it seems that fewer reservations are made during the weekend. That is probably because people are travelling over the weekend rather than planning travel.
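A simple count supports this (a sketch, restricted to users who actually booked):
# number of first bookings by weekday
table(subset(df, country_destination != "NDF")$weekday_first_book)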
df$days_to_book <- as.numeric(df$date_first_booking - df$date_account_created)
summary(df$days_to_book)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -349.00 0.00 0.00 18.48 1.00 365.00
by(df$days_to_book, df$country_destination, summary)
## df$country_destination: AU
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -77.00 1.00 5.00 46.04 41.00 364.00
## --------------------------------------------------------
## df$country_destination: CA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.00 1.00 3.00 45.73 26.25 364.00
## --------------------------------------------------------
## df$country_destination: DE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -269.00 1.00 4.00 39.53 30.00 364.00
## --------------------------------------------------------
## df$country_destination: ES
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.00 4.00 37.35 25.00 365.00
## --------------------------------------------------------
## df$country_destination: FR
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.00 4.00 37.59 25.00 364.00
## --------------------------------------------------------
## df$country_destination: GB
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -109.00 1.00 4.00 39.44 28.00 365.00
## --------------------------------------------------------
## df$country_destination: IT
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -298.00 1.00 3.00 36.68 28.00 364.00
## --------------------------------------------------------
## df$country_destination: NDF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 0
## --------------------------------------------------------
## df$country_destination: NL
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 1.0 3.0 42.1 26.0 365.0
## --------------------------------------------------------
## df$country_destination: other
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -85.00 1.00 4.00 47.88 41.00 365.00
## --------------------------------------------------------
## df$country_destination: PT
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 4.00 44.49 29.00 364.00
## --------------------------------------------------------
## df$country_destination: US
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -349.00 0.00 3.00 45.19 28.00 365.00
ggplot(df, aes(country_destination, days_to_book)) +
geom_boxplot()
ggplot(subset(df, country_destination != "NDF"), aes(gender, days_to_book)) +
geom_boxplot()
ggplot(subset(df, country_destination != "NDF"), aes(language, days_to_book)) +
geom_boxplot()
summary(df$signup_method)
## basic facebook google
## 152897 60008 546
qplot(x = signup_method, data = df)
ggplot(subset(df, country_destination != "NDF" & country_destination != "US" & signup_method == "facebook"), aes(x = signup_method)) +
geom_bar()+
facet_wrap(~country_destination)+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
It seems that signup_method may be a useful feature. Users who signed up with Facebook selected some countries, such as France, Italy or Spain, more often than others.
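One quick way to quantify this (a sketch) is the share of Facebook sign-ups within each destination country:
# proportion of Facebook sign-ups per destination country
round(prop.table(table(df$signup_method, df$country_destination), 2)["facebook", ], 3)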
First, let’s look at how languages are distributed across the users who never made a reservation.
df <- within(df, language <- factor(language,
levels=names(sort(table(language),
decreasing=TRUE))))
qAll <- ggplot(subset(df, country_destination == "NDF"),
aes(language)) +
geom_bar(aes(fill = gender)) +
ggtitle("Frequency distribution - NDF, All Languages")
qWoEng <- ggplot(subset(df, country_destination == "NDF" & language != "en"),
aes(language)) +
geom_bar(aes(fill = gender)) +
ggtitle("Frequency Distribution - NDF, W/o English")
grid.arrange(qAll, qWoEng, ncol = 1)
Now, compare this to the distribution across all users who did make a reservation through Airbnb.
qAll <- ggplot(subset(df, country_destination != "NDF"),
aes(language)) +
geom_bar(aes(fill = gender)) +
ggtitle("Frequency distribution - Reservations, All Languages")
qWoEng <- ggplot(subset(df, country_destination != "NDF" & language != "en"),
aes(language)) +
geom_bar(aes(fill = gender)) +
ggtitle("Frequency Distribution - Reservations, W/o English")
grid.arrange(qAll, qWoEng, ncol = 1)
The graphs look quite similar, although there are slight differences in the ordering. For example, German-speaking users make reservations at almost the same level as Spanish-speaking users, whereas among non-reserving users German speakers appear less frequently than Spanish speakers.
English is the main language for the majority of the users. However, there may be more insight in the other (non-English) languages, so I will look in detail at the most used user language (other than English) for each European destination country.
ggplot(subset(df, country_destination == "DE" & language != "en"),
aes(language)) +
geom_bar(aes(fill = gender)) +
ggtitle("Frequency Distribution for Germany (w/o English)") +
geom_bar(data=subset(df,country_destination == "DE" & language == "de"), color = "black", size = 1, aes(fill = gender))
ggplot(subset(df, country_destination == "FR" & language != "en"),
aes(language)) +
geom_bar(aes(fill = gender)) +
ggtitle("Frequency Distribution for France (w/o English)")+
geom_bar(data=subset(df,country_destination == "FR" & language == "fr"), color = "black", size = 1, aes(fill = gender))
ggplot(subset(df, country_destination == "ES" & language != "en"),
aes(language)) +
geom_bar(aes(fill = gender)) +
ggtitle("Frequency Distribution for Spain (w/o English)")+
geom_bar(data=subset(df,country_destination == "ES" & language == "es"), color = "black", size = 1, aes(fill = gender))
ggplot(subset(df, country_destination == "NL" & language != "en"),
aes(language)) +
geom_bar(aes(fill = gender)) +
ggtitle("Frequency Distribution for Netherlands (w/o English)")+
geom_bar(data=subset(df,country_destination == "NL" & language == "nl"), color = "black", size = 1, aes(fill = gender))
ggplot(subset(df, country_destination == "PT" & language != "en"),
aes(language)) +
geom_bar(aes(fill = gender)) +
ggtitle("Frequency Distribution for Portugal (w/o English)")+
geom_bar(data=subset(df,country_destination == "PT" & language == "pt"), color = "black", size = 1, aes(fill = gender))
It seems that there is a correlation between a user preferring a European language and the destination country they book.
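To quantify this correlation (a sketch only; lang_map is an assumed pairing of each destination with its main language), the share of users whose preferred language matches the destination’s language can be computed:
# assumed mapping of destination country to its main language
lang_map <- c(DE = "de", FR = "fr", ES = "es", IT = "it", NL = "nl", PT = "pt")
# share of users booking each country whose site language matches that country's language
sapply(names(lang_map), function(cc)
  round(mean(df$language[df$country_destination == cc] == lang_map[cc]), 3))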
library(xgboost)
library(readr)
library(stringr)
library(caret)
library(car)
library(Ckmeans.1d.dp) # to plot feature importance
New users on Airbnb can book a place to stay in 34,000+ cities across 190+ countries. Now, I will try to predict, for each new user, the five most probable countries in which they will book a place to stay.
First, I load the test data and make some manipulations with the data, e.g. replacing all missing values with -1.
rm(list = ls())
# load data
df_train = read_csv("train_users_2.csv")
df_test = read_csv("test_users.csv")
labels = df_train['country_destination']
df_train = df_train[-grep('country_destination', colnames(df_train))]
# combine train and test data
df_all = rbind(df_train,df_test)
# remove date_first_booking
df_all = df_all[-c(which(colnames(df_all) %in% c('date_first_booking')))]
# replace missing values
df_all[is.na(df_all)] <- -1
I clean the data by 1) splitting date_account_created into year, month and day, 2) doing the same with timestamp_first_active, and 3) removing unrealistic age values, i.e. those greater than 100 or smaller than 14.
# split date_account_created in year, month and day
dac = as.data.frame(str_split_fixed(df_all$date_account_created, '-', 3))
df_all['dac_year'] = dac[,1]
df_all['dac_month'] = dac[,2]
df_all['dac_day'] = dac[,3]
df_all$dac_weekday <- factor(format(df_all$date_account_created, format = "%A"),
levels = c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday",
"Sunday"))
df_all = df_all[,-c(which(colnames(df_all) %in% c('date_account_created')))]
# split timestamp_first_active in year, month and day
df_all[,'tfa_year'] = substring(as.character(df_all[,'timestamp_first_active']), 1, 4)
df_all['tfa_month'] = substring(as.character(df_all[,'timestamp_first_active']), 5, 6)
df_all['tfa_day'] = substring(as.character(df_all[,'timestamp_first_active']), 7, 8)
df_all = df_all[,-c(which(colnames(df_all) %in% c('timestamp_first_active')))]
# clean Age by removing values
df_all[df_all$age < 14 | df_all$age > 100,'age'] <- -1
# one-hot-encoding features
ohe_feats = c('gender', 'signup_method', 'signup_flow', 'language', 'affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked', 'signup_app', 'first_device_type', 'first_browser')
dummies <- dummyVars(~ gender + signup_method + signup_flow + language + affiliate_channel + affiliate_provider + first_affiliate_tracked + signup_app + first_device_type + first_browser, data = df_all)
df_all_ohe <- as.data.frame(predict(dummies, newdata = df_all))
df_all_combined <- cbind(df_all[,-c(which(colnames(df_all) %in% ohe_feats))],df_all_ohe)
# split train and test
X = df_all_combined[df_all_combined$id %in% df_train$id,]
y <- recode(labels$country_destination,"'NDF'=0; 'US'=1; 'other'=2; 'FR'=3; 'CA'=4; 'GB'=5; 'ES'=6; 'IT'=7; 'PT'=8; 'NL'=9; 'DE'=10; 'AU'=11")
X_test = df_all_combined[df_all_combined$id %in% df_test$id,]
Now I build a machine learning model to predict the destination country from the features, using the XGBoost (extreme gradient boosting) algorithm.
# train xgboost
xgb <- xgboost(data = data.matrix(X),
label = y,
eta = 0.1,
max_depth = 9,
nround=25,
subsample = 0.5,
colsample_bytree = 0.5,
seed = 1,
eval_metric = "merror",
objective = "multi:softprob",
num_class = 12,
nthread = 3,
missing=NA
)
## Warning in data.matrix(X): NAs introduced by coercion
## Warning in data.matrix(X): NAs introduced by coercion
## Warning: parameter seed is ignored, please set random seed using set.seed
## [0] train-merror:0.386515
## [1] train-merror:0.399591
## [2] train-merror:0.390975
## [3] train-merror:0.373477
## [4] train-merror:0.375407
## [5] train-merror:0.371912
## [6] train-merror:0.370389
## [7] train-merror:0.366168
## [8] train-merror:0.364946
## [9] train-merror:0.364843
## [10] train-merror:0.364416
## [11] train-merror:0.364472
## [12] train-merror:0.364660
## [13] train-merror:0.364688
## [14] train-merror:0.364205
## [15] train-merror:0.363395
## [16] train-merror:0.363535
## [17] train-merror:0.363512
## [18] train-merror:0.362041
## [19] train-merror:0.361319
## [20] train-merror:0.361193
## [21] train-merror:0.360842
## [22] train-merror:0.360664
## [23] train-merror:0.360270
## [24] train-merror:0.359553
Now, I analyse the feature importance of the trained model.
## Analyse and visualize the xgboost model
model <- xgb.dump(xgb, with.stats = T)
model[1:10]
## [1] "booster[0]"
## [2] "0:[f10<0.5] yes=1,no=2,missing=1,gain=2473.81,cover=16263.2"
## [3] "1:[f17<9] yes=3,no=4,missing=3,gain=1685.54,cover=11447.8"
## [4] "3:[f17<1.5] yes=7,no=8,missing=7,gain=1134.43,cover=9605.45"
## [5] "7:[f46<0.5] yes=15,no=16,missing=15,gain=473.749,cover=8766.24"
## [6] "15:[f77<0.5] yes=29,no=30,missing=29,gain=124.152,cover=3166.78"
## [7] "29:[f49<0.5] yes=53,no=54,missing=53,gain=134.544,cover=2719.29"
## [8] "53:[f51<0.5] yes=97,no=98,missing=97,gain=126.618,cover=1554.97"
## [9] "97:[f2<3.5] yes=155,no=156,missing=155,gain=39.8956,cover=1235.06"
## [10] "155:[f17<0.5] yes=243,no=244,missing=243,gain=12.8258,cover=290.278"
#names <- dimnames(data.matrix(X[,-1]))[[2]] # delete -1 ?
names <- dimnames(data.matrix(X))[[2]]
## Warning in data.matrix(X): NAs introduced by coercion
## Warning in data.matrix(X): NAs introduced by coercion
importance_matrix <- xgb.importance(names, model = xgb)
xgb.plot.importance(importance_matrix[1:20,])
#library(DiagrammeR)
#xgb.plot.tree(feature_names = names, model = xgb, n_first_tree = 2)
The feature importance plot is useful for selecting only the best features, i.e. those most strongly related to the outcome. In this case, age has the highest importance.
To improve model fitting (training time and overfitting) and to simplify the model, the less important features can be removed, as sketched below.
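A minimal sketch of such a reduction (hypothetical; the cut-off of 30 features and the names top_feats, X_reduced and xgb_small are arbitrary choices, not part of the pipeline above) would keep only the top-ranked features from importance_matrix and retrain on them:
# hypothetical: keep only the 30 most important features and rebuild the design matrix
top_feats <- head(importance_matrix$Feature, 30)
X_reduced <- X[, colnames(X) %in% top_feats, drop = FALSE]
# the model could then be retrained on the reduced matrix, e.g.
# xgb_small <- xgboost(data = data.matrix(X_reduced), label = y, nround = 25,
#                      objective = "multi:softprob", num_class = 12, missing = NA)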
Finally, I use the model above to predict values for the test set and to extract the 5 destination countries with the highest probabilities for each user.
# predict values in test set
y_pred <- predict(xgb, data.matrix(X_test),missing=NA)
## Warning in data.matrix(X_test): NAs introduced by coercion
## Warning in data.matrix(X_test): NAs introduced by coercion
# extract the 5 classes with highest probabilities
predictions <- as.data.frame(matrix(y_pred, nrow=12))
rownames(predictions) <- c('NDF','US','other','FR','CA','GB','ES','IT','PT','NL','DE','AU')
predictions_top5 <- as.vector(apply(predictions, 2, function(x) names(sort(x)[12:8])))
Then I create the results data frame.
# create submission
ids <- NULL
for (i in 1:NROW(X_test)) {
idx <- X_test$id[i]
ids <- append(ids, rep(idx,5))
}
results <- NULL
results$id <- ids
results$country <- predictions_top5
# generate submission file
results <- as.data.frame(results)
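The submission file itself is not written above; a minimal sketch (assuming the usual Kaggle submission format with id and country columns, and an arbitrary file name) would be:
# write the submission file
write.csv(results, "submission.csv", quote = FALSE, row.names = FALSE)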
These results correspond to a score of 0.86956 when submitted to the Airbnb New User Bookings Kaggle competition.