1. Introduction

  • 이제 막 Airbnb에 가입한 New 사용자는 190개 이상의 국가 중 34,000개 이상의 도시의 숙소를 Airbnb를 통해 예약할 수 있다.
    이 사용자가 어느 도시 및 국가를 처음으로 방문할 것인지 예측할 수 있다면, Airbnb는 맞춤 컨텐츠를 제공해 줄 수 있으며 첫 예약하기까지의 사용자의 시간을 줄여줄 수 있을 것이다.

1.1. Aim

Analysis - Airbnb의 데이터를 바탕으로 전환율과 유입 경로, 예약하는 요일 등을 분석

Analytics - Airbnb 데이터를 바탕으로 머신러닝 모델링을 통해 어느 나라의 유저가 처음 어느곳을 예약할 지 예측 - 새로운 가입자가 최초로 예약할 장소를 예측하는 것

1.2. Data Description

  • 사용할 데이터는 인구통계학 수치(demographics)와 web session records 그리고 몇몇 요약된 통계자료로 구성되어 있음.

[Files & Data fields]

- train.users.csv / test.users.csv

  • id : 유저 id
  • date_account_created : 계정 생성 날짜
  • timestamp_first_active : 처음 활동한 날짜 (이 feature는 계정 생성 or 첫 예약 날짜보다 이전일 것입니다)
  • date_first_booking : 첫 예약한 날짜
  • gender : 성별
  • age : 나이
  • signup_method : 회원가입 방법
  • signup_flow : 회원가입할 때 사용자가 보고있는 page 번호
  • language : 선호 언어, 사용 언어(모국어)
  • affiliate_channel : Airbnb 유입경로
  • affiliate_provider : 유입 장소 (google, craigslist, other)
  • first_affiliate_tracked : 회원가입 전 user가 반응한 첫번째 marketing
  • signup_app : Web , iOS, Android etc
  • first_device_type : Window/Mac desktop , iPhone etc
  • first_browser : Chrome, Safari, Firefox etc
  • country_destination : 첫 예약 국가 / 우리가 예측해야하는 target 변수

- sessions.csv : 웹 사용자 log

  • user_id : 유저 id
  • action : 활동 ( search_result, lookup, index etc)
  • action_type : Click, View etc
  • device_type : Window/Mac desktop etc
  • secs_elapsed : session 사용시간

- countries.csv : 첫 예약 국가 요약 통계자료

- age_gender_bkts.csv : 나이와 성별에 따라 destination 요약 통계자료

2. Preprocessing

2.1. Import library & Load data

suppressPackageStartupMessages({
library(data.table)
library(forecast)
library(lubridate)
library(zoo)
library(ggplot2)
library(DescTools)
library(sqldf)
library(dplyr)
library(knitr)
library(VIM)
library(mice)
library(rpart)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
library(ggthemes)
library(DT)
library(corrplot)
library(psych)
library(caret)
library(gmodels)
library(C50)
library(xgboost)
library(readr)
library(stringr)
library(caret)
library(car)
})
train <- read.csv('./input/train_users_2.csv', stringsAsFactors = FALSE)
test <- read.csv('./input/test_users.csv', stringsAsFactors = FALSE)
train_unlabel <- train[-16] # EDA를 위해 label 제거(test data에는 label이 없기 때문에)

2.2. Merge data

  • EDA의 정확도를 높이기 위해 train set과 test set를 합쳐서 표본을 늘림.
  • merge할 때 두 데이터의 정합성을 위해 train의 label 변수를 제거.

Method 1 : R (rbind)

users_r <- rbind(train_unlabel, test)

Method2 : SQL (union all)

users_sql <- sqldf("
select *
from train_unlabel a

union all

select *
from test b
;")

2.3. Peek into the Data

datatable(head(train), style="bootstrap", class="table-condensed", options = list(dom = 'tp',scrollX = TRUE))

2.4. Reshape Date Type

  • date 관련된 변수들의 데이터 타입이 모두 문자열(character) 혹은 실수형(numeric)으로 인식되어 있음.
    • 정리 되지 않은 데이터는 가공이 필요함
      • 형식을 통일하고 빈 값이 있다면 분석 중에 error가 발생하지 않도록 적절한 키워드를 넣어줘야 함.
    • 이를 위해 아래와 같이 형식을 통일
      • date_account_created
        • ex) 2010-06-28 / character -> Date
      • timestamp_first_active
        • ex) 20090319043255 / numeric -> Date
      • date_first_booking
        • ex) 2010-08-02 / character -> Date

Reshape data : label table

train$date_first_booking <- as.Date(train$date_first_booking, "%Y-%m-%d")
train$date_account_created <- as.Date(train$date_account_created, "%Y-%m-%d")
train$timestamp_first_active <- as.character(train$timestamp_first_active)
train$timestamp_first_active <- as.Date(train$timestamp_first_active, "%Y%m%d")

train$date_first_booking_year <- year(train$date_first_booking)
train$date_first_booking_month <- month(train$date_first_booking)
train$date_first_booking_quarter <- quarter(train$date_first_booking)
train$date_first_booking_weekdays <- weekdays(train$date_first_booking)

train$date_account_created_year <- year(train$date_account_created)
train$date_account_created_month <- month(train$date_account_created)
train$date_account_created_quarter <- quarter(train$date_account_created)
train$date_account_created_weekdays <- weekdays(train$date_account_created)

train$timestamp_first_active_year <- year(train$timestamp_first_active)
train$timestamp_first_active_month <- month(train$timestamp_first_active)
train$timestamp_first_active_quarter <- quarter(train$timestamp_first_active)
train$timestamp_first_active_weekdays <- weekdays(train$timestamp_first_active)

Reshape data : merge table

users_r$date_first_booking <- as.Date(users_r$date_first_booking, "%Y-%m-%d")
users_r$date_account_created <- as.Date(users_r$date_account_created, "%Y-%m-%d")

users_r$timestamp_first_active <- as.character(users_r$timestamp_first_active)
users_r$timestamp_first_active <- as.Date(users_r$timestamp_first_active, "%Y%m%d")

users_r$date_first_booking_year <- year(users_r$date_first_booking)
users_r$date_first_booking_month <- month(users_r$date_first_booking)
users_r$date_first_booking_quarter <- quarter(users_r$date_first_booking)
users_r$date_first_booking_weekdays <- weekdays(users_r$date_first_booking)

users_r$date_account_created_year <- year(users_r$date_account_created)
users_r$date_account_created_month <- month(users_r$date_account_created)
users_r$date_account_created_quarter <- quarter(users_r$date_account_created)
users_r$date_account_created_weekdays <- weekdays(users_r$date_account_created)

users_r$timestamp_first_active_year <- year(users_r$timestamp_first_active)
users_r$timestamp_first_active_month <- month(users_r$timestamp_first_active)
users_r$timestamp_first_active_quarter <- quarter(users_r$timestamp_first_active)
users_r$timestamp_first_active_weekdays <- weekdays(users_r$timestamp_first_active)

2.4.3. Summary : Reshape date type

  • label을 제거한 train과 test 데이터를 결합한 users_r 테이블과 label이 있는 train 테이블에 날짜형 데이터를 사용할 수 있게끔 변환 시킴.

2.5. Missing values treatment

  • Missing values : date_first_booking(186,639 rows), age(116,866 rows)
  • 누락된 값, 비어있는 값
  • 함수 적용 불가, 분석 결과를 왜곡하므로 적당한 방법으로 처리
  • Visualization
    • colSums(is.na(data))
    • VIM
  • Solution
    • Deletion
      • row 제거
      • column 제거
    • Imputation
      • knnout
      • 대표값
        • 최빈값
        • 평균값

2.5.1. Find missing values

sort(colSums(is.na(users_r)), decreasing = TRUE)
##              date_first_booking         date_first_booking_year 
##                          186639                          186639 
##        date_first_booking_month      date_first_booking_quarter 
##                          186639                          186639 
##     date_first_booking_weekdays                             age 
##                          186639                          116866 
##                              id            date_account_created 
##                               0                               0 
##          timestamp_first_active                          gender 
##                               0                               0 
##                   signup_method                     signup_flow 
##                               0                               0 
##                        language               affiliate_channel 
##                               0                               0 
##              affiliate_provider         first_affiliate_tracked 
##                               0                               0 
##                      signup_app               first_device_type 
##                               0                               0 
##                   first_browser       date_account_created_year 
##                               0                               0 
##      date_account_created_month    date_account_created_quarter 
##                               0                               0 
##   date_account_created_weekdays     timestamp_first_active_year 
##                               0                               0 
##    timestamp_first_active_month  timestamp_first_active_quarter 
##                               0                               0 
## timestamp_first_active_weekdays 
##                               0
aggr(users_r, col = c('gray', 'red'), sortVars = T)

## 
##  Variables sorted by number of missings: 
##                         Variable     Count
##               date_first_booking 0.6773400
##          date_first_booking_year 0.6773400
##         date_first_booking_month 0.6773400
##       date_first_booking_quarter 0.6773400
##      date_first_booking_weekdays 0.6773400
##                              age 0.4241237
##                               id 0.0000000
##             date_account_created 0.0000000
##           timestamp_first_active 0.0000000
##                           gender 0.0000000
##                    signup_method 0.0000000
##                      signup_flow 0.0000000
##                         language 0.0000000
##                affiliate_channel 0.0000000
##               affiliate_provider 0.0000000
##          first_affiliate_tracked 0.0000000
##                       signup_app 0.0000000
##                first_device_type 0.0000000
##                    first_browser 0.0000000
##        date_account_created_year 0.0000000
##       date_account_created_month 0.0000000
##     date_account_created_quarter 0.0000000
##    date_account_created_weekdays 0.0000000
##      timestamp_first_active_year 0.0000000
##     timestamp_first_active_month 0.0000000
##   timestamp_first_active_quarter 0.0000000
##  timestamp_first_active_weekdays 0.0000000

2.5.2. Summary : Missing values

- 결측치는 아래와 같음
    - date_first_booking : 186,639 rows
    - date_first_booking_year : 186,639 rows
    - date_first_booking_month : 186,639 rows
    - date_first_booking_quarter : 186,639 rows
    - date_first_booking_weekdays : 186,639 rows
    - age : 116,866 rows
    
  
- 해결 방안
    - 방법 1 : 해당 row 제거
    - 방법 2 : 해당 variable 제거
    - 방법 3 : 결측치 대체 (평균, 중앙값, knn)
    
- 고민
    - 전체(275,547) - 결측치(186639) = 88,908 건
    - 상당한 양의 (결측)데이터를 날려버리면 표본이 줄어들기 때문에 대체하는 방향으로 결정

2.6. Find numeric & character & date type variables

# number of numeric & factor & date variables
numeric_var <- sapply(users_r, is.numeric)
char_var <- sapply(users_r, is.character)
date_var <- sapply(users_r, is.Date)

users_r_num <- users_r[, numeric_var]
users_r_char <- users_r[, char_var]
users_r_date <- users_r[, date_var]

#str(users_r_num)
#str(users_r_char)
#str(users_r_date)

# 11 numeric variables
# 13 character variables
# 3 date variables

2.7. Outliers Treatment

  • age 변수를 살펴보니 2000이 넘어가는 이상치를 발견, 존재할 수 없는 (가입시 실수로 기입한 걸로 예상되는)값이므로 제거하기로 결정

  • Visualization
    • boxplot
  • Preprocessing
    • 이상치 포함시 분석 결과 왜곡
    • 결측 처리 후 제외하고 분석
  • 방법
    • A. 정상 범위를 넘어가는 이상치를 결측 처리
    • B. 정상 범위를 넘어가는 이상치를 최소/최대값으로 변환
  • 정상범위 기준
    • Range 1
      • 1Q - (1.5 * IQR)
    • Ragne 2
      • 3Q + (1.5 * IQR)
  • 고민
    • A. 이상치 제외한 데이터 분석
      • 통계분석을 하지 않는 이상 굳이 할 필요가 있을까?
    • B. 이상치 처리후 데이터 분석
      • 현실은 이상치에 주목하는데 그걸 날려버릴 필요가 있을까?
      • Logistic 분석을 시행할 때는 필요
  • 처리 기준

    종류 예시 해결 방법
    존재할 수 없는 값 성별 변수에 3 결측처리
    극단적인 값 몸무게 변수에 200 정상 범위 기준 정해서 결측 처리

2.7.1. Boxplot

# merge numeric type data (unlable & test)
b_total <- boxplot(users_r_num)

# original 'age' data (train)
b_age_origin <- boxplot(train$age)

# merge 'age' data (unlabel & test)
b_age_modi <- boxplot(users_r$age)

# original data (train)
summary(train$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   28.00   34.00   49.67   43.00 2014.00   87990
hist(train$age, 5)

# merge data (unlabel & test)
#summary(users_r$age)
#hist(users_r$age, 5)

2.7.2. Preprocessing - Age Outliers

# original data (train)
#str(which(train$age < b_age_origin$stats[1])) # 1분위수 - 1.5IQR : 57개
#str(which(train$age > b_age_origin$stats[5])) # 3분위수 + 1.5IQR : 5,337개

# merge data (unlabel & test)
#str(which(users_r$age < b_age_modi$stats[1])) # 1분위수 - 1.5IQR : 59개
#str(which(users_r$age > b_age_modi$stats[5])) # 3분위수 + 1.5IQR : 7,772개

# original data (train)
out1 <- which(train$age < b_age_origin$stats[1])
out2 <- which(train$age > b_age_origin$stats[5])

# merge data (unlabel & test)
out3 <- which(users_r$age < b_age_modi$stats[1])
out4 <- which(users_r$age > b_age_modi$stats[5])

train <- train[-c(out1, out2), ]
users_r <- users_r[-c(out3, out4), ]

# original data (train)
summary(train$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    15.0    28.0    33.0    35.6    41.0    65.0   87990
hist(train$age)

boxplot(train$age)

# merge data (unlabel & test)
#summary(users_r$age)
#hist(users_r$age)
#boxplot(users_r$age)

2.7.3. Preprocessing - Ages (연령대)

  • 다양하게 분포된 연령을 대표값으로 군집화해서 분석
  • 10~19 : 10대 / 20~29 : 20대 / 30~39 : 30대 / 40~49 : 40대 / 50~59 : 50대 / 60~69 : 60대 / 70 ~ 79 : 70대
users_r$ages <- trunc(users_r$age / 10) * 10
train$ages <- trunc(train$age / 10) * 10

# table(users_r$ages)
Desc(train$ages)
## ------------------------------------------------------------------------- 
## train$ages (numeric)
## 
##    length        n     NAs  unique     0s   mean  meanCI
##   207'857  119'867  87'990       6      0  31.03   30.97
##              57.7%   42.3%           0.0%          31.09
##                                                         
##       .05      .10     .25  median    .75    .90     .95
##     20.00    20.00   20.00   30.00  40.00  50.00   50.00
##                                                         
##     range       sd   vcoef     mad    IQR   skew    kurt
##     50.00    10.70    0.34   14.83  20.00   0.74    0.15
##                                                         
## 
##    level    freq   perc  cumfreq  cumperc
## 1     10   1'872   1.6%    1'872     1.6%
## 2     20  36'049  30.1%   37'921    31.6%
## 3     30  47'570  39.7%   85'491    71.3%
## 4     40  20'210  16.9%  105'701    88.2%
## 5     50  10'521   8.8%  116'222    97.0%
## 6     60   3'645   3.0%  119'867   100.0%

#Desc(users_r$ages)


task0 <- sqldf("
with base as(

select
    a.*
from (
    select
        ages
        , sum(case when country_destination = 'AU' then 1 else 0 end) as AU
        , sum(case when country_destination = 'CA' then 1 else 0 end) as CA
        , sum(case when country_destination = 'DE' then 1 else 0 end) as DE
        , sum(case when country_destination = 'ES' then 1 else 0 end) as ES
        , sum(case when country_destination = 'FR' then 1 else 0 end) as FR
        , sum(case when country_destination = 'GB' then 1 else 0 end) as GB
        , sum(case when country_destination = 'IT' then 1 else 0 end) as IT
        , sum(case when country_destination = 'NDF' then 1 else 0 end) as NDF
        , sum(case when country_destination = 'NL' then 1 else 0 end) as NL
        , sum(case when country_destination = 'PT' then 1 else 0 end) as PT
        , sum(case when country_destination = 'US' then 1 else 0 end) as US
        , sum(case when country_destination = 'other' then 1 else 0 end) as other
        , count(*) as total
    from
        train
    group by 1
    ) as a
)
select
    *
from
    base;
")
ages by country_destination
ages AU CA DE ES FR GB IT NDF NL PT US other total
NA 103 351 210 543 1310 550 799 67614 160 59 13773 2518 87990
10 7 32 18 30 74 31 41 924 11 4 584 116 1872
20 106 245 243 569 940 460 566 16347 191 45 14262 2075 36049
30 181 440 310 669 1425 643 763 20059 225 65 19669 3121 47570
40 70 182 138 209 606 274 291 9894 83 18 7313 1132 20210
50 44 101 70 114 373 189 190 5253 39 16 3561 571 10521
60 10 32 32 45 117 76 65 1756 20 4 1286 202 3645

2.7.4. Summary : Outliers

- 총 7,831(59 + 7,772)개의 이상치 제거 완료
- 267,716건 잔존

3. EDA - Part 1

  • (새로운 사용자 분석을 위한) 기존의 사용자 분석
    • 기존의 사용자가 어떠한 상태에서 예약을 했는지, 예약을 했다면 어디를 맨 처음 예약했는지 확인하기 위한 분석
  • Part 1은 변수 하나로 보는 EDA
  • Part 2는 여러 개의 변수를 결합해서 SQL과 R로 EDA

3.1. EDA 1 : Age

  • 연령별 분포와 전체 연령을 각각의 대표값(10~70대)로 만들어서 비율 확인하기
    • outlier 전처리 과정에서 이미 변수 생성

Visualization

# View age chart
ggplot(users_r, aes(x=age)) +
  geom_histogram(aes(y=..density..),      
# Histogram with density instead of count on y-axis
  binwidth=.5,
  colour="black", fill="white") +
  geom_density(alpha=.2, fill="pink") + # Overlay with transparent density plot
  ggtitle("Users age distribution")

# View ages chart
ggplot(data = train, aes(x = ages)) +
  xlim(10, 100) +
  geom_bar(fill = "#FF6666") +
  ggtitle("Users age distribution")

# View age by date_account_created
users_r %>% 
#   sample_n(2000) %>% 
#   filter(age<150) %>%
  ggplot(aes(age, date_first_booking)) +
  geom_point(alpha=.5) +
  geom_smooth() +
  ggtitle("Age by date_account_created")

Summary

- 20대(32.3%) ~ 30대(38.8%)의 고객(71.2%)이 주로 Airbnb 서비스를 사용하고 있는 것을 알 수 있음.
- (이상치 처리하기 전에도) 70세 이상의 고객이 거의 없음을 알 수 있어서 이를 반영하였음.

“주 사용고객은 전체 고객의 71.2%를 차지하는 2030이다.”

3.2. EDA 2 : Gender

Report

task_gender_ratio <- sqldf("
select
    gender
    , round(100. * g_tot/tot, 2) as prob
from (
    select
        gender
        , count(*) as g_tot
    from users_r
    group by 1
    ) as a, (
    select
        count(*) as tot
    from users_r
    ) as b
")
signup_app by country_destination
gender prob
-unknown- 47.75
FEMALE 27.69
MALE 24.43
OTHER 0.12

Visualization

Desc(train$gender)
## ------------------------------------------------------------------------- 
## train$gender (character)
## 
##    length       n     NAs  unique  levels   dupes
##   207'857 207'857       0       4       4       y
##            100.0%    0.0%                        
## 
##        level    freq   perc  cumfreq  cumperc
## 1  -unknown-  94'507  45.5%   94'507    45.5%
## 2     FEMALE  60'630  29.2%  155'137    74.6%
## 3       MALE  52'444  25.2%  207'581    99.9%
## 4      OTHER     276   0.1%  207'857   100.0%

ggplot(data = task_gender_ratio, aes(x = reorder(prob, gender), y = prob, fill = gender)) +
  geom_bar(stat = "identity") +
  coord_flip()

Summary

- 여성 이용자(27.69%) 남성 이용자(24.43%) 보다 조금 더 많은데 실제 매출에 기여하는 성별은 누구인지는 파악되지 않음
  - 추후에 성별에 따른 예약 전환율을 분석할 예정

3.3. EDA 3 : Signup_app

  • 가입한 앱의 분포와 비율 확인

Visualization

Desc(users_r$signup_app)
## ------------------------------------------------------------------------- 
## users_r$signup_app (character)
## 
##    length       n     NAs  unique  levels   dupes
##   267'716 267'716       0       4       4       y
##            100.0%    0.0%                        
## 
##      level     freq   perc  cumfreq  cumperc
## 1      Web  212'705  79.5%  212'705    79.5%
## 2      iOS   34'258  12.8%  246'963    92.2%
## 3  Android   10'384   3.9%  257'347    96.1%
## 4    Moweb   10'369   3.9%  267'716   100.0%

users_r %>%
  ggplot(aes(x = signup_app)) +
  geom_bar() +
  ggtitle("signup_app")

Summary

- Web의 비율이 79.5%로 압도적으로 많고 그 다음은 iOS 사용자(12.8%)가 많다.

“Web과 iOS 사용자의 구매 전환율이 높으므로 추후 마케팅할 때 그들에게 집중하는 것이 좋다.”

3.4. EDA 4 : Country destination

  • (기존) 사용자들이 맨 처음 예약한 장소가 어디인지 파악하는 분석
  • 기존에 label이 있는 데이터에서 추출해야함.

Visualization

Desc(train$country_destination)
## ------------------------------------------------------------------------- 
## train$country_destination (character)
## 
##    length       n     NAs  unique  levels   dupes
##   207'857 207'857       0      12      12       y
##            100.0%    0.0%                        
## 
##     level     freq   perc  cumfreq  cumperc
## 1     NDF  121'847  58.6%  121'847    58.6%
## 2      US   60'448  29.1%  182'295    87.7%
## 3   other    9'735   4.7%  192'030    92.4%
## 4      FR    4'845   2.3%  196'875    94.7%
## 5      IT    2'715   1.3%  199'590    96.0%
## 6      GB    2'223   1.1%  201'813    97.1%
## 7      ES    2'179   1.0%  203'992    98.1%
## 8      CA    1'383   0.7%  205'375    98.8%
## 9      DE    1'021   0.5%  206'396    99.3%
## 10     NL      729   0.4%  207'125    99.6%
## 11     AU      521   0.3%  207'646    99.9%
## 12     PT      211   0.1%  207'857   100.0%

Summary

- NDF가 가장 많은 비율(58.3%)을 차지하고 있다. No Destination Found의 약자로 예약을 하지 않은 사용자를 뜻한다.
- 많은(전체 사용자의 2/3 이상) 사용자가 예약하지 않았고, 예약한 사용자 중에서는 미국 29.2%로 가장 많은 비율을 차지했음을 알 수 있다.
- 조금 더 구체적인 분석을 실시할 예정

3.5. EDA 5 : First device type

Visualization

Desc(users_r$first_device_type)
## ------------------------------------------------------------------------- 
## users_r$first_device_type (character)
## 
##    length       n     NAs  unique  levels   dupes
##   267'716 267'716       0       9       9       y
##            100.0%    0.0%                        
## 
##                 level     freq   perc  cumfreq  cumperc
## 1         Mac Desktop  103'224  38.6%  103'224    38.6%
## 2     Windows Desktop   83'782  31.3%  187'006    69.9%
## 3              iPhone   39'432  14.7%  226'438    84.6%
## 4                iPad   17'354   6.5%  243'792    91.1%
## 5       Other/Unknown   10'933   4.1%  254'725    95.1%
## 6       Android Phone    9'313   3.5%  264'038    98.6%
## 7      Android Tablet    2'015   0.8%  266'053    99.4%
## 8     Desktop (Other)    1'475   0.6%  267'528    99.9%
## 9  SmartPhone (Other)      188   0.1%  267'716   100.0%

Summary

- Web의 비율이 높지만, Apple 제품을 쓰는 사용자의 비율이 59.8%로 압도적으로 높다.

4. EDA - Part 2

4.1. EDA 1 : 가입일과 예약일

  • 가입 날짜와 첫 예약 날짜와의 상관관계가 있는지 확인하기
  • 샘플링해서 분석을 시행(무작위 추출 2,000개), 모집단을 예측하는데 문제가 없을 것으로 예상
  • 사용자의 나이를 나타내는 변수의 이상치는 이미 제거했음

Visualization

users_r %>% 
  sample_n(2000) %>% 
#   filter(age < 150) %>% # 제거하지 않았을 경우 주석 해제하고 진행
  ggplot(aes(date_account_created, date_first_booking)) + 
  geom_point(alpha = 0.5) + geom_smooth() +
  ggtitle("date_first_booking by date_account_created")

Summary

- 두 변수(date_account_created, date_first_booking)는 선형 관계에 있음
즉, 가입한지 오래된 사용자 보다 가입한 지 얼마 안 된 사용자를 대상으로 마케팅을 진행해야함.

4.2. EDA 2 : 다변량 분석

  • 가입 일자, 첫 예약 일자, 가입한 앱, 나이

Visualization

users_r %>% 
  sample_n(2000) %>% 
#   filter(age < 150) %>%
  ggplot(aes(date_account_created, date_first_booking)) + 
  geom_point(aes(size=age, col=signup_app),alpha = 0.5) + geom_smooth() +
  ggtitle("user analysis")

Summary

예약을 완료한 대부분의 고객들은 Web을 통해 가입한 고객임을 알 수 있음
숙소를 예약하는 것이기 때문에 (화면이 작은) 스마트폰 보다는, 상대적으로 큰 모니터를 통해 천천히 살펴보면서 예약하는 것이라는 가설을 세울 수 있음
점의 크기로 age를 나타냈지만 너무 많아서 의미를 찾기 어려움, 추가적인 분석이 필요함(분석할 예정).

“최근에 회원을 가입한 사용자 중에 Web을 통해 가입한 고객을 대상으로 마케팅을 해야 한다.”

  • 위 분석을 통해 signup_app이 중요 변수임을 알았음.
  • signup_app 추가 분석 진행 필요

4.3. EDA 3 : 가입한 앱과 예약 국가

Report 1 - 가입기기별, 예약 국가별 예약자 수

table(train$signup_app)
## 
## Android     iOS   Moweb     Web 
##    5406   18832    6167  177452
# conversion_rate_by_country_signup_app
task1 <- sqldf("
with base as(
select
    a.*
from (
    select 
        signup_app
        , sum(case when country_destination = 'AU' then 1 else 0 end) as AU
        , sum(case when country_destination = 'CA' then 1 else 0 end) as CA
        , sum(case when country_destination = 'DE' then 1 else 0 end) as DE
        , sum(case when country_destination = 'ES' then 1 else 0 end) as ES
        , sum(case when country_destination = 'FR' then 1 else 0 end) as FR
        , sum(case when country_destination = 'GB' then 1 else 0 end) as GB
        , sum(case when country_destination = 'IT' then 1 else 0 end) as IT
        , sum(case when country_destination = 'NDF' then 1 else 0 end) as NDF
        , sum(case when country_destination = 'NL' then 1 else 0 end) as NL
        , sum(case when country_destination = 'PT' then 1 else 0 end) as PT
        , sum(case when country_destination = 'US' then 1 else 0 end) as US
        , sum(case when country_destination = 'other' then 1 else 0 end) as other
        , count(*) as total
    from train
    group by 1
    ) as a
)
select *
from base

union all

select
    'Total'
    , sum(AU)
    , sum(CA)
    , sum(DE)
    , sum(ES)
    , sum(FR)
    , sum(GB)
    , sum(IT)
    , sum(NDF)
    , sum(NL)
    , sum(PT)
    , sum(US)
    , sum(other)
    , sum(total)
from base;
")
# conversion_rate_by_country_signup_app
signup_app by country_destination
signup_app AU CA DE ES FR GB IT NDF NL PT US other total
Android 3 15 9 25 24 16 21 4197 7 2 921 166 5406
Moweb 14 28 15 34 55 33 30 4078 14 3 1673 190 6167
Web 473 1282 957 2016 4548 2059 2524 100172 670 194 53831 8726 177452
iOS 31 58 40 104 218 115 140 13400 38 12 4023 653 18832
Total 521 1383 1021 2179 4845 2223 2715 121847 729 211 60448 9735 207857

Report 2 - 가입기기별, 예약 국가별 예약 비율

task2 <- sqldf("
with base as(
select
    b.*
    , AU + CA + DE + ES + FR + GB + IT + NDF + NL + PT + US + other as total
from (

    select
        signup_app
        , round( 100. * AU / total, 2) as AU
        , round( 100. * CA / total, 2) as CA
        , round( 100. * DE / total, 2) as DE
        , round( 100. * ES / total, 2) as ES
        , round( 100. * FR / total, 2) as FR
        , round( 100. * GB / total, 2) as GB
        , round( 100. * IT / total, 2) as IT
        , round( 100. * NDF / total, 2) as NDF
        , round( 100. * NL / total, 2) as NL
        , round( 100. * PT / total, 2) as PT
        , round( 100. * US / total, 2) as US
        , round( 100. * other / total, 2) as other
        
    from (
        select 
            signup_app
            , sum(case when country_destination = 'AU' then 1 else 0 end) as AU
            , sum(case when country_destination = 'CA' then 1 else 0 end) as CA
            , sum(case when country_destination = 'DE' then 1 else 0 end) as DE
            , sum(case when country_destination = 'ES' then 1 else 0 end) as ES
            , sum(case when country_destination = 'FR' then 1 else 0 end) as FR
            , sum(case when country_destination = 'GB' then 1 else 0 end) as GB
            , sum(case when country_destination = 'IT' then 1 else 0 end) as IT
            , sum(case when country_destination = 'NDF' then 1 else 0 end) as NDF
            , sum(case when country_destination = 'NL' then 1 else 0 end) as NL
            , sum(case when country_destination = 'PT' then 1 else 0 end) as PT
            , sum(case when country_destination = 'US' then 1 else 0 end) as US
            , sum(case when country_destination = 'other' then 1 else 0 end) as other
            , count(*) as total
        from train
        group by 1
        )
    ) as b
)

select
    *
from
    base

union all

select
    'Total'
    , round(avg(AU), 2)
    , round(avg(CA), 2)
    , round(avg(DE), 2)
    , round(avg(ES), 2)
    , round(avg(FR), 2)
    , round(avg(GB), 2)
    , round(avg(IT), 2)
    , round(avg(NDF), 2)
    , round(avg(NL), 2)
    , round(avg(PT), 2)
    , round(avg(US), 2)
    , round(avg(other), 2)
    , null
from
    base
;")
signup_app by country_destination
signup_app AU CA DE ES FR GB IT NDF NL PT US other total
Android 0.06 0.28 0.17 0.46 0.44 0.30 0.39 77.64 0.13 0.04 17.04 3.07 100.02
Moweb 0.23 0.45 0.24 0.55 0.89 0.54 0.49 66.13 0.23 0.05 27.13 3.08 100.01
Web 0.27 0.72 0.54 1.14 2.56 1.16 1.42 56.45 0.38 0.11 30.34 4.92 100.01
iOS 0.16 0.31 0.21 0.55 1.16 0.61 0.74 71.16 0.20 0.06 21.36 3.47 99.99
Total 0.18 0.44 0.29 0.68 1.26 0.65 0.76 67.84 0.23 0.07 23.97 3.64 NA

Report 3 for Visualization
- ggplot 시각화를 위한 dataframe 생성

task3 <- sqldf("
select
    a.signup_app
    , country_destination
    , round(100. * total / sa_total, 2) as prob

from (
    select
        signup_app
        , country_destination
        , count(*) as total
    from
        train
    group by 1, 2

    ) as a, (

    select
        signup_app
        , count(*) as sa_total
        
    from
        train
    group by 1
    ) b

where a.signup_app = b.signup_app;
")

Visualization

ggplot(data = task3, aes(x = signup_app, y = prob, fill = country_destination)) +
  geom_bar(stat = "identity")

Report 4
- 국가별, 기기별 예약자 수

table(train$country_destination)
## 
##     AU     CA     DE     ES     FR     GB     IT    NDF     NL  other 
##    521   1383   1021   2179   4845   2223   2715 121847    729   9735 
##     PT     US 
##    211  60448
task4 <- sqldf("
with base as(
select
    a.*
from (
    select 
        country_destination
        , sum(case when signup_app = 'Android' then 1 else 0 end) as Android
        , sum(case when signup_app = 'Moweb' then 1 else 0 end) as Moweb
        , sum(case when signup_app = 'Web' then 1 else 0 end) as Web
        , sum(case when signup_app = 'iOS' then 1 else 0 end) as iOS
        , count(*) as total
    from train
    group by 1
    ) as a
)
select *
from base

union all

select
    'Total'
    , sum(Android)
    , sum(Moweb)
    , sum(Web)
    , sum(iOS)
    , sum(total)
from base;
")
signup_app by country_destination
country_destination Android Moweb Web iOS total
AU 3 14 473 31 521
CA 15 28 1282 58 1383
DE 9 15 957 40 1021
ES 25 34 2016 104 2179
FR 24 55 4548 218 4845
GB 16 33 2059 115 2223
IT 21 30 2524 140 2715
NDF 4197 4078 100172 13400 121847
NL 7 14 670 38 729
PT 2 3 194 12 211
US 921 1673 53831 4023 60448
other 166 190 8726 653 9735
Total 5406 6167 177452 18832 207857

Report 5
- 국가별, 기기별 예약 비율

task5 <- sqldf("
with base as(

select
    b.*
    , Android + Moweb + Web + iOS as total
from (
    select
        country_destination
        , round( 100. * Android / total, 2) as Android
        , round( 100. * Moweb / total, 2) as Moweb
        , round( 100. * Web / total, 2) as Web
        , round( 100. * iOS / total, 2) as iOS
    from (
        select 
            country_destination
            , sum(case when signup_app = 'Android' then 1 else 0 end) as Android
            , sum(case when signup_app = 'Moweb' then 1 else 0 end) as Moweb
            , sum(case when signup_app = 'Web' then 1 else 0 end) as Web
            , sum(case when signup_app = 'iOS' then 1 else 0 end) as iOS
            , count(*) as total
        from train
        group by 1
        )
    ) as b
)

select
    *
from 
    base

union all

select
    'Total'
    , sum(Android)
    , sum(Moweb)
    , sum(Web)
    , sum(iOS)
    , null
from base;
")
signup_app by country_destination
country_destination Android Moweb Web iOS total
AU 0.58 2.69 90.79 5.95 100.01
CA 1.08 2.02 92.70 4.19 99.99
DE 0.88 1.47 93.73 3.92 100.00
ES 1.15 1.56 92.52 4.77 100.00
FR 0.50 1.14 93.87 4.50 100.01
GB 0.72 1.48 92.62 5.17 99.99
IT 0.77 1.10 92.97 5.16 100.00
NDF 3.44 3.35 82.21 11.00 100.00
NL 0.96 1.92 91.91 5.21 100.00
PT 0.95 1.42 91.94 5.69 100.00
US 1.52 2.77 89.05 6.66 100.00
other 1.71 1.95 89.64 6.71 100.01
Total 14.26 22.87 1093.95 68.93 NA

Report 6 for visualization
- ggplot 시각화를 위한 국가별, 기기별 예약 비율

task6 <- sqldf("
select
    a.country_destination
    , a.signup_app
    , round(100. * total / sa_total, 2) as prob

from (
    select
        country_destination
        , signup_app 
        , count(*) as total
    from
        train
    group by 1, 2

    ) as a, (

    select
       country_destination
        , count(*) as sa_total
        
    from
        train
    group by 1
    ) b

where a.country_destination = b.country_destination;
")

Visualization

ggplot(data = task6, aes(x = country_destination, y = prob, fill = signup_app)) +
  geom_bar(stat = "identity")

Summary

Web 전환률은 평균 80~90%로 App을 모두 합친 것보다 훨씬 높다.
(미국의 경우 Wep이 89.2%이고, App이 10.7%인데, 다른 나라도 상황이 비슷하다)

Web 가입자의 전환률이 그들을 대상으로 대상으로 더 높으므로 마케팅을 진행하는 게 좋다.

4.4. EDA 4 : 시즌별 분석

예약일(분기, 요일)과 destination의 상관관계 분석 (월별, 나라별 방문자 수 지표)

  • 시즌별로 어느 나라에 사람들이 많이 방문했는지 확인
    • 특정 시즌에 인기 있는 도시와 나라 분석이 가능해짐
  • 첫 예약 날짜를 기준으로 새로운 변수(년, 월, 분기)를 생성했음

Report

date_first_booking_quarter <- sqldf("
select
    date_first_booking_quarter 
    , count(*) as freq
from
    train
where
    date_first_booking_quarter not in ('NA')
group by 1
order by 2 desc
")

date_first_booking_month <- sqldf("
select
    date_first_booking_month 
    , count(*) as freq
from
    train
where
    date_first_booking_month not in ('NA')
group by 1
order by 2 desc
")

date_first_booking_weekdays <- sqldf("
select
    date_first_booking_weekdays 
    , count(*) as freq
from
    train
where
    date_first_booking_weekdays not in ('NA')
group by 1
order by 2 desc
")


task7 <- sqldf("
with base as(
select
    a.*
from (
    select 
        date_first_booking_month
        , sum(case when country_destination = 'AU' then 1 else 0 end) as AU
        , sum(case when country_destination = 'CA' then 1 else 0 end) as CA
        , sum(case when country_destination = 'DE' then 1 else 0 end) as DE
        , sum(case when country_destination = 'ES' then 1 else 0 end) as ES
        , sum(case when country_destination = 'FR' then 1 else 0 end) as FR
        , sum(case when country_destination = 'GB' then 1 else 0 end) as GB
        , sum(case when country_destination = 'IT' then 1 else 0 end) as IT
        , sum(case when country_destination = 'NDF' then 1 else 0 end) as NDF
        , sum(case when country_destination = 'NL' then 1 else 0 end) as NL
        , sum(case when country_destination = 'PT' then 1 else 0 end) as PT
        , sum(case when country_destination = 'US' then 1 else 0 end) as US
        , sum(case when country_destination = 'other' then 1 else 0 end) as other
        , count(*) as total
    from
        train
    group by 1
    ) as a
)

select
    *
from
    base

union all

select
    'Total'
    , round((sum(AU) / 12), 2)
    , round((sum(CA) / 12), 2)
    , round((sum(DE) / 12), 2)
    , round((sum(ES) / 12), 2)
    , round((sum(FR) / 12), 2)
    , round((sum(GB) / 12), 2)
    , round((sum(IT) / 12), 2)
    , null
    , round((sum(NL) / 12), 2)
    , round((sum(PT) / 12), 2)
    , round((sum(US) / 12), 2)
    , round((sum(other) / 12), 2)
    , null
from base;

")

task8 <- sqldf("
with base as(
select
    b.*
    , AU + CA + DE + ES + FR + GB + IT + NDF + NL + PT + US + other as total
from (

    select
        date_first_booking_month
        , round( 100. * AU / total, 2) as AU
        , round( 100. * CA / total, 2) as CA
        , round( 100. * DE / total, 2) as DE
        , round( 100. * ES / total, 2) as ES
        , round( 100. * FR / total, 2) as FR
        , round( 100. * GB / total, 2) as GB
        , round( 100. * IT / total, 2) as IT
        , round( 100. * NDF / total, 2) as NDF
        , round( 100. * NL / total, 2) as NL
        , round( 100. * PT / total, 2) as PT
        , round( 100. * US / total, 2) as US
        , round( 100. * other / total, 2) as other
        
    from (
        select 
            date_first_booking_month
            , sum(case when country_destination = 'AU' then 1 else 0 end) as AU
            , sum(case when country_destination = 'CA' then 1 else 0 end) as CA
            , sum(case when country_destination = 'DE' then 1 else 0 end) as DE
            , sum(case when country_destination = 'ES' then 1 else 0 end) as ES
            , sum(case when country_destination = 'FR' then 1 else 0 end) as FR
            , sum(case when country_destination = 'GB' then 1 else 0 end) as GB
            , sum(case when country_destination = 'IT' then 1 else 0 end) as IT
            , sum(case when country_destination = 'NDF' then 1 else 0 end) as NDF
            , sum(case when country_destination = 'NL' then 1 else 0 end) as NL
            , sum(case when country_destination = 'PT' then 1 else 0 end) as PT
            , sum(case when country_destination = 'US' then 1 else 0 end) as US
            , sum(case when country_destination = 'other' then 1 else 0 end) as other
            , count(*) as total
        from train
        group by 1
        )
    ) as b
)

select
    *
from
    base

union all

select
    'Avg_booking'
    , round((sum(AU) / 12), 2)
    , round((sum(CA) / 12), 2)
    , round((sum(DE) / 12), 2)
    , round((sum(ES) / 12), 2)
    , round((sum(FR) / 12), 2)
    , round((sum(GB) / 12), 2)
    , round((sum(IT) / 12), 2)
    , null
    , round((sum(NL) / 12), 2)
    , round((sum(PT) / 12), 2)
    , round((sum(US) / 12), 2)
    , round((sum(other) / 12), 2)
    , null
from
    base
;")

task9 <- sqldf("
select
    a.date_first_booking_month
    , a.country_destination
    , round(100. * total / sa_total, 2) as prob

from (
    select
        date_first_booking_month
        , country_destination
        , count(*) as total
    from
        train
    group by 1, 2
    ) as a, (

    select
        date_first_booking_month
        , count(*) as sa_total
    from 
        train
    group by 1
    ) b

where a.date_first_booking_month = b.date_first_booking_month
;")
date_first_booking_quarter
date_first_booking_quarter freq
2 28885
1 20875
3 20209
4 16041
date_first_booking_month
date_first_booking_month freq
6 10217
5 10153
4 8515
3 8087
7 7027
8 6816
2 6539
9 6366
1 6249
10 5987
11 5107
12 4947
date_first_booking_weekdays
date_first_booking_weekdays freq
Wednesday 13951
Tuesday 13832
Thursday 13491
Friday 12840
Monday 12341
Saturday 10043
Sunday 9512
date_first_booking_month
date_first_booking_month AU CA DE ES FR GB IT NDF NL PT US other total
NA 0 0 0 0 0 0 0 121847 0 0 0 0 121847
1 57 53 57 112 320 140 157 0 29 9 4645 670 6249
2 46 72 66 163 376 168 194 0 49 18 4604 783 6539
3 47 102 79 206 488 217 280 0 64 22 5622 960 8087
4 27 156 99 256 544 229 341 0 67 23 5822 951 8515
5 53 208 144 326 705 312 418 0 99 27 6715 1146 10153
6 36 214 130 332 595 295 351 0 116 40 6859 1249 10217
7 31 136 94 229 348 182 247 0 68 14 4913 765 7027
8 30 139 102 181 399 151 210 0 81 15 4806 702 6816
9 41 101 110 144 332 169 221 0 49 12 4545 642 6366
10 45 67 54 114 291 141 126 0 50 13 4423 663 5987
11 60 73 45 60 238 128 103 0 29 9 3754 608 5107
12 48 62 41 56 209 91 67 0 28 9 3740 596 4947
0 43 115 85 181 403 185 226 NA 60 17 5037 811 NA
date_first_booking_month_ratio
date_first_booking_month AU CA DE ES FR GB IT NDF NL PT US other total
NA 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100 0.00 0.00 0.00 0.00 100.00
1 0.91 0.85 0.91 1.79 5.12 2.24 2.51 0 0.46 0.14 74.33 10.72 99.98
2 0.70 1.10 1.01 2.49 5.75 2.57 2.97 0 0.75 0.28 70.41 11.97 100.00
3 0.58 1.26 0.98 2.55 6.03 2.68 3.46 0 0.79 0.27 69.52 11.87 99.99
4 0.32 1.83 1.16 3.01 6.39 2.69 4.00 0 0.79 0.27 68.37 11.17 100.00
5 0.52 2.05 1.42 3.21 6.94 3.07 4.12 0 0.98 0.27 66.14 11.29 100.01
6 0.35 2.09 1.27 3.25 5.82 2.89 3.44 0 1.14 0.39 67.13 12.22 99.99
7 0.44 1.94 1.34 3.26 4.95 2.59 3.52 0 0.97 0.20 69.92 10.89 100.02
8 0.44 2.04 1.50 2.66 5.85 2.22 3.08 0 1.19 0.22 70.51 10.30 100.01
9 0.64 1.59 1.73 2.26 5.22 2.65 3.47 0 0.77 0.19 71.39 10.08 99.99
10 0.75 1.12 0.90 1.90 4.86 2.36 2.10 0 0.84 0.22 73.88 11.07 100.00
11 1.17 1.43 0.88 1.17 4.66 2.51 2.02 0 0.57 0.18 73.51 11.91 100.01
12 0.97 1.25 0.83 1.13 4.22 1.84 1.35 0 0.57 0.18 75.60 12.05 99.99
0 0.65 1.55 1.16 2.39 5.48 2.53 3.00 NA 0.82 0.23 70.89 11.30 NA

**Visualization

#View count of each country_destination by data_first_booking_month

fig.dac_mon <- ggplot(data = train, aes(x = country_destination, fill=country_destination))
fig.dac_mon <- fig.dac_mon + geom_bar(position="identity", alpha=1.0)
fig.dac_mon <- fig.dac_mon + scale_y_log10()
# fig.dac_mon + facet_wrap(~date_first_booking_month)
fig.dac_mon

#View count of each country_destination by data_first_booking_month
fig.dac_mon <- ggplot(train, aes(date_first_booking_month, fill=country_destination))
fig.dac_mon <- fig.dac_mon + geom_bar(position="identity", alpha=1.0)
fig.dac_mon <- fig.dac_mon + scale_y_log10()
fig.dac_mon + facet_wrap(~country_destination)

ggplot(data = task9, aes(x = date_first_booking_month, y = prob, fill = country_destination)) +
  geom_bar(stat = "identity")

Summary
- alpah를 1.0으로 설정하여 색을 진하게, log를 사용하여 한 눈에 보기 쉽게 scale 수정 - 나라마다 조금의 차이가 있지만 대체적으로 5~6월에 예약을 많이 한다. 그 이후로는 예약 수가 줄어든다. - 다른 나라들은 거의 유사하지만 호주(AU)의 경우 조금 다른 양상을 보인다. - 추측 : 다른 나라들과 계절이 반대이기 때문에 반대의 양상이 보이지 않을까 추측해볼 수 있음.

4.5. EDA 5 : 연령대별 분석

Report 1
- 연령대에 따른 예약자 수

task10 <- sqldf("
with base as(

select
    a.*
from (
    select
        ages
        , sum(case when country_destination = 'AU' then 1 else 0 end) as AU
        , sum(case when country_destination = 'CA' then 1 else 0 end) as CA
        , sum(case when country_destination = 'DE' then 1 else 0 end) as DE
        , sum(case when country_destination = 'ES' then 1 else 0 end) as ES
        , sum(case when country_destination = 'FR' then 1 else 0 end) as FR
        , sum(case when country_destination = 'GB' then 1 else 0 end) as GB
        , sum(case when country_destination = 'IT' then 1 else 0 end) as IT
        , sum(case when country_destination = 'NDF' then 1 else 0 end) as NDF
        , sum(case when country_destination = 'NL' then 1 else 0 end) as NL
        , sum(case when country_destination = 'PT' then 1 else 0 end) as PT
        , sum(case when country_destination = 'US' then 1 else 0 end) as US
        , sum(case when country_destination = 'other' then 1 else 0 end) as other
        , count(*) as total
    from
        train
    group by 1
    ) as a
)
select
    *
from
    base
where ages > 0 and ages < 70

union all

select 
    'Avg'
    , round((sum(AU) / 12), 2)
    , round((sum(CA) / 12), 2)
    , round((sum(DE) / 12), 2)
    , round((sum(ES) / 12), 2)
    , round((sum(FR) / 12), 2)
    , round((sum(GB) / 12), 2)
    , round((sum(IT) / 12), 2)
    , null
    , round((sum(NL) / 12), 2)
    , round((sum(PT) / 12), 2)
    , round((sum(US) / 12), 2)
    , round((sum(other) / 12), 2)
    , null
from base

union all

select
    'Total'
    , sum(AU)
    , sum(CA)
    , sum(DE)
    , sum(ES)
    , sum(FR)
    , sum(GB)
    , sum(IT)
    , null
    , sum(NL)
    , sum(PT)
    , sum(US)
    , sum(other)
    , null
from base;
")
Age vs. Country destination
ages AU CA DE ES FR GB IT NDF NL PT US other total
10 7 32 18 30 74 31 41 924 11 4 584 116 1872
20 106 245 243 569 940 460 566 16347 191 45 14262 2075 36049
30 181 440 310 669 1425 643 763 20059 225 65 19669 3121 47570
40 70 182 138 209 606 274 291 9894 83 18 7313 1132 20210
50 44 101 70 114 373 189 190 5253 39 16 3561 571 10521
60 10 32 32 45 117 76 65 1756 20 4 1286 202 3645
0 43 115 85 181 403 185 226 NA 60 17 5037 811 NA
0 521 1383 1021 2179 4845 2223 2715 NA 729 211 60448 9735 NA

Report 2
- 연령대에 따른 예약 비율

task11 <- sqldf("
with base as (
select
    b.*
    , AU + CA + DE + ES + FR + GB + IT + NDF + NL + PT + US + other as total

from (
    
    select
        ages
        , round( 100. * AU / total, 2) as AU
        , round( 100. * CA / total, 2) as CA
        , round( 100. * DE / total, 2) as DE
        , round( 100. * ES / total, 2) as ES
        , round( 100. * FR / total, 2) as FR
        , round( 100. * GB / total, 2) as GB
        , round( 100. * IT / total, 2) as IT
        , round( 100. * NDF / total, 2) as NDF
        , round( 100. * NL / total, 2) as NL
        , round( 100. * PT / total, 2) as PT
        , round( 100. * US / total, 2) as US
        , round( 100. * other / total, 2) as other

    from (

        select
            ages
            , sum(case when country_destination = 'AU' then 1 else 0 end) as AU
            , sum(case when country_destination = 'CA' then 1 else 0 end) as CA
            , sum(case when country_destination = 'DE' then 1 else 0 end) as DE
            , sum(case when country_destination = 'ES' then 1 else 0 end) as ES
            , sum(case when country_destination = 'FR' then 1 else 0 end) as FR
            , sum(case when country_destination = 'GB' then 1 else 0 end) as GB
            , sum(case when country_destination = 'IT' then 1 else 0 end) as IT
            , sum(case when country_destination = 'NDF' then 1 else 0 end) as NDF
            , sum(case when country_destination = 'NL' then 1 else 0 end) as NL
            , sum(case when country_destination = 'PT' then 1 else 0 end) as PT
            , sum(case when country_destination = 'US' then 1 else 0 end) as US
            , sum(case when country_destination = 'other' then 1 else 0 end) as other
            , count(*) as total
        from (
                select *
                from train
                where ages > 0 and ages < 70
            )
        group by 1
        ) 
    ) as b
)
select
    *
from
    base

union all

select
    'Avg_ages'
    , round(avg(AU), 2)
    , round(avg(CA), 2)
    , round(avg(DE), 2)
    , round(avg(ES), 2)
    , round(avg(FR), 2)
    , round(avg(GB), 2)
    , round(avg(IT), 2)
    , null
    , round(avg(NL), 2)
    , round(avg(PT), 2)
    , round(avg(US), 2)
    , round(avg(other), 2)
    , null
from
    base
;")
Age vs. Country destination
ages AU CA DE ES FR GB IT NDF NL PT US other total
10 0.37 1.71 0.96 1.60 3.95 1.66 2.19 49.36 0.59 0.21 31.20 6.20 100.00
20 0.29 0.68 0.67 1.58 2.61 1.28 1.57 45.35 0.53 0.12 39.56 5.76 100.00
30 0.38 0.92 0.65 1.41 3.00 1.35 1.60 42.17 0.47 0.14 41.35 6.56 100.00
40 0.35 0.90 0.68 1.03 3.00 1.36 1.44 48.96 0.41 0.09 36.19 5.60 100.01
50 0.42 0.96 0.67 1.08 3.55 1.80 1.81 49.93 0.37 0.15 33.85 5.43 100.02
60 0.27 0.88 0.88 1.23 3.21 2.09 1.78 48.18 0.55 0.11 35.28 5.54 100.00
0 0.35 1.01 0.75 1.32 3.22 1.59 1.73 NA 0.49 0.14 36.24 5.85 NA

Report 3 for visualization
- ggplot 시각화를 위한 dataframe

task12 <- sqldf("

select
    a.ages
    , a.country_destination
    , round(100. * ac_total / ages_total, 2) as prob

from (
    select
        ages
        , country_destination
        , count(*) as ac_total
    from
        train
    group by 1, 2
    
    ) as a, (

    select
        ages
        , count(*) as ages_total
    from
        train
    group by 1
    ) as b
where a.ages = b.ages
")

Visualization 1

not.na <- !is.na(train$ages)
ggplot(data = task12[not.na, ], aes(x = ages, y = prob, fill =  country_destination)) +
  geom_bar(stat = "identity", alpha = 1) + 
  xlim(14, 70)

# View count of age
not.na <- !is.na(train$age)
fig.age <- ggplot(train[not.na,], aes(age, fill=country_destination))
fig.age <- fig.age + geom_bar(position="identity", alpha=0.5)
fig.age <- fig.age + xlim(14, 70)
fig.age <- fig.age + scale_y_log10()
fig.age + facet_wrap(~country_destination)

# View count of ages
not.na <- !is.na(train$ages)
fig.ages <- ggplot(train[not.na,], aes(ages, fill=country_destination))
fig.ages <- fig.ages + geom_bar(position="identity", alpha=0.5)
fig.ages <- fig.ages + xlim(14, 70)
fig.ages <- fig.ages + scale_y_log10()
fig.ages + facet_wrap(~country_destination)

Summary 1
- 대부분의 나라가 전체 인구 분포와 유사함 - Age와 Destination의 관계를 Boxplot을 통해 표현 - Boxplot을 이용하면 전체 데이터 분포를 더 잘 파악할 수 있고, 어디에 데이터가 분포되어 있는지 Outlier의 존재 유무 등을 파악할 수 있음

Visualization 2

# Boxplot for age (between 10 and 70)

fig.age.bx <- ggplot(train, aes(x=country_destination, y=age))
fig.age.bx <- fig.age.bx + geom_boxplot(aes(colour=country_destination))
fig.age.bx <- fig.age.bx + ylim(10, 70)

fig.age.bx

# boxplot for ages

fig.ages.bx <- ggplot(train, aes(x=country_destination, y=ages))
fig.ages.bx <- fig.ages.bx + geom_boxplot(aes(colour=country_destination))
fig.ages.bx <- fig.ages.bx + ylim(10, 70)

fig.ages.bx

Summary 2
- 모든 나라의 20~30대(중반)에 많이 집중되어 있음을 알 수 있음. - 영국(GB)가 연령대가 조금 높은 편이고, 스페인(ES)가 조금 낮은 편. - 중위값은 모든 나라에서 박스의 아래쪽에 있는 것으로 보아 젊은 연령대가 Airbnb를 더 많이 사용하고 있음을 알 수 있음.

4.6. EDA 6 : 전환율(Conversion rate)

  • Airbnb 이용자의 도착 국가별, 성별의 구매전환율*
  • (* 첫구매 전환율 : 첫 구매자수 / 가입자수 * 100(%))

Report - 국가별, 성별의 구매 전환 수 (Conversion rate by country destination & gender)

task_a <- sqldf("
with base as(
select a.*
from (
    select
        country_destination
        , sum(case when gender = '-unknown-' then 1 
                   when gender = 'OTHER' then 1 else 0 end) as unknown
        , sum(case when gender = 'FEMALE' then 1 else 0 end) as female
        , sum(case when gender = 'MALE' then 1 else 0 end) as male
        , count(*) as total
    from
        train
    group by 1) as a
)

select *
from base

union all

select
    'Total'
    , sum(unknown)
    , sum(female)
    , sum(male)
    , sum(total)
from base

union all

select
    'Avg'
    , avg(unknown)
    , avg(female)
    , avg(male)
    , avg(total)
from base
")

task_b <- sqldf("
with base as(
    select
        a.*
        , male_ratio + female_ratio + unknown_ratio as tot_ratio
    from(
        select
            country_destination
            , round(100. * male / total, 2) as male_ratio
            , round(100. * female / total, 2) as female_ratio
            , round(100. * unknown / total, 2) as unknown_ratio
        from (
            select
                country_destination
                , sum(case when gender = '-unknown-' then 1 
                   when gender = 'OTHER' then 1 else 0 end) as unknown
                , sum(case when gender = 'FEMALE' then 1 else 0 end) as female
                , sum(case when gender = 'MALE' then 1 else 0 end) as male
                , count(*) as total
            from
                train
            group by 1
            )
        ) as a
)
select *
from base

union all

select
    'Total_avg'
    , round(avg(male_ratio), 2)
    , round(avg(female_ratio), 2)
    , round(avg(unknown_ratio), 2)
    , round(avg(tot_ratio), 2)
from base
")

task_c <- sqldf("
select
    a.gender
    , a.country_destination
    , round(100. * freq / freq_total, 2) as prob
from (
    select
        gender
        , country_destination
        , count(*) as freq
    from train
    group by 1, 2

    ) as a, (
    select
        gender
        , count(*) freq_total
    from train
    ) as b

")
ages by country_destination
country_destination unknown female male total
AU 140.000 200.0 181.000 521.00
CA 478.000 442.0 463.000 1383.00
DE 281.000 344.0 396.000 1021.00
ES 701.000 830.0 648.000 2179.00
FR 1678.000 1893.0 1274.000 4845.00
GB 739.000 838.0 646.000 2223.00
IT 1016.000 1042.0 657.000 2715.00
NDF 66394.000 29769.0 25684.000 121847.00
NL 225.000 241.0 263.000 729.00
PT 69.000 75.0 67.000 211.00
US 19668.000 21915.0 18865.000 60448.00
other 3394.000 3041.0 3300.000 9735.00
Total 94783.000 60630.0 52444.000 207857.00
Avg 7898.583 5052.5 4370.333 17321.42
ages by country_destination
country_destination male_ratio female_ratio unknown_ratio tot_ratio
AU 34.74 38.39 26.87 100
CA 33.48 31.96 34.56 100
DE 38.79 33.69 27.52 100
ES 29.74 38.09 32.17 100
FR 26.30 39.07 34.63 100
GB 29.06 37.70 33.24 100
IT 24.20 38.38 37.42 100
NDF 21.08 24.43 54.49 100
NL 36.08 33.06 30.86 100
PT 31.75 35.55 32.70 100
US 31.21 36.25 32.54 100
other 33.90 31.24 34.86 100
Total_avg 30.86 34.82 34.32 100

Visualization
- 국가별, 성별 비중 시각화

ggplot(data = task_b, aes(x = country_destination, y = task_b$male_ratio, fill = country_destination)) +
  geom_bar(stat = 'identity') +
  ggtitle("male ratio") +
  coord_flip()

ggplot(data = task_b, aes(x = country_destination, y = task_b$female_ratio, fill = country_destination)) +
  geom_bar(stat = 'identity') +
  ggtitle("female ratio") +
  coord_flip()

ggplot(data = task_b, aes(x = country_destination, y = task_b$unknown_ratio, fill = country_destination)) +
  geom_bar(stat = 'identity') +
  ggtitle("unknown ratio") +
  coord_flip()

ggplot(data = task_c, aes(x = gender, y = prob, fill = country_destination)) +
  geom_bar(stat = "identity") +
  ggtitle("country destination ratio by gender") +
  coord_flip()

Report
- 국가별, 성별 전환율 차이 비교 리포트 1) 국가별로 어떠한 성별이 얼만큼의 차이로 전환율이 우세한지 확인하는 지표 만들기 2) 성별로 우세한 국가 뽑기

task_d <- sqldf("
select
    country_destination
    , male_ratio
    , female_ratio
    , unknown_ratio
    , tot_ratio
    , case when male_ratio > female_ratio then 'M'
           when female_ratio > male_ratio then 'F' end as male_vs_female
    , abs(male_ratio - female_ratio) as male_female_diff
from 
    task_b
")
ages by country_destination
country_destination male_ratio female_ratio unknown_ratio tot_ratio male_vs_female male_female_diff
AU 34.74 38.39 26.87 100 F 3.65
CA 33.48 31.96 34.56 100 M 1.52
DE 38.79 33.69 27.52 100 M 5.10
ES 29.74 38.09 32.17 100 F 8.35
FR 26.30 39.07 34.63 100 F 12.77
GB 29.06 37.70 33.24 100 F 8.64
IT 24.20 38.38 37.42 100 F 14.18
NDF 21.08 24.43 54.49 100 F 3.35
NL 36.08 33.06 30.86 100 M 3.02
PT 31.75 35.55 32.70 100 F 3.80
US 31.21 36.25 32.54 100 F 5.04
other 33.90 31.24 34.86 100 M 2.66
Total_avg 30.86 34.82 34.32 100 F 3.96
summary(task_d)
##  country_destination   male_ratio     female_ratio   unknown_ratio  
##  Length:13           Min.   :21.08   Min.   :24.43   Min.   :26.87  
##  Class :character    1st Qu.:29.06   1st Qu.:33.06   1st Qu.:32.17  
##  Mode  :character    Median :31.21   Median :35.55   Median :33.24  
##                      Mean   :30.86   Mean   :34.82   Mean   :34.32  
##                      3rd Qu.:33.90   3rd Qu.:38.09   3rd Qu.:34.63  
##                      Max.   :38.79   Max.   :39.07   Max.   :54.49  
##    tot_ratio   male_vs_female     male_female_diff
##  Min.   :100   Length:13          Min.   : 1.520  
##  1st Qu.:100   Class :character   1st Qu.: 3.350  
##  Median :100   Mode  :character   Median : 3.960  
##  Mean   :100                      Mean   : 5.849  
##  3rd Qu.:100                      3rd Qu.: 8.350  
##  Max.   :100                      Max.   :14.180
table(task_d$male_vs_female)
## 
## F M 
## 9 4

Summary
- 평균적으로 여성의 전환율이 남성보다 3.87 %의 차이로 우세하다. - 남성의 전환율이 우세한 국가는 4개국으로 [‘CA’ ‘DE’ ‘NL’ ‘other’] 이 있다. - 여성의 전환율이 우세한 국가는 8개국으로 [‘AU’ ‘ES’ ‘FR’ ‘GB’ ‘IT’ ‘PT’ ‘US’ ‘total_avg_gender’] 이 있다.

4.7. EDA 7 : 가입 후 예약까지 걸리는 시간

  • 첫 예약이 발생하기까지 걸리는 시간
  • 회원 가입 시점을 기준으로 첫 예약이 이루어지기까지의 시간을 다음과 같은 구간으로 나누어서 유저 수 구하기(1일 미만 / 1~3일 / 3~5일 / 5~7일 / 7일 이상)

Outliers treatment

train$date_diff <- train$date_first_booking - train$timestamp_first_active
train$date_diff <- as.numeric(train$date_diff)

train %>%
  select(timestamp_first_active, date_first_booking, date_diff) %>%
  head()
##   timestamp_first_active date_first_booking date_diff
## 1             2009-03-19               <NA>        NA
## 2             2009-05-23               <NA>        NA
## 3             2009-06-09         2010-08-02       419
## 4             2009-10-31         2012-09-08      1043
## 5             2009-12-08         2010-02-18        72
## 6             2010-01-01         2010-01-02         1
summary(train$date_diff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.00    3.00   44.89   30.00 1369.00  121847
hist(train$date_diff)

b_diff <- boxplot(train$date_diff)

# 이상치로 분류된 값을 보면 365일 이상의 관측치의 수는 평균적으로 1~2개 이므로 그 이상은 제거(365일의 관측치는 50개)
#a <- train[train$date_diff > b_diff$stats[5],]
#str(a)
#table(a$date_diff)

under_diff_365 <- train %>%
  filter(date_diff < 365)

hist(under_diff_365$date_diff)

task_e <- sqldf("
select
    a.dd as periods
    , count(*) freq
from (
    select
        date_diff
        , case when date_diff < 1 then 0
               when date_diff >= 1 and date_diff < 2 then '1 days'
               when date_diff >= 2 and date_diff < 3 then '2 days'
               when date_diff >= 3 and date_diff < 4 then '3 days'
               when date_diff >= 4 and date_diff < 5 then '4 days'
               when date_diff >= 5 and date_diff < 6 then '5 days'
               when date_diff >= 6 and date_diff < 7 then '6 days'
               when date_diff >= 7 and date_diff < 8 then '7 days'
               when date_diff >= 8 and date_diff <= 14  then '2 weeks'
               when date_diff >= 15 and date_diff <= 21 then '3 weeks'
               when date_diff >= 22 and date_diff <= 28 then '4 weeks'
               when date_diff >= 31 and date_diff <= 60 then '2 months'
               when date_diff >= 61 and date_diff <= 90 then '3 months'
               when date_diff >= 91 and date_diff <= 120 then '4 months'
               when date_diff >= 121 and date_diff <= 150 then '5 months'
               when date_diff >= 151 and date_diff <= 180 then '6 months'
               when date_diff >= 181 and date_diff <= 210 then '7 months'
               when date_diff >= 211 and date_diff <= 240 then '8 months'
               when date_diff >= 241 and date_diff <= 270 then '9 months'
               when date_diff >= 271 and date_diff <= 300 then '10 months'
               when date_diff >= 301 and date_diff <= 330 then '11 months'
               when date_diff >= 331 and date_diff <= 365 then '12 months'
               when date_diff is NULL then 'NA'
               else 'others' end as dd
    from
        train

    ) as a
group by 1
order by freq desc

")

Report

task_ee <- sqldf("select *
from task_e
where periods not in ('NA')")


task_eee <- sqldf("
select
    a.periods
    , round(100. * p_tot / b.tot, 2) as prob
from(
    select
        periods
        , freq as p_tot
    from task_ee
    
    ) as a, (
    
    select
        sum(freq) as tot
    from task_ee
    ) as b
")
How long it will take for the first booking to occur
periods freq
0 20710
1 days 14248
2 days 6271
2 weeks 5823
2 months 4248
3 days 3863
3 weeks 2974
4 days 2836
3 months 2615
5 days 2174
4 months 2002
4 weeks 1917
12 months 1794
5 months 1769
6 days 1726
7 days 1616
6 months 1576
7 months 1514
11 months 1504
8 months 1480
9 months 1418
10 months 1389
others 543
How long it will take for the first booking to occur
periods prob
0 24.08
1 days 16.57
2 days 7.29
2 weeks 6.77
2 months 4.94
3 days 4.49
3 weeks 3.46
4 days 3.30
3 months 3.04
5 days 2.53
4 months 2.33
4 weeks 2.23
12 months 2.09
5 months 2.06
6 days 2.01
7 days 1.88
6 months 1.83
7 months 1.76
11 months 1.75
8 months 1.72
9 months 1.65
10 months 1.61
others 0.63

Visualization

ggplot(data = task_ee, aes(x = reorder(periods, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "#FF6666") +
  ggtitle("") +
  coord_flip()

Summary
- 23개의 계급값으로 구분(당일~others) - 총 예약 건수는 88,908건 - 대부분의 예약은 2주 내에 이루어졌음 (54.81%)

5. Time Series Analysis

5.1. Visuallization 1

5.2. Visualization 2

weekly_created <- group_by(training, date = week_created) %>%
    summarise(created_count = n())

weekly_stats <- merge(weekly_booked, weekly_created, by="date")

weekly_stats <- melt(weekly_stats, id.vars = "date")

weekly_plot <- ggplot(weekly_stats, aes(date, value, col = variable)) +
    geom_line()

weekly_plot

5.3. Summary

사람들은 크리스마스(연휴)가 있는 주에 크리스마스 휴가를 예약하지 않는다. 보통 사람들은 몇 달 혹은 몇 주전에 예약을 할 것이다.
그리고 사람들은 주말보다 평일(수요일)에 가장 이용을 많이 한다. 아마 사무실에서 업무에 치이면서 무료하고 반복된 패턴을 벗어나고 싶은 욕구가 클 때 여행을 이용하는 것 같다. 상대적으로 여유로운 주말이나 연휴에는 이미 휴식을 취하거나 여가를 보내고 있기 때문에 이용을 덜 하는 것 같다.    

6. Summary

6.1. Age

Summary
1) 서비스의 주요 고객의 연령대는 전체 고객의 71.2%를 차지하는 20(32.3%)~30(38.8%)이다. 2) 대부분의 나라가 전체 인구 분포와 유사하다. - 중위 값은 모든 나라에서 박스의 아래쪽에 있는 것으로 보아 젊은 연령대가 airbnb를 더 많이 이용하고 있음을 알 수 있음 - 영국(GB)의 경우 연령대가 조금 높은 편이고, 스페인(ES)가 조금 낮은 편

Report

Age vs. Country destination
ages AU CA DE ES FR GB IT NDF NL PT US other total
10 7 32 18 30 74 31 41 924 11 4 584 116 1872
20 106 245 243 569 940 460 566 16347 191 45 14262 2075 36049
30 181 440 310 669 1425 643 763 20059 225 65 19669 3121 47570
40 70 182 138 209 606 274 291 9894 83 18 7313 1132 20210
50 44 101 70 114 373 189 190 5253 39 16 3561 571 10521
60 10 32 32 45 117 76 65 1756 20 4 1286 202 3645
0 43 115 85 181 403 185 226 NA 60 17 5037 811 NA
0 521 1383 1021 2179 4845 2223 2715 NA 729 211 60448 9735 NA
Age vs. Country destination
ages AU CA DE ES FR GB IT NDF NL PT US other total
10 0.37 1.71 0.96 1.60 3.95 1.66 2.19 49.36 0.59 0.21 31.20 6.20 100.00
20 0.29 0.68 0.67 1.58 2.61 1.28 1.57 45.35 0.53 0.12 39.56 5.76 100.00
30 0.38 0.92 0.65 1.41 3.00 1.35 1.60 42.17 0.47 0.14 41.35 6.56 100.00
40 0.35 0.90 0.68 1.03 3.00 1.36 1.44 48.96 0.41 0.09 36.19 5.60 100.01
50 0.42 0.96 0.67 1.08 3.55 1.80 1.81 49.93 0.37 0.15 33.85 5.43 100.02
60 0.27 0.88 0.88 1.23 3.21 2.09 1.78 48.18 0.55 0.11 35.28 5.54 100.00
0 0.35 1.01 0.75 1.32 3.22 1.59 1.73 NA 0.49 0.14 36.24 5.85 NA

Visualization

# View age chart
ggplot(users_r, aes(x=age)) +
  geom_histogram(aes(y=..density..),      
  binwidth=.5,
  colour="black", fill="white") +
  geom_density(alpha=.2, fill="pink") + # Overlay with transparent density plot
  ggtitle("Users age distribution")

# View ages chart
ggplot(data = train, aes(x = ages)) +
  xlim(10, 100) +
  geom_bar(fill = "#FF6666") +
  ggtitle("Users age distribution")

# View count of age
not.na <- !is.na(train$age)
fig.age <- ggplot(train[not.na,], aes(age, fill=country_destination))
fig.age <- fig.age + geom_bar(position="identity", alpha=0.5)
fig.age <- fig.age + xlim(14, 70)
fig.age <- fig.age + scale_y_log10()
fig.age + facet_wrap(~country_destination)

# View count of ages
not.na <- !is.na(train$ages)
fig.ages <- ggplot(train[not.na,], aes(ages, fill=country_destination))
fig.ages <- fig.ages + geom_bar(position="identity", alpha=0.5)
fig.ages <- fig.ages + xlim(14, 70)
fig.ages <- fig.ages + scale_y_log10()
fig.ages + facet_wrap(~country_destination)

# Boxplot for age (between 10 and 70)
fig.age.bx <- ggplot(train, aes(x=country_destination, y=age))
fig.age.bx <- fig.age.bx + geom_boxplot(aes(colour=country_destination))
fig.age.bx <- fig.age.bx + ylim(10, 70)

fig.age.bx

6.2. Gender

Summary
1) 여성 이용자(27.69%) 남성 이용자(24.43%) 보다 조금 더 많은데 실제 매출에 기여하는 성별은 누구인지는 파악되지 않으므로 전환율(Conversion Rate) 분석을 진행하였음 2) 평균적으로 여성의 전환율이 남성보다 3.87 %의 차이로 우세함을 알 수 있었다. 3) 남성의 전환율이 우세한 국가는 4개국으로 [‘CA’ ‘DE’ ‘NL’ ‘other’]가 있다. 4) 여성의 전환율이 우세한 국가는 8개국으로 [‘AU’ ‘ES’ ‘FR’ ‘GB’ ‘IT’ ‘PT’ ‘US’]가 있다.

Report

Gender ratio by country_destination
gender prob
-unknown- 47.75
FEMALE 27.69
MALE 24.43
OTHER 0.12
ages by country_destination
country_destination unknown female male total
AU 140.000 200.0 181.000 521.00
CA 478.000 442.0 463.000 1383.00
DE 281.000 344.0 396.000 1021.00
ES 701.000 830.0 648.000 2179.00
FR 1678.000 1893.0 1274.000 4845.00
GB 739.000 838.0 646.000 2223.00
IT 1016.000 1042.0 657.000 2715.00
NDF 66394.000 29769.0 25684.000 121847.00
NL 225.000 241.0 263.000 729.00
PT 69.000 75.0 67.000 211.00
US 19668.000 21915.0 18865.000 60448.00
other 3394.000 3041.0 3300.000 9735.00
Total 94783.000 60630.0 52444.000 207857.00
Avg 7898.583 5052.5 4370.333 17321.42
ages by country_destination
country_destination male_ratio female_ratio unknown_ratio tot_ratio
AU 34.74 38.39 26.87 100
CA 33.48 31.96 34.56 100
DE 38.79 33.69 27.52 100
ES 29.74 38.09 32.17 100
FR 26.30 39.07 34.63 100
GB 29.06 37.70 33.24 100
IT 24.20 38.38 37.42 100
NDF 21.08 24.43 54.49 100
NL 36.08 33.06 30.86 100
PT 31.75 35.55 32.70 100
US 31.21 36.25 32.54 100
other 33.90 31.24 34.86 100
Total_avg 30.86 34.82 34.32 100

Visualization

Desc(train$gender)
## ------------------------------------------------------------------------- 
## train$gender (character)
## 
##    length       n     NAs  unique  levels   dupes
##   207'857 207'857       0       4       4       y
##            100.0%    0.0%                        
## 
##        level    freq   perc  cumfreq  cumperc
## 1  -unknown-  94'507  45.5%   94'507    45.5%
## 2     FEMALE  60'630  29.2%  155'137    74.6%
## 3       MALE  52'444  25.2%  207'581    99.9%
## 4      OTHER     276   0.1%  207'857   100.0%

ggplot(data = task_gender_ratio, aes(x = reorder(prob, gender), y = prob, fill = gender)) +
  geom_bar(stat = "identity") +
  coord_flip()

ggplot(data = task_b, aes(x = reorder(country_destination, male_ratio), y = task_b$male_ratio, fill = country_destination)) +
  geom_bar(stat = 'identity') +
  ggtitle("male ratio") +
  coord_flip()

ggplot(data = task_b, aes(x = reorder(country_destination, female_ratio), y = task_b$female_ratio, fill = country_destination)) +
  geom_bar(stat = 'identity') +
  ggtitle("female ratio") +
  coord_flip()

ggplot(data = task_b, aes(x = reorder(country_destination, unknown_ratio), y = task_b$unknown_ratio, fill = country_destination)) +
  geom_bar(stat = 'identity') +
  ggtitle("unknown ratio") +
  coord_flip()

ggplot(data = task_c, aes(x = reorder(gender, prob), y = prob, fill = country_destination)) +
  geom_bar(stat = "identity") +
  ggtitle("country destination ratio by gender") +
  coord_flip()

ages by country_destination
country_destination male_ratio female_ratio unknown_ratio tot_ratio male_vs_female male_female_diff
AU 34.74 38.39 26.87 100 F 3.65
CA 33.48 31.96 34.56 100 M 1.52
DE 38.79 33.69 27.52 100 M 5.10
ES 29.74 38.09 32.17 100 F 8.35
FR 26.30 39.07 34.63 100 F 12.77
GB 29.06 37.70 33.24 100 F 8.64
IT 24.20 38.38 37.42 100 F 14.18
NDF 21.08 24.43 54.49 100 F 3.35
NL 36.08 33.06 30.86 100 M 3.02
PT 31.75 35.55 32.70 100 F 3.80
US 31.21 36.25 32.54 100 F 5.04
other 33.90 31.24 34.86 100 M 2.66
Total_avg 30.86 34.82 34.32 100 F 3.96

6.3. Signup_app & First device type

Summary
1) Web의 비율이 79.5%로 압도적으로 많고 그 다음은 iOS 사용자(12.8%)가 많다. “Web과 iOS 사용자의 구매 전환율이 높으므로 추후 마케팅할 때 그들에게 집중하는 것이 좋다.”

2) Web의 비율이 높지만, Apple 제품을 쓰는 사용자의 비율이 59.8%로 압도적으로 높다.

3) Web 전환률은 평균 80~90%로 App을 모두 합친 것보다 훨씬 높다.
  (미국의 경우 Wep이 89.2%이고, App이 10.7%인데, 다른 나라도 상황이 비슷하다)

4) Web 가입자의 전환률이 그들을 대상으로 대상으로 더 높으므로 마케팅을 진행하는 게 좋다.

5) Device type 분석을 진행한 결과, Apple 유저의 경우 61.61 %의 전환율(Mac desktop 48.50 % + iPad 6.81 % + iPhone 6.30 %)을 보여주고 있다. 이는 Web으로 들어갈 수 있는 모든 Desktop의 34.74% 전환율(Windows desktop 33.56 % + Desktop other 0.62 % + Android tablet 0.56 %) 보다 압도적으로 높다.

Report

z <- sqldf("
with base as(
select
    b.*
    , AU + CA + DE + ES + FR + GB + IT + NDF + NL + PT + US + other as total
from (

    select
        first_device_type
        , round( 100. * AU / total, 2) as AU
        , round( 100. * CA / total, 2) as CA
        , round( 100. * DE / total, 2) as DE
        , round( 100. * ES / total, 2) as ES
        , round( 100. * FR / total, 2) as FR
        , round( 100. * GB / total, 2) as GB
        , round( 100. * IT / total, 2) as IT
        , round( 100. * NDF / total, 2) as NDF
        , round( 100. * NL / total, 2) as NL
        , round( 100. * PT / total, 2) as PT
        , round( 100. * US / total, 2) as US
        , round( 100. * other / total, 2) as other
        
    from (
        select 
            first_device_type
            , sum(case when country_destination = 'AU' then 1 else 0 end) as AU
            , sum(case when country_destination = 'CA' then 1 else 0 end) as CA
            , sum(case when country_destination = 'DE' then 1 else 0 end) as DE
            , sum(case when country_destination = 'ES' then 1 else 0 end) as ES
            , sum(case when country_destination = 'FR' then 1 else 0 end) as FR
            , sum(case when country_destination = 'GB' then 1 else 0 end) as GB
            , sum(case when country_destination = 'IT' then 1 else 0 end) as IT
            , sum(case when country_destination = 'NDF' then 1 else 0 end) as NDF
            , sum(case when country_destination = 'NL' then 1 else 0 end) as NL
            , sum(case when country_destination = 'PT' then 1 else 0 end) as PT
            , sum(case when country_destination = 'US' then 1 else 0 end) as US
            , sum(case when country_destination = 'other' then 1 else 0 end) as other
            , count(*) as total
        from train
        group by 1
        )
    ) as b
)

select
    *
from
    base

union all

select
    'Total'
    , round(avg(AU), 2)
    , round(avg(CA), 2)
    , round(avg(DE), 2)
    , round(avg(ES), 2)
    , round(avg(FR), 2)
    , round(avg(GB), 2)
    , round(avg(IT), 2)
    , round(avg(NDF), 2)
    , round(avg(NL), 2)
    , round(avg(PT), 2)
    , round(avg(US), 2)
    , round(avg(other), 2)
    , null
from
    base
;")

zz <- sqldf("
select
    a.first_device_type
    , country_destination
    , round(100. * total / sa_total, 2) as prob

from (
    select
        first_device_type
        , country_destination
        , count(*) as total
    from
        train
    group by 1, 2

    ) as a, (

    select
        first_device_type
        , count(*) as sa_total
        
    from
        train
    group by 1
    ) b

where a.first_device_type = b.first_device_type;
")

zzz <- sqldf("
with base as(

select
    b.*
    , Android_Phone + Android_Tablet + Desktop_Other + Mac_Desktop + Other_Unknown + SmartPhone_Other + Windows_Desktop + iPad + iPhone as total
from (
    select
            country_destination
            , round( 100. * ap / total, 2) as 'Android_Phone'
            , round( 100. * at / total, 2) as 'Android_Tablet'
            , round( 100. * do / total, 2) as 'Desktop_Other'
            , round( 100. * md / total, 2) as 'Mac_Desktop'
            , round( 100. * ou / total, 2) as 'Other_Unknown'
            , round( 100. * so / total, 2) as 'SmartPhone_Other'
            , round( 100. * wd / total, 2) as 'Windows_Desktop'
            , round( 100. * ipa / total, 2) as 'iPad'
            , round( 100. * iph / total, 2) as 'iPhone'
        from (
            select 
                country_destination
                , sum(case when first_device_type = 'Android Phone' then 1 else 0 end) as ap
                , sum(case when first_device_type = 'Android Tablet' then 1 else 0 end) as at
                , sum(case when first_device_type = 'Desktop (Other)' then 1 else 0 end) as do
                , sum(case when first_device_type = 'Mac Desktop' then 1 else 0 end) as md
                , sum(case when first_device_type = 'Other/Unknown' then 1 else 0 end) as ou
                , sum(case when first_device_type = 'SmartPhone (Other)' then 1 else 0 end) as so
                , sum(case when first_device_type = 'Windows Desktop' then 1 else 0 end) as wd
                , sum(case when first_device_type = 'iPad' then 1 else 0 end) as ipa
                , sum(case when first_device_type = 'iPhone' then 1 else 0 end) as iph
                , count(*) as total
            from train
            group by 1
            )
        ) as b
)
select
    *
from 
    base

union all

select
    'AVG'
    , round(avg(Android_Phone), 2)
    , round(avg(Android_Tablet), 2)
    , round(avg(Desktop_Other), 2)
    , round(avg(Mac_Desktop), 2)
    , round(avg(Other_Unknown), 2)
    , round(avg(SmartPhone_Other), 2)
    , round(avg(Windows_Desktop), 2)
    , round(avg(iPad), 2)
    , round(avg(iPhone), 2)
    , null
from base
;
")
Desc(users_r$signup_app)
## ------------------------------------------------------------------------- 
## users_r$signup_app (character)
## 
##    length       n     NAs  unique  levels   dupes
##   267'716 267'716       0       4       4       y
##            100.0%    0.0%                        
## 
##      level     freq   perc  cumfreq  cumperc
## 1      Web  212'705  79.5%  212'705    79.5%
## 2      iOS   34'258  12.8%  246'963    92.2%
## 3  Android   10'384   3.9%  257'347    96.1%
## 4    Moweb   10'369   3.9%  267'716   100.0%

Desc(users_r$first_device_type)
## ------------------------------------------------------------------------- 
## users_r$first_device_type (character)
## 
##    length       n     NAs  unique  levels   dupes
##   267'716 267'716       0       9       9       y
##            100.0%    0.0%                        
## 
##                 level     freq   perc  cumfreq  cumperc
## 1         Mac Desktop  103'224  38.6%  103'224    38.6%
## 2     Windows Desktop   83'782  31.3%  187'006    69.9%
## 3              iPhone   39'432  14.7%  226'438    84.6%
## 4                iPad   17'354   6.5%  243'792    91.1%
## 5       Other/Unknown   10'933   4.1%  254'725    95.1%
## 6       Android Phone    9'313   3.5%  264'038    98.6%
## 7      Android Tablet    2'015   0.8%  266'053    99.4%
## 8     Desktop (Other)    1'475   0.6%  267'528    99.9%
## 9  SmartPhone (Other)      188   0.1%  267'716   100.0%

signup_app by country_destination
signup_app AU CA DE ES FR GB IT NDF NL PT US other total
Android 3 15 9 25 24 16 21 4197 7 2 921 166 5406
Moweb 14 28 15 34 55 33 30 4078 14 3 1673 190 6167
Web 473 1282 957 2016 4548 2059 2524 100172 670 194 53831 8726 177452
iOS 31 58 40 104 218 115 140 13400 38 12 4023 653 18832
Total 521 1383 1021 2179 4845 2223 2715 121847 729 211 60448 9735 207857
signup_app by country_destination
signup_app AU CA DE ES FR GB IT NDF NL PT US other total
Android 0.06 0.28 0.17 0.46 0.44 0.30 0.39 77.64 0.13 0.04 17.04 3.07 100.02
Moweb 0.23 0.45 0.24 0.55 0.89 0.54 0.49 66.13 0.23 0.05 27.13 3.08 100.01
Web 0.27 0.72 0.54 1.14 2.56 1.16 1.42 56.45 0.38 0.11 30.34 4.92 100.01
iOS 0.16 0.31 0.21 0.55 1.16 0.61 0.74 71.16 0.20 0.06 21.36 3.47 99.99
Total 0.18 0.44 0.29 0.68 1.26 0.65 0.76 67.84 0.23 0.07 23.97 3.64 NA
signup_app by country_destination
country_destination Android Moweb Web iOS total
AU 0.58 2.69 90.79 5.95 100.01
CA 1.08 2.02 92.70 4.19 99.99
DE 0.88 1.47 93.73 3.92 100.00
ES 1.15 1.56 92.52 4.77 100.00
FR 0.50 1.14 93.87 4.50 100.01
GB 0.72 1.48 92.62 5.17 99.99
IT 0.77 1.10 92.97 5.16 100.00
NDF 3.44 3.35 82.21 11.00 100.00
NL 0.96 1.92 91.91 5.21 100.00
PT 0.95 1.42 91.94 5.69 100.00
US 1.52 2.77 89.05 6.66 100.00
other 1.71 1.95 89.64 6.71 100.01
Total 14.26 22.87 1093.95 68.93 NA
signup_app by country_destination
first_device_type AU CA DE ES FR GB IT NDF NL PT US other total
Android Phone 0.00 0.47 0.14 0.47 0.62 0.11 0.43 74.59 0.14 0.00 20.09 2.93 99.99
Android Tablet 0.16 0.88 0.40 0.88 1.76 0.40 1.36 64.82 0.40 0.16 23.80 4.97 99.99
Desktop (Other) 0.26 1.54 1.11 0.94 1.89 0.69 1.11 54.50 0.43 0.00 32.90 4.63 100.00
Mac Desktop 0.29 0.75 0.61 1.22 2.88 1.32 1.53 53.07 0.44 0.12 32.89 4.87 99.99
Other/Unknown 0.17 0.45 0.24 0.58 1.53 0.48 0.54 73.02 0.17 0.05 20.11 2.65 99.99
SmartPhone (Other) 0.00 0.00 0.00 0.00 2.70 0.00 1.35 66.22 0.00 0.00 27.03 2.70 100.00
Windows Desktop 0.23 0.71 0.46 1.06 2.14 1.03 1.30 58.88 0.30 0.11 28.57 5.22 100.01
iPad 0.30 0.55 0.42 1.05 2.43 1.09 1.53 62.21 0.47 0.09 25.26 4.59 99.99
iPhone 0.17 0.29 0.29 0.61 1.28 0.66 0.76 69.22 0.19 0.05 23.10 3.37 99.99
Total 0.18 0.63 0.41 0.76 1.91 0.64 1.10 64.06 0.28 0.06 25.97 3.99 NA
signup_app by country_destination
country_destination Android_Phone Android_Tablet Desktop_Other Mac_Desktop Other_Unknown SmartPhone_Other Windows_Desktop iPad iPhone total
AU 0.00 0.38 0.58 49.33 3.45 0.00 31.48 8.06 6.72 100.00
CA 0.94 0.80 1.30 47.36 3.40 0.00 36.37 5.50 4.34 100.01
DE 0.39 0.49 1.27 52.20 2.45 0.00 31.73 5.68 5.78 99.99
ES 0.60 0.50 0.50 48.97 2.80 0.00 34.14 6.70 5.78 99.99
FR 0.35 0.45 0.45 51.95 3.30 0.04 31.04 6.98 5.43 99.99
GB 0.13 0.22 0.36 51.69 2.25 0.00 32.48 6.79 6.07 99.99
IT 0.44 0.63 0.48 49.06 2.10 0.04 33.63 7.85 5.78 100.01
NDF 1.69 0.66 0.52 38.01 6.28 0.04 34.03 7.10 11.67 100.00
NL 0.55 0.69 0.69 52.40 2.47 0.00 28.81 9.05 5.35 100.01
PT 0.00 0.95 0.00 50.71 2.37 0.00 35.07 5.69 5.21 100.00
US 0.92 0.49 0.64 47.49 3.49 0.03 33.29 5.81 7.85 100.01
other 0.83 0.64 0.55 43.68 2.86 0.02 37.76 6.55 7.11 100.00
AVG 0.57 0.58 0.61 48.57 3.10 0.01 33.32 6.81 6.42 NA

Visualization

ggplot(data = task3, aes(x = signup_app, y = prob, fill = country_destination)) +
  geom_bar(stat = "identity")

ggplot(data = task6, aes(x = country_destination, y = prob, fill = signup_app)) +
  geom_bar(stat = "identity")

ggplot(data = zz, aes(x = first_device_type, y = prob, fill = country_destination)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

6.4. Season

Summary
나라마다 조금의 차이가 있지만 대체적으로 5~6월에 예약을 많이 한다. 그 이후로는 예약 수가 줄어드는 경향을 보인다. 다른 나라들은 거의 유사하지만 호주(AU)의 경우 조금 다른 양상을 보인다. 추측 : 다른 나라들과 계절이 반대이기 때문에 반대의 양상이 보이지 않을까 추측해볼 수 있음.

Report

date_first_booking_quarter
date_first_booking_quarter freq
2 28885
1 20875
3 20209
4 16041
date_first_booking_month
date_first_booking_month freq
6 10217
5 10153
4 8515
3 8087
7 7027
8 6816
2 6539
9 6366
1 6249
10 5987
11 5107
12 4947
date_first_booking_weekdays
date_first_booking_weekdays freq
Wednesday 13951
Tuesday 13832
Thursday 13491
Friday 12840
Monday 12341
Saturday 10043
Sunday 9512
date_first_booking_month
date_first_booking_month AU CA DE ES FR GB IT NDF NL PT US other total
NA 0 0 0 0 0 0 0 121847 0 0 0 0 121847
1 57 53 57 112 320 140 157 0 29 9 4645 670 6249
2 46 72 66 163 376 168 194 0 49 18 4604 783 6539
3 47 102 79 206 488 217 280 0 64 22 5622 960 8087
4 27 156 99 256 544 229 341 0 67 23 5822 951 8515
5 53 208 144 326 705 312 418 0 99 27 6715 1146 10153
6 36 214 130 332 595 295 351 0 116 40 6859 1249 10217
7 31 136 94 229 348 182 247 0 68 14 4913 765 7027
8 30 139 102 181 399 151 210 0 81 15 4806 702 6816
9 41 101 110 144 332 169 221 0 49 12 4545 642 6366
10 45 67 54 114 291 141 126 0 50 13 4423 663 5987
11 60 73 45 60 238 128 103 0 29 9 3754 608 5107
12 48 62 41 56 209 91 67 0 28 9 3740 596 4947
0 43 115 85 181 403 185 226 NA 60 17 5037 811 NA
date_first_booking_month_ratio
date_first_booking_month AU CA DE ES FR GB IT NDF NL PT US other total
NA 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100 0.00 0.00 0.00 0.00 100.00
1 0.91 0.85 0.91 1.79 5.12 2.24 2.51 0 0.46 0.14 74.33 10.72 99.98
2 0.70 1.10 1.01 2.49 5.75 2.57 2.97 0 0.75 0.28 70.41 11.97 100.00
3 0.58 1.26 0.98 2.55 6.03 2.68 3.46 0 0.79 0.27 69.52 11.87 99.99
4 0.32 1.83 1.16 3.01 6.39 2.69 4.00 0 0.79 0.27 68.37 11.17 100.00
5 0.52 2.05 1.42 3.21 6.94 3.07 4.12 0 0.98 0.27 66.14 11.29 100.01
6 0.35 2.09 1.27 3.25 5.82 2.89 3.44 0 1.14 0.39 67.13 12.22 99.99
7 0.44 1.94 1.34 3.26 4.95 2.59 3.52 0 0.97 0.20 69.92 10.89 100.02
8 0.44 2.04 1.50 2.66 5.85 2.22 3.08 0 1.19 0.22 70.51 10.30 100.01
9 0.64 1.59 1.73 2.26 5.22 2.65 3.47 0 0.77 0.19 71.39 10.08 99.99
10 0.75 1.12 0.90 1.90 4.86 2.36 2.10 0 0.84 0.22 73.88 11.07 100.00
11 1.17 1.43 0.88 1.17 4.66 2.51 2.02 0 0.57 0.18 73.51 11.91 100.01
12 0.97 1.25 0.83 1.13 4.22 1.84 1.35 0 0.57 0.18 75.60 12.05 99.99
0 0.65 1.55 1.16 2.39 5.48 2.53 3.00 NA 0.82 0.23 70.89 11.30 NA

Visualization

#View count of each country_destination by data_first_booking_month
fig.dac_mon <- ggplot(train, aes(date_first_booking_month, fill=country_destination))
fig.dac_mon <- fig.dac_mon + geom_bar(position="identity", alpha=1.0)
fig.dac_mon <- fig.dac_mon + scale_y_log10()
fig.dac_mon + facet_wrap(~country_destination)

ggplot(data = task9, aes(x = date_first_booking_month, y = prob, fill = country_destination)) +
  geom_bar(stat = "identity")

6.5. Time cost

Summary
23개의 계급값으로 구분(당일~others) 총 예약 건수는 88,908건 예약의 상당수는 2주 내에 이루어졌다는 것을 알 수 있다.(54.81%)

Report

How long it will take for the first booking to occur
periods freq
0 20710
1 days 14248
2 days 6271
2 weeks 5823
2 months 4248
3 days 3863
3 weeks 2974
4 days 2836
3 months 2615
5 days 2174
4 months 2002
4 weeks 1917
12 months 1794
5 months 1769
6 days 1726
7 days 1616
6 months 1576
7 months 1514
11 months 1504
8 months 1480
9 months 1418
10 months 1389
others 543
How long it will take for the first booking to occur
periods prob
0 24.08
1 days 16.57
2 days 7.29
2 weeks 6.77
2 months 4.94
3 days 4.49
3 weeks 3.46
4 days 3.30
3 months 3.04
5 days 2.53
4 months 2.33
4 weeks 2.23
12 months 2.09
5 months 2.06
6 days 2.01
7 days 1.88
6 months 1.83
7 months 1.76
11 months 1.75
8 months 1.72
9 months 1.65
10 months 1.61
others 0.63

Visualization

6.6. Conclusion

1) Apple을 사용하는 2030의 젊은 층을 타겟으로 마케팅을 진행
2) 나라에 따라 다르지만 5~6월 성수기 외에 프로모션을 진행해서 전환율은 높이는 방법도 고려해볼 수 있음
3) 가입한 지 2주 내에 첫 예약이 이뤄지는 경우가 많기 때문에 처음 가입하고 난 다음의 UX를 잘 설계함으로써 지속적으로 이용할 수 있도록 관계를 맺는 것이 중요함
4) 사람들은 크리스마스(연휴)가 있는 주에 크리스마스 휴가를 예약하지 않는다. 보통 사람들은 몇 달 혹은 몇 주전에 예약을 할 것이다.
5) 주말보다 평일(수요일)에 가장 이용을 많이 한다. 아마 사무실에서 업무에 치이면서 무료하고 반복된 패턴을 벗어나고 싶은 욕구가 클 때 여행을 이용하는 것 같다. 상대적으로 여유로운 주말이나 연휴에는 이미 휴식을 취하거나 여가를 보내고 있기 때문에 이용을 덜 하는 것 같다.

- 이와 같이 여러 변수들간의 관계를 시각화 하면서 어떤 관계가 있는지 유추해 볼 수 있다.
- 이것으로 EDA를 마치고 다음 장부터는 Machine Learning Algorithm을 통해 위 데이터를 학습하고 예측하는 과정을 진행할 예정이다.

7. Modeling

7.1. Setting

- Split the data

# This R script is based on Sandro's python script, which produces a LB score of 0.8655
# This script should produce a LB score of 0.86547

set.seed(1)
# load data
df_train = read_csv("./input/train_users_2.csv")
df_test = read_csv("./input/test_users.csv")
labels = df_train['country_destination']
df_train = df_train[-grep('country_destination', colnames(df_train))]

7.2. Feature Engineering

# combine train and test data
df_all = rbind(df_train,df_test)
# remove date_first_booking
df_all = df_all[-c(which(colnames(df_all) %in% c('date_first_booking')))]
# replace missing values
df_all[is.na(df_all)] <- -1

# split date_account_created in year, month and day
dac = as.data.frame(str_split_fixed(df_all$date_account_created, '-', 3))
df_all['dac_year'] = dac[,1]
df_all['dac_month'] = dac[,2]
df_all['dac_day'] = dac[,3]
df_all = df_all[,-c(which(colnames(df_all) %in% c('date_account_created')))]

# split timestamp_first_active in year, month and day
df_all[,'tfa_year'] = substring(as.character(df_all[,'timestamp_first_active']), 1, 4)
df_all['tfa_month'] = substring(as.character(df_all['timestamp_first_active']), 5, 6)
df_all['tfa_day'] = substring(as.character(df_all['timestamp_first_active']), 7, 8)
df_all = df_all[,-c(which(colnames(df_all) %in% c('timestamp_first_active')))]

# clean Age by removing values
df_all[df_all$age < 14 | df_all$age > 100,'age'] <- -1

# one-hot-encoding features
ohe_feats = c('gender', 'signup_method', 'signup_flow', 'language', 'affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked', 'signup_app', 'first_device_type', 'first_browser')
dummies <- dummyVars(~ gender + signup_method + signup_flow + language + affiliate_channel + affiliate_provider + first_affiliate_tracked + signup_app + first_device_type + first_browser, data = df_all)
df_all_ohe <- as.data.frame(predict(dummies, newdata = df_all))
df_all_combined <- cbind(df_all[,-c(which(colnames(df_all) %in% ohe_feats))],df_all_ohe)

# split train and test
X = df_all_combined[df_all_combined$id %in% df_train$id,]
y <- recode(labels$country_destination,"'NDF'=0; 'US'=1; 'other'=2; 'FR'=3; 'CA'=4; 'GB'=5; 'ES'=6; 'IT'=7; 'PT'=8; 'NL'=9; 'DE'=10; 'AU'=11")
X_test = df_all_combined[df_all_combined$id %in% df_test$id,]

7.3. Modeling : XGBoost

# train xgboost
xgb <- xgboost(data = data.matrix(X[,-1]), 
               label = y, 
               eta = 0.1,
               max_depth = 9, 
               nround=25, 
               subsample = 0.5,
               colsample_bytree = 0.5,
               seed = 1,
               eval_metric = "merror",
               objective = "multi:softprob",
               num_class = 12,
               nthread = 3
)
## [1]  train-merror:0.364903 
## [2]  train-merror:0.364337 
## [3]  train-merror:0.363540 
## [4]  train-merror:0.363189 
## [5]  train-merror:0.362781 
## [6]  train-merror:0.362449 
## [7]  train-merror:0.362341 
## [8]  train-merror:0.362247 
## [9]  train-merror:0.362027 
## [10] train-merror:0.361741 
## [11] train-merror:0.361488 
## [12] train-merror:0.361273 
## [13] train-merror:0.361198 
## [14] train-merror:0.361020 
## [15] train-merror:0.360818 
## [16] train-merror:0.360757 
## [17] train-merror:0.360687 
## [18] train-merror:0.360486 
## [19] train-merror:0.360350 
## [20] train-merror:0.360050 
## [21] train-merror:0.359628 
## [22] train-merror:0.359474 
## [23] train-merror:0.359221 
## [24] train-merror:0.358963 
## [25] train-merror:0.358902

7.4. Predict

# predict values in test set
y_pred <- predict(xgb, data.matrix(X_test[,-1]))

# extract the 5 classes with highest probabilities
predictions <- as.data.frame(matrix(y_pred, nrow=12))
rownames(predictions) <- c('NDF','US','other','FR','CA','GB','ES','IT','PT','NL','DE','AU')
predictions_top5 <- as.vector(apply(predictions, 2, function(x) names(sort(x)[12:8])))

# create submission 
ids <- NULL
for (i in 1:NROW(X_test)) {
  idx <- X_test$id[i]
  ids <- append(ids, rep(idx,5))
}
submission <- NULL
submission$id <- ids
submission$country <- predictions_top5

# generate submission file
submission <- as.data.frame(submission)
write.csv(submission, "submission.csv", quote=FALSE, row.names = FALSE)