EDA

Problem statement

Data from hotel bookings will be analyzed for this research. This data set includes reservation details for a resort hotel and a city hotel, as well as details like the country of origin, the hotel category, whether the reservation was canceled, the number of adults, kids, and/or babies, the arrival date, and whether the booking was made by a returning customer or not. Here are some of the most crucial factors and their explanations.

Table 1. The most crucial factors and their explanations
Variables	Explanation
hotel	The category of hotel, city hotel or resort hotel
is_canceled	Value indicating if the booking was canceled (1) or not (0)
is_repeated_guest	Value indicating if the booking name was from a repeated guest (1) or not (0)
previous_cancellations	Number of previous bookings that were cancelled by the customer prior to the current booking
previous_bookings_not_canceled	Number of previous bookings not cancelled by the customer prior to the current booking
adults	Number of adults
adr	Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
lead_time	Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
booking changes	Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation

The hotel industry is very unstable, and reservations depend on a variety of factors. This project is driven by two primary goals. One is to investigate and examine data to find significant elements associated to order cancellation. Another method is to categorize customers and identify devoted followers by investigating a closely connected variable, which is repeated guests. Both of them can carry out numerous initiatives to increase business and performance and can offer insights to hotel management. EDA will be divided into following 5 analysis:

Figure 1. Work flow of EDA (1) Data summary: Using summary() and str() functions to explore the overall situation of variables.

(2) Check data for data cleaning: Check missing values, duplicate values and outliers.

(3)Data visualization: Using plot(), hist(), vioplot() and scatterplot() functions to visualize data.

(4) Statistical testing: Using Pearson Correlation Analysis find correlation between different variables. Such as variables that strongly correlated with cancellation and variables that strongly correlated with is_repeated_guest.

(5) conclusion of EDA: This section is a summary of the EDA.

Data summary

Import the dataset name as “df” and analyze the data.

df[df == "NULL"] <- NA
head(df)

##          hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel           0       342              2015               July
## 2 Resort Hotel           0       737              2015               July
## 3 Resort Hotel           0         7              2015               July
## 4 Resort Hotel           0        13              2015               July
## 5 Resort Hotel           0        14              2015               July
## 6 Resort Hotel           0        14              2015               July
##   arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1                       27                         1                       0
## 2                       27                         1                       0
## 3                       27                         1                       0
## 4                       27                         1                       0
## 5                       27                         1                       0
## 6                       27                         1                       0
##   stays_in_week_nights adults children babies meal country market_segment
## 1                    0      2        0      0   BB     PRT         Direct
## 2                    0      2        0      0   BB     PRT         Direct
## 3                    1      1        0      0   BB     GBR         Direct
## 4                    1      1        0      0   BB     GBR      Corporate
## 5                    2      2        0      0   BB     GBR      Online TA
## 6                    2      2        0      0   BB     GBR      Online TA
##   distribution_channel is_repeated_guest previous_cancellations
## 1               Direct                 0                      0
## 2               Direct                 0                      0
## 3               Direct                 0                      0
## 4            Corporate                 0                      0
## 5                TA/TO                 0                      0
## 6                TA/TO                 0                      0
##   previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1                              0                  C                  C
## 2                              0                  C                  C
## 3                              0                  A                  C
## 4                              0                  A                  A
## 5                              0                  A                  A
## 6                              0                  A                  A
##   booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1               3   No Deposit  <NA>    <NA>                    0     Transient
## 2               4   No Deposit  <NA>    <NA>                    0     Transient
## 3               0   No Deposit  <NA>    <NA>                    0     Transient
## 4               0   No Deposit   304    <NA>                    0     Transient
## 5               0   No Deposit   240    <NA>                    0     Transient
## 6               0   No Deposit   240    <NA>                    0     Transient
##   adr required_car_parking_spaces total_of_special_requests reservation_status
## 1   0                           0                         0          Check-Out
## 2   0                           0                         0          Check-Out
## 3  75                           0                         0          Check-Out
## 4  75                           0                         0          Check-Out
## 5  98                           0                         1          Check-Out
## 6  98                           0                         1          Check-Out
##   reservation_status_date
## 1              2015-07-01
## 2              2015-07-01
## 3              2015-07-02
## 4              2015-07-02
## 5              2015-07-03
## 6              2015-07-03

Check the variable type and values

str(df)

## 'data.frame':    119390 obs. of  32 variables:
##  $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr  "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr  "BB" "BB" "BB" "BB" ...
##  $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr  "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr  "C" "C" "C" "A" ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr  NA NA NA "304" ...
##  $ company                       : chr  NA NA NA NA ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...

There are 119390 rows and 32 columns in df.

The minimum value, length, first quantile, median, third quantile, mode, class and maximum value of each attribute are displayed below.

summary(df)

##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##                                                                       
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  NA's   :4                                                                  
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##                                                                          
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##                                                                            
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##                                                                              
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  Length:119390          
##  Class :character       
##  Mode  :character       
##                         
##                         
##                         
##

Check the count of canceled order of the whole datafram.

table(df$is_canceled)

## 
##     0     1 
## 75166 44224

There are 44224 rows of order which is canceled while 75166 rows are not canceled.The cancellation rate of the original data set is 37.04%. We will caculate the cancellation rate again after data cleaning. This step just to get a rough idea of data.

44224/119390

## [1] 0.3704163

table(df$is_repeated_guest)

## 
##      0      1 
## 115580   3810

There are just 3810 rows from repeated guest. Large proportion of new users whose rate is 96.8%.

115580/119390

## [1] 0.9680878

Check data for data cleaning

Check for missing values

Using sum() function check count of missing values.

sum(is.na(df))

## [1] 129425

There are 129425 missing values.

Install and library tidyverse and naniar packages.

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.7      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.0 
## ✔ readr   2.1.2      ✔ forcats 0.5.1 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## 載入需要的套件：sm
## 
## Package 'sm', version 2.2-5.7: type help(sm) for summary information
## 
## 載入需要的套件：zoo
## 
## 
## 載入套件：'zoo'
## 
## 
## 下列物件被遮斷自 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## 
## corrplot 0.92 loaded

Show the missing values and its percentage.

df %>% miss_var_summary()

## # A tibble: 32 × 3
##    variable                 n_miss pct_miss
##    <chr>                     <int>    <dbl>
##  1 company                  112593 94.3    
##  2 agent                     16340 13.7    
##  3 country                     488  0.409  
##  4 children                      4  0.00335
##  5 hotel                         0  0      
##  6 is_canceled                   0  0      
##  7 lead_time                     0  0      
##  8 arrival_date_year             0  0      
##  9 arrival_date_month            0  0      
## 10 arrival_date_week_number      0  0      
## # … with 22 more rows

From the results of the code above, “company” has 112593 missing vaules which accounts for 94% of the data in this column, “agent” has 16340 missing values which which accounts for 13% of the data in this column. “country” has 488 missing vaules and “children” has 4 missing values.

Check for outliers

Choose some columns of numeric type to generate a new dataframe df1 to explore outliers, remove columns with only two values, for example, “is_canceled” just has two values that is 0 and 1, remove it. The remaining columns are lead_time, arrival_date_year, arrival_date_week_number, arrival_date_day_of_month, stays_in_weekend_nights, stays_in_week_nights, adults, children ,babies, previous_cancellations, previous_bookings_not_canceled, booking_changes, days_in_waiting_list, adr, required_car_parking_spaces,and total_of_special_requests. Since there are many remaining columns, the violin plot is divided into three data frames for the convenience of displaying.

df1 <- df[,c('lead_time','arrival_date_year','arrival_date_week_number','arrival_date_day_of_month')]
head(df1)

##   lead_time arrival_date_year arrival_date_week_number
## 1       342              2015                       27
## 2       737              2015                       27
## 3         7              2015                       27
## 4        13              2015                       27
## 5        14              2015                       27
## 6        14              2015                       27
##   arrival_date_day_of_month
## 1                         1
## 2                         1
## 3                         1
## 4                         1
## 5                         1
## 6                         1

df2 <- df[,c('stays_in_weekend_nights','stays_in_week_nights', 'adults', 'children','babies', 'previous_cancellations')]
head(df2)

##   stays_in_weekend_nights stays_in_week_nights adults children babies
## 1                       0                    0      2        0      0
## 2                       0                    0      2        0      0
## 3                       0                    1      1        0      0
## 4                       0                    1      1        0      0
## 5                       0                    2      2        0      0
## 6                       0                    2      2        0      0
##   previous_cancellations
## 1                      0
## 2                      0
## 3                      0
## 4                      0
## 5                      0
## 6                      0

df3 <- df[,c('previous_bookings_not_canceled', 'booking_changes','days_in_waiting_list', 'adr')]
head(df3)

##   previous_bookings_not_canceled booking_changes days_in_waiting_list adr
## 1                              0               3                    0   0
## 2                              0               4                    0   0
## 3                              0               0                    0  75
## 4                              0               0                    0  75
## 5                              0               0                    0  98
## 6                              0               0                    0  98

if(!require(vioplot)) install.packages("vioplot",repos = "http://cran.us.r-project.org")
#install.packages('vioplot')
#library(vioplot)
vioplot(df1,border='steelblue',pchMed=16)

vioplot(df2,border='steelblue',pchMed=16)

vioplot(df3,border='steelblue',pchMed=16)

From graphs above, “lead_time”, “stays_in_weekend_nights”,“stays_in_week_nights”, “adults”, “previous_cancellations” and “adr” have outliers. ## Data visualization ### Cancellation rate

 if(!require(ggplot2)) install.packages("ggplot2",repos = "http://cran.us.r-project.org")
# install.packages('ggplot2')
# library(ggplot2)
dfi<-df
dfi[,'is_canceled']<-factor(df[,'is_canceled'])
ggplot(dfi, aes(x = is_canceled,fill=is_canceled)) +
  #geom_bar() +
  geom_bar(stat="count")+theme_minimal()+
  labs(title = "Cancellation Rate")

We can infer from the plot that the majority of reservations were kept.

Bar chart of hotel types

 if(!require(dplyr)) install.packages("dplyr",repos = "http://cran.us.r-project.org")
# install.packages('dplyr')
# library(dplyr)
 if(!require(ggplot2)) install.packages("ggplot2",repos = "http://cran.us.r-project.org")
# install.packages('ggplot2')
# library(ggplot2)
df <- df %>%
  mutate_at(.vars = vars(is_canceled), .fun = as.character)
ggplot(df, aes(x = hotel,fill=is_canceled)) +
  geom_bar() +
  labs(title = "Bar chart of hotel types")

City hotel has more people canceled order than resort hotel.

Repeated guest rate

   if(!require(ggplot2)) install.packages("ggplot2",repos = "http://cran.us.r-project.org")
# install.packages('ggplot2')
# library(ggplot2)
dfr <-df
dfr[,'is_repeated_guest']<-factor(df[,'is_repeated_guest'])
ggplot(dfr,aes(x=is_repeated_guest,fill=is_repeated_guest))+
  geom_bar(stat="count")+theme_minimal()+
  labs(title = "Repeated guest rate")

From the plot, we can see that the majority of bookings were not repeated guests.

Pie chart of market segment

#market segment group by cancellation
 if(!require(ggrepel)) install.packages("ggrepel",repos = "http://cran.us.r-project.org")

## 載入需要的套件：ggrepel

# install.packages('ggrepel')
# library(ggrepel)

bookings_bysegment_plot<- df %>%
  group_by(is_canceled, market_segment) %>%
  summarise(cuenta = n())%>%
  transmute(market_segment, pct_segment = round(cuenta/sum(cuenta)*100, 2)) %>%
  ggplot(aes(x=" ", y=pct_segment, group=market_segment, 
             colour=market_segment, fill=market_segment)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y") + 
  facet_grid(.~ is_canceled) +
  theme_void() +
  geom_text_repel(aes(x = 1.7, label = paste(pct_segment, "%", sep = " ")),
                  colour = 'black', position = position_stack(vjust = 0.5)) + 
  labs(x = "cancellation", title = "market segment group by cancellation")

## `summarise()` has grouped output by 'is_canceled'. You can override using the
## `.groups` argument.

print(bookings_bysegment_plot)

From the picture above, Online TA is the most popular market segment both in cancellation and reservation.

Statistical testing

Using Pearson Correlation Analysis to check the correlationship with is_repeated_guest and is_canceled

Convert adr from num to int

df4 <-df[,c('is_repeated_guest','is_canceled','lead_time','arrival_date_year','arrival_date_week_number','arrival_date_day_of_month','stays_in_weekend_nights','stays_in_week_nights', 'adults', 'children','babies', 'previous_cancellations','previous_bookings_not_canceled','booking_changes','days_in_waiting_list', 'adr')]

Check the correlated coefficient

df4$is_canceled <- as.integer(df4$is_canceled)
df4$adr <- as.integer(df4$adr)
df4 <-na.omit(df4)
df4 %>% miss_var_summary()

## # A tibble: 16 × 3
##    variable                       n_miss pct_miss
##    <chr>                           <int>    <dbl>
##  1 is_repeated_guest                   0        0
##  2 is_canceled                         0        0
##  3 lead_time                           0        0
##  4 arrival_date_year                   0        0
##  5 arrival_date_week_number            0        0
##  6 arrival_date_day_of_month           0        0
##  7 stays_in_weekend_nights             0        0
##  8 stays_in_week_nights                0        0
##  9 adults                              0        0
## 10 children                            0        0
## 11 babies                              0        0
## 12 previous_cancellations              0        0
## 13 previous_bookings_not_canceled      0        0
## 14 booking_changes                     0        0
## 15 days_in_waiting_list                0        0
## 16 adr                                 0        0

Check for variables that are correlated with is_repeated_guest and is_canceled. Check the correlated coefficient of is_repeated_guest and is_canceled.

There are no variables that strongly correlated with is_repeated_guest and is_canceled. The top 3 variables that are correlated with is_repeated_guest are previous_bookings_not_canceled, adults, and adr. The absolute values of the correlation coefficients are 0.41805573, 0.14642457, and 0.13432846. The top 3 variables related to is_canceled are lead_time, booking_changes, and previous_cancellations. The absolute values of the correlation coefficients are 0.2931774, 0.14437057, and 0.11014047.

Visulaze the correlation

Since column names of df4 is very long, rename column names for a better visualization.

Old name	New name
is_repeated_guest	rep
is_canceled	cc
lead_time	lt
arrival_date_year	year
arrival_date_week_number	week
arrival_date_day_of_month	day
stays_in_weekend_nights	wkn
stays_in_week_nights	wn
adults	adu
children	chi
babies	abb
previous_cancellations	pc
previous_bookings_not_canceled	pnc
booking_changes	bc
days_in_waiting_list	dw
adr	adr

# change column names of all the columns in the df4
colnames(df4) <- c("rep", "cc","lt","year","week","day","wkn","wn","adu","chi","bab",
                   "pc","pnc","bc","dw","adr")

# visualizations
M <-cor(df4,method="pearson")
corrplot.mixed(M, order = 'AOE',number.cex = 0.5,tl.col = 'black')

chisq.test(dfc\(is_canceled,dfc\)distribution_channel) chisq.test(dfc\(is_canceled,dfc\)customer_type) chisq.test(dfc\(is_canceled,dfc\)meal) chisq.test(dfc\(is_canceled,dfc\)reserved_room_type) chisq.test(dfc\(is_canceled,dfc\)deposit_type)

Using chi square and bar chart to check chr variables relationship with is_repeated_guest and is_canceled

Pearson Correlation Analysis analyse the correlation of numeric variables. Lets use chi square and bar chart to check chr variables relationship with is_repeated_guest and is_canceled. Chr variables are hotel, arrival_date_month, meal, country, distribution_channel, reserved_room_type , assigned_room_type, deposit_type, agent, company, customer_type, reservation_status, and reservation_status_date. Market_segment and hotel are analysed before reservation_status_date has the same mean with is_canceled in this article, agent is ID of the travel agency that made the booking, and company has 94% Null. Therefore, this part don’t analyzed again those ralated variable(including reservation_status_date).
Among those variables, arrival_date_month can be anlalyzed by line chart.
While distribution_channel, meal, reserved_room_type, deposit_type, customer_type can be analyzed by bar chart.

Using chi square to check whether is_canceled and other 5 variables is corralated or not. H0: is_canceled have no significant correlation with distribution_channel, customer_type, meal, reserved_room_type, and deposit_type.

dfc <- df[,c( "is_canceled","is_repeated_guest","distribution_channel","customer_type", "meal", "reserved_room_type",
          "deposit_type")]
chisq.test(dfc$is_canceled,dfc$distribution_channel,correct = TRUE)

## Warning in chisq.test(dfc$is_canceled, dfc$distribution_channel, correct =
## TRUE): Chi-squared 近似演算法有可能不準

## 
##  Pearson's Chi-squared test
## 
## data:  dfc$is_canceled and dfc$distribution_channel
## X-squared = 3745.8, df = 4, p-value < 2.2e-16

chisq.test(dfc$is_canceled,dfc$customer_type)

## 
##  Pearson's Chi-squared test
## 
## data:  dfc$is_canceled and dfc$customer_type
## X-squared = 2222.5, df = 3, p-value < 2.2e-16

chisq.test(dfc$is_canceled,dfc$meal)

## 
##  Pearson's Chi-squared test
## 
## data:  dfc$is_canceled and dfc$meal
## X-squared = 304.24, df = 4, p-value < 2.2e-16

chisq.test(dfc$is_canceled,dfc$reserved_room_type, correct = TRUE)

## Warning in chisq.test(dfc$is_canceled, dfc$reserved_room_type, correct = TRUE):
## Chi-squared 近似演算法有可能不準

## 
##  Pearson's Chi-squared test
## 
## data:  dfc$is_canceled and dfc$reserved_room_type
## X-squared = 647.84, df = 9, p-value < 2.2e-16

chisq.test(dfc$is_canceled,dfc$deposit_type)

## 
##  Pearson's Chi-squared test
## 
## data:  dfc$is_canceled and dfc$deposit_type
## X-squared = 27677, df = 2, p-value < 2.2e-16

From the results above, every p-value < 2.2e-16. Therefore, is_canceled has correlation with distribution_channel, customer_type, meal, reserved_room_type, and deposit_type.

Using bar chart to check chr variables relationship with is_repeated_guest and is_canceled

# details in customer_type
df %>% 
  group_by(customer_type) %>% 
  summarize(mean_adr = mean(adr, na.rm = TRUE), 
            n = n(),
            median_adr = median(adr, na.rm = TRUE)) %>% 
  arrange(desc(mean_adr))

## # A tibble: 4 × 4
##   customer_type   mean_adr     n median_adr
##   <chr>              <dbl> <int>      <dbl>
## 1 Transient          107.  89613       99  
## 2 Contract            87.5  4076       82.9
## 3 Transient-Party     86.1 25124       81  
## 4 Group               83.5   577       75

# customer_type
  ggplot(data=df,aes(x=customer_type,fill=is_canceled))+
  geom_bar()

From the image above, when the customer type is Transient, the count of cancellation and not cancellation is most.

ggplot(df, aes(x=customer_type, y=adr, colour=customer_type)) + 
  geom_boxplot() + 
  scale_color_brewer(palette = "BrBG") + 
  theme_light() + 
  scale_y_continuous(limits = c(0, 600)) +
  ggtitle("ADR vs. Customer Type") + 
  xlab("Customer Type") + 
  ylab("adr")

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

From the image above, most of the customer type is Transient and the average adr is also the highest.

# details in deposit_type
df %>% 
  group_by(deposit_type) %>% 
  summarize(mean_adr = mean(adr, na.rm = TRUE), 
            n = n(),
            median_adr = median(adr, na.rm = TRUE)) %>% 
  arrange(desc(mean_adr))

## # A tibble: 3 × 4
##   deposit_type mean_adr      n median_adr
##   <chr>           <dbl>  <int>      <dbl>
## 1 No Deposit      104.  104641       95.5
## 2 Non Refund       90.0  14587       86  
## 3 Refundable       76.2    162       66

# deposit_type
  ggplot(data=df,aes(x=deposit_type,fill=is_canceled))+
  geom_bar()

From the graph above, when the deposit type is No deposit, more people canceled the order.

ggplot(df, aes(x=deposit_type, y=adr, colour=deposit_type)) + 
  geom_boxplot() + 
  scale_color_brewer(palette = "Dark2") + 
  theme_light() + 
  scale_y_continuous(limits = c(0, 600)) +
  ggtitle("ADR vs. Deposit Type") + 
  xlab("Deposit Type") + 
  ylab("adr")

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

From the figure above, most of the reservation come from “No Deposit”, and also the mean adr is the highest among others.

# reserved_room_type
  ggplot(data=df,aes(x=reserved_room_type,fill=is_canceled))+
  geom_bar()

From the photo above, most people book room type A.

ggplot(df, aes(x=reserved_room_type, y=adr, colour=reserved_room_type)) + 
  geom_boxplot() + 
  scale_color_brewer(palette = "BrBG") + 
  theme_light() + 
  scale_y_continuous(limits = c(0, 750)) +
  ggtitle("Average daily rate vs. Room Type") + 
  xlab("reserved_room_type") + 
  ylab("adr")

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

From the picture above, room type A, B and D contain many outliers. Both room type A and B have the lowest adr.

# details in meal_type
df %>% 
  group_by(meal) %>% 
  summarize(mean_adr = mean(adr, na.rm = TRUE), 
            n = n(),
            median_adr = median(adr, na.rm = TRUE)) %>% 
  arrange(desc(mean_adr))

## # A tibble: 5 × 4
##   meal      mean_adr     n median_adr
##   <chr>        <dbl> <int>      <dbl>
## 1 HB           120.  14463      107  
## 2 FB           109.    798       84  
## 3 BB            99.4 92310       91.7
## 4 SC            98.3 10650       96  
## 5 Undefined     91.9  1169       79

# meal
  ggplot(df,aes(x=meal,fill=is_canceled))+
  geom_bar()

From the picture above, meal type BB is chosen by most people. BB – Bed & Breakfast.

ggplot(df, aes(x=meal, y=adr, colour=meal)) + 
  geom_boxplot() + 
  scale_color_brewer(palette = "BrBG") + 
  theme_light() + 
  scale_y_continuous(limits = c(0, 600)) +
  ggtitle("Average daily rate vs. Meal Type") + 
  xlab("Meal") + 
  ylab("adr")

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

From the picture above, meal type BB is chosen by most people.In contrast, FB has the lowest demand. For adr, HB (Half board (breakfast and dinner)) has the highest ADR among others.

#Distribution channel
df%>%
  ggplot(aes(x=distribution_channel,fill=is_canceled))+
  geom_bar()

Booking distribution channel, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”. TA/TO was choosen by most people.

# detail in market segment
df %>% 
  group_by(market_segment) %>% 
  summarize(mean_adr = mean(adr, na.rm = TRUE), 
            n = n(),
            median_adr = median(adr, na.rm = TRUE)) %>% 
  arrange(desc(mean_adr))

## # A tibble: 8 × 4
##   market_segment mean_adr     n median_adr
##   <chr>             <dbl> <int>      <dbl>
## 1 Online TA        117.   56477      110. 
## 2 Direct           115.   12606      105  
## 3 Aviation         100.     237       95  
## 4 Offline TA/TO     87.4  24219       85.5
## 5 Groups            79.5  19811       70  
## 6 Corporate         69.4   5295       65  
## 7 Undefined         15        2       15  
## 8 Complementary      2.89   743        0

ggplot(df, aes(x=market_segment, y=adr, colour=market_segment)) + 
  geom_boxplot() + 
  scale_color_brewer(palette = "BrBG") + 
  theme_light() + 
  scale_y_continuous(limits = c(0, 600)) +
  ggtitle("Average daily rate vs. Market segment") + 
  xlab("Market segment") + 
  ylab("adr") +
  theme(legend.position = "none")

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

From the picture above, Online TA has the highest adr and amount, followed by direct channel.

conclusion of EDA

The majority of reservations were kept. The cancellation rate of the original data set is 37.04%. City hotel has more people canceled order than resort hotel.
The majority of bookings were not repeated guests. The new user rate is 96.8%.
Previous bookings not cancelled, adults, and adr are the top 3 factors that linked with is repeated guest. The correlation coefficients’ absolute values are 0.41805573, 0.14642457, and 0.13432846. Lead time, booking changes, and previous cancellations are the top three variables that are connected to being canceled. The correlation coefficients’ absolute values are 0.2931174, 0.14437057, and 0.11014047.
In character variables, people preger TA/TO in distribution channel, BB in meal, transient in customer type, and room type A are.
For ADR, most of the reservation come from no deposit mode, but the average daily rate is also higher than others mode.
In market segment part, people tend to use online TA and Direct channel to book hotel, and the higher demand with the higher price (ADR).
It is noticeable that in the meal type average adr of HB (2 meals) is the highest one than FB (Full Board 3 meals included).
Overall, as the correlation coefficient map shown, the adr does not have a strong relation with hotel cancellation, which is 0.05, regarded as weak correlation between each other.

WQD7004-EDA-Rmarkdown

Danny

2023-01-15

EDA