Data from hotel bookings will be analyzed for this research. This data set includes reservation details for a resort hotel and a city hotel, as well as details like the country of origin, the hotel category, whether the reservation was canceled, the number of adults, kids, and/or babies, the arrival date, and whether the booking was made by a returning customer or not. Here are some of the most crucial factors and their explanations.
| Variables | Explanation |
|---|---|
| hotel | The category of hotel, city hotel or resort hotel |
| is_canceled | Value indicating if the booking was canceled (1) or not (0) |
| is_repeated_guest | Value indicating if the booking name was from a repeated guest (1) or not (0) |
| previous_cancellations | Number of previous bookings that were cancelled by the customer prior to the current booking |
| previous_bookings_not_canceled | Number of previous bookings not cancelled by the customer prior to the current booking |
| adults | Number of adults |
| adr | Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights |
| lead_time | Number of days that elapsed between the entering date of the booking into the PMS and the arrival date |
| booking changes | Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation |
The hotel industry is very unstable, and reservations depend on a variety of factors. This project is driven by two primary goals. One is to investigate and examine data to find significant elements associated to order cancellation. Another method is to categorize customers and identify devoted followers by investigating a closely connected variable, which is repeated guests. Both of them can carry out numerous initiatives to increase business and performance and can offer insights to hotel management. EDA will be divided into following 5 analysis:
(1) Data summary: Using summary() and str() functions
to explore the overall situation of variables.
(2) Check data for data cleaning: Check missing values, duplicate values and outliers.
(3)Data visualization: Using plot(), hist(), vioplot() and scatterplot() functions to visualize data.
(4) Statistical testing: Using Pearson Correlation Analysis find correlation between different variables. Such as variables that strongly correlated with cancellation and variables that strongly correlated with is_repeated_guest.
(5) conclusion of EDA: This section is a summary of the EDA.
Import the dataset name as “df” and analyze the data.
df[df == "NULL"] <- NA
head(df)
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1 27 1 0
## 2 27 1 0
## 3 27 1 0
## 4 27 1 0
## 5 27 1 0
## 6 27 1 0
## stays_in_week_nights adults children babies meal country market_segment
## 1 0 2 0 0 BB PRT Direct
## 2 0 2 0 0 BB PRT Direct
## 3 1 1 0 0 BB GBR Direct
## 4 1 1 0 0 BB GBR Corporate
## 5 2 2 0 0 BB GBR Online TA
## 6 2 2 0 0 BB GBR Online TA
## distribution_channel is_repeated_guest previous_cancellations
## 1 Direct 0 0
## 2 Direct 0 0
## 3 Direct 0 0
## 4 Corporate 0 0
## 5 TA/TO 0 0
## 6 TA/TO 0 0
## previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1 0 C C
## 2 0 C C
## 3 0 A C
## 4 0 A A
## 5 0 A A
## 6 0 A A
## booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1 3 No Deposit <NA> <NA> 0 Transient
## 2 4 No Deposit <NA> <NA> 0 Transient
## 3 0 No Deposit <NA> <NA> 0 Transient
## 4 0 No Deposit 304 <NA> 0 Transient
## 5 0 No Deposit 240 <NA> 0 Transient
## 6 0 No Deposit 240 <NA> 0 Transient
## adr required_car_parking_spaces total_of_special_requests reservation_status
## 1 0 0 0 Check-Out
## 2 0 0 0 Check-Out
## 3 75 0 0 Check-Out
## 4 75 0 0 Check-Out
## 5 98 0 1 Check-Out
## 6 98 0 1 Check-Out
## reservation_status_date
## 1 2015-07-01
## 2 2015-07-01
## 3 2015-07-02
## 4 2015-07-02
## 5 2015-07-03
## 6 2015-07-03
Check the variable type and values
str(df)
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr "July" "July" "July" "July" ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr "BB" "BB" "BB" "BB" ...
## $ country : chr "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr "C" "C" "A" "A" ...
## $ assigned_room_type : chr "C" "C" "C" "A" ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr NA NA NA "304" ...
## $ company : chr NA NA NA NA ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
There are 119390 rows and 32 columns in df.
The minimum value, length, first quantile, median, third quantile, mode, class and maximum value of each attribute are displayed below.
summary(df)
## hotel is_canceled lead_time arrival_date_year
## Length:119390 Min. :0.0000 Min. : 0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Mode :character Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.0000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## NA's :4
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03191
## 3rd Qu.:0.00000
## Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
##
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.58
## Mean : 2.321 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## Length:119390
## Class :character
## Mode :character
##
##
##
##
Check the count of canceled order of the whole datafram.
table(df$is_canceled)
##
## 0 1
## 75166 44224
There are 44224 rows of order which is canceled while 75166 rows are not canceled.The cancellation rate of the original data set is 37.04%. We will caculate the cancellation rate again after data cleaning. This step just to get a rough idea of data.
44224/119390
## [1] 0.3704163
table(df$is_repeated_guest)
##
## 0 1
## 115580 3810
There are just 3810 rows from repeated guest. Large proportion of new users whose rate is 96.8%.
115580/119390
## [1] 0.9680878
Using sum() function check count of missing values.
sum(is.na(df))
## [1] 129425
There are 129425 missing values.
Install and library tidyverse and naniar packages.
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## 載入需要的套件:sm
##
## Package 'sm', version 2.2-5.7: type help(sm) for summary information
##
## 載入需要的套件:zoo
##
##
## 載入套件:'zoo'
##
##
## 下列物件被遮斷自 'package:base':
##
## as.Date, as.Date.numeric
##
##
## corrplot 0.92 loaded
Show the missing values and its percentage.
df %>% miss_var_summary()
## # A tibble: 32 × 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 company 112593 94.3
## 2 agent 16340 13.7
## 3 country 488 0.409
## 4 children 4 0.00335
## 5 hotel 0 0
## 6 is_canceled 0 0
## 7 lead_time 0 0
## 8 arrival_date_year 0 0
## 9 arrival_date_month 0 0
## 10 arrival_date_week_number 0 0
## # … with 22 more rows
From the results of the code above, “company” has 112593 missing vaules which accounts for 94% of the data in this column, “agent” has 16340 missing values which which accounts for 13% of the data in this column. “country” has 488 missing vaules and “children” has 4 missing values.
Choose some columns of numeric type to generate a new dataframe df1 to explore outliers, remove columns with only two values, for example, “is_canceled” just has two values that is 0 and 1, remove it. The remaining columns are lead_time, arrival_date_year, arrival_date_week_number, arrival_date_day_of_month, stays_in_weekend_nights, stays_in_week_nights, adults, children ,babies, previous_cancellations, previous_bookings_not_canceled, booking_changes, days_in_waiting_list, adr, required_car_parking_spaces,and total_of_special_requests. Since there are many remaining columns, the violin plot is divided into three data frames for the convenience of displaying.
df1 <- df[,c('lead_time','arrival_date_year','arrival_date_week_number','arrival_date_day_of_month')]
head(df1)
## lead_time arrival_date_year arrival_date_week_number
## 1 342 2015 27
## 2 737 2015 27
## 3 7 2015 27
## 4 13 2015 27
## 5 14 2015 27
## 6 14 2015 27
## arrival_date_day_of_month
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
df2 <- df[,c('stays_in_weekend_nights','stays_in_week_nights', 'adults', 'children','babies', 'previous_cancellations')]
head(df2)
## stays_in_weekend_nights stays_in_week_nights adults children babies
## 1 0 0 2 0 0
## 2 0 0 2 0 0
## 3 0 1 1 0 0
## 4 0 1 1 0 0
## 5 0 2 2 0 0
## 6 0 2 2 0 0
## previous_cancellations
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
df3 <- df[,c('previous_bookings_not_canceled', 'booking_changes','days_in_waiting_list', 'adr')]
head(df3)
## previous_bookings_not_canceled booking_changes days_in_waiting_list adr
## 1 0 3 0 0
## 2 0 4 0 0
## 3 0 0 0 75
## 4 0 0 0 75
## 5 0 0 0 98
## 6 0 0 0 98
if(!require(vioplot)) install.packages("vioplot",repos = "http://cran.us.r-project.org")
#install.packages('vioplot')
#library(vioplot)
vioplot(df1,border='steelblue',pchMed=16)
vioplot(df2,border='steelblue',pchMed=16)
vioplot(df3,border='steelblue',pchMed=16)
From graphs above, “lead_time”, “stays_in_weekend_nights”,“stays_in_week_nights”, “adults”, “previous_cancellations” and “adr” have outliers. ## Data visualization ### Cancellation rate
if(!require(ggplot2)) install.packages("ggplot2",repos = "http://cran.us.r-project.org")
# install.packages('ggplot2')
# library(ggplot2)
dfi<-df
dfi[,'is_canceled']<-factor(df[,'is_canceled'])
ggplot(dfi, aes(x = is_canceled,fill=is_canceled)) +
#geom_bar() +
geom_bar(stat="count")+theme_minimal()+
labs(title = "Cancellation Rate")
We can infer from the plot that the majority of reservations were kept.
if(!require(dplyr)) install.packages("dplyr",repos = "http://cran.us.r-project.org")
# install.packages('dplyr')
# library(dplyr)
if(!require(ggplot2)) install.packages("ggplot2",repos = "http://cran.us.r-project.org")
# install.packages('ggplot2')
# library(ggplot2)
df <- df %>%
mutate_at(.vars = vars(is_canceled), .fun = as.character)
ggplot(df, aes(x = hotel,fill=is_canceled)) +
geom_bar() +
labs(title = "Bar chart of hotel types")
City hotel has more people canceled order than resort hotel.
if(!require(ggplot2)) install.packages("ggplot2",repos = "http://cran.us.r-project.org")
# install.packages('ggplot2')
# library(ggplot2)
dfr <-df
dfr[,'is_repeated_guest']<-factor(df[,'is_repeated_guest'])
ggplot(dfr,aes(x=is_repeated_guest,fill=is_repeated_guest))+
geom_bar(stat="count")+theme_minimal()+
labs(title = "Repeated guest rate")
From the plot, we can see that the majority of bookings were not repeated guests.
#market segment group by cancellation
if(!require(ggrepel)) install.packages("ggrepel",repos = "http://cran.us.r-project.org")
## 載入需要的套件:ggrepel
# install.packages('ggrepel')
# library(ggrepel)
bookings_bysegment_plot<- df %>%
group_by(is_canceled, market_segment) %>%
summarise(cuenta = n())%>%
transmute(market_segment, pct_segment = round(cuenta/sum(cuenta)*100, 2)) %>%
ggplot(aes(x=" ", y=pct_segment, group=market_segment,
colour=market_segment, fill=market_segment)) +
geom_bar(width = 1, stat = "identity") +
coord_polar("y") +
facet_grid(.~ is_canceled) +
theme_void() +
geom_text_repel(aes(x = 1.7, label = paste(pct_segment, "%", sep = " ")),
colour = 'black', position = position_stack(vjust = 0.5)) +
labs(x = "cancellation", title = "market segment group by cancellation")
## `summarise()` has grouped output by 'is_canceled'. You can override using the
## `.groups` argument.
print(bookings_bysegment_plot)
From the picture above, Online TA is the most popular market segment both in cancellation and reservation.
Convert adr from num to int
df4 <-df[,c('is_repeated_guest','is_canceled','lead_time','arrival_date_year','arrival_date_week_number','arrival_date_day_of_month','stays_in_weekend_nights','stays_in_week_nights', 'adults', 'children','babies', 'previous_cancellations','previous_bookings_not_canceled','booking_changes','days_in_waiting_list', 'adr')]
Check the correlated coefficient
df4$is_canceled <- as.integer(df4$is_canceled)
df4$adr <- as.integer(df4$adr)
df4 <-na.omit(df4)
df4 %>% miss_var_summary()
## # A tibble: 16 × 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 is_repeated_guest 0 0
## 2 is_canceled 0 0
## 3 lead_time 0 0
## 4 arrival_date_year 0 0
## 5 arrival_date_week_number 0 0
## 6 arrival_date_day_of_month 0 0
## 7 stays_in_weekend_nights 0 0
## 8 stays_in_week_nights 0 0
## 9 adults 0 0
## 10 children 0 0
## 11 babies 0 0
## 12 previous_cancellations 0 0
## 13 previous_bookings_not_canceled 0 0
## 14 booking_changes 0 0
## 15 days_in_waiting_list 0 0
## 16 adr 0 0
Check for variables that are correlated with is_repeated_guest and is_canceled. Check the correlated coefficient of is_repeated_guest and is_canceled.
There are no variables that strongly correlated with is_repeated_guest and is_canceled. The top 3 variables that are correlated with is_repeated_guest are previous_bookings_not_canceled, adults, and adr. The absolute values of the correlation coefficients are 0.41805573, 0.14642457, and 0.13432846. The top 3 variables related to is_canceled are lead_time, booking_changes, and previous_cancellations. The absolute values of the correlation coefficients are 0.2931774, 0.14437057, and 0.11014047.
Visulaze the correlation
Since column names of df4 is very long, rename column names for a better visualization.
| Old name | New name |
|---|---|
| is_repeated_guest | rep |
| is_canceled | cc |
| lead_time | lt |
| arrival_date_year | year |
| arrival_date_week_number | week |
| arrival_date_day_of_month | day |
| stays_in_weekend_nights | wkn |
| stays_in_week_nights | wn |
| adults | adu |
| children | chi |
| babies | abb |
| previous_cancellations | pc |
| previous_bookings_not_canceled | pnc |
| booking_changes | bc |
| days_in_waiting_list | dw |
| adr | adr |
# change column names of all the columns in the df4
colnames(df4) <- c("rep", "cc","lt","year","week","day","wkn","wn","adu","chi","bab",
"pc","pnc","bc","dw","adr")
# visualizations
M <-cor(df4,method="pearson")
corrplot.mixed(M, order = 'AOE',number.cex = 0.5,tl.col = 'black')
chisq.test(dfc\(is_canceled,dfc\)distribution_channel) chisq.test(dfc\(is_canceled,dfc\)customer_type) chisq.test(dfc\(is_canceled,dfc\)meal) chisq.test(dfc\(is_canceled,dfc\)reserved_room_type) chisq.test(dfc\(is_canceled,dfc\)deposit_type)
Pearson Correlation Analysis analyse the correlation of numeric
variables. Lets use chi square and bar chart to check chr variables
relationship with is_repeated_guest and is_canceled. Chr variables are
hotel, arrival_date_month, meal, country, distribution_channel,
reserved_room_type , assigned_room_type, deposit_type, agent, company,
customer_type, reservation_status, and reservation_status_date.
Market_segment and hotel are analysed before reservation_status_date has
the same mean with is_canceled in this article, agent is ID of the
travel agency that made the booking, and company has 94% Null.
Therefore, this part don’t analyzed again those ralated
variable(including reservation_status_date).
Among those variables, arrival_date_month can be anlalyzed by line
chart.
While distribution_channel, meal, reserved_room_type, deposit_type,
customer_type can be analyzed by bar chart.
dfc <- df[,c( "is_canceled","is_repeated_guest","distribution_channel","customer_type", "meal", "reserved_room_type",
"deposit_type")]
chisq.test(dfc$is_canceled,dfc$distribution_channel,correct = TRUE)
## Warning in chisq.test(dfc$is_canceled, dfc$distribution_channel, correct =
## TRUE): Chi-squared 近似演算法有可能不準
##
## Pearson's Chi-squared test
##
## data: dfc$is_canceled and dfc$distribution_channel
## X-squared = 3745.8, df = 4, p-value < 2.2e-16
chisq.test(dfc$is_canceled,dfc$customer_type)
##
## Pearson's Chi-squared test
##
## data: dfc$is_canceled and dfc$customer_type
## X-squared = 2222.5, df = 3, p-value < 2.2e-16
chisq.test(dfc$is_canceled,dfc$meal)
##
## Pearson's Chi-squared test
##
## data: dfc$is_canceled and dfc$meal
## X-squared = 304.24, df = 4, p-value < 2.2e-16
chisq.test(dfc$is_canceled,dfc$reserved_room_type, correct = TRUE)
## Warning in chisq.test(dfc$is_canceled, dfc$reserved_room_type, correct = TRUE):
## Chi-squared 近似演算法有可能不準
##
## Pearson's Chi-squared test
##
## data: dfc$is_canceled and dfc$reserved_room_type
## X-squared = 647.84, df = 9, p-value < 2.2e-16
chisq.test(dfc$is_canceled,dfc$deposit_type)
##
## Pearson's Chi-squared test
##
## data: dfc$is_canceled and dfc$deposit_type
## X-squared = 27677, df = 2, p-value < 2.2e-16
From the results above, every p-value < 2.2e-16. Therefore, is_canceled has correlation with distribution_channel, customer_type, meal, reserved_room_type, and deposit_type.
# details in customer_type
df %>%
group_by(customer_type) %>%
summarize(mean_adr = mean(adr, na.rm = TRUE),
n = n(),
median_adr = median(adr, na.rm = TRUE)) %>%
arrange(desc(mean_adr))
## # A tibble: 4 × 4
## customer_type mean_adr n median_adr
## <chr> <dbl> <int> <dbl>
## 1 Transient 107. 89613 99
## 2 Contract 87.5 4076 82.9
## 3 Transient-Party 86.1 25124 81
## 4 Group 83.5 577 75
# customer_type
ggplot(data=df,aes(x=customer_type,fill=is_canceled))+
geom_bar()
From the image above, when the customer type is Transient, the count of cancellation and not cancellation is most.
ggplot(df, aes(x=customer_type, y=adr, colour=customer_type)) +
geom_boxplot() +
scale_color_brewer(palette = "BrBG") +
theme_light() +
scale_y_continuous(limits = c(0, 600)) +
ggtitle("ADR vs. Customer Type") +
xlab("Customer Type") +
ylab("adr")
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
From the image above, most of the customer type is Transient and the average adr is also the highest.
# details in deposit_type
df %>%
group_by(deposit_type) %>%
summarize(mean_adr = mean(adr, na.rm = TRUE),
n = n(),
median_adr = median(adr, na.rm = TRUE)) %>%
arrange(desc(mean_adr))
## # A tibble: 3 × 4
## deposit_type mean_adr n median_adr
## <chr> <dbl> <int> <dbl>
## 1 No Deposit 104. 104641 95.5
## 2 Non Refund 90.0 14587 86
## 3 Refundable 76.2 162 66
# deposit_type
ggplot(data=df,aes(x=deposit_type,fill=is_canceled))+
geom_bar()
From the graph above, when the deposit type is No deposit, more people canceled the order.
ggplot(df, aes(x=deposit_type, y=adr, colour=deposit_type)) +
geom_boxplot() +
scale_color_brewer(palette = "Dark2") +
theme_light() +
scale_y_continuous(limits = c(0, 600)) +
ggtitle("ADR vs. Deposit Type") +
xlab("Deposit Type") +
ylab("adr")
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
From the figure above, most of the reservation come from “No Deposit”, and also the mean adr is the highest among others.
# reserved_room_type
ggplot(data=df,aes(x=reserved_room_type,fill=is_canceled))+
geom_bar()
From the photo above, most people book room type A.
ggplot(df, aes(x=reserved_room_type, y=adr, colour=reserved_room_type)) +
geom_boxplot() +
scale_color_brewer(palette = "BrBG") +
theme_light() +
scale_y_continuous(limits = c(0, 750)) +
ggtitle("Average daily rate vs. Room Type") +
xlab("reserved_room_type") +
ylab("adr")
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
From the picture above, room type A, B and D contain many outliers. Both room type A and B have the lowest adr.
# details in meal_type
df %>%
group_by(meal) %>%
summarize(mean_adr = mean(adr, na.rm = TRUE),
n = n(),
median_adr = median(adr, na.rm = TRUE)) %>%
arrange(desc(mean_adr))
## # A tibble: 5 × 4
## meal mean_adr n median_adr
## <chr> <dbl> <int> <dbl>
## 1 HB 120. 14463 107
## 2 FB 109. 798 84
## 3 BB 99.4 92310 91.7
## 4 SC 98.3 10650 96
## 5 Undefined 91.9 1169 79
# meal
ggplot(df,aes(x=meal,fill=is_canceled))+
geom_bar()
From the picture above, meal type BB is chosen by most people. BB – Bed & Breakfast.
ggplot(df, aes(x=meal, y=adr, colour=meal)) +
geom_boxplot() +
scale_color_brewer(palette = "BrBG") +
theme_light() +
scale_y_continuous(limits = c(0, 600)) +
ggtitle("Average daily rate vs. Meal Type") +
xlab("Meal") +
ylab("adr")
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
From the picture above, meal type BB is chosen by most people.In contrast, FB has the lowest demand. For adr, HB (Half board (breakfast and dinner)) has the highest ADR among others.
#Distribution channel
df%>%
ggplot(aes(x=distribution_channel,fill=is_canceled))+
geom_bar()
Booking distribution channel, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”. TA/TO was choosen by most people.
# detail in market segment
df %>%
group_by(market_segment) %>%
summarize(mean_adr = mean(adr, na.rm = TRUE),
n = n(),
median_adr = median(adr, na.rm = TRUE)) %>%
arrange(desc(mean_adr))
## # A tibble: 8 × 4
## market_segment mean_adr n median_adr
## <chr> <dbl> <int> <dbl>
## 1 Online TA 117. 56477 110.
## 2 Direct 115. 12606 105
## 3 Aviation 100. 237 95
## 4 Offline TA/TO 87.4 24219 85.5
## 5 Groups 79.5 19811 70
## 6 Corporate 69.4 5295 65
## 7 Undefined 15 2 15
## 8 Complementary 2.89 743 0
ggplot(df, aes(x=market_segment, y=adr, colour=market_segment)) +
geom_boxplot() +
scale_color_brewer(palette = "BrBG") +
theme_light() +
scale_y_continuous(limits = c(0, 600)) +
ggtitle("Average daily rate vs. Market segment") +
xlab("Market segment") +
ylab("adr") +
theme(legend.position = "none")
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
From the picture above, Online TA has the highest adr and amount, followed by direct channel.
The majority of reservations were kept. The cancellation rate of the original data set is 37.04%. City hotel has more people canceled order than resort hotel.
The majority of bookings were not repeated guests. The new user rate is 96.8%.
Previous bookings not cancelled, adults, and adr are the top 3 factors that linked with is repeated guest. The correlation coefficients’ absolute values are 0.41805573, 0.14642457, and 0.13432846. Lead time, booking changes, and previous cancellations are the top three variables that are connected to being canceled. The correlation coefficients’ absolute values are 0.2931174, 0.14437057, and 0.11014047.
In character variables, people preger TA/TO in distribution channel, BB in meal, transient in customer type, and room type A are.
For ADR, most of the reservation come from no deposit mode, but the average daily rate is also higher than others mode.
In market segment part, people tend to use online TA and Direct channel to book hotel, and the higher demand with the higher price (ADR).
It is noticeable that in the meal type average adr of HB (2 meals) is the highest one than FB (Full Board 3 meals included).
Overall, as the correlation coefficient map shown, the adr does not have a strong relation with hotel cancellation, which is 0.05, regarded as weak correlation between each other.