Booking cancellation is a major challenge for hotel managers. We already know by now that, Covid-19 pandemic has resulted in disruptions to travel plans. Hotel industry has been overwhelmed by large number of cancellations. Due to the growing popularity of online travel agents such as Booking.com, Expedia and Co. It is becoming more and more common to do free cancellations up to 24 hours before arrival. Although, this brings a new dimension of flexibility to hotel guests, but, it has also increased financial risk for hotels and difficulties to plan their occupancy rate. Thus, realizing the importance to analyse cancellation behavior of hotel guests we are working on hotels data to predict risk of cancellation on future bookings.
We will be using classification to predict the risk of cancellation into high and low risk. There are several methods of classifications which we can use to create a prediction model based on the different attributes that influence the cancellation. There are several classification models which we are going to test such as ‘Decision Tree’, ‘Naive Bayes’, ‘Support Vector Machine’ etc. We will select the final model based on the accuracy of the model. We are using the hotel data for building the prediction model to deep dive into booking cancellations. For fitting the data into the model and to get the desired accuracy, first we need to sort and clean the data. Then we will transform the data in a way which will increase the prediction score.
This prediction of risk will help the hotels in managing the demand for their hotel by predicting the overbooking percentage for their hotel in peak demand seasons. This will in turn help them minimizing their revenue loss from cancellations.
library(dplyr)
library(tidyverse)
library(pastecs)
library(flexdashboard)The packages used till now in this projects are tidyverse and dplyr. dplyr is part of “tidyverse” package. tidyverse is an collection of R packages designed for data science. All packages share an underlying design philosophy, grammar and data structures. The core packages are ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats, which provide functionality to model, transform, and visualize data. dplyr package functions are used to solve the vast majority of data manipulation challenges. The tidyr package functions used to solve the vast majority of data tidying challenges. In this project we have used functions such as “mutate” which creates new variables with functions of existing variables.
We have also used package pestecs to turn summary statistics into table format. Among many user-written packages, package pastecs has an easy to use function called stat.desc to display a table of descriptive statistics for a list of variables
The data is obtained from github website where many data sets are available publicly. Following is the link for it. Data source
Purpose of data is perform analytics operations on the data to understand the causes of booking cancellation, its prediction, booking trends etc. The data consist the information about the hotel bookings. We have total 1,19,390 records for the hotel bookings which are collected between the arrival date of 1st July 2015 and 31st August 2017. It consist of 32 variables like Hotel type, date and time of booking, repeated customer or not, country, number of people etc.
$hotel
[1] 0
$is_canceled
[1] 0
$lead_time
[1] 0
$arrival_date_year
[1] 0
$arrival_date_month
[1] 0
$arrival_date_week_number
[1] 0
$arrival_date_day_of_month
[1] 0
$stays_in_weekend_nights
[1] 0
$stays_in_week_nights
[1] 0
$adults
[1] 0
$children
[1] 4
$babies
[1] 0
$meal
[1] 0
$country
[1] 0
$market_segment
[1] 0
$distribution_channel
[1] 0
$is_repeated_guest
[1] 0
$previous_cancellations
[1] 0
$previous_bookings_not_canceled
[1] 0
$reserved_room_type
[1] 0
$assigned_room_type
[1] 0
$booking_changes
[1] 0
$deposit_type
[1] 0
$agent
[1] 0
$company
[1] 0
$days_in_waiting_list
[1] 0
$customer_type
[1] 0
$adr
[1] 0
$required_car_parking_spaces
[1] 0
$total_of_special_requests
[1] 0
$reservation_status
[1] 0
$reservation_status_date
[1] 0
[1] "NULL" "304" "240" "303" "15" "241" "8" "250" "115" "5"
[11] "175" "134" "156" "243" "242" "3" "105" "40" "147" "306"
[21] "184" "96" "2" "127" "95" "146" "9" "177" "6" "143"
[31] "244" "149" "167" "300" "171" "305" "67" "196" "152" "142"
[41] "261" "104" "36" "26" "29" "258" "110" "71" "181" "88"
[51] "251" "275" "69" "248" "208" "256" "314" "126" "281" "273"
[61] "253" "185" "330" "334" "328" "326" "321" "324" "313" "38"
[71] "155" "68" "335" "308" "332" "94" "348" "310" "339" "375"
[81] "66" "327" "387" "298" "91" "245" "385" "257" "393" "168"
[91] "405" "249" "315" "75" "128" "307" "11" "436" "1" "201"
[101] "183" "223" "368" "336" "291" "464" "411" "481" "10" "154"
[111] "468" "410" "390" "440" "495" "492" "493" "434" "57" "531"
[121] "420" "483" "526" "472" "429" "16" "446" "34" "78" "139"
[131] "252" "270" "47" "114" "301" "193" "182" "135" "350" "195"
[141] "352" "355" "159" "363" "384" "360" "331" "367" "64" "406"
[151] "163" "414" "333" "427" "431" "430" "426" "438" "433" "418"
[161] "441" "282" "432" "72" "450" "180" "454" "455" "59" "451"
[171] "254" "358" "469" "165" "467" "510" "337" "476" "502" "527"
[181] "479" "508" "535" "302" "497" "187" "13" "7" "27" "14"
[191] "22" "17" "28" "42" "20" "19" "45" "37" "61" "39"
[201] "21" "24" "41" "50" "30" "54" "52" "12" "44" "31"
[211] "83" "32" "63" "60" "55" "56" "89" "87" "118" "86"
[221] "85" "210" "214" "129" "179" "138" "174" "170" "153" "93"
[231] "151" "119" "35" "173" "58" "53" "133" "79" "235" "192"
[241] "191" "236" "162" "215" "157" "287" "132" "234" "98" "77"
[251] "103" "107" "262" "220" "121" "205" "378" "23" "296" "290"
[261] "229" "33" "286" "276" "425" "484" "323" "403" "219" "394"
[271] "509" "111" "423" "4" "70" "82" "81" "74" "92" "99"
[281] "90" "112" "117" "106" "148" "158" "144" "211" "213" "216"
[291] "232" "150" "267" "227" "247" "278" "280" "285" "289" "269"
[301] "295" "265" "288" "122" "294" "325" "341" "344" "346" "359"
[311] "283" "364" "370" "371" "25" "141" "391" "397" "416" "404"
[321] "299" "197" "73" "354" "444" "408" "461" "388" "453" "459"
[331] "474" "475" "480" "449"
[1] "NULL" "110" "113" "270" "178" "240" "154" "144" "307" "268"
[11] "59" "204" "312" "318" "94" "174" "274" "195" "223" "317"
[21] "281" "118" "53" "286" "12" "47" "324" "342" "373" "371"
[31] "383" "86" "82" "218" "88" "31" "397" "392" "405" "331"
[41] "367" "20" "83" "416" "51" "395" "102" "34" "84" "360"
[51] "394" "457" "382" "461" "478" "386" "112" "486" "421" "9"
[61] "308" "135" "224" "504" "269" "356" "498" "390" "513" "203"
[71] "263" "477" "521" "169" "515" "445" "337" "251" "428" "292"
[81] "388" "130" "250" "355" "254" "543" "531" "528" "62" "120"
[91] "42" "81" "116" "530" "103" "39" "16" "92" "61" "501"
[101] "165" "291" "290" "43" "325" "192" "108" "200" "465" "287"
[111] "297" "490" "482" "207" "282" "437" "225" "329" "272" "28"
[121] "77" "338" "72" "246" "319" "146" "159" "380" "323" "511"
[131] "407" "278" "80" "403" "399" "14" "137" "343" "346" "347"
[141] "349" "289" "351" "353" "54" "99" "358" "361" "362" "366"
[151] "372" "365" "277" "109" "377" "379" "22" "378" "330" "364"
[161] "401" "232" "255" "384" "167" "212" "514" "391" "400" "376"
[171] "402" "396" "302" "398" "6" "370" "369" "409" "168" "104"
[181] "408" "413" "148" "10" "333" "419" "415" "424" "425" "423"
[191] "422" "435" "439" "442" "448" "443" "454" "444" "52" "459"
[201] "458" "456" "460" "447" "470" "466" "484" "184" "485" "32"
[211] "487" "491" "494" "193" "516" "496" "499" "29" "78" "520"
[221] "507" "506" "512" "126" "64" "242" "518" "523" "539" "534"
[231] "436" "525" "541" "40" "455" "410" "45" "38" "49" "48"
[241] "67" "68" "65" "91" "37" "8" "179" "209" "219" "221"
[251] "227" "153" "186" "253" "202" "216" "275" "233" "280" "309"
[261] "321" "93" "316" "85" "107" "350" "279" "334" "348" "150"
[271] "73" "385" "418" "197" "450" "452" "115" "46" "76" "96"
[281] "100" "105" "101" "122" "11" "139" "142" "127" "143" "140"
[291] "149" "163" "160" "180" "238" "183" "222" "185" "217" "215"
[301] "213" "237" "230" "234" "35" "245" "158" "258" "259" "260"
[311] "411" "257" "271" "18" "106" "210" "273" "71" "284" "301"
[321] "305" "293" "264" "311" "304" "313" "288" "320" "314" "332"
[331] "341" "352" "243" "368" "393" "132" "220" "412" "420" "426"
[341] "417" "429" "433" "446" "357" "479" "483" "489" "229" "481"
[351] "497" "451" "492"
[1] "PRT" "GBR" "USA" "ESP" "IRL" "FRA" "NULL" "ROU" "NOR" "OMN"
[11] "ARG" "POL" "DEU" "BEL" "CHE" "CN" "GRC" "ITA" "NLD" "DNK"
[21] "RUS" "SWE" "AUS" "EST" "CZE" "BRA" "FIN" "MOZ" "BWA" "LUX"
[31] "SVN" "ALB" "IND" "CHN" "MEX" "MAR" "UKR" "SMR" "LVA" "PRI"
[41] "SRB" "CHL" "AUT" "BLR" "LTU" "TUR" "ZAF" "AGO" "ISR" "CYM"
[51] "ZMB" "CPV" "ZWE" "DZA" "KOR" "CRI" "HUN" "ARE" "TUN" "JAM"
[61] "HRV" "HKG" "IRN" "GEO" "AND" "GIB" "URY" "JEY" "CAF" "CYP"
[71] "COL" "GGY" "KWT" "NGA" "MDV" "VEN" "SVK" "FJI" "KAZ" "PAK"
[81] "IDN" "LBN" "PHL" "SEN" "SYC" "AZE" "BHR" "NZL" "THA" "DOM"
[91] "MKD" "MYS" "ARM" "JPN" "LKA" "CUB" "CMR" "BIH" "MUS" "COM"
[101] "SUR" "UGA" "BGR" "CIV" "JOR" "SYR" "SGP" "BDI" "SAU" "VNM"
[111] "PLW" "QAT" "EGY" "PER" "MLT" "MWI" "ECU" "MDG" "ISL" "UZB"
[121] "NPL" "BHS" "MAC" "TGO" "TWN" "DJI" "STP" "KNA" "ETH" "IRQ"
[131] "HND" "RWA" "KHM" "MCO" "BGD" "IMN" "TJK" "NIC" "BEN" "VGB"
[141] "TZA" "GAB" "GHA" "TMP" "GLP" "KEN" "LIE" "GNB" "MNE" "UMI"
[151] "MYT" "FRO" "MMR" "PAN" "BFA" "LBY" "MLI" "NAM" "BOL" "PRY"
[161] "BRB" "ABW" "AIA" "SLV" "DMA" "PYF" "GUY" "LCA" "ATA" "GTM"
[171] "ASM" "MRT" "NCL" "KIR" "SDN" "ATF" "SLE" "LAO"
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0000 0.0000 0.0000 0.1039 0.0000 10.0000 4
# A tibble: 6 x 30
hotel is_canceled lead_time arrival_date_we~ stays_in_weeken~ stays_in_week_n~
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 342 27 0 0
2 0 0 737 27 0 0
3 0 0 7 27 0 1
4 0 0 13 27 0 1
5 0 0 14 27 0 2
6 0 0 14 27 0 2
# ... with 24 more variables: adults <dbl>, children <dbl>, babies <dbl>,
# meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, ...
hotel is_canceled lead_time arrival_date_week_number
nbr.val NA 1.193900e+05 1.193900e+05 1.193900e+05
nbr.null NA 7.516600e+04 6.345000e+03 0.000000e+00
nbr.na NA 0.000000e+00 0.000000e+00 0.000000e+00
min NA 0.000000e+00 0.000000e+00 1.000000e+00
max NA 1.000000e+00 7.370000e+02 5.300000e+01
range NA 1.000000e+00 7.370000e+02 5.200000e+01
sum NA 4.422400e+04 1.241792e+07 3.243250e+06
median NA 0.000000e+00 6.900000e+01 2.800000e+01
mean NA 3.704163e-01 1.040114e+02 2.716517e+01
SE.mean NA 1.397622e-03 3.092743e-01 3.937486e-02
CI.mean NA 2.739316e-03 6.061726e-01 7.717408e-02
var NA 2.332100e-01 1.141972e+04 1.850998e+02
std.dev NA 4.829182e-01 1.068631e+02 1.360514e+01
coef.var NA 1.303718e+00 1.027417e+00 5.008302e-01
stays_in_weekend_nights stays_in_week_nights adults children
nbr.val 1.193900e+05 1.193900e+05 1.193900e+05 1.193860e+05
nbr.null 5.199800e+04 7.645000e+03 4.030000e+02 1.107960e+05
nbr.na 0.000000e+00 0.000000e+00 0.000000e+00 4.000000e+00
min 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
max 1.900000e+01 5.000000e+01 5.500000e+01 1.000000e+01
range 1.900000e+01 5.000000e+01 5.500000e+01 1.000000e+01
sum 1.107460e+05 2.985110e+05 2.216360e+05 1.240300e+04
median 1.000000e+00 2.000000e+00 2.000000e+00 0.000000e+00
mean 9.275986e-01 2.500302e+00 1.856403e+00 1.038899e-01
SE.mean 2.890104e-03 5.522801e-03 1.676449e-03 1.153503e-03
CI.mean 5.664557e-03 1.082460e-02 3.285813e-03 2.260847e-03
var 9.972289e-01 3.641554e+00 3.355433e-01 1.588512e-01
std.dev 9.986135e-01 1.908286e+00 5.792610e-01 3.985614e-01
coef.var 1.076558e+00 7.632222e-01 3.120340e-01 3.836383e+00
babies meal country market_segment distribution_channel
nbr.val 1.193900e+05 NA NA NA NA
nbr.null 1.184730e+05 NA NA NA NA
nbr.na 0.000000e+00 NA NA NA NA
min 0.000000e+00 NA NA NA NA
max 1.000000e+01 NA NA NA NA
range 1.000000e+01 NA NA NA NA
sum 9.490000e+02 NA NA NA NA
median 0.000000e+00 NA NA NA NA
mean 7.948739e-03 NA NA NA NA
SE.mean 2.819917e-04 NA NA NA NA
CI.mean 5.526992e-04 NA NA NA NA
var 9.493811e-03 NA NA NA NA
std.dev 9.743619e-02 NA NA NA NA
coef.var 1.225807e+01 NA NA NA NA
is_repeated_guest previous_cancellations
nbr.val 1.193900e+05 1.193900e+05
nbr.null 1.155800e+05 1.129060e+05
nbr.na 0.000000e+00 0.000000e+00
min 0.000000e+00 0.000000e+00
max 1.000000e+00 2.600000e+01
range 1.000000e+00 2.600000e+01
sum 3.810000e+03 1.040100e+04
median 0.000000e+00 0.000000e+00
mean 3.191222e-02 8.711785e-02
SE.mean 5.086906e-04 2.443608e-03
CI.mean 9.970254e-04 4.789432e-03
var 3.089409e-02 7.129039e-01
std.dev 1.757671e-01 8.443364e-01
coef.var 5.507832e+00 9.691887e+00
previous_bookings_not_canceled reserved_room_type assigned_room_type
nbr.val 1.193900e+05 NA NA
nbr.null 1.157700e+05 NA NA
nbr.na 0.000000e+00 NA NA
min 0.000000e+00 NA NA
max 7.200000e+01 NA NA
range 7.200000e+01 NA NA
sum 1.636800e+04 NA NA
median 0.000000e+00 NA NA
mean 1.370969e-01 NA NA
SE.mean 4.333757e-03 NA NA
CI.mean 8.494093e-03 NA NA
var 2.242317e+00 NA NA
std.dev 1.497437e+00 NA NA
coef.var 1.092247e+01 NA NA
booking_changes deposit_type agent company days_in_waiting_list
nbr.val 1.193900e+05 NA NA NA 1.193900e+05
nbr.null 1.013140e+05 NA NA NA 1.156920e+05
nbr.na 0.000000e+00 NA NA NA 0.000000e+00
min 0.000000e+00 NA NA NA 0.000000e+00
max 2.100000e+01 NA NA NA 3.910000e+02
range 2.100000e+01 NA NA NA 3.910000e+02
sum 2.640000e+04 NA NA NA 2.771220e+05
median 0.000000e+00 NA NA NA 0.000000e+00
mean 2.211240e-01 NA NA NA 2.321149e+00
SE.mean 1.887848e-03 NA NA NA 5.092117e-02
CI.mean 3.700152e-03 NA NA NA 9.980468e-02
var 4.255026e-01 NA NA NA 3.095742e+02
std.dev 6.523056e-01 NA NA NA 1.759472e+01
coef.var 2.949953e+00 NA NA NA 7.580177e+00
customer_type adr required_car_parking_spaces
nbr.val NA 1.193900e+05 1.193900e+05
nbr.null NA 1.959000e+03 1.119740e+05
nbr.na NA 0.000000e+00 0.000000e+00
min NA -6.380000e+00 0.000000e+00
max NA 5.400000e+03 8.000000e+00
range NA 5.406380e+03 8.000000e+00
sum NA 1.215762e+07 7.464000e+03
median NA 9.457500e+01 0.000000e+00
mean NA 1.018311e+02 6.251780e-02
SE.mean NA 1.462565e-01 7.099012e-04
CI.mean NA 2.866603e-01 1.391395e-03
var NA 2.553866e+03 6.016775e-02
std.dev NA 5.053579e+01 2.452911e-01
coef.var NA 4.962706e-01 3.923541e+00
total_of_special_requests reservation_status reservation_status_date
nbr.val 1.193900e+05 NA 1.193900e+05
nbr.null 7.031800e+04 NA 0.000000e+00
nbr.na 0.000000e+00 NA 0.000000e+00
min 0.000000e+00 NA 1.636000e+04
max 5.000000e+00 NA 1.742300e+04
range 5.000000e+00 NA 1.063000e+03
sum 6.821500e+04 NA 2.031065e+09
median 0.000000e+00 NA 1.702000e+04
mean 5.713628e-01 NA 1.701202e+04
SE.mean 2.294451e-03 NA 6.634426e-01
CI.mean 4.497087e-03 NA 1.300337e+00
var 6.285293e-01 NA 5.255024e+04
std.dev 7.927984e-01 NA 2.292384e+02
coef.var 1.387557e+00 NA 1.347509e-02
date
nbr.val 1.193900e+05
nbr.null 0.000000e+00
nbr.na 0.000000e+00
min 1.661700e+04
max 1.740900e+04
range 7.920000e+02
sum 2.034608e+09
median 1.705000e+04
mean 1.704169e+04
SE.mean 6.472825e-01
CI.mean 1.268663e+00
var 5.002138e+04
std.dev 2.236546e+02
coef.var 1.312397e-02
*For data, we first check how many null values are present in each column. This will give us idea if we have to work on any column.
*After checking,we find that there is no such discrepancy seen in the data, the only things observed were, there are 4 Null values in children columns, which we can take care of later.
*Now we will check all the unique values of each variable, which will give us lot of information.
*Things we noticed that there is “Null” entry as a character in country, agent and company. In most cases of hotels we are unaware about these things. While booking. People try to give as minimum info as possible because they feel it as unnecessary info just for booking purpose. Thus, we will keep them as it is and consider it as a value.
*We also observe that there are three separate column with date which are months,year and date. We will combine that into one and delete the rest.
*We can see from unique values that there are only two types of hotel. This hotel type can be useful while modeling the data. So we will replace that types into numerical categories which will helps us going ahead for modeling.
*Now we will check what can we do with that null values. We go though that column and observe that mean and median are similar. Thus we will replace that column with median which is 0 here.
The data is obtained from github website where many data sets are available publicly. Following is the link for it. Data source
Purpose of data is perform analytics operations on the data to understand the causes of booking cancellation, its prediction, booking trends etc. The data consist the information about the hotel bookings. We have total 1,19,390 records for the hotel bookings which are collected between the arrival date of 1st July 2015 and 31st August 2017. It consist of 32 variables like Hotel type, date and time of booking, repeated customer or not, country, number of people etc.
# A tibble: 2 x 3
lead_time_updated num_bookings perc_bookings
<chr> <int> <dbl>
1 <10 22852 0.191
2 >10 96538 0.809
As we can see from the chart and the table, 19% of the bookings were made less than 10 days before the arrival day. Hence, any cancellation during these 10 days is a loss for the hotel as they wont be able to get replacement bookings for these cancellations. Below we will look into number of cancellations by # of days before arrival date.
# A tibble: 2 x 3
D_cancellation num_cancel perc_cancel
<chr> <int> <dbl>
1 <10 6601 0.153
2 >10 36416 0.847
Out of all bookings, 15% bookings were cancelled in 10 days leading up-to the arrival day of the guest. A breakup of number of bookings is given below
# A tibble: 391 x 2
D_cancellation num_cancel
<drtn> <int>
1 0 days 881
2 1 days 630
3 2 days 487
4 3 days 718
5 4 days 581
6 5 days 603
7 6 days 636
8 7 days 599
9 8 days 496
10 9 days 462
# ... with 381 more rows
Total amount lost by hotels because of these bookings
# A tibble: 2 x 2
D_day amount_lost
<chr> <dbl>
1 <=10 days 2065942.
2 >10 days 14211814.
# A tibble: 2 x 4
hotel num_booking num_cancel perc_can
<chr> <int> <dbl> <dbl>
1 City Hotel 79330 33102 41.7
2 Resort Hotel 40060 11122 27.8
We have lost ~ 2 Million USD from the bookings that are getting cancelled less than 10 days before the arrival day.
# A tibble: 2 x 2
hotel num_can
<chr> <int>
1 City Hotel 33102
2 Resort Hotel 11122
we can clearly see that there are 3 times more cancellation in city hotel as compared to resort hotels. This can be because of locations of some of the city hotels as traffic plays important role for business class customers. Moreover, resort hotels have bookings of customers who are there for a dedicated vacation.
# A tibble: 3 x 2
arrival_date_year cancel_in_year
<dbl> <int>
1 2015 8142
2 2016 20337
3 2017 15745
There’s a skyrocket increase of ~50% cancellations from year 2015 to 2016 with slight decrease in 2017.
It is observed that bookings of 2 and 3 days of room nights has maximum cancellations.
# A tibble: 8 x 2
market_segment ms
<chr> <int>
1 Aviation 41
2 Complementary 85
3 Corporate 916
4 Direct 1722
5 Groups 12023
6 Offline TA/TO 8080
7 Online TA 20148
8 Undefined 2
# A tibble: 5 x 2
distribution_channel DC
<chr> <int>
1 Corporate 1381
2 Direct 2330
3 GDS 34
4 TA/TO 39268
5 Undefined 4
Market segment and distribution channel referrers to the booking done by travel agents or website involvement in there booking. As observed in distribution channel graph , more than 2000 cancellations are from the bookings completed through online travel agents i.e. websites like trivago, airbnb, booking.com etc. As these sites provide ease in refunds and smooth cancellations.
# A tibble: 1 x 1
num_cancel
<int>
1 14917
# A tibble: 1 x 1
num_cancel
<int>
1 594
# A tibble: 1 x 1
same_room_cancel
<int>
1 104473
# A tibble: 1 x 1
same_room_cancel
<int>
1 42423
Looking at the results we found that there is no impact of assigning same room as reserved room on cancellations as only 594 bookings were cancelled out of 14917 i.e. is just 3.9%. On the other hand, 40.6% of the bookings are cancelled even when desired rooms are assigned to customers.
1.Summarizing the problem statement: Analyzed cancellation behavior of hotel guests to predict risk of cancellation on future bookings. This analysis includes factors such as cancellation trend according to hotel type, trend by year, based on number of room nights booked and cancellation trend according to different channels of bookings.
2.Summarizing the implementation: The data was scrapped and manipulated accordingly for the analysis. The data was then reviewed cancelation percentage within 10 days leading up to arrival day and the total monetary impact of it. Then data was reviewed for finding potential factors that can be considered for the cancellation. This was done using summary statistics,area charts and bar charts.
3.Summary/Insights: Various results and analysis showed that there is a severe impact on hotels because of cancellation of bookings specially when the booking is cancelled within 10 days leading up to arrival day. This cancellation trend is based on number of factors. The insights that we got from the analysis: * There are 3 times more cancellation in city hotel as compared to resort hotels. * Bookings of short span such as bookings of 2 and 3 days of room nights has maximum cancellations.
* In market segments and distribution channels, there is higher percentage of cancellation of bookings that are completed through online travel agents.
* There is significant number of bookings that are canceled even when desired rooms were assigned to customers which shows this variable does not play any role in prediction.
* On average, hotel lost nearly 2 million USD from bookings that are getting cancelled less than 10 days before the arrival day.
---
title: "CANCELLATION Trends in Hotel Bookings"
output:
flexdashboard::flex_dashboard:
social: menu
source: embed
vertical_layout: fill
orientation: columns
theme: cerulean
---
```{r setup, include=FALSE,echo=FALSE}
library(flexdashboard)
```
# INTRODUCTION
### FEW THINGS TO READ{.tabset}
Booking cancellation is a major challenge for hotel managers. We already know by now that, Covid-19 pandemic has resulted in disruptions to travel plans. Hotel industry has been overwhelmed by large number of cancellations. Due to the growing popularity of online travel agents such as Booking.com, Expedia and Co. It is becoming more and more common to do free cancellations up to 24 hours before arrival. Although, this brings a new dimension of flexibility to hotel guests, but, it has also increased financial risk for hotels and difficulties to plan their occupancy rate. **Thus, realizing the importance to analyse cancellation behavior of hotel guests we are working on hotels data to predict risk of cancellation on future bookings.**
We will be using classification to predict the risk of cancellation into high and low risk. There are several methods of classifications which we can use to create a prediction model based on the different attributes that influence the cancellation. There are several classification models which we are going to test such as 'Decision Tree', 'Naive Bayes', 'Support Vector Machine' etc. We will select the final model based on the accuracy of the model. We are using the hotel data for building the prediction model to deep dive into booking cancellations. For fitting the data into the model and to get the desired accuracy, first we need to sort and clean the data. Then we will transform the data in a way which will increase the prediction score.
This prediction of risk will help the hotels in managing the demand for their hotel by predicting the overbooking percentage for their hotel in peak demand seasons. This will in turn help them minimizing their revenue loss from cancellations.
#### .
```{r out.width='65%',out.height='20%',fig.align='center'}
knitr::include_graphics("C:/Storage/Study/Masters/Data Wrangling/MicrosoftTeams-image.png")
```
LIBRARIES USED
=======================================================================
```{r message=FALSE}
library(dplyr)
library(tidyverse)
library(pastecs)
library(flexdashboard)
url <- 'https://github.com/mpst989/Datasets/raw/main/hotels.csv'
hotel <- readr::read_csv(url)
```
```{r include=FALSE}
knitr::opts_chunk$set(echo = TRUE,warning = FALSE, message = FALSE)
```
```{r message=FALSE}
library(dplyr)
library(tidyverse)
library(pastecs)
library(flexdashboard)
```
dplyr
flexdashboard
pastecs
tidyverse
The packages used till now in this projects are tidyverse and dplyr. dplyr is part of “tidyverse” package. tidyverse is an collection of R packages designed for data science. All packages share an underlying design philosophy, grammar and data structures. The core packages are ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats, which provide functionality to model, transform, and visualize data. dplyr package functions are used to solve the vast majority of data manipulation challenges. The tidyr package functions used to solve the vast majority of data tidying challenges. In this project we have used functions such as “mutate” which creates new variables with functions of existing variables.
We have also used package pestecs to turn summary statistics into table format. Among many user-written packages, package pastecs has an easy to use function called stat.desc to display a table of descriptive statistics for a list of variables
DATA PREPARARION
=======================================================================
## About Data{.tabset}
The data is obtained from github website where many data sets are available publicly. Following is the link for it. [Data source](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md)
Purpose of data is perform analytics operations on the data to understand the causes of booking cancellation, its prediction, booking trends etc. The data consist the information about the hotel bookings. We have total 1,19,390 records for the hotel bookings which are collected between the arrival date of 1st July 2015 and 31st August 2017. It consist of 32 variables like Hotel type, date and time of booking, repeated customer or not, country, number of people etc.
### NA and Unique Values
```{r, echo=FALSE}
lapply(hotel,function(x) { length(which(is.na(x)))}) # NA values in all columns
#Null values display
unique(hotel$agent)
unique(hotel$company)
unique(hotel$country)
```
### Few Null Values Replaced
```{r, echo=FALSE}
#replace null by mean/median
summary(hotel$children)
hotel_1 <- hotel %>%
mutate(date = glue::glue("{arrival_date_year}-{arrival_date_month}-{arrival_date_day_of_month}"),
date = parse_date(date, format = "%Y-%B-%d"))
hotel_2 = subset(hotel_1, select = -c(arrival_date_year,arrival_date_month,arrival_date_day_of_month) )
#converting two resort types into 0 and 1.
hotel_2$hotel[hotel_2$hotel %in% c('Resort Hotel')] <- 0
hotel_2$hotel[(hotel_2$hotel == 'City Hotel')] <- 1
#replacing the null value by 0
hotel_2$children[(hotel_2$children == 'NA')] <- 0
head(hotel_2)
```
### Summary Statistics
```{r, echo=FALSE}
# Data statistics in tabular format
stat.desc(hotel_2)
```
### Thought Process
*For data, we first check how many null values are present in each column. This will give us idea if we have to work on any column.
*After checking,we find that there is no such discrepancy seen in the data, the only things observed were, there are 4 Null values in children columns, which we can take care of later.
*Now we will check all the unique values of each variable, which will give us lot of information.
*Things we noticed that there is "Null" entry as a character in country, agent and company. In most cases of hotels we are unaware about these things. While booking. People try to give as minimum info as possible because they feel it as unnecessary info just for booking purpose. Thus, we will keep them as it is and consider it as a value.
*We also observe that there are three separate column with date which are months,year and date. We will combine that into one and delete the rest.
*We can see from unique values that there are only two types of hotel. This hotel type can be useful while modeling the data. So we will replace that types into numerical categories which will helps us going ahead for modeling.
*Now we will check what can we do with that null values. We go though that column and observe that mean and median are similar. Thus we will replace that column with median which is 0 here.
EXPLORATORY DATA ANALYSIS
=======================================================================
## WRITE POINTS OF EDA AND E XPLAIN{.tabset}
The data is obtained from github website where many data sets are available publicly. Following is the link for it. [Data source](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md)
Purpose of data is perform analytics operations on the data to understand the causes of booking cancellation, its prediction, booking trends etc. The data consist the information about the hotel bookings. We have total 1,19,390 records for the hotel bookings which are collected between the arrival date of 1st July 2015 and 31st August 2017. It consist of 32 variables like Hotel type, date and time of booking, repeated customer or not, country, number of people etc.
### Lead Time
```{r, echo=FALSE}
hotel_updated <- hotel %>%
mutate(arrival_date = glue::glue("{arrival_date_year}-{arrival_date_month}-{arrival_date_day_of_month}"),arrival_date = parse_date(arrival_date, format = "%Y-%B-%d"))
hotel1 <- hotel_updated %>%
subset(select = -c(arrival_date_year,arrival_date_month,arrival_date_day_of_month) ) %>%
filter(reservation_status=="Canceled") %>%
mutate(D_cancellation= arrival_date- reservation_status_date) %>%
mutate(D_day = ifelse(D_cancellation<=10,"<=10 days",">10 days"))
```
```{r, echo=FALSE}
library(ggplot2)
p <- ggplot(hotel_updated, aes(x=hotel_updated$lead_time)) + geom_area(stat="bin",color="cornflowerblue",fill="cornflowerblue")
p + labs(x="Number of Days", y="Count", title="Hotel Lead Time") + theme_bw()
```
```{r, echo=FALSE}
hotel_updated %>% mutate(lead_time_updated = ifelse(lead_time<=10,"<10",">10")) %>%
group_by(lead_time_updated) %>%
summarise(num_bookings= n()) %>% mutate(perc_bookings=num_bookings/sum(num_bookings))
```
As we can see from the chart and the table, **19%** of the bookings were made less than 10 days before the arrival day. Hence, any cancellation during these **10 days** is a loss for the hotel as they wont be able to get replacement bookings for these cancellations. Below we will look into number of cancellations by # of days before arrival date.
### Cancellations Before D Days
```{r, echo=FALSE}
p <- ggplot(hotel1, aes(x=hotel1$D_cancellation)) + geom_area(stat="bin",color="cornflowerblue",fill="cornflowerblue")
p + labs(x="Number of Days", y="Count", title="Cancellation Before Arrival Day") + theme_bw()
```
```{r, echo=FALSE}
hotel1 %>% mutate(D_cancellation = ifelse(D_cancellation<=10,"<10",">10")) %>%
group_by(D_cancellation) %>%
summarise(num_cancel= n()) %>% mutate(perc_cancel=num_cancel/sum(num_cancel))
```
Out of all bookings, **15%** bookings were cancelled in 10 days leading up-to the arrival day of the guest.
A breakup of number of bookings is given below
```{r, echo=FALSE}
hotel2 <- hotel_updated %>%
subset(select = -c(arrival_date_year,arrival_date_month,arrival_date_day_of_month) ) %>%
filter(reservation_status=="Canceled") %>%
mutate(D_cancellation= arrival_date- reservation_status_date) %>%
group_by(D_cancellation) %>%
summarise(num_cancel= n())
hotel2
```
**Total amount lost by hotels because of these bookings**
```{r, echo=FALSE}
hotel1 %>%
mutate(total_amount = (stays_in_weekend_nights+stays_in_week_nights)* adr) %>%
group_by(D_day) %>%
summarise(amount_lost=sum(total_amount))
hotel_updated %>% group_by(hotel) %>% summarise(num_booking=n(), num_cancel=sum(is_canceled)) %>% mutate(perc_can= num_cancel*100/num_booking)
```
We have lost **~ 2 Million** USD from the bookings that are getting cancelled less than **10** days before the arrival day.
### Cancellation By Hotel Type
#### .
```{r, echo=FALSE}
p <- ggplot(hotel_updated, aes(x= hotel, fill=hotel)) + geom_bar()
p + labs(x="Hotel Type", y="Count", title="Cancellation Trend According To Hotel Type") + theme_bw()
```
#### .
```{r, echo=FALSE}
hotel_updated %>%
filter(is_canceled==1) %>%
group_by(hotel) %>%
summarise(num_can = n())
```
we can clearly see that there are **3 times** more cancellation in city hotel as compared to resort hotels. This can be because of locations of some of the city hotels as traffic plays important role for business class customers. Moreover, resort hotels have bookings of customers who are there for a dedicated vacation.
### Cancellation Trends By Year 2015-2017
#### .
```{r, echo=FALSE}
# trends by year
hotel3 <- hotel_updated %>%
filter(reservation_status=="Canceled")
p <-ggplot(hotel3, aes(x = arrival_date_year)) + geom_bar(fill= "cornflowerblue")
p + labs(x="Year", y="Cancellations", title="Yearwise Cancellation Trend") + theme_bw()
```
#### .
```{r, echo=FALSE}
# trends by year
hotel_updated %>%
filter(is_canceled==1) %>%
group_by(arrival_date_year) %>%
summarise(cancel_in_year = n())
```
There's a skyrocket increase of **~50%** cancellations from year 2015 to 2016 with slight decrease in 2017.
### Cancellation On Basis of no. of Room Nights Booked
#### .
```{r, echo=FALSE}
hotel1 %>%
mutate(roomnights= stays_in_week_nights+ stays_in_weekend_nights)%>%
ggplot(aes(x= roomnights)) + geom_bar(color="black", fill="cornflowerblue") + theme_bw() + labs(x="Number of Room Nights", y="Count", title="Cancellation Trend According To Number of Room Nights Booked")
```
It is observed that bookings of **2 and 3 days** of room nights has maximum cancellations.
### Cancalletion Trend Due to Different Channels of Booking
#### .
```{r, echo=FALSE}
hotel1 %>%
group_by(market_segment) %>%
summarise(ms = n())
ggplot(hotel1, aes(x= market_segment))+geom_bar(color="black", fill="cornflowerblue") + theme_bw() + labs(x="Channels of Bookings", y="Count", title="Cancellation Trend According To Channels of Bookings")
```
```{r, echo=FALSE}
hotel1 %>%
group_by(distribution_channel) %>%
summarise(DC = n())
ggplot(hotel1, aes(x= distribution_channel))+geom_bar(color="black", fill="cornflowerblue") + theme_bw() + labs(x="Distribution Channels", y="Count", title="Cancellation Trend According To Distribution Channels")
```
Market segment and distribution channel referrers to the booking done by travel agents or website involvement in there booking. As observed in distribution channel graph , more than 2000 cancellations are from the bookings completed through online travel agents i.e. websites like trivago, airbnb, booking.com etc. As these sites provide ease in refunds and smooth cancellations.
### Cancellation Due to Not Assigning Reserved Room Type
```{r, echo=FALSE}
hotel_updated %>%
filter(reserved_room_type != assigned_room_type) %>%
summarise(num_cancel =n())
hotel1 %>%
filter(reserved_room_type != assigned_room_type) %>%
summarise(num_cancel =n())
hotel_updated%>%
filter(reserved_room_type== assigned_room_type) %>%
summarise(same_room_cancel =n())
hotel1 %>%
filter(reserved_room_type== assigned_room_type) %>%
summarise(same_room_cancel =n())
```
Looking at the results we found that there is no impact of assigning same room as reserved room on cancellations as only **594** bookings were cancelled out of **14917** i.e. is just **3.9%**.
On the other hand, **40.6%** of the bookings are cancelled even when desired rooms are assigned to customers.
SUMMARY
================================================================
## Summary
1.Summarizing the problem statement: Analyzed cancellation behavior of hotel guests to predict risk of cancellation on future bookings. This analysis includes factors such as cancellation trend according to hotel type, trend by year, based on number of room nights booked and cancellation trend according to different channels of bookings.
2.Summarizing the implementation: The data was scrapped and manipulated accordingly for the analysis. The data was then reviewed cancelation percentage within 10 days leading up to arrival day and the total monetary impact of it. Then data was reviewed for finding potential factors that can be considered for the cancellation. This was done using summary statistics,area charts and bar charts.
3.Summary/Insights: Various results and analysis showed that there is a severe impact on hotels because of cancellation of bookings specially when the booking is cancelled within 10 days leading up to arrival day. This cancellation trend is based on number of factors. The insights that we got from the analysis:
* There are 3 times more cancellation in city hotel as compared to resort hotels.
* Bookings of short span such as bookings of 2 and 3 days of room nights has maximum cancellations.
* In market segments and distribution channels, there is higher percentage of cancellation of bookings that are completed through online travel agents.
* There is significant number of bookings that are canceled even when desired rooms were assigned to customers which shows this variable does not play any role in prediction.
* On average, hotel lost nearly 2 million USD from bookings that are getting cancelled less than 10 days before the arrival day.