INTRODUCTION

FEW THINGS TO READ



Booking cancellation is a major challenge for hotel managers. We already know by now that, Covid-19 pandemic has resulted in disruptions to travel plans. Hotel industry has been overwhelmed by large number of cancellations. Due to the growing popularity of online travel agents such as Booking.com, Expedia and Co. It is becoming more and more common to do free cancellations up to 24 hours before arrival. Although, this brings a new dimension of flexibility to hotel guests, but, it has also increased financial risk for hotels and difficulties to plan their occupancy rate. Thus, realizing the importance to analyse cancellation behavior of hotel guests we are working on hotels data to predict risk of cancellation on future bookings.

We will be using classification to predict the risk of cancellation into high and low risk. There are several methods of classifications which we can use to create a prediction model based on the different attributes that influence the cancellation. There are several classification models which we are going to test such as ‘Decision Tree’, ‘Naive Bayes’, ‘Support Vector Machine’ etc. We will select the final model based on the accuracy of the model. We are using the hotel data for building the prediction model to deep dive into booking cancellations. For fitting the data into the model and to get the desired accuracy, first we need to sort and clean the data. Then we will transform the data in a way which will increase the prediction score.

This prediction of risk will help the hotels in managing the demand for their hotel by predicting the overbooking percentage for their hotel in peak demand seasons. This will in turn help them minimizing their revenue loss from cancellations.

.

LIBRARIES USED



library(dplyr)
library(tidyverse)
library(pastecs)
library(flexdashboard)

dplyr

flexdashboard

pastecs

tidyverse

The packages used till now in this projects are tidyverse and dplyr. dplyr is part of “tidyverse” package. tidyverse is an collection of R packages designed for data science. All packages share an underlying design philosophy, grammar and data structures. The core packages are ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats, which provide functionality to model, transform, and visualize data. dplyr package functions are used to solve the vast majority of data manipulation challenges. The tidyr package functions used to solve the vast majority of data tidying challenges. In this project we have used functions such as “mutate” which creates new variables with functions of existing variables.

We have also used package pestecs to turn summary statistics into table format. Among many user-written packages, package pastecs has an easy to use function called stat.desc to display a table of descriptive statistics for a list of variables

DATA PREPARARION

About Data




The data is obtained from github website where many data sets are available publicly. Following is the link for it. Data source

Purpose of data is perform analytics operations on the data to understand the causes of booking cancellation, its prediction, booking trends etc. The data consist the information about the hotel bookings. We have total 1,19,390 records for the hotel bookings which are collected between the arrival date of 1st July 2015 and 31st August 2017. It consist of 32 variables like Hotel type, date and time of booking, repeated customer or not, country, number of people etc.


NA and Unique Values

$hotel
[1] 0

$is_canceled
[1] 0

$lead_time
[1] 0

$arrival_date_year
[1] 0

$arrival_date_month
[1] 0

$arrival_date_week_number
[1] 0

$arrival_date_day_of_month
[1] 0

$stays_in_weekend_nights
[1] 0

$stays_in_week_nights
[1] 0

$adults
[1] 0

$children
[1] 4

$babies
[1] 0

$meal
[1] 0

$country
[1] 0

$market_segment
[1] 0

$distribution_channel
[1] 0

$is_repeated_guest
[1] 0

$previous_cancellations
[1] 0

$previous_bookings_not_canceled
[1] 0

$reserved_room_type
[1] 0

$assigned_room_type
[1] 0

$booking_changes
[1] 0

$deposit_type
[1] 0

$agent
[1] 0

$company
[1] 0

$days_in_waiting_list
[1] 0

$customer_type
[1] 0

$adr
[1] 0

$required_car_parking_spaces
[1] 0

$total_of_special_requests
[1] 0

$reservation_status
[1] 0

$reservation_status_date
[1] 0
  [1] "NULL" "304"  "240"  "303"  "15"   "241"  "8"    "250"  "115"  "5"   
 [11] "175"  "134"  "156"  "243"  "242"  "3"    "105"  "40"   "147"  "306" 
 [21] "184"  "96"   "2"    "127"  "95"   "146"  "9"    "177"  "6"    "143" 
 [31] "244"  "149"  "167"  "300"  "171"  "305"  "67"   "196"  "152"  "142" 
 [41] "261"  "104"  "36"   "26"   "29"   "258"  "110"  "71"   "181"  "88"  
 [51] "251"  "275"  "69"   "248"  "208"  "256"  "314"  "126"  "281"  "273" 
 [61] "253"  "185"  "330"  "334"  "328"  "326"  "321"  "324"  "313"  "38"  
 [71] "155"  "68"   "335"  "308"  "332"  "94"   "348"  "310"  "339"  "375" 
 [81] "66"   "327"  "387"  "298"  "91"   "245"  "385"  "257"  "393"  "168" 
 [91] "405"  "249"  "315"  "75"   "128"  "307"  "11"   "436"  "1"    "201" 
[101] "183"  "223"  "368"  "336"  "291"  "464"  "411"  "481"  "10"   "154" 
[111] "468"  "410"  "390"  "440"  "495"  "492"  "493"  "434"  "57"   "531" 
[121] "420"  "483"  "526"  "472"  "429"  "16"   "446"  "34"   "78"   "139" 
[131] "252"  "270"  "47"   "114"  "301"  "193"  "182"  "135"  "350"  "195" 
[141] "352"  "355"  "159"  "363"  "384"  "360"  "331"  "367"  "64"   "406" 
[151] "163"  "414"  "333"  "427"  "431"  "430"  "426"  "438"  "433"  "418" 
[161] "441"  "282"  "432"  "72"   "450"  "180"  "454"  "455"  "59"   "451" 
[171] "254"  "358"  "469"  "165"  "467"  "510"  "337"  "476"  "502"  "527" 
[181] "479"  "508"  "535"  "302"  "497"  "187"  "13"   "7"    "27"   "14"  
[191] "22"   "17"   "28"   "42"   "20"   "19"   "45"   "37"   "61"   "39"  
[201] "21"   "24"   "41"   "50"   "30"   "54"   "52"   "12"   "44"   "31"  
[211] "83"   "32"   "63"   "60"   "55"   "56"   "89"   "87"   "118"  "86"  
[221] "85"   "210"  "214"  "129"  "179"  "138"  "174"  "170"  "153"  "93"  
[231] "151"  "119"  "35"   "173"  "58"   "53"   "133"  "79"   "235"  "192" 
[241] "191"  "236"  "162"  "215"  "157"  "287"  "132"  "234"  "98"   "77"  
[251] "103"  "107"  "262"  "220"  "121"  "205"  "378"  "23"   "296"  "290" 
[261] "229"  "33"   "286"  "276"  "425"  "484"  "323"  "403"  "219"  "394" 
[271] "509"  "111"  "423"  "4"    "70"   "82"   "81"   "74"   "92"   "99"  
[281] "90"   "112"  "117"  "106"  "148"  "158"  "144"  "211"  "213"  "216" 
[291] "232"  "150"  "267"  "227"  "247"  "278"  "280"  "285"  "289"  "269" 
[301] "295"  "265"  "288"  "122"  "294"  "325"  "341"  "344"  "346"  "359" 
[311] "283"  "364"  "370"  "371"  "25"   "141"  "391"  "397"  "416"  "404" 
[321] "299"  "197"  "73"   "354"  "444"  "408"  "461"  "388"  "453"  "459" 
[331] "474"  "475"  "480"  "449" 
  [1] "NULL" "110"  "113"  "270"  "178"  "240"  "154"  "144"  "307"  "268" 
 [11] "59"   "204"  "312"  "318"  "94"   "174"  "274"  "195"  "223"  "317" 
 [21] "281"  "118"  "53"   "286"  "12"   "47"   "324"  "342"  "373"  "371" 
 [31] "383"  "86"   "82"   "218"  "88"   "31"   "397"  "392"  "405"  "331" 
 [41] "367"  "20"   "83"   "416"  "51"   "395"  "102"  "34"   "84"   "360" 
 [51] "394"  "457"  "382"  "461"  "478"  "386"  "112"  "486"  "421"  "9"   
 [61] "308"  "135"  "224"  "504"  "269"  "356"  "498"  "390"  "513"  "203" 
 [71] "263"  "477"  "521"  "169"  "515"  "445"  "337"  "251"  "428"  "292" 
 [81] "388"  "130"  "250"  "355"  "254"  "543"  "531"  "528"  "62"   "120" 
 [91] "42"   "81"   "116"  "530"  "103"  "39"   "16"   "92"   "61"   "501" 
[101] "165"  "291"  "290"  "43"   "325"  "192"  "108"  "200"  "465"  "287" 
[111] "297"  "490"  "482"  "207"  "282"  "437"  "225"  "329"  "272"  "28"  
[121] "77"   "338"  "72"   "246"  "319"  "146"  "159"  "380"  "323"  "511" 
[131] "407"  "278"  "80"   "403"  "399"  "14"   "137"  "343"  "346"  "347" 
[141] "349"  "289"  "351"  "353"  "54"   "99"   "358"  "361"  "362"  "366" 
[151] "372"  "365"  "277"  "109"  "377"  "379"  "22"   "378"  "330"  "364" 
[161] "401"  "232"  "255"  "384"  "167"  "212"  "514"  "391"  "400"  "376" 
[171] "402"  "396"  "302"  "398"  "6"    "370"  "369"  "409"  "168"  "104" 
[181] "408"  "413"  "148"  "10"   "333"  "419"  "415"  "424"  "425"  "423" 
[191] "422"  "435"  "439"  "442"  "448"  "443"  "454"  "444"  "52"   "459" 
[201] "458"  "456"  "460"  "447"  "470"  "466"  "484"  "184"  "485"  "32"  
[211] "487"  "491"  "494"  "193"  "516"  "496"  "499"  "29"   "78"   "520" 
[221] "507"  "506"  "512"  "126"  "64"   "242"  "518"  "523"  "539"  "534" 
[231] "436"  "525"  "541"  "40"   "455"  "410"  "45"   "38"   "49"   "48"  
[241] "67"   "68"   "65"   "91"   "37"   "8"    "179"  "209"  "219"  "221" 
[251] "227"  "153"  "186"  "253"  "202"  "216"  "275"  "233"  "280"  "309" 
[261] "321"  "93"   "316"  "85"   "107"  "350"  "279"  "334"  "348"  "150" 
[271] "73"   "385"  "418"  "197"  "450"  "452"  "115"  "46"   "76"   "96"  
[281] "100"  "105"  "101"  "122"  "11"   "139"  "142"  "127"  "143"  "140" 
[291] "149"  "163"  "160"  "180"  "238"  "183"  "222"  "185"  "217"  "215" 
[301] "213"  "237"  "230"  "234"  "35"   "245"  "158"  "258"  "259"  "260" 
[311] "411"  "257"  "271"  "18"   "106"  "210"  "273"  "71"   "284"  "301" 
[321] "305"  "293"  "264"  "311"  "304"  "313"  "288"  "320"  "314"  "332" 
[331] "341"  "352"  "243"  "368"  "393"  "132"  "220"  "412"  "420"  "426" 
[341] "417"  "429"  "433"  "446"  "357"  "479"  "483"  "489"  "229"  "481" 
[351] "497"  "451"  "492" 
  [1] "PRT"  "GBR"  "USA"  "ESP"  "IRL"  "FRA"  "NULL" "ROU"  "NOR"  "OMN" 
 [11] "ARG"  "POL"  "DEU"  "BEL"  "CHE"  "CN"   "GRC"  "ITA"  "NLD"  "DNK" 
 [21] "RUS"  "SWE"  "AUS"  "EST"  "CZE"  "BRA"  "FIN"  "MOZ"  "BWA"  "LUX" 
 [31] "SVN"  "ALB"  "IND"  "CHN"  "MEX"  "MAR"  "UKR"  "SMR"  "LVA"  "PRI" 
 [41] "SRB"  "CHL"  "AUT"  "BLR"  "LTU"  "TUR"  "ZAF"  "AGO"  "ISR"  "CYM" 
 [51] "ZMB"  "CPV"  "ZWE"  "DZA"  "KOR"  "CRI"  "HUN"  "ARE"  "TUN"  "JAM" 
 [61] "HRV"  "HKG"  "IRN"  "GEO"  "AND"  "GIB"  "URY"  "JEY"  "CAF"  "CYP" 
 [71] "COL"  "GGY"  "KWT"  "NGA"  "MDV"  "VEN"  "SVK"  "FJI"  "KAZ"  "PAK" 
 [81] "IDN"  "LBN"  "PHL"  "SEN"  "SYC"  "AZE"  "BHR"  "NZL"  "THA"  "DOM" 
 [91] "MKD"  "MYS"  "ARM"  "JPN"  "LKA"  "CUB"  "CMR"  "BIH"  "MUS"  "COM" 
[101] "SUR"  "UGA"  "BGR"  "CIV"  "JOR"  "SYR"  "SGP"  "BDI"  "SAU"  "VNM" 
[111] "PLW"  "QAT"  "EGY"  "PER"  "MLT"  "MWI"  "ECU"  "MDG"  "ISL"  "UZB" 
[121] "NPL"  "BHS"  "MAC"  "TGO"  "TWN"  "DJI"  "STP"  "KNA"  "ETH"  "IRQ" 
[131] "HND"  "RWA"  "KHM"  "MCO"  "BGD"  "IMN"  "TJK"  "NIC"  "BEN"  "VGB" 
[141] "TZA"  "GAB"  "GHA"  "TMP"  "GLP"  "KEN"  "LIE"  "GNB"  "MNE"  "UMI" 
[151] "MYT"  "FRO"  "MMR"  "PAN"  "BFA"  "LBY"  "MLI"  "NAM"  "BOL"  "PRY" 
[161] "BRB"  "ABW"  "AIA"  "SLV"  "DMA"  "PYF"  "GUY"  "LCA"  "ATA"  "GTM" 
[171] "ASM"  "MRT"  "NCL"  "KIR"  "SDN"  "ATF"  "SLE"  "LAO" 

Few Null Values Replaced

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.0000  0.0000  0.1039  0.0000 10.0000       4 
# A tibble: 6 x 30
  hotel is_canceled lead_time arrival_date_we~ stays_in_weeken~ stays_in_week_n~
  <chr>       <dbl>     <dbl>            <dbl>            <dbl>            <dbl>
1 0               0       342               27                0                0
2 0               0       737               27                0                0
3 0               0         7               27                0                1
4 0               0        13               27                0                1
5 0               0        14               27                0                2
6 0               0        14               27                0                2
# ... with 24 more variables: adults <dbl>, children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, ...



Summary Statistics

         hotel  is_canceled    lead_time arrival_date_week_number
nbr.val     NA 1.193900e+05 1.193900e+05             1.193900e+05
nbr.null    NA 7.516600e+04 6.345000e+03             0.000000e+00
nbr.na      NA 0.000000e+00 0.000000e+00             0.000000e+00
min         NA 0.000000e+00 0.000000e+00             1.000000e+00
max         NA 1.000000e+00 7.370000e+02             5.300000e+01
range       NA 1.000000e+00 7.370000e+02             5.200000e+01
sum         NA 4.422400e+04 1.241792e+07             3.243250e+06
median      NA 0.000000e+00 6.900000e+01             2.800000e+01
mean        NA 3.704163e-01 1.040114e+02             2.716517e+01
SE.mean     NA 1.397622e-03 3.092743e-01             3.937486e-02
CI.mean     NA 2.739316e-03 6.061726e-01             7.717408e-02
var         NA 2.332100e-01 1.141972e+04             1.850998e+02
std.dev     NA 4.829182e-01 1.068631e+02             1.360514e+01
coef.var    NA 1.303718e+00 1.027417e+00             5.008302e-01
         stays_in_weekend_nights stays_in_week_nights       adults     children
nbr.val             1.193900e+05         1.193900e+05 1.193900e+05 1.193860e+05
nbr.null            5.199800e+04         7.645000e+03 4.030000e+02 1.107960e+05
nbr.na              0.000000e+00         0.000000e+00 0.000000e+00 4.000000e+00
min                 0.000000e+00         0.000000e+00 0.000000e+00 0.000000e+00
max                 1.900000e+01         5.000000e+01 5.500000e+01 1.000000e+01
range               1.900000e+01         5.000000e+01 5.500000e+01 1.000000e+01
sum                 1.107460e+05         2.985110e+05 2.216360e+05 1.240300e+04
median              1.000000e+00         2.000000e+00 2.000000e+00 0.000000e+00
mean                9.275986e-01         2.500302e+00 1.856403e+00 1.038899e-01
SE.mean             2.890104e-03         5.522801e-03 1.676449e-03 1.153503e-03
CI.mean             5.664557e-03         1.082460e-02 3.285813e-03 2.260847e-03
var                 9.972289e-01         3.641554e+00 3.355433e-01 1.588512e-01
std.dev             9.986135e-01         1.908286e+00 5.792610e-01 3.985614e-01
coef.var            1.076558e+00         7.632222e-01 3.120340e-01 3.836383e+00
               babies meal country market_segment distribution_channel
nbr.val  1.193900e+05   NA      NA             NA                   NA
nbr.null 1.184730e+05   NA      NA             NA                   NA
nbr.na   0.000000e+00   NA      NA             NA                   NA
min      0.000000e+00   NA      NA             NA                   NA
max      1.000000e+01   NA      NA             NA                   NA
range    1.000000e+01   NA      NA             NA                   NA
sum      9.490000e+02   NA      NA             NA                   NA
median   0.000000e+00   NA      NA             NA                   NA
mean     7.948739e-03   NA      NA             NA                   NA
SE.mean  2.819917e-04   NA      NA             NA                   NA
CI.mean  5.526992e-04   NA      NA             NA                   NA
var      9.493811e-03   NA      NA             NA                   NA
std.dev  9.743619e-02   NA      NA             NA                   NA
coef.var 1.225807e+01   NA      NA             NA                   NA
         is_repeated_guest previous_cancellations
nbr.val       1.193900e+05           1.193900e+05
nbr.null      1.155800e+05           1.129060e+05
nbr.na        0.000000e+00           0.000000e+00
min           0.000000e+00           0.000000e+00
max           1.000000e+00           2.600000e+01
range         1.000000e+00           2.600000e+01
sum           3.810000e+03           1.040100e+04
median        0.000000e+00           0.000000e+00
mean          3.191222e-02           8.711785e-02
SE.mean       5.086906e-04           2.443608e-03
CI.mean       9.970254e-04           4.789432e-03
var           3.089409e-02           7.129039e-01
std.dev       1.757671e-01           8.443364e-01
coef.var      5.507832e+00           9.691887e+00
         previous_bookings_not_canceled reserved_room_type assigned_room_type
nbr.val                    1.193900e+05                 NA                 NA
nbr.null                   1.157700e+05                 NA                 NA
nbr.na                     0.000000e+00                 NA                 NA
min                        0.000000e+00                 NA                 NA
max                        7.200000e+01                 NA                 NA
range                      7.200000e+01                 NA                 NA
sum                        1.636800e+04                 NA                 NA
median                     0.000000e+00                 NA                 NA
mean                       1.370969e-01                 NA                 NA
SE.mean                    4.333757e-03                 NA                 NA
CI.mean                    8.494093e-03                 NA                 NA
var                        2.242317e+00                 NA                 NA
std.dev                    1.497437e+00                 NA                 NA
coef.var                   1.092247e+01                 NA                 NA
         booking_changes deposit_type agent company days_in_waiting_list
nbr.val     1.193900e+05           NA    NA      NA         1.193900e+05
nbr.null    1.013140e+05           NA    NA      NA         1.156920e+05
nbr.na      0.000000e+00           NA    NA      NA         0.000000e+00
min         0.000000e+00           NA    NA      NA         0.000000e+00
max         2.100000e+01           NA    NA      NA         3.910000e+02
range       2.100000e+01           NA    NA      NA         3.910000e+02
sum         2.640000e+04           NA    NA      NA         2.771220e+05
median      0.000000e+00           NA    NA      NA         0.000000e+00
mean        2.211240e-01           NA    NA      NA         2.321149e+00
SE.mean     1.887848e-03           NA    NA      NA         5.092117e-02
CI.mean     3.700152e-03           NA    NA      NA         9.980468e-02
var         4.255026e-01           NA    NA      NA         3.095742e+02
std.dev     6.523056e-01           NA    NA      NA         1.759472e+01
coef.var    2.949953e+00           NA    NA      NA         7.580177e+00
         customer_type           adr required_car_parking_spaces
nbr.val             NA  1.193900e+05                1.193900e+05
nbr.null            NA  1.959000e+03                1.119740e+05
nbr.na              NA  0.000000e+00                0.000000e+00
min                 NA -6.380000e+00                0.000000e+00
max                 NA  5.400000e+03                8.000000e+00
range               NA  5.406380e+03                8.000000e+00
sum                 NA  1.215762e+07                7.464000e+03
median              NA  9.457500e+01                0.000000e+00
mean                NA  1.018311e+02                6.251780e-02
SE.mean             NA  1.462565e-01                7.099012e-04
CI.mean             NA  2.866603e-01                1.391395e-03
var                 NA  2.553866e+03                6.016775e-02
std.dev             NA  5.053579e+01                2.452911e-01
coef.var            NA  4.962706e-01                3.923541e+00
         total_of_special_requests reservation_status reservation_status_date
nbr.val               1.193900e+05                 NA            1.193900e+05
nbr.null              7.031800e+04                 NA            0.000000e+00
nbr.na                0.000000e+00                 NA            0.000000e+00
min                   0.000000e+00                 NA            1.636000e+04
max                   5.000000e+00                 NA            1.742300e+04
range                 5.000000e+00                 NA            1.063000e+03
sum                   6.821500e+04                 NA            2.031065e+09
median                0.000000e+00                 NA            1.702000e+04
mean                  5.713628e-01                 NA            1.701202e+04
SE.mean               2.294451e-03                 NA            6.634426e-01
CI.mean               4.497087e-03                 NA            1.300337e+00
var                   6.285293e-01                 NA            5.255024e+04
std.dev               7.927984e-01                 NA            2.292384e+02
coef.var              1.387557e+00                 NA            1.347509e-02
                 date
nbr.val  1.193900e+05
nbr.null 0.000000e+00
nbr.na   0.000000e+00
min      1.661700e+04
max      1.740900e+04
range    7.920000e+02
sum      2.034608e+09
median   1.705000e+04
mean     1.704169e+04
SE.mean  6.472825e-01
CI.mean  1.268663e+00
var      5.002138e+04
std.dev  2.236546e+02
coef.var 1.312397e-02

Thought Process

*For data, we first check how many null values are present in each column. This will give us idea if we have to work on any column.

*After checking,we find that there is no such discrepancy seen in the data, the only things observed were, there are 4 Null values in children columns, which we can take care of later.

*Now we will check all the unique values of each variable, which will give us lot of information.

*Things we noticed that there is “Null” entry as a character in country, agent and company. In most cases of hotels we are unaware about these things. While booking. People try to give as minimum info as possible because they feel it as unnecessary info just for booking purpose. Thus, we will keep them as it is and consider it as a value.

*We also observe that there are three separate column with date which are months,year and date. We will combine that into one and delete the rest.

*We can see from unique values that there are only two types of hotel. This hotel type can be useful while modeling the data. So we will replace that types into numerical categories which will helps us going ahead for modeling.

*Now we will check what can we do with that null values. We go though that column and observe that mean and median are similar. Thus we will replace that column with median which is 0 here.

EXPLORATORY DATA ANALYSIS

WRITE POINTS OF EDA AND E XPLAIN




The data is obtained from github website where many data sets are available publicly. Following is the link for it. Data source

Purpose of data is perform analytics operations on the data to understand the causes of booking cancellation, its prediction, booking trends etc. The data consist the information about the hotel bookings. We have total 1,19,390 records for the hotel bookings which are collected between the arrival date of 1st July 2015 and 31st August 2017. It consist of 32 variables like Hotel type, date and time of booking, repeated customer or not, country, number of people etc.


Lead Time

# A tibble: 2 x 3
  lead_time_updated num_bookings perc_bookings
  <chr>                    <int>         <dbl>
1 <10                      22852         0.191
2 >10                      96538         0.809

As we can see from the chart and the table, 19% of the bookings were made less than 10 days before the arrival day. Hence, any cancellation during these 10 days is a loss for the hotel as they wont be able to get replacement bookings for these cancellations. Below we will look into number of cancellations by # of days before arrival date.

Cancellations Before D Days

# A tibble: 2 x 3
  D_cancellation num_cancel perc_cancel
  <chr>               <int>       <dbl>
1 <10                  6601       0.153
2 >10                 36416       0.847

Out of all bookings, 15% bookings were cancelled in 10 days leading up-to the arrival day of the guest. A breakup of number of bookings is given below

# A tibble: 391 x 2
   D_cancellation num_cancel
   <drtn>              <int>
 1 0 days                881
 2 1 days                630
 3 2 days                487
 4 3 days                718
 5 4 days                581
 6 5 days                603
 7 6 days                636
 8 7 days                599
 9 8 days                496
10 9 days                462
# ... with 381 more rows

Total amount lost by hotels because of these bookings

# A tibble: 2 x 2
  D_day     amount_lost
  <chr>           <dbl>
1 <=10 days    2065942.
2 >10 days    14211814.
# A tibble: 2 x 4
  hotel        num_booking num_cancel perc_can
  <chr>              <int>      <dbl>    <dbl>
1 City Hotel         79330      33102     41.7
2 Resort Hotel       40060      11122     27.8

We have lost ~ 2 Million USD from the bookings that are getting cancelled less than 10 days before the arrival day.

Cancellation By Hotel Type

.

.

# A tibble: 2 x 2
  hotel        num_can
  <chr>          <int>
1 City Hotel     33102
2 Resort Hotel   11122

we can clearly see that there are 3 times more cancellation in city hotel as compared to resort hotels. This can be because of locations of some of the city hotels as traffic plays important role for business class customers. Moreover, resort hotels have bookings of customers who are there for a dedicated vacation.

Cancellation On Basis of no. of Room Nights Booked

.


It is observed that bookings of 2 and 3 days of room nights has maximum cancellations.

Cancalletion Trend Due to Different Channels of Booking

.

# A tibble: 8 x 2
  market_segment    ms
  <chr>          <int>
1 Aviation          41
2 Complementary     85
3 Corporate        916
4 Direct          1722
5 Groups         12023
6 Offline TA/TO   8080
7 Online TA      20148
8 Undefined          2



# A tibble: 5 x 2
  distribution_channel    DC
  <chr>                <int>
1 Corporate             1381
2 Direct                2330
3 GDS                     34
4 TA/TO                39268
5 Undefined                4


Market segment and distribution channel referrers to the booking done by travel agents or website involvement in there booking. As observed in distribution channel graph , more than 2000 cancellations are from the bookings completed through online travel agents i.e. websites like trivago, airbnb, booking.com etc. As these sites provide ease in refunds and smooth cancellations.

Cancellation Due to Not Assigning Reserved Room Type

# A tibble: 1 x 1
  num_cancel
       <int>
1      14917
# A tibble: 1 x 1
  num_cancel
       <int>
1        594
# A tibble: 1 x 1
  same_room_cancel
             <int>
1           104473
# A tibble: 1 x 1
  same_room_cancel
             <int>
1            42423

Looking at the results we found that there is no impact of assigning same room as reserved room on cancellations as only 594 bookings were cancelled out of 14917 i.e. is just 3.9%. On the other hand, 40.6% of the bookings are cancelled even when desired rooms are assigned to customers.

SUMMARY

Summary




1.Summarizing the problem statement: Analyzed cancellation behavior of hotel guests to predict risk of cancellation on future bookings. This analysis includes factors such as cancellation trend according to hotel type, trend by year, based on number of room nights booked and cancellation trend according to different channels of bookings.


2.Summarizing the implementation: The data was scrapped and manipulated accordingly for the analysis. The data was then reviewed cancelation percentage within 10 days leading up to arrival day and the total monetary impact of it. Then data was reviewed for finding potential factors that can be considered for the cancellation. This was done using summary statistics,area charts and bar charts.


3.Summary/Insights: Various results and analysis showed that there is a severe impact on hotels because of cancellation of bookings specially when the booking is cancelled within 10 days leading up to arrival day. This cancellation trend is based on number of factors. The insights that we got from the analysis:

* There are 3 times more cancellation in city hotel as compared to resort hotels.
* Bookings of short span such as bookings of 2 and 3 days of room nights has maximum cancellations.
* In market segments and distribution channels, there is higher percentage of cancellation of bookings that are completed through online travel agents.
* There is significant number of bookings that are canceled even when desired rooms were assigned to customers which shows this variable does not play any role in prediction.
* On average, hotel lost nearly 2 million USD from bookings that are getting cancelled less than 10 days before the arrival day.

---
title: "CANCELLATION Trends in Hotel Bookings"


output: 
  flexdashboard::flex_dashboard:
             social: menu
             source: embed
             vertical_layout: fill
             orientation: columns
             theme: cerulean
             
---

```{r setup, include=FALSE,echo=FALSE}
library(flexdashboard)   

```


# INTRODUCTION

### FEW THINGS TO READ{.tabset}



Booking cancellation is a major challenge for hotel managers. We already know by now that, Covid-19 pandemic has resulted in disruptions to travel plans. Hotel industry has been overwhelmed by large number of cancellations. Due to the growing popularity of online travel agents such as Booking.com, Expedia and Co. It is becoming more and more common to do free cancellations up to 24 hours before arrival. Although, this brings a new dimension of flexibility to hotel guests, but, it has also increased financial risk for hotels and difficulties to plan their occupancy rate. **Thus, realizing the importance to analyse cancellation behavior of hotel guests we are working on hotels data to predict risk of cancellation on future bookings.**

We will be using classification to predict the risk of cancellation into high and low risk. There are several methods of classifications which we can use to create a prediction model based on the different attributes that influence the cancellation. There are several classification models which we are going to test such as 'Decision Tree', 'Naive Bayes', 'Support Vector Machine' etc. We will select the final model based on the accuracy of the model. We are using the hotel data for building the prediction model to deep dive into booking cancellations. For fitting the data into the model and to get the desired accuracy, first we need to sort and clean the data. Then we will transform the data in a way which will increase the prediction score.

This prediction of risk will help the hotels in managing the demand for their hotel by predicting the overbooking percentage for their hotel in peak demand seasons. This will in turn help them minimizing their revenue loss from cancellations.

#### . ```{r out.width='65%',out.height='20%',fig.align='center'} knitr::include_graphics("C:/Storage/Study/Masters/Data Wrangling/MicrosoftTeams-image.png") ``` LIBRARIES USED =======================================================================

```{r message=FALSE} library(dplyr) library(tidyverse) library(pastecs) library(flexdashboard) url <- 'https://github.com/mpst989/Datasets/raw/main/hotels.csv' hotel <- readr::read_csv(url) ``` ```{r include=FALSE} knitr::opts_chunk$set(echo = TRUE,warning = FALSE, message = FALSE) ``` ```{r message=FALSE} library(dplyr) library(tidyverse) library(pastecs) library(flexdashboard) ```

dplyr

flexdashboard

pastecs

tidyverse

The packages used till now in this projects are tidyverse and dplyr. dplyr is part of “tidyverse” package. tidyverse is an collection of R packages designed for data science. All packages share an underlying design philosophy, grammar and data structures. The core packages are ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats, which provide functionality to model, transform, and visualize data. dplyr package functions are used to solve the vast majority of data manipulation challenges. The tidyr package functions used to solve the vast majority of data tidying challenges. In this project we have used functions such as “mutate” which creates new variables with functions of existing variables.

We have also used package pestecs to turn summary statistics into table format. Among many user-written packages, package pastecs has an easy to use function called stat.desc to display a table of descriptive statistics for a list of variables

DATA PREPARARION ======================================================================= ## About Data{.tabset}


The data is obtained from github website where many data sets are available publicly. Following is the link for it. [Data source](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md)

Purpose of data is perform analytics operations on the data to understand the causes of booking cancellation, its prediction, booking trends etc. The data consist the information about the hotel bookings. We have total 1,19,390 records for the hotel bookings which are collected between the arrival date of 1st July 2015 and 31st August 2017. It consist of 32 variables like Hotel type, date and time of booking, repeated customer or not, country, number of people etc.


### NA and Unique Values ```{r, echo=FALSE} lapply(hotel,function(x) { length(which(is.na(x)))}) # NA values in all columns #Null values display unique(hotel$agent) unique(hotel$company) unique(hotel$country) ``` ### Few Null Values Replaced ```{r, echo=FALSE} #replace null by mean/median summary(hotel$children) hotel_1 <- hotel %>% mutate(date = glue::glue("{arrival_date_year}-{arrival_date_month}-{arrival_date_day_of_month}"), date = parse_date(date, format = "%Y-%B-%d")) hotel_2 = subset(hotel_1, select = -c(arrival_date_year,arrival_date_month,arrival_date_day_of_month) ) #converting two resort types into 0 and 1. hotel_2$hotel[hotel_2$hotel %in% c('Resort Hotel')] <- 0 hotel_2$hotel[(hotel_2$hotel == 'City Hotel')] <- 1 #replacing the null value by 0 hotel_2$children[(hotel_2$children == 'NA')] <- 0 head(hotel_2) ```

### Summary Statistics ```{r, echo=FALSE} # Data statistics in tabular format stat.desc(hotel_2) ``` ### Thought Process

*For data, we first check how many null values are present in each column. This will give us idea if we have to work on any column.

*After checking,we find that there is no such discrepancy seen in the data, the only things observed were, there are 4 Null values in children columns, which we can take care of later.

*Now we will check all the unique values of each variable, which will give us lot of information.

*Things we noticed that there is "Null" entry as a character in country, agent and company. In most cases of hotels we are unaware about these things. While booking. People try to give as minimum info as possible because they feel it as unnecessary info just for booking purpose. Thus, we will keep them as it is and consider it as a value.

*We also observe that there are three separate column with date which are months,year and date. We will combine that into one and delete the rest.

*We can see from unique values that there are only two types of hotel. This hotel type can be useful while modeling the data. So we will replace that types into numerical categories which will helps us going ahead for modeling.

*Now we will check what can we do with that null values. We go though that column and observe that mean and median are similar. Thus we will replace that column with median which is 0 here.

EXPLORATORY DATA ANALYSIS ======================================================================= ## WRITE POINTS OF EDA AND E XPLAIN{.tabset}


The data is obtained from github website where many data sets are available publicly. Following is the link for it. [Data source](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md)

Purpose of data is perform analytics operations on the data to understand the causes of booking cancellation, its prediction, booking trends etc. The data consist the information about the hotel bookings. We have total 1,19,390 records for the hotel bookings which are collected between the arrival date of 1st July 2015 and 31st August 2017. It consist of 32 variables like Hotel type, date and time of booking, repeated customer or not, country, number of people etc.


### Lead Time ```{r, echo=FALSE} hotel_updated <- hotel %>% mutate(arrival_date = glue::glue("{arrival_date_year}-{arrival_date_month}-{arrival_date_day_of_month}"),arrival_date = parse_date(arrival_date, format = "%Y-%B-%d")) hotel1 <- hotel_updated %>% subset(select = -c(arrival_date_year,arrival_date_month,arrival_date_day_of_month) ) %>% filter(reservation_status=="Canceled") %>% mutate(D_cancellation= arrival_date- reservation_status_date) %>% mutate(D_day = ifelse(D_cancellation<=10,"<=10 days",">10 days")) ``` ```{r, echo=FALSE} library(ggplot2) p <- ggplot(hotel_updated, aes(x=hotel_updated$lead_time)) + geom_area(stat="bin",color="cornflowerblue",fill="cornflowerblue") p + labs(x="Number of Days", y="Count", title="Hotel Lead Time") + theme_bw() ``` ```{r, echo=FALSE} hotel_updated %>% mutate(lead_time_updated = ifelse(lead_time<=10,"<10",">10")) %>% group_by(lead_time_updated) %>% summarise(num_bookings= n()) %>% mutate(perc_bookings=num_bookings/sum(num_bookings)) ```

As we can see from the chart and the table, **19%** of the bookings were made less than 10 days before the arrival day. Hence, any cancellation during these **10 days** is a loss for the hotel as they wont be able to get replacement bookings for these cancellations. Below we will look into number of cancellations by # of days before arrival date.

### Cancellations Before D Days ```{r, echo=FALSE} p <- ggplot(hotel1, aes(x=hotel1$D_cancellation)) + geom_area(stat="bin",color="cornflowerblue",fill="cornflowerblue") p + labs(x="Number of Days", y="Count", title="Cancellation Before Arrival Day") + theme_bw() ``` ```{r, echo=FALSE} hotel1 %>% mutate(D_cancellation = ifelse(D_cancellation<=10,"<10",">10")) %>% group_by(D_cancellation) %>% summarise(num_cancel= n()) %>% mutate(perc_cancel=num_cancel/sum(num_cancel)) ```

Out of all bookings, **15%** bookings were cancelled in 10 days leading up-to the arrival day of the guest. A breakup of number of bookings is given below

```{r, echo=FALSE} hotel2 <- hotel_updated %>% subset(select = -c(arrival_date_year,arrival_date_month,arrival_date_day_of_month) ) %>% filter(reservation_status=="Canceled") %>% mutate(D_cancellation= arrival_date- reservation_status_date) %>% group_by(D_cancellation) %>% summarise(num_cancel= n()) hotel2 ``` **Total amount lost by hotels because of these bookings** ```{r, echo=FALSE} hotel1 %>% mutate(total_amount = (stays_in_weekend_nights+stays_in_week_nights)* adr) %>% group_by(D_day) %>% summarise(amount_lost=sum(total_amount)) hotel_updated %>% group_by(hotel) %>% summarise(num_booking=n(), num_cancel=sum(is_canceled)) %>% mutate(perc_can= num_cancel*100/num_booking) ```

We have lost **~ 2 Million** USD from the bookings that are getting cancelled less than **10** days before the arrival day.

### Cancellation By Hotel Type #### . ```{r, echo=FALSE} p <- ggplot(hotel_updated, aes(x= hotel, fill=hotel)) + geom_bar() p + labs(x="Hotel Type", y="Count", title="Cancellation Trend According To Hotel Type") + theme_bw() ``` #### . ```{r, echo=FALSE} hotel_updated %>% filter(is_canceled==1) %>% group_by(hotel) %>% summarise(num_can = n()) ```

we can clearly see that there are **3 times** more cancellation in city hotel as compared to resort hotels. This can be because of locations of some of the city hotels as traffic plays important role for business class customers. Moreover, resort hotels have bookings of customers who are there for a dedicated vacation.

### Cancellation Trends By Year 2015-2017 #### . ```{r, echo=FALSE} # trends by year hotel3 <- hotel_updated %>% filter(reservation_status=="Canceled") p <-ggplot(hotel3, aes(x = arrival_date_year)) + geom_bar(fill= "cornflowerblue") p + labs(x="Year", y="Cancellations", title="Yearwise Cancellation Trend") + theme_bw() ``` #### . ```{r, echo=FALSE} # trends by year hotel_updated %>% filter(is_canceled==1) %>% group_by(arrival_date_year) %>% summarise(cancel_in_year = n()) ```

There's a skyrocket increase of **~50%** cancellations from year 2015 to 2016 with slight decrease in 2017.

### Cancellation On Basis of no. of Room Nights Booked #### . ```{r, echo=FALSE} hotel1 %>% mutate(roomnights= stays_in_week_nights+ stays_in_weekend_nights)%>% ggplot(aes(x= roomnights)) + geom_bar(color="black", fill="cornflowerblue") + theme_bw() + labs(x="Number of Room Nights", y="Count", title="Cancellation Trend According To Number of Room Nights Booked") ```

It is observed that bookings of **2 and 3 days** of room nights has maximum cancellations.

### Cancalletion Trend Due to Different Channels of Booking #### . ```{r, echo=FALSE} hotel1 %>% group_by(market_segment) %>% summarise(ms = n()) ggplot(hotel1, aes(x= market_segment))+geom_bar(color="black", fill="cornflowerblue") + theme_bw() + labs(x="Channels of Bookings", y="Count", title="Cancellation Trend According To Channels of Bookings") ```

```{r, echo=FALSE} hotel1 %>% group_by(distribution_channel) %>% summarise(DC = n()) ggplot(hotel1, aes(x= distribution_channel))+geom_bar(color="black", fill="cornflowerblue") + theme_bw() + labs(x="Distribution Channels", y="Count", title="Cancellation Trend According To Distribution Channels") ```

Market segment and distribution channel referrers to the booking done by travel agents or website involvement in there booking. As observed in distribution channel graph , more than 2000 cancellations are from the bookings completed through online travel agents i.e. websites like trivago, airbnb, booking.com etc. As these sites provide ease in refunds and smooth cancellations.

### Cancellation Due to Not Assigning Reserved Room Type ```{r, echo=FALSE} hotel_updated %>% filter(reserved_room_type != assigned_room_type) %>% summarise(num_cancel =n()) hotel1 %>% filter(reserved_room_type != assigned_room_type) %>% summarise(num_cancel =n()) hotel_updated%>% filter(reserved_room_type== assigned_room_type) %>% summarise(same_room_cancel =n()) hotel1 %>% filter(reserved_room_type== assigned_room_type) %>% summarise(same_room_cancel =n()) ```

Looking at the results we found that there is no impact of assigning same room as reserved room on cancellations as only **594** bookings were cancelled out of **14917** i.e. is just **3.9%**. On the other hand, **40.6%** of the bookings are cancelled even when desired rooms are assigned to customers.

SUMMARY ================================================================ ## Summary


1.Summarizing the problem statement: Analyzed cancellation behavior of hotel guests to predict risk of cancellation on future bookings. This analysis includes factors such as cancellation trend according to hotel type, trend by year, based on number of room nights booked and cancellation trend according to different channels of bookings.


2.Summarizing the implementation: The data was scrapped and manipulated accordingly for the analysis. The data was then reviewed cancelation percentage within 10 days leading up to arrival day and the total monetary impact of it. Then data was reviewed for finding potential factors that can be considered for the cancellation. This was done using summary statistics,area charts and bar charts.


3.Summary/Insights: Various results and analysis showed that there is a severe impact on hotels because of cancellation of bookings specially when the booking is cancelled within 10 days leading up to arrival day. This cancellation trend is based on number of factors. The insights that we got from the analysis:

* There are 3 times more cancellation in city hotel as compared to resort hotels.
* Bookings of short span such as bookings of 2 and 3 days of room nights has maximum cancellations.
* In market segments and distribution channels, there is higher percentage of cancellation of bookings that are completed through online travel agents.
* There is significant number of bookings that are canceled even when desired rooms were assigned to customers which shows this variable does not play any role in prediction.
* On average, hotel lost nearly 2 million USD from bookings that are getting cancelled less than 10 days before the arrival day.