Introduction


Problem Statement

  • Booking cancellation is a major challenge for hotel managers. We already know by now that, Covid-19 pandemic has resulted in disruptions to travel plans. Hotel industry has been overwhelmed by large number of cancellations. Due to the growing popularity of online travel agents such as Booking.com, Expedia and Co. it is becoming more and more common to do free cancellations up to 24 hours before arrival. Although, this brings a new dimension of flexibility to hotel guests, but, it has also increased financial risk for hotels and difficulties to plan their occupancy rate. Thus, realizing the importance to analyse cancellation behavior of hotel guests we are working on hotels data to predict risk of cancellation on future bookings.

  • We will be using classification to predict the risk of cancellation into high and low risk. There are several methods of classifications which we can use to create a prediction model based on the different attributes that influence the cancellation. There are several classification models which we are going to test such as ‘Decision Tree’, ‘Naive Bayes’, ‘Support Vector Machine’ etc. We will select the final model based on the accuracy of the model. We are using the hotel data for building the prediction model to deep dive into booking cancellations. For fitting the data into the model and to get the desired accuracy, first we need to sort and clean the data. Then we will transform the data in a way which will increase the prediction score.

  • This prediction of risk will help the hotels in managing the demand for their hotel by predicting the overbooking percentage for their hotel in peak demand seasons. This will in turn help them minimizing their revenue loss from cancellations.

Packages Required

The packages used till now in this projects are tidyverse and dplyr. dplyr is part of “tidyverse” package. tidyverse is an collection of R packages designed for data science. All packages share an underlying design philosophy, grammar and data structures. The core packages are ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats, which provide functionality to model, transform, and visualize data. dplyr package functions are used to solve the vast majority of data manipulation challenges. The tidyr package functions used to solve the vast majority of data tidying challenges. In this project we have used functions such as “mutate” which creates new variables with functions of existing variables.

We have also used package pestecs to turn summary statistics into table format. Among many user-written packages, package pastecs has an easy to use function called stat.desc to display a table of descriptive statistics for a list of variables.

options(repos = structure(c(CRAN="http://cran.r-project.org")))
options(repos = "https://cran.rstudio.com")

install.packages("pastecs")
## package 'pastecs' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\vibze\AppData\Local\Temp\RtmpqOE2fr\downloaded_packages
install.packages("tidyverse")
## package 'tidyverse' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\vibze\AppData\Local\Temp\RtmpqOE2fr\downloaded_packages
install.packages("dplyr")
## package 'dplyr' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\vibze\AppData\Local\Temp\RtmpqOE2fr\downloaded_packages
library(dplyr)
library(tidyverse)
library(pastecs)

Data Preparation

url <- 'https://github.com/mpst989/Datasets/raw/main/hotels.csv'
hotel <- readr::read_csv(url)
## Rows: 119390 Columns: 32
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl  (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date  (1): reservation_status_date
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(hotel)
##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##                                                                       
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  NA's   :4                                                                  
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##                                                                          
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##                                                                            
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##                                                                              
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  Min.   :2014-10-17     
##  1st Qu.:2016-02-01     
##  Median :2016-08-07     
##  Mean   :2016-07-30     
##  3rd Qu.:2017-02-08     
##  Max.   :2017-09-14     
## 
  • The data is obtained from github website where many data sets are available publicly. Following is the link for it.Data source

  • Purpose of data is perform analytics operations on the data to understand the causes of booking cancellation, its prediction, booking trends etc. The data consist the information about the hotel bookings. We have total 1,19,390 records for the hotel bookings which are collected between the arrival date of 1st July 2015 and 31st August 2017. It consist of 32 variables like Hotel type, date and time of booking, repeated customer or not, country, number of people etc.

lapply(hotel,function(x) { length(which(is.na(x)))}) # NA values in all columns
## $hotel
## [1] 0
## 
## $is_canceled
## [1] 0
## 
## $lead_time
## [1] 0
## 
## $arrival_date_year
## [1] 0
## 
## $arrival_date_month
## [1] 0
## 
## $arrival_date_week_number
## [1] 0
## 
## $arrival_date_day_of_month
## [1] 0
## 
## $stays_in_weekend_nights
## [1] 0
## 
## $stays_in_week_nights
## [1] 0
## 
## $adults
## [1] 0
## 
## $children
## [1] 4
## 
## $babies
## [1] 0
## 
## $meal
## [1] 0
## 
## $country
## [1] 0
## 
## $market_segment
## [1] 0
## 
## $distribution_channel
## [1] 0
## 
## $is_repeated_guest
## [1] 0
## 
## $previous_cancellations
## [1] 0
## 
## $previous_bookings_not_canceled
## [1] 0
## 
## $reserved_room_type
## [1] 0
## 
## $assigned_room_type
## [1] 0
## 
## $booking_changes
## [1] 0
## 
## $deposit_type
## [1] 0
## 
## $agent
## [1] 0
## 
## $company
## [1] 0
## 
## $days_in_waiting_list
## [1] 0
## 
## $customer_type
## [1] 0
## 
## $adr
## [1] 0
## 
## $required_car_parking_spaces
## [1] 0
## 
## $total_of_special_requests
## [1] 0
## 
## $reservation_status
## [1] 0
## 
## $reservation_status_date
## [1] 0
#Null values display

unique(hotel$agent)
##   [1] "NULL" "304"  "240"  "303"  "15"   "241"  "8"    "250"  "115"  "5"   
##  [11] "175"  "134"  "156"  "243"  "242"  "3"    "105"  "40"   "147"  "306" 
##  [21] "184"  "96"   "2"    "127"  "95"   "146"  "9"    "177"  "6"    "143" 
##  [31] "244"  "149"  "167"  "300"  "171"  "305"  "67"   "196"  "152"  "142" 
##  [41] "261"  "104"  "36"   "26"   "29"   "258"  "110"  "71"   "181"  "88"  
##  [51] "251"  "275"  "69"   "248"  "208"  "256"  "314"  "126"  "281"  "273" 
##  [61] "253"  "185"  "330"  "334"  "328"  "326"  "321"  "324"  "313"  "38"  
##  [71] "155"  "68"   "335"  "308"  "332"  "94"   "348"  "310"  "339"  "375" 
##  [81] "66"   "327"  "387"  "298"  "91"   "245"  "385"  "257"  "393"  "168" 
##  [91] "405"  "249"  "315"  "75"   "128"  "307"  "11"   "436"  "1"    "201" 
## [101] "183"  "223"  "368"  "336"  "291"  "464"  "411"  "481"  "10"   "154" 
## [111] "468"  "410"  "390"  "440"  "495"  "492"  "493"  "434"  "57"   "531" 
## [121] "420"  "483"  "526"  "472"  "429"  "16"   "446"  "34"   "78"   "139" 
## [131] "252"  "270"  "47"   "114"  "301"  "193"  "182"  "135"  "350"  "195" 
## [141] "352"  "355"  "159"  "363"  "384"  "360"  "331"  "367"  "64"   "406" 
## [151] "163"  "414"  "333"  "427"  "431"  "430"  "426"  "438"  "433"  "418" 
## [161] "441"  "282"  "432"  "72"   "450"  "180"  "454"  "455"  "59"   "451" 
## [171] "254"  "358"  "469"  "165"  "467"  "510"  "337"  "476"  "502"  "527" 
## [181] "479"  "508"  "535"  "302"  "497"  "187"  "13"   "7"    "27"   "14"  
## [191] "22"   "17"   "28"   "42"   "20"   "19"   "45"   "37"   "61"   "39"  
## [201] "21"   "24"   "41"   "50"   "30"   "54"   "52"   "12"   "44"   "31"  
## [211] "83"   "32"   "63"   "60"   "55"   "56"   "89"   "87"   "118"  "86"  
## [221] "85"   "210"  "214"  "129"  "179"  "138"  "174"  "170"  "153"  "93"  
## [231] "151"  "119"  "35"   "173"  "58"   "53"   "133"  "79"   "235"  "192" 
## [241] "191"  "236"  "162"  "215"  "157"  "287"  "132"  "234"  "98"   "77"  
## [251] "103"  "107"  "262"  "220"  "121"  "205"  "378"  "23"   "296"  "290" 
## [261] "229"  "33"   "286"  "276"  "425"  "484"  "323"  "403"  "219"  "394" 
## [271] "509"  "111"  "423"  "4"    "70"   "82"   "81"   "74"   "92"   "99"  
## [281] "90"   "112"  "117"  "106"  "148"  "158"  "144"  "211"  "213"  "216" 
## [291] "232"  "150"  "267"  "227"  "247"  "278"  "280"  "285"  "289"  "269" 
## [301] "295"  "265"  "288"  "122"  "294"  "325"  "341"  "344"  "346"  "359" 
## [311] "283"  "364"  "370"  "371"  "25"   "141"  "391"  "397"  "416"  "404" 
## [321] "299"  "197"  "73"   "354"  "444"  "408"  "461"  "388"  "453"  "459" 
## [331] "474"  "475"  "480"  "449"
unique(hotel$company)
##   [1] "NULL" "110"  "113"  "270"  "178"  "240"  "154"  "144"  "307"  "268" 
##  [11] "59"   "204"  "312"  "318"  "94"   "174"  "274"  "195"  "223"  "317" 
##  [21] "281"  "118"  "53"   "286"  "12"   "47"   "324"  "342"  "373"  "371" 
##  [31] "383"  "86"   "82"   "218"  "88"   "31"   "397"  "392"  "405"  "331" 
##  [41] "367"  "20"   "83"   "416"  "51"   "395"  "102"  "34"   "84"   "360" 
##  [51] "394"  "457"  "382"  "461"  "478"  "386"  "112"  "486"  "421"  "9"   
##  [61] "308"  "135"  "224"  "504"  "269"  "356"  "498"  "390"  "513"  "203" 
##  [71] "263"  "477"  "521"  "169"  "515"  "445"  "337"  "251"  "428"  "292" 
##  [81] "388"  "130"  "250"  "355"  "254"  "543"  "531"  "528"  "62"   "120" 
##  [91] "42"   "81"   "116"  "530"  "103"  "39"   "16"   "92"   "61"   "501" 
## [101] "165"  "291"  "290"  "43"   "325"  "192"  "108"  "200"  "465"  "287" 
## [111] "297"  "490"  "482"  "207"  "282"  "437"  "225"  "329"  "272"  "28"  
## [121] "77"   "338"  "72"   "246"  "319"  "146"  "159"  "380"  "323"  "511" 
## [131] "407"  "278"  "80"   "403"  "399"  "14"   "137"  "343"  "346"  "347" 
## [141] "349"  "289"  "351"  "353"  "54"   "99"   "358"  "361"  "362"  "366" 
## [151] "372"  "365"  "277"  "109"  "377"  "379"  "22"   "378"  "330"  "364" 
## [161] "401"  "232"  "255"  "384"  "167"  "212"  "514"  "391"  "400"  "376" 
## [171] "402"  "396"  "302"  "398"  "6"    "370"  "369"  "409"  "168"  "104" 
## [181] "408"  "413"  "148"  "10"   "333"  "419"  "415"  "424"  "425"  "423" 
## [191] "422"  "435"  "439"  "442"  "448"  "443"  "454"  "444"  "52"   "459" 
## [201] "458"  "456"  "460"  "447"  "470"  "466"  "484"  "184"  "485"  "32"  
## [211] "487"  "491"  "494"  "193"  "516"  "496"  "499"  "29"   "78"   "520" 
## [221] "507"  "506"  "512"  "126"  "64"   "242"  "518"  "523"  "539"  "534" 
## [231] "436"  "525"  "541"  "40"   "455"  "410"  "45"   "38"   "49"   "48"  
## [241] "67"   "68"   "65"   "91"   "37"   "8"    "179"  "209"  "219"  "221" 
## [251] "227"  "153"  "186"  "253"  "202"  "216"  "275"  "233"  "280"  "309" 
## [261] "321"  "93"   "316"  "85"   "107"  "350"  "279"  "334"  "348"  "150" 
## [271] "73"   "385"  "418"  "197"  "450"  "452"  "115"  "46"   "76"   "96"  
## [281] "100"  "105"  "101"  "122"  "11"   "139"  "142"  "127"  "143"  "140" 
## [291] "149"  "163"  "160"  "180"  "238"  "183"  "222"  "185"  "217"  "215" 
## [301] "213"  "237"  "230"  "234"  "35"   "245"  "158"  "258"  "259"  "260" 
## [311] "411"  "257"  "271"  "18"   "106"  "210"  "273"  "71"   "284"  "301" 
## [321] "305"  "293"  "264"  "311"  "304"  "313"  "288"  "320"  "314"  "332" 
## [331] "341"  "352"  "243"  "368"  "393"  "132"  "220"  "412"  "420"  "426" 
## [341] "417"  "429"  "433"  "446"  "357"  "479"  "483"  "489"  "229"  "481" 
## [351] "497"  "451"  "492"
unique(hotel$country)
##   [1] "PRT"  "GBR"  "USA"  "ESP"  "IRL"  "FRA"  "NULL" "ROU"  "NOR"  "OMN" 
##  [11] "ARG"  "POL"  "DEU"  "BEL"  "CHE"  "CN"   "GRC"  "ITA"  "NLD"  "DNK" 
##  [21] "RUS"  "SWE"  "AUS"  "EST"  "CZE"  "BRA"  "FIN"  "MOZ"  "BWA"  "LUX" 
##  [31] "SVN"  "ALB"  "IND"  "CHN"  "MEX"  "MAR"  "UKR"  "SMR"  "LVA"  "PRI" 
##  [41] "SRB"  "CHL"  "AUT"  "BLR"  "LTU"  "TUR"  "ZAF"  "AGO"  "ISR"  "CYM" 
##  [51] "ZMB"  "CPV"  "ZWE"  "DZA"  "KOR"  "CRI"  "HUN"  "ARE"  "TUN"  "JAM" 
##  [61] "HRV"  "HKG"  "IRN"  "GEO"  "AND"  "GIB"  "URY"  "JEY"  "CAF"  "CYP" 
##  [71] "COL"  "GGY"  "KWT"  "NGA"  "MDV"  "VEN"  "SVK"  "FJI"  "KAZ"  "PAK" 
##  [81] "IDN"  "LBN"  "PHL"  "SEN"  "SYC"  "AZE"  "BHR"  "NZL"  "THA"  "DOM" 
##  [91] "MKD"  "MYS"  "ARM"  "JPN"  "LKA"  "CUB"  "CMR"  "BIH"  "MUS"  "COM" 
## [101] "SUR"  "UGA"  "BGR"  "CIV"  "JOR"  "SYR"  "SGP"  "BDI"  "SAU"  "VNM" 
## [111] "PLW"  "QAT"  "EGY"  "PER"  "MLT"  "MWI"  "ECU"  "MDG"  "ISL"  "UZB" 
## [121] "NPL"  "BHS"  "MAC"  "TGO"  "TWN"  "DJI"  "STP"  "KNA"  "ETH"  "IRQ" 
## [131] "HND"  "RWA"  "KHM"  "MCO"  "BGD"  "IMN"  "TJK"  "NIC"  "BEN"  "VGB" 
## [141] "TZA"  "GAB"  "GHA"  "TMP"  "GLP"  "KEN"  "LIE"  "GNB"  "MNE"  "UMI" 
## [151] "MYT"  "FRO"  "MMR"  "PAN"  "BFA"  "LBY"  "MLI"  "NAM"  "BOL"  "PRY" 
## [161] "BRB"  "ABW"  "AIA"  "SLV"  "DMA"  "PYF"  "GUY"  "LCA"  "ATA"  "GTM" 
## [171] "ASM"  "MRT"  "NCL"  "KIR"  "SDN"  "ATF"  "SLE"  "LAO"
  • For data, we first check how many null values are present in each column. This will give us idea if we have to work on any column.

  • After checking,we find that there is no such discrepancy seen in the data, the only things observed were, there are 4 Null values in children columns, which we can take care of later.

  • Now we will check all the unique values of each variable, which will give us lot of information.

  • Things we noticed that there is “Null” entry as a character in country, agent and company. In most cases of hotels we are unaware about these things. While booking. People try to give as minimum info as possible because they feel it as unnecessary info just for booking purpose. Thus, we will keep them as it is and consider it as a value

  • We also observe that there are three seperate column with date which are months,year and date. We will combine that into one and delete the rest.

  • We can see from unique values that there are only two types of hotel. This hotel type can be useful while modeling the data. So we will replace that types into numerical categories which will helps us going ahead for modeling.

  • Now we will check what can we do with that null values. We go though that column and observe that mean and median are similar. Thus we will replace that column with median which is 0 here.

#replace null by mean/median

summary(hotel$children)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.1039  0.0000 10.0000       4
hotel_1 <- hotel %>% 
    mutate(date = glue::glue("{arrival_date_year}-{arrival_date_month}-{arrival_date_day_of_month}"),
         date = parse_date(date, format = "%Y-%B-%d")) 



hotel_2 = subset(hotel_1, select = -c(arrival_date_year,arrival_date_month,arrival_date_day_of_month) )



#converting two resort types into 0 and 1.

hotel_2$hotel[hotel_2$hotel %in% c('Resort Hotel')] <- 0
hotel_2$hotel[(hotel_2$hotel == 'City Hotel')] <- 1

#replacing the null value by 0

hotel_2$children[(hotel_2$children == 'NA')] <- 0



head(hotel_2)
## # A tibble: 6 x 30
##   hotel is_canceled lead_time arrival_date_we~ stays_in_weeken~ stays_in_week_n~
##   <chr>       <dbl>     <dbl>            <dbl>            <dbl>            <dbl>
## 1 0               0       342               27                0                0
## 2 0               0       737               27                0                0
## 3 0               0         7               27                0                1
## 4 0               0        13               27                0                1
## 5 0               0        14               27                0                2
## 6 0               0        14               27                0                2
## # ... with 24 more variables: adults <dbl>, children <dbl>, babies <dbl>,
## #   meal <chr>, country <chr>, market_segment <chr>,
## #   distribution_channel <chr>, is_repeated_guest <dbl>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
## #   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
## #   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, ...



Summary statistics

# Data statistics in tabular format

stat.desc(hotel_2)
##          hotel  is_canceled    lead_time arrival_date_week_number
## nbr.val     NA 1.193900e+05 1.193900e+05             1.193900e+05
## nbr.null    NA 7.516600e+04 6.345000e+03             0.000000e+00
## nbr.na      NA 0.000000e+00 0.000000e+00             0.000000e+00
## min         NA 0.000000e+00 0.000000e+00             1.000000e+00
## max         NA 1.000000e+00 7.370000e+02             5.300000e+01
## range       NA 1.000000e+00 7.370000e+02             5.200000e+01
## sum         NA 4.422400e+04 1.241792e+07             3.243250e+06
## median      NA 0.000000e+00 6.900000e+01             2.800000e+01
## mean        NA 3.704163e-01 1.040114e+02             2.716517e+01
## SE.mean     NA 1.397622e-03 3.092743e-01             3.937486e-02
## CI.mean     NA 2.739316e-03 6.061726e-01             7.717408e-02
## var         NA 2.332100e-01 1.141972e+04             1.850998e+02
## std.dev     NA 4.829182e-01 1.068631e+02             1.360514e+01
## coef.var    NA 1.303718e+00 1.027417e+00             5.008302e-01
##          stays_in_weekend_nights stays_in_week_nights       adults     children
## nbr.val             1.193900e+05         1.193900e+05 1.193900e+05 1.193860e+05
## nbr.null            5.199800e+04         7.645000e+03 4.030000e+02 1.107960e+05
## nbr.na              0.000000e+00         0.000000e+00 0.000000e+00 4.000000e+00
## min                 0.000000e+00         0.000000e+00 0.000000e+00 0.000000e+00
## max                 1.900000e+01         5.000000e+01 5.500000e+01 1.000000e+01
## range               1.900000e+01         5.000000e+01 5.500000e+01 1.000000e+01
## sum                 1.107460e+05         2.985110e+05 2.216360e+05 1.240300e+04
## median              1.000000e+00         2.000000e+00 2.000000e+00 0.000000e+00
## mean                9.275986e-01         2.500302e+00 1.856403e+00 1.038899e-01
## SE.mean             2.890104e-03         5.522801e-03 1.676449e-03 1.153503e-03
## CI.mean             5.664557e-03         1.082460e-02 3.285813e-03 2.260847e-03
## var                 9.972289e-01         3.641554e+00 3.355433e-01 1.588512e-01
## std.dev             9.986135e-01         1.908286e+00 5.792610e-01 3.985614e-01
## coef.var            1.076558e+00         7.632222e-01 3.120340e-01 3.836383e+00
##                babies meal country market_segment distribution_channel
## nbr.val  1.193900e+05   NA      NA             NA                   NA
## nbr.null 1.184730e+05   NA      NA             NA                   NA
## nbr.na   0.000000e+00   NA      NA             NA                   NA
## min      0.000000e+00   NA      NA             NA                   NA
## max      1.000000e+01   NA      NA             NA                   NA
## range    1.000000e+01   NA      NA             NA                   NA
## sum      9.490000e+02   NA      NA             NA                   NA
## median   0.000000e+00   NA      NA             NA                   NA
## mean     7.948739e-03   NA      NA             NA                   NA
## SE.mean  2.819917e-04   NA      NA             NA                   NA
## CI.mean  5.526992e-04   NA      NA             NA                   NA
## var      9.493811e-03   NA      NA             NA                   NA
## std.dev  9.743619e-02   NA      NA             NA                   NA
## coef.var 1.225807e+01   NA      NA             NA                   NA
##          is_repeated_guest previous_cancellations
## nbr.val       1.193900e+05           1.193900e+05
## nbr.null      1.155800e+05           1.129060e+05
## nbr.na        0.000000e+00           0.000000e+00
## min           0.000000e+00           0.000000e+00
## max           1.000000e+00           2.600000e+01
## range         1.000000e+00           2.600000e+01
## sum           3.810000e+03           1.040100e+04
## median        0.000000e+00           0.000000e+00
## mean          3.191222e-02           8.711785e-02
## SE.mean       5.086906e-04           2.443608e-03
## CI.mean       9.970254e-04           4.789432e-03
## var           3.089409e-02           7.129039e-01
## std.dev       1.757671e-01           8.443364e-01
## coef.var      5.507832e+00           9.691887e+00
##          previous_bookings_not_canceled reserved_room_type assigned_room_type
## nbr.val                    1.193900e+05                 NA                 NA
## nbr.null                   1.157700e+05                 NA                 NA
## nbr.na                     0.000000e+00                 NA                 NA
## min                        0.000000e+00                 NA                 NA
## max                        7.200000e+01                 NA                 NA
## range                      7.200000e+01                 NA                 NA
## sum                        1.636800e+04                 NA                 NA
## median                     0.000000e+00                 NA                 NA
## mean                       1.370969e-01                 NA                 NA
## SE.mean                    4.333757e-03                 NA                 NA
## CI.mean                    8.494093e-03                 NA                 NA
## var                        2.242317e+00                 NA                 NA
## std.dev                    1.497437e+00                 NA                 NA
## coef.var                   1.092247e+01                 NA                 NA
##          booking_changes deposit_type agent company days_in_waiting_list
## nbr.val     1.193900e+05           NA    NA      NA         1.193900e+05
## nbr.null    1.013140e+05           NA    NA      NA         1.156920e+05
## nbr.na      0.000000e+00           NA    NA      NA         0.000000e+00
## min         0.000000e+00           NA    NA      NA         0.000000e+00
## max         2.100000e+01           NA    NA      NA         3.910000e+02
## range       2.100000e+01           NA    NA      NA         3.910000e+02
## sum         2.640000e+04           NA    NA      NA         2.771220e+05
## median      0.000000e+00           NA    NA      NA         0.000000e+00
## mean        2.211240e-01           NA    NA      NA         2.321149e+00
## SE.mean     1.887848e-03           NA    NA      NA         5.092117e-02
## CI.mean     3.700152e-03           NA    NA      NA         9.980468e-02
## var         4.255026e-01           NA    NA      NA         3.095742e+02
## std.dev     6.523056e-01           NA    NA      NA         1.759472e+01
## coef.var    2.949953e+00           NA    NA      NA         7.580177e+00
##          customer_type           adr required_car_parking_spaces
## nbr.val             NA  1.193900e+05                1.193900e+05
## nbr.null            NA  1.959000e+03                1.119740e+05
## nbr.na              NA  0.000000e+00                0.000000e+00
## min                 NA -6.380000e+00                0.000000e+00
## max                 NA  5.400000e+03                8.000000e+00
## range               NA  5.406380e+03                8.000000e+00
## sum                 NA  1.215762e+07                7.464000e+03
## median              NA  9.457500e+01                0.000000e+00
## mean                NA  1.018311e+02                6.251780e-02
## SE.mean             NA  1.462565e-01                7.099012e-04
## CI.mean             NA  2.866603e-01                1.391395e-03
## var                 NA  2.553866e+03                6.016775e-02
## std.dev             NA  5.053579e+01                2.452911e-01
## coef.var            NA  4.962706e-01                3.923541e+00
##          total_of_special_requests reservation_status reservation_status_date
## nbr.val               1.193900e+05                 NA            1.193900e+05
## nbr.null              7.031800e+04                 NA            0.000000e+00
## nbr.na                0.000000e+00                 NA            0.000000e+00
## min                   0.000000e+00                 NA            1.636000e+04
## max                   5.000000e+00                 NA            1.742300e+04
## range                 5.000000e+00                 NA            1.063000e+03
## sum                   6.821500e+04                 NA            2.031065e+09
## median                0.000000e+00                 NA            1.702000e+04
## mean                  5.713628e-01                 NA            1.701202e+04
## SE.mean               2.294451e-03                 NA            6.634426e-01
## CI.mean               4.497087e-03                 NA            1.300337e+00
## var                   6.285293e-01                 NA            5.255024e+04
## std.dev               7.927984e-01                 NA            2.292384e+02
## coef.var              1.387557e+00                 NA            1.347509e-02
##                  date
## nbr.val  1.193900e+05
## nbr.null 0.000000e+00
## nbr.na   0.000000e+00
## min      1.661700e+04
## max      1.740900e+04
## range    7.920000e+02
## sum      2.034608e+09
## median   1.705000e+04
## mean     1.704169e+04
## SE.mean  6.472825e-01
## CI.mean  1.268663e+00
## var      5.002138e+04
## std.dev  2.236546e+02
## coef.var 1.312397e-02

Proposed Exploratory Data Analysis


  • We are tying to know what are the various factors that are impacting cancellation of bookings and what are the trends of cancellations given below:

    1. Monthly number of cancellation

    2. Cancellation rate based on ADR

    3. Cancellation rate based on distribution channel

    4. Cancellation rate based on customer profile

    5. Cancellation rate based on deposit

    6. Cancellation trend based on of days before arrival

    7. Cancellation rate for the customer who has had previous cancellations

    As we want to see cancellation trends based on number of days before the arrival, that information is not clearly available in the data. To get that information, We have to create new information using 3 already existing variables. We will use arrival date , status update date and cancellation status to determine, how many days before the checking date a booking got cancelled. Using this new variable we can see how many days before the arrival day generally the booking is cancelled. This can help the hotel to check if too many bookings are getting cancelled too close to the arrival time or not. There could be two reasons for that, either fraud bookings are made or customers are getting better prices somewhere else. Other objective of the analysis is to use the past bookings data and predict the risk of cancellation of the upcoming bookings and to provide insights on cancellations.

  • We will use histogram plots and scatter plots for cancellation status with arrival time(probably month wise and for festive season), of booked nights, ADR, market segments, distribution channel, repeat guest, number of booking changes, customer type etc. We will also check if there is any covariance between these different variables.

  • We don’t know much about modelling in R right now and will learn the same to complete these analysis.

  • We are planning to use classification modelling techniques to classify our upcoming bookings as high low and medium risk of cancellation.