Booking cancellation is a major challenge for hotel managers. We already know by now that, Covid-19 pandemic has resulted in disruptions to travel plans. Hotel industry has been overwhelmed by large number of cancellations. Due to the growing popularity of online travel agents such as Booking.com, Expedia and Co. it is becoming more and more common to do free cancellations up to 24 hours before arrival. Although, this brings a new dimension of flexibility to hotel guests, but, it has also increased financial risk for hotels and difficulties to plan their occupancy rate. Thus, realizing the importance to analyse cancellation behavior of hotel guests we are working on hotels data to predict risk of cancellation on future bookings.
We will be using classification to predict the risk of cancellation into high and low risk. There are several methods of classifications which we can use to create a prediction model based on the different attributes that influence the cancellation. There are several classification models which we are going to test such as ‘Decision Tree’, ‘Naive Bayes’, ‘Support Vector Machine’ etc. We will select the final model based on the accuracy of the model. We are using the hotel data for building the prediction model to deep dive into booking cancellations. For fitting the data into the model and to get the desired accuracy, first we need to sort and clean the data. Then we will transform the data in a way which will increase the prediction score.
This prediction of risk will help the hotels in managing the demand for their hotel by predicting the overbooking percentage for their hotel in peak demand seasons. This will in turn help them minimizing their revenue loss from cancellations.
The packages used till now in this projects are tidyverse and dplyr. dplyr is part of “tidyverse” package. tidyverse is an collection of R packages designed for data science. All packages share an underlying design philosophy, grammar and data structures. The core packages are ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats, which provide functionality to model, transform, and visualize data. dplyr package functions are used to solve the vast majority of data manipulation challenges. The tidyr package functions used to solve the vast majority of data tidying challenges. In this project we have used functions such as “mutate” which creates new variables with functions of existing variables.
We have also used package pestecs to turn summary statistics into table format. Among many user-written packages, package pastecs has an easy to use function called stat.desc to display a table of descriptive statistics for a list of variables.
options(repos = structure(c(CRAN="http://cran.r-project.org")))
options(repos = "https://cran.rstudio.com")
install.packages("pastecs")
## package 'pastecs' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\vibze\AppData\Local\Temp\RtmpqOE2fr\downloaded_packages
install.packages("tidyverse")
## package 'tidyverse' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\vibze\AppData\Local\Temp\RtmpqOE2fr\downloaded_packages
install.packages("dplyr")
## package 'dplyr' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\vibze\AppData\Local\Temp\RtmpqOE2fr\downloaded_packages
library(dplyr)
library(tidyverse)
library(pastecs)
url <- 'https://github.com/mpst989/Datasets/raw/main/hotels.csv'
hotel <- readr::read_csv(url)
## Rows: 119390 Columns: 32
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date (1): reservation_status_date
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(hotel)
## hotel is_canceled lead_time arrival_date_year
## Length:119390 Min. :0.0000 Min. : 0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Mode :character Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.0000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## NA's :4
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03191
## 3rd Qu.:0.00000
## Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
##
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.58
## Mean : 2.321 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## Min. :2014-10-17
## 1st Qu.:2016-02-01
## Median :2016-08-07
## Mean :2016-07-30
## 3rd Qu.:2017-02-08
## Max. :2017-09-14
##
The data is obtained from github website where many data sets are available publicly. Following is the link for it.Data source
Purpose of data is perform analytics operations on the data to understand the causes of booking cancellation, its prediction, booking trends etc. The data consist the information about the hotel bookings. We have total 1,19,390 records for the hotel bookings which are collected between the arrival date of 1st July 2015 and 31st August 2017. It consist of 32 variables like Hotel type, date and time of booking, repeated customer or not, country, number of people etc.
lapply(hotel,function(x) { length(which(is.na(x)))}) # NA values in all columns
## $hotel
## [1] 0
##
## $is_canceled
## [1] 0
##
## $lead_time
## [1] 0
##
## $arrival_date_year
## [1] 0
##
## $arrival_date_month
## [1] 0
##
## $arrival_date_week_number
## [1] 0
##
## $arrival_date_day_of_month
## [1] 0
##
## $stays_in_weekend_nights
## [1] 0
##
## $stays_in_week_nights
## [1] 0
##
## $adults
## [1] 0
##
## $children
## [1] 4
##
## $babies
## [1] 0
##
## $meal
## [1] 0
##
## $country
## [1] 0
##
## $market_segment
## [1] 0
##
## $distribution_channel
## [1] 0
##
## $is_repeated_guest
## [1] 0
##
## $previous_cancellations
## [1] 0
##
## $previous_bookings_not_canceled
## [1] 0
##
## $reserved_room_type
## [1] 0
##
## $assigned_room_type
## [1] 0
##
## $booking_changes
## [1] 0
##
## $deposit_type
## [1] 0
##
## $agent
## [1] 0
##
## $company
## [1] 0
##
## $days_in_waiting_list
## [1] 0
##
## $customer_type
## [1] 0
##
## $adr
## [1] 0
##
## $required_car_parking_spaces
## [1] 0
##
## $total_of_special_requests
## [1] 0
##
## $reservation_status
## [1] 0
##
## $reservation_status_date
## [1] 0
#Null values display
unique(hotel$agent)
## [1] "NULL" "304" "240" "303" "15" "241" "8" "250" "115" "5"
## [11] "175" "134" "156" "243" "242" "3" "105" "40" "147" "306"
## [21] "184" "96" "2" "127" "95" "146" "9" "177" "6" "143"
## [31] "244" "149" "167" "300" "171" "305" "67" "196" "152" "142"
## [41] "261" "104" "36" "26" "29" "258" "110" "71" "181" "88"
## [51] "251" "275" "69" "248" "208" "256" "314" "126" "281" "273"
## [61] "253" "185" "330" "334" "328" "326" "321" "324" "313" "38"
## [71] "155" "68" "335" "308" "332" "94" "348" "310" "339" "375"
## [81] "66" "327" "387" "298" "91" "245" "385" "257" "393" "168"
## [91] "405" "249" "315" "75" "128" "307" "11" "436" "1" "201"
## [101] "183" "223" "368" "336" "291" "464" "411" "481" "10" "154"
## [111] "468" "410" "390" "440" "495" "492" "493" "434" "57" "531"
## [121] "420" "483" "526" "472" "429" "16" "446" "34" "78" "139"
## [131] "252" "270" "47" "114" "301" "193" "182" "135" "350" "195"
## [141] "352" "355" "159" "363" "384" "360" "331" "367" "64" "406"
## [151] "163" "414" "333" "427" "431" "430" "426" "438" "433" "418"
## [161] "441" "282" "432" "72" "450" "180" "454" "455" "59" "451"
## [171] "254" "358" "469" "165" "467" "510" "337" "476" "502" "527"
## [181] "479" "508" "535" "302" "497" "187" "13" "7" "27" "14"
## [191] "22" "17" "28" "42" "20" "19" "45" "37" "61" "39"
## [201] "21" "24" "41" "50" "30" "54" "52" "12" "44" "31"
## [211] "83" "32" "63" "60" "55" "56" "89" "87" "118" "86"
## [221] "85" "210" "214" "129" "179" "138" "174" "170" "153" "93"
## [231] "151" "119" "35" "173" "58" "53" "133" "79" "235" "192"
## [241] "191" "236" "162" "215" "157" "287" "132" "234" "98" "77"
## [251] "103" "107" "262" "220" "121" "205" "378" "23" "296" "290"
## [261] "229" "33" "286" "276" "425" "484" "323" "403" "219" "394"
## [271] "509" "111" "423" "4" "70" "82" "81" "74" "92" "99"
## [281] "90" "112" "117" "106" "148" "158" "144" "211" "213" "216"
## [291] "232" "150" "267" "227" "247" "278" "280" "285" "289" "269"
## [301] "295" "265" "288" "122" "294" "325" "341" "344" "346" "359"
## [311] "283" "364" "370" "371" "25" "141" "391" "397" "416" "404"
## [321] "299" "197" "73" "354" "444" "408" "461" "388" "453" "459"
## [331] "474" "475" "480" "449"
unique(hotel$company)
## [1] "NULL" "110" "113" "270" "178" "240" "154" "144" "307" "268"
## [11] "59" "204" "312" "318" "94" "174" "274" "195" "223" "317"
## [21] "281" "118" "53" "286" "12" "47" "324" "342" "373" "371"
## [31] "383" "86" "82" "218" "88" "31" "397" "392" "405" "331"
## [41] "367" "20" "83" "416" "51" "395" "102" "34" "84" "360"
## [51] "394" "457" "382" "461" "478" "386" "112" "486" "421" "9"
## [61] "308" "135" "224" "504" "269" "356" "498" "390" "513" "203"
## [71] "263" "477" "521" "169" "515" "445" "337" "251" "428" "292"
## [81] "388" "130" "250" "355" "254" "543" "531" "528" "62" "120"
## [91] "42" "81" "116" "530" "103" "39" "16" "92" "61" "501"
## [101] "165" "291" "290" "43" "325" "192" "108" "200" "465" "287"
## [111] "297" "490" "482" "207" "282" "437" "225" "329" "272" "28"
## [121] "77" "338" "72" "246" "319" "146" "159" "380" "323" "511"
## [131] "407" "278" "80" "403" "399" "14" "137" "343" "346" "347"
## [141] "349" "289" "351" "353" "54" "99" "358" "361" "362" "366"
## [151] "372" "365" "277" "109" "377" "379" "22" "378" "330" "364"
## [161] "401" "232" "255" "384" "167" "212" "514" "391" "400" "376"
## [171] "402" "396" "302" "398" "6" "370" "369" "409" "168" "104"
## [181] "408" "413" "148" "10" "333" "419" "415" "424" "425" "423"
## [191] "422" "435" "439" "442" "448" "443" "454" "444" "52" "459"
## [201] "458" "456" "460" "447" "470" "466" "484" "184" "485" "32"
## [211] "487" "491" "494" "193" "516" "496" "499" "29" "78" "520"
## [221] "507" "506" "512" "126" "64" "242" "518" "523" "539" "534"
## [231] "436" "525" "541" "40" "455" "410" "45" "38" "49" "48"
## [241] "67" "68" "65" "91" "37" "8" "179" "209" "219" "221"
## [251] "227" "153" "186" "253" "202" "216" "275" "233" "280" "309"
## [261] "321" "93" "316" "85" "107" "350" "279" "334" "348" "150"
## [271] "73" "385" "418" "197" "450" "452" "115" "46" "76" "96"
## [281] "100" "105" "101" "122" "11" "139" "142" "127" "143" "140"
## [291] "149" "163" "160" "180" "238" "183" "222" "185" "217" "215"
## [301] "213" "237" "230" "234" "35" "245" "158" "258" "259" "260"
## [311] "411" "257" "271" "18" "106" "210" "273" "71" "284" "301"
## [321] "305" "293" "264" "311" "304" "313" "288" "320" "314" "332"
## [331] "341" "352" "243" "368" "393" "132" "220" "412" "420" "426"
## [341] "417" "429" "433" "446" "357" "479" "483" "489" "229" "481"
## [351] "497" "451" "492"
unique(hotel$country)
## [1] "PRT" "GBR" "USA" "ESP" "IRL" "FRA" "NULL" "ROU" "NOR" "OMN"
## [11] "ARG" "POL" "DEU" "BEL" "CHE" "CN" "GRC" "ITA" "NLD" "DNK"
## [21] "RUS" "SWE" "AUS" "EST" "CZE" "BRA" "FIN" "MOZ" "BWA" "LUX"
## [31] "SVN" "ALB" "IND" "CHN" "MEX" "MAR" "UKR" "SMR" "LVA" "PRI"
## [41] "SRB" "CHL" "AUT" "BLR" "LTU" "TUR" "ZAF" "AGO" "ISR" "CYM"
## [51] "ZMB" "CPV" "ZWE" "DZA" "KOR" "CRI" "HUN" "ARE" "TUN" "JAM"
## [61] "HRV" "HKG" "IRN" "GEO" "AND" "GIB" "URY" "JEY" "CAF" "CYP"
## [71] "COL" "GGY" "KWT" "NGA" "MDV" "VEN" "SVK" "FJI" "KAZ" "PAK"
## [81] "IDN" "LBN" "PHL" "SEN" "SYC" "AZE" "BHR" "NZL" "THA" "DOM"
## [91] "MKD" "MYS" "ARM" "JPN" "LKA" "CUB" "CMR" "BIH" "MUS" "COM"
## [101] "SUR" "UGA" "BGR" "CIV" "JOR" "SYR" "SGP" "BDI" "SAU" "VNM"
## [111] "PLW" "QAT" "EGY" "PER" "MLT" "MWI" "ECU" "MDG" "ISL" "UZB"
## [121] "NPL" "BHS" "MAC" "TGO" "TWN" "DJI" "STP" "KNA" "ETH" "IRQ"
## [131] "HND" "RWA" "KHM" "MCO" "BGD" "IMN" "TJK" "NIC" "BEN" "VGB"
## [141] "TZA" "GAB" "GHA" "TMP" "GLP" "KEN" "LIE" "GNB" "MNE" "UMI"
## [151] "MYT" "FRO" "MMR" "PAN" "BFA" "LBY" "MLI" "NAM" "BOL" "PRY"
## [161] "BRB" "ABW" "AIA" "SLV" "DMA" "PYF" "GUY" "LCA" "ATA" "GTM"
## [171] "ASM" "MRT" "NCL" "KIR" "SDN" "ATF" "SLE" "LAO"
For data, we first check how many null values are present in each column. This will give us idea if we have to work on any column.
After checking,we find that there is no such discrepancy seen in the data, the only things observed were, there are 4 Null values in children columns, which we can take care of later.
Now we will check all the unique values of each variable, which will give us lot of information.
Things we noticed that there is “Null” entry as a character in country, agent and company. In most cases of hotels we are unaware about these things. While booking. People try to give as minimum info as possible because they feel it as unnecessary info just for booking purpose. Thus, we will keep them as it is and consider it as a value
We also observe that there are three seperate column with date which are months,year and date. We will combine that into one and delete the rest.
We can see from unique values that there are only two types of hotel. This hotel type can be useful while modeling the data. So we will replace that types into numerical categories which will helps us going ahead for modeling.
Now we will check what can we do with that null values. We go though that column and observe that mean and median are similar. Thus we will replace that column with median which is 0 here.
#replace null by mean/median
summary(hotel$children)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.1039 0.0000 10.0000 4
hotel_1 <- hotel %>%
mutate(date = glue::glue("{arrival_date_year}-{arrival_date_month}-{arrival_date_day_of_month}"),
date = parse_date(date, format = "%Y-%B-%d"))
hotel_2 = subset(hotel_1, select = -c(arrival_date_year,arrival_date_month,arrival_date_day_of_month) )
#converting two resort types into 0 and 1.
hotel_2$hotel[hotel_2$hotel %in% c('Resort Hotel')] <- 0
hotel_2$hotel[(hotel_2$hotel == 'City Hotel')] <- 1
#replacing the null value by 0
hotel_2$children[(hotel_2$children == 'NA')] <- 0
head(hotel_2)
## # A tibble: 6 x 30
## hotel is_canceled lead_time arrival_date_we~ stays_in_weeken~ stays_in_week_n~
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 342 27 0 0
## 2 0 0 737 27 0 0
## 3 0 0 7 27 0 1
## 4 0 0 13 27 0 1
## 5 0 0 14 27 0 2
## 6 0 0 14 27 0 2
## # ... with 24 more variables: adults <dbl>, children <dbl>, babies <dbl>,
## # meal <chr>, country <chr>, market_segment <chr>,
## # distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
## # deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
## # customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, ...
# Data statistics in tabular format
stat.desc(hotel_2)
## hotel is_canceled lead_time arrival_date_week_number
## nbr.val NA 1.193900e+05 1.193900e+05 1.193900e+05
## nbr.null NA 7.516600e+04 6.345000e+03 0.000000e+00
## nbr.na NA 0.000000e+00 0.000000e+00 0.000000e+00
## min NA 0.000000e+00 0.000000e+00 1.000000e+00
## max NA 1.000000e+00 7.370000e+02 5.300000e+01
## range NA 1.000000e+00 7.370000e+02 5.200000e+01
## sum NA 4.422400e+04 1.241792e+07 3.243250e+06
## median NA 0.000000e+00 6.900000e+01 2.800000e+01
## mean NA 3.704163e-01 1.040114e+02 2.716517e+01
## SE.mean NA 1.397622e-03 3.092743e-01 3.937486e-02
## CI.mean NA 2.739316e-03 6.061726e-01 7.717408e-02
## var NA 2.332100e-01 1.141972e+04 1.850998e+02
## std.dev NA 4.829182e-01 1.068631e+02 1.360514e+01
## coef.var NA 1.303718e+00 1.027417e+00 5.008302e-01
## stays_in_weekend_nights stays_in_week_nights adults children
## nbr.val 1.193900e+05 1.193900e+05 1.193900e+05 1.193860e+05
## nbr.null 5.199800e+04 7.645000e+03 4.030000e+02 1.107960e+05
## nbr.na 0.000000e+00 0.000000e+00 0.000000e+00 4.000000e+00
## min 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## max 1.900000e+01 5.000000e+01 5.500000e+01 1.000000e+01
## range 1.900000e+01 5.000000e+01 5.500000e+01 1.000000e+01
## sum 1.107460e+05 2.985110e+05 2.216360e+05 1.240300e+04
## median 1.000000e+00 2.000000e+00 2.000000e+00 0.000000e+00
## mean 9.275986e-01 2.500302e+00 1.856403e+00 1.038899e-01
## SE.mean 2.890104e-03 5.522801e-03 1.676449e-03 1.153503e-03
## CI.mean 5.664557e-03 1.082460e-02 3.285813e-03 2.260847e-03
## var 9.972289e-01 3.641554e+00 3.355433e-01 1.588512e-01
## std.dev 9.986135e-01 1.908286e+00 5.792610e-01 3.985614e-01
## coef.var 1.076558e+00 7.632222e-01 3.120340e-01 3.836383e+00
## babies meal country market_segment distribution_channel
## nbr.val 1.193900e+05 NA NA NA NA
## nbr.null 1.184730e+05 NA NA NA NA
## nbr.na 0.000000e+00 NA NA NA NA
## min 0.000000e+00 NA NA NA NA
## max 1.000000e+01 NA NA NA NA
## range 1.000000e+01 NA NA NA NA
## sum 9.490000e+02 NA NA NA NA
## median 0.000000e+00 NA NA NA NA
## mean 7.948739e-03 NA NA NA NA
## SE.mean 2.819917e-04 NA NA NA NA
## CI.mean 5.526992e-04 NA NA NA NA
## var 9.493811e-03 NA NA NA NA
## std.dev 9.743619e-02 NA NA NA NA
## coef.var 1.225807e+01 NA NA NA NA
## is_repeated_guest previous_cancellations
## nbr.val 1.193900e+05 1.193900e+05
## nbr.null 1.155800e+05 1.129060e+05
## nbr.na 0.000000e+00 0.000000e+00
## min 0.000000e+00 0.000000e+00
## max 1.000000e+00 2.600000e+01
## range 1.000000e+00 2.600000e+01
## sum 3.810000e+03 1.040100e+04
## median 0.000000e+00 0.000000e+00
## mean 3.191222e-02 8.711785e-02
## SE.mean 5.086906e-04 2.443608e-03
## CI.mean 9.970254e-04 4.789432e-03
## var 3.089409e-02 7.129039e-01
## std.dev 1.757671e-01 8.443364e-01
## coef.var 5.507832e+00 9.691887e+00
## previous_bookings_not_canceled reserved_room_type assigned_room_type
## nbr.val 1.193900e+05 NA NA
## nbr.null 1.157700e+05 NA NA
## nbr.na 0.000000e+00 NA NA
## min 0.000000e+00 NA NA
## max 7.200000e+01 NA NA
## range 7.200000e+01 NA NA
## sum 1.636800e+04 NA NA
## median 0.000000e+00 NA NA
## mean 1.370969e-01 NA NA
## SE.mean 4.333757e-03 NA NA
## CI.mean 8.494093e-03 NA NA
## var 2.242317e+00 NA NA
## std.dev 1.497437e+00 NA NA
## coef.var 1.092247e+01 NA NA
## booking_changes deposit_type agent company days_in_waiting_list
## nbr.val 1.193900e+05 NA NA NA 1.193900e+05
## nbr.null 1.013140e+05 NA NA NA 1.156920e+05
## nbr.na 0.000000e+00 NA NA NA 0.000000e+00
## min 0.000000e+00 NA NA NA 0.000000e+00
## max 2.100000e+01 NA NA NA 3.910000e+02
## range 2.100000e+01 NA NA NA 3.910000e+02
## sum 2.640000e+04 NA NA NA 2.771220e+05
## median 0.000000e+00 NA NA NA 0.000000e+00
## mean 2.211240e-01 NA NA NA 2.321149e+00
## SE.mean 1.887848e-03 NA NA NA 5.092117e-02
## CI.mean 3.700152e-03 NA NA NA 9.980468e-02
## var 4.255026e-01 NA NA NA 3.095742e+02
## std.dev 6.523056e-01 NA NA NA 1.759472e+01
## coef.var 2.949953e+00 NA NA NA 7.580177e+00
## customer_type adr required_car_parking_spaces
## nbr.val NA 1.193900e+05 1.193900e+05
## nbr.null NA 1.959000e+03 1.119740e+05
## nbr.na NA 0.000000e+00 0.000000e+00
## min NA -6.380000e+00 0.000000e+00
## max NA 5.400000e+03 8.000000e+00
## range NA 5.406380e+03 8.000000e+00
## sum NA 1.215762e+07 7.464000e+03
## median NA 9.457500e+01 0.000000e+00
## mean NA 1.018311e+02 6.251780e-02
## SE.mean NA 1.462565e-01 7.099012e-04
## CI.mean NA 2.866603e-01 1.391395e-03
## var NA 2.553866e+03 6.016775e-02
## std.dev NA 5.053579e+01 2.452911e-01
## coef.var NA 4.962706e-01 3.923541e+00
## total_of_special_requests reservation_status reservation_status_date
## nbr.val 1.193900e+05 NA 1.193900e+05
## nbr.null 7.031800e+04 NA 0.000000e+00
## nbr.na 0.000000e+00 NA 0.000000e+00
## min 0.000000e+00 NA 1.636000e+04
## max 5.000000e+00 NA 1.742300e+04
## range 5.000000e+00 NA 1.063000e+03
## sum 6.821500e+04 NA 2.031065e+09
## median 0.000000e+00 NA 1.702000e+04
## mean 5.713628e-01 NA 1.701202e+04
## SE.mean 2.294451e-03 NA 6.634426e-01
## CI.mean 4.497087e-03 NA 1.300337e+00
## var 6.285293e-01 NA 5.255024e+04
## std.dev 7.927984e-01 NA 2.292384e+02
## coef.var 1.387557e+00 NA 1.347509e-02
## date
## nbr.val 1.193900e+05
## nbr.null 0.000000e+00
## nbr.na 0.000000e+00
## min 1.661700e+04
## max 1.740900e+04
## range 7.920000e+02
## sum 2.034608e+09
## median 1.705000e+04
## mean 1.704169e+04
## SE.mean 6.472825e-01
## CI.mean 1.268663e+00
## var 5.002138e+04
## std.dev 2.236546e+02
## coef.var 1.312397e-02
We are tying to know what are the various factors that are impacting cancellation of bookings and what are the trends of cancellations given below:
Monthly number of cancellation
Cancellation rate based on ADR
Cancellation rate based on distribution channel
Cancellation rate based on customer profile
Cancellation rate based on deposit
Cancellation trend based on of days before arrival
Cancellation rate for the customer who has had previous cancellations
As we want to see cancellation trends based on number of days before the arrival, that information is not clearly available in the data. To get that information, We have to create new information using 3 already existing variables. We will use arrival date , status update date and cancellation status to determine, how many days before the checking date a booking got cancelled. Using this new variable we can see how many days before the arrival day generally the booking is cancelled. This can help the hotel to check if too many bookings are getting cancelled too close to the arrival time or not. There could be two reasons for that, either fraud bookings are made or customers are getting better prices somewhere else. Other objective of the analysis is to use the past bookings data and predict the risk of cancellation of the upcoming bookings and to provide insights on cancellations.
We will use histogram plots and scatter plots for cancellation status with arrival time(probably month wise and for festive season), of booked nights, ADR, market segments, distribution channel, repeat guest, number of booking changes, customer type etc. We will also check if there is any covariance between these different variables.
We don’t know much about modelling in R right now and will learn the same to complete these analysis.
We are planning to use classification modelling techniques to classify our upcoming bookings as high low and medium risk of cancellation.