Authors: Sreeparna Chatterjee | Abhishek Yadav | Deepak Narang
Introduction
The hotel industry is highly competitive and unpredictable, which makes clean, authentic, real-time data crucial. Predicting customer behavior is essential for maintaining a steady flow of revenue and for planning the investment and effort required in marketing and advertising. The hospitality industry in the US is valued at close to $200 billion; the numbers dipped slightly during the pandemic but are expected to return to normal as the pandemic fades. With intense competition in the online bidding market and several booking agencies targeting the same customers, models built on reliable, clean data can make the difference in a hotel’s success. For this project, we used the dataset from Antonio, Almeida, and Nunes (2019).
One major and unavoidable challenge for the hotel industry is booking cancellation, which directly impacts a hotel’s revenue. A model that predicts cancellations would let management plan for them and forecast demand more accurately, which can result in a significant increase in revenue.
Our objectives are:
We begin by importing the data and cleaning it: handling null or missing values and checking for completeness and consistency. We then look for relationships between the variables and the factors affecting them. To get a better understanding of the data, we carry out basic EDA with visualizations such as bar charts, box plots, and graphs to examine correlations between factors.
The R packages used for data analysis in this project are:
• data.table - To import data
• tidyverse - For tidy data, visualisation, transformation
• tibble - To create tibbles
• plotrix - For 3D Exploded Pie Chart
• tidyr - For tidy data
• dplyr - For data analysis
• DT - To display data set
• ggplot2 - For data visualization
• magrittr - For pipe operator
• knitr - To display tables
• moments - For moments
• caret - For partitioning the data
• ROCR - For creating ROC curve
• scales - To demonstrate ggplot2 style scales for specific types of data
• pander - To produce simple tables from summary() output
• plotly - For creating interactive charts
options(warn=-1)
library(data.table) #import data
library(tidyverse) #tidy data, visualisation, transformation
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::between() masks data.table::between()
## x dplyr::filter() masks stats::filter()
## x dplyr::first() masks data.table::first()
## x dplyr::lag() masks stats::lag()
## x dplyr::last() masks data.table::last()
## x purrr::transpose() masks data.table::transpose()
library(tibble) #create tibbles
library(plotrix) #3D Exploded Pie Chart
library(tidyr) #tidy data
library(dplyr) #data analysis
library(DT) #display data set
library(ggplot2) #data visualization
library(magrittr) #pipe operator
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(knitr) #display tables
library(moments) #moments
library(caret) #partitioning the data
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(ROCR)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:plotrix':
##
## rescale
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(pander)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
The hotel data for this analysis was sourced from the tidytuesday GitHub repository (the URL appears in the code below). It contains the booking information needed to answer our analysis questions.
The imported hotel data contains 119,390 rows and 32 columns, covering all the information for each booking.
Here we read the data from the CSV file and check the number of rows and columns.
hotels <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
## Rows: 119390 Columns: 32
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date (1): reservation_status_date
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
data<-hotels
dim(data)
## [1] 119390 32
The description for Hotel Data variables is given below.
data.type <- lapply(data, class)
Description <- c("Hotel (H1 = Resort Hotel or H2 = City Hotel)",
"Value indicating if the booking was canceled (1) or not (0)",
"Number of days that elapsed between the entering date of the booking into the PMS and the arrival date",
"Year of arrival date",
"Month of arrival date",
"Week number of year for arrival date",
"Day of arrival date",
"Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel",
"Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel",
"Number of adults",
"Number of children",
"Number of babies",
"Type of meal booked. SC for no meal package, BB for Bed & Breakfast, HB for Half board (breakfast and one other meal – usually dinner), FB for Full board (breakfast, lunch and dinner)",
"Country of origin. Categories are represented in the ISO 3155–3:2013 format",
"Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”",
"Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”",
"Value indicating if the booking name was from a repeated guest (1) or not (0)",
"Number of previous bookings that were cancelled by the customer prior to the current booking",
"Number of previous bookings not cancelled by the customer prior to the current booking",
"Code of room type reserved. Code is presented instead of designation for anonymity reasons",
"Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons",
"Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation",
"Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay.",
"ID of the travel agency that made the booking",
"ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons",
"Number of days the booking was in the waiting list before it was confirmed to the customer",
"Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking",
"Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights",
"Number of car parking spaces required by the customer",
"Number of special requests made by the customer (e.g. twin bed or high floor)",
"Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why",
"Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel"
)
Variable <- colnames(data)
data.description <- as.data.frame(cbind(Variable, Description))
options(warn=-1)
data.description
## Variable
## 1 hotel
## 2 is_canceled
## 3 lead_time
## 4 arrival_date_year
## 5 arrival_date_month
## 6 arrival_date_week_number
## 7 arrival_date_day_of_month
## 8 stays_in_weekend_nights
## 9 stays_in_week_nights
## 10 adults
## 11 children
## 12 babies
## 13 meal
## 14 country
## 15 market_segment
## 16 distribution_channel
## 17 is_repeated_guest
## 18 previous_cancellations
## 19 previous_bookings_not_canceled
## 20 reserved_room_type
## 21 assigned_room_type
## 22 booking_changes
## 23 deposit_type
## 24 agent
## 25 company
## 26 days_in_waiting_list
## 27 customer_type
## 28 adr
## 29 required_car_parking_spaces
## 30 total_of_special_requests
## 31 reservation_status
## 32 reservation_status_date
## Description
## 1 Hotel (H1 = Resort Hotel or H2 = City Hotel)
## 2 Value indicating if the booking was canceled (1) or not (0)
## 3 Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
## 4 Year of arrival date
## 5 Month of arrival date
## 6 Week number of year for arrival date
## 7 Day of arrival date
## 8 Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
## 9 Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
## 10 Number of adults
## 11 Number of children
## 12 Number of babies
## 13 Type of meal booked. SC for no meal package, BB for Bed & Breakfast, HB for Half board (breakfast and one other meal – usually dinner), FB for Full board (breakfast, lunch and dinner)
## 14 Country of origin. Categories are represented in the ISO 3155–3:2013 format
## 15 Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
## 16 Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
## 17 Value indicating if the booking name was from a repeated guest (1) or not (0)
## 18 Number of previous bookings that were cancelled by the customer prior to the current booking
## 19 Number of previous bookings not cancelled by the customer prior to the current booking
## 20 Code of room type reserved. Code is presented instead of designation for anonymity reasons
## 21 Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons
## 22 Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
## 23 Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay.
## 24 ID of the travel agency that made the booking
## 25 ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons
## 26 Number of days the booking was in the waiting list before it was confirmed to the customer
## 27 Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking
## 28 Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
## 29 Number of car parking spaces required by the customer
## 30 Number of special requests made by the customer (e.g. twin bed or high floor)
## 31 Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why
## 32 Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel
Converting variables to factors
data<-data%>%
mutate(
hotel=as.factor(hotel),
is_canceled=as.factor(is_canceled),
meal=as.factor(meal),
country=as.factor(country),
market_segment=as.factor(market_segment),
distribution_channel=as.factor(distribution_channel),
is_repeated_guest=as.factor(is_repeated_guest),
reserved_room_type=as.factor(reserved_room_type),
assigned_room_type=as.factor(assigned_room_type),
deposit_type=as.factor(deposit_type),
customer_type=as.factor(customer_type),
reservation_status=as.factor(reservation_status),
agent=as.factor(agent),
company=as.factor(company),
arrival_date_day_of_month=as.factor(arrival_date_day_of_month),
arrival_date_month=as.factor(arrival_date_month),
arrival_date_year=as.factor(arrival_date_year))
New variables “arrival_date”, “nights_stay” and “total_guests” were created using existing variables.
#creating new variable: arrival_date
data$arrival_date <- paste(data$arrival_date_month,
data$arrival_date_day_of_month,
data$arrival_date_year,sep="-")
data$arrival_date <-as.Date(data$arrival_date, format="%B-%d-%Y")
#creating new variable: nights_stay and total_guests
data <- data %>%
dplyr::mutate(nights_stay = stays_in_weekend_nights + stays_in_week_nights,
total_guests = adults + children + babies)%>%
dplyr::select(-c(stays_in_weekend_nights, stays_in_week_nights, adults, children, babies,
arrival_date_week_number ))
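Parsing month names with the "%B" format depends on the locale of the R session. As a hedged alternative, the sketch below builds the date from a month number looked up in the built-in month.name constant. The toy data frame and its values are made up for illustration and are not taken from the hotel data.

```r
# Toy rows mirroring the three arrival-date columns (illustrative values only)
bookings <- data.frame(
  arrival_date_year = c(2015, 2016, 2017),
  arrival_date_month = c("July", "February", "August"),
  arrival_date_day_of_month = c(1, 29, 31)
)

# month.name holds the English month names, so match() returns 1..12
# regardless of the session locale
month_number <- match(bookings$arrival_date_month, month.name)

# Build an ISO "YYYY-MM-DD" string and convert it to a Date.
# If the columns are factors, convert with as.integer(as.character(x)) instead.
bookings$arrival_date <- as.Date(sprintf(
  "%04d-%02d-%02d",
  as.integer(bookings$arrival_date_year),
  month_number,
  as.integer(bookings$arrival_date_day_of_month)
))

print(bookings$arrival_date)
```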
The summary below shows the distribution of each variable after these transformations.
summary(data)
## hotel is_canceled lead_time arrival_date_year
## City Hotel :79330 0:75166 Min. : 0 2015:21996
## Resort Hotel:40060 1:44224 1st Qu.: 18 2016:56707
## Median : 69 2017:40687
## Mean :104
## 3rd Qu.:160
## Max. :737
##
## arrival_date_month arrival_date_day_of_month meal country
## August :13877 17 : 4406 BB :92310 PRT :48590
## July :12661 5 : 4317 FB : 798 GBR :12129
## May :11791 15 : 4196 HB :14463 FRA :10415
## October:11160 25 : 4160 SC :10650 ESP : 8568
## April :11089 26 : 4147 Undefined: 1169 DEU : 7287
## June :10939 9 : 4096 ITA : 3766
## (Other):47873 (Other):94068 (Other):28635
## market_segment distribution_channel is_repeated_guest
## Online TA :56477 Corporate: 6677 0:115580
## Offline TA/TO:24219 Direct :14645 1: 3810
## Groups :19811 GDS : 193
## Direct :12606 TA/TO :97870
## Corporate : 5295 Undefined: 5
## Complementary: 743
## (Other) : 239
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 A :85994
## 1st Qu.: 0.00000 1st Qu.: 0.0000 D :19201
## Median : 0.00000 Median : 0.0000 E : 6535
## Mean : 0.08712 Mean : 0.1371 F : 2897
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000 G : 2094
## Max. :26.00000 Max. :72.0000 B : 1118
## (Other): 1551
## assigned_room_type booking_changes deposit_type agent
## A :74053 Min. : 0.0000 No Deposit:104641 9 :31961
## D :25322 1st Qu.: 0.0000 Non Refund: 14587 NULL :16340
## E : 7806 Median : 0.0000 Refundable: 162 240 :13922
## F : 3751 Mean : 0.2211 1 : 7191
## G : 2553 3rd Qu.: 0.0000 14 : 3640
## C : 2375 Max. :21.0000 7 : 3539
## (Other): 3530 (Other):42797
## company days_in_waiting_list customer_type
## NULL :112593 Min. : 0.000 Contract : 4076
## 40 : 927 1st Qu.: 0.000 Group : 577
## 223 : 784 Median : 0.000 Transient :89613
## 67 : 267 Mean : 2.321 Transient-Party:25124
## 45 : 250 3rd Qu.: 0.000
## 153 : 215 Max. :391.000
## (Other): 4354
## adr required_car_parking_spaces total_of_special_requests
## Min. : -6.38 Min. :0.00000 Min. :0.0000
## 1st Qu.: 69.29 1st Qu.:0.00000 1st Qu.:0.0000
## Median : 94.58 Median :0.00000 Median :0.0000
## Mean : 101.83 Mean :0.06252 Mean :0.5714
## 3rd Qu.: 126.00 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :5400.00 Max. :8.00000 Max. :5.0000
##
## reservation_status reservation_status_date arrival_date
## Canceled :43017 Min. :2014-10-17 Min. :2015-07-01
## Check-Out:75166 1st Qu.:2016-02-01 1st Qu.:2016-03-13
## No-Show : 1207 Median :2016-08-07 Median :2016-09-06
## Mean :2016-07-30 Mean :2016-08-28
## 3rd Qu.:2017-02-08 3rd Qu.:2017-03-18
## Max. :2017-09-14 Max. :2017-08-31
##
## nights_stay total_guests
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 2.000
## Median : 3.000 Median : 2.000
## Mean : 3.428 Mean : 1.968
## 3rd Qu.: 4.000 3rd Qu.: 2.000
## Max. :69.000 Max. :55.000
## NA's :4
“Undefined” meals are recoded as “SC” because both “Undefined” and “SC” mean the customer chose not to eat at the hotel. The result is stored in a new variable, “new_meal”, and the existing variable “meal” is dropped.
data<- data %>%
dplyr::mutate(new_meal = fct_collapse(meal, SC = c("Undefined" , "SC"),
BB = "BB",
FB = "FB",
HB = "HB"),
new_meal = fct_relevel(new_meal, "FB", "HB", "BB", "SC"),
new_meal = fct_explicit_na(new_meal)) %>%
dplyr::select(-meal)
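The same recoding can be sketched in base R, which makes it easy to check on a small example that two levels are merged and then reordered. The toy vector below is made up, and this is an illustrative equivalent rather than the code used in the analysis.

```r
# Toy meal codes (illustrative values only)
meal <- factor(c("BB", "SC", "Undefined", "HB", "FB", "Undefined"))

# Renaming a level to an existing level name merges the two levels
new_meal <- meal
levels(new_meal)[levels(new_meal) == "Undefined"] <- "SC"

# Put the levels in the same order as the fct_relevel() call
new_meal <- factor(new_meal, levels = c("FB", "HB", "BB", "SC"))

print(table(new_meal))
```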
From the summary we can see that there are no missing values except in “total_guests”. Its 4 missing values are replaced by 0.
data$total_guests[is.na(data$total_guests)] <- 0
colSums(is.na(data))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_day_of_month
## 0 0
## country market_segment
## 0 0
## distribution_channel is_repeated_guest
## 0 0
## previous_cancellations previous_bookings_not_canceled
## 0 0
## reserved_room_type assigned_room_type
## 0 0
## booking_changes deposit_type
## 0 0
## agent company
## 0 0
## days_in_waiting_list customer_type
## 0 0
## adr required_car_parking_spaces
## 0 0
## total_of_special_requests reservation_status
## 0 0
## reservation_status_date arrival_date
## 0 0
## nights_stay total_guests
## 0 0
## new_meal
## 0
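A sum computed with + is NA whenever any of its components is NA, which is how a derived total can end up missing. The sketch below shows that behavior on made-up rows, together with one possible alternative (not the approach taken above): summing with rowSums() and na.rm = TRUE so that a missing component counts as zero while the other components are kept.

```r
# Toy guest counts with one missing component (illustrative values only)
guests <- data.frame(
  adults   = c(2, 2, 1),
  children = c(0, NA, 2),
  babies   = c(0, 0, 1)
)

# Plain addition propagates the NA to the total
total_plain <- guests$adults + guests$children + guests$babies

# Alternative: ignore missing components when summing across columns
total_na_rm <- rowSums(guests[, c("adults", "children", "babies")], na.rm = TRUE)

print(total_plain)
print(total_na_rm)
```

Whether a missing component should count as zero is a modeling choice; replacing the whole total with 0, as done above, instead produces bookings with no guests.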
The hotel data runs from July 2015 to August 2017, so most months are covered by only two years of data. Bookings are seasonal: the summary above shows that August and July have the most arrivals, followed by May, October, April and June, while each of the remaining months has fewer.
library(ggplot2)
# Order months chronologically first: a default factor sorts month names
# alphabetically, which would not line up with the Jan-Dec axis labels below
data %>%
dplyr::mutate(arrival_date_month=factor(arrival_date_month, levels=month.name)) %>%
ggplot(aes(x=arrival_date_month, fill=hotel)) +
geom_bar(position="dodge") +
scale_fill_manual(values=c("skyblue4", "plum4"),labels=c("City Hotel", "Resort Hotel")) +
scale_y_continuous(name = "Bookings",labels = scales::comma) +
guides(fill=guide_legend(title=NULL)) +
facet_grid(arrival_date_year ~ .) +
theme(legend.position="bottom", axis.text.x=element_text(angle=0, hjust=1, vjust=0.5)) +
scale_x_discrete(labels=c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"))+
labs(x="", y="Bookings") +
ggtitle("Bookings Data", subtitle="7/1/2015 - 8/31/2017")
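On a discrete x-axis the bars follow the level order of the factor, and an unnamed vector of labels is applied to those levels by position. The minimal sketch below, using a made-up vector of month names, shows the difference between the default alphabetical levels and levels taken from the built-in month.name constant.

```r
# A few month names as they might appear in a character column (toy values)
months_seen <- c("July", "August", "January", "April")

# Default factor(): levels are sorted alphabetically
default_factor <- factor(months_seen)
print(levels(default_factor))

# Explicit levels: calendar order, including months not present in the data
ordered_factor <- factor(months_seen, levels = month.name)
print(levels(ordered_factor))
```

With the default levels, a positional label such as "Jan" would be attached to April, so explicit calendar-ordered levels are the safer choice whenever axis labels are hard-coded.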
The chart below shows canceled and non-canceled bookings for the City Hotel and the Resort Hotel; each label is that bar’s share of all bookings. Here we can see that the City Hotel had more cancellations than the Resort Hotel.
data %>% ggplot( aes(x=hotel,
fill=is_canceled)) +
geom_bar(position="dodge") +
geom_text(stat = "Count", aes(label=scales::percent(..count../sum(..count..))),position=position_dodge(0.9), vjust=1.5) +
scale_fill_manual(values=c("skyblue4", "plum4"),labels=c("Not Canceled", "Canceled")) +
guides(fill=guide_legend(title=NULL)) +
ggtitle("City and Resort Hotels Booking with/without Cancellation")
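Two different percentages can be read from such counts: each cell's share of all bookings, and the cancellation rate within each hotel. The sketch below computes the overall cancellation rate from the is_canceled counts in the summary output, and then uses a toy table with made-up counts (hotels "A" and "B" are hypothetical) to show how prop.table() gives within-group rates.

```r
# Counts of is_canceled taken from the summary() output
not_canceled <- 75166
canceled <- 44224
overall_rate <- canceled / (canceled + not_canceled)
print(round(overall_rate, 3))

# Toy cross-tabulation with made-up counts
toy <- matrix(
  c(60, 40,
    30, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(hotel = c("A", "B"), is_canceled = c("0", "1"))
)

share_of_all <- prop.table(toy)              # each cell divided by the grand total
within_hotel <- prop.table(toy, margin = 1)  # each row sums to 1

print(share_of_all)
print(within_hotel)
```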
Lead time is the number of days between the date a booking was made and the arrival date.
We want to see the effect of lead time on cancellations.
# Calculating average lead time for every month
avg_monthly_lead <- hotels %>% group_by(arrival_date_month) %>%
summarise(avg_lead_time=mean(lead_time))
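# Order months chronologically so the Jan-Dec axis labels below match the right bars
data %>% dplyr::mutate(arrival_date_month=factor(arrival_date_month, levels=month.name)) %>%
group_by(arrival_date_month, is_canceled) %>%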
summarize(avg_lead_time=mean(lead_time)) %>%
ggplot(aes(x=arrival_date_month, y=avg_lead_time, fill=is_canceled)) +
geom_col(position="dodge") +
geom_text(aes(label=round(avg_lead_time)), position=position_dodge(0.9), vjust=1.5, size=3) +
scale_fill_manual(values=c("skyblue4", "plum4"),labels=c("Not Canceled", "Canceled")) +
guides(fill=guide_legend(title=NULL)) +
theme(legend.position="bottom", axis.text.x=element_text(angle=45, hjust=1, vjust=0.5)) +
scale_x_discrete(labels=c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"))+
labs(x="", y="Lead Time in Days", title = "Average Lead Time per Month")
## `summarise()` has grouped output by 'arrival_date_month'. You can override using the `.groups` argument.
From the above chart we can see that lead time is longer for canceled bookings than for bookings that were not canceled.
We wanted a more robust way of identifying the patterns or drivers behind booking cancellations.
We decided to go ahead with a logistic regression model whose response variable is whether a booking is canceled (1) or not (0).
Splitting the dataset into training and testing sets:
dt = sort(sample(nrow(data), nrow(data)*.7))
train<-data[dt,]
test<-data[-dt,]
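Because sample() draws at random, the split differs from run to run unless a seed is set first. The sketch below shows a reproducible 70/30 split on a toy data frame. The seed value 123 and the toy data are assumptions for illustration and were not used to produce the results reported here.

```r
set.seed(123)  # assumed seed; any fixed integer makes the split repeatable

# Toy data standing in for the bookings table
toy <- data.frame(id = 1:10, y = rep(0:1, 5))

# Draw 70% of the row indices for training, the rest go to testing
n_train <- floor(0.7 * nrow(toy))
idx <- sort(sample(nrow(toy), n_train))

train_toy <- toy[idx, ]
test_toy <- toy[-idx, ]

print(nrow(train_toy))
print(nrow(test_toy))
```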
Fitting the logistic regression model:
options(warn=-1)
model <- glm(is_canceled ~ hotel + lead_time + adr + total_guests +
total_of_special_requests + distribution_channel +
is_repeated_guest + previous_cancellations + booking_changes +
deposit_type + days_in_waiting_list + required_car_parking_spaces +
customer_type, family = "binomial" , data=train)
Summary of the logistic regression model
summary(model)
##
## Call:
## glm(formula = is_canceled ~ hotel + lead_time + adr + total_guests +
## total_of_special_requests + distribution_channel + is_repeated_guest +
## previous_cancellations + booking_changes + deposit_type +
## days_in_waiting_list + required_car_parking_spaces + customer_type,
## family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.8636 -0.7910 -0.4512 0.2153 3.6586
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.014e+00 8.011e-02 -37.617 < 2e-16 ***
## hotelResort Hotel 1.277e-02 1.969e-02 0.649 0.516548
## lead_time 3.845e-03 1.011e-04 38.017 < 2e-16 ***
## adr 5.598e-03 2.117e-04 26.444 < 2e-16 ***
## total_guests 1.273e-01 1.477e-02 8.616 < 2e-16 ***
## total_of_special_requests -5.898e-01 1.279e-02 -46.108 < 2e-16 ***
## distribution_channelDirect -3.362e-01 5.776e-02 -5.821 5.85e-09 ***
## distribution_channelGDS -3.581e-01 2.223e-01 -1.610 0.107290
## distribution_channelTA/TO 5.536e-01 5.077e-02 10.903 < 2e-16 ***
## distribution_channelUndefined 1.596e+01 1.605e+02 0.099 0.920758
## is_repeated_guest1 -1.443e+00 8.813e-02 -16.375 < 2e-16 ***
## previous_cancellations 2.047e+00 5.849e-02 34.996 < 2e-16 ***
## booking_changes -3.835e-01 1.829e-02 -20.965 < 2e-16 ***
## deposit_typeNon Refund 4.962e+00 1.300e-01 38.164 < 2e-16 ***
## deposit_typeRefundable 5.657e-01 2.455e-01 2.304 0.021205 *
## days_in_waiting_list -2.154e-03 5.735e-04 -3.756 0.000173 ***
## required_car_parking_spaces -3.863e+01 4.244e+03 -0.009 0.992738
## customer_typeGroup -2.380e-01 1.992e-01 -1.195 0.232126
## customer_typeTransient 1.173e+00 5.876e-02 19.955 < 2e-16 ***
## customer_typeTransient-Party 2.116e-01 6.119e-02 3.458 0.000544 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 110227 on 83572 degrees of freedom
## Residual deviance: 74825 on 83553 degrees of freedom
## AIC: 74865
##
## Number of Fisher Scoring iterations: 11
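Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below does this for three estimates copied from the summary above, holding all other variables fixed. The choice of a 30-day increase in lead time is only an illustrative example.

```r
# Estimates copied from the coefficient table above
coefs <- c(
  lead_time = 3.845e-03,
  total_of_special_requests = -5.898e-01,
  is_repeated_guest1 = -1.443e+00
)

# Odds ratio for a one-unit increase in each predictor
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))

# Odds ratio for 30 additional days of lead time
lead_time_30_days <- exp(30 * coefs[["lead_time"]])
print(round(lead_time_30_days, 3))
```

An odds ratio above 1 means the odds of cancellation rise with the predictor, and a value below 1 means they fall.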
model$deviance
## [1] 74824.63
AIC(model)
## [1] 74864.63
BIC(model)
## [1] 75051.3
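For a logistic regression with a 0/1 response, the residual deviance equals minus twice the log-likelihood, so AIC and BIC can be checked by hand from the numbers printed above. In the sketch below, k = 20 is the number of coefficients in the summary table (including the intercept), and n = 83573 is inferred from the 83572 degrees of freedom of the null deviance.

```r
deviance_value <- 74824.63  # residual deviance printed above
k <- 20                     # estimated coefficients, including the intercept
n <- 83573                  # training rows: null deviance df (83572) + 1

aic_by_hand <- deviance_value + 2 * k
bic_by_hand <- deviance_value + k * log(n)

print(aic_by_hand)
print(bic_by_hand)
```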
#Setting cut-off probability=0.5
table(predict(model,type="response") > 0.5)
##
## FALSE TRUE
## 64302 19271
#confusion matrix
pred_prob=predict(model, newdata=train, type="response")
pred_value=1*(pred_prob>0.5)
actual_value <-train$is_canceled
confusion_matrix <- table(actual_value, pred_value)
confusion_matrix
## pred_value
## actual_value 0 1
## 0 49728 2843
## 1 14574 16428
#misclassification or error rate
misclassification_error_rate=1-sum(diag(confusion_matrix))/sum(confusion_matrix)
misclassification_error_rate
## [1] 0.2084046
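The error rate alone hides how the mistakes are split between the two classes. The sketch below re-enters the confusion matrix printed above and derives accuracy, sensitivity, specificity, and precision from it. The object name cm is ours, chosen to avoid overwriting confusion_matrix.

```r
# Confusion matrix copied from the output above (rows = actual, columns = predicted)
cm <- matrix(
  c(49728,  2843,
    14574, 16428),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual_value = c("0", "1"), pred_value = c("0", "1"))
)

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["1", "1"] / sum(cm["1", ])  # canceled bookings correctly flagged
specificity <- cm["0", "0"] / sum(cm["0", ])  # non-canceled bookings correctly passed
precision   <- cm["1", "1"] / sum(cm[, "1"])  # flagged bookings that were canceled

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision), 4))
```

At the 0.5 cut-off the model misses many actual cancellations (14,574 false negatives against 2,843 false positives), so a lower cut-off may be worth considering if missed cancellations are the costlier error.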
#In-sample prediction
pred.glm0.train<- predict(model, type="response")
##ROC Curve
pred <- prediction(pred.glm0.train, as.numeric(train$is_canceled))
perf <- performance(pred, measure="tpr", x.measure="fpr")
plot(perf, colorize=TRUE)
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8355711
#out-of-sample prediction
pred.glm0.test<- predict(model, newdata = test, type="response")
##ROC Curve
pred <- prediction(pred.glm0.test, as.numeric(test$is_canceled))
perf <- performance(pred, measure="tpr", x.measure="fpr")
plot(perf, colorize=TRUE)
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8294258
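The AUC can be read as the probability that a randomly chosen canceled booking receives a higher predicted probability than a randomly chosen non-canceled one. The sketch below computes it in base R with the rank-sum (Mann-Whitney) formula on made-up scores. The helper auc_rank is hypothetical, written for illustration, and is not how ROCR computes the values above.

```r
# Rank-based AUC: labels must be 1 (positive) or 0 (negative)
auc_rank <- function(scores, labels) {
  r <- rank(scores)  # ties receive average ranks
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy predictions (illustrative values only)
scores <- c(0.1, 0.4, 0.35, 0.8)
labels <- c(0, 0, 1, 1)

toy_auc <- auc_rank(scores, labels)
print(toy_auc)
```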