opi— title: “Midterm” author: “Group 2” date: “2023-10-08” output: html_document —

Introduction

Problem Statement: The goal is to understand the factors that influence the average daily rate (adr) of hotel bookings the most.

Approach: The analysis will use the hotels.csv dataset. The methodology will involve data cleaning, exploratory data analysis (EDA), and model fitting including linear regression, KNN, alogistic Regression, and SVM.

Analytic Technique: Linear regression and KNN will be employed to understand and predict the adr. Data preprocessing techniques like winsorization will be applied to refine the model further.

Logistic regression will utilized to understand is_cancelled and how to reduce the odds of cancellation.

Consumer Benefit: This analysis aims to provide insights on the most profitable customers, how to increase their bookings, and decrease cancellations.

Packages Required

knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
library(ggplot2)
library(kknn)
library(dplyr)
library(tidyr)
library(stringr)
library(tidyverse)
library(corrplot)
library(DescTools)
library(ROCR)

Data Preparation

More information on the original data can be found here: hotels.csv

## 'data.frame':    119390 obs. of  32 variables:
##  $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr  "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr  "BB" "BB" "BB" "BB" ...
##  $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr  "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr  "C" "C" "C" "A" ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##                                                                       
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  NA's   :4                                                                  
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##                                                                          
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##                                                                            
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##                                                                              
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  Length:119390          
##  Class :character       
##  Mode  :character       
##                         
##                         
##                         
## 

Clean adr

Setting the outlier 5400 to the mean of adr

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -6.38   69.29   94.58  101.83  126.00 5400.00
##  [1] hotel                          is_canceled                   
##  [3] lead_time                      arrival_date_year             
##  [5] arrival_date_month             arrival_date_week_number      
##  [7] arrival_date_day_of_month      stays_in_weekend_nights       
##  [9] stays_in_week_nights           adults                        
## [11] children                       babies                        
## [13] meal                           country                       
## [15] market_segment                 distribution_channel          
## [17] is_repeated_guest              previous_cancellations        
## [19] previous_bookings_not_canceled reserved_room_type            
## [21] assigned_room_type             booking_changes               
## [23] deposit_type                   agent                         
## [25] company                        days_in_waiting_list          
## [27] customer_type                  adr                           
## [29] required_car_parking_spaces    total_of_special_requests     
## [31] reservation_status             reservation_status_date       
## <0 rows> (or 0-length row.names)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -6.38   69.29   94.58  101.78  126.00  451.50

Clean arrival_date_year

Found arrival year date min = 0 leading to all var in these obs as 0. These should be removed

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   101.8  2016.0  2016.0  2016.1  2017.0  2017.0
## [1] 2015.0000 2016.0000 2017.0000  101.8311
##  [1] hotel                          is_canceled                   
##  [3] lead_time                      arrival_date_year             
##  [5] arrival_date_month             arrival_date_week_number      
##  [7] arrival_date_day_of_month      stays_in_weekend_nights       
##  [9] stays_in_week_nights           adults                        
## [11] children                       babies                        
## [13] meal                           country                       
## [15] market_segment                 distribution_channel          
## [17] is_repeated_guest              previous_cancellations        
## [19] previous_bookings_not_canceled reserved_room_type            
## [21] assigned_room_type             booking_changes               
## [23] deposit_type                   agent                         
## [25] company                        days_in_waiting_list          
## [27] customer_type                  adr                           
## [29] required_car_parking_spaces    total_of_special_requests     
## [31] reservation_status             reservation_status_date       
## <0 rows> (or 0-length row.names)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   101.8  2016.0  2016.0  2016.1  2017.0  2017.0
## [1] 2015.0000 2016.0000 2017.0000  101.8311

Clean adults

All of the obs with adults > 4 is cancelled. These obs will be removed assuming it was a typo error then cancellation to correct

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   2.000   1.859   2.000 101.831
##  [1]   2.0000   1.0000   3.0000   4.0000  40.0000  26.0000  50.0000  27.0000
##  [9]  55.0000   0.0000  20.0000   6.0000   5.0000  10.0000 101.8311
##  [1]   2.0000   1.0000   3.0000   4.0000  40.0000  26.0000  50.0000  27.0000
##  [9]  55.0000   0.0000  20.0000   6.0000   5.0000  10.0000 101.8311
## [1] "Canceled" "Canceled" "Canceled" "Canceled" "Canceled" "Canceled"
##             hotel is_canceled lead_time arrival_date_year arrival_date_month
## 2225 Resort Hotel           0         1              2015            October
##      arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 2225                       41                         6                       0
##      stays_in_week_nights adults children babies meal country market_segment
## 2225                    3      0        0      0   SC     PRT      Corporate
##      distribution_channel is_repeated_guest previous_cancellations
## 2225            Corporate                 0                      0
##      previous_bookings_not_canceled reserved_room_type assigned_room_type
## 2225                              0                  A                  I
##      booking_changes deposit_type agent company days_in_waiting_list
## 2225               1   No Deposit  NULL     174                    0
##        customer_type adr required_car_parking_spaces total_of_special_requests
## 2225 Transient-Party   0                           0                         0
##      reservation_status reservation_status_date
## 2225          Check-Out              2015-10-06
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   2.000   1.853   2.000   4.000
## [1] 2 1 3 4 0

Clean babies

Babies var seems off. There are 9 and 10 counts. This appears to be a one-off error. changing value to 0

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.00795  0.00000 10.00000
## [1]  0  1  2 10  9
## [1]  0  1  2 10  9
## [1] 0 1 2
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.007791 0.000000 2.000000
## [1] 0 1 2

Reviewing NA values

Found 4 NA in children and changed them to 0.

##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              0                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0

Exploratory Data Analysis

Boxplots of all numeric variables

Correlation Plot

Upon review of the correlation plot we can see that:

Strongest Positive Correlations - Visual review:

  • children ~ adr

  • adults ~ adr This indicates that as the number of children or adults in the booking increases, the Average Daily Rate increases. Likely because larger rooms are more expensive.

  • stays_in_weekend_nights ~ stays_in_weeks_nights Bookings that have weekend stays also tend to have longer weekday stays.

  • is_canceled ~ lead_time Bookings made well in advance are more likely to be canceled.

Strongest Negative Correlations - Visual review:

  • required_car_parking_spaces ~ is_canceled

  • booking_changes ~ is_canceled

  • total_of_special_requests ~ is_canceled Bookings with these attributes are less likely to be canceled.

  • required_car_parking_spaces ~ lead_time

  • adr ~ is_repeated_guest

  • adults ~ is_repeated_guest Repeated guests tend to book less expensive rooms or may benefit from loyalty discounts.

  • is_repeated_guest ~ lead_time Repeated guests might book closer to their stay date compared to first-time guests.

Target Variable adr Correlations * children ~ adr * adults ~ adr * adr ~ is_repeated_guest

Modelling

Create training and testing sets

# Set the seed
set.seed(21)

# Randomly sample row indices for the training set
train_indices <- sample(1:NROW(hotels_data),NROW(hotels_data)*0.75)
 
# Create the training set
train_data <- hotels_data[train_indices, ]
numeric_train_data <- train_data[sapply(train_data, is.numeric)]

# Create the testing set
test_data <- hotels_data[-train_indices, ]
numeric_test_data <- test_data[sapply(test_data, is.numeric)]

print(paste("We have split the data using random sampling into a training set of 75%:", count(train_data), "observations and a testing set of 25%:", count(test_data), "observations"))
## [1] "We have split the data using random sampling into a training set of 75%: 89528 observations and a testing set of 25%: 29843 observations"

Linear Regression

Winsorization function

# Winsorize all numeric variables in the dataframe
winsorize_all <- function(data, lower = 0.05, upper = 0.95) {
  data <- lapply(data, function(x) Winsorize(x, probs = c(lower, upper)))
  return(data)
}

Target variable adr - Average Daily Rate

In sample linear model/winsorization comparison

##        Actual Predicted
## 16562    81.0  96.45532
## 12105    80.1 103.70450
## 76839    89.0 104.74265
## 55685   115.0  96.35001
## 106096  105.0  90.61195
## 68024   130.0  97.38073
##   Actual Predicted_Winsorized
## 1   81.0             78.39981
## 2   80.1            114.89974
## 3   89.0             87.32924
## 4  115.0             93.52211
## 5  105.0             96.42189
## 6  130.0            107.44216

In sample MSE comparison to winsorization model

The linear model MSE is improved after winsorizing the training set. This suggests that winsorizing the data will improve the models fit to the data.

## [1] "Training MSE for Linear Model: 1739.62"
## [1] "Training MSE for Winsorized Linear Model: 1671.39"

Out of sample linear model/winsorization comparison

hotels_test_data_winsorized <- winsorize_all(numeric_test_data)

lm_model_test <- lm(adr ~ ., data = numeric_test_data)

lm_model_test_win <- lm(adr ~ ., data = hotels_test_data_winsorized)

comparison_df <- data.frame(Actual = numeric_test_data$adr, lm_predicted = lm_model_test$fitted.values)
head(comparison_df)
##    Actual lm_predicted
## 8  103.00    105.31938
## 10 105.50    105.41917
## 23  84.67    103.24816
## 33 108.30    122.30285
## 37  98.00     97.38399
## 39 108.80    132.21362
comparison_df_win <- data.frame(Actual = hotels_test_data_winsorized$adr, Predicted_winsorized = lm_model_test_win$fitted.values)
head(comparison_df)
##    Actual lm_predicted
## 8  103.00    105.31938
## 10 105.50    105.41917
## 23  84.67    103.24816
## 33 108.30    122.30285
## 37  98.00     97.38399
## 39 108.80    132.21362

MSE comparison of In sample and out of sample testing and training models

The winsorized model performs better than the standard model when comparing the MSEs below in both training and testing datasets which is making the model less sensitive to outliers.

lm_mse_train <- mean((lm_model_train$fitted.values - numeric_train_data$adr)^2)
print(paste("Training MSE for Linear Model:", round(lm_mse_train, 2)))
## [1] "Training MSE for Linear Model: 1739.62"
lm_mse_train_win <- mean((lm_model_train_win$fitted.values - numeric_train_data$adr)^2)
print(paste("Training MSE for Winsorized Linear Model:", round(lm_mse_train_win, 2)))
## [1] "Training MSE for Winsorized Linear Model: 1671.39"
lm_mse_test <- mean((lm_model_test$fitted.values - numeric_test_data$adr)^2)
print(paste("Test MSE for Linear Model:", round(lm_mse_test, 2)))
## [1] "Test MSE for Linear Model: 1763.21"
lm_mse_test_win <- mean((lm_model_test_win$fitted.values - numeric_test_data$adr)^2)
print(paste("Test MSE for Winsorized Linear Model:", round(lm_mse_test_win, 2)))
## [1] "Test MSE for Winsorized Linear Model: 1679.17"

This scatter plot indicates a wide variation in adr across all levels of adults. The slop is positive showing a positive relationship with Average Daily Rate and the number of Adults in the booking. This is more than likely due to the need of additional beds and/or space.

This scatterplot shows a positive relationship between adr and children. This is more than likely due to the need of additional space, beds, or amenities that children prefer.

This plot indicates a negative relationship between adr and is_repeated_guest. This may be due to loyalty programs or the guests being ‘in the know’ of deals.

## 
## Call:
## lm(formula = adr ~ adults, data = numeric_test_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -136.19  -31.10   -6.19   23.00  330.90 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.9101     1.0436   43.99   <2e-16 ***
## adults       30.0942     0.5442   55.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45.94 on 29841 degrees of freedom
## Multiple R-squared:  0.09295,    Adjusted R-squared:  0.09292 
## F-statistic:  3058 on 1 and 29841 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = adr ~ adults + adults * is_repeated_guest, data = numeric_test_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -135.35  -31.71   -6.71   22.74  330.29 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               49.4140     1.0857  45.514  < 2e-16 ***
## adults                    28.6465     0.5624  50.934  < 2e-16 ***
## is_repeated_guest        -19.4695     4.1174  -4.729 2.27e-06 ***
## adults:is_repeated_guest  -4.2869     2.7291  -1.571    0.116    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45.72 on 29839 degrees of freedom
## Multiple R-squared:  0.1015, Adjusted R-squared:  0.1014 
## F-statistic:  1123 on 3 and 29839 DF,  p-value: < 2.2e-16

The R^2 slightly improved from 9.3% to 10.15% with the interaction term of adults and is_repeated_guest. While the number of adults and repeated guest status individually have a significant impact on adr, the interaction between them isn’t statistically significant at the 0.05 level. This could indicate a reason to invest in marketing toward adult groups to increase the average daily rate of bookings.

Target variable adr

Logistic Regression is_canceled

##  [1] "hotel"                          "is_canceled"                   
##  [3] "lead_time"                      "arrival_date_year"             
##  [5] "arrival_date_month"             "arrival_date_week_number"      
##  [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
##  [9] "stays_in_week_nights"           "adults"                        
## [11] "children"                       "babies"                        
## [13] "meal"                           "country"                       
## [15] "market_segment"                 "distribution_channel"          
## [17] "is_repeated_guest"              "previous_cancellations"        
## [19] "previous_bookings_not_canceled" "reserved_room_type"            
## [21] "assigned_room_type"             "booking_changes"               
## [23] "deposit_type"                   "agent"                         
## [25] "company"                        "days_in_waiting_list"          
## [27] "customer_type"                  "adr"                           
## [29] "required_car_parking_spaces"    "total_of_special_requests"     
## [31] "reservation_status"             "reservation_status_date"

is_canceled ~ lead_time + adr + adults + children + arrival_date_year + market_segment + total_of_special_requests + booking_change

## 
## Call:
## glm(formula = is_canceled ~ lead_time + adr + adults + children + 
##     arrival_date_year + market_segment + total_of_special_requests + 
##     booking_changes, family = binomial, data = hotels.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3453  -0.8595  -0.5441   0.9853   4.9367  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -1.057e+01  4.877e+01  -0.217    0.828    
## lead_time                    5.548e-03  8.748e-05  63.422   <2e-16 ***
## adr                          5.464e-03  2.044e-04  26.728   <2e-16 ***
## adults                       3.002e-02  1.834e-02   1.637    0.102    
## children                    -2.813e-02  2.172e-02  -1.295    0.195    
## arrival_date_year            1.058e-02  1.192e-02   0.888    0.375    
## market_segmentAviation      -1.267e+01  5.437e+01  -0.233    0.816    
## market_segmentComplementary -1.208e+01  5.437e+01  -0.222    0.824    
## market_segmentCorporate     -1.251e+01  5.437e+01  -0.230    0.818    
## market_segmentDirect        -1.294e+01  5.437e+01  -0.238    0.812    
## market_segmentGroups        -1.153e+01  5.437e+01  -0.212    0.832    
## market_segmentOffline TA/TO -1.247e+01  5.437e+01  -0.229    0.819    
## market_segmentOnline TA     -1.167e+01  5.437e+01  -0.215    0.830    
## total_of_special_requests   -8.656e-01  1.346e-02 -64.307   <2e-16 ***
## booking_changes             -7.421e-01  1.894e-02 -39.174   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 109950  on 83558  degrees of freedom
## Residual deviance:  91917  on 83544  degrees of freedom
## AIC: 91947
## 
## Number of Fisher Scoring iterations: 9

Start with in-sample unweighted

## [1] 0.7681851

out-of-sample unweighted

## [1] 0.7712624

## [1] 0.2

Optimal cut off prob for in-sample

##     Predicted
## True     0     1
##    0 18226 34578
##    1  2957 27798
## [1] "MR:0.449203556768272"
## [1] "FPR:0.654836754791304"
## [1] "FNR:0.0961469679726874"
## [1] "cost:0.59075623212341"

out-of-sample

##     Predicted
## True     0     1
##    0 19323  3041
##    1  6426  7022
## [1] "MR:0.264352730928181"
## [1] "FPR:0.135977463781077"
## [1] "FNR:0.477840571088638"
## [1] "cost:0.935219425794947"