---
title: "Midterm"
author: "Group 2"
date: "2023-10-08"
output: html_document
---

Introduction

Problem Statement: The goal is to understand the factors that influence the average daily rate (adr) of hotel bookings the most.

Approach: The analysis will use the hotels.csv dataset. The methodology will involve data cleaning, exploratory data analysis (EDA), and model fitting, including linear regression, KNN, logistic regression, and SVM.

Analytic Technique: Linear regression and KNN will be employed to understand and predict the adr. Data preprocessing techniques like winsorization will be applied to refine the model further.

Logistic regression will be used to model is_canceled and to identify how to reduce the odds of cancellation.

Consumer Benefit: This analysis aims to provide insights on the most profitable customers, how to increase their bookings, and decrease cancellations.

Packages Required

ggplot2: Used for visualizations
kknn: Used for the KNN models
dplyr: Used for data manipulation
tidyr: Used for data cleaning and modifying
stringr: Used for character variable manipulation
tidyverse: Used for data manipulation
corrplot: Correlation matrix visualization
DescTools: Used for winsorization
rpart: Builds the tree model
rpart.plot: Creates an advanced plot of the tree

knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
library(ggplot2)
library(kknn)
library(dplyr)
library(tidyr)
library(stringr)
library(tidyverse)
library(corrplot)
library(DescTools)
library(ROCR)
library(readr)
library(reshape2)
library(caret)
library(e1071)
library(rpart)
library(rpart.plot)

Data Preparation

More information on the original data can be found here: hotels.csv

## 'data.frame':    119390 obs. of  32 variables:
##  $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr  "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr  "BB" "BB" "BB" "BB" ...
##  $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr  "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr  "C" "C" "C" "A" ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...

##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##                                                                       
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  NA's   :4                                                                  
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##                                                                          
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##                                                                            
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##                                                                              
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  Length:119390          
##  Class :character       
##  Mode  :character       
##                         
##                         
##                         
##

Clean `adr`

The maximum adr of 5400 is an outlier. It is replaced with the mean of adr.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -6.38   69.29   94.58  101.83  126.00 5400.00

##  [1] hotel                          is_canceled                   
##  [3] lead_time                      arrival_date_year             
##  [5] arrival_date_month             arrival_date_week_number      
##  [7] arrival_date_day_of_month      stays_in_weekend_nights       
##  [9] stays_in_week_nights           adults                        
## [11] children                       babies                        
## [13] meal                           country                       
## [15] market_segment                 distribution_channel          
## [17] is_repeated_guest              previous_cancellations        
## [19] previous_bookings_not_canceled reserved_room_type            
## [21] assigned_room_type             booking_changes               
## [23] deposit_type                   agent                         
## [25] company                        days_in_waiting_list          
## [27] customer_type                  adr                           
## [29] required_car_parking_spaces    total_of_special_requests     
## [31] reservation_status             reservation_status_date       
## <0 rows> (or 0-length row.names)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -6.38   69.29   94.58  101.78  126.00  451.50
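
A minimal sketch of this kind of replacement, assuming the outlier is identified by its value. The data frame `toy` and its values are made up for illustration and stand in for `hotels_data`. Indexing the `adr` column, rather than the whole row, leaves every other variable in that observation unchanged.

```r
# Hypothetical toy data standing in for hotels_data
toy <- data.frame(
  adr    = c(75, 98, 5400, 120),
  adults = c(2, 2, 2, 1)
)

# Mean of adr, computed before the replacement (the outlier is still included)
adr_mean <- mean(toy$adr)

# Replace only the adr cell of the outlying row; other columns are untouched
toy$adr[toy$adr == 5400] <- adr_mean

summary(toy$adr)
```

Assigning to `toy[toy$adr == 5400, ]` instead would overwrite every column in that row, so the column-level index is the safer form.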

Clean `arrival_date_year`

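Found an invalid minimum in arrival_date_year (101.8 in the summary below, which is not a year). The other variables in the affected observations are invalid as well. These observations should be removed.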

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   101.8  2016.0  2016.0  2016.1  2017.0  2017.0

## [1] 2015.0000 2016.0000 2017.0000  101.8311

##  [1] hotel                          is_canceled                   
##  [3] lead_time                      arrival_date_year             
##  [5] arrival_date_month             arrival_date_week_number      
##  [7] arrival_date_day_of_month      stays_in_weekend_nights       
##  [9] stays_in_week_nights           adults                        
## [11] children                       babies                        
## [13] meal                           country                       
## [15] market_segment                 distribution_channel          
## [17] is_repeated_guest              previous_cancellations        
## [19] previous_bookings_not_canceled reserved_room_type            
## [21] assigned_room_type             booking_changes               
## [23] deposit_type                   agent                         
## [25] company                        days_in_waiting_list          
## [27] customer_type                  adr                           
## [29] required_car_parking_spaces    total_of_special_requests     
## [31] reservation_status             reservation_status_date       
## <0 rows> (or 0-length row.names)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   101.8  2016.0  2016.0  2016.1  2017.0  2017.0

## [1] 2015.0000 2016.0000 2017.0000  101.8311

Clean `adults`

All observations with adults > 4 were canceled. These observations are removed on the assumption that each was a data-entry error that was then canceled in order to correct it.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   2.000   1.859   2.000 101.831

##  [1]   2.0000   1.0000   3.0000   4.0000  40.0000  26.0000  50.0000  27.0000
##  [9]  55.0000   0.0000  20.0000   6.0000   5.0000  10.0000 101.8311

##  [1]   2.0000   1.0000   3.0000   4.0000  40.0000  26.0000  50.0000  27.0000
##  [9]  55.0000   0.0000  20.0000   6.0000   5.0000  10.0000 101.8311

## [1] "Canceled" "Canceled" "Canceled" "Canceled" "Canceled" "Canceled"

##             hotel is_canceled lead_time arrival_date_year arrival_date_month
## 2225 Resort Hotel           0         1              2015            October
##      arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 2225                       41                         6                       0
##      stays_in_week_nights adults children babies meal country market_segment
## 2225                    3      0        0      0   SC     PRT      Corporate
##      distribution_channel is_repeated_guest previous_cancellations
## 2225            Corporate                 0                      0
##      previous_bookings_not_canceled reserved_room_type assigned_room_type
## 2225                              0                  A                  I
##      booking_changes deposit_type agent company days_in_waiting_list
## 2225               1   No Deposit  NULL     174                    0
##        customer_type adr required_car_parking_spaces total_of_special_requests
## 2225 Transient-Party   0                           0                         0
##      reservation_status reservation_status_date
## 2225          Check-Out              2015-10-06

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   2.000   1.853   2.000   4.000

## [1] 2 1 3 4 0
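
The check and the filter described in this step can be sketched as follows. The data frame `toy` is hypothetical; only the column names `adults` and `reservation_status` come from the dataset.

```r
# Hypothetical toy data standing in for hotels_data
toy <- data.frame(
  adults = c(2, 1, 40, 4, 55, 0),
  reservation_status = c("Check-Out", "Check-Out", "Canceled",
                         "Check-Out", "Canceled", "Check-Out")
)

# Verify the claim first: is every booking with more than 4 adults canceled?
all_large_canceled <- all(toy$reservation_status[toy$adults > 4] == "Canceled")

# Keep only bookings with at most 4 adults
toy_clean <- toy[toy$adults <= 4, ]

unique(toy_clean$adults)
```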

Clean `babies`

The babies variable contains implausible values: a few bookings report 9 or 10 babies. These appear to be one-off entry errors, so the values are changed to 0.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.00795  0.00000 10.00000

## [1]  0  1  2 10  9

## [1]  0  1  2 10  9

## [1] 0 1 2

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.007791 0.000000 2.000000

## [1] 0 1 2
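
A minimal sketch of the recoding, using a made-up vector in place of `hotels_data$babies`. The threshold of 9 is an assumption taken from the values listed in this step.

```r
# Hypothetical values standing in for hotels_data$babies
babies <- c(0, 1, 2, 10, 9, 0)

# Treat counts of 9 or more as entry errors and set them to 0
babies[babies >= 9] <- 0

unique(babies)
```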

Reviewing NA values

Found 4 NA in children and changed them to 0.

##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0

##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              0                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0
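
The two NA tables above correspond to a count before and after the replacement. A sketch of that pattern on a hypothetical data frame `toy`:

```r
# Hypothetical toy data with missing values in children
toy <- data.frame(
  children = c(0, NA, 2, NA),
  babies   = c(0, 0, 1, 0)
)

# NA count per column before the fix
na_before <- colSums(is.na(toy))

# Replace missing children counts with 0
toy$children[is.na(toy$children)] <- 0

# NA count per column after the fix
na_after <- colSums(is.na(toy))
na_after
```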

Exploratory Data Analysis

Boxplots of all numeric variables

Correlation Plot

Upon review of the correlation plot we can see that:

Strongest Positive Correlations - Visual review:

children ~ adr
adults ~ adr
This indicates that as the number of children or adults in a booking increases, the average daily rate increases, likely because larger rooms are more expensive.
stays_in_weekend_nights ~ stays_in_week_nights
Bookings that include weekend nights also tend to have longer weekday stays.
is_canceled ~ lead_time
Bookings made well in advance are more likely to be canceled.

Strongest Negative Correlations - Visual review:

required_car_parking_spaces ~ is_canceled
booking_changes ~ is_canceled
total_of_special_requests ~ is_canceled
Bookings with these attributes are less likely to be canceled.
required_car_parking_spaces ~ lead_time
adr ~ is_repeated_guest
adults ~ is_repeated_guest
Repeated guests tend to book less expensive rooms or may benefit from loyalty discounts.
is_repeated_guest ~ lead_time
Repeated guests might book closer to their stay date compared to first-time guests.

Target Variable adr Correlations

children ~ adr
adults ~ adr
adr ~ is_repeated_guest
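
The visual review could be complemented by ranking the correlations with the target numerically. The sketch below uses simulated data: the data frame `toy`, its coefficients, and the noise level are all made up for illustration and are not the report's results.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the numeric columns of hotels_data
toy <- data.frame(
  adults            = sample(1:4, n, replace = TRUE),
  children          = sample(0:2, n, replace = TRUE),
  is_repeated_guest = rbinom(n, 1, 0.1)
)
toy$adr <- 50 + 25 * toy$adults + 20 * toy$children -
  15 * toy$is_repeated_guest + rnorm(n, sd = 20)

# Pearson correlation matrix of all numeric columns
cor_matrix <- cor(toy)

# Correlations of each predictor with adr, strongest positive first
adr_cor <- sort(cor_matrix["adr", colnames(cor_matrix) != "adr"],
                decreasing = TRUE)
adr_cor
```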

Modelling

Create training and testing sets

# Set the seed
set.seed(21)

# Randomly sample row indices for the training set
train_indices <- sample(1:NROW(hotels_data),NROW(hotels_data)*0.75)
 
# Create the training set
train_data <- hotels_data[train_indices, ]
numeric_train_data <- train_data[sapply(train_data, is.numeric)]

# Create the testing set
test_data <- hotels_data[-train_indices, ]
numeric_test_data <- test_data[sapply(test_data, is.numeric)]

print(paste("We have split the data using random sampling into a training set of 75%:", nrow(train_data), "observations and a testing set of 25%:", nrow(test_data), "observations"))

## [1] "We have split the data using random sampling into a training set of 75%: 89528 observations and a testing set of 25%: 29843 observations"

Linear Regression

Winsorization function

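# Winsorize every column of a numeric data frame: values below the `lower`
# quantile or above the `upper` quantile are replaced by that quantile.
winsorize_all <- function(data, lower = 0.05, upper = 0.95) {
  # Assigning into data[] keeps the data frame structure (lapply alone returns a list)
  data[] <- lapply(data, function(x) Winsorize(x, probs = c(lower, upper)))
  return(data)
}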
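
To make the effect of winsorization concrete, here is a base-R sketch of the same idea for a single vector. The helper `winsorize_vec` is hypothetical and is not the DescTools implementation; it simply clamps values to the chosen quantiles.

```r
# Hypothetical helper: clamp a numeric vector to its lower/upper quantiles
winsorize_vec <- function(x, lower = 0.05, upper = 0.95) {
  limits <- unname(quantile(x, probs = c(lower, upper), na.rm = TRUE))
  # Values below limits[1] are raised to it; values above limits[2] are lowered to it
  pmin(pmax(x, limits[1]), limits[2])
}

# One extreme value among otherwise small numbers
x <- c(1:19, 1000)
w <- winsorize_vec(x)

range(x)
range(w)
```

The extreme value is pulled in to the 95th percentile while the middle of the distribution is left unchanged, which is why the technique reduces the influence of outliers on a least-squares fit.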

Target variable `adr` - Average Daily Rate

In sample linear model/winsorization comparison

##        Actual Predicted
## 16562    81.0  96.45532
## 12105    80.1 103.70450
## 76839    89.0 104.74265
## 55685   115.0  96.35001
## 106096  105.0  90.61195
## 68024   130.0  97.38073

##   Actual Predicted_Winsorized
## 1   81.0             78.39981
## 2   80.1            114.89974
## 3   89.0             87.32924
## 4  115.0             93.52211
## 5  105.0             96.42189
## 6  130.0            107.44216

In sample MSE comparison to winsorization model

The linear model MSE improves after winsorizing the training set, which suggests that winsorizing improves the model's fit to the data.

## [1] "Training MSE for Linear Model: 1739.62"

## [1] "Training MSE for Winsorized Linear Model: 1671.39"

Out of sample linear model/winsorization comparison

hotels_test_data_winsorized <- winsorize_all(numeric_test_data)

lm_model_test <- lm(adr ~ ., data = numeric_test_data)

lm_model_test_win <- lm(adr ~ ., data = hotels_test_data_winsorized)

comparison_df <- data.frame(Actual = numeric_test_data$adr, lm_predicted = lm_model_test$fitted.values)
head(comparison_df)

##    Actual lm_predicted
## 8  103.00    105.31938
## 10 105.50    105.41917
## 23  84.67    103.24816
## 33 108.30    122.30285
## 37  98.00     97.38399
## 39 108.80    132.21362

comparison_df_win <- data.frame(Actual = hotels_test_data_winsorized$adr, Predicted_winsorized = lm_model_test_win$fitted.values)
head(comparison_df_win)

MSE comparison of In sample and out of sample testing and training models

The winsorized model has a lower MSE than the standard model on both the training and testing datasets, as shown below, which suggests that winsorizing makes the model less sensitive to outliers.

lm_mse_train <- mean((lm_model_train$fitted.values - numeric_train_data$adr)^2)
print(paste("Training MSE for Linear Model:", round(lm_mse_train, 2)))

## [1] "Training MSE for Linear Model: 1739.62"

lm_mse_train_win <- mean((lm_model_train_win$fitted.values - numeric_train_data$adr)^2)
print(paste("Training MSE for Winsorized Linear Model:", round(lm_mse_train_win, 2)))

## [1] "Training MSE for Winsorized Linear Model: 1671.39"

lm_mse_test <- mean((lm_model_test$fitted.values - numeric_test_data$adr)^2)
print(paste("Test MSE for Linear Model:", round(lm_mse_test, 2)))

## [1] "Test MSE for Linear Model: 1763.21"

lm_mse_test_win <- mean((lm_model_test_win$fitted.values - numeric_test_data$adr)^2)
print(paste("Test MSE for Winsorized Linear Model:", round(lm_mse_test_win, 2)))

## [1] "Test MSE for Winsorized Linear Model: 1679.17"
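
The test-set MSEs above use the fitted values of models estimated on the test rows themselves. A common alternative for measuring out-of-sample error is to fit on the training rows only and score the held-out rows with `predict()`. The sketch below illustrates that pattern on simulated data; the data frame `toy` and its coefficients are made up for illustration and are not the report's results.

```r
set.seed(21)
n <- 400

# Simulated stand-in for the numeric hotel data
toy <- data.frame(
  adults   = sample(1:4, n, replace = TRUE),
  children = sample(0:2, n, replace = TRUE)
)
toy$adr <- 45 + 30 * toy$adults + 20 * toy$children + rnorm(n, sd = 40)

# 75/25 random split
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Fit on the training rows only
fit <- lm(adr ~ ., data = toy_train)

# In-sample error: fitted values against the training response
mse_train <- mean((fitted(fit) - toy_train$adr)^2)

# Out-of-sample error: predictions for rows the model has not seen
pred_test <- predict(fit, newdata = toy_test)
mse_test  <- mean((pred_test - toy_test$adr)^2)

c(train = mse_train, test = mse_test)
```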

This scatter plot indicates a wide variation in adr across all levels of adults. The slope is positive, showing a positive relationship between the average daily rate and the number of adults in the booking. This is most likely due to the need for additional beds and/or space.

This scatter plot shows a positive relationship between adr and children. This is most likely due to the need for additional space, beds, or amenities that children prefer.

This plot indicates a negative relationship between adr and is_repeated_guest. This may be due to loyalty programs or to repeat guests being 'in the know' about deals.

## 
## Call:
## lm(formula = adr ~ adults, data = numeric_test_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -136.19  -31.10   -6.19   23.00  330.90 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.9101     1.0436   43.99   <2e-16 ***
## adults       30.0942     0.5442   55.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45.94 on 29841 degrees of freedom
## Multiple R-squared:  0.09295,    Adjusted R-squared:  0.09292 
## F-statistic:  3058 on 1 and 29841 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = adr ~ adults + adults * is_repeated_guest, data = numeric_test_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -135.35  -31.71   -6.71   22.74  330.29 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               49.4140     1.0857  45.514  < 2e-16 ***
## adults                    28.6465     0.5624  50.934  < 2e-16 ***
## is_repeated_guest        -19.4695     4.1174  -4.729 2.27e-06 ***
## adults:is_repeated_guest  -4.2869     2.7291  -1.571    0.116    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45.72 on 29839 degrees of freedom
## Multiple R-squared:  0.1015, Adjusted R-squared:  0.1014 
## F-statistic:  1123 on 3 and 29839 DF,  p-value: < 2.2e-16

R^2 improved slightly, from 9.3% to 10.15%, after adding is_repeated_guest and its interaction with adults. While the number of adults and repeated-guest status each have a significant effect on adr, the interaction between them is not statistically significant at the 0.05 level (p = 0.116). This could indicate a reason to invest in marketing toward adult groups to increase the average daily rate of bookings.
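
In an R formula, `adults * is_repeated_guest` already expands to both main effects plus their interaction, so the formula in the call above is equivalent to the shorter form. A small check on simulated data (the data frame `toy` and its coefficients are made up for illustration):

```r
set.seed(2)
n <- 300

# Simulated stand-in for the numeric hotel data
toy <- data.frame(
  adults            = sample(1:4, n, replace = TRUE),
  is_repeated_guest = rbinom(n, 1, 0.2)
)
toy$adr <- 50 + 29 * toy$adults - 19 * toy$is_repeated_guest + rnorm(n, sd = 45)

# Formula as written in the report
m_report <- lm(adr ~ adults + adults * is_repeated_guest, data = toy)

# Shorter equivalent: a * b means a + b + a:b
m_short <- lm(adr ~ adults * is_repeated_guest, data = toy)

names(coef(m_report))
same_fit <- isTRUE(all.equal(coef(m_report), coef(m_short)))
same_fit
```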

Logistic Regression `is_canceled`

is_canceled was chosen as the target variable because cancellations can have significant implications for revenue.

## 
## Call:
## glm(formula = is_canceled ~ lead_time + adr + adults + children + 
##     arrival_date_year + market_segment + total_of_special_requests + 
##     booking_changes, family = binomial, data = hotels.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3453  -0.8595  -0.5441   0.9853   4.9367  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -1.057e+01  4.877e+01  -0.217    0.828    
## lead_time                    5.548e-03  8.748e-05  63.422   <2e-16 ***
## adr                          5.464e-03  2.044e-04  26.728   <2e-16 ***
## adults                       3.002e-02  1.834e-02   1.637    0.102    
## children                    -2.813e-02  2.172e-02  -1.295    0.195    
## arrival_date_year            1.058e-02  1.192e-02   0.888    0.375    
## market_segmentAviation      -1.267e+01  5.437e+01  -0.233    0.816    
## market_segmentComplementary -1.208e+01  5.437e+01  -0.222    0.824    
## market_segmentCorporate     -1.251e+01  5.437e+01  -0.230    0.818    
## market_segmentDirect        -1.294e+01  5.437e+01  -0.238    0.812    
## market_segmentGroups        -1.153e+01  5.437e+01  -0.212    0.832    
## market_segmentOffline TA/TO -1.247e+01  5.437e+01  -0.229    0.819    
## market_segmentOnline TA     -1.167e+01  5.437e+01  -0.215    0.830    
## total_of_special_requests   -8.656e-01  1.346e-02 -64.307   <2e-16 ***
## booking_changes             -7.421e-01  1.894e-02 -39.174   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 109950  on 83558  degrees of freedom
## Residual deviance:  91917  on 83544  degrees of freedom
## AIC: 91947
## 
## Number of Fisher Scoring iterations: 9
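
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are easier to interpret. The sketch below copies the four significant estimates from the summary above; the vector name `log_odds` is just a label chosen for this example.

```r
# Estimates copied from the summary output above (log-odds scale)
log_odds <- c(
  lead_time                 = 5.548e-03,
  adr                       = 5.464e-03,
  total_of_special_requests = -8.656e-01,
  booking_changes           = -7.421e-01
)

# Odds ratio: multiplicative change in the odds of cancellation
# for a one-unit increase in the predictor, other predictors held fixed
odds_ratios <- exp(log_odds)
round(odds_ratios, 3)

# Same information as a percent change in the odds
round(100 * (odds_ratios - 1), 1)
```

For example, each additional special request multiplies the odds of cancellation by about 0.42, and each additional booking change by about 0.48, holding the other predictors fixed.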

In-sample Unweighted

##     Predicted
## True     0     1
##    0 36687 16117
##    1  9220 21535

## [1] "MR:0.303222872461375"

## [1] 0.7681851

Out-of-sample Unweighted

##     Predicted
## True     0     1
##    0 19323  3041
##    1  6426  7022

## [1] "MR:0.264352730928181"

## [1] 0.7712624

The unweighted testing AUC (0.771) was marginally better than the unweighted training AUC (0.768), and the testing MR (0.264) was lower than the training MR (0.303), meaning the model performed better on the testing set.

## Optimal pcut

## [1] 0.2

In-sample Weighted

##     Predicted
## True     0     1
##    0 18226 34578
##    1  2957 27798

## [1] "MR:0.449203556768272"

## [1] "FPR:0.654836754791304"

## [1] "FNR:0.0961469679726874"

## [1] "cost:0.59075623212341"

Out-of-sample Weighted

##     Predicted
## True     0     1
##    0 19323  3041
##    1  6426  7022

## [1] "MR:0.264352730928181"

## [1] "FPR:0.135977463781077"

## [1] "FNR:0.477840571088638"

## [1] "cost:0.935219425794947"

##               Model        MR        FNR       FPR      Cost
## 1 Weighted Training 0.4492036 0.09614697 0.6548368 0.5907562
## 2  Weighted Testing 0.2643527 0.47784057 0.1359775 0.9352194
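
The weighted training metrics can be recomputed from the in-sample weighted confusion matrix above. The cost weights are not stated in the text; the 5:1 weighting of false negatives to false positives used here is an assumption, chosen because it reproduces the reported training cost.

```r
# In-sample weighted confusion matrix, copied from the output above
cm <- matrix(c(18226, 34578,
                2957, 27798),
             nrow = 2, byrow = TRUE,
             dimnames = list(True = c("0", "1"), Predicted = c("0", "1")))

fp <- cm["0", "1"]  # not canceled, predicted canceled
fn <- cm["1", "0"]  # canceled, predicted not canceled
n  <- sum(cm)

mr  <- (fp + fn) / n          # misclassification rate
fpr <- fp / sum(cm["0", ])    # false positive rate
fnr <- fn / sum(cm["1", ])    # false negative rate

# Assumed asymmetric cost: a false negative counts 5 times a false positive
cost <- (5 * fn + 1 * fp) / n

round(c(MR = mr, FPR = fpr, FNR = fnr, cost = cost), 7)
```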

Tree Classification Model

## 'data.frame':    119371 obs. of  18 variables:
##  $ is_canceled                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ lead_time                     : num  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : num  2015 2015 2015 2015 2015 ...
##  $ arrival_date_week_number      : num  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : num  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : num  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_repeated_guest             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ booking_changes               : num  3 4 0 0 0 0 0 0 0 0 ...
##  $ days_in_waiting_list          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : num  0 0 0 0 1 1 0 1 1 0 ...

##     0     1 
## 59717 29811

##     pred
## true     0     1
##    0 47002  9240
##    1 12715 20571

## [1] "Accuracy on Training Data: 0.245230542400143"

##     pred
## true     0     1
##    0 15879  3047
##    1  4148  6769

## [1] 0.2410951

## [1] "Accuracy on Testing Data: 0.241095064169152"

##     pred
## true     0     1
##    0 47002  9240
##    1 12715 20571

##     pred
## true     0     1
##    0 14623  4303
##    1  3657  7260

## [1] 0.2452305
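
The value 0.2452 reported for the training data equals the off-diagonal share of the training confusion matrix, which is the misclassification rate; accuracy is its complement. A short check using the counts from the output above:

```r
# Training confusion matrix of the classification tree, copied from above
cm_train <- matrix(c(47002,  9240,
                     12715, 20571),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(true = c("0", "1"), pred = c("0", "1")))

# Share of correct predictions (diagonal of the matrix)
accuracy <- sum(diag(cm_train)) / sum(cm_train)

# Share of incorrect predictions (off-diagonal)
error_rate <- 1 - accuracy

round(c(accuracy = accuracy, error_rate = error_rate), 7)
```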

Obtain ROC and AUC on training set (use predicted probabilities).

## [1] 0.7696875

## [1] 0.7713274
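
AUC can be read as the probability that a randomly chosen canceled booking receives a higher predicted probability than a randomly chosen non-canceled one. The report loads ROCR; the sketch below instead uses the equivalent rank-based (Mann-Whitney) formula in base R, with a hypothetical helper `auc_rank` and made-up scores.

```r
# Hypothetical helper: AUC from predicted scores and 0/1 labels via ranks
auc_rank <- function(scores, labels) {
  r     <- rank(scores)        # average ranks, so ties count as one half
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  # Mann-Whitney U statistic for the positives, scaled to [0, 1]
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up predicted probabilities and true labels
scores <- c(0.1, 0.4, 0.35, 0.8)
labels <- c(0, 0, 1, 1)

auc_rank(scores, labels)  # 0.75: three of the four positive/negative pairs are ordered correctly
```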

Regression Tree

Target variable `adr`

## Call:
## rpart(formula = adr ~ ., data = numeric_data)
##   n= 119371 
## 
##           CP nsplit rel error    xerror        xstd
## 1 0.10611870      0 1.0000000 1.0000105 0.005831705
## 2 0.06906433      1 0.8938813 0.8939234 0.004966323
## 3 0.05123835      2 0.8248170 0.8248877 0.004714844
## 4 0.05075714      3 0.7735786 0.7538636 0.004473344
## 5 0.02338833      4 0.7228215 0.7228448 0.004381954
## 6 0.01671854      5 0.6994332 0.6993238 0.004301035
## 7 0.01339108      6 0.6827146 0.6839629 0.004224456
## 8 0.01112683      7 0.6693235 0.6705173 0.004168324
## 9 0.01000000      8 0.6581967 0.6582171 0.004143095
## 
## Variable importance
## arrival_date_week_number                 children                   adults 
##                       34                       34                       20 
##                lead_time        arrival_date_year   previous_cancellations 
##                        7                        4                        1 
## 
## Node number 1: 119371 observations,    complexity param=0.1061187
##   mean=101.7911, MSE=2315.247 
##   left son=2 (110781 obs) right son=3 (8590 obs)
##   Primary splits:
##       children                  < 0.5    to the left,  improve=0.10611870, (0 missing)
##       adults                    < 2.5    to the left,  improve=0.07454214, (0 missing)
##       arrival_date_week_number  < 13.5   to the left,  improve=0.07166795, (0 missing)
##       arrival_date_year         < 2016.5 to the left,  improve=0.03680391, (0 missing)
##       total_of_special_requests < 0.5    to the left,  improve=0.02949121, (0 missing)
##   Surrogate splits:
##       adults < 0.5    to the right, agree=0.928, adj=0.004, (0 split)
## 
## Node number 2: 110781 observations,    complexity param=0.06906433
##   mean=97.42633, MSE=1914.985 
##   left son=4 (105046 obs) right son=5 (5735 obs)
##   Primary splits:
##       adults                    < 2.5    to the left,  improve=0.08997446, (0 missing)
##       arrival_date_week_number  < 13.5   to the left,  improve=0.07515351, (0 missing)
##       arrival_date_year         < 2016.5 to the left,  improve=0.03555443, (0 missing)
##       total_of_special_requests < 0.5    to the left,  improve=0.03191124, (0 missing)
##       is_repeated_guest         < 0.5    to the right, improve=0.02100150, (0 missing)
## 
## Node number 3: 8590 observations,    complexity param=0.01112683
##   mean=158.081, MSE=4062.978 
##   left son=6 (4861 obs) right son=7 (3729 obs)
##   Primary splits:
##       children                 < 1.5    to the left,  improve=0.088110920, (0 missing)
##       arrival_date_week_number < 13.5   to the left,  improve=0.086564980, (0 missing)
##       adults                   < 1.5    to the left,  improve=0.041024670, (0 missing)
##       arrival_date_year        < 2016.5 to the left,  improve=0.038691800, (0 missing)
##       lead_time                < 7.5    to the left,  improve=0.008326848, (0 missing)
##   Surrogate splits:
##       adults                    < 0.5    to the right, agree=0.591, adj=0.058, (0 split)
##       total_of_special_requests < 0.5    to the right, agree=0.582, adj=0.036, (0 split)
##       stays_in_week_nights      < 8.5    to the left,  agree=0.569, adj=0.006, (0 split)
##       stays_in_weekend_nights   < 4.5    to the left,  agree=0.567, adj=0.002, (0 split)
##       lead_time                 < 334.5  to the left,  agree=0.566, adj=0.001, (0 split)
## 
## Node number 4: 105046 observations,    complexity param=0.05123835
##   mean=94.35929, MSE=1723.278 
##   left son=8 (20937 obs) right son=9 (84109 obs)
##   Primary splits:
##       arrival_date_week_number  < 13.5   to the left,  improve=0.07822694, (0 missing)
##       arrival_date_year         < 2016.5 to the left,  improve=0.03106447, (0 missing)
##       adults                    < 1.5    to the left,  improve=0.02886486, (0 missing)
##       total_of_special_requests < 0.5    to the left,  improve=0.02853794, (0 missing)
##       is_repeated_guest         < 0.5    to the right, improve=0.02130618, (0 missing)
##   Surrogate splits:
##       lead_time                 < 543.5  to the right, agree=0.802, adj=0.008, (0 split)
##       stays_in_week_nights      < 13.5   to the right, agree=0.802, adj=0.004, (0 split)
##       stays_in_weekend_nights   < 5.5    to the right, agree=0.801, adj=0.003, (0 split)
##       arrival_date_year         < 1007.5 to the left,  agree=0.801, adj=0.000, (0 split)
##       arrival_date_day_of_month < 0.5    to the left,  agree=0.801, adj=0.000, (0 split)
## 
## Node number 5: 5735 observations
##   mean=153.6042, MSE=2098.161 
## 
## Node number 6: 4861 observations
##   mean=141.5092, MSE=3221.267 
## 
## Node number 7: 3729 observations
##   mean=179.6835, MSE=4335.545 
## 
## Node number 8: 20937 observations
##   mean=71.08803, MSE=689.0327 
## 
## Node number 9: 84109 observations,    complexity param=0.05075714
##   mean=100.1521, MSE=1812.366 
##   left son=18 (22080 obs) right son=19 (62029 obs)
##   Primary splits:
##       arrival_date_week_number  < 40.5   to the right, improve=0.09202481, (0 missing)
##       arrival_date_year         < 2016.5 to the left,  improve=0.08699909, (0 missing)
##       total_of_special_requests < 0.5    to the left,  improve=0.03557023, (0 missing)
##       lead_time                 < 220.5  to the right, improve=0.02998976, (0 missing)
##       previous_cancellations    < 0.5    to the right, improve=0.02180009, (0 missing)
##   Surrogate splits:
##       lead_time               < 480.5  to the right, agree=0.739, adj=0.004, (0 split)
##       days_in_waiting_list    < 385    to the right, agree=0.738, adj=0.002, (0 split)
##       previous_cancellations  < 25.5   to the right, agree=0.738, adj=0.001, (0 split)
##       stays_in_weekend_nights < 6.5    to the right, agree=0.738, adj=0.000, (0 split)
##       stays_in_week_nights    < 15.5   to the right, agree=0.738, adj=0.000, (0 split)
## 
## Node number 18: 22080 observations
##   mean=78.50635, MSE=1202.244 
## 
## Node number 19: 62029 observations,    complexity param=0.02338833
##   mean=107.8572, MSE=1803.395 
##   left son=38 (15907 obs) right son=39 (46122 obs)
##   Primary splits:
##       lead_time                 < 188.5  to the right, improve=0.05778426, (0 missing)
##       arrival_date_year         < 2016.5 to the left,  improve=0.05313649, (0 missing)
##       total_of_special_requests < 0.5    to the left,  improve=0.04456134, (0 missing)
##       previous_cancellations    < 0.5    to the right, improve=0.03764751, (0 missing)
##       arrival_date_week_number  < 21.5   to the left,  improve=0.02683526, (0 missing)
##   Surrogate splits:
##       previous_cancellations < 0.5    to the right, agree=0.775, adj=0.124, (0 split)
##       days_in_waiting_list   < 59.5   to the right, agree=0.755, adj=0.045, (0 split)
## 
## Node number 38: 15907 observations
##   mean=90.4748, MSE=1002.185 
## 
## Node number 39: 46122 observations,    complexity param=0.01671854
##   mean=113.8522, MSE=1939.576 
##   left son=78 (11842 obs) right son=79 (34280 obs)
##   Primary splits:
##       arrival_date_week_number  < 19.5   to the left,  improve=0.05165108, (0 missing)
##       arrival_date_year         < 2016.5 to the left,  improve=0.04279114, (0 missing)
##       total_of_special_requests < 0.5    to the left,  improve=0.02750329, (0 missing)
##       adults                    < 1.5    to the left,  improve=0.02706598, (0 missing)
##       is_repeated_guest         < 0.5    to the right, improve=0.02437876, (0 missing)
##   Surrogate splits:
##       days_in_waiting_list < 40.5   to the right, agree=0.748, adj=0.017, (0 split)
## 
## Node number 78: 11842 observations
##   mean=96.82277, MSE=1126.113 
## 
## Node number 79: 34280 observations,    complexity param=0.01339108
##   mean=119.7351, MSE=2085.798 
##   left son=158 (22830 obs) right son=159 (11450 obs)
##   Primary splits:
##       arrival_date_year         < 2016.5 to the left,  improve=0.05176054, (0 missing)
##       total_of_special_requests < 0.5    to the left,  improve=0.02735772, (0 missing)
##       adults                    < 1.5    to the left,  improve=0.02654263, (0 missing)
##       is_repeated_guest         < 0.5    to the right, improve=0.02199786, (0 missing)
##       stays_in_week_nights      < 2.5    to the left,  improve=0.01555139, (0 missing)
##   Surrogate splits:
##       arrival_date_week_number       < 22.5   to the right, agree=0.684, adj=0.054, (0 split)
##       lead_time                      < 168.5  to the left,  agree=0.667, adj=0.004, (0 split)
##       previous_bookings_not_canceled < 11.5   to the left,  agree=0.667, adj=0.002, (0 split)
##       total_of_special_requests      < 3.5    to the left,  agree=0.667, adj=0.002, (0 split)
##       required_car_parking_spaces    < 1.5    to the left,  agree=0.666, adj=0.000, (0 split)
## 
## Node number 158: 22830 observations
##   mean=112.3766, MSE=1906.504 
## 
## Node number 159: 11450 observations
##   mean=134.4069, MSE=2120.063

## [1] 1519.47

## [1] 1537.142
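
The summary above comes from the call `rpart(formula = adr ~ ., data = numeric_data)`. The sketch below shows the same kind of workflow on synthetic data. The data frame `toy`, its columns, the 75/25 split, and the interpretation of the two printed values as training and testing mean squared errors are all assumptions made for illustration, and `numeric_data` itself is not reproduced here.

```r
library(rpart)

set.seed(42)
n <- 1000

# Synthetic stand-in for the numeric hotel data (hypothetical columns).
toy <- data.frame(
  children = rbinom(n, size = 2, prob = 0.1),
  adults   = sample(1:3, n, replace = TRUE),
  week     = sample(1:53, n, replace = TRUE)
)
toy$adr <- 90 + 50 * (toy$children > 0) + 20 * (toy$adults > 2) +
  rnorm(n, sd = 10)

# Hold out 25% of the rows for testing.
train_idx <- sample(n, size = 0.75 * n)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# A numeric response makes rpart fit a regression ("anova") tree by default.
fit <- rpart(adr ~ ., data = train)

# Mean squared error in-sample and out-of-sample.
mse_train <- mean((train$adr - predict(fit, newdata = train))^2)
mse_test  <- mean((test$adr  - predict(fit, newdata = test))^2)
c(train = mse_train, test = mse_test)
```

Calling `summary(fit)` on such an object prints the CP table, variable importance, and per-node splits in the format shown above.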

R-squared and adjusted R-squared for training data

## [1] 0.3426609

## 
## Regression tree:
## rpart(formula = adr ~ ., data = numeric_data)
## 
## Variables actually used in tree construction:
## [1] adults                   arrival_date_week_number arrival_date_year       
## [4] children                 lead_time               
## 
## Root node error: 276373307/119371 = 2315.2
## 
## n= 119371 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.106119      0   1.00000 1.00001 0.0058317
## 2 0.069064      1   0.89388 0.89392 0.0049663
## 3 0.051238      2   0.82482 0.82489 0.0047148
## 4 0.050757      3   0.77358 0.75386 0.0044733
## 5 0.023388      4   0.72282 0.72284 0.0043820
## 6 0.016719      5   0.69943 0.69932 0.0043010
## 7 0.013391      6   0.68271 0.68396 0.0042245
## 8 0.011127      7   0.66932 0.67052 0.0041683
## 9 0.010000      8   0.65820 0.65822 0.0041431
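
In the CP table, `rel error` is the tree's residual sum of squares divided by that of the root node, so one minus the final `rel error` is the R-squared on the data used to fit the tree. A short worked calculation with the values printed above follows; the variable names are hypothetical.

```r
# Values copied from the printcp output above.
root_node_error <- 276373307 / 119371  # MSE when predicting the overall mean
final_rel_error <- 0.6581967           # rel error after 8 splits (last row)

r_squared <- 1 - final_rel_error                 # share of variance explained
tree_mse  <- final_rel_error * root_node_error   # implied MSE of the fitted tree

round(c(root_node_error = root_node_error,
        r_squared = r_squared,
        tree_mse = tree_mse), 4)
```

The `xerror` column plays the same role for cross-validated error, so `1 - xerror` gives a cross-validated analogue of R-squared.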

Out-of-sample R-squared

## Out-of-Sample R-squared: 0.3392447
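
Out-of-sample R-squared compares the model's squared prediction error on held-out data with that of a constant baseline. The sketch below is one common definition, not necessarily the one used to produce the value above: the function name `oos_r_squared` is hypothetical, and using the mean of the held-out response as the baseline (rather than the training mean) is an assumption.

```r
# Out-of-sample R-squared: 1 - SSE / SST on held-out data.
oos_r_squared <- function(actual, predicted, baseline = mean(actual)) {
  stopifnot(length(actual) == length(predicted))
  sse <- sum((actual - predicted)^2)  # model's squared error
  sst <- sum((actual - baseline)^2)   # squared error of the constant baseline
  1 - sse / sst
}

# Toy held-out rates and predictions (not taken from hotels.csv).
toy_actual    <- c(100, 120, 80, 150)
toy_predicted <- c(110, 115, 90, 140)
oos_r_squared(toy_actual, toy_predicted)
```

Passing the training-set mean through the `baseline` argument gives the alternative definition, which can be negative when the model predicts worse than that mean.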

Regression tree model: AUC for training data: 0.5399; AUC for testing data: 0.5467

Linear model: AUC for training data: 0.7682; AUC for testing data: 0.7713

Conclusion

The models were evaluated with confusion matrices and AUC values, both in-sample (training) and out-of-sample (testing). The key findings are summarized below:

Linear Model:

In-Sample AUC: 0.7682

Out-of-Sample AUC: 0.7713

Regression Tree Model:

In-Sample AUC: 0.5399

Out-of-Sample AUC: 0.5467

Confusion Matrix Metrics:

Linear Model (In-Sample):

Misclassification Rate (MR): 0.2644

False Positive Rate (FPR): 0.1360

False Negative Rate (FNR): 0.4778

Cost: 0.9352

Linear Model (Out-of-Sample):

MR: 0.2644

FPR: 0.1360

FNR: 0.4778

Cost: 0.9352

Regression Tree Model (In-Sample):

MR: 0.4492

FPR: 0.6548

FNR: 0.0961

Cost: 0.5908

Regression Tree Model (Out-of-Sample):

MR: 0.2644

FPR: 0.1360

FNR: 0.4778

Cost: 0.9352

The linear model outperforms the regression tree model on AUC, at about 0.77 versus about 0.54 for both in-sample and out-of-sample data.

The confusion matrix metrics add detail on the kinds of errors each model makes: the misclassification rate, the false positive rate, the false negative rate, and the associated cost.

These results should be interpreted in light of the goals of the modeling task, since the cost assigned to each type of misclassification can be decisive when choosing a model.

Overall, the linear model is the better-performing model on these evaluation metrics.
