---
title: "Midterm"
author: "Group 2"
date: "2023-10-08"
output: html_document
---
Problem Statement: The goal is to identify the factors that most influence the average daily rate (adr) of hotel bookings.
Approach: The analysis uses the hotels.csv dataset. The methodology involves data cleaning, exploratory data analysis (EDA), and model fitting, including linear regression, KNN, logistic regression, and SVM.
Analytic Technique: Linear regression and KNN will be employed to understand and predict adr. Data preprocessing techniques such as winsorization will be applied to refine the models further.
Logistic regression will be used to model is_canceled and to understand how to reduce the odds of cancellation.
Consumer Benefit: This analysis aims to provide insights into the most profitable customers, how to increase their bookings, and how to decrease cancellations.
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
library(ggplot2)
library(kknn)
library(dplyr)
library(tidyr)
library(stringr)
library(tidyverse)
library(corrplot)
library(DescTools)
library(ROCR)
library(readr)
library(reshape2)
library(caret)
library(e1071)
More information on the original data can be found here: hotels.csv
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr "July" "July" "July" "July" ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr "BB" "BB" "BB" "BB" ...
## $ country : chr "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr "C" "C" "A" "A" ...
## $ assigned_room_type : chr "C" "C" "C" "A" ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr "NULL" "NULL" "NULL" "304" ...
## $ company : chr "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
## hotel is_canceled lead_time arrival_date_year
## Length:119390 Min. :0.0000 Min. : 0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Mode :character Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.0000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## NA's :4
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03191
## 3rd Qu.:0.00000
## Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
##
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.58
## Mean : 2.321 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## Length:119390
## Class :character
## Mode :character
##
##
##
##
adr
Setting the outlier adr value of 5400 to the mean of adr.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6.38 69.29 94.58 101.83 126.00 5400.00
## [1] hotel is_canceled
## [3] lead_time arrival_date_year
## [5] arrival_date_month arrival_date_week_number
## [7] arrival_date_day_of_month stays_in_weekend_nights
## [9] stays_in_week_nights adults
## [11] children babies
## [13] meal country
## [15] market_segment distribution_channel
## [17] is_repeated_guest previous_cancellations
## [19] previous_bookings_not_canceled reserved_room_type
## [21] assigned_room_type booking_changes
## [23] deposit_type agent
## [25] company days_in_waiting_list
## [27] customer_type adr
## [29] required_car_parking_spaces total_of_special_requests
## [31] reservation_status reservation_status_date
## <0 rows> (or 0-length row.names)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6.38 69.29 94.58 101.78 126.00 451.50
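The replacement described above can be sketched as follows. This is a minimal illustration on a made-up data frame (`toy` and its values are hypothetical, not rows from hotels.csv). It assigns to the adr column only; assigning a scalar to whole rows, for example `toy[toy$adr == 5400, ] <- adr_mean`, would overwrite every column in those rows.

```r
# Hypothetical toy data standing in for hotels_data
toy <- data.frame(
  adr    = c(75, 98, 5400, 120),
  adults = c(2, 1, 2, 2)
)

# Mean of adr over all rows (the outlier itself is included in this mean)
adr_mean <- mean(toy$adr)

# Replace only the adr value of the outlier rows; other columns stay intact
toy$adr[toy$adr == 5400] <- adr_mean

summary(toy$adr)
```

Whether the mean should include the outlier is a design choice; a median or a mean computed without the extreme value would be less affected by it.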
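arrival_date_year
The minimum arrival_date_year is 101.8 rather than a valid year (it matches the adr mean), and the other variables in these observations hold the same invalid value. These observations should be removed.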
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 101.8 2016.0 2016.0 2016.1 2017.0 2017.0
## [1] 2015.0000 2016.0000 2017.0000 101.8311
## [1] hotel is_canceled
## [3] lead_time arrival_date_year
## [5] arrival_date_month arrival_date_week_number
## [7] arrival_date_day_of_month stays_in_weekend_nights
## [9] stays_in_week_nights adults
## [11] children babies
## [13] meal country
## [15] market_segment distribution_channel
## [17] is_repeated_guest previous_cancellations
## [19] previous_bookings_not_canceled reserved_room_type
## [21] assigned_room_type booking_changes
## [23] deposit_type agent
## [25] company days_in_waiting_list
## [27] customer_type adr
## [29] required_car_parking_spaces total_of_special_requests
## [31] reservation_status reservation_status_date
## <0 rows> (or 0-length row.names)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 101.8 2016.0 2016.0 2016.1 2017.0 2017.0
## [1] 2015.0000 2016.0000 2017.0000 101.8311
adults
All observations with adults > 4 are canceled. These observations will be removed on the assumption that each was a data-entry error that was then canceled in order to correct it.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 2.000 1.859 2.000 101.831
## [1] 2.0000 1.0000 3.0000 4.0000 40.0000 26.0000 50.0000 27.0000
## [9] 55.0000 0.0000 20.0000 6.0000 5.0000 10.0000 101.8311
## [1] 2.0000 1.0000 3.0000 4.0000 40.0000 26.0000 50.0000 27.0000
## [9] 55.0000 0.0000 20.0000 6.0000 5.0000 10.0000 101.8311
## [1] "Canceled" "Canceled" "Canceled" "Canceled" "Canceled" "Canceled"
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 2225 Resort Hotel 0 1 2015 October
## arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 2225 41 6 0
## stays_in_week_nights adults children babies meal country market_segment
## 2225 3 0 0 0 SC PRT Corporate
## distribution_channel is_repeated_guest previous_cancellations
## 2225 Corporate 0 0
## previous_bookings_not_canceled reserved_room_type assigned_room_type
## 2225 0 A I
## booking_changes deposit_type agent company days_in_waiting_list
## 2225 1 No Deposit NULL 174 0
## customer_type adr required_car_parking_spaces total_of_special_requests
## 2225 Transient-Party 0 0 0
## reservation_status reservation_status_date
## 2225 Check-Out 2015-10-06
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 2.000 1.853 2.000 4.000
## [1] 2 1 3 4 0
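One way to check the claim and drop such rows is sketched below on hypothetical data (`toy` is made up for illustration; the threshold of 4 follows the text above).

```r
# Hypothetical toy data standing in for hotels_data
toy <- data.frame(
  adults      = c(2, 1, 40, 4, 55, 0),
  is_canceled = c(0, 0, 1, 1, 1, 0)
)

# Check the claim first: are all bookings with more than 4 adults canceled?
all_large_canceled <- all(toy$is_canceled[toy$adults > 4] == 1)
all_large_canceled

# Keep only rows with at most 4 adults
toy_clean <- toy[toy$adults <= 4, ]

sort(unique(toy_clean$adults))
```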
babies
The babies variable looks off: it contains values of 9 and 10. These appear to be one-off errors, so they are changed to 0.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.00795 0.00000 10.00000
## [1] 0 1 2 10 9
## [1] 0 1 2 10 9
## [1] 0 1 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.007791 0.000000 2.000000
## [1] 0 1 2
Found 4 NA in children and changed them to 0.
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 0 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
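The before-and-after NA counts above can be produced with a pattern like the following. This is a minimal sketch on a hypothetical data frame, not the report's own code.

```r
# Hypothetical toy data with missing children counts
toy <- data.frame(
  children = c(0, NA, 2, NA),
  babies   = c(0, 0, 1, 0)
)

# Count missing values per column before the fix
na_before <- colSums(is.na(toy))

# Replace missing children counts with 0
toy$children[is.na(toy$children)] <- 0

# Count again; every column should now report 0
na_after <- colSums(is.na(toy))
na_before
na_after
```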
Upon review of the correlation plot we can see that:
Strongest Positive Correlations - Visual review:
* children ~ adr and adults ~ adr: as the number of children or adults in the booking increases, the Average Daily Rate increases, likely because larger rooms are more expensive.
* stays_in_weekend_nights ~ stays_in_week_nights: bookings that have weekend stays also tend to have longer weekday stays.
* is_canceled ~ lead_time: bookings made well in advance are more likely to be canceled.
Strongest Negative Correlations - Visual review:
* required_car_parking_spaces ~ is_canceled, booking_changes ~ is_canceled, and total_of_special_requests ~ is_canceled: bookings with these attributes are less likely to be canceled.
* required_car_parking_spaces ~ lead_time
* adr ~ is_repeated_guest and adults ~ is_repeated_guest: repeated guests tend to book less expensive rooms or may benefit from loyalty discounts.
* is_repeated_guest ~ lead_time: repeated guests might book closer to their stay date compared to first-time guests.
Target Variable adr Correlations:
* children ~ adr
* adults ~ adr
* adr ~ is_repeated_guest
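The visual review can be backed by the numbers behind the plot. The sketch below uses simulated data (all column values are hypothetical) to show how correlations with the target can be extracted from a correlation matrix and ranked.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the numeric columns of hotels_data
adults <- sample(1:3, n, replace = TRUE)
adr <- 50 + 30 * adults + rnorm(n, sd = 10)
is_repeated_guest <- rbinom(n, 1, 0.1)
toy <- data.frame(adults, adr, is_repeated_guest, label = "x")

# Keep numeric columns only, then compute pairwise Pearson correlations
num <- toy[sapply(toy, is.numeric)]
cor_mat <- cor(num, use = "complete.obs")

# Correlations with the target, from strongest positive to strongest negative
sort(cor_mat["adr", colnames(cor_mat) != "adr"], decreasing = TRUE)
```

A matrix like `cor_mat` is also the kind of input that `corrplot::corrplot()` takes for plotting.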
# Set the seed for reproducibility
set.seed(21)
# Randomly sample row indices for the training set (75% of rows)
n_obs <- nrow(hotels_data)
train_indices <- sample(seq_len(n_obs), floor(n_obs * 0.75))
# Create the training set
train_data <- hotels_data[train_indices, ]
numeric_train_data <- train_data[sapply(train_data, is.numeric)]
# Create the testing set
test_data <- hotels_data[-train_indices, ]
numeric_test_data <- test_data[sapply(test_data, is.numeric)]
print(paste("We have split the data using random sampling into a training set of 75%:", nrow(train_data), "observations and a testing set of 25%:", nrow(test_data), "observations"))
## [1] "We have split the data using random sampling into a training set of 75%: 89528 observations and a testing set of 25%: 29843 observations"
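# Winsorize all numeric variables in the dataframe
# (each column is clamped to its lower and upper quantiles)
winsorize_all <- function(data, lower = 0.05, upper = 0.95) {
  # lapply() returns a named list of columns (not a data frame);
  # lm() accepts a list as its `data` argument
  data <- lapply(data, function(x) Winsorize(x, probs = c(lower, upper)))
  return(data)
}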
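To make the effect of winsorization concrete, here is a minimal base-R sketch of the same idea. `winsorize_vec` is a hypothetical helper written for illustration, not part of DescTools, and its quantile handling may differ in detail from `Winsorize()`.

```r
# Clamp values below the lower quantile and above the upper quantile
winsorize_vec <- function(x, lower = 0.05, upper = 0.95) {
  limits <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  pmin(pmax(x, limits[[1]]), limits[[2]])
}

x <- c(1:19, 1000)   # one extreme value
w <- winsorize_vec(x)

range(x)
range(w)
```

Values inside the two quantiles are left unchanged; only the tails are pulled in, which is why the extreme value no longer dominates squared-error measures such as MSE.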
adr - Average Daily Rate
## Actual Predicted
## 16562 81.0 96.45532
## 12105 80.1 103.70450
## 76839 89.0 104.74265
## 55685 115.0 96.35001
## 106096 105.0 90.61195
## 68024 130.0 97.38073
## Actual Predicted_Winsorized
## 1 81.0 78.39981
## 2 80.1 114.89974
## 3 89.0 87.32924
## 4 115.0 93.52211
## 5 105.0 96.42189
## 6 130.0 107.44216
The linear model's training MSE improves after winsorizing the training set (1671.39 versus 1739.62). This suggests that winsorizing the data improves the model's fit.
## [1] "Training MSE for Linear Model: 1739.62"
## [1] "Training MSE for Winsorized Linear Model: 1671.39"
hotels_test_data_winsorized <- winsorize_all(numeric_test_data)
lm_model_test <- lm(adr ~ ., data = numeric_test_data)
lm_model_test_win <- lm(adr ~ ., data = hotels_test_data_winsorized)
comparison_df <- data.frame(Actual = numeric_test_data$adr, lm_predicted = lm_model_test$fitted.values)
head(comparison_df)
## Actual lm_predicted
## 8 103.00 105.31938
## 10 105.50 105.41917
## 23 84.67 103.24816
## 33 108.30 122.30285
## 37 98.00 97.38399
## 39 108.80 132.21362
comparison_df_win <- data.frame(Actual = hotels_test_data_winsorized$adr, Predicted_winsorized = lm_model_test_win$fitted.values)
head(comparison_df_win)
Comparing the MSEs below, the winsorized model performs better than the standard model on both the training and testing datasets, which suggests that winsorizing makes the model less sensitive to outliers.
lm_mse_train <- mean((lm_model_train$fitted.values - numeric_train_data$adr)^2)
print(paste("Training MSE for Linear Model:", round(lm_mse_train, 2)))
## [1] "Training MSE for Linear Model: 1739.62"
lm_mse_train_win <- mean((lm_model_train_win$fitted.values - numeric_train_data$adr)^2)
print(paste("Training MSE for Winsorized Linear Model:", round(lm_mse_train_win, 2)))
## [1] "Training MSE for Winsorized Linear Model: 1671.39"
lm_mse_test <- mean((lm_model_test$fitted.values - numeric_test_data$adr)^2)
print(paste("Test MSE for Linear Model:", round(lm_mse_test, 2)))
## [1] "Test MSE for Linear Model: 1763.21"
lm_mse_test_win <- mean((lm_model_test_win$fitted.values - numeric_test_data$adr)^2)
print(paste("Test MSE for Winsorized Linear Model:", round(lm_mse_test_win, 2)))
## [1] "Test MSE for Winsorized Linear Model: 1679.17"
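The test MSEs above come from models that were refit on the test rows (`lm_model_test`, `lm_model_test_win`), so they measure in-sample fit on that subset. A common alternative is to fit once on the training set and score the held-out rows with `predict()`. The following is a minimal sketch on simulated data; all names and values are hypothetical.

```r
set.seed(21)
n <- 400

# Simulated stand-in for the hotel data
toy <- data.frame(adults = sample(1:4, n, replace = TRUE))
toy$adr <- 45 + 30 * toy$adults + rnorm(n, sd = 20)

# 75/25 split, as in the report
idx   <- sample(seq_len(n), floor(0.75 * n))
train <- toy[idx, ]
test  <- toy[-idx, ]

# Fit on the training rows only
fit <- lm(adr ~ adults, data = train)

# In-sample error versus error on rows the model has not seen
mse_train <- mean((fitted(fit) - train$adr)^2)
mse_test  <- mean((predict(fit, newdata = test) - test$adr)^2)
c(train = mse_train, test = mse_test)
```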
This scatter plot indicates a wide variation in adr across all levels of adults. The slope is positive, showing a positive relationship between Average Daily Rate and the number of adults in the booking. This is most likely due to the need for additional beds and/or space.
This scatterplot shows a positive relationship between adr and children. This is most likely due to the need for additional space, beds, or amenities that children prefer.
This plot indicates a negative relationship between adr and is_repeated_guest. This may be due to loyalty programs or the guests being ‘in the know’ about deals.
##
## Call:
## lm(formula = adr ~ adults, data = numeric_test_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -136.19 -31.10 -6.19 23.00 330.90
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.9101 1.0436 43.99 <2e-16 ***
## adults 30.0942 0.5442 55.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45.94 on 29841 degrees of freedom
## Multiple R-squared: 0.09295, Adjusted R-squared: 0.09292
## F-statistic: 3058 on 1 and 29841 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = adr ~ adults + adults * is_repeated_guest, data = numeric_test_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -135.35 -31.71 -6.71 22.74 330.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.4140 1.0857 45.514 < 2e-16 ***
## adults 28.6465 0.5624 50.934 < 2e-16 ***
## is_repeated_guest -19.4695 4.1174 -4.729 2.27e-06 ***
## adults:is_repeated_guest -4.2869 2.7291 -1.571 0.116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45.72 on 29839 degrees of freedom
## Multiple R-squared: 0.1015, Adjusted R-squared: 0.1014
## F-statistic: 1123 on 3 and 29839 DF, p-value: < 2.2e-16
The R^2 improved slightly, from 9.3% to 10.15%, after adding
is_repeated_guest and its interaction with adults. While
the number of adults and repeated guest status each have a
significant effect on adr, the interaction between them is not
statistically significant at the 0.05 level (p = 0.116). This could
indicate a reason to invest in marketing toward adult groups to
increase the average daily rate of bookings.
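In R's formula syntax, `adults * is_repeated_guest` already expands to both main effects plus their interaction, so `adults + adults * is_repeated_guest` fits the same model. A small check on simulated data (hypothetical values, not the hotel data):

```r
set.seed(2)
n <- 300

# Simulated stand-in for the hotel data
toy <- data.frame(
  adults            = sample(1:4, n, replace = TRUE),
  is_repeated_guest = rbinom(n, 1, 0.2)
)
toy$adr <- 50 + 28 * toy$adults - 19 * toy$is_repeated_guest + rnorm(n, sd = 15)

fit_a <- lm(adr ~ adults + adults * is_repeated_guest, data = toy)
fit_b <- lm(adr ~ adults * is_repeated_guest, data = toy)

# Same terms, same coefficients
names(coef(fit_a))
all.equal(coef(fit_a), coef(fit_b))

# R-squared as reported by summary()
summary(fit_b)$r.squared
```

To isolate how much the interaction term itself contributes, one could compare against `adr ~ adults + is_repeated_guest`, which contains only the main effects.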
is_canceled
is_canceled was chosen as the target variable because cancellations
can have significant implications for revenue.
##
## Call:
## glm(formula = is_canceled ~ lead_time + adr + adults + children +
## arrival_date_year + market_segment + total_of_special_requests +
## booking_changes, family = binomial, data = hotels.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3453 -0.8595 -0.5441 0.9853 4.9367
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.057e+01 4.877e+01 -0.217 0.828
## lead_time 5.548e-03 8.748e-05 63.422 <2e-16 ***
## adr 5.464e-03 2.044e-04 26.728 <2e-16 ***
## adults 3.002e-02 1.834e-02 1.637 0.102
## children -2.813e-02 2.172e-02 -1.295 0.195
## arrival_date_year 1.058e-02 1.192e-02 0.888 0.375
## market_segmentAviation -1.267e+01 5.437e+01 -0.233 0.816
## market_segmentComplementary -1.208e+01 5.437e+01 -0.222 0.824
## market_segmentCorporate -1.251e+01 5.437e+01 -0.230 0.818
## market_segmentDirect -1.294e+01 5.437e+01 -0.238 0.812
## market_segmentGroups -1.153e+01 5.437e+01 -0.212 0.832
## market_segmentOffline TA/TO -1.247e+01 5.437e+01 -0.229 0.819
## market_segmentOnline TA -1.167e+01 5.437e+01 -0.215 0.830
## total_of_special_requests -8.656e-01 1.346e-02 -64.307 <2e-16 ***
## booking_changes -7.421e-01 1.894e-02 -39.174 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 109950 on 83558 degrees of freedom
## Residual deviance: 91917 on 83544 degrees of freedom
## AIC: 91947
##
## Number of Fisher Scoring iterations: 9
## Predicted
## True 0 1
## 0 36687 16117
## 1 9220 21535
## [1] "MR:0.303222872461375"
## [1] 0.7681851
## Predicted
## True 0 1
## 0 19323 3041
## 1 6426 7022
## [1] "MR:0.264352730928181"
## [1] 0.7712624
The unweighted testing AUC (0.7713) was marginally higher than the unweighted training AUC (0.7682), and the testing MR (0.264) was noticeably lower than the training MR (0.303), so the model performed at least as well on the testing set as on the training set.
## Optimal pcut
## [1] 0.2
## Predicted
## True 0 1
## 0 18226 34578
## 1 2957 27798
## [1] "MR:0.449203556768272"
## [1] "FPR:0.654836754791304"
## [1] "FNR:0.0961469679726874"
## [1] "cost:0.59075623212341"
## Predicted
## True 0 1
## 0 19323 3041
## 1 6426 7022
## [1] "MR:0.264352730928181"
## [1] "FPR:0.135977463781077"
## [1] "FNR:0.477840571088638"
## [1] "cost:0.935219425794947"
## Model MR FNR FPR Cost
## 1 Weighted Training 0.4492036 0.09614697 0.6548368 0.5907562
## 2 Weighted Testing 0.2643527 0.47784057 0.1359775 0.9352194
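The MR, FPR, and FNR figures follow directly from the confusion-matrix counts. As a worked example, the sketch below recomputes them from the testing table above (rows are the true class, columns the predicted class); the variable names are chosen here for illustration.

```r
# Counts from the testing confusion matrix
tn <- 19323; fp <- 3041
fn <- 6426;  tp <- 7022

mr  <- (fp + fn) / (tn + fp + fn + tp)  # misclassification rate
fpr <- fp / (fp + tn)                   # false positive rate
fnr <- fn / (fn + tp)                   # false negative rate

round(c(MR = mr, FPR = fpr, FNR = fnr), 4)
```

The cost figure depends on the weights assigned to false negatives and false positives, which are not shown in this output, so it is left out of the sketch.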
### Tree Classification Model
## 'data.frame': 119371 obs. of 18 variables:
## $ is_canceled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ lead_time : num 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : num 2015 2015 2015 2015 2015 ...
## $ arrival_date_week_number : num 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : num 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : num 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : num 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : num 2 2 1 1 2 2 2 2 2 2 ...
## $ children : num 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : num 0 0 0 0 0 0 0 0 0 0 ...
## $ is_repeated_guest : num 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : num 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: num 0 0 0 0 0 0 0 0 0 0 ...
## $ booking_changes : num 3 4 0 0 0 0 0 0 0 0 ...
## $ days_in_waiting_list : num 0 0 0 0 0 0 0 0 0 0 ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : num 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : num 0 0 0 0 1 1 0 1 1 0 ...
## 0 1
## 59717 29811
## pred
## true 0 1
## 0 47002 9240
## 1 12715 20571
## [1] "Misclassification Rate on Training Data: 0.245230542400143"
## pred
## true 0 1
## 0 15879 3047
## 1 4148 6769
## [1] 0.2410951
## [1] "Misclassification Rate on Testing Data: 0.241095064169152"
## pred
## true 0 1
## 0 47002 9240
## 1 12715 20571
## pred
## true 0 1
## 0 14623 4303
## 1 3657 7260
## [1] 0.2452305
Obtain ROC and AUC on training set (use predicted probabilities).
## [1] 0.7696875
## [1] 0.7713274
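AUC can be read as the probability that a randomly chosen canceled booking receives a higher predicted probability than a randomly chosen non-canceled one. The sketch below computes it from ranks in base R; `auc_rank` is a hypothetical helper for illustration, and the report itself loads ROCR, whose `prediction()` and `performance(pred, "auc")` serve the same purpose.

```r
# AUC as the probability that a random positive scores above a random negative
auc_rank <- function(labels, scores) {
  r  <- rank(scores)            # average ranks handle ties
  n1 <- sum(labels == 1)        # number of positives
  n0 <- sum(labels == 0)        # number of negatives
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)
auc_rank(labels, scores)
```

Three of the four positive-negative pairs are ordered correctly in this toy example, so the result is 0.75.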
adr
## Call:
## rpart(formula = adr ~ ., data = numeric_data)
## n= 119371
##
## CP nsplit rel error xerror xstd
## 1 0.10611870 0 1.0000000 1.0000105 0.005831705
## 2 0.06906433 1 0.8938813 0.8939234 0.004966323
## 3 0.05123835 2 0.8248170 0.8248877 0.004714844
## 4 0.05075714 3 0.7735786 0.7538636 0.004473344
## 5 0.02338833 4 0.7228215 0.7228448 0.004381954
## 6 0.01671854 5 0.6994332 0.6993238 0.004301035
## 7 0.01339108 6 0.6827146 0.6839629 0.004224456
## 8 0.01112683 7 0.6693235 0.6705173 0.004168324
## 9 0.01000000 8 0.6581967 0.6582171 0.004143095
##
## Variable importance
## arrival_date_week_number children adults
## 34 34 20
## lead_time arrival_date_year previous_cancellations
## 7 4 1
##
## Node number 1: 119371 observations, complexity param=0.1061187
## mean=101.7911, MSE=2315.247
## left son=2 (110781 obs) right son=3 (8590 obs)
## Primary splits:
## children < 0.5 to the left, improve=0.10611870, (0 missing)
## adults < 2.5 to the left, improve=0.07454214, (0 missing)
## arrival_date_week_number < 13.5 to the left, improve=0.07166795, (0 missing)
## arrival_date_year < 2016.5 to the left, improve=0.03680391, (0 missing)
## total_of_special_requests < 0.5 to the left, improve=0.02949121, (0 missing)
## Surrogate splits:
## adults < 0.5 to the right, agree=0.928, adj=0.004, (0 split)
##
## Node number 2: 110781 observations, complexity param=0.06906433
## mean=97.42633, MSE=1914.985
## left son=4 (105046 obs) right son=5 (5735 obs)
## Primary splits:
## adults < 2.5 to the left, improve=0.08997446, (0 missing)
## arrival_date_week_number < 13.5 to the left, improve=0.07515351, (0 missing)
## arrival_date_year < 2016.5 to the left, improve=0.03555443, (0 missing)
## total_of_special_requests < 0.5 to the left, improve=0.03191124, (0 missing)
## is_repeated_guest < 0.5 to the right, improve=0.02100150, (0 missing)
##
## Node number 3: 8590 observations, complexity param=0.01112683
## mean=158.081, MSE=4062.978
## left son=6 (4861 obs) right son=7 (3729 obs)
## Primary splits:
## children < 1.5 to the left, improve=0.088110920, (0 missing)
## arrival_date_week_number < 13.5 to the left, improve=0.086564980, (0 missing)
## adults < 1.5 to the left, improve=0.041024670, (0 missing)
## arrival_date_year < 2016.5 to the left, improve=0.038691800, (0 missing)
## lead_time < 7.5 to the left, improve=0.008326848, (0 missing)
## Surrogate splits:
## adults < 0.5 to the right, agree=0.591, adj=0.058, (0 split)
## total_of_special_requests < 0.5 to the right, agree=0.582, adj=0.036, (0 split)
## stays_in_week_nights < 8.5 to the left, agree=0.569, adj=0.006, (0 split)
## stays_in_weekend_nights < 4.5 to the left, agree=0.567, adj=0.002, (0 split)
## lead_time < 334.5 to the left, agree=0.566, adj=0.001, (0 split)
##
## Node number 4: 105046 observations, complexity param=0.05123835
## mean=94.35929, MSE=1723.278
## left son=8 (20937 obs) right son=9 (84109 obs)
## Primary splits:
## arrival_date_week_number < 13.5 to the left, improve=0.07822694, (0 missing)
## arrival_date_year < 2016.5 to the left, improve=0.03106447, (0 missing)
## adults < 1.5 to the left, improve=0.02886486, (0 missing)
## total_of_special_requests < 0.5 to the left, improve=0.02853794, (0 missing)
## is_repeated_guest < 0.5 to the right, improve=0.02130618, (0 missing)
## Surrogate splits:
## lead_time < 543.5 to the right, agree=0.802, adj=0.008, (0 split)
## stays_in_week_nights < 13.5 to the right, agree=0.802, adj=0.004, (0 split)
## stays_in_weekend_nights < 5.5 to the right, agree=0.801, adj=0.003, (0 split)
## arrival_date_year < 1007.5 to the left, agree=0.801, adj=0.000, (0 split)
## arrival_date_day_of_month < 0.5 to the left, agree=0.801, adj=0.000, (0 split)
##
## Node number 5: 5735 observations
## mean=153.6042, MSE=2098.161
##
## Node number 6: 4861 observations
## mean=141.5092, MSE=3221.267
##
## Node number 7: 3729 observations
## mean=179.6835, MSE=4335.545
##
## Node number 8: 20937 observations
## mean=71.08803, MSE=689.0327
##
## Node number 9: 84109 observations, complexity param=0.05075714
## mean=100.1521, MSE=1812.366
## left son=18 (22080 obs) right son=19 (62029 obs)
## Primary splits:
## arrival_date_week_number < 40.5 to the right, improve=0.09202481, (0 missing)
## arrival_date_year < 2016.5 to the left, improve=0.08699909, (0 missing)
## total_of_special_requests < 0.5 to the left, improve=0.03557023, (0 missing)
## lead_time < 220.5 to the right, improve=0.02998976, (0 missing)
## previous_cancellations < 0.5 to the right, improve=0.02180009, (0 missing)
## Surrogate splits:
## lead_time < 480.5 to the right, agree=0.739, adj=0.004, (0 split)
## days_in_waiting_list < 385 to the right, agree=0.738, adj=0.002, (0 split)
## previous_cancellations < 25.5 to the right, agree=0.738, adj=0.001, (0 split)
## stays_in_weekend_nights < 6.5 to the right, agree=0.738, adj=0.000, (0 split)
## stays_in_week_nights < 15.5 to the right, agree=0.738, adj=0.000, (0 split)
##
## Node number 18: 22080 observations
## mean=78.50635, MSE=1202.244
##
## Node number 19: 62029 observations, complexity param=0.02338833
## mean=107.8572, MSE=1803.395
## left son=38 (15907 obs) right son=39 (46122 obs)
## Primary splits:
## lead_time < 188.5 to the right, improve=0.05778426, (0 missing)
## arrival_date_year < 2016.5 to the left, improve=0.05313649, (0 missing)
## total_of_special_requests < 0.5 to the left, improve=0.04456134, (0 missing)
## previous_cancellations < 0.5 to the right, improve=0.03764751, (0 missing)
## arrival_date_week_number < 21.5 to the left, improve=0.02683526, (0 missing)
## Surrogate splits:
## previous_cancellations < 0.5 to the right, agree=0.775, adj=0.124, (0 split)
## days_in_waiting_list < 59.5 to the right, agree=0.755, adj=0.045, (0 split)
##
## Node number 38: 15907 observations
## mean=90.4748, MSE=1002.185
##
## Node number 39: 46122 observations, complexity param=0.01671854
## mean=113.8522, MSE=1939.576
## left son=78 (11842 obs) right son=79 (34280 obs)
## Primary splits:
## arrival_date_week_number < 19.5 to the left, improve=0.05165108, (0 missing)
## arrival_date_year < 2016.5 to the left, improve=0.04279114, (0 missing)
## total_of_special_requests < 0.5 to the left, improve=0.02750329, (0 missing)
## adults < 1.5 to the left, improve=0.02706598, (0 missing)
## is_repeated_guest < 0.5 to the right, improve=0.02437876, (0 missing)
## Surrogate splits:
## days_in_waiting_list < 40.5 to the right, agree=0.748, adj=0.017, (0 split)
##
## Node number 78: 11842 observations
## mean=96.82277, MSE=1126.113
##
## Node number 79: 34280 observations, complexity param=0.01339108
## mean=119.7351, MSE=2085.798
## left son=158 (22830 obs) right son=159 (11450 obs)
## Primary splits:
## arrival_date_year < 2016.5 to the left, improve=0.05176054, (0 missing)
## total_of_special_requests < 0.5 to the left, improve=0.02735772, (0 missing)
## adults < 1.5 to the left, improve=0.02654263, (0 missing)
## is_repeated_guest < 0.5 to the right, improve=0.02199786, (0 missing)
## stays_in_week_nights < 2.5 to the left, improve=0.01555139, (0 missing)
## Surrogate splits:
## arrival_date_week_number < 22.5 to the right, agree=0.684, adj=0.054, (0 split)
## lead_time < 168.5 to the left, agree=0.667, adj=0.004, (0 split)
## previous_bookings_not_canceled < 11.5 to the left, agree=0.667, adj=0.002, (0 split)
## total_of_special_requests < 3.5 to the left, agree=0.667, adj=0.002, (0 split)
## required_car_parking_spaces < 1.5 to the left, agree=0.666, adj=0.000, (0 split)
##
## Node number 158: 22830 observations
## mean=112.3766, MSE=1906.504
##
## Node number 159: 11450 observations
## mean=134.4069, MSE=2120.063
## [1] 1519.47
## [1] 1537.142
## [1] 0.3426609
##
## Regression tree:
## rpart(formula = adr ~ ., data = numeric_data)
##
## Variables actually used in tree construction:
## [1] adults arrival_date_week_number arrival_date_year
## [4] children lead_time
##
## Root node error: 276373307/119371 = 2315.2
##
## n= 119371
##
## CP nsplit rel error xerror xstd
## 1 0.106119 0 1.00000 1.00001 0.0058317
## 2 0.069064 1 0.89388 0.89392 0.0049663
## 3 0.051238 2 0.82482 0.82489 0.0047148
## 4 0.050757 3 0.77358 0.75386 0.0044733
## 5 0.023388 4 0.72282 0.72284 0.0043820
## 6 0.016719 5 0.69943 0.69932 0.0043010
## 7 0.013391 6 0.68271 0.68396 0.0042245
## 8 0.011127 7 0.66932 0.67052 0.0041683
## 9 0.010000 8 0.65820 0.65822 0.0041431
### Out-of-sample R^2?
## Out-of-Sample R-squared: 0.3392447
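One common definition of out-of-sample R^2 is 1 - SSE/SST computed on held-out rows. The sketch below assumes SST is taken around the mean of the held-out actual values (some definitions use the training mean instead); `oos_r2` and the numbers are hypothetical.

```r
# Out-of-sample R^2: 1 - SSE / SST on held-out data
oos_r2 <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)       # squared prediction error
  sst <- sum((actual - mean(actual))^2)    # squared error of a mean-only baseline
  1 - sse / sst
}

actual    <- c(80, 100, 120, 140)
predicted <- c(90, 100, 110, 150)
oos_r2(actual, predicted)
```

Here SSE is 300 and SST is 2000, giving 0.85. A value near 0 means the model predicts held-out adr no better than the mean.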
AUC for Training Data: 0.5399
AUC for Testing Data: 0.5467
AUC for Training Data: 0.7682
AUC for Testing Data: 0.7713
The models were evaluated with confusion matrices and AUC values, both in-sample (training) and out-of-sample (testing). Key findings:
Logistic Regression Model:
In-Sample AUC: 0.7682
Out-of-Sample AUC: 0.7713
Regression Tree Model:
In-Sample AUC: 0.5399
Out-of-Sample AUC: 0.5467
Confusion Matrix Metrics:
Logistic Regression Model, weighted (In-Sample):
MR: 0.4492
FPR: 0.6548
FNR: 0.0961
Cost: 0.5908
Logistic Regression Model, weighted (Out-of-Sample):
MR: 0.2644
FPR: 0.1360
FNR: 0.4778
Cost: 0.9352
Tree Classification Model (In-Sample):
MR: 0.2452
Tree Classification Model (Out-of-Sample):
MR: 0.2411
The logistic regression model outperforms the regression tree model in terms of AUC for both in-sample and out-of-sample data.
The confusion matrix metrics add detail on misclassification rates, false positive rates, false negative rates, and associated costs.
These results should be read against the specific goals of the modeling task; the cost attached to each kind of misclassification can be crucial in decision-making.
Overall, the logistic regression model appears to be the better-performing model based on these evaluation metrics.