opi— title: “Midterm” author: “Group 2” date: “2023-10-08” output: html_document —
Problem Statement: The goal is to understand the factors that influence the average daily rate (adr) of hotel bookings the most.
Approach: The analysis will use the hotels.csv dataset. The methodology will involve data cleaning, exploratory data analysis (EDA), and model fitting including linear regression, KNN, alogistic Regression, and SVM.
Analytic Technique: Linear regression and KNN will be employed to understand and predict the adr. Data preprocessing techniques like winsorization will be applied to refine the model further.
Logistic regression will utilized to understand is_cancelled and how to reduce the odds of cancellation.
Consumer Benefit: This analysis aims to provide insights on the most profitable customers, how to increase their bookings, and decrease cancellations.
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
library(ggplot2)
library(kknn)
library(dplyr)
library(tidyr)
library(stringr)
library(tidyverse)
library(corrplot)
library(DescTools)
library(ROCR)
More information on the original data can be found here: hotels.csv
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr "July" "July" "July" "July" ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr "BB" "BB" "BB" "BB" ...
## $ country : chr "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr "C" "C" "A" "A" ...
## $ assigned_room_type : chr "C" "C" "C" "A" ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr "NULL" "NULL" "NULL" "304" ...
## $ company : chr "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
## hotel is_canceled lead_time arrival_date_year
## Length:119390 Min. :0.0000 Min. : 0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Mode :character Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.0000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## NA's :4
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03191
## 3rd Qu.:0.00000
## Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
##
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.58
## Mean : 2.321 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## Length:119390
## Class :character
## Mode :character
##
##
##
##
adrSetting the outlier 5400 to the mean of adr
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6.38 69.29 94.58 101.83 126.00 5400.00
## [1] hotel is_canceled
## [3] lead_time arrival_date_year
## [5] arrival_date_month arrival_date_week_number
## [7] arrival_date_day_of_month stays_in_weekend_nights
## [9] stays_in_week_nights adults
## [11] children babies
## [13] meal country
## [15] market_segment distribution_channel
## [17] is_repeated_guest previous_cancellations
## [19] previous_bookings_not_canceled reserved_room_type
## [21] assigned_room_type booking_changes
## [23] deposit_type agent
## [25] company days_in_waiting_list
## [27] customer_type adr
## [29] required_car_parking_spaces total_of_special_requests
## [31] reservation_status reservation_status_date
## <0 rows> (or 0-length row.names)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6.38 69.29 94.58 101.78 126.00 451.50
arrival_date_yearFound arrival year date min = 0 leading to all var in these obs as 0. These should be removed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 101.8 2016.0 2016.0 2016.1 2017.0 2017.0
## [1] 2015.0000 2016.0000 2017.0000 101.8311
## [1] hotel is_canceled
## [3] lead_time arrival_date_year
## [5] arrival_date_month arrival_date_week_number
## [7] arrival_date_day_of_month stays_in_weekend_nights
## [9] stays_in_week_nights adults
## [11] children babies
## [13] meal country
## [15] market_segment distribution_channel
## [17] is_repeated_guest previous_cancellations
## [19] previous_bookings_not_canceled reserved_room_type
## [21] assigned_room_type booking_changes
## [23] deposit_type agent
## [25] company days_in_waiting_list
## [27] customer_type adr
## [29] required_car_parking_spaces total_of_special_requests
## [31] reservation_status reservation_status_date
## <0 rows> (or 0-length row.names)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 101.8 2016.0 2016.0 2016.1 2017.0 2017.0
## [1] 2015.0000 2016.0000 2017.0000 101.8311
adultsAll of the obs with adults > 4 is cancelled. These obs will be removed assuming it was a typo error then cancellation to correct
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 2.000 1.859 2.000 101.831
## [1] 2.0000 1.0000 3.0000 4.0000 40.0000 26.0000 50.0000 27.0000
## [9] 55.0000 0.0000 20.0000 6.0000 5.0000 10.0000 101.8311
## [1] 2.0000 1.0000 3.0000 4.0000 40.0000 26.0000 50.0000 27.0000
## [9] 55.0000 0.0000 20.0000 6.0000 5.0000 10.0000 101.8311
## [1] "Canceled" "Canceled" "Canceled" "Canceled" "Canceled" "Canceled"
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 2225 Resort Hotel 0 1 2015 October
## arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 2225 41 6 0
## stays_in_week_nights adults children babies meal country market_segment
## 2225 3 0 0 0 SC PRT Corporate
## distribution_channel is_repeated_guest previous_cancellations
## 2225 Corporate 0 0
## previous_bookings_not_canceled reserved_room_type assigned_room_type
## 2225 0 A I
## booking_changes deposit_type agent company days_in_waiting_list
## 2225 1 No Deposit NULL 174 0
## customer_type adr required_car_parking_spaces total_of_special_requests
## 2225 Transient-Party 0 0 0
## reservation_status reservation_status_date
## 2225 Check-Out 2015-10-06
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 2.000 1.853 2.000 4.000
## [1] 2 1 3 4 0
babiesBabies var seems off. There are 9 and 10 counts. This appears to be a one-off error. changing value to 0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.00795 0.00000 10.00000
## [1] 0 1 2 10 9
## [1] 0 1 2 10 9
## [1] 0 1 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.007791 0.000000 2.000000
## [1] 0 1 2
Found 4 NA in children and changed them to 0.
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 0 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
Upon review of the correlation plot we can see that:
Strongest Positive Correlations - Visual review:
children ~ adr
adults ~ adr This indicates that as the
number of children or adults in the booking increases, the Average Daily
Rate increases. Likely because larger rooms are more expensive.
stays_in_weekend_nights ~
stays_in_weeks_nights Bookings that have weekend stays also
tend to have longer weekday stays.
is_canceled ~ lead_time Bookings made
well in advance are more likely to be canceled.
Strongest Negative Correlations - Visual review:
required_car_parking_spaces ~
is_canceled
booking_changes ~ is_canceled
total_of_special_requests ~ is_canceled
Bookings with these attributes are less likely to be canceled.
required_car_parking_spaces ~
lead_time
adr ~ is_repeated_guest
adults ~ is_repeated_guest Repeated
guests tend to book less expensive rooms or may benefit from loyalty
discounts.
is_repeated_guest ~ lead_time Repeated
guests might book closer to their stay date compared to first-time
guests.
Target Variable adr Correlations *
children ~ adr * adults ~
adr * adr ~ is_repeated_guest
# Set the seed
set.seed(21)
# Randomly sample row indices for the training set
train_indices <- sample(1:NROW(hotels_data),NROW(hotels_data)*0.75)
# Create the training set
train_data <- hotels_data[train_indices, ]
numeric_train_data <- train_data[sapply(train_data, is.numeric)]
# Create the testing set
test_data <- hotels_data[-train_indices, ]
numeric_test_data <- test_data[sapply(test_data, is.numeric)]
print(paste("We have split the data using random sampling into a training set of 75%:", count(train_data), "observations and a testing set of 25%:", count(test_data), "observations"))
## [1] "We have split the data using random sampling into a training set of 75%: 89528 observations and a testing set of 25%: 29843 observations"
# Winsorize all numeric variables in the dataframe
winsorize_all <- function(data, lower = 0.05, upper = 0.95) {
data <- lapply(data, function(x) Winsorize(x, probs = c(lower, upper)))
return(data)
}
adr - Average Daily Rate## Actual Predicted
## 16562 81.0 96.45532
## 12105 80.1 103.70450
## 76839 89.0 104.74265
## 55685 115.0 96.35001
## 106096 105.0 90.61195
## 68024 130.0 97.38073
## Actual Predicted_Winsorized
## 1 81.0 78.39981
## 2 80.1 114.89974
## 3 89.0 87.32924
## 4 115.0 93.52211
## 5 105.0 96.42189
## 6 130.0 107.44216
The linear model MSE is improved after winsorizing the training set. This suggests that winsorizing the data will improve the models fit to the data.
## [1] "Training MSE for Linear Model: 1739.62"
## [1] "Training MSE for Winsorized Linear Model: 1671.39"
hotels_test_data_winsorized <- winsorize_all(numeric_test_data)
lm_model_test <- lm(adr ~ ., data = numeric_test_data)
lm_model_test_win <- lm(adr ~ ., data = hotels_test_data_winsorized)
comparison_df <- data.frame(Actual = numeric_test_data$adr, lm_predicted = lm_model_test$fitted.values)
head(comparison_df)
## Actual lm_predicted
## 8 103.00 105.31938
## 10 105.50 105.41917
## 23 84.67 103.24816
## 33 108.30 122.30285
## 37 98.00 97.38399
## 39 108.80 132.21362
comparison_df_win <- data.frame(Actual = hotels_test_data_winsorized$adr, Predicted_winsorized = lm_model_test_win$fitted.values)
head(comparison_df)
## Actual lm_predicted
## 8 103.00 105.31938
## 10 105.50 105.41917
## 23 84.67 103.24816
## 33 108.30 122.30285
## 37 98.00 97.38399
## 39 108.80 132.21362
The winsorized model performs better than the standard model when comparing the MSEs below in both training and testing datasets which is making the model less sensitive to outliers.
lm_mse_train <- mean((lm_model_train$fitted.values - numeric_train_data$adr)^2)
print(paste("Training MSE for Linear Model:", round(lm_mse_train, 2)))
## [1] "Training MSE for Linear Model: 1739.62"
lm_mse_train_win <- mean((lm_model_train_win$fitted.values - numeric_train_data$adr)^2)
print(paste("Training MSE for Winsorized Linear Model:", round(lm_mse_train_win, 2)))
## [1] "Training MSE for Winsorized Linear Model: 1671.39"
lm_mse_test <- mean((lm_model_test$fitted.values - numeric_test_data$adr)^2)
print(paste("Test MSE for Linear Model:", round(lm_mse_test, 2)))
## [1] "Test MSE for Linear Model: 1763.21"
lm_mse_test_win <- mean((lm_model_test_win$fitted.values - numeric_test_data$adr)^2)
print(paste("Test MSE for Winsorized Linear Model:", round(lm_mse_test_win, 2)))
## [1] "Test MSE for Winsorized Linear Model: 1679.17"
This scatter plot indicates a wide variation in adr
across all levels of adults. The slop is positive showing a
positive relationship with Average Daily Rate and the number of Adults
in the booking. This is more than likely due to the need of additional
beds and/or space.
This scatterplot shows a positive relationship between
adr and children. This is more than likely due
to the need of additional space, beds, or amenities that children
prefer.
This plot indicates a negative relationship between adr
and is_repeated_guest. This may be due to loyalty programs
or the guests being ‘in the know’ of deals.
##
## Call:
## lm(formula = adr ~ adults, data = numeric_test_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -136.19 -31.10 -6.19 23.00 330.90
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.9101 1.0436 43.99 <2e-16 ***
## adults 30.0942 0.5442 55.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45.94 on 29841 degrees of freedom
## Multiple R-squared: 0.09295, Adjusted R-squared: 0.09292
## F-statistic: 3058 on 1 and 29841 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = adr ~ adults + adults * is_repeated_guest, data = numeric_test_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -135.35 -31.71 -6.71 22.74 330.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.4140 1.0857 45.514 < 2e-16 ***
## adults 28.6465 0.5624 50.934 < 2e-16 ***
## is_repeated_guest -19.4695 4.1174 -4.729 2.27e-06 ***
## adults:is_repeated_guest -4.2869 2.7291 -1.571 0.116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45.72 on 29839 degrees of freedom
## Multiple R-squared: 0.1015, Adjusted R-squared: 0.1014
## F-statistic: 1123 on 3 and 29839 DF, p-value: < 2.2e-16
The R^2 slightly improved from 9.3% to 10.15% with the interaction
term of adults and is_repeated_guest. While
the number of adults and repeated guest status individually have a
significant impact on adr, the interaction between them isn’t
statistically significant at the 0.05 level. This could indicate a
reason to invest in marketing toward adult groups to increase the
average daily rate of bookings.
adris_canceled## [1] "hotel" "is_canceled"
## [3] "lead_time" "arrival_date_year"
## [5] "arrival_date_month" "arrival_date_week_number"
## [7] "arrival_date_day_of_month" "stays_in_weekend_nights"
## [9] "stays_in_week_nights" "adults"
## [11] "children" "babies"
## [13] "meal" "country"
## [15] "market_segment" "distribution_channel"
## [17] "is_repeated_guest" "previous_cancellations"
## [19] "previous_bookings_not_canceled" "reserved_room_type"
## [21] "assigned_room_type" "booking_changes"
## [23] "deposit_type" "agent"
## [25] "company" "days_in_waiting_list"
## [27] "customer_type" "adr"
## [29] "required_car_parking_spaces" "total_of_special_requests"
## [31] "reservation_status" "reservation_status_date"
is_canceled ~ lead_time + adr + adults + children + arrival_date_year + market_segment + total_of_special_requests + booking_change
##
## Call:
## glm(formula = is_canceled ~ lead_time + adr + adults + children +
## arrival_date_year + market_segment + total_of_special_requests +
## booking_changes, family = binomial, data = hotels.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3453 -0.8595 -0.5441 0.9853 4.9367
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.057e+01 4.877e+01 -0.217 0.828
## lead_time 5.548e-03 8.748e-05 63.422 <2e-16 ***
## adr 5.464e-03 2.044e-04 26.728 <2e-16 ***
## adults 3.002e-02 1.834e-02 1.637 0.102
## children -2.813e-02 2.172e-02 -1.295 0.195
## arrival_date_year 1.058e-02 1.192e-02 0.888 0.375
## market_segmentAviation -1.267e+01 5.437e+01 -0.233 0.816
## market_segmentComplementary -1.208e+01 5.437e+01 -0.222 0.824
## market_segmentCorporate -1.251e+01 5.437e+01 -0.230 0.818
## market_segmentDirect -1.294e+01 5.437e+01 -0.238 0.812
## market_segmentGroups -1.153e+01 5.437e+01 -0.212 0.832
## market_segmentOffline TA/TO -1.247e+01 5.437e+01 -0.229 0.819
## market_segmentOnline TA -1.167e+01 5.437e+01 -0.215 0.830
## total_of_special_requests -8.656e-01 1.346e-02 -64.307 <2e-16 ***
## booking_changes -7.421e-01 1.894e-02 -39.174 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 109950 on 83558 degrees of freedom
## Residual deviance: 91917 on 83544 degrees of freedom
## AIC: 91947
##
## Number of Fisher Scoring iterations: 9
Start with in-sample unweighted
## [1] 0.7681851
out-of-sample unweighted
## [1] 0.7712624
## [1] 0.2
Optimal cut off prob for in-sample
## Predicted
## True 0 1
## 0 18226 34578
## 1 2957 27798
## [1] "MR:0.449203556768272"
## [1] "FPR:0.654836754791304"
## [1] "FNR:0.0961469679726874"
## [1] "cost:0.59075623212341"
out-of-sample
## Predicted
## True 0 1
## 0 19323 3041
## 1 6426 7022
## [1] "MR:0.264352730928181"
## [1] "FPR:0.135977463781077"
## [1] "FNR:0.477840571088638"
## [1] "cost:0.935219425794947"