Hello all! It’s Gissella Nadya again and today we are going to explore on yet another Classification Models, those are Naive Bayes, Decision Tree and Random Forest. The dataset can be accessed on Kaggle and it was originally obtained from an article on ScienceDirect. It was then made avaiable here on February 11th 2020. Let’s get started!
library(tidyverse)
library(ggplot2)
library(ggmosaic)
library(plotly)
library(e1071) # naiveBayes()
library(caret) # cm
library(partykit) # ctree()
library(tictoc)
library(rpart)
library(randomForest)
library(ROCR)First of all, let’s get to know the data first.
bookings <- read_csv("hotel_bookings.csv") %>% janitor::clean_names()
head(bookings)glimpse(bookings)#> Rows: 119,390
#> Columns: 32
#> $ hotel <chr> "Resort Hotel", "Resort Hotel", "Resor…
#> $ is_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
#> $ lead_time <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75,…
#> $ arrival_date_year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 20…
#> $ arrival_date_month <chr> "July", "July", "July", "July", "July"…
#> $ arrival_date_week_number <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27…
#> $ arrival_date_day_of_month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ stays_in_weekend_nights <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ stays_in_week_nights <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4,…
#> $ adults <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
#> $ children <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ babies <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", "B…
#> $ country <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GB…
#> $ market_segment <chr> "Direct", "Direct", "Direct", "Corpora…
#> $ distribution_channel <chr> "Direct", "Direct", "Direct", "Corpora…
#> $ is_repeated_guest <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ previous_cancellations <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ reserved_room_type <chr> "C", "C", "A", "A", "A", "A", "C", "C"…
#> $ assigned_room_type <chr> "C", "C", "C", "A", "A", "A", "C", "C"…
#> $ booking_changes <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ deposit_type <chr> "No Deposit", "No Deposit", "No Deposi…
#> $ agent <chr> "NULL", "NULL", "NULL", "304", "240", …
#> $ company <chr> "NULL", "NULL", "NULL", "NULL", "NULL"…
#> $ days_in_waiting_list <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ customer_type <chr> "Transient", "Transient", "Transient",…
#> $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00…
#> $ required_car_parking_spaces <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ total_of_special_requests <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3,…
#> $ reservation_status <chr> "Check-Out", "Check-Out", "Check-Out",…
#> $ reservation_status_date <date> 2015-07-01, 2015-07-01, 2015-07-02, 2…
To understand more about our dataset, here are the explanations for each columns:
|
target = is_cancelled
unique(sort(bookings$hotel))#> [1] "City Hotel" "Resort Hotel"
Here we can see that the dataset consists of data from a city hotel and resort hotel. Today we are going to focus only to our Resort Hotel dataset.
resort <- bookings %>%
filter(hotel == "Resort Hotel")
resort <- resort %>%
select(-hotel) %>%
mutate_if(~is.character(.), ~as.factor(.)) %>%
mutate(is_canceled = as.factor(is_canceled),
is_repeated_guest = as.factor(is_repeated_guest),
total_of_special_requests = as.factor(total_of_special_requests))resort$arrival_date_month <- factor(resort$arrival_date_month, levels = month.name)colSums(is.na(resort))#> is_canceled lead_time
#> 0 0
#> arrival_date_year arrival_date_month
#> 0 0
#> arrival_date_week_number arrival_date_day_of_month
#> 0 0
#> stays_in_weekend_nights stays_in_week_nights
#> 0 0
#> adults children
#> 0 0
#> babies meal
#> 0 0
#> country market_segment
#> 0 0
#> distribution_channel is_repeated_guest
#> 0 0
#> previous_cancellations previous_bookings_not_canceled
#> 0 0
#> reserved_room_type assigned_room_type
#> 0 0
#> booking_changes deposit_type
#> 0 0
#> agent company
#> 0 0
#> days_in_waiting_list customer_type
#> 0 0
#> adr required_car_parking_spaces
#> 0 0
#> total_of_special_requests reservation_status
#> 0 0
#> reservation_status_date
#> 0
No missing values. Great!
We noticed that there are variables for stays_in_weekend_nights and stays_in_week_nights, in this feature engineering, we are going to make a new column for the total of this two. Next, we are also going to count the total of amount that the guest have to pay and named it total_stay_cost which would be counted by the Average Daily Rate * Total Number of Night Stays.
resort <- resort %>%
mutate(stays_in_total_nights = stays_in_weekend_nights + stays_in_week_nights,
total_stay_cost = adr * stays_in_total_nights)resort <- resort %>%
mutate(is_canceled = case_when(is_canceled == 0 ~ "No",
is_canceled == 1 ~ "Yes"),
is_canceled = as.factor(is_canceled)) %>%
rename(canceled = is_canceled)
resort$canceled <- factor(resort$canceled, levels = c("No", "Yes"))summary(resort)#> canceled lead_time arrival_date_year arrival_date_month
#> No :28938 Min. : 0.00 Min. :2015 August : 4894
#> Yes:11122 1st Qu.: 10.00 1st Qu.:2016 July : 4573
#> Median : 57.00 Median :2016 April : 3609
#> Mean : 92.68 Mean :2016 May : 3559
#> 3rd Qu.:155.00 3rd Qu.:2017 October: 3555
#> Max. :737.00 Max. :2017 March : 3336
#> (Other):16534
#> arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
#> Min. : 1.00 Min. : 1.00 Min. : 0.00
#> 1st Qu.:16.00 1st Qu.: 8.00 1st Qu.: 0.00
#> Median :28.00 Median :16.00 Median : 1.00
#> Mean :27.14 Mean :15.82 Mean : 1.19
#> 3rd Qu.:38.00 3rd Qu.:24.00 3rd Qu.: 2.00
#> Max. :53.00 Max. :31.00 Max. :19.00
#>
#> stays_in_week_nights adults children babies
#> Min. : 0.000 Min. : 0.000 Min. : 0.0000 Min. :0.0000
#> 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 0.0000 1st Qu.:0.0000
#> Median : 3.000 Median : 2.000 Median : 0.0000 Median :0.0000
#> Mean : 3.129 Mean : 1.867 Mean : 0.1287 Mean :0.0139
#> 3rd Qu.: 5.000 3rd Qu.: 2.000 3rd Qu.: 0.0000 3rd Qu.:0.0000
#> Max. :50.000 Max. :55.000 Max. :10.0000 Max. :2.0000
#>
#> meal country market_segment distribution_channel
#> BB :30005 PRT :17630 Complementary: 201 Corporate: 3269
#> FB : 754 GBR : 6814 Corporate : 2309 Direct : 7865
#> HB : 8046 ESP : 3957 Direct : 6513 TA/TO :28925
#> SC : 86 IRL : 2166 Groups : 5836 Undefined: 1
#> Undefined: 1169 FRA : 1611 Offline TA/TO: 7472
#> DEU : 1203 Online TA :17729
#> (Other): 6679
#> is_repeated_guest previous_cancellations previous_bookings_not_canceled
#> 0:38282 Min. : 0.0000 Min. : 0.0000
#> 1: 1778 1st Qu.: 0.0000 1st Qu.: 0.0000
#> Median : 0.0000 Median : 0.0000
#> Mean : 0.1017 Mean : 0.1465
#> 3rd Qu.: 0.0000 3rd Qu.: 0.0000
#> Max. :26.0000 Max. :30.0000
#>
#> reserved_room_type assigned_room_type booking_changes deposit_type
#> A :23399 A :17046 Min. : 0.000 No Deposit:38199
#> D : 7433 D :10339 1st Qu.: 0.000 Non Refund: 1719
#> E : 4982 E : 5638 Median : 0.000 Refundable: 142
#> G : 1610 C : 2214 Mean : 0.288
#> F : 1106 G : 1853 3rd Qu.: 0.000
#> C : 918 F : 1733 Max. :17.000
#> (Other): 612 (Other): 1237
#> agent company days_in_waiting_list customer_type
#> 240 :13905 NULL :36952 Min. : 0.0000 Contract : 1776
#> NULL : 8209 223 : 784 1st Qu.: 0.0000 Group : 284
#> 250 : 2869 281 : 138 Median : 0.0000 Transient :30209
#> 241 : 1721 154 : 133 Mean : 0.5278 Transient-Party: 7791
#> 40 : 1002 405 : 100 3rd Qu.: 0.0000
#> 314 : 927 94 : 87 Max. :185.0000
#> (Other):11427 (Other): 1866
#> adr required_car_parking_spaces total_of_special_requests
#> Min. : -6.38 Min. :0.0000 0:22361
#> 1st Qu.: 50.00 1st Qu.:0.0000 1:11806
#> Median : 75.00 Median :0.0000 2: 4827
#> Mean : 94.95 Mean :0.1381 3: 910
#> 3rd Qu.:125.00 3rd Qu.:0.0000 4: 142
#> Max. :508.00 Max. :8.0000 5: 14
#>
#> reservation_status reservation_status_date stays_in_total_nights
#> Canceled :10831 Min. :2014-11-18 Min. : 0.000
#> Check-Out:28938 1st Qu.:2016-01-26 1st Qu.: 2.000
#> No-Show : 291 Median :2016-07-31 Median : 3.000
#> Mean :2016-07-28 Mean : 4.319
#> 3rd Qu.:2017-02-11 3rd Qu.: 7.000
#> Max. :2017-09-14 Max. :69.000
#>
#> total_stay_cost
#> Min. : -63.8
#> 1st Qu.: 117.0
#> Median : 273.0
#> Mean : 435.4
#> 3rd Qu.: 593.0
#> Max. :7590.0
#>
If we take a look nicely to the summary, we can see that the variable company, and agent has many NULL in it. Therefore, we are going to ignore these columns.
cleaned_hotel <- resort %>%
select(-c(company, agent))resort %>%
ggplot() +
geom_mosaic(aes(x = product(canceled),
fill = canceled,
conds = product(is_repeated_guest))) +
labs(title = "Does our Repeated Guests Make Many Canceled Bookings?") +
theme(legend.position = "none",
title = element_text(face = "bold"),
plot.title = element_text(hjust = 0.5, size = 14))Lead times is the gap between the arrival date and the booking date. Through the boxplot, we are able to tell that the bookings that people who books later tends to lead to a canceled bookings, while those who books very early, tend to be a successful booking and leads to arrival to the hotel.
According to our dataset, the most booking happened was in the year 2016.
RNGkind(sample.kind = "Rounding")
set.seed(598)
index <- sample(nrow(cleaned_hotel), nrow(cleaned_hotel)*0.8)
hotel_train <- cleaned_hotel[index, ]
hotel_test <- cleaned_hotel[-index, ]prop.table(table(hotel_train$canceled))#>
#> No Yes
#> 0.7234461 0.2765539
72:27. The data seems to be imbalance, therefore we are going to perform a downsampling method.
hotel_train <- downSample(x = hotel_train %>% select(-canceled),
y = hotel_train$canceled,
yname = "canceled")prop.table(table(hotel_train$canceled))#>
#> No Yes
#> 0.5 0.5
tic()
model_naive <- naiveBayes(x = hotel_train %>% select(-c(canceled, reservation_status)),
y = hotel_train$canceled,
laplace = 1) # angka untuk laplace smoothing
toc()#> 0.13 sec elapsed
Here are interpretation samples.
model_naive$tables$is_repeated_guest#> is_repeated_guest
#> hotel_train$canceled 0 1
#> No 0.94280880 0.05719120
#> Yes 0.98996052 0.01003948
model_naive$tables$market_segment#> market_segment
#> hotel_train$canceled Complementary Corporate Direct Groups
#> No 0.005637614 0.066411095 0.196076221 0.118389897
#> Yes 0.003382568 0.033036419 0.079264855 0.223700530
#> market_segment
#> hotel_train$canceled Offline TA/TO Online TA
#> No 0.208478972 0.405006201
#> Yes 0.103281091 0.557334536
tic()
naive_class <- predict(model_naive, hotel_test %>% select(-reservation_status), type = "class")
toc()#> 8.458 sec elapsed
table(naive_class)#> naive_class
#> No Yes
#> 1974 6038
head(naive_class)#> [1] No Yes Yes Yes Yes Yes
#> Levels: No Yes
before we model, we need to make sure that there is no date class in our data frame
dt_hotel_train <- hotel_train %>% select(-c(reservation_status_date, reservation_status))
dt_hotel_test <- hotel_test %>% select(-c(reservation_status_date, reservation_status))model_dtree <- ctree(formula = canceled ~ .,
data = dt_hotel_train)Surprisingly, by using the partykit package and ctree() function for the decision tree, are not able to distinguish anything. Let’s try to use another Decission Tree Method using the rpart package.
tic()
set.seed(598)
model_dt_rpart <- rpart(canceled ~ ., data = dt_hotel_train, method = "class")
toc()#> 1.663 sec elapsed
rattle::fancyRpartPlot(model_dt_rpart, sub = NULL, palettes = "YlOrBr")It looks like the plot is self explanatory. It is interesting that the model divided it using the country as the first decision.
dtree_rpart_pred <- select(dt_hotel_test, canceled) %>%
bind_cols(pred_class = predict(model_dt_rpart,
newdata = dt_hotel_test, type = "class"))
head(dtree_rpart_pred)Yikes, it appears we have some false predictions here, but to make sure we are going to see later in our model evaluation.
Random Forest model are known with their heavy and long computational load. To overcome this characteristic, we are able to use nearZeroVar() function from the package caret to eliminate the columns with variance close to 0 (uninformative columns).
dim(cleaned_hotel)#> [1] 40060 31
nzvar <- nearZeroVar(cleaned_hotel)
rf_hotel <- cleaned_hotel[, -nzvar]
dim(rf_hotel)#> [1] 40060 24
Using the nearZeroVar() function, we are able to eliminate from 31 columns to 25. Now we can proceed with another cross validation step.
RNGkind(sample.kind = "Rounding")
set.seed(598)
index <- sample(nrow(rf_hotel), nrow(rf_hotel) * 0.8)
rf_hotel_train <- rf_hotel[index, ]
rf_hotel_test <- rf_hotel[-index, ]prop.table(table(rf_hotel_train$canceled))#>
#> No Yes
#> 0.7234461 0.2765539
As before, we are going to do downsampling.
rf_hotel_train <- downSample(x = rf_hotel_train %>% select(-canceled),
y = rf_hotel_train$canceled,
yname = "canceled")prop.table(table(rf_hotel_train$canceled))#>
#> No Yes
#> 0.5 0.5
tic()
set.seed(598)
ctrl <- trainControl(method = "repeatedcv", #k-fold
number = 5, # k-fold cross validation (k=5)
repeats = 3) # 3x repeat
model_forest <- train(canceled ~ .,
data = rf_hotel_train,
method = "rf",
trControl = ctrl)
# saveRDS(model_forest, "model_forest_resort_1.RDS") # save the model
toc()#> [1] "1.14 hours elapsed to create the model"
hotel_forest_rds <- readRDS("model_forest_resort_1.RDS")
hotel_forest_rds#> Random Forest
#>
#> 17726 samples
#> 23 predictor
#> 2 classes: 'No', 'Yes'
#>
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 3 times)
#> Summary of sample sizes: 14181, 14180, 14181, 14180, 14182, 14181, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.7811311 0.5622615
#> 96 1.0000000 1.0000000
#> 190 1.0000000 1.0000000
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 96.
mtry is the number of columns that was made as the predictors. For hotel_forest_rds model, there are 96 splitting nodes.
hotel_forest_rds$finalModel#>
#> Call:
#> randomForest(x = x, y = y, mtry = param$mtry)
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 96
#>
#> OOB estimate of error rate: 0%
#> Confusion matrix:
#> No Yes class.error
#> No 8863 0 0
#> Yes 0 8863 0
For the hotel_forest_rds model, the Out of Bag Error is 0%. It means that the accuracy fo the model is 100%! Which is a bit suspicious.
\[Accuracy = 100 - 0 = 100 \%\]
varImp(hotel_forest_rds)#> rf variable importance
#>
#> only 20 most important variables shown (out of 190)
#>
#> Overall
#> reservation_statusCheck-Out 100.00000
#> countryPRT 3.39248
#> lead_time 3.13235
#> required_car_parking_spaces 2.41052
#> market_segmentOnline TA 0.62521
#> reservation_status_date 0.41835
#> customer_typeTransient 0.28214
#> market_segmentOffline TA/TO 0.14293
#> countryGBR 0.13807
#> stays_in_total_nights 0.13744
#> reservation_statusNo-Show 0.12462
#> total_stay_cost 0.06876
#> market_segmentDirect 0.05254
#> market_segmentGroups 0.04598
#> distribution_channelTA/TO 0.03999
#> booking_changes 0.03975
#> arrival_date_year 0.02933
#> adr 0.01956
#> distribution_channelDirect 0.01687
#> arrival_date_week_number 0.01027
Here we see that the reservation_statusCheck-Out is the most important variable for seeing if a booking is cancelled or not. Surprisingly that is common logic. Therefore we are going to try a hypothesis in which we remove this variable.
forest_class <- predict(hotel_forest_rds, rf_hotel_test, type = "raw")
head(forest_class)#> [1] No No Yes No No No
#> Levels: No Yes
cleaned_hotel2 <- cleaned_hotel %>% select(-reservation_status)
dim(cleaned_hotel2)#> [1] 40060 30
nzvar2 <- nearZeroVar(cleaned_hotel2)
rf_hotel2 <- cleaned_hotel2[, -nzvar2]
dim(rf_hotel2)#> [1] 40060 23
Using the nearZeroVar() function, we are able to eliminate from 30 columns to 23. Now we can proceed with another cross validation step.
RNGkind(sample.kind = "Rounding")
set.seed(598)
index <- sample(nrow(rf_hotel2), nrow(rf_hotel2) * 0.8)
rf_hotel_train2 <- rf_hotel2[index, ]
rf_hotel_test2 <- rf_hotel2[-index, ]prop.table(table(rf_hotel_train2$canceled))#>
#> No Yes
#> 0.7234461 0.2765539
rf_hotel_train2 <- downSample(x = rf_hotel_train2 %>% select(-canceled),
y = rf_hotel_train2$canceled,
yname = "canceled")prop.table(table(rf_hotel_train2$canceled))#>
#> No Yes
#> 0.5 0.5
tic()
set.seed(598)
ctrl2 <- trainControl(method = "repeatedcv", #k-fold
number = 5, # k-fold cross validation (k=5)
repeats = 3) # 3x repeat
model_forest2 <- train(canceled ~ .,
data = rf_hotel_train2,
method = "rf",
trControl = ctrl2)
# saveRDS(model_forest2, "model_forest_resort_2.RDS") # save the model
toc()#> [1] "4.41 hours elapsed to create the model"
hotel_forest_rds2 <- readRDS("model_forest_resort_2.RDS")
hotel_forest_rds2#> Random Forest
#>
#> 17726 samples
#> 22 predictor
#> 2 classes: 'No', 'Yes'
#>
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 3 times)
#> Summary of sample sizes: 14181, 14180, 14181, 14180, 14182, 14181, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.6905469 0.3810921
#> 95 0.9175416 0.8350833
#> 188 0.9197041 0.8394082
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 188.
mtry is the number of columns that was made as the predictors. For hotel_forest_rds model, there are 188 splitting nodes.
hotel_forest_rds2$finalModel#>
#> Call:
#> randomForest(x = x, y = y, mtry = param$mtry)
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 188
#>
#> OOB estimate of error rate: 7.62%
#> Confusion matrix:
#> No Yes class.error
#> No 8205 658 0.07424123
#> Yes 693 8170 0.07819023
For the hotel_forest_rds2 model, the Out of Bag Error is 7.62%. It means that the accuracy fo the model is 92.38%! Which means very good!
\[Accuracy = 100 - 7.62 = 92.38 \%\]
varImp(hotel_forest_rds2)#> rf variable importance
#>
#> only 20 most important variables shown (out of 188)
#>
#> Overall
#> reservation_status_date 100.000
#> lead_time 87.854
#> countryPRT 60.552
#> required_car_parking_spaces 49.957
#> arrival_date_week_number 40.886
#> market_segmentOnline TA 37.148
#> adr 22.383
#> total_stay_cost 16.312
#> arrival_date_day_of_month 14.061
#> arrival_date_year 8.325
#> stays_in_total_nights 7.102
#> booking_changes 6.684
#> total_of_special_requests1 5.509
#> stays_in_week_nights 4.981
#> customer_typeTransient 4.583
#> distribution_channelDirect 4.299
#> total_of_special_requests2 3.729
#> mealUndefined 3.552
#> stays_in_weekend_nights 3.198
#> adults 2.362
For hotel_forest_rds2 model, the most important variable is reservation_status_date
forest_class_2 <- predict(hotel_forest_rds2, rf_hotel_test2, type = "raw")
head(forest_class_2)#> [1] Yes Yes Yes No Yes No
#> Levels: No Yes
cm_nb <- confusionMatrix(naive_class,
hotel_test$canceled,
positive = "Yes")
cm_nb#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 1901 73
#> Yes 3852 2186
#>
#> Accuracy : 0.5101
#> 95% CI : (0.4991, 0.5211)
#> No Information Rate : 0.718
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.1977
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.9677
#> Specificity : 0.3304
#> Pos Pred Value : 0.3620
#> Neg Pred Value : 0.9630
#> Prevalence : 0.2820
#> Detection Rate : 0.2728
#> Detection Prevalence : 0.7536
#> Balanced Accuracy : 0.6491
#>
#> 'Positive' Class : Yes
#>
cm_dt_rpart <- confusionMatrix(dtree_rpart_pred$pred_class,
dt_hotel_test$canceled, positive = "Yes")
cm_dt_rpart#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 4685 307
#> Yes 1068 1952
#>
#> Accuracy : 0.8284
#> 95% CI : (0.8199, 0.8366)
#> No Information Rate : 0.718
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.6155
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.8641
#> Specificity : 0.8144
#> Pos Pred Value : 0.6464
#> Neg Pred Value : 0.9385
#> Prevalence : 0.2820
#> Detection Rate : 0.2436
#> Detection Prevalence : 0.3769
#> Balanced Accuracy : 0.8392
#>
#> 'Positive' Class : Yes
#>
cm_rf <- confusionMatrix(data = forest_class,
reference = rf_hotel_test$canceled,
positive = "Yes")
cm_rf#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 5753 0
#> Yes 0 2259
#>
#> Accuracy : 1
#> 95% CI : (0.9995, 1)
#> No Information Rate : 0.718
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 1
#>
#> Mcnemar's Test P-Value : NA
#>
#> Sensitivity : 1.000
#> Specificity : 1.000
#> Pos Pred Value : 1.000
#> Neg Pred Value : 1.000
#> Prevalence : 0.282
#> Detection Rate : 0.282
#> Detection Prevalence : 0.282
#> Balanced Accuracy : 1.000
#>
#> 'Positive' Class : Yes
#>
cm_rf_2 <- confusionMatrix(data = forest_class_2,
reference = rf_hotel_test2$canceled,
positive = "Yes")
cm_rf_2#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 5305 143
#> Yes 448 2116
#>
#> Accuracy : 0.9262
#> 95% CI : (0.9203, 0.9319)
#> No Information Rate : 0.718
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.825
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.9367
#> Specificity : 0.9221
#> Pos Pred Value : 0.8253
#> Neg Pred Value : 0.9738
#> Prevalence : 0.2820
#> Detection Rate : 0.2641
#> Detection Prevalence : 0.3200
#> Balanced Accuracy : 0.9294
#>
#> 'Positive' Class : Yes
#>
As we can see, the overall models that we created have its ups and downs. The Naive Bayes model really high on Recall / Sensitivity, but worst in all other metrics. The Decision Tree have overall above 0.80 values except the precision. As what we were expected, the First Random Forest model have 1 value in everything, which is very suspicious. And finally the second Random Forest have almost 0.90 in all the metrics, except precision. Because we would like to see the accuracy of the model, therefore we are going to use the Random Forest 2 model.
We are using our best model with the highest accuracy: rf_hotel_test2.
# probability results
rf_prob <- predict(hotel_forest_rds2, # best model
rf_hotel_test2, # data test
type = "prob")
head(rf_prob)# object predictions
forest_auc <- prediction(predictions = rf_prob[, 2], # prob kelas positif
labels = as.numeric(rf_hotel_test$canceled == "Yes")) # label kelas positif
# buat performance dari objek prediction
perf <- performance(prediction.obj = forest_auc,
measure = "tpr", # tpr = true positive rate
x.measure = "fpr") #fpr = false positive rate
# buat plot
plot(perf)
abline(0,1, lty = 2, col = "red")auc <- performance(prediction.obj = forest_auc,
measure = "auc")
auc@y.values#> [[1]]
#> [1] 0.9827171
The AUC value of rf_hotel_test2 model is almost 1, this means that our model could classify both the positive class and negative class very well.
Thank you very much for taking the time to read my report project. Hope you have a great day!