Introduction

The hotel industry thrives on understanding customer behavior, optimizing reservations, and forecasting demand. In this portfolio, I present a comprehensive analysis of hotel reservation data using R programming to derive valuable insights and build predictive models.

Objectives: - Analyze key metrics like booking trends, cancellation rates, and customer preferences. - Visualize reservation data to identify patterns across seasons, room types, and customer demographics. - Build predictive models to forecast future bookings and cancellations. - Optimize operational strategies for enhancing customer satisfaction and revenue generation.

Tools and Techniques: This project leverages powerful R libraries such as ggplot2 for data visualization, dplyr for data manipulation, and caret or xgboost for machine learning models. Advanced statistical techniques and machine learning algorithms, such as decision trees and logistic regression, are applied to ensure accurate predictions and actionable insights.

Through this portfolio, I demonstrate my expertise in data science and R programming, showcasing how data-driven decisions can improve operational efficiency and customer experiences in the hotel industry.

Library

library(dplyr)
library(caret)
library(e1071)
library(ROCR)
library(partykit)
library(rsample)
library(xgboost)
library(lubridate)
library(randomForest)

Read Data & Understanding

Import Data

hotel <- read.csv("HotelReservations.csv")

Data Inspection

Let’s take a quick look at the data content with the Head() command

head(hotel)

We check the data type with the glimpse() command.

glimpse(hotel)
#> Rows: 36,275
#> Columns: 19
#> $ Booking_ID                           <chr> "INN00001", "INN00002", "INN00003…
#> $ no_of_adults                         <int> 2, 2, 1, 2, 2, 2, 2, 2, 3, 2, 1, …
#> $ no_of_children                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ no_of_weekend_nights                 <int> 1, 2, 2, 0, 1, 0, 1, 1, 0, 0, 1, …
#> $ no_of_week_nights                    <int> 2, 3, 1, 2, 1, 2, 3, 3, 4, 5, 0, …
#> $ type_of_meal_plan                    <chr> "Meal Plan 1", "Not Selected", "M…
#> $ required_car_parking_space           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ room_type_reserved                   <chr> "Room_Type 1", "Room_Type 1", "Ro…
#> $ lead_time                            <int> 224, 5, 1, 211, 48, 346, 34, 83, …
#> $ arrival_year                         <int> 2017, 2018, 2018, 2018, 2018, 201…
#> $ arrival_month                        <int> 10, 11, 2, 5, 4, 9, 10, 12, 7, 10…
#> $ arrival_date                         <int> 2, 6, 28, 20, 11, 13, 15, 26, 6, …
#> $ market_segment_type                  <chr> "Offline", "Online", "Online", "O…
#> $ repeated_guest                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ no_of_previous_cancellations         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ no_of_previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ avg_price_per_room                   <dbl> 65.00, 106.68, 60.00, 100.00, 94.…
#> $ no_of_special_requests               <int> 0, 1, 0, 0, 0, 1, 1, 1, 1, 3, 0, …
#> $ booking_status                       <chr> "Not_Canceled", "Not_Canceled", "…

From the glimps function above, we can see that the data has 36.275 rows and 19 columns. Here is the explanation of the variables:

Data Dictionary

  • Booking_ID : unique identifier of each booking
  • no_of_adults : Number of adults
  • no_of_children : Number of Children
  • no_of_weekend_nights : Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
  • no_of_week_nights : Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
  • type_of_meal_plan : Type of meal plan booked by the customer:
  • required_car_parking_space : Does the customer require a car parking space? (0 - No, 1- Yes)
  • room_type_reserved : Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels.
  • lead_time : Number of days between the date of booking and the arrival date
  • arrival_year : Year of arrival date
  • arrival_month : Month of arrival date
  • arrival_date : Date of the month
  • market_segment_type : Market segment designation.
  • repeated_guest : Is the customer a repeated guest? (0 - No, 1- Yes)
  • no_of_previous_cancellations : Number of previous bookings that were canceled by the customer prior to the current booking
  • no_of_previous_bookings_not_canceled : Number of previous bookings not canceled by the customer prior to the current booking
  • avg_price_per_room : Average price per day of the reservation; prices of the rooms are dynamic. (in euros)
  • no_of_special_requests : Total number of special requests made by the customer (e.g. high floor, view from the room, etc)
  • booking_status : Flag indicating if the booking was canceled or not.

Data Manipulation

hotel_clean <- hotel %>% 
  mutate_at(vars(type_of_meal_plan, room_type_reserved, market_segment_type, booking_status), as.factor)
glimpse(hotel_clean)
#> Rows: 36,275
#> Columns: 19
#> $ Booking_ID                           <chr> "INN00001", "INN00002", "INN00003…
#> $ no_of_adults                         <int> 2, 2, 1, 2, 2, 2, 2, 2, 3, 2, 1, …
#> $ no_of_children                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ no_of_weekend_nights                 <int> 1, 2, 2, 0, 1, 0, 1, 1, 0, 0, 1, …
#> $ no_of_week_nights                    <int> 2, 3, 1, 2, 1, 2, 3, 3, 4, 5, 0, …
#> $ type_of_meal_plan                    <fct> Meal Plan 1, Not Selected, Meal P…
#> $ required_car_parking_space           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ room_type_reserved                   <fct> Room_Type 1, Room_Type 1, Room_Ty…
#> $ lead_time                            <int> 224, 5, 1, 211, 48, 346, 34, 83, …
#> $ arrival_year                         <int> 2017, 2018, 2018, 2018, 2018, 201…
#> $ arrival_month                        <int> 10, 11, 2, 5, 4, 9, 10, 12, 7, 10…
#> $ arrival_date                         <int> 2, 6, 28, 20, 11, 13, 15, 26, 6, …
#> $ market_segment_type                  <fct> Offline, Online, Online, Online, …
#> $ repeated_guest                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ no_of_previous_cancellations         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ no_of_previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ avg_price_per_room                   <dbl> 65.00, 106.68, 60.00, 100.00, 94.…
#> $ no_of_special_requests               <int> 0, 1, 0, 0, 0, 1, 1, 1, 1, 3, 0, …
#> $ booking_status                       <fct> Not_Canceled, Not_Canceled, Cance…

Check Missing Values

colSums(is.na(hotel_clean))
#>                           Booking_ID                         no_of_adults 
#>                                    0                                    0 
#>                       no_of_children                 no_of_weekend_nights 
#>                                    0                                    0 
#>                    no_of_week_nights                    type_of_meal_plan 
#>                                    0                                    0 
#>           required_car_parking_space                   room_type_reserved 
#>                                    0                                    0 
#>                            lead_time                         arrival_year 
#>                                    0                                    0 
#>                        arrival_month                         arrival_date 
#>                                    0                                    0 
#>                  market_segment_type                       repeated_guest 
#>                                    0                                    0 
#>         no_of_previous_cancellations no_of_previous_bookings_not_canceled 
#>                                    0                                    0 
#>                   avg_price_per_room               no_of_special_requests 
#>                                    0                                    0 
#>                       booking_status 
#>                                    0

No missing values

Exploratory Data Analyst

Check the distribution/pattern of the data

summary(hotel_clean)
#>   Booking_ID         no_of_adults   no_of_children    no_of_weekend_nights
#>  Length:36275       Min.   :0.000   Min.   : 0.0000   Min.   :0.0000      
#>  Class :character   1st Qu.:2.000   1st Qu.: 0.0000   1st Qu.:0.0000      
#>  Mode  :character   Median :2.000   Median : 0.0000   Median :1.0000      
#>                     Mean   :1.845   Mean   : 0.1053   Mean   :0.8107      
#>                     3rd Qu.:2.000   3rd Qu.: 0.0000   3rd Qu.:2.0000      
#>                     Max.   :4.000   Max.   :10.0000   Max.   :7.0000      
#>                                                                           
#>  no_of_week_nights    type_of_meal_plan required_car_parking_space
#>  Min.   : 0.000    Meal Plan 1 :27835   Min.   :0.00000           
#>  1st Qu.: 1.000    Meal Plan 2 : 3305   1st Qu.:0.00000           
#>  Median : 2.000    Meal Plan 3 :    5   Median :0.00000           
#>  Mean   : 2.204    Not Selected: 5130   Mean   :0.03099           
#>  3rd Qu.: 3.000                         3rd Qu.:0.00000           
#>  Max.   :17.000                         Max.   :1.00000           
#>                                                                   
#>    room_type_reserved   lead_time       arrival_year  arrival_month   
#>  Room_Type 1:28130    Min.   :  0.00   Min.   :2017   Min.   : 1.000  
#>  Room_Type 2:  692    1st Qu.: 17.00   1st Qu.:2018   1st Qu.: 5.000  
#>  Room_Type 3:    7    Median : 57.00   Median :2018   Median : 8.000  
#>  Room_Type 4: 6057    Mean   : 85.23   Mean   :2018   Mean   : 7.424  
#>  Room_Type 5:  265    3rd Qu.:126.00   3rd Qu.:2018   3rd Qu.:10.000  
#>  Room_Type 6:  966    Max.   :443.00   Max.   :2018   Max.   :12.000  
#>  Room_Type 7:  158                                                    
#>   arrival_date     market_segment_type repeated_guest   
#>  Min.   : 1.0   Aviation     :  125    Min.   :0.00000  
#>  1st Qu.: 8.0   Complementary:  391    1st Qu.:0.00000  
#>  Median :16.0   Corporate    : 2017    Median :0.00000  
#>  Mean   :15.6   Offline      :10528    Mean   :0.02564  
#>  3rd Qu.:23.0   Online       :23214    3rd Qu.:0.00000  
#>  Max.   :31.0                          Max.   :1.00000  
#>                                                         
#>  no_of_previous_cancellations no_of_previous_bookings_not_canceled
#>  Min.   : 0.00000             Min.   : 0.0000                     
#>  1st Qu.: 0.00000             1st Qu.: 0.0000                     
#>  Median : 0.00000             Median : 0.0000                     
#>  Mean   : 0.02335             Mean   : 0.1534                     
#>  3rd Qu.: 0.00000             3rd Qu.: 0.0000                     
#>  Max.   :13.00000             Max.   :58.0000                     
#>                                                                   
#>  avg_price_per_room no_of_special_requests      booking_status 
#>  Min.   :  0.00     Min.   :0.0000         Canceled    :11885  
#>  1st Qu.: 80.30     1st Qu.:0.0000         Not_Canceled:24390  
#>  Median : 99.45     Median :0.0000                             
#>  Mean   :103.42     Mean   :0.6197                             
#>  3rd Qu.:120.00     3rd Qu.:1.0000                             
#>  Max.   :540.00     Max.   :5.0000                             
#> 
ggplot(hotel_clean, aes(x = no_of_adults, fill = booking_status)) +
  geom_bar(position = "dodge") +
  labs(
    x = "Number of Adults", 
    y = "Frequency", 
    title = "Bar Plot of Number of Adults by Booking Status"
  ) +
  theme_minimal()+
  scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))

Insight: no_of_adults with 2 have booking_status Canceled highest

ggplot(hotel_clean, aes(x = no_of_children, fill = booking_status)) +
  geom_bar(position = "dodge") +
  labs(
    x = "Number of Children", 
    y = "Frequency", 
    title = "Bar Plot of Number of Children by Booking Status"
  ) +
  theme_minimal()+
  scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))

Insight: no_of_children with 0 have booking_status Canceled highest

ggplot(hotel_clean, aes(x = no_of_weekend_nights, fill = booking_status)) +
  geom_bar(position = "dodge") +
  labs(
    x = "Number of Weekend Night", 
    y = "Frequency", 
    title = "Bar Plot of Number of Weekend Night by Booking Status"
  ) +
  theme_minimal()+
  scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))

Insight: no_of_weekend_nights with 0 have booking_status Canceled highest

ggplot(hotel_clean, aes(x = no_of_week_nights, fill = booking_status)) +
  geom_bar(position = "dodge") +
  labs(
    x = "Number of Week Night", 
    y = "Frequency", 
    title = "Bar Plot of Number of Week Night by Booking Status"
  ) +
  theme_minimal()+
  scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))

Insight: no_of_week_nights with 3 have booking_status Canceled highest

ggplot(hotel_clean, aes(x = type_of_meal_plan, fill = booking_status)) +
  geom_bar(position = "dodge") +
  labs(
    x = "Type of Meal Plan", 
    y = "Count", 
    title = "Bar Plot of Booking Status by Type of Meal Plan"
  ) +
  theme_minimal() +
  scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))

Insight: type_of_meal_plan with Meal Plan 1 have booking_status Canceled highest

ggplot(hotel_clean, aes(x = required_car_parking_space, fill = booking_status)) +
  geom_bar(position = "dodge") +
  labs(
    x = "Required Car Parking Space", 
    y = "Frequency", 
    title = " Bar Plot Required Car Parking Space by Booking Status"
  ) +
  theme_minimal()+
  scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))

Insight: required_car_parking_space with 0 have booking_status Canceled highest

ggplot(hotel_clean, aes(x = room_type_reserved, fill = booking_status)) +
  geom_bar(position = "dodge") +
  labs(
    x = "Room Type Reserved", 
    y = "Frequency", 
    title = " Bar Plot Room Type Reserved by Booking Status"
  ) +
  theme_minimal()+
  scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))

Insight: room_type_reserved with Room_type 1 have booking_status Canceled highest

ggplot(hotel_clean, aes(x = as.factor(booking_status), y = lead_time)) +
  geom_boxplot(fill = "#B3D9FF", color = "black") +
  labs(
    x = "Booking Status", 
    y = "Lead Time", 
    title = "Box Plot of Lead Time by Booking Status"
  ) +
  theme_minimal()

Insight: booking_status Canceled have mean lead_time higher than booking_status Not_Canceled, both booking_status have outliers

ggplot(hotel_clean, aes(x = arrival_year, fill = booking_status)) +
  geom_bar(position = "dodge") +
  labs(
    x = "Arrival Year", 
    y = "Frequency", 
    title = " Bar Plot Arrival Year by Booking Status"
  ) +
  theme_minimal()+
  scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))

Insight: arrival_year 2018 have booking_status Canceled higher than other

ggplot(hotel_clean, aes(x = arrival_month, fill = booking_status)) +
  geom_bar(position = "dodge") +
  labs(
    x = "Arrival Month", 
    y = "Frequency", 
    title = " Bar Plot Arrival Month by Booking Status"
  ) +
  theme_minimal()+
  scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))

Insight: arrival_month 10 have booking_status Canceled higher than other

ggplot(hotel_clean, aes(x = arrival_date, fill = booking_status)) +
  geom_bar(position = "dodge") +
  labs(
    x = "Arrival Date", 
    y = "Frequency", 
    title = " Bar Plot Arrival Date by Booking Status"
  ) +
  theme_minimal()+
  scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))

Insight: arrival_date 15 have booking_status Canceled higher than other

ggplot(hotel_clean, aes(x = market_segment_type, fill = booking_status)) +
  geom_bar(position = "dodge") +
  labs(
    x = "Market Segment Type", 
    y = "Frequency", 
    title = " Bar Plot Market Segment Type by Booking Status"
  ) +
  theme_minimal()+
  scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))

Insight: market_segment_type online have booking_status Canceled higher than other

ggplot(hotel_clean, aes(x = repeated_guest, fill = booking_status)) +
  geom_bar(position = "dodge") +
  labs(
    x = "Repeated Guest", 
    y = "Frequency", 
    title = " Bar Plot Repeated Guest by Booking Status"
  ) +
  theme_minimal()+
  scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))

Insight: repeated_guest 0 have booking_status Canceled higher than other

ggplot(hotel_clean, aes(x = no_of_previous_cancellations, fill = booking_status)) +
  geom_bar(position = "dodge") +
  labs(
    x = "Number Of Previous Cancellations", 
    y = "Frequency", 
    title = " Bar Plot Number Of Previous Cancellations by Booking Status"
  ) +
  theme_minimal()+
  scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))

Insight: no_of_previous_cancellations 0 have booking_status Canceled higher than other

ggplot(hotel_clean, aes(x = no_of_previous_bookings_not_canceled, fill = booking_status)) +
  geom_bar(position = "dodge") +
  labs(
    x = "Number Of Previous Booking Not Cancellations", 
    y = "Frequency", 
    title = " Bar Plot Number Of Previous Booking Not Cancellations by Booking Status"
  ) +
  theme_minimal()+
  scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))

Insight: no_of_previous_bookings_not_canceled 0 have booking_status Canceled higher than other

ggplot(hotel_clean, aes(x = as.factor(booking_status), y = avg_price_per_room)) +
  geom_boxplot(fill = "#B3D9FF", color = "black") +
  labs(
    x = "Booking Status", 
    y = "Average Price Per Room", 
    title = "Box Plot of Average Price Per Room by Booking Status"
  ) +
  theme_minimal()

Insight: booking_status Not_Canceled have mean avg_price_per_room higher, booking_status Canceled have median higher, both booking_status have outliers

ggplot(hotel_clean, aes(x = no_of_special_requests, fill = booking_status)) +
  geom_bar(position = "dodge") +
  labs(
    x = "Number Of Special Requests", 
    y = "Frequency", 
    title = " Bar Plot Number Of Special Requests by Booking Status"
  ) +
  theme_minimal()+
  scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))

Insight; no_of_special_requests 0 have booking_status Canceled Higher than others

Feature Engineering

Feature engineering in R refers to the process of transforming raw data into meaningful features (variables) that can be used to improve the performance of machine learning models. It involves creating, modifying, or selecting variables that better represent the underlying patterns in the data and make it easier for models to learn. Combining arrival_year, arrival_month, and arrival_date into a single arrival_date_full. make Day from extract arrival_date_full.

hotel_clean$arrival_date_full <- as.Date(
  paste(hotel_clean$arrival_year, hotel_clean$arrival_month, hotel_clean$arrival_date, sep = "-"))

hotel_clean <- hotel_clean %>% mutate(Day = weekdays(arrival_date_full))
hotel_clean <- na.omit(hotel_clean)

Combining no_of_adults and no_of_children into single total_people, combining no_of_weekend_nights and no_of_week_nights into single total_nights

hotel_clean <- hotel_clean %>%
  mutate(total_people = no_of_adults + no_of_children,
         total_nights = no_of_weekend_nights + no_of_week_nights)
prop.table(table(hotel_clean$Day))
#> 
#>    Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday 
#> 0.1308295 0.1484629 0.1495392 0.1692974 0.1253932 0.1351896 0.1412882

change data type

hotel_clean$room_type_reserved <- as.numeric(hotel_clean$room_type_reserved)
hotel_clean$type_of_meal_plan <- as.numeric(hotel_clean$type_of_meal_plan)
hotel_clean$market_segment_type <- as.numeric(hotel_clean$market_segment_type)
hotel_clean$Day <- as.numeric(factor(hotel_clean$Day))
prop.table(table(hotel_clean$Day))
#> 
#>         1         2         3         4         5         6         7 
#> 0.1308295 0.1484629 0.1495392 0.1692974 0.1253932 0.1351896 0.1412882
colSums(is.na(hotel_clean))
#>                           Booking_ID                         no_of_adults 
#>                                    0                                    0 
#>                       no_of_children                 no_of_weekend_nights 
#>                                    0                                    0 
#>                    no_of_week_nights                    type_of_meal_plan 
#>                                    0                                    0 
#>           required_car_parking_space                   room_type_reserved 
#>                                    0                                    0 
#>                            lead_time                         arrival_year 
#>                                    0                                    0 
#>                        arrival_month                         arrival_date 
#>                                    0                                    0 
#>                  market_segment_type                       repeated_guest 
#>                                    0                                    0 
#>         no_of_previous_cancellations no_of_previous_bookings_not_canceled 
#>                                    0                                    0 
#>                   avg_price_per_room               no_of_special_requests 
#>                                    0                                    0 
#>                       booking_status                    arrival_date_full 
#>                                    0                                    0 
#>                                  Day                         total_people 
#>                                    0                                    0 
#>                         total_nights 
#>                                    0

Check all dataset

glimpse(hotel_clean)
#> Rows: 36,238
#> Columns: 23
#> $ Booking_ID                           <chr> "INN00001", "INN00002", "INN00003…
#> $ no_of_adults                         <int> 2, 2, 1, 2, 2, 2, 2, 2, 3, 2, 1, …
#> $ no_of_children                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ no_of_weekend_nights                 <int> 1, 2, 2, 0, 1, 0, 1, 1, 0, 0, 1, …
#> $ no_of_week_nights                    <int> 2, 3, 1, 2, 1, 2, 3, 3, 4, 5, 0, …
#> $ type_of_meal_plan                    <dbl> 1, 4, 1, 1, 4, 2, 1, 1, 1, 1, 4, …
#> $ required_car_parking_space           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ room_type_reserved                   <dbl> 1, 1, 1, 1, 1, 1, 1, 4, 1, 4, 1, …
#> $ lead_time                            <int> 224, 5, 1, 211, 48, 346, 34, 83, …
#> $ arrival_year                         <int> 2017, 2018, 2018, 2018, 2018, 201…
#> $ arrival_month                        <int> 10, 11, 2, 5, 4, 9, 10, 12, 7, 10…
#> $ arrival_date                         <int> 2, 6, 28, 20, 11, 13, 15, 26, 6, …
#> $ market_segment_type                  <dbl> 4, 5, 5, 5, 5, 5, 5, 5, 4, 5, 5, …
#> $ repeated_guest                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ no_of_previous_cancellations         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ no_of_previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ avg_price_per_room                   <dbl> 65.00, 106.68, 60.00, 100.00, 94.…
#> $ no_of_special_requests               <int> 0, 1, 0, 0, 0, 1, 1, 1, 1, 3, 0, …
#> $ booking_status                       <fct> Not_Canceled, Not_Canceled, Cance…
#> $ arrival_date_full                    <date> 2017-10-02, 2018-11-06, 2018-02-…
#> $ Day                                  <dbl> 2, 6, 7, 4, 7, 5, 4, 7, 1, 5, 6, …
#> $ total_people                         <int> 2, 2, 1, 2, 2, 2, 2, 2, 3, 2, 1, …
#> $ total_nights                         <int> 3, 5, 3, 2, 2, 2, 4, 4, 4, 5, 1, …
summary(hotel_clean)
#>   Booking_ID         no_of_adults   no_of_children    no_of_weekend_nights
#>  Length:36238       Min.   :0.000   Min.   : 0.0000   Min.   :0.0000      
#>  Class :character   1st Qu.:2.000   1st Qu.: 0.0000   1st Qu.:0.0000      
#>  Mode  :character   Median :2.000   Median : 0.0000   Median :1.0000      
#>                     Mean   :1.845   Mean   : 0.1052   Mean   :0.8105      
#>                     3rd Qu.:2.000   3rd Qu.: 0.0000   3rd Qu.:2.0000      
#>                     Max.   :4.000   Max.   :10.0000   Max.   :7.0000      
#>  no_of_week_nights type_of_meal_plan required_car_parking_space
#>  Min.   : 0.000    Min.   :1.000     Min.   :0.00000           
#>  1st Qu.: 1.000    1st Qu.:1.000     1st Qu.:0.00000           
#>  Median : 2.000    Median :1.000     Median :0.00000           
#>  Mean   : 2.204    Mean   :1.516     Mean   :0.03093           
#>  3rd Qu.: 3.000    3rd Qu.:1.000     3rd Qu.:0.00000           
#>  Max.   :17.000    Max.   :4.000     Max.   :1.00000           
#>  room_type_reserved   lead_time       arrival_year  arrival_month   
#>  Min.   :1.000      Min.   :  0.00   Min.   :2017   Min.   : 1.000  
#>  1st Qu.:1.000      1st Qu.: 17.00   1st Qu.:2018   1st Qu.: 5.000  
#>  Median :1.000      Median : 57.00   Median :2018   Median : 8.000  
#>  Mean   :1.708      Mean   : 85.28   Mean   :2018   Mean   : 7.429  
#>  3rd Qu.:1.000      3rd Qu.:126.00   3rd Qu.:2018   3rd Qu.:10.000  
#>  Max.   :7.000      Max.   :443.00   Max.   :2018   Max.   :12.000  
#>   arrival_date   market_segment_type repeated_guest   
#>  Min.   : 1.00   Min.   :1.000       Min.   :0.00000  
#>  1st Qu.: 8.00   1st Qu.:4.000       1st Qu.:0.00000  
#>  Median :16.00   Median :5.000       Median :0.00000  
#>  Mean   :15.58   Mean   :4.553       Mean   :0.02555  
#>  3rd Qu.:23.00   3rd Qu.:5.000       3rd Qu.:0.00000  
#>  Max.   :31.00   Max.   :5.000       Max.   :1.00000  
#>  no_of_previous_cancellations no_of_previous_bookings_not_canceled
#>  Min.   : 0.00000             Min.   : 0.000                      
#>  1st Qu.: 0.00000             1st Qu.: 0.000                      
#>  Median : 0.00000             Median : 0.000                      
#>  Mean   : 0.02335             Mean   : 0.153                      
#>  3rd Qu.: 0.00000             3rd Qu.: 0.000                      
#>  Max.   :13.00000             Max.   :58.000                      
#>  avg_price_per_room no_of_special_requests      booking_status 
#>  Min.   :  0.00     Min.   :0.00           Canceled    :11878  
#>  1st Qu.: 80.30     1st Qu.:0.00           Not_Canceled:24360  
#>  Median : 99.45     Median :0.00                               
#>  Mean   :103.44     Mean   :0.62                               
#>  3rd Qu.:120.00     3rd Qu.:1.00                               
#>  Max.   :540.00     Max.   :5.00                               
#>  arrival_date_full         Day         total_people     total_nights   
#>  Min.   :2017-07-01   Min.   :1.000   Min.   : 1.000   Min.   : 0.000  
#>  1st Qu.:2018-02-28   1st Qu.:2.000   1st Qu.: 2.000   1st Qu.: 2.000  
#>  Median :2018-06-12   Median :4.000   Median : 2.000   Median : 3.000  
#>  Mean   :2018-05-24   Mean   :3.981   Mean   : 1.951   Mean   : 3.015  
#>  3rd Qu.:2018-09-19   3rd Qu.:6.000   3rd Qu.: 2.000   3rd Qu.: 4.000  
#>  Max.   :2018-12-31   Max.   :7.000   Max.   :12.000   Max.   :24.000

Modeling

Before make modeling, we must chose feature column, and we dont use column Booking_ID and arrival_date_full

hotel_finish <- hotel_clean %>% select(-Booking_ID, -arrival_date_full) %>% mutate(booking_status = as.factor(booking_status))

split data to train and test

RNGkind(sample.kind = "Rounding")
set.seed(100)

# your code here
index_hotel <- sample(nrow(hotel_finish), nrow(hotel_finish)*0.75)

hotel_train <- hotel_finish[index_hotel,] 
hotel_test <- hotel_finish[-index_hotel,]

Proportional target not balance, but we must try to use modeling

prop.table(table(hotel_train$booking_status))
#> 
#>     Canceled Not_Canceled 
#>    0.3269188    0.6730812

From above prompt, we can see data not balance, we use up sample to balance target

hotel_train_up <- upSample(
  x = hotel_train %>% select(-booking_status),
  y = hotel_train$booking_status,
  yname = "booking_status"
)

check again proprtion of target

prop.table(table(hotel_train_up$booking_status))
#> 
#>     Canceled Not_Canceled 
#>          0.5          0.5

Target has balance, ready for training model

Random Forest is one of the most popular and powerful machine learning algorithms. It falls under the category of ensemble learning, which means it combines several simpler models (in this case, decision trees) to produce a more accurate and stable model. Create a Random Forest model using hotel_train_up with 5-fold cross validation, then the process is repeated 3 times

set.seed(42)

ctrl <- trainControl(method = "repeatedcv",
                     number = 5,     
                     repeats = 3)  

hotel_forest <- train(booking_status ~ .,
                            data = hotel_train_up,
                            method = "rf",       
                            trControl = ctrl)  

In the Bootstrap sampling stage, there is data that is not used in modeling, this is referred to as Out-of-Bag (OOB) data. The Random Forest model will use OOB data as data to evaluate by calculating the error (similar to test data). This error is called OOB error. In the case of classification, OOB error is the percentage of OOB data that is misclassified.

hotel_forest$finalModel
#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = param$mtry) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 11
#> 
#>         OOB estimate of  error rate: 5.33%
#> Confusion matrix:
#>              Canceled Not_Canceled class.error
#> Canceled        17667          626  0.03422074
#> Not_Canceled     1324        16969  0.07237741

The OOB Error value for the hotel_forest model is 5.33%. In other words, the model accuracy on OOB data is 94.67%.

Although random forest is labeled as an uninterpretable model, at least we can see what predictors are most used (important) in making random forest:

varImp(hotel_forest) %>% plot()

Feature importance highest is lead_time and less then is no_of_previous_cancellations

Evaluation

predictions_rf <- predict(hotel_forest, newdata = hotel_test)
conf_matrix_rf <- confusionMatrix(predictions_rf, hotel_test$booking_status, positive = "Canceled")
conf_matrix_rf
#> Confusion Matrix and Statistics
#> 
#>               Reference
#> Prediction     Canceled Not_Canceled
#>   Canceled         2518          420
#>   Not_Canceled      475         5647
#>                                               
#>                Accuracy : 0.9012              
#>                  95% CI : (0.8949, 0.9073)    
#>     No Information Rate : 0.6696              
#>     P-Value [Acc > NIR] : < 0.0000000000000002
#>                                               
#>                   Kappa : 0.7757              
#>                                               
#>  Mcnemar's Test P-Value : 0.07107             
#>                                               
#>             Sensitivity : 0.8413              
#>             Specificity : 0.9308              
#>          Pos Pred Value : 0.8570              
#>          Neg Pred Value : 0.9224              
#>              Prevalence : 0.3304              
#>          Detection Rate : 0.2779              
#>    Detection Prevalence : 0.3243              
#>       Balanced Accuracy : 0.8860              
#>                                               
#>        'Positive' Class : Canceled            
#> 

Accuracy Random Forest model is 90.12 %

Conclusion

Our positive class is Canceled, which means the customer has cancel booking status hotel, while the negative class is Not_Canceled, which means the customer has not cancel booking status. FP: predicting a customer Canceled, while the customer Cancel booking hotel, the hotel risk is that the hotel incurs a loss. FN: predicting the customer Not_Canceled to Booking Status, even though the customer Not Canceled, the hotel risks losing profits. From the hotel side, the concerning risk is FN so the matrix we use is Recall. From the Random Forest methods above, if we are concerned according to the matrix, we will use Recall of 84.13%