The hotel industry thrives on understanding customer behavior, optimizing reservations, and forecasting demand. In this portfolio, I present a comprehensive analysis of hotel reservation data using R programming to derive valuable insights and build predictive models.
Objectives: - Analyze key metrics like booking trends, cancellation rates, and customer preferences. - Visualize reservation data to identify patterns across seasons, room types, and customer demographics. - Build predictive models to forecast future bookings and cancellations. - Optimize operational strategies for enhancing customer satisfaction and revenue generation.
Tools and Techniques: This project leverages powerful R libraries such as ggplot2 for data visualization, dplyr for data manipulation, and caret or xgboost for machine learning models. Advanced statistical techniques and machine learning algorithms, such as decision trees and logistic regression, are applied to ensure accurate predictions and actionable insights.
Through this portfolio, I demonstrate my expertise in data science and R programming, showcasing how data-driven decisions can improve operational efficiency and customer experiences in the hotel industry.
Let’s take a quick look at the data content with the Head() command
We check the data type with the glimpse() command.
#> Rows: 36,275
#> Columns: 19
#> $ Booking_ID <chr> "INN00001", "INN00002", "INN00003…
#> $ no_of_adults <int> 2, 2, 1, 2, 2, 2, 2, 2, 3, 2, 1, …
#> $ no_of_children <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ no_of_weekend_nights <int> 1, 2, 2, 0, 1, 0, 1, 1, 0, 0, 1, …
#> $ no_of_week_nights <int> 2, 3, 1, 2, 1, 2, 3, 3, 4, 5, 0, …
#> $ type_of_meal_plan <chr> "Meal Plan 1", "Not Selected", "M…
#> $ required_car_parking_space <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ room_type_reserved <chr> "Room_Type 1", "Room_Type 1", "Ro…
#> $ lead_time <int> 224, 5, 1, 211, 48, 346, 34, 83, …
#> $ arrival_year <int> 2017, 2018, 2018, 2018, 2018, 201…
#> $ arrival_month <int> 10, 11, 2, 5, 4, 9, 10, 12, 7, 10…
#> $ arrival_date <int> 2, 6, 28, 20, 11, 13, 15, 26, 6, …
#> $ market_segment_type <chr> "Offline", "Online", "Online", "O…
#> $ repeated_guest <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ no_of_previous_cancellations <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ no_of_previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ avg_price_per_room <dbl> 65.00, 106.68, 60.00, 100.00, 94.…
#> $ no_of_special_requests <int> 0, 1, 0, 0, 0, 1, 1, 1, 1, 3, 0, …
#> $ booking_status <chr> "Not_Canceled", "Not_Canceled", "…
From the glimps function above, we can see that the data has 36.275 rows and 19 columns. Here is the explanation of the variables:
Data Dictionary
Booking_ID
: unique identifier of each bookingno_of_adults
: Number of adultsno_of_children
: Number of Childrenno_of_weekend_nights
: Number of weekend nights
(Saturday or Sunday) the guest stayed or booked to stay at the
hotelno_of_week_nights
: Number of week nights (Monday to
Friday) the guest stayed or booked to stay at the hoteltype_of_meal_plan
: Type of meal plan booked by the
customer:required_car_parking_space
: Does the customer require
a car parking space? (0 - No, 1- Yes)room_type_reserved
: Type of room reserved by the
customer. The values are ciphered (encoded) by INN Hotels.lead_time
: Number of days between the date of booking
and the arrival datearrival_year
: Year of arrival datearrival_month
: Month of arrival datearrival_date
: Date of the monthmarket_segment_type
: Market segment designation.repeated_guest
: Is the customer a repeated guest? (0 -
No, 1- Yes)no_of_previous_cancellations
: Number of previous
bookings that were canceled by the customer prior to the current
bookingno_of_previous_bookings_not_canceled
: Number of
previous bookings not canceled by the customer prior to the current
bookingavg_price_per_room
: Average price per day of the
reservation; prices of the rooms are dynamic. (in euros)no_of_special_requests
: Total number of special
requests made by the customer (e.g. high floor, view from the room,
etc)booking_status
: Flag indicating if the booking was
canceled or not.hotel_clean <- hotel %>%
mutate_at(vars(type_of_meal_plan, room_type_reserved, market_segment_type, booking_status), as.factor)
glimpse(hotel_clean)
#> Rows: 36,275
#> Columns: 19
#> $ Booking_ID <chr> "INN00001", "INN00002", "INN00003…
#> $ no_of_adults <int> 2, 2, 1, 2, 2, 2, 2, 2, 3, 2, 1, …
#> $ no_of_children <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ no_of_weekend_nights <int> 1, 2, 2, 0, 1, 0, 1, 1, 0, 0, 1, …
#> $ no_of_week_nights <int> 2, 3, 1, 2, 1, 2, 3, 3, 4, 5, 0, …
#> $ type_of_meal_plan <fct> Meal Plan 1, Not Selected, Meal P…
#> $ required_car_parking_space <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ room_type_reserved <fct> Room_Type 1, Room_Type 1, Room_Ty…
#> $ lead_time <int> 224, 5, 1, 211, 48, 346, 34, 83, …
#> $ arrival_year <int> 2017, 2018, 2018, 2018, 2018, 201…
#> $ arrival_month <int> 10, 11, 2, 5, 4, 9, 10, 12, 7, 10…
#> $ arrival_date <int> 2, 6, 28, 20, 11, 13, 15, 26, 6, …
#> $ market_segment_type <fct> Offline, Online, Online, Online, …
#> $ repeated_guest <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ no_of_previous_cancellations <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ no_of_previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ avg_price_per_room <dbl> 65.00, 106.68, 60.00, 100.00, 94.…
#> $ no_of_special_requests <int> 0, 1, 0, 0, 0, 1, 1, 1, 1, 3, 0, …
#> $ booking_status <fct> Not_Canceled, Not_Canceled, Cance…
Check Missing Values
#> Booking_ID no_of_adults
#> 0 0
#> no_of_children no_of_weekend_nights
#> 0 0
#> no_of_week_nights type_of_meal_plan
#> 0 0
#> required_car_parking_space room_type_reserved
#> 0 0
#> lead_time arrival_year
#> 0 0
#> arrival_month arrival_date
#> 0 0
#> market_segment_type repeated_guest
#> 0 0
#> no_of_previous_cancellations no_of_previous_bookings_not_canceled
#> 0 0
#> avg_price_per_room no_of_special_requests
#> 0 0
#> booking_status
#> 0
No missing values
Check the distribution/pattern of the data
#> Booking_ID no_of_adults no_of_children no_of_weekend_nights
#> Length:36275 Min. :0.000 Min. : 0.0000 Min. :0.0000
#> Class :character 1st Qu.:2.000 1st Qu.: 0.0000 1st Qu.:0.0000
#> Mode :character Median :2.000 Median : 0.0000 Median :1.0000
#> Mean :1.845 Mean : 0.1053 Mean :0.8107
#> 3rd Qu.:2.000 3rd Qu.: 0.0000 3rd Qu.:2.0000
#> Max. :4.000 Max. :10.0000 Max. :7.0000
#>
#> no_of_week_nights type_of_meal_plan required_car_parking_space
#> Min. : 0.000 Meal Plan 1 :27835 Min. :0.00000
#> 1st Qu.: 1.000 Meal Plan 2 : 3305 1st Qu.:0.00000
#> Median : 2.000 Meal Plan 3 : 5 Median :0.00000
#> Mean : 2.204 Not Selected: 5130 Mean :0.03099
#> 3rd Qu.: 3.000 3rd Qu.:0.00000
#> Max. :17.000 Max. :1.00000
#>
#> room_type_reserved lead_time arrival_year arrival_month
#> Room_Type 1:28130 Min. : 0.00 Min. :2017 Min. : 1.000
#> Room_Type 2: 692 1st Qu.: 17.00 1st Qu.:2018 1st Qu.: 5.000
#> Room_Type 3: 7 Median : 57.00 Median :2018 Median : 8.000
#> Room_Type 4: 6057 Mean : 85.23 Mean :2018 Mean : 7.424
#> Room_Type 5: 265 3rd Qu.:126.00 3rd Qu.:2018 3rd Qu.:10.000
#> Room_Type 6: 966 Max. :443.00 Max. :2018 Max. :12.000
#> Room_Type 7: 158
#> arrival_date market_segment_type repeated_guest
#> Min. : 1.0 Aviation : 125 Min. :0.00000
#> 1st Qu.: 8.0 Complementary: 391 1st Qu.:0.00000
#> Median :16.0 Corporate : 2017 Median :0.00000
#> Mean :15.6 Offline :10528 Mean :0.02564
#> 3rd Qu.:23.0 Online :23214 3rd Qu.:0.00000
#> Max. :31.0 Max. :1.00000
#>
#> no_of_previous_cancellations no_of_previous_bookings_not_canceled
#> Min. : 0.00000 Min. : 0.0000
#> 1st Qu.: 0.00000 1st Qu.: 0.0000
#> Median : 0.00000 Median : 0.0000
#> Mean : 0.02335 Mean : 0.1534
#> 3rd Qu.: 0.00000 3rd Qu.: 0.0000
#> Max. :13.00000 Max. :58.0000
#>
#> avg_price_per_room no_of_special_requests booking_status
#> Min. : 0.00 Min. :0.0000 Canceled :11885
#> 1st Qu.: 80.30 1st Qu.:0.0000 Not_Canceled:24390
#> Median : 99.45 Median :0.0000
#> Mean :103.42 Mean :0.6197
#> 3rd Qu.:120.00 3rd Qu.:1.0000
#> Max. :540.00 Max. :5.0000
#>
ggplot(hotel_clean, aes(x = no_of_adults, fill = booking_status)) +
geom_bar(position = "dodge") +
labs(
x = "Number of Adults",
y = "Frequency",
title = "Bar Plot of Number of Adults by Booking Status"
) +
theme_minimal()+
scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))
Insight: no_of_adults
with 2 have
booking_status
Canceled highest
ggplot(hotel_clean, aes(x = no_of_children, fill = booking_status)) +
geom_bar(position = "dodge") +
labs(
x = "Number of Children",
y = "Frequency",
title = "Bar Plot of Number of Children by Booking Status"
) +
theme_minimal()+
scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))
Insight: no_of_children
with 0 have
booking_status
Canceled highest
ggplot(hotel_clean, aes(x = no_of_weekend_nights, fill = booking_status)) +
geom_bar(position = "dodge") +
labs(
x = "Number of Weekend Night",
y = "Frequency",
title = "Bar Plot of Number of Weekend Night by Booking Status"
) +
theme_minimal()+
scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))
Insight: no_of_weekend_nights
with 0 have
booking_status
Canceled highest
ggplot(hotel_clean, aes(x = no_of_week_nights, fill = booking_status)) +
geom_bar(position = "dodge") +
labs(
x = "Number of Week Night",
y = "Frequency",
title = "Bar Plot of Number of Week Night by Booking Status"
) +
theme_minimal()+
scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))
Insight: no_of_week_nights
with 3 have
booking_status
Canceled highest
ggplot(hotel_clean, aes(x = type_of_meal_plan, fill = booking_status)) +
geom_bar(position = "dodge") +
labs(
x = "Type of Meal Plan",
y = "Count",
title = "Bar Plot of Booking Status by Type of Meal Plan"
) +
theme_minimal() +
scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))
Insight: type_of_meal_plan
with Meal Plan 1 have
booking_status
Canceled highest
ggplot(hotel_clean, aes(x = required_car_parking_space, fill = booking_status)) +
geom_bar(position = "dodge") +
labs(
x = "Required Car Parking Space",
y = "Frequency",
title = " Bar Plot Required Car Parking Space by Booking Status"
) +
theme_minimal()+
scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))
Insight: required_car_parking_space
with 0 have
booking_status
Canceled highest
ggplot(hotel_clean, aes(x = room_type_reserved, fill = booking_status)) +
geom_bar(position = "dodge") +
labs(
x = "Room Type Reserved",
y = "Frequency",
title = " Bar Plot Room Type Reserved by Booking Status"
) +
theme_minimal()+
scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))
Insight: room_type_reserved
with Room_type 1 have
booking_status
Canceled highest
ggplot(hotel_clean, aes(x = as.factor(booking_status), y = lead_time)) +
geom_boxplot(fill = "#B3D9FF", color = "black") +
labs(
x = "Booking Status",
y = "Lead Time",
title = "Box Plot of Lead Time by Booking Status"
) +
theme_minimal()
Insight: booking_status
Canceled have mean
lead_time
higher than booking_status
Not_Canceled, both booking_status
have outliers
ggplot(hotel_clean, aes(x = arrival_year, fill = booking_status)) +
geom_bar(position = "dodge") +
labs(
x = "Arrival Year",
y = "Frequency",
title = " Bar Plot Arrival Year by Booking Status"
) +
theme_minimal()+
scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))
Insight: arrival_year
2018 have
booking_status
Canceled higher than other
ggplot(hotel_clean, aes(x = arrival_month, fill = booking_status)) +
geom_bar(position = "dodge") +
labs(
x = "Arrival Month",
y = "Frequency",
title = " Bar Plot Arrival Month by Booking Status"
) +
theme_minimal()+
scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))
Insight: arrival_month
10 have
booking_status
Canceled higher than other
ggplot(hotel_clean, aes(x = arrival_date, fill = booking_status)) +
geom_bar(position = "dodge") +
labs(
x = "Arrival Date",
y = "Frequency",
title = " Bar Plot Arrival Date by Booking Status"
) +
theme_minimal()+
scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))
Insight: arrival_date
15 have
booking_status
Canceled higher than other
ggplot(hotel_clean, aes(x = market_segment_type, fill = booking_status)) +
geom_bar(position = "dodge") +
labs(
x = "Market Segment Type",
y = "Frequency",
title = " Bar Plot Market Segment Type by Booking Status"
) +
theme_minimal()+
scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))
Insight: market_segment_type
online have
booking_status
Canceled higher than other
ggplot(hotel_clean, aes(x = repeated_guest, fill = booking_status)) +
geom_bar(position = "dodge") +
labs(
x = "Repeated Guest",
y = "Frequency",
title = " Bar Plot Repeated Guest by Booking Status"
) +
theme_minimal()+
scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))
Insight: repeated_guest
0 have
booking_status
Canceled higher than other
ggplot(hotel_clean, aes(x = no_of_previous_cancellations, fill = booking_status)) +
geom_bar(position = "dodge") +
labs(
x = "Number Of Previous Cancellations",
y = "Frequency",
title = " Bar Plot Number Of Previous Cancellations by Booking Status"
) +
theme_minimal()+
scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))
Insight: no_of_previous_cancellations
0 have
booking_status
Canceled higher than other
ggplot(hotel_clean, aes(x = no_of_previous_bookings_not_canceled, fill = booking_status)) +
geom_bar(position = "dodge") +
labs(
x = "Number Of Previous Booking Not Cancellations",
y = "Frequency",
title = " Bar Plot Number Of Previous Booking Not Cancellations by Booking Status"
) +
theme_minimal()+
scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))
Insight: no_of_previous_bookings_not_canceled
0 have
booking_status
Canceled higher than other
ggplot(hotel_clean, aes(x = as.factor(booking_status), y = avg_price_per_room)) +
geom_boxplot(fill = "#B3D9FF", color = "black") +
labs(
x = "Booking Status",
y = "Average Price Per Room",
title = "Box Plot of Average Price Per Room by Booking Status"
) +
theme_minimal()
Insight: booking_status
Not_Canceled have mean
avg_price_per_room
higher, booking_status
Canceled have median higher, both booking_status
have
outliers
ggplot(hotel_clean, aes(x = no_of_special_requests, fill = booking_status)) +
geom_bar(position = "dodge") +
labs(
x = "Number Of Special Requests",
y = "Frequency",
title = " Bar Plot Number Of Special Requests by Booking Status"
) +
theme_minimal()+
scale_fill_manual(values = c("#B3D9FF", "#FFB3B3"))
Insight; no_of_special_requests
0 have
booking_status
Canceled Higher than others
Feature engineering in R refers to the process of transforming raw
data into meaningful features (variables) that can be used to improve
the performance of machine learning models. It involves creating,
modifying, or selecting variables that better represent the underlying
patterns in the data and make it easier for models to learn. Combining
arrival_year
, arrival_month
, and
arrival_date
into a single arrival_date_full
.
make Day from extract arrival_date_full
.
hotel_clean$arrival_date_full <- as.Date(
paste(hotel_clean$arrival_year, hotel_clean$arrival_month, hotel_clean$arrival_date, sep = "-"))
hotel_clean <- hotel_clean %>% mutate(Day = weekdays(arrival_date_full))
hotel_clean <- na.omit(hotel_clean)
Combining no_of_adults
and no_of_children
into single total_people
, combining
no_of_weekend_nights
and no_of_week_nights
into single total_nights
hotel_clean <- hotel_clean %>%
mutate(total_people = no_of_adults + no_of_children,
total_nights = no_of_weekend_nights + no_of_week_nights)
#>
#> Friday Monday Saturday Sunday Thursday Tuesday Wednesday
#> 0.1308295 0.1484629 0.1495392 0.1692974 0.1253932 0.1351896 0.1412882
change data type
hotel_clean$room_type_reserved <- as.numeric(hotel_clean$room_type_reserved)
hotel_clean$type_of_meal_plan <- as.numeric(hotel_clean$type_of_meal_plan)
hotel_clean$market_segment_type <- as.numeric(hotel_clean$market_segment_type)
hotel_clean$Day <- as.numeric(factor(hotel_clean$Day))
#>
#> 1 2 3 4 5 6 7
#> 0.1308295 0.1484629 0.1495392 0.1692974 0.1253932 0.1351896 0.1412882
#> Booking_ID no_of_adults
#> 0 0
#> no_of_children no_of_weekend_nights
#> 0 0
#> no_of_week_nights type_of_meal_plan
#> 0 0
#> required_car_parking_space room_type_reserved
#> 0 0
#> lead_time arrival_year
#> 0 0
#> arrival_month arrival_date
#> 0 0
#> market_segment_type repeated_guest
#> 0 0
#> no_of_previous_cancellations no_of_previous_bookings_not_canceled
#> 0 0
#> avg_price_per_room no_of_special_requests
#> 0 0
#> booking_status arrival_date_full
#> 0 0
#> Day total_people
#> 0 0
#> total_nights
#> 0
Check all dataset
#> Rows: 36,238
#> Columns: 23
#> $ Booking_ID <chr> "INN00001", "INN00002", "INN00003…
#> $ no_of_adults <int> 2, 2, 1, 2, 2, 2, 2, 2, 3, 2, 1, …
#> $ no_of_children <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ no_of_weekend_nights <int> 1, 2, 2, 0, 1, 0, 1, 1, 0, 0, 1, …
#> $ no_of_week_nights <int> 2, 3, 1, 2, 1, 2, 3, 3, 4, 5, 0, …
#> $ type_of_meal_plan <dbl> 1, 4, 1, 1, 4, 2, 1, 1, 1, 1, 4, …
#> $ required_car_parking_space <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ room_type_reserved <dbl> 1, 1, 1, 1, 1, 1, 1, 4, 1, 4, 1, …
#> $ lead_time <int> 224, 5, 1, 211, 48, 346, 34, 83, …
#> $ arrival_year <int> 2017, 2018, 2018, 2018, 2018, 201…
#> $ arrival_month <int> 10, 11, 2, 5, 4, 9, 10, 12, 7, 10…
#> $ arrival_date <int> 2, 6, 28, 20, 11, 13, 15, 26, 6, …
#> $ market_segment_type <dbl> 4, 5, 5, 5, 5, 5, 5, 5, 4, 5, 5, …
#> $ repeated_guest <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ no_of_previous_cancellations <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ no_of_previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ avg_price_per_room <dbl> 65.00, 106.68, 60.00, 100.00, 94.…
#> $ no_of_special_requests <int> 0, 1, 0, 0, 0, 1, 1, 1, 1, 3, 0, …
#> $ booking_status <fct> Not_Canceled, Not_Canceled, Cance…
#> $ arrival_date_full <date> 2017-10-02, 2018-11-06, 2018-02-…
#> $ Day <dbl> 2, 6, 7, 4, 7, 5, 4, 7, 1, 5, 6, …
#> $ total_people <int> 2, 2, 1, 2, 2, 2, 2, 2, 3, 2, 1, …
#> $ total_nights <int> 3, 5, 3, 2, 2, 2, 4, 4, 4, 5, 1, …
#> Booking_ID no_of_adults no_of_children no_of_weekend_nights
#> Length:36238 Min. :0.000 Min. : 0.0000 Min. :0.0000
#> Class :character 1st Qu.:2.000 1st Qu.: 0.0000 1st Qu.:0.0000
#> Mode :character Median :2.000 Median : 0.0000 Median :1.0000
#> Mean :1.845 Mean : 0.1052 Mean :0.8105
#> 3rd Qu.:2.000 3rd Qu.: 0.0000 3rd Qu.:2.0000
#> Max. :4.000 Max. :10.0000 Max. :7.0000
#> no_of_week_nights type_of_meal_plan required_car_parking_space
#> Min. : 0.000 Min. :1.000 Min. :0.00000
#> 1st Qu.: 1.000 1st Qu.:1.000 1st Qu.:0.00000
#> Median : 2.000 Median :1.000 Median :0.00000
#> Mean : 2.204 Mean :1.516 Mean :0.03093
#> 3rd Qu.: 3.000 3rd Qu.:1.000 3rd Qu.:0.00000
#> Max. :17.000 Max. :4.000 Max. :1.00000
#> room_type_reserved lead_time arrival_year arrival_month
#> Min. :1.000 Min. : 0.00 Min. :2017 Min. : 1.000
#> 1st Qu.:1.000 1st Qu.: 17.00 1st Qu.:2018 1st Qu.: 5.000
#> Median :1.000 Median : 57.00 Median :2018 Median : 8.000
#> Mean :1.708 Mean : 85.28 Mean :2018 Mean : 7.429
#> 3rd Qu.:1.000 3rd Qu.:126.00 3rd Qu.:2018 3rd Qu.:10.000
#> Max. :7.000 Max. :443.00 Max. :2018 Max. :12.000
#> arrival_date market_segment_type repeated_guest
#> Min. : 1.00 Min. :1.000 Min. :0.00000
#> 1st Qu.: 8.00 1st Qu.:4.000 1st Qu.:0.00000
#> Median :16.00 Median :5.000 Median :0.00000
#> Mean :15.58 Mean :4.553 Mean :0.02555
#> 3rd Qu.:23.00 3rd Qu.:5.000 3rd Qu.:0.00000
#> Max. :31.00 Max. :5.000 Max. :1.00000
#> no_of_previous_cancellations no_of_previous_bookings_not_canceled
#> Min. : 0.00000 Min. : 0.000
#> 1st Qu.: 0.00000 1st Qu.: 0.000
#> Median : 0.00000 Median : 0.000
#> Mean : 0.02335 Mean : 0.153
#> 3rd Qu.: 0.00000 3rd Qu.: 0.000
#> Max. :13.00000 Max. :58.000
#> avg_price_per_room no_of_special_requests booking_status
#> Min. : 0.00 Min. :0.00 Canceled :11878
#> 1st Qu.: 80.30 1st Qu.:0.00 Not_Canceled:24360
#> Median : 99.45 Median :0.00
#> Mean :103.44 Mean :0.62
#> 3rd Qu.:120.00 3rd Qu.:1.00
#> Max. :540.00 Max. :5.00
#> arrival_date_full Day total_people total_nights
#> Min. :2017-07-01 Min. :1.000 Min. : 1.000 Min. : 0.000
#> 1st Qu.:2018-02-28 1st Qu.:2.000 1st Qu.: 2.000 1st Qu.: 2.000
#> Median :2018-06-12 Median :4.000 Median : 2.000 Median : 3.000
#> Mean :2018-05-24 Mean :3.981 Mean : 1.951 Mean : 3.015
#> 3rd Qu.:2018-09-19 3rd Qu.:6.000 3rd Qu.: 2.000 3rd Qu.: 4.000
#> Max. :2018-12-31 Max. :7.000 Max. :12.000 Max. :24.000
Before make modeling, we must chose feature column, and we dont use
column Booking_ID
and arrival_date_full
hotel_finish <- hotel_clean %>% select(-Booking_ID, -arrival_date_full) %>% mutate(booking_status = as.factor(booking_status))
split data to train and test
RNGkind(sample.kind = "Rounding")
set.seed(100)
# your code here
index_hotel <- sample(nrow(hotel_finish), nrow(hotel_finish)*0.75)
hotel_train <- hotel_finish[index_hotel,]
hotel_test <- hotel_finish[-index_hotel,]
Proportional target not balance, but we must try to use modeling
#>
#> Canceled Not_Canceled
#> 0.3269188 0.6730812
From above prompt, we can see data not balance, we use up sample to balance target
hotel_train_up <- upSample(
x = hotel_train %>% select(-booking_status),
y = hotel_train$booking_status,
yname = "booking_status"
)
check again proprtion of target
#>
#> Canceled Not_Canceled
#> 0.5 0.5
Target has balance, ready for training model
Random Forest is one of the most popular and powerful machine learning algorithms. It falls under the category of ensemble learning, which means it combines several simpler models (in this case, decision trees) to produce a more accurate and stable model. Create a Random Forest model using hotel_train_up with 5-fold cross validation, then the process is repeated 3 times
set.seed(42)
ctrl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 3)
hotel_forest <- train(booking_status ~ .,
data = hotel_train_up,
method = "rf",
trControl = ctrl)
In the Bootstrap sampling stage, there is data that is not used in modeling, this is referred to as Out-of-Bag (OOB) data. The Random Forest model will use OOB data as data to evaluate by calculating the error (similar to test data). This error is called OOB error. In the case of classification, OOB error is the percentage of OOB data that is misclassified.
#>
#> Call:
#> randomForest(x = x, y = y, mtry = param$mtry)
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 11
#>
#> OOB estimate of error rate: 5.33%
#> Confusion matrix:
#> Canceled Not_Canceled class.error
#> Canceled 17667 626 0.03422074
#> Not_Canceled 1324 16969 0.07237741
The OOB Error value for the hotel_forest model is 5.33%. In other words, the model accuracy on OOB data is 94.67%.
Although random forest is labeled as an uninterpretable model, at least we can see what predictors are most used (important) in making random forest:
Feature importance highest is lead_time
and less then is
no_of_previous_cancellations
conf_matrix_rf <- confusionMatrix(predictions_rf, hotel_test$booking_status, positive = "Canceled")
conf_matrix_rf
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Canceled Not_Canceled
#> Canceled 2518 420
#> Not_Canceled 475 5647
#>
#> Accuracy : 0.9012
#> 95% CI : (0.8949, 0.9073)
#> No Information Rate : 0.6696
#> P-Value [Acc > NIR] : < 0.0000000000000002
#>
#> Kappa : 0.7757
#>
#> Mcnemar's Test P-Value : 0.07107
#>
#> Sensitivity : 0.8413
#> Specificity : 0.9308
#> Pos Pred Value : 0.8570
#> Neg Pred Value : 0.9224
#> Prevalence : 0.3304
#> Detection Rate : 0.2779
#> Detection Prevalence : 0.3243
#> Balanced Accuracy : 0.8860
#>
#> 'Positive' Class : Canceled
#>
Accuracy Random Forest model is 90.12 %
Our positive class is Canceled, which means the customer has cancel booking status hotel, while the negative class is Not_Canceled, which means the customer has not cancel booking status. FP: predicting a customer Canceled, while the customer Cancel booking hotel, the hotel risk is that the hotel incurs a loss. FN: predicting the customer Not_Canceled to Booking Status, even though the customer Not Canceled, the hotel risks losing profits. From the hotel side, the concerning risk is FN so the matrix we use is Recall. From the Random Forest methods above, if we are concerned according to the matrix, we will use Recall of 84.13%