  • Background
  • Library and Setup
  • Hotel Demand Forecasting
    • Import Data
    • Data Preprocessing
    • Exploratory Data Analysis
      • Market Segments
      • Time Series Analysis
      • Seasonality Analysis
    • Cross-Validation
    • Forecasting Methods
      • Preprocessing Specification
      • Seasonality Specification
      • Impute Outlier
      • Model Specification
      • Model Fitting
      • Forecasting Result
    • Model Assumption Checking
      • Autocorrelation
      • Normality
  • Conclusion

Background

The hospitality industry is growing, with more and more people spending money on vacations and leisure activities. People tend to stay at a hotel only during a holiday season or a special event, so the demand for rooms is not evenly distributed across the year. To maximize revenue, hotel management often employs a pricing strategy, such as raising the room rate when demand is high and offering promotions when demand is low. The ability to forecast future demand accurately is therefore a vital part of the pricing scheme. Demand may also differ between customer segments, which makes forecasting harder because each segment may require a different model.

This post focuses on fitting and tuning different forecasting models on a real dataset using the purrr package.

Library and Setup

Below is the list of required packages if you wish to reproduce the results. The full source code for this post is available at my GitHub repository.

Hotel Demand Forecasting

Import Data

Let’s import the dataset. The data comes from Nuno et al. (2019). It consists of 119,390 booking transactions from two hotels: an anonymous city hotel in Lisbon and a resort hotel in the Algarve. The dataset comprises bookings due to arrive between 1 July 2015 and 31 August 2017, including bookings that effectively arrived and bookings that were canceled. There is much to explore in this data, but we will focus only on demand forecasting.

    hotel         is_canceled  lead_time  arrival_date_year  arrival_date_month
    <fctr>        <int>        <int>      <int>              <fctr>
 1  Resort Hotel  0            342        2015               July
 2  Resort Hotel  0            737        2015               July
 3  Resort Hotel  0            7          2015               July
 4  Resort Hotel  0            13         2015               July
 5  Resort Hotel  0            14         2015               July
 6  Resort Hotel  0            14         2015               July
 7  Resort Hotel  0            0          2015               July
 8  Resort Hotel  0            9          2015               July
 9  Resort Hotel  1            85         2015               July
10  Resort Hotel  1            75         2015               July

Data Description:

  • hotel : Hotel (Resort Hotel or City Hotel)

  • is_canceled : Value indicating if the booking was canceled (1) or not (0)

  • lead_time : Number of days that elapsed between the entering date of the booking into the PMS and the arrival date

  • arrival_date_year : Year of arrival date

  • arrival_date_month : Month of arrival date

  • arrival_date_week_number : Week number of year for arrival date

  • arrival_date_day_of_month : Day of arrival date

  • stays_in_weekend_nights : Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel

  • stays_in_week_nights : Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel

  • adults : Number of adults

  • children : Number of children

  • babies : Number of babies

  • meal : Type of meal booked. Categories are presented in standard hospitality meal packages:

    • Undefined/SC – no meal package
    • BB – Bed & Breakfast
    • HB – Half board (breakfast and one other meal – usually dinner)
    • FB – Full board (breakfast, lunch and dinner)
  • country : Country of origin. Categories are represented as ISO 3166 country codes (the data source cites this as ISO 3155–3:2013)

  • market_segment : Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”

  • distribution_channel : Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”

  • is_repeated_guest : Value indicating if the booking name was from a repeated guest (1) or not (0)

  • previous_cancellations : Number of previous bookings that were cancelled by the customer prior to the current booking

  • previous_bookings_not_canceled : Number of previous bookings not cancelled by the customer prior to the current booking

  • reserved_room_type : Code of room type reserved. Code is presented instead of designation for anonymity reasons.

  • assigned_room_type : Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons.

  • booking_changes : Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation

  • deposit_type : Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories:

    • No Deposit – no deposit was made
    • Non Refund – a deposit was made in the value of the total stay cost
    • Refundable – a deposit was made with a value under the total cost of stay.
  • agent : ID of the travel agency that made the booking

  • company : ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons

  • days_in_waiting_list : Number of days the booking was in the waiting list before it was confirmed to the customer

  • customer_type : Type of booking, assuming one of four categories:

    • Contract – when the booking has an allotment or other type of contract associated to it
    • Group – when the booking is associated to a group
    • Transient – when the booking is not part of a group or contract, and is not associated to other transient booking
    • Transient-party – when the booking is transient, but is associated to at least other transient booking
  • adr : Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights

  • required_car_parking_spaces : Number of car parking spaces required by the customer

  • total_of_special_requests : Number of special requests made by the customer (e.g. twin bed or high floor)

  • reservation_status : Reservation last status, assuming one of three categories:

    • Canceled – booking was canceled by the customer
    • Check-Out – customer has checked in but already departed
    • No-Show – customer did not check-in and did inform the hotel of the reason why
  • reservation_status_date : Date at which the last status was set. This variable can be used in conjunction with reservation_status to understand when the booking was canceled or when the customer checked out of the hotel

Data Preprocessing

Before we analyze the data, note that the arrival date is scattered across separate columns. We will unite them into a single, proper arrival date column.
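As an illustration of this step, here is a minimal base R sketch. It is not the original code from the post: the three toy rows are made up, and it assumes the month is stored as an English month name, as in the data preview above.

```r
# Toy rows mimicking the three arrival-date columns of the bookings data.
bookings <- data.frame(
  arrival_date_year         = c(2015L, 2015L, 2016L),
  arrival_date_month        = c("July", "August", "February"),
  arrival_date_day_of_month = c(1L, 15L, 29L),
  stringsAsFactors = FALSE
)

# Map month names to month numbers with base R's `month.name`,
# which avoids locale-dependent parsing of "%B".
month_number <- match(bookings$arrival_date_month, month.name)

# Build an ISO date string and convert it to a Date column.
bookings$arrival_date <- as.Date(sprintf(
  "%d-%02d-%02d",
  bookings$arrival_date_year,
  month_number,
  bookings$arrival_date_day_of_month
))

print(bookings$arrival_date)
```

Matching against `month.name` keeps the conversion independent of the system locale, which matters when the month names are in English but the machine is not.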

   hotel         is_canceled  lead_time  arrival_date  arrival_date_week_number
   <fctr>        <int>        <int>      <date>        <int>
1  Resort Hotel  0            342        2015-07-01    27
2  Resort Hotel  0            737        2015-07-01    27
3  Resort Hotel  0            7          2015-07-01    27
4  Resort Hotel  0            13         2015-07-01    27
5  Resort Hotel  0            14         2015-07-01    27
6  Resort Hotel  0            14         2015-07-01    27

Exploratory Data Analysis

Market Segments

For each hotel, we have several market segments, as mentioned earlier in the data description. In order to maximize our revenue, we will forecast the most profitable market segments. First, we will look at the number of transactions for each market segment in each hotel.

Based on the result, most bookings are made via travel agents, either online or offline, which together contribute more than 40% of the total non-canceled transactions. The other segments do not have many transactions, but we would also like to see the revenue generated by each market segment by looking at the Average Daily Rate (ADR). ADR is defined as the sum of all lodging transactions divided by the total number of staying nights, so a higher ADR means more revenue generated per staying night.

Total ADR generated from the online travel agent segment is the highest in both the city hotel and the resort hotel. The offline travel agent and direct segments are close to each other, with the direct segment contributing more in the resort hotel even though it has fewer transactions. This gives us the top 3 market segments in terms of both quantity and profitability. We will focus on these segments for the rest of the analysis.

Time Series Analysis

We want to forecast the demand for lodging in both the city hotel and the resort hotel. Based on the previous exploration, we will use only the data from the online travel agent, offline travel agent, and direct segments. We will consider both canceled and non-canceled transactions to reflect the demand.
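The aggregation described above can be sketched in base R as follows. This is an assumption about how the daily demand could be computed, not the post's own code, and the six bookings are invented for the example.

```r
# Toy bookings: one row per booking, canceled or not.
bookings <- data.frame(
  arrival_date   = as.Date(c("2015-07-01", "2015-07-01", "2015-07-01",
                             "2015-07-02", "2015-07-02", "2015-07-03")),
  market_segment = c("Online TA", "Direct", "Groups",
                     "Online TA", "Offline TA/TO", "Corporate"),
  is_canceled    = c(0L, 1L, 0L, 0L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Keep the three segments chosen in the exploration; cancellations stay in.
top_segments <- c("Online TA", "Offline TA/TO", "Direct")
kept <- bookings[bookings$market_segment %in% top_segments, ]

# Demand is the number of bookings per arrival date.
demand <- aggregate(
  list(demand = rep(1L, nrow(kept))),
  by  = list(arrival_date = kept$arrival_date),
  FUN = sum
)

print(demand)
```

Days without any booking in the kept segments simply do not appear in the result, which is why the series has to be made regular later on.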

arrival_date  demand
<date>        <int>
2015-07-01    79
2015-07-02    8
2015-07-03    3
2015-07-04    6
2015-07-05    8
2015-07-06    10
2015-07-07    18
2015-07-08    8
2015-07-09    6
2015-07-10    7

Let’s look at the series of demand for each hotel.

The demand for the city hotel fluctuates more than the demand for the resort hotel. This may be caused by several factors, including room capacity, which we do not know for either hotel. There are also some spikes in demand for the city hotel in late 2015. We will inspect this further by dividing the series by market segment. We also need the series to have a constant time interval, which is daily in our case.
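One way to obtain a constant daily interval is to build the full calendar and join the observed demand onto it. The base R sketch below is illustrative only; in particular, filling missing days with zero is an assumption (no booking recorded means no demand), not something the post states.

```r
# Toy daily demand with a two-day gap (3 and 4 July are missing).
demand <- data.frame(
  arrival_date = as.Date(c("2015-07-01", "2015-07-02", "2015-07-05")),
  demand       = c(79L, 8L, 8L)
)

# Every calendar day between the first and last observed date.
all_days <- data.frame(
  arrival_date = seq(min(demand$arrival_date), max(demand$arrival_date), by = "day")
)

# Left join onto the calendar; days without bookings come back as NA.
padded <- merge(all_days, demand, by = "arrival_date", all.x = TRUE)

# Assumption: a day with no recorded booking has zero demand.
padded$demand[is.na(padded$demand)] <- 0L

print(padded)
```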

The behaviour of each segment is quite different, so we will forecast them separately. At this point, we have 6 different time series to forecast: 3 segments for each of the 2 hotels.

Seasonality Analysis

Before we create a forecasting model, we will try to gain more insight into customer behaviour by looking at the seasonality of the demand. Since we have too many series, we will only explore the most lucrative market segment, online TA, for both the city hotel and the resort hotel.

Weekly Seasonality

First, we will look at the weekly seasonality, which helps us understand on which days people check in most frequently. Common sense suggests that people would start to check in on the weekend. However, the weekly seasonality may be weak, since people do not go on vacation or stay at a hotel on a regular weekly basis.
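To make the idea of a weekly seasonal component concrete, here is a small base R sketch on simulated data. The day-of-week effects are hypothetical, and `stats::stl()` is used as one possible decomposition; the post does not say which decomposition produced its figures.

```r
set.seed(42)
n_weeks <- 20

# Hypothetical day-of-week effect, Monday to Sunday; it sums to zero.
day_effect <- c(2, 6, 3, 1, 0, -4, -8)
demand <- 40 + rep(day_effect, times = n_weeks) + rnorm(7 * n_weeks, sd = 1)

# A frequency of 7 tells R that one seasonal cycle spans seven observations.
weekly_ts <- ts(demand, frequency = 7)

# Periodic STL decomposition: the seasonal part repeats identically each week.
fit <- stl(weekly_ts, s.window = "periodic")
seasonal <- fit$time.series[, "seasonal"]

# The first seven values give the estimated effect of each day of the week.
weekly_pattern <- setNames(
  as.numeric(seasonal[1:7]),
  c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
)

print(round(weekly_pattern, 2))
```

A positive value means arrivals on that day sit above the trend, and a negative value means they sit below it, which is how the seasonality figures in this section are read.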

The following figure shows the weekly seasonality of the online TA segment for the city hotel. It has a negative seasonality on the weekend (Saturday and Sunday) and a high positive seasonality on Tuesday. Thus, the hotel is likely to have fewer arrivals on the weekend, perhaps because it is less of a vacation hotel and more of a business or transit hotel.

Next, we will look at the weekly seasonality of the online TA segment of the resort hotel as well.

The customer behaviour is much the same, with a strong positive seasonality on Wednesday and a negative seasonality on Sunday. It is natural that the number of arrivals drops on Sunday, since people go back home and resume their ordinary activities on Monday. Another possible reason is that Monday is the day on which most monuments and museums in Portugal are closed, so people who check in on Sunday evening do not have much to visit the next day. The peak on Wednesday may signal that people want to spend more time ahead of the upcoming weekend, or that they simply want to avoid the weekend crowd.

Monthly Seasonality

We will also check the monthly seasonality to see in which months it reaches its highest and lowest points. The first one is the city hotel.

The next one is the online TA segment in the resort hotel. The seasonality reaches its highest point in October and, as with the city hotel, its lowest point in March.

Based on both graphs, the high positive seasonality happens around May-June and September-October, and the strongest negative seasonality happens in March. Both Lisbon and the Algarve are located in Portugal. According to Audley Travels, the best time to visit Portugal is in spring (March-May), when the country is in bloom and waking after the winter. You could also go in fall (between September and October), when the sun is still shining, the weather is warm, and many of the crowds have dispersed. However, the negative seasonality in March and April perhaps tells us that the weather is still too cold for travelling around, and that people prefer to go on vacation in September and October. The school summer holiday, which spans from late June to early September, has a positive influence on the city hotel seasonality.

September and October are two of the best months to visit Portugal. The weather is still warm and pleasant, and the temperatures are much more manageable for sightseeing or hiking. It’s also a wonderful time to visit many of Portugal’s wineries with the grape harvest in full swing. The beaches are also much quieter.
Audley Travels

For the next sections, we will focus on the forecasting of the demand using various machine learning methods.

Cross-Validation

We will split the data into a training dataset and a testing dataset, with the testing dataset consisting of the last 30 days of the full dataset.
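A time-ordered split like this can be sketched in base R as shown below. The data frame and its placeholder demand values are invented for the example; only the "last 30 days" rule comes from the text.

```r
# Toy daily series covering 1 May to 31 August 2017.
daily <- data.frame(
  arrival_date = seq(as.Date("2017-05-01"), as.Date("2017-08-31"), by = "day")
)
daily$demand <- seq_len(nrow(daily))  # placeholder values

# The last 30 calendar days form the test set; everything before is training.
test_days <- 30
cutoff <- max(daily$arrival_date) - test_days

train <- daily[daily$arrival_date <= cutoff, ]
test  <- daily[daily$arrival_date >  cutoff, ]

print(range(test$arrival_date))
```

Splitting by date rather than at random keeps the test period strictly after the training period, which is what a forecast will face in practice.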

arrival_date  market_segment  demand  hotel
<date>        <fctr>          <dbl>   <chr>
2017-07-23    Online TA       26      Resort Hotel
2017-07-24    Online TA       46      Resort Hotel
2017-07-25    Online TA       25      Resort Hotel
2017-07-26    Online TA       32      Resort Hotel
2017-07-27    Online TA       33      Resort Hotel
2017-07-28    Online TA       36      Resort Hotel
2017-07-29    Online TA       51      Resort Hotel
2017-07-30    Online TA       23      Resort Hotel
2017-07-31    Online TA       44      Resort Hotel
2017-08-01    Online TA       49      Resort Hotel

Forecasting Methods

We will forecast each segment of each hotel separately. This is done to capture the pattern of each series, since they have different characteristics and an aggregated forecast may result in a higher error. Thus, we will have 6 different series, 3 for each hotel. With 6 series to forecast, manually fitting and tuning the models would be tedious and take a long time. We will use purrr to fit and evaluate the models efficiently in order to get the best model for each series based on the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). MAE is chosen for its interpretability, while RMSE is chosen because it is sensitive to large errors. We do not use the Mean Absolute Percentage Error (MAPE), despite it being easier to interpret, because it does not behave well when the actual data contains values equal or close to zero.

$$MAE = \frac{\sum_{i=1}^{n} |y_i - \hat{y}_i|}{n}$$

$$RMSE = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}$$
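The two metrics translate directly into R. The sketch below uses made-up numbers, with one actual value equal to zero to show why MAPE is avoided; the function names are illustrative and not taken from the post.

```r
# Error metrics written out from their definitions.
mae  <- function(actual, forecast) mean(abs(actual - forecast))
rmse <- function(actual, forecast) sqrt(mean((actual - forecast)^2))
mape <- function(actual, forecast) mean(abs((actual - forecast) / actual)) * 100

# Toy values; the third actual value is zero on purpose.
actual   <- c(10, 12, 0, 8)
forecast <- c(12, 10, 1, 8)

mae(actual, forecast)   # 1.25
rmse(actual, forecast)  # 1.5
mape(actual, forecast)  # Inf, because of the division by the zero actual value
```

The errors here are -2, 2, -1 and 0, so the mean absolute error is 5 / 4 = 1.25 and the mean squared error is 9 / 4 = 2.25, whose square root is 1.5.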

Before we proceed, we will fit a forecast on the aggregated time series, obtained by combining all demand into a single series. This will serve as a benchmark for the subsequent forecasts.

arrival_date  demand
<date>        <dbl>
2015-07-01    118
2015-07-02    52
2015-07-03    43
2015-07-04    55
2015-07-05    53
2015-07-06    55
2015-07-07    53
2015-07-08    34
2015-07-09    38
2015-07-10    49

Using a weekly seasonality and the ARIMA method, here is the result of the forecast for the next 30 days. The forecast gives an MAE of about 21 on the test set, meaning that on average the forecast differs from the actual demand by around 21 bookings, and an RMSE of about 26.

dataset       ME          RMSE      MAE       MPE        MAPE      MASE
<chr>         <dbl>       <dbl>     <dbl>     <dbl>      <dbl>     <dbl>
Training set  0.4580129   37.73934  28.13519  -10.16052  27.86560  0.7628458
Test set      -8.0150468  26.31801  21.25028  -8.68376   15.88266  0.5761712

We will see whether separating the series gives us better forecast performance.

First, we will nest the dataset, turning our data into a list of 6 separate series.

series                      data
<chr>                       <list>
City Hotel_Direct           <tibble>
City Hotel_Offline TA/TO    <tibble>
City Hotel_Online TA        <tibble>
Resort Hotel_Direct         <tibble>
Resort Hotel_Offline TA/TO  <tibble>
Resort Hotel_Online TA      <tibble>

The column data consists of a list of the demand for each series.
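The nested layout above is essentially one small data frame per series. As a dependency-free illustration, the sketch below builds the same structure with base R's `split()` and loops over it with `sapply()`; in a tidyverse workflow, `tidyr::nest()` and `purrr::map()` play the corresponding roles. The toy values are invented.

```r
# Toy long-format data: 2 hotels x 3 segments x 2 days.
long <- data.frame(
  hotel          = rep(c("City Hotel", "Resort Hotel"), each = 6),
  market_segment = rep(rep(c("Direct", "Offline TA/TO", "Online TA"), each = 2), times = 2),
  arrival_date   = rep(as.Date(c("2015-07-01", "2015-07-02")), times = 6),
  demand         = 1:12,
  stringsAsFactors = FALSE
)

# One identifier per series, following the "hotel_segment" naming shown above.
long$series <- paste(long$hotel, long$market_segment, sep = "_")

# A named list with one data frame per series.
nested <- split(long[, c("arrival_date", "demand")], long$series)

# Any per-series computation is then a single loop over the list.
total_demand <- sapply(nested, function(d) sum(d$demand))
print(total_demand)
```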

Seasonality Specification

The next step is to specify the seasonal period for the series. We will try several seasonal periods, including:

  • weekly seasonality
  • monthly seasonality
  • annual seasonality
  • weekly and monthly seasonality (multi-seasonal)
  • weekly and annual seasonality (multi-seasonal)
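The five specifications listed above can be kept in a named list so that they can be looped over later. The sketch below is an assumption about how this might look: the period lengths (7, 30 and 365 days) are my own choice for daily data, not values confirmed by the post. Base R's `ts()` only supports a single period, so the two multi-seasonal entries are only stored here; they would need a multi-seasonal container such as `forecast::msts()`, which takes a `seasonal.periods` vector.

```r
# Candidate seasonal periods for daily data (lengths are assumptions).
seasonal_periods <- list(
  weekly         = 7,
  monthly        = 30,        # assumed length of a "month" in days
  annual         = 365,
  weekly_monthly = c(7, 30),  # multi-seasonal
  weekly_annual  = c(7, 365)  # multi-seasonal
)

# Toy daily demand spanning a bit more than two years (770 days).
demand <- rep(c(5, 9, 7, 6, 4, 2, 1), times = 110)

# Base R ts objects can only carry one period, so build those three here.
single <- Filter(function(p) length(p) == 1, seasonal_periods)
ts_by_spec <- lapply(single, function(p) ts(demand, frequency = p))

print(sapply(ts_by_spec, frequency))
```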

Impute Outlier

We will also try two preprocessing options: replacing outliers or leaving them as they are. If outliers are replaced, we identify them and estimate their replacements using the tsoutliers function. In that function, residuals are identified by fitting a loess curve for non-seasonal data and via a periodic STL decomposition for seasonal data.
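To show the general idea behind this kind of outlier replacement, here is a simplified base R stand-in. It is not the `tsoutliers` implementation: the helper `replace_outliers()`, the 3 x IQR rule on the STL remainder, and the linear interpolation are all assumptions chosen for the illustration, and the data is simulated.

```r
# Simplified stand-in for STL-based outlier replacement (hypothetical helper).
replace_outliers <- function(x, period = 7) {
  # Robust periodic STL, so the outlier itself barely affects the fit.
  fit <- stl(ts(x, frequency = period), s.window = "periodic", robust = TRUE)
  remainder <- as.numeric(fit$time.series[, "remainder"])

  # Flag remainders far outside the interquartile range.
  q <- unname(quantile(remainder, c(0.25, 0.75)))
  limit <- 3 * (q[2] - q[1])
  index <- which(remainder < q[1] - limit | remainder > q[2] + limit)

  # Replace flagged points by linear interpolation between their neighbours.
  cleaned <- x
  cleaned[index] <- NA
  cleaned <- approx(seq_along(x), cleaned, xout = seq_along(x), rule = 2)$y

  list(index = index, cleaned = cleaned)
}

# Simulated weekly-seasonal demand with one artificial spike at day 50.
set.seed(1)
demand <- 20 + 5 * sin(2 * pi * (1:140) / 7) + rnorm(140)
demand[50] <- 200

result <- replace_outliers(demand)
print(result$index)
print(round(result$cleaned[48:52], 1))
```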

Model Fitting

Below are the combinations of specifications for each series. For 6 different series, we have 5 different seasonality specifications, 3 different models, and other specifications. In total, we will have 540 different combinations. We will choose the best model based on the RMSE and MAE values on the testing dataset.
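A grid of this kind can be enumerated with `expand.grid()`. In the sketch below, the series and seasonality names come from the text, but everything else is hypothetical: the model labels are placeholders, and the "other specifications" are assumed to be three transformations and an outlier on/off switch purely so that the toy grid also has 540 rows.

```r
series <- c("City Hotel_Direct", "City Hotel_Offline TA/TO", "City Hotel_Online TA",
            "Resort Hotel_Direct", "Resort Hotel_Offline TA/TO", "Resort Hotel_Online TA")
seasonality <- c("weekly", "monthly", "annual", "weekly_monthly", "weekly_annual")

# Placeholders: the text does not name the three models.
model <- c("model_a", "model_b", "model_c")

# Assumed remaining specifications (3 x 2 = 6 options).
preprocess     <- c("log_spec", "hypothetical_spec_2", "hypothetical_spec_3")
impute_outlier <- c(TRUE, FALSE)

# Every combination of the five dimensions, one row per model run.
grid <- expand.grid(
  series = series,
  seasonality = seasonality,
  model = model,
  preprocess = preprocess,
  impute_outlier = impute_outlier,
  stringsAsFactors = FALSE
)

nrow(grid)  # 6 * 5 * 3 * 3 * 2 = 540
```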

series             data      preprocess_name  preprocess_spec  reverse_name
<chr>              <list>    <chr>            <list>           <chr>
City Hotel_Direct  <tibble>  log_spec         <fun>            log_spec
City Hotel_Direct  <tibble>  log_spec         <fun>            log_spec
City Hotel_Direct  <tibble>  log_spec         <fun>            log_spec
City Hotel_Direct  <tibble>  log_spec         <fun>            log_spec
City Hotel_Direct  <tibble>  log_spec         <fun>            log_spec
City Hotel_Direct  <tibble>  log_spec         <fun>            log_spec
City Hotel_Direct  <tibble>  log_spec         <fun>            log_spec
City Hotel_Direct  <tibble>  log_spec         <fun>            log_spec
City Hotel_Direct  <tibble>  log_spec         <fun>            log_spec
City Hotel_Direct  <tibble>  log_spec         <fun>            log_spec

The following code transforms the data before it is fitted to the model.
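The original code chunk is not reproduced here. As a rough idea of what a transformation specification and its reverse could look like, here is a minimal sketch. The `log_spec` name appears in the table above, but its body is my assumption, including the `+ 1` offset that guards against days with zero demand; `log_reverse` is a hypothetical name.

```r
# Assumed forward transformation: log with a +1 offset so that zero demand is allowed.
log_spec <- function(x) log(x + 1)

# Matching reverse transformation, used to bring forecasts back to the original scale.
log_reverse <- function(x) exp(x) - 1

demand <- c(0, 3, 8, 79)

transformed <- log_spec(demand)
restored    <- log_reverse(transformed)

print(round(transformed, 3))
print(restored)
```

Keeping each transformation next to its reverse makes it harder to evaluate a forecast on the wrong scale by accident.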

The following code does the whole process, from transforming the data into a time series to fitting the model and forecasting demand for the next 30 days.
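The original pipeline is not shown, so the block below is only a compact sketch of the same flow in base R, under several assumptions: the data is simulated, the helper `fit_and_forecast()` is hypothetical, the model is a fixed seasonal ARIMA from `stats::arima()` rather than whatever the post selected, and `lapply()` and `mapply()` stand in for the corresponding purrr mapping functions.

```r
# Hypothetical helper: log-transform, fit a weekly seasonal ARIMA, forecast, back-transform.
fit_and_forecast <- function(train, horizon = 30, period = 7) {
  y <- ts(log(train + 1), frequency = period)
  # Fixed order chosen for the sketch: AR(1) on seasonally differenced data.
  fit <- arima(y, order = c(1, 0, 0),
               seasonal = list(order = c(0, 1, 0), period = period))
  pred <- predict(fit, n.ahead = horizon)$pred
  pmax(exp(as.numeric(pred)) - 1, 0)  # back to the demand scale, never negative
}

# Simulated daily demand with a weekly pattern (210 days per series).
set.seed(7)
make_series <- function(level) {
  pattern <- rep(c(3, 6, 4, 2, 0, -5, -10), length.out = 210)
  round(pmax(level + pattern + rnorm(210, sd = 2), 0))
}
series_list <- list(
  "City Hotel_Direct"   = make_series(15),
  "Resort Hotel_Direct" = make_series(25)
)

# Train on the first 180 days, forecast the last 30, one model per series.
forecasts <- lapply(series_list, function(x) fit_and_forecast(x[1:180]))

# Test-set MAE for each series.
mae_by_series <- mapply(
  function(x, f) mean(abs(x[181:210] - f)),
  series_list, forecasts
)
print(round(mae_by_series, 2))
```

Because every step lives inside one function, adding more series or more specifications only changes the list being looped over, not the modelling code.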

Forecasting Result

Below is the result of our modeling process. We use MAE and RMSE to measure and compare the performance of each model.

hotel       market_segment  mae        rmse       preprocess_name  reverse_name
<chr>       <chr>           <dbl>      <dbl>      <chr>            <chr>
City Hotel  Direct          3.789519   4.634582   log_spec         log_spec
City Hotel  Direct          11.690057  13.925391  log_spec         log_spec
City Hotel  Direct          10.583412  12.780857  log_spec         log_spec
City Hotel  Direct          3.789519   4.634582   log_spec         log_spec
City Hotel  Direct          11.690057  13.925391  log_spec         log_spec
City Hotel  Direct          10.583412  12.780857  log_spec         log_spec
City Hotel  Direct          3.789519   4.634582   log_spec         log_spec
City Hotel  Direct          3.852730   4.710368   log_spec         log_spec
City Hotel  Direct          3.791213   4.643537   log_spec         log_spec
City Hotel  Direct          3.789519   4.634582   log_spec         log_spec

Below is the best configuration for each series based on the lowest RMSE, since RMSE penalizes large errors more heavily. To interpret the MAE values, we need to consider the spread of the data, shown here as the standard deviation.
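Picking the lowest-RMSE configuration per series is a sort followed by keeping the first row of each group. The sketch below uses invented series names, configuration labels and error values, not the results of this post.

```r
# Toy evaluation results: several configurations per series.
results <- data.frame(
  series = c("series_1", "series_1", "series_1", "series_2", "series_2"),
  config = c("config_a", "config_b", "config_c", "config_a", "config_b"),
  mae    = c(3.8, 11.7, 10.6, 8.7, 9.1),
  rmse   = c(4.6, 13.9, 12.8, 11.4, 11.0),
  stringsAsFactors = FALSE
)

# Sort by series, then by RMSE, and keep the first row of each series.
ordered <- results[order(results$series, results$rmse), ]
best <- ordered[!duplicated(ordered$series), ]

print(best)
```

For `series_2` the lowest RMSE belongs to `config_b` even though `config_a` has the lower MAE, which shows that the two metrics can disagree and that the selection rule has to be fixed in advance.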

hotel         market_segment  mae        rmse       mean       std_dev
<chr>         <chr>           <dbl>      <dbl>      <dbl>      <dbl>
Resort Hotel  Direct          2.206183   2.646562   8.142857   4.854501
City Hotel    Direct          3.428378   4.166860   7.532110   4.833170
Resort Hotel  Offline TA/TO   4.260005   6.094221   9.390564   7.382110
Resort Hotel  Online TA       8.708364   11.031997  21.786370  11.695512
City Hotel    Offline TA/TO   8.797844   13.718861  21.377457  31.955130
City Hotel    Online TA       13.163037  16.190001  48.128440  28.530810

Since we only have two years of transactions, the models may not perform very well. However, judging from the MAE values, the performance is quite acceptable: for every series, the MAE is smaller than one standard deviation of the data. Compared to the aggregated forecast, which has an MAE of 21, each separate series has a lower MAE, which gives us evidence that building a separate forecasting model for each market segment makes the forecast more accurate.

Below is the forecasting result for each series. The red line indicates the actual demand, while the blue line indicates the forecast. The blue area represents the 80% prediction interval, while the light blue area represents the 95% prediction interval. Most of the actual demand is still inside the prediction intervals.

Model Assumption Checking

Autocorrelation

The autocorrelation can be checked using the Ljung-Box test. If there are correlations between residuals, then there is information left in the residuals which should be used in computing forecasts.
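As a small illustration of the test, the sketch below applies `stats::Box.test()` with `type = "Ljung-Box"` to two simulated series rather than to the model residuals of this post. The lag of 10 is an arbitrary choice; for residuals of a fitted model, the `fitdf` argument can be used to account for the number of estimated parameters.

```r
set.seed(123)

# Independent noise: what well-behaved residuals should look like.
white_noise <- rnorm(200)

# AR(1) process with coefficient 0.9: strongly autocorrelated by construction.
autocorrelated <- as.numeric(
  stats::filter(rnorm(200), filter = 0.9, method = "recursive")
)

p_values <- c(
  white_noise    = Box.test(white_noise, lag = 10, type = "Ljung-Box")$p.value,
  autocorrelated = Box.test(autocorrelated, lag = 10, type = "Ljung-Box")$p.value
)

# A small p-value rejects the hypothesis of no autocorrelation.
print(signif(p_values, 3))
```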

hotel         market_segment  test            p_value
<chr>         <chr>           <fctr>          <dbl>
City Hotel    Direct          Box-Ljung test  0.9764
City Hotel    Offline TA/TO   Box-Ljung test  0.9712
City Hotel    Online TA       Box-Ljung test  0.7693
Resort Hotel  Direct          Box-Ljung test  0.8539
Resort Hotel  Offline TA/TO   Box-Ljung test  0.9684
Resort Hotel  Online TA       Box-Ljung test  0.9236

The results suggest that none of our models shows residual autocorrelation, based on the non-significant p-values.

Normality

We will also check whether the residuals of each model are normally distributed using the Shapiro-Wilk test. If the residuals are not normally distributed, the prediction intervals, which rely on the normality assumption, become less reliable. It also indicates that we can still tweak our model to get better performance.
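The check itself is a single call to `stats::shapiro.test()`. The sketch below runs it on simulated, right-skewed "residuals" (not the residuals of this post) and also compares their mean and median, since a gap between the two is a quick sign of skewness.

```r
set.seed(99)

# Right-skewed toy residuals: mostly small values with a few large positive ones.
skewed_resid <- exp(rnorm(200)) - 1

# Shapiro-Wilk test: a small p-value rejects normality.
normality <- shapiro.test(skewed_resid)
print(normality$p.value)

# With right skew, a few large errors pull the mean above the median.
summary_stats <- c(mean = mean(skewed_resid), median = median(skewed_resid))
print(round(summary_stats, 3))
```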

hotel         segment        mean_error  median_error
<chr>         <chr>          <dbl>       <dbl>
City Hotel    Direct         0.17975     -0.38742
City Hotel    Offline TA/TO  0.10356     -6.74179
City Hotel    Online TA      0.01455     -0.89750
Resort Hotel  Direct         0.04419     -0.58587
Resort Hotel  Offline TA/TO  -0.08350    -0.10431
Resort Hotel  Online TA      0.21409     -0.38216

Based on the result, none of our models fulfills the normality assumption for the residuals. A positive mean error signifies that the model underestimates demand on average, while a negative mean error means that it overestimates. The median error is negative for every model, so each model overestimates the typical observation on the training set, whereas the mean error is positive for five of the six models. This gap is likely driven by outliers, such as the very high demand in the early part of the series, which pull the mean error upward. This suggests that we can improve the models further in order to get better performance.

Conclusion

This article has illustrated how R and the functional programming tools of purrr can help us do flexible forecasting for multiple time series. We have forecast hotel demand using a real-world dataset with two years' worth of data. We have also analyzed the series pattern for each hotel and segment and fitted the best model for each of them, with satisfying results. The next step is perhaps to enhance the models further, either by using other time series models, by incorporating predictors from the unused variables of the original dataset, or by transforming the data.