Time Series Analysis

This is a time series analysis tutorial on building an ARIMA model. We will use a simple dataset from the hotel industry. The sample data has only four columns: date, room_sold, adr, and revenue. The “date” column is the historical check-in date at a hotel in NYC, and “room_sold” is, as the name suggests, the number of rooms sold on that date. Most importantly, “adr” is the average daily rate, i.e. the average price of a room on that date (it is an average because a hotel offers multiple room types, and we care only about the average across all of them). Finally, “revenue” is simply the product of “room_sold” and “adr”. In this tutorial, we want to perform time series analysis to predict “revenue”, but the same technique can be used to predict any of the series, e.g. room_sold or adr.

The data set covers January 1, 2016 through December 31, 2019. We will use the data from January 1, 2016 through August 31, 2019 as our training set to predict revenue for the rest of 2019, and then compare the actual and forecasted results.
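That hold-out window spans 30 + 31 + 30 + 31 = 122 days (September through December 2019), which is the forecast horizon we will pass to the model later. A quick sanity check in R:

# number of days between the end of the train set and the end of 2019
as.numeric(as.Date("2019-12-31") - as.Date("2019-08-31"))  # should print 122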

# load packages (note: FitAR has since been archived on CRAN and may need to
# be installed from the CRAN archive)
pacman::p_load(fUnitRoots, lmtest, forecast, FitAR, tidyverse, broom)
# read data
df <- read.csv("sample_hotel_data.csv", header = TRUE) %>% 
        dplyr::mutate(date = as.Date(date, "%m/%d/%Y"))

# take a quick glance at the data
broom::glance(df)
## Warning: 'glance.data.frame' is deprecated.
## See help("Deprecated")
## # A tibble: 1 x 4
##    nrow  ncol complete.obs na.fraction
##   <int> <int>        <int>       <dbl>
## 1  1461     4         1461           0
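The warning above appears because calling glance() on a plain data frame is deprecated in newer broom releases; dplyr::glimpse() offers a comparable quick inspection without the warning:

# alternative quick look at the data (no deprecation warning)
dplyr::glimpse(df)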

# head
head(df)
##         date room_sold    adr revenue
## 1 2016-01-01       124 147.20   18252
## 2 2016-01-02       118 142.60   16827
## 3 2016-01-03       128 151.62   19408
## 4 2016-01-04       151 145.34   21947
## 5 2016-01-05       141 119.21   16809
## 6 2016-01-06       135 110.84   14964
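Since “revenue” should equal “room_sold” times “adr” (the published revenue appears to be rounded to whole dollars), here is a quick sanity check of that identity, with a one-dollar tolerance to allow for the rounding:

# verify revenue = room_sold * adr, up to rounding
all(abs(df$revenue - df$room_sold * df$adr) < 1)  # TRUE if only rounding differs
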
# set the variable to forecast
var <- 'revenue'

# filter: extract the revenue series between January 1, 2016 and August 31, 2019
tsData <- df %>%
        dplyr::filter(date <= as.Date("08/31/2019", "%m/%d/%Y")) %>%
        dplyr::select(dplyr::all_of(var)) %>%
        unlist() %>%
        as.vector() %>%
        ts(start = c(2016, 1), frequency = 365)

# hold-out data (September 1 to December 31, 2019) used for comparison
dfForecast <- df %>%
        dplyr::filter(date > as.Date("08/31/2019", "%m/%d/%Y"))

# head
head(tsData)
## Time Series:
## Start = c(2016, 1) 
## End = c(2016, 6) 
## Frequency = 365 
## [1] 18252 16827 19408 21947 16809 14964

# class
print(paste0("tsData is a ", class(tsData), " object."))
## [1] "tsData is a ts object."

Clearly, we see a seasonal pattern in our data. The next step is to decompose the data into its time series components.

plot(tsData, main = "Hotel Revenue Y201601 to Y201908")

After decomposition, we can see a clear upward trend, seasonality, and random noise in our data.

timeseriescomponents <- decompose(tsData)
plot(timeseriescomponents)
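decompose() fits an additive model by default, so at each time point tsData = trend + seasonal + random. The individual component series are available directly from the returned object:

# inspect the components of the additive decomposition
str(timeseriescomponents, max.level = 1)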

Let’s make our time series stationary so that we can forecast from it, and examine the ACF (autocorrelation function) and PACF (partial autocorrelation function) plots to choose the AR (autoregressive) and MA (moving average) orders.

# determine stationarity of the data
urkpssTest(tsData, type = c("tau"), lags = c("short"), use.lag = NULL, doplot = TRUE)

## 
## Title:
##  KPSS Unit Root Test
## 
## Test Results:
##   NA
## 
## Description:
##  Sat May 02 22:01:15 2020 by user: myvio
tsstationary <- diff(tsData, differences = 1)
plot(tsstationary)

acf(tsData, lag.max = length(tsData))
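The prose above mentions the PACF as well, but the original code plots only the ACF; for completeness, here is the companion call:

# partial autocorrelation plot, typically used to eyeball the AR order
pacf(tsData, lag.max = 50)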

Let’s remove the seasonal component from our data, and plot again after differencing.

# remove seasonality
timeseriesseasonallyadjusted <- tsData - timeseriescomponents$seasonal
plot(timeseriesseasonallyadjusted)


# difference the seasonally adjusted series and plot again
tsstationary <- diff(timeseriesseasonallyadjusted, differences = 1)
plot(tsstationary)
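As a cross-check that a single difference is sufficient, forecast::ndiffs() estimates the required order of differencing from unit-root tests (an extra check, not part of the original walkthrough):

# estimated number of differences needed for stationarity
forecast::ndiffs(timeseriesseasonallyadjusted)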

Now, let’s use auto.arima to search for the best-fitting ARIMA model and its parameters.

# auto.arima
auto.arima(tsData, trace = TRUE)
## 
##  Fitting models using approximations to speed things up...
## 
##  ARIMA(2,1,2)(1,1,1)[365]                    : Inf
##  ARIMA(0,1,0)(0,1,0)[365]                    : 13880.77
##  ARIMA(1,1,0)(1,1,0)[365]                    : Inf
##  ARIMA(0,1,1)(0,1,1)[365]                    : Inf
##  ARIMA(0,1,0)(1,1,0)[365]                    : Inf
##  ARIMA(0,1,0)(0,1,1)[365]                    : Inf
##  ARIMA(0,1,0)(1,1,1)[365]                    : Inf
##  ARIMA(1,1,0)(0,1,0)[365]                    : 13883.35
##  ARIMA(0,1,1)(0,1,0)[365]                    : 13881.35
##  ARIMA(1,1,1)(0,1,0)[365]                    : 13572.13
##  ARIMA(1,1,1)(1,1,0)[365]                    : Inf
##  ARIMA(1,1,1)(0,1,1)[365]                    : Inf
##  ARIMA(1,1,1)(1,1,1)[365]                    : Inf
##  ARIMA(2,1,1)(0,1,0)[365]                    : 13421.3
##  ARIMA(2,1,1)(1,1,0)[365]                    : Inf
##  ARIMA(2,1,1)(0,1,1)[365]                    : Inf
##  ARIMA(2,1,1)(1,1,1)[365]                    : Inf
##  ARIMA(2,1,0)(0,1,0)[365]                    : 13759.98
##  ARIMA(3,1,1)(0,1,0)[365]                    : 13407.53
##  ARIMA(3,1,1)(1,1,0)[365]                    : Inf
##  ARIMA(3,1,1)(0,1,1)[365]                    : Inf
##  ARIMA(3,1,1)(1,1,1)[365]                    : Inf
##  ARIMA(3,1,0)(0,1,0)[365]                    : 13730.59
##  ARIMA(4,1,1)(0,1,0)[365]                    : 13374.2
##  ARIMA(4,1,1)(1,1,0)[365]                    : Inf
##  ARIMA(4,1,1)(0,1,1)[365]                    : Inf
##  ARIMA(4,1,1)(1,1,1)[365]                    : Inf
##  ARIMA(4,1,0)(0,1,0)[365]                    : 13659.38
##  ARIMA(5,1,1)(0,1,0)[365]                    : 13347.96
##  ARIMA(5,1,1)(1,1,0)[365]                    : Inf
##  ARIMA(5,1,1)(0,1,1)[365]                    : Inf
##  ARIMA(5,1,1)(1,1,1)[365]                    : Inf
##  ARIMA(5,1,0)(0,1,0)[365]                    : 13447.38
##  ARIMA(5,1,2)(0,1,0)[365]                    : 13333.24
##  ARIMA(5,1,2)(1,1,0)[365]                    : Inf
##  ARIMA(5,1,2)(0,1,1)[365]                    : Inf
##  ARIMA(5,1,2)(1,1,1)[365]                    : Inf
##  ARIMA(4,1,2)(0,1,0)[365]                    : 13393.23
##  ARIMA(5,1,3)(0,1,0)[365]                    : Inf
##  ARIMA(4,1,3)(0,1,0)[365]                    : 13390.17
## 
##  Now re-fitting the best model(s) without approximations...
## 
##  ARIMA(5,1,2)(0,1,0)[365]                    : 20030.98
## 
##  Best model: ARIMA(5,1,2)(0,1,0)[365]
## Series: tsData 
## ARIMA(5,1,2)(0,1,0)[365] 
## 
## Coefficients:
##          ar1      ar2      ar3      ar4      ar5      ma1     ma2
##       0.2281  -0.6530  -0.2181  -0.2895  -0.3917  -0.6288  0.2679
## s.e.  0.0625   0.0438   0.0437   0.0295   0.0375   0.0639  0.0589
## 
## sigma^2 estimated as 50122376:  log likelihood=-10007.42
## AIC=20030.83   AICc=20030.98   BIC=20069.87

Let’s fit our model with the selected orders, ARIMA(5,1,2)(0,1,0)[365]: an AR order of 5, one difference, and an MA order of 2, plus a single seasonal difference at period 365.

# fit the model
fitARIMA <- arima(tsData, order = c(5, 1, 2), 
                  seasonal = list(order = c(0, 1, 0), period = 365), 
                  method = "ML")

# summary
summary(fitARIMA)
## 
## Call:
## arima(x = tsData, order = c(5, 1, 2), seasonal = list(order = c(0, 1, 0), period = 365), 
##     method = "ML")
## 
## Coefficients:
##          ar1      ar2      ar3      ar4      ar5      ma1     ma2
##       0.2271  -0.6524  -0.2186  -0.2893  -0.3920  -0.6277  0.2670
## s.e.  0.0624   0.0438   0.0437   0.0295   0.0375   0.0638  0.0589
## 
## sigma^2 estimated as 49761784:  log likelihood = -10007.41,  aic = 20030.83
## 
## Training set error measures:
##                     ME    RMSE      MAE       MPE     MAPE      MASE
## Training set -7.152579 6013.32 3713.296 -1.436386 11.71198 0.5519136
##                     ACF1
## Training set -0.02126539

# significance of coefficients
coeftest(fitARIMA)
## 
## z test of coefficients:
## 
##      Estimate Std. Error  z value  Pr(>|z|)    
## ar1  0.227094   0.062409   3.6388 0.0002739 ***
## ar2 -0.652434   0.043767 -14.9070 < 2.2e-16 ***
## ar3 -0.218592   0.043654  -5.0074 5.518e-07 ***
## ar4 -0.289270   0.029465  -9.8176 < 2.2e-16 ***
## ar5 -0.392038   0.037461 -10.4653 < 2.2e-16 ***
## ma1 -0.627704   0.063813  -9.8367 < 2.2e-16 ***
## ma2  0.266962   0.058883   4.5338 5.794e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow = c(1, 1))
acf(fitARIMA$residuals)
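For a one-stop diagnostic, forecast::checkresiduals() bundles a residual time plot, the residual ACF, a histogram, and a Ljung-Box test into a single call (an addition to the original code):

# bundled residual diagnostics from the forecast package
forecast::checkresiduals(fitARIMA)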


# residual diagnostics (the Ljung-Box test is applied to the model residuals,
# not the original series)
boxresult <- LjungBoxTest(fitARIMA$residuals, k = 2, StartLag = 1)
par(mfrow = c(2, 1))
plot(boxresult[, 3], main = "Ljung-Box Q Test", ylab = "P-values", xlab = "Lag")
qqnorm(fitARIMA$residuals)
qqline(fitARIMA$residuals)
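If FitAR is unavailable (it has since been archived on CRAN), stats::Box.test() computes the same Ljung-Box statistic one lag at a time; fitdf = 7 accounts for the five AR and two MA parameters we estimated:

# base-R Ljung-Box test on the model residuals
Box.test(fitARIMA$residuals, lag = 10, type = "Ljung-Box", fitdf = 7)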


# forecast future values: September 1 to December 31, 2019 (122 days)
par(mfrow = c(1, 1))
r <- predict(fitARIMA, n.ahead = 122)
futureVal <- forecast(fitARIMA, h = 122, level = c(95))
plot(futureVal)

Here’s the moment of truth: let’s combine our forecast with the actual data and see whether our forecast is successful.

# forecast - r is our forecasted result
r
## $pred
## Time Series:
## Start = c(2019, 245) 
## End = c(2020, 1) 
## Frequency = 365 
##   [1] 26848.03 17743.17 35634.94 49838.13 44155.55 37584.08 38423.21 33246.32
##   [9] 46197.29 53294.33 62112.15 50615.76 37571.58 37003.60 30992.86 45337.57
##  [17] 44428.01 48862.69 46354.56 37381.23 39600.36 35208.38 59916.03 70615.63
##  [25] 69273.66 48358.60 36199.47 38506.97 31010.34 48886.27 56758.23 57822.43
##  [33] 48491.68 46520.70 51986.09 38745.38 45548.68 55366.02 57573.47 48807.08
##  [41] 39651.72 41304.99 30401.43 49989.23 58914.27 60792.58 52405.96 36235.17
##  [49] 41553.66 31262.71 50954.45 59225.24 61996.50 48935.48 40086.83 43681.31
##  [57] 30244.23 36151.86 41783.65 41696.60 41393.87 41449.59 45039.75 28469.63
##  [65] 44308.43 51269.65 54682.50 48666.82 37377.47 44588.90 27741.40 48286.91
##  [73] 55860.39 55319.10 46599.69 29585.83 31584.12 19996.90 19777.53 17180.61
##  [81] 24644.64 26777.64 27622.57 26626.32 18362.46 31869.82 44690.48 47534.55
##  [89] 41377.75 36333.28 41010.15 28424.31 51343.49 79534.44 78127.41 64713.03
##  [97] 53943.33 50479.05 28016.64 46015.38 50933.11 53726.93 49818.45 36710.96
## [105] 36208.32 20827.58 26299.37 27866.26 25718.92 18697.01 21771.34 26174.16
## [113] 18456.25 14003.41 13899.71 25823.23 36596.68 43700.56 40003.72 34116.72
## [121] 39077.47 19066.36
## 
## $se
## Time Series:
## Start = c(2019, 245) 
## End = c(2020, 1) 
## Frequency = 365 
##   [1]  7054.203  8224.333  8269.933  8279.882  8348.856  8354.599  8564.363
##   [8]  9471.006 10154.347 10309.453 10355.406 10391.024 10442.425 10635.926
##  [15] 11085.424 11495.901 11683.339 11758.068 11808.387 11883.266 12052.997
##  [22] 12343.060 12629.276 12807.339 12902.214 12970.738 13058.501 13207.584
##  [29] 13423.076 13642.073 13803.921 13909.566 13992.590 14087.746 14221.996
##  [36] 14396.038 14574.838 14721.741 14831.745 14924.134 15023.247 15147.022
##  [43] 15295.628 15448.595 15583.155 15693.683 15791.129 15891.845 16007.964
##  [50] 16139.869 16275.314 16399.901 16508.838 16608.292 16709.006 16819.232
##  [57] 16939.561 17062.582 17179.090 17285.370 17384.855 17484.517 17589.966
##  [64] 17701.874 17815.736 17925.626 18028.790 18127.107 18225.064 18326.470
##  [71] 18431.973 18538.837 18643.228 18743.157 18839.626 18935.507 19033.377
##  [78] 19133.808 19235.142 19334.895 19431.652 19525.927 19619.547 19714.254
##  [85] 19810.522 19907.351 20003.133 20096.870 20188.809 20280.109 20371.943
##  [92] 20464.688 20557.747 20650.078 20740.985 20830.570 20919.568 21008.766
##  [99] 21098.455 21188.283 21277.575 21365.849 21453.135 21539.896 21626.659
## [106] 21713.643 21800.645 21887.229 21973.063 22058.142 22142.756 22227.259
## [113] 22311.813 22396.303 22480.448 22564.021 22647.002 22729.574 22811.972
## [120] 22894.316 22976.542 23058.470

r_hat <- r$pred %>% as.numeric()

# combine actuals and forecasts
dfForecast <- dfForecast %>%
        dplyr::mutate(forecast = r_hat) %>%
        dplyr::select(date, actual = revenue, forecast)
        
head(dfForecast); tail(dfForecast)
##         date actual forecast
## 1 2019-09-01  28856 26848.03
## 2 2019-09-02  20001 17743.17
## 3 2019-09-03  33240 35634.94
## 4 2019-09-04  47821 49838.13
## 5 2019-09-05  43832 44155.55
## 6 2019-09-06  41337 37584.08
##           date actual forecast
## 117 2019-12-26  35876 36596.68
## 118 2019-12-27  44421 43700.56
## 119 2019-12-28  42427 40003.72
## 120 2019-12-29  35404 34116.72
## 121 2019-12-30  38725 39077.47
## 122 2019-12-31  39893 19066.36
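Before plotting, we can put a number on the out-of-sample error. forecast::accuracy() accepts a numeric vector of forecasts and the matching actuals (this check is an addition to the original code):

# out-of-sample error measures (ME, RMSE, MAE, MPE, MAPE)
forecast::accuracy(dfForecast$forecast, dfForecast$actual)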

# ggplot
dfForecast %>%
        tidyr::pivot_longer(-date, names_to = "actual vs. forecast", values_to = "revenue") %>%
        ggplot(aes(date, revenue)) +
        geom_line(aes(col = `actual vs. forecast`)) + 
        scale_y_continuous(labels = scales::dollar) +
        labs(title = "Actual vs. Forecasted Revenue, September to December 2019")

Wow, surprisingly, our forecast is very close to the actual values, with the exception of Thanksgiving week (where the predicted revenue is much lower than the actual) and the week after. It is worth spending more time looking into why that is the case. We should also keep overfitting in mind, and note that continuing to rely on this model to predict Q1 2020 would make no sense at all during the pandemic crisis. That’s the end of this tutorial on the basics of forecasting. Perhaps we shall revisit this data and topic later.