Forecast daily bike rental demand using time series models

Author

Satoshi Matsumoto

About Data Analysis Report

This report covers a project on forecasting daily bike rental demand using time series models in R. It includes data exploration, summary statistics, and the building of time series models. The final report was completed on Mon Jan 8 14:51:19 2024.

Data Description:

This dataset contains the daily count of rental bike transactions between the years 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and seasonal information.

Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset

Relevant Paper:

Fanaee-T, Hadi, and Gama, Joao, ‘Event labeling combining ensemble detectors and background knowledge’, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg

Task One: Load and explore the data

Load data and install packages

  • tidyverse for data manipulation

  • plotly for visualization

  • forecast for forecasting

  • TTR for smoothing time series data

  • TSA for the Holt-Winters method

  • fUnitRoots for unit root tests to assess whether the time series is stationary
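
A minimal sketch of loading these packages (assuming they are already installed; timetk and lubridate are additional assumptions for the plotting and date helpers used later in this report):

# Load the packages used throughout this report
library(tidyverse)   # data manipulation (dplyr, ggplot2, readr)
library(plotly)      # interactive visualization
library(forecast)    # forecasting models and accuracy metrics
library(TTR)         # simple moving averages
library(TSA)         # time series analysis utilities
library(fUnitRoots)  # urkpssTest() unit root test
library(timetk)      # plot_time_series() and diagnostics plots
library(lubridate)   # year() helper used when grouping by year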

Describe and explore the data

According to the data exploration, the dataset contains information on time, temperature, humidity, and the number of bike users. In this scenario, we use the "day.csv" file for the time series analysis.
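
A minimal sketch of reading the file into R (the relative path "day.csv" is an assumption; adjust it to where the file is stored):

# Read the daily data; read_csv() parses the dteday column as a date
df1 <- readr::read_csv("day.csv")
glimpse(df1)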

Transform machine-readable data into human-readable form

Some column names are difficult to understand, so we rename them: dteday becomes date and cnt becomes ride.

# Map readable names onto the original column names
new_colnames <- c(date = "dteday", ride = "cnt")
df1 <- df1 %>%
  rename(all_of(new_colnames))

df1 %>%
  select(date,ride)%>%
  head()
# A tibble: 6 × 2
  date        ride
  <date>     <dbl>
1 2011-01-01   985
2 2011-01-02   801
3 2011-01-03  1349
4 2011-01-04  1562
5 2011-01-05  1600
6 2011-01-06  1606

Task Two: Create interactive time series plots

To visualize the time series, we can use the plot_time_series function from the timetk library. Here is what the command looks like:

## Read about the timetk package
df1%>%
  plot_time_series(.date_var = date,.value = ride,.interactive = TRUE, .plotly_slider = TRUE)

Visualizing seasonality

Using the plot_seasonal_diagnostics function, find out the seasonal trends of rides; plot_anomaly_diagnostics additionally flags anomalous data points.

df1 %>%
  group_by(year(date))%>%
  plot_seasonal_diagnostics(.date_var = date, .value = ride,.x_lab = "Date", .y_lab = "Number of users")
df1 %>%
  group_by(year(date))%>%
  plot_anomaly_diagnostics(.date_var = date, .value = ride,.x_lab = "Date", .y_lab = "Number of users")
frequency = 7 observations per 1 week
trend = 92 observations per 3 months
frequency = 7 observations per 1 week
trend = 92 observations per 3 months

Task Three: Smooth time series data

Make short-term forecasts using the time series data by smoothing it. Before we do anything regarding forecasting, we need to tell R that this data is a time series. To do this, we make a time series object. There are two critical inputs we must give the function: frequency and start. We need to manually tell it the frequency of our readings, that is, how many observations make up one full cycle (a year, here). Although the raw data are daily, this report analyzes the series at a monthly scale, so frequency = 12 is used.

For start, if we're not interested in attaching specific dates to our results, then we can just enter 1; but since we usually are, we indicate the year and month of the first data point as start = c(2011, 1).
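
For reference, a truly daily series with a weekly seasonal cycle could be constructed as follows (a sketch only; it is not used in the rest of this report):

# Alternative: keep the daily resolution, with frequency = 7 for weekly seasonality
dfts_daily <- ts(df1$ride, frequency = 7)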

Make a time series object from the ride (count of total bike rentals) variable

# Use the "ride" variable from Jan 2011 to Dec 2012 as a time series object "dfts".

dfts <- ts(df1$ride, start = c(2011, 1), end = c(2012, 12), frequency = 12)
plot(dfts)

We use tsclean() to identify and replace outliers and missing values in the time series:

tsclean(dfts, replace.missing = TRUE, iterate = 2, lambda = NULL)
      Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
2011  985  801 1349 1562 1600 1606 1510  959  822 1321 1263 1162
2012 1406 1421 1248 1204 1000  683 1650 1927 1543  981  986 1416

Simple Exponential Smoothing

# Fit a simple exponential smoothing predictive model using the HoltWinters() function
dfts_forecasts <- HoltWinters(dfts, beta = FALSE, gamma = FALSE)
dfts_forecasts
Holt-Winters exponential smoothing without trend and without seasonal component.

Call:
HoltWinters(x = dfts, beta = FALSE, gamma = FALSE)

Smoothing parameters:
 alpha: 0.1621681
 beta : FALSE
 gamma: FALSE

Coefficients:
      [,1]
a 1276.253

The estimated value of the alpha parameter is about 0.162. Recall the SES level update: level_t = alpha * x_t + (1 - alpha) * level_(t-1). A value of alpha this close to zero means little weight is placed on the most recent observations, so the forecasts rely largely on observations further in the past.

The forecasts made by HoltWinters() are stored in a named element of this list variable called “fitted”, so we can get their values by typing:

dfts_forecasts$fitted
              xhat     level
Feb 2011  985.0000  985.0000
Mar 2011  955.1611  955.1611
Apr 2011 1019.0292 1019.0292
May 2011 1107.0817 1107.0817
Jun 2011 1187.0173 1187.0173
Jul 2011 1254.9629 1254.9629
Aug 2011 1296.3218 1296.3218
Sep 2011 1241.6190 1241.6190
Oct 2011 1173.5702 1173.5702
Nov 2011 1197.4786 1197.4786
Dec 2011 1208.1041 1208.1041
Jan 2012 1200.6275 1200.6275
Feb 2012 1233.9323 1233.9323
Mar 2012 1264.2687 1264.2687
Apr 2012 1261.6305 1261.6305
May 2012 1252.2846 1252.2846
Jun 2012 1211.3721 1211.3721
Jul 2012 1125.6870 1125.6870
Aug 2012 1210.7139 1210.7139
Sep 2012 1326.8726 1326.8726
Oct 2012 1361.9216 1361.9216
Nov 2012 1300.1482 1300.1482
Dec 2012 1249.2034 1249.2034

We can plot the original time series (black) against the in-sample forecasts (red) by typing:

plot(dfts_forecasts)

The plot shows the original time series in black, and the forecasts as a red line. The time series of forecasts is much smoother than the time series of the original data here.

Accuracy of the forecasts

As a measure of the accuracy of the forecasts, we can calculate the sum of squared errors for the in-sample forecast errors, that is, the forecast errors for the time period covered by our original time series.

dfts_forecasts$SSE
[1] 2815237
# Here the sum of squared errors is 2815237
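
The same number can be reproduced by hand from the components stored in the fitted object (a minimal sketch; x holds the original series and the one-step fitted values start one period later):

# Recompute the SSE from the observations and one-step-ahead fitted values
sum((dfts_forecasts$x[-1] - dfts_forecasts$fitted[, "xhat"])^2)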

It is common in simple exponential smoothing to use the first value in the time series as the initial value for the level. The initial value for the level is specified in the HoltWinters() function via the "l.start" parameter.

HoltWinters(dfts, beta=FALSE, gamma=FALSE, l.start=985)
Holt-Winters exponential smoothing without trend and without seasonal component.

Call:
HoltWinters(x = dfts, beta = FALSE, gamma = FALSE, l.start = 985)

Smoothing parameters:
 alpha: 0.1621681
 beta : FALSE
 gamma: FALSE

Coefficients:
      [,1]
a 1276.253

By default, HoltWinters() just makes forecasts for the time period covered by the original data, in this case from January 2011 to December 2012.
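
To forecast beyond the original sample, the forecast() function from the forecast package can be applied to the fitted object (a sketch; the horizon h = 8 is an arbitrary choice):

# Forecast 8 periods past the end of the data, with prediction intervals
dfts_future <- forecast(dfts_forecasts, h = 8)
plot(dfts_future)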

Simple Moving Average (SMA)

The following commands produce the SMA. The value of n can be changed to the desired value; e.g., n = 20 will produce a 20-day simple moving average (SMA).

# Calculate 10-day moving average
df1$MA10 <- TTR::SMA(df1$ride, n = 10)

# Calculate 60-day moving average
df1$MA60 <- TTR::SMA(df1$ride, n = 60)

Plotting Simple Moving Average

# Now we plot the values in ggplot
pl <- ggplot(df1 , aes(x = date))
pl <- pl + geom_line(aes(y = ride, color = "Ride"), group = 1)
pl <- pl + geom_line(aes(y = MA10, color = "MA10"),group = 1)
pl <- pl + geom_line(aes(y = MA60, color = "MA60"), group = 1)
pl <- pl +  theme_minimal()
#pl <- pl + theme(legend.title = "Moving Ave." )
pl <- pl + theme(legend.position = "top")
pl <- pl + labs(title ="Moving averages")
pl <- pl + labs(color="Frequency")
pl
Warning: Removed 9 rows containing missing values (`geom_line()`).
Warning: Removed 59 rows containing missing values (`geom_line()`).

These warnings are expected: the first n - 1 values of each simple moving average are NA, so ggplot drops those rows.

Task Four: Decompose and assess the stationarity of time series data

Ultimately, we usually want to fit more advanced ARIMA models for forecasting rather than relying on short-term smoothing forecasts. It is therefore important to decompose the data and assess its stationarity, because if the time series is not stationary, we cannot fit these more advanced models. The decompose() function splits the series into four components:

  1. Observed: the actual data plot

  2. Trend: the long-term trends in the data

  3. Seasonal: the repeating seasonal signal, i.e. any monthly/yearly pattern in the data points

  4. Random: the "left-over" component that isn't explained by the seasonality or trend components

components_dfts <- decompose(dfts)
plot(components_dfts)

Check if time series data is stationary

We need to remove the non-stationary components before fitting an ARIMA model.

library(tseries)
adf.test(dfts)

    Augmented Dickey-Fuller Test

data:  dfts
Dickey-Fuller = -3.2447, Lag order = 2, p-value = 0.09935
alternative hypothesis: stationary

Since the p-value is not less than .05, we fail to reject the null hypothesis. This means the time series is non-stationary: it has some time-dependent structure and does not have constant variance over time. For the ARIMA model analysis, we first need to make the data stationary.

Methods to achieve stationarity

  1. Transform the data: compute the differences between consecutive observations (differencing); log or square-root the series to stabilize non-constant variance; or, if the data contain a trend, fit some type of curve to the data and then model the residuals from that fit. A sketch of these transformations follows this list.
  2. Unit root test: used to determine whether differencing or detrending should be applied to the trending data to make it stationary. In the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test, small p-values suggest differencing is required.
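
A minimal sketch of these transformations (which of them is needed depends on the series):

# First difference to remove a trend
diff_dfts <- diff(dfts, differences = 1)

# Log transform to stabilize non-constant variance (all counts here are positive)
log_dfts <- log(dfts)

# ndiffs() from the forecast package estimates how many differences are needed
forecast::ndiffs(dfts)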

Unit root test

urkpssTest(dfts, type = c("tau"), lags = c("short"), use.lag = NULL, doplot = TRUE)


Title:
 KPSS Unit Root Test

Test Results:
  NA

Description:
 Mon Jan  8 14:51:24 2024 by user: satos

tsstationary <- diff(dfts, differences = 1)  # remove the non-stationary part
# Plot the series after differencing
autoplot(tsstationary)

Detect the seasonality

The following plots and functions help detect the seasonality (a box-plot sketch follows this list):

  • A seasonal subseries plot

  • Multiple box plot

  • Autocorrelation plot

  • ndiffs(), which determines the number of first differences required to make the time series stationary
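
For example, a multiple box plot by calendar month can be drawn with base R (a minimal sketch):

# One box per month; cycle() returns each observation's position within the seasonal cycle
boxplot(dfts ~ cycle(dfts), xlab = "Month", ylab = "Number of rides")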

We calculate auto correlation here.

acf(dfts, lag.max = 34)

As we can infer from the graph above, the autocorrelation rises and falls as the lag increases, decaying toward zero, which suggests there is little linear association between observations separated by larger lags.

Remove seasonality

To remove seasonality from the data, we subtract the seasonal component from the original series and then difference it to make it stationary.

ts_seasonally_adj <- dfts - components_dfts$seasonal
ts_stationary2 <- diff(ts_seasonally_adj, differences=1)
autoplot(ts_stationary2)

Task Five: Fit and forecast time series data using ARIMA models

After decomposing the series and making sure that it is stationary, an ARIMA model can be fitted and used for forecasting. ARIMA stands for AutoRegressive Integrated Moving Average; it is a statistical model used to analyze and forecast time series data. It is versatile and can handle a variety of time series patterns, including trend, seasonality, and autocorrelation.

Examine p and q values

To determine the order of the model to be fitted to the data, we need three values: p, d, and q, which are non-negative integers referring to the order of the autoregressive, integrated, and moving average parts of the model, respectively.

To examine which p and q values will be appropriate, we need to run the acf() and pacf() functions.

#ACF
acf(ts_stationary2, lag.max=34)

pacf(ts_stationary2, lag.max=34)

Manual ARIMA model

Use the Arima() function to fit an ARIMA model to a univariate time series.

fitARIMA <- Arima(dfts, order = c(1,1,1), seasonal = list(order = c(1,0,0), period = 12), method = "ML")
fitARIMA
Series: dfts 
ARIMA(1,1,1)(1,0,0)[12] 

Coefficients:
          ar1     ma1     sar1
      -0.5510  0.7693  -0.3612
s.e.   0.4032  0.3146   0.2729

sigma^2 = 112553:  log likelihood = -165.72
AIC=339.44   AICc=341.66   BIC=343.98
library(lmtest)
Loading required package: zoo

Attaching package: 'zoo'
The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
coeftest(fitARIMA)

z test of coefficients:

     Estimate Std. Error z value Pr(>|z|)  
ar1  -0.55098    0.40325 -1.3664  0.17182  
ma1   0.76932    0.31462  2.4453  0.01447 *
sar1 -0.36124    0.27291 -1.3237  0.18561  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Observing the coefficients, we can exclude the insignificant ones. We can use the confint() function for this purpose:

confint(fitARIMA)
          2.5 %    97.5 %
ar1  -1.3413336 0.2393643
ma1   0.1526882 1.3859576
sar1 -0.8961214 0.1736466

Box-Ljung test

To confirm there is no pattern left in the residuals, we perform a test of independence at all lags up to the one specified. To obtain the Box test results:

acf(fitARIMA$residuals)

boxresult <- Box.test(fitARIMA$residuals, lag = 2, type = "Ljung-Box")
boxresult

    Box-Ljung test

data:  fitARIMA$residuals
X-squared = 3.1199, df = 2, p-value = 0.2101
qqnorm(fitARIMA$residuals)

The p-value for the Ljung-Box test (0.2101) is well above 0.05, indicating "non-significance", i.e. no evidence of remaining autocorrelation in the residuals.

Auto ARIMA model

The auto.arima() function in R uses a combination of unit root tests, minimization of the AICc, and MLE to obtain an ARIMA model.

# Fit an ARIMA model
arima_model <- auto.arima(dfts,trace = TRUE)

 ARIMA(2,0,2)            with non-zero mean : Inf
 ARIMA(0,0,0)            with non-zero mean : 347.794
 ARIMA(1,0,0)            with non-zero mean : 347.2763
 ARIMA(0,0,1)            with non-zero mean : 341.2001
 ARIMA(0,0,0)            with zero mean     : 414.5997
 ARIMA(1,0,1)            with non-zero mean : 344.0656
 ARIMA(0,0,2)            with non-zero mean : Inf
 ARIMA(1,0,2)            with non-zero mean : Inf
 ARIMA(0,0,1)            with zero mean     : Inf

 Best model: ARIMA(0,0,1)            with non-zero mean 
arima_model
Series: dfts 
ARIMA(0,0,1) with non-zero mean 

Coefficients:
         ma1       mean
      0.7237  1272.4343
s.e.  0.1445    86.6815

sigma^2 = 68521:  log likelihood = -167
AIC=340   AICc=341.2   BIC=343.53

Forecasting using an ARIMA model

Once the best-suited model has been selected for the time series data, its parameters can be used as a predictive model for making forecasts of future values.

The d value affects the prediction intervals: the higher the value of d, the faster the prediction intervals increase in size. For d = 0, the long-term forecast standard deviation goes to the standard deviation of the historical data, so the prediction intervals will all be essentially the same.

The predict() function is used for predictions from the results of various model-fitting functions. It takes an argument n.ahead specifying how many time steps ahead to predict.

predict(fitARIMA,n.ahead = 25)
$pred
          Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
2013 1352.434 1333.474 1403.429 1415.213 1491.170 1604.435 1255.806 1155.364
2014 1363.060 1369.899 1344.634 1340.374 1312.937 1272.021 1397.959 1434.242
2015 1359.215                                                               
          Sep      Oct      Nov      Dec
2013 1294.288 1497.188 1495.446 1340.079
2014 1384.058 1310.763 1311.392 1367.517
2015                                    

$se
           Jan       Feb       Mar       Apr       May       Jun       Jul
2013  335.4935  528.7916  644.4571  753.5950  843.2468  927.0267 1002.4176
2014 1343.8583 1364.2656 1386.9164 1407.8055 1429.1428 1449.7553 1470.3016
2015 1594.8694                                                            
           Aug       Sep       Oct       Nov       Dec
2013 1073.2525 1139.3139 1201.9474 1261.3708 1318.1729
2014 1490.4434 1510.3826 1530.0261 1549.4401 1568.6032
2015                                                  

These are the predictions from the manual ARIMA model.

Predict the values of the auto ARIMA model

predict(arima_model, n.ahead = 25)
$pred
          Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
2013 1375.011 1272.434 1272.434 1272.434 1272.434 1272.434 1272.434 1272.434
2014 1272.434 1272.434 1272.434 1272.434 1272.434 1272.434 1272.434 1272.434
2015 1272.434                                                               
          Sep      Oct      Nov      Dec
2013 1272.434 1272.434 1272.434 1272.434
2014 1272.434 1272.434 1272.434 1272.434
2015                                    

$se
          Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
2013 261.7650 323.1268 323.1268 323.1268 323.1268 323.1268 323.1268 323.1268
2014 323.1268 323.1268 323.1268 323.1268 323.1268 323.1268 323.1268 323.1268
2015 323.1268                                                               
          Sep      Oct      Nov      Dec
2013 323.1268 323.1268 323.1268 323.1268
2014 323.1268 323.1268 323.1268 323.1268
2015                                    

forecast() function

future_val <- forecast(fitARIMA, h = 25, level = c(99.5))
autoplot(future_val)

  • The manual ARIMA model's forecast shows month-to-month changes over the next 25 months in the line plot.

Auto ARIMA model

autoplot(forecast(arima_model, h=25))

  • The auto ARIMA model's forecast is flat (horizontal): an MA(1) model reverts to its mean beyond one step ahead.

Check residuals

A residual is the difference between the forecast and the actual data. A good model must have randomly distributed residuals without an obvious pattern. We check this for both models below.

Manual ARIMA model

checkresiduals(object = fitARIMA)


    Ljung-Box test

data:  Residuals from ARIMA(1,1,1)(1,0,0)[12]
Q* = 4.328, df = 3, p-value = 0.2282

Model df: 3.   Total lags used: 6

Auto ARIMA model

checkresiduals(object = arima_model)


    Ljung-Box test

data:  Residuals from ARIMA(0,0,1) with non-zero mean
Q* = 4.5212, df = 4, p-value = 0.34

Model df: 1.   Total lags used: 5

According to the residual checks, both models pass the Ljung-Box test (p-values above 0.05), so neither shows significant residual autocorrelation.

Accuracy Metrics

There are many accuracy metrics in forecasting. In this discussion, we will compare the Root Mean Squared Error (RMSE) of both models.

Root Mean Squared Error (RMSE): the standard deviation of the residuals, which measures how spread out the residuals are. A smaller RMSE is a sign of a better result. A sketch of computing it by hand follows.
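
RMSE can also be computed directly from a fitted model's residuals (a minimal sketch):

# RMSE by hand: square root of the mean squared in-sample residual
sqrt(mean(residuals(fitARIMA)^2))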

Manual

#Manual
accuracy(fitARIMA)
                   ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
Training set 16.80081 306.2578 208.8546 -1.584652 17.22951 0.4379268 -0.1393143
#306.2578

Auto

accuracy(arima_model)
                      ME     RMSE      MAE       MPE     MAPE      MASE
Training set -0.09656058 250.6209 199.1791 -4.672498 17.61748 0.4176391
                    ACF1
Training set -0.01205578
#250.6209

Based on RMSE, we can see that the auto ARIMA model gives a better result than the manual ARIMA model.

Task Six: Findings and Conclusions

What did you learn throughout the process?

I have learned the following things.

  • Time-series analysis

  • ARIMA model analysis

  • Forecasting calculation

Are the results what you expected?

  • Yes. I learned techniques to determine whether data are stationary and seasonal.

What are the key findings and takeaways?

The key findings are as follows:

  • The user trends are seasonal, with more usage in the summer period and less usage in the winter period.