Forecast daily bike rental demand using time series models
Author
Satoshi Matsumoto
About Data Analysis Report
This report is for the project on forecasting daily bike rental demand using time series models in R. It contains analyses such as data exploration, summary statistics, and the building of time series models. The final report was completed on Mon Jan 8 14:51:19 2024.
Data Description:
This dataset contains the daily count of rental bike transactions between the years 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and seasonal information.
Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
Relevant Paper:
Fanaee-T, Hadi, and Gama, Joao, ‘Event labeling combining ensemble detectors and background knowledge’, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg
Task One: Load and explore the data
Load data and install packages
The following packages are used (loaded in the sketch after this list):
tidyverse for data manipulation
plotly for visualization
forecast for forecasting
TTR for smoothing time series data
TSA for the Holt-Winters method
fUnitRoots for assessing whether time series data are stationary
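A minimal loading sketch, assuming the packages are already installed; the ride variable used later is assumed here to be the raw rental count cnt from day.csv:
# Load the libraries used throughout the report
library(tidyverse)   # data manipulation
library(plotly)      # visualization
library(forecast)    # forecasting (Arima, auto.arima, forecast, tsclean)
library(TTR)         # smoothing time series data
library(TSA)         # time series analysis
library(fUnitRoots)  # unit root tests (urkpssTest)
# Read the daily data; ride is assumed to be the raw count column cnt
df1 <- read.csv("day.csv")
df1$ride <- df1$cnt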
Describe and explore the data
According to the data exploration, there is information on time, temperature, humidity, and the number of bike users. In this scenario, we use the "day.csv" file for the time series analysis.
We make short-term forecasts from the time series data by smoothing it. Before doing any forecasting, we need to tell R that this data is a time series; to do this, we make a time series object. There are two critical inputs we must give the function: frequency and start. We need to tell it the frequency of our readings, that is, how many rows of data there are in a year. In this case, we have one data point per month, so frequency = 12.
For start, if we are not interested in attaching specific dates to our results, we can just enter 1; but since we usually are, we indicate the year and month of the first data point as start = c(2011, 1).
Make a time series object from the date and total bike count variables.
# Use the "ride" variable from Jan 2011 to Dec 2012 as a time series object "dfts"
dfts <- ts(df1$ride, start = c(2011, 1), end = c(2012, 12), frequency = 12)
plot(dfts)
Next, we identify and replace outliers and missing values in the time series.
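A sketch of this cleaning step, assuming tsclean() from the forecast package (the report's exact call is not shown); printing the series then gives the table below:
# Identify and replace outliers, and interpolate missing values
dfts <- tsclean(dfts)
dfts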
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2011 985 801 1349 1562 1600 1606 1510 959 822 1321 1263 1162
2012 1406 1421 1248 1204 1000 683 1650 1927 1543 981 986 1416
Simple Exponential Smoothing
To make forecasts using simple exponential smoothing in R, we can fit a simple exponential smoothing predictive model using the HoltWinters() function.
# Simple exponential smoothing: no trend (beta) and no seasonal (gamma) component
dfts_forecasts <- HoltWinters(dfts, beta = FALSE, gamma = FALSE)
dfts_forecasts
Holt-Winters exponential smoothing without trend and without seasonal component.
Call:
HoltWinters(x = dfts, beta = FALSE, gamma = FALSE)
Smoothing parameters:
alpha: 0.1621681
beta : FALSE
gamma: FALSE
Coefficients:
[,1]
a 1276.253
The estimated value of the alpha parameter is about 0.162. This is relatively close to zero, telling us that the forecasts are based on both recent and less recent observations, although somewhat more weight is placed on recent ones.
The forecasts made by HoltWinters() are stored in a named element of this list variable called “fitted”, so we can get their values by typing:
dfts_forecasts$fitted
xhat level
Feb 2011 985.0000 985.0000
Mar 2011 955.1611 955.1611
Apr 2011 1019.0292 1019.0292
May 2011 1107.0817 1107.0817
Jun 2011 1187.0173 1187.0173
Jul 2011 1254.9629 1254.9629
Aug 2011 1296.3218 1296.3218
Sep 2011 1241.6190 1241.6190
Oct 2011 1173.5702 1173.5702
Nov 2011 1197.4786 1197.4786
Dec 2011 1208.1041 1208.1041
Jan 2012 1200.6275 1200.6275
Feb 2012 1233.9323 1233.9323
Mar 2012 1264.2687 1264.2687
Apr 2012 1261.6305 1261.6305
May 2012 1252.2846 1252.2846
Jun 2012 1211.3721 1211.3721
Jul 2012 1125.6870 1125.6870
Aug 2012 1210.7139 1210.7139
Sep 2012 1326.8726 1326.8726
Oct 2012 1361.9216 1361.9216
Nov 2012 1300.1482 1300.1482
Dec 2012 1249.2034 1249.2034
We can plot the original time series in black against the forecasts in red by typing:
plot(dfts_forecasts)
The plot shows the original time series in black, and the forecasts as a red line. The time series of forecasts is much smoother than the time series of the original data here.
Accuracy of the forecasts
As a measure of the accuracy of the forecasts, we can calculate the sum of squared errors for the in-sample forecast errors, that is, the forecast errors for the time period covered by our original time series.
dfts_forecasts$SSE
[1] 2815237
Here the sum of squared errors is 2,815,237.
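As a cross-check, this figure can be recomputed from the in-sample one-step errors (a small sketch; the two series align on their common window):
# SSE = sum of squared differences between observed values and fitted xhat
sum((dfts_forecasts$x - dfts_forecasts$fitted[, "xhat"])^2)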
It is common in simple exponential smoothing to use the first value in the time series as the initial value for the level. The initial value for the level is specified in the HoltWinters() function using the l.start parameter.
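For example, a sketch of a re-fit with the level explicitly initialized to the first observation:
# Initialize the level at the first data point
HoltWinters(dfts, beta = FALSE, gamma = FALSE, l.start = dfts[1])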
Task Four: Decompose and assess the stationarity of time series data
Ultimately, we usually want to fit more advanced ARIMA models for forecasting rather than relying on short-term smoothing forecasts. It is therefore important to decompose the data and assess its stationarity, because if the time series is not stationary, we cannot fit the more advanced models. Decomposition splits the series into four components (a code sketch follows this list):
Observed: the actual data plot
Trend: the long-term trend in the data
Seasonal: the repeated seasonal signal, i.e. any monthly or yearly pattern in the data points
Random: the "left-over" component that is not explained by the trend or seasonal components
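A minimal sketch of producing these four panels (the report's exact call is not shown):
# Classical decomposition of the monthly series
dfts_components <- decompose(dfts)
plot(dfts_components)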
Augmented Dickey-Fuller Test
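The output below is consistent with adf.test() from the tseries package (the exact call is assumed):
# Augmented Dickey-Fuller test; H0: the series is non-stationary
library(tseries)
adf.test(dfts)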
data: dfts
Dickey-Fuller = -3.2447, Lag order = 2, p-value = 0.09935
alternative hypothesis: stationary
Since the p-value is not less than 0.05, we fail to reject the null hypothesis. This means the time series is non-stationary: it has some time-dependent structure and does not have constant variance over time. For the ARIMA model analysis, we need to make the data stationary.
Methods to achieve stationarity
Difference the data: compute the differences between consecutive observations (see the sketch after this list).
Log or square-root transform the series to stabilize non-constant variance.
If the data contain a trend, fit some type of curve to the data and then model the residuals from that fit.
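For example, first differencing in R (a sketch; the name tsstationary matches the plot call further below):
# First differences of the series
tsstationary <- diff(dfts, differences = 1)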
Unit root test: this test is used to determine whether differencing or a regression on trend should be applied to trending data to make it stationary. In the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test, small p-values suggest that differencing is required.
Unit root test
urkpssTest(dfts, type = c("tau"), lags = c("short"), use.lag = NULL, doplot = TRUE)
Title:
KPSS Unit Root Test
Test Results:
NA
Description:
Mon Jan 8 14:51:24 2024 by user: satos
# Plot after removing non-stationarity
autoplot(tsstationary)
Detect the seasonality
The following plots and functions help in detecting seasonality:
A seasonal subseries plot
Multiple box plots
An autocorrelation plot
ndiffs(), which estimates the number of first differences required to make the time series stationary (see the sketch after this list)
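A sketch of the ndiffs() check (forecast package):
# Estimated number of first differences needed for stationarity
ndiffs(dfts)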
We calculate the autocorrelation here.
acf(dfts, lag.max = 34)
As we can infer from the graph above, the autocorrelation alternately increases and decreases as the lag grows, confirming that there is no strong linear association between observations separated by larger lags.
Remove seasonality
To remove seasonality from the data, we subtract the seasonal component from the original series and then difference it to make it stationary.
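A minimal sketch of this step, with the result named ts_stationary2 to match the ACF/PACF calls below (both names are assumptions):
# Subtract the seasonal component, then take first differences
ts_seasonal_adj <- dfts - decompose(dfts)$seasonal
ts_stationary2 <- diff(ts_seasonal_adj, differences = 1)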
Task Five: Fit and forecast time series data using ARIMA models
After decomposing the data and making sure that the time series is stationary, an ARIMA model can be fitted and used for forecasting. ARIMA stands for AutoRegressive Integrated Moving Average, and it is a statistical model used to analyze and forecast time series data. It is a versatile model that can handle a variety of time series patterns, including trend, seasonality, and autocorrelation.
Examine p and q values
To determine the order of the model to be fitted to the data, we need three values: p, d, and q, which are non-negative integers referring to the order of the autoregressive, integrated, and moving average parts of the model, respectively.
To examine which p and q values are appropriate, we run the acf() and pacf() functions.
# ACF
acf(ts_stationary2, lag.max = 34)
# PACF
pacf(ts_stationary2, lag.max = 34)
Manual ARIMA model
Use the Arima() function to fit an ARIMA model to the univariate time series.
fitARIMA <- Arima(dfts, order = c(1, 1, 1),
                  seasonal = list(order = c(1, 0, 0), period = 12),
                  method = "ML")
fitARIMA
The p-values for the Ljung-Box Q test are all well above 0.05, indicating non-significance of the residual autocorrelation.
Auto ARIMA model
The auto.arima() function in R uses a combination of unit root tests, minimization of the AIC and MLE to obtain an ARIMA model.
# Fit an ARIMA model, tracing the candidate models
arima_model <- auto.arima(dfts, trace = TRUE)
ARIMA(2,0,2) with non-zero mean : Inf
ARIMA(0,0,0) with non-zero mean : 347.794
ARIMA(1,0,0) with non-zero mean : 347.2763
ARIMA(0,0,1) with non-zero mean : 341.2001
ARIMA(0,0,0) with zero mean : 414.5997
ARIMA(1,0,1) with non-zero mean : 344.0656
ARIMA(0,0,2) with non-zero mean : Inf
ARIMA(1,0,2) with non-zero mean : Inf
ARIMA(0,0,1) with zero mean : Inf
Best model: ARIMA(0,0,1) with non-zero mean
arima_model
Series: dfts
ARIMA(0,0,1) with non-zero mean
Coefficients:
ma1 mean
0.7237 1272.4343
s.e. 0.1445 86.6815
sigma^2 = 68521: log likelihood = -167
AIC=340 AICc=341.2 BIC=343.53
Forecasting using an ARIMA model
Once the best-suited model is selected for the time series data, its parameters can be used as a predictive model for making forecasts of future values.
The d-value affects the prediction intervals: they increase in size with higher values of d. When d = 0, the long-term forecast standard deviation converges to the standard deviation of the historical data, so the prediction intervals will all be essentially the same width.
There is a function called predict() that produces predictions from the results of various model-fitting functions. It takes an argument n.ahead specifying how many time steps ahead to predict.
predict(fitARIMA, n.ahead = 25)
$pred
Jan Feb Mar Apr May Jun Jul Aug
2013 1352.434 1333.474 1403.429 1415.213 1491.170 1604.435 1255.806 1155.364
2014 1363.060 1369.899 1344.634 1340.374 1312.937 1272.021 1397.959 1434.242
2015 1359.215
Sep Oct Nov Dec
2013 1294.288 1497.188 1495.446 1340.079
2014 1384.058 1310.763 1311.392 1367.517
2015
$se
Jan Feb Mar Apr May Jun Jul
2013 335.4935 528.7916 644.4571 753.5950 843.2468 927.0267 1002.4176
2014 1343.8583 1364.2656 1386.9164 1407.8055 1429.1428 1449.7553 1470.3016
2015 1594.8694
Aug Sep Oct Nov Dec
2013 1073.2525 1139.3139 1201.9474 1261.3708 1318.1729
2014 1490.4434 1510.3826 1530.0261 1549.4401 1568.6032
2015
These are the predictions from the manual ARIMA model.
Predict values from the auto ARIMA model
predict(arima_model, n.ahead = 25)
$pred
Jan Feb Mar Apr May Jun Jul Aug
2013 1375.011 1272.434 1272.434 1272.434 1272.434 1272.434 1272.434 1272.434
2014 1272.434 1272.434 1272.434 1272.434 1272.434 1272.434 1272.434 1272.434
2015 1272.434
Sep Oct Nov Dec
2013 1272.434 1272.434 1272.434 1272.434
2014 1272.434 1272.434 1272.434 1272.434
2015
$se
Jan Feb Mar Apr May Jun Jul Aug
2013 261.7650 323.1268 323.1268 323.1268 323.1268 323.1268 323.1268 323.1268
2014 323.1268 323.1268 323.1268 323.1268 323.1268 323.1268 323.1268 323.1268
2015 323.1268
Sep Oct Nov Dec
2013 323.1268 323.1268 323.1268 323.1268
2014 323.1268 323.1268 323.1268 323.1268
2015
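Manual ARIMA model
The line plot referenced next can be produced with forecast() and autoplot(); the exact call is assumed, mirroring the auto-model call below.
# Forecast the manual model 25 steps ahead and plot
autoplot(forecast(fitARIMA, h = 25))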
The manual ARIMA model analysis shows the changes over the next 25 months in the line plot.
Auto ARIMA model
autoplot(forecast(arima_model, h = 25))
The auto ARIMA model analysis shows a flat, horizontal forecast.
Check residuals
A residual is the difference between the forecast and the actual data. A good model must have randomly distributed residuals without an obvious pattern; the checkresiduals() function plots the residuals and runs a Ljung-Box test of this assumption.
Manual ARIMA model
checkresiduals(object = fitARIMA)
Ljung-Box test
data: Residuals from ARIMA(1,1,1)(1,0,0)[12]
Q* = 4.328, df = 3, p-value = 0.2282
Model df: 3. Total lags used: 6
Auto ARIMA model
checkresiduals(object = arima_model)
Ljung-Box test
data: Residuals from ARIMA(0,0,1) with non-zero mean
Q* = 4.5212, df = 4, p-value = 0.34
Model df: 1. Total lags used: 5
According to the residual check, the manual ARIMA model analysis seems to be more accurate than the auto ARIMA model.
Accuracy Metrics
There are many accuracy metrics in forecasting. In this discussion, we compare the Root Mean Squared Error (RMSE) of the two models.
Root Mean Squared Error (RMSE): the standard deviation of the residuals, which measures how spread out the residuals are. A smaller RMSE indicates a better result.
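Equivalently, RMSE can be recomputed directly from a model's residuals (a sketch, using the manual model):
# RMSE by hand: square root of the mean squared residual
sqrt(mean(residuals(fitARIMA)^2))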
Manual
# Manual ARIMA model accuracy
accuracy(fitARIMA)
ME RMSE MAE MPE MAPE MASE ACF1
Training set 16.80081 306.2578 208.8546 -1.584652 17.22951 0.4379268 -0.1393143
Here the RMSE is 306.2578.
Auto
accuracy(arima_model)
ME RMSE MAE MPE MAPE MASE
Training set -0.09656058 250.6209 199.1791 -4.672498 17.61748 0.4176391
ACF1
Training set -0.01205578
Here the RMSE is 250.6209.
Based on RMSE, the auto ARIMA model gives a better result than the manual ARIMA model.
Task Six: Findings and Conclusions
What did you learn throughout the process?
I have learned the following things.
Time-series analysis
ARIMA model analysis
Forecasting calculation
Are the results what you expected?
Yes. I acquired the techniques needed to determine whether data are stationary and seasonal.
What are the key findings and takeaways?
The key findings are as follows:
The user trends are seasonal, with more usage in the summer period and less usage in the winter period.