This RMarkdown file contains the report of the data analysis done for the project on forecasting daily bike rental demand using time series models in R. It contains analysis such as data exploration, summary statistics and building the time series models. The final report was completed on Tue Oct 31 09:00:36 2023.
Data Description:
This dataset contains the daily count of rental bike transactions between years 2011 and 2012 in Capital bikeshare system with the corresponding weather and seasonal information.
Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
Relevant Paper:
Fanaee-T, Hadi, and Gama, Joao. Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg
install.packages("timetk")
install.packages("lubridate")
install.packages("dplyr")
install.packages("ggpot2")
install.packages("tidyverse")
install.packages("TTR")
install.packages("tseries")
install.packages("corrplot")
install.packages("astsa")
Load the packages and data
library(tseries)
library(lubridate)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(timetk)
library(TTR)
library(corrplot)
library(astsa)
data(bike_sharing_daily)
Describe and explore the data
summary(bike_sharing_daily)
## instant dteday season yr
## Min. : 1.0 Min. :2011-01-01 Min. :1.000 Min. :0.0000
## 1st Qu.:183.5 1st Qu.:2011-07-02 1st Qu.:2.000 1st Qu.:0.0000
## Median :366.0 Median :2012-01-01 Median :3.000 Median :1.0000
## Mean :366.0 Mean :2012-01-01 Mean :2.497 Mean :0.5007
## 3rd Qu.:548.5 3rd Qu.:2012-07-01 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :731.0 Max. :2012-12-31 Max. :4.000 Max. :1.0000
## mnth holiday weekday workingday
## Min. : 1.00 Min. :0.00000 Min. :0.000 Min. :0.000
## 1st Qu.: 4.00 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.000
## Median : 7.00 Median :0.00000 Median :3.000 Median :1.000
## Mean : 6.52 Mean :0.02873 Mean :2.997 Mean :0.684
## 3rd Qu.:10.00 3rd Qu.:0.00000 3rd Qu.:5.000 3rd Qu.:1.000
## Max. :12.00 Max. :1.00000 Max. :6.000 Max. :1.000
## weathersit temp atemp hum
## Min. :1.000 Min. :0.05913 Min. :0.07907 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.33708 1st Qu.:0.33784 1st Qu.:0.5200
## Median :1.000 Median :0.49833 Median :0.48673 Median :0.6267
## Mean :1.395 Mean :0.49538 Mean :0.47435 Mean :0.6279
## 3rd Qu.:2.000 3rd Qu.:0.65542 3rd Qu.:0.60860 3rd Qu.:0.7302
## Max. :3.000 Max. :0.86167 Max. :0.84090 Max. :0.9725
## windspeed casual registered cnt
## Min. :0.02239 Min. : 2.0 Min. : 20 Min. : 22
## 1st Qu.:0.13495 1st Qu.: 315.5 1st Qu.:2497 1st Qu.:3152
## Median :0.18097 Median : 713.0 Median :3662 Median :4548
## Mean :0.19049 Mean : 848.2 Mean :3656 Mean :4504
## 3rd Qu.:0.23321 3rd Qu.:1096.0 3rd Qu.:4776 3rd Qu.:5956
## Max. :0.50746 Max. :3410.0 Max. :6946 Max. :8714
ts_data<-bike_sharing_daily
mean_temp<-numeric(length=4) #Average temperature in 4 different seasons
median_temp<-numeric(length=4)
for(i in 1:4){
mean_temp[i]<-mean(ts_data$temp[ts_data$season == i])
median_temp[i]<-median(ts_data$temp[ts_data$season == i])}
print(mean_temp)
## [1] 0.2977475 0.5444052 0.7063093 0.4229060
print(median_temp)
## [1] 0.2858330 0.5620835 0.7145830 0.4091665
temp_rent<-data.frame(ts_data$temp, ts_data$atemp,ts_data$cnt) #Exploring correlations
cor(temp_rent)
## ts_data.temp ts_data.atemp ts_data.cnt
## ts_data.temp 1.0000000 0.9917016 0.6274940
## ts_data.atemp 0.9917016 1.0000000 0.6310657
## ts_data.cnt 0.6274940 0.6310657 1.0000000
corrplot(cor(temp_rent))
temp_reg_casual<-data.frame(ts_data$temp,ts_data$registered,ts_data$casual)
cor(temp_reg_casual)
## ts_data.temp ts_data.registered ts_data.casual
## ts_data.temp 1.0000000 0.5400120 0.5432847
## ts_data.registered 0.5400120 1.0000000 0.3952825
## ts_data.casual 0.5432847 0.3952825 1.0000000
corrplot(cor(temp_reg_casual))
mean_cnt<-numeric(length=12) #Mean of bikes rented across months
for(i in 1:12){
mean_cnt[i]<-mean(ts_data$cnt[ts_data$mnth == i])
}
print(mean_cnt)
## [1] 2176.339 2655.298 3692.258 4484.900 5349.774 5772.367 5563.677 5664.419
## [9] 5766.517 5199.226 4247.183 3403.806
ggplot(ts_data, aes(x = factor(season), y = temp, fill = factor(season))) +
geom_boxplot() +
scale_fill_manual(values = c("#E41A1C", "#377EB8", "#4DAF4A", "#FF7F00")) +
labs(x = "Season", y = "Temperature") +
theme_minimal() #Visuals to see the change in temperature with respect to change in season
ggplot(ts_data, aes(x = factor(season), y = temp, fill = factor(season))) +
stat_summary(fun.y = "mean", geom = "bar", position = "dodge") +
scale_fill_manual(values = c("#E41A1C", "#377EB8", "#4DAF4A", "#FF7F00")) +
labs(x = "Season", y = "Mean Temperature") +
theme_minimal()
## Warning: The `fun.y` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
## ℹ Please use the `fun` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
ts_data %>%
plot_time_series(dteday, cnt,
.interactive = FALSE,
.plotly_slider = TRUE) ## Using timetk to explore time series visually
ts_data%>%
group_by(yr) %>%
plot_time_series(dteday,cnt,
.color_var=season, .interactive=FALSE)
ts_data%>%
group_by(yr) %>%
plot_time_series(dteday,hum,
.color_var=season, .interactive=FALSE)
ts_data%>%
group_by(yr) %>%
plot_time_series(dteday,atemp,
.color_var=season,.interactive=FALSE)
ts_data%>%
group_by(yr) %>%
plot_time_series(dteday,windspeed,
.color_var=season,.interactive=FALSE)
ts_data%>%
group_by(yr) %>%
plot_time_series(dteday,casual,
.color_var=season,.interactive=FALSE)
ts_data%>%
group_by(yr) %>%
plot_time_series(dteday,registered,
.color_var=season,.interactive=FALSE) ### As the objective of this project is to forecast the number of bikes that will be rented in the future. Our focus will be on "cnt" as it contains both casual and registered clients.
#To check for the seasonality we used seasonal_diagnostics
ts_data %>%
plot_seasonal_diagnostics(dteday, cnt, .interactive = FALSE) # seasonality is seen repeating year by year
ts_data %>%
plot_anomaly_diagnostics(dteday,(cnt),
.message = FALSE,
.facet_ncol = 3,
.ribbon_alpha = 0.25,
.interactive = FALSE) #Anomaly detection
ts_data %>%
mutate(year = year(dteday)) %>%
plot_time_series(dteday, (cnt),
.facet_vars = c(year), # add groups/facets
.color_var = year, # color by year
.facet_ncol = 2,
.facet_scales = "free",
.facet_collapse = TRUE, # combine group strip text into 1 line
.interactive = FALSE)
ts_data %>%
plot_time_series(dteday,(cnt),
.color_var = year(dteday),# Returns static ggplot
.interactive = FALSE, # Customization
.title = "bike rental data",
.x_lab = "Date",
.y_lab = "count of bikes rented",
.color_lab = "year") +
scale_y_continuous(labels = scales::comma_format())
library(tseries) #creating a univariate time series
library(forecast)
##
## Attaching package: 'forecast'
## The following object is masked from 'package:astsa':
##
## gas
count1 <- ts(ts_data$cnt, start = c(2011, 1,1), frequency = 365)
length(count1)
## [1] 731
hist(count1,freq=FALSE)
lines(density(count1), lwd = 2, col = 'red')
g <- seq(min(count1), max(count1), length = 731)
h <- dnorm(g, mean = mean(count1), sd = sd(count1))
lines(g, h, col = "blue", lwd = 2) #cleaning outliers
library(tseries)
library(forecast)
count<-tsclean(count1)
hist(count, freq=FALSE)
lines(density(count), lwd = 2, col = 'red')
k <- seq(min(count), max(count), length = 731)
l <- dnorm(k, mean = mean(count), sd = sd(count))
lines(k, l, col = "blue", lwd = 2) #simple exponential smoothing
HW1<- HoltWinters((count), beta=FALSE, gamma=FALSE)
library(forecast)
forecast_s.exp<-predict(HW1,n=24)
forecast_s.exp
## Time Series:
## Start = c(2013, 2)
## End = c(2013, 25)
## Frequency = 365
## fit
## [1,] 2300.898
## [2,] 2300.898
## [3,] 2300.898
## [4,] 2300.898
## [5,] 2300.898
## [6,] 2300.898
## [7,] 2300.898
## [8,] 2300.898
## [9,] 2300.898
## [10,] 2300.898
## [11,] 2300.898
## [12,] 2300.898
## [13,] 2300.898
## [14,] 2300.898
## [15,] 2300.898
## [16,] 2300.898
## [17,] 2300.898
## [18,] 2300.898
## [19,] 2300.898
## [20,] 2300.898
## [21,] 2300.898
## [22,] 2300.898
## [23,] 2300.898
## [24,] 2300.898
plot(HW1) #Visually evaluate the fit
plot(count, ylab="rented bikes", xlim=c(2011,2013))
lines(HW1$fitted[,1], lty=2, col="blue")
HW1.pred <- predict(HW1, 24, prediction.interval = TRUE, level=0.95)
#Visually evaluate the prediction
plot(count, ylab="rented bikes", xlim=c(2011,2013))
lines(HW1$fitted[,1], lty=2, col="blue")
lines(HW1.pred[,1], col="red")
lines(HW1.pred[,2], lty=2, col="orange")
lines(HW1.pred[,3], lty=2, col="orange")
library(forecast)
HW1_for <- forecast(HW1, h=24, level=c(80,95))
HW1_for$residuals<-na.omit(HW1_for$residuals)
plot(HW1_for, xlim=c(2011, 2013)) #visualize forecast and checking stationarity, autocorrelation and normality of forecasted errors
lines(HW1_for$fitted, lty=2, col="purple")
acf(HW1_for$residuals, na.action=na.pass)
Box.test(HW1_for$residuals, type="Ljung-Box")
##
## Box-Ljung test
##
## data: HW1_for$residuals
## X-squared = 24.108, df = 1, p-value = 9.109e-07
hist(HW1_for$residuals,freq=FALSE)
lines(density(HW1_for$residuals), lwd = 2, col = 'red')
y <- seq(min(HW1_for$residual), max(HW1_for$residual), length = 722)
z <- dnorm(y, mean = mean(HW1_for$residuals), sd = sd(HW1_for$residuals))
lines(y, z, col = "blue", lwd = 2)
shapiro.test(HW1_for$residuals)
##
## Shapiro-Wilk normality test
##
## data: HW1_for$residuals
## W = 0.94856, p-value = 2.98e-15
adf.test(count) #Checking the stationarity of original time series
##
## Augmented Dickey-Fuller Test
##
## data: count
## Dickey-Fuller = -1.4435, Lag order = 9, p-value = 0.8138
## alternative hypothesis: stationary
decomp<-decompose((count)) #trend is seen
plot(decomp)
adf.test(diff(count)) #checking the stationarity of differenced time series
## Warning in adf.test(diff(count)): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff(count)
## Dickey-Fuller = -13.298, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
#Seasonal decomposition
fit <- stl((count), s.window="periodic")
plot(fit)
plot(fit$time.series[,1], ylab = "Stationary of the Rental", # both trend and seasonality is observed
xlab = "Day of the Year")
checkresiduals(fit$time.series[, 3])
##
## Ljung-Box test
##
## data: Residuals
## Q* = 1841, df = 146, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 146
shapiro.test(fit$time.series[, 3]) #Calculating Simple moving average
##
## Shapiro-Wilk normality test
##
## data: fit$time.series[, 3]
## W = 0.9989, p-value = 0.9439
trend <- SMA(count, n = 10) #deseasonalizing the data
deseasonalized_detrended_data <- count - trend
deseasonalized_detrended_data<-na.omit(deseasonalized_detrended_data)
plot(deseasonalized_detrended_data)
adf.test(deseasonalized_detrended_data) #checking stationarity
## Warning in adf.test(deseasonalized_detrended_data): p-value smaller than
## printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: deseasonalized_detrended_data
## Dickey-Fuller = -11.154, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
library(forecast) #Fitting SARIMA as ARIMA with deseasonalized data was giving large AIC
fit_arima<-auto.arima((count),trace=TRUE)
##
## Fitting models using approximations to speed things up...
##
## ARIMA(2,0,2)(0,1,0)[365] with drift : 381.8691
## ARIMA(0,0,0)(0,1,0)[365] with drift : 494.9632
## ARIMA(1,0,0)(0,1,0)[365] with drift : 397.904
## ARIMA(0,0,1)(0,1,0)[365] with drift : 427.4676
## ARIMA(0,0,0)(0,1,0)[365] : 1056.341
## ARIMA(1,0,2)(0,1,0)[365] with drift : 379.6846
## ARIMA(0,0,2)(0,1,0)[365] with drift : 404.7834
## ARIMA(1,0,1)(0,1,0)[365] with drift : 389.6037
## ARIMA(1,0,3)(0,1,0)[365] with drift : 379.227
## ARIMA(0,0,3)(0,1,0)[365] with drift : 401.2876
## ARIMA(2,0,3)(0,1,0)[365] with drift : Inf
## ARIMA(1,0,4)(0,1,0)[365] with drift : 380.5488
## ARIMA(0,0,4)(0,1,0)[365] with drift : 402.943
## ARIMA(2,0,4)(0,1,0)[365] with drift : 383.8164
## ARIMA(1,0,3)(0,1,0)[365] : Inf
##
## Now re-fitting the best model(s) without approximations...
##
## ARIMA(1,0,3)(0,1,0)[365] with drift : 6097.864
##
## Best model: ARIMA(1,0,3)(0,1,0)[365] with drift
fit <- arima(count, order=c(1, 0, 2), seasonal=list(order=c(0, 1, 0),period=365))
predict(fit, n.ahead=24) #forecasting for next 2 months
## $pred
## Time Series:
## Start = c(2013, 2)
## End = c(2013, 25)
## Frequency = 365
## [1] 2311.064 2523.005 2653.928 3556.856 4381.788 4803.724 3706.664 2656.607
## [9] 3877.555 2455.507 4374.462 3490.422 2768.385 2585.353 2571.324 3207.299
## [17] 3647.278 3562.261 3432.247 1569.238 2244.232 2698.230 4604.231 4534.237
##
## $se
## Time Series:
## Start = c(2013, 2)
## End = c(2013, 25)
## Frequency = 365
## [1] 994.0721 1056.4824 1068.9631 1081.2079 1093.2255 1105.0239 1116.6107
## [8] 1127.9931 1139.1779 1150.1713 1160.9796 1171.6083 1182.0630 1192.3488
## [15] 1202.4706 1212.4331 1222.2407 1231.8977 1241.4081 1250.7759 1260.0046
## [22] 1269.0979 1278.0591 1286.8916
forecast_arima <- predict(fit, 24)
plot(forecast(fit, 24))
hist(fit$residuals,freq=FALSE, xlab = "Residual", ylab = "Distribution", main = "Histogram of Model Errors - Bike Count") #Residual analysis
lines(density(fit$residuals), lwd = 2, col = 'red')
x <- seq(min(fit$residual), max(fit$residual), length = 722)
f <- dnorm(x, mean = mean(fit$residuals), sd = sd(fit$residuals))
lines(x, f, col = "blue", lwd = 2)
adf.test(fit$residuals)
## Warning in adf.test(fit$residuals): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: fit$residuals
## Dickey-Fuller = -8.0939, Lag order = 9, p-value = 0.01
## alternative hypothesis: stationary
shapiro.test(fit$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit$residuals
## W = 0.83833, p-value < 2.2e-16
acf(fit$residuals)
print(cbind(forecast_arima$pred,HW1.pred[,1])) #Comparison
## Time Series:
## Start = c(2013, 2)
## End = c(2013, 25)
## Frequency = 365
## forecast_arima$pred HW1.pred[, 1]
## 2013.003 2311.064 2300.898
## 2013.005 2523.005 2300.898
## 2013.008 2653.928 2300.898
## 2013.011 3556.856 2300.898
## 2013.014 4381.788 2300.898
## 2013.016 4803.724 2300.898
## 2013.019 3706.664 2300.898
## 2013.022 2656.607 2300.898
## 2013.025 3877.555 2300.898
## 2013.027 2455.507 2300.898
## 2013.030 4374.462 2300.898
## 2013.033 3490.422 2300.898
## 2013.036 2768.385 2300.898
## 2013.038 2585.353 2300.898
## 2013.041 2571.324 2300.898
## 2013.044 3207.299 2300.898
## 2013.047 3647.278 2300.898
## 2013.049 3562.261 2300.898
## 2013.052 3432.247 2300.898
## 2013.055 1569.238 2300.898
## 2013.058 2244.232 2300.898
## 2013.060 2698.230 2300.898
## 2013.063 4604.231 2300.898
## 2013.066 4534.237 2300.898
#FINDINGS In summary, SARIMA (1,0,2)(0,1,0)[365] fitted the data better as compared to simple exponential smoothing. We choose SARIMA (1,0,2)(0,1,0)[365] to fit for two reasons. 1) This model gave the least AICc (379) in comparison with all the models constructed with or without deseasonalized data. 2) Seasonality in this data is justified. So, instead of removing it we had to take some additional parameters (369). The proposed model showed that in the next 24 days there will be a need of approximately maximum 4700 and minimum 2300 bicycles. #Through graphical analysis we found; 1)Most people prefer to rent bicycles the most in summers (June-Oct) and the least in winters (Nov-Jan). As there exists 63% correlation between temperature and the rented bicycles. 2)The registered clients increases as the temperature increases by 54%. approximately 40% of the casual riders get registered and vice versa. 2)Seasonality in average temperature and count of rented bicycles (parabolic curve) is observed to be repeating periodically after one year. 3)Severe fluctuation in Casual riders is seen with no trend. 4)Windspeed in january is the highest. 5)Registered riders gradually increases from June onwards.
#CONCLUSION 1.In future, advance machine learning algorithms should be combined with ARIMA model to have an optimized hybrid model. ARIMA itself is insufficient to grab the trends exhibited by bike_rental_daily data. 2. A way to deal with leap year (2012) should be used. As, already existing techniques are redundant.
#NOTE 1. A trail has been made to meet all the requirement as directed by the guide. But the platform used for coding is unable to create interactive plot. To create them just specify “interactive=TRUE” in place of “interactive=FALSE”.