About Data Analysis Report

This RMarkdown file contains the report of the data analysis carried out for the project on forecasting daily bike rental demand using time series models in R. It covers data exploration, summary statistics, and the building of time series models. The final report was completed on Tue Oct 31 09:00:36 2023.

Data Description:

This dataset contains the daily count of rental bike transactions in the Capital Bikeshare system for the years 2011 and 2012, together with the corresponding weather and seasonal information.

Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset

Relevant Paper:

Fanaee-T, Hadi, and Gama, Joao. "Event labeling combining ensemble detectors and background knowledge." Progress in Artificial Intelligence (2013): 1-15. Springer Berlin Heidelberg.

Task One: Load and explore the data

Load data and install packages

Import required packages

install.packages("timetk")
install.packages("lubridate")
install.packages("dplyr")
install.packages("ggpot2")
install.packages("tidyverse")
install.packages("TTR")
install.packages("tseries")
install.packages("corrplot")
install.packages("astsa")

Load the packages and data

library(tseries)
library(lubridate)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(timetk)
library(TTR)
library(corrplot)
library(astsa)
data(bike_sharing_daily)

Describe and explore the data

summary(bike_sharing_daily)
##     instant          dteday               season            yr        
##  Min.   :  1.0   Min.   :2011-01-01   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:183.5   1st Qu.:2011-07-02   1st Qu.:2.000   1st Qu.:0.0000  
##  Median :366.0   Median :2012-01-01   Median :3.000   Median :1.0000  
##  Mean   :366.0   Mean   :2012-01-01   Mean   :2.497   Mean   :0.5007  
##  3rd Qu.:548.5   3rd Qu.:2012-07-01   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :731.0   Max.   :2012-12-31   Max.   :4.000   Max.   :1.0000  
##       mnth          holiday           weekday        workingday   
##  Min.   : 1.00   Min.   :0.00000   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 4.00   1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.000  
##  Median : 7.00   Median :0.00000   Median :3.000   Median :1.000  
##  Mean   : 6.52   Mean   :0.02873   Mean   :2.997   Mean   :0.684  
##  3rd Qu.:10.00   3rd Qu.:0.00000   3rd Qu.:5.000   3rd Qu.:1.000  
##  Max.   :12.00   Max.   :1.00000   Max.   :6.000   Max.   :1.000  
##    weathersit         temp             atemp              hum        
##  Min.   :1.000   Min.   :0.05913   Min.   :0.07907   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:0.33708   1st Qu.:0.33784   1st Qu.:0.5200  
##  Median :1.000   Median :0.49833   Median :0.48673   Median :0.6267  
##  Mean   :1.395   Mean   :0.49538   Mean   :0.47435   Mean   :0.6279  
##  3rd Qu.:2.000   3rd Qu.:0.65542   3rd Qu.:0.60860   3rd Qu.:0.7302  
##  Max.   :3.000   Max.   :0.86167   Max.   :0.84090   Max.   :0.9725  
##    windspeed           casual         registered        cnt      
##  Min.   :0.02239   Min.   :   2.0   Min.   :  20   Min.   :  22  
##  1st Qu.:0.13495   1st Qu.: 315.5   1st Qu.:2497   1st Qu.:3152  
##  Median :0.18097   Median : 713.0   Median :3662   Median :4548  
##  Mean   :0.19049   Mean   : 848.2   Mean   :3656   Mean   :4504  
##  3rd Qu.:0.23321   3rd Qu.:1096.0   3rd Qu.:4776   3rd Qu.:5956  
##  Max.   :0.50746   Max.   :3410.0   Max.   :6946   Max.   :8714
ts_data <- bike_sharing_daily
mean_temp <- numeric(length = 4)     # Average temperature in the 4 seasons
median_temp <- numeric(length = 4)
for (i in 1:4) {
  mean_temp[i] <- mean(ts_data$temp[ts_data$season == i])
  median_temp[i] <- median(ts_data$temp[ts_data$season == i])
}
print(mean_temp)
## [1] 0.2977475 0.5444052 0.7063093 0.4229060
print(median_temp)
## [1] 0.2858330 0.5620835 0.7145830 0.4091665
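The same summaries can be computed without an explicit loop; an equivalent vectorized form using base R's tapply:

tapply(ts_data$temp, ts_data$season, mean)     # per-season mean temperature
tapply(ts_data$temp, ts_data$season, median)   # per-season median temperature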
temp_rent <- data.frame(ts_data$temp, ts_data$atemp, ts_data$cnt)   # Explore correlations between temperature and rentals
cor(temp_rent)
##               ts_data.temp ts_data.atemp ts_data.cnt
## ts_data.temp     1.0000000     0.9917016   0.6274940
## ts_data.atemp    0.9917016     1.0000000   0.6310657
## ts_data.cnt      0.6274940     0.6310657   1.0000000
corrplot(cor(temp_rent))

temp_reg_casual <- data.frame(ts_data$temp, ts_data$registered, ts_data$casual)
cor(temp_reg_casual)
##                    ts_data.temp ts_data.registered ts_data.casual
## ts_data.temp          1.0000000          0.5400120      0.5432847
## ts_data.registered    0.5400120          1.0000000      0.3952825
## ts_data.casual        0.5432847          0.3952825      1.0000000
corrplot(cor(temp_reg_casual))

mean_cnt <- numeric(length = 12)   # Mean number of bikes rented per month
for (i in 1:12) {
  mean_cnt[i] <- mean(ts_data$cnt[ts_data$mnth == i])
}
print(mean_cnt)
##  [1] 2176.339 2655.298 3692.258 4484.900 5349.774 5772.367 5563.677 5664.419
##  [9] 5766.517 5199.226 4247.183 3403.806
ggplot(ts_data, aes(x = factor(season), y = temp, fill = factor(season))) +  
  geom_boxplot() +
  scale_fill_manual(values = c("#E41A1C", "#377EB8", "#4DAF4A", "#FF7F00")) +
  labs(x = "Season", y = "Temperature") +
  theme_minimal()   # Visualize how temperature changes with the seasons

ggplot(ts_data, aes(x = factor(season), y = temp, fill = factor(season))) +
  stat_summary(fun = "mean", geom = "bar", position = "dodge") +
  scale_fill_manual(values = c("#E41A1C", "#377EB8", "#4DAF4A", "#FF7F00")) +
  labs(x = "Season", y = "Mean Temperature") +
  theme_minimal()

Task Two: Create interactive time series plots

ts_data %>% 
  plot_time_series(dteday, cnt, 
                   .interactive = FALSE,            
                   .plotly_slider = TRUE)  ## Using timetk to explore time series visually

ts_data %>%
  group_by(yr) %>%
  plot_time_series(dteday, cnt,
                   .color_var = season, .interactive = FALSE)

ts_data %>%
  group_by(yr) %>%
  plot_time_series(dteday, hum,
                   .color_var = season, .interactive = FALSE)

ts_data %>%
  group_by(yr) %>%
  plot_time_series(dteday, atemp,
                   .color_var = season, .interactive = FALSE)

ts_data %>%
  group_by(yr) %>%
  plot_time_series(dteday, windspeed,
                   .color_var = season, .interactive = FALSE)

ts_data %>%
  group_by(yr) %>%
  plot_time_series(dteday, casual,
                   .color_var = season, .interactive = FALSE)

ts_data %>%
  group_by(yr) %>%
  plot_time_series(dteday, registered,
                   .color_var = season, .interactive = FALSE)

Since the objective of this project is to forecast the number of bikes rented in the future, the focus will be on "cnt", which covers both casual and registered clients.
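A quick sanity check (an addition, not in the original code) confirms that cnt is simply the sum of the two rider types:

all(ts_data$cnt == ts_data$casual + ts_data$registered)   # expected TRUE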

# Check for seasonality with seasonal diagnostics
ts_data %>%
  plot_seasonal_diagnostics(dteday, cnt, .interactive = FALSE)   # the seasonal pattern repeats year after year

ts_data %>%
  plot_anomaly_diagnostics(dteday, cnt,
                           .message = FALSE,
                           .facet_ncol = 3,
                           .ribbon_alpha = 0.25,
                           .interactive = FALSE)   # Anomaly detection

ts_data %>%
  mutate(year = year(dteday)) %>%
  plot_time_series(dteday, cnt,
                   .facet_vars     = c(year),   # add groups/facets
                   .color_var      = year,      # color by year
                   .facet_ncol     = 2,
                   .facet_scales   = "free",
                   .facet_collapse = TRUE,      # combine group strip text into one line
                   .interactive    = FALSE)

ts_data %>%
  plot_time_series(dteday, cnt,
                   .color_var = year(dteday),   # returns a static ggplot
                   .interactive = FALSE,
                   .title = "Bike rental data",
                   .x_lab = "Date",
                   .y_lab = "Count of bikes rented",
                   .color_lab = "Year") +
  scale_y_continuous(labels = scales::comma_format())

Task Three: Smooth time series data

library(tseries)
library(forecast)
## 
## Attaching package: 'forecast'
## The following object is masked from 'package:astsa':
## 
##     gas
count1 <- ts(ts_data$cnt, start = c(2011, 1), frequency = 365)   # create a univariate daily time series
length(count1)
## [1] 731
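Daily data often carries a weekly cycle on top of the annual one; forecast::msts can encode multiple seasonal periods. A sketch of that alternative representation (not used in the rest of this report):

count_m <- msts(ts_data$cnt, seasonal.periods = c(7, 365.25))   # weekly + annual seasonality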
hist(count1, freq = FALSE)
lines(density(count1), lwd = 2, col = 'red')
g <- seq(min(count1), max(count1), length = 731)
h <- dnorm(g, mean = mean(count1), sd = sd(count1))
lines(g, h, col = "blue", lwd = 2)

# Clean outliers
count <- tsclean(count1)
hist(count, freq = FALSE)
lines(density(count), lwd = 2, col = 'red')
k <- seq(min(count), max(count), length = 731)
l <- dnorm(k, mean = mean(count), sd = sd(count))
lines(k, l, col = "blue", lwd = 2)
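
A quick check (an addition, not in the original code) shows how many observations tsclean() adjusted:

sum(count != count1)   # number of points replaced by tsclean()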

# Simple exponential smoothing
HW1 <- HoltWinters(count, beta = FALSE, gamma = FALSE)
forecast_s.exp <- predict(HW1, n.ahead = 24)
forecast_s.exp
## Time Series:
## Start = c(2013, 2) 
## End = c(2013, 25) 
## Frequency = 365 
##            fit
##  [1,] 2300.898
##  [2,] 2300.898
##  [3,] 2300.898
##  [4,] 2300.898
##  [5,] 2300.898
##  [6,] 2300.898
##  [7,] 2300.898
##  [8,] 2300.898
##  [9,] 2300.898
## [10,] 2300.898
## [11,] 2300.898
## [12,] 2300.898
## [13,] 2300.898
## [14,] 2300.898
## [15,] 2300.898
## [16,] 2300.898
## [17,] 2300.898
## [18,] 2300.898
## [19,] 2300.898
## [20,] 2300.898
## [21,] 2300.898
## [22,] 2300.898
## [23,] 2300.898
## [24,] 2300.898
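The flat forecast is expected: with beta = FALSE and gamma = FALSE, HoltWinters() reduces to simple exponential smoothing, whose forecast at every horizon equals the final smoothed level. The estimated parameters can be inspected directly:

HW1$alpha          # estimated smoothing parameter
HW1$coefficients   # final level 'a', i.e. the constant forecast value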
plot(HW1) #Visually evaluate the fit

plot(count, ylab="rented bikes", xlim=c(2011,2013))
lines(HW1$fitted[,1], lty=2, col="blue")

HW1.pred <- predict(HW1, 24, prediction.interval = TRUE, level=0.95)
#Visually evaluate the prediction
plot(count, ylab="rented bikes", xlim=c(2011,2013))
lines(HW1$fitted[,1], lty=2, col="blue")
lines(HW1.pred[,1], col="red")
lines(HW1.pred[,2], lty=2, col="orange")
lines(HW1.pred[,3], lty=2, col="orange")

HW1_for <- forecast(HW1, h = 24, level = c(80, 95))
HW1_for$residuals <- na.omit(HW1_for$residuals)
# Visualize the forecast, then check stationarity, autocorrelation and normality of the forecast errors
plot(HW1_for, xlim = c(2011, 2013))
lines(HW1_for$fitted, lty = 2, col = "purple")

acf(HW1_for$residuals, na.action=na.pass)

Box.test(HW1_for$residuals, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  HW1_for$residuals
## X-squared = 24.108, df = 1, p-value = 9.109e-07
hist(HW1_for$residuals, freq = FALSE)
lines(density(HW1_for$residuals), lwd = 2, col = 'red')
y <- seq(min(HW1_for$residuals), max(HW1_for$residuals), length = 722)
z <- dnorm(y, mean = mean(HW1_for$residuals), sd = sd(HW1_for$residuals))
lines(y, z, col = "blue", lwd = 2)

shapiro.test(HW1_for$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  HW1_for$residuals
## W = 0.94856, p-value = 2.98e-15
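
Taken together, the Ljung-Box test (p = 9.1e-07) and the Shapiro-Wilk test (p = 2.98e-15) indicate that the SES residuals are both autocorrelated and non-normal, so the flat SES forecast and its prediction intervals should be treated with caution.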

Task Four: Decompose and assess the stationarity of time series data

adf.test(count)   # Check the stationarity of the original time series
## 
##  Augmented Dickey-Fuller Test
## 
## data:  count
## Dickey-Fuller = -1.4435, Lag order = 9, p-value = 0.8138
## alternative hypothesis: stationary
decomp <- decompose(count)   # a clear trend is visible
plot(decomp)

adf.test(diff(count))   # check the stationarity of the differenced time series
## Warning in adf.test(diff(count)): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(count)
## Dickey-Fuller = -13.298, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
# Seasonal decomposition
fit <- stl(count, s.window = "periodic")
plot(fit)

plot(fit$time.series[, 1], ylab = "Seasonal component of rentals",   # both trend and seasonality are observed
     xlab = "Day of the Year")

checkresiduals(fit$time.series[, 3])   # diagnostics on the STL remainder

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 1841, df = 146, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 146
shapiro.test(fit$time.series[, 3])   # normality of the STL remainder
## 
##  Shapiro-Wilk normality test
## 
## data:  fit$time.series[, 3]
## W = 0.9989, p-value = 0.9439
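So the STL remainder looks approximately normal (Shapiro-Wilk p = 0.94) yet is still strongly autocorrelated (Ljung-Box p < 2.2e-16), meaning some structure remains unmodelled.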
trend <- SMA(count, n = 10)   # estimate the trend with a 10-day simple moving average
deseasonalized_detrended_data <- count - trend   # remove the smoothed trend
deseasonalized_detrended_data <- na.omit(deseasonalized_detrended_data)
plot(deseasonalized_detrended_data)

adf.test(deseasonalized_detrended_data)   #checking stationarity
## Warning in adf.test(deseasonalized_detrended_data): p-value smaller than
## printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  deseasonalized_detrended_data
## Dickey-Fuller = -11.154, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
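
The forecast package offers a quick cross-check of the differencing order these tests suggest; a small sketch (assuming forecast is loaded, as above):

ndiffs(count)   # unit-root based estimate of the number of differences needed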

Task Five: Fit and forecast time series data using ARIMA models

# Fit a SARIMA model: a plain ARIMA on the deseasonalized data gave a much larger AIC
fit_arima <- auto.arima(count, trace = TRUE)
## 
##  Fitting models using approximations to speed things up...
## 
##  ARIMA(2,0,2)(0,1,0)[365] with drift         : 381.8691
##  ARIMA(0,0,0)(0,1,0)[365] with drift         : 494.9632
##  ARIMA(1,0,0)(0,1,0)[365] with drift         : 397.904
##  ARIMA(0,0,1)(0,1,0)[365] with drift         : 427.4676
##  ARIMA(0,0,0)(0,1,0)[365]                    : 1056.341
##  ARIMA(1,0,2)(0,1,0)[365] with drift         : 379.6846
##  ARIMA(0,0,2)(0,1,0)[365] with drift         : 404.7834
##  ARIMA(1,0,1)(0,1,0)[365] with drift         : 389.6037
##  ARIMA(1,0,3)(0,1,0)[365] with drift         : 379.227
##  ARIMA(0,0,3)(0,1,0)[365] with drift         : 401.2876
##  ARIMA(2,0,3)(0,1,0)[365] with drift         : Inf
##  ARIMA(1,0,4)(0,1,0)[365] with drift         : 380.5488
##  ARIMA(0,0,4)(0,1,0)[365] with drift         : 402.943
##  ARIMA(2,0,4)(0,1,0)[365] with drift         : 383.8164
##  ARIMA(1,0,3)(0,1,0)[365]                    : Inf
## 
##  Now re-fitting the best model(s) without approximations...
## 
##  ARIMA(1,0,3)(0,1,0)[365] with drift         : 6097.864
## 
##  Best model: ARIMA(1,0,3)(0,1,0)[365] with drift
# Proceed with the slightly simpler ARIMA(1,0,2)(0,1,0)[365]
fit <- arima(count, order = c(1, 0, 2), seasonal = list(order = c(0, 1, 0), period = 365))
predict(fit, n.ahead = 24)   # forecast the next 24 days
## $pred
## Time Series:
## Start = c(2013, 2) 
## End = c(2013, 25) 
## Frequency = 365 
##  [1] 2311.064 2523.005 2653.928 3556.856 4381.788 4803.724 3706.664 2656.607
##  [9] 3877.555 2455.507 4374.462 3490.422 2768.385 2585.353 2571.324 3207.299
## [17] 3647.278 3562.261 3432.247 1569.238 2244.232 2698.230 4604.231 4534.237
## 
## $se
## Time Series:
## Start = c(2013, 2) 
## End = c(2013, 25) 
## Frequency = 365 
##  [1]  994.0721 1056.4824 1068.9631 1081.2079 1093.2255 1105.0239 1116.6107
##  [8] 1127.9931 1139.1779 1150.1713 1160.9796 1171.6083 1182.0630 1192.3488
## [15] 1202.4706 1212.4331 1222.2407 1231.8977 1241.4081 1250.7759 1260.0046
## [22] 1269.0979 1278.0591 1286.8916
forecast_arima <- predict(fit, n.ahead = 24)
plot(forecast(fit, 24))

# Residual analysis
hist(fit$residuals, freq = FALSE, xlab = "Residual", ylab = "Density",
     main = "Histogram of Model Errors - Bike Count")
lines(density(fit$residuals), lwd = 2, col = 'red')
x <- seq(min(fit$residuals), max(fit$residuals), length = 722)
f <- dnorm(x, mean = mean(fit$residuals), sd = sd(fit$residuals))
lines(x, f, col = "blue", lwd = 2)

adf.test(fit$residuals)
## Warning in adf.test(fit$residuals): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  fit$residuals
## Dickey-Fuller = -8.0939, Lag order = 9, p-value = 0.01
## alternative hypothesis: stationary
shapiro.test(fit$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit$residuals
## W = 0.83833, p-value < 2.2e-16
acf(fit$residuals)       

print(cbind(forecast_arima$pred, HW1.pred[, 1]))   # Compare the SARIMA and SES forecasts
## Time Series:
## Start = c(2013, 2) 
## End = c(2013, 25) 
## Frequency = 365 
##          forecast_arima$pred HW1.pred[, 1]
## 2013.003            2311.064      2300.898
## 2013.005            2523.005      2300.898
## 2013.008            2653.928      2300.898
## 2013.011            3556.856      2300.898
## 2013.014            4381.788      2300.898
## 2013.016            4803.724      2300.898
## 2013.019            3706.664      2300.898
## 2013.022            2656.607      2300.898
## 2013.025            3877.555      2300.898
## 2013.027            2455.507      2300.898
## 2013.030            4374.462      2300.898
## 2013.033            3490.422      2300.898
## 2013.036            2768.385      2300.898
## 2013.038            2585.353      2300.898
## 2013.041            2571.324      2300.898
## 2013.044            3207.299      2300.898
## 2013.047            3647.278      2300.898
## 2013.049            3562.261      2300.898
## 2013.052            3432.247      2300.898
## 2013.055            1569.238      2300.898
## 2013.058            2244.232      2300.898
## 2013.060            2698.230      2300.898
## 2013.063            4604.231      2300.898
## 2013.066            4534.237      2300.898
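
Comparing the two forecasts against each other says little about out-of-sample accuracy. A holdout evaluation would quantify it; a minimal sketch, assuming the final 24 observations are reserved as a test set (refitting the SARIMA with period 365 is slow, so it is left commented out):

h <- 24
n <- length(count)
train <- ts(head(count, n - h), start = start(count), frequency = frequency(count))
test  <- tail(count, h)

ses_fc <- forecast(HoltWinters(train, beta = FALSE, gamma = FALSE), h = h)
accuracy(ses_fc, test)   # SES error on the 24-day holdout

# sarima_fit <- arima(train, order = c(1, 0, 2),
#                     seasonal = list(order = c(0, 1, 0), period = 365))
# accuracy(as.numeric(predict(sarima_fit, n.ahead = h)$pred), test)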

Task Six: Findings and Conclusions

Findings

In summary, SARIMA(1,0,2)(0,1,0)[365] fitted the data better than simple exponential smoothing. We chose this model for two reasons: (1) it gave the lowest approximate AICc (about 379) of all the models constructed, with or without deseasonalized data; (2) the seasonality in this data is genuine, so instead of removing it we accounted for it with an additional seasonal term. The proposed model forecasts that over the next 24 days daily demand will range from roughly 1,600 to 4,800 bicycles.

Through graphical analysis we found:

1. People rent bicycles most in summer (June-October) and least in winter (November-January); temperature and rental count are about 63% correlated.
2. Registered ridership rises with temperature (about 54% correlation), and the casual and registered counts are themselves about 40% correlated.
3. The seasonal pattern in average temperature and in the count of rented bicycles (a parabolic curve) repeats year after year.
4. Casual ridership fluctuates severely, with no clear trend.
5. Windspeed is highest in January.
6. Registered ridership increases gradually from June onwards.

Conclusion

1. In the future, advanced machine learning algorithms should be combined with the ARIMA model to build an optimized hybrid model; ARIMA by itself cannot capture all the patterns exhibited by the bike_sharing_daily data.
2. A way to deal with the leap year (2012) is needed, as the existing techniques are inadequate.

Note

1. An attempt has been made to meet all the requirements as directed by the guide, but the platform used for coding cannot render interactive plots. To create them, set .interactive = TRUE in place of .interactive = FALSE.