This RMarkdown file contains the report of the data analysis done for the project on forecasting daily bike rental demand using time series models in R. It contains analysis such as data exploration, summary statistics and building the time series models. The final report was completed on Sat Nov 30 16:20:25 2024.
Data Description:
This dataset contains the daily count of rental bike transactions between years 2011 and 2012 in Capital bikeshare system with the corresponding weather and seasonal information.
Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
Relevant Paper:
Fanaee-T, Hadi, and Gama, Joao. Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg
# Summary of the dataset
summary(data)
## instant dteday season yr
## Min. : 1.0 Length:731 Min. :1.000 Min. :0.0000
## 1st Qu.:183.5 Class :character 1st Qu.:2.000 1st Qu.:0.0000
## Median :366.0 Mode :character Median :3.000 Median :1.0000
## Mean :366.0 Mean :2.497 Mean :0.5007
## 3rd Qu.:548.5 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :731.0 Max. :4.000 Max. :1.0000
## mnth holiday weekday workingday
## Min. : 1.00 Min. :0.00000 Min. :0.000 Min. :0.000
## 1st Qu.: 4.00 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.000
## Median : 7.00 Median :0.00000 Median :3.000 Median :1.000
## Mean : 6.52 Mean :0.02873 Mean :2.997 Mean :0.684
## 3rd Qu.:10.00 3rd Qu.:0.00000 3rd Qu.:5.000 3rd Qu.:1.000
## Max. :12.00 Max. :1.00000 Max. :6.000 Max. :1.000
## weathersit temp atemp hum
## Min. :1.000 Min. :0.05913 Min. :0.07907 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.33708 1st Qu.:0.33784 1st Qu.:0.5200
## Median :1.000 Median :0.49833 Median :0.48673 Median :0.6267
## Mean :1.395 Mean :0.49538 Mean :0.47435 Mean :0.6279
## 3rd Qu.:2.000 3rd Qu.:0.65542 3rd Qu.:0.60860 3rd Qu.:0.7302
## Max. :3.000 Max. :0.86167 Max. :0.84090 Max. :0.9725
## windspeed casual registered cnt
## Min. :0.02239 Min. : 2.0 Min. : 20 Min. : 22
## 1st Qu.:0.13495 1st Qu.: 315.5 1st Qu.:2497 1st Qu.:3152
## Median :0.18097 Median : 713.0 Median :3662 Median :4548
## Mean :0.19049 Mean : 848.2 Mean :3656 Mean :4504
## 3rd Qu.:0.23321 3rd Qu.:1096.0 3rd Qu.:4776 3rd Qu.:5956
## Max. :0.50746 Max. :3410.0 Max. :6946 Max. :8714
# Select relevant columns
bike_data <- data %>%
select(dteday, cnt, temp, hum, windspeed) %>%
mutate(dteday = as.Date(dteday))
# Check data structure
str(bike_data)
## 'data.frame': 731 obs. of 5 variables:
## $ dteday : Date, format: "2011-01-01" "2011-01-02" ...
## $ cnt : int 985 801 1349 1562 1600 1606 1510 959 822 1321 ...
## $ temp : num 0.344 0.363 0.196 0.2 0.227 ...
## $ hum : num 0.806 0.696 0.437 0.59 0.437 ...
## $ windspeed: num 0.16 0.249 0.248 0.16 0.187 ...
# Plot daily bike rentals
ggplot(bike_data, aes(x = dteday, y = cnt)) +
geom_line(color = "blue") +
labs(title = "Daily Bike Rentals", x = "Date", y = "Count")
## Read about the timetk package
# ?timetk
# Convert to time series object
bike_ts <- bike_data %>%
select(dteday, cnt) %>%
tk_ts(start = c(2011, 1), freq = 365)
## Warning: Non-numeric columns being dropped: dteday
# Interactive plot
bike_data %>%
plot_time_series(dteday, cnt, .interactive = FALSE)
# Apply moving average for smoothing
bike_data <- bike_data %>%
mutate(cnt_smoothed = zoo::rollmean(cnt, k = 7, fill = NA))
# Plot smoothed data
ggplot(bike_data, aes(x = dteday)) +
geom_line(aes(y = cnt), color = "blue", alpha = 0.5) +
geom_line(aes(y = cnt_smoothed), color = "red") +
labs(title = "Smoothed Daily Bike Rentals", x = "Date", y = "Count")
## Warning: Removed 6 rows containing missing values (`geom_line()`).
# Decompose time series
decomp <- decompose(bike_ts, type = "multiplicative")
plot(decomp)
# Check stationarity using Augmented Dickey-Fuller Test
adf.test <- tseries::adf.test(bike_ts)
adf.test
##
## Augmented Dickey-Fuller Test
##
## data: bike_ts
## Dickey-Fuller = -1.6351, Lag order = 9, p-value = 0.7327
## alternative hypothesis: stationary
# Fit ARIMA model
arima_model <- auto.arima(bike_ts)
summary(arima_model)
## Series: bike_ts
## ARIMA(1,0,2)(0,1,0)[365] with drift
##
## Coefficients:
## ar1 ma1 ma2 drift
## 0.9586 -0.6363 -0.1892 5.7093
## s.e. 0.0283 0.0583 0.0506 0.7566
##
## sigma^2 = 1599566: log likelihood = -3131.76
## AIC=6273.52 AICc=6273.68 BIC=6293.03
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 5.357072 890.0137 457.0405 -44.28372 51.73145 0.1967752 0.01047273
# Forecast
forecasted <- forecast(arima_model, h = 30)
autoplot(forecasted) +
labs(title = "ARIMA Model Forecast", x = "Date", y = "Count")
rmse <- sqrt(mean((bike_ts - fitted(arima_model))^2, na.rm = TRUE))
rmse
## [1] 890.0137
The Root Mean Square Error (RMSE) for the ARIMA model was calculated as follows:
# Calculate RMSE for ARIMA model
rmse <- sqrt(mean((bike_ts - fitted(arima_model))^2, na.rm = TRUE))
rmse
## [1] 890.0137