About Data Analysis Report

This RMarkdown file contains the report of the data analysis done for the project on forecasting daily bike rental demand using time series models in R. It contains analysis such as data exploration, summary statistics and building the time series models. The final report was completed on Fri May 23 00:53:41 2025.

Data Description:

This dataset contains the daily count of rental bike transactions between years 2011 and 2012 in Capital bikeshare system with the corresponding weather and seasonal information.

Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset

Relevant Paper:

Fanaee-T, Hadi, and Gama, Joao. Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg

Load data and install packages

## Import required packages
# Load required packages
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(timetk)
library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(TSstudio)

# Load the bike sharing dataset
download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip", "bikes.zip")
unzip("bikes.zip")
bikes <- read.csv("day.csv", stringsAsFactors = FALSE)

# Inspect the data
glimpse(bikes)

## Rows: 731
## Columns: 16
## $ instant    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ dteday     <chr> "2011-01-01", "2011-01-02", "2011-01-03", "2011-01-04", "20…
## $ season     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ yr         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mnth       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ holiday    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ weekday    <int> 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4,…
## $ workingday <int> 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1,…
## $ weathersit <int> 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2,…
## $ temp       <dbl> 0.3441670, 0.3634780, 0.1963640, 0.2000000, 0.2269570, 0.20…
## $ atemp      <dbl> 0.3636250, 0.3537390, 0.1894050, 0.2121220, 0.2292700, 0.23…
## $ hum        <dbl> 0.805833, 0.696087, 0.437273, 0.590435, 0.436957, 0.518261,…
## $ windspeed  <dbl> 0.1604460, 0.2485390, 0.2483090, 0.1602960, 0.1869000, 0.08…
## $ casual     <int> 331, 131, 120, 108, 82, 88, 148, 68, 54, 41, 43, 25, 38, 54…
## $ registered <int> 654, 670, 1229, 1454, 1518, 1518, 1362, 891, 768, 1280, 122…
## $ cnt        <int> 985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 1321, 126…

summary(bikes)

##     instant         dteday              season            yr        
##  Min.   :  1.0   Length:731         Min.   :1.000   Min.   :0.0000  
##  1st Qu.:183.5   Class :character   1st Qu.:2.000   1st Qu.:0.0000  
##  Median :366.0   Mode  :character   Median :3.000   Median :1.0000  
##  Mean   :366.0                      Mean   :2.497   Mean   :0.5007  
##  3rd Qu.:548.5                      3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :731.0                      Max.   :4.000   Max.   :1.0000  
##       mnth          holiday           weekday        workingday   
##  Min.   : 1.00   Min.   :0.00000   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 4.00   1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.000  
##  Median : 7.00   Median :0.00000   Median :3.000   Median :1.000  
##  Mean   : 6.52   Mean   :0.02873   Mean   :2.997   Mean   :0.684  
##  3rd Qu.:10.00   3rd Qu.:0.00000   3rd Qu.:5.000   3rd Qu.:1.000  
##  Max.   :12.00   Max.   :1.00000   Max.   :6.000   Max.   :1.000  
##    weathersit         temp             atemp              hum        
##  Min.   :1.000   Min.   :0.05913   Min.   :0.07907   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:0.33708   1st Qu.:0.33784   1st Qu.:0.5200  
##  Median :1.000   Median :0.49833   Median :0.48673   Median :0.6267  
##  Mean   :1.395   Mean   :0.49538   Mean   :0.47435   Mean   :0.6279  
##  3rd Qu.:2.000   3rd Qu.:0.65542   3rd Qu.:0.60860   3rd Qu.:0.7302  
##  Max.   :3.000   Max.   :0.86167   Max.   :0.84090   Max.   :0.9725  
##    windspeed           casual         registered        cnt      
##  Min.   :0.02239   Min.   :   2.0   Min.   :  20   Min.   :  22  
##  1st Qu.:0.13495   1st Qu.: 315.5   1st Qu.:2497   1st Qu.:3152  
##  Median :0.18097   Median : 713.0   Median :3662   Median :4548  
##  Mean   :0.19049   Mean   : 848.2   Mean   :3656   Mean   :4504  
##  3rd Qu.:0.23321   3rd Qu.:1096.0   3rd Qu.:4776   3rd Qu.:5956  
##  Max.   :0.50746   Max.   :3410.0   Max.   :6946   Max.   :8714

Describe and explore the data

head(bikes)

##   instant     dteday season yr mnth holiday weekday workingday weathersit
## 1       1 2011-01-01      1  0    1       0       6          0          2
## 2       2 2011-01-02      1  0    1       0       0          0          2
## 3       3 2011-01-03      1  0    1       0       1          1          1
## 4       4 2011-01-04      1  0    1       0       2          1          1
## 5       5 2011-01-05      1  0    1       0       3          1          1
## 6       6 2011-01-06      1  0    1       0       4          1          1
##       temp    atemp      hum windspeed casual registered  cnt
## 1 0.344167 0.363625 0.805833 0.1604460    331        654  985
## 2 0.363478 0.353739 0.696087 0.2485390    131        670  801
## 3 0.196364 0.189405 0.437273 0.2483090    120       1229 1349
## 4 0.200000 0.212122 0.590435 0.1602960    108       1454 1562
## 5 0.226957 0.229270 0.436957 0.1869000     82       1518 1600
## 6 0.204348 0.233209 0.518261 0.0895652     88       1518 1606

colnames(bikes)

##  [1] "instant"    "dteday"     "season"     "yr"         "mnth"      
##  [6] "holiday"    "weekday"    "workingday" "weathersit" "temp"      
## [11] "atemp"      "hum"        "windspeed"  "casual"     "registered"
## [16] "cnt"

str(bikes)

## 'data.frame':    731 obs. of  16 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : chr  "2011-01-01" "2011-01-02" "2011-01-03" "2011-01-04" ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 0 1 2 3 4 5 6 0 1 ...
##  $ workingday: int  0 0 1 1 1 1 1 0 0 1 ...
##  $ weathersit: int  2 2 1 1 1 1 2 2 1 1 ...
##  $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
##  $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
##  $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
##  $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
##  $ casual    : int  331 131 120 108 82 88 148 68 54 41 ...
##  $ registered: int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
##  $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...

Create Interactive Time Series Plot

bikes$dteday <- as.Date(bikes$dteday)

# Create interactive time series plot
dev.new()
bikes %>%
  plot_time_series(
    .date_var = dteday,
    .value = cnt,
    .interactive = TRUE,
    .title = "Bike Rentals Over Time"
  )

Smooth time series data

bikes %>%
  plot_time_series(
    .date_var = dteday,
    .value = cnt,
    .smooth = TRUE,          # Adds a moving average line
    .smooth_period = 7,      # 7-day moving average
    .interactive = TRUE,
    .title = "Bike Rentals with 7-Day Moving Average"
  )

## Warning: There was 1 warning in `dplyr::mutate()`.
## ℹ In argument: `.value_smooth = auto_smooth(...)`.
## Caused by warning in `simpleLoess()`:
## ! k-d tree limited by memory. ncmax= 731

Decompose and assess the stationarity of time series data

# Decompose and assess the stationarity of time series data

# Load required package
library(tseries)

# Convert to time series with yearly frequency (365 days)
bikes_ts <- ts(bikes$cnt, frequency = 365)

# Decompose the time series to see trend, seasonality, and random parts
decomp <- decompose(bikes_ts)
plot(decomp)

# This plot shows:
# - Observed: the original data
# - Trend: the overall direction (going up or down)
# - Seasonal: the repeating pattern each year
# - Random: the leftover part after removing trend and seasonality

# Stationarity check - Augmented Dickey-Fuller test on original data
adf_result <- adf.test(bikes_ts)
print(adf_result)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  bikes_ts
## Dickey-Fuller = -1.6351, Lag order = 9, p-value = 0.7327
## alternative hypothesis: stationary

# If p-value > 0.05, the data is not stationary (has trends or patterns)

# Differencing the time series to remove trend
diff_bikes <- diff(bikes$cnt)
plot(diff_bikes, main = "Differenced Bike Rentals", ylab = "Differenced Count", xlab = "Time")

# Stationarity check on differenced data to confirm it’s now stationary
adf_diff_result <- adf.test(diff_bikes)

## Warning in adf.test(diff_bikes): p-value smaller than printed p-value

print(adf_diff_result)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff_bikes
## Dickey-Fuller = -13.798, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary

# If p-value < 0.05, the differenced data is stationary (good for ARIMA)

# ACF and PACF plots to understand patterns in the differenced data
acf(diff_bikes, main = "ACF of Differenced Bike Rentals")

# ACF shows how past values affect future ones
pacf(diff_bikes, main = "PACF of Differenced Bike Rentals")

# PACF helps decide the AR part of ARIMA

Fit and forecast time series data using ARIMA models

# Fit and forecast time series data using ARIMA models

#Fit the ARIMA model with chosen parameters
fit_arima <- arima(bikes_ts, order = c(2, 1, 1))  # Adjust p and q based on your ACF/PACF
summary(fit_arima)

## 
## Call:
## arima(x = bikes_ts, order = c(2, 1, 1))
## 
## Coefficients:
##          ar1      ar2      ma1
##       0.3617  -0.0404  -0.8809
## s.e.  0.0430   0.0405   0.0226
## 
## sigma^2 estimated as 854241:  log likelihood = -6021.46,  aic = 12050.92
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 12.04143 923.6189 645.4637 -44.17775 58.16843 0.8843462
##                      ACF1
## Training set -0.003108696

# Plot residuals
residuals_arima <- residuals(fit_arima)
plot(residuals_arima, main = "Residuals of ARIMA Model", ylab = "Residuals", xlab = "Time")

# Check if residuals are white noise (no patterns) using the Ljung-Box test
Box.test(residuals_arima, type = "Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  residuals_arima
## X-squared = 0.0070934, df = 1, p-value = 0.9329

# If p-value > 0.05, residuals are random (good model)
# Use first 700 days for training, last 31 days for testing
train <- bikes_ts[1:700]
test <- bikes_ts[701:731]

# Fit the ARIMA model on training data
fit_train <- arima(train, order = c(2, 1, 1))  # Use same p, d, q as above

# Forecast for the test period (31 days)
forecast_test <- forecast(fit_train, h = 31)

# Plot the forecast against actual test data
plot(forecast_test, main = "ARIMA Forecast vs Actual (Test Data)")
lines(test, col = "red")
legend("topleft", legend = c("Forecast", "Actual"), col = c("blue", "red"), lty = 1)

# forecast accuracy on test data
accuracy(forecast_test, test)

##                      ME      RMSE       MAE       MPE      MAPE      MASE
## Training set   34.30738  912.0027  638.8407 -44.63792  58.96331 0.8793965
## Test set     -810.22249 1945.1434 1494.0067 -88.70813 100.64437 2.0565758
##                     ACF1
## Training set -0.00645021
## Test set              NA

# Forecast 30 days into the future using the full dataset
forecast_future <- forecast(fit_arima, h = 30)
plot(forecast_future, main = "30-Day Forecast of Bike Rentals")

Findings and Conclusions

The bike rental data shows a clear upward trend and yearly seasonality (from the decomposition plot). I used an ARIMA(2,1,1) model after differencing the data to make it stationary (ADF test p-value < 0.05 after differencing). The model’s residuals are random (Box-Ljung test p-value = 0.9329), which means it captures the main patterns in the data. However, the forecast accuracy is poor: - Training MAPE is 58.96%, which is too high (should be < 20%). - Test MAPE is 100.64%, meaning the model doesn’t work well on new data. The 30-day forecast follows the trend but isn’t reliable for planning because of the high errors. Limitations: - The model doesn’t include factors like weather or holidays, which affect bike rentals. - It only captures yearly patterns, not weekly ones (e.g., more rentals on weekends). Future improvements: - Add weather and holiday data to the model. - Try a simpler model like ARIMA(1,1,1) or use weekly patterns (frequency = 7).

Forecast daily bike rental demand using time series models

Bedangshu Majumder

23/05/2025