Bike Daily Rental Demand Forecasting

Author

Pranish Shinde

Libraries Setup

library(tidyverse) # collection of data manipulation and visualization packages
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
Warning: package 'tibble' was built under R version 4.5.2
Warning: package 'tidyr' was built under R version 4.5.2
Warning: package 'readr' was built under R version 4.5.2
Warning: package 'purrr' was built under R version 4.5.2
Warning: package 'dplyr' was built under R version 4.5.2
Warning: package 'stringr' was built under R version 4.5.2
Warning: package 'forcats' was built under R version 4.5.2
Warning: package 'lubridate' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr) # data manipulation verbs (filter, mutate, group_by, etc.)
library(tseries) # time series analysis tools including ADF stationarity test
Warning: package 'tseries' was built under R version 4.5.2
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
library(lubridate) # simplifies working with date/time objects
library(timetk) # time series visualization and feature engineering
Warning: package 'timetk' was built under R version 4.5.2
library(ggplot2) # data visualization (also loaded via tidyverse, explicit for clarity)
library(forecast) # provides ARIMA modeling and forecasting tools
Warning: package 'forecast' was built under R version 4.5.2

Load Datasets

# Read in the daily and hourly bike-sharing CSVs from local disk.
daily_data  <- read.csv("C:/Users/Administrator/Downloads/bike data daily.csv")
hourly_data <- read.csv("C:/Users/Administrator/Downloads/bike data hour.csv")

# Quick structural overview: column names, data types, and sample values
str(daily_data)
'data.frame':   731 obs. of  16 variables:
 $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ dteday    : chr  "2011-01-01" "2011-01-02" "2011-01-03" "2011-01-04" ...
 $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ weekday   : int  6 0 1 2 3 4 5 6 0 1 ...
 $ workingday: int  0 0 1 1 1 1 1 0 0 1 ...
 $ weathersit: int  2 2 1 1 1 1 2 2 1 1 ...
 $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
 $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
 $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
 $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
 $ casual    : int  331 131 120 108 82 88 148 68 54 41 ...
 $ registered: int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
 $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...
# Preview the first 6 rows of daily data
head(daily_data)
  instant     dteday season yr mnth holiday weekday workingday weathersit
1       1 2011-01-01      1  0    1       0       6          0          2
2       2 2011-01-02      1  0    1       0       0          0          2
3       3 2011-01-03      1  0    1       0       1          1          1
4       4 2011-01-04      1  0    1       0       2          1          1
5       5 2011-01-05      1  0    1       0       3          1          1
6       6 2011-01-06      1  0    1       0       4          1          1
      temp    atemp      hum windspeed casual registered  cnt
1 0.344167 0.363625 0.805833 0.1604460    331        654  985
2 0.363478 0.353739 0.696087 0.2485390    131        670  801
3 0.196364 0.189405 0.437273 0.2483090    120       1229 1349
4 0.200000 0.212122 0.590435 0.1602960    108       1454 1562
5 0.226957 0.229270 0.436957 0.1869000     82       1518 1600
6 0.204348 0.233209 0.518261 0.0895652     88       1518 1606
# Open both datasets for manual inspection
view(daily_data)
view(hourly_data)

Advanced Data Wrangling

# Convert the date column from character/factor to a proper Date object
# so it can be used correctly in time series plots and models
daily_data[,"dteday"] <- as.Date(daily_data[,"dteday"])

# Normalize total bike rental count to [0, 1] range by dividing by its maximum
# This makes comparisons across variables on different scales easier
daily_data[,"ncnt"] <- daily_data[,"cnt"] / max(daily_data[,"cnt"])

# Normalize the number of registered (non-casual) users the same way
daily_data[,"nr"] <- daily_data[,"registered"] / max(daily_data[,"registered"])

# Compute the rental-to-registration ratio:
# How many rentals occurred relative to the maximum registered user count
# Values > 1 would mean more rentals than registered users (possible with casual riders)
daily_data[,"rr"] <- daily_data[,"cnt"] / max(daily_data[,"registered"])

# Summary statistics for all columns to check ranges and spot any anomalies
summary(daily_data)
    instant          dteday               season            yr        
 Min.   :  1.0   Min.   :2011-01-01   Min.   :1.000   Min.   :0.0000  
 1st Qu.:183.5   1st Qu.:2011-07-02   1st Qu.:2.000   1st Qu.:0.0000  
 Median :366.0   Median :2012-01-01   Median :3.000   Median :1.0000  
 Mean   :366.0   Mean   :2012-01-01   Mean   :2.497   Mean   :0.5007  
 3rd Qu.:548.5   3rd Qu.:2012-07-01   3rd Qu.:3.000   3rd Qu.:1.0000  
 Max.   :731.0   Max.   :2012-12-31   Max.   :4.000   Max.   :1.0000  
      mnth          holiday           weekday        workingday   
 Min.   : 1.00   Min.   :0.00000   Min.   :0.000   Min.   :0.000  
 1st Qu.: 4.00   1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.000  
 Median : 7.00   Median :0.00000   Median :3.000   Median :1.000  
 Mean   : 6.52   Mean   :0.02873   Mean   :2.997   Mean   :0.684  
 3rd Qu.:10.00   3rd Qu.:0.00000   3rd Qu.:5.000   3rd Qu.:1.000  
 Max.   :12.00   Max.   :1.00000   Max.   :6.000   Max.   :1.000  
   weathersit         temp             atemp              hum        
 Min.   :1.000   Min.   :0.05913   Min.   :0.07907   Min.   :0.0000  
 1st Qu.:1.000   1st Qu.:0.33708   1st Qu.:0.33784   1st Qu.:0.5200  
 Median :1.000   Median :0.49833   Median :0.48673   Median :0.6267  
 Mean   :1.395   Mean   :0.49538   Mean   :0.47435   Mean   :0.6279  
 3rd Qu.:2.000   3rd Qu.:0.65542   3rd Qu.:0.60860   3rd Qu.:0.7302  
 Max.   :3.000   Max.   :0.86167   Max.   :0.84090   Max.   :0.9725  
   windspeed           casual         registered        cnt      
 Min.   :0.02239   Min.   :   2.0   Min.   :  20   Min.   :  22  
 1st Qu.:0.13495   1st Qu.: 315.5   1st Qu.:2497   1st Qu.:3152  
 Median :0.18097   Median : 713.0   Median :3662   Median :4548  
 Mean   :0.19049   Mean   : 848.2   Mean   :3656   Mean   :4504  
 3rd Qu.:0.23321   3rd Qu.:1096.0   3rd Qu.:4776   3rd Qu.:5956  
 Max.   :0.50746   Max.   :3410.0   Max.   :6946   Max.   :8714  
      ncnt                nr                 rr          
 Min.   :0.002525   Min.   :0.002879   Min.   :0.003167  
 1st Qu.:0.361717   1st Qu.:0.359487   1st Qu.:0.453786  
 Median :0.521919   Median :0.527210   Median :0.654765  
 Mean   :0.516909   Mean   :0.526371   Mean   :0.648481  
 3rd Qu.:0.683498   3rd Qu.:0.687662   3rd Qu.:0.857472  
 Max.   :1.000000   Max.   :1.000000   Max.   :1.254535  

Exploratory Visualization: Temperature

# Plot normalized temperature over time, grouped by year (yr) and colored by season.
# Grouping by year creates two faceted panels for easier year-over-year comparison.

daily_data %>%
  group_by(yr) %>%
  plot_time_series(
    dteday, temp,
    .color_var  = season,
    .x_lab      = "Date",
    .y_lab      = "Normalized Temperature",
    .title      = "Normalized Temperature vs Date for Daily Data",
    .interactive = TRUE   # renders as an interactive plotly chart
  )
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the timetk package.
  Please report the issue at
  <https://github.com/business-science/timetk/issues>.

Exploratory Visualization: Humidity

# Same faceted time series plot for normalized humidity.
# Humidity shows more noise/variability than temperature.

daily_data %>%
  group_by(yr) %>%
  plot_time_series(
    dteday, hum,
    .color_var  = season,
    .x_lab      = "Date",
    .y_lab      = "Normalized Humidity",
    .title      = "Normalized Humidity vs Date for Daily Data",
    .interactive = TRUE
  )

Exploratory Visualization: Wind Speed

# Wind speed is also quite noisy; plotted here for completeness before deciding whether to include it in the model.

daily_data %>%
  group_by(yr) %>%
  plot_time_series(
    dteday, windspeed,
    .color_var  = season,
    .x_lab      = "Date",
    .y_lab      = "Normalized Windspeed",
    .title      = "Windspeed vs Date for Daily Data",
    .interactive = TRUE
  )

Exploratory Visualization: Normalized Bike Rentals (Total Counts)

# Plotting ncnt (normalized total rentals) reveals the seasonal rental pattern and overall growth trend across the two years.

daily_data %>%
  group_by(yr) %>%
  plot_time_series(
    dteday, ncnt,
    .color_var  = season,
    .x_lab      = "Date",
    .y_lab      = "Normalized Bike Rentals",
    .title      = "Normalized Bike Rentals vs Date for Daily Data",
    .interactive = TRUE
  )

Exploratory Visualization: Normalized Registered Rentals

# nr = normalized registered-user rentals; shows how the registered user base contributes to the overall rental trend.

daily_data %>%
  group_by(yr) %>%
  plot_time_series(
    dteday, nr,
    .color_var  = season,
    .x_lab      = "Date",
    .y_lab      = "Normalized Registered Rentals",
    .title      = "Normalized Registered Rentals vs Date for Daily Data",
    .interactive = TRUE
  )

Exploratory Visualization: Rental-to-Registration Ratio

# rr = total rentals / max registered users.
# Tracks how heavily the system is used relative to its registered user ceiling.

daily_data %>%
  group_by(yr) %>%
  plot_time_series(
    dteday, rr,
    .color_var  = season,
    .x_lab      = "Date",
    .y_lab      = "Ratio",
    .title      = "Ratio of Rentals to Registration vs Date for Daily Data",
    .interactive = TRUE
  )

Data Cleaning: Outlier Removal via tsclean()

# Humidity and wind speed exhibited excessive noise, so they are dropped from further analysis.
# The four variables below are cleaned using tsclean(), which identifies and replaces outliers and missing values with linearly interpolated values, making the series more suitable for ARIMA modeling.

daily_data[,"temp"]  <- tsclean(daily_data[,"temp"])   # normalized temperature
daily_data[,"ncnt"]  <- tsclean(daily_data[,"ncnt"])   # normalized total rentals
daily_data[,"nr"]    <- tsclean(daily_data[,"nr"])     # normalized registered rentals
daily_data[,"rr"]    <- tsclean(daily_data[,"rr"])     # rental-to-registration ratio

# Confirm the cleaned values look sensible
head(daily_data)
  instant     dteday season yr mnth holiday weekday workingday weathersit
1       1 2011-01-01      1  0    1       0       6          0          2
2       2 2011-01-02      1  0    1       0       0          0          2
3       3 2011-01-03      1  0    1       0       1          1          1
4       4 2011-01-04      1  0    1       0       2          1          1
5       5 2011-01-05      1  0    1       0       3          1          1
6       6 2011-01-06      1  0    1       0       4          1          1
      temp    atemp      hum windspeed casual registered  cnt       ncnt
1 0.344167 0.363625 0.805833 0.1604460    331        654  985 0.11303649
2 0.363478 0.353739 0.696087 0.2485390    131        670  801 0.09192105
3 0.196364 0.189405 0.437273 0.2483090    120       1229 1349 0.15480835
4 0.200000 0.212122 0.590435 0.1602960    108       1454 1562 0.17925178
5 0.226957 0.229270 0.436957 0.1869000     82       1518 1600 0.18361258
6 0.204348 0.233209 0.518261 0.0895652     88       1518 1606 0.18430112
          nr        rr
1 0.09415491 0.1418082
2 0.09645839 0.1153182
3 0.17693637 0.1942125
4 0.20932911 0.2248776
5 0.21854305 0.2303484
6 0.21854305 0.2312122

Post-Cleaning Visualization: Temperature

# Re-plot temperature after tsclean() to confirm that outliers have been removed and the series is smoother.
daily_data %>%
  group_by(yr) %>%
  plot_time_series(
    dteday, temp,
    .color_var  = season,
    .x_lab      = "Date",
    .y_lab      = "Normalized Temperature",
    .title      = "Normalized Temperature vs Date for Daily Data",
    .interactive = TRUE
  )

Post-Cleaning Visualization: Normalized Bike Rentals

daily_data %>%
  group_by(yr) %>%
  plot_time_series(
    dteday, ncnt,
    .color_var  = season,
    .x_lab      = "Date",
    .y_lab      = "Normalized Bike Rentals",
    .title      = "Normalized Bike Rentals vs Date for Daily Data",
    .interactive = TRUE
  )

Post-Cleaning Visualization: Normalized Registered Rentals

daily_data %>%
  group_by(yr) %>%
  plot_time_series(
    dteday, nr,
    .color_var  = season,
    .x_lab      = "Date",
    .y_lab      = "Normalized Registered Rentals",
    .title      = "Normalized Registered Rentals vs Date for Daily Data",
    .interactive = TRUE
  )

Post-Cleaning Visualization: Rental-to-Registration Ratio

daily_data %>%
  group_by(yr) %>%
  plot_time_series(
    dteday, rr,
    .color_var  = season,
    .x_lab      = "Date",
    .y_lab      = "Normalized Registered Rentals",
    .title      = "Ratio of Rentals to Registration vs Date for Daily Data",
    .interactive = TRUE
  )

Stationarity Testing: Augmented Dickey-Fuller (ADF) Test

# The ADF test checks whether a time series is stationary (i.e., has no unit root).

# H0: the series has a unit root (non-stationary)
# H1: the series is stationary

# A p-value < 0.05 allows us to reject H0 and conclude the series is stationary,
# which is a prerequisite for ARIMA modeling.

# Test stationarity of the cleaned temperature series
daily_data[,"temp"] %>% adf.test()

    Augmented Dickey-Fuller Test

data:  .
Dickey-Fuller = -1.6785, Lag order = 9, p-value = 0.7144
alternative hypothesis: stationary

ADF test for normalized total bike rental count

daily_data[,"ncnt"] %>% adf.test()

    Augmented Dickey-Fuller Test

data:  .
Dickey-Fuller = -1.3084, Lag order = 9, p-value = 0.871
alternative hypothesis: stationary

ADF test for normalized registered rentals

daily_data[,"nr"] %>% adf.test()

    Augmented Dickey-Fuller Test

data:  .
Dickey-Fuller = -2.418, Lag order = 9, p-value = 0.4014
alternative hypothesis: stationary

ADF test for rental-to-registration ratio

daily_data[,"rr"] %>% adf.test()

    Augmented Dickey-Fuller Test

data:  .
Dickey-Fuller = -1.3084, Lag order = 9, p-value = 0.871
alternative hypothesis: stationary

Seasonal and Trend decomposition using Loess

# STL (Seasonal and Trend decomposition using Loess) splits a time series into:
#   [,1] seasonal  – repeating within-year pattern
#   [,2] trend     – long-run direction of the series
#   [,3] remainder – residuals (noise after removing seasonal + trend)
# freq = 365 sets annual seasonality for daily data.
freq <- 365

# --- Decompose: Normalized Registered Rentals (nr) ---------------------------
norm_rentals <- ts(daily_data[, "nr"], frequency = freq)  # convert to ts object
decomped1    <- stl(norm_rentals, "periodic")             # STL decomposition

# Plot the trend component to visualize the long-run rental growth
plot(decomped1$time.series[, 2],
     ylab = "Stationary of the Normalized Rental Reservations",
     xlab = "Daily of the Year")

Residual diagnostics for the nr decomposition:

# check residuals() plots the remainder, its ACF, and a Ljung-Box test.
# Ideally residuals should look like white noise (no autocorrelation).
checkresiduals(decomped1$time.series[, 3])


    Ljung-Box test

data:  Residuals
Q* = 1997.3, df = 146, p-value < 2.2e-16

Model df: 0.   Total lags used: 146

Decompose: Normalized Total Rental Count (ncnt)

norm_cnt  <- ts(daily_data[, "ncnt"], frequency = freq)
decomped2 <- stl(norm_cnt, "periodic")

# Plot the trend component for total rental counts
plot(decomped2$time.series[, 2],
     ylab = "Stationary of the Normalized Rental Counts",
     xlab = "Daily of the Year")

Residual diagnostics for the ncnt decomposition

checkresiduals(decomped2$time.series[, 3])


    Ljung-Box test

data:  Residuals
Q* = 1437.8, df = 146, p-value < 2.2e-16

Model df: 0.   Total lags used: 146

Decompose: Rental-to-Registration Ratio (rr)

norm_rr   <- ts(daily_data[, "rr"], frequency = freq)
decomped3 <- stl(norm_rr, "periodic")

# Plot the trend component for the rental ratio
plot(decomped3$time.series[, 2],
     ylab = "Stationary of the Normalized Rental Counts to Reservations",
     xlab = "Daily of the Year")

Residual diagnostics for the rr decomposition

checkresiduals(decomped3$time.series[, 3])


    Ljung-Box test

data:  Residuals
Q* = 1437.8, df = 146, p-value < 2.2e-16

Model df: 0.   Total lags used: 146

Normality Testing on STL Residuals (Shapiro-Wilk)

# The Shapiro-Wilk test checks whether residuals follow a normal distribution.
# H0: the residuals are normally distributed
# p > 0.05 → fail to reject H0 (residuals appear normal, which is desirable)
# This is important for validating ARIMA model assumptions.

print("-------Noramlized Rental Reservations")
[1] "-------Noramlized Rental Reservations"
# Shapiro-Wilk test on STL residuals for normalized registered rentals
shapiro.test(decomped1$time.series[, 3])

    Shapiro-Wilk normality test

data:  decomped1$time.series[, 3]
W = 0.99554, p-value = 0.03334
print("-------Normalized Count of Rentals")
[1] "-------Normalized Count of Rentals"
# Note: this re-uses decomped1 residuals; consider using decomped2 for ncnt
shapiro.test(decomped1$time.series[, 3])

    Shapiro-Wilk normality test

data:  decomped1$time.series[, 3]
W = 0.99554, p-value = 0.03334
print("-------Normalized Ratio of Rentals to Reservations")
[1] "-------Normalized Ratio of Rentals to Reservations"
# Shapiro-Wilk test on STL residuals for the rental-to-registration ratio
shapiro.test(decomped3$time.series[, 3])

    Shapiro-Wilk normality test

data:  decomped3$time.series[, 3]
W = 0.99702, p-value = 0.1993

ARIMA Modeling & Forecasting: Total Bike Count (ncnt)

# auto.arima() automatically selects the best-fitting ARIMA(p,d,q)(P,D,Q)[m]
# model by minimizing AICc. seasonal = TRUE allows it to fit a seasonal component.
fit1 <- auto.arima(norm_cnt, seasonal = TRUE)

# Histogram of model residuals: should be roughly bell-shaped and centered at 0
# if the model has adequately captured the signal in the data.
hist(fit1$residuals,
     xlab = "Residual",
     ylab = "Distribution",
     main = "Histogram of Model Errors - Bike Count")

normality test: ARIMA Residuals

# Formal normality test on ARIMA residuals for the bike count model
shapiro.test(fit1$residuals)

    Shapiro-Wilk normality test

data:  fit1$residuals
W = 0.83122, p-value < 2.2e-16

Forecast using fit1

# Forecast the next 25 days of normalized bike rental counts using fit1.
# The shaded bands represent 80% and 95% prediction intervals.
prediction1 <- forecast(fit1, 25)
plot(prediction1,
     xlab = "Date",
     ylab = "Normalized Count of Rentals",
     main = "Prediction of Bike Rental Counts")

ARIMA Modeling & Forecasting: Registered Rentals (nr)

# Fit a seasonal ARIMA model to the normalized registered-user rental series.
fit2 <- auto.arima(norm_rentals, seasonal = TRUE)

# Inspect residuals histogram to check for model fit quality
hist(fit2$residuals,
     xlab = "Residual",
     ylab = "Distribution",
     main = "Histogram of Model Errors - Rental Count")

Normality test: reregistered rentals

# Normality test on residuals for the registered-rentals ARIMA model
shapiro.test(fit2$residuals)

    Shapiro-Wilk normality test

data:  fit2$residuals
W = 0.85305, p-value < 2.2e-16

forecast using fit2

# Forecast the next 25 days of normalized registered rentals using fit2
prediction2 <- forecast(fit2, 25)
plot(prediction2,
     xlab = "Date",
     ylab = "Normalized Rentals",
     main = "Prediction of Bike Rentals")

ARIMA Modeling & Forecasting: Rental-to-Registration Ratio

# Note: fit2 is re-assigned here to a model fitted on norm_cnt (same as fit1).
# Consider renaming to fit3 and fitting on norm_rr to model the ratio specifically.
fit2 <- auto.arima(norm_cnt, seasonal = TRUE)

# Residuals histogram for this third model
hist(fit2$residuals,
     xlab = "Residual",
     ylab = "Distribution",
     main = "Histogram of Model Errors - Count to Rental Ratio")

Normality test on residuals for this model

shapiro.test(fit2$residuals)

    Shapiro-Wilk normality test

data:  fit2$residuals
W = 0.83122, p-value < 2.2e-16

Final forecast

# Forecast the next 25 days and plot with prediction intervals
prediction3 <- forecast(fit2, 25)
plot(prediction3,
     xlab = "Date",
     ylab = "Normalized Rental Ratio",
     main = "Prediction of Bike Rentals to Reservations")

Comment

Findings and Conclusions After processing the raw data and using the ARIMA package to model ride share data, I was able to make predictions for the 25 days beyond the current data set. Qualitatively the data shows that was the weather gets warmer the number of bike rentals increase, and over the course of two years the number of rentals increases over the number of rentals from the previous year. As the data terminates at the end of one cycle, I expect the number of rentals to increase to a level higher than it was a year before, which is what the models are predicting. Therefore the results were what I expected the data appears to oscillate up and down over a 1 year period with the overall data moving towards higher rental numbers.