About Data Analysis Report

This RMarkdown file contains the report of the data analysis done for the project on forecasting daily bike rental demand using time series models in R. It contains analysis such as data exploration, summary statistics and building the time series models. The final report was completed on Sat Nov 30 16:20:25 2024.

Data Description:

This dataset contains the daily count of rental bike transactions between years 2011 and 2012 in Capital bikeshare system with the corresponding weather and seasonal information.

Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset

Relevant Paper:

Fanaee-T, Hadi, and Gama, Joao. Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg

Task One: Load and explore the data

Load data and install packages

Describe and explore the data

# Summary of the dataset
summary(data)
##     instant         dteday              season            yr        
##  Min.   :  1.0   Length:731         Min.   :1.000   Min.   :0.0000  
##  1st Qu.:183.5   Class :character   1st Qu.:2.000   1st Qu.:0.0000  
##  Median :366.0   Mode  :character   Median :3.000   Median :1.0000  
##  Mean   :366.0                      Mean   :2.497   Mean   :0.5007  
##  3rd Qu.:548.5                      3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :731.0                      Max.   :4.000   Max.   :1.0000  
##       mnth          holiday           weekday        workingday   
##  Min.   : 1.00   Min.   :0.00000   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 4.00   1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.000  
##  Median : 7.00   Median :0.00000   Median :3.000   Median :1.000  
##  Mean   : 6.52   Mean   :0.02873   Mean   :2.997   Mean   :0.684  
##  3rd Qu.:10.00   3rd Qu.:0.00000   3rd Qu.:5.000   3rd Qu.:1.000  
##  Max.   :12.00   Max.   :1.00000   Max.   :6.000   Max.   :1.000  
##    weathersit         temp             atemp              hum        
##  Min.   :1.000   Min.   :0.05913   Min.   :0.07907   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:0.33708   1st Qu.:0.33784   1st Qu.:0.5200  
##  Median :1.000   Median :0.49833   Median :0.48673   Median :0.6267  
##  Mean   :1.395   Mean   :0.49538   Mean   :0.47435   Mean   :0.6279  
##  3rd Qu.:2.000   3rd Qu.:0.65542   3rd Qu.:0.60860   3rd Qu.:0.7302  
##  Max.   :3.000   Max.   :0.86167   Max.   :0.84090   Max.   :0.9725  
##    windspeed           casual         registered        cnt      
##  Min.   :0.02239   Min.   :   2.0   Min.   :  20   Min.   :  22  
##  1st Qu.:0.13495   1st Qu.: 315.5   1st Qu.:2497   1st Qu.:3152  
##  Median :0.18097   Median : 713.0   Median :3662   Median :4548  
##  Mean   :0.19049   Mean   : 848.2   Mean   :3656   Mean   :4504  
##  3rd Qu.:0.23321   3rd Qu.:1096.0   3rd Qu.:4776   3rd Qu.:5956  
##  Max.   :0.50746   Max.   :3410.0   Max.   :6946   Max.   :8714
# Select relevant columns
bike_data <- data %>%
  select(dteday, cnt, temp, hum, windspeed) %>%
  mutate(dteday = as.Date(dteday))

# Check data structure
str(bike_data)
## 'data.frame':    731 obs. of  5 variables:
##  $ dteday   : Date, format: "2011-01-01" "2011-01-02" ...
##  $ cnt      : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...
##  $ temp     : num  0.344 0.363 0.196 0.2 0.227 ...
##  $ hum      : num  0.806 0.696 0.437 0.59 0.437 ...
##  $ windspeed: num  0.16 0.249 0.248 0.16 0.187 ...
# Plot daily bike rentals
ggplot(bike_data, aes(x = dteday, y = cnt)) +
  geom_line(color = "blue") +
  labs(title = "Daily Bike Rentals", x = "Date", y = "Count")

Task Two: Create interactive time series plots

## Read about the timetk package
# ?timetk
# Convert to time series object
bike_ts <- bike_data %>%
  select(dteday, cnt) %>%
  tk_ts(start = c(2011, 1), freq = 365)
## Warning: Non-numeric columns being dropped: dteday
# Interactive plot
bike_data %>%
  plot_time_series(dteday, cnt, .interactive = FALSE)

Task Three: Smooth time series data

# Apply moving average for smoothing
bike_data <- bike_data %>%
  mutate(cnt_smoothed = zoo::rollmean(cnt, k = 7, fill = NA))

# Plot smoothed data
ggplot(bike_data, aes(x = dteday)) +
  geom_line(aes(y = cnt), color = "blue", alpha = 0.5) +
  geom_line(aes(y = cnt_smoothed), color = "red") +
  labs(title = "Smoothed Daily Bike Rentals", x = "Date", y = "Count")
## Warning: Removed 6 rows containing missing values (`geom_line()`).

Task Four: Decompose and assess the stationarity of time series data

# Decompose time series
decomp <- decompose(bike_ts, type = "multiplicative")
plot(decomp)

# Check stationarity using Augmented Dickey-Fuller Test
adf.test <- tseries::adf.test(bike_ts)
adf.test
## 
##  Augmented Dickey-Fuller Test
## 
## data:  bike_ts
## Dickey-Fuller = -1.6351, Lag order = 9, p-value = 0.7327
## alternative hypothesis: stationary

Task Five: Fit and forecast time series data using ARIMA models

# Fit ARIMA model
arima_model <- auto.arima(bike_ts)
summary(arima_model)
## Series: bike_ts 
## ARIMA(1,0,2)(0,1,0)[365] with drift 
## 
## Coefficients:
##          ar1      ma1      ma2   drift
##       0.9586  -0.6363  -0.1892  5.7093
## s.e.  0.0283   0.0583   0.0506  0.7566
## 
## sigma^2 = 1599566:  log likelihood = -3131.76
## AIC=6273.52   AICc=6273.68   BIC=6293.03
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 5.357072 890.0137 457.0405 -44.28372 51.73145 0.1967752 0.01047273
# Forecast
forecasted <- forecast(arima_model, h = 30)
autoplot(forecasted) +
  labs(title = "ARIMA Model Forecast", x = "Date", y = "Count")

Calculate RMSE for ARIMA model

rmse <- sqrt(mean((bike_ts - fitted(arima_model))^2, na.rm = TRUE))
rmse
## [1] 890.0137

Task Six: Findings and Conclusions

Findings

  • The dataset revealed a clear seasonal trend in bike rental demand, with higher rentals observed in the summer and lower rentals during winter. This seasonal variation aligns with expected user behavior based on weather conditions.
  • The time series decomposition highlighted trend, seasonality, and residuals:
    • The trend component showed a steady increase in demand over time.
    • The seasonal component displayed consistent yearly patterns, with peaks around warmer months and troughs during colder months.
    • The residual component indicated random noise without significant irregularities.

Model Performance

The Root Mean Square Error (RMSE) for the ARIMA model was calculated as follows:

# Calculate RMSE for ARIMA model
rmse <- sqrt(mean((bike_ts - fitted(arima_model))^2, na.rm = TRUE))
rmse
## [1] 890.0137

Implications

  • Resource Planning: Allocate more bikes and maintenance resources during the summer months to meet increased demand.
  • Marketing Strategies: Design promotional campaigns to encourage usage during colder months and weekdays.
  • Operational Efficiency: Use forecasts to optimize bike redistribution and staffing levels based on predicted demand patterns.

Limitations

  • The model does not account for external factors such as holidays, special events, or sudden weather changes, which could significantly influence bike rental demand.
  • The analysis assumes that historical patterns will continue in the future, which may not hold under changing circumstances (e.g., new user behavior or operational policies).

Future Work

  • Incorporate external variables like weather forecasts, holidays, or events into the model to improve forecast accuracy.
  • Experiment with more advanced models such as Prophet or machine learning approaches like LSTMs to capture complex patterns.
  • Segment users into groups (e.g., casual riders vs. subscribers) to provide more granular and customized forecasts.