This RMarkdown file contains the report of the data analysis done for the project on forecasting daily bike rental demand using time series models in R. It contains analysis such as data exploration, summary statistics and building the time series models. The final report was completed on Mon Aug 4 12:36:25 2025.
Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
## Import required packages
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.3.0
## ✔ purrr 1.1.0 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyquant)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## ── Attaching core tidyquant packages ─────────────────────── tidyquant 1.0.11 ──
## ✔ PerformanceAnalytics 2.0.8 ✔ TTR 0.24.4
## ✔ quantmod 0.4.28 ✔ xts 0.14.1── Conflicts ────────────────────────────────────────── tidyquant_conflicts() ──
## ✖ zoo::as.Date() masks base::as.Date()
## ✖ zoo::as.Date.numeric() masks base::as.Date.numeric()
## ✖ dplyr::filter() masks stats::filter()
## ✖ xts::first() masks dplyr::first()
## ✖ dplyr::lag() masks stats::lag()
## ✖ xts::last() masks dplyr::last()
## ✖ PerformanceAnalytics::legend() masks graphics::legend()
## ✖ quantmod::summary() masks base::summary()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(timetk)
##
## Attaching package: 'timetk'
##
## The following object is masked from 'package:tidyquant':
##
## FANG
library(TTR)
library(urca)
library(lubridate)
library(tseries)
library(forecast)
library(lmtest)
library(tsibble)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
##
## Attaching package: 'tsibble'
##
## The following object is masked from 'package:zoo':
##
## index
##
## The following object is masked from 'package:lubridate':
##
## interval
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
library(zoo)
# Loading the data
bike_df<- read.csv('bikes_rent.csv')
This dataset contains the daily count of rental bike transactions between years 2011 and 2012 in Capital bikeshare system with the corresponding weather and seasonal information.
# Checking for missing values for each column
colSums(is.na(bike_df))
## season yr mnth holiday weekday
## 0 0 0 0 0
## workingday weathersit temp atemp hum
## 0 0 0 0 0
## windspeed.mph. windspeed.ms. cnt
## 0 0 0
#Checking for duplicates
sum(duplicated(bike_df))
## [1] 0
# Rename columns
bike_df<- bike_df %>%
rename(month='mnth', year='yr', temperature='temp', count='cnt',apparent_temp='atemp', humidity='hum')
# Converting numeric feature to their meaniful factor names
bike_df$season <- factor(bike_df$season,
levels = c(1, 2, 3, 4),
labels = c("Winter", "Spring", "Summer", "Fall"))
bike_df$year <- factor(bike_df$year,
levels = c(0, 1),
labels = c("2011", "2012"))
bike_df$month <- factor(bike_df$month,
levels = 1:12,
labels = month.name) # Built-in month labels
bike_df$holiday <- factor(bike_df$holiday,
levels = c(0, 1),
labels = c("No", "Yes"))
bike_df$weekday <- factor(bike_df$weekday,
levels = 0:6,
labels = c("Sunday", "Monday", "Tuesday",
"Wednesday", "Thursday", "Friday", "Saturday"))
bike_df$workingday <- factor(bike_df$workingday,
levels = c(0, 1),
labels = c("No", "Yes"))
bike_df$weathersit <- factor(bike_df$weathersit,
levels = c(1, 2, 3, 4),
labels = c("Clear", "Cloudy",
"Light Rain/Snow", "Heavy Rain/Snow"))
bike_df %>%
select(where(is.numeric)) %>%
summary()
## temperature apparent_temp humidity windspeed.mph.
## Min. : 2.424 Min. : 3.953 Min. : 0.00 Min. : 1.500
## 1st Qu.:13.820 1st Qu.:16.892 1st Qu.:52.00 1st Qu.: 9.042
## Median :20.432 Median :24.337 Median :62.67 Median :12.125
## Mean :20.311 Mean :23.718 Mean :62.79 Mean :12.763
## 3rd Qu.:26.872 3rd Qu.:30.430 3rd Qu.:73.02 3rd Qu.:15.625
## Max. :35.328 Max. :42.045 Max. :97.25 Max. :34.000
## windspeed.ms. count
## Min. : 0.6706 Min. : 22
## 1st Qu.: 4.0419 1st Qu.:3152
## Median : 5.4204 Median :4548
## Mean : 5.7052 Mean :4504
## 3rd Qu.: 6.9850 3rd Qu.:5956
## Max. :15.1989 Max. :8714
Temperatures ranged from 2.4°C to 35.3°C, but felt warmer, as apparent temperatures hit up to 42°C, likely due to humidity or sun exposure. Humidity varied widely (0% to 97.25%), with most days remaining around 63%. Winds were generally breezy, averaging 12 mph, though maximum windspeed reached 34 mph, which could affect rider comfort. Bike rentals ranged from 22 to 8,714 per day, with a median of 4,548, and is indicative of consistent bike rentals demand.
# Bike Rentals: Holiday vs Non- Holidays Distribution
ggplot(bike_df, aes(x = holiday, fill=holiday)) +
geom_bar(color="black") +
scale_fill_manual(values = c("No" ="orange", "Yes"="skyblue"))+
labs(title = "Bike Rentals: Holidays vs Non-Holidays", x = "Holiday", y = "Frequency") +
theme_minimal()+
theme(plot.title= element_text(hjust = 0.5, face="bold"))
Let’s tackle the answer with basic logic first. We all know there are far fewer holidays than regular working days (non-holidays). It makes sense if the availability of bikes decreases significantly on holidays. This could also be explained by other factors. Holidays could be equivalent to people leaving town, spending time indoors, or using other modes of transport. Let’s not forget that some rental systems have reduced hours or staffing on holidays, thus fewer bikes available for users. Perhaps, on holidays, most roads are busy with parades, marathons events, thereby disrupting riding patterns, and, of course, decreasing even more bike rentals on holidays. Additionally, some holidays might fall in colder months, a season where biking is not popular at all. As a result, more No’s could mean that most data entries occur on regular days.
# Bike Rentals: Workingdays vs Non- Workingdays Distribution
ggplot(bike_df, aes(x = workingday, fill=workingday)) +
geom_bar(color="black") +
scale_fill_manual(values = c("No" ="tomato", "Yes"="skyblue"))+
labs(title = "Bike Rentals: Workingdays vs Non- Workingdays Distribution", x = "Holiday", y = "Frequency") +
theme_minimal()+
theme(plot.title= element_text(hjust = 0.5, face="bold"))
Although it is tempting to assume that leisure bike rides would see a surge on holidays, bike rentals actually peak during working days. This reflects a growing trend of people using bikes as a practical mode of transportation for commuting to work or school. Factors like bad weather, travel, and service closures tend to suppress activity on non-working days. Overall, the data suggests that bike rentals are driven more by necessity than leisure.
Now, let’s see how the weather conditions affect bike rentals.
# Bike Rentals: Weathersit Distribution
ggplot(bike_df, aes(x=weathersit, fill = weathersit)) +
geom_bar(color="black") +
scale_fill_manual(values = c("Clear" = "skyblue", "Cloudy" = "lightgray", "Light Rain/Snow" = "gray30")) +
labs(title = "Bike Rentals by Weather", x="Weathersit", y="Frequency")+
theme_minimal()+
theme(plot.title = element_text(hjust=0.5, face="bold"))
Bike rentals are highest during clear weather, drop noticeably on cloudy days, and are lowest when it rains or snows. This pattern indicates how weather conditions directly influences bike rentals volume.
# Creating another column to distinguish weekdays from weekends
bike_df$day_type <- ifelse(bike_df$weekday %in% c("Saturday", "Sunday"), "Weekend", "Weekday")
ggplot(bike_df, aes(x = day_type, fill=day_type)) +
geom_bar( color = "black") +
scale_fill_manual(values = c("Weekend"="lightblue", "Weekday"="green"))+
labs(title = "Weekday vs Weekend Comparison",x = "Day Type", y = "Frequency") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
weather_daytype_df <- bike_df %>%
group_by(day_type, weathersit) %>%
summarise(avg_count = mean(count))
## `summarise()` has grouped output by 'day_type'. You can override using the
## `.groups` argument.
ggplot(weather_daytype_df, aes(x = day_type, y = avg_count, fill = weathersit)) +
geom_bar(stat = "identity", position = "dodge", color = "black") +
labs(title = "Day Type and Weather",x = "Day Type",y = "Average Rentals",fill = "Weather") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
# Create seasonal data by grouping by season
season_df <- bike_df %>%
group_by(season) %>%
summarise(avg_count = mean(count))
# Create summer_value object to annotate the spike
summer_value<- season_df %>%
filter(season=="Summer") %>%
pull(avg_count)
# Create winter_value to annotate the lowest bike rental
winter_value<- season_df %>%
filter(season=="Winter") %>%
pull(avg_count)
# Plot bike rentals trend by season
ggplot(season_df, aes(x= season, y= avg_count, group=1)) +
geom_line(color="blue", size=1.5) +
annotate("point", x="Summer", y= summer_value, color="green", size=3) +
annotate("text", x = "Summer", y= summer_value + 250, label="Summer Spike!", color="black", fontface="bold") +
annotate("point", x="Winter", y= winter_value, color="red", size=3) +
annotate("text", x = "Winter", y= winter_value +400, label="Low", color="tomato", fontface="bold") +
labs(title = "Bike Rentals Trend by Season", x="Seasons", y="Average Rentals") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
Bike rentals show strong seasonal variation, peaking in summer and dipping sharply in winter. The “Summer Spike!” annotation highlights peak demand, while winter sees the lowest activity in bike stations. This pattern suggests that seasonality is a key factor in bike rental behavior
# Convert year and month to numeric for efficient visualization
bike_df$year_num <- as.numeric(as.character(bike_df$year))
bike_df$month_num <- as.numeric(bike_df$month)
# Create a date column using the first day of each month
bike_df$date <- as.Date(paste(bike_df$year_num, bike_df$month_num, "01", sep = "-"), format = "%Y-%m-%d")
# grouping data by month
monthly_df<- bike_df %>%
mutate(month_year = floor_date(date,"month")) %>%
group_by(month_year) %>%
summarise(tot_count = sum(count))
# Interactive plot
monthly_df %>%
plot_time_series(.date_var = month_year,.value = tot_count,.interactive = TRUE,.plotly_slider = TRUE, .title="Monthly Bike Rentals", .x_lab="Month", .y_lab="Total Rentals",.line_color = "black", .smooth=TRUE, .smooth_color = "tomato",.smooth_span = 0.4)
#Time series object
bike_ts_monthly <- ts(monthly_df$tot_count, start = c(2011,1),frequency = 12)
# Convert time series to data frame
tibble_columns <- tibble(
month_year = as.Date(time(bike_ts_monthly), origin = "1970-01-01"),
tot_count = as.numeric(bike_ts_monthly)
)
# With simple MA (with tidyquant package)
ggplot(tibble_columns, aes(x = month_year, y = tot_count)) +
geom_line() +
geom_ma(ma_fun = SMA, n = 5, color="blue") +
theme_minimal() +
labs(title = "5-Months Moving Average", x = "Time", y = "Bike Rentals")+
theme(plot.title = element_text(hjust=0.5, face = "bold"))
# Decomposing time series
decomp <- decompose(bike_ts_monthly)
plot(decomp, col= c("magenta"),lwd=1.5)
grid()
# ARIMA model
bike_ts_arima<- auto.arima(bike_ts_monthly)
bike_ts_arima
## Series: bike_ts_monthly
## ARIMA(1,1,0)
##
## Coefficients:
## ar1
## 0.5383
## s.e. 0.1755
##
## sigma^2 = 382385344: log likelihood = -259.56
## AIC=523.12 AICc=523.72 BIC=525.39
# Forecasting bike rentals for 10 month
forecast_vals <- forecast(bike_ts_arima, h = 10)
# Plot forecast
autoplot(forecast_vals) +
scale_x_continuous(breaks = seq(2011, 2025, by = 1)) +
scale_y_continuous(labels = scales::comma, limits = c(0, NA)) +
theme_minimal() +
labs(title = "Forecasted Bike Rentals", x = "Year", y = "Rentals") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
# Checking for autocorrelation
checkresiduals(bike_ts_arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,0)
## Q* = 2.9228, df = 4, p-value = 0.5708
##
## Model df: 1. Total lags used: 5
# Check if residuals are normally distributed (Shapiro test)
shapiro.test(residuals(bike_ts_arima))
##
## Shapiro-Wilk normality test
##
## data: residuals(bike_ts_arima)
## W = 0.93563, p-value = 0.1303
# ACF/PACF of residuals
acf(residuals(bike_ts_arima))
pacf(residuals(bike_ts_arima))
# Chech for stationarity of residuals (important for ARIMA validity)
kpss.test(residuals(bike_ts_arima))
##
## KPSS Test for Level Stationarity
##
## data: residuals(bike_ts_arima)
## KPSS Level = 0.19327, Truncation lag parameter = 2, p-value = 0.1
# Evaluate model performance
accuracy(bike_ts_arima)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 931.7514 18722.18 13300.08 1.804659 10.21507 0.1979 -0.02802576
# Generate predictions
predicted <- fitted(bike_ts_arima)
# Plot actual vs predicted
#zoo package interpret x axis as date format, instead of numeric
bike_zoo <- zoo(monthly_df$tot_count, order.by = monthly_df$month_year)
plot(bike_zoo, type = "l", col = "black", lwd = 2,
main = "Actual vs Predicted Values",
xlab = "Time", ylab = "Bike Rentals")
lines(zoo(predicted, order.by = monthly_df$month_year), col = "blue", lwd = 2)
legend("topleft", legend = c("Actual", "Predicted"),
col = c("black", "blue"), lwd = 2)
Based on the ARIMA forecasts, the company can make strategic, data-driven decisions to refine inventory and pricing across bike stations. The forecast shows bike rentals stabilizing around 75,000–80,000 annual rentals, following a sharp drop from a 2012 peak of 218,573 monthly rentals, on average. This long-term equilibrium suggests a predictable and consistent level of demand.
With demand flattening over the next decade, stakeholders can avoid over-provisioning during off-peak seasons and redirecting surplus bikes to high-demand areas. The forecasted drop, followed by stability, presents clear windows for promotional pricing. Periods where demand is projected to decline may benefit from discount strategies to encourage usage.
The integration of predictive analysis into the station-level decisions can design seasonal redistribution schedules, optimize dynamic pricing algorithms, and finally, plan an expansion or contraction based on sustained trends.
The residuals pass the normality, stationary, and autocorrelation test. This suggests that the ARIMA(1,1,0) is statistically sound and well-specified. Also, the forecast errors are low (ACF1 close to zero), indicating no significant autocorrelation in forecast errors. The MASE < 1, showing that the model performs better than a naïve baseline, and the Mean Absolute Percentage Error (MAPE) is less than 15%, a key indicator that it is acceptable to good in real-world forecasting. Therefore, the model produces reliable and reasonably accurate predictions for monthly bike rentals.
First, the ARIMA model has relatively high RMSE and MAE, which could signal larger errors during peaks. Then, there is a lack of other relevant regressors such as local events, pricing promotions, and even some economic indicators (unemploymnent and inflation) that could improve predictive power. Finally, the dataset has a short time span (only 24 months), which prevented the ability to generalize patterns, and capture long-term seasonality.