This RMarkdown report documents the data analysis for a project on forecasting daily bike rental demand using time series models in R. It covers data exploration, summary statistics, and time series model building. The final report was completed on Mon Jul 15 11:12:04 2024.
Data Description:
This dataset contains the daily count of rental bike transactions in the Capital Bikeshare system for the years 2011 and 2012, with the corresponding weather and seasonal information.
Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
Relevant Paper:
Fanaee-T, Hadi, and Gama, Joao. Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg
# Install, load, and explore the data
## Install required libraries
install.packages(c("tidyverse", "plotly","tseries","forecast"))
## Warning in install.packages(c("tidyverse", "plotly", "tseries", "forecast")):
## installation of package 'forecast' had non-zero exit status
## Import required packages
library(tidyverse)
library(plotly)
library(tseries)
library(forecast)
## Data cleaning
# Read the CSV file
bike_day_data = read_csv("day.csv")
# Check for missing values in the data frame:
# anyNA() returns TRUE if any value is missing, and cat() prints the result
cat("Are there any missing data in the dataframe?: ", anyNA(bike_day_data), "\n")
## Are there any missing data in the dataframe?: FALSE
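anyNA() only reports whether any value is missing at all. If it ever returned TRUE, a per-column breakdown would show where the gaps are. A minimal sketch on a small toy data frame (the `toy` object and its values are illustrative stand-ins, not rows taken from `day.csv`):

```r
# Toy stand-in for bike_day_data; the values are invented for illustration.
toy <- data.frame(
  dteday = as.Date("2011-01-01") + 0:3,
  temp   = c(0.34, NA, 0.20, 0.23),
  cnt    = c(900, 800, NA, 1500)
)

# Count missing values in each column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Keep only the rows that contain at least one missing value.
rows_with_na <- toy[!complete.cases(toy), ]
print(rows_with_na)
```

The same two calls could be applied to `bike_day_data` directly; for this dataset the check above already reported no missing values.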
## checking summary of data
#summary(bike_day_data)
# Convert dteday to Date type
bike_day_data$dteday = as.Date(bike_day_data$dteday)
# Aggregate data to count records by season
season_counts = bike_day_data %>%
  group_by(season) %>%
  summarise(count = n())
# Convert season and weather codes to descriptive labels for better visualization
bike_day_data$season = factor(bike_day_data$season,
                              levels = c(1, 2, 3, 4),
                              labels = c("Spring", "Summer", "Fall", "Winter"))
bike_day_data$weathersit = factor(bike_day_data$weathersit,
                                  levels = c(1, 2, 3),
                                  labels = c("Clear", "Cloudy", "Light_Rain"))
plot_ly(bike_day_data, x = ~dteday, y = ~cnt, color = ~season,
        colors = c("blue", "orange", "green", "red"),  # One color per season
        type = 'scatter', mode = 'lines') %>%
  layout(
    title = "Daily Bike Rental Counts",
    xaxis = list(title = "Date"),
    yaxis = list(title = "Bike Rented")
  )
The chart shows the wide range of daily rental demand in the dataset, which peaks at a maximum of 8714 rentals in a single day. Spikes of this size suggest that specific events, weather conditions, or other external factors contribute to rental activity.
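To follow up on a spike, it helps to know which day it occurred on. The sketch below shows one way to pull the range and the peak row from a data frame shaped like `bike_day_data`; the `toy` data frame and its values are invented for illustration.

```r
# Toy stand-in for bike_day_data; dates and counts are invented for illustration.
toy <- data.frame(
  dteday = as.Date("2011-01-01") + 0:4,
  cnt    = c(1200, 4300, 7900, 2500, 3100)
)

# Smallest and largest daily counts.
cnt_range <- range(toy$cnt)

# Row holding the peak day, so it can be checked against events or weather.
peak_day <- toy[which.max(toy$cnt), ]

print(cnt_range)
print(peak_day)
```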
ggplot(bike_day_data, aes(x = dteday, y = cnt, colour = season)) +
  geom_line() +
  labs(title = "Daily Bike Rental Counts",
       x = "Date",
       y = "Bike Rented") +
  facet_wrap(~ season) +
  theme_minimal()
At first glance, the chart above does not show a significant impact of season on bike rentals. The counts appear relatively balanced across all seasons, suggesting that rental demand does not vary drastically with the time of year.
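Two different quantities can be called "counts" here: the number of daily records per season (what `season_counts` holds) and the number of rentals per day (`cnt`). A rough sketch of how both could be tabulated side by side, using a toy data frame with invented values in place of `bike_day_data`:

```r
# Toy stand-in for bike_day_data; the values are invented for illustration.
toy <- data.frame(
  season = factor(rep(c("Spring", "Summer", "Fall", "Winter"), each = 2),
                  levels = c("Spring", "Summer", "Fall", "Winter")),
  cnt = c(1000, 2000, 4000, 5000, 5500, 6000, 4500, 5000)
)

# Number of daily records in each season.
days_per_season <- table(toy$season)

# Average number of rentals per day in each season.
mean_per_season <- tapply(toy$cnt, toy$season, mean)

season_summary <- data.frame(
  season   = names(mean_per_season),
  days     = as.vector(days_per_season),
  mean_cnt = as.vector(mean_per_season)
)
print(season_summary)
```

In the toy data every season has the same number of days while the average rentals differ. Whether the real data behaves the same way would need to be checked by running the summary on `bike_day_data`.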
plot_ly(bike_day_data, x = ~dteday, y = ~cnt, color = ~weathersit,
        colors = c("green", "purple", "orange"),  # One color per weather situation
        type = 'scatter', mode = 'lines') %>%
  layout(
    title = "Daily Bike Rental Counts by Weather Situation",
    xaxis = list(title = "Date"),
    yaxis = list(title = "Bike Rented")
  )
This chart shows how rental behavior varies with weather conditions. Clear weather, the most frequent condition, likely encourages higher rental rates, while Cloudy and Light_Rain conditions may correlate with fewer rentals.
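A line chart over time makes it hard to compare rental levels between weather situations. One alternative, sketched here with ggplot2 on a toy data frame (the values are invented, and the box plot is a suggestion rather than part of the original analysis), is to compare the distribution of `cnt` per weather category:

```r
library(ggplot2)

# Toy stand-in for bike_day_data; the values are invented for illustration.
toy <- data.frame(
  weathersit = factor(rep(c("Clear", "Cloudy", "Light_Rain"), each = 4),
                      levels = c("Clear", "Cloudy", "Light_Rain")),
  cnt = c(5200, 6100, 4800, 7000,
          4100, 3900, 4600, 5000,
          1800, 2200, 1500, 2600)
)

# One box per weather situation shows the median and spread of daily rentals.
weather_box <- ggplot(toy, aes(x = weathersit, y = cnt, fill = weathersit)) +
  geom_boxplot() +
  labs(title = "Daily Bike Rental Counts by Weather Situation",
       x = "Weather Situation",
       y = "Bike Rented") +
  theme_minimal()

weather_box
```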
# Extract the start year of the series
start_date = as.numeric(format(min(bike_day_data$dteday), "%Y"))
# Convert to a time series object with one cycle per year (365 daily observations)
bike_ts = ts(bike_day_data$cnt, start = c(start_date, 1), frequency = 365)
# Build the time series plot (stored in p)
p = autoplot(bike_ts) +
  labs(title = "Daily Bike Rental Counts",
       x = "Date",
       y = "Bike Rented") +
  theme_minimal()
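The ts() call above indexes observations by cycle and position within the cycle rather than by calendar date. The sketch below illustrates that indexing on a placeholder series, assuming one observation for every day of 2011 and 2012 (365 + 366 = 731 values); the series content is just a counter, not rental data.

```r
# Assumption: 731 daily observations (365 days in 2011 plus 366 in leap year 2012).
n_days <- 365 + 366
x <- ts(seq_len(n_days), start = c(2011, 1), frequency = 365)

# time() gives the fractional-year index that ts() assigns to each observation.
first_times <- as.numeric(time(x))[1:3]
print(first_times)

# start() and end() report (cycle, position-in-cycle) pairs.
series_start <- start(x)
series_end   <- end(x)
print(series_start)
print(series_end)
```

Because the cycle length is fixed at 365, the 731st observation is labelled as position 1 of cycle 2013 under this assumption, so the axis labels drift by one day relative to the calendar after the leap year.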
# Decompose the time series
decomposed = decompose(bike_ts)
decom <- autoplot(decomposed) +
  theme_minimal()
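decompose() splits a series into trend, seasonal, and random components. The following sketch shows what those components look like on a simulated series with a weekly cycle; the series, its `frequency = 7`, and all values are assumptions chosen for illustration, not the bike rental data.

```r
# Simulated daily series with a weekly cycle; all values are invented.
set.seed(123)
n      <- 7 * 12                                   # 12 full weeks
trend  <- seq(100, 160, length.out = n)            # slow upward drift
weekly <- rep(c(-10, -5, 0, 5, 10, 20, -20), times = 12)
y      <- ts(trend + weekly + rnorm(n, sd = 2), frequency = 7)

# Additive decomposition: y = trend + seasonal + random.
parts <- decompose(y, type = "additive")

# One estimated seasonal effect per position in the cycle.
print(round(parts$figure, 1))

# The components add back up to the original series wherever the trend is defined
# (the moving-average trend is NA at both ends of the series).
recombined <- parts$trend + parts$seasonal + parts$random
defined    <- !is.na(recombined)
print(all.equal(as.numeric(recombined)[defined], as.numeric(y)[defined]))
```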
# Check stationarity
adf_test = adf.test(bike_ts)
#print(adf_test)
# Difference the time series to achieve stationarity, because the previous test indicated the data is not stationary
diff_bike_ts = diff(bike_ts)
# Check stationarity of the differenced series
adf_test_diff = adf.test(diff_bike_ts)
## Warning in adf.test(diff_bike_ts): p-value smaller than printed p-value
#print(adf_test_diff)
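The null hypothesis of the ADF test is that the series has a unit root (is non-stationary), so a small p-value is evidence for stationarity. A minimal sketch of reading the result programmatically, using a simulated random walk instead of `bike_ts`; the helper `rejects_unit_root` and the 0.05 threshold are hypothetical choices for illustration.

```r
library(tseries)

# Simulated random walk: non-stationary by construction; values are simulated.
set.seed(1)
random_walk <- cumsum(rnorm(300))

# Hypothetical helper: TRUE when the ADF test rejects a unit root at level alpha.
rejects_unit_root <- function(x, alpha = 0.05) {
  # adf.test() warns when the p-value falls outside its lookup table; silence that here.
  result <- suppressWarnings(adf.test(x))
  result$p.value < alpha
}

level_result <- rejects_unit_root(random_walk)
diff_result  <- rejects_unit_root(diff(random_walk))

cat("Level series rejects unit root:      ", level_result, "\n")
cat("Differenced series rejects unit root:", diff_result, "\n")
```

The warning shown above ("p-value smaller than printed p-value") means the true p-value lies below the smallest tabulated value, which strengthens rather than weakens the rejection.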
# Fit ARIMA model
fit = auto.arima(bike_ts)
#summary(fit)
# Forecast future values
forecasts = forecast(fit, h = 30) # Forecast the next 30 days
autoplot(forecasts) +
  labs(title = "Bike Rental Forecast For Next 30 Days",
       x = "Date",
       y = "Bike Rented") +
  theme_minimal()
# Plot the forecast with the original time series data
forecasts_with_original = autoplot(bike_ts) +
  autolayer(forecasts, series = "Forecast") +
  labs(title = "Bike Rental Forecast For Next 30 Days",
       x = "Date",
       y = "Bike Rented") +
  theme_minimal()
# Display the plot
forecasts_with_original
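The forecast above is fitted on the full series, so its accuracy cannot be read off the plot. One common way to get a rough accuracy estimate is to hold out the last 30 days, fit on the rest, and compare against a naive benchmark. The sketch below does this on a simulated series; the data, the weekly `frequency = 7`, and the `mae` helper are assumptions for illustration, not part of the original analysis.

```r
library(forecast)

# Simulated daily series with a weekly pattern; values are invented.
set.seed(42)
n      <- 210
weekly <- rep(c(0, 5, 10, 15, 20, 40, 35), length.out = n)
y      <- 500 + weekly + cumsum(rnorm(n, sd = 3))

# Hold out the last 30 days for testing.
h     <- 30
train <- ts(y[1:(n - h)], frequency = 7)
test  <- y[(n - h + 1):n]

# Fit on the training window only, then forecast the holdout period.
fit_train <- auto.arima(train)
fc_arima  <- forecast(fit_train, h = h)
fc_naive  <- naive(train, h = h)   # Benchmark: repeat the last observed value

# Hypothetical helper: mean absolute error on the held-out days.
mae <- function(actual, predicted) mean(abs(actual - as.numeric(predicted)))
mae_arima <- mae(test, fc_arima$mean)
mae_naive <- mae(test, fc_naive$mean)

cat("MAE auto.arima:", round(mae_arima, 2), "\n")
cat("MAE naive:     ", round(mae_naive, 2), "\n")
```

Applied to `bike_ts`, the same split would show whether the ARIMA model beats the naive benchmark over a 30-day horizon.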
The analysis suggests that weather conditions influence bike rental behavior: clear weather appears to drive higher rental rates, indicating a preference for favorable outdoor conditions.
Looking ahead, the forecasted values suggest that bike rental demand will follow seasonal patterns, with potential spikes during periods of clear weather.