library(tidyverse) # collection of data manipulation and visualization packages
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
Warning: package 'tibble' was built under R version 4.5.2
Warning: package 'tidyr' was built under R version 4.5.2
Warning: package 'readr' was built under R version 4.5.2
Warning: package 'purrr' was built under R version 4.5.2
Warning: package 'dplyr' was built under R version 4.5.2
Warning: package 'stringr' was built under R version 4.5.2
Warning: package 'forcats' was built under R version 4.5.2
Warning: package 'lubridate' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr) # data manipulation verbs (filter, mutate, group_by, etc.)library(tseries) # time series analysis tools including ADF stationarity test
Warning: package 'tseries' was built under R version 4.5.2
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
library(lubridate) # simplifies working with date/time objectslibrary(timetk) # time series visualization and feature engineering
Warning: package 'timetk' was built under R version 4.5.2
library(ggplot2) # data visualization (also loaded via tidyverse, explicit for clarity)library(forecast) # provides ARIMA modeling and forecasting tools
Warning: package 'forecast' was built under R version 4.5.2
Load Datasets
# Read in the daily and hourly bike-sharing CSVs from local disk.daily_data <-read.csv("C:/Users/Administrator/Downloads/bike data daily.csv")hourly_data <-read.csv("C:/Users/Administrator/Downloads/bike data hour.csv")# Quick structural overview: column names, data types, and sample valuesstr(daily_data)
# Open both datasets for manual inspectionview(daily_data)view(hourly_data)
Advanced Data Wrangling
# Convert the date column from character/factor to a proper Date object# so it can be used correctly in time series plots and modelsdaily_data[,"dteday"] <-as.Date(daily_data[,"dteday"])# Normalize total bike rental count to [0, 1] range by dividing by its maximum# This makes comparisons across variables on different scales easierdaily_data[,"ncnt"] <- daily_data[,"cnt"] /max(daily_data[,"cnt"])# Normalize the number of registered (non-casual) users the same waydaily_data[,"nr"] <- daily_data[,"registered"] /max(daily_data[,"registered"])# Compute the rental-to-registration ratio:# How many rentals occurred relative to the maximum registered user count# Values > 1 would mean more rentals than registered users (possible with casual riders)daily_data[,"rr"] <- daily_data[,"cnt"] /max(daily_data[,"registered"])# Summary statistics for all columns to check ranges and spot any anomaliessummary(daily_data)
instant dteday season yr
Min. : 1.0 Min. :2011-01-01 Min. :1.000 Min. :0.0000
1st Qu.:183.5 1st Qu.:2011-07-02 1st Qu.:2.000 1st Qu.:0.0000
Median :366.0 Median :2012-01-01 Median :3.000 Median :1.0000
Mean :366.0 Mean :2012-01-01 Mean :2.497 Mean :0.5007
3rd Qu.:548.5 3rd Qu.:2012-07-01 3rd Qu.:3.000 3rd Qu.:1.0000
Max. :731.0 Max. :2012-12-31 Max. :4.000 Max. :1.0000
mnth holiday weekday workingday
Min. : 1.00 Min. :0.00000 Min. :0.000 Min. :0.000
1st Qu.: 4.00 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.000
Median : 7.00 Median :0.00000 Median :3.000 Median :1.000
Mean : 6.52 Mean :0.02873 Mean :2.997 Mean :0.684
3rd Qu.:10.00 3rd Qu.:0.00000 3rd Qu.:5.000 3rd Qu.:1.000
Max. :12.00 Max. :1.00000 Max. :6.000 Max. :1.000
weathersit temp atemp hum
Min. :1.000 Min. :0.05913 Min. :0.07907 Min. :0.0000
1st Qu.:1.000 1st Qu.:0.33708 1st Qu.:0.33784 1st Qu.:0.5200
Median :1.000 Median :0.49833 Median :0.48673 Median :0.6267
Mean :1.395 Mean :0.49538 Mean :0.47435 Mean :0.6279
3rd Qu.:2.000 3rd Qu.:0.65542 3rd Qu.:0.60860 3rd Qu.:0.7302
Max. :3.000 Max. :0.86167 Max. :0.84090 Max. :0.9725
windspeed casual registered cnt
Min. :0.02239 Min. : 2.0 Min. : 20 Min. : 22
1st Qu.:0.13495 1st Qu.: 315.5 1st Qu.:2497 1st Qu.:3152
Median :0.18097 Median : 713.0 Median :3662 Median :4548
Mean :0.19049 Mean : 848.2 Mean :3656 Mean :4504
3rd Qu.:0.23321 3rd Qu.:1096.0 3rd Qu.:4776 3rd Qu.:5956
Max. :0.50746 Max. :3410.0 Max. :6946 Max. :8714
ncnt nr rr
Min. :0.002525 Min. :0.002879 Min. :0.003167
1st Qu.:0.361717 1st Qu.:0.359487 1st Qu.:0.453786
Median :0.521919 Median :0.527210 Median :0.654765
Mean :0.516909 Mean :0.526371 Mean :0.648481
3rd Qu.:0.683498 3rd Qu.:0.687662 3rd Qu.:0.857472
Max. :1.000000 Max. :1.000000 Max. :1.254535
Exploratory Visualization: Temperature
# Plot normalized temperature over time, grouped by year (yr) and colored by season.# Grouping by year creates two faceted panels for easier year-over-year comparison.daily_data %>%group_by(yr) %>%plot_time_series( dteday, temp,.color_var = season,.x_lab ="Date",.y_lab ="Normalized Temperature",.title ="Normalized Temperature vs Date for Daily Data",.interactive =TRUE# renders as an interactive plotly chart )
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the timetk package.
Please report the issue at
<https://github.com/business-science/timetk/issues>.
Exploratory Visualization: Humidity
# Same faceted time series plot for normalized humidity.# Humidity shows more noise/variability than temperature.daily_data %>%group_by(yr) %>%plot_time_series( dteday, hum,.color_var = season,.x_lab ="Date",.y_lab ="Normalized Humidity",.title ="Normalized Humidity vs Date for Daily Data",.interactive =TRUE )
Exploratory Visualization: Wind Speed
# Wind speed is also quite noisy; plotted here for completeness before deciding whether to include it in the model.daily_data %>%group_by(yr) %>%plot_time_series( dteday, windspeed,.color_var = season,.x_lab ="Date",.y_lab ="Normalized Windspeed",.title ="Windspeed vs Date for Daily Data",.interactive =TRUE )
Exploratory Visualization: Normalized Bike Rentals (Total Counts)
# Plotting ncnt (normalized total rentals) reveals the seasonal rental pattern and overall growth trend across the two years.daily_data %>%group_by(yr) %>%plot_time_series( dteday, ncnt,.color_var = season,.x_lab ="Date",.y_lab ="Normalized Bike Rentals",.title ="Normalized Bike Rentals vs Date for Daily Data",.interactive =TRUE )
# nr = normalized registered-user rentals; shows how the registered user base contributes to the overall rental trend.daily_data %>%group_by(yr) %>%plot_time_series( dteday, nr,.color_var = season,.x_lab ="Date",.y_lab ="Normalized Registered Rentals",.title ="Normalized Registered Rentals vs Date for Daily Data",.interactive =TRUE )
Exploratory Visualization: Rental-to-Registration Ratio
# rr = total rentals / max registered users.# Tracks how heavily the system is used relative to its registered user ceiling.daily_data %>%group_by(yr) %>%plot_time_series( dteday, rr,.color_var = season,.x_lab ="Date",.y_lab ="Ratio",.title ="Ratio of Rentals to Registration vs Date for Daily Data",.interactive =TRUE )
Data Cleaning: Outlier Removal via tsclean()
# Humidity and wind speed exhibited excessive noise, so they are dropped from further analysis.# The four variables below are cleaned using tsclean(), which identifies and replaces outliers and missing values with linearly interpolated values, making the series more suitable for ARIMA modeling.daily_data[,"temp"] <-tsclean(daily_data[,"temp"]) # normalized temperaturedaily_data[,"ncnt"] <-tsclean(daily_data[,"ncnt"]) # normalized total rentalsdaily_data[,"nr"] <-tsclean(daily_data[,"nr"]) # normalized registered rentalsdaily_data[,"rr"] <-tsclean(daily_data[,"rr"]) # rental-to-registration ratio# Confirm the cleaned values look sensiblehead(daily_data)
# Re-plot temperature after tsclean() to confirm that outliers have been removed and the series is smoother.daily_data %>%group_by(yr) %>%plot_time_series( dteday, temp,.color_var = season,.x_lab ="Date",.y_lab ="Normalized Temperature",.title ="Normalized Temperature vs Date for Daily Data",.interactive =TRUE )
Post-Cleaning Visualization: Normalized Bike Rentals
daily_data %>%group_by(yr) %>%plot_time_series( dteday, ncnt,.color_var = season,.x_lab ="Date",.y_lab ="Normalized Bike Rentals",.title ="Normalized Bike Rentals vs Date for Daily Data",.interactive =TRUE )
daily_data %>%group_by(yr) %>%plot_time_series( dteday, nr,.color_var = season,.x_lab ="Date",.y_lab ="Normalized Registered Rentals",.title ="Normalized Registered Rentals vs Date for Daily Data",.interactive =TRUE )
Post-Cleaning Visualization: Rental-to-Registration Ratio
daily_data %>%group_by(yr) %>%plot_time_series( dteday, rr,.color_var = season,.x_lab ="Date",.y_lab ="Normalized Registered Rentals",.title ="Ratio of Rentals to Registration vs Date for Daily Data",.interactive =TRUE )
Stationarity Testing: Augmented Dickey-Fuller (ADF) Test
# The ADF test checks whether a time series is stationary (i.e., has no unit root).# H0: the series has a unit root (non-stationary)# H1: the series is stationary# A p-value < 0.05 allows us to reject H0 and conclude the series is stationary,# which is a prerequisite for ARIMA modeling.# Test stationarity of the cleaned temperature seriesdaily_data[,"temp"] %>%adf.test()
Augmented Dickey-Fuller Test
data: .
Dickey-Fuller = -1.6785, Lag order = 9, p-value = 0.7144
alternative hypothesis: stationary
ADF test for normalized total bike rental count
daily_data[,"ncnt"] %>%adf.test()
Augmented Dickey-Fuller Test
data: .
Dickey-Fuller = -1.3084, Lag order = 9, p-value = 0.871
alternative hypothesis: stationary
ADF test for normalized registered rentals
daily_data[,"nr"] %>%adf.test()
Augmented Dickey-Fuller Test
data: .
Dickey-Fuller = -2.418, Lag order = 9, p-value = 0.4014
alternative hypothesis: stationary
ADF test for rental-to-registration ratio
daily_data[,"rr"] %>%adf.test()
Augmented Dickey-Fuller Test
data: .
Dickey-Fuller = -1.3084, Lag order = 9, p-value = 0.871
alternative hypothesis: stationary
Seasonal and Trend decomposition using Loess
# STL (Seasonal and Trend decomposition using Loess) splits a time series into:# [,1] seasonal – repeating within-year pattern# [,2] trend – long-run direction of the series# [,3] remainder – residuals (noise after removing seasonal + trend)# freq = 365 sets annual seasonality for daily data.freq <-365# --- Decompose: Normalized Registered Rentals (nr) ---------------------------norm_rentals <-ts(daily_data[, "nr"], frequency = freq) # convert to ts objectdecomped1 <-stl(norm_rentals, "periodic") # STL decomposition# Plot the trend component to visualize the long-run rental growthplot(decomped1$time.series[, 2],ylab ="Stationary of the Normalized Rental Reservations",xlab ="Daily of the Year")
Residual diagnostics for the nr decomposition:
# check residuals() plots the remainder, its ACF, and a Ljung-Box test.# Ideally residuals should look like white noise (no autocorrelation).checkresiduals(decomped1$time.series[, 3])
Ljung-Box test
data: Residuals
Q* = 1997.3, df = 146, p-value < 2.2e-16
Model df: 0. Total lags used: 146
Decompose: Normalized Total Rental Count (ncnt)
norm_cnt <-ts(daily_data[, "ncnt"], frequency = freq)decomped2 <-stl(norm_cnt, "periodic")# Plot the trend component for total rental countsplot(decomped2$time.series[, 2],ylab ="Stationary of the Normalized Rental Counts",xlab ="Daily of the Year")
Residual diagnostics for the ncnt decomposition
checkresiduals(decomped2$time.series[, 3])
Ljung-Box test
data: Residuals
Q* = 1437.8, df = 146, p-value < 2.2e-16
Model df: 0. Total lags used: 146
Decompose: Rental-to-Registration Ratio (rr)
norm_rr <-ts(daily_data[, "rr"], frequency = freq)decomped3 <-stl(norm_rr, "periodic")# Plot the trend component for the rental ratioplot(decomped3$time.series[, 2],ylab ="Stationary of the Normalized Rental Counts to Reservations",xlab ="Daily of the Year")
Residual diagnostics for the rr decomposition
checkresiduals(decomped3$time.series[, 3])
Ljung-Box test
data: Residuals
Q* = 1437.8, df = 146, p-value < 2.2e-16
Model df: 0. Total lags used: 146
Normality Testing on STL Residuals (Shapiro-Wilk)
# The Shapiro-Wilk test checks whether residuals follow a normal distribution.# H0: the residuals are normally distributed# p > 0.05 → fail to reject H0 (residuals appear normal, which is desirable)# This is important for validating ARIMA model assumptions.print("-------Noramlized Rental Reservations")
[1] "-------Noramlized Rental Reservations"
# Shapiro-Wilk test on STL residuals for normalized registered rentalsshapiro.test(decomped1$time.series[, 3])
Shapiro-Wilk normality test
data: decomped1$time.series[, 3]
W = 0.99554, p-value = 0.03334
print("-------Normalized Count of Rentals")
[1] "-------Normalized Count of Rentals"
# Note: this re-uses decomped1 residuals; consider using decomped2 for ncntshapiro.test(decomped1$time.series[, 3])
Shapiro-Wilk normality test
data: decomped1$time.series[, 3]
W = 0.99554, p-value = 0.03334
print("-------Normalized Ratio of Rentals to Reservations")
[1] "-------Normalized Ratio of Rentals to Reservations"
# Shapiro-Wilk test on STL residuals for the rental-to-registration ratioshapiro.test(decomped3$time.series[, 3])
Shapiro-Wilk normality test
data: decomped3$time.series[, 3]
W = 0.99702, p-value = 0.1993
ARIMA Modeling & Forecasting: Total Bike Count (ncnt)
# auto.arima() automatically selects the best-fitting ARIMA(p,d,q)(P,D,Q)[m]# model by minimizing AICc. seasonal = TRUE allows it to fit a seasonal component.fit1 <-auto.arima(norm_cnt, seasonal =TRUE)# Histogram of model residuals: should be roughly bell-shaped and centered at 0# if the model has adequately captured the signal in the data.hist(fit1$residuals,xlab ="Residual",ylab ="Distribution",main ="Histogram of Model Errors - Bike Count")
normality test: ARIMA Residuals
# Formal normality test on ARIMA residuals for the bike count modelshapiro.test(fit1$residuals)
Shapiro-Wilk normality test
data: fit1$residuals
W = 0.83122, p-value < 2.2e-16
Forecast using fit1
# Forecast the next 25 days of normalized bike rental counts using fit1.# The shaded bands represent 80% and 95% prediction intervals.prediction1 <-forecast(fit1, 25)plot(prediction1,xlab ="Date",ylab ="Normalized Count of Rentals",main ="Prediction of Bike Rental Counts")
# Fit a seasonal ARIMA model to the normalized registered-user rental series.fit2 <-auto.arima(norm_rentals, seasonal =TRUE)# Inspect residuals histogram to check for model fit qualityhist(fit2$residuals,xlab ="Residual",ylab ="Distribution",main ="Histogram of Model Errors - Rental Count")
Normality test: reregistered rentals
# Normality test on residuals for the registered-rentals ARIMA modelshapiro.test(fit2$residuals)
Shapiro-Wilk normality test
data: fit2$residuals
W = 0.85305, p-value < 2.2e-16
forecast using fit2
# Forecast the next 25 days of normalized registered rentals using fit2prediction2 <-forecast(fit2, 25)plot(prediction2,xlab ="Date",ylab ="Normalized Rentals",main ="Prediction of Bike Rentals")
ARIMA Modeling & Forecasting: Rental-to-Registration Ratio
# Note: fit2 is re-assigned here to a model fitted on norm_cnt (same as fit1).# Consider renaming to fit3 and fitting on norm_rr to model the ratio specifically.fit2 <-auto.arima(norm_cnt, seasonal =TRUE)# Residuals histogram for this third modelhist(fit2$residuals,xlab ="Residual",ylab ="Distribution",main ="Histogram of Model Errors - Count to Rental Ratio")
Normality test on residuals for this model
shapiro.test(fit2$residuals)
Shapiro-Wilk normality test
data: fit2$residuals
W = 0.83122, p-value < 2.2e-16
Final forecast
# Forecast the next 25 days and plot with prediction intervalsprediction3 <-forecast(fit2, 25)plot(prediction3,xlab ="Date",ylab ="Normalized Rental Ratio",main ="Prediction of Bike Rentals to Reservations")
Comment
Findings and Conclusions After processing the raw data and using the ARIMA package to model ride share data, I was able to make predictions for the 25 days beyond the current data set. Qualitatively the data shows that was the weather gets warmer the number of bike rentals increase, and over the course of two years the number of rentals increases over the number of rentals from the previous year. As the data terminates at the end of one cycle, I expect the number of rentals to increase to a level higher than it was a year before, which is what the models are predicting. Therefore the results were what I expected the data appears to oscillate up and down over a 1 year period with the overall data moving towards higher rental numbers.