This RMarkdown file contains the report of the data analysis for the project on forecasting daily bike rental demand using time series models in R. It covers data exploration, summary statistics, and the construction and evaluation of the time series models. The final report was completed on Sun Oct 27 07:14:25 2024.
Data Description:
This dataset contains the daily count of rental bike transactions in the Capital Bikeshare system between 2011 and 2012, together with the corresponding weather and seasonal information.
Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
Relevant Paper:
Fanaee-T, Hadi, and Gama, João. "Event labeling combining ensemble detectors and background knowledge." Progress in Artificial Intelligence (2013): 1-15. Springer Berlin Heidelberg.
## Import required packages
# Install (if needed) and load the required packages
packages <- c("tidyverse", "ggplot2", "dplyr", "dbplyr", "timetk", "tseries", "forecast")
for (pkg in packages) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg)
    library(pkg, character.only = TRUE)
  }
}
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: dbplyr
##
##
## Attaching package: 'dbplyr'
##
##
## The following objects are masked from 'package:dplyr':
##
## ident, sql
##
##
## Loading required package: timetk
##
## Loading required package: tseries
##
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Loading required package: forecast
# Load the daily data and parse the date column
daydata <- read.csv("day.csv")
daydata[, "dteday"] <- as.Date(daydata[, "dteday"])
# Normalize total and registered rentals by their maxima; rr is total rentals
# relative to the maximum number of registered rentals, so it can exceed 1
daydata[, "ncnt"] <- daydata[, "cnt"] / max(daydata[, "cnt"])
daydata[, "nr"] <- daydata[, "registered"] / max(daydata[, "registered"])
daydata[, "rr"] <- daydata[, "cnt"] / max(daydata[, "registered"])
summary(daydata)
## instant dteday season yr
## Min. : 1.0 Min. :2011-01-01 Min. :1.000 Min. :0.0000
## 1st Qu.:183.5 1st Qu.:2011-07-02 1st Qu.:2.000 1st Qu.:0.0000
## Median :366.0 Median :2012-01-01 Median :3.000 Median :1.0000
## Mean :366.0 Mean :2012-01-01 Mean :2.497 Mean :0.5007
## 3rd Qu.:548.5 3rd Qu.:2012-07-01 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :731.0 Max. :2012-12-31 Max. :4.000 Max. :1.0000
## mnth holiday weekday workingday
## Min. : 1.00 Min. :0.00000 Min. :0.000 Min. :0.000
## 1st Qu.: 4.00 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.000
## Median : 7.00 Median :0.00000 Median :3.000 Median :1.000
## Mean : 6.52 Mean :0.02873 Mean :2.997 Mean :0.684
## 3rd Qu.:10.00 3rd Qu.:0.00000 3rd Qu.:5.000 3rd Qu.:1.000
## Max. :12.00 Max. :1.00000 Max. :6.000 Max. :1.000
## weathersit temp atemp hum
## Min. :1.000 Min. :0.05913 Min. :0.07907 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.33708 1st Qu.:0.33784 1st Qu.:0.5200
## Median :1.000 Median :0.49833 Median :0.48673 Median :0.6267
## Mean :1.395 Mean :0.49538 Mean :0.47435 Mean :0.6279
## 3rd Qu.:2.000 3rd Qu.:0.65542 3rd Qu.:0.60860 3rd Qu.:0.7302
## Max. :3.000 Max. :0.86167 Max. :0.84090 Max. :0.9725
## windspeed casual registered cnt
## Min. :0.02239 Min. : 2.0 Min. : 20 Min. : 22
## 1st Qu.:0.13495 1st Qu.: 315.5 1st Qu.:2497 1st Qu.:3152
## Median :0.18097 Median : 713.0 Median :3662 Median :4548
## Mean :0.19049 Mean : 848.2 Mean :3656 Mean :4504
## 3rd Qu.:0.23321 3rd Qu.:1096.0 3rd Qu.:4776 3rd Qu.:5956
## Max. :0.50746 Max. :3410.0 Max. :6946 Max. :8714
## ncnt nr rr
## Min. :0.002525 Min. :0.002879 Min. :0.003167
## 1st Qu.:0.361717 1st Qu.:0.359488 1st Qu.:0.453786
## Median :0.521919 Median :0.527210 Median :0.654765
## Mean :0.516909 Mean :0.526371 Mean :0.648481
## 3rd Qu.:0.683498 3rd Qu.:0.687662 3rd Qu.:0.857472
## Max. :1.000000 Max. :1.000000 Max. :1.254535
## Read about the timetk package
# ?timetk
daydata %>% group_by(yr) %>% plot_time_series(dteday, temp, .color_var = season, .x_lab = "Date", .y_lab = "Normalized Temperature",
.title = "Normalized Temperature vs Date for Day Data", .interactive = TRUE)
daydata %>% group_by(yr) %>% plot_time_series(dteday, hum, .color_var = season, .x_lab = "Date", .y_lab = "Normalized Humidity",
.title = "Normalized Humidity vs Date for Day Data", .interactive = TRUE)
daydata %>% group_by(yr) %>% plot_time_series(dteday, ncnt, .color_var = season, .x_lab = "Date", .y_lab = "Normalized Bike Rentals",
.title = "Normalized Bike Rentals vs Date for Day Data", .interactive = TRUE)
daydata %>% group_by(yr) %>% plot_time_series(dteday, nr, .color_var = season, .x_lab = "Date", .y_lab = "Normalized Registered Rentals",
.title = "Normalized Registered Rentals vs Date for Day Data", .interactive = TRUE)
daydata[,"temp"] <- tsclean(daydata[,"temp"])
daydata[,"ncnt"] <- tsclean(daydata[,"ncnt"])
daydata[,"nr"] <- tsclean(daydata[,"nr"])
daydata[,"rr"] <- tsclean(daydata[,"rr"])
summary(daydata)
## instant dteday season yr
## Min. : 1.0 Min. :2011-01-01 Min. :1.000 Min. :0.0000
## 1st Qu.:183.5 1st Qu.:2011-07-02 1st Qu.:2.000 1st Qu.:0.0000
## Median :366.0 Median :2012-01-01 Median :3.000 Median :1.0000
## Mean :366.0 Mean :2012-01-01 Mean :2.497 Mean :0.5007
## 3rd Qu.:548.5 3rd Qu.:2012-07-01 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :731.0 Max. :2012-12-31 Max. :4.000 Max. :1.0000
## mnth holiday weekday workingday
## Min. : 1.00 Min. :0.00000 Min. :0.000 Min. :0.000
## 1st Qu.: 4.00 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.000
## Median : 7.00 Median :0.00000 Median :3.000 Median :1.000
## Mean : 6.52 Mean :0.02873 Mean :2.997 Mean :0.684
## 3rd Qu.:10.00 3rd Qu.:0.00000 3rd Qu.:5.000 3rd Qu.:1.000
## Max. :12.00 Max. :1.00000 Max. :6.000 Max. :1.000
## weathersit temp atemp hum
## Min. :1.000 Min. :0.05913 Min. :0.07907 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.33708 1st Qu.:0.33784 1st Qu.:0.5200
## Median :1.000 Median :0.49833 Median :0.48673 Median :0.6267
## Mean :1.395 Mean :0.49538 Mean :0.47435 Mean :0.6279
## 3rd Qu.:2.000 3rd Qu.:0.65542 3rd Qu.:0.60860 3rd Qu.:0.7302
## Max. :3.000 Max. :0.86167 Max. :0.84090 Max. :0.9725
## windspeed casual registered cnt
## Min. :0.02239 Min. : 2.0 Min. : 20 Min. : 22
## 1st Qu.:0.13495 1st Qu.: 315.5 1st Qu.:2497 1st Qu.:3152
## Median :0.18097 Median : 713.0 Median :3662 Median :4548
## Mean :0.19049 Mean : 848.2 Mean :3656 Mean :4504
## 3rd Qu.:0.23321 3rd Qu.:1096.0 3rd Qu.:4776 3rd Qu.:5956
## Max. :0.50746 Max. :3410.0 Max. :6946 Max. :8714
## ncnt nr rr
## Min. :0.04946 Min. :0.05989 Min. :0.06205
## 1st Qu.:0.36826 1st Qu.:0.36100 1st Qu.:0.46199
## Median :0.52444 Median :0.52865 Median :0.65793
## Mean :0.51986 Mean :0.52838 Mean :0.65218
## 3rd Qu.:0.68637 3rd Qu.:0.68766 3rd Qu.:0.86107
## Max. :1.00000 Max. :1.00000 Max. :1.25453
daydata %>% group_by(yr) %>% plot_time_series(dteday, temp, .color_var = season, .x_lab = "Date", .y_lab = "Normalized Temperature",
.title = "Normalized Temperature vs Date for Day Data", .interactive = TRUE)
daydata %>% group_by(yr) %>% plot_time_series(dteday, hum, .color_var = season, .x_lab = "Date", .y_lab = "Normalized Humidity",
.title = "Normalized Humidity vs Date for Day Data", .interactive = TRUE)
daydata %>% group_by(yr) %>% plot_time_series(dteday, ncnt, .color_var = season, .x_lab = "Date", .y_lab = "Normalized Bike Rentals",
.title = "Normalized Bike Rentals vs Date for Day Data", .interactive = TRUE)
daydata %>% group_by(yr) %>% plot_time_series(dteday, nr, .color_var = season, .x_lab = "Date", .y_lab = "Normalized Registered Rentals",
.title = "Normalized Registered Rentals vs Date for Day Data", .interactive = TRUE)
daydata[,"temp"] %>% adf.test()
##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -1.6785, Lag order = 9, p-value = 0.7144
## alternative hypothesis: stationary
daydata[,"ncnt"] %>% adf.test()
##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -1.3084, Lag order = 9, p-value = 0.871
## alternative hypothesis: stationary
daydata[,"nr"] %>% adf.test()
##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -2.418, Lag order = 9, p-value = 0.4014
## alternative hypothesis: stationary
daydata[,"rr"] %>% adf.test()
##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -1.3084, Lag order = 9, p-value = 0.871
## alternative hypothesis: stationary
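All four Augmented Dickey-Fuller tests fail to reject the null hypothesis of a unit root (p-values well above 0.05), so the raw series are treated as non-stationary. A minimal sketch of how one could confirm the order of differencing required and re-test a differenced series, using ndiffs() from the forecast package:
# Sketch: suggested number of first differences, then ADF test on the differenced series
ndiffs(daydata[, "ncnt"])
diff(daydata[, "ncnt"]) %>% adf.test()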
# Decompose each series with STL, using a yearly frequency
freq <- 365
# Normalized registered rentals
norm_rentals <- ts(daydata[, "nr"], frequency = freq)
decomped1 <- stl(norm_rentals, s.window = "periodic")
plot(decomped1$time.series[, 2], ylab = "Trend of the Normalized Registered Rentals",
     xlab = "Time (years)")
checkresiduals(decomped1$time.series[, 3])
##
## Ljung-Box test
##
## data: Residuals
## Q* = 1997.3, df = 146, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 146
shapiro.test(decomped1$time.series[, 3])
##
## Shapiro-Wilk normality test
##
## data: decomped1$time.series[, 3]
## W = 0.99554, p-value = 0.03334
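The plot above shows only the trend component (column 2 of time.series). As a quick sketch, the seasonal, trend, and remainder components of the same decomposition can be viewed together with the base plot method for stl objects:
# Sketch: plot all STL components (seasonal, trend, remainder) in one figure
plot(decomped1, main = "STL Decomposition of Normalized Registered Rentals")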
# Normalized total rental counts
norm_cnt <- ts(daydata[, "ncnt"], frequency = freq)
decomped2 <- stl(norm_cnt, s.window = "periodic")
plot(decomped2$time.series[, 2], ylab = "Trend of the Normalized Rental Counts",
     xlab = "Time (years)")
checkresiduals(decomped2$time.series[, 3])
##
## Ljung-Box test
##
## data: Residuals
## Q* = 1437.8, df = 146, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 146
shapiro.test(decomped2$time.series[, 3])
##
## Shapiro-Wilk normality test
##
## data: decomped2$time.series[, 3]
## W = 0.99702, p-value = 0.1993
# Normalized ratio of total rentals to registered rentals
norm_rr <- ts(daydata[, "rr"], frequency = freq)
decomped3 <- stl(norm_rr, s.window = "periodic")
plot(decomped3$time.series[, 2], ylab = "Trend of the Normalized Ratio of Rentals to Registered Rentals",
     xlab = "Time (years)")
checkresiduals(decomped3$time.series[, 3])
##
## Ljung-Box test
##
## data: Residuals
## Q* = 1437.8, df = 146, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 146
shapiro.test(decomped3$time.series[, 3])
##
## Shapiro-Wilk normality test
##
## data: decomped3$time.series[, 3]
## W = 0.99702, p-value = 0.1993
# ARIMA model for the normalized bike rental counts
fit1 <- auto.arima(norm_cnt, seasonal = TRUE)
hist(fit1$residuals, xlab = "Residual", ylab = "Frequency", main = "Histogram of Model Errors - Bike Count")
shapiro.test(fit1$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit1$residuals
## W = 0.83122, p-value < 2.2e-16
prediction1 <- forecast(fit1, 25)
plot(prediction1, xlab = "Date", ylab = "Normalized Count of Rentals", main = "Prediction of Bike Rental Counts")
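With frequency = 365, auto.arima() may not fit an explicitly seasonal model, since seasonal ARIMA terms become impractical for such long seasonal periods, and the annual cycle then has to be absorbed by the non-seasonal terms. A common alternative, sketched here under that assumption with illustrative names (fit1_fourier, prediction1_fourier) and a small number of harmonics, is to capture the yearly seasonality with Fourier regressors and fit a non-seasonal ARIMA to the rest:
# Sketch: yearly seasonality via Fourier terms, non-seasonal ARIMA for the remainder
K <- 2  # number of Fourier harmonics (illustrative choice)
fit1_fourier <- auto.arima(norm_cnt, seasonal = FALSE, xreg = fourier(norm_cnt, K = K))
prediction1_fourier <- forecast(fit1_fourier, xreg = fourier(norm_cnt, K = K, h = 25))
plot(prediction1_fourier, xlab = "Date", ylab = "Normalized Count of Rentals",
     main = "Fourier-Term ARIMA Prediction of Bike Rental Counts")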
# ARIMA model for the normalized registered rentals
fit2 <- auto.arima(norm_rentals, seasonal = TRUE)
hist(fit2$residuals, xlab = "Residual", ylab = "Frequency", main = "Histogram of Model Errors - Rental Count")
shapiro.test(fit2$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit2$residuals
## W = 0.85305, p-value < 2.2e-16
prediction2 <- forecast(fit2, 25)
plot(prediction2, xlab = "Date", ylab = "Normalized Rentals", main = "Prediction of Bike Rentals")
# ARIMA model for the normalized ratio of rentals to registered rentals
fit3 <- auto.arima(norm_rr, seasonal = TRUE)
hist(fit3$residuals, xlab = "Residual", ylab = "Frequency", main = "Histogram of Model Errors - Count to Rental Ratio")
shapiro.test(fit3$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit3$residuals
## W = 0.83122, p-value < 2.2e-16
prediction3 <- forecast(fit3, 25)
plot(prediction3, xlab = "Date", ylab = "Normalized Rental Ratio", main = "Prediction of Bike Rentals to Reservations")
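The three models above are fit to the full series, so there is no held-out data against which to quantify forecast error. A minimal sketch of an out-of-sample check, assuming the last 25 days are held back as a test set (train, test, and fit_train are illustrative names):
# Sketch: hold out the last 25 days and compare forecasts against them
n <- length(norm_cnt)
train <- window(norm_cnt, end = time(norm_cnt)[n - 25])
test <- window(norm_cnt, start = time(norm_cnt)[n - 24])
fit_train <- auto.arima(train, seasonal = TRUE)
accuracy(forecast(fit_train, h = 25), test)  # RMSE, MAE, MAPE on training and test sets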
"By used the ARIMA package to model ride share data, I was able to make predictions for the 25 days beyond the current data set. Qualitatively the data shows that was the weather gets warmer the number of bike rentals increase, and over the course of two years the number of rentals increases over the number of rentals from the previous year. As the data terminates at the end of one cycle, I expect the number of rentals to increase to a level higher than it was a year before, which is what the models are predicting. Therefore the results were what I expected - the data appears to oscillate up and down over a 1 year period with the overall data moving towards higher rental numbers."
## [1] "By used the ARIMA package to model ride share data, I was able to make predictions for the 25 days beyond the current data set. Qualitatively the data shows that was the weather gets warmer the number of bike rentals increase, and over the course of two years the number of rentals increases over the number of rentals from the previous year. As the data terminates at the end of one cycle, I expect the number of rentals to increase to a level higher than it was a year before, which is what the models are predicting. Therefore the results were what I expected - the data appears to oscillate up and down over a 1 year period with the overall data moving towards higher rental numbers."