About Data Analysis Report

This RMarkdown file contains the report of the data analysis done for the project on forecasting daily bike rental demand using time series models in R. It contains analysis such as data exploration, summary statistics and building the time series models. The final report was completed on Sat Sep 16 10:57:04 2023.

Data Description:

This dataset contains the daily count of rental bike transactions between years 2011 and 2012 in Capital bikeshare system with the corresponding weather and seasonal information.

Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset

Relevant Paper:

Fanaee-T, Hadi, and Gama, Joao. Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg

Task One: Load and explore the data

Load data and install packages

if(!require('tidyverse')) {
  install.packages('tidyverse')
  library('tidyverse')
}
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
if(!require('lubridate')) {
  install.packages('lubridate')
  library('lubridate')
}
if(!require('ggplot2')) {
  install.packages('ggplot2')
  library('ggplot2')
}
if(!require('timetk')) {
  install.packages('timetk')
  library('timetk')
}
## Loading required package: timetk
if(!require('dbplyr')) {
  install.packages('dbplyr')
  library('dbplyr')
}
## Loading required package: dbplyr
## 
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
## 
##     ident, sql
if(!require('tseries')) {
  install.packages('tseries')
  library('tseries')
}
## Loading required package: tseries
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
if(!require('forecast')) {
  install.packages('forecast')
  library('forecast')
}
## Loading required package: forecast
setwd("C:/Users/kenne/Desktop/WashingtonDCBikeShare/WashingtonDCBikeShare")
Data_day <- read.csv("day.csv")
Data_hour <- read.csv("hour.csv")

Describe and explore the data

Data_day <- mutate(Data_day,dteday = as.Date(dteday))
Data_day$ncnt <- Data_day$cnt / max(Data_day$cnt)
Data_day$nr <- Data_day$registered / max(Data_day$registered)
Data_day$rr <- Data_day$cnt / max(Data_day$registered)
summary(Data_day)
##     instant          dteday               season            yr        
##  Min.   :  1.0   Min.   :2011-01-01   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:183.5   1st Qu.:2011-07-02   1st Qu.:2.000   1st Qu.:0.0000  
##  Median :366.0   Median :2012-01-01   Median :3.000   Median :1.0000  
##  Mean   :366.0   Mean   :2012-01-01   Mean   :2.497   Mean   :0.5007  
##  3rd Qu.:548.5   3rd Qu.:2012-07-01   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :731.0   Max.   :2012-12-31   Max.   :4.000   Max.   :1.0000  
##       mnth          holiday           weekday        workingday   
##  Min.   : 1.00   Min.   :0.00000   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 4.00   1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.000  
##  Median : 7.00   Median :0.00000   Median :3.000   Median :1.000  
##  Mean   : 6.52   Mean   :0.02873   Mean   :2.997   Mean   :0.684  
##  3rd Qu.:10.00   3rd Qu.:0.00000   3rd Qu.:5.000   3rd Qu.:1.000  
##  Max.   :12.00   Max.   :1.00000   Max.   :6.000   Max.   :1.000  
##    weathersit         temp             atemp              hum        
##  Min.   :1.000   Min.   :0.05913   Min.   :0.07907   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:0.33708   1st Qu.:0.33784   1st Qu.:0.5200  
##  Median :1.000   Median :0.49833   Median :0.48673   Median :0.6267  
##  Mean   :1.395   Mean   :0.49538   Mean   :0.47435   Mean   :0.6279  
##  3rd Qu.:2.000   3rd Qu.:0.65542   3rd Qu.:0.60860   3rd Qu.:0.7302  
##  Max.   :3.000   Max.   :0.86167   Max.   :0.84090   Max.   :0.9725  
##    windspeed           casual         registered        cnt      
##  Min.   :0.02239   Min.   :   2.0   Min.   :  20   Min.   :  22  
##  1st Qu.:0.13495   1st Qu.: 315.5   1st Qu.:2497   1st Qu.:3152  
##  Median :0.18097   Median : 713.0   Median :3662   Median :4548  
##  Mean   :0.19049   Mean   : 848.2   Mean   :3656   Mean   :4504  
##  3rd Qu.:0.23321   3rd Qu.:1096.0   3rd Qu.:4776   3rd Qu.:5956  
##  Max.   :0.50746   Max.   :3410.0   Max.   :6946   Max.   :8714  
##       ncnt                nr                 rr          
##  Min.   :0.002525   Min.   :0.002879   Min.   :0.003167  
##  1st Qu.:0.361717   1st Qu.:0.359488   1st Qu.:0.453786  
##  Median :0.521919   Median :0.527210   Median :0.654765  
##  Mean   :0.516909   Mean   :0.526371   Mean   :0.648481  
##  3rd Qu.:0.683498   3rd Qu.:0.687662   3rd Qu.:0.857472  
##  Max.   :1.000000   Max.   :1.000000   Max.   :1.254535

Task Two: Create interactive time series plots

Data_day %>% 
  group_by(yr) %>% 
  plot_time_series(dteday
                   , temp
                   , .color_var = season
                   , .x_lab = "Date"
                   , .y_lab = "Temperature"
                   , .title = "Normalized Temperature vs Date"
                   , .interactive = TRUE)
Data_day %>%
  group_by(yr) %>%
  plot_time_series(dteday
                   , hum
                   , .color_var = season
                   , .x_lab = "Date"
                   , .y_lab = "Humidity"
                   , .title = "Normalized Humidity vs. Date"
                   , .interactive = TRUE)
Data_day %>%
  group_by(yr) %>%
  plot_time_series(dteday
                   , windspeed
                   , .color_var = season
                   , .x_lab = "Date" 
                   , .y_lab = "Windspeed"
                   , .title = "Normalized Windspeed vs Date" 
                   , .interactive = TRUE)
Data_day %>%
  group_by(yr) %>%
  plot_time_series(dteday
                   , ncnt
                   , .color_var = season
                   , .x_lab = "Date"
                   , .y_lab = "No. of Rental bikes including Casual and Registered"
                   , .title = "No. of Rental bikes including Casual and Registered vs. Date"
                   , .interactive = TRUE)
Data_day %>%
  group_by(yr) %>%
  plot_time_series(dteday
                   , nr
                   , .color_var = season
                   , .x_lab = "Date"
                   , .y_lab = "No. of Registered users"
                   , .title = "No. of Registered users vs. Date"
                   , .interactive = TRUE)
Data_day %>%
  group_by(yr) %>%
  plot_time_series(dteday
                   , rr
                   , .color_var = season
                   , .x_lab = "Date"
                   , .y_lab = "Ratio of Registered users"
                   , .title = "Ratio of Registered users vs. Date"
                   , .interactive = TRUE) 

Task Three: Smooth time series data

Data_day$temp <- tsclean(Data_day$temp)
Data_day$ncnt <- tsclean(Data_day$ncnt)
Data_day$nr <- tsclean(Data_day$nr)
Data_day$rr <- tsclean(Data_day$rr)
head(Data_day)
##   instant     dteday season yr mnth holiday weekday workingday weathersit
## 1       1 2011-01-01      1  0    1       0       6          0          2
## 2       2 2011-01-02      1  0    1       0       0          0          2
## 3       3 2011-01-03      1  0    1       0       1          1          1
## 4       4 2011-01-04      1  0    1       0       2          1          1
## 5       5 2011-01-05      1  0    1       0       3          1          1
## 6       6 2011-01-06      1  0    1       0       4          1          1
##       temp    atemp      hum windspeed casual registered  cnt       ncnt
## 1 0.344167 0.363625 0.805833 0.1604460    331        654  985 0.11303649
## 2 0.363478 0.353739 0.696087 0.2485390    131        670  801 0.09192105
## 3 0.196364 0.189405 0.437273 0.2483090    120       1229 1349 0.15480835
## 4 0.200000 0.212122 0.590435 0.1602960    108       1454 1562 0.17925178
## 5 0.226957 0.229270 0.436957 0.1869000     82       1518 1600 0.18361258
## 6 0.204348 0.233209 0.518261 0.0895652     88       1518 1606 0.18430112
##           nr        rr
## 1 0.09415491 0.1418082
## 2 0.09645839 0.1153182
## 3 0.17693637 0.1942125
## 4 0.20932911 0.2248776
## 5 0.21854305 0.2303484
## 6 0.21854305 0.2312122
Data_day %>% 
  group_by(yr) %>% 
  plot_time_series(dteday
                   , temp
                   , .color_var = season
                   , .x_lab = "Date"
                   , .y_lab = "Temperature"
                   , .title = "Normalized Temperature vs Date"
                   , .interactive = TRUE)
Data_day %>%
  group_by(yr) %>%
  plot_time_series(dteday
                   , hum
                   , .color_var = season
                   , .x_lab = "Date"
                   , .y_lab = "Humidity"
                   , .title = "Normalized Humidity vs. Date"
                   , .interactive = TRUE)
Data_day %>%
  group_by(yr) %>%
  plot_time_series(dteday
                   , windspeed
                   , .color_var = season
                   , .x_lab = "Date" 
                   , .y_lab = "Windspeed"
                   , .title = "Normalized Windspeed vs Date" 
                   , .interactive = TRUE)
Data_day %>%
  group_by(yr) %>%
  plot_time_series(dteday
                   , ncnt
                   , .color_var = season
                   , .x_lab = "Date"
                   , .y_lab = "No. of Rental bikes including Casual and Registered"
                   , .title = "No. of Rental bikes including Casual and Registered vs. Date"
                   , .interactive = TRUE)
Data_day %>%
  group_by(yr) %>%
  plot_time_series(dteday
                   , nr
                   , .color_var = season
                   , .x_lab = "Date"
                   , .y_lab = "No. of Registered users"
                   , .title = "No. of Registered users vs. Date"
                   , .interactive = TRUE)
Data_day %>%
  group_by(yr) %>%
  plot_time_series(dteday
                   , rr
                   , .color_var = season
                   , .x_lab = "Date"
                   , .y_lab = "Ratio of Registered users"
                   , .title = "Ratio of Registered users vs. Date"
                   , .interactive = TRUE) 

Task Four: Decompose and assess the stationarity of time series data

Data_day$temp %>% adf.test()
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -1.6785, Lag order = 9, p-value = 0.7144
## alternative hypothesis: stationary
Data_day$hum %>% adf.test()
## Warning in adf.test(.): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -6.3675, Lag order = 9, p-value = 0.01
## alternative hypothesis: stationary
Data_day$windspeed %>% adf.test()
## Warning in adf.test(.): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -7.1391, Lag order = 9, p-value = 0.01
## alternative hypothesis: stationary
Data_day$ncnt %>% adf.test()
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -1.3084, Lag order = 9, p-value = 0.871
## alternative hypothesis: stationary
Data_day$nr %>% adf.test()
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -2.418, Lag order = 9, p-value = 0.4014
## alternative hypothesis: stationary
Data_day$rr %>% adf.test()
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -1.3084, Lag order = 9, p-value = 0.871
## alternative hypothesis: stationary
freq <- 365
norm_rentals <- ts(Data_day$nr, frequency = freq)
decompd <- stl(norm_rentals, "periodic")
plot(decompd$time.series[,2], ylab = "Stationary of the Normalized Rental Reservations", 
     xlab = "Day of the Year")

checkresiduals(decompd$time.series[, 3])

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 1997.3, df = 146, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 146
norm_cnt <- ts(Data_day$ncnt, frequency = freq)
decompd2 <- stl(norm_cnt, "periodic")
plot(decompd2$time.series[,2], ylab = "Stationary of the Normalized Rental Counts", 
     xlab = "Day of the Year")

checkresiduals(decompd2$time.series[, 3])

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 1437.8, df = 146, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 146
norm_rr <- ts(Data_day$rr, frequency = freq)
decompd3 <- stl(norm_rr, "periodic")
plot(decompd3$time.series[,2], ylab = "Stationary of the Normalized Rental Counts to Reservations", 
     xlab = "Day of the Year")

checkresiduals(decompd3$time.series[, 3])

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 1437.8, df = 146, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 146
shapiro.test(decompd$time.series[, 3])
## 
##  Shapiro-Wilk normality test
## 
## data:  decompd$time.series[, 3]
## W = 0.99554, p-value = 0.03334
shapiro.test(decompd2$time.series[, 3])
## 
##  Shapiro-Wilk normality test
## 
## data:  decompd2$time.series[, 3]
## W = 0.99702, p-value = 0.1993
shapiro.test(decompd3$time.series[, 3])
## 
##  Shapiro-Wilk normality test
## 
## data:  decompd3$time.series[, 3]
## W = 0.99702, p-value = 0.1993

Task Five: Fit and forecast time series data using ARIMA models

fit1 <- auto.arima(norm_cnt, seasonal = TRUE, )
hist(fit1$residuals, xlab = "Residual", ylab = "Distribution", main = "Histogram of Model Errors (Bike Count)")

shapiro.test(fit1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit1$residuals
## W = 0.83122, p-value < 2.2e-16
prediction1 <- forecast(fit1, 30)
plot(prediction1, xlab = "Date", ylab = "Normalized Count of Rentals", main = "Prediction of Bike Rental Counts")

fit2 <- auto.arima(norm_rentals, seasonal = TRUE, )
hist(fit2$residuals, xlab = "Residual", ylab = "Distribution", main = "Histogram of Model Errors (Rental Count)")

shapiro.test(fit2$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit2$residuals
## W = 0.85305, p-value < 2.2e-16
prediction2 <- forecast(fit2, 30)
plot(prediction2, xlab = "Date", ylab = "Normalized Rentals", main = "Prediction of Bike Rentals")

fit3 <- auto.arima(norm_cnt, seasonal = TRUE, )
hist(fit3$residuals, xlab = "Residual", ylab = "Distribution", main = "Histogram of Model Errors (Count to Rental Ratio)")

shapiro.test(fit3$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit3$residuals
## W = 0.83122, p-value < 2.2e-16
prediction3 <- forecast(fit3, 30)
plot(prediction3, xlab = "Date", ylab = "Normalized Rental Ratio", main = "Prediction of Bike Rentals to Reservations")

Task Six: Findings and Conclusions

After processing the raw data and employing the ARIMA package to model ride share data, I successfully generated predictions for the upcoming 30 days beyond the current dataset. In a qualitative assessment of the data, it becomes evident that as the weather becomes warmer, there is a corresponding increase in the number of bike rentals. Moreover, when considering a two-year timeframe, it is apparent that the number of rentals steadily rises compared to the previous year.

Given that the data cycle concludes at the end of a year, it is reasonable to anticipate that the rental numbers will surge to a level surpassing the figures from the preceding year. This aligns with the projections made by the models. Consequently, the outcomes align with my expectations: the data exhibits a recurring annual pattern of fluctuations with an overall trend toward higher rental numbers.