In this data dive, I will examine the time series component of the Seoul Bikeshare data set. The date information came in d/m/Y format, so this had to be corrected to convert it to a date variable. Each row represents one hour labeled in the hour column.
library(readr)
library(tidyr)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(broom)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(tsibble)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
##
## Attaching package: 'tsibble'
## The following object is masked from 'package:lubridate':
##
## interval
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
library(xts)
## Warning: package 'xts' was built under R version 4.4.3
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following object is masked from 'package:tsibble':
##
## index
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
library(ggthemes)
#Import Data set
SeoulBikeData <- read.csv("C:\\Users\\jakem\\Documents\\1. Classes\\INFO H510\\SeoulBikeData.csv")
#Clean Dates
SeoulBikeData <- SeoulBikeData |>
separate(date, into = c("day", "month", "year"), sep = "/") |>
mutate(
day = as.numeric(day),
month = as.numeric(month),
year = as.numeric(year)
)
df <- SeoulBikeData |>
mutate(
date = as.Date(paste(year,month,day, sep = "/", collapse = NULL))
)
# Create datetime column
df$datetime <- ymd_h(paste(df$date, df$hour))
rentals <- df |>
select(datetime, rented_bikes)
# Create tsibble with datetime
bike_data <- as_tsibble(
rentals, index=datetime)
bike_xts <- xts(x = bike_data$rented_bikes,
order.by = bike_data$datetime)
bike_xts <- setNames(bike_xts, "bikes")
print(bike_xts)
## Warning: object timezone ('UTC') is different from system timezone ('')
## NOTE: set 'options(xts_check_TZ = FALSE)' to disable this warning
## This note is displayed once per session
## bikes
## 2017-12-01 00:00:00 254
## 2017-12-01 01:00:00 204
## 2017-12-01 02:00:00 173
## 2017-12-01 03:00:00 107
## 2017-12-01 04:00:00 78
## 2017-12-01 05:00:00 100
## 2017-12-01 06:00:00 181
## 2017-12-01 07:00:00 460
## 2017-12-01 08:00:00 930
## 2017-12-01 09:00:00 490
## ...
## 2018-11-30 14:00:00 761
## 2018-11-30 15:00:00 768
## 2018-11-30 16:00:00 837
## 2018-11-30 17:00:00 1047
## 2018-11-30 18:00:00 1384
## 2018-11-30 19:00:00 1003
## 2018-11-30 20:00:00 764
## 2018-11-30 21:00:00 694
## 2018-11-30 22:00:00 712
## 2018-11-30 23:00:00 584
The datetime was plotted using a weekly moving average. The depth spikes towards grow deeper as the number of days in the moving average decreases. The weekly moving average smoothed the data appropriately to identify trends. A LOESS smoothing line was used to examine seasons.
bike_xts_smoothed <- bike_xts |>
rollapply(width = 24*7, \(x) mean(x, na.rm = TRUE), fill = FALSE)
ggplot(data=bike_xts_smoothed, mapping = aes(x = Index, y = bikes)) +
geom_line() +
geom_smooth(method = "loess", se = FALSE, color = "blue") +
labs(title = "Rented Bikes in 2015",
subtitle = "Weekly Moving Average") +
theme_hc()
## `geom_smooth()` using formula = 'y ~ x'
The line plot shows clear trends through the year. Bikes rented is strongly correlated with temperature, so the cold months (November - March) show the lowest ridership and rises until Summer. Interestingly, ridership dips around August, likely due to the heat. The plot is bi-modal, with June and September having the highest ridership. The trends line up with the season.
The smoothed line shows a yearly sinusoidal pattern. This is intuitive due to tempe
The data was split on July 1st to fit two separate linear regression models.
bike_xts1 <- bike_xts["2018-01/2018-06"]
subset_2018 <- bike_xts["2018-07/2018-11"]
subset_2017 <- bike_xts["2017-12"]
# Combine both subsets
bike_xts2 <- bike_xts["2018-07/2018-11"]
dates <- index(bike_xts2)
# Modify the index to change values that are in December 2017 to 2018 to model trend
new_index <- if_else(
year(dates) == 2017 & month(dates) == 12,
dates %m+% years(1),
dates
)
# Apply new index
index(bike_xts2) <- new_index
The subsets of the yearly data were modeled.
bike_xts1 |>
ggplot(mapping = aes(x = Index, y = bikes)) +
geom_line() +
geom_smooth(method = 'lm', color = 'blue', se=FALSE) +
labs(title = "Bikes Rented from January to July 1st") +
theme_hc()
## `geom_smooth()` using formula = 'y ~ x'
The first half of the year shows a moderately strong, negative trend.
The trend line illustrates the underlying pattern through the noise.
bike_xts2 |>
ggplot(mapping = aes(x = Index, y = bikes)) +
geom_line() +
geom_smooth(method = 'lm', color = 'blue', se=FALSE) +
labs(title = "Bikes Rented from July 1st to December 31st") +
theme_hc()
## `geom_smooth()` using formula = 'y ~ x'
The second half of the year shows a strong, negative trend.
bike_xts1_smoothed <- rollmean(bike_xts1, k = 24*7, fill = NA)
bike_xts2_smoothed <- rollmean(bike_xts2, k = 24*7, fill = NA)
# Convert to data frames and add period labels
df1 <- data.frame(Index = index(bike_xts1_smoothed),
bikes = coredata(bike_xts1_smoothed),
Period = "First Half")
df2 <- data.frame(Index = index(bike_xts2_smoothed),
bikes = coredata(bike_xts2_smoothed),
Period = "Second Half")
# Combine datasets
combined_df <- bind_rows(df1, df2)
# Plot
ggplot(combined_df, aes(x = Index, y = bikes, color = Period)) +
geom_line() +
geom_smooth(method = 'lm', se = FALSE) + # separate trend lines by color/period
labs(title = "Bikes Rented in 2023", x = "Date", y = "Number of Bikes") +
theme_hc()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 334 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 334 rows containing missing values or values outside the scale range
## (`geom_line()`).
The smoothed plots show the trends much more clearly, especially when placed on the same graph. The season could be better visualized if more than one year of data was available. The season is intuitive due to the co-linearity of temperature with bikes rented and month.
# ACF plot
acf(coredata(bike_xts), main = "ACF of Bike Rentals")
# PACF plot
pacf(coredata(bike_xts), main = "PACF of Bike Rentals")
The ACF plot shows the highest spike at lag 24, which represents the
daily season.
The PACF shows a similar spike at lag 24, which means that each value is related to the value exactly one day prior.
# ACF plot
acf(coredata(bike_xts),lag.max = 24*7+3, main = "ACF of Bike Rentals")
# PACF plot
pacf(coredata(bike_xts), lag.max = 24*7+3, main = "PACF of Bike Rentals")
A weekly season is apparent as well in the PCAF at lag 168.
# ACF plot
acf(coredata(bike_xts),lag.max = 24*31+3, main = "ACF of Bike Rentals")
# PACF plot
pacf(coredata(bike_xts), lag.max = 24*31+3, main = "PACF of Bike Rentals")
There is no apparent monthly season based on the PCAF.
In conclusion, bike rentals have a positive trend during the first half of the year and a negative trend in the second half. The bike share data contains daily, weekly, and yearly seasonality, but not monthly.