Forecasting - Assignment 2

Author

Pragya Mishra

Section 1: Data Description

This data set assigned to me is Streetcar ridership data - it contains the daily count of riders for streetcars throughout the City of Cincinnati.

Data Span: 1st January 2023 - 31st Dec 2023

Refresh Frequency: Daily

Data Generating process: Streetcar Ridership data is created via an Automated Passenger Counter (APC) system installed in the doorways of each streetcar vehicle generated by Southwest Ohio Regional Transit Authority (SORTA).

The fluctuations in ridership can be attributed to a multitude of events throughout the year, ranging from football and basketball matches to city concerts. Daily ridership is further influenced by holidays, seasonal changes, and various factors such as the city’s ambitious plans to expand its streetcar network, the year-over-year increase in ridership, a modest annual rise in the city’s population, and a general decline in population across the US Midwest.

The availability of only one year of data for the forecasting exercise poses some challenges. With a limited historical dataset, establishing robust patterns and accurately predicting future trends becomes more difficult. The forecasting models may face challenges in capturing the full range of variability and complexities inherent in the ridership data.

Section 2: Data Visualization

Reading the data and required libraries

Code

knitr::opts_chunk$set(warning = FALSE, message = FALSE) 
#| echo: false
setwd("C:/Users/mishr/OneDrive/Documents/Semester 2 Coursework/Forecasting/Data")
ridership_data <- read.csv("streetcar_ridership.csv", header = TRUE, stringsAsFactors = TRUE)  # read in data
library(ggplot2)
library(dplyr)
library(tsibble)
library(plotly)
library(dplyr)
library(zoo)
library(tidyr)
library(slider)
library(stats)
library(feasts)
library(ggpubr)
library(cowplot)
library(ggplot2)
library(tsibble)
library(fable)
library(lubridate)

Upon examining various visualizations to understand ridership over time, it is evident that there is no clear upward or downward trend in the data. However, a pronounced seasonal pattern emerges, revealing fluctuations in ridership throughout the year. Notably, ridership hovers around 2500 individuals on most days, and the distribution of data exhibits a right-skewed pattern.

This suggests that while there may not be a consistent trend over the entire period, there is a discernible regularity in the ridership patterns, with certain periods experiencing higher or lower levels of activity.

Code

ggplot(ridership_data, aes(x = ridership)) +
  geom_density(fill = "skyblue", color = "black") +
  labs(title = "Density Plot",
       x = "Ridership",
       y = "Density")

Code

ggplot(ridership_data, aes(x = ridership)) +
  geom_histogram(binwidth = 80, fill = "lightgreen", color = "black") +
  labs(title = "Histogram",
       x = "Ridership",
       y = "Frequency")

Code

ggplot(ridership_data, aes(x = (date), y = ridership)) +
  geom_boxplot(fill = "orange", color = "black") +
  labs(title = "Boxplot",
       x = "Category",
       y = "Ridership")

Data Summary
Statistic	Value
Count (N)	365
Mean	2943.9
Median	2610
Standard Deviation	1240.7
Range	10084

Furthermore, specific months, such as June, July, and September, stand out with significant spikes in ridership. These observations suggest a strong correlation between ridership patterns and both the day of the week and certain months. The visual analysis highlights the importance of considering these temporal factors when interpreting ridership trends.

Section 3: Moving Average and Time Series Decomposition

In the process of calculating moving averages, we are generating plot for four different scenarios using varying window sizes (3, 5, 7, and 13 days). The goal is to assess which moving average best aligns with the data, effectively capturing underlying patterns while avoiding overfitting or excessive sensitivity to noise. This comparative analysis aids in determining the most suitable moving average that strikes a balance between accurately representing the data trends and maintaining resilience against noise interference.

Code

ridership_data_ma <- ridership_data %>%
  arrange(date) %>%
  mutate(
    ma_3 = slider::slide_dbl(ridership, mean,
                             .before = 1, .after = 1, .complete = TRUE),
    ma_5 = slider::slide_dbl(ridership, mean,
                             .before = 2, .after = 2, .complete = TRUE),
    ma_7 = slider::slide_dbl(ridership, mean,
                             .before = 3, .after = 3, .complete = TRUE),
    ma_13 = slider::slide_dbl(ridership, mean,
                              .before = 6, .after = 6, .complete = TRUE)
  )

ridership_ma_pivot <- ridership_data_ma %>%
  pivot_longer(
    cols = ma_3:ma_13,
    values_to = "value_ma",
    names_to = "ma_order"
  ) %>%
  mutate(ma_order = factor(
    ma_order,
    levels = c(
      "ma_3",
      "ma_5",
      "ma_7",
      "ma_13"
    ),
    labels = c(
      "ma_3",
      "ma_5",
      "ma_7",
      "ma_13"
    )
  ))

ridership_ma_pivot %>%
  ggplot() +
  geom_line(aes(date, ridership, group = 1), size = 1) +
  geom_line(aes(date, value_ma, group = 1, color = ma_order), size = 1) +
  scale_color_discrete(name = 'MA Order')+
  theme_bw() +
  xlab("Date") +
  ylab("Ridership")

Due to the daily granularity of the dataset, a 7-day period moving average was selected for the analysis. The trends observed indicate a gradual rise in ridership during the initial months of 2023. However, a notable dip in June-July is apparent, contributing to a visible decline even in the smoothed moving average. Despite this temporary setback, the latter part of the year follows the earlier upward trend, and overall, the ridership tends to stabilize.

Code

ridership_data_ma %>%
  ggplot() +
  geom_line(aes(date, ridership, group = 1), size = 1) +
  geom_line(aes(date, ma_7,group = 1), size = 1, color = "skyblue") +
  theme_bw()+
  ylab('Ridership')

Rolling Standard Deviation of Ridership : here variance is constant and not increasing or decreasing so we don’t need any transformation.

Code

ridership_data %>%
  mutate(
    sd = slider::slide_dbl(ridership, sd, .before = 12, .after = 0, .complete = TRUE)
  ) %>%
  ggplot()+
  geom_line(aes(date,sd,group = 1),size=1)+
  ggtitle('Rolling Standard Deviation of Ridership',
          subtitle = 'Upward Trend Indicates Non-Constant Variance')+
  ylab('12 Month Rolling Standard Deviation')

Remainder Plot -

Code

ridership_data %>%
  mutate(
    ma_7 = slider::slide_dbl(ridership, mean,
                                  .before = 3, .after = 3, .complete = TRUE),
  ) %>%
  mutate(resid = ridership - ma_7) %>%
  select(date, ridership, ma_7, resid) %>%
  pivot_longer(
    ridership:resid,
    names_to = "decomposition",
    values_to = "ridership"
  ) %>%
  mutate(
    decomposition = case_when(
      decomposition == "value" ~ "ridership",
      decomposition == "ma_7" ~ "Trend",
      decomposition == "resid" ~ "Remainder"
    )
  ) %>%
  mutate(
    decomposition = factor(
      decomposition,
      labels = c(
        "Ridership",
        "Trend",
        "Remainder"
      ),
      levels = c(
        "Ridership",
        "Trend",
        "Remainder"
      )
    )
  ) %>%
  ggplot() +
  geom_line(aes(date, ridership,group = 1), size = 1) +
  facet_wrap(
    ~decomposition,
    nrow = 3,
    scales = "free"
  ) +
  theme_bw() +
  ylab("Ridership") +
  xlab("Date") +
  ggtitle(
    "Ridership = Trend + Remainder"
  )

Upon detrending the data, it becomes evident that the moving average effectively captures the annual trend in 2023, depicting a consistent increase in ridership with a noticeable decline in June-July. Examining the remainder, it is apparent that a distinct seasonal pattern emerges, likely attributed to an increase in ridership on weekdays. This observation highlights the influence of a weekly cycle on the overall ridership patterns.

STL Decomposition

Code

tsibble_data <- as_tsibble(ridership_data, index = date)

tsibble_data %>%
  model(
    stl = STL(ridership)
  ) %>%
  components(stl) %>%
  autoplot()

Upon inspecting the seasonality in the remainder plot, it appears that there is no discernible pattern across the months. Instead, the plot shows a relatively consistent straight line, indicating a lack of clear seasonal variation in the remainder series. This observation suggests that the seasonality in the original time series has been effectively captured and accounted for by the chosen moving average. The absence of distinct patterns in the remainder plot implies that the chosen moving average has successfully smoothed out the seasonal fluctuations, allowing for a clearer understanding of the underlying trends in the data.

Code

#seasonality

boxplots <- ridership_data %>%
  mutate(ma = slider::slide_dbl(ridership, mean,
                                .before = 3, .after = 3, .complete = TRUE)) %>%
  mutate(detrend = ridership - ma) %>%
  mutate(month = lubridate::month(date,label=T,abbr=T)) %>%
  ggplot()+
  geom_boxplot(aes(month,detrend))+
  geom_hline(yintercept=0,color='red') +
  xlab("Month") +
  ylab("Remainder")

bar_plot <- ridership_data %>%
  mutate(ma = slider::slide_dbl(ridership, mean,
                                .before = 3, .after = 3, .complete = TRUE)) %>%
  mutate(detrend = ridership - ma) %>%
  mutate(month = lubridate::month(date,label=T,abbr=T)) %>%
  ggplot() +
  geom_col(aes(month, ridership, fill = month)) +
  theme_bw() +
  xlab("Month") +
  ylab("Seasonality Effect on Ridership")

#lag_plot for checking seasonality in remainder

combined_plot <- ggarrange(boxplots, bar_plot, ncol = 2)

final_plot <- combined_plot +
  ggtitle("Seasonality") +
  theme(plot.title = element_text(hjust = 0.5))

print(final_plot)

Section 4: Forecasting

Code

r_data_mod <- lm(ridership ~ date, data = tsibble_data)

newdata <- data.frame(date = ymd("2023-12-01")+ months(1:3))
newdata <- as_tsibble(newdata)


newdata$pred <- predict(r_data_mod, newdata = newdata)
# all combined
linear_forecast = newdata %>%
  mutate(.model = "Regression") %>%
  rename(.mean = pred) %>%
  mutate(date = yearmonth(date)) %>%
  as_tsibble()

###########
# Load required libraries
library(tsibble)
library(fable)
library(lubridate)

# Modeling and Forecasting
tsibble_data %>%
  model(
    Mean = MEAN(ridership),
    `Naïve` = fable::NAIVE(ridership),
    Drift = fable::NAIVE(ridership ~ drift())
  ) %>%
  forecast(h = 60) %>%
  bind_rows(linear_forecast) %>%
  autoplot(tsibble_data, level = NULL, linewidth = 1) +
  geom_vline(aes(xintercept = ymd("2023-12-31")), color = "red", linetype = "dashed") +
  geom_line(data = newdata, aes(date, pred), color = "#E76BF3", size = 1) +
  theme_bw() +
  ggtitle("Forecast for Next 6 Months")

In our plot depicting the four forecasting methods for the next six months, the naive forecast suggests an estimation of approximately 2500 riders per day for the upcoming two months. While this forecast is generally plausible, it overlooks the gradually increasing trend observed in the early months of the year, as evidenced by the original 2023 data.

On the other hand, the naive forecast with drift takes into account the rising trend witnessed in the first two months of 2024. It starts from around 2500 riders per day and projects an upward trajectory, offering a more probable forecast considering the ascending daily ridership magnitude observed in the initial months of 2023. This approach aligns better with the observed trend, providing a more nuanced and realistic projection for the upcoming period.

END.