CUNY Data Science 698 - Capstone Research Project

Abstract

This study analyzes daily ridership patterns across major transit services operated by the Metropolitan Transportation Authority (MTA) in New York City, with a focus on the impact of the COVID-19 pandemic and post-pandemic recovery. Using publicly available MTA ridership data from 2020 to 2025, the study examines how ridership declined during the pandemic and how recovery patterns differ across transit modes, including subways, buses, commuter rail, and bridges and tunnels.

To better explain fluctuations in transit usage, the study incorporates COVID-19 indicators such as daily case counts and hospitalizations. Statistical modeling and time-series techniques, including multiple linear regression, ARIMA, ETS and Prophet models, are used to quantify relationships and generate forecasts.

The results show that COVID-19 case levels have a significant negative effect on ridership, while recovery trends vary across transit modes. Forecasting models consistently indicate that ridership is recovering gradually but is likely to stabilize below pre-pandemic levels. These findings suggest a structural shift in urban mobility patterns and provide insights for long-term transit planning.

Introduction

Public transportation plays a critical role in the economy and daily life of New York City. The Metropolitan Transportation Authority operates one of the largest transit systems in the United States, serving millions of riders through subways, buses, commuter railroads, paratransit services, and bridges and tunnels. Understanding ridership trends is essential for effective transit planning, resource allocation and policy decision-making.

According to data from the Metropolitan Transportation Authority Open Data Portal, subway ridership fell by over 90% at the peak of the pandemic in April 2020 compared to pre-pandemic levels.

The COVID-19 pandemic caused an unprecedented disruption to public transit systems beginning in early 2020. Government restrictions, health concerns and the rapid shift to remote work led to a sharp decline in ridership across all modes of transportation. Although ridership has gradually recovered over time, the pace and pattern of recovery have varied significantly across different transit services. The COVID-19 pandemic also differs from shorter-lived disruptions in both scale and duration because of its long-term impact on commuting behavior. Recent labor market studies indicate that remote and hybrid work arrangements remain significantly more common than before the pandemic. Data from WFH Research shows that work-from-home behavior continues to reduce commuting frequency in major metropolitan areas. This shift raises important questions about the long-term sustainability of transit demand.

This study analyzes daily MTA ridership data from 2020 to 2025 to examine how transit usage changed during and after the pandemic. By incorporating COVID-19 indicators such as case counts and hospitalizations, the analysis aims to better understand how public health conditions influenced travel behavior.

The primary objective of this study is to evaluate how ridership evolved over time and to compare recovery patterns across transit modes. In addition, the study explores weekday and weekend differences, seasonal trends and the relationship between ridership and pandemic dynamics. The findings provide insights into long-term changes in urban mobility and support discussions on future transit planning.

The central research question guiding this study is:

How has daily MTA ridership evolved since the COVID-19 pandemic, and how do recovery patterns differ across transit modes when considering COVID-19 dynamics?

Secondary questions include:

• Are there consistent weekday and weekend ridership patterns across modes, and do they shift during pandemic waves?

• How do seasonal trends vary by transit service, and how are they affected by COVID-19 milestones?

• Which services have recovered more quickly, and which continue to lag behind pre-pandemic baselines, considering both case counts and vaccination rates?

Data

This project analyzes MTA ridership trends from 2020–2025 and examines how COVID-19 impacted recovery across transit modes.

MTA Ridership Data

Using the MTA dataset, I focused on the estimated number of riders by day of the week.

# Load packages and MTA data
library(tidyverse)
library(lubridate)
mta_url <- "https://data.ny.gov/resource/vxuj-8kew.csv"
mta <- read_csv(mta_url)

MTA Data Cleaning

# Convert date columns
mta <- mta %>%
  mutate(date = as.Date(date))

DT::datatable(mta)

glimpse(mta)

Rows: 1,000
Columns: 15
$ date                                                 <date> 2020-03-01, 2020…
$ subways_total_estimated_ridership                    <dbl> 2212965, 5329915,…
$ subways_of_comparable_pre_pandemic_day               <dbl> 0.97, 0.96, 0.98,…
$ buses_total_estimated_ridersip                       <dbl> 984908, 2209066, …
$ buses_of_comparable_pre_pandemic_day                 <dbl> 0.99, 0.99, 0.99,…
$ lirr_total_estimated_ridership                       <dbl> 86790, 321569, 31…
$ lirr_of_comparable_pre_pandemic_day                  <dbl> 1.00, 1.03, 1.02,…
$ metro_north_total_estimated_ridership                <dbl> 55825, 180701, 19…
$ metro_north_of_comparable_pre_pandemic_day           <dbl> 0.59, 0.66, 0.69,…
$ access_a_ride_total_scheduled_trips                  <dbl> 19922, 30338, 327…
$ access_a_ride_of_comparable_pre_pandemic_day         <dbl> 1.13, 1.02, 1.10,…
$ bridges_and_tunnels_total_traffic                    <dbl> 786960, 874619, 8…
$ bridges_and_tunnels_of_comparable_pre_pandemic_day   <dbl> 0.98, 0.95, 0.96,…
$ staten_island_railway_total_estimated_ridership      <dbl> 1636, 17140, 1745…
$ staten_island_railway_of_comparable_pre_pandemic_day <dbl> 0.52, 1.07, 1.09,…
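
The row count of exactly 1,000 reported above matches the default page size of Socrata (SODA) CSV endpoints, so the download may cover only part of the 2020 to 2025 study period. A minimal sketch of requesting more rows follows. The helper name build_soda_url and the limit of 5,000 are illustrative assumptions, while $limit and $order are standard SODA query parameters.

```r
# Hypothetical helper (not part of the original analysis):
# append SODA paging parameters to a resource URL.
build_soda_url <- function(base_url, limit = 5000, order_by = "date") {
  paste0(base_url,
         "?$limit=", format(limit, scientific = FALSE),
         "&$order=", order_by)
}

mta_url_full <- build_soda_url("https://data.ny.gov/resource/vxuj-8kew.csv")
print(mta_url_full)

# The full series could then be read as before (requires network access):
# mta <- readr::read_csv(mta_url_full)
```

The same approach would apply to the COVID-19 endpoint, with order_by set to that dataset's date column (date_of_interest).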

COVID-19 Data

covid_url <- "https://data.cityofnewyork.us/resource/rc75-m7u3.csv"
covid <- read_csv(covid_url)

# Convert date columns
covid <- covid %>%
  mutate(date = as.Date(date_of_interest))

DT::datatable(covid)

glimpse(covid)

Rows: 1,000
Columns: 56
$ date_of_interest                <dttm> 2020-02-29, 2020-03-01, 2020-03-02, 2…
$ case_count                      <dbl> 1, 0, 0, 1, 5, 3, 8, 7, 21, 57, 69, 15…
$ probable_case_count             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ hospitalized_count              <dbl> 1, 1, 2, 7, 2, 14, 8, 8, 18, 37, 60, 7…
$ death_count                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
$ case_count_7day_avg             <dbl> 0, 0, 0, 0, 0, 0, 3, 3, 6, 15, 24, 46,…
$ all_case_count_7day_avg         <dbl> 0, 0, 0, 0, 0, 0, 3, 3, 6, 15, 24, 46,…
$ hosp_count_7day_avg             <dbl> 0, 0, 0, 0, 0, 0, 5, 6, 8, 13, 21, 32,…
$ death_count_7day_avg            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bx_case_count                   <dbl> 0, 0, 0, 0, 0, 0, 2, 0, 3, 4, 8, 19, 2…
$ bx_probable_case_count          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bx_hospitalized_count           <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 5, 7, 7, 23, 1…
$ bx_death_count                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bx_case_count_7day_avg          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 5, 9,…
$ bx_probable_case_count_7day_avg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bx_all_case_count_7day_avg      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 5, 9,…
$ bx_hospitalized_count_7day_avg  <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 3, 6, 9,…
$ bx_death_count_7day_avg         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bk_case_count                   <dbl> 0, 0, 0, 0, 1, 3, 1, 2, 5, 16, 11, 31,…
$ bk_probable_case_count          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bk_hospitalized_count           <dbl> 1, 0, 2, 3, 1, 3, 1, 3, 8, 11, 13, 11,…
$ bk_death_count                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bk_case_count_7day_avg          <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 2, 4, 6, 10, 2…
$ bk_probable_case_count_7day_avg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bk_all_case_count_7day_avg      <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 2, 4, 6, 10, 2…
$ bk_hospitalized_count_7day_avg  <dbl> 0, 0, 0, 0, 0, 0, 2, 2, 3, 4, 6, 7, 11…
$ bk_death_count_7day_avg         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ mn_case_count                   <dbl> 1, 0, 0, 0, 2, 0, 3, 1, 6, 24, 24, 62,…
$ mn_probable_case_count          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ mn_hospitalized_count           <dbl> 0, 0, 0, 1, 1, 5, 3, 0, 1, 9, 12, 19, …
$ mn_death_count                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ mn_case_count_7day_avg          <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 2, 5, 9, 17, 3…
$ mn_probable_case_count_7day_avg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ mn_all_case_count_7day_avg      <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 2, 5, 9, 17, 3…
$ mn_hospitalized_count_7day_avg  <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 4, 7, 9,…
$ mn_death_count_7day_avg         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ qn_case_count                   <dbl> 0, 0, 0, 1, 2, 0, 1, 3, 6, 10, 24, 40,…
$ qn_probable_case_count          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ qn_hospitalized_count           <dbl> 0, 0, 0, 2, 0, 4, 2, 4, 4, 8, 23, 23, …
$ qn_death_count                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
$ qn_case_count_7day_avg          <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 7, 12, 2…
$ qn_probable_case_count_7day_avg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ qn_all_case_count_7day_avg      <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 7, 12, 2…
$ qn_hospitalized_count_7day_avg  <dbl> 0, 0, 0, 0, 0, 0, 1, 2, 2, 3, 6, 10, 1…
$ qn_death_count_7day_avg         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ si_case_count                   <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 3, 2, 3, 13…
$ si_probable_case_count          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ si_hospitalized_count           <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0, 2, 5, 2, 3,…
$ si_death_count                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ si_probable_case_count_7day_avg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ si_case_count_7day_avg          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3,…
$ si_all_case_count_7day_avg      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3,…
$ si_hospitalized_count_7day_avg  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 2,…
$ si_death_count_7day_avg         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ incomplete                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ date                            <date> 2020-02-29, 2020-03-01, 2020-03-02, 2…

COVID-19 Data Cleaning

# Select relevant columns
covid <- covid %>%
  select(date, cases = case_count, hospitalizations = hospitalized_count)

# Merge datasets
full_data <- left_join(mta, covid, by = "date")

# Create 7-day rolling averages (right-aligned: the current day plus the previous six)

full_data <- full_data %>%
  arrange(date) %>%
  mutate(
    cases_7day = zoo::rollmean(cases, 7, fill = NA, align = "right"),
    ridership_7day = zoo::rollmean(subways_total_estimated_ridership, 7, fill = NA, align = "right")
  )

DT::datatable(full_data)
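
To make the right-aligned window concrete, the following base R sketch computes a trailing 7-day mean on a toy series. It uses stats::filter as a stand-in for zoo::rollmean, an assumption made here only to keep the example free of extra packages; the toy values are not ridership data.

```r
# Toy daily series: 1, 2, ..., 10
x <- 1:10

# Trailing 7-day mean: each value averages the current day and the six before it.
# sides = 1 restricts the filter to present and past values,
# so the first six entries have no complete window and are NA.
trailing_mean <- as.numeric(stats::filter(x, rep(1 / 7, 7), sides = 1))
print(trailing_mean)
```

The first six entries are NA and the seventh is the mean of 1 through 7, which mirrors how cases_7day and ridership_7day are missing for the first six days of the merged data.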

Converting Dates to Day of the Week

To identify the day of the week for each date, I derived a labeled day-of-week column with lubridate's wday() and added a binary indicator that equals 1 on Saturdays and Sundays.

new_full_data <- full_data %>%
  mutate(
    Day_of_Week = wday(date, label = TRUE),
    Weekend = ifelse(Day_of_Week %in% c("Sat", "Sun"), 1, 0)
  )

DT::datatable(new_full_data)

Subway Ridership Totals

I looked at overall subway ridership across the study period, including the number of daily riders:

Subway Overall Ridership Totals

Subway_Overall_Totals <- mta %>%
  select(date, Subway_Ridership_Totals = `subways_total_estimated_ridership`) %>%
  arrange(date)

DT::datatable(Subway_Overall_Totals)

Literature Review

The COVID-19 pandemic had a significant impact on public transportation systems worldwide, leading to sharp declines in ridership and major changes in travel behavior. Early studies show that transit usage dropped rapidly in response to lockdown measures, health concerns, and the widespread shift to remote work. As infection rates increased, mobility decreased, resulting in reduced demand for public transit services.

Research also indicates that the impact of the pandemic was not uniform across transit modes. Bus systems tended to recover more quickly than rail services, as they are more commonly used by essential workers. In contrast, commuter rail systems experienced slower recovery due to reduced daily commuting. These patterns have been observed in major cities across the United States and are consistent with broader changes in work and travel behavior.

Public health conditions played a key role in shaping ridership trends. Studies find a negative relationship between transit usage and COVID-19 case counts, while improvements in vaccination rates contributed to increased ridership. As vaccination campaigns progressed, public confidence in using transit systems improved, supporting gradual recovery.

Another important theme in the literature is the long-term shift in mobility patterns. The pandemic accelerated trends such as remote work and flexible schedules, reducing peak-hour demand and altering traditional commuting patterns. Additionally, some travelers shifted toward private vehicles, walking, or cycling, reflecting changes in risk perception and travel preferences.

From a methodological perspective, time series analysis and regression modeling have been widely used to study the relationship between transit ridership and external factors such as public health data. These approaches allow researchers to capture trends, seasonality and structural breaks associated with major events like the COVID-19 pandemic.

Data from WFH Research suggests that remote work remains prevalent, reducing the frequency of commuting trips. As a result, transit demand may not fully return to pre-pandemic levels, supporting the need for models that incorporate both public health conditions and structural changes in labor patterns.

This study builds on existing research by analyzing MTA ridership data over an extended period from 2020 to 2025 and comparing recovery patterns across multiple transit modes. By combining transportation and COVID-19 data, the analysis provides a comprehensive view of how public transit usage evolved during and after the pandemic.

Methodology

The analysis begins with data pre-processing, including cleaning, alignment of datasets and transformation into a time-series format. Daily ridership and COVID-19 case data were merged by date and rolling averages were computed to smooth short-term fluctuations and highlight underlying trends.

Exploratory Data Analysis (EDA): EDA reveals a sharp decline in ridership across all transit modes in early 2020, followed by a gradual recovery. However, the pace of recovery differs across modes, with bus ridership recovering more quickly than subway and commuter rail services. These differences suggest that ridership patterns are influenced by changes in travel purpose and commuting behavior.

Primary Analytical Method: A multiple linear regression model is applied to quantify the relationship between daily ridership, COVID-19 case counts and weekend effects:

Ridershipₜ = β₀ + β₁(Casesₜ) + β₂(Weekendₜ) + εₜ

Here Ridershipₜ is ridership on day t, Casesₜ is the COVID-19 case count on day t, Weekendₜ equals 1 on Saturdays and Sundays and 0 otherwise, and εₜ is the error term.

While remote work is recognized as an important structural factor influencing transit demand, it is not directly included in the regression model due to data limitations.
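
The regression specification can be sketched in base R as follows. This is a minimal illustration on simulated data, not the fitted model from this study: the data frame sim, the coefficient values used to generate it, and the noise level are all assumptions chosen so that the example runs on its own.

```r
set.seed(698)

# Simulated stand-in data (NOT the MTA data): 200 days with a weekly cycle
n <- 200
sim <- data.frame(
  cases   = round(runif(n, min = 0, max = 5000)),
  weekend = rep(c(0, 0, 0, 0, 0, 1, 1), length.out = n)
)

# Assumed data-generating process, chosen only for illustration
sim$ridership <- 3e6 - 200 * sim$cases - 1e6 * sim$weekend + rnorm(n, sd = 1e5)

# Ridership_t = b0 + b1 * Cases_t + b2 * Weekend_t + e_t
fit <- lm(ridership ~ cases + weekend, data = sim)
print(coef(summary(fit)))
```

With the merged data frame built earlier, the analogous call would presumably be lm(subways_total_estimated_ridership ~ cases + Weekend, data = new_full_data). A negative estimate for the cases coefficient would correspond to lower ridership on days with more reported cases.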

Time Series Modeling: To complement this analysis, time-series models are applied to capture temporal dependencies and generate forecasts. An ARIMA model is used as a baseline approach to model autocorrelation and short-term dynamics in ridership data. In addition, the Prophet model is implemented to capture non-linear trends and seasonal patterns, particularly around the structural break caused by the COVID-19 pandemic.

Unlike univariate time-series models, the regression framework provides interpretable coefficients that quantify the effects of pandemic conditions and weekend travel. Forecasts generated using the time-series models are compared using performance metrics such as RMSE and MAE to evaluate predictive accuracy.
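
For reference, the two metrics can be written as short R helpers. The function names rmse and mae and the toy vectors below are illustrative assumptions, not objects from the analysis.

```r
# Root mean squared error: penalizes large misses more heavily
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Mean absolute error: average size of the miss, in the units of the data
mae <- function(actual, predicted) mean(abs(actual - predicted))

# Toy values for illustration only
actual    <- c(100, 200, 300)
predicted <- c(110, 190, 330)

print(c(RMSE = rmse(actual, predicted), MAE = mae(actual, predicted)))
```

Because RMSE squares each error before averaging, it is never smaller than MAE, and a large gap between the two points to a few unusually large misses.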

Exploratory Data Analysis

The exploratory data analysis provides an initial understanding of how ridership patterns in the Metropolitan Transportation Authority system changed during and after the COVID-19 pandemic. The analysis highlights a clear structural break beginning in March 2020, when ridership declined sharply across all transit modes due to lockdown measures, reduced mobility and public health concerns.

Following this initial decline, ridership began a gradual recovery starting in late 2020. However, recovery has been uneven and incomplete, with overall levels remaining below pre-pandemic benchmarks throughout the study period. This suggests that the pandemic introduced lasting changes in travel behavior rather than a temporary disruption.

Significant variation is observed across transit modes. Bus ridership shows a relatively faster recovery, likely reflecting continued use by essential workers and populations with limited transportation alternatives. In contrast, subway and commuter rail services exhibit slower recovery consistent with reduced commuting demand and the persistence of remote and hybrid work arrangements.

Changes in temporal patterns are also evident. Prior to the pandemic, ridership was substantially higher on weekdays than on weekends, driven by regular commuting activity. During the pandemic, this gap narrowed considerably, indicating a decline in work-related travel. Although weekday ridership has increased during the recovery phase, the difference between weekdays and weekends remains smaller than before the pandemic, suggesting a shift toward more flexible travel patterns.

Overall, the exploratory analysis reveals that MTA ridership is influenced by both short-term public health conditions and longer-term structural changes in commuting behavior. These findings provide a foundation for the statistical and time-series analyses that follow.

Data Preparation

The Metropolitan Transportation Authority ridership data was processed to ensure consistency and suitability for analysis. As part of pre-processing, column names were standardized and corrected to maintain uniformity across all transit modes.

The date variable was converted into a proper date format to enable time-based analysis and the dataset was organized in chronological order. Missing values and irregular observations were carefully reviewed and addressed through appropriate methods, including removal or imputation where necessary. The data was then aggregated at the daily level to create a continuous time series of total ridership.

To support comparative analysis, a categorical variable labeled “phase” was introduced to segment the data into key pandemic periods: COVID Shock (2020), Early Recovery (2021) and Recovery (2022 onward). This classification allows for clearer interpretation of how ridership patterns evolved over time.

Additional features, including day of the week and month indicators, were derived to capture temporal variation and seasonal patterns in transit usage. These enhancements improve the dataset’s ability to support both exploratory and modeling approaches.

Time Period Comparison

To evaluate the impact of COVID-19, the data is divided into three phases:

Phase	Years	Description
COVID Shock	2020	Lockdowns and restrictions
Early Recovery	2021	Initial rebound in ridership
Recovery	2022 onward	Gradual return to normal

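# Convert date and derive calendar features
mta <- mta %>%
  mutate(date = as.Date(date),
         year = year(date),
         # Ordered factor so plots run Monday to Sunday rather than alphabetically
         # (assumes an English locale for weekdays())
         weekday = factor(weekdays(date),
                          levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                     "Friday", "Saturday", "Sunday")))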

# Create phase variable
mta <- mta %>%
  mutate(phase = case_when(
    year == 2020 ~ "COVID Shock",
    year == 2021 ~ "Early Recovery",
    year >= 2022 ~ "Recovery"
  ))

# Aggregate daily subway ridership (other modes are compared separately below)
daily_ridership <- mta %>%
  group_by(date, phase) %>%
  summarise(ridership = sum(subways_total_estimated_ridership, na.rm = TRUE),
            .groups = "drop")

Ridership Trend Over Time

The following plot illustrates temporal trends in MTA ridership, highlighting variations before, during and after the COVID-19 pandemic.

ggplot(daily_ridership, aes(x = date, y = ridership, color = phase)) +
  geom_line(alpha = 0.7) +
  labs(title = "MTA Subway Ridership Trends (2020–2025)",
       x = "Date",
       y = "Ridership") +
  theme_minimal()+
  scale_y_continuous(labels = scales::comma)

The plot shows that subway ridership was high in early March 2020, before restrictions took effect. Later in March 2020, there is a sharp drop due to the COVID-19 pandemic, indicating a major disruption. Although ridership begins to recover after 2020, it remains below pre-pandemic levels, suggesting a slow and incomplete recovery.

Average Ridership by Phase

The bar chart summarizes average ridership levels across the COVID Shock, Early Recovery and Recovery phases, highlighting the impact of the COVID-19 pandemic.

phase_summary <- daily_ridership %>%
  group_by(phase) %>%
  summarise(avg_ridership = mean(ridership))

ggplot(phase_summary, aes(x = phase, y = avg_ridership, fill = phase)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Ridership by Phase",
       x = "Phase",
       y = "Average Daily Ridership") +
  theme_minimal()+
  scale_y_continuous(labels = scales::comma)

Average ridership dropped significantly during the pandemic and partially recovered afterward. However, ridership levels remain below pre-pandemic levels, indicating a slow and incomplete recovery.

Weekday vs Weekend Patterns

The chart analyzes variations in ridership by day of the week, highlighting how commuting behavior changed during the COVID-19 pandemic.

# Average subway ridership for each weekday within each phase
weekday_analysis <- mta %>%
  group_by(weekday, phase) %>%
  summarise(ridership = mean(subways_total_estimated_ridership, na.rm = TRUE),
            .groups = "drop")

ggplot(weekday_analysis, aes(x = weekday, y = ridership, fill = phase)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Weekday vs Weekend Ridership",
       x = "Day of Week",
       y = "Average Ridership") +
  theme_minimal()+
  scale_y_continuous(labels = scales::comma)

Before COVID-19, ridership was significantly higher on weekdays due to regular commuting patterns. During the pandemic, this difference decreased as travel behavior became more uniform. In the recovery phase, weekday ridership begins to increase again, but the gap between weekdays and weekends remains smaller than before the pandemic, indicating lasting changes in commuting habits.

Multi-Mode Ridership Comparison

The chart compares daily ridership across transit modes, highlighting how the decline and subsequent recovery following the COVID-19 pandemic differ by service.

# Multi-mode selection
mta_modes <- mta %>%
  select(
    date,
    subways = subways_total_estimated_ridership,
    buses = buses_total_estimated_ridersip,
    lirr = lirr_total_estimated_ridership,
    metro_north = metro_north_total_estimated_ridership,
    bridges_tunnels = bridges_and_tunnels_total_traffic
  ) %>%
  pivot_longer(
    cols = -date,
    names_to = "mode",
    values_to = "ridership"
  )

# Plot comparison
ggplot(mta_modes, aes(x = date, y = ridership, color = mode)) +
  geom_line(alpha = 0.7) +
  labs(
    title = "MTA Ridership Trends by Mode (2020–2025)",
    x = "Date",
    y = "Ridership"
  ) +
  theme_minimal()+
  scale_y_continuous(labels = scales::comma)

# Subway ridership trend with smoothing

ggplot(full_data, aes(x = date)) +
  geom_line(aes(y = subways_total_estimated_ridership), alpha = 0.4) +
  geom_line(aes(y = ridership_7day), color = "blue", linewidth = 1) +
  theme_minimal() +
  labs(
    title = "Subway Ridership Trend with 7-Day Rolling Average",
    x = "Date",
    y = "Ridership"
  )+
  scale_y_continuous(labels = scales::label_comma())

The results show that all modes experienced a sharp decline in early 2020, but recovery patterns vary significantly. Subway and commuter rail services show slower recovery due to reduced commuting demand and sustained remote work trends. In contrast, bus ridership demonstrates relatively faster recovery, likely due to its reliance on essential travel. These differences highlight how travel purpose and rider demographics influenced recovery trajectories across the transit system.

# Normalized comparison (recovery comparison)
mta_modes %>%
  group_by(mode) %>%
  mutate(index = ridership / max(ridership, na.rm = TRUE)) %>%
  ggplot(aes(date, index, color = mode)) +
  geom_line() +
  labs(
    title = "Relative Recovery by Mode",
    y = "Ridership as Share of Each Mode's Peak"
  ) +
  theme_minimal()+
  scale_y_continuous(labels = scales::label_percent())

Time-Series Comparison of COVID-19 Cases and Ridership

To examine the relationship between public health conditions and transit usage, this analysis compares daily COVID-19 case counts with subway ridership over the study period. The objective is to understand whether changes in pandemic severity are associated with fluctuations in public transportation demand.

# COVID-19 cases vs subway ridership (cases scaled by 50 to share the ridership axis)
ggplot(full_data, aes(x = date)) +
  geom_line(aes(y = subways_total_estimated_ridership, color = "Ridership"), linewidth = 0.7) +
  geom_line(aes(y = cases * 50, color = "COVID Cases"), linewidth = 0.7) +
  # Use a single y scale: a second scale_y_continuous() call would replace
  # this one and drop the secondary axis
  scale_y_continuous(
    name = "Ridership",
    labels = scales::label_comma(),
    sec.axis = sec_axis(~ . / 50, name = "COVID Cases", labels = scales::label_comma())
  ) +
  labs(
    title = "COVID-19 Cases vs Subway Ridership",
    x = "Date",
    color = "Legend"
  ) +
  theme_minimal()

The plot shows a clear inverse relationship between COVID-19 cases and subway ridership. When case counts increase, ridership drops, reflecting reduced mobility during periods of higher health risk. As cases decline, ridership gradually recovers, but it remains below pre-pandemic levels, suggesting that longer-term changes in commuting behavior also influence transit demand.

Multi-Mode Ridership Analysis

To provide a comprehensive understanding of transit usage, ridership trends were analyzed across all major MTA service modes, including subways, buses, Long Island Rail Road (LIRR), Metro-North Railroad and bridges and tunnels. This multi-mode approach allows for a direct comparison of how different transportation systems responded to the COVID-19 pandemic and subsequent recovery period.

The analysis shows that all transit modes experienced a sharp decline in ridership during the early stages of the pandemic in 2020. However, recovery patterns differ significantly across modes. Subway and commuter rail services, including LIRR and Metro-North, exhibit slower recovery, largely due to reduced commuting associated with remote work and changes in travel behavior.

In contrast, bus ridership demonstrates a relatively faster recovery, likely reflecting continued reliance by essential workers and populations with fewer alternative transportation options. Meanwhile, bridges and tunnels show a strong rebound, indicating an increased shift toward private vehicle usage during and after the pandemic.

These findings highlight that the impact of COVID-19 on transportation was not uniform across systems. Instead, recovery trajectories vary depending on the role each mode plays in urban mobility, providing important insights into long-term changes in travel behavior and transit demand.

Time Series Modeling

Because the data is time-dependent, autocorrelation was assessed using ACF plots. Differencing was applied where necessary to achieve stationarity before fitting ARIMA models. This supports reliable parameter estimation and forecasting performance.
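
The effect of differencing on autocorrelation can be illustrated with a short base R sketch. The series below is a simulated random walk, used here as an assumed stand-in for a non-stationary series; it is not the ridership data.

```r
set.seed(1)

# Simulated random walk: a simple non-stationary series (not ridership data)
y <- cumsum(rnorm(500))

# Lag-1 autocorrelation before and after first differencing
# ($acf[1] is lag 0, so element 2 is lag 1)
acf_level <- acf(y, plot = FALSE)$acf[2]
acf_diff  <- acf(diff(y), plot = FALSE)$acf[2]

print(round(c(level = acf_level, differenced = acf_diff), 3))
```

A lag-1 autocorrelation near 1 in the level series that falls toward 0 after differencing is the pattern that motivates the differencing step. For the ridership series, the same check would be applied to ts_ridership.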

To perform time series modeling, the dataset from the Metropolitan Transportation Authority was aggregated at the daily level and sorted chronologically. This ensures the data is in a consistent format suitable for forecasting ridership trends over time.

# Aggregate daily ridership (ensure proper ordering)
ts_data <- daily_ridership %>%
  group_by(date) %>%
  summarise(ridership = sum(ridership, na.rm = TRUE), .groups = "drop") %>%
  arrange(date)

# Convert to time series object (use correct start year)
start_year <- lubridate::year(min(ts_data$date))
start_day  <- lubridate::yday(min(ts_data$date))

ts_ridership <- ts(
  ts_data$ridership,
  start = c(start_year, start_day),
  frequency = 365
)

ggplot(ts_data, aes(x = date, y = ridership)) +
  geom_line(color = "blue", linewidth = 1.5) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "MTA Ridership Time Series (2020–2025)",
    x = "Date",
    y = "Ridership"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )

The time series plot highlights a pronounced structural break in early 2020, where ridership drops sharply due to the onset of the COVID-19 pandemic. This decline reflects the immediate impact of lockdown measures, reduced mobility and public health concerns. Following this disruption, ridership shows a gradual upward trend indicating a steady but slow recovery over time. However, the recovery remains incomplete, as ridership levels do not return to pre-pandemic highs within the observed period.

The plot also suggests the presence of recurring fluctuations which may reflect seasonal patterns and variations in travel behavior. Overall, the time series demonstrates that while recovery is underway, long-term ridership dynamics have shifted, likely due to sustained changes such as remote work and evolving commuting patterns.

Model Comparison (ARIMA, Prophet, ETS)

To evaluate forecasting performance, the ARIMA, Prophet and Exponential Smoothing (ETS) models were compared using standard accuracy metrics, including Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). These metrics provide a quantitative measure of how closely each model’s predictions align with observed ridership values.

The comparison shows that all three models produce consistent forecasts, indicating a gradual recovery in ridership following the sharp decline in 2020. However, differences emerge in their predictive performance and ability to capture underlying patterns. The ARIMA model performs well in capturing short-term dependencies and provides stable forecasts when the data exhibits consistent temporal structure. The ETS model produces smooth and reliable trend estimates, making it useful as a baseline for comparison. In contrast, the Prophet model is more flexible in capturing non-linear trends, seasonal effects and structural breaks, particularly those associated with the COVID-19 disruption.

Based on the evaluation metrics, ARIMA attains the lowest RMSE and Prophet the lowest MAE, while ETS is the least accurate on both measures. Because these metrics are computed on fitted (in-sample) values rather than on held-out data, they describe how well each model fits the observed series and should be read as a guide to forecast accuracy rather than a direct measure of it. The consistency across models strengthens confidence in the overall findings, suggesting that MTA ridership is recovering gradually but is unlikely to return to pre-pandemic levels in the near term. Overall, the use of multiple forecasting approaches improves the robustness of the analysis by validating results across different modeling assumptions and techniques.

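# Create the model comparison table

# ARIMA model: fit, then forecast 90 days ahead
fit_arima <- auto.arima(ts_ridership)
forecast_arima <- forecast(fit_arima, h = 90)

# With no test set supplied, accuracy() returns training-set (in-sample) measures
acc_arima <- accuracy(forecast_arima)

# ETS model: fit, then forecast 90 days ahead
fit_ets <- ets(ts_ridership)
forecast_ets <- forecast(fit_ets, h = 90)

acc_ets <- accuracy(forecast_ets)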

# PROPHET MODEL
df_prophet <- ts_data %>%
  mutate(date = as.Date(date)) %>%
  rename(ds = date, y = ridership)

df_prophet$y <- as.numeric(df_prophet$y)
df_prophet <- df_prophet %>% filter(!is.na(ds) & !is.na(y))

model_prophet <- prophet(df_prophet)

future <- make_future_dataframe(model_prophet, periods = 90)
forecast_prophet <- predict(model_prophet, future)

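# Keep the date and the fitted/predicted value
prophet_pred <- forecast_prophet %>%
  select(ds, yhat)

# The inner join keeps only dates with observations, so the errors below are in-sample
actual_vs_pred <- df_prophet %>%
  inner_join(prophet_pred, by = "ds")

prophet_rmse <- sqrt(mean((actual_vs_pred$y - actual_vs_pred$yhat)^2))
prophet_mae  <- mean(abs(actual_vs_pred$y - actual_vs_pred$yhat))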

arima_rmse <- tail(acc_arima[, "RMSE"], 1)
arima_mae  <- tail(acc_arima[, "MAE"], 1)

ets_rmse <- tail(acc_ets[, "RMSE"], 1)
ets_mae  <- tail(acc_ets[, "MAE"], 1)

# MODEL COMPARISON TABLE
model_comparison <- data.frame(
  Model = c("ARIMA", "ETS", "Prophet"),
  RMSE = c(arima_rmse, ets_rmse, prophet_rmse),
  MAE = c(arima_mae, ets_mae, prophet_mae)
)

# DISPLAY
knitr::kable(
  model_comparison,
  digits = 2,
  caption = "Model Comparison: In-Sample Accuracy (Lower is Better)"
)

Model Comparison: In-Sample Accuracy (Lower is Better)

Model      RMSE        MAE
ARIMA      405734.1    317782.5
ETS        585608.2    471996.2
Prophet    417573.3    275560.3

The model comparison results show clear differences in in-sample accuracy across the three approaches. The ARIMA model attains the lowest RMSE (405,734), reflecting its ability to track short-term dependencies in the data, although a single linear specification has limited ability to represent structural breaks such as the COVID-19 shock. The ETS model produces smoother fitted values by exponentially down-weighting older observations, but it has the largest errors on both metrics (RMSE 585,608; MAE 471,996), so it serves mainly as a baseline.

The Prophet model is flexible in capturing non-linear trends, seasonal variation and changes in trend. It attains the lowest MAE (275,560) with an RMSE (417,573) close to ARIMA's, which suggests smaller typical errors but a few larger misses.

Overall, ARIMA and Prophet fit the series about equally well, each leading on one metric, while ETS trails on both. Because all values are training-set errors, they describe goodness of fit rather than out-of-sample forecast accuracy. The agreement in forecast direction across models nonetheless supports the conclusion that MTA ridership is recovering gradually but has not returned to pre-pandemic levels.
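The RMSE and MAE values in the comparison table are computed on the same data used to fit each model. A hold-out check is a common complement. The sketch below illustrates the idea with base R on simulated data; the series, the 90-day test window and the ARIMA(1,1,0) order are illustrative assumptions, not the study's data or selected model.

```r
# Hold-out evaluation sketch on simulated data (base R only).
set.seed(42)

n <- 500   # number of simulated days (hypothetical)
h <- 90    # length of the held-out test window

# Simulated ridership in thousands: cumulative sum of AR(1) daily changes.
dy <- arima.sim(model = list(ar = 0.5), n = n, sd = 40)
y  <- 2000 + cumsum(as.numeric(dy))

# Time-ordered split: the last h days are never seen during fitting.
train <- y[1:(n - h)]
test  <- y[(n - h + 1):n]

# Fit on the training window only, then forecast h days ahead.
fit  <- arima(train, order = c(1, 1, 0))
pred <- as.numeric(predict(fit, n.ahead = h)$pred)

# Out-of-sample error metrics.
holdout_rmse <- sqrt(mean((test - pred)^2))
holdout_mae  <- mean(abs(test - pred))

data.frame(Metric = c("RMSE", "MAE"), Value = c(holdout_rmse, holdout_mae))
```

Applied to each model with the same split, this would give errors that reflect forecasting ability rather than goodness of fit.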

ARIMA Model

The ARIMA model was used to analyze and forecast ridership trends in the MTA system. ARIMA is a widely used time-series method that captures autocorrelation, underlying trends, and random fluctuations in the data.

Because ridership data is time-dependent, autocorrelation was first examined using autocorrelation function (ACF) plots. The series showed non-stationary behavior due to the sharp structural break during the COVID-19 pandemic. To address this, differencing was applied automatically within the modeling process to stabilize the mean and ensure stationarity. This step is important for producing reliable parameter estimates and improving forecast accuracy.
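The effect of differencing can be illustrated on a simulated series. The sketch below uses base R and a hypothetical random walk with drift rather than the ridership data; it compares the lag-1 sample autocorrelation before and after first differencing.

```r
# Differencing sketch on simulated data (base R only).
set.seed(1)

# A random walk with drift is non-stationary: its level has no fixed mean.
level <- cumsum(rnorm(1000, mean = 0.5, sd = 1))

# First differencing (d = 1) converts levels into day-to-day changes.
change <- diff(level)

# Lag-1 sample autocorrelation of each series (element 1 is lag 0).
acf_level  <- acf(level,  lag.max = 1, plot = FALSE)$acf[2]
acf_change <- acf(change, lag.max = 1, plot = FALSE)$acf[2]

c(level = acf_level, differenced = acf_change)
```

An autocorrelation near one in the level series and near zero after differencing is the pattern that leads an automatic procedure to select one order of differencing.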

The selected model captures both short-term dependencies and longer-term movement in ridership. The AR and MA components help account for persistence in travel behavior and unexpected shocks in demand.

Model Estimation

# Fit ARIMA model
fit_arima <- auto.arima(ts_ridership)

# Model summary
summary(fit_arima)

Series: ts_ridership 
ARIMA(3,1,3) with drift 

Coefficients:
         ar1     ar2      ar3      ma1      ma2     ma3      drift
      0.3327  0.0813  -0.7224  -0.6029  -0.5370  0.7891  -2093.181
s.e.  0.0597  0.0696   0.0590   0.0467   0.0662  0.0425   6374.812

sigma^2 = 1.659e+11:  log likelihood = -14320.34
AIC=28656.68   AICc=28656.83   BIC=28695.94

Training set error measures:
                   ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
Training set 2683.132 405734.1 317782.5 -3.481779 18.08459 0.3039474 -0.1511143

The selected ARIMA(3,1,3) model with drift uses one order of differencing, three autoregressive terms and three moving-average terms. The autoregressive and moving-average components allow the model to account for persistence and random shocks in the time series.
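Written out with the coefficients reported in the summary above, the fitted model is approximately as follows. This assumes the usual R sign convention, in which moving-average terms enter with a plus sign and the drift is the mean of the differenced series; $\Delta y_t$, $\mu$ and $\varepsilon_t$ are notation introduced here.

```latex
\begin{aligned}
\Delta y_t - \mu ={}& 0.3327\,(\Delta y_{t-1} - \mu) + 0.0813\,(\Delta y_{t-2} - \mu) - 0.7224\,(\Delta y_{t-3} - \mu) \\
&+ \varepsilon_t - 0.6029\,\varepsilon_{t-1} - 0.5370\,\varepsilon_{t-2} + 0.7891\,\varepsilon_{t-3},
\end{aligned}
\qquad \Delta y_t = y_t - y_{t-1}, \quad \mu = -2093.181
```

The drift estimate is small relative to its standard error (2093.181 / 6374.812 ≈ 0.33), so it is not statistically distinguishable from zero.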

Forecasting

# Forecast - ARIMA
forecast_arima <- forecast(fit_arima, h = 90)

# Create forecast dataframe
arima_df <- data.frame(
  date = seq(max(ts_data$date) + 1, by = "day", length.out = 90),
  forecast = as.numeric(forecast_arima$mean)
)

# Plot
ggplot() +
  geom_line(data = ts_data,
            aes(x = date, y = ridership),
            linewidth = 1.5) +
  geom_line(data = arima_df,
            aes(x = date, y = forecast),
            linetype = "dashed",
            linewidth = 1.5) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "ARIMA Forecast of MTA Ridership",
    x = "Date",
    y = "Daily Ridership"
  ) +
  theme_minimal(base_size = 14)

The forecast plot shows the observed series together with the 90-day point forecast (dashed line); prediction intervals are not drawn in this figure. The model captures the overall recovery trend following the sharp decline observed in early 2020.
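The plotting code above draws only the point forecast. As a minimal sketch of how approximate 95% prediction intervals could be derived from forecast standard errors, the example below uses base R on a simulated series; the data and the ARIMA(0,1,1) order are assumptions for illustration only.

```r
# Prediction-interval sketch on simulated data (base R only).
set.seed(7)

# Hypothetical non-stationary series standing in for daily ridership.
y <- cumsum(rnorm(400, mean = 0.2, sd = 1))

fit <- arima(y, order = c(0, 1, 1))
fc  <- predict(fit, n.ahead = 90)

point <- as.numeric(fc$pred)   # point forecasts
se    <- as.numeric(fc$se)     # forecast standard errors

# Normal-approximation 95% interval around each point forecast.
lower <- point - 1.96 * se
upper <- point + 1.96 * se
width <- upper - lower

head(data.frame(h = 1:90, point, lower, upper), 3)
```

For a differenced model the standard error grows with the horizon, so the interval widens the further ahead the forecast extends.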

Residual or Model Diagnostics

Residual diagnostics are used to evaluate how well the ARIMA model captures the underlying structure of the time series. In particular, residuals should behave like random noise with no visible pattern over time. This plot helps assess whether the model has successfully accounted for trends, seasonality, and autocorrelation in the ridership data.

res_df <- data.frame(
  date = ts_data$date,
  residuals = as.numeric(residuals(fit_arima))
)

ggplot(res_df, aes(x = date, y = residuals)) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Residuals Over Time (ARIMA Model)",
    x = "Date",
    y = "Residuals"
  ) +
  theme_minimal()+
  scale_y_continuous(labels = scales::label_comma())

The residuals fluctuate around zero without a clear pattern, suggesting that the ARIMA model captures the main structure of the data reasonably well. There is no strong evidence of systematic bias or remaining trend in the errors. However, some small variations still exist, which is expected due to external factors not included in the model such as policy changes, weather, and behavioral shifts during the post-pandemic period. Overall, the diagnostics indicate that the model is an adequate fit for the data.
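A visual check can be complemented with a formal test for remaining autocorrelation. The sketch below applies a Ljung-Box test with base R to the residuals of a model fitted on simulated data. The study does not report this test; the lag of 14 and the simulated series are assumptions.

```r
# Ljung-Box residual check on simulated data (base R only).
set.seed(3)

# Hypothetical ARIMA(1,1,0) series.
y <- cumsum(as.numeric(arima.sim(model = list(ar = 0.4), n = 500)))

fit <- arima(y, order = c(1, 1, 0))
res <- residuals(fit)

# fitdf = 1 accounts for the single estimated AR coefficient.
lb <- Box.test(res, lag = 14, type = "Ljung-Box", fitdf = 1)

lb$statistic
lb$p.value
```

A large p-value is consistent with residuals that behave like white noise; a small one would indicate autocorrelation the model has not captured.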

Prophet Model

The Prophet model was applied as a flexible time-series forecasting approach to capture non-linear trends and seasonal patterns in ridership data. Unlike traditional models, Prophet is designed to handle structural breaks and sudden changes, making it particularly suitable for modeling disruptions such as the COVID-19 pandemic.

The model decomposes the time series into trend, seasonality and residual components. This allows it to capture long-term growth patterns as well as recurring fluctuations in ridership behavior. Its ability to adapt to changes in trend makes it useful for analyzing post-pandemic recovery dynamics.
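In its default additive form, Prophet's decomposition is usually written as below, where $g(t)$ is a piecewise trend with changepoints, $s(t)$ is periodic seasonality, $h(t)$ captures optional holiday effects and $\varepsilon_t$ is the residual term. The symbols follow the common Prophet notation and do not appear in the study's code.

```latex
y(t) = g(t) + s(t) + h(t) + \varepsilon_t
```

When the model is fitted without a holidays table, as in a plain `prophet(df_prophet)` call, the $h(t)$ term is absent and the fit rests on trend and seasonality alone.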

Model Estimation

# Prepare data for Prophet
df_prophet <- ts_data %>%
  mutate(date = as.Date(date)) %>%
  rename(ds = date, y = ridership)

df_prophet$y <- as.numeric(df_prophet$y)
df_prophet <- df_prophet %>% filter(!is.na(ds) & !is.na(y))

# Fit Prophet model
model_prophet <- prophet(df_prophet)

Forecasting

# Create future dataframe
future <- make_future_dataframe(model_prophet, periods = 90)

# Generate forecast
forecast_prophet <- predict(model_prophet, future)

# Plot forecast
plot(model_prophet, forecast_prophet) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Forecasted Ridership",
    x = "Date",
    y = "Daily Ridership"
  ) +
  theme_minimal(base_size = 12)

The Prophet model identifies a clear structural break in early 2020, corresponding to the onset of the COVID-19 pandemic. It captures the sharp decline in ridership followed by a gradual recovery phase.

Actual vs Forecasted ARIMA

The ARIMA model is used to forecast short-term MTA ridership based on historical patterns in the time series data. This plot shows the observed ridership series followed by the model’s 90-day forecast, to assess whether the forecast continues recent trends and recovery behavior.

# Convert forecast to dataframe
forecast_df <- data.frame(
  date = seq(max(ts_data$date) + 1, by = "day", length.out = 90),
  predicted = as.numeric(forecast_arima$mean)
)

# Plot actual vs predicted (linewidth replaces the deprecated size argument for lines)
ggplot() +
  geom_line(data = ts_data, aes(x = date, y = ridership, color = "Actual"), linewidth = 0.7) +
  geom_line(data = forecast_df, aes(x = date, y = predicted, color = "Forecast"), linewidth = 0.7) +
  labs(
    title = "Actual vs Forecasted MTA Ridership (ARIMA)",
    x = "Date",
    y = "Ridership",
    color = "Legend"
  ) +
  theme_minimal() +
  scale_y_continuous(labels = scales::label_comma())

The ARIMA model captures the overall recovery trend in ridership following the pandemic-related decline. Forecasted values continue from the recent level of the actual data, indicating that the model picks up short-term patterns. However, predicted ridership remains below pre-pandemic levels, suggesting that full recovery is not expected in the near term based on historical trends alone.

ETS - Exponential Smoothing

The Exponential Smoothing (ETS) model was applied as an additional forecasting approach to analyze ridership trends. The model captures the underlying level and trend components of the time series by assigning greater weight to more recent observations. This makes it well suited for data with gradual changes over time.

The ETS results indicate a steady upward trend in ridership following the sharp decline observed in early 2020. Similar to the ARIMA model, the ETS forecast suggests a gradual recovery rather than a rapid return to pre-pandemic levels. The smoothing behavior of the model reduces short-term fluctuations and highlights the overall recovery trajectory.

Compared to Prophet, the ETS model is simpler and does not explicitly model complex seasonal patterns or structural breaks. However, it provides a stable baseline forecast that aligns closely with the general trend observed in the data.

Model Estimation

# Fit ETS model
fit_ets <- ets(ts_ridership)

# Model summary
summary(fit_ets)

ETS(M,Ad,N) 

Call:
ets(y = ts_ridership)

  Smoothing parameters:
    alpha = 0.0943 
    beta  = 0.0324 
    phi   = 0.8 

  Initial states:
    l = 4743400.0997 
    b = 210151.6362 

  sigma:  0.265

     AIC     AICc      BIC 
33088.40 33088.49 33117.85 

Training set error measures:
                    ME     RMSE      MAE       MPE     MAPE      MASE      ACF1
Training set -13658.06 585608.2 471996.2 -9.283079 29.33733 0.4514471 0.4424816
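The damping parameter in the summary above determines how quickly the forecast trend flattens. Under a damped additive trend, the h-step point forecast adds (phi + phi^2 + ... + phi^h) times the last estimated slope to the last estimated level. The sketch below computes that multiplier for the reported phi = 0.8; the 90-day horizon matches the forecasts in this study, and the variable names are illustrative.

```r
# Damped-trend multiplier for the reported damping parameter (base R only).
phi <- 0.8                    # damping parameter from the ETS summary
h   <- 1:90                   # forecast horizon in days

# Cumulative trend contribution at each horizon: phi + phi^2 + ... + phi^h.
multiplier <- cumsum(phi^h)

# Value the multiplier approaches as the horizon grows.
limit <- phi / (1 - phi)

round(multiplier[c(1, 7, 30, 90)], 4)
limit
```

With phi = 0.8 the trend contribution converges to about four times the final slope, so the ETS forecast levels off rather than continuing to climb.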

Forecasting

# Convert ETS forecast to dataframe
ets_df <- data.frame(
  date = seq(max(ts_data$date) + 1, by = "day", length.out = 90),
  forecast = as.numeric(forecast_ets$mean)
)

# Combine actual + forecast
ggplot() +
  geom_line(data = ts_data, 
            aes(x = date, y = ridership), 
            linewidth = 1) +
  geom_line(data = ets_df, 
            aes(x = date, y = forecast), 
            linetype = "dashed", 
            linewidth = 1) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "ETS Forecast of MTA Ridership",
    x = "Date",
    y = "Daily Ridership"
  ) +
  theme_minimal(base_size = 16)

The ETS model reinforces the findings from the ARIMA and Prophet models, indicating that MTA ridership is recovering gradually but remains below pre-pandemic levels. The consistency across multiple forecasting approaches strengthens the reliability of the results.

Although the ETS model is less flexible in capturing sudden structural changes such as the COVID-19 shock, it effectively summarizes the overall trend and provides a useful benchmark for comparison. Overall, the inclusion of the ETS model supports the conclusion that transit recovery is ongoing but incomplete, reflecting lasting changes in travel behavior and commuting patterns.

Model Comparison (ARIMA vs Prophet vs ETS)

The three forecasting approaches (ARIMA, Prophet and ETS) produce broadly consistent results, all pointing to a gradual recovery in MTA ridership after the sharp decline in 2020. Each model captures different aspects of the underlying data.

The ARIMA model performs well in modeling short-term dependencies and autocorrelation, making it effective for capturing overall trends in ridership. The Prophet model provides greater flexibility by incorporating trend changes and seasonal effects, allowing it to better account for the structural break caused by the COVID-19 pandemic. The ETS model focuses on level and trend components, producing a smoother and more stable baseline forecast.

These differences are also reflected in the model performance metrics. As shown in the Model Comparison Table, ARIMA has the lowest RMSE and Prophet the lowest MAE, while ETS has the highest error on both measures.

Overall, the consistency in both visual forecasts and quantitative metrics strengthens confidence in the findings, indicating that ridership recovery is gradual and remains below pre-pandemic levels.

Statistical Analysis

This section complements the exploratory data analysis by applying statistical methods to quantify relationships and validate observed trends in ridership for the Metropolitan Transportation Authority system during the COVID-19 pandemic.

Correlation Analysis

Correlation analysis was conducted to examine the relationship between ridership and key pandemic-related variables, including COVID-19 case counts and vaccination rates.

Pearson correlation coefficients were computed to measure the strength and direction of these relationships. The results indicate a negative correlation between ridership and COVID-19 case counts, suggesting that increases in infection rates are associated with decreased public transit usage. This reflects reduced mobility during periods of heightened health concerns.

In contrast, a positive correlation between ridership and vaccination rates was observed. As vaccination levels increased, ridership also showed signs of recovery, which is consistent with growing public confidence in using public transportation.
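As a minimal sketch of how such a coefficient and its uncertainty can be obtained, the example below runs a Pearson correlation test with base R. The data are simulated with a built-in negative relationship; the variable names and magnitudes are hypothetical and do not reproduce the study's values.

```r
# Pearson correlation sketch on simulated data (base R only).
set.seed(11)

n <- 300
cases     <- rexp(n, rate = 1 / 2000)                       # hypothetical daily cases
ridership <- 3e6 - 150 * cases + rnorm(n, mean = 0, sd = 2e5)  # hypothetical ridership

ct <- cor.test(cases, ridership, method = "pearson")

ct$estimate   # correlation coefficient
ct$conf.int   # 95% confidence interval
ct$p.value    # test of zero correlation
```

Reporting the confidence interval alongside the coefficient makes clear how precisely the strength of the relationship is estimated.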

Regression Modeling

To quantify the factors affecting subway ridership, a regression analysis was conducted using COVID-19 case counts and weekend indicators as key predictors. While earlier analysis identifies general patterns, this approach provides a clearer measure of how these variables influence daily ridership.

By applying a multiple linear regression model, the study evaluates the extent to which changes in public health conditions and travel behavior explain variations in transit usage over time.

Ridershipₜ = β₀ + β₁(Casesₜ) + β₂(Weekendₜ) + εₜ
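The Weekend term in the equation is a 0/1 indicator. The source does not show how the column was built, so the sketch below is one assumed construction using base R, with an illustrative two-week date range.

```r
# Weekend indicator sketch (base R only).
dates <- seq(as.Date("2024-01-01"), by = "day", length.out = 14)

# POSIXlt stores the day of week as 0 = Sunday ... 6 = Saturday,
# which avoids locale-dependent day names.
weekend <- as.integer(as.POSIXlt(dates)$wday %in% c(0, 6))

data.frame(date = dates, Weekend = weekend)
```

With this coding, β₂ is the average difference in ridership between weekend days and weekdays, holding case counts fixed.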

The regression results suggest that:

* Higher COVID-19 case counts are associated with slightly lower ridership, but the estimated effect is not statistically significant

* Weekend days have substantially lower ridership than weekdays

# Prepare dataset for regression
reg_data <- new_full_data %>%
  select(
    ridership = subways_total_estimated_ridership,
    cases,
    Weekend
  ) %>%
  drop_na()

# Fit linear regression model
model <- lm(ridership ~ cases + Weekend, data = reg_data)

# Model summary
summary(model)


Call:
lm(formula = ridership ~ cases + Weekend, data = reg_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1942507  -604297    29782   722306  3190574 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.308e+06  3.621e+04  63.754   <2e-16 ***
cases       -7.309e-01  5.329e+00  -0.137    0.891    
Weekend     -9.455e+05  6.168e+04 -15.328   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 875000 on 996 degrees of freedom
Multiple R-squared:  0.1925,    Adjusted R-squared:  0.1909 
F-statistic: 118.7 on 2 and 996 DF,  p-value: < 2.2e-16

The regression results show that the coefficient on COVID-19 case counts is negative (about -0.73 riders per additional case) but not statistically significant (p = 0.891). Once the weekend effect is controlled for, this model provides no evidence of a linear relationship between daily case counts and subway ridership.

The weekend variable shows a large and highly significant effect (p < 0.001): ridership is about 945,500 lower on weekends than on weekdays, consistent with commuting behavior.

Overall, the model explains about 19% of the variation in daily ridership (R-squared = 0.19), almost all of it through the day-of-week pattern. Additional variables such as vaccination rates and policy interventions could improve explanatory power.
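As a check on the coefficient table above, the t statistic, p-value and a 95% confidence interval for the cases coefficient can be recomputed from the reported estimate, standard error and residual degrees of freedom. The variable names below are illustrative.

```r
# Recompute inference for the cases coefficient from the reported summary (base R only).
est <- -0.7309   # reported estimate for cases
se  <- 5.329     # reported standard error
df  <- 996       # reported residual degrees of freedom

t_stat <- est / se                            # t statistic
p_val  <- 2 * pt(-abs(t_stat), df)            # two-sided p-value
ci     <- est + c(-1, 1) * qt(0.975, df) * se # 95% confidence interval

t_stat
p_val
ci
```

The interval spans zero, which is why the cases effect cannot be distinguished from zero in this model.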

# Plot residuals
plot(model)

Residual diagnostics indicate that the model captures the weekday and weekend difference, although substantial variability remains (residual standard error of about 875,000 riders) due to unobserved factors.

Time Series Diagnosis

Time series diagnostics were used to examine temporal patterns in ridership for the Metropolitan Transportation Authority system. The ACF results show clear autocorrelation, meaning past ridership values are strongly related to current values.

Seasonal patterns were also observed, with regular fluctuations in ridership before the pandemic. These patterns were disrupted during the COVID-19 period and gradually returned during the recovery phase.

Residual checks from the fitted models did not show any strong patterns, indicating that the models adequately capture the main structure of the data.

Segmented Trend Analysis

The ridership data for the Metropolitan Transportation Authority system was divided into phases to capture major shifts over time. The analysis shows a sharp decline in early 2020 due to the onset of the COVID-19 pandemic followed by a gradual recovery as restrictions eased. From 2022 onward, ridership began to stabilize but remained below pre-pandemic levels. This segmentation highlights clear structural changes in transit usage over the study period.
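The segmentation can be sketched as labeling each date with a phase and averaging ridership within phases. The source gives only "early 2020" and "from 2022 onward" as boundaries, so the cutoff dates, phase labels and simulated ridership (in millions) below are hypothetical.

```r
# Phase segmentation sketch on simulated data (base R only).
set.seed(5)

dates <- seq(as.Date("2020-01-01"), as.Date("2023-12-31"), by = "day")

# Hypothetical phase boundaries.
cutoffs <- as.Date(c("2020-03-15", "2022-01-01"))

# findInterval returns 0, 1 or 2 depending on how many cutoffs a date has passed.
phase_id <- findInterval(as.numeric(dates), as.numeric(cutoffs)) + 1
phase <- factor(phase_id, levels = 1:3,
                labels = c("Pre-pandemic", "Decline and recovery", "Stabilization"))

# Simulated ridership with a different mean level in each phase.
phase_level <- c(5.5, 2.0, 3.5)
ridership <- phase_level[phase_id] + rnorm(length(dates), mean = 0, sd = 0.3)

phase_means <- tapply(ridership, phase, mean)
phase_means
```

Comparing the phase means gives a compact summary of the structural changes described above.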

Summary

Overall Ridership Trends

Ridership in the MTA system fell sharply by more than 80% during March–April 2020. A gradual recovery began later in 2020, although progress slowed during major waves of the COVID-19 pandemic.

Transit Mode Differences

Different transit modes showed varied recovery patterns. Subway ridership recovered slowly due to reduced commuting demand, while buses rebounded faster, likely because they supported essential travel. Commuter rail usage remained below pre-pandemic levels, reflecting sustained remote work trends. In contrast, bridges and tunnels experienced a strong recovery, suggesting increased reliance on private vehicles.

Weekday and Weekend Patterns

Before the pandemic, ridership was significantly higher on weekdays compared to weekends. During the pandemic, this distinction became less pronounced. In the recovery phase, weekday travel patterns began to return but have not fully reached pre-pandemic levels.

Impact of COVID-19 Indicators

The correlation analysis indicates a negative relationship between ridership and COVID-19 case counts, while vaccination rates are positively associated with ridership recovery. In the regression that controls for weekends, however, the case-count effect is not statistically significant.

Regression Findings

In the subway regression, the coefficient on COVID-19 case counts is negative but not statistically significant (p = 0.891), and vaccination rates were not included as a predictor. The weekend effect is large and significant, with ridership roughly 945,500 lower on weekends, highlighting the difference in travel behavior between commuting and non-commuting days.

Discussion

The regression analysis provides quantitative evidence on how COVID-19 conditions and temporal patterns relate to MTA ridership. The coefficient for COVID-19 cases is negative, which is directionally consistent with higher infection levels reducing transit use, but it is not statistically significant (p = 0.891), so the regression alone does not establish this effect. The negative association seen in the correlation analysis may reflect behavioral responses to public health risk, where higher case levels reduce mobility.

Although remote work is not explicitly included in the regression model, prior research and observed trends suggest that it plays a significant role in shaping post-pandemic ridership patterns. The persistence of hybrid and remote work arrangements likely contributes to the slower recovery observed in subway and commuter rail services, where commuting demand is a primary driver of ridership.

The weekend indicator is statistically significant and negative, confirming that ridership is consistently lower on weekends compared to weekdays. This reflects the dominant role of weekday commuting in shaping overall transit demand, particularly for work-related travel in the New York City metropolitan system.

Model diagnostics indicate that the regression provides only a partial fit to the observed data. Residual plots do not show strong systematic patterns, but the model explains only about 19% of the variation in ridership. The large share of unexplained variation is expected given external influences such as policy changes, weather conditions, and behavioral shifts not explicitly included in the model.

Time-series forecasting results complement the regression analysis by capturing overall ridership dynamics and recovery patterns. ARIMA, ETS, and Prophet models all indicate a sharp structural break in early 2020 followed by a gradual recovery. While these models do not explicitly include external variables such as COVID-19 cases, they effectively capture the underlying trend and seasonal behavior in ridership data. Across all models, forecasts suggest that ridership is recovering but remains below pre-pandemic levels, indicating a potential new long-term equilibrium in transit demand.

The gradual recovery in ridership despite declining COVID-19 case levels suggests that factors beyond public health conditions are influencing transit demand. One likely explanation is the continued prevalence of remote and hybrid work arrangements, which reduce the need for daily commuting. Although not directly included in the regression model, this structural shift helps explain why ridership has not fully returned to pre-pandemic levels.

Conclusion

This study examined how MTA ridership has evolved since the COVID-19 pandemic, with particular emphasis on the role of public health conditions and remote work trends in shaping transit demand. The findings indicate that while ridership has recovered from the sharp decline observed in early 2020, the recovery has been gradual and remains incomplete.

The regression analysis gives a more qualified picture of these relationships. The estimated effect of COVID-19 case levels on ridership is negative but not statistically significant once weekend effects are controlled for, whereas the weekend effect is large and significant. The inverse pattern between case counts and ridership over time is visible in Figure 3.

More importantly, remote work appears to have a stronger and more persistent impact on ridership, although it was not included as a variable in the regression model. Observed trends suggest that remote work is not merely a temporary disruption but a structural shift in commuting behavior that continues to suppress transit demand.

Forecasting results further reinforce these findings. The ARIMA, Prophet and ETS models all project a gradual recovery in ridership, but predicted levels remain consistently below pre-pandemic baselines. The widening confidence intervals in Figure 4 also highlight increasing uncertainty in long-term forecasts, particularly in the presence of ongoing behavioral changes.

Overall, the analysis demonstrates that post-pandemic ridership trends cannot be fully explained by historical patterns alone. Instead, incorporating external variables such as public health conditions and remote work behavior is essential for accurately modeling and forecasting transit demand. These findings suggest that transit systems may need to adapt to a new equilibrium characterized by lower peak demand and more flexible commuting patterns.

References

Metropolitan Transportation Authority. (2020–2025). MTA Ridership Data and Recovery Reports.
American Public Transportation Association. (2021). Public Transportation Ridership Report.
U.S. Bureau of Transportation Statistics. (2022). Transportation Trends and COVID-19 Impact.
New York City Department of Health and Mental Hygiene. (2020–2025). COVID-19 Data Reports. https://www.nyc.gov/site/doh/covid/covid-19-data-vaccines.page
World Health Organization. (2020–2022). COVID-19 Pandemic Reports.
Jenelius, E., & Cebecauer, M. (2020). Impacts of COVID-19 on public transport ridership. https://pubmed.ncbi.nlm.nih.gov/34173478/
Public transit use in the United States in the era of COVID-19. https://www.sciencedirect.com/science/article/pii/S0967070X21002067
WFH Research. https://wfhresearch.com/
MTA Daily Ridership Data. MTA Open Data Portal. https://www.mta.info/open-data

DATA 698 : Capstone Research Project

New York City Transit’s Ridership Trends and Impact of COVID-19 Pandemic Statistics

Author: Rupendra Shrestha | May, 2026
