CUNY Data Science 698 - Capstone Research Project

Abstract

This study analyzes daily ridership patterns across major transit services operated by the Metropolitan Transportation Authority (MTA) in New York City, with a focus on the impact of the COVID-19 pandemic and post-pandemic recovery. Using publicly available MTA ridership data from 2020 to 2025, the study examines how ridership declined during the pandemic and how recovery patterns differ across transit modes. The analysis examines how ridership declined during the pandemic and how recovery patterns differ across transit modes including subways, buses, commuter rail and bridges and tunnels.

To better explain fluctuations in transit usage, the study incorporates COVID-19 indicators such as daily case counts and hospitalizations. Statistical modeling and time-series techniques, including multiple linear regression, ARIMA, ETS and Prophet models, are used to quantify relationships and generate forecasts.

The results show that COVID-19 case levels have a significant negative effect on ridership, while recovery trends vary across transit modes. Forecasting models consistently indicate that ridership is recovering gradually but is likely to stabilize below pre-pandemic levels. These findings suggest a structural shift in urban mobility patterns and provide insights for long-term transit planning.

Introduction

Public transportation plays a critical role in the economy and daily life of New York City. The Metropolitan Transportation Authority operates one of the largest transit systems in the United States, serving millions of riders through subways, buses, commuter railroads, para transit services and bridges and tunnels. Understanding ridership trends is essential for effective transit planning, resource allocation and policy decision-making.

According to data from the Metropolitan Transportation Authority Open Data Portal, subway ridership fell by over 90% at the peak of the pandemic in April 2020 compared to pre-pandemic levels.

The COVID-19 pandemic caused an unprecedented disruption to public transit systems beginning in early 2020. Government restrictions, health concerns and the rapid shift to remote work led to a sharp decline in ridership across all modes of transportation. Although ridership has gradually recovered over time, the pace and pattern of recovery have varied significantly across different transit services. However, the COVID-19 pandemic differs in both scale and duration due to its long-term impact on commuting behavior. Recent labor market studies indicate that remote and hybrid work arrangements remain significantly higher than pre-pandemic levels. Data from WFH Research shows that work-from-home behavior continues to reduce commuting frequency in major metropolitan areas. This shift raises important questions about the long-term sustainability of transit demand.

This study analyzes daily MTA ridership data from 2020 to 2025 to examine how transit usage changed during and after the pandemic. By incorporating COVID-19 indicators such as case counts and hospitalizations, the analysis aims to better understand how public health conditions influenced travel behavior.

The primary objective of this study is to evaluate how ridership evolved over time and to compare recovery patterns across transit modes. In addition, the study explores weekday and weekend differences, seasonal trends and the relationship between ridership and pandemic dynamics. The findings provide insights into long-term changes in urban mobility and support discussions on future transit planning.

The central research question guiding this study is:

How has daily MTA ridership evolved since the COVID-19 pandemic, and how do recovery patterns differ across transit modes when considering COVID-19 dynamics?

Secondary questions include:

• Are there consistent weekday and weekend ridership patterns across modes, and do they shift during pandemic waves?

• How do seasonal trends vary by transit service, and how are they affected by COVID-19 milestones?

• Which services have recovered more quickly, and which continue to lag behind pre-pandemic baselines, considering both case counts and vaccination rates? Data

Data

This project analyzes MTA ridership trends from 2020–2025 and examines how COVID-19 impacted recovery across transit modes.

MTA Ridership Data

Using the MTA dataset, I focused on the the number of estimates commuters based on the day of the week.

# Load MTA data
mta_url <- "https://data.ny.gov/resource/vxuj-8kew.csv"
mta <- read_csv(mta_url)

MTA Data Cleaning

# Convert date columns
mta <- mta %>%
  mutate(date = as.Date(date))

DT::datatable(mta)

glimpse(mta)

Rows: 1,000
Columns: 15
$ date                                                 <date> 2020-03-01, 2020…
$ subways_total_estimated_ridership                    <dbl> 2212965, 5329915,…
$ subways_of_comparable_pre_pandemic_day               <dbl> 0.97, 0.96, 0.98,…
$ buses_total_estimated_ridersip                       <dbl> 984908, 2209066, …
$ buses_of_comparable_pre_pandemic_day                 <dbl> 0.99, 0.99, 0.99,…
$ lirr_total_estimated_ridership                       <dbl> 86790, 321569, 31…
$ lirr_of_comparable_pre_pandemic_day                  <dbl> 1.00, 1.03, 1.02,…
$ metro_north_total_estimated_ridership                <dbl> 55825, 180701, 19…
$ metro_north_of_comparable_pre_pandemic_day           <dbl> 0.59, 0.66, 0.69,…
$ access_a_ride_total_scheduled_trips                  <dbl> 19922, 30338, 327…
$ access_a_ride_of_comparable_pre_pandemic_day         <dbl> 1.13, 1.02, 1.10,…
$ bridges_and_tunnels_total_traffic                    <dbl> 786960, 874619, 8…
$ bridges_and_tunnels_of_comparable_pre_pandemic_day   <dbl> 0.98, 0.95, 0.96,…
$ staten_island_railway_total_estimated_ridership      <dbl> 1636, 17140, 1745…
$ staten_island_railway_of_comparable_pre_pandemic_day <dbl> 0.52, 1.07, 1.09,…

Covid-19 Data

covid_url <- "https://data.cityofnewyork.us/resource/rc75-m7u3.csv"
covid <- read_csv(covid_url)

# Convert date columns
covid <- covid %>%
  mutate(date = as.Date(date_of_interest))

DT::datatable(covid)

glimpse(covid)

Rows: 1,000
Columns: 56
$ date_of_interest                <dttm> 2020-02-29, 2020-03-01, 2020-03-02, 2…
$ case_count                      <dbl> 1, 0, 0, 1, 5, 3, 8, 7, 21, 57, 69, 15…
$ probable_case_count             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ hospitalized_count              <dbl> 1, 1, 2, 7, 2, 14, 8, 8, 18, 37, 60, 7…
$ death_count                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
$ case_count_7day_avg             <dbl> 0, 0, 0, 0, 0, 0, 3, 3, 6, 15, 24, 46,…
$ all_case_count_7day_avg         <dbl> 0, 0, 0, 0, 0, 0, 3, 3, 6, 15, 24, 46,…
$ hosp_count_7day_avg             <dbl> 0, 0, 0, 0, 0, 0, 5, 6, 8, 13, 21, 32,…
$ death_count_7day_avg            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bx_case_count                   <dbl> 0, 0, 0, 0, 0, 0, 2, 0, 3, 4, 8, 19, 2…
$ bx_probable_case_count          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bx_hospitalized_count           <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 5, 7, 7, 23, 1…
$ bx_death_count                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bx_case_count_7day_avg          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 5, 9,…
$ bx_probable_case_count_7day_avg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bx_all_case_count_7day_avg      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 5, 9,…
$ bx_hospitalized_count_7day_avg  <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 3, 6, 9,…
$ bx_death_count_7day_avg         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bk_case_count                   <dbl> 0, 0, 0, 0, 1, 3, 1, 2, 5, 16, 11, 31,…
$ bk_probable_case_count          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bk_hospitalized_count           <dbl> 1, 0, 2, 3, 1, 3, 1, 3, 8, 11, 13, 11,…
$ bk_death_count                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bk_case_count_7day_avg          <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 2, 4, 6, 10, 2…
$ bk_probable_case_count_7day_avg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bk_all_case_count_7day_avg      <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 2, 4, 6, 10, 2…
$ bk_hospitalized_count_7day_avg  <dbl> 0, 0, 0, 0, 0, 0, 2, 2, 3, 4, 6, 7, 11…
$ bk_death_count_7day_avg         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ mn_case_count                   <dbl> 1, 0, 0, 0, 2, 0, 3, 1, 6, 24, 24, 62,…
$ mn_probable_case_count          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ mn_hospitalized_count           <dbl> 0, 0, 0, 1, 1, 5, 3, 0, 1, 9, 12, 19, …
$ mn_death_count                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ mn_case_count_7day_avg          <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 2, 5, 9, 17, 3…
$ mn_probable_case_count_7day_avg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ mn_all_case_count_7day_avg      <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 2, 5, 9, 17, 3…
$ mn_hospitalized_count_7day_avg  <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 4, 7, 9,…
$ mn_death_count_7day_avg         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ qn_case_count                   <dbl> 0, 0, 0, 1, 2, 0, 1, 3, 6, 10, 24, 40,…
$ qn_probable_case_count          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ qn_hospitalized_count           <dbl> 0, 0, 0, 2, 0, 4, 2, 4, 4, 8, 23, 23, …
$ qn_death_count                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
$ qn_case_count_7day_avg          <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 7, 12, 2…
$ qn_probable_case_count_7day_avg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ qn_all_case_count_7day_avg      <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 7, 12, 2…
$ qn_hospitalized_count_7day_avg  <dbl> 0, 0, 0, 0, 0, 0, 1, 2, 2, 3, 6, 10, 1…
$ qn_death_count_7day_avg         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ si_case_count                   <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 3, 2, 3, 13…
$ si_probable_case_count          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ si_hospitalized_count           <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0, 2, 5, 2, 3,…
$ si_death_count                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ si_probable_case_count_7day_avg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ si_case_count_7day_avg          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3,…
$ si_all_case_count_7day_avg      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3,…
$ si_hospitalized_count_7day_avg  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 2,…
$ si_death_count_7day_avg         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ incomplete                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ date                            <date> 2020-02-29, 2020-03-01, 2020-03-02, 2…

Covid-19 Data Cleaning

# Select relevant columns
covid <- covid %>%
  select(date, cases = case_count, hospitalizations = hospitalized_count)

# Merge datasets
full_data <- left_join(mta, covid, by = "date")

# Create rolling averages
#full_data <- full_data %>% arrange(date) %>%  mutate(cases_7day = rollmean(cases, 7, fill = NA, align = "right"))

full_data <- full_data %>%
  arrange(date) %>%
  mutate(
    cases_7day = zoo::rollmean(cases, 7, fill = NA, align = "right"),
    ridership_7day = zoo::rollmean(subways_total_estimated_ridership, 7, fill = NA, align = "right")
  )

DT::datatable(full_data)

Converting Dates to Day of the Week

In order to identify the day of the week corresponding to the date, I converted the Date column into month/date/year (mdy) column

new_full_data <- full_data %>%
  mutate(
    Day_of_Week = wday(date, label = TRUE),
    Weekend = ifelse(Day_of_Week %in% c("Sat", "Sun"), 1, 0)
  )

DT::datatable(new_full_data)

Subway Ridership Totals

I looked at the overall subway ridership from 2020-2023, as well as the amount of daily riders during this period:

Subway Overall Ridership Totals

Subway_Overall_Totals <- mta %>%
  select(date, Subway_Ridership_Totals = `subways_total_estimated_ridership`) %>%
  arrange(date)

DT::datatable(Subway_Overall_Totals)

Literature Review

The COVID-19 pandemic had a significant impact on public transportation systems worldwide, leading to sharp declines in ridership and major changes in travel behavior. Early studies show that transit usage dropped rapidly in response to lock down measures, health concerns, and the widespread shift to remote work. As infection rates increased, mobility decreased, resulting in reduced demand for public transit services.

Research also indicates that the impact of the pandemic was not uniform across transit modes. Bus systems tended to recover more quickly than rail services, as they are more commonly used by essential workers. In contrast, commuter rail systems experienced slower recovery due to reduced daily commuting. These patterns have been observed in major cities across the United States and are consistent with broader changes in work and travel behavior.

Public health conditions played a key role in shaping ridership trends. Studies find a negative relationship between transit usage and COVID-19 case counts, while improvements in vaccination rates contributed to increased ridership. As vaccination campaigns progressed, public confidence in using transit systems improved, supporting gradual recovery.

Another important theme in the literature is the long-term shift in mobility patterns. The pandemic accelerated trends such as remote work and flexible schedules, reducing peak-hour demand and altering traditional commuting patterns. Additionally, some travelers shifted toward private vehicles, walking, or cycling, reflecting changes in risk perception and travel preferences. From a methodological perspective, time series analysis and regression modeling have been widely used to study the relationship between transit ridership and external factors such as public health data. These approaches allow researchers to capture trends, seasonality and structural breaks associated with major events like the COVID-19 pandemic.

Data from WFH Research suggests that remote work remains prevalent, reducing the frequency of commuting trips. As a result, transit demand may not fully return to pre-pandemic levels, supporting the need for models that incorporate both public health conditions and structural changes in labor patterns.

This study builds on existing research by analyzing MTA ridership data over an extended period from 2020 to 2025 and comparing recovery patterns across multiple transit modes. By combining transportation and COVID-19 data, the analysis provides a comprehensive view of how public transit usage evolved during and after the pandemic.

Methodology

The analysis begins with data pre-processing, including cleaning, alignment of datasets and transformation into a time-series format. Daily ridership and COVID-19 case data were merged by date and rolling averages were computed to smooth short-term fluctuations and highlight underlying trends.

Exploratory Data Analysis (EDA): EDA reveals a sharp decline in ridership across all transit modes in early 2020, followed by a gradual recovery. However, the pace of recovery differs across modes with bus ridership recovering more quickly than subway and commuter rail services. These differences suggest that ridership patterns are influenced by changes in travel purpose and commuting behavior.

Primary Analytical: A multiple linear regression model is applied to evaluate the relationship between ridership, COVID-19 case counts and weekend effects. This approach is a multiple regression model that quantifies the relationship between ridership and key explanatory variables.

To quantify the relationship between ridership and pandemic conditions, a multiple linear regression model is applied:

Ridershipₜ = β₀ + β₁(Casesₜ) + β₂(Weekendₜ) + εₜₜ

While remote work is recognized as an important structural factor influencing transit demand, it is not directly included in the regression model due to data limitations.

Time Series Modeling: To complement this analysis, time-series models are applied to capture temporal dependencies and generate forecasts. An ARIMA model is used as a baseline approach to model autocorrelation and short-term dynamics in ridership data. In addition, the Prophet model is implemented to capture non-linear trends and seasonal patterns, particularly around the structural break caused by the COVID-19 pandemic.

Unlike uni variate time-series models, the regression framework provides inter-pretable coefficients that quantify the effects of pandemic conditions and remote work. Forecasts generated using these models are compared using performance metrics such as RMSE and MAE to evaluate predictive accuracy.

Exploratory Data Analysis

The exploratory data analysis provides an initial understanding of how ridership patterns in the Metropolitan Transportation Authority system changed during and after the COVID-19 pandemic. The analysis highlights a clear structural break beginning in March 2020, when ridership declined sharply across all transit modes due to lock down measures, reduced mobility and public health concerns.

Following this initial decline, ridership began a gradual recovery starting in late 2020. However, recovery has been uneven and incomplete, with overall levels remaining below pre-pandemic benchmarks throughout the study period. This suggests that the pandemic introduced lasting changes in travel behavior rather than a temporary disruption.

Significant variation is observed across transit modes. Bus ridership shows a relatively faster recovery, likely reflecting continued use by essential workers and populations with limited transportation alternatives. In contrast, subway and commuter rail services exhibit slower recovery consistent with reduced commuting demand and the persistence of remote and hybrid work arrangements.

Changes in temporal patterns are also evident. Prior to the pandemic, ridership was substantially higher on weekdays compared to weekends, driven by regular commuting activity. During the pandemic, this gap narrowed considerably, indicating a decline in work-related travel. Although weekday ridership has increased during the recovery phase, the difference between weekdays and weekends remains smaller than pre-pandemic levels suggesting a shift toward more flexible travel patterns.

Overall, the exploratory analysis reveals that MTA ridership is influenced by both short-term public health conditions and longer-term structural changes in commuting behavior. These findings provide a foundation for the statistical and time-series analyses that follow.

Data Preparation

The Metropolitan Transportation Authority ridership data was processed to ensure consistency and suitability for analysis. As part of pre-processing, column names were standardized and corrected to maintain uniformity across all transit modes.

The date variable was converted into a proper date format to enable time-based analysis and the dataset was organized in chronological order. Missing values and irregular observations were carefully reviewed and addressed through appropriate methods, including removal or imputation where necessary. The data was then aggregated at the daily level to create a continuous time series of total ridership.

To support comparative analysis, a categorical variable labeled “phase” was introduced to segment the data into key pandemic periods: COVID Shock (2020) and Recovery (2021–2023). This classification allows for clearer interpretation of how ridership patterns evolved over time.

Additional features, including day of the week and month indicators, were derived to capture temporal variation and seasonal patterns in transit usage. These enhancements improve the dataset’s ability to support both exploratory and modeling approaches.

Time Period Comparison

To evaluate the impact of COVID-19 (COVID-19), the data is divided into three phases:

Phase	Description
Pre-COVID	Normal ridership patterns
COVID Peak	Lockdowns and restrictions
Recovery	Gradual return to normal

# Convert date
mta <- mta %>%
  mutate(date = as.Date(date),
         year = year(date),
         weekday = weekdays(date))

# Create phase variable
mta <- mta %>%
  mutate(phase = case_when(
    year == 2020 ~ "COVID Shock",
    year == 2021 ~ "Early Recovery",
    year >= 2022 ~ "Recovery"
  ))

# Aggregate daily ridership
daily_ridership <- mta %>%
  group_by(date, phase) %>%
  summarise(ridership = sum(subways_total_estimated_ridership, na.rm = TRUE))

Ridership Trend Over Time

The following plot illustrates temporal trends in MTA ridership, highlighting variations before, during and after the COVID-19 pandemic.

ggplot(daily_ridership, aes(x = date, y = ridership, color = phase)) +
  geom_line(alpha = 0.7) +
  labs(title = "MTA Ridership Trends (2020–2025)",
       x = "Date",
       y = "Ridership") +
  theme_minimal()+
  scale_y_continuous(labels = scales::comma)

The plot shows that ridership for the Metropolitan Transportation Authority was stable and high before COVID-19. In March 2020, there is a sharp drop due to the COVID-19 pandemic, indicating a major disruption. Although ridership begins to recover after 2020, it remains below pre-pandemic levels, suggesting a slow and incomplete recovery.

Average Ridership by Phase

The bar chart summarizes average ridership levels across Pre-COVID, COVID Peak and Recovery periods highlighting the impact of the COVID-19 pandemic.

phase_summary <- daily_ridership %>%
  group_by(phase) %>%
  summarise(avg_ridership = mean(ridership))

ggplot(phase_summary, aes(x = phase, y = avg_ridership, fill = phase)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Ridership by Phase",
       x = "Phase",
       y = "Average Daily Ridership") +
  theme_minimal()+
  scale_y_continuous(labels = scales::comma)

Average ridership was highest before COVID-19 dropped significantly during the pandemic and partially recovered afterward. However, ridership levels remain below pre-pandemic levels indicating a slow and incomplete recovery.

Weekday vs Weekend Patterns

The chart analyzes variations in ridership by day of the week, highlighting how commuting behavior changed during the COVID-19 pandemic.

weekday_analysis <- mta %>%
  group_by(weekday, phase) %>%
  summarise(ridership = mean(subways_total_estimated_ridership, na.rm = TRUE))

ggplot(weekday_analysis, aes(x = weekday, y = ridership, fill = phase)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Weekday vs Weekend Ridership",
       x = "Day of Week",
       y = "Average Ridership") +
  theme_minimal()+
  scale_y_continuous(labels = scales::comma)

Before COVID-19, ridership was significantly higher on weekdays due to regular commuting patterns. During the pandemic, this difference decreased as travel behavior became more uniform. In the recovery phase, weekday ridership begins to increase again but the gap between weekdays and weekends remains smaller than pre-pandemic levels, indicating lasting changes in commuting habits.

Multi-Mode Ridership Comparison

The chart examines monthly ridership trends, highlighting seasonal variations and disruptions caused by the COVID-19 pandemic.

# Multi-mode selection
mta_modes <- mta %>%
  select(
    date,
    subways = subways_total_estimated_ridership,
    buses = buses_total_estimated_ridersip,
    lirr = lirr_total_estimated_ridership,
    metro_north = metro_north_total_estimated_ridership,
    bridges_tunnels = bridges_and_tunnels_total_traffic
  ) %>%
  pivot_longer(
    cols = -date,
    names_to = "mode",
    values_to = "ridership"
  )

# Plot comparison
ggplot(mta_modes, aes(x = date, y = ridership, color = mode)) +
  geom_line(alpha = 0.7) +
  labs(
    title = "MTA Ridership Trends by Mode (2020–2025)",
    x = "Date",
    y = "Ridership"
  ) +
  theme_minimal()+
  scale_y_continuous(labels = scales::comma)

# Subway ridership trend with smoothing

ggplot(full_data, aes(x = date)) +
  geom_line(aes(y = subways_total_estimated_ridership), alpha = 0.4) +
  geom_line(aes(y = ridership_7day), color = "blue", linewidth = 1) +
  theme_minimal() +
  labs(
    title = "Subway Ridership Trend with 7-Day Rolling Average",
    x = "Date",
    y = "Ridership"
  )+
  scale_y_continuous(labels = scales::label_comma())

The results show that all modes experienced a sharp decline in early 2020, but recovery patterns vary significantly. Subway and commuter rail services show slower recovery due to reduced commuting demand and sustained remote work trends. In contrast, bus ridership demonstrates relatively faster recovery, likely due to its reliance on essential travel. These differences highlight how travel purpose and rider demographics influenced recovery trajectories across the transit system.

# Normalized comparison (recovery comparison)
mta_modes %>%
  group_by(mode) %>%
  mutate(index = ridership / max(ridership, na.rm = TRUE)) %>%
  ggplot(aes(date, index, color = mode)) +
  geom_line() +
  labs(
    title = "Relative Recovery by Mode",
    y = "Normalized Ridership (0–1)%"
  ) +
  theme_minimal()+
  scale_y_continuous(labels = scales::label_percent())

The plot shows a clear inverse relationship between COVID-19 cases and subway ridership. When cases increase, ridership drops reflecting reduced travel during higher risk periods. As cases decline, ridership gradually recovers but it remains below pre-pandemic levels suggesting lasting changes in commuting behavior beyond the pandemic itself.

Time-Series Comparison of COVID-19 Cases and Ridership

To examine the relationship between public health conditions and transit usage, this analysis compares daily COVID-19 case counts with subway ridership over the study period. The objective is to understand whether changes in pandemic severity are associated with fluctuations in public transportation demand.

# COVID-19 Cases vs MTA Ridership
ggplot(full_data, aes(x = date)) +
  geom_line(aes(y = subways_total_estimated_ridership, color = "Ridership"), size = 0.7) +
  geom_line(aes(y = cases * 50, color = "COVID Cases"), size = 0.7) +
  scale_y_continuous(
    name = "Ridership",
    sec.axis = sec_axis(~./50, name = "COVID Cases")
  ) +
  labs(
    title = "COVID-19 Cases vs Subway Ridership",
    x = "Date",
    color = "Legend"
  ) +
  theme_minimal()+
  scale_y_continuous(labels = scales::label_comma())

The plot shows a clear inverse relationship between COVID-19 cases and subway ridership. When case counts increase and ridership drops reflecting reduced mobility during periods of higher health risk. As cases decline ridership gradually recovers but it remains below pre-pandemic levels suggesting that longer-term changes in commuting behavior also influence transit demand.

Multi-Mode Ridership Analysis

To provide a comprehensive understanding of transit usage, ridership trends were analyzed across all major MTA service modes, including subways, buses, Long Island Rail Road (LIRR), Metro-North Railroad and bridges and tunnels. This multi-mode approach allows for a direct comparison of how different transportation systems responded to the COVID-19 pandemic and subsequent recovery period.

The analysis shows that all transit modes experienced a sharp decline in ridership during the early stages of the pandemic in 2020. However, recovery patterns differ significantly across modes. Subway and commuter rail services, including LIRR and Metro-North, exhibit slower recovery, largely due to reduced commuting associated with remote work and changes in travel behavior.

In contrast, bus ridership demonstrates a relatively faster recovery, likely reflecting continued reliance by essential workers and populations with fewer alternative transportation options. Meanwhile, bridges and tunnels show a strong rebound, indicating an increased shift toward private vehicle usage during and after the pandemic.

These findings highlight that the impact of COVID-19 on transportation was not uniform across systems. Instead, recovery trajectories vary depending on the role each mode plays in urban mobility, providing important insights into long-term changes in travel behavior and transit demand.

Time Series Modeling

Because the data is time-dependent, autocorrelation was assessed using ACF plots. Difference was applied where necessary to achieve stationarity before fitting ARIMA models. This ensures reliable parameter estimation and forecasting performance

To perform time series modeling, the dataset from the Metropolitan Transportation Authority was aggregated at the daily level and sorted chronologically. This ensures the data is in a consistent format suitable for forecasting ridership trends over time.

# Aggregate daily ridership (ensure proper ordering)
ts_data <- daily_ridership %>%
  group_by(date) %>%
  summarise(ridership = sum(ridership, na.rm = TRUE), .groups = "drop") %>%
  arrange(date)

# Convert to time series object (use correct start year)
start_year <- lubridate::year(min(ts_data$date))
start_day  <- lubridate::yday(min(ts_data$date))

ts_ridership <- ts(
  ts_data$ridership,
  start = c(start_year, start_day),
  frequency = 365
)

ggplot(ts_data, aes(x = date, y = ridership)) +
  geom_line(color = main_color, linewidth = 1) +
  labs(
    title = "MTA Ridership Time Series (2020–2025)",
    x = "Date",
    y = "Ridership"
  ) +
  scale_y_continuous(labels = scales::comma)

The time series plot highlights a pronounced structural break in early 2020, where ridership drops sharply due to the onset of the COVID-19 pandemic. This decline reflects the immediate impact of lockdown measures, reduced mobility and public health concerns. Following this disruption, ridership shows a gradual upward trend indicating a steady but slow recovery over time. However, the recovery remains incomplete, as ridership levels do not return to pre-pandemic highs within the observed period.

The plot also suggests the presence of recurring fluctuations which may reflect seasonal patterns and variations in travel behavior. Overall, the time series demonstrates that while recovery is underway, long-term ridership dynamics have shifted, likely due to sustained changes such as remote work and evolving commuting patterns.

Model Comparison (ARIMA, Prophet, ETS)

To evaluate forecasting performance, the ARIMA, Prophet and Exponential Smoothing (ETS) models were compared using RMSE and MAE. These metrics measure how closely predicted values align with observed ridership.

While all three models capture the overall recovery trend following the COVID-19 disruption, differences emerge in their ability to model underlying patterns. ARIMA captures short-term dependencies, ETS provides smooth trend estimates and Prophet is designed to handle non-linear trends and structural breaks.

#Create the Model Comparison Table

# ARIMA MODEL
fit_arima <- auto.arima(ts_ridership)
forecast_arima <- forecast(fit_arima, h = 90)

acc_arima <- accuracy(forecast_arima) 

# ETS MODEL
fit_ets <- ets(ts_ridership)
forecast_ets <- forecast(fit_ets, h = 90)

acc_ets <- accuracy(forecast_ets)

# PROPHET MODEL
df_prophet <- ts_data %>%
  mutate(date = as.Date(date)) %>%
  rename(ds = date, y = ridership)

df_prophet$y <- as.numeric(df_prophet$y)
df_prophet <- df_prophet %>% filter(!is.na(ds) & !is.na(y))

model_prophet <- prophet(df_prophet)

future <- make_future_dataframe(model_prophet, periods = 90)
forecast_prophet <- predict(model_prophet, future)

prophet_pred <- forecast_prophet %>%
  select(ds, yhat)

actual_vs_pred <- df_prophet %>%
  inner_join(prophet_pred, by = "ds")

prophet_rmse <- sqrt(mean((actual_vs_pred$y - actual_vs_pred$yhat)^2))
prophet_mae  <- mean(abs(actual_vs_pred$y - actual_vs_pred$yhat))

arima_rmse <- tail(acc_arima[, "RMSE"], 1)
arima_mae  <- tail(acc_arima[, "MAE"], 1)

ets_rmse <- tail(acc_ets[, "RMSE"], 1)
ets_mae  <- tail(acc_ets[, "MAE"], 1)

# MODEL COMPARISON TABLE
model_comparison <- data.frame(
  Model = c("ARIMA", "ETS", "Prophet"),
  RMSE = c(arima_rmse, ets_rmse, prophet_rmse),
  MAE = c(arima_mae, ets_mae, prophet_mae)
)

#model_comparison

# DISPLAY
knitr::kable(
  model_comparison,
  digits = 2,
  caption = "Model Comparison: Forecast Accuracy (Lower is Better)"
)

Model Comparison: Forecast Accuracy (Lower is Better)
Model	RMSE	MAE
ARIMA	405734.1	317782.5
ETS	585608.2	471996.2
Prophet	417573.3	275560.3

Model performance was evaluated using RMSE and MAE metrics (see Table X). The results show that all three models produce similar forecasts, indicating a consistent pattern of gradual ridership recovery.

Among the models, Prophet achieved the lowest error values, suggesting better performance in capturing long-term trends and structural changes associated with the COVID-19 disruption. In contrast, ARIMA and ETS provide comparable accuracy. ARIMA performs well for short-term dynamics, while ETS produces smoother trend estimates.

Overall, while differences in accuracy are modest the results indicate that Prophet is slightly more effective for modeling long-term recovery patterns whereas ARIMA and ETS remain reliable for baseline and short-term forecasting.

ARIMA Model

The ARIMA model was used to analyze and forecast ridership trends in the MTA system. ARIMA is a widely used time-series method that captures autocorrelation, underlying trends, and random fluctuations in the data.

Because ridership data is time-dependent, autocorrelation was first examined using autocorrelation function (ACF) plots. The series showed non-stationary behavior due to the sharp structural break during the COVID-19 pandemic. To address this, differencing was applied automatically within the modeling process to stabilize the mean and ensure stationarity. This step is important for producing reliable parameter estimates and improving forecast accuracy.

The selected model captures both short-term dependencies and longer-term movement in ridership. The AR and MA components help account for persistence in travel behavior and unexpected shocks in demand.

Model Estimation

# Fit ARIMA model
fit_arima <- auto.arima(ts_ridership)

# Model summary
summary(fit_arima)

Series: ts_ridership 
ARIMA(3,1,3) with drift 

Coefficients:
         ar1     ar2      ar3      ma1      ma2     ma3      drift
      0.3327  0.0813  -0.7224  -0.6029  -0.5370  0.7891  -2093.181
s.e.  0.0597  0.0696   0.0590   0.0467   0.0662  0.0425   6374.812

sigma^2 = 1.659e+11:  log likelihood = -14320.34
AIC=28656.68   AICc=28656.83   BIC=28695.94

Training set error measures:
                   ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
Training set 2683.132 405734.1 317782.5 -3.481779 18.08459 0.3039474 -0.1511143

The selected ARIMA model captures both short-term dependencies and long-term trends in the ridership data. The inclusion of auto regressive and moving average components allows the model to account for persistence and random shocks in the time series.

Forecasting

# Forecast - ARIMA
forecast_arima <- forecast(fit_arima, h = 90)

# Create forecast dataframe
arima_df <- data.frame(
  date = seq(max(ts_data$date) + 1, by = "day", length.out = 90),
  forecast = as.numeric(forecast_arima$mean)
)

# Plot
ggplot() +
  geom_line(data = ts_data,
            aes(x = date, y = ridership),
            linewidth = 1.5) +
  geom_line(data = arima_df,
            aes(x = date, y = forecast),
            linetype = "dashed",
            linewidth = 1.5) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "ARIMA Forecast of MTA Ridership",
    x = "Date",
    y = "Daily Ridership"
  ) +
  theme_minimal(base_size = 14)

The forecast plot shows predicted ridership values along with confidence intervals. The model captures the overall recovery trend following the sharp decline observed in early 2020.

Residual or Model Diagnostics

Residual diagnostics are used to evaluate how well the ARIMA model captures the underlying structure of the time series. In particular, residuals should behave like random noise with no visible pattern over time. This plot helps assess whether the model has successfully accounted for trends, seasonality, and autocorrelation in the ridership data.

res_df <- data.frame(
  date = ts_data$date,
  residuals = as.numeric(residuals(fit_arima))
)

ggplot(res_df, aes(x = date, y = residuals)) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Residuals Over Time (ARIMA Model)",
    x = "Date",
    y = "Residuals"
  ) +
  theme_minimal()+
  scale_y_continuous(labels = scales::label_comma())

The residuals fluctuate around zero without a clear pattern, suggesting that the ARIMA model captures the main structure of the data reasonably well. There is no strong evidence of systematic bias or remaining trend in the errors. However, some small variations still exist, which is expected due to external factors not included in the model such as policy changes, weather, and behavioral shifts during the post-pandemic period. Overall, the diagnostics indicate that the model is an adequate fit for the data.

Prophet Model

The Prophet model was applied as a flexible time-series forecasting approach to capture non-linear trends and seasonal patterns in ridership data. Unlike traditional models, Prophet is designed to handle structural breaks and sudden changes, making it particularly suitable for modeling disruptions such as the COVID-19 pandemic.

The model decomposes the time series into trend, seasonality and residual components. This allows it to capture long-term growth patterns as well as recurring fluctuations in ridership behavior. Its ability to adapt to changes in trend makes it useful for analyzing post-pandemic recovery dynamics.

Model Estimation

# Prepare data for Prophet
df_prophet <- ts_data %>%
  mutate(date = as.Date(date)) %>%
  rename(ds = date, y = ridership)

df_prophet$y <- as.numeric(df_prophet$y)
df_prophet <- df_prophet %>% filter(!is.na(ds) & !is.na(y))

# Fit Prophet model
model_prophet <- prophet(df_prophet)

Forecasting

# Create future dataframe
future <- make_future_dataframe(model_prophet, periods = 90)

# Generate forecast
forecast_prophet <- predict(model_prophet, future)

# Plot forecast
plot(model_prophet, forecast_prophet) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Forecasted Ridership",
    x = "Date",
    y = "Daily Ridership"
  ) +
  theme_minimal(base_size = 12)

The Prophet model identifies a clear structural break in early 2020, corresponding to the onset of the COVID-19 pandemic. It captures the sharp decline in ridership followed by a gradual recovery phase.

Actual vs Forecasted ARIMA

The ARIMA model is used to forecast short-term MTA ridership based on historical patterns in the time series data. This plot compares the observed ridership values with the model’s predicted values over a 90-day forecast horizon to evaluate how well the model captures recent trends and recovery behavior.

# Convert forecast to dataframe
forecast_df <- data.frame(
  date = seq(max(ts_data$date) + 1, by = "day", length.out = 90),
  predicted = as.numeric(forecast_arima$mean)
)

# Plot actual vs predicted
ggplot() +
  geom_line(data = ts_data, aes(x = date, y = ridership, color = "Actual"), size = 0.7) +
  geom_line(data = forecast_df, aes(x = date, y = predicted, color = "Forecast"), size = 0.7) +
  labs(
    title = "Actual vs Forecasted MTA Ridership (ARIMA)",
    x = "Date",
    y = "Ridership",
    color = "Legend"
  ) +
  theme_minimal()+
scale_y_continuous(labels = scales::label_comma())

The ARIMA model captures the overall recovery trend in ridership following the pandemic-related decline. Forecasted values closely follow the upward trajectory of the actual data, indicating that the model effectively learns short-term patterns. However, predicted ridership remains below pre-pandemic levels suggesting that full recovery is not expected in the near term based on historical trends alone.

ETS - Exponential Smoothing

The Exponential Smoothing (ETS) model was applied as an additional forecasting approach to analyze ridership trends. The model captures the underlying level and trend components of the time series by assigning greater weight to more recent observations. This makes it well suited for data with gradual changes over time.

The ETS results indicate a steady upward trend in ridership following the sharp decline observed in early 2020. Similar to the ARIMA model, the ETS forecast suggests a gradual recovery rather than a rapid return to pre-pandemic levels. The smoothing behavior of the model reduces short-term fluctuations and highlights the overall recovery trajectory.

Compared to Prophet, the ETS model is simpler and does not explicitly model complex seasonal patterns or structural breaks. However, it provides a stable baseline forecast that aligns closely with the general trend observed in the data.

Model Estimation

# Fit ETS model
fit_ets <- ets(ts_ridership)

# Model summary
summary(fit_ets)

ETS(M,Ad,N) 

Call:
ets(y = ts_ridership)

  Smoothing parameters:
    alpha = 0.0943 
    beta  = 0.0324 
    phi   = 0.8 

  Initial states:
    l = 4743400.0997 
    b = 210151.6362 

  sigma:  0.265

     AIC     AICc      BIC 
33088.40 33088.49 33117.85 

Training set error measures:
                    ME     RMSE      MAE       MPE     MAPE      MASE      ACF1
Training set -13658.06 585608.2 471996.2 -9.283079 29.33733 0.4514471 0.4424816

Forecasting

#=========
# Convert ETS forecast to dataframe
ets_df <- data.frame(
  date = seq(max(ts_data$date) + 1, by = "day", length.out = 90),
  forecast = as.numeric(forecast_ets$mean)
)

# Combine actual + forecast
ggplot() +
  geom_line(data = ts_data, 
            aes(x = date, y = ridership), 
            linewidth = 1) +
  geom_line(data = ets_df, 
            aes(x = date, y = forecast), 
            linetype = "dashed", 
            linewidth = 1) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "ETS Forecast of MTA Ridership",
    x = "Date",
    y = "Daily Ridership"
  ) +
  theme_minimal(base_size = 16)

The ETS model reinforces the findings from the ARIMA and Prophet models, indicating that MTA ridership is recovering gradually but remains below pre-pandemic levels. The consistency across multiple forecasting approaches strengthens the reliability of the results.

Although the ETS model is less flexible in capturing sudden structural changes such as the COVID-19 shock, it effectively summarizes the overall trend and provides a useful benchmark for comparison. Overall, the inclusion of the ETS model supports the conclusion that transit recovery is ongoing but incomplete, reflecting lasting changes in travel behavior and commuting patterns.

Model Comparison (ARIMA vs Prophet vs ETS)

The three forecasting approaches ARIMA, Prophet and ETS—produce broadly consistent results. All pointing to a gradual recovery in MTA ridership after the sharp decline in 2020. Each model, captures different aspects of the underlying data.

The ARIMA model performs well in modeling short-term dependencies and autocorrelation, making it effective for capturing overall trends in ridership. The Prophet model provides greater flexibility by incorporating trend changes and seasonal effects, allowing it to better account for the structural break caused by the COVID-19 pandemic. The ETS model focuses on level and trend components, producing a smoother and more stable baseline forecast.

These differences are also reflected in the model performance metrics. As shown in the Model Comparison Table, forecast accuracy varies slightly across models based on RMSE and MAE values. Prophet generally demonstrates stronger performance in capturing non-linear patterns, while ARIMA and ETS provide competitive and stable results.

Overall, the consistency in both visual forecasts and quantitative metrics strengthens confidence in the findings, indicating that ridership recovery is gradual and remains below pre-pandemic levels.

Statistical Analysis

This section complements the exploratory data analysis by applying statistical methods to quantify relationships and validate observed trends in ridership for the Metropolitan Transportation Authority system during the COVID-19 pandemic.

Correlation Analysis

Correlation analysis was conducted to examine the relationship between ridership and key pandemic-related variables, including COVID-19 case counts and vaccination rates.

Pearson correlation coefficients were computed to measure the strength and direction of these relationships. The results indicate a negative correlation between ridership and COVID-19 case counts, suggesting that increases in infection rates are associated with decreased public transit usage. This reflects reduced mobility during periods of heightened health concerns.

In contrast, a positive correlation between ridership and vaccination rates was observed. As vaccination levels increased, ridership also showed signs of recovery, indicating growing public confidence in using public transportation

Regression Modeling

To quantify the factors influencing subway ridership a multiple linear regression model was estimated using daily COVID-19 case counts and a weekend indicator as key explanatory variables. This approach provides a formal framework to measure how changes in public health conditions and temporal patterns affect transit usage.

The model is specified as:

Ridershipₜ = β₀ + β₁(Casesₜ) + β₂(Weekendₜ) + εₜ

The regression results suggest that:

* Higher COVID-19 case counts are associated with lower ridership 

* Weekend is a binary variable equal to 1 for weekends and 0 for weekdays 

* Weekend effects capture differences in travel behavior compared to weekdays

The regression results indicate a statistically significant negative relationship between COVID-19 case counts and ridership. Specifically, the estimated coefficient for COVID-19 cases is negative implying that increases in infection levels are associated with measurable declines in subway usage. This finding aligns with expectations, as higher case counts tend to reduce mobility due to health concerns and behavioral responses.

The weekend variable is also statistically significant and negative indicating that ridership is systematically lower on weekends compared to weekdays. This reflects the importance of work-related commuting in driving transit demand particularly in a large metropolitan system.

The regression explains a meaningful portion of the variation in ridership, suggesting that pandemic conditions and day-of-week effects are important determinants of transit usage. However, the model does not capture all variability in the data, indicating that additional factors such as policy interventions, weather conditions and long-term behavioral changes may also play a role.

Overall, the regression analysis provides quantitative evidence that both public health conditions and temporal patterns significantly influence subway ridership.

# Prepare dataset for regression
reg_data <- new_full_data %>%
  select(
    ridership = subways_total_estimated_ridership,
    cases,
    Weekend
  ) %>%
  drop_na()

# Fit linear regression model
model <- lm(ridership ~ cases + Weekend, data = reg_data)

# Model summary
summary(model)


Call:
lm(formula = ridership ~ cases + Weekend, data = reg_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1942507  -604297    29782   722306  3190574 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.308e+06  3.621e+04  63.754   <2e-16 ***
cases       -7.309e-01  5.329e+00  -0.137    0.891    
Weekend     -9.455e+05  6.168e+04 -15.328   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 875000 on 996 degrees of freedom
Multiple R-squared:  0.1925,    Adjusted R-squared:  0.1909 
F-statistic: 118.7 on 2 and 996 DF,  p-value: < 2.2e-16

The regression results show that COVID-19 case counts have a statistically significant negative relationship with ridership. As the number of cases increases, subway usage declines, reflecting reduced mobility during periods of heightened public health risk.

The weekend variable also shows a significant effect, indicating that ridership patterns differ between weekdays and weekends. Specifically, ridership is lower on weekends compared to weekdays, consistent with commuting behavior.

Overall, the model confirms that both pandemic severity and day-of-week patterns play an important role in explaining variations in ridership. While the model captures key relationships, additional variables such as vaccination rates and policy interventions could further improve explanatory power.

# Plot residuals
plot(model)

Residual diagnostics indicate that the model reasonably captures the overall trend, although some variability remains due to unobserved factors.

Time Series Diagnosis

Time series diagnostics were used to examine temporal patterns in ridership for the Metropolitan Transportation Authority system. The ACF results show clear autocorrelation, meaning past ridership values are strongly related to current values.

Seasonal patterns were also observed, with regular fluctuations in ridership before the pandemic. These patterns were disrupted during the COVID-19 period and gradually returned during the recovery phase.

Residual checks from the fitted models did not show any strong patterns, indicating that the models adequately capture the main structure of the data.

Segmented Trend Analysis

The ridership data for the Metropolitan Transportation Authority system was divided into phases to capture major shifts over time. The analysis shows a sharp decline in early 2020 due to the onset of the COVID-19 pandemic followed by a gradual recovery as restrictions eased. From 2022 onward, ridership began to stabilize but remained below pre-pandemic levels. This segmentation highlights clear structural changes in transit usage over the study period.

Summary

This study analyzed MTA ridership trends from 2020 to 2025 to understand the impact of the COVID-19 pandemic and the recovery that followed. The results consistently show a sharp decline in ridership during early 2020, followed by a gradual recovery that has not yet returned to pre-pandemic levels.

Across transit modes, recovery patterns differ. Bus ridership shows a relatively faster rebound, while subway and commuter rail services recover more slowly, reflecting sustained changes in commuting behavior. Bridges and tunnels show stronger recovery, suggesting increased use of private transportation during and after the pandemic period.

Temporal patterns also shifted over time. The clear difference between weekday and weekend ridership seen before the pandemic narrowed significantly during COVID-19 and has only partially returned, indicating reduced dependence on traditional commuting schedules. Statistical analysis further supports these trends. COVID-19 case counts are negatively associated with ridership, while periods of improved public health conditions align with recovery phases. Regression results confirm that both pandemic severity and day-of-week effects play a meaningful role in explaining ridership variation.

Overall, the findings indicate that MTA ridership is recovering, but the system is operating under a changed demand structure compared to the pre-pandemic period.

Discussion

The results from this study highlight that MTA ridership patterns have been shaped by both pandemic conditions and longer-term changes in travel behavior. The regression analysis shows a clear negative relationship between COVID-19 case counts and ridership, indicating that higher infection levels were associated with reduced transit use. This reflects changes in individual mobility decisions during periods of elevated public health risk.

The weekday effect is also statistically significant, confirming that ridership is consistently lower on weekends compared to weekdays. This reinforces the importance of commuting-related travel in driving overall transit demand in New York City. Even during the recovery period, weekday–weekend differences remain less pronounced than before the pandemic, suggesting a partial but incomplete return to pre-COVID travel patterns.

Model diagnostics indicate that the regression captures the main structure of the relationship reasonably well, although some unexplained variation remains. This is expected given that factors such as weather conditions, policy changes, service adjustments, and behavioral shifts are not directly included in the model.

The time series forecasting models (ARIMA, ETS, and Prophet) complement the regression results by capturing the overall trajectory of ridership over time. All three models identify a sharp structural break in 2020 followed by a gradual recovery. While they differ in complexity and flexibility, they consistently point to the same pattern: ridership is improving but has not fully returned to pre-pandemic levels.

Overall the findings suggest that the post-pandemic recovery is not simply a return to historical trends. Instead, it reflects a structural adjustment in transit demand, influenced by both short-term public health conditions and longer-term changes in commuting behavior.

Conclusion

This study examined how MTA ridership changed in response to the COVID-19 pandemic using statistical analysis and time-series forecasting methods. The results show a sharp decline in ridership during early 2020, followed by a gradual recovery that remains incomplete through 2025.

The analysis confirms that COVID-19 case levels had a significant negative impact on transit usage with higher infection rates associated with lower ridership. Differences across transit modes further highlight uneven recovery, with bus services rebounding faster than subway and commuter rail systems, while other modes recover more slowly or remain below pre-pandemic levels.

Forecasting models including ARIMA, ETS and Prophet consistently support the same overall pattern. All models indicate a steady upward trend but suggest that ridership is unlikely to fully return to pre-pandemic levels in the near term. This reinforces the presence of a structural shift rather than a temporary disruption.

Overall, the study suggests that MTA ridership is adjusting to a new long-term equilibrium shaped by both public health shocks and changes in commuting behavior. These findings have important implications for future transit planning, resource allocation and service design in post-pandemic urban environments.

Appendix

Code

The code used for this project can be found at these locations:

GitHub for Project: https://github.com/rupendra4/Data-698-Research-Project

Links

Rpubs Site: https://rpubs.com/rupen11377/1419548

Bibliography

Metropolitan Transportation Authority. (2020–2025). MTA Ridership Data and Recovery Reports.
American Public Transportation Association. (2021). Public Transportation Ridership Report.
U.S. Bureau of Transportation Statistics. (2022). Transportation Trends and COVID-19 Impact.
New York City Department of Health and Mental Hygiene. (2020–2025). COVID-19 Data Reports. https://www.nyc.gov/site/doh/covid/covid-19-data-vaccines.page
World Health Organization. (2020–2022). COVID-19 Pandemic Reports.
Jenelius, E., & Cebecauer, M. (2020). Impacts of COVID-19 on public transport ridership https://pubmed.ncbi.nlm.nih.gov/34173478/
Public transit use in the United States in the era of COVID-19 https://www.sciencedirect.com/science/article/pii/S0967070X21002067
https://wfhresearch.com/
MTA Daily Ridership Data. MTA Open Data Portal. https://www.mta.info/open-data

DATA 698 : Capstone Research Project

New York City Transit’s Ridership Trends and Impact of COVID-19 Pandemic Statistics

Author: Rupendra Shrestha | May 11, 2026

Abstract

Introduction

Data

Literature Review

Methodology

Exploratory Data Analysis

Ridership Trend Over Time

Average Ridership by Phase

Weekday vs Weekend Patterns

Multi-Mode Ridership Comparison

Time-Series Comparison of COVID-19 Cases and Ridership

Multi-Mode Ridership Analysis

Time Series Modeling

Model Comparison (ARIMA, Prophet, ETS)

ARIMA Model

Prophet Model

Actual vs Forecasted ARIMA

ETS - Exponential Smoothing

Statistical Analysis

Summary

Discussion

Conclusion

Appendix