CUNY Data Science 698 - Capstone Research Project

Abstract

This project analyzes daily ridership patterns across major transit services operated by the Metropolitan Transportation Authority in New York City, with particular attention to the impact of the COVID-19 pandemic. Using publicly available MTA Daily Ridership Data from 2020 to 2025, the study examines how ridership levels changed during the pandemic and how recovery has differed across transit modes, including subways, buses, commuter railroads, paratransit services, and bridges and tunnels. To better explain fluctuations in transit usage, the analysis incorporates NYC COVID-19 data, including daily confirmed cases, hospitalizations, and vaccination rates. Time-series visualization, rolling averages, and regression modeling are used to evaluate how ridership responded to pandemic waves and public health milestones. By combining transportation and public health data, this research aims to provide a clear, data-driven explanation of post-pandemic transit recovery and to identify differences in how various transit modes responded to public health conditions. The findings offer insights into urban mobility behavior and inform discussions on long-term transit planning and resilience.

Introduction

Public transportation plays a critical role in the economy and daily life of New York City. The Metropolitan Transportation Authority operates one of the largest transit systems in the United States, serving millions of riders through subways, buses, commuter railroads, paratransit services, and bridges and tunnels. Understanding ridership trends is essential for effective transit planning, resource allocation and policy decision-making.

The COVID-19 pandemic caused an unprecedented disruption to public transit systems beginning in early 2020. Government restrictions, health concerns and the rapid shift to remote work led to a sharp decline in ridership across all modes of transportation. Although ridership has gradually recovered over time, the pace and pattern of recovery have varied significantly across different transit services.

This study analyzes daily MTA ridership data from 2020 to 2025 to examine how transit usage changed during and after the pandemic. By incorporating COVID-19 indicators such as case counts and hospitalizations, the analysis aims to better understand how public health conditions influenced travel behavior.

The primary objective of this study is to evaluate how ridership evolved over time and to compare recovery patterns across transit modes. In addition, the study explores weekday and weekend differences, seasonal trends and the relationship between ridership and pandemic dynamics. The findings provide insights into long-term changes in urban mobility and support discussions on future transit planning.

The central research question guiding this study is:

How has daily MTA ridership evolved since the COVID-19 pandemic, and how do recovery patterns differ across transit modes when considering COVID-19 dynamics?

Secondary questions include:

• Are there consistent weekday and weekend ridership patterns across modes, and do they shift during pandemic waves?

• How do seasonal trends vary by transit service, and how are they affected by COVID-19 milestones?

• Which services have recovered more quickly, and which continue to lag behind pre-pandemic baselines, considering both case counts and vaccination rates?

Data

This project analyzes MTA ridership trends from 2020–2025 and examines how COVID-19 impacted recovery across transit modes.

MTA Ridership Data

Using the MTA dataset, I focused on the estimated number of commuters by day of the week.

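# Load packages and MTA data
library(tidyverse)  # read_csv(), dplyr verbs, pivot_longer(), ggplot2
library(lubridate)  # year(), wday()
# The Socrata endpoint returns 1,000 rows by default
mta_url <- "https://data.ny.gov/resource/vxuj-8kew.csv"
mta <- read_csv(mta_url)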
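
The `glimpse()` output that follows reports exactly 1,000 rows, which matches the default page size of Socrata (SODA) endpoints such as data.ny.gov. A minimal sketch of how a larger page could be requested, assuming the standard SODA query parameters `$limit`, `$offset` and `$order`. The helper name `build_soda_url` and the limit of 5000 are illustrative choices, not part of the original analysis:

```r
# Hypothetical helper: append SODA paging parameters to a resource URL.
build_soda_url <- function(base_url, limit = 5000, offset = 0, order = "date") {
  query <- paste0(
    "$limit=", limit,
    "&$offset=", offset,
    "&$order=", order
  )
  paste0(base_url, "?", query)
}

mta_url_full <- build_soda_url("https://data.ny.gov/resource/vxuj-8kew.csv")
print(mta_url_full)

# The result could be passed to readr::read_csv() in place of mta_url:
# mta <- readr::read_csv(mta_url_full)
```

Ordering by `date` makes the returned rows deterministic, which matters if several pages are fetched with different offsets and then combined.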

MTA Data Cleaning

# Convert date columns
mta <- mta %>%
  mutate(date = as.Date(date))

DT::datatable(mta)

glimpse(mta)

Rows: 1,000
Columns: 15
$ date                                                 <date> 2020-03-01, 2020…
$ subways_total_estimated_ridership                    <dbl> 2212965, 5329915,…
$ subways_of_comparable_pre_pandemic_day               <dbl> 0.97, 0.96, 0.98,…
$ buses_total_estimated_ridersip                       <dbl> 984908, 2209066, …
$ buses_of_comparable_pre_pandemic_day                 <dbl> 0.99, 0.99, 0.99,…
$ lirr_total_estimated_ridership                       <dbl> 86790, 321569, 31…
$ lirr_of_comparable_pre_pandemic_day                  <dbl> 1.00, 1.03, 1.02,…
$ metro_north_total_estimated_ridership                <dbl> 55825, 180701, 19…
$ metro_north_of_comparable_pre_pandemic_day           <dbl> 0.59, 0.66, 0.69,…
$ access_a_ride_total_scheduled_trips                  <dbl> 19922, 30338, 327…
$ access_a_ride_of_comparable_pre_pandemic_day         <dbl> 1.13, 1.02, 1.10,…
$ bridges_and_tunnels_total_traffic                    <dbl> 786960, 874619, 8…
$ bridges_and_tunnels_of_comparable_pre_pandemic_day   <dbl> 0.98, 0.95, 0.96,…
$ staten_island_railway_total_estimated_ridership      <dbl> 1636, 17140, 1745…
$ staten_island_railway_of_comparable_pre_pandemic_day <dbl> 0.52, 1.07, 1.09,…

COVID-19 Data

covid_url <- "https://data.cityofnewyork.us/resource/rc75-m7u3.csv"
covid <- read_csv(covid_url)

# Convert date columns
covid <- covid %>%
  mutate(date = as.Date(date_of_interest))

DT::datatable(covid)

glimpse(covid)

Rows: 1,000
Columns: 56
$ date_of_interest                <dttm> 2020-02-29, 2020-03-01, 2020-03-02, 2…
$ case_count                      <dbl> 1, 0, 0, 1, 5, 3, 8, 7, 21, 57, 69, 15…
$ probable_case_count             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ hospitalized_count              <dbl> 1, 1, 2, 7, 2, 14, 8, 8, 18, 37, 60, 7…
$ death_count                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
$ case_count_7day_avg             <dbl> 0, 0, 0, 0, 0, 0, 3, 3, 6, 15, 24, 46,…
$ all_case_count_7day_avg         <dbl> 0, 0, 0, 0, 0, 0, 3, 3, 6, 15, 24, 46,…
$ hosp_count_7day_avg             <dbl> 0, 0, 0, 0, 0, 0, 5, 6, 8, 13, 21, 32,…
$ death_count_7day_avg            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bx_case_count                   <dbl> 0, 0, 0, 0, 0, 0, 2, 0, 3, 4, 8, 19, 2…
$ bx_probable_case_count          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bx_hospitalized_count           <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 5, 7, 7, 23, 1…
$ bx_death_count                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bx_case_count_7day_avg          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 5, 9,…
$ bx_probable_case_count_7day_avg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bx_all_case_count_7day_avg      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 5, 9,…
$ bx_hospitalized_count_7day_avg  <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 3, 6, 9,…
$ bx_death_count_7day_avg         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bk_case_count                   <dbl> 0, 0, 0, 0, 1, 3, 1, 2, 5, 16, 11, 31,…
$ bk_probable_case_count          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bk_hospitalized_count           <dbl> 1, 0, 2, 3, 1, 3, 1, 3, 8, 11, 13, 11,…
$ bk_death_count                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bk_case_count_7day_avg          <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 2, 4, 6, 10, 2…
$ bk_probable_case_count_7day_avg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bk_all_case_count_7day_avg      <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 2, 4, 6, 10, 2…
$ bk_hospitalized_count_7day_avg  <dbl> 0, 0, 0, 0, 0, 0, 2, 2, 3, 4, 6, 7, 11…
$ bk_death_count_7day_avg         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ mn_case_count                   <dbl> 1, 0, 0, 0, 2, 0, 3, 1, 6, 24, 24, 62,…
$ mn_probable_case_count          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ mn_hospitalized_count           <dbl> 0, 0, 0, 1, 1, 5, 3, 0, 1, 9, 12, 19, …
$ mn_death_count                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ mn_case_count_7day_avg          <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 2, 5, 9, 17, 3…
$ mn_probable_case_count_7day_avg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ mn_all_case_count_7day_avg      <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 2, 5, 9, 17, 3…
$ mn_hospitalized_count_7day_avg  <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 4, 7, 9,…
$ mn_death_count_7day_avg         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ qn_case_count                   <dbl> 0, 0, 0, 1, 2, 0, 1, 3, 6, 10, 24, 40,…
$ qn_probable_case_count          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ qn_hospitalized_count           <dbl> 0, 0, 0, 2, 0, 4, 2, 4, 4, 8, 23, 23, …
$ qn_death_count                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
$ qn_case_count_7day_avg          <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 7, 12, 2…
$ qn_probable_case_count_7day_avg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ qn_all_case_count_7day_avg      <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 7, 12, 2…
$ qn_hospitalized_count_7day_avg  <dbl> 0, 0, 0, 0, 0, 0, 1, 2, 2, 3, 6, 10, 1…
$ qn_death_count_7day_avg         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ si_case_count                   <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 3, 2, 3, 13…
$ si_probable_case_count          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ si_hospitalized_count           <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0, 2, 5, 2, 3,…
$ si_death_count                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ si_probable_case_count_7day_avg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ si_case_count_7day_avg          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3,…
$ si_all_case_count_7day_avg      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3,…
$ si_hospitalized_count_7day_avg  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 2,…
$ si_death_count_7day_avg         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ incomplete                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ date                            <date> 2020-02-29, 2020-03-01, 2020-03-02, 2…

COVID-19 Data Cleaning

# Select relevant columns
covid <- covid %>%
  select(date, cases = case_count, hospitalizations = hospitalized_count)

# Merge datasets
full_data <- left_join(mta, covid, by = "date")

# Create 7-day trailing rolling averages (the first six rows are NA)
full_data <- full_data %>%
  arrange(date) %>%
  mutate(
    cases_7day = zoo::rollmean(cases, 7, fill = NA, align = "right"),
    ridership_7day = zoo::rollmean(subways_total_estimated_ridership, 7, fill = NA, align = "right")
  )

DT::datatable(full_data)
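
To make the effect of `align = "right"` concrete, the following sketch reproduces a trailing 7-day mean with base R on a toy vector. The values are invented, and `stats::filter()` stands in for `zoo::rollmean()` only for illustration:

```r
# Toy daily series (hypothetical values, not MTA data)
x <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)

# sides = 1 uses only the current and the previous six values,
# which mirrors align = "right" in zoo::rollmean()
trailing_7day <- as.numeric(stats::filter(x, rep(1 / 7, 7), sides = 1))

print(trailing_7day)
```

The first six positions are `NA` because a full 7-day window is not yet available, so plots of the smoothed series start one week after the raw series.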

Converting Dates to Day of the Week

To identify the day of the week for each observation, I derived a weekday label and a weekend indicator from the date column. Because the column is already stored as a Date, no further parsing is needed.

new_full_data <- full_data %>%
  mutate(
    # date is already a Date; re-parsing it with mdy() would yield NA
    Day_of_Week = wday(date, label = TRUE),
    Weekend = ifelse(Day_of_Week %in% c("Sat", "Sun"), 1, 0)
  )

DT::datatable(new_full_data)
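
Because `date` is already a `Date` after the cleaning step, the weekend flag can also be derived without any label matching. A small base R sketch under that assumption, where the variable names `day_index` and `weekend` are illustrative:

```r
dates <- as.Date("2020-03-01") + 0:6   # 2020-03-01 was a Sunday

# as.POSIXlt()$wday is locale-independent: 0 = Sunday, ..., 6 = Saturday
day_index <- as.POSIXlt(dates)$wday
weekend <- ifelse(day_index %in% c(0, 6), 1, 0)

print(data.frame(date = dates, day_index = day_index, weekend = weekend))
```

A quick `table(weekend)` on the real data is a cheap check that both values occur before the indicator is used in a model.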

Subway Ridership Totals

I looked at overall subway ridership across the period covered by the data, as well as the number of daily riders during this period:

Subway Overall Ridership Totals

Subway_Overall_Totals <- mta %>%
  select(date, Subway_Ridership_Totals = `subways_total_estimated_ridership`) %>%
  arrange(date)

DT::datatable(Subway_Overall_Totals)

Literature Review

The COVID-19 pandemic had a significant impact on public transportation systems worldwide, leading to sharp declines in ridership and major changes in travel behavior. Early studies show that transit usage dropped rapidly in response to lockdown measures, health concerns, and the widespread shift to remote work. As infection rates increased, mobility decreased, resulting in reduced demand for public transit services.

Research also indicates that the impact of the pandemic was not uniform across transit modes. Bus systems tended to recover more quickly than rail services, as they are more commonly used by essential workers. In contrast, commuter rail systems experienced slower recovery due to reduced daily commuting. These patterns have been observed in major cities across the United States and are consistent with broader changes in work and travel behavior.

Public health conditions played a key role in shaping ridership trends. Studies find a negative relationship between transit usage and COVID-19 case counts, while improvements in vaccination rates contributed to increased ridership. As vaccination campaigns progressed, public confidence in using transit systems improved, supporting gradual recovery.

Another important theme in the literature is the long-term shift in mobility patterns. The pandemic accelerated trends such as remote work and flexible schedules, reducing peak-hour demand and altering traditional commuting patterns. Additionally, some travelers shifted toward private vehicles, walking, or cycling, reflecting changes in risk perception and travel preferences.

From a methodological perspective, time series analysis and regression modeling have been widely used to study the relationship between transit ridership and external factors such as public health data. These approaches allow researchers to capture trends, seasonality and structural breaks associated with major events like the COVID-19 pandemic.

This study builds on existing research by analyzing MTA ridership data over an extended period from 2020 to 2025 and comparing recovery patterns across multiple transit modes. By combining transportation and COVID-19 data, the analysis provides a comprehensive view of how public transit usage evolved during and after the pandemic.

Methodology

This study adopts a data-driven approach to analyze ridership trends in New York City’s transit system during and after the COVID-19 pandemic. While the NYC MTA internally relies on automated fare collection data and advanced origin–destination modeling to estimate ridership patterns, this project uses publicly available aggregated datasets to examine system-wide trends across multiple transit modes.

The analysis begins with data collection and preprocessing. New York City’s MTA daily ridership data and NYC COVID-19 data were obtained from open data sources, cleaned and merged by date. Variables such as day of the week, weekend indicators and pandemic phases were created to support temporal analysis.

Exploratory data analysis was conducted to identify overall trends, seasonal patterns and structural changes in ridership. This includes time series visualization, phase-based comparisons and multi-mode analysis across subways, buses, commuter rail, and bridges and tunnels.

To quantify the relationship between ridership and public health conditions, statistical methods were applied. Correlation analysis and multiple linear regression were used to evaluate how COVID-19 case counts and travel patterns influence ridership levels.

Finally, time series models, including ARIMA and Prophet, were used to capture temporal dependencies and generate short-term forecasts. These models help identify long-term trends and assess the pace of recovery following the pandemic disruption.

Although this approach relies on aggregated public data rather than trip-level records, it provides a clear and interpretable framework for understanding how public transit usage responded to COVID-19 and how recovery patterns differ across transit modes.

Exploratory Data Analysis

The purpose of this exploratory data analysis is to understand how ridership patterns in the Metropolitan Transportation Authority system changed before, during, and after the COVID-19 pandemic. This step focuses on identifying trends, seasonality, anomalies and structural shifts in ridership behavior.

To ensure a comprehensive evaluation of transit behavior, the analysis includes all major MTA service categories: subways, buses, Long Island Rail Road, Metro-North Railroad, and bridges and tunnels. This multi-mode structure allows for comparison of recovery patterns across different transportation systems.

Data Preparation

The ridership data from the Metropolitan Transportation Authority was cleaned and transformed to ensure consistency and analytical readiness. During data pre-processing, column names were standardized and corrected to ensure consistency across transit modes.

The date field was converted into a standard date format to support time-based analysis. Missing values and irregular observations were examined and handled through removal or imputation where appropriate. Data was then aggregated to the daily level to produce a continuous time series of total ridership.

A new categorical variable “phase” was created to classify observations into Pre-COVID (2019), COVID Peak (2020) and Recovery (2021–2023). This segmentation allows for meaningful comparison across different pandemic periods. Additional features such as day of the week and month were derived to capture temporal patterns and seasonal trends.

Time Period Comparison

To evaluate the impact of COVID-19, the data is divided into three phases:

Phase	Description
Pre-COVID	Normal ridership patterns
COVID Peak	Lockdowns and restrictions
Recovery	Gradual return to normal

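# Convert date and derive calendar features
mta <- mta %>%
  mutate(date = as.Date(date),
         year = year(date),
         # ordered factor (Monday first) so plots sort days chronologically
         weekday = wday(date, label = TRUE, abbr = FALSE, week_start = 1))

# Create phase variable
# Note: the glimpse() output above starts at 2020-03-01, so rows are
# labeled "Pre-COVID" only if 2019 observations are present in the data
mta <- mta %>%
  mutate(phase = case_when(
    year == 2019 ~ "Pre-COVID",
    year == 2020 ~ "COVID Peak",
    year >= 2021 ~ "Recovery"
  ))

# Aggregate daily ridership (subway only)
daily_ridership <- mta %>%
  group_by(date, phase) %>%
  summarise(ridership = sum(subways_total_estimated_ridership, na.rm = TRUE),
            .groups = "drop")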
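
Before comparing phases, it can help to confirm that every phase actually contains observations. The sketch below applies the same year-based rule to a handful of hypothetical dates using base R. The dates and the object name `phase_counts` are assumptions made for illustration:

```r
# Hypothetical dates within the range shown in the glimpse() output
dates <- as.Date(c("2020-03-01", "2020-12-31", "2021-06-15", "2022-11-01"))
yr <- as.integer(format(dates, "%Y"))

# Same rule as the case_when() above: unmatched years become NA
phase_label <- ifelse(yr == 2019, "Pre-COVID",
               ifelse(yr == 2020, "COVID Peak",
               ifelse(yr >= 2021, "Recovery", NA)))

phase <- factor(phase_label, levels = c("Pre-COVID", "COVID Peak", "Recovery"))

# table() keeps empty factor levels, so a phase with no rows shows up as 0
phase_counts <- table(phase)
print(phase_counts)
```

A zero count for any level signals that averages or plots for that phase would be empty, which is worth resolving before interpreting phase comparisons.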

Ridership Trend Over Time

The following plot illustrates temporal trends in MTA ridership, highlighting variations before, during and after the COVID-19 pandemic.

ggplot(daily_ridership, aes(x = date, y = ridership, color = phase)) +
  geom_line(alpha = 0.7) +
  labs(title = "MTA Ridership Trends (2019–2023)",
       x = "Date",
       y = "Ridership") +
  theme_minimal()+
  scale_y_continuous(labels = scales::comma)

The plot shows that ridership for the Metropolitan Transportation Authority was stable and high before COVID-19. In March 2020, there is a sharp drop due to the COVID-19 pandemic, indicating a major disruption. Although ridership begins to recover after 2020, it remains below pre-pandemic levels, suggesting a slow and incomplete recovery.

Average Ridership by Phase

The bar chart summarizes average ridership levels across Pre-COVID, COVID Peak and Recovery periods highlighting the impact of the COVID-19 pandemic.

phase_summary <- daily_ridership %>%
  group_by(phase) %>%
  summarise(avg_ridership = mean(ridership))

ggplot(phase_summary, aes(x = phase, y = avg_ridership, fill = phase)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Ridership by Phase",
       x = "Phase",
       y = "Average Daily Ridership") +
  theme_minimal()

Average ridership was highest before COVID-19, dropped significantly during the pandemic, and partially recovered afterward. However, ridership remains below pre-pandemic levels, indicating a slow and incomplete recovery.

Weekday vs Weekend Patterns

The chart analyzes variations in ridership by day of the week, highlighting how commuting behavior changed during the COVID-19 pandemic.

weekday_analysis <- mta %>%
  group_by(weekday, phase) %>%
  summarise(ridership = mean(subways_total_estimated_ridership, na.rm = TRUE))

ggplot(weekday_analysis, aes(x = weekday, y = ridership, fill = phase)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Weekday vs Weekend Ridership",
       x = "Day of Week",
       y = "Average Ridership") +
  theme_minimal()+
  scale_y_continuous(labels = scales::comma)

Before COVID-19, ridership was significantly higher on weekdays due to regular commuting patterns. During the pandemic, this difference decreased as travel behavior became more uniform. In the recovery phase, weekday ridership begins to increase again but the gap between weekdays and weekends remains smaller than pre-pandemic levels, indicating lasting changes in commuting habits.

Multi-Mode Ridership Comparison

The chart compares daily ridership trends across transit modes, highlighting the disruption caused by the COVID-19 pandemic and differences in how each mode recovered.

# Multi-mode selection
mta_modes <- mta %>%
  select(
    date,
    subways = subways_total_estimated_ridership,
    buses = buses_total_estimated_ridersip,
    lirr = lirr_total_estimated_ridership,
    metro_north = metro_north_total_estimated_ridership,
    bridges_tunnels = bridges_and_tunnels_total_traffic
  ) %>%
  pivot_longer(
    cols = -date,
    names_to = "mode",
    values_to = "ridership"
  )

# Plot comparison
ggplot(mta_modes, aes(x = date, y = ridership, color = mode)) +
  geom_line(alpha = 0.7) +
  labs(
    title = "MTA Ridership Trends by Mode (2020–2025)",
    x = "Date",
    y = "Ridership"
  ) +
  theme_minimal()+
  scale_y_continuous(labels = scales::comma)

# Subway ridership trend with smoothing

ggplot(full_data, aes(x = date)) +
  geom_line(aes(y = subways_total_estimated_ridership), alpha = 0.4) +
  geom_line(aes(y = ridership_7day), color = "blue", linewidth = 1) +
  theme_minimal() +
  labs(
    title = "Subway Ridership Trend with 7-Day Rolling Average",
    x = "Date",
    y = "Ridership"
  )+
  scale_y_continuous(labels = scales::comma)

The results show that all modes experienced a sharp decline in early 2020, but recovery patterns vary significantly. Subway and commuter rail services show slower recovery due to reduced commuting demand and sustained remote work trends. In contrast, bus ridership demonstrates relatively faster recovery, likely due to its reliance on essential travel. These differences highlight how travel purpose and rider demographics influenced recovery trajectories across the transit system.

# Normalized comparison (recovery comparison)
mta_modes %>%
  group_by(mode) %>%
  mutate(index = ridership / max(ridership, na.rm = TRUE)) %>%
  ggplot(aes(date, index, color = mode)) +
  geom_line() +
  labs(
    title = "Relative Recovery by Mode",
    y = "Normalized Ridership (0–1)"
  ) +
  theme_minimal()+
  scale_y_continuous(labels = scales::comma)

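Each mode is indexed to its own maximum within the sample, so the lines are comparable across modes of very different size. On this scale, bus services recover more quickly relative to their baseline, while rail services remain below pre-pandemic levels.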
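
The dataset also ships `*_of_comparable_pre_pandemic_day` columns (visible in the `glimpse()` output), which express each day as a share of a comparable pre-pandemic day. A hedged sketch of summarizing those shares by year is shown below. The rows are invented toy values, and only two of the columns are used:

```r
# Toy rows using column names from the glimpse() output;
# the values are invented for illustration only
toy <- data.frame(
  date = as.Date(c("2020-04-15", "2020-04-16", "2022-04-15", "2022-04-16")),
  subways_of_comparable_pre_pandemic_day = c(0.08, 0.10, 0.58, 0.62),
  buses_of_comparable_pre_pandemic_day   = c(0.20, 0.24, 0.60, 0.66)
)
toy$year <- as.integer(format(toy$date, "%Y"))

# Mean share of pre-pandemic ridership per year and mode
recovery_by_year <- aggregate(
  cbind(subways_of_comparable_pre_pandemic_day,
        buses_of_comparable_pre_pandemic_day) ~ year,
  data = toy, FUN = mean
)
print(recovery_by_year)
```

Unlike an index based on the in-sample maximum, these shares are anchored to a pre-pandemic reference, so they may be a more direct measure of recovery when comparing modes.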

Multi-Mode Ridership Analysis

To provide a comprehensive understanding of transit usage, ridership trends were analyzed across all major MTA service modes, including subways, buses, Long Island Rail Road (LIRR), Metro-North Railroad and bridges and tunnels. This multi-mode approach allows for a direct comparison of how different transportation systems responded to the COVID-19 pandemic and subsequent recovery period.

The analysis shows that all transit modes experienced a sharp decline in ridership during the early stages of the pandemic in 2020. However, recovery patterns differ significantly across modes. Subway and commuter rail services, including LIRR and Metro-North, exhibit slower recovery, largely due to reduced commuting associated with remote work and changes in travel behavior.

In contrast, bus ridership demonstrates a relatively faster recovery, likely reflecting continued reliance by essential workers and populations with fewer alternative transportation options. Meanwhile, bridges and tunnels show a strong rebound, indicating an increased shift toward private vehicle usage during and after the pandemic.

These findings highlight that the impact of COVID-19 on transportation was not uniform across systems. Instead, recovery trajectories vary depending on the role each mode plays in urban mobility, providing important insights into long-term changes in travel behavior and transit demand.

Time Series Modeling

The goal of this step is to model and forecast ridership trends in the Metropolitan Transportation Authority system using time series methods. This helps quantify the impact of the COVID-19 pandemic and evaluate recovery patterns over time.

To perform time series modeling, the dataset from the Metropolitan Transportation Authority was aggregated at the daily level and sorted chronologically. This ensures the data is in a consistent format suitable for forecasting ridership trends over time.

ts_data <- daily_ridership %>%
  group_by(date) %>%
  summarise(ridership = sum(ridership, na.rm = TRUE)) %>%
  arrange(date)

# Convert to time series object
# The series starts on 2020-03-01, the 61st day of 2020
ts_ridership <- ts(ts_data$ridership,
                   start = c(2020, 61),
                   frequency = 365)

plot(ts_ridership,
     main = "MTA Subway Ridership Time Series",
     xlab = "Time",
     ylab = "Ridership")

The series shows a clear structural break in early 2020, where ridership drops sharply due to COVID-19. A gradual upward trend is visible afterward, but full recovery is not observed.

ARIMA Model

An ARIMA model is used to capture temporal dependencies in ridership data and generate short-term forecasts following the disruption caused by the COVID-19 pandemic.

# Model selection
library(forecast)  # auto.arima(), forecast()
fit_arima <- auto.arima(ts_ridership)
summary(fit_arima)

Series: ts_ridership 
ARIMA(3,1,3) with drift 

Coefficients:
         ar1     ar2      ar3      ma1      ma2     ma3      drift
      0.3327  0.0813  -0.7224  -0.6029  -0.5370  0.7891  -2093.181
s.e.  0.0597  0.0696   0.0590   0.0467   0.0662  0.0425   6374.812

sigma^2 = 1.659e+11:  log likelihood = -14320.34
AIC=28656.68   AICc=28656.83   BIC=28695.94

Training set error measures:
                   ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
Training set 2683.132 405734.1 317782.5 -3.481779 18.08459 0.3039474 -0.1511143

# Forecast - ARIMA
forecast_arima <- forecast(fit_arima, h = 90)

plot(forecast_arima,
     main = "ARIMA Forecast of MTA Ridership")

The ARIMA model captures overall trend and short-term dependencies in ridership. Forecast results indicate a slow upward trend, but predicted values remain below pre-pandemic levels. This suggests that recovery is gradual and may not return fully to historical highs in the short term.
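
The training-set MAPE of roughly 18% reported above is easier to interpret against a simple benchmark evaluated on held-out data. The following sketch shows one way to do that with a seasonal-naive forecast (repeat the last observed week). It uses a simulated series, and the horizon of 28 days, the noise level and all object names are assumptions rather than the study's settings:

```r
set.seed(698)

# Simulated daily series with a weekly cycle, in millions of riders (not MTA data)
n <- 210
weekly <- rep(c(1.0, 1.0, 1.0, 1.0, 1.0, 0.6, 0.5), length.out = n)
y <- 2 * weekly + rnorm(n, sd = 0.05)

h <- 28                                  # holdout length in days
train <- y[1:(n - h)]
test <- y[(n - h + 1):n]

# Seasonal-naive forecast: repeat the last observed week across the horizon
seasonal_naive <- rep(tail(train, 7), length.out = h)

mape <- function(actual, predicted) {
  mean(abs((actual - predicted) / actual)) * 100
}

baseline_mape <- mape(test, seasonal_naive)
print(round(baseline_mape, 2))
```

The same `mape()` function could be applied to forecasts from the fitted ARIMA model on an identical holdout window, so that the two error figures are directly comparable.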

Prophet Model

# Prepare data for Prophet
library(prophet)
df_prophet <- ts_data %>%
  mutate(date = as.Date(date)) %>%       # ensure proper date format
  rename(ds = date, y = ridership)

# Ensure numeric values
df_prophet$y <- as.numeric(df_prophet$y)

# Remove missing values
df_prophet <- df_prophet %>%
  filter(!is.na(ds) & !is.na(y))

# Fit Prophet model
model_prophet <- prophet(df_prophet)

# Create future dataframe (next 90 days)
future <- make_future_dataframe(model_prophet, periods = 90)

# Generate forecast (named forecast_prophet to avoid masking forecast())
forecast_prophet <- predict(model_prophet, future)

# Plot forecast
plot(model_prophet, forecast_prophet) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Forecasted Ridership",
    x = "Date",
    y = "Daily Ridership"
  ) +
  theme_minimal(base_size = 12)

The Prophet model identifies strong trend changes around 2020, corresponding to the COVID-19 disruption. It also captures seasonality patterns more clearly than ARIMA. Forecast results show a steady recovery trend but indicate that ridership stabilizes at a lower level than pre-pandemic peaks.

Statistical Analysis

This section complements the exploratory data analysis by applying statistical methods to quantify relationships and validate observed trends in ridership for the Metropolitan Transportation Authority system during the COVID-19 pandemic.

Correlation Analysis

Correlation analysis was conducted to examine the relationship between ridership and key pandemic-related variables, including COVID-19 case counts and vaccination rates.

Pearson correlation coefficients were computed to measure the strength and direction of these relationships. The results indicate a negative correlation between ridership and COVID-19 case counts, suggesting that increases in infection rates are associated with decreased public transit usage. This reflects reduced mobility during periods of heightened health concerns.

In contrast, a positive correlation between ridership and vaccination rates was observed. As vaccination levels increased, ridership also showed signs of recovery, indicating growing public confidence in using public transportation.
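
For reference, a Pearson correlation and its significance test can be computed with base R as sketched below. The data frame is invented purely to show the calls. In the study, the inputs would be columns such as `cases` and `subways_total_estimated_ridership` from the merged data:

```r
# Invented values for illustration; not taken from the MTA or NYC datasets
toy <- data.frame(
  cases     = c(100, 400, 900, 1500, 2500, 4000),
  ridership = c(3.1e6, 2.9e6, 2.4e6, 2.0e6, 1.6e6, 1.1e6)
)

# Pearson correlation with a test of H0: rho = 0
result <- cor.test(toy$cases, toy$ridership, method = "pearson")

r <- unname(result$estimate)
print(round(r, 3))
print(result$p.value)
```

Reporting the coefficient together with its p-value and the number of observations makes the strength of the relationship easier to judge than the sign alone.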

Regression Modeling

To quantify the factors affecting subway ridership, a regression analysis was conducted using COVID-19 case counts and weekend indicators as key predictors. While earlier analysis identifies general patterns, this approach provides a clearer measure of how these variables influence daily ridership.

By applying a multiple linear regression model, the study evaluates the extent to which changes in public health conditions and travel behavior explain variations in transit usage over time.

Ridershipₜ = β₀ + β₁(Casesₜ) + β₂(Weekendₜ) + εₜ

The model is specified to test two expectations:

* Higher COVID-19 case counts are associated with lower ridership

* Weekend days differ from weekdays in ridership level

Vaccination rates are not part of this specification.

# Prepare dataset for regression
reg_data <- new_full_data %>%
  select(
    ridership = subways_total_estimated_ridership,
    cases,
    Weekend
  ) %>%
  drop_na()

# Fit linear regression model
model <- lm(ridership ~ cases + Weekend, data = reg_data)

# Model summary
summary(model)


Call:
lm(formula = ridership ~ cases + Weekend, data = reg_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1841171  -677432  -193119   884621  3483109 

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.016e+06  3.418e+04  58.964   <2e-16 ***
cases       8.282e+00  5.885e+00   1.407     0.16    
Weekend            NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 972200 on 997 degrees of freedom
Multiple R-squared:  0.001982,  Adjusted R-squared:  0.0009814 
F-statistic:  1.98 on 1 and 997 DF,  p-value: 0.1597

The regression output does not support a statistically significant relationship between daily COVID-19 case counts and subway ridership in this specification. The estimated coefficient on cases is small and positive (about 8.3 additional riders per reported case) with a p-value of 0.16, and the model explains less than 1% of the variation in ridership (R-squared of about 0.002).

The weekend coefficient is reported as NA because of a singularity, which means the Weekend indicator carried no independent variation in the regression data (for example, if every row was coded 0). As a result, this fit cannot measure the difference between weekday and weekend ridership.

Overall, a linear model with same-day case counts alone explains little of the variation in ridership. Additional variables such as a correctly coded weekend indicator, vaccination rates and policy interventions could improve explanatory power.
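
The note `(1 not defined because of singularities)` in the output above is what `lm()` reports when a predictor is perfectly collinear with other terms, for instance a column that takes a single value. The toy example below reproduces that behavior and shows a simple pre-fit check. All values and object names are invented for illustration:

```r
# Invented values for illustration; not taken from the MTA or NYC datasets
toy <- data.frame(
  ridership = c(2.1, 2.3, 2.2, 1.1, 1.0, 2.4, 2.2, 1.2),
  cases     = c(50, 40, 60, 55, 45, 35, 65, 52),
  Weekend   = 0                        # constant column
)

# A constant column is collinear with the intercept, so lm() returns NA for it
fit_constant <- lm(ridership ~ cases + Weekend, data = toy)
print(coef(fit_constant))

# Simple pre-fit check: count distinct values in each predictor
n_distinct <- sapply(toy[c("cases", "Weekend")], function(col) length(unique(col)))
print(n_distinct)

# With a varying indicator the coefficient becomes estimable
toy$Weekend <- c(0, 0, 0, 1, 1, 0, 0, 1)
fit_varying <- lm(ridership ~ cases + Weekend, data = toy)
print(coef(fit_varying))
```

Running a check like `n_distinct` on `reg_data` before fitting would flag an indicator that carries no variation.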

# Plot residuals
plot(model)

Residual diagnostics indicate that the model reasonably captures the overall trend, although some variability remains due to unobserved factors.

Time Series Diagnosis

Time series diagnostics were used to examine temporal patterns in ridership for the Metropolitan Transportation Authority system. The ACF results show clear autocorrelation, meaning past ridership values are strongly related to current values.

Seasonal patterns were also observed, with regular fluctuations in ridership before the pandemic. These patterns were disrupted during the COVID-19 period and gradually returned during the recovery phase.

Residual checks from the fitted models did not show any strong patterns, indicating that the models adequately capture the main structure of the data.
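
The diagnostics described here can be produced with base R. The sketch below uses a deterministic toy series with a 7-day cycle to show the calls. Applied to the study's models, the input would be the residuals of a fitted model, for example `residuals(fit_arima)`:

```r
# Deterministic toy series with a 7-day cycle (not MTA data)
x <- rep(c(5, 5, 5, 5, 5, 2, 2), times = 20)

# Sample autocorrelations up to lag 14, without drawing the plot
acf_values <- acf(x, lag.max = 14, plot = FALSE)$acf[, 1, 1]
lag7 <- acf_values[8]                  # element 1 is lag 0, so element 8 is lag 7
print(round(lag7, 3))

# Ljung-Box test of H0: no autocorrelation up to lag 14
lb <- Box.test(x, lag = 14, type = "Ljung-Box")
print(lb$p.value)
```

A large autocorrelation at lag 7 and a small Ljung-Box p-value indicate remaining weekly structure. For well-behaved model residuals, the opposite pattern would be expected.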

Segmented Trend Analysis

The ridership data for the Metropolitan Transportation Authority system was divided into phases to capture major shifts over time. The analysis shows a sharp decline in early 2020 due to the onset of the COVID-19 pandemic followed by a gradual recovery as restrictions eased. From 2022 onward, ridership began to stabilize but remained below pre-pandemic levels. This segmentation highlights clear structural changes in transit usage over the study period.

Summary

Overall Ridership Trends

Ridership in the MTA system fell sharply by more than 80% during March–April 2020. A gradual recovery began later in 2020, although progress slowed during major waves of the COVID-19 pandemic.

Transit Mode Differences

Different transit modes showed varied recovery patterns. Subway ridership recovered slowly due to reduced commuting demand, while buses rebounded faster, likely because they supported essential travel. Commuter rail usage remained below pre-pandemic levels, reflecting sustained remote work trends. In contrast, bridges and tunnels experienced a strong recovery, suggesting increased reliance on private vehicles.

Weekday and Weekend Patterns

Before the pandemic, ridership was significantly higher on weekdays compared to weekends. During the pandemic, this distinction became less pronounced. In the recovery phase, weekday travel patterns began to return but have not fully reached pre-pandemic levels.

Impact of COVID-19 Indicators

Statistical results show a clear negative relationship between ridership and COVID-19 case counts, while vaccination rates are positively associated with ridership recovery. Periods of high infection rates correspond with noticeable declines in transit usage.

Regression Findings

The regression, as specified here, does not find a statistically significant effect of same-day COVID-19 case counts on subway ridership (p = 0.16), and the weekend effect could not be estimated from the fitted model. Vaccination rates were not included in the regression, so their association with ridership rests on the correlation analysis.

Discussion

The analysis of the Metropolitan Transportation Authority system shows that the COVID-19 pandemic had a major and lasting impact on ridership patterns. Ridership dropped sharply in early 2020 and has only partially recovered in the following years.

Results show that higher COVID-19 case counts are associated with lower ridership, while increasing vaccination rates support recovery. This indicates that public transit usage was strongly influenced by health conditions and public confidence.

Time series and forecasting models suggest a gradual recovery trend, but ridership is still expected to remain below pre-pandemic levels in the short term. Overall, the findings highlight a structural change in transit behavior rather than a full return to previous patterns.

Conclusion

This study analyzed how ridership in the Metropolitan Transportation Authority system changed during and after the COVID-19 pandemic. The results show a sharp decline in early 2020 followed by a gradual but incomplete recovery.

Recovery patterns differ across transit modes. Subway and commuter rail services remain below pre-pandemic levels, while buses recovered more quickly and bridges and tunnels showed strong growth, indicating a shift toward private travel.

Statistical analysis confirms that higher COVID-19 case counts are associated with lower ridership, highlighting the impact of public health conditions on transit usage. Overall, the findings suggest that the pandemic led to lasting changes in travel behavior, with important implications for future transportation planning.

References

Metropolitan Transportation Authority. (2020–2025). MTA Ridership Data and Recovery Reports.
American Public Transportation Association. (2021). Public Transportation Ridership Report.
U.S. Bureau of Transportation Statistics. (2022). Transportation Trends and COVID-19 Impact.
New York City Department of Health and Mental Hygiene. (2020–2025). COVID-19 Data Reports.
World Health Organization. (2020–2022). COVID-19 Pandemic Reports.

DATA 698 : Capstone Research Project

New York City Transit’s Ridership Trends and Impact of COVID-19 Pandemic Statistics

Author: Rupendra Shrestha | May, 2026