This project asks: which factors most consistently predict subway delays, and has overall performance improved since 2022?. This project analyzes New York City subway delay patterns from January 2022 through March 2026, exploring which subway lines experience the most disruptions, how delays have trended over time, and which reporting categories such as infrastructure issues, operating conditions, or external factors are the most common causes of delays. Data is sourced from the MTA Open Data portal (CSV). The workflow follows the OSEMN framework: Obtain, Scrub, Explore, Model, and Interpret. The reason I choice this project is because the mta has gotten more expensive and I wanted to see if there’s been any improvement.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Yearly totals for trend analysisdelays_by_year <- mta_clean %>%group_by(year) %>%summarise(total_delays =sum(delays, na.rm =TRUE))# Delays by reporting categorydelays_by_category <- mta_clean %>%group_by(reporting_category) %>%summarise(total_delays =sum(delays, na.rm =TRUE)) %>%arrange(desc(total_delays))# Delays by linedelays_by_line <- mta_clean %>%group_by(line) %>%summarise(total_delays =sum(delays, na.rm =TRUE)) %>%arrange(desc(total_delays))# Delays by category and year (for heatmap)delays_category_year <- mta_clean %>%group_by(year, reporting_category) %>%summarise(total_delays =sum(delays, na.rm =TRUE), .groups ="drop")cat("Date range in cleaned data:\n")
Date range in cleaned data:
range(mta_clean$date)
[1] "2022-01-01" "2026-03-01"
Data exploration
# Delays by line for March 2026kable(delays_by_year, caption ="Table 1: Total Subway Delays by Year – 2022 to 2026")
Table 1: Total Subway Delays by Year – 2022 to 2026
year
total_delays
2022
498487
2023
449652
2024
488407
2025
430338
2026
112317
kable(delays_by_category, caption ="Table 2: Total Subway Delays by Reporting Category – 2022 to 2026")
Table 2: Total Subway Delays by Reporting Category – 2022 to 2026
reporting_category
total_delays
Infrastructure & Equipment
586224
Police & Medical
433112
Planned ROW Work
413873
Operating Conditions
271913
Crew Availability
222752
External Factors
51327
kable(delays_by_line, caption ="Table 3: Total Subway Delays by Line – 2022 to 2026")
Table 3: Total Subway Delays by Line – 2022 to 2026
line
total_delays
N
156761
F
149142
6
146033
A
140230
2
131371
E
130229
1
107459
Q
104843
4
101742
D
101106
R
85248
C
84580
7
79166
5
77749
L
68475
3
68466
B
65085
G
56299
JZ
55102
M
54737
S Rock
9925
S 42nd
2863
S Fkln
2590
ggplot(delays_by_year, aes(x = year, y = total_delays)) +geom_col(fill ="#2980b9", alpha =0.85) +geom_text(aes(label = scales::comma(total_delays)), vjust =-0.5, size =3.5) +labs(title ="Total NYC Subway Delays by Year – 2022 to 2026",subtitle ="Source: MTA Open Data Portal",x ="Year",y ="Total Delays" ) +scale_y_continuous(labels = scales::comma) +theme_minimal()
ggplot(delays_by_category, aes(x =reorder(reporting_category, -total_delays), y = total_delays, fill = total_delays)) +geom_col() +scale_fill_gradient(low ="#f7c59f", high ="#c0392b") +scale_y_continuous(labels = scales::comma) +labs(title ="Total Delays by Reporting Category – 2022 to 2026",subtitle ="Source: MTA Open Data Portal",x ="Reporting Category",y ="Total Delays",fill ="Delays" ) +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1))
ggplot(delays_by_line, aes(x =reorder(line, -total_delays),y = total_delays, fill = total_delays)) +geom_col() +scale_fill_gradient(low ="#f7c59f", high ="#c0392b") +scale_y_continuous(labels = scales::comma) +labs(title ="Total Delays by Subway Line – 2022 to 2026",subtitle ="Source: MTA Open Data Portal",x ="Subway Line",y ="Total Delays",fill ="Delays" ) +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1))
# Summary table: delays by line and yeardelays_line_year <- mta_clean %>%group_by(year, line) %>%summarise(total_delays =sum(delays, na.rm =TRUE), .groups ="drop") %>%arrange(year, desc(total_delays))kable(delays_line_year, caption ="Table 4: Subway Delays by Line and Year – 2022 to 2026")
Table 4: Subway Delays by Line and Year – 2022 to 2026
year
line
total_delays
2022
A
41776
2022
6
35775
2022
N
35679
2022
F
35356
2022
E
35300
2022
2
32003
2022
Q
30052
2022
4
26112
2022
D
25593
2022
1
24951
2022
C
22584
2022
R
21398
2022
7
19631
2022
5
18724
2022
L
16638
2022
3
16445
2022
M
15378
2022
JZ
15218
2022
B
14857
2022
G
10535
2022
S Rock
2587
2022
S 42nd
1026
2022
S Fkln
869
2023
N
38820
2023
F
36772
2023
A
35842
2023
6
34334
2023
E
31712
2023
2
28220
2023
D
24103
2023
1
24039
2023
Q
24012
2023
4
22027
2023
R
19357
2023
7
19217
2023
C
19089
2023
5
17636
2023
3
13097
2023
L
12853
2023
G
11962
2023
B
11866
2023
JZ
11283
2023
M
10246
2023
S Rock
2018
2023
S Fkln
657
2023
S 42nd
490
2024
6
39069
2024
F
37990
2024
N
37377
2024
2
33362
2024
A
32611
2024
E
28796
2024
1
28019
2024
D
26766
2024
4
24460
2024
Q
23145
2024
C
22812
2024
R
21794
2024
5
18916
2024
B
18341
2024
7
17331
2024
3
16587
2024
JZ
16287
2024
L
14813
2024
G
13430
2024
M
12855
2024
S Rock
2231
2024
S 42nd
916
2024
S Fkln
499
2025
N
34652
2025
F
32078
2025
2
31008
2025
E
27557
2025
6
27256
2025
1
25223
2025
A
23475
2025
4
22287
2025
Q
20964
2025
D
19567
2025
7
18996
2025
5
18409
2025
R
18339
2025
3
18295
2025
L
17777
2025
G
16772
2025
C
16699
2025
B
15696
2025
M
12505
2025
JZ
9313
2025
S Rock
2620
2025
S Fkln
459
2025
S 42nd
391
2026
N
10233
2026
6
9599
2026
F
6946
2026
E
6864
2026
4
6856
2026
2
6778
2026
Q
6670
2026
A
6526
2026
L
6394
2026
1
5227
2026
D
5077
2026
R
4360
2026
B
4325
2026
5
4064
2026
3
4042
2026
7
3991
2026
M
3753
2026
G
3600
2026
C
3396
2026
JZ
3001
2026
S Rock
469
2026
S Fkln
106
2026
S 42nd
40
ggplot(delays_line_year, aes(x =reorder(line, -total_delays), y = total_delays, fill =factor(year))) +geom_col(position ="dodge") +scale_y_continuous(labels = scales::comma) +scale_fill_brewer(palette ="Set2") +labs(title ="NYC Subway Delays by Line and Year – 2022 to 2026",subtitle ="Source: MTA Open Data Portal",x ="Subway Line",y ="Total Delays",fill ="Year" ) +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1),legend.position ="bottom")
Statistical Analysis
Kruskal-Wallis Test Across Lines. This is to see if delay are significantly different across reporting categories
kruskal_category <-kruskal.test(delays ~ reporting_category, data = mta_clean)print(kruskal_category)
Kruskal-Wallis rank sum test
data: delays by reporting_category
Kruskal-Wallis chi-squared = 2254.3, df = 5, p-value < 2.2e-16
cat("\nIf p < 0.05, delay counts differ significantly across reporting categories.\n")
If p < 0.05, delay counts differ significantly across reporting categories.
This tests whether total delays have increased or decreased year over year from 2022 to 2026.
model <-lm(total_delays ~ year, data = delays_by_year)summary(model)
Call:
lm(formula = total_delays ~ year, data = delays_by_year)
Residuals:
1 2 3 4 5
-55684 -25354 92567 113663 -125192
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 160626610 74737056 2.149 0.121
year -79165 36925 -2.144 0.121
Residual standard error: 116800 on 3 degrees of freedom
Multiple R-squared: 0.6051, Adjusted R-squared: 0.4734
F-statistic: 4.596 on 1 and 3 DF, p-value: 0.1214
ggplot(delays_category_year, aes(x = year, y = total_delays, fill = reporting_category)) +geom_col(position ="stack") +scale_y_continuous(labels = scales::comma) +labs(title ="NYC Subway Delays by Reporting Category and Year",subtitle ="2022 to 2026 | Source: MTA Open Data Portal",x ="Year",y ="Total Delays",fill ="Reporting Category" ) +theme_minimal() +theme(legend.position ="bottom",legend.text =element_text(size =7))
delays_category_year_detail <- mta_clean %>%group_by(year, reporting_category) %>%summarise(total_delays =sum(delays, na.rm =TRUE), .groups ="drop") %>%arrange(year, desc(total_delays))ggplot(delays_category_year_detail, aes(x =reorder(reporting_category, -total_delays), y = total_delays, fill =factor(year))) +geom_col(position ="dodge") +scale_y_continuous(labels = scales::comma) +scale_fill_brewer(palette ="Set2") +labs(title ="NYC Subway Delays by Reporting Category and Year – 2022 to 2026",subtitle ="Source: MTA Open Data Portal",x ="Reporting Category",y ="Total Delays",fill ="Year" ) +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1),legend.position ="bottom")
ggplot(delays_by_year, aes(x = year, y = total_delays)) +geom_col(fill ="#8e44ad", alpha =0.7) +geom_smooth(method ="lm", se =TRUE, color ="#2c3e50") +scale_y_continuous(labels = scales::comma) +labs(title ="NYC Subway Delay Trend – 2022 to 2026",subtitle ="Linear trend with 95% confidence interval",x ="Year",y ="Total Delays" ) +theme_minimal()
Total subway delays have shown a general decline over the study period, dropping from a peak of 498,487 delays in 2022 to 449,652 in 2023, rising slightly to 488,407 in 2024, then falling again to 430,338 in 2025. The linear regression supports a modest downward trend overall, though the p-value of 0.121 indicates this trend is not statistically significant, meaning we cannot confidently conclude delays are improving system-wide.
Infrastructure & Equipment was by far the largest driver of delays across all years, accounting for 586,224 total delays — nearly 135,000 more than the next highest category, Police & Medical at 433,112. Planned ROW Work (413,873), Operating Conditions (271,913), and Crew Availability (222,752) followed, while External Factors were the least common at 51,327. The Kruskal-Wallis test confirmed these differences are statistically significant (chi-squared = 2254.3, p < 2.2e-16), meaning delay counts differ meaningfully across reporting categories and are not due to chance. The grouped bar chart shows Infrastructure & Equipment consistently dominated delays in every year from 2022 to 2026.
Correlation Matrix: The correlation heatmap showed that reporting category had a modest positive correlation with delays (0.19), while subway line had a slight negative correlation (-0.14). Year and delays were nearly uncorrelated (-0.03), consistent with the regression finding that the year over year trend is weak. Overall the correlations are low, suggesting that no single variable fully explains delay volume and that delays are driven by a complex combination of factors.