DATA607 - How Data Driven Decisions Can Save New York’s Subways

Introduction

New York City Transit Authority (NYCT) is the largest public transit authority in the US. NYCT’s subway system serves about 8 million riders each day, according to the data on Metropolitan Transit Authority’s official website. For most New Yorkers, it is the primary way of getting around in the city. In recent years, the quality of service on the NYCT’s subways has drastically deteriorated, prompting numerous public inquiries. On November 18, 2017, New York Times published a report that tried to explain the reasons behind the decline. Among the many culprits listed, New York Times pointed out that NYCT has diverted maintenance funds to support more flashy capital projects, such as the 2nd Avenue subway line. Also, the report disclosed that NYCT’s leadership, given concession to the unions, has allowed the salaries to soar. All of which contribute to the budget crunch that led to the degradation of the subway services.

This project is a an attempt to dive deeper into the issue, using the same data that the New York Times has used in their report. These are the data opened to public access, provided through two portals:

With these data, we seek to discover the extent of degradation in NYCT’s subway services. Moreover, using a data driven approach, this project explores a preliminary frame work that may be used to find a solution for NYCT’s predicament, particularly in the area of funding allocation.

library(ggplot2)
library(gdata)
library(stringr)
library(dplyr)
library(tidyr)
library(knitr)
library(kableExtra)
library(gridExtra)

Just How Bad Is It?

In this section, we explore MTA’s “Performance Data” to see just how bad the NYCT’s subway service has degraded. As of the date of this report, the “Performance Data” was last updated on November, 2017. The data package can be downloaded from MTA’s data portal, and contains several csv files, each for one of the MTA’s subagency, such as the Long Island Railroad, the Metro North Railroad, etc. The data file of interest here is the “MTA_Performance_NYCT.csv”, which holds the performance measurement for the NYCT’s subway and buses from 2008 to 2017. This link lists all of the performance metrics. For the subway, we are interested in four metrics:

On-Time Performance, total and for all lines
Subway wait assessment for all lines
Mean Distance Between Failure
Total Ridership

The csv file can be downloaded directly from MTA’s data portal.

# Import MTA files
temp <- tempfile()
download.file("http://web.mta.info/persdashboard/perxml/MTA_Performance_Datall.zip", temp)
nyct_perf <- read.csv(unz(temp, "MTA_Performance_NYCT.csv"), stringsAsFactors = F)
unlink(temp)

The description for the performance metrics can be read from a spreadsheet file stored on the MTA’s website.

# Read description for the metrics
temp <- read.xls("http://web.mta.info/developers/Performance_Indicators_by%20Agency.xls", sheet = 1, header=F)
temp <- temp %>% 
  filter(V1 %in% c("On-Time Performance (Absolute) - Subways", "Wait Assessment - Subways", 
                   "Mean Distance Between Failures - Subways", "Total Ridership - Subways"))
names(temp) <- c("PERFORMANCE_INDICATOR", "DESCRIPTION")
kable(temp, row.names = F)

PERFORMANCE_INDICATOR	DESCRIPTION
Mean Distance Between Failures - Subways	Average number of miles a subway car travels in service before a mechanical failure that makes the train arrive at its final destination later than 5 minutes
On-Time Performance (Absolute) - Subways	The percent of trains making all the scheduled station stops arriving at the destination terminal on-time, early or no more than 5 minutes late. Trains re-routed, abandoned or operating under temporary schedule changes are all counted as late. This is a new indicator that New York City Transit began reporting in August 2008.
Total Ridership - Subways	Ridership is the number of passengers from whom the agency receives a fare, either through direct fare payment (cash, Pay-Per-Ride MetroCards, Unlimited Ride MetroCards, etc.) or fare reimbursements (senior citizens, school children, the physically disabled). Passengers who use free transfers (train-to-bus, bus-to-train) are counted as additional passengers even though they are not paying additional fares. Ridership data is preliminary and subject to revision as well as adjustments warranted by annual audit review.
Wait Assessment - Subways	The percent of actual intervals between trains that are no more than the scheduled interval plus 2 minutes during peak hours (6 AM - 9 AM) and plus 4 minutes during off-peak hours (9 AM - 4 PM and 7 PM - midnight). Wait assessment is measured weekdays between 6:00 AM - midnight when service is relatively frequent. The data is based on a sample methodology.

Note that the On-Time Performance (OTP) and the Subway Wait Assessment metrics are in percentage, and can be broken down into each subway line.

# The indicator names of the OTP and wait assessment can be pulled from the data frame using the `stringr` package.
otp_names <- nyct_perf$INDICATOR_NAME %>% 
  str_extract(pattern = "OTP[[:graph:]\\s]*") %>% 
  .[!is.na(.)] %>% 
  unique()
otp_names

##  [1] "OTP (Terminal) - 1 Line"        "OTP (Terminal) - 2 Line"       
##  [3] "OTP (Terminal) - 3 Line"        "OTP (Terminal) - 4 Line"       
##  [5] "OTP (Terminal) - 5 Line"        "OTP (Terminal) - 6 Line"       
##  [7] "OTP (Terminal) - 7 Line"        "OTP (Terminal) - S Line 42 St."
##  [9] "OTP (Terminal) - A Line"        "OTP (Terminal) - B Line"       
## [11] "OTP (Terminal) - C Line"        "OTP (Terminal) - D Line"       
## [13] "OTP (Terminal) - E Line"        "OTP (Terminal) - F Line"       
## [15] "OTP (Terminal) - S Fkln Line"   "OTP (Terminal) - G Line"       
## [17] "OTP (Terminal) - J Z Line"      "OTP (Terminal) - L Line"       
## [19] "OTP (Terminal) - M Line"        "OTP (Terminal) - N Line"       
## [21] "OTP (Terminal) - Q Line"        "OTP (Terminal) - R Line"       
## [23] "OTP (Terminal) - S Line Rock"

wa_names <- nyct_perf$INDICATOR_NAME %>% 
  str_extract(pattern = "Subway Wait Assessment -[[:graph:]\\s]*") %>% 
  .[!is.na(.)] %>% 
  unique()
wa_names

##  [1] "Subway Wait Assessment - 1 Line"  
##  [2] "Subway Wait Assessment - 2 Line"  
##  [3] "Subway Wait Assessment - 3 Line"  
##  [4] "Subway Wait Assessment - 4 Line"  
##  [5] "Subway Wait Assessment - 5 Line"  
##  [6] "Subway Wait Assessment - 6 Line"  
##  [7] "Subway Wait Assessment - 7 Line"  
##  [8] "Subway Wait Assessment - S 42 St" 
##  [9] "Subway Wait Assessment - A Line"  
## [10] "Subway Wait Assessment - B Line"  
## [11] "Subway Wait Assessment - C Line"  
## [12] "Subway Wait Assessment - D Line"  
## [13] "Subway Wait Assessment - E Line"  
## [14] "Subway Wait Assessment - F Line"  
## [15] "Subway Wait Assessment - G Line"  
## [16] "Subway Wait Assessment - S Rock " 
## [17] "Subway Wait Assessment - J Z Line"
## [18] "Subway Wait Assessment - L Line"  
## [19] "Subway Wait Assessment - M Line"  
## [20] "Subway Wait Assessment - N Line"  
## [21] "Subway Wait Assessment - Q Line"  
## [22] "Subway Wait Assessment - R Line"  
## [23] "Subway Wait Assessment - S Fkln"

The three “S” lines are excluded since these lines are missing a number of data. We can create plots to inspect the performance of each subway line. Here, we use a special type of graph called the “Edward Tufte style slopegraphs”, so we can compare the performance changes among the subway lines

# Below two functions, `tufte_sort` and `plot_slopegraph` are copied from source: https://github.com/jkeirstead/r-slopegraph
tufte_sort <- function(df, x="year", y="value", group="group", method="tufte", min.space=0.05) {
  ## First rename the columns for consistency
  ids <- match(c(x, y, group), names(df))
  df <- df[,ids]
  names(df) <- c("x", "y", "group")
  
  ## Expand grid to ensure every combination has a defined value
  tmp <- expand.grid(x=unique(df$x), group=unique(df$group))
  tmp <- merge(df, tmp, all.y=TRUE)
  df <- mutate(tmp, y=ifelse(is.na(y), 0, y))
  
  ## Cast into a matrix shape and arrange by first column
  require(reshape2)
  tmp <- dcast(df, group ~ x, value.var="y")
  ord <- order(tmp[,2])
  tmp <- tmp[ord,]
  
  min.space <- min.space*diff(range(tmp[,-1]))
  yshift <- numeric(nrow(tmp))
  ## Start at "bottom" row
  ## Repeat for rest of the rows until you hit the top
  for (i in 2:nrow(tmp)) {
    ## Shift subsequent row up by equal space so gap between
    ## two entries is >= minimum
    mat <- as.matrix(tmp[(i-1):i, -1])
    d.min <- min(diff(mat))
    yshift[i] <- ifelse(d.min < min.space, min.space - d.min, 0)
  }
  
  tmp <- cbind(tmp, yshift=cumsum(yshift))
  
  scale <- 1
  tmp <- melt(tmp, id=c("group", "yshift"), variable.name="x", value.name="y")
  ## Store these gaps in a separate variable so that they can be scaled ypos = a*yshift + y
  
  tmp <- transform(tmp, ypos=y + scale*yshift)
  return(tmp)
  
}
plot_slopegraph <- function(df) {
  ylabs <- subset(df, x==head(x,1))$group
  yvals <- subset(df, x==head(x,1))$ypos
  fontSize <- 3
  gg <- ggplot(df,aes(x=x,y=ypos)) +
    geom_line(aes(group=group),colour="grey80") +
    geom_point(colour="white",size=8) +
    geom_text(aes(label=y), size=fontSize, family="American Typewriter") +
    scale_y_continuous(name="", breaks=yvals, labels=ylabs) +
    theme_classic()
  return(gg)
}    

# Below function is a slight modification of the code from http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html#Slope%20Chart%202
createSlopPlot <- function(df, title){

  df <- tufte_sort(df, 
                   x="PERIOD_YEAR", 
                   y="YTD_ACTUAL", 
                   group="INDICATOR_NAME", 
                   method="tufte", 
                   min.space=0.05)
  
  df <- transform(df, 
                  x=factor(x, levels=c(2009:2016), 
                           labels=c(2009:2016)), 
                  y=round(y))
  
  plot <- plot_slopegraph(df) + labs(title=title) + 
    theme(axis.title=element_blank(),
          axis.ticks = element_blank(),
          plot.title = element_text(hjust=0.5,
                                    family = "American Typewriter",
                                    face="bold"),
          axis.text = element_text(family = "American Typewriter",
                                   face="bold"))
  
  return(plot)
}

## Construct the plots
otp <- nyct_perf %>% 
  # Select the OTP rows based on the indicator names
  filter(INDICATOR_NAME %in% otp_names[-c(8, 15, 23)]) %>% 
  # Select the month of December only, since we will be using the "YTD_ACTUAL" year-to-date column.
  filter(PERIOD_MONTH == 12)
# Shorten the indicator names
otp$INDICATOR_NAME <- str_replace_all(otp$INDICATOR_NAME, "OTP \\(Terminal\\) - ", "")
otp_names <- str_replace_all(otp_names, "OTP \\(Terminal\\) - ", "")
# Separate the rows into number lines and letter lines
temp1 <- filter(otp, INDICATOR_NAME %in% otp_names[c(1:7)])
temp2 <- filter(otp, INDICATOR_NAME %in% otp_names[c(9:14, 16:22)])
# Construct plots
grid.arrange(createSlopPlot(temp1, "% On-Time for NYCT Number Lines"), 
             createSlopPlot(temp2, "% On-Time for NYCT Letter Lines"),ncol=2)

The on-time performance for all lines has been declining steadily. The 2, 4, 5, and 6 lines are particularly bad, having the steepest decline and the lowest on-time percentage.

wa <- nyct_perf %>% 
  filter(INDICATOR_NAME %in% wa_names[-c(8, 16, 23)]) %>% 
  filter(PERIOD_MONTH == 12)
wa$INDICATOR_NAME <- str_replace_all(wa$INDICATOR_NAME, "Subway Wait Assessment - ", "")
wa_names <- str_replace_all(wa_names, "Subway Wait Assessment - ", "")
temp1 <- filter(wa, INDICATOR_NAME %in% wa_names[c(1:7)])
temp2 <- filter(wa, INDICATOR_NAME %in% wa_names[c(9:15, 17:22)])

grid.arrange(createSlopPlot(temp1, "Wait Assessment for Number Lines"), 
             createSlopPlot(temp2, "Wait Assessment for Letter Lines"),ncol=2)

The subway wait assessment paints a better picture for NYCT, although the declining trend remains true. It is important to note the difference between these two performance metrics. As explained in the description for these metrics, the OTP measures the on-time performance of a train making all stops and arriving at the final station on time. The subway wait assessment, on the other hand, measures the on-time performance of trains in randomly sampled stations throughout the line. OTP can be thought of as a metric that represents the cumulative effect of train delays along the line. If a train line is long, having more stations than others, it is more likely for that line to have lower OTP than others. Therefore the subway wait assessment is a better measurement for everyday commuters, since commuters usually do not care if a train arrives at its last station on-time or not, but do care about if the trains between the origin and destination stations are running on-time. However, MTA does not document its sampling methodology for the subway wait assessment. So it’s difficult to talk about how useful this metric is.

rd <- nyct_perf %>% 
  filter(INDICATOR_NAME == "Total Ridership - Subways") %>% 
  filter(PERIOD_MONTH == 12)

fd <- nyct_perf %>% 
  filter(INDICATOR_NAME == "Mean Distance Between Failures - Subways") %>% 
  filter(PERIOD_MONTH == 12)

plot1 <- ggplot(rd, aes(x = PERIOD_YEAR, y = YTD_ACTUAL/1000000000, color = INDICATOR_NAME)) +
  geom_line() +
  labs(title = "Annual Ridership - Subways", y = "Riders in Billion", x = "Year") +
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5), legend.position = "none")

plot2 <- ggplot(fd, aes(x = PERIOD_YEAR, y = YTD_ACTUAL/1000, color = INDICATOR_NAME)) +
  geom_line() +
  labs(title = "Mean Failure Distance of Trains", y = "Thousand Miles", x = "Year") +
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5), legend.position = "none")

grid.arrange(plot1, plot2, ncol=2)

The annual ridership has increased significantly since 2009, reaching more than 1.75 billion in 2016. The mean failure distance of the trains increased from 2008 to 2011, representing an improvement in the train’s performance; but then it began to decline, dropping to pre 2008 level in 2015.

It is apparent from the graphs above that the service performance of NYCT subways has slid over the years. So what can we do about it? The MTA data portal does not provide enough information to answer this question. In the following sections, we turn to the data from the National Transit Database (NTD) in an attempt to answer this question.

NTD Data Exploration

The NTD is maintained by the Federal Transit Administration. As required by law since 1974, a public transit authority receiving federal funding assistance must periodically submit its agency data to the NTD. These include the transit agency’s financial, operating, and asset condition data. NTD consolidates and publishes these data in a number of time series Excel spreadsheets, publicly accessible on its website. The latest time series files contain annual data from 1991 to 2016. We can import these data files directly.

The first file we would like to import is the “TS3.1 - Capital Expenditures Time-Series” Excel file. We are interested in 3 sheets in this file:

Funds spent on Rolling Stock
Funds spent on Facilities
Funds spent on other Capital Projects

url_link <- "http://www.transit.dot.gov/sites/fta.dot.gov/files/TS3.1TimeSeriesUsesofCap_3.xls"
rs <- read.xls(url_link, sheet = "Rolling Stock", stringsAsFactors = F)
fa <- read.xls(url_link, sheet = "Facilities", stringsAsFactors = F)
ot <- read.xls(url_link, sheet = "Other", stringsAsFactors = F)

The second file to import is the “TS2.1 - Service Data and Operating Expenses Time-Series by Mode” Excel file. The sheets being imported into data frame for the analysis are:

Vehicle Operations Expenses
Vehicle Maintenance Expenses
Non-Vehicle Maintenance Expenses
General Administration Expenses
Unlinked Passenger Trips

url_link <- "http://www.transit.dot.gov/sites/fta.dot.gov/files/TS2.1TimeSeriesOpExpSvcModeTOS_6.xls"
op <- read.xls(url_link, sheet = "OpExp VO", stringsAsFactors = F)
ga <- read.xls(url_link, sheet = "OpExp GA", stringsAsFactors = F)
mv <- read.xls(url_link, sheet = "OpExp VM", stringsAsFactors = F)
mn <- read.xls(url_link, sheet = "OpExp NVM", stringsAsFactors = F)
pt <- read.xls(url_link, sheet = "UPT", stringsAsFactors = F)

The two Excels files imported above include the data for all transit agencies and all transit modes (train, bus, etc.) of the agencies in the country. Each row is a mode for a agency. We can extract the row by looking at the “NTD ID” and the “Mode” columns. The NYCT’s NTD ID is 2008, and the subway mode is HR, which stands for Heavy Rail.

The extracted rows will be in wide format, where the each year is a column. We would like to transform it into a long format, so that each row is an annual observation of the expenses and trips.

# This function takes a data frame object and extract the NYCT subway row, and performs transformation so that the returned data frame is in long format. String columns are also treated with stringr and turned into numeric.
extractInf <- function(df){
  df <- df %>% 
    # Extract NYCT subway row 
    filter(X4.Digit.NTDID == 2008 & Mode == "HR") %>%
    # Turn into long format
    gather("Year", "Amount", X1992:X2016) %>% 
    select("Year", "Amount")
  # Change the Year column to numeric format
  df$Year <- df$Year %>% 
    str_replace("X", "") %>% 
    as.numeric()
  # Clean up the Amount column
  df$Amount <- df$Amount %>% 
    str_replace_all(pattern="[$|,]*", replacement="") %>% 
    as.numeric()
  return(df)
}

# Use lapply to aplly the extractInf function to each data frame, and use do.call to combine them into one data frame, also removing the duplicated "Year" columns.
expenses <- list(op, ga, mv, mn, rs, fa, ot, pt) %>% 
  lapply(extractInf) %>% 
  do.call(cbind, .) %>% 
  .[,-c(3,5,7,9,11,13,15)]
names(expenses) <- c("Year", "Operation", "General.Admin", "Vehicle.Maint", "NonVeh.Maint",
                         "Rolling.Stock", "Facilities", "Other.Cap", "Passenger.Trips")

The next two sets of data to be imported are a bit tricky to obtain. They are the mechanical failure record and maintenance spending detail data. The maintenance spending detail breaks down the annual operation and maintenance expense into expense categories such as operator wages, fringe benefit, material and supplies, etc. These are not time series files, so you must get it individually for each year, using NTD’s search engine. These files are also difficult to read automatically. The file names change from year to year, and the table formats also changes from one year to another. It is necessary to download each file and extract the NYCT’s data from each file manually. For this project, we have performed this footwork extracting the information. Two csv files were constructed and stored in the git hub repository, which can be read directly. If wish, the readers may download the original raw files using this link, which contains the file names that can be concatenated with “https://www.transit.dot.gov/sites/fta.dot.gov/files/” to create links for automatic download.

We can now create the complete data table.

# Read the mechanical failure and expense detail csv files
url_link <- "https://raw.githubusercontent.com/Tyllis/Data607/master/ntd_breakdowns.csv"
bd <- read.csv(url_link, stringsAsFactors = F)
url_link <- "https://raw.githubusercontent.com/Tyllis/Data607/master/ntd_maintdetails.csv"
md <- read.csv(url_link, stringsAsFactors = F)

# Use `left_join` function to join the `expenses` table and the bd, md tables together
nyct <- expenses %>% 
  left_join(bd, by = "Year") %>% 
  left_join(md, by = "Year")

# Create scrollable table box
kable(nyct, "html") %>% 
  kable_styling() %>% 
  scroll_box(height = "400px")

Year	Operation	General.Admin	Vehicle.Maint	NonVeh.Maint	Rolling.Stock	Facilities	Other.Cap	Passenger.Trips	Major.Failures	Minor.Failures	Operator.Wage	Other.Wage	Fringe.Benefit	Services	Fuel.Lube	Tires.Other	Utilities	Casualty.Liability	Other	Expense.Transfer
1992	893641326	212967749	301681358	635022737	98308445	186123385	536025689	1373625314	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1993	1067907739	250401313	303238065	511378906	86008580	163768857	536529428	1178121493	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1994	1099407927	286414816	314305911	543277117	28972033	234838528	566128507	1308429940	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1995	922383381	227412141	313146677	543368932	17310000	285660000	579863000	1234598558	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1996	764467178	237037370	303778624	508709115	27850000	324730000	577960000	1352100577	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1997	796058424	247358893	303013154	508679427	205649887	969392939	74693518	1579782509	3942	7642	187152000	1052571000	615292000	66515000	1286000	138231000	158881000	35097000	653000	-400569000
1998	795693241	258616239	303675462	486263766	178398116	975043057	91739033	1535830314	3539	6613	197504000	1048337000	620366000	90600000	1011000	161207000	155766000	37826000	5307000	-473677000
1999	825488171	268523333	356448615	504462527	129485116	1155726514	108370068	1627237995	2931	6560	208887000	1109146000	643602000	81823000	1515000	194973000	161733000	45024000	21264000	-513045000
2000	882799082	272510611	378898854	556121441	200521584	985895843	192204917	1677506585	2931	6560	225083000	1197335000	712861000	86677000	2134000	197128000	163628000	49125000	6163000	-549803000
2001	943335417	259856996	415154052	593210358	601925092	1146096435	134583289	1740326067	2885	6264	236617000	1277831000	792789000	97365000	2040000	218353000	164386000	30328000	759000	-608911000
2002	983238399	276567641	398100666	598038448	878221360	1305788694	207419015	1694026559	2886	6409	248866000	1319444000	841110000	89936000	2144000	189588000	167422000	49380000	577000	-652523000
2003	1067275547	300196252	414259223	596656668	583444043	1173282589	450289379	1701453276	2363	5585	264794000	1317717000	954039000	89032000	3124000	177743000	169477000	62001000	3443000	-662983000
2004	1138484678	311994032	429419154	657741884	169133301	1448268896	526704409	1760778918	2167	5973	277387000	1317275000	1064092000	97057000	3197000	171710000	178642000	50696000	20351000	-642768000
2005	1245850476	311293972	468942283	691364336	281954895	1178470527	436071987	1804034331	1895	6065	285239000	1329631000	1168890000	97871000	3524000	182556000	211472000	44326000	22814000	-628873000
2006	1226258672	295555894	467192342	718737332	184980779	1370286195	437313379	1874980539	2173	7486	292370000	1405646000	1118956000	90610000	3975000	197155000	222698000	25740000	15747000	-665152000
2007	1352215524	360527720	518886019	796878634	388221042	1375050297	499898882	2390402930	2260	7971	299679000	1455800000	1326184000	89533000	4465000	211877000	242560000	54834000	27442000	-683866000
2008	1441121443	385873732	554363491	868672471	818930834	1750414220	550879735	2428308510	2547	8208	307171000	1528666000	1466413000	106169000	5766000	215778000	258283000	47453000	24722000	-710390000
2009	1488221411	384513026	586186075	854206924	1130952810	1786173718	557319784	2358313445	2250	7536	314851000	1557656000	1552822000	102836000	4443000	234136000	269383000	79057000	27744000	-829799000
2010	1530356185	388965044	581012003	845601344	645592650	1805457962	569533037	2439158966	2038	7443	322801000	1542021000	1633885000	102258000	4269000	216441000	279702000	83707000	29143000	-868293000
2011	1659737070	450000081	620936007	794747318	53988596	1840682764	586553078	2497626015	1997	7329	330871000	1548036000	1699870000	93903000	4408000	227745000	299126000	112775000	39818000	-831130000
2012	1713836073	432161210	632776728	965306300	112319809	1888657130	708216282	2569543549	2112	7614	339060000	1572998000	1863936000	107449000	6024000	216747000	318677000	47065000	38231000	-766108000
2013	1893029576	791964523	709394965	1369099149	140437345	1585618007	877008046	2656476693	2260	8346	347536000	1650514000	1915005000	114855000	6216000	243916000	322664000	101798000	60984000	0
2014	1985624384	818085678	743288437	1475083987	314933292	1563285711	710326170	2743004452	2443	10132	356225000	1752498000	2029341000	106449000	7652000	268684000	344067000	107978000	49189000	0
2015	2039452975	845108951	776673215	1538976738	114621030	1593448029	861471378	2662421226	2631	10058	365130000	1814337000	2153595000	105611000	4785000	98000	314349000	131666000	263449000	47194000
2016	2119065465	959270181	836289940	1644317531	102448194	1617950755	838617620	2673282334	3097	10196	374258000	1900102000	2390069000	121174000	4000000	82000	283189000	169989000	270647000	45432000

Note that there are some data missing from the expense details and mechanical failure data groups, for years before 1996. These are not available for public access on NTD’s website.

The Major.Failures and Minor.Failures columns are of particular interest. Mechanical failure is one of the main causes for train delay. Below, we plot the two types of failures together, and separately.

# Plot mechanical failure
temp <- nyct %>%
  filter(Year > 1996) %>% 
  select(Year, Major.Failures:Minor.Failures)%>%
  gather("Exp.Type", "Amount", Major.Failures:Minor.Failures)
plot1 <- ggplot(temp, aes(x = Year, y = Amount, fill = Exp.Type)) +
  geom_bar(stat = "identity") +
  labs(title = "Mechanical Failures", y = "Number of Failures") +
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5), legend.title=element_blank()) +
  scale_x_continuous(breaks = seq(1997,2017, by = 5))
plot2 <- ggplot(temp, aes(x = Year, y = Amount, color = Exp.Type)) +
  geom_line() +
  labs(title = "Mechanical Failures", y = "Number of Failures") +
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5), legend.title=element_blank()) +
  scale_x_continuous(breaks = seq(1997,2017, by = 5))
grid.arrange(plot1, plot2, ncol=1)

It appears that the Major.Failures has remained steady over the years, while the Minor.Failures has increased. In the New York Times article, it was suggested that the NYCT’s maintenance spending has stagnated over the years, which led to more breakdowns that causes more delay. Let’s see if it is true.

# Plot expense 
temp <- nyct %>%
  select(Year, Operation:NonVeh.Maint)%>%
  gather("Exp.Type", "Amount", Operation:NonVeh.Maint)
plot1 <- ggplot(temp, aes(x = Year, y = Amount/1000000, fill = Exp.Type)) +
  geom_bar(stat = "identity") +
  labs(title = "Operation and Maintenance Expenses", y = "Million USD") +
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5), legend.title=element_blank()) +
  scale_x_continuous(breaks = seq(1992,2017, by = 5)) +
  scale_y_continuous(breaks = seq(0, 5000, by = 1000))
plot2 <- ggplot(temp, aes(x = Year, y = Amount/1000000, color = Exp.Type)) +
  geom_line() +
  labs(title = "Operation and Maintenance Expenses", y = "Million USD") +
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5), legend.title=element_blank()) +
  scale_x_continuous(breaks = seq(1992,2017, by = 5)) +
  scale_y_continuous(breaks = seq(0, 5000, by = 500))

grid.arrange(plot1, plot2, ncol=1)

The graphs above show that the operation and maintenance spending have increased over the years, contrary to the New York Time’s claim, although the increase for the maintenance spending is noticeably smaller than other spending.

However, these graphs do not paint the whole picture, because the ridership has increased over the years as well. The Passenger.Trips records the annual number of unlinked passenger trips (UPT) the trains have served. The official definition of UPT is “number of passengers who board public transportation vehicles. Passengers are counted each time they board vehicles no matter how many vehicles they use to travel from their origin to their destination.” It is another form of measurement for ridership.

temp <- nyct %>% 
  select(Year, Passenger.Trips) 
ggplot(temp, aes(x = Year, y = Passenger.Trips/1000000000)) +
  geom_line() +
  labs(title = "Unlinked Passenger Trips (UPT)", y = "Billion Trips") +
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5), legend.title=element_blank()) +
  scale_x_continuous(breaks = seq(1992,2017, by = 5)) +
  scale_y_continuous(breaks = seq(0, 3, by = 0.25))

To tell if the expenses have fallen behind the rise in ridership, we have to normalize the expenses against the ridership. Here, we divide the expenses by the unlinked passenger trips. From this point on, we will use the normalized metrics, all obtained by dividing the original data by the unlinked passenger trips corresponding to the data.

Below, we reconstruct the graphs for the operation and maintenance expenses, using the normalized metric.

# Normalized the spending data by dividing by the passenger trips
nyct_normal <- nyct / nyct$Passenger.Trips
nyct_normal$Year <- nyct$Year

# Plot maintenance expense 
temp <- nyct_normal %>%
  select(Year, Operation:NonVeh.Maint)%>%
  gather("Exp.Type", "Amount", Operation:NonVeh.Maint)
plot1 <- ggplot(temp, aes(x = Year, y = Amount, fill = Exp.Type)) +
  geom_bar(stat = "identity") +
  labs(title = "Operation and Maintenance Expenses", y = "USD per Passenger Trip") +
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5), legend.title=element_blank()) +
  scale_x_continuous(breaks = seq(1992,2017, by = 5)) 
plot2 <- ggplot(temp, aes(x = Year, y = Amount, color = Exp.Type)) +
  geom_line() +
  labs(title = "Operation and Maintenance Expenses", y = "USD per Passenger Trip") +
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5), legend.title=element_blank()) +
  scale_x_continuous(breaks = seq(1992,2017, by = 5)) 

grid.arrange(plot1, plot2, ncol=1)

The new graphs show that the vehicle maintenance expense has, relatively, kept leveled with the ridership over the years; while the expenses for non-Vehicle maintenance, general administration have increased.

NTD also offers a different breakdown of operation and maintenance spending, into following categories:

Operator Wage
Other Wage
Fringe Benefit
Services
Fuel and Lube
Tires and Other
Utilities
Casualty and Liability
Other
Expense Transfer

The Expense.Transfer are the adjustment made to the other expense categories. Not knowing which categories were adjusted, this variable is excluded in the analysis.

The breakdowns of the operation and maintenance spendings are plotted below.

temp <- nyct_normal %>%
  filter(Year > 1996) %>% 
  select(Year, Operator.Wage:Other)%>%
  gather("Exp.Type", "Amount", Operator.Wage:Other)
plot1 <- ggplot(temp, aes(x = Year, y = Amount, fill = Exp.Type)) +
  geom_bar(stat = "identity") +
  labs(title = "Operation and Maintenance Expenses", y = "USD per Passenger Trip") +
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5), legend.title=element_blank()) +
  scale_x_continuous(breaks = seq(1997,2017, by = 5)) 
plot2 <- ggplot(temp, aes(x = Year, y = Amount, color = Exp.Type)) +
  geom_line() +
  labs(title = "Operation and Maintenance Expenses", y = "USD per Passenger Trip") +
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5), legend.title=element_blank()) +
  scale_x_continuous(breaks = seq(1997,2017, by = 5)) 

grid.arrange(plot1, plot2, ncol=1)

The Fringe.Benefit and Other.Wage, which are the wages paid to maintenance works, administrative personnel, and managers, are by far the largest operating and maintenance expenses. In 2016, together, they were 77% of NYCT’s total operating and maintenance expense. In particular, Fringe.Benefit really took off after 2007, passing over Other.Wage to become the number one cost in this spending category.

Below, we construct the graphs for the capital spending.

# Plot capital spending
temp <- nyct_normal %>%
  select(Year, Rolling.Stock:Other.Cap)%>%
  gather("Exp.Type", "Amount", Rolling.Stock:Other.Cap)
plot1 <- ggplot(temp, aes(x = Year, y = Amount, fill = Exp.Type)) +
  geom_bar(stat = "identity") +
  labs(title = "Capital Expense by Type", y = "USD per Passenger Trip") +
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5), legend.title=element_blank()) +
  scale_x_continuous(breaks = seq(1992,2017, by = 5))
plot2 <- ggplot(temp, aes(x = Year, y = Amount, color = Exp.Type)) +
  geom_line() +
  labs(title = "Capital Expense by Type", y = "USD per Passenger Trip") +
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5), legend.title=element_blank()) +
  scale_x_continuous(breaks = seq(1992,2017, by = 5))
grid.arrange(plot1, plot2, ncol=1)

There are couple things worthy of mention on this graph. First, the Rolling.Stock capital spending is small comparing to other two spending categories. Rolling stock procurement refers to the purchase of trains and rail equipment. Facilities refers to spendings on the subway stations and NYCT owned buildings and structures. Facilities spending is by far the largest capital spending.

In this section, we explored NTA’s data for NYCT’s subway system and constructed graphs to dissect NYCT’s funding allocations and trends. But are these spending decisions affecting the subway’s service performance? In the next section, we will attempt to answer this question.

Linear Model Solutions

We now move forward to propose a frame work that can be used to test and check if the spending has any relationship with the performance. Namely, we will create a number of linear regression models, using normalized mechanical failures as the target response variable, and the normalized expenses as the explanatory variables. The slope of the linear line is checked to see if it is significant. Using an alpha level of 0.05, we check if the p-value is less than the alpha level, so we can reject the null hypothesis that the slope is zero. A slope of zero means that there is no relationship between the expense and the mechanical failures.

Since the mechanical failures data is only available from 1997 to 2016, any explanatory variables data prior to 1997 are excluded. Also, to simplify the analysis, major and minor failure types are added together to create a new variable Total.Failures.

# Select rows with year later than 1996
nyct_lm <- filter(nyct_normal, Year > 1996)
# Calculate the total number of mechanical failures per 1 million passenger trips
nyct_lm$Total.Failures <- (nyct_lm$Major.Failures + nyct_lm$Minor.Failures) * 1000000

Below is a scatter plot matrix, pairing up the capital expenses with the new failure variable. In the upper panel of the matrix, the Pearson Correlation Coefficients between the variables are calculated. In the diagonal panel, the histograms for the variables are shown.

# Following two functions, panel.cor and panel.hist, were based on a stackoverflow post.
# https://stackoverflow.com/questions/15271103/how-to-modify-this-correlation-matrix-plot
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y, use = "complete.obs"))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt)
}

panel.hist <- function(x, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 1.5) )
    h <- hist(x, plot = FALSE)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}

# Construct the scattplot matrix
temp <- nyct_lm %>% 
  select(Total.Failures, Rolling.Stock : Other.Cap)
pairs(temp, upper.panel = panel.cor, diag.panel = panel.hist)

The plot matrix shows that the Total.Failures has a high correlation with Other.Cap. Next, we create a function that can perform the linear regression and construct a summary table. This function will be used repeatedly.

# This function takes the reponse variable as a string, and the explanatory variables as a vector of strings. The function performs linear regression with each variable individually, and returns a data frame object containing the intercepts, slopes and p-values of each linear model.
calLm <- function(response, explain, data = nyct_lm){
  
  formula_vec <- paste(response, explain, sep = " ~ ")
  intercepts <- rep(NA, length(explain))
  slopes <- rep(NA, length(explain))
  p_vals <- rep(NA, length(explain))
  adj_rsq <- rep(NA, length(explain))
  
  for (i in 1:length(explain)){
    temp <- lm(as.formula(formula_vec[i]), data = data)
    summary_temp <- summary(temp)
    intercepts[i] <- round(summary_temp$coefficients[1,1],4)
    slopes[i] <- round(summary_temp$coefficients[2,1], 4)
    p_vals[i] <- round(summary_temp$coefficients[2,4], 4)
    adj_rsq[i] <- round(summary_temp$adj.r.squared, 4)
  }
  
  df <- data.frame(Model = formula_vec, Intercept = intercepts, Slope = slopes, 
                   `Adj R Squared` = adj_rsq, `P Value` = p_vals)
  return(df)
}

# Construct table
temp <- calLm("Total.Failures", c("Rolling.Stock", "Facilities", "Other.Cap"), data = nyct_lm)
kable(temp, "html") %>% 
  kable_styling(full_width = F, position = "left")

Model	Intercept	Slope	Adj.R.Squared	P.Value
Total.Failures ~ Rolling.Stock	4.8927	-0.0671	-0.0554	0.9644
Total.Failures ~ Facilities	7.0001	-3.1379	0.0105	0.2873
Total.Failures ~ Other.Cap	6.5273	-7.8984	0.5800	0.0001

The table shows that only the Other.Cap variable has a slope that has a p-value lower than the alpha level of 0.05. The scatter plot and the linear regression line are plotted below. The negative slope suggests that increase the Other.Cap spending will lower the number of mechanical failures.

par(mfrow = c(1,2))
xlab <- "Other Capital Spending (USD per UPT)"
ylab <- "Mechanical Failures (per Million UPT)"
plot(Total.Failures ~ Other.Cap, data = nyct_lm, xlab = xlab, ylab = ylab)
abline(lm(Total.Failures ~ Other.Cap, data = nyct_lm))

The definition of “Other Capital Projects”, according to NTD’s glossary, is:

“Any item not described as guideway, passenger stations, administrative buildings, maintenance buildings, revenue vehicles, service vehicles, fare revenue collection equipment or systems including: - Furniture and equipment that are not an integral part of buildings and structures; and - Shelters, signs and passenger amenities (e.g., benches) not in passenger stations.”

How this help reduces the total failure is not clear. This information is not helpful until after taking a deeper look at the breakdown of the Other.Cap. For now, let us turn to the operating and maintenance expenses. We again begin with a scatter plot matrix, follow by a table summarizing the estimates of the linear regression models.

# Construct the scattplot matrix
temp <- nyct_lm %>% 
  select(Total.Failures, Operation : NonVeh.Maint)
pairs(temp, upper.panel = panel.cor, diag.panel = panel.hist)

# Construct summary table
expense_lm <- calLm("Total.Failures", c("Operation","General.Admin","Vehicle.Maint","NonVeh.Maint"), data = nyct_lm)
kable(expense_lm, "html") %>% 
  kable_styling(full_width = F, position = "left")

Model	Intercept	Slope	Adj.R.Squared	P.Value
Total.Failures ~ Operation	9.0504	-6.6497	0.3235	0.0052
Total.Failures ~ General.Admin	5.3597	-2.4550	-0.0257	0.4786
Total.Failures ~ Vehicle.Maint	9.1004	-17.3148	0.2349	0.0176
Total.Failures ~ NonVeh.Maint	5.8201	-2.4070	0.0027	0.3186

The Operation and Vehicle.Maint variables are significant. Again, the slopes are negative, suggesting that increase spending in operation and vehicle maintenance reduce the mechanical failures.

par(mfrow = c(1,2))
xlab <- "Operation Expense (USD per UPT)"
ylab <- "Mechanical Failures (per Million UPT)"
plot(Total.Failures ~ Operation, data = nyct_lm, xlab = xlab, ylab = ylab)
abline(lm(Total.Failures ~ Operation, data = nyct_lm))
xlab <- "Vehicle Maintenance (USD per UPT)"
plot(Total.Failures ~ Vehicle.Maint, data = nyct_lm, xlab = xlab, ylab = ylab)
abline(lm(Total.Failures ~ Vehicle.Maint, data = nyct_lm))

In particular, the slope for Vehicle.Maint has a steeper slope, suggesting that it has a larger effect on reducing the number of mechanical failures. Moreover, the linear model puts a figure on how much the total mechanical failures can be reduced. The slope for Vehicle.Maint suggests that for each additional 1 million USD spend on vehicle maintenance, NYCT can reduce the number of mechanical failures by 17.

We now use a different break down of the operation and maintenance expenses, using categories such as Fringe.Benefit, Operator.Wage, etc. We would like to see how the spending in these categories affect the number of mechanical failures.

temp <- nyct_lm %>% 
  select(Total.Failures, Operator.Wage : Other)
pairs(temp, upper.panel = panel.cor, diag.panel = panel.hist)

exp_detail <- calLm("Total.Failures", c("Operator.Wage","Other.Wage","Fringe.Benefit","Services","Fuel.Lube",
                               "Tires.Other","Utilities","Casualty.Liability","Other"), data = nyct_lm)
kable(exp_detail, "html") %>% 
  kable_styling(full_width = F, position = "left")

Model	Intercept	Slope	Adj.R.Squared	P.Value
Total.Failures ~ Operator.Wage	7.0596	-15.9058	-0.0147	0.4055
Total.Failures ~ Other.Wage	0.6440	6.1930	0.0878	0.1098
Total.Failures ~ Fringe.Benefit	7.6471	-4.5982	0.4553	0.0007
Total.Failures ~ Services	1.9374	62.9423	0.1563	0.0476
Total.Failures ~ Fuel.Lube	7.1346	-1316.1938	0.5892	0.0000
Total.Failures ~ Tires.Other	4.4918	4.3063	-0.0317	0.5271
Total.Failures ~ Utilities	11.0411	-56.4985	0.3340	0.0045
Total.Failures ~ Casualty.Liability	5.4461	-18.2828	0.0027	0.3188
Total.Failures ~ Other	4.9757	-5.0862	-0.0306	0.5173

The significant explanatory variables are Fringe.Benefit, Fuel.Lube, Services, and Utilities.

par(mfrow = c(1,2))
xlab <- "Fringe Benefit (USD per UPT)"
ylab <- "Mechanical Failures (per Million UPT)"
plot(Total.Failures ~ Fringe.Benefit, data = nyct_lm, xlab = xlab, ylab = ylab)
abline(lm(Total.Failures ~ Fringe.Benefit, data = nyct_lm))
xlab <- "Fuel and Lube Expenses (USD per UPT)"
plot(Total.Failures ~ Fuel.Lube, data = nyct_lm, xlab = xlab, ylab = ylab)
abline(lm(Total.Failures ~ Fuel.Lube, data = nyct_lm))

xlab <- "3rd Party Services (USD per UPT)"
plot(Total.Failures ~ Services, data = nyct_lm, xlab = xlab, ylab = ylab)
abline(lm(Total.Failures ~ Services, data = nyct_lm))
xlab <- "Utilities (USD per UPT)"
plot(Total.Failures ~ Utilities, data = nyct_lm, xlab = xlab, ylab = ylab)
abline(lm(Total.Failures ~ Utilities, data = nyct_lm))

There are few interesting discoveries. The Services expense category, which are the payments made to 3rd party contractors, has a positive slope. Without further research and more data to back it up, we can only speculate that this may be because more mechanical breakdown leads to more 3rd party services being ordered to repair or supplement the service. The relatively small adjusted R squared for this variable also suggests that the variable only explains a small portion of the variation.

Further, notice the large slope for Fuel.Lube. This translates to that each 1 million USD additional spending in Fuel.Lube reduces the number of mechanical failure by 1316. Consider that Fuel.Lube is only a small fraction of the total operating and maintenance expense, there may be room for improvement. However, we must also consider practicality - the Fuel.Lube expenses may be capped by the service load of the trains and equipment. Additional funds in this category may be hard to justified.

In summary, the linear models constructed above suggest that the spending increases in the following expense categories, ranked in the order of descending effectiveness, can reduce the number of mechanical failures in NYCT’s subway system, which may lead to service improvement:

Fuel and Lube
Utilities
Vehicle Maintenance
Other Capital Projects
Operation
Fringe Benefit

Conclusion

In this project, we explored the data provided by MTA to gauge just how bad the NYCT’s subway services have degraded over the years. We extracted data from NTD to identify the issues underlying NYCT’s spending decisions. Moreover, we created a preliminary frame work, using linear regression models, to recommend spending allocations that can help to increase the subway performance by reducing the number of mechanical failures in the system.

It is important to stress that this project is only a preliminary look into how a data driven model can help address the issues in NYCT’s subway system. There are still much works to be done to validate this method. For instance, thus far we have not performed diagnostics to check the residuals of each model against the explanatory variables and see if any regression assumptions are violated. We have not checked the auto-correlation of the data, to see if there are information not being used within the time-series and if the relationship is spurious. Moreover, by creating multiple linear models, we risk running into the multiple comparison problem, in which a multivariate regression model may be a better approach. For these potential issues, we offer that there are more sophisticated statistical methods that can be applied to adjust the model. At last, it is our hope that the basis established in this project can be of any use for future policy decision making and guidance, leading to more efficient allocation of public funds in our public transportation system.