Now Boarding!

An Exploratory Analysis of NYC’s Subway

New York City, home to roughly 8M people, and another 1.6M commuters would be nothing without its 113 year old infrastructure: The New York City Subway. The MTA fleet consists of 6,418 subway cars that travel 358M miles in a calendar year. With 8,200 weekday train trips, carrying a daily average of 5.7M riders to 472 stations may seem like daunting task, but hey, only in New York.

There are a number to different factors that go into a subway ride, and no two days are the same. Looking at complete data from 2015, this project overall aims to analyze the movement of people through NYC’s underground. To do so, a variety of data was collected from the MTA. This data included turnstile usage, 511 events, customer feedback, and MetroCard fares; the largest data set containing nine million records.

Understanding the MTA data is two fold - it benefits both the agency and the consumer. From an MTA stand point, we can gather insight on how to optimize stations, market fare types, and avoid common commuter problems. Knowing the frequency will help the MTA forecast their repair schedules and to some degree, avoid the situation they are in now. From a consumer stand point, it is how to make your commute more bearable (as someone who commuted to NYC, this is crucial), what MTA events mean for your ride home, and do complaints affect ridership.

This is an Analysis Bound Train
The next stop is The Packages

library(tidyverse)  #The tidyverse package / tidyr, readr, dplyr, ggplot2
library(tidyr)      #Cleaning Data
library(readr)      #Read Rectangular Text Data (.csv in this case)
library(dplyr)      #Data Manipulation
library(ggplot2)    #Graphics
library(DT)         #HTML Datatable
library(stringr)    #Manipulating strings
library(lubridate)  #Manipulating dates/times
library(magrittr)   #Using the pipe operator

This is an Analysis Bound Train
The next stop is The Data

Data regarding 2015 customer feedback filtered only to include data pretaining to the Long Island Railroad and NYC Subways. The following is a sample that represents 50% of the data.

feedback <- read_csv("MTA Customer Feedback.csv")

colnames(feedback)[colnames(feedback)=="Commendation or Complaint"] <- "Feedback"
feedbackclean <- filter(feedback, Year == 2015 & Agency %in% c("Long Island Rail Road","Subways")) %>%
  mutate(Agency = if_else(Agency == "Long Island Rail Road", "LIRR","Subway")) %>%
  select(Agency, Feedback, `Subject Matter`, `Subject Detail`,`Issue Detail`,`Quarter`)

datatable(sample_frac(feedbackclean, .5), filter = "top")

Data regarding 511 events filtered only to include LIRR and NYC Subways. A duration variable was calculated to determine impact of event. The following is a sample that represents 50% of the data.

Event <- read_csv("MTA Events.csv")
Eventclean <- select(Event, "Event Type","Organization Name","Facility Name","Direction","Latitude","Longitude", "Create Time", "Close Time", "Event Description") %>%
  separate(`Create Time`, c("Create Date", "Create Time"), sep = " ") %>%
  separate(`Close Time`, c("Close Date", "Close Time"), sep = " ") %>%
  separate("Close Date", c("ClMonth", "ClDay", "ClYear"), sep = "/") %>%
  separate("Create Date", c("CrMonth", "CrDay", "CrYear"), sep = "/") %>%
  mutate(CreateDate = make_date(CrYear, CrMonth, CrDay),
         CloseDate = make_date(ClYear, ClMonth, ClDay),
         Duration = abs(CloseDate - CreateDate)) %>%
  filter(CrYear == 2015, `Organization Name` %in% c("MTA NYC Transit Subway", "MTA Long Island Rail Road")) %>%
  mutate(`Organization Name` = if_else(`Organization Name` == "MTA Long Island Rail Road", "LIRR", "Subway"),`Direction` = if_else(`Direction` == "both directions", "both", 
          `Direction`)) %>%
  select("Event Type", "Organization Name", "Event Description" ,"CreateDate", "CloseDate", Duration) %>%
  arrange(Duration) %>%
  filter(Duration > 0) 
  
EventAll <- select(Event, "Event Type","Organization Name","Facility Name","Direction","Latitude","Longitude", "Create Time", "Close Time", "Event Description") %>%
  separate(`Create Time`, c("Create Date", "Create Time"), sep = " ") %>%
  separate(`Close Time`, c("Close Date", "Close Time"), sep = " ") %>%
  separate("Close Date", c("ClMonth", "ClDay", "ClYear"), sep = "/") %>%
  separate("Create Date", c("CrMonth", "CrDay", "CrYear"), sep = "/") %>%
  mutate(CreateDate = make_date(CrYear, CrMonth, CrDay),
         CloseDate = make_date(ClYear, ClMonth, ClDay),
         Duration = abs(CloseDate - CreateDate))

datatable(sample_frac(Eventclean, .5), filter = "top")

Turnstile data for 2015 filtered to show only subway figures by unit and station. A subset of Janruary figures was joined with a subset of June figures.A new variable was created to determine the total entries per unit. The data was sampled at the same time on the first day of each momth for consistency. The following is a sample that represents 10% of the data.
Data Overview
Data Dictionary

Turnstile <- read_csv("MTA Turnstile Data.csv")

Turnstileclean <- Turnstile %>% filter(Division %in% c("BMT", "INT", "IRD"))
Turnstileclean <- Turnstileclean %>%
  separate("Date", c("Month", "Day", "Year"), sep = "/") %>%
  mutate(NewDate = make_date(Year, Month, Day))
    
TurnstilecleanDate <- Turnstileclean %>%
  filter(Day == "01" & Time == 00:00:00 & Year == "2015" | Day == "31" & Month == "12" & Time == 00:00:00)

TSCD <- TurnstilecleanDate %>%
  arrange(Unit, SCP, Month, Year) %>%
  group_by(Unit, SCP) %>%
  mutate(Entry_Change = Entries - lag(Entries)) %>%
  select(`C/A`,Unit,SCP,Station,`Line Name`,Division, Month, Day, Year, Entries, Exits, Entry_Change) %>%
  arrange(Station) %>%
  filter(Entry_Change >=0 & Entry_Change < 1000000) %>%
  arrange(Month) 

TSMonth <- unique(TurnstilecleanDate) %>%
  arrange(Unit, SCP) %>%
  select(`C/A`, Unit, SCP, Station, Month, Entries) %>% 
  group_by(Unit, SCP) %>%
  mutate(Entry_Change = abs(Entries - lag(Entries))) 

datatable(sample_frac(TSMonth, .1), filter = "top")

Fare data for the second week of the following months: January, April, July, and October. This sample was choosen to reflect each quarter. The data is the fare type by number of swipes for each station.The data was combined into one data set for ease of analysis.
Data Dictionary

FareJan <- read_csv("JanFare.csv")
FareApr <- read_csv("AprilFare.csv")
FareJul <- read_csv("JulyFare.csv")
FareOct <- read_csv("OctFare.csv")

FareJan <- mutate(FareJan, Month = "Jan")
FareApr <- mutate(FareApr, Month = "Apr")
FareJul  <- mutate(FareJul, Month = "Jul")
FareOct <- mutate(FareOct, Month = "Oct")

Fares <- rbind(FareJan, FareApr, FareJul, FareOct)
Fares <- Fares %>% gather(Fare, Swipes, 3:25) 
Fares <- Fares %>% select(REMOTE, STATION, Month, Fare, Swipes)

datatable(Fares)

This is an Analysis Bound Train
The next stop is The Visuals

feedback %>%
  filter(Year == 2015 & Agency %in% c("Long Island Rail Road","Subways")) %>%
  arrange(Quarter) %>%
  ggplot(aes(x = Feedback, fill=Feedback))+
  geom_bar(stat="count")+
  scale_fill_manual(values=c("green","red"))+
  ggtitle("2015 Customer Feedback by Quarter")+
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  facet_wrap(~ Quarter, nrow = 1)

feedbackclean %>%
  count(`Subject Detail`) %>%
  filter(n > 150) %>%
  ggplot(aes(`Subject Detail`, y = n))+
  geom_bar(stat = "identity")+
  coord_flip()+
  ggtitle("Frequent Subjects about which Individuals Have Feedback (n>150)")+
  labs(y = "Count", x = "Subject")+
  theme(plot.title = element_text(size = 10, face = "bold"))

feedbackclean %>%
  filter(`Subject Detail` == "Train - General") %>%
  ggplot(aes(`Issue Detail`))+
  geom_bar(stat = "count")+
  coord_flip()+
  ggtitle("Issues Within the 'Train - General' Category")+
  labs(x = "Issue", y = "Count")+
  theme(plot.title = element_text(size = 10, face = "bold"))

feedbackclean %>%
  filter(Feedback == "Commendation") %>% 
  ggplot(aes(x = `Issue Detail`, fill = Agency))+
  geom_bar(position = "dodge", color = "black")+
  coord_flip()+
  labs(y = "Count", x = "Issue")+
  ggtitle("Common Commendations by Agency")+
  theme(plot.title = element_text(size = 10, face = "bold"))

TSMonth <- unique(TurnstilecleanDate) %>%
  select(`C/A`, Unit, SCP, Station, Month, Year, Entries) %>%
  arrange(Unit, SCP, Year, Month) %>%
  group_by(Unit, SCP) %>%
  mutate(Entry_Change = abs(Entries - lag(Entries))) 


aggregate(Entry_Change~Station, TSCD, mean) %>%
  top_n(20) %>%
  ggplot(aes(x = Station, y = Entry_Change))+
  geom_bar(stat = "identity", color = "red", width = .8)+
  coord_flip()+
  scale_y_continuous(labels = scales::comma)+
  labs(x = "Station", y = "Entries")+
  ggtitle("Most Used Turnstiles (Average)")+
  theme(plot.title = element_text(size = 10, face = "bold"))

TSMonth %>%
  na.omit %>%
  arrange(desc(Entry_Change)) %>%
  filter(Entry_Change < 300000) %>%
  ggplot(aes(x = Month, y = Entry_Change))+
  geom_boxplot()+
  scale_y_continuous(labels = scales::comma)+
  labs(y = "Entries")+
  ggtitle("Entries by Month of all Stations")+
  theme(plot.title = element_text(size = 10, face = "bold"))

aggregate(Swipes~STATION, Fares, mean) %>%
  top_n(10) %>%
  ggplot(aes(x = STATION, y = Swipes))+
  geom_bar(stat = "identity", fill = "orange")+
  coord_flip()+
  scale_y_continuous(labels = scales::comma)+
  ggtitle("Top 10 Station for Average Swipes in Sample")+
  theme(plot.title = element_text(size = 10, face = "bold"))

Fares %>%
  filter(Swipes > 10000) %>%
  ggplot(aes(x = Fare, y = Swipes))+
  geom_boxplot(fill = "orange")+
  scale_y_continuous(labels = scales::comma)

Fares %>%
  filter(Swipes > 10000) %>%
  ggplot(aes(x = Fare, y = Swipes, fill = Month))+
  geom_bar(stat = "summary", fun.y = "sum", position = "dodge")+
  scale_y_continuous(labels = scales::comma)+
  ggtitle("Total Swipes Per Fare By Month")+
  theme(plot.title = element_text(size = 10, face = "bold"))+
  coord_flip()

Fares %>% 
  filter(Swipes > 10000) %>%
  ggplot(aes(x = Month, y = Swipes, fill = Fare))+
  geom_bar(stat = "summary", fun.y = "mean", position = "dodge")+
  scale_y_continuous(labels = scales::comma)+
  ggtitle("Average Swipes Per Month By Fare")+
  theme(plot.title = element_text(size = 10, face = "bold"))

aggregate(Duration~`Event Type`, EventAll, mean) %>%
  arrange(desc(Duration)) %>%
  top_n(10) %>%
  ggplot(aes(x = `Event Type`, y = Duration))+
  geom_bar(stat = "identity", fill = "dark blue")+
  coord_flip()+
  ggtitle("Top 10 Longest Events of all MTA Agencies")+
  theme(plot.title = element_text(size = 10, face = "bold"))

aggregate(`Event Type`~CrMonth, EventAll, length) %>%
  ggplot(aes(x = CrMonth, y = `Event Type`))+
  geom_bar(stat = "identity", color = "blue", fill = "orange")+
  labs(x = "Month")+
  ggtitle("Frequency of All MTA Events by Month")+
  theme(plot.title = element_text(size = 10, face = "bold"))+
  labs(y = "Count")

Eventclean %>%
  ggplot(aes(x = `Event Type`,  fill = `Organization Name`))+
  geom_bar(stat = "count", position = "dodge")+
  ggtitle("Frequency of LIRR and Subway Events")+
  theme(plot.title = element_text(size = 10, face = "bold"))+
  labs(y = "Count", legend = "Agency")

Eventclean %>%
  separate(CreateDate, c("Year","Month","Day"), sep = "-") %>%
  select(`Event Type`, `Organization Name`, `Month`) %>%
  ggplot(aes(x = Month, group = `Organization Name`, color = `Organization Name`))+
  geom_line(stat="count", size=1)+
  ggtitle("Frequency of Events by Month")+
  theme(plot.title = element_text(size = 10, face = "bold"))+
  labs(y = "Count")

EDesc <- Eventclean %>% 
  mutate( EDesc = word(Eventclean$`Event Description`,3,4))

EDesc %>% select(`Event Type`,`Organization Name`,EDesc) %>%
  group_by(`Organization Name`, `Event Type`,EDesc) %>%
  count() %>%
  filter(n > 10) %>%
  ggplot(aes(x = EDesc, y = n, fill = `Organization Name`))+
  geom_bar(stat = "identity", position = "dodge")+
  coord_flip()+
  ggtitle("Common Events on the LIRR and Subway")+
  theme(plot.title = element_text(size = 10, face = "bold"))+
  labs(x = "Event", y = "Count")

This is an Analysis Bound Train
The next stop is The Analysis

Service Status
Planned Work

Frank Sinatra sang of New York, New York, but never really mentioned what many New Yorker’s aptly call the “Summer of Hell. Even though this data is from 2015, we can pretty much assume that figures here are more today than they were in 2015, as more individuals seek alternative methods of transportation. Overall this project aimed to get a glimpse of 2015 in review, explore the most common problems and complaints, and how to have a more enjoyable commute.

Don’t find fault, find a remedy. - Henry Ford

It is probably easier to complain rather than commend, and the data would support that. The number of complaints outweighs commendations by almost 24 times. There is more feedback given in the 4th quarter than the rest of the year. However, this may be a result of the increased ridership in December (see boxplot). Ultimately it seems that filing complaints does not affect ridership, as it almost seems to be negatively correlated. Speaking from experience, I feel there is very little that would truly affect overall ridership. Given that Long Island is landlocked, residents have very few options to commute to the City, given driving is not a possibility (practically and economically). Ergo, Long Islanders have basically no other options other than the Long Island Rail Road. This is not to say that there are not things on which the MTA can improve. Looking at the most frequent subject which individuals have feedback about, “Train-General” is the most significant. Breaking down this category, most feedback revolves around the punctuality of the train (being both late and early). Although unforeseen events occur, issues such as “Failure to make Scheduled Stop” should be avoided (as it is pretty much what a train is supposed to do). “Equipment Breakdown” can be avoided by performing more frequent and thorough checks of equipment (which would also prevent major repair projects). On the rare occasion that the MTA receives positive feedback, it is mostly regarding their employees being helpful or honest. Looking at the “Common Commendations by Agency”, one is more likely to receive a commendation as a subway employee.

Dont let the train of enthusiasm run through the station so fast that people can’t get on board. - H V Morton

Who is even getting on this train anyway? As it turns out, a lot of people. Looking at the “Most Used Turnstiles”, the Sutter Ave. Station has on average, turnstiles with the highest usage per month. This is followed by the 7th and 8th Ave (Penn Station). Looking at “Top 10 Stations for Swipes in Sample”. The 14th St./Union Square Station seems to be the busiest of the sample, followed by more notable locations such as Herald Square (34th and 6th), Columbus Circle, and Penn Station. The results differ based on the sample data and the method used. I believe the fare data to be more accurate as it takes into account the station in its entirety as opposed to a turnstile as an entity, however the fare data is just a sample of four weeks of the year, as opposed to all 52 weeks. The “Entries by Month of all Stations” suggests that December is the busiest month. Most likely due to the influx of tourism during the holiday season. The MTA should adequately prepare for this increase in ridership by maintaining equipment, having extra staff, and avoiding delays. The most popular fares are the “Full Fare” (FF) and the “30 Day Unlimited” (30-D UNL). These fares mostly apply to commuters and residents of New York City. Full fare and 30-day unlimited swipes saw more than 10M and roughly 7M swipes, respectively, across all stations for each of the sample months. The 1 day and 14 day unlimited are almost non-existent in frequency, as well as the 10 Trip and Monthly AirTrain, which were only marginally greater. The MTA should stop offering these fares (assuming there is a cost associated with them). Lastly, the box plot describes the distribution of observations by fare for the most used fares (>10000). Basically, the MTA is going to be busy regardless of the time or location.

I never travel without my diary. One should always have something sensational to read in the train. - Oscar Wilde

Given that it is more likely than not to be impacted by some MTA event, the above is not such a bad idea. Most event affect riders less than a day, delaying most trains for only a few minutes. However, looking at the top ten longest duration events, most events involve construction, with “Bridge Rehabilitation” taking more than a year on average. “Station Renovation” are the most relevant in this instance. Again, looking at the MTA as a whole, there are (almost) over 10,000 events each month, with the peak occurring in May. A time series comparison would be the most useful in this instance to examine if the trend is constant, however the 2015 data was used for completeness. Separated by agency, the Long Island Rail Road has roughly 100 fewer events than the subway for each month. Despite having truncated data, it can be reasonably assumed that available data is representative of the first quarter. Examining the “Common Events”, it is evident that construction is one of the most frequent events of the subway (based on event description). More common for the LIRR seems to be poorly maintained equipment (such as escalators and elevators), relating back to the point earlier about such types of events.

Stand clear of the closing doors, please - Charlie Pellett

The data is just a byte of the massive amount of information released by the MTA. Using a variety of data sets from 511 Events, to Fare Types, to Turnstile Usage, and Customer Feedback, we can gain greater insight in to how this data impacts the MTA and the areas of needed improvement. The most significant hindrance to this analysis is using only one year of data. By comparing this data across several years, we will be able to see notable improvements and areas of concern as trends become more emergent.