Introduction

Recently, riders of the NJ PATH transit system have complained of a common train over-crowding situation so severe they are unable to board and remain on the platform waiting for the next train. Some riders have described missing numerous trains during peak transit hours indicating this problem may be quite common and have varying levels of severity. Armed with turnstile entry and exit data along with the PATH train schedule we began to analyze the data in an attempt to verify or refute rider complaints.

A visual tool that would allow users to view system wide rider flow in an easily understood animation was imagined. Consistent with this goal the visualization would reflect the motion of the train, the capacity and rider flux, and the tracks and relevant stations. Specifically, a circle traveling along a track on a map with a size corresponding to number of riders and bounded by a black open circle reflecting the capacity of the train would convey this information perfectly.

Data Description and Normalization

While exploring the supplied PATH train schedule interesting detail was uncovered, the NJ PATH train schedule operated on a 25 hour day! Trains operating during the 23rd hour could arrive at their destination between the 24th and 25th hour according to the provided timetable. The following code was used to normalize 25 hour NJ time to 24 hour Earth time and join the schedule table with a table of calculated rider flow provided by my teammate Sylvie, whose methodology is detailed here.

library(plyr)
library(dplyr)

#fix miscoded indices
df = read.csv('path_schedule_clean.csv', header=T) 
df$departure_time = as.character(df$departure_time)
idx = grep('24:[[:digit:]]{2}:[[:digit:]]{2}', df$departure_time) 
mins = strsplit(df[idx, 'departure_time'], ':') 
mins = do.call(rbind, lapply(mins, function(x) x[2]))


#assign date placeholder
datetimes = paste0('01/02/2014 0:', mins, ':00') 
df[idx, 'departure_time'] = datetimes 
df$departure_time = ifelse(nchar(df$departure_time) < 16, paste0('01/01/2014 ', df$departure_time), 
                           df$departure_time) 
df$departure_time = strptime(df$departure_time, '%m/%d/%Y %H:%M:%S')


#only analyzing weekday rider flow
df = df[df$service_name == 'Yearly Service (Mon-Fri)',]


#merge train rider flow with cleaned schedule
trainridership = read.csv('station_train.csv')
df = df[,-1]; trainridership = trainridership[,-1]
trainridership = trainridership[-which(is.na(trainridership$left)),]
all = merge(trainridership, select(df, stop_id, trip_id, stop_lat, stop_lon, departure_time, direction_id), 
            by.x=c('stop_id','direction_id','trip_id'), 
            by.y=c('stop_id','direction_id','trip_id'), all.x=T)

Constructing an intuitive Visualization

The schedule data provides only a timetable of arrivals by a given train at a station along its route. In order to create a visualization of a train moving along its track a function that interpolates linear coordinates between two adjacent stations every 15s was written. The map_trains function imputes these data points but, since a linear approximation was used for the imputation, the tracks and train movement do not reflect any real-world track curvature.

Finally, a few aesthetics were added to make the visualization more informative and visually appealing. First, an abbreviated unique ID was assigned to each train. The schedule provided contained an ID number but with 14 digits the ID number would clutter the visualization and be difficult to identify when flagged on an animation. Second, the coordinates of each PATH train station and the computed linear track was extracted in order to provide context to the movement of the trains. Third the capacity of each train was added in order to provide an easy way to visualize the relative load of a train during its trip.

#add station names
stop_id = unique(all$stop_id)
names = c('14th','23rd','33rd','9th','Christopher St','Exchange Pl','Grove St','Harrison','Hoboken',
          'Journal Sq','Newport','Newark', 'World Trade Ctr')
stations = data.frame(stop_id, names)
all = merge(all, stations, all.x=T)
all = all[order(all$departure_time),]

#the map_trains function interpolates assigns a linear coordinate between adjacent stops every 15s
map_trains = function(df){
  tripdf = data.frame()
  for(i in 2:nrow(df)-1){
    newdf = data.frame()
    path = seq(df$departure_time[i], df$departure_time[i+1], '15 secs')
    path = path[1:length(path)-1]
    lat = seq(df$stop_lat[i], df$stop_lat[i+1],  (df$stop_lat[i+1] - df$stop_lat[i])/(length(path)-1))
    lon = seq(df$stop_lon[i], df$stop_lon[i+1],  (df$stop_lon[i+1] - df$stop_lon[i])/(length(path)-1))
    newdf = data.frame(path, lat, lon, takingtrain = df$takingtrain[i], intrain = df$intrain[i], 
                       left = df$left[i], leavingtrain = df$leavingtrain[i], leavingstop = df$leavingstop[i], 
                       trip_id = df$trip_id[i], direction_id = df$direction_id[i], route_id = df$route_id[i],
                       stop_id = df$stop_id[i], traintime = df$traintime[i], names = df$names[i])
    tripdf = rbind(tripdf, newdf)
  }
  return(unique(tripdf))
}

newdf = by(all, all$trip_id, map_trains) 
newdf = do.call(rbind, newdf)


#remove date placeholder
library(lubridate)
time = strsplit(as.character(newdf$path), ' ')
newdf$time = unlist(lapply(time, function(x) x[2]))


#create intervals to plot all trains that operate during same time period
time.range = seq(range(newdf$path)[1], range(newdf$path)[2], by ='15 secs')
newdf2 = data.frame(path = time.range, interval = 1:length(time.range)) 
pathdf = merge(newdf, newdf2, by.x='path', by.y='path', all.x=T)


#station coordinates
station.coords = data.frame(lat = unique(df$stop_lat), lon = unique(df$stop_lon))
write.csv(station.coords, 'station.coords.csv')


#assign each train an abbreviated ID number for animation
id.df = data.frame(trip_id = unique(pathdf$trip_id), id = seq(1, length(unique(pathdf$trip_id)), 1)) 
pathdf3 = merge(pathdf, id.df, by.x='trip_id', by.y='trip_id', all.x=T)


#assign capacity info
traincap = data.frame(route_id = c(859,860,861,862,1024), capacity = c(574, 738, 574, 738, 574))
pathdf4 = merge(pathdf3, traincap, by.x='route_id', by.y='route_id', all.x=T)
write.csv(pathdf4, 'pathdf.csv')


#get train routes
lines = split(df, df$route_id)
lines = data.frame(lon = c(lines[[1]]$stop_lon[1:6], lines[[2]]$stop_lon[5:8], lines[[3]]$stop_lon[1:8], 
                  lines[[4]]$stop_lon[1:6], lines[[5]]$stop_lon[1:9]), lat = c(lines[[1]]$stop_lat[1:6], 
                  lines[[2]]$stop_lat[5:8], lines[[3]]$stop_lat[1:8], lines[[4]]$stop_lat[1:6], 
                  lines[[5]]$stop_lat[1:9]), route_id = rep(names(lines), c(6,4,8,6,9)))
write.csv(lines, 'lines.csv')

Results

Then two animations were created covering the subset of time and direction related to rider complaints, 9am - 11am, from NJ -> NY. The entire system was visualized in the first animation but it quickly becomes overwhelming with all 3 lines running multiple trains during the morning rush hours. To address this the facets arguement was added to the ggmap function which separated the different train lines into different panels allowing a user to focus on a single train line at a time. Please be patient, the animation may take a minute to load.

library(ggmap)
library(animation)
nyc = get_map(location = c(lon = -74.07, lat=40.72), source='stamen', maptype='toner', zoom = 12)
oopt = ani.options(interval = 0.1)
cols = c('859' = 'red', '860' = 'green', '861' = 'purple', '862' = 'blue', '1024'='brown')
train_animation = function(){
  for(i in 1:nrow(pathdf)){
    print(ggmap(nyc, extent='device')+
            geom_text(aes(x=lon, y=lat, label=id), size=5, data=pathdf[pathdf$interval== i,], 
                      color='#80C080', angle=45, hjust=1.5, position = position_dodge())+
            annotate('text', x=-74.153, y=40.77, label=unique(pathdf[pathdf$interval == i ,15]), color='red', size=6)+ 
            geom_path(aes(x=lon, y=lat, group=factor(route_id)), color='orange', alpha=0.5, size = 1, data=lines)+
            geom_point(aes(x=lon, y=lat), size =5, pch = 7, color = 'orange', data=station.coords)+
            geom_point(aes(x=lon, y=lat, color=factor(route_id), size = intrain), 
                       data=pathdf[pathdf$interval == i,])+scale_color_manual(values = cols)+
            geom_point(aes(x=lon, y=lat, size = capacity), pch = 1, data = pathdf[pathdf$interval == i,])+
            scale_size_continuous(range=c(3,6))+
            theme(axis.text.y=element_blank(), axis.title.x=element_blank(), axis.title.y=element_blank(), 
                  axis.text.x=element_blank(), axis.ticks = element_blank(), legend.position='none'))
  animation::ani.pause()
  }
}

saveHTML(train_animation(), 
         autoplay = FALSE, 
         loop = FALSE, 
         verbose = TRUE, 
         outdir = "images", 
         single.opts = "'controls': ['first', 'previous', 'play', 'next', 'last', 'loop', 'speed'], 
         'delayMin': 0", 
         ani.height = 600, 
         ani.width = 600)

Conclusion

We’ve built a powerful visual tool enabling optimization analysts to perform schedule updates and visually inspect the expected rider flow and and potential stations and trains with over-capacity problems. Additionally, we have verified the capacity problems reported by PATH riders and identified the trains lines and stations associated with capacity problems. The overcapacity problems occur on two lines, Newark -> WTC and Journal Sq. -> 33rd St and occur primarily at 4 stations: Journal Sq., Grove, Exchange, and Newport. While there are many ways to address these capacity issues we realize train optimization is a field unto itself that we could not hope to master in the time provided. Instead, we enable experts by providing a powerful new tool to visualize the results of their ongoing optimization efforts.