Following a hackathon on the PATH train data, we decided to help to create a model of the passenger flux in the PATH train system. We had access to three files, one with the number of entries and exits in each station in 15 minutes block, one with the all the schedules of the trains and one that has the name of the station, the station ID, and the latitude and longitude of each station. I loaded the three csv files from Github and realized that the stop_id were different in each file; I made a short script using the same ID in the different data files and added the name of the station in the turnstile data to help readability.
schedule <- read.csv('path_schedule_clean.csv')
turnstile <- read.csv('turnstile_cleaned_2.csv')
link <- read.csv('stop_id_to_turnstile_name_2.csv')
names(turnstile) <- tolower(names(turnstile))
names(link) <- tolower(names(link))
for(i in unique(schedule$stop_name)){
stop_id <- link$stop_id[grep(i, link$turnstile)[1]]
schedule$stop_id[schedule$stop_name == i] <- stop_id
}
names(link)[1] <- 'station'
turnstile <- left_join(turnstile, link[,1:2])
write.csv(schedule, file = 'schedule.csv')
write.csv(turnstile, file = 'turnstile.csv')
write.csv(link, file = 'link.csv')
Using Lubridate I extracted the time of the entries and exits and converted it into minutes beginning at 0 at midnight (so from 0 to 1440 for one day). I then added a Boolean value to flag whether or not each day was a week-day or a week-end day, kept only the week-day values and averaged the number of entries and exits across each week-day.
library(dplyr)
library(lubridate)
turnstile$date_time <- ymd_hms(turnstile$date_time)
turnstile$day <- format((turnstile$date_time), '%a')
turnstile$hour <- hour(turnstile$date_time)
turnstile$min <- minute(turnstile$date_time)
turnstile$time <- turnstile$hour*60+turnstile$min
turnstile$weekday <- ifelse(!(turnstile$day %in% c('Sat', 'Sun')), TRUE, FALSE)
flux <- turnstile %>% group_by(stop_id, weekday, hour, min, time) %>% dplyr::summarise(count = n(), mentry = mean(entry), mexit = mean(exit), flux = mentry - mexit)
turnweek <- flux %>% filter(weekday)
write.csv(turnweek, file = 'turnweek.csv')
In the same manner, I extracted the time from the train schedule, converted it in minutes from midnight, and kept only the week-day trains.
schedule$hour <- hour(hms(schedule$departure_time))
schedule$min <- minute(hms(schedule$departure_time))
schedule$time <- schedule$hour*60+schedule$min
week <- schedule %>% filter(service_name == 'Yearly Service (Mon-Fri)')
write.csv(week, file = 'week.csv')
The PATH system is composed of 5 lines:
To get the passenger flux, I first put some assumption:
For example, this animation shows where a passenger entering the Newport station may be going and where a passenger exiting the Journal Square station may be coming from.
To do so I looked at every 15-minute block. For each of these blocks, I looked at each station and decided where people entering each individual station at this given time could potentially exit.
The main part is just a for loop, subsetting the turnstile data for a given time and calling the main function exit_per_entry_time.f. I used the library doParallel for each to be able to parallelize the computation.
library(doParallel)
library(foreach)
registerDoParallel(cores=4)
possible_travel <- foreach(i = unique(turnweek$time)) %dopar% {
time <- turnweek %>% filter(time == i)
exit_per_entry_time.f(time, week, station,i)
}
possible_travel <- do.call(rbind, possible_travel)
save(possible_travel, file = 'possible_travel')
The main function looks at each station for a given time. For a station, I looked at all the trains that come into this station between the scheduled time and 30 min after (if there were none, I looked at the first train coming in). If several trains going in the same direction and following the same line come during this interval, I keep only the first one. If the stop is in Manhattan, except for Christopher st, I assumed people are going to New Jersey so removed trains going towards 33rd st. When I had all the trip_id of the possible trains, I called get_possible_travel.f to get the possible exits and their time (see below).
I then rounded the possible exit times (deduced from the train schedule) in 15-minute blocks to make them correspond to the turnstile data time. Finally, I looked at how many people leave each possible exit in a compatible time frame using the find_station_data.f function.
exit_per_entry_time.f <- function(time, week, station, en){
# this function takes a 'time' df aka a df from the turnstile data filtered by hour and for each station get the number of ppl that enter and look where they could possibly exit the system
output <- list()
for(i in 1:dim(time)[1]){
t <- time[i,]
posstrip <- week %>% filter(stop_id == t$stop_id & between(time, t$time, (t$time + 30))) %>% select(trip_id, stop_sequence, route_id, direction_id)
if(dim(posstrip)[1] == 0){
posstrip <- week %>% filter(stop_id == t$stop_id & time > t$time) %>% filter(time == min(time)) %>% select(trip_id, stop_sequence, route_id, direction_id)
}
posstrip$route_dir <- paste(posstrip$route_id,posstrip$direction_id, sep = '_')
if(length(unique(posstrip$route_dir)) > 1){
posstrip <- posstrip[!duplicated(posstrip$route_dir),]
}
if(station$inny[which(station$stop_id == t$stop_id)] & t$stop_id != 26726 & nrow(posstrip) > 1){
#if the stop is in manhattan except for christopher st can only go to nj
posstrip <- posstrip %>% filter(direction_id == 0)
}
out <- get_possible_travel.f(posstrip, week)
# I'm rounding. it is definitely not perfect but it make it a lot easier to have only one time value
if(nrow(out) != 0){
out$exittime <- round(out$time/15)*15
out <- out[!(duplicated(out$stop_id) & duplicated(out$exittime)),]
out$exittime[which(out$exittime >= 1440)] <- out$exittime[which(out$exittime >= 1440)] - 1440
out$n_exit = find_station_data.f(out)
out$idx <- paste('t', en, i, sep = "")
out <- data.frame(entry_stop = t$stop_id, entrytime = t$time, n_entry = t$mentry, out)
output[[i]] <- out
}
}
output <- do.call(rbind, output)
return(output)
}
The get_possible_travel.f function goes to all possible trains and looks for the possible exits along the train paths and directions. In certain cases, it is possible that riders changed trains; in these cases the add_stop.f function will return the additional stops (as well as direction, line and time).
get_possible_travel.f <- function(posstrip, week){
#this function get where and when the possible exit is for each entry point and time
output = list()
for(i in 1:dim(posstrip)[1]){
out <- week %>% filter(trip_id == posstrip[i,1] & stop_sequence > posstrip[i,2]) %>% select(stop_id, time, route_id, direction_id, trip_id)
if(nrow(out) != 0){
toadd <- add_stop.f(out)
}else{toadd = NULL}
output[[i]] <- rbind(out, toadd)
}
output = do.call(rbind, output)
return(output)
}
add_stop.f <- function(out){
## add station if ppl change train put limit on the possibilities like if come from manhattan don't change in a train to go back from manhattan. also look only at change at JS and not at grove
route <- out$route_id[1]
direction <- out$direction_id[1]
if(route == 1024 | route == 861){ #33rd st to JS via hob
if(direction == 0){ ## toward nj
js <- out %>% filter(stop_id == 26731)
toadd <- week %>% filter(stop_id == js$stop_id, direction_id == 0, route_id == 862, between(time, js$time,js$time+15)) %>% select(stop_id, time, route_id, direction_id, trip_id)
}else if(route == 861 & direction == 1 & 26732 %in% out$stop_id){
pa <- out %>% filter(stop_id == 26732)
toadd <- week %>% filter(stop_id == js$stop_id, direction_id == 0, route_id == 860, between(time, pa$time,pa$time+15)) %>% select(stop_id, time, route_id, direction_id, trip_id)
}else{return(NULL)}
}else if(route == 862){
if(direction == 1){
if(26731 %in% out$stop_id){
js <- out %>% filter(stop_id == 26731)
t1 <- week %>% filter(stop_id == js$stop_id, direction_id == 1, route_id == 861, between(time, js$time,js$time+15)) %>% select(trip_id)
if(nrow(t1) > 0){
toadd <- week %>% filter(trip_id == t1$trip_id & stop_id %in% c(26732, 26726,2625,2622,2623,2624)) %>% select(stop_id, time, route_id, direction_id, trip_id)
}else{toadd <- NULL}
t2 <- week %>% filter(stop_id == js$stop_id, direction_id == 1, route_id == 1024, between(time, js$time,js$time+15)) %>% select(trip_id)
if(nrow(t2) > 0){
toadd2 <- week %>% filter(trip_id == t2$trip_id & stop_id %in% c(26730, 26732)) %>% select(stop_id, time, route_id, direction_id, trip_id)}else{toadd2 <- NULL}
toadd <- rbind(toadd, toadd2)
}else{return(NULL)}
}else{return(NULL)}
}else if(route == 860){
if(direction == 1){#green line toward nyc can change to yellow and then even to red toward nj
if(26732 %in% out$stop_id){
pa <- out %>% filter(stop_id == 26732)
t1 <- week %>% filter(stop_id == pa$stop_id, direction_id == 0, route_id == 861, between(time, pa$time,pa$time+15)) %>% select(trip_id)
if(nrow(t1) > 0){
toadd <- week %>% filter(trip_id == t1$trip_id & stop_id %in% c(26728, 26731)) %>% select(stop_id, time, route_id, direction_id, trip_id)
}else{toadd <- NULL}
t2 <- week %>% filter(stop_id == pa$stop_id, direction_id == 0, route_id == 1024, between(time, pa$time,pa$time+15)) %>% select(trip_id)
if(nrow(t2) > 0){
toadd2 <- week %>% filter(trip_id == t2$trip_id & stop_id %in% c(26728, 26731)) %>% select(stop_id, time, route_id, direction_id, trip_id)}else{toadd2 <- NULL}
toadd <- rbind(toadd, toadd2)
if(!is.null(toadd) & nrow(toadd) > 0){
js <- toadd %>% filter(stop_id == 26731)
if(nrow(js) < 1){js <- toadd %>% filter(stop_id == 26728)}
if(nrow(js) > 1){js <- js[1,]}
t3 <- week %>% filter(stop_id == js$stop_id, direction_id == 0, route_id == 862, between(time, js$time,js$time+15)) %>% select(trip_id)
if(nrow(t3) > 0){
toadd3 <- week %>% filter(trip_id == t3$trip_id & stop_id %in% c(26729, 26733)) %>% select(stop_id, time, route_id, direction_id, trip_id)}else{toadd3 <- NULL}
toadd <- rbind(toadd, toadd3)
}else{return(NULL)}
}else{return(NULL)}
}else{return(NULL)}
}else{return(NULL)}
return(toadd)
}
This function removes the possible stops in Manhattan if the person took a train in Manhattan (except Christopher st, and except if it is the only possibility).
rm_manhattan.f <- function(out){
tm <- out %>% filter(direction_id == 1) #if the only possible exits are towrd manhattan
if(nrow(tm) == nrow(out)){return(out)}else{
tm <- tm[-which(tm$stop_id %in% c(26722, 26723, 26724)),]
return(tm)
}
}
find_station_data.f <- function(x){
#this function takes an 'out' df and has for goal to find the number of people exiting at each of the possible exit
output = rep(0, dim(x)[1])
for(i in 1:dim(x)[1]){
t <- x[i,]
if(t$exittime >= 1440){t$exittime = 0}
output[i] <- (turnweek %>% filter(stop_id == t$stop_id & time == t$exittime))$mexit
}
return(output)
}
For example, the people who enter at Newport Station can take the 33rd st -> Journal Square line in both directions, and then could change at Journal Square for the WTC -> Newark line, or Hoboken -> WTC line in both directions as illustrated with the example below. At this state of the analysis, a passenger has the same probability to get out at any station on these lines.
First we grouped the possible_travel by exit stop and exit time, and rounded the number of entries and exits (since it’s averaged). I also created an exit index and a unique index for each line to help.
by_exit <- possible_travel %>% group_by(exittime, stop_id)
by_exit$n_entry <- round(by_exit$n_entry)
by_exit$n_exit <- round(by_exit$n_exit)
tochange <- as.data.frame(by_exit)
tochange$exitidx <- paste(tochange$exittime, tochange$stop_id,sep = '_')
tochange <- arrange(tochange, exittime, stop_id)
tochange$uidx <- 1:dim(tochange)[1]
The get_travel.f function gets the number of people coming from each possible entry station (as given by the get_n_remove.f function) and stores it in the n_people variable.
get_travel.f <- function(x){
rem <- get_n_remove.f(x)
t <- x %>% select(entry_stop, entrytime, stop_id, exittime, route_id, direction_id, trip_id,idx)
if(length(rem) == 1 & rem[1] == 0){t$n_people <- 0}else{
rem[x$n_entry < rem] <- x$n_entry[x$n_entry < rem]
t$n_people <- rem
}
return(t)
}
The ge_n_remove.f function takes the total number of exits at a station and distributes it among all the possible entries, in proportion to the number of people entering in each of these entries. I called the get_rounded.f function that checked that the number of people coming from each station is the same as the total number of exits.
get_n_remove.f <- function(x){
n <- x$n_exit[1]
p <- x$n_entry/sum(x$n_entry)
real <- n*p
if(is.na(real[1])){rmv <- 0}else{rmv <- get_rounded.f(real,n)}
return(rmv)
}
get_rounded.f <- function(v, n){
rounded <- round(v)
dif <- rounded - v
ndif <- n - sum(rounded)
if(ndif > 0){
tochoose <- which(dif < 0)
replace <- vector()
for(i in 1:ndif){
replace <- c(replace,tochoose[which.min(dif[tochoose])])
tochoose <- tochoose[-which.min(dif[tochoose])]
}
rounded[replace] <- rounded[replace] + 1
}else if(ndif < 0){
tochoose <- which(dif > 0)
replace <- vector()
for(i in 1:abs(ndif)){
replace <- c(replace,tochoose[which.max(dif[tochoose])])
tochoose <- tochoose[-which.max(dif[tochoose])]
}
rounded[replace] <- rounded[replace] - 1
}
return(rounded)
}
The remove_from_entry.f function takes the data that has not been used in the present loop call and removes the people that have been placed in a train by using the data frame named torm and looking for the entry with the same idx.
remove_from_entry.f <- function(df, x){
x <- x %>% filter(n_people > 0)
for(i in 1:dim(x)[1]){
tmp <- df %>% filter(idx == as.character(x$idx[i]))
tmp$n_entry <- tmp$n_entry - as.numeric(x$n_people[i])
tmp2 <- df %>% filter(idx != as.character(x$idx[i]))
df <- rbind(tmp, tmp2)
}
return(df)
}
The main loop looks at each unique combination of stop and time, and calls the get_travel function to gather where people exiting in this station are coming from. The number of people traveling is then removed from the number of entries and exits.
Sometimes, we could not find as many people entering as exiting; in those cases, we looked at earlier trains to find when these people entered (that would happen if they could not enter the train they would like to, for example). In these cases, it means they had to take a later train than the first one that arrived corresponding to the time of their entry; these data are stored in the later train list. It is important to remember that each entry was repeated several times in the data frame (since for each entry we gathered all the possible exits) so now it is important to remove all the people we found a train for from the rest of the data; this is what the function remove_from_entry.f is for.
travel <- list()
latertrain <- list()
lim <- length(unique(tochange$exitidx))
for(i in 1:lim){
idx <- unique(tochange$exitidx)[i]
tmp <- tochange[which(tochange$exitidx == idx),]
#if(length(unique(tmp$route_id)) > 1){break}
trav <- get_travel.f(tmp)
torm <- trav %>% select(idx, n_people)
if(any(torm$n_people > 0)){
tmp$n_exit <- tmp$n_exit - sum(torm$n_people)
tmp$n_entry <- tmp$n_entry - torm$n_people
if(any(tmp$n_exit > 0)){
et <- min(tmp$entrytime)
if(et < 30){et = 1440 - et}
prev <- 15
tmp2 <- tochange %>% filter(entry_stop %in% tmp$entry_stop & between(entrytime, et-prev, et) & route_id %in% tmp$route_id & direction_id %in% tmp$direction_id & n_exit == 0 & !(idx %in% tmp$idx)) %>% distinct(idx)
while(sum(tmp2$n_entry) < tmp$n_exit[1] & prev <= 90){
prev <- prev + 15
tmp2 <- tochange %>% filter(entry_stop %in% tmp$entry_stop & between(entrytime, et-prev, et) & route_id %in% tmp$route_id & direction_id %in% tmp$direction_id & n_exit == 0 & !(idx %in% tmp$idx)) %>% distinct(idx)
}
if(nrow(tmp2) >0){
tmp2$n_exit = tmp$n_exit[1]
trav2 <- get_travel.f(tmp2)
torm2 <- trav2 %>% select(idx, n_people)
if(any(torm2$n_people > 0)){
latertrain[[i]] <- data.frame(torm2, uidx = tmp2$uidx, wait = et - tmp2$entrytime)
tmp2$n_exit <- 0
tmp2$n_entry <- tmp2$n_entry - torm2$n_people
tmp$n_exit = tmp$n_exit - sum(torm2$n_people)
tmp <- rbind(tmp, tmp2)
torm <- rbind(torm, torm2)
}
}
}
notuse <- tochange[which(!(tochange$uidx %in% tmp$uidx)),]
notuse <- remove_from_entry.f(notuse, torm) #as there is repetition of line need to remove in each repetition
tochange <- rbind(tmp, notuse)
if(any(tochange$n_entry < 0 )){break}
}
travel[[i]] <- trav
tochange <- arrange(tochange, exittime, stop_id)
}
travel <- do.call(rbind, travel)
changed1 <- tochange
So for example, if 100 passengers are leaving the Journal Square station, we may know how many of them are coming from each station by looking at how many people are in these stations when the train comes. (This is a theoretical example for illustration.)
With that done, we put most people in trains, however, we still have some people left on the platform; there is more analysis left to do. However, it is interesting to see that it is mostly during rush hour that we have people remaining on the platform.
groupedtravel <- travel %>% group_by(trip_id, entry_stop, stop_id) %>% summarise(count = sum(n_people))
train <- list()
tid <- unique(week$trip_id)[44:1030]
for(i in tid){
#options(warn = w); cat("\n warn =", i, "\n")
sched <- schedule %>% filter(trip_id == as.character(i))
tr <- groupedtravel %>% filter(trip_id == as.character(i))
enter <- tr %>% group_by(stop_id = entry_stop) %>% dplyr::summarise(entry = sum(count))
exit <- tr %>% group_by(stop_id) %>% dplyr::summarise(exit = sum(count))
sched <- left_join(sched,enter)
sched <- left_join(sched,exit)
sched$transfer <- 0
if(any(!(enter$stop_id %in% sched$stop_id))){ # it means there is a transfer, people come from another line
sched <- get_other_entry.f(sched, enter)
}
sched <- sched %>% select(stop_id, departure_time, trip_id, hour, min, time, entry, exit, transfer)
sched$entry[is.na(sched$entry)] = 0
sched$exit[is.na(sched$exit)] = 0
sched$intrain <- cumsum(sched$entry) - cumsum(sched$exit)
train[[i]] <- sched
}
train <- do.call(rbind, train)
id.df <- data.frame(trip_id = unique(train$trip_id), id = seq(1, length(unique(train$trip_id)), 1))
train <- left_join(train, id.df)
write.csv(train, file = 'train.csv')
For example, this train leaves 33rd st at 16:37 and goes to JS. 262 persons entered it at 33rd st; 413 entered it at Newport; and the maximum number of people in the train was 545 after the Newport stop.
## X stop_id departure_time trip_id hour min time entry exit
## 1 1 26724 16:37:00 93057A744B2171 16 37 997 262 0
## 2 2 26723 16:39:00 93057A744B2171 16 39 999 58 18
## 3 3 26722 16:40:00 93057A744B2171 16 40 1000 61 28
## 4 4 26725 16:41:00 93057A744B2171 16 41 1001 53 46
## 5 5 26726 16:43:00 93057A744B2171 16 43 1003 13 51
## 6 6 26732 16:51:00 93057A744B2171 16 51 1011 413 172
## 7 7 26728 16:55:00 93057A744B2171 16 55 1015 131 149
## 8 8 26731 16:59:00 93057A744B2171 16 59 1019 0 527
## transfer intrain id
## 1 0 262 1
## 2 0 302 1
## 3 0 335 1
## 4 0 342 1
## 5 0 304 1
## 6 0 545 1
## 7 0 527 1
## 8 0 0 1
To get the flux of people in each station and train, I looked at each station and for a given 15-minute block all the people entering and exiting the station, as well as every person entering or exiting trains that are stopping in this station during this time.
turnweek$mentry <- round(turnweek$mentry)
turnweek$mexit <- round(turnweek$mexit)
flux <- turnweek %>% group_by(stop_id)
train$idx <- 1:dim(train)[1]
station_flux <- list()
for(i in unique(flux$stop_id)){
stat <- flux %>% filter(stop_id == i)
tr <- train %>% filter(stop_id == i)
onestat <- list()
k <- 1
left <- 0
for(j in stat$time){
if(j == 0){b <- 1440 - 15}else{b <- j-15}
s <- stat %>% filter(time == j)
before <- stat %>% filter(time == b)
s$instation <- s$mentry + left
t <- tr %>% filter(between(time, j, j+15))
if(nrow(t) > 0){
if(any(t$transfer > 0)){s$instation <- s$instation + sum(t$transfer)}
left <- s$instation - sum(t$entry)
if(left < 0){left <- 0}#not sure why sometimes I have neg numbers there..... I actually think it was because i forgot the else... now should be ok.
onestat[[k]] <- data.frame(stop_id = s$stop_id, time = s$time, instop = s$instation, takingtrain = t$entry, left, traintime = t$time, leavingtrain = t$exit, leavingstop = s$mexit, idx = t$idx)
k <- k + 1
}else{left <- s$instation}
if(left < before$mexit){left <- 0}else{left <- left - before$mexit}
}
station_flux[[i]] <- do.call(rbind, onestat)
}
station_flux <- do.call(rbind, station_flux)
station_flux <- station_flux[which(!duplicated(station_flux$idx)),]
By grouping the station flux and the train flux we have a data frame with all the people flux in the PATH network.
train$departure_time <- as.character(train$departure_time)
pf <- left_join(train, station_flux, by='idx')
pf <- pf %>% select(stop_id = stop_id.x, time = time.y, instop, trip_id, id, traintime, hour, min, takingtrain, intrain, left, leavingtrain, leavingstop)
write.csv(pf, file = 'train_station.csv')
For example, for the same train, we can now see how many people where in the station before the train arrived, how many are leaving the station, and how many people are left in the station. All these numbers have to be taken carefully as we only have 15 minute blocks and several trains are likely to come during this time.
head(train)
## X stop_id time instop trip_id id traintime hour min takingtrain
## 1 1 26724 990 329 93057A744B2171 1 997 16 37 262
## 2 2 26723 990 97 93057A744B2171 1 999 16 39 58
## 3 3 26722 990 106 93057A744B2171 1 1000 16 40 61
## 4 4 26725 990 97 93057A744B2171 1 1001 16 41 53
## 5 5 26726 990 108 93057A744B2171 1 1003 16 43 13
## 6 6 26732 1005 685 93057A744B2171 1 1011 16 51 413
## intrain left leavingtrain leavingstop
## 1 262 0 0 92
## 2 302 39 18 34
## 3 335 45 28 59
## 4 342 44 46 92
## 5 304 29 51 55
## 6 545 0 172 212
To visualize the passenger flux in the station and in the train I built an animation for the JS -> 33rd st. line in both direction as well as the WTC -> Newark line in both direction: