CSCI E-107 Final Project

https://github.com/TaleOfTwoTransportationSystems/DataProject

The Project

The MBTA is the greater Boston area’s traditional public transit system, with over 1 million riders on a typical weekday. It comprises subway trains, longer-reach commuter trains, buses, and ferries. Since 2009, the MBTA has made available a large amount of data regarding trips on the system, including alerts. The real-time alerts are scraped and then tweeted by CodeForBoston.

While pursuing another project (also related to public transit in Boston), we ran into roadblocks with data availability, but had the good fortune to find an exciting new API from MBTA Developers that had only recently been made accessible to the public. It is so recent, in fact, that we’re not sure it has been officially announced outside the MBTA Developer forum post where they told us about it (the API specification they shared is still stamped with a big “DRAFT” on the cover, which was kind of cool).

Using this API, as well as the CodeForBoston tweets, we were able to conduct some interesting data sourcing, exploration, and visualization. The Twitter-sourced MBTA alerts give us both temporal and geographic insights into where problems may be occurring, and the MBTA-sourced data on train movements allow us to see both specific incidents in the T system and broad patterns in service quality throughout it. We hope that this project will be a good demonstration of what this new MBTA source contains, the sorts of insights it may allow, and the research it might enable, with the connection to Twitter alerts being a first example of how to enrich it even further.

Data

The MBTA’s GTFS (https://developers.google.com/transit/gtfs/) archive was a little intimidating because the data is scattered across several CSVs, but it is still a good source of schedule information. Alerts are not part of the GTFS archive, but are sent out on Twitter (originally by a third-party bot). Fetching from this source only allows us to get the last 3200 tweets, but the alerts are infrequent enough for that to be a reasonable window; we were able to go back to 2015. The new MBTA performance API allows for pulling down historical data, including actual arrivals, departures, and headways, back to mid-2015. Unfortunately, ridership information (such as turnstile metrics or fare collections) is not yet available through the APIs; previous ridership studies were done in partnership with the MBTA, which provided the ridership data directly.

MBTA Archive (2009-)

MBTA Alerts Twitter Feed (last 3200 tweets)

https://twitter.com/mbta_alerts
https://twitter.com/mbta?lang=en
https://cran.r-project.org/web/packages/twitteR/index.html
http://bigcomputing.blogspot.com/2016/02/the-twitter-r-package-by-jeff-gentry-is.html
May ask @CodeForBoston to download their entire feed.

MBTA Performance Data (July 2015+)

https://goo.gl/M6G4MZ

The Results

We used the MBTA’s new API to fetch data for both travel times and headways. Travel times consist of one record (row) for each time a train went from one stop to another. Headways track the time between successive departures at a given station; in structure and access, this data is very similar to the travel times data, with each record representing one train leaving a station. When trains are running late, headways exceed their benchmark targets. The root causes of slow service can probably be better picked apart from travel times and dwell times, but the eventual impact on riders is most cleanly seen in headways.

Here’s what a raw block of travel times data from the MBTA’s new API looks like:

route_id	dep_dt	arr_dt	travel_time_sec	benchmark_travel_time_sec	threshold_flag_1	threshold_flag_2	threshold_flag_3
Red	1453717691	1453717793	102	120	NA	NA	NA
Red	1453718308	1453718390	82	120	NA	NA	NA
Red	1453718587	1453718696	109	120	NA	NA	NA
Red	1453720076	1453720223	147	120	NA	NA	NA
Red	1453720838	1453720998	160	120	NA	NA	NA
Red	1453721192	1453721491	299	120	NA	NA	NA

And here are some headways:

route_id	prev_route_id	current_dep_dt	previous_dep_dt	headway_time_sec	benchmark_headway_time_sec	threshold_flag_1	threshold_flag_2	threshold_flag_3
Red	Red	1453718308	1453717691	617	405	threshold_id_01	threshold_id_02	NA
Red	Red	1453718587	1453718308	279	405	NA	NA	NA
Red	Red	1453720076	1453718587	1489	420	threshold_id_01	threshold_id_02	threshold_id_03
Red	Red	1453720838	1453720076	762	420	threshold_id_01	threshold_id_02	NA
Red	Red	1453721192	1453720838	354	420	NA	NA	NA
Red	Red	1453726531	1453725039	1492	240	threshold_id_01	threshold_id_02	threshold_id_03
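The headway_time_sec column is simply the gap between consecutive departures from the same stop, which can be checked against the sample rows above. Here is a minimal sketch in base R using the departure timestamps and benchmarks from the first five rows of that table (the variable names are ours for illustration, not part of the API):

```r
# Consecutive departure timestamps (epoch seconds) taken from the sample rows above.
departures <- c(1453717691, 1453718308, 1453718587,
                1453720076, 1453720838, 1453721192)

# Pair every departure with the one before it.
headways <- data.frame(
  current_dep_dt  = departures[-1],
  previous_dep_dt = departures[-length(departures)],
  benchmark_headway_time_sec = c(405, 405, 420, 420, 420)
)

# A headway is the time elapsed since the previous departure.
headways$headway_time_sec <- headways$current_dep_dt - headways$previous_dep_dt

# Flag headways that ran longer than the MBTA's benchmark.
headways$over_benchmark <-
  headways$headway_time_sec > headways$benchmark_headway_time_sec

print(headways$headway_time_sec)  # 617 279 1489 762 354
```

The recomputed values match the headway_time_sec column reported by the API for those rows.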

With some cleaning, we turn this raw data into something we can more easily work with. Here are some examples of what the tables end up looking like and the sorts of columns we have. Most are from the source, a few, like service_dt, we have inferred to make other processing easier.

dep_dt	arr_dt	travel_time_sec	benchmark_travel_time_sec	from_stop	to_stop	time_delta	lateness	is_weekend	dep_time	stop_name	parent_station_name	heading	stop_seq	service_d
2016-01-25 05:20:13	2016-01-25 05:21:23	70	120	70063	70065	-50	0	FALSE	5.336944 hours	Davis - Inbound	Davis	Southbound	2	2016-01-25
2016-01-25 05:22:26	2016-01-25 05:23:59	93	180	70065	70067	-87	0	FALSE	5.373889 hours	Porter - Inbound	Porter	Southbound	3	2016-01-25
2016-01-25 05:24:54	2016-01-25 05:27:21	147	180	70067	70069	-33	0	FALSE	5.415000 hours	Harvard - Inbound	Harvard	Southbound	4	2016-01-25
2016-01-25 05:28:11	2016-01-25 05:29:53	102	120	70069	70071	-18	0	FALSE	5.469722 hours	Central - Inbound	Central	Southbound	5	2016-01-25
2016-01-25 05:29:48	2016-01-25 05:30:59	71	120	70063	70065	-49	0	FALSE	5.496667 hours	Davis - Inbound	Davis	Southbound	2	2016-01-25
2016-01-25 05:30:43	2016-01-25 05:35:38	295	120	70071	70073	175	175	FALSE	5.511944 hours	Kendall/MIT - Inbound	Kendall/MIT	Southbound	6	2016-01-25

current_dep_dt	previous_dep_dt	headway_time_sec	benchmark_headway_time_sec	from_stop	to_stop	time_delta	lateness	is_weekend	dep_time	stop_name	parent_station_name	heading	stop_seq	service_d
2016-01-25 05:29:48	2016-01-25 05:20:13	575	480	70063	70065	95	95	FALSE	5.496667 hours	Davis - Inbound	Davis	Southbound	2	2016-01-25
2016-01-25 05:31:47	2016-01-25 05:22:26	561	405	70065	70067	156	156	FALSE	5.529722 hours	Porter - Inbound	Porter	Southbound	3	2016-01-25
2016-01-25 05:34:20	2016-01-25 05:29:48	272	405	70063	70065	-133	0	FALSE	5.572222 hours	Davis - Inbound	Davis	Southbound	2	2016-01-25
2016-01-25 05:34:48	2016-01-25 05:24:54	594	420	70067	70069	174	174	FALSE	5.580000 hours	Harvard - Inbound	Harvard	Southbound	4	2016-01-25
2016-01-25 05:36:26	2016-01-25 05:31:47	279	405	70065	70067	-126	0	FALSE	5.607222 hours	Porter - Inbound	Porter	Southbound	3	2016-01-25
2016-01-25 05:38:28	2016-01-25 05:28:11	617	405	70069	70071	212	212	FALSE	5.641111 hours	Central - Inbound	Central	Southbound	5	2016-01-25
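The derived columns in the tables above can be reproduced from the raw API fields with a few lines of base R. The sketch below is an approximation, not necessarily the exact cleaning code used for this report: the function name clean_travel_times and the America/New_York time zone are assumptions, dep_time is returned as a plain number of hours rather than a difftime, and the stop-name and service-date joins are left out.

```r
# Two raw travel-time records copied from the API sample shown earlier.
raw <- data.frame(
  dep_dt = c(1453717691, 1453721192),
  arr_dt = c(1453717793, 1453721491),
  travel_time_sec = c(102, 299),
  benchmark_travel_time_sec = c(120, 120)
)

# Hypothetical helper: converts epoch seconds and adds the derived columns.
clean_travel_times <- function(df, tz = "America/New_York") {
  # Epoch seconds -> local date-times (time zone is an assumption).
  df$dep_dt <- as.POSIXct(df$dep_dt, origin = "1970-01-01", tz = tz)
  df$arr_dt <- as.POSIXct(df$arr_dt, origin = "1970-01-01", tz = tz)

  # Signed difference from the benchmark; negative means faster than expected.
  df$time_delta <- df$travel_time_sec - df$benchmark_travel_time_sec

  # Lateness ignores early trips: anything at or under the benchmark counts as 0.
  df$lateness <- pmax(df$time_delta, 0)

  lt <- as.POSIXlt(df$dep_dt)
  df$is_weekend <- lt$wday %in% c(0, 6)                 # 0 = Sunday, 6 = Saturday
  df$dep_time <- lt$hour + lt$min / 60 + lt$sec / 3600  # hours since midnight
  df
}

cleaned <- clean_travel_times(raw)
print(cleaned)
```

The first record here corresponds to the Central to Kendall/MIT trip that departs at 05:28:11 in the cleaned table, with a time_delta of -18 and a lateness of 0.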

As you can see, the data is not terribly complicated, but there is lots of it; the Red Line alone has almost 450,000 travel-time records in the time window we looked at. Here is a simple scatter plot of train departures within our window, just from Porter Square heading inbound.

Here’s one using headways, so that we can color-code by how the trains were performing compared to their benchmark times when they departed the station. Positive values indicate a train was running late, and are encoded tending toward pink. Negative values indicate the train was actually running a bit ahead, and are coded in blue.

With this simple addition we can start to see patterns in the data. Headway lateness seems to cluster; once it happens in a given day it appears to take a while to sort itself out. Delays tend to happen at the same time of day for several days in a row; this may indicate construction or some other persistent (but not permanent) interruption.

What if we try piling on ALL of the stops for one direction? Here we look at the times between trains by time of day. We restrict the data to only northbound weekday trains, since the two directions have different patterns, as do weekends (when the train schedule is different).

That’s a lot of points! Let’s use travel times (the number of seconds a train takes between a pair of stops) to look for some broad trends, by comparing the benchmark time (provided by the MBTA) and the actual time per individual trip, averaged over each minute of the day from January to April. The daily rush hours are highlighted in red (7-9am) and green (5-7pm). This time we pick Park Street’s northbound track (heading to Charles/MGH), a major stop for commuters. The plot is interactive, so you can look in more detail at different points or mouse over for more information. Note the major spikes in the averages just before (and during) the rush hours, but also note that travel times out of Park Street are, on average, consistently about a minute longer than expected even outside those spikes.
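The per-minute averaging described above can be sketched in base R as follows. The trips data frame is a tiny invented example (not real MBTA records), and the rush-hour column names are our own; only dep_time, travel_time_sec, and benchmark_travel_time_sec mirror columns from the cleaned tables.

```r
# Toy trips for one stop pair; dep_time is hours since midnight.
trips <- data.frame(
  dep_time = c(7.50, 7.505, 7.51, 8.25, 8.26),
  travel_time_sec = c(150, 170, 160, 200, 220),
  benchmark_travel_time_sec = c(120, 120, 120, 120, 120)
)

# Bucket each trip into the minute of the day in which it departed.
trips$minute_of_day <- floor(trips$dep_time * 60)

# Average the actual and benchmark travel times within each minute.
by_minute <- aggregate(
  cbind(travel_time_sec, benchmark_travel_time_sec) ~ minute_of_day,
  data = trips, FUN = mean
)

# How far the average trip runs over (or under) its benchmark.
by_minute$excess_sec <-
  by_minute$travel_time_sec - by_minute$benchmark_travel_time_sec

# Mark the rush-hour windows highlighted in the plot (7-9am and 5-7pm).
by_minute$am_rush <- by_minute$minute_of_day >= 7 * 60 &
  by_minute$minute_of_day < 9 * 60
by_minute$pm_rush <- by_minute$minute_of_day >= 17 * 60 &
  by_minute$minute_of_day < 19 * 60

print(by_minute)
```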

Compare Park Street to the next pair out on the line, Charles/MGH to Kendall. There is a noteworthy cyclical pattern in the averages, which is actually at its lowest during those same rush hours.

Park Street seems like it’s generally pretty slow, while Charles/MGH appears more consistent. Let’s take ALL of the stations and see if there’s evidence of this type of difference elsewhere.

The boxplots make clear that though much of the data is very close to the benchmark, there are certainly instances where gaps are greater or less than their benchmark times. For reference, Park Street is #8 on the Southbound plot and #16 on the Northbound plot.

To get a better sense of the full range of the data, let’s try a density plot. Here we’re looking at time-deltas for headways by station; there is naturally significant overplotting, but this helps us spot very unusual stations.

We can see that the distributions tend to skew right; this isn’t a surprise, because there is a hard limit to how early a train can be, but far more room for it to be late! We also note that even though the “center mass” of the density plots is very close to 0 (no delay), there are a significant number of late and early trains as well. Weekend northbound trains seem to be the most consistent (density closest to 0), but even there some noteworthy delays can be seen in the plot.

We know that some amount of headway delay can be caused by longer-than-expected times to travel between stations. Let’s look at what northbound, weekday trains’ travel times look like; here we plot densities of travel times, also by station.

We certainly get less overplotting! This makes sense, as each connection covers a different geographic distance, but what’s more interesting is how variable those times appear to be even along fixed routes.

Let’s take a look at service quality. One sensible metric would be how long passengers wait on platforms for trains beyond how long they “should” be waiting under normal service. In our data this can be thought of as the difference between the actual observed times and the benchmark times provided by the MBTA. However, we only look at cases where this number is positive; the MBTA gets no extra credit for early trains! This isn’t because we’re mean-spirited; it’s because, to a typical rider, a faster-than-expected headway is indistinguishable from just happening to get to the platform at the right time, and because faster-than-expected headways are often the result of backups caused by slowness earlier in the day.

Here’s a simple density plot of ALL observed lateness (on-time or early headways removed) to give an idea of what is typical.

This shape should be familiar; it’s the right-half of the “time-delta” headway distributions above.

Let’s see if there are broad patterns by type of day or direction of travel.

Interesting; it looks like the MBTA’s expected benchmarks are pretty good, in that they appear to account well for variances by day of week and direction. If they didn’t, we would expect the distributions to show some systematic bias, but they’re actually pretty similar. The weekday southbound trains tend to have somewhat shorter lateness events, and for weekend northbound they tend to be longer. Note, however, that we can’t say that weekend northbound trains tend to be late more often, because we had already thrown out anything that wasn’t lateness.
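The caveat about "how long" versus "how often" can be made concrete with a small base R sketch. The numbers below are invented purely for illustration and the group labels are our own; the point is that the two questions are answered by different summaries of the same time_delta column.

```r
# Invented headway deltas (seconds) for two groups of trains.
headways <- data.frame(
  group = rep(c("Weekday Southbound", "Weekend Northbound"), each = 4),
  time_delta = c(30, -20, 45, 60, 300, -50, -10, -5)
)

late <- headways$time_delta > 0

# How OFTEN trains are late: needs every record, early ones included.
share_late <- tapply(late, headways$group, mean)

# How LONG lateness lasts once it happens: uses only the late records,
# which is all that the lateness density plots can show.
mean_lateness_given_late <- tapply(
  headways$time_delta[late], headways$group[late], mean
)

print(share_late)
print(mean_lateness_given_late)
```

In this toy data the weekend northbound group has the longer delays (300 seconds on average versus 45) even though it is late far less often (25% of departures versus 75%), which is exactly the distinction the filtered distributions cannot reveal.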

What if we wanted a sense of both where and when this lateness is most acutely felt? One approach would be to bucket the lateness events, add up the total amount of “overage” in each bucket, and divide by the total time in each bucket. This gives us a metric, “percent late”, which we can think of as “for the entire time we observed the station, for what % of that time was a train past its benchmark arrival time?” Keep in mind that this may not be a perfect ratio; if trains get really backed up, it’s possible that two trains could be sitting at a station simultaneously, allowing us to observe more lateness than the actual clock time during which one or more trains were late in a given bucket. However, it still makes a reasonable indicator of how far behind schedule the system is.
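As a rough illustration of the bucketing described above, here is a base R sketch that computes "percent late" per station and hour. The input rows are invented, the function name percent_late is hypothetical, and the way observed time is counted (one hour per distinct service day seen in the bucket) is an assumption on our part rather than a confirmed detail of the analysis.

```r
# Invented headway records: one row per departure, lateness in seconds.
headways <- data.frame(
  parent_station_name = c("Davis", "Davis", "Porter", "Porter", "Davis"),
  service_dt = as.Date(c("2016-01-25", "2016-01-25", "2016-01-25",
                         "2016-01-26", "2016-01-26")),
  interval_start_hour = c(5, 5, 5, 5, 5),
  lateness = c(95, 0, 156, 0, 40)
)

percent_late <- function(df, bucket_seconds = 3600) {
  keys <- c("parent_station_name", "interval_start_hour")

  # Total seconds of lateness in each station/hour bucket.
  late <- aggregate(lateness ~ parent_station_name + interval_start_hour,
                    data = df, FUN = sum)
  names(late)[3] <- "total_lateness"

  # Assumption: each distinct service day contributes one full bucket
  # of observed time to its station/hour.
  days <- unique(df[c(keys, "service_dt")])
  days$n_days <- 1
  days <- aggregate(n_days ~ parent_station_name + interval_start_hour,
                    data = days, FUN = sum)

  out <- merge(late, days, by = keys)
  out$seconds_observed <- out$n_days * bucket_seconds
  out$percent_late <- out$total_lateness / out$seconds_observed
  out
}

result <- percent_late(headways)
print(result)
```

Because lateness from overlapping trains is summed, the ratio can in principle exceed 1 for a badly backed-up bucket, which is the imperfection noted above.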

Given this rate, we can compute a heatmap of delay severity:

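library(dplyr)
library(ggplot2)

# total_lateness (assumed to be built in an earlier step) holds the summed
# lateness and the seconds observed per station, hour, heading, and day type.
weekday_southbound <- total_lateness %>%
  filter(heading == "Southbound", weekend == "Weekday") %>%
  group_by(parent_station_name, interval_start_hour) %>%
  # Share of observed time during which a train was past its benchmark.
  summarize(percent_late = sum(total_lateness) / sum(seconds_observed))

# One tile per station and hour; deeper red means a larger share of late time.
ggplot(weekday_southbound, aes(as.ordered(interval_start_hour), parent_station_name)) +
  geom_tile(aes(fill = percent_late), color = "white") +
  scale_fill_gradient(low = "white", high = "red") +
  xlab("Hour of Day") +
  ylab("Station")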

Alerts of T delays come in indirectly from Twitter handles for each of the specific lines. The MBTA puts out an RSS feed that enterprising developers at the codeforboston group started scraping and turning into tweets. We can use the twitteR R package to retrieve these alerts from codeforboston, as though we had been archiving the alerts ourselves.

text	created	favoriteCount	retweetCount	arr_dt	bounded	severity	alerts_at_station_code
Mattapan Trolley experiencing moderate delays due to a disabled train at Central Ave #mbta	2016-01-25 23:36:17	0	0	2016-01-25 23:36:17		2	70069_Central - Inbound, 70070_Central - Outbound
#RedLine experiencing minor northbound delays between North Quincy and JFK/UMass due to a signal problem. #mbta	2016-01-26 11:46:33	0	0	2016-01-26 11:46:33	northbound	0
Shuttle buses replacing #RedLine service North Quincy to JFK due to disabled work equipment. Customers may utilize the Commuter rai… #mbta	2016-01-28 10:08:44	1	2	2016-01-28 10:08:44		0
#RedLine experiencing moderate residual delays due to earlier disabled work equipment. #mbta	2016-01-28 11:39:40	0	0	2016-01-28 11:39:40		0
#RedLine experiencing moderate delays due to police action at Alewife. #mbta	2016-01-28 23:00:48	0	0	2016-01-28 23:00:48		2	70061_Alewife, 70076_Park Street - to Alewife, 70078_Downtown Crossing - to Alewife
#RedLine experiencing moderate delays due to a disabled train at Charles. #mbta	2016-01-30 00:46:50	0	0	2016-01-30 00:46:50		2	70073_Charles/MGH - Inbound, 70074_Charles/MGH - Outbound

As you can see, the tweets follow a regular format, so with some simple text mining we can extract additional information: A) Directionality: does the alert relate to routes running southbound, northbound, or both? B) Severity: is the alert indicating a minor, moderate, or major delay? C) Location: which specific T station is the alert related to?

Here’s a sample of what the data looks like once processed:

text	created	favoriteCount	retweetCount	arr_dt	bounded	severity	alerts_at_station_code
Mattapan Trolley experiencing moderate delays due to a disabled train at Central Ave #mbta	2016-01-25 23:36:17	0	0	2016-01-25 23:36:17		2	70069_Central - Inbound, 70070_Central - Outbound
#RedLine experiencing minor northbound delays between North Quincy and JFK/UMass due to a signal problem. #mbta	2016-01-26 11:46:33	0	0	2016-01-26 11:46:33	northbound	0
Shuttle buses replacing #RedLine service North Quincy to JFK due to disabled work equipment. Customers may utilize the Commuter rai… #mbta	2016-01-28 10:08:44	1	2	2016-01-28 10:08:44		0
#RedLine experiencing moderate residual delays due to earlier disabled work equipment. #mbta	2016-01-28 11:39:40	0	0	2016-01-28 11:39:40		0
#RedLine experiencing moderate delays due to police action at Alewife. #mbta	2016-01-28 23:00:48	0	0	2016-01-28 23:00:48		2	70061_Alewife, 70076_Park Street - to Alewife, 70078_Downtown Crossing - to Alewife
#RedLine experiencing moderate delays due to a disabled train at Charles. #mbta	2016-01-30 00:46:50	0	0	2016-01-30 00:46:50		2	70073_Charles/MGH - Inbound, 70074_Charles/MGH - Outbound
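The kind of text mining described above can be sketched with base R regular expressions. This is an illustration under stated assumptions, not the code that produced the table: the function names, the short station list, and the numeric severity coding (minor = 1, moderate = 2, major = 3) are all hypothetical, and the severity values in the sample above suggest the actual coding differed.

```r
# Return the first match of `pattern` in each element of `x`, or NA if none.
extract_first <- function(pattern, x) {
  m <- regexpr(pattern, x)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  out
}

# Hypothetical parser: pulls direction, severity, and station out of alert text.
parse_alerts <- function(text, station_names) {
  lower <- tolower(text)

  # Assumed numeric coding for the severity words.
  severity_codes <- c(minor = 1, moderate = 2, major = 3)
  severity_word <- extract_first("minor|moderate|major", lower)

  # Naive location matching: any station name that appears in the tweet.
  stations <- vapply(text, function(t) {
    hits <- station_names[vapply(station_names, grepl, logical(1),
                                 x = t, fixed = TRUE)]
    paste(hits, collapse = ", ")
  }, character(1), USE.NAMES = FALSE)

  data.frame(
    text = text,
    bounded = extract_first("(north|south)bound", lower),
    severity = unname(severity_codes[severity_word]),
    station = stations,
    stringsAsFactors = FALSE
  )
}

alerts <- c(
  "#RedLine experiencing minor northbound delays between North Quincy and JFK/UMass due to a signal problem. #mbta",
  "#RedLine experiencing moderate delays due to police action at Alewife. #mbta",
  "Mattapan Trolley experiencing moderate delays due to a disabled train at Central Ave #mbta"
)
parsed <- parse_alerts(alerts, station_names = c("Alewife", "Charles", "Central"))
print(parsed[c("bounded", "severity", "station")])
```

Note that plain substring matching tags the Mattapan Trolley alert at "Central Ave" with the station name "Central", so a more careful version would need to check the line as well as the station name.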

Since we can identify where the tweets came from, we can draw maps of the subway lines and then overplot the tweets. We’ve added markers to show how many tweets over our time period correspond to the various stations. You’ll need to zoom in to see this broken out for each individual station; you can also click on individual dots on each line to get the name of the station.
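The per-station tweet counts behind the map markers can be derived from the alerts_at_station_code column. A minimal base R sketch follows; the first four values mirror entries from the sample table (including an alert with no matched station), while the last one is invented so that one stop appears twice.

```r
# One entry per alert: a comma-separated list of "code_name" stop labels.
alerts_at_station_code <- c(
  "70069_Central - Inbound, 70070_Central - Outbound",
  "",
  "70061_Alewife, 70076_Park Street - to Alewife, 70078_Downtown Crossing - to Alewife",
  "70073_Charles/MGH - Inbound, 70074_Charles/MGH - Outbound",
  "70061_Alewife"  # invented extra alert for illustration
)

# Split each alert into its individual stops; empty entries drop out.
stops <- unlist(strsplit(alerts_at_station_code, ", ", fixed = TRUE))

# Keep only the numeric stop code in front of the underscore.
stop_code <- sub("_.*$", "", stops)

# Number of alerts mentioning each stop, most-mentioned first.
alert_counts <- sort(table(stop_code), decreasing = TRUE)
print(alert_counts)
```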

Twitter Alerts Overlap

Most alerts are clustered around Park Street, Downtown Crossing, and Back Bay. Remember from our plots above that general slowness also seems pretty bad around Park Street and Downtown Crossing (Back Bay isn’t on the Red Line).

Further Study

We feel that we’ve only just scratched the surface when it comes to what we could do with this rich data. We’ve demonstrated the sorts of measurements one can do, and how even relatively simple visualizations can give insights into the workings of the MBTA’s subway lines.

At least two avenues present themselves for further study. First, we expect that a thorough review of the permutations of variance from benchmarks (by station, time of day, etc.) could yield insights to help the T operators make both short-term and long-term decisions that would positively impact service quality. The MBTA is certainly already reviewing the efficiency and performance of its system with an eye toward improvements, as well as managing the system in real time to deliver the best service, but access to this level of detail (especially crowd-sourced!) about the functioning of each route and station allows a level of modeling that could enhance both of those endeavors.

This segues into the second area for further research, which would be to apply fancy time-series predictive modeling in real time to help both T operators and riders. One example, which we had hoped to pursue in this project (but were unable to), would be attempting to determine minutes or hours in advance when the system is likely to have a problem and to take corrective action, or at least issue alerts in time for passengers to seek alternate routes. One could imagine a situation where, for example, heavy traffic at one end of the Green Line reliably presages a slowdown on the Red Line some time in the future, as the delays percolate through the whole system. We of course have no idea if that’s actually true, but it’s the sort of behavior this level of detail would allow one to attempt to model.

In conclusion, the team greatly enjoyed working with this data and believes that, while we were able to make a good start, there is lots of interesting research left to be done.

MBTA Performance Data

Shawn Connor, Jeff Cunningham, Danielle Feng and Varuni Gang

Friday May 6, 2016
