General Transit Feed Specification (GTFS) is a data exchange format used by transportation agencies worldwide. The main benefit of this format is that it contains both the route and timetable information, and the associated geographical data. Also, as the format is text-based, it does not require any proprietary software to read and process making it easier for developers and data scientists to work with public transport data.
In this presentation, I will show how to read GTFS data into R and how to put it on an interactive map.
A lot of public authorities in all parts of the world are starting to explore the potential of modern data science for the improvement of multiple areas of their work.
Transportation is one of such areas. For a data scientist, transportation analytics is an extremely interesting domain, as it contains multi-faceted datasets (spatio-temporal, networks, streaming data etc.) and a lot of exciting problems, where solutions can bring tangible improvements to the lives of many people.
We will use the GTFS data from Berlin, Germany to make an interactive map of public transport stops.
A GTFS dataset is an archive of simple text databases that are linked by common keys and store various aspects of public transport schedules, stop positions, and fares.
In our case, the following tables are available:
list.files("GTFS_VBB_Dez2016 Aug2017_ohne_shape_files/")
## [1] "agency.txt" "calendar_dates.txt" "calendar.txt"
## [4] "logging" "routes.txt" "stop_times.txt"
## [7] "stops.txt" "transfers.txt" "trips.txt"
routes = read_csv("GTFS_VBB_Dez2016 Aug2017_ohne_shape_files/routes.txt",trim_ws = T)
stops = read_csv("GTFS_VBB_Dez2016 Aug2017_ohne_shape_files/stops.txt",trim_ws = T)
trips = read_csv("GTFS_VBB_Dez2016 Aug2017_ohne_shape_files/trips.txt",trim_ws = T)
transfers = read_csv("GTFS_VBB_Dez2016 Aug2017_ohne_shape_files/transfers.txt",trim_ws = T)
agency = read_csv("GTFS_VBB_Dez2016 Aug2017_ohne_shape_files/agency.txt",trim_ws = T)
stoptime = read_csv("GTFS_VBB_Dez2016 Aug2017_ohne_shape_files/stop_times.txt",trim_ws = T)
We see that we have in total 13020 records for public transport stops and 4578199 records of stops of different means of transport.
The stops
table looks like this:
library(knitr)
kable(head(stops))
stop_id | stop_code | stop_name | stop_desc | stop_lat | stop_lon | zone_id | stop_url | location_type | parent_station | platform_code |
---|---|---|---|---|---|---|---|---|---|---|
8000058 | NA | Büchen | NA | 53.47497 | 10.623298 | NA | NA | 0 | NA | NA |
8000147 | NA | Hamburg-Harburg | NA | 53.45591 | 9.991701 | NA | NA | 0 | NA | NA |
8000168 | NA | Uelzen | NA | 52.96978 | 10.553053 | NA | NA | 0 | NA | NA |
8000238 | NA | Lüneburg | NA | 53.24966 | 10.419890 | NA | NA | 0 | NA | NA |
8003774 | NA | Lübeck St Jürgen | NA | 53.84232 | 10.702942 | NA | NA | 0 | NA | NA |
8010022 | NA | Bad Schandau | NA | 50.91884 | 14.139322 | NA | NA | 0 | NA | NA |
One of the benefits of GTFS is that the data is in perfect shape, with lat/long coordinates, so we can get straight to mapping!
We will use leaflet
, an interactive mapping library to make simple a zoomable map of all the stops.
The usage of Leaflet in R is very straightforward and akin to the logic of base R plotting engine:
library("leaflet")
m = leaflet() %>%
# Add CartoDB background map
addProviderTiles("CartoDB.DarkMatter") %>%
# Add a marker for each stop
addCircleMarkers(lng= ~ stop_lon, lat= ~stop_lat, data = head(stops,100),
stroke = FALSE, fillOpacity = 0.5, radius =5 )
m # Show the map
Wow, the stops are all across Germany!
OK let us filter the stops to Berlin, then plot all the stops in Berlin.
In order to do this, we need to filter the stops to those in the trips which are made on the routes of the agencies that work in Berlin.
STOPS -> STOPTIME -> TRIP -> ROUTE -> AGENCY
library(stringr)
agency_filter = "[Bb]erlin"
# Get coordinates for each stop
berlin = agency %>%
select(agency_id,agency_name) %>%
filter(str_detect(agency_name, agency_filter)) %>%
inner_join(select(routes, route_short_name,agency_id,route_id)) %>%
inner_join(select(trips, route_id, trip_id)) %>%
inner_join(select(stoptime, trip_id, stop_id)) %>%
select(-trip_id) %>% unique() %>%
inner_join(select(stops, stop_id, stop_name, lat=stop_lat, lon=stop_lon)) %>%
unique()
# Get coordinates for each stop and counts of trips passing through each stop
berlin_cnt = agency %>%
select(agency_id,agency_name) %>%
filter(str_detect(agency_name, agency_filter)) %>%
inner_join(select(routes, route_short_name,agency_id,route_id)) %>%
inner_join(select(trips, route_id, trip_id)) %>%
inner_join(select(stoptime, trip_id, stop_id)) %>%
group_by(agency_id,stop_id) %>% summarise(cnt=n()) %>%
inner_join(select(stops, stop_id, stop_name, lat=stop_lat, lon=stop_lon)) %>%
unique()
Now we have a nice table with lat/lon coordinates for every stop in Berlin with the identification of the type of route and the stop name.
library(knitr)
kable(head(berlin))
agency_id | agency_name | route_short_name | route_id | stop_id | stop_name | lat | lon |
---|---|---|---|---|---|---|---|
BVB— | Berliner Verkehrsbetriebe | 12 | 269 | 9140002 | Pasedagplatz (Berlin) | 52.55977 | 13.46011 |
BVB— | Berliner Verkehrsbetriebe | 12 | 269 | 9140013 | Berliner Allee/Rennbahnstr. (Berlin) | 52.55755 | 13.46539 |
BVB— | Berliner Verkehrsbetriebe | 12 | 269 | 9140006 | Berliner Allee/Indira-Gandhi-Str. (Berlin) | 52.55189 | 13.46548 |
BVB— | Berliner Verkehrsbetriebe | 12 | 269 | 9140012 | Falkenberger Str./Berliner Allee (Berlin) | 52.55471 | 13.46774 |
BVB— | Berliner Verkehrsbetriebe | 27 | 270 | 9150008 | Rhinstr./Gärtnerstr. (Berlin) | 52.54639 | 13.51053 |
BVB— | Berliner Verkehrsbetriebe | 27 | 270 | 9150020 | Hauptstr./Rhinstr. (Berlin) | 52.54887 | 13.50476 |
Let’s plot again
# Define colors for different modes of transport
berlin_modes = data.frame(agency_id = c("BVB---", "BVF---", "BVT---" ,"BVU---", "DBS---", "SEV---"), mot = c("Bus", "Ferry", "Tram", "Subway", "Light Rail", "Replacement"))
berlin$agency_id = factor(berlin$agency_id)
berlin = left_join(berlin, berlin_modes)
factpal=colorFactor(palette = c("#91148d", "#145d91","#14912d","#8d9114", "#e3ea12","#ba0d0d"),
domain = berlin$mot, levels = levels(berlin$mot), ordered = FALSE, na.color = "#808080", alpha = FALSE)
# Plot again adding color options and a correspondent legend
m = leaflet() %>%
# Add CartoDB background map
addProviderTiles("CartoDB.DarkMatter") %>%
# Add a marker for each stop
addCircleMarkers(lng= ~ lon, lat= ~lat, data = berlin, stroke = FALSE,
fillOpacity = 0.5, radius =3, color = ~ factpal(mot)) %>%
addLegend(colors =c("#91148d", "#145d91","#14912d","#8d9114", "#e3ea12","#ba0d0d"),
labels = levels(berlin$mot),title = "Mode of Transport")
m # Show the map
Finally, make a map visualizing the main traffic hubs per mode of transport using the table with the counts of trips per stop. This helps to visualize the local structure in the city.
berlin_cnt$agency_id = factor(berlin_cnt$agency_id)
berlin_cnt = left_join(berlin_cnt, berlin_modes)
# Add 10 bins to each MoT group
berlin_cnt = berlin_cnt %>%
group_by(mot) %>%
mutate(bin = ntile(cnt,n = 8))
m = leaflet() %>%
# Add CartoDB background map
addProviderTiles("CartoDB.DarkMatter") %>%
# Add a marker for each stop
addCircleMarkers(lng= ~ lon, lat= ~lat, data = berlin_cnt, stroke = FALSE,
fillOpacity = 0.5, radius = ~(bin/2), color = ~ factpal(mot)) %>%
addLegend(colors =c("#91148d", "#145d91","#14912d","#8d9114", "#e3ea12","#ba0d0d"),
labels = levels(berlin_cnt$mot),title = "Mode of Transport")
m # Show the map