Intro

General Transit Feed Specification (GTFS) is a data exchange format used by transportation agencies worldwide. The main benefit of this format is that it contains both the route and timetable information, and the associated geographical data. Also, as the format is text-based, it does not require any proprietary software to read and process making it easier for developers and data scientists to work with public transport data.

Objective

In this presentation, I will show how to read GTFS data into R and how to put it on an interactive map.

Relevance

A lot of public authorities in all parts of the world are starting to explore the potential of modern data science for the improvement of multiple areas of their work.
Transportation is one of such areas. For a data scientist, transportation analytics is an extremely interesting domain, as it contains multi-faceted datasets (spatio-temporal, networks, streaming data etc.) and a lot of exciting problems, where solutions can bring tangible improvements to the lives of many people.

The Data

We will use the GTFS data from Berlin, Germany to make an interactive map of public transport stops.

Dataset structure

A GTFS dataset is an archive of simple text databases that are linked by common keys and store various aspects of public transport schedules, stop positions, and fares.

In our case, the following tables are available:

list.files("GTFS_VBB_Dez2016 Aug2017_ohne_shape_files/")
## [1] "agency.txt"         "calendar_dates.txt" "calendar.txt"      
## [4] "logging"            "routes.txt"         "stop_times.txt"    
## [7] "stops.txt"          "transfers.txt"      "trips.txt"

Reading data

routes = read_csv("GTFS_VBB_Dez2016 Aug2017_ohne_shape_files/routes.txt",trim_ws = T)
stops = read_csv("GTFS_VBB_Dez2016 Aug2017_ohne_shape_files/stops.txt",trim_ws = T)
trips = read_csv("GTFS_VBB_Dez2016 Aug2017_ohne_shape_files/trips.txt",trim_ws = T)
transfers = read_csv("GTFS_VBB_Dez2016 Aug2017_ohne_shape_files/transfers.txt",trim_ws = T)
agency = read_csv("GTFS_VBB_Dez2016 Aug2017_ohne_shape_files/agency.txt",trim_ws = T)
stoptime = read_csv("GTFS_VBB_Dez2016 Aug2017_ohne_shape_files/stop_times.txt",trim_ws = T)

We see that we have in total 13020 records for public transport stops and 4578199 records of stops of different means of transport.

The stops table looks like this:

library(knitr)
kable(head(stops))
stop_id stop_code stop_name stop_desc stop_lat stop_lon zone_id stop_url location_type parent_station platform_code
8000058 NA Büchen NA 53.47497 10.623298 NA NA 0 NA NA
8000147 NA Hamburg-Harburg NA 53.45591 9.991701 NA NA 0 NA NA
8000168 NA Uelzen NA 52.96978 10.553053 NA NA 0 NA NA
8000238 NA Lüneburg NA 53.24966 10.419890 NA NA 0 NA NA
8003774 NA Lübeck St Jürgen NA 53.84232 10.702942 NA NA 0 NA NA
8010022 NA Bad Schandau NA 50.91884 14.139322 NA NA 0 NA NA

One of the benefits of GTFS is that the data is in perfect shape, with lat/long coordinates, so we can get straight to mapping!

Making a map

We will use leaflet, an interactive mapping library to make simple a zoomable map of all the stops.

The usage of Leaflet in R is very straightforward and akin to the logic of base R plotting engine:

  1. add a background map,
  2. then add elements in layers on top of it
  3. leaflet produces a nice intereactive map!
library("leaflet")

m = leaflet() %>%
  # Add CartoDB background map
  addProviderTiles("CartoDB.DarkMatter") %>%  
  # Add a marker for each stop
  addCircleMarkers(lng= ~ stop_lon, lat= ~stop_lat, data = head(stops,100),
                   stroke = FALSE, fillOpacity = 0.5, radius =5 ) 
m  # Show the map

Wow, the stops are all across Germany!

OK let us filter the stops to Berlin, then plot all the stops in Berlin.

In order to do this, we need to filter the stops to those in the trips which are made on the routes of the agencies that work in Berlin.

STOPS -> STOPTIME -> TRIP -> ROUTE -> AGENCY

library(stringr)

agency_filter = "[Bb]erlin"

# Get coordinates for each stop
berlin = agency %>% 
  select(agency_id,agency_name) %>% 
  filter(str_detect(agency_name, agency_filter)) %>% 
  inner_join(select(routes, route_short_name,agency_id,route_id)) %>% 
  inner_join(select(trips, route_id, trip_id)) %>% 
  inner_join(select(stoptime, trip_id, stop_id)) %>% 
  select(-trip_id) %>% unique() %>% 
  inner_join(select(stops, stop_id, stop_name, lat=stop_lat, lon=stop_lon)) %>% 
  unique()
  
# Get coordinates for each stop and counts of trips passing through each stop

berlin_cnt = agency %>% 
  select(agency_id,agency_name) %>% 
  filter(str_detect(agency_name, agency_filter)) %>% 
  inner_join(select(routes, route_short_name,agency_id,route_id)) %>% 
  inner_join(select(trips, route_id, trip_id)) %>% 
  inner_join(select(stoptime, trip_id, stop_id)) %>% 
  group_by(agency_id,stop_id) %>% summarise(cnt=n()) %>% 
  inner_join(select(stops, stop_id, stop_name, lat=stop_lat, lon=stop_lon)) %>% 
  unique()

Now we have a nice table with lat/lon coordinates for every stop in Berlin with the identification of the type of route and the stop name.

library(knitr)
kable(head(berlin))
agency_id agency_name route_short_name route_id stop_id stop_name lat lon
BVB— Berliner Verkehrsbetriebe 12 269 9140002 Pasedagplatz (Berlin) 52.55977 13.46011
BVB— Berliner Verkehrsbetriebe 12 269 9140013 Berliner Allee/Rennbahnstr. (Berlin) 52.55755 13.46539
BVB— Berliner Verkehrsbetriebe 12 269 9140006 Berliner Allee/Indira-Gandhi-Str. (Berlin) 52.55189 13.46548
BVB— Berliner Verkehrsbetriebe 12 269 9140012 Falkenberger Str./Berliner Allee (Berlin) 52.55471 13.46774
BVB— Berliner Verkehrsbetriebe 27 270 9150008 Rhinstr./Gärtnerstr. (Berlin) 52.54639 13.51053
BVB— Berliner Verkehrsbetriebe 27 270 9150020 Hauptstr./Rhinstr. (Berlin) 52.54887 13.50476

Let’s plot again

# Define colors for different modes of transport

berlin_modes = data.frame(agency_id = c("BVB---", "BVF---", "BVT---" ,"BVU---", "DBS---", "SEV---"), mot = c("Bus", "Ferry", "Tram", "Subway", "Light Rail", "Replacement"))

berlin$agency_id = factor(berlin$agency_id)
berlin = left_join(berlin, berlin_modes)

factpal=colorFactor(palette = c("#91148d", "#145d91","#14912d","#8d9114", "#e3ea12","#ba0d0d"), 
            domain = berlin$mot, levels = levels(berlin$mot), ordered = FALSE, na.color = "#808080", alpha = FALSE)

# Plot again adding color options and a correspondent legend

m = leaflet() %>%
  # Add CartoDB background map
  addProviderTiles("CartoDB.DarkMatter") %>%  
  # Add a marker for each stop
  addCircleMarkers(lng= ~ lon, lat= ~lat, data = berlin, stroke = FALSE, 
                   fillOpacity = 0.5, radius =3, color = ~ factpal(mot)) %>% 
  addLegend(colors =c("#91148d", "#145d91","#14912d","#8d9114", "#e3ea12","#ba0d0d"),
            labels = levels(berlin$mot),title = "Mode of Transport")
m  # Show the map

Finally, make a map visualizing the main traffic hubs per mode of transport using the table with the counts of trips per stop. This helps to visualize the local structure in the city.

berlin_cnt$agency_id = factor(berlin_cnt$agency_id)
berlin_cnt = left_join(berlin_cnt, berlin_modes)

# Add 10 bins to each MoT group

berlin_cnt = berlin_cnt %>% 
  group_by(mot) %>% 
  mutate(bin = ntile(cnt,n = 8))

m = leaflet() %>%
  # Add CartoDB background map
  addProviderTiles("CartoDB.DarkMatter") %>%  
  # Add a marker for each stop
  addCircleMarkers(lng= ~ lon, lat= ~lat, data = berlin_cnt, stroke = FALSE, 
                   fillOpacity = 0.5, radius = ~(bin/2), color = ~ factpal(mot)) %>% 
  addLegend(colors =c("#91148d", "#145d91","#14912d","#8d9114", "#e3ea12","#ba0d0d"),
            labels = levels(berlin_cnt$mot),title = "Mode of Transport")
m  # Show the map

Reference

https://en.wikipedia.org/wiki/General_Transit_Feed_Specification?oldformat=true https://daten.berlin.de/datensaetze/vbb-fahrplandaten-dezember-2016-bis-august-2017 https://rstudio.github.io/leaflet/