Introduction
The aim of this analysis is to explore the feasibility of locating and monitoring a public transit system by aggregating the GPS data of everyday commuters. This would take place in the context of a phone app that provides commuters with information about the public transport system while, at the same time, recording each user’s route through the city.
Since users will not provide any input aside from their location, which is tracked automatically, the system will need to distinguish whether a commuter is driving a car, walking, or riding a bus and, in the last case, determine which bus.
The implementation won’t assume that information about the network, such as bus locations and schedules, will be provided from the outset, so the system would also need to map the network from scratch.
In this initial analysis, we will focus on a key element of the mapping process: finding bus stops.
Since the system hasn’t been implemented and depends on a large number of users, we’ve generated a transit simulation using SUMO (sumo.dlr.de), which simulates a large population of commuters who walk, drive, and ride buses in a virtual city.
To achieve a high degree of accuracy, new commuter traces need to be compared with prior ones that the system has labeled with a high degree of certainty. Since we don’t have pre-existing data when searching for the stops, we aim to narrow down the possible options, knowing that as more data is added, accuracy will improve until only true stops are mapped.
Data Generation
Although it’s possible to create a fictional city in SUMO, it’s much more convenient and useful to simulate traffic in an existing city by importing the data from OpenStreetMap. For our study, we used Berlin since the public transit data is very complete for that area.
As we can see in the preview, vehicle information is included, which we won’t use since the point of the study is to infer this strictly from the pedestrian data.
Parsed with column specification:
cols(
  timestep_time = col_double(),
  vehicle_angle = col_double(),
  vehicle_id = col_character(),
  vehicle_lane = col_character(),
  vehicle_pos = col_double(),
  vehicle_slope = col_double(),
  vehicle_speed = col_double(),
  vehicle_type = col_character(),
  vehicle_x = col_double(),
  vehicle_y = col_double(),
  person_angle = col_double(),
  person_edge = col_character(),
  person_id = col_character(),
  person_pos = col_double(),
  person_slope = col_double(),
  person_speed = col_double(),
  person_x = col_double(),
  person_y = col_double()
)
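Dropping the vehicle columns could look roughly like the sketch below. It uses a small made-up table in place of the imported data, and it assumes that vehicle and person records sit on separate rows with the other group's columns left empty. The names `fcd` and `ped` are hypothetical.

```r
# Toy stand-in for the imported table (hypothetical values, subset of columns).
fcd <- data.frame(
  timestep_time = c(0, 0, 1, 1),
  vehicle_id    = c("bus_1", NA, "bus_1", NA),
  vehicle_speed = c(8.2, NA, 8.5, NA),
  person_id     = c(NA, "p1", NA, "p1"),
  person_speed  = c(NA, 1.1, NA, 1.0),
  person_x      = c(NA, 100, NA, 101),
  person_y      = c(NA, 50, NA, 50),
  stringsAsFactors = FALSE
)

# Keep the timestep plus every person_* column.
keep <- c("timestep_time", grep("^person_", names(fcd), value = TRUE))

# Drop the rows that describe vehicles only (assumed to have no person_id).
ped <- fcd[!is.na(fcd$person_id), keep]
rownames(ped) <- NULL
```

Only the pedestrian view of the simulation remains in `ped`, which mirrors what a phone app would actually observe.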
Exploration
If we take a random snapshot of pedestrian clusters (which are already a small subset), we see that obtaining reliable data will be a challenge.

When comparing speeds between vehicles and pedestrians, we find a key differentiator. Besides standing completely still, pedestrians do most of their walking at around 1 m/s, a speed that is rarely seen in vehicles.
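One way to quantify this difference is to measure what share of the moving observations falls inside a walking-speed band. The sketch below uses made-up speed samples, and the band limits of 0.5 and 1.5 m/s are an assumption chosen for illustration. `walking_share` is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical speed samples in m/s; real values would come from the
# person_speed and vehicle_speed columns.
person_speed  <- c(0, 0, 0.9, 1.0, 1.1, 1.2, 1.0, 13.5)
vehicle_speed <- c(0, 0, 5.4, 8.3, 12.1, 13.9, 1.1, 9.7)

# Share of moving observations that fall inside an assumed walking band.
walking_share <- function(speed, lo = 0.5, hi = 1.5) {
  moving <- speed[!is.na(speed) & speed > 0]
  if (length(moving) == 0) return(NA_real_)
  mean(moving >= lo & moving <= hi)
}

walking_share(person_speed)
walking_share(vehicle_speed)
```

A large gap between the two shares would support using speed as a first filter between walking and riding.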

Although not our main goal, we can use this information to find roads (and train tracks). By filtering for groups moving at over 7 m/s, most become easy to see.
plot(dfc$person_x[dfc$person_speed > 7], jitter(dfc$person_y[dfc$person_speed > 7]))

Let’s now graph clusters that either aren’t moving or are larger than 10.
smoothScatter(
  dfc$person_x[dfc$person_speed == 0 | dfc$ccount > 10],
  dfc$person_y[dfc$person_speed == 0 | dfc$ccount > 10],
  nbin = 1000,
  bandwidth = 5)
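The `ccount` column used in the plot above is not derived in this section. A minimal sketch of how such a cluster size might be computed follows, assuming a cluster is the set of pedestrians that share the exact same coordinates at the same timestep. The toy table `ped` is hypothetical.

```r
# Toy pedestrian observations (hypothetical values).
ped <- data.frame(
  timestep_time = c(0, 0, 0, 0, 1, 1),
  person_id     = c("p1", "p2", "p3", "p4", "p1", "p2"),
  person_x      = c(10, 10, 10, 42, 10, 11),
  person_y      = c(20, 20, 20, 7, 20, 20),
  stringsAsFactors = FALSE
)

# ave() applies length() within each (time, x, y) group and returns
# one value per row, so every pedestrian gets the size of its cluster.
ped$ccount <- ave(
  rep(1L, nrow(ped)),
  ped$timestep_time, ped$person_x, ped$person_y,
  FUN = length
)
```

Exact coordinate matching only works because the simulation places bus passengers at identical positions; noisy GPS data would need a distance-based grouping instead.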

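Many of the bus stops are visible under the dark blots, but we have a lot of noise. Let’s instead focus on changes in cluster size while at rest. The transformation below takes the absolute change in size of each stationary cluster (pedestrians sharing the same location and time) and counts the nonzero changes at each location.
# Count, per location, how often a stationary cluster changes size.
dfc2 <- dfc %>%
  filter(person_speed == 0) %>%        # keep pedestrians at rest
  mutate(abscc = abs(cchange)) %>%     # magnitude of the change in cluster size
  filter(abscc > 0) %>%                # keep only rows where the size changed
  dplyr::group_by(person_x, person_y) %>%
  count(person_x, person_y)            # number of such rows per location
# Density plot of the candidate stop locations.
smoothScatter(
  dfc2$person_x,
  dfc2$person_y,
  nbin = 1000,
  bandwidth = 5)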
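The `cchange` column used in the transformation above is also taken as given. One possible definition, shown here only as a sketch, is the difference in cluster size between consecutive timesteps at the same location. The toy table `counts` is hypothetical, and the sketch assumes every location has a row for every timestep, which real data would not guarantee.

```r
# Toy cluster sizes per location and timestep (hypothetical values).
counts <- data.frame(
  timestep_time = c(0, 1, 2, 0, 1, 2),
  person_x      = c(10, 10, 10, 42, 42, 42),
  person_y      = c(20, 20, 20, 7, 7, 7),
  ccount        = c(2, 5, 1, 3, 3, 3)
)

# Order by location, then time, so diff() compares consecutive timesteps.
counts <- counts[order(counts$person_x, counts$person_y, counts$timestep_time), ]

# Change in cluster size relative to the previous timestep at the same
# location; the first timestep of each location is assigned 0.
counts$cchange <- ave(
  counts$ccount,
  counts$person_x, counts$person_y,
  FUN = function(n) c(0, diff(n))
)

# Count the nonzero changes per location, as candidate stop evidence.
events <- counts[abs(counts$cchange) > 0, ]
stop_candidates <- aggregate(cchange ~ person_x + person_y, data = events, FUN = length)
names(stop_candidates)[3] <- "n"
```

In this toy case only the location where people board and alight survives the filter, while the location with a constant group size drops out.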

This is an improvement. We have more defined spots, but we still have many more spots than stops.
Analysis & Conclusion
As we’ve seen, analyzing the speed and collective behavior of pedestrians can give us many clues about the layout of the city and their mode of transportation. Although the results are promising, they are clearly incomplete, as we are still getting many false positives. Most of these are likely train stops, pedestrian crossings, and other places where pedestrians may gather in close groups.
An obvious extension of this model, which should greatly enhance its capabilities, is to extract “cluster chains” from the data. In our current analysis, we rely on instantaneous changes in group size, which are both unrealistic and limiting. If we identify consecutive timesteps in which two or more pedestrians are present together, complete routes could be linked, which would also allow us to analyze the acceleration profile of each chain. The public transport routes could then be obtained by linking chains. The acceleration profile would not only improve the location of stops; it would likely help differentiate modes of transport, from walking to riding on a ferry.
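A simplified version of the chain idea can be sketched per pedestrian: label each run of consecutive timesteps during which that pedestrian is part of a cluster of two or more. This is an assumption about how chains might be built, not the method used in this analysis, and it tracks individuals instead of whole groups. The table `obs` and the column `chain_id` are hypothetical.

```r
# Toy observations: cluster size seen by each pedestrian at each timestep.
obs <- data.frame(
  person_id     = c("p1", "p1", "p1", "p1", "p1", "p1", "p2", "p2"),
  timestep_time = c(0, 1, 2, 3, 4, 5, 0, 1),
  ccount        = c(1, 3, 3, 4, 1, 2, 2, 2),
  stringsAsFactors = FALSE
)
obs <- obs[order(obs$person_id, obs$timestep_time), ]

in_cluster <- obs$ccount >= 2

# TRUE when the previous row is the same person, exactly one timestep
# earlier, and was also inside a cluster.
continues <- c(
  FALSE,
  head(obs$person_id, -1) == tail(obs$person_id, -1) &
    diff(obs$timestep_time) == 1 &
    head(in_cluster, -1)
)

# A chain starts at a clustered row that does not continue a previous one.
starts <- in_cluster & !continues
obs$chain_id <- ifelse(in_cluster, cumsum(starts), NA)
```

Each `chain_id` then marks one uninterrupted shared ride, and speed differences within a chain could serve as a rough acceleration profile.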
The simulation introduces a major simplification by placing all passengers of a bus at exactly the same location. It will eventually be necessary to introduce a degree of randomness to account for different seat locations and GPS inaccuracy. In this case, determining which pedestrians belong to a cluster will be more difficult but should be possible by constructing some kind of localized cointegration test on possible cluster chains.
It was my intention to incorporate cluster chains from the outset, but the data manipulation was unfortunately beyond my capabilities at the time. I consulted with several researchers, but they were also unable to help, although the logic of the transformation is fairly straightforward. In spite of this, the results suggest that meaningful information can be extracted from random GPS traces, and thus additional work on this topic seems warranted.
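To illustrate the effect of positional noise, the sketch below perturbs made-up coordinates with Gaussian noise and then groups points by distance instead of by exact match. The 3 m standard deviation and the 20 m radius are assumptions for illustration only, and single-linkage clustering is swapped in here as a simple stand-in, not the cointegration test mentioned above.

```r
set.seed(1)

# Hypothetical positions: five passengers on one bus, two lone pedestrians.
true_xy <- rbind(
  matrix(rep(c(100, 50), 5), ncol = 2, byrow = TRUE),
  c(160, 50),
  c(100, 120)
)

# Add assumed GPS and seat-position noise (sd = 3 m on each axis).
noisy_xy <- true_xy + matrix(rnorm(length(true_xy), sd = 3), ncol = 2)

# Single-linkage clustering: a point joins a cluster when it lies within
# `radius` metres of any existing member.
radius  <- 20
tree    <- hclust(dist(noisy_xy), method = "single")
cluster <- cutree(tree, h = radius)

table(cluster)
```

With noise this small relative to the radius, the passengers still end up in one group, but a crowded street would need a tighter radius or additional evidence such as shared movement over time.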
References
Krajzewicz, D., Erdmann, J., Behrisch, M., & Bieker, L. (2012). Recent development and applications of SUMO-Simulation of Urban MObility. International Journal On Advances in Systems and Measurements, 5, 128–138. Retrieved from https://elib.dlr.de/80483/
