9. Waze Data Wrangling

In September of 2015, Louisville became the 5th city to partner with Waze Mobile, Ltd. (https://www.bizjournals.com/louisville/news/2015/09/24/louisville-partners-with-waze-to-create-a-real.html) to enhance traffic outcomes. As a result of this partnership, we have received access to a rich dataset that presents unique challenges and opportunities.

9.1 Creating the WazeData dataset

Our data extract was provided as a Microsoft SQL Server database backup. After restoring it, we could explore the data’s schema. A snapshot of the schema is provided below. Before walking through the schema, let’s review our understanding of the data generation process.

Waze pings its users’ mobile phones at regular (frequent) intervals to obtain location data. Based on its internal models and/or proprietary historical data collection, Waze has an expectation for the amount of time that it should take a user to move through a corridor. The company can then detect when someone is moving slower than normal and ping a sample of users with an alert as to the cause of the delay. As a result, Waze has provided tables with these anonymized trip IDs, the types of delays and the associated coordinates.

To our knowledge, this is the first time anyone has adapted this dataset for predicting collisions. It should be noted that this data was created to enhance the business model of a private enterprise. Therefore, some data that might make the analysis more robust is not available and is unlikely to be provided due to proprietary or legal (privacy) considerations. Nevertheless, we approached this data with the understanding that it might provide useful approximations for data that is often difficult to obtain (especially volume and congestion).

9.1.1 Challenges to Wrangling

First, the dataset is large and continuous, with readings every few minutes from October 2014 to July 2017. There are over 15 million Jam IDs (our unit of analysis). We selected four random days for our analysis, representing just over 350,000 Jam IDs (just over 2% of the entire dataset). We selected a random Tuesday and Thursday for two different weeks to represent typical traffic (10/20/2016, 10/25/2016 & 11/3/2016, 11/8/2016). Weather for these days may be found at www.wunderground.com. Two of the four days experienced light precipitation.
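
As a rough illustration of the sampling step, the sketch below keeps only the records that fall on the four sample days. The record layout (a Jam ID paired with a timestamp) and the function name are assumptions made for this example; they do not reflect the actual Waze table schema, and the records shown are made up.

```python
from datetime import date, datetime

# The four sample days named in the text.
SAMPLE_DAYS = {
    date(2016, 10, 20),
    date(2016, 10, 25),
    date(2016, 11, 3),
    date(2016, 11, 8),
}


def select_sample(jams):
    """Keep only jam records whose timestamp falls on a sample day.

    `jams` is an iterable of (jam_id, timestamp) pairs. This layout is
    hypothetical, not the real Waze schema.
    """
    return [(jam_id, ts) for jam_id, ts in jams if ts.date() in SAMPLE_DAYS]


# Made-up records for illustration only.
jams = [
    ("a1", datetime(2016, 10, 20, 8, 15)),
    ("b2", datetime(2016, 10, 21, 17, 40)),
    ("c3", datetime(2016, 11, 8, 23, 59)),
]
sample = select_sample(jams)
```

In practice the same filter would more likely be expressed as a date predicate in the database query, so that only the sampled rows leave the server.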

As can be seen in the image below, these Jam IDs present a unique challenge. For example, if the trips seen in the image below had been pinged for a delay, and the user flagged the delay as due to a pothole, clearly the pothole did not span the entire length of the segment or trip. However, there is no means to identify its exact location.

Instead, we decided to treat these situations as “exposure,” which is to say that if we aggregated all segments associated with or “exposed” to a particular flag, this might still suggest something important about the network in relation to the flag (even without an exact location). In the image below, though only three sets of points appear to the naked eye, they represent a sample of 16 trips (by Jam ID).

Lastly, since our focus is on the relative collision risk of each segment, we did not include timing data in our model. Though time may have a relationship to collision risk (e.g., collision risk might increase during rush hour), we assume that the relative risk per street segment is independent of time so that we can focus on the built environment.

9.1.2 The Approach

For congestion, we began by associating each point with its nearest segment using a near table in ArcGIS. The results were disappointing, as they captured neither the inherent organization of the road network nor its spatial autocorrelation. The result may be seen in the image below.

Level of Traffic Jam by Nearest Segment
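
The nearest-segment association can be approximated outside a GIS as well. The following is a minimal sketch, not the ArcGIS near table itself: it assumes planar (projected) coordinates, treats each street segment as a single straight line between two endpoints, and uses made-up segment names, points and jam levels.

```python
import math


def point_segment_distance(p, a, b):
    """Planar distance from point p to the straight line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        # Degenerate segment: both endpoints coincide.
        return math.hypot(px - ax, py - ay)
    # Project p onto the line, then clamp the projection to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def nearest_segment(point, segments):
    """Return the id of the segment closest to `point`."""
    return min(
        segments,
        key=lambda sid: point_segment_distance(point, *segments[sid]),
    )


# Hypothetical segments (id -> two endpoints) and jam points (xy, level).
segments = {
    "main_st": ((0.0, 0.0), (10.0, 0.0)),
    "oak_ave": ((0.0, 5.0), (0.0, 15.0)),
}
points = [((3.0, 1.0), 2), ((1.0, 9.0), 4), ((9.0, 2.0), 3)]

# Group jam levels by the segment each point is nearest to.
levels_by_segment = {}
for xy, level in points:
    levels_by_segment.setdefault(nearest_segment(xy, segments), []).append(level)
```

Because each point contributes to exactly one segment, neighboring segments can end up with very different values, which is the discontinuity described above.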

As a result, though not ideal, we used the three nearest points to give a more continuous representation of the network. We chose three because we wanted to be judicious without having the level “bleed” into other segments inappropriately. The results can be seen in the image below.

To obtain our average delay by segment feature, we first considered only jam levels of 1-4 (since Waze scores jam levels of 5 with a -1 for time, while all others represent a continuous count in seconds). We then took the three nearest points per segment and averaged their delay values.
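
The delay feature described above can be sketched as follows. This is an illustration under stated assumptions rather than the workflow actually run in ArcGIS: each segment is reduced to one representative coordinate (for example, its midpoint), coordinates are planar, and the field names `xy`, `level` and `delay` are hypothetical.

```python
import math


def mean_delay_of_nearest(segment_xy, jam_points, k=3):
    """Average the delay (seconds) of the k jam points nearest a segment.

    Level-5 jams are dropped first because their delay is coded as -1
    rather than as a count of seconds.
    """
    usable = [p for p in jam_points if 1 <= p["level"] <= 4]
    nearest = sorted(usable, key=lambda p: math.dist(segment_xy, p["xy"]))[:k]
    if not nearest:
        return None  # no usable points at all
    return sum(p["delay"] for p in nearest) / len(nearest)


# Made-up jam points for illustration only.
jam_points = [
    {"xy": (0.0, 1.0), "level": 2, "delay": 60},
    {"xy": (1.0, 0.0), "level": 3, "delay": 120},
    {"xy": (0.5, 0.5), "level": 5, "delay": -1},  # excluded: level 5
    {"xy": (2.0, 2.0), "level": 1, "delay": 30},
    {"xy": (9.0, 9.0), "level": 4, "delay": 600},
]

segment_delay = mean_delay_of_nearest((0.0, 0.0), jam_points)
```

Filtering before the nearest-neighbor search matters: if the level-5 point were kept, its -1 placeholder would pull the average down even though it is not a real delay.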

For volume, we first extracted all of the points associated with the Jam IDs in our sample. In ArcGIS, we generated a near table to associate each point with its closest segment and aggregated the results. We immediately noticed that this method did not appear to capture the inherent organization of the network (adjacent segments received very different scores). To account for this and for our suspicion that there is some spatial autocorrelation in the traffic pattern, we decided to take a kernel density of our volume points, reclassify it into 100 quantiles, convert the result to polygons and spatially join those polygons to our segments.
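
The idea behind the kernel density and quantile steps can be sketched in a few lines. Note the simplifications, all of which are assumptions of this example: the density is evaluated directly at one representative coordinate per segment instead of building a raster, converting it to polygons and joining them; a Gaussian kernel is used for simplicity and is not necessarily the kernel the GIS tool applies; and the bandwidth, point locations and segment names are made up. Four classes are used here so the toy output is readable, whereas the text uses 100.

```python
import math
from bisect import bisect_right


def kernel_density(location, points, bandwidth):
    """Gaussian kernel density of `points` evaluated at `location`."""
    total = 0.0
    for p in points:
        d = math.dist(location, p)
        total += math.exp(-0.5 * (d / bandwidth) ** 2)
    return total / (2.0 * math.pi * bandwidth ** 2)


def quantile_classes(values, n_classes):
    """Assign each value a class from 1 to n_classes based on its rank."""
    ordered = sorted(values)
    n = len(ordered)
    classes = []
    for v in values:
        rank = bisect_right(ordered, v)  # how many values are <= v
        classes.append((rank * n_classes + n - 1) // n)  # ceil(rank * k / n)
    return classes


# Made-up volume points and segment locations.
volume_points = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.3), (5.0, 5.0)]
segment_xy = {
    "downtown": (0.0, 0.0),
    "arterial": (1.0, 1.0),
    "suburb": (5.0, 5.0),
    "rural": (20.0, 20.0),
}

names = list(segment_xy)
density = [kernel_density(segment_xy[n], volume_points, bandwidth=1.0) for n in names]
classes = dict(zip(names, quantile_classes(density, n_classes=4)))
```

Unlike the near table, every point contributes to every nearby segment with a weight that decays with distance, so adjacent segments receive similar scores.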

The result can be seen in the images below and appears consistent with what might be expected: the Downtown area and main arterials receive more volume overall. The same kernel density approach was taken to engineer the following features based on Waze alert data: potholes, roadkill, missing sign on shoulder, hazard object on road, heavy traffic, medium traffic, standstill traffic and level 1 through level 4 traffic jam scores as determined by Waze.

Volume Kernel Density

Volume by Quantile

9.2 Next Steps

There is potential for additional feature engineering based on an inverse distance weighted (IDW) interpolation of the rasters used to generate variables like volume. Additionally, a principal component analysis (PCA) may help to unpack additional variables.
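
For reference, the standard IDW estimate at a location is a weighted average of known values, with each weight equal to the inverse of the distance raised to a power. The sketch below shows that formula on made-up samples; the function name, the sample layout and the default power of 2 are assumptions for illustration, and planar coordinates are assumed.

```python
import math


def idw(location, samples, power=2.0):
    """Inverse distance weighted estimate at `location`.

    `samples` is a list of ((x, y), value) pairs. Each sample is
    weighted by 1 / distance**power.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    weighted_sum = 0.0
    weight_total = 0.0
    for xy, value in samples:
        d = math.dist(location, xy)
        if d == 0.0:
            return value  # exactly on a sample: use its value directly
        w = 1.0 / d ** power
        weighted_sum += w * value
        weight_total += w
    return weighted_sum / weight_total


# Made-up samples: two known values two units apart.
samples = [((0.0, 0.0), 10.0), ((2.0, 0.0), 20.0)]
midpoint_estimate = idw((1.0, 0.0), samples)
```

A larger power makes the estimate follow the closest samples more tightly, while a smaller power produces a smoother surface.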

Lastly, future work may benefit from seasonal cross-validation that takes weather into account.