How reliable is Strava?

Introduction

Strava is a mobile app and website widely used by bicycle riders to track their riding. The app logs GPS position coordinates and timestamps. Strava Metro is a data product that provides counts of bicycle trips by time of day (in one-minute bins) and direction of travel by location. The Metro product assigns trips to any GIS road layer, often an OpenStreetMap network, facilitating queries by link and node. The data as provided by Strava does not facilitate select link analysis; it is understood that, for privacy reasons, only aggregated data is provided, or data with the final 100 m truncated from origins and destinations. Origin-destination matrices can be provided for user-provided zones (e.g. BSTM travel zones, ABS Statistical Area geographies).

There is no sample control in Strava - the sample is entirely self-selected and there are no limitations on how many trips a rider can enter into the app. The sample is not weighted.

Purpose

The purpose of this analysis is to attempt to ascertain how representative Strava is for the purposes of identifying “busy” and “quiet” cycling routes in Brisbane. Specifically, we are interested in how reliably Strava can be used to compare two links. Put simply, if Strava indicates location A is, say, 50% busier than location B, how reliable is this indicator?

Methodology

The methodology was to compare Strava against automatic cycling counts data available from Brisbane City Council (BCC) and the Department of Transport and Main Roads (TMR). It is implicitly assumed that the automatic counters accurately represent the cyclist count along these links. This seems to be a generally reasonable assumption, as both counting technologies have been previously tested in real-world conditions and demonstrated count accuracy exceeding 95% (the TMR counters are Q-Free piezoelectric-based devices while the BCC counters are Eco-Counter inductive loops). The caveat on this assumption is that site-specific validation has not been performed; it is assumed the counters have been installed correctly and are operating satisfactorily. Sense checks of the average daily counts, seasonal and time-of-day variations suggest this assumption is reasonable.

Strava

Records were provided by TMR from Strava Metro for the links nearest the automatic bicycle counters. The data was provided as counts by direction of travel in one-minute bins for 2016.

Automatic counters

Automatic cyclist counts data is provided by both BCC and TMR. The data for 2016, as available on their respective open data portals, was used for this analysis. In both cases the data was partially aggregated:

The TMR data is provided as monthly averages by day of week and hour of day
The BCC data consists of daily counts

The data was aggregated to Average Annual Daily Traffic (AADT) using the averages-of-averages approach recommended by AASHTO. This approach reduces the impact of missing values and outliers compared to naive averages.
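The averages-of-averages calculation can be sketched as below for daily counts such as the BCC data. This is a minimal illustration rather than the code used for this analysis: the function name `aadt_average_of_averages` and the synthetic counts are assumptions made for the example. Counts are first averaged within each month and day-of-week cell, then across the days of week within each month, and finally across months.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean


def aadt_average_of_averages(daily_counts):
    """Estimate AADT from a dict of {datetime.date: count}.

    Days with missing data are simply left out of the dict.
    """
    # Step 1: collect counts into (month, day-of-week) cells.
    cells = defaultdict(list)
    for day, count in daily_counts.items():
        cells[(day.month, day.weekday())].append(count)

    # Step 2: average within each cell, then group the cell means by month.
    by_month = defaultdict(list)
    for (month, _dow), cell_counts in cells.items():
        by_month[month].append(mean(cell_counts))

    # Step 3: average across days of week to get each month's MADT.
    madt = [mean(dow_means) for dow_means in by_month.values()]

    # Step 4: average the monthly values to get the AADT.
    return mean(madt)


# Synthetic (made-up) data for 2016: 100 riders on weekdays, 200 on weekends.
counts = {}
day = date(2016, 1, 1)
while day.year == 2016:
    counts[day] = 200 if day.weekday() >= 5 else 100
    day += timedelta(days=1)

# Simulate a counter outage: drop the June weekend days after 5 June,
# leaving one Saturday and one Sunday in the month.
for d in [d for d in counts if d.month == 6 and d.weekday() >= 5 and d.day > 5]:
    del counts[d]

naive = mean(list(counts.values()))
robust = aadt_average_of_averages(counts)
print(f"naive mean: {naive:.1f}, averages-of-averages: {robust:.1f}")
```

In this synthetic case the averages-of-averages estimate stays at 900/7 (about 128.6) despite the outage, whereas the naive mean is pulled down towards the weekday value. Note that the sketch still requires at least one observation in every month and day-of-week cell; a cell with no data at all would need separate handling.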

Comparisons were made using AADTs, by month (to assess seasonal variation) and by time of day. For the latter only the subset of TMR counter sites was used given that time-of-day data was unavailable for the BCC sites (note: this breakdown has been requested from BCC and if forthcoming this note will be updated accordingly).

Initial inspection of the TMR data suggested the time of day counts provided by TMR were offset +1 hour from the correct values. This observation was made following initial inspection of the raw data, and plots of the time of day distributions against the Strava data. Both comparisons suggested that, for example, the weekday AM peak hour was 8-9 AM. Experience would suggest this is not the case, and that the peak hour for commuter-oriented sites in Brisbane is in fact 7-8 AM. Assuming this is a simple data error (or that the hour code is for the hour ending rather than starting as stated in the metadata) we have shifted the TMR counts data back one hour to compensate. We have requested clarification from TMR data services on this issue.
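The one-hour correction amounts to re-indexing the hourly profile, as in the sketch below. The function name `shift_hours_back` and the example profile are hypothetical, and wrapping the hour-0 count around to hour 23 is an assumption that is only reasonable for an averaged daily profile, not for a continuous time series.

```python
def shift_hours_back(hourly_counts, offset=1):
    """Move each count to an hour code `offset` hours earlier.

    hourly_counts: list of 24 values indexed by the reported hour code (0-23).
    """
    if len(hourly_counts) != 24:
        raise ValueError("expected 24 hourly values")
    # The value reported under hour h + offset belongs to hour h.
    # Assumption: the profile is an average day, so hour 0 wraps to hour 23.
    return [hourly_counts[(hour + offset) % 24] for hour in range(24)]


# Made-up weekday profile whose peak is reported under hour code 8.
reported = [0] * 24
reported[8] = 120
reported[9] = 60

corrected = shift_hours_back(reported)
peak_hour = corrected.index(max(corrected))
print(f"corrected peak hour starts at {peak_hour}:00")
```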

Results

Average Annual Daily Traffic (AADT)

The AADT estimates provided by the automatic counters are shown below, as is the equivalent measure from the Strava data. Note that the scale of the Strava count is largely meaningless - it represents the average number of Strava trips per day at the site, and would be expected to only be a small fraction of the true number of cycling trips. What is more relevant than the magnitude is the relative size of each count. In this vein the Strava data appears to correctly measure the Bicentennial Bikeway as the busiest of all the sites. However, beyond this site the correlation between Strava and the automatic counters is fairly weak.

This lack of correlation between busy and quiet sites can be illustrated by ranking the sites from busiest to quietest as shown below. This illustrates dramatic differences between the datasets. For example, the automatic counter data would suggest the Kangaroo Point Bikeway is the second busiest site from the sample, but Strava suggests it is only the 13th busiest. Similarly, the automatic counts suggest Riverwalk is the 3rd busiest site but Strava suggests it is only the 15th busiest. Conversely, the automatic counters suggest the Jack Pesch Bridge is the 11th busiest site but Strava suggests it is the 2nd busiest. Overall, only two sites are ranked equally across the two datasets (namely, the Bicentennial Bikeway at 1st and Ted Smout Bridge at 20th).
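The ranking comparison can also be summarised numerically, for instance with Spearman's rank correlation. This statistic was not part of the original analysis; it is offered here as one possible summary. The site labels and counts below are invented for illustration and are not the Brisbane results, and the formula used assumes no tied counts.

```python
def rank_descending(counts):
    """Return {site: rank}, where rank 1 is the busiest site (no ties assumed)."""
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {site: position for position, site in enumerate(ordered, start=1)}


def spearman_rho(ranks_a, ranks_b):
    """Spearman's rank correlation for two untied rankings of the same sites."""
    n = len(ranks_a)
    d_squared = sum((ranks_a[s] - ranks_b[s]) ** 2 for s in ranks_a)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Invented daily counts at five hypothetical sites.
counter_aadt = {"site_a": 500, "site_b": 400, "site_c": 300, "site_d": 200, "site_e": 100}
strava_aadt = {"site_a": 50, "site_b": 10, "site_c": 20, "site_d": 40, "site_e": 5}

counter_ranks = rank_descending(counter_aadt)
strava_ranks = rank_descending(strava_aadt)

# Sites that hold the same position in both rankings.
same_rank = [s for s in counter_ranks if counter_ranks[s] == strava_ranks[s]]
rho = spearman_rho(counter_ranks, strava_ranks)
print(f"sites ranked equally: {same_rank}; Spearman rho = {rho:.2f}")
```

A value near 1 would indicate that the two datasets order the sites similarly, while a value near 0 would indicate little agreement in the ordering.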

TBD: cordon analysis

Another way of thinking about the over- and under-representation of the Strava data is to scale the sites together. This could also be done as a cordon or screenline; for example, using the Brisbane River crossings. However, simplistically we consider below the full sample of sites. Strava very accurately predicts that just over 15% of cyclist trips occurred at the Bicentennial Bikeway site. Similarly, it accurately predicts that only around 1% of trips occur at sites such as the PA Hospital (O’Keefe St) and at Ted Smout Bridge. However, the wildly different estimates at Jack Pesch Bridge, Kangaroo Point Bikeway and Northbank are particularly evident in this analysis.
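The scaling described above can be sketched as follows: each site's count is expressed as a share of the total across all sites, and the ratio of the Strava share to the counter share indicates over- or under-representation. The function names and the counts are illustrative assumptions, not values from this analysis.

```python
def shares_percent(counts):
    """Express each site's count as a percentage of the total over all sites."""
    total = sum(counts.values())
    return {site: 100 * value / total for site, value in counts.items()}


def representation_ratio(strava_counts, counter_counts):
    """Ratio of Strava share to counter share; 1.0 means proportional.

    Assumes every site has a non-zero counter count.
    """
    strava_share = shares_percent(strava_counts)
    counter_share = shares_percent(counter_counts)
    return {site: strava_share[site] / counter_share[site] for site in counter_share}


# Invented counts at three hypothetical sites.
counter_counts = {"site_a": 500, "site_b": 300, "site_c": 200}
strava_counts = {"site_a": 50, "site_b": 45, "site_c": 5}

ratios = representation_ratio(strava_counts, counter_counts)
for site, ratio in ratios.items():
    print(f"{site}: Strava share is {ratio:.2f} times the counter share")
```

A ratio above 1 marks a site that Strava over-represents relative to the counters, and a ratio below 1 marks one it under-represents. Because shares are relative to the chosen set of sites, the ratios would change if the sample were restricted to a cordon or screenline.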

Seasonal variation

The monthly average daily traffic (MADT) was normalised by the sum of the MADT across the year and expressed as a percentage. In this way, if there were no seasonal variation each month would represent 1/12 = 8.3% of traffic. The Strava data appears to capture the seasonal variation (or lack thereof) well across most sites, as shown below. The few anomalies are Lambert Rd (Indooroopilly) and the Go Between Bridge. At least the latter of these appears to be related to an issue with the automatic counter rather than Strava - the counter appeared to be inoperative in January and for at least part of February.
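The normalisation and comparison of monthly profiles can be sketched as below. The two-percentage-point threshold for flagging a month is an arbitrary assumption made for this example, as are the function names and the counts, which mimic a counter that recorded nothing in January.

```python
def normalise_percent(values):
    """Express each value as a percentage of the sum of all values."""
    total = sum(values)
    return [100 * v / total for v in values]


def flag_months(counter_madt, strava_madt, threshold=2.0):
    """Return month numbers (1-12) where the two normalised profiles differ
    by more than `threshold` percentage points (an assumed value)."""
    counter_pct = normalise_percent(counter_madt)
    strava_pct = normalise_percent(strava_madt)
    return [
        month
        for month, (c, s) in enumerate(zip(counter_pct, strava_pct), start=1)
        if abs(c - s) > threshold
    ]


# Invented MADT values: the counter reads zero in January, Strava is flat.
counter_madt = [0] + [100] * 11
strava_madt = [10] * 12

flat_share = normalise_percent(strava_madt)[0]  # 100 / 12 when there is no seasonality
flagged = flag_months(counter_madt, strava_madt)
print(f"flat monthly share: {flat_share:.1f}%; flagged months: {flagged}")
```

A month with a dead counter also inflates the normalised share of every other month at that site, so small differences in the remaining months should be read with that in mind.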

Day of week

The day of week distribution is shown below. The Average Annual Day of Week (AADW) is presented as a percentage of the sum of AADW across the week; if every day of week had identical demand the AADW would be 1/7 = 14.3%. The Strava data appears to accurately represent the day of week distributions across most sites. However, it appears to somewhat underestimate weekend demand at Norman Park, severely underestimates weekend demand at Story Bridge West and overestimates weekend demand at Stanley Street and the Gateway Motorway.

Time of day

Strava appears to capture the time of day distribution well at most sites, as shown below (note: this analysis is split into weekdays and weekends, given they will have self-evidently different profiles - in this high-level analysis public holidays and school holidays are not treated differently to the day of week upon which they fall).

Conclusions

In our view this analysis would suggest the following:

Strava does not provide spatially representative data; that is, it cannot be relied upon to identify “busy” and “quiet” sites
The over- and under-representation of Strava data by location does not appear to be consistent, nor is there an evident pattern (e.g. journey purpose) which would suggest some kind of clustering and factoring process could conceivably correct for these spatial biases
While the relative demand estimated from Strava appears to be very misleading, in many instances the seasonal variation, day of week variation and time of day variation appear to be well predicted by Strava.

We would however caveat these conclusions by noting that we are assuming here that the automatic counters are performing reliably; while we have no evidence to the contrary it may be prudent to perform checks upon the automatic counters before relying upon these conclusions.

Another avenue for further work could be to look at how reliably Strava may represent the redistribution of cycling trips that may occur after infrastructure is completed. That is, how reliable is Strava at illustrating the redistribution of cycling trips off the pre-existing road and path network onto a new facility?