I have implemented the filtering algorithm described in Palmius et al.’s paper. Nick was kind enough to provide pseudocode in addition to his MATLAB code. His pseudocode is below:
```
Data:   D(t), recorded data points at tmin ≤ t ≤ tmax
Result: D′(t′), filtered data points at t′min ≤ t′ ≤ t′max
begin
    D′ ← D
    t′ ← t
    repeat
        t* ← t′ where dD′/dt > 100 km/h    ▹ data points with high dD′/dt
        Dsrc  ← D′(t* − 1)
        Ddest ← D′(t*)
        D̂src  ← mode(Dsrc)
        D̂dest ← mode(Ddest)
        if |D̂src| ≥ |D̂dest| and |D̂src| > 10 then
            remove all occurrences of D̂src from D′
        else if |D̂dest| ≥ 10 then
            remove all occurrences of D̂dest from D′
        end if
    until |D̂src| ≤ 10 and |D̂dest| < 10
end
```
Based on that pseudocode, I implemented the algorithm in R. You can see the repository with all my code here; check out the filtering algorithm and the downsampling algorithm in particular. A condensed sketch of the core loop follows below.
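This is a simplified, self-contained sketch of the idea, assuming a data frame `d` with `time` (POSIXct), `lat` and `lon` columns; the `geosphere` dependency and the four-decimal rounding (roughly 11 m) used to treat nearby fixes as the same location are just one way to realise the pseudocode, not Nick's code:

```r
library(geosphere)

# Iteratively remove the most frequent source/destination of >100 km/h jumps,
# following the pseudocode above. The rounding that defines "the same
# location" is my own choice.
filter_jumps <- function(d, max_speed_kmh = 100, min_count = 10) {
  repeat {
    n <- nrow(d)
    if (n < 2) break
    # speed between consecutive fixes, in km/h
    dist_km <- distHaversine(cbind(d$lon[-n], d$lat[-n]),
                             cbind(d$lon[-1], d$lat[-1])) / 1000
    dt_h  <- as.numeric(difftime(d$time[-1], d$time[-n], units = "hours"))
    jumps <- which(dist_km / dt_h > max_speed_kmh) + 1  # destination indices
    if (length(jumps) == 0) break

    key  <- paste(round(d$lat, 4), round(d$lon, 4))  # location identity
    src  <- table(key[jumps - 1])                    # jump sources
    dest <- table(key[jumps])                        # jump destinations
    if (max(src) >= max(dest) && max(src) > min_count) {
      d <- d[key != names(which.max(src)), ]   # drop the modal source everywhere
    } else if (max(dest) >= min_count) {
      d <- d[key != names(which.max(dest)), ]  # drop the modal destination
    } else {
      break  # neither mode is frequent enough: the pseudocode's `until`
    }
  }
  d
}
```

Note that removing points can create new high-speed transitions between the points that remain, which is why the whole thing has to run inside a `repeat` rather than a single pass.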
As expected, high speeds between consecutive measurements are associated with high values of the accuracy field (which, counterintuitively, indicate inaccurate measurements). Approximately 1.5% of all measurements in 2017 were associated with such jumps.
What Nick does is exclude a location if it is frequently associated with big jumps, i.e. if it appears as the source or destination of a jump more than ten times.
For my own 2017 data only one location was excluded (0.02% of all measurements), so with the original parameters the filter is not very effective on this data. Moreover, which points get filtered depends on how much data is taken into account, since the ten-occurrence threshold is an absolute count rather than a proportion.

Further experimentation with the parameters would be necessary to see what fits this data best.

The difference may lie in the quality of the recording devices. I’ll check the 2015 logs; perhaps the filter will be more effective there.
In addition to filtering, Nick also downsamples the data to 5-minute windows. If the standard deviation of the positions within a fixed one-hour window (e.g. from 2 am to 3 am) is less than 10 meters, he simply takes the mean for that hour; otherwise he takes the median within each 5-minute window and assigns it to that epoch. A rough sketch of this logic is below.
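This reuses the data frame `d` from above; the crude degree-to-metre conversion and the helper names are my own approximations, not Nick's code:

```r
# Two-level downsampling: one mean fix for a "quiet" hour, otherwise a
# median fix per 5-minute window.
downsample <- function(d, sd_threshold_m = 10) {
  hours <- cut(d$time, breaks = "1 hour")
  out <- lapply(split(d, hours, drop = TRUE), function(h) {
    # approximate positional spread in metres (1 degree of latitude ~ 111 km)
    sd_m <- 111000 * sqrt(sd(h$lat)^2 +
                          (sd(h$lon) * cos(mean(h$lat) * pi / 180))^2)
    if (!is.na(sd_m) && sd_m < sd_threshold_m) {
      # quiet hour: a single mean position for the whole hour
      data.frame(time = h$time[1], lat = mean(h$lat), lon = mean(h$lon))
    } else {
      # moving hour: the median position within each 5-minute window
      wins <- cut(h$time, breaks = "5 min")
      do.call(rbind, lapply(split(h, wins, drop = TRUE), function(w)
        data.frame(time = w$time[1],
                   lat = median(w$lat), lon = median(w$lon))))
    }
  })
  do.call(rbind, out)
}
```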
In the plot below, the red dots are the raw data and the blue dots are the filtered data for my movements on February 22.
Palmius uses several imputation algorithms to fill in missing data. It took me more time than expected, but I’ve managed to implement them. You can see the exact scripts on GitHub. The results, and a few key takeaways, are below.
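The simplest baseline, and a useful point of comparison for the real algorithms, is plain linear interpolation over the 5-minute grid; the sketch below is just that generic baseline, not one of the paper's methods:

```r
# Illustration only: fill gaps in the downsampled 5-minute series by linear
# interpolation. Assumes `ds` is the output of downsample() above.
impute_linear <- function(ds) {
  ds   <- ds[order(ds$time), ]
  grid <- seq(min(ds$time), max(ds$time), by = "5 min")
  data.frame(
    time = grid,
    lat  = approx(as.numeric(ds$time), ds$lat, xout = as.numeric(grid))$y,
    lon  = approx(as.numeric(ds$time), ds$lon, xout = as.numeric(grid))$y
  )
}
```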
The map below shows the downsampled versus the imputed measurements for a period of my movement within the Netherlands. Interestingly, the averaging does seem to trace the paths quite well.