I have implemented the filtering algorithm described in Palmius et al.’s paper. Nick was kind enough to provide pseudocode in addition to his MATLAB code. His pseudocode is below:
```
Data:   D(t), recorded data points at tmin ≤ t ≤ tmax
Result: D′(t′), filtered data points at t′min ≤ t′ ≤ t′max
begin
    D′ ← D
    t′ ← t
    repeat
        t* ← t′ where dD′/dt > 100 km/h    ▹ data points with high dD′/dt
        Dsrc  ← D′(t* − 1)
        Ddest ← D′(t*)
        D̂src  ← mode(Dsrc)
        D̂dest ← mode(Ddest)
        if |D̂src| ≥ |D̂dest| and |D̂src| > 10 then
            remove all occurrences of D̂src from D′
        else if |D̂dest| ≥ 10 then
            remove all occurrences of D̂dest from D′
        end if
    until |D̂src| ≤ 10 and |D̂dest| < 10
end
```
Based on that pseudocode, I implemented the algorithm in R. You can see the repository with all my code here; check out the filtering algorithm and the downsampling algorithm in particular. A condensed sketch of the core loop follows below.
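This is a simplified, self-contained sketch of the idea, assuming a data frame `d` with `time` (POSIXct), `lat` and `lon` columns; the `geosphere` dependency and the four-decimal rounding (roughly 11 m) used to treat nearby fixes as the same location are just one way to realise the pseudocode, not Nick's code:

```r
library(geosphere)

# Iteratively remove the most frequent source/destination of >100 km/h jumps,
# following the pseudocode above. The rounding that defines "the same
# location" is my own choice.
filter_jumps <- function(d, max_speed_kmh = 100, min_count = 10) {
  repeat {
    n <- nrow(d)
    if (n < 2) break
    # speed between consecutive fixes, in km/h
    dist_km <- distHaversine(cbind(d$lon[-n], d$lat[-n]),
                             cbind(d$lon[-1], d$lat[-1])) / 1000
    dt_h  <- as.numeric(difftime(d$time[-1], d$time[-n], units = "hours"))
    jumps <- which(dist_km / dt_h > max_speed_kmh) + 1  # destination indices
    if (length(jumps) == 0) break

    key  <- paste(round(d$lat, 4), round(d$lon, 4))  # location identity
    src  <- table(key[jumps - 1])                    # jump sources
    dest <- table(key[jumps])                        # jump destinations
    if (max(src) >= max(dest) && max(src) > min_count) {
      d <- d[key != names(which.max(src)), ]   # drop the modal source everywhere
    } else if (max(dest) >= min_count) {
      d <- d[key != names(which.max(dest)), ]  # drop the modal destination
    } else {
      break  # neither mode is frequent enough: the pseudocode's `until`
    }
  }
  d
}
```

Note that removing points can create new high-speed transitions between the points that remain, which is why the whole thing has to run inside a `repeat` rather than a single pass.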
As expected, high speeds between consecutive measurements are associated with high values of the accuracy field (which, counterintuitively, indicate inaccurate measurements). Approximately 1.5% of all measurements in 2017 were associated with such jumps.
What Nick does is exclude a location if it is frequently associated with big jumps, i.e. if it appears as the source or destination of a jump more than ten times.
For my own 2017 data only one location was excluded (0.02% of all measurements), so with the original parameters the filter is not very effective on this data. Moreover, which points get filtered depends on how much data is taken into account, since the ten-occurrence threshold is an absolute count rather than a proportion.

Further experimentation with the parameters would be necessary to see what fits this data best.

The difference may lie in the quality of the recording devices. I’ll check the 2015 logs; perhaps the filter will be more effective there.
In addition to filtering, Nick also downsamples the data to 5-minute windows. If the standard deviation of the positions within a fixed one-hour window (e.g. from 2 am to 3 am) is less than 10 meters, he simply takes the mean for that hour; otherwise he takes the median within each 5-minute window and assigns it to that epoch. A rough sketch of this logic is below.
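This reuses the data frame `d` from above; the crude degree-to-metre conversion and the helper names are my own approximations, not Nick's code:

```r
# Two-level downsampling: one mean fix for a "quiet" hour, otherwise a
# median fix per 5-minute window.
downsample <- function(d, sd_threshold_m = 10) {
  hours <- cut(d$time, breaks = "1 hour")
  out <- lapply(split(d, hours, drop = TRUE), function(h) {
    # approximate positional spread in metres (1 degree of latitude ~ 111 km)
    sd_m <- 111000 * sqrt(sd(h$lat)^2 +
                          (sd(h$lon) * cos(mean(h$lat) * pi / 180))^2)
    if (!is.na(sd_m) && sd_m < sd_threshold_m) {
      # quiet hour: a single mean position for the whole hour
      data.frame(time = h$time[1], lat = mean(h$lat), lon = mean(h$lon))
    } else {
      # moving hour: the median position within each 5-minute window
      wins <- cut(h$time, breaks = "5 min")
      do.call(rbind, lapply(split(h, wins, drop = TRUE), function(w)
        data.frame(time = w$time[1],
                   lat = median(w$lat), lon = median(w$lon))))
    }
  })
  do.call(rbind, out)
}
```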
In the plot below, the red dots are the raw data and the blue dots are the filtered data for my movements on February 22.
Palmius uses several imputation algorithms to fill in missing data. It took me more time than expected, but I’ve managed to implement them. You can see the exact scripts on GitHub. The results, and a few key takeaways, are below.
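The simplest baseline, and a useful point of comparison for the real algorithms, is plain linear interpolation over the 5-minute grid; the sketch below is just that generic baseline, not one of the paper's methods:

```r
# Illustration only: fill gaps in the downsampled 5-minute series by linear
# interpolation. Assumes `ds` is the output of downsample() above.
impute_linear <- function(ds) {
  ds   <- ds[order(ds$time), ]
  grid <- seq(min(ds$time), max(ds$time), by = "5 min")
  data.frame(
    time = grid,
    lat  = approx(as.numeric(ds$time), ds$lat, xout = as.numeric(grid))$y,
    lon  = approx(as.numeric(ds$time), ds$lon, xout = as.numeric(grid))$y
  )
}
```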
The map below shows the downsampled versus the imputed measurements for a period of my movement within the Netherlands. Interestingly, the averaging does seem to trace the paths quite well.