This is a brief, informal document summarizing the work that I’ve done over the summer. Crystal requested that I do something like this before writing a formal article (which I will be working on after this). It essentially provides an overview of how interpolation methods were identified and how they were tested. Some of the results are also presented. The formal write-up will include more detailed explanations of a lot of the techniques presented here.
The objectives of the project included
The paper A stable and fast implementation of natural neighbor interpolation by Liang & Hale (2010)1 does a good job articulating an important distinction that should be made when considering various interpolation methods. Specifically, the authors note that some interpolation methods use scattered data. These types of interpolants accept pairs of the form \((\mathbf{x}_i, f_i)\) for \(i=\{1,2,...\}\) such that \(\mathbf{x}_i \in \mathbb{R}^n\) is a n-tuple and \(f_i\in \mathbb{R}\) is the corresponding real value. Typically, this data would represent some characteristic of a multidimensional surface (for example, rainfall values over a mapped area). Hence these particular interpolating methods (i.e. ones that rely on scattered data input) wouldn’t work well with trajectory data.
Not appropriate for trajectory or time series data since it typically interpolates an atmospheric variable in the form \(f(x,y)\). Typically \(f\) gives some parameter which can be used with coordinates \(x,y\) to create a contour plot.
This method includes linear and cubic spline interpolation. It can be applied to both the time series and trajectory data.
This method can be used on trajectory and time series data.
Conventionally, bilinear interpolation would not be used with trajectory or time series data; however, the principle behind it (performing linear interpolation in both directions) could be adapted to compute interpolants on the desired datasets.
Kriging is used when the data reflects output fitted to a particular map. It would be used for reasons similar to that of Barnes interpolation.
This method would work best with a “data cloud” through which one constructs an optimal curve. Hence the format of the trajectory data makes it difficult to use with principal curve detection.
This method is outlined in the paper Frequent Trajectory Mining on GPS Data (2010) by Savage et al.2 It could be appropriately applied to trajectory data; however, the algorithm would have to be reconstructed since it doesn’t seem the authors have published any code related to their study.
These methods stand as potential means of interpolating trajectory data.
This method can be applied to both time series and trajectory data.
This method relies on an implementation of a Voronoi diagram. These diagrams are constructed from scattered data. Hence, it wouldn’t work well with trajectory data
This method also applies to scattered data. It is similar to natural neighbor interpolation in that it implements a weighting procedure.
GPS Trajectory Data Enrichment Based on a Latent Statistical Model (2016) by Kinoshita et al.3 outlines the latent statistical model. The model estimates the mode of transportation at various parts of the trajectory and uses such information to yield more accurate interpolations. Their experiments report a 78% accuracy. So, this may be a potential method to implement; however, it does not seem that the authors have published any code accessible to the public. Nonetheless, their paper includes enough detail regarding model construction to build a separate implementation.
This method can work with both time series and trajectory data.
Each interpolation method was tested on a total of 4,000 trajectories. Initially, I selected 100 unique trajectories from the GeoLife dataset. These trajectories are listed below by their filenames and associated file size. The trajectories were intentionally selected to represent a range of file sizes.
## File Size (KB)
## 1 20081026081229.plt 203.7949
## 2 20081031232033.plt 215.9160
## 3 20081025010205.plt 227.8887
## 4 20081206035334.plt 230.0449
## 5 20090127012158.plt 230.9326
## 6 20081122023300.plt 232.1484
## 7 20090128072557.plt 234.1123
## 8 20081001225205.plt 252.4590
## 9 20081109031933.plt 286.7744
## 10 20081114001504.plt 290.3438
## 11 20081024000805.plt 299.4229
## 12 20090612220336.plt 301.8906
## 13 20090129053303.plt 304.6885
## 14 20090131002206.plt 345.5342
## 15 20081115003535.plt 355.5986
## 16 20081105110052.plt 359.8379
## 17 20090226225533.plt 361.6768
## 18 20090702022530.plt 362.4785
## 19 20090628005229.plt 364.3945
## 20 20090130224933.plt 364.7637
## 21 20090702022456.plt 366.0820
## 22 20090208044718.plt 373.2598
## 23 20090309004221.plt 377.2705
## 24 20081026024152.plt 383.0693
## 25 20081024173348.plt 390.9453
## 26 20081214012937.plt 403.7100
## 27 20090116013541.plt 408.5947
## 28 20081231135628.plt 409.9570
## 29 20071020030002.plt 414.0293
## 30 20080928120641.plt 414.8506
## 31 20081206041700.plt 420.2930
## 32 20090125070926.plt 429.2109
## 33 20081209002004.plt 437.6240
## 34 20090315093133.plt 445.6934
## 35 20081024234405.plt 448.1279
## 36 20090204043201.plt 453.7969
## 37 20081121213904.plt 457.5312
## 38 20081001005535.plt 458.9785
## 39 20090413004238.plt 475.3369
## 40 20071024084732.plt 476.6318
## 41 20081213092204.plt 478.0029
## 42 20080618003409.plt 493.2510
## 43 20090301005903.plt 497.9541
## 44 20090322104032.plt 502.5664
## 45 20071229083859.plt 508.1064
## 46 20081202160051.plt 519.4492
## 47 20090214045230.plt 533.0000
## 48 20090221034838.plt 535.0000
## 49 20090120055702.plt 562.3369
## 50 20090120233215.plt 565.3965
## 51 20080929042557.plt 594.5996
## 52 20081101010123.plt 624.5342
## 53 20090226005433.plt 634.9443
## 54 20080615221713.plt 671.6953
## 55 20080619100408.plt 685.3506
## 56 20090120002837.plt 694.1084
## 57 20090109135227.plt 704.0645
## 58 20110912053207.plt 733.8496
## 59 20081227062809.plt 735.9043
## 60 20090307044707.plt 782.8330
## 61 20090528012914.plt 821.2100
## 62 20080927233805.plt 843.3311
## 63 20090407234410.plt 864.4072
## 64 20090124003534.plt 874.8311
## 65 20071017220238.plt 885.8301
## 66 20090403011657.plt 892.1387
## 67 20081109055559.plt 927.0039
## 68 20070921120306.plt 955.7158
## 69 20071228075610.plt 960.7109
## 70 20090214022413.plt 979.1143
## 71 20090521234308.plt 1049.2051
## 72 20071026214019.plt 1065.8379
## 73 20090113112028.plt 1096.7139
## 74 20090520231518.plt 1103.1709
## 75 20090323212145.plt 1150.8281
## 76 20081002160000.plt 1442.1582
## 77 20081116064735.plt 1456.9102
## 78 20110828143340.plt 1519.2480
## 79 20110911000506.plt 1527.4258
## 80 20080618160000.plt 1547.2656
## 81 20080801023537.plt 1557.4189
## 82 20090709003722.plt 1599.8350
## 83 20090218045104.plt 1720.7959
## 84 20071021110759.plt 2077.4229
## 85 20081114124439.plt 2215.4150
## 86 20090717132508.plt 2236.1680
## 87 20111231103152.plt 2387.5195
## 88 20080619160000.plt 2411.0020
## 89 20080930203247.plt 2524.8809
## 90 20090824031656.plt 2565.2549
## 91 20080929160000.plt 2588.6162
## 92 20080403160000.plt 2645.2207
## 93 20081001230824.plt 2885.6201
## 94 20110522080505.plt 3065.3135
## 95 20080928160000.plt 3624.9648
## 96 20111005004827.plt 3635.6660
## 97 20111022031803.plt 3947.2568
## 98 20090124065103.plt 4059.9131
## 99 20120128104325.plt 4315.8643
## 100 20081219114010.plt 5796.9873
A gap was created for each unique trajectory by removing a certain number of points. The number of removed points was a part of the sequence 50, 100, …, 2000. Thus, 40 artificial trajectories resulted from each original trajectory (one for each gap length). At this point, I could linearize the trajectories using the Hilbert space-filling curve, leading to a second dataset. The distinction between time series and trajectory data helped to satisfy the second objective of this project. Each interpolation method (cubic spline, linear, local regression, nearest neighbor, and polynomial) was applied to each file in the trajectory dataset and the time series dataset. An error analysis on the resulting files returned an inaccuracy value which allowed for cross-analysis.
This error analysis was developed in order to calculate an error value that could be used to compare different interpolation methods.
Consider two datasets (either trajectory or time series datasets): the original dataset, denoted \(t\), and the interpolated dataset, denoted \(t'\). If \(p_i, p_{i+1} \in t\) are consecutive points in the original dataset and \(p_i' \in t'\) is the corresponding point (at index \(i\)) in the interpolated dataset then the standard error, \(e_s\) is given by \[ e_s = \frac{\sum d(p_i, p_i')}{\sum d(p_i, p_{i+1})} \] where \(d()\) returns the Euclidian distance between two points. Note that \(e_s\) is unitless because the units of the quantities \(\sum d(p_i, p_i')\) and \(\sum d(p_i, p_i')\) cancel each other. The comparative analysis between interpolation methods is possible from the value returned by \(e_s\).
Also, in the plot above, when the cursor hovers over a particular point, a text box will appear. The bottom two lines of the text box show the x and y values and indicate the interplation method that was used to generate that particular data. The two most inaccurate interpolation methods were cubic spline and LOESS (both when applied to trajectories; LOESS is the blue plot and cubic trajectory is the red plot). Overall, we can observe that cubic spline, linear, and nearest neighbor performed very well when applied to time series.
The plot below includes the same results but with the cubic trajectory and LOESS trajectory results removed.
In this plot, it is a little easier to see that the linear and nearest neighbor interpolation methods applied to time series performed best. However, it is not immediately evident is better. We can visualize the data differently by plotting the mean of the standard error values.
The bar plot shows the three most accurate interpolation methods. Here, it is evident that nearest neighbor yielded the most accurate results. Also, the plot includes error bars, but they are so small it is difficult to differentiate them. Even though the relative size of the error bars suggest that the results are statistically significant, we can perform a t-test to determine the extent of this significance.
Performing a two sample t-test between the inaccuracy values of the linear and nearest neighbor time series data returns the following
##
## Welch Two Sample t-test
##
## data: LinerrorTS$Inaccuracy and NNerrorTS$Inaccuracy
## t = 2.72, df = 7964.2, p-value = 0.006542
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2830415 1.7435553
## sample estimates:
## mean of x mean of y
## 10.927609 9.914311
Since the t statistic is outside of the range defined by the 95% confidence interval and since the p-value is less than 0.05, we can conclude that the two methods returned statistically different results. Nearest neighbor interpolation with the time series data appears to be the most accurate method of interpolation.
From these results, we can make several observations
The original objectives have been satisfied. The following describes how they have been satisfied.
We found that interpolating time series data with a nearest neighbor algorithm yielded much more accurate results than interpolating with a linear model.
We found that it is best to apply interpolation after generating a time series.
We created a standard experimental procedure for testing interpolation methods. All of our tests have been implemented in Java. The files used in the experimentation can be found on GitHub. The READE.md file of the repository includes extensive documentation on all of the programs so that they can be easily modified as needed.
Liang, L., & Hale, D. (2010). A stable and fast implementation of natural neighbor interpolation. http://www.cwp.mines.edu/Meetings/Project10/cwp-657.pdf.↩
Savage, N. S., Nishimura, S., Chavez, N. E., & Yan, X. (2010, November). Frequent trajectory mining on GPS data. In Proceedings of the 3rd International Workshop on Location and the Web (p. 3). ACM.↩
Kinoshita, A., Takasu, A., Aihara, K., Ishii, J., Kurasawa, H., Sato, H., … & Adachi, J. (2016). GPS Trajectory Data Enrichment Based on a Latent Statistical Model.↩