Interpolation Methods

This is a brief, informal document summarizing the work that I’ve done over the summer. Crystal requested that I do something like this before writing a formal article (which I will be working on after this). It essentially provides an overview of how interpolation methods were identified and how they were tested. Some of the results are also presented. The formal write-up will include more detailed explanations of a lot of the techniques presented here.

Objective

The objectives of the project included

Finding interpolation methods that yielded more accurate results than that of linearly interpolated trajectories
Determining the best stage of preprocessing to apply interpolation
Creating a framework from which interpolation methods can be assessed and compared

Notes and Remarks on Potential Interpolation Methods

The paper A stable and fast implementation of natural neighbor interpolation by Liang & Hale (2010)¹ does a good job articulating an important distinction that should be made when considering various interpolation methods. Specifically, the authors note that some interpolation methods use scattered data. These types of interpolants accept pairs of the form \((\mathbf{x}_i, f_i)\) for \(i=\{1,2,...\}\) such that \(\mathbf{x}_i \in \mathbb{R}^n\) is a n-tuple and \(f_i\in \mathbb{R}\) is the corresponding real value. Typically, this data would represent some characteristic of a multidimensional surface (for example, rainfall values over a mapped area). Hence these particular interpolating methods (i.e. ones that rely on scattered data input) wouldn’t work well with trajectory data.

Barnes Interpolation

Not appropriate for trajectory or time series data since it typically interpolates an atmospheric variable in the form \(f(x,y)\). Typically \(f\) gives some parameter which can be used with coordinates \(x,y\) to create a contour plot.

Spline Interpolation

This method includes linear and cubic spline interpolation. It can be applied to both the time series and trajectory data.

Polynomial Interpolation

This method can be used on trajectory and time series data.

Bilinear

Conventionally, bilinear interpolation would not be used with trajectory or time series data; however, the principle behind it (performing linear interpolation in both directions) could be adapted to compute interpolants on the desired datasets.

Kriging

Kriging is used when the data reflects output fitted to a particular map. It would be used for reasons similar to that of Barnes interpolation.

Principal Curve Detection

This method would work best with a “data cloud” through which one constructs an optimal curve. Hence the format of the trajectory data makes it difficult to use with principal curve detection.

Frequent Trajectory Mining

This method is outlined in the paper Frequent Trajectory Mining on GPS Data (2010) by Savage et al.² It could be appropriately applied to trajectory data; however, the algorithm would have to be reconstructed since it doesn’t seem the authors have published any code related to their study.

Neural Network/Machine Learning and Probabilistic Modeling

These methods stand as potential means of interpolating trajectory data.

Nearest Neighbor

This method can be applied to both time series and trajectory data.

Natural Neighbor

This method relies on an implementation of a Voronoi diagram. These diagrams are constructed from scattered data. Hence, it wouldn’t work well with trajectory data

Inverse Distance Weighting

This method also applies to scattered data. It is similar to natural neighbor interpolation in that it implements a weighting procedure.

Latent Statistical Model

GPS Trajectory Data Enrichment Based on a Latent Statistical Model (2016) by Kinoshita et al.³ outlines the latent statistical model. The model estimates the mode of transportation at various parts of the trajectory and uses such information to yield more accurate interpolations. Their experiments report a 78% accuracy. So, this may be a potential method to implement; however, it does not seem that the authors have published any code accessible to the public. Nonetheless, their paper includes enough detail regarding model construction to build a separate implementation.

Local regression (LOESS)

This method can work with both time series and trajectory data.

Experimentation

Each interpolation method was tested on a total of 4,000 trajectories. Initially, I selected 100 unique trajectories from the GeoLife dataset. These trajectories are listed below by their filenames and associated file size. The trajectories were intentionally selected to represent a range of file sizes.

##     File               Size (KB)
## 1   20081026081229.plt  203.7949
## 2   20081031232033.plt  215.9160
## 3   20081025010205.plt  227.8887
## 4   20081206035334.plt  230.0449
## 5   20090127012158.plt  230.9326
## 6   20081122023300.plt  232.1484
## 7   20090128072557.plt  234.1123
## 8   20081001225205.plt  252.4590
## 9   20081109031933.plt  286.7744
## 10  20081114001504.plt  290.3438
## 11  20081024000805.plt  299.4229
## 12  20090612220336.plt  301.8906
## 13  20090129053303.plt  304.6885
## 14  20090131002206.plt  345.5342
## 15  20081115003535.plt  355.5986
## 16  20081105110052.plt  359.8379
## 17  20090226225533.plt  361.6768
## 18  20090702022530.plt  362.4785
## 19  20090628005229.plt  364.3945
## 20  20090130224933.plt  364.7637
## 21  20090702022456.plt  366.0820
## 22  20090208044718.plt  373.2598
## 23  20090309004221.plt  377.2705
## 24  20081026024152.plt  383.0693
## 25  20081024173348.plt  390.9453
## 26  20081214012937.plt  403.7100
## 27  20090116013541.plt  408.5947
## 28  20081231135628.plt  409.9570
## 29  20071020030002.plt  414.0293
## 30  20080928120641.plt  414.8506
## 31  20081206041700.plt  420.2930
## 32  20090125070926.plt  429.2109
## 33  20081209002004.plt  437.6240
## 34  20090315093133.plt  445.6934
## 35  20081024234405.plt  448.1279
## 36  20090204043201.plt  453.7969
## 37  20081121213904.plt  457.5312
## 38  20081001005535.plt  458.9785
## 39  20090413004238.plt  475.3369
## 40  20071024084732.plt  476.6318
## 41  20081213092204.plt  478.0029
## 42  20080618003409.plt  493.2510
## 43  20090301005903.plt  497.9541
## 44  20090322104032.plt  502.5664
## 45  20071229083859.plt  508.1064
## 46  20081202160051.plt  519.4492
## 47  20090214045230.plt  533.0000
## 48  20090221034838.plt  535.0000
## 49  20090120055702.plt  562.3369
## 50  20090120233215.plt  565.3965
## 51  20080929042557.plt  594.5996
## 52  20081101010123.plt  624.5342
## 53  20090226005433.plt  634.9443
## 54  20080615221713.plt  671.6953
## 55  20080619100408.plt  685.3506
## 56  20090120002837.plt  694.1084
## 57  20090109135227.plt  704.0645
## 58  20110912053207.plt  733.8496
## 59  20081227062809.plt  735.9043
## 60  20090307044707.plt  782.8330
## 61  20090528012914.plt  821.2100
## 62  20080927233805.plt  843.3311
## 63  20090407234410.plt  864.4072
## 64  20090124003534.plt  874.8311
## 65  20071017220238.plt  885.8301
## 66  20090403011657.plt  892.1387
## 67  20081109055559.plt  927.0039
## 68  20070921120306.plt  955.7158
## 69  20071228075610.plt  960.7109
## 70  20090214022413.plt  979.1143
## 71  20090521234308.plt 1049.2051
## 72  20071026214019.plt 1065.8379
## 73  20090113112028.plt 1096.7139
## 74  20090520231518.plt 1103.1709
## 75  20090323212145.plt 1150.8281
## 76  20081002160000.plt 1442.1582
## 77  20081116064735.plt 1456.9102
## 78  20110828143340.plt 1519.2480
## 79  20110911000506.plt 1527.4258
## 80  20080618160000.plt 1547.2656
## 81  20080801023537.plt 1557.4189
## 82  20090709003722.plt 1599.8350
## 83  20090218045104.plt 1720.7959
## 84  20071021110759.plt 2077.4229
## 85  20081114124439.plt 2215.4150
## 86  20090717132508.plt 2236.1680
## 87  20111231103152.plt 2387.5195
## 88  20080619160000.plt 2411.0020
## 89  20080930203247.plt 2524.8809
## 90  20090824031656.plt 2565.2549
## 91  20080929160000.plt 2588.6162
## 92  20080403160000.plt 2645.2207
## 93  20081001230824.plt 2885.6201
## 94  20110522080505.plt 3065.3135
## 95  20080928160000.plt 3624.9648
## 96  20111005004827.plt 3635.6660
## 97  20111022031803.plt 3947.2568
## 98  20090124065103.plt 4059.9131
## 99  20120128104325.plt 4315.8643
## 100 20081219114010.plt 5796.9873

A gap was created for each unique trajectory by removing a certain number of points. The number of removed points was a part of the sequence 50, 100, …, 2000. Thus, 40 artificial trajectories resulted from each original trajectory (one for each gap length). At this point, I could linearize the trajectories using the Hilbert space-filling curve, leading to a second dataset. The distinction between time series and trajectory data helped to satisfy the second objective of this project. Each interpolation method (cubic spline, linear, local regression, nearest neighbor, and polynomial) was applied to each file in the trajectory dataset and the time series dataset. An error analysis on the resulting files returned an inaccuracy value which allowed for cross-analysis.

Error Analysis

This error analysis was developed in order to calculate an error value that could be used to compare different interpolation methods.

Consider two datasets (either trajectory or time series datasets): the original dataset, denoted \(t\), and the interpolated dataset, denoted \(t'\). If \(p_i, p_{i+1} \in t\) are consecutive points in the original dataset and \(p_i' \in t'\) is the corresponding point (at index \(i\)) in the interpolated dataset then the standard error, \(e_s\) is given by \[ e_s = \frac{\sum d(p_i, p_i')}{\sum d(p_i, p_{i+1})} \] where \(d()\) returns the Euclidian distance between two points. Note that \(e_s\) is unitless because the units of the quantities \(\sum d(p_i, p_i')\) and \(\sum d(p_i, p_i')\) cancel each other. The comparative analysis between interpolation methods is possible from the value returned by \(e_s\).

Results

After calculating the standard error for each file, I grouped the data according to gap length (so, each gap length had 100 error values associated with it) and then took the average of each group. These averages are plotted on the figure above. Note that TS is an abbreviation for time series, and the “inaccuracy” values plotted on the y-axis represent the mean standard error for each particular gap length. Also, the inaccuracy values were scaled by a factor of one hundred for formatting purposes (so for example, the number 20 on the y-axis denotes an inaccuracy value of 2000). Also, the polynomial interpolation method was excluded since the curve fitting algorithm wouldn’t run on the experimental dataset in a feasable amount of time (time estimates suggest that it would have taken over 60 days). Additionally, the results of LOESS interpolation with time series included incomplete values (either the interpolant returned infinite or undefined values) - so it has be excluded from the plot. Regardless, enough data have been collected on both methods to determine that they are not more accurate than the linear or nearest neighbor methods applied to time series.

Also, in the plot above, when the cursor hovers over a particular point, a text box will appear. The bottom two lines of the text box show the x and y values and indicate the interplation method that was used to generate that particular data. The two most inaccurate interpolation methods were cubic spline and LOESS (both when applied to trajectories; LOESS is the blue plot and cubic trajectory is the red plot). Overall, we can observe that cubic spline, linear, and nearest neighbor performed very well when applied to time series.

The plot below includes the same results but with the cubic trajectory and LOESS trajectory results removed.

In this plot, it is a little easier to see that the linear and nearest neighbor interpolation methods applied to time series performed best. However, it is not immediately evident is better. We can visualize the data differently by plotting the mean of the standard error values.

The bar plot shows the three most accurate interpolation methods. Here, it is evident that nearest neighbor yielded the most accurate results. Also, the plot includes error bars, but they are so small it is difficult to differentiate them. Even though the relative size of the error bars suggest that the results are statistically significant, we can perform a t-test to determine the extent of this significance.

Performing a two sample t-test between the inaccuracy values of the linear and nearest neighbor time series data returns the following

## 
##  Welch Two Sample t-test
## 
## data:  LinerrorTS$Inaccuracy and NNerrorTS$Inaccuracy
## t = 2.72, df = 7964.2, p-value = 0.006542
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.2830415 1.7435553
## sample estimates:
## mean of x mean of y 
## 10.927609  9.914311

Since the t statistic is outside of the range defined by the 95% confidence interval and since the p-value is less than 0.05, we can conclude that the two methods returned statistically different results. Nearest neighbor interpolation with the time series data appears to be the most accurate method of interpolation.

Observations

From these results, we can make several observations

All interpolation methods yielded more accurate results when applied to time series (as opposed to trajectories). So, this is the optimal preprocessing step at which interpolation should occur
The error resulting from linear and nearest neighbor interpolation increased by a linear amount as the gap size increased (this observation can be verified via linear regression tests)
Nearest neighbor interpolation with time series returned the most accurate results
Deterministic, global interpolators (cubic spline, LOESS, polynomial) yielded larger errors than deterministic, local interpolators (linear, nearest neighbor). This observation will be explained in more detail

Fulfilling Project Objectives

The original objectives have been satisfied. The following describes how they have been satisfied.

Objective 1

We found that interpolating time series data with a nearest neighbor algorithm yielded much more accurate results than interpolating with a linear model.

Objective 2

We found that it is best to apply interpolation after generating a time series.

Objective 3

We created a standard experimental procedure for testing interpolation methods. All of our tests have been implemented in Java. The files used in the experimentation can be found on GitHub. The READE.md file of the repository includes extensive documentation on all of the programs so that they can be easily modified as needed.

Liang, L., & Hale, D. (2010). A stable and fast implementation of natural neighbor interpolation. http://www.cwp.mines.edu/Meetings/Project10/cwp-657.pdf.↩
Savage, N. S., Nishimura, S., Chavez, N. E., & Yan, X. (2010, November). Frequent trajectory mining on GPS data. In Proceedings of the 3rd International Workshop on Location and the Web (p. 3). ACM.↩
Kinoshita, A., Takasu, A., Aihara, K., Ishii, J., Kurasawa, H., Sato, H., … & Adachi, J. (2016). GPS Trajectory Data Enrichment Based on a Latent Statistical Model.↩