% Spatio-Temporal Imputation Methods: A Case Study
% Lingbing Feng
% 2012/10/29

Missing Value

Missing Piece

Agenda

Motivation
- What is Spatio-Temporal & Imputation
- Why it it important & difficult
- What is our goal?
Related Literature
Methods
- SVD
- CUTOFF
Case Study in Australia
Potential and Future Work

Spatio-Temporal Data

Spatio-temproral data: data with spatial (geographic), temporal components, and attributes.

Where —— space
when —— time
what —— attributes

You can think it as a set of multivariate time series that are geographically distributed

Examples

Climatic readings ( temperature, rainfall, riverflow, etc. ) for a number of nearby stations
Satellite images of parts of the earth
Election results for voting districts and a number of consecutive elections
GPS tracks for people or animal possibly with additional sensor readings
Desease outbreaks or volcano eruption
…

Why impute?

For data structure consistency, and ultimately for modelling
Model estimating procedure assumes fairly completeness upon the data matrix
If there were missing values in your data, sooner or later, you would have to face the music.

It is difficult when

You take it seriously rather than saying:“Here we have chosen to ignore the missing values”
** or “For the sake of convenience, we case-wise delete all the missing values”
You take it very seriously, you'd like to impute by incorporating as much valuable information as possible.
You take it extremly seriously and you want an algorithm which is intuitively simple, credible, and computational efficient.

Our Goal

A simple, intuitive method
Takes both spatial and temporal information into consideration
Computationally efficient

What have we achieved

Goal 1 (pre-modelling)
Goal 2 (to some extent)
Goal 3 (with an R package)

Related Literature

Schneider (2001) proposes a parametric method that makes use of E-M algorithm and ridge regression to estimate the mean and covariance matrix of the data iteratively. (spatially weak, and no package supported)
Kondrashov et al. (2006) developes an novel, iterative form of M-SSA (Multi-channel Singualr Spectrum Analysis) approach which utilizes temporal, as well as spatial correlations (as they claimed so) to fill in gaps.
It raised a heat debate (Schneider (2006)'s interesting comment and Kondrashov et al.'s rejoinder)

We found

K's idea comes down to matrix decomposition and EM algorithm
The cross-validation procedure used in their paper is vague and difficult to understand
Heavily relying on the MAR (Missing At Random) assumption, which sometimes, is not the truth.

SVD imputation

Fuentes et al.(2006): it is stable and reasonble
partly nonparametric, E-M like, counld be used before modelling
standard imputation method in the SpatioTemporal R pakage

** We consider it as the competetor **

Methods: about SVD

** The “crown jewel” of linear algebra (Gilbert Strang，2009) **

A little more

feature extraction
dimensionality reduction
Its relationship with PCA (Principal Component Analysis)
- PCA is done by eigenvalue decomposition on the covariance matrix
- normally, PCA can be done easily by doing SVD on the original data matrix

how does `SVDmiss()` work

Iterative process of regression and SVD
initialising: replace NAs by 0s, a column of rowmeans across sites: $ mu_{1} $
Reg(1): each column on $ mu_{1} $, replace NAs by fitted values
SVD the matrix and get J LeftEigen vectors
Reg(2): each column on the J vectors, replace NAs
Iteration until convergence

Comments on `SVDmiss()`

Works well when most sites share similar temporal patterns
Be watchful when obs have a substantial variation around the trend, the SVD approx might be a noisy representation of the real pattern
I reckon the SVD method as being partly spatio-temporal imputation

Our method: a general idea:

Nearby obs tend to be more alike than those far apart. (Cressie & Wikle, 2011)
Missing values are imputed by (available) obs from its neighbours and from similiar time points.
- Spatial Closeness: (cutoff value $\rho$) distance/correlation
- Temporal closeness: same temporal attribute (month, season, etc)

Illustration

Some Notes

Some stations may have no reference file when the cut-off value is big. If so, we set the nearest station as the reference station.
However, what if there is no values for $ R $. It'd be circumvented by tracking down the reference list to find a station that is not missing in that time point.
A whole candidate row missing is the only invalid case.

Case Study in Australia: Data and EDA

Murray-Darling Basin, 78 gauging stations
100 years' monthly data (Jan 1911 - Dec 2010)
Rainfall

Representation (Space-wide format)

Different columns reflect different stations

        date X048039 X049023 X050004 X050018 X050028 X050031 X050052
1 1911-01-01    33.2    58.3      NA    80.2    95.0   189.4   116.4
2 1911-02-01   149.5   133.6      NA    60.2    70.1    57.2    68.9
3 1911-03-01    14.0    22.8      NA    17.8    41.2    30.0    42.4
4 1911-04-01     0.0     0.8     8.1     0.0     0.0     0.0     0.4
5 1911-05-01    47.4    39.9    42.0    72.0    61.3    45.5    31.1
6 1911-06-01     6.6    23.0    16.5    16.3    12.9    30.5    13.4

location of stations

Missing Pattern (spatially)

plot of chunk unnamed-chunk-3

Structure Heatmap (stations reordered by missingness)

source(file = "functions.R")
HeatStruct(hqmr.cube.reorder) + theme(axis.text.x = element_blank())

plot of chunk unnamed-chunk-4

Cross-Validation

grid-wise
- 10 fold on the columns
- different row grid size (1 year, 5-year, 10-year)
randomization (both on rows and columns)
- month information has been kept while doing randomization
RMSE for each grid (and only for that values were not missing before imputing)
Mean-RMSE for each CUTOFF value from 0.55-0.95, by 0.05

CV results

Performance (CUTOFF)

plot of chunk unnamed-chunk-5

Comparison

plot of chunk unnamed-chunk-6

Comparison (more)

plot of chunk unnamed-chunk-7

Comments

both method capture temporal patterns well
much overlapping, similar imputation performace
which is better?

Simulation should be able to:

Capture the Missing pattern (block Missing and dot missing)
Knowe the real values in advance (for evaluating)

Reviewing the structure heatmap

plot of chunk unnamed-chunk-8

Structure Analysis for simulation

plot of chunk unnamed-chunk-9

Simulation workhorse (a missing vector): big idea

** Many thanks to Alan for the big idea and to Gen for helping me much on the programme!**

If we had a missing vector (with 0s and NAs) which indicates where obs are missing
and it's able to hold diverse missing patterns
combine and superimpose (an addition operation on the complete chunk)

Algorithm: `MissingSimulation()`

n: length of the vector
maxlen: max length of missing block
prob: fixed probability of being missing for each obs
cnst: a constant for controlling the probability increment, which is \[ prob^{*}=\frac{count+cnst}{maxlen+cnst} \]

Specification for our data

Try `MissingSimulation()`

 [1] 0 0 1 1 0 0 1 1 1 0 0 1 0 1 0 1 1 1 0 1

 [1] 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0

And you can play around with it to fit your needs.

Integration to our data problem

Give it a go

HeatStruct(BlockMissing()) + theme(axis.text.x = element_blank())

plot of chunk unnamed-chunk-11

Evaluating Imputation Performance for both method

SVD ($ J=5 $), CUTOFF($ \rho = 0.75 $)
5 sets of simulations: 50, 100, 500, 1000, 5000
RMSE (Root mean sum of error)being used as accuracy measurement.
Elapsed time being recorded for computational efficiency

# Comparison

Choosing $ \rho $ by simulation

100 simulated data matrix with mising value
cutoff value from 0.45 to 0.95, by 0.001
RMSE for each cutoff
standard error being recording for each cutoff value after imputation
compare to SVD

Choosing $ \rho $ by simulation (cont'd)

Conclusion

We provide a whole set of tools for doing spatio-temporal imputation
The CUTOFF method is suprior to SVD method in terms of accuracy and efficiency
It's also applicable for cross-sectional and longitudinal data (we believe so)
Simulation study is consistent with cross-validation results