% Spatio-Temporal Imputation Methods: A Case Study
% Lingbing Feng
% 2012/10/29
Missing Value

Agenda
Spatio-Temporal Data
Spatio-temproral data: data with spatial (geographic), temporal components, and attributes.
Where —— space
when —— time
what —— attributes
You can think it as a set of multivariate time series that are geographically distributed
Examples
Climatic readings ( temperature, rainfall, riverflow, etc. ) for a number of nearby stations
Satellite images of parts of the earth
Election results for voting districts and a number of consecutive elections
GPS tracks for people or animal possibly with additional sensor readings
Desease outbreaks or volcano eruption
…
Why impute?
For data structure consistency, and ultimately for modelling
Model estimating procedure assumes fairly completeness upon the data matrix
If there were missing values in your data, sooner or later, you would have to face the music.
It is difficult when
You take it seriously rather than saying:“Here we have chosen to ignore the missing values”
** or “For the sake of convenience, we case-wise delete all the missing values”
You take it very seriously, you'd like to impute by incorporating as much valuable information as possible.
You take it extremly seriously and you want an algorithm which is intuitively simple, credible, and computational efficient.
Our Goal
A simple, intuitive method
Takes both spatial and temporal information into consideration
Computationally efficient
What have we achieved
Related Literature
Schneider (2001) proposes a parametric method that makes use of E-M algorithm and ridge regression to estimate the mean and covariance matrix of the data iteratively. (spatially weak, and no package supported)
Kondrashov et al. (2006) developes an novel, iterative form of M-SSA (Multi-channel Singualr Spectrum Analysis) approach which utilizes temporal, as well as spatial correlations (as they claimed so) to fill in gaps.
It raised a heat debate (Schneider (2006)'s interesting comment and Kondrashov et al.'s rejoinder)
We found
K's idea comes down to matrix decomposition and EM algorithm
The cross-validation procedure used in their paper is vague and difficult to understand
Heavily relying on the MAR (Missing At Random) assumption, which sometimes, is not the truth.
SVD imputation
Fuentes et al.(2006): it is stable and reasonble
partly nonparametric, E-M like, counld be used before modelling
standard imputation method in the SpatioTemporal R pakage
** We consider it as the competetor **
Methods: about SVD
** The “crown jewel” of linear algebra (Gilbert Strang,2009) **

A little more
- feature extraction
- dimensionality reduction
- Its relationship with PCA (Principal Component Analysis)
- PCA is done by eigenvalue decomposition on the covariance matrix
- normally, PCA can be done easily by doing SVD on the original data matrix
how does SVDmiss() work
- Iterative process of regression and SVD
- initialising: replace NAs by 0s, a column of rowmeans across sites: \( mu_{1} \)
- Reg(1): each column on \( mu_{1} \), replace NAs by fitted values
- SVD the matrix and get J LeftEigen vectors
- Reg(2): each column on the J vectors, replace NAs
- Iteration until convergence
Comments on SVDmiss()
- Works well when most sites share similar temporal patterns
- Be watchful when obs have a substantial variation around the trend, the SVD approx might be a noisy representation of the real pattern
- I reckon the SVD method as being partly spatio-temporal imputation
Our method: a general idea:
Nearby obs tend to be more alike than those far apart. (Cressie & Wikle, 2011)
Missing values are imputed by (available) obs from its neighbours and from similiar time points.
- Spatial Closeness: (cutoff value $\rho$) distance/correlation
- Temporal closeness: same temporal attribute (month, season, etc)
Illustration

Some Notes
- Some stations may have no reference file when the cut-off value is big. If so, we set the nearest station as the reference station.
- However, what if there is no values for \( R \). It'd be circumvented by tracking down the reference list to find a station that is not missing in that time point.
- A whole candidate row missing is the only invalid case.
Case Study in Australia: Data and EDA
- Murray-Darling Basin, 78 gauging stations
- 100 years' monthly data (Jan 1911 - Dec 2010)
- Rainfall
Representation (Space-wide format)
- Different columns reflect different stations
date X048039 X049023 X050004 X050018 X050028 X050031 X050052
1 1911-01-01 33.2 58.3 NA 80.2 95.0 189.4 116.4
2 1911-02-01 149.5 133.6 NA 60.2 70.1 57.2 68.9
3 1911-03-01 14.0 22.8 NA 17.8 41.2 30.0 42.4
4 1911-04-01 0.0 0.8 8.1 0.0 0.0 0.0 0.4
5 1911-05-01 47.4 39.9 42.0 72.0 61.3 45.5 31.1
6 1911-06-01 6.6 23.0 16.5 16.3 12.9 30.5 13.4
location of stations
Missing Pattern (spatially)
Structure Heatmap (stations reordered by missingness)
source(file = "functions.R")
HeatStruct(hqmr.cube.reorder) + theme(axis.text.x = element_blank())
Cross-Validation
- grid-wise
- 10 fold on the columns
- different row grid size (1 year, 5-year, 10-year)
- randomization (both on rows and columns)
- month information has been kept while doing randomization
- RMSE for each grid (and only for that values were not missing before imputing)
- Mean-RMSE for each CUTOFF value from 0.55-0.95, by 0.05
CV results

Performance (CUTOFF)
Comparison
Comparison (more)
Comments
- both method capture temporal patterns well
- much overlapping, similar imputation performace
- which is better?
Simulation should be able to:
- Capture the Missing pattern (block Missing and dot missing)
- Knowe the real values in advance (for evaluating)
Reviewing the structure heatmap
Structure Analysis for simulation
Simulation workhorse (a missing vector): big idea
** Many thanks to Alan for the big idea and to Gen for helping me much on the programme!**
- If we had a missing vector (with 0s and NAs) which indicates where obs are missing
- and it's able to hold diverse missing patterns
- combine and superimpose (an addition operation on the complete chunk)
Algorithm: MissingSimulation()
- n: length of the vector
- maxlen: max length of missing block
- prob: fixed probability of being missing for each obs
- cnst: a constant for controlling the probability increment, which is \[ prob^{*}=\frac{count+cnst}{maxlen+cnst} \]
Specification for our data

Try MissingSimulation()
[1] 0 0 1 1 0 0 1 1 1 0 0 1 0 1 0 1 1 1 0 1
[1] 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
And you can play around with it to fit your needs.
Integration to our data problem
HeatStruct(BlockMissing()) + theme(axis.text.x = element_blank())
Evaluating Imputation Performance for both method
- SVD (\( J=5 \)), CUTOFF(\( \rho = 0.75 \))
- 5 sets of simulations: 50, 100, 500, 1000, 5000
- RMSE (Root mean sum of error)being used as accuracy measurement.
- Elapsed time being recorded for computational efficiency
# Comparison

Choosing \( \rho \) by simulation
- 100 simulated data matrix with mising value
- cutoff value from 0.45 to 0.95, by 0.001
- RMSE for each cutoff
- standard error being recording for each cutoff value after imputation
- compare to SVD
Choosing \( \rho \) by simulation (cont'd)

Conclusion
- We provide a whole set of tools for doing spatio-temporal imputation
- The
CUTOFF method is suprior to SVD method in terms of accuracy and efficiency
- It's also applicable for cross-sectional and longitudinal data (we believe so)
- Simulation study is consistent with cross-validation results
Outlook and future work
- Dig more on how to collect temporal information: e.g. kernel
- Better idea of spatial closeness eagerly wanted
About slides and pkg
- Wrote by Rstudio, in Rmarkdown syntax, converted to html slides by Pandoc
- Reproduciable research and sharing
- check SPImutation in my Github (https://github.com/Lingbing)
- pkg cunrrently available by requesting
Thanks & Questions