Permutation distribution clustering consists of a clustering method for time series, in which the Hellinger square distance, a metric approximation of the Kullback-Leibler divergence, is used as a measure of dissimilarity between the permutation distribution.
To comparatively evaluate the package’s functionalities, a data set of images of SAR textures linearized by the Hilbert-Peano curves was applied. Our goal is to verify the efficiency of these techniques in the analysis of regions.
source("Band_Pompe_Analysis.R")
if(!require(pdc)) install.packages("pdc")
if(!require(NATS)) install.packages("NATS")
if(!require(raster)) install.packages("raster")
if(!require(lattice)) install.packages("lattice")
if(!require(microbenchmark)) install.packages("microbenchmark")
For this analysis, three SAR images with different regions were used, they are:
Sierra del Lacandon National Park, Guatemala (purchased April 10, 2015), available at [https://uavsar.jpl.nasa.gov/cgi-bin/product.pl?jobName=Lacand_30202_15043_ 006_150410_L090_CX_01 # data] (https://uavsar.jpl.nasa.gov/cgi-bin/product.pl?jobName=Lacand_30202_15043_ 006_150410_L090_CX_01 # data);
Oceanic regions of Cape Canaveral (acquired on September 22, 2016);
Urban area of the city of Munich, Germany (acquired on June 5, 2015).
A total of 160 samples were considered during the investigation, with 40 forest regions in Guatemala, 80 ocean regions in Cape Canaveral and 40 urban regions in the city of Munich.
#Defining the number of samples from each region analyzed
ns.guatemala = 40
dimen.guatemala = matrix(nrow = ns.guatemala, ncol = 4)
ns.canaveral.behavior1 = 40
dimen.canaveral.behavior1 = matrix(nrow = ns.canaveral.behavior1, ncol = 4)
ns.canaveral.behavior2 = 40
dimen.canaveral.behavior2 = matrix(nrow = ns.canaveral.behavior2, ncol = 4)
ns.munich = 40
dimen.munich = matrix(nrow = ns.munich, ncol = 4)
n.total = (ns.guatemala + ns.canaveral.behavior1 + ns.canaveral.behavior2 + ns.munich)
#The SAR data is available on https://drive.google.com/file/d/1jtbOcYwQfysfcUp4UhoA7lSl4_tPIqfa/view?usp=sharing and
# correspond to HHHH band of an image taken from the Cape Canaveral (acquired Sep 22, 2016)
#Ocean regions in Cape Canaveral
row1 = c(50, 100, 150, 200, 250, 350, 450, 550, 650, 750)
row2 = c(50, 100, 150, 200, 250, 300, 350, 400, 450, 550)
row3 = c(50, 150, 250, 350, 450, 550, 650, 750, 800, 850)
row4 = c(250, 350, 450, 550, 650, 750, 850, 950, 1050)
row5 = c(50, 150, 250, 350, 450, 550, 650, 750, 850, 950, 1050)
row6 = c(50, 150, 250, 350, 450, 550, 650, 750, 800, 850, 950)
row7 = c(50, 150, 250, 350, 450, 550, 650, 750, 850, 950)
cols = c(1700, 1850, 1550, 1400, 1, 200, 350, 550)
#{Behavior 1}
dimen.canaveral.behavior1[1:9,] = c(row1[1:9], rep(128, 9), rep(cols[1], 9), rep(128, 9))
dimen.canaveral.behavior1[10,] = c(row1[10], 128, 1300, 128)
dimen.canaveral.behavior1[11:19,] = c(row2[1:9], rep(128, 9), rep(cols[2], 9), rep(128, 9))
dimen.canaveral.behavior1[20,] = c(row2[10], 128, 1350, 128)
dimen.canaveral.behavior1[21:30,] = c(row3, rep(128, 10), rep(cols[3], 10), rep(128, 10))
dimen.canaveral.behavior1[31:40,] = c(row3, rep(128, 10), rep(cols[4], 10), rep(128, 10))
#{Behavior 2}
dimen.canaveral.behavior2[1:9,] = c(row4, rep(128, 9), rep(cols[5], 9), rep(128, 9))
dimen.canaveral.behavior2[10:20,] = c(row5, rep(128, 11), rep(cols[6], 11), rep(128, 11))
dimen.canaveral.behavior2[21:30,] = c(row7, rep(128, 10), rep(cols[7], 10), rep(128, 10))
dimen.canaveral.behavior2[31:40,] = c(row7, rep(128, 10), rep(cols[8], 10), rep(128, 10))
#The SAR data is available on https://drive.google.com/file/d/1pO6p_UI9Cgdci9y6jVynAv8SrrAvv7K8/view?usp=sharing and
# correspond to HHHH band of an image taken from the Munich, Germany (acquired Jun 5, 2015)
#Urban regions in Munich
row1 = seq(3000, 3950, by = 50)
row2 = rep(c(4300, 4350), 5)
row3 = c(rep(seq(2300, 2450, by = 50), 2), 2400, 2450)
cols = 400
cols2 = c(1300, 1300, 1350, 1350, 1400, 1400, 1450, 1450, 1500, 1500)
cols3 = c(rep(500, 4), rep(400, 4), rep(300, 2))
dimen.munich[1:20,] = c(row1, rep(128, 20), rep(cols, 20), rep(128, 20))
dimen.munich[21:30,] = c(row2, rep(128, 10), cols2, rep(128, 10))
dimen.munich[31:40,] = c(row3, rep(128, 10), cols3, rep(128, 10))
#Forest regions in Guatemala
row1 = seq(5150, 6100, by = 50)
row2 = seq(5200, 5650, by = 50)
row3 = seq(4100, 4200, by = 50)
row4 = seq(1000, 1150, by = 50)
cols = c(2700, 2800, 2930, 1930, 1870)
dimen.guatemala[1:20,] = c(row1, rep(128, 20), rep(cols[1], 20), rep(128, 20))
dimen.guatemala[21:30,] = c(row2, rep(128, 10), rep(cols[2], 10), rep(128, 10))
dimen.guatemala[31:33,] = c(row3, rep(128, 3), rep(cols[3], 3), rep(128, 3))
dimen.guatemala[34:37,] = c(row4, rep(128, 4), rep(cols[4], 4), rep(128, 4))
dimen.guatemala[38:40,] = c(row4[1:3], rep(128, 3), rep(cols[5], 3), rep(128, 3))
hilbertcurve = unlist(read.table("../../Data/Hilbert/HilbertCurves128.txt")) + 1
time.series = matrix(ncol = 160, nrow = 128*128)
#Guatemala
sar_data = raster(paste( "../../Data/guatemala", "/HHHH", ".grd", sep = ""))
for(j in c(1:ns.guatemala)){
img = getValuesBlock(sar_data, row = dimen.guatemala[j,1], nrows = dimen.guatemala[j,2], col = dimen.guatemala[j,3], ncols = dimen.guatemala[j,4], format = "matrix")
time.series[,j] = img[hilbertcurve]/max(img[hilbertcurve])
}
#Cape Canaveral - behavior 1
sar_data = raster(paste("../../Data/cape", "/HHHH", ".grd", sep = ""))
for(j in c(1:ns.canaveral.behavior1)){
img = getValuesBlock(sar_data, row = dimen.canaveral.behavior1[j,1], nrows = dimen.canaveral.behavior1[j,2], col = dimen.canaveral.behavior1[j,3], ncols = dimen.canaveral.behavior1[j,4], format = "matrix")
time.series[,ns.guatemala + j] = img[hilbertcurve]/max(img[hilbertcurve])
}
#Cape Canaveral - behavior 2
sar_data = raster(paste("../../Data/cape", "/HHHH", ".grd", sep = ""))
for(j in c(1:ns.canaveral.behavior2)){
img = getValuesBlock(sar_data, row = dimen.canaveral.behavior2[j,1], nrows = dimen.canaveral.behavior2[j,2], col = dimen.canaveral.behavior2[j,3], ncols = dimen.canaveral.behavior2[j,4], format = "matrix")
time.series[,(ns.canaveral.behavior1 + ns.guatemala) + j] = img[hilbertcurve]/max(img[hilbertcurve])
}
#Munich
sar_data = raster(paste("../../Data/munich", "/HHHH", ".grd", sep = ""))
for(j in c(1:ns.munich)){
img = getValuesBlock(sar_data, row = dimen.munich[j,1], nrows = dimen.munich[j,2], col = dimen.munich[j,3], ncols = dimen.munich[j,4], format = "matrix")
time.series[,(ns.canaveral.behavior1 + ns.canaveral.behavior2 + ns.guatemala) + j] = img[hilbertcurve]/max(img[hilbertcurve])
}
In this paper, is used an explicit implementation of a sorting algorithm as a tree of if-clauses with each inner node being a pairwise comparison of items and each leaf having a unique integer label between 0 and m! representing the permutation index. Since no items have to swap places, no subroutines are called, and no extra memory is required to be reserved on the stack or the heap, this implementation is more efficient. A sample output of this algorithm is shown below:
function y = get_codeword(x)
if x(3) < x(1)
if x(2) < x(1)
if x(3) < x(2)
y=1;
else
y=2;
end
else
y=3;
end
else
if x(2) < x(1)
y=4;
else
if x(3) < x(2)
y=5;
else
y=6;
end
end
end
Below we can see a comparison of the execution time between the naive implementation of the Bandt-Pompe symbolization and the “optimized” version. We evaluate time elapsed, for the system and user. Both routines were implemented in C and linked by R through the Rcpp package.
cat("\n BP distribution performance \n\n", "NATS: ", system.time(bandt.pompe(time.series[,1], 3, 1)), "\n pdc: ", system.time(codebook(time.series[,1], 3, 1)), "\n")
BP distribution performance
NATS: 0.368 0 0.368 0 0
pdc: 0.001 0 0.001 0 0
Another feature present in the package that also exists in NATS is the plot of the histogram of ordinal patterns.
x = time.series[,1]
barplot(x, xlab="Permutation Distribution")
The choice of parameters m (embedding dimension) and t (time delay) for the time-delayed embedding is crucial for the performance of the clustering. In this way, these observations are formalized in a heuristic that guides researchers in choosing an incorporation dimension, if no prior information is available. This heuristic aims to choose an incorporation dimension that is maximally expressive, that is, the DP must be maximally different from a uniform distribution and for that, permutation entropy is used.
Thus, the heuristic chooses the embedding dimension with the lowest average, normalized permutation entropy:
\[ arg\,min_{m} \sum_{P \in \mathbb{P}} \mathbb{H}(m) \]
In a similar vein, different choices of the time delay t can be selected. Therefore, the hyper-parameters selected for the analysis were:
mine = entropyHeuristic(time.series, t.min = 1, t.max = 10)
summary(mine)
Embedding dimension: 7 [ 3,4,5,6,7 ]
Time delay: 1 [ 1,2,3,4,5,6,7,8,9,10 ]
An illustration of a grid-search on the parameters m and t to determine differences between the regions is given below:
rr plot(mine)
Hellinger’s square distance is used to incorporate the permutation distributions of time series in a metric space. Let \(\mathbb{P} = (p_1, p_2, \dots , p_n)\) and \(\mathbb{Q} = (q_1, q_2, \dots , q_n)\) be two permutation distributions. The squared Hellinger distance is:
\[ D(P, Q) = \frac{1}{\sqrt{2}} || \sqrt{P} - \sqrt{Q} ||^2 \]
The squared Hellinger distance can be derived by means of a Taylor approximation of the KL divergence.
Finally, the pdc package also provides the functionality to calculate the Hellinger distance, however it is done between two time series. In NATS this metric is calculated using the uniform probability distribution as a reference.
x = codebook(time.series[,1], m = 3)
y = codebook(time.series[,2], m = 3)
cat("Hellinger (x, y): ", hellingerDistance(x,y), " - Hellinger (x, UD): ", HellingerDistance(x), " - Hellinger (y, UD): ", HellingerDistance(y))
Hellinger (x, y): 0.007199873 - Hellinger (x, UD): 0.03381798 - Hellinger (y, UD): 0.03489021
A distance matrix can also be calculated. It is through this matrix that clustering is done.
D = pdcDist(time.series)
levelplot(as.matrix(D))
Just as a heuristic was proposed for choosing the hyper-parameters of the Bandt-Pompe symbolization, another heuristic is also proposed for choosing the number of clusters.
Based on a dissimilarity measure, clusters are constructed by iteratively merging clusters until there is only a single top cluster left. This leads to a hierarchy of clusters.
The distances between sets of time series are calculated by the complete linkage method, which is defined as the dissimilarity between two clusters \(C_i\) and \(C_j\) :
\[ d_{complete}(C_i, C_j) = max_{x \in C_i, y \in C_j} D(x,y) \]
The resulting binary tree is typically visualized in a dendrogram depicting the tree with branch heights that reflect the distance between clusters.
clust = pdclust(time.series)
plot(clust)
Using loo1nn, which implements a leave-one-out cross-validation scheme, we estimate the predictive accuracy of the clustering:
truth = c(rep(1,40), rep(2,40), rep(2,40), rep(3, 40))
loo1nn(clust, truth)
[1] 78.125
Also, the convex hull of each cluster can be plotted as the shaded area in the principal coordinate space. The resulting plot is shown below and can be created using the command mdsPlot:
mdsPlot(clust, truth, col = c("lightgray", "lightblue", "lightgreen"))