Objective: To compare the effectiveness of Andrews Curves as a multivariate exploratory analysis technique with that of parallel plots, using both the original Andrews Curves and the alternative method proposed in the article Andrews plots for multivariate data: Some new suggestions and applications (link in the References).
Techniques Covered: Andrews Curves, Modified Andrews Curves, and Parallel Plots.
Additional Notes: Following the article, the comparison first uses the original data. The data are then standardized by attribute (column) to a common mean and variance, which, as will be seen, helps Andrews Curves differentiate the samples. The comparison with Parallel Plots comes last.
Dataset Description: The dataset comes from
Example 1 of the article and consists of samples from
two geological formations, Sandstone and Carbonate, measured on six
variables: y1, y2, y3,
y4, y5, and y6.
Preprocessing: Filtering the data into two separate datasets:
data_Sandstone: Contains data specific to the Sandstone formation.
data_Carbonate: Contains data specific to the Carbonate formation.
R Code Implementation:
library(dplyr)
data <- read.csv("From your file path/data.csv")
data_Sandstone <- data %>% filter(Population == "Sandstone") %>% select(-Population)
data_Carbonate <- data %>% filter(Population == "Carbonate") %>% select(-Population)
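The same split can also be sketched in base R with split(). The example below runs on a small made-up data frame; the values in `toy` and the names `numeric_cols` and `groups` are illustrative assumptions, not the article's data.

```r
# Toy stand-in for data.csv; the values are invented for illustration only
toy <- data.frame(
  Population = c("Sandstone", "Carbonate", "Sandstone", "Carbonate"),
  y1 = c(1.0, 2.0, 3.0, 4.0),
  y2 = c(10, 20, 30, 40)
)

# Split the numeric columns into one data frame per Population level
numeric_cols <- setdiff(names(toy), "Population")
groups <- split(toy[numeric_cols], toy$Population)

names(groups)        # "Carbonate" "Sandstone"
groups$Sandstone$y1  # 1 3
```

Each element of `groups` plays the role of data_Sandstone or data_Carbonate, with the Population column already dropped.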
Definition: Andrews Curves represent multivariate data as a linear combination of orthogonal trigonometric functions that can be graphed, thus projecting the data from multivariate space into a lower dimension while preserving important statistical properties of the original data, such as the distance between samples and, under very particular circumstances, the variance. Mathematically, given a finite set of properties \(y_1, y_2, \cdots\), the Andrews curve is: \[ f_y(t) = \frac{y_1}{\sqrt{2}}+y_2\sin(t)+y_3\cos(t)+ y_4\sin(2t)+y_5\cos(2t)+y_6\sin(3t)+\cdots \] In vector notation, for a given \(t\in[0,2\pi]\), this can be written as \[ f_y(t) = \underline{a_f}(t)\cdot \underline{y} \] where \(\underline{y}=(y_1,\cdots,y_n)\) and \(\underline{a_f}(t) = \left(\frac{1}{\sqrt{2}}, \sin(t), \cos(t), \cdots\right)\).
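The distance-preservation property mentioned above can be written out as a worked statement. For a second sample \(\underline{x}\) (a symbol introduced here for illustration), linearity gives \(f_x(t)-f_y(t)=f_{x-y}(t)\). Because the functions \(1/\sqrt{2}, \sin(t), \cos(t), \sin(2t), \cdots\) are mutually orthogonal over an interval of length \(2\pi\), each with squared norm \(\pi\), the cross terms integrate to zero: \[ \int_{0}^{2\pi}\left(f_x(t)-f_y(t)\right)^2\,dt = \pi\sum_{i}(x_i-y_i)^2 \] so the integrated squared difference between two curves is proportional to the squared Euclidean distance between the two samples.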
R Code Implementation: Working with the vector notation, \(\underline{y}\) holds the 6 attributes of a single sample, whichever dataset it comes from, so the vector \(\underline{a_f}(t)\) is constructed to handle 6 attributes for any \(t\):
a_f <- function(t){
c(1/sqrt(2), sin(t), cos(t), sin(2*t), cos(2*t), sin(3*t))
}
The Andrews Curve is then applied to every sample of the dataset through the following function:
apply_AndrewCurves <- function(data, a_f, t_List) {
# data: data frame of numeric attributes, one sample per row
# a_f: function returning the coefficient vector, so that f_y(t) = a_f(t)*y
# t_List: vector of t values that make up the graphing domain
fy_matrix <- matrix(0, nrow = length(t_List), ncol = nrow(data))
# Calculate the Andrews curves for each row in the dataset
for (t_index in seq_along(t_List)) {
t <- t_List[t_index]
a_f_t <- a_f(t)
for (i in seq_len(nrow(data))) {
fy_matrix[t_index, i] <- sum(a_f_t * as.numeric(data[i, ]))
}
}
return(fy_matrix)
}
As the plotting code shows, the graphs use the domain \([-\pi,\pi]\) instead of \([0,2\pi]\) for convenience. Because the curves are \(2\pi\)-periodic, this shift of the domain does not change their shape.
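The double loop in apply_AndrewCurves can also be expressed as a single matrix product, since every entry is a dot product between a coefficient vector and a sample. The following is a minimal sketch under that reading; `andrews_matrix` is a hypothetical helper name and `toy` holds invented values.

```r
# Coefficient vector of the original Andrews Curve for 6 attributes
a_f <- function(t) c(1/sqrt(2), sin(t), cos(t), sin(2*t), cos(2*t), sin(3*t))

# Hypothetical helper: same output layout as apply_AndrewCurves
# (one row per t value, one column per sample), computed as a matrix product
andrews_matrix <- function(data, a, t_List) {
  A <- t(vapply(t_List, a, numeric(ncol(data))))  # length(t_List) x attributes
  A %*% t(as.matrix(data))                        # length(t_List) x samples
}

# Toy data: 3 samples with 6 invented attributes
set.seed(1)
toy <- as.data.frame(matrix(rnorm(18), nrow = 3))
t_List <- seq(-pi, pi, length.out = 100)

curves <- andrews_matrix(toy, a_f, t_List)
dim(curves)  # 100 3
```

The first and last rows of `curves` agree up to floating-point error, which reflects the \(2\pi\)-periodicity that makes the choice of \([-\pi,\pi]\) harmless.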
Definition: Modified Andrews Curves are the
alternative proposed in the article. They are meant, in part, to handle the
downside that several coefficients of the
\(y_i\) in the Andrews Curve, \(f_y\), vanish simultaneously when \(t\) takes particular values such as 0 or
\(\pi/2\). These simultaneous zeros
diminish the distinctiveness of samples at those values of \(t\).
As a solution, the Modified Andrews Curve is:
\[\begin{align}
g_y(t) =
& \,\,\,y_1 + y_2(\sin(t) + \cos(t)) + y_3(\sin(t) - \cos(t)) \\
&+ y_4(\sin(2t) + \cos(2t)) + y_5(\sin(2t) - \cos(2t)) + \cdots,
\quad -\pi \leq t \leq \pi.
\end{align}
\] In vector notation, for a given \(t\), the function \(\underline{a_g}(t)\) is defined such that
\[g_y(t)=\underline{a_g}(t)\cdot\underline{y}\]
R Code Implementation: As with the
Andrews Curve, the function \(a_g\) is
implemented for 6 attributes:
a_g <- function(t){
c(1, sin(t)+cos(t), sin(t)-cos(t), sin(2*t)+cos(2*t), sin(2*t)-cos(2*t), sin(3*t)+cos(3*t))
}
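The motivation for the modified curves can be checked numerically by counting how many coefficients vanish at the two values of \(t\) named above. This is a small sketch; `count_zeros` and its tolerance are assumptions introduced here, and the two coefficient functions are restated so the block runs on its own.

```r
a_f <- function(t) c(1/sqrt(2), sin(t), cos(t), sin(2*t), cos(2*t), sin(3*t))
a_g <- function(t) c(1, sin(t)+cos(t), sin(t)-cos(t),
                     sin(2*t)+cos(2*t), sin(2*t)-cos(2*t), sin(3*t)+cos(3*t))

# Count coefficients that are zero up to floating-point tolerance
count_zeros <- function(coefs, tol = 1e-12) sum(abs(coefs) < tol)

t_values <- c(0, pi/2)
zeros_f <- sapply(t_values, function(t) count_zeros(a_f(t)))
zeros_g <- sapply(t_values, function(t) count_zeros(a_g(t)))

zeros_f  # 3 2
zeros_g  # 0 0
```

At \(t=0\) the original curve ignores \(y_2\), \(y_4\), and \(y_6\) entirely, while every attribute still contributes to the modified curve. Individual coefficients of \(a_g\) can still vanish at other values of \(t\); the check only covers the two values discussed in the text.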
Definition: Parallel Plots make it possible to visualize multivariate data by plotting each attribute on a separate vertical axis, with lines connecting the attribute values of each sample across axes.
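The construction just described can be sketched with base R graphics alone. This is not the implementation used in this analysis, only an illustration of the idea on invented values; `rescale01` is a hypothetical helper, and it assumes no attribute is constant.

```r
# Toy data with invented values; column names mirror the article's y1..y3
toy <- data.frame(
  Population = c("Sandstone", "Sandstone", "Carbonate", "Carbonate"),
  y1 = c(1, 2, 8, 9),
  y2 = c(100, 120, 300, 310),
  y3 = c(0.5, 0.4, 0.1, 0.2)
)

# Rescale each attribute to [0, 1] so all vertical axes share one height
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
scaled <- as.data.frame(lapply(toy[-1], rescale01))

# One line per sample: x positions are the attributes, y the rescaled values
pdf(NULL)  # draw to a null device; remove this line to see the plot
matplot(t(as.matrix(scaled)), type = "l", lty = 1, xaxt = "n",
        col = ifelse(toy$Population == "Sandstone", "blue", "orange"),
        xlab = "Attribute", ylab = "Rescaled value")
axis(1, at = seq_along(scaled), labels = names(scaled))
abline(v = seq_along(scaled), col = "gray")
invisible(dev.off())
```

Rescaling each attribute separately is what lets attributes with very different units share the same plot.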
R Code Implementation: We are using the
parallelPlot package, which allows for an easy
implementation of the analysis. In the code that follows, the
categorical argument declares Population as a categorical
column and lists its categories, and the
refColumnDim argument colors the samples based on that
column.
library(parallelPlot)
categorical <- list(Population = c("Sandstone", "Carbonate"))
parallelPlot(data, categorical = categorical, refColumnDim = "Population")
Objective: To explore the effectiveness of Andrews Curves and Modified Andrews Curves in distinguishing between Sandstone and Carbonate data without standardization.
Procedure:
Apply a_f(t) and a_g(t) to both
datasets.
R Code:
t_List <- seq(-pi, pi, length.out = 100)
andrew_Functions <- list(a_f = a_f, a_g = a_g)
yLabels <- list(a_f = "Original Andrews Curves", a_g = "Modified Andrews Curves")
# Data Visualization
par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
for (name in names(andrew_Functions)) {
a <- andrew_Functions[[name]]
yLabel <- yLabels[[name]]
# Data Preparation
data_Sandstone_Andrews <- apply_AndrewCurves(data_Sandstone, a, t_List)
data_Carbonate_Andrews <- apply_AndrewCurves(data_Carbonate, a, t_List)
# Sandstone Plot
matplot(t_List, data_Sandstone_Andrews, type = "l", lty = 1, col = 1:ncol(data_Sandstone_Andrews), xlab = "t", ylim = c(-600, 900), ylab = yLabel, lwd = 2)
# Carbonate Plot
matlines(t_List, data_Carbonate_Andrews, type = "l", lty = 2, col = 1:ncol(data_Carbonate_Andrews), lwd = 2)
grid(col = "gray")
}
Objective: To examine the impact of data standardization on Andrews Curves and Modified Andrews Curves.
Procedure:
Standardize the data by column, then apply a_f(t) and a_g(t) to the
standardized data once the data has been separated by the Population
type.
R Code:
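# scale() returns a one-column matrix, so each result is converted back to a plain numeric vector
data_standardized <- data %>% mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))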
data_Sandstone_standardized <- data_standardized %>% filter(Population == "Sandstone") %>% select(-Population)
data_Carbonate_standardized <- data_standardized %>% filter(Population == "Carbonate") %>% select(-Population)
# Data Visualization
par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
for (name in names(andrew_Functions)) {
a <- andrew_Functions[[name]]
yLabel <- yLabels[[name]]
# Data Preparation
data_Sandstone_Andrews <- apply_AndrewCurves(data_Sandstone_standardized, a, t_List)
data_Carbonate_Andrews <- apply_AndrewCurves(data_Carbonate_standardized, a, t_List)
# Sandstone Plot
matplot(t_List, data_Sandstone_Andrews, type = "l", lty = 1, col = 1:ncol(data_Sandstone_Andrews),
xlab = "t", ylim = c(-7, 7), ylab = yLabel, lwd = 2)
# Carbonate Plot
matlines(t_List, data_Carbonate_Andrews, type = "l", lty = 2, col = 1:ncol(data_Carbonate_Andrews), lwd = 2)
grid(col = "gray")
}
Objective: To compare the insights gained from Andrews Curves and Modified Andrews Curves with those from Parallel Plots.
Procedure:
library(parallelPlot)
parallelPlot(data, categorical = list(Population = c("Sandstone", "Carbonate")), refColumnDim = "Population")
Figure 1: Andrews Curves of the Example 1 data. The line type differentiates the samples according to their Population attribute - Solid lines: Sandstone; Dashed lines: Carbonate. The top plot is for the original Andrews Curves and the bottom plot is for the Modified Andrews Curves.
As shown in Figure 1, the Sandstone samples (solid lines) and the Carbonate samples (dashed lines) can be differentiated from each other. Notably, the Modified Andrews Curves (bottom plot) maintain this differentiation over a longer interval of the domain.
Figure 2: Andrews Curves of the standardized Example 1 data. The line type differentiates the samples according to their Population attribute - Solid lines: Sandstone; Dashed lines: Carbonate. The top plot is for the original Andrews Curves and the bottom plot is for the Modified Andrews Curves.
Similarly to the previous analysis, Figure 2 clearly shows the distinction between the Sandstone samples (solid lines) and the Carbonate samples (dashed lines), with the Modified Andrews Curves (bottom plot) providing an even clearer distinction between the sample types. However, through the standardization process, it becomes more apparent that there are sub-populations within both the Sandstone and Carbonate data, highlighting the importance of applying such standardization before using Andrews Curves transformations to more precisely identify groups.
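To illustrate why standardization matters for a linear combination such as \(f_y(t)\), the sketch below uses two invented attributes on very different scales. Before scaling, the larger-scale attribute dominates any weighted sum. After scale(), each column has mean 0 and standard deviation 1, so each attribute contributes on equal footing. The values in `toy` are assumptions for illustration, not the article's data.

```r
# Invented attributes on very different scales
toy <- data.frame(
  y1 = c(0.1, 0.2, 0.3, 0.4),
  y2 = c(1000, 3000, 2000, 4000)
)

# Column-wise standardization: subtract the mean, divide by the standard deviation
toy_std <- as.data.frame(lapply(toy, function(x) as.numeric(scale(x))))

sapply(toy, sd)      # raw spreads differ by several orders of magnitude
colMeans(toy_std)    # both zero, up to floating-point error
sapply(toy_std, sd)  # both one, up to floating-point error
```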
Figure 3: Parallel plots of the Example 1 data. The line color differentiates the samples according to their Population attribute - Orange: Carbonate; Blue: Sandstone.
In Figure 3, the parallel plots clearly differentiate the data according to its Population type based on the y1, …, y6 attributes. Evidence of the sub-populations discussed in the Andrews Curves for Standardized Data can also be observed here, reinforcing the findings of distinct groups within the Sandstone and Carbonate data.
When it comes to identifying populations within the data, all the techniques covered are capable of distinguishing between them. However, focusing specifically on the Andrews Curves, the comparison between the results in Figures 1 and 2 highlights the enhanced differentiation achieved with the standardized data. The standardization process revealed the existence of sub-populations more clearly. Additionally, the Modified Andrews Curves outperformed the original Andrews Curves in terms of resolution, indicating that applying Modified Andrews Curves to standardized data offers the best combination of characteristics for using Andrews Curves to differentiate data into distinct groups.
On the other hand, Parallel Plots seem to outperform the other techniques covered in this exploratory data analysis. Similar to the Modified Andrews Curves for standardized data, the Parallel Plot effectively differentiates population types and shows evidence of sub-populations. However, it has the additional advantage of allowing the identification of specific attributes that contribute to distinguishing between populations and sub-populations.