Comparing Multivariate Exploratory Analysis: Andrews Curves, Modified Andrews Curves, and Parallel Plots

Introduction

Objective: To compare the effectiveness of Andrews Curves as a multivariate exploratory analysis technique with that of parallel plots, using both the original Andrews Curves and the alternative method proposed in the article Andrews plots for multivariate data: Some new suggestions and applications (link in the References).
Techniques Covered:
1. Andrews Curves
2. Modified Andrews Curves
3. Parallel Plot
Additional Notes: Following the article, the analysis first compares the methods on the original data. It then standardizes the data by attribute (column) to a common mean and variance, which, as will be seen, helps Andrews Curves differentiate the data. The comparison with Parallel Plots follows.

Data Overview

Dataset Description: The dataset comes from Example 1 of the article and consists of samples from two geological formations, Sandstone and Carbonate, measured on six variables: y1, y2, y3, y4, y5, and y6.
Preprocessing: Filtering the data into two separate datasets:
- data_Sandstone: Contains data specific to the Sandstone formation.
- data_Carbonate: Contains data specific to the Carbonate formation.
R Code Implementation:

  library(dplyr)
  data <- read.csv("From your file path/data.csv")
  data_Sandstone <- data %>% filter(Population == "Sandstone") %>% select(-Population)
  data_Carbonate <- data %>% filter(Population == "Carbonate") %>% select(-Population)

Methodology

1. Andrews Curves

Definition: Andrews Curves represent multivariate data as a linear combination of orthogonal trigonometric functions that can be graphed, thus projecting the data from multivariate space into lower dimensions while preserving important statistical qualities of the original data, such as the distance between samples and, under very particular circumstances, the variance. Mathematically, given a finite set of attributes \(y_1, y_2, \cdots\), the Andrews curve is: \[ f_y(t) = \frac{y_1}{\sqrt{2}}+y_2\sin(t)+y_3\cos(t)+ y_4\sin(2t)+y_5\cos(2t)+y_6\sin(3t)+\cdots \] In vector notation, for a given \(t\in[0,2\pi]\), this can be written as \[ f_y(t) = \underline{a_f}(t)\cdot \underline{y} \] where \(\underline{y}=(y_1,\cdots,y_n)\) and \(\underline{a_f}(t) = \left(\frac{1}{\sqrt{2}}, \sin(t), \cos(t), \cdots\right)\).
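
The two preserved qualities mentioned above can be stated explicitly. The following is a short sketch that relies only on the orthogonality of the trigonometric terms over an interval of length \(2\pi\); the second sample \(\underline{x}\) and the common variance \(\sigma^2\) are notation introduced here for illustration. For two samples \(\underline{x}\) and \(\underline{y}\), the cross terms integrate to zero and each squared term integrates to \(\pi\) (for the constant term, \(\left(1/\sqrt{2}\right)^2\cdot 2\pi=\pi\)), so \[ \int_{0}^{2\pi}\left(f_x(t)-f_y(t)\right)^2\,dt = \pi\sum_{i=1}^{n}(x_i-y_i)^2 \] That is, the squared distance between two curves is proportional to the squared Euclidean distance between the samples. If the attributes are uncorrelated and share a common variance \(\sigma^2\), then \[ \operatorname{Var}\left(f_y(t)\right) = \sigma^2\left(\tfrac{1}{2}+\sin^2(t)+\cos^2(t)+\sin^2(2t)+\cos^2(2t)+\cdots\right) \] which equals the constant \(\sigma^2 n/2\) when \(n\) is odd and stays between \(\sigma^2(n-1)/2\) and \(\sigma^2(n+1)/2\) when \(n\) is even.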

R Code Implementation: Working with the vector notation, \(\underline{y}\) represents the 6 attributes of a single sample, whichever dataset it comes from, so the vector \(\underline{a_f}(t)\) is constructed to handle 6 attributes for any \(t\):

a_f <- function(t){ 
  c(1/sqrt(2), sin(t), cos(t), sin(2*t), cos(2*t), sin(3*t))
}

The Andrews Curve is then applied to every sample of the dataset through the following function:

apply_AndrewCurves <- function(data, a_f, t_List) {
  # data: data frame of numeric attributes, one sample per row.
  # a_f: function returning the basis vector, so that f_y(t) = a_f(t) . y
  # t_List: vector of t values that forms the graphing domain
  fy_matrix <- matrix(0, nrow = length(t_List), ncol = nrow(data))

  # Calculate the Andrews curves for each row in the dataset
  for (t_index in seq_along(t_List)) {
    t <- t_List[t_index]
    a_f_t <- a_f(t)
    for (i in seq_len(nrow(data))) {
      fy_matrix[t_index, i] <- sum(a_f_t * as.numeric(data[i, ]))
    }
  }
  return(fy_matrix)
}
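
Because each curve value is a dot product, the double loop above can also be written as a single matrix product. The following is a minimal sketch rather than part of the original analysis; the name `andrews_matrix` and the toy samples are hypothetical and serve only as an illustration.

```r
# Basis vector of the original Andrews Curve for six attributes (as defined above)
a_f <- function(t) {
  c(1 / sqrt(2), sin(t), cos(t), sin(2 * t), cos(2 * t), sin(3 * t))
}

# Hypothetical vectorized alternative to the double loop:
# rows index the values of t, columns index the samples
andrews_matrix <- function(data, a, t_List) {
  basis <- t(sapply(t_List, a))     # length(t_List) x p matrix whose rows are a(t)
  basis %*% t(as.matrix(data))      # (length(t_List) x p) %*% (p x n)
}

# Toy data (made up for illustration): two samples, six attributes
toy <- data.frame(y1 = c(1, 2), y2 = c(0, 1), y3 = c(3, 1),
                  y4 = c(2, 2), y5 = c(1, 0), y6 = c(4, 1))
t_List <- seq(-pi, pi, length.out = 5)
curves <- andrews_matrix(toy, a_f, t_List)
dim(curves)   # 5 rows (values of t) by 2 columns (samples)
```

The result has the same layout as `fy_matrix`, so under these assumptions it could be passed to `matplot` in the same way.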

For convenience, the graphs use the domain \([-\pi,\pi]\) instead of \([0,2\pi]\). Because every term of the curve is \(2\pi\)-periodic, this choice only changes where the period starts and does not affect the comparison between samples.
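
The distance-preserving quality of the Andrews Curve, and the fact that the choice of domain does not matter, can be checked numerically. This is a rough sketch with made-up samples `x` and `y`; the helper `curve_sq_dist` is a hypothetical name, and the integral is approximated on an equally spaced grid over one period.

```r
a_f <- function(t) c(1 / sqrt(2), sin(t), cos(t), sin(2 * t), cos(2 * t), sin(3 * t))

# Two made-up samples with six attributes each (illustrative values only)
x <- c(1, 0, 3, 2, 1, 4)
y <- c(2, 1, 1, 2, 0, 1)

# Integral of (f_x - f_y)^2 over one period starting at `lower`,
# approximated on an equally spaced grid
curve_sq_dist <- function(lower, n_grid = 1000) {
  t_grid <- lower + 2 * pi * seq_len(n_grid) / n_grid
  sq_diff <- sapply(t_grid, function(s) sum(a_f(s) * (x - y))^2)
  2 * pi * mean(sq_diff)
}

dist_sym <- curve_sq_dist(-pi)     # domain [-pi, pi]
dist_pos <- curve_sq_dist(0)       # domain [0, 2*pi]
euclid   <- pi * sum((x - y)^2)    # pi times the squared Euclidean distance
c(dist_sym, dist_pos, euclid)      # the three values agree up to rounding
```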

2. Modified Andrews Curves

Definition: The Modified Andrews Curve is the alternative proposed in the article. It is meant, in part, to address a drawback of the Andrews Curve \(f_y\): at particular values of \(t\), such as 0 or \(\pi/2\), the coefficients of several \(y_i\) are zero simultaneously. These simultaneous zeroes diminish the distinctiveness of the samples at these values. As a solution, the Modified Andrews Curve is defined as:
\[\begin{align} g_y(t) = & \,\,\,y_1 + y_2(\sin(t) + \cos(t)) + y_3(\sin(t) - \cos(t)) \\ &+ y_4(\sin(2t) + \cos(2t)) + y_5(\sin(2t) - \cos(2t)) + \cdots, \quad -\pi \leq t \leq \pi. \end{align} \] In vector notation, for a given \(t\), the function \(\underline{a_g}(t)\) is defined such that \[g_y(t)=\underline{a_g}(t)\cdot\underline{y}\]
R Code Implementation: As with the Andrews Curve, the function \(a_g\) is implemented for 6 attributes:

a_g <- function(t){
  c(1, sin(t)+cos(t), sin(t)-cos(t), sin(2*t)+cos(2*t), sin(2*t)-cos(2*t), sin(3*t)+cos(3*t))
}
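
The simultaneous zeroes described in the definition can be counted directly. The sketch below restates both basis functions so that it runs on its own; the helper `count_zeros` and its tolerance are assumptions made for illustration.

```r
a_f <- function(t) c(1 / sqrt(2), sin(t), cos(t), sin(2 * t), cos(2 * t), sin(3 * t))
a_g <- function(t) c(1, sin(t) + cos(t), sin(t) - cos(t),
                     sin(2 * t) + cos(2 * t), sin(2 * t) - cos(2 * t),
                     sin(3 * t) + cos(3 * t))

# Count coefficients that are numerically zero (the tolerance is an arbitrary choice)
count_zeros <- function(v, tol = 1e-12) sum(abs(v) < tol)

zeros_f <- sapply(c(0, pi / 2), function(t) count_zeros(a_f(t)))
zeros_g <- sapply(c(0, pi / 2), function(t) count_zeros(a_g(t)))
zeros_f   # coefficients of f_y that vanish at t = 0 and at t = pi/2
zeros_g   # the same count for g_y
```

For six attributes, \(a_f\) has three vanishing coefficients at \(t=0\) and two at \(t=\pi/2\), while \(a_g\) has none at either value.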

3. Parallel Plot

Definition: Parallel Plots make it possible to visualize multivariate data by plotting each attribute on a separate vertical axis, with lines connecting the attribute values of each sample across axes.
R Code Implementation: We use the parallelPlot package, which allows for an easy implementation of the analysis. In the code that follows, the categorical argument lists the categories of the categorical column (here, Population), and the refColumnDim argument names the column used to color the samples.
```
library(parallelPlot) 
categorical <- list(Population = c("Sandstone", "Carbonate"))
parallelPlot(data, categorical = categorical, refColumnDim = "Population")
```
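
To illustrate the construction described in the definition independently of any package, the following rough sketch draws a static parallel plot with base R graphics. The toy data and the helper `rescale01` are made up for illustration and are not the article's dataset; rescaling each attribute to a common range is one common convention, assumed here so that all axes share the same height.

```r
# Toy data (made up for illustration; not the article's dataset)
toy <- data.frame(
  Population = c("Sandstone", "Sandstone", "Carbonate", "Carbonate"),
  y1 = c(10, 12, 30, 28),
  y2 = c(0.5, 0.7, 0.2, 0.1),
  y3 = c(100, 140, 90, 60)
)

# Rescale one attribute to [0, 1] so every vertical axis shares the same height
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
scaled <- sapply(toy[, -1], rescale01)   # samples in rows, attributes in columns

# One polyline per sample: x = attribute position, y = rescaled value
group_col <- ifelse(toy$Population == "Sandstone", "blue", "orange")
matplot(t(scaled), type = "l", lty = 1, lwd = 2, col = group_col,
        xaxt = "n", xlab = "Attribute", ylab = "Rescaled value")
axis(1, at = seq_len(ncol(scaled)), labels = colnames(scaled))
```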

Analysis

1. Analysis Using Non-Standardized Data

Objective: To explore the effectiveness of Andrews Curves and Modified Andrews Curves in distinguishing between Sandstone and Carbonate data without standardization.
Procedure:
- Apply a_f(t) and a_g(t) to both datasets.
- Visualize the results using the Andrews Curves and Modified Andrews Curves.

R Code:

t_List <- seq(-pi, pi, length.out = 100)
andrew_Functions <- list(a_f = a_f, a_g = a_g)
yLabels <- list(a_f = "Original Andrews Curves", a_g = "Modified Andrews Curves")

# Data Visualization
par(mfrow = c(2, 1), mar = c(4, 4, 2, 1)) 

for (name in names(andrew_Functions)) {
  a <- andrew_Functions[[name]]
  yLabel <- yLabels[[name]]

  # Data Preparation
  data_Sandstone_Andrews <- apply_AndrewCurves(data_Sandstone, a, t_List)
  data_Carbonate_Andrews <- apply_AndrewCurves(data_Carbonate, a, t_List)
  # Sandstone Plot
  matplot(t_List, data_Sandstone_Andrews, type = "l", lty = 1, col = 1:ncol(data_Sandstone_Andrews), xlab = "t", ylim = c(-600, 900), ylab = yLabel, lwd = 2)
  # Carbonate Plot
  matlines(t_List, data_Carbonate_Andrews, type = "l", lty = 2, col = 1:ncol(data_Carbonate_Andrews), lwd = 2)
  grid(col = "gray")
}

2. Analysis Using Standardized Data

Objective: To examine the impact of data standardization on Andrews Curves and Modified Andrews Curves.
Procedure:
- Standardize the dataset (excluding the “Population” column) before separating the dataset.
- Apply a_f(t) and a_g(t) to the standardized data once the data has been separated by the Population type.

R Code:

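# Standardize every numeric column; as.numeric() keeps plain vectors instead of one-column matrices
data_standardized <- data %>% mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))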

data_Sandstone_standardized <- data_standardized %>% filter(Population == "Sandstone") %>% select(-Population)
data_Carbonate_standardized <- data_standardized %>% filter(Population == "Carbonate") %>% select(-Population)

# Data Visualization
par(mfrow = c(2, 1), mar = c(4, 4, 2, 1)) 

for (name in names(andrew_Functions)) {
  a <- andrew_Functions[[name]]
  yLabel <- yLabels[[name]]

  # Data Preparation
  data_Sandstone_Andrews <- apply_AndrewCurves(data_Sandstone_standardized, a, t_List)
  data_Carbonate_Andrews <- apply_AndrewCurves(data_Carbonate_standardized, a, t_List)
  # Sandstone Plot
  matplot(t_List, data_Sandstone_Andrews, type = "l", lty = 1, col = 1:ncol(data_Sandstone_Andrews),
          xlab = "t", ylim = c(-7, 7), ylab = yLabel, lwd = 2)
  # Carbonate Plot
  matlines(t_List, data_Carbonate_Andrews, type = "l", lty = 2, col = 1:ncol(data_Carbonate_Andrews), lwd = 2)
  grid(col = "gray")
}
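
The effect of the standardization step can be checked on a small example. This is a minimal sketch with made-up values whose two attributes are on very different scales; after `scale()`, each column has mean 0 and standard deviation 1, so no single attribute dominates the curves merely because of its units.

```r
# Toy data (made up for illustration): attributes on very different scales
toy <- data.frame(y1 = c(100, 200, 300, 400), y2 = c(0.1, 0.4, 0.2, 0.3))

# scale() centers each column to mean 0 and rescales it to standard deviation 1
toy_std <- as.data.frame(scale(toy))

round(colMeans(toy_std), 10)   # both column means are 0
sapply(toy_std, sd)            # both standard deviations are 1
```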

3. Comparison with Parallel Plot

Objective: To compare the insights gained from Andrews Curves and Modified Andrews Curves with those from Parallel Plots.
Procedure:
- Generate Parallel Plots for the original dataset.
- Compare the effectiveness of Parallel Plots with the Andrews Curves and Modified Andrews Curves.
```
library(parallelPlot)
parallelPlot(data, categorical = list(Population = c("Sandstone", "Carbonate")), refColumnDim = "Population")
```

Results

Andrews Plots for Non-Standardized Data

Figure 1: Andrews Curves of the Example 1 data. The line type differentiates the samples according to their Population attribute - Solid lines: Sandstone; Dashed lines: Carbonate. The top plot is for the original Andrews Curves and the bottom plot is for the Modified Andrews Curves.

As shown in Figure 1, both the Sandstone samples (solid lines) and the Carbonate samples (dashed lines) can be differentiated from each other. However, it is noteworthy that the Modified Andrews Curves (bottom plot) maintain this differentiation over a longer interval of the domain.

Andrews Plots for Standardized Data

Figure 2: Andrews Curves of the standardized Example 1 data. The line type differentiates the samples according to their Population attribute - Solid lines: Sandstone; Dashed lines: Carbonate. The top plot is for the original Andrews Curves and the bottom plot is for the Modified Andrews Curves.

Similarly to the previous analysis, Figure 2 clearly shows the distinction between the Sandstone samples (solid lines) and the Carbonate samples (dashed lines), with the Modified Andrews Curves (bottom plot) providing an even clearer distinction between the sample types. However, through the standardization process, it becomes more apparent that there are sub-populations within both the Sandstone and Carbonate data, highlighting the importance of applying such standardization before using Andrews Curves transformations to more precisely identify groups.

Parallel Plots

Figure 3: Parallel plots of the Example 1 data. The line color differentiates the samples according to their Population attribute - Orange: Carbonate; Blue: Sandstone.

In Figure 3, the parallel plots clearly differentiate the data according to its Population type based on the y1, …, y6 attributes. Evidence of the sub-populations discussed under Andrews Plots for Standardized Data can also be observed here, reinforcing the finding of distinct groups within the Sandstone and Carbonate data.

Discussion and Conclusion

When it comes to identifying populations within the data, all the techniques covered are capable of distinguishing between them. However, focusing specifically on the Andrews Curves, the comparison between the results in Figures 1 and 2 highlights the enhanced differentiation achieved with the standardized data. The standardization process revealed the existence of sub-populations more clearly. Additionally, the Modified Andrews Curves outperformed the original Andrews Curves in terms of resolution, indicating that applying Modified Andrews Curves to standardized data offers the best combination of characteristics for using Andrews Curves to differentiate data into distinct groups.

On the other hand, Parallel Plots seem to outperform the other techniques covered in this exploratory data analysis. Similar to the Modified Andrews Curves for standardized data, the Parallel Plot effectively differentiates population types and shows evidence of sub-populations. However, it has the additional advantage of allowing the identification of specific attributes that contribute to distinguishing between populations and sub-populations.

References

Khattree, R., & Naik, D. (2002). Andrews plots for multivariate data: Some new suggestions and applications. Journal of Statistical Planning and Inference, 100, 411-425. https://doi.org/10.1016/S0378-3758(01)00150-1. Link to article: https://www.researchgate.net/publication/248776163_Andrews_plots_for_multivariate_data_Some_new_suggestions_and_applications