Introduction

Data Overview

  library(dplyr)
  data <- read.csv("From your file path/data.csv")
  data_Sandstone <- data %>% filter(Population == "Sandstone") %>% select(-Population)
  data_Carbonate <- data %>% filter(Population == "Carbonate") %>% select(-Population)

Methodology

1. Andrews Curves

  • Definition: Andrews Curves represent multivariate data as a linear combination of orthogonal trigonometric functions that can be graphed, thus projecting the data from multivariate space into lower dimensions while preserving important statistical qualities of the original data, such as the distance between samples and, under very particular circumstances, the variance. Mathematically, given a finite set of properties \(y_1, y_2, \cdots, y_n\), the Andrews curve is: \[ f_y(t) = \frac{y_1}{\sqrt{2}}+y_2\sin(t)+y_3\cos(t)+ y_4\sin(2t)+y_5\cos(2t)+y_6\sin(3t)+\cdots \] In vector notation, for a given \(t\in[0,2\pi]\), this may be written as \[ f_y(t) = \underline{a_f}(t)\cdot \underline{y} \] where \(\underline{y}=(y_1,\cdots,y_n)\) and \(\underline{a_f}(t) = \left(\frac{1}{\sqrt{2}}, \sin(t), \cos(t), \cdots\right)\).

  • R Code Implementation: Working with the vector notation, \(\underline{y}\) holds the 6 attributes associated with a single sample, whichever dataset it comes from, so the \(\underline{a_f}(t)\) vector is constructed to handle 6 attributes, for any \(t\):

    a_f <- function(t){ 
      c(1/sqrt(2), sin(t), cos(t), sin(2*t), cos(2*t), sin(3*t))
    }

    The Andrews Curve is then applied to every sample of the dataset through the following function:

    apply_AndrewCurves <- function(data, a_f, t_List) {
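      # data: data frame of numeric attributes, one row per sample.
      # a_f: function returning the coefficient vector, so that f_y(t) = a_f(t) . y
      # t_List: numeric vector of t values that make up the graphing domain.
      # Returns a matrix with one row per t value and one column per sample.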
      fy_matrix <- matrix(0, nrow = length(t_List), ncol = nrow(data))
    
      # Calculate the Andrews curves for each row in the dataset
      for (t_index in seq_along(t_List)) {
        t <- t_List[t_index]
        a_f_t <- a_f(t)
        for (i in seq_len(nrow(data))) {
          fy_matrix[t_index, i] <- sum(a_f_t * as.numeric(data[i, ]))
        }
      }
      return(fy_matrix)
    }
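
    The double loop above computes, for every value of t and every sample, a dot product between the coefficient vector and the sample's attributes. The same result can be written as a single matrix product. What follows is a minimal sketch of that idea, not part of the original analysis: the name apply_AndrewCurves_matrix and the toy data frame are hypothetical, and a_f is repeated so the block runs on its own.

```r
# Coefficient vector of the Andrews curve for 6 attributes (as defined above)
a_f <- function(t) c(1/sqrt(2), sin(t), cos(t), sin(2*t), cos(2*t), sin(3*t))

# Hypothetical matrix-product variant of apply_AndrewCurves
apply_AndrewCurves_matrix <- function(data, a_f, t_List) {
  # One row of coefficients per t value: length(t_List) x number of attributes
  A <- t(sapply(t_List, a_f))
  # (t values x attributes) %*% (attributes x samples) = one column per sample
  A %*% t(as.matrix(data))
}

# Toy data with made-up values: 2 samples, 6 attributes
toy <- data.frame(y1 = c(1, 2), y2 = c(0, 1), y3 = c(3, 1),
                  y4 = c(2, 2), y5 = c(1, 0), y6 = c(4, 5))
t_List <- seq(-pi, pi, length.out = 5)
curves <- apply_AndrewCurves_matrix(toy, a_f, t_List)
dim(curves)  # rows are t values, columns are samples
```

    The output has the same layout as fy_matrix (rows indexed by t, columns by sample), so it could be passed to matplot in the same way.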

The graphs use the domain \([-\pi,\pi]\) instead of \([0,2\pi]\) for convenience. Because every term of \(f_y\) is \(2\pi\)-periodic, this only shifts the starting point of the curve and leaves its shape unchanged.
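
The distance-preservation property mentioned in the definition can be checked numerically: over one full period, the integral of the squared difference between two Andrews curves equals \(\pi\) times the squared Euclidean distance between the two samples. The sketch below illustrates this with two made-up samples; the values of x and y and the helper sq_diff are assumptions introduced only for the check.

```r
# Coefficient vector of the Andrews curve for 6 attributes
a_f <- function(t) c(1/sqrt(2), sin(t), cos(t), sin(2*t), cos(2*t), sin(3*t))

# Two made-up samples with 6 attributes each (illustrative values only)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 0, 3, 1, 5, 4)

# Squared difference between the two curves, vectorized over t for integrate()
sq_diff <- function(t) sapply(t, function(s) sum(a_f(s) * (x - y))^2)

# Integrate over one period and compare with pi * squared Euclidean distance
curve_dist  <- integrate(sq_diff, lower = -pi, upper = pi)$value
euclid_dist <- pi * sum((x - y)^2)
c(curve_dist = curve_dist, euclid_dist = euclid_dist)
```

The two numbers should agree up to numerical integration error, which is why samples that are close in attribute space produce curves that stay close to each other.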

2. Modified Andrews Curves

  • Definition: Modified Andrews Curves are the alternative proposed in the article. They are meant, in part, to address a weakness of the Andrews Curve \(f_y\): at particular values of \(t\), such as 0 or \(\pi/2\), the coefficients of several \(y_i\) are zero simultaneously, so those attributes contribute nothing and samples become harder to tell apart at these values. As a solution, the Modified Andrews Curve is defined as:
    \[\begin{aligned} g_y(t) = & \,\,\,y_1 + y_2(\sin(t) + \cos(t)) + y_3(\sin(t) - \cos(t)) \\ &+ y_4(\sin(2t) + \cos(2t)) + y_5(\sin(2t) - \cos(2t)) + \cdots, \quad -\pi \leq t \leq \pi. \end{aligned} \] In vector notation, for a given \(t\), the function \(\underline{a_g}(t)\) is defined such that \[g_y(t)=\underline{a_g}(t)\cdot\underline{y}\]

  • R Code Implementation: As with the Andrews Curve, the function \(a_g\) is implemented for 6 attributes:

    a_g <- function(t){
      c(1, sin(t)+cos(t), sin(t)-cos(t), sin(2*t)+cos(2*t), sin(2*t)-cos(2*t), sin(3*t)+cos(3*t))
    }
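
The motivation for the modified curve can be illustrated by counting how many coefficients vanish at the values of \(t\) mentioned in the definition. This is a small sketch rather than part of the original analysis; count_zeros and its tolerance are hypothetical, and both coefficient functions are repeated so the block is self-contained.

```r
# Coefficient vectors for the original and modified curves (6 attributes)
a_f <- function(t) c(1/sqrt(2), sin(t), cos(t), sin(2*t), cos(2*t), sin(3*t))
a_g <- function(t) c(1, sin(t)+cos(t), sin(t)-cos(t),
                     sin(2*t)+cos(2*t), sin(2*t)-cos(2*t), sin(3*t)+cos(3*t))

# Hypothetical helper: count coefficients that are zero up to floating-point error
count_zeros <- function(a, t, tol = 1e-12) sum(abs(a(t)) < tol)

t_values <- c(0, pi/2)
zeros_f <- sapply(t_values, function(t) count_zeros(a_f, t))
zeros_g <- sapply(t_values, function(t) count_zeros(a_g, t))

zeros_f  # 3 coefficients vanish at t = 0 and 2 at t = pi/2
zeros_g  # none vanish at either value
```

The modified coefficients can still vanish individually (for example, \(\sin(t)+\cos(t)\) is zero at \(t=-\pi/4\)), but at these two values they no longer vanish together.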

3. Parallel Plot

  • Definition: Parallel Plots make it possible to visualize multivariate data by plotting each attribute on a separate vertical axis, with lines connecting the attribute values of each sample across axes.

  • R Code Implementation: We use the parallelPlot package, which allows for an easy implementation of the analysis. In the code that follows, the categorical argument declares Population as a categorical column and lists its categories, and the refColumnDim argument colors the samples according to that column.

    library(parallelPlot) 
    categorical <- list(Population = c("Sandstone", "Carbonate"))
    parallelPlot(data, categorical = categorical, refColumnDim = "Population")
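
To make the definition concrete, the sketch below shows the one computation a parallel plot needs before drawing: rescaling every attribute to a common axis height. It uses base R only and is not how the parallelPlot package works internally; the toy data frame, its values, and the helper rescale01 are assumptions made for illustration.

```r
# Toy data with made-up values; the column names mimic the report's layout
toy <- data.frame(
  y1 = c(5, 7, 20, 22),
  y2 = c(0.2, 0.3, 0.9, 0.8),
  y3 = c(100, 120, 40, 35),
  Population = c("Sandstone", "Sandstone", "Carbonate", "Carbonate")
)

# Hypothetical helper: rescale one attribute to [0, 1] so all axes share a height
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

attributes_only <- toy[, c("y1", "y2", "y3")]
scaled <- as.matrix(as.data.frame(lapply(attributes_only, rescale01)))

# Each row of `scaled` is one sample. Plotting t(scaled) draws one polyline per
# sample across the attribute axes, for example:
# matplot(t(scaled), type = "l", lty = 1, xaxt = "n",
#         col = ifelse(toy$Population == "Sandstone", "blue", "orange"))
# axis(1, at = seq_len(ncol(scaled)), labels = colnames(scaled))
```

Note that rescale01 would divide by zero for a constant attribute, so a real implementation would need to handle that case.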

Analysis

1. Analysis Using Non-Standardized Data

  • Objective: To explore the effectiveness of Andrews Curves and Modified Andrews Curves in distinguishing between Sandstone and Carbonate data without standardization.

  • Procedure:

    • Apply a_f(t) and a_g(t) to both datasets.
    • Visualize the results using the Andrews Curves and Modified Andrews Curves.
  • R Code:

    t_List <- seq(-pi, pi, length.out = 100)
    andrew_Functions <- list(a_f = a_f, a_g = a_g)
    yLabels <- list(a_f = "Original Andrews Curves", a_g = "Modified Andrews Curves")
    
    # Data Visualization
    par(mfrow = c(2, 1), mar = c(4, 4, 2, 1)) 
    
    for (name in names(andrew_Functions)) {
      a <- andrew_Functions[[name]]
      yLabel <- yLabels[[name]]
    
      # Data Preparation
      data_Sandstone_Andrews <- apply_AndrewCurves(data_Sandstone, a, t_List)
      data_Carbonate_Andrews <- apply_AndrewCurves(data_Carbonate, a, t_List)
      # Sandstone Plot
      matplot(t_List, data_Sandstone_Andrews, type = "l", lty = 1, col = 1:ncol(data_Sandstone_Andrews), xlab = "t", ylim = c(-600, 900), ylab = yLabel, lwd = 2)
      # Carbonate Plot
      matlines(t_List, data_Carbonate_Andrews, type = "l", lty = 2, col = 1:ncol(data_Carbonate_Andrews), lwd = 2)
      grid(col = "gray")
    }
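
    The plotting loop above fixes the vertical range by hand with ylim = c(-600, 900). As an alternative, the range could be derived from the curves themselves so that it adapts to other datasets. The following is only a sketch: shared_ylim, its pad parameter, and the toy matrices are hypothetical and stand in for the matrices computed in the loop.

```r
# Hypothetical helper: combined range of several curve matrices, with padding
shared_ylim <- function(..., pad = 0.05) {
  r <- range(..., finite = TRUE)   # overall min and max across all inputs
  r + c(-1, 1) * pad * diff(r)     # widen both ends by a fraction of the span
}

# Toy matrices standing in for data_Sandstone_Andrews and data_Carbonate_Andrews
curves_a <- matrix(c(-2, 0, 3, 1), nrow = 2)
curves_b <- matrix(c(5, -4, 2, 0), nrow = 2)

ylim_auto <- shared_ylim(curves_a, curves_b)
ylim_auto
# The result could then be passed on, e.g. matplot(..., ylim = ylim_auto)
```

    Computing one range across both populations keeps the Sandstone and Carbonate curves on the same scale, which the comparison in the figure relies on.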

2. Analysis Using Standardized Data

  • Objective: To examine the impact of data standardization on Andrews Curves and Modified Andrews Curves.

  • Procedure:

    • Standardize the dataset (excluding the “Population” column) before separating the dataset.
    • Apply a_f(t) and a_g(t) to the standardized data once the data has been separated by the Population type.
  • R Code:

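    # Standardize every numeric column to zero mean and unit standard deviation;
    # as.numeric() keeps plain numeric columns instead of one-column matrices
    data_standardized <- data %>% mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))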
    
    data_Sandstone_standardized <- data_standardized %>% filter(Population == "Sandstone") %>% select(-Population)
    data_Carbonate_standardized <- data_standardized %>% filter(Population == "Carbonate") %>% select(-Population)
    
    # Data Visualization
    par(mfrow = c(2, 1), mar = c(4, 4, 2, 1)) 
    
    for (name in names(andrew_Functions)) {
      a <- andrew_Functions[[name]]
      yLabel <- yLabels[[name]]
    
      # Data Preparation
      data_Sandstone_Andrews <- apply_AndrewCurves(data_Sandstone_standardized, a, t_List)
      data_Carbonate_Andrews <- apply_AndrewCurves(data_Carbonate_standardized, a, t_List)
      # Sandstone Plot
      matplot(t_List, data_Sandstone_Andrews, type = "l", lty = 1, col = 1:ncol(data_Sandstone_Andrews),
              xlab = "t", ylim = c(-7, 7), ylab = yLabel, lwd = 2)
      # Carbonate Plot
      matlines(t_List, data_Carbonate_Andrews, type = "l", lty = 2, col = 1:ncol(data_Carbonate_Andrews), lwd = 2)
      grid(col = "gray")
    }
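
    Standardization here means replacing each value by its z-score, (value - column mean) / column standard deviation, so that attributes measured on large scales no longer dominate the curves. The sketch below spells this out on made-up numbers and compares it with base R's scale(); the toy data frame and variable names are assumptions for illustration.

```r
# Toy data: two attributes on very different scales (made-up values)
toy <- data.frame(y1 = c(10, 20, 30, 40), y2 = c(0.1, 0.4, 0.2, 0.3))

# Manual z-score: subtract the column mean, divide by the column standard deviation
z_manual <- as.data.frame(lapply(toy, function(col) (col - mean(col)) / sd(col)))

# The same transformation using base R's scale()
z_scale <- as.data.frame(scale(toy))

# After standardization every column has mean 0 and standard deviation 1
sapply(z_manual, mean)
sapply(z_manual, sd)
```

    Because the whole dataset is standardized before it is split by Population, both populations are expressed relative to the same means and standard deviations.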

3. Comparison with Parallel Plot

  • Objective: To compare the insights gained from Andrews Curves and Modified Andrews Curves with those from Parallel Plots.

  • Procedure:

    • Generate Parallel Plots for the original dataset.
    • Compare the effectiveness of Parallel Plots with the Andrews Curves and Modified Andrews Curves.
  • R Code:

    library(parallelPlot)
    parallelPlot(data, categorical = list(Population = c("Sandstone", "Carbonate")), refColumnDim = "Population")

Results

Andrews Plots for Non-Standardized Data

Figure 1: Andrews Curves of the Example 1 data. The line type differentiates the samples according to their Population attribute - Solid lines: Sandstone; Dashed lines: Carbonate. The top plot is for the original Andrews Curves and the bottom plot is for the Modified Andrews Curves.

As shown in Figure 1, both the Sandstone samples (solid lines) and the Carbonate samples (dashed lines) can be differentiated from each other. However, it is noteworthy that the Modified Andrews Curves (bottom plot) maintain this differentiation over a longer interval of the domain.

Andrews Plots for Standardized Data

Figure 2: Andrews Curves of the standardized Example 1 data. The line type differentiates the samples according to their Population attribute - Solid lines: Sandstone; Dashed lines: Carbonate. The top plot is for the original Andrews Curves and the bottom plot is for the Modified Andrews Curves.

As in the previous analysis, Figure 2 clearly shows the distinction between the Sandstone samples (solid lines) and the Carbonate samples (dashed lines), with the Modified Andrews Curves (bottom plot) providing an even clearer distinction between the sample types. In addition, standardization makes it more apparent that there are sub-populations within both the Sandstone and Carbonate data, which highlights the importance of standardizing before applying Andrews Curves transformations when the goal is to identify groups more precisely.

Parallel Plots

Figure 3: Parallel plots of the Example 1 data. The line color differentiates the samples according to their Population attribute - Orange: Carbonate; Blue: Sandstone.

In Figure 3, the parallel plots clearly differentiate the data according to its Population type based on the y1, …, y6 attributes. Evidence of the sub-populations discussed under Andrews Plots for Standardized Data can also be observed here, reinforcing the findings of distinct groups within the Sandstone and Carbonate data.

Discussion and Conclusion

When it comes to identifying populations within the data, all the techniques covered are capable of distinguishing between them. However, focusing specifically on the Andrews Curves, the comparison between the results in Figures 1 and 2 highlights the enhanced differentiation achieved with the standardized data. The standardization process revealed the existence of sub-populations more clearly. Additionally, the Modified Andrews Curves outperformed the original Andrews Curves in terms of resolution, indicating that applying Modified Andrews Curves to standardized data offers the best combination of characteristics for using Andrews Curves to differentiate data into distinct groups.

On the other hand, Parallel Plots seem to outperform the other techniques covered in this exploratory data analysis. Similar to the Modified Andrews Curves for standardized data, the Parallel Plot effectively differentiates population types and shows evidence of sub-populations. However, it has the additional advantage of allowing the identification of specific attributes that contribute to distinguishing between populations and sub-populations.

References