DRAFT Triad Data Visualizations: Part 3

These documents provide some examples of triad data visualization primarily using the packages ggtern, compositions and robCompositions. The ggtern package is an extension of R’s ggplot2 package offering many specialized visualization options for presenting 3-part compositional data within ternary plots. The author, Nicholas Hamilton, offers many examples with accompanying code. Compositions and robCompositions provide custom functions to transform, analyse, and model compositional data (data with multiple variables that measure distinct parts of a whole).

This code and documentation were developed to accompany in-person tutorials for an individual needing to learn just enough R to understand, create, and edit visualizations of their dissertation data. I wrote these notes primarily to serve as reminders of coding topics covered in our tutoring sessions.

NB: R software and packages are regularly updated, so version notes are provided at the end of this document along with instructions on how to roll back to older versions, if necessary.

Load your data

You will need to provide the correct path to your data files.

# Remember to change the file path to your own directory
demodata1 <- read.csv("E:/P_Teaching/DemoSheets/triad_data1.csv")
str(demodata1)

## 'data.frame':    62 obs. of  23 variables:
##  $ ObsID    : int  27 30 2 37 25 5 10 28 4 52 ...
##  $ StartTime: Factor w/ 47 levels "2/3/2016 19:13",..: 10 11 9 43 23 12 47 4 27 38 ...
##  $ EndTime  : Factor w/ 47 levels "2/3/2016 19:13",..: 11 12 10 43 24 13 47 5 28 39 ...
##  $ T1A      : num  9.65 13.07 45.19 42.77 24.14 ...
##  $ T1B      : num  33.2 78.4 42 40.2 17.4 ...
##  $ T1C      : num  57.12 8.53 12.81 16.98 58.42 ...
##  $ T2A      : num  9.6 22.6 46.9 42.9 60.1 ...
##  $ T2B      : num  47.3 58.7 39.4 45.1 23.1 ...
##  $ T2C      : num  43.1 18.8 13.8 12 16.7 ...
##  $ T3A      : num  24.18 16.48 10.8 17.39 9.93 ...
##  $ T3B      : num  54.7 43.8 81.5 70.5 85.4 ...
##  $ T3C      : num  21.1 39.77 7.73 12.11 4.65 ...
##  $ D1X      : num  0.317 0.289 0.247 0.202 0.488 ...
##  $ D1Y      : num  0.683 0.711 0.753 0.798 0.512 ...
##  $ D2X      : num  0.372 0.546 0.82 0.233 0.909 ...
##  $ D2Y      : num  0.6278 0.4543 0.1798 0.7666 0.0915 ...
##  $ D3X      : num  0.665 0.558 0.587 0.699 0.27 ...
##  $ D3Y      : num  0.335 0.442 0.413 0.301 0.73 ...
##  $ F1       : Factor w/ 5 levels "Jupiter","Mars",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ F2       : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
##  $ F3       : Factor w/ 2 levels "A","B": 1 2 2 2 2 1 1 2 2 2 ...
##  $ L1XRight : num  0.306 0.351 0.727 0.301 0.724 ...
##  $ L1YTop   : num  0.303 0.5 0.645 0.274 0.772 ...

Calculate and Plot Cluster Means

As in the previous tutorial, use package compositions to perform Aitchison’s transformation of the compositional data, then calculate the distance matrix and generate the cluster dendrogram. Interactively or statically select a height to cut the dendrogram into distinct groups. Assign cluster ids to data as a new column named T1Clusters.
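
The idea behind this step can be sketched in base R alone. This is only an illustration under stated assumptions: the toy values are hypothetical, the helper clr() is written here by hand and does not come from any package, and the tree is cut by number of clusters (k) rather than by height. It relies on the usual definition of Aitchison distance as Euclidean distance between centred log-ratio (clr) coordinates, and it needs strictly positive parts.

```r
# Toy 3-part compositions (hypothetical values; each row sums to 100)
toy <- data.frame(
  A = c(10, 12, 70, 65, 20, 25),
  B = c(30, 28, 20, 25, 20, 15),
  C = c(60, 60, 10, 10, 60, 60)
)

# Centred log-ratio: log of each part minus the mean of the logs in that row
clr <- function(x) {
  lx <- log(as.matrix(x))
  lx - rowMeans(lx)
}

toy_clr   <- clr(toy)
toy_dist  <- dist(toy_clr, method = "euclidean")   # Aitchison distance
toy_clust <- hclust(toy_dist, method = "ward.D2")
toy_ids   <- cutree(toy_clust, k = 2)              # cut by count instead of height
```

Because only ratios between parts matter, rescaling the data (for example from percentages to proportions) leaves the distances unchanged.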

This time, prior to plotting the clustered data, we calculate the cluster means. Calculate the geometric means of each cluster and store these values in a new data frame t1_means. Plot the T1 data in a ternary plot as usual with ggtern with each cluster having a distinct color. This time, however, add a second geom_point() function to draw the cluster means as large triangles of the matching color. Include the legend for the colors.

library(compositions)
library(ggplot2)
library(ggtern)

# First define the clusters (acomp transformation, distance matrix, and hierarchical clustering with Ward's D)
t1_acomp <- acomp(demodata1[,4:6], parts=c("T1A", "T1B", "T1C"))
t1_dist <- dist(t1_acomp, method="euclidean")
t1_clust <- hclust(t1_dist, method="ward.D2")
plot(t1_clust)

# Select a cut height for dendrogram (use "locator" code for interactive selection)
#t1_h <- locator(1)$y
t1_h <- 6

# Label the clusters
t1_clustids <- cutree(t1_clust, h=t1_h)
t1_clusters<-as.data.frame(t1_acomp)
t1_clusters<-cbind(t1_clusters,as.vector(t1_clustids))
names(t1_clusters)[4]<-"T1Clusters"

# Calculate cluster means and then combine the values into a single dataframe
t1_clust1_x <- geometricmeanCol(t1_clusters[t1_clusters$T1Clusters==1,1:3])
t1_clust2_x <- geometricmeanCol(t1_clusters[t1_clusters$T1Clusters==2,1:3])
t1_clust3_x <- geometricmeanCol(t1_clusters[t1_clusters$T1Clusters==3,1:3])
t1_clust4_x <- geometricmeanCol(t1_clusters[t1_clusters$T1Clusters==4,1:3])
t1_means <- rbind(t1_clust1_x, t1_clust2_x, t1_clust3_x, t1_clust4_x)

# Plot the ternary graph of the clusters with their means
# The first geom_point function draws the t1_clusters data
# The second geom_point function calls and draws a different dataset (t1_means)
ggtern(data=t1_clusters,aes(x=T1A, y=T1B, z=T1C)) +
    geom_point(aes(colour=as.factor(T1Clusters))) +
    ggtitle("My Triad with Cluster Means") +
    xlab("A Part") +
    ylab("B Part") +
    zlab("C Part") +
    theme_showarrows() +
    scale_colour_manual(values = c("red", "blue", "cyan", "orange"), name="Ward's D Method",
                        labels=c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4"))+
    geom_point(data=as.data.frame(t1_means), aes(x=T1A, y=T1B, z=T1C), 
               colour=c("red", "blue", "cyan", "orange"), size=5, shape=17)
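
The four geometricmeanCol() lines above are written out by hand, one per cluster. As a hedged alternative, the sketch below computes the same kind of per-cluster geometric means in a loop, so it keeps working when the number of clusters changes. The data frame toy_clusters and the helper geo_mean_cols() are hypothetical stand-ins written in base R; they are not part of the compositions package.

```r
# Hypothetical stand-in for t1_clusters: three parts plus a cluster ID column
toy_clusters <- data.frame(
  T1A = c(10, 20, 60, 70),
  T1B = c(30, 20, 30, 20),
  T1C = c(60, 60, 10, 10),
  T1Clusters = c(1, 1, 2, 2)
)

# Geometric mean of each column: exp of the mean of the logs
geo_mean_cols <- function(x) exp(colMeans(log(as.matrix(x))))

# Split the rows by cluster ID, then apply the helper to every group
parts  <- c("T1A", "T1B", "T1C")
groups <- split(toy_clusters[, parts], toy_clusters$T1Clusters)
toy_means <- as.data.frame(do.call(rbind, lapply(groups, geo_mean_cols)))
toy_means$T1Clusters <- names(groups)
```

The result has one row per cluster, so it could stand in for the data frame built with rbind() above.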

The code above is a good example of a situation where, if you need to evaluate many triads, a function could greatly improve efficiency. You could define one function that uses a rule to select h and a second function that counts the number of clusters resulting from that h value. This count could then be passed to a third function that creates the data frame of means, assigns colors, etc., based on the count for that triad. Without creating functions, running the code for each triad requires much manual editing of code. The same is true for each of the following sections.
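
A minimal sketch of that function-based approach is shown below, with the three steps folded into a single helper for brevity. Everything named here is hypothetical: summarise_triad(), its h_rule argument (here simply half the tallest merge height), the five-colour palette, and the toy data. It uses base R only, with a hand-written centred log-ratio transform standing in for the compositions workflow, and it assumes strictly positive parts.

```r
# Hypothetical helper: cluster one triad and summarise the result.
# h_rule is an illustrative rule that returns a cut height from an hclust object.
summarise_triad <- function(data, parts,
                            h_rule = function(cl) 0.5 * max(cl$height),
                            palette = c("red", "blue", "cyan", "orange", "purple")) {
  x   <- as.matrix(data[, parts])
  clr <- log(x) - rowMeans(log(x))             # centred log-ratio coordinates
  cl  <- hclust(dist(clr), method = "ward.D2")
  h   <- h_rule(cl)                            # step 1: choose h by rule
  ids <- cutree(cl, h = h)
  n   <- length(unique(ids))                   # step 2: count the clusters
  stopifnot(n <= length(palette))
  # step 3: geometric mean of each part within each cluster, plus colours
  means <- do.call(rbind, lapply(split(as.data.frame(x), ids),
                                 function(g) exp(colMeans(log(as.matrix(g))))))
  list(h = h, ids = ids, n_clusters = n,
       means = as.data.frame(means), colours = palette[seq_len(n)])
}

# Toy data (hypothetical values)
toy <- data.frame(
  A = c(10, 12, 70, 65, 20, 25),
  B = c(30, 28, 20, 25, 20, 15),
  C = c(60, 60, 10, 10, 60, 60)
)
res <- summarise_triad(toy, c("A", "B", "C"))
```

The returned list carries what the plotting code needs for one triad: the cluster IDs, one row of means per cluster, and a matching number of colours.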

Clusters in Landscape Data

The X and Y data in a landscape plot are NOT compositional data - the X and Y values are not two parts of a whole; the value of X does not constrain the value of Y and vice versa. Subset to select the X and Y coordinate data columns. Calculate the distance matrix and generate the cluster dendrogram. Interactively or statically select a height to cut the dendrogram into distinct groups. Assign cluster ids to data as a new column named L1Clusters. Calculate the geometric means of each cluster and store these values in a new data frame l1_means. Plot the L1 data in an X,Y scatter plot with each cluster having a distinct color and with the cluster means added as large triangles of the matching color. Include the legend for the colors.

library(ggplot2)

# First define the clusters (distance matrix, and hierarchical clustering with Ward's D)
l1_sub <- demodata1[complete.cases(demodata1[,22:23]),22:23] # Get columns L1XRight and L1YTop
l1_dist <- dist(l1_sub, method="euclidean")
l1_clust <- hclust(l1_dist, method="ward.D2")
plot(l1_clust)

# Select a cut height for dendrogram (use "locator" code for interactive selection)
#l1_h <- locator(1)$y
l1_h <- 2

#Label the groups
l1_clustids <- cutree(l1_clust, h=l1_h)
l1_clusters<-as.data.frame(l1_sub)

# Add the cluster IDs to the full data (which includes the dyad columns)
l1_landdy <- cbind(demodata1[complete.cases(demodata1[,22:23]),],as.vector(l1_clustids))
names(l1_landdy)[24]<-"L1Clusters"

# Calculate cluster means and then combine the values into a single dataframe
l1_clust1_x <- geometricmeanCol(l1_landdy[l1_landdy$L1Clusters==1, 22:23])
l1_clust2_x <- geometricmeanCol(l1_landdy[l1_landdy$L1Clusters==2, 22:23])
l1_means <- rbind(l1_clust1_x, l1_clust2_x)

# Plot L1 landscape clusters 
ggplot(data=l1_landdy,aes(x=L1XRight, y=L1YTop)) +
    geom_point(aes(colour=factor(L1Clusters))) +
    theme_bw() +
    ylim(c(0,1)) +
    ggtitle("L1 Landscape Clusters") +
    xlab("X Scale Values") +
    ylab("Y Scale Values") +
    scale_colour_manual(values = c("green", "brown"), name="Ward's D Method",
                        labels=c("L1Cluster 1", "L1Cluster 2"))+
    geom_point(data=as.data.frame(l1_means), aes(x=L1XRight, y=L1YTop), 
               colour=c("green", "brown"), size=5, shape=17)
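
The landscape code filters out incomplete rows with complete.cases() in two places and has to apply the same filter each time so that the cluster IDs line up with the right observations. The sketch below shows one way to do the filtering once and reuse the result. The toy data are hypothetical, columns are selected by name rather than by position, and the tree is cut by number of clusters (k) instead of height.

```r
# Hypothetical data with a missing landscape coordinate in the second row
toy <- data.frame(
  ObsID    = 1:5,
  L1XRight = c(0.10, NA,   0.15, 0.80, 0.85),
  L1YTop   = c(0.20, 0.50, 0.25, 0.90, 0.80)
)
xy <- c("L1XRight", "L1YTop")

keep     <- complete.cases(toy[, xy])   # TRUE where both coordinates are present
toy_land <- toy[keep, ]                 # filter once, reuse everywhere

# Cluster the retained rows and attach the IDs to the same rows
toy_clust <- hclust(dist(toy_land[, xy]), method = "ward.D2")
toy_land$L1Clusters <- cutree(toy_clust, k = 2)
```

Since the IDs are attached to the same filtered data frame that was clustered, the rows and IDs cannot drift out of alignment.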

Landscape Cluster IDs Plotted on Triad Data

Once cluster IDs have been assigned as a new data column, these ID values can be assigned to any of the visualizations. Thus characteristics of clusters (defined based on data structure rather than demographics) can be visually evaluated across all the data variables.

As an example, the following code plots the Triad 1 (T1A, T1B, T1C) and Triad 2 (T2A, T2B, T2C) data and assigns the points colors based on the Landscape 1 cluster IDs. In each plot, the two mean values are the means of that triad's data points that are members of Landscape 1 cluster 1 and cluster 2.

library(ggplot2)
library(ggtern)

# First define the clusters (distance matrix, and hierarchical clustering with Ward's D)
l1_sub <- demodata1[complete.cases(demodata1[,22:23]),22:23] # Get columns L1XRight and L1YTop
l1_dist <- dist(l1_sub, method="euclidean")
l1_clust <- hclust(l1_dist, method="ward.D2")
plot(l1_clust)

#l1_h <- locator(1)$y
l1_h <- 2

#Label the groups
l1_clustids <- cutree(l1_clust, h=l1_h)
l1_clusters<-as.data.frame(l1_sub)

# Add the cluster IDs to the full data (which includes the dyad columns)
l1_landdy <- cbind(demodata1[complete.cases(demodata1[,22:23]),],as.vector(l1_clustids))
names(l1_landdy)[24]<-"L1Clusters"

# Calculate L1cluster means in T1 space and then combine the values into a single dataframe
l1t1_clust1_x <- geometricmeanCol(l1_landdy[l1_landdy$L1Clusters==1, 4:6])
l1t1_clust2_x <- geometricmeanCol(l1_landdy[l1_landdy$L1Clusters==2, 4:6])
l1t1_means <- rbind(l1t1_clust1_x, l1t1_clust2_x)

# Calculate L1cluster means in T2 space and then combine the values into a single dataframe
l1t2_clust1_x <- geometricmeanCol(l1_landdy[l1_landdy$L1Clusters==1, 7:9])
l1t2_clust2_x <- geometricmeanCol(l1_landdy[l1_landdy$L1Clusters==2, 7:9])
l1t2_means <- rbind(l1t2_clust1_x, l1t2_clust2_x)

# Plot L1 landscape clusters on the T1 Triad
# This time assign the plot output to the ggtern object "T1L1" so we can print it later in a grid
T1L1 <- ggtern(data=l1_landdy,aes(x=T1A, y=T1B, z=T1C)) +
    geom_point(aes(colour=factor(L1Clusters))) +
    theme_bw() +
    ggtitle("T1 Triad with L1 Clusters") +
    xlab("A") +
    ylab("B") +
    zlab("C") +
    scale_colour_manual(values = c("green", "brown"), name="Ward's D Method",
                        labels=c("L1Cluster 1", "L1Cluster 2"))+
    geom_point(data=as.data.frame(l1t1_means), aes(x=T1A, y=T1B, z=T1C), 
               colour=c("green", "brown"), size=5, shape=17)

# Plot L1 landscape clusters on the T2 Triad
# This time assign the plot output to the ggtern object "T2L1"
T2L1 <- ggtern(data=l1_landdy,aes(x=T2A, y=T2B, z=T2C)) +
    geom_point(aes(colour=factor(L1Clusters))) +
    theme_bw() +
    ggtitle("T2 Triad with L1 Clusters") +
    xlab("A") +
    ylab("B") +
    zlab("C") +
    scale_colour_manual(values = c("green", "brown"), name="Ward's D Method",
                        labels=c("L1Cluster 1", "L1Cluster 2"))+
    geom_point(data=as.data.frame(l1t2_means), aes(x=T2A, y=T2B, z=T2C), 
               colour=c("green", "brown"), size=5, shape=17)

# Use ggtern's grid.arrange function to call and plot the two ternary graphs side-by-side.
grid.arrange(T1L1, T2L1, ncol=2)
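
The means for T1 and T2 above are computed with near-identical blocks that differ only in the column range. With more triads, the same calculation could be driven by a named list of column sets. This is a base R sketch only: the toy data frame, the triads list, and the helper geo_mean_cols() are hypothetical and stand in for l1_landdy and geometricmeanCol().

```r
# Hypothetical stand-in for l1_landdy: two triads plus landscape cluster IDs
toy <- data.frame(
  T1A = c(10, 20, 60, 70), T1B = c(30, 20, 30, 20), T1C = c(60, 60, 10, 10),
  T2A = c(50, 50, 25, 25), T2B = c(25, 25, 50, 50), T2C = c(25, 25, 25, 25),
  L1Clusters = c(1, 1, 2, 2)
)

# One entry per triad, naming the columns that hold its three parts
triads <- list(T1 = c("T1A", "T1B", "T1C"),
               T2 = c("T2A", "T2B", "T2C"))

# Geometric mean of each column: exp of the mean of the logs
geo_mean_cols <- function(x) exp(colMeans(log(as.matrix(x))))

# For every triad, compute the mean of each landscape cluster in that triad's space
cluster_means <- lapply(triads, function(parts) {
  groups <- split(toy[, parts], toy$L1Clusters)
  as.data.frame(do.call(rbind, lapply(groups, geo_mean_cols)))
})
```

Each element of cluster_means (for example cluster_means$T2) is a small data frame with one row per landscape cluster.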

Session and Package Information

I created and tested these examples with:

RStudio Version 0.99.892
R version 3.2.4 (2016-03-10) – “Very Secure Dishes”
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 8 x64 (build 9200)
package ‘ggtern’ version 2.1.0
package ‘ggplot2’ version 2.1.0

If you need to install older versions of ggplot2 and ggtern enter these commands (substitute the version numbers you are seeking):

oldggternurl <- "http://cran.r-project.org/src/contrib/Archive/ggtern/ggtern_1.0.2.0.tar.gz"
install.packages(oldggternurl, repos=NULL, type="source")
oldggplot2url <- "http://cran.r-project.org/src/contrib/Archive/ggplot2/ggplot2_1.0.1.tar.gz"
install.packages(oldggplot2url, repos=NULL, type="source")

After installation, you should restart RStudio.
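
Before rolling anything back, it can help to check which versions are actually installed. The sketch below uses packageVersion() and package_version() from base R. The package name "stats" is only a placeholder so the snippet runs on any installation; substitute "ggtern" or "ggplot2". The comparison value "2.1.0" is the ggtern version listed above.

```r
# Report the running R version
R.version.string

# Installed version of a package ("stats" is a placeholder; try "ggtern" or "ggplot2")
pkg <- "stats"
installed <- packageVersion(pkg)

# Compare with the version these examples were tested with
tested_with <- package_version("2.1.0")
needs_attention <- installed != tested_with
```

Calling sessionInfo() gives a fuller report covering the platform and every attached package.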

Contact Information

For more information about this R script and associated data support consulting services, contact Dr. Ashton Drew.
