DRAFT Triad Data Visualizations: Part 4

These documents provide examples of triad data visualization, primarily using the packages ggtern, compositions, and robCompositions. The ggtern package extends R’s ggplot2 package with many specialized options for presenting three-part compositional data in ternary plots. Its author, Nicholas Hamilton, offers many examples with accompanying code. The compositions and robCompositions packages provide functions to transform, analyse, and model compositional data (data in which multiple variables measure distinct parts of a whole).

This code and documentation were developed to accompany in-person tutorials for an individual needing to learn just enough R to understand, create, and edit visualizations of their dissertation data. I wrote these notes primarily to serve as reminders of coding topics covered in our tutoring sessions.

NB: R software and packages are regularly updated, so version notes are provided at the end of this document along with instructions on how to roll back to older versions, if necessary.

Load your data

You will need to provide the correct path to your data files.

# Remember to change the file path to your own directory
demodata1 <- read.csv("E:/P_Teaching/DemoSheets/triad_data1.csv")
str(demodata1)

## 'data.frame':    62 obs. of  23 variables:
##  $ ObsID    : int  27 30 2 37 25 5 10 28 4 52 ...
##  $ StartTime: Factor w/ 47 levels "2/3/2016 19:13",..: 10 11 9 43 23 12 47 4 27 38 ...
##  $ EndTime  : Factor w/ 47 levels "2/3/2016 19:13",..: 11 12 10 43 24 13 47 5 28 39 ...
##  $ T1A      : num  9.65 13.07 45.19 42.77 24.14 ...
##  $ T1B      : num  33.2 78.4 42 40.2 17.4 ...
##  $ T1C      : num  57.12 8.53 12.81 16.98 58.42 ...
##  $ T2A      : num  9.6 22.6 46.9 42.9 60.1 ...
##  $ T2B      : num  47.3 58.7 39.4 45.1 23.1 ...
##  $ T2C      : num  43.1 18.8 13.8 12 16.7 ...
##  $ T3A      : num  24.18 16.48 10.8 17.39 9.93 ...
##  $ T3B      : num  54.7 43.8 81.5 70.5 85.4 ...
##  $ T3C      : num  21.1 39.77 7.73 12.11 4.65 ...
##  $ D1X      : num  0.317 0.289 0.247 0.202 0.488 ...
##  $ D1Y      : num  0.683 0.711 0.753 0.798 0.512 ...
##  $ D2X      : num  0.372 0.546 0.82 0.233 0.909 ...
##  $ D2Y      : num  0.6278 0.4543 0.1798 0.7666 0.0915 ...
##  $ D3X      : num  0.665 0.558 0.587 0.699 0.27 ...
##  $ D3Y      : num  0.335 0.442 0.413 0.301 0.73 ...
##  $ F1       : Factor w/ 5 levels "Jupiter","Mars",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ F2       : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
##  $ F3       : Factor w/ 2 levels "A","B": 1 2 2 2 2 1 1 2 2 2 ...
##  $ L1XRight : num  0.306 0.351 0.727 0.301 0.724 ...
##  $ L1YTop   : num  0.303 0.5 0.645 0.274 0.772 ...
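
Because each triad is a three-part composition, a quick sanity check after loading is to confirm that the three parts of every row add up to the same total. The following is a minimal sketch, assuming the triad columns are percentages that should total 100 within rounding. The toy data frame and the tolerance of 0.1 are illustrative assumptions, not values from the tutorial data.

```r
# Toy stand-in for the T1A/T1B/T1C columns (hypothetical values)
toy <- data.frame(T1A = c(10, 45, 25),
                  T1B = c(30, 42, 17),
                  T1C = c(60, 13, 55))

# Sum the three parts of each row
part_totals <- rowSums(toy[, c("T1A", "T1B", "T1C")])

# Flag rows whose total differs from 100 by more than the tolerance
bad_rows <- which(abs(part_totals - 100) > 0.1)
part_totals
bad_rows
```

With the real data, replace toy with demodata1 and repeat the check for the T2 and T3 columns.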

Evaluate Dyad Values in Triad Clusters

As in the previous tutorial, use the compositions package to perform Aitchison’s transformation of the compositional data, then calculate the distance matrix and generate the cluster dendrogram. Select a height, either interactively or by setting it directly, at which to cut the dendrogram into distinct groups. Assign the cluster IDs to the data as a new column named T1Clusters.

library(compositions)
library(ggplot2)
library(ggtern)


# First define the clusters (acomp transformation, distance matrix, and hierarchical clustering with Ward's method, ward.D2)
t1_acomp <- acomp(demodata1[,4:6], parts=c("T1A", "T1B", "T1C"))
t1_dist <- dist(t1_acomp, method="euclidean")
t1_clust <- hclust(t1_dist, method="ward.D2")

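# Interactive option: draw the dendrogram, then click on it to choose the cut height
# plot(t1_clust)
# t1_h <- locator(1)$y
# Static option: set the cut height directly
t1_h <- 6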

#Label the groups
t1_clustids <- cutree(t1_clust, h=t1_h)
t1_clusters<-as.data.frame(t1_acomp)

#Add the vector of cluster ids to the original dataframe
t1_tridy <- cbind(demodata1,as.vector(t1_clustids))
names(t1_tridy)[24]<-"T1Clusters"

#Plot the triad with its clusters and a custom color scheme
ggtern(data=t1_tridy,aes(x=T1A, y=T1B, z=T1C)) +
    geom_point(aes(colour=as.factor(T1Clusters))) +
    ggtitle("Triad 1 Clusters") +
    xlab("A Part") +
    ylab("B Part") +
    zlab("C Part") +
    theme_showarrows() +
    scale_colour_manual(values = c("red", "blue", "cyan", "orange"), name="Ward's D Method",
                        labels=c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4"))
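
For intuition about what the clustering step measures, here is a minimal base-R sketch. It assumes that the distance between compositions is the Euclidean distance after a centered log-ratio (clr) transformation, which is the usual definition of the Aitchison distance. It does not reproduce the internals of the compositions package. The toy matrix and the helper name clr are hypothetical. The sketch also shows cutree() with k (number of groups) as an alternative to cutting by height h.

```r
# Toy three-part compositions (hypothetical values, each row sums to 100)
toy <- matrix(c(10, 30, 60,
                12, 28, 60,
                45, 42, 13,
                50, 40, 10),
              ncol = 3, byrow = TRUE,
              dimnames = list(NULL, c("A", "B", "C")))

# Centered log-ratio: log of each part minus the row mean of the logs
clr <- function(x) {
  logx <- log(x)
  logx - rowMeans(logx)
}

toy_clr <- clr(toy)

# Euclidean distance between clr-transformed rows
toy_dist <- dist(toy_clr, method = "euclidean")

# Ward clustering, then cut the tree by number of groups (k) instead of height (h)
toy_clust <- hclust(toy_dist, method = "ward.D2")
toy_ids <- cutree(toy_clust, k = 2)
toy_ids
```

Cutting by k is convenient when the number of groups is fixed in advance, for example to match a color scale with a set number of entries. Cutting by h keeps the choice tied to what you see in the dendrogram.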

Once the clusters have been defined, you may want to know how they compare across the other variables you collected. To evaluate dyad values within clusters, use ggplot2 to create boxplots with the clusters as the x-axis factor variable. If your data include missing values (NA), these can be excluded with !is.na(), as shown in the code below. This example also places text within the plots, introducing the annotate() and paste0() functions.

# Select the dataframe and indicate which columns provide the x and y axis values
T1D1boxes <- ggplot(data=t1_tridy[!is.na(t1_tridy$D1X),], aes(x=T1Clusters, y=D1X)) +
    # Select a geometry to display the data and indicate that clusters data should be treated as a group variable
    geom_boxplot(aes(group=as.factor(T1Clusters), fill=as.factor(T1Clusters))) +
    # Change from default to manually selected color scheme and provide labels for the legend
    scale_fill_manual(values = c("red", "blue", "cyan", "orange"), name="Ward's D Method",
                        labels=c("T1Cluster 1", "T1Cluster 2", "T1Cluster 3", "T1Cluster 4")) +
    # Label the x and y axes
    xlab("Triad 1 Cluster Number") +
    ylab("Dyad 1 Value") +
    # Remove the legend because the axis labels and annotations already carry its information
    guides(fill="none") +
    # Add some text that provides the calculated number of samples in each cluster
    # The text is placed according to x and y coordinates
    annotate("text", x =  1:4, y = 0.08, label = c(paste0("N=",sum(t1_tridy$T1Clusters==1)), 
                                                   paste0("N=",sum(t1_tridy$T1Clusters==2)),
                                                   paste0("N=",sum(t1_tridy$T1Clusters==3)),
                                                   paste0("N=",sum(t1_tridy$T1Clusters==4))))
#Repeat for the D2 dyad
T1D2boxes <- ggplot(data=t1_tridy[!is.na(t1_tridy$D2X),], aes(x=T1Clusters, y=D2X)) +
    geom_boxplot(aes(group=as.factor(T1Clusters), fill=as.factor(T1Clusters))) +
    scale_fill_manual(values = c("red", "blue", "cyan", "orange"), name="Ward's D Method",
                        labels=c("T1Cluster 1", "T1Cluster 2", "T1Cluster 3", "T1Cluster 4")) +
    xlab("Triad 1 Cluster Number") +
    ylab("Dyad 2 Value") +
    guides(fill="none") +
    annotate("text", x =  1:4, y = 0.98, label = c(paste0("N=",sum(t1_tridy$T1Clusters==1)), 
                                                   paste0("N=",sum(t1_tridy$T1Clusters==2)),
                                                   paste0("N=",sum(t1_tridy$T1Clusters==3)),
                                                   paste0("N=",sum(t1_tridy$T1Clusters==4))))

# Plot the two ggplot objects side-by-side with grid.arrange
grid.arrange(T1D1boxes, T1D2boxes, ncol=2)
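
Note that the N labels in the plots above are computed from every row of t1_tridy, so a label can exceed the number of points actually drawn if some dyad values are missing. The following is a minimal sketch of counting only the rows that are plotted, using a hypothetical toy data frame in place of t1_tridy.

```r
# Toy stand-in for t1_tridy (hypothetical values)
toy <- data.frame(T1Clusters = c(1, 1, 2, 2, 2, 3),
                  D1X        = c(0.2, NA, 0.5, 0.6, NA, 0.9))

# Keep only rows with a non-missing dyad value, as in the boxplot call
plotted <- toy[!is.na(toy$D1X), ]

# Count rows per cluster and build the "N=" labels in one step
n_per_cluster <- as.vector(table(plotted$T1Clusters))
n_labels <- paste0("N=", n_per_cluster)
n_labels
```

The resulting vector could be passed as the label argument of annotate(), provided every cluster keeps at least one non-missing row so that the number of labels still matches the x positions.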

Evaluate Factor Values in Triad Clusters

To evaluate factor values within clusters, use ggplot2 to create bar plots with the clusters as the fill variable. Two options are illustrated for the geom_bar() geometry: (1) the default presentation, which counts the responses, and (2) the position="fill" presentation, which calculates the proportion of responses in each cluster within each factor level (i.e., within “Yes” and within “No”).

# Select the dataframe and indicate which columns provide the x-axis and fill values
T1F1bar_count <- ggplot(data=t1_tridy[!is.na(t1_tridy$F2),], aes(x=F2, fill=factor(T1Clusters))) +
    # The default geom_bar() stacks the count of responses in each cluster
    geom_bar() +
    # Change from default to manually selected color scheme and provide labels for the legend
    scale_fill_manual(values = c("red", "blue", "cyan", "orange"), name="Ward's D Method",
                        labels=c("T1Cluster 1", "T1Cluster 2", "T1Cluster 3", "T1Cluster 4")) +
    # Label the x and y axes (the x variable plotted here is F2)
    xlab("Factor 2 Value") +
    ylab("Count of Responses by Cluster")

T1F1bar_percent <- ggplot(data=t1_tridy[!is.na(t1_tridy$F2),], aes(x=F2, fill=factor(T1Clusters))) +
    # position="fill" rescales each bar to 1 so it shows the proportion of responses in each cluster
    geom_bar(position="fill") +
    # Change from default to manually selected color scheme and provide labels for the legend
    scale_fill_manual(values = c("red", "blue", "cyan", "orange"), name="Ward's D Method",
                        labels=c("T1Cluster 1", "T1Cluster 2", "T1Cluster 3", "T1Cluster 4")) +
    # Label the x and y axes (the x variable plotted here is F2)
    xlab("Factor 2 Value") +
    ylab("Proportion of Responses by Cluster")

grid.arrange(T1F1bar_count, T1F1bar_percent, ncol=2)

Notice that these two visual representations of the same data can lead to different conclusions. Does Cluster 1 have more Yes or No responses? It depends on whether you care about the raw count or the proportion: is 1 out of 2 (50%) more or less than 4 out of 10 (40%)?
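
The same contrast can be checked numerically. The following is a minimal sketch with a made-up table chosen to match the 1-of-2 versus 4-of-10 example. The counts are illustrative and are not taken from the tutorial data.

```r
# Hypothetical counts: rows are factor levels, columns are clusters
counts <- matrix(c(1, 1,
                   4, 6),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(F2 = c("No", "Yes"),
                                 Cluster = c("1", "2")))

# Raw counts: what geom_bar() shows by default
counts

# Row proportions: the share of each cluster within "No" and within "Yes",
# which is what position = "fill" displays
props <- prop.table(counts, margin = 1)
props
```

In this toy table, Cluster 1 has more Yes responses by count (4 versus 1) but makes up a larger share of the No responses (50% versus 40%).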

Session and Package Information

I created and tested these examples with:

RStudio Version 0.99.892
R version 3.2.4 (2016-03-10) – “Very Secure Dishes”
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 8 x64 (build 9200)
package ‘ggtern’ version 2.1.0
package ‘ggplot2’ version 2.1.0

If you need to install older versions of ggplot2 and ggtern, enter these commands (substituting the version numbers you need):

oldggternurl <- "http://cran.r-project.org/src/contrib/Archive/ggtern/ggtern_1.0.2.0.tar.gz"
install.packages(oldggternurl, repos=NULL, type="source")
oldggplot2url <- "http://cran.r-project.org/src/contrib/Archive/ggplot2/ggplot2_1.0.1.tar.gz"
install.packages(oldggplot2url, repos=NULL, type="source")

After installation, you should restart RStudio.
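
To confirm which versions are in use after a rollback, you can query them from R. The following is a minimal sketch using packageVersion() and requireNamespace() from base R. The package list is an assumption based on the packages named in this tutorial, so edit it to suit your setup.

```r
# Packages named in this tutorial; edit the vector to suit your setup
pkgs <- c("ggplot2", "ggtern", "compositions")

# Report the installed version of each package, or NA if it is not installed
installed <- vapply(pkgs, function(p) {
  if (requireNamespace(p, quietly = TRUE)) {
    as.character(packageVersion(p))
  } else {
    NA_character_
  }
}, character(1))
installed

# R's own version string, for comparison with the notes above
R.version.string
```

Calling sessionInfo() gives a fuller report, including the platform and the packages currently attached.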

Contact Information

For more information about this R script and associated data support consulting services, contact Dr. Ashton Drew.
