Introduction

This code provides a comprehensive walk through of data analysis and plotting for one set of data: Amino Acids. Specifically, this script describes a large majority of the steps we took in Dr. Brouwer’s class to analyze other data sets and data frames and applies it to this single data set instead. It is important to learn these techniques because of their wide variety of applications to identify and analyze trends in data (as shown through this script). These types of analyses we have also found in the class to be useful in identifying trends in ecological data studies.

Preliminaries

Packages

library(ggplot2)        # plots graphs to visualize data
library(ggpubr)         # creates "wrappers," which is a function that runs another function for you
library(vegan)          # has functions doing/plotting Principal Components Analysis (PCA)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-6
library(scatterplot3d)  # creates cluster diagrams and 3D graphical representations

ggplot2 and ggpubr are connected, because ggpubr creates wrappers for ggplot2’s core functionality.

Build the dataframe

This data represents mostly represents Table 1 from Higgs (2009) and Higgs & Attwood (2005) with extra columns. The values in the dataframe describe characteristics of amino acids.

# 1 letter code
aa        <-c('A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y')

# molecular weight in Daltons
MW.da     <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)

# volume from van der Waals radii
vol    <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
              114,148,73,93,105,163,141)

# bulkiness - a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
              15.71,21.4,16.25,12.28,17.43,
              14.45,14.28,9.47,15.77,21.57,21.67,18.03)

# polarity - a measure of the electric field strength around the molecule
pol  <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
              1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

# isoelectric point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
                   5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)

# 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
                  -3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)

# 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
                  -0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)

# Surface area accessible to water in an unfolded peptide
saaH2O       <-c(113,140,151,183,218,85,194,182,211,180,204,158,
                 143,189,241,122,146,160,259,229)

# Fraction of accessible area lost when a protein folds
faal.fold    <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
                 0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)

# Polar requirement
polar.req    <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
                 6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)

# Relative frequency of occurrence
## "The frequencies column shows the mean percentage of each amino acid in the protein sequences of modern organisms"
freq        <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
                10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)

# Charges
## un = un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
          'un','un','un','un','un','pos','un','un','un','un','un')

# Hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
              hydrophilic','hydrophobic','neutral','neutral','hydrophobic','hydrophilic','hydrophobic','hydrophobic','hydrophilic','neutral',
              'hydrophilic','hydrophilic','neutral','neutral',
              'hydrophobic','hydrophobic','neutral')

# Vol
volume.cat<-c('verysmall','small','small','medium',
              'verylarge','verysmall','medium','large','large',
              'large','large','small','small','medium','large','verysmall','small','medium','verylarge','verylarge')

# Pol
polarity.cat<-c('nonpolar','nonpolar','polar','polar',
                'nonpolar','nonpolar','polar','nonpolar',
                'polar','nonpolar','nonpolar','polar','nonpolar','polar',
                'polar','polar','polar','nonpolar','nonpolar','polar')

# Chemical group
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
            'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
            'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
            'aliphatic','aromatic','aromatic')

aa_dat <- data.frame(aa,MW.da,vol,bulk,
pol,isoelec,H2Opho.34,H2Opho.35,
saaH2O,faal.fold, polar.req,freq,charge,hydropathy,volume.cat,polarity.cat,chemical)

Raw data exploration

Similarity matrix

cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
##           MW.da   vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da        NA    NA    NA    NA      NA        NA        NA     NA        NA
## vol        0.93    NA    NA    NA      NA        NA        NA     NA        NA
## bulk       0.55  0.73    NA    NA      NA        NA        NA     NA        NA
## pol        0.29  0.24 -0.20    NA      NA        NA        NA     NA        NA
## isoelec    0.20  0.36  0.08  0.27      NA        NA        NA     NA        NA
## H2Opho.34 -0.27 -0.08  0.44 -0.67   -0.20        NA        NA     NA        NA
## H2Opho.35 -0.25 -0.16  0.32 -0.85   -0.26      0.85        NA     NA        NA
## saaH2O     0.97  0.99  0.64  0.29    0.35     -0.18     -0.23     NA        NA
## faal.fold  0.11  0.18  0.49 -0.53   -0.18      0.84      0.79   0.12        NA
## polar.req -0.05 -0.19 -0.53  0.76   -0.11     -0.79     -0.87  -0.11     -0.81
## freq      -0.52 -0.30 -0.04 -0.01    0.02      0.26     -0.02  -0.38     -0.18
##           polar.req freq
## MW.da            NA   NA
## vol              NA   NA
## bulk             NA   NA
## pol              NA   NA
## isoelec          NA   NA
## H2Opho.34        NA   NA
## H2Opho.35        NA   NA
## saaH2O           NA   NA
## faal.fold        NA   NA
## polar.req        NA   NA
## freq           0.14   NA

This code is creating a similarity matrix using the characteristics of amino acids to find how similar or correlated they are with each other.

which(cor_ == max(cor_, na.rm = T), arr.ind = T)
##        row col
## saaH2O   8   2
which(cor_ == min(cor_, na.rm = T), arr.ind = T)
##           row col
## polar.req  10   7

This code is asking for the maximum and minimum correlations between the data in the similarity matrix. The variables with the most positive correlation is saaH2O and vol and the variables with the most negative correlation is polar.req and H2Opho.35.

Plot 0: Scatterplot matrix

This code is defining the dimensions and parameters for the scatterplot matrix below.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0,1,0,1))
  r <- abs(cor(x,y))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste(prefix, txt)
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * r)
}

Here we are actually plotting the scatterplot matrix.

plot(~MW.da+vol+pol+H2Opho.34+polar.req+saaH2O+freq, 
     data = na.omit(aa_dat[, c("MW.da","vol","pol","H2Opho.34","polar.req","saaH2O","freq")]),
     panel = panel.smooth,
      upper.panel = panel.cor )

According to this matrix, the relationship between all the variables is that they are quantitative. As indicated by the matrix, MW.da, vol, and saaH2O have the strongest correlations between each other, measuring at 0.93, 0.97 and 0.99. The weakest correlation on the plot is between MW.da and polar.req at 0.051. Graphs appearing to have nonlinear relationships on the matrix are those variables with weak correlations.

Plot 1: Replication Higgs and Attwood Figure 2.8

The plot represents the relationship of each amino acid’s isoelectric point with its respective volume.

plot(vol ~ isoelec, data = aa_dat,
     xlab = "pI",
     ylab = "Volume",
     main = "Amino Acid Volume vs. pI",
     col = 0)               
# the plot() function is set up to make a blank plot
text(vol ~ isoelec, 
     labels = aa, 
     data = aa_dat, 
     col = 1:20)

# the text() function uses the same tilda notation as plot() to plot the labels for the graph

Figure 2. HA Figure 2.8 with ggpubr

ggscatter(y = "bulk",
          x = "vol",
          size = "polar.req",
          color = "polar.req",
          data = aa_dat,
          xlab = "Volume",
          ylab = "Bulkiness of Side Chain")

Using ggscatter from ggpubr, we are able to make a scatter plot that represents the correlation between x and y, while also integrating the relationship with a third variable using color gradient and size of plotted points.

Figure 3. Highly correlated data

In this code, we will examine a scatter plot that shows highly correlated data, including a regression line, and a data ellipse.

ggscatter(y = "MW.da",
          x = "saaH2O",
          add = "reg.line", # line of best fit
          ellipse = TRUE,   # data ellipse
          cor.coef = TRUE,  # correlation coefficient
          # color = "saaH2O"
          data = aa_dat,
          xlab = "Surface area accessible to water in an unfolded peptide",
          ylab = "Molecular Weight in Daltons"
          )
## `geom_smooth()` using formula 'y ~ x'

A regression line is a single line that best fits your data. This is calculated using the equation y = a + bx. A data ellipse function constructs and returns one set of x,y coordinates for each value of probability, superimposed over a scatter plot of the data, acting as a measure of confidence.

Figure 4: Non-linear relationship

In this code, we will look at the opposite type of relationship (non-linear), using a function that creates a local smooth line through the data.

ggscatter(y = "MW.da",
          x = "polar.req",
          add  = "loess",
          color = "isoelec",
          size = "bulk",
          data = aa_dat,
          xlab = "Polarity Requirement",
          ylab = "Molecular Weight in Daltons")
## `geom_smooth()` using formula 'y ~ x'

Figure 5: Non-linear relationship on a LOG scale

In this code, the log transformation can help us to identify any trends in the data, particularly a linear relationship. It is most often used to normalize skewed data to a normal distribution in order to identify these trends.

ggplot(data = aa_dat, aes(x = polar.req, y = MW.da, color = "saaH2O")) + scale_x_discrete("Polarity Requirement") + scale_y_discrete("Molecular Weight in Daltons") +
  geom_point(aes(size = "pol")) +
  scale_x_log10() + scale_y_log10() + 
  geom_smooth(method=lm)

Figure 6: 3D Scatterplot

This code uses three axes to compare three variables on the same plot. Specifically looking for large correlation trends.

scatterplot3d(x = aa_dat$MW.da,
                     y = aa_dat$vol,
                     z = aa_dat$saaH2O,
              highlight.3d = TRUE,
              type="h")

From this data we can observe a distinct linear relationship in all directions (x, y, and z). This trend or appearance is suggestive of a strong correlation between all three of the variables to one another.

Principal Component Analysis (PCA)

“The PCA method transforms a cloud of points first by scaling them and shifting them to the origin, and then by rotating them in such a way that the points are spread out as much as possible, and the structure in the data is made easier to see” (Higgs and Attwood 2005, pg.26). In doing this, we create an overview of linear relationships between the objects and variables. This is a good starting place when performing multivariate data analysis - one that allows you to note trends, groupings, key variables, and outliers. Additionally, if there is a data set with many variables, PCA can help collapse them into a few principal components to be used in further analysis.

Two basic ways we have made PCA plots in this class are using the base-R method and the vegan package. Base R just uses a simple series of R commands, and the vegan package requires slightly more steps resulting in an output that often draws lines around groupings of points.

Data Prep

Creating a second data frame

# volume from van der Waals radii
vol    <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
              114,148,73,93,105,163,141)

# bulkiness - a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
              15.71,21.4,16.25,12.28,17.43,
              14.45,14.28,9.47,15.77,21.57,21.67,18.03)

# polarity - a measure of the electric field strength around the molecule
pol  <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
              1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

# isoelectric point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
                   5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)

# 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
                  -3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)

# 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
                  -0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)

# Surface area accessible to water in an unfolded peptide
saaH2O       <-c(113,140,151,183,218,85,194,182,211,180,204,158,
                 143,189,241,122,146,160,259,229)

# Fraction of accessible area lost when a protein folds
faal.fold    <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
                 0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)


aa_dat2 <- data.frame(vol,bulk,
pol,isoelec,H2Opho.34,H2Opho.35,
saaH2O,faal.fold)

My approach to making this second data frame was to use the same template as my first one, but manually removing the variables not included in Table 2.2 from Higgs & Attwood (2005)

Figure 7: PCA - base R

Run the PCA using rda()

rda.out <- prcomp(aa_dat2, 
                      scale = TRUE)

Extract the PCA score

rda_scores <- scores(rda.out)

This displays the 2D base-R PCA plot.

biplot(rda.out, display = "sites")

Figure 8: PCA - vegan

biplot(rda.out, display = "sites", col = 0)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
vegan::ordihull(rda.out,
         group = aa_dat$charge,
         col = 1:7,
         lty = 1:7,
         lwd = c(3,6))

The differences between this plot and Figure 2.10 shown on page 28 of Higgs & Attwood (2005) are largely in appearance. As one example, there are no lines included on the Higgs & Attwood graph. Additionally, the plotted sites on this graph are a reflection across the x-axis of the plotted points on Figure 2.10.

Cluster analysis

Cluster analysis is an algorithmic approach to finding discrete groups with varying degrees of (dis)similarity in a data set. What makes it important for us is in its application: hypothesis generation in ecological analyses, identifying distinct regions which respond to an ecologically meaningful group, and more!

UPGMA is a distance method. How this method works is actually explained through its long name: Unweighted Pair-Group Method with Arithmetic mean. Unweighted means that all the pairwise distances contribute equally. Pair-group means that the groups are combined in pairs (dichotomies only). The arithmetic mean refers to the pairwise distances to each group (clade) being mean distances to all members of that group.

Hierarchical cluster analysis

Adding row names.

row.names(aa_dat) <- paste(aa_dat$aa,1:nrow(aa_dat),sep = ".")

Creating a Euclidean distance

dist_euc <- dist(aa_dat, 
                 method = "euclidean")
## Warning in dist(aa_dat, method = "euclidean"): NAs introduced by coercion

A distance matrix is a non-negative, square, symmetric table with elements that correspond to estimates of some pairwise distance between sequences in a set. It uses the proportion of homologous sites in alignment with differing characters. On the other hand, a Euclidean distance is the measurement of a segment connecting two points. This is the most obvious way of representing distance between two points.

clust_euc <- hclust(dist_euc)
# hclust() function is setting the code to use UPGMA 
par(mar = c(1,1,1,1))
is(clust_euc)
## [1] "hclust"
dendro_euc <- as.dendrogram(clust_euc)

is(dendro_euc)
## [1] "dendrogram"
plot(dendro_euc,horiz = T,nodePar = list(pch = c(1,NA), 
                                          cex = 0.5, 
                                          lab.cex = 0.5))

This figure differs greatly from Plate 2.2 shown in Higgs & Attwood (2005). To begin, this diagram is oriented horizontally, and labeled only by 1 letter code. In contrast, Plate 2.2 is oriented vertically and displaying two sets of axes: 1 letter codes arranged in clusters and an assortment of other variables. In addition, there is also a venn-diagram figure to go alongside part a of Plate 2.2, to fill in where the sequence alignment failed. In this venn diagram, there is an included color scheme to display the “chemical” variable. In stark comparison, this plot is only arranged as a dendrogram without sequence alignments or venn diagrams visibly comparing other variables.