Introduction

In this script, we look at various methods of plotting and analyzing amino acid data. We are given data points about each individual amino acid such as their molecular weight, isoelectric point, and polarity. We build the dataframe and being with a scatterplot matrix analysis so we can visualize how the data relates to each other. Next, we also create various plots in ordere to visualize relationships between two (or three) variables within the set. Lastly, we run two different PCA and a cluster analysis to further visualize the data. This is biologically important because amino acids are the building blocks to proteins and we rely on proteins to do many things to keep us alive. Studying amino acids can lead us to many biological advances and it is the main focus of biochemists around the globe.

Preliminaries

Packages

## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-6

ggpubr is basicalyl a wrapper for ggplot that provides some functions for customizing ggplot2 plots. The focus is to make ggplot2 easier to use

Build the

Here we will start by building the vectors that will become columns in the dataframe

These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical varibles based on online information.

# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N', 'P','Q','R','S','T','V','W','Y')

## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)

## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90, 114,148,73,93,105,163,141)

##  bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4, 15.71,21.4,16.25,12.28,17.43,
              14.45,14.28,9.47,15.77,21.57,21.67,18.03)

## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13, 1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98, 5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)

## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9, -3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)

## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8, -0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)

## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158, 143,189,241,122,146,160,259,229)

## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52, 0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)

# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10, 6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)

## relative frequency of occurance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences ### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
                10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)

# charges
## un = Un-charged
## neg = negative
## pos = positive 
charge<-c('un','un','neg','neg','un','un','pos','un','pos','un','un','un','un','un','pos','un','un','un','un','un')

## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic',' hydrophilic','hydrophobic','neutral','neutral',
              'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
              'hydrophilic','neutral',
              'hydrophilic','hydrophilic','neutral','neutral',
              'hydrophobic','hydrophobic','neutral')

## vol
vol.cat<-c('verysmall','small','small','medium', 'verylarge','verysmall','medium','large','large',
              'large','large','small','small','medium','large',
              'verysmall',
              'small','medium','verylarge','verylarge')

## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar', 'nonpolar','nonpolar','polar','nonpolar',
                'polar','nonpolar','nonpolar','polar','nonpolar','polar',
                'polar','polar','polar','nonpolar','nonpolar','polar')

## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic', 'aliphatic','basic','aliphatic','basic','aliphatic','sulfur', 'amide','aliphatic','amide','basic','hydroxyl','hydroxyl', 'aliphatic','aromatic','aromatic')

Build the dataframes.

Now we will build the dataframe which is called aa_dat. We use each of the vectors we created in the last code chunk in order to do so.

### Build dataframe
aa_dat <- data.frame(aa,MW.da,vol,bulk,pol,isoelec,H2Opho.34, H2Opho.35, saaH2O, faal.fold, polar.req, freq, charge, hydropathy, vol.cat, pol.cat, chemical)

Raw data exploration

Correlation matrix

This is building the correlation matrix of all colums in the aa_dat dataframe. We use the cor() function to do this as well as rounging them to 2 decimal places.

cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2) 
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
##           MW.da   vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da        NA    NA    NA    NA      NA        NA        NA     NA        NA
## vol        0.93    NA    NA    NA      NA        NA        NA     NA        NA
## bulk       0.55  0.73    NA    NA      NA        NA        NA     NA        NA
## pol        0.29  0.24 -0.20    NA      NA        NA        NA     NA        NA
## isoelec    0.20  0.36  0.08  0.27      NA        NA        NA     NA        NA
## H2Opho.34 -0.27 -0.08  0.44 -0.67   -0.20        NA        NA     NA        NA
## H2Opho.35 -0.25 -0.16  0.32 -0.85   -0.26      0.85        NA     NA        NA
## saaH2O     0.97  0.99  0.64  0.29    0.35     -0.18     -0.23     NA        NA
## faal.fold  0.11  0.18  0.49 -0.53   -0.18      0.84      0.79   0.12        NA
## polar.req -0.05 -0.19 -0.53  0.76   -0.11     -0.79     -0.87  -0.11     -0.81
## freq      -0.52 -0.30 -0.04 -0.01    0.02      0.26     -0.02  -0.38     -0.18
##           polar.req freq
## MW.da            NA   NA
## vol              NA   NA
## bulk             NA   NA
## pol              NA   NA
## isoelec          NA   NA
## H2Opho.34        NA   NA
## H2Opho.35        NA   NA
## saaH2O           NA   NA
## faal.fold        NA   NA
## polar.req        NA   NA
## freq           0.14   NA

Here we are determining which variables have the most positive and most negative correlations, as well as where they are located in the correlation matrix.

which(cor_ == max(cor_, na.rm = T), arr.ind = T)
##        row col
## saaH2O   8   2
which(cor_ == min(cor_, na.rm = T), arr.ind = T)
##           row col
## polar.req  10   7

Plot 0: Scatterplot Matrix

This panel.cor function is meant to be passed into the upper.panel argument of the plot funciton. This makes it so that the uppper panels in the scatterplot matrix have the correlations in them instead of the scatterplot. This leaves the bottom plots of the matrix unchanged from the default.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}

Now we will create the scatterplot matrix with 7 variables and utilize the above panel.cor function to do so.

plot(
    aa_dat[,c("MW.da","vol","pol","H2Opho.34","polar.req","isoelec","freq")],
    panel=panel.smooth,
    upper.panel = panel.cor,
    main = "Scatterplot Matrix"
     )

By examining this matrix we can see that MW.da and volume have the highest positive correlation at a calue of 0.93 (meaning they are almost directly correlated). We also see that there are no notible nonlinear correlations, they all seem to either be linear or not correlated at all. It is also worth noting that we do not see any negative correlations in the matrix.

Plot 1: Replication Higgs and Atwood Figure 2.8

In the code below, the plot() function is handling the plotting of the volume vs pI data, while the text() function is responsible for plotting the colored letter labels that represent the single letter abbreviation for the amino acids.

#main plotting function
plot(vol  ~ isoelec, data = aa_dat,
     xlab = "Isoelectric Point",
     ylab = "Volume (A^3)",
     main = "Plot of amino acid volumes against pI",
     col = 0)
#adding text labels
text(vol  ~ isoelec, 
     labels = aa, 
     data = aa_dat, 
     col = 1:20)

This plot represents the relationship between isoelectric point (pI) and the volume of the specific amino acid. It is useful to see their correlations.

Figure 2: HA Figure 2.8 with ggpubr

Here we are creating the same plot as in plot 1, except this time we are using ggpubr’s ggscatter instead of the normal plot and text functions. The size of the dots prepresent the molecular weight of the amino acid and the color represents the measure of the electric field strength around the molecule.

ggscatter(y='vol',
          x='isoelec',
          size='MW.da',
          color='pol',
          data = aa_dat,
          xlab="Isoelectric Point",
          ylab="Volume (A^3)")

Figure 3: Highly correlated data

Here we are taking 2 highly correlated columns (MW.da and volume with a coeff of 0.93) and plotting them using ggpubr’s ggscatter just as before. This time, however, we are also including various additions to the graph such as a correlation coefficient and a regression line.

ggscatter(y="MW.da",
          x="vol",
          size="MW.da",
          add="reg.line", # adds line of best fit
          ellipse=TRUE, # adds data ellipse
          cor.coef=TRUE, # adds correlation coefficient
          data=aa_dat,
          xlab="Volume",
          ylab="Molecular Weight (Da)")
## `geom_smooth()` using formula 'y ~ x'

The regresion line (aka line of best fit) is given in the form of y=mx+b where m is the slope of the regression line and b is the y-intercept. The data ellipse gives an area on the graph where we can reasonably expect new data points to fall. It is basically an area of confidence for other future points.

Figure 4: Non-linear relationships

Here we are taking 2 variables which show a non-linear relationship and plotting them with a loess smoother.

ggscatter(y = "faal.fold",
          x = "polar.req",
          add  = "loess",
          data = aa_dat,
          size='MW.da',
          color='pol',
          xlab = "Polar requirement",
          ylab="Fraction of accessible area lost")
## `geom_smooth()` using formula 'y ~ x'

Figure 5: Non-linear relatnioship on log scale

Here we are recreating figure 4 except we are using the log() function to transform it before graphing.

aa_dat$log_faalfold <- log(aa_dat$faal.fold)
aa_dat$log_polarreq <- log(aa_dat$polar.req)

ggscatter(y = "log_faalfold",
          x = "log_polarreq",
          add  = "reg.line",
          data =  aa_dat,
          size='MW.da',
          color='pol',
          xlab = "log(Polar requirement)",
          ylab="log(Fraction of accessible area lost)")
## `geom_smooth()` using formula 'y ~ x'

The log trasformation takes non-linearaly relateed data and transforms it into a linear relationship. For this example, we can see that the log transformed graph shows a linear negative relationship while the standard data was a curved relationship.

Figure 6: 3D Scatterplot

Here we are creating a 3D scatterplot using 3 highly correlated data points from aa_dat data frame.

scatterplot3d(x=aa_dat$vol,
              y=aa_dat$saaH2O,
              z=aa_dat$MW.da,
              type="h",
              highlight.3d =TRUE)

The correlations in this plot are as follows: vol vs saaH20 = 0.99 vol vs MW.da = 0.93 saaH20 vs MW.da = 0.97 These are all very strong positive correlations. Anything over an R value of 0.9 is considerd very strong and that is why I chose these 3 to represent in the 3d scatterplot.

Principal component analysis

Principal componennt analysis (PCA) is a dimensional reduction tool that is used in visualizing data sets. Higgs and Attwood explain PCA as “a way of visualizing the important features of multidimensional data sets, such as tables of amino acid properties” (Higgs and Attwood 2005, pg 34). In short, it transforms large data sets with many variables into a smaller data said while losing as little information as possible. A PCA is successful if you lose the least amount of information possible while making it easier to visualize.

In this class, we used a few methods to create PCA plots. The first one being BaseR in which we use prcomp() to run the PCA and biplot() to plot the PCA. Vegan is the other one where we still use biplot() to plot the PCA but we use vegan::rda() to run the PCA.

Data Prep

aa_dat2 <- aa_dat[,c("vol", "bulk", "pol", "isoelec", "H2Opho.34", "H2Opho.35", "saaH2O", "faal.fold")]

In order to make this new data frame, I first removed all of the categorical data. Then, I refered to Table 2.2 in the Higgs and Attwood 2005 paper and made sure that my new data frame only inncluded the 8 varibles that were in the table header. I used the “subset by naming the columns” method that was described in lecture to create aa_dat2.

Figure 7: Principal Components Analysis - base R

Here we are using the base R command to plot the PCA.

pca.out <- prcomp(aa_dat2, scale=TRUE)
biplot(pca.out)

Figure 8: Principal Components Analysis - vegan

Here we are using the vegan command to plot the PCA.

rda.out <- vegan::rda(aa_dat2, scale = TRUE)
rda_scores <- scores(rda.out)
biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2) 
ordihull(rda.out, group = aa_dat$pol.cat, col = 1:7,
lty = 1:7,
lwd = c(3,6))

According to the R documentation, ‘scale = TRUE’ causes the variables to be scaled to have unit variance (stdev and variance tending toward 1) before analysis takes place.

There are a few differences between my Figure 8 and Figure 2.10 from the Higgs and Attwood paper. First of all, the PC variables have different scales, with mine having a higher range for both PC1 and PC2. Second, my plot includes the grouping lines that are not included in the paper. Lastly, mine is plotted at a different orientation than the one in the paper (in order for it to match, it needs to be flipped on the horizontal axis).

Cluster Analysis

In short, cluster analysis’ purpose is to create groups of a set of objects (called clusters) such that everything in the cluster is similar to each other in a certain way. Because of this, objects that are in different clusters are different. This is useful for visually comparing data in a dataset.

Unweighted Pair Group Method with Arithmetic Mean (UPGMA) is a method for hierarchical clustering. It uses a similarity matrix (or dissimilarity matrix) to construct an unrooted tree called a dendrogram. In short, once you have the matrix you take the smallest distance value, cluster those variables, and then update the distance matrix to account for the distances between the existing variables and the new cluster. You continue this until the end.

Hiearchical Cluster Analysis

Adding row names

row.names(aa_dat2) <- paste(aa_dat$aa, sep=".")

Creating a distance matrix using the dist() function and euclidean method.

dist_euc <- dist(aa_dat2, method="euclidean")

A distance matrix is a 2d matrix/array that contains the pairwise distances between objects of a dataset. The diagonal of the matrix will always be all 0s because by definition, the distance between an object and itself is 0. There are a few ways to calculate distance, but the most common is euclidean distance. Euclidean distance describes the distance between 2 points in “euclidean space” which is basically just the standard 2d coordinate system that we use. It is described by the formula sqrt((a1-b1)2+(a1-b2)2) where a and b are points in euclidean space.

Here we are doing the cluster analysis with hclust.

clust_euc = hclust(dist_euc, method="average") #method="average" is causing hclust to use the UPGMA algorithm

Here we are plotting the dendrogram

dendro_euc <- as.dendrogram(clust_euc)
plot(dendro_euc,horiz = T,nodePar = list(pch = c(1,NA), 
                                          cex = 0.5, 
                                          lab.cex = 1.2))

My cluster diagram is made using UPGMA, while the one in the paper is made using CLUTO. CLUTO is a non-hiarchical method unline UPGMA. Because of this, the topology of the dendrogram is different. Additionally, ours included a scale on the bottom wheras the one in the paper lacks a scale.