Replicating Higgs and Attwood’s analysis on the properties of amino acids

##Introduction This script is a replication of Higgs and Attwood’s analysis on the properties of amino aicds. This project will help us determine whether there is a correlation between amino acids and their relevance in protein folding, The amino acid property data used in this script is used with principal component analysis and clustering algorithms. Amino acids vary greatly in size, charge, hydrophobicity, and other physical properties. Principal component analysis (PCA) is a way of visualizing the important features of multidimensional data sets, such as tables of amino acid properties.

##Preliminaries Packages

library(ggplot2) # create graphs that represent both univariate and multivariate numerical and categorical data in a straightforward manner
library(ggpubr) # create easily publication-ready plots, makes ggplot2 easier
library(vegan) # PCA has functions doing/plotting PCA

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.5-6

library(scatterplot3d) # creates scatterplot in R
library(permute) # implementation of the permutation schemes
library(lattice) #  number of different functions to create different types of plot

The ggplot2 package allows R to create many different types of graphical represenations of data, and the ggpubr package makes these graphs easily ready for publication. They go hand-in-hand.

##Build the dataframe These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical varibles based on online information. Make the vectors.

# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N', 'P','Q','R','S','T','V','W','Y')
## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)
## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90, 114,148,73,93,105,163,141)
##  bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4, 15.71,21.4,16.25,12.28,17.43,
              14.45,14.28,9.47,15.77,21.57,21.67,18.03)
## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13, 1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)
## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98, 5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)
## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9, -3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)
## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8, -0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)
## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158, 143,189,241,122,146,160,259,229)
## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52, 0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)
# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10, 6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)
## relative frequency of occurance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences ### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
                10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)
##
# charges
## un = Un-charged
## neg = negative
## pos = positive charge<-c('un','un','neg','neg','un','un','pos','un','pos','un','un','un','un','un','pos','un','un','un','un','un')
## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic',' hydrophilic','hydrophobic','neutral','neutral',
              'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
              'hydrophilic','neutral',
              'hydrophilic','hydrophilic','neutral','neutral',
              'hydrophobic','hydrophobic','neutral')
## vol
vol.cat<-c('verysmall','small','small','medium', 'verylarge','verysmall','medium','large','large',
              'large','large','small','small','medium','large',
              'verysmall',
              'small','medium','verylarge','verylarge')
## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar', 'nonpolar','nonpolar','polar','nonpolar',
                'polar','nonpolar','nonpolar','polar','nonpolar','polar',
                'polar','polar','polar','nonpolar','nonpolar','polar')
## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic', 'aliphatic','basic','aliphatic','basic','aliphatic','sulfur', 'amide','aliphatic','amide','basic','hydroxyl','hydroxyl', 'aliphatic','aromatic','aromatic')

Build the data frames. ### Build dataframe

aa_dat = data.frame(aa)
aa_dat <- data.frame(aa,
                     MW.da,vol,
                     bulk, pol,
                     isoelec, H2Opho.34, H2Opho.35,
                     saaH2O, faal.fold, polar.req,
                    freq, hydropathy,
                     vol.cat, pol.cat, chemical)

Raw data exploration

correlation table

##creating a correlation table of all the variables in the data frame
cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2) 
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_

##           MW.da   vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da        NA    NA    NA    NA      NA        NA        NA     NA        NA
## vol        0.93    NA    NA    NA      NA        NA        NA     NA        NA
## bulk       0.55  0.73    NA    NA      NA        NA        NA     NA        NA
## pol        0.29  0.24 -0.20    NA      NA        NA        NA     NA        NA
## isoelec    0.20  0.36  0.08  0.27      NA        NA        NA     NA        NA
## H2Opho.34 -0.27 -0.08  0.44 -0.67   -0.20        NA        NA     NA        NA
## H2Opho.35 -0.25 -0.16  0.32 -0.85   -0.26      0.85        NA     NA        NA
## saaH2O     0.97  0.99  0.64  0.29    0.35     -0.18     -0.23     NA        NA
## faal.fold  0.11  0.18  0.49 -0.53   -0.18      0.84      0.79   0.12        NA
## polar.req -0.05 -0.19 -0.53  0.76   -0.11     -0.79     -0.87  -0.11     -0.81
## freq      -0.52 -0.30 -0.04 -0.01    0.02      0.26     -0.02  -0.38     -0.18
##           polar.req freq
## MW.da            NA   NA
## vol              NA   NA
## bulk             NA   NA
## pol              NA   NA
## isoelec          NA   NA
## H2Opho.34        NA   NA
## H2Opho.35        NA   NA
## saaH2O           NA   NA
## faal.fold        NA   NA
## polar.req        NA   NA
## freq           0.14   NA

The code below applies the which() function, which in this scenario idenitfies where in the table has the most negative correlation between amino acids

saaH2O and vol have the most positive correlation

polar.req and H2Opho.35 have the most negative correlation

# identifying which where in the table has the most negative correlation
which(cor_ == max(cor_, na.rm = T), arr.ind = T)

##        row col
## saaH2O   8   2

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor) 
{
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y))
  txt <- format(c(r, 0.123456789), digits = digits)[1] 
  txt <- paste0(prefix, txt)
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * r) 
}

## makes subset of scatter plot matrix
plot(aa_dat[,c("MW.da", "vol", "pol",
               "H2Opho.34", "polar.req",
               "saaH2O", "faal.fold")],
    panel = panel.smooth,
    upper.panel = panel.cor,
    main = "Example")

## creates the dimensions of all the plots in the presentation and includes correlations
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...) 
{
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y))
  txt <- format(c(r, 0.123456789), digits = digits)
  txt <- paste0(prefix, txt)
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * r) 
}

According to the 6 variables I plotted, there is a large range of weak to strong correlations. The correlation between MW.da and polar.req is the weakest. The strongest correlation is between the variables vol and saaH2O.

This plot represent the relationship the polar.req (polar requirement) vs. freq(mean percentage of each amino acid in the protein sequences of modern organisms")

plot(polar.req - freq, data = aa_dat, #does main plotting
     xlab = "freq",                   #labels x-axis
     ylab = "polar req.",             #labels y-axis
     main = "polar req v. freq.",
     col = 0)

## Warning in plot.window(...): "data" is not a graphical parameter

## Warning in plot.xy(xy, type, ...): "data" is not a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not a
## graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not a
## graphical parameter

## Warning in box(...): "data" is not a graphical parameter

## Warning in title(...): "data" is not a graphical parameter

text(polar.req - freq,
     labels = aa,
     data = aa_dat,
     col = 1:20)

## Warning in text.default(polar.req - freq, labels = aa, data = aa_dat, col =
## 1:20): "data" is not a graphical parameter

## Figure 2: HA Figure 2.8 with ggpubr

ggscatter(y = "polar.req", x = "freq",
          size = "polar.req",
          color = "polar.req",
          data = aa_dat,
          xlab = "freq",
          ylab = "polar.req")

Figure 3: Highly correlated data

-the regression line formula is y = b1x + b0, where b0 is the y-intercept and b1 is the slope coefficient -the data.ellipse constructs and returns one set of x, y coordinates for each value of probs, in a form that can be passed directly to polygon

ggscatter(y = "polar.req",
          x = "freq",
          size = "polar.req",
          add = "reg.line", # line of best fit 
          ellipse = TRUE, # data ellipse
          cor.coef = TRUE, # correlation coef 
          # color = "freq",
          data = aa_dat,
          xlab = "freq",
          ylab = "polar.req")

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

Figure 4: Non-linear relationship

ggscatter(y = "vol",
          x = "pol",
          size = "vol",
          color = "vol",
          add  = "loess",
          data = aa_dat)

## `geom_smooth()` using formula 'y ~ x'

Figure 5: Non-linear relationship on LOG scale

aa_dat$vol_log <- log(aa_dat$vol)
ggscatter(y = "vol",
          x = "pol",
          add = "reg.line",
          cor.coef = TRUE,
          size = "vol",
          color = "vol",
          data = aa_dat)

## `geom_smooth()` using formula 'y ~ x'

##Figure 6: 3D Scatterplot I chose to make a 3d scatterplot of the variavles bulk, pol, and freq. The data plots are rather close to each other and not incredibly scattered, which suggest that these 3 variables would have a strong correlation.

aa_dat$bulk_log <- log(aa_dat$bulk)
aa_dat$pol_log <- log(aa_dat$pol)
aa_dat$freq_log <- log(aa_dat$freq)

aa_dat_df <- data.frame(aa_dat)
scatterplot3d(aa_dat_df[, c("bulk_log",
                            "pol_log",
                            "freq_log")]) #highlight.3d = TRUE type="h" #<---- can't get these two to be working

Principal component analysis

According to Higgs and Attwood, PCA “transforms this cloud of points first by scaling them and shifting them to the origin, and then by rotating them in such a way that the points are spread out as much as possible, and the structure in the data is made easier to see.” (pg.26). One can do this by beginning with the data in the form of an N x P matrix, where each row in the data matrix can be thought of as the coordinates of a point in P-dimensional space. The PCA diagram manages to do a fairly good job at illustrating all these similarities at the same time.

Data prep

To make this new dataframe, first assign the data to a new variable, which is then subsetted as a dataframe using ngative indexing, followed by column naming.

subsetting the dataframe

subset dataframe using negative indexing

aa_dat2 <- data.frame(aa_dat)
aa_dat2 <- aa_dat[,-c(1,
                      2,11,12,
                      13,14,15,16,17,
                      18)]

subset by naming the colums

aa_dat2 <- aa_dat[,c("vol" = 1,"bulk" = 2,"pol" = 11,"isoelec" = 12,
                     "h2Opho.34" = 13,"h2Opho.35" = 14,
                     "saaH2O" = 15, "faal.fold" = 16, "polar.req" = 17)]

##couldnt get code to work from here on ## Figure 7: Principal components analysis - base R Run a PCA and create a PCA plot using base R command.

#aa_dat2 <- na.omit(aa_dat2)
#pca.out <- prcomp(as.numeric(aa_dat2[1,2,11,12,13,14,15,16,17,18], scale = TRUE) )

Plot the output

##biplot(pca.out)

Figure 8: Principal components analysis - vegan

scale = TRUE means that scaling is done by dividing the (centered) columns of x by their standard deviations if center is TRUE, and the root mean square otherwise.

## rda.out <- vegan::rda(aa_dat2, 
 ##                      scale = TRUE)

Extract what are known as the PCA score

##rda_scores <- scores(rda.out)

Make a “biplot” (I usually call this a PCA plot) using the code below.

There are many differences between my plot and the one in Figure 2.10, page 28, of Higgs and Attwood. The scale for the a and y axes are different, with this plot having larger numbers. The lay out of the data is also different, with Higg’s data being more scattered throguhout the graph whereas this one’s data values are more clumped together in the center of the graph.

## biplot(rda.out, display = "sites", col = 0,)
## orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2) 
## ordihull(rda.out,
##          group = aa_dat$hydropathy, 
##          col = 1:7,
 ##         lty = 1:7,
 ##         lwd = c(3,6))

Cluster analysis

The purpose of cluster analysis is to provide a form of data analysis in which observations are divided into different groups that share common characteristics. It performs the task of assigning some set of objects into the groups are also known as clusters.

UPGMA constructs a rooted tree that reflects the structure present in a pairwise similarity matrix, or a dissimilarity matrix. At each step, the nearest two clusters are combined into a higher-level cluster.

#Hierarchical Cluster Analysis The Euclidean distance of a line segment between the two points. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, and is occasionally called the Pythagorean distance.

##dist_euc <- dist(aa_dat2,
##                  method = "euclidean")

 ##clust_euc <- hclust(dist_euc,
 ##                    method = "average") #this part sets it to use UPGMA

My cluster diagram is different in many ways compared to the cluster diagram produced by Higgs and Atwood (2005) in Plate 2.2. The layout of the diagram is more flat and horizontal, while the one in the reading is more compressed and skinny. There are some similarities, like how there are 4 distinct, broad, large classes with multiple min clades within them.The color scheme is also different as the colors were more scattered on mine compared to the one in reading in which the colors were more distinct from each other.

## clust_euc <- hclust(dist_euc)

## dendro_euc <- as.dendrogram(clust_euc)

## plot(dendro_euc.horiz = T,
##      nodePar = list(pch = c(1,NA),
##                     cex = 0.5,
##                     lab.cex = 1.2))