Below is code that will first create a data-frame of variables measuring characteristics of amino acids, and then utilize several data visualization methods to determine relationships between these variables. Highly correlated data will be visualized, a 3D plot will be used to show multiple variables and a PCA analysis will show groupings of data by variance in principal components, among other analyses. These analyses are biologically important because amino acids are the primary structure of proteins, the macro-molecules responsible for almost all biological functions. Determing relationships between characteristics of amino acids can help to understand why proteins fold in the way the do or why one protein with a high content of a certain amino acid binds to one receptor but not another. The implications are endless with proper analysis and adequate data.
ggpubr expands on the functionalities of ggplot2 but makes the package easier to use. It also provides functions that create ‘publication ready’ graphics.
library(ggplot2) #Contains functions for plotting/graphing data that are easier, better than base graphics
## Warning: package 'ggplot2' was built under R version 4.0.2
library(ggpubr) #Makes it easier to implement ggplot
## Warning: package 'ggpubr' was built under R version 4.0.2
library(vegan) #Contains functions for plotting PCA
## Warning: package 'vegan' was built under R version 4.0.2
## Loading required package: permute
## Warning: package 'permute' was built under R version 4.0.2
## Loading required package: lattice
## This is vegan 2.5-6
library(scatterplot3d) #Contains functions for creating a 3d scatterplot
## Warning: package 'scatterplot3d' was built under R version 4.0.2
library(permute) #Allows us to create spatial grid designs as well as time series'
library(lattice) #Contains functions useful for creating graphs to display multivariate relationshipsa
The data in each vector represent the values of a characteristic of amino acids, for each acid. For example, MW.da lists the molecular weight in daltons of each amino acid (aa). The volume, bulk, polarity as well as many other variables are also recorded. Listing each value in the order that the acids are listed in the vector aa, will allow use to combine all the vectors into a dataframe, with each value corresponding to the correct aa. - Note - These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical varibles based on online information.
Make the vectors.
# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N', 'P','Q','R','S','T','V','W','Y')
## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)
## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90, 114,148,73,93,105,163,141)
## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4, 15.71,21.4,16.25,12.28,17.43,14.45,14.28,9.47,15.77,21.57,21.67,18.03)
## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13, 1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)
## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98, 5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)
## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9, -3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)
## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8, -0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)
## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158, 143,189,241,122,146,160,259,229)
## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52, 0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)
# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10, 6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)
## Relative frequency of occurrence - shows mean percentage of each amino acid in the protein sequences of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)
## un = Un-charged
## neg = negative
## pos = positive
charge <- c('un','un','neg','neg','un','un','pos','un','pos','un','un','un','un','un','pos','un','un','un','un','un')
## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic',' hydrophilic','hydrophobic','neutral','neutral',
'hydrophobic','hydrophilic','hydrophobic','hydrophobic', 'hydrophilic','neutral',
'hydrophilic','hydrophilic','neutral','neutral', 'hydrophobic','hydrophobic','neutral')
## Volume
vol.cat<-c('verysmall','small','small','medium', 'verylarge','verysmall','medium','large','large', 'large','large',
'small','small','medium','large', 'verysmall', 'small','medium','verylarge','verylarge')
## Polarity
pol.cat<-c('nonpolar','nonpolar','polar','polar', 'nonpolar','nonpolar','polar','nonpolar', 'polar','nonpolar',
'nonpolar','polar','nonpolar','polar', 'polar','polar','polar','nonpolar','nonpolar','polar')
## Chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic', 'aliphatic','basic','aliphatic','basic','aliphatic','sulfur', 'amide','aliphatic','amide','basic','hydroxyl','hydroxyl', 'aliphatic','aromatic','aromatic')
Build the Dataframe
aa_dat <- data.frame(aa, MW.da, vol, bulk, pol, isoelec, H2Opho.34, H2Opho.35, saaH2O, faal.fold, polar.req, freq)
cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
## MW.da vol bulk pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da NA NA NA NA NA NA NA NA NA
## vol 0.93 NA NA NA NA NA NA NA NA
## bulk 0.55 0.73 NA NA NA NA NA NA NA
## pol 0.29 0.24 -0.20 NA NA NA NA NA NA
## isoelec 0.20 0.36 0.08 0.27 NA NA NA NA NA
## H2Opho.34 -0.27 -0.08 0.44 -0.67 -0.20 NA NA NA NA
## H2Opho.35 -0.25 -0.16 0.32 -0.85 -0.26 0.85 NA NA NA
## saaH2O 0.97 0.99 0.64 0.29 0.35 -0.18 -0.23 NA NA
## faal.fold 0.11 0.18 0.49 -0.53 -0.18 0.84 0.79 0.12 NA
## polar.req -0.05 -0.19 -0.53 0.76 -0.11 -0.79 -0.87 -0.11 -0.81
## freq -0.52 -0.30 -0.04 -0.01 0.02 0.26 -0.02 -0.38 -0.18
## polar.req freq
## MW.da NA NA
## vol NA NA
## bulk NA NA
## pol NA NA
## isoelec NA NA
## H2Opho.34 NA NA
## H2Opho.35 NA NA
## saaH2O NA NA
## faal.fold NA NA
## polar.req NA NA
## freq 0.14 NA
The code below determines which variables have the most positive correlation and which have the most negative correlation. - The variables with the most positive correlation are surface area accessible to water (saaH2O) and volume (vol). This makes logical sense, since volume and surface area are positively related. - The variables with the most negative correlation are polarity requirement (polar req) and Hydrophobicity (H20pho.35).
#Finding the variables that have the most positive correlation
which(cor_ == max(cor_, na.rm = T), arr.ind = T)
## row col
## saaH2O 8 2
cor_[which(cor_ == max(cor_, na.rm = T), arr.ind = T) ]
## [1] 0.99
#Finding the variables that have the most negative correlation
which(cor_ == min(cor_, na.rm = T), arr.ind = T)
## row col
## polar.req 10 7
cor_[which(cor_ == min(cor_, na.rm = T), arr.ind = T)]
## [1] -0.87
This function panel.cor allows us to create an upper half of a distance matrix that displays the correlation between the two variables instead of the scatter plot. Further, it displays the correlation in a greater font it the correlation itself is larger.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...) {
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r) }
plot(~MW.da+vol+pol+H2Opho.34+H2Opho.35+polar.req+bulk+isoelec,
data = na.omit(aa_dat[, c("MW.da", "vol", "pol", "H2Opho.34", "H2Opho.35", "polar.req", "bulk", "isoelec")]),
panel = panel.smooth,
upper.panel = panel.cor)
It seems that Vol and MW.da have a very strong relationship as well as H2Opho.35 and pol and H2Opho.34. The correlations are easily determined because the upper half of the matrix displays them in large text.
This plot represents the relationship between the volume of an amino acid and its isoelectric point. We can see that there seems to be a strong, relatively vertical relationship where the volume does nat change, but most amino acids have an isoelectric point around 6.
ggscatter(y = "vol", x = "isoelec", #Choose which variables to add to which axis
data = aa_dat,
ylab = "Volume",
xlab = "Isoelectric Point", #ylab and xlab arguments change the labels of the axes
main = "Volume vs. Isoelectric point",
col = 1:20)
## Warning in if (color %in% names(data) & is.null(add.params$color))
## add.params$color <- color: the condition has length > 1 and only the first
## element will be used
ggscatter(y = "vol", x = "isoelec",
size = "bulk",
color = "freq",
data = aa_dat,
xlab = "Isoelectric Point",
ylab = "Volume")