Replicating Higgs and Attwood’s analysis on the properties of amino acids

Introduction

This analysis uses multiple data sets relating to each amino acid. Since this is a multi-variable data set, I will be compressing the data using a dimension-reduction approach called principal component analysis (PCA). I will then use cluster analysis, which converts a distance matrix into a dendrogram. Together these will show us clusters of amino acids with similar properties, which can be used for further analysis.

Preliminaries

Packages

ggpubr is an extension of ggplot2. It adds to the functionality of ggplot2 to make more detailed plots with simpler commands.

library(ggplot2)       # Useful for creating plots
library(ggpubr)        # Works with ggplot2 for plot making, adds functionality
library(vegan)         # "tools for descriptive community ecology" - from ?vegan

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.5-6

library(scatterplot3d) # Allows for 3D plotting

Build the Dataframe

The data below represent different quantitative and qualitative properties of each amino acid. Each of these attributes will be used to build a data frame for further analysis. The data include factors unique to each amino acid as well as data regarding interactions with other amino acids.

## 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N', 'P','Q','R','S','T','V','W','Y')


## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)


## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90, 114,148,73,93,105,163,141)


##  bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,15.71,21.4,16.25,12.28,17.43, 14.45,14.28,9.47,15.77,21.57,21.67,18.03)


## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)


## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)


## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,-3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)


## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,-0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)


## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158,143,189,241,122,146,160,259,229)


## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)


# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)


## relative frequency of occurrence
## "The frequencies column shows the mean
## percentage of each amino acid in the protein sequences
## of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
         10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)

## charges
## un = Un-charged
## neg = negative
## pos = positive 
charge<-c('un','un','neg','neg','un','un','pos','un','pos','un','un','un','un','un','pos','un','un','un','un','un')


## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','hydrophilic','hydrophobic','neutral','neutral','hydrophobic','hydrophilic','hydrophobic','hydrophobic','hydrophilic','neutral','hydrophilic','hydrophilic','neutral','neutral','hydrophobic','hydrophobic','neutral')


## vol
vol.cat<-c('verysmall','small','small','medium','verylarge','verysmall','medium','large','large','large','large','small','small','medium','large','verysmall','small','medium','verylarge','verylarge')


## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar','nonpolar','nonpolar','polar','nonpolar','polar','nonpolar','nonpolar','polar','nonpolar','polar','polar','polar','polar','nonpolar','nonpolar','polar')


## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic','aliphatic','basic','aliphatic','basic','aliphatic','sulfur','amide','aliphatic','amide','basic','hydroxyl','hydroxyl', 'aliphatic','aromatic','aromatic')

Data Frame of Amino Acid Data

Below I build two data frames, one with categorical data and one without the categorical data.

# with categorical data
aa_dat <- data.frame(Amino_Acid_Name = aa, 
           Molecular_Weight = MW.da, 
           Vol_VDW = vol, 
           Vol_Categorical = vol.cat,
           Bulk = bulk, 
           Polarity = pol, 
           Polarity_Categorical = pol.cat,
           Polar_Requirement = polar.req, 
           Residue = chemical,
           Isoelectric_Point = isoelec, 
           Charge = charge, 
           First_Hydrophobicity_Scale = H2Opho.34, 
           Second_Hydrophobicity_Scale = H2Opho.35, 
           Hydropathy = hydropathy,
           Surface_Area = saaH2O, 
           Fraction_Accessible_Area = faal.fold, 
           Frequency_of_Occurance = freq)

# without categorical data
aa_dat2 <- data.frame(
           Molecular_Weight = MW.da, 
           Vol_VDW = vol, 
           Bulk = bulk, 
           Polarity = pol, 
           Polar_Requirement = polar.req, 
           Isoelectric_Point = isoelec, 
           First_Hydrophobicity_Scale = H2Opho.34, 
           Second_Hydrophobicity_Scale = H2Opho.35, 
           Surface_Area = saaH2O, 
           Fraction_Accessible_Area = faal.fold, 
           Frequency_of_Occurance = freq)
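
As an alternative to retyping the numeric columns, the second data frame could be derived from the first by keeping only its numeric columns. The sketch below shows the idea on a toy data frame holding the first three amino acids; the names `toy`, `is_num`, and `toy_numeric` are hypothetical and are not used elsewhere in this analysis.

```r
# Toy data frame with a mix of column types (values for A, C, and D)
toy <- data.frame(Amino_Acid_Name  = c("A", "C", "D"),
                  Molecular_Weight = c(89, 121, 133),
                  Charge           = c("un", "un", "neg"),
                  Bulk             = c(11.5, 13.46, 11.68))

# TRUE for numeric columns, FALSE for character (or factor) columns
is_num <- sapply(toy, is.numeric)

# Keep only the numeric columns; drop = FALSE keeps the result a data frame
toy_numeric <- toy[, is_num, drop = FALSE]

names(toy_numeric)
```

The same two lines applied to aa_dat should give a data frame with the same columns as aa_dat2, assuming every categorical column is stored as character or factor.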


aa_dat

##    Amino_Acid_Name Molecular_Weight Vol_VDW Vol_Categorical  Bulk Polarity
## 1                A               89      67       verysmall 11.50     0.00
## 2                C              121      86           small 13.46     1.48
## 3                D              133      91           small 11.68    49.70
## 4                E              146     109          medium 13.57    49.90
## 5                F              165     135       verylarge 19.80     0.35
## 6                G               75      48       verysmall  3.40     0.00
## 7                H              155     118          medium 13.69    51.60
## 8                I              131     124           large 21.40     0.13
## 9                K              146     135           large 15.71    49.50
## 10               L              131     124           large 21.40     0.13
## 11               M              149     124           large 16.25     1.43
## 12               N              132      96           small 12.28     3.38
## 13               P              115      90           small 17.43     1.58
## 14               Q              147     114          medium 14.45     3.53
## 15               R              174     148           large 14.28    52.00
## 16               S              105      73       verysmall  9.47     1.67
## 17               T              119      93           small 15.77     1.66
## 18               V              117     105          medium 21.57     0.13
## 19               W              204     163       verylarge 21.67     2.10
## 20               Y              181     141       verylarge 18.03     1.61
##    Polarity_Categorical Polar_Requirement   Residue Isoelectric_Point Charge
## 1              nonpolar               7.0 aliphatic              6.00     un
## 2              nonpolar               4.8    sulfur              5.07     un
## 3                 polar              13.0    acidic              2.77    neg
## 4                 polar              12.5    acidic              3.22    neg
## 5              nonpolar               5.0  aromatic              5.48     un
## 6              nonpolar               7.9 aliphatic              5.97     un
## 7                 polar               8.4     basic              7.59    pos
## 8              nonpolar               4.9 aliphatic              6.02     un
## 9                 polar              10.1     basic              9.74    pos
## 10             nonpolar               4.9 aliphatic              5.98     un
## 11             nonpolar               5.3    sulfur              5.74     un
## 12                polar              10.0     amide              5.41     un
## 13             nonpolar               6.6 aliphatic              6.30     un
## 14                polar               8.6     amide              5.65     un
## 15                polar               9.1     basic             10.76    pos
## 16                polar               7.5  hydroxyl              5.68     un
## 17                polar               6.6  hydroxyl              6.16     un
## 18             nonpolar               5.6 aliphatic              5.96     un
## 19             nonpolar               5.2  aromatic              5.89     un
## 20                polar               5.4  aromatic              5.66     un
##    First_Hydrophobicity_Scale Second_Hydrophobicity_Scale  Hydropathy
## 1                         1.8                         1.6 hydrophobic
## 2                         2.5                         2.0 hydrophobic
## 3                        -3.5                        -9.2 hydrophilic
## 4                        -3.5                        -8.2 hydrophilic
## 5                         2.8                         3.7 hydrophobic
## 6                        -0.4                         1.0     neutral
## 7                        -3.2                        -3.0     neutral
## 8                         4.5                         3.1 hydrophobic
## 9                        -3.9                        -8.8 hydrophilic
## 10                        3.8                         2.8 hydrophobic
## 11                        1.9                         3.4 hydrophobic
## 12                       -3.5                        -4.8 hydrophilic
## 13                       -1.6                        -0.2     neutral
## 14                       -3.5                        -4.1 hydrophilic
## 15                       -4.5                       -12.3 hydrophilic
## 16                       -0.8                         0.6     neutral
## 17                       -0.7                         1.2     neutral
## 18                        4.2                         2.6 hydrophobic
## 19                       -0.9                         1.9 hydrophobic
## 20                       -1.3                        -0.7     neutral
##    Surface_Area Fraction_Accessible_Area Frequency_of_Occurance
## 1           113                     0.74                   7.80
## 2           140                     0.91                   1.10
## 3           151                     0.62                   5.19
## 4           183                     0.62                   6.72
## 5           218                     0.88                   4.39
## 6            85                     0.72                   6.77
## 7           194                     0.78                   2.03
## 8           182                     0.88                   6.95
## 9           211                     0.52                   6.32
## 10          180                     0.85                  10.15
## 11          204                     0.85                   2.28
## 12          158                     0.63                   4.37
## 13          143                     0.64                   4.26
## 14          189                     0.62                   3.45
## 15          241                     0.64                   5.23
## 16          122                     0.66                   6.46
## 17          146                     0.70                   5.12
## 18          160                     0.86                   7.01
## 19          259                     0.85                   1.09
## 20          229                     0.76                   3.30

aa_dat2

##    Molecular_Weight Vol_VDW  Bulk Polarity Polar_Requirement Isoelectric_Point
## 1                89      67 11.50     0.00               7.0              6.00
## 2               121      86 13.46     1.48               4.8              5.07
## 3               133      91 11.68    49.70              13.0              2.77
## 4               146     109 13.57    49.90              12.5              3.22
## 5               165     135 19.80     0.35               5.0              5.48
## 6                75      48  3.40     0.00               7.9              5.97
## 7               155     118 13.69    51.60               8.4              7.59
## 8               131     124 21.40     0.13               4.9              6.02
## 9               146     135 15.71    49.50              10.1              9.74
## 10              131     124 21.40     0.13               4.9              5.98
## 11              149     124 16.25     1.43               5.3              5.74
## 12              132      96 12.28     3.38              10.0              5.41
## 13              115      90 17.43     1.58               6.6              6.30
## 14              147     114 14.45     3.53               8.6              5.65
## 15              174     148 14.28    52.00               9.1             10.76
## 16              105      73  9.47     1.67               7.5              5.68
## 17              119      93 15.77     1.66               6.6              6.16
## 18              117     105 21.57     0.13               5.6              5.96
## 19              204     163 21.67     2.10               5.2              5.89
## 20              181     141 18.03     1.61               5.4              5.66
##    First_Hydrophobicity_Scale Second_Hydrophobicity_Scale Surface_Area
## 1                         1.8                         1.6          113
## 2                         2.5                         2.0          140
## 3                        -3.5                        -9.2          151
## 4                        -3.5                        -8.2          183
## 5                         2.8                         3.7          218
## 6                        -0.4                         1.0           85
## 7                        -3.2                        -3.0          194
## 8                         4.5                         3.1          182
## 9                        -3.9                        -8.8          211
## 10                        3.8                         2.8          180
## 11                        1.9                         3.4          204
## 12                       -3.5                        -4.8          158
## 13                       -1.6                        -0.2          143
## 14                       -3.5                        -4.1          189
## 15                       -4.5                       -12.3          241
## 16                       -0.8                         0.6          122
## 17                       -0.7                         1.2          146
## 18                        4.2                         2.6          160
## 19                       -0.9                         1.9          259
## 20                       -1.3                        -0.7          229
##    Fraction_Accessible_Area Frequency_of_Occurance
## 1                      0.74                   7.80
## 2                      0.91                   1.10
## 3                      0.62                   5.19
## 4                      0.62                   6.72
## 5                      0.88                   4.39
## 6                      0.72                   6.77
## 7                      0.78                   2.03
## 8                      0.88                   6.95
## 9                      0.52                   6.32
## 10                     0.85                  10.15
## 11                     0.85                   2.28
## 12                     0.63                   4.37
## 13                     0.64                   4.26
## 14                     0.62                   3.45
## 15                     0.64                   5.23
## 16                     0.66                   6.46
## 17                     0.70                   5.12
## 18                     0.86                   7.01
## 19                     0.85                   1.09
## 20                     0.76                   3.30

Raw Data Exploration

Correlation Matrix

The correlation matrix below describes the pairwise relationship (correlation) between the numeric variables in the data frame, calculated across all 20 amino acids.

The closer the value is to 1 in magnitude, the stronger the correlation.

A positive correlation indicates that as one variable increases, so does the other.

A negative correlation indicates that as one variable increases, the other decreases.
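
To make the meaning of a correlation coefficient concrete, the sketch below computes Pearson's r by hand for one pair of variables and compares it with cor(). It re-enters the vol and saaH2O vectors from above so that it runs on its own; `r_manual` and `r_builtin` are illustrative names.

```r
# Volume and water-accessible surface area, copied from the data above
vol    <- c(67,86,91,109,135,48,118,124,135,124,124,96,90,
            114,148,73,93,105,163,141)
saaH2O <- c(113,140,151,183,218,85,194,182,211,180,204,158,143,
            189,241,122,146,160,259,229)

# Deviations of each value from its mean
dx <- vol - mean(vol)
dy <- saaH2O - mean(saaH2O)

# Pearson's r: sum of cross-products divided by the
# square root of the product of the sums of squares
r_manual  <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_builtin <- cor(vol, saaH2O)

all.equal(r_manual, r_builtin)
```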

# Drop column 1 (Molecular_Weight); aa_dat2 holds only numeric columns
cor_ <- round(cor(aa_dat2[, -1]), 2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_

##                             Vol_VDW  Bulk Polarity Polar_Requirement
## Vol_VDW                          NA    NA       NA                NA
## Bulk                           0.73    NA       NA                NA
## Polarity                       0.24 -0.20       NA                NA
## Polar_Requirement             -0.19 -0.53     0.76                NA
## Isoelectric_Point              0.36  0.08     0.27             -0.11
## First_Hydrophobicity_Scale    -0.08  0.44    -0.67             -0.79
## Second_Hydrophobicity_Scale   -0.16  0.32    -0.85             -0.87
## Surface_Area                   0.99  0.64     0.29             -0.11
## Fraction_Accessible_Area       0.18  0.49    -0.53             -0.81
## Frequency_of_Occurance        -0.30 -0.04    -0.01              0.14
##                             Isoelectric_Point First_Hydrophobicity_Scale
## Vol_VDW                                    NA                         NA
## Bulk                                       NA                         NA
## Polarity                                   NA                         NA
## Polar_Requirement                          NA                         NA
## Isoelectric_Point                          NA                         NA
## First_Hydrophobicity_Scale              -0.20                         NA
## Second_Hydrophobicity_Scale             -0.26                       0.85
## Surface_Area                             0.35                      -0.18
## Fraction_Accessible_Area                -0.18                       0.84
## Frequency_of_Occurance                   0.02                       0.26
##                             Second_Hydrophobicity_Scale Surface_Area
## Vol_VDW                                              NA           NA
## Bulk                                                 NA           NA
## Polarity                                             NA           NA
## Polar_Requirement                                    NA           NA
## Isoelectric_Point                                    NA           NA
## First_Hydrophobicity_Scale                           NA           NA
## Second_Hydrophobicity_Scale                          NA           NA
## Surface_Area                                      -0.23           NA
## Fraction_Accessible_Area                           0.79         0.12
## Frequency_of_Occurance                            -0.02        -0.38
##                             Fraction_Accessible_Area Frequency_of_Occurance
## Vol_VDW                                           NA                     NA
## Bulk                                              NA                     NA
## Polarity                                          NA                     NA
## Polar_Requirement                                 NA                     NA
## Isoelectric_Point                                 NA                     NA
## First_Hydrophobicity_Scale                        NA                     NA
## Second_Hydrophobicity_Scale                       NA                     NA
## Surface_Area                                      NA                     NA
## Fraction_Accessible_Area                          NA                     NA
## Frequency_of_Occurance                         -0.18                     NA

The code below finds the pair of variables with the greatest positive correlation: Surface_Area (saaH2O) and Vol_VDW (vol), with r = 0.99.

which(cor_ == max(cor_, na.rm = TRUE), arr.ind = TRUE)

##              row col
## Surface_Area   8   1

The code below finds the pair of variables with the greatest negative correlation: Second_Hydrophobicity_Scale (H2Opho.35) and Polar_Requirement (polar.req), with r = -0.87.

which(cor_ == min(cor_, na.rm = TRUE), arr.ind = TRUE)

##                             row col
## Second_Hydrophobicity_Scale   7   4
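
The output of which() with arr.ind = TRUE names only the row variable; the column is reported as an index. The sketch below shows one way to turn both indices back into variable names, using a small toy matrix filled with three values copied from the correlation output above. The names `cor_toy`, `idx`, and `pair` are hypothetical.

```r
# Toy lower-triangle correlation matrix (values copied from the output above)
vars <- c("Vol_VDW", "Bulk", "Polarity")
cor_toy <- matrix(NA_real_, nrow = 3, ncol = 3,
                  dimnames = list(vars, vars))
cor_toy["Bulk", "Vol_VDW"]     <- 0.73
cor_toy["Polarity", "Vol_VDW"] <- 0.24
cor_toy["Polarity", "Bulk"]    <- -0.20

# Row and column index of the largest correlation
idx <- which(cor_toy == max(cor_toy, na.rm = TRUE), arr.ind = TRUE)

# Look up the variable names behind those indices
pair <- c(rownames(cor_toy)[idx[1, "row"]],
          colnames(cor_toy)[idx[1, "col"]])
pair
```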

Scatterplot matrix

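The code below defines a panel function that assists in the creation of a scatterplot matrix. It prints the absolute correlation in each upper panel, with the text size proportional to the strength of the correlation. It will be used in conjunction with the plot function.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    # Temporarily switch the panel to 0-1 coordinates; restore them on exit
    usr <- par("usr"); on.exit(par(usr = usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))                     # strength of correlation, sign dropped
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)  # larger text for stronger correlation
}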

Strong Correlation:

Molecular Weight ~ Volume Van Der Waals
Surface Area ~ Molecular Weight
Polar Requirement ~ Polarity
Hydrophobicity ~ Polarity
Fraction Accessible ~ Polarity
Hydrophobicity ~ Polar Requirement
Fraction Accessible ~ Polar Requirement
Fraction Accessible ~ Hydrophobicity

Weak Correlation:

Polarity ~ Molecular Weight
Polar Requirement ~ Molecular Weight
Hydrophobicity ~ Molecular Weight
Fraction Accessible ~ Molecular Weight
Polarity ~ Volume Van Der Waals
Polar Requirement ~ Volume Van Der Waals
Hydrophobicity ~ Volume Van Der Waals
Fraction Accessible ~ Volume Van Der Waals
Surface Area ~ Polarity
Surface Area ~ Polar Requirement
Surface Area ~ Hydrophobicity
Fraction Accessible ~ Surface Area

Linear Relationship:

Molecular Weight ~ Volume Van Der Waals
Polarity ~ Polar Requirement
Polarity ~ Hydrophobicity
Molecular Weight ~ Surface Area
Volume Van Der Waals ~ Surface Area
Polarity ~ Surface Area
Polarity ~ Fraction Accessible
Hydrophobicity ~ Fraction Accessible

Non-Linear Relationship:

Molecular Weight ~ Polarity
Volume Van Der Waals ~ Polarity
Molecular Weight ~ Polar Requirement
Volume Van Der Waals ~ Polar Requirement
Molecular Weight ~ Hydrophobicity
Volume Van Der Waals ~ Hydrophobicity
Polar Requirement ~ Hydrophobicity
Polar Requirement ~ Surface Area
Hydrophobicity ~ Surface Area
Molecular Weight ~ Fraction Accessible

The code below builds a scatterplot matrix from a new data frame containing a subset of the variables.

aa_dat3 <-data.frame(
           Molecular_Weight = MW.da, 
           Vol_VDW = vol, 
           Polarity = pol, 
           Polar_Requirement = polar.req, 
           First_Hydrophobicity_Scale = H2Opho.34, 
           Surface_Area = saaH2O, 
           Fraction_Accessible_Area = faal.fold)


plot(aa_dat3,upper.panel = panel.cor,
     panel = panel.smooth)

Figure 1: Replicating Higgs and Attwood Figure 2.8

The figure below plots the polar requirement of each amino acid against its frequency of occurrence. There does not appear to be a strong correlation between polar requirement and frequency (r = 0.14 in the correlation matrix above). Some of the more frequent amino acids, such as E and K, have a high polar requirement, but the most frequent one, L, has one of the lowest.

plot(polar.req  ~ freq, 
     data = aa_dat, 
     xlab = "Frequency",      # x axis is freq
     ylab = "Polar Req",      # y axis is polar.req
     main = "Amino Acid Polar Req Against Frequency",
     col = 0)                 # hide the points; text() adds the labels


text(polar.req ~ freq, 
     labels = aa, 
     data = aa_dat, 
     col = 1:20)

Figure 2: HA Figure 2.8 with ggpubr

Frequency ~ Polar Requirement

Below is a scatterplot demonstrating the relationship between Frequency and Polar Requirement

ggscatter(y = "Polar_Requirement", 
          x = "Frequency_of_Occurance",
          size = "Polar_Requirement",
          color = "Polar_Requirement",
          data = aa_dat,
          xlab = "Frequency",
           ylab = "Polar Requirement")

Bulkiness ~ Molecular Weight

Below is a scatterplot demonstrating the relationship between Bulkiness and Molecular Weight

ggscatter(y = "Molecular_Weight", 
          x = "Bulk",
          size = "Molecular_Weight",
          color = "Molecular_Weight",
          data = aa_dat,
          xlab = "Bulk",
          ylab = "Molecular Weight")

Surface Area ~ Isoelectric Point

Below is a scatterplot demonstrating the relationship between Surface Area and Isoelectric Point

ggscatter(y = "Isoelectric_Point", 
          x = "Surface_Area",
          size = "Isoelectric_Point",
          color = "Isoelectric_Point",
          data = aa_dat,
          xlab = "Surface Area",
          ylab = "Isoelectric Point")

Figure 3: Highly Correlated Data

The general mathematical form of the regression line is y = mx + b, where m is the slope and b is the intercept. The data ellipse shows the spread of the data: its thickness represents the spread along the y axis (variability) and its length represents the spread along the x axis.

ggscatter(y = "Surface_Area", 
          x = "Vol_VDW",
          size = "Surface_Area",
          add = "reg.line", # line of best fit 
          ellipse = TRUE, # data ellipse
          cor.coef = TRUE, # correlation coefficient
          data = aa_dat,
          xlab = "Volume (van der Waals)",
          ylab = "Surface Area")

## `geom_smooth()` using formula 'y ~ x'
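
The slope m and intercept b of the regression line can also be pulled out directly with lm(). The sketch below fits surface area against volume using the same two vectors as above, re-entered here so that it runs on its own; `fit`, `m`, `b`, and `predicted` are illustrative names.

```r
# Volume and water-accessible surface area, copied from the data above
vol    <- c(67,86,91,109,135,48,118,124,135,124,124,96,90,
            114,148,73,93,105,163,141)
saaH2O <- c(113,140,151,183,218,85,194,182,211,180,204,158,143,
            189,241,122,146,160,259,229)

# Least-squares fit of y (surface area) on x (volume)
fit <- lm(saaH2O ~ vol)

b <- coef(fit)[["(Intercept)"]]  # intercept
m <- coef(fit)[["vol"]]          # slope

# y = mx + b reproduces the fitted values of the model
predicted <- m * vol + b
all.equal(predicted, unname(fitted(fit)))
```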

Figure 4: Non-linear relationship

Below is a scatterplot demonstrating the relationship between Isoelectric Point and Bulkiness using a loess smoother

ggscatter(data = aa_dat, 
          x = "Isoelectric_Point",
          y = "Bulk",
          size = "Vol_VDW",
          color = "Surface_Area",
          add  = "loess",
          xlab = "Isoelectric Point", 
          ylab = "Bulkiness")

## `geom_smooth()` using formula 'y ~ x'

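Figure 5: Non-linear relationship on LOG scale

A log transformation compresses large values, which can reduce skew and give a more normal distribution of the data. The logarithm is undefined for negative numbers, so the negative values in the two hydrophobicity scales produce the NaN warnings below, and the zeros in Polarity become -Inf.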

Log_aa_dat2 <- log(aa_dat2)

## Warning in FUN(X[[i]], ...): NaNs produced

## Warning in FUN(X[[i]], ...): NaNs produced
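
The warnings come from values that have no real logarithm. A minimal sketch of what log() does with zero and with negative numbers, using the first three polarity and hydrophobicity values from above (the names `pol_sample` and `h_sample` are hypothetical):

```r
pol_sample <- c(0, 1.48, 49.7)   # Polarity for A, C, D
h_sample   <- c(1.8, 2.5, -3.5)  # first hydrophobicity scale for A, C, D

# log(0) is -Inf; no warning is raised
log_pol <- log(pol_sample)

# log of a negative number is NaN, with a warning (suppressed here)
log_h <- suppressWarnings(log(h_sample))

is.infinite(log_pol)
is.nan(log_h)
```

Checking a data frame for values at or below zero before transforming it, for example with `sapply(df, function(x) any(x <= 0))`, shows in advance which columns will be affected.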

ggscatter(data = Log_aa_dat2,
          x = "Isoelectric_Point",
          y = "Bulk",
          size = "Polarity",
          color = "Fraction_Accessible_Area",
          add  = "reg.line",
          xlab = "Isoelectric Point", 
          ylab = "Bulkiness")

## `geom_smooth()` using formula 'y ~ x'

## Warning in sqrt(x): NaNs produced

## Warning: Removed 2 rows containing missing values (geom_point).

Figure 6: 3D Scatterplot

I chose the 3 variables below because they all measure physical characteristics of each amino acid. Based on the scatterplots in the scatterplot matrix, the correlations among them appeared promising.

Bulk, Molecular Weight, and Volume positively correlate with each other as demonstrated in the 3D scatterplot below.

scatterplot3d(x = aa_dat$Molecular_Weight, 
              y = aa_dat$Vol_VDW, 
              z = aa_dat$Bulk,
              type = "h",
              highlight.3d = TRUE)

PCA

Figure 7: Principal Component Analysis with Base R

The purpose of PCA is to take a data set with many variables (multidimensional) and reduce it to a 2D plane while preserving as much of the variation as possible. This allows for easier analysis of the data. In a PCA plot, distance shows relationship: amino acids with similar properties are plotted close together. UPGMA, used later in the cluster analysis, groups them into clusters.

“The program positions points in two dimensions such that the distances in the two-dimensional space are as close as possible to the original distances in the multi-dimensional space.” (Higgs and Attwood 2005, pg 6).

The two basic ways that we discussed performing PCA are prcomp() from base R (Figure 7) and rda() from the vegan package (Figure 8).

Data Prep

The data prep for the PCA was done above, when the data frames were built.

To make my new data frame, I removed all categorical variables. I had already made this data frame (aa_dat2) so that I could convert the entire data frame to a logarithmic scale in Figure 5.


pca.out <- prcomp(aa_dat2, scale. = TRUE)  # scale. = TRUE gives each variable unit variance

biplot(pca.out)
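
How much of the total variation the 2D plot captures can be checked from the standard deviations stored in the prcomp() result. The sketch below uses only three of the variables (volume, surface area, and polarity), re-entered here so that it runs on its own; `dat`, `pca_toy`, and `var_explained` are illustrative names, and the three-variable subset is an assumption made to keep the example short.

```r
# Three variables copied from the data above
vol    <- c(67,86,91,109,135,48,118,124,135,124,124,96,90,
            114,148,73,93,105,163,141)
saaH2O <- c(113,140,151,183,218,85,194,182,211,180,204,158,143,
            189,241,122,146,160,259,229)
pol    <- c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,1.43,3.38,
            1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

dat <- data.frame(Vol_VDW = vol, Surface_Area = saaH2O, Polarity = pol)

# PCA on standardized variables
pca_toy <- prcomp(dat, scale. = TRUE)

# Each component's variance is sdev^2; divide by the total for proportions
var_explained <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)
round(var_explained, 2)
```

The same calculation on pca.out would show how much of the variation in all eleven variables is captured by the first two components drawn in the biplot.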

Figure 8: Principal Components Analysis - Vegan

scale = TRUE standardizes each variable to “unit variance” by dividing it by its standard deviation. This draws out clusters in reference to one another, rather than letting the variables with the largest values dominate.
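
A small sketch of what scaling to unit variance does, using base R's scale() on two variables with very different ranges (molecular weight and polarity, re-entered from above). The names `raw` and `scaled` are hypothetical; this illustrates the idea of standardizing and is not a claim about how rda() implements it internally.

```r
MW.da <- c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,
           174,105,119,117,204,181)
pol   <- c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,1.43,3.38,
           1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

raw <- cbind(Molecular_Weight = MW.da, Polarity = pol)

# Center each column on its mean and divide by its standard deviation
scaled <- scale(raw)

apply(raw, 2, sd)     # very different spreads before scaling
apply(scaled, 2, sd)  # both equal to 1 after scaling
```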

There are some differences between the PCA I made and the PCA in Higgs and Attwood (2005). First, the PCA I made is separated into clusters by hull lines. Second, our scales are different: mine run from -2 to 2 on the y axis and -4 to 4 on the x axis, while theirs run from -.6 to .6 on the y axis and -.6 to 6 on the x axis. Third, it looks like Higgs and Attwood created a 4th grouping of amino acids (W and R, and G and C), while on my PCA these amino acids are included in already established clusters. The same is true of Y and H, which are part of the same cluster in my PCA but in different clusters in Higgs and Attwood.

rda.out <- vegan::rda(aa_dat2, 
                      scale = TRUE)


rda_scores <- scores(rda.out)


biplot(rda.out, display = "sites", 
       col = 0)


orditorp(rda.out, 
         display = "sites", 
         labels = aa_dat$Amino_Acid_Name, cex = 1.2)


ordihull(rda.out, 
         groups = aa_dat$Hydropathy,  # one convex hull per hydropathy class
         col = 1:7, 
         lty = 1:7, 
         lwd = c(3, 6))

Hierarchical Cluster Analysis

The purpose of a cluster analysis is to group the data points (amino acids) and show the relationships among them in a dendrogram. This allows for better visualization of the clusters that have already been established, but using a different data analysis. In this case we will be clustering on Euclidean distances.

UPGMA (unweighted pair group method with arithmetic mean) is a hierarchical clustering approach. Starting from a distance matrix, the two closest points are joined into a cluster. The distance matrix is then recalculated, treating this cluster as a single point whose distance to every other point is the unweighted average of its members' distances. This is repeated until all points have been joined, so that relationships based on average distance can be formed.
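
The averaging step can be seen on a toy example. In hclust(), UPGMA corresponds to method = "average"; the default is method = "complete", which uses the largest distance between members instead of the average. The three points below are made up for illustration, and `pts`, `upgma`, and `complete` are hypothetical names.

```r
# Three made-up points on a line: a-b are 1 apart, a-c 5 apart, b-c 4 apart
pts <- c(a = 0, b = 1, c = 5)
d <- dist(pts)

upgma    <- hclust(d, method = "average")   # UPGMA
complete <- hclust(d, method = "complete")  # hclust() default

# a and b join first at distance 1.
# UPGMA then joins c at the average of 5 and 4
upgma$height     # 1.0 4.5
# Complete linkage joins c at the maximum of 5 and 4
complete$height  # 1 5
```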

As mentioned above, UPGMA starts from a distance matrix, which holds the distance between every pair of amino acids, calculated from all of their variables. Once the pair with the “shortest distance” is found and joined, the distance matrix is recalculated and the process is repeated.

A Euclidean distance is the straight-line distance between two points. With two variables, x and y, it can be thought of as the hypotenuse of the right triangle formed by the differences in x and y between the two points.
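
As a worked example of the hypotenuse idea, the sketch below computes the Euclidean distance between alanine (A) and glycine (G) from two of their variables, molecular weight and volume, and checks it against dist(). The values are copied from the data above; `props`, `d_manual`, and `d_builtin` are illustrative names.

```r
# Molecular weight and volume for A and G, copied from the data above
props <- rbind(A = c(Molecular_Weight = 89, Vol_VDW = 67),
               G = c(Molecular_Weight = 75, Vol_VDW = 48))

# Differences are 14 and 19, so the distance is sqrt(14^2 + 19^2) = sqrt(557)
d_manual <- sqrt(sum((props["A", ] - props["G", ])^2))

# The same distance from dist()
d_builtin <- as.numeric(dist(props, method = "euclidean"))

all.equal(d_manual, d_builtin)
```

With more than two variables the formula is the same: square the difference in each variable, sum the squares, and take the square root.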

?par

# Numeric columns only; the amino acid codes go in the row names so that
# dist() does not try to coerce them to numbers
aa_dat3 <- data.frame(
           Molecular_Weight = MW.da, 
           Vol_VDW = vol, 
           Bulk = bulk, 
           Polarity = pol, 
           Polar_Requirement = polar.req, 
           Isoelectric_Point = isoelec, 
           First_Hydrophobicity_Scale = H2Opho.34, 
           Second_Hydrophobicity_Scale = H2Opho.35, 
           Surface_Area = saaH2O, 
           Fraction_Accessible_Area = faal.fold, 
           Frequency_of_Occurance = freq,
           row.names = aa)

dist_euc <- dist(aa_dat3, 
                 method = "euclidean")

## method = "average" is the part that indicates UPGMA usage;
## the hclust() default is "complete" linkage
clust_euc <- hclust(dist_euc, method = "average")


par(mfrow = c(1,1))


plot(clust_euc, 
     hang = -1, 
     cex = 0.5)

dendro_euc <- as.dendrogram(clust_euc)


plot(dendro_euc, 
     horiz = TRUE)