#Introduction: In this script, we are going to do some analysis of the data, including PVA, clustering analysis, scatterplot matrix, and so on. We are going to examine the data and the graphs on the article Higgs and Attwood. We are also going to try different data in order to see the similarities and differences between our examination and the origional data.

#Preliminaries

##Packages: Loading all packages

library(ape) # ape provides functions for reading, writing, manipulating, analysing, and simulating phylogenetic trees and DNA sequences, computing DNA distances, translating into AA sequences, estimating trees with distance-based methods, and a range of methods for comparative analyses and analysis of diversification.
library(ggpubr) #The ggpubr R package facilitates the creation of beautiful ggplot2-based graphs for researcher with non-advanced programming backgrounds. 
## Warning: package 'ggpubr' was built under R version 4.0.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.0.3
## 
## Attaching package: 'ggpubr'
## The following object is masked from 'package:ape':
## 
##     rotate
library(ggplot2) #You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
library(permute) #Permute provides an R implementation of the permutation schemes developed by Cajo ter Braak and made available in the Canoco software, version 3.1 (ter Braak, 1990).
## Warning: package 'permute' was built under R version 4.0.3
library(lattice) #The lattice package in R Programming provides barchart to plot Bar Chart.
library(vegan) # It contains the most popular methods of multivariate analysis needed in analysing ecological communities, and tools for diversity analysis, and other potentially useful functions.
## Warning: package 'vegan' was built under R version 4.0.3
## This is vegan 2.5-6
library(scatterplot3d)# It helps us to make 3D graphs. 
## Warning: package 'scatterplot3d' was built under R version 4.0.3
#Relationship between ggpubr and ggplot2: The ggpubr R package facilitates the creation of beautiful ggplot2-based graphs for researcher with non-advanced programming backgrounds.

##Build the dataframe These data provide the chemical structure of the molecules that we are testing. Which including some organic chemistry which is in micro view, like forces and strains between each parts of the molecules.

# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N',
'P','Q','R','S','T','V','W','Y')
## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)
## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
114,148,73,93,105,163,141)
## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)
## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)
## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
-3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)
## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
-0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)
## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158,
143,189,241,122,146,160,259,229)
## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)
# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)
## relative frequency of occurance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences
### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)
##
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
'un','un','un','un','un','pos','un','un','un','un','un')
## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
hydrophilic','hydrophobic','neutral','neutral',
'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
'hydrophilic','neutral',
'hydrophilic','hydrophilic','neutral','neutral',
'hydrophobic','hydrophobic','neutral')
## vol
vol.cat<-c('verysmall','small','small','medium',
'verylarge','verysmall','medium','large','large',
'large','large','small','small','medium','large',
'verysmall',
'small','medium','verylarge','verylarge')
## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
'nonpolar','nonpolar','polar','nonpolar',
'polar','nonpolar','nonpolar','polar','nonpolar','polar',
'polar','polar','polar','nonpolar','nonpolar','polar')
## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
'aliphatic','aromatic','aromatic')


aa_dat <- data.frame(aa,MW.da,vol,bulk,pol,isoelec,H2Opho.34,H2Opho.35,saaH2O,faal.fold,polar.req,freq)


aa_dat
##    aa MW.da vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## 1   A    89  67 11.50  0.00    6.00       1.8       1.6    113      0.74
## 2   C   121  86 13.46  1.48    5.07       2.5       2.0    140      0.91
## 3   D   133  91 11.68 49.70    2.77      -3.5      -9.2    151      0.62
## 4   E   146 109 13.57 49.90    3.22      -3.5      -8.2    183      0.62
## 5   F   165 135 19.80  0.35    5.48       2.8       3.7    218      0.88
## 6   G    75  48  3.40  0.00    5.97      -0.4       1.0     85      0.72
## 7   H   155 118 13.69 51.60    7.59      -3.2      -3.0    194      0.78
## 8   I   131 124 21.40  0.13    6.02       4.5       3.1    182      0.88
## 9   K   146 135 15.71 49.50    9.74      -3.9      -8.8    211      0.52
## 10  L   131 124 21.40  0.13    5.98       3.8       2.8    180      0.85
## 11  M   149 124 16.25  1.43    5.74       1.9       3.4    204      0.85
## 12  N   132  96 12.28  3.38    5.41      -3.5      -4.8    158      0.63
## 13  P   115  90 17.43  1.58    6.30      -1.6      -0.2    143      0.64
## 14  Q   147 114 14.45  3.53    5.65      -3.5      -4.1    189      0.62
## 15  R   174 148 14.28 52.00   10.76      -4.5     -12.3    241      0.64
## 16  S   105  73  9.47  1.67    5.68      -0.8       0.6    122      0.66
## 17  T   119  93 15.77  1.66    6.16      -0.7       1.2    146      0.70
## 18  V   117 105 21.57  0.13    5.96       4.2       2.6    160      0.86
## 19  W   204 163 21.67  2.10    5.89      -0.9       1.9    259      0.85
## 20  Y   181 141 18.03  1.61    5.66      -1.3      -0.7    229      0.76
##    polar.req  freq
## 1        7.0  7.80
## 2        4.8  1.10
## 3       13.0  5.19
## 4       12.5  6.72
## 5        5.0  4.39
## 6        7.9  6.77
## 7        8.4  2.03
## 8        4.9  6.95
## 9       10.1  6.32
## 10       4.9 10.15
## 11       5.3  2.28
## 12      10.0  4.37
## 13       6.6  4.26
## 14       8.6  3.45
## 15       9.1  5.23
## 16       7.5  6.46
## 17       6.6  5.12
## 18       5.6  7.01
## 19       5.2  1.09
## 20       5.4  3.30

##Raw data exploration Correlation test

cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
##           MW.da   vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da        NA    NA    NA    NA      NA        NA        NA     NA        NA
## vol        0.93    NA    NA    NA      NA        NA        NA     NA        NA
## bulk       0.55  0.73    NA    NA      NA        NA        NA     NA        NA
## pol        0.29  0.24 -0.20    NA      NA        NA        NA     NA        NA
## isoelec    0.20  0.36  0.08  0.27      NA        NA        NA     NA        NA
## H2Opho.34 -0.27 -0.08  0.44 -0.67   -0.20        NA        NA     NA        NA
## H2Opho.35 -0.25 -0.16  0.32 -0.85   -0.26      0.85        NA     NA        NA
## saaH2O     0.97  0.99  0.64  0.29    0.35     -0.18     -0.23     NA        NA
## faal.fold  0.11  0.18  0.49 -0.53   -0.18      0.84      0.79   0.12        NA
## polar.req -0.05 -0.19 -0.53  0.76   -0.11     -0.79     -0.87  -0.11     -0.81
## freq      -0.52 -0.30 -0.04 -0.01    0.02      0.26     -0.02  -0.38     -0.18
##           polar.req freq
## MW.da            NA   NA
## vol              NA   NA
## bulk             NA   NA
## pol              NA   NA
## isoelec          NA   NA
## H2Opho.34        NA   NA
## H2Opho.35        NA   NA
## saaH2O           NA   NA
## faal.fold        NA   NA
## polar.req        NA   NA
## freq           0.14   NA

The code below have chosen the variable that have most positive correlation and the variable that have most negative correlation.

The saaH2O has the highest correlation while polar.req has the lowest correlation.

which(cor_ == max(cor_, na.rm = T), arr.ind = T)
##        row col
## saaH2O   8   2
which(cor_ == min(cor_, na.rm = T), arr.ind = T)
##           row col
## polar.req  10   7

##Plot 0: Scatterplot matrix Make a scatterplot matrix of all of the data. Included the following variables: 1. MW.da 2. vol 3. pol 4. H2Opho.34 5. polar.req 6. bulk 7. freq

ch_dat <- c("MW.da","vol","pol","H2Opho.34","polar.req","bulk","freq")
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}
plot(aa_dat[ch_dat],upper.panel = panel.cor,
     panel = panel.smooth)

This code helps people to get the correlation between each variables we chose. When we are making the scatterplot matrix plot, we can just use the panel.cor to plot them on the upper panel.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}

From the scatterplot matrix, we can see that polar.req, H2Opho.34, and pol has strong linear relationship. While MW.da with these three variables are all having non-linear relationships. Vol is strongly correlated to MW.da and bulk.

##Plot 1: Replication Higgs and Attwood Figure 2.8

This graph shows the coorelation of these six variables. The distance between each trails shows the differences between them.

ch_dat <- c("MW.da","vol","pol","H2Opho.34","polar.req","bulk","freq")
data = aa_dat[ch_dat]
plot(MW.da ~ freq, data = data,
     xlab = "freq",
     ylab = "MW.da",
     main = "Correlation coefficient", # main plotting
     col = 0)
text(MW.da ~ freq, 
     labels = aa, 
     data = data, 
     col = 1:7) # text labels

##Figure 2: HA Figure 2.8 with ggpubr

ggscatter(y = "polar.req",
x = "freq",
size = "polar.req",
color = "polar.req",
data = aa_dat,
xlab = "frequency",
ylab = "polar.req")

##Figure 3: Highly correlated data

The regression line is the line that shows the prediction of the relationship between two variables. People use the number exists to calculate the linear regression for the data and want to predict data when we change the independent variable.

The elipse has a verticle and a horizontal axis, which shows the change of data from the regression line in vertical and horizontal direction.

ggscatter(y = "H2Opho.34",
x = "H2Opho.35",
size = "H2Opho.34",
add = "reg.line", # line of best fit
ellipse = TRUE, # data ellipse
cor.coef = TRUE, # correlation coefficient
# color = "freq",
data = aa_dat,
xlab = "H2Opho.35",
ylab = "H2Opho.34")
## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

##Figure 4: Non-linear relationship

ggscatter(y = "MW.da",
x = "bulk",
size = "vol",
add = "loess", # line of best fit
color = "pol", # color changes
data = aa_dat,
xlab = "bulk",
ylab = "MW.da")
## `geom_smooth()` using formula 'y ~ x'

##Figure 5: Non-linear relationship on LOG scale The log transformation calculates the log for each data and helps us to find the pattern of the data more clearly. It could make the graph to have a more linear line of best fit.

aa_dat$log_MW.da <- log(aa_dat$MW.da)
aa_dat$log_bulk <- log (aa_dat$bulk)
ggscatter(y = "log_MW.da",
        x = "log_bulk",
        size = "vol",
        add = "reg.line", # line of best fit
        color = "pol", # color changes
        data = aa_dat,
        xlab = "bulk",
        ylab = "MW.da")
## `geom_smooth()` using formula 'y ~ x'

##Figure 6: 3D Scatterplot

The variable pol and H2Opho.34 has strong negative correlation, pol and polar.req also have positive correlation. H2Opho.34 and the polar.req variable have strong negative correlation.

aa_dat$log_pol <- log(aa_dat$pol)
aa_dat$log_H2Opho.34 <- log(aa_dat$H2Opho.34)
## Warning in log(aa_dat$H2Opho.34): NaNs produced
aa_dat$log_polar.req <- log(aa_dat$polar.req)
scatterplot3d(x = aa_dat$pol,
                     y = aa_dat$log_H2Opho.34,
                     z = aa_dat$polar.req,
              type="h",
              highlight.3d = TRUE)

#Principal component analysis

PCA is the algorithm we use to make graphs simpler. For example, we have multi-dimensional graphs, we can see that it is difficult to read the graph. By using PCA we can make them into a 2D graph thus it is easier to read.

There are two methods for R principal component analysis: 1. Spectral Decomposition. It examines the covariances/correlations between variables. 2. Singular Value Decomposition.

##Data prep In this new dataframe, we only have numeric datas. That will help us to do PCA analysis easier.

aa_dat2 <- aa_dat[, c("vol", "bulk", "pol", "H2Opho.34", "H2Opho.35", "saaH2O", "faal.fold")]

#Figure 7: Principal components analysis - base R Run a PCA and create a PCA plot using base R command.

pca.out <- prcomp(aa_dat2, scale = TRUE)
biplot(pca.out)

#Figure 8: Principal components analysis - vegan For another example see https://rpubs.com/brouwern/veganpca Run the PCA using vegan. Call the output rda.out TODO: use the setting scale = TRUE TODO: describe what scale = TRUE does

rda.out <- vegan::rda(aa_dat2,
scale = TRUE)

Extract what are known as the PCA score - don’t worry about what this means. Call the output rda_scores

rda_scores <- scores(rda.out)

Make a “biplot” (I usually call this a PCA plot) using the code below. TODO: set the group variable to be a different categorical variable that has 4 or fewer categories. TODO:

It seemed that I have a different dataframe compared to the picture in Figure 2.10. The shape between mine graph is different than them, for example, they have W at the top but I have Q at the top. I have found out that the data our professor gave us has missed a variable called “pI”, which I think should be one of the reason that why my graph is different than the origional one.

biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)

ordihull(rda.out,
group = aa_dat$hydropathy,
col = 1:7,
lty = 1:7,
lwd = c(3,6))

#Cluster analysis

Cluster analysis gives us a general idea of how the trees are working. By doing cluster analysis, we can see the possible trees and get the best fitted tree based on them. The UPGMA is the simplest distance-matrix method, and it employs sequential clustering to build a rooted phylogenetic tree. First, all sequences are compared through pairwise alignment to compute the distance matrix. Next, the distance between this pair and all other sequences is recalculated to form a new matrix. (https://www.sciencedirect.com/topics/agricultural-and-biological-sciences/upgma)

##Hierarchical cluster analysis

row.names(aa_dat2) <- paste(aa_dat2$Species,1:nrow(aa_dat2),sep = ".")

The distance matrix shows the distance between each two of the variables. It is easier for us to see them like a matrix instead of a list.

Euclidean distance is a measure of the true straight line distance between two points in Euclidean space. It is the shortest distance between each two of the variables.

dist_euc <- dist(aa_dat2, 
                 method = "euclidean")

In the original article, the diagram has the variables on the top and the different trails on the left.In my graph I didn’t lable them as what they did. Then it has the temperature graph which I don’t have. They have made 6 different clusters there on the right but we didn’t do that.

clust_euc <- hclust(dist_euc) # This is the step used UPGMA algorithm
plot(clust_euc, hang = -1, cex = 0.5)

par(mar = c(1,1,1,1))
is(clust_euc)
## [1] "hclust"
dendro_euc <- as.dendrogram(clust_euc)

is(dendro_euc)
## [1] "dendrogram"
plot(dendro_euc,horiz = T,nodePar = list(pch = c(1,NA), 
                                          cex = 0.5, 
                                          lab.cex = 0.5))