Replicating Higgs and Attwood’s analysis on the properties of amino acids

Add your name to “author” above

NOTE 1: When you submit this please remove all “TODO” prompts, NOTES, instructions. etc by me. THe final product should look like a blog post, NOT a homework assignment

NOTE 2: Each code chunk should have a line that indicates what is being done, like an instructional blog post.

Introduction

WRITING TODO: Write a 4-5 sentence introduce describing what is done by this script and why we care about it biologically.

Preliminaries

Packages

CODE-TODO: load ALL packages
COMMENT-TODO: add a comment to indicate what each packages does
WRITING-TODO: describe in 1-2 sentences the relationship between ggpubr R and ggplot2.

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.5-6

Build the dataframe

CODE-TODO: build the amino acid dataframe I provided.
WRITING-TODO: write 2-3 sentences about what the data represent.
TODO: read the comments I wrote in the code; you will want to refer to these later

These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical varibles based on online information.

Make the vectors.

# 1 letter code
aa        <-c('A','C','D','E','F','G','H','I','K','L','M','N',
              'P','Q','R','S','T','V','W','Y')


## molecular weight in dalton
MW.da     <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)

## vol from van der Waals radii 
vol    <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
              114,148,73,93,105,163,141)

##  bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
              15.71,21.4,16.25,12.28,17.43,
              14.45,14.28,9.47,15.77,21.57,21.67,18.03)

## pol – a measure of the electric field strength around the molecule
pol  <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
              1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
                   5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)

## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
                  -3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)

## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
                  -0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)

## Surface area accessible to water in an unfolded peptide
saaH2O       <-c(113,140,151,183,218,85,194,182,211,180,204,158,
                 143,189,241,122,146,160,259,229)


## Fraction of accessible area lost when a protein folds
faal.fold    <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
                 0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)



# Polar requirement 
polar.req    <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
                 6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)

## relative frequency of occurance
## "The frequencies column shows the mean 
### percentage of each amino acid in the protein sequences 
### of modern organisms"
freq        <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
                10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)

##
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
          'un','un','un','un','un','pos','un','un','un','un','un')

## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
              hydrophilic','hydrophobic','neutral','neutral',
              'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
              'hydrophilic','neutral',
              'hydrophilic','hydrophilic','neutral','neutral',
              'hydrophobic','hydrophobic','neutral')

## vol
vol.cat<-c('verysmall','small','small','medium',
              'verylarge','verysmall','medium','large','large',
              'large','large','small','small','medium','large',
              'verysmall',
              'small','medium','verylarge','verylarge')

## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
                'nonpolar','nonpolar','polar','nonpolar',
                'polar','nonpolar','nonpolar','polar','nonpolar','polar',
                'polar','polar','polar','nonpolar','nonpolar','polar')

## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
            'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
            'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
            'aliphatic','aromatic','aromatic')

Build the dataframes.

### Build dataframe
aa_dat <- data.frame(aa, 
                     MW.da,vol,
                     bulk, pol, 
                     isoelec,H2Opho.34, H2Opho.35,
                     saaH2O, faal.fold, polar.req,
                     freq, charge, hydropathy,
                     vol.cat, pol.cat, chemical)

Raw data exploration

XXXXXXXX

TODO: Examine the code below. What is it called? Replace the XXXXXXXX above with the name of this thing

cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_

##           MW.da   vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da        NA    NA    NA    NA      NA        NA        NA     NA        NA
## vol        0.93    NA    NA    NA      NA        NA        NA     NA        NA
## bulk       0.55  0.73    NA    NA      NA        NA        NA     NA        NA
## pol        0.29  0.24 -0.20    NA      NA        NA        NA     NA        NA
## isoelec    0.20  0.36  0.08  0.27      NA        NA        NA     NA        NA
## H2Opho.34 -0.27 -0.08  0.44 -0.67   -0.20        NA        NA     NA        NA
## H2Opho.35 -0.25 -0.16  0.32 -0.85   -0.26      0.85        NA     NA        NA
## saaH2O     0.97  0.99  0.64  0.29    0.35     -0.18     -0.23     NA        NA
## faal.fold  0.11  0.18  0.49 -0.53   -0.18      0.84      0.79   0.12        NA
## polar.req -0.05 -0.19 -0.53  0.76   -0.11     -0.79     -0.87  -0.11     -0.81
## freq      -0.52 -0.30 -0.04 -0.01    0.02      0.26     -0.02  -0.38     -0.18
##           polar.req freq
## MW.da            NA   NA
## vol              NA   NA
## bulk             NA   NA
## pol              NA   NA
## isoelec          NA   NA
## H2Opho.34        NA   NA
## H2Opho.35        NA   NA
## saaH2O           NA   NA
## faal.fold        NA   NA
## polar.req        NA   NA
## freq           0.14   NA

WRITING-TODO: Describe what the code below is doing and summarzie the next 2 TODOs in 2-3 sentences. TODO: Which variables have the most postive correlation? TODO: What have most negative correlation?

which(cor_ == max(cor_, na.rm = T), arr.ind = T)

##        row col
## saaH2O   8   2

max(cor_, na.rm = T)

## [1] 0.99

min(cor_, na.rm = T)

## [1] -0.87

which(cor_ == min(cor_, na.rm = T), arr.ind = T)

##           row col
## polar.req  10   7

Plot 0: Scatterplot matrix

Code TODO: Make a scatterplot matrix of all of the data. Included the following variables:

MW.da
vol
pol
H2Opho.34
polar.req

PLUS 2 others (6 total)

NOTE: Depending on how you do this you may have to make sure the first column doesn’t go into the plotting command

TODO: Write 1-2 sentences describing what this below code does in your scatter plot matrix below

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}

TOD: WRITE 2-3 sentences describing the relationships between the 6 variables you plotted; which have strong correlations which ones weak, which ones have non-linear relationships.

plot(aa_dat[,c("MW.da","vol","pol",
               "H2Opho.34","polar.req",
               "saaH2O","faal.fold")],
     panel = panel.smooth,
      upper.panel =panel.cor ,
     main = "Example")

Plot 1: Replication Higgs and Attwood Figure 2.8

CODE-TODO: make a plot that matches Figure 2.8 on page 26 of the Higgs and Atwood chapter 2. Make sure you have plotted the correct axes
CODE-TODO: label the x and y axes.
COMMENT-TODO: indicate what is doig the main plotting and what is adding the text lables.
WRITING-TODO: Write 1-2 sentences describing what the plot represents

Figure 2: HA Figure 2.8 with ggpubr

TODO: Make a ggpubr version of HA Figure 2.8. Additionally, use both COLOR and SHAPE to represent 2 additional NUMERIC (NOT categorical) variables of your choice.
TODO: lable the axes with the full name of the variables

ggscatter(y = "polar.req",
          x = "freq",
          size = "polar.req",
          color = "polar.req",
          data = aa_dat,
          xlab = "",
          ylab = "")

Figure 3: Highly correlated data

CODE:

TODO: Identify 2 variables that are highly positively correlated and plot them in GGPUBR
TODO: add a regression line
TODO: add a data ellipse
TODO: add a correlation coefficient
TODO: lable the axes with the full name of the variables

COMMENTS:

TODO: add a comment indicating which line of the ggpubr is adding the line
TODO: add a comment indicating which line of the ggpubr is adding the ellipse TODO: add a comment indicating which line of the ggpubr is adding the correlation coefficient

WRITING

1.TODO: describe the general mathematical form of the regression line in 1-2 sentences 1.TODO: describe the general meaning of the data ellipse in 1-2 sentences

ggscatter(y = "polar.req",
          x = "freq",
          size = "polar.req",
           add = "reg.line",  # line of best fit
           ellipse = TRUE,   # data ellipse
          cor.coef = TRUE, # correlation coef
         # color = "freq",
          data = aa_dat,
         xlab = "",
         ylab = "")

## `geom_smooth()` using formula 'y ~ x'

Figure 4: Non-linear relationship

CODE:

TODO: Identify 2 variables that have a non-linear relationship between them (scatter plot has a curved shape).
TODO: Use ggpubr to plot a scatterplot with a loess smoother. Use google or the ggubr help files to figure out how to do this
TODO: also vary the size and color of the data by 2 other NUMERIC variables
TODO: label the axes with the full name of the variables

Figure 5: Non-linear relationship on LOG scale

CODE:

TODO: Log-transform both of your the variables you used in Figure 4 make the plot again
TODO: a regression line instead of a loess smoother
TODO: also vary the size and color of the data by 2 other NUMERIC variables

WRITING:

TODO: describe what the log transformation does in 1 to 2 sentences.

Figure 6: 3D Scatterplot

CODE TODO: Make a 3D scatterplot of 3 variables that have relatively strong correlations.
CODE TODO: use highlight.3d = TRUE to colored code
CODE CODE TODO: Use type=“h”
WRITING TODO: Summarize the relationships and correlations between the variables you chose in 2-4 sentences.

scatterplot3d(x = aa_dat$MW.da,
              y = aa_dat$vol,
              z = aa_dat$bulk,
              highlight.3d = TRUE,
              type="h")

Principal component analysis

TODO: Write 3-4 sentences describing the basic use and interpretation of PCA as we have used it in this class. Refer to Higgs and Attwood, and provide a useful quote that explains PCA. Cite the quote as (Higgs and Attwood 2005, pg XX).

TODO: Write 3-4 sentences describing the 2 basic ways we made PCA plots in this class.

Data prep

TODO: To make life easier, make a new version of your dataset which drops any categorical variables. Call it aa_dat2 TODO: Additionally: compare Table 2.2 and the data provided. I give you ~11 numeric variables, while Table 2.2 has only 8. Read the information carefully and remove the extra variables

TODO: Write 1-2 sentences describing the approach you used to make this new dataframe.

aa_dat2 <- aa_dat[,-c(1,
                      2,11,12,
                      13,14,15,16,17)]

Figure 7: Principal components analysis - base R

Run a PCA and create a PCA plot using base R command.

pca.out <- prcomp(aa_dat2, scale = TRUE)

Plot the output

biplot(pca.out)

We can carry out the numeric analysis using the rda() function from the vegan package.

rda.out <- vegan::rda(aa_dat2, 
                      scale = TRUE)

We then extract what are known as the PCA score - don’t worry about what this means right now, we’ll dig into it in a moment. Basically, we’re going from 15 columns of data to just 2, which we pull out of rda.out with the scores() function.

rda_scores <- scores(rda.out)

Principal Components Analysis (PCA)

We can reduce the dimensionality of complicated data sets in order to make them easier to visualize. In the case of our amino acid dataset we can go from 15 different variables to just to.

Principal Components Analysis (PCA) is common form of dimension reduction. We’ll first produce the standard output of a PCA analysis – a biplot – before digging into the theory behind what is happening.

We can carry out the numeric analysis using the rda() function from the vegan package.

TASK: Call the rda() function on the aa_dat2 object and save it to the rda.out object provided below.

# COMPLETE THIS CODE:

rda.out <- ## YOUR CODE GOES HERE ##

TASK: Call the scores() function on the rda.out object produced above; save the output to the rda_score object provided

# COMPLETE THIS CODE:

rda_scores <- # COMPLETE THIS CODE:

Now we’ll make a biplot. This is similar in some ways to a scatter plot. We’ll also add some additional annotation to the plot to help us see meaninful grouping of our amino acids. First, we’ll label each amino acids with its 1-letter code. Then we’ll delineate all amino acids within pre-specified groups to see if these groups map to the entire set of columns.

# Build plot
biplot(rda.out, display = "sites", col = 0,
         main = "Major clusters of amino acids based on chemistry",
       xlab = "PCA Axis 1: Primary axis of variation",
       ylab = "PCA Axis 1: Secondary axis of variation")

# add point
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)

# add groupings
ordihull(rda.out,
         group = aa_dat$hydropathy,
         col = 1:7,
         lty = 1:7,
         lwd = c(3,6))

Cluster analysis

TODO: Write 2-3 sentences describing the purpose of cluster analysis. TODO: Write 3-5 sentences describing how UPGMA works.

XXXXXXXXXXXXXXX cluster analysis

TODO: What type of cluster analysis is UPGMA? Change the XXXXXXX above to this. NOTE: use the aa_dat2 data we used above for all of this.

Add row names. Don’t worry about what this code is doing

row.names(aa_dat2) <- paste(aa_dat$aa,
                            #aa_dat$hydropathy,
                            sep = ".")

CODE-TODO: Create a distance matrix using the dist() function. Set the distance to be euclidean using method = “euclidean”. Call the output dist_euc WRITING-TODO: Write 2-3 sentences describing what a distance matrix is. WRITING-TODO: write 2-3 sentences describing what Euclidean distance is.

dist_euc <- dist(aa_dat2, 
                 method = "euclidean")

CODE-TODO: Use the hclust() function to do cluster analysis. Use the UPGMA algorithm (see the help file for the proper setting)

COMMENT-TODO: Add a comment that that indicates what part of the code is setting it to use UPGMA.

clust_euc <- hclust(dist_euc,
                    method = "average")

CODE-TODO: Plot the dendrogram (Optional - convert to a dengrogram as I did) WRITING-TODO: Write 2-5 sentences describing how your cluster diagram compares to the cluster diagram produced by Higgs and Atwood (2005) in Plate 2.2. (This is a cluster diagram that also includes a heat map)

dendro_euc <- as.dendrogram(clust_euc)
plot(dendro_euc,horiz = T,
     nodePar = list(pch = c(1,NA), 
                  cex = 0.5, 
                   lab.cex = 1.2))

Replicating Higgs and Attwood’s analysis on the properties of amino acids

YOURNAME, with contributions from Nathan Brouwer

11/12/2020

Introduction

Preliminaries

Packages

Build the dataframe

Raw data exploration

XXXXXXXX

Plot 0: Scatterplot matrix

Plot 1: Replication Higgs and Attwood Figure 2.8

Figure 2: HA Figure 2.8 with ggpubr

Figure 3: Highly correlated data

Figure 4: Non-linear relationship

Figure 5: Non-linear relationship on LOG scale

Figure 6: 3D Scatterplot

Principal component analysis

Data prep

Figure 7: Principal components analysis - base R

Principal Components Analysis (PCA)

Cluster analysis

XXXXXXXXXXXXXXX cluster analysis