Add your name to “author” above
NOTE 1: When you submit this please remove all “TODO” prompts, NOTES, instructions. etc by me. THe final product should look like a blog post, NOT a homework assignment
NOTE 2: Each code chunk should have a line that indicates what is being done, like an instructional blog post.
WRITING TODO: Write a 4-5 sentence introduce describing what is done by this script and why we care about it biologically.
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-6
These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical varibles based on online information.
Make the vectors.
# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N',
'P','Q','R','S','T','V','W','Y')
## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)
## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
114,148,73,93,105,163,141)
## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)
## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)
## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)
## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
-3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)
## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
-0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)
## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158,
143,189,241,122,146,160,259,229)
## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)
# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)
## relative frequency of occurance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences
### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)
##
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
'un','un','un','un','un','pos','un','un','un','un','un')
## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
hydrophilic','hydrophobic','neutral','neutral',
'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
'hydrophilic','neutral',
'hydrophilic','hydrophilic','neutral','neutral',
'hydrophobic','hydrophobic','neutral')
## vol
vol.cat<-c('verysmall','small','small','medium',
'verylarge','verysmall','medium','large','large',
'large','large','small','small','medium','large',
'verysmall',
'small','medium','verylarge','verylarge')
## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
'nonpolar','nonpolar','polar','nonpolar',
'polar','nonpolar','nonpolar','polar','nonpolar','polar',
'polar','polar','polar','nonpolar','nonpolar','polar')
## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
'aliphatic','aromatic','aromatic')
Build the dataframes.
### Build dataframe
aa_dat <- data.frame(aa,
MW.da,vol,
bulk, pol,
isoelec,H2Opho.34, H2Opho.35,
saaH2O, faal.fold, polar.req,
freq, charge, hydropathy,
vol.cat, pol.cat, chemical)
TODO: Examine the code below. What is it called? Replace the XXXXXXXX above with the name of this thing
cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
## MW.da vol bulk pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da NA NA NA NA NA NA NA NA NA
## vol 0.93 NA NA NA NA NA NA NA NA
## bulk 0.55 0.73 NA NA NA NA NA NA NA
## pol 0.29 0.24 -0.20 NA NA NA NA NA NA
## isoelec 0.20 0.36 0.08 0.27 NA NA NA NA NA
## H2Opho.34 -0.27 -0.08 0.44 -0.67 -0.20 NA NA NA NA
## H2Opho.35 -0.25 -0.16 0.32 -0.85 -0.26 0.85 NA NA NA
## saaH2O 0.97 0.99 0.64 0.29 0.35 -0.18 -0.23 NA NA
## faal.fold 0.11 0.18 0.49 -0.53 -0.18 0.84 0.79 0.12 NA
## polar.req -0.05 -0.19 -0.53 0.76 -0.11 -0.79 -0.87 -0.11 -0.81
## freq -0.52 -0.30 -0.04 -0.01 0.02 0.26 -0.02 -0.38 -0.18
## polar.req freq
## MW.da NA NA
## vol NA NA
## bulk NA NA
## pol NA NA
## isoelec NA NA
## H2Opho.34 NA NA
## H2Opho.35 NA NA
## saaH2O NA NA
## faal.fold NA NA
## polar.req NA NA
## freq 0.14 NA
WRITING-TODO: Describe what the code below is doing and summarzie the next 2 TODOs in 2-3 sentences. TODO: Which variables have the most postive correlation? TODO: What have most negative correlation?
which(cor_ == max(cor_, na.rm = T), arr.ind = T)
## row col
## saaH2O 8 2
max(cor_, na.rm = T)
## [1] 0.99
min(cor_, na.rm = T)
## [1] -0.87
which(cor_ == min(cor_, na.rm = T), arr.ind = T)
## row col
## polar.req 10 7
Code TODO: Make a scatterplot matrix of all of the data. Included the following variables:
PLUS 2 others (6 total)
NOTE: Depending on how you do this you may have to make sure the first column doesn’t go into the plotting command
TODO: Write 1-2 sentences describing what this below code does in your scatter plot matrix below
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
TOD: WRITE 2-3 sentences describing the relationships between the 6 variables you plotted; which have strong correlations which ones weak, which ones have non-linear relationships.
plot(aa_dat[,c("MW.da","vol","pol",
"H2Opho.34","polar.req",
"saaH2O","faal.fold")],
panel = panel.smooth,
upper.panel =panel.cor ,
main = "Example")
ggscatter(y = "polar.req",
x = "freq",
size = "polar.req",
color = "polar.req",
data = aa_dat,
xlab = "",
ylab = "")
CODE:
CODE:
WRITING:
scatterplot3d(x = aa_dat$MW.da,
y = aa_dat$vol,
z = aa_dat$bulk,
highlight.3d = TRUE,
type="h")
TODO: Write 3-4 sentences describing the basic use and interpretation of PCA as we have used it in this class. Refer to Higgs and Attwood, and provide a useful quote that explains PCA. Cite the quote as (Higgs and Attwood 2005, pg XX).
TODO: Write 3-4 sentences describing the 2 basic ways we made PCA plots in this class.
TODO: To make life easier, make a new version of your dataset which drops any categorical variables. Call it aa_dat2 TODO: Additionally: compare Table 2.2 and the data provided. I give you ~11 numeric variables, while Table 2.2 has only 8. Read the information carefully and remove the extra variables
TODO: Write 1-2 sentences describing the approach you used to make this new dataframe.
aa_dat2 <- aa_dat[,-c(1,
2,11,12,
13,14,15,16,17)]
Run a PCA and create a PCA plot using base R command.
pca.out <- prcomp(aa_dat2, scale = TRUE)
Plot the output
biplot(pca.out)
We can carry out the numeric analysis using the rda() function from the vegan package.
rda.out <- vegan::rda(aa_dat2,
scale = TRUE)
We then extract what are known as the PCA score - don’t worry about what this means right now, we’ll dig into it in a moment. Basically, we’re going from 15 columns of data to just 2, which we pull out of rda.out with the scores() function.
rda_scores <- scores(rda.out)
We can reduce the dimensionality of complicated data sets in order to make them easier to visualize. In the case of our amino acid dataset we can go from 15 different variables to just to.
Principal Components Analysis (PCA) is common form of dimension reduction. We’ll first produce the standard output of a PCA analysis – a biplot – before digging into the theory behind what is happening.
We can carry out the numeric analysis using the rda() function from the vegan package.
TASK: Call the rda() function on the aa_dat2 object and save it to the rda.out object provided below.
# COMPLETE THIS CODE:
rda.out <- ## YOUR CODE GOES HERE ##
We then extract what are known as the PCA score - don’t worry about what this means right now, we’ll dig into it in a moment. Basically, we’re going from 15 columns of data to just 2, which we pull out of rda.out with the scores() function.
TASK: Call the scores() function on the rda.out object produced above; save the output to the rda_score object provided
# COMPLETE THIS CODE:
rda_scores <- # COMPLETE THIS CODE:
Now we’ll make a biplot. This is similar in some ways to a scatter plot. We’ll also add some additional annotation to the plot to help us see meaninful grouping of our amino acids. First, we’ll label each amino acids with its 1-letter code. Then we’ll delineate all amino acids within pre-specified groups to see if these groups map to the entire set of columns.
# Build plot
biplot(rda.out, display = "sites", col = 0,
main = "Major clusters of amino acids based on chemistry",
xlab = "PCA Axis 1: Primary axis of variation",
ylab = "PCA Axis 1: Secondary axis of variation")
# add point
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
# add groupings
ordihull(rda.out,
group = aa_dat$hydropathy,
col = 1:7,
lty = 1:7,
lwd = c(3,6))
TODO: Write 2-3 sentences describing the purpose of cluster analysis. TODO: Write 3-5 sentences describing how UPGMA works.
TODO: What type of cluster analysis is UPGMA? Change the XXXXXXX above to this. NOTE: use the aa_dat2 data we used above for all of this.
Add row names. Don’t worry about what this code is doing
row.names(aa_dat2) <- paste(aa_dat$aa,
#aa_dat$hydropathy,
sep = ".")
CODE-TODO: Create a distance matrix using the dist() function. Set the distance to be euclidean using method = “euclidean”. Call the output dist_euc WRITING-TODO: Write 2-3 sentences describing what a distance matrix is. WRITING-TODO: write 2-3 sentences describing what Euclidean distance is.
dist_euc <- dist(aa_dat2,
method = "euclidean")
CODE-TODO: Use the hclust() function to do cluster analysis. Use the UPGMA algorithm (see the help file for the proper setting)
COMMENT-TODO: Add a comment that that indicates what part of the code is setting it to use UPGMA.
clust_euc <- hclust(dist_euc,
method = "average")
CODE-TODO: Plot the dendrogram (Optional - convert to a dengrogram as I did) WRITING-TODO: Write 2-5 sentences describing how your cluster diagram compares to the cluster diagram produced by Higgs and Atwood (2005) in Plate 2.2. (This is a cluster diagram that also includes a heat map)
dendro_euc <- as.dendrogram(clust_euc)
plot(dendro_euc,horiz = T,
nodePar = list(pch = c(1,NA),
cex = 0.5,
lab.cex = 1.2))