Amino acids can be described by a number of chemical properties. This information is fairly easy to locate for the 20 standard proteinogenic amino acids coded by the standard codon table. For other amino acids it can be very difficult to find this information.
A key question in the study of the origin of the universal genetic code is: of the hundreds of amino acids that occur on earth, why does the codon table code for the 20 that it does? That is, why isn’t life based on a different set of 20 amino acids?
Many studies compare and contrast the chemical properties of the 20 proteinogenic amino acids with other amino acids, such as the amino acids that occur in meteorites. Unfortunately, these studies rarely publish their data. Moreover, experimentally derived data on the chemical properties of non-proteinogenic amino acids do not always seem to exist. Authors therefore use chemistry modeling software to predict the chemical attributes of non-standard amino acids.
In this portfolio assignment you will build a simple regression model (line of best fit) using the lm() function, then use the molecular weight of non-standard amino acids to predict another chemical property: the pH at the isoelectric point (pI).
You’ll then assess whether this model is likely to be a good one for predicting pI.
The assignment consists only of instructions - no code! To complete this assignment, you will need to gather the necessary data and code from recent assignments to make a self-sufficient script to carry out the following analysis.
# packages
library(ggpubr)
## Loading required package: ggplot2
library(pander)
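First, we need a dataframe to hold data on the 20 proteinogenic amino acids. Make a dataframe called habk.df2 with the following columns: the one-letter amino acid code (aa_codes), the molecular weight in daltons (MW.da), and the pH at the isoelectric point (isoelec).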
# vectors
aa_codes <- c("A", "R", "N", "D", "C", "Q", "E", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
MW.da <- c(89, 174, 132, 133, 121, 147, 146, 75, 155, 131, 131, 146, 149, 165, 115, 105, 119, 204, 181, 117)
isoelec <- c(6, 10.76, 5.41, 2.77, 5.07, 5.65, 3.22, 5.97, 7.59, 6.02, 5.98, 9.74, 5.74, 5.48, 6.3, 5.68, 6.16, 5.89, 5.66, 5.96)
# make dataframe
habk.df2 <- data.frame(aa_codes, MW.da, isoelec)
Display the finished dataframe with pander::pander()
# show table with pander
pander::pander(habk.df2)
| aa_codes | MW.da | isoelec |
|---|---|---|
| A | 89 | 6 |
| R | 174 | 10.76 |
| N | 132 | 5.41 |
| D | 133 | 2.77 |
| C | 121 | 5.07 |
| Q | 147 | 5.65 |
| E | 146 | 3.22 |
| G | 75 | 5.97 |
| H | 155 | 7.59 |
| I | 131 | 6.02 |
| L | 131 | 5.98 |
| K | 146 | 9.74 |
| M | 149 | 5.74 |
| F | 165 | 5.48 |
| P | 115 | 6.3 |
| S | 105 | 5.68 |
| T | 119 | 6.16 |
| W | 204 | 5.89 |
| Y | 181 | 5.66 |
| V | 117 | 5.96 |
Use ggpubr to make a plot of the pH at the isoelectric point (pI) versus molecular mass.
# make plot
ggscatter(x = "MW.da",
y = "isoelec",
data = habk.df2,
add = "reg.line",
ellipse = TRUE,
main = "pH at the Isoelectric Point (pI) versus Molecular Mass",
xlab = "Molecular Weight",
ylab = "pH at the Isoelectric Point (pI)",
label = aa_codes,
cor.coef = TRUE,
conf.int = TRUE)
## `geom_smooth()` using formula 'y ~ x'
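The correlation annotation that cor.coef = TRUE adds to the plot can be reproduced outside the plot. A minimal sketch, assuming ggscatter() is left at its default Pearson method; the object name pI.cor.test is made up for this example and the vectors are copied from the dataframe above.

```r
# Data copied from habk.df2 (20 proteinogenic amino acids)
MW.da <- c(89, 174, 132, 133, 121, 147, 146, 75, 155, 131,
           131, 146, 149, 165, 115, 105, 119, 204, 181, 117)
isoelec <- c(6, 10.76, 5.41, 2.77, 5.07, 5.65, 3.22, 5.97, 7.59, 6.02,
             5.98, 9.74, 5.74, 5.48, 6.3, 5.68, 6.16, 5.89, 5.66, 5.96)

# cor.test() uses the Pearson correlation unless told otherwise
pI.cor.test <- cor.test(MW.da, isoelec)

pI.cor.test$estimate   # correlation coefficient r
pI.cor.test$p.value    # p-value for the test that r = 0
```

A small r together with a large p-value is a first hint that the regression line in the plot is not capturing much of the variation in pI.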
Look up the molecular mass of the non-standard amino acids Selenocysteine and Pyrrolysine from https://en.wikipedia.org/wiki/Amino_acid
Place this information in objects called selenocysteine.MW and pyrrolysine.MW
# molecular weights
selenocysteine.MW <- 168.064
pyrrolysine.MW <- 255.313
Selenocysteine and Pyrrolysine are amino acids that can be added to proteins by re-coding stop codons. (On an exam you should recognize these as amino acids that occur in proteins via this mechanism and be able to answer simple multiple choice questions about them.)
For more information see:
Scitable: https://www.nature.com/scitable/topicpage/an-evolutionary-perspective-on-amino-acids-14568445/
Wikipedia: https://en.wikipedia.org/wiki/Selenocysteine https://en.wikipedia.org/wiki/Pyrrolysine
Build a regression model using lm() to determine the y = m*x + b equation for the line in your scatterplot.
A simple regression model has a slope (m) and an intercept and can be set up as an equation
y = m*x + b
In this situation the model would be
pI = m*MW.da + b
“b” is called the “intercept” in R output.
When you build a regression model you can access output that looks like this:
(Intercept) MW.da
2.30536091 1.01131509
Which is interpreted as
(Intercept) MW.da
b m
If the numbers above were accurate (they aren’t), then our equation would be pI = 1.01131509*MW.da + 2.30536091
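To see where m and b come from, the least-squares slope and intercept can be computed by hand and compared with what lm() returns. This is a minimal sketch using the assignment's own vectors; the names m.hand, b.hand, and fit.check are made up for illustration.

```r
# Data copied from the assignment (20 proteinogenic amino acids)
MW.da <- c(89, 174, 132, 133, 121, 147, 146, 75, 155, 131,
           131, 146, 149, 165, 115, 105, 119, 204, 181, 117)
isoelec <- c(6, 10.76, 5.41, 2.77, 5.07, 5.65, 3.22, 5.97, 7.59, 6.02,
             5.98, 9.74, 5.74, 5.48, 6.3, 5.68, 6.16, 5.89, 5.66, 5.96)

# Least-squares formulas: m = Sxy / Sxx and b = mean(y) - m * mean(x)
x <- MW.da     # predictor
y <- isoelec   # response
m.hand <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b.hand <- mean(y) - m.hand * mean(x)

# Same model fitted with lm(); the response goes on the left of the ~
fit.check <- lm(isoelec ~ MW.da)
b.lm <- coef(fit.check)[["(Intercept)"]]
m.lm <- coef(fit.check)[["MW.da"]]

# Both routes should agree up to floating-point error
all.equal(c(b.hand, m.hand), c(b.lm, m.lm))
```

Pulling the coefficients out by name, rather than by position, avoids mixing up which number is b and which is m.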
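First, build the regression model (aka linear model). In an R formula the response goes on the left of the ~, so to predict pI from molecular weight the formula is isoelec ~ MW.da.
# Build model (response ~ predictor) and save it for later steps
pI.model <- lm(isoelec ~ MW.da, data = habk.df2)
pI.model
##
## Call:
## lm(formula = isoelec ~ MW.da, data = habk.df2)
##
## Coefficients:
## (Intercept) MW.da
## 4.50516 0.01132
Next get the coefficients for the model (the m of m*x, and the b).
#get coefficients / parameters for equation
coef(pI.model)
## (Intercept) MW.da
## 4.50516096 0.01131509
Use the equation y = m*x + b to estimate the pI of selenocysteine and pyrrolysine.
# Estimate pI for Sec and Pyl with pI = m*MW + b (new objects, so isoelec is not overwritten)
pI.Sec <- 0.01131509*168.064 + 4.50516096
pI.Pyl <- 0.01131509*255.313 + 4.50516096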
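The same estimates can be obtained without typing coefficients by hand, using predict(). A minimal sketch, assuming the model is fitted as isoelec ~ MW.da; the names aa.df, pI.fit, new.aa, and pI.manual are made up for this example.

```r
# Data copied from the assignment (20 proteinogenic amino acids)
MW.da <- c(89, 174, 132, 133, 121, 147, 146, 75, 155, 131,
           131, 146, 149, 165, 115, 105, 119, 204, 181, 117)
isoelec <- c(6, 10.76, 5.41, 2.77, 5.07, 5.65, 3.22, 5.97, 7.59, 6.02,
             5.98, 9.74, 5.74, 5.48, 6.3, 5.68, 6.16, 5.89, 5.66, 5.96)
aa.df <- data.frame(MW.da, isoelec)

pI.fit <- lm(isoelec ~ MW.da, data = aa.df)

# newdata must use the same column name as the predictor in the formula
new.aa <- data.frame(MW.da = c(168.064, 255.313),
                     row.names = c("Sec", "Pyl"))
pI.pred <- predict(pI.fit, newdata = new.aa)

# Manual version of pI = m*MW + b, for comparison
pI.manual <- coef(pI.fit)[["MW.da"]] * new.aa$MW.da +
  coef(pI.fit)[["(Intercept)"]]

round(pI.pred, 2)
```

Note that pyrrolysine's molecular weight (about 255) lies outside the range of the 20 amino acids used to fit the model (75 to 204), so its estimate is an extrapolation.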
Make a table of your results, including the following columns: amino acid name, three-letter code, one-letter code, molecular weight, and estimated pI.
# make table
amino.acid <- c("Selenocysteine", "Pyrrolysine")
three.letter <- c("Sec", "Pyl")
one.letter <- c("U", "O")
molecular.weight <- c(168.1, 255.3)
# pI estimates from pI = 0.01131509*MW + 4.50516096, rounded to 2 decimals
pI.estimate <- c(6.41, 7.39)
df.1 <- data.frame(amino.acid, three.letter, one.letter, molecular.weight, pI.estimate)
df.1
## amino.acid three.letter one.letter molecular.weight pI.estimate
## 1 Selenocysteine Sec U 168.1 6.41
## 2 Pyrrolysine Pyl O 255.3 7.39
Display the dataframe with the function pander() from the pander package.
# make table
pander(df.1)
| amino.acid | three.letter | one.letter | molecular.weight | pI.estimate |
|---|---|---|---|---|
| Selenocysteine | Sec | U | 168.1 | 6.41 |
| Pyrrolysine | Pyl | O | 255.3 | 7.39 |
Using R, determine the correlation coefficient for molecular mass versus pI. Is this a very strong correlation coefficient?
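# correlation coefficient (Pearson, the default, matches the linear model)
cor(habk.df2$MW.da, habk.df2$isoelec)
## [1] 0.1974383
# rank-based (Spearman) correlation, for comparison
cor(habk.df2$MW.da, habk.df2$isoelec, method = "spearman")
## [1] -0.09706549
# No: r is only about 0.2, a weak correlation, so molecular weight alone is a poor predictor of pI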
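One way to turn the correlation coefficient into a statement about model quality: for a regression with a single predictor, the squared Pearson correlation equals the R-squared reported by summary(). A minimal sketch with the assignment's vectors; fit.r2 and r.squared are made-up names.

```r
# Data copied from the assignment (20 proteinogenic amino acids)
MW.da <- c(89, 174, 132, 133, 121, 147, 146, 75, 155, 131,
           131, 146, 149, 165, 115, 105, 119, 204, 181, 117)
isoelec <- c(6, 10.76, 5.41, 2.77, 5.07, 5.65, 3.22, 5.97, 7.59, 6.02,
             5.98, 9.74, 5.74, 5.48, 6.3, 5.68, 6.16, 5.89, 5.66, 5.96)

r <- cor(MW.da, isoelec)                 # Pearson correlation (the default)
fit.r2 <- lm(isoelec ~ MW.da)
r.squared <- summary(fit.r2)$r.squared   # proportion of variance explained

all.equal(r^2, r.squared)    # the two quantities agree
round(100 * r.squared, 1)    # percent of variation in pI explained by MW
```

With these data R-squared is about 0.04, so molecular weight accounts for only around 4% of the variation in pI.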
Build a dataframe with all 8 variables used in Higgs and Attwood Chapter 2. Add molecular weight to this (which wasn’t included in their original table).
Determine the variable which has the highest correlation with pI.
First, make a dataframe with volume, bulk, etc., along with molecular weight.
# data
vol <- c(67, 148, 96, 91, 86, 114, 109, 48, 118, 124, 124, 135, 124, 135, 90, 73, 93, 163, 141, 105)
bulk <- c(11.50, 14.28, 12.28, 11.68, 13.46, 14.45, 13.57, 3.40, 13.69, 21.40, 21.40, 15.71, 16.25, 19.80, 17.43, 9.47, 15.77, 21.67, 18.03, 21.57)
polarity <- c(0.00, 52.00, 3.38, 49.70, 1.48, 3.53, 49.90, 0.00, 51.60, 0.13, 0.13, 49.50, 1.43, 0.35, 1.58, 1.67, 1.66, 2.10, 1.61, 0.13)
Hyd.1 <- c(1.8, -4.5, -3.5, -3.5, 2.5, -3.5, -3.5, -0.4, -3.2, 4.5, 3.8, -3.9, 1.9, 2.8, -1.6, -0.8, -0.7, -0.9, -1.3, 4.2)
Hyd.2 <- c(1.6, -12.3, -4.8, -9.2, 2.0, -4.1, -8.2, 1.0, -3.0, 3.1, 2.8, -8.8, 3.4, 3.7, -0.2, 0.6, 1.2, 1.9, -0.7, 2.6)
surface.area <- c(113, 241, 158, 151, 140, 189, 183, 85, 194, 182, 180, 211, 204, 218, 143, 122, 146, 259, 229, 160)
fract.area <- c(0.74, 0.64, 0.63, 0.62, 0.91, 0.62, 0.62, 0.72, 0.78, 0.88, 0.85, 0.52, 0.85, 0.88, 0.64, 0.66, 0.70, 0.85, 0.76, 0.86)
# make dataframe (pI is taken from habk.df2 so that it holds the measured values)
df.2 <- data.frame(MW.da, vol, bulk, polarity, pI = habk.df2$isoelec, Hyd.1, Hyd.2, surface.area, fract.area)
df.2
## MW.da vol bulk polarity pI Hyd.1 Hyd.2 surface.area fract.area
## 1 89 67 11.50 0.00 6.00 1.8 1.6 113 0.74
## 2 174 148 14.28 52.00 10.76 -4.5 -12.3 241 0.64
## 3 132 96 12.28 3.38 5.41 -3.5 -4.8 158 0.63
## 4 133 91 11.68 49.70 2.77 -3.5 -9.2 151 0.62
## 5 121 86 13.46 1.48 5.07 2.5 2.0 140 0.91
## 6 147 114 14.45 3.53 5.65 -3.5 -4.1 189 0.62
## 7 146 109 13.57 49.90 3.22 -3.5 -8.2 183 0.62
## 8 75 48 3.40 0.00 5.97 -0.4 1.0 85 0.72
## 9 155 118 13.69 51.60 7.59 -3.2 -3.0 194 0.78
## 10 131 124 21.40 0.13 6.02 4.5 3.1 182 0.88
## 11 131 124 21.40 0.13 5.98 3.8 2.8 180 0.85
## 12 146 135 15.71 49.50 9.74 -3.9 -8.8 211 0.52
## 13 149 124 16.25 1.43 5.74 1.9 3.4 204 0.85
## 14 165 135 19.80 0.35 5.48 2.8 3.7 218 0.88
## 15 115 90 17.43 1.58 6.30 -1.6 -0.2 143 0.64
## 16 105 73 9.47 1.67 5.68 -0.8 0.6 122 0.66
## 17 119 93 15.77 1.66 6.16 -0.7 1.2 146 0.70
## 18 204 163 21.67 2.10 5.89 -0.9 1.9 259 0.85
## 19 181 141 18.03 1.61 5.66 -1.3 -0.7 229 0.76
## 20 117 105 21.57 0.13 5.96 4.2 2.6 160 0.86
Then make a correlation matrix from the data. Save it to an object called corr.mat
# make correlation matrix
corr.mat <- cor(df.2)
Round off the correlation matrix to 1 decimal place
# round off
round(corr.mat, 1)
## MW.da vol bulk polarity pI Hyd.1 Hyd.2 surface.area fract.area
## MW.da 1.0 0.9 0.6 0.3 0.2 -0.3 -0.2 1.0 0.1
## vol 0.9 1.0 0.7 0.2 0.4 -0.1 -0.2 1.0 0.2
## bulk 0.6 0.7 1.0 -0.2 0.1 0.4 0.3 0.6 0.5
## polarity 0.3 0.2 -0.2 1.0 0.3 -0.7 -0.9 0.3 -0.5
## pI 0.2 0.4 0.1 0.3 1.0 -0.2 -0.3 0.3 -0.2
## Hyd.1 -0.3 -0.1 0.4 -0.7 -0.2 1.0 0.8 -0.2 0.8
## Hyd.2 -0.2 -0.2 0.3 -0.9 -0.3 0.8 1.0 -0.2 0.8
## surface.area 1.0 1.0 0.6 0.3 0.3 -0.2 -0.2 1.0 0.1
## fract.area 0.1 0.2 0.5 -0.5 -0.2 0.8 0.8 0.1 1.0
Display the matrix with pander::pander()
# display with pander()
pander(round(corr.mat, 1))
| | MW.da | vol | bulk | polarity | pI | Hyd.1 | Hyd.2 |
|---|---|---|---|---|---|---|---|
| MW.da | 1 | 0.9 | 0.6 | 0.3 | 0.2 | -0.3 | -0.2 |
| vol | 0.9 | 1 | 0.7 | 0.2 | 0.4 | -0.1 | -0.2 |
| bulk | 0.6 | 0.7 | 1 | -0.2 | 0.1 | 0.4 | 0.3 |
| polarity | 0.3 | 0.2 | -0.2 | 1 | 0.3 | -0.7 | -0.9 |
| pI | 0.2 | 0.4 | 0.1 | 0.3 | 1 | -0.2 | -0.3 |
| Hyd.1 | -0.3 | -0.1 | 0.4 | -0.7 | -0.2 | 1 | 0.8 |
| Hyd.2 | -0.2 | -0.2 | 0.3 | -0.9 | -0.3 | 0.8 | 1 |
| surface.area | 1 | 1 | 0.6 | 0.3 | 0.3 | -0.2 | -0.2 |
| fract.area | 0.1 | 0.2 | 0.5 | -0.5 | -0.2 | 0.8 | 0.8 |

| | surface.area | fract.area |
|---|---|---|
| MW.da | 1 | 0.1 |
| vol | 1 | 0.2 |
| bulk | 0.6 | 0.5 |
| polarity | 0.3 | -0.5 |
| pI | 0.3 | -0.2 |
| Hyd.1 | -0.2 | 0.8 |
| Hyd.2 | -0.2 | 0.8 |
| surface.area | 1 | 0.1 |
| fract.area | 0.1 | 1 |
Which variable has the strongest (most positive OR most negative) correlation with pI? (if possible write this with code, otherwise just state what it is).
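# find maximum: correlate each variable with the measured pI values
cor(df.2[,"MW.da"], habk.df2$isoelec)
## [1] 0.1974383
cor(df.2[,"vol"], habk.df2$isoelec)
## [1] 0.3625143
cor(df.2[,"bulk"], habk.df2$isoelec)
## [1] 0.08225406
cor(df.2[,"polarity"], habk.df2$isoelec)
## [1] 0.2656773
cor(df.2[,"Hyd.1"], habk.df2$isoelec)
## [1] -0.2041273
cor(df.2[,"Hyd.2"], habk.df2$isoelec)
## [1] -0.2608317
cor(df.2[,"surface.area"], habk.df2$isoelec)
## [1] 0.3467519
cor(df.2[,"fract.area"], habk.df2$isoelec)
## [1] -0.1798668
# vol has the strongest correlation with pI (r = 0.36), narrowly ahead of surface.area (r = 0.35); both are weak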
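The comparison above can also be done in one step by taking the pI row of the correlation matrix and asking for the entry with the largest absolute value. A minimal sketch that rebuilds the data so it runs on its own; the names props, pI.cor, and strongest are made up for this example.

```r
# Data copied from the assignment (20 proteinogenic amino acids)
MW.da <- c(89, 174, 132, 133, 121, 147, 146, 75, 155, 131,
           131, 146, 149, 165, 115, 105, 119, 204, 181, 117)
isoelec <- c(6, 10.76, 5.41, 2.77, 5.07, 5.65, 3.22, 5.97, 7.59, 6.02,
             5.98, 9.74, 5.74, 5.48, 6.3, 5.68, 6.16, 5.89, 5.66, 5.96)
vol <- c(67, 148, 96, 91, 86, 114, 109, 48, 118, 124,
         124, 135, 124, 135, 90, 73, 93, 163, 141, 105)
bulk <- c(11.50, 14.28, 12.28, 11.68, 13.46, 14.45, 13.57, 3.40, 13.69, 21.40,
          21.40, 15.71, 16.25, 19.80, 17.43, 9.47, 15.77, 21.67, 18.03, 21.57)
polarity <- c(0.00, 52.00, 3.38, 49.70, 1.48, 3.53, 49.90, 0.00, 51.60, 0.13,
              0.13, 49.50, 1.43, 0.35, 1.58, 1.67, 1.66, 2.10, 1.61, 0.13)
Hyd.1 <- c(1.8, -4.5, -3.5, -3.5, 2.5, -3.5, -3.5, -0.4, -3.2, 4.5,
           3.8, -3.9, 1.9, 2.8, -1.6, -0.8, -0.7, -0.9, -1.3, 4.2)
Hyd.2 <- c(1.6, -12.3, -4.8, -9.2, 2.0, -4.1, -8.2, 1.0, -3.0, 3.1,
           2.8, -8.8, 3.4, 3.7, -0.2, 0.6, 1.2, 1.9, -0.7, 2.6)
surface.area <- c(113, 241, 158, 151, 140, 189, 183, 85, 194, 182,
                  180, 211, 204, 218, 143, 122, 146, 259, 229, 160)
fract.area <- c(0.74, 0.64, 0.63, 0.62, 0.91, 0.62, 0.62, 0.72, 0.78, 0.88,
                0.85, 0.52, 0.85, 0.88, 0.64, 0.66, 0.70, 0.85, 0.76, 0.86)

props <- data.frame(MW.da, vol, bulk, polarity, pI = isoelec,
                    Hyd.1, Hyd.2, surface.area, fract.area)

# Row of the correlation matrix for pI, without the trivial pI-vs-pI entry
pI.cor <- cor(props)["pI", ]
pI.cor <- pI.cor[names(pI.cor) != "pI"]

# abs() treats strong negative and strong positive correlations alike
strongest <- names(which.max(abs(pI.cor)))
strongest

# All variables ranked by strength of correlation with pI
round(sort(abs(pI.cor), decreasing = TRUE), 2)
```

Ranking on the absolute value is what makes this answer the "most positive OR most negative" wording of the question.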