Key vocab

Key functions / packages

Predict pI for an Selenocysteine and Pyrrolysine

Amino acids can be described by a number of chemical properties. This information is fairly easy to locate for the 20 standard proteinogenic amino acids coded by the standard codon table. For other amino acids it can be very difficult to find this information.

A key question in the study of the origin of the universal genetic code is: of the hundreds of amino acids that occur on earth, why does the codon table code for the 20 that it does? That is, why isn’t life based on a different set of 20 amino acids?

Many studies compare and contrast the chemical properties of the 20 proteinogenic amino acids with other amino acids, such as the amino acids that occur meteorites. Unfortunately, these studies rarely publish their data. Moreover, it seems like there is not always experimentally derived data on the chemical properties of non-proteinogenic amino acids. Authors’ therefore use chemistry modeling software to predict the chemical attributes non-standard amino acids.

In this portfolio assignment you will build a simple regression model (line of best fit) using the lm() function, then use the molecular weight of non-standard amino acids to predict other chemical values.

You’ll then assess whether this model is likely to be a very good one for predicting pI.

The assignment consists only of instructions - no code! To complete this assignment, you will need to gather the necessary data and code from recent assignments to make a self-sufficient script to carry out the following analysis.

Step 0: Load the necessary packages

library(ggplot2)
library(ggpubr)
library(scatterplot3d)
library(pander) 
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-7
library(dendextend)
## Registered S3 method overwritten by 'dendextend':
##   method     from 
##   rev.hclust vegan
## 
## ---------------------
## Welcome to dendextend version 1.15.2
## Type citation('dendextend') for how to cite the package.
## 
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
## 
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags: 
##   https://stackoverflow.com/questions/tagged/dendextend
## 
##  To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
## ---------------------
## 
## Attaching package: 'dendextend'
## The following object is masked from 'package:permute':
## 
##     shuffle
## The following object is masked from 'package:ggpubr':
## 
##     rotate
## The following object is masked from 'package:stats':
## 
##     cutree
library(lsa)
## Loading required package: SnowballC

Step 1: Make a dataframe

First, we need a dataframe to hold data on the 20 proteinogenic amino acids. Make a dataframe with the following columns:

  1. amino acid codes
  2. molecular weight
  3. pI (the vector is not called this though in my code code - make sure you set up your code properly)

Call the dataframe habk.df2.

# vectors
aa <-c('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L','M','N',
                'P','Q','R','S','T','V','W','Y')
pI <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98, 5.74, 5.41, 6.3, 
            5.65, 10.76, 5.68, 6.16, 5.96, 5.89, 5.66)
MW <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)

# make dataframe
habk.df2 <- data.frame(aa, MW, pI)

habk.df2
##    aa  MW    pI
## 1   A  89  6.00
## 2   C 121  5.07
## 3   D 133  2.77
## 4   E 146  3.22
## 5   F 165  5.48
## 6   G  75  5.97
## 7   H 155  7.59
## 8   I 131  6.02
## 9   K 146  9.74
## 10  L 131  5.98
## 11  M 149  5.74
## 12  N 132  5.41
## 13  P 115  6.30
## 14  Q 147  5.65
## 15  R 174 10.76
## 16  S 105  5.68
## 17  T 119  6.16
## 18  V 117  5.96
## 19  W 204  5.89
## 20  Y 181  5.66

Display the finished dataframe with pander::pander()

# show table with pander
pander::pander(habk.df2)
aa MW pI
A 89 6
C 121 5.07
D 133 2.77
E 146 3.22
F 165 5.48
G 75 5.97
H 155 7.59
I 131 6.02
K 146 9.74
L 131 5.98
M 149 5.74
N 132 5.41
P 115 6.3
Q 147 5.65
R 174 10.76
S 105 5.68
T 119 6.16
V 117 5.96
W 204 5.89
Y 181 5.66

Step 2: Plot pI versus molecular weight

Use ggpubr to make a plot of the pH at the isoelectric point (pI) versus molecular mass.

  1. Plot pI on one axis and molecular weight on the other.
  2. Include regression line (line of best fit)
  3. Include other information, including a confidence intervals (CI), confidence ellipse, and the correlation coefficient
  4. Have each point labeled with the 1-letter amino acid code.
  5. Include relevant x-axis and y-axis labels and a title.
# make plot
ggscatter(y = "MW",
          x = "pI",
          add = "reg.line",   # regression line added
          ellipse = TRUE,     # ellipse included
          cor.coef = TRUE,    # correlation coefficient included
          labels = aa,
          data = habk.df2,
          xlab = "pI",
          ylab = "Molecular Weight")
## `geom_smooth()` using formula 'y ~ x'

Step 3: Determine MW for Selenocysteine and Pyrrolysine

Look up the molecular mass of the non-standard amino acids Selenocysteine and Pyrrolysine from https://en.wikipedia.org/wiki/Amino_acid

Place this information in objects called selenocysteine.MW and pyrrolysine.MW

# molecular weights

selenocysteine.MW <- 168

pyrrolysine.MW <- 255

Selenocysteine and Pyrrolysine are amino acids that can be added to proteins by re-coding stop codons. (On an exam you should recognize these as amino acids that occur in proteins via this mechanism and be able to answer simple multiple choice questions about them.)

For more information see:

Scitable: https://www.nature.com/scitable/topicpage/an-evolutionary-perspective-on-amino-acids-14568445/

Wikipedia: https://en.wikipedia.org/wiki/Selenocysteine https://en.wikipedia.org/wiki/Pyrrolysine

Step 4: Estimate pI for Selenocysteine (U) and Pyrrolysine (O)

Build a regression model using lm() to determine the y = m*x + b equation for the line in your scatterplot.

A simple regression model has a slope (m) and an intercept and can be set up as an equation

y = m*x + b

In this situation the model would be

pI = m*MW.da + b

“b” is called the “intercept” in R output.

When you build a regression model you can access output that looks like this: (Intercept) MW.da 2.30536091 1.01131509

Which is interpreted as

(Intercept) MW.da b m

If the numbers above were accurate (they aren’t), then our equation would be pI = 1.01131509*MW.da + 2.30536091

First, build the regression model (aka linear model)

# Build model
regression <- lm(pI ~ MW , data=habk.df2)

Next get the coefficients for the model (the m of m*x, and the b).

#get coefficients / parameters for equation
summary(regression)
## 
## Call:
## lm(formula = pI ~ MW, data = habk.df2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2401 -0.8263 -0.0103  0.4892  4.2860 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  4.50516    1.85405   2.430   0.0258 *
## MW           0.01132    0.01324   0.854   0.4041  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.779 on 18 degrees of freedom
## Multiple R-squared:  0.03898,    Adjusted R-squared:  -0.01441 
## F-statistic: 0.7301 on 1 and 18 DF,  p-value: 0.4041
#y = 4.50516 x + 0.01132

Use the equation y = m*x + b to estimate the pI of selenocystein an pyrrolysine.

# Estimate pI for Se and Py

Se_pI <- 756.88
Py_pI <- 1013.67

Step 5: Make a table of summary information for U and O

Make a table of your results, including the following columns

  1. amino.acid: full amino acid names for U an d O
  2. three.letter: 3-letter code for the amino acids (given on Wikipedia)
  3. one.letter: 1-letter code for the amino acids (given on Wikipedia)
  4. molecular.weight: from wikipedia
  5. pI.estimate: from your regression equation
# make table
table <- c("Selenocysteine", "sec", "u", "168", "756.88",
          "Pyrrolysine", "ply", "o", "255", "1013.67")

table
##  [1] "Selenocysteine" "sec"            "u"              "168"           
##  [5] "756.88"         "Pyrrolysine"    "ply"            "o"             
##  [9] "255"            "1013.67"

Display the dataframe with the function pander() from the pander package.

# make table
table_matrix <- matrix(table,
                       byrow = T,
                       nrow = 2)

# convert to data frame
table_data <- data.frame(table_matrix, 
                     stringsAsFactors = F)

pander(table_data)
X1 X2 X3 X4 X5
Selenocysteine sec u 168 756.88
Pyrrolysine ply o 255 1013.67

Step 6: Do you think your predictions of pI will be very accurate?

Using R, determine the correlation coefficient for molecular mass versus pI. Is this a very strong correlation coefficient?

# correlation coefficient
cor(MW, pI)
## [1] 0.1974383

Build a dataframe with all 8 variables used in Higgs and Attwood Chapter 2. Add molecular weight to this (which wasn’t included in their original table).

Determine the variable which has the highest correlation with pI.

First, make a datafrome with volume, bulk etc, along with molecular weight.

# data
aa        <-c('A','C','D','E','F','G','H','I','K','L','M','N',
              'P','Q','R','S','T','V','W','Y')

MW.da     <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)

vol    <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
              114,148,73,93,105,163,141)

bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
              15.71,21.4,16.25,12.28,17.43,
              14.45,14.28,9.47,15.77,21.57,21.67,18.03)

pol  <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
              1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
                   5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)

H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
                  -3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)

H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
                  -0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)

saaH2O       <-c(113,140,151,183,218,85,194,182,211,180,204,158,
                 143,189,241,122,146,160,259,229)


faal.fold    <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
                 0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)

polar.req    <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
                 6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)

freq        <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
                10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)

charge<-c('un','un','neg','neg','un','un','pos','un','pos',
          'un','un','un','un','un','pos','un','un','un','un','un')

hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
              hydrophilic','hydrophobic','neutral','neutral',
              'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
              'hydrophilic','neutral',
              'hydrophilic','hydrophilic','neutral','neutral',
              'hydrophobic','hydrophobic','neutral')

vol.cat<-c('verysmall','small','small','medium',
              'verylarge','verysmall','medium','large','large',
              'large','large','small','small','medium','large',
              'verysmall',
              'small','medium','verylarge','verylarge')

pol.cat<-c('nonpolar','nonpolar','polar','polar',
                'nonpolar','nonpolar','polar','nonpolar',
                'polar','nonpolar','nonpolar','polar','nonpolar','polar', 'polar','polar','polar','nonpolar','nonpolar','polar')

chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
            'aliphatic','basic','aliphatic','basic','aliphatic','sulfur', 'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
            'aliphatic','aromatic','aromatic')

# made dataframe
aa_dat <- data.frame(aa, 
                     MW.da,vol,
                     bulk, pol, 
                     isoelec,H2Opho.34, H2Opho.35,
                     saaH2O, faal.fold, polar.req,
                     freq, charge, hydropathy,
                     vol.cat, pol.cat, chemical)

Then make a correlation matrix from the data. Save it to an object called corr.mat

# make correlation matrix
cor(polar.req, H2Opho.35)
## [1] -0.8655016
dim(aa_dat)
## [1] 20 17
cor(aa_dat[,c("polar.req", "H2Opho.35")])
##            polar.req  H2Opho.35
## polar.req  1.0000000 -0.8655016
## H2Opho.35 -0.8655016  1.0000000
corr.mat <- aa_dat[,-c(1, 13:17)]
corr.mat
##    MW.da vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold polar.req
## 1     89  67 11.50  0.00    6.00       1.8       1.6    113      0.74       7.0
## 2    121  86 13.46  1.48    5.07       2.5       2.0    140      0.91       4.8
## 3    133  91 11.68 49.70    2.77      -3.5      -9.2    151      0.62      13.0
## 4    146 109 13.57 49.90    3.22      -3.5      -8.2    183      0.62      12.5
## 5    165 135 19.80  0.35    5.48       2.8       3.7    218      0.88       5.0
## 6     75  48  3.40  0.00    5.97      -0.4       1.0     85      0.72       7.9
## 7    155 118 13.69 51.60    7.59      -3.2      -3.0    194      0.78       8.4
## 8    131 124 21.40  0.13    6.02       4.5       3.1    182      0.88       4.9
## 9    146 135 15.71 49.50    9.74      -3.9      -8.8    211      0.52      10.1
## 10   131 124 21.40  0.13    5.98       3.8       2.8    180      0.85       4.9
## 11   149 124 16.25  1.43    5.74       1.9       3.4    204      0.85       5.3
## 12   132  96 12.28  3.38    5.41      -3.5      -4.8    158      0.63      10.0
## 13   115  90 17.43  1.58    6.30      -1.6      -0.2    143      0.64       6.6
## 14   147 114 14.45  3.53    5.65      -3.5      -4.1    189      0.62       8.6
## 15   174 148 14.28 52.00   10.76      -4.5     -12.3    241      0.64       9.1
## 16   105  73  9.47  1.67    5.68      -0.8       0.6    122      0.66       7.5
## 17   119  93 15.77  1.66    6.16      -0.7       1.2    146      0.70       6.6
## 18   117 105 21.57  0.13    5.96       4.2       2.6    160      0.86       5.6
## 19   204 163 21.67  2.10    5.89      -0.9       1.9    259      0.85       5.2
## 20   181 141 18.03  1.61    5.66      -1.3      -0.7    229      0.76       5.4
##     freq
## 1   7.80
## 2   1.10
## 3   5.19
## 4   6.72
## 5   4.39
## 6   6.77
## 7   2.03
## 8   6.95
## 9   6.32
## 10 10.15
## 11  2.28
## 12  4.37
## 13  4.26
## 14  3.45
## 15  5.23
## 16  6.46
## 17  5.12
## 18  7.01
## 19  1.09
## 20  3.30

Round off the correlation matrix to 1 decimal [lace]

# round off
round(cor(corr.mat),1)
##           MW.da  vol bulk  pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da       1.0  0.9  0.6  0.3     0.2      -0.3      -0.2    1.0       0.1
## vol         0.9  1.0  0.7  0.2     0.4      -0.1      -0.2    1.0       0.2
## bulk        0.6  0.7  1.0 -0.2     0.1       0.4       0.3    0.6       0.5
## pol         0.3  0.2 -0.2  1.0     0.3      -0.7      -0.9    0.3      -0.5
## isoelec     0.2  0.4  0.1  0.3     1.0      -0.2      -0.3    0.3      -0.2
## H2Opho.34  -0.3 -0.1  0.4 -0.7    -0.2       1.0       0.8   -0.2       0.8
## H2Opho.35  -0.2 -0.2  0.3 -0.9    -0.3       0.8       1.0   -0.2       0.8
## saaH2O      1.0  1.0  0.6  0.3     0.3      -0.2      -0.2    1.0       0.1
## faal.fold   0.1  0.2  0.5 -0.5    -0.2       0.8       0.8    0.1       1.0
## polar.req  -0.1 -0.2 -0.5  0.8    -0.1      -0.8      -0.9   -0.1      -0.8
## freq       -0.5 -0.3  0.0  0.0     0.0       0.3       0.0   -0.4      -0.2
##           polar.req freq
## MW.da          -0.1 -0.5
## vol            -0.2 -0.3
## bulk           -0.5  0.0
## pol             0.8  0.0
## isoelec        -0.1  0.0
## H2Opho.34      -0.8  0.3
## H2Opho.35      -0.9  0.0
## saaH2O         -0.1 -0.4
## faal.fold      -0.8 -0.2
## polar.req       1.0  0.1
## freq            0.1  1.0

Display the matrix with pander::pander()

# display with pander()

pander::pander(corr.mat)
Table continues below
MW.da vol bulk pol isoelec H2Opho.34 H2Opho.35 saaH2O
89 67 11.5 0 6 1.8 1.6 113
121 86 13.46 1.48 5.07 2.5 2 140
133 91 11.68 49.7 2.77 -3.5 -9.2 151
146 109 13.57 49.9 3.22 -3.5 -8.2 183
165 135 19.8 0.35 5.48 2.8 3.7 218
75 48 3.4 0 5.97 -0.4 1 85
155 118 13.69 51.6 7.59 -3.2 -3 194
131 124 21.4 0.13 6.02 4.5 3.1 182
146 135 15.71 49.5 9.74 -3.9 -8.8 211
131 124 21.4 0.13 5.98 3.8 2.8 180
149 124 16.25 1.43 5.74 1.9 3.4 204
132 96 12.28 3.38 5.41 -3.5 -4.8 158
115 90 17.43 1.58 6.3 -1.6 -0.2 143
147 114 14.45 3.53 5.65 -3.5 -4.1 189
174 148 14.28 52 10.76 -4.5 -12.3 241
105 73 9.47 1.67 5.68 -0.8 0.6 122
119 93 15.77 1.66 6.16 -0.7 1.2 146
117 105 21.57 0.13 5.96 4.2 2.6 160
204 163 21.67 2.1 5.89 -0.9 1.9 259
181 141 18.03 1.61 5.66 -1.3 -0.7 229
faal.fold polar.req freq
0.74 7 7.8
0.91 4.8 1.1
0.62 13 5.19
0.62 12.5 6.72
0.88 5 4.39
0.72 7.9 6.77
0.78 8.4 2.03
0.88 4.9 6.95
0.52 10.1 6.32
0.85 4.9 10.15
0.85 5.3 2.28
0.63 10 4.37
0.64 6.6 4.26
0.62 8.6 3.45
0.64 9.1 5.23
0.66 7.5 6.46
0.7 6.6 5.12
0.86 5.6 7.01
0.85 5.2 1.09
0.76 5.4 3.3

Which variable has the strongest (most positive OR most negative) correlation with pI? (if possible write this with code, otherwise just state what it is).

# find maximum
cor(pI, corr.mat)
##          MW.da       vol       bulk       pol isoelec  H2Opho.34  H2Opho.35
## [1,] 0.1974383 0.3625143 0.08225406 0.2656773       1 -0.2041273 -0.2608317
##         saaH2O  faal.fold  polar.req       freq
## [1,] 0.3467519 -0.1798668 -0.1146537 0.02048058
# maximal correlation = isoelec and vol