Predicting amino acid properties using regression

Predict pI for an Selenocysteine and Pyrrolysine

Amino acids can be described by a number of chemical properties. This information is fairly easy to locate for the 20 standard proteinogenic amino acids coded by the standard codon table. For other amino acids it can be very difficult to find this information.

A key question in the study of the origin of the universal genetic code is: of the hundreds of amino acids that occur on earth, why does the codon table code for the 20 that it does? That is, why isn’t life based on a different set of 20 amino acids?

Many studies compare and contrast the chemical properties of the 20 proteinogenic amino acids with other amino acids, such as the amino acids that occur meteorites. Unfortunately, these studies rarely publish their data. Moreover, it seems like there is not always experimentally derived data on the chemical properties of non-proteinogenic amino acids. Authors’ therefore use chemistry modeling software to predict the chemical attributes non-standard amino acids.

In this portfolio assignment you will build a simple regression model (line of best fit) using the lm() function, then use the molecular weight of non-standard amino acids to predict other chemical values.

You’ll then assess whether this model is likely to be a very good one for predicting pI.

The assignment consists only of instructions - no code! To complete this assignment, you will need to gather the necessary data and code from recent assignments to make a self-sufficient script to carry out the following analysis.

Step 0: Load the necessary packages

# packages 
#install.packages("ggpubr")
#install.packages("pander")

library(ggpubr)

## Loading required package: ggplot2

library(pander)

Step 1: Make a dataframe

First, we need a dataframe to hold data on the 20 proteinogenic amino acids. Make a dataframe with the following columns:

amino acid codes
molecular weight
pI (the vector is not called this though in my code code - make sure you set up your code properly)

Call the dataframe habk.df2.

# vectors

aa.code <- c('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L','M','N',
                'P','Q','R','S','T','V','W','Y')
MW.da <- c(89, 121, 133, 146, 165, 75, 155, 131, 146, 131,
               149, 132, 115, 147, 174, 105, 119, 117, 204, 181)

isoelec <- c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98, 5.74, 5.41, 6.3, 
            5.65, 10.76, 5.68, 6.16, 5.96, 5.89, 5.66)

habk.df2 <- data.frame(aa.code, isoelec, MW.da)

habk.df2

##    aa.code isoelec MW.da
## 1        A    6.00    89
## 2        C    5.07   121
## 3        D    2.77   133
## 4        E    3.22   146
## 5        F    5.48   165
## 6        G    5.97    75
## 7        H    7.59   155
## 8        I    6.02   131
## 9        K    9.74   146
## 10       L    5.98   131
## 11       M    5.74   149
## 12       N    5.41   132
## 13       P    6.30   115
## 14       Q    5.65   147
## 15       R   10.76   174
## 16       S    5.68   105
## 17       T    6.16   119
## 18       V    5.96   117
## 19       W    5.89   204
## 20       Y    5.66   181

# make dataframe

Display the finished dataframe with pander::pander()

# show table with pander

pander::pander(habk.df2)

aa.code	isoelec	MW.da
A	6	89
C	5.07	121
D	2.77	133
E	3.22	146
F	5.48	165
G	5.97	75
H	7.59	155
I	6.02	131
K	9.74	146
L	5.98	131
M	5.74	149
N	5.41	132
P	6.3	115
Q	5.65	147
R	10.76	174
S	5.68	105
T	6.16	119
V	5.96	117
W	5.89	204
Y	5.66	181

Step 2: Plot pI versus molecular weight

Use ggpubr to make a plot of the pH at the isoelectric point (pI) versus molecular mass.

Plot pI on one axis and molecular weight on the other.
Include regression line (line of best fit)
Include other information, including a confidence intervals (CI), confidence ellipse, and the correlation coefficient
Have each point labeled with the 1-letter amino acid code.
Include relevant x-axis and y-axis labels and a title.

# make plot

ggscatter(x = "isoelec",
          y = "MW.da",
          data = habk.df2,
          label = aa.code, 
          conf.int = TRUE,
          ellipse = TRUE,
          ylab = "MW.da", 
          xlab = "isoelec",
          title = "pi vs molecular weight",
          cor.coef = TRUE,
          add = "reg.line")

## `geom_smooth()` using formula 'y ~ x'

Step 3: Determine MW for Selenocysteine and Pyrrolysine

Look up the molecular mass of the non-standard amino acids Selenocysteine and Pyrrolysine from https://en.wikipedia.org/wiki/Amino_acid

Place this information in objects called selenocysteine.MW and pyrrolysine.MW

# molecular weights

selenocysteine.MW <- 168.064
pyrrolysine.MW <- 255.313

Selenocysteine and Pyrrolysine are amino acids that can be added to proteins by re-coding stop codons. (On an exam you should recognize these as amino acids that occur in proteins via this mechanism and be able to answer simple multiple choice questions about them.)

For more information see:

Scitable: https://www.nature.com/scitable/topicpage/an-evolutionary-perspective-on-amino-acids-14568445/

Wikipedia: https://en.wikipedia.org/wiki/Selenocysteine https://en.wikipedia.org/wiki/Pyrrolysine

Step 4: Estimate pI for Selenocysteine (U) and Pyrrolysine (O)

Build a regression model using lm() to determine the y = m*x + b equation for the line in your scatterplot.

A simple regression model has a slope (m) and an intercept and can be set up as an equation

y = m*x + b

In this situation the model would be

pI = m*MW.da + b

“b” is called the “intercept” in R output.

When you build a regression model you can access output that looks like this: (Intercept) MW.da 2.30536091 1.01131509

Which is interpreted as

(Intercept) MW.da b m

If the numbers above were accurate (they aren’t), then our equation would be pI = 1.01131509*MW.da + 2.30536091

First, build the regression model (aka linear model)

# Build model

pi_MW <- lm(data = habk.df2,
   isoelec~MW.da)

Next get the coefficients for the model (the m of m*x, and the b).

#get coefficients / parameters for equation

slope <- coef(pi_MW)

Use the equation y = m*x + b to estimate the pI of selenocystein an pyrrolysine.

# Estimate pI for Se and Py

Se_pI <- coef(pi_MW)[2]*selenocysteine.MW + coef(pi_MW)[1]


Py_pI <- coef(pi_MW)[2]*pyrrolysine.MW + coef(pi_MW)[1]

Step 5: Make a table of summary information for U and O

Make a table of your results, including the following columns

amino.acid: full amino acid names for U an d O
three.letter: 3-letter code for the amino acids (given on Wikipedia)
one.letter: 1-letter code for the amino acids (given on Wikipedia)
molecular.weight: from wikipedia
pI.estimate: from your regression equation

# make table

amino.acid <- c("Selenocysteine","Pyrollisine")
three.letter <- c("Sec","Pyl")
one.letter <- c("U","O")
molecular.weight <- c(168.1,255.3)
pI.estimate <- c(Se_pI,Py_pI)
  
  
table <- data.frame(amino.acid,three.letter,one.letter,molecular.weight,pI.estimate)

Display the dataframe with the function pander() from the pander package.

# make table

pander(table)

amino.acid	three.letter	one.letter	molecular.weight	pI.estimate
Selenocysteine	Sec	U	168.1	6.407
Pyrollisine	Pyl	O	255.3	7.394

Step 6: Do you think your predictions of pI will be very accurate?

Using R, determine the correlation coefficient for molecular mass versus pI. Is this a very strong correlation coefficient?

# correlation coefficient

cor(pI.estimate, molecular.weight, method = "spearman")

## [1] 1

Build a dataframe with all 8 variables used in Higgs and Attwood Chapter 2. Add molecular weight to this (which wasn’t included in their original table).

Determine the variable which has the highest correlation with pI.

First, make a datafrome with volume, bulk etc, along with molecular weight.

# data


# made dataframe
MW.da <- c(89, 121, 133, 146, 165, 75, 155, 131, 146, 131,
               149, 132, 115, 147, 174, 105, 119, 117, 204, 181)

  Vol <- c(67, 86, 91, 109, 135, 48, 118, 124, 135, 124, 124, 96, 90, 114,
            148, 73, 93, 105, 163, 141)

  Bulk <- c(11.5, 13.46, 11.68, 13.57, 19.8, 3.4, 13.69, 21.4, 15.71, 21.4, 16.25,
         12.28,17.43, 14.45, 14.28, 9.47, 15.77, 21.57, 21.67, 18.03)

  Polarity <- c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
              1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

  pI <- c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98, 5.74, 5.41, 6.3, 
            5.65, 10.76, 5.68, 6.16, 5.96, 5.89, 5.66)

  Hyd.1 <- c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5, -3.9, 3.8, 1.9, -3.5, -1.6, 
              -3.5, -4.5, -0.8, -0.7, 4.2, -0.9, -1.3)

  Hyd.2 <- c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8, -0.2,-4.1,-12.3,
              0.6,1.2,2.6,1.9,-0.7)

  Surface_area <-c(113, 140,151,183,218,85,194,182,211,180, 204, 158, 143, 189, 
                 241, 122, 146,160,259,229)

  Fract_area <- c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52, 0.85, 0.85, 0.63,
                 0.64,0.62,0.64,0.66,0.7, 0.86, 0.85, 0.76)


table2 <- data.frame(MW.da,Vol,Bulk,Polarity,pI,Hyd.1,Hyd.2,Surface_area,Fract_area)

Then make a correlation matrix from the data. Save it to an object called corr.mat

# make correlation matrix

corr.mat <- c(data = table2,
              byrow = TRUE,
              nrow = 10)

corr.mat

## $data.MW.da
##  [1]  89 121 133 146 165  75 155 131 146 131 149 132 115 147 174 105 119 117 204
## [20] 181
## 
## $data.Vol
##  [1]  67  86  91 109 135  48 118 124 135 124 124  96  90 114 148  73  93 105 163
## [20] 141
## 
## $data.Bulk
##  [1] 11.50 13.46 11.68 13.57 19.80  3.40 13.69 21.40 15.71 21.40 16.25 12.28
## [13] 17.43 14.45 14.28  9.47 15.77 21.57 21.67 18.03
## 
## $data.Polarity
##  [1]  0.00  1.48 49.70 49.90  0.35  0.00 51.60  0.13 49.50  0.13  1.43  3.38
## [13]  1.58  3.53 52.00  1.67  1.66  0.13  2.10  1.61
## 
## $data.pI
##  [1]  6.00  5.07  2.77  3.22  5.48  5.97  7.59  6.02  9.74  5.98  5.74  5.41
## [13]  6.30  5.65 10.76  5.68  6.16  5.96  5.89  5.66
## 
## $data.Hyd.1
##  [1]  1.8  2.5 -3.5 -3.5  2.8 -0.4 -3.2  4.5 -3.9  3.8  1.9 -3.5 -1.6 -3.5 -4.5
## [16] -0.8 -0.7  4.2 -0.9 -1.3
## 
## $data.Hyd.2
##  [1]   1.6   2.0  -9.2  -8.2   3.7   1.0  -3.0   3.1  -8.8   2.8   3.4  -4.8
## [13]  -0.2  -4.1 -12.3   0.6   1.2   2.6   1.9  -0.7
## 
## $data.Surface_area
##  [1] 113 140 151 183 218  85 194 182 211 180 204 158 143 189 241 122 146 160 259
## [20] 229
## 
## $data.Fract_area
##  [1] 0.74 0.91 0.62 0.62 0.88 0.72 0.78 0.88 0.52 0.85 0.85 0.63 0.64 0.62 0.64
## [16] 0.66 0.70 0.86 0.85 0.76
## 
## $byrow
## [1] TRUE
## 
## $nrow
## [1] 10

Round off the correlation matrix to 1 decimal [lace]

# round off
round(table2)

##    MW.da Vol Bulk Polarity pI Hyd.1 Hyd.2 Surface_area Fract_area
## 1     89  67   12        0  6     2     2          113          1
## 2    121  86   13        1  5     2     2          140          1
## 3    133  91   12       50  3    -4    -9          151          1
## 4    146 109   14       50  3    -4    -8          183          1
## 5    165 135   20        0  5     3     4          218          1
## 6     75  48    3        0  6     0     1           85          1
## 7    155 118   14       52  8    -3    -3          194          1
## 8    131 124   21        0  6     4     3          182          1
## 9    146 135   16       50 10    -4    -9          211          1
## 10   131 124   21        0  6     4     3          180          1
## 11   149 124   16        1  6     2     3          204          1
## 12   132  96   12        3  5    -4    -5          158          1
## 13   115  90   17        2  6    -2     0          143          1
## 14   147 114   14        4  6    -4    -4          189          1
## 15   174 148   14       52 11    -4   -12          241          1
## 16   105  73    9        2  6    -1     1          122          1
## 17   119  93   16        2  6    -1     1          146          1
## 18   117 105   22        0  6     4     3          160          1
## 19   204 163   22        2  6    -1     2          259          1
## 20   181 141   18        2  6    -1    -1          229          1

Display the matrix with pander::pander()

# display with pander()

pander(table2)

MW.da	Vol	Bulk	Polarity	pI	Hyd.1	Hyd.2	Surface_area	Fract_area
89	67	11.5	0	6	1.8	1.6	113	0.74
121	86	13.46	1.48	5.07	2.5	2	140	0.91
133	91	11.68	49.7	2.77	-3.5	-9.2	151	0.62
146	109	13.57	49.9	3.22	-3.5	-8.2	183	0.62
165	135	19.8	0.35	5.48	2.8	3.7	218	0.88
75	48	3.4	0	5.97	-0.4	1	85	0.72
155	118	13.69	51.6	7.59	-3.2	-3	194	0.78
131	124	21.4	0.13	6.02	4.5	3.1	182	0.88
146	135	15.71	49.5	9.74	-3.9	-8.8	211	0.52
131	124	21.4	0.13	5.98	3.8	2.8	180	0.85
149	124	16.25	1.43	5.74	1.9	3.4	204	0.85
132	96	12.28	3.38	5.41	-3.5	-4.8	158	0.63
115	90	17.43	1.58	6.3	-1.6	-0.2	143	0.64
147	114	14.45	3.53	5.65	-3.5	-4.1	189	0.62
174	148	14.28	52	10.76	-4.5	-12.3	241	0.64
105	73	9.47	1.67	5.68	-0.8	0.6	122	0.66
119	93	15.77	1.66	6.16	-0.7	1.2	146	0.7
117	105	21.57	0.13	5.96	4.2	2.6	160	0.86
204	163	21.67	2.1	5.89	-0.9	1.9	259	0.85
181	141	18.03	1.61	5.66	-1.3	-0.7	229	0.76

Which variable has the strongest (most positive OR most negative) correlation with pI? (if possible write this with code, otherwise just state what it is).

# find maximum

Predicting amino acid properties using regression

Xiaohan Xu

12/10/2021

Key vocab

Key functions / packages