Preparation

Concepts

These concepts will be on the next exam

  • four protein fold classes
  • correlation as a measure of similarity
  • cosine similarity
  • scatterplot matrix
  • correlation matrix
  • Euclidean distance
  • distance matrix

Packages

  • pander

Introduction

A key determinant of a proteins secondary structure is the frequency of different amino acids in the polypeptide chain. In the 1990s - long before Alphafold - researchers such as Kuo-Chen Chou used multivariate statistical methods to predict whether proteins were classified as alpha, beta, alpha+beta or alpha/beta. At the time there were not many proteins in the PDB, and these methods worked fairly well using the available data.

In this script is code to carry out prediction using a correlation coefficient approach (Chou and Zhang 1992) and a cosine similarity approach (Chou 1994, Chou and Zhang 1995). Both approaches typically yield similar results, though the correlation coefficient method may be more intuitive. We’ll also contrast this with the Euclidean matrix.

Version 1.0 of this script focus on producing the results and does not contain much explanation.

Amino acid compositions for four primary folding types

Chou and Zhang 1995 present a table which summarizes the amino acid composition for proteins with known structures. Based on these structures and the consensus of the protein community, proteins were classified as alpha, beta, alpha+beta or alpha/beta. In a previous study, Chou obtain the sequence of each of thes proteins and determined the number of each amino acid in each protein. In Chou and Zhang (1995) they present the summary information for each of these proteins in Tables 1 through 4 (one table for each fold class of protein).

They then totaled up the total number of each amino acid within each fold class, e.g., the total number of Alanine in all of the alpha proteins, total Argine in the alpha proteins etc. These data are presented in their Table 5 and reproduced below.

From this information the frequency of each amino acid in each fold class can be determined and a prediction made about a new protein with an unknown structure.

Amino acid frequencies

Make a vector amino acid names

# enter once
aa.1.1 <- c("A","R","N","D","C","Q","E","G","H","I",
            "L","K","M","F","P","S","T","W","Y","V")

# enter again
aa.1.2 <- c("A","R","N","D","C","Q","E","G","H","I",
            "L","K","M","F","P","S","T","W","Y","V")

# are they the same length?
length(aa.1.1) == length(aa.1.2)
## [1] TRUE
# are all the values the same?
aa.1.1 == aa.1.2
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE
#length(unique(aa.1.1)) tells me how many unique values
unique(aa.1.1)
##  [1] "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S" "T" "W" "Y"
## [20] "V"
# does each vector have the same number of unique values?
length(unique(aa.1.1)) == length(unique(aa.1.2))
## [1] TRUE
# OPTIONAL: are any of the values returned by the logical check == FALSE?
any(c(aa.1.1 == aa.1.2) == FALSE)
## [1] FALSE

Make vectors with the frequency of each amino acid in the databse of proteins of each type used by Chou. Chou’s table lists totals so I do logical checks again the total he gives and the total in my vecotr.

# alpha proteins
alpha <- c(285, 53, 97, 163, 22, 67, 134, 197, 111, 91, 
           221, 249, 48, 123, 82, 122, 119, 33, 63, 167)

# check against chou's total
sum(alpha) == 2447
## [1] TRUE
# beta proteins
beta <- c(203, 67, 139, 121, 75, 122, 86, 297, 49, 120, 
          177, 115, 16, 85, 127, 341, 253, 44, 110, 229)

# check against chou's total
sum(beta) == 2776
## [1] TRUE
# alpha + beta
a.plus.b <- c(175, 78, 120, 111, 74, 74, 86, 171, 33, 93,
              110, 112, 25, 52, 71, 126, 117, 30, 108, 123)
sum(a.plus.b) == 1889
## [1] TRUE
# alpha/beta
a.div.b <- c(361, 146, 183, 244, 63, 114, 257, 377, 107, 239, 
             339, 321, 91, 158, 188, 327, 238, 72, 130, 378)
sum(a.div.b) == 4333
## [1] TRUE

Chou and Chang’s table 5 is approximately like this

data.frame(aa.1.1, alpha, beta, a.plus.b, a.div.b)
##    aa.1.1 alpha beta a.plus.b a.div.b
## 1       A   285  203      175     361
## 2       R    53   67       78     146
## 3       N    97  139      120     183
## 4       D   163  121      111     244
## 5       C    22   75       74      63
## 6       Q    67  122       74     114
## 7       E   134   86       86     257
## 8       G   197  297      171     377
## 9       H   111   49       33     107
## 10      I    91  120       93     239
## 11      L   221  177      110     339
## 12      K   249  115      112     321
## 13      M    48   16       25      91
## 14      F   123   85       52     158
## 15      P    82  127       71     188
## 16      S   122  341      126     327
## 17      T   119  253      117     238
## 18      W    33   44       30      72
## 19      Y    63  110      108     130
## 20      V   167  229      123     378

From table 5 we calculate the proprotions of each amino acid in each fold class.

Calculate proportions for each of the four protein fold types

alpha.prop <- alpha/sum(alpha)
beta.prop <- beta/sum(beta)
a.plus.b.prop <- a.plus.b/sum(a.plus.b)
a.div.b <- a.div.b/sum(a.div.b)

Create a dataframe

#dataframe
aa.prop <- data.frame(alpha.prop,
                      beta.prop,
                      a.plus.b.prop,
                      a.div.b)
#row labels
row.names(aa.prop) <- aa.1.1

Table 5 therefore becomes this

aa.prop
##    alpha.prop   beta.prop a.plus.b.prop    a.div.b
## A 0.116469146 0.073126801    0.09264161 0.08331410
## R 0.021659174 0.024135447    0.04129169 0.03369490
## N 0.039640376 0.050072046    0.06352567 0.04223402
## D 0.066612178 0.043587896    0.05876125 0.05631202
## C 0.008990601 0.027017291    0.03917417 0.01453958
## Q 0.027380466 0.043948127    0.03917417 0.02630972
## E 0.054760932 0.030979827    0.04552673 0.05931225
## G 0.080506743 0.106988473    0.09052409 0.08700669
## H 0.045361667 0.017651297    0.01746956 0.02469421
## I 0.037188394 0.043227666    0.04923240 0.05515809
## L 0.090314671 0.063760807    0.05823187 0.07823679
## K 0.101757254 0.041426513    0.05929063 0.07408262
## M 0.019615856 0.005763689    0.01323452 0.02100162
## F 0.050265631 0.030619597    0.02752779 0.03646434
## P 0.033510421 0.045749280    0.03758602 0.04338795
## S 0.049856968 0.122838617    0.06670196 0.07546734
## T 0.048630977 0.091138329    0.06193753 0.05492730
## W 0.013485901 0.015850144    0.01588142 0.01661666
## Y 0.025745811 0.039625360    0.05717311 0.03000231
## V 0.068246833 0.082492795    0.06511382 0.08723748

Data exploration

Chou’s method rely on the correlation between amino acid composition between proteins with known structure and unknown proteins. Its therefore information to do a standard examination of the data using scatterplots and correlations. First we’ll look at the data for the four fold classes.

Scatterplot matrix

plot(aa.prop,panel = panel.smooth)

Correlation matrix

cor(aa.prop)
##               alpha.prop beta.prop a.plus.b.prop   a.div.b
## alpha.prop     1.0000000 0.4941143     0.6969508 0.8555289
## beta.prop      0.4941143 1.0000000     0.7977771 0.7706654
## a.plus.b.prop  0.6969508 0.7977771     1.0000000 0.8198043
## a.div.b        0.8555289 0.7706654     0.8198043 1.0000000
round(cor(aa.prop), 3)
##               alpha.prop beta.prop a.plus.b.prop a.div.b
## alpha.prop         1.000     0.494         0.697   0.856
## beta.prop          0.494     1.000         0.798   0.771
## a.plus.b.prop      0.697     0.798         1.000   0.820
## a.div.b            0.856     0.771         0.820   1.000

Individual plots

par(mfrow = c(1,3), mar = c(4,4,1,0))
plot(alpha.prop ~ beta.prop, data = aa.prop)
plot(alpha.prop ~ a.plus.b.prop, data = aa.prop)
plot(alpha.prop ~ a.div.b, data = aa.prop)

par(mfrow = c(1,1), mar = c(4,4,4,4))

Download focal gene

We will pick a gene, determine its amino acid composition, and determine the similarity of its composition to the four fold classes.

# download
NP_000783 <- rentrez::entrez_fetch(id = "NP_000783.2",
                                     db = "protein",
                                     rettype = "fasta")

# clean and turn into vector
NP_000783 <- compbio4all::fasta_cleaner(NP_000783, parse = TRUE)

Determine amino acid frequencies for my focal gene (counts of amino acid divided by length of sequence)

NP_000783.freq.table <- table(NP_000783)/length(NP_000783)

Function to convert a table into a vector

table_to_vector <- function(table_x){
  table_names <- attr(table_x, "dimnames")[[1]]
  table_vect <- as.vector(table_x)
  names(table_vect) <- table_names
  return(table_vect)
}
DIO1.human.aa.freq <- table_to_vector(NP_000783.freq.table)

Check for presence of U, an unknown amino acid

aa.names <- names(DIO1.human.aa.freq)
any(aa.names == "U")
## [1] TRUE
i.U <- which(aa.names == "U")

aa.names[i.U]
## [1] "U"
DIO1.human.aa.freq[i.U]
##           U 
## 0.004016064
#uing %in%, which isn't necessary but illustrates the syntax
which(aa.names %in% c("U"))
## [1] 18

Remove the “U”. This throws off the math a tiny bit but will be ok. (Better to remove it from the full sequence.)

DIO1.human.aa.freq <- DIO1.human.aa.freq[-i.U]

# sum is just under 1; ok for our purposes
sum(DIO1.human.aa.freq)
## [1] 0.9959839

Add data to datatable

aa.prop$DIO1.human.aa.freq <- DIO1.human.aa.freq

Functions to calculate similarities

Corrleation used in Chou adn Zhange 1992.

chou_cor <- function(x,y){
  numerator <- sum(x*y)
denominator <- sqrt((sum(x^2))*(sum(y^2)))
result <- numerator/denominator
return(result)
}

Cosine similarity used in Higgs and Attwood (2005).
TODO: Check if this is exactly the same as used in Chou’s 1994 and 1995 papers.

chou_cosine <- function(z.1, z.2){
  z.1.abs <- sqrt(sum(z.1^2))
  z.2.abs <- sqrt(sum(z.2^2))
  my.cosine <- sum(z.1*z.2)/(z.1.abs*z.2.abs)
  return(my.cosine)
}

Data exploration. None of the correlations are great…

par(mfrow = c(2,2), mar = c(1,4,1,1))
plot(alpha.prop ~ DIO1.human.aa.freq, data = aa.prop)
plot(beta.prop ~ DIO1.human.aa.freq, data = aa.prop)
plot(a.plus.b.prop  ~ DIO1.human.aa.freq, data = aa.prop)
plot(a.div.b ~ DIO1.human.aa.freq, data = aa.prop)

par(mfrow = c(1,1), mar = c(1,1,1,1))

Calculate correlation between each column

corr.alpha <- chou_cor(aa.prop[,5], aa.prop[,1])
corr.beta  <- chou_cor(aa.prop[,5], aa.prop[,2])
corr.apb   <- chou_cor(aa.prop[,5], aa.prop[,3])
corr.adb   <- chou_cor(aa.prop[,5], aa.prop[,4])

Calculate cosine similarity

cos.alpha <- chou_cosine(aa.prop[,5], aa.prop[,1])
cos.beta  <- chou_cosine(aa.prop[,5], aa.prop[,2])
cos.apb   <- chou_cosine(aa.prop[,5], aa.prop[,3])
cos.adb   <- chou_cosine(aa.prop[,5], aa.prop[,4])

Calculate distance. Note: we need to flip the dataframe on its side using a command called t()

aa.prop.flipped <- t(aa.prop)
round(aa.prop.flipped,2)
##                       A    R    N    D    C    Q    E    G    H    I    L    K
## alpha.prop         0.12 0.02 0.04 0.07 0.01 0.03 0.05 0.08 0.05 0.04 0.09 0.10
## beta.prop          0.07 0.02 0.05 0.04 0.03 0.04 0.03 0.11 0.02 0.04 0.06 0.04
## a.plus.b.prop      0.09 0.04 0.06 0.06 0.04 0.04 0.05 0.09 0.02 0.05 0.06 0.06
## a.div.b            0.08 0.03 0.04 0.06 0.01 0.03 0.06 0.09 0.02 0.06 0.08 0.07
## DIO1.human.aa.freq 0.05 0.02 0.04 0.05 0.06 0.06 0.03 0.05 0.04 0.12 0.03 0.06
##                       M    F    P    S    T    W    Y    V
## alpha.prop         0.02 0.05 0.03 0.05 0.05 0.01 0.03 0.07
## beta.prop          0.01 0.03 0.05 0.12 0.09 0.02 0.04 0.08
## a.plus.b.prop      0.01 0.03 0.04 0.07 0.06 0.02 0.06 0.07
## a.div.b            0.02 0.04 0.04 0.08 0.05 0.02 0.03 0.09
## DIO1.human.aa.freq 0.06 0.06 0.06 0.05 0.03 0.08 0.03 0.02

We can get distance matrix like this

dist(aa.prop.flipped, method = "euclidean")
##                    alpha.prop  beta.prop a.plus.b.prop    a.div.b
## beta.prop          0.13342098                                    
## a.plus.b.prop      0.09281824 0.08289406                         
## a.div.b            0.06699039 0.08659174    0.06175113           
## DIO1.human.aa.freq 0.17606277 0.18081192    0.15112424 0.15786886

Individual distances

dist.alpha <- dist((aa.prop.flipped[c(1,5),]),  method = "euclidean")
dist.beta  <- dist((aa.prop.flipped[c(2,5),]),  method = "euclidean")
dist.apb   <- dist((aa.prop.flipped[c(3,5),]),  method = "euclidean")
dist.adb  <- dist((aa.prop.flipped[c(4,5),]), method = "euclidean")

Compile the information

# fold types
fold.type <- c("alpha","beta","alpha plus beta", "alpha/beta")

# data
corr.sim <- round(c(corr.alpha,corr.beta,corr.apb,corr.adb),5)
cosine.sim <- round(c(cos.alpha,cos.beta,cos.apb,cos.adb),5)
Euclidean.dist <- round(c(dist.alpha,dist.beta,dist.apb,dist.adb),5)

# summary
sim.sum <- c("","","most.sim","")
dist.sum <- c("","","min.dist","")

df <- data.frame(fold.type,
           corr.sim ,
           cosine.sim ,
           Euclidean.dist ,
           sim.sum ,
           dist.sum )

Display output

pander::pander(df)
fold.type corr.sim cosine.sim Euclidean.dist sim.sum dist.sum
alpha 0.7552 0.7552 0.1761
beta 0.746 0.746 0.1808
alpha plus beta 0.8072 0.8072 0.1511 most.sim min.dist
alpha/beta 0.7937 0.7937 0.1579