These concepts will be on the next exam
A key determinant of a proteins secondary structure is the frequency of different amino acids in the polypeptide chain. In the 1990s - long before Alphafold - researchers such as Kuo-Chen Chou used multivariate statistical methods to predict whether proteins were classified as alpha, beta, alpha+beta or alpha/beta. At the time there were not many proteins in the PDB, and these methods worked fairly well using the available data.
In this script is code to carry out prediction using a correlation coefficient approach (Chou and Zhang 1992) and a cosine similarity approach (Chou 1994, Chou and Zhang 1995). Both approaches typically yield similar results, though the correlation coefficient method may be more intuitive. We’ll also contrast this with the Euclidean matrix.
Version 1.0 of this script focus on producing the results and does not contain much explanation.
Chou and Zhang 1995 present a table which summarizes the amino acid composition for proteins with known structures. Based on these structures and the consensus of the protein community, proteins were classified as alpha, beta, alpha+beta or alpha/beta. In a previous study, Chou obtain the sequence of each of thes proteins and determined the number of each amino acid in each protein. In Chou and Zhang (1995) they present the summary information for each of these proteins in Tables 1 through 4 (one table for each fold class of protein).
They then totaled up the total number of each amino acid within each fold class, e.g., the total number of Alanine in all of the alpha proteins, total Argine in the alpha proteins etc. These data are presented in their Table 5 and reproduced below.
From this information the frequency of each amino acid in each fold class can be determined and a prediction made about a new protein with an unknown structure.
Make a vector amino acid names
# enter once
aa.1.1 <- c("A","R","N","D","C","Q","E","G","H","I",
"L","K","M","F","P","S","T","W","Y","V")
# enter again
aa.1.2 <- c("A","R","N","D","C","Q","E","G","H","I",
"L","K","M","F","P","S","T","W","Y","V")
# are they the same length?
length(aa.1.1) == length(aa.1.2)
## [1] TRUE
# are all the values the same?
aa.1.1 == aa.1.2
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE
#length(unique(aa.1.1)) tells me how many unique values
unique(aa.1.1)
## [1] "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S" "T" "W" "Y"
## [20] "V"
# does each vector have the same number of unique values?
length(unique(aa.1.1)) == length(unique(aa.1.2))
## [1] TRUE
# OPTIONAL: are any of the values returned by the logical check == FALSE?
any(c(aa.1.1 == aa.1.2) == FALSE)
## [1] FALSE
Make vectors with the frequency of each amino acid in the databse of proteins of each type used by Chou. Chou’s table lists totals so I do logical checks again the total he gives and the total in my vecotr.
# alpha proteins
alpha <- c(285, 53, 97, 163, 22, 67, 134, 197, 111, 91,
221, 249, 48, 123, 82, 122, 119, 33, 63, 167)
# check against chou's total
sum(alpha) == 2447
## [1] TRUE
# beta proteins
beta <- c(203, 67, 139, 121, 75, 122, 86, 297, 49, 120,
177, 115, 16, 85, 127, 341, 253, 44, 110, 229)
# check against chou's total
sum(beta) == 2776
## [1] TRUE
# alpha + beta
a.plus.b <- c(175, 78, 120, 111, 74, 74, 86, 171, 33, 93,
110, 112, 25, 52, 71, 126, 117, 30, 108, 123)
sum(a.plus.b) == 1889
## [1] TRUE
# alpha/beta
a.div.b <- c(361, 146, 183, 244, 63, 114, 257, 377, 107, 239,
339, 321, 91, 158, 188, 327, 238, 72, 130, 378)
sum(a.div.b) == 4333
## [1] TRUE
Chou and Chang’s table 5 is approximately like this
data.frame(aa.1.1, alpha, beta, a.plus.b, a.div.b)
## aa.1.1 alpha beta a.plus.b a.div.b
## 1 A 285 203 175 361
## 2 R 53 67 78 146
## 3 N 97 139 120 183
## 4 D 163 121 111 244
## 5 C 22 75 74 63
## 6 Q 67 122 74 114
## 7 E 134 86 86 257
## 8 G 197 297 171 377
## 9 H 111 49 33 107
## 10 I 91 120 93 239
## 11 L 221 177 110 339
## 12 K 249 115 112 321
## 13 M 48 16 25 91
## 14 F 123 85 52 158
## 15 P 82 127 71 188
## 16 S 122 341 126 327
## 17 T 119 253 117 238
## 18 W 33 44 30 72
## 19 Y 63 110 108 130
## 20 V 167 229 123 378
From table 5 we calculate the proprotions of each amino acid in each fold class.
Calculate proportions for each of the four protein fold types
alpha.prop <- alpha/sum(alpha)
beta.prop <- beta/sum(beta)
a.plus.b.prop <- a.plus.b/sum(a.plus.b)
a.div.b <- a.div.b/sum(a.div.b)
Create a dataframe
#dataframe
aa.prop <- data.frame(alpha.prop,
beta.prop,
a.plus.b.prop,
a.div.b)
#row labels
row.names(aa.prop) <- aa.1.1
Table 5 therefore becomes this
aa.prop
## alpha.prop beta.prop a.plus.b.prop a.div.b
## A 0.116469146 0.073126801 0.09264161 0.08331410
## R 0.021659174 0.024135447 0.04129169 0.03369490
## N 0.039640376 0.050072046 0.06352567 0.04223402
## D 0.066612178 0.043587896 0.05876125 0.05631202
## C 0.008990601 0.027017291 0.03917417 0.01453958
## Q 0.027380466 0.043948127 0.03917417 0.02630972
## E 0.054760932 0.030979827 0.04552673 0.05931225
## G 0.080506743 0.106988473 0.09052409 0.08700669
## H 0.045361667 0.017651297 0.01746956 0.02469421
## I 0.037188394 0.043227666 0.04923240 0.05515809
## L 0.090314671 0.063760807 0.05823187 0.07823679
## K 0.101757254 0.041426513 0.05929063 0.07408262
## M 0.019615856 0.005763689 0.01323452 0.02100162
## F 0.050265631 0.030619597 0.02752779 0.03646434
## P 0.033510421 0.045749280 0.03758602 0.04338795
## S 0.049856968 0.122838617 0.06670196 0.07546734
## T 0.048630977 0.091138329 0.06193753 0.05492730
## W 0.013485901 0.015850144 0.01588142 0.01661666
## Y 0.025745811 0.039625360 0.05717311 0.03000231
## V 0.068246833 0.082492795 0.06511382 0.08723748
Chou’s method rely on the correlation between amino acid composition between proteins with known structure and unknown proteins. Its therefore information to do a standard examination of the data using scatterplots and correlations. First we’ll look at the data for the four fold classes.
Scatterplot matrix
plot(aa.prop,panel = panel.smooth)
Correlation matrix
cor(aa.prop)
## alpha.prop beta.prop a.plus.b.prop a.div.b
## alpha.prop 1.0000000 0.4941143 0.6969508 0.8555289
## beta.prop 0.4941143 1.0000000 0.7977771 0.7706654
## a.plus.b.prop 0.6969508 0.7977771 1.0000000 0.8198043
## a.div.b 0.8555289 0.7706654 0.8198043 1.0000000
round(cor(aa.prop), 3)
## alpha.prop beta.prop a.plus.b.prop a.div.b
## alpha.prop 1.000 0.494 0.697 0.856
## beta.prop 0.494 1.000 0.798 0.771
## a.plus.b.prop 0.697 0.798 1.000 0.820
## a.div.b 0.856 0.771 0.820 1.000
Individual plots
par(mfrow = c(1,3), mar = c(4,4,1,0))
plot(alpha.prop ~ beta.prop, data = aa.prop)
plot(alpha.prop ~ a.plus.b.prop, data = aa.prop)
plot(alpha.prop ~ a.div.b, data = aa.prop)
par(mfrow = c(1,1), mar = c(4,4,4,4))
We will pick a gene, determine its amino acid composition, and determine the similarity of its composition to the four fold classes.
# download
NP_000783 <- rentrez::entrez_fetch(id = "NP_000783.2",
db = "protein",
rettype = "fasta")
# clean and turn into vector
NP_000783 <- compbio4all::fasta_cleaner(NP_000783, parse = TRUE)
Determine amino acid frequencies for my focal gene (counts of amino acid divided by length of sequence)
NP_000783.freq.table <- table(NP_000783)/length(NP_000783)
Function to convert a table into a vector
table_to_vector <- function(table_x){
table_names <- attr(table_x, "dimnames")[[1]]
table_vect <- as.vector(table_x)
names(table_vect) <- table_names
return(table_vect)
}
DIO1.human.aa.freq <- table_to_vector(NP_000783.freq.table)
Check for presence of U, an unknown amino acid
aa.names <- names(DIO1.human.aa.freq)
any(aa.names == "U")
## [1] TRUE
i.U <- which(aa.names == "U")
aa.names[i.U]
## [1] "U"
DIO1.human.aa.freq[i.U]
## U
## 0.004016064
#uing %in%, which isn't necessary but illustrates the syntax
which(aa.names %in% c("U"))
## [1] 18
Remove the “U”. This throws off the math a tiny bit but will be ok. (Better to remove it from the full sequence.)
DIO1.human.aa.freq <- DIO1.human.aa.freq[-i.U]
# sum is just under 1; ok for our purposes
sum(DIO1.human.aa.freq)
## [1] 0.9959839
Add data to datatable
aa.prop$DIO1.human.aa.freq <- DIO1.human.aa.freq
Corrleation used in Chou adn Zhange 1992.
chou_cor <- function(x,y){
numerator <- sum(x*y)
denominator <- sqrt((sum(x^2))*(sum(y^2)))
result <- numerator/denominator
return(result)
}
Cosine similarity used in Higgs and Attwood (2005).
TODO: Check if this is exactly the same as used in Chou’s 1994 and 1995 papers.
chou_cosine <- function(z.1, z.2){
z.1.abs <- sqrt(sum(z.1^2))
z.2.abs <- sqrt(sum(z.2^2))
my.cosine <- sum(z.1*z.2)/(z.1.abs*z.2.abs)
return(my.cosine)
}
Data exploration. None of the correlations are great…
par(mfrow = c(2,2), mar = c(1,4,1,1))
plot(alpha.prop ~ DIO1.human.aa.freq, data = aa.prop)
plot(beta.prop ~ DIO1.human.aa.freq, data = aa.prop)
plot(a.plus.b.prop ~ DIO1.human.aa.freq, data = aa.prop)
plot(a.div.b ~ DIO1.human.aa.freq, data = aa.prop)
par(mfrow = c(1,1), mar = c(1,1,1,1))
Calculate correlation between each column
corr.alpha <- chou_cor(aa.prop[,5], aa.prop[,1])
corr.beta <- chou_cor(aa.prop[,5], aa.prop[,2])
corr.apb <- chou_cor(aa.prop[,5], aa.prop[,3])
corr.adb <- chou_cor(aa.prop[,5], aa.prop[,4])
Calculate cosine similarity
cos.alpha <- chou_cosine(aa.prop[,5], aa.prop[,1])
cos.beta <- chou_cosine(aa.prop[,5], aa.prop[,2])
cos.apb <- chou_cosine(aa.prop[,5], aa.prop[,3])
cos.adb <- chou_cosine(aa.prop[,5], aa.prop[,4])
Calculate distance. Note: we need to flip the dataframe on its side using a command called t()
aa.prop.flipped <- t(aa.prop)
round(aa.prop.flipped,2)
## A R N D C Q E G H I L K
## alpha.prop 0.12 0.02 0.04 0.07 0.01 0.03 0.05 0.08 0.05 0.04 0.09 0.10
## beta.prop 0.07 0.02 0.05 0.04 0.03 0.04 0.03 0.11 0.02 0.04 0.06 0.04
## a.plus.b.prop 0.09 0.04 0.06 0.06 0.04 0.04 0.05 0.09 0.02 0.05 0.06 0.06
## a.div.b 0.08 0.03 0.04 0.06 0.01 0.03 0.06 0.09 0.02 0.06 0.08 0.07
## DIO1.human.aa.freq 0.05 0.02 0.04 0.05 0.06 0.06 0.03 0.05 0.04 0.12 0.03 0.06
## M F P S T W Y V
## alpha.prop 0.02 0.05 0.03 0.05 0.05 0.01 0.03 0.07
## beta.prop 0.01 0.03 0.05 0.12 0.09 0.02 0.04 0.08
## a.plus.b.prop 0.01 0.03 0.04 0.07 0.06 0.02 0.06 0.07
## a.div.b 0.02 0.04 0.04 0.08 0.05 0.02 0.03 0.09
## DIO1.human.aa.freq 0.06 0.06 0.06 0.05 0.03 0.08 0.03 0.02
We can get distance matrix like this
dist(aa.prop.flipped, method = "euclidean")
## alpha.prop beta.prop a.plus.b.prop a.div.b
## beta.prop 0.13342098
## a.plus.b.prop 0.09281824 0.08289406
## a.div.b 0.06699039 0.08659174 0.06175113
## DIO1.human.aa.freq 0.17606277 0.18081192 0.15112424 0.15786886
Individual distances
dist.alpha <- dist((aa.prop.flipped[c(1,5),]), method = "euclidean")
dist.beta <- dist((aa.prop.flipped[c(2,5),]), method = "euclidean")
dist.apb <- dist((aa.prop.flipped[c(3,5),]), method = "euclidean")
dist.adb <- dist((aa.prop.flipped[c(4,5),]), method = "euclidean")
Compile the information
# fold types
fold.type <- c("alpha","beta","alpha plus beta", "alpha/beta")
# data
corr.sim <- round(c(corr.alpha,corr.beta,corr.apb,corr.adb),5)
cosine.sim <- round(c(cos.alpha,cos.beta,cos.apb,cos.adb),5)
Euclidean.dist <- round(c(dist.alpha,dist.beta,dist.apb,dist.adb),5)
# summary
sim.sum <- c("","","most.sim","")
dist.sum <- c("","","min.dist","")
df <- data.frame(fold.type,
corr.sim ,
cosine.sim ,
Euclidean.dist ,
sim.sum ,
dist.sum )
Display output
pander::pander(df)
| fold.type | corr.sim | cosine.sim | Euclidean.dist | sim.sum | dist.sum |
|---|---|---|---|---|---|
| alpha | 0.7552 | 0.7552 | 0.1761 | ||
| beta | 0.746 | 0.746 | 0.1808 | ||
| alpha plus beta | 0.8072 | 0.8072 | 0.1511 | most.sim | min.dist |
| alpha/beta | 0.7937 | 0.7937 | 0.1579 |