suppressPackageStartupMessages(library("dplyr"))
From http://genes.mit.edu/GENSCAN.html
Organism: Vertebrate
Suboptimal Exon Cutoff (optional): 1.00
Print Options: Predicted peptides only
DNA Sequence: sequence.txt attached
GENSCAN 1.0 Date run: 18-May-115 Time: 09:12:11
Sequence /tmp/05_18_15-09:12:11.fasta : 22080 bp : 41.16% C+G : Isochore 1 ( 0 - 43 C+G%)
Parameter matrix: HumanIso.smat
Predicted genes/exons:
Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------
1.01 Intr + 6799 6871 73 0 1 74 57 81 0.022 1.66
1.02 Term + 6970 7148 179 2 2 11 37 197 0.049 3.77
1.03 PlyA + 7719 7724 6 1.05
2.04 PlyA - 8033 8028 6 1.05
2.03 Term - 8907 8796 112 2 1 37 42 149 0.692 2.25
2.02 Intr - 9994 9702 293 1 2 19 23 301 0.822 11.71
2.01 Init - 12143 11295 849 2 0 68 66 1026 0.981 93.10
2.00 Prom - 12317 12278 40 -10.55
3.05 PlyA - 12339 12334 6 -4.04
3.04 Term - 13402 12433 970 0 1 19 50 1322 0.980 112.03
3.03 Intr - 13715 13616 100 0 1 89 6 143 0.981 4.45
3.02 Intr - 13926 13797 130 0 1 59 111 125 0.986 11.25
3.01 Init - 14074 14024 51 1 0 92 2 70 0.984 0.01
3.00 Prom - 14137 14098 40 -11.64
4.00 Prom + 14386 14425 40 -12.62
4.01 Init + 14428 14481 54 0 0 87 97 71 0.756 9.24
4.02 Intr + 14523 14781 259 2 1 25 11 335 0.738 15.61
4.03 Term + 14857 15359 503 2 2 73 32 338 0.966 20.36
4.04 PlyA + 15483 15488 6 -0.45
5.04 PlyA - 15584 15579 6 1.05
5.03 Term - 16364 16186 179 0 2 65 38 253 0.232 14.87
5.02 Intr - 19527 18263 1265 2 2 55 74 2278 0.588 209.46
5.01 Intr - 19892 19648 245 2 2 110 74 453 0.996 41.47
Suboptimal exons with probability > 1.000
Exnum Type S .Begin ...End .Len Fr Ph B/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------
NO EXONS FOUND AT GIVEN PROBABILITY CUTOFF
Predicted peptide sequence(s):
>/tmp/05_18_15-09:12:11.fasta|GENSCAN_predicted_peptide_1|83_aa
IKSIQSNGHKVSTISVTQMQGVRIAHSIITKHRIQPKAQSPQTVSDALPRSSILGDCVKF
KELSDARCVYLIYRGGVTYEAKI
>/tmp/05_18_15-09:12:11.fasta|GENSCAN_predicted_peptide_2|417_aa
MVFCGITSAATLQSTPKQQTKQTFPQETSEIPIEFPTQRMQKPSHFAREDSQEHLPTSHT
QNSHTVNMKRRTLSGSCGVGVFVFAFAFIVIAFATPSWLVSDYRITGAKLDRLGLWVHCF
RSLPDVNDDSQRRFFVGCRWVYDPFTTGYDEIRGFLLPAFMIATQFFYTLAFIGMLVSAI
GVLVFILCAGPDQKHFITLIKSLGYVLLGAGVSAAIAVIVFAGFGNRNGWMPEHANNWFG
WSFILACVGTVLTLVASTLFLSEAHVQHKKRIQFKESQTRFELSGFKAAGLQNEIRNNES
PRQSAWCIRLSFPMAFAFWENRTNFSLSAFVAVEATGRICISLFAVTTDMPTDCGLHHVS
TRRFRSCLFVLGYLLMIDEIRLTNKSDYCTRYTERIKNFGITRLRLKPTDCVRIVEF
>/tmp/05_18_15-09:12:11.fasta|GENSCAN_predicted_peptide_3|416_aa
MSFLLKAVTTARHIVHKVPVRNLNLLEFQSKDLLQKYGVAIQQFKVLNNSKADAEVVKTF
ECPEYVVKAQILAGGRGKGTFDNGFKGGVHITTNKSEVLSLTQQMIGNRLITKQTPKSGI
LVNKVMVARSINITRETYLCILLDREHNGPVLIASPAGGMDIEAVAEETPEKIKTVPLDI
GKPIPESTLLEVAKFLEFKGDSVKRCAEEIQKLYTLFKAVDAVQIEINPLAETDKGEVIS
VDAKLNFDDNAQFRQKDIFSMDVTEEESDPREVEAAKYNLNYVAMDGNIGCLVNGAGLAM
ATMDIIKLNGGEPANFLDVGGGVREDQVAKAFEILTADPKVKGILVNVFGGIVNCATIAN
GIVAASKKLQLNVPLVVRLEGTNVNQAREILKNSGLPIQTASDLDDAAHKAVAALN
>/tmp/05_18_15-09:12:11.fasta|GENSCAN_predicted_peptide_4|271_aa
MQENVVHESLPQIPVATSPLHLEHQTQPGSSNHGQEKIRQCVHGDAGSPVLRLLLLIAGA
AQPGTISQAARFALGGIVAGCCYTIQPDGAALPGTHLCQHAERVHLDYWRGSFGSIIWVR
NHVIAPLSEEFVFRACMMPLILQSFSPLVAVFITPLFFGVAHLHHIAERLSLGVELSTAL
LIGLFQFIYTTLFGFYSAFLFARTGHVMAPILVHAFCNHMGLPDLQDLWQQDLWRRVVAI
ILYLAGFVGWMFLVPLATDPSIYDNTLYWNA
>/tmp/05_18_15-09:12:11.fasta|GENSCAN_predicted_peptide_5|562_aa
ELPRAEPLLGGAAGAADDSELQSLEPESRKTEATQDAEEHPDKDNSYSDSDETDELQKFP
GKMPLELDLSDLAAAILRNNKKSRMNCVVERKHAEEIVDSPSKTVALPFGVNLTTDPKKA
RITGERISIFCDGGDDKDKEEKEAKRQREQEEEDSDEDAEPLRPMFIPFQRVPIPFGRIP
DQMPLTHMPQRMMPPAMQMQQQLPQQQQQGPIIIRQLPPPFQHLPMRPPMPPQVMQAPRM
ESSEEMQMPKVQTVRIHMQQIPFREIHVADDVPVQIPAQQQRLEMQQRQEMQQRQEMQQR
HEMQQRQEMQQRHEMQQRQEQREIPQIQIQPMPFGMALQRVGITAEDLRNIQRMAEDRIT
DELRRLAAEEGAESSEGSDEDVSSSAEQATTEAQTTVPQQQQQQTVQPEQQHPEEQHQEV
SQERQEQQQQQVPQMRIQRLILQPLPEQPRDTMTPPMPDQPQQLPFGRMVYGRSLAQPVR
IPVPMMQAMEGGAAAPEESQRPHFAPKYAVPDATVAYQGYGCEEYEAADGAGLSLQEESD
QVEDYEHHVRFHQGRIRGFRYQ
Gn.Ex : gene number, exon number (for reference)
Type : Init = Initial exon (ATG to 5' splice site)
Intr = Internal exon (3' splice site to 5' splice site)
Term = Terminal exon (3' splice site to stop codon)
Sngl = Single-exon gene (ATG to stop)
Prom = Promoter (TATA box / initation site)
PlyA = poly-A signal (consensus: AATAAA)
S : DNA strand (+ = input strand; - = opposite strand)
Begin : beginning of exon or signal (numbered on input strand)
End : end point of exon or signal (numbered on input strand)
Len : length of exon or signal (bp)
Fr : reading frame (a forward strand codon ending at x has frame x mod 3)
Ph : net phase of exon (exon length modulo 3)
I/Ac : initiation signal or 3' splice site score (tenth bit units)
Do/T : 5' splice site or termination signal score (tenth bit units)
CodRg : coding region score (tenth bit units)
P : probability of exon (sum over all parses containing exon)
Tscr : exon score (depends on length, I/Ac, Do/T and CodRg scores)
The SCORE of a predicted feature (e.g., exon or splice site) is a
log-odds measure of the quality of the feature based on local sequence
properties. For example, a predicted 5' splice site with
score > 100 is strong; 50-100 is moderate; 0-50 is weak; and
below 0 is poor (more than likely not a real donor site).
The PROBABILITY of a predicted exon is the estimated probability under
GENSCAN's model of genomic sequence structure that the exon is correct.
This probability depends in general on global as well as local sequence
properties, e.g., it depends on how well the exon fits with neighboring
exons. It has been shown that predicted exons with higher probabilities
are more likely to be correct than those with lower probabilities.
a) How many full-length polypeptides are predicted within this sequence by GENSCAN (assume fly to be a vertebrate)?
five full-length polypeptides predicted within this sequence by GENSCAN
b) Which one of these polypeptides is likely to be involved in controlling the tube size of the respiratory system in flies?
predicted peptide 3 because it has the highest and most consistent P values
this is because the p-value gives the probability that the prediction is correct
c) Comment on the confidence of the GENSCAN predictions focusing on the metrics returned for the various predictions. Why are some of these predictions of lower confidence than others?
most confident for gene 3 based on P values
some confidence are lower than other because they don’t meet the exon identification requirements
the algorithm models the transition probabilities from one part of the gene to another which are pre-computed from a set of known gene structures
d) Can sequences in sequence databases be wrong? Why or why not?
yes of course
errors occurr because the sequence being put in the database is wrong
or there are errors in annotating the sequence
e) Suggest a virtual (in silico) and a ‘real’ experimental strategy you would use to test the accuracy of the prediction.
search the protein sequence in BLAST and determine if there are any proteins that have similar respiratory tube controlling properties
f) Identify the closest homologue of the aforementioned polypeptide (i.e. the one regulating tube size in fly) in Homo sapiens.
SUCLG2 protein is the closest homologue closest homologue in Homo sapiens was searched in BLAST with the maximum hits set to 500 and the query organism set to Homo sapiens.
g) Use the PHD suite to analyze the aforementioned polypeptide (i.e. the one regulating tube size in fly).
i. What features about the protein can you discern from these analyses?
two diagram generated by the PHD suite:
analysis gives me the predicted secondary structure of the protein
ii. Compare the secondary structure prediction results with those produced by Jpred (discussed fleetingly in class).
results from JPred4 are similar to those from PredictProtein in that they both predict the locations and residues corresponding to helices and beta sheets
both methods also show the probability that the prediction is correct by digits from 0-9
predictProtein is much more thorough in assigning all the potential seconday structures in proteins
JPred4 provides multiple sequence alignments and predicted helices/beta sheets
both predict the location of secondary structures in roughly the same places
except JPred4 predicts an extra helix in the 1-60 residue range
and PredictProtein predicts an extra beta-barrel in the 120-180 residue range
iii. Now back to the results from the PHD suite: one particular feature strikes you as interesting and you wish to explore this feature using HMMTOP discussed in class.
How do the results compare between TMSEG and HMMTOP?
TMSEG said there was low probability that there was a transmembrane beta-barrel
TMSEG predicts that the protein is not a transmembrane protein
this is the same as HMMTOP
HMMTOP predicts that this sequence is also not a transmembrance protein
HMMTOP did not provide reasons for why it rejected the protein or the accuracy (probability)
The transmembrane helices (TMSEG) from PredictProtein said that there was low probability that there was a transmembrane beta-barrel (Z-score: 2.6, 9.7% estimated percent chance that this protein is a transmembrane beta-barrel).
In the best model, TMSEG predicted 1 membrane helix from residues 344-361. However, the reliability of the model is 2, which is low. Thus, TMSEG predicts that this protein is not a transmembrane protein. This agrees with the results from HMMTOP, which also predicted that this sequence is not a transmembrane protein. However, HMMTOP did not provide any reasons for why it rejected the protein as a membrane protein or the accuracy of its prediction.
iv. What experiments can you do in the lab to test some of these predictions?
protein can be isolated, purified, and crystalized - structure can then be determined via X-ray crystallography
can also be determined via NMR-experiments.
Load the data file yeast_properties.txt and extract the dN/dS ratios (see column dnds) for all the genes into a vector.
genes <- read.table('yeast_properties.txt',header=TRUE)
dnds <- genes$dnds
Compute the mean and standard deviation.
mean(dnds)
## [1] 0.1010852
sd(dnds)
## [1] 0.08331521
See if the summary function provides any additional information.
summary(dnds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.04733 0.08105 0.10110 0.13030 0.88500
Which gene(s) has/have the highest and which has/have the lowest dN/dS ratios.
Lowest
genes[(genes['dnds']==min(genes['dnds'])),]$Gene
## [1] YBR009C YBR010W YDR059C YDR225W YEL027W YFL039C YGR027C YLR293C
## [9] YNL030W YNL031C YOR210W YOR369C YPL218W YPR165W
## 5108 Levels: YAL002W YAL005C YAL007C YAL008W YAL009W YAL010C ... YPR201W
Highest
genes[(genes['dnds']==max(genes['dnds'])),]$Gene
## [1] YOL159C
## 5108 Levels: YAL002W YAL005C YAL007C YAL008W YAL009W YAL010C ... YPR201W
Generate a boxplot of dN/dS ratios into two groups: Genes with Degree >50 and genes with Degree <=50. The Degree column contains the number of protein-protein interactions for each gene.
genes <- genes %>% mutate(degreelevel = ifelse(Degree <= 50, "Low", "High"))
boxplot(dnds~degreelevel,genes, ylab="dN/dS")
What proportion of genes are essential (i.e. essentiality property is 1) of the “young” (i.e., Fungi) genes and what proportion of the “old” (i.e., pre_Eukaryota)?
overall_prop_essential <- nrow(genes[c(genes$essentiality==1),]) / nrow(genes)
young_prop_essential <-
nrow(genes[c(genes$essentiality==1 & genes$Age=="Fungi"),]) / nrow(genes[c(genes$Age=="Fungi"),])
old_prop_essential <-
nrow(genes[c(genes$essentiality==1 & genes$Age=="pre_Eukaryota"),]) / nrow(genes[c(genes$Age=="pre_Eukaryota"),])
c(overall_prop_essential, young_prop_essential, old_prop_essential)
## [1] 0.1947925 0.1373239 0.2522124
Loop over all yeast genes using a for loop and find the genes which fulfill the following conditions: (a) it is essential and (b) it is in the top 10% of the cai values (i.e., they are highly expressed).
setlist<-NULL
for (i in 1:nrow(genes)){
if (genes[i,]$essentiality ==1 && genes[i,]$cai > quantile(genes$cai, probs=.90)){
setlist=c(setlist,i)
}
}
genes_of_setlist <- genes[setlist,]
Is there an easier way of identifying these genes?
easier_way_to_get_setlist <-
genes[c(genes$essentiality == 1 & genes$cai > quantile(genes$cai, probs=.90) ),]
essential_high_copy_genes <- easier_way_to_get_setlist
What is the mean of the dN/dS ratio of this list of genes?
mean(essential_high_copy_genes$dnds)
## [1] 0.0445505
Is it higher/lower than the yeast average and is this difference significant?
mean(genes$dnds)
## [1] 0.1010852
mean(essential_high_copy_genes$dnds)
## [1] 0.0445505
mean(essential_high_copy_genes$dnds) < mean(genes$dnds)
## [1] TRUE
t.test(essential_high_copy_genes$dnds, genes$dnds)
##
## Welch Two Sample t-test
##
## data: essential_high_copy_genes$dnds and genes$dnds
## t = -15.3439, df = 164.181, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.0638099 -0.0492596
## sample estimates:
## mean of x mean of y
## 0.0445505 0.1010852
Essential, high copy mean is about 44% of that of all. This seems significant. Essential, high copy genes would be expected to evolve more slowly as they are critical to core cell operation.
Does the result make sense when you make a boxplot of the two sets of genes?
boxplot(genes$dnds, essential_high_copy_genes$dnds,
names=c("All Genes", "High Copy, Essential Genes"), ylab="dN/dS")
makes sense