Bioinformatics Homework 7

suppressPackageStartupMessages(library("dplyr"))

Task 1

GENSCAN Output

From http://genes.mit.edu/GENSCAN.html
Organism: Vertebrate
Suboptimal Exon Cutoff (optional): 1.00
Print Options: Predicted peptides only
DNA Sequence: sequence.txt attached

GENSCAN 1.0 Date run: 18-May-115    Time: 09:12:11

Sequence /tmp/05_18_15-09:12:11.fasta : 22080 bp : 41.16% C+G : Isochore 1 ( 0 - 43 C+G%)

Parameter matrix: HumanIso.smat

Predicted genes/exons:

Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------
 1.01 Intr +   6799   6871   73  0  1   74   57    81 0.022   1.66
 1.02 Term +   6970   7148  179  2  2   11   37   197 0.049   3.77
 1.03 PlyA +   7719   7724    6                               1.05
 2.04 PlyA -   8033   8028    6                               1.05
 2.03 Term -   8907   8796  112  2  1   37   42   149 0.692   2.25
 2.02 Intr -   9994   9702  293  1  2   19   23   301 0.822  11.71
 2.01 Init -  12143  11295  849  2  0   68   66  1026 0.981  93.10
 2.00 Prom -  12317  12278   40                             -10.55
 3.05 PlyA -  12339  12334    6                              -4.04
 3.04 Term -  13402  12433  970  0  1   19   50  1322 0.980 112.03
 3.03 Intr -  13715  13616  100  0  1   89    6   143 0.981   4.45
 3.02 Intr -  13926  13797  130  0  1   59  111   125 0.986  11.25
 3.01 Init -  14074  14024   51  1  0   92    2    70 0.984   0.01
 3.00 Prom -  14137  14098   40                             -11.64
 4.00 Prom +  14386  14425   40                             -12.62
 4.01 Init +  14428  14481   54  0  0   87   97    71 0.756   9.24
 4.02 Intr +  14523  14781  259  2  1   25   11   335 0.738  15.61
 4.03 Term +  14857  15359  503  2  2   73   32   338 0.966  20.36
 4.04 PlyA +  15483  15488    6                              -0.45
 5.04 PlyA -  15584  15579    6                               1.05
 5.03 Term -  16364  16186  179  0  2   65   38   253 0.232  14.87
 5.02 Intr -  19527  18263 1265  2  2   55   74  2278 0.588 209.46
 5.01 Intr -  19892  19648  245  2  2  110   74   453 0.996  41.47

Suboptimal exons with probability > 1.000

Exnum Type S .Begin ...End .Len Fr Ph B/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------

NO EXONS FOUND AT GIVEN PROBABILITY CUTOFF

Predicted peptide sequence(s):

>/tmp/05_18_15-09:12:11.fasta|GENSCAN_predicted_peptide_1|83_aa
IKSIQSNGHKVSTISVTQMQGVRIAHSIITKHRIQPKAQSPQTVSDALPRSSILGDCVKF
KELSDARCVYLIYRGGVTYEAKI

>/tmp/05_18_15-09:12:11.fasta|GENSCAN_predicted_peptide_2|417_aa
MVFCGITSAATLQSTPKQQTKQTFPQETSEIPIEFPTQRMQKPSHFAREDSQEHLPTSHT
QNSHTVNMKRRTLSGSCGVGVFVFAFAFIVIAFATPSWLVSDYRITGAKLDRLGLWVHCF
RSLPDVNDDSQRRFFVGCRWVYDPFTTGYDEIRGFLLPAFMIATQFFYTLAFIGMLVSAI
GVLVFILCAGPDQKHFITLIKSLGYVLLGAGVSAAIAVIVFAGFGNRNGWMPEHANNWFG
WSFILACVGTVLTLVASTLFLSEAHVQHKKRIQFKESQTRFELSGFKAAGLQNEIRNNES
PRQSAWCIRLSFPMAFAFWENRTNFSLSAFVAVEATGRICISLFAVTTDMPTDCGLHHVS
TRRFRSCLFVLGYLLMIDEIRLTNKSDYCTRYTERIKNFGITRLRLKPTDCVRIVEF

>/tmp/05_18_15-09:12:11.fasta|GENSCAN_predicted_peptide_3|416_aa
MSFLLKAVTTARHIVHKVPVRNLNLLEFQSKDLLQKYGVAIQQFKVLNNSKADAEVVKTF
ECPEYVVKAQILAGGRGKGTFDNGFKGGVHITTNKSEVLSLTQQMIGNRLITKQTPKSGI
LVNKVMVARSINITRETYLCILLDREHNGPVLIASPAGGMDIEAVAEETPEKIKTVPLDI
GKPIPESTLLEVAKFLEFKGDSVKRCAEEIQKLYTLFKAVDAVQIEINPLAETDKGEVIS
VDAKLNFDDNAQFRQKDIFSMDVTEEESDPREVEAAKYNLNYVAMDGNIGCLVNGAGLAM
ATMDIIKLNGGEPANFLDVGGGVREDQVAKAFEILTADPKVKGILVNVFGGIVNCATIAN
GIVAASKKLQLNVPLVVRLEGTNVNQAREILKNSGLPIQTASDLDDAAHKAVAALN

>/tmp/05_18_15-09:12:11.fasta|GENSCAN_predicted_peptide_4|271_aa
MQENVVHESLPQIPVATSPLHLEHQTQPGSSNHGQEKIRQCVHGDAGSPVLRLLLLIAGA
AQPGTISQAARFALGGIVAGCCYTIQPDGAALPGTHLCQHAERVHLDYWRGSFGSIIWVR
NHVIAPLSEEFVFRACMMPLILQSFSPLVAVFITPLFFGVAHLHHIAERLSLGVELSTAL
LIGLFQFIYTTLFGFYSAFLFARTGHVMAPILVHAFCNHMGLPDLQDLWQQDLWRRVVAI
ILYLAGFVGWMFLVPLATDPSIYDNTLYWNA

>/tmp/05_18_15-09:12:11.fasta|GENSCAN_predicted_peptide_5|562_aa
ELPRAEPLLGGAAGAADDSELQSLEPESRKTEATQDAEEHPDKDNSYSDSDETDELQKFP
GKMPLELDLSDLAAAILRNNKKSRMNCVVERKHAEEIVDSPSKTVALPFGVNLTTDPKKA
RITGERISIFCDGGDDKDKEEKEAKRQREQEEEDSDEDAEPLRPMFIPFQRVPIPFGRIP
DQMPLTHMPQRMMPPAMQMQQQLPQQQQQGPIIIRQLPPPFQHLPMRPPMPPQVMQAPRM
ESSEEMQMPKVQTVRIHMQQIPFREIHVADDVPVQIPAQQQRLEMQQRQEMQQRQEMQQR
HEMQQRQEMQQRHEMQQRQEQREIPQIQIQPMPFGMALQRVGITAEDLRNIQRMAEDRIT
DELRRLAAEEGAESSEGSDEDVSSSAEQATTEAQTTVPQQQQQQTVQPEQQHPEEQHQEV
SQERQEQQQQQVPQMRIQRLILQPLPEQPRDTMTPPMPDQPQQLPFGRMVYGRSLAQPVR
IPVPMMQAMEGGAAAPEESQRPHFAPKYAVPDATVAYQGYGCEEYEAADGAGLSLQEESD
QVEDYEHHVRFHQGRIRGFRYQ

GENSCAN Legend

Gn.Ex : gene number, exon number (for reference)
Type  : Init = Initial exon (ATG to 5' splice site)
                             Intr = Internal exon (3' splice site to 5' splice site)
                             Term = Terminal exon (3' splice site to stop codon)
Sngl = Single-exon gene (ATG to stop)
Prom = Promoter (TATA box / initation site)
PlyA = poly-A signal (consensus: AATAAA)
S     : DNA strand (+ = input strand; - = opposite strand)
Begin : beginning of exon or signal (numbered on input strand)
End   : end point of exon or signal (numbered on input strand)
Len   : length of exon or signal (bp)
Fr    : reading frame (a forward strand codon ending at x has frame x mod 3)
Ph    : net phase of exon (exon length modulo 3)
I/Ac  : initiation signal or 3' splice site score (tenth bit units)
Do/T  : 5' splice site or termination signal score (tenth bit units)
CodRg : coding region score (tenth bit units)
P     : probability of exon (sum over all parses containing exon)
Tscr  : exon score (depends on length, I/Ac, Do/T and CodRg scores)

The SCORE of a predicted feature (e.g., exon or splice site) is a
log-odds measure of the quality of the feature based on local sequence
properties. For example, a predicted 5' splice site with
score > 100 is strong; 50-100 is moderate; 0-50 is weak; and
below 0 is poor (more than likely not a real donor site).

The PROBABILITY of a predicted exon is the estimated probability under
GENSCAN's model of genomic sequence structure that the exon is correct.
This probability depends in general on global as well as local sequence
properties, e.g., it depends on how well the exon fits with neighboring
exons.  It has been shown that predicted exons with higher probabilities
are more likely to be correct than those with lower probabilities.

a) How many full-length polypeptides are predicted within this sequence by GENSCAN (assume fly to be a vertebrate)?

five full-length polypeptides predicted within this sequence by GENSCAN

b) Which one of these polypeptides is likely to be involved in controlling the tube size of the respiratory system in flies?

predicted peptide 3 because it has the highest and most consistent P values
this is because the p-value gives the probability that the prediction is correct

c) Comment on the confidence of the GENSCAN predictions focusing on the metrics returned for the various predictions. Why are some of these predictions of lower confidence than others?

most confident for gene 3 based on P values
some confidence are lower than other because they don’t meet the exon identification requirements
the algorithm models the transition probabilities from one part of the gene to another which are pre-computed from a set of known gene structures

d) Can sequences in sequence databases be wrong? Why or why not?

yes of course
errors occurr because the sequence being put in the database is wrong
or there are errors in annotating the sequence

e) Suggest a virtual (in silico) and a ‘real’ experimental strategy you would use to test the accuracy of the prediction.

search the protein sequence in BLAST and determine if there are any proteins that have similar respiratory tube controlling properties

f) Identify the closest homologue of the aforementioned polypeptide (i.e. the one regulating tube size in fly) in Homo sapiens.

SUCLG2 protein is the closest homologue closest homologue in Homo sapiens was searched in BLAST with the maximum hits set to 500 and the query organism set to Homo sapiens.

g) Use the PHD suite to analyze the aforementioned polypeptide (i.e. the one regulating tube size in fly).

i. What features about the protein can you discern from these analyses?

two diagram generated by the PHD suite:

analysis gives me the predicted secondary structure of the protein

ii. Compare the secondary structure prediction results with those produced by Jpred (discussed fleetingly in class).

results from JPred4 are similar to those from PredictProtein in that they both predict the locations and residues corresponding to helices and beta sheets
both methods also show the probability that the prediction is correct by digits from 0-9
predictProtein is much more thorough in assigning all the potential seconday structures in proteins
JPred4 provides multiple sequence alignments and predicted helices/beta sheets
both predict the location of secondary structures in roughly the same places
except JPred4 predicts an extra helix in the 1-60 residue range
and PredictProtein predicts an extra beta-barrel in the 120-180 residue range

iii. Now back to the results from the PHD suite: one particular feature strikes you as interesting and you wish to explore this feature using HMMTOP discussed in class.
How do the results compare between TMSEG and HMMTOP?

TMSEG said there was low probability that there was a transmembrane beta-barrel
TMSEG predicts that the protein is not a transmembrane protein
this is the same as HMMTOP
HMMTOP predicts that this sequence is also not a transmembrance protein
HMMTOP did not provide reasons for why it rejected the protein or the accuracy (probability)

The transmembrane helices (TMSEG) from PredictProtein said that there was low probability that there was a transmembrane beta-barrel (Z-score: 2.6, 9.7% estimated percent chance that this protein is a transmembrane beta-barrel).

In the best model, TMSEG predicted 1 membrane helix from residues 344-361. However, the reliability of the model is 2, which is low. Thus, TMSEG predicts that this protein is not a transmembrane protein. This agrees with the results from HMMTOP, which also predicted that this sequence is not a transmembrane protein. However, HMMTOP did not provide any reasons for why it rejected the protein as a membrane protein or the accuracy of its prediction.

iv. What experiments can you do in the lab to test some of these predictions?

protein can be isolated, purified, and crystalized - structure can then be determined via X-ray crystallography
can also be determined via NMR-experiments.

Task 2

Part I

Load the data file yeast_properties.txt and extract the dN/dS ratios (see column dnds) for all the genes into a vector.

genes <- read.table('yeast_properties.txt',header=TRUE)
dnds <- genes$dnds

Compute the mean and standard deviation.

mean(dnds)

## [1] 0.1010852

sd(dnds)

## [1] 0.08331521

See if the summary function provides any additional information.

summary(dnds)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.04733 0.08105 0.10110 0.13030 0.88500

Part II

Which gene(s) has/have the highest and which has/have the lowest dN/dS ratios.

Lowest

genes[(genes['dnds']==min(genes['dnds'])),]$Gene

##  [1] YBR009C YBR010W YDR059C YDR225W YEL027W YFL039C YGR027C YLR293C
##  [9] YNL030W YNL031C YOR210W YOR369C YPL218W YPR165W
## 5108 Levels: YAL002W YAL005C YAL007C YAL008W YAL009W YAL010C ... YPR201W

Highest

genes[(genes['dnds']==max(genes['dnds'])),]$Gene

## [1] YOL159C
## 5108 Levels: YAL002W YAL005C YAL007C YAL008W YAL009W YAL010C ... YPR201W

Generate a boxplot of dN/dS ratios into two groups: Genes with Degree >50 and genes with Degree <=50. The Degree column contains the number of protein-protein interactions for each gene.

genes <- genes %>% mutate(degreelevel = ifelse(Degree <= 50, "Low", "High"))
boxplot(dnds~degreelevel,genes, ylab="dN/dS")

Task 3

What proportion of genes are essential (i.e. essentiality property is 1) of the “young” (i.e., Fungi) genes and what proportion of the “old” (i.e., pre_Eukaryota)?

overall_prop_essential <- nrow(genes[c(genes$essentiality==1),]) / nrow(genes)
young_prop_essential <- 
  nrow(genes[c(genes$essentiality==1 & genes$Age=="Fungi"),]) / nrow(genes[c(genes$Age=="Fungi"),])
old_prop_essential <- 
  nrow(genes[c(genes$essentiality==1 & genes$Age=="pre_Eukaryota"),]) / nrow(genes[c(genes$Age=="pre_Eukaryota"),])

c(overall_prop_essential, young_prop_essential, old_prop_essential)

## [1] 0.1947925 0.1373239 0.2522124

Task 4

Loop over all yeast genes using a for loop and find the genes which fulfill the following conditions: (a) it is essential and (b) it is in the top 10% of the cai values (i.e., they are highly expressed).

setlist<-NULL 
for (i in 1:nrow(genes)){
  if (genes[i,]$essentiality ==1 && genes[i,]$cai > quantile(genes$cai, probs=.90)){
    setlist=c(setlist,i)
  }
}

genes_of_setlist <- genes[setlist,]

Is there an easier way of identifying these genes?

easier_way_to_get_setlist <- 
  genes[c(genes$essentiality == 1 & genes$cai > quantile(genes$cai, probs=.90) ),]

essential_high_copy_genes <- easier_way_to_get_setlist

What is the mean of the dN/dS ratio of this list of genes?

mean(essential_high_copy_genes$dnds)

## [1] 0.0445505

Is it higher/lower than the yeast average and is this difference significant?

mean(genes$dnds)

## [1] 0.1010852

mean(essential_high_copy_genes$dnds)

## [1] 0.0445505

mean(essential_high_copy_genes$dnds) < mean(genes$dnds)

## [1] TRUE

t.test(essential_high_copy_genes$dnds, genes$dnds)

## 
##  Welch Two Sample t-test
## 
## data:  essential_high_copy_genes$dnds and genes$dnds
## t = -15.3439, df = 164.181, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.0638099 -0.0492596
## sample estimates:
## mean of x mean of y 
## 0.0445505 0.1010852

Essential, high copy mean is about 44% of that of all. This seems significant. Essential, high copy genes would be expected to evolve more slowly as they are critical to core cell operation.

Does the result make sense when you make a boxplot of the two sets of genes?

boxplot(genes$dnds, essential_high_copy_genes$dnds, 
        names=c("All Genes", "High Copy, Essential Genes"), ylab="dN/dS")

makes sense