WORKFLOW

blastx against the GenBank protein database

-For each cassette exon, find the upstream and downstream exons
-BLAST the sequences with the cassette exon spliced in and spliced out against the five fish species (e-value threshold = 0.1, identity >= 30%, query coverage >= 70%, and gaps introduced < 30% of query length)
-If there are BLAST hits for both the spliced in and spliced out sequences, then that splicing event is conserved
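
A minimal sketch of the filtering and conservation call described above. It assumes each BLAST hit has already been parsed into a record; the Hit class, its field names, and the function names are hypothetical, and treating the e-value threshold as inclusive (<= 0.1) is an assumption. For blastx the gap count and the query length may be reported in different units (amino acids vs nucleotides), so the gap check assumes both have been converted to the same unit.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One parsed BLAST hit (hypothetical record layout)."""
    evalue: float  # expect value of the alignment
    pident: float  # percent identity, 0-100
    qcov: float    # query coverage in percent, 0-100
    gaps: int      # number of gap positions in the alignment
    qlen: int      # query length, in the same units as gaps


def passes_filters(hit, max_evalue=0.1, min_pident=30.0,
                   min_qcov=70.0, max_gap_frac=0.30):
    """Apply the four thresholds from the workflow to a single hit."""
    return (hit.evalue <= max_evalue
            and hit.pident >= min_pident
            and hit.qcov >= min_qcov
            and hit.gaps < max_gap_frac * hit.qlen)


def has_hit(hits):
    """True if at least one hit survives the filters."""
    return any(passes_filters(h) for h in hits)


def is_conserved(spliced_in_hits, spliced_out_hits):
    """An event is conserved when both isoforms have a surviving hit."""
    return has_hit(spliced_in_hits) and has_hit(spliced_out_hits)
```

Here is_conserved would be called once per splicing event and per species, with the hit lists for the spliced in and spliced out query sequences.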

Number of total cassette exons: 5051
Number of genes with cassette exons: 3037

Number of total cassette splicing events: 19015

Number of genes with conserved spliced in OR (inclusive) spliced out isoforms in at least one species: 2367

Non-conserved means there are BLAST hits for only one of the spliced in and spliced out sequences, not both (exclusive or)
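
The inclusive-or and exclusive-or definitions used in these counts can be sketched as below. The function names and the "no hit" label are hypothetical; the inputs are two booleans saying whether the spliced in and spliced out sequences each had a BLAST hit that passed the filters in one species.

```python
from collections import Counter


def classify_event(in_hit, out_hit):
    """Label one splicing event in one species."""
    if in_hit and out_hit:
        return "conserved"      # hits for both isoforms
    if in_hit != out_hit:       # exclusive or: exactly one isoform has a hit
        return "non-conserved"
    return "no hit"             # neither isoform has a hit


def has_conserved_isoform(in_hit, out_hit):
    """Inclusive or: at least one of the two isoforms has a hit."""
    return in_hit or out_hit


def tally(events):
    """Count labels over an iterable of (in_hit, out_hit) pairs."""
    return Counter(classify_event(i, o) for i, o in events)
```

Under these definitions every event that is conserved or non-conserved also satisfies has_conserved_isoform, which is why the gene-level inclusive-or count is the broadest of the three.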

BLAST RESULTS:

Species       Conserved cassette splicing   Non-conserved cassette splicing   Genes with at least one
              events (# of genes)           events (# of genes)               conserved isoform
lamprey       73 (20)                       436 (174)                         184
spotted gar   7257 (906)                    3798 (987)                        1590
zebrafish     6647 (892)                    3815 (972)                        1564
fugu          3045 (541)                    2754 (737)                        1107
coelacanth    6978 (1008)                   3967 (990)                        1647
human         10322 (1560)                  3718 (992)                        2134
C. elegans    1817 (146)                    1678 (452)                        534

Splicing Event Conservation:

upset(fromList(listInput), nsets = 7, order.by = "freq")

  • i.e., 10 / 70 mouse genes (14%) have cassette splicing events that are conserved in all 5 fish species, human, and C. elegans
  • 15 / 101 mouse genes (15%) have cassette splicing events that are conserved in all 5 fish species and human
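
The "conserved in all species" counts above correspond to a set intersection over the per-species gene lists. A rough sketch, assuming the per-species gene sets are available as a dictionary; the function names and the gene IDs in the usage note are hypothetical. The denominator is passed in explicitly because these notes do not spell out how it is defined.

```python
def conserved_in_all(genes_by_species, species):
    """Genes whose events are conserved in every listed species."""
    sets = [genes_by_species[s] for s in species]
    return set.intersection(*sets) if sets else set()


def percent(numerator, denominator):
    """Whole-number percentage, as reported in the notes."""
    return round(100 * numerator / denominator)
```

For example, with toy sets {"geneA", "geneB", "geneC"} for zebrafish, {"geneA", "geneC"} for human, and {"geneC"} for C. elegans, conserved_in_all over the three species returns {"geneC"}. The reported figures follow the same arithmetic: percent(10, 70) gives 14 and percent(15, 101) gives 15.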

TODO: blast against other mammals
-take the first hit (the one with the highest bit score)
-length vs bit score
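
One possible way to approach the two TODO items, assuming the BLAST output has been parsed into one dictionary per hit. The keys qseqid, length, and bitscore are an assumed record layout and the function names are hypothetical. Picking the maximum explicitly avoids relying on the order of hits in the output file.

```python
def best_hits(rows):
    """Keep, for each query, the hit with the highest bit score."""
    best = {}
    for row in rows:
        query = row["qseqid"]
        # On ties the hit seen first is kept.
        if query not in best or row["bitscore"] > best[query]["bitscore"]:
            best[query] = row
    return best


def length_vs_bitscore(best):
    """(alignment length, bit score) pairs for the best hits, ready to plot."""
    return [(row["length"], row["bitscore"]) for row in best.values()]
```

The pairs returned by length_vs_bitscore could then be passed to any plotting tool to look at how bit score scales with alignment length.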