-find the exon coordinates for each isoform in the BED file
-check if these exons exist in the ribosomal profiling data
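A minimal sketch of these two steps, assuming the BED file is in BED12 format (exons encoded in `blockSizes` / `blockStarts`, 0-based half-open coordinates) and assuming an exon counts as "found" only when its coordinates match exactly. The toy tables, the `ribo_exons` table, and the `split_int` helper are hypothetical; the matching rule actually used in this analysis may differ.

```r
library(dplyr)

# Toy BED12 records (hypothetical values); only the columns needed here.
bed <- data.frame(
  chrom       = c("chr1", "chr1"),
  chromStart  = c(1000L, 5000L),
  name        = c("isoA", "isoB"),
  strand      = c("+", "-"),
  blockCount  = c(2L, 3L),
  blockSizes  = c("100,200,", "50,60,70,"),
  blockStarts = c("0,400,", "0,100,300,"),
  stringsAsFactors = FALSE
)

# Split a comma-separated BED field (optional trailing comma) into integers.
split_int <- function(x) as.integer(strsplit(sub(",$", "", x), ",")[[1]])

# One row per exon: start = chromStart + blockStart, end = start + blockSize.
exons <- bind_rows(lapply(seq_len(nrow(bed)), function(i) {
  starts <- bed$chromStart[i] + split_int(bed$blockStarts[i])
  data.frame(
    name       = bed$name[i],
    chrom      = bed$chrom[i],
    strand     = bed$strand[i],
    exon_start = starts,
    exon_end   = starts + split_int(bed$blockSizes[i]),
    stringsAsFactors = FALSE
  )
}))

# Hypothetical exon table derived from the ribosomal profiling data.
ribo_exons <- data.frame(
  chrom      = c("chr1", "chr1", "chr1"),
  strand     = c("+", "+", "-"),
  exon_start = c(1000L, 1400L, 5000L),
  exon_end   = c(1100L, 1600L, 5050L),
  stringsAsFactors = FALSE
)

# Per isoform, count exons with an exact coordinate match (assumed rule).
num_found <- exons %>%
  left_join(mutate(ribo_exons, found = TRUE),
            by = c("chrom", "strand", "exon_start", "exon_end")) %>%
  group_by(name) %>%
  summarise(num_found = sum(!is.na(found)), n_exons = n())
```

An isoform would then be treated as fully supported when `num_found` equals its exon count, which is the comparison against `blockCount` used further down in these notes.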
# Count isoforms per gene, bin the counts (1, 2, 3, >=4), then count genes per bin
isoform_num_per_gene <- mouse.isoforms.with.gene.name %>%
  group_by(ensembl_id) %>%
  summarise(count = n()) %>%
  mutate(num_of_isoforms = ifelse(count >= 4, ">=4", as.character(count))) %>%
  group_by(num_of_isoforms) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
ggplot(isoform_num_per_gene, aes(x = reorder(num_of_isoforms, -count), y = count, label = count)) +
  geom_bar(stat = "identity", width = 0.5) +
  xlab("Number of isoforms per gene") +
  ylab("Gene count") +
  ggtitle("Isoforms in Long Read Data") +
  geom_text(size = 4, vjust = -0.3)
Number of isoforms in the long read data (after filtering out those without an Ensembl ID): 8476
Number of genes in the long read data: 5808
Number of genes with more than one isoform in the long read data: 1456
-The majority of genes in the long read data (4352 of 5808, about 75%) have only one isoform
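For reference, a small sketch of how summary counts like the ones above could be derived from a one-row-per-isoform table. `toy_isoforms` is a hypothetical stand-in for `mouse.isoforms.with.gene.name`, and the column `isoform_id` is an assumption; only `ensembl_id` appears in the actual code.

```r
library(dplyr)

# Toy stand-in for the isoform table: one row per isoform (hypothetical IDs).
toy_isoforms <- data.frame(
  isoform_id = paste0("iso", 1:6),
  ensembl_id = c("G1", "G1", "G2", "G3", "G3", "G3"),
  stringsAsFactors = FALSE
)

# Number of isoforms per gene.
per_gene <- toy_isoforms %>% count(ensembl_id, name = "n_isoforms")

n_isoforms_total <- nrow(toy_isoforms)               # total isoforms
n_genes          <- nrow(per_gene)                   # distinct genes
n_multi          <- sum(per_gene$n_isoforms > 1)     # genes with >1 isoform
frac_single      <- mean(per_gene$n_isoforms == 1)   # share of single-isoform genes

# Same share computed from the counts reported in these notes.
reported_frac_single <- (5808 - 1456) / 5808
```

With the reported counts, `reported_frac_single` comes out at roughly 0.75.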
Number of genes in the ribosomal profiling data: 14382
Number of isoforms found in both the ribosomal profiling and long read data: 3450
Number of genes found in both the ribosomal profiling and long read data: 2863
Number of genes with more than one isoform found in both the ribosomal profiling and long read data: 485
# Keep isoforms where num_found equals blockCount (the BED exon count),
# i.e. every exon was found, then repeat the per-gene binning
isoform_count_for_genes_found <- mouse.isoforms.with.gene.name %>%
  filter(num_found == blockCount) %>%
  dplyr::select(ensembl_id) %>%
  group_by(ensembl_id) %>%
  summarise(count = n()) %>%
  mutate(num_of_isoforms = ifelse(count >= 4, ">=4", as.character(count))) %>%
  group_by(num_of_isoforms) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
ggplot(isoform_count_for_genes_found, aes(x = reorder(num_of_isoforms, -count), y = count, label = count)) +
  geom_bar(stat = "identity", width = 0.5) +
  xlab("Number of isoforms per gene") +
  ylab("Gene count") +
  ggtitle("Isoforms Found in Long Read and Ribosomal Profiling Data") +
  geom_text(size = 4, vjust = -0.3)
Note: 26 Ensembl IDs in the long read data are not found in BioMart, so they have no gene symbol. However, some of them are still found in the ribosomal profiling data by the code that I wrote.
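One way to list such IDs, sketched with `dplyr::anti_join`. All three inputs are hypothetical: `long_read_ids` stands in for the long read gene list, `annotation` for a table exported from BioMart, and `ribo_ids` for the Ensembl IDs seen in the ribosomal profiling data. No BioMart query is made here.

```r
library(dplyr)

# Hypothetical Ensembl IDs from the long read data.
long_read_ids <- data.frame(
  ensembl_id = c("G1", "G2", "G3", "G4"),
  stringsAsFactors = FALSE
)

# Hypothetical annotation table (e.g. a saved BioMart export).
annotation <- data.frame(
  ensembl_id  = c("G1", "G3"),
  gene_symbol = c("Actb", "Gapdh"),
  stringsAsFactors = FALSE
)

# Hypothetical IDs present in the ribosomal profiling data.
ribo_ids <- c("G2", "G3")

# IDs with no annotation row, hence no gene symbol.
missing_symbol <- anti_join(long_read_ids, annotation, by = "ensembl_id")

# Of those, the ones that still appear in the ribosomal profiling data.
missing_but_in_ribo <- filter(missing_symbol, ensembl_id %in% ribo_ids)
```

Matching on Ensembl ID rather than gene symbol is what lets unannotated IDs still be compared across the two data sets.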
75 genes with multiple isoforms found in the long read data are conserved in all 5 fish species (75/644, about 12%)
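A rough sketch of how "conserved in all 5 fish species" could be tabulated, assuming a long table with one row per gene and species in which a hit was found. The `hits` table, the species labels, and `query_genes` are hypothetical, and the criterion for calling a hit is not shown in these notes.

```r
library(dplyr)

# Hypothetical hit table: one row per (gene, species) hit; duplicates allowed.
hits <- data.frame(
  ensembl_id = c(rep("G1", 5), rep("G2", 3), "G3"),
  species    = c(paste0("fish", 1:5), "fish1", "fish2", "fish2", "fish4"),
  stringsAsFactors = FALSE
)

# All genes that were searched, including those with no hit at all.
query_genes <- c("G1", "G2", "G3", "G4")
n_species   <- 5

# Genes with a hit in every species.
conserved <- hits %>%
  group_by(ensembl_id) %>%
  summarise(n_sp = n_distinct(species)) %>%
  filter(n_sp == n_species)

# Fraction of searched genes conserved in all species.
frac_conserved <- nrow(conserved) / length(query_genes)
```

The denominator is taken from the query list rather than from `hits`, because genes without any hit never appear in the hit table.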
-BLAST the isoforms found against the 5 fish species, human, and C. elegans