In recent versions of GENCODE (including Release 49), long non-coding RNA biotypes such as lincRNA and antisense have been consolidated into a single generic lncRNA gene type.
To determine the specific biotype of these transcripts, we can no longer rely on text labels. Instead, we must computationally re-derive their identities based on their physical genomic coordinates relative to protein-coding genes.
Objective: Use interval mathematics to classify generic GENCODE lncRNA transcripts into:
1. lincRNAs (Intergenic): Transcripts with zero physical overlap with any protein-coding gene.
2. Antisense lncRNAs: Transcripts that physically overlap a protein-coding gene but are transcribed from the opposite DNA strand.
First, we load the required Bioconductor packages to handle genomic intervals and GTF parsing.
# Load Bioconductor packages for genomic arithmetic
library(rtracklayer)
library(GenomicRanges)
We import the Comprehensive Gene Annotation (CHR) GTF file from GENCODE Release 49. We then filter the comprehensive annotation to isolate only the gene-level features, separating them into protein_coding and lncRNA subsets.
# Load the GENCODE Release 49 annotation from a previously saved RDS file
# (the path is relative to your working directory)
gtf <- readRDS("../data/processed/gencode_v49.rds")
# Keep only primary gene annotations to avoid redundant transcript counting
genes <- gtf[gtf$type == "gene"]
# Separate into protein-coding and lncRNA sets
pc_genes <- genes[genes$gene_type == "protein_coding"]
lncrnas <- genes[genes$gene_type == "lncRNA"]
# Output initial counts
cat("Total Protein-Coding Genes:", length(pc_genes), "\n")
Total Protein-Coding Genes: 20097
cat("Total Generic lncRNAs:", length(lncrnas), "\n")
Total Generic lncRNAs: 34880
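The loading and filtering steps can be tried on a toy file before running them on the full annotation. The sketch below is a minimal illustration, assuming the annotation object was originally produced with rtracklayer's import(); the three-line GTF, the TOY source label, and the toy_* names are hypothetical and are not taken from GENCODE.

```r
library(rtracklayer)
library(GenomicRanges)

# Hypothetical three-feature GTF (toy data, not GENCODE), written to a temp file
gtf_lines <- c(
  paste("chr1", "TOY", "gene", 100, 500, ".", "+", ".",
        'gene_id "G1"; gene_type "protein_coding";', sep = "\t"),
  paste("chr1", "TOY", "transcript", 100, 500, ".", "+", ".",
        'gene_id "G1"; transcript_id "T1"; gene_type "protein_coding";', sep = "\t"),
  paste("chr1", "TOY", "gene", 900, 1200, ".", "-", ".",
        'gene_id "G2"; gene_type "lncRNA";', sep = "\t")
)
toy_path <- tempfile(fileext = ".gtf")
writeLines(gtf_lines, toy_path)

# Parse the GTF into a GRanges object; attributes become metadata columns
toy_gtf <- import(toy_path, format = "gtf")

# Keep gene-level rows only, then split by gene_type as in the main workflow
toy_genes <- toy_gtf[toy_gtf$type == "gene"]
toy_pc    <- toy_genes[toy_genes$gene_type == "protein_coding"]
toy_lnc   <- toy_genes[toy_genes$gene_type == "lncRNA"]

cat("Toy gene-level features:", length(toy_genes), "\n")
```

On the real data, import() would be pointed at the downloaded GENCODE GTF instead of the temporary file, and the result could be cached with saveRDS() so that later sessions can use readRDS().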
lincRNAs exist in the intergenic regions of the genome. We define them computationally by finding all lncRNAs that have zero overlap with any protein-coding gene, regardless of the strand.
# Find any overlaps between lncRNAs and protein-coding genes (ignoring strand)
overlaps <- findOverlaps(lncrnas, pc_genes, ignore.strand = TRUE)
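# Keep only the lncRNAs with no overlap. setdiff() stays correct even when there
# are no hits, whereas lncrnas[-queryHits(overlaps)] would return an empty set
lincRNAs <- lncrnas[setdiff(seq_along(lncrnas), queryHits(overlaps))]
cat("Identified true lincRNAs:", length(lincRNAs), "\n")
Identified true lincRNAs: 20488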
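The intergenic filter is easy to verify on a handful of ranges. The following is a small sketch on hypothetical coordinates (the toy_* objects and lncA to lncD names are invented for illustration). It also shows why indexing with a negated, empty hit vector deserves care.

```r
library(GenomicRanges)

# Toy coordinates (hypothetical, not taken from GENCODE)
toy_pc <- GRanges("chr1",
                  IRanges(start = c(100, 1000), end = c(500, 1500)),
                  strand = c("+", "-"))
toy_lnc <- GRanges("chr1",
                   IRanges(start = c(400, 600, 1200, 2000),
                           end   = c(550, 800, 1300, 2500)),
                   strand = c("-", "+", "-", "+"))
names(toy_lnc) <- c("lncA", "lncB", "lncC", "lncD")

# Strand-agnostic overlaps: lncA hits the first gene, lncC hits the second
toy_hits <- findOverlaps(toy_lnc, toy_pc, ignore.strand = TRUE)

# Intergenic = query indices that never appear among the hits
toy_linc <- toy_lnc[setdiff(seq_along(toy_lnc), queryHits(toy_hits))]

# Equivalent one-step form using overlapsAny()
toy_linc_alt <- toy_lnc[!overlapsAny(toy_lnc, toy_pc, ignore.strand = TRUE)]

# Pitfall: negating an empty index selects nothing rather than everything
no_hits <- integer(0)
length(toy_lnc[-no_hits])

print(names(toy_linc))
```

Here lncB and lncD are the only ranges that touch no protein-coding gene on either strand, so both forms of the filter return those two.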
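Antisense lncRNAs share genomic coordinates with a protein-coding gene but lie on the opposite strand. We reuse the strand-agnostic overlaps computed in the previous step and keep only the pairs in which the lncRNA strand differs from the strand of the protein-coding gene. An lncRNA counts as antisense if at least one of its overlaps is on the opposite strand.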
# Extract strand information for the overlapping pairs identified earlier
lncrna_strands <- strand(lncrnas)[queryHits(overlaps)]
pc_strands <- strand(pc_genes)[subjectHits(overlaps)]
# Create a logical vector: TRUE if strands are opposite
is_antisense <- lncrna_strands != pc_strands
# Apply the filter to isolate antisense overlap events
antisense_hits <- overlaps[is_antisense]
# Extract the unique lncRNAs from these hits
# (Using unique() because one lncRNA might span multiple coding genes)
antisense_lncRNAs <- lncrnas[unique(queryHits(antisense_hits))]
cat("Identified Antisense lncRNAs:", length(antisense_lncRNAs), "\n")
Identified Antisense lncRNAs: 11912
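A toy case makes the strand comparison and the role of unique() concrete. This is a sketch on hypothetical coordinates (the toy_* objects and lncA to lncD names are invented), in which one lncRNA is antisense to two genes and another overlaps genes on both strands.

```r
library(GenomicRanges)

# Toy coordinates (hypothetical, not taken from GENCODE)
toy_pc <- GRanges("chr1",
                  IRanges(start = c(100, 520, 1000), end = c(500, 560, 1500)),
                  strand = c("+", "+", "-"))
toy_lnc <- GRanges("chr1",
                   IRanges(start = c(400, 600, 1200, 450),
                           end   = c(550, 800, 1300, 1100)),
                   strand = c("-", "+", "-", "+"))
names(toy_lnc) <- c("lncA", "lncB", "lncC", "lncD")

toy_hits <- findOverlaps(toy_lnc, toy_pc, ignore.strand = TRUE)

# Compare strands pair by pair; as.character() gives plain vectors
q_strand <- as.character(strand(toy_lnc))[queryHits(toy_hits)]
s_strand <- as.character(strand(toy_pc))[subjectHits(toy_hits)]
opposite <- q_strand != s_strand

# lncA has two antisense hits, so unique() is needed to count it once
anti_idx       <- unique(queryHits(toy_hits)[opposite])
toy_antisense  <- toy_lnc[anti_idx]

# Overlapping lncRNAs with no opposite-strand partner at all
toy_sense_only <- toy_lnc[setdiff(queryHits(toy_hits), anti_idx)]

print(names(toy_antisense))
print(names(toy_sense_only))
```

With the counts reported above, 34880 - 20488 - 11912 = 2480 lncRNA genes fall into neither set. Under these definitions they overlap protein-coding genes only on the same strand, like lncC in the toy example.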
Using GenomicRanges, we worked around the biotype consolidation in GENCODE Release 49 by classifying lncRNA genes strictly from their coordinates and strand relative to protein-coding genes. This yields coordinate-defined sets of intergenic and antisense lncRNAs for downstream analysis. The sets reflect the overlap rules used here and may not match the labels assigned in earlier GENCODE releases.