Annotating putative protein-coding genes and long non-coding RNAs from public transcriptomic data

We are interested in identifying the differences in lncRNA annotation between the currently used Ensembl version of the Camelina genome and the newly generated Camelina genome by Sun et al.

Background

Transcript assembly for Illumina sequencing and Nanopore sequencing was performed with StringTie (v2.2.1). Newly assembled transcripts were classified relative to annotated protein-coding genes using Gffcompare (v0.12.2). Transcripts antisense to protein-coding genes were identified using the Gffcompare classification code “x” while intergenic transcripts were identified using the classification code “u”. Antisense and intergenic transcripts were processed through the Coding Potential Calculator 2 (CPC2) (v1.0.1). Non-coding intergenic transcripts (putative long intergenic non-coding RNAs – lincRNAs) were further filtered for other “housekeeping” RNAs (e.g., ribosomal RNA, small nuclear/nucleolar RNAs, etc.) using the RNA families (Rfam) webserver. All transcripts with hits to the Rfam database were removed. The following script walks through the exact methods for filtering these newly assembled transcripts.

Load packages

library(tidyverse)
library(readr)
library(readxl)

Pull out lncRNA classes (we are only interested in the antisense and the intergenic ones)

tmap_old <- read_delim("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/gffcompare_CS.all_merged_CS.gtf.tmap", 
                                                  delim = "\t", escape_double = FALSE, 
                                                  trim_ws = TRUE)

tmap_old %>% 
  filter(class_code == "x") %>%
  distinct(qry_id) %>%
  write_delim("old_gen_antisense_txIDs.txt") #save the antisense transcript IDs

tmap_old %>% 
  filter(class_code == "u") %>%
  distinct(qry_id) %>%
  write_delim("old_gen_intergenic_txIDs.txt") #save the intergenic transcript IDs

These transcripts were used to pull out the transcript information from the genome and genome annotation files in commandline. Essentially, we pulled transcript cooridinates and sequences. The sequences were used as input into CPC2.

all antisense tx
grep -wFf  old_gen_antisense_txIDs.txt ../all_merged_CS.gtf > cs_old_antisense_gffcompare_txIDs.gtf

all intergenic tx
grep -wFf  old_gen_intergenic_txIDs.txt ../all_merged_CS.gtf > cs_old_intergenic_gffcompare_txIDs.gtf

all antisense tx
gffread -w cs_old_antisense_gffcompare_txIDs.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa cs_old_antisense_gffcompare_txIDs.gtf

all intergenic tx
gffread -w cs_old_intergenic_gffcompare_txIDs.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa cs_old_intergenic_gffcompare_txIDs.gtf
# using the fasta files as input for CPC2

## bash: all: command not found
## grep: ../all_merged_CS.gtf: No such file or directory
## bash: line 3: all: command not found
## grep: ../all_merged_CS.gtf: No such file or directory
## bash: line 6: all: command not found
## bash: line 9: all: command not found

Determine coding capacity by filtering based on CPC2 results

old_as <- read_delim("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/CPC2_output_antisense.txt", 
                             delim = "\t", escape_double = FALSE, 
                             trim_ws = TRUE)


old_int <- read_delim("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/CPC2_output_intergenic.txt", 
                 delim = "\t", escape_double = FALSE, 
                 trim_ws = TRUE)



old_as %>%
  filter(label == "noncoding") %>%
  distinct(`#ID`) %>%
  write.table("old_noncoding_antisense_txIDs.txt",
              quote = F, row.names = F, col.names = F) #save non-coding antisense transcripts

old_as %>%
  filter(label == "coding") %>%
  distinct(`#ID`) %>%
  write.table("old_coding_antisense_txIDs.txt",
              quote = F, row.names = F, col.names = F) #save coding antisense transcripts

old_int %>%
  filter(label == "noncoding") %>%
  distinct(`#ID`) %>%
  write.table("old_noncoding_intergenic_txIDs.txt",
              quote = F, row.names = F, col.names = F) -> old_int_non #save the non-coding intergenic and write to a new object 

old_int %>%
  filter(label == "coding") %>%
  distinct(`#ID`) %>%
  write.table("old_coding_intergenic_txIDs.txt",
              quote = F, row.names = F, col.names = F) #save coding intergenic transcripts

Pull sequences for each class of transcripts

grep -wFf  old_noncoding_intergenic_txIDs.txt ../../all_merged_CS.gtf > cs_old_noncoding_intergenic_txIDs.gtf

grep -wFf  old_coding_intergenic_txIDs.txt ../../all_merged_CS.gtf > cs_old_coding_intergenic_txIDs.gtf

gffread -w cs_old_noncoding_intergenic_txIDs.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa cs_old_noncoding_intergenic_txIDs.gtf

gffread -w cs_old_coding_intergenic_txIDs.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa cs_old_coding_intergenic_txIDs.gtf

## grep: ../../all_merged_CS.gtf: No such file or directory
## grep: ../../all_merged_CS.gtf: No such file or directory

Now we want to further filter the non-coding intergenic transcripts. We used the Rfam database and resources that are available on the web. We initally identified 9158 non-coding intergenic transcripts, that number is far more than what the Rfam database can handle in one batch search. So, the non-coding intergenic transcripts were split up into 10 batches and submitted to Rfam database. All transcripts that hit at all were filtered out.

#Pt 1. 
sed -n '1,900p;901q'  old_noncoding_intergenic_txIDs.txt > old_noncoding_intergenic_txIDs_pt1.txt

grep -wFf  old_noncoding_intergenic_txIDs_pt1.txt ../../all_merged_CS.gtf > old_noncoding_intergenic_txIDs_pt1.gtf

gffread -w old_noncoding_intergenic_txIDs_pt1.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_noncoding_intergenic_txIDs_pt1.gtf

#Pt 2. 
sed -n '902,1802p;1803q' old_noncoding_intergenic_txIDs.txt > old_noncoding_intergenic_txIDs_pt2.txt

grep -wFf  old_noncoding_intergenic_txIDs_pt2.txt ../../all_merged_CS.gtf > old_noncoding_intergenic_txIDs_pt2.gtf

gffread -w old_noncoding_intergenic_txIDs_pt2.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_noncoding_intergenic_txIDs_pt2.gtf

#Pt 3. 
sed -n '1803,2703p;2704q' old_noncoding_intergenic_txIDs.txt > old_noncoding_intergenic_txIDs_pt3.txt

grep -wFf  old_noncoding_intergenic_txIDs_pt3.txt ../../all_merged_CS.gtf > old_noncoding_intergenic_txIDs_pt3.gtf

gffread -w old_noncoding_intergenic_txIDs_pt3.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_noncoding_intergenic_txIDs_pt3.gtf

#Pt 4. 
sed -n '2704,3604p;3605q' old_noncoding_intergenic_txIDs.txt > old_noncoding_intergenic_txIDs_pt4.txt

grep -wFf  old_noncoding_intergenic_txIDs_pt4.txt ../../all_merged_CS.gtf > old_noncoding_intergenic_txIDs_pt4.gtf

gffread -w old_noncoding_intergenic_txIDs_pt4.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_noncoding_intergenic_txIDs_pt4.gtf

#Pt 5. 
sed -n '3605,4505p;4506q' old_noncoding_intergenic_txIDs.txt > old_noncoding_intergenic_txIDs_pt5.txt
 
grep -wFf  old_noncoding_intergenic_txIDs_pt5.txt ../../all_merged_CS.gtf > old_noncoding_intergenic_txIDs_pt5.gtf
 
gffread -w old_noncoding_intergenic_txIDs_pt5.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_noncoding_intergenic_txIDs_pt5.gtf

#Pt 6. 
sed -n '4506,5406p;5407q' old_noncoding_intergenic_txIDs.txt > old_noncoding_intergenic_txIDs_pt6.txt
 
grep -wFf  old_noncoding_intergenic_txIDs_pt6.txt ../../all_merged_CS.gtf > old_noncoding_intergenic_txIDs_pt6.gtf
 
gffread -w old_noncoding_intergenic_txIDs_pt6.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_noncoding_intergenic_txIDs_pt6.gtf

#Pt 7. 
sed -n '5407,6307p;6308q' old_noncoding_intergenic_txIDs.txt > old_noncoding_intergenic_txIDs_pt7.txt

grep -wFf  old_noncoding_intergenic_txIDs_pt7.txt ../../all_merged_CS.gtf > old_noncoding_intergenic_txIDs_pt7.gtf
 
gffread -w old_noncoding_intergenic_txIDs_pt7.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_noncoding_intergenic_txIDs_pt7.gtf

#Pt 8. 
sed -n '6308,7208p;7209q' old_noncoding_intergenic_txIDs.txt > old_noncoding_intergenic_txIDs_pt8.txt

grep -wFf  old_noncoding_intergenic_txIDs_pt8.txt ../../all_merged_CS.gtf > old_noncoding_intergenic_txIDs_pt8.gtf
 
gffread -w old_noncoding_intergenic_txIDs_pt8.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_noncoding_intergenic_txIDs_pt8.gtf

#Pt 9. 
sed -n '7209,8109p;8110q' old_noncoding_intergenic_txIDs.txt > old_noncoding_intergenic_txIDs_pt9.txt

grep -wFf  old_noncoding_intergenic_txIDs_pt9.txt ../../all_merged_CS.gtf > old_noncoding_intergenic_txIDs_pt9.gtf
 
gffread -w old_noncoding_intergenic_txIDs_pt9.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_noncoding_intergenic_txIDs_pt9.gtf

#Pt 10. 
sed -n '8110,9158p;9159q' old_noncoding_intergenic_txIDs.txt > old_noncoding_intergenic_txIDs_pt10.txt

grep -wFf  old_noncoding_intergenic_txIDs_pt10.txt ../../all_merged_CS.gtf > old_noncoding_intergenic_txIDs_pt10.gtf
 
gffread -w old_noncoding_intergenic_txIDs_pt10.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_noncoding_intergenic_txIDs_pt10.gtf

## grep: ../../all_merged_CS.gtf: No such file or directory
## grep: ../../all_merged_CS.gtf: No such file or directory
## grep: ../../all_merged_CS.gtf: No such file or directory
## grep: ../../all_merged_CS.gtf: No such file or directory
## grep: ../../all_merged_CS.gtf: No such file or directory
## grep: ../../all_merged_CS.gtf: No such file or directory
## grep: ../../all_merged_CS.gtf: No such file or directory
## grep: ../../all_merged_CS.gtf: No such file or directory
## grep: ../../all_merged_CS.gtf: No such file or directory
## grep: ../../all_merged_CS.gtf: No such file or directory

Filter out any non-coding intergenic transcripts that hit within the rfam database (prior to this manual delimiting of columns was performed on the pfam outputs)

#Pt. 1
pt1_rfam_hits <- read_excel("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/rfam_hits/pt1_rfam_hits.xlsx", 
                  sheet = "Sheet2")

pt1_rfam_hits_tx_names <- pt1_rfam_hits$accession...3 %>% 
  as.data.frame() %>%
  distinct() #extract the column with the transcripts that hit within the database

pt1_rfam_hits_tx_names <- pt1_rfam_hits_tx_names[startsWith(pt1_rfam_hits_tx_names$., "MSTRG"),] %>% 
  as.data.frame() %>%
  na.omit() %>%
  setNames("rfam_hits") #clean up the dataframe and give the column a more informative name

#Do this for the other 8 batches (batch 8 did not have any rfam hits so there are only have 9 other batches to do)

#Pt. 2
pt2_rfam_hits <- read_excel("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/rfam_hits/pt2_rfam_hits.xlsx", 
                  sheet = "Sheet2")

pt2_rfam_hits_tx_names <- pt2_rfam_hits$accession...3 %>% 
  as.data.frame() %>%
  distinct()

pt2_rfam_hits_tx_names <- pt2_rfam_hits_tx_names[startsWith(pt2_rfam_hits_tx_names$., "MSTRG"),] %>% 
  as.data.frame() %>%
  na.omit() %>%
  setNames("rfam_hits")

#Pt. 3
pt3_rfam_hits <- read_excel("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/rfam_hits/pt3_rfam_hits.xlsx", 
                  sheet = "Sheet2")

pt3_rfam_hits_tx_names <- pt3_rfam_hits$accession...3 %>% 
  as.data.frame() %>%
  distinct()

pt3_rfam_hits_tx_names <- pt3_rfam_hits_tx_names[startsWith(pt3_rfam_hits_tx_names$., "MSTRG"),] %>% 
  as.data.frame() %>%
  na.omit() %>%
  setNames("rfam_hits")

#Pt. 4
pt4_rfam_hits <- read_excel("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/rfam_hits/pt4_rfam_hits.xlsx", 
                  sheet = "Sheet2")

pt4_rfam_hits_tx_names <- pt4_rfam_hits$accession...3 %>% 
  as.data.frame() %>%
  distinct()

pt4_rfam_hits_tx_names <- pt4_rfam_hits_tx_names[startsWith(pt4_rfam_hits_tx_names$., "MSTRG"),] %>% 
  as.data.frame() %>%
  na.omit() %>%
  setNames("rfam_hits")

#Pt. 5
pt5_rfam_hits <- read_excel("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/rfam_hits/pt5_rfam_hits.xlsx", 
                  sheet = "Sheet2")

pt5_rfam_hits_tx_names <- pt5_rfam_hits$accession...3 %>% 
  as.data.frame() %>%
  distinct()

pt5_rfam_hits_tx_names <- pt5_rfam_hits_tx_names[startsWith(pt5_rfam_hits_tx_names$., "MSTRG"),] %>% 
  as.data.frame() %>%
  na.omit() %>%
  setNames("rfam_hits")

#Pt. 6
pt6_rfam_hits <- read_excel("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/rfam_hits/pt6_rfam_hits.xlsx", 
                  sheet = "Sheet2")

pt6_rfam_hits_tx_names <- pt6_rfam_hits$accession...3 %>% 
  as.data.frame() %>%
  distinct()

pt6_rfam_hits_tx_names <- pt6_rfam_hits_tx_names[startsWith(pt6_rfam_hits_tx_names$., "MSTRG"),] %>% 
  as.data.frame() %>%
  na.omit() %>%
  setNames("rfam_hits")

#Pt. 7
pt7_rfam_hits <- read_excel("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/rfam_hits/pt7_rfam_hits.xlsx", 
                  sheet = "Sheet2")

pt7_rfam_hits_tx_names <- pt7_rfam_hits$accession...3 %>% 
  as.data.frame() %>%
  distinct()

pt7_rfam_hits_tx_names <- pt7_rfam_hits_tx_names[startsWith(pt7_rfam_hits_tx_names$., "MSTRG"),] %>% 
  as.data.frame() %>%
  na.omit() %>%
  setNames("rfam_hits")

#Pt. 9
pt9_rfam_hits <- read_excel("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/rfam_hits/pt9_rfam_hits.xlsx", 
                  sheet = "Sheet2")

pt9_rfam_hits_tx_names <- pt9_rfam_hits$accession...3 %>% 
  as.data.frame() %>%
  distinct()

pt9_rfam_hits_tx_names <- pt9_rfam_hits_tx_names[startsWith(pt9_rfam_hits_tx_names$., "MSTRG"),] %>% 
  as.data.frame() %>%
  na.omit() %>%
  setNames("rfam_hits")

#Pt. 10
pt10_rfam_hits <- read_excel("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/rfam_hits/pt10_rfam_hits.xlsx", 
                  sheet = "Sheet1")

pt10_rfam_hits_tx_names <- pt10_rfam_hits$accession...3 %>% 
  as.data.frame() %>%
  distinct()

pt10_rfam_hits_tx_names <- pt10_rfam_hits_tx_names[startsWith(pt10_rfam_hits_tx_names$., "MSTRG"),] %>% 
  as.data.frame() %>%
  na.omit() %>%
  setNames("rfam_hits")

#combine all of the transcripts that hit within the rfam database

all_hits <- rbind (pt1_rfam_hits_tx_names,
                   pt2_rfam_hits_tx_names,
                   pt3_rfam_hits_tx_names,
                   pt4_rfam_hits_tx_names,
                   pt5_rfam_hits_tx_names,
                   pt6_rfam_hits_tx_names,
                   pt7_rfam_hits_tx_names,
                   pt9_rfam_hits_tx_names,
                   pt10_rfam_hits_tx_names) #combine the rfam hits into one dataframe


write_delim(all_hits, "/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/rfam_hits/all_hits.txt") #save this

241 transcripts hit within the rfam database. So, lets pull those out of our final numbers for lincRNAs

Filter out the rfam hits

non_coding_linc <- read_csv("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/old_noncoding_intergenic_txIDs.txt", 
    col_names = FALSE)

all_hits <- read_csv("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/rfam_hits/all_hits.txt")

dplyr::anti_join(non_coding_linc,
          all_hits,
          by = c("X1" = "rfam_hits")) -> filt_non_coding_linc #only keep the rows in the non_coding_linc dataframe that does NOT have a partner in the the all_hits dataframe


write_delim(filt_non_coding_linc, "/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/rfam_hits/filt_non_coding_linc_tx_names.txt",
            col_names = FALSE) #save our final list of non-coding lincRNAs, 8917 transcripts

Generate final “old” annotations for putative protein-coding genes and long non-coding RNAs

# working directory /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos

# coding antisense
grep -wFf /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/old_coding_antisense_txIDs.txt /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/all_merged_CS.gtf > old_coding_antisense_txIDs.gtf
 
gffread -w old_coding_antisense_txIDs.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_coding_antisense_txIDs.gtf

# non-coding antisense
grep -wFf /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/old_noncoding_antisense_txIDs.txt /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/all_merged_CS.gtf > old_noncoding_antisense_txIDs.gtf
 
gffread -w old_noncoding_antisense_txIDs.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_noncoding_antisense_txIDs.gtf

# coding intergenic

grep -wFf /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/old_coding_intergenic_txIDs.txt /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/all_merged_CS.gtf > cs_old_coding_intergenic_txIDs.gtf #generated above

gffread -w cs_old_coding_intergenic_txIDs.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa cs_old_coding_intergenic_txIDs.gtf # generated above

#cs_old_coding_intergenic_txIDs.fa -> generated above = copied and renamed as old_coding_intergenic_txIDs.fa
#cs_old_coding_intergenic_txIDs.gtf -> generated above = copied and renamed as old_coding_intergenic_txIDs.gtf

# non-coding intergenic

grep -wFf /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/filt_non_coding_linc_tx_names.txt /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/all_merged_CS.gtf > old_noncoding_intergenic_txIDs.gtf
 
gffread -w old_noncoding_intergenic_txIDs.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_noncoding_intergenic_txIDs.gtf

## Generate our master annotation files for the old and new putative protein-coding genes and long non-coding RNAs
cat /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/old_coding_antisense_txIDs.gtf /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/old_noncoding_antisense_txIDs.gtf > all_old_antisense.gtf
cat /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/old_coding_intergenic_txIDs.gtf /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/old_noncoding_intergenic_txIDs.gtf > all_old_intergenic.gtf

cat /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/new_coding_antisense.gtf /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/new_noncoding_antisense.gtf > all_new_antisense.gtf
cat /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/new_coding_intergenic.gtf /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/new_noncoding_intergenic.gtf > all_new_intergenic.gtf
# all of the new annotations were generated by Kyle

## grep: /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/old_coding_intergenic_txIDs.txt: No such file or directory

Now we employ Liftoff

(https://github.com/agshumate/Liftoff) To help us determine which transcripts are unique to each annotation. For input we will use the the new putative protein-coding genes and long non-coding RNAs annotations and existing (old) and newly generated (new) genome sequence files.

Run liftoff - using default settings and supplying a text file with features we want surveyed in this case we want transcripts

Present in the old but not in the new

/home/crailey/liftoff
-g /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/liftoff/all_new_seq.gtf
-f /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/liftoff/transcript
-o /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/liftoff/comp_out_lift_from_new.gff
-u /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/liftoff/unmapped_features_lift_from_new.txt
/mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa
/mnt/Ninelives/genomes/Camelina/CsSunv4.fasta

#96 transcripts that are present in the new but not the old (cat/wc -l unmapped_features_lift_from_new.txt)

Present in the new but not the old

/home/crailey/liftoff
-g /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/liftoff/all_old_seq.gtf
-f /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/liftoff/transcript
-o /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/liftoff/comp_out_lift_from_old.gff
-u /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/liftoff/unmapped_features_lift_from_old.txt
/mnt/Ninelives/genomes/Camelina/CsSunv4.fasta
/mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa

#200 transcripts are present in the old but not in the new (cat/wc -l unmapped_features_lift_from_old.txt)