We are interested in identifying the differences in lncRNA annotation between the currently used Ensembl version of the Camelina genome and the newly generated Camelina genome by Sun et al.
Background
Transcript assembly for Illumina sequencing and Nanopore sequencing was performed with StringTie (v2.2.1). Newly assembled transcripts were classified relative to annotated protein-coding genes using Gffcompare (v0.12.2). Transcripts antisense to protein-coding genes were identified using the Gffcompare classification code “x” while intergenic transcripts were identified using the classification code “u”. Antisense and intergenic transcripts were processed through the Coding Potential Calculator 2 (CPC2) (v1.0.1). Non-coding intergenic transcripts (putative long intergenic non-coding RNAs – lincRNAs) were further filtered for other “housekeeping” RNAs (e.g., ribosomal RNA, small nuclear/nucleolar RNAs, etc.) using the RNA families (Rfam) webserver. All transcripts with hits to the Rfam database were removed. The following script walks through the exact methods for filtering these newly assembled transcripts.
Load packages
library(tidyverse)
library(readr)
library(readxl)Pull out lncRNA classes (we are only interested in the antisense and the intergenic ones)
tmap_old <- read_delim("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/gffcompare_CS.all_merged_CS.gtf.tmap",
delim = "\t", escape_double = FALSE,
trim_ws = TRUE)
tmap_old %>%
filter(class_code == "x") %>%
distinct(qry_id) %>%
write_delim("old_gen_antisense_txIDs.txt") #save the antisense transcript IDs
tmap_old %>%
filter(class_code == "u") %>%
distinct(qry_id) %>%
write_delim("old_gen_intergenic_txIDs.txt") #save the intergenic transcript IDsThese transcripts were used to pull out the transcript information from the genome and genome annotation files in commandline. Essentially, we pulled transcript cooridinates and sequences. The sequences were used as input into CPC2.
all antisense tx
grep -wFf old_gen_antisense_txIDs.txt ../all_merged_CS.gtf > cs_old_antisense_gffcompare_txIDs.gtf
all intergenic tx
grep -wFf old_gen_intergenic_txIDs.txt ../all_merged_CS.gtf > cs_old_intergenic_gffcompare_txIDs.gtf
all antisense tx
gffread -w cs_old_antisense_gffcompare_txIDs.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa cs_old_antisense_gffcompare_txIDs.gtf
all intergenic tx
gffread -w cs_old_intergenic_gffcompare_txIDs.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa cs_old_intergenic_gffcompare_txIDs.gtf
# using the fasta files as input for CPC2## bash: all: command not found
## grep: ../all_merged_CS.gtf: No such file or directory
## bash: line 3: all: command not found
## grep: ../all_merged_CS.gtf: No such file or directory
## bash: line 6: all: command not found
## bash: line 9: all: command not found
Determine coding capacity by filtering based on CPC2 results
old_as <- read_delim("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/CPC2_output_antisense.txt",
delim = "\t", escape_double = FALSE,
trim_ws = TRUE)
old_int <- read_delim("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/CPC2_output_intergenic.txt",
delim = "\t", escape_double = FALSE,
trim_ws = TRUE)
old_as %>%
filter(label == "noncoding") %>%
distinct(`#ID`) %>%
write.table("old_noncoding_antisense_txIDs.txt",
quote = F, row.names = F, col.names = F) #save non-coding antisense transcripts
old_as %>%
filter(label == "coding") %>%
distinct(`#ID`) %>%
write.table("old_coding_antisense_txIDs.txt",
quote = F, row.names = F, col.names = F) #save coding antisense transcripts
old_int %>%
filter(label == "noncoding") %>%
distinct(`#ID`) %>%
write.table("old_noncoding_intergenic_txIDs.txt",
quote = F, row.names = F, col.names = F) -> old_int_non #save the non-coding intergenic and write to a new object
old_int %>%
filter(label == "coding") %>%
distinct(`#ID`) %>%
write.table("old_coding_intergenic_txIDs.txt",
quote = F, row.names = F, col.names = F) #save coding intergenic transcriptsPull sequences for each class of transcripts
grep -wFf old_noncoding_intergenic_txIDs.txt ../../all_merged_CS.gtf > cs_old_noncoding_intergenic_txIDs.gtf
grep -wFf old_coding_intergenic_txIDs.txt ../../all_merged_CS.gtf > cs_old_coding_intergenic_txIDs.gtf
gffread -w cs_old_noncoding_intergenic_txIDs.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa cs_old_noncoding_intergenic_txIDs.gtf
gffread -w cs_old_coding_intergenic_txIDs.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa cs_old_coding_intergenic_txIDs.gtf## grep: ../../all_merged_CS.gtf: No such file or directory
## grep: ../../all_merged_CS.gtf: No such file or directory
Now we want to further filter the non-coding intergenic transcripts. We used the Rfam database and resources that are available on the web. We initally identified 9158 non-coding intergenic transcripts, that number is far more than what the Rfam database can handle in one batch search. So, the non-coding intergenic transcripts were split up into 10 batches and submitted to Rfam database. All transcripts that hit at all were filtered out.
#Pt 1.
sed -n '1,900p;901q' old_noncoding_intergenic_txIDs.txt > old_noncoding_intergenic_txIDs_pt1.txt
grep -wFf old_noncoding_intergenic_txIDs_pt1.txt ../../all_merged_CS.gtf > old_noncoding_intergenic_txIDs_pt1.gtf
gffread -w old_noncoding_intergenic_txIDs_pt1.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_noncoding_intergenic_txIDs_pt1.gtf
#Pt 2.
sed -n '902,1802p;1803q' old_noncoding_intergenic_txIDs.txt > old_noncoding_intergenic_txIDs_pt2.txt
grep -wFf old_noncoding_intergenic_txIDs_pt2.txt ../../all_merged_CS.gtf > old_noncoding_intergenic_txIDs_pt2.gtf
gffread -w old_noncoding_intergenic_txIDs_pt2.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_noncoding_intergenic_txIDs_pt2.gtf
#Pt 3.
sed -n '1803,2703p;2704q' old_noncoding_intergenic_txIDs.txt > old_noncoding_intergenic_txIDs_pt3.txt
grep -wFf old_noncoding_intergenic_txIDs_pt3.txt ../../all_merged_CS.gtf > old_noncoding_intergenic_txIDs_pt3.gtf
gffread -w old_noncoding_intergenic_txIDs_pt3.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_noncoding_intergenic_txIDs_pt3.gtf
#Pt 4.
sed -n '2704,3604p;3605q' old_noncoding_intergenic_txIDs.txt > old_noncoding_intergenic_txIDs_pt4.txt
grep -wFf old_noncoding_intergenic_txIDs_pt4.txt ../../all_merged_CS.gtf > old_noncoding_intergenic_txIDs_pt4.gtf
gffread -w old_noncoding_intergenic_txIDs_pt4.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_noncoding_intergenic_txIDs_pt4.gtf
#Pt 5.
sed -n '3605,4505p;4506q' old_noncoding_intergenic_txIDs.txt > old_noncoding_intergenic_txIDs_pt5.txt
grep -wFf old_noncoding_intergenic_txIDs_pt5.txt ../../all_merged_CS.gtf > old_noncoding_intergenic_txIDs_pt5.gtf
gffread -w old_noncoding_intergenic_txIDs_pt5.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_noncoding_intergenic_txIDs_pt5.gtf
#Pt 6.
sed -n '4506,5406p;5407q' old_noncoding_intergenic_txIDs.txt > old_noncoding_intergenic_txIDs_pt6.txt
grep -wFf old_noncoding_intergenic_txIDs_pt6.txt ../../all_merged_CS.gtf > old_noncoding_intergenic_txIDs_pt6.gtf
gffread -w old_noncoding_intergenic_txIDs_pt6.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_noncoding_intergenic_txIDs_pt6.gtf
#Pt 7.
sed -n '5407,6307p;6308q' old_noncoding_intergenic_txIDs.txt > old_noncoding_intergenic_txIDs_pt7.txt
grep -wFf old_noncoding_intergenic_txIDs_pt7.txt ../../all_merged_CS.gtf > old_noncoding_intergenic_txIDs_pt7.gtf
gffread -w old_noncoding_intergenic_txIDs_pt7.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_noncoding_intergenic_txIDs_pt7.gtf
#Pt 8.
sed -n '6308,7208p;7209q' old_noncoding_intergenic_txIDs.txt > old_noncoding_intergenic_txIDs_pt8.txt
grep -wFf old_noncoding_intergenic_txIDs_pt8.txt ../../all_merged_CS.gtf > old_noncoding_intergenic_txIDs_pt8.gtf
gffread -w old_noncoding_intergenic_txIDs_pt8.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_noncoding_intergenic_txIDs_pt8.gtf
#Pt 9.
sed -n '7209,8109p;8110q' old_noncoding_intergenic_txIDs.txt > old_noncoding_intergenic_txIDs_pt9.txt
grep -wFf old_noncoding_intergenic_txIDs_pt9.txt ../../all_merged_CS.gtf > old_noncoding_intergenic_txIDs_pt9.gtf
gffread -w old_noncoding_intergenic_txIDs_pt9.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_noncoding_intergenic_txIDs_pt9.gtf
#Pt 10.
sed -n '8110,9158p;9159q' old_noncoding_intergenic_txIDs.txt > old_noncoding_intergenic_txIDs_pt10.txt
grep -wFf old_noncoding_intergenic_txIDs_pt10.txt ../../all_merged_CS.gtf > old_noncoding_intergenic_txIDs_pt10.gtf
gffread -w old_noncoding_intergenic_txIDs_pt10.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_noncoding_intergenic_txIDs_pt10.gtf## grep: ../../all_merged_CS.gtf: No such file or directory
## grep: ../../all_merged_CS.gtf: No such file or directory
## grep: ../../all_merged_CS.gtf: No such file or directory
## grep: ../../all_merged_CS.gtf: No such file or directory
## grep: ../../all_merged_CS.gtf: No such file or directory
## grep: ../../all_merged_CS.gtf: No such file or directory
## grep: ../../all_merged_CS.gtf: No such file or directory
## grep: ../../all_merged_CS.gtf: No such file or directory
## grep: ../../all_merged_CS.gtf: No such file or directory
## grep: ../../all_merged_CS.gtf: No such file or directory
Filter out any non-coding intergenic transcripts that hit within the rfam database (prior to this manual delimiting of columns was performed on the pfam outputs)
#Pt. 1
pt1_rfam_hits <- read_excel("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/rfam_hits/pt1_rfam_hits.xlsx",
sheet = "Sheet2")
pt1_rfam_hits_tx_names <- pt1_rfam_hits$accession...3 %>%
as.data.frame() %>%
distinct() #extract the column with the transcripts that hit within the database
pt1_rfam_hits_tx_names <- pt1_rfam_hits_tx_names[startsWith(pt1_rfam_hits_tx_names$., "MSTRG"),] %>%
as.data.frame() %>%
na.omit() %>%
setNames("rfam_hits") #clean up the dataframe and give the column a more informative name
#Do this for the other 8 batches (batch 8 did not have any rfam hits so there are only have 9 other batches to do)
#Pt. 2
pt2_rfam_hits <- read_excel("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/rfam_hits/pt2_rfam_hits.xlsx",
sheet = "Sheet2")
pt2_rfam_hits_tx_names <- pt2_rfam_hits$accession...3 %>%
as.data.frame() %>%
distinct()
pt2_rfam_hits_tx_names <- pt2_rfam_hits_tx_names[startsWith(pt2_rfam_hits_tx_names$., "MSTRG"),] %>%
as.data.frame() %>%
na.omit() %>%
setNames("rfam_hits")
#Pt. 3
pt3_rfam_hits <- read_excel("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/rfam_hits/pt3_rfam_hits.xlsx",
sheet = "Sheet2")
pt3_rfam_hits_tx_names <- pt3_rfam_hits$accession...3 %>%
as.data.frame() %>%
distinct()
pt3_rfam_hits_tx_names <- pt3_rfam_hits_tx_names[startsWith(pt3_rfam_hits_tx_names$., "MSTRG"),] %>%
as.data.frame() %>%
na.omit() %>%
setNames("rfam_hits")
#Pt. 4
pt4_rfam_hits <- read_excel("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/rfam_hits/pt4_rfam_hits.xlsx",
sheet = "Sheet2")
pt4_rfam_hits_tx_names <- pt4_rfam_hits$accession...3 %>%
as.data.frame() %>%
distinct()
pt4_rfam_hits_tx_names <- pt4_rfam_hits_tx_names[startsWith(pt4_rfam_hits_tx_names$., "MSTRG"),] %>%
as.data.frame() %>%
na.omit() %>%
setNames("rfam_hits")
#Pt. 5
pt5_rfam_hits <- read_excel("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/rfam_hits/pt5_rfam_hits.xlsx",
sheet = "Sheet2")
pt5_rfam_hits_tx_names <- pt5_rfam_hits$accession...3 %>%
as.data.frame() %>%
distinct()
pt5_rfam_hits_tx_names <- pt5_rfam_hits_tx_names[startsWith(pt5_rfam_hits_tx_names$., "MSTRG"),] %>%
as.data.frame() %>%
na.omit() %>%
setNames("rfam_hits")
#Pt. 6
pt6_rfam_hits <- read_excel("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/rfam_hits/pt6_rfam_hits.xlsx",
sheet = "Sheet2")
pt6_rfam_hits_tx_names <- pt6_rfam_hits$accession...3 %>%
as.data.frame() %>%
distinct()
pt6_rfam_hits_tx_names <- pt6_rfam_hits_tx_names[startsWith(pt6_rfam_hits_tx_names$., "MSTRG"),] %>%
as.data.frame() %>%
na.omit() %>%
setNames("rfam_hits")
#Pt. 7
pt7_rfam_hits <- read_excel("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/rfam_hits/pt7_rfam_hits.xlsx",
sheet = "Sheet2")
pt7_rfam_hits_tx_names <- pt7_rfam_hits$accession...3 %>%
as.data.frame() %>%
distinct()
pt7_rfam_hits_tx_names <- pt7_rfam_hits_tx_names[startsWith(pt7_rfam_hits_tx_names$., "MSTRG"),] %>%
as.data.frame() %>%
na.omit() %>%
setNames("rfam_hits")
#Pt. 9
pt9_rfam_hits <- read_excel("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/rfam_hits/pt9_rfam_hits.xlsx",
sheet = "Sheet2")
pt9_rfam_hits_tx_names <- pt9_rfam_hits$accession...3 %>%
as.data.frame() %>%
distinct()
pt9_rfam_hits_tx_names <- pt9_rfam_hits_tx_names[startsWith(pt9_rfam_hits_tx_names$., "MSTRG"),] %>%
as.data.frame() %>%
na.omit() %>%
setNames("rfam_hits")
#Pt. 10
pt10_rfam_hits <- read_excel("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/rfam_hits/pt10_rfam_hits.xlsx",
sheet = "Sheet1")
pt10_rfam_hits_tx_names <- pt10_rfam_hits$accession...3 %>%
as.data.frame() %>%
distinct()
pt10_rfam_hits_tx_names <- pt10_rfam_hits_tx_names[startsWith(pt10_rfam_hits_tx_names$., "MSTRG"),] %>%
as.data.frame() %>%
na.omit() %>%
setNames("rfam_hits")
#combine all of the transcripts that hit within the rfam database
all_hits <- rbind (pt1_rfam_hits_tx_names,
pt2_rfam_hits_tx_names,
pt3_rfam_hits_tx_names,
pt4_rfam_hits_tx_names,
pt5_rfam_hits_tx_names,
pt6_rfam_hits_tx_names,
pt7_rfam_hits_tx_names,
pt9_rfam_hits_tx_names,
pt10_rfam_hits_tx_names) #combine the rfam hits into one dataframe
write_delim(all_hits, "/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/rfam_hits/all_hits.txt") #save this241 transcripts hit within the rfam database. So, lets pull those out of our final numbers for lincRNAs
Filter out the rfam hits
non_coding_linc <- read_csv("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/old_noncoding_intergenic_txIDs.txt",
col_names = FALSE)
all_hits <- read_csv("/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/rfam_hits/all_hits.txt")
dplyr::anti_join(non_coding_linc,
all_hits,
by = c("X1" = "rfam_hits")) -> filt_non_coding_linc #only keep the rows in the non_coding_linc dataframe that does NOT have a partner in the the all_hits dataframe
write_delim(filt_non_coding_linc, "/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/rfam_hits/filt_non_coding_linc_tx_names.txt",
col_names = FALSE) #save our final list of non-coding lincRNAs, 8917 transcriptsGenerate final “old” annotations for putative protein-coding genes and long non-coding RNAs
# working directory /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos
# coding antisense
grep -wFf /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/old_coding_antisense_txIDs.txt /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/all_merged_CS.gtf > old_coding_antisense_txIDs.gtf
gffread -w old_coding_antisense_txIDs.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_coding_antisense_txIDs.gtf
# non-coding antisense
grep -wFf /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/old_noncoding_antisense_txIDs.txt /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/all_merged_CS.gtf > old_noncoding_antisense_txIDs.gtf
gffread -w old_noncoding_antisense_txIDs.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_noncoding_antisense_txIDs.gtf
# coding intergenic
grep -wFf /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/old_coding_intergenic_txIDs.txt /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/all_merged_CS.gtf > cs_old_coding_intergenic_txIDs.gtf #generated above
gffread -w cs_old_coding_intergenic_txIDs.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa cs_old_coding_intergenic_txIDs.gtf # generated above
#cs_old_coding_intergenic_txIDs.fa -> generated above = copied and renamed as old_coding_intergenic_txIDs.fa
#cs_old_coding_intergenic_txIDs.gtf -> generated above = copied and renamed as old_coding_intergenic_txIDs.gtf
# non-coding intergenic
grep -wFf /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/filt_non_coding_linc_tx_names.txt /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/all_merged_CS.gtf > old_noncoding_intergenic_txIDs.gtf
gffread -w old_noncoding_intergenic_txIDs.fa -g /mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa old_noncoding_intergenic_txIDs.gtf
## Generate our master annotation files for the old and new putative protein-coding genes and long non-coding RNAs
cat /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/old_coding_antisense_txIDs.gtf /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/old_noncoding_antisense_txIDs.gtf > all_old_antisense.gtf
cat /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/old_coding_intergenic_txIDs.gtf /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/old_noncoding_intergenic_txIDs.gtf > all_old_intergenic.gtf
cat /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/new_coding_antisense.gtf /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/new_noncoding_antisense.gtf > all_new_antisense.gtf
cat /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/new_coding_intergenic.gtf /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/new_noncoding_intergenic.gtf > all_new_intergenic.gtf
# all of the new annotations were generated by Kyle ## grep: /home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/old_coding_intergenic_txIDs.txt: No such file or directory
Now we employ Liftoff
(https://github.com/agshumate/Liftoff) To help us determine which transcripts are unique to each annotation. For input we will use the the new putative protein-coding genes and long non-coding RNAs annotations and existing (old) and newly generated (new) genome sequence files.
Run liftoff - using default settings and supplying a text file with features we want surveyed in this case we want transcripts
Present in the old but not in the new
/home/crailey/liftoff
-g
/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/liftoff/all_new_seq.gtf
-f
/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/liftoff/transcript
-o
/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/liftoff/comp_out_lift_from_new.gff
-u
/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/liftoff/unmapped_features_lift_from_new.txt
/mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa
/mnt/Ninelives/genomes/Camelina/CsSunv4.fasta
#96 transcripts that are present in the new but not the old (cat/wc -l unmapped_features_lift_from_new.txt)
Present in the new but not the old
/home/crailey/liftoff
-g
/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/liftoff/all_old_seq.gtf
-f
/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/liftoff/transcript
-o
/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/liftoff/comp_out_lift_from_old.gff
-u
/home/crailey/documents/collabs/j_cam_lncRNAs/old_genome_analyses/gffcompare_CS/CPC2_output/final_annos/liftoff/unmapped_features_lift_from_old.txt
/mnt/Ninelives/genomes/Camelina/CsSunv4.fasta
/mnt/Kanta/antisense_lncRNAs/Camelina/Camelina_sativa.Cs.dna.toplevel.fa
#200 transcripts are present in the old but not in the new (cat/wc -l unmapped_features_lift_from_old.txt)