Long non-coding RNA discovery in Pocillopora sp.

Author

Zach Bengtsson

Published

June 6, 2023

Zenodo Archive

GitHub Repository

Introduction

This project uses RNA-seq data from Danielle Becker’s experiment exposing colonies of Pocillopora sp. to low level nutrient conditions (nitrogen and phosphorous).

Experimental Design

This study attempts to identify and understand the expression of long non-coding RNAs in corals experiencing low level nutrient enrichment. Long non-coding RNAs (lncRNAs), RNA strands >200bp that do not code for proteins, are implicated in important processes of gene regulation, but little is known about their function in corals.

  • Pocillopora sp. corals were experimentall enriched with dissolved inorganic nitrogen and phosphate for 15 months on an oligotrophic fore reef in Mo’orea, French Polynesia.
  • Tissue samples were retrieved, went through RNA extraction, and were sent off for sequencing.
  • The primary data type used in this study is RNA-seq data.

Objectives

  1. Identify lncRNAs present within the RNA-seq data
  2. Create a list of lncRNA transcript IDs
  3. Examine the sensitivity of lncRNA expression to low level nutrient enrichment with differential expression analysis

Part 1: Finding lncRNAs

Workflow of programs

Data Wrangling

Get data from reef genomics genome assembly (GFF3) and scaffolds (FASTA) as well as genome assembly GTF created by Roberts Lab Research Scientist, Sam White…

wget -P /home/shared/8TB_HDD_02/zbengt/github/zach-lncRNA/data http://pver.reefgenomics.org/download/Pver_genome_assembly_v1.0.gff3.gz
gunzip /home/shared/8TB_HDD_02/zbengt/github/zach-lncRNA/data/Pver_genome_assembly_v1.0.gff3.gz
wget -P /home/shared/8TB_HDD_02/zbengt/github/zach-lncRNA/data http://pver.reefgenomics.org/download/Pver_genome_assembly_v1.0.fasta.gz
gunzip /home/shared/8TB_HDD_02/zbengt/github/zach-lncRNA/data/Pver_genome_assembly_v1.0.fasta.gz
curl -o ../data/Pver_genome_assembly_v1.0-valid.gtf https://gannet.fish.washington.edu/Atumefaciens/20230127-pver-gff_to_gtf/Pver_genome_assembly_v1.0-valid.gtf

Download 32 paired-end RNA-seq FASTQ files from Roberts Lab server, Gannet. This downloads all FASTQ files at once…

wget -r \
--no-check-certificate \
--quiet \
--no-directories --no-parent \
-P /home/shared/8TB_HDD_01/pver \
-A *fastq.gz \
https://gannet.fish.washington.edu/Atumefaciens/hputnam-Becker_E5/Becker_RNASeq/data/trimmed/

What an unzipped FASTQ should look like…

@GWNJ-0957:702:GW2012083569th:7:1101:21278:1608 1:N:0:TCCGGAGA+TTTCGCCT
TATGCTGTGTTGAGTTCAGTATGCAACCCAATTGATACTGTTCATTTTGTGTGCTTCTAGGAAACTGAGGACAATATGATATTCTGTAGACAGCCCAGGATGAGAGATTCCTCTGTGCGAGTGGTCTCCAAATAATATCAACATCCTGCA
+
AAAFFJJJJJJJF-AA7FJJJJJFJFAFFJJFJFJJJJJJJJJJAJJJJJFJJAFF<F-<JFFJA<F7FA7AFJJJJJJJJJJJAJFF-FFJJ-7AFAJJJJ<JFJJJJJJFJJJJJJFJJJJJJFFJJJFJJJJJ<F7AJJJJJJFA<F
@GWNJ-0957:702:GW2012083569th:7:1101:13758:1907 1:N:0:TCCGGAGA+ATTCGCCT
GGATCATCATATTTGTTGTTCACCAGGTTAAGCTCGTGGCGTAGATGATCGACGGGTGAATCCAGGTTCTTTATGTCTGCATCATTTTCTTTAAGTCATCTTCTGGCTTCCGTTAGGTCCTCTTGAATTTGCTCGTTCTTGCTCTGCTTG
+
AAFFFJJJJJJF-AF<AFFJJJJJJJJJJJJJJJJJJJJJJJJJFAFJJJJFJJA<JJA<JJJJJAAAJJJJJFJJJJFJJJJJFJJJ-FJ-AFFJ--AJFJJJJJJFFJJAJJJJFJ<JJJJFJJJJJJJJFAFJJA-AFAFFJ<777-
@GWNJ-0957:702:GW2012083569th:7:1101:8389:1924 1:N:0:TCCGCAGA+CTTCGCCT
CCCAAAGTCATTTTCCCAGTCTGTGCGGGGACATCGTGGAGGAGTATAAATGAAACTTGCTCTTGCTGTGTAGTAGAATTTACACTTGGTTACAGGATCAAGGTGAGTTCGGATATCTTTTGTTTTCTTATCAAACCAGTGTGTAATATC

HISAT2

HISAT2 build to create an index for the Pver_genome_assembly_v1.0.fasta file…

/home/shared/hisat2-2.2.1/hisat2-build \
-f ../data/Pver_genome_assembly_v1.0.fasta \
../output/Pver_genome_assembly_v1.0-valid.index

HISAT2 to align paired-end RNA-Seq reads to the Pver_genome_assembly_v1.0.fasta index…

find /home/shared/8TB_HDD_01/pver/*gz \
| xargs basename -s _R1_001.fastq.gz | xargs -I{} \
/home/shared/hisat2-2.2.1/hisat2 \
-x ../output/Pver_genome_assembly_v1.0-valid.index \
-p 8 \
-1 /home/shared/8TB_HDD_01/pver/{}_R1_001.fastq.gz \
-2 /home/shared/8TB_HDD_01/pver/{}_R2_001.fastq.gz \
-S /home/shared/8TB_HDD_01/pver/hisat-output/{}-valid.sam

Samtools

Samtools to convert the SAM files output from the previous HISAT2 command into sorted BAM files…

for file in /home/shared/8TB_HDD_01/pver/hisat-output/*-valid.sam; do
    base=$(basename "$file" -valid.sam)
    /home/shared/samtools-1.12/samtools view -bS "$file" | \
    /home/shared/samtools-1.12/samtools sort \
    -o /home/shared/8TB_HDD_01/pver/samtools-output/"$base"_valid_sorted.bam
done

StringTie

StringTie to assemble transcripts from the sorted BAM files generated by the previous Samtools commands…

find /home/shared/8TB_HDD_01/pver/samtools-output/*bam \
| xargs basename -s -valid_sorted.bam | xargs -I{} \
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
-p 8 \
-G ../data/Pver_genome_assembly_v1.0-valid.gtf \
-o /home/shared/8TB_HDD_01/pver/stringtie-output/{}-valid.gtf \
/home/shared/8TB_HDD_01/pver/samtools-output/{}-valid_sorted.bam \

Use StringTie merge to merge all the individual GTF files into a single merged GTF file…

/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
--merge \
-G ../data/Pver_genome_assembly_v1.0-valid.gtf \
-o /home/shared/8TB_HDD_01/pver/stringtie-merge-output/stringtie_merged.gtf \
/home/shared/8TB_HDD_01/pver/stringtie-output/*-valid.gtf

GffCompare

Use gffcompare to compare annotation file generated by StringTie to a reference annotation file and to produce a set of output files summarizing the results of the comparison, including classification of each transcript…

/home/shared/gffcompare-0.12.6.Linux_x86_64/gffcompare \
-r ../data/Pver_genome_assembly_v1.0-valid.gtf \
-G ../data/Pver_genome_assembly_v1.0-valid.gtf \
-o /home/shared/8TB_HDD_01/pver/gffcompare-output/gffcompare_merged \
/home/shared/8TB_HDD_01/pver/stringtie-merge-output/stringtie_merged.gtf \

Filter

Filter to get a subset of the transcripts from the original GTF file that are putative lncRNA candidates based on their length and lack of overlap with known reference transcripts…

awk '$3 == "transcript" && $1 !~ /^#/ {print}' /home/shared/8TB_HDD_01/pver/gffcompare-output/gffcompare_merged.annotated.gtf | grep 'class_code "u"' | awk '$5 - $4 > 199 {print}' > /home/shared/8TB_HDD_01/pver/filter-output/merged_lncRNA_candidates.gtf

Bedtools

Get fasta of lncRNA candidate regions with Bedtools…

/home/shared/bedtools2/bin/bedtools \
getfasta -fi ../data/Pver_genome_assembly_v1.0.fasta -bed /home/shared/8TB_HDD_01/pver/filter-output/merged_lncRNA_candidates.gtf -fo /home/shared/8TB_HDD_01/pver/bedtools-output/merged_lncRNA_candidates.fasta -name -split
>transcript::Pver_Sc0000000_size2095917:21743-24809
CTTACCTTTTAGAGCATGTGCCAATGTCTTTACCCAGCCTGTCATATGACATTCACACGTAACCAGTCAATATTTTCTGTTTTTACAATTTAAATCAGCCTCGCTGAGAATTTGACGTCACTTTTTTTAGTCGTCGAAGAAAGGTGGCTCACCCATGTGATTTCGAGTAATTGTTTCTATCATTCTCATTTATTTCCCCTTTCAACTAACTGAGGTAATTACACAAGGAGCATCGACATTATCTTTTTCAGTTTTGTACGTTCTGAACATGGTTTTCAAGACTATCGTTTGCATTCTGCTTGCGAGTCAAGTATTCAGCATCGGTAGGTCTCTTCCTATATTTATTCCATCGTTTGAGGATGCTCTTAAATATTTCCAAATGTTTTAGCCGGTAACCGATTTTTTTATTTTTTTCTGCTCGTTGCATACTCCATGGTTTTTTTTCTGTGTTCTCTAATCGAGAATTTCGAGTTTTCGGGTAGATTTTTCATACGATAAGAAAAGAATTGACAACTAGCTGCTTCATGGTGGAGCCTATTTATTAATTTTTTTTGTTTCAGGTTATTTTATCTCTCCTCTCTAAAGTTCTCCTTGATGAGAGTCCACCATTTTCTCTGTCAATTTTCCATTCCCTCAATATTCCAAAGAGTTTTTGACCAAACAAAACGAGTTTCTCACCACACTTATTGGATATTTCATAGTACCCTTGGCTTATTACTTCTCCTCTTTGCTTGTGCTATTTTTTTGTTTAATTTGCGTCTGATTCTTCTTAGTTGTTTTTAATTTTTTTCATTTTGTATCTTTTTTAAAAATTCTTTATTTTTACTGATCCCAGCATGTTTCGGGATCTTTTGCGCTAAAATCACTCATGGTCATGTAGAAAAGTCACTCCTAAATGAAATAACTTACAATTTGTCGGTCATTTCTTGTTTACAGACGTCATAGACTACGTACCCGTCGGATGTTACAAGGACATACGCTTAAATAGGGCCTTGCCTACTTTGCTCGCAAACTACCGCGTGAAAAACGCAAAATGGCCAGATATCTTGGACTGGACTGATTTGGAAAACTCTGTCATTAAAAAATGCGCCACTAAGGTTTGCCTGATTATTATTGTTGACATTGTTGTCGTTGTTGATGTTTTTCATAATAGTTGCAAAGAAAAAAAATGGCGAAGGTACTTGCACAGACATCTCTCTTAATTCGCCACGGTACTTACGGTCGTCTTGGTAACTTTATTTTCTTTCTACGCATACAGCTCTGTAAGAAGCCTGTTACACTGAACTGACGTTTTTCAAAATTTTATACGAGAAACATGACGAGCATTTTTTGTCCTATTCAGGCTAAAGAGAGGGGCTATTCATTTTTCGGGATCCAATTCTATGGAGAGTGCTGGAGTGGATCAGCGAGTCAAAGTACCTATGCTAAACACGGTTTATCCAAGAGATGCACCACGGCGGCTCCATATGTGGGAAAGGATCACGCCAACTTCGTTTACATGCTTACTGAAGGTTAGCACTTACTAGGAAAGATTTACATTGATGCTTGTTGTACTTGTGTACATATGTAAGCGACTAATTGATTACTGGAACTGATTCTCCTCCTATTTCGGTTAATCCACATGTGGCGGGTGGATTAAAGAGTCCATTGTACCCAGCCGTTACACGGAAGGCACGTGCTTTAGTCAAGTCAAAGCCTTGTTTCCTTATTTTCATTAATTATTTTTGTTATTCTTGTTGCTTTCAAAGAACTGTTTGTTCTGTTTTTTACTTCTAAAGGTTGATTTGTCTTCTTTATCTTTTAATACTGTGCCGAGTTGGACCCAAACGACCTTTTGCTCAAGACCTCGATAGTTAAAAATTTTGTTTTCGTGGGTTTGATTTTGTGCGTGAAGTATTATTGCGCGAAATATTATTAAACAACGTCATAGTAAGAGTCACAGACAATGGGCGAAAGAAAAAGCAGACGAAGAAAATAGATTACAGGTAGGTACGCGTGCGGTCTAGCACAATTGTCATCGCAATGTTGTACGGTTTCGATACTAAAGCGTCCTTACGCAATTTTGTAACAGTATACCAATATTTTCTCTTTTGTGTATACTTCATGCAAATTCTGCTGTTGTTCTTATGCAGAAAATGAATGTAAAAGCTACAAAATTTTTGACGACGCCTCCCGATCTCAGTTTCATGTAACTAACTATGGACGCCCGATTAGATGCGACAATGGACAGTTGGATGAAAACCGCTGGTATAGATTTACTGGCCAAGCAGGTAGTGCCATGGCAGACCAATGTGTCCCATCCAGCAGGTGTCAGACGTTTGGTACAGGATGGTACAGCGGATCTTATCCTCAGGTAATCCTCAGTTTAAATTTGGATTATCTCTTTGTGTTCTATTGGCCCATGATTTTTTCGTGACCATCTGTATAACGAGGGAGTGGCTTGTTTACTCAAAGTAAAATGTCAACGCATGGGAAATGGTCAGGTAGCAGTGGTAACTATATCATCCACTACCACTCATTAGTCAAAAGTAGTCATTCATTTCGCTTGGGGCATTCTTTTGAAGAGAAGATCTCTTCACCTTAAGATAAACATTTACTGACCAAAGAATCTATTCCACACTTCATGTTGAAAGTTAATGAACTAACTTGGACTAACTCCAAACTTTGTATCTCTGTCAGTCGTATTCCTTGTTAAGACAAATTCAAACCAAACCAAGGGAATGAACACTAGGAATGCCATATCAATACGCATTAACCATTTTCCATCTATTTTCTTTCTGGTTTTTTTTTTGTAAATTGAAGGTTTCAGACGGAATTCAAGAAGGAGATGTTTGCTTCAAATGGGATGGCAATTGCTGTAGCAAAAAAGTCAACATTCGCATTCGAAACTGTTCATCTTTTTTCGTTTACAAGCTGAACAACAAGATCAGCTGCCAGTATGCATATTGCGGAAACGGGCAGTGATTAGGAGAGGTCTTCAGTAATGAAACGGGAAGTGAACTGACCCACTGTATCAGCCAGTCATTTTTAACATTATTATCTGCTCAACACACAGCTGATATCAAG
>transcript::Pver_Sc0000000_size2095917:205920-207483
GAGACCTTCGGGCTCGAAGCTTGGGAACTAGTATTCAGGGAAAGATAAAATGGCGGCTTACGGCGGCGAAGGAAATCTTAGTAGTTTAATATTTTAGCGTGAAAAGGCTTGGGTGTTGTGATCTGGTTCTCAAGTAAAAAAAAATCACGGTCTGCAATGGCACTACCACCTTTTGAACAACAAGAAAACATTGAAACGGTGGTAAAAAGCAGCAAGTTTCGAAGTCTAAACGAAAGATGGGTAAGCTGAGTTTCAGTCAAATATCATGACATTTGGGATGTAATTTCGATACCGAATAAATAGTTCTACATGATTATACACGCATAGGATTCTGTTGTAAGTGGTAGGAAATAAAATATATGCATACTTGTTTCGTATTGAAGCCCTGTAAAATCACAAATTTGATTTATTTCTTCTTTGTTTTAGGTGGTCATTATCCATTATTTTTGGAAGTCGAATCATACTGATGAGAAAGAAAAAATGTTTACGATTATTTCTTCTCTCGAACATTTTACTATTTAGGTACCAAATAATCTTGATACTGAAGATCATTGCATCCTATTATGAAAAAATTACACTTGATTGCTTTGTCAACTGGAATGAATAATGTAATATACCTTGTGATTTGTGGGATTTTCACAAGTAAACTAGAAATACTTGAAGAGGGTGATGATGTACTTAATGCAGAGCATTTGGTACAAAGTCTGCTTTGCTGTAATCTACCTGAAGGTGCTTTTTATATGCTAGTAAGACTTGCACAGCCCAGTCAAAATGTTTTTGAGTAGTAGACAAGCTGGCTGTAGTAATTATAGGCAGCAGAAGCCAAACCTCTTCCTGAAAAGCAGTCTGAAAAGTGTGAAGTTAGCGGAAGGTACTTCATTGTAGGCAGCCAAATTGTAAGCTGCCAGCCCATTTTGAGTGAGCAGTAAAACTCAAGATTGGCTATTTTAGCTAATTTTATTGAGAACCAAATTCCCAGCAACGGGCATATTTTTACTGGGCTATTTGTTATTTAACTACTTAGTAATATTCTTATTAAAATTAACTGCATTCCTTTCTAATAATTTGCTGCTGATTGTATGCTGTATATAGCTTATGATTTGTGTGATTTTCACAAGTAAATTAGAAATACCTGAAGATGGTAATGATGTACTAGATAATGTACTTGATGCAAAGCTATTTTGGTACAAAGTGTGTGCCTTGTCCGTAGGTGATTTTGCATGCTAATAAGACTTGTAATTCAAATAATACACATCAGCTGGAGTGGGAATATTAAAAGAAAAATAATAATATTTAGCTATACCAAGTTCTGGTTTTATGTCCAAANTTTCACGCTAAAATATTAAACTACTGAGATTTCCTTCGCCGCCGTAAGCCGCCATTTTATCTTTCCCTGAATACTAGTTCCCAAGCTTCGAGCCCGAAGGTCTCAGACATGCACTGAAATTATGATATTCGTTATTTCAGGGCCAATTTAGAAATAACGATAATCGTTATTTTGGTGTCGAGATCATGCTGATCACAACTGAACGCACATTTACCCCTTTCCTCCGCCATCTTGAA
>transcript::Pver_Sc0000000_size2095917:209824-211345
CATATCAAACTCTTTCCCCAGCGCGATGAACATTTGTTTTTCTCGGTTCCAGGTGTACGCGACCATGCACATGCTCTCCGTTGCGAATTTTCAAAATGGCGGACAGGCTAAATAGTGAAATAAAAAAAGCAAAGTGATTTTGCAACAAATACTTAACATAAACCCCCGTTAGAAAGGCGAATCCACGATAGAAAGATACAAAACTGTATGATATTGAGATCAGGGAGGTTGACAAAATCAACAAGAGAGGTTGAAAATTCATTACGTTGGCTATGGCACATGCTTCGATGAATGGACACCCTTCAGGGCCGATAGAGGATCTGGATATTTTCCGTTTGTTCGCAGAGAAAAGCTGTTCATTCCCTCGGAAAATTCTCTAGAGGATAGAACAGAATCGTTCCATGGCCAACTTGCTCGTGAAATAAAGAAAAAGCTTTGGTCTGGACAGCGAAAGGACCCTGATATAAACATAGGCCTTAATGTTGACCAAGATGTTTTCATGGAAGGTCTTGGGAGTGTTTTTCCCGGCGTTCTTCAAAGAGGGAAGATGGTTTACCAATTGGAGTCAAATAGGGTGCTTGATTCATATCTGGGGCTGAAGTGGGATGAAAGAATTTTAAATGTAGGTGAAGATTTTGCCTATGTAATTCCAGGAACTATTAAGTATTGGCTGGGACAAAGAAGTCCAATCATTGAATACAAGCTCATCGTTGTAGGTGTTCTTAAACCCCTTCATAACCTGCTGCCTCCAGGAACTGAATCCTGTTTTTCTTCAGAAACAGTTGTTTGAGGGCATCAACCCCTATGTGAATTGTGGATCTGAAGAAATCTTTAGAAGCTCTATTATTGTATTTTTCCAAGCCTTGCTGTGTGTAATATGCTATGTTCTGATAAAGTGAGAGGAACTCAGGAACATGATAACATAGTGCATGCATGTAGGGAGTGACATCTTTTGCTTGGTAAACTTTGAGAAAGTGCTCAAACCATGATTTGATATTTCTTTGCAGTTCAAGAATACCCTCTTCAGATGTAAAATCTAGCCTCAAATCACCAATTATGTCCATAAATTTAGCCCAGATCACTTGTATGTCTTTGTTTTGACTGCAAAGAGGGACAAGCATAGGTCGGAATGTTTATATTCTGAAAGAGTTTCAACTTTTCTGGTTTGGTAAGATCTCTATATTCCAGCTTTTTGCTCTCTTTGTTGATTCTAAAATTAAAAGTGATTCCCATACTTTTGATAAATTCTTCATAACATGCCATGTGTTTATNCGATCCCTCTACTACTGCTTTCACGCTCACCAAACAGGAAAAAATCAAAAGGAGGAAGATACAGCTGTTTTGAAACGCCATTTGACGAGGGTAAATCTGCTTTTTGTTCGACACTTTACATATCAAAACAAAGGAAATTTGTTCCTCTTTTTGATTCTGGCGGTACATTTTTAAATTGCGCTTGCGTAAAAGGGATATCGCCCGAGACGCTTCGCTCGGTTGTGTTTTGAACAAAAACGCCTTGTTAAT
>transcript::Pver_Sc0000000_size2095917:209932-211345
TAAATAGTGAAATAAAAAAAGCAAAGTGATTTTGCAACAAATACTTAACATAAACCCCCGTTAGAAAGGCGAATCCACGATAGAAAGATACAAAACTGTATGATATTGAGATCAGGGAGGTTGACAAAATCAACAAGAGAGGTTGAAAATTCATTACGTTGGCTATGGCACATGCTTCGATGAATGGACACCCTTCAGGGCCGATAGAGGATCTGGATATTTTCCGTTTGTTCGCAGAGAAAAGCTGTTCATTCCCTCGGAAAATTCTCTAGAGGATAGAACAGAATCGTTCCATGGCCAACTTGCTCGTGAAATAAAGAAAAAGCTTTGGTCTGGACAGCGAAAGGACCCTGATATAAACATAGGCCTTAATGTTGACCAAGATGTTTTCATGGAAGGTCTTGGGAGTGTTTTTCCCGGCGTTCTTCAAAGAGGGAAGATGGTTTACCAATTGGAGTCAAATAGGGTGCTTGATTCATATCTGGGGCTGAAGTGGGATGAAAGAATTTTAAATGTAGGTGAAGATTTTGCCTATGTAATTCCAGGAACTATTAAGTATTGGCTGGGACAAAGAAGTCCAATCATTGAATACAAGCTCATCGTTGTAGGTGTTCTTAAACCCCTTCATAACCTGCTGCCTCCAGGAACTGAATCCTGTTTTTCTTCAGAAACAGTTGTTTGAGGGCATCAACCCCTATGTGAATTGTGGATCTGAAGAAATCTTTAGAAGCTCTATTATTGTATTTTTCCAAGCCTTGCTGTGTGTAATATGCTATGTTCTGATAAAGTGAGAGGAACTCAGGAACATGATAACATAGTGCATGCATGTAGGGAGTGACATCTTTTGCTTGGTAAACTTTGAGAAAGTGCTCAAACCATGATTTGATATTTCTTTGCAGTTCAAGAATACCCTCTTCAGATGTAAAATCTAGCCTCAAATCACCAATTATGTCCATAAATTTAGCCCAGATCACTTGTATGTCTTTGTTTTGACTGCAAAGAGGGACAAGCATAGGTCGGAATGTTTATATTCTGAAAGAGTTTCAACTTTTCTGGTTTGGTAAGATCTCTATATTCCAGCTTTTTGCTCTCTTTGTTGATTCTAAAATTAAAAGTGATTCCCATACTTTTGATAAATTCTTCATAACATGCCATGTGTTTATNCGATCCCTCTACTACTGCTTTCACGCTCACCAAACAGGAAAAAATCAAAAGGAGGAAGATACAGCTGTTTTGAAACGCCATTTGACGAGGGTAAATCTGCTTTTTGTTCGACACTTTACATATCAAAACAAAGGAAATTTGTTCCTCTTTTTGATTCTGGCGGTACATTTTTAAATTGCGCTTGCGTAAAAGGGATATCGCCCGAGACGCTTCGCTCGGTTGTGTTTTGAACAAAAACGCCTTGTTAAT
>transcript::Pver_Sc0000000_size2095917:309051-311795
AGCAAATCGACGTGGGAGCTTAAAGTAAATCGATGTCCCAGCAGAAATAACCGGTGATTTCAGTCAGGAAAACGCACCAGCATTTGCGTAAGCTTGTTAGGCCACCGAAAGTATACCATTTGTACATAAAACAATTCTTCCTAAATTATTGGTTTTACAAGTGATTATGGTTTTGATAAGGGGACTCATGGTAGTTTGGGATACATCGGTCAAAAATCCGCATTTCTTAAAGAATTATGTTTGCCATTACCATGCAACAGACAAAGAGACTCGTCGTGCTCTAGTATCGGAACTTTTTGATAGAAATATACCAAGGCACTCTTAATGCCACTTCATGTGGCTCCTAGTGTCTTGTTGACACAGCTTTATCAGGGAAAATGTTAATGACATAAAGAATATAATCAGTTCAACATTCTCCACAGGCTCAATCACCGGCGCACGAGGAAACTTTTTTTATTAAATCTAAATTAAAGTCGATTTAGATAAATTTCACTGATTGACTTCGCTGAAACGACTTCCAACTATGAGAAAGTCAGGGGGAAATACTGAACTTCTCTGACGGTAACAATATCAGAAAGATTCCTGACTAATGCGTCTGCGAATTGTTCATGAATCGGTTCAACACAGTTTGATTCAGCTTATTGTTTTTTTGATTTTAGATTGAACTGCAACGGGAGTTGACAAACACACCTGCTTTTTCGAGGCTTTAACATCTCTTTCTAATTATCCTTACCCAGTCTATTCAGCTACATGTTTTATTATGGCATCCTCACAGTGTCTGATCCTGGAGATTAAAGTGAATATTTCCATTACAAAAGCGCTTAAAAACTCGTTTTCAAGAGCCCCGGTGGCATTTCTATCAGATCAAAGAGTGGTCGAAAAAGAGAATGAGAGAGAAGAAGGACGCCTCCCGCTGGTAATTTTAATTTTGTGGTTCCTATGGTGAAGAAGACGACTCGTTTTGTTGGTTTCCTAGACCTATTATAAGTATAAATACAGCAAACCAAGAAACTCAATTAAATTGTGAATTACAGAAGTGCAGGTTATGCAAAAAAGAACTAGTGGAATAACGACAAGGGCAGAAATTCAATCGCCGCACAACAGTGAATAGCGGACCCTAGGCGAGAAAGAAAGATACGTTCGGAACTGGTCCAAAGCCTTGCAAAGCTTCTAAACGTAAGCGCAATTCAGAACGAACACGTCGAAAGAAACGGAGTTCTCATGCAGTAGAAGTTGATCATCGCCCTACAATAAATAGCTTATCTAAAGCACAAAAGCAAGATGAGCTCGGATGTCATTCAAAACATGACAAAACGACTAAATTCGGGTGCAATTCAGAACAAGTCTTCCGAAGGCAACGGACTTCAAATCGACAGAGTGCTCGAGGCGATTTTACTCTCTTTTTTGGCCCTCTTCGCTATCGTGGGAAATGGCCTTCTCATGTACGTCATATACTCAAATCACAATCTTCGAACTCTTCGGAATGCTTTTGTAGCAAGTCTTGCGGCTGCCGATCTCATGTTAGCTTGTACAGACATTATTTACCAAGCTGTGGAAAAGGTCATAATCGACTTTAAGCCTATCCACCCATCGGTTTGTTACATTATCTTGCTTAGCGGAGTATTGTTCGGTTCTGCTTCTGTGTTTAACCTCACCGCCATGACTTTGGTCCGTTACGTAAACATTCGCTACCCTCTTCATTTTAACCGTTACCTAACTGTCCATCGCTCTACTACCATTGTACTGGTGACTTGGCTAATTGCGCTTTGTCTCGCGTTCCCGCCTCTCATTTGGCGTCCGGAAGGCGTCGTGTGTTCGACAACGAAGCCTTCCGACGAGCACATCTTAAACGAAGCCATATACATGAGTGCCGAGTGGTTAATTTGGTTCATTATCCCAGCGATACTCATAACGTTTTCATATTATCGCATATACATGATCGCAAGAACTCAAGCCAGGCAGATAGCTGCACTAGAAGTGACTGCAGTCAGTGAAACAAATTCTAAGAATTCTTTTCCTGTAAGAGGACGAGCGTCGTCTTTGAGAGAAAAGAAAGCGGGGCGAATGGTGGCCATTTTGGTTGGATTTTGGGTTCTTTGCTGGCTTCCGTTCTTCACAGTGCTCACCATAAGTAAATTCAAGTCTCAAGTGCCAGGCTTTCTCATGCGTTTGTTTCTGTGTCTTATGTTTGCAAACTCAGCAATTAACCCTATTGTATTAACGTGGTATAACCGTGAACTTAAAGAAGCGATTAAGAAGTTGTTCTGTCGAGAAAAACAGCGTAACCAGTCGTTATCCATGGCAGAAACACGGTCAAACTTGAGAGGATCAACTTCTAGAACAGCATTTTAAACATGGCATACAAACTCATAAATAAGACTGATCAAGGGCGCTGGCTGGTGTTTCTCTCAGAATGTTTTAAACACCTACTGAGAAGGGAGGACAAAATGTAAGAGCATGCCGTGTCAAAAAAAGTATTATTACTCGCAAGAAATAGCTCTGGATGGTACTTTAAACGAGTGAGTTTTTACCCCTTAAAATCCATTCAAATAAGTCGCAATCTTTGATTCTTTATTACGTACAGAAAATAAATCTAATCAAATTTTATGATTACTGAATCGTTATAAACTCAAGGTTAATGTCATATAGGATCTCAAACTCACAGCAAAAGAAATCACCTTGTGGCAAGGACAATAACTCTTTACATGACATCAAATACAAAATATCAGACGGACCTAGTCTCG

CPC2

Evaluate coding potential of transcripts with Coding Potential Calculator 2…

eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)"
python /home/shared/CPC2_standalone-1.0.1/bin/CPC2.py -i /home/shared/8TB_HDD_01/pver/bedtools-output/merged_lncRNA_candidates.fasta -o ~/github/zach-lncRNA/output/merged_cpc2_results

Final Filter

Filter for those transcripts with label “noncoding”…

awk '$8 == "noncoding" {print $1}' ~/github/zach-lncRNA/output/merged_cpc2_results.txt > ~/github/zach-lncRNA/output/noncoding_transcripts_ids.txt
grep -Fwf ~/github/zach-lncRNA/output/noncoding_transcripts_ids.txt /home/shared/8TB_HDD_01/pver/bedtools-output/merged_lncRNA_candidates.fasta > ~/github/zach-lncRNA/output/merged_final_lncRNAs.gtf

This results in a final list of transcript IDs identified as lncRNAs. The final GTF of lncRNA transcripts can be found here.

>transcript::Pver_Sc0000000_size2095917:21743-24809
>transcript::Pver_Sc0000000_size2095917:205920-207483
>transcript::Pver_Sc0000000_size2095917:209824-211345
>transcript::Pver_Sc0000000_size2095917:209932-211345
>transcript::Pver_Sc0000000_size2095917:405436-413642
>transcript::Pver_Sc0000000_size2095917:434811-435922
>transcript::Pver_Sc0000000_size2095917:437235-439117
>transcript::Pver_Sc0000000_size2095917:645784-646411
>transcript::Pver_Sc0000000_size2095917:811102-817194
>transcript::Pver_Sc0000000_size2095917:812141-814871
>transcript::Pver_xpSc0122334_size1013:0-293
>transcript::Pver_xpSc0122334_size1013:566-801
>transcript::Pver_xpSc0122341_size1005:3-987
>transcript::Pver_xpSc0122357_size967:266-953
>transcript::Pver_xpSc0122375_size948:11-928
>transcript::Pver_xpSc0122429_size882:198-839
>transcript::Pver_xpSc0122462_size858:90-315
>transcript::Pver_xpSc0122471_size849:473-849
>transcript::Pver_xpSc0122613_size713:64-293
>transcript::Pver_xpSc0122829_size550:47-547

Part 2: Differential Expression of lncRNAs

Workflow

Kallisto Abundance Estimation

Transcript quantification using the Kallisto software on multiple pairs of FASTQ files…

find /home/shared/8TB_HDD_01/pver/*R1_001.fastq* \
| xargs basename -s _R1_001.fastq.gz | xargs -I{} /home/shared/kallisto/kallisto \
quant -i ~/github/zach-lncRNA/output/kallisto-output/Pver_genome_assembly_v1.0.index \
-o /home/shared/8TB_HDD_01/pver/kallisto-output/{} \
-t 20 \
/home/shared/8TB_HDD_01/pver/{}_R1_001.fastq.gz \
/home/shared/8TB_HDD_01/pver/{}_R2_001.fastq.gz \
2>&1 | tee /home/shared/8TB_HDD_01/pver/kallisto-output/{}.out

Trinity Count Matrix Generation

Generate a count matrix from Kallisto abundance estimates for all transcripts in RNA-seq data…

perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \
--est_method kallisto \
--gene_trans_map none \
--out_prefix ~/github/zach-lncRNA/output/count_matrix \
--name_sample_by_basedir \
/home/shared/8TB_HDD_01/pver/kallisto-output/*/abundance.tsv

Filter Count Matrix for lncRNA Transcript IDs

Extract counts only for lncRNAs and save as a new filtered count matrix…

# Specify the transcript ID file (GTF format)
transcript_id_file="/home/shared/8TB_HDD_02/zbengt/github/zach-lncRNA/output/merged_final_lncRNAs.gtf"

# Specify the count matrix file
count_matrix_file="/home/shared/8TB_HDD_02/zbengt/github/zach-lncRNA/output/count_matrix.isoform.counts.matrix"

# Specify the output file for the filtered count matrix
filtered_count_matrix_file="/home/shared/8TB_HDD_02/zbengt/github/zach-lncRNA/output/filtered_count_matrix.tsv"

# Extract the middle part of the transcript IDs from the GTF file
grep -oP "(?<=transcript::)[^:]*" "$transcript_id_file" > "tmp_transcript_ids.txt"

# Store the first line (column headings) of the count matrix file
head -n 1 "$count_matrix_file" > "$filtered_count_matrix_file"

# Filter the count matrix using the extracted transcript IDs, starting from the second line
awk 'NR==FNR{a[$0];next} ($1 in a)' "tmp_transcript_ids.txt" <(tail -n +2 "$count_matrix_file") >> "$filtered_count_matrix_file"

# Remove the temporary transcript ID file
rm "tmp_transcript_ids.txt"

The final count matrix of lncRNAs can be found here. Note: Counts need to be rounded before DESeq2 analysis

DESeq2 Differential Expression Analysis

Install the DESeq2 package from Bioconductor and load necessary packages to run DESeq2 and create figures…

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("DESeq2")
library(DESeq2)
library(tidyverse)
library(pheatmap)
library(RColorBrewer)
library(data.table)

Create data frame for DESeq to reference. This specifies which sample IDs are control and which are treatment (nutrient enriched)…

metaData <- data.frame(
  sample = c("C17", "C18", "C19", "C20", "C21", "C22", "C23", "C24", "C25", "C26", "C27", "C28", "C29", "C30", "C31", "C32", "E10", "E11", "E12", "E13", "E14", "E15", "E16", "E1", "E2", "E3", "E4", "E5", "E6", "E7", "E8", "E9"),
  treatment = c("Control", "Control", "Control", "Control", "Control", "Control", "Control", "Control", "Control", "Control", "Control", "Control", "Control", "Control", "Control", "Control", "Enriched", "Enriched", "Enriched", "Enriched", "Enriched", "Enriched", "Enriched", "Enriched", "Enriched", "Enriched", "Enriched", "Enriched", "Enriched", "Enriched", "Enriched", "Enriched")
  )

Load lncRNA filtered count matrix…

countmatrix <- read.delim("~/github/zach-lncRNA/output/filtered_count_matrix.tsv", header = TRUE, sep = '\t')
rownames(countmatrix) <- countmatrix$X
countmatrix <- countmatrix[,-1]
head(countmatrix)
                              C17      C18      C19      C20      C21     C22
Pver_Sc0004518_size2183   7.85228 17.11440  6.81050  2.16756  7.18268 4.66885
Pver_Sc0006272_size1408   0.00000  0.00000  2.00000  0.00000  1.00000 0.00000
Pver_Sc0003523_size3224   0.00000  1.09759 12.12730  1.61304  1.16390 3.45154
Pver_Sc0006233_size1418   7.34194 14.33960  1.32007 12.92490 20.73860 9.61535
Pver_xpSc0121546_size5746 6.00000  7.46051 13.02040  5.35396  2.47586 0.00000
Pver_xfSc0001217_size5096 6.85357  2.54863  1.00207  2.02089  0.00000 2.69000
                               C23     C24      C25      C26     C27     C28
Pver_Sc0004518_size2183   21.47020 9.93404  1.07379 30.33510 3.59148 4.56873
Pver_Sc0006272_size1408    0.00000 0.00000 13.00000  1.00000 0.00000 1.00000
Pver_Sc0003523_size3224    1.22297 2.52883  9.83971  5.84554 8.89700 6.15775
Pver_Sc0006233_size1418   21.42530 3.65209  9.37371 14.74800 5.75744 7.00592
Pver_xpSc0121546_size5746  2.03561 8.09258  6.53826  7.29864 8.02735 6.94427
Pver_xfSc0001217_size5096  0.00000 0.00000  0.00000  2.84627 3.77124 1.67987
                               C29      C30      C31       C32      E10     E11
Pver_Sc0004518_size2183    3.49436 14.42350 19.15810 19.950900  3.16544 4.45746
Pver_Sc0006272_size1408    2.00000  0.00000  0.00000  2.324370  0.00000 0.00000
Pver_Sc0003523_size3224   18.88970  1.15033  6.69146  2.181350  5.74262 2.48931
Pver_Sc0006233_size1418    7.79912 20.08930 26.51350 12.057400 17.45350 9.35781
Pver_xpSc0121546_size5746  2.91164  3.22259  1.99999  5.000000  1.08613 1.00000
Pver_xfSc0001217_size5096  0.00000  9.00644  3.62421  0.856301  2.81287 2.09052
                              E12      E13      E14      E15      E16       E1
Pver_Sc0004518_size2183   6.03863  5.71854  5.38058 35.04610 18.17120 15.94860
Pver_Sc0006272_size1408   0.00000  2.00000  0.00000  2.00000  1.00000  0.00000
Pver_Sc0003523_size3224   5.28486  2.52545  3.55154 15.84280  3.86092  0.00000
Pver_Sc0006233_size1418   3.82369 12.46040 19.26390 12.24390 28.95150  7.63817
Pver_xpSc0121546_size5746 3.81159  5.89469  5.36072  6.68835  2.41352  1.00000
Pver_xfSc0001217_size5096 0.00000  0.00000  0.00000  3.43347  2.08294  1.57733
                               E2       E3       E4      E5       E6       E7
Pver_Sc0004518_size2183   6.64014  2.77836 15.68280 4.42886  6.84266 28.33200
Pver_Sc0006272_size1408   1.00000  0.00000  2.38019 1.00000  0.00000  0.00000
Pver_Sc0003523_size3224   1.68306  9.61741  2.51885 4.15475 11.98390  1.83673
Pver_Sc0006233_size1418   3.93828 14.00460  1.24852 3.46708 12.58970  2.49715
Pver_xpSc0121546_size5746 1.00000  3.60409  0.00000 1.09690  1.00000 11.00000
Pver_xfSc0001217_size5096 0.00000  3.81283  1.39317 4.50908  2.31714  8.13514
                                E8      E9
Pver_Sc0004518_size2183    4.73182 0.00000
Pver_Sc0006272_size1408    0.00000 1.00000
Pver_Sc0003523_size3224    5.64801 5.10102
Pver_Sc0006233_size1418   11.70150 9.82817
Pver_xpSc0121546_size5746 29.74560 1.70132
Pver_xfSc0001217_size5096  0.00000 0.00000

Round counts to integers so they can be run through DESeq2 and summarize countmatrix structure to check and see that it worked…

countmatrix <- round(countmatrix, 0)
str(countmatrix)
'data.frame':   6264 obs. of  32 variables:
 $ C17: num  8 0 0 7 6 ...
 $ C18: num  17 0 1 14 7 ...
 $ C19: num  7 2 12 1 13 ...
 $ C20: num  2 0 2 13 5 ...
 $ C21: num  7 1 1 21 2 ...
 $ C22: num  5 0 3 10 0 ...
 $ C23: num  21 0 1 21 2 ...
 $ C24: num  10 0 3 4 8 ...
 $ C25: num  1 13 10 9 7 ...
 $ C26: num  30 1 6 15 7 ...
 $ C27: num  4 0 9 6 8 ...
 $ C28: num  5 1 6 7 7 ...
 $ C29: num  3 2 19 8 3 ...
 $ C30: num  14 0 1 20 3 ...
 $ C31: num  19 0 7 27 2 ...
 $ C32: num  20 2 2 12 5 ...
 $ E10: num  3 0 6 17 1 ...
 $ E11: num  4 0 2 9 1 ...
 $ E12: num  6 0 5 4 4 ...
 $ E13: num  6 2 3 12 6 ...
 $ E14: num  5 0 4 19 5 ...
 $ E15: num  35 2 16 12 7 ...
 $ E16: num  18 1 4 29 2 ...
 $ E1 : num  16 0 0 8 1 ...
 $ E2 : num  7 1 2 4 1 ...
 $ E3 : num  3 0 10 14 4 ...
 $ E4 : num  16 2 3 1 0 ...
 $ E5 : num  4 1 4 3 1 ...
 $ E6 : num  7 0 12 13 1 ...
 $ E7 : num  28 0 2 2 11 ...
 $ E8 : num  5 0 6 12 30 ...
 $ E9 : num  0 1 5 10 2 ...

The count matrix should look like this before you input into DESeq2…

                          C17 C18 C19 C20 C21 C22 C23 C24 C25 C26 C27 C28 C29
Pver_Sc0004518_size2183     8  17   7   2   7   5  21  10   1  30   4   5   3
Pver_Sc0006272_size1408     0   0   2   0   1   0   0   0  13   1   0   1   2
Pver_Sc0003523_size3224     0   1  12   2   1   3   1   3  10   6   9   6  19
Pver_Sc0006233_size1418     7  14   1  13  21  10  21   4   9  15   6   7   8
Pver_xpSc0121546_size5746   6   7  13   5   2   0   2   8   7   7   8   7   3
Pver_xfSc0001217_size5096   7   3   1   2   0   3   0   0   0   3   4   2   0
                          C30 C31 C32 E10 E11 E12 E13 E14 E15 E16 E1 E2 E3 E4
Pver_Sc0004518_size2183    14  19  20   3   4   6   6   5  35  18 16  7  3 16
Pver_Sc0006272_size1408     0   0   2   0   0   0   2   0   2   1  0  1  0  2
Pver_Sc0003523_size3224     1   7   2   6   2   5   3   4  16   4  0  2 10  3
Pver_Sc0006233_size1418    20  27  12  17   9   4  12  19  12  29  8  4 14  1
Pver_xpSc0121546_size5746   3   2   5   1   1   4   6   5   7   2  1  1  4  0
Pver_xfSc0001217_size5096   9   4   1   3   2   0   0   0   3   2  2  0  4  1
                          E5 E6 E7 E8 E9
Pver_Sc0004518_size2183    4  7 28  5  0
Pver_Sc0006272_size1408    1  0  0  0  1
Pver_Sc0003523_size3224    4 12  2  6  5
Pver_Sc0006233_size1418    3 13  2 12 10
Pver_xpSc0121546_size5746  1  1 11 30  2
Pver_xfSc0001217_size5096  5  2  8  0  0

Use the DESeq2 package to create a DESeqDataSet object from a count matrix and metadata…

deseq2.dds <- DESeqDataSetFromMatrix(countData = countmatrix,
                                     colData = metaData, 
                                     design = ~ treatment)

Performs differential expression analysis using DESeq2 on the deseq2.dds object, extract the results, converts them into a data frame, and then orders the rows of the data frame alphabetically based on the row names…

deseq2.dds <- DESeq(deseq2.dds)
deseq2.res <- results(deseq2.dds)
deseq2.res <- as.data.frame(deseq2.res)  # Convert to data frame
deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ]

The results (deseq2.res) should look like this…

                            baseMean log2FoldChange      lfcSE       stat
Pver_Sc0000000_size2095917 110374.10    0.055635923 0.04699103  1.1839689
Pver_Sc0000001_size2081954 102310.20    0.117073634 0.05308784  2.2052817
Pver_Sc0000002_size1617595  76740.43    0.281523591 0.15086218  1.8660978
Pver_Sc0000003_size1576134  63080.00    0.045165834 0.03014089  1.4984905
Pver_Sc0000004_size1560107  45979.44   -0.001693396 0.02232370 -0.0758564
Pver_Sc0000005_size1451149  40752.22    0.019263990 0.03704071  0.5200762
                               pvalue      padj
Pver_Sc0000000_size2095917 0.23642537 0.7606988
Pver_Sc0000001_size2081954 0.02743434 0.3870680
Pver_Sc0000002_size1617595 0.06202768 0.5058815
Pver_Sc0000003_size1576134 0.13400586 0.6511615
Pver_Sc0000004_size1560107 0.93953335 0.9916985
Pver_Sc0000005_size1451149 0.60301048 0.9171572

Count the number of hits with adjusted p-value less then 0.05…

dim(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])
[1] 22  6

Retrieve the transcript IDs of the 22 lncRNAs with differential expression p-values less than 0.05 and save them to a text file…

# Subset the hits with adjusted p-value less than 0.05
hits <- deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ]

# View the transcript IDs of the hits
transcript_ids <- rownames(hits)
print(transcript_ids)
 [1] "Pver_Sc0000139_size537088"  "Pver_Sc0000156_size509041" 
 [3] "Pver_Sc0000172_size480810"  "Pver_Sc0000556_size184003" 
 [5] "Pver_Sc0000853_size79572"   "Pver_Sc0000968_size64176"  
 [7] "Pver_Sc0001125_size43131"   "Pver_Sc0001315_size24563"  
 [9] "Pver_Sc0001356_size25637"   "Pver_Sc0001408_size22845"  
[11] "Pver_Sc0001524_size18081"   "Pver_Sc0001709_size13140"  
[13] "Pver_Sc0001979_size9044"    "Pver_Sc0002476_size5736"   
[15] "Pver_Sc0009607_size821"     "Pver_xfSc0000044_size69330"
[17] "Pver_xfSc0000188_size16086" "Pver_xfSc0001443_size4587" 
[19] "Pver_xfSc0006440_size1274"  "Pver_xfSc0006736_size1221" 
[21] "Pver_xpSc0121218_size45261" "Pver_xpSc0121370_size11769"
# Save the transcript IDs in a text file
writeLines(transcript_ids, "~/github/zach-lncRNA/output/sig_diff_IDs.txt")

These transcript IDs can be found in a text file here.

Generating Differential Expression Figures

Create a plot of differentially expressed lncRNAs with those that have a p-value less than 0.05 highlighted in red…

tmp <- deseq2.res
# The main plot
plot(tmp$baseMean, tmp$log2FoldChange, pch=20, cex=0.45, ylim=c(-3, 3), log="x", col="darkgray",
     main="DEG Nutrient Enrichment  (pval <= 0.05)",
     xlab="mean of normalized counts",
     ylab="Log2 Fold Change")
# Getting the significant points and plotting them again so they're a different color
tmp.sig <- deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ]
points(tmp.sig$baseMean, tmp.sig$log2FoldChange, pch=20, cex=0.45, col="red")
# 2 FC lines
abline(h=c(-1,1), col="blue")

Generate a PCA plot to see how all of the differential expression results cluster…

vsd <- vst(deseq2.dds, blind = FALSE)
plotPCA(vsd, intgroup = "treatment")

Create heatmaps to view the top 50 and top 22 differentially expressed lncRNAs…

# Select top 50 differentially expressed genes
res <- results(deseq2.dds)
res_ordered <- res[order(res$padj), ]
top_genes <- row.names(res_ordered)[1:50]

# Extract counts and normalize
counts <- counts(deseq2.dds, normalized = TRUE)
counts_top <- counts[top_genes, ]

# Log-transform counts
log_counts_top <- log2(counts_top + 1)

# Generate heatmap
pheatmap(log_counts_top, scale = "row")

# Select top 22 differentially expressed genes
res <- results(deseq2.dds)
res_ordered <- res[order(res$padj), ]
top_genes <- row.names(res_ordered)[1:22]

# Extract counts and normalize
counts <- counts(deseq2.dds, normalized = TRUE)
counts_top <- counts[top_genes, ]

# Log-transform counts
log_counts_top <- log2(counts_top + 1)

# Generate heatmap
pheatmap(log_counts_top, scale = "row")

Write final differential expression results table…

write.table(tmp.sig, "../output/DEGlist.tab", row.names = T)

The final differential gene expression results can be found here.

Part 3: Interpreting Results and Future Work

  • Of 6264 identified lncRNA transcripts, only 22 were differentially expressed with a p-value of less than 0.05 in this analysis.

  • It may be that lncRNAs don’t play much of a role in nutrient enrichment, or that a relatively small number of lncRNAs are particularly important managing nutrient uptake and utilization. Low level nutrient enrichment may also not be enough stimulus to elicit a response in the expression of lncRNAs.

  • Further work needs to be done to understand the potential function of these lncRNAs. Preliminary database searches did not yield transcript matches with associated GO terms.

  • The next step in this work will be to complete weighted correlation network analysis (WGCNA) to see how these lncRNAs are expressed alongside mRNAs with associated functions. This will provide clues about which GO terms, and thus potential functions, these lncRNAs may be associated with.