Differential Expression Analysis of Vanessa cardui RNA-Seq Data

QC, Sorting, Trimming, Alignment, SAM/BAM Generation, GTF Generation, Transcriptome Assembly, and Table Count Generation From FASTA Files

Perl RNA-Seq Pipeline Script

#!/usr/bin/perl

#I developed this pipeline using perl to take raw fastq files and run quality control during different stages of file generation; merging and unmerging of fastq files; sorting to remove RNA contamination; trimming to remove sequencing adapters; alignment of RNA-seq reads to a reference genome; sorting of .sam files and generation of .bam files; indexing of .bam files to view in IGV; merging aligned reads into full-length transcripts; merging .gtf transcript files; and generation of table counts for differential expression using R.

#To run this script, you need the following programs: FastQC, SortMeRNA, Trimmomatic, HISAT2, SAMtools, and StringTie. The Ballgown package can be downloaded in R to utilize the table count files for differential expression.

use warnings;

use strict;

my @samples = qw(VcV1 VcV2 VcV3 VcN1 VcN2 VcN3);

foreach(@samples){
    do {
        system ("fastqc -o fastqc_results --extract ${_}*gz");

        system ("gunzip ${_}*.gz");

        system ("bash", "merge-paired-reads.sh", "${_}_R1_001.fastq", "${_}_R2_001.fastq", "${_}_merged.fastq");

        system ("sortmerna", "--ref", "/home/alexanderselvey/sortmerna/rRNA_databases/silva-bac-16s-id90.fasta,/home/alexanderselvey/sortmerna/index/silva-bac-16s-db:\/home/alexanderselvey/sortmerna/rRNA_databases/silva-bac-23s-id98.fasta,/home/alexanderselvey/sortmerna/index/silva-bac-23s-db:\/home/alexanderselvey/sortmerna/rRNA_databases/silva-arc-16s-id95.fasta,/home/alexanderselvey/sortmerna/index/silva-arc-16s-db:\/home/alexanderselvey/sortmerna/rRNA_databases/silva-arc-23s-id98.fasta,/home/alexanderselvey/sortmerna/index/silva-arc-23s-db:\/home/alexanderselvey/sortmerna/rRNA_databases/silva-euk-18s-id95.fasta,/home/alexanderselvey/sortmerna/index/silva-euk-18s-db:\/home/alexanderselvey/sortmerna/rRNA_databases/silva-euk-28s-id98.fasta,/home/alexanderselvey/sortmerna/index/silva-euk-28s:\/home/alexanderselvey/sortmerna/rRNA_databases/rfam-5s-database-id98.fasta,/home/alexanderselvey/sortmerna/index/rfam-5s-db:\/home/alexanderselvey/sortmerna/rRNA_databases/rfam-5.8s-database-id98.fasta,/home/alexanderselvey/sortmerna/index/rfam-5.8s-db", "--reads", "/media/alexanderselvey/selvey_alexander_external/rnaseq_pipeline_project/VcB_data/${_}_merged.fastq", "--paired_in", "-a", "11", "--log", "--fastx", "--aligned", "${_}_merged_rRNA", "--other", "${_}_merged_sortmerna");

        system ("bash", "unmerge-paired-reads.sh", "${_}_merged_sortmerna.fastq", "${_}_sortmerna_R1.fastq", "${_}_sortmerna_R2.fastq");

        system ("fastqc -o fastqc_results --extract ${_}_sortmerna_R*.fastq");

        system ("java", "-jar", "/home/alexanderselvey/Downloads/Trimmomatic/Trimmomatic-0.38/trimmomatic-0.38.jar", "PE", "-threads", "11", "-trimlog", "${_}_trimmomatic_trimlog.log", "${_}_sortmerna_R1.fastq", "${_}_sortmerna_R2.fastq", "${_}_sortmerna_R1_trim_pe.fastq", "${_}_sortmerna_R1_trim_se.fastq", "${_}_sortmerna_R2_trim_pe.fastq", "${_}_sortmerna_R2_trim_se.fastq", "ILLUMINACLIP:/home/alexanderselvey/Downloads/Trimmomatic/Trimmomatic-0.38/adapters/TruSeq3-PE.fa:2:30:10", "LEADING:3", "TRAILING:3", "SLIDINGWINDOW:4:15", "MINLEN:36");

        system ("fastqc -o fastqc_results --extract ${_}_sortmerna_R*_trim_pe.fastq");

        system("hisat2", "-p", "11", "--dta", "-x", "Vcardui_hisat2_index/Vcardui_index", "-1", "${_}_sortmerna_R1_trim_pe.fastq", "-2", "${_}_sortmerna_R2_trim_pe.fastq", "-S", "${_}_alignedh2_Vcardui.sam");

        system("samtools", "sort", "-@", "11", "-o", "${_}_alignedh2_Vcardui.bam", "${_}_alignedh2_Vcardui.sam");

        system("samtools", "index", "${_}_alignedh2_Vcardui.bam", "${_}_alignedh2_Vcardui.bam.bai");

        system("stringtie", "-p", "11", "-o", "${_}.gtf", "-l", "${_}", "${_}_alignedh2_Vcardui.bam");
    }
}

system("stringtie", "--merge", "-p", "11", "-o", "VcB_combined.gtf", "mergelist.txt");

foreach(@samples){
    do{
        system("stringtie", "-e", "-B", "-p", "11", "-G", "VcB_combined.gtf", "-o", "ballgown/${_}/${_}.gtf", "${_}_alignedh2_Vcardui.bam");
    }
}

Ballgown for isoform-level differential expression analysis for the Non-infected and Virally-infected groups

Load Packages and Set Working Directory

library(ballgown) #for differential expression analysis
library(RColorBrewer) #or RSkittleBrewer for plot coloring
library(genefilter) #calc of means and variances
library(dplyr) #sort and arrange results
library(devtools) #to ensure reproducibility and install relevant packages    

setwd()

Load data

pheno_data = read.csv('Vcardui_phenodata.csv')
bg_Vcardui = ballgown(dataDir = "Vc_ballgown", samplePattern = "Vc", pData=pheno_data)

Data Manipulation

Filter

bg_Vc_filt = subset(bg_Vcardui, 'rowVars(texpr(bg_Vcardui)) >1', genomesubset=TRUE)

Identify Transcripts

results_transcripts = stattest(bg_Vc_filt, feature="transcript", covariate="treatment", getFC=TRUE, meas="FPKM")

Identify Genes

results_genes = stattest(bg_Vc_filt,feature="gene", covariate="treatment", getFC=TRUE, meas="FPKM")

Adding Gene Names and IDs

results_transcripts = data.frame(geneNames=ballgown::geneNames(bg_Vc_filt), geneIDs=ballgown::geneIDs(bg_Vc_filt), results_transcripts)

Arrange Data Based on P-Value

results_transcripts = arrange(results_transcripts,pval)
results_genes = arrange(results_genes,pval)

Save Results

Write Results to .csv

write.csv(results_transcripts, "Vc_transcript_results.csv", row.names=FALSE)
write.csv(results_genes, "Vc_gene_results.csv", row.names=FALSE)

Tables

Fold Change Table

fc_values = subset(results_genes,results_genes$fc>2)
fc_values = arrange(fc_values,desc(fc))
print(fc_values)
write.csv(fc_values,"Vc_fc_values_greaterthan_2.csv",row.names=FALSE)

Table of Transcripts with Q-Value < 0.05

transcripts_qvalues = subset(results_transcripts,results_transcripts$qval<0.05)
print(transcripts_qvalues)
write.csv(transcripts_qvalues,"Vc_tran_q_values_lessthan_0.05.csv",row.names=FALSE)

Table of Genes with Q-Value < 0.05

genes_qvalues = subset(results_genes,results_genes$qval<0.05)
print(genes_qvalues)
write.csv(genes_qvalues,"Vc_gene_q_values_lessthan_0.05.csv",row.names=FALSE)

Boxplots

Add Color

colors = brewer.pal(5,"RdBu")
palette(colors)

tropical = c('darkorange', 'dodgerblue', 'hotpink', 'limegreen', 'yellow')
palette(tropical)

FPKM Boxplot

fpkm = texpr(bg_Vcardui,meas="FPKM")
fpkm = log2(fpkm+1)
par(mar=c(9,9,4,2)) 
boxplot(fpkm, col=as.numeric(pheno_data$treatment), las=2, ylab='log2(FPKM+1)',main='Log2 FPKM Values Across Samples')

Individual Boxplots

ballgown::transcriptNames(bg_Vcardui)[10835]

par(mar=c(5,7,5,5))
plot(fpkm[10835,] ~ pheno_data$treatment, border=c(1,2), cex.main=1.5, main=paste(ballgown::geneNames(bg_Vcardui)[10835],' : ', ballgown::transcriptNames(bg_Vcardui)[10835]), pch=19, cex.axis=1.5, cex.lab=1.5, xlab="Treatment", ylab='log2(FPKM+1)')
points(fpkm[10835,] ~ jitter(as.numeric(pheno_data$treatment)), col=as.numeric(pheno_data$treatment))

plot(fpkm[29227,] ~ pheno_data$treatment, border=c(1,2), cex.main=1.5, main=paste(ballgown::geneNames(bg_Vcardui)[29227],' : ', ballgown::transcriptNames(bg_Vcardui)[29227]), pch=19, cex.axis=1.5, cex.lab=1.5, xlab="Treatment", ylab='log2(FPKM+1)')
points(fpkm[29227,] ~ jitter(as.numeric(pheno_data$treatment)), col=as.numeric(pheno_data$treatment))

plot(fpkm[22056,] ~ pheno_data$treatment, border=c(1,2), cex.main=1.5, main=paste(ballgown::geneNames(bg_Vcardui)[22056],' : ', ballgown::transcriptNames(bg_Vcardui)[22056]), pch=19, cex.axis=1.5, cex.lab=1.5, xlab="Treatment", ylab='log2(FPKM+1)')
points(fpkm[22056,] ~ jitter(as.numeric(pheno_data$treatment)), col=as.numeric(pheno_data$treatment))

Summary Plots

Transcript Summary Plots

plotTranscripts(ballgown::geneIDs(bg_Vcardui)[10835], bg_Vcardui, main=c('Gene 10835'), sample=c('VcN1','VcN2','VcN3','VcV1','VcV2','VcV3'))
plotTranscripts(ballgown::geneIDs(bg_Vcardui)[21668], bg_Vcardui, main=c('Gene 21668'), sample=c('VcN1','VcN2','VcN3','VcV1','VcV2','VcV3'))

Means Summary Plots

plotMeans('MSTRG.10806', bg_Vc_filt, groupvar="treatment", legend=FALSE)
plotMeans('MSTRG.29227', bg_Vc_filt, groupvar="treatment", legend=FALSE)
plotMeans('MSTRG.21668', bg_Vc_filt, groupvar="treatment", legend=FALSE)

DEseq2 for gene-level differential expression analysis for the Non-infected and Virally-infected groups

Load Packages and Set Working Directory

library(DESeq2)
library(EnhancedVolcano)
setwd()

Load Data

countData = as.matrix(read.csv("gene_count_matrix.csv",row.names="gene_id"))
colData = read.csv('Vcardui_phenodata.csv',sep=',',row.names=1)
colData$sample = factor(colData$sample)

Data Manipulation

Check Sample IDs in colData Are Also in CountData and Match Their Orders

all(rownames(colData) %in% colnames(countData))
countData = countData[,rownames(colData)]
all(rownames(colData)==colnames(countData))

Run DESeq2

Create a DEseqDataSet From Count Martix and Labels

dds = DESeqDataSetFromMatrix(countData=countData,colData=colData,design=~treatment)

Run Defalut Analysis and Generate Results Table

dds = DESeq(dds,fitType='local')
res = results(dds,alpha=0.05)

Sort by Adjusted P-Value/Fold Change, Display, and Save

(resOrdered = res[order(res$padj),])
(resOrdered1 = res[order(res$log2FoldChange),])
write.csv(resOrdered,"Vc_genes_ordered_by_padj.csv",row.names=FALSE)
write.csv(resOrdered1,"Vc_genes_ordered_by_fc.csv",row.names=FALSE)

Volcano Plots

Make Volcano Plots

EnhancedVolcano(res,x='log2FoldChange',y='pvalue',lab=rownames(res),xlab=bquote(~Log[2]~ "Fold Change"),ylab=bquote(~-Log[10]~italic(P)),title='V vs. N Volcano Plot by P-Value',xlim=c(-35,35))

EnhancedVolcano(res,x='log2FoldChange',y='padj',lab=rownames(res),xlab=bquote(~Log[2]~ "Fold Change"),ylab=bquote(~-Log[10]~adjusted~italic(P)),title='V vs. N Volcano Plot by Adjusted P-Value (FDR)',xlim=c(-35,35))

MA Plots

Make MA plot

DESeq2::plotMA(res,alpha=0.05,main='V vs. N MA Plot (alpha=0.05)')

Ordination Plots

PCA plot

vsd = vst(dds,blind=FALSE)
plotPCA(vsd,intgroup='treatment')
plotPCA(vsd,intgroup='sample')

Histograms

P-Value Histogram Plot

res.df = as.data.frame(res)
ggplot(res.df[!is.na(res.df$pvalue), ], aes(x = pvalue)) +
  geom_histogram(alpha=.5, position='identity', bins = 50) +
  labs(title='Histogram of unadjusted p-values') +
  xlab('Unadjusted p-values') +
  xlim(c(0, 1.0005))

Adjusted P-Values Histogram Plot

ggplot(res.df[!is.na(res.df$padj), ], aes(x = padj)) +
  geom_histogram(alpha=.5, position='identity', bins = 50) +
  labs(title=paste('Histogram of', elementMetadata(res)$description[grep('adjusted', elementMetadata(res)$description)])) +
  xlab('Adjusted p-values') +
  xlim(c(0, 1.0005))+ylim(c(0,1000))

DEseq2 for gene-level differential expression analysis across all four sample groups

This was just for fun and all outputs are included as supplemental data

Load Packages and Set Working Directory

library(DESeq2)
library(EnhancedVolcano)
setwd()

Load Data

countData = as.matrix(read.csv("/media/alexander/selvey_alexander_external/rnaseq_pipeline_project/Vc_data_all/gene_count_matrix.csv",row.names="gene_id"))
colData = read.csv("/media/alexander/selvey_alexander_external/rnaseq_pipeline_project/Vc_data_all/Vcardui_all_phenodata.csv",sep=',',row.names=1)
colData$sample = factor(colData$sample)

Data Manipulation

Check Sample IDs in colData Are Also in CountData and Match Their Orders

all(rownames(colData) %in% colnames(countData))
countData = countData[,rownames(colData)]
all(rownames(colData)==colnames(countData))

Run DESeq2

Create a DEseqDataSet From Count Martix and Labels

dds = DESeqDataSetFromMatrix(countData=countData,colData=colData,design=~treatment)

Run Defalut Analysis and Generate Results Table

dds = DESeq(dds, fitType='local')
res = results(dds,alpha=0.05)
resvn = results(dds, contrast=c("treatment","V","N"),alpha=0.05)
resbs = results(dds, contrast=c("treatment","B","S"),alpha=0.05)
resvb = results(dds, contrast=c("treatment","V","B"),alpha=0.05)
ressn = results(dds, contrast=c("treatment","S","N"),alpha=0.05)

Sort by Adjusted P-Value/Fold Change and Display

#V-vs-N
(resOrderedvn = resvn[order(resvn$padj),])
(resOrderedvn1 = resvn[order(resvn$log2FoldChange),])
write.csv(resOrderedvn,"VcVN_genes_ordered_by_padj.csv",row.names=FALSE)
write.csv(resOrderedvn1,"VcVN_genes_ordered_by_fc.csv",row.names=FALSE)

#B-vs-S
(resOrderedbs = resbs[order(resbs$padj),])
(resOrderedbs1 = resbs[order(resbs$log2FoldChange),])
write.csv(resOrderedbs,"VcBS_genes_ordered_by_padj.csv",row.names=FALSE)
write.csv(resOrderedbs1,"VcBS_genes_ordered_by_fc.csv",row.names=FALSE)

#V-vs-B
(resOrderedvb = resvb[order(resvb$padj),])
(resOrderedvb1 = resvb[order(resvb$log2FoldChange),])
write.csv(resOrderedvb,"VcVB_genes_ordered_by_padj.csv",row.names=FALSE)
write.csv(resOrderedvb1,"VcVB_genes_ordered_by_fc.csv",row.names=FALSE)

#S-vs-N
(resOrderedsn = ressn[order(ressn$padj),])
(resOrderedsn1 = ressn[order(ressn$log2FoldChange),])
write.csv(resOrderedsn,"VcSN_genes_ordered_by_padj.csv",row.names=FALSE)
write.csv(resOrderedsn1,"VcSN_genes_ordered_by_fc.csv",row.names=FALSE)

Volcano Plots

Make Volcano Plots

EnhancedVolcano(resvn,x='log2FoldChange',y='pvalue',lab=rownames(resvn),xlab=bquote(~Log[2]~ "Fold Change"),ylab=bquote(~-Log[10]~italic(P)),title='V vs. N Volcano Plot by P-Value',xlim=c(-35,35))
EnhancedVolcano(resvn,x='log2FoldChange',y='padj',lab=rownames(resvn),xlab=bquote(~Log[2]~ "Fold Change"),ylab=bquote(~-Log[10]~adjusted~italic(P)),title='V vs. N Volcano Plot by Adjusted P-Value (FDR)',xlim=c(-35,35))

EnhancedVolcano(resbs,x='log2FoldChange',y='pvalue',lab=rownames(resbs),xlab=bquote(~Log[2]~ "Fold Change"),ylab=bquote(~-Log[10]~italic(P)),title='B vs. S Volcano Plot by P-Value',xlim=c(-35,35))
EnhancedVolcano(resbs,x='log2FoldChange',y='padj',lab=rownames(resbs),xlab=bquote(~Log[2]~ "Fold Change"),ylab=bquote(~-Log[10]~adjusted~italic(P)),title='B vs. S Volcano Plot by Adjusted P-Value (FDR)',xlim=c(-35,35))

EnhancedVolcano(resvb,x='log2FoldChange',y='pvalue',lab=rownames(resvb),xlab=bquote(~Log[2]~ "Fold Change"),ylab=bquote(~-Log[10]~italic(P)),title='V vs. B Volcano Plot by P-Value',xlim=c(-35,35))
EnhancedVolcano(resvb,x='log2FoldChange',y='padj',lab=rownames(resvb),xlab=bquote(~Log[2]~ "Fold Change"),ylab=bquote(~-Log[10]~adjusted~italic(P)),title='V vs. B Volcano Plot by Adjusted P-Value (FDR)',xlim=c(-35,35))

EnhancedVolcano(ressn,x='log2FoldChange',y='pvalue',lab=rownames(ressn),xlab=bquote(~Log[2]~ "Fold Change"),ylab=bquote(~-Log[10]~italic(P)),title='S vs. N Volcano Plot by P-Value',xlim=c(-35,35))
EnhancedVolcano(ressn,x='log2FoldChange',y='padj',lab=rownames(ressn),xlab=bquote(~Log[2]~ "Fold Change"),ylab=bquote(~-Log[10]~adjusted~italic(P)),title='S vs. N Volcano Plot by Adjusted P-Value (FDR)',xlim=c(-35,35))

MA Plots

Make MA plots

plotMA(resvn,alpha=0.05,main='V vs. N MA Plot (alpha=0.05)') 
plotMA(resbs,alpha=0.05,main='B vs. S MA Plot (alpha=0.05)') 
plotMA(resvb,alpha=0.05,main='V vs. B MA Plot (alpha=0.05)') 
plotMA(ressn,alpha=0.05,main='S vs. N MA Plot (alpha=0.05)')

Ordination Plots

PCA plot

res1 = subset(res,subset=res@listData[['padj']] < 0.05)
vsd = vst(dds,blind=FALSE,fitType='local')
plotPCA(vsd,intgroup='treatment')
plotPCA(vsd,intgroup='sample')

Histograms

P-Value Histogram Plot

res.df = as.data.frame(c(resvn,resvb,resbs,ressn))
ggplot(res.df[!is.na(res.df$pvalue), ], aes(x = pvalue)) +
  geom_histogram(alpha=.5, position='identity', bins = 50) +
  labs(title='Histogram of unadjusted p-values') +
  xlab('Unadjusted p-values') +
  xlim(c(0, 1.0005))

Adjusted P-Values Histogram Plot

ggplot(res.df[!is.na(res.df$padj), ], aes(x = padj)) +
  geom_histogram(alpha=.5, position='identity', bins = 50) +
  labs(title=paste('Histogram of', elementMetadata(c(resvn,resvb,resbs,ressn))$description[grep('adjusted', elementMetadata(c(resvn,resvb,resbs,ressn))$description)])) +
 xlab('Adjusted p-values') +
 xlim(c(0, 1.0005))+ggplot2::ylim(c(0,175))

Rmd BIOL792 Final Project

Alexander Selvey

5/2/2019

Differential Expression Analysis of Vanessa cardui RNA-Seq Data

QC, Sorting, Trimming, Alignment, SAM/BAM Generation, GTF Generation, Transcriptome Assembly, and Table Count Generation From FASTA Files

Perl RNA-Seq Pipeline Script

Ballgown for isoform-level differential expression analysis for the Non-infected and Virally-infected groups

Load Packages and Set Working Directory

Load data

Data Manipulation

Filter

Identify Transcripts

Identify Genes

Adding Gene Names and IDs

Arrange Data Based on P-Value

Save Results

Write Results to .csv

Tables

Fold Change Table

Table of Transcripts with Q-Value < 0.05

Table of Genes with Q-Value < 0.05

Boxplots

Add Color

FPKM Boxplot

Individual Boxplots

Summary Plots

Transcript Summary Plots

Means Summary Plots

DEseq2 for gene-level differential expression analysis for the Non-infected and Virally-infected groups

Load Packages and Set Working Directory

Load Data

Data Manipulation

Check Sample IDs in colData Are Also in CountData and Match Their Orders

Run DESeq2

Create a DEseqDataSet From Count Martix and Labels

Run Defalut Analysis and Generate Results Table

Sort by Adjusted P-Value/Fold Change, Display, and Save

Volcano Plots

Make Volcano Plots

MA Plots

Make MA plot

Ordination Plots

PCA plot

Histograms

P-Value Histogram Plot

Adjusted P-Values Histogram Plot

DEseq2 for gene-level differential expression analysis across all four sample groups

This was just for fun and all outputs are included as supplemental data

Load Packages and Set Working Directory

Load Data

Data Manipulation

Check Sample IDs in colData Are Also in CountData and Match Their Orders

Run DESeq2

Create a DEseqDataSet From Count Martix and Labels

Run Defalut Analysis and Generate Results Table

Sort by Adjusted P-Value/Fold Change and Display

Volcano Plots

Make Volcano Plots

MA Plots

Make MA plots

Ordination Plots

PCA plot

Histograms

P-Value Histogram Plot

Adjusted P-Values Histogram Plot