Data loading, cleaning and Exploration

Let’s first take a look at the dataset. The column name is “gene”, and the row name is “patient ID”. There are 24 patients in the placebo group and 17 patients in the treatment group.
## 
## Placebo   Treat 
##      24      17

Filter the data set

In this case, we will delete the genes that have no counts in any of the patients. This leaves us with 12,410 genes. In this data, 80% gene has no expression in any sample. We remove them from our data set.

keep = rowSums(data1) > 0
genecount1= data1[keep, ]
dim(data1)
## [1] 52580    41
dim(genecount1)
## [1] 12410    41

Data exploration

Our purpose is to detect differentially expressed genes, with a particular focus on identifying samples where the experimental treatment was affected by an abnormality that could render the data obtained from these samples unsuitable for our analysis.

For data exploration and visualization, it is useful to work with transformed versions of the count data. Here we use Since the count values for a gene can be zero in some conditions (and non-zero in others), we advocate the use of pseudocounts, which involve transformations of the form: \[ y=\log _2(K+0.5) \] The left figure shows the counts frequency without transforming and the right figure shows the counts frequency after transforming. The data variates similar across orders of magnitude and we will work with the transformed data set.

GC<-genecount1
GC$Totalcount<- rowSums(GC[,c(1:41)])
par(mfrow=c(2,1))
#p1=ggplot(GC,aes(x = Totalcount)) + geom_histogram(fill = "#525252", binwidth = #5)+theme(plot.margin=unit(c(2, 2, 2, 2),'cm'))
#p1=ggplot(GC,aes(x = Totalcount)) + geom_histogram(colour = "white",fill = "#525252", binwidth = 5)
pseudoCount = log2(GC + 1)
p2=ggplot(pseudoCount, aes(x = Totalcount)) + ylab(expression(log[2](count + 0.5))) +
geom_histogram(colour = "white", fill = "#525252", binwidth = 0.6)+theme(plot.margin=unit(c(2, 2, 2, 2),'cm'))
p2

We will transform the dataset into a DGE List. It is also necessary to generate a Multi-dimensional Scaling (MDS) plot, which provides a visual representation of the pattern of similarities or distances among a set of objects. Distances on the plot can be interpreted as leading log2-fold-change, which refers to the root-mean-square log2-fold-change between the samples for the genes that distinguish those samples.

From the plot, we can observe that Some placebo group samples are clustered towards the left part of the figure. In most part of the figure, the distinction is not clear. But, the two dimension can only explain 26% variance. It is necessary to build a model to determine which genes are differentially expressed.

#pseudoCount = log2(genecount1 + 1)

y <- DGEList(counts=genecount1,group=factor(data2$condition))
#head(y)
dev.new(width=4, height=4, unit="cm")
plotMDS(y, col=as.numeric(y$samples$group))
legend("bottomleft", as.character(unique(y$samples$group)), col=1:2, pch=20)

Parametric model: Negative Binomial based on edgeR

Normalize the data

The first step to build a model is to normalize the data set. Here, we choose the existed “TMM” tool in edgeR package.

Estimating the Dispersion

The first major step in the analysis of DGE data using the NB model is to estimate the dispersion parameter for each tag, which measures the degree of inter-library variation for that tag. Estimating the common dispersion gives an idea of overall variability across the genome for this dataset.

To estimate the dispersion, generally, we will reuse the variable “y” and use the estimateDisp() method. uses the Cox-Reid estimateDisp() make use of quantile-adjusted conditional maximum likelihood (qCML) method and it is only applicable on datasets with a single factor.We can also use profile-adjusted likelihood (CR) method in estimating dispersions when there are multiple factors.

The following code shows the result of qCML method. We then plot the tag-wise biological coefficient of variation (square root of dispersions) against log2-CPM. We can observe that a single estimate for the coefficient of variation is an acceptable model since the tag-wise dispersion decreases as the counts per million (CPM) increases. In this case, I also chose the Bin-spline method (a nonparametric smoothing method) to estimate the trended dispersion. Overall, the common dispersion is 0.36 and the BCV is 0.601.

Each point on the BCV graph represents a gene. Some genes with high logCPM contain low variation, which is in line with biology, and these genes with many replicates are called housekeeping genes. Some genes with middle logCPM but with high variation are of interest for differential expression. We can observe up-regulated genes and down-regulated genes with treatment.

#y1 <- estimateCommonDisp(ynorm, verbose=T)
design.mat <- model.matrix(~ ynorm$samples$group)
colnames(design.mat) <- levels(ynorm$samples$group)
#y2 <- estimateGLMCommonDisp(ynorm,design.mat)
#y2$common.dispersion
#y2 <- estimateGLMTrendedDisp(y2,design.mat, method="bin.spline")
#y2 <- estimateGLMTagwiseDisp(y2,design.mat)
y2 <- estimateDisp(y, design.mat,  method="bin.spline")
y2$common.dispersion
## [1] 0.369794
plotBCV(y2)

Differential Expression based

For routine differential expression analysis, we can use the empirical Bayes tag-wise dispersions. Once the dispersions are estimated, we can proceed with testing procedures to determine differential expression. In edgeR, the function exactTest() conducts tag-wise tests using the exact negative binomial test. As mentioned in the slides, the negative binomial test is identical to the exact test which is also parallel to fisher’s test.

Suppose we have two different conditions, treatment and placebo (\(T\)and \(P\)), with \(m_A\) and \(m_B\) replicate samples, respectively. We would like to test the null hypothesis that there is no difference in expression between the two conditions. This can be formulated as: \[ H_0: n_{g T}=n_{g P} \] where \(n_{g X}\) is the number of counts for the samples of condition \(X\).

We can also utilize likelihood ratio test and quasi-likelihood (QL) F-test. Although LRT is a more obvious choice for inferences with GLMs, the QL F-test is preferred as it reflects the uncertainty in estimating the dispersion for each gene. It provides more robust and reliable error rate control when the number of replicates is small. Comparing the two results, QL F-test can provide three more significant gene.

The test results for the n most significant tags are conveniently displayed by the topTags() function. Here we choose n=2. The resulting p-values can then be adjusted for multiple testing using methods such as the Benjamini-Hochberg procedure to control the false discovery rate (FDR).

In the result table,logFC is effect size estimate. It tells us how much the gene’s expression seems to have changed due to the treatment in comparison to the placebo group.

The top DE tags have tiny p-values and FDR values, as well as large fold changes.The total number of differentially expressed genes at FDR< 0.05 is shown in the summary of de1. Here the entries for -1, 0 and 1 are for down-regulated, non-differentially expressed and up-regulated tags respectively. Compared to the placebo group, the treatment group have 9 down-regulated and 9 up-regulated genes. Usually, the gene with logFC > 1 and FDR < 0.05 are upregulated and who with logFC < 1 or FDR < 0.05 are downregulated. Here, we set the p-value less than 0.05

et12 <- exactTest(y2,pair=c(1,2)) # compare groups 1 and 2
topTags(et12,n=2)
## Comparison of groups:  Treat-Placebo 
##                     logFC   logCPM        PValue           FDR
## ENSG00000154620 -5.409522 3.304863 2.981958e-109 3.700610e-105
## ENSG00000157828 -7.396414 2.362417  5.703878e-58  3.539256e-54
de1 <- decideTestsDGE(et12, adjust.method="BH", p.value = 0.05 , lfc=0)
summary(de1)
##        Treat-Placebo
## Down               8
## NotSig         12395
## Up                 7
#head(de1,n=2)
#de1<-as.data.frame(de1)
fit <- glmQLFit(y2, design.mat)
QL<-glmQLFTest(fit, coef=2)
summary(decideTests(QL))
##        Treat
## Down       8
## NotSig 12390
## Up        12
de1 <- decideTestsDGE(QL, adjust.method="BH", p.value = 0.05 , lfc=0)
plotMD(QL)
abline(h=c(-1, 1), col="blue")

fit <- glmFit(y2, design.mat)
lrt <- glmLRT(fit)
summary(decideTests(lrt))
##        Treat
## Down       9
## NotSig 12393
## Up         8
#de1 <- decideTestsDGE(lrt, adjust.method="BH", p.value = 0.05 , lfc=0)
#plotMD(lrt)
#abline(h=c(-1, 1), col="blue")
topTags(QL,n=2)
## Coefficient:  Treat 
##                     logFC   logCPM        F       PValue          FDR
## ENSG00000154620 -5.411470 3.304863 397.3835 9.820236e-24 1.218691e-19
## ENSG00000157828 -7.375484 2.362417 278.0909 1.062578e-20 4.399800e-17

Parametric Appraoch Based on DESeq2 package

The package provides tools that utilize empirical Bayes techniques to estimate priors for log fold change and dispersion, and to calculate posterior estimates for these quantities. The first step is to create the dds data. The package performs several steps to estimate size factors, dispersions, gene-wise dispersion estimates, mean-dispersion relationships, final dispersion estimates, and fitting models for testing.

We compare the results of differential expression (DE) based on the original data and transformed data. The DE analysis based on original and transformed data shows 6 up-regulated genes and 17 down-regulated genes. However, since the lack of power, 17% genes that have a low count are not powerful enough to result to be significant.

We ordered the result table by adjusted p-value. In the result table, baseMean, is the average of the normalized count values, dividing by size factors, taken over all samples. The column log2FoldChange is the effect size estimate which works similar to the logFC in edgeR. The adjusted p-value is used to correct for multiple testing error which occurs when several hypothesis are tested simultaneously.

dds1<-DESeq(dds)
res <- results(dds1)
#head(results(dds1, tidy=TRUE))
summary(res)
## 
## out of 12407 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 6, 0.048%
## LFC < 0 (down)     : 17, 0.14%
## outliers [1]       : 0, 0%
## low counts [2]     : 2168, 17%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
resOrdered <- res[order(res$padj),]
resOrdered=as.data.frame(resOrdered)
head(resOrdered,n=2)
de2<-as.data.frame(resOrdered)
write.csv(de2, "C:/Users/yinfu/Desktop/研二下/RNA sequence analysis/Class 4/de2.csv", row.names=T)
suppressMessages(dds2 <- DESeq(dds))
rld <- rlogTransformation(dds2)  ## #DESeq2 log normlization
exprSet_new=assay(rld)
## 
res <-  results(dds2) 
summary(res)
## 
## out of 12407 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 6, 0.048%
## LFC < 0 (down)     : 17, 0.14%
## outliers [1]       : 0, 0%
## low counts [2]     : 2168, 17%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
#head(res)
## 
resOrdered <- res[order(res$padj),]
resOrdered=as.data.frame(resOrdered)
head(resOrdered,n=2)

Beyesian with EBSeq

EBSeq is an empirical Bayesian approach that models a number of features observed in RNA-seq data. Here, library size factor may be obtained via the function MedianNorm, which reproduces the median normalization approach in DESeq.

In this package, we assume that the prior distribution is \(Beta(\alpha ,\beta)\) function. The EM algorithm is used to estimate the hyper-parameters \(\alpha\), \(\beta\) and the mixture parameter p. We use 5 iterations to check the convergence. From the result, the prior distribution is \(Beta(0.42 ,1.12)\).The matrix EBDERes$PPMat contains two columns PPEE and PPDE, corresponding to the posterior probabilities of being Equivalent Expression or Differential Expression for each gene.

In practice, QQP generates the Q-Q plot of the empirical distribution vs. the simulated distribution from the Beta prior distribution with estimated hyper parameters. From the QQplot, we can be noticed still a bit of deviation from the normality in both conditions. However, the density plot shows a good negative Binomial approximation which means the NB model fit well in this case.

EBDERes=GetDEResults(EBOut, FDR=0.05)
str(EBDERes$DEfound)
##  chr [1:38] "ENSG00000006757" "ENSG00000010278" "ENSG00000011600" ...
head(EBDERes$PPMat)
##                      PPEE        PPDE
## ENSG00000000003 0.9963635 0.003636458
## ENSG00000000419 0.9972820 0.002717978
## ENSG00000000457 0.9989404 0.001059573
## ENSG00000000460 0.9978708 0.002129157
## ENSG00000000938 0.9967531 0.003246910
## ENSG00000001036 0.9988912 0.001108845
#GeneFC=PostFC(EBOut)
#str(GeneFC)
#PlotPostVsRawFC(EBOut,GeneFC)
par(mfrow=c(1,2))
QQP(EBOut)

EBOut$Alpha
##            [,1]
## iter1 0.4330381
## iter2 0.4263755
## iter3 0.4232318
## iter4 0.4225494
## iter5 0.4218449
EBOut$Beta
##            Ng1
## iter1 1.061551
## iter2 1.125249
## iter3 1.124256
## iter4 1.127799
## iter5 1.124717
EBOut$P
##             [,1]
## iter1 0.14655263
## iter2 0.05717969
## iter3 0.03438135
## iter4 0.02738436
## iter5 0.02504611
par(mfrow=c(1,2))
DenNHist(EBOut)

Nonparametric Analysis Based on NOISeq package

In this part, we will apply the function in the NOISeq package to find differential expression. Non-parametric approaches can be used in case outliers are present, as they are more robust. Simplicity, the principle idea of Non-para test is that the label may not affect of distribution. NOISeq explores the distribution of fold-changes and absolute expression differences between the two contrasted conditions for the observed data and compares this distribution to the corresponding distribution obtained by comparing pairs of samples belonging to the same distribution. Changes in expression between conditions with equivalent magnitudes, as opposed to changes in expression between replicates within a single condition, should not be regarded as indicative of differential expression. Our dataset consists of biological replicates, which are different biological samples processed separately. Biological replicates are required if we want to make any inference about the population we are studying. Noise distribution is obtained by comparing all pairs of replicates within the same condition.

NOISeq also provides the “TMM” method to normalize the dataset. In this package, the differential expression statistic \(\theta\) is defined as (M + D)/2, where M is the log2-ratio of the two conditions and D is the value of the difference between conditions. The probability distribution of θ can be described as a mixture of two distributions: one for features changing between conditions and the other for invariant features.

In the result table, the “prob” values are not equivalent to p-values. They represent the likelihood that the difference in expression is due to the change in the experimental condition and not due to chance. 6 genes are detected. The summary plot for the mean expressed gene level for the treatment and control groups is shown by DEplot. It shows that the gene with high counts contain less variation than these gene with middel counts.

#To compare to the results assuming these are technical replicates, we'll use exactly the same q value cutoff first. However, developers recommend setting q = 0.95 for `NOISeqBIO` 
mynoiseqBio.deg = degenes(mynoiseqBio, q = 0.95, M = NULL)
## [1] "6 differentially expressed features"
par(mfrow = c(1, 2))
DE.plot(mynoiseqBio, q = 0.95, graphic = "expr", log.scale = TRUE)

## [1] "6 differentially expressed features"
head(mynoiseqBio.deg,n=2)
de3<-as.data.frame(mynoiseqBio.deg)
write.csv(de3, "C:/Users/yinfu/Desktop/研二下/RNA sequence analysis/Class 4/de3.csv", row.names=T)

Comparing the three results and Further analysis (Upregulated–enrichment analysis)

In conclusion, use parametric models when assumptions are met and there are few outliers. Bayesian approaches are effective when noise within groups may influence the actual differences between conditions. Non-parametric methods should be employed when dealing with numerous outliers. Note that data cleaning and normalization are often done in the begining and integrated into the function used to test differential expressed genes.

Comparing the three results, we found that the parametric method (edgeR and DESeq2) can detect about 20 genes. The Bayesian method can detect about 38 genes. The nonparametric method is the most strict method, as it does not make any assumption about the dataset. The three methods found some same genes. I combined the result of edgeR and DESeq2 in the merged dataset. From the merged dataset, we find that there are 3 common up-regulated genes and 8 common down-regulated genes.

We can also do annotation for these significant genes which means determining what those genes do. This process usually be down automaticly in the beginning of the whole of process by some R code. Besides, if there are a lot upregulated genes, we make use of clusterProfiler to do Over-Repressentation Analysis(ORA),Functional Class Scoring(FCS) analysis.
However, since it takes long time and not enough biological background were provided. Here, I picked one significant gene that the three methods detected, i.e ENSG00000006757. It is protein coding type gene and the protein link to abundant triacylglycerol lipase activity(Uniport). It indicates that after the treatment, the Catalytic activity in the human body increase.

de1<-na.omit(de1)
write.csv(de1, "C:/Users/yinfu/Desktop/研二下/RNA sequence analysis/Class 4/de1.csv", row.names=T)
de1<-read.csv("C:/Users/yinfu/Desktop/研二下/RNA sequence analysis/Class 4/de1.csv")
de2<-read.csv("C:/Users/yinfu/Desktop/研二下/RNA sequence analysis/Class 4/de2.csv")
de2<-head(de2,13)
MergedDF <- merge(de1, de2)
MergedDF
#up_regulated <- subset(MergedDF, Treat.Placebo > 0)
#down_regulated <- subset(MergedDF, Treat.Placebo < 0)

The following are up-regulated gene plot

par(mfrow=c(1,3))
plotCounts(dds, gene="ENSG00000006757")
plotCounts(dds, gene="ENSG00000154096")
plotCounts(dds, gene="ENSG00000186629")

par(mfrow=c(2,4))
plotCounts(dds, gene="ENSG00000099749")
plotCounts(dds, gene="ENSG00000102962")
plotCounts(dds, gene="ENSG00000129824")
plotCounts(dds, gene="ENSG00000154620")
plotCounts(dds, gene="ENSG00000157828")
plotCounts(dds, gene="ENSG00000158481")
plotCounts(dds, gene="ENSG00000183878")
plotCounts(dds, gene="ENSG00000198692")