The identification of differentially expressed genes using the Wilcoxon rank-sum test

This is a tutorial for identifying differentially expressed genes using Wilcoxon rank-sum test for large-sample-size RNA-seq datasets.

If you use this tutorial in your published research, please cite

Li Y, Ge X, Peng F, Li W, Li JJ. Exaggerated false positives by popular differential expression methods when analyzing human population samples. Genome Biol 23, 79 (2022). https://doi.org/10.1186/s13059-022-02648-4

Load required R Packages

suppressWarnings(library(edgeR, quietly = T))

Read the read count matrix file and the condition labels file.

Read count matrix file: each row is a gene and each column is a sample; the first column is the gene name; the first row is the sample name.

Condition labels file: one row with condition labels corresponding to each sample in the read count matrix file.

You can download the example files from https://github.com/xihuimeijing/DEGs_Analysis_FDR/tree/main/RMarkdown_Wilcoxon/examples

readCount<-read.table(file="examples/examples.countMatrix.tsv", header = T, row.names = 1, stringsAsFactors = F,check.names = F)
conditions<-read.table(file="examples/examples.conditions.tsv", header = F)
conditions<-factor(t(conditions))

Count matrix preprocessing using edgeR package

y <- DGEList(counts=readCount,group=conditions)
##Remove rows consistently have zero or very low counts
keep <- filterByExpr(y)
y <- y[keep,keep.lib.sizes=FALSE]
##Perform TMM normalization and transfer to CPM (Counts Per Million)
y <- calcNormFactors(y,method="TMM")
count_norm=cpm(y)
count_norm<-as.data.frame(count_norm)

Run the Wilcoxon rank-sum test for each gene

pvalues <- sapply(1:nrow(count_norm),function(i){
     data<-cbind.data.frame(gene=as.numeric(t(count_norm[i,])),conditions)
     p=wilcox.test(gene~conditions, data)$p.value
     return(p)
   })
fdr=p.adjust(pvalues,method = "fdr")

Calculate the fold-change for each gene

conditionsLevel<-levels(conditions)
dataCon1=count_norm[,c(which(conditions==conditionsLevel[1]))]
dataCon2=count_norm[,c(which(conditions==conditionsLevel[2]))]
foldChanges=log2(rowMeans(dataCon2)/rowMeans(dataCon1))

Output results based on FDR threshold

outRst<-data.frame(log2foldChange=foldChanges, pValues=pvalues, FDR=fdr)
rownames(outRst)=rownames(count_norm)
outRst=na.omit(outRst)
fdrThres=0.05
write.table(outRst[outRst$FDR<fdrThres,], file="examples/examples.WilcoxonTest.rst.tsv",sep="\t", quote=F,row.names = T,col.names = T)