Why must expression values be log2-transformed before further analysis?

The general reason to log-transform data (log2 or otherwise) is to make variation similar across orders of magnitude. This isn’t really a must, but usually makes things more convenient.
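As a small illustration of that point, the sketch below compares a two-fold change at a low and at a high expression level. The values are made up for illustration and are not taken from the example data. On the natural scale the two differences are 10 and 1000, while on the log2 scale both are 1.

```r
# Hypothetical expression values, chosen only for illustration
before <- c(low = 10, high = 1000)
after  <- before * 2                      # a two-fold increase at both levels

# On the natural scale the difference depends on the magnitude
diffNatural <- after - before             # 10 and 1000

# On the log2 scale the same fold change gives the same difference
diffLog2 <- log2(after) - log2(before)    # 1 and 1 (up to floating-point rounding)

print(diffNatural)
print(diffLog2)
```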

Indeed, microarray values and RNA-seq RPKM/FPKM values are better correlated when log-transformed. The reason is that the distribution of expression or RPKM/FPKM values is skewed, and log-transforming brings it closer to a normal distribution. Many common statistical tests (for example, the t-test) assume approximately normally distributed data.
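The effect on skewness can be checked with a short simulation. This is only a sketch: the values are simulated from a log-normal distribution (an assumption made for illustration, not the lab's data), and `skewness` is a small helper defined here rather than a base R function.

```r
set.seed(1)  # make the simulation reproducible

# Simulated right-skewed "expression" values (log-normal); not real data
natural <- rlnorm(1000, meanlog = 5, sdlog = 1.5)
logged  <- log2(natural)

# Sample skewness: about 0 for a symmetric distribution, positive for a long right tail
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skewNatural <- skewness(natural)
skewLog2    <- skewness(logged)
print(c(natural = skewNatural, log2 = skewLog2))

# hist(natural); hist(logged)  # uncomment to compare the two histograms
```

The skewness of the natural values should be clearly positive, while the skewness of the log2 values should be close to zero.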

See the example expression data in natural (untransformed) values below. These are the original RNA-seq read counts from 836 miRNA samples. The histogram shows the frequency of read counts per read length, and it is skewed.

Figure 1. Natural value data

The screenshot below shows the expression data as log2 values. As the histogram shows, the log2-transformed read counts are approximately normally distributed.

Figure 2. Log2 value data

1. Input data preparation

You have to prepare your input files in the same format as the examples shown here. Edit your input data in Microsoft Excel and save it in Text (tab delimited) format.

The example input file, “TCGA_578s15miRNA_naturalValue.txt”, is at dbgap:
D:/DataTransfer/Resources/LabStandard_DataProcess/ConvertToLog2

Figure 3. Input data with natural values
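If you prefer to check the format from R instead of Excel, the sketch below writes a tiny tab-delimited file and reads it back the same way the conversion script does (header row, identifiers in the first column). The miRNA names, sample names, and counts are hypothetical, and the choice of miRNAs as rows and samples as columns is an assumption; follow the layout in Figure 3 for real data.

```r
# Hypothetical miRNA and sample names; the values are made-up read counts
example <- data.frame(
  Sample1 = c(120, 0, 35),
  Sample2 = c(98, 4, 0),
  row.names = c("miRNA-A", "miRNA-B", "miRNA-C")
)

# Write a tab-delimited text file with the identifiers in the first column
exampleFile <- tempfile(fileext = ".txt")
write.table(example, exampleFile, col.names = NA, row.names = TRUE,
            quote = FALSE, sep = "\t")

# Read it back with the same options the conversion script uses
check <- read.table(exampleFile, sep = "\t", row.names = 1, header = TRUE,
                    check.names = FALSE)
print(check)
```

If every column of `check` is numeric and the row names are the identifiers, the file should be readable by the conversion script.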

3. Configuration

This is the part of the script that you need to edit. Set the working directory to the folder where you saved the data file you want to convert, and set the input and output file names. The output file is written to the same working directory.

###############################################
##############  Configuration #################
###############################################
workingDirectory="D:/DataTransfer/Resources/LabStandard_DataProcess/ConvertToLog2"
inFilePath="TCGA_578s15miRNA_naturalValue.txt"
outFilePath="TCGA_578s15miRNA_log2Value.txt"

♦ workingDirectory: The directory address where you saved the input file
♦ inFilePath: The input file name of data that have natural values
♦ outFilePath: The output file name for the data that have log2 values
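Before running the conversion, it can help to confirm that the configured paths exist. The helper below is a hypothetical addition and is not part of ConvertToLog2.R; it uses only base R functions (`dir.exists`, `file.exists`, `file.path`).

```r
# Hypothetical helper: stop early if the configured paths do not exist
checkConfig <- function(workingDirectory, inFilePath) {
  if (!dir.exists(workingDirectory)) {
    stop("workingDirectory does not exist: ", workingDirectory)
  }
  fullPath <- file.path(workingDirectory, inFilePath)
  if (!file.exists(fullPath)) {
    stop("Input file not found: ", fullPath)
  }
  invisible(fullPath)  # return the full path of the input file
}

# Usage with the configuration values set above:
# checkConfig(workingDirectory, inFilePath)
```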

4. Run the ConvertToLog2.R file.

##############################################################
## Convert natural values of expression level to log2 values
##############################################################
setwd(workingDirectory)  # read and write files relative to the configured directory

# Read the tab-delimited input: header row, identifiers in the first column
data <- read.table(inFilePath, sep="\t", stringsAsFactors=FALSE, row.names=1, header=TRUE, check.names=FALSE)
data <- as.matrix(log2(data))

# log2(0) is -Inf; replace those entries with 0.
# Note: a natural value of 1 also gives 0, so zeros and ones are no longer distinguishable.
data[which(data == -Inf)] <- 0

# Write the result to the configured output file
write.table(data, outFilePath, col.names=NA, row.names=TRUE, quote=FALSE, sep="\t")

Then you will get an output file in your working directory. The file name is TCGA_578s15miRNA_log2Value.txt, as assigned in outFilePath.
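To see what the conversion does to individual values, here is a minimal worked example on made-up numbers (not from the example file). It applies the same two steps as the script: take log2, then replace -Inf with 0.

```r
# Made-up natural values, including a zero
natural <- c(0, 1, 2, 8)

logged <- log2(natural)                    # -Inf, 0, 1, 3

# Same replacement as the script: -Inf becomes 0
converted <- logged
converted[which(converted == -Inf)] <- 0

print(rbind(natural, logged, converted))
```

Note that the natural values 0 and 1 both end up as 0 after conversion, so keep this in mind when interpreting zeros in the output file.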

Figure 4. Output data with log2 values
