Differential Expression Analysis in R

Identification of RNA biomarkers in breast cancer

Introduction to RNA sequencing - Using publicly available data

Living organisms codes for instructions of life in their genome present in the cell nucleus. The human genome is made up of double stranded deoxyrebonucleic acid (DNA), coiled and packaged inside the chromosomes. It has 23 chromosomes while different animals have varied number of chromosomes.

The DNA is made of buiding blocks called nucleotides. There are four different nucleotides namely: Adenine (A), Cytosine(C), Guanine (G) and Thymine (T). The double stranded DNA forms a helix to the sugar phosphate backbone whereby the nucleotides A pairs with T and C pairs with G. The order of nucleotide pairing form a DNA sequence. Within DNA sequences, there are regions known as genes. Genes provide instructions to make proteins and proteins inturn performs cellular functions.

In order for the proteins to be formed, DNA is transcribed into messenger ribonuclei acid (mRNA) that gets translated into proteins by the ribosomes. Some genes encodes for RNA that do not get translated into protein(s), such RNAs are known as non-coding RNA. They have functions within themselves. Other types of RNA include tRNA, rRNA, sRNA among others. All the RNAs transcribed from genes are called transcripts.

Messenger RNA has to undergo procesing to enable translation into proteins. Genes are transcribed into pre-mRNA. The transcribed mRNA have intronic sequences. After post transcription process, the introns are spliced out and poly A tail and 5’ caps added to yield mature RNA transcripts. Mature mRNA transcripts can be translated into proteins. Poly A tail is a sequence of A’s at the end of the transcript. Many of the non-coding RNAs do not have poly A tail.

All cells contains the same DNA sequence and the difference only arises from the different genes that are turned on and off in the respective cells and the different RNAs and proteins that are produced in the process. Disease causing mutation can result in differences in what genes are turned on or expressed and which are turned off. RNA sequencing quantifiies all the genes that have been expressed in different conditions. this technique is essential in answering questions such as:

What genes are expressed between sample groups
What are the trends in gene expression over time accross conditions
What group of genes change similarly over time accross conditions
What processes or pathways are essential in the condition of interest

RNA sequencing a growing technology and multiple tools have been developed to enable its operation. One of the widely used platform is Galaxy online dataprocessing platfoem and R programming language. R programming language uses R developed and Bioconductor packages such DESEq2 package for analysis.

library(DESeq2)
library(RColorBrewer)
library(pheatmap)
library(tidyverse)
library(dplyr)
library(ggplot2)

The above listed packages need to be installed and loaded into the R Studio environment before any analysis begin.

RNA Sequence Workflow

Planning is essential prior to begining RNA workflow. This step of the nalayis is crutial for the outcome on the results as there is often no saving of poorly designed experiment. There are certain considerations during the planning. These includes replicates, batch effects and compounding. For RNA sequence experiments, there is generally low technical variation. The more biological replicates you have for the experiment the better estimates for variation and expression. (In my study , I will have atleast 3 replicates for each sample selected). Experiments perfomed as batches should be avoided as much as possible and note down all the expperimental variables accross all conditions avoid batches as well. Compounding of experimets with major sources of variations such animals of different sexes should be avoided i.e do not have all male mice as control and female mice as treatment as in this case differentiation can not be based on sex. After the experiment have been well planned out then we can begin the sample preparation. Typically, the workflow follows the outlined steps below.

Computatoional analysis of RNA begins once the FASTQ files have been obtained. Normally, it is computationally intensive to manupulate the FASTQ files considering their massive sizes. Therefore, Galaxy or Unix Commandline can be used for this purpose.

As high throughput sequencing becomes more afforadable and and accesible to a wider community of researchers, lots of publicly available data of RNA sequence reads can be obtained in data repositories. In this case, we obtained RNA sequence data from NCBI’s SRA database. This database in full of publicly avalibel data that have been sequenced and stored in SRA format. Connvering these data requires strong coputaional capabilities such as Unix commandline or use of Galaxy platform.

From Reads to Counts

Generating count matrix is the ultimate goal before any quantitative analysis of the differentially expressed genes can begin. Being a coputationally intensive process, we chose to use Gaxaxy platform for generating reads, performing quality analysis of the FASTQ files and aligning the reads to the refrence genome and finally generating a count matrix for analyis in R.

Using the Galaxy platform, RNA-seq reads were converted to counts, quality control (QC) of RNA-seq reads was performed and interactive reports to summarize the QC information were generated.

In this practice, a list of cell line RNA sequence data was obtained from NCBI’s SRA data base. Data from GEO dataset Accession number GSE113650 will be used in this analysis. This Bioproject, accesion number PRJNA453538 compiles a list of breast cancer RNA-seq data. In this bioproject, the reserchers searched for triple negative breast cancer (TNBC) expressing candidates for target therapy. The aim is to compare non TNBC and TNBC high expressing genes in both cell lines and pick up non cancer expressing genes. Among the cell lines used in this study include Basal-like TNBC HCC70 cell line, Basal-like TNBC HCC1143 cell line, Luminal(non-TNBC) T47D cell line and Luminal(non-TNBC) MCF7 cell lines respectively.

Galaxy Workflow

A collection list was creted from NCBi’s SRA platform and uploaded into Galaxy. In Galaxy the SRA files were fetched, downloaded and converted from SRA to FASTQ format using Galaxy’s SRA dump feature. More often than not,errors are introduced during sequencing. such errors include incorrect calling of nucleotides. Such error can result from technocal limitations of the sequening platforms used. These errors leads to bias during analysis resulting in misinterpretation of the data obtained. Every base that has been sequenced gets a score form the seqencer. A quality score of 30 ressonates to 1 in 1000 chance of incorrect base calling.

To check the qaulity of of the called bases in our read sequence, we used Galaxy’s FastQC feature. Here, the number of unassigned reads and assigned reads were determined.

Incase there was contamination by adaptor from the sequencing facility, we would use Cutadapt feature of Galaxy to trim out adaptors and any other low qality bases at the end of the sequence and also remove any sequences that are shorter than 200 bp.

Now that the reads have been prepared and their quality meets the threshold, we can proceed to map them to a reference human genome. Galaxy’s HISAT2 feature was used to align the reads to the reference gneome. The aligneed reads are generated in BAM format. We checked the percentage of reads that were alligned to the reference genome. Low mapping percentages indicates problems with the data or the overall analysis. MultiQC feature was used in the overall galaxy workflow to get aggregate QC report on every step.

We also perfomed QC on the number of reads that were mapped to each chromosome using SAM TOOL’S IdxStats feature in Galaxy. This generated a report to enable us check the sex of the sample by looking at the reads that were mapped to the X/Y chromoseomes and to determine mitochondrial contamination if there was any. The gene body coverage (5’-3’) was also checked to determine if there was nay bias in the reads coverage. This was achieved by using Gene Body Cooverage(BAM) feature in Galaxy.

Last but not least, reads distribution across exons, introns and intergenic distribution was determined using the Read Distribution feature.

Finally, the reads mapped to the refrence gene were counted by featureCounts feature on the BAM files. The Galaxy workflow was completed by creating a count matrix by Column Join on Collectin feature.

The next steps were conducted in R Studio using R programming language and a bunch of packages predominated by DESEQ2 package for differential expression analysis.

library(readr)
library(dplyr)

rawcounts <- read.table("TNBC_count_data.txt", header = TRUE, row.names = 1)
colnames(rawcounts)

##  [1] "SRR7063013" "SRR7063014" "SRR7063015" "SRR7063016" "SRR7063021"
##  [6] "SRR7063022" "SRR7063023" "SRR7063024" "SRR7063029" "SRR7063030"
## [11] "SRR7063031" "SRR7063032" "SRR7063033" "SRR7063034" "SRR7063035"
## [16] "SRR7063036"

str(rawcounts)

## 'data.frame':    28395 obs. of  16 variables:
##  $ SRR7063013: int  10 0 27 2 0 0 0 0 0 0 ...
##  $ SRR7063014: int  8 0 18 1 1 0 0 0 0 0 ...
##  $ SRR7063015: int  18 0 29 0 0 0 0 0 0 0 ...
##  $ SRR7063016: int  11 0 29 0 0 0 0 0 0 0 ...
##  $ SRR7063021: int  42 0 32 2 0 0 0 0 0 0 ...
##  $ SRR7063022: int  38 1 32 1 0 0 0 0 0 0 ...
##  $ SRR7063023: int  35 3 35 4 0 0 0 0 0 0 ...
##  $ SRR7063024: int  35 2 40 2 1 0 0 0 0 0 ...
##  $ SRR7063029: int  2 0 187 0 0 0 0 0 0 0 ...
##  $ SRR7063030: int  2 0 175 1 0 0 0 0 0 0 ...
##  $ SRR7063031: int  4 0 185 2 0 0 0 0 0 0 ...
##  $ SRR7063032: int  2 0 172 0 0 0 0 0 0 0 ...
##  $ SRR7063033: int  8 0 182 1267 57 0 0 0 0 0 ...
##  $ SRR7063034: int  15 0 147 1295 32 0 0 0 0 0 ...
##  $ SRR7063035: int  20 2 155 1368 23 0 0 0 0 0 ...
##  $ SRR7063036: int  13 0 165 1299 48 0 0 0 0 0 ...

###sample metadata

cell_line <- c(rep("HCC1143", 4), rep("HCC70", 4), rep("T47D", 4), rep("MCF7", 4))
condition <- c(rep("tnbc", 8), rep("non_tnbc", 8))

count_metadata <- data.frame(cell_line, condition)
count_metadata

##    cell_line condition
## 1    HCC1143      tnbc
## 2    HCC1143      tnbc
## 3    HCC1143      tnbc
## 4    HCC1143      tnbc
## 5      HCC70      tnbc
## 6      HCC70      tnbc
## 7      HCC70      tnbc
## 8      HCC70      tnbc
## 9       T47D  non_tnbc
## 10      T47D  non_tnbc
## 11      T47D  non_tnbc
## 12      T47D  non_tnbc
## 13      MCF7  non_tnbc
## 14      MCF7  non_tnbc
## 15      MCF7  non_tnbc
## 16      MCF7  non_tnbc

create metadata row names associated with the sample column names

rownames(count_metadata) <- c("SRR7063036", "SRR7063035", "SRR7063034", "SRR7063033",
                              "SRR7063032", "SRR7063031", "SRR7063030", "SRR7063029", 
                              "SRR7063024", "SRR7063023",   "SRR7063022",   "SRR7063021",   
                              "SRR7063016", "SRR7063015",   "SRR7063014",   "SRR7063013")
  count_metadata

##            cell_line condition
## SRR7063036   HCC1143      tnbc
## SRR7063035   HCC1143      tnbc
## SRR7063034   HCC1143      tnbc
## SRR7063033   HCC1143      tnbc
## SRR7063032     HCC70      tnbc
## SRR7063031     HCC70      tnbc
## SRR7063030     HCC70      tnbc
## SRR7063029     HCC70      tnbc
## SRR7063024      T47D  non_tnbc
## SRR7063023      T47D  non_tnbc
## SRR7063022      T47D  non_tnbc
## SRR7063021      T47D  non_tnbc
## SRR7063016      MCF7  non_tnbc
## SRR7063015      MCF7  non_tnbc
## SRR7063014      MCF7  non_tnbc
## SRR7063013      MCF7  non_tnbc

check whether the metadata row names matches the count data column names using the all() function use the match() to match the column names of count data to the row names of metadata

all(rownames(count_metadata) == colnames(rawcounts))

## [1] FALSE

rownames(count_metadata)

##  [1] "SRR7063036" "SRR7063035" "SRR7063034" "SRR7063033" "SRR7063032"
##  [6] "SRR7063031" "SRR7063030" "SRR7063029" "SRR7063024" "SRR7063023"
## [11] "SRR7063022" "SRR7063021" "SRR7063016" "SRR7063015" "SRR7063014"
## [16] "SRR7063013"

colnames(rawcounts)

##  [1] "SRR7063013" "SRR7063014" "SRR7063015" "SRR7063016" "SRR7063021"
##  [6] "SRR7063022" "SRR7063023" "SRR7063024" "SRR7063029" "SRR7063030"
## [11] "SRR7063031" "SRR7063032" "SRR7063033" "SRR7063034" "SRR7063035"
## [16] "SRR7063036"

idx <- match(colnames(rawcounts), rownames(count_metadata))

reorder the matched data

reordered_count_metadata <- count_metadata[idx, ]
all(rownames(reordered_count_metadata) == colnames(rawcounts))

## [1] TRUE

create DESeq2 object

library(DESeq2)
dds_tnbc <- DESeqDataSetFromMatrix(countData = rawcounts,
                                   colData = reordered_count_metadata, 
                                   design = ~ condition)

dds_tnbc

## class: DESeqDataSet 
## dim: 28395 16 
## metadata(1): version
## assays(1): counts
## rownames(28395): 1 10 ... 9994 9997
## rowData names(0):
## colnames(16): SRR7063013 SRR7063014 ... SRR7063035 SRR7063036
## colData names(2): cell_line condition

normalize raw counts calculate normalized counts

dds_tnbc <- estimateSizeFactors(dds_tnbc)
sizeFactors(dds_tnbc)

## SRR7063013 SRR7063014 SRR7063015 SRR7063016 SRR7063021 SRR7063022 
##  0.9826010  0.9912131  0.9991431  0.9880352  0.9933680  1.0089220 
## SRR7063023 SRR7063024 SRR7063029 SRR7063030 SRR7063031 SRR7063032 
##  1.0105104  1.0020592  0.9668933  0.9758933  0.9776894  0.9745721 
## SRR7063033 SRR7063034 SRR7063035 SRR7063036 
##  1.0848983  1.0875630  1.0999252  1.0913644

extract normalized counts

normalized_tnbc_counts <- counts(dds_tnbc, normalized = TRUE)
head(normalized_tnbc_counts)

##           SRR7063013 SRR7063014 SRR7063015 SRR7063016 SRR7063021
## 1          10.177070   8.070918   18.01544   11.13321  42.280403
## 10          0.000000   0.000000    0.00000    0.00000   0.000000
## 100        27.478090  18.159566   29.02487   29.35118  32.213641
## 1000        2.035414   1.008865    0.00000    0.00000   2.013353
## 10000       0.000000   1.008865    0.00000    0.00000   0.000000
## 100008587   0.000000   0.000000    0.00000    0.00000   0.000000
##           SRR7063022 SRR7063023 SRR7063024 SRR7063029 SRR7063030
## 1         37.6639631  34.635962  34.928076   2.068481   2.049404
## 10         0.9911569   2.968797   1.995890   0.000000   0.000000
## 100       31.7170215  34.635962  39.917802 193.402930 179.322885
## 1000       0.9911569   3.958396   1.995890   0.000000   1.024702
## 10000      0.0000000   0.000000   0.997945   0.000000   0.000000
## 100008587  0.0000000   0.000000   0.000000   0.000000   0.000000
##           SRR7063031 SRR7063032  SRR7063033 SRR7063034  SRR7063035
## 1           4.091279   2.052183    7.373963   13.79230   18.183054
## 10          0.000000   0.000000    0.000000    0.00000    1.818305
## 100       189.221654 176.487715  167.757662  135.16459  140.918668
## 1000        2.045639   0.000000 1167.851418 1190.73564 1243.720890
## 10000       0.000000   0.000000   52.539488   29.42358   20.910512
## 100008587   0.000000   0.000000    0.000000    0.00000    0.000000
##           SRR7063036
## 1           11.91170
## 10           0.00000
## 100        151.18690
## 1000      1190.25322
## 10000       43.98164
## 100008587    0.00000

compare the counts between different samples from the normalized counts use heatmaps and PCA for the comparison and visualization to improve the visualization of clustering, log transform the normalized counts using vst() functionm set blint to TRUE on the DESeq2 object

log_transformed_tnbc <- vst(dds_tnbc, blind = TRUE)

extract the log transformed values using assay(). Log transformed values are used in plotting heatmaps

log_vst_transformed_tnbc <- assay(log_transformed_tnbc)

compute pairwise correlation values between the transformed values. this gives correlation between each of the sample pairs

log_cor_tnbc <- cor(log_vst_transformed_tnbc)
head(log_cor_tnbc)

##            SRR7063013 SRR7063014 SRR7063015 SRR7063016 SRR7063021
## SRR7063013  1.0000000  0.9976934  0.9977668  0.9977345  0.9343396
## SRR7063014  0.9976934  1.0000000  0.9977506  0.9977423  0.9340874
## SRR7063015  0.9977668  0.9977506  1.0000000  0.9978063  0.9341225
## SRR7063016  0.9977345  0.9977423  0.9978063  1.0000000  0.9341436
## SRR7063021  0.9343396  0.9340874  0.9341225  0.9341436  1.0000000
## SRR7063022  0.9342195  0.9339972  0.9340174  0.9340140  0.9977446
##            SRR7063022 SRR7063023 SRR7063024 SRR7063029 SRR7063030
## SRR7063013  0.9342195  0.9342121  0.9343300  0.8957332  0.8959401
## SRR7063014  0.9339972  0.9339506  0.9341296  0.8954832  0.8957177
## SRR7063015  0.9340174  0.9340016  0.9341450  0.8955968  0.8958214
## SRR7063016  0.9340140  0.9339541  0.9341303  0.8957355  0.8959264
## SRR7063021  0.9977446  0.9977555  0.9977458  0.8918946  0.8923080
## SRR7063022  1.0000000  0.9978161  0.9978669  0.8915445  0.8919359
##            SRR7063031 SRR7063032 SRR7063033 SRR7063034 SRR7063035
## SRR7063013  0.8958176  0.8955690  0.8972205  0.8972234  0.8971701
## SRR7063014  0.8955798  0.8952946  0.8970229  0.8969982  0.8969391
## SRR7063015  0.8957280  0.8954504  0.8972533  0.8972531  0.8971976
## SRR7063016  0.8958433  0.8955715  0.8970651  0.8970995  0.8970408
## SRR7063021  0.8920547  0.8920285  0.8882633  0.8882359  0.8879210
## SRR7063022  0.8917117  0.8916691  0.8880183  0.8880125  0.8877317
##            SRR7063036
## SRR7063013  0.8972428
## SRR7063014  0.8971397
## SRR7063015  0.8973284
## SRR7063016  0.8971529
## SRR7063021  0.8881778
## SRR7063022  0.8880254

generate heatmap of the correlated values

library(pheatmap)
pheatmap(log_cor_tnbc, annotation = select(reordered_count_metadata, condition))

perform PCA on the log transformed object

plotPCA(log_transformed_tnbc, intgroup = "condition")

we have explored the data above and now we can begin the DE analysis with DESeq2 Create a design formula:it tells DESeq2 the known major sources of variation to control for or regress out as well as condition of interest to use for DE testing. It should be similar to the metadata

dds_tnbc <- DESeq(dds_tnbc)

calculate the mean-variance relationship between genes

mean_counts <- apply(rawcounts[, 1:3], 1, mean)
head(mean_counts)

##          1         10        100       1000      10000  100008587 
## 12.0000000  0.0000000 24.6666667  1.0000000  0.3333333  0.0000000

variance_counts <- apply(rawcounts[, 1:3], 1, var)
head(variance_counts)

##          1         10        100       1000      10000  100008587 
## 28.0000000  0.0000000 34.3333333  1.0000000  0.3333333  0.0000000

determine the dispersion of the genes plot the relationship between mean and variance create a dataframe with mean and variance for every gene # each black dot represents a gene

library(ggplot2)
df <- data.frame(mean_counts, variance_counts)
ggplot(df) +
  geom_point(aes(x=mean_counts, y=variance_counts)) +
  scale_y_log10() +
  scale_x_log10() +
  xlab("Mean counts per gene") +
  ylab("Variance per gene")

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Transformation introduced infinite values in continuous x-axis

plot dispersion estimate on the DESeq object

plotDispEsts(dds_tnbc)

run the deseq2 model and proceed with the DE analysis. compare results bewteen the conditions

results(dds_tnbc, alpha = 0.05)

## log2 fold change (MLE): condition tnbc vs non tnbc 
## Wald test p-value: condition tnbc vs non tnbc 
## DataFrame with 28395 rows and 6 columns
##                baseMean    log2FoldChange              lfcSE
##               <numeric>         <numeric>          <numeric>
## 1      16.1517125593302 -1.66657072477551  0.528038713110786
## 10    0.485884323243151 -1.34154066771732   2.17705497538572
## 100    98.4975708469911  2.45652440140871  0.149514655256122
## 1000   300.477161321999  8.64153449005958   1.38004655803206
## 10000  9.30387725157366   5.9160961156437   1.68756940266714
## ...                 ...               ...                ...
## 9991   3990.90432320118 0.417678537910221  0.181339607114812
## 9992   4.53976685326931  1.50528046670504  0.563598482433172
## 9993   1014.29155964696 0.439000181576629 0.0627697793663041
## 9994   529.454088746119 0.332923038316119 0.0945239430294878
## 9997   9.72892335247021 0.563324209710535  0.413251166022334
##                     stat               pvalue                 padj
##                <numeric>            <numeric>            <numeric>
## 1        -3.156152538433  0.00159865231809849   0.0043035623807905
## 10    -0.616218094115713    0.537750587910477                   NA
## 100     16.4299907403768 1.16676764268152e-60 9.28257417867291e-59
## 1000    6.26177025678203 3.80631442401815e-10 2.80067420934878e-09
## 10000   3.50569055488535 0.000455424368444192  0.00137911940017739
## ...                  ...                  ...                  ...
## 9991    2.30329460042215   0.0212622732425948   0.0425126188103087
## 9992    2.67083839581404  0.00756620639359585   0.0172426296854567
## 9993    6.99381431651633  2.6751156522809e-12 2.46373312864589e-11
## 9994    3.52210273551813 0.000428138191315946  0.00130357948187963
## 9997    1.36315213610333    0.172834568417487    0.252548773677274

create contrasts between conditions specify condition of interest, level to compare and the base level # use the syntax results(dds, contrast = c(“condition_factor”, “level_to_compare”, “base_level”), alpha = 0.05)

tnbc_results <- results(dds_tnbc,
                       contrast = c("condition", "tnbc", "non_tnbc" ),
                       alpha = 0.05)
tnbc_results

## log2 fold change (MLE): condition tnbc vs non_tnbc 
## Wald test p-value: condition tnbc vs non tnbc 
## DataFrame with 28395 rows and 6 columns
##                baseMean    log2FoldChange              lfcSE
##               <numeric>         <numeric>          <numeric>
## 1      16.1517125593302 -1.66657072477551  0.528038713110786
## 10    0.485884323243151 -1.34154066771732   2.17705497538572
## 100    98.4975708469911  2.45652440140871  0.149514655256122
## 1000   300.477161321999  8.64153449005958   1.38004655803206
## 10000  9.30387725157366   5.9160961156437   1.68756940266714
## ...                 ...               ...                ...
## 9991   3990.90432320118 0.417678537910221  0.181339607114812
## 9992   4.53976685326931  1.50528046670504  0.563598482433172
## 9993   1014.29155964696 0.439000181576629 0.0627697793663041
## 9994   529.454088746119 0.332923038316119 0.0945239430294878
## 9997   9.72892335247021 0.563324209710535  0.413251166022334
##                     stat               pvalue                 padj
##                <numeric>            <numeric>            <numeric>
## 1        -3.156152538433  0.00159865231809849   0.0043035623807905
## 10    -0.616218094115713    0.537750587910477                   NA
## 100     16.4299907403768 1.16676764268152e-60 9.28257417867291e-59
## 1000    6.26177025678203 3.80631442401815e-10 2.80067420934878e-09
## 10000   3.50569055488535 0.000455424368444192  0.00137911940017739
## ...                  ...                  ...                  ...
## 9991    2.30329460042215   0.0212622732425948   0.0425126188103087
## 9992    2.67083839581404  0.00756620639359585   0.0172426296854567
## 9993    6.99381431651633  2.6751156522809e-12 2.46373312864589e-11
## 9994    3.52210273551813 0.000428138191315946  0.00130357948187963
## 9997    1.36315213610333    0.172834568417487    0.252548773677274

Explore the LFC by MA plot

par("mar")

## [1] 5.1 4.1 4.1 2.1

par(mar=c(1,1,1,1))
plotMA(tnbc_results, ylim=c(-8,8))

LFC shrinkage

tnbc_results <- lfcShrink(dds_tnbc,
                          contrast = c("condition", "tnbc", "non_tnbc"),
                          res = tnbc_results)
plotMA(tnbc_results, ylim=c(-8,8))

Explore the differentially expresed genes

mcols(tnbc_results)

## DataFrame with 6 rows and 2 columns
##                        type
##                 <character>
## baseMean       intermediate
## log2FoldChange      results
## lfcSE               results
## stat                results
## pvalue              results
## padj                results
##                                                       description
##                                                       <character>
## baseMean                mean of normalized counts for all samples
## log2FoldChange log2 fold change (MAP): condition tnbc vs non_tnbc
## lfcSE                  standard error: condition tnbc vs non_tnbc
## stat                   Wald statistic: condition tnbc vs non tnbc
## pvalue              Wald test p-value: condition tnbc vs non tnbc
## padj                                         BH adjusted p-values

Determinne the significantly expressed genes

head(tnbc_results, n=10)

## log2 fold change (MAP): condition tnbc vs non_tnbc 
## Wald test p-value: condition tnbc vs non tnbc 
## DataFrame with 10 rows and 6 columns
##                    baseMean     log2FoldChange             lfcSE
##                   <numeric>          <numeric>         <numeric>
## 1          16.1517125593302  -1.62592421348364 0.514801657059637
## 10        0.485884323243151 -0.946506122270113  1.52917461470458
## 100        98.4975708469911   2.45159228508875  0.14915604417004
## 1000       300.477161321999    7.3248848232758  1.16108153448918
## 10000      9.30387725157366   4.76250020227342  1.34363903831347
## 100008587                 0                 NA                NA
## 100008588                 0                 NA                NA
## 100008589                 0                 NA                NA
## 100009601                 0                 NA                NA
## 100009602                 0                 NA                NA
##                         stat               pvalue                 padj
##                    <numeric>            <numeric>            <numeric>
## 1            -3.156152538433  0.00159865231809849   0.0043035623807905
## 10        -0.616218094115713    0.537750587910477                   NA
## 100         16.4299907403768 1.16676764268152e-60 9.28257417867291e-59
## 1000        6.26177025678203 3.80631442401815e-10 2.80067420934878e-09
## 10000       3.50569055488535 0.000455424368444192  0.00137911940017739
## 100008587                 NA                   NA                   NA
## 100008588                 NA                   NA                   NA
## 100008589                 NA                   NA                   NA
## 100009601                 NA                   NA                   NA
## 100009602                 NA                   NA                   NA

summary(tnbc_results)

## 
## out of 22026 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 4671, 21%
## LFC < 0 (down)     : 4476, 20%
## outliers [1]       : 0, 0%
## low counts [2]     : 4205, 19%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Return the genes most likely to be biologically relevant

tnbc_results <- results(dds_tnbc,
                        contrast = c("condition", "tnbc", "non_tnbc"),
                        alpha = 0.05,
                        lfcThreshold = 0.32)
tnbc_results <- lfcShrink(dds_tnbc,
                          contrast = c("condition", "tnbc", "non_tnbc"),
                          res = tnbc_results)
tnbc_results

## log2 fold change (MAP): condition tnbc vs non_tnbc 
## Wald test p-value: condition tnbc vs non tnbc 
## DataFrame with 28395 rows and 6 columns
##                baseMean     log2FoldChange              lfcSE
##               <numeric>          <numeric>          <numeric>
## 1      16.1517125593302  -1.62592421348364  0.514801657059637
## 10    0.485884323243151 -0.946506122270113   1.52917461470458
## 100    98.4975708469911   2.45159228508875   0.14915604417004
## 1000   300.477161321999    7.3248848232758   1.16108153448918
## 10000  9.30387725157366   4.76250020227342   1.34363903831347
## ...                 ...                ...                ...
## 9991   3990.90432320118  0.416445822645098  0.180804486155742
## 9992   4.53976685326931   1.46360847130524  0.546699761954415
## 9993   1014.29155964696  0.438844517706302 0.0627474870571554
## 9994   529.454088746119  0.332655566767773 0.0944479254004225
## 9997   9.72892335247021  0.554818916186089  0.406944155630255
##                     stat               pvalue                 padj
##                <numeric>            <numeric>            <numeric>
## 1      -2.55013636565125   0.0107680791200835   0.0330216032233792
## 10    -0.469230533572689    0.638904862719894                   NA
## 100     14.2897323192084 2.53582264903258e-46 2.11116335374004e-44
## 1000    6.02989402178289 1.64067233070712e-09 1.69825690388482e-08
## 10000    3.3160687239288 0.000912933755845116  0.00368990646961789
## ...                  ...                  ...                  ...
## 9991   0.538649771356221    0.590128537983286    0.910938036981509
## 9992    2.10305830063264   0.0354606734913963   0.0926589155654445
## 9993    1.89581965681578   0.0579838956949794     0.14053764940697
## 9994   0.136717088834174    0.891254421434998                    1
## 9997   0.588804653723312    0.555992323674708    0.872262774496432

Significant DE genes summary

summary(tnbc_results)

## 
## out of 22026 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 3203, 15%
## LFC < 0 (down)     : 2824, 13%
## outliers [1]       : 0, 0%
## low counts [2]     : 4626, 21%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Problem Solving

Charles Otieno

May 5, 2019