Living organisms codes for instructions of life in their genome present in the cell nucleus. The human genome is made up of double stranded deoxyrebonucleic acid (DNA), coiled and packaged inside the chromosomes. It has 23 chromosomes while different animals have varied number of chromosomes.
The DNA is made of buiding blocks called nucleotides. There are four different nucleotides namely: Adenine (A), Cytosine(C), Guanine (G) and Thymine (T). The double stranded DNA forms a helix to the sugar phosphate backbone whereby the nucleotides A pairs with T and C pairs with G. The order of nucleotide pairing form a DNA sequence. Within DNA sequences, there are regions known as genes. Genes provide instructions to make proteins and proteins inturn performs cellular functions.
In order for the proteins to be formed, DNA is transcribed into messenger ribonuclei acid (mRNA) that gets translated into proteins by the ribosomes. Some genes encodes for RNA that do not get translated into protein(s), such RNAs are known as non-coding RNA. They have functions within themselves. Other types of RNA include tRNA, rRNA, sRNA among others. All the RNAs transcribed from genes are called transcripts.
Messenger RNA has to undergo procesing to enable translation into proteins. Genes are transcribed into pre-mRNA. The transcribed mRNA have intronic sequences. After post transcription process, the introns are spliced out and poly A tail and 5’ caps added to yield mature RNA transcripts. Mature mRNA transcripts can be translated into proteins. Poly A tail is a sequence of A’s at the end of the transcript. Many of the non-coding RNAs do not have poly A tail.
All cells contains the same DNA sequence and the difference only arises from the different genes that are turned on and off in the respective cells and the different RNAs and proteins that are produced in the process. Disease causing mutation can result in differences in what genes are turned on or expressed and which are turned off. RNA sequencing quantifiies all the genes that have been expressed in different conditions. this technique is essential in answering questions such as:
RNA sequencing a growing technology and multiple tools have been developed to enable its operation. One of the widely used platform is Galaxy online dataprocessing platfoem and R programming language. R programming language uses R developed and Bioconductor packages such DESEq2 package for analysis.
library(DESeq2)
library(RColorBrewer)
library(pheatmap)
library(tidyverse)
library(dplyr)
library(ggplot2)
The above listed packages need to be installed and loaded into the R Studio environment before any analysis begin.
Planning is essential prior to begining RNA workflow. This step of the nalayis is crutial for the outcome on the results as there is often no saving of poorly designed experiment. There are certain considerations during the planning. These includes replicates, batch effects and compounding. For RNA sequence experiments, there is generally low technical variation. The more biological replicates you have for the experiment the better estimates for variation and expression. (In my study , I will have atleast 3 replicates for each sample selected). Experiments perfomed as batches should be avoided as much as possible and note down all the expperimental variables accross all conditions avoid batches as well. Compounding of experimets with major sources of variations such animals of different sexes should be avoided i.e do not have all male mice as control and female mice as treatment as in this case differentiation can not be based on sex. After the experiment have been well planned out then we can begin the sample preparation. Typically, the workflow follows the outlined steps below.
Computatoional analysis of RNA begins once the FASTQ files have been obtained. Normally, it is computationally intensive to manupulate the FASTQ files considering their massive sizes. Therefore, Galaxy or Unix Commandline can be used for this purpose.
As high throughput sequencing becomes more afforadable and and accesible to a wider community of researchers, lots of publicly available data of RNA sequence reads can be obtained in data repositories. In this case, we obtained RNA sequence data from NCBI’s SRA database. This database in full of publicly avalibel data that have been sequenced and stored in SRA format. Connvering these data requires strong coputaional capabilities such as Unix commandline or use of Galaxy platform.
Generating count matrix is the ultimate goal before any quantitative analysis of the differentially expressed genes can begin. Being a coputationally intensive process, we chose to use Gaxaxy platform for generating reads, performing quality analysis of the FASTQ files and aligning the reads to the refrence genome and finally generating a count matrix for analyis in R.
Using the Galaxy platform, RNA-seq reads were converted to counts, quality control (QC) of RNA-seq reads was performed and interactive reports to summarize the QC information were generated.
In this practice, a list of cell line RNA sequence data was obtained from NCBI’s SRA data base. Data from GEO dataset Accession number GSE113650 will be used in this analysis. This Bioproject, accesion number PRJNA453538 compiles a list of breast cancer RNA-seq data. In this bioproject, the reserchers searched for triple negative breast cancer (TNBC) expressing candidates for target therapy. The aim is to compare non TNBC and TNBC high expressing genes in both cell lines and pick up non cancer expressing genes. Among the cell lines used in this study include Basal-like TNBC HCC70 cell line, Basal-like TNBC HCC1143 cell line, Luminal(non-TNBC) T47D cell line and Luminal(non-TNBC) MCF7 cell lines respectively.
A collection list was creted from NCBi’s SRA platform and uploaded into Galaxy. In Galaxy the SRA files were fetched, downloaded and converted from SRA to FASTQ format using Galaxy’s SRA dump feature. More often than not,errors are introduced during sequencing. such errors include incorrect calling of nucleotides. Such error can result from technocal limitations of the sequening platforms used. These errors leads to bias during analysis resulting in misinterpretation of the data obtained. Every base that has been sequenced gets a score form the seqencer. A quality score of 30 ressonates to 1 in 1000 chance of incorrect base calling.
To check the qaulity of of the called bases in our read sequence, we used Galaxy’s FastQC feature. Here, the number of unassigned reads and assigned reads were determined.
Incase there was contamination by adaptor from the sequencing facility, we would use Cutadapt feature of Galaxy to trim out adaptors and any other low qality bases at the end of the sequence and also remove any sequences that are shorter than 200 bp.
Now that the reads have been prepared and their quality meets the threshold, we can proceed to map them to a reference human genome. Galaxy’s HISAT2 feature was used to align the reads to the reference gneome. The aligneed reads are generated in BAM format. We checked the percentage of reads that were alligned to the reference genome. Low mapping percentages indicates problems with the data or the overall analysis. MultiQC feature was used in the overall galaxy workflow to get aggregate QC report on every step.
We also perfomed QC on the number of reads that were mapped to each chromosome using SAM TOOL’S IdxStats feature in Galaxy. This generated a report to enable us check the sex of the sample by looking at the reads that were mapped to the X/Y chromoseomes and to determine mitochondrial contamination if there was any. The gene body coverage (5’-3’) was also checked to determine if there was nay bias in the reads coverage. This was achieved by using Gene Body Cooverage(BAM) feature in Galaxy.
Last but not least, reads distribution across exons, introns and intergenic distribution was determined using the Read Distribution feature.
Finally, the reads mapped to the refrence gene were counted by featureCounts feature on the BAM files. The Galaxy workflow was completed by creating a count matrix by Column Join on Collectin feature.
The next steps were conducted in R Studio using R programming language and a bunch of packages predominated by DESEQ2 package for differential expression analysis.
library(readr)
library(dplyr)
rawcounts <- read.table("TNBC_count_data.txt", header = TRUE, row.names = 1)
colnames(rawcounts)
## [1] "SRR7063013" "SRR7063014" "SRR7063015" "SRR7063016" "SRR7063021"
## [6] "SRR7063022" "SRR7063023" "SRR7063024" "SRR7063029" "SRR7063030"
## [11] "SRR7063031" "SRR7063032" "SRR7063033" "SRR7063034" "SRR7063035"
## [16] "SRR7063036"
str(rawcounts)
## 'data.frame': 28395 obs. of 16 variables:
## $ SRR7063013: int 10 0 27 2 0 0 0 0 0 0 ...
## $ SRR7063014: int 8 0 18 1 1 0 0 0 0 0 ...
## $ SRR7063015: int 18 0 29 0 0 0 0 0 0 0 ...
## $ SRR7063016: int 11 0 29 0 0 0 0 0 0 0 ...
## $ SRR7063021: int 42 0 32 2 0 0 0 0 0 0 ...
## $ SRR7063022: int 38 1 32 1 0 0 0 0 0 0 ...
## $ SRR7063023: int 35 3 35 4 0 0 0 0 0 0 ...
## $ SRR7063024: int 35 2 40 2 1 0 0 0 0 0 ...
## $ SRR7063029: int 2 0 187 0 0 0 0 0 0 0 ...
## $ SRR7063030: int 2 0 175 1 0 0 0 0 0 0 ...
## $ SRR7063031: int 4 0 185 2 0 0 0 0 0 0 ...
## $ SRR7063032: int 2 0 172 0 0 0 0 0 0 0 ...
## $ SRR7063033: int 8 0 182 1267 57 0 0 0 0 0 ...
## $ SRR7063034: int 15 0 147 1295 32 0 0 0 0 0 ...
## $ SRR7063035: int 20 2 155 1368 23 0 0 0 0 0 ...
## $ SRR7063036: int 13 0 165 1299 48 0 0 0 0 0 ...
###sample metadata
cell_line <- c(rep("HCC1143", 4), rep("HCC70", 4), rep("T47D", 4), rep("MCF7", 4))
condition <- c(rep("tnbc", 8), rep("non_tnbc", 8))
count_metadata <- data.frame(cell_line, condition)
count_metadata
## cell_line condition
## 1 HCC1143 tnbc
## 2 HCC1143 tnbc
## 3 HCC1143 tnbc
## 4 HCC1143 tnbc
## 5 HCC70 tnbc
## 6 HCC70 tnbc
## 7 HCC70 tnbc
## 8 HCC70 tnbc
## 9 T47D non_tnbc
## 10 T47D non_tnbc
## 11 T47D non_tnbc
## 12 T47D non_tnbc
## 13 MCF7 non_tnbc
## 14 MCF7 non_tnbc
## 15 MCF7 non_tnbc
## 16 MCF7 non_tnbc
create metadata row names associated with the sample column names
rownames(count_metadata) <- c("SRR7063036", "SRR7063035", "SRR7063034", "SRR7063033",
"SRR7063032", "SRR7063031", "SRR7063030", "SRR7063029",
"SRR7063024", "SRR7063023", "SRR7063022", "SRR7063021",
"SRR7063016", "SRR7063015", "SRR7063014", "SRR7063013")
count_metadata
## cell_line condition
## SRR7063036 HCC1143 tnbc
## SRR7063035 HCC1143 tnbc
## SRR7063034 HCC1143 tnbc
## SRR7063033 HCC1143 tnbc
## SRR7063032 HCC70 tnbc
## SRR7063031 HCC70 tnbc
## SRR7063030 HCC70 tnbc
## SRR7063029 HCC70 tnbc
## SRR7063024 T47D non_tnbc
## SRR7063023 T47D non_tnbc
## SRR7063022 T47D non_tnbc
## SRR7063021 T47D non_tnbc
## SRR7063016 MCF7 non_tnbc
## SRR7063015 MCF7 non_tnbc
## SRR7063014 MCF7 non_tnbc
## SRR7063013 MCF7 non_tnbc
check whether the metadata row names matches the count data column names using the all() function use the match() to match the column names of count data to the row names of metadata
all(rownames(count_metadata) == colnames(rawcounts))
## [1] FALSE
rownames(count_metadata)
## [1] "SRR7063036" "SRR7063035" "SRR7063034" "SRR7063033" "SRR7063032"
## [6] "SRR7063031" "SRR7063030" "SRR7063029" "SRR7063024" "SRR7063023"
## [11] "SRR7063022" "SRR7063021" "SRR7063016" "SRR7063015" "SRR7063014"
## [16] "SRR7063013"
colnames(rawcounts)
## [1] "SRR7063013" "SRR7063014" "SRR7063015" "SRR7063016" "SRR7063021"
## [6] "SRR7063022" "SRR7063023" "SRR7063024" "SRR7063029" "SRR7063030"
## [11] "SRR7063031" "SRR7063032" "SRR7063033" "SRR7063034" "SRR7063035"
## [16] "SRR7063036"
idx <- match(colnames(rawcounts), rownames(count_metadata))
reorder the matched data
reordered_count_metadata <- count_metadata[idx, ]
all(rownames(reordered_count_metadata) == colnames(rawcounts))
## [1] TRUE
create DESeq2 object
library(DESeq2)
dds_tnbc <- DESeqDataSetFromMatrix(countData = rawcounts,
colData = reordered_count_metadata,
design = ~ condition)
dds_tnbc
## class: DESeqDataSet
## dim: 28395 16
## metadata(1): version
## assays(1): counts
## rownames(28395): 1 10 ... 9994 9997
## rowData names(0):
## colnames(16): SRR7063013 SRR7063014 ... SRR7063035 SRR7063036
## colData names(2): cell_line condition
normalize raw counts calculate normalized counts
dds_tnbc <- estimateSizeFactors(dds_tnbc)
sizeFactors(dds_tnbc)
## SRR7063013 SRR7063014 SRR7063015 SRR7063016 SRR7063021 SRR7063022
## 0.9826010 0.9912131 0.9991431 0.9880352 0.9933680 1.0089220
## SRR7063023 SRR7063024 SRR7063029 SRR7063030 SRR7063031 SRR7063032
## 1.0105104 1.0020592 0.9668933 0.9758933 0.9776894 0.9745721
## SRR7063033 SRR7063034 SRR7063035 SRR7063036
## 1.0848983 1.0875630 1.0999252 1.0913644
extract normalized counts
normalized_tnbc_counts <- counts(dds_tnbc, normalized = TRUE)
head(normalized_tnbc_counts)
## SRR7063013 SRR7063014 SRR7063015 SRR7063016 SRR7063021
## 1 10.177070 8.070918 18.01544 11.13321 42.280403
## 10 0.000000 0.000000 0.00000 0.00000 0.000000
## 100 27.478090 18.159566 29.02487 29.35118 32.213641
## 1000 2.035414 1.008865 0.00000 0.00000 2.013353
## 10000 0.000000 1.008865 0.00000 0.00000 0.000000
## 100008587 0.000000 0.000000 0.00000 0.00000 0.000000
## SRR7063022 SRR7063023 SRR7063024 SRR7063029 SRR7063030
## 1 37.6639631 34.635962 34.928076 2.068481 2.049404
## 10 0.9911569 2.968797 1.995890 0.000000 0.000000
## 100 31.7170215 34.635962 39.917802 193.402930 179.322885
## 1000 0.9911569 3.958396 1.995890 0.000000 1.024702
## 10000 0.0000000 0.000000 0.997945 0.000000 0.000000
## 100008587 0.0000000 0.000000 0.000000 0.000000 0.000000
## SRR7063031 SRR7063032 SRR7063033 SRR7063034 SRR7063035
## 1 4.091279 2.052183 7.373963 13.79230 18.183054
## 10 0.000000 0.000000 0.000000 0.00000 1.818305
## 100 189.221654 176.487715 167.757662 135.16459 140.918668
## 1000 2.045639 0.000000 1167.851418 1190.73564 1243.720890
## 10000 0.000000 0.000000 52.539488 29.42358 20.910512
## 100008587 0.000000 0.000000 0.000000 0.00000 0.000000
## SRR7063036
## 1 11.91170
## 10 0.00000
## 100 151.18690
## 1000 1190.25322
## 10000 43.98164
## 100008587 0.00000
compare the counts between different samples from the normalized counts use heatmaps and PCA for the comparison and visualization to improve the visualization of clustering, log transform the normalized counts using vst() functionm set blint to TRUE on the DESeq2 object
log_transformed_tnbc <- vst(dds_tnbc, blind = TRUE)
extract the log transformed values using assay(). Log transformed values are used in plotting heatmaps
log_vst_transformed_tnbc <- assay(log_transformed_tnbc)
compute pairwise correlation values between the transformed values. this gives correlation between each of the sample pairs
log_cor_tnbc <- cor(log_vst_transformed_tnbc)
head(log_cor_tnbc)
## SRR7063013 SRR7063014 SRR7063015 SRR7063016 SRR7063021
## SRR7063013 1.0000000 0.9976934 0.9977668 0.9977345 0.9343396
## SRR7063014 0.9976934 1.0000000 0.9977506 0.9977423 0.9340874
## SRR7063015 0.9977668 0.9977506 1.0000000 0.9978063 0.9341225
## SRR7063016 0.9977345 0.9977423 0.9978063 1.0000000 0.9341436
## SRR7063021 0.9343396 0.9340874 0.9341225 0.9341436 1.0000000
## SRR7063022 0.9342195 0.9339972 0.9340174 0.9340140 0.9977446
## SRR7063022 SRR7063023 SRR7063024 SRR7063029 SRR7063030
## SRR7063013 0.9342195 0.9342121 0.9343300 0.8957332 0.8959401
## SRR7063014 0.9339972 0.9339506 0.9341296 0.8954832 0.8957177
## SRR7063015 0.9340174 0.9340016 0.9341450 0.8955968 0.8958214
## SRR7063016 0.9340140 0.9339541 0.9341303 0.8957355 0.8959264
## SRR7063021 0.9977446 0.9977555 0.9977458 0.8918946 0.8923080
## SRR7063022 1.0000000 0.9978161 0.9978669 0.8915445 0.8919359
## SRR7063031 SRR7063032 SRR7063033 SRR7063034 SRR7063035
## SRR7063013 0.8958176 0.8955690 0.8972205 0.8972234 0.8971701
## SRR7063014 0.8955798 0.8952946 0.8970229 0.8969982 0.8969391
## SRR7063015 0.8957280 0.8954504 0.8972533 0.8972531 0.8971976
## SRR7063016 0.8958433 0.8955715 0.8970651 0.8970995 0.8970408
## SRR7063021 0.8920547 0.8920285 0.8882633 0.8882359 0.8879210
## SRR7063022 0.8917117 0.8916691 0.8880183 0.8880125 0.8877317
## SRR7063036
## SRR7063013 0.8972428
## SRR7063014 0.8971397
## SRR7063015 0.8973284
## SRR7063016 0.8971529
## SRR7063021 0.8881778
## SRR7063022 0.8880254
generate heatmap of the correlated values
library(pheatmap)
pheatmap(log_cor_tnbc, annotation = select(reordered_count_metadata, condition))
perform PCA on the log transformed object
plotPCA(log_transformed_tnbc, intgroup = "condition")
we have explored the data above and now we can begin the DE analysis with DESeq2 Create a design formula:it tells DESeq2 the known major sources of variation to control for or regress out as well as condition of interest to use for DE testing. It should be similar to the metadata
dds_tnbc <- DESeq(dds_tnbc)
calculate the mean-variance relationship between genes
mean_counts <- apply(rawcounts[, 1:3], 1, mean)
head(mean_counts)
## 1 10 100 1000 10000 100008587
## 12.0000000 0.0000000 24.6666667 1.0000000 0.3333333 0.0000000
variance_counts <- apply(rawcounts[, 1:3], 1, var)
head(variance_counts)
## 1 10 100 1000 10000 100008587
## 28.0000000 0.0000000 34.3333333 1.0000000 0.3333333 0.0000000
determine the dispersion of the genes plot the relationship between mean and variance create a dataframe with mean and variance for every gene # each black dot represents a gene
library(ggplot2)
df <- data.frame(mean_counts, variance_counts)
ggplot(df) +
geom_point(aes(x=mean_counts, y=variance_counts)) +
scale_y_log10() +
scale_x_log10() +
xlab("Mean counts per gene") +
ylab("Variance per gene")
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous x-axis
plot dispersion estimate on the DESeq object
plotDispEsts(dds_tnbc)
run the deseq2 model and proceed with the DE analysis. compare results bewteen the conditions
results(dds_tnbc, alpha = 0.05)
## log2 fold change (MLE): condition tnbc vs non tnbc
## Wald test p-value: condition tnbc vs non tnbc
## DataFrame with 28395 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## 1 16.1517125593302 -1.66657072477551 0.528038713110786
## 10 0.485884323243151 -1.34154066771732 2.17705497538572
## 100 98.4975708469911 2.45652440140871 0.149514655256122
## 1000 300.477161321999 8.64153449005958 1.38004655803206
## 10000 9.30387725157366 5.9160961156437 1.68756940266714
## ... ... ... ...
## 9991 3990.90432320118 0.417678537910221 0.181339607114812
## 9992 4.53976685326931 1.50528046670504 0.563598482433172
## 9993 1014.29155964696 0.439000181576629 0.0627697793663041
## 9994 529.454088746119 0.332923038316119 0.0945239430294878
## 9997 9.72892335247021 0.563324209710535 0.413251166022334
## stat pvalue padj
## <numeric> <numeric> <numeric>
## 1 -3.156152538433 0.00159865231809849 0.0043035623807905
## 10 -0.616218094115713 0.537750587910477 NA
## 100 16.4299907403768 1.16676764268152e-60 9.28257417867291e-59
## 1000 6.26177025678203 3.80631442401815e-10 2.80067420934878e-09
## 10000 3.50569055488535 0.000455424368444192 0.00137911940017739
## ... ... ... ...
## 9991 2.30329460042215 0.0212622732425948 0.0425126188103087
## 9992 2.67083839581404 0.00756620639359585 0.0172426296854567
## 9993 6.99381431651633 2.6751156522809e-12 2.46373312864589e-11
## 9994 3.52210273551813 0.000428138191315946 0.00130357948187963
## 9997 1.36315213610333 0.172834568417487 0.252548773677274
create contrasts between conditions specify condition of interest, level to compare and the base level # use the syntax results(dds, contrast = c(“condition_factor”, “level_to_compare”, “base_level”), alpha = 0.05)
tnbc_results <- results(dds_tnbc,
contrast = c("condition", "tnbc", "non_tnbc" ),
alpha = 0.05)
tnbc_results
## log2 fold change (MLE): condition tnbc vs non_tnbc
## Wald test p-value: condition tnbc vs non tnbc
## DataFrame with 28395 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## 1 16.1517125593302 -1.66657072477551 0.528038713110786
## 10 0.485884323243151 -1.34154066771732 2.17705497538572
## 100 98.4975708469911 2.45652440140871 0.149514655256122
## 1000 300.477161321999 8.64153449005958 1.38004655803206
## 10000 9.30387725157366 5.9160961156437 1.68756940266714
## ... ... ... ...
## 9991 3990.90432320118 0.417678537910221 0.181339607114812
## 9992 4.53976685326931 1.50528046670504 0.563598482433172
## 9993 1014.29155964696 0.439000181576629 0.0627697793663041
## 9994 529.454088746119 0.332923038316119 0.0945239430294878
## 9997 9.72892335247021 0.563324209710535 0.413251166022334
## stat pvalue padj
## <numeric> <numeric> <numeric>
## 1 -3.156152538433 0.00159865231809849 0.0043035623807905
## 10 -0.616218094115713 0.537750587910477 NA
## 100 16.4299907403768 1.16676764268152e-60 9.28257417867291e-59
## 1000 6.26177025678203 3.80631442401815e-10 2.80067420934878e-09
## 10000 3.50569055488535 0.000455424368444192 0.00137911940017739
## ... ... ... ...
## 9991 2.30329460042215 0.0212622732425948 0.0425126188103087
## 9992 2.67083839581404 0.00756620639359585 0.0172426296854567
## 9993 6.99381431651633 2.6751156522809e-12 2.46373312864589e-11
## 9994 3.52210273551813 0.000428138191315946 0.00130357948187963
## 9997 1.36315213610333 0.172834568417487 0.252548773677274
Explore the LFC by MA plot
par("mar")
## [1] 5.1 4.1 4.1 2.1
par(mar=c(1,1,1,1))
plotMA(tnbc_results, ylim=c(-8,8))
LFC shrinkage
tnbc_results <- lfcShrink(dds_tnbc,
contrast = c("condition", "tnbc", "non_tnbc"),
res = tnbc_results)
plotMA(tnbc_results, ylim=c(-8,8))
Explore the differentially expresed genes
mcols(tnbc_results)
## DataFrame with 6 rows and 2 columns
## type
## <character>
## baseMean intermediate
## log2FoldChange results
## lfcSE results
## stat results
## pvalue results
## padj results
## description
## <character>
## baseMean mean of normalized counts for all samples
## log2FoldChange log2 fold change (MAP): condition tnbc vs non_tnbc
## lfcSE standard error: condition tnbc vs non_tnbc
## stat Wald statistic: condition tnbc vs non tnbc
## pvalue Wald test p-value: condition tnbc vs non tnbc
## padj BH adjusted p-values
Determinne the significantly expressed genes
head(tnbc_results, n=10)
## log2 fold change (MAP): condition tnbc vs non_tnbc
## Wald test p-value: condition tnbc vs non tnbc
## DataFrame with 10 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## 1 16.1517125593302 -1.62592421348364 0.514801657059637
## 10 0.485884323243151 -0.946506122270113 1.52917461470458
## 100 98.4975708469911 2.45159228508875 0.14915604417004
## 1000 300.477161321999 7.3248848232758 1.16108153448918
## 10000 9.30387725157366 4.76250020227342 1.34363903831347
## 100008587 0 NA NA
## 100008588 0 NA NA
## 100008589 0 NA NA
## 100009601 0 NA NA
## 100009602 0 NA NA
## stat pvalue padj
## <numeric> <numeric> <numeric>
## 1 -3.156152538433 0.00159865231809849 0.0043035623807905
## 10 -0.616218094115713 0.537750587910477 NA
## 100 16.4299907403768 1.16676764268152e-60 9.28257417867291e-59
## 1000 6.26177025678203 3.80631442401815e-10 2.80067420934878e-09
## 10000 3.50569055488535 0.000455424368444192 0.00137911940017739
## 100008587 NA NA NA
## 100008588 NA NA NA
## 100008589 NA NA NA
## 100009601 NA NA NA
## 100009602 NA NA NA
summary(tnbc_results)
##
## out of 22026 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 4671, 21%
## LFC < 0 (down) : 4476, 20%
## outliers [1] : 0, 0%
## low counts [2] : 4205, 19%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
Return the genes most likely to be biologically relevant
tnbc_results <- results(dds_tnbc,
contrast = c("condition", "tnbc", "non_tnbc"),
alpha = 0.05,
lfcThreshold = 0.32)
tnbc_results <- lfcShrink(dds_tnbc,
contrast = c("condition", "tnbc", "non_tnbc"),
res = tnbc_results)
tnbc_results
## log2 fold change (MAP): condition tnbc vs non_tnbc
## Wald test p-value: condition tnbc vs non tnbc
## DataFrame with 28395 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## 1 16.1517125593302 -1.62592421348364 0.514801657059637
## 10 0.485884323243151 -0.946506122270113 1.52917461470458
## 100 98.4975708469911 2.45159228508875 0.14915604417004
## 1000 300.477161321999 7.3248848232758 1.16108153448918
## 10000 9.30387725157366 4.76250020227342 1.34363903831347
## ... ... ... ...
## 9991 3990.90432320118 0.416445822645098 0.180804486155742
## 9992 4.53976685326931 1.46360847130524 0.546699761954415
## 9993 1014.29155964696 0.438844517706302 0.0627474870571554
## 9994 529.454088746119 0.332655566767773 0.0944479254004225
## 9997 9.72892335247021 0.554818916186089 0.406944155630255
## stat pvalue padj
## <numeric> <numeric> <numeric>
## 1 -2.55013636565125 0.0107680791200835 0.0330216032233792
## 10 -0.469230533572689 0.638904862719894 NA
## 100 14.2897323192084 2.53582264903258e-46 2.11116335374004e-44
## 1000 6.02989402178289 1.64067233070712e-09 1.69825690388482e-08
## 10000 3.3160687239288 0.000912933755845116 0.00368990646961789
## ... ... ... ...
## 9991 0.538649771356221 0.590128537983286 0.910938036981509
## 9992 2.10305830063264 0.0354606734913963 0.0926589155654445
## 9993 1.89581965681578 0.0579838956949794 0.14053764940697
## 9994 0.136717088834174 0.891254421434998 1
## 9997 0.588804653723312 0.555992323674708 0.872262774496432
Significant DE genes summary
summary(tnbc_results)
##
## out of 22026 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 3203, 15%
## LFC < 0 (down) : 2824, 13%
## outliers [1] : 0, 0%
## low counts [2] : 4626, 21%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results