The motivation of the following, is to re-do (part of) the analysis described for the expression profiling by high throughput sequencing in the experiment with identifier GSE101942 described at the NCBI Gene Expression Omnibus (GEO), employing the state-of-the-art tidytranscriptomics
suite (Mangiola et al. 2021; Mangiola 2021).
Even though these data originated from a different study, samples reflecting the same cell type clustered together, revealed that a limited number of samples already suffice to identify tissue-specific genes for known disease-associated variants. Additionally, we proposed recent advances in tidy programming to efficiently evaluate a biological dataset to perform integral RNA-seq analyses.
The tools we will use to perform the analysis will be the statistical programming language R
(R Core Team 2020) with its many libraries and the software package RStudio
(RStudio Team 2019) to interact with R
.
Experiment with identifier GSE101942 presents evidence that transmembrane protein podocalyxin (PCX) suppresses genes promoting receptivity (eg LIF, CSF1) but stimulates those inhibiting implantation (eg WNT7A, LEFTY2) and is a significant factor regulating human endometrial receptivity. Therefore, down-regulation of podocalyxin selectively converts the endometrial surface to a more adhesive state that facilitates embryo attachment and penetration and inadequate down-regulation of podocalyxin is associated with poorer implantation rates in IVF (in vitro fertilization).
The extraction protocol of the chosen experiment consisted of 4 replicates per group (control vs PCX-overexpression): Ishikawa cells were cultured, total RNA was isolated from control and PCX-overexpressing Ishikawa cells and total RNA was converted to strand specific Illumina compatible sequencing libraries.
FASTQ reads were assessed for quality using FastQC
(Andrews 2010). Reads were then trimmed for sequence adapters using AdapterRemoval
(Mikkel Schubert and Orlando 2016) and aligned to the human genome GRCh37 using the RNA-seq alignment algorithm STAR
(Dobin and Gingeras. 2013).
After alignment, mapped sequence reads were summarised to the GRCh37.p13 (NCBI:GCA_000001405.14 2013-09) gene intervals using featureCounts
(Yang Liao and Shi 2014). Finally the raw table of counts was transferred to R
for expression analysis. The material on differential expression will be now adapted to introduce a tidy approach to RNA sequencing data representation and analysis.
To analyse the set of RNA transcripts in the population of cells to measure changes in expression, we’ll make use of the Bulk RNA sequencing differential expression workflow, depicted in Figure 1.
Figure 1: Bulk RNAseq pipeline (Doyle, M
and Mangiola, S., 2021)
The tidytranscriptomics
suite follows the data paradigm for providing a standard way to organize data values within a dataset, where each variable is a column, each observation is a row, and data is manipulated using an easy-to-understand vocabulary.
By following this paradigm in RNA sequencing data exploration and analysis, a number of key concepts, such as filtering, scaling, dimensionality reduction, hypothesis testing, clustering and visualization become more intuitive by the use of a homogeneous jargon, the simplicity of data wrangling, and the coding unburden. Therefore, enabling focus on the statistical and biological challenges underlying an informed RNA sequencing analysis.
To approach data representation and analysis through a tidy data paradigm, allows the integration of the tidyverse
with tidybulk, tidyseurat and other tidyverse-friendly packages.
The main objectives in this work will be to recall the key concepts of RNA sequencing data analysis, to apply the concepts to publicly available data and to create plots that summarize the information content of the data and analysis results.
Let’s load all the packages we will need to analyse the bulk RNA sequencing data and setting the colors and theme we will use for our plots.
# tidyverse core packages
library(tidyverse)
# databeses packages
library(GEOquery)
library(SummarizedExperiment)
# tidyverse-friendly packages
library(plotly)
library(ggrepel)
library(GGally)
library(tidybulk)
library(tidySummarizedExperiment)
# Use colourblind-friendly colours
friendly_cols <- dittoSeq::dittoColors()
# Set theme
custom_theme <-
list(
scale_fill_manual(values = friendly_cols),
scale_color_manual(values = friendly_cols),
theme_bw() +
theme(
panel.border = element_blank(),
axis.line = element_line(),
panel.grid.major = element_line(size = 0.2),
panel.grid.minor = element_line(size = 0.1),
text = element_text(size = 12),
legend.position = "bottom",
strip.background = element_blank(),
axis.title.x = element_text(margin = margin(t = 10, r = 10, b = 10, l =
10)),
axis.title.y = element_text(margin = margin(t = 10, r = 10, b = 10, l =
10)),
axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1)
)
)
We first started by downloading the actual count data and the metadata file obtained from the sample information of the series matrix file downloaded from GEO employing the GEOquery package (Davis and Meltzer 2007). The columns were then parsed and new columns with shorter names and factor levels were added.
The information which connects the sample information from GEO with the SRA run id was also downloaded https://www.ncbi.nlm.nih.gov/sra/SRP255647, using the relations code SRP255647. This data.frame
was merged with the query from GEO.
## RNA-seq (control vs PCX-overexpression) 4 replicates per group
#exp_1 <- read_table2("GSE148274_all_GN_samples_counts.tsv")
#exp.1 <- column_to_rownames(exp_1, "Geneid")
exp.1 <- read_table2("GSE148274_all_GN_samples_counts.tsv") %>%
column_to_rownames("Geneid")
# TO DO tidy
gse1 <- getGEO(filename="GSE148274_series_matrix.txt")
pdata1 <- pData(gse1)[,grepl("ch1",names(pData(gse1)))]
pdata1$SampleName <- rownames(pData(gse1))
srp <- read.csv("SraRunInfo3.csv")
srpsmall <- srp[,c("Run","avgLength","Experiment","Sample","BioSample","SampleName")]
coldata1 <- merge(pdata1, srpsmall, by="SampleName")
rownames(coldata1) <- coldata1$Run
Then a SummarizedExperiment
object was created. The library tidySummarizedExperiment provides a bridge between Bioconductor
SummarizedExperiment (Morgan et al. 2021) and the tidyverse (Wickham et al. 2019) to enable visualisation and treatment of the Bioconductor
SummarizedExperiment
object as a tidyverse
tibble, providing SummarizedExperiment
-compatible dplyr, tidyr, ggplot and plotly functions.
The SummarizedExperiment
class is used to store the rectangular matrix of experimental results produced by the high throughput sequencing experiments. Each object stores observations of one or more samples, along with additional meta-data describing both the observations (features) and samples (phenotypes).
pcxoe <- SummarizedExperiment(list(counts=as.matrix(exp.1)),colData = coldata1)
## take a look
pcxoe
## # A SummarizedExperiment-tibble abstraction: 463,240 × 19
## [90m# Transcripts=57905 | Samples=8 | Assays=counts[39m
## feature sample counts SampleName source_name_ch1 organism_ch1
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 ENSG00… SRR11… 0 GSM4459412 epithelial cel… Homo sapiens
## 2 ENSG00… SRR11… 16 GSM4459412 epithelial cel… Homo sapiens
## 3 ENSG00… SRR11… 0 GSM4459412 epithelial cel… Homo sapiens
## 4 ENSG00… SRR11… 0 GSM4459412 epithelial cel… Homo sapiens
## 5 ENSG00… SRR11… 0 GSM4459412 epithelial cel… Homo sapiens
## 6 ENSG00… SRR11… 0 GSM4459412 epithelial cel… Homo sapiens
## 7 ENSG00… SRR11… 0 GSM4459412 epithelial cel… Homo sapiens
## 8 ENSG00… SRR11… 0 GSM4459412 epithelial cel… Homo sapiens
## 9 ENSG00… SRR11… 0 GSM4459412 epithelial cel… Homo sapiens
## 10 ENSG00… SRR11… 0 GSM4459412 epithelial cel… Homo sapiens
## # … with 40 more rows, and 13 more variables: characteristics_ch1 <chr>,
## # characteristics_ch1.1 <chr>, molecule_ch1 <chr>,
## # extract_protocol_ch1 <chr>, extract_protocol_ch1.1 <chr>, taxid_ch1 <chr>,
## # cell.line.ch1 <chr>, genotype.variation.ch1 <chr>, Run <chr>,
## # avgLength <int>, Experiment <chr>, Sample <chr>, BioSample <chr>
assays(pcxoe)
## List of length 1
## names(1): counts
And now tidyverse commands, such as filter, select and mutate were used to explore the tidy SummarizedExperiment
object, alongside the tidyverse
pipe %>%
, since it’s use reduces the amount of code needed to write when we have multiple steps and also makes the steps clearer and easier to see.
We created a new sample_name
column that contains shorter sample names, by removing the GSM4459 prefix that’s present in all of the samples, as shorter names can fit better the plots that were created. Also, to give a statistical treatment, the genotype variation was converted to a factor type.
Lastly, to set up the data for our RNA sequencing analysis a column with gene symbols was created. The gene symbols for these Ensembl gene ids can be obtained using the Bioconductor
annotation package for human, org.Hs.eg.db (Pagès et al. 2021).
counts <-
pcxoe %>%
mutate(sample_name = str_remove(sample, "GSM4459"),
genotype = as_factor(genotype.variation.ch1)) %>%
mutate(symbol = AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
keys = feature,
keytype = "ENSEMBL",
column = "SYMBOL",
multiVals = "first"
))
counts
## # A SummarizedExperiment-tibble abstraction: 463,240 × 22
## [90m# Transcripts=57905 | Samples=8 | Assays=counts[39m
## feature sample counts SampleName source_name_ch1 organism_ch1
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 ENSG00… SRR11… 0 GSM4459412 epithelial cel… Homo sapiens
## 2 ENSG00… SRR11… 16 GSM4459412 epithelial cel… Homo sapiens
## 3 ENSG00… SRR11… 0 GSM4459412 epithelial cel… Homo sapiens
## 4 ENSG00… SRR11… 0 GSM4459412 epithelial cel… Homo sapiens
## 5 ENSG00… SRR11… 0 GSM4459412 epithelial cel… Homo sapiens
## 6 ENSG00… SRR11… 0 GSM4459412 epithelial cel… Homo sapiens
## 7 ENSG00… SRR11… 0 GSM4459412 epithelial cel… Homo sapiens
## 8 ENSG00… SRR11… 0 GSM4459412 epithelial cel… Homo sapiens
## 9 ENSG00… SRR11… 0 GSM4459412 epithelial cel… Homo sapiens
## 10 ENSG00… SRR11… 0 GSM4459412 epithelial cel… Homo sapiens
## # … with 40 more rows, and 16 more variables: characteristics_ch1 <chr>,
## # characteristics_ch1.1 <chr>, molecule_ch1 <chr>,
## # extract_protocol_ch1 <chr>, extract_protocol_ch1.1 <chr>, taxid_ch1 <chr>,
## # cell.line.ch1 <chr>, genotype.variation.ch1 <chr>, Run <chr>,
## # avgLength <int>, Experiment <chr>, Sample <chr>, BioSample <chr>,
## # sample_name <chr>, genotype <fct>, symbol <chr>
A filter for lowly expressed genes was performed using tidybulk
, which in turn uses the edgeR filterByExpr()
function described in McCarthy, Chen, and Smyth (2012), to automatically identify the genes with adequate abundance for differential expression testing.
We also normalized counts by scaling them and avoid technical and composition differences between the samples, employing the default edgeR
’s trimmed mean of M values (TMM), (Robinson and Oshlack 2010). We can visualise the difference in abundance densities before and after scaling (Figure 2).
counts_scaled <- counts %>%
keep_abundant(factor_of_interest = genotype) %>%
scale_abundance()
counts_scaled %>%
# Reshaping
pivot_longer(cols = c("counts", "counts_scaled"), names_to = "source",
values_to = "abundance") %>%
# Plotting
ggplot(aes(x = abundance + 1, color = sample_name)) +
geom_density() +
facet_wrap(~source) +
scale_x_log10() +
custom_theme
Figure 2: Density plot comparing raw and normalized counts distribution for 8 samples
It can be assessed, that the distributions of the counts are not very different to each other before scaling and scaling makes the distributions more similar.
To interrogate the RNA-seq experiment design, as a way to perform quality control, the similarity of the samples with each other in terms of the quantified expression level profiles across the set of genes was measured. This allowed to see whether the most similar samples are the biological replicates of that sample.
We reduced the dimensions of the data to identify the greatest sources of variation by applying Principal Component Analysis (PCA) and visualized them as two (Figure 3) and three-dimensional (Figure 4) scatter plots.
The 2D visualization follows the tidy graphical guidelines of ggplot2 (Wickham 2016) and ggrepel (Slowikowski 2021) to repel away text labels, data points, and edges from each other in the plotting area. The 3D plot uses the plotly framework (Sievert 2020).
## Get principal components
counts_scal_PCA <-
counts_scaled %>%
reduce_dimensions(method = "PCA")
counts_scal_PCA
## # A SummarizedExperiment-tibble abstraction: 136,704 × 28
## [90m# Transcripts=17088 | Samples=8 | Assays=counts, counts_scaled[39m
## feature sample counts counts_scaled SampleName source_name_ch1 organism_ch1
## <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 ENSG00… SRR11… 16 16.2 GSM4459412 epithelial cel… Homo sapiens
## 2 ENSG00… SRR11… 30 30.4 GSM4459412 epithelial cel… Homo sapiens
## 3 ENSG00… SRR11… 9 9.13 GSM4459412 epithelial cel… Homo sapiens
## 4 ENSG00… SRR11… 2039 2068. GSM4459412 epithelial cel… Homo sapiens
## 5 ENSG00… SRR11… 19 19.3 GSM4459412 epithelial cel… Homo sapiens
## 6 ENSG00… SRR11… 11 11.2 GSM4459412 epithelial cel… Homo sapiens
## 7 ENSG00… SRR11… 11304 11464. GSM4459412 epithelial cel… Homo sapiens
## 8 ENSG00… SRR11… 17 17.2 GSM4459412 epithelial cel… Homo sapiens
## 9 ENSG00… SRR11… 8 8.11 GSM4459412 epithelial cel… Homo sapiens
## 10 ENSG00… SRR11… 505 512. GSM4459412 epithelial cel… Homo sapiens
## # … with 40 more rows, and 21 more variables: characteristics_ch1 <chr>,
## # characteristics_ch1.1 <chr>, molecule_ch1 <chr>,
## # extract_protocol_ch1 <chr>, extract_protocol_ch1.1 <chr>, taxid_ch1 <chr>,
## # cell.line.ch1 <chr>, genotype.variation.ch1 <chr>, Run <chr>,
## # avgLength <int>, Experiment <chr>, Sample <chr>, BioSample <chr>,
## # sample_name <chr>, genotype <fct>, TMM <dbl>, multiplier <dbl>, PC1 <dbl>,
## # PC2 <dbl>, symbol <chr>, .abundant <lgl>
## take a look
counts_scal_PCA %>% pivot_sample()
## # A tibble: 8 × 23
## sample SampleName source_name_ch1 organism_ch1 characteristics_ch1
## <chr> <chr> <chr> <chr> <chr>
## 1 SRR11506594 GSM4459412 epithelial cells Homo sapiens cell line: Ishikawa cells
## 2 SRR11506595 GSM4459413 epithelial cells Homo sapiens cell line: Ishikawa cells
## 3 SRR11506596 GSM4459414 epithelial cells Homo sapiens cell line: Ishikawa cells
## 4 SRR11506597 GSM4459415 epithelial cells Homo sapiens cell line: Ishikawa cells
## 5 SRR11506598 GSM4459416 epithelial cells Homo sapiens cell line: Ishikawa cells
## 6 SRR11506599 GSM4459417 epithelial cells Homo sapiens cell line: Ishikawa cells
## 7 SRR11506600 GSM4459418 epithelial cells Homo sapiens cell line: Ishikawa cells
## 8 SRR11506601 GSM4459419 epithelial cells Homo sapiens cell line: Ishikawa cells
## # … with 18 more variables: characteristics_ch1.1 <chr>, molecule_ch1 <chr>,
## # extract_protocol_ch1 <chr>, extract_protocol_ch1.1 <chr>, taxid_ch1 <chr>,
## # cell.line.ch1 <chr>, genotype.variation.ch1 <chr>, Run <chr>,
## # avgLength <int>, Experiment <chr>, Sample <chr>, BioSample <chr>,
## # sample_name <chr>, genotype <fct>, TMM <dbl>, multiplier <dbl>, PC1 <dbl>,
## # PC2 <dbl>
## PCA plot
counts_scal_PCA %>%
pivot_sample() %>%
ggplot(aes(x = PC1, y = PC2, colour = genotype)) +
geom_point() +
geom_text_repel(aes(label = sample_name), show.legend = FALSE) +
custom_theme
Figure 3: PCA plot of samples using TPM counts
counts_scal_PCA3 <-
counts_scaled %>%
reduce_dimensions(method = "PCA", .dims = 3)
counts_scal_PCA3
## # A SummarizedExperiment-tibble abstraction: 136,704 × 29
## [90m# Transcripts=17088 | Samples=8 | Assays=counts, counts_scaled[39m
## feature sample counts counts_scaled SampleName source_name_ch1 organism_ch1
## <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 ENSG00… SRR11… 16 16.2 GSM4459412 epithelial cel… Homo sapiens
## 2 ENSG00… SRR11… 30 30.4 GSM4459412 epithelial cel… Homo sapiens
## 3 ENSG00… SRR11… 9 9.13 GSM4459412 epithelial cel… Homo sapiens
## 4 ENSG00… SRR11… 2039 2068. GSM4459412 epithelial cel… Homo sapiens
## 5 ENSG00… SRR11… 19 19.3 GSM4459412 epithelial cel… Homo sapiens
## 6 ENSG00… SRR11… 11 11.2 GSM4459412 epithelial cel… Homo sapiens
## 7 ENSG00… SRR11… 11304 11464. GSM4459412 epithelial cel… Homo sapiens
## 8 ENSG00… SRR11… 17 17.2 GSM4459412 epithelial cel… Homo sapiens
## 9 ENSG00… SRR11… 8 8.11 GSM4459412 epithelial cel… Homo sapiens
## 10 ENSG00… SRR11… 505 512. GSM4459412 epithelial cel… Homo sapiens
## # … with 40 more rows, and 22 more variables: characteristics_ch1 <chr>,
## # characteristics_ch1.1 <chr>, molecule_ch1 <chr>,
## # extract_protocol_ch1 <chr>, extract_protocol_ch1.1 <chr>, taxid_ch1 <chr>,
## # cell.line.ch1 <chr>, genotype.variation.ch1 <chr>, Run <chr>,
## # avgLength <int>, Experiment <chr>, Sample <chr>, BioSample <chr>,
## # sample_name <chr>, genotype <fct>, TMM <dbl>, multiplier <dbl>, PC1 <dbl>,
## # PC2 <dbl>, PC3 <dbl>, symbol <chr>, .abundant <lgl>
## take a look
counts_scal_PCA3 %>% pivot_sample()
## # A tibble: 8 × 24
## sample SampleName source_name_ch1 organism_ch1 characteristics_ch1
## <chr> <chr> <chr> <chr> <chr>
## 1 SRR11506594 GSM4459412 epithelial cells Homo sapiens cell line: Ishikawa cells
## 2 SRR11506595 GSM4459413 epithelial cells Homo sapiens cell line: Ishikawa cells
## 3 SRR11506596 GSM4459414 epithelial cells Homo sapiens cell line: Ishikawa cells
## 4 SRR11506597 GSM4459415 epithelial cells Homo sapiens cell line: Ishikawa cells
## 5 SRR11506598 GSM4459416 epithelial cells Homo sapiens cell line: Ishikawa cells
## 6 SRR11506599 GSM4459417 epithelial cells Homo sapiens cell line: Ishikawa cells
## 7 SRR11506600 GSM4459418 epithelial cells Homo sapiens cell line: Ishikawa cells
## 8 SRR11506601 GSM4459419 epithelial cells Homo sapiens cell line: Ishikawa cells
## # … with 19 more variables: characteristics_ch1.1 <chr>, molecule_ch1 <chr>,
## # extract_protocol_ch1 <chr>, extract_protocol_ch1.1 <chr>, taxid_ch1 <chr>,
## # cell.line.ch1 <chr>, genotype.variation.ch1 <chr>, Run <chr>,
## # avgLength <int>, Experiment <chr>, Sample <chr>, BioSample <chr>,
## # sample_name <chr>, genotype <fct>, TMM <dbl>, multiplier <dbl>, PC1 <dbl>,
## # PC2 <dbl>, PC3 <dbl>
## PCA3 plot
counts_scal_PCA3 %>%
pivot_sample() %>%
plot_ly(
x = ~ PC1,
y = ~ PC2,
z = ~ PC3,
color = ~ genotype,
colors = friendly_cols[1:5]
)
Figure 4: PCA plot of samples using TPM counts
Three principal components.
The experiment is well controlled and has worked well, since the greatest sources of variation in the data are the treatments/groups we are interested in. It also shows a streamlined quality control because no outliers are detected.
The samples group by treatment on PC1 which is what we hope to see. PC2 separates the PCX-overexpression from the other samples, indicating a greater difference between that group and control.
The exploratory results provides hints towards the fundamental mechanisms governing receptivity, a key step in establishing the transformation from a non-receptive to a receptive state, allowing embryos to attach to the surface and enter into the tissue.
The main and most important analysis step we will do is finding genes that show a difference in expression between sample groups. The process consisted on calculating the ratios for all genes between samples to determine the fold-change (FC) denoting the factor of change in expression between groups. Then, we filtered out only those genes that actually show a difference.
Whit a list of genes showing different behavior across samples, a focus on genes that are also statistically significant was tested against the null hypothesis that the activity of the genes stayed the same under overexpression of PCX conditions.
The limiting factors that influence the power of detecting genes that have real changes between control and PCX-overexpression are the limited number of biological replicates, non-normality of the distribution of the read counts, and higher uncertainty of measurements for lowly expressed genes than highly expressed genes (McCarthy, Chen, and Smyth 2012).
To address these limitations, several popular methods for differential transcript abundance testing where integrated in order to maximize the amount of knowledge that can be extracted from such noisy datasets: the edgeR
quasi-likelihood (default method in tidybulk
), edgeR
likelihood ratio (Robinson, McCarthy, and Smyth 2010), limma-voom of the limma library (Ritchie et al. 2015) and DESeq2 (Love, Huber, and Anders 2014).
To choose a method, tidybulk
easily runs all these methods at once to make comparisons. All samples were from the same Ishikawa cell line and there were no additional factors contributing variance, such as batch differences, so the formula only takes genotype as covariate.
This outputs the columns from each method such as log-fold change (logFC), false-discovery rate (FDR) and probability value (p-value). logFC is log2(control/overexpression).
de_all <-
counts_scal_PCA %>%
# edgeR QLT
test_differential_abundance(
~ genotype,
method = "edgeR_quasi_likelihood",
prefix = "edgerQLT_"
) %>%
# edgeR LRT
test_differential_abundance(
~ genotype,
method = "edgeR_likelihood_ratio",
prefix = "edgerLR_"
) %>%
# limma-voom
test_differential_abundance(
~ genotype,
method = "limma_voom",
prefix = "voom_"
) %>%
# DESeq2
test_differential_abundance(
~ genotype,
method = "deseq2",
prefix = "deseq2_"
)
## take a look
de_all
## # A SummarizedExperiment-tibble abstraction: 136,704 × 50
## [90m# Transcripts=17088 | Samples=8 | Assays=counts, counts_scaled[39m
## feature sample counts counts_scaled SampleName source_name_ch1 organism_ch1
## <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 ENSG00… SRR11… 16 16.2 GSM4459412 epithelial cel… Homo sapiens
## 2 ENSG00… SRR11… 30 30.4 GSM4459412 epithelial cel… Homo sapiens
## 3 ENSG00… SRR11… 9 9.13 GSM4459412 epithelial cel… Homo sapiens
## 4 ENSG00… SRR11… 2039 2068. GSM4459412 epithelial cel… Homo sapiens
## 5 ENSG00… SRR11… 19 19.3 GSM4459412 epithelial cel… Homo sapiens
## 6 ENSG00… SRR11… 11 11.2 GSM4459412 epithelial cel… Homo sapiens
## 7 ENSG00… SRR11… 11304 11464. GSM4459412 epithelial cel… Homo sapiens
## 8 ENSG00… SRR11… 17 17.2 GSM4459412 epithelial cel… Homo sapiens
## 9 ENSG00… SRR11… 8 8.11 GSM4459412 epithelial cel… Homo sapiens
## 10 ENSG00… SRR11… 505 512. GSM4459412 epithelial cel… Homo sapiens
## # … with 40 more rows, and 43 more variables: characteristics_ch1 <chr>,
## # characteristics_ch1.1 <chr>, molecule_ch1 <chr>,
## # extract_protocol_ch1 <chr>, extract_protocol_ch1.1 <chr>, taxid_ch1 <chr>,
## # cell.line.ch1 <chr>, genotype.variation.ch1 <chr>, Run <chr>,
## # avgLength <int>, Experiment <chr>, Sample <chr>, BioSample <chr>,
## # sample_name <chr>, genotype <fct>, TMM <dbl>, multiplier <dbl>, PC1 <dbl>,
## # PC2 <dbl>, symbol <chr>, .abundant <lgl>, edgerQLT_logFC <dbl>,
## # edgerQLT_logCPM <dbl>, edgerQLT_F <dbl>, edgerQLT_PValue <dbl>,
## # edgerQLT_FDR <dbl>, edgerLR_logFC <dbl>, edgerLR_logCPM <dbl>,
## # edgerLR_LR <dbl>, edgerLR_PValue <dbl>, edgerLR_FDR <dbl>,
## # voom_logFC <dbl>, voom_AveExpr <dbl>, voom_t <dbl>, voom_P.Value <dbl>,
## # voom_adj.P.Val <dbl>, voom_B <dbl>, deseq2_baseMean <dbl>,
## # deseq2_log2FoldChange <dbl>, deseq2_lfcSE <dbl>, deseq2_stat <dbl>,
## # deseq2_pvalue <dbl>, deseq2_padj <dbl>
The ggpairs
plot in Figure 5, done via the GGally library (Schloerke et al. 2021) to reduce the complexity of combining transformed data, compares visually the significance for all methods, where we can notice that there is some difference between them.
de_all %>%
pivot_transcript() %>%
select(edgerQLT_PValue, edgerLR_PValue, voom_P.Value, deseq2_pvalue, feature) %>%
ggpairs(1:4)
Figure 5: ggpairs plot for most significant genes
Simplifying the way of performing an RNA sequencing differential expression analysis, the method that detects the most differentially abundant transcripts, p value adjusted for multiple testing < 0.05 (FDR, adj.P.Val, padj), is p value with the limma
package.
It is also important to observe the distribution of raw p-values, since a peak around low p-values and a uniform distribution at P-values above 0.1 (Figure 5) confirms adjustment for multiple testing is working and the results are statistically meaningful.
## take a look
de_all %>% pivot_transcript()
## # A tibble: 17,088 × 25
## feature symbol .abundant edgerQLT_logFC edgerQLT_logCPM edgerQLT_F
## <chr> <chr> <lgl> <dbl> <dbl> <dbl>
## 1 ENSG00000227232 WASH7P TRUE 0.186 -0.350 0.377
## 2 ENSG00000237683 <NA> TRUE 0.783 0.494 7.17
## 3 ENSG00000228463 RPL23AP21 TRUE 0.624 -1.13 2.11
## 4 ENSG00000225630 MTND2P28 TRUE -0.318 5.91 3.18
## 5 ENSG00000237973 MTCO1P12 TRUE -0.543 0.00992 1.35
## 6 ENSG00000229344 MTCO2P12 TRUE 0.114 -1.02 0.0798
## 7 ENSG00000248527 MTATP6P1 TRUE -0.537 8.27 10.9
## 8 ENSG00000237491 LINC01409 TRUE 0.422 0.286 0.748
## 9 ENSG00000225880 LINC00115 TRUE 2.03 -0.944 28.0
## 10 ENSG00000228794 LINC01128 TRUE 0.404 3.89 24.8
## # … with 17,078 more rows, and 19 more variables: edgerQLT_PValue <dbl>,
## # edgerQLT_FDR <dbl>, edgerLR_logFC <dbl>, edgerLR_logCPM <dbl>,
## # edgerLR_LR <dbl>, edgerLR_PValue <dbl>, edgerLR_FDR <dbl>,
## # voom_logFC <dbl>, voom_AveExpr <dbl>, voom_t <dbl>, voom_P.Value <dbl>,
## # voom_adj.P.Val <dbl>, voom_B <dbl>, deseq2_baseMean <dbl>,
## # deseq2_log2FoldChange <dbl>, deseq2_lfcSE <dbl>, deseq2_stat <dbl>,
## # deseq2_pvalue <dbl>, deseq2_padj <dbl>
## summaries
de_all %>%
filter(edgerQLT_FDR < 0.05) %>%
summarise(num_de = n_distinct(feature))
## # A tibble: 1 × 1
## num_de
## <int>
## 1 10655
de_all %>%
filter(edgerQLT_PValue < 0.05) %>%
summarise(num_de = n_distinct(feature))
## # A tibble: 1 × 1
## num_de
## <int>
## 1 11218
de_all %>%
filter(edgerLR_FDR < 0.05) %>%
summarise(num_de = n_distinct(feature))
## # A tibble: 1 × 1
## num_de
## <int>
## 1 10649
de_all %>%
filter(edgerLR_PValue < 0.05) %>%
summarise(num_de = n_distinct(feature))
## # A tibble: 1 × 1
## num_de
## <int>
## 1 11141
de_all %>%
filter(voom_P.Value < 0.05) %>%
summarise(num_de = n_distinct(feature))
## # A tibble: 1 × 1
## num_de
## <int>
## 1 11288
de_all %>%
filter(voom_adj.P.Val < 0.05) %>%
summarise(num_de = n_distinct(feature))
## # A tibble: 1 × 1
## num_de
## <int>
## 1 10735
de_all %>%
filter(deseq2_pvalue < 0.05) %>%
summarise(num_de = n_distinct(feature))
## # A tibble: 1 × 1
## num_de
## <int>
## 1 11036
de_all %>%
filter(deseq2_padj < 0.05) %>%
summarise(num_de = n_distinct(feature))
## # A tibble: 1 × 1
## num_de
## <int>
## 1 10561
At this point we produced several sets of DEGs from the replicates, and Volcano plots are useful to offer guidance to visualize, summarize and connect biological relevance to the results.
For checking genome-wide that the analysis looks good the plot enables visualization of the significance of change (p-value) versus the fold change (logFC). Highly significant genes are towards the top of the plot and significant genes with false-discovery rate < 0.05 got colored.
We decided which genes are differentially expressed taking a cut-off of 0.05 on the FDR (or adjusted P-value), not the raw p-value, because we are testing multiple genes, and the chances of finding differentially expressed genes are very high whit many tests.
In the Volcano plot of Figure 6 the genes are colored if they pass the thresholds for FDR and Log Fold Change, on the right if they are up-regulated and blue if they are down-regulated. We can see in this plot that there are many (hundreds) of significant genes in this dataset but we only indicate the 6 top genes establishing embryo implantation associated with PCX-overexpresion.
# run default edgeR method
counts_de <- counts_scal_PCA %>%
test_differential_abundance(~ genotype)
topgenes_symbols <-
counts_de %>%
pivot_transcript() %>%
arrange(PValue) %>%
head(100) %>%
pull(symbol)
topgenes_symbols
## [1] NA "EDIL3" "EMX2" "PDE1C" "MSX1" "S100A14"
## [7] "HCLS1" "CLIC6" "ADD2" "ITIH5" "MOB3B" "MKX"
## [13] "OSBP2" NA "SALL1" "FGF2" "EHD2" "SYK"
## [19] "SLC1A3" "TIMP2" "VCAN" "CAPN13" "GBP1" "PRKCA"
## [25] "CDH13" "GPR157" "IGFBP4" "AXL" "LRP2" "FBXL17"
## [31] "LGR4" "MAGEC1" "FBLN5" "SFRP1" "TC2N" "PELI2"
## [37] "PRSS8" "EPHA1" "VIM" "RBP7" "ACSL1" "ST6GAL2"
## [43] "ZEB1" "PLAU" "EPCAM" "SLC16A3" "STAT6" "EMP3"
## [49] "MRC2" "PBDC1" "INPP4B" "MAP3K14" "GANAB" "KIF5C"
## [55] "RSPO3" "FXYD5" "FKBP10" "CDH3" "PHKA1" "STX18"
## [61] "WNT7A" "SH3YL1" "TIMP1" "TGFB1" "FLNC" "PLAUR"
## [67] "RNF180" "SV2A" "ST14" "SLC4A8" "CAVIN1" "PNMA2"
## [73] "NPNT" "RAB25" "EFNA5" "P3H3" "TMEM132B" "HSD17B2"
## [79] "LAD1" "CDH1" "AEBP1" "DIRAS2" "COL8A2" "MGAT3"
## [85] "GCNT2" "PCDHB5" "TGFB2" "MYD88" "PRRX1" "APOBEC3C"
## [91] "BEND4" "DAB2" "SCEL" "S100A2" "MME" "SAMD5"
## [97] "SEMA5A" "RBMS3" "ANO1" "GALNT14"
counts_de %>%
pivot_transcript() %>%
# Subset data
mutate(significant = FDR < 0.05 & abs(logFC) >= 2) %>%
mutate(symbol = ifelse(symbol %in% topgenes_symbols, as.character(symbol), "")) %>%
# Plot
ggplot(aes(x = logFC, y = PValue, label = symbol)) +
geom_point(aes(color = significant, size = significant, alpha = significant)) +
geom_text_repel() +
# Custom scales
custom_theme +
scale_y_continuous(trans = "log10_reverse") +
scale_color_manual(values = c("black", "#e11f28")) +
scale_size_discrete(range = c(0, 2))
Figure 6: Visualization of RNA-Seq results with Volcano Plot
A highly significant correlation was found between methods for differential transcript abundance for testing transcriptional profiles of endometrial implantation (Pearson’s r > 0.970 P < 0.05).
Exploratory analyses explained the fraction of variance with the three first principal components in a 97.277% by getting the 500 most variable genes.
For the methods of transcript abundance: the edgeR
quasi-likelihood detects 10655 differential transcripts based on adjusted p value and 11218 transcripts based on p value, the edgeR
likelihood ratio detects 10649 differential transcripts based on adjusted p value and 11141 transcripts based on p value, for the lima-voom method 11288 transcripts for p value and 10735 for adjusted p vaulue and DESeq2
detects 11036 for p value and 10561 for adjusted p value, therefore, limma
outputs the most differentially abundant transcripts
From the differentiation of expression in genes we could distinguish as top gene symbols for the significant genes with the volcano plot tool: PDE1C and EDIL3 downregulating PCX-overexpression and promoting to a receptive state embryo implantation of human endometrial epithelium, and EMX2, MSX1, S100A14 upregulating PCX-overexpression promoting to a non-recpetive state the inner lining of the uterus. The top genes are those that pass the FDR and logFC thresholds that have the smallest P values. As there are hundreds of significant genes here, too many to sensibly label, we labelled the top 6 genes.
We have shown that it is possible to reliably map SummarizedExperiment
objects and perform DEG’S analyses by calling genotypes directly from RNA-seq data of two groups of human samples, despite the fact that these data originated from a different study, mainly due to the novel use of the tidy paradigm to perform these processes.
For the human samples with RNA-seq, this study establishes a gene signature to use for acquiring knowledge in endometrial receptivity tests. This report is one of the first to show that the transcriptome can be better and more efficiently analyzed with the advantage of the unified grammar encoding that the tidytranscriptomics
suite provides and that gene promotion could serve as a reservoir for potential collection of receptivity markers. Our work thus represents a step forward in the design of approaches for real-time monitoring of endometrial status, necessary for advancing the field of reproductive medicine.