Differentially expressed genes that regulate receptivity for embryo implantation in control vs PCX-overexpression replicates

JPAG1* and Ricardo Josué Acuña-González2

1Facultad de Ciencias, UNAM
2Instituto Nacional de Perinatología. Facultad de Medicina, UNAM

*jp_6@ciencias.unam.mx

24 julio 2021

Abstract

The motivation of the following, is to re-do (part of) the analysis described for the expression profiling by high throughput sequencing in the experiment with identifier GSE101942 described at the NCBI Gene Expression Omnibus (GEO), employing the state-of-the-art tidytranscriptomics suite (Mangiola et al. 2021; Mangiola 2021). Even though these data originated from a different study, samples reflecting the same cell type clustered together, revealed that a limited number of samples already suffice to identify tissue-specific genes for known disease-associated variants. Additionally, we proposed recent advances in tidy programming to efficiently evaluate a biological dataset to perform integral RNA-seq analyses. The tools we will use to perform the analysis will be the statistical programming language R (R Core Team 2020) with its many libraries and the software package RStudio (RStudio Team 2019) to interact with R.

1 Introduction

Experiment with identifier GSE101942 presents evidence that transmembrane protein podocalyxin (PCX) suppresses genes promoting receptivity (eg LIF, CSF1) but stimulates those inhibiting implantation (eg WNT7A, LEFTY2) and is a significant factor regulating human endometrial receptivity. Therefore, down-regulation of podocalyxin selectively converts the endometrial surface to a more adhesive state that facilitates embryo attachment and penetration and inadequate down-regulation of podocalyxin is associated with poorer implantation rates in IVF (in vitro fertilization).

The extraction protocol of the chosen experiment consisted of 4 replicates per group (control vs PCX-overexpression): Ishikawa cells were cultured, total RNA was isolated from control and PCX-overexpressing Ishikawa cells and total RNA was converted to strand specific Illumina compatible sequencing libraries.

FASTQ reads were assessed for quality using FastQC (Andrews 2010). Reads were then trimmed for sequence adapters using AdapterRemoval (Mikkel Schubert and Orlando 2016) and aligned to the human genome GRCh37 using the RNA-seq alignment algorithm STAR (Dobin and Gingeras. 2013).

After alignment, mapped sequence reads were summarised to the GRCh37.p13 (NCBI:GCA_000001405.14 2013-09) gene intervals using featureCounts (Yang Liao and Shi 2014). Finally the raw table of counts was transferred to R for expression analysis. The material on differential expression will be now adapted to introduce a tidy approach to RNA sequencing data representation and analysis.

2 Approach

To analyse the set of RNA transcripts in the population of cells to measure changes in expression, we’ll make use of the Bulk RNA sequencing differential expression workflow, depicted in Figure 1.

Figure 1: Bulk RNAseq pipeline (Doyle, M
and Mangiola, S., 2021)

The tidytranscriptomics suite follows the data paradigm for providing a standard way to organize data values within a dataset, where each variable is a column, each observation is a row, and data is manipulated using an easy-to-understand vocabulary.

By following this paradigm in RNA sequencing data exploration and analysis, a number of key concepts, such as filtering, scaling, dimensionality reduction, hypothesis testing, clustering and visualization become more intuitive by the use of a homogeneous jargon, the simplicity of data wrangling, and the coding unburden. Therefore, enabling focus on the statistical and biological challenges underlying an informed RNA sequencing analysis.

To approach data representation and analysis through a tidy data paradigm, allows the integration of the tidyverse with tidybulk, tidyseurat and other tidyverse-friendly packages.

The main objectives in this work will be to recall the key concepts of RNA sequencing data analysis, to apply the concepts to publicly available data and to create plots that summarize the information content of the data and analysis results.

3 Methods

3.1 Getting the data

Let’s load all the packages we will need to analyse the bulk RNA sequencing data and setting the colors and theme we will use for our plots.

# tidyverse core packages
library(tidyverse)

# databeses packages
library(GEOquery)
library(SummarizedExperiment)

# tidyverse-friendly packages
library(plotly)
library(ggrepel)
library(GGally)
library(tidybulk)
library(tidySummarizedExperiment)

# Use colourblind-friendly colours
friendly_cols <- dittoSeq::dittoColors()

# Set theme
custom_theme <-
  list(
    scale_fill_manual(values = friendly_cols),
    scale_color_manual(values = friendly_cols),
    theme_bw() +
      theme(
        panel.border = element_blank(),
        axis.line = element_line(),
        panel.grid.major = element_line(size = 0.2),
        panel.grid.minor = element_line(size = 0.1),
        text = element_text(size = 12),
        legend.position = "bottom",
        strip.background = element_blank(),
        axis.title.x = element_text(margin = margin(t = 10, r = 10, b = 10, l =
                                                      10)),
        axis.title.y = element_text(margin = margin(t = 10, r = 10, b = 10, l =
                                                      10)),
        axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1)
      )
  )

We first started by downloading the actual count data and the metadata file obtained from the sample information of the series matrix file downloaded from GEO employing the GEOquery package (Davis and Meltzer 2007). The columns were then parsed and new columns with shorter names and factor levels were added.

The information which connects the sample information from GEO with the SRA run id was also downloaded https://www.ncbi.nlm.nih.gov/sra/SRP255647, using the relations code SRP255647. This data.frame was merged with the query from GEO.

## RNA-seq (control vs PCX-overexpression) 4 replicates per group

#exp_1 <- read_table2("GSE148274_all_GN_samples_counts.tsv")
#exp.1 <- column_to_rownames(exp_1, "Geneid")
exp.1 <- read_table2("GSE148274_all_GN_samples_counts.tsv") %>% 
  column_to_rownames("Geneid")

# TO DO tidy 
gse1 <- getGEO(filename="GSE148274_series_matrix.txt")
pdata1 <- pData(gse1)[,grepl("ch1",names(pData(gse1)))]
pdata1$SampleName <- rownames(pData(gse1))
srp <- read.csv("SraRunInfo3.csv")
srpsmall <- srp[,c("Run","avgLength","Experiment","Sample","BioSample","SampleName")]
coldata1 <- merge(pdata1, srpsmall, by="SampleName")
rownames(coldata1) <- coldata1$Run

Then a SummarizedExperiment object was created. The library tidySummarizedExperiment provides a bridge between Bioconductor SummarizedExperiment (Morgan et al. 2021) and the tidyverse (Wickham et al. 2019) to enable visualisation and treatment of the Bioconductor SummarizedExperiment object as a tidyverse tibble, providing SummarizedExperiment-compatible dplyr, tidyr, ggplot and plotly functions.

The SummarizedExperiment class is used to store the rectangular matrix of experimental results produced by the high throughput sequencing experiments. Each object stores observations of one or more samples, along with additional meta-data describing both the observations (features) and samples (phenotypes).

pcxoe <- SummarizedExperiment(list(counts=as.matrix(exp.1)),colData = coldata1)

## take a look
pcxoe

## # A SummarizedExperiment-tibble abstraction: 463,240 × 19
## [90m# Transcripts=57905 | Samples=8 | Assays=counts[39m
##    feature sample counts SampleName source_name_ch1 organism_ch1
##    <chr>   <chr>   <dbl> <chr>      <chr>           <chr>       
##  1 ENSG00… SRR11…      0 GSM4459412 epithelial cel… Homo sapiens
##  2 ENSG00… SRR11…     16 GSM4459412 epithelial cel… Homo sapiens
##  3 ENSG00… SRR11…      0 GSM4459412 epithelial cel… Homo sapiens
##  4 ENSG00… SRR11…      0 GSM4459412 epithelial cel… Homo sapiens
##  5 ENSG00… SRR11…      0 GSM4459412 epithelial cel… Homo sapiens
##  6 ENSG00… SRR11…      0 GSM4459412 epithelial cel… Homo sapiens
##  7 ENSG00… SRR11…      0 GSM4459412 epithelial cel… Homo sapiens
##  8 ENSG00… SRR11…      0 GSM4459412 epithelial cel… Homo sapiens
##  9 ENSG00… SRR11…      0 GSM4459412 epithelial cel… Homo sapiens
## 10 ENSG00… SRR11…      0 GSM4459412 epithelial cel… Homo sapiens
## # … with 40 more rows, and 13 more variables: characteristics_ch1 <chr>,
## #   characteristics_ch1.1 <chr>, molecule_ch1 <chr>,
## #   extract_protocol_ch1 <chr>, extract_protocol_ch1.1 <chr>, taxid_ch1 <chr>,
## #   cell.line.ch1 <chr>, genotype.variation.ch1 <chr>, Run <chr>,
## #   avgLength <int>, Experiment <chr>, Sample <chr>, BioSample <chr>

assays(pcxoe)

## List of length 1
## names(1): counts

3.2 Setting up the data

And now tidyverse commands, such as filter, select and mutate were used to explore the tidy SummarizedExperiment object, alongside the tidyverse pipe %>%, since it’s use reduces the amount of code needed to write when we have multiple steps and also makes the steps clearer and easier to see.

We created a new sample_name column that contains shorter sample names, by removing the GSM4459 prefix that’s present in all of the samples, as shorter names can fit better the plots that were created. Also, to give a statistical treatment, the genotype variation was converted to a factor type.

Lastly, to set up the data for our RNA sequencing analysis a column with gene symbols was created. The gene symbols for these Ensembl gene ids can be obtained using the Bioconductor annotation package for human, org.Hs.eg.db (Pagès et al. 2021).

counts <-
  pcxoe %>%
  mutate(sample_name = str_remove(sample, "GSM4459"),
         genotype = as_factor(genotype.variation.ch1)) %>%
  mutate(symbol = AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
                                        keys = feature,
                                        keytype = "ENSEMBL",
                                        column = "SYMBOL",
                                        multiVals = "first"
  ))
counts

## # A SummarizedExperiment-tibble abstraction: 463,240 × 22
## [90m# Transcripts=57905 | Samples=8 | Assays=counts[39m
##    feature sample counts SampleName source_name_ch1 organism_ch1
##    <chr>   <chr>   <dbl> <chr>      <chr>           <chr>       
##  1 ENSG00… SRR11…      0 GSM4459412 epithelial cel… Homo sapiens
##  2 ENSG00… SRR11…     16 GSM4459412 epithelial cel… Homo sapiens
##  3 ENSG00… SRR11…      0 GSM4459412 epithelial cel… Homo sapiens
##  4 ENSG00… SRR11…      0 GSM4459412 epithelial cel… Homo sapiens
##  5 ENSG00… SRR11…      0 GSM4459412 epithelial cel… Homo sapiens
##  6 ENSG00… SRR11…      0 GSM4459412 epithelial cel… Homo sapiens
##  7 ENSG00… SRR11…      0 GSM4459412 epithelial cel… Homo sapiens
##  8 ENSG00… SRR11…      0 GSM4459412 epithelial cel… Homo sapiens
##  9 ENSG00… SRR11…      0 GSM4459412 epithelial cel… Homo sapiens
## 10 ENSG00… SRR11…      0 GSM4459412 epithelial cel… Homo sapiens
## # … with 40 more rows, and 16 more variables: characteristics_ch1 <chr>,
## #   characteristics_ch1.1 <chr>, molecule_ch1 <chr>,
## #   extract_protocol_ch1 <chr>, extract_protocol_ch1.1 <chr>, taxid_ch1 <chr>,
## #   cell.line.ch1 <chr>, genotype.variation.ch1 <chr>, Run <chr>,
## #   avgLength <int>, Experiment <chr>, Sample <chr>, BioSample <chr>,
## #   sample_name <chr>, genotype <fct>, symbol <chr>

3.3 Pre-processing

A filter for lowly expressed genes was performed using tidybulk, which in turn uses the edgeR filterByExpr() function described in McCarthy, Chen, and Smyth (2012), to automatically identify the genes with adequate abundance for differential expression testing.

We also normalized counts by scaling them and avoid technical and composition differences between the samples, employing the default edgeR’s trimmed mean of M values (TMM), (Robinson and Oshlack 2010). We can visualise the difference in abundance densities before and after scaling (Figure 2).

counts_scaled <- counts %>% 
  keep_abundant(factor_of_interest = genotype) %>%
  scale_abundance()

counts_scaled %>%
  
  # Reshaping
  pivot_longer(cols = c("counts", "counts_scaled"), names_to = "source",
               values_to = "abundance") %>%
  
  # Plotting
  ggplot(aes(x = abundance + 1, color = sample_name)) +
  geom_density() +
  facet_wrap(~source) +
  scale_x_log10() +
  custom_theme

Figure 2: Density plot comparing raw and normalized counts distribution for 8 samples

It can be assessed, that the distributions of the counts are not very different to each other before scaling and scaling makes the distributions more similar.

3.4 Exploratory analysis

To interrogate the RNA-seq experiment design, as a way to perform quality control, the similarity of the samples with each other in terms of the quantified expression level profiles across the set of genes was measured. This allowed to see whether the most similar samples are the biological replicates of that sample.

We reduced the dimensions of the data to identify the greatest sources of variation by applying Principal Component Analysis (PCA) and visualized them as two (Figure 3) and three-dimensional (Figure 4) scatter plots.

The 2D visualization follows the tidy graphical guidelines of ggplot2 (Wickham 2016) and ggrepel (Slowikowski 2021) to repel away text labels, data points, and edges from each other in the plotting area. The 3D plot uses the plotly framework (Sievert 2020).

## Get principal components
counts_scal_PCA <-
  counts_scaled %>%
  reduce_dimensions(method = "PCA")

counts_scal_PCA

## # A SummarizedExperiment-tibble abstraction: 136,704 × 28
## [90m# Transcripts=17088 | Samples=8 | Assays=counts, counts_scaled[39m
##    feature sample counts counts_scaled SampleName source_name_ch1 organism_ch1
##    <chr>   <chr>   <dbl>         <dbl> <chr>      <chr>           <chr>       
##  1 ENSG00… SRR11…     16         16.2  GSM4459412 epithelial cel… Homo sapiens
##  2 ENSG00… SRR11…     30         30.4  GSM4459412 epithelial cel… Homo sapiens
##  3 ENSG00… SRR11…      9          9.13 GSM4459412 epithelial cel… Homo sapiens
##  4 ENSG00… SRR11…   2039       2068.   GSM4459412 epithelial cel… Homo sapiens
##  5 ENSG00… SRR11…     19         19.3  GSM4459412 epithelial cel… Homo sapiens
##  6 ENSG00… SRR11…     11         11.2  GSM4459412 epithelial cel… Homo sapiens
##  7 ENSG00… SRR11…  11304      11464.   GSM4459412 epithelial cel… Homo sapiens
##  8 ENSG00… SRR11…     17         17.2  GSM4459412 epithelial cel… Homo sapiens
##  9 ENSG00… SRR11…      8          8.11 GSM4459412 epithelial cel… Homo sapiens
## 10 ENSG00… SRR11…    505        512.   GSM4459412 epithelial cel… Homo sapiens
## # … with 40 more rows, and 21 more variables: characteristics_ch1 <chr>,
## #   characteristics_ch1.1 <chr>, molecule_ch1 <chr>,
## #   extract_protocol_ch1 <chr>, extract_protocol_ch1.1 <chr>, taxid_ch1 <chr>,
## #   cell.line.ch1 <chr>, genotype.variation.ch1 <chr>, Run <chr>,
## #   avgLength <int>, Experiment <chr>, Sample <chr>, BioSample <chr>,
## #   sample_name <chr>, genotype <fct>, TMM <dbl>, multiplier <dbl>, PC1 <dbl>,
## #   PC2 <dbl>, symbol <chr>, .abundant <lgl>

## take a look
counts_scal_PCA %>% pivot_sample()

## # A tibble: 8 × 23
##   sample      SampleName source_name_ch1  organism_ch1 characteristics_ch1      
##   <chr>       <chr>      <chr>            <chr>        <chr>                    
## 1 SRR11506594 GSM4459412 epithelial cells Homo sapiens cell line: Ishikawa cells
## 2 SRR11506595 GSM4459413 epithelial cells Homo sapiens cell line: Ishikawa cells
## 3 SRR11506596 GSM4459414 epithelial cells Homo sapiens cell line: Ishikawa cells
## 4 SRR11506597 GSM4459415 epithelial cells Homo sapiens cell line: Ishikawa cells
## 5 SRR11506598 GSM4459416 epithelial cells Homo sapiens cell line: Ishikawa cells
## 6 SRR11506599 GSM4459417 epithelial cells Homo sapiens cell line: Ishikawa cells
## 7 SRR11506600 GSM4459418 epithelial cells Homo sapiens cell line: Ishikawa cells
## 8 SRR11506601 GSM4459419 epithelial cells Homo sapiens cell line: Ishikawa cells
## # … with 18 more variables: characteristics_ch1.1 <chr>, molecule_ch1 <chr>,
## #   extract_protocol_ch1 <chr>, extract_protocol_ch1.1 <chr>, taxid_ch1 <chr>,
## #   cell.line.ch1 <chr>, genotype.variation.ch1 <chr>, Run <chr>,
## #   avgLength <int>, Experiment <chr>, Sample <chr>, BioSample <chr>,
## #   sample_name <chr>, genotype <fct>, TMM <dbl>, multiplier <dbl>, PC1 <dbl>,
## #   PC2 <dbl>

## PCA plot
counts_scal_PCA %>%
  pivot_sample() %>%
  ggplot(aes(x = PC1, y = PC2, colour = genotype)) +
  geom_point() +
  geom_text_repel(aes(label = sample_name), show.legend = FALSE) +
  custom_theme

Figure 3: PCA plot of samples using TPM counts

counts_scal_PCA3 <-
  counts_scaled %>%
  reduce_dimensions(method = "PCA", .dims = 3)

counts_scal_PCA3

## # A SummarizedExperiment-tibble abstraction: 136,704 × 29
## [90m# Transcripts=17088 | Samples=8 | Assays=counts, counts_scaled[39m
##    feature sample counts counts_scaled SampleName source_name_ch1 organism_ch1
##    <chr>   <chr>   <dbl>         <dbl> <chr>      <chr>           <chr>       
##  1 ENSG00… SRR11…     16         16.2  GSM4459412 epithelial cel… Homo sapiens
##  2 ENSG00… SRR11…     30         30.4  GSM4459412 epithelial cel… Homo sapiens
##  3 ENSG00… SRR11…      9          9.13 GSM4459412 epithelial cel… Homo sapiens
##  4 ENSG00… SRR11…   2039       2068.   GSM4459412 epithelial cel… Homo sapiens
##  5 ENSG00… SRR11…     19         19.3  GSM4459412 epithelial cel… Homo sapiens
##  6 ENSG00… SRR11…     11         11.2  GSM4459412 epithelial cel… Homo sapiens
##  7 ENSG00… SRR11…  11304      11464.   GSM4459412 epithelial cel… Homo sapiens
##  8 ENSG00… SRR11…     17         17.2  GSM4459412 epithelial cel… Homo sapiens
##  9 ENSG00… SRR11…      8          8.11 GSM4459412 epithelial cel… Homo sapiens
## 10 ENSG00… SRR11…    505        512.   GSM4459412 epithelial cel… Homo sapiens
## # … with 40 more rows, and 22 more variables: characteristics_ch1 <chr>,
## #   characteristics_ch1.1 <chr>, molecule_ch1 <chr>,
## #   extract_protocol_ch1 <chr>, extract_protocol_ch1.1 <chr>, taxid_ch1 <chr>,
## #   cell.line.ch1 <chr>, genotype.variation.ch1 <chr>, Run <chr>,
## #   avgLength <int>, Experiment <chr>, Sample <chr>, BioSample <chr>,
## #   sample_name <chr>, genotype <fct>, TMM <dbl>, multiplier <dbl>, PC1 <dbl>,
## #   PC2 <dbl>, PC3 <dbl>, symbol <chr>, .abundant <lgl>

## take a look
counts_scal_PCA3 %>% pivot_sample()

## # A tibble: 8 × 24
##   sample      SampleName source_name_ch1  organism_ch1 characteristics_ch1      
##   <chr>       <chr>      <chr>            <chr>        <chr>                    
## 1 SRR11506594 GSM4459412 epithelial cells Homo sapiens cell line: Ishikawa cells
## 2 SRR11506595 GSM4459413 epithelial cells Homo sapiens cell line: Ishikawa cells
## 3 SRR11506596 GSM4459414 epithelial cells Homo sapiens cell line: Ishikawa cells
## 4 SRR11506597 GSM4459415 epithelial cells Homo sapiens cell line: Ishikawa cells
## 5 SRR11506598 GSM4459416 epithelial cells Homo sapiens cell line: Ishikawa cells
## 6 SRR11506599 GSM4459417 epithelial cells Homo sapiens cell line: Ishikawa cells
## 7 SRR11506600 GSM4459418 epithelial cells Homo sapiens cell line: Ishikawa cells
## 8 SRR11506601 GSM4459419 epithelial cells Homo sapiens cell line: Ishikawa cells
## # … with 19 more variables: characteristics_ch1.1 <chr>, molecule_ch1 <chr>,
## #   extract_protocol_ch1 <chr>, extract_protocol_ch1.1 <chr>, taxid_ch1 <chr>,
## #   cell.line.ch1 <chr>, genotype.variation.ch1 <chr>, Run <chr>,
## #   avgLength <int>, Experiment <chr>, Sample <chr>, BioSample <chr>,
## #   sample_name <chr>, genotype <fct>, TMM <dbl>, multiplier <dbl>, PC1 <dbl>,
## #   PC2 <dbl>, PC3 <dbl>

## PCA3 plot
counts_scal_PCA3 %>%
  pivot_sample() %>%
  plot_ly(
    x = ~ PC1,
    y = ~ PC2,
    z = ~ PC3,
    color = ~ genotype,
    colors = friendly_cols[1:5]
  )

Figure 4: PCA plot of samples using TPM counts
Three principal components.

The experiment is well controlled and has worked well, since the greatest sources of variation in the data are the treatments/groups we are interested in. It also shows a streamlined quality control because no outliers are detected.

The samples group by treatment on PC1 which is what we hope to see. PC2 separates the PCX-overexpression from the other samples, indicating a greater difference between that group and control.

The exploratory results provides hints towards the fundamental mechanisms governing receptivity, a key step in establishing the transformation from a non-receptive to a receptive state, allowing embryos to attach to the surface and enter into the tissue.

3.5 Differential expression

The main and most important analysis step we will do is finding genes that show a difference in expression between sample groups. The process consisted on calculating the ratios for all genes between samples to determine the fold-change (FC) denoting the factor of change in expression between groups. Then, we filtered out only those genes that actually show a difference.

Whit a list of genes showing different behavior across samples, a focus on genes that are also statistically significant was tested against the null hypothesis that the activity of the genes stayed the same under overexpression of PCX conditions.

The limiting factors that influence the power of detecting genes that have real changes between control and PCX-overexpression are the limited number of biological replicates, non-normality of the distribution of the read counts, and higher uncertainty of measurements for lowly expressed genes than highly expressed genes (McCarthy, Chen, and Smyth 2012).

To address these limitations, several popular methods for differential transcript abundance testing where integrated in order to maximize the amount of knowledge that can be extracted from such noisy datasets: the edgeR quasi-likelihood (default method in tidybulk), edgeR likelihood ratio (Robinson, McCarthy, and Smyth 2010), limma-voom of the limma library (Ritchie et al. 2015) and DESeq2 (Love, Huber, and Anders 2014).

To choose a method, tidybulk easily runs all these methods at once to make comparisons. All samples were from the same Ishikawa cell line and there were no additional factors contributing variance, such as batch differences, so the formula only takes genotype as covariate.

This outputs the columns from each method such as log-fold change (logFC), false-discovery rate (FDR) and probability value (p-value). logFC is log2(control/overexpression).

de_all <-
  
  counts_scal_PCA %>%
  
  # edgeR QLT
  test_differential_abundance(
    ~ genotype,
    method = "edgeR_quasi_likelihood",
    prefix = "edgerQLT_"
  ) %>%
  
  # edgeR LRT
  test_differential_abundance(
    ~ genotype,
    method = "edgeR_likelihood_ratio",
    prefix = "edgerLR_"
  ) %>%
  
  # limma-voom
  test_differential_abundance(
    ~ genotype,
    method = "limma_voom",
    prefix = "voom_"
  ) %>%
  
  # DESeq2
  test_differential_abundance(
    ~ genotype,
    method = "deseq2",
    prefix = "deseq2_"
  )

## take a look
de_all

## # A SummarizedExperiment-tibble abstraction: 136,704 × 50
## [90m# Transcripts=17088 | Samples=8 | Assays=counts, counts_scaled[39m
##    feature sample counts counts_scaled SampleName source_name_ch1 organism_ch1
##    <chr>   <chr>   <dbl>         <dbl> <chr>      <chr>           <chr>       
##  1 ENSG00… SRR11…     16         16.2  GSM4459412 epithelial cel… Homo sapiens
##  2 ENSG00… SRR11…     30         30.4  GSM4459412 epithelial cel… Homo sapiens
##  3 ENSG00… SRR11…      9          9.13 GSM4459412 epithelial cel… Homo sapiens
##  4 ENSG00… SRR11…   2039       2068.   GSM4459412 epithelial cel… Homo sapiens
##  5 ENSG00… SRR11…     19         19.3  GSM4459412 epithelial cel… Homo sapiens
##  6 ENSG00… SRR11…     11         11.2  GSM4459412 epithelial cel… Homo sapiens
##  7 ENSG00… SRR11…  11304      11464.   GSM4459412 epithelial cel… Homo sapiens
##  8 ENSG00… SRR11…     17         17.2  GSM4459412 epithelial cel… Homo sapiens
##  9 ENSG00… SRR11…      8          8.11 GSM4459412 epithelial cel… Homo sapiens
## 10 ENSG00… SRR11…    505        512.   GSM4459412 epithelial cel… Homo sapiens
## # … with 40 more rows, and 43 more variables: characteristics_ch1 <chr>,
## #   characteristics_ch1.1 <chr>, molecule_ch1 <chr>,
## #   extract_protocol_ch1 <chr>, extract_protocol_ch1.1 <chr>, taxid_ch1 <chr>,
## #   cell.line.ch1 <chr>, genotype.variation.ch1 <chr>, Run <chr>,
## #   avgLength <int>, Experiment <chr>, Sample <chr>, BioSample <chr>,
## #   sample_name <chr>, genotype <fct>, TMM <dbl>, multiplier <dbl>, PC1 <dbl>,
## #   PC2 <dbl>, symbol <chr>, .abundant <lgl>, edgerQLT_logFC <dbl>,
## #   edgerQLT_logCPM <dbl>, edgerQLT_F <dbl>, edgerQLT_PValue <dbl>,
## #   edgerQLT_FDR <dbl>, edgerLR_logFC <dbl>, edgerLR_logCPM <dbl>,
## #   edgerLR_LR <dbl>, edgerLR_PValue <dbl>, edgerLR_FDR <dbl>,
## #   voom_logFC <dbl>, voom_AveExpr <dbl>, voom_t <dbl>, voom_P.Value <dbl>,
## #   voom_adj.P.Val <dbl>, voom_B <dbl>, deseq2_baseMean <dbl>,
## #   deseq2_log2FoldChange <dbl>, deseq2_lfcSE <dbl>, deseq2_stat <dbl>,
## #   deseq2_pvalue <dbl>, deseq2_padj <dbl>

3.6 Comparison of methods

The ggpairs plot in Figure 5, done via the GGally library (Schloerke et al. 2021) to reduce the complexity of combining transformed data, compares visually the significance for all methods, where we can notice that there is some difference between them.

de_all %>%
  pivot_transcript() %>%
  select(edgerQLT_PValue, edgerLR_PValue, voom_P.Value, deseq2_pvalue, feature) %>%
  ggpairs(1:4)

Figure 5: ggpairs plot for most significant genes

Simplifying the way of performing an RNA sequencing differential expression analysis, the method that detects the most differentially abundant transcripts, p value adjusted for multiple testing < 0.05 (FDR, adj.P.Val, padj), is p value with the limma package.

It is also important to observe the distribution of raw p-values, since a peak around low p-values and a uniform distribution at P-values above 0.1 (Figure 5) confirms adjustment for multiple testing is working and the results are statistically meaningful.

## take a look
de_all %>% pivot_transcript()

## # A tibble: 17,088 × 25
##    feature         symbol    .abundant edgerQLT_logFC edgerQLT_logCPM edgerQLT_F
##    <chr>           <chr>     <lgl>              <dbl>           <dbl>      <dbl>
##  1 ENSG00000227232 WASH7P    TRUE               0.186        -0.350       0.377 
##  2 ENSG00000237683 <NA>      TRUE               0.783         0.494       7.17  
##  3 ENSG00000228463 RPL23AP21 TRUE               0.624        -1.13        2.11  
##  4 ENSG00000225630 MTND2P28  TRUE              -0.318         5.91        3.18  
##  5 ENSG00000237973 MTCO1P12  TRUE              -0.543         0.00992     1.35  
##  6 ENSG00000229344 MTCO2P12  TRUE               0.114        -1.02        0.0798
##  7 ENSG00000248527 MTATP6P1  TRUE              -0.537         8.27       10.9   
##  8 ENSG00000237491 LINC01409 TRUE               0.422         0.286       0.748 
##  9 ENSG00000225880 LINC00115 TRUE               2.03         -0.944      28.0   
## 10 ENSG00000228794 LINC01128 TRUE               0.404         3.89       24.8   
## # … with 17,078 more rows, and 19 more variables: edgerQLT_PValue <dbl>,
## #   edgerQLT_FDR <dbl>, edgerLR_logFC <dbl>, edgerLR_logCPM <dbl>,
## #   edgerLR_LR <dbl>, edgerLR_PValue <dbl>, edgerLR_FDR <dbl>,
## #   voom_logFC <dbl>, voom_AveExpr <dbl>, voom_t <dbl>, voom_P.Value <dbl>,
## #   voom_adj.P.Val <dbl>, voom_B <dbl>, deseq2_baseMean <dbl>,
## #   deseq2_log2FoldChange <dbl>, deseq2_lfcSE <dbl>, deseq2_stat <dbl>,
## #   deseq2_pvalue <dbl>, deseq2_padj <dbl>

## summaries

de_all %>%
  filter(edgerQLT_FDR < 0.05) %>%
  summarise(num_de = n_distinct(feature))

## # A tibble: 1 × 1
##   num_de
##    <int>
## 1  10655

de_all %>%
  filter(edgerQLT_PValue < 0.05) %>%
  summarise(num_de = n_distinct(feature))

## # A tibble: 1 × 1
##   num_de
##    <int>
## 1  11218

de_all %>%
  filter(edgerLR_FDR < 0.05) %>%
  summarise(num_de = n_distinct(feature))

## # A tibble: 1 × 1
##   num_de
##    <int>
## 1  10649

de_all %>%
  filter(edgerLR_PValue < 0.05) %>%
  summarise(num_de = n_distinct(feature))

## # A tibble: 1 × 1
##   num_de
##    <int>
## 1  11141

de_all %>%
  filter(voom_P.Value < 0.05) %>%
  summarise(num_de = n_distinct(feature))

## # A tibble: 1 × 1
##   num_de
##    <int>
## 1  11288

de_all %>%
  filter(voom_adj.P.Val < 0.05) %>%
  summarise(num_de = n_distinct(feature))

## # A tibble: 1 × 1
##   num_de
##    <int>
## 1  10735

de_all %>%
  filter(deseq2_pvalue < 0.05) %>%
  summarise(num_de = n_distinct(feature))

## # A tibble: 1 × 1
##   num_de
##    <int>
## 1  11036

de_all %>%
  filter(deseq2_padj < 0.05) %>%
  summarise(num_de = n_distinct(feature))

## # A tibble: 1 × 1
##   num_de
##    <int>
## 1  10561

3.7 Volcano plots

At this point we produced several sets of DEGs from the replicates, and Volcano plots are useful to offer guidance to visualize, summarize and connect biological relevance to the results.

For checking genome-wide that the analysis looks good the plot enables visualization of the significance of change (p-value) versus the fold change (logFC). Highly significant genes are towards the top of the plot and significant genes with false-discovery rate < 0.05 got colored.

We decided which genes are differentially expressed taking a cut-off of 0.05 on the FDR (or adjusted P-value), not the raw p-value, because we are testing multiple genes, and the chances of finding differentially expressed genes are very high whit many tests.

In the Volcano plot of Figure 6 the genes are colored if they pass the thresholds for FDR and Log Fold Change, on the right if they are up-regulated and blue if they are down-regulated. We can see in this plot that there are many (hundreds) of significant genes in this dataset but we only indicate the 6 top genes establishing embryo implantation associated with PCX-overexpresion.

# run default edgeR method
counts_de <- counts_scal_PCA %>%
  test_differential_abundance(~ genotype)

topgenes_symbols <-
  counts_de %>%
  pivot_transcript() %>%
  arrange(PValue) %>%
  head(100) %>%
  pull(symbol)
topgenes_symbols

##   [1] NA         "EDIL3"    "EMX2"     "PDE1C"    "MSX1"     "S100A14" 
##   [7] "HCLS1"    "CLIC6"    "ADD2"     "ITIH5"    "MOB3B"    "MKX"     
##  [13] "OSBP2"    NA         "SALL1"    "FGF2"     "EHD2"     "SYK"     
##  [19] "SLC1A3"   "TIMP2"    "VCAN"     "CAPN13"   "GBP1"     "PRKCA"   
##  [25] "CDH13"    "GPR157"   "IGFBP4"   "AXL"      "LRP2"     "FBXL17"  
##  [31] "LGR4"     "MAGEC1"   "FBLN5"    "SFRP1"    "TC2N"     "PELI2"   
##  [37] "PRSS8"    "EPHA1"    "VIM"      "RBP7"     "ACSL1"    "ST6GAL2" 
##  [43] "ZEB1"     "PLAU"     "EPCAM"    "SLC16A3"  "STAT6"    "EMP3"    
##  [49] "MRC2"     "PBDC1"    "INPP4B"   "MAP3K14"  "GANAB"    "KIF5C"   
##  [55] "RSPO3"    "FXYD5"    "FKBP10"   "CDH3"     "PHKA1"    "STX18"   
##  [61] "WNT7A"    "SH3YL1"   "TIMP1"    "TGFB1"    "FLNC"     "PLAUR"   
##  [67] "RNF180"   "SV2A"     "ST14"     "SLC4A8"   "CAVIN1"   "PNMA2"   
##  [73] "NPNT"     "RAB25"    "EFNA5"    "P3H3"     "TMEM132B" "HSD17B2" 
##  [79] "LAD1"     "CDH1"     "AEBP1"    "DIRAS2"   "COL8A2"   "MGAT3"   
##  [85] "GCNT2"    "PCDHB5"   "TGFB2"    "MYD88"    "PRRX1"    "APOBEC3C"
##  [91] "BEND4"    "DAB2"     "SCEL"     "S100A2"   "MME"      "SAMD5"   
##  [97] "SEMA5A"   "RBMS3"    "ANO1"     "GALNT14"

counts_de %>%
  pivot_transcript() %>%
  
  # Subset data
  mutate(significant = FDR < 0.05 & abs(logFC) >= 2) %>%
  mutate(symbol = ifelse(symbol %in% topgenes_symbols, as.character(symbol), "")) %>%
  
  # Plot
  ggplot(aes(x = logFC, y = PValue, label = symbol)) +
  geom_point(aes(color = significant, size = significant, alpha = significant)) +
  geom_text_repel() +
  
  # Custom scales
  custom_theme +
  scale_y_continuous(trans = "log10_reverse") +
  scale_color_manual(values = c("black", "#e11f28")) +
  scale_size_discrete(range = c(0, 2))

Figure 6: Visualization of RNA-Seq results with Volcano Plot

4 Discussion

A highly significant correlation was found between methods for differential transcript abundance for testing transcriptional profiles of endometrial implantation (Pearson’s r > 0.970 P < 0.05).

Exploratory analyses explained the fraction of variance with the three first principal components in a 97.277% by getting the 500 most variable genes.

For the methods of transcript abundance: the edgeR quasi-likelihood detects 10655 differential transcripts based on adjusted p value and 11218 transcripts based on p value, the edgeR likelihood ratio detects 10649 differential transcripts based on adjusted p value and 11141 transcripts based on p value, for the lima-voom method 11288 transcripts for p value and 10735 for adjusted p vaulue and DESeq2 detects 11036 for p value and 10561 for adjusted p value, therefore, limma outputs the most differentially abundant transcripts

From the differentiation of expression in genes we could distinguish as top gene symbols for the significant genes with the volcano plot tool: PDE1C and EDIL3 downregulating PCX-overexpression and promoting to a receptive state embryo implantation of human endometrial epithelium, and EMX2, MSX1, S100A14 upregulating PCX-overexpression promoting to a non-recpetive state the inner lining of the uterus. The top genes are those that pass the FDR and logFC thresholds that have the smallest P values. As there are hundreds of significant genes here, too many to sensibly label, we labelled the top 6 genes.

5 Conclusion

We have shown that it is possible to reliably map SummarizedExperiment objects and perform DEG’S analyses by calling genotypes directly from RNA-seq data of two groups of human samples, despite the fact that these data originated from a different study, mainly due to the novel use of the tidy paradigm to perform these processes.

For the human samples with RNA-seq, this study establishes a gene signature to use for acquiring knowledge in endometrial receptivity tests. This report is one of the first to show that the transcriptome can be better and more efficiently analyzed with the advantage of the unified grammar encoding that the tidytranscriptomics suite provides and that gene promotion could serve as a reservoir for potential collection of receptivity markers. Our work thus represents a step forward in the design of approaches for real-time monitoring of endometrial status, necessary for advancing the field of reproductive medicine.

References

Andrews, S. 2010. “FastQC: A Quality Control Tool for High Throughput Sequence Data.”

Davis, Sean, and Paul Meltzer. 2007. “GEOquery: A Bridge Between the Gene Expression Omnibus (GEO) and BioConductor.” Bioinformatics 14: 1846–47.

Dobin, Schlesinger, Davis, and Gingeras. 2013. “STAR: Ultrafast Universal RNA-Seq Aligner.” Bioinformatics 29 (1): 15–21.

Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15: 550. https://doi.org/10.1186/s13059-014-0550-8.

Mangiola, Stefano. 2021. tidySummarizedExperiment: Brings SummarizedExperiment to the Tidyverse.

Mangiola, Stefano, Ramyar Molania, Ruining Dong, and Maria A. Doyle & Anthony T. Papenfuss. 2021. “Tidybulk: An r Tidy Framework for Modular Transcriptomic Data Analysis.” Genome Biology 22 (42). https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02233-7.

McCarthy, Davis J, Yunshun Chen, and Gordon K Smyth. 2012. “Differential Expression Analysis of Multifactor RNA-Seq Experiments with Respect to Biological Variation.” Nucleic Acids Research 40 (10): 4288–97. https://doi.org/10.1093/nar/gks042.

Mikkel Schubert, Stinus Lindgreen, and Ludovic Orlando. 2016. “AdapterRemoval V2: Rapid Adapter Trimming, Identification, and Read Merging.” BMC Research Notes 9 (88).

Morgan, Martin, Valerie Obenchain, Jim Hester, and Hervé Pagès. 2021. SummarizedExperiment: SummarizedExperiment Container. https://bioconductor.org/packages/SummarizedExperiment.

Pagès, Hervé, Marc Carlson, Seth Falcon, and Nianhua Li. 2021. AnnotationDbi: Manipulation of SQLite-Based Annotations in Bioconductor. https://bioconductor.org/packages/AnnotationDbi.

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.

Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40.

Robinson, Mark D, and Alicia Oshlack. 2010. “A Scaling Normalization Method for Differential Expression Analysis of RNA-Seq Data.” Genome Biology 11 (3): 1–9.

RStudio Team. 2019. RStudio: Integrated Development Environment for r. Boston, MA: RStudio, Inc. http://www.rstudio.com/.

Schloerke, Barret, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Jason Crowley. 2021. GGally: Extension to ’Ggplot2’. https://CRAN.R-project.org/package=GGally.

Sievert, Carson. 2020. Interactive Web-Based Data Visualization with r, Plotly, and Shiny. Chapman; Hall/CRC. https://plotly-r.com.

Slowikowski, Kamil. 2021. Ggrepel: Automatically Position Non-Overlapping Text Labels with ’Ggplot2’. https://CRAN.R-project.org/package=ggrepel.

Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain Francois, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686.

Yang Liao, Gordon K Smyth, and Wei Shi. 2014. “featureCounts: An Efficient General Purpose Program for Assigning Sequence Reads to Genomic Features.” Bioinformatics 30 (7): 923–30.