1 Overview

MeinteR (MEthylation INTERpretation) is an R package that identifies critical differentially methylated sites (DMS), based on the following hypothesis: Critical methylation-mediated changes are more likely to occur in genomic regions that are enriched in cis-acting regulatory elements than in “genomic deserts”. With MeinteR we first calculate the abundance of co-localized elements, such as transcription factor binding sites, tentative splice sites, and other DNA features, such as G-quadruplexes and palindromes, that potentially lead to distinct DNA conformational features, and then we rank them with respect to their putative methylation impact. MeinteR supports the most widely-used, single-base resolution assays, including Illumina’s BeadChip HumanMethylation27, HumanMethylation450 and MethylationEPIC arrays, whole genome, reduced representation and targeted bisulfite sequencing. The input data are either tabular files containing DMS coordinates of human genome (hg19 assembly) or array-based and sequencing data that can be directly imported from the Gene Expression Omnibus (GEO) repository. # Installation To download and install MeinteR and its dependencies, you need to install devtools and then run the following R commands:

devtools::install_github("andigoni/meinter")

Upon successful installation of the package, load MeinteR and all its dependencies in the workspace using:

library(MeinteR)

2 Getting started

Before we start, we may carry out a couple of optional preparatory functions, as described below: ## Setting the working environment First, we set a title for our study by using the nameStudy function:

nameStudy(study.name="MyProj")

The study.name will be shown in the produced plots. Second, we set the analysis folder with read-write access rights e.g.:

project.dir <- "~/meinter_dir"

2.1 Formatting and importing datasets

2.1.1 Data format

MeinteR’s core functions are applied on bed-formatted files containing the following columns:

Chr: Name of the chromosome (e.g. chr3)

Start: Start coordinate (hg19 assembly)

End: End coordinate (hg19 assembly)

Score: Difference of methylation levels ranging between -1 and 1

Strand: DNA strand. Either “.” (=no strand) or “+” or “-”

The Score field of the input dataset may correspond to the delta-beta value of each DMS i.e. the difference of the methylation levels in two groups of data, for example normal/disease samples or pre-, post-treatment conditions of a sample set etc.

In the following, we demostrate the functionalities on a sample dataset assigned to variable sample. The sample dataset is automatically loaded in the workspace and can be used for demostration purposes.

head(sample)

The sample dataset contains the coordinates and delta-beta values for a set of 5,840 CpG sites. The column names are not consistent with formatting rules of the input data, so we need to reorder the misplaced columns using the reorderBed function:

re.sample = reorderBed(sample,1,2,3,5,4) 
head(re.sample)

Columns 4 and 5 in the above example will change position. In addition, all columns will be renamed, appropriately. ### Import local data To work on your own data, load the tabular file by setting the column separator (sep) and heading (header) parameters appropriately. For example,

input.data <- read.csv(file.path(project.dir, "my_data.csv"), sep=",", header = T)

where my_data.csv in this example is a comma-delimited file including a header line. Additional columns in the input data will be ignored. MeinteR also provides a well-formatted dataset loaded in the workspace called test.data that contains 401 DMS with `|delta-beta|>=0.3’.

2.1.2 Import data from GEO

MeinteR supports analyses of GEO data series from microarray and next-generation sequencing platforms. In both cases, MeinteR will automatically fetch data series based on the accession number of the GEO data series. A data series usually contains moew than one samples analyzed by a unique or multiple platforms. MeinteR imports data from a data series that contains two groups of samples analysed by the same platform according to a user-defined annotation file. The annotation file is a comma-delimited file with two columns: Column sample lists the GSM accession numbers that are included in the data series and the status column lists the class of each sample. For example, to import sample data of the GSE37362 series we must create an annotation file that has the following form:

sample,status
GSM916781,WT
GSM916782,WT
GSM916783,WT
GSM916803,Mutant
GSM916804,Mutant
GSM916805,Mutant

The status column must list only two groups of samples e.g. WT/Mutant, pre-treatment/post-treatment etc. The analysis will be performed on the samples listed in the annotation file. Samples included in the data series, but not in the annotation file, will be ignored. MeinteR will automatically identify the platform and will fetch the corresponding sample data. In case of array-based data, the probes will be mapped to the hg19 genomic coordinates and a bed-formatted file will be generated conforming with the formatting rules of the input data. The delta-beta values will be automatically calculated by subtracting the mean beta values of the two groups. To import a GEO data series use the function importGEO as follows:

fp <- file.path(project.dir, "GSE37362_annotation.csv")
geo.data <- importGEO(gse.acc="GSE37362", annotation.file= fp)

where gse.acc is the unique GEO identifier of the data series and annotation.file is the local path to the annotation file.

2.1.3 Import limma differentially methylated probes

To increase compatibility with other software packages MeinteR includes an import function of the Limma-based differentially methylated probes. Limma is widely-used to build linear models for microarray data, includind DNA methylation. Using the importLimma function the highly ranked differentially methylated probes can be directly imported to the workspace and processed by the core functions. An example on the minfi data is shown below:

require(minfi) 
require(minfiData)
require(limma)
library(MeinteR)
# Get example methylation data and transform to M-values
meth <- getMeth(MsetEx)
unmeth <- getUnmeth(MsetEx)
Mval <- log2((meth + 100)/(unmeth + 100))
group <- factor(pData(MsetEx)$Sample_Group)
#Build the design matrix
design <- model.matrix(~group)
#Fit a linear model
lFit <- lmFit(Mval,design)
#Compute Bayes statistics
lFit2 <- eBayes(lFit)
#Extract highly ranked probes
lTop <- topTable(lFit2,coef=2,num=200)
bed.data <- importLimma(lTop, sortBy="adj.P.Val")
pals <- findPals(bed.data)
#Find G-quadruplexes
quads <- findQuads(bed.data, offset = 50)
#Find overlaps with of probes located in promoters with JASPAR TFBS
tfbs <- findTFBS(bed.data)
#Feature weights
weights = list()
weights[["tfbs"]] = 1
weights[["pals"]] = 1
weights[["quads"]] = 1
#Mapping MeinteR arguments with function outputs
funList = list()
funList[["tfbs"]] = tfbs
funList[["pals"]] = pals
funList[["quads"]] = quads
#Calculate genomic index
index <- meinter(bed.data, funList, weights)

topbed is a data frame that complies with the formatting rules of the MeinteR input files. To export topbed data importLimma needs the lTop data frame containing empirical Bayes statistics of the highly ranked probes. These probes are matched to the annotation file of the platform that they came from and the score column of the topbed data frame is filled with the values set by the sortBy parameter.

2.2 Data filtering

A common preprocessing step is subsetting the initial dataset using filtering of the score values. For example:

#Select DMS with delta-beta values equal to 0
subsample.1 <- re.sample[re.sample$score == 0,] 
#Select DMS with delta-beta values greater than or equal to 0.30
subsample.2 <- re.sample[re.sample$score >= 0.30,] 
#Select DMS with absolute delta-beta values greater than or equal to 0.60
subsample.3 <- re.sample[abs(re.sample$score) >= 0.60,] 

3 Core functions

MeinteR offers a set of core functions that investigate the presence of the following methylation-mediated regulatory features:

Transcription factors: Two functions are implemented that examine the effect of methylation on the binding affinity of a) human transcription factors, and b) conserved human/mouse/rat transcription factors.

Splice sites and alternative splicing events: Splicing regulation is analysed by two functions: a) A function that identifies potential 5’ and 3’ splice sites in the DMS proximal regions, and b) a function that searches for known alternative splicing events overlapping our DMS data.

Symmetry elements/Palindromes: With this function DMS are examined with respect to the presence of palindromic regions in a user-defined region centered at the DMS.

G-quadruplex structures: With this function the DMS are examined with respect to the presence of G-quadruplex structures.

DNA shapes: This function quantifies the effect of DNA methylation on four DNA shape conformations, aiming to identify potential changes on the DNA-protein interactions.

3.1 Detection of transcription factor binding sites

3.1.1 JASPAR transcription factor binding sites

findTFBS identifies potential binding sites of human transcription factors in a region of 2*offset nucleotides centered at the DMS, where offset defines the sequence length on each side and persim sets the similarity threshold for considering a trascription factor as candidate for binding. The list of transcription factors, on both strands, is extracted from the JASPAR 2018 database (Khan et al., 2018) and processed using TFBSTools (Tan and Lenhard, 2016). By default, MeinteR reserves the maximum number of available cores. Still, for large datasets findTFBS is time-consuming. In such case, consider shortening the list of target transcription factors using the td.ID parameter and/or descreasing the number of DMS to those overlapping with promoters or CpG islands, using the target parameter. For example, to identify a list of transcription factor binding sites on subsample.3 run:

#Transcription factors of interest
tf.ID = c("MA0003.1", "MA0019.1", "MA0004.1", "MA0036.3", "MA0037.3")
tfbs <-findTFBS(bed.data=subsample.3,persim=0.8, offset=10, target="PROMOTER", 
                up.tss=2000, down.tss=100, mcores = 2, tf.ID=tf.ID)

In the above example, findTFBS will return transcription factors that are inluded in the tf.ID list as well as the binding sequence that resembles at least 80% with the DMS expanded by 10bp on each side. The analysis is performed on DMS located in promoters (2000bp upstream and 100bp downstream TSS). tfbs is a list of two data frames. The first data frame tfbs[[1]] contains all the details of each transcription factor found at each DMS and tfbs[[2]] summarizes the number of transcription factors per DMS. plotTF function gets as input the tfbs[[1]] data frame and visualises the most frequent transcription factors that potentially bind to DMS regions.

plotTF(tfbs[[1]], topTF=10) #topTF:Number of most frequent transcription factors

plotTF returns a list of three plots (two bar charts and a scatter plot): The first bar chart overlays the total number of binding sites for each JASPAR transcription factor and the number of sequences containing at least one binding site (Figure 1). The second bar chart sums the number of binding sites per transcription factor class (Figure 2). Finally, the scatter plot combines the number of transcription factors per class that are identified in our dataset as compared with the JASPAR2018 number of transcription factors in each class (Figure 3).

Most frequent transcription factors with putative binding sites on the DMS-containing sequences.

Most frequent transcription factors with putative binding sites on the DMS-containing sequences.