Working with the IMPC data is an exciting experience for any data scientist. However, the high-throughput nature of the IMPC pipelines produces so many data points that running the statistical analysis pipeline becomes a complex task. In this manual, we describe the step-by-step execution of the IMPC statistical pipeline. To follow this manual, the following software must be installed on your machine/server:
The input data to the IMPC statistical pipeline (IMPC-SP) come in the form of comma-separated values (CSV), tab-separated values (TSV), Rdata (see the R data.frame documentation) or Parquet files. The latter must be in the flattened format (no nested structure is allowed in the Parquet files). The CSV or TSV files can reside on a remote server, but Parquet files must be available locally on disk. The entire IMPC-SP requires 300GB to 1.5TB of disk space, depending on the number of analyses/methods included in the StatPackets, the term we assign to the IMPC-SP output. This document assumes an LSF cluster as the computing driver for the IMPC-SP; however, the IMPC-SP can also be run on a single-core machine, which potentially takes a significant amount of time (an estimated 1.5 months).
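For orientation, the sketch below shows how the three directly readable formats can be loaded into an R data.frame; the file names are placeholders, and Parquet files are intentionally omitted because they go through the dedicated conversion step described later in this manual.

```r
# Minimal sketch: loading the directly supported input formats into R.
# File names below are placeholders, not actual IMPC file names.
csv_data <- read.csv("impc_input.csv", stringsAsFactors = FALSE)    # CSV
tsv_data <- read.delim("impc_input.tsv", stringsAsFactors = FALSE)  # TSV
load("impc_input.Rdata")                                            # Rdata (restores the saved data.frame)
```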
The diagram below shows the optimal sequence of steps for running the data preparation pipeline as quickly as possible.
All analysis steps in the IMPC-SP require the R software, together with the packages and dependencies listed in the table below:
| R Package name | R Package name |
|---|---|
| 1. DRrequiredAgeing (available from GitHub) | 2. OpenStats |
| 3. SmoothWin | 4. base64enc |
| 5. RJSONIO | 6. jsonlite |
| 7. DBI | 8. foreach |
| 9. doParallel | 10. parallel |
| 11. nlme | 12. plyr |
| 13. rlist | 14. pingr |
| 15. robustbase | 16. abind |
| 17. stringi | 18. RPostgreSQL |
| 19. data.table | 20. Tmisc |
| 21. devtools | 22. miniparquet |
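If you are setting up a fresh machine, the dependencies above can be installed roughly as sketched below. The repository locations are assumptions (most packages are expected on CRAN, OpenStats on Bioconductor, DRrequiredAgeing on GitHub); substitute the real GitHub path for DRrequiredAgeing, and note that the base R package parallel needs no installation.

```r
# Sketch only: installing the dependencies listed above.
# Repository locations are assumptions and some packages may have moved.
install.packages(c(
  "SmoothWin", "base64enc", "RJSONIO", "jsonlite", "DBI", "foreach",
  "doParallel", "nlme", "plyr", "rlist", "pingr", "robustbase", "abind",
  "stringi", "RPostgreSQL", "data.table", "Tmisc", "devtools", "miniparquet"
))
# OpenStats is distributed via Bioconductor (assumption):
# BiocManager::install("OpenStats")
# DRrequiredAgeing comes from GitHub; replace the placeholder with the real repository path:
# devtools::install_github("<ORG>/DRrequiredAgeing")
```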
The driver packages are DRrequiredAgeing, OpenStats and SmoothWin, which need to be updated every time the IMPC-SP runs. This ensures that the latest version of the software packages is used in the analysis pipeline. The update can be run with the commands below:
R -e "file.copy(file.path(DRrequiredAgeing:::local(), 'StatsPipeline/jobs/UpdatePackagesFromGithub.R') , to = file.path(getwd(), 'UpdatePackagesFromGithub.R'))"Rscript UpdatePackagesFromGithub.RHaving the packages updated, the first step is to read the input files. CSV, TSV and Rdata files can be directly read in the pipeline (skip to _ Packaging the raw data for parallelisation _). Parquet files require an extra step to be converted into the R data frames. To this end, the parquet files need to be available locally on the disk. The whole process is divided into four steps, two for creating and two for executing LSF cluster jobs:
The scripts for the 4 steps above are available from the R package DRrequiredAgeing.
Copy the contents of the directory below into a path on your machine/server:

R -e "file.path(DRrequiredAgeing:::local(), 'StatsPipeline/0-ETL')"

There are four scripts in the directory you just copied into. Run the commands below to get the data frames ready:
Rscript Step1MakePar2RdataJobs "FULL PATH TO THE PARQUET FILES + trailing /"
chmod 775 jobs_step2_Parquet2Rdata.bch
./jobs_step2_Parquet2Rdata.bch
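Conceptually, each job in jobs_step2_Parquet2Rdata.bch turns one Parquet file into an Rdata file holding a plain data.frame. The snippet below is only a sketch of that conversion, assuming the miniparquet reader from the package table; the actual script shipped with DRrequiredAgeing may differ in detail.

```r
# Sketch of a single Parquet-to-Rdata conversion (illustrative only).
library(miniparquet)
f  <- "example_procedure.parquet"                 # placeholder file name
df <- parquet_read(f)                             # read the flattened Parquet file into a data.frame
save(df, file = sub("\\.parquet$", ".Rdata", f))  # store it next to the input as an Rdata file
```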
Rscript Step3MergeRdataFilesJobs.R "FULL PATH TO THE ProcedureScatterRdata DIRECTORY + trailing /"
chmod 775 jobs_step4_MergeRdatas.bch
./jobs_step4_MergeRdatas.bch
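The merge jobs then stack the per-file data frames of each procedure into a single object. A minimal sketch, assuming every Rdata file under ProcedureScatterRdata stores a single data.frame, could look like this (again, the shipped script may differ):

```r
# Sketch of merging per-procedure Rdata files into one data.frame (illustrative only).
library(plyr)
rdata_files <- list.files("ProcedureScatterRdata", pattern = "\\.Rdata$",
                          full.names = TRUE, recursive = TRUE)
pieces <- lapply(rdata_files, function(f) {
  e <- new.env()
  load(f, envir = e)           # assumes each file stores a single data.frame
  get(ls(e)[1], envir = e)
})
merged <- rbind.fill(pieces)   # plyr::rbind.fill tolerates differing columns
save(merged, file = "MergedProcedure.Rdata")
```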
rm -rf ProcedureScatterRdata

The previous step produces bulky data files, which is very inefficient for parallelisation via the LSF cluster. In the next step, we break the raw data into small packages that can be processed independently in parallel. This step is fully automated and only requires an initialisation step. The output of this step is a set of LSF jobs, XXXX.bch (see an example BCM_ACS_Batch.bch), that need to be concatenated into a single file or can be used individually for each IMPC procedure. The script for this step is available from the path that comes out of the command below:
R -e "file.path(DRrequiredAgeing:::local(),'StatsPipeline/jobs)"The script is named InputDataGenerator.R. You can customize the output XXXX.bch files for the amount of memory allocated to single jobs in LSF by tweaking the memory/CPU/etc. parameters in this script.
To run InputDataGenerator.R, follow the steps below:
R -e "DRrequiredAgeing:::jobCreator('FULL PATH TO THE Rdata DIRECTORY OR RAW DATA FILES')";
chmod 775 DataGenerationJobList.bch
./DataGenerationJobList.bch

You can check the log files in the DataGeneratingLog directory for any errors; if no error is shown in the logs, the preprocessing step is marked as successful. Here is the command to check for errors in the log files:

grep "exit" * -lR
As the log directory can get bulky quickly, we suggest compressing the whole directory to save disk space. You can run the command below to compress and remove the log directory:

zip -rm DataGeneratingLog.zip DataGeneratingLog/

One important note in this step is to adjust the LSF job configurations, such as the memory limit (in the InputDataGenerator.R script). Overestimating the memory required for the LSF jobs prevents unwanted halts of the LSF jobs.
The output of the previous steps is a set of directories for individual IMPC procedures (see an example here) that each contain an XXX.bch file. The next step is to append these XXX.bch files into a single file that we call AllJobs.bch [see an example here]. You can use tools like find to search for the XXX.bch files and cat to append them. An example merging command is shown below:
cat *.bch >> AllJobs.bch

Some preparation is recommended before running the stats pipeline, as listed below:
R -e "DRrequiredAgeing:::updateImpress(updateImpressFileInThePackage = TRUE,updateOptionalParametersList = TRUE,updateTheSkipList = TRUE)"
Rscript "DRrequiredAgeing:::local()"The IMPC-SP require a function.R [see an example here]driver script written in R to perform the analysis to the data. The script is available from
R -e "file.path(DRrequiredAgeing:::local(),'StatsPipeline/jobs')"put the function.R script and the AllJobs.bch in the same directory and execute the AllJobs.bch to start the IMPC-SP. Some notes are required for better understanding of the IMPC-SP.
Some notes are given below for a better understanding of the IMPC-SP. You can modify some parameters in function.R, such as activating soft windowing. Here is a typical function.R call and its parameters:

mainAgeing(
file = suppressWarnings(tail(UnzipAndfilePath(file), 1)),
subdir = 'Results_DR12V1OpenStats',
concurrentControlSelect = FALSE,
seed = 123456,
messages = FALSE,
utdelim = '\t',
activeWindowing = FALSE,
check = 2,
storeplot = FALSE,
plotWindowing = FALSE,
debug = FALSE,
MMOptimise = c(1,1,1,1,1,1),
FERRrep = 1500,
activateMulticore = FALSE,
coreRatio = 1,
MultiCoreErrorHandling = 'stop',
inorder = FALSE,
verbose = TRUE,
OverwriteExistingFiles = FALSE,
storeRawData = TRUE,
outlierDetection = FALSE,
compressRawData = TRUE,
writeOutputToDB = FALSE,
onlyFillNotExisitingResults = FALSE
)
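For example, to activate soft windowing and keep its diagnostics, the windowing-related arguments shown above would be switched on, roughly as in the sketch below; treat this as an illustration rather than a recommended configuration.

```r
# Sketch: the windowing-related arguments of the mainAgeing() call above, switched on.
# Only the changed arguments are shown; all other arguments stay as listed.
mainAgeing(
  file            = suppressWarnings(tail(UnzipAndfilePath(file), 1)),
  subdir          = 'Results_DR12V1OpenStats',
  activeWindowing = TRUE,  # enable soft windowing (SmoothWin)
  plotWindowing   = TRUE,  # produce the windowing diagnostic plots
  storeplot       = TRUE   # store the generated plots
  # ... remaining arguments as in the listing above
)
```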
It is highly recommended to remove the log files before running/re-running the IMPC-SP. To do this, navigate to the AllJobs.bch directory and run the commands below in your terminal:
find ./*/*_RawData/ClusterErr/ -name "*ClusterErr" -type f | xargs rm
find ./*/*_RawData/ClusterOut/ -name "*ClusterOut" -type f | xargs rm

Results: the IMPC-SP output is a directory named after the subdir argument in the function.R script. The StatPackets are located at the very right-hand side (the deepest level) of the following directory structure path:
The IMPC-SP requires some QC checks and random validation to ensure that the results in the StatPackets are reliable and there is no failure in the pipeline. Here we list some typical checks of the pipeline outputs:
cd ..
find ./*/*_RawData/ClusterOut/ -name "*ClusterOut" -type f | xargs cp --backup=numbered -t ~/**XXXXXX**
find ./*/*_RawData/ClusterErr/ -name "*ClusterErr" -type f | xargs cp --backup=numbered -t ~/**XXXXXX**
grep "exit" \* -lR
Here we answer some of the frequently asked questions.