Working with the IMPC data is an exciting experience for any data scientist. However, the high-throughput nature of the IMPC pipelines produces so many data points that running the statistical analysis pipeline becomes a complex task. In this manual, we describe the step-by-step execution of the IMPC statistical pipeline. To follow this manual, the following software must be installed on your machine/server:
The input data to the IMPC statistical pipeline (IMPC-SP) come in the form of comma-separated values (CSV), tab-separated values (TSV), Rdata (see the R data.frame documentation) or Parquet files. The latter must be in the flattened format (no nested structure is allowed in the Parquet files). The CSV or TSV files can reside on a remote server, but Parquet files must be available locally on disk. The entire IMPC-SP requires 300GB to 1.5TB of disk space, depending on the number of analyses/methods included in the StatPackets, the term we assign to the IMPC-SP output. This document assumes an LSF cluster as the computing driver for the IMPC-SP; however, the IMPC-SP can also be run on a single-core machine, which potentially takes a significant amount of time (an estimated 1.5 months).
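For orientation, the sketch below shows how the three directly readable formats can be loaded into an R data.frame; the file names are placeholders, and Parquet files are intentionally omitted because they go through the dedicated conversion step described later in this manual.

```r
# Minimal sketch: loading the directly supported input formats into R.
# File names below are placeholders, not actual IMPC file names.
csv_data <- read.csv("impc_input.csv", stringsAsFactors = FALSE)    # CSV
tsv_data <- read.delim("impc_input.tsv", stringsAsFactors = FALSE)  # TSV
load("impc_input.Rdata")                                            # Rdata (restores the saved data.frame)
```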
The diagram below shows the optimal sequence of steps for running the data preparation pipeline as quickly as possible.
All analysis steps in the IMPC-SP require the R software, together with the packages and dependencies listed in the table below:
| R Package name | R Package name |
|---|---|
| 1. DRrequiredAgeing (available from GitHub) | 2. OpenStats |
| 3. SmoothWin | 4. base64enc |
| 5. RJSONIO | 6. jsonlite |
| 7. DBI | 8. foreach |
| 9. doParallel | 10. parallel |
| 11. nlme | 12. plyr |
| 13. rlist | 14. pingr |
| 15. robustbase | 16. abind |
| 17. stringi | 18. RPostgreSQL |
| 19. data.table | 20. Tmisc |
| 21. devtools | 22. miniparquet |
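If you are setting up a fresh machine, the dependencies above can be installed roughly as sketched below. The repository locations are assumptions (most packages are expected on CRAN, OpenStats on Bioconductor, DRrequiredAgeing on GitHub); substitute the real GitHub path for DRrequiredAgeing, and note that the base R package parallel needs no installation.

```r
# Sketch only: installing the dependencies listed above.
# Repository locations are assumptions and some packages may have moved.
install.packages(c(
  "SmoothWin", "base64enc", "RJSONIO", "jsonlite", "DBI", "foreach",
  "doParallel", "nlme", "plyr", "rlist", "pingr", "robustbase", "abind",
  "stringi", "RPostgreSQL", "data.table", "Tmisc", "devtools", "miniparquet"
))
# OpenStats is distributed via Bioconductor (assumption):
# BiocManager::install("OpenStats")
# DRrequiredAgeing comes from GitHub; replace the placeholder with the real repository path:
# devtools::install_github("<ORG>/DRrequiredAgeing")
```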
The driver packages are DRrequiredAgeing, OpenStats and SmoothWin, which need to be updated every time the IMPC-SP runs. This ensures that the latest version of the software packages is used in the analysis pipeline. The update can be run with the commands below:
R -e "file.copy(file.path(DRrequiredAgeing:::local(), 'StatsPipeline/jobs/UpdatePackagesFromGithub.R') , to = file.path(getwd(), 'UpdatePackagesFromGithub.R'))"Rscript UpdatePackagesFromGithub.RHaving the packages updated, the first step is to read the input files. CSV, TSV and Rdata files can be directly read in the pipeline (skip to _ Packaging the raw data for parallelisation _). Parquet files require an extra step to be converted into the R data frames. To this end, the parquet files need to be available locally on the disk. The whole process is divided into four steps, two for creating and two for executing LSF cluster jobs:
The scripts for the 4 steps above are available from the R package DRrequiredAgeing.
Copy the contents of the directory below into a path on your machine/server:

R -e "file.path(DRrequiredAgeing:::local(), 'StatsPipeline/0-ETL')"

There are four scripts in the directory you just copied into. Run the commands below to get the data frames ready:
Rscript Step1MakePar2RdataJobs "FULL PATH TO THE PARQUET FILES + trailing /"
chmod 775 jobs_step2_Parquet2Rdata.bch
./jobs_step2_Parquet2Rdata.bch
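Conceptually, each job in jobs_step2_Parquet2Rdata.bch turns one Parquet file into an Rdata file holding a plain data.frame. The snippet below is only a sketch of that conversion, assuming the miniparquet reader from the package table; the actual script shipped with DRrequiredAgeing may differ in detail.

```r
# Sketch of a single Parquet-to-Rdata conversion (illustrative only).
library(miniparquet)
f  <- "example_procedure.parquet"                 # placeholder file name
df <- parquet_read(f)                             # read the flattened Parquet file into a data.frame
save(df, file = sub("\\.parquet$", ".Rdata", f))  # store it next to the input as an Rdata file
```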
Rscript Step3MergeRdataFilesJobs.R "FULL PATH TO THE ProcedureScatterRdata DIRECTORY + trailing /"
chmod 775 jobs_step4_MergeRdatas.bch
./jobs_step4_MergeRdatas.bch
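The merge jobs then stack the per-file data frames of each procedure into a single object. A minimal sketch, assuming every Rdata file under ProcedureScatterRdata stores a single data.frame, could look like this (again, the shipped script may differ):

```r
# Sketch of merging per-procedure Rdata files into one data.frame (illustrative only).
library(plyr)
rdata_files <- list.files("ProcedureScatterRdata", pattern = "\\.Rdata$",
                          full.names = TRUE, recursive = TRUE)
pieces <- lapply(rdata_files, function(f) {
  e <- new.env()
  load(f, envir = e)           # assumes each file stores a single data.frame
  get(ls(e)[1], envir = e)
})
merged <- rbind.fill(pieces)   # plyr::rbind.fill tolerates differing columns
save(merged, file = "MergedProcedure.Rdata")
```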
rm -rf ProcedureScatterRdata

The previous step produces bulky data files, which is very inefficient for parallelisation via the LSF cluster. In the next step, we break the raw data into small packages that can be processed independently in parallel. This step is fully automated and only requires an initialisation step. The output of this step is a set of LSF jobs, XXXX.bch (see an example BCM_ACS_Batch.bch), that need to be concatenated into a single file or can be used individually for each IMPC procedure. The script for this step is available from the path that comes out of the command below:
R -e "file.path(DRrequiredAgeing:::local(),'StatsPipeline/jobs)"The script is named InputDataGenerator.R. You can customize the output XXXX.bch files for the amount of memory allocated to single jobs in LSF by tweaking the memory/CPU/etc. parameters in this script.
To run InputDataGenerator.R, follow the steps below:
R -e "DRrequiredAgeing:::jobCreator('FULL PATH TO THE Rdata DIRECTORY OR RAW DATA FILES')";
chmod 775 DataGenerationJobList.bch
./DataGenerationJobList.bch

You can check the log files in the DataGeneratingLog directory for any errors; if no error is shown in the logs, the preprocessing step is marked as successful. Here is the command to check for errors in the log files:

grep "exit" * -lR
As the log directory can get bulky quickly, we suggest compressing the whole directory to save disk space. You can run the command below to compress and remove the log directory:

zip -rm DataGeneratingLog.zip DataGeneratingLog/

One important note in this step is to adjust the LSF job configurations, such as the memory limit (in the InputDataGenerator.R script). Overestimating the memory required for the LSF jobs prevents unwanted halts of the LSF jobs.
The output of the previous steps is a set of directories for individual IMPC procedures (see an example here) that each contain an XXX.bch file. The next step is to append these XXX.bch files into a single file that we call AllJobs.bch [see an example here]. You can use tools like find to search for the XXX.bch files and cat to append them. An example merging command is shown below:
cat *.bch >> AllJobs.bch

Some preparation is recommended before running the stats pipeline, as listed below:
R -e "DRrequiredAgeing:::updateImpress(updateImpressFileInThePackage = TRUE,updateOptionalParametersList = TRUE,updateTheSkipList = TRUE)"
Rscript "DRrequiredAgeing:::local()"The IMPC-SP require a function.R [see an example here]driver script written in R to perform the analysis to the data. The script is available from
R -e "file.path(DRrequiredAgeing:::local(),'StatsPipeline/jobs')"put the function.R script and the AllJobs.bch in the same directory and execute the AllJobs.bch to start the IMPC-SP. Some notes are required for better understanding of the IMPC-SP.
Some notes are given below for a better understanding of the IMPC-SP. You can modify some parameters in function.R, such as activating soft windowing. Here is a typical function.R call and its parameters:

mainAgeing(
file = suppressWarnings(tail(UnzipAndfilePath(file), 1)),
subdir = 'Results_DR12V1OpenStats',
concurrentControlSelect = FALSE,
seed = 123456,
messages = FALSE,
utdelim = '\t',
activeWindowing = FALSE,
check = 2,
storeplot = FALSE,
plotWindowing = FALSE,
debug = FALSE,
MMOptimise = c(1,1,1,1,1,1),
FERRrep = 1500,
activateMulticore = FALSE,
coreRatio = 1,
MultiCoreErrorHandling = 'stop',
inorder = FALSE,
verbose = TRUE,
OverwriteExistingFiles = FALSE,
storeRawData = TRUE,
outlierDetection = FALSE,
compressRawData = TRUE,
writeOutputToDB = FALSE,
onlyFillNotExisitingResults = FALSE
)
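For example, to activate soft windowing and keep its diagnostics, the windowing-related arguments shown above would be switched on, roughly as in the sketch below; treat this as an illustration rather than a recommended configuration.

```r
# Sketch: the windowing-related arguments of the mainAgeing() call above, switched on.
# Only the changed arguments are shown; all other arguments stay as listed.
mainAgeing(
  file            = suppressWarnings(tail(UnzipAndfilePath(file), 1)),
  subdir          = 'Results_DR12V1OpenStats',
  activeWindowing = TRUE,  # enable soft windowing (SmoothWin)
  plotWindowing   = TRUE,  # produce the windowing diagnostic plots
  storeplot       = TRUE   # store the generated plots
  # ... remaining arguments as in the listing above
)
```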
It is highly recommended to remove the log files before running/re-running the IMPC-SP. To do this, navigate to the AllJobs.bch directory and run the commands below in your terminal:
find ./*/*_RawData/ClusterErr/ -name "*ClusterErr" -type f | xargs rm
find ./*/*_RawData/ClusterOut/ -name "*ClusterOut" -type f | xargs rm

Results: the IMPC-SP output is a directory named after the subdir argument in the function.R script. The StatPackets are located at the very right-hand side (the deepest level) of the following directory structure path:
The IMPC-SP requires some QC checks and random validation to ensure that the results in the StatPackets are reliable and there is no failure in the pipeline. Here we list some typical checks of the pipeline outputs:
cd ..
find ./*/*_RawData/ClusterOut/ -name "*ClusterOut" -type f | xargs cp --backup=numbered -t ~/**XXXXXX**
find ./*/*_RawData/ClusterErr/ -name "*ClusterErr" -type f | xargs cp --backup=numbered -t ~/**XXXXXX**
grep "exit" \* -lR
Here we answer some of the frequently asked questions.