Simulation Study

The objective of the simulation study is to simulate a diverse array of scenarios, generating data using various distributions, means, standard deviations, and mixing proportions etc. Subsequently, the power and false positive rates of different methods in bifucatoR are compared.

Module 1 concentrates on identifying unstructured mean or variance heterogeneity through seven methods.

Meanwhile, Module 2 is dedicated to implementing techniques for detecting structured mean or variance heterogeneity, or data bimodality.The methods implemented include mClust, Bimodality Coefficient, Ameijeiras-Alonso et al.,. Excess Mass Fit, Cheng and Hall Excess Mass, Fisher and Marron Carmer-von Mises, Hall and York Bandwidth test, Hartigans’ dip test, Sliverman Bandwidth and mixR.

All scripts are executed utilizing the Northeastern University high-performance computing resource Discovery.

Module 1

To generate the power and false positive results of the simulation study, execute Module1.R(vignettes/Simulation_Study/Module_1/scripts/Module1.R) The study includes 7 tests: “Levene,” “Permutations (Gini),” “Permutations (MAD),” “Permutations (SD),” “ANOVA,” “Non-parametric ANOVA,” and “Permutations (Raw).”

The table below summarizes the scripts and the associated information. Module_1_Methods

Note that the running time differs for each method; some may take several hours to complete. To expedite execution, we’ve arranged for multiple NEU Discovery R sessions, with each test running in its own session. For instance, in Module1.R, we replace the 63rd line of code, which currently executes 7 methods, with a specific method such as “Levene” using the following code:

tests = c("Levene")

In this setup, all tests will be able to execute in parallel, leveraging the multiple NEU Discovery R sessions concurrently. This parallel execution significantly speeds up the overall process, ensuring efficient utilization of computing resources and reducing the total time required for the simulation study.

After the completion of Module_1 execution, it produced multiple raw results .rds files, which are stored within the same directory. To consolidate these files with identical structure in the same folder, we employ merge_results.R(vignettes/Simulation_Study/Module_1/scripts/merge_results.R). Simply adjust the working directory and the folder for result storage as required. This process yields module1_n_w_7_tests_sim(1000).rds(vignettes/Simulation_Study/Module_1/results/module1_n_w_7_tests_sim(1000).rds) and module1_n_w_7_tests_sim(1000).csv(vignettes/Simulation_Study/Module_1/results/module1_n_w_7_tests_sim(1000).csv)

Module 2

Unlike Module 1, Module 2 is considerably more computationally intensive and may require longer time to complete all tests for all scenarios. To address this, we introduce Slurm (Simple Linux Utility for Resource Management), a widely utilized tool in High-Performance Computing environments. Slurm is tailored to handle the intricate demands of extensive computational workloads. It excels in distributing and overseeing tasks across clusters consisting of thousands of nodes, providing efficient control over resources, scheduling, and job queuing. More information can be found in Slurm

The table below summarizes the scripts and the associated information. All the scripts can be found in Module_2(vignettes/Simulation_Study/Module_2/scripts) Please note that the minimum sample size required for running mixR is 30. Therefore, we exclude sample sizes of 25 from the mixR parameters. Additionally, ensure that you have installed the mixR package from CRAN.

For instance, to execute 8 methods from Module 2 for both Normal and Weibull distributions, you can submit Slurm array jobs employing the scripts Module2_Norm_Weib_Distribution.R(vignettes/Simulation_Study/Module_2/scripts/Module2_Norm_Weib_Distribution.R) and Module2_Norm_Weib_Distribution.sh(vignettes/Simulation_Study/Module_2/scripts/Module2_LNorm_Distribution_mixR.sh). The Slurm script facilitates running multiple analogous job concurrently, such as performing the same analysis with various inputs or parameters, eliminating the need to submit each job individually. Ensure that keep each R script and .sh script in the same directory.

You can also define the number of jobs within the .sh script. For instance, if the params_grid function in the R script suggests 270 similar scenarios, you can set the number of array jobs using: #SBATCH –array=1-270.

Additionally, you can customize parameters such as the partition, memory, number of nodes, and others according to your requirements. Further details on Slurm Array Jobs are available in the Slurm Array Jobs documentation.

After the execution of Module_2 is complete, the same merge_results.R(vignettes/Simulation_Study/Module_1/scripts/merge_results.R) script is utilized to combine the raw .rds files. Ensure to adjust the saving directory according to where you want the file to be stored.

Specifically, for normal, Weibull, and lognormal distributions, merge the raw results from Module2_Norm_Weib_Distribution.R, Module2_Norm_Weib_Distribution_mixR.R, and Module2_LNorm_Distribution_mixR.R to generate module2_n_w_9_tests_sim(1000).rds (vignettes/Simulation_Study/Module_2/results/module2_n_w_9_tests_sim(1000).rds) and module2_n_w_9_tests_sim(1000).csv(vignettes/Simulation_Study/Module_2/results/module2_n_w_9_tests_sim(1000).csv)

For the beta distribution, merge the raw results generated from Module2_Beta_Distribution.R to produce module2_b_all_tests_sim(1000).rds(/vignettes/Simulation_Study/Module_2/results/module2_b_all_tests_sim(1000).rds) and module2_b_all_tests_sim(1000).csv(/vignettes/Simulation_Study/Module_2/results/module2_b_all_tests_sim(1000).csv)

We created an overall combined results .xlsx file Module_1_2_Results.xlsx(/vignettes/Simulation_Study/Module_2/results/Module_1_2_Results.xlsx) using the three .csv files above through Microsoft Excel.