class: center, middle, title-slide

.title[
# High Performance Computing and Cloud Computing
]
.subtitle[
## JSC 370: Data Science II
]
.date[
### March 3, 2025
]

---

<!--Yeah... I have really long code chunks, so I just changed the default size :)-->

<style type="text/css">
code.r,code.cpp{
  font-size:medium;
}
</style>

## What is HPC

High Performance Computing (HPC) can relate to any of the following:

- **Parallel computing**, i.e. using multiple resources (could be threads, cores, nodes, etc.) simultaneously to complete a task.

- **Big data**, i.e. working with large datasets (in/out-of-memory). We discussed this with lazy loading.

Today we will focus on parallel computing.

---

## Serial computation

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#fig/pll-computing-explained-serial.svg" alt="Here we are using a single core. The function is applied one element at a time, leaving the other 3 cores without usage." width="30%" />
<p class="caption">Here we are using a single core. The function is applied one element at a time, leaving the other 3 cores without usage.</p>
</div>

---

## Serial computation

- Serial programming is the default method of code execution.

- Code is run sequentially, meaning only one instruction is processed at a time (line-by-line).

- Code is executed on a single processor, so only one instruction can execute at a time.

---

## Parallel computation

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#fig/pll-computing-explained-parallel.svg" alt="Here we are taking full advantage of our computer by using all 4 cores at the same time. This translates into a reduced computation time which, in the case of complicated/long calculations, can be an important speed gain." width="30%" />
<p class="caption">Here we are taking full advantage of our computer by using all 4 cores at the same time. This translates into a reduced computation time which, in the case of complicated/long calculations, can be an important speed gain.</p>
</div>

---

## Parallel computation

Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:

- A problem is broken into discrete parts that can be solved concurrently

- Each part is further broken down into a series of instructions

- Instructions from each part execute simultaneously on different processors

- An overall control/coordination mechanism is employed

---

## Parallel computation

The parallel compute resources are typically:

- A single computer with multiple processors/cores

- An arbitrary number of such computers connected by a network

---

## Parallel computing: Hardware

When it comes to parallel computing, there are several ways (levels) in which we can speed up our analysis. From the bottom up:

- **[Thread level SIMD instructions](https://en.wikipedia.org/wiki/Vector_processor)**: Most modern processors support some level of what is called vectorization, that is, applying a single (same) instruction to streams of data, for example, adding vectors `A` and `B` (see the short example below).

- **[Hyper-Threading Technology](https://en.wikipedia.org/wiki/Hyper-threading)** (HTT): Intel's hyper-threading generates a virtual partition of a single core (processor) which, while not equivalent to having multiple physical cores, does speed things up.

- **[Multi-core processor](https://en.wikipedia.org/wiki/Multi-core_processor)**: Most modern CPUs (Central Processing Units) have two or more physical cores. A typical laptop computer has about 8 cores.
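
A quick way to see the benefit of working on whole vectors in R (a minimal sketch; the exact timings depend on your machine). R's vectorized arithmetic pushes the loop down to compiled code, which is where instruction-level vectorization can kick in:

```r
# Two large vectors
x <- runif(1e6)
y <- runif(1e6)

# Vectorized: one call over the full vectors
system.time(z1 <- x + y)

# Explicit loop: one element at a time
system.time({
  z2 <- numeric(length(x))
  for (i in seq_along(x)) z2[i] <- x[i] + y[i]
})
```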
---

## Parallel computing: Hardware

- **[General-Purpose Computing on Graphics Processing Unit](https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units)** (GP-GPU): While modern CPUs have at most a few dozen cores, GPUs can hold thousands of them. Although designed for image processing, GPUs are increasingly used as an alternative to CPUs for scientific computing. For example, Compute Canada, now part of the [Digital Research Alliance of Canada](https://alliancecan.ca/en/services/advanced-research-computing/federation/national-host-sites), provides GPU resources.

- **[High-Performance Computing Cluster](https://en.wikipedia.org/wiki/Computer_cluster)** (HPC): A collection of computing nodes interconnected using a fast network (e.g. Ethernet or InfiniBand). Nodes nowadays typically have both CPUs and GPUs.

- **[Grid Computing](https://en.wikipedia.org/wiki/Grid_computing)**: A collection of loosely interconnected machines that may or may not be in the same physical place, for example: HTCondor clusters.

---

## Parallel computing: Terminology

CPU: Central Processing Unit, consists of one or more cores. Cores may be organized into one or more sockets, each with its own distinct memory.

Core: A processing unit within a CPU that can execute a separate thread or process.

Node: A standalone computer, usually composed of multiple CPUs/processors/cores, memory, and network interfaces. Nodes are networked together to form a supercomputer.

Cluster: A collection of nodes working together as a single system.

---

## Parallel computing: Terminology

Thread: A sequence of instructions that a CPU executes. It shares memory with other threads within the same process. For example, a CPU with 8 cores and 2 threads per core can execute 16 threads simultaneously.

Process: An independent instance of a program with its own memory space.

---

## Parallel computing: CPU components

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#fig/cpu-slurm.png" alt="Taxonomy of CPUs (from https://slurm.schedmd.com/mc_support.html)" width="45%" />
<p class="caption">Taxonomy of CPUs (from https://slurm.schedmd.com/mc_support.html)</p>
</div>

---

## Parallel computing: Terminology

Job: A container that includes all the work submitted to the cluster. It may consist of one or more tasks. A job is typically submitted using SLURM (Simple Linux Utility for Resource Management). SLURM is an open-source job scheduler used for managing and allocating resources in computing clusters. It allows users to submit, schedule, and monitor jobs on a cluster of computers.

Task: Typically a program or program-like set of instructions that is executed by a processor. A parallel program consists of multiple tasks running on multiple processors.

---

## Parallel computing: Components

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#fig/components.png" alt="Parallel Computing Components with a SLURM script" width="70%" />
<p class="caption">Parallel Computing Components with a SLURM script</p>
</div>

---

## Parallel Programming Models

There are several parallel programming models in common use:

- Shared Memory (without threads)
- Threads
- Distributed Memory / Message Passing (e.g. MPI)
- Data Parallel
- Hybrid
- Single Program Multiple Data (SPMD)
- Multiple Program Multiple Data (MPMD)

---

## Do I need HPC?

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#fig/when_to_parallel.png" alt="Ask yourself these questions before jumping into HPC!"
width="60%" /> <p class="caption">Ask yourself these questions before jumping into HPC!</p> </div> --- ## Top 10 HPC: www.top500.org <img src="data:image/png;base64,#fig/hpcs.png" width="60%" style="display: block; margin: auto;" /> --- ## Parallel computing in R While there are several ways to do parallel computing in R (just take a look at the [High-Performance Computing Task View](https://cran.r-project.org/web/views/HighPerformanceComputing.html)), we'll focus on the following R-packages for **explicit parallelism** Some examples: > * [**parallel**](https://cran.r-project.org/package=parallel): R package that provides 'support for parallel computation, including random-number generation'. > * [**foreach**](https://cran.r-project.org/package=foreach): R package for 'general iteration over elements' in parallel fashion. --- ## Parallel computing in R > * [**doParallel**](https://cran.r-project.org/package=doParallel): Provides a parallel backend for the %dopar% function using the parallel package. > * [**future**](https://cran.r-project.org/package=future): 'A lightweight and unified Future API for sequential and parallel processing of R expression via lazy loading and futures.' Implicit parallelism, on the other hand, are out-of-the-box tools that allow the programmer not to worry about parallelization, e.g. such as [**gpuR**](https://cran.r-project.org/package=gpuR) for Matrix manipulation using GPU, [**tensorflow**](https://cran.r-project.org/package=tensorflow) --- ## Parallel computing in R There are more advanced options: > * [**Rcpp**](https://cran.r-project.org/package=Rcpp) + [OpenMP](https://www.openmp.org): [Rcpp](https://cran.r-project.org/package=Rcpp) is a R package for integrating R with C++, and OpenMP is a library for high-level parallelism for C/C++ and Fortran. > * A ton of other type of resources, notably the tools for working with batch schedulers such as Slurm, HTCondor, etc. --- ## Embarrasingly Parallel Many problems can be executed in an “embarrassingly parallel” way, whereby multiple independent pieces of a problem are executed simultaneously because the different pieces of the problem never really have to communicate with each other (except perhaps at the end when all the results are assembled). --- ## Embarrasingly Parallel The basic mode of an embarrassingly parallel operation can be seen with `lapply()`. Recall that the `lapply()` function has two arguments: - a list, or an object that can be coerced to a list. - a function to be applied to each element of the list. The `lapply()` function works much like a loop–it cycles through each element of the list and applies the supplied function to that element. NOTE: we can also relate this to the `apply()` function, which takes a matrix rather than a list. --- ## Parallelization Conceptually, the steps in the parallel procedure are: 1. Split list X across multiple cores 2. Copy the supplied function (and associated environment) to each of the cores 3. Apply the supplied function to each subset of the list X on each of the cores in parallel 4. Assemble the results of all the function evaluations into a single list and return --- ## The parallel package * Based on the `snow` and `multicore` R Packages. * Explicit parallelism. * Simple yet powerful idea: Parallel computing as multiple R sessions. * Clusters can be made of both local and remote sessions * Multiple types of cluster: `PSOCK`, `Fork`, `MPI`, etc. 
<embed src="fig/parallel-package.pdf" width="30%" style="display: block; margin: auto;" type="application/pdf" /> --- # Parallel workflow (Usually) We do the following: 1. Create a `PSOCK/FORK` (or other) cluster using `makePSOCKCluster`/`makeForkCluster` (or `makeCluster`) 2. Copy/prepare each R session (if you are using a `PSOCK` cluster): a. Copy objects with `clusterExport` b. Pass expressions with `clusterEvalQ` c. Set a seed 3. Do your call: `parApply`, `parLapply`, `mclapply`. 4. Stop the cluster with `clusterStop` --- ## The parallel package - The `mclapply()` function essentially parallelizes calls to `lapply()`. - The first two arguments to `mclapply()` are exactly the same as they are for `lapply()`. - `mclapply()` has further arguments (that must be named), the most important of which is the mc.cores argument which you can use to specify the number of processors/cores you want to split the computation across. - For example, if your machine has 4 cores on it, you might specify mc.cores = 4 to break your parallelize your operation across 4 cores (although this may not be the best idea if you are running other operations in the background besides R). - check for the number of cores you have with: ``` r library(parallel) detectCores() ``` ``` ## [1] 8 ``` --- ## Ex 1: Hello world! ``` r # 1. CREATING A CLUSTER library(parallel) cl <- makePSOCKcluster(4) x <- 20 # 2. PREPARING THE CLUSTER clusterSetRNGStream(cl, 123) # Equivalent to `set.seed(123)` clusterExport(cl, "x") # 3. DO YOUR CALL clusterEvalQ(cl, { paste0("Hello from process #", Sys.getpid(), ". I see x and it is equal to ", x) }) # 4. STOP THE CLUSTER stopCluster(cl) ``` ``` ## [[1]] ## [1] "Hello from process #36507. I see x and it is equal to 20" ## ## [[2]] ## [1] "Hello from process #36508. I see x and it is equal to 20" ## ## [[3]] ## [1] "Hello from process #36509. I see x and it is equal to 20" ## ## [[4]] ## [1] "Hello from process #36510. I see x and it is equal to 20" ``` --- ## Ex 2: Parallel regressions **Problem**: Run multiple regressions on a very wide dataset. We need to fit the following model: $$ y = X_i\beta_i + \varepsilon,\quad \varepsilon\sim N(0, \sigma^2_i) $$ ``` r dim(X) X[1:6, 1:5] str(y) ``` ``` ## [1] 500 999 ## x001 x002 x003 x004 x005 ## 1 0.61827227 1.72847041 -1.4810695 -0.2471871 1.4776281 ## 2 0.96777456 -0.19358426 -0.8176465 0.6356714 0.7292221 ## 3 -0.04303734 -0.06692844 0.9048826 -1.9277964 2.2947675 ## 4 0.84237608 -1.13685605 -1.8559158 0.4687967 0.9881953 ## 5 -1.91921443 1.83865873 0.5937039 -0.1410556 0.6507415 ## 6 0.59146153 0.81743419 0.3348553 -1.8771819 0.8181764 ## num [1:500] -0.8188 -0.5438 1.0209 0.0467 -0.4501 ... ``` --- ## Ex 2: Parallel regressions (cont'd 1) **Serial solution**: Use `apply` (forloop) to solve it ``` r ans <- apply( X = X, MARGIN = 2, FUN = function(x) coef(lm(y ~ x)) ) ans[,1:5] ``` ``` ## x001 x002 x003 x004 x005 ## (Intercept) -0.03449819 -0.03339681 -0.03728140 -0.03644192 -0.03717344 ## x -0.06082548 0.02748265 -0.01327855 -0.08012361 -0.04067826 ``` --- ## Ex 2: Parallel regressions (cont'd 2) **Parallel solution**: Use `parApply` ``` r library(parallel) cl <- makePSOCKcluster(4) clusterExport(cl, "y") ans <- parApply( cl = cl, X = X, MARGIN = 2, FUN = function(x) coef(lm(y ~ x)) ) ans[,1:5] ``` ``` ## x001 x002 x003 x004 x005 ## (Intercept) -0.02094065 -0.01630664 -0.01565541 -0.01848773 -0.015305816 ## x 0.09269758 -0.05233096 0.02893588 0.02404687 0.009151671 ``` --- Are we going any faster? 
``` r
microbenchmark::microbenchmark(
  parallel = parApply(
    cl = cl,
    X = X, MARGIN = 2,
    FUN = function(x) coef(lm(y ~ x))
  ),
  serial = apply(
    X = X, MARGIN = 2,
    FUN = function(x) coef(lm(y ~ x))
  ),
  unit = "ms"
)
```

```
## Unit: milliseconds
##      expr      min        lq     mean    median        uq      max neval cld
##  parallel  71.8764  75.45558  79.4269  77.32483  79.28072 152.1781   100  a 
##    serial 188.6807 197.96128 204.1578 200.25206 203.98470 272.3249   100   b
```

---

## Extended Example: SARS-CoV2 simulation

An altered version of [Conway's game of life](https://en.wikipedia.org/wiki/Conway's_Game_of_Life)

1. People live on a torus, each individual having 8 neighbors.

2. A healthy individual interacting with a sick neighbor has the following probabilities of contracting the disease:

    a. 90% if neither wears a face mask.

    b. 50% if only the healthy individual wears a face mask.

    c. 20% if only the sick neighbor wears a face mask.

    d. 5% if both wear a face mask.

---

## Extended Example: SARS-CoV2 simulation

3. Infected individuals may die with probability 10%.

We want to illustrate the importance of wearing face masks. We need to simulate a system with 2,500 (50 x 50) individuals, 1,000 times, so we can analyze (a) the contagion curve and (b) the death curve.

More models like this: The [SIRD model](https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology#The_SIRD_model) (Susceptible-Infected-Recovered-Deceased)

---

## Conway's Game of Masks

Download the program [here](https://github.com/JSC370/JSC370-2025/blob/main/slides/sars-cov2.R).


``` r
source("sars-cov2.R", echo=FALSE)

# Looking at some constants
probs_sick # Sick individual's probabilities
probs_susc # Probabilities of i getting the disease
```

```
##  deceased  infected recovered 
##       0.1       0.4       0.5 
##                j doesn't wear j wears
## i doesn't wear            0.9    0.20
## i wears                   0.5    0.05
```

---

## How does the simulation look?


``` r
set.seed(7123)
one <- simulate_covid(
  pop_size    = 1600,
  nsick       = 160,
  nwears_mask = 1:400,
  nsteps      = 20,
  store       = TRUE
)
one$statistics[c(1:5, 16:20),]
```

```
##    susceptible infected recovered deceased
## 0         1440      160         0        0
## 1         1260      236        87       17
## 2         1075      278       213       34
## 3          909      282       349       60
## 4          779      258       475       88
## 15         399       10      1004      187
## 16         398        6      1009      187
## 17         398        1      1010      191
## 18         398        1      1010      191
## 19         397        1      1011      191
```

---

## How does the simulation look?


``` r
# Location of who wears the facemask.
# This step is only for plotting
wears <- which(one$wears, arr.ind = TRUE) - 1
wears <- wears/(one$nr) * (1 + 1/one$nr)

# Initializing the animation
fig <- magick::image_device(600, 600, res = 96/2, pointsize = 24)
for (i in 1:one$current_step) {
  # Plot
  image(
    one$temporal[,,i], col = c("gray", "tomato", "steelblue", "black"),
    main = paste("Time", i - 1L, "of", one$nsteps), zlim = c(1,4)
  )
  points(wears, col = "white", pch = 20, cex = 1.5)
  legend(
    "topright",
    col    = c("gray", "tomato", "steelblue", "black", "black"),
    legend = c(names(codes), "wears a mask"),
    pch    = c(rep(15, 4), 21)
  )
}

# Finalizing plot and writing the animation
dev.off()
animation <- magick::image_animate(fig, fps = 2)
magick::image_write(animation, "covid1.gif")
```

---

## How it Spreads

<video controls width="400" height="400">
<source type="video/webm" src="data:image/png;base64,#fig/covid1.webm">
</video>

---


``` r
set.seed(123355)
stats_nobody_wears_masks <- replicate(50, {
  simulate_covid(
    pop_size = 900, nsick = 10,
    nwears_mask = 0, nsteps = 15)$statistics[,"deceased"]
  }, simplify = FALSE
)

set.seed(123355)
stats_half_wears_masks <- replicate(50, {
  simulate_covid(
    pop_size = 900, nsick = 10,
    nwears_mask = 450, nsteps = 15)$statistics[,"deceased"]
  }, simplify = FALSE
)

set.seed(123355)
stats_all_wears_masks <- replicate(50, {
  simulate_covid(
    pop_size = 900, nsick = 10,
    nwears_mask = 900, nsteps = 15)$statistics[,"deceased"]
  }, simplify = FALSE
)
```

---

## Results: How it Spreads

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#JSC370-slides-09_files/figure-html/boxplot-1.png" alt="Cumulative number of deceased as a function of whether none (red), half (gray), or all (blue) individuals wear a face mask." />
<p class="caption">Cumulative number of deceased as a function of whether none (red), half (gray), or all (blue) individuals wear a face mask.</p>
</div>

---

## Speed things up: Timing under the serial implementation

We will use the function `system.time` to measure how much time it takes to complete 50 simulations (of 20 steps each) in serial versus in parallel using 4 cores.


``` r
time_serial <- system.time({
  ans_serial <- replicate(50, {
    simulate_covid(
      pop_size = 900, nsick = 10,
      nwears_mask = 900, nsteps = 20)$statistics[,"deceased"]
    }, simplify = FALSE
  )
})
```

---

## Speed things up: Parallel with a Forking Cluster

If you are using a Unix-like system (Ubuntu, macOS, etc.), you can take advantage of process forking, and thus, parallel's `mclapply` function:


``` r
set.seed(1231)
time_parallel_fork <- system.time({
  ans_parallel <- parallel::mclapply(1:50, function(i) {
    simulate_covid(
      pop_size = 900, nsick = 10,
      nwears_mask = 900, nsteps = 20)$statistics[,"deceased"]
    }, mc.cores = 4
  )
})
```

---

## Speed things up: Parallel with a Socket Cluster

Forking does not work on Windows. Regardless of the operating system, we can use a socket cluster, which is simply a group of fresh R sessions that listen to the parent/main session.


``` r
# Step 1: Make the cluster
cl <- parallel::makePSOCKcluster(2)
```

```r
# Step 2: Prepare the cluster
# We could either export all the needed variables
parallel::clusterExport(
  cl,
  c("calc_stats", "codes", "dat", "get_neighbors", "init",
    "probs_sick", "probs_susc", "simulate_covid", "update_status",
    "update_status_all"
  )
)
```

Or simply run the simulation script in the other sessions:


``` r
# Step 2 (alt): Prepare the cluster
parallel::clusterEvalQ(cl, source("sars-cov2.R"))
parallel::clusterSetRNGStream(cl, 123) # Make sure it is reproducible!
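# Note: clusterSetRNGStream() gives each worker its own L'Ecuyer-CMRG
# random-number stream, so the workers don't accidentally reuse the same seed.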
```

---

## Speed things up: Parallel with a Socket Cluster

Bonus: checking which processes are running on the system

```r
(pids <- c(
  master    = Sys.getpid(),
  offspring = unlist(parallel::clusterEvalQ(cl, Sys.getpid()))
))
#     master offspring1 offspring2
#      14810      15998      16012
```

---

## Speed things up: Parallel with a Socket Cluster


``` r
# Step 3: Do your call
time_parallel_sock <- system.time({
  ans_parallel <- parallel::parLapply(cl, 1:50, function(i) {
    simulate_covid(
      pop_size = 900, nsick = 10,
      nwears_mask = 900, nsteps = 20)$statistics[,"deceased"]
    }
  )
})

# Step 4: Stop
parallel::stopCluster(cl)
```

---

## Serial vs Parallel Time

Using two threads/processes, you can obtain the following speedup:


``` r
time_serial;time_parallel_sock;time_parallel_fork
```

```
##    user  system elapsed 
##   8.263   0.045   8.313 
##    user  system elapsed 
##   0.001   0.000   4.442 
##    user  system elapsed 
##   6.974   0.219   2.618
```

We care about the elapsed (wall-clock) time. The user and system times measure CPU time spent in the current R process and by the operating system on its behalf; for the socket cluster they are near zero because the actual work happens in the worker processes.

---

## Cloud Computing (a.k.a. on-demand computing)

HPC clusters, super-computers, etc. need not be bought... you can rent them:

- [Amazon Web Services (AWS)](https://aws.amazon.com)

- [Google Cloud Computing](https://cloud.google.com)

- [Microsoft Azure](https://azure.microsoft.com)

These services provide more than just computing (storage, data analysis, etc.). But for computing and storage, there are other free resources, e.g.:

- [The Extreme Science and Engineering Discovery Environment (XSEDE)](https://www.xsede.org/)

---

## There are many ways to run R in the cloud

Running R in:

- Google Cloud: https://cloud.google.com/solutions/running-r-at-scale

- Amazon Web Services: https://aws.amazon.com/blogs/big-data/running-r-on-aws/

- Microsoft Azure: https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/r-developers-guide

---

## Submitting jobs

- A key feature of cloud services: you interact with them via the command line.

- You will need to familiarize yourself with `Rscript` and `R CMD BATCH`.

- Which is better? It depends on the application.

---

## Submitting jobs

Imagine we have the following R script (download [here](https://github.com/JSC370/JSC370-2024/blob/main/slides/dummy.R)):

```r
library(data.table)
set.seed(1231)
dat <- data.table(y = rnorm(1e3), x = sample.int(5, 1e3, TRUE))
dat[,mean(y), by = x]
```

**R CMD BATCH**

This will run a non-interactive R session and put all the output ([stdout](https://wikipedia.org/wiki/stdout) and [stderr](https://wikipedia.org/wiki/stderr)) into the file `dummy.Rout`.

---

## Submitting jobs

```
R CMD BATCH --vanilla dummy.R dummy.Rout &
```

**Rscript**

This will also execute R in the background, with the difference that the output `dummy.Rout` will not capture `stderr` (messages, warnings, and errors from R).

```
Rscript --vanilla dummy.R > dummy.Rout &
```

The `&` at the end submits the job to the background, so the shell does not wait for it to finish.

Try it yourself (5 mins)!

---

## Rscript

The R script can be executed as a program directly, if you specify where the `Rscript` program lives. The following example works in Unix.

This is an R script named `since_born.R` (download [here](https://github.com/JSC370/JSC370-2024/blob/main/slides/since_born.R))

```bash
#!/usr/bin/Rscript
args <- tail(commandArgs(), 1) # take the last command-line argument (the birth date)
message(Sys.Date() - as.Date(args), " days since you were born.")
```

This R script can be executed in various ways...

---

## Rscript as a program

For this we need to make it executable. In Unix you can use the [chmod](https://wikipedia.org/wiki/chmod) command: `chmod +x since_born.R`.
This allows you to run it directly:

```bash
./since_born.R 2000-01-01
```

---

## Rscript in a bash script (most common)

When running jobs on a cluster (or something similar), we usually need a bash script. In our case, we have a file named `since_born_bash.sh` that calls `Rscript` (download [here](https://github.com/JSC370/jsc370-2023/blob/main/slides/since_born_bash.sh)):

```bash
#!/bin/bash
Rscript since_born.R 2000-01-01
```

Which we would execute like this:

``` bash
sh since_born_bash.sh
```

```
## 9193 days since you were born.
```

---

# Summary

- Parallel computing can speed things up.

- It is not always needed... first make sure that you are taking advantage of vectorization.

---

# Summary

- Most of the time we look at "Embarrassingly parallel computing."

- In R, explicit parallelism can be achieved using the **parallel** package:

  1. Load the package and create a cluster **parallel::makeCluster()**

  2. Setup the environment **parallel::clusterEvalQ()**, **parallel::clusterExport()**, and **parallel::clusterSetRNGStream()**

  3. Make the call, e.g., **parallel::parLapply()**

  4. Stop the cluster **parallel::stopCluster()**

- Regardless of the cloud computing service we are using, we will be using either `R CMD BATCH` or `Rscript` to submit jobs.

---

## Session info

```
## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.3.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## Random number generation:
##  RNG:     L'Ecuyer-CMRG 
##  Normal:  Inversion 
##  Sample:  Rejection 
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Toronto
## tzcode source: internal
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] reticulate_1.40.0 lubridate_1.9.4   forcats_1.0.0     stringr_1.5.1    
## [5] dplyr_1.1.4       purrr_1.0.4       readr_2.1.5       tidyr_1.3.1      
## [9] tibble_3.2.1      ggplot2_3.5.1     tidyverse_2.0.0   sass_0.4.9       
## 
## loaded via a namespace (and not attached):
##  [1] Matrix_1.7-2      gtable_0.3.6      jsonlite_1.8.9    compiler_4.4.2   
##  [5] Rcpp_1.0.14       tidyselect_1.2.1  jquerylib_0.1.4   png_0.1-8        
##  [9] scales_1.3.0      yaml_2.3.10       fastmap_1.2.0     lattice_0.22-6   
## [13] R6_2.5.1          generics_0.1.3    knitr_1.49        munsell_0.5.1    
## [17] bslib_0.9.0       pillar_1.10.1     tzdb_0.4.0        rlang_1.1.5      
## [21] stringi_1.8.4     cachem_1.1.0      xfun_0.50         timechange_0.3.0 
## [25] cli_3.6.3         withr_3.0.2       magrittr_2.0.3    digest_0.6.37    
## [29] grid_4.4.2        rstudioapi_0.17.1 hms_1.1.3         lifecycle_1.0.4  
## [33] xaringan_0.30     vctrs_0.6.5       evaluate_1.0.3    glue_1.8.0       
## [37] colorspace_2.1-1  rmarkdown_2.29    tools_4.4.2       pkgconfig_2.0.3  
## [41] htmltools_0.5.8.1
```

---

## Resources

* [Package parallel](https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf)

* [Using the iterators package](https://cran.r-project.org/web/packages/iterators/vignettes/iterators.pdf)

* [Using the foreach package](https://cran.r-project.org/web/packages/foreach/vignettes/foreach.pdf)

* [32 OpenMP traps for C++ developers](https://software.intel.com/en-us/articles/32-openmp-traps-for-c-developers)

* [The OpenMP API specification for parallel programming](http://www.openmp.org/)

* ['openmp' tag in Rcpp gallery](https://gallery.rcpp.org/tags/openmp/)

* [OpenMP tutorials and articles](http://www.openmp.org/resources/tutorials-articles/)
For more, check out the [CRAN Task View on HPC](https://cran.r-project.org/web/views/HighPerformanceComputing.html){target="_blank"}

---

## Simulating `\(\pi\)`

* We know that `\(\pi = \frac{A}{r^2}\)`, where `\(A\)` is the area of a circle with radius `\(r\)`. We approximate it by randomly adding points `\(x\)` to a square of size 2 centered at the origin.

* So, we approximate `\(\pi\)` as `\(\Pr\{\|x\| \leq 1\}\times 2^2\)`

<img src="data:image/png;base64,#JSC370-slides-09_files/figure-html/unnamed-chunk-10-1.jpeg" width="300px" height="300px" style="display: block; margin: auto;" />

---

The R code to do this


``` r
pisim <- function(i, nsim) {  # Notice we don't use the -i-

  # Random points
  ans <- matrix(runif(nsim*2), ncol=2)

  # Distance to the origin
  ans <- sqrt(rowSums(ans^2))

  # Estimated pi
  (sum(ans <= 1)*4)/nsim
}
```

---


``` r
library(parallel)

# Setup
cl <- makePSOCKcluster(4L)
clusterSetRNGStream(cl, 123)

# Number of simulations we want each time to run
nsim <- 1e5

# We need to make -nsim- and -pisim- available to the
# cluster
clusterExport(cl, c("nsim", "pisim"))

# Benchmarking: parSapply and sapply will run this simulation
# a hundred times each, so at the end we have 1e5*100 points
# to approximate pi
microbenchmark::microbenchmark(
  parallel = parSapply(cl, 1:100, pisim, nsim=nsim),
  serial   = sapply(1:100, pisim, nsim=nsim),
  times    = 1,
  unit     = "ms"
)
```

```
## Unit: milliseconds
##      expr       min        lq      mean    median        uq       max neval
##  parallel  80.34819  80.34819  80.34819  80.34819  80.34819  80.34819     1
##    serial 272.63417 272.63417 272.63417 272.63417 272.63417 272.63417     1
```

---

## (Bonus) Overview of HPC

Using [Flynn's classical taxonomy](https://en.wikipedia.org/wiki/Flynn%27s_taxonomy), we can classify parallel computing according to the following two dimensions:

a. Type of instruction: Single vs Multiple

b. Data stream: Single vs Multiple

<figure>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/ae/SISD.svg/480px-SISD.svg.png" style="width:23%;">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/97/MISD.svg/480px-MISD.svg.png" style="width:23%;">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/21/SIMD.svg/480px-SIMD.svg.png" style="width:23%;">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c6/MIMD.svg/480px-MIMD.svg.png" style="width:23%;">
<figcaption><a href="https://en.wikipedia.org/wiki/Michael_J._Flynn" target="_blank">Michael Flynn</a>'s Taxonomy (<a href="https://en.wikipedia.org/wiki/Flynn%27s_taxonomy" target="_blank">wiki</a>)</figcaption>
</figure>

---

## (Bonus) Parallel computing: Software

Implicit parallelization:

- [tensorflow](https://en.wikipedia.org/wiki/TensorFlow): Machine learning framework

- [pqR](http://www.pqr-project.org/): Branched version of R.

- [Microsoft R](https://mran.microsoft.com/open): Microsoft's private R version (based on Revolution Analytics' R version).

- [data.table](https://cran.r-project.org/package=data.table) (R package): Data wrangling using multiple cores.

- [caret](https://cran.r-project.org/package=caret) (R package): A meta package with various implementations that use parallel computing.

Explicit parallelization ([DIY](https://en.wikipedia.org/wiki/Do_it_yourself)):

- [CUDA](https://en.wikipedia.org/wiki/CUDA) (C/C++ library): Programming with GP-GPUs.

- [Open MP](https://openmp.org) (C/C++ library): Multi-core programming (CPUs).

- [Open MPI](https://open-mpi.org) (C/C++ library): Large-scale programming with multi-node systems.
- [Threading Building Blocks](https://en.wikipedia.org/wiki/Threading_Building_Blocks) (C/C++ library): Intel's parallel computing library.

- [Kokkos](https://kokkos.org/about/) (C++ library): A hardware-agnostic programming framework for HPC applications.

- [parallel](https://CRAN.R-project.org/view=HighPerformanceComputing) (R package): R's built-in parallel computing package.

- [future](https://cran.r-project.org/package=future) (R package): Framework for parallelizing R (see the short sketch below).

- [RcppParallel](https://cran.r-project.org/package=RcppParallel) (R C++ API wrapper): Headers and templates for building [Rcpp](https://cran.r-project.org/package=Rcpp)-based multi-threaded programs.

- [julia](https://julialang.org) (programming language): A high-performance language with built-in support for parallel computing.
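
As a small taste of the **future** framework mentioned above, here is a minimal sketch (it assumes the **future** and **future.apply** packages are installed):

```r
library(future)
library(future.apply)

# Run the work in background R sessions (similar in spirit to a PSOCK cluster)
plan(multisession, workers = 4)

# future_lapply() is a parallel drop-in replacement for lapply()
ans <- future_lapply(1:100, function(i) mean(rnorm(1e4)))

# Go back to sequential evaluation
plan(sequential)
```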