class: center, middle, title-slide

.title[
# High Performance Computing and Cloud Computing
]
.subtitle[
## JSC 370: Data Science II
]
.date[
### March 3, 2025
]

---

<!--Yeah... I have really long code chunks, so I just changed the default size :)-->

<style type="text/css">
code.r,code.cpp{
  font-size:medium;
}
</style>

## What is HPC

High Performance Computing (HPC) can relate to any of the following:

- **Parallel computing**, i.e. using multiple resources (could be threads, cores, nodes, etc.) simultaneously to complete a task.

- **Big data**, i.e. working with large datasets (in/out-of-memory). We discussed this with lazy loading.

Today we will focus on parallel computing.

---

## Serial computation

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#fig/pll-computing-explained-serial.svg" alt="Here we are using a single core. The function is applied one element at a time, leaving the other 3 cores without usage." width="30%" />
<p class="caption">Here we are using a single core. The function is applied one element at a time, leaving the other 3 cores without usage.</p>
</div>

---

## Serial computation

- Serial programming is the default method of code execution.

- Code is run sequentially, meaning only one instruction is processed at a time (line-by-line).

- Code is executed on a single processor, so only one instruction can execute at a time.

---

## Parallel computation

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#fig/pll-computing-explained-parallel.svg" alt="Here we are taking full advantage of our computer by using all 4 cores at the same time. This translates into a reduced computation time which, in the case of complicated/long calculations, can be an important speed gain." width="30%" />
<p class="caption">Here we are taking full advantage of our computer by using all 4 cores at the same time. This translates into a reduced computation time which, in the case of complicated/long calculations, can be an important speed gain.</p>
</div>

---

## Parallel computation

Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:

- A problem is broken into discrete parts that can be solved concurrently

- Each part is further broken down into a series of instructions

- Instructions from each part execute simultaneously on different processors

- An overall control/coordination mechanism is employed

---

## Parallel computation

The parallel compute resources are typically:

- A single computer with multiple processors/cores

- An arbitrary number of such computers connected by a network

---

## Parallel computing: Hardware

When it comes to parallel computing, there are several ways (levels) in which we can speed up our analysis. From the bottom up:

- **[Thread level SIMD instructions](https://en.wikipedia.org/wiki/Vector_processor)**: Most modern processors support some level of what is called vectorization, that is, applying a single (same) instruction to streams of data, for example, adding vectors `A` and `B` (see the short example below).

- **[Hyper-Threading Technology](https://en.wikipedia.org/wiki/Hyper-threading)** (HTT): Intel's hyper-threading generates a virtual partition of a single core (processor) which, while not equivalent to having multiple physical cores, does speed things up.

- **[Multi-core processor](https://en.wikipedia.org/wiki/Multi-core_processor)**: Most modern CPUs (Central Processing Units) have two or more physical cores. A typical laptop computer has about 8 cores.
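
A quick way to see the benefit of working on whole vectors in R (a minimal sketch; the exact timings depend on your machine). R's vectorized arithmetic pushes the loop down to compiled code, which is where instruction-level vectorization can kick in:

```r
# Two large vectors
x <- runif(1e6)
y <- runif(1e6)

# Vectorized: one call over the full vectors
system.time(z1 <- x + y)

# Explicit loop: one element at a time
system.time({
  z2 <- numeric(length(x))
  for (i in seq_along(x)) z2[i] <- x[i] + y[i]
})
```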
---

## Parallel computing: Hardware

- **[General-Purpose Computing on Graphics Processing Unit](https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units)** (GP-GPU): While modern CPUs have at most a few dozen cores, GPUs can hold thousands of them. Although designed for image processing, GPUs are increasingly used as an alternative to CPUs for scientific computing. For example, Compute Canada, now part of the [Digital Research Alliance of Canada](https://alliancecan.ca/en/services/advanced-research-computing/federation/national-host-sites), provides GPU resources.

- **[High-Performance Computing Cluster](https://en.wikipedia.org/wiki/Computer_cluster)** (HPC): A collection of computing nodes interconnected using a fast network (e.g. Ethernet or InfiniBand). Nodes nowadays typically have both CPUs and GPUs.

- **[Grid Computing](https://en.wikipedia.org/wiki/Grid_computing)**: A collection of loosely interconnected machines that may or may not be in the same physical place, for example: HTCondor clusters.

---

## Parallel computing: Terminology

CPU: Central Processing Unit, consists of one or more cores. Cores may be organized into one or more sockets, each with its own distinct memory.

Core: A processing unit within a CPU that can execute a separate thread or process.

Node: A standalone computer, usually composed of multiple CPUs/processors/cores, memory, and network interfaces. Nodes are networked together to form a supercomputer.

Cluster: A collection of nodes working together as a single system.

---

## Parallel computing: Terminology

Thread: A sequence of instructions that a CPU executes. It shares memory with other threads within the same process. For example, a CPU with 8 cores and 2 threads per core can execute 16 threads simultaneously.

Process: An independent instance of a program with its own memory space.

---

## Parallel computing: CPU components

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#fig/cpu-slurm.png" alt="Taxonomy of CPUs (from https://slurm.schedmd.com/mc_support.html)" width="45%" />
<p class="caption">Taxonomy of CPUs (from https://slurm.schedmd.com/mc_support.html)</p>
</div>

---

## Parallel computing: Terminology

Job: A container that includes all the work submitted to the cluster. It may consist of one or more tasks. A job is typically submitted using SLURM (Simple Linux Utility for Resource Management). SLURM is an open-source job scheduler used for managing and allocating resources in computing clusters. It allows users to submit, schedule, and monitor jobs on a cluster of computers.

Task: Typically a program or program-like set of instructions that is executed by a processor. A parallel program consists of multiple tasks running on multiple processors.

---

## Parallel computing: Components

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#fig/components.png" alt="Parallel Computing Components with a SLURM script" width="70%" />
<p class="caption">Parallel Computing Components with a SLURM script</p>
</div>

---

## Parallel Programming Models

There are several parallel programming models in common use:

- Shared Memory (without threads)
- Threads
- Distributed Memory / Message Passing (e.g. MPI)
- Data Parallel
- Hybrid
- Single Program Multiple Data (SPMD)
- Multiple Program Multiple Data (MPMD)

---

## Do I need HPC?

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#fig/when_to_parallel.png" alt="Ask yourself these questions before jumping into HPC!"
width="60%" /> <p class="caption">Ask yourself these questions before jumping into HPC!</p> </div> --- ## Top 10 HPC: www.top500.org <img src="data:image/png;base64,#fig/hpcs.png" width="60%" style="display: block; margin: auto;" /> --- ## Parallel computing in R While there are several ways to do parallel computing in R (just take a look at the [High-Performance Computing Task View](https://cran.r-project.org/web/views/HighPerformanceComputing.html)), we'll focus on the following R-packages for **explicit parallelism** Some examples: > * [**parallel**](https://cran.r-project.org/package=parallel): R package that provides 'support for parallel computation, including random-number generation'. > * [**foreach**](https://cran.r-project.org/package=foreach): R package for 'general iteration over elements' in parallel fashion. --- ## Parallel computing in R > * [**doParallel**](https://cran.r-project.org/package=doParallel): Provides a parallel backend for the %dopar% function using the parallel package. > * [**future**](https://cran.r-project.org/package=future): 'A lightweight and unified Future API for sequential and parallel processing of R expression via lazy loading and futures.' Implicit parallelism, on the other hand, are out-of-the-box tools that allow the programmer not to worry about parallelization, e.g. such as [**gpuR**](https://cran.r-project.org/package=gpuR) for Matrix manipulation using GPU, [**tensorflow**](https://cran.r-project.org/package=tensorflow) --- ## Parallel computing in R There are more advanced options: > * [**Rcpp**](https://cran.r-project.org/package=Rcpp) + [OpenMP](https://www.openmp.org): [Rcpp](https://cran.r-project.org/package=Rcpp) is a R package for integrating R with C++, and OpenMP is a library for high-level parallelism for C/C++ and Fortran. > * A ton of other type of resources, notably the tools for working with batch schedulers such as Slurm, HTCondor, etc. --- ## Embarrasingly Parallel Many problems can be executed in an “embarrassingly parallel” way, whereby multiple independent pieces of a problem are executed simultaneously because the different pieces of the problem never really have to communicate with each other (except perhaps at the end when all the results are assembled). --- ## Embarrasingly Parallel The basic mode of an embarrassingly parallel operation can be seen with `lapply()`. Recall that the `lapply()` function has two arguments: - a list, or an object that can be coerced to a list. - a function to be applied to each element of the list. The `lapply()` function works much like a loop–it cycles through each element of the list and applies the supplied function to that element. NOTE: we can also relate this to the `apply()` function, which takes a matrix rather than a list. --- ## Parallelization Conceptually, the steps in the parallel procedure are: 1. Split list X across multiple cores 2. Copy the supplied function (and associated environment) to each of the cores 3. Apply the supplied function to each subset of the list X on each of the cores in parallel 4. Assemble the results of all the function evaluations into a single list and return --- ## The parallel package * Based on the `snow` and `multicore` R Packages. * Explicit parallelism. * Simple yet powerful idea: Parallel computing as multiple R sessions. * Clusters can be made of both local and remote sessions * Multiple types of cluster: `PSOCK`, `Fork`, `MPI`, etc. 
<embed src="fig/parallel-package.pdf" width="30%" style="display: block; margin: auto;" type="application/pdf" /> --- # Parallel workflow (Usually) We do the following: 1. Create a `PSOCK/FORK` (or other) cluster using `makePSOCKCluster`/`makeForkCluster` (or `makeCluster`) 2. Copy/prepare each R session (if you are using a `PSOCK` cluster): a. Copy objects with `clusterExport` b. Pass expressions with `clusterEvalQ` c. Set a seed 3. Do your call: `parApply`, `parLapply`, `mclapply`. 4. Stop the cluster with `clusterStop` --- ## The parallel package - The `mclapply()` function essentially parallelizes calls to `lapply()`. - The first two arguments to `mclapply()` are exactly the same as they are for `lapply()`. - `mclapply()` has further arguments (that must be named), the most important of which is the mc.cores argument which you can use to specify the number of processors/cores you want to split the computation across. - For example, if your machine has 4 cores on it, you might specify mc.cores = 4 to break your parallelize your operation across 4 cores (although this may not be the best idea if you are running other operations in the background besides R). - check for the number of cores you have with: ``` r library(parallel) detectCores() ``` ``` ## [1] 8 ``` --- ## Ex 1: Hello world! ``` r # 1. CREATING A CLUSTER library(parallel) cl <- makePSOCKcluster(4) x <- 20 # 2. PREPARING THE CLUSTER clusterSetRNGStream(cl, 123) # Equivalent to `set.seed(123)` clusterExport(cl, "x") # 3. DO YOUR CALL clusterEvalQ(cl, { paste0("Hello from process #", Sys.getpid(), ". I see x and it is equal to ", x) }) # 4. STOP THE CLUSTER stopCluster(cl) ``` ``` ## [[1]] ## [1] "Hello from process #36507. I see x and it is equal to 20" ## ## [[2]] ## [1] "Hello from process #36508. I see x and it is equal to 20" ## ## [[3]] ## [1] "Hello from process #36509. I see x and it is equal to 20" ## ## [[4]] ## [1] "Hello from process #36510. I see x and it is equal to 20" ``` --- ## Ex 2: Parallel regressions **Problem**: Run multiple regressions on a very wide dataset. We need to fit the following model: $$ y = X_i\beta_i + \varepsilon,\quad \varepsilon\sim N(0, \sigma^2_i) $$ ``` r dim(X) X[1:6, 1:5] str(y) ``` ``` ## [1] 500 999 ## x001 x002 x003 x004 x005 ## 1 0.61827227 1.72847041 -1.4810695 -0.2471871 1.4776281 ## 2 0.96777456 -0.19358426 -0.8176465 0.6356714 0.7292221 ## 3 -0.04303734 -0.06692844 0.9048826 -1.9277964 2.2947675 ## 4 0.84237608 -1.13685605 -1.8559158 0.4687967 0.9881953 ## 5 -1.91921443 1.83865873 0.5937039 -0.1410556 0.6507415 ## 6 0.59146153 0.81743419 0.3348553 -1.8771819 0.8181764 ## num [1:500] -0.8188 -0.5438 1.0209 0.0467 -0.4501 ... ``` --- ## Ex 2: Parallel regressions (cont'd 1) **Serial solution**: Use `apply` (forloop) to solve it ``` r ans <- apply( X = X, MARGIN = 2, FUN = function(x) coef(lm(y ~ x)) ) ans[,1:5] ``` ``` ## x001 x002 x003 x004 x005 ## (Intercept) -0.03449819 -0.03339681 -0.03728140 -0.03644192 -0.03717344 ## x -0.06082548 0.02748265 -0.01327855 -0.08012361 -0.04067826 ``` --- ## Ex 2: Parallel regressions (cont'd 2) **Parallel solution**: Use `parApply` ``` r library(parallel) cl <- makePSOCKcluster(4) clusterExport(cl, "y") ans <- parApply( cl = cl, X = X, MARGIN = 2, FUN = function(x) coef(lm(y ~ x)) ) ans[,1:5] ``` ``` ## x001 x002 x003 x004 x005 ## (Intercept) -0.02094065 -0.01630664 -0.01565541 -0.01848773 -0.015305816 ## x 0.09269758 -0.05233096 0.02893588 0.02404687 0.009151671 ``` --- Are we going any faster? 
``` r
microbenchmark::microbenchmark(
  parallel = parApply(
    cl = cl,
    X = X, MARGIN = 2,
    FUN = function(x) coef(lm(y ~ x))
  ),
  serial = apply(
    X = X, MARGIN = 2,
    FUN = function(x) coef(lm(y ~ x))
  ),
  unit = "ms"
)
```

```
## Unit: milliseconds
##      expr      min        lq     mean    median        uq      max neval cld
##  parallel  71.8764  75.45558  79.4269  77.32483  79.28072 152.1781   100  a 
##    serial 188.6807 197.96128 204.1578 200.25206 203.98470 272.3249   100   b
```

---

## Extended Example: SARS-CoV2 simulation

An altered version of [Conway's game of life](https://en.wikipedia.org/wiki/Conway's_Game_of_Life)

1. People live on a torus, each individual having 8 neighbors.

2. A healthy individual interacting with a sick neighbor has the following probabilities of contracting the disease:

    a. 90% if neither wears a face mask.

    b. 50% if only the healthy individual wears a face mask.

    c. 20% if only the sick neighbor wears a face mask.

    d. 5% if both wear a face mask.

---

## Extended Example: SARS-CoV2 simulation

3. Infected individuals may die with probability 10%.

We want to illustrate the importance of wearing face masks. We need to simulate a system with 2,500 (50 x 50) individuals, 1,000 times, so we can analyze (a) the contagion curve and (b) the death curve.

More models like this: The [SIRD model](https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology#The_SIRD_model) (Susceptible-Infected-Recovered-Deceased)

---

## Conway's Game of Masks

Download the program [here](https://github.com/JSC370/JSC370-2025/blob/main/slides/sars-cov2.R).


``` r
source("sars-cov2.R", echo=FALSE)

# Looking at some constants
probs_sick # Sick individual's probabilities
probs_susc # Probabilities of i getting the disease
```

```
##  deceased  infected recovered 
##       0.1       0.4       0.5 
##                j doesn't wear j wears
## i doesn't wear            0.9    0.20
## i wears                   0.5    0.05
```

---

## How does the simulation look?


``` r
set.seed(7123)
one <- simulate_covid(
  pop_size    = 1600,
  nsick       = 160,
  nwears_mask = 1:400,
  nsteps      = 20,
  store       = TRUE
)
one$statistics[c(1:5, 16:20),]
```

```
##    susceptible infected recovered deceased
## 0         1440      160         0        0
## 1         1260      236        87       17
## 2         1075      278       213       34
## 3          909      282       349       60
## 4          779      258       475       88
## 15         399       10      1004      187
## 16         398        6      1009      187
## 17         398        1      1010      191
## 18         398        1      1010      191
## 19         397        1      1011      191
```

---

## How does the simulation look?


``` r
# Location of who wears the facemask.
# This step is only for plotting
wears <- which(one$wears, arr.ind = TRUE) - 1
wears <- wears/(one$nr) * (1 + 1/one$nr)

# Initializing the animation
fig <- magick::image_device(600, 600, res = 96/2, pointsize = 24)
for (i in 1:one$current_step) {
  # Plot
  image(
    one$temporal[,,i], col = c("gray", "tomato", "steelblue", "black"),
    main = paste("Time", i - 1L, "of", one$nsteps), zlim = c(1,4)
  )
  points(wears, col = "white", pch = 20, cex = 1.5)
  legend(
    "topright",
    col    = c("gray", "tomato", "steelblue", "black", "black"),
    legend = c(names(codes), "wears a mask"),
    pch    = c(rep(15, 4), 21)
  )
}

# Finalizing plot and writing the animation
dev.off()
animation <- magick::image_animate(fig, fps = 2)
magick::image_write(animation, "covid1.gif")
```

---

## How it Spreads

<video controls width="400" height="400">
<source type="video/webm" src="data:image/png;base64,#fig/covid1.webm">
</video>

---


``` r
set.seed(123355)
stats_nobody_wears_masks <- replicate(50, {
  simulate_covid(
    pop_size = 900, nsick = 10,
    nwears_mask = 0, nsteps = 15)$statistics[,"deceased"]
  }, simplify = FALSE
)

set.seed(123355)
stats_half_wears_masks <- replicate(50, {
  simulate_covid(
    pop_size = 900, nsick = 10,
    nwears_mask = 450, nsteps = 15)$statistics[,"deceased"]
  }, simplify = FALSE
)

set.seed(123355)
stats_all_wears_masks <- replicate(50, {
  simulate_covid(
    pop_size = 900, nsick = 10,
    nwears_mask = 900, nsteps = 15)$statistics[,"deceased"]
  }, simplify = FALSE
)
```

---

## Results: How it Spreads

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#JSC370-slides-09_files/figure-html/boxplot-1.png" alt="Cumulative number of deceased as a function of whether none (red), half (gray), or all (blue) individuals wear a face mask." />
<p class="caption">Cumulative number of deceased as a function of whether none (red), half (gray), or all (blue) individuals wear a face mask.</p>
</div>

---

## Speed things up: Timing under the serial implementation

We will use the function `system.time` to measure how much time it takes to complete 50 simulations (of 20 steps each) in serial versus in parallel using 4 cores.


``` r
time_serial <- system.time({
  ans_serial <- replicate(50, {
    simulate_covid(
      pop_size = 900, nsick = 10,
      nwears_mask = 900, nsteps = 20)$statistics[,"deceased"]
    }, simplify = FALSE
  )
})
```

---

## Speed things up: Parallel with a Forking Cluster

If you are using a Unix-like system (Ubuntu, macOS, etc.), you can take advantage of process forking, and thus, parallel's `mclapply` function:


``` r
set.seed(1231)
time_parallel_fork <- system.time({
  ans_parallel <- parallel::mclapply(1:50, function(i) {
    simulate_covid(
      pop_size = 900, nsick = 10,
      nwears_mask = 900, nsteps = 20)$statistics[,"deceased"]
    }, mc.cores = 4
  )
})
```

---

## Speed things up: Parallel with a Socket Cluster

Forking does not work on Windows. Regardless of the operating system, we can use a socket cluster, which is simply a group of fresh R sessions that listen to the parent/main session.


``` r
# Step 1: Make the cluster
cl <- parallel::makePSOCKcluster(2)
```

```r
# Step 2: Prepare the cluster
# We could either export all the needed variables
parallel::clusterExport(
  cl,
  c("calc_stats", "codes", "dat", "get_neighbors", "init",
    "probs_sick", "probs_susc", "simulate_covid", "update_status",
    "update_status_all"
  )
)
```

Or simply run the simulation script in the other sessions:


``` r
# Step 2 (alt): Prepare the cluster
parallel::clusterEvalQ(cl, source("sars-cov2.R"))
parallel::clusterSetRNGStream(cl, 123) # Make sure it is reproducible!
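# Note: clusterSetRNGStream() gives each worker its own L'Ecuyer-CMRG
# random-number stream, so the workers don't accidentally reuse the same seed.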
```

---

## Speed things up: Parallel with a Socket Cluster

Bonus: checking which processes are running on the system

```r
(pids <- c(
  master    = Sys.getpid(),
  offspring = unlist(parallel::clusterEvalQ(cl, Sys.getpid()))
))
#     master offspring1 offspring2
#      14810      15998      16012
```

---

## Speed things up: Parallel with a Socket Cluster


``` r
# Step 3: Do your call
time_parallel_sock <- system.time({
  ans_parallel <- parallel::parLapply(cl, 1:50, function(i) {
    simulate_covid(
      pop_size = 900, nsick = 10,
      nwears_mask = 900, nsteps = 20)$statistics[,"deceased"]
    }
  )
})

# Step 4: Stop
parallel::stopCluster(cl)
```

---

## Serial vs Parallel Time

Using two threads/processes, you can obtain the following speedup:


``` r
time_serial;time_parallel_sock;time_parallel_fork
```

```
##    user  system elapsed 
##   8.263   0.045   8.313 
##    user  system elapsed 
##   0.001   0.000   4.442 
##    user  system elapsed 
##   6.974   0.219   2.618
```

We care about the elapsed (wall-clock) time. The user and system times measure CPU time spent in the current R process and by the operating system on its behalf; for the socket cluster they are near zero because the actual work happens in the worker processes.

---

## Cloud Computing (a.k.a. on-demand computing)

HPC clusters, super-computers, etc. need not be bought... you can rent them:

- [Amazon Web Services (AWS)](https://aws.amazon.com)

- [Google Cloud Computing](https://cloud.google.com)

- [Microsoft Azure](https://azure.microsoft.com)

These services provide more than just computing (storage, data analysis, etc.). But for computing and storage, there are other free resources, e.g.:

- [The Extreme Science and Engineering Discovery Environment (XSEDE)](https://www.xsede.org/)

---

## There are many ways to run R in the cloud

Running R in:

- Google Cloud: https://cloud.google.com/solutions/running-r-at-scale

- Amazon Web Services: https://aws.amazon.com/blogs/big-data/running-r-on-aws/

- Microsoft Azure: https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/r-developers-guide

---

## Submitting jobs

- A key feature of cloud services: you interact with them via the command line.

- You will need to familiarize yourself with `Rscript` and `R CMD BATCH`.

- Which is better? It depends on the application.

---

## Submitting jobs

Imagine we have the following R script (download [here](https://github.com/JSC370/JSC370-2024/blob/main/slides/dummy.R)):

```r
library(data.table)
set.seed(1231)
dat <- data.table(y = rnorm(1e3), x = sample.int(5, 1e3, TRUE))
dat[,mean(y), by = x]
```

**R CMD BATCH**

This will run a non-interactive R session and put all the output ([stdout](https://wikipedia.org/wiki/stdout) and [stderr](https://wikipedia.org/wiki/stderr)) into the file `dummy.Rout`.

---

## Submitting jobs

```
R CMD BATCH --vanilla dummy.R dummy.Rout &
```

**Rscript**

This will also execute R in the background, with the difference that the output `dummy.Rout` will not capture `stderr` (messages, warnings, and errors from R).

```
Rscript --vanilla dummy.R > dummy.Rout &
```

The `&` at the end submits the job to the background, so the shell does not wait for it to finish.

Try it yourself (5 mins)!

---

## Rscript

The R script can be executed as a program directly, if you specify where the `Rscript` program lives. The following example works in Unix.

This is an R script named `since_born.R` (download [here](https://github.com/JSC370/JSC370-2024/blob/main/slides/since_born.R))

```bash
#!/usr/bin/Rscript
args <- tail(commandArgs(), 1) # take the last command-line argument (the birth date)
message(Sys.Date() - as.Date(args), " days since you were born.")
```

This R script can be executed in various ways...

---

## Rscript as a program

For this we need to make it executable. In Unix you can use the [chmod](https://wikipedia.org/wiki/chmod) command: `chmod +x since_born.R`.
This allows you to run it directly:

```bash
./since_born.R 2000-01-01
```

---

## Rscript in a bash script (most common)

When running jobs on a cluster (or something similar), we usually need a bash script. In our case, we have a file named `since_born_bash.sh` that calls `Rscript` (download [here](https://github.com/JSC370/jsc370-2023/blob/main/slides/since_born_bash.sh)):

```bash
#!/bin/bash
Rscript since_born.R 2000-01-01
```

Which we would execute like this:

``` bash
sh since_born_bash.sh
```

```
## 9193 days since you were born.
```

---

# Summary

- Parallel computing can speed things up.

- It is not always needed... first make sure that you are taking advantage of vectorization.

---

# Summary

- Most of the time we look at "Embarrassingly parallel computing."

- In R, explicit parallelism can be achieved using the **parallel** package:

  1. Load the package and create a cluster **parallel::makeCluster()**

  2. Setup the environment **parallel::clusterEvalQ()**, **parallel::clusterExport()**, and **parallel::clusterSetRNGStream()**

  3. Make the call, e.g., **parallel::parLapply()**

  4. Stop the cluster **parallel::stopCluster()**

- Regardless of the cloud computing service we are using, we will be using either `R CMD BATCH` or `Rscript` to submit jobs.

---

## Session info

```
## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.3.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## Random number generation:
##  RNG:     L'Ecuyer-CMRG 
##  Normal:  Inversion 
##  Sample:  Rejection 
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Toronto
## tzcode source: internal
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] reticulate_1.40.0 lubridate_1.9.4   forcats_1.0.0     stringr_1.5.1    
## [5] dplyr_1.1.4       purrr_1.0.4       readr_2.1.5       tidyr_1.3.1      
## [9] tibble_3.2.1      ggplot2_3.5.1     tidyverse_2.0.0   sass_0.4.9       
## 
## loaded via a namespace (and not attached):
##  [1] Matrix_1.7-2      gtable_0.3.6      jsonlite_1.8.9    compiler_4.4.2   
##  [5] Rcpp_1.0.14       tidyselect_1.2.1  jquerylib_0.1.4   png_0.1-8        
##  [9] scales_1.3.0      yaml_2.3.10       fastmap_1.2.0     lattice_0.22-6   
## [13] R6_2.5.1          generics_0.1.3    knitr_1.49        munsell_0.5.1    
## [17] bslib_0.9.0       pillar_1.10.1     tzdb_0.4.0        rlang_1.1.5      
## [21] stringi_1.8.4     cachem_1.1.0      xfun_0.50         timechange_0.3.0 
## [25] cli_3.6.3         withr_3.0.2       magrittr_2.0.3    digest_0.6.37    
## [29] grid_4.4.2        rstudioapi_0.17.1 hms_1.1.3         lifecycle_1.0.4  
## [33] xaringan_0.30     vctrs_0.6.5       evaluate_1.0.3    glue_1.8.0       
## [37] colorspace_2.1-1  rmarkdown_2.29    tools_4.4.2       pkgconfig_2.0.3  
## [41] htmltools_0.5.8.1
```

---

## Resources

* [Package parallel](https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf)

* [Using the iterators package](https://cran.r-project.org/web/packages/iterators/vignettes/iterators.pdf)

* [Using the foreach package](https://cran.r-project.org/web/packages/foreach/vignettes/foreach.pdf)

* [32 OpenMP traps for C++ developers](https://software.intel.com/en-us/articles/32-openmp-traps-for-c-developers)

* [The OpenMP API specification for parallel programming](http://www.openmp.org/)

* ['openmp' tag in Rcpp gallery](https://gallery.rcpp.org/tags/openmp/)

* [OpenMP tutorials and articles](http://www.openmp.org/resources/tutorials-articles/)
For more, check out the [CRAN Task View on HPC](https://cran.r-project.org/web/views/HighPerformanceComputing.html){target="_blank"}

---

## Simulating `\(\pi\)`

* We know that `\(\pi = \frac{A}{r^2}\)`, where `\(A\)` is the area of a circle with radius `\(r\)`. We approximate it by randomly adding points `\(x\)` to a square of size 2 centered at the origin.

* So, we approximate `\(\pi\)` as `\(\Pr\{\|x\| \leq 1\}\times 2^2\)`

<img src="data:image/png;base64,#JSC370-slides-09_files/figure-html/unnamed-chunk-10-1.jpeg" width="300px" height="300px" style="display: block; margin: auto;" />

---

The R code to do this


``` r
pisim <- function(i, nsim) {  # Notice we don't use the -i-

  # Random points
  ans <- matrix(runif(nsim*2), ncol=2)

  # Distance to the origin
  ans <- sqrt(rowSums(ans^2))

  # Estimated pi
  (sum(ans <= 1)*4)/nsim
}
```

---


``` r
library(parallel)

# Setup
cl <- makePSOCKcluster(4L)
clusterSetRNGStream(cl, 123)

# Number of simulations we want each time to run
nsim <- 1e5

# We need to make -nsim- and -pisim- available to the
# cluster
clusterExport(cl, c("nsim", "pisim"))

# Benchmarking: parSapply and sapply will run this simulation
# a hundred times each, so at the end we have 1e5*100 points
# to approximate pi
microbenchmark::microbenchmark(
  parallel = parSapply(cl, 1:100, pisim, nsim=nsim),
  serial   = sapply(1:100, pisim, nsim=nsim),
  times    = 1,
  unit     = "ms"
)
```

```
## Unit: milliseconds
##      expr       min        lq      mean    median        uq       max neval
##  parallel  80.34819  80.34819  80.34819  80.34819  80.34819  80.34819     1
##    serial 272.63417 272.63417 272.63417 272.63417 272.63417 272.63417     1
```

---

## (Bonus) Overview of HPC

Using [Flynn's classical taxonomy](https://en.wikipedia.org/wiki/Flynn%27s_taxonomy), we can classify parallel computing according to the following two dimensions:

a. Type of instruction: Single vs Multiple

b. Data stream: Single vs Multiple

<figure>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/ae/SISD.svg/480px-SISD.svg.png" style="width:23%;">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/97/MISD.svg/480px-MISD.svg.png" style="width:23%;">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/21/SIMD.svg/480px-SIMD.svg.png" style="width:23%;">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c6/MIMD.svg/480px-MIMD.svg.png" style="width:23%;">
<figcaption><a href="https://en.wikipedia.org/wiki/Michael_J._Flynn" target="_blank">Michael Flynn</a>'s Taxonomy (<a href="https://en.wikipedia.org/wiki/Flynn%27s_taxonomy" target="_blank">wiki</a>)</figcaption>
</figure>

---

## (Bonus) Parallel computing: Software

Implicit parallelization:

- [tensorflow](https://en.wikipedia.org/wiki/TensorFlow): Machine learning framework

- [pqR](http://www.pqr-project.org/): Branched version of R.

- [Microsoft R](https://mran.microsoft.com/open): Microsoft's private R version (based on Revolution Analytics' R version).

- [data.table](https://cran.r-project.org/package=data.table) (R package): Data wrangling using multiple cores.

- [caret](https://cran.r-project.org/package=caret) (R package): A meta package with various implementations that use parallel computing.

Explicit parallelization ([DIY](https://en.wikipedia.org/wiki/Do_it_yourself)):

- [CUDA](https://en.wikipedia.org/wiki/CUDA) (C/C++ library): Programming with GP-GPUs.

- [Open MP](https://openmp.org) (C/C++ library): Multi-core programming (CPUs).

- [Open MPI](https://open-mpi.org) (C/C++ library): Large-scale programming with multi-node systems.
- [Threading Building Blocks](https://en.wikipedia.org/wiki/Threading_Building_Blocks) (C/C++ library): Intel's parallel computing library.

- [Kokkos](https://kokkos.org/about/) (C++ library): A hardware-agnostic programming framework for HPC applications.

- [parallel](https://CRAN.R-project.org/view=HighPerformanceComputing) (R package): R's built-in parallel computing package.

- [future](https://cran.r-project.org/package=future) (R package): Framework for parallelizing R (see the short sketch below).

- [RcppParallel](https://cran.r-project.org/package=RcppParallel) (R C++ API wrapper): Headers and templates for building [Rcpp](https://cran.r-project.org/package=Rcpp)-based multi-threaded programs.

- [julia](https://julialang.org) (programming language): A high-performance language with built-in support for parallel computing.
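
As a small taste of the **future** framework mentioned above, here is a minimal sketch (it assumes the **future** and **future.apply** packages are installed):

```r
library(future)
library(future.apply)

# Run the work in background R sessions (similar in spirit to a PSOCK cluster)
plan(multisession, workers = 4)

# future_lapply() is a parallel drop-in replacement for lapply()
ans <- future_lapply(1:100, function(i) mean(rnorm(1e4)))

# Go back to sequential evaluation
plan(sequential)
```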