In this notebook (or markdown) we will attempt to solve three of the tasks for the R GSoC 2023 StochOptim project: the Rastrigin function, finding several local minima using optim, and stochastic optimization.

Rastrigin Function

The Rastrigin function is a highly non-convex function with, well, infinitely many local minima. If your optimization algorithm is not stochastic and you would like to give it a nightmare, this is the function to use. If it is stochastic, then this is a good benchmark to test its performance.

As defined on Wikipedia, the Rastrigin function is: \[f(\vec{x}) = An + \sum_{i=1}^n (x_i^2 - A\cos(2\pi x_i))\]

where \(x_i\) is the \(i\)-th element of the vector \(\vec{x}\) and \(n\) is its dimension. The cosine term is what makes the function non-convex.

Implementing the Rastrigin Function in 2D

Rastrigin <- function(x, y, n = 2, A = 5) A*n + (x^2 - A*cos(2*pi*x)) + (y^2 - A*cos(2*pi*y))

Now let’s plot!

We will plot the function for \(x \in [-3, 3]\) and \(y \in [-3, 3]\). Let’s start by creating the ranges; outer() will then help us build the grid.

# We will make a grid using these later via outer()
x <- seq(-3, 3, length = 50)
y <- seq(-3, 3, length = 50)

To make guessing where the global minimum is easier, we will make three plots with different values of A.

# Define a vector of A values
A_vec <- c(9, 3, 1)

# Define a vector of colors
colors <- c("blue", "green", "red")

# Setting up plot options to show three versions of the function
options(repr.plot.width=26, repr.plot.height=2)     # doesn't work when knitting
par(mfrow=c(1,3))

# Now let's plot each
for (i in seq_along(A_vec)) {
  A <- A_vec[i]
  
  # An outer product yields all combinations we need to plug into the function to get the surface
  z <- outer(x, y, function(x,y) Rastrigin(x, y, A=A))
  
  # Plot it using persp
  persp(x, y, z, theta = 30, phi = 40, expand = 0.5, d = 5, col = colors[i], main = paste0("A = ", A))
}

It’s obvious at this point that the global minimum is at \((x, y) = (0, 0)\). Indeed, when every \(x_i = 0\), each cosine equals 1, so the sum evaluates to \(-nA\) and cancels the leading \(An\) term, giving \(f = 0\).
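We can sanity-check this claim numerically with the 2D implementation above (a quick check added here, not part of the original tasks):

# A quick sanity check: the 2D implementation is exactly zero at the origin
# and positive at a nearby point.
Rastrigin(0, 0)       # expected: 0
Rastrigin(0.5, 0.5)   # a nearby point; should be well above 0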

Finding Several Local Minima using Optim

Okay. Checking the documentation for optim, the interface supports six different optimization algorithms, and quite oddly none of them is plain gradient descent. At least there is limited-memory BFGS (L-BFGS-B), so let’s go for that.

# Define the Rastrigin function (N dimensions this time)
Rastrigin <- function(x, A = 5)    A*length(x) + sum(x^2 - A*cos(2*pi*x))

# Method will be L-BFGS-B
method <- "L-BFGS-B"

# Set a list of initial values (in practice, one would generate these randomly)
initial_values <- list(c(1.11, 2.9), c(3.3, 12), c(6, 3), c(12, 16), c(0.3, 0.4))

# Let's make a table for the results
column_names <- c("initial_value", "optimized_value", "minimum_value")
results <- matrix(NA, nrow = length(initial_values), ncol = 3, dimnames = list(NULL, column_names))

# Loop through each initial value and run the optimization
for (i in seq_along(initial_values)) {

  # Run the optimization (assume 1000 iterations is enough)
  result <- optim(par = initial_values[[i]], fn = Rastrigin, method = method, control = list(maxit = 1000))

  # Store the results
  results[i, "initial_value"] <- paste("(", paste(initial_values[[i]], collapse = ", "), ")", sep = "")
  results[i, "optimized_value"] <- paste("(", paste(round(result$par, 3), collapse = ", "), ")", sep = "")
  results[i, "minimum_value"] <- round(result$value, 3)
}

results
##      initial_value optimized_value    minimum_value
## [1,] "(1.11, 2.9)" "(0.99, 2.97)"     "9.899"      
## [2,] "(3.3, 12)"   "(-7.916, -4.949)" "88.091"     
## [3,] "(6, 3)"      "(2.97, 0)"        "8.909"      
## [4,] "(12, 16)"    "(0, 0)"           "0"          
## [5,] "(0.3, 0.4)"  "(-0.99, 0)"       "0.99"

Interesting results. It’s hard to tell what L-BFGS-B was thinking when it decided to converge to the global minimum at \((x, y) = (0, 0)\) all the way from \((x, y) = (12, 16)\).
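As an aside, optim’s gradient-based methods (BFGS, CG, L-BFGS-B) also accept an analytic gradient through the gr argument. The sketch below is my addition (not part of the original runs); Rastrigin_grad is a helper introduced here, using \(\partial f / \partial x_i = 2x_i + 2\pi A \sin(2\pi x_i)\), which spares optim from estimating the gradient by finite differences.

# A minimal sketch: supply the analytic gradient to optim() via `gr`.
# Rastrigin_grad is a helper introduced only for this example.
Rastrigin_grad <- function(x, A = 5) 2*x + 2*pi*A*sin(2*pi*x)

optim(par = c(1.11, 2.9), fn = Rastrigin, gr = Rastrigin_grad,
      method = "L-BFGS-B", control = list(maxit = 1000))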

Now let’s try the other algorithms provided by optim.

# Initial value
initial_value <- c(1.11, 2.9)

# Brent, the remaining method, is excluded because it only works for 1D functions
methods <- c("BFGS", "SANN", "CG", "Nelder-Mead")

# Let's make a table for the results
column_names <- c("method", "optimized_value", "minimum_value")
results <- matrix(NA, nrow = length(methods), ncol = 3, dimnames = list(NULL, column_names))

# Loop through each method and run the optimization
for (i in seq_along(methods)) {

  # Run the optimization
  result <- optim(par = initial_value, fn = Rastrigin, method = methods[i], control = list(maxit = 1000))

  # Store the results
  results[i, "method"] <- methods[i]
  results[i, "optimized_value"] <- paste("(", paste(round(result$par, 3), collapse = ", "), ")", sep = "")
  results[i, "minimum_value"] <- round(result$value, 3)
}
results
##      method        optimized_value  minimum_value
## [1,] "BFGS"        "(0.99, 1.98)"   "4.95"       
## [2,] "SANN"        "(0.986, 0.988)" "1.982"      
## [3,] "CG"          "(0.99, 2.97)"   "9.899"      
## [4,] "Nelder-Mead" "(0.99, 2.97)"   "9.899"

It’s obvious here that SANN has the most spectacular results. It makes sense once you realize that it’s the only stochastic algorithm supported by optim.
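SANN also has its own knobs in optim’s control list: temp (the starting temperature) and tmax (the number of function evaluations per temperature step). Here is a minimal sketch with illustrative, untuned values (my addition):

# A minimal sketch (illustrative values, not tuned): for SANN, maxit is the
# total number of function evaluations, so it is set much higher here.
sann_result <- optim(par = c(1.11, 2.9), fn = Rastrigin, method = "SANN",
                     control = list(maxit = 20000, temp = 20, tmax = 20))
round(sann_result$par, 3)
round(sann_result$value, 3)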

Stochastic Optimization

After checking the “Global and Stochastic Optimization” section of the Optimization Task View, it’s obvious that there are many more algorithms for global and stochastic optimization. We will consider simulated annealing, genetic algorithms, and the cross-entropy method.

Simulated Annealing

The documentation thankfully has a good example, which the following is based on.

library(GenSA)

Rastrigin <- function(x, A=10)    A*length(x) + sum(x^2 - A*cos(2*pi*x))


dimension <- 30                          #  Let's make the problem harder
global_min <- 0                          #  Something we know about the function that can help us stop the algorithm
tol <- 1e-13                             #  We'll call the results "satisfactory" if we get within this tolerance of the global minimum
lower <- rep(-5.12, dimension)           #  The lower bound of the components (again, as if we knew this about the function)
upper <- rep(5.12, dimension)            #  The upper bound of the components 

temperature <- 100                       #  The initial temperature (how often should it be bouncing initially)

par <- runif(dimension, lower, upper)    #  Generate a random starting point

out <- GenSA(par = par, lower = lower, upper = upper, fn = Rastrigin, control = list(threshold.stop = global_min + tol, temperature = temperature))

cat("The minimum value found was ", out$value, " at ", out$par[c(1:5)], "... \n after ", out$counts, " iterations.")
## The minimum value found was  5.684342e-14  at  1.716471e-10 7.000393e-11 -1.457059e-10 7.842643e-11 -2.130771e-10 ... 
##  after  49387  iterations.

The algorithm-specific parameters that GenSA takes are the initial temperature and two more parameters controlling the visiting and acceptance distributions. All of these matter. It’s a bit odd that there is no way to set a custom temperature schedule; if there were, that would have been an important knob as well.

The initial value is of course also an important option, but there is no magic recipe for choosing it that works well in general.
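For reference, here is a minimal sketch (my addition, with illustrative rather than tuned values) of how those parameters are passed through the control list:

# A minimal sketch: the visiting and acceptance distribution parameters live
# in GenSA's control list, alongside the initial temperature and the
# explicit starting point `par`.
out2 <- GenSA(par = runif(dimension, lower, upper),
              lower = lower, upper = upper, fn = Rastrigin,
              control = list(temperature = 5230,
                             visiting.param = 2.62,
                             acceptance.param = -5,
                             threshold.stop = global_min + tol))
out2$value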

Genetic Algorithm

library(GA)
## Loading required package: foreach
## Loading required package: iterators
## Package 'GA' version 3.2.3
## Type 'citation("GA")' for citing this R package in publications.
## 
## Attaching package: 'GA'
## The following object is masked from 'package:utils':
## 
##     de
Rastrigin <- function(x, A=10)    A*length(x) + sum(x^2 - A*cos(2*pi*x))

dimension <- 30                          #  Let's make the problem harder
tol <- 1e-13                             #  We'll call the results "satisfactory" if we get within this tolerance of the global minimum
lower <- rep(-5.12, dimension)           #  The lower bound of the components (again, as if we knew this about the function)
upper <- rep(5.12, dimension)            #  The upper bound of the components 

popsize <- 100000                         # population size
pcrossover <- 0.4                         # probability of crossover
pmutation <- 0.01                         # probability of mutation
elitism <- 0.1                            # elitism (ga() expects a count of individuals, so this fraction effectively becomes 0, as the summary below shows)


out <- ga(type = "real-valued", fitness = function(x) -Rastrigin(x), lower = lower, upper = upper, 
          run = 500, popSize=popsize, pcrossover = pcrossover, pmutation = pmutation, elitism=elitism)

summary(out)
## ── Genetic Algorithm ─────────────────── 
## 
## GA settings: 
## Type                  =  real-valued 
## Population size       =  1e+05 
## Number of generations =  100 
## Elitism               =  0 
## Crossover probability =  0.4 
## Mutation probability  =  0.01 
## Search domain = 
##          x1    x2    x3    x4    x5    x6    x7    x8    x9   x10  ...    x29
## lower -5.12 -5.12 -5.12 -5.12 -5.12 -5.12 -5.12 -5.12 -5.12 -5.12       -5.12
## upper  5.12  5.12  5.12  5.12  5.12  5.12  5.12  5.12  5.12  5.12        5.12
##         x30
## lower -5.12
## upper  5.12
## 
## GA results: 
## Iterations             = 100 
## Fitness function value = -0.7738309 
## Solution = 
##               x1          x2         x3          x4        x5          x6
## [1,] 0.004387843 0.009406425 0.01166057 0.003237599 0.0221791 0.009492672
##               x7          x8           x9        x10  ...          x29
## [1,] -0.01600166 0.007839713 -0.007547582 0.01263047       0.007436299
##              x30
## [1,] 0.004930968

The genetic algorithm involves many important parameters. Four key examples are the population size, the survival rate (elitism), and the crossover and mutation probabilities.

Setting such parameters to get better results can sometimes be tricky. The basic observation is that a GA can be thought of as an instance of local beam search, which becomes optimal as the population size approaches infinity. Thus, the easiest way to get better results, at the cost of more computation time, is to increase the population size.

Other parameters, like the mutation probability, won’t always change the results as intended, but a mutation probability that is too high will reduce the algorithm to random search. A small sweep is sketched below.
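Here is a minimal sketch of that sweep (my addition; a much smaller population is used so it finishes quickly, and the exact numbers will vary from run to run):

# A minimal sketch: sweep pmutation with a small population and report the
# best Rastrigin value found for each setting.
for (pm in c(0.01, 0.1, 0.5)) {
  fit <- ga(type = "real-valued", fitness = function(x) -Rastrigin(x),
            lower = lower, upper = upper,
            popSize = 200, pmutation = pm, maxiter = 200, monitor = FALSE)
  cat("pmutation =", pm, " best Rastrigin value:", -fit@fitnessValue, "\n")
}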

The GA library is nice enough to provide a way to plot how the best fitness changed over the generations during the search. The plotted fitness increases, but remember that we negated the objective, so this corresponds to the Rastrigin value decreasing.

par(bg='white') 
plot(out)

Cross-Entropy Optimizer

We will keep it easy this time and optimize the Rastrigin function in 2D.

library(CEoptim)
## Loading required package: MASS
## Loading required package: msm
## Loading required package: sna
## Loading required package: statnet.common
## 
## Attaching package: 'statnet.common'
## The following objects are masked from 'package:base':
## 
##     attr, order
## Loading required package: network
## 
## 'network' 1.18.1 (2023-01-24), part of the Statnet Project
## * 'news(package="network")' for changes since last version
## * 'citation("network")' for citation information
## * 'https://statnet.org' for help, support, and other information
## sna: Tools for Social Network Analysis
## Version 2.7-1 created on 2023-01-24.
## copyright (c) 2005, Carter T. Butts, University of California-Irvine
##  For citation information, type citation("sna").
##  Type help(package="sna") to get started.
μ <- c(0,0)                 # The initial mean
σ <- c(1,1)                 # The initial standard deviation
ρ <- 0.05                   # The elite proportion
N <- 500L                   # The CE sample size

result <- CEoptim(Rastrigin, continuous=list(mean=μ , sd=σ), rho=ρ, N=N, maximize=FALSE)

result
## Optimizer for continuous part: 
##  -1.897005e-05 0.0004384767 
## Optimum: 
##  3.821459e-05 
## Number of iterations: 
##  7 
## Total number of function evaluations: 
##  3500 
## Convergence: 
##  Variance converged
library(CEoptim)

# Algorithm Parameters
μ <- c(3.5,6.2)                                    # The initial mean
σ <- c(1,1)                                        # The initial standard deviation
ρ <- 0.3                                           # The elite proportion

# Let's try a few different values of N to see which optimizes the function better
Ns <- c(1000L, 5000L, 50000L)                      # The values of N to try

# Let's make a table for the results
results <- matrix(ncol=4, nrow=length(Ns))
colnames(results) <- c("N", "Optimum Value", "Number of Iterations", "Convergence")


# Loop through each value of N and run the optimization
for (i in seq_along(Ns)) {
  result <- CEoptim(Rastrigin, continuous=list(mean=μ , sd=σ), rho=ρ, N=Ns[i], maximize=FALSE, noImproveThr = 10)

  results[i,] <- c(Ns[i], result$optimum, result$termination$niter, result$termination$convergence)
}

results
##      N       Optimum Value      Number of Iterations Convergence         
## [1,] "1000"  "5.36186898153036" "21"                 "Variance converged"
## [2,] "5000"  "4.29160691977389" "24"                 "Variance converged"
## [3,] "50000" "3.9835281026324"  "22"                 "Variance converged"

These seem like the poorest results we have had so far. No wonder this wasn’t covered at my uni. Of course, we can always cheat by setting the initial mean to something near \((0,0)\) to get much better convergence, as sketched below.
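A minimal sketch of that cheat (my addition; the starting mean and standard deviation are arbitrary choices near the optimum):

# A minimal sketch: start the sampling distribution near the known optimum
# and rerun the cross-entropy method.
cheat <- CEoptim(Rastrigin,
                 continuous = list(mean = c(0.1, -0.1), sd = c(0.5, 0.5)),
                 rho = 0.05, N = 1000L, maximize = FALSE)
cheat$optimum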

Thank you.