In this notebook (or its markdown rendering) we will attempt to solve three of the tasks for the R GSoC 2023 StochOptim project: the Rastrigin function, finding several local minima using optim, and stochastic optimization.
The Rastrigin function is a highly non-convex function with, well, infinitely many local minima. If your optimization algorithm is not stochastic and you would like to give it a nightmare, this is the function to use. If it is stochastic, then this is a good benchmark for testing its performance.
Following Wikipedia, the Rastrigin function is defined as: \[f(\vec{x}) = An + \sum_{i=1}^n (x_i^2 - A\cos(2\pi x_i))\]
where \(x_i\) is the \(i\)-th element of the vector \(\vec{x}\) and \(n\) is its dimension. The cosine term is what makes the function non-convex.
# 2D version of the Rastrigin function (n = 2 terms in the sum)
Rastrigin <- function(x, y, n = 2, A = 5) A*n + (x^2 - A*cos(2*pi*x)) + (y^2 - A*cos(2*pi*y))
We will plot the function for \(x \in [-3, 3]\) and \(y \in [-3, 3]\). Let’s start by creating the ranges; outer() will then help us build the grid.
# We will make a grid using these later via outer()
x <- seq(-3, 3, length = 50)
y <- seq(-3, 3, length = 50)
To make it easier to guess where the global minimum is, we will make three plots with different values of A.
# Define a vector of A values
A_vec <- c(9, 3, 1)
# Define a vector of colors
colors <- c("blue", "green", "red")
# Setting up plot options to show three versions of the function
options(repr.plot.width=26, repr.plot.height=2) #doesn't work with Knitting
par(mfrow=c(1,3) )
# Now let's plot each
for (i in seq_along(A_vec)) {
A <- A_vec[i]
# An outer product yields all combinations we need to plug into the function to get the surface
z <- outer(x, y, function(x,y) Rastrigin(x, y, A=A))
# Plot it using persp
persp(x, y, z, theta = 30, phi = 40, expand = 0.5, d = 5, col = colors[i], main = paste0("A = ", A))
}
It’s obvious at this point that the global minimum is at \((x, y) = (0, 0)\). Indeed, when every \(x_i = 0\) the cosine terms all equal 1, so the sum evaluates to \(-nA\) and cancels the leading \(An\), giving \(f = 0\).
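As a quick sanity check, evaluating the 2D definition above at the origin returns 0 regardless of A (the value 9 below is arbitrary):
# The two cosine terms each contribute -A, cancelling the leading A*n
Rastrigin(0, 0, A = 9)
## [1] 0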
Okay. Now checking the documentation for optim(), the interface supports six different optimization algorithms, and quite oddly none of them is plain gradient descent. At least there is limited-memory BFGS, so let’s go for that.
# Define the Rastrigin function (N dimensions this time)
Rastrigin <- function(x, A=5) A*length(x) + sum(x^2 - A*cos(2*pi*x))
# Method will be L-BFGS-B
method <- "L-BFGS-B"
# Set a list of initial values (normally, one would generate these randomly)
initial_values <- list(c(1.11, 2.9), c(3.3, 12), c(6, 3), c(12, 16), c(0.3, 0.4))
# Let's make a table for the results
column_names <- c("initial_value", "optimized_value", "minimum_value")
results <- matrix(NA, nrow = length(initial_values), ncol = 3, dimnames = list(NULL, column_names))
# Loop through each initial value and run the optimization
for (i in 1:length(initial_values)) {
# Run the optimization (assume 1000 iterations is enough)
result <- optim(par = initial_values[[i]], fn = Rastrigin, method = method, control = list(maxit = 1000))
# Store the results
results[i, "initial_value"] <- paste("(", paste(initial_values[[i]], collapse = ", "), ")", sep = "")
results[i, "optimized_value"] <- paste("(", paste(round(result$par, 3), collapse = ", "), ")", sep = "")
results[i, "minimum_value"] <- round(result$value, 3)
}
results
## initial_value optimized_value minimum_value
## [1,] "(1.11, 2.9)" "(0.99, 2.97)" "9.899"
## [2,] "(3.3, 12)" "(-7.916, -4.949)" "88.091"
## [3,] "(6, 3)" "(2.97, 0)" "8.909"
## [4,] "(12, 16)" "(0, 0)" "0"
## [5,] "(0.3, 0.4)" "(-0.99, 0)" "0.99"
Interesting results. It’s hard to tell what L-BFGS-B was thinking when it decided to converge to the global minimum at \((x, y) = (0, 0)\) all the way from \((x, y) = (12, 16)\).
Now let’s try the other algorithms provided by optim().
# Initial value
initial_value <- c(1.11, 2.9)
# Brent is excluded because it only works for 1D functions
methods <- c("BFGS", "SANN", "CG", "Nelder-Mead")
# Let's make a table for the results
column_names <- c("method", "optimized_value", "minimum_value")
results <- matrix(NA, nrow = length(methods), ncol = 3, dimnames = list(NULL, column_names))
# Loop through each method and run the optimization
for (i in 1:length(methods)) {
# Run the optimization
result <- optim(par = initial_value, fn = Rastrigin, method = methods[i], control = list(maxit = 1000))
# Store the results
results[i, "method"] <- methods[i]
results[i, "optimized_value"] <- paste("(", paste(round(result$par, 3), collapse = ", "), ")", sep = "")
results[i, "minimum_value"] <- round(result$value, 3)
}
results
## method optimized_value minimum_value
## [1,] "BFGS" "(0.99, 1.98)" "4.95"
## [2,] "SANN" "(0.986, 0.988)" "1.982"
## [3,] "CG" "(0.99, 2.97)" "9.899"
## [4,] "Nelder-Mead" "(0.99, 2.97)" "9.899"
It’s obvious here that SANN gives the best results. That makes sense once you realize it’s the only stochastic algorithm supported by optim().
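As a hedged sketch (the values below are illustrative, not tuned), SANN’s behavior can be nudged through optim()’s control list, for example by allowing more function evaluations than before and raising the starting temperature:
# More function evaluations and a hotter start for SANN (illustrative values)
result_sann <- optim(par = initial_value, fn = Rastrigin, method = "SANN",
                     control = list(maxit = 5000, temp = 20, tmax = 10))
round(result_sann$par, 3)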
After checking the “Global and Stochastic Optimization” section of the CRAN Optimization Task View, it’s clear that there are many more algorithms for global and stochastic optimization. We will consider simulated annealing, cross-entropy, and genetic algorithms.
The documentation thankfully has a good example on which the following is based.
library(GenSA)
Rastrigin <- function(x, A=10) A*length(x) + sum(x^2 - A*cos(2*pi*x))
dimension <- 30 # Let's make the problem harder
global_min <- 0 # Something we know about the function that can help us stop the algorithm
tol <- 1e-13 # We'll call the results "satisfactory" if we get within this tolerance of the global minimum
lower <- rep(-5.12, dimension) # The lower bound of the components (again, as if we knew this about the function)
upper <- rep(5.12, dimension) # The upper bound of the components
temperature <- 100 # The initial temperature (how much the search should bounce around initially)
par <- runif(dimension, lower, upper) # Generate a random starting point
out <- GenSA(par = par, lower = lower, upper = upper, fn = Rastrigin, control = list(threshold.stop = global_min + tol, temperature = temperature))
cat("The minimum value found was ", out$value, " at ", out$par[c(1:5)], "... \n after ", out$counts, " iterations.")
## The minimum value found was 5.684342e-14 at 1.716471e-10 7.000393e-11 -1.457059e-10 7.842643e-11 -2.130771e-10 ...
## after 49387 iterations.
The algorithm-specific parameters that GenSA takes are the initial temperature and two further parameters controlling the visiting and acceptance distributions. All of these are important; it is a little odd that there is no way to set a custom temperature schedule, but if there were, that would be important as well.
The initial value is of course also an important option, but there is no magical way of choosing it that yields impressive results in general.
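As a hedged sketch of setting those knobs explicitly (the values below are illustrative, not tuned; see ?GenSA for the defaults):
# Reuse the same bounds and starting point, but spell out GenSA's algorithm-specific controls
out2 <- GenSA(par = par, fn = Rastrigin, lower = lower, upper = upper,
              control = list(threshold.stop = global_min + tol,
                             temperature = 5000,      # hotter start: more early exploration
                             visiting.param = 2.62,   # shape of the visiting distribution
                             acceptance.param = -5))  # how readily worse moves are accepted
out2$value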
library(GA)
## Loading required package: foreach
## Loading required package: iterators
## Package 'GA' version 3.2.3
## Type 'citation("GA")' for citing this R package in publications.
##
## Attaching package: 'GA'
## The following object is masked from 'package:utils':
##
## de
Rastrigin <- function(x, A=10) A*length(x) + sum(x^2 - A*cos(2*pi*x))
dimension <- 30 # Let's make the problem harder
tol <- 1e-13 # We'll call the results "satisfactory" if we get within this tolerance of the global minimum
lower <- rep(-5.12, dimension) # The lower bound of the components (again, as if we knew this about the function)
upper <- rep(5.12, dimension) # The upper bound of the components
popsize <- 100000 # population size
pcrossover <- 0.4 # probability of crossover
pmutation <- 0.01 # probability of mutation
elitism <- 0.1 # intended survival rate (note: ga() expects a count of individuals, hence the Elitism = 0 in the summary below)
out <- ga(type = "real-valued", fitness = function(x) -Rastrigin(x), lower = lower, upper = upper,
run = 500, popSize=popsize, pcrossover = pcrossover, pmutation = pmutation, elitism=elitism)
summary(out)
## ── Genetic Algorithm ───────────────────
##
## GA settings:
## Type = real-valued
## Population size = 1e+05
## Number of generations = 100
## Elitism = 0
## Crossover probability = 0.4
## Mutation probability = 0.01
## Search domain =
## x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 ... x29
## lower -5.12 -5.12 -5.12 -5.12 -5.12 -5.12 -5.12 -5.12 -5.12 -5.12 -5.12
## upper 5.12 5.12 5.12 5.12 5.12 5.12 5.12 5.12 5.12 5.12 5.12
## x30
## lower -5.12
## upper 5.12
##
## GA results:
## Iterations = 100
## Fitness function value = -0.7738309
## Solution =
## x1 x2 x3 x4 x5 x6
## [1,] 0.004387843 0.009406425 0.01166057 0.003237599 0.0221791 0.009492672
## x7 x8 x9 x10 ... x29
## [1,] -0.01600166 0.007839713 -0.007547582 0.01263047 0.007436299
## x30
## [1,] 0.004930968
The genetic algorithm involves many important parameters. Four key examples are the population size, the number of elite survivors (elitism), and the crossover and mutation probabilities.
Setting such parameters to get better results can sometimes be tricky. The basic observation is that a GA can be thought of as an instance of local beam search, which becomes optimal as the population size approaches infinity. Thus the easiest way to get better results, at the cost of computation time, is to increase the population size.
Other parameters, like the mutation probability, won’t always change the results as intended, but a mutation probability that is too high will reduce the algorithm to a purely random search. A sketch of a rerun with these parameters set explicitly follows.
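The sketch below reruns ga() on the same problem with a smaller population and with elitism given as a count of survivors, the way ga() expects it; the parameter values are illustrative, not tuned.
# Hedged sketch: smaller population, explicit elitism count (illustrative values)
popsize_small <- 200
out_small <- ga(type = "real-valued", fitness = function(x) -Rastrigin(x),
                lower = lower, upper = upper,
                popSize = popsize_small, pcrossover = 0.8, pmutation = 0.05,
                elitism = round(0.1 * popsize_small), # keep the best 10% each generation
                maxiter = 200, run = 50)
summary(out_small)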
The GA library is nice enough to provide a way to plot how the fitness changed over the course of the search. Since we negated the Rastrigin function and ga() maximizes, the plotted fitness should be increasing (towards 0).
par(bg='white')
plot(out)
We will keep it easy this time and optimize the Rastrigin function in 2D.
library(CEoptim)
## Loading required package: MASS
## Loading required package: msm
## Loading required package: sna
## Loading required package: statnet.common
##
## Attaching package: 'statnet.common'
## The following objects are masked from 'package:base':
##
## attr, order
## Loading required package: network
##
## 'network' 1.18.1 (2023-01-24), part of the Statnet Project
## * 'news(package="network")' for changes since last version
## * 'citation("network")' for citation information
## * 'https://statnet.org' for help, support, and other information
## sna: Tools for Social Network Analysis
## Version 2.7-1 created on 2023-01-24.
## copyright (c) 2005, Carter T. Butts, University of California-Irvine
## For citation information, type citation("sna").
## Type help(package="sna") to get started.
μ <- c(0,0) # The initial mean
σ <- c(1,1) # The initial standard deviation
ρ <- 0.05 # The elite proportion
N <- 500L # The CE sample size
result <- CEoptim(Rastrigin, continuous=list(mean=μ , sd=σ), rho=ρ, N=N, maximize=FALSE)
result
## Optimizer for continuous part:
## -1.897005e-05 0.0004384767
## Optimum:
## 3.821459e-05
## Number of iterations:
## 7
## Total number of function evaluations:
## 3500
## Convergence:
## Variance converged
library(CEoptim)
# Algorithm Parameters
μ <- c(3.5,6.2) # The initial mean
σ <- c(1,1) # The initial standard deviation
ρ <- 0.3 # The elite proportion
# Let's try a few different values of N to see which optimizes the function better
Ns <- c(1000L, 5000L, 50000L) # The values of N to try
# Lets make a table for the results
results <- matrix(ncol=4, nrow=length(Ns))
colnames(results) <- c("N", "Optimum Value", "Number of Iterations", "Convergence")
# Loop through each value of N and run the optimization
for (i in 1:length(Ns)) {
result <- CEoptim(Rastrigin, continuous=list(mean=μ , sd=σ), rho=ρ, N=Ns[i], maximize=FALSE, noImproveThr = 10)
results[i,] <- c(Ns[i], result$optimum, result$termination$niter, result$termination$convergence)
}
results
## N Optimum Value Number of Iterations Convergence
## [1,] "1000" "5.36186898153036" "21" "Variance converged"
## [2,] "5000" "4.29160691977389" "24" "Variance converged"
## [3,] "50000" "3.9835281026324" "22" "Variance converged"
These seem like the poorest results we have so far; no doubt that’s why this method wasn’t covered at my uni. Of course, we can always cheat by setting the initial mean to something near \((0, 0)\) and observe much better convergence.
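A hedged sketch of that cheat (the starting mean and standard deviations below are illustrative):
# Start the CE sampling distribution near the known optimum (illustrative values)
μ_near <- c(0.1, -0.1)
result_near <- CEoptim(Rastrigin, continuous = list(mean = μ_near, sd = c(0.5, 0.5)),
                       rho = 0.05, N = 1000L, maximize = FALSE)
result_near$optimum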