Optimization with Genetic Algorithm

1 Introduction
2 Optimization Problem
3 Genetic Algorithm: Concepts
4 Genetic Algorithm with R
- 4.1 Business Application
- 4.2 Hyper-Parameter Tuning on Machine Learning Model
5 Conclusion
6 Reference

1 Introduction

1.1 About

Optimization is important in many fields, including in data science. In manufacturing, where every decision is critical to the process and the profit of organization, optimization is often employed, from the number of each products produced, how the unit is scheduled for production, get the best or optimal process parameter, ans also the routing determination such as the traveling salesman problem. In data science, we are familiar with model tuning, where we tune our model in order to improve the model performance. Optimization algorithm can help us to get a better model performance. Genetic Algorithm (GA) is one of the widely used optimization algorithm.

This article is an attempt to explain the mechanism behind one of the most effective algorithm for optimization problems: Genetic Algorithm (GA). There are many algorithm specifically created for optimization, such as the Particle Swarm Optimization (PSO), Simulated Annealing (SA), and Tabu Search. However, in many areas, GA can achieve the objective better than said methods. There are various derivative and variation of GA implementation, with perhaps the most famous one is the implementation of GA in multiobjective optimization problem, where we want to optimize two objective simultaneously, called Multi Objective Evolutionary Algorithm (MOEA). We will see how a general GA works and where it can be applied, either on business problems and on the data science field.

1.2 Objectives

Learning Objective:

Learn the importance of optimization in business
Learn about cost function/objective function and decision variables
Learn the relationship between machine learning and optimization
Learn the principle and concepts of Genetic Algorithm
Learn the implementation of GA in business case and data science field

1.3 Library and Setup

To use GA, you can install GA package using the install.packages() function.

The following is the packages that will be used throughout the articles.

library(tidyverse)
library(GA)
library(ranger)
library(tidymodels)
library(caret)
library(tictoc)

2 Optimization Problem

2.1 Why Optimization is Important

Say, we have 2 line of product, with product A has profit of USD 50 while product B has profit of USD 30. We want to maximize our profit, but we have limited resources. We only have 80 hours of work for a week, with product A take about 10 hours in making while product B is 4 hours in making. Both product shares the same materials and we have only 100 meters of fabric with each product A requires 2 meters of fabric while product B requires 4 meters of fabric. How do we decide what is the best number of product A and B to be produced? Such is the optimization problem (in these case, optimize the profit).

What if we decide to randomly select the quantities of each products to be produced. We may don’t have enough time to finish it. We may don’t have enough materials, or the opposite could happen: we may have spare materials.

Another case, say we want to deliver our product the some warehouses. How do we decide which warehouse should we visit first or how do we decide what route should we take in order to minimize the delivery time?

Optimization are important especially if we have limited resources. More importanly, optimization concern with profit and loss: choosing the wrong delivery route will result in late delivery, which may incur penalty cost or damaging our products along the way. That’s why optimization is really important for business.

2.2 Objective Function and Decision Variables

Every optimization problem always has 2 elements: objective function and decision variables. Objective function mean how do we formulate our goals, like how do we measure profit in order to maximize it, or how do we measure delivery time in order to minimize it. Inside the objective function there exist decision variables, what should be our decision that can result in maximal profit? We want to maximize our goals by choosing the right decision variables. In the previous example, the goal to maximize profit belong to the objective function, while the number of product A and product B to be produced belong to the decision variables.

Some objective may have a constraint, like the number of workhours a week is only 80 hours or the number of materials available is only 100 meters. The constraint is important, since any solution that doesn’t meet the constraint is an infeasible solution.

2.3 Machine Learning and Optimization

Machine learning and optimization cannot be separated, they are related to each other, although the goal is different. Machine learning is concerned with prediction or classification based on several predictors, while optimization is concerned with finding the best objective value based on the choice of decision variables. However, the mechanism behind most of machine learning models are optimization. For example, the gradient descent method of Neural Network training model is an optimization method, one which goal is to find the lowest point and converge. Another example is the linear regression fitting, which minimize the sum of squared error (thus, the name Least Square method).

3 Genetic Algorithm: Concepts

3.1 Overview

“Genetic algorithms (GAs) are stochastic search algorithms inspired by the basic principles of biological evolution and natural selection. GAs simulate the evolution of living organisms, where the fittest individuals dominate over the weaker ones, by mimicking the biological mechanisms of evolution, such as selection, crossover and mutation.” - Luca Scrucca

Genetic Algorithm is an optimization algorithm that use the concept of evolution by natural selection. Evolution by natural selection, as proposed by Darwin, is the mechanism on how many varieties of living things are adapting to their environment to survive, through 2 main principles: natural selection and random mutation. The genetic algorithm (GA) borrowed the concept and use it to gain optimal solutions by selecting the best or the fittest solution alongside the rare and random occurence of mutation. The general flow of the algorithm is shown below:

Genetic Algorithm

There will be a population of randomly selected chromosomes or solutions.
The fitness values or the objective function values from each solutions are calculated.
From the population, two of the solutions will be chosen as parent solutions, either by tournament selection or any other selection methods.
The chosen solutions will be crossed to create new solutions, called child solutions.
The child solutions may change due to the random mutations (that has really low probability to happen).
In the end of the iteration, new population will be chosen from the parent solutions or the initial population and the child solutions based on the fitness values.
As long as the stopping criterion is not fulfilled, usually in term of number of iterations, the algorithm will continue to iterate.
Best or optimal solutions are those with the optimal fitness value.

3.2 Elements

There exist 3 main elements of GA: Population, Chromosomes, and Genes.

Population is a group of individuals or solutions, mainly called chromosomes. A chromosomes consists of a sequence of genes, with each or multiple genes (depend on the encoding) represent a single decision variables that will be fitted on the objective functions.

Population, Chromosome, and Gene

A gene can be represented or encoded in various ways, including:

Binary Encoding

Binary encoding mean that our genes are encoded into binary number, this is the earliest and the most common form of GA encoding.

Real-Valued Encoding

Real-valued encoding mean that our genes are encoded in a floating point or decimal number. Real-valued encoding can be used to optimize a process parameter, since it consists of floating point number and are not suited for integer-based optimization.

Permutation Encoding

Permutation encoding mean that our genes are encoded into a sequence of number, each number is uniquely assigned to a gene so no 2 genes that will have the same number or value. This encoding is often used in a scheduling or routing problem, where each permutation represent one location or unique points.

3.3 Parents Selection

Parents selections are done in order to select two solutions that will be used for crossover to create new solutions called child solutions. There are several methods to choose parents:

Pairing From Top to Bottom

This method pairs the chromosomes of the top individuals in the population with individuals afterwards until a number of individuals are fulfilled to be used in the crossing process. In other words, this method marry solutions in odd rows with solutions in even rows. The method is very simple although it does not model natural selection accurately.

Random Pairing

Parents are selected randomly with equal chance to be chosen.

Weighted Random Pairing (Roulette Selection)

Parents will be selected randomly based on the fitness value. Higher fitness value mean higher chance to be chosen.

Tournament Selection

Tournament selection is done by selecting a set of chromosomes from population and make certain selection criterion. Tournament selection is effective for GAs with that has big population because there is no need to sort the individuals inside the population.

3.4 Crossover and Mutation

3.4.1 Crossover

Crossover is the process of mating between the two selected individuals, the process represents how genes are transferred to the offspring. There are several methods of crossover:

One-point crossover

One-point crossover use only one random point as the marker where crossover should happen.

Two-point crossover

Two-point crossover use two point as markers where each chromosomes should split and join with the other chromosome.

Three-point crossover

Two-point crossover use three point as markers where each chromosomes should split and join with the other chromosome.

3.4.2 Mutation

A mutation means change in one or several genes inside the chromosome. Just like in the natural world, mutation rarely occur, thus they have a little probability to happen, usually set at 0.01. There are some example of mutations:

Adjacent Exchange Mutation

Two adjacent genes are swapped randomly.

Random Exchange Mutation

Two genes are randomly swapped position.

Shift Mutation

Shift mutation is done by randomly repositioning a gene and shift the following gene after the position.

Inverse Mutation

Inverse mutation is done by inversing the sequence between two random points.

3.5 Application

Genetic Algorithm can be applied in various business problem. Since the algorithm is all about optimization, as long as there is an objective/fitness function to optimize, GA can be applied. Some example including production scheduling, knapsack problem (how many/how much we can load things in our bag or truck to maximize the space or the weight load), traveling salesman problem, etc. GA can also be applied in data science field, especially when we do hyper-parameter tuning to optimize the model performance.

4 Genetic Algorithm with R

Here we will illustrate how GA can work using the GA package both on business problem and on machine learning hyper-parameter tuning.

4.1 Business Application

4.1.1 Production Scheduling

Supposed that we have various units of car to be produced at our manufacturing plant. There are 7 different types of product colors. If the following unit has different color than the previous car, there will changeover cost (cost incurred due to change in product types), such as the cost of material required to clean the equipment from the previous paint color. There are many way to optimize the color sequence, but we will use the simplest one in here. We want to minimize the number of setup, making the number of color switch or changover occur less, therefore the changeover cost can also be minimized.

First we import the data. I’ve prepared the dummy data which you can access here.

df_car <- read.csv("data_input/car_data.csv")

df_car

The objective function will be: \[ S = 1 + \Sigma_{k=2}^{DT} \ S_k\]

\(S\) = number of setup
\(D_T\) = Number of car/unit to be produced
\(k\) = The position of car in the sequence
\(S_k\) = Is setup required? 1 if the next car (\(K+1\)) doesn’t have the same color with the current \(k^{th}\) car, 0 if the next car does

We will translate the objective function into an R function of min_setup. The algorithm in GA package run in context of maximization, so in order to do a minimization problem, we will make our fitness value negative.

min_setup <- function(x) {
    df <- df_car[x, ]
    for (i in 1:(nrow(df_car) - 1)) {
        df$s[i] <- if_else(df$color[i] == df$color[i + 1], 0, 1)
    }
    S = -(1 + sum(df$s, na.rm = T))
    return(S)
}

Now we run the algorithm. Since the problem is a scheduling problem, we will encode our solutions with permutation encoding. We will stop the algorithm if the number of iteration has reached 300 iterations or if in 30 iterations there is no improvement in the optimal fitness value.

IMPORTANT The choice of population size and the number of iteration will greatly affect the performance of GA. If the population size is too little, there will be limited gene pool in our population so the the new solutions will not differ too much from the previous generations. If the number of iterations is too little, there will not be enough time for the algorithm to find new optimal solutions. GA package can be run with parallel computing.

The parameter of the algorithm:

type: Type of the gene encoding, here we use “permutation”
fitness: The fitness function that will be optimized
lower: The lowest value of the permutation encoding, we will encode the permutation from 1 to the number of car available
upper: The highest value of the permutation encoding
seed: The value of random seed to get reproducible result
elitism: Number of best solution that will survive in each iteration, default is the top 5% of the population, here we set it to 50 solutions that will survive
maxiter: Number of max iterations for the algorithm to run
popSize: Number of solutions in a population
run: Number of iteration before the algorithm stop if there is no improvement on the optimal fitness value
parallel: Will we use parallel computing? Can be logical value or the number of core used

The rest of the parameters, such as the crossover and mutation probability are left to the default setting.

tic()
gann <- ga(type = "permutation", fitness = min_setup, lower = 1, upper = nrow(df_car), 
    seed = 123, elitism = 50, maxiter = 300, popSize = 300, run = 30, parallel = parallel::detectCores())

summary(gann)

-- Genetic Algorithm ------------------- 

GA settings: 
Type                  =  permutation 
Population size       =  300 
Number of generations =  300 
Elitism               =  50 
Crossover probability =  0.8 
Mutation probability  =  0.1 

GA results: 
Iterations             = 96 
Fitness function value = -7 
Solutions = 
      x1 x2 x3 x4 x5 x6 x7 x8 x9 x10  ...  x27 x28
[1,]  25  4 11 18  1 15  3  8 17  22        13  27
[2,]  19  7 21 10  5 24  3 17 15   2        11  18
[3,]  25 11  4 18  1  8  3 15  2  22        13  27
[4,]  25  4 11 18  1  8  3 15  2  22        13  27
[5,]  25  4 11 18  2  1 15 22 16  17        19   5
[6,]   1  2  8 16 15 17  3 22 23   9        11  18
[7,]  25  4 11 18  1  8  3 17 15   2        13  27
[8,]  11  4 25 18  1  8  3 15  2  22        13  27
[9,]   7 19 24 10 21  5  1 15  3   8        23   9
[10,]  4 25 11 18  2  1 15 22 16  17        19   5
 ...                                              
[24,]  4 25 11 18  1  8  3 17 15   2        13  27
[25,]  2  1 15 22 16 17  3  8 23   9        11  18

toc()

87.14 sec elapsed

We can plot the progression of the algorithm by using plot() function on the GA object.

plot(gann)

The best or optimal solutions are obtained at least at the 67th iteration and didn’t improve since then, so the algorithm stop after 30 iterations at 96th iteration.

The Solution output is the solution that result in the optimal fitness value. We can choose one of them since they will result in the same fitness value. Let’s check one of them.

min_setup(gann@solution[1, ])

[1] -7

If you want to look at all of the fitness value from the last population, use can use the fitness attribute from the GA object. We’ll summarise the fitness value to get the distribution.

summary(gann@fitness)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -25.00  -18.00  -15.00  -14.29  -11.00   -7.00

The worst solution in the population has fitness value of 25, while the optimal solution has fitness value of 7. From all 300 chromosomes or solutions generated, the median of fitness value is 15 while the mean is 14.29.

Let’s put the optimal sequence to our data. The resulting data frame will be our production schedule. Since there are multiple solutions, let’s just look at the first two optimal solutions.

The first solution:

df_car[gann@solution[1, ], ]

The second solution:

df_car[gann@solution[2, ], ]

Even though the first solution and the second solution has different sequence, the fitness function or the number of setup are the same, therefore both are optimal solutions and the decision to choose which one should be put into actual use is being handed to the decision maker, in these case, the production control staff.

4.1.2 Knapsack Problem

Suppose we have several items to deliver into the distribution center. However, our truck can only load up to total of 10 tons or 10000 kg, so we need to choose which items to deliver and maximize the weight loaded. You can directly solve this problem using the Knapsack Method, but you can also use Genetic Algorithm. Here we will see how GA deal with a constrained problem (weight should not exceed 10 tons or 10000 kg).

df_item <- data.frame(item = c("Tires", "Bumper", "Engine", "Chasis", "Seat"), freq = c(80, 
    50, 70, 50, 70), weight = c(7, 17, 158, 100, 30))

df_item

Each items have quantities and their respective weight.s

Let’s create a new dataset with one observation correspond to only 1 item.

df_item_long <- df_item[rep(rownames(df_item), df_item$freq), c(1, 3)]

df_item_long

Let’s set the objective function.

\[Total\ Weight = \Sigma_{i=1}^{k} \ W_k \ * \ n_k \]

\(Total\ Weight\): The total weight (kg)
\(k\): Number of unique item
\(W_k\): Weight of item k
\(n_k\): Number of item k that is chosen

Constrained to:

\[Total\ Weight <= 10000\]

weightlimit <- 10000

evalFunc <- function(x) {
    df <- df_item_long[x == 1, ]
    total_weight <- sum(df$weight)
    total_weight <- if_else(total_weight > weightlimit, 0, total_weight)
    return(total_weight)
}

Let’s run the algorithm. We will use binary encoding since the decision variables are integer-valued. Each bit represent a single logical value whether the item is selected.

tic()
gann2 <- ga(type = "binary", fitness = evalFunc, popSize = 100, maxiter = 100, run = 20, 
    nBits = nrow(df_item_long), seed = 123)

summary(gann2)

-- Genetic Algorithm ------------------- 

GA settings: 
Type                  =  binary 
Population size       =  100 
Number of generations =  100 
Elitism               =  5 
Crossover probability =  0.8 
Mutation probability  =  0.1 

GA results: 
Iterations             = 22 
Fitness function value = 10000 
Solutions = 
     x1 x2 x3 x4 x5 x6 x7 x8 x9 x10  ...  x319 x320
[1,]  1  1  0  1  1  1  1  0  1   1          0    1
[2,]  1  1  0  1  1  1  1  0  1   1          1    1
[3,]  0  1  1  1  1  0  0  1  1   0          0    0
[4,]  1  1  0  1  1  1  1  0  1   1          1    1
[5,]  1  1  0  1  1  1  1  0  1   1          1    0

toc()

0.7 sec elapsed

Let’s check the plot.

plot(gann2)

Let’s choose one of the solution as our optimal choice.

df_sol <- df_item_long[gann2@solution[1, ] == 1, ]

df_sol <- df_sol %>% group_by(item, weight) %>% summarise(freq = n()) %>% mutate(total_weigth = freq * 
    weight)

df_sol

Let’s check the total weight

sum(df_sol$total_weigth)

[1] 10000

The optimal fitness value is same with the constrain, so we can fully load the vehicle based on it’s maximum capacity.

There are still many application of GA in various field, especially in manufacturing and engineering process. As long as there is a fitness function and decision variables, GA can be appllied toward the problem.

4.2 Hyper-Parameter Tuning on Machine Learning Model

Now we will try to apply the algorithm to optimize our machine learning model. We will use the attrition dataset.

4.2.1 Import Data

attrition <- read.csv("data_input/attrition.csv")

attrition <- attrition %>% mutate(job_involvement = as.factor(job_involvement), education = as.factor(education), 
    environment_satisfaction = as.factor(environment_satisfaction), performance_rating = as.factor(performance_rating), 
    relationship_satisfaction = as.factor(relationship_satisfaction), work_life_balance = as.factor(work_life_balance))

attrition

4.2.2 Cross-Validation

We split the data into training set and testing dataset, with 80% of the data will be used as the training set.

set.seed(123)
intrain <- initial_split(attrition, prop = 0.8, strata = "attrition")

intrain

<1177/293/1470>

4.2.3 Data Preprocessing

We will preprocess the data with the following step:

Upsample to increase the number of the minority class and prevent class imbalance
Scaling all of the numeric variables
Remove numeric variable with near zero variance
Use PCA to reduce dimensions with threshold of 90% information kept
Remove over_18 variable since it only has 1 levels of factor

rec <- recipe(attrition ~ ., data = training(intrain)) %>% step_upsample(attrition) %>% 
    step_scale(all_numeric()) %>% step_nzv(all_numeric()) %>% step_pca(all_numeric(), 
    threshold = 0.9) %>% step_rm(over_18) %>% prep()

data_train <- juice(rec)
data_test <- bake(rec, testing(intrain))

data_train

We will train the model using random forest. We will optimize the mtry (number of nodes in each tree) and the min_n (minimum number of member on each tree in order to be splitted) value by maximizing the model accuracy on the testing dataset. Since the number of mtry and min_n is integer, we better use binary encoding instead of real-valued encoding.

We will set the range of mtry from 1 to 32. We will limit the range of min_n from 1 to 128. Let’s check how many bits we need to cover those numbers.

# max value of mtry = 32
2^5

[1] 32

# max value of min_n = 128
2^7

[1] 128

So, we will need at least 12 bits of binary. If the converted value of bits for mtry or min_n is 0, we change it to 1.

We will write the model as fitness function. We will maximize the accuracy of our model on the testing dataset.

fit_function <- function(x) {
    a <- binary2decimal(x[1:5])
    b <- binary2decimal(x[6:12])
    
    if (a == 0) {
        a <- 1
    }
    if (b == 0) {
        b <- 1
    }
    if (a > 17) {
        a <- 17
    }
    
    # define model spec
    model_spec <- rand_forest(mode = "classification", mtry = a, trees = 500, min_n = b)
    
    # define model engine
    model_spec <- set_engine(model_spec, engine = "ranger", seed = 123, num.threads = parallel::detectCores(), 
        importance = "impurity")
    
    # model fitting
    set.seed(123)
    model <- fit_xy(object = model_spec, x = select(data_train, -attrition), y = select(data_train, 
        attrition))
    
    pred_test <- predict(model, new_data = data_test %>% select(-attrition))
    acc <- accuracy_vec(truth = data_test$attrition, estimate = pred_test$.pred_class)
    return(acc)
}

IMPORTANT Since some model require a lot of time to train, you may need to be extra careful on choosing the population size and the number of iteration. If you don’t have a problem to run the GA on a long period of time, than you may choose bigger population size and number of iterations. Here I will set the population size to be only 100 and the maximum number of iteration is 100. You can use parallel = TRUE if you want faster computation with parallel computing.

There are several options that can be chosen for each of the GA process. For example, the default parent selection of GA for binary is linear rank selection. The detailed default methods used on each encoding can be found at https://www.rdocumentation.org/packages/GA/versions/3.2/topics/gaControl. Here, we will try to use tournament selection instead of the default linear rank selection.

tic()
ga_rf <- ga(type = "binary", fitness = fit_function, nBits = 12, seed = 123, popSize = 100, 
    maxiter = 100, run = 10, parallel = T, selection = gabin_tourSelection)
summary(ga_rf)

-- Genetic Algorithm ------------------- 

GA settings: 
Type                  =  binary 
Population size       =  100 
Number of generations =  100 
Elitism               =  5 
Crossover probability =  0.8 
Mutation probability  =  0.1 

GA results: 
Iterations             = 16 
Fitness function value = 0.8805461 
Solutions = 
     x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12
[1,]  0  0  0  0  0  0  1  0  0   0   0   0
[2,]  0  0  0  0  1  0  1  0  0   0   0   0

toc()

633.28 sec elapsed

Let’s plot the progression of GA

plot(ga_rf)

Let’s check each decision variables and the accuracy that is used as the fitness value.

# Transform decision variables into data frame
ga_rf_pop <- as.data.frame(ga_rf@population)

# Convert decision variables into mtry and min_n
ga_mtry <- apply(ga_rf_pop[, 1:5], 1, binary2decimal)
ga_min_n <- apply(ga_rf_pop[, 6:12], 1, binary2decimal)

# Show the list of solutions in the final population
data.frame(mtry = ga_mtry, min_n = ga_min_n, accuracy = ga_rf@fitness) %>% mutate(mtry = ifelse(mtry == 
    0, 1, mtry), min_n = ifelse(min_n == 0, 1, min_n), mtry = ifelse(mtry > 17, 17, 
    mtry)) %>% distinct() %>% arrange(desc(accuracy))

You may want to compare the computation time of GA with the required time for repeated K-fold cross-validation using the conventional caret package. We will use k = 5 and 10 repeats.

tic()
set.seed(123)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 10)
attrition_forest <- train(attrition ~ ., data = data_train, method = "rf", trControl = ctrl)
toc()

660.06 sec elapsed

attrition_forest

Random Forest 

1974 samples
  17 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 10 times) 
Summary of sample sizes: 1580, 1579, 1578, 1580, 1579, 1578, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.9489451  0.8978902
  22    0.9596744  0.9193495
  42    0.9476663  0.8953330

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 22.

Let’s check the accuracy of the random forest model with caret

pred_rf <- predict(attrition_forest, newdata = data_test)
accuracy_vec(truth = data_test$attrition, estimate = pred_rf)

[1] 0.8634812

The model performance by GA is better than those optimized with simple K-fold cross validation, even though GA require more time to compute. This trade-off of computation time vs performance should be considered.

5 Conclusion

Optimization problem is one of many concern in business, which may seek to maximize profit and minimize loss. Machine learning apply optimization on their method of training model to get the best possible result. GA is an optimization algorithm that can be applied in various fields, including data science. If a data scientist can harness the benefit of GA, he can get more optimal model with better performance. The downside of GA is that it require time to compute since it belong to the group of search algorithm.

6 Reference

Luca Scrucca. A quick tour of GA. http://luca-scr.github.io/GA/articles/GA.html
Luca Scrucca. 2013. GA: A Package for Genetic Algorithms in R. Journal of Statistical Software, 3(4).
Rohit Ghandi. Genetic Algorithms. https://medium.com/datadriveninvestor/genetic-algorithms-9f920939f7cc
Nearchou, A. C. 2004. The Effect of Various Operators on The Genetic Search for Large Scheduling Problems. International Journal of productions Economics, 88, 191-203.