hEDA

Joint determination of stratification and sample allocation designs

Partition a population into mutually exclusive and collectively exhaustive strata
Basic strata created from the cross product of auxiliary variables
Variables can be continuous or discrete
These basic strata tend to be small in size (especially in the case of continuous strata)
For business surveys think NACE Code (similar to Standard Industry Classification (SIC)) * firm size*earnings band
For household surveys think low-level NUTS region * gender * age class
For agricultural surveys think NUTS region * legal status * farm Size
These basic strata are then combined into larger strata

Stratification Quality

Each stratification is a solution
The quality of each solution is measured by the optimal sample size that can be allocated to it and still meet precision constraints
This quality is evaluated by the Bethel-Chromy algorithm
This is a partitioning problem in which the number of possible solutions grows with the set of basic strata
For larger sets, the problem cannot be solved because the search for an optimum solution requires a prohibitive amount of computation time

Objective Function

Neyman allocation - sample size allocated to each stratum considers its size and variation
Sample size calculated to meet precision requirements
Calculate the sample size for one or more variables
Bethel-Chromy algorithm

Estimation of distribution algorithms (EDAs)

Iteratively estimating, building and sampling probability models in the search for optimal solutions
Also known as probabilistic model-building genetic algorithms (PMBGAs)
Belong to a family of evolutionary algorithms
Stochastic blackbox optimization algorithms
EDAs have been show to outperform other optimisation techniques for other problems
we assume that the probability distribution of any basic stratum belonging to a particular stratum is independent of that for another basic stratum
Nonetheless, while EDAs should locate the global optimum or its near approximation in a reasonable amount of iterations, they have two main computational bottlenecks: (1) fitness evaluation and (2) model building -These must be addressed by efficiency enhancement techniques suitable for EDAs (Hauschild and Pelikan, 2011).

Simulated Annealing Algorithm (SAA)

Simulated annealing makes small adjustments to a candidate solution
Keeps those adjustments which lead to an improvement in solution quality
Gradually morphs the solution towards the closest local optimum
Resembles Hillclimbing in behaviour

A hybrid estimation of distribution algorithm (hEDA)

Hybridisation addresses the first bottleneck, i.e. combining it with the SAA to combine the methods of exploration methods with exploitation methods to shorten the search time
We address the model buliding bottleneck by using the simple model building approach, i.e. probably of each basic stratum belonging to a larger stratum

Outline of the hEDA

Start with a population of \(N_p\) solutions
Either generated at random, or generated with random permutations of a starting solution (accompanying that solution)
Evaluate the population and select a predefined number of the best quality solutions (elitism)
Construct a probability model of the best solutions
Use this model to generate new solutions to replace those not selected at the elitism stage
After a tunable number of iterations a simulated annealing algorithm is applied to a tunable number of solutions in the population
We the continue as before
This process continues until a stopping rule has been reached

Versatility and Adaptability of the hEDA

Hybrid Estimation of Distribution Algorithm with Simulated Annealing
Hybrid Estimation of Distribution Algorithm with Hillclimbing
Estimation of Distribution Algorithm
Simulated Annealing Algorithm
Hillclimbing Algorithm

Evaluation Approach

Each of the \(Np\) solutions in the population are evaluated in a two step-approach using the aggrStrata function (Ballin and Barcaroli, 2020) and Bethel algorithm (Ballin and Barcaroli, 2020; De Meo and De Meo, 2009)
We have coded the aggrStrata function in C++ and integrated it in to R using the Rcpp package (Eddelbuettel, 2013)
the traditional set.seed() function in R which was useful for reproducbility of experiments is not easily applied in Rcpp - meaning that some further work is required before we can reproduce results exactly

Initial Solutions generated using

Hard clustering

Kmeans algorithm
Kmedoids algorithm (partitioning around medoids) ### Soft Clustering
Fuzzy clustering (Soft clustering)

Future plans

Integrate a C++ version of the Bethel algorithm into R (Some work on this has already been done and we have been collaborating with Giulio Barcaroli)
Integrate a C++ version of the hEDA into R (Beta version exists)
Expecation Maximisation algorithm for generating initial solutions to be finalised

Bibliography

Ballin, M. and G. Barcaroli (2020). Optimization of sampling strata with the SamplingStrata package. Accessed 3 May 2021. https://barcaroli.github.io/SamplingStrata/articles/SamplingStrata.html

Bethel, J. W. (1985). An optimum allocation algorithm for multivariate surveys. American Statistical Proceedings of the Survey Research Methods, Section, 209-212.

Chromy, J. R. (1987). Design optimization with multiple objectives. Proceedings of the Survey Research Methods Section.

De Meo, M. and M. M. De Meo (2009). Package ‘bethel’.

Eddelbuettel, E. (2013). Seamless R and C++ Integration with Rcpp ISBN 978-1-4614- 6867-7 10.1007/978-1-4614-6868-4

Hauschild, M. and M. Pelikan (2011). An introduction and survey of estimation of distribution algorithms. Swarm and evolutionary computation 1 (3), 111–128.