Joint determination of stratification and sample allocation designs
- Partition a population into mutually exclusive and collectively exhaustive strata
- Basic strata created from the cross product of auxiliary variables
- Variables can be continuous or discrete
- These basic strata tend to be small in size (especially in the case of continuous strata)
- For business surveys think NACE Code (similar to Standard Industry Classification (SIC)) * firm size*earnings band
- For household surveys think low-level NUTS region * gender * age class
- For agricultural surveys think NUTS region * legal status * farm Size
- These basic strata are then combined into larger strata
Stratification Quality
- Each stratification is a solution
- The quality of each solution is measured by the optimal sample size that can be allocated to it and still meet precision constraints
- This quality is evaluated by the Bethel-Chromy algorithm
- This is a partitioning problem in which the number of possible solutions grows with the set of basic strata
- For larger sets, the problem cannot be solved because the search for an optimum solution requires a prohibitive amount of computation time
Objective Function
- Neyman allocation - sample size allocated to each stratum considers its size and variation
- Sample size calculated to meet precision requirements
- Calculate the sample size for one or more variables
- Bethel-Chromy algorithm

Estimation of distribution algorithms (EDAs)
- Iteratively estimating, building and sampling probability models in the search for optimal solutions
- Also known as probabilistic model-building genetic algorithms (PMBGAs)
- Belong to a family of evolutionary algorithms
- Stochastic blackbox optimization algorithms
- EDAs have been show to outperform other optimisation techniques for other problems
- we assume that the probability distribution of any basic stratum belonging to a particular stratum is independent of that for another basic stratum
- Nonetheless, while EDAs should locate the global optimum or its near approximation in a reasonable amount of iterations, they have two main computational bottlenecks: (1) fitness evaluation and (2) model building -These must be addressed by efficiency enhancement techniques suitable for EDAs (Hauschild and Pelikan, 2011).
Simulated Annealing Algorithm (SAA)
- Simulated annealing makes small adjustments to a candidate solution
- Keeps those adjustments which lead to an improvement in solution quality
- Gradually morphs the solution towards the closest local optimum
- Resembles Hillclimbing in behaviour
A hybrid estimation of distribution algorithm (hEDA)
- Hybridisation addresses the first bottleneck, i.e. combining it with the SAA to combine the methods of exploration methods with exploitation methods to shorten the search time
- We address the model buliding bottleneck by using the simple model building approach, i.e. probably of each basic stratum belonging to a larger stratum
Outline of the hEDA
- Start with a population of \(N_p\) solutions
- Either generated at random, or generated with random permutations of a starting solution (accompanying that solution)
- Evaluate the population and select a predefined number of the best quality solutions (elitism)
- Construct a probability model of the best solutions
- Use this model to generate new solutions to replace those not selected at the elitism stage
- After a tunable number of iterations a simulated annealing algorithm is applied to a tunable number of solutions in the population
- We the continue as before
- This process continues until a stopping rule has been reached

Versatility and Adaptability of the hEDA
- Hybrid Estimation of Distribution Algorithm with Simulated Annealing
- Hybrid Estimation of Distribution Algorithm with Hillclimbing
- Estimation of Distribution Algorithm
- Simulated Annealing Algorithm
- Hillclimbing Algorithm
Evaluation Approach
- Each of the \(Np\) solutions in the population are evaluated in a two step-approach using the aggrStrata function (Ballin and Barcaroli, 2020) and Bethel algorithm (Ballin and Barcaroli, 2020; De Meo and De Meo, 2009)
- We have coded the aggrStrata function in C++ and integrated it in to R using the Rcpp package (Eddelbuettel, 2013)
- the traditional set.seed() function in R which was useful for reproducbility of experiments is not easily applied in Rcpp - meaning that some further work is required before we can reproduce results exactly
Initial Solutions generated using
Hard clustering
- Kmeans algorithm
- Kmedoids algorithm (partitioning around medoids) ### Soft Clustering
- Fuzzy clustering (Soft clustering)
Future plans
- Integrate a C++ version of the Bethel algorithm into R (Some work on this has already been done and we have been collaborating with Giulio Barcaroli)
- Integrate a C++ version of the hEDA into R (Beta version exists)
- Expecation Maximisation algorithm for generating initial solutions to be finalised
Bibliography
Ballin, M. and G. Barcaroli (2020). Optimization of sampling strata with the SamplingStrata package. Accessed 3 May 2021. https://barcaroli.github.io/SamplingStrata/articles/SamplingStrata.html
Bethel, J. W. (1985). An optimum allocation algorithm for multivariate surveys. American Statistical Proceedings of the Survey Research Methods, Section, 209-212.
Chromy, J. R. (1987). Design optimization with multiple objectives. Proceedings of the Survey Research Methods Section.
De Meo, M. and M. M. De Meo (2009). Package ‘bethel’.
Eddelbuettel, E. (2013). Seamless R and C++ Integration with Rcpp ISBN 978-1-4614- 6867-7 10.1007/978-1-4614-6868-4
Hauschild, M. and M. Pelikan (2011). An introduction and survey of estimation of distribution algorithms. Swarm and evolutionary computation 1 (3), 111–128.