CMA-ES

Covariance matrix adaptation evolution strategy (CMA-ES) is a particular kind of strategy for numerical optimization. Evolution strategies (ES) are stochastic, derivative-free methods for numerical optimization of non-linear or non-convex continuous optimization problems. They belong to the class of evolutionary algorithms and evolutionary computation. An evolutionary algorithm is broadly based on the principle of biological evolution, namely the repeated interplay of variation and selection: in each generation new individuals (candidate solutions) are generated by variation, usually in a stochastic way, of the current parental individuals. Then, some individuals are selected to become the parents in the next generation based on their fitness or objective function value $f(x)$. In this way, over the generation sequence, individuals with better and better $f$-values are generated.
In an evolution strategy, new candidate solutions are sampled according to a multivariate normal distribution in $\mathbb{R}^n$. Recombination amounts to selecting a new mean value for the distribution. Mutation amounts to adding a random vector, a perturbation with zero mean. Pairwise dependencies between the variables in the distribution are represented by a covariance matrix. The covariance matrix adaptation (CMA) is a method to update the covariance matrix of this distribution. This is particularly useful if the function $f$ is ill-conditioned.
Adaptation of the covariance matrix amounts to learning a second order model of the underlying objective function similar to the approximation of the inverse Hessian matrix in the quasi-Newton method in classical optimization. In contrast to most classical methods, fewer assumptions on the nature of the underlying objective function are made. Only the ranking between candidate solutions is exploited for learning the sample distribution and neither derivatives nor even the function values themselves are required by the method.

Principles

Two main principles for the adaptation of parameters of the search distribution are exploited in the CMA-ES algorithm.
First, a maximum-likelihood principle, based on the idea to increase the probability of successful candidate solutions and search steps. The mean of the distribution is updated such that the likelihood of previously successful candidate solutions is maximized. The covariance matrix of the distribution is updated such that the likelihood of previously successful search steps is increased. Both updates can be interpreted as a natural gradient descent. Also, in consequence, the CMA conducts an iterated principal components analysis of successful search steps while retaining all principal axes. Estimation of distribution algorithms and the Cross-Entropy Method are based on very similar ideas, but estimate the covariance matrix by maximizing the likelihood of successful solution points instead of successful search steps.
Second, two paths of the time evolution of the distribution mean of the strategy are recorded, called search or evolution paths. These paths contain significant information about the correlation between consecutive steps. Specifically, if consecutive steps are taken in a similar direction, the evolution paths become long. The evolution paths are exploited in two ways. One path is used for the covariance matrix adaptation procedure in place of single successful search steps and facilitates a possibly much faster variance increase of favorable directions. The other path is used to conduct an additional step-size control. This step-size control aims to make consecutive movements of the distribution mean orthogonal in expectation. The step-size control effectively prevents premature convergence while still allowing fast convergence to an optimum.

Algorithm

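In the following the most commonly used $(\mu/\mu_w, \lambda)$-CMA-ES is outlined, where in each iteration step a weighted combination of the $\mu$ best out of $\lambda$ new candidate solutions is used to update the distribution parameters. The main loop consists of three main parts: 1) sampling of new solutions, 2) re-ordering of the sampled solutions based on their fitness, 3) update of the internal state variables based on the re-ordered samples. A pseudocode of the algorithm looks as follows.
set λ // number of samples per iteration, at least two, generally > 4
initialize m, σ, C = I, p_σ = 0, p_c = 0 // initialize state variables
while not terminate do // iterate
    for i in {1 … λ} do // sample λ new solutions and evaluate them
        x_i = sample_multivariate_normal(mean = m, covariance_matrix = σ² C)
        f_i = fitness(x_i)
    x_{1…λ} ← x_{s(1)…s(λ)} with s(i) = argsort(f_{1…λ}, i) // sort solutions
    m' = m // we need later m − m' and x_i − m'
    m ← update_m(x_1, …, x_λ) // move mean to better solutions
    p_σ ← update_ps(p_σ, σ⁻¹ C^(−1/2) (m − m')) // update isotropic evolution path
    p_c ← update_pc(p_c, σ⁻¹ (m − m'), ‖p_σ‖) // update anisotropic evolution path
    C ← update_C(C, p_c, (x_1 − m')/σ, …, (x_λ − m')/σ) // update covariance matrix
    σ ← update_sigma(σ, ‖p_σ‖) // update step-size using isotropic path length
return m or x_1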
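The order of the five update assignments is relevant: $m$ must be updated first, $p_\sigma$ and $p_c$ must be updated before $C$, and $\sigma$ must be updated last. In the following, the update equations for the five state variables are specified.
Given are the search space dimension $n$ and the iteration step $k$. The five state variables are
$m_k \in \mathbb{R}^n$, the distribution mean and current favorite solution to the optimization problem,
$\sigma_k > 0$, the step-size,
$C_k$, a symmetric and positive-definite $n \times n$ covariance matrix with $C_0 = I$, and
$p_\sigma \in \mathbb{R}^n$ and $p_c \in \mathbb{R}^n$, two evolution paths, initially set to the zero vector.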
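The iteration starts with sampling $\lambda > 1$ candidate solutions $x_i \in \mathbb{R}^n$ from a multivariate normal distribution $\mathcal{N}(m_k, \sigma_k^2 C_k)$, i.e. for $i = 1, \ldots, \lambda$
$$x_i \sim \mathcal{N}(m_k, \sigma_k^2 C_k)$$
$$\phantom{x_i} \sim m_k + \sigma_k \times \mathcal{N}(0, C_k)$$
The second line suggests the interpretation as an unbiased perturbation (mutation) of the current favorite solution vector $m_k$ (the distribution mean vector). The candidate solutions $x_i$ are evaluated on the objective function $f: \mathbb{R}^n \to \mathbb{R}$ to be minimized. Denoting the $f$-sorted candidate solutions as
$$\{x_{i:\lambda} \mid i = 1 \ldots \lambda\} = \{x_i \mid i = 1 \ldots \lambda\} \quad \text{and} \quad f(x_{1:\lambda}) \le \cdots \le f(x_{\mu:\lambda}) \le f(x_{\mu+1:\lambda}) \le \cdots,$$
the new mean value is computed as
$$m_{k+1} = \sum_{i=1}^{\mu} w_i \, x_{i:\lambda} = m_k + \sum_{i=1}^{\mu} w_i \, (x_{i:\lambda} - m_k)$$
where the positive (recombination) weights $w_1 \ge w_2 \ge \cdots \ge w_\mu > 0$ sum to one. Typically, $\mu \le \lambda/2$ and the weights are chosen such that $\mu_w := 1 / \sum_{i=1}^{\mu} w_i^2 \approx \lambda/4$. The only feedback used from the objective function here and in the following is an ordering of the sampled candidate solutions due to the indices $i:\lambda$.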
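The recombination step can be illustrated with a minimal Python sketch. It assumes the logarithmic default weights used in the MATLAB/Octave listing of this article; the function names `recombination_weights` and `update_mean` are hypothetical and introduced only for illustration.

```python
import math

def recombination_weights(lam):
    """Return (mu, weights, mu_w) for population size lam.

    Assumption: logarithmic weights log(lam/2 + 1/2) - log(i), normalized
    to sum to one, as in the example code of this article.
    """
    mu_real = lam / 2.0
    mu = int(math.floor(mu_real))
    raw = [math.log(mu_real + 0.5) - math.log(i) for i in range(1, mu + 1)]
    total = sum(raw)
    weights = [r / total for r in raw]
    mu_w = 1.0 / sum(w * w for w in weights)  # variance effective selection mass
    return mu, weights, mu_w

def update_mean(sorted_candidates, weights):
    """Weighted mean of the best len(weights) candidates.

    sorted_candidates must already be ordered by objective value, best first.
    """
    n = len(sorted_candidates[0])
    return [sum(w * x[j] for w, x in zip(weights, sorted_candidates))
            for j in range(n)]

mu, weights, mu_w = recombination_weights(12)
new_mean = update_mean([[0.0, 0.0], [2.0, 2.0]], [0.75, 0.25])
```

Only the order of the candidates enters `update_mean`; the objective values themselves are never used.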
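The step-size $\sigma_k$ is updated using cumulative step-size adaptation (CSA), sometimes also denoted as path length control. The evolution path (or search path) $p_\sigma$ is updated first.
$$p_\sigma \gets (1 - c_\sigma)\, p_\sigma + \sqrt{1 - (1 - c_\sigma)^2}\, \sqrt{\mu_w}\, C_k^{-1/2}\, \frac{m_{k+1} - m_k}{\sigma_k}$$
$$\sigma_{k+1} = \sigma_k \times \exp\left(\frac{c_\sigma}{d_\sigma} \left(\frac{\|p_\sigma\|}{\operatorname{E}\|\mathcal{N}(0, I)\|} - 1\right)\right)$$
where
$c_\sigma^{-1} \approx n/3$ is the backward time horizon for the evolution path $p_\sigma$ and larger than one,
$\mu_w = \left(\sum_{i=1}^{\mu} w_i^2\right)^{-1}$ is the variance effective selection mass and $1 \le \mu_w \le \mu$ by definition of $w_i$,
$C_k^{-1/2} = \sqrt{C_k}^{\,-1} = \sqrt{C_k^{-1}}$ is the unique symmetric square root of the inverse of $C_k$, and
$d_\sigma$ is the damping parameter, usually close to one. For $d_\sigma = \infty$ or $c_\sigma = 0$ the step-size remains unchanged.
The step-size $\sigma_k$ is increased if and only if $\|p_\sigma\|$ is larger than the expected value
$$\operatorname{E}\|\mathcal{N}(0, I)\| = \sqrt{2}\, \Gamma((n+1)/2) / \Gamma(n/2) \approx \sqrt{n}\, \left(1 - \frac{1}{4n} + \frac{1}{21 n^2}\right)$$
and decreased if it is smaller. For this reason, the step-size update tends to make consecutive steps $C_k^{-1}$-conjugate, in that after the adaptation has been successful
$$\left(\frac{m_{k+2} - m_{k+1}}{\sigma_{k+1}}\right)^T C_k^{-1}\, \frac{m_{k+1} - m_k}{\sigma_k} \approx 0.$$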
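The comparison of the path length with its expected value can be sketched in Python as follows. The expected norm of a standard normal vector is the mean of the chi distribution, $\sqrt{2}\,\Gamma((n+1)/2)/\Gamma(n/2)$, and the polynomial approximation is the one used for `chiN` in the example code of this article. The function names and the parameter names `c_sigma` and `d_sigma` are assumptions made for this sketch.

```python
import math

def expected_norm_exact(n):
    """E||N(0, I)|| in dimension n, computed via log-gamma for stability."""
    return math.sqrt(2.0) * math.exp(
        math.lgamma((n + 1) / 2.0) - math.lgamma(n / 2.0))

def expected_norm_approx(n):
    """Common approximation sqrt(n) * (1 - 1/(4n) + 1/(21 n^2))."""
    return math.sqrt(n) * (1.0 - 1.0 / (4.0 * n) + 1.0 / (21.0 * n * n))

def step_size_update(sigma, ps_norm, n, c_sigma, d_sigma):
    """Increase sigma if the path is longer than expected, decrease otherwise."""
    ratio = ps_norm / expected_norm_approx(n)
    return sigma * math.exp((c_sigma / d_sigma) * (ratio - 1.0))
```

A path of exactly the expected length leaves the step-size unchanged, which corresponds to the stationarity of $\log \sigma$ under neutral selection.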
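Finally, the covariance matrix is updated, where again the respective evolution path is updated first.
$$p_c \gets (1 - c_c)\, p_c + \mathbf{1}_{[0, \alpha\sqrt{n}]}(\|p_\sigma\|)\, \sqrt{1 - (1 - c_c)^2}\, \sqrt{\mu_w}\, \frac{m_{k+1} - m_k}{\sigma_k}$$
$$C_{k+1} = (1 - c_1 - c_\mu + c_s)\, C_k + c_1\, p_c p_c^T + c_\mu \sum_{i=1}^{\mu} w_i\, \frac{x_{i:\lambda} - m_k}{\sigma_k} \left(\frac{x_{i:\lambda} - m_k}{\sigma_k}\right)^T$$
where $T$ denotes the transpose and
$c_c^{-1} \approx n/4$ is the backward time horizon for the evolution path $p_c$ and larger than one,
$\alpha \approx 1.5$ and the indicator function $\mathbf{1}_{[0, \alpha\sqrt{n}]}(\|p_\sigma\|)$ evaluates to one if and only if $\|p_\sigma\| \le \alpha\sqrt{n}$, which is usually the case,
$c_s = \left(1 - \mathbf{1}_{[0, \alpha\sqrt{n}]}(\|p_\sigma\|)^2\right) c_1 c_c (2 - c_c)$ makes up in part for the small variance loss in case the indicator is zero,
$c_1 \approx 2/n^2$ is the learning rate for the rank-one update of the covariance matrix, and
$c_\mu \approx \mu_w/n^2$ is the learning rate for the rank-$\mu$ update of the covariance matrix and must not exceed $1 - c_1$.
The covariance matrix update tends to increase the likelihood for $p_c$ and for $(x_{i:\lambda} - m_k)/\sigma_k$ to be sampled from $\mathcal{N}(0, C_{k+1})$. This completes the iteration step.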
The number of candidate samples per iteration, $\lambda$, is not determined a priori and can vary in a wide range. Smaller values, for example $\lambda = 10$, lead to more local search behavior. Larger values, for example $\lambda = 10n$ with default value $\mu_w \approx \lambda/4$, render the search more global. Sometimes the algorithm is repeatedly restarted with $\lambda$ increased by a factor of two for each restart. Apart from setting $\lambda$ (or possibly $\mu$ instead, if for example $\lambda$ is predetermined by the number of available processors), the parameters introduced above are not specific to the given objective function and are therefore not meant to be modified by the user.

Example code in MATLAB/Octave

function xmin=purecmaes   % (mu/mu_w, lambda)-CMA-ES
  % -------------------- Initialization --------------------------------
  % User defined input parameters (need to be edited)
  strfitnessfct = 'frosenbrock';  % name of objective/fitness function
  N = 20;               % number of objective variables/problem dimension
  xmean = rand(N,1);    % objective variables initial point
  sigma = 0.3;          % coordinate wise standard deviation (step size)
  stopfitness = 1e-10;  % stop if fitness < stopfitness (minimization)
  stopeval = 1e3*N^2;   % stop after stopeval number of function evaluations

  % Strategy parameter setting: Selection
  lambda = 4+floor(3*log(N));  % population size, offspring number
  mu = lambda/2;               % number of parents/points for recombination
  weights = log(mu+1/2)-log(1:mu)'; % muXone array for weighted recombination
  mu = floor(mu);
  weights = weights/sum(weights);     % normalize recombination weights array
  mueff=sum(weights)^2/sum(weights.^2); % variance-effectiveness of sum w_i x_i

  % Strategy parameter setting: Adaptation
  cc = (4+mueff/N) / (N+4 + 2*mueff/N);  % time constant for cumulation for C
  cs = (mueff+2) / (N+mueff+5);  % t-const for cumulation for sigma control
  c1 = 2 / ((N+1.3)^2+mueff);    % learning rate for rank-one update of C
  cmu = min(1-c1, 2 * (mueff-2+1/mueff) / ((N+2)^2+mueff));  % and for rank-mu update
  damps = 1 + 2*max(0, sqrt((mueff-1)/(N+1))-1) + cs; % damping for sigma
                                                      % usually close to 1
  % Initialize dynamic (internal) strategy parameters and constants
  pc = zeros(N,1); ps = zeros(N,1);   % evolution paths for C and sigma
  B = eye(N,N);                       % B defines the coordinate system
  D = ones(N,1);                      % diagonal D defines the scaling
  C = B * diag(D.^2) * B';            % covariance matrix C
  invsqrtC = B * diag(D.^-1) * B';    % C^-1/2
  eigeneval = 0;                      % track update of B and D
  chiN=N^0.5*(1-1/(4*N)+1/(21*N^2));  % expectation of
                                      %   ||N(0,I)|| == norm(randn(N,1))
  % -------------------- Generation Loop --------------------------------
  counteval = 0;  % the next 40 lines contain the 20 lines of interesting code
  while counteval < stopeval

      % Generate and evaluate lambda offspring
      for k=1:lambda
          arx(:,k) = xmean + sigma * B * (D .* randn(N,1)); % m + sig * Normal(0,C)
          arfitness(k) = feval(strfitnessfct, arx(:,k)); % objective function call
          counteval = counteval+1;
      end

      % Sort by fitness and compute weighted mean into xmean
      [arfitness, arindex] = sort(arfitness); % minimization
      xold = xmean;
      xmean = arx(:,arindex(1:mu))*weights;   % recombination, new mean value

      % Cumulation: Update evolution paths
      ps = (1-cs)*ps ...
            + sqrt(cs*(2-cs)*mueff) * invsqrtC * (xmean-xold) / sigma;
      hsig = norm(ps)/sqrt(1-(1-cs)^(2*counteval/lambda))/chiN < 1.4 + 2/(N+1);
      pc = (1-cc)*pc ...
            + hsig * sqrt(cc*(2-cc)*mueff) * (xmean-xold) / sigma;

      % Adapt covariance matrix C
      artmp = (1/sigma) * (arx(:,arindex(1:mu))-repmat(xold,1,mu));
      C = (1-c1-cmu) * C ...                  % regard old matrix
           + c1 * (pc*pc' ...                 % plus rank one update
                   + (1-hsig) * cc*(2-cc) * C) ... % minor correction if hsig==0
           + cmu * artmp * diag(weights) * artmp'; % plus rank mu update

      % Adapt step size sigma
      sigma = sigma * exp((cs/damps)*(norm(ps)/chiN - 1));

      % Decomposition of C into B*diag(D.^2)*B' (diagonalization)
      if counteval - eigeneval > lambda/(c1+cmu)/N/10  % to achieve O(N^2)
          eigeneval = counteval;
          C = triu(C) + triu(C,1)'; % enforce symmetry
          [B,D] = eig(C);           % eigen decomposition, B==normalized eigenvectors
          D = sqrt(diag(D));        % D is a vector of standard deviations now
          invsqrtC = B * diag(D.^-1) * B';
      end

      % Break, if fitness is good enough or condition exceeds 1e14, better termination methods are advisable
      if arfitness(1) <= stopfitness || max(D) > 1e7 * min(D)
          break;
      end

  end % while, end generation loop

  xmin = arx(:, arindex(1)); % Return best point of last iteration.
                             % Notice that xmean is expected to be even
                             % better.
end
% ---------------------------------------------------------------
function f=frosenbrock(x)
  if size(x,1) < 2 error('dimension must be greater one'); end
  f = 100*sum((x(1:end-1).^2 - x(2:end)).^2) + sum((x(1:end-1)-1).^2);
end

Theoretical foundations

Given the distribution parameters (mean, variances and covariances), the normal probability distribution for sampling new candidate solutions is the maximum entropy probability distribution over $\mathbb{R}^n$, that is, the sample distribution with the minimal amount of prior information built into the distribution. More considerations on the update equations of CMA-ES are made in the following.

Variable metric

The CMA-ES implements a stochastic variable-metric method. In the very particular case of a convex-quadratic objective function
$$f(x) = \tfrac{1}{2} (x - x^*)^T H (x - x^*)$$
the covariance matrix $C_k$ adapts to the inverse of the Hessian matrix $H$, up to a scalar factor and small random fluctuations. More generally, also on the function $g \circ f$, where $g$ is strictly increasing and therefore order preserving and $f$ is convex-quadratic, the covariance matrix $C_k$ adapts to $H^{-1}$, up to a scalar factor and small random fluctuations. Note that a generalized capability of evolution strategies to adapt a covariance matrix reflective of the inverse Hessian has been proven for a static model relying on a quadratic approximation.

Maximum-likelihood updates

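The update equations for mean and covariance matrix maximize a likelihood while resembling an expectation-maximization algorithm. The update of the mean vector $m$ maximizes a log-likelihood, such that
$$m_{k+1} = \arg\max_{m} \sum_{i=1}^{\mu} w_i \log p_{\mathcal{N}}(x_{i:\lambda} \mid m)$$
where
$$\log p_{\mathcal{N}}(x) = -\tfrac{1}{2} \log \det(2\pi C) - \tfrac{1}{2} (x - m)^T C^{-1} (x - m)$$
denotes the log-likelihood of $x$ from a multivariate normal distribution with mean $m$ and any positive definite covariance matrix $C$. To see that $m_{k+1}$ is independent of $C$, remark first that this is the case for any diagonal matrix $C$, because the coordinate-wise maximizer is independent of a scaling factor. Then, rotation of the data points or choosing $C$ non-diagonal are equivalent.
The rank-$\mu$ update of the covariance matrix, that is, the rightmost summand in the update equation of $C_k$, maximizes a log-likelihood in that
$$\sum_{i=1}^{\mu} w_i\, \frac{x_{i:\lambda} - m_k}{\sigma_k} \left(\frac{x_{i:\lambda} - m_k}{\sigma_k}\right)^T = \arg\max_{C} \sum_{i=1}^{\mu} w_i \log p_{\mathcal{N}}\left(\left.\frac{x_{i:\lambda} - m_k}{\sigma_k} \right| C\right)$$
for $\mu \ge n$ (otherwise $C$ is singular, but substantially the same result holds for $\mu < n$). Here, $p_{\mathcal{N}}(x \mid C)$ denotes the likelihood of $x$ from a multivariate normal distribution with zero mean and covariance matrix $C$. Therefore, for $c_1 = 0$ and $c_\mu = 1$, $C_{k+1}$ is the above maximum-likelihood estimator. See estimation of covariance matrices for details on the derivation.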
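The statement that the weighted mean maximizes the weighted log-likelihood regardless of the covariance can be checked numerically in one dimension. The following Python sketch uses made-up sample points and weights; the function name `weighted_log_likelihood` is hypothetical.

```python
import math

def weighted_log_likelihood(m, points, weights, variance):
    """Weighted sum of 1-D normal log-densities of the points, given mean m."""
    return sum(
        w * (-0.5 * math.log(2.0 * math.pi * variance)
             - 0.5 * (x - m) ** 2 / variance)
        for w, x in zip(weights, points)
    )

points = [0.2, 1.0, 3.5]   # hypothetical selected points, best first
weights = [0.6, 0.3, 0.1]  # positive weights that sum to one
weighted_mean = sum(w * x for w, x in zip(weights, points))

# For any variance, moving m away from the weighted mean lowers the likelihood.
for variance in (0.1, 1.0, 10.0):
    best = weighted_log_likelihood(weighted_mean, points, weights, variance)
    shifted = weighted_log_likelihood(weighted_mean + 0.5, points, weights, variance)
    print(variance, best > shifted)
```

The variance only rescales the quadratic term, so the maximizer stays at the weighted mean, which mirrors the diagonal-matrix argument in the text.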

Natural gradient descent in the space of sample distributions

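Akimoto et al. and Glasmachers et al. discovered independently that the update of the distribution parameters resembles the descent in direction of a sampled natural gradient of the expected objective function value $\operatorname{E} f(x)$ (to be minimized), where the expectation is taken under the sample distribution. With the parameter setting of $c_\sigma = 0$ and $c_1 = 0$, i.e. without step-size control and rank-one update, CMA-ES can thus be viewed as an instantiation of Natural Evolution Strategies (NES).
The natural gradient is independent of the parameterization of the distribution. Taken with respect to the parameters $\theta$ of the sample distribution $p$, the gradient of $\operatorname{E} f(x)$ can be expressed as
$$\nabla_\theta \operatorname{E}(f(x) \mid \theta) = \nabla_\theta \int_{\mathbb{R}^n} f(x)\, p(x)\, \mathrm{d}x = \int_{\mathbb{R}^n} f(x)\, p(x)\, \nabla_\theta \ln p(x)\, \mathrm{d}x = \operatorname{E}(f(x)\, \nabla_\theta \ln p(x \mid \theta))$$
where $p(x) = p(x \mid \theta)$ depends on the parameter vector $\theta$. The so-called score function, $\nabla_\theta \ln p(x \mid \theta) = \nabla_\theta p(x) / p(x)$, indicates the relative sensitivity of $p$ w.r.t. $\theta$, and the expectation is taken with respect to the distribution $p$. The natural gradient of $\operatorname{E} f(x)$, complying with the Fisher information metric, now reads
$$\tilde{\nabla} \operatorname{E}(f(x) \mid \theta) = F_\theta^{-1}\, \nabla_\theta \operatorname{E}(f(x) \mid \theta)$$
where the Fisher information matrix $F_\theta$ is the expectation of the Hessian of $-\ln p$ and renders the expression independent of the chosen parameterization. Combining the previous equalities we get
$$\tilde{\nabla} \operatorname{E}(f(x) \mid \theta) = F_\theta^{-1}\, \operatorname{E}(f(x)\, \nabla_\theta \ln p(x \mid \theta))$$
A Monte Carlo approximation of the latter expectation takes the average over $\lambda$ samples from $p$
$$\tilde{\nabla} \hat{\operatorname{E}}_\theta(f) := -\sum_{i=1}^{\lambda} w_i\, F_\theta^{-1}\, \nabla_\theta \ln p(x_{i:\lambda} \mid \theta) \quad \text{with} \quad w_i = -f(x_{i:\lambda})/\lambda$$
where the notation $i:\lambda$ from above is used and therefore $w_i$ are monotonically decreasing in $i$.
Ollivier et al. finally found a rigorous derivation for the more robust weights, $w_i$, as they are defined in the CMA-ES (weights are often zero for $i > \mu$). They are formulated as the consistent estimator for the CDF of $f(X)$, $X \sim p(\cdot \mid \theta)$, at the point $f(x_{i:\lambda})$, composed with a fixed monotonically decreasing transformation $w$, that is,
$$w_i = w\left(\frac{\operatorname{rank}(x_{i:\lambda}) - 1/2}{\lambda}\right)$$
This makes the algorithm insensitive to the specific $f$-values. More concisely, using the CDF estimator of $f$ instead of $f$ itself lets the algorithm depend only on the ranking of $f$-values but not on their underlying distribution. It renders the algorithm invariant to monotonic $f$-transformations. Let
$$\theta = [m_k^T \;\; \operatorname{vec}(C_k)^T \;\; \sigma_k]^T \in \mathbb{R}^{n + n^2 + 1}$$
such that $p(\cdot \mid \theta)$ is the density of the multivariate normal distribution $\mathcal{N}(m_k, \sigma_k^2 C_k)$. Then, we have an explicit expression for the inverse of the Fisher information matrix where $\sigma_k$ is fixed
$$F_{\theta \mid \sigma_k}^{-1} = \begin{bmatrix} \sigma_k^2 C_k & 0 \\ 0 & 2\, C_k \otimes C_k \end{bmatrix}$$
and for
$$\ln p(x \mid \theta) = \ln p(x \mid m_k, \sigma_k^2 C_k) = -\tfrac{1}{2} (x - m_k)^T \sigma_k^{-2} C_k^{-1} (x - m_k) - \tfrac{1}{2} \ln \det(2\pi \sigma_k^2 C_k)$$
and, after some calculations, the updates in the CMA-ES turn out as
$$m_{k+1} = m_k - [\tilde{\nabla} \hat{\operatorname{E}}_\theta(f)]_{1, \ldots, n} = m_k + \sum_{i=1}^{\lambda} w_i\, (x_{i:\lambda} - m_k)$$
and
$$C_{k+1} = C_k + c_1 (p_c p_c^T - C_k) - c_\mu \operatorname{mat}\left([\tilde{\nabla} \hat{\operatorname{E}}_\theta(f)]_{n+1, \ldots, n+n^2}\right) = C_k + c_1 (p_c p_c^T - C_k) + c_\mu \sum_{i=1}^{\lambda} w_i \left(\frac{x_{i:\lambda} - m_k}{\sigma_k} \left(\frac{x_{i:\lambda} - m_k}{\sigma_k}\right)^T - C_k\right)$$
where mat forms the proper matrix from the respective natural gradient sub-vector. That means, setting $c_1 = c_\sigma = 0$, the CMA-ES updates descend in direction of the approximation $\tilde{\nabla} \hat{\operatorname{E}}_\theta(f)$ of the natural gradient while using different step-sizes (learning rates 1 and $c_\mu$) for the orthogonal parameters $m$ and $C$ respectively. The most recent version of CMA-ES also uses a different function $w$ for $m$ and $C$, with negative values only for the latter (so-called active CMA).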

Stationarity or unbiasedness

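It is comparatively easy to see that the update equations of CMA-ES satisfy some stationarity conditions, in that they are essentially unbiased. Under neutral selection, where $x_{i:\lambda} \sim \mathcal{N}(m_k, \sigma_k^2 C_k)$, we find that
$$\operatorname{E}(m_{k+1} \mid m_k) = m_k$$
and under some mild additional assumptions on the initial conditions
$$\operatorname{E}(\log \sigma_{k+1} \mid \sigma_k) = \log \sigma_k$$
and with an additional minor correction in the covariance matrix update for the case where the indicator function evaluates to zero, we find
$$\operatorname{E}(C_{k+1} \mid C_k) = C_k$$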

Invariance

Invariance properties imply uniform performance on a class of objective functions. They have been argued to be an advantage, because they make it possible to generalize and predict the behavior of the algorithm and therefore strengthen the meaning of empirical results obtained on single functions. The following invariance properties have been established for CMA-ES.

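Invariance under order-preserving transformations of the objective function value $f$, in that for any $h: \mathbb{R}^n \to \mathbb{R}$ the behavior is identical on $f: x \mapsto g(h(x))$ for all strictly increasing $g: \mathbb{R} \to \mathbb{R}$. This invariance is easy to verify, because only the $f$-ranking is used in the algorithm, which is invariant under the choice of $g$.
Scale-invariance, in that for any $h: \mathbb{R}^n \to \mathbb{R}$ the behavior is independent of $\alpha > 0$ for the objective function $f: x \mapsto h(\alpha x)$ given $\sigma_0 \propto 1/\alpha$ and $m_0 \propto 1/\alpha$.
Invariance under rotation of the search space in that for any $h: \mathbb{R}^n \to \mathbb{R}$ and any $z \in \mathbb{R}^n$ the behavior on $f: x \mapsto h(Rx)$ is independent of the orthogonal matrix $R$, given $m_0 = R^{-1} z$. More generally, the algorithm is also invariant under general linear transformations $R$ when additionally the initial covariance matrix is chosen as $R^{-1} (R^{-1})^T$.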
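The first invariance property can be illustrated with a short Python sketch: the ranking of candidate solutions, which is the only information the algorithm uses, does not change when the objective values pass through a strictly increasing function. The candidate points, the objective `sphere` and the transformation are arbitrary choices made for this illustration.

```python
import math

def ranking(values):
    """Indices of the values in ascending order (best first for minimization)."""
    return sorted(range(len(values)), key=lambda i: values[i])

def sphere(x):
    """Simple test objective h(x) = sum of squares."""
    return sum(xi * xi for xi in x)

candidates = [[0.5, -1.0], [2.0, 0.1], [-0.3, 0.2], [1.5, 1.5]]
f_values = [sphere(x) for x in candidates]
g_values = [math.exp(v) + 3.0 for v in f_values]  # strictly increasing g

same_order = ranking(f_values) == ranking(g_values)
```

Because `same_order` holds for every strictly increasing transformation, all rank-based updates are identical on `sphere` and on its transformed version.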

Any serious parameter optimization method should be translation invariant, but most methods do not exhibit all the invariance properties described above. A prominent example with the same invariance properties is the Nelder–Mead method, where the initial simplex must be chosen accordingly.

Convergence

Conceptual considerations like the scale-invariance property of the algorithm, the analysis of simpler evolution strategies, and overwhelming empirical evidence suggest that the algorithm converges on a large class of functions fast to the global optimum, denoted as $x^*$. On some functions, convergence occurs independently of the initial conditions with probability one. On some functions the probability is smaller than one and typically depends on the initial $m_0$ and $\sigma_0$. Empirically, the fastest possible convergence rate in $k$ for rank-based direct search methods can often be observed (depending on the context denoted as linear, log-linear or exponential convergence). Informally, we can write
$$\|m_k - x^*\| \approx \|m_0 - x^*\| \times e^{-ck}$$
for some $c > 0$, and more rigorously
$$\frac{1}{k} \sum_{i=1}^{k} \log \frac{\|m_i - x^*\|}{\|m_{i-1} - x^*\|} = \frac{1}{k} \log \frac{\|m_k - x^*\|}{\|m_0 - x^*\|} \to -c < 0 \quad \text{for } k \to \infty,$$
or similarly,
$$\operatorname{E} \log \frac{\|m_k - x^*\|}{\|m_{k-1} - x^*\|} \to -c < 0 \quad \text{for } k \to \infty.$$
This means that on average the distance to the optimum decreases in each iteration by a "constant" factor, namely by $\exp(-c)$. The convergence rate $c$ is roughly $0.1 \lambda / n$, given $\lambda$ is not much larger than the dimension $n$. Even with optimal $\sigma$ and $C$, the convergence rate $c$ cannot largely exceed $0.25 \lambda / n$, given the above recombination weights $w_i$ are all non-negative. The actual linear dependencies in $\lambda$ and $n$ are remarkable and they are in both cases the best one can hope for in this kind of algorithm. Yet, a rigorous proof of convergence is missing.
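The convergence rate $c$ can be estimated from a recorded run as the average log-ratio of consecutive distances to the optimum, which telescopes to a single logarithm. The Python sketch below assumes that the distances $\|m_k - x^*\|$ are available (which is only the case on test problems with known optimum); the function name `empirical_rate` and the synthetic data are illustrative.

```python
import math

def empirical_rate(distances):
    """Estimate c from distances[k] = ||m_k - x*|| for k = 0, 1, ..., K.

    Uses -(1/K) * log(distances[K] / distances[0]), the telescoped form of
    the average log-ratio of consecutive distances.
    """
    k = len(distances) - 1
    return -math.log(distances[-1] / distances[0]) / k

# Synthetic run with exact linear convergence at rate c = 0.05.
distances = [2.0 * math.exp(-0.05 * k) for k in range(101)]
rate = empirical_rate(distances)
per_iteration_factor = math.exp(-rate)  # average shrink factor per iteration
```

On real runs the log-ratios fluctuate from iteration to iteration, so the estimate is only meaningful over a sufficiently long window.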

Interpretation as coordinate-system transformation

Using a non-identity covariance matrix for the multivariate normal distribution in evolution strategies is equivalent to a coordinate system transformation of the solution vectors, mainly because the sampling equation
$$x_i \sim m_k + \sigma_k \times \mathcal{N}(0, C_k) \sim m_k + \sigma_k \times C_k^{1/2}\, \mathcal{N}(0, I)$$
can be equivalently expressed in an "encoded space" as
$$C_k^{-1/2} x_i \sim C_k^{-1/2} m_k + \sigma_k \times \mathcal{N}(0, I)$$
The covariance matrix defines a bijective transformation (encoding) for all solution vectors into a space, where the sampling takes place with identity covariance matrix. Because the update equations in the CMA-ES are invariant under linear coordinate system transformations, the CMA-ES can be re-written as an adaptive encoding procedure applied to a simple evolution strategy with identity covariance matrix.
This adaptive encoding procedure is not confined to algorithms that sample from a multivariate normal distribution, but can in principle be applied to any iterative search method.

Performance in practice

In contrast to most other evolutionary algorithms, the CMA-ES is, from the user's perspective, quasi-parameter-free. The user has to choose an initial solution point, $x_0 \in \mathbb{R}^n$, and the initial step-size, $\sigma_0 > 0$. Optionally, the number of candidate samples $\lambda$ (population size) can be modified by the user in order to change the characteristic search behavior, and termination conditions can or should be adjusted to the problem at hand.
The CMA-ES has been empirically successful in hundreds of applications and is considered to be useful in particular on non-convex, non-separable, ill-conditioned, multi-modal or noisy objective functions. One survey of black-box optimization found that it outranked 31 other optimization algorithms, performing especially strongly on "difficult functions" or larger-dimensional search spaces.
The search space dimension ranges typically between two and a few hundred. Assuming a black-box optimization scenario, where gradients are not available and function evaluations are the only considered cost of search, the CMA-ES method is likely to be outperformed by other methods in the following conditions:

on low-dimensional functions, say $n < 5$, for example by the downhill simplex method or surrogate-based methods;
on separable functions without or with only negligible dependencies between the design variables, in particular in the case of multi-modality or large dimension, for example by differential evolution;
on convex-quadratic functions with low or moderate condition number of the Hessian matrix, where BFGS or NEWUOA are typically ten times faster;
on functions that can already be solved with a comparatively small number of function evaluations, say no more than $10n$, where CMA-ES is often slower than, for example, NEWUOA or Multilevel Coordinate Search.

On separable functions, the performance disadvantage is likely to be most significant, in that CMA-ES might not be able to find comparable solutions at all. On the other hand, on non-separable functions that are ill-conditioned or rugged or can only be solved with more than $100n$ function evaluations, the CMA-ES most often shows superior performance.

Variations and extensions

The (1+1)-CMA-ES generates only one candidate solution per iteration step, which becomes the new distribution mean if it is better than the current mean. For $c_c = 1$ the (1+1)-CMA-ES is a close variant of Gaussian adaptation. Some Natural Evolution Strategies are close variants of the CMA-ES with specific parameter settings. Natural Evolution Strategies do not utilize evolution paths (that means in the CMA-ES setting $c_c = 1$) and they formalize the update of variances and covariances on a Cholesky factor instead of a covariance matrix. The CMA-ES has also been extended to multiobjective optimization as MO-CMA-ES. Another remarkable extension has been the addition of a negative update of the covariance matrix with the so-called active CMA.
The additional active CMA update is nowadays considered the default variant.
