# Background¶

## Color space¶

*Gecos* utilizes *CIE L\*a\*b\**, a color space that is designed to be
approximately perceptually uniform:
Changes of *L\*a\*b\** component values scale approximately linearly with the
visually perceived change of the corresponding color.
The *L\*a\*b\** color space contains the following components:

- *L\**: The lightness of the color. `0` is completely black and `100` is completely white.

- *a\**: The green-red component. Green is in the negative direction, red is in the positive direction.

- *b\**: The blue-yellow component. Blue is in the negative direction, yellow is in the positive direction.

The values for *a\** and *b\** are theoretically not limited in either
direction, but only a subspace (*gamut*) can be displayed on devices and can
be converted into *RGB*.

Due to the perceptual uniformity, the perceptual difference of two *L\*a\*b\**
colors is approximately the Euclidean distance of the *L\*a\*b\** components,
according to the *CIE76* formula.
Newer standards apply several corrections to the Euclidean distance to deal
with perceptual non-uniformities.
By default, *Gecos* uses the *CIEDE2000* formula, the latest iteration
of the perceptual difference formula.

*Gecos* uses the package *scikit-image* for color space conversions and
perceptual difference calculations.
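
To illustrate these operations (this snippet is an example, not *Gecos*' own code), the relevant *scikit-image* functions are `rgb2lab` for the conversion and the `deltaE_*` functions for the perceptual difference:

```python
import numpy as np
from skimage.color import rgb2lab, deltaE_cie76, deltaE_ciede2000

# Pure red and pure blue as RGB values in the range [0, 1]
red = np.array([[[1.0, 0.0, 0.0]]])
blue = np.array([[[0.0, 0.0, 1.0]]])

# Convert to L*a*b*: each pixel becomes an (L*, a*, b*) triple
red_lab = rgb2lab(red)
blue_lab = rgb2lab(blue)

# CIE76: simply the Euclidean distance in L*a*b* space
diff_cie76 = deltaE_cie76(red_lab, blue_lab)[0, 0]

# CIEDE2000: the corrected formula that Gecos uses by default
diff_ciede2000 = deltaE_ciede2000(red_lab, blue_lab)[0, 0]
```

Note that `deltaE_cie76` is exactly the Euclidean distance of the component triples, while `deltaE_ciede2000` applies the corrections mentioned above, so the two formulas generally give different absolute values for the same color pair.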

## Score function¶

The score function \(S_T\) is a compound of two terms:
a sum of harmonic potentials between each pair of symbols \(S_H\)
and a linear *contrast score* \(S_C\):
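
With both terms defined in the following sections, the total score is simply their sum:

\[S_T = S_H + S_C\]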

Note that a low score is desirable in this context.

> **Note:** In the following, \(\left< X \right>\) denotes the arithmetic mean of
> all matrix elements (\(\left< X \right> = \frac{1}{n} \sum_{ij} X_{ij}\)).
> For a triangular matrix, only the non-zero diagonals are taken into account.
> Hence, the number of possible \(ij\) pairings is \(n = \frac{n_s (n_s - 1)}{2}\),
> where \(n_s\) is the number of symbols in the alphabet.

### Construction of distance matrix¶

For both score terms it is required that the input substitution matrix \(M\) is converted into a triangular distance matrix \(D'\):
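
A conversion with the required properties (a plausible form; the exact formula used by *Gecos* may differ) is

\[D_{ij} = \frac{M_{ii} + M_{jj}}{2} - M_{ij}\]

so that a high substitution score for a pair of symbols translates into a small distance.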

For any substitution matrix \(M\), \(M_{ii}\) should be the maximum value in the row/column \(i\), otherwise a symbol would be more similar to another symbol than to itself. Thus, \(D'_{ii} = 0\).

\(D'\) is scaled, so that the average distance is \(1\).
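
With \(\left< D \right>\) denoting the arithmetic mean of the unscaled distances, this scaling reads

\[D'_{ij} = \frac{D_{ij}}{\left< D \right>}\]

so that \(\left< D' \right> = 1\).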

### Perceptual difference matrix¶

In contrast, the triangular matrix \(C\) denotes the pairwise
perceptual differences of the *L\*a\*b\** colors for each pair of symbols.

According to the *CIE76* formula, the perceptual difference is the Euclidean
distance.
For the sake of simplicity, the exact form of the more modern
*CIEDE2000* formula is omitted here.
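
In the *CIE76* case, the matrix elements are therefore

\[C_{ij} = \sqrt{(L^*_i - L^*_j)^2 + (a^*_i - a^*_j)^2 + (b^*_i - b^*_j)^2}\]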

While \(D\) is constant, \(C\) changes with every step of the optimization process.

In order to translate the *L\*a\*b\** color distances into values in the
distance matrix \(D\), a scale factor \(f_s\) is introduced.
\(f_s\) is the proportion of the average distance in \(D\) to the
average difference in \(C\):
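
Using the mean notation from the note above, this reads

\[f_s = \frac{\left< D \right>}{\left< C \right>}\]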

As \(C\) is variable, \(f_s\) also dynamically changes.

### Harmonic potentials¶

A harmonic potential is applied to each pair of symbols, with the equilibrium position being the respective value from the distance matrix:
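
A standard harmonic (quadratic) form consistent with this description, with the perceptual differences rescaled by \(f_s\) (the exact expression in *Gecos* may differ), is

\[S_H = \left< \left( f_s C - D \right)^2 \right>\]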

This adjusts the relative distance between all symbols.

### Contrast score¶

The *contrast score* rewards symbol conformations with a high contrast,
i.e. a high average perceptual difference between the symbols.
While the harmonic potentials adjust the relative distances between the
symbols, the *contrast score* tries to maximize the absolute distances.
A reciprocal function based on the average color differences is used here.
The *contrast factor* \(f_c\) is a user-supplied parameter for weighting
this term:
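
One reciprocal form matching this description (illustrative; the exact expression may differ) is

\[S_C = \frac{f_c}{\left< C \right>}\]

so that a larger average perceptual difference yields a lower (better) score.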

Without this term, there would be no score difference between conformations that use a small or a large part of the color space, as long as the relative distances are equal. This term drives the symbols to the edges of the color space, thereby increasing the contrast.

## Optimization¶

The purpose of *Gecos* is to find colors that match distances derived from a
substitution matrix as well as possible.
This means that the software tries to optimize the *L\*a\*b\** values for all
symbols, so that the score function described above is minimized.
The *L\*a\*b\** values can be described as a vector \(\vec{x}\) with
\(n_s \times 3\) dimensions, where \(n_s\) is the number of symbols
in the alphabet (e.g. 20 for amino acids).

The optimization is performed via *simulated annealing* (SA).
The SA algorithm is essentially an improved Monte-Carlo
optimization algorithm: the solution space of the optimization problem is
sampled at a given temperature \(T\), or
rather an inverse temperature \(\beta\) where \(\beta \sim 1/T\).

The improvement of SA over a simple Monte-Carlo optimization algorithm is that the optimization starts at a high temperature (or a low inverse temperature, accordingly), which is continuously cooled down over the course of the algorithm's runtime. The idea stems from the physical process of annealing, e.g. of steel, where slowly cooled steel exhibits superior material characteristics.

The cooling down is steered by an annealing schedule, which in our case is an exponential schedule:
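
A common parametrization of such a schedule, using the initial inverse temperature \(\beta_0\) and the rate \(\tau\) introduced below, is

\[\beta(n) = \beta_0 \, e^{\tau n}\]

where \(n\) is the current iteration.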

Furthermore, as SA is usually employed for combinatorial optimization problems, i.e. problems defined on a discrete space, we also use an exponential schedule for the step size:
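
Analogously, with a rate \(r\) (notation assumed here), the step size schedule can be written as

\[\delta(n) = \delta_{start} \, e^{-r n}\]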

The step size is used to perturb the current solution in each step of the SA algorithm in order to find a new candidate solution. The idea of using a schedule here is to start with a relatively large step size \(\delta_{start}\) and to choose the rate according to a target step size \(\delta_{end}\). A suitable rate is easily derived by requiring \(\delta(N_{max}) = \delta_{end}\), which leads to
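
Assuming the exponential form \(\delta(n) = \delta_{start} e^{-r n}\) with rate \(r\), the condition \(\delta(N_{max}) = \delta_{end}\) yields

\[r = \frac{1}{N_{max}} \ln \frac{\delta_{start}}{\delta_{end}}\]

or equivalently \(\delta(n) = \delta_{start} \left( \delta_{end} / \delta_{start} \right)^{n / N_{max}}\).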

### Monte-Carlo algorithm¶

Starting from a random initial conformation \(\vec{x}_0\) with a score of \(S_0 = S_T(\vec{x}_0)\), the following steps are performed:

1. Perform a random modification of \(\vec{x}_n\):

   \(\vec{x}_{n+1} = \vec{x}_n + \Delta(\vec{x}_n)\)

   where \(\Delta(\vec{x}_n)\) is a random perturbation calculated using the step size \(\delta(n)\).

2. Calculate the score of the new conformation:

   \(S_{n+1} = S_T(\vec{x}_{n+1})\)

3. Decide whether to accept the new conformation based on the difference to the score of the conformation prior to modification, \(\Delta S = S_{n+1} - S_{n}\):

   - If \(\Delta S \leq 0\), accept the new conformation.

   - If \(\Delta S > 0\), accept the new conformation with a probability of \(p = \exp \left( -\beta(n) \cdot \Delta S \right)\), where \(\beta(n)\) is the inverse temperature according to the exponential annealing schedule. The initial inverse temperature \(\beta_0\) as well as the rate \(\tau\), specifying how fast the inverse temperature increases, can be specified by the user.

   - If the new conformation is not accepted, it is discarded and replaced with the conformation prior to modification: \(\vec{x}_{n+1} = \vec{x}_n\)

These steps are repeated until a stop criterion is met, which in this case is simply a fixed number of iterations.
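
The loop above can be sketched in Python as follows; the one-dimensional toy score function and all parameter values are illustrative stand-ins, not *Gecos*' actual implementation:

```python
import math
import random

def simulated_annealing(score, x0, n_max=10_000,
                        beta0=1.0, tau=1e-3,
                        delta_start=1.0, delta_end=0.01):
    """Minimize `score` via simulated annealing with exponential schedules."""
    # Rate chosen so that the step size reaches delta_end in the last step
    r = math.log(delta_start / delta_end) / n_max
    x, s = x0, score(x0)
    for n in range(n_max):
        beta = beta0 * math.exp(tau * n)           # inverse temperature schedule
        delta = delta_start * math.exp(-r * n)     # step size schedule
        x_new = x + random.uniform(-delta, delta)  # random perturbation
        s_new = score(x_new)
        delta_s = s_new - s
        # Accept improvements always, deteriorations with a probability
        # of exp(-beta * delta_s)
        if delta_s <= 0 or random.random() < math.exp(-beta * delta_s):
            x, s = x_new, s_new
    return x, s

random.seed(0)
# Toy score function: a parabola with its minimum at x = 3
x_min, s_min = simulated_annealing(lambda x: (x - 3.0) ** 2, x0=0.0)
```

In the real optimization problem, \(\vec{x}\) is the \(n_s \times 3\)-dimensional vector of *L\*a\*b\** values and the perturbation must additionally keep the colors inside the displayable gamut.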