An Empirical Approach For Probing the Definiteness of Kernels**preprint submitted to Soft Computing journal, Springer (2024)

∎

¹¹institutetext: Martin Zaefferer (corresponding author) ²²institutetext: Faculty of Computer Science and Engineering Science,
TH Köln - University of Applied Sciences,
Steinmüllerallee 1, 51643 Gummersbach, Germany
Telephone: +49 2261 8196 6327
²²email: Martin.Zaefferer@th-koeln.de³³institutetext: Thomas Bartz-Beielstein ⁴⁴institutetext: Faculty of Computer Science and Engineering Science,
TH Köln - University of Applied Sciences,
Steinmüllerallee 1, 51643 Gummersbach, Germany
⁵⁵institutetext: Günter Rudolph ⁶⁶institutetext: Department of Computer Science, TU Dortmund University
Otto-Hahn-Str. 14, 44227 Dortmund, Germany

Martin Zaefferer Thomas Bartz-Beielstein Günter Rudolph

(Received: date / Accepted: date)

Abstract

Models like support vector machines orGaussian process regression often require positivesemi-definite kernels. These kernels may be basedon distance functions.While definiteness isproven for common distances and kernels, a proof for a new kernel mayrequire too much time and effort for users who simply aim at practical usage.Furthermore, designing definite distances or kernels may be equallyintricate.Finally, models can be enabled to use indefinite kernels.This may deteriorate the accuracy or computational cost of the model.Hence, an efficient method to determine definiteness is required.We propose an empirical approach. We show that sampling as well asoptimization with an evolutionary algorithm may be employedto determine definiteness.We provide a proof-of-concept with 16 different distance measures for permutations.Our approach allows to disprove definiteness if a respective counter-example is found.It can also provide an estimate ofhow likely it is to obtain indefinitekernel matrices.This providesa simple, efficient tool to decidewhether additional effort should be spent ondesigning/selecting a more suitable kernel or algorithm.

Keywords:

Definiteness Kernel Distance Sampling Optimization Evolutionary Algorithm

^†^†journal: Soft Computing

1 Introduction

The definiteness of kernels and distances is an important issuein statistics and machine learning(Feller, 1971; Vapnik, 1998; Schölkopf, 2001).One application that recently gained interest is the field of surrogate model-basedcombinatorial optimization(Moraglio and Kattan, 2011; Zaefferer etal., 2014b ; Bartz-Beielstein and Zaefferer, 2017).Continuous distance measures are replaced by distance measures that are adequatefor the respective search space (e.g., permutation distances or string distances).Such a measure will have an effect on the definiteness of the employed kernel function.For arbitrary problems, practitioners may come up with any kind of suitable distance measure or kernel.

While it is easyto determine definiteness of matrices, determining the definiteness of a function is not as simple.Proving definiteness by theoretical means may be infeasible in practice(Murphy, 2012, p.482; Ong etal., 2004).It may be equally difficult to design a function to be definite.Finally, algorithms may be adapted to handle indefinite kernels.These adaptations usually have a detrimental impact on the computational effortor accuracy of the derived model.Hence, this study tries to answerthe following two research questions:

$\text{Q}_{1}$: Discovery: Is there an efficient, empirical approach to determine thedefiniteness of kernel functions based on arbitrary distance measures?
${\text{Q}_{2}}$: Measurability: If $\text{Q}_{1}$ can be answered affirmatively,can we quantify to what extent a lack of definiteness is problematic?

$\text{Q}_{1}$ tries to find an answer to the general question of whether or not a function is definite.Measurability ( ${\text{Q}_{2}}$ ) is important, asthis may allow determining the impact that indefiniteness has in practice.For instance, if a kernel rarely produces indefinite matrices, an optimization or learning process that exploresonly a small subset of the search space may not be negatively affected.Therefore, this article proposes two approaches.

${\text{A}_{1}}$: Sampling: Random sampling is used to determine the proportion of solution sets withindefinite distance or kernel matrices for a given setting.
${\text{A}_{2}}$: Optimization: Maximizing the largest eigenvalue related to a certain solution set with an Evolutionary Algorithm (EA), hence findingindefinite cases even when they are rare.

If either approach detects an indefinite matrix, the respective kernel function is demonstrated to beindefinite. On the other hand, if no indefinite matrix is detected,definiteness of the function is not proven. Still, this indeterminate resultmay indicate that indefiniteness is at least unlikely to occur. Hence, the function may be treated as unproblematicin practical use.

Section2 provides the background of the methods presented inSection3.The experimental setup for a proof-of-concept, based on permutation distance measures, is described in Section4.Results of the experiments are analyzed and discussed in Section5.Finally, a summary of this work as well as an outlook on future research directions is given in Section6.

2 Background: Distances, Kernels and Definiteness

2.1 Distance Measures

Distance measures compute the dissimilarity of two objects $x,x^{\prime}\in\mathcal{X}$ ,where we do not assume anything about the nonempty set $\mathcal{X}$ .Such objects can be, e.g., permutations, trees, strings, or vectors of real values.Thus, a distance measure ${d}:~{}\mathcal{X}\times\mathcal{X}\to\mathbb{R}_{+}$ expresses a scalar, numerical value ${d}(x,x^{\prime})$ that should becomelarger the more distinct the objects $x$ and $x^{\prime}$ are.For a set of $n\in\mathbb{N}$ objects, the distance matrix $D$ of dimension $n\times n$ collects all pairwise distances $D_{ij}={d}(x_{i},x_{j})$ with $i=1,\ldots,n$ and $j=1,\ldots,n$ .Intuitively, distances can be expected to satisfy certain conditions, e.g., theyshould be zero when comparing identical objects.

A more formal definition is implied by the term distance metric.A distance metric ${d}(x,x^{\prime})$ is symmetric ${d}(x,x^{\prime})={d}(x^{\prime},x)$ , non-negative ${d}(x,x^{\prime})\geq 0,$ preserves identity ${d}(x,x^{\prime})=0\iff x=x^{\prime},$ and satisfies the triangle inequality ${d}(x,x^{\prime\prime})\leq{d}(x,x^{\prime})+{d}(x^{\prime},x^{\prime\prime}).$ Distance measures that do not preserve identity are often called pseudo-metrics.

An important class of distance measures are edit distance measures.Edit distance measures can be defined to count the minimal number of edit operationsrequired to transform one object into another. An edit distancemeasure may concern one specific edit operation (e.g., only swaps)or a set of different operations (e.g., Levenshtein distance with substitutions, deletions, insertions).Edit distances usuallysatisfy the metric axioms.

2.2 Kernels

In the following, a kernel (also: kernel function, similarity measure or correlation function) is defined asa real valued function $k(x,x^{\prime})$ with

\begin{split}k:&~{}\mathcal{X}\times\mathcal{X}\to\mathbb{R}\\&~{}(x,x^{\prime})\mapsto k(x,x^{\prime})\end{split}

(1)

that will usually be symmetric $k(x,x^{\prime})=k(x^{\prime},x)$ andnon-negative $k(x,x^{\prime})\geq 0$ (Murphy, 2012, p. 479).Kernels can be based on distance measures, i.e., $k({d}(x,x^{\prime}))$ .

2.3 Definiteness of Matrices

One important property of kernels and distance measures is their definiteness.We refer to the literature for more in-detail descriptions and proofs, which arethe basis for the following paragraphs(Berg etal., 1984; Schölkopf, 2001; Camastra and Vinciarelli, 2008).

First, we introduce the concept of matrix definiteness.A symmetric, square matrix $A$ of dimension $n\times n$ ( $n\in\mathbb{N}$ ) is positive definite (PD) if and only if

\sum_{i=1}^{n}\sum_{j=1}^{n}c_{i}c_{j}A_{ij}>0,

for all $c\in\mathbb{R}^{n}\setminus\{0\}$ . This is equivalent toall eigenvalues $\lambda_{1}\leq\lambda_{2}\leq\ldots\leq\lambda_{n}$ of the matrix $A$ being positive.Due to symmetry, the eigenvalues are $\lambda\in\mathbb{R}^{n}$ .Respectively, a matrix is negativedefinite (ND) if all eigenvalues are negative.If eigenvalues are non-negative (i.e., some are zero) or non-positive, the matrix is respectivelycalled Positive or Negative Semi-Definite (PSD, NSD).If mixed signs are present in the eigenvalues, the matrix may be called indefinite.Kernel (or correlation, covariance) matrices are examples of matrices thathave to be PSD.

A broader class of matrices are Conditionally PSD or NSD (CPSD, CNSD).Here, the coefficients satisfy

\sum_{i=1}^{n}c_{i}=0,

(2)

with $n>1$ .All PSD (NSD) matrices are CPSD (CNSD).To check conditional definiteness, let the $n\times n$ matrix $P$ be

P=\begin{pmatrix}I_{(n-1)}-\mathbf{e}\mathbf{e}^{T}/n~{}~{}~{}~{}&\mathbf{e}/n\\(0,\ldots,0)~{}~{}~{}~{}&1\end{pmatrix}

with $\mathbf{e}=(1,\ldots,1)^{T}$ , and $B=PAP^{T}$ . Then, $A$ is CNSD if and only if

\hat{A}=B_{n-1}

(3)

is NSD(Ikramov and Savel’eva, 2000, Algorithm 1).Here, $B_{n-1}$ is the leading principal submatrix of $B$ , thatis, the last column and row of $B$ are removed.

2.4 Definiteness of Kernel Functions

In a similar way to the definiteness of matrices, definiteness can also be determined for kernel functions.The upcoming description roughly follows the definitions and notations byBerg etal., (1984)andSchölkopf, (2001).For the nonempty set $\mathcal{X}$ , a symmetric kernel $k$ is called PSD if and only if

\sum_{i=1}^{n}\sum_{j=1}^{n}c_{i}c_{j}k(x_{i},x_{j})\geq 0,

for all $n\in\mathbb{N}$ , $x\in\mathcal{X}$ and $\mathbf{c}\in\mathbb{R}^{n}$ .A PSD kernel will always yield PSD kernel matrices.

An important special case are conditionally definite functions.Analogous to the matrix case, they observe the respective condition in equation(2).One example of a CNSD function is the Euclidean distance.The importance of CNSD functions is also due to the fact that the distance measure ${d}(x,x^{\prime})$ is CNSDif and only if the kernel $k(x,x^{\prime})=\exp(-\theta{d}(x,x^{\prime}))$ is PSD $\forall~{}\theta>0$ (Schölkopf, 2001, Proposition 2.28).

It is often stated that kernels must be PSD (Rasmussen and Williams, 2006; Curriero, 2006).Still, even an indefinite kernel function may yield a PSD kernel matrix.This depends on the specific data set used to train the model(Burges, 1998);(Li and Jiang, 2004, Theorem 2)as well as the parameters of the kernel function.Some frequently used kernels are known to be indefinite.Examples are the sigmoid kernel(Smola etal., 2000; Camps-Valls etal., 2004) or time-warp kernels for time series(Marteau and Gibet, 2014).

To handle the issue of a kernel’s definiteness, different (not mutually exclusive) approaches can be found in the literature.

•
Proving:Definiteness of a specific function can be proven (or disproven) by theoretical considerationsin some cases (cf.(Berg etal., 1984)). For complex cases, or practitioners this may be aninfeasible approach(Ong etal., 2004).
•
Designing:Functions can be designedto be definite(Haussler, 1999; Gärtner etal., 2003; Marteau and Gibet, 2014).Especially noteworthy are the so called convolution kernels(Haussler, 1999), asthey provide a method to construct PSD kernels for structured data.For a similar purpose, Gärtner etal., (2004) show how to design asyntax based PSD kernel for structured data.However, convolution kernels may be hard to design(Gärtner etal., 2004).Also, kernels and distance measures may be predetermined for a certain application.
•
Adapting:Algorithms or kernel functions may be adaptedto be usable despite a lack of definiteness.This, however, may affect the computational effort or accuracy of the derived model.Some approaches alter matrices (or rather, their eigenspectrum),hence enforcing PSDness(Wu etal., 2005; Chen etal., 2009; Zaefferer and Bartz-Beielstein, 2016).For the case of SVMs, Loosli etal., (2015) provide a nice comparison ofvarious approaches of this type and propose an interesting new solution to the issueof indefinite kernels based on learning in Krein spaces.A recent survey is given bySchleif and Tino, (2015).

Between these three approaches, there is a lack of straightforward empirical procedures,without resorting to complex theoretical reasoning.How can the definiteness of a function be determined? And what impact doesa lack of definiteness have on a model?

Hence, we propose the two related empirical approaches ${\text{A}_{1}}$ and ${\text{A}_{2}}$ introduced inSec.1 to fill the gap.An empirical approach may help to overcome the difficulty of theoretical considerationsor designed kernels.Empirical results may also be a starting point for a more formal approach.Furthermore, it may give a quick answer on whether or not the algorithm will have to be adaptedfor non-PSD matrices(which, if applied by default, would require additional computational effort and may limit the accuracy of derived models).

3 Methods for Estimating Definiteness

We propose an experimental approachto determine and analyze definiteness ( ${\text{A}_{1}}$ , ${\text{A}_{2}}$ ).As a test case, we determine the definiteness of a distance-based exponential kernel, $k(x,x^{\prime})=exp(-\theta d(x,x^{\prime}))$ .The kernel is definite if the underlying distance function is CNSD.For a given distance matrix, CNSDness is determined by the largest eigenvalue $\lambda_{n}$ of $\hat{D}$ ,based on equation(3). $D$ is not CNSD if $\lambda_{n}>0$ .We could also probe the kernel matrix $K$ , but in this case the kernel parameter $\theta$ would have to be dealt with.

Of course, we cannot simply check definiteness of a single matrix $D$ ,since this would be only one possible outcome of the respective kernel or distance function.Hence, a large number of solution sets with respective distance matrices has to be generatedto determine whether anyof the matrices are CNSD (research question $\text{Q}_{1}$ : discovery)and to what extent this may affect a model( ${\text{Q}_{2}}$ : measurability).For smaller, finite spaces a brute force approach maybe viable. All potential matrices $D$ can be enumerated and checked for CNSDness.Since this quickly becomes computational infeasible, we proposeto use sampling or optimization instead.

3.1 Estimating Definiteness with Random Sampling

To estimate the definiteness of a distance or kernel function we propose a simple random sampling approach ( ${\text{A}_{1}}$ ).This approach randomly generates $t\in\mathbb{N}$ sets $X_{1},\ldots,X_{t}$ .Each set has size $n$ , that is, it contains $n\in\mathbb{N}$ candidate samples $X=\{x_{1},\ldots,x_{n}\}$ .

For each set, the distance matrix $D$ is computed, containing distances between all candidates in the set.Based on this, $\hat{D}$ is derived from equation(3).Then, the largest eigenvalue $\lambda_{n}$ of $\hat{D}$ is computed.This eigenvalue determines whether $\hat{D}$ is NSD, and hence whether $D$ is CNSD and $K$ PSD.This is repeated for all $t$ sets.The number of times that the largest eigenvalue is positive ( $\lambda_{n}>0$ ) is retained as $n_{\lambda+}$ .Accordingly, the proportion of non-CNSD matrices is determined with $p=\frac{n_{\lambda+}}{t}.$

Obviously, all distance measures that yield $p>0$ are proven to be non-CNSD.Hence, an exponential kernel based on these measures is also proven to be indefinite.If $p=0$ , CNSDness is not proven or disproven.

In general, the proposed method can be categorized as a randomized algorithm of the complexity class RP (Motwani and Raghavan, 1995, p.21f.).That is, it stops after polynomially many steps, and if the output is “no” then the distance measure is non-CNSDwith probability 1, and if the output is “yes” then the distance measure is CNSD with some probabilitystrictly bounded from zero.

The parameter $p$ is an estimator of how likely a non-CNSD matrix is to occur, for the specified set size $n$ .To determine definiteness, the calculation of $\lambda_{n}$ of $\hat{D}$ is not mandatory, but it may be usefulto see how close to zero $\lambda_{n}$ is, to distinguish between a pathological case andcases where the matrix is just barely non-CNSD.We will show in Section5.3 that $\lambda_{n}$ of $\hat{D}$ can be linked to model quality.

Note, that inaccuracies of the numerical algorithmused to compute the eigenvalues might lead to an erroneous sign of the largest eigenvalue.To deal with that, one could try to use exact or symbolic methods or else usea tolerance when checking whether the largest eigenvalue is larger than zero. In thelatter case, a matrix $\hat{D}$ is assumed to be non-NSD if $\lambda_{n}>\epsilon$ , where $\epsilon$ is a small positive number.

3.2 Estimating Definiteness with Directed Search

If very few sets $X$ yield indefinite matrices, ${\text{A}_{1}}$ may failto find indefinite matrices by pure chance.In such cases, it may be more efficient to replace random sampling with a directed search ( ${\text{A}_{2}}$ ).In detail, a set $X$ can itself be viewed as a candidate solution of an optimization problem.The largest eigenvalue $\lambda_{n}$ of the transformed distance matrix $\hat{D}$ is the objective to be maximized.By maximizing the largest eigenvalue, a positive $\lambda_{n}$ may be foundmore quickly (and more reliably).

This optimization problem is strongly dependent onthe kind of solution representation used. Evolutionary Algorithms (EAs)are a good choice to solve this problem, because they are applicableto a wide range of solution representations (e.g., real values, strings, permutations, graphs, and mixed search spaces).EAs use principles derived from natural evolution for the purpose of optimization.That is, EAs are optimization algorithms based on a cyclic repetition ofparent selection, recombination, mutation, and fitness selection (see e.g.,Eiben and Smith, (2003)).These operations can be adapted to a large variety of representations, mainly dependingon suitable mutation and recombination operators. The EA in this study will operate as follows:

•
Individual: $X$ . A set $X=\{x_{1},\ldots,x_{n}\}$ with set size $n$ is considered as an individual.Set elements $x$ are samples in the actual search or input space, as used throughout Sec.2.
•
Search space: $S^{n}$ . All possible sets $X$ of size $n$ , i.e., $X\in S^{n}$ .
•
Population: $Z$ . A population of size $r$ , containing $X_{k}\in Z$ with $k\in\{1,\ldots,r\}$ and $Z\subseteq S^{n}$ .
•
Objective Function:
$\displaystyle\begin{split}f:~{}&S^{n}\to\mathbb{R}\\&X\mapsto\lambda_{n}\end{split}$ (4)
where $\lambda_{n}$ is the largest eigenvalue of the transformed distance matrix $\hat{D}$ based on equation(3). The objective function $f$ is maximized.
•
Mutation: Alteration of an individual.
$X_{\text{new}}=mutation(X)=$
$\{x_{1},\ldots,x_{j-1},submutation(x_{j}),x_{j+1},\ldots,x_{n},\}$ ,with $j\in\{1,\ldots,n\}$ .For the submutation function, any edit operation that works for a sample $x$ can be chosen: $x_{new}=submutation(x)=edit(x)$ .
For example, in case of permutations, one permutation $x_{j}\in X$ is chosen and mutated with typical permutation edit-operations (swap, interchange, reversal).The specific edit-operation is called submutation operator, to distinguish between mutation of the individual set $X$ and the submutation of a single permutation $x_{j}\in X$ .
•
Recombination: Combining two sets. For recombination, two sets are randomly split and the parts of both sets are joined to form a new setof the same size.
•
Repair: Duplicate removal.Mutation and recombination may createduplicates ( $x_{i}=x_{j}$ with $i\neq j$ ). In practice, duplicates are not desirable and are irrelevant tothe question of definiteness.Hence, duplicates are replaced by randomly generated, unique samples $x^{*}\notin X$ .
•
Stopping criterion: Indefiniteness proven or budget exhausted.The optimization can stop when some solution set $X$ is found which yields $\lambda_{n}>\epsilon$ , where $\epsilon$ is a small positive number. Alternatively, the process stops if a budget of objective function evaluations is exhausted.

4 Experimental Validation

The proposed approaches can be useful in any case where definiteness of kernels is of interest.The experiments provide a proof of concept of the proposed approaches.Hence, we chose to pick a recent application as a motivation for our experiments:surrogate-model based optimization in permutation spaces (Moraglio and Kattan, 2011; Zaefferer etal., 2014a ).

In many real-world optimization problems, objective functionevaluations are expensive.Sequential modeling and optimization techniques are state-of-the-art in these settings(Bartz-Beielstein and Zaefferer, 2017).Typically, an initial model is built at the first stage of the process. The model will be subsequently refined by adding further datauntil the budget is exhausted. At each stage of this sequential process, available information from the model is used to determinepromising new candidate solutions.This motivates the rather small data set sizes used throughout this study.

4.1 Test Case: Permutations

As a proof-of-concept, we selected 16 distance measuresfor permutations; see Table1 for a complete list.The implementation of these distance measures is taken from the R-package CEGO¹¹1The package CEGO is available on CRAN at http://cran.r-project.org/package=CEGO..

Similarly to Schiavinotto and Stützle, (2007) we define $\Pi^{m}$ as the set of all permutations of the numbers $\{1,2,\ldots,m\}$ .A permutation has exactly $m$ elements.We denote a single permutation with $\pi\in\Pi^{m}$ and $\pi=\{\pi_{1},\pi_{2},\ldots,\pi_{m}\}$ where $\pi_{i}$ is aspecific element of the permutation at position $i$ .For example, a permutation in this notation is $\pi=\{3,2,1,4,5\}\in\Pi^{5}$ .Explanations and formulas (where applicable) for the distance measures are given in appendix A.

name	complexity	metric	abbreviation
Levenshtein	$O(m^{2})$	yes	Lev
Swap	$O(m^{2})$	yes	Swa
Interchange	$O(m^{2})$	yes	Int
Insert	$O(mlog(m))$	yes	Ins
LC Substring	$O(m^{2})$	yes	LCStr
R	$O(m^{2})$	yes	R
Adjacency	$O(m^{2})$	pseudo	Adj
Position	$O(m^{2})$	yes	Pos
Position²	$O(m^{2})$	no	Posq
Hamming	$O(m)$	yes	Ham
Euclidean	$O(m)$	yes	Euc
Manhattan	$O(m)$	yes	Man
Chebyshev	$O(m)$	yes	Che
Lee	$O(m)$	yes	Lee
Cosine	$O(m)$	no	Cos
Lexicographic	$O(mlog(m))$	yes	Lex

4.2 Random Sampling

In a single experiment, $t=10,000$ sets of permutations are randomly generated.Each set contains $n$ permutations and each permutation has $m$ elements.For each set, the largest eigenvalue $\lambda_{n}$ of $\hat{D}$ is computed based on equation(3).To summarize all $t$ sets, the largest $\lambda_{n}$ as well as the ratio $p=\frac{n_{\lambda+}}{t}$ are recorded.The tolerance value used to check whether the largest eigenvalue is positive is $\epsilon$ =1e-10.This process is repeated 10 times, to achieve a reliable estimate of the recorded values.

Two batches of experiments are performed. In the first, all 16 distance measures areexamined, with $n=\{4,\ldots,20\}$ and $m=\{4,\ldots,15\}$ .In the second batch, larger sizes $n=\{21,\ldots,40,45,50,60,70,80,90,100\}$ are tested,but the permutations are restricted to $m=\{5,\ldots,15\}$ and the distance measures are only LCStr, Insert, Chebyshev, Levenshtein and Interchange.

4.3 Directed Search

To be comparable to the random sampling approach, the budget for each EA run is $10,000$ fitness function evaluations. A run will stop if the budget isexhausted or if $\lambda_{n}>\epsilon=10^{-10}$ .The population size of the EA is set to 100. The recombination rate is 0.5, the mutationrate is $1/m$ and truncation selection is used.

To identify bias introducedby the choice of the submutation operator (which may have a strong interaction withthe respective distance measures), each EA run is performed repeatedly withthree different submutation operators:

•

Swap mutation: Transposing two adjacent elements of the permutation.

	$\displaystyle\pi$	$\displaystyle=\pi_{1},\ldots,\pi_{a},\pi_{b},\ldots,\pi_{m}$
	$\displaystyle\pi^{*}$	$\displaystyle=\pi_{1},\ldots,\pi_{b},\pi_{a},\ldots,\pi_{m},$

with $1\leq a<(m-1)$ and $b=a+1$ .

•

Interchange mutation: Transposing two arbitrary elements of the permutation.

	$\displaystyle\pi$	$\displaystyle=\pi_{1},\ldots,\pi_{a-1},\pi_{a},\pi_{a+1},\ldots,\pi_{b-1},\pi_{b},\pi_{b+1},\ldots,\pi_{m}$
	$\displaystyle\pi^{*}$	$\displaystyle=\pi_{1},\ldots,\pi_{a-1},\pi_{b},\pi_{a+1},\ldots,\pi_{b-1},\pi_{a},\pi_{b+1},\ldots,\pi_{m},$

with $1\leq a\leq m$ and $1\leq b\leq m$ .

•

Reversal mutation: Reversing a substring of the permutation.

	$\displaystyle\pi$	$\displaystyle=\pi_{1},\ldots,\pi_{a},\pi_{a+1},\ldots,\pi_{b-1},\pi_{b},\ldots,\pi_{m}$
	$\displaystyle\pi^{*}$	$\displaystyle=\pi_{1},\ldots,\pi_{b},\pi_{b-1},\ldots,\pi_{a+1},\pi_{a},\ldots,\pi_{m},$

with $1\leq a<b\leq m$ .

All 16 distance measures aretested, with
$n=\{4,\ldots,20\}$ and $m=\{4,\ldots,15\}$ . With ten repeats, and the three differentsubmutation operators, this results into $97,920$ EA runs, each with $10,000$ fitness function evaluations.The employed EA implementation is part of the R-package CEGO.

4.4 Tests For Other Search Domains

To show that the proposed approach is not limited to the presented permutation distance example, wealso briefly explore other search domains and their respective distances. However, these are not analyzed in further detail. Instead,we provide a list of minimal examples in Appendix B: the smallest (w.r.t.dimension) indefinite distance matrix for eachtested distance measure. The examples include distance measures for permutations, signed permutations,trees, and strings.

5 Observations and Discussion

5.1 Sampling Results

The proportions of sets with positive eigenvalues ( $p$ ) are summarized in Fig.1.The largest eigenvalues are depicted in Fig.2.Only the five indefinite distance measures, which achieved positive eigenvalues are shown:

An Empirical Approach For Probing the Definiteness of Kernels**preprint submitted to Soft Computing journal, Springer (1)

An Empirical Approach For Probing the Definiteness of Kernels**preprint submitted to Soft Computing journal, Springer (2)

Longest Common Substring, Insert, Chebyshev, Levenshtein, Interchange.

No counter-examples are found for the remaining eleven measures. That does notprove that they are CNSD (although some are CNSD, e.g., Euclidean distance, Swapdistance(Jiao and Vert, 2015), Hamming distance(Hutter etal., 2011)),but indicates that it may be unproblematic to use them in practice.Some of the five non-CNSD distance measures were reported to work wellin a surrogate-model optimization framework, e.g., Levenshtein distanceseemed to work well for modeling of scheduling problems(Zaefferer etal., 2014a ).This is mainly due to the fact, that even a non-CNSD distance may yieldPSD kernel matrix $K$ , depending on the specific data set and kernel parameters used.We do not suggest that non-CNSD distance measures should be avoided, but thattheir application should be handled with care.

Regarding values of $p$ (Fig.1),some trends can be observed.For indefinite distance measures, increasing theset size( $n$ ) will in general lead to larger values of $p$ . Obviously, a larger set is more likely to contain combinations of samples that yieldnegative eigenvalues. In addition, the lower bound for eigenvalues decreases with increasingmatrix size(Constantine, 1985).

In contrast to the set size, increasing the number of permutation elements ( $m$ ) decreases the proportion of positive eigenvalues $p$ in all five cases. This can be attributed to a larger and hence more difficult search space.Overall, none of the distance measures shows exactly the same behavior. LCStr distancehas the least problematic behavior. Only very few sets (of comparatively large size) yielded positive eigenvalues with LCStr distance.Interchange, Levenshtein and Insert distance all have relatively large $p$ for small set sizes $n$ . Chebyshev on the other hand,starts to have non-zero $p$ for relatively large $n$ . However, the number of permutation elements has only a weak influencein case of Chebyshev distance. Hence, curves for different $m$ are much closer to each other, compared to the other distance measures.Somewhat analogous to $p$ , the largest eigenvalues of $\hat{D}$ plotted in Fig.2 are generally increasing for larger $n$ ,and decreasing with larger $m$ .

Our findings are confirmed by some results from literature.Cortes etal., (2004) have shownthat an exponential kernel function based on the Levenshtein distancesis indefinite for strings of more than one symbol. Our experiments show that this result can be easilyrediscovered empirically, for the case of permutations.At the same time, these findings also confirm (and provide reasons for)problems observed with these kinds of kernel functions in a previous study(Zaefferer etal., 2014a ).

As a consistency check, Table2 compares the samplingresults to a brute force approach for $n=\{4,\ldots,8\}$ and $m=4$ .It shows that the sampling indeed approximatesthe number of non-CNSD matrices quite well. In this small set, the sampling identified all combinations of distance measure and $n$ that may yield non-CNSD matrices.

n	distance	$n_{s}$	true	estimate
5	Insert	42,504	0.047	0.048
6	Insert	134,596	0.221	0.219
7	Insert	346,104	0.530	0.529
8	Insert	735,471	0.827	0.825
5	Interchange	42,504	0.106	0.105
6	Interchange	134,596	0.465	0.463
7	Interchange	346,104	0.833	0.834
8	Interchange	735,471	0.978	0.978
5	Levenshtein	42,504	0.087	0.087
6	Levenshtein	134,596	0.293	0.293
7	Levenshtein	346,104	0.591	0.589
8	Levenshtein	735,471	0.847	0.846
5	LCStr	42,504	0.002	0.002
6	LCStr	134,596	0.007	0.007
7	LCStr	346,104	0.026	0.026
8	LCStr	735,471	0.093	0.092

5.2 Optimization Results

The average number of fitness function evaluations (i.e., number of times $\lambda_{n}$ of $\hat{D}$ is computed)required to find a matrix $\hat{D}$ with positive $\lambda_{n}$ is depicted in Fig.3.

An Empirical Approach For Probing the Definiteness of Kernels**preprint submitted to Soft Computing journal, Springer (3)

The optimization results show similar behavior with respect to $n$ and $m$ as the sampling approach. Increasing $m$ leads to an increased number of fitness function evaluations. That means, finding positive eigenvalues becomes more difficult with increasingvalues of $m$ .Increasing $n$ reduces the number of required fitness function evaluations. That is, finding positive eigenvalues becomes easier.In some cases, this effect disappears for large values of $m$ , e.g., for LCStr distance, where the average is more or less constant over $n$ , if $m$ is large enough.

Importantly, the comparison to the sampling results clearly shows that the EA has some success in optimizing the largest eigenvalueof the transformed distance matrix $\hat{D}$ .In several cases, positive eigenvalues are found by the EA while sampling with the same budgetfailed to find any.Hence, it can be assumed that the fitness landscape based on $\lambda_{n}$ of $\hat{D}$ is sufficiently smooth to allow for optimization.The eigenvalue $\lambda_{n}$ seems to be a good indicator of how close a solution set is to yielding an indefinite kernel matrix.

Furthermore, the expected bias of the used submutation operator becomes visible.For Insert and LCStr distance, the EA with swap mutation works considerably better than the other two variants.Hence, comparisons of these values across different distance measures should be handled with caution.Clearly, other aspects of the optimization algorithm (e.g., selection criteria or recombination operators) might have similar effects.While this bias is troubling, results may still offer interesting insights. The eigenvalue optimizationcould be interpreted as a worst-case scenario that occurs if an iterative learning process strongly correlates with theeigenvalues of the employed distance matrix.

5.3 Verification: Impact on Model Quality

Earlier, we discussed two valuesthat may express the effect of the lack of definitenessin practice, i.e., $p$ and $\lambda_{n}$ of $\hat{D}$ .But what do these values imply?

The value $p$ can be seen as a probability of generating indefinite matrices.If we assume that a model is unable to deal with indefinite data,the fraction $p$ is an estimate of how likely a modeling failure is.For other cases, it is very hard to link it to a model performance measuresuch as accuracy without making too many assumptions.We still argue that $p$ provides useful information, especiallywhen the kernel is designed and probed before sampling any data (e.g.,when planning an experiment).We suggest to use $p$ to support an initial decision (e.g., whetherto spend additional time on fixing or otherwise dealing with the indefinite kernel).One advantage is that it is rather easy to interpret.

In contrast, the parameter $\lambda_{n}$ of $\hat{D}$ is more difficult to interpret.But it has the advantage that it may be estimated for a single matrixas well as its average for a set of matrices. Hence, we want to determinewhether the magnitude of this eigenvalue affects model performance.We expect an influence that depends on the choice of model.Consider, e.g., a Gaussian process regression model, as e.g., described by Forrester etal., (2008).The model may be able to mitigate the problematic eigenvalueby assigning larger $\theta$ values to the kernel $k(x,x^{\prime})=\exp(-\theta{d}(x,x^{\prime}))$ .For very large $\lambda_{n}$ and thus very large $\theta$ , this will lead to kernel matricesthat approximate the unit matrix, which is positive definite.A model with a unit kernel matrix would be able to reproducethe training data, but would predict the process mean for most other data points.Hence, we examine Gaussian process regression models, since they providea transparent and interpretable test case (but similar experimentscould easily be made with support vector regression).

An experimental test has to consider the potential bias of the used data set.We need to be able to reasonably assume that differences in performanceare actually due to the properties of employed distance or kernel(i.e., a kernel performs poorly because the corresponding $\lambda_{n}$ of $\hat{D}$ is high)rather than properties of the data set(i.e., a kernel performs poorly because it does not fit well to the groundtruth of the data set).To that end, we suggest that observations in a test data set are derived fromthe same distances that are used in the model.

Hence, we randomly created data sets $X$ of size $n$ with permutations of dimension $m$ ,similarly to the random sampling performed earlier.Then, we created training observations by evaluating the distance of each permutation in $X$ to a reference permutation $x_{ref}={1,...,m}$ , i.e., $y={d}(x,x_{ref})$ .A Gaussian process model was then trained with this data, largely followingthe descriptions inForrester etal., (2008).The model was trained with the kernel $k(x,x^{\prime})=\exp(-\theta{d}(x,x^{\prime}))$ ,and the model parameters (e.g., $\theta$ ) were determined by maximum likelihood estimation, via the locally biasedversion of the DIviding RECTangles (DIRECT) algorithm (Gablonsky and Kelley, 2001)with 1,000 likelihood evaluations.For each test, the distance chosen to produce the observations $y$ and the distance chosen in the kernel were identical.

The Root MeanSquare Error (RMSE) of the model was evaluated on 1000 randomly chosen permutations.The resulting RMSE values for each training set $X$ , as well as the correspondingeigenvalue $\lambda_{n}$ of $\hat{D}$ are shown in Fig.4.The graphic shows a trend that confirms our expectation. It seems that distancesassociated to larger $\lambda_{n}$ tend to produce larger errors.

An Empirical Approach For Probing the Definiteness of Kernels**preprint submitted to Soft Computing journal, Springer (4)

6 Summary and Outlook

The focus of this study was the definiteness of kernel and distance measures.Definiteness is a main requirement for modeling techniques like Gaussian processes or SVMs.Distance-based kernels with unknown definitenessmay be promising choices for certain parameter spaces.Yet, their definiteness may be very hard to prove or enforce.

It was found, that empirical approaches (based on sampling or optimization) may helpto assess definiteness of the respective function. In detail, two research questionswere investigated:

$\text{Q}_{1}$: Discovery: Is there an efficient, empirical approach to determine thedefiniteness of kernel functions based on arbitrary distance measures?
${\text{Q}_{2}}$: Measurability: If $\text{Q}_{1}$ can be answered affirmatively,can we quantify to what extent a lack of definiteness is problematic?

Two empirical approaches were suggested towards that end. The first approach ( ${\text{A}_{1}}$ )samples from the space of solution sets, and determines whether a set is found which leadsto an indefinite distance or kernel matrix. If indefinite matrices are rare, a directed search with an EAis more successful ( ${\text{A}_{2}}$ ). The EA maximizes the largest eigenvalue of a transformed distance matrix $\hat{D}$ ,respectively minimizes the smallest eigenvalue of a kernel matrix. Hence, the EA searches for sets that yield indefinite matrices.

As a proof-of-concept, the approaches were applied to distance measures for permutations.It was shown that five problematic distance measures couldbe identified: Longest Common Substring (LCStr), Insert, Chebyshev, Levenshtein, and Interchange distance.Information known from literature (regarding indefiniteness of the respectivekernel function) could be rediscovered by the empirical approaches.

The optimization approach was successful, as it was able to outperform the sampling approachin discovering sets with indefinite kernel matrices. Still, the results also indicated that thechoice of variation operators in the optimization algorithm does introduce bias.Hence, the respective results do not allow a conclusion about the impact of a lack of definitenessof the respective sets/matrices.Still, the success of the EA indicates that the fitness landscape posed by the largest eigenvalue isnot excessively rugged and has an exploitable structure. This suggests that the largest eigenvalue is a good indicator of how far acertain solution set (and the respective distance or kernel matrix) is from becoming indefinite.In an additional set of experiments, we further verified that increasing the largest eigenvalue can in factbe linked to a decrease in model quality.This results into the following responses to the posed research questions:

R1: Discovery: Sampling from the space of potential candidate solution sets allows identifying problems with definiteness,by identifying solution sets that lead to non-CNSD distance matrices ( ${\text{A}_{1}}$ ).Where such situations are rare (hence more likely to be missed by the sampling),an optimization approach may be more successful ( ${\text{A}_{2}}$ ).While neither approach ${\text{A}_{1}}$ nor ${\text{A}_{2}}$ are able to prove definiteness, both are able to disprove it.If no negative results are found it is reasonable to assume thatusing the respective distance/kernel function is feasible.
R2: Measurability: The sampling approach ( ${\text{A}_{1}}$ ) yields a proportion of potentially non-CNSD matrices, which in turn yields an estimateof how problematic a distance measure is. In a similar way, yet potentially biased by the choice of optimization algorithm,the number of evaluations required by the optimization approach gives a similar estimate.In addition, the success of the optimization approach ( ${\text{A}_{2}}$ ) suggests that the respective largest eigenvalueis an indicator of how close certain sets (and respective distance or kernel matrices) are to becoming indefinite.Additional experiments showed how this eigenvalue could be linked to model performance.

For future research, it may be of interest to allow the EA to change the set size.Clearly, one issue would be that enlarging the sets may quickly lead to a trivial solution, sincelarger sets naturally lead to larger $\lambda_{n}$ of $\hat{D}$ .Hence, there is a trade-off between the largest eigenvalue and the set size.A multi-objective EA (e.g. NSGA-II(Deb etal., 2002) or SMS-EMOA(Beume etal., 2007))may be used to handle this issue by simultaneouslymaximizing $\lambda_{n}$ and minimizing the set size $n$ .

Finally, the herein described kernels and distances are not the full story. For other kernels, the relation between distance measure and kernel functionmay not be as straightforward. Parameters of the distance measure or the kernel functioncould complicate the situation. It may be necessary to adapt the proposed method to, e.g., includeparameters in the sampling and optimization procedures.

Compliance with ethical standards

Conflicts of Interest The authors declare that they have no conflict of interest.

References

Bader etal., (2004)Bader, D.A., Moret, B.M., Warnow, T., Wyman, S.K., Yan, M., Tang, J.,Siepel, A.C., & Caprara, A. (2004).Genome rearrangements analysis under parsimony and other phylogeneticalgorithms (grappa) 2.0.Online: https://www.cs.unm.edu/moret/GRAPPA/.accessed: 16.11.2016.
Bartz-Beielstein and Zaefferer, (2017)Bartz-Beielstein, T. & Zaefferer, M. (2017).Model-based methods for continuous and discrete global optimization.Applied Soft Computing, 55:154 – 167.
Berg etal., (1984)Berg, C., Christensen, J. P.R., & Ressel, P. (1984).Harmonic Analysis on Semigroups, volume 100 of GraduateTexts in Mathematics.Springer New York.
Beume etal., (2007)Beume, N., Naujoks, B., & Emmerich, M. (2007).SMS-EMOA: Multiobjective selection based on dominatedhypervolume.European Journal of Operational Research, 181(3):1653–1669.
Boytsov, (2011)Boytsov, L. (2011).Indexing methods for approximate dictionary searching: Comparativeanalysis.J. Exp. Algorithmics, 16:1–91.
Burges, (1998)Burges, C.J. (1998).A tutorial on support vector machines for pattern recognition.Data Mining and Knowledge Discovery, 2(2):121–167.
Camastra and Vinciarelli, (2008)Camastra, F. & Vinciarelli, A. (2008).Machine Learning for Audio, Image and Video Analysis: Theoryand Applications.Advanced information and knowledge processing. Springer, London.
Campos etal., (2005)Campos, V., Laguna, M., & Martí, R. (2005).Context-independent scatter and tabu search for permutation problems.INFORMS Journal on Computing, 17(1):111–122.
Camps-Valls etal., (2004)Camps-Valls, G., Martín-Guerrero, J.D., Rojo-Álvarez, J.L., &Soria-Olivas, E. (2004).Fuzzy sigmoid kernel for support vector classifiers.Neurocomputing, 62:501–506.
Chen etal., (2009)Chen, Y., Gupta, M.R., & Recht, B. (2009).Learning kernels from indefinite similarities.In Proceedings of the 26th Annual International Conference onMachine Learning, ICML ’09, (pp. 145–152), New York, NY, USA. ACM.
Constantine, (1985)Constantine, G. (1985).Lower bounds on the spectra of symmetric matrices with nonnegativeentries.Linear Algebra and its Applications, 65:171–178.
Cortes etal., (2004)Cortes, C., Haffner, P., & Mohri, M. (2004).Rational kernels: Theory and algorithms.J. Mach. Learn. Res., 5:1035–1062.
Curriero, (2006)Curriero, F. (2006).On the use of non-euclidean distance measures in geostatistics.Mathematical Geology, 38(8):907–926.
Deb etal., (2002)Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. (2002).A fast and elitist multiobjective genetic algorithm: NSGA-II.IEEE Transactions on Evolutionary Computation, 6(2):182–197.
Deza and Huang, (1998)Deza, M. & Huang, T. (1998).Metrics on permutations, a survey.Journal of Combinatorics, Information and System Sciences,23(1-4):173–185.
Eiben and Smith, (2003)Eiben, A.E. & Smith, J.E. (2003).Introduction to Evolutionary Computing.Springer Berlin Heidelberg.
Feller, (1971)Feller, W. (1971).An Introduction to Probability Theory and Its Applications,Volume 2.JOHN WILEY & SONS INC.
Forrester etal., (2008)Forrester, A., Sobester, A., & Keane, A. (2008).Engineering Design via Surrogate Modelling.Wiley.
Gablonsky and Kelley, (2001)Gablonsky, J. & Kelley, C. (2001).A locally-biased form of the direct algorithm.Journal of Global Optimization, 21(1):27–37.
Gärtner etal., (2003)Gärtner, T., Lloyd, J., & Flach, P. (2003).Kernels for structured data.In Matwin, S. & Sammut, C., editors, Inductive LogicProgramming, volume 2583 of Lecture Notes in Computer Science, pages66–83. Springer Berlin Heidelberg.
Gärtner etal., (2004)Gärtner, T., Lloyd, J., & Flach, P. (2004).Kernels and distances for structured data.Machine Learning, 57(3):205–232.
Haussler, (1999)Haussler, D. (1999).Convolution kernels on discrete structures.Technical Report UCSC-CRL-99-10, Department of Computer Science,University of California at Santa Cruz.
Hirschberg, (1975)Hirschberg, D.S. (1975).A linear space algorithm for computing maximal common subsequences.Commun. ACM, 18(6):341–343.
Hutter etal., (2011)Hutter, F., Hoos, H.H., & Leyton-Brown, K. (2011).Sequential model-based optimization for general algorithmconfiguration.In Proc.of LION-5, (pp. 507–523).
Ikramov and Savel’eva, (2000)Ikramov, K. & Savel’eva, N. (2000).Conditionally definite matrices.Journal of Mathematical Sciences, 98(1):1–50.
Jiao and Vert, (2015)Jiao, Y. & Vert, J.-P. (2015).The Kendall and Mallows kernels for permutations.In Proceedings of the 32nd International Conference on MachineLearning (ICML-15), (pp. 1935–1944).
Kendall and Gibbons, (1990)Kendall, M. & Gibbons, J. (1990).Rank Correlation Methods.Oxford University Press.
Lee, (1958)Lee, C. (1958).Some properties of nonbinary error-correcting codes.IRE Transactions on Information Theory, 4(2):77–82.
Li and Jiang, (2004)Li, H. & Jiang, T. (2004).A class of edit kernels for svms to predict translation initiationsites in eukaryotic mrnas.In Proceedings of the Eighth Annual International Conference onResaerch in Computational Molecular Biology, RECOMB ’04, (pp. 262–271),New York, NY, USA. ACM.
Loosli etal., (2015)Loosli, G., Canu, S., & Ong, C. (2015).Learning SVM in Krein spaces.IEEE Transactions on Pattern Analysis and Machine Intelligence,38(6):1204–1216.
Marteau and Gibet, (2014)Marteau, P.-F. & Gibet, S. (2014).On recursive edit distance kernels with application to time seriesclassification.IEEE Transactions on Neural Networks and Learning Systems,PP(99):1–1.
Moraglio and Kattan, (2011)Moraglio, A. & Kattan, A. (2011).Geometric generalisation of surrogate model based optimisation tocombinatorial spaces.In Proceedings of the 11th European Conference on EvolutionaryComputation in Combinatorial Optimization, EvoCOP’11, (pp. 142–154),Berlin, Heidelberg, Germany. Springer.
Motwani and Raghavan, (1995)Motwani, R. & Raghavan, P. (1995).Randomized Algorithms.Cambridge University Press.
Murphy, (2012)Murphy, K.P. (2012).Machine Learning.MIT Press Ltd.
Ong etal., (2004)Ong, C.S., Mary, X., Canu, S., & Smola, A.J. (2004).Learning with non-positive kernels.In Proceedings of the Twenty-first International Conference onMachine Learning, ICML ’04, (pp. 81–88), New York, NY, USA. ACM.
Pawlik and Augsten, (2015)Pawlik, M. & Augsten, N. (2015).Efficient computation of the tree edit distance.ACM Transactions on Database Systems, 40(1):1–40.
Pawlik and Augsten, (2016)Pawlik, M. & Augsten, N. (2016).Tree edit distance: Robust and memory-efficient.Information Systems, 56:157–173.
Rasmussen and Williams, (2006)Rasmussen, C.E. & Williams, C. K.I. (2006).Gaussian Processes for Machine Learning.The MIT Press.
Reeves, (1999)Reeves, C.R. (1999).Landscapes, operators and heuristic search.Annals of Operations Research, 86(0):473–490.
Schiavinotto and Stützle, (2007)Schiavinotto, T. & Stützle, T. (2007).A review of metrics on permutations for search landscape analysis.Computers & Operations Research, 34(10):3143–3153.
Schleif and Tino, (2015)Schleif, F.-M. & Tino, P. (2015).Indefinite proximity learning: A review.Neural Computation, 27(10):2039–2096.
Schölkopf, (2001)Schölkopf, B. (2001).The kernel trick for distances.In Leen, T.K., Dietterich, T.G., & Tresp, V., editors, Advances in Neural Information Processing Systems 13, pages 301–307. MITPress.
Sevaux and Sörensen, (2005)Sevaux, M. & Sörensen, K. (2005).Permutation distance measures for memetic algorithms with populationmanagement.In Proceedings of 6th Metaheuristics International Conference(MIC’05), (pp. 832–838). University of Vienna.
Singhal, (2001)Singhal, A. (2001).Modern information retrieval: a brief overview.IEEE Bulletin on Data Engineering, 24(4):35–43.
Smola etal., (2000)Smola, A.J., Ovári, Z.L., & Williamson, R.C. (2000).Regularization with dot-product kernels.In Advances in Neural Information Processing Systems 13,Proceedings, (pp. 308–314). MIT Press.
vander Loo, (2014)vander Loo, M.P. (2014).The stringdist package for approximate string matching.The R Journal, 6(1):111–122.
Vapnik, (1998)Vapnik, V.N. (1998).Statistical Learning Theory, volume1.Wiley New York.
Voutchkov etal., (2005)Voutchkov, I., Keane, A., Bhaskar, A., & Olsen, T.M. (2005).Weld sequence optimization: The use of surrogate models for solvingsequential combinatorial problems.Computer Methods in Applied Mechanics and Engineering,194(30-33):3535–3551.
Wagner and Fischer, (1974)Wagner, R.A. & Fischer, M.J. (1974).The string-to-string correction problem.J. ACM, 21(1):168–173.
Wu etal., (2005)Wu, G., Chang, E.Y., & Zhang, Z. (2005).An analysis of transformation on non-positive semidefinite similaritymatrix for kernel machines.In Proceedings of the 22nd International Conference on MachineLearning.
Zaefferer and Bartz-Beielstein, (2016)Zaefferer, M. & Bartz-Beielstein, T. (2016).Efficient global optimization with indefinite kernels.In Parallel Problem Solving from Nature–PPSN XIV, (pp.69–79). Springer.
(52)Zaefferer, M., Stork, J., & Bartz-Beielstein, T. (2014a).Distance measures for permutations in combinatorial efficient globaloptimization.In Bartz-Beielstein, T., Branke, J., Filipič, B., & Smith, J.,editors, Parallel Problem Solving from Nature–PPSN XIII, (pp.373–383), Cham, Switzerland. Springer.
(53)Zaefferer, M., Stork, J., Friese, M., Fischbach, A., Naujoks, B., &Bartz-Beielstein, T. (2014b).Efficient global optimization for combinatorial problems.In Proceedings of the 2014 Conference on Genetic andEvolutionary Computation, GECCO ’14, (pp. 871–878), New York, NY, USA.ACM.

Appendix A: Distance Measures for Permutations

In the following, we describe the distance measures employed in the experiments.

•
The Levenshtein distance is an edit distance measure:
${d}_{Lev}(\pi,\pi^{\prime})=edits_{\pi\rightarrow\pi^{\prime}}$
Here, $edits_{\pi\rightarrow\pi^{\prime}}$ is the minimal number of deletions, insertions, or substitutionsrequired to transform one string(or here: permutation) $\pi$ into another string $\pi^{\prime}$ .The implementation is based on Wagner and Fischer, (1974).

•

Swaps are transpositions of two adjacent elements.The Swap distance (also: Kendall’s Tau (Kendall and Gibbons, 1990; Sevaux and Sörensen, 2005) orPrecedence distance (Schiavinotto and Stützle, 2007))counts the minimum number of swapsrequired to transform one permutation into another.For permutations, it is(Sevaux and Sörensen, 2005):

	$\displaystyle{d}_{Swa}(\pi,\pi^{\prime})=\sum_{i=1}^{m}\sum_{j=1}^{m}z_{ij}~{}~{}\text{with}$
	$\displaystyle z_{ij}=\left\{\begin{array}[]{l l}1&\quad\text{if }\pi_{i}<\pi_{j}~{}\text{and}~{}\pi^{\prime}_{i}>\pi^{\prime}_{j},\\0&\quad\text{otherwise.}\end{array}\right.$

•
An interchange operation is the transposition of two arbitrary elements.Respectively, the Interchange (also: Cayley) distance counts the minimum numberof interchanges ( $interchanges_{\pi\rightarrow\pi^{\prime}}$ ) required to transform one permutation into another (Schiavinotto and Stützle, 2007):
${d}_{Int}(\pi,\pi^{\prime})=interchanges_{\pi\rightarrow\pi^{\prime}}$
•
The Insert distance is based on the longest common subsequence $LCSeq(\pi,\pi^{\prime})$ .The longest common subsequence is the largest number of elements that follow eachother in both permutations, with interruptions. The correspondingdistance is
${d}_{Ins}(\pi,\pi^{\prime})=m-LCSeq(\pi,\pi^{\prime}).$
We use the algorithm described by Hirschberg, (1975).The name is due to its interpretation as an edit distance measure. The corresponding edit operation is a combination of insertionand deletion. A single element is moved from one position (delete) to a new position (insert).It is also called Ulam’s distance (Schiavinotto and Stützle, 2007).

•

The Longest Common Substring distance is based on the largest number ofelements that follow each other in both permutations, without interruption.Unlike the longest common subsequence all elements have to be adjacent.

If $LCStr(\pi,\pi^{\prime})$ is the length of the longest common string,the distance is

{d}_{LCStr}(\pi,\pi^{\prime})=m-LCStr(\pi,\pi^{\prime}).

•

The R-distance(Campos etal., 2005; Sevaux and Sörensen, 2005)counts the number of times that one element follows another in one permutation, but not in the other.It is identical with the uni-directional adjacency distance(Reeves, 1999). It is computed by

	$\displaystyle{d}_{R}(\pi,\pi^{\prime})=\sum_{i=1}^{m-1}y_{i}~{}~{}\text{with}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}$
	$\displaystyle y_{i}=\left\{\begin{array}[]{l l}0&\quad\text{if }\exists j:\pi_{i}=\pi^{\prime}_{j}~{}\text{and}~{}\pi_{i+1}=\pi^{\prime}_{j+1},\\1&\quad\text{otherwise.}\end{array}\right.$

•

The (bi-directional) Adjacencydistance(Reeves, 1999; Schiavinotto and Stützle, 2007) counts the number of times two elements are neighbors in one, but not in the other permutation.Unlike R-distance (uni-directional), the order of the two elements does not matter.It is computed by

	$\displaystyle{d}_{Adj}(\pi,\pi^{\prime})=\sum_{i=1}^{m-1}y_{i}~{}~{}\text{with}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}$
	$\displaystyle y_{i}=\left\{\begin{array}[]{l l}0&\quad\text{if }\exists j:\pi_{i}=\pi^{\prime}_{j}~{}\text{and}~{}\pi_{i+1}\in\{\pi^{\prime}_{j+1},\pi^{\prime}_{j-1}\},\\1&\quad\text{otherwise.}\end{array}\right.$

•
The Position distance(Schiavinotto and Stützle, 2007) is identical with the Deviation distance or Spearman’s footrule (Sevaux and Sörensen, 2005),
${d}_{\text{Pos}}(\pi,\pi^{\prime})=\sum_{k=1}^{m}|i-j|~{}~{}\text{where}~{}~{}\pi_{i}=\pi^{\prime}_{j}=k$ .
•
The non-metric Squared Position distance is Spearman’s rank correlation coefficient(Sevaux and Sörensen, 2005). In contrast to the Position distance, the term $|i-j|$ is replaced by $(i-j)^{2}$ .
•
The Hamming distance or Exact Match distance simply counts the number of unequal elements in two permutations, i.e.,
${d}_{Ham}(\pi,\pi^{\prime})=\sum_{i=1}^{m}a_{i},~{}~{}\text{where}~{}~{}a_{i}=\left\{\begin{array}[]{l l}0&\quad\text{if }\pi_{i}=\pi^{\prime}_{i},\\1&\quad\text{otherwise.}\end{array}\right.$
•
The Euclidean distance is
${d}_{Euc}(\pi,\pi^{\prime})=\sqrt{\sum_{i=1}^{m}(\pi_{i}-\pi^{\prime}_{i})^{2}}$ .
•
The Manhattan distance (A-Distance,
cf.(Sevaux and Sörensen, 2005; Campos etal., 2005)) is
${d}_{Man}(\pi,\pi^{\prime})=\sum_{i=1}^{m}|\pi_{i}-\pi^{\prime}_{i}|$ .
•
The Chebyshev distance is
${d}_{Che}(\pi,\pi^{\prime})=\underset{1\leq i\leq m}{\max}(|\pi_{i}-\pi^{\prime}_{i}|)$ .
•
For permutations, the Lee distance(Lee, 1958; Deza and Huang, 1998) is
${d}_{Lee}(\pi,\pi^{\prime})=\sum_{i=1}^{m}\min(|\pi_{i}-\pi^{\prime}_{i}|,m-|\pi_{i}-\pi^{\prime}_{i}|)$ .

•

The non-metric Cosine distance is based on the dot product of two permutations.It is derived from the cosine similarity(Singhal, 2001) of two vectors:

{d}_{Cos}(\pi,\pi^{\prime})=1-\frac{\pi\cdot\pi^{\prime}}{||\pi||~{}||\pi^{\prime}||}.

•
The Lexicographic distance regards the lexicographic ordering of permutations. If the position of a permutation $\pi$ in the lexicographic ordering of all permutations with fixed $m$ is $L(\pi)$ , then the Lexicographic distance metric is
${d}_{Lex}(\pi,\pi^{\prime})=|L(\pi)-L(\pi^{\prime})|.$

Appendix B: Minimal Examples for Indefinite Sets

To showcase the usefulness of the proposed methods, this sectionlists small example data sets and the respective indefinite distance matrices.Besides the standard permutation distances we also tested:

•
Signed permutations, reversal distance: Permutations where each element has a sign are referred to as signed permutations.An application example for signed permutations is, e.g., weld path optimization(Voutchkov etal., 2005).The reversal distance counts the number of reversals required to transform one permutationinto another.We used the non-cyclic reversal distance provided in theGRAPPA library version 2.0(Bader etal., 2004).
•
Labeled trees, tree edit distance:Trees in general are widely applied as solution representation, e.g., in Genetic Programming.In this study, we considered labeled trees. The tree edit distance counts the number node insertions, deletions or relabels.We used the efficient implementation in the APTED 0.1.1 library(Pawlik and Augsten, 2015, 2016).The labeled trees will be denoted with the bracket notation: curly brackets indicate the tree structure, letters indicate labels (internal and terminal nodes).
•
Strings, Optimal String Alignment distance (OSA): The OSA is an non-metric edit distance that counts insertions, deletions, substitutions and transpositions of characters. Each substring can be edited no more than once. It is also called the restricted Damerau-Levenshtein distance(Boytsov, 2011). We used the implementation in the stringdist R package(vander Loo, 2014).
•
Strings, Jaro-Winkler distance: The Jaro Winkler distance is based on the number of matching characters in two strings as well as the number of transpositions required to bring all matches in the same order.We used the implementation in the stringdist R package(vander Loo, 2014).

The respective results are listed in Table3.All of the listed distance measures are shown to be non-CNSD.

Permutations, Insert, $n=5$ , $m=4$ , $\lambda_{n}\approx$ 0.090
$i$	$x_{i}$	$d_{i,1}$	$d_{i,2}$	$d_{i,3}$	$d_{i,4}$	$d_{i,5}$
1	$\{1,2,3,4\}$	0	$1/3$	$1/3$	$2/3$	$1/3$
2	$\{1,3,4,2\}$		0	$2/3$	$1/3$	$2/3$
3	$\{2,3,4,1\}$			0	$1/3$	$2/3$
4	$\{3,4,1,2\}$				0	$1/3$
5	$\{4,1,2,3\}$					0
Permutations, Interchange, $n=5$ , $m=4$ , $\lambda_{n}\approx$ 0.090
1	$\{1,2,3,4\}$	0	$1/3$	$1/3$	$2/3$	$1/3$
2	$\{1,2,4,3\}$		0	$2/3$	$1/3$	$2/3$
3	$\{1,3,2,4\}$			0	$1/3$	$2/3$
4	$\{1,3,4,2\}$				0	$1/3$
5	$\{1,4,3,2\}$					0
Permutations, Levenshtein, $n=5$ , $m=4$ , $\lambda_{n}\approx$ 0.135
1	$\{1,2,4,3\}$	0	$1$	$1/2$	$1/2$	$1$
2	$\{2,3,1,4\}$		0	$1/2$	$1/2$	$1$
3	$\{2,4,3,1\}$			0	$1$	$1/2$
4	$\{3,1,2,4\}$				0	$1/2$
5	$\{3,4,2,1\}$					0
Permutations, LCStr, $n=5$ , $m=4$ , $\lambda_{n}\approx$ 0.023
1	$\{1,3,2,4\}$	0	$2/3$	$1/3$	$1/3$	$2/3$
2	$\{2,4,1,3\}$		0	$1/3$	$1/3$	$2/3$
3	$\{3,2,4,1\}$			0	$2/3$	$1$
4	$\{4,1,3,2\}$				0	$2/3$
5	$\{4,2,1,3\}$					0
Permutations, Chebyshev, $n=5$ , $m=5$ , $\lambda_{n}\approx$ 0.034
1	$\{1,5,3,4,2\}$	0	$1/4$	$3/4$	$3/4$	$1$
2	$\{2,5,3,4,1\}$		0	$1$	$1$	$3/4$
3	$\{4,2,3,1,5\}$			0	$2/4$	$1/4$
4	$\{4,3,1,2,5\}$				0	$1/4$
5	$\{5,3,2,1,4\}$					0
Sign. Permutations, Reversal, $n=5$ , $m=5$ , $\lambda_{n}\approx$ 0.016
1	$\{~{}4,~{}5,\text{-}1,\text{-}2,\text{-}3\}$	0	$4/6$	$5/6$	$3/6$	$2/6$
2	$\{~{}2,~{}1,~{}3,\text{-}4,\text{-}5\}$		0	$2/6$	$3/6$	$5/6$
3	$\{\text{-}2,~{}1,~{}3,~{}5,~{}4\}$			0	$5/6$	$3/6$
4	$\{~{}4,\text{-}2,~{}3,~{}1,\text{-}5\}$				0	$2/6$
5	$\{~{}4,\text{-}2,~{}1,\text{-}5,\text{-}3\}$					0
Labeled Trees, Edit dist., $n=5$ , $\lambda_{n}\approx$ 0.026
1	$\{b\{c\{b\}\}\}$	0	$2$	$1$	$3$	$1$
2	$\{b\}$		0	$1$	$3$	$1$
3	$\{b\{c\}\}$			0	$2$	$2$
4	$\{a\{c\}\{a\}\}$				0	$3$
5	$\{c\{b\}\}$					0
Strings, Optimal String Alignment, $n=5$ , $\lambda_{n}\approx$ 0.102
1	abc	0	$1$	$2$	$3$	$1$
2	acc		0	$3$	$2$	$2$
3	cba			0	$1$	$2$
4	caa				0	$2$
5	bac					0
Strings, Jaro-Winkler, $n=4$ , $\lambda_{n}\approx$ 0.046
1	bbbb	0	$1$	$1/6$	$3/6$
2	aaaa		0	$3/6$	$1/6$
3	bbba			0	$3/6$
4	aaab				0