## Spoken term discovery

Spoken term discovery can be logically broken down into a series of three operations, which can all be evaluated independently (see Figure 1). The first step consists of matching pairs of stretches of speech on the basis of their global similarity. The second step consists of clustering the matching pairs, thereby building a library of classes with potentially many instances. This is equivalent to building a lexicon. In the third step, the system can use its acquired classes to parse the continuous stream into candidate tokens and boundaries. Some systems may implement only some of these steps; others may perform them simultaneously rather than sequentially. The metrics below have been devised to enable comparisons between these different systems by evaluating these logically distinct steps separately.

## Evaluation metrics

All of our metrics assume a time-aligned transcription, where *T_{i,j}* is the (phoneme) transcription corresponding to the speech fragment designated by the pair of indices ⟨*i,j*⟩ (i.e., the speech fragment between frames *i* and *j*). If the left or right edge of the fragment contains part of a phoneme, that phoneme is included in the transcription if the part corresponds to more than 30 ms or more than 50% of its duration.
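The edge rule above can be sketched in Python as follows. This is a minimal illustration, not the challenge's implementation: the function names, the use of seconds as the time unit, and the `(start, end, label)` layout of the aligned phonemes are all assumptions made for the example.

```python
def include_edge_phoneme(frag_start, frag_end, ph_start, ph_end,
                         min_overlap_s=0.030, min_fraction=0.5):
    """Decide whether a phoneme belongs to the transcription of a fragment.

    Times are in seconds (an assumption). The phoneme is kept if the part
    covered by the fragment lasts more than 30 ms or more than 50% of the
    phoneme's own duration.
    """
    covered = min(frag_end, ph_end) - max(frag_start, ph_start)
    if covered <= 0:
        return False  # no shared time at all
    duration = ph_end - ph_start
    return covered > min_overlap_s or covered > min_fraction * duration


def transcribe(fragment, phones):
    """Return the phoneme labels T_{i,j} for a (start, end) fragment.

    `phones` is a hypothetical alignment: a list of (start, end, label).
    """
    start, end = fragment
    return [label for (ps, pe, label) in phones
            if include_edge_phoneme(start, end, ps, pe)]


phones = [(0.00, 0.08, "k"), (0.08, 0.20, "ae"), (0.20, 0.30, "t")]
labels = transcribe((0.07, 0.26), phones)
```

In this example the fragment covers only 10 ms of the 80 ms `k`, so that phoneme is dropped, while `t` is kept because 60 ms of it fall inside the fragment.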

We first define the set related to the output of the discovery algorithm:

- *C_{disc}*: the set of discovered clusters (a cluster being a set of fragments with the same category name).

From these, we can derive:

- *F_{disc}*: the set of discovered fragments. *F_{disc} = { f | f ∈ c, c ∈ C_{disc} }*
- *P_{disc}*: the set of non-overlapping discovered pairs. *P_{disc} = { {a,b} | a ∈ c, b ∈ c, ¬overlap(a,b), c ∈ C_{disc} }*
- *P_{disc}*\*: the set of pairwise substring completions of *P_{disc}*, which means that we compute all of the possible minimal path realignments of the two strings, and extract all of the substring pairs along the path (e.g., for fragment pair ⟨abcd,efg⟩: ⟨abc,efg⟩, ⟨ab,ef⟩, ⟨bc,fg⟩, ⟨bcd,efg⟩, etc.).
- *B_{disc}*: the set of discovered fragment boundaries (boundaries are defined in terms of *i*, the index of the nearest phoneme boundary in the transcription if it is less than 30 ms away, and -1 (wrong boundary) otherwise).

Note: two fragments a and b *overlap* if they share more than half of their temporal extension.
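To make the derived sets concrete, here is a minimal Python sketch that builds the discovered fragments and the non-overlapping discovered pairs from a set of clusters. Several choices are assumptions made for illustration only: fragments are represented as `(file_id, start, end)` tuples, fragments from different files never overlap, and "more than half of their temporal extension" is read as more than half of the *shorter* fragment's duration.

```python
from itertools import combinations


def overlap(a, b):
    """True if fragments a and b share more than half of their extension.

    Assumption: the shared time is compared with half the duration of the
    shorter fragment; fragments from different files never overlap.
    """
    if a[0] != b[0]:
        return False
    shared = min(a[2], b[2]) - max(a[1], b[1])
    shortest = min(a[2] - a[1], b[2] - b[1])
    return shared > 0.5 * shortest


def discovered_fragments(clusters):
    """Every fragment that appears in at least one cluster."""
    return {f for c in clusters for f in c}


def discovered_pairs(clusters):
    """Unordered within-cluster pairs of fragments that do not overlap."""
    return {frozenset((a, b))
            for c in clusters
            for a, b in combinations(sorted(c), 2)
            if not overlap(a, b)}


clusters = [{("f1", 0.0, 1.0), ("f1", 0.2, 1.0), ("f2", 3.0, 4.0)}]
f_disc = discovered_fragments(clusters)
p_disc = discovered_pairs(clusters)
```

Here the two fragments from `f1` overlap, so only the two pairs involving the `f2` fragment are retained.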

Next, we define the gold sets:

- *F_{all}*: the set of all possible fragments of size between 3 and 20 phonemes in the corpus.
- *P_{all}*: the set of all possible non-overlapping matching fragment pairs. *P_{all} = { {a,b} ∈ F_{all} × F_{all} | T_{a} = T_{b}, ¬overlap(a,b) }*
- *F_{goldLex}*: the set of fragments corresponding to the corpus transcribed at the word level (gold transcription).
- *P_{goldLex}*: the set of matching fragment pairs from *F_{goldLex}*.
- *B_{gold}*: the set of boundaries in the parsed corpus.
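Before discovered boundaries can be compared with the gold boundaries, each one has to be mapped onto the transcription as described for the discovered boundary set: the index of the nearest phoneme boundary if it is less than 30 ms away, and -1 otherwise. A minimal Python sketch of that mapping follows. The function name `snap_boundary`, the use of seconds, and the tie-breaking towards the earlier boundary are assumptions made for the example.

```python
def snap_boundary(t, phoneme_boundaries, tolerance_s=0.030):
    """Map a boundary time to the index of the nearest phoneme boundary.

    Returns -1 (wrong boundary) when the nearest phoneme boundary is not
    strictly less than `tolerance_s` away. Times are in seconds
    (an assumption); ties go to the earlier boundary.
    """
    if not phoneme_boundaries:
        return -1
    nearest = min(range(len(phoneme_boundaries)),
                  key=lambda k: abs(phoneme_boundaries[k] - t))
    if abs(phoneme_boundaries[nearest] - t) < tolerance_s:
        return nearest
    return -1


boundaries = [0.0, 0.08, 0.20, 0.30]
snapped = [snap_boundary(t, boundaries) for t in (0.21, 0.14, 0.095)]
```

A boundary at 0.14 s is 60 ms away from both of its neighbours, so it is marked as wrong.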

Most of our measures are defined in terms of *Precision*, *Recall* and *F-score*. *Precision* is the probability that an element in a discovered set of entities belongs to the gold set, and *Recall* the probability that a gold entity belongs to the discovered set. The *F-score* is the harmonic mean between *Precision* and *Recall*.

- *Precision_{disc,gold}* = | disc ∩ gold | / | disc |
- *Recall_{disc,gold}* = | disc ∩ gold | / | gold |
- *F-score_{disc,gold}* = 2 / (1/*Precision_{disc,gold}* + 1/*Recall_{disc,gold}*)
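These three definitions translate directly into set operations. The sketch below is illustrative only; the function name and the convention of returning 0.0 when a set is empty or when there is no intersection are assumptions, not part of the definitions above.

```python
def precision_recall_fscore(disc, gold):
    """Set-based precision, recall and F-score of `disc` against `gold`.

    Assumption: a score is 0.0 whenever its denominator would be zero.
    """
    disc, gold = set(disc), set(gold)
    hits = len(disc & gold)
    precision = hits / len(disc) if disc else 0.0
    recall = hits / len(gold) if gold else 0.0
    # 2 / (1/P + 1/R) rewritten as 2PR / (P + R) to avoid dividing by zero.
    fscore = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, fscore


p, r, f = precision_recall_fscore({1, 2, 3, 4}, {3, 4, 5})
```

With two shared elements out of four discovered and three gold items, precision is 1/2, recall is 2/3 and the F-score is 4/7.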

### Matching quality

Many spoken term discovery systems incorporate a step whereby fragments of speech are realigned and compared. Matching quality measures the accuracy of this process. Here, we use two kinds of metrics to evaluate it: *NED/Coverage* and Matching *F-score*.

*NED* and *Coverage* are quick to compute and give a qualitative estimate of the matching step. *NED* is the normalised edit distance; it is equal to 0 when a pair of fragments have exactly the same transcription, and 1 when they differ in all phonemes. *Coverage* is the fraction of the corpus containing matching pairs that has been discovered.
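As an illustration of the normalised edit distance, here is a minimal Python sketch operating on phoneme sequences. Normalising the Levenshtein distance by the length of the longer transcription is an assumption, chosen because it yields 0 for identical transcriptions and 1 when every phoneme differs; averaging over pairs in `mean_ned` is likewise an assumption about how a corpus-level score could be obtained.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (unit insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def ned(a, b):
    """Edit distance divided by the longer length (assumed normalisation)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def mean_ned(pairs):
    """Average NED over a non-empty collection of transcription pairs."""
    pairs = list(pairs)
    return sum(ned(a, b) for a, b in pairs) / len(pairs)


same = ned(["k", "ae", "t"], ["k", "ae", "t"])
different = ned(["k", "ae", "t"], ["d", "o", "g"])
```

Lower values indicate better matches, so a system whose pairs all share the same transcription would obtain a mean of 0.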


The Matching metrics (*precision, recall and F-score*) are much more exhaustive, but require considerably more computation. They compare *X = P_{disc}*\*, the set of discovered pairs (with substring completion), to *Y = P_{all}*, the set of all possible gold pairs. The precision and recall are computed over each type of pair, and averaged after reweighting by the frequency of the pair.

### Clustering Quality

Clustering quality is evaluated using two metrics. The first metric (Grouping precision, recall and F-score) computes the intrinsic quality of the clusters in terms of their phonetic composition. This score is equivalent to the purity and inverse purity scores used for evaluating clustering. Like the Matching score, it is computed over pairs, but unlike the Matching score, it focusses on the covered part of the corpus.


The second metric (Type precision, recall and F-score) takes the true lexicon as the gold cluster set and is therefore much more demanding. Indeed, a system could have very pure clusters, but could systematically missegment words. Since a discovered cluster could have several transcriptions, we use all of them (rather than using some kind of centroid).

### Parsing Quality

Parsing quality is evaluated using two metrics. The first one (Token precision, recall and F-score) evaluates how many of the word tokens were correctly segmented (*X = F_{disc}, Y = F_{goldLex}*). The second one (Boundary precision, recall and F-score) evaluates how many of the gold word boundaries were found (*X = B_{disc}, Y = B_{gold}*). These two metrics are typically correlated, but researchers typically use the first. We provide the Boundary metrics for completeness, and also to enable system diagnostics.

The details of these metrics are given in the Ludusan et al. (2014) paper. The only divergence between that paper and the present measures is that we compute these scores on the entirety of the corpus, rather than on the covered corpus. This is necessary if we want to compare systems that cover different subsets of the corpus.

In the implementation for the challenge, we use a subsampling scheme whereby the corpus is cut into 10 equal parts, each metric is computed on each subsample separately, and the results are then averaged. This makes the computation more tractable (especially for the matching metric, which requires substring completion), and also provides a standard deviation measure for each metric. In addition to each metric run on the entire corpus, we also provide the same metric restricted to within-talker matches. This is to enable the evaluation of systems that are specialized in within-talker spoken term discovery.
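The subsampling scheme can be sketched in Python as follows. This is only an illustration under stated assumptions: the corpus is treated as a flat list of items cut into contiguous parts, `metric` is a hypothetical callable returning one score per part, and the sample standard deviation (`statistics.stdev`) is used; the text above does not specify any of these details.

```python
from statistics import mean, stdev


def split_into_parts(items, n_parts=10):
    """Cut a sequence into n_parts contiguous parts of near-equal size."""
    items = list(items)
    size, extra = divmod(len(items), n_parts)
    parts, start = [], 0
    for k in range(n_parts):
        end = start + size + (1 if k < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts


def subsampled_metric(items, metric, n_parts=10):
    """Compute `metric` on each part; return (mean, standard deviation)."""
    scores = [metric(part) for part in split_into_parts(items, n_parts)]
    return mean(scores), stdev(scores)


# Toy usage: the "metric" is simply the average value within each part.
avg, sd = subsampled_metric(range(100), lambda part: sum(part) / len(part))
```

Computing the score per part keeps each evaluation small, and the spread across the 10 scores gives the standard deviation reported alongside the mean.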