Track 1

Baseline and topline

The baseline and topline ABX error rates for Track 1 are given in Table 1 (see also [1]). For the baseline model, we used 13-dimensional MFCC features computed every 10 ms, and the ABX score was computed using the cosine distance. For the topline model, we used posteriorgrams extracted from a Kaldi GMM-HMM pipeline with MFCC, delta, and delta-delta features, Gaussian mixture observation models, word-position-dependent triphone states, fMLLR talker adaptation, and a bigram word language model. The exact same Kaldi pipeline was used for the two languages and gave a phone error rate (PER) of 26.4% for English and 7.5% for Tsonga. Note that the two corpora are quite different: the English corpus contains spontaneous, casual speech; the Tsonga corpus contains read speech constructed from a small vocabulary and tailored for building speech recognition applications. The acoustic and language models were trained on the part of the corpora not used in the evaluation, and the posteriorgrams were fed into the ABX evaluation software using the KL divergence. Unsupervised models are expected to fall between the performance of these two systems.

Table 1. Track 1 ABX error rate (%) for baseline and topline models on the English and Tsonga datasets.
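To illustrate how the Track 1 metric works, below is a minimal Python sketch of the ABX decision for a single (A, B, X) triplet, where A and X are tokens of the same phonemic category and B is a token of a different one. It assumes each token is a numpy array of shape (frames, dims) and supports the two frame-level distances mentioned above: cosine for MFCC features and a symmetrised KL divergence for posteriorgrams. All function names are illustrative assumptions; the official evaluation software additionally aggregates scores across contexts and talkers.

```python
import numpy as np

def frame_dist(u, v, metric="cos"):
    """Frame-level distance: cosine for raw features (baseline),
    symmetrised KL divergence for posteriorgrams (topline)."""
    if metric == "cos":
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    eps = 1e-10                       # smooth to avoid log(0) on sparse posteriors
    u, v = u + eps, v + eps
    return 0.5 * (np.sum(u * np.log(u / v)) + np.sum(v * np.log(v / u)))

def dtw_dist(X, Y, metric="cos"):
    """Average frame distance along the best DTW alignment path."""
    nx, ny = len(X), len(Y)
    D = np.full((nx + 1, ny + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            c = frame_dist(X[i - 1], Y[j - 1], metric)
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[nx, ny] / (nx + ny)      # normalise by path length

def abx_error(a, b, x, metric="cos"):
    """Return 1.0 (an error) when X, which matches A's category,
    is nonetheless closer to B than to A; 0.0 otherwise."""
    return float(dtw_dist(b, x, metric) < dtw_dist(a, x, metric))
```

The reported ABX error rate is the average of this decision over many such triplets, so a representation scores well when tokens of the same phonemic category are systematically closer to each other than to tokens of other categories.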

Note: the Kaldi recipes can be found HERE.

Interspeech 2015 results

The results can be found HERE. The papers covered are [2, 3, 4, 5, 6, 7].

Track 2

Baseline and topline

For the baseline model, we used the JHU system described in Jansen & Van Durme (2011), run on PLP features. It performs DTW matching, using random projections to make the search efficient, followed by connected-component clustering of the matched fragment pairs. The topline is an Adaptor Grammar with a unigram grammar, run on the gold phoneme transcription. Here, the topline performance is probably not attainable by unsupervised systems, since it uses the gold transcription; it is better viewed as a reference for the maximum value that it is reasonable to expect on the metrics used.

Table 2. Track 2 metrics for baseline and topline models on the English and Tsonga datasets.
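To make the baseline pipeline concrete, here is a toy Python sketch of the three ingredients just described: random projections to bucket candidate segments cheaply, DTW verification of candidate pairs (reusing the dtw_dist helper from the Track 1 sketch above), and connected-component clustering of the resulting match graph. All names and thresholds are illustrative assumptions; the actual JHU system operates on frame-level similarity matrices and is considerably more sophisticated.

```python
import numpy as np
from itertools import combinations

def lsh_signature(seg, planes):
    """Bit signature from random hyperplane projections of the
    mean-pooled segment: similar segments tend to share signatures."""
    v = seg.mean(axis=0)
    return tuple((planes @ v) > 0)

def discover_clusters(segments, n_bits=16, max_dist=0.2, seed=0):
    """Toy spoken term discovery: bucket by random projections,
    verify pairs with DTW, then take connected components."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, segments[0].shape[1]))
    # 1) bucket segments by LSH signature (random projections)
    buckets = {}
    for i, seg in enumerate(segments):
        buckets.setdefault(lsh_signature(seg, planes), []).append(i)
    # 2) verify candidate pairs within a bucket by DTW distance
    edges = []
    for idxs in buckets.values():
        for i, j in combinations(idxs, 2):
            if dtw_dist(segments[i], segments[j]) < max_dist:
                edges.append((i, j))
    # 3) connected components of the match graph = word-like clusters
    parent = list(range(len(segments)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i, j in edges:
        parent[find(i)] = find(j)
    clusters = {}
    for i in range(len(segments)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

In this sketch, each discovered cluster groups segment indices hypothesised to be tokens of the same word-like unit; the Track 2 metrics then score such clusters against the gold word transcription.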

Note: the spoken term discovery baseline can be found HERE. The Pitman-Yor Adaptor Grammar sampler can be found HERE.

The results can be found HERE. The papers covered are [8, 9].

Challenge References

[1] M. Versteegh, R. Thiollière, T. Schatz, X.-N. Cao, X. Anguera, A. Jansen, and E. Dupoux, “The zero resource speech challenge 2015,” in INTERSPEECH-2015, 2015. Available: http://www.isca-speech.org/archive/interspeech_2015/i15_3169.html

[2] R. Thiollière, E. Dunbar, G. Synnaeve, M. Versteegh, and E. Dupoux, “A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling,” in INTERSPEECH-2015, 2015. Available: http://www.isca-speech.org/archive/interspeech_2015/i15_3169.html

[3] L. Badino, A. Mereta, and L. Rosasco, “Discovering Discrete Subword Units with Binarized Autoencoders and Hidden-Markov-Model Encoders,” in INTERSPEECH-2015, 2015. Available: http://www.isca-speech.org/archive/interspeech_2015/i15_3174.html

[4] D. Renshaw, H. Kamper, A. Jansen, and S. Goldwater, “A Comparison of Neural Network Methods for Unsupervised Representation Learning on the Zero Resource Speech Challenge,” in INTERSPEECH-2015, 2015. Available: http://www.isca-speech.org/archive/interspeech_2015/i15_3199.html

[5] W. Agenbag and T. Niesler, “Automatic Segmentation and Clustering of Speech Using Sparse Coding and Metaheuristic Search,” in INTERSPEECH-2015, 2015. Available: http://www.isca-speech.org/archive/interspeech_2015/i15_3184.html

[6] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Parallel Inference of Dirichlet Process Gaussian Mixture Models for Unsupervised Acoustic Modeling: A Feasibility Study,” in INTERSPEECH-2015, 2015. Available: http://www.isca-speech.org/archive/interspeech_2015/i15_3189.html

[7] P. Baljekar, S. Sitaram, P. K. Muthukumar, and A. W. Black, “Using Articulatory Features and Inferred Phonological Segments in Zero Resource Speech Processing,” in INTERSPEECH-2015, 2015. Available: http://www.cs.cmu.edu/~pbaljeka/papers/IS2015.pdf

[8] O. Räsänen, G. Doyle, and M. C. Frank, “Unsupervised word discovery from speech using automatic segmentation into syllable-like units,” in INTERSPEECH-2015, 2015. Available: http://www.isca-speech.org/archive/interspeech_2015/i15_3204.html

[9] V. Lyzinski, G. Sell, and A. Jansen, “An Evaluation of Graph Clustering Methods for Unsupervised Term Discovery,” in INTERSPEECH-2015, 2015. Available: https://ccrma.stanford.edu/~gsell/pubs/2015_IS1.pdf