This challenge targets the unsupervised discovery of linguistic units from raw speech in an unknown language. Human infants accomplish this task within the first year of life through mere immersion in a language-speaking community, but it remains very difficult for machines, for which the dominant paradigm is massive supervision with large human-annotated datasets.
The idea behind this challenge is to push the envelope on adaptability and flexibility in speech recognition systems by setting up the rather extreme situation in which a whole language has to be learned from scratch. We expect this both to impact the Speech and Language Technology field, by providing algorithms that can supplement supervised systems when human-annotated corpora are scarce (under-resourced languages or dialects), and to help the basic science of infant research, by providing scalable quantitative models that can be compared to psycholinguistic data.
This challenge covers two levels of linguistic structure: subword units and word units. Both levels have been investigated in previous work (see [1-6] and [7-10], respectively), but the performance of the different systems has not yet been compared using common evaluation metrics and datasets. In the first track, we use a psychophysically inspired evaluation task (minimal pair ABX discrimination), and, in the second, metrics inspired by those used in NLP word segmentation applications (segmentation and token F-scores).
- Track 1: unsupervised subword modeling. The aim of this task is to construct a representation of speech sounds that is robust to within- and between-talker variation and supports word identification. The metric we will use is the ABX discriminability between phonemic minimal pairs (see [11,12]). The ABX discriminability between the minimal pair "beg" and "bag" is defined as the probability that A and X are closer together than B and X, where A and X are distinct tokens of "beg" and B is a token of "bag" (or vice versa), with distance defined as the DTW divergence of the representations of the tokens. Our global ABX discriminability score aggregates over the entire set of minimal pairs like "beg"-"bag" found in the corpus. We analyze the effects of within- and between-talker variation separately.
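The triple-level computation can be sketched as follows. This is a minimal illustration assuming NumPy arrays of frame-wise features (shape: frames x dims); the challenge's actual evaluation software additionally handles aggregation across minimal pairs, contexts, and talkers.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW divergence between two feature sequences a, b of shape
    (n_frames, n_dims). Frame-wise cost is the Euclidean distance;
    the accumulated path cost is normalized by the sequence lengths."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m] / (n + m)

def abx_score(cat_a_tokens, cat_b_tokens):
    """Fraction of (A, B, X) triples in which X is closer to A than to B,
    where A and X are distinct tokens of one category (e.g. "beg")
    and B is a token of the contrasting category (e.g. "bag")."""
    correct, total = 0, 0
    for i, x in enumerate(cat_a_tokens):
        for j, a in enumerate(cat_a_tokens):
            if i == j:
                continue  # A and X must be distinct tokens
            for b in cat_b_tokens:
                total += 1
                if dtw_distance(a, x) < dtw_distance(b, x):
                    correct += 1
    return correct / total
```

A perfect representation yields a score of 1.0 (chance is 0.5); the symmetric score swapping the roles of the two categories would be averaged in as well.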
- Track 2: spoken term discovery. The aim of this task is the unsupervised discovery of "words", defined as recurring speech fragments. Systems should take raw speech as input and output a list of speech fragments (timestamps referring to the original audio file) together with a discrete label for category membership. The evaluation will use a suite of F-score metrics that enables detailed assessment of the different components of a spoken term discovery pipeline (matching, clustering, segmentation, parsing) and thus supports a direct comparison with NLP models of unsupervised word segmentation.
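To give the flavor of these metrics, here is a minimal sketch of one of them, the segmentation (boundary) F-score. The exact-match comparison of boundary sets is an illustrative simplification; the full evaluation suite defines several such scores over matching, clustering, segmentation, and parsing.

```python
def boundary_fscore(gold, found):
    """Segmentation F-score: harmonic mean of the precision and recall of
    hypothesized boundaries against gold boundaries, each given as a
    collection of boundary positions (e.g. frame or sample indices)."""
    gold, found = set(gold), set(found)
    hits = len(gold & found)                       # correctly placed boundaries
    precision = hits / len(found) if found else 0.0
    recall = hits / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The token F-score is defined analogously, but over whole discovered fragments (both boundaries must match) rather than individual boundaries.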
Data and Sharing
To encourage teams from both the ASR and non-ASR communities to enter these tracks, all resources for this challenge (evaluation and baseline software, datasets) are free and open source. We strongly encourage applicants to make their systems available as open source as well. This is not only good scientific practice (it enables verification and replication); we also believe it will foster the growth of this domain by facilitating the emergence of new teams and participants.
Data for the challenge are drawn from two languages: an English dataset that is nevertheless treated as a zero resource language (meaning no pretraining with any English dataset is allowed), and a low resource language, Xitsonga. The data are made available in three sets.
- the sample set (2 speakers, 40 min each, English) is provided for anyone to download (see the Getting Started tab) together with the evaluation software.
- the English test dataset (casual conversations, 12 speakers, 16-30 min each, total 5h)
- the Xitsonga test dataset (read speech, 24 speakers, 2-29 minutes each, total 2h30).
To get these datasets, see Registration below. All datasets have been prepared in the following way:
- the original recordings were segmented into short files that contain only 'clean speech', i.e., no overlap, pauses, or nonspeech noises, and only the speech of a single speaker.
- the file names contain a talker ID. We kept this information because infants arguably have access to it when learning their language, and because it is relatively easy to recover anyway. The proposed systems may therefore openly use it.
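For example, exploiting the talker ID amounts to no more than parsing it out of the file name and grouping fragments by talker. The naming scheme below (`<talkerID>_<utteranceID>.wav`) is hypothetical; check the distributed data for the actual convention.

```python
import re
from collections import defaultdict

# Hypothetical naming scheme: "<talkerID>_<utteranceID>.wav", e.g. "s01_0042.wav".
# The actual scheme in the challenge data may differ.
def talker_id(filename):
    """Extract the talker ID from a file name of the form '<talker>_<utt>.wav'."""
    m = re.match(r"([^_]+)_", filename)
    if m is None:
        raise ValueError(f"unexpected file name: {filename}")
    return m.group(1)

def group_by_talker(filenames):
    """Map each talker ID to the list of its files (e.g. for per-talker training)."""
    groups = defaultdict(list)
    for f in filenames:
        groups[talker_id(f)].append(f)
    return dict(groups)
```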
This challenge is primarily driven by a scientific question: how could an infant or a system learn language(s) in an unsupervised fashion? We therefore expect submissions to emphasize novel and interesting ideas (as opposed to trying to get the best result through all possible means). Since we provide the evaluation software, there is the distinct possibility that it can be used to optimize system parameters for the particular corpus at hand. Doing this would blur the comparison between competing ideas and architectures, especially if this information is not disclosed. We therefore kindly ask participants to disclose, whenever they publish their work, whether and how they have used the evaluation software to tune system parameters.
Similarly, competitors should disclose the type of information they used to train their systems. In order to compare systems, we will distinguish those that use no training at all to derive the speech features (pure signal processing systems), those that use unsupervised training on the provided datasets (unsupervised systems), and those that use supervised training on other languages or mixtures of languages (transfer systems). Training features or models on another English dataset is prohibited, except for baseline comparisons.
The first round of the competition was presented as a special session at Interspeech 2015 (Sept. 6-10, 2015, Dresden; see the Interspeech 2015 proceedings and the Results tab), but the challenge remains open, and participants can still compete and try to beat the current best system. The only requirement is that the results be sent to the organizers so that we can update the results page.
Note that SLTU 2016 has a special topic on zero resource speech technology!
To register, send an email to  and follow the instructions in this github repository. If you encounter a problem, please send us an email ().
You can try out your systems without registering by downloading the starter kit (Getting Started tab).
- Xavier Anguera (Telefonica)
- Emmanuel Dupoux (Ecole des Hautes Etudes en Sciences Sociales, Paris)
- Aren Jansen (Johns Hopkins University, Baltimore)
- Maarten Versteegh (ENS, Paris)
- Thomas Schatz (ENS, Paris)
- Roland Thiollière (EHESS, Paris)
- Bogdan Ludusan (EHESS, Paris)
The organizers can be reached at
ZeroSpeech2015 is funded through an ERC grant to Emmanuel Dupoux (see website).
 Badino, L., Canevari, C., Fadiga, L., & Metta, G. (2014). An auto-encoder based approach to unsupervised learning of subword units. In ICASSP.
 Huijbregts, M., McLaren, M., & van Leeuwen, D. (2011). Unsupervised acoustic sub-word unit detection for query-by-example spoken term detection. In ICASSP (pp. 4436-4439).
 Jansen, A., Thomas, S., & Hermansky, H. (2013). Weak top-down constraints for unsupervised acoustic model training. In ICASSP (pp. 8091-8095).
 Lee, C., & Glass, J. (2012). A nonparametric Bayesian approach to acoustic model discovery. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, Volume 1 (pp. 40-49).
 Varadarajan, B., Khudanpur, S., & Dupoux, E. (2008). Unsupervised learning of acoustic subword units. In Proceedings of ACL-08: HLT (pp. 165-168).
 Synnaeve, G., Schatz, T., & Dupoux, E. (2014). Phonetics embedding learning with side information. In IEEE SLT.
 Siu, M., Gish, H., Chan, A., Belfield, W., & Lowe, S. (2014). Unsupervised training of an HMM-based self-organizing unit recognizer with applications to topic classification and keyword discovery. Computer Speech & Language, 28(1), 210-223.
Spoken term discovery
 Jansen, A., & Van Durme, B. (2011). Efficient spoken term discovery using randomized algorithms. In IEEE ASRU Workshop (pp. 401-406).
 Muscariello, A., Gravier, G., & Bimbot, F. (2012). Unsupervised motif acquisition in speech via seeded discovery and template matching combination. IEEE Transactions on Audio, Speech, and Language Processing, 20(7), 2031-2044.
 Park, A. S., & Glass, J. R. (2008). Unsupervised pattern discovery in speech. IEEE Transactions on Audio, Speech, and Language Processing, 16(1), 186-197.
 Zhang, Y., & Glass, J. R. (2010). Towards multi-speaker unsupervised speech pattern discovery. In ICASSP (pp. 4366-4369).
 Schatz, T., Peddinti, V., Cao, X.-N., Bach, F., Hermansky, H., & Dupoux, E. (2014). Evaluating speech features with the Minimal-Pair ABX task (II): Resistance to noise. In Interspeech.
 Schatz, T., Peddinti, V., Bach, F., Jansen, A., Hermansky, H., & Dupoux, E. (2013). Evaluating speech features with the Minimal-Pair ABX task: Analysis of the classical MFC/PLP pipeline. In Interspeech (pp. 1781-1785).
 Ludusan, B., Versteegh, M., Jansen, A., Gravier, G., Cao, X.-N., Johnson, M., & Dupoux, E. (2014). Bridging the gap between speech technology and natural language processing: an evaluation toolbox for term discovery systems. In Proceedings of LREC.