This challenge targets the unsupervised discovery of linguistic units from raw speech in an unknown language. Human infants accomplish this task within their first year of life through mere immersion in a speech community, yet it remains very difficult for machines, where the dominant paradigm relies on massive supervision from large human-annotated datasets.

The idea behind this challenge is to push the envelope on the notion of adaptability and flexibility in speech recognition systems by setting up the rather extreme situation in which a whole language has to be learned from scratch. We expect this both to impact the Speech and Language Technology field, by providing algorithms that can supplement supervised systems when human-annotated corpora are scarce (under-resourced languages or dialects), and to help the basic science of infant language research, by providing scalable quantitative models that can be compared to psycholinguistic data.

This challenge covers two levels of linguistic structure: subword units and word units. Both levels have been investigated in previous work (see [1-7] and [8-11], respectively), but the performance of the different systems has not yet been compared using common evaluation metrics and datasets. In the first track, we use a psychophysically inspired evaluation task (minimal-pair ABX discrimination), and, in the second, metrics inspired by those used in NLP word segmentation applications (segmentation and token F-scores).
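To make these two metrics concrete, here is a minimal Python sketch. It is our own illustration, not the official evaluation code: `abx_error` assumes fixed-length embeddings compared with cosine distance, whereas the official Track 1 evaluation computes DTW distances over frame-level representations and aggregates scores over talkers and contexts [12, 13]; `token_fscore` counts a discovered token as correct only on an exact span match, a simplification of the Track 2 toolbox [14], which also reports segmentation (boundary) scores.

```python
import numpy as np

def cosine(u, v):
    """Cosine distance between two fixed-length embedding vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def abx_error(cat_a, cat_b, dist=cosine):
    """ABX error over all triples: A and X are distinct tokens of the same
    category, B is a token of the contrasting category; an error is scored
    when X is closer to B than to A (ties count as half an error)."""
    errors, total = 0.0, 0
    for i, a in enumerate(cat_a):
        for j, x in enumerate(cat_a):
            if i == j:
                continue  # X must be a different token than A
            for b in cat_b:
                d_ax, d_bx = dist(a, x), dist(b, x)
                if d_ax > d_bx:
                    errors += 1.0
                elif d_ax == d_bx:
                    errors += 0.5
                total += 1
    return errors / total

def token_fscore(gold_tokens, found_tokens):
    """Token F-score with exact matching: a discovered (start, end) span
    is a hit only if it coincides exactly with a gold word token."""
    gold, found = set(gold_tokens), set(found_tokens)
    hits = len(gold & found)
    if hits == 0:
        return 0.0
    precision, recall = hits / len(found), hits / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For instance, `token_fscore({(0, 3), (3, 5)}, {(0, 3), (3, 6)})` yields a precision and recall of 0.5 each, hence a token F-score of 0.5.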

You can find more details on these two tracks in the relevant tabs (Track 1 and Track 2).

Data and Sharing

To encourage teams from both ASR and non-ASR communities to enter these tracks, all of the resources for this challenge (evaluation and baseline software, datasets) are free and open source. We strongly encourage participants to make their systems available in an open source fashion as well. This is not only good scientific practice (it enables verification and replication); we also believe it will encourage the growth of this domain by facilitating the emergence of new teams and participants.

Data for the challenge is drawn from two languages: an English dataset that is nevertheless treated as a zero resource language (which means that no pretraining on other English data is allowed), and a low resource language, Xitsonga. The data is made available in three sets.

To get these datasets, see Registration below. All datasets have been prepared in the following way:

Ground rules

This challenge is primarily driven by a scientific question: how could an infant, or a system, learn language(s) in an unsupervised fashion? We therefore expect submissions to emphasize novel and interesting ideas (as opposed to trying to obtain the best result through all possible means). Since we provide the evaluation software, there is the distinct possibility that it will be used to optimize system parameters for the particular corpus at hand. Doing so would blur the comparison between competing ideas and architectures, especially if this information is not disclosed. We therefore kindly ask participants to disclose, whenever they publish their work, whether and how they have used the evaluation software to tune particular system parameters.

Similarly, competitors should disclose the type of information they have used to train their systems. In order to compare systems, we will distinguish those that use no training at all to derive the speech features (bare signal processing systems), those that use unsupervised training on the provided datasets (unsupervised systems), and those that use supervised training on other languages or mixtures of languages (transfer systems). Training features or models on another English dataset is prohibited, except for baseline comparison.

Registration

The first round of the competition was presented as a special session at Interspeech 2015 (September 6-10, 2015, Dresden; see the Interspeech 2015 proceedings and the Results tab), but the challenge remains open: participants can still compete and try to beat the current best system. The only requirement is that the results be sent to the organizers so that we can update the Results page.

Note that the upcoming SLTU 2016 features a special topic on zero resource speech technology!

To register, send an email to the organizers and follow the instructions in this github repository. If you encounter a problem, please send us an email.

You can try out your systems without registering by downloading the starter kit (see the Getting Started tab).

Organizers

Challenge Organization

Track 1

Track 2

Contact

The organizers can be reached at

Sponsors

ZeroSpeech2015 is funded through an ERC grant to Emmanuel Dupoux (see website).

References

Subword units/embeddings

[1] Badino, L., Canevari, C., Fadiga, L., & Metta, G. (2014). An auto-encoder based approach to unsupervised learning of subword units. In ICASSP.

[2] Huijbregts, M., McLaren, M., & van Leeuwen, D. (2011). Unsupervised acoustic sub-word unit detection for query-by-example spoken term detection. In ICASSP (pp. 4436-4439).

[3] Jansen, A., Thomas, S., & Hermansky, H. (2013). Weak top-down constraints for unsupervised acoustic model training. In ICASSP (pp. 8091-8095).

[4] Lee, C., & Glass, J. (2012). A nonparametric Bayesian approach to acoustic model discovery. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1 (pp. 40-49).

[5] Varadarajan, B., Khudanpur, S., & Dupoux, E. (2008). Unsupervised learning of acoustic subword units. In Proceedings of ACL-08: HLT (pp. 165-168).

[6] Synnaeve, G., Schatz, T., & Dupoux, E. (2014). Phonetics embedding learning with side information. In IEEE SLT.

[7] Siu, M., Gish, H., Chan, A., Belfield, W., & Lowe, S. (2014). Unsupervised training of an HMM-based self-organizing unit recognizer with applications to topic classification and keyword discovery. Computer Speech & Language, 28(1), 210-223.

Spoken term discovery

[8] Jansen, A., & Van Durme, B. (2011). Efficient spoken term discovery using randomized algorithms. In IEEE ASRU Workshop (pp. 401-406).

[9] Muscariello, A., Gravier, G., & Bimbot, F. (2012). Unsupervised motif acquisition in speech via seeded discovery and template matching combination. IEEE Transactions on Audio, Speech, and Language Processing, 20(7), 2031-2044.

[10] Park, A. S., & Glass, J. R. (2008). Unsupervised pattern discovery in speech. IEEE Transactions on Audio, Speech, and Language Processing, 16(1), 186-197.

[11] Zhang, Y., & Glass, J. R. (2010). Towards multi-speaker unsupervised speech pattern discovery. In ICASSP (pp. 4366-4369).

Evaluation metrics

[12] Schatz, T., Peddinti, V., Cao, X.N., Bach, F., Hermansky, H., & Dupoux, E. (2014). Evaluating speech features with the minimal-pair ABX task (II): Resistance to noise. In Interspeech.

[13] Schatz, T., Peddinti, V., Bach, F., Jansen, A., Hermansky, H., & Dupoux, E. (2013). Evaluating speech features with the minimal-pair ABX task: Analysis of the classical MFC/PLP pipeline. In Interspeech (pp. 1781-1785).

[14] Ludusan, B., Versteegh, M., Jansen, A., Gravier, G., Cao, X.N., Johnson, M., & Dupoux, E. (2014). Bridging the gap between speech technology and natural language processing: An evaluation toolbox for term discovery systems. In Proceedings of LREC.