Emmanuel Dupoux Home Page

Emmanuel Dupoux
Email: emmanuel.dupoux at gmail dot com

Directeur d'Études
École des Hautes Études en Sciences Sociales

Laboratoire de Sciences Cognitives et Psycholinguistique.
29 rue d'Ulm, 75005 Paris, France.
tel: (+33 1) 44 32 26 16, fax: (+33 1) 44 32 26 30.

Vitae
Research Topics
Publications
Teaching

My research focuses on the mechanisms and representations specific to the human brain that allow the human baby to acquire one or several languages and become cognitively functional in his or her culture. This investigation is conducted using behavioral methods in adults and infants, brain imaging, and computational modeling with machine learning techniques.

Updated: Apr 2024

Vitae

Born November 30th 1964 in Paris.

2002- Co-creator and director of the Cognitive Science Master program (see the CogMaster site).

1998-2009 Director of Laboratoire de Sciences Cognitives et Psycholinguistique (LSCP).

1992 Diploma in Telecom Engineering at Télécom Paris.

1989-1990 Post-doc at the Cognitive Science Program, Univ. of Arizona.

1989 PhD in Cognitive Psychology, EHESS, Paris.

1984-1988 Student at École Normale Supérieure

Complete Vitae

Research topics

In my research, I have been focusing on the early acquisition of linguistic and social skills in infants and their more or less reversible consequences in adults, in terms of a cognitive specialization for a particular language or culture. My approach is to run comparative studies in adults and infants, and test theoretical models that take into account both types of studies. More recently, I have been exploring how machine learning and artificial intelligence can provide quantitative models of processing and learning in infants. For more details on this current activity, see the Cognitive Machine Learning (CoML) INRIA team website. Below are three of my major research interests:

Modeling early language acquisition.

The developmental landmarks of language acquisition during the first years of life have been well described, but the mechanisms underpinning them remain poorly understood. The complexity of the learning problem that infants face is daunting: they have to acquire, mostly without supervision, several interdependent aspects of language simultaneously: phonology, morphology, syntax, semantics.

The aim of this project is to apply machine learning and signal processing techniques (Bayesian models, HMM, etc.) to corpora of child-adult verbal interactions and develop unsupervised algorithms which can extract phonological categories (syllables, phonemes, features). These algorithms are then tested in infants or newborns using either behavioral techniques or noninvasive brain imaging (near-infrared spectroscopy, EEG).

See the summary of the main results in:

Modelling phonological acquisition (in preparation) [PDF]

Dupoux, E. (2009). How Do Infants Bootstrap into Spoken Language?: Models and Challenges. ICML, McGill, June 2009. [video lecture]

More details on the CoML website.

Phonological 'deafnesses' in speech perception: acquisition and plasticity.

Infants can effortlessly learn one or several languages at the same time. Yet adults have a hard time acquiring a second language. Why? In this project, we investigate the hypothesis that part of the difficulty is due to an early specialization and subsequent lack of plasticity of perceptual processes after a certain critical age.

We test this hypothesis by conducting experiments on the development of phonological categories during the first year of life, and on perception and production in monolingual and bilingual adults. We use psycholinguistic methods, as well as brain imaging (ERPs, fMRI). The issue of the residual functional plasticity for language is also studied in neurological patients with language impairment.

See the summary of the main results in:

Phonological deafnesses: a summary [PDF].

The development of social cognition.

Humans have unique abilities to help, communicate, and cooperate with their conspecifics. Correlatively, they also have a high propensity to cheat, defect and harm their conspecifics. It is therefore plausible that we have adaptive mechanisms devoted to the quick evaluation of the actions and dispositions of other humans. Here, we study the biological, developmental and psychological bases of these implicit social evaluation mechanisms and the role they play in the emergence of explicit cultural systems of social and moral norms.

We conduct experiments in infants, toddlers, adults, and patients using animated cartoons showing characters that perform prosocial or antisocial actions towards conspecifics. We test the social evaluation of these characters and the inferences that are drawn from them using a variety of explicit and implicit measures.

See the summary of the main results in:

Moral and social development: a summary [PDF].

This work is supported in part by a grant from the French ministry of research (SOCODEV ANR-09-BLAN-0327 CSD 9).

Note. For some papers, you can read an abstract and/or download a ready-to-print PDF file. To print or preview a PDF file, use Acrobat Reader.

Peer Reviewed Articles

Marvin Lavechin, Maureen de Seyssel, Marianne Métais, Florian Metze, Abdelrahman Mohamed, Hervé Bredin, Emmanuel Dupoux & Alejandrina Cristia (2024). Modeling early phonetic acquisition from child-centered audio data. Cognition, 245, 105734. [abstract] Infants learn their native language(s) at an amazing speed. Before they even talk, their perception adapts to the language(s) they hear. However, the mechanisms responsible for this perceptual attunement and the circumstances in which it takes place remain unclear. This paper presents the first attempt to study perceptual attunement using ecological child-centered audio data. We show that a simple prediction algorithm exhibits perceptual attunement when applied on unrealistic clean audio-book data, but fails to do so when applied on ecologically-valid child-centered data. In the latter scenario, perceptual attunement only emerges when the prediction mechanism is supplemented with inductive biases that force the algorithm to focus exclusively on speech segments while learning speaker-, pitch-, and room-invariant representations. We argue these biases are plausible given previous research on infants and non-human animals. More generally, we show that what our model learns and how it develops through exposure to speech depends exquisitely on the details of the input signal. By doing so, we illustrate the importance of considering ecologically valid input data when modeling language acquisition.

de Seyssel, M., Lavechin, M., Titeux, H., Thomas, A., Virlet, G., Revilla, A.S., Wisniewski, G., Ludusan, B. & Dupoux, E. (2023). ProsAudit, a prosodic benchmark for self-supervised speech models. In INTERSPEECH-2023 (pp 2963-2967). [abstract] We present ProsAudit, a benchmark in English to assess structural prosodic knowledge in self-supervised learning (SSL) speech models. It consists of two subtasks, their corresponding metrics, and an evaluation dataset. In the protosyntax task, the model must correctly identify strong versus weak prosodic boundaries. In the lexical task, the model needs to correctly distinguish between pauses inserted between words and within words. We also provide human evaluation scores on this benchmark. We evaluated a series of SSL models and found that they were all able to perform above chance on both tasks, even when evaluated on an unseen language. However, non-native models performed significantly worse than native ones on the lexical task, highlighting the importance of lexical knowledge in this task. We also found a clear effect of size with models trained on more data performing better in the two subtasks.

de Seyssel, M., Lavechin, M. & Dupoux, E. (2023). Realistic and broad-scope learning simulations: first results and challenges. Journal of Child Language. [abstract] There is a current 'theory crisis' in language acquisition research, resulting from fragmentation both at the level of the approaches and the linguistic level studied. We identify a need for integrative approaches that go beyond these limitations, and propose to analyse the strengths and weaknesses of current theoretical approaches of language acquisition. In particular, we advocate that language learning simulations, if they integrate realistic input and multiple levels of language, have the potential to contribute significantly to our understanding of language acquisition. We then review recent results obtained through such language learning simulations. Finally, we propose some guidelines for the community to build better simulations.

Taillandier, V., Hupkes, D., Sagot, B., Dupoux, E. & Michel, P. (2023). Neural Agents Struggle to Take Turns in Bidirectional Emergent Communication. In ICLR.

Sy, Y., Havard, W.N., Lavechin, M., Dupoux, E. & Cristia, A. (2023). Measuring language development from child-centered recordings. In Proceedings of INTERSPEECH (pp 4618-4622). [abstract] Standard ways to measure child language development from spontaneous corpora rely on detailed linguistic descriptions of a language as well as exhaustive transcriptions of the child's speech, which today can only be done through costly human labor. We tackle both issues by proposing (1) a new language development metric (based on entropy) that does not require linguistic knowledge other than having a corpus of text in the language in question to train a language model, (2) a method to derive this metric directly from speech based on a smaller text-speech parallel corpus. Here, we present descriptive results on an open archive including data from six English-learning children as a proof of concept. We show that our entropy metric documents a gradual convergence of children's speech towards adults' speech as a function of age, and it also correlates moderately with lexical and morphosyntactic measures derived from morphologically-parsed transcriptions.

Poli, M., Dupoux, E. & Riad, R. (2023). Introducing Topography in Convolutional Neural Networks. In Proc. of ICASSP (pp 1-5). IEEE. [abstract] Parts of the brain that carry sensory tasks are organized topographically: nearby neurons are responsive to the same properties of input signals. Thus, in this work, inspired by the neuroscience literature, we proposed a new topographic inductive bias in Convolutional Neural Networks (CNNs). To achieve this, we introduced a new topographic loss and an efficient implementation to topographically organize each convolutional layer of any CNN. We benchmarked our new method on 4 datasets and 3 models in vision and audio tasks and showed equivalent performance to all benchmarks. Besides, we also showcased the generalizability of our topographic loss with how it can be used with different topographic organizations in CNNs. Finally, we demonstrated that adding the topographic inductive bias made CNNs more resistant to pruning. Our approach provides a new avenue to obtain models that are more memory efficient while maintaining better accuracy.

Nguyen, T.A., Kharitonov, E., Copet, J., Adi, Y., Hsu, W.N., Elkahky, A., Tomasello, P., Algayres, R., Sagot, B., Mohamed, A. & Dupoux, E. (2023). Generative Spoken Dialogue Language Modeling. Transactions of the Association for Computational Linguistics. [abstract] We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter and other paralinguistic signals in the two channels simultaneously and reproduces more naturalistic and fluid turn taking compared to a text-based cascaded model.

Nguyen, T.A., Hsu, W.N., d'Avirro, A., Shi, B., Gat, I., Fazel-Zarani, M., Remez, T., Copet, J., Synnaeve, G., Hassid, M. et al. (2023). Expresso: A benchmark and analysis of discrete expressive speech resynthesis. In INTERSPEECH-2023 (pp 4823-4827). [abstract] Recent work has shown that it is possible to resynthesize high-quality speech based, not on text, but on low-bitrate discrete units that have been learned in a self-supervised fashion and can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalization). The adoption of these methods is still limited by the fact that most speech synthesis datasets are read, severely limiting spontaneity and expressivity. Here, we introduce EXPRESSO, a high-quality expressive speech dataset for textless speech synthesis that includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles. We illustrate the challenges and potentials of this dataset with an expressive resynthesis benchmark where the task is to encode the input in low-bitrate units and resynthesize it in a target voice while preserving content and style. We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders, and explore tradeoffs between quality, bitrate and invariance to speaker and style. All the dataset, evaluation metrics and baseline models are open sourced.

Lavechin, M., Sy, Y., Titeux, H., Cruz Blandón, M.A., Räsänen, O., Bredin, H., Dupoux, E. & Cristia, A. (2023). BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models. In INTERSPEECH-2023 (pp 4588-4592).

Lavechin, M., Métais, M., Titeux, H., Boissonnet, A., Copet, J., Rivière, M., Bergelson, E., Cristia, A., Dupoux, E. & Bredin, H. (2023). Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (pp 1-7). [abstract] Most automatic speech processing systems register degraded performance when applied to noisy or reverberant speech. But how can one tell whether speech is noisy or reverberant? We propose Brouhaha, a neural network jointly trained to extract speech/non-speech segments, speech-to-noise ratios, and C50 room acoustics from single-channel recordings. Brouhaha is trained using a data-driven approach in which noisy and reverberant audio segments are synthesized. We first evaluate its performance and demonstrate that the proposed multi-task regime is beneficial. We then present two scenarios illustrating how Brouhaha can be used on naturally noisy and reverberant data: 1) to investigate the errors made by a speaker diarization model (pyannote.audio); and 2) to assess the reliability of an automatic speech recognition model (Whisper from OpenAI). Both our pipeline and a pretrained model are open source and shared with the speech community.

Hassid, M., Remez, T., Nguyen, T.A., Gat, I., Conneau, A., Kreuk, F., Copet, J., Defossez, A., Synnaeve, G., Dupoux, E., Schwartz, R. & Adi, Y. (2023). Textually pretrained speech language models. In NeurIPS, 36 (pp 63483-63501). [abstract] Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm-start from a pretrained textual language model. We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model design choices such as the speech tokenizer, the pretrained textual model, and the dataset size. We find that model and dataset scale both play an important role in constructing better-performing SpeechLMs. Based on our observations, we present the largest (to the best of our knowledge) SpeechLM both in terms of number of parameters and training data. We additionally introduce two spoken versions of the StoryCloze textual benchmark to further improve model evaluation and advance future research in the field. We make speech samples, code and models publicly available.

Hallap, M., Dupoux, E. & Dunbar, E. (2023). Evaluating context-invariance in unsupervised speech representations. In INTERSPEECH-2023 (pp 2973-2977). [abstract] Unsupervised speech representations have taken off with benchmarks demonstrating major progress on semi-supervised speech recognition, speech synthesis, and speech-only language modelling. Inspiration comes from the promise of discovering the phonemes of a language or a similar low-bitrate encoding. However, one of the critical properties of phoneme transcriptions is context-invariance: the phonetic context of a speech sound can have massive influence on the way it is pronounced while text remains stable. This is why tokens of the same word have the same transcriptions, which is key to language understanding. Current benchmarks do not measure context-stability. We develop a new version of the ZeroSpeech ABX benchmark that does, and apply it to recent self-supervised representations. We show that context-independence of representations is predictive of the stability of word-level representations. We suggest research concentrate on improving context-independence of unsupervised representations.

Gat, I., Kreuk, F., Nguyen, T.A., Lee, A., Copet, J., Synnaeve, G., Dupoux, E. & Adi, Y. (2023). Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling. In The 20th International Conference on Spoken Language Translation (pp 465-477). [abstract] Generative Spoken Language Modeling research focuses on optimizing speech Language Models (LMs) using raw audio recordings without accessing any textual supervision. Such speech LMs usually operate over discrete units obtained from quantizing internal representations of self-supervised models. Although such units show impressive modeling results, their robustness capabilities have not been extensively investigated. This work focuses on improving the invariance of discrete input representations to non-spoken augmentations for generative spoken language modeling. First, we formally define how to measure the robustness of such representations to various signal variations that do not alter the spoken information (e.g., time-stretch). Next, we empirically demonstrate how current state-of-the-art representation models lack robustness to such variations. To overcome this, we propose an effective and efficient method to learn invariant discrete speech representation for generative spoken language modeling. The proposed approach is based on applying a set of signal transformations to the speech signal and optimizing the model using an iterative pseudolabeling scheme. Our method significantly improves over the evaluated baselines when considering encoding and modeling metrics. We additionally evaluate our method on the speech-to-speech translation task, considering Spanish-English and French-English translations, and show the proposed approach outperforms the evaluated baselines.

Elkahky, A., Hsu, W.N., Tomasello, P., Nguyen, T.A., Algayres, R., Adi, Y., Copet, J., Dupoux, E. & Mohamed, A. (2023). Do Coarser Units Benefit Cluster Prediction-Based Speech Pre-Training? In Proc. of ICASSP (pp 1-5). [abstract] The research community has produced many successful self-supervised speech representation learning methods over the past few years. Discrete units have been utilized in various self-supervised learning frameworks, such as VQ-VAE [1], wav2vec 2.0 [2], HuBERT [3], and Wav2Seq [4]. This paper studies the impact of altering the granularity and improving the quality of these discrete acoustic units for pre-training encoder-only and encoder-decoder models. We systematically study the current proposals of using Byte-Pair Encoding (BPE) and new extensions that use cluster smoothing and Brown clustering. The quality of learned units is studied intrinsically using zero speech metrics and on the down-stream speech recognition (ASR) task. Our results suggest that longer-range units are helpful for encoder-decoder pre-training; however, encoder-only masked-prediction models cannot yet benefit from self-supervised word-like targets.

Bernard, M., Poli, M., Karadayi, J. & Dupoux, E. (2023). Shennong: a Python toolbox for audio speech features extraction. Behavior Research Methods. [abstract] We introduce Shennong, a Python toolbox and command-line utility for audio speech features extraction. It implements a wide range of well-established state-of-the-art algorithms: spectro-temporal filters such as Mel-Frequency Cepstral Filterbank or Predictive Linear Filters, pre-trained neural networks, pitch estimators, speaker normalization methods, and post-processing algorithms. Shennong is an open source, reliable and extensible framework built on top of the popular Kaldi speech processing library. The Python implementation makes it easy to use by non-technical users and integrates with third-party speech modeling and machine learning tools from the Python ecosystem. This paper describes the Shennong software architecture, its core components, and implemented algorithms. Then, three applications illustrate its use. We first present a benchmark of speech features extraction algorithms available in Shennong on a phone discrimination task. We then analyze the performances of a speaker normalization model as a function of the speech duration used for training. We finally compare pitch estimation algorithms on speech under various noise conditions.

Algayres, R., Adi, Y., Nguyen, T., Copet, J., Synnaeve, G., Sagot, B. & Dupoux, E. (2023). Generative Spoken Language Model based on continuous word-sized audio tokens. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp 3008-3028). [abstract] In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard input of spoken LMs are 20ms or 40ms-long discrete units (shorter than a phoneme). Taking inspiration from word-based LM, we introduce a Generative Spoken Language Model (GSLM) based on word-size continuous-valued audio tokens that can generate diverse and expressive language output. This is obtained by replacing lookup table for lexical types with a Lexical Embedding function, the cross entropy loss by a contrastive loss, and multinomial sampling by k-NN sampling. The resulting model is the first generative language model based on word-size continuous tokens. Its performance is on par with discrete unit GSLMs regarding generation quality as measured by automatic metrics and subjective human judgements. Moreover, it is five times more memory efficient thanks to its large 200ms units. In addition, the embeddings before and after the Lexical Embedder are phonetically and semantically interpretable.

de Seyssel, M., Wisniewski, G., Dupoux, E. & Ludusan, B. (2022). Investigating the usefulness of i-vectors for automatic language characterization. In Proceedings of Speech Prosody (pp 460-464). [abstract] Work done in recent years has shown the usefulness of using automatic methods for the study of linguistic typology. However, the majority of proposed approaches come from natural language processing and require expert knowledge to predict typological information for new languages. An alternative would be to use speech-based methods that do not need extensive linguistic annotations, but considerably less work has been done in this direction. The current study aims to reduce this gap, by investigating a promising speech representation, i-vectors, which by capturing suprasegmental features of language, can be used for the automatic characterization of languages. Employing data from 24 languages, covering several linguistic families, we computed the i-vectors corresponding to each sentence and we represented the languages by their centroid i-vector. Analyzing the distance between the language centroids and phonological, inventory and syntactic distances between the same languages, we observed a significant correlation between the i-vector distance and the syntactic distance. Then, we explored in more detail a number of syntactic features and we proposed a method for predicting the value of the most promising feature, based on the i-vector information. The obtained results, an 87% classification accuracy, are encouraging and we envision to extend this method further.

de Seyssel, M., Wisniewski, G. & Dupoux, E. (2022). Is the Language Familiarity Effect gradual? A computational modelling approach. In A.P. J. Culbertson (ed) Proceedings of Cognitive Science (pp 1728-1735). [abstract] According to the Language Familiarity Effect (LFE), people are better at discriminating between speakers of their native language. Although this cognitive effect was largely studied in the literature, experiments have only been conducted on a limited number of language pairs and their results only show the presence of the effect without yielding a gradual measure that may vary across language pairs. In this work, we show that the computational model of LFE introduced by Thorburn, Feldman, and Schatz (2019) can address these two limitations. In a first experiment, we attest to this model's capacity to obtain a gradual measure of the LFE by replicating behavioural findings on native and accented speech. In a second experiment, we evaluate LFE on a large number of language pairs, including many which have never been tested on humans. We show that the effect is replicated across a wide array of languages, providing further evidence of its universality. Building on the gradual measure of LFE, we also show that languages belonging to the same family yield smaller scores, supporting the idea of an effect of language distance on LFE.

de Seyssel, M., Lavechin, M., Adi, Y., Dupoux, E. & Wisniewski, G. (2022). Probing phoneme, language and speaker information in unsupervised speech representations. In INTERSPEECH (pp 1402-1406). [abstract] Unsupervised models of representations based on Contrastive Predictive Coding (CPC) [1] are primarily used in spoken language modelling in that they encode phonetic information. In this study, we ask what other types of information are present in CPC speech representations. We focus on three categories: phone class, gender and language, and compare monolingual and bilingual models. Using qualitative and quantitative tools, we find that both gender and phone class information are present in both types of models. Language information, however, is very salient in the bilingual model only, suggesting CPC models learn to discriminate languages when trained on multiple languages. Some language information can also be retrieved from monolingual models, but it is more diffused across all features. These patterns hold when analyses are carried on the discrete units from a downstream clustering model. However, although there is no effect of the number of target clusters on phone class and language information, more gender information is encoded with more clusters. Finally, we find that there is some cost to being exposed to two languages on a downstream phoneme discrimination task.

Tomasello, P., Shrivastava, A., Lazar, D., Le, D., Sagar, A., Elkahky, A., Copet, J., Hsu, W.N., Adi, Y., Algayres, R., Nguyen, T.A., Dupoux, E., Zettlemoyer, L. & Mohamed, A. (2022). STOP: A dataset for Spoken Task Oriented Semantic Parsing. In IEEE SLT-2022 (pp 991-998). [abstract] End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model. It promises to improve the performance of assistant systems by leveraging acoustic information lost in the intermediate textual representation and preventing cascading errors from Automatic Speech Recognition (ASR). Further, having one unified model has efficiency advantages when deploying assistant systems on-device. Unfortunately, the limited number of public audio datasets with semantic parse labels hinders the research progress in this area. In this paper, we release the Spoken task-oriented semantic parsing (STOP) dataset, the largest and most complex SLU dataset to be publicly available. Additionally, we define low-resource splits to establish a benchmark for improving SLU when limited labeled data is available. Furthermore, in addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems.

Rita, M., Tallec, C., Michel, P., Grill, J.B., Pietquin, O., Dupoux, E. & Strub, F. (2022). Emergent communication: generalization and overfitting in Lewis games. In NeurIPS. [abstract] Lewis signaling games are a class of simple communication games for simulating the emergence of language. In these games, two agents must agree on a communication protocol in order to solve a cooperative task. Previous work has shown that agents trained to play this game with reinforcement learning tend to develop languages that display undesirable properties from a linguistic point of view (lack of generalization, lack of compositionality, etc). In this paper, we aim to provide better understanding of this phenomenon by analytically studying the learning problem in Lewis games. As a core contribution, we demonstrate that the standard objective in Lewis games can be decomposed in two components: a co-adaptation loss and an information loss. This decomposition enables us to surface two potential sources of overfitting, which we show may undermine the emergence of a structured communication protocol. In particular, when we control for overfitting on the co-adaptation loss, we recover desired properties in the emergent languages: they are more compositional and generalize better.

Rita, M., Strub, F., Grill, J.B., Pietquin, O. & Dupoux, E. (2022). On the role of population heterogeneity in emergent communication. In ICLR. [abstract] ABSTRACT = Populations have often been perceived as a structuring component for language to emerge and evolve: the larger the population, the more structured the language. While this observation is widespread in the sociolinguistic literature, it has not been consistently reproduced in computer simulations with neural agents. In this paper, we thus aim to clarify this apparent contradiction. We explore emergent language properties by varying agent population size in the speaker-listener Lewis Game. After reproducing the experimental difference, we challenge the simulation assumption that the agent community is homogeneous. We then investigate how speaker-listener asymmetry alters language structure through the analysis of a potential diversity factor: learning speed. From there, we leverage this observation to control population heterogeneity without introducing confounding factors. We finally show that introducing such training speed heterogeneities naturally sorts out the initial contradiction: larger simulated communities start developing more stable and structured languages.

Riad, R., Titeux, H., Lemoine, L., Montillot, J., Sliwinski, A., Xuan-Nga, C., Bachoud-Lévi, A.C. & Dupoux, E. (2022). A comparison study on patient-psychologist voice diarization. In Ninth Workshop on Speech and Language Processing for Assistive Technologies (SLPAT-2022), (pp 30--36) . [abstract] Conversations between a clinician and a patient, in natural conditions, are valuable sources of information for medical follow-up. The automatic analysis of these dialogues could help extract new language markers and speed up the clinicians' reports. Yet, it is not clear which model is the most efficient to detect and identify the speaker turns, especially for individuals with speech disorders. Here, we proposed a split of the data that allows conducting a comparative evaluation of different diarization methods. We designed and trained end-to-end neural network architectures to directly tackle this task from the raw signal and evaluate each approach under the same metric. We also studied the effect of fine-tuning models to find the best performance. Experimental results are reported on naturalistic clinical conversations between Psychologists and Interviewees, at different stages of Huntington's disease, displaying a large panel of speech disorders. We found that our best end-to-end model achieved 19.5% IER on the test set, compared to 23.6% achieved by fine-tuning the X-vector architecture. Finally, we observed that we could extract clinical markers directly from the automatic systems, highlighting the clinical relevance of our methods.

Riad, R., Lunven, M., Titeux, H., Xuan-Nga, C., Hamet Bagnou, J., Lemoine, L., Montillot, J., Sliwinski, A., Youssov, K., Cleret de Langavant, L., Dupoux, E. & Bachoud-Lévi, A.C. (2022). Predicting clinical scores in Huntington's disease: a lightweight speech test. Journal of Neurology, 269, 5008--5021. [abstract] Objectives: Using brief samples of speech recordings, we aimed at predicting, through machine learning, the clinical performance in Huntington's Disease (HD), an inherited Neurodegenerative disease (NDD). Methods: We collected and analyzed 126 samples of audio recordings of both forward and backward counting from 103 Huntington's disease gene carriers [87 manifest and 16 premanifest; mean age 50.6 (SD 11.2), range (27--88) years] from three multicenter prospective studies in France and Belgium (MIG-HD (ClinicalTrials.gov NCT00190450); BIO-HD (ClinicalTrials.gov NCT00190450) and Repair-HD (ClinicalTrials.gov NCT00190450)). We pre-registered all of our methods before running any analyses, in order to avoid inflated results. We automatically extracted 60 speech features from blindly annotated samples. We used machine learning models to combine multiple speech features in order to make predictions at individual levels of the clinical markers. We trained machine learning models on 86% of the samples, the remaining 14% constituted the independent test set. We combined speech features with demographics variables (age, sex, CAG repeats, and burden score) to predict cognitive, motor, and functional scores of the Unified Huntington's disease rating scale. We provided correlation between speech variables and striatal volumes. Results: Speech features combined with demographics allowed the prediction of the individual cognitive, motor, and functional scores with a relative error from 12.7 to 20.0% which is better than predictions using demographics and genetic information. Both mean and standard deviation of pause durations during backward recitation and clinical scores correlated with striatal atrophy (Spearman 0.6 and 0.5--0.6, respectively). Interpretation: Brief and examiner-free speech recording and analysis may become in the future an efficient method for remote evaluation of the individual condition in HD and likely in other NDD.

Nguyen, T.A., Sagot, B. & Dupoux, E. (2022). Are discrete units necessary for Spoken Language Modeling? IEEE Journal of Selected Topics in Signal Processing, 16(6), 1415 -- 1423. [abstract] ABSTRACT = Recent work in spoken language modeling shows the possibility of learning a language unsupervisedly from raw audio without any text labels. The approach relies first on transforming the audio into a sequence of discrete units (or pseudo-text) and then training a language model directly on such pseudo-text. Is such a discrete bottleneck necessary, potentially introducing irreversible errors in the encoding of the speech signal, or could we learn a language model without discrete units at all? In this work, we show that discretization is indeed essential for good results in spoken language modeling, but that we can omit the discrete bottleneck if we use discrete target features from a higher level than the input features. We also show that an end-to-end model trained with discrete targets like HuBERT achieves similar results as the best language model trained on pseudo-text on a set of zero-shot spoken language modeling metrics from the Zero Resource Speech Challenge 2021.

Ludusan, B., Cristia, A., Mazuka, R. & Dupoux, E. (2022). How much does prosody help word segmentation? A simulation study on infant-directed speech. Cognition, 219, 104961. [abstract] ABSTRACT = Infants come to learn several hundreds of word forms by two years of age, and it is possible this involves carving these forms out from continuous speech. It has been proposed that the task is facilitated by the presence of prosodic boundaries. We revisit this claim by running computational models of word segmentation, with and without prosodic information, on a corpus of infant-directed speech. We use five cognitively-based algorithms, which vary in whether they employ a sub-lexical or a lexical segmentation strategy and whether they are simple heuristics or embody an ideal learner. Results show that providing expert-annotated prosodic breaks does not uniformly help all segmentation models. The sub-lexical algorithms, which perform more poorly, benefit most, while the lexical ones show a very small gain. Moreover, when prosodic information is derived automatically from the acoustic cues infants are known to be sensitive to, errors in the detection of the boundaries lead to smaller positive effects, and even negative ones for some algorithms. This shows that even though infants could potentially use prosodic breaks, it does not necessarily follow that they should incorporate prosody into their segmentation strategies, when confronted with realistic signals.

Lavechin, M., Seyssel, M.D., Gautheron, L., Dupoux, E. & Cristia, A. (2022). Reverse engineering language acquisition with child-centered long-form recordings. Annual Review of Linguistics, 8, 389-407.

Kreuk, F., Polyak, A., Copet, J., Kharitonov, E., Nguyen, T.A., Rivière, M., Hsu, W.N., Mohamed, A., Dupoux, E. & Adi, Y. (2022). Textless Speech Emotion Conversion using Decomposed and Discrete Representations. In Proceedings of EMNLP, (pp 11200 - 11214) . [abstract] ABSTRACT = Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity. In this study, we cast the problem of emotion conversion as a spoken language translation task. We use a decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, and emotion. First, we modify the speech content by translating the phonetic-content units to a target emotion, and then predict the prosodic features based on these units. Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder. Such a paradigm allows us to go beyond spectral and parametric changes of the signal, and model non-verbal vocalizations, such as laughter insertion, yawning removal, etc. We demonstrate objectively and subjectively that the proposed method is vastly superior to current approaches and even beats text-based systems in terms of perceived emotion and audio quality. We rigorously evaluate all components of such a complex system and conclude with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths and weaknesses of the proposed method. Samples are available at https://speechbot.github.io/emotion/

Kharitonov, E., Lee, A., Polyak, A., Adi, Y., Copet, J., Lakhotia, K., Nguyen, T.A., Rivière, M., Mohamed, A., Dupoux, E. & Hsu, W.N. (2022). Text-Free Prosody-Aware Generative Spoken Language Modeling. In ACL, (pp 8666-8681) . [abstract] ABSTRACT = Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored. Generative Spoken Language Modeling (GSLM) (Lakhotia et al., 2021) is the only prior work addressing the generative aspects of speech pre-training, which replaces text with discovered phone-like units for language modeling and shows the ability to generate meaningful novel sentences. Unfortunately, despite eliminating the need of text, the units used in GSLM discard most of the prosodic information. Hence, GSLM fails to leverage prosody for better comprehension, and does not generate expressive speech. In this work, we present a prosody-aware generative spoken language model (pGSLM). It is composed of a multi-stream transformer language model (MS-TLM) of speech, represented as discovered unit and prosodic feature streams, and an adapted HiFi-GAN model converting MS-TLM outputs to waveforms. We devise a series of metrics for prosody modeling and generation, and re-use metrics from GSLM for content modeling. Experimental results show that the pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt. Audio samples can be found at this https URL.

Kharitonov, E., Copet, J., Lakhotia, K., Nguyen, T.A., Tomasello, P., Lee, A., Elkahky, A., Hsu, W.N., Mohamed, A., Dupoux, E. & Adi, Y. (2022). textless-lib: a Library for Textless Spoken Language Processing. In NAACL: System Demonstrations, (pp 1-9) . [abstract] ABSTRACT = Textless spoken language processing research aims to extend the applicability of the standard NLP toolset onto spoken language and languages with few or no textual resources. In this paper, we introduce textless-lib, a PyTorch-based library aimed to facilitate research in this area. We describe the building blocks that the library provides and demonstrate its usability by discussing three different use-case examples: (i) speaker probing, (ii) speech resynthesis and compression, and (iii) speech continuation. We believe that textless-lib substantially simplifies research in the textless setting and will be handy not only for speech researchers but also for the NLP community at large. The code, documentation, and pre-trained models are available at this URL: https://github.com/facebookresearch/textlesslib/

Gallezot, C., Riad, R., Titeux, H., Lemoine, L., Montillot, J., Sliwinski, A., Bagnou Hamet, J., Cao, X.N., Youssov, K., Dupoux, E. & Bachoud-Lévi, A.C. (2022). Emotion expression through spoken language in Huntington disease. Cortex, 155, 150-161. [abstract] ABSTRACT = Patients with Huntington's disease suffer from disturbances in the perception of emotions; they do not correctly read the body, vocal and facial expressions of others. With regard to the expression of emotions, it has been shown that they are impaired in expressing emotions through face but up until now, little research has been conducted about their ability to express emotions through spoken language. To better understand emotion production in both voice and language in Huntington's Disease (HD), we tested 115 individuals: 68 patients (HD), 22 participants carrying the mutant HD gene without any motor symptoms (pre-manifest HD), and 25 controls in a single-centre prospective observational follow-up study. Participants were recorded in interviews in which they were asked to recall sad, angry, happy, and neutral stories. Emotion expression through voice and language was investigated by comparing the identifiability of emotions expressed by controls, preHD and HD patients in these interviews. To assess separately vocal and linguistic expression of emotions in a blind design, we used machine learning models instead of a human jury performing a forced-choice recognition test. Results from this study showed that patients with HD had difficulty expressing emotions through both voice and language compared to preHD participants and controls, who behaved similarly and above chance. In addition, we did not find any differences in expression of emotions between preHD and healthy controls. We further validated our newly proposed methodology with a human jury on the speech produced by the controls. These results are consistent with the hypothesis that emotional deficits in HD are caused by impaired sensori-motor representations of emotions, in line with embodied cognition theories. This study also shows how machine learning models can be leveraged to assess emotion expression in a blind and reproducible way.

Dunbar, E., Hamilakis, N. & Dupoux, E. (2022). Self-supervised language learning from raw audio: Lessons from the Zero Resource Speech Challenge. IEEE Journal of Selected Topics in Signal Processing, 16(6), 1211 - 1226. [abstract] ABSTRACT = Recent progress in self-supervised or unsupervised machine learning has opened the possibility of building a full speech processing system from raw audio without using any textual representations or expert labels such as phonemes, dictionaries or parse trees. The contribution of the Zero Resource Speech Challenge series since 2015 has been to break down this long-term objective into four well-defined tasks---Acoustic Unit Discovery, Spoken Term Discovery, Discrete Resynthesis, and Spoken Language Modeling---and introduce associated metrics and benchmarks enabling model comparison and cumulative progress. We present an overview of the six editions of this challenge series since 2015, discuss the lessons learned, and outline the areas which need more work or give puzzling results.

Algayres, R., Ricoul, T., Karadayi, J., Laurençon, H., Zaiem, S., Mohamed, A., Sagot, B. & Dupoux, E. (2022). DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon. Transactions of the Association for Computational Linguistics, 10, 1051--1065. [abstract] ABSTRACT = Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a 'space' delimiter between words. Popular Bayesian non-parametric models for text segmentation (Goldwater et al., 2006, 2009) use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state-of-the-art in 5 languages. The algorithm monotonically improves with better input representations, achieving yet higher scores when fed with weakly supervised inputs. Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic and syntactic representations as assessed by a new spoken word embedding benchmark.

Algayres, R., Nabli, A., Sagot, B. & Dupoux, E. (2022). Speech Sequence Embeddings using Nearest Neighbors Contrastive Learning. In INTERSPEECH-2022, (pp 2123-2127) . [abstract] ABSTRACT = We introduce a simple neural encoder architecture that can be trained using an unsupervised contrastive learning objective which gets its positive samples from data-augmented k-Nearest Neighbors search. We show that when built on top of recent self-supervised audio representations [1, 2, 3], this method can be applied iteratively and yield competitive speech sequence embeddings (SSE) as evaluated on two tasks: query-by-example of random sequences of speech, and spoken term discovery. On both tasks our method pushes the state-of-the-art by a significant margin across 5 different languages. Finally, we establish a benchmark on a query-by-example task on the LibriSpeech dataset to monitor future improvements in the field.

Wang, C., Rivière, M., Lee, A., Wu, A., Talnikar, C., Haziza, D., Williamson, M., Pino, J. & Dupoux, E. (2021). VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation. In Proceedings of ACL, (pp 993--1003) . [abstract] ABSTRACT = We introduce VoxPopuli, a large-scale multilingual corpus providing 100K hours of unlabelled speech data in 23 languages. It is the largest open data to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 16 languages and their aligned oral interpretations into 5 other languages totaling 5.1K hours. We provide speech recognition baselines and validate the versatility of VoxPopuli unlabelled data in semi-supervised learning under challenging out-of-domain settings. We will release the corpus at https://github.com/facebookresearch/voxpopuli under an open license.

Tsuji, S., Cristia, A. & Dupoux, E. (2021). SCALa: A blueprint for computational models of language acquisition in social context. Cognition, 213, 104779. [abstract] ABSTRACT = Theories and data on language acquisition suggest a range of cues are used, ranging from information on structure found in the linguistic signal itself, to information gleaned from the environmental context or through social interaction. We propose a blueprint for computational models of the early language learner (SCALa, for Socio-Computational Architecture of Language Acquisition) that makes explicit the connection between the kinds of information available to the social learner and the computational mechanisms required to extract language-relevant information and learn from it. SCALa integrates a range of views on language acquisition, further allowing us to make precise recommendations for future large-scale empirical research.

Riochet, R., Ynocente Castro, M., Bernard, M., Lerer, A., Fergus, R., Izard, V. & Dupoux, E. (2021). IntPhys: A Framework and Benchmark for Visual Intuitive Physics Understanding. Transactions on Pattern Analysis and Machine Intelligence, 44(9), 5016 - 5025. [abstract] In order to reach human performance on complex visual tasks, artificial systems need to incorporate a significant amount of understanding of the world in terms of macroscopic objects, movements, forces, etc. Inspired by work on intuitive physics in infants, we propose an evaluation benchmark which diagnoses how much a given system understands about physics by testing whether it can tell apart well matched videos of possible versus impossible events constructed with a game engine. The test requires systems to compute a physical plausibility score over an entire video. To prevent perceptual biases, the dataset is made of pixel matched quadruplets of videos, enforcing systems to focus on high level temporal dependencies between frames rather than pixel-level details. We then describe two Deep Neural Networks systems aimed at learning intuitive physics in an unsupervised way, using only physically possible videos. The systems are trained with a future semantic mask prediction objective and tested on the possible versus impossible discrimination task. The analysis of their results compared to human data gives novel insights in the potentials and limitations of next frame prediction architectures.

Riad, R., Karadayi, J., Bachoud-Lévi, A.C. & Dupoux, E. (2021). Learning spectro-temporal representations of complex sounds with parameterized neural networks. Journal of the Acoustical Society of America, 150(1), 353--366. [abstract] ABSTRACT = Deep Learning models have become potential candidates for auditory neuroscience research, thanks to their recent successes on a variety of auditory tasks. Yet, these models often lack interpretability to fully understand the exact computations that have been performed. Here, we proposed a parametrized neural network layer, that computes specific spectro-temporal modulations based on Gabor kernels (Learnable STRFs) and that is fully interpretable. We evaluated the predictive capabilities of this layer on Speech Activity Detection, Speaker Verification, Urban Sound Classification and Zebra Finch Call Type Classification. We found out that models based on this learnable parametrized neural network are on par for all tasks with the different toplines, and obtain the best performance for Speech Activity Detection. As this layer is fully interpretable, we used quantitative measures to describe the distribution of the learned spectro-temporal modulations. The filters adapted to each task and focused mostly on low temporal and spectral modulations. The analyses show that the filters learned on human speech have similar spectro-temporal parameters as the ones measured directly in the human auditory cortex. Finally, equipped with the Sinkhorn distance to compare the learned STRFs distributions, we observed that the tasks were organized in a meaningful way: the human vocalizations tasks were closer to each other, and bird vocalizations were far away from human vocalizations and urban sounds tasks.

Polyak, A., Adi, Y., Copet, J., Kharitonov, E., Lakhotia, K., Hsu, W.N., Mohamed, A. & Dupoux, E. (2021). Speech Resynthesis from Discrete Disentangled Self-Supervised Representations. In INTERSPEECH-2021, (pp 3615--3619) . [abstract] ABSTRACT = We propose using self-supervised discrete representations for the task of speech resynthesis. To generate disentangled representation, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows us to synthesize speech in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and shed light on the advantages of each method while considering reconstruction quality and disentanglement properties. Specifically, we evaluate the F0 reconstruction, speaker identification performance (for both resynthesis and voice conversion), recordings' intelligibility, and overall quality using subjective human evaluation. Lastly, we demonstrate how these representations can be used for an ultra-lightweight speech codec. Using the obtained representations, we can get to a rate of 365 bits per second while providing better speech quality than the baseline methods. Audio samples can be found at https://resynthesis-ssl.github.io/

Ludusan, B., Morii, M., Minagawa, Y. & Dupoux, E. (2021). The effect of different information sources on prosodic boundary perception. JASA Express Letters, 1(11), 115203. [abstract] ABSTRACT = This study aims to quantify the effect of several information sources: acoustic, higher-level linguistic, and knowledge of the prosodic system of the language, on the perception of prosodic boundaries. An experiment with native and non-native participants investigating the identification of prosodic boundaries in Japanese was conducted. It revealed that non-native speakers as well as native speakers with access only to acoustic information can recognize boundaries better than chance level. However, knowledge of both the prosodic system and of higher-level information are required for a good boundary identification, each one having similar or higher importance than that of acoustic information.

Lakhotia, K., Kharitonov, E., Hsu, W.N., Adi, Y., Polyak, A., Bolte, B., Nguyen, T.A., Copet, J., Baevski, A., Mohamed, A. & Dupoux, E. (2021). Generative Spoken Language Modeling from Raw Audio. Transactions of the Association for Computational Linguistics, 9, 1336--1354. [abstract] ABSTRACT = We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text) all trained without supervision and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.

Feldman, N., Goldwater, S., Dupoux, E. & Schatz, T. (2021). Do Infants Really Learn Phonetic Categories? Open Mind, 5, 113--131. [abstract] ABSTRACT = Acoustic realizations of a given phonetic segment are typically affected by coarticulation with the preceding and following phonetic context. While coarticulation has been extensively studied using descriptive phonetic measurements, little is known about the functional impact of coarticulation for speech processing. Here, we use DTW-based similarity defined on raw acoustic features and ABX scores to derive a measure of the effect of coarticulation on phonetic discriminability. This measure does not rely on defining segment-specific phonetic cues (formants, duration, etc.) and can be applied systematically and automatically to any segment in large scale corpora. We illustrate our method using stimuli in English and Japanese. We replicate some well-known results, i.e., stronger anticipatory than perseveratory coarticulation and stronger coarticulation for lax/short vowels than for tense/long vowels. We then quantify for the first time the impact of coarticulation across different segment types (like vowels and consonants). We discuss how our metric and its possible extensions can help addressing current challenges in the systematic study of coarticulation.

Dunbar, E., Bernard, M., Hamilakis, N., Nguyen, T.A., de Seyssel, M., Rozé, P., Rivière, M., Kharitonov, E. & Dupoux, E. (2021). The Interspeech Zero Resource Speech Challenge 2021: Spoken language modelling. In INTERSPEECH-2021, (pp 1574--1578) . [abstract] ABSTRACT = We present the Zero Resource Speech Challenge 2021, which asks participants to learn a language model directly from audio, without any text or labels. The challenge is based on the Libri-light dataset, which provides up to 60k hours of audio from English audio books without any associated text. We provide a pipeline baseline system consisting of an encoder based on contrastive predictive coding (CPC), a quantizer (k-means) and a standard language model (BERT or LSTM). The metrics evaluate the learned representations at the acoustic (ABX discrimination), lexical (spot-the-word), syntactic (acceptability judgment) and semantic levels (similarity judgment). We present an overview of the eight submitted systems from four groups and discuss the main results.

Chaabouni, R., Kharitonov, E., Dupoux, E. & Baroni, M. (2021). Communicating artificial neural networks develop efficient color naming systems. Proceedings of the National Academy of Sciences of the United States of America, 118(12), e2016569118. [abstract] ABSTRACT = Words categorize the semantic fields they refer to in ways that maximize communication accuracy while minimizing complexity. Focusing on the well-studied color domain, we show that artificial neural networks trained with deep learning techniques to play a discrimination game develop communication systems whose distribution on the accuracy/complexity plane closely matches that of human languages. The observed variation among emergent color-naming systems is explained by different degrees of discriminative need, of the sort that might also characterize different human communities. Like human languages, emergent systems show a preference for relatively low-complexity solutions, even at the cost of imperfect communication. We demonstrate next that the nature of the emergent systems crucially depends on communication being discrete (as is human word usage). When continuous message passing is allowed, emergent systems become more complex, and eventually less efficient. Our study suggests that efficient semantic categorization is a general property of discrete communication systems, not limited to human language. It suggests moreover that it is exactly the discrete nature of such systems that, acting as a bottleneck, pushes them towards low complexity and optimal efficiency.

de Seyssel, M. & Dupoux, E. (2020). Does bilingual input hurt? A simulation of language discrimination and clustering using i-vectors. In Proceedings of the Cognitive Science Conference, (pp 2791--2797) . [abstract] ABSTRACT = The language discrimination process in infants has been successfully modeled using i-vector based systems, with results replicating several experimental findings. Still, recent work found intriguing results regarding the difference between monolingual and mixed-language exposure on language discrimination tasks. We use two carefully designed datasets, with an additional "bilingual" condition on the i-vector model of language discrimination. Our results do not show any difference in the ability of discriminating languages between the three backgrounds, although we do replicate past observations that distant languages (English-Finnish) are easier to discriminate than close languages (English-German). We do, however, find a strong effect of background when testing for the ability of the learner to automatically sort sentences in language clusters: bilingual background being generally harder than mixed background (one speaker one language). Other analyses reveal that clustering is dominated by speaker information rather than by languages.

Titeux, H., Riad, R., Cao, X.N., Hamilakis, N., Madden, K., Cristia, A., Bachoud-Lévi, A.C. & Dupoux, E. (2020). Seshat: A tool for managing and verifying annotation campaigns of audio data. In LREC, (pp 6976--6982) . [abstract] ABSTRACT = We introduce Seshat, a new, simple and open-source software to efficiently manage annotations of speech corpora. The Seshat software allows users to easily customise and manage annotations of large audio corpora while ensuring compliance with the formatting and naming conventions of the annotated output files. In addition, it includes procedures for checking the content of annotations following specific rules that are implemented in personalised parsers. Finally, we propose a double-annotation mode, for which Seshat automatically computes an associated inter-annotator agreement with the γ measure, taking into account the categorisation and segmentation discrepancies.

Schatz, T., Feldman, N., Goldwater, S., Cao, X.N. & Dupoux, E. (2020). Early phonetic learning without phonetic categories. Insights from large-scale simulations on realistic input. Proceedings of the National Academy of Sciences of the United States of America. Abstract: Acoustic realizations of a given phonetic segment are typically affected by coarticulation with the preceding and following phonetic context. While coarticulation has been extensively studied using descriptive phonetic measurements, little is known about the functional impact of coarticulation for speech processing. Here, we use DTW-based similarity defined on raw acoustic features and ABX scores to derive a measure of the effect of coarticulation on phonetic discriminability. This measure does not rely on defining segment-specific phonetic cues (formants, duration, etc.) and can be applied systematically and automatically to any segment in large scale corpora. We illustrate our method using stimuli in English and Japanese. We replicate some well-known results, i.e., stronger anticipatory than perseveratory coarticulation and stronger coarticulation for lax/short vowels than for tense/long vowels. We then quantify for the first time the impact of coarticulation across different segment types (like vowels and consonants). We discuss how our metric and its possible extensions can help address current challenges in the systematic study of coarticulation.

Rivière, M., Mazaré, P.E., Joulin, A. & Dupoux, E. (2020). Unsupervised pretraining transfers well across languages. In ICASSP-2020, (pp 7414-7418). Abstract: Cross-lingual and multi-lingual training of Automatic Speech Recognition (ASR) has been extensively investigated in the supervised setting. This assumes the existence of a parallel corpus of speech and orthographic transcriptions. Recently, contrastive predictive coding (CPC) algorithms have been proposed to pretrain ASR systems with unlabelled data. In this work, we investigate whether unsupervised pretraining transfers well across languages. We show that a slight modification of the CPC pretraining extracts features that transfer well to other languages, being on par with or even outperforming supervised pretraining. This shows the potential of unsupervised methods for languages with few linguistic resources.

Rivière, M., Kharitonov, E., Mazaré, P.E., Douze, M. & Dupoux, E. (2020). Towards unsupervised learning of speech features in the wild. In SLT-2020. Abstract: Recent work on unsupervised contrastive learning of speech representation has shown promising results, but so far has mostly been applied to clean, curated speech datasets. Can it also be used with unprepared audio data "in the wild"? Here, we explore three potential problems in this setting: (i) presence of non-speech data, (ii) noisy or low quality speech data, and (iii) imbalance in speaker distribution. We show that on the Libri-light train set, which is itself a relatively clean speech-only dataset, these problems combined can already have a performance cost of up to 30% relative for the ABX score. We show that the first two problems can be alleviated by data filtering, with voice activity detection selecting speech segments, while the perplexity of a model trained with clean data helps to discard entire files. We show that the third problem can be alleviated by learning a speaker embedding in the predictive branch of the model. We show that these techniques build more robust speech features that can be transferred to an ASR task in the low resource setting.

Rita, M., Chaabouni, R. & Dupoux, E. (2020). "LazImpa": Lazy and Impatient neural agents learn to communicate efficiently. In CONLL, (pp 335-343). Abstract: Previous work has shown that artificial neural agents naturally develop surprisingly non-efficient codes. This is illustrated by the fact that in a referential game involving a speaker and a listener neural network optimizing accurate transmission over a discrete channel, the emergent messages fail to achieve an optimal length. Furthermore, frequent messages tend to be longer than infrequent ones, a pattern contrary to Zipf's Law of Abbreviation (ZLA) observed in all natural languages. Here, we show that near-optimal and ZLA-compatible messages can emerge, but only if both the speaker and the listener are modified. We hence introduce a new communication system, "LazImpa", where the speaker is made increasingly lazy, i.e., avoids long messages, and the listener impatient, i.e., seeks to guess the intended content as soon as possible.

Riad, R., Titeux, H., Lemoine, L., Montillot, J., Cao, X.N., Dupoux, E. & Bachoud-Lévi, A.C. (2020). Vocal markers from sustained phonation in Huntington's Disease. In INTERSPEECH-2020, (pp 1893-1897). Abstract: Disease-modifying treatments are currently assessed in neurodegenerative diseases. Huntington's Disease represents a unique opportunity to design automatic sub-clinical markers, even in premanifest gene carriers. We investigated phonatory impairments as potential clinical markers and propose them for both diagnosis and gene carriers follow-up. We used two sets of features: Phonatory features and Modulation Power Spectrum Features. We found that phonation is not sufficient for the identification of sub-clinical disorders of premanifest gene carriers. According to our regression results, Phonatory features are suitable for the predictions of clinical performance in Huntington's Disease.

Riad, R., Bachoud-Lévi, A.C., Rudzicz, F. & Dupoux, E. (2020). Identification of primary and collateral tracks in stuttered speech. In LREC, (pp 1681-1688). Abstract: Disfluent speech has been previously addressed from two main perspectives: the clinical perspective focusing on diagnosis, and the Natural Language Processing (NLP) perspective aiming at modeling these events and detecting them for downstream tasks. In addition, previous works often used different metrics depending on whether the input features are text or speech, making it difficult to compare the different contributions. Here, we introduce a new evaluation framework for disfluency detection inspired by the clinical and NLP perspectives together with the theory of performance from Clark (1996), which distinguishes between primary and collateral tracks. We introduce a novel forced-aligned disfluency dataset from a corpus of semi-directed interviews, and present baseline results directly comparing the performance of text-based features (word and span information) and speech-based features (acoustic-prosodic information). Finally, we introduce new audio features inspired by the word-based span features. We show experimentally that using these features outperformed the baselines for speech-based predictions on the present dataset.

Nguyen, T.A., de Seyssel, M., Rozé, P., Rivière, M., Kharitonov, E., Baevski, A., Dunbar, E. & Dupoux, E. (2020). The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling. In NeurIPS Workshop on Self-Supervised Learning for Speech and Audio Processing.

Ludusan, B., Mazuka, R. & Dupoux, E. (2020). Does infant-directed speech help phonetic learning? A machine learning investigation. Cognitive Science. Abstract: A prominent hypothesis holds that by speaking to infants in infant-directed speech (IDS) as opposed to adult-directed speech (ADS), parents help them learn phonetic categories. Specifically, two characteristics of IDS have been claimed to facilitate learning: hyperarticulation, which makes the categories more separable, and variability, which makes the generalization more robust. Here, we test the separability and robustness of vowel category learning on acoustic representations of speech uttered by Japanese adults in either ADS, IDS (addressed to 18-24 month olds) or read speech (RS). Separability is determined by means of a distance measure computed between the five short vowel categories of Japanese, while robustness is assessed by testing the ability of six different machine learning algorithms trained to classify vowels to generalize on stimuli spoken by a novel speaker in ADS. Using two different speech representations, we find that hyperarticulated speech, in the case of RS, can yield better separability, and that increased between-speaker variability in ADS can yield, for some algorithms, more robust categories. However, these conclusions do not apply to IDS, which turned out to yield neither more separable nor more robust categories compared to ADS inputs. We discuss the usefulness of machine learning algorithms run on real data to test hypotheses about the functional role of IDS.

Lavechin, M., Bousbib, R., Bredin, H., Dupoux, E. & Cristia, A. (2020). An open-source voice type classifier for child-centered daylong recordings. In INTERSPEECH-2020, (pp 3072-3076). Abstract: Spontaneous conversations in real-world settings such as those found in child-centered recordings have been shown to be amongst the most challenging audio files to process. Nevertheless, building speech processing models handling such a wide variety of conditions would be particularly useful for language acquisition studies in which researchers are interested in the quantity and quality of the speech that children hear and produce, as well as for early diagnosis and measuring effects of remediation. In this paper, we present our approach to designing an open-source neural network to classify audio segments into vocalizations produced by the child wearing the recording device, vocalizations produced by other children, adult male speech, and adult female speech. To this end, we gathered diverse child-centered corpora which sum up to a total of 260 hours of recordings and cover 10 languages. Our model can be used as input for downstream tasks such as estimating the number of words produced by adult speakers, or the number of linguistic units produced by children. Our architecture combines SincNet filters with a stack of recurrent layers and outperforms by a large margin the state-of-the-art system, the Language ENvironment Analysis (LENA), which has been used in numerous child language studies.

Kharitonov, E., Rivière, M., Synnaeve, G., Wolf, L., Mazaré, P.E., Douze, M. & Dupoux, E. (2020). Data Augmenting Contrastive Learning of Speech Representations in the Time Domain. In SLT-2020. Abstract: Contrastive Predictive Coding (CPC), based on predicting future segments of speech based on past segments, is emerging as a powerful algorithm for representation learning of speech signal. However, it still under-performs other methods on unsupervised evaluation benchmarks. Here, we introduce WavAugment, a time-domain data augmentation library, and find that applying augmentation in the past is generally more efficient and yields better performances than other methods. We find that a combination of pitch modification, additive noise and reverberation substantially increases the performance of CPC (relative improvement of 18-22%), beating the reference Libri-light results with 600 times less data. Using an out-of-domain dataset, time-domain data augmentation can push CPC to be on par with the state of the art on the Zero Speech Benchmark 2017. We also show that time-domain data augmentation consistently improves downstream limited-supervision phoneme classification tasks by a factor of 12-15% relative.

Kahn, J., Rivière, M., Zheng, W., Kharitonov, E., Xu, Q., Mazaré, P.E., Karadayi, J., Liptchinsky, V., Collobert, R., Fuegen, C., Likhomanenko, T., Synnaeve, G., Joulin, A., Mohamed, A. & Dupoux, E. (2020). Libri-Light: A Benchmark for ASR with Limited or No Supervision. In ICASSP-2020, (pp 7669-7674). Abstract: We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero resource/unsupervised setting (ABX), (2) the semi-supervised setting (PER, CER) and (3) the distant supervision setting (WER). Settings (2) and (3) use limited textual resources (10 minutes to 10 hours) aligned with the speech. Setting (3) uses large amounts of unaligned text. They are evaluated on the standard LibriSpeech dev and test sets for comparison with the supervised state-of-the-art.

Jiang, B., Dunbar, E., Clayards, M., Darcy, I., Sonderegger, M. & Dupoux, E. (2020). Modelling Perceptual Effects of Phonology with ASR Systems. In Proceedings of the Cognitive Science Conference, (pp 2735-2741). Abstract: This paper explores the minimal knowledge a listener needs to compensate for phonological assimilation, one kind of phonological process responsible for variation in speech. We used standard automatic speech recognition models to represent English and French listeners. We found that, first, some types of models show language-specific assimilation patterns comparable to those shown by human listeners. Like English listeners, when trained on English, the models compensate more for place assimilation than for voicing assimilation, and like French listeners, the models show the opposite pattern when trained on French. Second, the models which best predict the human pattern use contextually-sensitive acoustic models and language models, which capture allophony and phonotactics, but do not make use of higher-level knowledge of a lexicon or word boundaries. Finally, some models overcompensate for assimilation, showing a (super-human) ability to recover the underlying form even in the absence of the triggering phonological context, pointing to an incomplete neutralization not exploited by human listeners.

Fournier, L., Dunbar, E. & Dupoux, E. (2020). Analogies minus analogy test: measuring regularities in word embeddings. In CoNLL 2020, (pp 365-375), Association for Computational Linguistics. Abstract: Vector space models of words have long been claimed to capture linguistic regularities as simple vector translations, but problems have been raised with this claim. We decompose and empirically analyze the classic arithmetic word analogy test, to motivate two new metrics that address the issues with the standard test, and which distinguish between class-wise offset concentration (similar directions between pairs of words drawn from different broad classes, such as France-London, China-Ottawa, ...) and pairing consistency (the existence of a regular transformation between correctly-matched pairs such as France:Paris::China:Beijing). We show that, while the standard analogy test is flawed, several popular word embeddings do nevertheless encode linguistic regularities.

Dunbar, E., Karadayi, J., Bernard, M., Cao, X.N., Algayres, R., Ondel, L., Besacier, L., Sakti, S. & Dupoux, E. (2020). The Zero Resource Speech Challenge 2020: Discovering discrete subword and word units. In INTERSPEECH-2020, (pp 4831-4835). Abstract: We present the Zero Resource Speech Challenge 2020, which aims at learning speech representations from raw audio signals without any labels. It combines the data sets and metrics from two previous benchmarks (2017 and 2019) and features two tasks which tap into two levels of speech representation. The first task is to discover low bit-rate subword representations that optimize the quality of speech synthesis; the second one is to discover word-like units from unsegmented raw speech. We present the results of the twenty submitted models and discuss the implications of the main findings for unsupervised speech learning.

Chaabouni, R., Kharitonov, E., Bouchacourt, D., Dupoux, E. & Baroni, M. (2020). Compositionality and Generalization in Emergent Languages. In ACL, (pp 4427-4442). Abstract: Natural language allows us to refer to novel composite concepts by combining expressions denoting their parts according to systematic rules, a property known as compositionality. In this paper, we study whether the language emerging in deep multi-agent simulations possesses a similar ability to refer to novel primitive combinations, and whether it accomplishes this feat by strategies akin to human-language compositionality. Equipped with new ways to measure compositionality in emergent languages inspired by disentanglement in representation learning, we establish three main results. First, given sufficiently large input spaces, the emergent language will naturally develop the ability to refer to novel composite concepts. Second, there is no correlation between the degree of compositionality of an emergent language and its ability to generalize. Third, while compositionality is not necessary for generalization, it provides an advantage in terms of language transmission: The more compositional a language is, the more easily it will be picked up by new learners, even when the latter differ in architecture from the original agents. We conclude that compositionality does not arise from simple generalization pressure, but if an emergent language does chance upon it, it will be more likely to survive and thrive.

Algayres, R., Zaiem, S., Sagot, B. & Dupoux, E. (2020). Evaluating the reliability of acoustic speech embeddings. In INTERSPEECH-2020, (pp 4621-4625). Abstract: Speech embeddings are fixed-size acoustic representations of variable-length speech sequences. They are increasingly used for a variety of tasks ranging from information retrieval to unsupervised term discovery and speech segmentation. However, there is currently no clear methodology to compare or optimise the quality of these embeddings in a task-neutral way. Here, we systematically compare two popular metrics, ABX discrimination and Mean Average Precision (MAP), on 5 languages across 17 embedding methods, ranging from supervised to fully unsupervised, and using different loss functions (autoencoders, correspondence autoencoders, siamese). Then we use the ABX and MAP to predict performances on a new downstream task: the unsupervised estimation of the frequencies of speech segments in a given corpus. We find that overall, ABX and MAP correlate with one another and with frequency estimation. However, substantial discrepancies appear in the fine-grained distinctions across languages and/or embedding methods. This makes it unrealistic at present to propose a task-independent silver bullet method for computing the intrinsic quality of speech embeddings. There is a need for more detailed analysis of the metrics currently used to evaluate such embeddings.

Fourtassi, A. & Dupoux, E. (2019). Phoneme learning is influenced by the taxonomic similarity of the semantic referents. In Proceedings of the Cognitive Science Conference, (pp 323-324), Cognitive Science Society. Abstract: Word learning relies on the ability to master the sound contrasts that are phonemic (i.e., signal meaning difference) in a given language. Though the timeline of phoneme development has been studied extensively over the past few decades, the mechanism of this development is poorly understood. Previous work has shown that human learners rely on referential information to differentiate similar sounds, but largely ignored the problem of taxonomic ambiguity at the semantic level (two different objects may be described by one or two words depending on how abstract the meaning intended by the speaker is). In this study, we varied the taxonomic distance of pairs of objects and tested how adult learners judged the phonemic status of the sound contrast associated with each of these pairs. We found that judgments were sensitive to gradients in the taxonomic structure, suggesting that learners use probabilistic information at the semantic level to optimize the accuracy of their judgments at the phonological level. The findings provide evidence for an interaction between phonological learning and meaning generalization, raising important questions about how these two important processes of language acquisition are related.

Dunbar, E., Algayres, R., Karadayi, J., Bernard, M., Benjumea, J., Cao, X.N., Miskic, L., Dugrain, C., Ondel, L., Black, A., Besacier, L., Sakti, S. & Dupoux, E. (2019). The Zero Resource Speech Challenge 2019: TTS without T. In INTERSPEECH-2019. Abstract: We present the Zero Resource Speech Challenge 2019, which proposes to build a speech synthesizer without any text or phonetic labels: hence, TTS without T (text-to-speech without text). We provide raw audio for a target voice in an unknown language (the Voice dataset), but no alignment, text or labels. Participants must discover subword units in an unsupervised way (using the Unit Discovery dataset) and align them to the voice recordings in a way that works best for the purpose of synthesizing novel utterances from novel speakers, similar to the target speaker's voice. We describe the metrics used for evaluation, a baseline system consisting of unsupervised subword unit discovery plus a standard TTS system, and a topline TTS using gold phoneme transcriptions. We present an overview of the 19 submitted systems from 11 teams and discuss the main results.

Cristia, A., Dupoux, E., Bernstein Ratner, N. & Soderstrom, M. (2019). Segmentability differences between child-directed and adult-directed speech: A systematic test with an ecologically valid corpus. In Open Mind, 3, (pp 13-22). Abstract: Previous computational modeling suggests it is much easier to segment words from child-directed (CDS) than adult-directed speech (ADS). However, this conclusion is based on data collected in the laboratory, with CDS from play sessions and ADS between a parent and an experimenter, which may not be representative of ecologically-collected CDS and ADS. Fully naturalistic ADS and CDS collected with a non-intrusive recording device as the child went about her day were analyzed with a diverse set of algorithms. The difference between registers was small compared to differences between algorithms, it reduced when corpora were matched, and it even reversed under some conditions. These results highlight the interest of studying learnability using naturalistic corpora and diverse algorithmic definitions.

Chaabouni, R., Kharitonov, E., Lazaric, A., Dupoux, E. & Baroni, M. (2019). Word-order biases in deep-agent emergent communication. In ACL 2019. Abstract: Sequence-processing neural networks led to remarkable progress on many NLP tasks. As a consequence, there has been increasing interest in understanding to what extent they process language as humans do. We aim here to uncover which biases such models display with respect to "natural" word-order constraints. We train models to communicate about paths in a simple gridworld, using miniature languages that reflect or violate various natural language trends, such as the tendency to avoid redundancy or to minimize long-distance dependencies. We study how the controlled characteristics of our miniature languages affect individual learning and their stability across multiple network generations. The results draw a mixed picture. On the one hand, neural networks show a strong tendency to avoid long-distance dependencies. On the other hand, there is no clear preference for the efficient, non-redundant encoding of information that is widely attested in natural language. We thus suggest inoculating a notion of "effort" into neural networks, as a possible way to make their linguistic behavior more human-like.

Chaabouni, R., Kharitonov, E., Dupoux, E. & Baroni, M. (2019). Anti-efficient encoding in emergent communication. In NeurIPS. Abstract: Despite renewed interest in emergent language simulations with neural networks, little is known about the basic properties of the induced code, and how they compare to human language. One fundamental characteristic of the latter, known as Zipf's Law of Abbreviation (ZLA), is that more frequent words are efficiently associated to shorter strings. We study whether the same pattern emerges when two neural networks, a "speaker" and a "listener", are trained to play a signaling game. Surprisingly, we find that networks develop an anti-efficient encoding scheme, in which the most frequent inputs are associated to the longest messages, and messages in general are skewed towards the maximum length threshold. This anti-efficient code appears easier to discriminate for the listener, and, unlike in human communication, the speaker does not impose a contrasting least-effort pressure towards brevity. Indeed, when the cost function includes a penalty for longer messages, the resulting message distribution starts respecting ZLA. Our analysis stresses the importance of studying the basic features of emergent communication in a highly controlled setup, to ensure the latter will not stray too far from human language. Moreover, we present a concrete illustration of how different functional pressures can lead to successful communication codes that lack basic properties of human language, thus highlighting the role such pressures play in the latter.

Bernard, M., Thiollière, R., Saksida, A., Loukatou, G., Larsen, E., Johnson, M., Fibla Reixachs, L., Dupoux, E., Daland, R., Cao, X.N. & Cristia, A. (2019). WordSeg: Standardizing unsupervised word form segmentation from text. Behavior Research Methods, 52, 264-278.

Zeghidour, N., Usunier, N., Synnaeve, G., Collobert, R. & Dupoux, E. (2018). End-to-End Speech Recognition from the raw waveform. In Interspeech-2018. Abstract: State-of-the-art speech recognition systems rely on fixed, hand-crafted features such as mel-filterbanks to preprocess the waveform before the training pipeline. In this paper, we study end-to-end systems trained directly from the raw waveform, building on two alternatives for trainable replacements of mel-filterbanks that use a convolutional architecture. The first one is inspired by gammatone filterbanks (Hoshen et al., 2015; Sainath et al., 2015), and the second one by the scattering transform (Zeghidour et al., 2017). We propose two modifications to these architectures and systematically compare them to mel-filterbanks, on the Wall Street Journal dataset. The first modification is the addition of an instance normalization layer, which greatly improves on the gammatone-based trainable filterbanks and speeds up the training of the scattering-based filterbanks. The second one relates to the low-pass filter used in these approaches. These modifications consistently improve performances for both approaches, and remove the need for a careful initialization in scattering-based trainable filterbanks. In particular, we show a consistent improvement in word error rate of the trainable filterbanks relative to comparable mel-filterbanks. It is the first time end-to-end models trained from the raw signal significantly outperform mel-filterbanks on a large vocabulary task under clean recording conditions.

Zeghidour, N., Usunier, N., Kokkinos, I., Schatz, T., Synnaeve, G. & Dupoux, E. (2018). Learning filterbanks from raw speech for phoneme recognition. In ICASSP-2018. Abstract: In this work we train a bank of complex filters that operates at the level of the raw speech signal and feeds into a convolutional neural network for phone recognition. These time-domain filterbanks (TD-filterbanks) are initialized as an approximation of MFSC, and then fine-tuned jointly with the remaining convolutional network. We perform phone recognition experiments on TIMIT and show that for several architectures, models trained on TD-filterbanks consistently out-perform their counterparts trained on comparable MFSC. We get our best performance by learning all front-end steps, from pre-emphasis up to averaging. Finally, we observe that the filters at convergence have an asymmetric impulse response while preserving some analyticity.

Thual, A., Dancette, C., Karadayi, J., Benjumea, J. & Dupoux, E. (2018). A K-nearest neighbours approach to unsupervised spoken term discovery. In IEEE SLT-2018. Abstract: Unsupervised spoken term discovery is the task of finding recurrent acoustic patterns in speech without any annotations. Current approaches consist of two steps: (1) discovering similar patterns in speech, and (2) partitioning those pairs of acoustic tokens using graph clustering methods. We propose a new approach for the first step. Previous systems used various approximation algorithms to make the search tractable on large amounts of data. Our approach is based on an optimized k-nearest neighbours (KNN) search coupled with a fixed word embedding algorithm. The results show that the KNN algorithm is robust across languages, consistently outperforms the DTW-based baseline, and is competitive with current state-of-the-art spoken term discovery systems.

Schatz, T., Bach, F. & Dupoux, E. (2018). Evaluating automatic speech recognition systems as quantitative models of cross-lingual phonetic category perception. Journal of the Acoustical Society of America: Express Letters. Abstract: Existing theories of cross-linguistic phonetic category perception agree that listeners perceive foreign sounds by mapping them onto their native phonetic categories. Yet, none of the available theories specify a way to compute this mapping. As a result, they cannot provide systematic quantitative predictions and remain mainly descriptive. In this paper, Automatic Speech Recognition (ASR) systems are used to provide a fully specified mapping between foreign and native sounds. This is shown to provide a quantitative model that can account for several empirically attested effects in human cross-linguistic phonetic category perception.

Scharenborg, O., Besacier, L., Black, A., Hasegawa-Johnson, M., Metze, F., Neubig, G., Stüker, S., Godard, P., Müller, M., Ondel, L., Palaskar, S., Arthur, P., Ciannella, F., Du, M., Larsen, E., Merkx, D., Riad, R., Wang, L. & Dupoux, E. (2018). Linguistic unit discovery from multimodal inputs in unwritten languages: Summary of the "Speaking Rosetta" JSALT 2017 Workshop. In ICASSP-2018. Abstract: We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding the discovery of linguistic units (subwords and words) in a language without orthography. We study the replacement of orthographic transcriptions by images and/or translated text in a well-resourced language to help unsupervised discovery from raw speech.

Riad, R., Dancette, C., Karadayi, J., Zeghidour, N., Schatz, T. & Dupoux, E. (2018). Sampling strategies in Siamese Networks for unsupervised speech representation learning. In Interspeech-2018. Abstract: Recent studies have investigated siamese network architectures for learning invariant speech representations using same-different side information at the word level. Here we investigate systematically an often ignored component of siamese networks: the sampling procedure (how pairs of same vs. different tokens are selected). We show that sampling strategies taking into account Zipf's Law, the distribution of speakers and the proportions of same and different pairs of words significantly impact the performance of the network. In particular, we show that word frequency compression improves learning across a large range of variations in number of training pairs. This effect does not apply to the same extent to the fully unsupervised setting, where the pairs of same-different words are obtained by spoken term discovery. We apply these results to pairs of words discovered using an unsupervised algorithm and show an improvement on state-of-the-art in unsupervised representation learning using siamese networks.

Ondel, L., Godard, P., Besacier, L., Larsen, E., Hasegawa-Johnson, M., Scharenborg, O., Dupoux, E., Burget, L., Yvon, F. & Khudanpur, S. (2018). Bayesian models for unit discovery on a very low resource language. In ICASSP-2018. [abstract] ABSTRACT = Developing speech technologies for low-resource languages has become a very active research field over the last decade. Among others, Bayesian models have shown some promising results on artificial examples but still lack of in situ experiments. Our work applies state-of-the-art Bayesian models to unsupervised Acoustic Unit Discovery (AUD) in a real low-resource language scenario. We also show that Bayesian models can naturally integrate information from other resourceful languages by means of informative prior leading to more consistent discovered units. Finally, discovered acoustic units are used, either as the 1-best sequence or as a lattice, to perform word segmentation. Word segmentation results show that this Bayesian approach clearly outperforms a Segmental-DTW baseline on the same corpus.

Holzenberger, N., Du, M., Karadayi, J., Riad, R. & Dupoux, E. (2018). Learning word embeddings: unsupervised methods for fixed-size representations of variable-length speech segments. In Interspeech-2018. [abstract] ABSTRACT = Fixed-length embeddings of words are very useful for a variety of tasks in speech and language processing. Here we systematically explore two methods of computing fixed-length embeddings for variable-length sequences. We evaluate their susceptibility to phonetic and speaker-specific variability on English, a high resource language, and Xitsonga, a low resource language, using two evaluation metrics: ABX word discrimination and ROC-AUC on same-different phoneme n-grams. We show that a simple downsampling method supplemented with length information can be competitive with the variable-length input feature representation on both evaluations. Recurrent autoencoders trained without supervision can yield even better results at the expense of increased computational complexity.

Guevara-Rukoz, A., Cristia, A., Ludusan, B., Thiollière, R., Martin, A., Mazuka, R. & Dupoux, E. (2018). Are words easier to learn from infant- than adult-directed speech? A quantitative corpus-based investigation. Cognitive Science, 42(5), 1586-1617. [abstract] ABSTRACT = We investigate whether infant-directed speech (IDS) facilitates lexical learning when compared to adult-directed speech (ADS). To study this, we compare the distinctiveness of the lexicon at two levels, acoustic and phonological, using a large database of spontaneous speech in Japanese. At the acoustic level we show that, as has been documented before for phonemes, the realizations of words are more variable and less discriminable in IDS. At the phonological level, we find that despite a slight increase in the number of phonological neighbors, the IDS lexicon contains more distinctive words (such as onomatopoeias). Combining the acoustic and phonological metrics together in a global discrimination score, the two effects cancel each other out and the IDS lexicon winds up being as discriminable as its ADS counterpart. We discuss the implication of these findings for the view of IDS as hyperspeech, i.e., a register whose purpose is to facilitate language acquisition.

Dupoux, E. (2018). Cognitive Science in the era of Artificial Intelligence: A roadmap for reverse-engineering the infant language-learner. Cognition, 173, 34-59. [abstract] ABSTRACT = Spectacular progress in the information processing sciences (machine learning, wearable sensors) promises to revolutionize the study of cognitive development. Here, we analyse the conditions under which 'reverse engineering' language development, i.e., building an effective system that mimics infants' achievements, can contribute to our scientific understanding of early language development. We argue that, on the computational side, it is important to move from toy problems to the full complexity of the learning situation, and take as input as faithful reconstructions of the sensory signals available to infants as possible. On the data side, accessible but privacy-preserving repositories of home data have to be set up. On the psycholinguistic side, specific tests have to be constructed to benchmark humans and machines at different linguistic levels. We discuss the feasibility of this approach and present an overview of current results.

Cao, X.N., Dakhlia, C., del Carmen, P., Jaouani, M.A., Ould-Arbi, M. & Dupoux, E. (2018). Baby Cloud, a technological platform for parents and researchers. In Nicoletta Calzolari (Conference chair), Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis & Takenobu Tokunaga (eds) Proceedings of LREC 2018, European Language Resources Association (ELRA). [abstract] ABSTRACT = In this paper, we present BabyCloud, a platform for capturing, storing and analyzing daylong audio recordings and photographs of children's linguistic environments, for the purpose of studying infant's cognitive and linguistic development and interactions with the environment. The proposed platform connects two communities of users: families and academics, with strong innovation potential for each type of users. For families, the platform offers a novel functionality: the ability for parents to follow the development of their child on a daily basis through language and cognitive metrics (growth curves in number of words, verbal complexity, social skills, etc). For academic research, the platform provides a novel means for studying language and cognitive development at an unprecedented scale and level of detail. They will submit algorithms to the secure server which will only output anonymized aggregate statistics. Ultimately, BabyCloud aims at creating an ecosystem of third parties (public and private research labs...) gravitating around developmental data, entirely controlled by the party whose data originate from, i.e. families.

Tsuji, S., Fikkert, P., Minagawa-Kawai, Y., Dupoux, E., Filippin, L., Versteegh, M., Hagoort, P. & Cristia, A. (2017). The more, the better? Behavioral and neural correlates of frequent and infrequent vowel exposure. Developmental Psychobiology, 59, 603-612. [abstract] ABSTRACT = A central assumption in the perceptual attunement literature holds that exposure to a speech sound contrast leads to improvement in native speech sound processing. However, whether the amount of exposure matters for this process has not been put to a direct test. We elucidated indicators of frequency-dependent perceptual attunement by comparing 5-8-month-old Dutch infants' discrimination of tokens containing a highly frequent [hɪt-he:t] and a highly infrequent [hʏt-hø:t] native vowel contrast as well as a non-native [hɛt-hæt] vowel contrast in a behavioral visual habituation paradigm (Experiment 1). Infants discriminated both native contrasts similarly well, but did not discriminate the non-native contrast. We sought further evidence for subtle differences in the processing of the two native contrasts using near-infrared spectroscopy and a within-participant design (Experiment 2). The neuroimaging data did not provide additional evidence that responses to native contrasts are modulated by frequency of exposure. These results suggest that even large differences in exposure to a native contrast may not directly translate to behavioral and neural indicators of perceptual attunement, raising the possibility that frequency of exposure does not influence improvements in discriminating native contrasts.

Schatz, T., Turnbull, R., Bach, F. & Dupoux, E. (2017). A Quantitative Measure of the Impact of Coarticulation on Phone Discriminability. In INTERSPEECH-2017. [abstract] ABSTRACT = Acoustic realizations of a given phonetic segment are typically affected by coarticulation with the preceding and following phonetic context. While coarticulation has been extensively studied using descriptive phonetic measurements, little is known about the functional impact of coarticulation for speech processing. Here, we use DTW-based similarity defined on raw acoustic features and ABX scores to derive a measure of the effect of coarticulation on phonetic discriminability. This measure does not rely on defining segment-specific phonetic cues (formants, duration, etc.) and can be applied systematically and automatically to any segment in large scale corpora. We illustrate our method using stimuli in English and Japanese. We replicate some well-known results, i.e., stronger anticipatory than perseveratory coarticulation and stronger coarticulation for lax/short vowels than for tense/long vowels. We then quantify for the first time the impact of coarticulation across different segment types (like vowels and consonants). We discuss how our metric and its possible extensions can help addressing current challenges in the systematic study of coarticulation.

Michel, P., Räsänen, O., Thiollière, R. & Dupoux, E. (2017). Blind phoneme segmentation with temporal prediction errors. In Proceedings of ACL: Student Research Workshop, 62-68. [abstract] Phonemic segmentation of speech is a critical step of speech recognition systems. We propose a novel unsupervised algorithm based on sequence prediction models such as Markov chains and recurrent neural networks. Our approach consists in analyzing the error profile of a model trained to predict speech features frame-by-frame. Specifically, we try to learn the dynamics of speech in the MFCC space and hypothesize boundaries from local maxima in the prediction error. We evaluate our system on the TIMIT dataset, with improvements over similar methods.

Ludusan, B., Mazuka, R., Bernard, M., Cristia, A. & Dupoux, E. (2017). The Role of Prosody and Speech Register in Word Segmentation: A Computational Modelling Perspective. In ACL 2017, 2, (pp 178-183) . [abstract] ABSTRACT = This study explores the role of speech register and prosody for the task of word segmentation. Since these two factors are thought to play an important role in early language acquisition, we aim to quantify their contribution for this task. We study a Japanese corpus containing both infant- and adult-directed speech and we apply four different word segmentation models, with and without knowledge of prosodic boundaries. The results showed that the difference between registers is smaller than previously reported and that prosodic boundary information helps more adult- than infant-directed speech.

Le Godais, G., Linzen, T. & Dupoux, E. (2017). Comparing character-level neural language models using a lexical decision task. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics., 2, (pp 125-130) . [abstract] ABSTRACT = What is the information captured by neural network models of language? We address this question in the case of character-level recurrent neural language models. These models do not have explicit word representations; do they acquire implicit ones? We assess the lexical capacity of a network using the lexical decision task common in psycholinguistics: the system is required to decide whether or not a string of characters forms a word. We explore how accuracy on this task is affected by the architecture of the network, focusing on cell type (LSTM vs. SRN), depth and width. We also compare these architectural properties to a simple count of the parameters of the network. The overall number of parameters in the network turns out to be the most important predictor of accuracy; in particular, there is little evidence that deeper networks are beneficial for this task.

Larsen, E., Cristia, A. & Dupoux, E. (2017). Relating unsupervised word segmentation to reported vocabulary acquisition. In INTERSPEECH-2017. [abstract] ABSTRACT = A range of computational approaches have been used to model the discovery of word forms from continuous speech by infants. Typically, these algorithms are evaluated with respect to the ideal 'gold standard' word segmentation and lexicon. These metrics assess how well an algorithm matches the adult state, but may not reflect the intermediate states of the child's lexical development. We set up a new evaluation method based on the correlation between word frequency counts derived from the application of an algorithm onto a corpus of child-directed speech, and the proportion of infants knowing the words according to parental reports. We evaluate a representative set of 4 algorithms, applied to transcriptions of the Brent corpus, which have been phonologized using either phonemes or syllables as basic units. Results show remarkable variation in the extent to which these 8 algorithm-unit combinations predicted infant vocabulary, with some of these predictions surpassing those derived from the adult gold standard segmentation. We argue that infant vocabulary prediction provides a useful complement to traditional evaluation; for example, the best predictor model was also one of the worst in terms of segmentation score, and there was no clear relationship between token or boundary F-score and vocabulary prediction.

Guevara-Rukoz, A., Parlato-Oliveira, E., Yu, S., Hirose, Y., Peperkamp, S. & Dupoux, E. (2017). Predicting epenthetic vowel quality from acoustics. In INTERSPEECH-2017. [abstract] ABSTRACT = Past research has shown that sound sequences not permitted in our native language may be distorted by our perceptual system. A well documented example is vowel epenthesis, a phenomenon in which non-existent vowels are hallucinated by listeners, in order to repair illegal consonantal sequences. As reported in previous work, this occurs in Japanese (JP) and Brazilian Portuguese (BP), languages for which the 'default' epenthetic vowels are /u/ and /i/, respectively. In a perceptual experiment, we corroborate the finding that the quality of this illusory vowel is language-dependent, but also that this default choice can be overridden by coarticulatory information present on the consonant cluster. In a second step, we analyse recordings of JP and BP speakers producing 'epenthesized' versions of stimuli from the perceptual task. Results reveal that the default vowel corresponds to the vowel with the most reduced acoustic characteristics, also the one for which formants are acoustically closest to formant transitions present in consonantal clusters. Lastly, we model behavioural responses from the perceptual experiment with an exemplar model using dynamic time warping (DTW)-based similarity measures on MFCCs.

Guevara-Rukoz, A., Lin, I., Morii, M., Minagawa, Y., Dupoux, E. & Peperkamp, S. (2017). Which epenthetic vowel? Phonetic categories versus acoustic detail in perceptual vowel epenthesis. Journal of the Acoustical Society of America: Express Letters, 142(2), EL211-2017. [abstract] ABSTRACT = This study aims to quantify the relative contributions of phonetic categories and acoustic detail on phonotactically induced perceptual vowel epenthesis in Japanese listeners. A vowel identification task tested whether a vowel was perceived within illegal consonant clusters and, if so, which vowel was heard. Cross-spliced stimuli were used in which vowel coarticulation present in the cluster did not match the quality of the flanking vowel. Two clusters were used, /hp/ and /kp/, the former containing larger amounts of resonances of the preceding vowel. While both flanking vowel and coarticulation influenced vowel quality, the influence of coarticulation was larger, especially for /hp/.

Dunbar, E., Cao, X.N., Benjumea, J., Karadayi, J., Bernard, M., Besacier, L., Anguera, X. & Dupoux, E. (2017). The Zero Resource Speech Challenge 2017. In ASRU-2017. [abstract] ABSTRACT = We describe a new challenge aimed at discovering subword and word units from raw speech. This challenge is the followup to the Zero Resource Speech Challenge 2015. It aims at constructing systems that generalize across languages and adapt to new speakers. The design features and evaluation metrics of the challenge are presented and the results of seventeen models are discussed.

Cristia, A., Dupoux, E., Gurven, M. & Stieglitz, J. (2017). Child-directed speech is infrequent in a forager-farmer population: a time allocation study. Child Development. [abstract] This article provides an estimation of how frequently, and from whom, children aged 0-11 years (Ns between 9 and 24) receive one-on-one verbal input among Tsimane forager-horticulturalists of lowland Bolivia. Analyses of systematic daytime behavioral observations reveal < 1 min per daylight hour is spent talking to children younger than 4 years of age, which is 4 times less than estimates for others present at the same time and place. Adults provide a majority of the input at 0-3 years of age but not afterward. When integrated with previous work, these results reveal large cross-cultural variation in the linguistic experiences provided to young children. Consideration of more diverse human populations is necessary to build generalizable theories of language acquisition.

Chaabouni, R., Dunbar, E., Zeghidour, N. & Dupoux, E. (2017). Learning weakly supervised multimodal phoneme embeddings. In INTERSPEECH-2017. [abstract] ABSTRACT = Recent works have explored deep architectures for learning multimodal speech representation (e.g. audio and images, articulation and audio) in a supervised way. Here we investigate the role of combining different speech modalities, i.e. audio and visual information representing the lips' movements, in a weakly-supervised way using Siamese networks and lexical same-different side information. In particular, we ask whether one modality can benefit from the other to provide a richer representation for phone recognition in a weakly supervised setting. We introduce mono-task and multi-task methods for merging speech and visual modalities for phone recognition. The mono-task learning consists in applying a Siamese network on the concatenation of the two modalities, while the multi-task learning receives several different combinations of modalities at train time. We show that multi-task learning enhances discriminability for visual and multimodal inputs while minimally impacting auditory inputs. Furthermore, we present a qualitative analysis of the obtained phone embeddings, and show that cross-modal visual input can improve the discriminability of phonetic features which are visually discernable (rounding, open/close, labial place of articulation), resulting in representations that are closer to abstract linguistic features than those based on audio only.

de Diego-Balaguer, R., Schramm, C., Rebeix, I., Dupoux, E., Durr, A., Brice, A., Charles, P., Cleret de Langavant, L., Youssov, K., Verny, C., Damotte, V., Azulay, J.P., Goizet, C., Simonin, C., Tranchant, C., Maison, P., Rialland, A., Schmitz, D., Jacquemot, C., Fontaine, B. & Bachoud-Lévi, A.C. (2016). COMT Val158Met Polymorphism Modulates Huntington's Disease Progression. Plos One, 11(9), e0161106.

Zeghidour, N., Synnaeve, G., Versteegh, M. & Dupoux, E. (2016). A Deep Scattering Spectrum - Deep Siamese Network Pipeline For Unsupervised Acoustic Modeling. In ICASSP-2016, (pp 4965-4969) . [abstract] ABSTRACT = Recent work has explored deep architectures for learning acoustic features in an unsupervised or weakly supervised way for phone recognition. Here we investigate the role of the input features, and in particular we test whether standard mel-scaled filterbanks could be replaced by inherently richer representations, such as derived from an analytic scattering spectrum. We use a Siamese network using lexical side information similar to a well performing architecture used in the Zero Resource Speech Challenge (2015), and show a substantial improvement when the filterbanks are replaced by scattering features, even though these features yield similar performance when tested without training. This shows that unsupervised and weakly-supervised architectures can benefit from richer features than the traditional ones.

Zeghidour, N., Synnaeve, G., Usunier, N. & Dupoux, E. (2016). Joint Learning of Speaker and Phonetic Similarities with Siamese Networks. In INTERSPEECH-2016, (pp 1295-1299) . [abstract] ABSTRACT = Recent work has demonstrated, on small datasets, the feasibility of jointly learning specialized speaker and phone embeddings, in a weakly supervised siamese DNN architecture using word and speaker identity as side information. Here, we scale up these architectures to the 360 hours of the Librispeech corpus by implementing a sampling method to efficiently select pairs of words from the dataset and improving the loss function. We also compare the standard siamese networks fed with same (AA) or different (AB) pairs, to a 'triamese' network fed with AAB triplets. We use ABX discrimination tasks to evaluate the discriminability and invariance properties of the obtained joined embeddings, and compare these results with mono-embeddings architectures. We find that the joined embeddings architectures succeed in effectively disentangling speaker from phoneme information, with around 10% errors for the matching tasks and embeddings (speaker task on speaker embeddings, and phone task on phone embedding) and near chance for the mismatched task. Furthermore, the results carry over in out-of-domain datasets, even beating the best results obtained with similar weakly supervised techniques.

Versteegh, M., Anguera, X., Jansen, A. & Dupoux, E. (2016). The Zero Resource Speech Challenge 2015: Proposed Approaches and Results. In SLTU-2016 Procedia Computer Science, 81, (pp 67-72) . [abstract] This paper reports on the results of the Zero Resource Speech Challenge 2015, the first unified benchmark for zero resource speech technology, which aims at the unsupervised discovery of subword and word units from raw speech. This paper discusses the motivation for the challenge, its data sets, tasks and baseline systems. We outline the ideas behind the systems that were submitted for the two challenge tracks: unsupervised subword unit modeling and spoken term discovery, and summarize their results. The results obtained by participating teams show great promise; many systems beat the provided baselines and some even perform better than comparable supervised systems.

Synnaeve, G. & Dupoux, E. (2016). A temporal coherence loss function for learning unsupervised acoustic embeddings. In SLTU-2016 Procedia Computer Science, 81, (pp 95-100) . [abstract] ABSTRACT = We train Neural Networks of varying depth with a loss function which imposes the output representations to have a temporal profile which looks like that of phonemes. We show that a simple loss function which maximizes the dissimilarity between near frames and long distance frames helps to construct a speech embedding that improves phoneme discriminability, both within and across speakers, even though the loss function only uses within speaker information. However, with too deep an architecture, this loss function yields overfitting, suggesting the need for more data and/or regularization.

Ogawa, T., Mallidi, S.H., Dupoux, E., Cohen, J., Feldman, N. & Hermansky, H. (2016). A new efficient measure for accuracy prediction and its application to multistream-based unsupervised adaptation. In ICPR. [abstract] ABSTRACT = A new efficient measure for predicting estimation accuracy is proposed and successfully applied to multistream-based unsupervised adaptation of ASR systems to address data uncertainty when the ground-truth is unknown. The proposed measure is an extension of the M-measure, which predicts confidence in the output of a probability estimator by measuring the divergences of probability estimates spaced at specific time intervals. In this study, the M-measure was extended by considering the latent phoneme information, resulting in an improved reliability. Experimental comparisons carried out in a multistream-based ASR paradigm demonstrated that the extended M-measure yields a significant improvement over the original M-measure, especially under narrow-band noise conditions.

Ludusan, B., Cristia, A., Martin, A., Mazuka, R. & Dupoux, E. (2016). Learnability of prosodic boundaries: Is infant-directed speech easier? Journal of the Acoustical Society of America, 140(2), 1239-1250. [abstract] ABSTRACT = This study explores the long-standing hypothesis that the acoustic cues to prosodic boundaries in infant-directed speech (IDS) make those boundaries easier to learn than those in adult-directed speech (ADS). Three cues (pause duration, nucleus duration and pitch change) were investigated, by means of a systematic review of the literature, statistical analyses of a new corpus, and machine learning experiments. The review of previous work revealed that the effect of register on boundary cues is less well established than previously thought, and that results often vary across studies for certain cues. Statistical analyses run on a large database of mother-child and mother-interviewer interactions showed that the duration of a pause and the duration of the syllable nucleus preceding the boundary are two cues which are enhanced in IDS, while f0 change is actually degraded in IDS. Supervised and unsupervised machine learning techniques applied to these acoustic cues revealed that IDS boundaries were consistently better classified than ADS ones, regardless of the learning method used. The role of the cues examined in this study and the importance of these findings in the more general context of early linguistic structure acquisition is discussed.

Ludusan, B. & Dupoux, E. (2016). The role of prosodic boundaries in word discovery: Evidence from a computational model. Journal of the Acoustical Society of America, 140(1), EL1. [abstract] ABSTRACT = This study aims to quantify the role of prosodic boundaries in early language acquisition using a computational modeling approach. A spoken term discovery system that models early word learning was used with and without a prosodic component on speech corpora of English, Spanish, and Japanese. The results showed that prosodic information induces a consistent improvement both in the alignment of the terms to actual word boundaries and in the phonemic homogeneity of the discovered clusters of terms. This benefit was found also when automatically discovered prosodic boundaries were used, boundaries which did not perfectly match the linguistically defined ones.

Ludusan, B. & Dupoux, E. (2016). Automatic syllable segmentation using broad phonetic class information. In SLTU-2016 Procedia Computer Science, 81, (pp 101-106) . [abstract] ABSTRACT = We propose in this paper a language-independent method for syllable segmentation. The method is based on the Sonority Sequencing Principle, by which the sonority inside a syllable increases from its boundaries towards the syllabic nucleus. The sonority function employed was derived from the posterior probabilities of a broad phonetic class recognizer, trained with data coming from an open-source corpus of English stories. We tested our approach on English, Spanish and Catalan and compared the results obtained to those given by an energy-based system. The proposed method outperformed the energy-based system on all three languages, showing a good generalizability to the two unseen languages. We conclude with a discussion of the implications of this work for under-resourced languages.

Linzen, T., Dupoux, E. & Spector, B. (2016). Quantificational features in distributional word representations. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, (pp 1-11) .

Linzen, T., Dupoux, E. & Goldberg, Y. (2016). Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4, 521-535.

Gvozdic, K., Moutier, S., Dupoux, E. & Buon, M. (2016). Priming Children's Use of Intentions in Moral Judgement with Metacognitive Training. Frontiers in Language Sciences, 7(190).

Fourtassi, A. & Dupoux, E. (2016). The role of word-word co-occurrence in word learning. In Proceedings of the 38th Annual Conference of the Cognitive Science Society, (pp 662-667) . [abstract] ABSTRACT = A growing body of research on early word learning suggests that learners gather word-object co-occurrence statistics across learning situations. Here we test a new mechanism whereby learners are also sensitive to word-word co-occurrence statistics. Indeed, we find that participants can infer the likely referent of a novel word based on its co-occurrence with other words, in a way that mimics a machine learning algorithm dubbed 'zero-shot learning'. We suggest that the interaction between referential and distributional regularities can bring robustness to the process of word acquisition.

Dunbar, E. & Dupoux, E. (2016). Geometric constraints on human speech sound inventories. Frontiers in Psychology, 7(1061). [abstract] We investigate the idea that the languages of the world have developed coherent sound systems in which having one sound increases or decreases the chances of having certain other sounds, depending on shared properties of those sounds. We investigate the geometries of sound systems that are defined by the inherent properties of sounds. We document three typological tendencies in sound system geometries: economy, a tendency for the differences between sounds in a system to be definable on a relatively small number of independent dimensions; local symmetry, a tendency for sound systems to have relatively large numbers of pairs of sounds that differ only on one dimension; and global symmetry, a tendency for sound systems to be relatively balanced. The finding of economy corroborates previous results; the two symmetry properties have not been previously documented. We also investigate the relation between the typology of inventory geometries and the typology of individual sounds, showing that the frequency distribution with which individual sounds occur across languages works in favour of both local and global symmetry.

Carbajal, J., Fér, R. & Dupoux, E. (2016). Modeling language discrimination in infants using i-vector representations. In Proceedings of the 38th Annual Conference of the Cognitive Science Society, (pp 889-896) . [abstract] ABSTRACT = Experimental research suggests that at birth infants can discriminate two languages if they belong to different rhythmic classes, and by 4 months of age they can discriminate two languages within the same class provided they have been previously exposed to at least one of them. In this paper, we present a novel application of speech technology tools to model language discrimination, which may help to understand how infants achieve this task. By combining a Gaussian Mixture Model of the acoustic space and low-dimensional representations of novel utterances with a model of a habituation paradigm, we show that brief exposure to French does not allow to discriminate between two previously unheard languages belonging to the same rhythmic class, but allows to discriminate two languages across rhythmic class. The implications of these findings are discussed.

Carbajal, J., Dawud, A., Thiollière, R. & Dupoux, E. (2016). The 'Language Filter' Hypothesis: Modeling Language Separation in Infants using I-vectors. In EPIROB 2016, (pp 195-201). [abstract] Experimental research suggests that at birth infants can discriminate two languages if they belong to different rhythmic classes, and by 4 months of age they can discriminate two languages within the same class provided they have been previously exposed to at least one of them. In this paper, we present a novel application of speech technology tools to model language discrimination, which may help to understand how infants achieve this task. By combining a Gaussian Mixture Model of the acoustic space and low-dimensional representations of novel utterances with a model of a habituation paradigm, we show that brief exposure to French does not allow discrimination between two previously unheard languages belonging to the same rhythmic class, but does allow discrimination between two languages across rhythmic classes. The implications of these findings are discussed.

Bergmann, C., Cristia, A. & Dupoux, E. (2016). Discriminability of sound contrasts in the face of speaker variation quantified. In Proceedings of the 38th Annual Conference of the Cognitive Science Society, (pp 1331-1336). [abstract] How does a naive language learner deal with speaker variation irrelevant to distinguish word meanings? Experimental data is conflicting and incompatible models have been proposed. In this paper we examine the basic assumptions of these models regarding the signal the learner deals with: Is speaker variability a hurdle in discriminating sounds or can it easily be abstracted over? To this end we summarize existing infant data and compare them to machine-based discriminability scores of sound pairs obtained without added language knowledge. Our results show consistently that speaker variability decreases sound contrast discriminability, and that some pairs are affected more than others. Further, chance performance is a rare exception; contrasts remain discriminable in the face of speaker variation. Our data offer a way to reunite seemingly conflicting findings in the infant literature and show a path forward in testing whether and how speaker variation plays a role for language acquisition.

Versteegh, M., Thiollière, R., Schatz, T., Cao, X.N., Anguera, X., Jansen, A. & Dupoux, E. (2015). The Zero Resource Speech Challenge 2015. In INTERSPEECH-2015, (pp 3169-3173). [abstract] The Interspeech 2015 Zero Resource Speech Challenge aims at discovering subword and word units from raw speech. The challenge provides the first unified and open source suite of evaluation metrics and data sets to compare and analyse the results of unsupervised linguistic unit discovery algorithms. It consists of two tracks. In the first, a psychophysically inspired evaluation task (minimal pair ABX discrimination) is used to assess how well speech feature representations discriminate between contrastive subword units. In the second, several metrics gauge the quality of discovered word-like patterns. Two data sets are provided, one for English, one for Xitsonga. Both data sets are provided without any annotation except for voice activity and talker identity. This paper introduces the evaluation metrics, presents the results of baseline systems and discusses some of the key issues in unsupervised unit discovery.
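
The minimal pair ABX discrimination task mentioned in the abstract above can be illustrated with a small sketch. This is a simplified illustration and not the challenge's evaluation code: it assumes each token is a fixed-length feature vector compared with Euclidean distance, and the names `euclidean` and `abx_error_rate` are hypothetical. The published evaluations operate on variable-length speech tokens and their details are not reproduced here.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def abx_error_rate(cat_a, cat_b, distance=euclidean):
    """Simplified ABX error rate (an assumption-laden sketch).

    For every triplet (A, B, X), with A and X two distinct tokens of
    category A and B a token of category B, count an error when X is
    farther from A than from B. Ties count as half an error.
    """
    errors = 0.0
    total = 0
    for i, x in enumerate(cat_a):
        for j, a in enumerate(cat_a):
            if i == j:
                continue  # X and A must be different tokens
            for b in cat_b:
                d_a = distance(x, a)
                d_b = distance(x, b)
                if d_a > d_b:
                    errors += 1.0
                elif d_a == d_b:
                    errors += 0.5
                total += 1
    if total == 0:
        raise ValueError("need at least two tokens in cat_a and one in cat_b")
    return errors / total


# Toy data: two well-separated categories in a 2-D feature space.
well_separated = abx_error_rate([[0.0, 0.0], [0.0, 1.0]], [[5.0, 5.0], [6.0, 5.0]])
# Toy data: category A tokens are far apart, category B sits between them.
confusable = abx_error_rate([[0.0], [10.0]], [[1.0]])
```

A lower error rate means the representation separates the two categories better. In this sketch only category A supplies the X tokens; a symmetric score would average the result with `abx_error_rate(cat_b, cat_a)`.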

Thiollière, R., Dunbar, E., Synnaeve, G., Versteegh, M. & Dupoux, E. (2015). A Hybrid Dynamic Time Warping-Deep Neural Network Architecture for Unsupervised Acoustic Modeling. In INTERSPEECH-2015, (pp 3179-3183). [abstract] We report on an architecture for the unsupervised discovery of talker-invariant subword embeddings. It is made out of two components: a dynamic-time warping based spoken term discovery (STD) system and a Siamese deep neural network (DNN). The STD system clusters word-sized repeated fragments in the acoustic streams while the DNN is trained to minimize the distance between time aligned frames of tokens of the same cluster, and maximize the distance between tokens of different clusters. We use additional side information regarding the average duration of phonemic units, as well as talker identity tags. For evaluation we use the datasets and metrics of the Zero Resource Speech Challenge. The model shows improvement over the baseline in subword unit modeling.

Michon, E., Dupoux, E. & Cristia, A. (2015). Salient dimensions in implicit phonotactic learning. In INTERSPEECH-2015, (pp 2665-2669). [abstract] Adults are able to learn sound co-occurrences without conscious knowledge after brief exposures. But which dimensions of sounds are most salient in this process? Using an artificial phonology paradigm, we explored potential learnability differences involving consonant-, speaker-, and tone-vowel co-occurrences. Results revealed that participants, whose native language was not tonal, implicitly encoded consonant-vowel patterns with a high level of accuracy; were above chance for tone-vowel co-occurrences; and were at chance for speaker-vowel co-occurrences. This pattern of results is exactly what would be expected if both language-specific experience and innate biases to encode potentially contrastive linguistic dimensions affect the salience of different dimensions during implicit learning of sound patterns.

Martin, A., Schatz, T., Versteegh, M., Miyazawa, K., Mazuka, R., Dupoux, E. & Cristia, A. (2015). Mothers speak less clearly to infants: A comprehensive test of the hyperarticulation hypothesis. Psychological Science, 26(3), 341-347. [abstract] Infants learn language at an incredible speed, and one of the first steps in this voyage includes learning the basic sound units of their native language. It is widely thought that caregivers facilitate this task by hyperarticulating when speaking to their infants. Utilizing state-of-the-art speech technology, we address this key theoretical question: Are sound categories clearer in infant- than in adult-directed speech? A comprehensive examination of sound contrasts in a large corpus of spontaneous Japanese demonstrates that there is a small but significant tendency for contrasts in infant-directed speech to be less clear than those in adult-directed speech, contrary to the idea that caregivers actively enhance phonetic categories in infant-directed speech. These results suggest that the ability to learn from noisy data must be a crucial component of plausible theories of infant language acquisition.

Ludusan, B., Synnaeve, G. & Dupoux, E. (2015). Prosodic boundary information helps unsupervised word segmentation. In NAACL HLT 2015, (pp 953-963).

Ludusan, B., Seidl, A., Dupoux, E. & Cristia, A. (2015). Motif discovery in infant- and adult-directed speech. In Proceedings of CogACLL2015, (pp 93-102). [abstract] Infant-directed speech (IDS) is thought to play a key role in determining infant language acquisition. It is thus important to describe to what extent it differs from adult-directed speech (ADS) in dimensions that could affect learnability. In this paper, we explore how an acoustic motif discovery algorithm fares when presented with spontaneous speech from both registers. Results show small but significant differences in performance, with lower recall and higher fragmentation in IDS than ADS. Such a result is inconsistent with a view of IDS where clarity and ease of lexical recognition is a primary consideration. Additionally, it predicts that learners who extract acoustic word-forms should do worse with IDS than ADS. Similarities and differences with human infants' performance on word segmentation tasks are discussed.

Ludusan, B., Origlia, A. & Dupoux, E. (2015). Rhythm-Based Syllabic Stress Learning without Labelled Data. In Proceedings of Statistical Language and Speech Processing -SLSP 2015, (pp 185-196). [abstract] In this paper we propose a method for syllabic stress annotation which does not require manual labels for the learning process, but uses stress labels automatically generated from a multiscale model of rhythm perception. The model gives in its output a sequence of events, corresponding to the sequences of strong-weak syllables present in speech, based on which a stressed/unstressed decision is taken. We tested our approach on two languages, Catalan and Spanish, and we found that a supervised system employing the automatic labels for learning improves the performance over the baseline, for both languages. We also compared the results of this system with that of an identical learning algorithm, but which employs manual labels for stress, as well as to that of an unsupervised learning algorithm using the same features. It showed that the system using automatic labels has a similar performance to the one using manual labels, with both supervised systems outperforming the clustering algorithm.

Ludusan, B., Caranica, A., Cucu, H., Buzo, A., Burileanu, C. & Dupoux, E. (2015). Exploring multi-language resources for unsupervised spoken term discovery. In Speech Technology and Human-Computer Dialogue (SpeD), 2015 International Conference on, (pp 1-6). [abstract] With information processing and retrieval of spoken documents becoming an important topic, there is a need for systems performing automatic segmentation of audio streams. Among such algorithms, spoken term discovery allows the extraction of word-like units (terms) directly from the continuous speech signal, in an unsupervised manner and without any knowledge of the language at hand. Since the performance of any downstream application depends on the goodness of the terms found, it is relevant to try to obtain higher quality automatic terms. In this paper we investigate whether the use of input features derived from multi-language resources helps the process of term discovery. For this, we employ an open-source phone recognizer to extract posterior probabilities and phone segment decisions, for several languages. We examine the features obtained from a single language and from combinations of languages based on the spoken term discovery results attained on two different datasets of English and Xitsonga. Furthermore, a comparison to the results obtained with standard spectral features is performed and the implications of the work discussed.

Ludusan, B. & Dupoux, E. (2015). A multilingual study on intensity as a cue for marking prosodic boundaries. In ICPhS, (pp e982). [abstract] Speech intensity is one of the main prosodic cues, playing a role in most of the suprasegmental phenomena. Despite this, its contribution to the signalling of prosodic hierarchy is still relatively under-studied, compared to the other cues, like duration or fundamental frequency. We present here an investigation on the role of intensity in prosodic boundary detection in four different languages, by testing several intensity measures. The statistical analysis performed showed significant correlates of prosodic boundaries, for most intensity measures employed and in all languages. Our findings were further validated with a classification experiment in which the boundary/non-boundary distinction was learned in unsupervised manner, using only intensity cues. It showed that intensity range measures outperform absolute intensity measures, with the total intensity range being consistently the best feature.

Johnson, M., Pater, J., Staub, R. & Dupoux, E. (2015). Sign constraints on feature weights improve a joint model of word segmentation and phonology. In NAACL HLT 2015, (pp 303-313). [abstract] This paper describes a joint model of word segmentation and phonological alternations, which takes unsegmented utterances as input and infers word segmentations and underlying phonological representations. The model is a Maximum Entropy or log-linear model, which can express a probabilistic version of Optimality Theory (OT; Prince and Smolensky, 2004), a standard phonological framework. The features in our model are inspired by OT's Markedness and Faithfulness constraints. Following the OT principle that such features indicate "violations", we require their weights to be non-positive. We apply our model to a modified version of the Buckeye corpus (Pitt et al., 2007) in which the only phonological alternations are deletions of word-final /d/ and /t/ segments. The model sets a new state-of-the-art for this corpus for word segmentation, identification of underlying forms, and identification of /d/ and /t/ deletions. We also show that the OT-inspired sign constraints on feature weights are crucial for accurate identification of deleted /d/s; without them our model posits approximately 10 times more deleted underlying /d/s than appear in the manually annotated data.

Hermansky, H., Burget, L., Cohen, J., Dupoux, E., Feldman, N., Godfrey, J., Khudanpur, S., Maciejewski, M., Mallidi, S.H., Menon, A., Ogawa, T., Peddinti, V., Rose, R., Stern, R., Wiesner, M. & Vesely, K. (2015). Towards machines that know when they do not know: Summary of work done at 2014 Frederick Jelinek memorial workshop in Prague. In ICASSP-2015 (IEEE International Conference on Acoustics Speech and Signal Processing), (pp 5009-5013). [abstract] A group of junior and senior researchers gathered as a part of the 2014 Frederick Jelinek Memorial Workshop in Prague to address the problem of predicting the accuracy of a nonlinear Deep Neural Network probability estimator for unknown data in a different application domain from the domain in which the estimator was trained. The paper describes the problem and summarizes approaches that were taken by the group.

Dunbar, E., Synnaeve, G. & Dupoux, E. (2015). Quantitative methods for comparing featural representations. In ICPhS, (pp paper number 1024). [abstract] The basic representational hypothesis in phonology is that segments are coded using a universal set of discrete features. We propose a method for quantitatively measuring how well such features align with arbitrary segment representations. We assess articulatory, spectral, and phonotactic representations of English consonants. Our procedure constructs a concrete representation of a feature in terms of the pairs it distinguishes, and can be extended to any pair of representations to test the consistency of one with the individual dimensions of the other. We validate the method on our phonetic representations and then show that major natural classes are not well represented in the surface phonotactics.

Synnaeve, G., Versteegh, M. & Dupoux, E. (2014). Learning words from images and speech. In NIPS Workshop on Learning Semantics.

Synnaeve, G., Schatz, T. & Dupoux, E. (2014). Phonetics embedding learning with side information. In IEEE Spoken Language Technology Workshop, (pp 106-111). [abstract] We show that it is possible to learn an efficient acoustic model using only a small amount of easily available word-level similarity annotations. In contrast to the detailed phonetic labeling required by classical speech recognition technologies, the only information our method requires are pairs of speech excerpts which are known to be similar (same word) and pairs of speech excerpts which are known to be different (different words). An acoustic model is obtained by training shallow and deep neural networks, using an architecture and a cost function well-adapted to the nature of the provided information. The resulting model is evaluated on an ABX minimal-pair discrimination task and is shown to perform much better (11.8% ABX error rate) than raw speech features (19.6%), not far from a fully supervised baseline (best neural network: 9.2%, HMM-GMM: 11%).

Synnaeve, G., Dautriche, I., Boerschinger, B., Johnson, M. & Dupoux, E. (2014). Unsupervised word segmentation in context. In Proceedings of 25th International Conference on Computational Linguistics (CoLing), (pp 2326-2334). [abstract] This paper extends existing word segmentation models to take non-linguistic context into account. It improves the token F-score of well-performing segmentation models by 2.5% on a 27k utterances dataset. We posit that word segmentation is easier in-context because the learner is not trying to access irrelevant lexical items. We use topics from Latent Dirichlet Allocation as a proxy for activities context, to label the Providence corpus. We present Adaptor Grammar models that use these context labels, and we study their performance with and without context annotations at test time.

Schatz, T., Peddinti, V., Cao, X.N., Bach, F., Hermansky, H. & Dupoux, E. (2014). Evaluating speech features with the Minimal-Pair ABX task (II): Resistance to noise. In INTERSPEECH-2014, (pp 915-919). [abstract] The Minimal-Pair ABX (MP-ABX) paradigm has been proposed as a method for evaluating speech features for zero-resource/unsupervised speech technologies. We apply it in a phoneme discrimination task on the Articulation Index corpus to evaluate the resistance to noise of various speech features. In Experiment 1, we evaluate the robustness to additive noise at different signal-to-noise ratios, using car and babble noise from the Aurora-4 database and white noise. In Experiment 2, we examine the robustness to different kinds of convolutional noise. In both experiments we consider two classes of techniques to induce noise resistance: smoothing of the time-frequency representation and short-term adaptation in the time-domain. We consider smoothing along the spectral axis (as in PLP) and along the time axis (as in FDLP). For short-term adaptation in the time-domain, we compare the use of a static compressive non-linearity followed by RASTA filtering to an adaptive compression scheme.

Ludusan, B., Versteegh, M., Jansen, A., Gravier, G., Cao, X.N., Johnson, M. & Dupoux, E. (2014). Bridging the gap between speech technology and natural language processing: an evaluation toolbox for term discovery systems. In Proceedings of LREC 2014, (pp 560-567). [abstract] The unsupervised discovery of linguistic terms from either continuous phoneme transcriptions or from raw speech has seen an increasing interest in the past years both from a theoretical and a practical standpoint. Yet, there exists no common accepted evaluation method for the systems performing term discovery. Here, we propose such an evaluation toolbox, drawing ideas from both speech technology and natural language processing. We first transform the speech-based output into a symbolic representation and compute five types of evaluation metrics on this representation: the quality of acoustic matching, the quality of the clusters found, and the quality of the alignment with real words (type, token, and boundary scores). We tested our approach on two term discovery systems taking speech as input, and one using symbolic input. The latter was run using both the gold transcription and a transcription obtained from an automatic speech recognizer, in order to simulate the case when only imperfect symbolic information is available. The results obtained are analysed through the use of the proposed evaluation metrics and the implications of these metrics are discussed.

Ludusan, B., Gravier, G. & Dupoux, E. (2014). Incorporating Prosodic Boundaries in Unsupervised Term Discovery. In Proceedings of Speech Prosody, 7, (pp 939-943). [abstract] We present a preliminary investigation on the usefulness of prosodic boundaries for unsupervised term discovery (UTD). Studies in language acquisition show that infants use prosodic boundaries to segment continuous speech into word-like units. We evaluate whether such a strategy could also help UTD algorithms. Running a previously published UTD algorithm (MODIS) on a corpus of prosodically annotated English broadcast news revealed that many discovered terms straddle prosodic boundaries. We then implemented two variants of this algorithm: one that discards straddling items and one that truncates them to the nearest boundary (either prosodic or pause marker). Both algorithms showed a better term matching F-score compared to the baseline and higher level prosodic boundaries were found to be better than lower level boundaries or pause markers. In addition, we observed that the truncation algorithm, but not the discard algorithm, increased word boundary F-score over the baseline.

Ludusan, B. & Dupoux, E. (2014). Towards Low Resource Prosodic Boundary Detection. In Proceedings of International Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU'14), (pp 231-237). [abstract] In this study we propose a method of prosodic boundary detection based only on acoustic cues which are easily extractable from the speech signal and without any supervision. Drawing a parallel between the process of language acquisition in babies and the speech processing techniques for under-resourced languages, we take advantage of the findings of several psycholinguistic studies relative to the cues used by babies for the identification of prosodic boundaries. Several durational and pitch cues were investigated, by themselves or in a combination, and relatively good performances were achieved. The best result obtained, a combination of all the cues, compares well against a previously proposed approach, without relying on any learning method or any lexical or syntactic cues.

Johnson, M., Christophe, A., Demuth, K. & Dupoux, E. (2014). Modelling function words improves unsupervised word segmentation. In Proceedings of the 52nd Annual meeting of the ACL, (pp 282-292). [abstract] Inspired by experimental psychological findings suggesting that function words play a special role in word learning, we make a simple modification to an Adaptor Grammar based Bayesian word segmentation model to allow it to learn sequences of monosyllabic "function words" at the beginnings and endings of collocations of (possibly multi-syllabic) words. This modification improves unsupervised word segmentation on the standard Bernstein-Ratner (1987) corpus of child-directed English by more than 4% token f-score compared to a model identical except that it does not special-case "function words", setting a new state-of-the-art of 92.4% token f-score. Our function word model assumes that function words appear at the left periphery, and while this is true of languages such as English, it is not true universally. We show that a learner can use Bayesian model selection to determine the location of function words in their language, even though the input to the model only consists of unsegmented sequences of phones. Thus our computational models support the hypothesis that function words play a special role in word learning.

Fourtassi, A., Schatz, T., Varadarajan, B. & Dupoux, E. (2014). Exploring the Relative Role of Bottom-up and Top-down Information in Phoneme Learning. In Proceedings of the 52nd Annual meeting of the ACL, 2, (pp 1-6) Association for Computational Linguistics. [abstract] We test both bottom-up and top-down approaches in learning the phonemic status of the sounds of English and Japanese. We used large corpora of spontaneous speech to provide the learner with an input that models both the linguistic properties and statistical regularities of each language. We found both approaches to help discriminate between allophonic and phonemic contrasts with a high degree of accuracy, although top-down cues proved to be effective only on an interesting subset of the data. We test their performance in a task that consists in discriminating within-category contrasts from between-category contrasts. Finally we discuss the role and scope of each approach in learning phonemes.

Fourtassi, A., Dunbar, E. & Dupoux, E. (2014). Self Consistency as an Inductive Bias in Early Language Acquisition. In Proceedings of the 36th Annual Meeting of the Cognitive Science Society, (pp 469-474). [abstract] In this paper we introduce an inductive bias for language acquisition. It is based on a holistic approach, whereby the levels of representations are not treated in isolation, but as different interacting parts. The best representation of the sound system is the one that leads to the best lexicon, defined as the one that sustains the most coherent semantics. We quantify this coherence through an intrinsic and unsupervised measure called "Self Consistency". We found this measure to be optimal under the true phonemic inventory and the correct word segmentation in English and Japanese.

Fourtassi, A. & Dupoux, E. (2014). A Rudimentary Lexicon and Semantics Help Bootstrap Phoneme Acquisition. In Proceedings of the 18th Conference on Computational Natural Language Learning (CoNLL), (pp 191-200) Association for Computational Linguistics. [abstract] Infants spontaneously discover the relevant phonemes of their language without any direct supervision. This acquisition is puzzling because it seems to require the availability of high levels of linguistic structures (lexicon, semantics), that logically suppose the infants having a set of phonemes already. We show how this circularity can be broken by testing, in real-size language corpora, a scenario whereby infants would learn approximate representations at all levels, and then refine them in a mutual constraining way. We start with corpora of spontaneous speech that have been encoded in a varying number of detailed context-dependent allophones. We derive an approximate lexicon and a rudimentary semantic representation. Despite the fact that all these representations are poor approximations of the ground truth, they help reorganize the fine grained categories into phoneme-like categories with a high degree of accuracy.

Cristia, A., Minagawa-Kawai, Y., Vendelin, I., Cabrol, D. & Dupoux, E. (2014). Responses to vocalizations and auditory controls in the human newborn brain. Plos One, 9(12), e115162. [abstract] The functional organization of the human adult brain allows selective activation of specific regions in response to stimuli. In the adult, linguistic processing has been associated with left-dominant activations in perisylvian regions, whereas emotional vocalizations can give place to right-dominant activation in posterior temporal cortices. Near Infrared Spectroscopy was used to register the response of 40 newborns' temporal regions when stimulated with speech, human and macaque emotional vocalizations, and auditory controls where the formant structure was destroyed but the long-term spectrum was retained. Speech elicited left-dominant activation in one channel in left posterior temporal cortices, as well as in more anterior, deeper tissue with no clear lateralization. Emotional vocalizations induced left-dominant, large activations in more anterior regions, and induced activation. Finally, activation elicited by the control stimuli was right-dominant, and more variable across infants. Overall, these results suggest that left-dominance for speech processing in newborns may be partially modulated by the presence of formant structure, which is shared between speech and non-linguistic vocalizations. Moreover, they indicate that development plays an important role in shaping the cortical networks involved in the processing of emotional vocalizations.

Cristia, A., Minagawa-Kawai, Y., Egorova, N., Gervain, J., Filippin, L., Cabrol, D. & Dupoux, E. (2014). Neural correlates of infant dialect discrimination: A fNIRS study. Developmental Science, 17(4), 628-635. [abstract] The present study investigated the neural correlates of infant discrimination of very similar linguistic varieties (Quebecois and Parisian French) using functional Near InfraRed Spectroscopy. In line with previous behavioral and electrophysiological data, there was no evidence that 3-month-olds discriminated the two regional accents, whereas 5-month-olds did, with the locus of discrimination in left anterior perisylvian regions. These neuroimaging results suggest that a developing language network relying crucially on left perisylvian cortices sustains infants' discrimination of similar linguistic varieties within this early period of infancy.

Buon, M., Jacob, P., Margules, S., Brunet, I., Dutat, M., Cabrol, D. & Dupoux, E. (2014). Friend or foe? Early social evaluation of human interactions. PloS One, 9(2), e88612. [abstract] We report evidence that 29-month-old toddlers and preverbal 10-month-old human infants discriminate between two agents, a pro-social agent, who performs a positive action on a human patient and a negative action on an inanimate object, and an anti-social agent, who does the opposite. Furthermore the evidence shows that they prefer the former to the latter even though the agents perform the same bodily movements. Given that humans can be threats to their conspecifics, we discuss this finding in light of the likely adaptive value of the ability to detect harmful human agents.

Schatz, T., Peddinti, V., Bach, F., Jansen, A., Hermansky, H. & Dupoux, E. (2013). Evaluating speech features with the Minimal-Pair ABX task: Analysis of the classical MFC/PLP pipeline. In INTERSPEECH-2013, (pp 1781-1785). [abstract] We present a new framework for the evaluation of speech representations in zero-resource settings, that extends and complements previous work by Carlin, Jansen and Hermansky [1]. In particular, we replace their Same/Different discrimination task by several Minimal-Pair ABX (MP-ABX) tasks. We explain the analytical advantages of this new framework and apply it to decompose the standard signal processing pipelines for computing PLP and MFC coefficients. This method enables us to confirm and quantify a variety of well-known and not-so-well-known results in a single framework.

Ngon, C., Martin, A., Dupoux, E., Cabrol, D. & Peperkamp, S. (2013). Nonwords, nonwords, nonwords: Evidence for a proto-lexicon during the first year of life. Developmental Science, 16(1), 24-34. [abstract] Previous research with artificial language learning paradigms has shown that infants are sensitive to statistical cues to word boundaries (Saffran, Aslin & Newport, 1996) and that they can use these cues to extract word-like units (Saffran, 2001). However, it is unknown whether infants use statistical information to construct a recognition lexicon when acquiring their native language. In order to investigate this issue, we rely on the fact that besides real words a statistical algorithm extracts sound sequences that are highly frequent in infant-directed speech but constitute nonwords. In two experiments, we use a preferential listening paradigm to test French-learning 11-month-old infants' recognition of highly frequent disyllabic sequences from their native language. In Experiment 1, we use nonword stimuli and find that infants listen longer to high-frequency than to low-frequency sequences. In Experiment 2, we compare high-frequency nonwords to real words in the same frequency range, and find that infants show no preference. Thus, at 11 months, French-learning infants recognize highly frequent sound sequences from their native language and fail to differentiate between words and nonwords among these sequences. These results are evidence that they have used statistical information to extract word candidates from their input and store them in a "proto-lexicon", containing both words and nonwords.

Minagawa-Kawai, Y., Cristia, A., Long, B., Vendelin, I., Hakuno, Y., Dutat, M., Filippin, L., Cabrol, D. & Dupoux, E. (2013). Insights on NIRS sensitivity from a cross-linguistic study on the emergence of phonological grammar. Frontiers in Language Sciences, 4(170), 10.3389/fpsyg.2013.00170. [abstract] ABSTRACT = Each language has a unique set of phonemic categories and phonotactic rules which determine permissible sound sequences in that language. Behavioral research demonstrates that one's native language shapes the perception of both sound categories and sound sequences in adults, and neuroimaging results further indicate that the processing of native phonemes and phonotactics involves a left-dominant perisylvian brain network. Recent work using a novel technique, functional Near InfraRed Spectroscopy (NIRS), has suggested that a left-dominant network becomes evident toward the end of the first year of life as infants process phonemic contrasts. The present research project attempted to assess whether the same pattern would be seen for native phonotactics. We measured brain responses in Japanese- and French-learning infants to two contrasts: Abuna vs. Abna (a phonotactic contrast that is native in French, but not in Japanese) and Abuna vs. Abuuna (a vowel length contrast that is native in Japanese, but not in French). Results did not show a significant response to either contrast in either group, unlike both previous behavioral research on phonotactic processing and NIRS work on phonemic processing. To understand these null results, we performed similar NIRS experiments with Japanese adult participants. These data suggest that the infant null results arise from an interaction of multiple factors, involving the suitability of the experimental paradigm for NIRS measurements and stimulus perceptibility. We discuss the challenges facing this novel technique, particularly focusing on the optimal stimulus presentation which could yield strong enough hemodynamic responses when using the change detection paradigm.

Martin, A., Peperkamp, S. & Dupoux, E. (2013). Learning Phonemes with a Proto-lexicon. Cognitive Science, 37, 103-124. [abstract] ABSTRACT = Before the end of the first year of life, infants begin to lose the ability to perceive distinctions between sounds that are not phonemic in their native language. It is typically assumed that this developmental change reflects the construction of language-specific phoneme categories, but how these categories are learned largely remains a mystery. Peperkamp, Le Calvez, Nadal, & Dupoux (2006) present an algorithm that can discover phonemes using the distributions of allophones as well as the phonetic properties of the allophones and their contexts. We show that a third type of information source, the occurrence of pairs of minimally-differing word forms in speech heard by the infant, is also useful for learning phonemic categories, and is in fact more reliable than purely distributional information in data containing a large number of allophones. In our model, learners build an approximation of the lexicon consisting of the high-frequency n-grams present in their speech input, allowing them to take advantage of top-down lexical information without needing to learn words. This may explain how infants have already begun to exhibit sensitivity to phonemic categories before they have a large receptive lexicon.

Jansen, A., Dupoux, E., Goldwater, S., Johnson, M., Khudanpur, S., Church, K., Feldman, N., Hermansky, H., Metze, F., Rose, R., Seltzer, M., Clark, P., McGraw, I., Varadarajan, B., Bennett, E., Boerschinger, B., Chiu, J., Dunbar, E., Fourtassi, A., Harwath, D., Lee, C.y., Levin, K., Norouzian, A., Peddinti, V., Richardson, R., Schatz, T. & Thomas, S. (2013). A summary of the 2012 JH CLSP Workshop on zero resource speech technologies and models of early language acquisition. In ICASSP-2013 (IEEE International Conference on Acoustics Speech and Signal Processing), (pp 8111-8115) . [abstract] ABSTRACT = We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding zero resource (unsupervised) speech technologies and related models of early language acquisition. Centered around the tasks of phonetic and lexical discovery, we consider unified evaluation metrics, present two new approaches for improving speaker independence in the absence of supervision, and evaluate the application of Bayesian word segmentation algorithms to automatic subword unit tokenizations. Finally, we present two strategies for integrating zero resource techniques into supervised settings, demonstrating the potential of unsupervised methods to improve mainstream technologies.

Fourtassi, A., Boerschinger, B., Johnson, M. & Dupoux, E. (2013). Why is English so easy to segment? In Proceedings of the 4th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2013), (pp 1-10) . [abstract] ABSTRACT = Cross-linguistic studies on unsupervised word segmentation have consistently shown that English is easier to segment than other languages. In this paper, we propose an explanation based on the notion of segmentation ambiguity. We show that English has a very low segmentation ambiguity compared to Japanese and that this difference correlates with the segmentation performance in a unigram model. We suggest that segmentation ambiguity is linked to a trade-off between syllable structure complexity and word length distribution.

Fourtassi, A. & Dupoux, E. (2013). A corpus-based evaluation method for Distributional Semantic Models. In Proceedings of ACL-SRW 2013, (pp 165-171) . [abstract] ABSTRACT = Evaluation methods for Distributional Semantic Models typically rely on behaviorally derived gold standards. These methods are difficult to deploy in languages with scarce linguistic/behavioral resources. We introduce a corpus-based measure that evaluates the stability of the lexical semantic similarity space using a pseudo-synonym same-different detection task and no external resources. We show that it can predict two behavior-based measures across a range of parameters in a Latent Semantic Analysis model.

Cristia, A., Dupoux, E., Hakuno, Y., Lloyd-Fox, S., Schuetze, M., Kivits, J., Bergvelt, T., van Gelder, M., Filippin, L., Charron, S. & Minagawa-Kawai, Y. (2013). An online database of infant functional Near InfraRed Spectroscopy studies: A community-augmented systematic review. PLoS One, 8(3), e58906.

Buon, M., Jacob, P., Loissel, E. & Dupoux, E. (2013). A non-mentalistic cause-based heuristic in human social evaluations. Cognition, 126(2), 149-155.

Buon, M., Dupoux, E., Jacob, P., Chaste, P., Leboyer, M. & Zalla, T. (2013). The role of causal and intentional reasoning in moral judgment in individuals with High Functioning Autism. Journal of Autism and Developmental Disorders, 43(2), 458-70.

Kinzler, K.D., Dupoux, E. & Spelke, E.S. (2012). "Native" objects and collaborators: Infants' object choices and acts of giving reflect favor for native over foreign speakers. Journal of Cognition and Development, 13(1), 1-15. [abstract] Infants learn from adults readily and cooperate with them spontaneously, but how do they select culturally appropriate teachers and collaborators? Building on evidence that children demonstrate social preferences for speakers of their native language, Experiment 1 presented 10-month-old infants with videotaped events in which a native and a foreign speaker introduced two different toys. When given a chance to choose between real exemplars of the objects, infants preferentially chose the toy modeled by the native speaker. In Experiment 2, 2.5-year-old children were presented with the same videotaped native and foreign speakers, and played a game in which they could offer an object to one of two individuals. Children reliably gave to the native speaker. Together, the results suggest that infants and young children are selective social learners and cooperators, and that language provides one basis for this selectivity.

Jacquemot, C., Dupoux, E., Robotham, L. & Bachoud-Lévi, A.C. (2012). Specificity in rehabilitation of word production: a meta-analysis and a case study. Behavioural Neurology, 25(2), 73-101. [abstract] ABSTRACT = Speech production impairment is a frequent deficit observed in aphasic patients and rehabilitation programs have been extensively developed. Nevertheless, there is still no agreement on the type of rehabilitation that yields the most successful outcomes. Here, we ran a detailed meta-analysis of 39 studies of word production rehabilitation involving 124 patients. We used a model-driven approach for analyzing each rehabilitation task by identifying which levels of our model each task tapped into. We found that (1) not all rehabilitation tasks are equally efficient and the most efficient ones involved the activation of the two levels of the word production system: the phonological output lexicon and the phonological output, and (2) the activation of the speech perception system as it occurs in many tasks used in rehabilitation is not successful in rehabilitating word production. In this meta-analysis, the effect of the activation of the phonological output lexicon and the phonological output cannot be assessed separately. We further conducted a rehabilitation study with DPI, a patient who suffers from damage to the phonological output lexicon. Our results confirm that rehabilitation is more efficient, in terms of time and performance, when specifically addressing the impaired level of word production.

Cova, F., Dupoux, E. & Jacob, P. (2012). On doing things intentionally. Mind and Language, 27(4), 378-409. [abstract] ABSTRACT = Recent empirical and conceptual research has shown that moral considerations have an influence on the way we use the adverb "intentionally". Here we propose our own account of these phenomena according to which they arise from the fact that the adverb "intentionally" has three different meanings that are differently selected by contextual factors, including normative expectations. We argue that our hypotheses can account for most available data and present some new results which support this. We end by discussing the implications of our account for folk psychology.

Cleret de Langavant, L., Trinkler, I., Remy, P., Thirioux, B., McIntyre, J., Berthoz, A., Dupoux, E. & Bachoud-Lévi, A.C. (2012). Viewing another person's body as a target object: a behavioural and PET study of pointing. Neuropsychologia, 50(8), 1801-13.

Minagawa-Kawai, Y., van der Lely, H., Ramus, F., Sato, Y., Mazuka, R. & Dupoux, E. (2011). Optical Brain Imaging Reveals General Auditory and Language-Specific Processing in Early Infant Development. Cerebral Cortex, 21(2), 254-261. [abstract] This study uses near-infrared spectroscopy in young infants in order to elucidate the nature of functional cerebral processing for speech. Previous imaging studies of infants' speech perception revealed left-lateralized responses to native language. However, it is unclear if these activations were due to language per se rather than to some low-level acoustic correlate of spoken language. Here we compare native (L1) and non-native (L2) languages with 3 different nonspeech conditions including emotional voices, monkey calls, and phase scrambled sounds that provide more stringent controls. Hemodynamic responses to these stimuli were measured in the temporal areas of Japanese 4-month-olds. The results show clear left-lateralized responses to speech, prominently to L1, as opposed to various activation patterns in the nonspeech conditions. Furthermore, implementing a new analysis method designed for infants, we discovered a slower hemodynamic time course in awake infants. Our results are largely explained by signal-driven auditory processing. However, stronger activations to L1 than to L2 indicate a language-specific neural factor that modulates these responses. This study is the first to discover a significantly higher sensitivity to L1 in 4-month-olds and reveals a neural precursor of the functional specialization for the higher cognitive network.

Minagawa-Kawai, Y., Cristia, A., Vendelin, I., Cabrol, D. & Dupoux, E. (2011). Assessing signal-driven mechanisms in neonates: Brain responses to temporally and spectrally different sounds. Frontiers in Language Sciences, 2(135). [abstract] ABSTRACT = Past studies have found that, in adults, the acoustic properties of sound signals (such as fast vs. slow temporal features) differentially activate the left and right hemispheres, and some have hypothesized that left-lateralization for speech processing may follow from left-lateralization to rapidly changing signals. Here, we tested whether newborns' brains show some evidence of signal-specific lateralization responses using near-infrared spectroscopy (NIRS) and auditory stimuli that elicit lateralized responses in adults, composed of segments that vary in duration and spectral diversity. We found significantly greater bilateral responses of oxygenated hemoglobin (oxy-Hb) in the temporal areas for stimuli with a minimum segment duration of 21 ms than for stimuli with a minimum segment duration of 667 ms. However, we found no evidence for hemispheric asymmetries dependent on the stimulus characteristics. We hypothesize that acoustic-based functional brain asymmetries may develop throughout early infancy, and discuss their possible relationship with brain asymmetries for language.

Minagawa-Kawai, Y., Cristià, A. & Dupoux, E. (2011). Cerebral lateralization and early speech acquisition: A developmental scenario. Developmental Cognitive Neuroscience, 1(3), 217-232. [abstract] During the past ten years, research using Near-InfraRed Spectroscopy (NIRS) to study the developing brain has provided groundbreaking evidence of brain functions in infants. We review three competing classes of hypotheses (signal-driven, domain-driven, and learning-biases hypotheses) regarding the causes of hemispheric specialization for speech processing. We assess the fit between each of these hypotheses and neuroimaging evidence in speech perception and show that none of the three hypotheses can account for the entire set of observations on its own. However, we argue that they provide a good fit when combined within a developmental perspective. According to our proposed scenario, lateralization for language emerges out of the interaction between pre-existing left-right biases in generic auditory processing (signal-driven hypothesis), and a left-hemisphere predominance of particular learning mechanisms (learning-biases hypothesis). As a result of this completed developmental process, the native language is represented in the left hemisphere predominantly. The integrated scenario makes it possible to link infant and adult data, and points to many empirical avenues that need to be explored more systematically.

Mazuka, R., Cao, Y., Dupoux, E. & Christophe, A. (2011). The development of a phonological illusion: A cross-linguistic study with Japanese and French infants. Developmental Science, 14(4), 693-699. [abstract] ABSTRACT = In adults, the native language phonology has strong perceptual effects. Previous work showed that Japanese speakers, unlike French speakers, break up illegal sequences of consonants with illusory vowels: they report hearing abna as abuna. To study the development of the phonological grammar, we compared Japanese and French infants in a discrimination task. In Experiment 1, we observed that 14-month-old Japanese infants, in contrast with French infants, failed to discriminate phonetically varied sets of abna-type and abuna-type stimuli. In Experiment 2, 8-month-old French and Japanese infants did not differ significantly from each other. In Experiment 3, we found that, like adults, Japanese infants can discriminate abna from abuna when phonetic variability is reduced (single item). These results show that the phonologically-induced /u/ illusion is already experienced by Japanese infants at the age of 14 months. Hence, before having acquired many words of their language, they have grasped enough of their native phonological grammar to constrain their perception of speech sound sequences.

Jacquemot, C., Dupoux, E. & Bachoud-Lévi, A.C. (2011). Is the word-length effect linked to subvocal rehearsal? Cortex, 47(4), 484-493. [abstract] Models of phonological short-term memory (pSTM) generally distinguish between two components: a phonological buffer and a subvocal rehearsal. Evidence for these two components comes, respectively, from the phonological similarity effect and the word-length effect which disappears under articulatory suppression. But alternative theories posit that subvocal rehearsal is only an optional component of the pSTM. According to them, the depletion of the length effect under articulatory suppression results from the interference of the self-produced speech rather than the disruption of subvocal rehearsal. In order to disentangle these two theories, we tested two patients with a short-term memory deficit: FA, who presents a pseudoword repetition deficit, and FL, who does not. FA's deficit allowed for the observation of an ecological case of subvocal rehearsal disruption without any articulatory suppression task. FA's performance in pSTM tasks reveals, as in controls, a phonological similarity effect and, contrary to controls, no word-length effect. In contrast, the second patient, FL, exhibits the same effects as control subjects. This result is in accordance with models of pSTM in which the word-length effect emerges from subvocal rehearsal and disappears when the latter is disrupted.

Hannagan, T., Dupoux, E. & Christophe, A. (2011). Holographic String Encoding. Cognitive Science, 35(1), 79-118. [abstract] In this article, we apply a special case of holographic representations to letter position coding. We translate different well-known schemes into this format, which uses distributed representations and supports constituent structure. We show that in addition to these brain-like characteristics, performances on a standard benchmark of behavioral effects are improved in the holographic format relative to the standard localist one. This notably occurs because of emerging properties in holographic codes, like transposition and edge effects, for which we give formal demonstrations. Finally, we outline the limits of the approach as well as its possible future extensions.

Dupoux, E., Parlato, E., Frota, S., Hirose, Y. & Peperkamp, S. (2011). Where do illusory vowels come from? Journal of Memory and Language, 64(3), 199-210. [abstract] Listeners of various languages tend to perceive an illusory vowel inside consonant clusters that are illegal in their native language. Here, we test whether this phenomenon arises after phoneme categorization or rather interacts with it. We assess the perception of illegal consonant clusters in native speakers of Japanese, Brazilian Portuguese, and European Portuguese, three languages that have similar phonological properties, but that differ with respect to both segmental categories and segmental transition probabilities. We manipulate the coarticulatory information present in the consonant clusters, and use a forced choice vowel labeling task (Experiment 1) and an ABX discrimination task (Experiment 2). We find that only Japanese and Brazilian Portuguese listeners show a perceptual epenthesis effect, and, furthermore, that within these participant groups the nature of the perceived epenthetic vowel varies according to the coarticulation cues. These results are consistent with models that integrate phonotactic probabilities within perceptual categorization, and are problematic for two-step models in which the repair of illegal sequences follows that of categorization.

Dupoux, E., Beraud-Sudreau, G. & Sagayama, S. (2011). Templatic features for modeling phoneme acquisition. In Proceedings of the 33rd Annual Conference of the Cognitive Science Society, Boston, Mass. [abstract] We describe a model for the coding of speech sounds into a high dimensional space. This code is obtained by computing the similarity between speech sounds and stored syllable-sized templates. We show that this code yields a better linear separation of phonemes than the standard MFCC code. Additional experiments show that the code is tuned to a particular language, and is able to use temporal cues for the purpose of phoneme recognition. Optimal templates seem to correspond to chunks of speech of around 120ms containing transitions between phonemes or syllables.

Cleret de Langavant, L., Remy, P., Trinkler, I., McIntyre, J., Dupoux, E., Berthoz, A. & Bachoud-Lévi, A.C. (2011). Behavioral and Neural Correlates of Communication via Pointing. PLoS One, 6(3), e17719. [abstract] Communicative pointing is a human specific gesture which allows sharing information about a visual item with another person. It sets up a three-way relationship between a subject who points, an addressee and an object. Yet psychophysical and neuroimaging studies have focused on non-communicative pointing, which implies a two-way relationship between a subject and an object without the involvement of an addressee, and makes such gesture comparable to touching or grasping. Thus, experimental data on the communicating function of pointing remain scarce. Here, we examine whether the communicative value of pointing modifies both its behavioral and neural correlates by comparing pointing with or without communication. We found that when healthy participants pointed repeatedly at the same object, the communicative interaction with an addressee induced a spatial reshaping of both the pointing trajectories and the endpoint variability. Our finding supports the hypothesis that a change in reference frame occurs when pointing conveys a communicative intention. In addition, measurement of regional cerebral blood flow using H2O15 PET-scan showed that pointing when communicating with an addressee activated the right posterior superior temporal sulcus and the right medial prefrontal cortex, in contrast to pointing without communication. Such a right hemisphere network suggests that the communicative value of pointing is related to processes involved in taking another person's perspective. This study brings to light the need for future studies on communicative pointing and its neural correlates by unraveling the three-way relationship between subject, object and an addressee.

Boruta, L., Peperkamp, S., Crabbé, B. & Dupoux, E. (2011). Testing the robustness of online word segmentation: effects of linguistic diversity and phonetic variation. In Proceedings of the 2011 Workshop on Cognitive Modeling and Computational Linguistics, ACL, 1-9, Portland, Oregon. [abstract] Models of the acquisition of word segmentation are typically evaluated using phonemically transcribed corpora. Accordingly, they implicitly assume that children know how to undo phonetic variation when they learn to extract words from speech. Moreover, whereas models of language acquisition should perform similarly across languages, evaluation is often limited to English samples. Using child-directed corpora of English, French and Japanese, we evaluate the performance of state-of-the-art statistical models given inputs where phonetic variation has not been reduced. To do so, we measure segmentation robustness across different levels of segmental variation, simulating systematic allophonic variation or errors in phoneme recognition. We show that these models do not resist an increase in such variations and do not generalize to typologically different languages. From the perspective of early language acquisition, the results strengthen the hypothesis according to which phonological knowledge is acquired in large part before the construction of a lexicon.

Peperkamp, S., Vendelin, I. & Dupoux, E. (2010). Perception of predictable stress: A cross-linguistic investigation. Journal of Phonetics, 38(3), 422-430. [abstract] Previous studies have documented that speakers of French, a language with predictable stress, have difficulty distinguishing nonsense words that vary in stress position solely (stress "deafness"). In a sequence recall task with adult speakers of five languages with predictable stress (Standard French, Southeastern French, Finnish, Hungarian and Polish) and one language with non-predictable stress (Spanish), it was found that speakers of all languages with predictable stress except Polish exhibited a strong stress "deafness", while Spanish speakers exhibited no such "deafness". Polish speakers yielded an intermediate pattern of results: they exhibited a weak stress "deafness". These findings are discussed in light of current theoretical models of speech perception.

Parlato-Oliveira, E., Christophe, A., Hirose, Y. & Dupoux, E. (2010). Plasticity of illusory vowel perception in Brazilian-Japanese bilinguals. Journal of the Acoustical Society of America, 127(6), 3738-3748. [abstract] Previous research shows that monolingual Japanese and Brazilian Portuguese listeners perceive illusory vowels (/u/ and /i/, respectively) within illegal sequences of consonants. Here, several populations of Japanese-Brazilian bilinguals are tested, using an explicit vowel identification task (experiment 1), and an implicit categorization and sequence recall task (experiment 2). Overall, second-generation immigrants, who first acquired Japanese at home and Brazilian during childhood (after age 4) showed a typical Brazilian pattern of results (and so did simultaneous bilinguals, who were exposed to both languages from birth on). In contrast, late bilinguals, who acquired their second language in adulthood, exhibited a pattern corresponding to their native language. In addition, an influence of the second language was observed in the explicit task of Exp. 1, but not in the implicit task used in Exp. 2, suggesting that second language experience affects mostly explicit or metalinguistic skills. These results are compared to other studies of phonological representations in adopted children or immigrants, and discussed in relation to the role of age of acquisition and sociolinguistic factors. (C) 2010 Acoustical Society of America. [DOI: 10.1121/1.3327792]

Kouider, S., de Gardelle, V., Sackur, J. & Dupoux, E. (2010). How rich is consciousness? The partial awareness hypothesis. Trends in Cognitive Sciences, 14(7), 301-307. [abstract] Current theories of consciousness posit a dissociation between 'phenomenal' consciousness (rich) and 'access' consciousness (limited). Here, we argue that the empirical evidence for phenomenal consciousness without access is equivocal, resulting either from a confusion between phenomenal and unconscious contents, or from an impression of phenomenally rich experiences arising from illusory contents. We propose a refined account of access that relies on a hierarchy of representational levels and on the notion of partial awareness, whereby lower and higher levels are accessed independently. Reframing of the issue of dissociable forms of consciousness into dissociable levels of access provides a more parsimonious account of the existing evidence. In addition, the rich phenomenology illusion can be studied and described in terms of testable cognitive mechanisms.

Kouider, S., de Gardelle, V., Dehaene, S., Dupoux, E. & Pallier, C. (2010). Cerebral bases of subliminal speech priming. Neuroimage, 49(1), 922-929. [abstract] While the neural correlates of unconscious perception and subliminal priming have been largely studied for visual stimuli, little is known about their counterparts in the auditory modality. Here we used a subliminal speech priming method in combination with fMRI to investigate which regions of the cerebral network for language can respond in the absence of awareness. Participants performed a lexical decision task on target items preceded by subliminal primes, which were either phonetically identical or different from the target. Moreover, the prime and target could be spoken by the same speaker or by two different speakers. Word repetition reduced the activity in the insula and in the left superior temporal gyrus. Although the priming effect on reaction times was independent of voice manipulation, neural repetition suppression was modulated by speaker change in the superior temporal gyrus while the insula showed voice-independent priming. These results provide neuroimaging evidence of subliminal priming for spoken words and inform us on the first, unconscious stages of speech perception.

Dupoux, E., Peperkamp, S. & Sebastian-Galles, N. (2010). Limits on bilingualism revisited: Stress "deafness" in simultaneous French-Spanish bilinguals. Cognition, 114(2), 266-275. [abstract] We probed simultaneous French-Spanish bilinguals for the perception of Spanish lexical stress using three tasks, two short-term memory encoding tasks and a speeded lexical decision. In all three tasks, the performance of the group of simultaneous bilinguals was intermediate between that of native speakers of Spanish on the one hand and French late learners of Spanish on the other hand. Using a composite stress "deafness" index measure computed over the results of the three tasks, we found that the performance of the simultaneous bilinguals is best fitted by a bimodal distribution that corresponds to a mixture of the performance distributions of the two control groups. Correlation analyses showed that the variables explaining language dominance are linked to early language exposure. These findings are discussed in light of theories of language processing in bilinguals.

Teichmann, M., Darcy, I., Bachoud-Lévi, A.C. & Dupoux, E. (2009). The role of the striatum in phonological processing. Evidence from early stages of Huntington's disease. Cortex, 45(7), 839-849. [abstract] The linguistic role of subcortical structures such as the striatum is still controversial. According to the claim that language processing is subdivided into a lexical memory store and a computational rule system (Pinker, 1999) several studies on word morphology (e.g., Ullman et al., 1997) and on syntax (e.g., Teichmann et al., 2005) have suggested that the striatum is specifically dedicated to the latter component. However, little is known about whether the striatum is involved in phonological operations and whether its role in linguistic rule application generalizes to phonological processing. We investigated this issue by assessing perceptual compensation for assimilation rules in a model of striatal disorders, namely in the early stages of Huntington's disease (HD). In Experiment 1 we used a same-different task with isolated words to evaluate whether phoneme perception is intact in HD. In Experiment 2 a word detection task in phrasal contexts allowed for assessing both phoneme perception and perceptual compensation for the French regressive assimilation rule. Results showed that HD patients have normal performance with both phoneme perception in isolated words and regressive assimilation rules. However, in phrasal contexts they display reduced abilities of phoneme discrimination. These findings challenge the striatum-rule claim and suggest a more fine-grained function of striatal structures in linguistic rule processing. Alternative explanatory frameworks of the striatum-language link are discussed.

Skoruppa, K., Pons, F., Christophe, A., Bosch, L., Dupoux, E., Sebastian-Galles, N., Limissuri, R.A. & Peperkamp, S. (2009). Language-specific stress perception by 9-month-old French and Spanish infants. Developmental Science, 12(6), 914-919. [abstract] During the first year of life, infants begin to have difficulties perceiving non-native vowel and consonant contrasts, thus adapting their perception to the phonetic categories of the target language. In this paper, we examine the perception of a non-segmental feature, i.e. stress. Previous research with adults has shown that speakers of French (a language with fixed stress) have great difficulties in perceiving stress contrasts (Dupoux, Pallier, Sebastian & Mehler, 1997), whereas speakers of Spanish (a language with lexically contrastive stress) perceive these contrasts as accurately as segmental contrasts. We show that language-specific differences in the perception of stress likewise arise during the first year of life. Specifically, 9-month-old Spanish infants successfully distinguish between stress-initial and stress-final pseudo-words, while French infants of this age show no sign of discrimination. In a second experiment using multiple tokens of a single pseudo-word, French infants of the same age successfully discriminate between the two stress patterns, showing that they are able to perceive the acoustic correlates of stress. Their failure to discriminate stress patterns in the first experiment thus reflects an inability to process stress at an abstract, phonological level.

Kouider, S. & Dupoux, E. (2009). Episodic accessibility and morphological processing: Evidence from long-term auditory priming. Acta Psychologica, 130(1), 38-47. [abstract] Long-term priming studies of lexical processing have yielded conflicting claims as to whether abstract versus episodic representations are involved during word recognition. A critical piece of evidence that could separate the two accounts rests on the existence of full morphological priming, where morphologically related words yield the same amount of priming as repeated words. In this study, participants performed speeded lexical decision on lists of auditory words and non-words, which contained repeated, morphologically related, semantically related and phonologically related pairs of items. In order to minimize the involvement of episodic factors, we increased the prime-target interval and decreased their physical similarity by introducing a change in speaker's voice. We show that under conditions that minimize access to episodic features, the magnitude of repetition priming decreased to attain that of morphological priming. Importantly, morphological and repetition priming for words were always observed in the absence of any semantic and phonological priming, suggesting that they cannot be reduced to formal or meaning overlap. Our results support the view that long-term priming taps both abstract lexical codes with a morphological format and episodic memory components. Further, they show that episodic influences on priming can be modulated by prime-target interval and physical similarity.

Varadarajan, B., Khudanpur, S. & Dupoux, E. (2008). Unsupervised Learning of Acoustic Subword Units. In Proceedings of ACL-08: HLT, (pp 165-168). [abstract] Accurate unsupervised learning of phonemes of a language directly from speech is demonstrated via an algorithm for joint unsupervised learning of the topology and parameters of a hidden Markov model (HMM); states and short state-sequences through this HMM correspond to the learnt sub-word units. The algorithm, originally proposed for unsupervised learning of allophonic variations within a given phoneme set, has been adapted to learn without any knowledge of the phonemes. An evaluation methodology is also proposed, whereby the state-sequence that aligns to a test utterance is transduced in an automatic manner to a phoneme-sequence and compared to its manual transcription. Over 85% phoneme recognition accuracy is demonstrated for speaker-dependent learning from fluent, large-vocabulary speech.

Teichmann, M., Dupoux, E., Cesaro, P. & Bachoud-Lévi, A.C. (2008). The role of the striatum in sentence processing: Evidence from a priming study in early stages of Huntington's disease. Neuropsychologia, 46(1), 174-185. [abstract] The role of sub-cortical structures such as the striatum in language remains a controversial issue. Based on linguistic claims that language processing implies both recovery of lexical information and application of combinatorial rules it has been shown that striatal damaged patients have difficulties applying conjugation rules while lexical recovery of irregular forms is broadly spared (e.g., Ullman, M. T., Corkin, S., Coppola, M., Hickok, G., Growdon, J. H., Koroshetz, W. J., et al. (1997). A neural dissociation within language: Evidence that the mental dictionary is part of declarative memory, and that grammatical rules are processed by the procedural system. Journal of Cognitive Neuroscience, 9(2), 266-276). Here we bolstered the striatum-rule hypothesis by investigating lexical abilities and rule application at the phrasal level. Both processing aspects were assessed in a model of striatal dysfunction, namely Huntington's disease (HD). Using a semantic priming task we compared idiomatic prime sentences involving lexical access to whole phrases (e.g., "Paul has kicked the bucket") with idiom-derived sentences that contained passivation changes involving syntactic movement rules (e.g., "Paul was kicked by the bucket"), word changes (e.g., "Paul has crushed the bucket") or both. Target words were either idiom-related (e.g., "death") reflecting lexical access to idiom meanings, word-related (e.g., "bail") reflecting lexical access to single words, or unrelated (e.g., "table"). HD patients displayed selective abnormalities with passivated sentences whereas priming was normal with idioms and sentences containing only word changes. We argue that the role of the striatum in sentence processing specifically pertains to the application of syntactic movement rules whereas it is not involved in canonical rules required for active structures or in lexical processing aspects. Our findings support the striatum-rule hypothesis but suggest that it should be refined by tracking the particular kind of language rules depending on striatal computations.

Minagawa-Kawai, Y., Mori, K., Hebden, J.C. & Dupoux, E. (2008). Optical Imaging of infants' neurocognitive development: Recent advances and perspectives. Developmental Neurobiology, 68(6), 712-728. [abstract] Near-infrared spectroscopy (NIRS) provides a unique method of monitoring infant brain function by measuring the changes in the concentrations of oxygenated and deoxygenated hemoglobin. During the past 10 years, NIRS measurement of the developing brain has rapidly expanded. In this article, a brief discussion of the general principles of NIRS, including its technical advantages and limitations, is followed by a detailed review of the role played so far by NIRS in the study of infant perception and cognition, including language, and visual and auditory functions. Results have highlighted, in particular, the developmental changes of cerebral asymmetry associated with speech acquisition. Finally, suggestions for future studies of neurocognitive development using NIRS are presented. Although NIRS studies of the infant brain have yet to fulfill their potential, a review of the work done so far indicates that NIRS is likely to provide many unique insights in the field of developmental neuroscience.

Dupoux, E., de Gardelle, V. & Kouider, S. (2008). Subliminal speech perception and auditory streaming. Cognition, 109(2), 267-273. [abstract] Current theories of consciousness assume a qualitative dissociation between conscious and unconscious processing: while subliminal stimuli only elicit a transient activity, supraliminal stimuli have long-lasting influences. Nevertheless, the existence of this qualitative distinction remains controversial, as past studies confounded awareness and stimulus strength (energy, duration). Here, we used a masked speech priming method in conjunction with a submillisecond interaural delay manipulation to contrast subliminal and supraliminal processing at constant prime, mask and target strength. This delay induced a perceptual streaming effect, with the prime popping out in the supraliminal condition. By manipulating the prime-target interval (ISI), we show a qualitatively distinct profile of priming longevity as a function of prime awareness. While subliminal priming disappeared after half a second, supraliminal priming was independent of ISI. This shows that the distinction between conscious and unconscious processing depends on high-level perceptual streaming factors rather than low-level features (energy, duration).

Dupoux, E., Sebastian-Galles, N., Navarrete, E. & Peperkamp, S. (2008). Persistent stress "deafness": The case of French learners of Spanish. Cognition, 106(2), 682-706. [abstract] Previous research by Dupoux et al. [Dupoux, E., Pallier, C., Sebastian, N., & Mehler, J. (1997). A destressing "deafness" in French? Journal of Memory and Language 36, 406-421; Dupoux, E., Peperkamp, S., & Sebastian-Galles, N. (2001). A robust method to study stress "deafness". Journal of the Acoustical Society of America 110, 1608-1618.] found that French speakers, as opposed to Spanish ones, are impaired in discrimination tasks with stimuli that vary only in the position of stress. However, what was called stress "deafness" was only found in tasks that used high phonetic variability and memory load, not in cognitively less demanding tasks such as single token AX discrimination. This raised the possibility that instead of a perceptual problem, monolingual French speakers might simply lack a metalinguistic representation of contrastive stress, which would impair them in memory tasks. We examined a sample of 39 native speakers of French who underwent formal teaching of Spanish after age 10, and varied in degree of practice in this language. Using a sequence recall task, we observed in all our groups of late learners of Spanish the same impairment in short-term memory encoding of stress contrasts that was previously found in French monolinguals. Furthermore, using a speeded lexical decision task with word-nonword minimal pairs that differ only in the position of stress, we found that all late learners had much difficulty in the use of stress to access the lexicon. Our results show that stress "deafness" is better interpreted as a lasting processing problem resulting from the impossibility for French speakers to encode contrastive stress in their phonological representations. This affects their memory encoding as well as their lexical access in on-line tasks. The generality of such a persistent suprasegmental "deafness" is discussed in relation to current findings and models on the perception of non-native phonological contrasts.

Peperkamp, S. & Dupoux, E. (2007). Learning the mapping from surface to underlying representations in an artificial language. In J. Cole & J. Hualde (eds) Laboratory Phonology, 9, Mouton de Gruyter. [abstract] When infants acquire their native language they not only extract language-specific segmental categories and the words of their language, they also learn the underlying form of these words. This is difficult because words can have multiple phonetic realizations, according to the phonological context. In a series of artificial language-learning experiments with a phrase-picture matching task, we consider the respective contributions of word meaning and distributional information for the acquisition of underlying representations in the presence of an allophonic rule. We show that on the basis of semantic information, French adults can learn to map voiced and voiceless stops or fricatives onto the same underlying phonemes, whereas in their native language voicing is phonemic in all obstruents. They do not extend this knowledge to novel stops or fricatives, though. In the presence of distributional cues only, learning is much reduced and limited to the words subjects are trained on. We also test if phonological naturalness plays a role in this type of learning, and find that if semantic information is present, French adults can learn to map different segments onto a single underlying phoneme even if the mappings are highly unnatural. We discuss our findings in light of current statistical learning approaches to language acquisition.

Minagawa-Kawai, Y., Naoi, N., Nishijima, N., Kojima, S. & Dupoux, E. (2007). Developmental changes in cerebral responses to native and non-native vowels: a NIRS study. In Proceedings of the International Conference of Phonetic Sciences XVI, (pp 1877-1880) Saarbrucken. [abstract] While newborn infants discriminate speech sounds from languages that they have never heard, 6-month-olds demonstrate the beginnings of vowel classification specific to their native language. The neuronal correlates involved in such a dramatic perceptual reorganization process, however, are not well understood. Using near-infrared spectroscopy (NIRS), this study compares the neural responses of Japanese infants at 3-4 months and 7-8 months of age as well as of adults to native ([i] vs. [w]) and non-native vowel contrasts ([w] vs. [u]) within pseudo-word contexts. The findings demonstrated longitudinal developmental changes of functional temporal cortex asymmetries associated with exposure to the native language.

Kinzler, K.D., Dupoux, E. & Spelke, E.S. (2007). The native language of social cognition. Proceedings of the National Academy of Sciences of the United States of America, 104(30), 12577-12580. [abstract] What leads humans to divide the social world into groups, preferring their own group and disfavoring others? Experiments with infants and young children suggest these tendencies are based on predispositions that emerge early in life and depend, in part, on natural language. Young infants prefer to look at a person who previously spoke their native language. Older infants preferentially accept toys from native-language speakers, and preschool children preferentially select native-language speakers as friends. Variations in accent are sufficient to evoke these social preferences, which are observed in infants before they produce or comprehend speech and are exhibited by children even when they comprehend the foreign-accented speech. Early-developing preferences for native-language speakers may serve as a foundation for later-developing preferences and conflicts among social groups.

Jacquemot, C., Dupoux, E. & Bachoud-Lévi, A.C. (2007). Breaking the mirror: Asymmetrical disconnection between the phonological input and output codes. Cognitive Neuropsychology, 24(1), 3-22. [abstract] In this paper, we study the link between the processing systems that sustain speech perception and production in a patient (F.A.) with conduction aphasia. Her pattern of performance in repetition task - quantitative but also qualitative striking difference in errors with pseudowords versus words - cannot be properly accounted for either by a perception deficit or by a production deficit. We discuss this finding according to theoretical models of phonological processing and show that it is best explained by an impaired ability to transfer phonological information from the perception to the production system. We also probed for a phonological link in the opposite direction, from the production to the perception system. F. A.'s results show that this link was not impaired. Overall, our results suggest that (a) the phonological codes in perception and in production are separate but connected by two conversion mechanisms and that (b) these two mechanisms can be disrupted independently.

Dupoux, E. & Jacob, P. (2007). Universal moral grammar: a critical appraisal. Trends in Cognitive Sciences, 11(9), 373-378. [abstract] A new framework for the study of the human moral faculty is currently receiving much attention: the so-called 'universal moral grammar' (UMG) framework. It is based on an intriguing analogy, first pointed out by Rawls, between the study of the human moral sense and Chomsky's research program into the human language faculty. To assess UMG, we ask: is moral competence modular? Does it have an underlying hierarchical grammatical structure? Does moral diversity rest on culture-dependent parameters? We review the evidence and argue that formal grammatical concepts are of limited value for the study of moral judgments, moral development and moral diversity.

Darcy, I., Peperkamp, S. & Dupoux, E. (2007). Bilinguals play by the rules. Perceptual compensation for assimilation in late L2-learners. In J. Cole & J. Hualde (eds) Laboratory Phonology, 9, (pp 411-442) Mouton de Gruyter. [abstract] Phonological rules introduce variation in word forms that listeners have to compensate for. We previously showed (Darcy 2002, Darcy et al., under review) that compensation for phonological variation in perception is driven by language-specific mechanisms. In particular, English speakers compensate more for place assimilation than for voicing assimilation, whereas the reverse holds for French speakers. English indeed has a rule of place assimilation, whereas French has a rule of voicing assimilation. In the present study, we explore the patterns of compensation for assimilation in English learners of French and in French learners of English. We use the same design and stimuli as Darcy 2002, Darcy et al. (under review); in this design, listeners are engaged in a word detection task on sentences containing occurrences of both place assimilation and voicing assimilation. We test British English and American English learners of French as well as French learners of American English on both their native language (L1) and their second language (L2). The results show that beginners interpret their L2 in exactly the same way as their L1: they apply the native compensation pattern to both languages. Advanced learners, by contrast, succeed in compensating for the non-native assimilation rule in their L2, while keeping the native compensation pattern for L1; as little or no interference from L2 on L1 is observed for these learners, we conclude that two separate systems of compensation for phonological processes can co-exist.

Teichmann, M., Dupoux, E., Kouider, S. & Bachoud-Lévi, A.C. (2006). The role of the striatum in processing language rules: Evidence from word perception in Huntington's disease. Journal of Cognitive Neuroscience, 18(9), 1555-1569. [abstract] On the assumption that linguistic faculties reflect both lexical storage in the temporal cortex and combinatorial rules in the striatal circuits, several authors have shown that striatal-damaged patients are impaired with conjugation rules while retaining lexical knowledge of irregular verbs [Teichmann, M., Dupoux, E., Kouider, S., Brugieres, P., Boisse, M. F., Baudic, S., Cesaro, P., Peschanski, M., & Bachoud-Lévi, A. C. (2005). The role of the striatum in rule application. The model of Huntington's disease at early stage. Brain, 128, 1155-1167; Ullman, M. T., Corkin, S., Coppola, M., Hickok, G., Growdon, J. H., Koroshetz, W. J., & Pinker, S. (1997). A neural dissociation within language: Evidence that the mental dictionary is part of declarative memory, and that grammatical rules are processed by the procedural system. Journal of Cognitive Neuroscience, 9, 266-276]. Yet, such impairment was documented only with explicit conjugation tasks in the production domain. Little is known about whether it generalizes to other language modalities such as perception and whether it refers to implicit language processing or rather to intentional rule operations through executive functions. We investigated these issues by assessing perceptive processing of conjugated verb forms in a model of striatal dysfunction, namely, in Huntington's Disease (HD) at early stages. Rule application and lexical processes were evaluated in an explicit task (acceptability judgments on verb and nonword forms) and in an implicit task (lexical decision on frequency-manipulated verb forms). HD patients were also assessed in executive functions, and striatal atrophy was evaluated with magnetic resonance imaging (bicaudate ratio). Results from both tasks showed that HD patients were selectively impaired for rule application but lexical abilities were spared. Bicaudate ratios correlated with rule scores on both tasks, whereas executive parameters only correlated with scores on the explicit task. We argue that the striatum has a core function in linguistic rule application generalizing to perceptive aspects of morphological operations and pertaining to implicit language processes. In addition, we suggest that the striatum may enclose computational circuits that underpin explicit manipulation of regularities.

Peperkamp, S., Le Calvez, R., Nadal, J.P. & Dupoux, E. (2006). The acquisition of allophonic rules: Statistical learning with linguistic constraints. Cognition, 101(3), B31-B41. [abstract] Phonological rules relate surface phonetic word forms to abstract underlying forms that are stored in the lexicon. Infants must thus acquire these rules in order to infer the abstract representation of words. We implement a statistical learning algorithm for the acquisition of one type of rule, namely allophony, which introduces context-sensitive phonetic variants of phonemes. This algorithm is based on the observation that different realizations of a single phoneme typically do not appear in the same contexts (ideally, they have complementary distributions). In particular, it measures the discrepancies in context probabilities for each pair of phonetic segments. In Experiment 1, we test the algorithm's performances on a pseudo-language and show that it is robust to statistical noise due to sampling and coding errors, and to non-systematic rule application. In Experiment 2, we show that a natural corpus of semiphonetically transcribed child-directed speech in French presents a very large number of near-complementary distributions that do not correspond to existing allophonic rules. These spurious allophonic rules can be eliminated by a linguistically motivated filtering mechanism based on a phonetic representation of segments. We discuss the role of a priori linguistic knowledge in the statistical learning of phonology.

Jacquemot, C., Dupoux, E., Decouche, O. & Bachoud-Lévi, A.C. (2006). Misperception in sentences but not in words: Speech perception and the phonological buffer. Cognitive Neuropsychology, 23(6), 949-971. [abstract] We report two case studies of aphasic patients with a working-memory impairment due to reduced storage in the phonological buffer. The two patients display excellent performance in phonological discrimination tasks as long as the tasks do not involve a memory load. We then show that their performance drops when they have to maintain fine-grained phonological information for sentence comprehension: They are impaired at mispronunciation detection and at comprehending sentences involving minimal word pairs. We argue that the phonological buffer plays a role in sentence perception during the phonological analysis of the speech stream: It sustains the temporary storage of phonological input in order to check and resolve phonological ambiguities, and it also allows reexamination of the phonological input if necessary.

Teichmann, M., Dupoux, E., Kouider, S., Brugières, P., Boisse, M., Baudic, S., Cesaro, P., Peschanski, M. & Bachoud-Lévi, A.C. (2005). The role of the striatum in rule application: the model of Huntington's disease at early stage. Brain, 128(5), 1155-1167. [abstract] The role of the basal ganglia, and more specifically of the striatum, in language is still debated. Recent studies have proposed that linguistic abilities involve two distinct types of processes: the retrieving of stored information, implicating temporal lobe areas, and the application of combinatorial rules, implicating fronto-striatal circuits. Studies of patients with focal lesions and neurodegenerative diseases have suggested a role for the striatum in morphological rule application, but functional imaging studies found that the left caudate was involved in syntactic processing and not morphological processing. In the present study, we tested the view that the basal ganglia are involved in rule application and not in lexical retrieving in a model of striatal dysfunction, namely Huntington's disease at early stages. We assessed the rule-lexicon dichotomy in the linguistic domain with morphology (conjugation of non-verbs and verbs) and syntax (sentence comprehension) and in a non-linguistic domain with arithmetic operations (subtraction and multiplication). Thirty Huntington's disease patients (15 at stage I and 15 at stage II) and 20 controls matched for their age and cultural level were included in this study. Huntington's disease patients were also assessed using the Unified Huntington's Disease Rating Scale (UHDRS) and MRI. We found that early Huntington's disease patients were impaired in rule application in the linguistic and non-linguistic domains (morphology, syntax and subtraction), whereas they were broadly spared with lexical processing. The pattern of performance was similar in patients at stage I and stage II, except that stage II patients were more impaired in all tasks assessing rules and had in addition a very slight impairment in the lexical condition of conjugation. Finally, syntactic rule abilities correlated with all markers of the disease evolution including bicaudate ratio and performance in executive function, whereas there was no correlation with arithmetic and morphological abilities. Together, this suggests that the striatum is involved in rule processing more than in lexical processing and that it extends to linguistic and non-linguistic domains. These results are discussed in terms of domain-specific versus domain-general processes of rule application.

Kouider, S. & Dupoux, E. (2005). Subliminal speech priming. Psychological Science, 16(8), 617-625. [abstract] We present a novel subliminal priming technique that operates in the auditory modality. Masking is achieved by hiding a spoken word within a stream of time-compressed speechlike sounds with similar spectral characteristics. Participants were unable to consciously identify the hidden words, yet reliable repetition priming was found. This effect was unaffected by a change in the speaker's voice and remained restricted to lexical processing. The results show that the speech modality, like the written modality, involves the automatic extraction of abstract word-form representations that do not include nonlinguistic details. In both cases, priming operates at the level of discrete and abstract lexical entries and is little influenced by overlap in form or semantics.

Kouider, S. & Dupoux, E. (2004). Partial awareness creates the "illusion" of subliminal semantic priming. Psychological Science, 15(2), 75-81. [abstract] We argue that the lack of consensus regarding the existence of subliminal semantic processing arises from not taking into account the fact that linguistic stimuli are represented across several processing levels (features, letters, word form) that can independently reach or not reach awareness. Using masked words, we constructed conditions in which participants were aware of some letters or fragments of a word, while remaining unaware of the whole word. Three experiments using the Stroop priming paradigm show that when the stimulus set is reduced and participants are encouraged to guess the identity of the prime, such partially perceived stimuli can nonetheless give rise to "semantic" processing. We provide evidence that this effect is due to illusory reconstruction of the incompletely perceived stimulus, followed by usual semantic processing of the result. We conclude that previously reported unconscious Stroop priming is in fact a conscious effect, but applied to a perceptual illusion.

Pallier, C., Dehaene, S., Poline, J., LeBihan, D., Argenti, A., Dupoux, E. & Mehler, J. (2003). Brain imaging of language plasticity in adopted adults: Can a second language replace the first? Cerebral Cortex, 13(2), 155-161. [abstract] Do the neural circuits that subserve language acquisition lose plasticity as they become tuned to the maternal language? We tested adult subjects born in Korea and adopted by French families in childhood; they have become fluent in their second language and report no conscious recollection of their native language. In behavioral tests assessing their memory for Korean, we found that they do not perform better than a control group of native French subjects who have never been exposed to Korean. We also used event-related functional magnetic resonance imaging to monitor cortical activations while the Korean adoptees and native French listened to sentences spoken in Korean, French and other, unknown, foreign languages. The adopted subjects did not show any specific activations to Korean stimuli relative to unknown languages. The areas activated more by French stimuli than by foreign stimuli were similar in the Korean adoptees and in the French native subjects, but with relatively larger extents of activation in the latter group. We discuss these data in light of the critical period hypothesis for language acquisition.

Jacquemot, C., Pallier, C., LeBihan, D., Dehaene, S. & Dupoux, E. (2003). Phonological grammar shapes the auditory cortex: A functional magnetic resonance imaging study. Journal of Neuroscience, 23(29), 9541-9546. [abstract] Languages differ depending on the set of basic sounds they use (the inventory of consonants and vowels) and on the way in which these sounds can be combined to make up words and phrases (phonological grammar). Previous research has shown that our inventory of consonants and vowels affects the way in which our brains decode foreign sounds (Goto, 1971; Naatanen et al., 1997; Kuhl, 2000). Here, we show that phonological grammar has an equally potent effect. We build on previous research, which shows that stimuli that are phonologically ungrammatical are assimilated to the closest grammatical form in the language (Dupoux et al., 1999). In a cross-linguistic design using French and Japanese participants and a fast event-related functional magnetic resonance imaging (fMRI) paradigm, we show that phonological grammar involves the left superior temporal and the left anterior supramarginal gyri, two regions previously associated with the processing of human vocal sounds.

Dupoux, E., Kouider, S. & Mehler, J. (2003). Lexical access without attention? Explorations using dichotic priming. Journal of Experimental Psychology: Human Perception and Performance, 29(1), 172-184. [abstract] The authors used lexical decision in a dichotic listening situation and measured identity priming across channels to explore whether unattended stimuli can be processed lexically. In 6 experiments, temporal synchronization of prime and target words was manipulated, and acoustic saliency of the unattended prime was varied by embedding it in a carrier sentence or in babble speech. When the prime was acoustically salient, a cross-channel priming effect emerged, and participants were aware of the prime. When the prime was less salient, no identity priming was found, and participants failed to notice the prime. Saliency was manipulated in ways that did not degrade the prime. Results are inconsistent with models of late filtering, which predict equal priming irrespective of prime saliency.

Bachoud-Lévi, A.C. & Dupoux, E. (2003). An influence of syntactic and semantic variables on word form retrieval. Cognitive Neuropsychology, 20(2), 163-188. [abstract] We report the case of DPI, an aphasic patient who shows a phonological impairment in production that spares certain syntactic and semantic categories. On a picture naming task, he produces mostly phonological paraphasias, and the probability of producing a correct response depends on the frequency and length of the target word. This deficit occurs in the presence of spared ability to find the grammatical gender of the items that he cannot name, intact conceptual knowledge, and very good reading and word repetition. Therefore, we conclude that DPI's deficit is restricted to the phonological retrieval of a correctly selected lexical entry. However, production errors are not uniform across semantic and syntactic domains. Numerals and names of days and months are totally spared compared to matched controls. In addition, abstract nouns and verbs are significantly less affected than concrete nouns, even when variables affecting phonological retrieval (frequency, length, syllabic structure) are controlled for. This suggests that a functional organisation in terms of semantic and syntactic variables exists at the level of phonological retrieval. We discuss these findings in light of current models of speech production.

Peperkamp, S. & Dupoux, E. (2002). A typological study of stress "deafness". In C. Gussenhoven & N. Warner (eds) Laboratory Phonology 7, 4-1, (pp 203-240). [abstract] Previous research has shown that native speakers of French, as opposed to those of Spanish, exhibit stress "deafness", i.e. have difficulties distinguishing stress contrasts. In French, stress is non-contrastive, while in Spanish, stress is used to make lexical distinctions. We examine three other languages with non-contrastive stress, Finnish, Hungarian and Polish. In two experiments with a short-term memory sequence repetition task, we find that speakers of Finnish and Hungarian are like French speakers (i.e. exhibit stress "deafness"), but not those of Polish. We interpret these findings in the light of an acquisition framework that states that infants decide whether or not to keep stress in their phonological representation during the first two years of life, based on information extractable from utterance edges. In particular, we argue that Polish infants, unlike French, Finnish and Hungarian ones, cannot extract the stress regularity of their language on the basis of what they have already learned. As a consequence, they keep stress in their phonological representation, and as adults, they do not have difficulties in distinguishing stress contrasts.

Jacquemot, C., Dupoux, E., Pallier, C. & Bachoud-Lévi, A.C. (2002). Comprehending spoken words without hearing phonemes: A case study. Cortex, 38, 869-873. [abstract] In this paper, we describe a patient who presents a strong dissociation between performance on sublexical and lexical tasks in the unexpected direction. While he was extremely poor in a sublexical discrimination task, he was only mildly impaired in lexical tasks. The patient had a global aphasia resulting from a left parieto-temporal ischemia. Tested with the Boston Diagnostic Aphasia Examination, he showed impairment in oral comprehension, and strong deficits in naming and repetition. Here, we focus on his speech comprehension deficit and in particular on the relatively spared lexical level compared to the drastic impairment of the sublexical level.

Gout, A., Christophe, A. & Dupoux, E. (2002). Testing Infants' Discrimination With the Orientation Latency Procedure. Infancy, 3(2), 249-259. [abstract] A new discrimination procedure based on the measurement of visual orientation latency to speech stimuli is introduced. Each participant listens to a series of short familiarization test trials. In each trial, 5 to 7 centrally-presented familiarization stimuli are followed by laterally-presented test stimuli. Infants were found to orient faster to different-category than to same-category test stimuli. This result was found despite a high degree of prosodic variability in the familiarization and test stimuli introduced by changes in talker and speaking rate. The combination of a multitrial design with use of acoustic and prosodic variability seems suitable for studying the representation of phonological categories.

Kouider, S. & Dupoux, E. (2001). A functional disconnection between spoken and visual word recognition: evidence from unconscious priming. Cognition, 82(1), B35-B49. [abstract] The goal of the present study is to assess whether there is an automatic and obligatory activation of the phonological lexicon upon the presentation of a written word under unconscious processing conditions. We use a cross-modal version of the masked repetition priming procedure introduced by Forster and Davis (Journal of Experimental Psychology: Learning, Memory, and Cognition 10 (1984) 680) which consists of priming a spoken word by its written equivalent under masked conditions. These trials are randomly mixed with within-modal (visual-visual) repetition priming control trials. Our results show that cross-modal priming effects are absent unless primes are consciously perceived, as assessed by d' scores obtained with a letter/pseudo discrimination task. In contrast, priming effects within the written modality are observed under conscious as well as unconscious processing conditions. We conclude that the systems underlying written and spoken word processing are, respectively, autonomous and connected only under conscious conditions.

Dupoux, E., Peperkamp, S. & Sebastian-Galles, N. (2001). A robust method to study stress "deafness". Journal of the Acoustical Society of America, 110(3), 1606-1618. [abstract] Previous research by Dupoux et al. [J. Memory Lang. 36, 406-421 (1997)] has shown that French participants, as opposed to Spanish participants, have difficulties in distinguishing nonwords that differ only in the location of stress. Contrary to Spanish, French does not have contrastive stress, and French participants are "deaf" to stress contrasts. The experimental paradigm used by Dupoux et al. (speeded ABX) yielded significant group differences, but did not allow for a sorting of individuals according to their stress "deafness." Individual assessment is crucial to study special populations, such as bilinguals or trained monolinguals. In this paper, a more robust paradigm based on a short-term memory sequence repetition task is proposed. In five French-Spanish cross-linguistic experiments, stress "deafness" is shown to crucially depend upon a combination of memory load and phonetic variability in F0. In experiments 3 and 4, nonoverlapping distribution of individual results for French and Spanish participants is observed. The paradigm is thus appropriate for assessing stress deafness in individual participants.

Dupoux, E., Pallier, C., Kakehi, K. & Mehler, J. (2001). New evidence for prelexical phonological processing in word recognition. Language and Cognitive Processes, 16(5-6), 491-505. [abstract] When presented with stimuli that contain illegal consonant clusters, Japanese listeners tend to hear an illusory vowel that makes their perception conform to the phonotactics of their language. In a previous paper, we suggested that this effect arises from language-specific prelexical processes. The present paper assesses the alternative hypothesis that this illusion is due to a "top-down" lexical effect. We manipulate the lexical neighbourhood of nonwords that contain illegal consonant clusters and show that perception of the illusory vowel is not due to lexical influences. This demonstrates that phonotactic knowledge influences speech processing at an early stage.

Bachoud-Lévi, A.C., Dupoux, E. & Degos, J.D. (2001). Syntactic and semantic organization in word form retrieval? Cortex, 37(5), 693-695. [abstract] Many studies have reported that naming disorders may affect selectively certain semantic categories (animals vs. vegetables or artifacts, see Caramazza and Shelton, 1998, for a review) or syntactic categories (open vs. closed class items, Friederici and Schoenle, 1980, nouns vs. verbs, Baxter and Warrington, 1985; Caramazza and Hillis, 1991; Daniele et al., 1994; McCarthy and Warrington, 1985; Miceli et al., 1988) suggesting that the conceptual system and the output lexicon are organized along both syntactic and semantic dimensions. Most current models of speech production distinguish two components in the output lexicon: lexical selection and word form retrieval. Lexical selection consists in comparing the conceptual representation of the object to be named to the lexical entries, and selecting the best match. Conceivably, this level should be both sensitive to syntactic and semantic parameters. Word form retrieval involves recovering the phonological information associated to the selected entry which is then used to construct a phonological plan to be executed by the articulatory system. Prima facie, word form retrieval should not be influenced by syntactic, and even more, semantic variables. However, Cohen et al. (1997) reported the case of a patient impaired in word form retrieval, as evidenced by a predominance of phonological paraphasias in naming and reading tasks, which totally spared names for numbers. The authors speculated that the topographical segregation of numbers in the conceptual system propagates along the speech production pathway, even down to word form retrieval. In this paper, we report the case of another aphasic patient who shows a word form retrieval impairment in production which surprisingly spares certain syntactic and semantic categories.

Sebastian-Galles, N., Dupoux, E., Costa, A. & Mehler, J. (2000). Adaptation to time-compressed speech: Phonological determinants. Perception & Psychophysics, 62(4), 834-842. [abstract] Perceptual adaptation to time-compressed speech was analyzed in two experiments. Previous research has suggested that this adaptation phenomenon is language specific and takes place at the phonological level. Moreover, it has been proposed that adaptation should only be observed for languages that are rhythmically similar. This assumption was explored by studying adaptation to different time-compressed languages in Spanish speakers. In Experiment 1, the performances of Spanish-speaking subjects who adapted to Spanish, Italian, French, English, and Japanese were compared. In Experiment 2, subjects from the same population were tested with Greek sentences compressed to two different rates. The results showed adaptation for Spanish, Italian, and Greek and no adaptation for English and Japanese, with French being an intermediate case. To account for the data, we propose that variables other than just the rhythmic properties of the languages, such as the vowel system and/or the lexical stress pattern, must be considered. The Greek data also support the view that phonological, rather than lexical, information is a determining factor in adaptation to compressed speech.

Le Clec'H, G., Dehaene, S., Cohen, L., Mehler, J., Dupoux, E., Poline, J., Lehericy, S., van de Moortele, P. & Le Bihan, D. (2000). Distinct cortical areas for names of numbers and body parts independent of language and input modality. Neuroimage, 12(4), 381-391. [abstract] Some models of word comprehension postulate that the processing of words presented in different modalities and languages ultimately converges toward common cerebral systems associated with semantic-level processing and that the localization of these systems may vary with the category of semantic knowledge being accessed. We used functional magnetic resonance imaging to investigate this hypothesis with two categories of words, numerals, and body parts, for which the existence of distinct category-specific areas is debated in neuropsychology. Across two experiments, one with a blocked design and the other with an event-related design, a reproducible set of left-hemispheric parietal and prefrontal areas showed greater activation during the manipulation of topographical knowledge about body parts and a right-hemispheric parietal network during the manipulation of numerical quantities. These results complement the existing neuropsychological and brain-imaging literature by suggesting that within the extensive network of bilateral parietal regions active during both number and body-part processing, a subset shows category-specific responses independent of the language and modality of presentation.

Dehaene-Lambertz, G., Dupoux, E. & Gout, A. (2000). Electrophysiological correlates of phonological processing: A cross-linguistic study. Journal of Cognitive Neuroscience, 12(4), 635-647. [abstract] It is well known that speech perception is deeply affected by the phoneme categories of the native language. Recent studies have found that phonotactics, i.e., constraints on the cooccurrence of phonemes within words, also have a considerable impact on speech perception routines. For example, Japanese does not allow (nonnasal) coda consonants. When presented with stimuli that violate this constraint, as in /ebzo/, Japanese adults report that they hear a /u/ between consonants, i.e., /ebuzo/. We examine this phenomenon using event-related potentials (ERPs) on French and Japanese participants in order to study how and when the phonotactic properties of the native language affect speech perception routines. Trials using four similar precursor stimuli were presented followed by a test stimulus that was either identical or different depending on the presence or absence of an epenthetic vowel /u/ between two consonants (e.g., "ebuzo ebuzo ebuzo-ebzo"). Behavioral results confirm that Japanese, unlike French participants, are not able to discriminate between identical and deviant trials. In ERPs, three mismatch responses were recorded in French participants. These responses were either absent or significantly weaker for Japanese. In particular, a component similar in latency and topography to the mismatch negativity (MMN) was recorded for French, but not for Japanese participants. Our results suggest that the impact of phonotactics takes place early in speech processing and support models of speech perception, which postulate that the input signal is directly parsed into the native language phonological format. We speculate that such a fast computation of a phonological representation should facilitate lexical access, especially in degraded conditions.

Dupoux, E., Kakehi, K., Hirose, Y., Pallier, C. & Mehler, J. (1999). Epenthetic vowels in Japanese: A perceptual illusion? Journal of Experimental Psychology: Human Perception and Performance, 25(6), 1568-1578. [abstract] In 4 cross-linguistic experiments comparing French and Japanese listeners, we found that the phonotactic properties of Japanese (a reduced set of syllable types) induce Japanese listeners to perceive "illusory" vowels inside consonant clusters in vowel-consonant-consonant-vowel (VCCV) stimuli. In Experiments 1 and 2, a continuum of stimuli ranging from no vowel (e.g., ebzo) to a full vowel between the consonants (e.g., ebuzo) was used. Japanese, but not French participants, reported the presence of a vowel [u] between consonants, even in stimuli with no vowel. A speeded ABX discrimination paradigm was used in Experiments 3 and 4 and revealed that Japanese participants had trouble discriminating between VCCV and VCuCV stimuli. French participants, in contrast, had problems discriminating items that differed in vowel length (ebuzo vs. ebunzo), a distinctive contrast in Japanese but not in French. It is concluded that models of speech perception have to be revised to account for phonotactically based assimilations.

Perani, D., Paulesu, E., Galles, N., Dupoux, E., Dehaene, S., Bettinardi, V., Cappa, S., Fazio, F. & Mehler, J. (1998). The bilingual brain - Proficiency and age of acquisition of the second language. Brain, 121(10), 1841-1852. [abstract] Functional imaging methods show differences in the pattern of cerebral activation associated with the subject's native language (L1) compared with a second language (L2). In a recent PET investigation on bilingualism we showed that auditory processing of stories in L1 (Italian) engages the temporal lobes and temporoparietal cortex more extensively than L2 (English). However, in that study the Italian subjects learned L2 late and attained a fair, but not an excellent command of this language (low proficiency, late acquisition bilinguals). Thus, the different patterns of activation could be ascribed either to age of acquisition or to proficiency level. In the current study we use a similar paradigm to evaluate the effect of early and late acquisition of L2 in highly proficient bilinguals. We studied a group of Italian-English bilinguals who acquired L2 after the age of 10 years (high proficiency, late acquisition bilinguals) and a group of Spanish-Catalan bilinguals who acquired L2 before the age of 4 years (high proficiency, early acquisition bilinguals). The differing cortical responses we had observed when low proficiency volunteers listened to stories in L1 and L2 were not found in either of the high proficiency groups in this study. Several brain areas, similar to those observed for L1 in low proficiency bilinguals, were activated by L2. These findings suggest that, at least for pairs of L1 and L2 languages that are fairly close, attained proficiency is more important than age of acquisition as a determinant of the cortical representation of L2.

Pallier, C., Sebastian-Galles, N., Dupoux, E., Christophe, A. & Mehler, J. (1998). Perceptual adjustment to time-compressed speech: A cross-linguistic study. Memory & Cognition, 26(4), 844-851. [abstract] Previous research has shown that, when hearers listen to artificially speeded speech, their performance improves over the course of 10-15 sentences, as if their perceptual system was "adapting" to these fast rates of speech. In this paper, we further investigate the mechanisms that are responsible for such effects. In Experiment 1, we report that, for bilingual speakers of Catalan and Spanish, exposure to compressed sentences in either language improves performance on sentences in the other language. Experiment 2 reports that Catalan/Spanish transfer of performance occurs even in monolingual speakers of Spanish who do not understand Catalan. In Experiment 3, we study another pair of languages, namely English and French, and report no transfer of adaptation between these two languages for English-French bilinguals. Experiment 4, with monolingual English speakers, assesses transfer of adaptation from French, Dutch, and English toward English. Here we find that there is no adaptation from French and intermediate adaptation from Dutch. We discuss the locus of the adaptation to compressed speech and relate our findings to other cross-linguistic studies in speech perception.

Bachoud-Lévi, A.C., Dupoux, E., Cohen, L. & Mehler, J. (1998). Where is the length effect? A cross-linguistic study of speech production. Journal of Memory and Language, 39(3), 331-346. [abstract] Many models of speech production assume that one cannot begin to articulate a word before all its segmental units are inserted into the articulatory plan. Moreover, some of these models assume that segments are serially inserted from left to right. As a consequence, latencies to name words should increase with word length. In a series of five experiments, however, we showed that the time to name a picture or retrieve a word associated with a symbol is not affected by the length of the word. Experiments 1 and 2 used French materials and participants, while Experiments 3, 4, and 5 were conducted with English materials and participants. These results are discussed in relation to current models of speech production and previous reports of length effects are reevaluated in light of these findings. We conclude that if words are encoded serially, then articulation can start before an entire phonological word has been encoded.

Pallier, C., Dupoux, E. & Jeannin, X. (1997). EXPE: An expandable programming language for on-line psychological experiments. Behavior Research Methods, Instruments, & Computers, 29(3), 322-327. [abstract] EXPE is a DOS program for the design and running of experiments that involve the presentation of audio or visual stimuli and the collection of on-line or off-line behavioral responses. Its flexibility also makes it a useful tool for the rapid design of protocols for testing neuropsychological patients. EXPE provides a powerful scripting language that allows the user to specify all the components of an experiment in a human readable file. Subjects' responses are saved in a user-specified format as well as in readable ASCII files. The user can easily add new commands to the language: All the instructions are calls to functions written in independent Borland Pascal units. Thus, users can link their own Pascal procedures to EXPE to meet virtually any special need. This makes it possible, for example, to adapt EXPE to new hardware, such as new sound or video boards.

Dupoux, E., Pallier, C., Sebastian, N. & Mehler, J. (1997). A destressing "deafness" in French? Journal of Memory and Language, 36(3), 406-421. [abstract] Spanish but not French uses accent to distinguish between words (e.g., tópo vs topó). Two populations of subjects were tested on the same materials to determine whether this difference has an impact on the perceptual capacities of listeners. In Experiment 1, using an ABX paradigm, we found that French subjects had significantly more difficulties than Spanish subjects in performing an ABX classification task based on accent. In Experiment 2, we found that Spanish subjects were unable to ignore irrelevant differences in accent in a phoneme-based ABX task, whereas French subjects had no difficulty at all. In Experiment 3, we replicated the basic French finding and found that Spanish subjects benefited from redundant accent information even when phonemic information alone was sufficient to perform the task. In our final experiment, we showed that French subjects can be made to respond to the acoustic correlates of accent; therefore their difficulty in Experiment 1 seems to be located at the level of short-term memory. The implications of these findings for language-specific processing and acquisition are discussed.

Dupoux, E. & Green, K. (1997). Perceptual adjustment to highly compressed speech: Effects of talker and rate changes. Journal of Experimental Psychology: Human Perception and Performance, 23(3), 914-927. [abstract] This study investigated the perceptual adjustments that occur when listeners recognize highly compressed speech. In Experiment 1, adjustment was examined as a function of the amount of exposure to compressed speech by use of 2 different speakers and compression rates. The results demonstrated that adjustment takes place over a number of sentences, depending on the compression rate. Lower compression rates required less experience before full adjustment occurred. In Experiment 2, the impact of an abrupt change in talker characteristics was investigated; in Experiment 3, the impact of an abrupt change in compression rate was studied. The results of these 2 experiments indicated that sudden changes in talker characteristics or compression rate had little impact on the adjustment process. The findings are discussed with respect to the level of speech processing at which such adjustment might occur.

Dehaene, S., Dupoux, E., Mehler, J., Cohen, L., Paulesu, E., Perani, D., van de Moortele, P., Lehericy, S. & LeBihan, D. (1997). Anatomical variability in the cortical representation of first and second language. Neuroreport, 8(17), 3809-3815. [abstract] Functional magnetic resonance imaging was used to assess inter-subject variability in the cortical representation of language comprehension processes. Moderately fluent French-English bilinguals were scanned while they listened to stories in their first language (L1 = French) or in a second language (L2 = English) acquired at school after the age of seven. In all subjects, listening to L1 always activated a similar set of areas in the left temporal lobe, clustered along the left superior temporal sulcus. Listening to L2, however, activated a highly variable network of left and right temporal and frontal areas, sometimes restricted only to right-hemispheric regions. These results support the hypothesis that first language acquisition relies on a dedicated left-hemispheric cerebral network, while late second language acquisition is not necessarily associated with a reproducible biological substrate. The postulated contribution of the right hemisphere to L2 comprehension is found to hold only on average, individual subjects varying from complete right lateralization to standard left lateralization for L2.

Christophe, A., Guasti, T., Nespor, M., Dupoux, E. & Van Ooyen, B. (1997). Reflections on phonological bootstrapping: Its role for lexical and syntactic acquisition. Language and Cognitive Processes, 12(5-6), 585-612. [abstract] "Phonological bootstrapping" is the hypothesis that a purely phonological analysis of the speech signal may allow infants to start acquiring the lexicon and syntax of their native language (Morgan & Demuth, 1996a). To assess this hypothesis, a first step is to estimate how much information is provided by a phonological analysis of the speech input conducted in the absence of any prior (language-specific) knowledge in other domains such as syntax or semantics. We first review existing work on how babies may start acquiring a lexicon by relying on distributional regularities, phonotactics, typical word shape and prosodic boundary cues. Taken together, these sources of information may enable babies to learn the sound pattern of a reasonable number of the words in their native language. We then focus on syntax acquisition and discuss how babies may set one of the major structural syntactic parameters, the head direction parameter, by listening to prominence within phonological phrases and before they possess any words. Next, we discuss how babies may hope to acquire function words early, and how this knowledge would help lexical segmentation and acquisition, as well as syntactic analysis and acquisition. We then present a model of phonological bootstrapping of the lexicon and syntax that helps us to illustrate the congruence between problems. Some sources of information appear to be useful for more than one purpose; for example, phonological phrases and function words may help lexical segmentation as well as segmentation into syntactic phrases and labelling (NP, VP, etc.). Although our model derives directly from our reflection on acquisition, we argue that it may also be adequate as a model of adult speech processing. Since adults allow a greater variety of experimental paradigms, an advantage of our approach is that specific hypotheses can be tested on both populations. We illustrate this aspect in the final section of the paper, where we present the results of an adult experiment which indicates that prosodic boundaries and function words play an important role in continuous speech processing.

Perani, D., Dehaene, S., Grassi, F., Cohen, L., Cappa, S., Paulesu, E., Dupoux, E., Fazio, F. & Mehler, J. (1996). A PET study of native and foreign language processing. Brain and Language, 55(1), 99-101. [abstract] We used positron emission tomography to study brain activity in adults while they were listening to stories in their native language, in a second language acquired after the age of seven and in a third unknown language. Several areas, similar to those previously observed in monolinguals, were activated by the native but not by the second language. Both the second and the unknown language yielded distinct left-hemispheric activations in areas specialized for phonological processing, which were not engaged in a backward speech control task. These results indicate that some brain areas are shaped by early exposure to the maternal language, and are not necessarily activated by the processing of a second language to which they have been exposed for a limited time later in life.

Perani, D., Dehaene, S., Grassi, F., Cohen, L., Cappa, S., Dupoux, E., Fazio, F. & Mehler, J. (1996). Brain processing of native and foreign languages. Neuroreport, 7(15-17), 2439-2444. [abstract] We used positron emission tomography to study brain activity in adults while they were listening to stories in their native language, in a second language acquired after the age of seven, and in a third unknown language. Several areas, similar to those previously observed in monolinguals, were activated by the native but not by the second language. Both the second and the unknown language yielded distinct left-hemispheric activations in areas specialized for phonological processing, which were not engaged by a backward speech control task. These results indicate that some brain areas are shaped by early exposure to the maternal language, and are not necessarily activated by the processing of a second language to which they have been exposed for a limited time later in life.

Christophe, A. & Dupoux, E. (1996). Bootstrapping lexical acquisition: The role of prosodic structure. Linguistic Review, 13(3-4), 383-412.

Mehler, J., Dupoux, E., Pallier, C. & Dehaene-Lambertz, G. (1994). Cross-linguistic approaches to speech processing. Current Opinion in Neurobiology, 4(2), 171-176. [abstract] Recent advances in the field of speech processing indicate that speakers of differing languages process speech relying on units that are appropriate to the rhythmical properties of their maternal tongue. Studies with young infants suggest that the acquisition of these processing routines takes place before the end of the first year of life. Further evidence shows that the left hemisphere initially processes any language and gradually becomes specialized for the maternal language.

Mehler, J., Bertoncini, J., Dupoux, E. & Pallier, C. (1994). The role of suprasegmental in speech perception and acquisition. Dokkyo International Review, 7, 343-376.

Christophe, A., Dupoux, E., Bertoncini, J. & Mehler, J. (1994). Do infants perceive word boundaries? An empirical study of the bootstrapping of lexical acquisition. Journal of the Acoustical Society of America, 95(3), 1570-1580. [abstract] Babies, like adults, hear mostly continuous speech. Unlike adults, however, they are not acquainted with the words that constitute the utterances; yet in order to construct representations for words, they have to retrieve them from the speech wave. Given the apparent lack of obvious cues to word boundaries (such as pauses between words), this is not a trivial problem. Among the several mechanisms that could be explored to solve this bootstrapping problem for lexical acquisition, a tentative but reasonable one posits the existence of some cues (other than silence) that signal word boundaries. In order to test this hypothesis, infants were used as informants in our experiments. It was hypothesized that if word boundary cues exist, and if infants are to use them in the course of language acquisition, then they should at least perceive these cues. As a consequence, infants should be able to discriminate sequences that contain a word boundary from those that do not. A number of bisyllabic stimuli were extracted either from within French words (e.g., mati in mathematicien), or from between words (e.g., mati in panorama typique). Three-day-old infants were tested with a non-nutritive sucking paradigm, and the results of two experiments suggest that infants can discriminate between items that contain a word boundary and items that do not. It is therefore conceivable that newborns are already sensitive to cues that correlate with word boundaries. This result lends plausibility to the hypothesis that infants might use word boundary cues during lexical acquisition.

Mehler, J., Sebastian-Galles, N., Altmann, G., Dupoux, E., Christophe, A. & Pallier, C. (1993). Understanding compressed sentences - The role of rhythm and meaning. Annals of the New York Academy of Sciences, 682, 272-282.

Sebastian-Galles, N., Dupoux, E., Segui, J. & Mehler, J. (1992). Contrasting syllabic effects in Catalan and Spanish. Journal of Memory and Language, 31(1), 18-32. [abstract] The role of syllabic structure and stress assignment in the perceptual segmentation of Catalan and Spanish words is studied. Previous research suggested that the syllable is the segmentation unit for languages with clear syllabic structure. In Experiment 1, we found that syllabification effects are found in Catalan but only in unstressed first syllable word-targets. No syllabification is obtained when the first syllable is stressed. In Experiment 2, we failed to find any syllabification effect in Spanish, regardless of stress in word-targets. Nonetheless, Experiment 3 shows that syllabification effects emerge in Spanish when subjects are made to respond 250 ms slower than in Experiment 2. On the basis of these results, a modified version of the original syllabic hypothesis is proposed. We propose that both task demands and language specific parameters play a role in the presence or absence of syllabification effects in segment detection.

Dupoux, E. & Mehler, J. (1990). Monitoring the lexicon with normal and compressed speech - Frequency effects and the prelexical code. Journal of Memory and Language, 29(3), 316-335. [abstract] Previous reports suggest that initial phonemes are monitored on the basis of lexical information in monosyllabic words and on the basis of acoustic/phonetic information in multisyllabic words (Cutler, Mehler, Norris, & Segui, 1987). In Experiment 1, a frequency effect was found with item-initial phoneme monitoring for monosyllabic but not for bisyllabic words. In Experiments 2 and 3, we used speech time-compressed at a rate of 50% and failed to find a frequency effect for bisyllabic words, even though they were shorter than uncompressed monosyllables. In Experiment 4, we used a lexical decision task on the same items and found a frequency effect for both mono- and bisyllabic words. Results are interpreted on the basis of the dual code hypothesis. Implications for the nature of the prelexical code are discussed.

Dehaene, S., Dupoux, E. & Mehler, J. (1990). Is numerical comparison digital? Analogical and symbolic effects in two digit number comparison. Journal of Experimental Psychology: Human Perception and Performance, 16(3), 626-641. [abstract] Do Ss compare multidigit numbers digit by digit (symbolic model) or do they compute the whole magnitude of the numbers before comparing them (holistic model)? In 4 experiments of timed 2-digit number comparisons with a fixed standard, the findings of Hinrichs, Yurko, and Hu (1981) were extended with French Ss. Reaction times (RTs) decreased with target-standard distance, with discontinuities at the boundaries of the standard's decade appearing only with standards 55 and 66 but not with 65. The data are compatible with the holistic model. A symbolic interference model that posits the simultaneous comparison of decades and units can also account for the results. To separate the 2 models, the decades and units digits of target numbers were presented asynchronously in Experiment 4. Contrary to the prediction of the interference model, presenting the units before the decades did not change the influence of units on RTs. Pros and cons of the holistic model are discussed.

Books

Dupoux, E. (2001). Language, Brain and Cognitive Development: Essays in Honor of Jacques Mehler. Cambridge, Mass: MIT Press (translated into French: (2002). Les langages du cerveau, Paris: O. Jacob.). [abstract] In the early 1960s, cognition was known only to a group of avant-garde scientists. The bold aim of this field of research was to subject the human mind to rational examination grounded in philosophy, linguistics, computer science, and psychology. Forty years later, the cognitive sciences have flourished. What real progress has been made? What have we learned about language, cognition, and the brain? What have been the failures and the successes? Which avenues for the future are the most promising?

Mehler, J. & Dupoux, E. (1990). Naître Humain. Paris: Odile Jacob. Translated and published in English (Blackwell), Chinese (Yuan-Liou Publishers), Greek (Alexandria), Italian (Mondadori), Japanese (Fujiwara-Shoten), Portuguese (Piaget), & Spanish (Alianza).

Chapters, commentaries, etc.

Rochereau, C., Sagot, B. & Dupoux, E. (2019). Modeling German Verb Argument Structures: LSTMs vs. Humans. In ArXiv, 1912.00239. [abstract] LSTMs have proven very successful at language modeling. However, it remains unclear to what extent they are able to capture complex morphosyntactic structures. In this paper, we examine whether LSTMs are sensitive to verb argument structures. We introduce a German grammaticality dataset in which ungrammatical sentences are constructed by manipulating case assignments (e.g., substituting nominative by accusative or dative). We find that LSTMs are better than chance in detecting incorrect argument structures and slightly worse than humans tested on the same dataset. Surprisingly, LSTMs are contaminated by heuristics not found in humans, like a preference toward nominative noun phrases. In other respects they show human-similar results, like biases for particular orders of case assignments.

Riochet, R., Castro, M.Y., Bernard, M., Lerer, A., Fergus, R., Izard, V. & Dupoux, E. (2019). IntPhys: A Benchmark for Visual Intuitive Physics Reasoning. In ArXiv, 1803.07616. [abstract] In order to reach human performance on complex visual tasks, artificial systems need to incorporate a significant amount of understanding of the world in terms of macroscopic objects, movements, forces, etc. Inspired by work on intuitive physics in infants, we propose an evaluation framework which diagnoses how much a given system understands about physics by testing whether it can tell apart well matched videos of possible versus impossible events. The test requires systems to compute a physical plausibility score over an entire video. It is free of bias and can test a range of specific physical reasoning skills. We then describe the first release of a benchmark dataset aimed at learning intuitive physics in an unsupervised way, using videos constructed with a game engine. We describe two Deep Neural Network baseline systems trained with a future frame prediction objective and tested on the possible versus impossible discrimination task. The analysis of their results compared to human data gives novel insights into the potential and limitations of next frame prediction architectures.

Synnaeve, G. & Dupoux, E. (2015). Weakly Supervised Multi-Embeddings Learning of Acoustic Models. In ICLR Workshop (ArXiv 1412.6645 [cs.SD]). [abstract] We trained a Siamese network with multi-task same/different information on a speech dataset, and found that it was possible to share a network for both tasks without a loss in performance. The first task was to discriminate between two same or different words, and the second was to discriminate between two same or different talkers.

Dupoux, E. (2015). Category Learning in Songbirds: top-down effects are not unique to humans. Current Biology, 25(16), R718-R720. [abstract] Human infants use higher order patterns (words) to learn the sound categories of their language. A new study using artificial patterns made up of naturally occurring vocalizations shows that a similar mechanism may also exist in songbirds.

Dupoux, E. (2014). Towards Quantitative Studies of Early Cognitive Development. Autonomous Mental Development Technical Committee Newsletter, 11(1), 10-11. [abstract] We present a new framework for the evaluation of speech representations in zero-resource settings, that extends and complements previous work by Carlin, Jansen and Hermansky [1]. In particular, we replace their Same/Different discrimination task by several Minimal-Pair ABX (MP-ABX) tasks. We explain the analytical advantages of this new framework and apply it to decompose the standard signal processing pipelines for computing PLP and MFC coefficients. This method enables us to confirm and quantify a variety of well-known and not-so-well-known results in a single framework.

Synnaeve, G. & Dupoux, E. (2013). In Depth Deep Beliefs Networks for Phone Recognition. Poster presented at NIPS-2013.

Cleret de Langavant, L., Jacquemot, C., Bachoud-Lévi, A.C. & Dupoux, E. (2013). The second person in 'I'-'you'-'it' triadic interactions. Behavioral and Brain Sciences, 36, 416-417.

Ramus, F., Peperkamp, S., Christophe, A., Jacquemot, C., Kouider, S. & Dupoux, E. (2011). A psycholinguistic perspective on the acquisition of phonology. In C. Fougeron, B. Kühnert, M. d'Imperio & N. Vallée (eds) Laboratory Phonology, 10, Berlin: Mouton de Gruyter. [abstract] This paper discusses the target articles by Fikkert, Vihman, and Goldrick & Larson, which address diverse aspects of the acquisition of phonology. These topics are examined using a wide range of tasks and experimental paradigms across different ages. Various levels of processing and representation are thus involved. The main point of the present paper is that such data can be coherently interpreted only within a particular information-processing model that specifies in sufficient detail the different levels of processing and representation. In this paper, we first present the basic architecture of a model of speech perception and production, justifying it with psycholinguistic and neuropsychological data. We then use this model to interpret data from the target articles relative to the acquisition of phonology.

Cova, F., Dupoux, E. & Jacob, P. (2010). Moral evaluation shapes linguistic reports of others' psychological states, not theory-of-mind judgments. Behavioral and Brain Sciences, 33(4), 334-335. [abstract] We use psychological concepts (e.g., intention and desire) when we ascribe psychological states to others for purposes of describing, explaining, and predicting their actions. Does the evidence reported by Knobe show, as he thinks, that moral evaluation shapes our mastery of psychological concepts? We argue that the evidence so far shows instead that moral evaluation shapes the way we report, not the way we think about, others' psychological states.

Smolensky, P. & Dupoux, E. (2009). Universals in cognitive theories of language. Behavioral and Brain Sciences, 32(5), 468-469. [abstract] Generative linguistics' search for linguistic universals (1) is not comparable to the vague explanatory suggestions of the article; (2) clearly merits a more central place than linguistic typology in cognitive science; (3) is fundamentally untouched by the article's empirical arguments; (4) best explains the important facts of linguistic diversity; and (5) illuminates the dominant component of language's "biocultural" nature: biology.

Darcy, I., Ramus, F., Christophe, A., Kinzler, K.D. & Dupoux, E. (2009). Phonological knowledge in compensation for native and non-native assimilation. In F. Kügler, C. Féry & R. van de Vijver (eds) Variation and Gradience in Phonetics and Phonology, (pp 265-309) Berlin: Mouton De Gruyter. [abstract] We investigated whether compensation for phonological assimilation depends on language-universal or language-specific processes. To this end, we tested two different assimilation rules, one that exists in English and involves place of articulation, and another that exists in French and involves voicing. Both contrasts were tested on speakers of French, British English and American English. In three experiments using a word detection task, we observed that monolingual participants showed a significantly higher degree of compensation for phonological changes that correspond to rules existing in their language than to rules that do not exist in their language (even though they are phonologically possible since they exist in another language). Thus, French participants compensated more for voicing than place assimilation, while British and American English participants compensated more for place than voicing assimilation. In all three experiments, we also found that the non-native rule induced a very small but significant compensation effect, suggesting that both a language-specific and a language-universal mechanism are at play. In Experiment 4, we studied native speakers of British English who were late learners of French: they showed the British pattern of results even when listening to French stimuli, confirming that compensation for assimilation is induced by language-specific phonological processes rather than specific phonetic cues. The results are discussed in light of current models of lexical access and phonological processing.

Jacob, P. & Dupoux, E. (2008). A precursor of moral judgment in human infants? Current Biology, 18(5), R216-R218. [abstract] Human infants evaluate social interactions well before they can speak, and show a preference for characters that help others over characters that are not cooperative or are hindering.

Dupoux, E. & Jacob, P. (2008). Response to Dwyer and Hauser: Sounding the retreat? Trends in Cognitive Sciences, 12(1), 2-3.

Le Calvez, R., Peperkamp, S. & Dupoux, E. (2007). Bottom-up learning of phonemes: A computational study. In S. Vosniadou, D. Kayser & A. Protopapas (eds) Proceedings of the Second European Cognitive Science Conference, Taylor and Francis. (French translation in Mathématiques et Sciences Humaines 2007(4), 99-111). [abstract] We present a computational evaluation of a hypothesis according to which distributional information is sufficient to acquire allophonic rules (and hence phonemes) in a bottom-up fashion. The hypothesis was tested using a measure based on information theory that compares distributions. The test was conducted on several artificial language corpora and on two natural corpora containing transcriptions of speech directed to infants from two typologically distant languages (French and Japanese). The measure was complemented with three filters, one concerning the statistical reliability due to sample size and two concerning the following universal properties of allophonic rules: constituents of an allophonic rule should be phonetically similar, and allophonic rules should be assimilatory in nature.

Kouider, S., de Gardelle, V. & Dupoux, E. (2007). Partial awareness and the illusion of phenomenal consciousness (Comment on Block, 2007). Behavioral and Brain Sciences, 30(5-6), 510-511. [abstract] The dissociation Block provides between phenomenal and access consciousness (P-consciousness and A-consciousness) captures much of our intuition about conscious experience. However, it raises a major methodological puzzle, and is not uniquely supported by the empirical evidence. We provide an alternative interpretation based on the notion of levels of representation and partial awareness.

Kouider, S. & Dupoux, E. (2007). How "semantic" is response priming restricted to practiced items? A reply to Abrams & Grinspan (2007). Consciousness and Cognition, 16(4), 954-956.

Peperkamp, S., Skoruppa, K. & Dupoux, E. (2006). The role of phonetic naturalness in phonological rule acquisition. In D. Bamman, T. Magnitskaia & C. Zaller (eds) Proceedings of the 30th Annual Boston University Conference on Language Development, Vols 1 and 2, (pp 464-475). [abstract] The role of naturalness constraints in phonological learning is of considerable theoretical importance for linguistically motivated models of language acquisition. However, the existence of naturalness effects does not yet rest on firm empirical grounds. P&D (in press) exposed French subjects to an artificial language consisting of determiner + noun phrases which obey either a natural allophonic rule that voices a subclass of obstruents intervocalically, or an unnatural one that defines arbitrary relationships among certain obstruents intervocalically. After exposure, a phrase-picture matching task was used to assess whether subjects had learned the allophonic distributions and hence distinguished between phonemic and allophonic contrasts among obstruents for the purposes of word identification. Surprisingly, P&D (in press) found that natural assimilatory rules and unnatural arbitrary rules were learned with equal ease. In the present study, we use exactly the same exposure phase, but change the test phase: here, subjects have to produce a noun phrase upon the presentation of a picture, both for nouns that they have been trained on during the exposure phase, and for novel nouns. We find that with this more ecologically valid, but also more demanding task, a naturalness effect emerges: subjects learned the rule on old items and extended it to novel items, but only for the natural assimilatory rules, not for the unnatural arbitrary rules. We discuss these findings in relation to existing studies of the acquisition of phonological rules. We distinguish at least three constraints that characterize rule naturalness, and discuss the role of task demands and response strategies in relation to the emergence of naturalness effects in learning studies using artificial languages.

Dupoux, E. (2004). The Acquisition of Discrete Segmental Categories: Data and Model. In Proceedings of the 18th International Congress of Acoustics, Kyoto. [abstract] The way in which we parse continuous speech into discrete phonemes is highly language-dependent. Here, we first report that this phenomenon not only depends on the inventory of phonetic distinctions in the language, but also on the inventory of syllabic types. This is illustrated by studies showing that Japanese listeners perceptually insert epenthetic vowels inside illegal consonant clusters in order to make them legal. We then argue that this raises a bootstrapping problem for language acquisition, as the learning of phonetic inventories and syllabic types depend on each other. We present an acquisition model based on the storing and analysis of phonetic syllabic templates. We argue that this model has the potential of solving the bootstrapping problem as well as accounting for a range of observations regarding perceptual categorization for speech sounds.

Peperkamp, S., Pettinato, M. & Dupoux, E. (2003). Allophonic variation and the acquisition of phoneme categories. In B. Beachley, A. Brown & F. Conlin (eds) BUCLD 27: Annual Boston University Conference on Language Development, Vols 1 and 2, Proceedings, (pp 650-661).

Peperkamp, S. & Dupoux, E. (2003). Reinterpreting loanword adaptations: The role of perception. In Proceedings of the 15th International Congress of Phonetic Sciences, (pp 367-370). [abstract] Standard phonological accounts of loanword adaptations state that the inputs to the adaptations are constituted by the surface forms of the words in the source language and that the adaptations are computed by the phonological grammar of the borrowing language. In processing terms, this means that in perception, the phonetic form of the source words is faithfully copied onto an abstract underlying form, and that adaptations are produced by the standard phonological processes in production. We argue that this is at odds with speech perception models and propose that loanword adaptations take place in perception and are defined as phonetically minimal transformations.

Peperkamp, S. & Dupoux, E. (2002). Coping with phonological variation in early lexical acquisition. In I. Lasser (ed) The Process of Language Acquisition, (pp 359-385) Berlin: Peter Lang Verlag. [abstract] Models of lexical acquisition assume that infants can somehow extract unique word forms out of the speech stream before they acquire the meaning of words (e.g. Siskind 1996). However, words often surface with different phonetic forms due to the application of postlexical phonological processes; that is, surface word forms exhibit what we call phonological variation. In this paper, we will examine if and how infants that do not have a semantic lexicon might undo phonological variation, i.e. deduce which phonological processes apply and infer unique underlying word forms that will constitute lexical entries. We will propose a learning mechanism that deduces which rule applies and infers underlying phonemes and word forms. This mechanism is based on an examination of the distribution of either surface segments or surface word forms. The distribution of segments will be shown to provide sufficient information in the case of allophonic rules, i.e. rules that involve segments that do not otherwise occur in the language; the distribution of segments that are introduced by this type of rule is complementary to that of segments that are the direct phonetic realization of certain phonemes. The distribution of word forms will be shown to be necessary in cases in which all surface segments have a phonemic status in the language. In particular, infants can make use of the fact that certain word forms - i.e. the ones that have undergone the rule - fail to occur at the left or right edge of certain phrasal constituents, where the context for application of the rule is never met. This proposal makes predictions regarding the order in which various types of phonological variations can be coped with in the infant.

Dupoux, E. & Peperkamp, S. (2002). Fossil markers of language development: phonological deafnesses in adult speech processing. In B. Laks & J. Durand (eds) Phonetics, Phonology, and Cognition, (pp 168-190) Oxford: Oxford University Press. [abstract] The sound pattern of the language(s) we have heard as infants affects the way in which we perceive linguistic sounds as adults. Typically, some foreign sounds are very difficult to perceive accurately, even after extensive training. For instance, native speakers of French have trouble distinguishing foreign words that differ only in the position of main stress, French being a language in which stress is not contrastive. In this paper, we propose to explore the perception of foreign sounds cross-linguistically in order to understand the processes that govern early language acquisition. Specifically, we propose to test the hypothesis that early language acquisition begins by using only regularities that infants can observe in the surface speech stream (Bottom-Up Bootstrapping), and compare it with the hypothesis that they use all possible sources of information, including, for instance, word boundaries (Interactive Bootstrapping). We set up a research paradigm using the stress system, since it allows us to test the various options at hand within a single test procedure. We distinguish four types of regular stress systems, the acquisition of which requires different sources of information. We show that the two hypotheses make contrasting predictions as to the pattern of stress perception of adults in these four types of languages. We conclude that cross-linguistic research on adult speech perception, when coupled with detailed linguistic analysis, can be brought to bear on important issues of language acquisition.

Bachoud-Lévi, A.C. & Dupoux, E. (2001). L'effet de longueur et la production des mots parlés. Psychologie française, 46, 65-76.

Peperkamp, S., Dupoux, E. & Sebastián-Gallés, N. (1999). Perception of stress by French, Spanish, and bilingual subjects. In Proceedings of Eurospeech '99, 6, (pp 2683-2686). [abstract] Previous research has shown that French subjects, as opposed to Spanish subjects, have difficulties in distinguishing two words that differ only as far as the location of stress is concerned. In French, stress is not contrastive, and French subjects are 'deaf' to stress contrasts. In Experiment 1, we replicate this finding with a new and more powerful paradigm for assessing the perception of stress. With this new method, we obtain a complete separation of the two subject populations. In Experiment 2, we test highly proficient French-Spanish bilinguals with the same paradigm. Our findings are that the performance of individual bilinguals is either French-like or Spanish-like. The factor that best predicts the bilingual's performance is the country in which the subject was born. Consequences for models of bilingualism are discussed.

Dupoux, E., Fushimi, T., Kakehi, K. & Mehler, J. (1999). Prelexical locus of an illusory vowel effect in Japanese. In Eurospeech '99 Proceedings; ESCA 7th European Conference on Speech Communication and Technology. [abstract] Studies in vision have demonstrated that the visual system can induce the perception of illusory contours. In this study we document a similar phenomenon in the auditory mode: Japanese speakers report perceiving vowels that are absent in the acoustic signal. Such an illusion is due to the fact that in Japanese, successions of consonants are not allowed. Hence the linguistic system inserts an illusory vowel between adjacent consonants in order to conform to the expected pattern in this language. Here, we manipulate the lexical neighborhood of nonwords that contain illegal consonant clusters and show that this illusion is not due to lexical influence. Rather, it arises before lexical knowledge is activated, suggesting that phonotactics impact perception routines at a very early processing stage.

Dupoux, E. & Mehler, J. (1999). Non-Developmental studies of Development: examples from newborn research, bilingualism, and brain imaging. In C. Rovee-Collier, L. Lipsitt & H. Hayne (eds) Advances in infancy research, 12, (pp 375-406) Stamford, Connecticut: Ablex Publishing Corporation.

Mehler, J., Dupoux, E., Nazzi, T. & Dehaene-Lambertz, G. (1996). Coping with linguistic diversity: The infant's viewpoint. In J. Morgan & K.D. Demuth (eds) From Signal to Syntax: Bootstrapping from speech to grammar in early acquisition, (pp 101-116) Hillsdale, NJ: Erlbaum.

Hammond, M. & Dupoux, E. (1996). Psychophonology. In J. Durand & B. Laks (eds) Current Trends in Phonology: Models and Methods, (pp 281-304).

Dupoux, E. (1993). The time course of prelexical processing: The syllabic hypothesis revisited. In G. Altmann & R. Shillcock (eds) Cognitive Models of Speech Processing, (pp 81-114) Hillsdale, NJ: Erlbaum.

Dupoux, E. & Mehler, J. (1992). Unifying awareness and on line studies of speech: A tentative framework. In J. Alegria, D. Holender, J. Morais & M. Radeau (eds) Analytic approaches to human cognition, (pp 59-75) The Netherlands: Elsevier. [abstract] Generally, studies of speech recognition are related to theories of performance while studies of awareness are thought to bear upon language competence. In our conception, both areas of research contribute to our understanding of processing and of the representations that the subjects use when listening to speech. We present a unitary framework within which it becomes possible to incorporate the results from on-line speech recognition studies and from studies of the awareness that the language user has of speech segments. In particular, we argue that it is necessary to include a description of the manner in which acoustic-phonetic information is transduced and represented in order for us to understand how subjects come to decide to respond or not in a psycholinguistic experiment. Particular attention is given to the data from on-line chunk detection experiments and to the potential role of orthographic representation.

Dupoux, E. & Mehler, J. (1992). La segmentation de la parole. Courrier du CNRS.

Christophe, A., Dupoux, E. & Mehler, J. (1992). How do infants extract words from the speech stream? A discussion of the bootstrapping problem for lexical acquisition. In Proceedings of Child Language Research Forum, Stanford, CA.

Segui, J., Dupoux, E. & Mehler, J. (1990). The role of the syllable in speech segmentation, phoneme identification and lexical access. In G. Altmann (ed) Cognitive Models of Speech Processing, (pp 263-280) Cambridge, Mass: MIT Press.

Mehler, J., Dupoux, E. & Segui, J. (1990). Constraining models of lexical access: The onset of word recognition. In G. Altmann (ed) Cognitive Models of Speech Processing, (pp 236-262) Cambridge, Mass: MIT Press.

Mehler, J. & Dupoux, E. (1987). De la psychologie à la science cognitive. Le Débat, 47, 65-87.

Unpublished manuscripts

Note: these manuscripts are either unpublished or under revision. If you want to cite one of them, please send me an email.

Pallier, C., Dupoux, E. & Jeannin, X. (1997). EXPE6 Reference manual. [abstract] Expe is an experiment generator for PC computers: it allows users to run cognitive psychology experiments that involve the presentation of audio or visual stimuli and the collection of on-line or off-line behavioral responses (e.g. discrimination tasks, auditory target detection tasks, lexical decision and picture naming experiments). Its flexibility also makes it a very useful tool for the rapid design of protocols for testing neuropsychological patients. Expe provides a powerful scripting language which allows the user to specify, with human readable commands, all the components of an experiment (materials, stimulus presentation, training, instructions, etc.). Subjects' responses are saved in readable ASCII files, in a user-specified format. Expe is an open system: the commands of the language are calls to functions written in independent Borland Pascal units. Power users can thus easily add new commands to the language by linking their own Pascal procedures to meet any special need. This makes it possible, for example, to adapt Expe to new hardware, such as new sound or video boards, ERP collecting devices, etc.

Dupoux, E. (1994). A Syllabic Bottleneck in Prelexical Processing? A Phoneme Monitoring Investigation. LSCP Tech Report, 94(2), 1-14. [abstract] Previous research has found that phoneme detection latencies depend on the complexity of the syllable that bears the target phoneme. CV syllables give rise to faster latencies than CVC, which are faster than CCV (Treiman et al., 1982, Cutler et al., 1987). In Experiment 1, we replicate this result and extend it to a fourth structure: CCVC. In Experiment 2, we report a similar effect in first syllables of disyllabic items, showing that complexity effects cannot be reduced to stimulus duration effects. We argue that the complexity effect is inconsistent with the view that phonemes are the only units involved in speech perception, but supports models which stipulate larger sized units like syllables (Mehler, 1981; Segui, Dupoux & Mehler, 1990). In a series of post-hoc analyses, however, we show that the complexity effect is not uniform across subjects. Although both the complexity of onsets and codas of syllables influence phoneme detection latencies for slow subjects, fast subjects are only influenced by the nature of the onset. The interaction of speed of response with complexity effects is confirmed in Experiment 3, where it is found that when subjects are urged to respond as fast as possible, CVC items no longer show a complexity effect nor a lexical superiority effect. Implications for the existence of a syllabic bottleneck and the time course of prelexical processing are discussed.

Dupoux, E., Christophe, A. & Mehler, J. (1994). Lexical effects in phoneme monitoring: Time-course versus attentional accounts. LSCP Tech Report, 94(1), 1-12. [abstract] Under what conditions do lexical factors influence phoneme detection times? Experiment 1 measured subjects' latencies to detect initial phonemes in monosyllabic and disyllabic words that were preceded by a semantically related or unrelated word. One group of subjects was instructed to pay attention to the semantic relations between words, and a second group was asked to focus on acoustic-phonetic information. A significant priming effect was found, only for monosyllabic words, and only in the first group. In Experiment 2, previously observed frequency effects (Dupoux and Mehler, 1990) disappeared when the detection task was biased towards acoustic-phonetic information. In Experiment 3, two student populations were tested with exactly the same instruction set and showed markedly different results: One group showed a consistent lexical superiority effect on monosyllabic items while the other group showed no such effect. Taken together, these results suggest that the presence or absence of lexical effects is extremely sensitive to attentional parameters that can be affected by explicit biasing instructions and/or individual differences. Importantly, these effects cannot be accounted for in terms of mean reaction time differences (where slow reaction times would be expected to lead to stronger lexical influences than fast ones). The results reported here are consistent with the view that phoneme detection can be carried out using either of two quite different routes. Implications for current models of lexical and prelexical processing are discussed.

Dupoux, E. & Hammond, M. (1994). The role of stress in English: A fragment detection study. Unpublished Manuscript. [abstract] Previous investigations have claimed that speech perception uses language specific strategies and that, in particular, English does not use a strategy based on syllables (Cutler, Mehler, Norris, and Segui, 1983, 1986). This conclusion is based on a failure to replicate, with English subjects and materials, the interaction between target type and word type in fragment monitoring experiments that was originally found in French (Mehler et al., 1981). Here, we explore the possibility that this might be due to one of three related hypotheses: i. syllable boundaries in English depend on the stress value of the following syllable; ii. English listeners use the foot instead of the syllable in speech perception; iii. English subjects posit a perceptual boundary before an unreduced (or 'strong') syllable but not before a reduced, or 'weak' one. These three hypotheses can all be tested on the basis of the same contrasts and so we group them together under the rubric of 'stress-sensitive strategies' (SSSs). In Experiment 1, we find some incomplete support for an SSS, but the effect is not replicated in subsequent Experiments 2 (with slowed-down subjects) and 3 (with a different set of materials). An associated off-line task (Experiment 4) reveals that, according to subjects' intuitions, syllables have a rather different structure than we assumed at the outset. We conclude by rejecting these SSSs as the major source of the difference between French and English. In the final section, we discuss the possibility that the English perceptual system might still be based on the syllable, but not a stress-sensitive one.

See also Google Scholar.

