-In phonology, we make a useful distinction between a phoneme, an abstract category used to contrast and encode words in the lexicon, and a phone, the physical instantiation of a phoneme in a given linguistic context (Cite XX). For example, the English phoneme /t/ is realized as an aspirated [tʰ] in "top" but as an unaspirated [t] in "stop".
-Although phonemes represent the desired end-state of phonological learning, most computational studies have focused on modeling the discovery of sound clusters in speech (Vallabha et al., Toscano et al., Feldman et al., …), which correspond more closely to the notion of phones, as has been noted on several occasions (e.g., Dillon, Dunbar et al. 2013; Feldman et al. 2013).
-If these statistical models are more likely to converge on phones, then the second step of phonological development, i.e., determining the phonemic status of these phones, has yet to be addressed.
-A few algorithms have been proposed (Peperkamp et al. 2006; Martin et al. 2013; Calamaro & Jarosz 2015). While they run more or less successfully on simplified input, they fail to scale up to realistic input (Martin et al. 2013; Fourtassi et al. 2014; Dupoux (Blog post?)).
-In this paper, we test bottom-up and top-down cues on large corpora of spontaneous speech in English and Japanese, using tools from speech recognition and natural language processing.
-We assume learners have already converged on a first set of phones, which correspond to the physical instantiations of phonemes in various linguistic contexts.
-This phonetic variation is largely driven by coarticulation, that is, the overlap of vocal tract gestures for one sound with gestures for another.
-Experimental designs probing phoneme learning in infants compare behavior (e.g., looking time) when infants are presented with a pair of sounds that is contrastive in their native language (e.g., da-ta for English learners) to behavior when they are presented with a pair of phones that is not contrastive (e.g., da-ɖa).
-We use a task similar in spirit to these laboratory experiments in order to evaluate how well the learning mechanisms fare in discovering the phonemic status of phones:
-> For each corpus, we list all possible pairs of phones. Some of these pairs are instantiations of the same phoneme and are labeled “0” (non-contrastive), and others are instantiations of different phonemes and are labeled “1” (contrastive).
-> Each of these pairs is then assigned a score by each of the cues under investigation.
-> These scores should allow the learner to rank contrastive pairs higher than non-contrastive ones.
-> Since these scores are continuous, we compute the Receiver Operating Characteristic (ROC) curve, which plots the performance of the binary classifier obtained by thresholding the cue’s score, as the discrimination threshold is varied across its range.
-> Finally, the overall performance of the mechanism is summarized by the Area Under the ROC Curve (AUC); see the sketch after this list.
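-As a concrete illustration of this evaluation pipeline, a minimal Python sketch (using scikit-learn) that computes the ROC curve and AUC for one cue; the pair labels and scores below are hypothetical placeholders, not actual corpus values:
```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# labels[i] is 1 if pair i is contrastive (different phonemes), 0 otherwise;
# scores[i] is the cue's score for pair i (toy values for illustration)
labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])
scores = np.array([2.7, 0.9, 0.4, 2.1, 0.8, 1.1, 1.5, 0.2])

fpr, tpr, thresholds = roc_curve(labels, scores)  # ROC over all thresholds
auc = roc_auc_score(labels, scores)               # area under the ROC curve
print(f"AUC = {auc:.2f}")  # 0.94 for these toy values; 1.0 = perfect, 0.5 = chance
```
-The AUC can be read as the probability that a randomly chosen contrastive pair is ranked above a randomly chosen non-contrastive one.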
-Bottom-up cue: -> The acoustic cue applies to all pairs, since each phone has its own HMM model, so a distance can be computed between any two phones (see the sketch below).
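-What such an acoustic score might look like, in a minimal sketch: for illustration only, we assume each phone’s HMM is collapsed into a single diagonal-covariance Gaussian over its acoustic frames and take the symmetrized KL divergence between the two Gaussians (the actual cue may instead compare full HMM state sequences):
```python
import numpy as np

def kl_diag_gaussian(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between two diagonal-covariance Gaussians."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

def acoustic_distance(model_a, model_b):
    """Symmetrized KL divergence between two phones' emission models.

    Each model is a (mean, variance) pair of vectors -- a stand-in for
    the phone's full HMM (an illustrative simplification)."""
    (mu_a, var_a), (mu_b, var_b) = model_a, model_b
    return 0.5 * (kl_diag_gaussian(mu_a, var_a, mu_b, var_b) +
                  kl_diag_gaussian(mu_b, var_b, mu_a, var_a))
```
-Larger distances suggest a contrastive pair, so this score can directly feed the ranking evaluation above.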
-Top-down cue: -> The top-down cue does not apply to all pairs, only to those that give rise to word-form variation. Since the variation in our data is driven by coarticulation, word-form variants differ at the first or last segment (see the sketch below).
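-A sketch of how this cue could be extracted, assuming a hypothetical lexicon structure that maps each lexical item to its observed phonetic variants; two variants of the same item that differ in exactly one edge segment count as an alternation between the two phones involved:
```python
from collections import defaultdict
from itertools import combinations

def alternation_counts(lexicon):
    """Count edge-segment alternations between phones.

    lexicon: dict mapping a lexical item to the list of phone-tuple
    variants observed for it (hypothetical data structure), e.g.
    {'cat': [('k', 'ae', 't'), ('k', 'ae', 'th')]}."""
    counts = defaultdict(int)
    for variants in lexicon.values():
        for form_a, form_b in combinations(set(variants), 2):
            if len(form_a) != len(form_b):
                continue
            diffs = [i for i, (a, b) in enumerate(zip(form_a, form_b)) if a != b]
            # coarticulation-driven variants differ in exactly one segment,
            # and only at the word edge (first or last position)
            if len(diffs) == 1 and diffs[0] in (0, len(form_a) - 1):
                i = diffs[0]
                counts[frozenset((form_a[i], form_b[i]))] += 1
    return counts
```
-Since frequent alternation within the same item signals allophony, a pair’s contrastive score would be inversely related to its alternation count (e.g., its negation).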
-The bottom-up cue performs better and has a wider scope than the top-down cue.
-Using a basic matrix completion algorithm, we show that phonemic information can, in principle, be generalized to the totality of pairs, but only when variation is not too extreme (see the sketch below).
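-A minimal sketch of such a completion: an iterative truncated-SVD imputation over the symmetric matrix of top-down scores, where only the pairs showing word-form variation are observed (the rank and iteration count here are illustrative choices, not the settings used in the paper):
```python
import numpy as np

def complete_scores(D, mask, rank=2, n_iter=200):
    """Fill the unobserved cells of a symmetric phone-pair score matrix.

    D:    n x n matrix of cue scores (entries outside `mask` are ignored)
    mask: n x n boolean matrix, True where a score was actually observed
    """
    X = np.where(mask, D, np.mean(D[mask]))  # initialize gaps with the mean
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s[rank:] = 0.0                       # keep only the top `rank` factors
        X = np.where(mask, D, (U * s) @ Vt)  # re-impute gaps, keep observed cells
    return X
```
-The low-rank assumption is what lets scores observed on a sparse subset of pairs propagate to the rest of the matrix.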
-The previous results tested pair ranking with respect to the gold phonemic categorization. However, this leaves open the question of the hierarchy level.
-In machine learning terms, given a distance matrix, we usually need to specify a parameter that sets the desired hierarchy level (i.e., the number of categories).
-We test whether the acoustic and top-down cues encode a preference for the phonemic level of categorization. To this end, we evaluate how the cues fare under various hierarchical categorizations (see the sketch below).
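-One way to run this evaluation, in a minimal sketch: build a tree by average-linkage clustering over the cue’s distance matrix, cut it at every granularity, and score each cut against the gold phoneme labels with the adjusted Rand index (the linkage method and scoring metric here are illustrative assumptions):
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics import adjusted_rand_score

def score_hierarchy(dist_matrix, gold_phonemes):
    """Score every level of the hierarchy induced by a cue's distances.

    dist_matrix:   symmetric phone-by-phone distance matrix (zero diagonal)
    gold_phonemes: gold phoneme label for each phone"""
    condensed = squareform(dist_matrix, checks=False)  # condensed distance form
    tree = linkage(condensed, method="average")        # agglomerative hierarchy
    n = dist_matrix.shape[0]
    scores = {}
    for k in range(1, n + 1):                          # every granularity level
        clusters = fcluster(tree, t=k, criterion="maxclust")
        scores[k] = adjusted_rand_score(gold_phonemes, clusters)
    return scores
```
-A cue encodes a preference for the phonemic level if the score peaks when the number of clusters matches the number of gold phonemes.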