This study examines the relationship between the continuous acoustic properties of a stimulus and the perception of that stimulus. Through phoneme categorization (ID) tasks and discrimination (AX) tasks, it explores the association between the Voice Onset Time (VOT) of a stop consonant and the listener’s perception of that consonant’s phonetic/phonemic identity. Further, it examines how the listener’s native language shapes this relationship, by testing the listener on acoustic stimuli from both English and Russian.
Categorical Perception is the phenomenon whereby speakers cannot discriminate between acoustically dissimilar speech sounds that lie within the same phonetic category (Liberman, 1970). This has prompted many to assume that speakers are simply not sensitive to the finer acoustic differences that occur within the phonetic boundaries of their native language. Thus, if a listener classifies two different stimuli as the same phonetic segment, then they must not be sensitive to the acoustic differences between the stimuli.
Pisoni and Tash (1974) challenged this assumption by positing an alternative explanation – that the speakers’ observed inability to discriminate between within-category acoustic stimuli was an artifact of the discrimination task itself. Categorical Perception experiments typically employ the ABX discrimination task, wherein speakers are presented with three successive stimuli (the first two always differ) and must decide whether the third sound is the same as the first or the second. Such a task paradigm, Pisoni and Tash argued, relies on short-term memory and thus fails to capture acoustic information from the earliest stages of processing, which decays most rapidly. The underlying assumption here is that speech processing occurs at multiple levels: at the earliest stage, processing takes place at an acoustic level, and it later abstracts to a more phonetic representation.
To harness this early processing information, Pisoni and Tash designed a novel discrimination task with an A-X setup. On each trial, listeners were presented with a pair of speech sounds and asked to identify them as same or different from one another. We assume that classifying two acoustically identical speech sounds as “same” involves only the early (acoustic) stages of processing. On the other hand, classifying two acoustically dissimilar sounds as “same” requires an additional comparison of their abstract phonetic features, which occurs at a higher level of processing. The time taken by the listener to arrive at a decision should then reflect the level of processing and the information needed to make the comparison.
Alternatively, if only an abstract phonetic representation of speech sounds is available to listeners, then there should be no difference in their response times, regardless of the acoustic differences between a pair of stimuli.
In this study, I participated in four speech perception experiments. The first was an Identification (ID) task, where I was presented with stop consonants and asked to categorize them into one of two phonetic categories. The second was an AX discrimination task, where I was presented with two stop consonants, separated by varying interstimulus intervals, and asked to identify them as either “same” or “different” phonemes. These two experiments were each repeated with both English and Russian stimuli.
In the English experiments, the stop consonants were synthesized using the Klatt synthesizer. The stimuli were synthetic tokens whose Voice Onset Times (in ms) were: 0, +10, +20, +24, +28, +32, +40, +50, +60. These values were chosen to densely sample the region around the English /t/-/d/ phonetic boundary at approx. +30 ms.
In the Russian experiment, the stimuli were recorded by a native Russian speaker and manually edited to sample the /t/-/d/ boundary, which here lies at approx. -15 ms. The VOT values for the Russian stimuli were: -44, -36, -28, -20, -18, -16, -14, -6, +2, +10.
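For concreteness, the two continua and the boundary values just described can be summarized in a short Python sketch (the numbers are those listed above; the labeling rule is an idealization of categorical identification, not the actual synthesis or editing procedure):

    # VOT continua (ms) and approximate /t/-/d/ boundaries, as described above.
    ENGLISH_VOTS = [0, 10, 20, 24, 28, 32, 40, 50, 60]
    RUSSIAN_VOTS = [-44, -36, -28, -20, -18, -16, -14, -6, 2, 10]
    ENGLISH_BOUNDARY = 30   # ms VOT
    RUSSIAN_BOUNDARY = -15  # ms VOT

    def predicted_category(vot_ms, boundary_ms):
        """Idealized categorical label: /d/ below the boundary, /t/ at or above it."""
        return "/d/" if vot_ms < boundary_ms else "/t/"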
For the AX experiments, stimulus pairs were categorized into three groups. A-A pairs consisted of two physically identical tokens. A-a pairs were those trials where A and X were physically distinct tokens (with different VOTs) but fell on the same side of the phonetic boundary. A-B pairs were those trials where A and X fell on opposite sides of the category boundary. Each stimulus pair was separated by a particular interstimulus distance: for the English pairs, this distance was a multiple of 7 ms, whereas for the Russian pairs, it was a multiple of about 6 ms. The three-way classification is sketched below.
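The classification amounts to a simple decision rule over the pair’s VOTs. A minimal sketch, assuming each token is represented only by its VOT and using the boundary values above:

    def pair_type(vot_a, vot_x, boundary_ms):
        """Classify an AX trial as A-A, A-a, or A-B (as defined above)."""
        if vot_a == vot_x:
            return "A-A"                      # physically identical tokens
        same_side = (vot_a < boundary_ms) == (vot_x < boundary_ms)
        return "A-a" if same_side else "A-B"  # same vs. opposite side of boundary

For example, pair_type(20, 24, 30) yields "A-a" (distinct tokens, both on the /d/ side), while pair_type(24, 32, 30) yields "A-B".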
There appears to be a healthy distribution of Response Times around 750 ms, with a standard rightward skew.
My identification function seems consistent with the findings of Pisoni and Tash (1974). The stimulus continuum is partitioned into two discrete phonetic segments. My phonetic boundary of identification lies at about +27 ms VOT, slightly lower than the +30 ms VOT reported by Pisoni and Tash. That said, their result was an aggregate over the responses of nine subjects, each of whose category boundaries was likely situated somewhere around +30 ms VOT.
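If the boundary is taken to be the point where the identification function crosses 50% /t/ responses (a common convention, and an assumption on my part), it can be located by linear interpolation between the two flanking continuum steps. A sketch with illustrative, made-up response proportions:

    def boundary_50(vots, p_t):
        """VOT at which the proportion of /t/ responses crosses 0.5."""
        for (v0, p0), (v1, p1) in zip(zip(vots, p_t), zip(vots[1:], p_t[1:])):
            if (p0 - 0.5) * (p1 - 0.5) <= 0 and p0 != p1:
                # linear interpolation between the two flanking steps
                return v0 + (0.5 - p0) * (v1 - v0) / (p1 - p0)
        return None  # no crossover found in this range

    # e.g. boundary_50([20, 24, 28, 32], [0.10, 0.35, 0.60, 0.90]) ≈ 26.4 ms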
Similarly, Response Times (RTs) were slower close to the perceived phonetic boundary, with the longest mean RT of 1045 ms occurring at a Voice Onset Time of +24 ms. At the identified category boundary of +27 ms, the Response Time was approx. 925 ms.
These results are consistent with the finding that Reaction Time is a positive function of uncertainty, increasing at the phonetic boundary where identification is most ambiguous, and progressively decreasing away from the boundary where identification is least ambiguous. This suggests that, as a listener, I am sensitive to the finer acoustic differences between stop consonants that lie within the same phonetic category (here, English /t/ and /d/). This refutes the conclusion drawn by previous speech-perception studies – that listeners cannot discriminate between the acoustic properties of two stimuli that they identify as the same phonetic segment.
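One way to make “uncertainty” concrete (my own formalization, not one used by Pisoni and Tash) is the binary entropy of the identification proportions, which is maximal where p(/t/) = 0.5, i.e., at the boundary:

    import math

    def binary_entropy(p_t):
        """Uncertainty (in bits) of the /t/-/d/ decision at one VOT step."""
        if p_t <= 0.0 or p_t >= 1.0:
            return 0.0
        return -(p_t * math.log2(p_t) + (1 - p_t) * math.log2(1 - p_t))

    # binary_entropy(0.5)  == 1.0   (maximal uncertainty, at the boundary)
    # binary_entropy(0.95) ≈  0.29  (low uncertainty, far from the boundary)

Under the hypothesis above, mean RT at each continuum step should increase with this quantity.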
There appears to be a healthy distribution of Response Times for all three categories of stimulus pairs. The RTs are concentrated around approx. 750 ms, with the characteristic rightward skew. Overall, my RTs are much larger than the average 300 ms observed by Pisoni and Tash.
My AX experiment seems to follow a trend similar to that observed by Pisoni and Tash. The percentage of “different” responses increased with increasing acoustic distance between the sounds in the stimulus pair. This was true even for pairs that fell within the same phonemic category, affirming the notion that speakers are sensitive to intra-category acoustic properties of speech sounds.
In the “different” response condition (where the pair of stimuli were identified by the listener as two separate phonemes), RT decreased consistently with increasing interstimulus distance. Thus, the further apart two stimuli were acoustically – as measured by VOT – the faster the listener came to a decision regarding their similarity. Although my results followed a trend similar to that of Pisoni and Tash, the differences in RTs across the stimulus conditions were not nearly as stark. That said, their constant interstimulus distance of 250 ms was much larger than ours, which ranged between 7 and 35 ms; this could be why Pisoni and Tash observed a more pronounced effect.
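The comparison above reduces to grouping “different”-response trials by their acoustic distance. A sketch of that aggregation, assuming trials are stored as (vot_a, vot_x, response, rt_ms) tuples and that “distance” is the VOT difference within the pair:

    from collections import defaultdict

    def mean_rt_by_distance(trials):
        """Mean RT of "different" responses, grouped by |VOT_A - VOT_X|."""
        rts = defaultdict(list)
        for vot_a, vot_x, response, rt_ms in trials:
            if response == "different":
                rts[abs(vot_a - vot_x)].append(rt_ms)
        return {dist: sum(v) / len(v) for dist, v in sorted(rts.items())}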
Both the English ID and AX tasks support Pisoni and Tash’s hypothesis that speakers can discern the finer acoustic differences between sounds that belong to the same phonemic category, and that, when presented with the right task, they can utilise this information to successfully discriminate between sounds.
The positive relationship between acoustic difference and Response Time (as well as the proportion of “different” responses) suggests that I, as a listener, do show sensitivity to the /t/-/d/ phonemic boundary in English. This is not too surprising: as a native speaker of (Indian) English and Hindi, I am sensitive to the voicing distinction in stop consonants. That said, neither of the two sounds in this continuum (American English /t/ and /d/) is exactly part of my native phonemic inventory.
There appears to be a healthy distribution of Response Times concentrated around ~800 ms, with the characteristic rightward skew.
My identification function is somewhat consistent with the English results. Between -36 ms and +10 ms VOT, the stimulus continuum is partitioned into two discrete phonetic segments whose boundary of identification lies at about -14 ms VOT. This is quite close to the -15 ms boundary reported by Kazanina (2006). There is, however, a surprising level of ambiguity in the -44 ms to -36 ms VOT range: here, the data suggest that the listener is not categorically perceiving sounds within this acoustic range.
A possible explanation is that in this narrow range (-44 to -36 ms), the acoustic properties of the stimuli resemble members of yet another phonemic category that could be confused with a Russian /t/. As a Hindi speaker, I am sensitive to contrasts in voicing and aspiration in stop consonants. I am also sensitive to the dental-retroflex contrast in these sounds. This might be contributing to some noise towards the left edge of the continuum.
The relationship between VOT and Response Time does not show the same pattern as the English data. Here, the largest peak in RTs (where the listener took longest to respond) is not located at the perceived phonemic boundary; instead, it occurs further back in the continuum, somewhere in the anomalous region pointed out in the previous graph. Whether and how these two anomalies influence each other is unclear. That said, we do see a secondary (smaller) peak at the phonetic boundary of -14 ms, beyond which RT decreases with increasing distance from the boundary.
There appears to be a healthy distribution of Response Times for all three categories of stimulus pairs. The RTs are concentrated around 800 ms, with the characteristic rightward skew.
Here, the results are in line with the expectations of Pisoni and Tash. The percentage of “different” responses varies inversely with the acoustic similarity of the two stimuli: the more acoustically dissimilar the two stimuli, the more likely the listener is to identify them as different phonemic segments. This suggests that despite categorically perceiving speech sounds, listeners remain sensitive to their acoustic properties.
For the stimulus pairs identified by the listener as “different”, the time taken to arrive at a decision decreased with increasing acoustic distance between the two sounds. Conversely, for stimulus pairs identified as belonging to the same phonetic category, Response Time increased with increasing acoustic distance. This suggests that the acoustic properties of sounds influence the listener’s ability to discriminate amongst them, and that this holds true even for sounds that lie within the same phonemic category.
Together, the ID and AX experiments suggest that I do show sensitivity to the /t/-/d/ contrast in Russian, albeit less than perfectly. This is not surprising: as a native Hindi speaker, I am sensitive to voicing and aspiration contrasts in stop consonants. Moreover, Hindi, like Russian, has both a voiceless dental /t/ and a voiced dental /d/ in its phonemic inventory. This likely aided me in both the identification and discrimination tasks.
Overall, this study supports the hypothesis presented by Pisoni and Tash (1974). Both the English and Russian experiments provide evidence that speakers are sensitive to the finer acoustic differences between speech sounds that lie within the same perceived phonetic category. Moreover, this sensitivity to the acoustic signal manifests in varied response times during AX-type discrimination tasks. RT appears to be a positive function of uncertainty, increasing at the phonetic boundary where identification is least consistent. This supports the theory that speech processing occurs at multiple levels, starting with the acoustic and gradually abstracting to a more phonetic representation. The time and effort needed to discriminate between speech sounds thus depend on the level at which they are being processed.