Audiovisual Integration in Speech and Speaker Perception

Abstract

We understand speech more efficiently when we can see the speaker’s lips moving in addition to hearing the acoustic signal. The so-called McGurk illusion demonstrates that the visual signal has an involuntary influence on the perception of the spoken sound. Brain imaging studies have shown that attending to a speaking face activates areas in the auditory cortex even when acoustic stimulation is absent. It seems, therefore, that in speech perception the visual signal can directly modulate acoustic processing. Beyond speech perception, voices and faces are also naturally important sources of information for the recognition of people. Following a successful pilot study in our laboratory, this project investigated, for the first time, the integration processes involved in person recognition. The main aims were (1) to explore the conditions required for, and the mechanisms of, this phenomenon, (2) to investigate the neuronal correlates of the integration processes underlying person recognition, and (3) to compare the audiovisual integration processes involved in person recognition and speech recognition. The overall intention of this project is to attain a better understanding of person recognition under everyday conditions, where dynamic audiovisual processing regularly occurs. The results so far indicate that audiovisual face-voice integration is an important factor in the recognition of people, depends on familiarity with the speaker, is sensitive to the temporal synchronization of facial and vocal articulation, and can occur in a bidirectional manner. Moreover, event-related brain potential recordings suggest multiple loci of audiovisual integration: perceiving time-synchronized speaking faces triggers early (~50-80 ms) audiovisual processing, whereas audiovisual speaker identity is computed only ~200 ms later.

Selected Relevant Publications

Von Eiff, C.I., Frühholz, S., Korth, D., Guntinas-Lichius, O., & Schweinberger, S.R. (2022). Crossmodal benefits to vocal emotion perception in cochlear implant users. iScience, 25, 105711. (Link to PDF).

Estudillo, A.J., Kaufmann, J.M., Bindemann, M., & Schweinberger, S.R. (2018). Multisensory Stimulation Modulates Perceptual and Post-perceptual Face Representations: Evidence from Event-Related Potentials. European Journal of Neuroscience, 48(5), 2259-2271. (Link to PDF)

Robertson, D.M.C., & Schweinberger, S.R. (2010). The role of audiovisual asynchrony in person recognition. Quarterly Journal of Experimental Psychology, 63, 23-30.

Robertson, D., & Schweinberger, S.R. (2007). Hearing facial identities: Audiovisual integration in the recognition of people. Joint meeting of the Experimental Psychology Society and the Psychonomic Society, Edinburgh, 4-7 July, 2007.

Schweinberger, S.R. (2013). Audiovisual integration in speaker identification. In: P. Belin, S. Campanella, & T. Ethofer (Eds.) Integrating Face and Voice in Person Perception (pp. 119 – 134). New York, Heidelberg: Springer.

Schweinberger, S.R., Casper, C., Hauthal, N., Kaufmann, J.M., Kawahara, H., Kloth, N., Robertson, D.M.C., Simpson, A.P., & Zäske, R. (2008). Auditory adaptation in voice perception. Current Biology, 18, 684-688.

Schweinberger, S.R., Kawahara, H., Simpson, A.P., Skuk, V.G., & Zäske, R. (2014). Speaker Perception. Wiley Interdisciplinary Reviews: Cognitive Science, 5, 15-25. (Link to PDF)

Schweinberger, S.R., Kloth, N., & Robertson, D.M.C. (2011). Hearing facial identities: Brain correlates of face-voice integration in person identification. Cortex, 47, 1026-1037. (Link to PDF)

Schweinberger, S.R., & Robertson, D.M.C. (in press). Audiovisual integration in familiar person recognition. Visual Cognition. (Link to PDF)

Schweinberger, S.R., Robertson, D.M.C., & Kaufmann, J.M. (2006). Hearing facial identities: Audiovisual integration in the recognition of familiar people. Mid-year Meeting of the International Neuropsychological Society (INS), Zürich, July 26-30, 2006.

Schweinberger, S.R., Robertson, D., & Kaufmann, J.M. (2007). Hearing facial identities. The Quarterly Journal of Experimental Psychology, 60, 1446-1456.


Funding

DFG-Projekt Schw 511/6-1


Stimulus examples (Windows Media Player required)

So far, our research has demonstrated the importance of integrating visual and auditory information in the recognition of familiar speakers (Schweinberger, S.R., Robertson, D., & Kaufmann, J.M. (2007). Hearing facial identities. The Quarterly Journal of Experimental Psychology, 60, 1446-1456).

Below are some examples of the types of stimuli we use to investigate the effects of audiovisual integration on speaker recognition in our ongoing research:

  • Corresponding Static: An example of a familiar voice, combined with the correct (corresponding) static face.
  • Corresponding Dynamic: An example of a familiar voice, combined with the correct (corresponding) dynamic face.
  • Noncorresponding Dynamic: An example of a familiar voice, combined with an incorrect (noncorresponding) dynamic face, edited to ensure precise temporal synchronization.
  • Noncorresponding Dynamic (with delayed video clarity): An example of a familiar voice, combined with an incorrect (noncorresponding) dynamic face. The video was presented in black and white, and the blurred face becomes clearer over time. This was done to investigate whether the voice can affect face recognition.

Audiovisual Asynchrony

We investigated the effects of asynchronous audiovisual presentations on voice recognition. Below are examples of those stimuli.

Backwards video:

  • Corresponding Backwards: An example of a familiar voice, combined with the correct (corresponding) backwards-animated face.
  • Noncorresponding Backwards: An example of a familiar voice, combined with an incorrect (noncorresponding) backwards-animated face.
Manipulation of audiovisual synchrony (see the sketch after this list):

  • Corresponding -600ms: An example of a familiar voice and face video, where the voice leads the facial motion by 600 milliseconds (voice begins 600ms before facial motion).
  • Corresponding Synchronous: An example of a familiar voice and face video, where the voice and facial motion are in synchrony.
  • Noncorresponding Synchronous: An example of an unfamiliar voice with a familiar face video, where the voice and facial motion are in synchrony.
  • Corresponding +200ms: An example of a familiar voice and face video, where the voice onset lags behind the facial motion by 200 milliseconds (voice begins 200ms after facial motion).
  • Corresponding +600ms: An example of a familiar voice and face video, where the voice onset lags behind the facial motion by 600 milliseconds (voice begins 600ms after facial motion).
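
The following is a minimal sketch, in Python, of how voice-lead and voice-lag versions of such stimuli could be prepared by shifting the extracted voice track relative to a fixed video onset. It is an illustration only, not the procedure used for the original stimuli; the file names are placeholders, and the shifted track would still need to be re-combined with the unchanged video in a video editor or similar tool.

    # Sketch only: shift a voice recording relative to the video onset to create
    # audiovisual asynchronies such as the -600 ms, +200 ms and +600 ms conditions.
    # Assumes the voice has been extracted to a WAV file; file names are placeholders.
    import numpy as np
    import soundfile as sf

    def shift_voice(in_wav, out_wav, offset_ms):
        """Positive offset_ms delays the voice (voice lags the facial motion);
        negative offset_ms makes the voice lead the facial motion."""
        audio, sr = sf.read(in_wav)
        n = int(abs(offset_ms) * sr / 1000)            # offset in samples
        silence = np.zeros((n,) + audio.shape[1:])     # handles mono or stereo input
        if offset_ms >= 0:
            # Voice lags: prepend silence, trim the end to keep the duration constant.
            shifted = np.concatenate([silence, audio])[:len(audio)]
        else:
            # Voice leads: drop the first n samples, pad the end with silence.
            shifted = np.concatenate([audio[n:], silence])
        sf.write(out_wav, shifted, sr)

    shift_voice("familiar_voice.wav", "voice_plus200ms.wav", 200)    # voice lags by 200 ms
    shift_voice("familiar_voice.wav", "voice_minus600ms.wav", -600)  # voice leads by 600 ms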

Example Stimuli

Schweinberger, S.R., & Robertson, D.M.C. (in press). Audiovisual integration in familiar person recognition. Frontiers in Bioscience.

Experiment 1: Two example videos each for two personally familiar speakers, with either corresponding or noncorresponding auditory and visual speaker identities.

  • Corresponding A1V1
  • Corresponding A2V2
  • Noncorresponding A1V2
  • Noncorresponding A2V1
Experiment 2: Two example videos of auditory speakers that were combined with corresponding or noncorresponding visual identities, with the video clip played backwards (time-reversed). A minimal sketch of this reversal step follows the two examples below.

  • Corresponding A2V2 time reversed
  • Noncorresponding A1V2 time reversed
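
As announced above, a sketch of the time-reversed condition is given here: the visual track is reversed with ffmpeg's reverse filter while a forward-playing voice is dubbed onto it. This is an assumption-laden illustration (file names and tool choice are placeholders), not the pipeline actually used; note that the reverse filter buffers the entire clip in memory and is therefore only practical for short stimuli.

    # Sketch only: time-reverse the face video and dub a forward-playing voice onto it.
    # Requires the ffmpeg command-line tool; all file names are placeholders.
    import subprocess

    def make_reversed_stimulus(face_video, voice_wav, out_mp4):
        subprocess.run(
            [
                "ffmpeg", "-y",
                "-i", face_video,                       # visual speaker identity
                "-i", voice_wav,                        # auditory speaker identity (forward)
                "-filter_complex", "[0:v]reverse[v]",   # reverse only the video stream
                "-map", "[v]", "-map", "1:a",
                "-shortest", out_mp4,
            ],
            check=True,
        )

    make_reversed_stimulus("speaker2_face.mp4", "speaker1_voice.wav", "A1V2_reversed.mp4")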
Experiment 3: Two example videos of visual speakers that were combined with corresponding or noncorresponding auditory voice identities. Note: videos were linearly deblurred during the first 1000 ms of presentation (a sketch of this deblurring ramp follows the examples below).

  • Corresponding A2V2 deblur
  • Noncorresponding A1V2 deblur
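
The deblurring ramp could, in principle, be implemented as sketched below: a Gaussian blur whose radius decreases linearly to zero over the first 1000 ms of the clip. This is a hedged sketch under assumed parameters (frame rate, maximum blur radius, file names); it is not necessarily how the original stimuli were produced.

    # Sketch only: linear deblurring of a grayscale face video over its first 1000 ms.
    # Frame rate, maximum blur radius and file names are assumptions for illustration.
    from PIL import Image, ImageFilter

    FPS = 25              # assumed frame rate of the clip
    DEBLUR_MS = 1000      # duration of the deblurring ramp
    MAX_RADIUS = 12.0     # assumed maximum Gaussian blur radius in pixels

    def blur_radius(frame_index):
        """Blur decreases linearly from MAX_RADIUS to 0 across the first second."""
        t_ms = frame_index * 1000.0 / FPS
        return MAX_RADIUS * max(0.0, 1.0 - t_ms / DEBLUR_MS)

    def deblur_sequence(n_frames):
        for i in range(n_frames):
            frame = Image.open(f"frame_{i:03d}.png")        # placeholder frame files
            radius = blur_radius(i)
            out = frame.filter(ImageFilter.GaussianBlur(radius)) if radius > 0 else frame
            out.convert("L").save(f"blurred_{i:03d}.png")   # black-and-white output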

The McGurk Effect

Furthermore, as has been well established in previous research, audiovisual integration is important in speech perception. Our own demonstration of the classic McGurk effect (McGurk & MacDonald, 1976) is displayed here.

This video is a combination of an auditory /aba/ and a visual /aga/, but what most adults (98%) perceive is /ada/. Play the clip while looking at the face and listening to the voice; then try listening to the voice with your eyes closed to get an idea of the difference between what you hear auditorily and what you perceive audiovisually.

The McGurk effect also works for full-sentence stimuli. The first two parts of the video contain the auditory-only “Bichter und Benker bachten basselbe”, followed by the visual-only “Gichter und Genker gachten gasselbe”. The third and final part of the video lets you perceive the result of integrating these two signals.

It can be viewed here.
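
For completeness, a minimal sketch of how a McGurk-type dub could be assembled is shown below: the auditory /aba/ track is muxed onto the visual /aga/ recording. File names are placeholders, and the sketch assumes the two recordings are already temporally aligned, since the illusion depends on reasonable audiovisual synchrony; it is not the method used to create the demonstration above.

    # Sketch only: dub an auditory /aba/ onto a visual /aga/ to elicit a fused /ada/ percept.
    # Requires the ffmpeg command-line tool; file names are placeholders.
    import subprocess

    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", "aga_visual.mp4",     # visual articulation of /aga/
            "-i", "aba_audio.wav",      # auditory /aba/
            "-map", "0:v", "-map", "1:a",
            "-c:v", "copy",             # leave the video frames untouched
            "-shortest", "mcgurk_demo.mp4",
        ],
        check=True,
    )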
