Computer model mimics human audiovisual perception

A new computer model developed at the University of Liverpool combines sight and sound in a way that closely resembles human perception. Inspired by biology, the model could prove useful for artificial intelligence and machine perception.

The model is based on a neural mechanism first identified in insects, where it supports motion detection. Dr Cesare Parise, Senior Lecturer in Psychology, has adapted this idea to create a system that processes real-life audiovisual signals, such as videos and soundtracks, rather than relying on the abstract parameters used in older models.

When we watch someone speak, our brains automatically match what we see with what we hear. This can lead to illusions, such as the McGurk effect, where mismatched sounds and lip movements create a new perception, or the ventriloquist illusion, where a voice seems to come from a puppet rather than the performer. The latest work asks how the brain knows when sound and vision match.

Previous models tried to explain this but were limited because they didn’t work directly with real audiovisual signals. Dr Cesare Parise, of the University of Liverpool’s Institute of Population Health, explains: “Despite decades of research in audiovisual perception, we still did not have a model that could solve a task as simple as taking a video as input and telling whether the audio would be perceived as in sync. This limitation reveals a deeper issue: without being stimulus-computable, perceptual models can capture many aspects of perception in theory, but can’t perform even the most straightforward real-world test.”

Dr Parise’s new model addresses a long-standing challenge in sensory integration. It builds on earlier work by Parise and Marc Ernst (University of Bielefeld, Germany), who introduced the principle of correlation detection, a possible explanation for how the brain combines signals from different senses. That work led to the Multisensory Correlation Detector (MCD), a model that could mimic human responses to simple audiovisual patterns such as flashes and clicks. They later refined the model to emphasise brief changes in its input, which are key to how we integrate sight and sound.
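At its core, the MCD compares temporally filtered versions of the visual and auditory signals. The sketch below is a minimal, hedged illustration of that principle in Python, not the published implementation: it filters the two input envelopes with a fast and a slow low-pass filter (the filter form and time constants are assumptions chosen for illustration) and multiplies them across modalities to obtain a correlation output and a lag output.

```python
# Minimal sketch of a correlation-detector unit in the spirit of the MCD.
# Filter form, time constants and output definitions are illustrative
# assumptions, not the published model.
import numpy as np

def exp_lowpass(x, tau, dt=0.001):
    """First-order exponential low-pass filter, a common choice in
    correlation-detector models."""
    y = np.zeros_like(x, dtype=float)
    alpha = dt / (tau + dt)
    for i in range(1, len(x)):
        y[i] = y[i - 1] + alpha * (x[i] - y[i - 1])
    return y

def mcd_unit(visual, audio, tau_fast=0.05, tau_slow=0.2, dt=0.001):
    """Cross-correlate temporally filtered visual and auditory envelopes.

    Returns two time-varying signals:
      corr -- large when the two inputs co-vary (evidence for a common cause)
      lag  -- signed; its sign indicates which modality leads the other
    """
    v_fast, v_slow = exp_lowpass(visual, tau_fast, dt), exp_lowpass(visual, tau_slow, dt)
    a_fast, a_slow = exp_lowpass(audio, tau_fast, dt), exp_lowpass(audio, tau_slow, dt)
    sub1 = v_fast * a_slow           # Reichardt-style sub-unit 1
    sub2 = v_slow * a_fast           # Reichardt-style sub-unit 2
    return sub1 * sub2, sub1 - sub2  # correlation and lag outputs

# Example: a shared pulse in both modalities, with the sound delayed by 100 ms.
t = np.arange(0, 1, 0.001)
visual = ((t > 0.30) & (t < 0.50)).astype(float)
audio = ((t > 0.40) & (t < 0.60)).astype(float)
corr, lag = mcd_unit(visual, audio)
print(corr.mean(), lag.mean())  # a larger delay lowers the correlation output
```

In a full model these outputs would be read out over time to judge whether sound and vision are in sync and which one leads; here they simply illustrate the principle of correlation detection.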

In the current study, Parise simulated a population of these detectors arranged in a lattice spanning visual and auditory space. This setup allowed the model to handle complex, real-world stimuli, and it successfully reproduced the results of 69 well-known experiments involving humans, monkeys, and rats.
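To give a concrete, if simplified, sense of how such a lattice could operate on raw signals, the sketch below tiles a set of spatial positions with correlation detectors: each position receives the local visual temporal energy together with the overall auditory envelope, and the integrated correlation across positions forms a spatial map that peaks where sound and vision co-vary. The function names, filters and toy stimulus are assumptions for illustration only, not the architecture used in the paper.

```python
# Hedged sketch of a lattice of correlation detectors tiling space.
# Filters, weights and the toy stimulus are illustrative assumptions.
import numpy as np

def exp_lowpass(x, tau, dt=0.001):
    # Same first-order exponential low-pass filter as in the sketch above.
    y = np.zeros_like(x, dtype=float)
    alpha = dt / (tau + dt)
    for i in range(1, len(x)):
        y[i] = y[i - 1] + alpha * (x[i] - y[i - 1])
    return y

def correlation_map(video_energy, audio_env, tau_fast=0.05, tau_slow=0.2):
    """video_energy: array (time, n_positions) of local visual temporal energy.
    audio_env: array (time,) giving the auditory amplitude envelope.
    Returns a normalised (n_positions,) map of audiovisual correlation."""
    a_fast = exp_lowpass(audio_env, tau_fast)
    a_slow = exp_lowpass(audio_env, tau_slow)
    n_positions = video_energy.shape[1]
    corr_map = np.zeros(n_positions)
    for x in range(n_positions):
        v_fast = exp_lowpass(video_energy[:, x], tau_fast)
        v_slow = exp_lowpass(video_energy[:, x], tau_slow)
        # Reichardt-style sub-units multiplied together, integrated over time.
        corr_map[x] = np.sum((v_fast * a_slow) * (v_slow * a_fast))
    return corr_map / (corr_map.sum() + 1e-12)

# Toy example: visual activity at position 12 co-varies with a pulsing sound,
# so the correlation map should peak there, a crude analogue of the
# ventriloquist's dummy "capturing" the voice.
t = np.arange(0, 2, 0.001)
audio = (np.sin(2 * np.pi * 3 * t) > 0).astype(float)   # pulsing sound envelope
video = 0.05 * np.random.rand(len(t), 20)                # 20 positions of visual noise
video[:, 12] += audio                                    # correlated activity at one spot
print(np.argmax(correlation_map(video, audio)))          # expected to print 12
```

Read out frame by frame, a map of this kind can also be interpreted as a simple audiovisual saliency signal, indicating where correlated events are likely to draw attention.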

Dr Parise added: “This represents the largest-scale simulation ever conducted in the field. While other models have been tested extensively in the past, none have been tested against so many datasets in a single study.”

The model matched behaviour across species and outperformed the leading Bayesian Causal Inference model while using the same number of adjustable parameters. It also predicted where people would look while watching audiovisual movies, acting as a lightweight ‘saliency model’.

Parise believes the model could be useful beyond neuroscience: “Evolution has already solved the problem of aligning sound and vision with simple, general-purpose computations that scale across species and contexts. The crucial step here is stimulus computability: because the model works directly on raw audiovisual signals, it can be applied to any real-world material.”

He added: “Today’s AI systems still struggle to combine multimodal information reliably, and audiovisual saliency models depend on large, parameter-heavy networks trained on vast labelled datasets. By contrast, the MCD lattice is lightweight, efficient, and requires no training. This makes the model a powerful candidate for next-generation applications.”

Parise concludes: “What began as a model of insect motion vision now explains how brains – human or otherwise – integrate sound and vision across an extraordinary range of contexts. From predicting illusions like the McGurk and ventriloquist effects to inferring causality and generating dynamic audiovisual saliency maps, it offers a new blueprint for both neuroscience and artificial intelligence research.”

The paper, ‘A Stimulus-Computable Model for Audiovisual Perception and Spatial Orienting in Mammals’, was published in eLife (DOI: 10.7554/eLife.106122.3).

Ventriloquist image: Model response to a performing ventriloquist. The model response clusters around the dummy, reproducing the illusory shift in the localisation of the perceived sound source. Image credit: Parise, eLife 2025.