Abstract

Viewers of 360-degree videos are provided with both the visual modality, which characterizes the surrounding views, and the audio modality, which indicates the sound direction. Although both modalities are important for saliency prediction, little work has jointly exploited them, mainly due to the lack of audio-visual saliency datasets and the insufficient exploitation of multi-modal cues. In this article, we first construct an audio-visual saliency dataset comprising 57 360-degree videos watched by 63 viewers. Through a deep analysis of the constructed dataset, we find that human gaze can be attracted by auditory cues, and that the saliency maps become more concentrated when the location of the sound source is also provided. To jointly exploit visual and audio features and their correlation, we further design SVGC-AVA, a saliency prediction network for 360-degree videos based on Spherical Vector-based Graph Convolution and Audio-Visual Attention. The proposed spherical vector-based graph convolution processes visual and audio features directly in the sphere domain, thus avoiding the projection distortion incurred by traditional CNN-based predictors. In addition, the audio-visual attention scheme explores self-modal and cross-modal correlations for both modalities, which are further processed hierarchically through the U-Net-style multi-scale structure of SVGC-AVA. Evaluations on both our dataset and public datasets validate that SVGC-AVA achieves higher prediction accuracy, both quantitatively and qualitatively.
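
The audio-visual attention scheme described above combines self-modal and cross-modal attention between the two modalities. As a rough illustration of the cross-modal part only, the following PyTorch sketch (not the authors' implementation; the module names, token counts, and feature dimensions are assumptions) lets visual tokens attend to audio tokens and vice versa before fusion.

```python
# Hypothetical sketch of cross-modal attention between visual and audio tokens.
# All names and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """Queries come from one modality; keys/values come from the other."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_tokens: torch.Tensor, context_tokens: torch.Tensor) -> torch.Tensor:
        # query_tokens: (B, Nq, dim), context_tokens: (B, Nc, dim)
        attended, _ = self.attn(query_tokens, context_tokens, context_tokens)
        return self.norm(query_tokens + attended)  # residual connection


if __name__ == "__main__":
    # Hypothetical sizes: 642 visual nodes (e.g., an icosphere grid) and 16 audio frames.
    B, Nv, Na, dim = 2, 642, 16, 256
    visual = torch.randn(B, Nv, dim)
    audio = torch.randn(B, Na, dim)

    v2a = CrossModalAttention(dim)     # visual queries attend to audio context
    a2v = CrossModalAttention(dim)     # audio queries attend to visual context

    visual_fused = v2a(visual, audio)  # (B, Nv, dim)
    audio_fused = a2v(audio, visual)   # (B, Na, dim)
    print(visual_fused.shape, audio_fused.shape)
```

In a network such as SVGC-AVA, attention of this kind would presumably be applied at several scales of the U-Net-style encoder-decoder rather than once, but the single-block sketch is enough to show how one modality's features can be reweighted by the other's.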
