Research Output
Audio-visual speech enhancement and separation by leveraging multimodal self-supervised embeddings
  AV-HuBERT, a multi-modal self-supervised learning model, has been shown to be effective for categorical problems such as automatic speech recognition and lip-reading. This suggests that useful audio-visual speech representations can be obtained from multi-modal self-supervised embeddings. Nevertheless, it is unclear whether such representations can be generalized to solve real-world audio-visual regression tasks, such as audio-visual speech enhancement (AVSE) and audio-visual speech separation (AVSS). In this study, we leveraged the pre-trained AV-HuBERT model followed by an SE module for AVSE and AVSS. Comparative experimental results demonstrate that our proposed model performs better than state-of-the-art AVSE and traditional audio-only SE models. In summary, our results confirm the effectiveness of our proposed model for the AVSS task with proper fine-tuning strategies, demonstrating that the multi-modal self-supervised embeddings obtained from AV-HuBERT can be generalized to audio-visual regression tasks.
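
  The abstract describes a pipeline in which a pre-trained AV-HuBERT encoder supplies multimodal frame embeddings to a downstream SE module. The PyTorch sketch below illustrates one plausible reading of that structure; it is not the authors' implementation. The AV-HuBERT encoder (loaded in practice from the fairseq-based AV-HuBERT codebase) is abstracted as a pre-computed embedding tensor, and the `AVSEHead` mask-estimation module, its layer sizes, and the 768-dimensional embedding width are illustrative assumptions.

```python
# Minimal sketch, assuming: AV-HuBERT embeddings of shape (batch, frames, 768)
# are already extracted, and the SE module estimates a time-frequency mask
# that is applied to the noisy magnitude spectrogram (phase reused from input).
import torch
import torch.nn as nn

class AVSEHead(nn.Module):
    """Hypothetical SE module: maps AV-HuBERT embeddings to a T-F mask."""
    def __init__(self, embed_dim: int = 768, n_freq: int = 257, hidden: int = 512):
        super().__init__()
        self.rnn = nn.LSTM(embed_dim, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_freq)

    def forward(self, av_embeddings: torch.Tensor) -> torch.Tensor:
        # av_embeddings: (batch, frames, embed_dim) from the frozen or
        # fine-tuned AV-HuBERT encoder
        h, _ = self.rnn(av_embeddings)
        return torch.sigmoid(self.proj(h))  # mask in [0, 1] per T-F bin

def enhance(noisy_mag: torch.Tensor, av_embeddings: torch.Tensor,
            head: AVSEHead) -> torch.Tensor:
    """Apply the estimated mask to the noisy magnitude spectrogram."""
    mask = head(av_embeddings)  # (batch, frames, n_freq)
    return mask * noisy_mag     # enhanced magnitude spectrogram

# Toy shapes only: 1 utterance, 100 frames, 768-dim embeddings, 257 freq bins.
head = AVSEHead()
emb = torch.randn(1, 100, 768)
noisy = torch.rand(1, 100, 257)
print(enhance(noisy, emb, head).shape)  # torch.Size([1, 100, 257])
```

  For AVSS, the same embedding-to-mask structure would plausibly estimate one mask per target speaker; the paper's fine-tuning strategies for the AV-HuBERT backbone are not reflected in this sketch.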

  • Date:

    02 August 2023

  • Publication Status:

    Published

  • Publisher:

    IEEE

  • DOI:

  • Funders:

    Edinburgh Napier Funded

Citation

Chern, I.-C., Hung, K.-H., Chen, Y.-T., Hussain, T., Gogate, M., Hussain, A., Tsao, Y., & Hou, J.-C. (2023, June). Audio-visual speech enhancement and separation by leveraging multimodal self-supervised embeddings. Presented at 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), Rhodes Island, Greece.

Authors

Chern, I.-C., Hung, K.-H., Chen, Y.-T., Hussain, T., Gogate, M., Hussain, A., Tsao, Y., & Hou, J.-C.

Keywords

Audio-Visual Speech Enhancement, Audio-Visual Speech Separation, AV-HuBERT
