Towards Improving NAM-to-Speech Synthesis Intelligibility using Self-Supervised Speech Models

Abstract:

We propose a novel approach to significantly improve intelligibility in the Non-Audible Murmur (NAM)-to-speech conversion task, leveraging self-supervision and sequence-to-sequence (Seq2Seq) learning. Unlike conventional methods that explicitly record ground-truth speech, our method relies on self-supervision and speech-to-speech synthesis to simulate ground-truth speech. Despite using simulated speech, our method improves on the current state of the art (SOTA) by 29.08% in the Mel-Cepstral Distortion (MCD) metric. We also report error rates and demonstrate our model’s ability to synthesize speech in novel voices of interest. Finally, we present a methodology for augmenting the existing CSTR NAM TIMIT Plus corpus, setting a benchmark Word Error Rate (WER) of 42.57% for gauging the intelligibility of the synthesized speech.
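For readers unfamiliar with the WER metric mentioned above, the following is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance between a reference transcript and a hypothesis transcript, divided by the number of reference words. The function name `word_error_rate` and the lowercase/whitespace normalization are assumptions made for illustration; the text normalization and the way hypothesis transcripts were obtained for the reported 42.57% are not stated on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: lowercasing and whitespace splitting are assumed
    normalization steps, not necessarily those used by the authors.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("he was the architect", "he was an architect"))
```

Because insertions count as errors, a WER computed this way can exceed 100% when the hypothesis is much longer than the reference.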

Proposed Architecture

Overview of our proposed method: (A) ground-truth speech simulation from available whisper speech, (B) data augmentation using LJSpeech data and the dynamic time warping (DTW) algorithm to generate time-aligned data, and (C) a sequence-to-sequence (Seq2Seq) learning framework to synthesize speech in novel voices for the NAM-to-speech conversion task.
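Step (B) relies on DTW to produce time-aligned data. As a rough illustration of what such an alignment does, the sketch below implements textbook DTW over two sequences of feature frames and returns the total alignment cost together with the frame-to-frame warping path. The function name `dtw_path`, the Euclidean frame distance, and the toy one-dimensional frames are assumptions for illustration; the features, distance measure, and DTW variant used in the actual pipeline are not described on this page.

```python
import math


def dtw_path(x, y):
    """Classic dynamic time warping between two sequences of frames.

    x, y: non-empty lists of equal-dimension feature vectors (lists of floats).
    Returns (total_cost, path), where path is a list of (i, j) index pairs
    aligning x[i] with y[j]. Illustrative sketch, not the authors' code.
    """
    n, m = len(x), len(y)

    def frame_distance(a, b):
        # Euclidean distance between two frames (an assumed choice).
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning x[:i] with y[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(x[i - 1], y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # advance both
                                 cost[i - 1][j],      # advance x only
                                 cost[i][j - 1])      # advance y only

    # Backtrack from the end to recover the warping path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


if __name__ == "__main__":
    # Toy frames: the second sequence holds the middle frame one step longer.
    source = [[0.0], [1.0], [2.0]]
    target = [[0.0], [1.0], [1.0], [2.0]]
    total, path = dtw_path(source, target)
    print(total, path)
```

Given the path, time-aligned pairs can be formed by indexing both sequences with the returned `(i, j)` pairs, so that frames repeated in one sequence are matched to the same frame in the other.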

Comparing speech converted with existing methods and NAM2Speech on CSTR NAM TIMIT Plus corpus

Text	Input NAM Vibrations	DiscoGAN	MSpec-Net	NAM2Speech (Novel voice 1)	NAM2Speech (Novel voice 2)
It's the whole season.
It is a terrible loss.

More samples from our proposed methods on CSTR NAM TIMIT Plus corpus

ID	Text	Input NAM Vibrations	NAM2Speech (Novel voice 1)	NAM2Speech (Novel voice 2)	NAM2Speech (Novel voice 3)
192	I think we're going to make it.
208	Like everything in Scotland it takes time.
112	People have been wonderful beyond belief.
193	I didn't know where they were.
193	He was the architect.

Simulated ground-truth on CSTR NAM TIMIT Plus corpus