Towards Improving NAM-to-Speech Synthesis Intelligibility using Self-Supervised Speech Models

Abstract:

We propose a novel approach to improving intelligibility in the Non-Audible Murmur (NAM)-to-speech conversion task, leveraging self-supervision and sequence-to-sequence (Seq2Seq) learning techniques. Unlike conventional methods that explicitly record ground-truth speech, our methodology relies on self-supervision and speech-to-speech synthesis to simulate ground-truth speech. Despite using simulated speech, our method improves on the current state of the art (SOTA) by 29.08% in the Mel-Cepstral Distortion (MCD) metric. Additionally, we report error rates and demonstrate our model’s ability to synthesize speech in novel voices of interest. Moreover, we present a methodology for augmenting the existing CSTR NAM TIMIT Plus corpus, setting a benchmark Word Error Rate (WER) of 42.57% to gauge the intelligibility of the synthesized speech.
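For readers unfamiliar with the WER metric quoted above, the following is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The function name `word_error_rate` and the lowercase, whitespace-based tokenization are assumptions made for illustration. This page does not describe the authors' evaluation pipeline, including how the recognized transcripts were obtained.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration only. Tokenization here is a
    plain lowercase whitespace split, which is an assumption.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("he was the architect", "he was an architect"))
```

Under this definition a lower value indicates more intelligible output, and the value can exceed 1.0 when the hypothesis contains many inserted words.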

Proposed Architecture

...
Overview of our proposed method: (A) ground-truth speech simulation from available whisper speech, (B) data augmentation using LJSpeech data and the dynamic time warping (DTW) algorithm to generate time-aligned data, and (C) a sequence-to-sequence (Seq2Seq) learning framework to synthesize speech in novel voices for the NAM-to-speech conversion task.
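As a rough illustration of the time-alignment idea in step (B), the sketch below implements textbook dynamic time warping between two sequences of feature frames. It is not the authors' implementation. The function name `dtw_path`, the Euclidean frame distance, and the toy one-dimensional frames are all assumptions, and the page does not say which features are aligned.

```python
import math


def dtw_path(x, y):
    """Return (total_cost, path) aligning frame sequences x and y.

    x and y are non-empty sequences of equal-dimension feature frames.
    path is a list of (index_in_x, index_in_y) pairs from start to end.
    The Euclidean frame distance is an assumption for this sketch.
    """
    n, m = len(x), len(y)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    # cost[i][j]: minimal accumulated distance aligning x[:i] with y[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(x[i - 1], y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # advance both
                                 cost[i - 1][j],      # advance x only
                                 cost[i][j - 1])      # advance y only

    # Backtrack from the end to recover the warping path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
        path.append((i - 1, j - 1))
    path.reverse()
    return cost[n][m], path


if __name__ == "__main__":
    source = [[0.0], [1.0], [2.0]]
    target = [[0.0], [1.0], [1.0], [2.0]]
    total, path = dtw_path(source, target)
    print(total, path)
```

The returned path can be used to repeat or skip frames in one sequence so that both have the same length, which is what makes frame-wise paired training data possible.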

Note: The samples for comparison are taken from the demos of previous work and are therefore not cherry-picked.

Comparing speech converted with existing methods and NAM2Speech on CSTR NAM TIMIT Plus corpus

Text | Input NAM Vibrations | DiscoGAN | MSpec-Net | NAM2Speech (Novel voice 1) | NAM2Speech (Novel voice 2)
It's the whole season.
It is a terrible loss.

More samples from our proposed methods on CSTR NAM TIMIT Plus corpus

ID | Text | Input NAM Vibrations | NAM2Speech (Novel voice 1) | NAM2Speech (Novel voice 2) | NAM2Speech (Novel voice 3)
192 | I think we're going to make it.
208 | Like everything in Scotland it takes time.
112 | People have been wonderful beyond belief.
193 | I didn't know where they were.
193 | He was the architect.

Simulated ground-truth on CSTR NAM TIMIT Plus corpus

ID | Text | Input NAM Vibrations | Simulated Ground-Truth
002 | Ask her to bring these things with her from the store.
004 | We also need a small plastic snake and a big toy frog for the kids.
007 | The rainbow is a division of white light into many beautiful colors.
010 | People look, but no one ever finds it.
012 | Throughout the centuries people have explained the rainbow in various ways.