We present two new single-microphone algorithms developed in the course of the SPRING project.
The first is a speaker extraction algorithm, which can be used when a reference utterance of the desired speaker is available. Such an utterance may be obtained by identifying time frames in the same conversation (about 1 s long) in which only a single speaker is active, e.g., by utilizing the multi-channel current speakers activity detector (MCCSD). The second is a speaker separation algorithm that employs a temporal convolutional network (TCN) module.
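As a rough illustration of how a reference utterance could be assembled from single-speaker frames, the sketch below collects frames in which only the target speaker is active, given a per-speaker activity matrix such as the one an MCCSD-style detector might produce. The function name, frame layout, and thresholds are illustrative assumptions, not the project's actual interface.

```python
import numpy as np

def extract_reference(signal, activity, target, frame_len, min_frames):
    """Collect frames where only `target` is active into a reference utterance.

    signal:     1-D waveform (samples)
    activity:   (num_speakers, num_frames) boolean speaker-activity matrix,
                e.g. from a multi-channel speaker activity detector (assumed format)
    target:     index of the desired speaker
    frame_len:  samples per activity frame
    min_frames: frames needed (e.g. ~1 s of speech at the given frame length)
    """
    # Frames where the target is active and no one else is.
    solo = activity[target] & (activity.sum(axis=0) == 1)
    frames = [signal[i * frame_len:(i + 1) * frame_len]
              for i in np.flatnonzero(solo)]
    if len(frames) < min_frames:
        return None  # not enough single-speaker material
    return np.concatenate(frames[:min_frames])
```

With 100 ms frames at 16 kHz, `frame_len=1600` and `min_frames=10` yield roughly one second of reference speech.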
Both algorithms are implemented in Python. The extraction algorithm achieves substantial improvements in both the SI-SDR and STOI measures at medium-to-high reverberation levels. The separation algorithm was tested with NVIDIA's ASR system and provided significant improvements at medium reverberation levels, T60 ≈ 350 ms, and low noise levels. The speaker overlap was randomly chosen between 25% and 50%, since scenarios with full overlap between speakers are not considered realistic. We also tested the algorithm with a moving ReSpeaker sound card, which is not embedded in ARI; again, significant ASR improvements were demonstrated in a low-noise environment.
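For reference, the SI-SDR measure quoted above can be computed as below. This is a minimal NumPy sketch of the standard scale-invariant SDR definition, not the project's evaluation code.

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB.

    Projects the estimate onto the reference to obtain the target
    component, so the measure is invariant to rescaling the estimate.
    """
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference toward the estimate.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10(np.dot(target, target) / np.dot(noise, noise))
```

Improvement is then reported as SI-SDR of the processed output minus SI-SDR of the unprocessed mixture, both against the clean reference.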
We note that in the current audio processing architecture, two independent audio streams are simultaneously transcribed by the ASR system and transmitted to the dialogue system. To circumvent the speaker permutation phenomenon, typical of separation algorithms, we will apply a speaker identification module to the separated outputs and preserve the time consistency of the transcribed speech signals. For the extraction algorithm, the speaker identity can also be inferred from the reference utterance; for the separation algorithm, the VAD decisions can be used.
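One common way to resolve the permutation between separated streams and known speakers is to compare speaker embeddings of each output against enrolled embeddings and pick the better-matching assignment. The sketch below assumes a hypothetical speaker-embedding extractor has already produced the embedding vectors; it is an illustration of the idea, not the project's module.

```python
import numpy as np

def resolve_permutation(output_embs, enrolled_embs):
    """Match two separated streams to two enrolled speakers.

    output_embs:   (2, d) embeddings of the separated streams
                   (hypothetical speaker-embedding extractor assumed)
    enrolled_embs: (2, d) embeddings of the known speakers
    Returns a list p such that output i belongs to enrolled speaker p[i].
    """
    a = output_embs / np.linalg.norm(output_embs, axis=1, keepdims=True)
    b = enrolled_embs / np.linalg.norm(enrolled_embs, axis=1, keepdims=True)
    sim = a @ b.T  # (2, 2) cosine-similarity matrix
    # For two streams, compare the identity assignment against the swap.
    if sim[0, 0] + sim[1, 1] >= sim[0, 1] + sim[1, 0]:
        return [0, 1]
    return [1, 0]
```

Applying this per utterance keeps each transcription stream tied to a fixed speaker identity over time, which is what the dialogue system downstream would rely on.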
The operation paradigm of the dialogue system with two concurrent transcriptions has yet to be determined.