We report our progress on audio speaker diarisation and extraction with a moving robot. The goal is to provide several separated audio streams to be transcribed by the automatic speech recognition (ASR) system and fed to the multi-party conversational system that will be deployed on ARI, the robotic platform designed by PAL Robotics for the SPRING project.
The main achievements reported are:
1. A single-microphone speaker extraction algorithm that uses a reference utterance of the desired speaker;
2. A single-microphone speaker separation algorithm based on a temporal convolutional network (TCN) module.
Both methods are extensively tested on common databases as well as on real recordings from ARI. Word error rate (WER) improvements are also reported. Details are available in deliverable D3.4 on our results page.
We present two new single-microphone algorithms developed in the course of the SPRING project.
The first is a speaker extraction algorithm, which can be used when a reference utterance of the desired speaker is available. Such an utterance may be obtained by identifying time frames in the same conversation (about 1 s long) in which only a single speaker is active, e.g. by utilizing the multi-channel concurrent speakers detector (MCCSD). The second is a speaker separation algorithm that employs a TCN module.
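The implementations themselves are documented in D3.4; the following PyTorch sketch only illustrates the general idea of reference-based extraction with a TCN. It is our illustration, not the SPRING code: the layer sizes, the learned encoder/decoder, and the multiplicative conditioning of the mixture features on a speaker embedding (in the spirit of SpeakerBeam/VoiceFilter-style extractors) are all assumptions.

```python
# Illustrative sketch only -- not the SPRING implementation.
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """One dilated 1-D convolutional block, the building unit of a TCN."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation),
            nn.PReLU(),
            nn.GroupNorm(1, channels),
        )

    def forward(self, x):            # x: (batch, channels, time)
        return x + self.net(x)       # residual connection

class ReferenceConditionedExtractor(nn.Module):
    """Masks a mixture's features using a speaker embedding computed
    from a ~1 s reference utterance (multiplicative conditioning,
    one common choice; the actual conditioning scheme may differ)."""
    def __init__(self, feat_dim=256, emb_dim=256, n_blocks=8):
        super().__init__()
        self.encoder = nn.Conv1d(1, feat_dim, kernel_size=16, stride=8)
        self.spk_proj = nn.Linear(emb_dim, feat_dim)
        self.tcn = nn.Sequential(*[TCNBlock(feat_dim, 2 ** b)
                                   for b in range(n_blocks)])
        self.mask = nn.Conv1d(feat_dim, feat_dim, kernel_size=1)
        self.decoder = nn.ConvTranspose1d(feat_dim, 1, kernel_size=16, stride=8)

    def forward(self, mixture, spk_emb):
        # mixture: (batch, 1, samples); spk_emb: (batch, emb_dim)
        feats = self.encoder(mixture)
        feats = feats * self.spk_proj(spk_emb).unsqueeze(-1)  # inject identity
        mask = torch.sigmoid(self.mask(self.tcn(feats)))
        return self.decoder(feats * mask)                     # desired speaker
```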
Both algorithms are implemented in Python. The extraction algorithm achieves substantial improvements in both the scale-invariant signal-to-distortion ratio (SI-SDR) and short-time objective intelligibility (STOI) measures at medium-to-high reverberation levels. The separation algorithm was tested with NVIDIA’s ASR system and provided significant improvements at medium reverberation levels, T60 ≈ 350 ms, and low noise levels. The speaker overlap was randomly chosen between 25% and 50%, as scenarios with full overlap between speakers are not considered realistic. We also tested the algorithm with a moving ReSpeaker sound card, which is not embedded in ARI. Again, significant ASR improvements were demonstrated in a low-noise environment.
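For reference, the SI-SDR measure quoted above has a standard closed-form definition (Le Roux et al., 2019), computed by the short NumPy function below. This is a generic reference implementation, not code from the deliverable; STOI is more involved, and an off-the-shelf package such as pystoi is typically used for it.

```python
import numpy as np

def si_sdr(estimate: np.ndarray, target: np.ndarray) -> float:
    """Scale-invariant SDR in dB. The target is rescaled by the optimal
    gain alpha, so the metric ignores the overall level of the estimate."""
    target = target - target.mean()
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, target) / np.dot(target, target)
    projection = alpha * target      # component aligned with the target
    noise = estimate - projection    # everything else: distortion + artifacts
    return 10.0 * np.log10(np.dot(projection, projection) / np.dot(noise, noise))
```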
We note that in the current audio processing architecture, two independent audio streams are simultaneously transcribed by the ASR system and transmitted to the dialogue system. To circumvent the speaker permutation phenomenon typical of separation algorithms, we will apply a speaker identification module to the separated outputs and thus preserve the time consistency of the transcribed speech signals. For the extraction algorithm, we may also use the identity determined by the reference speaker; for the separation algorithm, we may use the voice activity detection (VAD) decisions.
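A minimal sketch of such a consistency step is shown below; the report does not specify the concrete module, so the use of speaker embeddings and an optimal assignment is our assumption. Given embeddings of the separated streams and of the enrolled (or previously observed) speakers, an assignment by cosine similarity keeps identities stable over time:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_streams(stream_embs: np.ndarray, enrolled_embs: np.ndarray):
    """Map each separated stream to an enrolled speaker identity.

    stream_embs:   (n_streams, emb_dim) embeddings of the current outputs
    enrolled_embs: (n_speakers, emb_dim) embeddings of known speakers
    Returns, for each stream index, the best-matching speaker index,
    so the transcribed streams keep a consistent identity over time.
    """
    a = stream_embs / np.linalg.norm(stream_embs, axis=1, keepdims=True)
    b = enrolled_embs / np.linalg.norm(enrolled_embs, axis=1, keepdims=True)
    cost = -a @ b.T                      # negative cosine similarity
    streams, speakers = linear_sum_assignment(cost)
    return dict(zip(streams, speakers))
```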
The operation paradigm of the dialogue system with two concurrent transcriptions has yet to be determined.