Sonos at ICASSP 2021
Principal Audio Research Engineer, Advanced Technology
Head of Machine Learning Research, Sonos Voice Experience
Senior Machine Learning Engineer, Sonos Voice Experience
Senior Manager, Machine Learning, Sonos Voice Experience
Principal Software Engineer, Sonos Voice Experience
The 46th edition of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) will start next week on June 6, 2021. This conference, one of the most prominent academic events in signal processing, will be held virtually again this year (see a previous blog post on how to make the most out of virtual conferences). It brings together researchers from academia and industry from all over the world to discuss the latest theoretical and applied achievements across the field.
The Advanced Technology and Voice Experience teams at Sonos attend this conference every year and we are pleased to announce that the two scientific papers we submitted for publication have been accepted to the 2021 edition. We are excited to share our most recent work on acoustic echo cancellation and speaker verification with the community. If you’re attending ICASSP as well, don’t hesitate to reach out and say hi! The session times for our two presentations are displayed below.
In this post, we will give a brief overview of these two papers and share some resources allowing you to dig further.
Robust STFT domain multi-channel acoustic echo cancellation with adaptive decorrelation of the reference signals
By Saeed Bagheri & Daniele Giacobello
An acoustic echo cancellation (AEC) system is generally required to perform voice control on music playback devices such as smart speakers, where the coupling of closely spaced loudspeakers and microphones can create very challenging speech-to-echo ratios. The goal is to remove the acoustic signal due to the music/video playback captured by the microphones.
When a multi-channel (MC) speaker setup is deployed, either as multiple spatially distributed loudspeakers (e.g., a 5.1 surround system) or as one device equipped with several loudspeakers (e.g., a soundbar), the loudspeaker-driving signals (i.e., the reference signals) are typically highly correlated. For P loudspeaker channels, implementing P independent adaptive filters in parallel, based on single-channel AEC techniques, suffers from the so-called non-uniqueness problem: because the references are correlated, many different filter combinations cancel the echo equally well, so the adaptive filters need not converge to the true echo paths.
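The non-uniqueness problem can be illustrated with a toy example. The Python sketch below assumes two loudspeakers with single-tap echo paths and a second reference that is an exact scaled copy of the first; all signal values, gains, and names are made up for illustration and are not taken from the paper.

```python
def mixed_echo(ref1, ref2, gain1, gain2):
    """Echo at the microphone for two single-tap echo paths (toy model)."""
    return [gain1 * a + gain2 * b for a, b in zip(ref1, ref2)]


def max_residual(mic, estimate):
    """Largest absolute difference between microphone signal and echo estimate."""
    return max(abs(m - e) for m, e in zip(mic, estimate))


# Hypothetical reference signals: channel 2 is a scaled copy of channel 1.
ref1 = [1.0, -2.0, 0.5, 3.0]
ref2_correlated = [0.5 * s for s in ref1]

true_paths = (0.8, 0.4)    # assumed true echo path gains
wrong_paths = (1.0, 0.0)   # a different pair with 1.0 + 0.5*0.0 == 0.8 + 0.5*0.4

# With fully correlated references, the wrong filters cancel the echo perfectly.
mic = mixed_echo(ref1, ref2_correlated, *true_paths)
residual_correlated = max_residual(mic, mixed_echo(ref1, ref2_correlated, *wrong_paths))

# As soon as the second reference changes independently, the wrong filters fail.
ref2_independent = [0.3, 1.0, -1.0, 0.2]
mic_new = mixed_echo(ref1, ref2_independent, *true_paths)
residual_independent = max_residual(mic_new, mixed_echo(ref1, ref2_independent, *wrong_paths))

print(residual_correlated, residual_independent)
```

In this toy setup the "wrong" filters look perfect as long as the channels stay correlated, and leave a clearly audible residual once the relationship between the channels changes, which is why decorrelating the references helps the filters find the true echo paths.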
Our objective in this work is to design a robust and scalable MCAEC algorithm with a low CPU and memory budget that is easy to deploy on different devices and loudspeaker configurations, with minimal parameter tuning and no calibration requirements. Meeting these requirements enables fast prototyping, testing, and deployment of the algorithm in new products.
AEC is a well-studied topic. However, the existing MCAEC solutions in the literature have mostly targeted hands-free voice communication, which is very different from a voice assistant application on a loudspeaker array. Reviewing the literature, we find two types of solutions to the non-uniqueness problem in MCAEC. The first type adds distortions to the loudspeaker signals to decorrelate the channels. While more recent solutions apply perceptually motivated criteria to reduce audible distortions, the results are still considered unacceptable for the type of high-fidelity (Hi-Fi) loudspeaker systems we are considering. Furthermore, these methods might interfere with the sound beamforming operations often present in this type of system. The second type is better suited to our application scenario and is based on applying decorrelation filters to the loudspeaker signals to speed up convergence. The idea is to adapt the filters using decorrelated reference signals, which makes these algorithms resilient to convergence issues and to the non-uniqueness problem. However, these methods require very high computational and memory resources, which are beyond our compute and memory budget (especially considering the 11 channels of the Sonos Arc).
In this work, we propose a novel algorithm for MCAEC that adaptively decorrelates the reference channels in the time domain. Reducing the channel correlation improves both the convergence speed and the robustness of the algorithm. The AEC operating on the decorrelated channels is applied directly in the STFT domain and builds on robust adaptive AEC algorithms, so that the adaptive filters can be updated reliably even when local speech is present. In the AEC implementation, we combine several ideas to improve the robustness of the method, reduce computational complexity, enhance echo cancellation, and improve the convergence rate. These ideas include frequency-domain adaptive filtering (FDAF); adaptive crossband filters to reduce the aliasing problem; the multi-delay filter (MDF), which reduces the processing delay by segmenting the FDAF into smaller blocks; an error recovery nonlinearity (ERN) that enhances the filter estimation error prior to the adaptation step; and an adaptive time-frequency dependent step size for continuous and stable adaptation of the cancellation filters without double-talk detection (DTD).
The combination of these methods offers several advantages for the MCAEC problem in smart speakers: it works at very low speech-to-music ratios, supports varying numbers of loudspeaker channels, and attempts to limit distortion of the target voice command. Moreover, the number of parameters to tune is kept very small, which helps the scalability of the solution.
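To make the basic adaptive filtering idea concrete, here is a much-simplified Python sketch of a single-channel, time-domain normalized LMS (NLMS) echo canceller. It is not the STFT-domain multi-channel algorithm described in the paper and omits decorrelation, crossband filters, MDF, ERN, and the adaptive step size; the function name, the fixed step size, the three-tap echo path, and the random reference signal are all assumptions chosen for illustration.

```python
import random


def nlms_echo_canceller(reference, microphone, num_taps, step_size=0.5, eps=1e-8):
    """Estimate the echo path with NLMS and return (weights, residual signal)."""
    weights = [0.0] * num_taps    # current estimate of the echo path
    history = [0.0] * num_taps    # most recent reference samples, newest first
    residual = []
    for ref_sample, mic_sample in zip(reference, microphone):
        history = [ref_sample] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = mic_sample - echo_estimate          # echo-cancelled output sample
        energy = sum(x * x for x in history) + eps  # normalization term
        scale = step_size * error / energy
        weights = [w + scale * x for w, x in zip(weights, history)]
        residual.append(error)
    return weights, residual


# Toy experiment: white-noise playback through an assumed three-tap echo path,
# with no local speech and no background noise at the microphone.
rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
true_path = [0.5, -0.3, 0.1]
microphone = [
    sum(h * reference[n - k] for k, h in enumerate(true_path) if n - k >= 0)
    for n in range(len(reference))
]

weights, residual = nlms_echo_canceller(reference, microphone, num_taps=3)
print([round(w, 4) for w in weights])
```

In this idealized setting the estimated weights approach the assumed echo path and the residual decays towards zero. Real devices face long echo paths, correlated channels, and simultaneous local speech, which is what motivates the additional components listed above.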
Presentation Time 1:
Tuesday, 8 June, 16:30 - 17:15 (Eastern Daylight Time)
Presentation Time 2:
Wednesday, 9 June, 04:30 - 05:15 (Eastern Daylight Time)
Small footprint Text-Independent Speaker Verification for Embedded Systems
By Julien Balian, Raffaele Tavarone, Mathieu Poumeyrol, and Alice Coucke
Speaker verification is the task of verifying someone’s identity based on the characteristics of their voice. This technology has received increasing attention in recent years, partially due to its application to voice assistants. Speaker verification makes it possible to customize the assistant’s responses for each user (e.g., “add an event to my calendar”).
This task is done in several steps. First, a neural network is trained to map voice signals to a low-dimensional space where “distances” are a direct measure of speaker similarity. The challenge is to make this representation, or embedding, invariant to everything that is not related to the speaker’s identity, such as background noise, reverberation, the words being pronounced, or the duration of the sentence, so that it performs well in real-world scenarios. Such techniques are well known in machine learning and are used, for instance, in face recognition from images or in language modelling. The personal voiceprint of a given user can then be derived from this representation during an “enrollment” phase in which they are asked to pronounce a few sentences. Finally, each time a user addresses a voice assistant, their query is compared to the set of recorded voiceprints in order to find the right match.
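The enrollment and matching steps can be sketched in a few lines of Python. This is a minimal illustration, assuming that embeddings are already available as plain lists of floats, that a voiceprint is the normalized average of the enrollment embeddings, and that matching uses cosine similarity with a fixed threshold. None of these choices, nor the function names or the 0.7 threshold, are taken from the paper.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine of the angle between two embeddings."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def enroll(embeddings):
    """Build a voiceprint by averaging the normalized enrollment embeddings."""
    normalized = [l2_normalize(e) for e in embeddings]
    dim = len(normalized[0])
    mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def identify(query, voiceprints, threshold=0.7):
    """Return the best-matching enrolled user, or None if no score reaches the threshold."""
    best_name, best_score = None, threshold
    for name, voiceprint in voiceprints.items():
        score = cosine_similarity(query, voiceprint)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Made-up three-dimensional embeddings for two hypothetical users.
voiceprints = {
    "alice": enroll([[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]]),
    "bob": enroll([[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]]),
}
print(identify([1.0, 0.05, 0.05], voiceprints))
print(identify([0.0, 0.0, 1.0], voiceprints))
```

The threshold controls the trade-off between accepting impostors and rejecting legitimate users, and in practice it would be tuned on held-out data.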
Keeping a small footprint is one of the biggest challenges of speaker verification, besides acoustic robustness. Although some speaker verification engines with low execution latency or the ability to run on mobile devices have been proposed, they remain too large for embedded applications where memory and computing power are further limited. In this work, we propose a system specifically tailored to embedded use cases. We budget CPU and memory resources to match those of typical wake word detection systems designed to run continuously and in real time on device. Voice being a highly sensitive biometric identifier, this lightweight approach could grant speaker verification abilities to small devices typical of IoT systems, such as Sonos speakers, while fully respecting the privacy of users.
We propose a two-stage model that produces fixed-size speaker embeddings. In the first stage, feature extraction, a streaming neural network inspired by the QuartzNet [1] architecture recently introduced for speech recognition takes as input an arbitrarily long time series of acoustic features and outputs another series of higher-level features. The second stage consists of an aggregator neural network, built upon the GVLAD [2] architecture, that aggregates the outputs of the feature extraction stage along the time dimension to build a fixed-size embedding of the audio signal (which becomes, after training, the voiceprint mentioned above). We bring some key modifications to these architectures, such as the inclusion of Max Feature Map operations and PReLU activations in the former and a more computationally efficient method for descriptor aggregation in the latter. This two-stage approach decouples streamed time-series feature extraction from aggregation and provides an optimal balance between representation quality and inference latency. Finally, we demonstrate that this lightweight system yields only a limited increase in Equal Error Rate (EER) on well-established benchmarks compared to state-of-the-art approaches (3.31% EER on the VoxCeleb1 verification test set and 7.47% EER on the VOiCES from a Distance 2019 Challenge), with a number of learnable parameters and operations that is orders of magnitude smaller.
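For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate (impostors accepted) equals the false rejection rate (genuine speakers rejected). The Python sketch below shows one simple way to approximate it from two lists of similarity scores. It is not the evaluation code used in the paper, and the function name and the example scores are made up.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by sweeping the decision threshold over all observed scores."""
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap, eer = float("inf"), None
    for threshold in thresholds:
        # A trial is accepted when its score is at or above the threshold.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Made-up similarity scores for same-speaker and different-speaker trials.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine, impostor))
```

With these toy scores, one genuine trial out of four is rejected and one impostor trial out of four is accepted at the balancing threshold. Evaluation toolkits usually interpolate between thresholds rather than picking the closest observed score, so their numbers can differ slightly.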
Presentation Time 1:
Wednesday, 9 June, 14:00 - 14:45 (Eastern Daylight Time)
Presentation Time 2:
Thursday, 10 June, 02:00 - 02:45 (Eastern Daylight Time)
[1] Samuel Kriman et al., "Quartznet: Deep automatic speech recognition with 1d time-channel separable convolutions," in ICASSP. IEEE, 2020, pp. 6124–6128.
[2] Weidi Xie, Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, "Utterance-level aggregation for speaker recognition in the wild," in ICASSP. IEEE, 2019, pp. 5791–5795.