Reproducing On-Device Data Accurately for Private-by-Design Voice Control
Senior Signal Processing and Machine Learning Scientist, Sonos Voice Experience
PhD student at École polytechnique fédérale de Lausanne
Speech recognition has made incredible progress in recent years. Its recent adoption across a wide range of applications, devices, and settings is a testament to this achievement. Smart devices, such as the Amazon Echo, Google Home, and all voice assistant-enabled Sonos speakers, have played a huge role in this surge of voice assistant usage, embedding this natural form of interaction within our homes and our daily lives.
Sonos Voice Control (SVC) allows users to control their music and their Sonos system, placing speed, accuracy and privacy on an equal footing. A core technology behind this system is supervised machine learning. Training supervised machine learning models requires vast amounts of data, which can easily amount to tens of thousands of hours of audio from hundreds of different people, all of which needs to be labeled.
Near-field audio data, e.g. recorded when speaking into a headset, is abundantly available online in corpora such as TIMIT, LibriSpeech, and VCTK, and can also be collected fairly easily through crowd-sourcing platforms. In the context of SVC, we need to be able to handle far-field conditions, as people are more likely to talk to their smart devices from across the room. On top of that, the speech signals captured at the microphones are often perturbed by interfering noises and by the content played by the device itself.
Having data that integrates all these elements and accurately reproduces audio signals as they appear on device is of the utmost importance to avoid a mismatch between training and real-life situations. Recording actual audio data directly from users’ devices raises serious privacy concerns. Getting physical measurements in realistic conditions, even though possible, is highly impractical: it does not scale across content, speakers, or acoustic environments, and it would need to be repeated for new devices and languages.
In this post, we present our pipeline for augmenting crowd-sourced near-field recordings to reproduce, as accurately as possible, the audio signal as it is handed over to the SVC pipeline.
Reproducing on-device data accurately
Smart devices need to do two things: play audio content and capture audio input. They do so through their loudspeakers and microphones, respectively. A major difficulty is that, during playback, the two interfaces are coupled: the microphones pick up the signal played by the device's own loudspeakers. This is what we call self-sound.
A general diagram of the whole data pipeline is shown in Fig. 2, showing all the different components that need to be reproduced to obtain realistic input audio signals for SVC as they would appear on actual devices.
The whole process can be summarized as the presence of three different types of sound sources: self-sound coming from the device itself, external noises, and the speech command of interest. These three sound sources all add up at the microphones and are then processed by an Audio Front-End (AFE) before being passed on to the SVC pipeline. In the following, we will see what happens inside each of the blocks outlined in the diagram in Fig. 2.
What do we need?
In general, when capturing audio using a microphone within a room, the recorded signals are corrupted by acoustic reflections from the walls and objects within the surrounding space. Taken together, these reflections, which appear as delayed and attenuated copies of the original signal, are what we call reverberation. Assuming the process is linear and time-invariant, reverberation can be characterized using Room Impulse Responses (RIRs).
The RIR captures most sound propagation properties between two points in a room. Contrary to what its name suggests, there is not a single RIR to characterize an entire room. Rather, there is a unique RIR for each pair of source (e.g. loudspeaker, person talking) and receiver (microphone) positions, and it changes as objects move around the room.
By convolving an audio file with an RIR, we are able to simulate the propagation of this signal between the two points of the RIR (see example in Fig. 3). This essentially means we can recreate how an audio recording would sound in a particular room.
In order to do so, we need the original audio recordings to be anechoic, or free from any echo or reverberation, as well as free from external noise. The anechoic condition is rarely met, as it requires the original files to be recorded in an anechoic chamber. In practice, a minimal amount of background noise and reverberation is considered acceptable, and this can be achieved with a standard headset or recording with a microphone less than 10 cm away.
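The convolution step described above can be sketched in a few lines. This is a minimal illustration, not the production code: the function name `apply_rir`, the toy two-tap RIR, and the 16 kHz sample rate are assumptions made for the example.

```python
import numpy as np

def apply_rir(dry, rir):
    """Simulate propagation by convolving a dry (near-anechoic) signal with an RIR."""
    # Full linear convolution: the output is longer than the input because
    # the reverberant tail keeps ringing after the dry signal has ended.
    return np.convolve(dry, rir, mode="full")

# Toy RIR (assumed for illustration): a direct path at 0 ms plus a single
# reflection at half the amplitude arriving 10 ms later.
fs = 16000
delay = fs // 100  # 10 ms expressed in samples
rir = np.zeros(delay + 1)
rir[0] = 1.0
rir[delay] = 0.5

# One second of white noise as a stand-in for a near-field recording.
dry = np.random.default_rng(0).standard_normal(fs)
wet = apply_rir(dry, rir)
```

For long recordings and realistic RIRs (thousands of taps), an FFT-based convolution such as `scipy.signal.fftconvolve` computes the same result much faster.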
We can thus model the signal captured at each microphone of a Sonos device as the sum of all sound sources present within the room after they have undergone acoustic propagation by applying the respective RIRs. In the example shown in Fig. 2, the sound sources include: the two loudspeakers of the device (when playback is active), the interfering microwave and vacuum cleaner, and the user issuing the voice command.
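To make this signal model concrete, here is a small sketch of the "sum of propagated sources" at each microphone. The function name and the nested-list layout of the RIRs are hypothetical choices for the example; the only thing taken from the text is that each microphone signal is the sum, over all sources, of the source signal convolved with the corresponding RIR.

```python
import numpy as np

def simulate_microphones(sources, rirs):
    """Sum all propagated sources at each microphone.

    sources: list of 1-D arrays, one dry signal per sound source.
    rirs:    rirs[s][m] is the 1-D RIR from source s to microphone m.
    Returns an array of shape (n_mics, n_samples).
    """
    n_mics = len(rirs[0])
    # The longest convolution result determines the output length.
    n_out = max(len(x) + len(h) - 1
                for x, hs in zip(sources, rirs) for h in hs)
    mics = np.zeros((n_mics, n_out))
    for x, hs in zip(sources, rirs):
        for m, h in enumerate(hs):
            y = np.convolve(x, h)      # propagate source to microphone m
            mics[m, :len(y)] += y      # superpose at the microphone
    return mics
```

In the scene described above, `sources` would hold the two loudspeaker signals, the two noise signals, and the voice command, in that or any other fixed order.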
Once audio signals have been captured at the microphones, they are processed by a module called the Audio Front-End (AFE). This term encompasses all the processing applied to the microphone signals in order to remove as much reverberation and unwanted signal as possible, before passing on the audio to the downstream tasks such as wakeword detection or speech recognition.
The AFE developed at Sonos comprises two main components: (1) a Multi-Channel Acoustic Echo Canceller, which aims to remove the self-sound picked up at the microphones during device playback using the knowledge of the signals sent to the loudspeakers, and (2) a Multi-Channel Wiener Filter, whose goal is to reduce the noise component and enhance the speech.
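To give some intuition for how knowing the loudspeaker signal helps remove self-sound, here is a textbook single-channel NLMS (normalized least mean squares) adaptive filter. This is explicitly not the Sonos AFE, which is a multi-channel algorithm; it is a deliberately simplified stand-in, and the function name, filter length, and step size are assumptions chosen for the example.

```python
import numpy as np

def nlms_echo_canceller(mic, ref, filter_len=16, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the echo of `ref` from `mic`.

    mic: microphone signal (echo plus anything else in the room).
    ref: reference signal sent to the loudspeaker.
    Returns the residual signal and the learned echo-path estimate.
    """
    w = np.zeros(filter_len)      # estimate of the echo path
    buf = np.zeros(filter_len)    # most recent reference samples, newest first
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = ref[n]
        echo_est = w @ buf                      # predicted self-sound
        e = mic[n] - echo_est                   # residual after cancellation
        w += mu * e * buf / (buf @ buf + eps)   # normalized gradient step
        out[n] = e
    return out, w
```

In a noise-free toy setup where the microphone only picks up the echo, the filter converges to the true echo path and the residual decays towards zero. In realistic conditions some residual interference always remains, which is why the training data has to go through the actual AFE.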
All in all, the implemented AFE is robust to a wide range of acoustic conditions, lightweight, agnostic to the downstream models, and easy to adapt to different devices with minimal tuning.
The playback processing component, shown at the beginning of the data pipeline in Fig. 2, transforms any audio content selected by the user for playback (e.g. a stereo piece of music or a movie in surround format on a soundbar) into the signals sent to the loudspeakers.
This transformation can include either down- or up-mixing depending on whether the target device is an all-in-one device like the Sonos One or a soundbar with a very large number of loudspeakers like the Sonos Arc. On top of that, this processing handles audio effects such as equalization, limiting, compression, etc. to make the device sound as good as possible at all volumes when optimally placed in a room.
We can compute the exact signals sent to the loudspeakers using the actual code running on Sonos devices. Having access to this at training time is very important, especially for soundbars and heavily spatialized audio content. Since the loudspeaker signals are used by the AFE to remove self-sound, this ensures that the training data exhibits the proper distribution of residual interference at the output of the AFE.
What parameters can we play with?
To ensure our voice interface can handle all the realistic situations encountered in people’s homes, we need to define the dimensions along which the training data should vary. Indeed, the models have to work in what is commonly called the far-field setting, which encompasses a whole range of acoustic conditions, for example a speech command spoken in a living room with someone else talking at the same time, in the kitchen with the microwave running, or in a reverberant setting (lots of echo) like a bathroom. More concretely, our models need to be robust to:
Different rooms: The dimensions of a room, the furniture, and the materials inside it can drastically change how your voice sounds. As an extreme example, imagine speaking to a friend from opposite ends of a tunnel. There will be a lot of reverberation, and you probably won’t be able to understand everything your friend says. On the other hand, in a typical living room setting, there is much less reverberation and it’s significantly easier to understand each other. Far-field speech recognition systems need to be robust to these different room properties.
Varying user distance from the Sonos device: The distance between the user and the device has a strong impact on intelligibility, especially in reverberant settings. The closer the talker is, the more prominent their voice will be with respect to the reverberation; the further away the talker is, the more prominent the reverberation becomes. We call this relationship between the direct path (the sound travelling straight from the talker to the microphone) and the reverberation the direct-to-reverberant ratio (DRR).
Voice levels: Closely tied to the distance between the user and the Sonos device, the level of the user’s voice at the microphones will not stay constant each time they talk. Sometimes the talker may speak softly, other times they may talk loudly. We quantify this loudness with the sound pressure level (SPL) at the microphones.
Voice characteristics: Voice characteristics such as pitch, prosody, and accent vary from person to person. Even though we try to cover these aspects directly by collecting data that spans these dimensions, we artificially increase the diversity in our data by applying voice modification techniques such as pitch shifting, time stretching, and compression.
Noise types: By external noise, we refer to any sound source within the room except the user talking to the device. This can mean other people talking in the room, background speech coming from a TV, the hum of a fan or AC, household appliances, cars outside, etc.
Noise levels: Similar to how the user’s voice will change in volume, so will the noise’s volume. We vary its volume with respect to the volume of the user's voice, namely by adjusting the Signal-to-Noise Ratio (SNR). The SNR is typically measured in decibels (dB); an SNR of 0 dB means the signal and noise are at the same level, while an SNR of 6 dB means the signal has about twice the amplitude of the noise. Every further factor of two in amplitude adds about 6 dB.
Playback content types: All types of playback content can be used to produce self-sound. We need to consider anything that can be played out of a Sonos device, e.g. music, podcasts, TV shows, movies. Depending on whether the target device is an all-in-one device like the Sonos One or a soundbar like the Sonos Arc, the distribution of content type needs to be adjusted.
Playback levels: Similar to the external noise, self-sound can take on different levels. Furthermore, the processing applied to playback signals before they are sent to the loudspeakers depends on the volume. We quantify the self-sound level using the Signal-to-Echo Ratio (SER), where echo refers to the self-sound.
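Two of the quantities in the list above are easy to illustrate numerically: the decibel arithmetic behind the SNR and SER, and the DRR of a given RIR. The sketch below is only illustrative. In particular, the 2.5 ms window around the strongest peak used to isolate the direct path is a common convention that we assume here, and the function names are made up for the example.

```python
import numpy as np

def amplitude_ratio_to_db(ratio):
    """Convert an amplitude ratio to decibels (a factor of 2 is about 6 dB)."""
    return 20 * np.log10(ratio)

def direct_to_reverberant_ratio_db(rir, fs, window_ms=2.5):
    """DRR of an RIR: energy around the direct path vs. energy everywhere else."""
    peak = int(np.argmax(np.abs(rir)))          # direct path = strongest tap
    half = int(round(window_ms * 1e-3 * fs))    # half-width of the direct window
    start, end = max(peak - half, 0), peak + half + 1
    direct = np.sum(rir[start:end] ** 2)
    reverb = np.sum(rir[:start] ** 2) + np.sum(rir[end:] ** 2)
    return 10 * np.log10(direct / reverb)

# Toy RIR (assumed): direct path plus one reflection at half the amplitude.
fs = 16000
toy_rir = np.zeros(2000)
toy_rir[0] = 1.0
toy_rir[1000] = 0.5
drr_db = direct_to_reverberant_ratio_db(toy_rir, fs)
```

With this toy RIR, the reflection carries a quarter of the direct-path energy, so the DRR is 10·log10(4), roughly 6 dB. Moving the talker further away lowers the direct-path amplitude and therefore the DRR.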
Being robust to everything above as well as different vocabulary / pronunciations necessitates a much larger dataset than what is required for near-field speech recognition. The “brute force” way of collecting such data would be to go into thousands of houses, ask people to say different commands from various locations, and record the data at the device. Then we would need to generate different types of external noises and self-sound at different levels, and ask the same people to say more commands. Finally, after collecting all this data we would have to transcribe it manually! It would be a very arduous, expensive and time-consuming process and certainly not scalable.
Another approach is to store the recordings of the devices used by real consumers and transcribe the resulting in-domain data, which avoids the tedious process of physically going to different homes and has the advantage of collecting data directly from the field. However, this is extremely invasive of users' privacy and poses ethical questions.
The approach we have taken is to start from a dataset that would be used for training a near-field speech recognition system, and generate through simulations a dataset for far-field recognition that takes into account the different conditions listed above. Through simulation, we can avoid the tedious process of recording and labeling (assuming the original dataset is already labeled). Changing data distributions, adding or updating content such as rooms, noise types, etc. is also made straightforward. Finally, we can respect the privacy of our users by never storing nor listening to recordings made by their smart devices.
Modeling Room Acoustics
As explained above, modeling the acoustic propagation of sound sources within rooms is a key component in reproducing on-device data accurately. Namely, we need to be able to get RIRs between all sound sources and the device microphones.
There are two ways to obtain RIRs, and choosing between them is a trade-off between accuracy and scalability:
Measuring them from real rooms.
Simulating them using a model of the room and of sound propagation.
When considering room simulation by itself, namely outside of the constraints of data augmentation for speech-to-text, using measured RIRs would be preferred. This is because modeling the exact physical properties of a room and of sound propagation is a very complex and time-consuming procedure, whereas simply measuring the RIR removes the need to model all the fine details of a specific room. However, with measured RIRs, we face the same scalability problem as with collecting far-field data for training. That is to say, we would have to measure these RIRs in thousands of different room setups in order to obtain the variability that is necessary for robustness to various acoustic conditions. Although measuring RIRs requires much less time than recording the data itself, having access to this large set of rooms and setting up the measurement equipment remains a significant bottleneck.
Therefore, to cater to the high variability needs in the training data, simulation is an attractive and scalable alternative. The difficulty that arises when simulating RIRs is how to properly model the room and sound propagation within it.
There are generally two approaches to RIR simulation: geometric and wave-based. Geometric approaches make simplifying assumptions about sound propagation, modeling sound as individual rays in the room. This allows for faster simulation, but at the expense of accuracy since the assumptions made are only theoretically valid at higher frequencies. Wave-based approaches, on the other hand, attempt to solve the wave equation numerically, which can be very expensive depending on how fine-grained the simulation is. They provide much more accurate simulations but have a high computational load and boundary conditions remain a problem.
In the following, we will provide some intuition about geometric approaches, as these are much faster to compute and allow us to scale our data generation very well.
Building the rooms
The first step in simulating RIRs is to model the room. This boils down to defining the shape and dimensions of the room, the furniture inside, and the materials of all the surfaces (walls and furniture). Different materials have different properties that affect how sound interacts with them when they come into contact. On a high level, two things can happen to sound when it comes in contact with a material:
It can be absorbed by the material.
It can be reflected by the material. There are two types of reflections: specular and scattered, which are governed by the smoothness of the surface. You can view specular reflections as mirror-like, whereas scattered reflections emit in all directions.
Thus, defining the materials of all the reflective surfaces within the room amounts to defining their absorption and scattering coefficients. These can be set either by randomly sampling materials from available data or by setting their values to match desired acoustic properties such as the reverberation time of the room.
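As an illustration of the second option, one classical way to go from a target reverberation time to an absorption coefficient is Sabine's formula, RT60 ≈ 0.161 · V / (S · α), with V the room volume in m³, S the total surface area in m², and α the average absorption coefficient. The post does not specify which formula is used in practice, so treat this as one possible choice; the function name is hypothetical.

```python
def sabine_absorption(room_dim, rt60):
    """Uniform absorption coefficient giving a target RT60 in a shoebox room.

    room_dim: (length, width, height) in metres.
    rt60:     desired reverberation time in seconds.
    """
    lx, ly, lz = room_dim
    volume = lx * ly * lz
    surface = 2 * (lx * ly + lx * lz + ly * lz)
    # Invert Sabine's formula RT60 = 0.161 * V / (S * alpha).
    alpha = 0.161 * volume / (surface * rt60)
    if not 0.0 < alpha <= 1.0:
        raise ValueError("target RT60 is not reachable in this room")
    return alpha

# A 5 m x 4 m x 3 m room with a target RT60 of 0.5 s.
alpha = sabine_absorption((5.0, 4.0, 3.0), 0.5)
```

Here the volume is 60 m³ and the surface area 94 m², which gives an average absorption coefficient of about 0.21. Very short reverberation times in large rooms would require α above 1, which is not physical, hence the check.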
Once the room has been defined, a virtual device is placed inside it. Virtual sources are placed where the device loudspeakers are in order to simulate the self-sound RIRs. All other virtual sound sources, i.e. the user uttering the voice command and the noise sources, are then placed relative to the device and according to realistic distance and position constraints.
Shoebox rooms and the Image Source Method
The Image Source Method (ISM) is a very popular technique to simulate RIRs since it is conceptually simple and very fast for shoebox rooms (i.e. cuboid rooms). Originally described by Allen and Berkley (1979), ISM is based on the idea that reflections off walls can be modeled through virtual image sources placed symmetrically on the other side of the walls, with the length of the acoustic propagation path being the same.
Each image source is then also mirrored, creating an infinite regular lattice of sources. Allen and Berkley proved this was equivalent to the solution to the wave equation for a rectangular room with rigid walls (i.e. perfectly reflecting walls). While this no longer holds for the more general case with non-rigid walls, it is widely accepted as a very good approximation.
Let’s look at a concrete example below. For the sake of visualization, we will consider a 2D room, which can be extended to 3D in a straightforward manner.
Each time we mirror a source across a wall, we are essentially modeling a reflection against that very wall. So by continuously mirroring the target speaker and its virtual sources, we can perfectly model all the specular reflections up to a desired number of reflections. Using the length of the path, we can place these reflections at the appropriate timestamp. Moreover, the distance and the walls intersected will affect the amplitude, due to the inverse distance law and energy absorption by the walls. ISM is deterministic in the sense that all rays contribute to the resulting RIR. Thanks to the rectangular geometry of shoebox rooms, we have a regular lattice of sources meaning the positions of the image sources are readily available and the RIR can be computed with a complexity in O(R^3) with R the maximum reflection order.
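The mirroring procedure above can be sketched for a 3D shoebox room as follows. This is a simplified illustration under several assumptions that are ours, not the post's: a single frequency-independent reflection coefficient `beta` shared by all walls, delays rounded to the nearest sample (real simulators use fractional-delay filters), and a 1/(4πd) distance attenuation. Along each axis, with walls at 0 and L, the images of a source at x sit at 2nL + x (after |2n| reflections) and 2nL − x (after |2n − 1| reflections) for integer n.

```python
import itertools
import numpy as np

def axis_images(x, length, max_order):
    """Image coordinates along one axis, with the number of reflections each needs."""
    images = []
    n_max = max_order // 2 + 1
    for n in range(-n_max, n_max + 1):
        for coord, order in ((2 * n * length + x, abs(2 * n)),
                             (2 * n * length - x, abs(2 * n - 1))):
            if order <= max_order:
                images.append((coord, order))
    return images

def shoebox_rir(room_dim, source, mic, max_order=10, beta=0.8, fs=16000, c=343.0):
    """RIR of a shoebox room using image sources up to `max_order` reflections."""
    per_axis = [axis_images(s, L, max_order) for s, L in zip(source, room_dim)]
    taps = []
    for combo in itertools.product(*per_axis):
        order = sum(o for _, o in combo)       # total number of wall reflections
        if order > max_order:
            continue
        image = np.array([p for p, _ in combo])
        dist = np.linalg.norm(image - np.asarray(mic, dtype=float))
        # Delay from the path length, amplitude from distance and wall losses.
        taps.append((dist / c, beta ** order / (4 * np.pi * dist)))
    rir = np.zeros(int(np.ceil(max(t for t, _ in taps) * fs)) + 1)
    for delay, amp in taps:
        rir[int(round(delay * fs))] += amp
    return rir

rir = shoebox_rir((5.0, 4.0, 3.0), source=(1.0, 1.0, 1.0), mic=(3.0, 2.0, 1.0),
                  max_order=3)
```

The triple product over the three axes is where the cubic growth in the number of image sources with the reflection order comes from.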
Arbitrary room shapes and hybrid ray-tracing techniques
With arbitrarily-shaped rooms, the ISM becomes a lot more expensive to compute. Indeed, the set of valid image sources and their positions is not trivial to obtain and an expensive validity test is required for each image source. Since each image source is the last in a chain of image sources, the validity test must be started at the receiver point and traced back up to the source. This brings the complexity of the ISM for arbitrary room shapes with N reflective surfaces to O(N^R). Even with simple polyhedra, it becomes completely impractical to simulate reflections of order greater than 6 or 7, making it impossible to capture the late reverberation properly.
This is where Stochastic Ray Tracing (SRT) comes into play: rather than trying to find all reflection paths deterministically, a large number of rays are sampled at the source and reflections are propagated within the room until one of three conditions is met: a ray intersects with a receiver, a ray’s energy falls below a threshold, or a maximum travel time has been reached. Not all rays in such stochastic approaches contribute to the RIR, but it is much more efficient for modeling high order reflections.
Another advantage of SRT compared to ISM is the ability to model not only specular reflections but also scattered ones, via the diffuse rain technique. Finally, ISM and SRT are combined into a hybrid approach, ensuring the late and diffuse reverberation is properly simulated while retaining the accurate early specular reflections from the ISM.
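The three stopping conditions of stochastic ray tracing can be illustrated with a toy tracer. The sketch below makes strong simplifying assumptions that are ours: a 2D rectangular room, purely specular reflections (no diffuse rain), a circular receiver, and a single absorption coefficient for all walls. All names and default values are hypothetical.

```python
import numpy as np

def trace_rays_2d(room, src, rcv, n_rays=2000, rcv_radius=0.3, absorption=0.3,
                  max_time=0.5, min_energy=1e-6, fs=16000, c=343.0, seed=0):
    """Energy histogram (one bin per sample) of rays reaching a circular receiver."""
    rng = np.random.default_rng(seed)
    room, rcv = np.asarray(room, dtype=float), np.asarray(rcv, dtype=float)
    hist = np.zeros(int(max_time * fs) + 1)
    for angle in rng.uniform(0.0, 2.0 * np.pi, n_rays):
        pos = np.array(src, dtype=float)
        d = np.array([np.cos(angle), np.sin(angle)])
        energy, travelled = 1.0 / n_rays, 0.0
        while energy > min_energy:      # stop: energy fell below the threshold
            # Distance along the ray to the next wall on each axis.
            with np.errstate(divide="ignore", invalid="ignore"):
                t_wall = np.where(d > 0, (room - pos) / d,
                                  np.where(d < 0, -pos / d, np.inf))
            axis = int(np.argmin(t_wall))
            seg_len = t_wall[axis]
            # Closest approach to the receiver centre, if it lies on this segment.
            s = float(np.dot(rcv - pos, d))
            if 0.0 <= s <= seg_len and np.linalg.norm(pos + s * d - rcv) <= rcv_radius:
                t = (travelled + s) / c
                if t < max_time:
                    hist[int(t * fs)] += energy
                break                   # stop: the ray reached the receiver
            pos = pos + seg_len * d
            travelled += seg_len
            if travelled / c > max_time:
                break                   # stop: maximum travel time exceeded
            d[axis] = -d[axis]          # specular reflection off the wall
            energy *= 1.0 - absorption  # part of the energy is absorbed
    return hist

hist = trace_rays_2d(room=(5.0, 4.0), src=(1.0, 1.0), rcv=(3.0, 2.0))
```

Most rays never reach the receiver, which matches the remark above that not all rays contribute to the result. The output is an energy envelope over time rather than a signed impulse response, which is why such methods are typically used for the late reverberation only.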
Putting it all together
We now have all the tools needed to accurately reproduce the audio data as it appears on Sonos devices. Below we give a breakdown of the whole procedure:
Randomly generate a room, place the device and all sound sources within it, and simulate the RIRs from all sources to all microphones.
Simulate the speech command in the room by convolving it with the appropriate RIRs.
Simulate any external noise source(s) in the room by convolving them with the appropriate RIRs.
Simulate self-sound in the room by applying the playback processing to the desired content and convolving the resulting signals with the appropriate RIRs.
Set the SPL of the simulated speech command.
Rescale the simulated noise(s) according to the desired SNR(s) with respect to the rescaled and simulated speech command.
Rescale the simulated self-sound according to the desired SER with respect to the rescaled and simulated speech command.
Add together the rescaled and simulated speech command, noise(s), and self-sound to create the mixture signal at the microphones.
Pass the mixture and the loudspeaker signals through the AFE to obtain the enhanced single-channel signal given to the SVC pipeline.
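The level-setting and mixing steps of this procedure can be sketched as follows. The inputs are assumed to be the already-propagated speech, noise, and self-sound signals, all with the same shape (for instance microphones × samples). Two simplifications are ours: levels are RMS values in dB relative to a digital amplitude of 1.0 (mapping them to an actual SPL would require the microphone calibration), and levels are computed over the whole signal. The function names are hypothetical.

```python
import numpy as np

def rms_db(x):
    """RMS level in dB relative to a digital amplitude of 1.0."""
    return 20 * np.log10(np.sqrt(np.mean(x ** 2)))

def scale_to_level(x, target_db):
    """Rescale x so that its RMS level equals target_db."""
    return x * 10 ** ((target_db - rms_db(x)) / 20)

def scale_to_ratio(interference, reference, ratio_db):
    """Rescale interference so that level(reference) - level(interference) = ratio_db."""
    return scale_to_level(interference, rms_db(reference) - ratio_db)

def mix(speech, noise, self_sound, speech_db, snr_db, ser_db):
    """Set the speech level, then the noise (SNR) and self-sound (SER), and sum."""
    speech = scale_to_level(speech, speech_db)
    noise = scale_to_ratio(noise, speech, snr_db)
    self_sound = scale_to_ratio(self_sound, speech, ser_db)
    mixture = speech + noise + self_sound
    return mixture, (speech, noise, self_sound)
```

Note that the order matters: the noise and the self-sound are rescaled relative to the speech after its level has been set, as in the procedure above. A negative SER simply means that the self-sound is louder than the speech at the microphones.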
With this procedure, there are many opportunities to increase the diversity of the resulting augmented audio data:
When generating the rooms, we can randomly sample different values for the shape and dimensions, the absorption and scattering properties of the walls, as well as the position of the user, noise source(s), and Sonos device.
When simulating external noise and self-sound, we can sample files from large representative datasets in order to simulate different types of far-field noise and playback content.
We can randomly sample the speech SPL, SNR, and SER values from realistic distributions expected to appear in the field.
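A scene sampler along these lines could look like the sketch below. Every numeric range in it is a made-up placeholder for illustration, not a value used in the actual pipeline, and the `SceneConfig` structure and its field names are hypothetical. In practice the distributions would be chosen to match what is expected in the field, and need not be uniform.

```python
import random
from dataclasses import dataclass

@dataclass
class SceneConfig:
    room_dim: tuple          # (length, width, height) in metres
    rt60_s: float            # target reverberation time in seconds
    user_distance_m: float   # distance between the talker and the device
    speech_level_db: float   # speech level at the microphones
    snr_db: float            # speech-to-noise ratio
    ser_db: float            # speech-to-self-sound (echo) ratio

def sample_scene(rng):
    """Draw one random scene; all ranges below are illustrative placeholders."""
    room_dim = (rng.uniform(3.0, 8.0), rng.uniform(3.0, 6.0), rng.uniform(2.4, 3.2))
    return SceneConfig(
        room_dim=room_dim,
        rt60_s=rng.uniform(0.2, 0.8),
        # Keep the talker inside the room: at most the shortest floor dimension.
        user_distance_m=rng.uniform(0.5, min(room_dim[0], room_dim[1])),
        speech_level_db=rng.uniform(-40.0, -20.0),
        snr_db=rng.uniform(0.0, 20.0),
        ser_db=rng.uniform(-30.0, 0.0),
    )

# A fixed seed makes the generated dataset reproducible.
scenes = [sample_scene(random.Random(seed)) for seed in range(3)]
```

Because each scene is fully described by a small configuration, changing a distribution or adding a new room type only requires regenerating the data, with no new recordings.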
In this fashion, we can artificially create a large and varied dataset to train a robust far-field speech recognition system! There are of course some disadvantages to this method, as the quest for realism of the simulations is never-ending and there is always a trade-off to choose between accuracy and scalability. Nonetheless, this scalable data augmentation pipeline allows us to train successful speech recognition models while at the same time respecting the privacy of users.
 John S. Garofolo, L. F. Lamel, W. M. Fisher, Jonathan G. Fiscus, D. S. Pallett, Nancy L. Dahlgren, et al. “DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM”. NIST Inter-agency/Internal Report (NISTIR) 4930, 1993.
 Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. “Librispeech: an ASR corpus based on public domain audio books”. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5206–5210. IEEE, 2015.
 Junichi Yamagishi, Christophe Veaux, Kirsten MacDonald, et al. “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit”. University of Edinburgh. The Centre for Speech Technology Research (CSTR), 2019.
 Anirudh Raju, Sankaran Panchapagesan, Xing Liu, Arindam Mandal, and Nikko Strom. “Data augmentation for robust keyword spotting under playback interference”. arXiv preprint arXiv:1808.00563, 2018.
 Kuba Łopatka, Katarzyna Kaszuba-Miotke, Piotr Klinke, and Paweł Trella. “Device playback augmentation with echo cancellation for keyword spotting”. In Proceedings of Interspeech, pages 4383–4387, 2021.
 Heinrich Kuttruff. “Room acoustics”. 2009.
 Saeed Bagheri and Daniele Giacobello. “Robust STFT domain multi-channel acoustic echo cancellation with adaptive decorrelation of the reference signals”. In 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135. IEEE, 2021.
 Saeed Bagheri and Daniele Giacobello. “Exploiting multi-channel speech presence probability in parametric multi-channel wiener filter”. In Proceedings of Interspeech, pages 101–105, 2019.
 James Eaton, Nikolay D Gaubitch, Alastair H Moore, and Patrick A Naylor. “Estimation of room acoustic parameters: The ACE challenge”. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(10):1681–1693, 2016.
 Michael Vorländer. “Auralization”. 2020.
 Jont B Allen and David A Berkley. “Image method for efficiently simulating small-room acoustics”. The Journal of the Acoustical Society of America, 65(4):943–950, 1979.
 Enzo De Sena, Niccolo Antonello, Marc Moonen, and Toon Van Waterschoot. “On the modeling of rectangular geometries in room acoustic simulations”. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(4):774–786, 2015.
 Jeffrey Borish. “Extension of the image model to arbitrary polyhedra”. The Journal of the Acoustical Society of America, 75(6):1827–1836, 1984.
 Dirk Schröder. “Physically based real-time auralization of interactive virtual environments”. PhD dissertation, RWTH Aachen, 2011.
 Eric Bezzam, Robin Scheibler, Cyril Cadoux, and Thibault Gisselbrecht. “A study on more realistic room simulation for far-field keyword spotting”. In 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 674–680. IEEE, 2020.