Tech Blog
Machine Learning, Audio processing
July 28, 2022

Sonos at ICASSP 2022

Julien Balian

Senior Machine Learning Engineer, Sonos Voice Experience

Matt Benatan

Principal Audio Researcher, Advanced Technology

Wenyu Jin

Principal Audio Researcher, Advanced Technology

Adib Mehrabi

Director, Advanced Technology

The 2022 edition of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) in Singapore was the first physical edition of ICASSP since 2019, and Sonos were proud to be a platinum sponsor of this year's event.

In this article, we speak to some of the researchers from the Sonos Advanced Technology and Voice Experience teams who attended the conference. They share their selected highlights from, and responses to, ICASSP 2022.

What is your name, role and research focus area at Sonos?

Julien Balian: Hello, I am Julien Balian, Senior Machine Learning Engineer in the Sonos Voice Experience team. I work on on-device acoustic modeling for ASR, keyword spotting, and voice recognition.

Matt Benatan: I'm a Principal Audio Researcher. My focus is on the research and development of machine learning and signal processing techniques for new features and capabilities being developed by Sonos Advanced Technology's Context Awareness team.

Wenyu Jin: I am a Principal Audio Researcher in Advanced Technology at Sonos. My main research focus at Sonos is on 3D spatial sound rendering, and room acoustics modeling and estimation.

Adib Mehrabi: I lead the Advanced Rendering and Immersive Audio research group in Advanced Technology at Sonos. Our team is focussed on developing methods for spatial rendering, adaptive rendering, and immersive audio technologies.

From left to right: Adib Mehrabi, Julien Balian, Matt Benatan, Wenyu Jin

What were you presenting at ICASSP 2022?

AM: I gave an industry expert talk entitled “When signal processing meets user experience: how to turn a regular user into an audio systems engineer in 60 seconds”. This was co-presented with Dayn Wilberding, a colleague in the design team. We delved into the design and development of Trueplay, a feature that allows users to easily tune their Sonos system to their room acoustics and speaker placement. The talk focussed on how the design decisions for the user-facing feature were informed by balancing the objective of premium sound quality with that of a human-centric user experience.

WJ: I gave a presentation on the technical program paper “Individualized Hear-Through For Acoustic Transparency Using PCA-Based Sound Pressure Estimation At The Eardrum” (paper). The work was conducted in 2021, when I was affiliated with Starkey Hearing Technologies. Its goal is to address practical issues in achieving personalized hear-through functionality. On hearable devices, hear-through functionality provides hearing equivalent to the open ear, whilst creating the possibility of modifying the sound pressure at the eardrum in a desired manner. It has drawn great attention from researchers in recent years.

This paper proposes an individualized hear-through equalization filter design that uses measurement data from the inward-facing microphone of the hearing device to predict the sound pressure at the eardrum. Experimental results using real-ear measured transfer functions confirm that the proposed method achieves good sound quality relative to the open ear.

Since I joined Sonos in September 2021, the team has been working on applying similar ideas from this paper to room acoustics estimation, in order to improve the sound quality of Trueplay. The output of that work has been accepted at the International Workshop on Acoustic Signal Enhancement 2022 (IWAENC 2022), and will be presented in Bamberg, Germany this September.

What were your two highlights from ICASSP 2022, and why?

JB: I would first like to highlight Yi Ma's invited plenary talk. The work that he and his team have done to optimize the parsimony and consistency of representational models is inspiring. His proposal to build on Linear Discriminative Representations (LDR), and his advocacy for maximizing coding rate reduction instead of minimizing cross-entropy, lead to convincing theories and results.

If you are interested in learning more, you should look at his recent submission to JMLR, “ReduNet: A White-box Deep Network from the Principle of Maximizing Rate Reduction”, or the paper “Closed-loop data transcription to an LDR via Minimaxing Rate Reduction”.
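To give a concrete feel for the rate reduction objective mentioned above, here is a minimal NumPy sketch of the coding rate and the rate reduction as they are commonly defined in the maximal coding rate reduction literature. The function names, the `eps` default and the toy data are illustrative assumptions, not code from the talk or the papers.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Coding rate R(Z) = 1/2 * logdet(I + d / (m * eps^2) * Z Z^T).

    Z holds m feature vectors of dimension d as columns (shape d x m).
    """
    d, m = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (m * eps**2)) * Z @ Z.T)
    return 0.5 * logdet

def rate_reduction(Z, labels, eps=0.5):
    """Rate of the whole set minus the size-weighted rates of each class."""
    _, m = Z.shape
    labels = np.asarray(labels)
    class_rate = 0.0
    for c in np.unique(labels):
        Zc = Z[:, labels == c]
        class_rate += (Zc.shape[1] / m) * coding_rate(Zc, eps)
    return coding_rate(Z, eps) - class_rate

# Toy data (hypothetical): four 2-D features lying on two orthogonal axes.
Z = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])

# Labels that match the two axes give a larger rate reduction ...
print(rate_reduction(Z, [0, 0, 1, 1]))
# ... than labels that mix the two axes within each class.
print(rate_reduction(Z, [0, 1, 0, 1]))
```

The quantity is large when the features as a whole span many directions while each class stays compact. That is the property a representation trained with this objective is encouraged to have.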

The second highlight for me would be the paper by Soheil Khorram and Jaeyoung Kim from Google, titled “Contrastive siamese network for semi-supervised speech recognition”. The presented models, while impractical for on-device assistant use cases such as Sonos Voice Control, certainly capture the current trend in the field toward very large, scalable deep neural network models trained in a semi-supervised fashion.

From an industry perspective, it raises the question of how efficient such big networks are, and how well they can be distilled, pruned and factorized for practical real-time use cases on commodity hardware.

MB: The first highlight I'll mention is Hung-yi Lee's invited talk on self-supervised models. Hung-yi's team have done some fantastic work on understanding the embedding space for models trained on sequential data, and there were some fascinating insights into semantic embeddings and cross-disciplinary applications of transformers. I highly recommend checking out a recent paper from Hung-yi's group, “Is BERT a Cross-Disciplinary Knowledge Learner? A Surprising Finding of Pre-trained Models’ Transferability”.

Another highlight for me was Yang Guo et al.'s paper, “Bayesian Continual Imputation and Prediction for Time Series Data”. This excellent piece of work introduces a method for imputing missing data in a continual learning setting through the use of Bayesian LSTMs. I particularly like how they incorporate the KL divergence from Bayes by Backprop as a means of constraining parameter shift over time: in effect, using it as a regulariser against catastrophic forgetting. As someone who's been very involved in Bayesian Deep Learning (BDL) over the past few years, it's great to see how the advantages of BDL methods are being leveraged by the wider research community.
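As a rough illustration of the idea of using a KL term as a regulariser, the sketch below computes the KL divergence between two diagonal Gaussian weight distributions and adds it to a data-fit loss. It shows the general principle only, under stated assumptions. It is not the method from the paper, and the function names, the `beta` weight and the example numbers are hypothetical.

```python
import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for diagonal Gaussians q = N(mu_q, sigma_q^2), p = N(mu_p, sigma_p^2)."""
    mu_q, sigma_q, mu_p, sigma_p = (
        np.asarray(a, dtype=float) for a in (mu_q, sigma_q, mu_p, sigma_p)
    )
    per_weight = (
        np.log(sigma_p / sigma_q)
        + (sigma_q**2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p**2)
        - 0.5
    )
    return float(np.sum(per_weight))

def continual_loss(nll, mu_q, sigma_q, mu_prev, sigma_prev, beta=1.0):
    """Data-fit term plus a penalty for drifting away from the earlier posterior.

    nll is the negative log-likelihood on the new data (assumed computed elsewhere);
    (mu_prev, sigma_prev) describe the weight posterior learned on earlier data.
    """
    return nll + beta * kl_diag_gaussians(mu_q, sigma_q, mu_prev, sigma_prev)

# Hypothetical numbers: two weights, one of which has moved away from its old mean.
mu_prev, sigma_prev = [0.0, 0.0], [1.0, 1.0]
print(continual_loss(2.0, [0.0, 0.0], [1.0, 1.0], mu_prev, sigma_prev))  # no drift, no penalty
print(continual_loss(2.0, [1.0, 0.0], [1.0, 1.0], mu_prev, sigma_prev))  # drift is penalised
```

The further the new weight distribution moves from the one learned on earlier data, the larger the penalty. This is how a KL term can discourage catastrophic forgetting.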

WJ: At this ICASSP, I mainly attended sessions related to room acoustic modeling and spatial sound field control. One of my observations is that data-driven and machine learning (ML) based room acoustic modeling approaches are becoming increasingly relevant. Models can be trained to emulate the transformations that a given input sound undergoes under certain room conditions, without the need to simulate or solve complex equations.

Meanwhile, more data-centric approaches are emerging, i.e. augmenting training data to give existing models more capabilities. Data collection is often seen as a one-time event and is neglected in favour of building better model architectures. As a result, hundreds of hours are lost in fine-tuning models based on imperfect data. Andrew Ng encourages us “to move from a model-centric approach to a data-centric approach.” Two interesting papers at this ICASSP that leverage data-centric approaches caught my eye:

1) P. Götz, C. Tuna, A. Walther and E. A. P. Habets, "Blind Reverberation Time Estimation in Dynamic Acoustic Conditions," ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 581-585

2) A. Ratnarajah, S.-X. Zhang, M. Yu, Z. Tang, D. Manocha and D. Yu, "FAST-RIR: Fast Neural Diffuse Room Impulse Response Generator," ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 571-575

The first paper presents a novel way of generating training data and demonstrates, using an existing deep neural network architecture, a considerable improvement in the ability to follow temporal changes when estimating room reverberation time. The second paper, on the other hand, devises a fast neural-network-based generator of diffuse room impulse responses (RIRs) for a given acoustic environment.
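For readers less familiar with these quantities, the sketch below shows how reverberation time relates to a room impulse response. It estimates RT60 from a known RIR using the classic Schroeder backward integration. This is a textbook, non-blind method shown for context only. It is not the blind estimator or the neural generator from the papers above, and the function name, the fit range and the synthetic RIR are illustrative assumptions.

```python
import numpy as np

def estimate_rt60(rir, fs, fit_range_db=(-5.0, -25.0)):
    """Estimate RT60 (seconds) from an impulse response sampled at fs Hz.

    Assumes the response has no trailing zeros and decays well below the fit range.
    """
    energy = np.asarray(rir, dtype=float) ** 2
    # Schroeder backward integration: energy remaining after each sample.
    edc = np.cumsum(energy[::-1])[::-1]
    edc_db = 10.0 * np.log10(edc / edc[0])
    upper, lower = fit_range_db
    idx = np.where((edc_db <= upper) & (edc_db >= lower))[0]
    # Fit a straight line to the decay curve; the slope is in dB per second.
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)
    # Extrapolate the fitted decay to a 60 dB drop.
    return -60.0 / slope

# Synthetic RIR (hypothetical): noise whose amplitude falls by 60 dB in 0.5 s.
fs = 8000
true_rt60 = 0.5
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
rir = rng.standard_normal(t.size) * 10.0 ** (-3.0 * t / true_rt60)

print(estimate_rt60(rir, fs))  # should land close to true_rt60
```

A blind estimator has to recover the same quantity from reverberant recordings alone, without access to the impulse response. That is what makes the choice of training data so important.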

Any other takeaways from ICASSP, or tips for first-time attendees?

JB: ICASSP 2022 was my first in-person research conference.

1) As a newcomer, I was lucky to attend a tutorial and a few short courses. Those are ideal introductions to new research areas. I know the additional fees are a barrier, but in my opinion they are totally worth it. Just be prepared to receive a large quantity of information in a short amount of time.

2) Networking is a core pillar of research conferences. You should allocate time to meet new people and talk informally. Arranging meetings with authors you already know, or approaching people you do not know personally, is far easier during the conference than usual. That said, ICASSP is a very big conference, so be sure you at least know what the people you want to meet look like. A natural way to do that is to attend their presentations, even if you already know their work.

MB: I have two contradictory but complementary tips: 1) Go in with a plan: there's a significant variety of excellent work at ICASSP, so it's good to plan how you'll spend your time to make sure you don't miss anything particularly relevant to your work. 2) Make time to explore: allow yourself to drift from your schedule every so often and discover the incredible work going on in adjacent fields. The invited talks are a great way to do this.

WJ: This year’s ICASSP is the 7th that I have attended in person since my first back in 2013. ICASSP is the flagship conference of the speech and signal processing community, typically with over 3,000 participants (before the pandemic). Naturally, it can be both exciting and intimidating for people who are new to it. My advice for first-time attendees is to be open-minded and brave: talk to researchers from different areas whether or not you know their work, promote your own work for more exposure, and network with people who may offer direct or indirect help to your research career. I am personally looking forward to the first post-pandemic edition of ICASSP next year in Rhodes Island, Greece.

AM: I would second Matt’s point from above about being flexible with your plans. There are so many interesting talks and papers at ICASSP that it is impossible to cover everything. It is tempting to spend all your time in sessions that are closest to your field or areas of interest, but ICASSP is a great conference to explore new areas of research. I would recommend selecting a few sessions that are outside your usual research topics.

© 2022 by Sonos, Inc.
All rights reserved. Sonos and Sonos product names are trademarks or registered trademarks of Sonos, Inc.
All other product names and services may be trademarks or service marks of their respective owners.