Takeaways from Interspeech 2020, Part 2: scientific highlights
Principal Machine Learning Scientist, Sonos Voice Experience
Senior Signal Processing and Machine Learning Scientist, Sonos Voice Experience
Senior Machine Learning Scientist, Sonos Voice Experience
Last month, from October 25th to 29th, Interspeech, one of the prominent conferences on speech processing, took place. Initially planned for Shanghai, China, it was “relocated” online as a virtual conference due to the ongoing pandemic. We attended the conference to learn more about the current trends in speech processing systems and to present a paper we submitted on keyword spotting applications.
In this series of two posts, we would like to share our impressions and takeaways from this very instructive event. For our experience attending virtual conferences, have a look at Part 1.
Here, we discuss our takeaways on three major topics: speech enhancement, automatic speech recognition, and spoken language understanding. Note that the conference covers a wide variety of subjects and our list is not meant to be exhaustive. We also take this opportunity to tell you more about the paper we presented this year, entitled “Predicting detection filters for small footprint open-vocabulary keyword spotting”.
Speech enhancement and source separation
One interesting trend we observed in speech enhancement papers (mostly in the multi-channel case but also somewhat for single-channel methods) was the tendency to go back to more traditional signal-processing approaches and replace some of the estimation components (usually the weakest ones) with neural network models [1, 2, 3, 4]. This is usually done in an effort to harness the power of black box neural network training while still retaining explainable models that have the potential to generalize better. This trend, which gained a lot of popularity with the Differentiable Digital Signal Processing (DDSP) paper published at ICLR this year [5], was also present in speech synthesis papers, with physically motivated methods based on voice production models, like LPCNet and its variants [6, 7, 8].
In terms of source separation, a significant number of papers tackled the problem of targeted source separation. This area of research aims to isolate the speech coming from a specific speaker while competing speech sources are also present in the room. This topic has applications in voice assistants, as it allows issuing commands in a typical cocktail party scenario with a speaker-agnostic ASR system. At the intersection of speaker recognition and source separation, some of the challenges of such systems are to make sure that there is no oversuppression of speech and that no information is lost when no speaker information is available. There were interesting papers in this track, such as the VoiceFilter-Lite system [9], showing impressive reductions in Word Error Rate at fairly low SNRs.
Regarding the neural network models and architectures favoured for single-channel noise suppression and speech enhancement, although there are end-to-end architectures showing successful results (e.g. ConvTasNet [10], DeMucs [11]), the majority of papers presented seemed to indicate that processing in the spectral domain still dominates the field. There are interesting trends though; for example, more and more papers seem to move away from estimating magnitude masks in favour of complex masks as the supervised target [12, 13, 18], and attention layers are more and more present at all stages of the models [1, 2, 14]. In terms of architectures, we saw the continued tendency of using RNN layers such as LSTM and GRU for online smaller footprint networks [15, 16] and variations on the U-Net for bigger offline models [17, 18, 11]. Interestingly, there were relatively few papers presenting systems applicable in a low-latency real-time setting.
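To make the difference between magnitude and complex masks concrete, here is a minimal sketch on a single time-frequency bin. The bin values are hypothetical, and the masks are the ideal (oracle) ones computed from the known clean signal, not the output of any model from the cited papers. A real-valued magnitude mask can only rescale the noisy bin and so keeps the noisy phase, while a complex mask can also rotate it.

```python
import cmath

def apply_mask(noisy_bin: complex, mask: complex) -> complex:
    """Apply a (possibly complex) time-frequency mask to one STFT bin."""
    return noisy_bin * mask

# Hypothetical values for a single time-frequency bin.
clean = cmath.rect(1.0, 0.3)   # magnitude 1.0, phase 0.3 rad
noise = cmath.rect(0.8, 2.0)   # magnitude 0.8, phase 2.0 rad
noisy = clean + noise

# Ideal magnitude mask: real-valued, corrects the magnitude only.
magnitude_mask = abs(clean) / abs(noisy)
est_mag = apply_mask(noisy, magnitude_mask)

# Ideal complex ratio mask: complex-valued, corrects magnitude and phase.
complex_mask = clean / noisy
est_cplx = apply_mask(noisy, complex_mask)

print(f"magnitude mask error: {abs(est_mag - clean):.3f}")
print(f"complex mask error:   {abs(est_cplx - clean):.3f}")
```

In this oracle setting the complex mask recovers the clean bin exactly, whereas the magnitude mask leaves a residual error that comes entirely from the uncorrected phase.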
Automatic Speech Recognition
Regarding acoustic models and automatic speech recognition (ASR) systems, we observed the continued trend of end-to-end models. It looks like nowadays, most of the research is dedicated to such models. The disadvantage of the classical hybrid NN/HMM models for general ASR and "ask-me-anything" voice assistants is the big decoding graph required to perform the inference, weighing over 5GB. Thus, end-to-end neural networks predicting words or word pieces from audio directly are competitive, even with hundreds of millions of parameters. One of the announced goals is to be able to run ASR directly in streaming mode on-device, not only for privacy concerns but also, and perhaps more importantly, to reduce the latency (notably, privacy is getting more and more attention, with a VoicePrivacy Challenge at this Interspeech and a full session dedicated to it).
One of the drawbacks of end-to-end systems is the difficulty of integrating external knowledge, which is especially useful for rare words or named entities. These are likely to occur very rarely or not at all in paired audio/text datasets. Hybrid systems do not suffer from these issues because they use a pronunciation lexicon and an external language model that may be estimated on a vast amount of text-only data. Building end-to-end systems that include external knowledge with something smarter than a shallow fusion with an external language model already received a lot of attention at ICASSP and is still a big topic, mainly through biasing with external FSTs [19], error correction [20], or methods to train from unpaired text/audio data [21, 22].
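For reference, the shallow fusion baseline mentioned above simply adds a weighted external language model log-probability to the end-to-end model's score for each hypothesis. The sketch below is only an illustration: the hypotheses, probabilities and the `lm_weight` value are made up, and real systems apply the interpolation inside beam search rather than on a final list.

```python
import math

def shallow_fusion_score(e2e_logprob, lm_logprob, lm_weight):
    """Log-linear interpolation of the E2E score and an external LM score."""
    return e2e_logprob + lm_weight * lm_logprob

def best_hypothesis(hypotheses, lm_weight):
    """Return the hypothesis text with the highest fused score."""
    return max(
        hypotheses,
        key=lambda text: shallow_fusion_score(
            hypotheses[text]["e2e"], hypotheses[text]["lm"], lm_weight
        ),
    )

# Hypothetical scores: the E2E model slightly prefers a misspelled entity,
# while the text-only LM has seen the correct artist name.
hypotheses = {
    "play songs by sade": {"e2e": math.log(0.30), "lm": math.log(0.20)},
    "play songs by shar day": {"e2e": math.log(0.35), "lm": math.log(0.01)},
}

print(best_hypothesis(hypotheses, lm_weight=0.0))  # E2E model alone
print(best_hypothesis(hypotheses, lm_weight=0.3))  # with shallow fusion
```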
The other important drawback of end-to-end systems is that they are generally not well suited to streaming recognition. A lot of research is devoted to training tricks that make these models streamable, including masking or chunk-based [23], monotonic [24], or triggered attention [25, 26], which complicates the training pipeline. A full session was dedicated to streaming ASR, showing how big a topic it has become, whereas hybrid systems were readily adapted to streaming mode. This is also probably the reason why the RNN-T approach seems to be taking over in the end-to-end ecosystem, compared to attention-based encoder-decoder (AED) or transformer approaches: it is both time and label synchronous (while the other two are label synchronous only and hybrid methods time synchronous only).
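As an illustration of the chunk-based masking idea, the sketch below builds a self-attention mask in which each frame may only attend to frames in its own chunk and in a limited number of previous chunks, so the look-ahead is bounded by the chunk size. The function name and parameters are hypothetical and do not reproduce any specific paper cited here.

```python
def chunk_attention_mask(num_frames, chunk_size, left_chunks=1):
    """Return mask[i][j] == True when frame i may attend to frame j."""
    mask = []
    for i in range(num_frames):
        chunk = i // chunk_size
        first = max(0, (chunk - left_chunks) * chunk_size)  # inclusive
        last = min(num_frames, (chunk + 1) * chunk_size)    # exclusive
        mask.append([first <= j < last for j in range(num_frames)])
    return mask

mask = chunk_attention_mask(num_frames=6, chunk_size=2, left_chunks=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a mask of this kind, the model never needs more than `chunk_size - 1` future frames, which is what makes chunk-wise streaming inference possible.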
In terms of neural network architecture, the long short-term memory recurrent networks are still very present, in their unidirectional and bi-directional form (depending on whether streaming or accuracy is most wanted). However, transformer blocks including self-attention continue to receive more and more focus (they also had a session dedicated to them) in spite of their requirement for tricks to be used in streaming mode. Time-depth separable convolutions and other factorizations of the convolutional layers are more and more popular, and appeared in many papers as well [27, 28, 29]. These trends seem to be independent of the chosen approach (RNN-T, AED, or hybrid). Interestingly, a few compression methods were proposed [30, 31, 32], confirming the interest in running ASR on devices.
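To give a sense of why factorized convolutions are attractive for small models, here is a quick parameter count comparing a standard 1-D convolution with a depthwise separable one (a depthwise convolution followed by a 1x1 pointwise convolution). This is one common factorization chosen for illustration, not the exact time-depth separable block of the cited papers; the kernel size and channel counts are arbitrary, and biases are ignored.

```python
def standard_conv1d_params(kernel_size, in_channels, out_channels):
    """Weights of a standard 1-D convolution (no bias)."""
    return kernel_size * in_channels * out_channels

def separable_conv1d_params(kernel_size, in_channels, out_channels):
    """Depthwise (one filter per input channel) plus 1x1 pointwise weights."""
    depthwise = kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise

standard = standard_conv1d_params(9, 256, 256)
separable = separable_conv1d_params(9, 256, 256)
print(standard, separable, round(standard / separable, 1))
```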
Finally, it is worth noting that self-supervised and semi-supervised training strategies are becoming more and more popular to increase the size of the training dataset [33, 34, 35, 36]. It seems that these approaches are now more popular in research than multi-task or transfer learning, although there were a few papers on these methods too [37, 38].
Spoken Language Understanding and Language Modeling
One of the key challenges faced in Spoken Language Understanding (SLU) systems composed of two distinct Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) blocks is to make the NLU robust to ASR errors. Such errors can occur in noisy and far-field conditions or when a domain-agnostic ASR is used on a specific domain. While this topic received a lot of attention this year, no clear trend seemed to emerge from the conference. Among the various approaches being explored, we've seen Transformers embedding the ASR Word Confusion Networks (WCN) [39] to make SLU predictions, using the idea that the right ASR transcription is not necessarily the ASR 1-best hypothesis but is probably contained somewhere in the ASR lattice [40] or WCN. Other papers made the NLU training “ASR-errors-aware” by training it jointly with the ASR [41] or by augmenting the NLU training dataset with the ASR transcriptions, using tricks such as the Marginalized-CRF [42] to take potential re-labeling errors into account.
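The intuition behind feeding a WCN to the NLU can be shown on a toy example. The confusion network below is hypothetical and simplified (real WCNs also contain empty/epsilon arcs): each position holds competing words with posterior probabilities, the 1-best hypothesis takes the top word at each position, and the correct transcription may still be one of the other paths.

```python
# Hypothetical WCN: one list of (word, posterior) alternatives per position.
wcn = [
    [("turn", 0.9), ("burn", 0.1)],
    [("off", 0.55), ("on", 0.45)],
    [("the", 1.0)],
    [("light", 0.6), ("lights", 0.4)],
]

def one_best(network):
    """Pick the most probable word at each position."""
    return [max(alternatives, key=lambda wp: wp[1])[0] for alternatives in network]

def contains(network, words):
    """Check whether a word sequence is one of the paths through the WCN."""
    if len(words) != len(network):
        return False
    return all(
        word in {w for w, _ in alternatives}
        for word, alternatives in zip(words, network)
    )

reference = ["turn", "on", "the", "lights"]
print(one_best(wcn))              # what a 1-best pipeline passes to the NLU
print(contains(wcn, reference))   # the reference is still inside the WCN
```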
SLU model architectures seem to follow the global trend of using Transformers for various cloud and on-device applications. They are used to encode WCNs or dialog context [39], to perform ASR error correction [43], to jointly perform intent classification and slot filling in a discriminative or generative fashion, and for end-to-end SLU [44]. This last application seems to be gaining popularity at the conference, partly because end-to-end SLU does not require expensive audio and transcript pairs: the model can be trained directly on audio for which we only have the final SLU parsing (domain + intents + slots). While such approaches seem attractive, they are, for now, evaluated on "easy" SLU tasks such as ATIS or Fluent Speech Commands. To predict directly from audio, some end-to-end SLU architectures choose not to model human speech and language. Such models can struggle to distinguish between utterances with similar audio features but opposite or different semantics, such as "More light in the kitchen please" vs. "No more light in the kitchen please", or "Deactivate the kitchen's lights" vs. "Activate the kitchen's lights". This limitation makes them unfit for complex real-world assistants at the moment. Some other end-to-end SLU architectures include an ASR module and leverage Transfer Learning: they are pre-trained on general-domain (audio, transcripts) pairs and then fine-tuned using domain-specific (audio, SLU parsing) pairs.
Another interesting trend we observed in Neural Language Modeling (NLM) is the use of Pointer Networks or similar mechanisms [45, 46] to attend to some context and bias the Language Model (LM) decoding towards contextually relevant words, improving performance on rare words and entity values.
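A generic way to picture this kind of biasing is a gated mixture between the LM's vocabulary distribution and a "pointer" distribution over words found in the context. The sketch below shows only that generic mixture, not the specific models of the cited papers; the words, probabilities and gate value are hypothetical, and in a real model the gate and the pointer distribution would be computed by attention over the context.

```python
def pointer_mixture(p_vocab, p_pointer, gate):
    """Mix two word distributions; `gate` is the weight of the vocabulary LM."""
    words = set(p_vocab) | set(p_pointer)
    return {
        word: gate * p_vocab.get(word, 0.0) + (1.0 - gate) * p_pointer.get(word, 0.0)
        for word in words
    }

# Hypothetical next-word distribution after "play some ...".
p_vocab = {"jazz": 0.50, "rock": 0.49, "zouk": 0.01}
# Hypothetical pointer distribution over context words (e.g. a user's playlists).
p_pointer = {"zouk": 1.0}

mixed = pointer_mixture(p_vocab, p_pointer, gate=0.7)
for word, prob in sorted(mixed.items(), key=lambda item: -item[1]):
    print(f"{word}: {prob:.3f}")
```

The rare word that appears in the context gets a much higher probability than the vocabulary LM alone would give it, while the mixture remains a valid distribution.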
Our paper at Interspeech 2020
At this conference we presented our work on keyword spotting entitled “Predicting detection filters for small footprint open-vocabulary keyword spotting”.
In the past 5 years or so, end-to-end keyword spotting models have gained a lot of popularity and are becoming the standard method for this task. This shift was largely motivated by the success of deep learning and end-to-end neural networks in general, and the availability of open datasets of spoken keywords, such as Google Speech Commands and Hey Snips. In these methods a neural network directly predicts the presence of a keyword in an audio clip or in an audio stream from acoustic features. Not only does this approach yield a much better accuracy than previous ones, but the neural networks built can also be very small, fitting into micro-controllers.
The main drawback of this approach is the requirement of a training dataset containing a large number of keyword samples. To detect a new keyword, data must be collected and a new system must be trained, preventing the easy and rapid customization of the keyword spotting engine by the user. Historical methods, however, tended to be more flexible in that regard. In the so-called acoustic keyword spotting systems, a classical acoustic model of phones is first trained from generic speech data, that is, training data that do not necessarily contain the keyword. This has two advantages. First, that kind of data is already available in large amounts: we can train on the datasets used for large-vocabulary speech recognition models. Second, the same acoustic model may be employed without retraining to detect a new keyword defined by the user. At inference, a phone-based keyword model is built and used to detect the keyword among the acoustic model’s outputs. The drawbacks of such methods are that they are generally bigger, less accurate and that it is more difficult to define and calibrate a meaningful confidence measure at the keyword level (we explored such methods in a previous publication).
The goal of our research was to propose a method to build a keyword detection model combining the advantages of these two approaches, namely:
that would be small and accurate and would predict a confidence at the keyword level, like the end-to-end approach
that could be trained on generic data, and customized to new, user-specified keywords at inference, without retraining, like the acoustic keyword spotting approach
We adopted a sort of meta-learning strategy. Basically, we want to build an auxiliary neural network that would predict the end-to-end keyword spotting neural network from the keywords themselves. This way, at inference, the user could input their keyword to that network and get a neural network which could subsequently be used to detect these keywords.
Since it would be a bit cumbersome and not necessarily relevant to predict the whole keyword spotting neural network, we instead have the auxiliary network predict only the weights of the last (classification) layer of a generic keyword spotting network. Alternatively, you can see it as a neural network predicting a small linear classifier whose inputs are high-level features extracted by a generic neural network.
The generic neural network is made of two parts: an acoustic encoder and a keyword detector.
The acoustic encoder is made of a stack of quantized long short-term memory layers, pre-trained with connectionist temporal classification on a generic speech dataset (Librispeech) to predict phones from audio. The motivation for this choice is that the high-level features extracted by this network should contain information about phones, which should be relevant to keyword detection.
The keyword detector is a small two-layer convolutional neural network whose weights are partially predicted by the auxiliary network. All the weights are quantized to eight-bit precision, so the overall model weighs less than 250 kilobytes.
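As a rough illustration of what eight-bit weights mean for the footprint, the sketch below uses a symmetric per-tensor linear quantization scheme. That scheme is an assumption made for the example (the post does not say which one is used), and the size estimate ignores any storage overhead such as scales or file headers.

```python
def quantize_int8(weights):
    """Symmetric linear quantization of floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]

def model_size_kb(num_params, bits=8):
    """Storage needed for the weights alone, in kilobytes (1 KB = 1000 bytes)."""
    return num_params * bits / 8 / 1000

weights = [0.5, -1.27, 0.0, 0.8]  # hypothetical float weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
print(quantized, [round(w, 3) for w in restored])

# At one byte per weight, a 250 KB budget holds roughly 250,000 parameters.
print(model_size_kb(250_000, bits=8), model_size_kb(250_000, bits=32))
```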
In the whole keyword spotting neural network, there is one sigmoid output for each keyword, hence one convolution filter for each keyword in the last layer, independent of each other. The weights of these filters are predicted by the auxiliary keyword encoder neural network.
There are two advantages of having independent keyword outputs in the network. First, it allows us to build a keyword encoder that only predicts one filter for every single keyword. Assuming that a pronunciation lexicon or a grapheme-to-phoneme converter is available, in our work the keyword encoder is a simple bidirectional long short-term memory layer followed by a linear layer, predicting the filter’s weights from the keyword’s phone sequence. Second, if specific training data is available (for example for very common keywords), the filters may still be trained the usual way from data, without affecting the ability of the neural network to accommodate new custom keywords for which the filters are predicted by the keyword encoder.
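The overall mechanism can be sketched in a few lines of plain Python. This is a heavily simplified toy, not the model from the paper: a mean-pooled phone embedding stands in for the bidirectional LSTM (so phone order is ignored), the predicted filter has a temporal width of one frame, all parameters are random and untrained, and every name and dimension is hypothetical. It only shows the data flow: phone sequence in, filter weights out, then a sigmoid score per frame.

```python
import math
import random

random.seed(0)
FEATURE_DIM = 4  # size of the acoustic features fed to the last layer
EMBED_DIM = 3    # size of the phone embeddings
PHONES = ["HH", "EY", "S", "N", "IH", "P", "OW", "K"]

# Random stand-ins for trained parameters of the keyword encoder.
EMBED = {p: [random.gauss(0, 1) for _ in range(EMBED_DIM)] for p in PHONES}
PROJ = [[random.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(FEATURE_DIM + 1)]

def keyword_encoder(phones):
    """Predict (filter weights, bias) for one keyword from its phone sequence."""
    pooled = [sum(EMBED[p][k] for p in phones) / len(phones) for k in range(EMBED_DIM)]
    out = [sum(w * x for w, x in zip(row, pooled)) for row in PROJ]
    return out[:-1], out[-1]

def detect(features, weights, bias):
    """Apply the predicted filter to each frame and return the maximum sigmoid score."""
    scores = [
        1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(weights, frame)) + bias)))
        for frame in features
    ]
    return max(scores)

# Hypothetical acoustic-encoder outputs for 5 frames of audio.
features = [[random.gauss(0, 1) for _ in range(FEATURE_DIM)] for _ in range(5)]

weights, bias = keyword_encoder(["HH", "EY", "S", "N", "IH", "P", "S"])
print(f"keyword confidence: {detect(features, weights, bias):.3f}")
```

Because each keyword gets its own independent filter and sigmoid output, adding a new keyword only requires one more call to `keyword_encoder`; nothing else in the network changes.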
For the keyword encoder, a keyword is merely an arbitrary sequence of phones. Using a generic acoustic model, we can align the training dataset to have both the audio and the corresponding phone sequence with timings. From the latter, we may build “fake” keywords, which can be any subsequence of the aligned phone sequence, and get labeled examples made of audio, a keyword, and the expected classification outcome (1 where the phone sequence matches the audio and 0 otherwise). This allows us to jointly train the keyword encoder and detector on generic speech training data.
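The construction of such training examples can be sketched as follows. The alignment format, the length limits and the way negatives are drawn (a phone subsequence taken from another utterance and checked not to occur in the current one) are assumptions made for this illustration, not a description of the exact sampling procedure used in the paper.

```python
import random

def sample_fake_keyword(alignment, rng, min_len=3, max_len=5):
    """Pick a random contiguous phone subsequence and its start/end times."""
    longest = min(max_len, len(alignment))
    shortest = min(min_len, longest)
    length = rng.randint(shortest, longest)
    start = rng.randint(0, len(alignment) - length)
    segment = alignment[start:start + length]
    phones = [phone for phone, _, _ in segment]
    return phones, segment[0][1], segment[-1][2]

def occurs_in(phones, alignment):
    """True if `phones` is a contiguous subsequence of the aligned phones."""
    sequence = [phone for phone, _, _ in alignment]
    n = len(phones)
    return any(sequence[i:i + n] == phones for i in range(len(sequence) - n + 1))

def make_pairs(utt_id, alignments, rng):
    """Build (utterance, keyword phones, label) examples for one utterance."""
    positive, _, _ = sample_fake_keyword(alignments[utt_id], rng)
    pairs = [(utt_id, positive, 1)]
    other = rng.choice([u for u in alignments if u != utt_id])
    negative, _, _ = sample_fake_keyword(alignments[other], rng)
    if not occurs_in(negative, alignments[utt_id]):
        pairs.append((utt_id, negative, 0))
    return pairs

# Hypothetical forced alignments: (phone, start time, end time) in seconds.
alignments = {
    "utt1": [("HH", 0.00, 0.08), ("EH", 0.08, 0.15), ("L", 0.15, 0.22), ("OW", 0.22, 0.35),
             ("W", 0.40, 0.48), ("ER", 0.48, 0.60), ("L", 0.60, 0.66), ("D", 0.66, 0.75)],
    "utt2": [("G", 0.00, 0.07), ("UH", 0.07, 0.14), ("D", 0.14, 0.20), ("M", 0.25, 0.32),
             ("AO", 0.32, 0.44), ("R", 0.44, 0.50), ("N", 0.50, 0.56), ("IH", 0.56, 0.62),
             ("NG", 0.62, 0.70)],
}

rng = random.Random(0)
pairs = make_pairs("utt1", alignments, rng)
for utt, phones, label in pairs:
    print(utt, " ".join(phones), label)
```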
In our experiments, we found that the proposed method outperforms classical acoustic keyword spotting approaches for open-vocabulary keywords and that when fine-tuned to keyword-specific training data, it is competitive with state-of-the-art end-to-end models directly trained on that data. If you want to know more about our work, we encourage you to watch our presentation at Interspeech below (one of the advantages of a virtual conference…) and to read our paper.
During the development of our method, we collected two datasets of voice queries containing keywords, that we re-recorded in clean and noisy environments. We made these datasets open to encourage reproducible and comparable research on the topic.
This concludes our series on Interspeech 2020. We are looking forward to future conferences and might share our takeaways again. We hope it was useful. Don’t hesitate to share it if you liked it!
[1] Li, G., Liang, S., Nie, S., Liu, W., Yang, Z., Xiao, L., Deep Neural Network-Based Generalized Sidelobe Canceller for Robust Multi-Channel Speech Recognition, Proc. Interspeech, 2020.
[2] Xu, Y., Yu, M., Zhang, S., Chen, L., Weng, C., Liu, J., Yu, D., Neural Spatio-Temporal Beamformer for Target Speech Separation, Proc. Interspeech, 2020.
[3] Roy, S.K., Nicolson, A., Paliwal, K.K., A Deep Learning-Based Kalman Filter for Speech Enhancement, Proc. Interspeech, 2020.
[4] Yu, H., Zhu, W., Champagne, B., Subband Kalman Filtering with DNN Estimated Parameters for Speech Enhancement, Proc. Interspeech, 2020.
[5] Engel, J., Hantrakul, L., Gu, C., Roberts, A., DDSP: Differentiable Digital Signal Processing, Proc. ICLR, 2020.
[6] Liu, Z., Chen, K., Yu, K., Neural Homomorphic Vocoder, Proc. Interspeech, 2020.
[7] Tian, Q., Zhang, Z., Lu, H., Chen, L., Liu, S., FeatherWave: An Efficient High-Fidelity Neural Vocoder with Multi-Band Linear Prediction, Proc. Interspeech, 2020.
[8] Kanagawa, H., Ijima, Y., Lightweight LPCNet-Based Neural Vocoder with Tensor Decomposition, Proc. Interspeech, 2020.
[9] Wang, Q., Moreno, I.L., Saglam, M., Wilson, K., Chiao, A., Liu, R., He, Y., Li, W., Pelecanos, J., Nika, M., Gruenstein, A., VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition, Proc. Interspeech, 2020.
[10] Luo, Y., Mesgarani, N., Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
[11] Défossez, A., Synnaeve, G., Adi, Y., Real Time Speech Enhancement in the Waveform Domain, Proc. Interspeech, 2020.
[12] Li, X., Horaud, R., Online Monaural Speech Enhancement Using Delayed Subband LSTM, Proc. Interspeech, 2020.
[13] Strake, M., Defraene, B., Fluyt, K., Tirry, W., Fingscheidt, T., INTERSPEECH 2020 Deep Noise Suppression Challenge: A Fully Convolutional Recurrent Network (FCRN) for Joint Dereverberation and Denoising, Proc. Interspeech, 2020.
[14] Deng, F., Jiang, T., Wang, X., Zhang, C., Li, Y., NAAGN: Noise-Aware Attention-Gated Network for Speech Enhancement, Proc. Interspeech, 2020.
[15] Schröter, H., Rosenkranz, T., Escalante-B., A., Zobel, P., Maier, A., Lightweight Online Noise Reduction on Embedded Devices Using Hierarchical Recurrent Neural Networks, Proc. Interspeech, 2020.
[16] Valin, J., Isik, U., Phansalkar, N., Giri, R., Helwani, K., Krishnaswamy, A., A Perceptually Motivated Approach for Low-Complexity, Real-Time Enhancement of Fullband Speech, Proc. Interspeech, 2020.
[17] Isik, U., Giri, R., Phansalkar, N., Valin, J., Helwani, K., Krishnaswamy, A., PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings, Semi-Supervised Conversational Data, and Biased Loss, Proc. Interspeech, 2020.
[18] Hu, Y., Liu, Y., Lv, S., Xing, M., Zhang, S., Fu, Y., Wu, J., Zhang, B., Xie, L., DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement, Proc. Interspeech, 2020.
[19] Huang, R., Abdel-hamid, O., Li, X., Evermann, G., Class LM and Word Mapping for Contextual Biasing in End-to-End ASR, Proc. Interspeech, 2020.
[20] Peyser, C., Mavandadi, S., Sainath, T.N., Apfel, J., Pang, R., Kumar, S., Improving Tail Performance of a Deliberation E2E ASR Model Using a Large Text Corpus, Proc. Interspeech, 2020.
[21] Garg, A., Gupta, A., Gowda, D., Singh, S., Kim, C., Hierarchical Multi-Stage Word-to-Grapheme Named Entity Corrector for Automatic Speech Recognition, Proc. Interspeech, 2020.
[22] Huang, Y., Li, J., He, L., Wei, W., Gale, W., Gong, Y., Rapid RNN-T Adaptation Using Personalized Speech Synthesis and Neural Language Generator, Proc. Interspeech, 2020.
[23] Wu, C., Wang, Y., Shi, Y., Yeh, C.F., Zhang, F., Streaming Transformer-Based Acoustic Models Using Self-Attention with Augmented Memory, Proc. Interspeech, 2020.
[24] Inaguma, H., Mimura, M., Kawahara, T., Enhancing Monotonic Multihead Attention for Streaming ASR, Proc. Interspeech, 2020.
[25] Zhang, S., Gao, Z., Luo, H., Lei, M., Gao, J., Yan, Z., Xie, L., Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition, Proc. Interspeech, 2020.
[26] Wang, C., Wu, Y., Liu, S., Li, J., Lu, L., Ye, G., Zhou, M., Low Latency End-to-End Streaming Speech Recognition with a Scout Network, Proc. Interspeech, 2020.
[27] Kumar, K., Liu, C., Gong, Y., Wu, J., 1-D Row-Convolution LSTM: Fast Streaming ASR at Accuracy Parity with LC-BLSTM, Proc. Interspeech, 2020.
[28] Xu, M., Zhang, X.L., Depthwise Separable Convolutional ResNet with Squeeze-and-Excitation Blocks for Small-Footprint Keyword Spotting, Proc. Interspeech, 2020.
[29] Pratap, V., Xu, Q., Kahn, J., Avidov, G., Likhomanenko, T., Hannun, A., Liptchinsky, V., Synnaeve, G., Collobert, R., Scaling Up Online Speech Recognition Using ConvNets, Proc. Interspeech, 2020.
[30] Kadetotad, D., Meng, J., Berisha, V., Chakrabarti, C., Seo, J.S., Compressing LSTM Networks with Hierarchical Coarse-Grain Sparsity, Proc. Interspeech, 2020.
[31] Mehrotra, A., Dudziak, Ł., Yeo, J., Lee, Y.Y., Vipperla, R., Abdelfattah, M.S., Bhattacharya, S., Ishtiaq, S., Ramos, A.G., Lee, S., Kim, D., Iterative Compression of End-to-End ASR Model Using AutoML, Proc. Interspeech, 2020.
[32] Nguyen, H.D., Alexandridis, A., Mouchtaris, A., Quantization Aware Training with Absolute-Cosine Regularization for Automatic Speech Recognition, Proc. Interspeech, 2020.
[33] Sheikh, I., Vincent, E., Illina, I., On Semi-Supervised LF-MMI Training of Acoustic Models with Limited Data, Proc. Interspeech, 2020.
[34] Weninger, F., Mana, F., Gemello, R., Andrés-Ferrer, J., Zhan, P., Semi-Supervised Learning with Data Augmentation for End-to-End ASR, Proc. Interspeech, 2020.
[35] Sapru, A., Garimella, S., Leveraging Unlabeled Speech for Sequence Discriminative Training of Acoustic Models, Proc. Interspeech, 2020.
[36] Xu, Q., Likhomanenko, T., Kahn, J., Hannun, A., Synnaeve, G., Collobert, R., Iterative Pseudo-Labeling for Speech Recognition, Proc. Interspeech, 2020.
[37] Houston, B., Kirchhoff, K., Continual Learning for Multi-Dialect Acoustic Models, Proc. Interspeech, 2020.
[38] Joshi, V., Zhao, R., Mehta, R.R., Kumar, K., Li, J., Transfer Learning Approaches for Streaming End-to-End Speech Recognition System, Proc. Interspeech, 2020.
[39] Liu, C., Zhu, S., Zhao, Z., Cao, R., Chen, L., Yu, K., Jointly Encoding Word Confusion Network and Dialogue Context with BERT for Spoken Language Understanding, Proc. Interspeech, 2020.
[40] Ladhak, F., Gandhe, A., Dreyer, M., Mathias, L., Rastrow, A., Hoffmeister, B., LatticeRnn: Recurrent Neural Networks Over Lattices, Proc. Interspeech, 2016.
[41] Rao, M., Raju, A., Dheram, P., Bui, B., Rastrow, A., Speech to Semantics: Improve ASR and NLU Jointly via All-Neural Interfaces, Proc. Interspeech, 2020.
[42] Ruan, W., Nechaev, Y., Chen, L., Su, C., Kiss, I., Towards an ASR Error Robust Spoken Language Understanding System, Proc. Interspeech, 2020.
[43] Wang, H., Dong, S., Liu, Y., Logan, J., Kumar Agrawal, A., Liu, Y., ASR Error Correction with Augmented Transformer for Entity Retrieval, Proc. Interspeech, 2020.
[44] Radfar, M., Mouchtaris, A., Kunzmann, S., End-to-End Neural Transformer Based Spoken Language Understanding, Proc. Interspeech, 2020.
[45] Liu, D., Liu, C., Zhang, F., Synnaeve, G., Saraf, Y., Zweig, G., Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network Language Model, Proc. Interspeech, 2020.
[46] Li, K., Povey, D., Khudanpur, S., Neural Language Modeling with Implicit Cache Pointers, Proc. Interspeech, 2020.