January 27, 2022

# How near-ultrasonic audio adds spatial awareness to the Sonos system

Daniel Jones

Distinguished Audio Research Engineer, Advanced Technology

It's a truism of design and engineering that behind the greatest simplicity lies the greatest complexity, and the Sonos system is a modern-day embodiment of this saying. On the surface is a clean, pared-down industrial design that blends neatly into a minimalist interior, and a mobile UI that makes it easy to access your music across the home with a single touch.

However, behind the minimal frontend lies something staggeringly sophisticated: a distributed array of powerful, multi-purpose computing devices, capable of synchronised audio rendering, sensing and networking, quietly orchestrating the household’s music playback via a network of high-performance DSPs. Although Sonos is perceived as a maker of audio hardware, at its core lies a million-line codebase, with vast cloud operations, highly optimised embedded code developed by a team of hundreds of brilliant software engineers, and a track record of innovation spanning 20 years and 500 patented inventions.

Our team, Sonos Advanced Technology, is focused on future-facing innovation and R&D within Sonos. We are an interdisciplinary group with backgrounds spanning physics, computer science, acoustics, studio production, sonic arts, and machine learning, working together to explore open problems with a 5-year horizon. It's an amusing paradox that, given the team’s background in solving some of the world's hardest problems - from astrophysics to animal auditory processing to AI-driven drug discovery - our unifying goal is to create experiences that are as simple and as effortless to use as possible.

Part of this remit is to consider new hardware advances for future products, working with the teams in Audio Systems Engineering, Radio Engineering, Mechanical Engineering and our own dedicated research labs to conduct experiments with novel rendering and sensing hardware.

But a fundamental advantage of the distributed computers underlying the Sonos system is that we can continually evolve and expand the existing system's capabilities with updates in the form of over-the-air software releases, taking advantage of new research and hitherto untapped hardware capabilities. You may wake up to find that the system that you bought a year ago has suddenly grown a set of new capabilities that it didn't have yesterday; Trueplay spectral correction technology, for example, was deployed as an overnight software update, immediately enabling millions of Sonos systems to tailor their sound to the room they were placed within.

In 2019, two problems had appeared on the Sonos product development roadmap that could not be addressed by the system's existing capabilities.

• The Setup team were embarking on a ground-up rewrite of the device setup process, aiming to make configuring a new device as seamless and friction-free as possible. As part of the legacy configuration process, the Sonos mobile app prompted the user to enter a PIN labelled on the device, to ensure that the device being configured was the one in front of the person.

The challenge was to replace manual PIN entry with a more streamlined approach, minimizing user effort whilst retaining the assurance of physical collocation.

• Simultaneously, the Control team were laying down the architecture for Roam, Sonos’ first compact portable speaker. Part of the product vision was to allow Roam to interoperate seamlessly with an existing household of Sonos devices, including making it quick and easy to swap audio streams between the Roam and other speakers, a feature now known as “Sound Swap”.

The objective was for Roam to automatically detect what speakers are around it, allowing users to move audio between nearby speakers without any manual target selection, which should work robustly at a target range of 3m.

Both of these can be reframed as signalling problems, requiring that devices A and B exchange a token when collocated within some small spatial bound. However, neither of these challenges was adequately addressed by the existing set of communication technologies available to Sonos devices:

• Bluetooth LE is available on most mobiles and Sonos devices, but is unreliable for precise ranging in a peer-to-peer scenario. Using off-the-shelf single-antenna hardware, the only viable way to estimate peer-to-peer range is from the received signal strength (RSS). However, BLE's propagation characteristics and uneven channel gains mean that RSS can depend strongly on the RF transmission channel, and can vary by up to 30dB over a 10cm range due to phase interference from reflected signals [1]. In addition, RF signals in the 2.4GHz range pass through walls, meaning that devices in adjacent rooms can appear nearer than those within the same space.

• Wi-Fi angle-of-arrival (AoA)-based approaches have been demonstrated to give reasonable localisation accuracy (<1m) using commodity hardware [2], but this typically requires multiple anchors at known reference points. In a peer-to-peer scenario, RSS-based distance estimates are subject to the same spatial fluctuations and room ambiguity as BLE. Moreover, the slow rate of Wi-Fi scans makes it too high-latency for scenarios involving dynamic positioning, and the requisite APIs are often not exposed to apps on mobile devices.

• Near-field communication (NFC) radios are available on more recent Sonos devices, and offer industry-standard security for short-range exchanges. However, the ambition of the Setup and Sound Swap use cases was to offer an approach to pairing and proximity that is supported across the entire range of Sonos speakers, including those pre-dating NFC. The ideal approach should allow casual interactions at a 1m range.
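To illustrate why the RSS fluctuations mentioned in the Bluetooth LE point make ranging unreliable, here is a minimal sketch that inverts the standard log-distance path-loss model. The reference level at 1m (`rss_at_1m_dbm`) and the path-loss exponent are assumed values chosen for illustration, not measurements from Sonos hardware.

```python
def estimate_distance_m(rss_dbm, rss_at_1m_dbm=-59.0, path_loss_exponent=2.0):
    """Invert the log-distance path-loss model to get a range estimate in metres.

    rss_at_1m_dbm and path_loss_exponent are assumed calibration values.
    """
    return 10 ** ((rss_at_1m_dbm - rss_dbm) / (10 * path_loss_exponent))


# The same device, with the reading depressed by increasing amounts of fading.
nominal_rss_dbm = -65.0
for fade_db in (0, 10, 30):
    estimate = estimate_distance_m(nominal_rss_dbm - fade_db)
    print(f"fade of {fade_db:2d} dB -> estimated range {estimate:.1f} m")
```

Under these assumptions, a 30dB swing changes the range estimate by a factor of roughly 30, which is far too coarse to decide whether a speaker is within a few metres.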

The proposed solution was to make use of an alternative channel that was available to almost all mobile handsets and Sonos devices: harnessing the on-board speakers and microphones as a data transmission medium, encoding data in audio signals that could be sent over the air and decoded at the receiver's microphone. In the Setup case, the receiving device would be the microphone of the user's mobile device; for Sound Swap, the portable speaker performing the audio exchange.

Our team already had significant expertise in this area, having joined Sonos from Chirp, a UK startup whose heritage was in applying acoustic communication to device-to-device data transmission, so a firm foundation was already in place to extend this capability across the Sonos platform.

There are a number of benefits of using the acoustic channel for data transmission:

• No additional hardware required: As audio I/O is already available across almost all Sonos devices, audio-based communication could be deployed using a software-defined networking stack without any hardware additions. Devices already in the field would be instantly upgraded with this new capability. By designing the acoustic transmission protocols to tolerate extreme variances in speaker and microphone frequency response, this support could extend to every mobile device running the Sonos app.

• Frictionless user experience: Acting as a one-shot broadcast signal with an uncongested medium, audio transmission can eliminate the need for pairing handshakes and passwords, reducing user input and potential points of failure. It is also less reliant on precise physical alignment than NFC or QR codes. In a previous study [3], we demonstrated two different facets to the efficiency of data-over-sound versus other methods: (1) that it is less error-prone to user handling traits; and (2) that it is deterministic in its transmission time, particularly compared with Bluetooth, whose handshake duration can be increased dramatically by the presence of other nearby devices.

• Spatially bounded: In contrast with the radio frequency transmissions used by Wi-Fi and Bluetooth, high-frequency audio is heavily attenuated by walls and other room boundaries. This means that it can be used to infer whether a receiving device is within the same room as the transmitter, a useful heuristic for presence detection tasks such as Sound Swap.

By implementing the transport components of the acoustic transmission library as a conduit for arbitrary packets of bytes, a well-designed implementation can act as a generic physical layer in an OSI networking stack; the schema shown in the figure below should look familiar from any networking textbook. For secure transmissions and to prevent replay attacks, security mechanisms such as RSA or time-based one-time passwords (TOTP) can be added in an application layer on top of this transport, just as in any well-behaved network stack.
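As a rough illustration of how a time-based one-time password could sit in an application layer above a byte-oriented transport, the sketch below implements the standard TOTP construction (RFC 6238, HMAC-SHA1) using only the Python standard library. The function names, the shared secret, and the idea of sending the token as the acoustic payload are assumptions made for this example; the article does not describe the actual Sonos protocol.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step_s: int = 30, digits: int = 6) -> str:
    """Compute an RFC 6238 time-based one-time password."""
    counter = unix_time // step_s
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): the low nibble of the last byte picks an offset.
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify(secret: bytes, token: str, unix_time: int, step_s: int = 30) -> bool:
    """Accept a token only if it matches the current time step."""
    return hmac.compare_digest(token, totp(secret, unix_time, step_s))


secret = b"12345678901234567890"             # hypothetical shared secret
token = totp(secret, unix_time=59)           # token the transmitter would send
print(verify(secret, token, unix_time=59))   # same time step: accepted
print(verify(secret, token, unix_time=119))  # replayed two steps later: rejected
```

Because the token depends on the current time step, a recording of an earlier transmission stops being valid once that step has passed, which is the replay protection referred to above.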

Moreover, by leveraging the narrow slice of the frequency band that’s above the typical range of human hearing (>19kHz) but within the performance spec of most consumer audio devices (<20kHz), acoustic data transmission could be achieved in a way that would be audible to machines but inaudible to humans. This resonates with the “calm technology” [4] principles of the Sonos user experience.

## The research challenges

Acoustic transmission dates back to the dawn of computing, with the Bell 101 modem sending data thousands of miles across the NORAD network back in 1958. However, sending acoustic signals over the air, into the chaotic acoustics of the real world, using devices whose audio hardware was not designed for precise and accurate rendering of digitally-encoded signals, poses a number of major new challenges.

1. Real-world noise and acoustics

The dominant challenge to acoustic communications is the noisy, reverberant nature of real-world acoustic environments and the distortion that it introduces to the signal.

As the original signal is passed from transmitter to receiver, it encounters multiple sources of potential noise and corruption: device loudspeakers, microphones and DACs/ADCs can introduce colouring from their frequency responses, further affected by directivity and filtering from the device’s enclosure; the reverberation of room acoustics introduces multi-path delays and constructive/destructive phase interference, leading to frequency-dependent comb filtering; and the background noise of unpredictable acoustic environments reduces the signal-to-noise ratio at the receiving device, sometimes obscuring the signal altogether.

Below is a sequence of plots illustrating what happens to the signal as it passes through each point in the transmission chain. On the left are the time-amplitude waveform representations, and on the right are the time-frequency spectrograms. The signal itself is encoded using a modulation scheme called m-ary frequency shift keying (m-FSK), with each discrete frequency roughly corresponding to a different integer in the payload.

It's apparent that the receiving device may have a difficult task in decoding the final signal against this level of distortion. For this reason, much of our research focuses on the decoding parts of the pipeline.
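For readers unfamiliar with m-FSK, the following is a minimal sketch of the idea on an ideal, noise-free channel: each symbol selects one tone, and the receiver picks the strongest candidate tone in each frame using the Goertzel algorithm. The sample rate, base frequency, tone spacing, alphabet size and symbol length are all assumed values for illustration and are not the parameters of the Sonos protocol.

```python
import math

SAMPLE_RATE = 48_000   # Hz (assumed)
BASE_FREQ = 19_000.0   # Hz, tone used for symbol 0 (assumed)
SPACING = 50.0         # Hz between adjacent tones (assumed)
NUM_TONES = 16         # m = 16, so each symbol carries 4 bits (assumed)
SYMBOL_SAMPLES = 960   # 20 ms per symbol at 48 kHz (assumed)


def modulate(symbols):
    """Map each integer symbol to a burst of a single sine tone."""
    samples = []
    for symbol in symbols:
        freq = BASE_FREQ + symbol * SPACING
        samples.extend(
            math.sin(2 * math.pi * freq * n / SAMPLE_RATE)
            for n in range(SYMBOL_SAMPLES)
        )
    return samples


def tone_power(frame, freq):
    """Goertzel algorithm: signal power at one frequency within a frame."""
    coeff = 2 * math.cos(2 * math.pi * freq / SAMPLE_RATE)
    s1 = s2 = 0.0
    for x in frame:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def demodulate(samples):
    """Split into symbol-length frames and pick the strongest tone in each."""
    symbols = []
    for start in range(0, len(samples) - SYMBOL_SAMPLES + 1, SYMBOL_SAMPLES):
        frame = samples[start:start + SYMBOL_SAMPLES]
        powers = [tone_power(frame, BASE_FREQ + k * SPACING) for k in range(NUM_TONES)]
        symbols.append(powers.index(max(powers)))
    return symbols


print(demodulate(modulate([0, 5, 15, 3])))  # [0, 5, 15, 3]
```

On a clean channel the round trip is exact. The transducer colouring, reverberation and background noise described above are what make the real decoder considerably harder to build.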

To address the issue, we take a multi-stage approach to decoding which attempts to address each of these sources of distortion:

• a dereverberation unit suppresses the per-bin contribution of room energy, utilising a lightweight algorithm that is designed to be as efficient as possible for embedded devices

• a spectral equaliser addresses frequency-dependent filtering, with controllable parameters for different classes of transducer and room

• a time alignment algorithm sequences the symbols for decoding with the objective of maximising the input energy and minimising the overlap between symbols, reducing the impact of background noise

• finally, incorporating error correction and detection into the protocol ensures that, if some of the tones are obscured by background noise, the remainder of the signal may still be decoded successfully
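The article does not say which error-correcting code the protocol uses. Purely to illustrate the principle in the last point, the sketch below uses a Hamming(7,4) code, chosen for brevity, which adds three parity bits to every four data bits and can repair any single corrupted bit in the resulting seven-bit codeword.

```python
def hamming74_encode(nibble):
    """Encode four data bits as a seven-bit Hamming codeword."""
    d1, d2, d3, d4 = nibble
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    # Parity bits sit at positions 1, 2 and 4 (1-indexed).
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one flipped bit and return the four data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome spells out the 1-indexed position of the bad bit (0 = none).
    error_position = s1 + 2 * s2 + 4 * s3
    if error_position:
        c[error_position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


sent = hamming74_encode([1, 0, 1, 1])
received = list(sent)
received[4] ^= 1  # simulate one bit obscured by noise
print(hamming74_decode(received) == [1, 0, 1, 1])  # True
```

A production protocol working on multi-bit FSK symbols would more plausibly use a symbol-oriented code, but the trade-off is the same: extra redundancy in exchange for tolerance to partially obscured transmissions.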

To maximise signal-to-noise ratio at the receiver, the signal transmission strength is calibrated to be as high as possible whilst remaining below the threshold of audibility for near-ultrasonic audio in young human ears [5]. The signal may be audible to household pets, but it is still played at a relatively low volume compared to music — so, to a dog or a cat, the transmission sounds like nothing more than a tiny high-frequency melody.

2. The Doppler effect

In many of the scenarios targeted by Sound Swap, the user may be in motion, carrying the portable device when they initiate the swap. This introduces a second form of distortion to the signal that is not intuitively obvious at first glance: the Doppler effect.

Similar to the effect of an ambulance siren passing by on the street, the motion of the transmitter relative to the receiver introduces a frequency shift to the audio signal. At audible frequencies, and at walking pace, this isn’t typically an issue — or we would hear people’s voices shift up and down as they walked past us. However, the Doppler effect is linearly proportional to the rate of movement and the frequency of the signal. At the near-ultrasonic frequencies used by these transmissions, this starts to become significant.

For example, if the transmitter is playing a sinusoidal tone of frequency 19kHz, and moving at a walking pace of 1m/s towards a static receiver, we can calculate the expected Doppler shift, where Δf is the expected change in received frequency, f₀ is the emitted carrier frequency, v is the velocity of the transmitter towards the receiver, and c is the speed of sound (343m/s at 20°C):

$$
\begin{aligned}
\Delta f &= \frac{v}{c} f_0 \\
         &= \frac{1}{343} \cdot 19000 \\
         &\approx 55.4\,\text{Hz}
\end{aligned}
$$

The received frequency will therefore be around 19055.4Hz. Although this may seem like a relatively small shift, particularly on the logarithmic axis used in perceptual frequency scales, the frequency-to-symbol demapper used by the demodulator works on a linear index, with evenly-spaced boundaries between 19kHz and 20kHz.

As illustrated in Figure 3 above, each frequency bin can be thought of as corresponding to an integer within the payload. If the frequency band is divided into equal bands of 50Hz each, a shift of 55.4Hz would thus cause symbols to be misinterpreted as the subsequent bin, so that transmitting a symbol of 0 will be received as a 1, 1 → 2, 2 → 3, …
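The arithmetic above can be checked with a short sketch. It assumes that symbol k is transmitted at 19kHz plus k times 50Hz and that the demapper picks the nearest tone; both are simplifications for illustration, and the function names are hypothetical.

```python
SPEED_OF_SOUND = 343.0  # m/s at 20 °C


def doppler_shift_hz(carrier_hz, velocity_ms, c=SPEED_OF_SOUND):
    """Approximate Doppler shift for a source moving towards a static receiver."""
    return velocity_ms / c * carrier_hz


def demap(freq_hz, base_hz=19_000.0, spacing_hz=50.0):
    """Nearest-tone demapper on a linear frequency grid (assumed layout)."""
    return round((freq_hz - base_hz) / spacing_hz)


print(f"shift at 19 kHz, 1 m/s: {doppler_shift_hz(19_000.0, 1.0):.1f} Hz")

for symbol in range(3):
    sent_hz = 19_000.0 + 50.0 * symbol
    received_hz = sent_hz + doppler_shift_hz(sent_hz, 1.0)
    print(f"sent {symbol} -> decoded {demap(received_hz)}")
```

With 50Hz spacing, every symbol in this example is decoded as its upper neighbour, matching the 0 → 1, 1 → 2, 2 → 3 pattern described above.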

We evaluated a number of different approaches to tackling Doppler shift, including spread-spectrum modulation schemes [6] and frequency compensation with pilot subcarriers [7]. However, for the low-bitrate, low-bandwidth, high-noise channel that we are targeting, the most effective solution turns out to be the simplest: maximising the inter-symbol frequency interval, and introducing a short guard period between adjacent frequencies, ensuring that motion within the likely velocity range will not prevent the symbols from being classified successfully.
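As a rough worked example of how the inter-symbol interval might be sized, suppose the demapper picks the nearest tone, so that a symbol is still classified correctly while the Doppler shift stays below half the tone spacing $\Delta_s$. The maximum velocity $v_{\max} = 2\,\text{m/s}$ and the top frequency $f_{\max} = 20\,\text{kHz}$ used here are assumptions for illustration, not the figures used in the product:

$$
\begin{aligned}
\Delta f_{\max} &= \frac{v_{\max}}{c} f_{\max} = \frac{2}{343} \cdot 20000 \approx 116.6\,\text{Hz} \\
\Delta_s &> 2\,\Delta f_{\max} \approx 233\,\text{Hz}
\end{aligned}
$$

Under these assumptions the spacing would need to be several times wider than the 50Hz of the earlier example, which shows how directly the Doppler margin trades against the number of tones that fit into the narrow near-ultrasonic band.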

3. Channel contention and many-device households

The final challenge for the Sound Swap communication flow was to address the scenario in which a user has many Sonos speakers around their household. In this situation, multiple players might be broadcasting their identifiers simultaneously, introducing the likelihood of interference between signals.

In radio frequency communication, collisions such as this are typically addressed by introducing a channel-sharing scheme, which permits multiple devices to share the airwaves. Common schemes include:

• Frequency-division multiple access (FDMA): Split the frequency band into equally-sized slots, and allocate a slot to each transmitter. This is a low-complexity scheme where airspace is abundant, but is not viable here because the near-ultrasonic frequency band is narrow, a limitation compounded by the Doppler constraints.

• Time-division multiple access (TDMA): Divide time into slots, and allow each transmitter to only broadcast in an empty slot (either with a centralized coordinator or by allowing each transmitter to sense whether the channel is available). This effectively requires the transmitters to broadcast sequentially, which would multiply the transmission time and add unacceptable amounts of latency to the user-facing Sound Swap experience.

• Code-division multiple access (CDMA): Design a coding system such that multiple transmitters can broadcast simultaneously. This is the scheme we have adopted, as it permits multiple concurrent transmissions to be decoded separately at the receiver. The cost is an additional reliance on error correction and a slightly higher per-packet broadcast time, but it still allowed us to stay within the target communication time (750ms, with a maximum budget of 1000ms) whilst minimising the likelihood of failed decodes.
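To show the principle behind code-division sharing, here is a textbook sketch using two orthogonal Walsh codes. It assumes that both transmitters are chip-synchronous and that their signals simply add in the channel. The code design actually used for Sound Swap is not described in this article, so the codes and function names below are illustrative only.

```python
# Two orthogonal spreading codes (rows of a 4x4 Walsh-Hadamard matrix).
CODE_A = [1, -1, 1, -1]
CODE_B = [1, 1, -1, -1]


def spread(bits, code):
    """Replace each bit with the code (for 1) or its negation (for 0)."""
    chips = []
    for bit in bits:
        level = 1 if bit else -1
        chips.extend(level * chip for chip in code)
    return chips


def despread(signal, code):
    """Correlate each code-length block against the code to recover the bits."""
    n = len(code)
    bits = []
    for start in range(0, len(signal), n):
        block = signal[start:start + n]
        correlation = sum(s * chip for s, chip in zip(block, code))
        bits.append(1 if correlation > 0 else 0)
    return bits


bits_a = [1, 0, 1, 1]
bits_b = [0, 0, 1, 0]

# Both transmitters broadcast at once; the channel sums their signals.
mixed = [a + b for a, b in zip(spread(bits_a, CODE_A), spread(bits_b, CODE_B))]

print(despread(mixed, CODE_A) == bits_a)  # True
print(despread(mixed, CODE_B) == bits_b)  # True
```

Because the two codes are orthogonal, each correlation cancels the other transmitter's contribution. The price is that every bit takes four chips to send, which corresponds to the higher per-packet broadcast time mentioned above.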

Adopting CDMA in the communication flow immediately created a complex test matrix, requiring tests with varying numbers of devices transmitting at differing ranges. This posed the additional challenge of requiring all transmitters to broadcast at an equivalent sound pressure level (SPL). That topic is beyond the scope of this article, but for the interested reader, issues around test and QA will be addressed in a later blog post by the Software Test team.

## Deploying in the wild

With a field-tested suite of algorithms and communication profiles, the final phase was to port the code to the Sonos players' embedded processors. For this, Advanced Technology has an internal group of expert engineers under the name of Research Operations, whose remit includes translating functional prototypes into production-ready code, incorporating platform-specific hardware optimisations to maximise performance and minimise energy usage.

The new setup process launched in stages from January to April 2021, and near-ultrasonic communication is now used to facilitate a friction-free PIN exchange thousands of times a day. The impact and success of the new communication flow is substantial: from the metrics of the first 500,000 setups in the wild, the median time saved by the audio-based setup is 38s per transaction.

Sound Swap launched with the release of Roam in April 2021, and has been a headline feature, recognised as a new way to interact with the smart home. We believe it is a small but important step towards easier and calmer interaction with music around the home, spreading music from room to room like a candle of light. The technology behind it is complex, but this is surely what technology should be: an invisible hand that extends and augments our relationship with the world, making complex things easy, and adding moments of delight to our own daily rhythms.

## References

[1] Faragher, R. and Harle, R. (2015) ‘Location Fingerprinting With Bluetooth Low Energy Beacons’, IEEE Journal on Selected Areas in Communications, 33(11), pp. 2418–2428. doi:10.1109/JSAC.2015.2430281.

[2] Kotaru, M. et al. (2015) ‘SpotFi: Decimeter Level Localization Using WiFi’, in Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication. SIGCOMM ’15: ACM SIGCOMM 2015 Conference, London United Kingdom: ACM, pp. 269–282.

[3] Mehrabi, A. et al. (2020) ‘Evaluating the user experience of acoustic data transmission: A study of sharing data between mobile devices using sound’, Personal and Ubiquitous Computing, 24(5), pp. 655–668.

[4] Case, A. (2015). Calm technology: principles and patterns for non-intrusive design. O'Reilly Media, Inc.

[5] Rodríguez Valiente, A. et al. (2014) ‘Extended high-frequency (9–20 kHz) audiometry reference thresholds in 645 healthy subjects’, International Journal of Audiology, 53(8), pp. 531–545.

[6] Doroshkin, A.A. et al. (2019) ‘Experimental Study of LoRa Modulation Immunity to Doppler Effect in CubeSat Radio Communications’, IEEE Access, 7, pp. 75721–75731.

[7] Ebihara, T. and Leus, G. (2016) ‘Doppler-Resilient Orthogonal Signal-Division Multiplexing for Underwater Acoustic Communication’, IEEE Journal of Oceanic Engineering, 41(2), pp. 408–427.