October 13, 2021

Optimising a neural network for inference

Mathieu Poumeyrol

Distinguished Software Engineer, Sonos Voice Experience

This is the first of three posts that will dive into some of the intricacies that come with running neural networks on small CPUs. We want to share what we’ve learned about computing while designing and implementing tract, an open source software solution developed by Sonos for embedded neural network inference, and talk about the challenges encountered.

This first post exposes the overall strategy and architecture of tract, discussing why efficiency is paramount for embedded systems. It explains how neural networks live two lives, one when they are trained and one when they are used for inference. The second post will focus on the most critical component for performance when running neural networks: a matrix multiplier. Finally, the third post will dive deeper into the matrix multiplier and how it can be optimised, first for a family of processors, and then for specific members of this family.

Past-century Science Fiction promised us a world full of flying cars, talking computers, robot assistants, and wearable technology. Even just two decades ago, these wild ideas seemed impossibly out of reach.

Over that time, though, artificial intelligence and smart objects have advanced so far that this is becoming our reality. Maybe we're not escaping from Dr. Claw, but we walk around daily wearing Penny Gadget's watch and carrying her computer book. Facial detection and recognition, no matter how biased and unethical, have arrived and are fighting to stay. Self-driving cars finally feel within reach. Flying hoverboards are late, but we'll get there eventually.

While Science Fiction preferred to focus on the trope of the AI run amok, one thing that was less obvious to novelists and screenwriters was how computationally expensive it would be to run such an AI. Neural network theory was drafted decades ago, but it took that much time for technology and usage to make it applicable and practical. Today, it is the wide increase in computing power and available data that makes some of these wild ideas possible.

In order to cope with the computing load, some devices just act as remote sensors for an AI that actually lives “in the cloud”. And, of course, there have been a number of fiction works that explore using the omniscient cloud as a plot device, too.

Sometimes using the cloud is just not an option. Self-driving cars can’t stop driving when entering a tunnel. People shouldn’t get locked out of their homes when a datacenter on the other side of the world gets in trouble. And some of us would just feel a tad more comfortable interacting with only the devices we own instead of sharing part of our lives with the ominous cloud.

Obviously, running Artificial Intelligence and neural networks locally, on a device, comes with its own set of challenges. Sonos is investing a lot of time and energy into tract, a software solution for embedded neural network inference. It is published as open source, meaning that anybody is free to audit it, use it or contribute.

This first post gives an overview of the global architecture and design of tract: efficiency, formats, and more generally the benefits of also considering the neural network inference task, as distinct from training.

A plea for efficiency

Most of the recent advances in Artificial Intelligence come from the progress around deep neural networks. While the theory has been there for ages, the last two decades have brought us the vast amount of computing power and the oceans of data required to train them.

In some cases, inference is done in the cloud: user data will be sent to the cloud, go through the model there, and the result will be sent back to the end-user’s device. The other option is to ship the model to the device, and run it locally. There are pros and cons to this local approach. On one hand, this is a pretty good way to preserve privacy, as the end-user data will never have to travel from the device to the cloud. It also saves a bit of latency, and is obviously less prone to connectivity issues. On the other hand, well, it has to fit on the device.

One of the big buzzwords over the past two decades is “scalability”. Scalability means that infrastructure can grow to keep up with the demand. Running a scalable cloud infrastructure means you commission more servers to accommodate the pressure generated by your ever growing fleet of devices. It will cost you money, but you’ll meet the demand.

In the process of looking for scalability, as an industry, we have sometimes lost track of another key metric of a system: efficiency. Efficiency means doing more with the same hardware. If you’re running a cloud, efficiency is a net gain, even if your infrastructure is scalable without limit. Gaining 10% in efficiency means running roughly 10% fewer servers, which costs roughly 10% less money and saves roughly 10% of the energy.

On a user’s device, efficiency is an even more critical factor — you obviously cannot scale the hardware CPUs. In the context of retrofitting new usages and applications into existing deployed devices, gaining efficiency will often be the difference between whether it works or not.

Efficiency gains are beneficial at all stages of a product lifetime. During the design stage, it can reduce the manufacturing cost by scaling down processors. Later in the product life cycle, efficiency gains can unlock bigger models — increasing quality, and, in extreme cases, unlocking unplanned features and prolonging a product’s lifetime.

Training and inference

Training neural networks for a task is orders of magnitude more expensive than actually running them. The training happens once, usually in the cloud or in a data center. It makes extensive use of huge databases and hardware managed by the organisation designing and training the network. Then, the trained model is frozen and used to perform “inference” for the end users. During “inference”, the model is read-only — it just does its task on the user data. The parameters are fixed, and it will not “learn” anything more.

Training models is a daunting and humbling task, while inference is comparatively straightforward. Model designing and training is also where most of the research in the field happens. As a result, software offerings in the neural network space are structured by the training frameworks: for any other piece of software in the ecosystem, a key characteristic is which training framework it is designed to work with. Google really made a big splash with TensorFlow, aggregating a huge part of the community around it.

Training frameworks are technically capable of running inference: you actually need to run inference at each step of the training process. At each step, once the training framework has run inference, it will perform backpropagation: adjust the parameters in the model a tiny bit in a direction that would improve the result. This typically involves at least as much code as the inference. And, on top of that, the framework must provide the plumbing and machinery that will drive the training: present the example data to the network and monitor the training progression, potentially running all of that on a distributed system.
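To make the relationship between the two concrete, here is a minimal sketch of a training step wrapped around an inference call. The toy model (one weight, one bias), the squared-error loss and the names `infer` and `training_step` are assumptions made for illustration; they are not taken from any real framework.

```python
# Hypothetical toy model: y = w * x + b, with a single weight and bias.

def infer(w, b, x):
    """Forward pass: this is all an inference engine has to do."""
    return w * x + b

def training_step(w, b, x, target, lr=0.1):
    """One training step: run inference, then adjust the parameters a bit."""
    y = infer(w, b, x)              # inference happens inside training
    error = y - target
    # Gradients of the loss 0.5 * error**2 with respect to w and b.
    grad_w = error * x
    grad_b = error
    # Move each parameter slightly in the direction that reduces the loss.
    return w - lr * grad_w, b - lr * grad_b

# The "plumbing": present example data repeatedly (samples of y = 2x + 1).
samples = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = 0.0, 0.0
for _ in range(500):
    for x, target in samples:
        w, b = training_step(w, b, x, target)

# Once training is done, only the frozen w, b and infer() are needed.
prediction = infer(w, b, 4.0)
```

Everything except `infer` and the two frozen parameters can be dropped once the model ships, which is the part an inference engine cares about.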

During model designing and training, machine learning teams focus on accuracy of prediction. While the overall computing budget is a known constraint in the background, the goal is to figure out the best model design and the best training process to get the best accuracy.

During inference, efficiency is key. Model and hardware are fixed entities at this stage: the issue is to make the given model run on the given hardware as efficiently as possible. First, to fit on the hardware, and later on, to free as much resource as possible for future evolutions.

Once the network is trained and frozen, training contingencies disappear. With them, a lot of abstractions that are useful for model design and training become redundant: when performing the multiplication of two values, a CPU will not care too much which high level neural network concept (like a convolution, or a normalisation layer) this operation belongs to. As a matter of fact, for inference, we prefer the multiplications in both kinds of layers to use the same code, as they are identical from a computing point of view. We want to switch the neural network design abstractions for another set of abstractions, mostly revolving around tensor arithmetic and manipulation. Then, we try to make this smaller set of operations as efficient as possible.
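As an illustration of two high level layers sharing one low level routine, the sketch below runs both a dense layer and a 1D convolution through the same `matmul` function. Rewriting the convolution as a product over sliding windows (often called im2col) is a common technique chosen here for the example; the text does not say this is how tract does it, and all names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def dense(inputs, weights):
    """A dense layer is directly a matrix product."""
    return matmul(inputs, weights)

def conv1d(signal, kernel):
    """'Valid' 1D convolution (cross-correlation, as NN frameworks define it).

    Each sliding window of the signal becomes one row of a matrix, so the
    whole convolution is a single call to the same matmul routine.
    """
    k = len(kernel)
    windows = [signal[i:i + k] for i in range(len(signal) - k + 1)]
    kernel_column = [[w] for w in kernel]
    return [row[0] for row in matmul(windows, kernel_column)]

dense_out = dense([[1, 2]], [[3, 4], [5, 6]])
conv_out = conv1d([1, 2, 3, 4], [1, 0, -1])
```

With this structure, any effort spent making `matmul` faster benefits both layers at once.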

Neural network formats

After TensorFlow's big splash, other organisations started to organise so their contributions could have more impact on the ecosystem. One of these efforts is Open Neural Network eXchange, aka ONNX. ONNX is mostly driven by software companies, including giants like Facebook and Microsoft. It was designed as an interoperable format to link PyTorch and Caffe2, in the hope of bringing the two ecosystems and communities together in the struggle to exist against TensorFlow.

ONNX still adopts many TensorFlow operators, sometimes taking the opportunity to fix a design error here and there. It has a strong versioning system, and the co-design approach leads to a slower pace of evolution than proprietary frameworks. This makes it possible for third parties to try and play in the ecosystem: following TensorFlow’s frenzied evolution is not possible for a small team, whereas ONNX evolution is slower and more concerted. But ONNX is still very much oriented towards model design and training. It still features lots of operators that are redundant from the perspective of an inference engine implementation.

Neural Network Exchange Format, or NNEF, finally, is the product of inference engine implementers, both software and hardware, taking the problem into their own hands and defining a neural network description format that focuses exclusively on inference. In the same spirit as ONNX, NNEF is an exchange format: it does not provide an implementation for inference engines, it is just a specification. But in NNEF, primitives are low level tensor operations, like matrix product, convolution, or shape transformations. These are just enough to express what has to be done during inference for neural network design concepts like batch normalisation or dropout, but only by composing simple primitives. The special tricks these operators do during training are irrelevant and erased.

To sum up, TensorFlow is proprietary, evolves at a breakneck pace and features hundreds of operators. ONNX is somewhat smaller and slower paced, sometimes distilling or generalising TensorFlow or PyTorch operators a bit before adopting them. Both are very training-oriented. To some extent, networks can be converted from one format to the other. On the other hand, NNEF aims at being as stable and small as possible. Conversion from ONNX or TensorFlow to NNEF is possible (to some extent), but the other way around would barely make sense at all, since NNEF uses a lower-level representation where the training semantics are erased.

Architecture of a neural network inference stack

With this landscape in mind, we can try and design a neural network inference software stack. However, choosing the right model serialization format is complicated. The NNEF format would be close to ideal for the purpose of inference, but it is not mainstream enough: most software integrators expect support for ONNX and/or TensorFlow out of the box. But translating these training formats to NNEF is not always possible: some features or operators are missing.

So, we introduced our own format, tract-opl (OPerational Language). It is semantically close to NNEF, focusing on simple operations without considering the high-level training features that ONNX and TensorFlow formats encode. It is designed as a set of NNEF extensions: tract can actually serialize tract-opl to plain NNEF if the model does not use any features or operators that NNEF does not include. This also means that tract can convert from ONNX and TensorFlow to NNEF.

This design is not what we initially intended: tract started as a monolithic TensorFlow frozen model interpreter. At some point, we began to look at tract as a compiler, or an interpreter, for a more inference-oriented neural network format, and we drew comparisons and analogies with modern compilers, interpreters and virtual machines.

Using an LLVM analogy, for instance, tract-opl is similar to the “Intermediate Representation”, while ONNX and TensorFlow converters act as compiler frontends (like clang or rustc). The intermediate representation is very close to NNEF: dumping to or loading from NNEF (and some extensions) is easy. Most optimisations are performed on the intermediate form, and later on, a backend (akin to a code generator for Intel or ARM) is picked, coming with its own set of optimisations.

Translating to tract-opl and "decluttering"

The tract intermediate form, tract-opl, is semantically close to NNEF. Converting from ONNX or TensorFlow to tract-opl erases their training semantics. We call this process “decluttering” since, from a tract point of view, the idiosyncrasies of training are useless and redundant.

Decluttering can be as simple as translating operators by substitution: for example, a training operator like BatchNormalisation is re-expressed as two arithmetic operations, Mul and Add. But some of the “clutter” is more subtle. The batch axis is a good example. It is standard in training networks for the outer axis of every tensor to be a “batch” axis: several samples of data are presented to the network simultaneously by stacking the independent inputs on this axis. Convolution in TensorFlow, ONNX or even NNEF cannot be encoded without the batch axis.

But at inference, we often only process one input at a time, making the batch axis useless and redundant. So tract tries as best as it can to remove unused axes like these from the graph. The same applies to axis permutation: switching axis order is sometimes required to satisfy the definition of an ONNX or TensorFlow operator, but the equivalent tract-opl operators may be simpler and more flexible, with fewer semantics attached to specific axes. Tract has the ability to eliminate these expensive transpositions.
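The BatchNormalisation substitution mentioned above can be sketched as follows. Once training is over, the mean, variance, gamma and beta of a batch normalisation layer are constants, so they can be folded ahead of time into one multiplier and one offset. The function names and the single-channel setup are assumptions made to keep the example short; this is not tract's actual code.

```python
import math

def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    """Batch normalisation at inference time, written as it is defined."""
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in x]

def fold_batch_norm(gamma, beta, mean, var, eps=1e-5):
    """Precompute the two constants that replace the whole operator."""
    scale = gamma / math.sqrt(var + eps)
    shift = beta - mean * scale
    return scale, shift

def mul_add(x, scale, shift):
    """The decluttered form: a Mul followed by an Add."""
    return [v * scale + shift for v in x]

x = [0.5, -1.0, 2.0]
params = dict(gamma=1.5, beta=0.2, mean=0.3, var=4.0)

reference = batch_norm(x, **params)
scale, shift = fold_batch_norm(**params)
decluttered = mul_add(x, scale, shift)
```

Both forms give the same values up to floating point rounding, but the second one no longer needs a dedicated operator, a square root or a division at run time.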

Recurring operators

Decluttering recurring operators is a similar process of simplifying to lower level semantics: for instance, tract-opl does not know about LSTM. It supports it by using the “Scan” operator from ONNX. “Scan” is a relatively complex operator: it has an internal subgraph, and will scan one or more input tensors on a specified axis while looping over that internal subgraph. Some inputs and outputs of the internal subgraph can be paired and declared state vectors: at each step, the operator will feed the subgraph whatever value was output at the previous turn for the paired output, maintaining state and effectively implementing the recurring aspect of the operator. In other words, the “Scan” operator is similar to the for-loop construct in imperative languages.

This abstraction is enough to translate LSTM (or GRU, or RNN), while also allowing the network designer to tweak the cell definition as much as they see fit. There is a large variety of LSTM variants in the ecosystem, and tract obstinately refuses to pick sides. It’s up to the machine learning engineer to pick their favourite definition or actually invent any arbitrary recurring operator. ONNX or TensorFlow LSTM cells are macros that get expanded to the right Scan operator.
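The for-loop analogy for Scan can be made explicit with a small sketch. The `scan` function below is a hypothetical simplification with one state value and one scanned input; it is neither the ONNX operator signature nor tract's API. The `body` argument plays the role of the internal subgraph.

```python
def scan(body, initial_state, frames):
    """Loop `body` over the input frames, threading the state through.

    `body` takes (state, frame) and returns (new_state, output). The state
    it returns at one step is fed back to it at the next step, which is the
    pairing of a subgraph output with a subgraph input described above.
    """
    state = initial_state
    outputs = []
    for frame in frames:
        state, out = body(state, frame)
        outputs.append(out)
    return state, outputs

# A trivial "cell": the state is a running sum of the frames seen so far.
def running_sum(state, frame):
    new_state = state + frame
    return new_state, new_state

final_state, outputs = scan(running_sum, 0, [1, 2, 3, 4])
```

An LSTM, GRU or custom cell would only change `body`; the looping machinery stays the same, which is why a single operator is enough to cover all the variants.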

The Scan operator has more tricks up its sleeve. Many of the recurring operators' inner loops start by concatenating the state vector from the previous loop with an input frame, then perform a linear operation: multiply this vector by a matrix of trained weights. Tract has been made able to detect this pattern, split the big product and concatenations into smaller components, and then extract the non-recurring half of the product out of the loop. That way, we trade many matrix-times-vector products for one matrix-times-matrix product. It does not sound like much: both forms will actually perform the exact same number of individual multiplications and additions, and of course yield the same values. But the matrix-times-matrix technique gives more opportunity for instruction vectorisation. In turn, this can yield a sizeable speed-up: on Intel, our matrix-times-matrix is roughly four times faster than our matrix-times-vector.
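Here is a minimal sketch of that transformation on a purely linear toy cell. It relies on the identity W·[h; x] = W_h·h + W_x·x, where W_h and W_x are the columns of W that apply to the state and to the input frame. The names, the shapes and the absence of an activation function are simplifying assumptions; a real cell would apply its non-linearity after the sum, which does not affect the split.

```python
def matvec(m, v):
    """Matrix-times-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Matrix-times-matrix product on lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def naive_loop(w, frames, h0):
    """Each step multiplies W by the concatenation of state and frame."""
    h = h0
    for x in frames:
        h = matvec(w, h + x)       # list concatenation: [h; x]
    return h

def hoisted_loop(w, frames, h0):
    """Same result, with the input half of the product moved out of the loop."""
    n = len(h0)
    w_h = [row[:n] for row in w]   # columns applied to the state
    w_x = [row[n:] for row in w]   # columns applied to the input frame
    # One matrix-times-matrix product covering every frame at once.
    w_x_t = [list(col) for col in zip(*w_x)]
    projected = matmul(frames, w_x_t)
    h = h0
    for p in projected:
        # Only the recurring half of the product remains inside the loop.
        h = [a + b for a, b in zip(matvec(w_h, h), p)]
    return h

w = [[1, 0, 2, -1, 0],
     [0, 1, 1, 1, -2]]             # 2 state values, 3 input values per frame
frames = [[1, 2, 3], [0, 1, 0], [2, 0, 1]]
h0 = [0, 0]

same = naive_loop(w, frames, h0) == hoisted_loop(w, frames, h0)
```

The recurring product still has to run once per frame, because each step needs the previous state. The input projection has no such dependency, so it can be batched across all frames.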

Coming up

This post is an overview of the overall architecture of tract, explaining the differences between training and inference and how they matter. The next part will dive into the component that is the most critical when running a neural network efficiently: the matrix multiplication and convolution engine.
