Tech Blog
Machine Learning, Open Source
December 6, 2021

# Assembly still matters: Cortex-A53 vs M1

Mathieu Poumeyrol

Distinguished Software Engineer, Sonos Voice Experience

This post is the third and final part of a series about tract, Sonos’ open source neural network inference engine. The first post of the series discussed some high-level aspects of running neural network inference on a modest processor. Next, the second one explained the general architecture of an efficient matrix multiplication engine.

Tract, Sonos's open-source neural network engine, took years to build. Its development story follows more or less the same big steps as this blog series. The first stage, described in Part 1, stays in the realm of neural networks. It is all about understanding what a neural network inference engine really has to compute. Once the inference problem is re-expressed in simpler terms and conceptually moved from neural network design terminology to arithmetics and linear algebra, we realized we were facing a very common problem seen in the high-performance computing field: multiplying matrices. We discussed this in Part 2, where we also had a first look at the other end of the problem: the CPU, and where the co-evolution of CPU hardware design and software has led us to today.

In this final post, we go one more step down the rabbit hole to squeeze as much performance as we can out of the Cortex-A53, a CPU that is at the core of an incredible number of everyday devices. We will have to leave behind the comfort zone of clean design and sensible abstractions for the painful and sometimes surprising exploration of this chip which will help us improve tract's overall performance and get us acquainted with a methodology that will help in tackling relevant problems on similar chips.

At the core of the matrix multiplier is a kernel, a function that performs a loop over the shared dimensions of the two operands, computing a “tile” of the output matrix. We explained in Part 2 that there are performance gains in making the tile bigger, as long as the tile result fits on the CPU registers. Part 2 kernels were built with simple C code, leaving it to the auto-vectorization features in the compiler to figure out an efficient way to lay the tile data and perform the kernel operations. We also noticed that results were not completely consistent, hinting at some possible limitations of the auto-vectorization.

We will now take the matter into our own hands to write the kernel functions. We will get as close to the metal as possible, writing by hand, in assembly language, the exact sequence of instructions we want the processor to execute. We will show that a simple, generic kernel in ARMv8 assembly can do wonders on some CPUs, but that others may benefit from dedicated attention. We will discuss where this journey took us on the Cortex-A53, and try to share what we learned along the way. In summary, we learned a lot about the floating point vector multiplier of the Cortex-A53, and developed a core methodology that can be applied to relevant optimization problems on other CPUs.

## Defining efficiency

In order to simplify the problem a bit, we are going to focus on single-core performance. Multi-thread and multi-core are interesting tools to lower latency or increase total throughput. There are many levels where out-of-core parallelism can be deployed: from running several processes on a single computer to splitting a big matrix product into smaller ones and processing each of them in a separate thread that will be able to run on a different core. We will only be focusing here on how much performance we can squeeze out of a single core because the better our single-core performance is, the better a multi-core implementation built on top of it can be.

First of all, it would be nice to get absolute measurements of how good a given solution is. In the previous post we compared our toy solutions to competitors like BLIS and saw that they were out-performing these simple implementations in many situations. What we don’t know is… Is it possible to make an even better solution? How much better could it be? Is there a hard, unbreakable limit to the performance and how far are we?

It is pretty easy to get some values. Let’s have a look at how modern CPUs are designed first. We discussed CPU registers before; registers are CPU immediate memory. They store intermediate values, and can be used as the inputs and outputs of actual operations. CPU ports are a bit less well-known. Ports are the actual operational part of the CPU which perform the data operations. At each cycle, the logical unit of the CPU will wire registers and ports together to “issue” the operation from its program. Port operations can last one cycle or more, and when they’re done will write the output to one of the registers they have been connected with.

Ports are the actual computing component of the CPU, the circuitry that implements integer or floating point addition, multiplication and so forth.

Modern CPUs have several ports, and many are actually capable of issuing more than one operation during one given cycle. Increasing the number of ports and having a logical unit capable of issuing more operations at each cycle is one of the tricks used by CPU designers to make them run “faster than the clock”.

To translate this into a hard limit, we need to choose a specific CPU. We will focus on the Cortex-A53 in this post. This CPU is quite common: Wikipedia calls it “the most widely used architecture for mobile SoCs since 2014 to the present day”. In the Raspberry Pi 3, it runs at 1.2GHz.

Despite its wide popularity, the documentation for Cortex-A53 may be surprisingly obscure on anything related to the actual performance of the chip. A bit of research though does tell us it has one single port for vector instructions. Each execution of these vector instructions will be able to perform 4 multiplications (vector registers can hold 4 single precision values). The theoretical throughput then for a single Cortex-A53 core is 1.2 x 1 x 4 = 4.8GFlops.

At the other end of the ARMv8 spectrum sits the Apple M1 chip. Apple’s flagship new chip targets an entirely different market. It is designed for laptops and desktop computers while the Cortex-A53 finds its place in smart objects and entry-level mobile phones. It provides us with an interesting comparison point as it belongs to the same ARMv8 family. They share the same instruction set and can run the same code.

The M1 runs at 3.2GHz, and has fast cores with 4 ports that can handle 4-item vectorized multiplication: 3.2 x 4 x 4 = 51.2GFlops

These values are hard maximum limits since we know we can not physically go beyond them. However even getting close to them is a challenge. Besides actually doing multiplications, our multiplier kernel must also pull data for both operands. We will define efficiency as the ratio between the actual performance of a kernel over a CPU and this hard limit. In the previous post, we measured the BLIS multiplier running at 1.3GFlops of the Cortex-A53, giving us an efficiency of 1.3/4.8=27%

## Writing the kernels

We discussed in Part 2 the most important aspects of multiplying two matrices. The main idea is to compute the output matrix tile per tile. Inside each tile, a kernel function computes all values for one tile in one scan over a $k$-length panel of operands $A$ and $B$, pulling packed data and performing accumulations in registers.

Part 2 also showed us that relying on auto-vectorization is not necessarily the best idea for a component as critical as the multiplier. To get a bit more control, compilers offer intrinsic instructions. They are pseudo-functions exposed by the compiler that will just get replaced by actual CPU instruction. The following function, for instance, represents the fadd instruction described in Part 2: it adds the four values from the first operand vector to their counterpart in the second operand vector.

At compile time, the compiler replaces a call to vaddq_f32 by just a single CPU instruction, letting us work with variables while it chooses registers. We could write a tile multiplier in C or Rust using intrinsic instructions, leaving the compiler to “fill in the blanks” and figure out how $a$ and $b$ and their sum could be put in registers. If we are careful about not inducing impossible conditions, such as using more variables than there are registers available, the compiler will probably find a way to accommodate us.

We tried using intrinsic instructions but found that they do not yield the same level of performance that we can get with raw assembly. Thus, in practice, we chose to implement our kernel functions in assembly, with one or more hand-crafted implementations for each processor family. Of course, the more complex, but also less performance critical higher level loops — scanning the matrix tile per tile — can stay in a high level, portable language — Rust in our case.

## A simple semantic ARMv8 multiplier loop

Time to finally look at a bit of assembly! We will start from a simple, very straightforward multiplier loop. It implements a kernel for a 8x8 tile as described in Part 2.

First of all, we need to choose how to map our variables to the 32 vector registers. This choice is pretty much arbitrary, we just want something relatively easy to remember.

For the accumulators, we pick the “high number” registers: $v16$ to $v31$: 16 vectors of 4 values each, storing the 64 values of an 8 times 8 tile. They are arranged going down then right: $v16$ contains a “vertical” vector of four values in the left top corner of the tile, $v17$ goes right behind it. $v18$ and $v19$ make up our second column and so forth.

It leaves us with registers $v0$ to $v15$ for other temporary data: $v0$ and $v1$ will temporarily store a 8-value vector of data read from $A$, while $v5$ and $v6$ will do the same for $B$. We have left a few registers unused in the low numbers, but it does not matter.

.loop:
ld1	{ v0.4s, v1.4s }, [x1], #32
ld1	{ v4.4s, v5.4s }, [x2], #32

fmla	v16.4s, v0.4s, v4.s[0]
fmla	v17.4s, v1.4s, v4.s[0]

fmla	v18.4s, v0.4s, v4.s[1]
fmla	v19.4s, v1.4s, v4.s[1]

fmla	v20.4s, v0.4s, v4.s[2]
fmla	v21.4s, v1.4s, v4.s[2]

fmla	v22.4s, v0.4s, v4.s[3]
fmla	v23.4s, v1.4s, v4.s[3]

fmla	v24.4s, v0.4s, v5.s[0]
fmla	v25.4s, v1.4s, v5.s[0]

fmla	v26.4s, v0.4s, v5.s[1]
fmla	v27.4s, v1.4s, v5.s[1]

fmla	v28.4s, v0.4s, v5.s[2]
fmla	v29.4s, v1.4s, v5.s[2]

fmla	v30.4s, v0.4s, v5.s[3]
fmla	v31.4s, v1.4s, v5.s[3]

subs	x3, x3, #1
bne	.loop

Some values have been set up before we enter this loop: $x1$ contains the address of the first value in the panel from $A$ ; $x2$ contains the address of the first value in the panel from $B$ ; $x3$ contains $k$, the number of individual products we must accumulate for each value of the result. It is also the number of columns of $A$, and the number of rows of $B$.

The last two instructions manage the loop. First, decrement (subtract 1 from) $k$ in place inside $x3$, and branch back to the loop begin label if we are not done. The loop will run $k$ times.

The first two operations in the loop are fetching data from memory. $A$ first ld1 loads 8 values from $A$ to two vector registers $v0$ and $v1$. $x1$ is the address to read from. ld1 will also add 32 to $x1$, effectively moving forward its pointer to the next vector of 8 elements that we will need during the next iteration. If you’re a bit lost at this point, and have not yet checked out Part 2, let’s just remind you that $A$ matrix has been “packed” beforehand to be read in this specific order. Same goes for $B$: 8 values are read from the $x2$ address to $v4$ and $v5$, and $x2$ is incremented accordingly.

Next comes the long list of multiplications and accumulations. We are using the fmla instruction, which does exactly what we need, namely multiply the second and third operand, then add this product into the first operand, in place. Each block of two instructions deals in turn with one of the eights columns of our tile. Two fmla instructions, each computing 4 products and sums, will update 8 accumulator values. Each column uses one value from $B$ (addressed individually from within $v4$, then $v5$) and all values from $A$ (full $v0$ and $v1$ vector).

All in all, a very “semantic” bit of assembly. Very clean and symmetric, maybe a bit boring. The Apple M1 loves it; it runs at 49.7GFlops, which, as the M1 theoretical limit is 51.2GFlops gives us an efficiency of 97%. On Cortex-A53, this code runs at 2.6GFlops. Out of 4.8GFlops, our Cortex-A53 efficiency is only 55.5%.

The 55.5% is already a nice improvement over the performance of BLIS (27%), but it turns out that we can still do a bit better on the Cortex-A53. We will not get nearly as efficient as the M1, but every cycle matters. Nonetheless, it will get weird. We will need to get closely acquainted with the Cortex-A53, and do some very counterintuitive things to get there.

## In order and out-of-order execution

As we discussed, the old mental model of a processor as something that runs one instruction at a time, instruction after instruction, is obsolete. CPUs now contain multiple ports that can be made to start executing instructions in the same cycle. The processor keeps track of multi-cycle instructions that were started but not yet terminated, like multiplication or a memory read operation. Its important to note that the degree of sophistication of this issuing logic can vary greatly from one chip to another.

The main differentiating factor between CPUs is the pipeline implementation. Some chips have an in-order pipeline, while others have an out-of-order pipeline.

Seeing as the Cortex-A53 is an old and modest CPU, it uses the “in-order, partial dual issue” pattern; In other words the issuing logic is just smart enough to dispatch one or two instructions at each cycle to two different ports. Instructions will be started in the exact order in which the program presents them.

On the other hand, the Apple M1 features an out-of-order pipeline that will consider a handful of upcoming instructions at each cycle, start as many as it can in each cycle, and even move instructions before their turn if it does not change the program semantics. As a consequence, for such an advanced CPU, it is enough that the program logic is arranged in the simplest way possible to let the processor issue logic reorder and reorganize instructions in the most favorable order.

On the Cortex-A53 things are harder. The in-order pipeline makes it critical that the developer is aware of “dependency chains” in the code. For example, in our first kernel, the second $ld1$ instruction loads values into $v4$ from memory. The next instruction actually uses this value in a computation. Loading a value typically takes 3 to 5 cycles, assuming the memory we are reading is in cache. During these cycles, the Cortex-A53 has nothing else to do and thus must wait for the result. During this time it can not start the next fmla as it needs $v4$ to be loaded. The core is stalling; its wasting cycles, time, and energy. It is up to the assembly developer to find a way to organize the computations so that these dependency chains do not block the processor.

ld1	{ v0.4s, v1.4s }, [x1], #32
ld1	{ v4.4s, v5.4s }, [x2], #32

.loop:
ld1	{ v2.4s, v3.4s }, [x1], #32
ld1	{ v6.4s, v7.4s }, [x2], #32

fmla	v16.4s, v0.4s, v4.s[0]
[ … ]
fmla	v31.4s, v1.4s, v5.s[3]

fmov	v0.4s, v2.4s
fmov	v1.4s, v3.4s
fmov	v4.4s, v6.4s
fmov	v5.4s, v7.4s

subs	x3, x3, #1
bne .loop

Or in a more synthetic way:

start load v0 and v1 with 8 values from A
start load v4 and v5 with 8 values from B
loop:
start load in v2 and v3 with 8 values from A
start load v6 and v7 with 8 values from B
start products from v0, v1 and v4, v5 into v16 to v31
move v2, v3, v6, and v7 to v0, v1, v4 and v5
decrease k by 1 and loop if necessary

One common solution is to use the remaining registers to pre-load values, instead of loading input data and then using it during the same iteration. We will assume every time we enter the loop, data for the current $k$ is already present in $v0$, $v1$, $v4$ and $v5$. The loop will load data for the next iteration instead, put it in $v2$, $v3$, $v6$ and $v7$, and will do the products just as before, from $v0$, $v1$, $v4$, and $v5$. The processor no longer needs to wait as it can start processing the products right away, while the memory circuitry pulls data from memory into separate registers. Hopefully, during the 16 cycles required to issue the fmla, the values for the next round will have been loaded. The loop must then move them where they are expected before starting again — moving from register to register is reasonably fast.

In order to avoid the four fmov calls, it is also possible to change the loop so that each iteration processes 2 steps over $k$ instead of 1:

start load v0 and v1 with 8 values from A
start load v4 and v5 with 8 values from B
loop:
start load v2 and v3 with 8 values from A
start load v6 and v7 with 8 values from B
start products from v0, v1 and v4, v5 into v16 to v31
start load v0 and v1 with 8 values from A
start load v4 and v5 with 8 values from B
start products from v2, v3 and v6, v7 into v16 to v31
decrease k by 2 and loop if necessary

This chart displays the efficiency of our three kernel variants. Bigger bars are better. We observe that our two attempts are not greatly successful in improving the performance. The first “broken chain” variant, with its four move instructions is even worse than our initial, while the second variant is only marginally better.

We were a bit eager and jumped to the “obvious” issues in the kernel, though we can now see we were wrong. At this stage, it’s time to backpedal a bit and learn more about our CPU.

## Cortex-A53 instruction timings

We have not yet looked too deeply into the precise timing of instructions. We only paid attention to breaking the dependency chains between back-to-back instructions. To move forward we must first find out the exact timing of instructions:

• Issue latency: how many cycles does it take to “start” an instruction before the flow can move to the next one? It is often 1, but it may get higher for the “big” vector instructions we are using.

• Result latency: how many cycles does it take for the result of an instruction to be available for the next instruction needing it? This is mostly important to know how “loose” we need to make the dependency chains.

Of course this information is different from one CPU to another and depending on the CPU, the documentation may be more or less explicit about it.

We timed tight loops over a handful of single operations or combinations to try and find more precise timing information.

• The fmla instruction can be issued at one instruction/cycle, as expected.

• The ld1 instruction which loads 8 values into two vectors takes 4 cycles to issue. This was a bad surprise.

• The and instruction, which we used to implement moving values from vector register to register does not dual-issue.

With these first bits, we can count cycles in our three kernels. Our loop performs 16 fmla calls, decrements and observes the $k$ counter (roughly 2 cycles), and loads data (8 cycles, 4 for $A$ and 4 for $B$). In the first “broken chain” example, we need to move loaded data between registers, this costs us 4 more cycles. And in the broken chain without moves, we put two loops together to use different temporary storage for odd and even iterations.

We can infer a kernel efficiency from here. With our definition, we reach 100% efficiency when the multiplier is saturated. This is only the case when every clock cycle is used to issue an fmla call. Every other operation can be seen as a necessary distraction, impairing efficiency. With this definition, the efficiency is the ratio between the “fmla cycle” and the total cycles.

naive                       16  +   8  +    2  =    26  ->     61.5%
broken chains, with move    16  +  12  +    2  =    30  ->     53.3%
broken chains, odd/even     32  +  16  +    2  =    50  ->     64.0%

Additionally we started measuring the time taken by implementations of the kernel loop, out of the additional complexity induced by the tiling scan of the full matrix.

Our measurements do not completely match out predictions meaning that there is a bit of waste somewhere. Our cycle count may not be perfect, however we are making a rather strong hypotheses on the fact that the data we need is in cache. Modeling the memory cache efficiency is an entirely different topic, so we will just ignore this for now. We can still be satisfied with the fact that our estimations and measurements are relatively consistent when the implementation changes. This is a first step towards a better understanding of our CPU.

## Cortex-A53 dual issue

The Cortex-A53 is branded as a “partial dual issue” processor. This means it is able to issue two instructions during one cycle, but special attention needs to be paid to the word “partial” since it is not completely clear which combinations will work. We therefore need to ask ourselves another question with respect to instruction timing, in addition to the three previous questions mentioned before: which instructions can be dual issued with which?

One thing that is clear is that the A53 can not issue two fmla in the same cycle, because it only has one port for them. As most of our operation is made of these fmla operations, it is interesting to try and figure out what could be dual issues with them. If we can instruct our Cortex-A53 to start loading the data at the same time as doing the multiplications, without slowing them down, we may be able to squeeze out a few cycles.

Cortex-A53 documentation is not explicit about which operations or combinations of operations are capable of dual issues, so we need to rely on softer sources of information. Some people on the internet have worked on this topic, and some of this wisdom can be found in gcc or LLVM code and comments. There are hints which can also be found on some bits of assembly optimized for Cortex-A53. We have a few hints here and there, so we know what we should look into, but no ready-to-use hard truths. In this search we discovered that the rules of dual issuing on the Cortex-A53 are rather complex and often counterintuitive.

While there are few good-looking alternatives to fmla to implement the actual computation, the case for data loading is quite different. There are about a dozen of load variants in the instruction set, and the diffused knowledge of the internet led us to take a closer look at the 64-bit load to general register (not vector).

We ran more tight loops tests and here are some of our findings:

• As hinted by the internet, a fmla and a 32-bit or 64-bit load to a generic purpose register can dual-issue.

• No other fmla and load combinations dual-issue.

• ins, instructions inserting a 32-bit or 64-bit value from a generic purpose register to a vector can be dual-issued together.

• There is also something weird happening when ins appears after an fmla. As we stated before, an fmla takes 8 cycles to complete. This is not a problem as long as we take it into account in the dependency chain analysis and check that no instruction should use the fmla result less than eight cycles after it is issued. Additionally, it looks like the processor can either do fmla or ins at one given time, independently of which registers we are targeting, so it will wait for all fmla (up to eight cycles) to be done before issuing an ins following it. This means it is better to group ins together in one block, as far as possible from the fmla.

There is another interesting path for loading the data. While issuing fmla for one value of $k$, we are actively pulling data for the $k+1$ loop from memory to the general-purpose register. These loads require 8 64-bit loads, but if we pair each of them with one of the 16 fmla, then they are “free”. Once this is done, we insert all the data from the regular registers to the “input” vector registers, $v0$, $v1$ for $A$ and $v4$ and $v5$ for $B$ for the next iteration. This takes 8 ins instructions, but as they are dual-issued, they only consume 4 cycles. We have saved 4 cycles, getting our theoretical efficiency to 72.7%.

## More Cortex-A53 idiosyncrasies

There are a few more subtleties that are worth mentioning here:

• When we switched from loading using ld1 naturally to our convoluted way, we noticed that performance was improved by the use of some prfm instruction to “prefetch” memory. Prefetching is a way to signal to the CPU that it would be nice to load a few pages from memory to the cache because we are going to access them soon. This was somewhat expected as prefetching memory to warm up a cache is a rather common trick. The prefetch instruction can be dual issued with fmla, so it is basically free to call. The actual surprise was that trying the prefetch trick in our initial semantic implementation did not improve efficency. We assumed the multiple load instructions (ld1) that we use is somehow a hint that we are going to stream through this data, so the CPU does some prefetching for us.

• Our loop ends with a big ins list at the end. From our measurements, it turns out that ins can be dual issued, but not when inserting into the same vector register. This is not a big problem since they are all independent, and thus we can run them in any order, but it does add to the apparent complexity.

• A final observation is that back-to-back instruction loading while incrementing the pointer sometimes seems to have a two-cycle length dependency chain, so we need to alternate loads for A and B.

All in all, this took a lot of work for barely 50 lines of assembly code. Reading compiler source code of various open-source projects to find ideas, micro-benching combinations of operations to verify the findings, and finally quite a bit of trial and error helped us put all the pieces together. In more than one case, some exciting findings at the micro-bench level did not result in much of a positive impact when assembled with the rest, mostly because of complex memory interactions that are hard to model.

## tract production Cortex-A53 kernel

Now that we have a better understanding of what makes the Cortex-A53 tick, we can finally come back to tract, our neural network inference engine, and optimize its matrix multiplier.

We went a bit beyond the 8x8 kernel we have been discussing. Tract actually has several kernels for the Cortex-A53, and the main one is not a 8x8 tile but a 12x8 tile. We are using the general purpose register to pre-load data as we just described and there are enough free registers to hold extra accumulators. Its main loop has 24 fmla, plus 5 cycles used for 10 insertions, giving it a theoretical efficiency of 77%, measured at 70% on the micro-benches.

In-vivo efficiency, measuring the entire process, is always a bit lower. Operands need to be packed and temporary buffers need to be allocated. In order to compare our kernel with other implementations, we bench tested it as if it was used to perform a standard BLAS gemm operation, just like we did in Part 2.

In this chart, we compare tract end-to-end efficiency with

• The naive implementation we started in Part 2

• BLIS, which only has a generic ARMv8 implementation

• Ruy, the recent engine from Tensorflow-Lite. They take things further, with a Cortex-A53 and Cortex-A55 kernel using the same load trick we described earlier. They stop at a 8x8 kernel though, which is probably enough to account for tract being more efficient.

This final part started as a comparison between two CPUs of the same family. Nevertheless, it turned out we had much more to cover on the side of the Cortex-A53 than for the Apple M1. The Cortex-A53, an incredibly successful piece of technology, is a unique challenge to developers, because it combines a huge, modern instruction set with limited issuing capabilities. This is aggravated by the documentation being relatively elliptical on the timings.

The M1 CPU, on the other hand, with the same instruction set, uses its impressive issuing logic to make the best out of very simple assembly. Our experience with the M1 vs Cortex-A53 was very unbalanced; M1 just worked, while Cortex-A53 has been a lot of work.

In order to compensate for the poor documentation, we had to study the handful of instructions that mattered to us and find out their exact timings. Prompted by wisdom disseminated across the open source world, we considered a totally counterintuitive way to load data into the Cortex-A53. It took some work but we were able to leverage it and get faster results while doing more "work”. The experience collected on the way allowed us to tackle optimization problems more systematically on various CPUs with differing problem sets.

As this series is drawing to a close we'd like to first thank you for reading, and congratulate you on following us so far. Beyond being a fun experience, writing this led us to investigate a few loose ends that happened to have a measurable impact on the performance. We started with a plea for efficiency, and saw how the hunt started with making the engine aware of the semantics of the network graph to implement a first series of simplification. Nonetheless, very soon thereafter the convolution and matrix multiplication cost dominated any performance profile, making it obvious it has to be optimized as much as is practical. There are low-hanging fruit, namely using third-party libraries, but we preferred to take ownership of the problem space and see how much progress we could make. We went as far as assembly on this problem, which is as far as you can go in the realm of software, and it yielded great results. We hope this series may help the next team taking this journey.

Stay tuned :)