The Lance Arm-strong of performance-enhanced CPUs: Armv8.1-M arch jams vector math into super-microcontrollers

Processor designer Arm will, we’re told, today pull the wraps off its Armv8.1-M architecture for crafting next-gen 32-bit microcontrollers.

This technology is expected to be the foundation of future beefy Arm Cortex-M CPU cores that chipmakers can license and stick in their components. For buyers of stuff, like you and me, that means gadgets and gizmos that eventually use these system-on-chips will hopefully see a bit of a processing-performance boost.

Microcontrollers drive stuff like wearables, medical equipment, Internet-of-Things sensors, and electronics in cars. They tend to be out of sight, small, power efficient, and each programmed to perform a specific task and perform it well. Being cut-down and streamlined CPUs, they therefore lack features found in general-purpose application processors such as Arm’s Cortex-A line.

For instance, Arm’s microcontroller-grade Cortex-M family is missing a memory management unit, though it can optionally sport a rudimentary protection unit. Today’s Cortex-M controllers can also be loaded up with things like TrustZone for security, and a DSP extension for real-time number-crunching.

As product designers and chip manufacturers eye up running intensive machine-learning and similar math-heavy code on their embedded devices, though, they need a little more oomph from their low-end microcontrollers. This is where Armv8.1-M, and the Cortex-M cores built on top of it, hope to step in.

The key difference, at least as far as we can tell, between v8-M and v8.1-M is the optional addition of Helium, which is Arm’s branding for vector math extensions. In more detail, a v8.1-M microcontroller core can feature the following bits and pieces, according to Arm senior principal engineer Joseph Yiu:

Helium, aka the M-Profile Vector Extension (MVE), brings in up to roughly 150 new instructions – the R in Arm no longer stands for RISC, after all – and is aimed at rapidly executing advanced DSP and machine-learning code, effectively bringing Neon-style SIMD tech from the Cortex-A line to its Cortex-M cousins. Helium is engineered to be as small and energy-efficient as possible, it is claimed. That means the microcontrollers can operate on and process 128-bit vectors, using floating-point registers to hold the data, without blowing the power budget. There are levers system-on-chip designers can pull and push to select just how much of Helium they want to use in their components.

We’re told Helium has fewer vector registers than Neon, it can perform things like loop predication and scatter-gather memory accesses, and it supports more data types than Neon. In terms of using FPU registers, Helium can form up to eight 128-bit vectors from the FPU bank, and split each of these vectors into, say, arrays of 16 one-byte values, eight 16-bit integers or floats, or four 32-bit integers or floats, which can then be fed into Helium’s SIMD-style instructions to process in one go.

Again, this means microcontrollers using this tech should, in theory, get a performance kick when working on arrays and vectors of data, as typically seen when performing machine-learning inference, and products using the chips will analyze stuff like sensor readings, audio, speech, and video faster without having to consult a network-connected IoT gateway or backend cloud.

Helium also supports conditional execution for each vector lane, can work with integers 128 bits or larger in size, and adds a few handy optional shift instructions for 32 and 64-bit data.

Meanwhile, the Low Overhead Branch Extension is pretty funky, as it introduces a bunch of instructions to mark the start and end of classic loops: WLS is While-Loop-Start, which sets the loop count and branch-back address, and LE marks the Loop End. Plus there’s DLS, which is Do-Loop-Start, and other variants. This is supposed to optimize memory accesses and instruction fetches. When the loop is running, these special instructions are skipped, hence why they’re dubbed low overhead.

Armv8.1-M also offers a load of new conditional execution instructions, more floating-point types, hardware save and restore of the FPU context using a couple of instructions, the ability to stop the CPU in privileged mode from executing code in particular memory regions (thus thwarting potential elevation-of-privilege attacks), various debugging add-ons, and other enhancements.

We understand chips using the Armv8.1-M architecture will hit the market in two years’ time, so around 2021. We won’t really know exactly how the power usage and performance will balance out until these become available.

And we imagine more information will appear at some point today over here on Softbank-owned Arm’s website if you’re so interested. ®
