Qualcomm Hexagon 685 DSP is a Boon for Machine Learning

Qualcomm’s Snapdragon 845 – the newest chip in its Snapdragon mobile processor series – is shaping up to be a powerhouse, though not necessarily through just conventional CPU and GPU improvements. It does boast updated CPU cores, a new, third-generation Spectra image signal processor (ISP) that can capture video in 10-bit color, and an architecture that’s 30 percent more power-efficient than the previous generation. But one of its most impressive (and often unrecognized) aspects is a co-processor that can juggle sensor, photo, and AI data in real time.

Just what makes Qualcomm’s Hexagon 685 DSP tick?

The Hexagon DSP architecture in the Snapdragon 835. Source: Qualcomm

“The Hexagon processor is a hardware multi-threaded, variable instruction length, VLIW processor architecture developed for efficient control and signal processing code execution at low power levels (…) ” – Qualcomm

To understand what makes the Hexagon DSP and other chips like it so unique, it helps to know that machine learning and artificial intelligence are powered by the kind of math most engineering students are required to learn. Machine learning in particular requires computations with large vectors, which poses a challenge for general-purpose smartphone, tablet, and PC processors. It’s hard for general-purpose processors to run algorithms like stochastic gradient descent quickly and power-efficiently enough for smartphone applications. The Hexagon 685 DSP was introduced to handle image and sensor data processing, and is famous for what it enabled in smartphone photography (like on the Pixel line) and for its “low-power island,” which became integrated with Google’s All-Ways Aware API. But the included HVX contexts (more on this later) allow for much more than that: sitting halfway between a general-purpose processor and a fixed-function core, the Hexagon 685 DSP computes the math behind on-device machine learning with terrific efficiency while retaining the flexibility of a more programmable device.
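To make the “large vectors” point concrete, here is a minimal sketch of one stochastic gradient descent step for a linear model, in plain Python. The function names (`dot`, `sgd_step`), the sample data, and the learning rate are assumptions made for illustration and have nothing to do with Qualcomm’s SDK. The point is only that each step boils down to a dot product plus an element-wise vector update, which is the kind of work a wide vector unit can spread across many lanes at once.

```python
# Illustrative only: one SGD step for a linear model y ~ w . x.
# Names, data, and hyperparameters are assumptions made for this sketch.

def dot(a, b):
    """Dot product: one multiply-accumulate per vector element."""
    return sum(x * y for x, y in zip(a, b))


def sgd_step(weights, features, target, learning_rate=0.1):
    """Return updated weights after one squared-error SGD step on one sample."""
    error = dot(weights, features) - target
    # The gradient of 0.5 * error**2 with respect to each weight is error * feature.
    return [w - learning_rate * error * x for w, x in zip(weights, features)]


weights = [0.0, 0.0]
sample, label = [1.0, 2.0], 5.0

# Repeating the step on the same sample drives the prediction toward the label.
for _ in range(50):
    weights = sgd_step(weights, sample, label)

prediction = dot(weights, sample)
```

On real models the vectors hold thousands or millions of elements, so the same two loops dominate the running time.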

AI compute units like the programmable Hexagon 685 DSP, which are sometimes referred to as “neural processing units,” “neural engines,” or “machine learning cores,” are tailored specifically to AI algorithms’ mathematical needs. They’re much more rigid in design than a traditional CPU, and contain special instructions and arrangements (in the Hexagon 685 DSP’s case, the aforementioned HVX architecture) that accelerate certain scalar and vector operations, a benefit that becomes noticeable in large-scale implementations.

The Snapdragon 845’s Hexagon 685 DSP can process thousands of bits of vector data per cycle, compared to the hundreds of bits per cycle of the average CPU core. That’s no accident. “Vector math is the foundation of deep learning,” said Travis Lanier, Senior Director of Product Management at Qualcomm.

With four parallel scalar threads for Very Long Instruction Word (VLIW) operations and multiple HVX contexts, the Hexagon 685 DSP is capable of driving multiple execution units with a single instruction and quickly blazing through integer and fixed-point operations. Rather than pushing performance through raw MHz, the Hexagon design aims for high levels of work per cycle at a reduced clock speed. It includes hardware multi-threading that works well with VLIW, as multi-threading hides pipeline latencies, which enables better utilization of VLIW packets. The DSP’s multi-threading also means it can service multiple offload sessions (concurrent apps for audio, camera, computer vision, and so on), and other tweaks help various tasks run concurrently without fighting for execution time on the DSP.
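For readers unfamiliar with fixed-point math, the following is a generic sketch of how fractional values can be multiplied using integer operations only. It assumes the common Q15 format (one sign bit, 15 fractional bits); that format and the helper names are choices made for this illustration, not a description of Hexagon’s instruction set.

```python
# Illustrative Q15 fixed-point helpers (assumed format: 1 sign bit, 15 fractional bits).
Q = 15
SCALE = 1 << Q  # 32768


def saturate(value):
    """Clamp an integer to the representable Q15 range."""
    return max(-SCALE, min(SCALE - 1, value))


def to_q15(value):
    """Convert a float in [-1, 1) to a Q15 integer."""
    return saturate(int(round(value * SCALE)))


def q15_mul(a, b):
    """Multiply two Q15 integers with rounding, using integer operations only."""
    product = a * b                             # Q30 intermediate result
    rounded = (product + (1 << (Q - 1))) >> Q   # round, then shift back to Q15
    return saturate(rounded)


def from_q15(value):
    """Convert a Q15 integer back to a float for inspection."""
    return value / SCALE


half = to_q15(0.5)
quarter = to_q15(0.25)
result = q15_mul(half, quarter)   # represents 0.5 * 0.25
```

Because every step is an integer multiply, add, or shift, hardware that is fast at integer vector math can run this kind of arithmetic across many elements per cycle.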

Source: Qualcomm

HVX in particular helped the Hexagon DSP become a power-efficient image-processing engine in 2015, aiding the ISP by handling some image-processing tasks more efficiently; the HVX-processed pixels are then streamed out to the ISP hardware for further processing and image composition. The Hexagon’s HVX units were also said to be sufficient for handling 4K video post-processing and other complex image and video tasks. The original Hexagon 680’s HVX extension featured thirty-two 1024-bit vector data registers for up to 4096 bits per cycle using four slots per instruction.
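The register figures quoted here can be turned into a quick back-of-the-envelope calculation. The sketch below assumes each of the four slots operates on one full 1024-bit register, and the element widths (8, 16, and 32 bits) are picked purely for illustration.

```python
# Back-of-the-envelope arithmetic using the figures quoted in the article.
VECTOR_REGISTER_BITS = 1024   # width of one HVX vector register (from the text)
SLOTS_PER_INSTRUCTION = 4     # slots per instruction (from the text)

# Assumption: every slot works on one full-width register in the same cycle.
bits_per_cycle = VECTOR_REGISTER_BITS * SLOTS_PER_INSTRUCTION


def lanes(register_bits, element_bits):
    """Number of elements of a given width that fit in one register."""
    return register_bits // element_bits


# Element widths chosen for illustration only.
lanes_per_register = {bits: lanes(VECTOR_REGISTER_BITS, bits) for bits in (8, 16, 32)}
```

Under these assumptions, a single register holds 128 eight-bit pixels, which helps explain why HVX suits image workloads.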

But those aren’t the Hexagon DSP’s only strengths. Its instruction set architecture (ISA) boasts improved efficiency over traditional VLIW thanks to the improved control code, and it employs clever tricks to recover performance from idle and stalled threads. It also implements zero-latency round-robin thread scheduling, meaning that the DSP’s threads process new instructions immediately after completing the previous data packet.
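A toy model can show why round-robin scheduling across hardware threads helps hide latency. The simulation below is a deliberately simplified sketch: the function name, the fixed four-cycle latency, and the “one packet issued per cycle” rule are assumptions made for illustration and do not describe the real Hexagon pipeline.

```python
# Toy model: each issued packet keeps its thread busy for `latency` cycles,
# and at most one packet is issued per cycle across all threads.

def simulate_round_robin(packets_per_thread, latency):
    """Return the cycle at which the last packet completes."""
    remaining = list(packets_per_thread)
    ready_at = [0] * len(remaining)   # cycle at which each thread can issue again
    next_thread = 0
    cycle = 0
    finish = 0
    while any(remaining):
        # Scan threads in round-robin order and issue from the first ready one.
        for offset in range(len(remaining)):
            t = (next_thread + offset) % len(remaining)
            if remaining[t] and ready_at[t] <= cycle:
                remaining[t] -= 1
                ready_at[t] = cycle + latency
                finish = max(finish, ready_at[t])
                next_thread = (t + 1) % len(remaining)
                break
        cycle += 1
    return finish


# Eight packets on one thread versus the same eight spread over four threads.
single_thread_cycles = simulate_round_robin([8], latency=4)
four_thread_cycles = simulate_round_robin([2, 2, 2, 2], latency=4)
```

In this model the single thread sits idle while each packet completes, whereas four threads keep the issue slot busy every cycle, so the same work finishes in roughly a third of the time.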

Source: Qualcomm

None of this is new, to be clear. Qualcomm introduced the ‘first-generation’ (or proper) Hexagon DSP (or QDSP6 v6) alongside the Snapdragon 820 in 2015. The Hexagon 680 was followed by the Hexagon 682, a relatively small revision as the name implies, through which Qualcomm further tried to advertise the Snapdragon 835’s on-device machine learning capabilities. But the latest generation is the most sophisticated yet, and delivers up to three times the overall performance of the Snapdragon 835’s DSP.

That’s thanks in large part to HVX, which works very well for image processing (think augmented reality, computer vision, video, and pictures). The DSP’s HVX contexts can be controlled by any two of the scalar threads, and the HVX units and scalar units can be used simultaneously, resulting in substantial performance gains and concurrency.

Here’s Qualcomm’s explanation:

“Say you’re processing on the mobile CPU in control code mode and you switch to computational mode on the coprocessor. If you need any control code, you have to stop and go back from the coprocessor to the main CPU. With Hexagon, both the control code processor on the DSP and the computational code processor on HVX can run at the same time for tight coupling of control and computational code. That allows the DSP to take the result of an HVX computation and use it in a control code decision in the next clock cycle.”

The HVX affords another big advantage in image sensor processing. Snapdragon devices with the Hexagon 685 DSP can stream data directly from the imaging sensor to the DSP’s local memory (L2 Cache), bypassing the device’s DDR memory controller. That reduces latency, of course, but also improves battery life — the Snapdragon processor is designed to idle throughout the operation.

The Hexagon 685 DSP is specifically optimized for 16-bit floating-point networks, and is controlled by Qualcomm’s machine learning software, the Snapdragon Neural Processing Engine.

“We’ve [taken] it very seriously,” a Qualcomm spokesperson said. “We’ve been working with partners for the last three years to have them utilize […] our silicon for AI and imaging.”

Those partners include Google, which used the Hexagon DSP’s image-processing capabilities to power the Pixel and Pixel 2’s HDR+ algorithm, for example. While Google has introduced its own Pixel Visual Core as well, it’s worth noting that Hexagon 685 DSP-enabled devices are the ones that see the best results with the famous Google Camera port, in part because (as we’ve confirmed) of HVX utilization. Facebook, another partner, worked closely with Qualcomm to accelerate Messenger’s real-time camera filters and effects. Oppo has optimized its face unlock technology for the Hexagon 685 DSP, meanwhile, and Lenovo has developed its Landmark Detection feature around it.

One reason for the platform’s wealth of support is its simplicity. Qualcomm’s extensive Hexagon SDK supports the Halide language for high-performance image processing, for instance, and you don’t have to worry about the framework you train a machine learning model on. Implementing a model is as simple as making an API call, in most cases.

“We’re not […] competing with the likes of IBM and Nvidia [in AI], but we have areas that developers can tap into — and already have,” Qualcomm told XDA Developers.

Hexagon vs. the Competition

The Snapdragon 845’s Hexagon 685 DSP comes as an increasing number of original equipment manufacturers (OEMs) pursue mobile, on-device AI solutions of their own. Huawei’s Kirin 970, the system-on-chip inside the Mate 10 and Mate 10 Pro, has a “neural processing unit” (NPU) that can reportedly recognize more than 2,000 images per minute at just 1/50th the power consumption of an average smartphone CPU. Of course, that’s marketing, and Huawei has struggled to gain a foothold with developers to really make such computing capabilities shine. And the Apple A11 Bionic system-on-chip in the iPhone 8, iPhone 8 Plus, and iPhone X has a “Neural Engine” that performs real-time facial modeling and up to 600 billion operations per second, something that the company has proudly incorporated into its marketing efforts and feature set (famously via Animoji).

But Qualcomm says that the Hexagon’s platform agnosticism gives it an advantage. Unlike Apple and Huawei, which largely force developers to use proprietary APIs, Qualcomm sought to support some of the most popular open-source frameworks from the get-go. For example, it worked with Google to optimize TensorFlow, Google’s machine learning platform, for the Hexagon 685 DSP — Qualcomm says it runs up to eight times faster and 25 times more power-efficiently than on non-Hexagon devices.

Source: Qualcomm

On Qualcomm’s DSP architecture, Google’s GoogLeNet Inception Deep Neural Network (a machine learning algorithm designed to assess the quality of object detection and classification systems) demonstrated gains in a demo that ran one TensorFlow-powered image recognition app on two smartphones: one that ran the app on the CPU, and another that ran it on Qualcomm’s Hexagon DSP. The DSP-accelerated app captured more images per second, identified objects faster, and had higher confidence in its conclusions about what each object was than the CPU-only app.

Google also uses the Hexagon DSP to accelerate Project Tango, its augmented reality platform for smartphones. Lenovo’s Phab 2 Pro, Asus’s ZenFone AR, and other devices with Tango’s depth-sensing IR module and image-tracking cameras take advantage of Qualcomm’s Heterogeneous Processing Architecture, which delegates processing tasks among the Snapdragon chipset’s Hexagon DSP, the sensor hub, and the image signal processor (ISP). The result is a “less than 10 percent” overhead on the system-on-chip’s CPU, according to Qualcomm.

“As far as we know, we’re the only mobile guys out there who [are] optimizing for performance and power efficiency,” a Qualcomm spokesperson said.

Of course, competitors are also working to expand their sphere of influence and foster developer support on their platforms. The Kirin 970’s neural chip launched with support for TensorFlow and Caffe (the open-source deep learning framework whose successor, Caffe2, is backed by Facebook) in addition to Huawei’s Kirin APIs, with TensorFlow Lite and Caffe2 integration on the way later this year. And Huawei worked with Microsoft to optimize its AI-powered Translator for the Mate 10.

But Qualcomm has another advantage: Reach. The chipmaker commanded 42 percent of the smartphone chip market in the first half of 2017, followed by Apple and MediaTek with 18 percent each, according to Strategy Analytics. Suffice it to say, it’s not shaking in its boots just yet.

And Qualcomm predicts it’ll only grow. The chipmaker’s projecting $160 billion in revenue by 2025 with AI software technologies like computer vision, and sees the smartphone market — which is expected to reach 8.6 billion units shipped by 2021 — as the largest platform.

With the Hexagon 685 DSP and other “tertiary” improvements continuously making their way downstream to mid-range hardware, it’s also becoming easier for Qualcomm chips to bring on-device machine learning to all sorts of devices in the near future. Qualcomm also offers a handy SDK (no need to fiddle with DSP assembly language) that lets developers take advantage of the Hexagon 685 DSP and HVX in their applications and services.

“There’s a need for these dedicated processing units for neural processing, but you also need to expand it, so you can support [open source] frameworks,” a Qualcomm spokesperson said. “If you don’t create that ecosystem, there’s no way […] developers can create on it.”
