Most major players argue that machine learning needs separate hardware, and several companies have introduced it: Huawei's NPU (Neural Processing Unit), Google's IPU, Qualcomm's NPE (Neural Processing Engine), and Apple's Neural Engine. However, some specialists believe these are all merely fancy names for custom DSPs (Digital Signal Processors), that is, dedicated hardware for running advanced mathematical functions quickly. The latest custom silicon is specifically optimized for machine learning and neural network operations, the most common of which are dot products and matrix multiplication.
Despite what OEMs may tell you, there's a drawback to this approach. Neural networking is still a young field, and the types of operations best suited to a given use case may change as the technology advances. Rather than future-proofing a device, these early designs can quickly become obsolete. Investing in such early silicon is a costly decision, and one that may need revisiting once the best mobile use cases become apparent.
Silicon designers and OEMs won't invest in such advanced circuits for mid- and low-tier devices at this stage, which is why you'll see dedicated processors only in the most expensive smartphones for now. New processor components from ARM, expected to debut in SoCs next year, should help enable more efficient machine learning algorithms without a separate processor, though.
Machine Learning in 2018
ARM has already introduced its Cortex-A75 and A55 CPUs and Mali-G72 GPU designs this year. While the launch focus was on the company's new DynamIQ technology, all of these new designs can also support more efficient machine learning algorithms.
Neural networks aren't very dependent on highly accurate data, particularly after training, so the math can be performed on 16-bit or even 8-bit values instead of larger 32- or 64-bit entries. This reduces memory and cache requirements and significantly improves memory bandwidth, which has always been a limited resource in smartphone SoCs.
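To see why 8-bit data is good enough after training, consider the affine (scale/zero-point) scheme commonly used to store network weights as integers. The sketch below is a minimal pure-Python illustration; the function names are my own, and real frameworks apply this per-tensor or per-channel with more careful rounding.

```python
def quantize(values, num_bits=8):
    """Map floats onto the signed integer range [-2^(b-1), 2^(b-1)-1]."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # guard against all-equal input
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values from the stored integers."""
    return [(qi - zero_point) * scale for qi in q]

weights = [0.91, -0.42, 0.07, 1.30, -1.05]
q, scale, zp = quantize(weights)
restored = dequantize(q, scale, zp)
# Each restored weight is within one quantization step of the original,
# yet now occupies 1 byte instead of 4: a 4x cut in memory and bandwidth.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_err)
```

The worst-case error is bounded by the quantization step (`scale`), which is why inference tolerates it so well while training, with its tiny gradient updates, usually does not.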
As part of the ARMv8.2-A architecture for the Cortex-A75 and A55, ARM introduced support for half-precision floating point (FP16) and integer (INT8) dot products in NEON, ARM's advanced single instruction, multiple data (SIMD) architecture extension. Native FP16 support removes the conversion stage to FP32 required by the previous architecture, reducing overhead and speeding up processing.
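Python's standard `struct` module can round-trip a value through the IEEE 754 binary16 format (format character `'e'`), which gives a feel for what FP16 storage costs in accuracy. This is only a software simulation of the rounding; the helper name is mine.

```python
import struct

def to_fp16(x):
    """Round a Python float (binary64) to the nearest representable FP16 value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

pi_half = to_fp16(3.14159265)
print(pi_half)  # roughly 3.1406: only about 3 decimal digits survive in 16 bits
print(to_fp16(0.1) == 0.1)  # False: 0.1 is not exactly representable in FP16
```

For neural-network weights and activations, which rarely need more than a few significant digits, that loss is usually acceptable, and every value moves through the memory system at half the width of FP32.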
ARM's new INT8 operations combine multiple instructions into a single instruction to improve latency. With the optional NEON pipeline included in the A55, INT8 performance can improve by up to 4x over the A53, making the core a power-efficient option for low-precision machine learning math.
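The semantics of such a combined dot-product instruction can be sketched as follows: four signed 8-bit products are summed and accumulated into a 32-bit lane in one step, where older cores needed separate multiply, widen, and add instructions. This is a plain-Python model of the behavior, not real NEON code, and the function names are illustrative.

```python
def sdot_lane(acc, a4, b4):
    """Accumulate a 4-element int8 dot product into a 32-bit accumulator."""
    assert all(-128 <= v <= 127 for v in a4 + b4), "inputs must fit in int8"
    return acc + sum(x * y for x, y in zip(a4, b4))

def int8_dot(a, b):
    """Full dot product, processed four int8 pairs at a time per 'instruction'."""
    acc = 0
    for i in range(0, len(a), 4):
        acc = sdot_lane(acc, a[i:i + 4], b[i:i + 4])
    return acc

a = [1, -2, 3, 4, 5, -6, 7, 8]
b = [8, 7, -6, 5, 4, 3, 2, -1]
print(int8_dot(a, b))  # prints 4
```

Because the accumulator is 32-bit, long dot products never overflow even though the inputs are only 8-bit, which is exactly the pattern convolution and fully connected layers rely on.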
On the GPU side, ARM specifically designed its Bifrost architecture to facilitate system coherency. This means the Mali-G71 and G72 can share cache memory directly with the CPU, which speeds up compute workloads by allowing the CPU and GPU to work more closely together. Given that GPUs specialize in crunching massively parallel workloads, close cooperation with the CPU makes for an ideal combination for processing machine learning algorithms.
ARM also made a variety of optimizations in the Mali-G72 to boost performance, including fused multiply-add (FMA), which is used to speed up dot products, convolutions, and matrix multiplication, all of which are vital to machine learning algorithms. The G72 also gains 17 percent energy-efficiency savings for FP32 and FP16 instructions, which is particularly important in mobile applications.
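FMA matters for these workloads because the inner loop of a matrix multiply is one long chain of a*b+c operations, each of which FMA hardware executes as a single instruction. The sketch below models that structure in plain Python; `fma` here is just a stand-in for the hardware operation, not a real intrinsic.

```python
def fma(a, b, c):
    """Stand-in for a hardware fused multiply-add: computes a * b + c."""
    return a * b + c

def matmul(A, B):
    """Naive matrix multiply whose hot loop is nothing but FMAs."""
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for k in range(inner):
                acc = fma(A[i][k], B[k][j], acc)  # one FMA per inner step
            C[i][j] = acc
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(matmul(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

On real hardware the fused version also rounds only once per step instead of twice, so it is both faster and slightly more accurate than a separate multiply and add.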
In summary, 2018's mobile SoCs built around ARM's Cortex-A75, A55, and Mali-G72, including those destined for mid-range products, will boast some efficiency optimizations for machine learning algorithms. No products have been announced yet, but these optimizations will almost certainly make their way into some Qualcomm, MediaTek, HiSilicon, and Samsung SoCs next year.
While next-generation hardware is being designed with machine learning in mind, existing SoCs can already run machine learning applications. ARM's Compute Library ties the company's efforts together: it includes a comprehensive set of functions for imaging and vision applications as well as for machine learning frameworks like Google's TensorFlow. The goal of the library is to enable portable code that can run across different ARM hardware configurations.
ARM isn't the only company enabling developers to write portable code for its hardware. Qualcomm's Hexagon SDK lets developers make use of the DSP capabilities found in its Snapdragon mobile platforms.
Qualcomm's arsenal also includes the Symphony System Manager SDK, which offers a set of APIs specifically designed to enable heterogeneous processing for computer vision, image/data processing, and the development of low-level algorithms. Although Qualcomm may be preparing to introduce special-purpose hardware of its own, it's already using its DSP for audio, imaging, video, and other common smartphone tasks.
The need for a dedicated processor
With all that said, you may wonder why any OEM would bother with dedicated hardware for neural networks. But such custom hardware still has one big advantage: raw performance. Take Huawei's NPU inside the Kirin 970, for example; it's rated at 1.92 TFLOPs of FP16 throughput, more than 3x what the Kirin 970's Mali-G72 GPU can manage (~0.6 TFLOPs of FP16).
Although ARM's latest CPU and GPU designs boast a variety of machine learning energy and performance enhancements, dedicated hardware optimized for highly specific tasks and a limited set of operations can still be more efficient.
The most important takeaway for consumers is that, with these pipeline optimizations arriving in the next generation of SoCs, even mid- and low-tier smartphones without a dedicated neural network processor will see some performance gains for machine learning. This will, in turn, encourage investment in and development of more interesting use cases, which is a win-win for consumers.