The Challenges of Building Inferencing Chips

via Semiconductor Engineering

As the field of AI continues to advance, different approaches to inferencing are being developed. Not all of them will work.

Putting a trained algorithm to work in the field is creating a frenzy of activity across the chip world, spurring designs that range from purpose-built specialty processors and accelerators to more generalized extensions of existing and silicon-proven technologies.

What’s clear so far is that no single chip architecture has been deemed the go-to solution for inferencing. Machine learning is still in its infancy, and so is the entire edge concept where most of these inferencing chips ultimately will be deployed. Moreover, how to utilize this technology across multiple end markets and use cases, let alone choose the best chip architectures, has shifted significantly over the past 12 to 18 months as training algorithms continue to evolve. That makes it difficult, if not impossible, for any single architecture to dominate this field for very long.

“Machine learning can run on a range of processors, depending on what you are most concerned about,” said Dennis Laudick, vice president of marketing for the machine learning group at Arm. “For example, all machine learning will run on an existing CPU today. Where you only want to do light ML, such as keyword spotting, or where response time is not critical, such as analyzing offline photos, then the CPU is capable of doing this. It can still carry out other tasks, which cuts the need for additional silicon investment. Where workloads become heavier, and where performance is critical or power efficiency is a concern, then there are a range of options.”

There are a variety of configuration options available to improve power, performance, area and bandwidth. “For example, many audio-focused ML networks are scalar-heavy and relatively matrix-light, while many object detection algorithms are matrix-heavy but fairly light on scalar needs,” Laudick said. “There is not one right answer.”
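As a rough illustration of that difference, a back-of-envelope op count shows how a small keyword-spotting-style dense layer performs far fewer multiply-accumulates per scalar activation than a convolutional detection layer. The layer shapes below are hypothetical, chosen only to make the ratio visible, and are not figures from Arm:

```python
# Illustrative op-count comparison (hypothetical layer shapes): why an audio
# network can be "scalar-heavy" relative to an object-detection network.

def dense_ops(in_features, out_features):
    """MACs and elementwise (scalar) ops for one fully connected layer + activation."""
    macs = in_features * out_features
    scalar = out_features            # one activation per output
    return macs, scalar

def conv_ops(h, w, cin, cout, k):
    """MACs and elementwise ops for one stride-1 'same' convolution + activation."""
    macs = h * w * cout * cin * k * k
    scalar = h * w * cout
    return macs, scalar

# Hypothetical keyword-spotting layer: 40 MFCC features -> 64 units
kws = dense_ops(40, 64)
# Hypothetical detector layer: 112x112x64 feature map -> 128 channels, 3x3 kernel
det = conv_ops(112, 112, 64, 128, 3)

for name, (macs, scalar) in [("keyword spotting", kws), ("object detection", det)]:
    print(f"{name}: {macs:,} MACs, {scalar:,} scalar ops, ratio {macs // scalar}:1")
```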

While there is agreement that most inferencing will be done at the edge, due to the physical inability to move large quantities of data quickly and efficiently enough, that’s still a very broad and hazy category. “The edge has expanded all the way from low-level IoT to the data center edge,” said Megha Daga, director of product management for AI inference at the edge in Cadence’s Tensilica Group. “Does it have to be a standalone? Do we need another co-processor? Is there a place for the co-processor? All of this really depends on the application market we are looking into, whether it is small-scale IoT or consumer IoT or industrial IoT or data center applications.”


Fig. 1: Arm’s ML processor. Source: Arm

In the consumer IoT space, for instance, power efficiency is critical because some of these devices will need to work off one or more tiny batteries.

“For things like AR/VR, there are AI-at-the-edge requirements, but with that there are other sensors that dominate, as well,” Daga said. “There, you have to work with the vision sensor and audio sensors and compositely look at the system design. From an inferencing perspective, it then becomes more about how much bandwidth you get from a system level to each of these configurations. What is the area budget? The cost due to power is highly critical. In that case, it’s not just inferencing at the edge by itself. There are several core components that need to be looked at for the design of the chip.”

In the case of AR/VR glasses, Daga noted that because they will be sitting on the face, the core power is highly critical. “There, [the engineering team] wants to do certain AI applications, as the trend is taking them toward that. But at the same time, they need to pick the IP and design the chip, such that the traditional computer vision can be done, as well. The AI is not a standard AI inference because you don’t have the area and the power to put in multiple chips. The design team has to look at the composite perspective.”

Other inferencing applications involve the industrial IoT or the data center, where it’s more about data analytics and number-crunching problems. “There are tons of data in different formats,” she explained. “It could be vision data, radar data, or whatever it is in the financial sector. All you have to do is crunch the numbers, so there it’s pure AI inference at the edge. That’s where they are looking more from a cost perspective, because if they don’t have to move the data back and forth to the cloud, it saves cost. I would rather do a lot of work at the edge and then send it over to the cloud, because there you can have that standalone AI inferencing at the edge.”

Speed matters
The key factor here is throughput. “These are generally plugged-in devices,” Daga said. “Power is always critical, and there is only so much dissipation you can afford. But in the hierarchy of systems, there are other things that come before power. Memory is certainly another big component of AI inference at the edge. How much memory and how much bandwidth can you sustain?”
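To make that question concrete, here is a minimal roofline-style check for deciding whether a given layer would be limited by the MAC array or by the memory bandwidth the system can sustain. All of the numbers are hypothetical, not figures from any vendor quoted here:

```python
# Back-of-envelope roofline check (hypothetical numbers): is a layer's throughput
# limited by peak compute or by sustained memory bandwidth?

def bound_by(macs, bytes_moved, peak_tops, bandwidth_gbs):
    compute_s = (2 * macs) / (peak_tops * 1e12)      # 2 ops per multiply-accumulate
    memory_s = bytes_moved / (bandwidth_gbs * 1e9)   # time to move weights/activations
    verdict = "compute-bound" if compute_s > memory_s else "bandwidth-bound"
    return verdict, compute_s, memory_s

# Hypothetical int8 conv layer: 1 GMAC of work, ~8 MB of weight/activation traffic,
# running on a 4 TOPS accelerator with 10 GB/s of sustained DRAM bandwidth.
verdict, tc, tm = bound_by(macs=1e9, bytes_moved=8e6, peak_tops=4, bandwidth_gbs=10)
print(f"{verdict}: compute {tc * 1e3:.2f} ms vs memory {tm * 1e3:.2f} ms")
```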

For companies building these chips, market opportunities are flourishing. Geoff Tate, CEO of Flex Logix, points to markets such as biomedical imaging, where AI is being implemented in ultrasound and genomic systems, along with scientific imaging applications that require very high resolution and very high frame rates. Surveillance cameras for retail stores also are growing in use, allowing retailers to extend cameras already wired into their servers to capture information such as how many customers are coming into the store and how long they wait.

While many, if not most, inferencing chips are mainly CPU-based, Flex Logix uses some embedded FPGA technology in its inferencing chip. “Companies like Microsoft use FPGAs in their datacenter today. They’ve deployed FPGAs for some time. They’ve done it because they found workloads that are common in their datacenter for which they can write code that runs on the FPGA, and basically it will run faster at lower cost and power than if it ran on a processor,” Tate said.

This opens up a whole swath of new options. “If it runs faster on the Xilinx boards than on an Intel Xeon, and the price is better, the customer just wants throughput per dollar and the FPGA can do better,” said Tate. “In the Microsoft data center, they run their inference on FPGAs because inference needs a lot of multiplier-accumulators and the Xeons don’t have them. Microsoft has shown for years that FPGA is good for inference.”

Flex Logix’s path to an inference chip started with a customer asking for an FPGA that was optimized for inferencing. “There was a time when FPGAs just had logic,” he said. “There were no multiplier-accumulators in them. That was in the ’80s, when Xilinx first came out with them. At a later point in time, all FPGAs had multiplier-accumulators in them, introduced primarily for signal processing. They were optimized in terms of their size and their function for signal processing applications. Those multiplier-accumulators are why Microsoft is doing inference using FPGAs, because FPGAs have a fair number of multiplier-accumulators,” Tate explained.

Then development teams started using GPUs for inferencing, because they also have a lot of multiplier-accumulators. But those weren’t optimized for inference either, although Nvidia has been slowly optimizing for it. Flex Logix’s customer asked the company to change its FPGA in two ways. The first was to change all the MACs from 22-bit to 8-bit, throwing away the extra bits to make a smaller multiplier-accumulator. The second, given that more of the smaller MACs could fit into the same area, was to allocate more of the die area to MACs.
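The first of those changes is essentially int8 quantization of the MAC datapath. The sketch below is only an illustrative approximation of that idea in NumPy, not Flex Logix’s actual hardware: weights and activations are rounded to 8 bits, multiplied as int8, and accumulated in a wider integer register before being rescaled:

```python
import numpy as np

# Minimal sketch of shrinking a MAC to 8 bits (illustrative only): quantize to
# int8, multiply in int8, accumulate into a wider register, then rescale.

def quantize_int8(x):
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)   # hypothetical weights
a = rng.standard_normal(256).astype(np.float32)   # hypothetical activations

qw, sw = quantize_int8(w)
qa, sa = quantize_int8(a)

acc = np.sum(qw.astype(np.int32) * qa.astype(np.int32))  # int8 x int8 -> int32 accumulate
approx = acc * sw * sa                                    # rescale back to real units
exact = float(np.dot(w, a))

print(f"float32 dot product: {exact:.4f}")
print(f"int8 MAC result:     {approx:.4f}")
```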

“We’ll find out over this next year which of the architectures actually deliver better throughput per dollar, or the throughput per watt, and those will be the winners,” Tate said. “The customer doesn’t care which one wins. To them it’s just a piece of silicon. They put in their neural model, the software does the magic to make the silicon work, and they don’t care what’s inside as long as the answers come out, at high throughput, and the price and power are right.”


Fig. 2: Flex Logix’s reconfigurable approach. Source: Flex Logix

Different approaches
It’s still far too early to determine who will win in this competition. Chris Rowen, CEO of BabbleLabs, believes there will be inferencing subsystems across a wide range of silicon platforms rather than lots of successful standalone pure inferencing chips.

“Deep learning inference is a powerful new computing tool, but few end solutions consist solely of inference execution,” Rowen said. “There is also conventional software and lots of use-case-specific interfaces (both hardware and software) that make up a silicon solution. In addition, neural network inferencing is so inherently parallel and efficient that a modest amount of silicon, say 5 to 10 mm², can support huge throughput. Would you add a separate chip to a board if you can get a more efficient on-board subsystem for less money and power?”

For the most part, only the very compute-intensive vision and real-time enterprise data analytics are going to justify big standalone chips for inference, Rowen said. “Of course, big standalone chips for neural network training will be a different story. There also may be a case for new inference chips close to memory. Some systems will require high memory bandwidth for inference, but not for other system functions, so specialized inference chips that sit close to new high-bandwidth memories may also find a niche. However, many high bandwidth systems need that bandwidth for more than just inference operations, so it will be more effective to combine inference and non-inference subsystems sharing common high-bandwidth memory.”

Still, when looking at chipsets or chips for AI, what’s developed over the last six to eight years is the concept of a deep learning accelerator, the prime example being the GPU, observed Roger Levinson, chief operating officer at BrainChip. “This is where Nvidia did brilliantly, in realizing that its floating-point math processor is great for doing matrix multiplications, which is the calculation required to do convolution in neural networks. Convolution is how an input such as an image gets processed, and that’s what GPUs did well. It has enabled a huge step forward in our capabilities in AI, and we have to be extremely thankful that we have this hardware, because without it we wouldn’t have gotten anywhere. That was a technology breakthrough that made the first generation of AI practical, and it has done a great job of getting us to where we are. But the power is way too high.”
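The link between a GPU’s matrix hardware and convolution comes from the standard trick of lowering a convolution to a single matrix multiplication, often called im2col. A minimal NumPy sketch of that mapping, independent of any particular GPU library, looks like this:

```python
import numpy as np

# Illustrative im2col lowering: a 2D convolution (the cross-correlation deep
# learning frameworks call convolution) expressed as one matrix multiplication.

def conv2d_as_matmul(image, kernels):
    """image: (H, W), kernels: (N, k, k). Valid convolution via a single matmul."""
    h, w = image.shape
    n, k, _ = kernels.shape
    oh, ow = h - k + 1, w - k + 1
    # Each output position becomes one column holding its k*k input patch.
    cols = np.stack([image[i:i + k, j:j + k].ravel()
                     for i in range(oh) for j in range(ow)], axis=1)  # (k*k, oh*ow)
    out = kernels.reshape(n, k * k) @ cols                            # one big matmul
    return out.reshape(n, oh, ow)

rng = np.random.default_rng(1)
img = rng.standard_normal((8, 8))
kers = rng.standard_normal((4, 3, 3))
print(conv2d_as_matmul(img, kers).shape)  # (4, 6, 6)
```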

Further, the ability to do real learning is not enabled through that hardware, he said. “The traditional architecture uses a CPU or host. In a data center that’s going to be a big host, or it might be a little microcontroller, but either way the CPU is really the brains of the system. That’s what’s doing the network algorithm management and running the algorithm itself. It offloads compute-heavy workloads to an accelerator, whether it’s called a deep learning accelerator, a MAC accelerator or an AI accelerator. That’s a chip provisioned with a systolic array or some other structure to do very efficient multiply-accumulates and accelerate the calculations that support the algorithm running on the CPU. Data flows in and out of it as the CPU drives it. The CPU says, ‘I need to run a bunch of calculations. Here’s your data. Do the calculations, put the results back into memory, and then I’ll process that and send you the next batch.’ The whole idea is to do that as fast as possible. Folks are looking at different architectures for how to optimize this.”
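Schematically, that host/accelerator split looks like the loop below. This is only a stand-in sketch, not any particular chip’s API, and the “accelerator” here is an ordinary NumPy call: the CPU owns the model and the orchestration, while the MAC-heavy matrix work is handed off batch by batch:

```python
import numpy as np

def accelerator_matmul(activations, weights):
    """Stand-in for a systolic-array MAC offload; here it is plain NumPy."""
    return activations @ weights

def host_inference_loop(batches, weights):
    results = []
    for batch in batches:                           # CPU: orchestration, pre-processing
        raw = accelerator_matmul(batch, weights)    # offloaded multiply-accumulate work
        results.append(np.argmax(raw, axis=1))      # CPU: light post-processing
    return results

rng = np.random.default_rng(2)
weights = rng.standard_normal((128, 10)).astype(np.float32)          # hypothetical model
batches = [rng.standard_normal((32, 128)).astype(np.float32) for _ in range(3)]
print([r.shape for r in host_inference_loop(batches, weights)])
```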

BrainChip’s approach is to build a power-efficient neuromorphic, purpose-built processor for doing this job. “It’s like the von Neumann computer was set up in a certain way to manage data, manipulate data and do calculations efficiently. It’s great for those types of workloads. But for AI workloads, you want a processor that’s different. It needs to be purpose built in order to process neural network types of information,” Levinson added.


Fig. 3: BrainChip’s Akida architecture. Source: BrainChip

Conclusion
Tradeoffs between specialized processors and general-purpose processors will continue to confound the industry for the foreseeable future. This may provide an opening for eFPGAs or other programmable logic or software, but it will take time before there is any clarity in this market.

Whether the best solution is purpose-built or an off-the-shelf component will vary widely by application, and ultimately by how these solutions perform over time and under load. Regardless, the inferencing market has opened the door to very different architectures and approaches than in the past, and there is no indication that will change anytime soon.