Keyword Spotting: From ‘Hey, Siri’ to Advanced Voice-Activated Apps
A natural-language processing technique known as keyword spotting is gaining traction with the proliferation of smart appliances controlled by voice commands.
Voice assistants from Amazon, Google, Apple and others act on a phrase that follows a “hot word” such as “Hey, Google” or “Hey, Siri” and appear to respond almost immediately. In fact, the response has a delay of a fraction of a second, which is acceptable in a smart speaker device.
How can a small device be so clever?
The voice assistant uses a digital signal processor to digest the first “hot word.” The phrases that follow are sent via the Internet to the cloud. The speech is then converted into streams of numbers, which are processed in a recurrent convolutional neural network that remembers previous internal states, so that it can be trained to recognize phrases or sequences of words.
These data streams are processed in a datacenter, and the answer or song requested is sent back to the voice assistant via the web. This works well in situations that are non-critical, where a delay does not matter and where Internet connections are reliable.
The neural networks located in data centers are trained using millions of samples in a method that resembles successive approximation; errors are initially very large, but are reduced by feeding the error back into an algorithm that adjusts the network parameters. The error is reduced in each training cycle. Training cycles are then repeated until the output is correct. This is done for every word and phrase in the dataset. Training such networks can take a very long time, on the order of weeks.
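The error-feedback loop described above can be sketched at toy scale. The following is a minimal illustration, assuming a single-weight linear model fitted by plain gradient descent; the function name `train`, the learning rate and the toy data are hypothetical and bear no relation to how production speech networks are trained.

```python
def train(samples, learning_rate=0.3, cycles=200):
    """Fit y = weight * x + bias by feeding the error back each cycle."""
    weight, bias = 0.0, 0.0
    history = []  # mean squared error recorded at the start of each cycle
    n = len(samples)
    for _ in range(cycles):
        grad_w = grad_b = total = 0.0
        for x, target in samples:
            error = (weight * x + bias) - target
            total += error * error
            # Gradient of the mean squared error with respect to each parameter.
            grad_w += 2.0 * error * x / n
            grad_b += 2.0 * error / n
        history.append(total / n)
        # Adjust the parameters in the direction that reduces the error.
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias, history


# Toy data generated from y = 3x + 1 (an assumption for illustration only).
samples = [(x / 10.0, 3.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]
weight, bias, history = train(samples)
print(round(weight, 3), round(bias, 3))  # 3.0 1.0
```

The first entry in `history` is large and the last is close to zero, which mirrors the pattern of large initial errors shrinking over repeated training cycles.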
Once trained, the network can recognize words and phrases spoken by different individuals. The recognition process, called inference, requires millions of multiply-accumulate (MAC) operations, which is why the information cannot be processed in a timely manner on a microprocessor within the device.
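To show where the MAC count comes from, here is a small sketch of one fully connected layer that tallies each multiply-accumulate as it goes. The helper names `dense_layer` and `count_macs` and the layer sizes are illustrative assumptions, not taken from any particular voice assistant.

```python
def dense_layer(inputs, weights, biases):
    """Compute one fully connected layer and count its MAC operations."""
    outputs = []
    macs = 0
    for neuron_weights, bias in zip(weights, biases):
        accumulator = bias
        for value, weight in zip(inputs, neuron_weights):
            accumulator += value * weight  # one multiply-accumulate
            macs += 1
        outputs.append(accumulator)
    return outputs, macs


def count_macs(layer_sizes):
    """MACs for one inference pass through a stack of dense layers."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


inputs = [1.0, 2.0, 3.0]
weights = [[0.5, 0.0, -1.0], [1.0, 1.0, 1.0]]
biases = [0.0, 0.5]
outputs, macs = dense_layer(inputs, weights, biases)
print(outputs, macs)             # [-2.5, 6.5] 6
print(count_macs([1000, 1000]))  # 1000000
```

A single layer with 1,000 inputs and 1,000 outputs already costs a million MACs per inference, so a multi-layer network reaches the millions quickly.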
In keyword spotting, multiple words need to be recognized. The delay incurred by sending the audio to a datacenter is not acceptable, and an Internet connection cannot always be guaranteed. Hence, local processing of phrases on the device is preferable.
One solution is to shrink the multiply-accumulate functions into smaller chips. The Google Edge TPU (Tensor Processing Unit), for instance, incorporates many array multipliers and math functions. This solution still requires a microprocessor to run the neural network, but the MAC functions are passed on to the chip and accelerated.
While this approach allows a small microprocessor to run larger neural networks, it comes with disadvantages: The power consumption remains too high for small or battery-powered appliances. With diminishing size comes diminishing performance. Small dedicated arrays of multipliers are not as plentiful or as fast as those provided by large, power-hungry GPUs or TPUs in datacenters.
An alternative approach involves smaller, tighter neural networks for keyword processing. Rather than performing complex processing in large recurrent networks, these networks process keywords by converting a stream of audio samples into a spectrogram using a feature-extraction method based on mel-frequency cepstral coefficients (MFCC). The spectrogram represents the frequency spectrum of the signal over time.
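The conversion from an audio stream to MFCC features can be sketched in a few steps: split the signal into frames, take the power spectrum of each frame, pool it with triangular mel-scale filters, take the logarithm, and apply a discrete cosine transform. The version below is a slow textbook formulation using only the standard library; the frame length, hop size, filter count and coefficient count are arbitrary assumptions, and a real device would use an optimized FFT rather than this direct DFT.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Hamming-window a frame and return the power of each DFT bin up to Nyquist."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n)
                    for i in range(n))
        spectrum.append(abs(total) ** 2 / n)
    return spectrum


def mel_filterbank(num_filters, frame_length, sample_rate):
    """Build triangular filters spaced evenly on the mel scale."""
    num_bins = frame_length // 2 + 1
    top = hz_to_mel(sample_rate / 2.0)
    mel_points = [top * i / (num_filters + 1) for i in range(num_filters + 2)]
    bin_points = [mel_to_hz(m) * frame_length / sample_rate for m in mel_points]
    filters = []
    for f in range(num_filters):
        left, centre, right = bin_points[f:f + 3]
        row = []
        for b in range(num_bins):
            if left < b <= centre:
                row.append((b - left) / (centre - left))
            elif centre < b < right:
                row.append((right - b) / (right - centre))
            else:
                row.append(0.0)
        filters.append(row)
    return filters


def mfcc(signal, sample_rate, frame_length=64, hop=32, num_filters=10, num_coeffs=5):
    """Return one list of cepstral coefficients per frame."""
    filters = mel_filterbank(num_filters, frame_length, sample_rate)
    features = []
    for start in range(0, len(signal) - frame_length + 1, hop):
        power = power_spectrum(signal[start:start + frame_length])
        # A small constant keeps the logarithm defined for empty filters.
        log_energy = [math.log(sum(w * p for w, p in zip(row, power)) + 1e-10)
                      for row in filters]
        # DCT-II turns the log mel energies into cepstral coefficients.
        features.append([
            sum(e * math.cos(math.pi * c * (m + 0.5) / num_filters)
                for m, e in enumerate(log_energy))
            for c in range(num_coeffs)
        ])
    return features


# A 0.1-second, 1 kHz test tone sampled at 8 kHz (illustrative values).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * i / sample_rate) for i in range(800)]
spectrum = power_spectrum(tone[:64])
features = mfcc(tone, sample_rate)
print(spectrum.index(max(spectrum)))   # 8, the bin at 1000 Hz
print(len(features), len(features[0])) # 24 5
```

Stacking the per-frame coefficient lists side by side gives the two-dimensional, image-like array that a small classification network can take as input.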
The spectrogram image is fed into a much simpler 7-layer feed-forward neural network that has been trained to recognize the features of a keyword set. The Google keyword dataset, for instance, consists of 65,000 one-second samples of 30 individual words spoken by thousands of different people. Examples of keywords are UP, DOWN, LEFT, RIGHT, STOP, GO, ON and OFF.
An alternative approach
We have taken a completely different approach, processing sound, images, data and odors in event-based hardware. Brainchip was founded long before the current machine learning rage. Advancing processing methods for neural networks and artificial intelligence is our main aim, and we are focused on neuromorphic hardware designs.
The human brain does not run instructions, but instead relies on neural cells. These cells process information and communicate in spikes, short bursts of electrical energy that signal the occurrence of an “event” such as a change in color, a line, a frequency, or touch.
By contrast, computers are designed to operate on data bits and execute instructions written by a programmer. These are two very different processing techniques. It takes many computer instructions to emulate the function of brain cells — in the form of a neural network — on a computer.
The brain is the ultimate example of a general intelligent system. We realized we could do away with the instructions and build very efficient digital circuits that compute in the same way the brain does. This is exactly what Brainchip has done in developing the Akida neural processor.
The chip evolved further when we combined deep learning capabilities with the event-based spiking neural network (SNN) hardware, thus significantly lowering power requirements and improving performance — with the added advantage of rapid on-chip learning. The Akida chip can process the Google keyword dataset, utilizing the simple 7-layer neural network described above, within a power budget of less than 200 microwatts.
Akida was trained using the ImageNet dataset, enabling it to instantly learn to recognize a new object without expensive retraining. The chip has built-in sparsity. The all-digital design is event-based and therefore produces no output unless the input stimulus causes a neuron to exceed its threshold.
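The threshold behavior described above can be illustrated with a generic integrate-and-fire neuron from the textbook literature. This is only a conceptual sketch: the function name, the threshold value and the reset rule are assumptions and do not describe the Akida circuit.

```python
def integrate_and_fire(inputs, threshold=1.0):
    """Return the time steps at which the neuron emits an event (spike)."""
    potential = 0.0
    events = []
    for step, stimulus in enumerate(inputs):
        potential += stimulus  # accumulate incoming stimulus
        if potential > threshold:
            events.append(step)
            potential = 0.0  # reset after firing
    return events


quiet = [0.0] * 8
burst = [0.0, 0.0, 0.6, 0.6, 0.0, 0.0, 0.0, 0.0]
quiet_events = integrate_and_fire(quiet)
burst_events = integrate_and_fire(burst)
print(quiet_events)  # []
print(burst_events)  # [3]
```

The quiet input produces no events at all, so nothing downstream has any work to do; only the burst that pushes the potential past the threshold generates output.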
This can be illustrated with a simplified, although extreme, example. Imagine an image with a single dot in the middle. A conventional neural network needs to process every location of the image to determine if there is something there. It takes a block of pixels from the image and performs a convolution. The results are zero, and these zeros are propagated throughout the entire network, together with the zeros generated by all the other blocks, until the network reaches the dot. Detecting and eliminating the zeros would add latency and would slow processing down rather than speed it up. Nearly 500 million operations are required to determine that there is a single dot in the image.
By contrast, the Akida event-based approach responds only to the one event, the single dot. All other locations contain no information, and zeros are not propagated through the network because they do not generate an event. In practical terms, with real images this sparsity results in 40 to 60 percent fewer computations, producing the same classification results while using less power.
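The single-dot comparison can be made concrete for one convolution layer. The sketch below counts the multiply-accumulates of a dense convolution against an event-driven version that only touches the neighborhood of each nonzero pixel. The image size, kernel and function names are hypothetical, and the counts are toy numbers for a single layer, not the figures quoted above for a full network.

```python
def dense_conv_macs(height, width, kernel_size):
    """MACs for a same-size dense convolution: every pixel, every kernel tap."""
    return height * width * kernel_size * kernel_size


def event_conv(events, height, width, kernel):
    """Scatter each event (y, x, value) into the output; zeros cost nothing."""
    half = len(kernel) // 2
    output = [[0.0] * width for _ in range(height)]
    macs = 0
    for y, x, value in events:
        for dy, kernel_row in enumerate(kernel):
            for dx, weight in enumerate(kernel_row):
                oy, ox = y + dy - half, x + dx - half
                if 0 <= oy < height and 0 <= ox < width:
                    output[oy][ox] += value * weight
                    macs += 1
    return output, macs


kernel = [[1.0, 2.0, 1.0],
          [2.0, 4.0, 2.0],
          [1.0, 2.0, 1.0]]
events = [(16, 16, 1.0)]  # a single dot in the middle of a 32x32 image
output, macs = event_conv(events, 32, 32, kernel)
print(dense_conv_macs(32, 32, 3))  # 9216
print(macs)                        # 9
```

Both approaches yield the same output for this layer, but the event-driven version performs work proportional to the number of events rather than to the size of the image.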
A keyword spotting application using the Akida chip trained on the Google Speech Commands Dataset can run for years off a penlight battery. The same circuit configured to use 30 layers and all 80 neural processing units on the chip can be used to process the entire ImageNet dataset in real time at less than 200 milliwatts (about five days on a penlight battery).
The MobileNet network for image classification fits comfortably on the chip, including all the required memory. The on-chip, real-time learning capability makes it possible to add to the library of learned words, a nice feature that can be used for personalized word recognition like names, places and customized commands.
Another option for keyword spotting is the Syntiant NDP101 chip. While this device also operates at comparably low power (200 microwatts), it is a dedicated audio processor that integrates an audio front end, buffering and feature extraction together with the neural network. Syntiant expects to replace digital MACs with an in-memory analog circuit in the future to further reduce power.
The Akida chip has the added advantages of on-chip learning and versatility. It can also be reconfigured to perform sound or image classification, odor identification or to classify features extracted from data. Another advantage of local processing is that no images or data are exposed on the Internet, significantly reducing privacy risks.
Applications for the technology range from voice-activated appliances to identifying worn-out components in manufacturing equipment that need replacing. The technology also could be used to determine tire wear based on the sound a tire makes on a road surface. Other automotive applications include monitoring a driver’s alertness, listening to the engine to determine if maintenance is required and scanning for vehicles in the driver’s blind spot.
We expect Akida to evolve, incorporating the structures of the brain, particularly cortical neural networks aimed at artificial general intelligence (AGI). This is a form of machine intelligence that can be trained to perform multiple tasks. AGI technology could be used in autonomous vehicles, with sufficient intelligence to control a vehicle and eventually learn to drive much as humans do. To be sure, there will be many intermediate steps along the way to that goal.
A future Akida device will include a more sophisticated neural network model that can learn increasingly complex tasks. Stay tuned.
— Peter AJ van der Made is the CTO of Brainchip.