Monday, July 9, 2012

Growth of voice-recognition UI drives need for embedded speech processors

Google introduced offline voice typing for Android Jelly Bean
at the 2012 Google I/O developer conference.
Apple's introduction of the Siri voice-input user interface (UI), still described as a beta, on the iPhone 4S, together with the company's pervasive advertising campaign, has raised the public's awareness that speech recognition technology is no longer just science fiction. While Google has incorporated voice input in its mobile apps for several years, Siri's spoken responses have stimulated comparisons to the HAL 9000 computer in Stanley Kubrick's 1968 classic, 2001: A Space Odyssey.

At the recently concluded 2012 Google I/O developer conference, Google responded to Apple by adding voice responses to the search functions in the new Android version 4.1, aka Jelly Bean. Voice search in Android will now be very Siri-like, using what Google calls its Knowledge Graph to formulate contextually aware spoken responses.

Google then went a step further than Apple by introducing offline voice typing in Android Jelly Bean. The average iPhone user may be unaware that Siri relies on a connection to cloud servers, as Google's voice search has until now, but Google now claims to have "shrunk the Google speech recognizer" to fit into smartphones. While this will not provide the intelligent responses of online voice search, Google has addressed one of the fundamental problems with speech UIs: the need for a continuous Internet connection to execute the recognition algorithms and language-database search.

Another problem with speech UIs is response time, or latency. As Google noted in its I/O presentation, a slow connection can make voice input unusable. By embedding the speech recognizer in the device, Android developers can more confidently include voice input in their applications. However, the demo device at Google I/O was a top-of-the-line Nexus smartphone, so it remains to be seen how pervasive the offline Android voice typing functions will be, at least initially. Google appears to be relying on general-purpose application processor horsepower, and on-board memory resources, to execute these functions in software. It is noteworthy that Google qualified the introduction by saying that only U.S. English will be supported at launch; installing a multi-language database was evidently not yet feasible.

Spansion's Acoustic Co-Processor combines custom logic and flash memory
to offload CPUs for speech recognition applications.
Now Spansion, best known as a provider of flash memory, is seeking to address these issues with the introduction of its Acoustic Co-Processor, which combines custom-designed logic and high-speed memory to accelerate and optimize voice-enabled human-machine interfaces (HMIs). The company says the new product will be ideal for voice recognition systems in automotive, gaming, and consumer electronics: applications where adding a new component is less constrained by cost, space, and power than it is in smartphones. As Tensilica CTO Chris Rowen said during his presentation at the Global Press Summit earlier this year, including always-on voice recognition in such devices will drive increased usage of specialized audio DSP blocks, integrated directly into an application processor System on a Chip (SoC).

Alvin Wong, Spansion's VP of Marketing and Business Development, says that the Acoustic Co-Processor is inserted into the speech processing path immediately after the analog-to-digital conversion of the voice input. The processor utilizes voice technology from Nuance Communications, a provider of speech recognition solutions for PC applications, call centers, and healthcare. The co-processor logic, which is Spansion's own design, executes algorithms that score sound packets (similar to syllables or phonemes) from the digitized voice against an acoustic database stored in flash memory on the same chip. The co-processor transmits the sound scores over a Serial Peripheral Interface (SPI) to an application processor, which then executes a search algorithm to select the most likely spoken words from a language database.
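To make that division of labor concrete, here is a minimal C sketch of the host side of the pipeline. All names, sizes, and structures are hypothetical, since Spansion has not published its interface: a stub stands in for the SPI driver, and a three-word toy table replaces the Nuance language database.

    #include <float.h>
    #include <stddef.h>

    #define NUM_PHONEMES 40   /* size of the acoustic symbol set (assumed) */
    #define MAX_FRAMES   64   /* frames buffered per utterance (assumed) */
    #define VOCAB_SIZE    3   /* toy language database */
    #define MAX_WORD_LEN  8

    /* Stub standing in for a hypothetical SPI driver call; a real driver
       would clock one frame of phoneme log-likelihood scores out of the
       co-processor and return 0 on success. */
    static int spi_read_scores(float scores[NUM_PHONEMES])
    {
        (void)scores;
        return -1;            /* no hardware attached in this sketch */
    }

    /* Toy language database: each word as a sequence of phoneme indices. */
    typedef struct {
        const char *text;
        int phonemes[MAX_WORD_LEN];
        int length;
    } word_entry;

    static const word_entry vocab[VOCAB_SIZE] = {
        { "call", { 5, 17, 21 },    3 },
        { "play", { 9, 21, 12 },    3 },
        { "stop", { 30, 2, 17, 9 }, 4 },
    };

    /* Buffer the per-frame scores once, then rate every vocabulary word
       against the same frames by summing the log-likelihood of its
       expected phoneme in each frame. Real decoders run an HMM/Viterbi
       search with time alignment; this collapses that to one frame per
       phoneme purely to show where the co-processor's output plugs in. */
    static const char *decode_utterance(void)
    {
        static float frames[MAX_FRAMES][NUM_PHONEMES];
        int n = 0;

        while (n < MAX_FRAMES && spi_read_scores(frames[n]) == 0)
            n++;

        const char *best_word = NULL;
        float best_score = -FLT_MAX;

        for (int w = 0; w < VOCAB_SIZE; w++) {
            if (vocab[w].length > n)
                continue;     /* not enough frames for this word */
            float total = 0.0f;
            for (int p = 0; p < vocab[w].length; p++)
                total += frames[p][vocab[w].phonemes[p]];
            if (total > best_score) {
                best_score = total;
                best_word = vocab[w].text;
            }
        }
        return best_word;     /* NULL if nothing could be scored */
    }

The point of the split is visible in the code: everything above the SPI read (feature extraction and acoustic scoring) stays on the co-processor, while the application processor only compares compact score vectors against its language database.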

Wong says that by implementing the scoring algorithms in the acoustic co-processor hardware, Spansion is able to significantly improve both response time and accuracy over conventional voice interfaces. In a benchmark with an automobile infotainment system, Spansion claims to have reduced CPU load and response time by 50% compared to a standalone 800MHz ARM processor. Spansion attributes much of the speedup in the scoring process to its dedicated 1.2GB/s wide data bus between the acoustic processor logic and the flash memory. The embedded flash memory allows for storage of as many as 10 to 12 language models, according to Wong, each with its own library of sounds provided by Nuance. Larger acoustic databases provide finer granularity, and hence greater accuracy, in the matching process. In addition, offloading the scoring process frees the application processor to execute a more natural language interface.
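The scoring step being offloaded is, at heart, a memory-streaming problem: every input frame must be compared against the entire acoustic database. A simplified sketch of that inner loop, with assumed model sizes and layout (the actual Nuance format is proprietary), shows why the width of the bus matters.

    #include <stddef.h>

    #define FEATURE_DIM   39    /* typical MFCC feature-vector size */
    #define NUM_DENSITIES 4096  /* stored acoustic model entries (assumed) */

    /* One Gaussian density as it might be laid out in flash: a mean and
       an inverse variance per feature dimension, plus a precomputed
       constant. This layout is illustrative only. */
    typedef struct {
        float mean[FEATURE_DIM];
        float inv_var[FEATURE_DIM];
        float log_const;
    } density;

    /* In the co-processor these entries live in on-chip flash; here a
       zero-initialized table stands in so the sketch compiles. */
    static const density acoustic_db[NUM_DENSITIES];

    /* Score one ~10ms frame: a weighted squared distance between the
       input feature vector and every stored density. Each call streams
       through the whole model, which is why a wide path between the
       scoring logic and the flash dominates performance. */
    static void score_frame(const float feat[FEATURE_DIM],
                            float scores[NUM_DENSITIES])
    {
        for (size_t d = 0; d < NUM_DENSITIES; d++) {
            float s = acoustic_db[d].log_const;
            for (int i = 0; i < FEATURE_DIM; i++) {
                float diff = feat[i] - acoustic_db[d].mean[i];
                s -= 0.5f * diff * diff * acoustic_db[d].inv_var[i];
            }
            scores[d] = s;
        }
    }

At the assumed sizes above, each frame reads roughly 4096 x 79 x 4 bytes, about 1.3MB of model data; at 100 frames per second that is on the order of 130MB/s of sustained reads for a single pass, before any wider search. Traffic of that shape is exactly what a dedicated 1.2GB/s bus between logic and flash is built to absorb.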

Spansion is planning to deliver design samples of its automotive platform in Q3, and is targeting Q1 of 2013 for full production. Device scaling will be based on how much flash memory is required to store language models. The company plans to introduce a low-end co-processor supporting 1 or 2 language models, along with a high-end device capable of 10 to 12 models.
