Information Blog

Posted by Françoise Beaufays, Research Scientist

Over the past several years, deep learning has shown remarkable success on some of the world’s most difficult computer science challenges, from image classification and captioning to translation to model visualization techniques. Recently we announced improvements to Google Voice transcription using Long Short-term Memory Recurrent Neural Networks (LSTM RNNs)—yet another place neural networks are improving useful services. We thought we’d give a little more detail on how we did this.

Since it launched in 2009, Google Voice transcription had used Gaussian Mixture Model (GMM) acoustic models, the state of the art in speech recognition for 30+ years. Sophisticated techniques like adapting the models to the speakers voice augmented this relatively simple modeling method.

Then around 2012, Deep Neural Networks (DNNs) revolutionized the field of speech recognition. These multi-layer networks distinguish sounds better than GMMs by using “discriminative training,” differentiating phonetic units instead of modeling each one independently.

But things really improved rapidly with Recurrent Neural Networks (RNNs), and especially LSTM RNNs, first launched in Android’s speech recognizer in May 2012. Compared to DNNs, LSTM RNNs have additional recurrent connections and memory cells that allow them to “remember” the data they’ve seen so far—much as you interpret the words you hear based on previous words in a sentence.

By then, Google’s old voicemail system, still using GMMs, was far behind the new state of the art. So we decided to rebuild it from scratch, taking advantage of the successes demonstrated by LSTM RNNs. But there were some challenges.

An LSTM memory cell, showing the gating mechanisms that allow it to store
and communicate information. Image credit: Alex Graves

There’s more to speech recognition than recognizing individual sounds in the audio: sequences of sounds need to match existing words, and sequences of words should make sense in the language. This is called “language modeling.” Language models are typically trained over very large corpora of text, often orders of magnitude larger than the acoustic data. It’s easy to find lots of text, but not so easy to find sources that match naturally spoken sentences. Shakespeare’s plays in 17th-century English won’t help on voicemails.

We decided to retrain both the acoustic and language models, and to do so using existing voicemails. We already had a small set of voicemails users had donated for research purposes and that we could transcribe for training and testing, but we needed much more data to retrain the language models. So we asked our users to donate their voicemails in bulk, with the assurance that the messages wouldn’t be looked at or listened to by anyone—only to be used by computers running machine learning algorithms. But how does one train models from data that’s never been human-validated or hand-transcribed?

We couldn’t just use our old transcriptions, because they were already tainted with recognition errors—garbage in, garbage out. Instead, we developed a delicate iterative pipeline to retrain the models. Using improved acoustic models, we could recognize existing voicemails offline to get newer, better transcriptions the language models could be retrained on, and with better language models we could recognize again the same data, and repeat the process. Step by step, the recognition error rate dropped, finally settling at roughly half what it was with the original system! That was an excellent surprise.

There were other (not so positive) surprises too. For example, sometimes the recognizer would skip entire audio segments; it felt as if it was falling asleep and waking up a few seconds later. It turned out that the acoustic model would occasionally get into a “bad state” where it would think the user was not speaking anymore and what it heard was just noise, so it stopped outputting words. When we retrained on that same data, we’d think all those spoken sounds should indeed be ignored, reinforcing that the model should do it even more. It took careful tuning to get the recognizer out of that state of mind.

It was also tough to get punctuation right. The old system relied on hand-crafted rules or “grammars,” which, by design, can’t easily take textual context into account. For example, in an early test our algorithms transcribed the audio “I got the message you left me” as “I got the message. You left me.” To try and tackle this, we again tapped into neural networks, teaching an LSTM to insert punctuation at the right spots. It’s still not perfect, but we’re continually working on ways to improve our accuracy.

In speech recognition as in many other complex services, neural networks are rapidly replacing previous technologies. There’s always room for improvement of course, and we’re already working on new types of networks that show even more promise!

Posted by Andrea Cohan, Google Science Fair Program Manager

(Cross-posted from the Google for Education Blog)

Sometimes the biggest discoveries are made by the youngest scientists. They’re curious and not afraid to ask, and it’s this spirit of exploration that leads them to try, and then try again. Thousands of these inquisitive young minds from around the world submitted projects for this year’s Google Science Fair, and today we’re thrilled to announce the 20 Global Finalists whose bright ideas could change the world.

From purifying water with corn cobs to transporting Ebola antibodies through silk; extracting water from air or quickly transporting vaccines to areas in need, these students have all tried inventive, unconventional things to help solve challenges they see around them. And did we mention that they’re all 18 or younger?

We’ll be highlighting each of the impressive 20 finalist projects over the next 20 days in the Spotlight on a Young Scientist series on the Google for Education blog to share more about these inspirational young people and what inspires them.

Then on September 21st, these students will join us in Mountain View to present their projects to a panel of notable international scientists and scholars, eligible for a $50,000 scholarship and other incredible prizes from our partners at LEGO Education, National Geographic, Scientific American and Virgin Galactic.

Congratulations to our finalists and everyone who submitted projects for this year’s Science Fair. Thank you for being curious and brave enough to try to change the world through science.