The neural networks behind Google Voice transcription
Posted by gilogo at 11:51 AM
Posted by Françoise Beaufays, Research Scientist
Over the past several years, deep learning has shown remarkable success on some of the world's most difficult computer science challenges, from image classification and captioning to translation to model visualization techniques. Recently we announced improvements to Google Voice transcription using Long Short-term Memory Recurrent Neural Networks (LSTM RNNs), yet another place neural networks are improving useful services. We thought we'd give a little more detail on how we did this.
Since it launched in 2009, Google Voice transcription had used Gaussian Mixture Model (GMM) acoustic models, the state of the art in speech recognition for 30+ years. Sophisticated techniques like adapting the models to the speaker's voice augmented this relatively simple modeling method.
Then around 2012, Deep Neural Networks (DNNs) revolutionized the field of speech recognition. These multi-layer networks distinguish sounds better than GMMs by using discriminative training, differentiating phonetic units instead of modeling each one independently.
But things really improved rapidly with Recurrent Neural Networks (RNNs), and especially LSTM RNNs, first launched in Android's speech recognizer in May 2012. Compared to DNNs, LSTM RNNs have additional recurrent connections and memory cells that allow them to "remember" the data they've seen so far, much as you interpret the words you hear based on previous words in a sentence.
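To make the idea of recurrent connections and memory cells concrete, here is a minimal sketch of one step of a standard LSTM cell, reduced to scalar inputs and states for readability. It is not the production model: the function names, the `WEIGHTS` table, and the weight values are all illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, weights):
    """One step of a scalar LSTM cell.

    weights maps a gate name to (input weight, recurrent weight, bias).
    h_prev feeds back into every gate: that is the recurrent connection.
    c_prev is the memory cell carried from the previous step.
    """
    def pre_activation(gate):
        w_x, w_h, bias = weights[gate]
        return w_x * x + w_h * h_prev + bias

    input_gate = sigmoid(pre_activation("input"))      # how much new content to write
    forget_gate = sigmoid(pre_activation("forget"))    # how much old memory to keep
    output_gate = sigmoid(pre_activation("output"))    # how much memory to expose
    candidate = math.tanh(pre_activation("candidate")) # proposed new content

    c = forget_gate * c_prev + input_gate * candidate
    h = output_gate * math.tanh(c)
    return h, c

def run(sequence, weights):
    """Feed a sequence through the cell, starting from an empty state."""
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, weights)
    return h, c

# Arbitrary illustrative weights (assumption); a real model learns these.
WEIGHTS = {gate: (1.0, 1.0, 0.0)
           for gate in ("input", "forget", "output", "candidate")}
```

With these weights, an input seen only at the first step still leaves a nonzero cell state after later all-zero inputs, which is the "remembering" behavior described above.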
By then, Google's old voicemail system, still using GMMs, was far behind the new state of the art. So we decided to rebuild it from scratch, taking advantage of the successes demonstrated by LSTM RNNs. But there were some challenges.
There's more to speech recognition than recognizing individual sounds in the audio: sequences of sounds need to match existing words, and sequences of words should make sense in the language. This is called "language modeling." Language models are typically trained over very large corpora of text, often orders of magnitude larger than the acoustic data. It's easy to find lots of text, but not so easy to find sources that match naturally spoken sentences. Shakespeare's plays in 17th-century English won't help on voicemails.
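As a rough illustration of what a language model contributes, the sketch below trains a tiny bigram model with add-one smoothing and scores candidate word sequences. This is a textbook technique chosen for brevity, not the model used in Google Voice; the corpus and the function names are made up for the example.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count word pairs, with sentence start and end markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(words[:-1])                 # contexts only
        bigrams.update(zip(words[:-1], words[1:]))
    vocab_size = len(set(unigrams) | {"</s>"})
    return unigrams, bigrams, vocab_size

def log_prob(sentence, model):
    """Log probability of a sentence under the add-one smoothed bigram model."""
    unigrams, bigrams, vocab_size = model
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(words[:-1], words[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) /
                          (unigrams[prev] + vocab_size))
    return total

# Hypothetical voicemail-like training text (assumption).
CORPUS = [
    "call me back when you can",
    "please call me back",
    "give me a call back",
]
MODEL = train_bigram(CORPUS)
```

Two hypotheses that sound alike, such as "call me back" and "cold me bat", get very different scores: the first matches word sequences seen in training, so the model prefers it. This also shows why the training text must resemble the target speech, since sequences the model has never seen score poorly.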
We decided to retrain both the acoustic and language models, and to do so using existing voicemails. We already had a small set of voicemails users had donated for research purposes and that we could transcribe for training and testing, but we needed much more data to retrain the language models. So we asked our users to donate their voicemails in bulk, with the assurance that the messages wouldn't be looked at or listened to by anyone: they would be used only by computers running machine learning algorithms. But how does one train models from data that's never been human-validated or hand-transcribed?
We couldn't just use our old transcriptions, because they were already tainted with recognition errors: garbage in, garbage out. Instead, we developed a delicate iterative pipeline to retrain the models. Using improved acoustic models, we could recognize existing voicemails offline to get newer, better transcriptions the language models could be retrained on, and with better language models we could recognize the same data again, and repeat the process. Step by step, the recognition error rate dropped, finally settling at roughly half what it was with the original system! That was an excellent surprise.
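The shape of that loop can be sketched with a toy stand-in. Everything here is an assumption made for illustration: each utterance is reduced to a list of candidate transcripts with fixed "acoustic" scores, the language model is a smoothed unigram counter, and only the language model is retrained (the real pipeline also improved the acoustic models).

```python
import math
from collections import Counter

def train_lm(transcripts):
    """Toy language model: word counts over the given transcripts."""
    return Counter(word for text in transcripts for word in text.split())

def lm_score(text, counts):
    """Add-one smoothed unigram log probability of a transcript."""
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen words
    return sum(math.log((counts[word] + 1) / (total + vocab_size))
               for word in text.split())

def recognize(utterances, counts):
    """Pick, per utterance, the candidate with the best combined score."""
    return [
        max(candidates, key=lambda cand: cand[1] + lm_score(cand[0], counts))[0]
        for candidates in utterances
    ]

def iterate(utterances, seed_transcripts, max_rounds=5):
    """Recognize, retrain the LM on the output, and repeat until stable."""
    lm = train_lm(seed_transcripts)
    transcripts = None
    for _ in range(max_rounds):
        new_transcripts = recognize(utterances, lm)
        if new_transcripts == transcripts:
            break  # nothing changed, so another round would not help
        transcripts = new_transcripts
        lm = train_lm(seed_transcripts + transcripts)
    return transcripts

# Hypothetical data: (candidate transcript, acoustic log score) per utterance.
UTTERANCES = [
    [("call me back", -1.0), ("cold me bat", -0.9)],
    [("please call back", -1.0), ("peas call bat", -0.9)],
]
SEED = ["call me back"]  # stands in for the small hand-transcribed set
```

In this toy data the acoustic scores slightly favor the wrong candidates, and the language model seeded from a small trusted set pulls the result back to the sensible ones. Mixing the seed transcripts into every round is a design choice of this sketch, meant to keep the loop anchored to human-checked text.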
There were other (not so positive) surprises too. For example, sometimes the recognizer would skip entire audio segments; it felt as if it was falling asleep and waking up a few seconds later. It turned out that the acoustic model would occasionally get into a "bad state" where it would think the user was not speaking anymore and what it heard was just noise, so it stopped outputting words. When we retrained on that same data, we'd think all those spoken sounds should indeed be ignored, reinforcing that the model should do it even more. It took careful tuning to get the recognizer out of that state of mind.
It was also tough to get punctuation right. The old system relied on hand-crafted rules or "grammars," which, by design, can't easily take textual context into account. For example, in an early test our algorithms transcribed the audio "I got the message you left me" as "I got the message. You left me." To try and tackle this, we again tapped into neural networks, teaching an LSTM to insert punctuation at the right spots. It's still not perfect, but we're continually working on ways to improve our accuracy.
In speech recognition as in many other complex services, neural networks are rapidly replacing previous technologies. There's always room for improvement of course, and we're already working on new types of networks that show even more promise!
Figure: An LSTM memory cell, showing the gating mechanisms that allow it to store and communicate information. Image credit: Alex Graves
Young people who are changing the world through science
Posted by gilogo at 10:01 AM
Posted by Andrea Cohan, Google Science Fair Program Manager
(Cross-posted from the Google for Education Blog)
Sometimes the biggest discoveries are made by the youngest scientists. They're curious and not afraid to ask, and it's this spirit of exploration that leads them to try, and then try again. Thousands of these inquisitive young minds from around the world submitted projects for this year's Google Science Fair, and today we're thrilled to announce the 20 Global Finalists whose bright ideas could change the world.
From purifying water with corn cobs to transporting Ebola antibodies through silk, and from extracting water from air to quickly transporting vaccines to areas in need, these students have all tried inventive, unconventional things to help solve challenges they see around them. And did we mention that they're all 18 or younger?
We'll be highlighting each of the 20 finalist projects over the next 20 days in the Spotlight on a Young Scientist series on the Google for Education blog, to share more about these young people and what inspires them.
Then on September 21st, these students will join us in Mountain View to present their projects to a panel of notable international scientists and scholars. They will be eligible for a $50,000 scholarship and other incredible prizes from our partners at LEGO Education, National Geographic, Scientific American and Virgin Galactic.
Congratulations to our finalists and everyone who submitted projects for this year's Science Fair. Thank you for being curious and brave enough to try to change the world through science.
Largest collection of Google Logos on the web Set 8
Posted by gilogo at 7:14 AM
Set1 Set2 Set3 Set4 Set5 Set6 Set7 Set8 Set9 Set10
