Showing posts with label networks. Show all posts
Showing posts with label networks. Show all posts

The neural networks behind Google Voice transcription



Over the past several years, deep learning has shown remarkable success on some of the world’s most difficult computer science challenges, from image classification and captioning to translation to model visualization techniques. Recently we announced improvements to Google Voice transcription using Long Short-term Memory Recurrent Neural Networks (LSTM RNNs)—yet another place neural networks are improving useful services. We thought we’d give a little more detail on how we did this.

Since it launched in 2009, Google Voice transcription had used Gaussian Mixture Model (GMM) acoustic models, the state of the art in speech recognition for 30+ years. Sophisticated techniques like adapting the models to the speakers voice augmented this relatively simple modeling method.

Then around 2012, Deep Neural Networks (DNNs) revolutionized the field of speech recognition. These multi-layer networks distinguish sounds better than GMMs by using “discriminative training,” differentiating phonetic units instead of modeling each one independently.

But things really improved rapidly with Recurrent Neural Networks (RNNs), and especially LSTM RNNs, first launched in Android’s speech recognizer in May 2012. Compared to DNNs, LSTM RNNs have additional recurrent connections and memory cells that allow them to “remember” the data they’ve seen so far—much as you interpret the words you hear based on previous words in a sentence.

By then, Google’s old voicemail system, still using GMMs, was far behind the new state of the art. So we decided to rebuild it from scratch, taking advantage of the successes demonstrated by LSTM RNNs. But there were some challenges.
An LSTM memory cell, showing the gating mechanisms that allow it to store
and communicate information. Image credit: Alex Graves
There’s more to speech recognition than recognizing individual sounds in the audio: sequences of sounds need to match existing words, and sequences of words should make sense in the language. This is called “language modeling.” Language models are typically trained over very large corpora of text, often orders of magnitude larger than the acoustic data. It’s easy to find lots of text, but not so easy to find sources that match naturally spoken sentences. Shakespeare’s plays in 17th-century English won’t help on voicemails.

We decided to retrain both the acoustic and language models, and to do so using existing voicemails. We already had a small set of voicemails users had donated for research purposes and that we could transcribe for training and testing, but we needed much more data to retrain the language models. So we asked our users to donate their voicemails in bulk, with the assurance that the messages wouldn’t be looked at or listened to by anyone—only to be used by computers running machine learning algorithms. But how does one train models from data that’s never been human-validated or hand-transcribed?

We couldn’t just use our old transcriptions, because they were already tainted with recognition errors—garbage in, garbage out. Instead, we developed a delicate iterative pipeline to retrain the models. Using improved acoustic models, we could recognize existing voicemails offline to get newer, better transcriptions the language models could be retrained on, and with better language models we could recognize again the same data, and repeat the process. Step by step, the recognition error rate dropped, finally settling at roughly half what it was with the original system! That was an excellent surprise.

There were other (not so positive) surprises too. For example, sometimes the recognizer would skip entire audio segments; it felt as if it was falling asleep and waking up a few seconds later. It turned out that the acoustic model would occasionally get into a “bad state” where it would think the user was not speaking anymore and what it heard was just noise, so it stopped outputting words. When we retrained on that same data, we’d think all those spoken sounds should indeed be ignored, reinforcing that the model should do it even more. It took careful tuning to get the recognizer out of that state of mind.

It was also tough to get punctuation right. The old system relied on hand-crafted rules or “grammars,” which, by design, can’t easily take textual context into account. For example, in an early test our algorithms transcribed the audio “I got the message you left me” as “I got the message. You left me.” To try and tackle this, we again tapped into neural networks, teaching an LSTM to insert punctuation at the right spots. It’s still not perfect, but we’re continually working on ways to improve our accuracy.

In speech recognition as in many other complex services, neural networks are rapidly replacing previous technologies. There’s always room for improvement of course, and we’re already working on new types of networks that show even more promise!
Read More..

Beyond Short Snippets Deep Networks for Video Classification



Convolutional Neural Networks (CNNs) have recently shown rapid progress in advancing the state of the art of detecting and classifying objects in static images, automatically learning complex features in pictures without the need for manually annotated features. But what if one wanted not only to identify objects in static images, but also analyze what a video is about? After all, a video isn’t much more than a string of static images linked together in time.

As it turns out, video analysis provides even more information to the object detection and recognition task performed by CNN’s by adding a temporal component through which motion and other information can be also be used to improve classification. However, analyzing entire videos is challenging from a modeling perspective because one must model variable length videos with a fixed number of parameters. Not to mention that modeling variable length videos is computationally very intensive.

In Beyond Short Snippets: Deep Networks for Video Classification, to be presented at the 2015 Computer Vision and Pattern Recognition conference (CVPR 2015), we1 evaluated two approaches - feature pooling networks and recurrent neural networks (RNNs) - capable of modeling variable length videos with a fixed number of parameters while maintaining a low computational footprint. In doing so, we were able to not only show that learning a high level global description of the video’s temporal evolution is very important for accurate video classification, but that our best networks exhibited significant performance improvements over previously published results on the Sports 1 million dataset (Sports-1M).

In previous work, we employed 3D-convolutions (meaning convolutions over time and space) over short video clips - typically just a few seconds - to learn motion features from raw frames implicitly and then aggregate predictions at the video level. For purposes of video classification, the low level motion features were only marginally outperforming models in which no motion was modeled.

To understand why, consider the following two images which are very similar visually but obtain drastically different scores from a CNN model trained on static images:
Slight differences in object poses/context can change the predicted class/confidence of CNNs trained on static images.
Since each individual video frame forms only a small part of the video’s story, static frames and short video snippets (2-3 secs) use incomplete information and could easily confuse subtle fine-grained distinctions between classes (e.g: Tae Kwon Do vs. Systema) or use portions of the video irrelevant to the action of interest.

To get around this frame-by-frame confusion, we used feature pooling networks that independently process each frame and then pool/aggregate the frame-level features over the entire video at various stages. Another approach we took was to utilize an RNN (derived from Long Short Term Memory units) instead of feature pooling, allowing the network itself to decide which parts of the video are important for classification. By sharing parameters through time, both feature pooling and RNN architectures are able to maintain a constant number of parameters while capturing a global description of the video’s temporal evolution.

In order to feed the two aggregation approaches, we compute an image “pixel-based” CNN model, based on the raw pixels in the frames of a video. We processed videos for the “pixel-based” CNNs at one frame per second to reduce computational complexity. Of course, at this frame rate implicit motion information is lost.

To compensate, we incorporate explicit motion information in the form of optical flow - the apparent motion of objects across a cameras viewfinder due to the motion of the objects or the motion of the camera. We compute optical flow images over adjacent frames to learn an additional “optical flow” CNN model.
Left: Image used for the pixel-based CNN; Right: Dense optical flow image used for optical flow CNN
The pixel-based and optical flow based CNN model outputs are provided as inputs to both the RNN and pooling approaches described earlier. These two approaches then separately aggregate the frame-level predictions from each CNN model input, and average the results. This allows our video-level prediction to take advantage of both image information and motion information to accurately label videos of similar activities even when the visual content of those videos varies greatly.
Badminton (top 25 videos according to the max-pooling model). Our methods accurately label all 25 videos as badminton despite the variety of scenes in the various videos because they use the entire video’s context for prediction.
We conclude by observing that although very different in concept, the max-pooling and the recurrent neural network methods perform similarly when using both images and optical flow. Currently, these two architectures are the top performers on the Sports-1M dataset. The main difference between the two was that the RNN approach was more robust when using optical flow alone on this dataset. Check out a short video showing some example outputs from the deep convolutional networks presented in our paper.


1 Research carried out in collaboration with University of Maryland, College Park PhD student Joe Yue-Hei Ng and University of Texas at Austin PhD student Matthew Hausknecht, as part of a Google Software Engineering Internship?

Read More..

A Beginner’s Guide to Deep Neural Networks



Last year, we (a couple of people who knew nothing about how voice search works) set out to make a video about the research that’s gone into teaching computers to recognize speech and understand language.

Making the video was eye-opening and brain-opening. It introduced us to concepts we’d never heard of – like machine learning and artificial neural networks – and ever since, we’ve been kind of fascinated by them. Machine learning, in particular, is a very active area of Computer Science research, with far-ranging applications beyond voice search – like machine translation, image recognition and description, and Google Voice transcription.

So... still curious to know more (and having just started this project) we found Google researchers Greg Corrado and Christopher Olah and ambushed them with our machine learning questions.
This video is our attempt to distill what we learned from talking with them, but if anything in it piques your curiosity, or you have other questions, you’re in luck! On Friday, September 25, at 1 PM PDT / 4 PM EST Greg and Chris will be doing an Ask Me Anything on Reddit (see the calendar here) to answer your deep learning questions.

Everyone who’s curious is welcome to join, ask questions, and hopefully gain a better understanding of the world of machine learning and deep neural networks. (And we’ll be hanging out with them, too...in case you have any questions about video making or dogs.) We hope to see you this Friday!
Read More..

DeepDream a code example for visualizing Neural Networks



Two weeks ago we blogged about a visualization tool designed to help us understand how neural networks work and what each layer has learned. In addition to gaining some insight on how these networks carry out classification tasks, we found that this process also generated some beautiful art.
Top: Input image. Bottom: output image made using a network trained on places by MIT Computer Science and AI Laboratory.
We have seen a lot of interest and received some great questions, from programmers and artists alike, about the details of how these visualizations are made. We have decided to open source the code we used to generate these images in an IPython notebook, so now you can make neural network inspired images yourself!

The code is based on Caffe and uses available open source packages, and is designed to have as few dependencies as possible. To get started, you will need the following (full details in the notebook):

  • NumPy, SciPy, PIL, IPython, or a scientific python distribution such as Anaconda or Canopy.
  • Caffe deep learning framework (Installation instructions)

Once you’re set up, you can supply an image and choose which layers in the network to enhance, how many iterations to apply and how far to zoom in. Alternatively, different pre-trained networks can be plugged in.

Itll be interesting to see what imagery people are able to generate. If you post images to Google+, Facebook, or Twitter, be sure to tag them with #deepdream so other researchers can check them out too.
Read More..

Inceptionism Going Deeper into Neural Networks



Update - 13/07/2015
Images in this blog post are licensed by Google Inc. under a Creative Commons Attribution 4.0 International License. However, images based on places by MIT Computer Science and AI Laboratory require additional permissions from MIT for use.

Artificial Neural Networks have spurred remarkable recent progress in image classification and speech recognition. But even though these are very useful tools based on well-known mathematical methods, we actually understand surprisingly little of why certain models work and others don’t. So let’s take a look at some simple techniques for peeking inside these networks.

We train an artificial neural network by showing it millions of training examples and gradually adjusting the network parameters until it gives the classifications we want. The network typically consists of 10-30 stacked layers of artificial neurons. Each image is fed into the input layer, which then talks to the next layer, until eventually the “output” layer is reached. The network’s “answer” comes from this final output layer.

One of the challenges of neural networks is understanding what exactly goes on at each layer. We know that after training, each layer progressively extracts higher and higher-level features of the image, until the final layer essentially makes a decision on what the image shows. For example, the first layer maybe looks for edges or corners. Intermediate layers interpret the basic features to look for overall shapes or components, like a door or a leaf. The final few layers assemble those into complete interpretations—these neurons activate in response to very complex things such as entire buildings or trees.

One way to visualize what goes on is to turn the network upside down and ask it to enhance an input image in such a way as to elicit a particular interpretation. Say you want to know what sort of image would result in “Banana.” Start with an image full of random noise, then gradually tweak the image towards what the neural net considers a banana (see related work in [1], [2], [3], [4]). By itself, that doesn’t work very well, but it does if we impose a prior constraint that the image should have similar statistics to natural images, such as neighboring pixels needing to be correlated.
So here’s one surprise: neural networks that were trained to discriminate between different kinds of images have quite a bit of the information needed to generate images too. Check out some more examples across different classes:
Why is this important? Well, we train networks by simply showing them many examples of what we want them to learn, hoping they extract the essence of the matter at hand (e.g., a fork needs a handle and 2-4 tines), and learn to ignore what doesn’t matter (a fork can be any shape, size, color or orientation). But how do you check that the network has correctly learned the right features? It can help to visualize the network’s representation of a fork.

Indeed, in some cases, this reveals that the neural net isn’t quite looking for the thing we thought it was. For example, here’s what one neural net we designed thought dumbbells looked like:
There are dumbbells in there alright, but it seems no picture of a dumbbell is complete without a muscular weightlifter there to lift them. In this case, the network failed to completely distill the essence of a dumbbell. Maybe it’s never been shown a dumbbell without an arm holding it. Visualization can help us correct these kinds of training mishaps.

Instead of exactly prescribing which feature we want the network to amplify, we can also let the network make that decision. In this case we simply feed the network an arbitrary image or photo and let the network analyze the picture. We then pick a layer and ask the network to enhance whatever it detected. Each layer of the network deals with features at a different level of abstraction, so the complexity of features we generate depends on which layer we choose to enhance. For example, lower layers tend to produce strokes or simple ornament-like patterns, because those layers are sensitive to basic features such as edges and their orientations.
Left: Original photo by Zachi Evenor. Right: processed by Günther Noack, Software Engineer
Left: Original painting by Georges Seurat. Right: processed images by Matthew McNaughton, Software Engineer
If we choose higher-level layers, which identify more sophisticated features in images, complex features or even whole objects tend to emerge. Again, we just start with an existing image and give it to our neural net. We ask the network: “Whatever you see there, I want more of it!” This creates a feedback loop: if a cloud looks a little bit like a bird, the network will make it look more like a bird. This in turn will make the network recognize the bird even more strongly on the next pass and so forth, until a highly detailed bird appears, seemingly out of nowhere.
The results are intriguing—even a relatively simple neural network can be used to over-interpret an image, just like as children we enjoyed watching clouds and interpreting the random shapes. This network was trained mostly on images of animals, so naturally it tends to interpret shapes as animals. But because the data is stored at such a high abstraction, the results are an interesting remix of these learned features.
Of course, we can do more than cloud watching with this technique. We can apply it to any kind of image. The results vary quite a bit with the kind of image, because the features that are entered bias the network towards certain interpretations. For example, horizon lines tend to get filled with towers and pagodas. Rocks and trees turn into buildings. Birds and insects appear in images of leaves.
The original image influences what kind of objects form in the processed image.
This technique gives us a qualitative sense of the level of abstraction that a particular layer has achieved in its understanding of images. We call this technique “Inceptionism” in reference to the neural net architecture used. See our Inceptionism gallery for more pairs of images and their processed results, plus some cool video animations.

We must go deeper: Iterations

If we apply the algorithm iteratively on its own outputs and apply some zooming after each iteration, we get an endless stream of new impressions, exploring the set of things the network knows about. We can even start this process from a random-noise image, so that the result becomes purely the result of the neural network, as seen in the following images:
Neural net “dreams”— generated purely from random noise, using a network trained on places by MIT Computer Science and AI Laboratory. See our Inceptionism gallery for hi-res versions of the images above and more (Images marked “Places205-GoogLeNet” were made using this network).
The techniques presented here help us understand and visualize how neural networks are able to carry out difficult classification tasks, improve network architecture, and check what the network has learned during training. It also makes us wonder whether neural networks could become a tool for artists—a new way to remix visual concepts—or perhaps even shed a little light on the roots of the creative process in general.
Read More..