Showing posts with label deep. Show all posts

Improving YouTube video thumbnails with deep neural nets



Video thumbnails are often the first things viewers see when they look for something interesting to watch. A strong, vibrant, and relevant thumbnail draws attention, giving viewers a quick preview of the content of the video, and helps them to find content more easily. Better thumbnails lead to more clicks and views for video creators.

Inspired by the recent remarkable advances of deep neural networks (DNNs) in computer vision, such as image and video classification, our team has recently launched an improved automatic YouTube "thumbnailer" in order to help creators showcase their video content. Here is how it works.

The Thumbnailer Pipeline

While a video is being uploaded to YouTube, we first sample frames from the video at one frame per second. Each sampled frame is evaluated by a quality model and assigned a single quality score. The frames with the highest scores are selected, enhanced and rendered as thumbnails with different sizes and aspect ratios. Among all the components, the quality model is the most critical and turned out to be the most challenging to develop. In the latest version of the thumbnailer algorithm, we used a DNN for the quality model. So, what is the quality model measuring, and how is the score calculated?
The main processing pipeline of the thumbnailer.
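The sample, score, and select steps can be sketched in a few lines of Python. This is only an illustration of the flow described above, not the production thumbnailer: the function names, the `top_k` parameter, and the toy scores are all assumptions, and the "quality model" here is a plain lookup table standing in for the DNN.

```python
from typing import Callable, List, Sequence, Tuple


def sample_timestamps(duration_s: float, fps: float = 1.0) -> List[float]:
    """Return the timestamps (in seconds) at which frames are sampled."""
    step = 1.0 / fps
    count = int(duration_s * fps)
    return [i * step for i in range(count)]


def select_thumbnails(
    frames: Sequence[str],
    score_frame: Callable[[str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every sampled frame and keep the top_k highest-scoring ones."""
    scored = [(frame, score_frame(frame)) for frame in frames]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical stand-in for the quality model: made-up scores per frame.
toy_scores = {"frame_0": 0.12, "frame_1": 0.91, "frame_2": 0.47, "frame_3": 0.78}
best = select_thumbnails(list(toy_scores), toy_scores.__getitem__, top_k=2)
```

The enhancement and rendering stages are left out, since the post does not describe how they work.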
(Training) The Quality Model

Unlike the task of identifying if a video contains your favorite animal, judging the visual quality of a video frame can be very subjective - people often have very different opinions and preferences when selecting frames as video thumbnails. One of the main challenges we faced was how to collect a large set of well-annotated training examples to feed into our neural network. Fortunately, on YouTube, in addition to having algorithmically generated thumbnails, many YouTube videos also come with carefully designed custom thumbnails uploaded by creators. Those thumbnails are typically well framed, in-focus, and center on a specific subject (e.g. the main character in the video). We consider these custom thumbnails from popular videos as positive (high-quality) examples, and randomly selected video frames as negative (low-quality) examples. Some examples of the training images are shown below.
Example training images.
The visual quality model essentially solves a problem we call "binary classification": given a frame, is it of high quality or not? We trained a DNN on this set using a similar architecture to the Inception network in GoogLeNet that achieved the top performance in the ImageNet 2014 competition.
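To make the labeling scheme concrete, here is a minimal Python sketch of how such a training set could be assembled and how a single prediction could be penalized. The post does not say which loss was used; binary cross-entropy is assumed here because it is a common choice for binary classification, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def build_training_set(
    custom_thumbnails: List[str], random_frames: List[str]
) -> List[Tuple[str, int]]:
    """Label custom thumbnails 1 (high quality) and random frames 0 (low quality)."""
    positives = [(image, 1) for image in custom_thumbnails]
    negatives = [(image, 0) for image in random_frames]
    return positives + negatives


def binary_cross_entropy(predicted: float, label: int) -> float:
    """Loss for one example; predicted is the model's probability of 'high quality'."""
    eps = 1e-12  # clamp to avoid log(0)
    predicted = min(max(predicted, eps), 1.0 - eps)
    return -(label * math.log(predicted) + (1 - label) * math.log(1.0 - predicted))
```

A confident correct prediction (for example 0.9 for a custom thumbnail) yields a small loss, while a confident wrong one yields a large loss, which is what pushes the network toward separating the two groups.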

Results

Compared to the previous automatically generated thumbnails, the DNN-powered model is able to select frames with much better quality. In a human evaluation, the thumbnails produced by our new models are preferred to those from the previous thumbnailer in more than 65% of side-by-side ratings. Here are some examples of how the new quality model performs on YouTube videos:
Example frames with low and high quality score from the DNN quality model, from video “Grand Canyon Rock Squirrel”.
Thumbnails generated by old vs. new thumbnailer algorithm.
We recently launched this new thumbnailer across YouTube, which means creators can start to choose from higher quality thumbnails generated by our new thumbnailer. Next time you see an awesome YouTube thumbnail, don’t hesitate to give it a thumbs up. ;)

Classifying everything using your RPi Camera: Deep Learning with the Pi

For those who don't want to read, the code is on my GitHub, with a README:
https://github.com/StevenHickson/RPi_CaffeQuery
You can also read about it on my Hackaday.io page here.

What is object classification?

Object classification has been a very popular topic over the past couple of years. Given an image, we want a computer to be able to tell us what that image is showing. The newest trend is to classify images using convolutional neural networks trained on large amounts of data.

One of the bigger frameworks for this is the Caffe framework. For more on this see the Caffe home page.
You can test out their web demo here. It isn't great at people, but it is very good at cats, dogs, objects, and activities.


Why is this useful?

There are all kinds of autonomous tasks you can do with the RPi camera. Perhaps you want to know if your dog is in your living room, so the Pi can take his/her picture or tell him/her they are a good dog. Perhaps you want your RPi to recognize whether there is fruit in your fruit drawer so it can order you more when it is empty. The possibilities are endless.

How do convolutional neural networks work (a VERY simple overview)?

Convolutional neural networks are loosely based on how the human brain works. They are built from layers of many neurons that are "activated" by certain inputs. The input layer is connected to the output through a series of interconnected neurons in hidden layers, like so:
[1]

Each neuron sends its signal to every neuron it is connected to. Each signal is multiplied by the connection weight, and the receiving neuron runs the sum of its weighted inputs through a sigmoid function. The network is trained with back propagation, which adjusts the weights to minimize an error function over a set of inputs with known outputs.
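Here is a tiny Python sketch of that forward step for one neuron and one layer. It is a plain fully connected layer, not a convolutional one, and it is not how Caffe implements anything; the bias term and the function names are my own additions for illustration.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs: List[float], weights: List[float], bias: float = 0.0) -> float:
    """Weighted sum of the incoming signals, passed through a sigmoid."""
    total = sum(signal * weight for signal, weight in zip(inputs, weights)) + bias
    return sigmoid(total)


def layer_output(
    inputs: List[float], weight_rows: List[List[float]], biases: List[float]
) -> List[float]:
    """One output per neuron; each neuron has its own row of weights."""
    return [neuron_output(inputs, row, b) for row, b in zip(weight_rows, biases)]
```

Training then amounts to nudging the numbers in `weights` so that the outputs get closer to the known answers.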

How do we get this on the Pi?

Well, I went ahead and compiled Caffe on the RPi. Unfortunately, since it doesn't have code to optimize the network with the Pi's GPU, classification takes ~20-25s per image, which is far too slow.
Note: I did find a different optimized CNN network for the RPi by Pete Warden here. It looks great, but it still takes about 3 seconds per image, which still doesn't seem fast enough.

You will also need the Raspberry Pi camera which you can get from here:
Raspberry PI 5MP Camera Board Module

A better option: Using the web demo with python

So we can take advantage of the Caffe web demo and use that to reduce the processing time even further. With this method, the image classification takes ~1.5s, which is usable for a system.

How does the code work?

We make a symbolic link from /dev/shm/images/ to our /var/www for apache and forward our router port 5050 to the Pi port 80. 
Then we use raspistill to take an image and save it to memory as /dev/shm/images/test.jpg. Since this is symlinked in /var/www, we should be able to see it at http://YOUR-EXTERNAL-IP:5050/images/test.jpg.
Then we use grab to pull up the Caffe demo framework with our image and get the classification results. This is done in queryCNN.py, which returns the results.
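The capture and URL steps can be sketched in Python like this. It is a rough sketch rather than the code in the repository: the helper names and the image size are my assumptions, the port is whichever router port you forwarded to the Pi's port 80, and the query against the Caffe demo is left to queryCNN.py.

```python
import subprocess
from typing import List

IMAGE_PATH = "/dev/shm/images/test.jpg"  # RAM-backed location, so no SD card writes


def raspistill_command(
    output_path: str = IMAGE_PATH, width: int = 640, height: int = 480
) -> List[str]:
    """Build the raspistill command line that saves one still to output_path."""
    return [
        "raspistill",
        "-o", output_path,   # output file
        "-w", str(width),    # image width in pixels (assumed value)
        "-h", str(height),   # image height in pixels (assumed value)
        "-t", "1",           # timeout in milliseconds before capture
        "-n",                # no preview window
    ]


def capture_image() -> None:
    """Take the picture; this only works on a Pi with the camera attached."""
    subprocess.run(raspistill_command(), check=True)


def public_image_url(external_ip: str, port: int) -> str:
    """URL at which the symlinked image is served through the forwarded port."""
    return f"http://{external_ip}:{port}/images/test.jpg"
```

Keeping the image under /dev/shm avoids wearing out the SD card when you capture in a loop.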

What does the output look like?

Given a picture of some of my Pi components, I get this, which is pretty accurate:

Where can I get the code?

https://github.com/StevenHickson/RPi_CaffeQuery

[1] http://white.stanford.edu/teach/index.php/An_Introduction_to_Convolutional_Neural_Networks

Consider donating to further my tinkering since I do all this and help people out for free.




Beyond Short Snippets: Deep Networks for Video Classification



Convolutional Neural Networks (CNNs) have recently shown rapid progress in advancing the state of the art of detecting and classifying objects in static images, automatically learning complex features in pictures without the need for manually annotated features. But what if one wanted not only to identify objects in static images, but also analyze what a video is about? After all, a video isn’t much more than a string of static images linked together in time.

As it turns out, video analysis provides even more information to the object detection and recognition task performed by CNNs by adding a temporal component through which motion and other information can also be used to improve classification. However, analyzing entire videos is challenging from a modeling perspective because one must model variable length videos with a fixed number of parameters. Not to mention that modeling variable length videos is computationally very intensive.

In Beyond Short Snippets: Deep Networks for Video Classification, to be presented at the 2015 Computer Vision and Pattern Recognition conference (CVPR 2015), we¹ evaluated two approaches - feature pooling networks and recurrent neural networks (RNNs) - capable of modeling variable length videos with a fixed number of parameters while maintaining a low computational footprint. In doing so, we were able to show not only that learning a high level global description of the video’s temporal evolution is very important for accurate video classification, but also that our best networks exhibited significant performance improvements over previously published results on the Sports 1 million dataset (Sports-1M).

In previous work, we employed 3D-convolutions (meaning convolutions over time and space) over short video clips - typically just a few seconds - to learn motion features from raw frames implicitly and then aggregate predictions at the video level. For purposes of video classification, the low level motion features were only marginally outperforming models in which no motion was modeled.

To understand why, consider the following two images which are very similar visually but obtain drastically different scores from a CNN model trained on static images:
Slight differences in object poses/context can change the predicted class/confidence of CNNs trained on static images.
Since each individual video frame forms only a small part of the video’s story, static frames and short video snippets (2-3 secs) use incomplete information and could easily confuse subtle fine-grained distinctions between classes (e.g., Tae Kwon Do vs. Systema) or use portions of the video irrelevant to the action of interest.

To get around this frame-by-frame confusion, we used feature pooling networks that independently process each frame and then pool/aggregate the frame-level features over the entire video at various stages. Another approach we took was to utilize an RNN (derived from Long Short Term Memory units) instead of feature pooling, allowing the network itself to decide which parts of the video are important for classification. By sharing parameters through time, both feature pooling and RNN architectures are able to maintain a constant number of parameters while capturing a global description of the video’s temporal evolution.
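The reason pooling copes with videos of any length can be seen in a minimal Python sketch. It assumes each frame has already been turned into a feature vector by a CNN; the function name and the toy numbers are illustrative, and the pooling layers in the paper sit inside a trained network rather than operating on plain lists.

```python
from typing import List


def max_pool_over_time(frame_features: List[List[float]]) -> List[float]:
    """Element-wise max across frames: any number of frames in, one vector out."""
    if not frame_features:
        raise ValueError("need at least one frame")
    return [max(column) for column in zip(*frame_features)]


# Two toy "videos" of different lengths, each frame described by 2 features.
short_video = [[0.1, 0.9], [0.4, 0.2]]
long_video = short_video + [[0.8, 0.3], [0.0, 0.5]]

short_descriptor = max_pool_over_time(short_video)
long_descriptor = max_pool_over_time(long_video)
```

Both descriptors have the same size regardless of how many frames went in, so the layers that follow need no extra parameters for longer videos.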

To feed the two aggregation approaches, we train a “pixel-based” image CNN model on the raw pixels in the frames of a video. We processed videos for the “pixel-based” CNNs at one frame per second to reduce computational complexity. Of course, at this frame rate implicit motion information is lost.

To compensate, we incorporate explicit motion information in the form of optical flow - the apparent motion of objects across a camera’s viewfinder due to the motion of the objects or the motion of the camera. We compute optical flow images over adjacent frames to learn an additional “optical flow” CNN model.
Left: Image used for the pixel-based CNN; Right: Dense optical flow image used for optical flow CNN
The pixel-based and optical flow based CNN model outputs are provided as inputs to both the RNN and pooling approaches described earlier. These two approaches then separately aggregate the frame-level predictions from each CNN model input, and average the results. This allows our video-level prediction to take advantage of both image information and motion information to accurately label videos of similar activities even when the visual content of those videos varies greatly.
Badminton (top 25 videos according to the max-pooling model). Our methods accurately label all 25 videos as badminton despite the variety of scenes in the various videos because they use the entire video’s context for prediction.
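The final averaging of the image-based and flow-based predictions can be illustrated with a short Python sketch. Equal weighting of the two streams is an assumption made here for simplicity, and the class names and probabilities are made up.

```python
from typing import Dict


def average_predictions(
    image_probs: Dict[str, float], flow_probs: Dict[str, float]
) -> Dict[str, float]:
    """Average per-class scores from the pixel-based and optical flow streams."""
    classes = image_probs.keys() | flow_probs.keys()
    return {
        name: 0.5 * (image_probs.get(name, 0.0) + flow_probs.get(name, 0.0))
        for name in classes
    }


# Made-up video-level predictions from each stream.
image_stream = {"badminton": 0.6, "tennis": 0.4}
flow_stream = {"badminton": 0.9, "tennis": 0.1}
fused = average_predictions(image_stream, flow_stream)
label = max(fused, key=fused.get)
```

In this toy case the image stream is unsure, but the motion stream tips the fused prediction firmly toward badminton.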
We conclude by observing that although very different in concept, the max-pooling and the recurrent neural network methods perform similarly when using both images and optical flow. Currently, these two architectures are the top performers on the Sports-1M dataset. The main difference between the two was that the RNN approach was more robust when using optical flow alone on this dataset. Check out a short video showing some example outputs from the deep convolutional networks presented in our paper.


¹ Research carried out in collaboration with University of Maryland, College Park PhD student Joe Yue-Hei Ng and University of Texas at Austin PhD student Matthew Hausknecht, as part of a Google Software Engineering Internship.


How Google Translate squeezes deep learning onto a phone



Today we announced that the Google Translate app now does real-time visual translation of 20 more languages. So the next time you’re in Prague and can’t read a menu, we’ve got your back. But how are we able to recognize these new languages?

In short: deep neural nets. When the Word Lens team joined Google, we were excited for the opportunity to work with some of the leading researchers in deep learning. Neural nets have gotten a lot of attention in the last few years because they’ve set all kinds of records in image recognition. Five years ago, if you gave a computer an image of a cat or a dog, it had trouble telling which was which. Thanks to convolutional neural networks, not only can computers tell the difference between cats and dogs, they can even recognize different breeds of dogs. Yes, they’re good for more than just trippy art: if you’re translating a foreign menu or sign with the latest version of Google’s Translate app, you’re now using a deep neural net. And the amazing part is it can all work on your phone, without an Internet connection. Here’s how.

Step by step

First, when a camera image comes in, the Google Translate app has to find the letters in the picture. It needs to weed out background objects like trees or cars, and pick up on the words we want translated. It looks at blobs of pixels that have similar color to each other that are also near other similar blobs of pixels. Those are possibly letters, and if they’re near each other, that makes a continuous line we should read.
Second, Translate has to recognize what each letter actually is. This is where deep learning comes in. We use a convolutional neural network, training it on letters and non-letters so it can learn what different letters look like.

But interestingly, if we train just on very “clean”-looking letters, we risk not understanding what real-life letters look like. Letters out in the real world are marred by reflections, dirt, smudges, and all kinds of weirdness. So we built our letter generator to create all kinds of fake “dirt” to convincingly mimic the noisiness of the real world—fake reflections, fake smudges, fake weirdness all around.

Why not just train on real-life photos of letters? Well, it’s tough to find enough examples in all the languages we need, and it’s harder to maintain the fine control over what examples we use when we’re aiming to train a really efficient, compact neural network. So it’s more effective to simulate the dirt.
Some of the “dirty” letters we use for training. Dirt, highlights, and rotation, but not too much because we don’t want to confuse our neural net.
The third step is to take those recognized letters, and look them up in a dictionary to get translations. Since every previous step could have failed in some way, the dictionary lookup needs to be approximate. That way, if we read an ‘S’ as a ‘5’, we’ll still be able to find the word ‘5uper’.
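Approximate lookup of this kind can be illustrated with Python’s standard difflib module. This is a sketch of the general idea only: the app’s actual matching method is not described here, and the toy dictionary, the function name, and the similarity cutoff are assumptions.

```python
import difflib
from typing import Dict, Optional

# Hypothetical toy dictionary (English -> Spanish); not the app's real data.
TOY_DICTIONARY: Dict[str, str] = {
    "super": "estupendo",
    "soup": "sopa",
    "paper": "papel",
}


def approximate_lookup(
    recognized: str, dictionary: Dict[str, str], cutoff: float = 0.6
) -> Optional[str]:
    """Translate the dictionary word closest to the recognized text, if any."""
    matches = difflib.get_close_matches(
        recognized.lower(), list(dictionary), n=1, cutoff=cutoff
    )
    return dictionary[matches[0]] if matches else None


translation = approximate_lookup("5uper", TOY_DICTIONARY)
```

Here the misread “5uper” still resolves to the entry for “super”, while text that resembles nothing in the dictionary returns no match instead of a wrong translation.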

Finally, we render the translation on top of the original words in the same style as the original. We can do this because we’ve already found and read the letters in the image, so we know exactly where they are. We can look at the colors surrounding the letters and use that to erase the original letters. And then we can draw the translation on top using the original foreground color.

Crunching it down for mobile

Now, if we could do this visual translation in our data centers, it wouldn’t be too hard. But a lot of our users, especially those getting online for the very first time, have slow or intermittent network connections and smartphones starved for computing power. These low-end phones can be about 50 times slower than a good laptop—and a good laptop is already much slower than the data centers that typically run our image recognition systems. So how do we get visual translation on these phones, with no connection to the cloud, translating in real-time as the camera moves around?

We needed to develop a very small neural net, and put severe limits on how much we tried to teach it—in essence, put an upper bound on the density of information it handles. The challenge here was in creating the most effective training data. Since we’re generating our own training data, we put a lot of effort into including just the right data and nothing more. For instance, we want to be able to recognize a letter with a small amount of rotation, but not too much. If we overdo the rotation, the neural network will use too much of its information density on unimportant things. So we put effort into making tools that would give us a fast iteration time and good visualizations. Inside of a few minutes, we can change the algorithms for generating training data, generate it, retrain, and visualize. From there we can look at what kind of letters are failing and why. At one point, we were warping our training data too much, and ‘$’ started to be recognized as ‘S’. We were able to quickly identify that and adjust the warping parameters to fix the problem. It was like trying to paint a picture of letters that you’d see in real life with all their imperfections painted just perfectly.
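The idea of generating “dirty” training letters while capping how much distortion is allowed can be sketched in Python as follows. Everything specific in it is an assumption for illustration: the parameter names, the 10 degree and 0.2 limits, and the pixel-flipping noise are not the values or methods of the real generator.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Distortion:
    rotation_deg: float
    noise_level: float


def sample_distortion(
    rng: random.Random, max_rotation_deg: float = 10.0, max_noise: float = 0.2
) -> Distortion:
    """Draw distortion parameters, capped so extreme cases are never generated."""
    return Distortion(
        rotation_deg=rng.uniform(-max_rotation_deg, max_rotation_deg),
        noise_level=rng.uniform(0.0, max_noise),
    )


def add_speckle(
    bitmap: List[List[int]], noise_level: float, rng: random.Random
) -> List[List[int]]:
    """Flip each pixel of a 0/1 bitmap with probability noise_level (fake 'dirt')."""
    return [
        [1 - pixel if rng.random() < noise_level else pixel for pixel in row]
        for row in bitmap
    ]
```

Tightening or loosening the two caps is the kind of quick change to the generator that the fast retrain-and-visualize loop makes cheap to evaluate.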

To achieve real-time, we also heavily optimized and hand-tuned the math operations. That meant using the mobile processor’s SIMD instructions and tuning things like matrix multiplies to fit processing into all levels of cache memory.

In the end, we were able to get our networks to give us significantly better results while running about as fast as our old system, which is great for translating what you see around you on the fly. Sometimes new technology can seem very abstract, and it’s not always obvious what the applications for things like convolutional neural nets could be. We think breaking down language barriers is one great use.

A Beginner’s Guide to Deep Neural Networks



Last year, we (a couple of people who knew nothing about how voice search works) set out to make a video about the research that’s gone into teaching computers to recognize speech and understand language.

Making the video was eye-opening and brain-opening. It introduced us to concepts we’d never heard of – like machine learning and artificial neural networks – and ever since, we’ve been kind of fascinated by them. Machine learning, in particular, is a very active area of Computer Science research, with far-ranging applications beyond voice search – like machine translation, image recognition and description, and Google Voice transcription.

So... still curious to know more (and having just started this project) we found Google researchers Greg Corrado and Christopher Olah and ambushed them with our machine learning questions.
This video is our attempt to distill what we learned from talking with them, but if anything in it piques your curiosity, or you have other questions, you’re in luck! On Friday, September 25, at 1 PM PDT / 4 PM EDT Greg and Chris will be doing an Ask Me Anything on Reddit (see the calendar here) to answer your deep learning questions.

Everyone who’s curious is welcome to join, ask questions, and hopefully gain a better understanding of the world of machine learning and deep neural networks. (And we’ll be hanging out with them, too...in case you have any questions about video making or dogs.) We hope to see you this Friday!

From Pixels to Actions: Human-level control through Deep Reinforcement Learning



Remember the classic videogame Breakout on the Atari 2600? When you first sat down to try it, you probably learned to play well pretty quickly, because you already knew how to bounce a ball off a wall in real life. You may have even worked up a strategy to maximise your overall score at the expense of more immediate rewards. But what if you didn’t possess that real-world knowledge, and only had the pixels on the screen, the control paddle in your hand, and the score to go on? How would you, or equally any intelligent agent faced with this situation, learn this task totally from scratch?

This is exactly the question that we set out to answer in our paper “Human-level control through deep reinforcement learning”, published in Nature this week. We demonstrate that a novel algorithm called a deep Q-network (DQN) is up to this challenge, excelling not only at Breakout but also a wide variety of classic videogames: everything from side-scrolling shooters (River Raid) to boxing (Boxing) and 3D car racing (Enduro). Strikingly, DQN was able to work straight “out of the box” across all these games – using the same network architecture and tuning parameters throughout and provided only with the raw screen pixels, set of available actions and game score as input.

The results: DQN outperformed previous machine learning methods in 43 of the 49 games. In fact, in more than half the games, it performed at more than 75% of the level of a professional human player. In certain games, DQN even came up with surprisingly far-sighted strategies that allowed it to achieve the maximum attainable score—for example, in Breakout, it learned to first dig a tunnel at one end of the brick wall so the ball could bounce around the back and knock out bricks from behind.
Video courtesy of Atari Inc. and Mnih et al. “Human-level control through deep reinforcement learning”
So how does it work? DQN incorporated several key features that for the first time enabled the power of Deep Neural Networks (DNN) to be combined in a scalable fashion with Reinforcement Learning (RL)—a machine learning framework that prescribes how agents should act in an environment in order to maximize future cumulative reward (e.g., a game score). Foremost among these was a neurobiologically inspired mechanism, termed “experience replay,” whereby during the learning phase DQN was trained on samples drawn from a pool of stored episodes—a process physically realized in a brain structure called the hippocampus through the ultra-fast reactivation of recent experiences during rest periods (e.g., sleep). Indeed, the incorporation of experience replay was critical to the success of DQN: disabling this function caused a severe deterioration in performance.
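A replay pool of this kind can be sketched in Python as a bounded buffer with random sampling. This is an illustration of the concept rather than the DQN implementation: the class and field names, the fixed capacity, and uniform sampling are assumptions made for the sketch.

```python
import random
from collections import deque
from typing import Deque, List, NamedTuple


class Transition(NamedTuple):
    state: tuple
    action: int
    reward: float
    next_state: tuple
    done: bool


class ReplayBuffer:
    """Bounded pool of past experiences; the oldest entries are dropped first."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self._storage: Deque[Transition] = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, transition: Transition) -> None:
        self._storage.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        """Draw a random batch of distinct stored experiences for training."""
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self) -> int:
        return len(self._storage)
```

Training on randomly drawn past experiences, rather than only on the most recent frames, lets each experience be reused many times and breaks up the strong correlation between consecutive frames.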
Comparison of the DQN agent with the best reinforcement learning methods in the literature. The performance of DQN is normalized with respect to a professional human games tester (100% level) and random play (0% level). Note that the normalized performance of DQN, expressed as a percentage, is calculated as: 100 × (DQN score - random play score) / (human score - random play score). Error bars indicate s.d. across the 30 evaluation episodes, starting with different initial conditions. Figure courtesy of Mnih et al. “Human-level control through deep reinforcement learning”, Nature 26 Feb. 2015.
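The normalization in the caption is simple enough to write out as a small Python function. The function name is hypothetical and the scores in the example are made up purely to show the arithmetic.

```python
def normalized_score(agent: float, random_play: float, human: float) -> float:
    """Percentage scale where random play is 0% and the human tester is 100%."""
    if human == random_play:
        raise ValueError("human and random play scores must differ")
    return 100.0 * (agent - random_play) / (human - random_play)


# Made-up scores: random play 10, human tester 70, agent 55.
example = normalized_score(agent=55.0, random_play=10.0, human=70.0)  # 75.0
```

A value above 100 means the agent outscored the human tester, and a negative value means it did worse than random play.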
This work offers the first demonstration of a general purpose learning agent that can be trained end-to-end to handle a wide variety of challenging tasks, taking in only raw pixels as inputs and transforming these into actions that can be executed in real-time. This kind of technology should help us build more useful products—imagine if you could ask the Google app to complete any kind of complex task (“Okay Google, plan me a great backpacking trip through Europe!”).

We also hope this kind of domain general learning algorithm will give researchers new ways to make sense of complex large-scale data, creating the potential for exciting discoveries in fields such as climate science, physics, medicine and genomics. And it may even help scientists better understand the process by which humans learn. After all, as the great physicist Richard Feynman famously said: “What I cannot create, I do not understand.”

Teach Yourself Deep Learning with TensorFlow and Udacity



Deep learning has become one of the hottest topics in machine learning in recent years. With TensorFlow, the deep learning platform that we recently released as an open-source project, our goal was to bring the capabilities of deep learning to everyone. So far, we are extremely excited by the uptake: more than 4000 users have forked it on GitHub in just a few weeks, and the project has been starred more than 16000 times by enthusiasts around the globe.

To help make deep learning even more accessible to engineers and data scientists at large, we are launching a new Deep Learning Course developed in collaboration with Udacity. This short, intensive course provides you with all the basic tools and vocabulary to get started with deep learning, and walks you through how to use it to address some of the most common machine learning problems. It is also accompanied by interactive TensorFlow notebooks that directly mirror and implement the concepts introduced in the lectures.
The course consists of four lectures which provide a tour of the main building blocks that are used to solve problems ranging from image recognition to text analysis. The first lecture focuses on the basics that will be familiar to those already versed in machine learning: setting up your data and experimental protocol, and training simple classification models. The second lecture builds on these fundamentals to explore how these simple models can be made deeper, and more powerful, and explores all the scalability problems that come with that, in particular regularization and hyperparameter tuning. The third lecture is all about convolutional networks and image recognition. The fourth and final lecture explores models for text and sequences in general, with embeddings and recurrent neural networks. By the end of the course, you will have implemented and trained this variety of models on your own machine and will be ready to transfer that knowledge to solve your own problems!

Our overall goal in designing this course was to provide the machine learning enthusiast a rapid and direct path to solving real and interesting problems with deep learning techniques, and we’re now very excited to share what we’ve built! It has been a lot of fun putting it together with the fantastic team of experts in online course design and production at Udacity. For more details, see the Udacity blog post, and register for the course. We hope you enjoy it!
