
Improving Photo Search: A Step Across the Semantic Gap



Last month at Google I/O, we showed a major upgrade to the photos experience: you can now easily search your own photos without having to manually label each and every one of them. This is powered by computer vision and machine learning technology, which uses the visual content of an image to generate searchable tags. Combined with other sources such as text tags and EXIF metadata, these tags enable search across thousands of concepts like flower, food, car, jet ski, or turtle.
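As a rough illustration (and not the actual production system), here is a minimal sketch of how visual tags from a model could be merged with user text tags and EXIF metadata into a single searchable index; all field names in it are hypothetical.

```python
# A minimal sketch (not the production system) of merging machine-generated
# visual tags with user text tags and EXIF metadata into one inverted index.
# All field names here are hypothetical.
from collections import defaultdict

def build_photo_index(photos):
    """Map each searchable term to the set of photo ids that carry it."""
    index = defaultdict(set)
    for photo in photos:
        terms = set(photo.get("visual_tags", []))   # tags from the vision model
        terms.update(photo.get("text_tags", []))    # user-supplied labels
        exif = photo.get("exif", {})
        if "camera_model" in exif:                  # EXIF-derived terms
            terms.add(exif["camera_model"])
        for term in terms:
            index[term.lower()].add(photo["id"])
    return index

photos = [
    {"id": 1, "visual_tags": ["flower", "garden"], "text_tags": [], "exif": {}},
    {"id": 2, "visual_tags": ["car"], "text_tags": ["road trip"],
     "exif": {"camera_model": "PixelCam"}},
]
index = build_photo_index(photos)
print(sorted(index["flower"]))  # -> [1]
```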

For many years Google has offered Image Search over web images; however, searching across photos represents a difficult new challenge. In Image Search there are many pieces of information that can be used for ranking images, for example text from the web or the image filename. In the case of photos, however, there is typically little or no information beyond the pixels in the images themselves, which makes it harder for a computer to identify and categorize what is in a photo. There are some things a computer can do well, like recognizing rigid objects and handwritten digits. For other classes of objects this is a daunting task: the average toddler is better at understanding what is in a photo than the world’s most powerful computers running state-of-the-art algorithms.

This past October the state of the art moved a step closer to toddler performance: a system that used deep learning and convolutional neural networks easily beat more traditional approaches in the ImageNet computer vision competition, which is designed to test image understanding. The winning team was from Professor Geoffrey Hinton’s group at the University of Toronto.

We built and trained models similar to those from the winning team using software infrastructure for training large-scale neural networks developed at Google in a group started by Jeff Dean and Andrew Ng. When we evaluated these models, we were impressed; on our test set we saw double the average precision compared to other approaches we had tried. We knew we had found what we needed to make photo searching easier for people using Google. We acquired the rights to the technology and went full speed ahead adapting it to run at large scale on Google’s computers. We took cutting-edge research straight out of an academic research lab and launched it in just a little over six months. You can try it out at photos.google.com.
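For readers less familiar with the metric: average precision summarizes precision over a ranked list of predictions for a class, so doubling it is a large gain. The snippet below is only a small illustration using scikit-learn and made-up scores, not our internal evaluation code.

```python
# Illustration of average precision for a single class, using scikit-learn
# with made-up labels and scores (not the internal evaluation pipeline).
from sklearn.metrics import average_precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                          # 1 = photo truly has the class
scores = [0.92, 0.80, 0.75, 0.61, 0.55, 0.40, 0.35, 0.10]  # classifier confidences

print(f"Average precision: {average_precision_score(y_true, scores):.3f}")
```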

Why the success now? What is new? Some things are unchanged: we still use convolutional neural networks -- originally developed in the late 1990s by Professor Yann LeCun in the context of software for reading handwritten letters and digits. What is different is that both computers and algorithms have improved significantly. First, bigger and faster computers have made it feasible to train larger neural networks on much more data. Ten years ago, running neural networks of this complexity would have been a momentous task even on a single image -- now we are able to run them on billions of images. Second, new training techniques have made it possible to train the large deep neural networks necessary for successful image recognition.
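The networks discussed in this post were trained on that internal infrastructure, but the building blocks are the familiar ones: stacked convolutions, pooling, and a final classification layer. The toy Keras model below is only meant to illustrate those pieces and is far smaller than anything described here.

```python
# A toy convolutional network, purely to illustrate the building blocks
# (convolutions, pooling, a final softmax classifier). The actual networks
# were far larger and trained on Google-internal infrastructure.
import tensorflow as tf

NUM_CLASSES = 1100  # size of the launched label set mentioned later in the post

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),            # RGB input image
    tf.keras.layers.Conv2D(64, 7, strides=2, activation="relu"),
    tf.keras.layers.MaxPooling2D(3, strides=2),
    tf.keras.layers.Conv2D(128, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(3, strides=2),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="sgd", loss="categorical_crossentropy")
model.summary()
```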

We feel it would be interesting to the research community to discuss some of the unique aspects of the system we built and some qualitative observations we made while testing it.

The first is our label and training set and how it compares to the one used in the ImageNet Large Scale Visual Recognition competition. Since we were working on search across photos, we needed an appropriate label set. We came up with a set of about 2000 visual classes based on the most popular labels on Google+ Photos that also seemed to have a visual component, in the sense that a human could recognize them visually. In contrast, the ImageNet competition has 1000 classes. As in ImageNet, the classes are not text strings but entities; in our case we use Freebase entities, which form the basis of the Knowledge Graph used in Google Search. An entity is a way to uniquely identify something in a language-independent way. In English, when we encounter the word “jaguar”, it is hard to determine whether it refers to the animal or the car manufacturer. Entities assign a unique ID to each, removing that ambiguity: in this case “/m/0449p” for the former and “/m/012x34” for the latter. In order to train better classifiers we used more training images per class than ImageNet, 5000 versus 1000. Since we wanted to provide only high-precision labels, we also refined the classes from our initial set of 2000 down to the 1100 most precise classes for our launch.
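To make the word-versus-entity distinction concrete, here is a tiny sketch; the two Freebase IDs are the ones quoted above, while the lookup table itself is a hypothetical simplification.

```python
# A tiny sketch of why entities remove ambiguity: the same surface word
# "jaguar" maps to two distinct Freebase IDs (the two quoted above).
# The lookup table is a hypothetical simplification.
WORD_TO_ENTITIES = {
    "jaguar": [
        {"mid": "/m/0449p",  "description": "jaguar (the animal)"},
        {"mid": "/m/012x34", "description": "Jaguar (the car manufacturer)"},
    ],
}

def candidate_entities(word):
    """Return all entity candidates for an ambiguous label string."""
    return WORD_TO_ENTITIES.get(word.lower(), [])

for entity in candidate_entities("jaguar"):
    print(entity["mid"], "->", entity["description"])
```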

During our development process we made many more qualitative observations that we felt were worth mentioning:

1) Generalization performance. Even though there was a significant difference in visual appearance between the training and test sets, the network appeared to generalize quite well. To train the system, we used images mined from the web, which did not match the typical appearance of personal photos. Images on the web are often used to illustrate a single concept and are carefully composed, so an image of a flower might only be a close-up of a single flower. Personal photos, in contrast, are unstaged and impromptu: a photo of a flower might contain many other things and may not be very carefully composed. So our training set image distribution was not necessarily a good match for the distribution of images we wanted to run the system on, as the examples below illustrate. However, we found that our system trained on web images was able to generalize and perform well on photos.

A typical photo of a flower found on the web.
A typical photo of a flower found in an impromptu photo.

2) Handling of classes with multimodal appearance. The network seemed to handle classes with multimodal appearance quite well; for example, the “car” class contains both exterior and interior views of cars. This was surprising because the final layer is effectively a linear classifier, which creates a single dividing plane in a high-dimensional space. Since it is a single plane, this type of classifier is often not very good at representing multiple very different concepts.
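For intuition, the scoring step of such a final layer amounts to one weight vector and bias per class applied to a high-dimensional feature vector, i.e. one dividing hyperplane per class. The numpy sketch below shows just that step, with made-up shapes and values rather than the actual network.

```python
# A numpy sketch of a final-layer scoring step: each class is scored by a
# single weight vector plus bias, i.e. one dividing hyperplane in feature
# space. Shapes and values are made up for illustration.
import numpy as np

FEATURE_DIM, NUM_CLASSES = 4096, 1100
rng = np.random.default_rng(0)

features = rng.standard_normal(FEATURE_DIM)             # output of the deep layers
W = rng.standard_normal((NUM_CLASSES, FEATURE_DIM))     # one weight vector per class
b = rng.standard_normal(NUM_CLASSES)                    # one bias per class

scores = W @ features + b                               # linear score for each class
print("top class index:", int(np.argmax(scores)))
```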

3) Handling abstract and generic visual concepts. The system was able to do reasonably well on classes that one would think are somewhat abstract and generic. These include "dance", "kiss", and "meal", to name a few. This was interesting because for each of these classes it did not seem that there would be any simple visual clues in the image that would make it easy to recognize this class. It would be difficult to describe them in terms of simple basic visual features like color, texture, and shape.

Photos recognized as containing a meal.

4) Reasonable errors. Unlike other systems we experimented with, the errors which we observed often seemed quite reasonable to people. The mistakes were the type that a person might make - confusing things that look similar. Some people have already noticed this, for example, mistaking a goat for a dog or a millipede for a snake. This is in contrast to other systems which often make errors which seem nonsensical to people, like mistaking a tree for a dog.

Photo of a banana slug mistaken for a snake.
Photo of a donkey mistaken for a dog.

5) Handling very specific visual classes. Some of the classes we have are very specific, such as particular types of flowers, for example “hibiscus” or “dahlia”. We were surprised that the system could do well on those. Recognizing specific subclasses often requires very fine detail to differentiate between them, so it was surprising that a system that could do well on a whole-image concept like “sunsets” could also do well on very specific classes.

Photo recognized as containing a hibiscus flower.
Photo recognized as containing a dahlia flower.
Photo recognized as containing a polar bear.
Photo recognized as containing a grizzly bear.

The resulting computer vision system worked well enough to launch to people as a useful tool to help improve personal photo search, which was a big step forward. So, is computer vision solved? Not by a long shot. Have we gotten computers to see the world as well as people do? Not yet. There’s still a lot of work to do, but we’re closer.


Improving YouTube video thumbnails with deep neural nets



Video thumbnails are often the first things viewers see when they look for something interesting to watch. A strong, vibrant, and relevant thumbnail draws attention, giving viewers a quick preview of the content of the video, and helps them to find content more easily. Better thumbnails lead to more clicks and views for video creators.

Inspired by the remarkable recent advances of deep neural networks (DNNs) in computer vision tasks such as image and video classification, our team has launched an improved automatic YouTube "thumbnailer" to help creators showcase their video content. Here is how it works.

The Thumbnailer Pipeline

While a video is being uploaded to YouTube, we first sample frames from the video at one frame per second. Each sampled frame is evaluated by a quality model and assigned a single quality score. The frames with the highest scores are selected, enhanced and rendered as thumbnails with different sizes and aspect ratios. Among all the components, the quality model is the most critical and turned out to be the most challenging to develop. In the latest version of the thumbnailer algorithm, we used a DNN for the quality model. So, what is the quality model measuring, and how is the score calculated?
The main processing pipeline of the thumbnailer.
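As a rough sketch, the pipeline described above boils down to: decode frames at one per second, score each frame, and keep the best. The frame decoder and quality scorer below are trivial stand-ins, not the actual YouTube components.

```python
# A simplified sketch of the thumbnailer pipeline: sample frames at one per
# second, score each with a quality model, keep the top-scoring frames.
# The decoder and the scorer are trivial stand-ins, not YouTube's components.
import random

def decode_frames(video_path, fps=1):
    """Stand-in decoder: pretend the video is two minutes long at `fps`."""
    return [f"{video_path}@{t / fps:.1f}s" for t in range(int(120 * fps))]

def quality_score(frame):
    """Stand-in for the DNN quality model: one score per frame."""
    return random.random()

def select_thumbnail_frames(video_path, top_k=3):
    """Return the top_k frames of a video ranked by predicted quality."""
    frames = decode_frames(video_path, fps=1)                 # one frame per second
    ranked = sorted(frames, key=quality_score, reverse=True)  # best frames first
    return ranked[:top_k]

# The selected frames would then be enhanced and rendered at the different
# sizes and aspect ratios needed for thumbnails.
print(select_thumbnail_frames("my_video.mp4"))
```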
(Training) The Quality Model

Unlike the task of identifying whether a video contains your favorite animal, judging the visual quality of a video frame can be very subjective - people often have very different opinions and preferences when selecting frames as video thumbnails. One of the main challenges we faced was how to collect a large set of well-annotated training examples to feed into our neural network. Fortunately, in addition to the algorithmically generated thumbnails, many YouTube videos also come with carefully designed custom thumbnails uploaded by their creators. Those thumbnails are typically well framed, in focus, and centered on a specific subject (e.g. the main character in the video). We treated these custom thumbnails from popular videos as positive (high-quality) examples and randomly selected video frames as negative (low-quality) examples. Some examples of the training images are shown below.
Example training images.
The visual quality model essentially solves a binary classification problem: given a frame, is it of high quality or not? We trained a DNN on this set using an architecture similar to the Inception network in GoogLeNet, which achieved the top performance in the ImageNet 2014 competition.
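As a hedged sketch of that setup, the snippet below wires up a small binary classifier with creator thumbnails as positives and random frames as negatives; the tiny architecture and the directory layout are placeholders, not the Inception-style network or data pipeline of the actual system.

```python
# A hedged sketch of training a binary frame-quality classifier: custom
# thumbnails as positives, random frames as negatives. The tiny network and
# the directory layout are placeholders, not the production setup.
import tensorflow as tf

def build_quality_model(input_shape=(224, 224, 3)):
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # high quality or not
    ])

model = build_quality_model()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Hypothetical layout: quality_training_data/positives/ holds creator
# thumbnails, quality_training_data/negatives/ holds random video frames.
# train_ds = tf.keras.utils.image_dataset_from_directory(
#     "quality_training_data", image_size=(224, 224), label_mode="binary")
# model.fit(train_ds, epochs=5)
```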

Results

Compared to the previous automatically generated thumbnails, the DNN-powered model is able to select frames with much better quality. In a human evaluation, the thumbnails produced by our new models are preferred to those from the previous thumbnailer in more than 65% of side-by-side ratings. Here are some examples of how the new quality model performs on YouTube videos:
Example frames with low and high quality score from the DNN quality model, from video “Grand Canyon Rock Squirrel”.
Thumbnails generated by old vs. new thumbnailer algorithm.
We recently launched this new thumbnailer across YouTube, which means creators can now choose from higher-quality automatically generated thumbnails. Next time you see an awesome YouTube thumbnail, don’t hesitate to give it a thumbs up. ;)