
All the News That's Fit to Read: A Study of Social Annotations for News Reading



News is one of the most important parts of our collective information diet, and like any other activity on the Web, online news reading is fast becoming a social experience. Internet users today see recommendations for news from a variety of sources; newspaper websites allow readers to recommend news articles to each other, restaurant review sites present other diners’ recommendations, and now several social networks have integrated social news readers.

With news article recommendations and endorsements coming from a combination of computers and algorithms, companies that publish and aggregate content, friends, and even complete strangers, how do these explanations (i.e., why the articles are shown to you, which we call “annotations”) affect users’ selections of what to read? Given the ubiquity of online social annotations in news dissemination, it is surprising how little is known about how users respond to these annotations, and how to offer them to users productively.

In All the News that’s Fit to Read: A Study of Social Annotations for News Reading, presented at the 2013 ACM SIGCHI Conference on Human Factors in Computing Systems and highlighted in the list of influential Google papers from 2013, we reported results from two experiments with voluntary participants which suggest that social annotations, so far treated as a simple, generic way to increase user engagement, are not simple at all: social annotations vary significantly in their persuasiveness and in their ability to change user engagement.
News articles in different annotation conditions
The first experiment looked at how people use annotations when the content they see is not personalized and the annotations do not come from people in their social network, as is the case when a user is not signed into a particular social network. Participants who signed up for the study were all shown the same set of news articles, accompanied by annotations from strangers, a computer agent, and a fictional branded company. Additionally, they were told whether or not other participants in the experiment would see their name displayed next to the articles they read (i.e. “Recorded” or “Not Recorded”).

Surprisingly, annotations by unknown companies and computers were significantly more persuasive than those by strangers in this “signed-out” context. This result points to the power of suggestion that annotations carry, even when they come from brands or recommendation algorithms previously unknown to the users, and suggests that annotations by computers and companies may be valuable in a signed-out context. Furthermore, the experiment showed that with “recording” on, the overall number of articles clicked decreased compared to participants without “recording,” regardless of the type of annotation, suggesting that subjects were cognizant of how they would appear to other users in social reading apps.

If annotations by strangers are not as persuasive as those by computers or brands, as the first experiment showed, what about the effects of friend annotations? The second experiment examined the signed-in experience (with Googlers as subjects) and how they reacted to social annotations from friends, investigating whether personalized endorsements help people discover and select more interesting content.

Perhaps not entirely surprisingly, the results showed that friend annotations are persuasive and improve user satisfaction with the articles selected. What’s interesting is that, in post-experiment interviews, we found that annotations influenced whether participants read articles primarily in three cases: first, when the annotator was above a threshold of social closeness; second, when the annotator had subject expertise related to the news article; and third, when the annotation provided additional context for the recommended article. This suggests that social context and personalized annotations work together to improve the user experience overall.

Some questions for future research include whether highlighting expertise in annotations helps, whether the threshold for social proximity can be determined algorithmically, and whether aggregating annotations (e.g. “110 people liked this”) increases engagement. We look forward to further research that enables social recommenders to offer appropriate explanations for why users should pay attention, and that reveals more nuances in how annotations are best presented.

Teaching machines to read between the lines (and a new corpus with entity salience annotations)



Language understanding systems are largely trained on freely available data, such as the Penn Treebank, perhaps the most widely used linguistic resource ever created. We have previously released lots of linguistic data ourselves, to contribute to the language understanding community as well as encourage further research into these areas.

Now, we’re releasing a new dataset, based on another great resource: the New York Times Annotated Corpus, a set of 1.8 million articles spanning 20 years. 600,000 articles in the NYTimes Corpus have hand-written summaries, and more than 1.5 million of them are tagged with people, places, and organizations mentioned in the article. The Times encourages use of the metadata for all kinds of things, and has set up a forum to discuss related research.

We recently used this corpus to study a topic called “entity salience”. To understand salience, consider: how do you know what a news article or a web page is about? Reading comes pretty easily to people -- we can quickly identify the places or things or people most central to a piece of text. But how might we teach a machine to perform this same task? This problem is a key step towards being able to read and understand an article.

One way to approach the problem is to look for words that appear more often than their ordinary rates. For example, if you see the word “coach” 5 times in a 581-word article, and compare that to the usual frequency of “coach” -- more like 5 in 330,000 words -- you have reason to suspect the article has something to do with coaching. The term “basketball” is even more extreme, appearing 150,000 times more often than usual. This is the idea behind the famous TF-IDF score, long used to index web pages.
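To make the arithmetic concrete, here is a minimal Python sketch of this term-ratio idea. The helper name and the background counts are illustrative only, not part of any released tool.

```python
from collections import Counter

def term_salience_ratios(doc_tokens, background_counts, background_total):
    """Compare each term's rate in a document to its rate in a large
    reference corpus. A high ratio suggests the term is unusually
    prominent in this document (the intuition behind TF-IDF)."""
    doc_counts = Counter(doc_tokens)
    doc_total = len(doc_tokens)
    ratios = {}
    for term, count in doc_counts.items():
        # Smooth unseen terms so we never divide by zero.
        background_rate = background_counts.get(term, 0.5) / background_total
        ratios[term] = (count / doc_total) / background_rate
    return ratios

# The example above: "coach" appears 5 times in a 581-word article,
# versus roughly 5 times per 330,000 words of ordinary text.
ratios = term_salience_ratios(
    doc_tokens=["coach"] * 5 + ["filler"] * 576,
    background_counts={"coach": 5, "filler": 300_000},
    background_total=330_000,
)
print(round(ratios["coach"]))  # ~568: far more frequent than usual
```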
Congratulations to Becky Hammon, first female NBA coach! Image via Wikipedia.
Term ratios are a start, but we can do better. Search indexing these days is much more involved, using, for example, the distances between pairs of words on a page to capture their relatedness. Now, with the Knowledge Graph, we are beginning to think in terms of entities and relations rather than keywords. “Basketball” is more than a string of characters; it is a reference to something in the real world which we already know quite a bit about.

Background information about entities ought to help us decide which of them are most salient. After all, an article’s author assumes her readers have some general understanding of the world, and probably a bit about sports too. Using background knowledge, we might be able to infer that the WNBA is a salient entity in the Becky Hammon article even though it only appears once.

To encourage research on leveraging background information, we are releasing a large dataset of annotations to accompany the New York Times Annotated Corpus, including resolved Freebase entity IDs and labels indicating which entities are salient. The salience annotations are determined by automatically aligning entities in the document with entities in accompanying human-written abstracts. Details of the salience annotations and some baseline results are described in our recent paper: A New Entity Salience Task with Millions of Training Examples (Jesse Dunietz and Dan Gillick).
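As a rough illustration of that alignment step (not the exact released pipeline), the sketch below marks a document entity as salient when its resolved ID also appears among the entities resolved from the abstract. The entity tuples and IDs are made-up placeholders, and entity resolution is assumed to have been run upstream.

```python
def label_salience(doc_entities, abstract_entities):
    """Toy version of the alignment heuristic: a document entity counts as
    salient if the same resolved ID also appears among the entities found
    in the human-written abstract."""
    abstract_ids = {entity_id for _, entity_id in abstract_entities if entity_id}
    return [
        {"mention": mention, "id": entity_id, "salient": entity_id in abstract_ids}
        for mention, entity_id in doc_entities
    ]

# Placeholder IDs for illustration only.
labels = label_salience(
    doc_entities=[("WNBA", "/m/placeholder_1"), ("San Antonio", "/m/placeholder_2")],
    abstract_entities=[("WNBA", "/m/placeholder_1")],
)
# -> "WNBA" is labeled salient; "San Antonio" is not.
```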

Since our entity resolver works better for named entities like WNBA than for nominals like “coach” (this is the notoriously difficult word sense disambiguation problem, which we’ve previously touched on), the annotations are limited to names.

The output for each document has the following format: the first line contains the NYT document ID and the headline; each subsequent line describes one entity, listing an entity index, an indicator for salience, the mention count for this entity in the document as determined by our coreference system, the text of the first mention of the entity, the byte offsets (start and end) of that first mention, and the resolved Freebase MID.
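For concreteness, here is a small parser sketch for records in that shape. It assumes tab-separated fields in the order just described and a "1"/"0" salience flag; check the released files for the exact delimiter and encoding.

```python
from typing import List, NamedTuple, Tuple

class EntityRecord(NamedTuple):
    index: int
    salient: bool
    mention_count: int
    first_mention: str
    start_byte: int
    end_byte: int
    freebase_mid: str

def parse_document_block(lines: List[str]) -> Tuple[str, str, List[EntityRecord]]:
    """Parse one document: a header line (doc ID and headline) followed by
    one line per entity, in the field order described above."""
    doc_id, headline = lines[0].rstrip("\n").split("\t", 1)
    entities = []
    for line in lines[1:]:
        idx, sal, count, mention, start, end, mid = line.rstrip("\n").split("\t")
        entities.append(EntityRecord(int(idx), sal == "1", int(count),
                                     mention, int(start), int(end), mid))
    return doc_id, headline, entities
```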
Features like mention count and document positioning give reasonable salience predictions. But because they only describe what’s explicitly in the document, we expect a system that uses background information to expose what’s implicit could give better results.
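A hypothetical toy scorer along those lines might combine the two surface cues directly; the weighting here is arbitrary and is not the trained baseline from the paper.

```python
def surface_salience_score(mention_count: int, first_mention_start: int,
                           doc_length: int) -> float:
    """Score an entity using only in-document cues: how often it is
    mentioned and how early its first mention appears. Higher counts and
    earlier first mentions both raise the score."""
    earliness = 1.0 - first_mention_start / max(doc_length, 1)
    return mention_count + 2.0 * earliness  # illustrative weight, not tuned
```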

Download the data directly from Google Drive, or visit the project home page with more information at our Google Code site. We look forward to seeing what you come up with!