Showing posts with label corpus. Show all posts

Teaching machines to read between the lines and a new corpus with entity salience annotations



Language understanding systems are largely trained on freely available data, such as the Penn Treebank, perhaps the most widely used linguistic resource ever created. We have previously released lots of linguistic data ourselves, to contribute to the language understanding community as well as encourage further research into these areas.

Now, we’re releasing a new dataset, based on another great resource: the New York Times Annotated Corpus, a set of 1.8 million articles spanning 20 years. 600,000 articles in the NYTimes Corpus have hand-written summaries, and more than 1.5 million of them are tagged with people, places, and organizations mentioned in the article. The Times encourages use of the metadata for all kinds of things, and has set up a forum to discuss related research.

We recently used this corpus to study a topic called “entity salience”. To understand salience, consider: how do you know what a news article or a web page is about? Reading comes pretty easily to people -- we can quickly identify the places or things or people most central to a piece of text. But how might we teach a machine to perform this same task? This problem is a key step towards being able to read and understand an article.

One way to approach the problem is to look for words that appear more often than their ordinary rates. For example, if you see the word “coach” 5 times in a 581-word article, and compare that to the usual frequency of “coach” -- more like 5 in 330,000 words -- you have reason to suspect the article has something to do with coaching. The term “basketball” is even more extreme, appearing 150,000 times more often than usual. This is the idea behind the famous TF-IDF weighting, long used to index web pages.
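
The "coach" comparison can be worked through in a few lines of Python. This is a minimal sketch of a plain rate ratio, not a full TF-IDF weighting (which usually involves a logarithm of inverse document frequency); the function name `frequency_ratio` is made up for illustration.

```python
def frequency_ratio(count_in_doc, doc_length, background_count, background_length):
    """Return how many times more often a word occurs in a document
    than in ordinary text (document rate divided by background rate)."""
    doc_rate = count_in_doc / doc_length
    background_rate = background_count / background_length
    return doc_rate / background_rate


# "coach": 5 occurrences in a 581-word article vs. about 5 per 330,000 words.
coach_ratio = frequency_ratio(5, 581, 5, 330_000)
print(f"'coach' appears about {coach_ratio:.0f} times more often than usual")
```

With the numbers above, the ratio comes out to roughly 568, which is why the word stands out.
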
Congratulations to Becky Hammon, first female NBA coach! Image via Wikipedia.
Term ratios are a start, but we can do better. Search indexing these days is much more involved, using for example the distances between pairs of words on a page to capture their relatedness. Now, with the Knowledge Graph, we are beginning to think in terms of entities and relations rather than keywords. “Basketball” is more than a string of characters; it is a reference to something in the real world which we already know quite a bit about.

Background information about entities ought to help us decide which of them are most salient. After all, an article’s author assumes her readers have some general understanding of the world, and probably a bit about sports too. Using background knowledge, we might be able to infer that the WNBA is a salient entity in the Becky Hammon article even though it only appears once.

To encourage research on leveraging background information, we are releasing a large dataset of annotations to accompany the New York Times Annotated Corpus, including resolved Freebase entity IDs and labels indicating which entities are salient. The salience annotations are determined by automatically aligning entities in the document with entities in accompanying human-written abstracts. Details of the salience annotations and some baseline results are described in our recent paper: A New Entity Salience Task with Millions of Training Examples (Jesse Dunietz and Dan Gillick).
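
As a rough illustration of the alignment idea, the sketch below marks a document entity as salient when the same resolved entity also shows up in the abstract. It is a simplification under stated assumptions: the entity IDs are made-up placeholders, the function name `label_salience` is hypothetical, and the actual alignment procedure is the one described in the paper.

```python
def label_salience(document_entities, abstract_entities):
    """Map each document entity ID to True if it also appears in the abstract."""
    abstract_ids = set(abstract_entities)
    return {entity: entity in abstract_ids for entity in document_entities}


# Placeholder entity IDs, for illustration only (not real Freebase MIDs).
document_entities = ["ent_hammon", "ent_spurs", "ent_wnba", "ent_texas"]
abstract_entities = ["ent_hammon", "ent_spurs"]

labels = label_salience(document_entities, abstract_entities)
print(labels)
```
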

Since our entity resolver works better for named entities like WNBA than for nominals like “coach” (this is the notoriously difficult word sense disambiguation problem, which we’ve previously touched on), the annotations are limited to names.

Below is sample output for a document. The first line contains the NYT document ID and the headline; each subsequent line includes an entity index, an indicator for salience, the mention count for this entity in the document as determined by our coreference system, the text of the first mention of the entity, the byte offsets (start and end) for the first mention of the entity, and the resolved Freebase MID.
Features like mention count and document positioning give reasonable salience predictions. But because they only describe what’s explicitly in the document, we expect a system that uses background information to expose what’s implicit could give better results.
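
To make the per-entity fields and the mention-count idea concrete, here is a hedged sketch of reading one entity line and applying a naive count threshold. Several things are assumptions, not facts about the released files: the tab delimiter, the encoding of the salience indicator as `1`/`0`, the names `EntityLine`, `parse_entity_line`, and `predict_salient_by_count`, and the threshold of 2. The sample line is invented for illustration, with a placeholder in place of a real Freebase MID.

```python
from dataclasses import dataclass


@dataclass
class EntityLine:
    index: int
    salient: bool
    mention_count: int
    first_mention: str
    start_byte: int
    end_byte: int
    mid: str


def parse_entity_line(line, sep="\t"):
    """Parse one entity line, assuming seven fields in the order described above."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    index, salient, count, text, start, end, mid = fields
    return EntityLine(int(index), salient == "1", int(count), text,
                      int(start), int(end), mid)


def predict_salient_by_count(entity, min_mentions=2):
    """Naive baseline: call an entity salient if it is mentioned often enough."""
    return entity.mention_count >= min_mentions


# Invented sample line; the MID is a placeholder, not a real identifier.
sample = "0\t1\t5\tBecky Hammon\t10\t22\t/m/placeholder"
entity = parse_entity_line(sample)
print(entity.first_mention, predict_salient_by_count(entity))
```

A baseline like this only sees what is explicit in the document, which is the limitation the paragraph above points out.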

Download the data directly from Google Drive, or visit the project home page with more information at our Google Code site. We look forward to seeing what you come up with!

A Multilingual Corpus of Automatically Extracted Relations from Wikipedia



In Natural Language Processing, relation extraction is the task of assigning a semantic relationship between a pair of arguments. As an example, a relationship between the phrases “Ottawa” and “Canada” is “is the capital of”. These extracted relations could be used in a variety of applications ranging from Question Answering to building databases from unstructured text.

While relation extraction systems work accurately for English and a few other languages, where tools for syntactic analysis such as parsers, part-of-speech taggers and named entity analyzers are readily available, there is relatively little work on developing such systems for most of the world's languages, where linguistic analysis tools do not yet exist. Fortunately, because we do have translation systems between English and many other languages (such as Google Translate), we can translate text from a non-English language to English, perform relation extraction and project these relations back to the foreign language.
Relation extraction in a Spanish sentence using the cross-lingual relation extraction pipeline.
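
The projection step of such a pipeline can be sketched as follows. This is an illustration under assumptions, not the algorithm from the paper: it presumes the translation and the English relation extraction have already been done, that a word alignment from English token positions to source token positions is available, and that each projected phrase is the smallest contiguous source span covering the aligned tokens. The names `project_span` and `project_relation` are hypothetical.

```python
def project_span(span, alignment):
    """Project an English token span (start, end) onto source-language tokens.

    `alignment` maps an English token index to a list of source token indices.
    Returns the smallest covering source span, or None if nothing is aligned.
    """
    start, end = span
    source_indices = [s for e in range(start, end) for s in alignment.get(e, [])]
    if not source_indices:
        return None
    return (min(source_indices), max(source_indices) + 1)


def project_relation(relation, alignment, source_tokens):
    """Project an (arg1, relation phrase, arg2) tuple of English spans to source text."""
    projected = []
    for span in relation:
        source_span = project_span(span, alignment)
        if source_span is None:
            return None  # drop tuples that cannot be fully projected
        projected.append(" ".join(source_tokens[source_span[0]:source_span[1]]))
    return tuple(projected)


spanish = ["Ottawa", "es", "la", "capital", "de", "Canadá"]
english = ["Ottawa", "is", "the", "capital", "of", "Canada"]
alignment = {0: [0], 1: [1], 2: [2], 3: [3], 4: [4], 5: [5]}

# English tuple: ("Ottawa", "is the capital of", "Canada") as token spans.
english_relation = ((0, 1), (1, 5), (5, 6))
projected = project_relation(english_relation, alignment, spanish)
print(projected)
```
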
In Multilingual Open Relation Extraction Using Cross-lingual Projection, which will appear at the 2015 Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL HLT 2015), we use this idea of cross-lingual projection to develop an algorithm that extracts open-domain relation tuples (that is, tuples in which an arbitrary phrase can describe the relation between the arguments) in multiple languages from Wikipedia. In this work, we also evaluated the quality of the extracted relations using human annotations in French, Hindi and Russian.

Since there is no such publicly available corpus of multilingual relations, we are releasing a dataset of automatically extracted relations from the Wikipedia corpus in 61 languages, along with the manually annotated relations in 3 languages (French, Hindi and Russian). It is our hope that our data will help researchers working on natural language processing and encourage novel applications in a wide variety of languages. More details on the corpus and the file formats can be found in this README file.

We wish to thank Bruno Cartoni, Vitaly Nikolaev, Hidetoshi Shimokawa, Kishore Papineni, John Giannandrea and their teams for making this data release possible. This dataset is licensed by Google Inc. under the Creative Commons Attribution-ShareAlike 3.0 License.

11 Billion Clues in 800 Million Documents: A Web Research Corpus Annotated with Freebase Concepts



“I assume that by knowing the truth you mean knowing things as they really are.”
- Plato

When you type in a search query -- perhaps Plato -- are you interested in the string of letters you typed? Or the concept or entity represented by that string? But knowing that the string represents something real and meaningful only gets you so far in computational linguistics or information retrieval -- you have to know what the string actually refers to. The Knowledge Graph and Freebase are databases of things, not strings, and references to them let you operate in the realm of concepts and entities rather than strings and n-grams.

We’ve previously released data to help with disambiguation and recently awarded $1.2M in research grants to work on related problems. Today we’re taking another step: releasing data consisting of nearly 800 million documents automatically annotated with over 11 billion references to Freebase entities.

These Freebase Annotations of the ClueWeb Corpora (FACC) consist of ClueWeb09 FACC and ClueWeb12 FACC. 11 billion phrases that refer to concepts and entities in Freebase were automatically labeled with their unique identifiers (Freebase MIDs). For example:



Since the annotation process was automatic, it likely made mistakes. We optimized for precision over recall, so the algorithm skipped a phrase if it wasn’t confident enough of the correct MID. If you prefer even higher precision, we include confidence levels, so you can filter out the lower-confidence annotations that we did include.
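
Filtering on confidence might look like the sketch below. The record layout is an assumption made for illustration: the `Annotation` fields, the 0-to-1 confidence scale, and the placeholder MIDs are not taken from the released files, whose actual format is documented on the dataset pages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    phrase: str
    mid: str           # placeholder values below, not real Freebase MIDs
    confidence: float  # assumed to lie between 0 and 1


def filter_by_confidence(annotations, min_confidence):
    """Keep only annotations at or above the chosen confidence threshold."""
    return [a for a in annotations if a.confidence >= min_confidence]


annotations = [
    Annotation("Plato", "/m/placeholder_1", 0.97),
    Annotation("Republic", "/m/placeholder_2", 0.61),
    Annotation("Athens", "/m/placeholder_3", 0.88),
]

kept = filter_by_confidence(annotations, min_confidence=0.8)
print([a.phrase for a in kept])
```

Raising the threshold trades recall for precision: fewer annotations survive, but the ones that do are more likely to be correct.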

Based on review of a sample of documents, we believe the precision is about 80-85%, and recall, which is inherently difficult to measure in situations like this, is in the range of 70-85%. Not every ClueWeb document is included in this corpus; documents in which we found no entities were excluded from the set. A document might be excluded because there were no entities to be found, because the entities in question weren’t in Freebase, or because none of the entities were resolved at a confidence level above the threshold.

The ClueWeb data is used in multiple TREC tracks. You may also be interested in our annotations of several TREC query sets, including those from the Million Query Track and Web Track.

If you would prefer a human-annotated set, you might want to look at the Wikilinks Corpus we released last year. Entities there were disambiguated by links to Wikipedia, inserted by the authors of the page, which is effectively a form of human annotation.

You can find more detail and download the data on the pages for the two sets: ClueWeb09 FACC and ClueWeb12 FACC. You can also subscribe to our data release mailing list to learn about releases as they happen.

Special thanks to Jamie Callan and Juan Caicedo Carvajal for their help throughout the annotation project.