The context: semantic knowledge extraction from unstructured text

In a previous post we introduced fise, an open source semantic engine now being incubated at the ASF under the new name Apache Stanbol. Here is a 4 minute demo that explains how such a semantic engine can be used by a Document Management System such as Nuxeo DM to tag documents with entities instead of potentially ambiguous words:

Semantic ECM Demo: Preview of upcoming marketplace addon

The problem with the current implementation, which is based on OpenNLP, is the lack of readily available statistical models for Named Entity Recognition in languages such as French. Furthermore, the existing models are restricted to the detection of a few entity classes (right now the English models can detect person, place and organization names).

To build such a model, developers have to teach or train the system by applying a machine learning algorithm to an annotated corpus of data. It is very easy to write such a corpus for OpenNLP: just pass it a text file with one sentence per line, where entity occurrences are marked using START and END tags, for instance:

<START:person> Olivier Grisel <END> is working on the <START:software> Stanbol <END> project .

The only slight problem is to somehow convince someone to spend hours manually annotating hundreds of thousands of sentences on various topics such as business, famous people, sports, science, literature, history… without making too many mistakes.
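To make the expected format concrete, here is a small Python sketch (a hypothetical helper, not part of OpenNLP) that renders one training line from a tokenized sentence and a list of token-level entity spans:

```python
def to_opennlp_line(tokens, spans):
    """Render one OpenNLP NameFinder training line.

    tokens: list of word tokens for a single sentence.
    spans: list of (start, end, entity_type) token offsets, end exclusive.
    """
    out = []
    for i, token in enumerate(tokens):
        for start, end, etype in spans:
            if i == start:
                out.append('<START:%s>' % etype)
        out.append(token)
        for start, end, etype in spans:
            if i == end - 1:
                out.append('<END>')
    return ' '.join(out)

tokens = 'Olivier Grisel is working on the Stanbol project .'.split()
spans = [(0, 2, 'person'), (6, 7, 'software')]
print(to_opennlp_line(tokens, spans))
# → <START:person> Olivier Grisel <END> is working on the <START:software> Stanbol <END> project .
```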

Mining Wikipedia in the cloud

Instead of manually annotating text, one should try to benefit from an existing, publicly available annotated text corpus that deals with a wide range of topics, namely Wikipedia.

Our approach is rather simple: the text body of Wikipedia articles is rich in internal links pointing to other Wikipedia articles. Some of those articles are referring to the entity classes we are interested in (e.g. person, countries, cities, …). Hence we just need to find a way to convert those links into entity class annotations on text sentences (without the Wikimarkup formatting syntax).
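As a toy illustration of that idea, the following Python sketch extracts `[[target|anchor]]` links from a snippet of wikimarkup with a naive regular expression (real wikimarkup needs a proper parser) and records their character offsets in the cleaned text:

```python
import re

LINK = re.compile(r'\[\[([^|\]]+)(?:\|([^\]]+))?\]\]')

def strip_links(wikitext):
    """Replace [[target|anchor]] links with their anchor text while
    recording (begin, end, target) character offsets in the clean text."""
    clean, links = [], []
    cursor = 0  # position in the cleaned output text
    pos = 0     # position in the raw wikitext
    for m in LINK.finditer(wikitext):
        clean.append(wikitext[pos:m.start()])
        cursor += m.start() - pos
        target = m.group(1)
        anchor = m.group(2) or target
        links.append((cursor, cursor + len(anchor), target))
        clean.append(anchor)
        cursor += len(anchor)
        pos = m.end()
    clean.append(wikitext[pos:])
    return ''.join(clean), links

text, links = strip_links('[[Paris]] is the capital of [[France]].')
# text  == 'Paris is the capital of France.'
# links == [(0, 5, 'Paris'), (24, 30, 'France')]
```

Each recorded link span can then be turned into an entity class annotation once we know the class of the link target, which is what the following sections address.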

To find the type of the entity described by a given Wikipedia article, one can use the category information as described in this paper by Alexander E. Richman and Patrick Schone. Alternatively, we can use the semi-structured information available in the Wikipedia infoboxes. We decided to go for the latter by reusing the work done by the DBpedia project:

DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data. We hope this will make it easier for the amazing amount of information in Wikipedia to be used in new and interesting ways, and that it might inspire new mechanisms for navigating, linking and improving the encyclopedia itself.

More specifically we will use a subset of the DBpedia RDF dumps:

  • instance_types_en.nt to relate a DBpedia entity ID to its entity class ID
  • wikipedia_links_en.nt to relate a Wikipedia article URL to its DBpedia entity ID

The mapping from a Wikipedia URL to a DBpedia entity ID is also available in 12 languages (en, de, fr, pl, ja, it, nl, es, pt, ru, sv, zh) which should allow us to reuse the same program to build statistical models for all of them.
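Expressed in plain Python over toy data, the two join steps boil down to chained dictionary lookups (the N-Triples parsing here is naive and the predicates shown are illustrative, not the exact dump contents):

```python
def parse_nt(lines):
    """Yield (subject, object) URI pairs from N-Triples lines,
    skipping malformed lines and lines with literal objects."""
    for line in lines:
        parts = line.split()
        if len(parts) >= 4 and parts[2].startswith('<'):
            yield parts[0].strip('<>'), parts[2].strip('<>')

# Toy stand-ins for the DBpedia dump files
wikipedia_links = [
    '<http://en.wikipedia.org/wiki/Paris> <http://xmlns.com/foaf/0.1/primaryTopic> <http://dbpedia.org/resource/Paris> .',
]
instance_types = [
    '<http://dbpedia.org/resource/Paris> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Place> .',
]

wiki_to_db = dict(parse_nt(wikipedia_links))  # step 2 join input
db_to_type = dict(parse_nt(instance_types))   # step 3 join input

def type_of(wikiuri):
    """The two JOINs as chained dict lookups."""
    dburi = wiki_to_db.get(wikiuri)
    return db_to_type.get(dburi) if dburi else None

print(type_of('http://en.wikipedia.org/wiki/Paris'))
# → http://dbpedia.org/ontology/Place
```

At Wikipedia scale these lookups no longer fit comfortably in memory on one machine, which is precisely why the real implementation expresses them as Pig JOINs over Hadoop.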

Hence to summarize, we want a program that will:

  1. parse the Wikimarkup of a Wikipedia dump to extract unformatted text body along with the internal wikilink position information;
  2. for each link target, find the DBpedia ID of the entity if available (this is the equivalent of a JOIN operation in SQL);
  3. for each DBpedia entity ID find the entity class ID (this is another JOIN);
  4. convert the result to OpenNLP formatted files with entity class information.

In order to implement this, we started a new project called pignlproc, licensed under ASL2. The source code is available as a github repository. pignlproc uses Apache Hadoop for distributed processing, Apache Pig for high level Hadoop scripting and Apache Whirr to deploy and manage Hadoop on a cluster of tens of virtual machines on the Amazon EC2 cloud infrastructure (you can also run it locally on a single machine of course).

Detailed instructions on how to run this yourself are available in the online wiki.

Parsing the wikimarkup

The first script performs the first step of the program: parsing and cleaning up the wikimarkup and extracting the sentences and link positions. It uses some pignlproc-specific User Defined Functions written in Java to parse the XML dump, parse the wikimarkup syntax using the bliki wiki parser and detect sentence boundaries using OpenNLP, all while propagating the link position information.

-- Register the project jar to use the custom loaders and UDFs
-- (the jar path is passed as a Pig parameter)
REGISTER $PIGNLPROC_JAR;
parsed = LOAD '$INPUT'
AS (title, wikiuri, text, redirect, links, headers, paragraphs);
-- filter and project as early as possible
noredirect = FILTER parsed BY redirect IS NULL;
projected = FOREACH noredirect GENERATE title, text, links, paragraphs;
-- Extract the sentence contexts of the links respecting the paragraph
-- boundaries
sentences = FOREACH projected
GENERATE title, FLATTEN(pignlproc.evaluation.SentencesWithLink(
text, links, paragraphs));
stored = FOREACH sentences
GENERATE title, sentenceOrder, linkTarget, linkBegin, linkEnd, sentence;
-- Ensure ordering for fast merge with type info later
ordered = ORDER stored BY linkTarget ASC, title ASC, sentenceOrder ASC;
STORE ordered INTO '$OUTPUT/$LANG/sentences_with_links';

We store the intermediate results on HDFS for later reuse by the last script in step 4.

Extracting entity class information from DBpedia

The second script performs steps 2 and 3 (the joins) on the DBpedia dumps. It also uses some pignlproc-specific tools to quickly parse NT triples while filtering out those that are not of interest:

-- Load Wikipedia links and instance types from the DBpedia dumps
wikipedia_links = LOAD '$INPUT/wikipedia_links_$LANG.nt'
AS (wikiuri: chararray, dburi: chararray);
wikipedia_links2 = FILTER wikipedia_links BY wikiuri IS NOT NULL;
-- Load DBpedia type data and filter out the overly generic owl:Thing type
instance_types =
LOAD '$INPUT/instance_types_en.nt'
AS (dburi: chararray, type: chararray);
instance_types_no_thing = FILTER instance_types
BY type != 'http://www.w3.org/2002/07/owl#Thing';
joined = JOIN instance_types_no_thing BY dburi, wikipedia_links2 BY dburi;
projected = FOREACH joined GENERATE wikiuri, type;
-- Ensure ordering for fast merge with sentence links
ordered = ORDER projected BY wikiuri ASC, type ASC;
STORE ordered INTO '$OUTPUT/$LANG/wikiuri_to_types';

Again we store the intermediate results on HDFS for later reuse by other scripts.

Merging and converting to the OpenNLP annotation format

Finally, the last script takes as input the previously generated files and an additional mapping from DBpedia class names to their OpenNLP counterparts, for instance: person, location, organization, album, movie, book, software and drug.
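Expressed as a Python dict, such a mapping might look like the following (the DBpedia ontology URIs are my assumption of what the two-column mapping file contains):

```python
# Assumed mapping from DBpedia ontology class URIs to OpenNLP type names.
DBPEDIA_TO_OPENNLP = {
    'http://dbpedia.org/ontology/Person': 'person',
    'http://dbpedia.org/ontology/Place': 'location',
    'http://dbpedia.org/ontology/Organisation': 'organization',
    'http://dbpedia.org/ontology/Album': 'album',
    'http://dbpedia.org/ontology/Film': 'movie',
    'http://dbpedia.org/ontology/Book': 'book',
    'http://dbpedia.org/ontology/Software': 'software',
    'http://dbpedia.org/ontology/Drug': 'drug',
}
```

Keeping this mapping in a small side file means new entity classes can be added without touching the Pig scripts themselves.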

The Pig script that does the final joins and the conversion to the OpenNLP output format is the following. Here again, pignlproc provides some UDFs for converting the Pig tuple & bag representation to the serialized format accepted by OpenNLP:

SET default_parallel 40;
-- use the english tokenizer for other European languages as well
DEFINE opennlp_merge pignlproc.evaluation.MergeAsOpenNLPAnnotatedText('en');
sentences = LOAD '$INPUT/$LANG/sentences_with_links'
AS (title: chararray, sentenceOrder: int, linkTarget: chararray,
linkBegin: int, linkEnd: int, sentence: chararray);
wikiuri_types = LOAD '$INPUT/$LANG/wikiuri_to_types'
AS (wikiuri: chararray, typeuri: chararray);
-- load the type mapping from DBpedia type URI to OpenNLP type name
type_names = LOAD '$TYPE_NAMES' AS (typeuri: chararray, typename: chararray);
-- Perform successive joins to find the OpenNLP typename of the linkTarget
joined = JOIN wikiuri_types BY typeuri, type_names BY typeuri USING 'replicated';
joined_projected = FOREACH joined GENERATE wikiuri, typename;
joined2 = JOIN joined_projected BY wikiuri, sentences BY linkTarget;
result = FOREACH joined2
GENERATE title, sentenceOrder, typename, linkBegin, linkEnd, sentence;
-- Reorder and group by article title and sentence order
ordered = ORDER result BY title ASC, sentenceOrder ASC;
grouped = GROUP ordered BY (title, sentenceOrder);
-- Convert to the OpenNLP training format
opennlp_corpus =
FOREACH grouped
GENERATE opennlp_merge(
ordered.sentence, ordered.linkBegin, ordered.linkEnd, ordered.typename);
STORE opennlp_corpus INTO '$OUTPUT/$LANG/opennlp';

Depending on the size of the corpus and the number of nodes you are using, each individual job will run from a couple of minutes to a couple of hours. For instance, the first step, parsing 3GB of Wikipedia XML chunks on 30 small EC2 instances, will typically take between 5 and 10 minutes.

Some preliminary results

Here is a sample of the output on the French Wikipedia dump for location detection only:

You can replace “location” by “person” or “organization” in the previous URL for more examples. You can also replace “part-r-00000” by “part-r-000XX” to download larger chunks of the corpus. You can also replace “fr” by “en” to get English sentences.

By concatenating chunks of each corpus into files of ~100k lines, one can get reasonably sized input files for the OpenNLP command line tool:

$ opennlp TokenNameFinderTrainer -lang fr -encoding utf-8 \
    -iterations 50 -type location -model fr-ner-location.bin \
    -data ~/data/fr/opennlp_location/train

Here are the resulting models:

It is possible to retrain those models on a larger subset of chunks by allocating more than 2GB of heap space to the OpenNLP CLI tool (I used version 1.5.0). To evaluate the performance of the trained models, you can run the OpenNLP evaluator on a separate part of the corpus (commonly called the testing set):

$ opennlp TokenNameFinderEvaluator -encoding utf-8 \
    -model fr-ner-location.bin \
    -data ~/data/fr/opennlp_location/test

The corpus is quite noisy, so the performance of the trained models is not optimal (but better than nothing anyway). Here are the results of evaluations on held-out chunks of each corpus (+/- 0.02):

Performance evaluation for NER on a French extraction with 100k sentences
| class | precision | recall | f1-score |
| location | 0.87 | 0.74 | 0.80 |
| person | 0.80 | 0.68 | 0.74 |
| organization | 0.80 | 0.65 | 0.72 |

Performance evaluation for NER on an English extraction with 100k sentences
| class | precision | recall | f1-score |
| location | 0.77 | 0.67 | 0.71 |
| person | 0.80 | 0.70 | 0.75 |
| organization | 0.79 | 0.64 | 0.70 |
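As a quick sanity check on the tables above, the f1-score is simply the harmonic mean of precision and recall:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# French 'location' row from the table above: 0.87 precision, 0.74 recall
print(round(f1_score(0.87, 0.74), 2))  # → 0.8
```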

The results of this first experiment are interesting, but lower than the state of the art, especially the recall values. The main reason is that many sentences in Wikipedia contain entities that do not carry a link.

A potential way to improve this would be to set up active learning tooling: the trained models would be reused to suggest missing annotations to a human validator, who could quickly accept or reject them. This would improve the quality of the corpus, and hence the quality of the next generation of models, until the corpus reaches the quality of a fully manually annotated one.

Future work

I hope that this first experiment has convinced some of you of the power of combining tools such as Pig, Hadoop and OpenNLP for batch text analytics. By advertising the project on the OpenNLP users mailing list, we have already received some very positive feedback. It is very likely that pignlproc will be contributed one way or another to the OpenNLP project.

These tools are, of course, not limited to training OpenNLP models, and it would be very easy to adapt the code of the conversion UDF to generate BIO formatted corpora for use by other NLP libraries such as NLTK.

Finally, there is no reason to limit this processing to NER corpora generation. Similar UDFs and scripts could be written to identify sentences that express, in natural language, the relationships between entities that the DBpedia project has already extracted in structured form from the infoboxes.

Such a new corpus would be of great value for developing and evaluating the quality of automated extraction of entity relationships and properties. Such an extractor could potentially be based on syntactic parsers such as the one available in OpenNLP, or on MaltParser.


This work was funded by the Scribo and IKS R&D projects. We would also like to thank all the developers of the projects involved.

For information about the Nuxeo Platform, please visit our product page: Content Services Platform or request a custom demo to see how we can help your organization.