Mining Wikipedia with Hadoop and Pig for Natural Language Processing

Tue 11 January 2011 By Olivier Grisel

The context: semantic knowledge extraction from unstructured text

In a previous post we introduced fise, an open source semantic engine
now being incubated at the ASF under the new name Apache Stanbol.
Here is a 4 minute demo that explains how such a semantic engine can be
used by a document management system such as Nuxeo DM to tag documents
with entities instead of potentially ambiguous words:

The problem with the current implementation, which is based on
OpenNLP, is the lack of readily
available statistical models for Named Entity Recognition in languages such as French. Furthermore, the existing models are restricted to the detection of a few entity
classes (right now the English models can detect person, place and organization names).

To build such a model, developers have to teach or train the system
by applying a machine learning algorithm on an annotated corpus of data.
It is very easy to write such a corpus for OpenNLP:
just pass it a text file with one sentence per line, where entity occurrences
are located using the START and END tags, for instance:

<START:person> Olivier Grisel <END> is working on the <START:software> Stanbol <END> project .
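
To make the format concrete, here is a minimal Python sketch that reads one
such line back into tokens and entity spans. It assumes whitespace-separated
tokens and non-nested tags, as in the example above; the function name
parse_annotated_sentence is hypothetical and not part of OpenNLP.

```python
import re

# Matches an opening tag such as "<START:person>" and captures the type.
START_TAG = re.compile(r"^<START:([^>]+)>$")
END_TAG = "<END>"


def parse_annotated_sentence(line):
    """Return (tokens, spans); spans are (type, first_token, end_token)."""
    tokens, spans = [], []
    current_type, current_start = None, None
    for item in line.split():
        match = START_TAG.match(item)
        if match:
            if current_type is not None:
                raise ValueError("nested <START> tag")
            current_type, current_start = match.group(1), len(tokens)
        elif item == END_TAG:
            if current_type is None:
                raise ValueError("<END> without <START>")
            # end_token is exclusive, like a Python slice bound
            spans.append((current_type, current_start, len(tokens)))
            current_type = None
        else:
            tokens.append(item)
    if current_type is not None:
        raise ValueError("unclosed <START> tag")
    return tokens, spans


line = ("<START:person> Olivier Grisel <END> is working on the "
        "<START:software> Stanbol <END> project .")
tokens, spans = parse_annotated_sentence(line)
for entity_type, first, end in spans:
    print(entity_type, tokens[first:end])
```

A check of this kind is a cheap way to reject malformed lines before handing
a large generated corpus to the trainer.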

The only slight problem is to somehow convince someone to spend hours
manually annotating hundreds of thousands of sentences from text on various
topics such as business, famous people, sports, science, literature,
history... without making too many mistakes.

Mining Wikipedia in the cloud

Instead of manually annotating text, one should try to benefit from an existing
annotated and publicly available text corpus that deals with a wide range of topics,
namely Wikipedia.

Our approach is rather simple: the text body of Wikipedia articles is rich in internal links
pointing to other Wikipedia articles. Some of those articles refer to the entity classes
we are interested in (e.g. persons, countries, cities, ...). Hence we just need to find a way
to convert those links into entity class annotations on text sentences (without the
Wikimarkup formatting syntax).
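
As a rough illustration of the idea, the following Python sketch strips
internal links from a snippet of Wikimarkup while recording where each link
label ends up in the plain text. It is a deliberately simplified stand-in:
the regular expression only handles [[Target]] and [[Target|label]] and
ignores templates, nested links and the rest of the syntax, which is why a
real wiki parser is needed in practice. The name strip_links is hypothetical.

```python
import re

# Simplified pattern for internal links: [[Target]] or [[Target|label]].
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def strip_links(markup):
    """Return (plain_text, links); links are (target, begin, end) offsets."""
    parts, links = [], []
    cursor = 0  # current position in the markup
    length = 0  # length of the plain text produced so far
    for match in WIKILINK.finditer(markup):
        before = markup[cursor:match.start()]
        # The visible label defaults to the target when no "|label" is given
        label = match.group(2) or match.group(1)
        parts.extend([before, label])
        begin = length + len(before)
        links.append((match.group(1), begin, begin + len(label)))
        length = begin + len(label)
        cursor = match.end()
    parts.append(markup[cursor:])
    return "".join(parts), links


text, links = strip_links(
    "[[Paris]] is the capital of [[France|the French Republic]].")
print(text)
for target, begin, end in links:
    print(target, repr(text[begin:end]))
```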

To find the type of the entity described
by a given Wikipedia article, one can use the category information as described in this paper by Alexander E. Richman and Patrick Schone. Alternatively, we can use the semi-structured
information available in the Wikipedia infoboxes. We decided to go for the latter by reusing the
work done by the DBpedia project:

DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data. We hope this will make it easier for the amazing amount of information in Wikipedia to be used in new and interesting ways, and that it might inspire new mechanisms for navigating, linking and improving the encyclopedia itself.

More specifically we will use a subset of the DBpedia RDF dumps:

  • instance_types_en.nt to relate a DBpedia entity ID to its entity class ID

  • page_links_en.nt to relate a Wikipedia article URL to its DBpedia entity ID

The mapping from a Wikipedia URL to a DBpedia entity ID is also available in 12 languages
(en, de, fr, pl, ja, it, nl, es, pt, ru, sv, zh) which should allow us to reuse the same program
to build statistical models for all of them.

Hence to summarize, we want a program that will:

  1. parse the Wikimarkup of a Wikipedia dump to extract unformatted text body along with
    the internal wikilink position information;

  2. for each link target, find the DBpedia ID of the entity if available (this is the equivalent
    of a JOIN operation in SQL);

  3. for each DBpedia entity ID find the entity class ID (this is another JOIN);

  4. convert the result to OpenNLP formatted files with entity class information.
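
Steps 2 and 3 above can be sketched in a few lines of Python, with in-memory
dictionaries standing in for the DBpedia dumps. All data values and the
function name typed_links are made up for illustration, and each entity is
assumed to have a single class; the real datasets are far too large for this
approach, which is the reason for running the joins on Hadoop.

```python
# Toy stand-ins for the three inputs; every value below is made up.
link_occurrences = [
    # (sentence, Wikipedia URL of the link target, link begin, link end)
    ("Paris is the capital of France.",
     "http://en.wikipedia.org/wiki/Paris", 0, 5),
    ("Paris is the capital of France.",
     "http://en.wikipedia.org/wiki/France", 24, 30),
    ("See the list of capitals.",
     "http://en.wikipedia.org/wiki/List_of_capitals", 8, 24),
]
wikiuri_to_dburi = {
    "http://en.wikipedia.org/wiki/Paris": "http://dbpedia.org/resource/Paris",
    "http://en.wikipedia.org/wiki/France": "http://dbpedia.org/resource/France",
}
dburi_to_type = {
    "http://dbpedia.org/resource/Paris": "http://dbpedia.org/ontology/Place",
    "http://dbpedia.org/resource/France": "http://dbpedia.org/ontology/Place",
}


def typed_links(occurrences, uri_map, type_map):
    """Inner-join link occurrences with both mappings (steps 2 and 3)."""
    for sentence, wikiuri, begin, end in occurrences:
        dburi = uri_map.get(wikiuri)       # step 2: Wikipedia URL -> DBpedia ID
        entity_type = type_map.get(dburi)  # step 3: DBpedia ID -> class ID
        if entity_type is not None:        # unmatched links are dropped
            yield sentence, begin, end, entity_type


for row in typed_links(link_occurrences, wikiuri_to_dburi, dburi_to_type):
    print(row)
```

The link to the list article has no DBpedia type in the toy data, so it is
dropped, exactly as an inner JOIN would drop it.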

In order to implement this, we started a new project called pignlproc, licensed under ASL2.
The source code is available as a GitHub repository. pignlproc uses Apache Hadoop for distributed processing, Apache Pig for high-level Hadoop scripting
and Apache Whirr to deploy and manage Hadoop on
a cluster of tens of virtual machines on the Amazon EC2
cloud infrastructure (you can also run it locally on a single machine of course).

Detailed instructions on how to run this yourself are available in the
and the online wiki.

Parsing the wikimarkup

The first script performs the first step of the program, namely parsing and cleaning up
the wikimarkup and extracting the sentences and link positions. This script uses some
pignlproc specific User Defined Functions written in Java to parse the
XML dump, parse the Wikimarkup syntax using the
bliki wiki parser and detect sentence boundaries
using OpenNLP, all of this while propagating the link positioning information.

-- Register the project jar to use the custom loaders and UDFs
parsed = LOAD '$INPUT'
  AS (title, wikiuri, text, redirect, links, headers, paragraphs);
-- filter and project as early as possible
noredirect = FILTER parsed BY redirect IS NULL;
projected = FOREACH noredirect GENERATE title, text, links, paragraphs;
-- Extract the sentence contexts of the links respecting the paragraph
-- boundaries
sentences = FOREACH projected
  GENERATE title, FLATTEN(pignlproc.evaluation.SentencesWithLink(
    text, links, paragraphs));
stored = FOREACH sentences
  GENERATE title, sentenceOrder, linkTarget, linkBegin, linkEnd, sentence;
-- Ensure ordering for fast merge with type info later
ordered = ORDER stored BY linkTarget ASC, title ASC, sentenceOrder ASC;
STORE ordered INTO '$OUTPUT/$LANG/sentences_with_links';

We store the intermediate results on HDFS for later reuse by the last script in step 4.

Extracting entity class information from DBpedia

The second script performs steps 2 and 3 (the joins) on the DBpedia dumps. This script
also uses some pignlproc specific tools to quickly parse NT triples while
filtering out those that are not interesting:

-- Load wikipedia, instance types and redirects from DBpedia dumps
wikipedia_links = LOAD '$INPUT/wikipedialinks$LANG.nt'
  AS (wikiuri: chararray, dburi: chararray);
wikipedia_links2 = FILTER wikipedia_links BY wikiuri IS NOT NULL;
-- Load DBpedia type data and filter out the overly generic owl:Thing type
instance_types =
  LOAD '$INPUT/instance_types_en.nt'
  AS (dburi: chararray, type: chararray);
instance_types_no_thing = FILTER instance_types
  BY type NEQ 'http://www.w3.org/2002/07/owl#Thing';
joined = JOIN instance_types_no_thing BY dburi, wikipedia_links2 BY dburi;
projected = FOREACH joined GENERATE wikiuri, type;
-- Ensure ordering for fast merge with sentence links
ordered = ORDER projected BY wikiuri ASC, type ASC;
STORE ordered INTO '$OUTPUT/$LANG/wikiuri_to_types';

Again we store the intermediate results on HDFS for later reuse by other scripts.
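
For readers unfamiliar with the NT (N-Triples) format, here is a small
Python sketch of the kind of parsing and filtering involved. It is not the
pignlproc loader: the regular expression only accepts triples whose three
terms are URIs, and the function name load_types and the sample lines are
assumptions made for illustration.

```python
import re

# A triple whose subject, predicate and object are all URIs:
# <subject> <predicate> <object> .
URI_TRIPLE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_THING = "http://www.w3.org/2002/07/owl#Thing"


def load_types(lines):
    """Yield (dburi, type) pairs, skipping other predicates and owl:Thing."""
    for line in lines:
        match = URI_TRIPLE.match(line)
        if match is None:
            continue  # comment, blank line or literal-valued triple
        subject, predicate, obj = match.groups()
        if predicate == RDF_TYPE and obj != OWL_THING:
            yield subject, obj


sample = [
    "# made-up sample lines in N-Triples syntax",
    "<http://dbpedia.org/resource/Paris> <%s> <%s> ." % (RDF_TYPE, OWL_THING),
    "<http://dbpedia.org/resource/Paris> <%s> "
    "<http://dbpedia.org/ontology/Place> ." % RDF_TYPE,
]
print(list(load_types(sample)))
```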

Merging and converting to the OpenNLP annotation format

Finally, the last script takes as input the previously generated files and an additional mapping from DBpedia class names to their OpenNLP counterparts. The OpenNLP type names used here are, for instance: person, location, organization, album, movie, book, software and drug.

The Pig script to do the final joins and conversion to the OpenNLP output format is the
following. Here again pignlproc provides some UDFs for converting the
Pig tuple and bag representation to the serialized format accepted by OpenNLP:

SET default_parallel 40
-- use the english tokenizer for other European languages as well
DEFINE opennlp_merge pignlproc.evaluation.MergeAsOpenNLPAnnotatedText('en');
sentences = LOAD '$INPUT/$LANG/sentences_with_links'
  AS (title: chararray, sentenceOrder: int, linkTarget: chararray,
      linkBegin: int, linkEnd: int, sentence: chararray);
wikiuri_types = LOAD '$INPUT/$LANG/wikiuri_to_types'
  AS (wikiuri: chararray, typeuri: chararray);
-- load the type mapping from DBpedia type URI to OpenNLP type name
type_names = LOAD '$TYPE_NAMES' AS (typeuri: chararray, typename: chararray);
-- Perform successive joins to find the OpenNLP typename of the linkTarget
joined = JOIN wikiuri_types BY typeuri, type_names BY typeuri USING 'replicated';
joined_projected = FOREACH joined GENERATE wikiuri, typename;
joined2 = JOIN joined_projected BY wikiuri, sentences BY linkTarget;
result = FOREACH joined2
  GENERATE title, sentenceOrder, typename, linkBegin, linkEnd, sentence;
-- Reorder and group by article title and sentence order
ordered = ORDER result BY title ASC, sentenceOrder ASC;
grouped = GROUP ordered BY (title, sentenceOrder);
-- Convert to the OpenNLP training format
opennlp_corpus =
  FOREACH grouped
  GENERATE opennlp_merge(
    ordered.sentence, ordered.linkBegin, ordered.linkEnd, ordered.typename);
STORE opennlp_corpus INTO '$OUTPUT/$LANG/opennlp';
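
The final grouping and merging step can be pictured with the following
Python sketch. It is only an approximation of what the merge UDF has to do:
it groups the joined rows by (title, sentenceOrder) and inserts the tags at
the link offsets, but it uses a naive regular expression tokenizer where the
script above uses the OpenNLP one, and the names tokenize and
merge_annotations as well as the sample rows are made up.

```python
import re
from itertools import groupby

# Naive tokenizer: runs of word characters, or single punctuation marks.
TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    return TOKEN.findall(text)


def merge_annotations(sentence, annotations):
    """annotations: iterable of (begin, end, typename) character spans."""
    out, cursor = [], 0
    for begin, end, typename in sorted(annotations):
        if begin < cursor:
            continue  # skip spans overlapping an already emitted one
        out.extend(tokenize(sentence[cursor:begin]))
        out.append("<START:%s>" % typename)
        out.extend(tokenize(sentence[begin:end]))
        out.append("<END>")
        cursor = end
    out.extend(tokenize(sentence[cursor:]))
    return " ".join(out)


rows = [
    # (title, sentenceOrder, typename, linkBegin, linkEnd, sentence)
    ("Paris", 0, "location", 24, 30, "Paris is the capital of France."),
    ("Paris", 0, "location", 0, 5, "Paris is the capital of France."),
]
rows.sort(key=lambda r: (r[0], r[1]))
corpus = []
for _, group in groupby(rows, key=lambda r: (r[0], r[1])):
    group = list(group)
    spans = [(r[3], r[4], r[2]) for r in group]
    corpus.append(merge_annotations(group[0][5], spans))
print(corpus[0])
```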

Depending on the size of the corpus and the number of nodes you are using, each
individual job will run for anywhere from a couple of minutes to a couple of hours. For instance, the
first steps for parsing 3GB of Wikipedia XML chunks on 30 small EC2 instances will typically take between 5 and 10 minutes.

Some preliminary results

Here is a sample of the output on the French Wikipedia dump for location detection only:

You can replace "location" by "person" or "organization" in the
previous URL for more examples. You can also replace "part-r-00000" by
"part-r-000XX" to download larger chunks of the corpus. You can also replace "fr" by "en"
to get English sentences.

By concatenating chunks of each corpus into files of ~100k lines one can get reasonably
sized input files for the OpenNLP command line tool:

$ opennlp TokenNameFinderTrainer -lang fr -encoding utf-8 \
    -iterations 50 -type location -model fr-ner-location.bin \
    -data ~/data/fr/opennlp_location/train
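
One possible way to prepare such files is sketched below in Python. It
assumes the chunks are plain UTF-8 text files named part-r-* in a single
local directory, each ending with a newline; the function name build_splits
and its parameters are hypothetical. The first lines go to a training file
and the following ones to a separate file that can be kept for evaluation.

```python
import itertools
from pathlib import Path


def build_splits(chunk_dir, train_path, test_path,
                 train_lines=100_000, test_lines=100_000):
    """Concatenate part-r-* chunks into a train file, then a test file."""

    def lines():
        # Stream the chunks in name order without loading them in memory
        for chunk in sorted(Path(chunk_dir).glob("part-r-*")):
            with chunk.open(encoding="utf-8") as handle:
                yield from handle

    stream = lines()
    try:
        for path, count in ((train_path, train_lines),
                            (test_path, test_lines)):
            with open(path, "w", encoding="utf-8") as out:
                # islice resumes where the previous file stopped, so the
                # two files never share a line
                out.writelines(itertools.islice(stream, count))
    finally:
        stream.close()
```

With the chunks downloaded to a local folder, a call such as
build_splits("chunks", "train", "test") would produce the two files.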

Here are the resulting models:

It is possible to retrain those models on a larger subset of chunks by allocating more
than 2GB of heap-space to the OpenNLP CLI tool (I used version 1.5.0). To evaluate the
performance of the trained models you can run the OpenNLP evaluator on a separate part
of the corpus (commonly called the testing set):

$ opennlp TokenNameFinderEvaluator -encoding utf-8 \
    -model fr-ner-location.bin \
    -data ~/data/fr/opennlp_location/test

The corpus is quite noisy so the performance of the trained models is
not optimal (but better than nothing anyway). Here is the result of
evaluations on held out chunks of the French and English corpora (+/- 0.02):

Performance evaluation for NER on a French extraction with 100k sentences

class          precision   recall   f1-score
location       0.87        0.74     0.80
person         0.80        0.68     0.74
organization   0.80        0.65     0.72

Performance evaluation for NER on an English extraction with 100k sentences

class          precision   recall   f1-score
location       0.77        0.67     0.71
person         0.80        0.70     0.75
organization   0.79        0.64     0.70
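
As a reminder of how these three columns relate, here is a short Python
sketch. It assumes that a predicted entity counts as correct only when its
position and type match a reference annotation exactly; the function names
and the toy spans are made up. Since the table values are rounded,
recomputing f1 from the printed precision and recall can differ from the
printed f1-score by 0.01.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(gold_spans, predicted_spans):
    """Return (precision, recall, f1) using exact span matches."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Toy spans: (sentence index, first token, end token, type)
gold = [(0, 0, 2, "person"), (0, 6, 7, "software"), (1, 3, 4, "location")]
predicted = [(0, 0, 2, "person"), (1, 3, 4, "person")]
print(evaluate(gold, predicted))

# Recomputing the f1-score of the French location row from the table
print(round(f1_score(0.87, 0.74), 2))
```

In the toy data one entity is missed entirely and one has the wrong type,
so recall suffers more than precision, the same pattern as in the tables.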

The results of this first experiment are interesting, but lower than the state of the art,
especially for the recall values. The main reason is that there are many sentences in Wikipedia
that hold entities that do not carry a link.

A potential way to improve this would be to set up a sort of active learning tooling where the
trained models are reused to suggest missing annotations to a human validator
to be quickly accepted or rejected so as to improve the quality of the corpus and then the quality
of the following generation of models until the corpus reaches the quality of the fully
manually annotated one.

Future work

I hope that this first experiment will convince some of you of the power of combining tools
such as Pig, Hadoop and OpenNLP for batch text analytics. By advertising the project on
the OpenNLP users mailing list, we already got some very positive feedback. It is very likely
that pignlproc will get contributed one way or another to the OpenNLP project.

These tools are, of course, not limited to training OpenNLP models, and it will be very easy to
adapt the code of the conversion UDF to generate BIO formatted corpora to be used by other
NLP libraries such as NLTK for instance.
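
To give an idea of how small that adaptation is, here is a Python sketch
that converts one sentence from the OpenNLP format into BIO tags (B- for the
first token of an entity, I- for the following ones, O outside entities). It
assumes whitespace-separated tokens and well-formed, non-nested tags; the
function name to_bio is hypothetical and the actual change would be made in
the Java UDF.

```python
def to_bio(line):
    """Convert one OpenNLP-annotated sentence into (token, tag) pairs."""
    pairs, entity_type, inside = [], None, False
    for item in line.split():
        if item.startswith("<START:") and item.endswith(">"):
            # Entering an entity: the next token gets the B- prefix
            entity_type, inside = item[len("<START:"):-1], False
        elif item == "<END>":
            entity_type = None
        elif entity_type is None:
            pairs.append((item, "O"))
        else:
            prefix = "I" if inside else "B"
            pairs.append((item, "%s-%s" % (prefix, entity_type)))
            inside = True
    return pairs


line = ("<START:person> Olivier Grisel <END> is working on the "
        "<START:software> Stanbol <END> project .")
for token, tag in to_bio(line):
    print(token, tag)
```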

Finally, there is no reason to limit this processing to NER corpora generation. Similar UDFs and
scripts could be produced to identify text sentences that express in a natural language the
relationships that link entities and that have already been extracted in a structured
manner from the infoboxes by the DBpedia project.

Such a new corpus would be of great value for developing and evaluating the
quality of automated entity relationships and properties extraction. Such a new extractor
could potentially be based on syntactic parsers such as the one available in OpenNLP or


This work was funded by the Scribo and IKS R&D projects. We also would like to thank all the developers of the involved projects.

