Last week I had the opportunity to attend the second edition of the Berlin Buzzwords conference and then to participate in the Semantic / NLP hackathon hosted by Neofonie. Here is my personal executive digest. For a more comprehensive overview, most of the slides are online and the videos should follow at some point.
Part 1 - the conference
Berlin Buzzwords is developer conference with a focus on scalable data processing, storage and search, mostly using Apache projects such as Hadoop, HBase, Solr & Lucene, Mahout and related projects. This second edition attracted more than 400 developers from all over the world including about a third who are Apache committers.
Hadoop MapReduce is no silver bullet
The idea that appealed to me the most across talks is that the MapReduce model is far from being an optimal way to do large scale distributed analytics. Both the model in general and the Hadoop implementation in particular exhibit limitations. The following summarize three recurrent data analytics situations where Hadoop MapReduce is almost useless:
1- In his keynote Ted Dunning highlighted the numerous file-system serialization barriers of a single MapReduce job. This design feature makes Hadoop MapReduce unable to scale down to small to medium problems that would benefit from data-local, distributed computing with just a couple of nodes. This use case is however the long tail of the real life business use cases: most companies do not have the typical 10k+ nodes of Google, Facebook and Yahoo! data-centers. There seemed to be an agreement that Hadoop MapReduce is currently almost always wasteful on clusters with less than 10 to 20 nodes. It was just not designed to address this use case but that will probably change in the future (see below).
2- On the opposite end of the scalability ladder (etymologic pun intended), MapReduce is no good fit for distributed real-time analytics as the batch processing model imposes high latency (often several hours).
Facebook hence introduced a new distributed flow processing architecture based on Scribe / PTail / HBase. The input of the flow are the various signals coming from user interactions with the front facing web applications (navigation logs, comments, posts, like button clicks, search queries...). The output are UI updates to adapt to the recent user interactions (personalized ads?). This new system dubbed Puma has now totally replaced a batch Hadoop MapReduce flow and allowed Facebook to reduce the overall latency from 8 to 24 hours to a couple of seconds or minutes before a sequence of user interactions are being reflected into UI changes (IIRC).
This architecture can probably be compared to Yahoo! S4, Backtype's Storm and to a lesser extent to LinkedIn's Kafka (the latter being more a distributed message service that does the Scribe part rather than a full-fledged distributed flow processing system).
3- Many Machine Learning algorithms are not well addressed by MapReduce either:
- the data model is usually small enough to fit in memory (hence writing it often to disk is counter productive),
- many workers want to get regular udpates on the model parameters (both read and write),
- finally the training set is usually scanned several times (instead of just a single pass).
The canonical example is Logistic Regression: the (excellent) Mahout implementation of this very scalable algorithm is purely sequential: Hadoop is not used at all. Other distributed data processing frameworks can still manage to get some speed up by leveraging the distributed power of a cluster. A new breed of distributed / large scale machine learning frameworks is thus recently emerging outside of the MapReduce paradigm:
- GraphLab: a new graph-based model and runtime developed at Carnegie Mellon,
- Spark: a new collection-centric model and runtime implemented in Scala at Berkeley,
- Stratosphere: a new model and runtime implementation developed at TU Berlin,
- A new Hadoop-based framework implemented in Scala under development at Yahoo!Labs.
I found about the latter incidentally by glimpsing over Markus Weimer's shoulder while he was working on a new slide-deck for another conference. We later had the opportunity to discuss this project in more detail during the informal Mahout meetup that took place after day 2: in summary this framework aims at leveraging the Hadoop infrastructure for data-locality aware storage and job scheduling while side stepping all the limitations imposed by the MapReduce model. Markus gave a talk (video & slides) at the last NIPS workshop on "Learning on Cores, Clusters and Clouds" featuring an overview of the project along with early results. Markus told me that the results of this ongoing development should be open sourced when ready.
The future is not that gloomy for Hadoop however: the aforementioned issues are mostly addressed by an undergoing refactoring of Hadoop MapReduce tackled by the Hadoop team at Yahoo (again!). The new architecture called YARN (Yet Another Resource Negotiator or YARN Application Resource Negotiator) is described in more detail in [MAPREDUCE-279].
In short, the main goal is to separate the job scheduling logic from the MapReduce algorithmic model. This should make it possible reuse Hadoop as a kind of "Cluster OS" that handles resource allocation and scheduling while leaving the "user-land" applications to implement their own distribution model, specifically tailored to the use case at hand (scaling down to in-memory datasets, low latency real time analytics and highly iterative machine learning processes). MapReduce would thus be one of the possible "user-land" application pattern among others.
Some random data hacks
Here are a couple of other useful tricks I learned during the event:
- The distributed mode of Solr does not share an IDF estimate across nodes and in practice this affects badly the scoring and thus ranking accuracy for evenly, randomly sharded nodes. This issue was known for quite some time but Andrzej Bialecki took care to run quantified comparative analysis to measure it's impact on ranking using various Information Retrieval evaluation metrics: the answer is that it hurts significantly.
- It is possible to store a compressed representation of the output Ted's Log-Likelihood Ratios test for collocation extraction as a bloom filter. This in memory data-structure can then be embedded in a really efficient collocation whitelist filter for Lucene [MAHOUT-415] that improves the indexing of interesting n-grams (a.k.a. Shingles) such as "New York". Frank Scholten is currently working on a pref-configured MR job tool to make it easy to deploy this as a reusable text-preprocessing step (e.g. for clustering text documents).
All in all Berlin Buzzwords 2011 was a great conference. It is a unique opportunity to have a fresh Hefeweizen in a funny looking crashed space station while casually chatting with people shaping the future of Big Data such as Doug Cutting, Chris Wensel, Ted Dunning and Tin-foil Chap. Basically people that I had previously interacted with only through twitter or mailing lists (careful though, I am pretty sure the last chap is a bot).
Part 2 - the Semantic / NLP Hackathon
The following 2 days Neofonie hosted a hackathon for people willing to collaborate on tools related to Named Entity corpus annotations, Named Entity disambiguation, active learning and related topics.
During the mornings we ran a couple of presentations and demos of:
- pignlproc, a set of scripts to extract Natural Language Processing training corpora out of Wikipedia and DBpedia using Apache Pig,
- Stanbol, the open source semantic engine previously introduced in this blog post,
- Alexandria, a very cool project by Neofonie with an impressive question answering component that uses Freebase as the main knowledge base.
During the afternoons we worked in small 2-3 people groups on various related topics.
Hannes Korte introduced a HTML / JS based tool named Walter to visualize and
edit named entities and (optionally typed relations between those
Currently Walter walks with UIMA / XMI formatted files as input /
output using a java servlet deployed on a tomcat server, for instance.
The plan is to adapt it to a corpus annotation validation / refinement
pattern: feed it with a partially annotated corpus coming from the
output of a OpenNLP pre-trained on the annotations extracted from
Wikipedia using pignlproc to bootstrap multilingual models.
We would like to make a fast 'binary' interface with keyboard shortcuts
to focus one sentence at a time. If the user thinks that all the
entities in the sentence are correctly annotated by the model, he/she
presses "space" and the sentence is marked validated and the focus moves
to the next sentence. If the sentence is complete gibberish he/she can
discard the sample by pressing "d". The user could spend more time to
fix individual annotations using the mouse interface before validating
the corrected sample.
When the focus is on a sample, the previous and next samples should be
displayed before and after with a lower opacity level in read-only
mode so as to provide the user with contextual information to make the
right decision on the active sample.
At the end of the session, the user can export all the validated
samples as a new corpus formatted using the OpenNLP format.
Unprocessed or explicitly discarded samples are not part of this
refined version of the annotated corpus.
To implement this we plan to rewrite the server side part of Walter in
- a set of JAX-RS resources to convert corpus items + their
annotations as JSON objects on the client to / from OpenNLP NameSamples
on the server. The first embryo for this part is on github.
- a POJO lib that uses OpenNLP to handle corpus loading and iterative
validation with validation, discarding, update and navigation. It also handles the serialization of the validated samples to a new
OpenNLP formatted file that can be fed to train a new generation of
the model. Work on this part was started in another github folder. I
would like to keep this in a separate maven artifact to be able to
build a simple alternative CLI variant of the refiner interface that
does not require to start a jetty or tomcat instance / browser.
For the client side, Hannes also started to integrate JQuery UI to make
it easier to implement the ajax callbacks based on mouse + keyboard
I really hope that this corpus refinement tools will help with the OpenNLP annotations effort recently bootstrapped by Jörn Kottmann.
Another group did some experiments with Dualist, an Active Learning user interface to quickly train machine learning models: the tool looks really nice for the text
classification case (check the demo video), less so for the NE detection case (the sample view is not very well suited for structured output and it requires to pre-process the corpus to extract interesting features first. Dualist does not do it automatically apparently.
The last group worked on named entity disambiguation using Solr's MoreLikeThisHandler and indexes of context occurrences of those
entities occurring in Wikipedia article. This work is currently being integrated into the Stanbol codebase.
Upcoming Stanbol-related events
- We will be co-organizing the next IKS workshop in Paris, early July (registration is still open)
- I will also present Stanbol in a talk dubbed Bridging traditional Open Source Content Management and the Web of Data with the Apache Stanbol Semantic Engine at ApacheCon 2011, Nov 7-11 in Vancouver, BC.
- Also expect an update on all of this during Nuxeo World 2011, Oct 20-21 in Paris.
See you there!
Special thanks go to the Berlin Buzzwords organizers for putting together such a great event and to the Neofonie crew for kindly hosting us during those two days (you have the best automatic coffee machine I have ever seen).