IKS Workshop in Rome


Last week I had the pleasure of attending the second workshop organized
by the IKS project in Rome. The goal of this four-year project is to
develop a software stack and a set of design guidelines to help CMS
developers leverage the promises of knowledge-oriented software and Linked Data.


In the following I will give a brief overview of some of the
discussions that took place during those four days, and a summary
of the Scribo project I presented during the demo session on the
last day. A more complete coverage of the event can be found on the event
page of the IKS wiki.


Materialized semantic indexes


Rupert Westenthaler from the Salzburg Research team is working
on a very interesting prototype that lets CMS applications perform
fast, complex graph queries on a knowledge base by materializing named
graph queries into flat Lucene indexes and tracking knowledge base
changes to detect when the indexes need incremental updates.
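
To make the idea more concrete, here is a minimal sketch of such a
materialization step, assuming Apache Jena for the graph query and Lucene
for the flat index; the query and field names are illustrative, not the
actual prototype:

```java
import org.apache.jena.query.*;
import org.apache.jena.rdf.model.Model;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;

public class GraphQueryMaterializer {

    // Illustrative graph query: one flat row per (person, name, employer).
    static final String QUERY =
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/> "
      + "SELECT ?person ?name ?employer WHERE { "
      + "  ?person foaf:name ?name . "
      + "  ?person foaf:workplaceHomepage ?employer . }";

    /** Flatten the query results into one Lucene document per result row. */
    public static void materialize(Model knowledgeBase, Directory index)
            throws Exception {
        try (QueryExecution qexec =
                 QueryExecutionFactory.create(QUERY, knowledgeBase);
             IndexWriter writer = new IndexWriter(
                 index, new IndexWriterConfig(new StandardAnalyzer()))) {
            ResultSet rows = qexec.execSelect();
            while (rows.hasNext()) {
                QuerySolution row = rows.nextSolution();
                Document doc = new Document();
                doc.add(new StringField("person",
                        row.getResource("person").getURI(), Store.YES));
                doc.add(new TextField("name",
                        row.getLiteral("name").getString(), Store.YES));
                doc.add(new StringField("employer",
                        row.getResource("employer").getURI(), Store.YES));
                writer.addDocument(doc);
            }
        }
    }
}
```

Complex graph traversals are then paid once at indexing time, and the CMS
only issues cheap flat lookups at query time.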


To me this sounds a lot like the permanent MapReduce views used to
query the CouchDB document database. I really look forward to the release
of the first prototype, along with some benchmarks comparing this
approach with general-purpose, non-materialized SPARQL engines such as
Jena SDB / TDB, Sesame and Virtuoso.


Bridging CMIS and RDF/OWL


During the workshop, Gokce Laleci introduced a prototype mapper from JCR
to RDF, i.e. from content structure to explicit semantic knowledge. The
goal is to express the underlying structure (document types and
properties) specific to a given CMS content store as a standards-based,
interoperable knowledge view that can be directly aggregated by Linked
Data crawlers.
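
The heart of such a mapper could look like the following sketch, assuming
the standard JCR API and Apache Jena; the namespace and the mapping
conventions are made up for illustration:

```java
import javax.jcr.Node;
import javax.jcr.Property;
import javax.jcr.PropertyIterator;
import javax.jcr.RepositoryException;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;

public class JcrToRdfMapper {

    // Hypothetical namespace for the properties of this content store.
    static final String NS = "http://example.org/cms/schema#";

    /** Expose a JCR node and its single-valued properties as RDF triples. */
    public static Model map(Node node, String baseUri)
            throws RepositoryException {
        Model model = ModelFactory.createDefaultModel();
        Resource subject = model.createResource(baseUri + node.getPath());
        // The node type becomes an rdf:type assertion.
        subject.addProperty(RDF.type,
                model.createResource(NS + node.getPrimaryNodeType().getName()));
        for (PropertyIterator it = node.getProperties(); it.hasNext(); ) {
            Property p = it.nextProperty();
            if (!p.isMultiple()) {
                subject.addProperty(
                        model.createProperty(NS, p.getName()), p.getString());
            }
        }
        return model;
    }
}
```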


She and her team will now work on a similar mapper for the CMIS protocol,
using Nuxeo DM and Apache Chemistry as the primary integration platforms.
Another interesting lead would be to translate SPARQL queries into CMIS QL
when the mapping makes sense, and hence allow any CMIS content repository
to behave as a SPARQL endpoint.
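
A toy example of what such a translation could look like for a single
property constraint; the predicate-to-property mapping and the rewriting
rule are hypothetical, just to show the shape of the idea:

```java
import java.util.Map;

/** Hypothetical sketch of a one-pattern SPARQL to CMIS QL rewriter. */
public class SparqlToCmisQl {

    // Illustrative mapping from RDF predicates to CMIS property ids.
    static final Map<String, String> PREDICATE_TO_PROPERTY = Map.of(
        "http://purl.org/dc/terms/title", "cmis:name",
        "http://purl.org/dc/terms/creator", "cmis:createdBy");

    /** Rewrite a single (predicate, literal) constraint as a CMIS QL query. */
    static String rewrite(String predicateUri, String literal) {
        String property = PREDICATE_TO_PROPERTY.get(predicateUri);
        if (property == null) {
            throw new IllegalArgumentException(
                "No CMIS mapping for " + predicateUri);
        }
        return "SELECT * FROM cmis:document WHERE " + property
            + " = '" + literal.replace("'", "''") + "'";
    }

    public static void main(String[] args) {
        System.out.println(
            rewrite("http://purl.org/dc/terms/creator", "John Doe"));
        // -> SELECT * FROM cmis:document WHERE cmis:createdBy = 'John Doe'
    }
}
```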


In the long term, I am not sure whether we want to keep the content
and the knowledge in separate stores as we currently do in Nuxeo (Nuxeo
Core and Jena). It might be simpler and more efficient to combine them
both in the Core and use such configurable knowledge mappers along with
materialized graph queries to implement the semantic features of Nuxeo.


Ontology-free semantic indexing


Stephane Gamard (@sgamard)
introduced the services offered by the SalsaDev platform. Their startup
focuses on an algorithm able to semantically index any text document
(blog posts, web page snippets, Wikipedia articles, ...) and to look up
semantically related documents in all the indexed content, without
relying on explicit ontologies or topic classification. Their approach
offers the same advantages as Latent Semantic Analysis but also scales
to very large document collections, whereas LSA suffers from quadratic
lookup times that make it unusable in practice.


This approach is very similar to a semantic hashing prototype
I have been working on during my idle weekends for quite
some time now. The short-term goal is to implement an image search
by similarity feature for the future Nuxeo Digital Asset Management
product. In the longer term the same algorithm should be adapted to also
work for text document similarity search.
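
For the curious, the general idea behind this family of techniques can be
sketched with signed random projections, one common way of hashing
document vectors so that semantically close documents end up with similar
bit signatures. This is only my rough illustration, not SalsaDev's
algorithm nor my actual prototype:

```java
import java.util.Random;

/**
 * Signed random projections: hash a term-frequency vector of dimension
 * DIM into BITS bits so that vectors with a small angle between them
 * tend to agree on most bits.
 */
public class RandomProjectionHasher {

    static final int DIM = 10_000;  // vocabulary size (illustrative)
    static final int BITS = 64;     // signature length

    private final float[][] planes = new float[BITS][DIM];

    public RandomProjectionHasher(long seed) {
        Random rng = new Random(seed);
        for (float[] plane : planes) {
            for (int i = 0; i < DIM; i++) {
                plane[i] = (float) rng.nextGaussian();
            }
        }
    }

    /** One bit per random hyperplane: which side does the vector fall on? */
    public long hash(float[] docVector) {
        long signature = 0L;
        for (int b = 0; b < BITS; b++) {
            double dot = 0.0;
            for (int i = 0; i < DIM; i++) {
                dot += planes[b][i] * docVector[i];
            }
            if (dot >= 0) signature |= 1L << b;
        }
        return signature;
    }

    /** Similar documents have a small Hamming distance between signatures. */
    public static int distance(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
```

Lookup then reduces to finding signatures within a small Hamming radius,
which stays tractable even on very large collections.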


Those purely data-driven approaches are interesting for at least two
reasons:




  • they allow for a natural implementation of the unstructured
    "query by example" paradigm,


  • they can be combined with more structured semantic extractions to
    perform disambiguation in a named entity recognition component,
    for instance.


Using UIMA for economic intelligence


Tommaso Teofili (@tommasoteofili)
from the Apache UIMA team
demoed a real application of semantic knowledge extraction: monitoring the
temporal evolution of real estate prices in the Rome area. The price data,
categorized by surface area and number of rooms, is automatically
extracted from the raw unstructured content of public ad web pages
and aggregated into a relational database that feeds a charting and
reporting user interface.
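
As an illustration of the mechanics, a trivial UIMA annotator for such
price mentions could look like the sketch below; the real demo relies on
AlchemyAPI rather than a regex, so this is only a stand-in:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

/** Marks price mentions such as "EUR 250.000" in the document text. */
public class PriceAnnotator extends JCasAnnotator_ImplBase {

    private static final Pattern PRICE =
        Pattern.compile("(EUR|€)\\s?\\d{1,3}(\\.\\d{3})*");

    @Override
    public void process(JCas jcas) throws AnalysisEngineProcessException {
        Matcher m = PRICE.matcher(jcas.getDocumentText());
        while (m.find()) {
            // A real chain would define a dedicated Price annotation type;
            // the generic Annotation type keeps this sketch self-contained.
            new Annotation(jcas, m.start(), m.end()).addToIndexes();
        }
    }
}
```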


The data extraction magic is performed by a UIMA chain
that wraps the online semantic engines provided by the AlchemyAPI web
service. Such semantic lifting services are typically what Nuxeo aims
to provide as part of the platform, without relying on third-party
service providers.


Incidentally, Peter Mika from Yahoo! Research is working on a similar
prototype to find his next flat in Barcelona.


Nuxeo and automated semantic knowledge extraction


As part of the demo session, I chose to present some of the ongoing
work done by Nuxeo and its partners in the Scribo project. One of the
goals of this project is to extract occurrences of entities (such as
persons, organizations and places) and semantic assertions between those
entities ("Person A" is the CEO of "Company B", or "Person B" has declared
that "he will reform the health care system"). To that end, we chose
to package annotators as chained UIMA Analysis Engines and to store the
extracted semantic annotations as RDF assertions using the classes of
the DBpedia ontology. Here are the slides introducing the context of
the demo:


[Slides: Nuxeo IKS 2009-11-13, by ogrisel (slideshare doc=nuxeo-iks-2009-11-13-091112122852-phpapp02)]

The demo itself is twofold. The first part features the Scribo
Workbench, mainly developed by XWiki, used to configure and test a chain
of UIMA annotators that extract semantic knowledge from the text content
of documents coming from heterogeneous content repositories such as
a filesystem folder, a CMIS repository (Nuxeo DM) or an XWiki server
accessed through its RESTful API.


The user can then combine such a document source with one or several
registered annotators into a UIMA chain (a.k.a. a Collection Processing
Engine), run the process and view the results as annotated text documents
directly in the Eclipse UI. The user can also validate or invalidate the
extracted annotations and hence incrementally build a validated knowledge
base of semantic statements out of their unstructured content. The following
screencast shows the details of this scenario using the Stanford Named
Entity Recognition annotator on two Wikinews articles:



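For readers who prefer code to screencasts, this is roughly what driving
such a chain from plain Java looks like with the standard UIMA API; the
descriptor file name is a placeholder:

```java
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.FSIterator;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.XMLInputSource;

public class RunChain {
    public static void main(String[] args) throws Exception {
        // Load the (possibly aggregate) analysis engine descriptor;
        // "NerChain.xml" is a placeholder file name for this sketch.
        XMLInputSource in = new XMLInputSource("NerChain.xml");
        ResourceSpecifier spec =
            UIMAFramework.getXMLParser().parseResourceSpecifier(in);
        AnalysisEngine engine = UIMAFramework.produceAnalysisEngine(spec);

        // Analyze one document held in a fresh CAS.
        JCas jcas = engine.newJCas();
        jcas.setDocumentText("Barack Obama visited Rome last week.");
        engine.process(jcas);

        // Dump every annotation the chain produced, with its covered text.
        FSIterator<Annotation> it = jcas.getAnnotationIndex().iterator();
        while (it.hasNext()) {
            Annotation a = it.next();
            System.out.println(a.getType().getShortName()
                    + ": " + a.getCoveredText());
        }
        engine.destroy();
    }
}
```

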
The second part of the demo showcases the deployment of the previous
UIMA chain directly inside a Nuxeo DM 5.3 instance. PDF documents are
semantically annotated at import time thanks to an asynchronous
event listener that calls a new UIMARunnerService packaged as
an OSGi component deployed by the Nuxeo Runtime.
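
The wiring could look roughly like the sketch below; the exact Nuxeo
listener API varies between versions, and the UIMARunnerService interface
shown here is a hypothetical stand-in for the real service:

```java
import org.nuxeo.ecm.core.api.DocumentModel;
import org.nuxeo.ecm.core.event.Event;
import org.nuxeo.ecm.core.event.EventListener;
import org.nuxeo.ecm.core.event.impl.DocumentEventContext;
import org.nuxeo.runtime.api.Framework;

/** Hypothetical service interface, standing in for the real one. */
interface UIMARunnerService {
    void runOn(DocumentModel doc);
}

/** Sketch of the asynchronous listener side of the wiring. */
public class SemanticExtractionListener implements EventListener {

    @Override
    public void handleEvent(Event event) {
        if (!(event.getContext() instanceof DocumentEventContext)) {
            return; // only interested in document events
        }
        DocumentModel doc =
            ((DocumentEventContext) event.getContext()).getSourceDocument();
        // Hand the freshly imported document over to the UIMA chain.
        Framework.getService(UIMARunnerService.class).runOn(doc);
    }
}
```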


The extracted named entities are stored in the default Nuxeo Jena
store. Some work is still needed to make the annotations show up correctly
in the "preview tab" and make it possible to validate / invalidate extractions
from the "knowledge base" tab.



Remember, this is just the beginning: we plan to
support all languages significantly represented in Wikipedia
along with finer-grained entity classes. You can also get an
overview of the global semantic R&D effort at Nuxeo on our Jira.


Last but not least, the showcased demo is deployable on your own Nuxeo
DM 5.3 instance by deploying a simple plugin, as explained on the UIMA page
of the Nuxeo wiki. Beware that this is really alpha-quality
work and should not be deployed on a production setup.