I think it's time to drop a note to the outside world about what I've been
working on for a little while at Nuxeo. I am pretty confident that this
project has now reached the end of its first iteration.



This post gives a short overview of how we chose to tackle the indexing and
searching stack in a Zope and CPS architecture. I submitted an abstract to
EuroPython this year. Hopefully, I'll have the chance to give you more
technical details at the conference in July.

Motivations




CPS is based on Zope, and Zope's standard cataloging solution today is the
ZCatalog. The ZCatalog works really well up to a certain number of indexed
documents: that's a fact. ZCatalog extensions such as TextIndexNG are also
of great interest.



But, because there is a but, the main problem is that Zope is dealing with a
task it shouldn't have to deal with. As a result, the overall performance of
the Zope platform itself decreases. If you are not convinced, just try
injecting 200k documents, each with 50 fields to be indexed, into a Zope
instance (or a Plone one if you wish :)) and watch how your response time
evolves when the instance is used as much by people working and writing to
the database as by others consulting it and thus searching all the time. At
Nuxeo, we tried this on large-scale projects. It simply doesn't work well or
fast enough for serious deployments. Zope gets really slow...



Anyway, you should consider the ZCatalog for what it is: a hack on top of
the ZODB, because the ZODB provides neither a native query language nor full
indexing support.



For those reasons, we needed another solution for our customer projects.



This also follows our vision of Zope3 as an integration platform for ECM
applications, where external services can be plugged in thanks to the
flexibility of the Zope3 component architecture and the agility of the
Python language.

What is Lucene ?




Lucene is an open source project from the Apache Software Foundation,
written in Java. It is a high-performance, full-featured text search engine
library.



I suggest checking the website, which contains a lot of useful information
and documentation. I also really recommend the Lucene In Action book to
anyone interested in working with Lucene and/or in understanding more deeply
how it works and how to use it properly. Some projects, such as Nutch, are
described there as case studies, which is more than interesting for anyone
who wants to build a system on top of Lucene, since those case studies
describe the best practices.



At Nuxeo, we first integrated Lucene for a customer within the scope of the
Apogee project. (Apogee is a framework based on Eclipse RCP for ECM rich
client applications.) It was a real success, so we decided to go further and
see how we could leverage Lucene on the server side.

What is PyLucene ?




The first time we seriously considered using PyLucene was at last year's
EuroPython conference, after Andi Vajda's really great presentation of
PyLucene. Andi is the current main PyLucene developer. PyLucene is
maintained by the Open Source Applications Foundation.




PyLucene is a GCJ-compiled version of Java Lucene integrated with Python. Its goal is to allow the use of Lucene's text indexing and
searching capabilities from Python. It
is designed to be API compatible with the latest version of Java Lucene.



PyLucene is freaking fast! Even faster than the Java Lucene version,
according to the authors of the Lucene In Action book. Furthermore, it can
easily be kept in sync with the latest Java Lucene releases, since it is not
a from-scratch port but a GCJ-compiled version of Java Lucene itself.

NXLucene : standalone Lucene indexation server




NXLucene is a standalone, multi-threaded remote server handling Lucene
stores. It takes advantage of the freaking fast PyLucene Python bindings and
uses Twisted for its server implementation. It also uses some parts of the
Zope3 component architecture. NXLucene currently supports the XML-RPC
protocol. (Its roadmap includes an ICE connector for the 1.x branch.)
NXLucene can also be seen as a good example of what can be achieved by
combining the best parts of different worlds (Java Lucene, PyLucene, Zope3,
Twisted...). Bear in mind that NXLucene does not run on top of the Zope
application server. It is standalone.



NXLucene exposes an XML query language for indexing and searching
operations. Note that the native Lucene search query syntax is of course
still supported. Check the NXLucene interfaces for details.




Installing NXLucene also installs the core libraries, which third-party
Python programs can use. For instance, the query library can help you format
your NXLucene XML queries, and the testing library can be really helpful for
writing tests for Python components that need to communicate with an
NXLucene server.



It is important to note that you can query NXLucene from any language. You
only need an XML-RPC client library to do so.
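As a rough illustration of how little a client needs, here is a minimal
sketch using Python's standard xmlrpc.client module (xmlrpclib in Python 2).
The endpoint URL, the method name "search" and its parameters are
placeholders chosen for this example; they are assumptions, not the
documented NXLucene interface. The sketch only serializes and decodes the
call, so it runs without a server.

```python
import xmlrpc.client

# Hypothetical values: the endpoint URL and the method name are placeholders
# for illustration, not taken from the NXLucene documentation.
NXLUCENE_URL = "http://localhost:9180"
SEARCH_METHOD = "search"


def build_search_call(lucene_query, start=0, limit=10):
    """Serialize an XML-RPC call carrying a native Lucene query string."""
    params = (lucene_query, start, limit)
    return xmlrpc.client.dumps(params, methodname=SEARCH_METHOD)


def decode_call(payload):
    """Parse the payload back, as an XML-RPC server would on reception."""
    params, method = xmlrpc.client.loads(payload)
    return method, params


payload = build_search_call('title:"zope" AND portal_type:Document')
print(payload)               # the XML document sent over HTTP
print(decode_call(payload))  # the method name and parameters it carries

# Against a running server, the same call would go through a proxy:
#   proxy = xmlrpc.client.ServerProxy(NXLUCENE_URL)
#   results = getattr(proxy, SEARCH_METHOD)('title:"zope"', 0, 10)
```

Any language with an XML-RPC library can produce the same payload, which is
the whole point of exposing the server over this protocol.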



NXLucene is an open source project available under the LGPL, as part of the
CPS platform project.



For more information about NXLucene and its installation, check the NXLucene
website.

nuxeo.lucene : Zope 3 cataloging component




nuxeo.lucene is a cataloging component written on top of the Zope3
application server. It currently offers an XML-RPC proxy to a remote
NXLucene server. It also offers an abstraction for the cataloging strategy
of Python objects, providing the ability to specify how Python objects
should be indexed in and retrieved from a Lucene store through NXLucene.
(It is important to note that, in theory, any remote server exposing an
XML-RPC interface to a Lucene server could be used.)
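To make the idea of a cataloging strategy more concrete, here is a minimal
sketch of what declaring "how a Python object should be indexed" could look
like. FieldSpec and extract_fields are hypothetical names invented for this
example and are not the nuxeo.lucene API; only the notions of tokenized and
stored fields come from Lucene itself.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical description of one indexed field (not the real API)."""
    name: str               # field name in the Lucene store
    attribute: str          # attribute read from the Python object
    tokenized: bool = True  # analyzed as full text, or kept as a single term
    stored: bool = False    # keep the raw value so hits can return it


def extract_fields(obj, specs):
    """Build the list of field records to send to the indexing server."""
    fields = []
    for spec in specs:
        value = getattr(obj, spec.attribute, None)
        if value is None:
            continue  # nothing to index for this field
        fields.append({
            "name": spec.name,
            "value": str(value),
            "tokenized": spec.tokenized,
            "stored": spec.stored,
        })
    return fields


# Usage: one strategy, applied to any object exposing these attributes.
STRATEGY = [
    FieldSpec("Title", "title", stored=True),
    FieldSpec("uid", "uid", tokenized=False, stored=True),
    FieldSpec("SearchableText", "body"),
]
doc = SimpleNamespace(title="Lucene and Zope", uid="doc-1", body=None)
for record in extract_fields(doc, STRATEGY):
    print(record)
```

Keeping the strategy as data, separate from the objects, is what lets the
same component serve different content types.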



Currently, this component is used from CPS through Five. Its integration on
top of the Zope3 application server is not finished, since we haven't needed
nuxeo.lucene outside of CPS yet. Feel free to participate in its development
if you are interested in having nuxeo.lucene fully integrated on top of a
stock Zope3 application server.



nuxeo.lucene is an open source project available under the ZPL, as part of
the CPS platform project.


CPSLuceneCatalog : CMF Catalog replacement for CPS-3.4





CPSLuceneCatalog is a CPS-3.4.x specific product that adds the CPS-specific
business rules to nuxeo.lucene. For example, it takes care of the way
different versions of CPS documents should be indexed. CPSLuceneCatalog is a
complete substitute for the ZCatalog, which shows its limits when dealing
with millions of objects. CPSLuceneCatalog will ship with the next major
release of CPS, version 4, along with the JackRabbit JCR repository.




CPSLuceneCatalog is almost fully backward compatible with the ZCatalog query
syntax, so your code should not break if you migrate. I don't currently
support 100% compatibility, but I do support at least the subset of the
ZCatalog query syntax we have been using in CPS internals.
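To give an idea of what such compatibility involves, here is a minimal
sketch that maps a small subset of ZCatalog-style queries onto the native
Lucene query syntax. It is an illustration only: zcatalog_to_lucene is a
hypothetical helper, it handles just exact values, lists of values and
"min:max" ranges, and it is not how CPSLuceneCatalog is implemented.

```python
def _quote(value):
    """Quote a value as a Lucene phrase, escaping backslashes and quotes."""
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{text}"'


def zcatalog_to_lucene(query):
    """Translate a subset of ZCatalog keyword queries (hypothetical helper)."""
    clauses = []
    for field, value in sorted(query.items()):
        if isinstance(value, dict):
            # ZCatalog record-style query: {'query': ..., 'range': ...}
            bounds = value["query"]
            usage = value.get("range")
            if usage == "min:max":
                low, high = bounds
                clauses.append(f"{field}:[{low} TO {high}]")
                continue
            if usage is not None:
                raise NotImplementedError(f"range usage {usage!r}")
            value = bounds
        if isinstance(value, (list, tuple)):
            # Several values on one index mean "any of these" in ZCatalog.
            terms = " OR ".join(_quote(v) for v in value)
            clauses.append(f"{field}:({terms})")
        else:
            clauses.append(f"{field}:{_quote(value)}")
    # Criteria on different indexes are combined with AND, as in ZCatalog.
    return " AND ".join(clauses)


query = {
    "portal_type": ["Document", "News Item"],
    "SearchableText": "lucene",
    "modified": {"query": ["20060101", "20060630"], "range": "min:max"},
}
print(zcatalog_to_lucene(query))
```

Existing calling code keeps passing the same dictionaries, and only the
translation layer needs to know about Lucene.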



An upgrade step is already available on CPS 3.4.x instances.




CPSLuceneCatalog is an open source project available under the GPL, as part
of the CPS platform project.


Already significant results !




The result is a big win on large scale deployments :


  • Indexing and searching are much faster and more scalable than with
    ZCatalog.

  • Indexing and searching are much more powerful than with ZCatalog
    (analysis, ranking, etc.).


  • Zope's overall performance improves because Zope no longer deals with
    the indexing and searching business.




Looking for support ?


If you are looking for any technical information or help regarding these
products, please subscribe to the CPS devel mailing list.




If you are looking for commercial support, Nuxeo provides professional services
whatever your needs are.



Nuxeo currently maintains NXLucene, nuxeo.lucene and CPSLuceneCatalog, and
we always welcome third-party contributors. As a developer, if you are
interested in contributing to these projects, we will grant you access to
our svn repositories and provide you with all the information you need to
get started. Just subscribe to the CPS devel mailing list.


Thanks




A big thanks to our customers at Nuxeo for trusting us, being patient, and
always bringing bleeding-edge use cases along with their projects.



And don't forget, at Nuxeo we love
challenge and innovation !



Hope you'll enjoy those components as much as I enjoyed writing them for
our customers. Looking
forward to hearing from you.



    J.

(Post originally written by Julien Anguenot on the old Nuxeo blogs.)