Starting with the Nuxeo Platform 6.0, we’ve integrated MongoDB as a NoSQL alternative to using a relational database such as PostgreSQL.
In recent posts, we referenced our participation in MongoDB Days conferences around the world, our perspective on the use of MongoDB and how dynamic facets of the Nuxeo Platform can provide Metadata Agility with MongoDB.
In this article, we’ll dive deeper into some of the specifics of using MongoDB for storage of metadata with the Nuxeo Platform.
The Nuxeo Platform, being a pluggable system, has a few layers of abstraction. The Document Store represents the storage of metadata and hierarchy relationships, and the generic operations of writing, reading and updating this data. It represents these operations regardless of how the data is actually stored.
Getting more specific, the Visible Content Store (VCS) represents a multi-table and, in practice, SQL-based storage system. It’s an abstraction of SQL-based storage that requires joins across tables. At the lowest level is the adapter for a particular SQL database, which issues the actual SQL calls and handles the particularities of each DB system.
Since MongoDB is a document-based storage system, we have created a Document Based Store (DBS) abstraction as an alternative to VCS, along with a MongoDB adapter as its connector implementation. DBS assumes the storage implementation keeps a single record per document, and it includes a Transient State manager to handle multi-document transactions.
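To make the layering concrete, here is a minimal sketch in Python. It is not Nuxeo code: the class names (DocumentStore, MultiTableStore, SingleRecordStore) and the save_and_reload helper are hypothetical, and plain dictionaries stand in for real databases. It only illustrates how application code can target one generic contract while the backend stores a document either as per-schema fragments or as a single record.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional

Schemas = Dict[str, dict]  # schema name -> field values (illustrative shape)


class DocumentStore(ABC):
    """Hypothetical generic contract: write and read a document by id."""

    @abstractmethod
    def write(self, doc_id: str, schemas: Schemas) -> None: ...

    @abstractmethod
    def read(self, doc_id: str) -> Optional[Schemas]: ...


class MultiTableStore(DocumentStore):
    """VCS-like layout: one 'table' per schema, reassembled on read."""

    def __init__(self) -> None:
        self.tables: Dict[str, Dict[str, dict]] = {}

    def write(self, doc_id: str, schemas: Schemas) -> None:
        for schema, fields in schemas.items():
            self.tables.setdefault(schema, {})[doc_id] = dict(fields)

    def read(self, doc_id: str) -> Optional[Schemas]:
        # Visit every table, which plays the role of the SQL joins.
        found = {s: dict(rows[doc_id])
                 for s, rows in self.tables.items() if doc_id in rows}
        return found or None


class SingleRecordStore(DocumentStore):
    """DBS-like layout: the whole document lives in one record."""

    def __init__(self) -> None:
        self.records: Dict[str, Schemas] = {}

    def write(self, doc_id: str, schemas: Schemas) -> None:
        self.records[doc_id] = {s: dict(f) for s, f in schemas.items()}

    def read(self, doc_id: str) -> Optional[Schemas]:
        rec = self.records.get(doc_id)
        return {s: dict(f) for s, f in rec.items()} if rec is not None else None


def save_and_reload(store: DocumentStore, doc_id: str, schemas: Schemas):
    """Application code only sees the DocumentStore contract."""
    store.write(doc_id, schemas)
    return store.read(doc_id)
```

Calling save_and_reload with either store returns the same document, which is the point of the abstraction: the caller does not know which layout sits underneath.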
Relational vs Document Based Storage
Document-based storage addresses three limitations of SQL-based storage of documents: impedance mismatch, scalability, and concurrency.
O/R Impedance Mismatch
Without getting too deep into the theory, in practice it comes down to too many tables, which limits performance. A Nuxeo document is a set of fields from multiple schemas, and those fields need to be mapped to a set of SQL tables. For large document objects with multiple schemas, and therefore many joins, queries that need the full object become unwieldy. This requires caching and lazy loading in the application. With document-based storage, a document is a JSON object held in a single record in the DB. The storage is designed to allow large objects in a single record, so we can efficiently load many large objects without resorting to those workarounds. One Nuxeo document maps to one MongoDB document.
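The difference can be sketched with Python’s standard library. In this illustration the table names, column names and the JSON layout are all hypothetical and do not reflect the actual VCS or DBS schema; SQLite stands in for the relational database and a dictionary of JSON strings stands in for MongoDB.

```python
import json
import sqlite3

# Relational layout: one table per schema, so loading a full document
# needs one join per extra schema. (Illustrative tables, not the VCS schema.)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hierarchy  (id TEXT PRIMARY KEY, name TEXT, primarytype TEXT);
    CREATE TABLE dublincore (id TEXT PRIMARY KEY, title TEXT, creator TEXT);
    CREATE TABLE note       (id TEXT PRIMARY KEY, body TEXT);
    INSERT INTO hierarchy  VALUES ('doc-1', 'spec', 'Note');
    INSERT INTO dublincore VALUES ('doc-1', 'Spec', 'alice');
    INSERT INTO note       VALUES ('doc-1', 'Hello');
""")

KEYS = ("id", "name", "primarytype", "title", "creator", "body")


def load_relational(doc_id):
    """Rebuild the full document by joining every schema table."""
    row = conn.execute(
        """
        SELECT h.id, h.name, h.primarytype, d.title, d.creator, n.body
        FROM hierarchy h
        JOIN dublincore d ON d.id = h.id
        JOIN note n ON n.id = h.id
        WHERE h.id = ?
        """,
        (doc_id,),
    ).fetchone()
    return dict(zip(KEYS, row)) if row is not None else None


# Document layout: the same data held as one JSON record per document.
records = {
    "doc-1": json.dumps({
        "id": "doc-1", "name": "spec", "primarytype": "Note",
        "title": "Spec", "creator": "alice", "body": "Hello",
    })
}


def load_document_based(doc_id):
    """A single lookup returns the whole document."""
    raw = records.get(doc_id)
    return json.loads(raw) if raw is not None else None
```

Both functions return the same dictionary for 'doc-1'. The relational version grows by one join for every additional schema, while the document version stays a single lookup.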
Scalability
It’s well known that scaling out SQL can be complex, since it is not a natively distributed architecture. Most document-based storage systems have been designed to be clustered and scaled out. While MongoDB still requires writes to go to a single primary instance, it’s possible to take advantage of Replica Sets for reads. Sharding is also an option and is not difficult to set up.
Concurrency
Every database system is a compromise between availability and consistency, and each implementation can be tuned further through configuration. MVCC (multiversion concurrency control) determines the degree of locking a multi-user database applies to a transaction in order to reach a particular level of consistency and availability. By default, most SQL databases give up a small amount of consistency for acceptable performance. When a document is spread across multiple tables that must be locked to stay consistent, some availability and performance are sacrificed. With DBS, a CRUD operation on a Nuxeo document is a single operation, so transactions and locking are unnecessary and performance improves considerably.
Still Needing Transactions
There will be cases where we need to run operations on multiple documents transactionally, such as when running a workflow that updates node data as well as documents. Nuxeo Automation Chains are also expected to be transactional. The Nuxeo Platform implements transaction control at the application level by tracking a transient state for documents, making changes to objects in memory, and flushing to the DB only when another operation in the transaction expects those changes to be present. Additionally, we can implement rollbacks using an Undo Log of the application-level transaction. We need actual transactions only in some cases, so most of the time we reap the performance benefits of the non-transactional nature of MongoDB.
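The idea of a transient state combined with an undo log can be sketched as follows. This is a simplified illustration, not the platform’s implementation: the TransientSession class and its methods are hypothetical, and a plain dictionary stands in for the MongoDB collection.

```python
import copy


class TransientSession:
    """Hypothetical application-level transaction over a dict-backed store."""

    def __init__(self, db):
        self.db = db          # doc_id -> document dict (stand-in for the DB)
        self.transient = {}   # changes made in memory, not yet written
        self.undo_log = []    # (doc_id, previous record or None) per flushed write

    def get(self, doc_id):
        if doc_id in self.transient:
            return self.transient[doc_id]
        stored = self.db.get(doc_id)
        return copy.deepcopy(stored) if stored is not None else None

    def update(self, doc_id, **fields):
        doc = self.get(doc_id) or {}
        doc.update(fields)
        self.transient[doc_id] = doc  # stays in memory until a flush

    def flush(self):
        # Write pending changes, remembering what each write replaced.
        for doc_id, doc in self.transient.items():
            self.undo_log.append((doc_id, copy.deepcopy(self.db.get(doc_id))))
            self.db[doc_id] = copy.deepcopy(doc)
        self.transient.clear()

    def query(self, predicate):
        # A query expects pending changes to be visible, so flush first.
        self.flush()
        return sorted(d for d, doc in self.db.items() if predicate(doc))

    def commit(self):
        self.flush()
        self.undo_log.clear()

    def rollback(self):
        # Drop unflushed changes, then undo flushed writes in reverse order.
        self.transient.clear()
        for doc_id, previous in reversed(self.undo_log):
            if previous is None:
                self.db.pop(doc_id, None)
            else:
                self.db[doc_id] = previous
        self.undo_log.clear()
```

In this sketch an update touches nothing but memory, a query forces a flush so it can see the pending changes, and a rollback replays the undo log backwards to restore the records as they were before the transaction started.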
Configuring MongoDB with Nuxeo
The capability is included as a bundle with the core distribution, so enabling it is just a matter of changing some configuration properties. First, you need a running MongoDB instance that is accessible from the Nuxeo Platform. Then add the ‘mongodb’ name to the list of templates in the nuxeo.conf configuration file. Upon startup, the Nuxeo Platform will connect to MongoDB and create a Nuxeo DB there, along with a Repository configuration descriptor. If you are running MongoDB locally and haven’t changed the defaults, the default host and port are used.
Refer to the documentation to learn more. This is all that is required and nothing in your application has to change or be particular to MongoDB. Abstraction allows this to be transparent.
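As a small illustration of the configuration step, the helper below appends a template name to the template list in the text of a nuxeo.conf file. The enable_template function is hypothetical, and the sketch assumes the list is held in a nuxeo.templates property as comma-separated names with ‘default’ as the base template; check both against the documentation for your version before relying on them.

```python
def enable_template(conf_text, template="mongodb"):
    """Return conf_text with `template` present in the nuxeo.templates list.

    Assumes a line of the form `nuxeo.templates=name1,name2`; commented-out
    lines are ignored. If no such line exists, one is appended using
    'default' as the base template (an assumption).
    """
    lines = conf_text.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("nuxeo.templates="):
            value = stripped.split("=", 1)[1]
            names = [n.strip() for n in value.split(",") if n.strip()]
            if template not in names:
                names.append(template)
            lines[i] = "nuxeo.templates=" + ",".join(names)
            break
    else:
        # No active templates line was found, so add one.
        lines.append("nuxeo.templates=default," + template)
    return "\n".join(lines) + "\n"
```

Running the helper on text that already lists the template leaves the line unchanged, so it is safe to apply more than once.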
Use Cases for MongoDB Advantage
Thanks to these levels of abstraction, you have a choice of pluggable storage systems for the Nuxeo Platform. Many factors determine the optimal storage system. SQL will often be the best choice, but as noted, DBS and MongoDB provide certain benefits in the following cases:
Huge Repository with Heavy Loading
This applies when you have a repository with document counts approaching a billion. It can also happen when every change is versioned, which multiplies the document records in the Nuxeo Platform. Combined with write-intensive access or recursive updates, an SQL DB could be limited. MongoDB can handle the volume without concurrency or scalability issues.
As noted in another post about benchmark figures, import throughput can stay at a consistent 4000 documents/second past 30 million documents. An SQL DB would be limited, and its throughput would drop early because of the way data is written. In addition, a high number of writes per second does not limit concurrent reads the way it does in an SQL system, where concurrency issues come into play.
In situations where each Document is a collection of a vast number of fields, large Document objects are created. This is where the impedance mismatch, and the number of joins needed when data is distributed among many tables, comes into play. In such cases, SQL requires lazy loading of objects after a query, because a transaction that loads a large number of fragmented objects would be unresponsive. Once the cache of objects is filled, performance is good. With MongoDB, no lazy loading or two-level caching is required because it’s simpler to retrieve these large objects. Improvements of up to 15x are possible, as shown in our published benchmarks.
Additionally, MongoDB is easier to scale out and distribute with Replica Sets, and it can be hosted on multiple sites for architectural flexibility.
While MongoDB is great for the noted use cases, as system architects we know the perfect solution is often a compromise between competing priorities. You can’t maximize availability and consistency at the same time. Every database choice and configuration is an optimization of this trade-off. In the noted use cases, we give up some of the consistency of SQL in favor of availability, and we maximize performance for the usage profile of the system.
At Nuxeo, we strive to provide choices for the varied use cases of our customers and to maximize performance in a variety of environments. Support for MongoDB further expands the ability of the Nuxeo Platform to be the highest performing ECM for a wide range of use cases.