Today, I am very pleased to announce the completion of a new benchmarking study that’s a bit different from anything that we have done before. To begin with, benchmarking is not a new activity for Nuxeo. For those of you who are familiar with us and our Content Services Platform, you will know that we have conducted several 1 billion-object benchmark tests and that we continually benchmark our platform builds to ensure the highest level of scalability and performance in each new release. And, in keeping with our “open kitchen” approach and open-source heritage, all of this information is documented and publicly available on benchmarks.nuxeo.com (contact us to get the password).

So, what makes this benchmarking effort so different from our previous exercises?

A Fundamentally Different Approach

Well, the first and most obvious answer is size. Our ultimate target with this benchmark was to exceed 10 billion objects. But that’s only a small part of the story.

Our goal for this project was fundamentally different from a theoretical benchmarking exercise. You see, we have had several customers approach us with plans to scale their own Nuxeo content repositories – in the cloud – from 1B to 10B objects and beyond. What they were looking for from us was guidance and best practices on how to get there optimally and how to operate the Nuxeo Platform effectively at extreme scale. This reflects the changing nature of content and its growing importance and variety in today’s modern enterprises. With legacy Enterprise Content Management (ECM) solutions, we have seen customers slowly accumulate documents, scanned images, and other traditional content over long periods of time – ten, fifteen, or even twenty years. And eventually, with a lot of hard work and hand-holding, they have been able to get these systems to scale up to 8B or 9B objects. We have one customer that is well on its way to 15B objects in the next three years.

So, in order to provide real-world guidance and best practices for our customers, this could not be a theoretical exercise in a lab, and it wasn’t enough to simply simulate loading documents in a NoSQL database. First, we provisioned an actual instance of our Nuxeo Cloud service, complete with Amazon S3 object stores, MongoDB Atlas database (NoSQL, of course), and Elasticsearch indices. Second, we began creating real content – a lot of content – along with corresponding metadata. In all, we created 11.3B binaries with metadata, an effort that, in and of itself, took several days. This content was based on real-world input from our customers and was reflective of the kind of extreme high-volume use cases that we are seeing, particularly with our financial services customers. It included customer IDs (images), statements, account documents, and even customer correspondence.
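Generating billions of realistic documents is a project in itself. The sketch below shows the general shape of a synthetic document generator for this kind of exercise; the document types mirror the financial-services use cases mentioned above, but the function names, weights, and payload sizes are illustrative assumptions, not Nuxeo's actual tooling.

```python
import random
import uuid

# Document types drawn from the financial-services use cases described
# above; the mix and payload size are illustrative, not the benchmark's.
DOC_TYPES = ["customer-id-image", "statement", "account-document", "correspondence"]

def make_document(customer_id: str) -> dict:
    """Generate one synthetic document: a small binary plus metadata."""
    doc_type = random.choice(DOC_TYPES)
    return {
        "id": str(uuid.uuid4()),
        "customer_id": customer_id,
        "type": doc_type,
        # Stand-in binary payload; the real benchmark stored binaries in S3.
        "binary": random.randbytes(256),
        "title": f"{doc_type}-{customer_id}",
    }

# One small batch; the real exercise produced 11.3B of these over several days.
batch = [make_document(f"cust-{n % 1000:04d}") for n in range(10_000)]
print(len(batch))  # → 10000
```

In practice a generator like this would run in parallel across many workers, streaming batches into the ingestion pipeline rather than holding them in memory.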

Fun Fact: if you create 1,000 new documents per second, every hour of every day, it will take you roughly 116 days (nearly four months) to generate 10B objects of content.
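For the curious, the back-of-the-envelope arithmetic works out like this:

```python
# How long does it take to create 10B documents at a sustained
# rate of 1,000 documents per second?
TARGET_DOCS = 10_000_000_000    # 10 billion objects
RATE_PER_SEC = 1_000            # sustained ingestion rate
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

days = TARGET_DOCS / (RATE_PER_SEC * SECONDS_PER_DAY)
print(f"{days:.0f} days")  # → 116 days
```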

And, because our goal was to test an actual working Nuxeo application, and not a storage service, we enabled all of the expected features, including full indexing and security for every object. We also incorporated real-life testing scenarios, including mixed workloads, full reindexing, and bulk actions or updates to large sets of content.

A Two-Phase Project

Another unique aspect of our benchmark is that it was actually a two-phase project:

In the first phase of the project, we focused on scaling up a single-repository instance of our Nuxeo Cloud service. Simply put, this means one database (no sharding), one set of indices, and one object store. The goal of this first phase was to identify any bottlenecks to performance, from either an ingestion or a usability perspective. We also wanted to identify key components and metrics that needed to be actively monitored, and to determine when we needed to scale out different components of the Nuxeo Cloud architecture. We conducted performance testing along the way, at 1B, 2B, and then 3B objects. In terms of deliverables and outcomes, we have now produced reference architectures for a single-repository system on AWS running at 1B, 2B, and 3B objects, with requisite hardware, configuration, and expected performance information.
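The "when do we scale out?" question ultimately reduces to watching a handful of metrics against thresholds. A toy sketch of that decision follows; the metric names and threshold values here are illustrative assumptions, not the tuned values from the benchmark (those live in the reference architectures).

```python
# Illustrative thresholds only -- real values depend on instance sizes
# and workload, and are documented in the reference architectures.
THRESHOLDS = {
    "doc_count": 3_000_000_000,  # single-repository comfort zone
    "p95_query_ms": 500,         # search latency budget
    "storage_pct": 80,           # database disk utilization
}

def scale_out_signals(metrics: dict) -> list[str]:
    """Return the names of any monitored metrics that crossed their threshold."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0) >= limit]

# Example: 2B docs and healthy latency, but disk filling up.
print(scale_out_signals({"doc_count": 2_000_000_000,
                         "p95_query_ms": 180,
                         "storage_pct": 85}))  # → ['storage_pct']
```

In a live system these checks would be driven by an actual monitoring stack; the point is that scale-out should be triggered by observed metrics, not guesswork.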

By the way, for many years now, we have extolled the virtues of NoSQL and MongoDB. As a dramatic proof point: yes, a single instance of a NoSQL database will scale to 3B objects and beyond, without sharding. And for those who might ask whether we stopped at 3B objects for any particular reason: the answer is no. 3B is not, by any means, a hard limit for a single-repository system. We simply felt that this number was sufficient for customers running a single-repository system before they would look to scale out.

All of which brings us to phase 2. For phase 2, we utilized a multi-repository system that we believed well represented the requirements of a large, global customer. Specifically, we deployed three separate repositories:

  1. An active, geographic repository (US West)
  2. A second, active geographic repository (US East)
  3. And an archival repository

If you think about this from a global customer perspective, this is a good representation of how they might partition a system.
One, they may need to establish different regional repositories based on data privacy or data locality or even high availability requirements. Two, they may want to begin to partition active data from archival data and house this information in separate repositories and lower-cost, less-performant storage tiers.

Each repository had its own database instance (MongoDB Atlas), its own set of indices (Elasticsearch), and its own object store (Amazon S3). While we do support Amazon Glacier, and it might be an obvious cloud storage choice for archival content, we did not use a deep archiving service for this exercise. The key point here, however, is that, for global customers with data residency requirements, we can create local repositories (in any AWS, GCP, or Azure data center), complete with binary storage and corresponding metadata and search indices. No data is shared across these different repositories, and yet you can search for and access content from any repository (depending on access rights and permissions, of course) seamlessly from within the Nuxeo UI.
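At document-creation (or archival) time, a layout like this comes down to a routing decision. A minimal sketch of that decision, where the repository names and the routing rule are hypothetical and for illustration only:

```python
# Hypothetical repository names mirroring the three-repository layout
# described above: two active regional repositories plus an archive.
ACTIVE_REPOSITORIES = {"us-west", "us-east"}

def route_document(region: str, archived: bool) -> str:
    """Pick the target repository for a new or transitioning document."""
    if archived:
        # Archival content is partitioned into its own repository,
        # typically backed by a cheaper, less-performant storage tier.
        return "archive"
    if region not in ACTIVE_REPOSITORIES:
        raise ValueError(f"no active repository for region {region!r}")
    return region

print(route_document("us-west", archived=False))  # → us-west
print(route_document("us-east", archived=True))   # → archive
```

The routing rule here is deliberately simple (region plus an archived flag); a real deployment might also weigh data privacy, locality, or high-availability requirements, as noted above.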

As with phase 1, we had some specific deliverables in mind. Our target with phase 2 was to deliver a series of benchmark results illustrating the performance of Nuxeo Cloud as it scales up to 10B+ objects. We also conducted a series of real-world tests at the 11B-object mark to demonstrate the performance of the system at scale. Along with these benchmark results, we have developed a reference sharding architecture for our customers’ use, with corresponding guidance and best practices for scaling out a multi-repository system.

Along the way, we also created a valuable new tool for our ongoing benchmarking efforts. The goal of this project was ultimately not to deliver a one-and-done benchmark, but to run this benchmark continuously to ensure that the performance of the Nuxeo Platform continues to improve as the platform itself evolves. And for those who want to see a live, 10B+ object system running in the cloud, we would be happy to show it to you.

Where Do I Find the Results?

Here is a bit of a teaser: a screenshot from our ingestion dashboard which clearly indicates our achievement: 11.34B documents.

[Screenshot: ingestion dashboard showing 11.34B documents]

But there is so much more detail available. Over the course of the next week or so, Thierry Delprat, Nuxeo’s CTO, will post a series of blogs detailing our benchmark project and sharing our test results as well as his thoughts and insights about the benchmark exercise. For those of you who simply can’t wait, we have also made a detailed technical brief available that you can download here.

Happy reading, and please join me in congratulating Thierry and his team on this notable and highly informative accomplishment.