One Billion Documents, Testing the Limits of Nuxeo
The amount of information and content that needs to be managed continues to increase at a tremendous rate. This has brought scalability and performance to the forefront of many recent customer conversations. To help our customers succeed, we put in considerable effort to ensure that the Nuxeo Platform performs and scales. Which also makes us curious about how far we can push the limits of the Platform. Recently we tested these limits, seriously tested, to the tune of 1 billion (1,000,000,000) documents. I'll take you through the story of how and what we tested and of course the results.
Benchmarking Is in Our DNA
We automatically benchmark and test the Nuxeo Platform performance every night to verify that the platform performance isn't altered by new features or changes in the code base.
We also perform custom benchmarks for client projects on a regular basis, but it has been a while since we tried a generic large scale benchmark of the Nuxeo Platform. The last major benchmark was in 2010 and with only 100 million documents. I've always been interested in whether we could do more.
In the last four years, much has changed inside the Nuxeo Platform, including VCS Optimizations, the integration of Elasticsearch, and a new REST API. Also during that time, hardware has progressed significantly, especially via the availability of SSD drives that increased I/O performance.
When we saw that OVH was proposing big servers (20 cores / 256 GB RAM) with SSD disks for less than 400$ a month, the first thing that jumped into our heads was, "Hey, we could use that kind of server to perform a huge benchmark!"We used this as an opportunity to update our generic benchmarks and see where we stand in terms of performance and how far we can go.
We looked around on the internet to see what others in the industry had published and found many with 10 million document or 100 million document benchmarks, which are interesting but we wanted to really push the system. We designed a way to benchmark the Nuxeo Platform with an unprecedented 1,000,000,000 documents, yes 1 BILLION!
The 1 Billion Document Benchmark
Running this type of benchmark requires some thought into the system and how to create a 1 billion document environment. Also important is to create an environment that will test under near real world circumstances.
Hardware & Software
The benchmark was run against a single server:
- 2x Intel Xeon E5-2670v2 @2.5 Ghz - 20 cores with HT
- 256 GB DDR RAM
- 2x240 GB SSD (ext4)
- 2x2TB HDD (ext4)
On this hardware we installed :
- Ubuntu 14.04
- a single Nuxeo 5.9.5 instance (Fast Track version)
- a PostgreSQL 9.3
- an Elasticsearch cluster (1.1.2)
Importing 100 Million Documents
We used the nuxeo-platform-importer addon that allows us to handle high speed bulk import through the Java API. Conveniently, nuxeo-platform-importer also includes a mode to generate random documents to import, this last point is actually important when you want to perform big import benchmarks and don't want to spend too much time generating test data.
- nuxeo-platform-importer with random document generator
- 30 metadata from 5 different schemas
- 5 Dublin Core metadata set by the importer including one that is complex
- 1KB average file size of random generated text
- Bulk mode is enabled / Full text extraction is disabled
- Storage on SSD
- PostgreSQL Storage on the first SSD drive
- BinaryStore and ElasticSearch share the second SSD drive
We ran the import until there were 100 million documents since at this point the SSD disks were full:
- 120GB for the database including 60GB of index
- 70GB for the binary store
- 45GB for the Elasticsearch index
An important fact is that during all the import phase the performance remained consistent:
- Document import runs at 6000 documents/second
- Full Text indexing inside Elasticsearch runs at 3500 documents/s
This means that importing and full text indexing of 100 million documents takes about 12 hours!
From 100 Million to 1 Billion
Initial testing showed us that the platform was responding very well (see below). We were ready to go further, to 1 Billion! However, we needed to change the import strategy since SSD disks were almost full, the import was really I/O bound, and importing 1B documents would take a couple of weeks on regular HDD.
We also thought that a one shot bulk import of 100,000,000 documents was already big enough. We decided the best way push the Nuxeo architecture to reach 1 Billions documents is to plug the same Nuxeo platform instance into several repositories. Having several repositories on the same application is a simple way to shard the Nuxeo data across several database engine.
To make that happen efficiently we duplicated the PostgreSQL database and the Elasticsearch shards resulting in a single Nuxeo instance with 10 PostgreSQL databases (each of them storing 100 Millions documents), 10 Binary stores (each of them storing 100 Millions blobs), and 1 unique Elasticsearch unified index (using several shards)
This approach has several advantages:
- It is faster to setup - Importing 1 billion document take a long time, even when importer is very fast
- It is still representative of real life data since each repository is handled separately
- We can massively scale out the storage processing!
Once the data loaded, we ran some benchmarks through the REST API using Siege. The 2 main use cases tested were retrieve a Document by it's ID and search and retrieve a document by a text meta-data value (ex: title).
1 Billion Documents, Results and Conclusions
The tests showed that even with a huge number of documents, the responsiveness and performance of the Nuxeo Platform remains intact and as you can see from the data, it can even increase the performance with higher volume, thanks to the scale out architecture: multi-repository and Elasticsearch.
It is very interesting to see that when multiplying the number of documents by 10 we also increase the throughput. This is due to sharding on several databases. The Nuxeo Platform scales out well not only because of the support for distributed architecture at the index level but also at the repository level.
The Nuxeo Platform performance on these tests shows that the Platform is able to scale to meet even the most demanding enterprise volumes of documents and information.
Our tests showed us that the hardware architecture we used to do the bench is clearly not the best suited since the limit in all tests is I/O and all components scale out well but we have only one server. This means that the ideal architecture would be to use several medium size VMs for the different PGSQL Database and several small size VMs for the Elasticsearch Cluster.
With that kind of architecture, we expect to significantly increase search performances and increase the total number of documents we can manage inside a single Nuxeo Application. That's why the next step is to run a similar benchmark, but on a distributed AWS based architecture. Based on the current results and our understanding of the system we believe that we now have the perfect architecture to be able to scale out efficiently on AWS .
Category: Nuxeo Updates