With the release of the Nuxeo Platform 6.0, Nuxeo powers its search queries with natively integrated Elasticsearch. Like everything we implement or integrate, the search experience is also configurable with Nuxeo Studio, our online configuration and design tool. This integration provides a unique cocktail of configurability and permissiveness that leverages the capabilities of Elasticsearch aggregates.

Elasticsearch Aggregates and Integration on Nuxeo Page Providers

From the Elasticsearch documentation, we know “An aggregation can be seen as a unit-of-work that builds analytic information over a set of documents.” Aggregations come in two kinds:

  • Bucketing: For a given aggregate, a set of buckets (key + filter) is either defined manually or generated automatically following a rule. Each document of the search result is evaluated against each bucket’s filter to see if the document should be added to it or not. In the end, for each bucket you have the number of documents that matched the filter.
  • Metrics: Indicators computed against a set of documents.
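To make the distinction concrete, here is a minimal Python sketch of the two ideas, independent of Elasticsearch itself. The document fields (`title`, `size_kb`), the bucket names, and the `count_buckets` helper are all made up for illustration.

```python
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]


def count_buckets(
    docs: List[Document],
    buckets: Dict[str, Callable[[Document], bool]],
) -> Dict[str, int]:
    """Bucketing: count, for each bucket key, the documents matching its filter."""
    counts = {key: 0 for key in buckets}
    for doc in docs:
        for key, matches in buckets.items():
            if matches(doc):
                counts[key] += 1
    return counts


# Hypothetical search result.
docs = [
    {"title": "Memo", "size_kb": 8},
    {"title": "Spec", "size_kb": 12},
    {"title": "Report", "size_kb": 340},
    {"title": "Video", "size_kb": 52000},
]

# Manually defined buckets: a key plus a filter.
buckets = {
    "small": lambda d: d["size_kb"] < 100,
    "medium": lambda d: 100 <= d["size_kb"] < 1024,
    "large": lambda d: d["size_kb"] >= 1024,
}

bucket_counts = count_buckets(docs, buckets)

# Metrics: a single indicator computed over the same set of documents.
mean_size_kb = sum(d["size_kb"] for d in docs) / len(docs)

print(bucket_counts)
print(mean_size_kb)
```

Bucketing answers "how many documents fall in each group?", while a metric collapses the whole set into one number.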

In this post, we will focus on bucketing, since it is the kind of aggregation we integrated with the Nuxeo Platform 6.0. Elasticsearch offers different kinds of bucketing with various nuances:

  • term: Documents are grouped by distinct values in the search results for a given document property. Each distinct value is the key of a bucket. This aggregate is also known as a “facet”. A related aggregate, “significant terms”, returns terms filtered by statistical relevance.
  • range: Each bucket of this aggregate corresponds to a range of values that has been manually defined on the aggregate. Documents are grouped in one bucket if the value of a given property fits in the range of the bucket.
  • date range: It’s the same as range but specific to dates.
  • histogram: All values of the search results for a given property of the documents are projected onto a discrete distribution of values that defines the buckets, such as the sequence 0, 5, 10, 15. It can be used for numerical document properties.
  • date histogram: It’s the same as histogram but specific to dates.
  • geo-distance: Documents are grouped based on their distance to a specific point. Like range, each bucket of the aggregate is defined by the distance to the point.
  • geo-hash: It is like a two-dimensional area projection for grouping geospatial positions. The space is divided into square zones, and documents are grouped and counted for each non-empty zone.
  • missing: It includes documents for which no value is available for a given property.
  • nested: Documents are aggregated based on their nested sub-objects.
  • children: The bucket values are joined using a parent/children relation.
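The three most common bucketing rules from the list above can be sketched in a few lines of plain Python. This is only an illustration of the grouping logic, not of how Elasticsearch computes it internally. The sample values are invented, and the range buckets include their lower bound and exclude their upper bound, which is the convention Elasticsearch documents for range aggregates.

```python
import math
from collections import Counter

# Hypothetical property values taken from a search result.
natures = ["article", "report", "article", "memo", "article"]
values = [1, 4, 5, 7, 12, 15]

# term: one bucket per distinct value.
terms = Counter(natures)


def histogram_key(value: float, interval: float) -> float:
    """histogram: project a value onto the start of its fixed-width interval."""
    return math.floor(value / interval) * interval


histogram = Counter(histogram_key(v, 5) for v in values)


def range_counts(vals, ranges):
    """range: manually defined buckets, lower bound included, upper bound excluded."""
    counts = {}
    for key, low, high in ranges:
        counts[key] = sum(
            1
            for v in vals
            if (low is None or v >= low) and (high is None or v < high)
        )
    return counts


ranges = [("below 5", None, 5), ("5 to 10", 5, 10), ("10 and above", 10, None)]
range_result = range_counts(values, ranges)

print(dict(terms))
print(dict(histogram))
print(range_result)
```

With an interval of 5, the histogram keys are exactly the sequence 0, 5, 10, 15 mentioned above.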

It is possible to nest aggregate requests, which means that aggregates provide not only a way to filter a result set, but also a way to extract specific information and gather it in a cross-document join.
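As an illustration of nesting, the following Python snippet builds a raw Elasticsearch request fragment in which a date histogram is nested inside a terms aggregate. It shows the Elasticsearch request syntax only, not a Nuxeo page provider configuration. The aggregate names (`by_nature`, `by_month`) and the choice of `dc:created` are assumptions for the example, and the `interval` parameter is the one used by the Elasticsearch releases of that period (later versions renamed it).

```python
import json

# A terms aggregate with a date histogram nested inside each of its buckets.
nested_request = {
    "aggs": {
        "by_nature": {  # hypothetical aggregate name
            "terms": {"field": "dc:nature"},
            "aggs": {
                "by_month": {  # hypothetical aggregate name
                    "date_histogram": {
                        "field": "dc:created",  # assumed date property
                        "interval": "month",  # renamed in later Elasticsearch versions
                    }
                }
            },
        }
    }
}

request_json = json.dumps(nested_request, indent=2)
print(request_json)
```

Each bucket of the outer aggregate would then carry its own list of monthly buckets in the response.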

In the Nuxeo Platform, the query layer (more precisely, the PageProvider interface) has been extended to support aggregates for filtering purposes. When defining a page provider object, it is possible to specify a list of aggregates that will be added to the query definition, and then to handle the corresponding list of buckets that are returned.

In the latest version of the Nuxeo Platform, we have support for aggregates of type term, significant term, range, histogram, date histogram and date range. For each type, it is possible to specify all the options on the page provider that Elasticsearch supports. For instance, a term aggregate is specified as follows:

<aggregate id="dc_nature_agg" type="terms" parameter="dc:nature">
  <field schema="default_search" name="dc_nature_agg" />
  <properties>
    <property name="size">10</property>
  </properties>
</aggregate>

You can see that we have chosen a generic grammar, one that does not depend on Elasticsearch. This will be helpful in case we want to implement a page provider for another search engine. The ElasticsearchPageProvider will add the following JSON to the query:

"aggs" : {
    "dc_nature_agg" : {
        "terms" : { "field" : "dc:nature", "size" : 10 }
    }
}
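The translation from the generic definition to the Elasticsearch fragment can be sketched as follows. This is not the actual ElasticsearchPageProvider code, only a rough Python approximation of the mapping: the `to_elasticsearch_aggs` function is hypothetical, and it copies property names unchanged, whereas a real implementation may need to rename some of them.

```python
import json
import xml.etree.ElementTree as ET

AGGREGATE_XML = """
<aggregate id="dc_nature_agg" type="terms" parameter="dc:nature">
  <field schema="default_search" name="dc_nature_agg" />
  <properties>
    <property name="size">10</property>
  </properties>
</aggregate>
"""


def to_elasticsearch_aggs(aggregate_xml: str) -> dict:
    """Hypothetical helper: turn one generic aggregate definition into an "aggs" fragment."""
    node = ET.fromstring(aggregate_xml.strip())
    # The "parameter" attribute names the document property to aggregate on.
    body = {"field": node.get("parameter")}
    for prop in node.iter("property"):
        text = (prop.text or "").strip()
        # Numeric options such as "size" are emitted as JSON numbers.
        body[prop.get("name")] = int(text) if text.isdigit() else text
    return {"aggs": {node.get("id"): {node.get("type"): body}}}


aggs_fragment = to_elasticsearch_aggs(AGGREGATE_XML)
print(json.dumps(aggs_fragment, indent=2))
```

The aggregate `id` becomes the key under `aggs`, the `type` selects the Elasticsearch aggregate, and the properties become its options.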

All the accepted parameters can be passed through:

  • size
  • minDocCount (returns only buckets having at least n matches)
  • order
  • script

Then, the list of buckets can be fetched from the page provider, which stores the list of aggregate objects. Each of them stores a list of Bucket objects, whose interface provides the getKey() and getDocCount() methods.
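To show the shape of the data involved, here is a small Python sketch that reads the buckets of a terms aggregate out of an Elasticsearch-style response and exposes them through accessors mirroring getKey() and getDocCount(). The `Bucket` class and the `read_buckets` helper are illustrative stand-ins, not the Nuxeo Java API, and the response values are invented.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Bucket:
    """Illustrative stand-in for a bucket: a key and the number of matching documents."""

    key: str
    doc_count: int

    def get_key(self) -> str:
        return self.key

    def get_doc_count(self) -> int:
        return self.doc_count


def read_buckets(response: dict) -> Dict[str, List[Bucket]]:
    """Collect the buckets of every aggregate found in a search response."""
    result = {}
    for agg_id, agg in response.get("aggregations", {}).items():
        result[agg_id] = [
            Bucket(b["key"], b["doc_count"]) for b in agg.get("buckets", [])
        ]
    return result


# Invented response fragment for the dc_nature_agg aggregate.
response = {
    "aggregations": {
        "dc_nature_agg": {
            "buckets": [
                {"key": "article", "doc_count": 12},
                {"key": "report", "doc_count": 5},
            ]
        }
    }
}

buckets_by_aggregate = read_buckets(response)
for bucket in buckets_by_aggregate["dc_nature_agg"]:
    print(bucket.get_key(), bucket.get_doc_count())
```

A UI component only needs these two values per bucket to render a label and its count.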

Nuxeo Platform Aggregates Widgets

On the Web UI layer of the Nuxeo Platform, a set of elements has been added to display the returned buckets in a way that is intuitive for the user. The display depends on the nature of the bucket value: a date range, a string referencing a user or an entry of a vocabulary (controlled list), a string representing a document id, etc. For instance, in the user case, we display enriched information: the first name and last name of the resolved principal. Also, for terms widgets (whatever the nature of the term), we offer either a checkbox display mode or a suggester display mode.

[Figure: A Term widget]

[Figure: A Range widget]

Configuring the Screens with Nuxeo Studio

The technical process for configuring a search screen would be to:

  • define the query and accepted aggregates
  • define the form that will modify the dynamic part of the query

In Nuxeo Studio, these two steps are merged to make the process simpler. The user just has to design the search form and fill in all the options when configuring each field. Nuxeo Studio picks up some of the information to generate the correct UI component, and some to generate the correct aggregate configuration on the page provider. When configuring a field, all options described in the Elasticsearch documentation can be passed through.

[Figure: Configuring an aggregate control in Nuxeo Studio]

By combining aggregate types, aggregate options, UI element types and UI element options, the possibilities for creating a search form become virtually endless. Better still, all of this is configured by simply dragging and dropping form controls in Nuxeo Studio.

You can implement such screens in JavaScript too, but then you have to code, debug, and maintain everything yourself, so you would definitely not get the same agility.

What’s Next

As you have seen, we have barely touched the possibilities offered by Elasticsearch. This integration of Elasticsearch with the Nuxeo Platform is full of potential and our next steps will be:

  • Leveraging other kinds of bucketing aggregates: Among them, spatial search would definitely be relevant for a content repository. The parent/children aggregate may also be interesting for implementing join queries.
  • Leveraging Metrics: Through our Page Provider, we will expose some of the available metrics and extend the Web UI framework with a set of KPI widgets that will graphically present the metrics computed by Elasticsearch. This will allow us to display statistics among workflow tasks, documents, etc. very easily via Nuxeo Studio.
  • Configuring Easy Field Mapping: Depending on the kind of query that should be run on a field, different mappings should be contributed. The mapping may specify which analyser to use, which filter/tokenizer to use at indexing time, etc. It would make sense to add this mapping information directly in the content model design for each property and to generate the mappings automatically from it. A current item on the Nuxeo roadmap prepares the content repository configuration structure to store such information.
  • “More like this document”, “Suggestions”: Other features will be used to improve user experience and benefit the most from Elasticsearch. The “suggestions” could be used for the quick search, be made available as an element, or as an option of the input text element. “More like this document” could be a simple element that can be set on a tab.
  • Last but not least, a percolator could be great for handling real-time notifications/synchronization. This is a vast topic, and we will discuss it in another blog post when we start working on it!

To find out more:

Indexing and Query Documentation Top Page (including architecture diagram): https://doc.nuxeo.com/x/9IdH

Supported Aggregates: https://doc.nuxeo.com/x/7hQ5AQ

How to Configure a Search Screen with facets and other aggregates: https://doc.nuxeo.com/x/DxM5AQ

Elasticsearch Aggregates Top Documentation page: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations.html