[Monday Dev Heaven] Multi-threaded, transactional bulk import with Nuxeo

Mon 05 March 2012 By Laurent Doguin

Most Nuxeo developers like toying around with new ideas or concepts. Sometimes the result ends up in a sandbox, sometimes in a new addon, sometimes in the platform. Unfortunately, some wicked cool stuff never leaves the sandbox. This doesn't necessarily mean these projects lack something or have no value; more often we simply had no time to take them to the next step, for instance making a supported package on the Marketplace. Some of them have a lot to offer and are fully compatible with 5.5.

So today I've decided to exhume nuxeo-importer-scan. I remember using it on a project and finding it really easy to configure and set up. It's a bulk import plugin made by our mighty CTO Thierry Delprat, and technically it's based on nuxeo-platform-importer-core. The way it works is quite simple: when the scanImporter event is fired, it looks into a specified directory and imports its content. With the XML mapping, that content is XML files plus the resources they reference (.tiff or .pdf files, for instance); without it, just the resources.

This makes it a very good fit for digitized content, which generally ends up in a shared folder on a NAS. This is a common need in the ECM world, and as a Nuxeo developer you will surely run into it.

Note that you already have a way of importing content through the bulk document importer available on the Nuxeo Marketplace. It works a bit differently though, with a file tree mapping the path of the documents in the repository and properties files for metadata. It might be less suited to digitized output than the scan importer, but it's great when you need to import a tree of documents.

It's also triggered differently, via a REST API. It would be simple to get the same behavior with the scan importer, though: all you have to do is call the Notification.SendEvent operation with 'scanImporter' as parameter.
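To give an idea of what such a call could look like from Java, here is a minimal sketch that posts to the Automation endpoint. The server URL, the credentials, the `name` parameter key and the `application/json+nxrequest` content type are assumptions on my side; check them against the Automation documentation of your own server before relying on them.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class ScanImportTrigger {

    /** JSON body asking the operation to fire the given event (parameter key "name" is assumed). */
    static String requestBody(String eventName) {
        return "{\"params\":{\"name\":\"" + eventName + "\"}}";
    }

    /** Automation URL of the operation; baseUrl is assumed to look like http://host:8080/nuxeo. */
    static String operationUrl(String baseUrl) {
        return baseUrl + "/site/automation/Notification.SendEvent";
    }

    /** Posts the request with basic authentication and returns the HTTP status code. */
    static int fire(String baseUrl, String user, String password) throws IOException {
        HttpURLConnection con =
                (HttpURLConnection) URI.create(operationUrl(baseUrl)).toURL().openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        String token = Base64.getEncoder()
                .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        con.setRequestProperty("Authorization", "Basic " + token);
        con.setRequestProperty("Content-Type", "application/json+nxrequest");
        try (OutputStream out = con.getOutputStream()) {
            out.write(requestBody("scanImporter").getBytes(StandardCharsets.UTF_8));
        }
        return con.getResponseCode();
    }
}
```

Calling `ScanImportTrigger.fire("http://localhost:8080/nuxeo", "Administrator", "Administrator")` against a local test server (hypothetical URL and credentials) would then fire the event without waiting for the scheduler.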

But let's see how to configure the scan importer.

Import Configuration

The import is triggered by a simple event fired by the scheduler service. By default the event is fired every 30 seconds, as you can see in the contribution below. If you are used to good old Unix cron, be careful: the syntax is slightly different since we are using Quartz, whose expressions start with a seconds field.

  <!-- define scheduler event to trigger the import -->
    <schedule id="scanImporter">
      <!-- every 30 seconds -->
      <!-- only edit this part !!! -->
      <cronExpression>*/30 * * * * ?</cronExpression>
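If the Quartz syntax is new to you, the following sketch simply labels each field of an expression, which makes the leading seconds field obvious. It is only an illustration: it does not use the Quartz library and does not validate the field values, and the class name is made up for this post.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class QuartzCronFields {

    // Field order of a Quartz cron expression; the seventh field (year) is optional.
    private static final String[] NAMES = {
        "seconds", "minutes", "hours", "dayOfMonth", "month", "dayOfWeek", "year"
    };

    /** Splits the expression on whitespace and maps each part to its field name. */
    static Map<String, String> label(String expression) {
        String[] parts = expression.trim().split("\\s+");
        if (parts.length < 6 || parts.length > 7) {
            throw new IllegalArgumentException("Expected 6 or 7 fields, got " + parts.length);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < parts.length; i++) {
            fields.put(NAMES[i], parts[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        // The default expression of the scan importer schedule.
        System.out.println(label("*/30 * * * * ?"));
    }
}
```

A five-field Unix cron line such as `*/30 * * * *` is rejected by this sketch, which is a handy reminder that you cannot paste a crontab entry as is.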

What's next? You need to configure the importer. In the following example you specify the source folder of the resources to import, the folder where processed files are moved, where to import the documents in the repository, the number of concurrent import threads, the size of the document batch to import between commits, etc.

  <extension target="org.nuxeo.ecm.platform.scanimporter.service.ScannedFileMapperComponent"

        <!-- define here importer configuration -->
            <!-- folder that holds the data to be imported -->
            <!-- folder where xml files will be moved when processed (files will be deleted if directory is not set or does not exist)-->
            <!-- number of threads used by the importer : keep it to 1 if using H2 or you will break H2's lucene index -->
            <!-- define how many documents are imported between 2 commits -->
            <!-- Specify the path of the root document where you want to import your documents -->
            <!-- default to true -->
            <!-- default to false -->
            <!-- Looks for XML file and use mapping configuration, default to true. -->
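To picture how the thread count and the batch size interact, here is a small self-contained sketch. It is not the importer's code: the class, the round-robin split of files between workers and the commit counter are all assumptions made for illustration, and the document creation itself is left as a comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class BatchImportSketch {

    /** Simulates an import with nbThreads workers and returns the total number of commits. */
    static int importAll(List<String> files, int nbThreads, int batchSize) {
        AtomicInteger commits = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(nbThreads);
        try {
            List<Future<?>> tasks = new ArrayList<>();
            for (int t = 0; t < nbThreads; t++) {
                final int worker = t;
                tasks.add(pool.submit(() -> {
                    int pending = 0;
                    // Each worker takes every nbThreads-th file (round-robin split).
                    for (int i = worker; i < files.size(); i += nbThreads) {
                        // A real importer would create a document from files.get(i) here.
                        pending++;
                        if (pending == batchSize) {
                            commits.incrementAndGet(); // commit a full batch
                            pending = 0;
                        }
                    }
                    if (pending > 0) {
                        commits.incrementAndGet(); // commit the last, partial batch
                    }
                }));
            }
            for (Future<?> task : tasks) {
                task.get(); // wait for the worker and surface its failure, if any
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Import interrupted or failed", e);
        } finally {
            pool.shutdown();
        }
        return commits.get();
    }
}
```

A smaller batch size means more commits and less work lost when something fails; a larger one means fewer, heavier transactions. The right trade-off depends on your database and document types.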


If you choose to use the XML mapping, you can configure it with an extension point contribution as you can see in the following example.

  <extension target="org.nuxeo.ecm.platform.scanimporter.service.ScannedFileMapperComponent"

        <!-- you can define the target folderish Document Type here: default to Folder -->

        <!-- you can define the target leaf Document type here: default to File
        You can use a static definition: -->

        <!-- Or a dynamic one by defining a class that implements the DocumentTypeMapper interface-->

            <!-- simple meta-data mapping
                   sourceXPath : xpath expression to locate the target XML node
                   sourceAttribute : attribute used to read value (if null, TEXT subnode will be used)
                   targetXPath : xpath of the target field in the Nuxeo DocumentModel
                   targetType : target type (integer, string, double, date)
                   dateFormat : define the date format to be used to parse a date (default to "yyyy-MM-dd'T'hh:mm:ss.sss'Z'")
            -->
            <fieldMapping sourceXPath="//string[@name='supplier']" sourceAttribute="value" targetXPath="dc:source" targetType="string"/>
            <fieldMapping sourceXPath="//string[@name='order_number']" sourceAttribute="value" targetXPath="dc:title" targetType="string"/>
            <fieldMapping sourceXPath="//date[@name='order_timestamp']" targetXPath="foo:order_timestamp" targetType="date" />
            <fieldMapping sourceXPath="//date[@name='order_date']" targetXPath="foo:order_date" targetType="date" dateFormat="yyyy-MM-dd"/>

            <!-- Mapping for blob resources:
                    sourceXPath : xpath expression to locate the target XML node
                    sourcePathAttribute : attribute used to read the file path (if null, TEXT subnode will be used)
                    sourceFilenameAttribute : attribute used to read the filename (if null, TEXT subnode will be used)
                    targetXPath : xpath of the target field in the Nuxeo DocumentModel (if null, Nuxeo will use the default file:content)
            -->
            <blobMapping sourceXPath="//file/content" sourcePathAttribute="filepath" sourceFilenameAttribute="name" />
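The dateFormat values look like java.text.SimpleDateFormat patterns. Assuming that is how they are interpreted (I have not checked the importer's code for this post), the sketch below shows how the default pattern and the "yyyy-MM-dd" pattern would read a timestamp and a plain date. The class name and the UTC time zone are choices made for the example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class DateFormatSketch {

    /** Parses value with the given pattern (in UTC) and renders it as yyyy-MM-dd HH:mm:ss. */
    static String normalize(String value, String pattern) {
        try {
            SimpleDateFormat in = new SimpleDateFormat(pattern, Locale.ROOT);
            in.setTimeZone(TimeZone.getTimeZone("UTC"));
            Date parsed = in.parse(value);

            SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ROOT);
            out.setTimeZone(TimeZone.getTimeZone("UTC"));
            return out.format(parsed);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse '" + value + "' with " + pattern, e);
        }
    }

    public static void main(String[] args) {
        // Default pattern from the mapping documentation, applied to a full timestamp.
        System.out.println(normalize("2005-03-17T11:00:00.000Z", "yyyy-MM-dd'T'hh:mm:ss.sss'Z'"));
        // Explicit dateFormat, as used for the order_date field.
        System.out.println(normalize("2005-03-17", "yyyy-MM-dd"));
    }
}
```

Trying your own sample values this way before launching a big import is a cheap way to catch a pattern that does not match your scanner's output.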


The mapping used is simple XPath. Here is a sample of what the corresponding file could look like:

<?xml version="1.0" encoding="UTF-8"?>
  <string name="supplier" label="Supplier" value="SFC" />
  <string name="order_number" label="Order Number" value="3-77-2" />
  <date name="order_timestamp" label="Order Timestamp" value="2005-03-17T11:00:00.000Z" />
  <date name="order_date" label="Order date" value="2005-03-17" />
  <collection name="OrderResources">
      <content name="testFile.txt" filepath="3-77-2/OrderResources/testFile.txt" />
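If you want to check your XPath expressions before running an import, you can evaluate them with the JDK alone. The sketch below applies the sourceXPath and attribute names from the mapping above to an inline sample. The `order` root element and the helper class are assumptions made for the example, and this is not how the plugin itself is implemented.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class FieldMappingSketch {

    // Sample descriptor; the <order> root element is hypothetical.
    static final String SAMPLE =
            "<order>"
          + "<string name=\"supplier\" label=\"Supplier\" value=\"SFC\" />"
          + "<string name=\"order_number\" label=\"Order Number\" value=\"3-77-2\" />"
          + "<collection name=\"OrderResources\"><file>"
          + "<content name=\"testFile.txt\" filepath=\"3-77-2/OrderResources/testFile.txt\" />"
          + "</file></collection>"
          + "</order>";

    /**
     * Returns the value found at sourceXPath: the given attribute,
     * or the text content when sourceAttribute is null. Returns null if nothing matches.
     */
    static String read(String xml, String sourceXPath, String sourceAttribute) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Object node = xpath.evaluate(
                    sourceXPath, new InputSource(new StringReader(xml)), XPathConstants.NODE);
            if (!(node instanceof Element)) {
                return null;
            }
            Element element = (Element) node;
            return sourceAttribute == null
                    ? element.getTextContent()
                    : element.getAttribute(sourceAttribute);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + sourceXPath, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(read(SAMPLE, "//string[@name='supplier']", "value"));
        System.out.println(read(SAMPLE, "//file/content", "filepath"));
    }
}
```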

If you are interested in this plugin, I really encourage you to look at its unit tests. They cover the different options and will give you a better understanding of the possible configurations.

There are many things that you could do to enhance this plugin. You could, for instance, add mapping for ACLs or for the document's path. This could come in really handy when you're working with CMF and need to distribute some mail directly to the right mailbox. You could add a REST API to trigger the import. You could also add a nice front end for all this configuration in the Admin Center (this might be a subject for another Monday blog, as well as a Marketplace package update :) ).

Anyway, I hope you liked what you saw here. See you Friday for the next Q&A :)

Category: Product & Development
Tagged: Java, Monday Dev Heaven