Document capture is often a critical part of any system built on a content repository. Plenty of solutions automate the process of scanning paper documents. Just as important is the ability to automatically capture metadata from the content of those documents and store it in the repository.

Our partner Ephesoft offers an excellent solution for document capture, data extraction, and (most relevant to this post) built-in support for exporting documents via CMIS. Given Nuxeo’s long involvement in the development of CMIS and its excellent support for the standard, Ephesoft is a great fit for automatically getting those pesky paper documents into your Nuxeo repository.

What is Ephesoft?

Ephesoft (in this case the Enterprise Edition) offers “intelligent document capture”. It provides the ability to scan physical and electronic documents, automatically process them for arbitrary content (ICR, OCR, images, etc.), and export/report on the results.

What is CMIS?

Content Management Interoperability Services (CMIS) is an open standard that allows different content repositories to interoperate. Specifically, CMIS defines an abstraction layer for document management using web protocols.

To be clear, the support for CMIS in Ephesoft means that it can export content to any repository that properly supports CMIS with no special tooling or integration. Nuxeo happens to have excellent CMIS support so this kind of integration is really easy.

Ok, how does it work?

Here’s a short rundown on how to capture documents directly in Nuxeo using Ephesoft and CMIS. Note that in this example Ephesoft is running on Windows so any file paths are in Windows format.

Set up Ephesoft

You can find a complete tutorial on setting up document capture here:

http://www.ephesoft.com/wiki/index.php?title=Tutorial

I will just summarize the basic steps and follow with some helpful tips:

  • Create a “batch class”.
  • Define a “document type”.
  • Define the “index fields”.
  • Define the “key value extraction” for those fields.

Tip: I recommend using the Advanced Extraction in most cases, as opposed to the Key-Value Extraction, because it’s more explicit and intuitive. With Advanced Extraction you visually define the explicit capture area with the label (green) and the field (red).

Tip: Ephesoft uses “confidence” scores to determine whether a document matches a particular Document Type, and whether its fields match. If something does not match, user intervention is generally required. Confidence scores are not a percentage but an index that is capped at 40. For basic testing and development it’s perfectly acceptable to set the confidence score to 0 to avoid any human intervention.

Tip: If you’re accessing the Ephesoft UI from somewhere other than the host, you may find that document images do not show up. To fix this you need to modify the file “C:\Ephesoft\Application\WEB-INF\classes\META-INF\dcma-batch\dcma-batch.properties”. Set the property “batch.base_http_url” to match the IP address or hostname of the Ephesoft server.
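
If you prefer to script that change, here is a minimal Python sketch. It assumes the existing value of “batch.base_http_url” is a plain, unescaped URL and swaps only the host part, leaving the scheme, port, and path as they already are in your file. The sample value and the IP address in the usage lines are made up for illustration; back up the real file before overwriting it.

```python
import re
from urllib.parse import urlsplit, urlunsplit

KEY = "batch.base_http_url"  # property name from the tip above


def with_host(url: str, host: str) -> str:
    """Return `url` with its host replaced, keeping scheme, port and path."""
    parts = urlsplit(url)
    netloc = f"{host}:{parts.port}" if parts.port else host
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def update_base_url(properties_text: str, host: str, key: str = KEY) -> str:
    """Rewrite the host of the first `key=...` line in a properties file body."""
    pattern = re.compile(
        r"^([ \t]*" + re.escape(key) + r"[ \t]*=[ \t]*)([^\r\n]*)", re.MULTILINE
    )

    def swap(match):
        return match.group(1) + with_host(match.group(2).strip(), host)

    updated, count = pattern.subn(swap, properties_text, count=1)
    if count == 0:
        raise KeyError(f"{key} not found in properties text")
    return updated


if __name__ == "__main__":
    # Hypothetical file content, for illustration only.
    sample = "batch.base_http_url=http://localhost:8080/dcma-batches\n"
    print(update_base_url(sample, "192.168.1.20"), end="")
```

To apply it to the real file, read the file into a string, pass it through update_base_url, and write the result back.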

Set up Nuxeo

You need to create a folderish document into which Ephesoft will export the documents.

You may need to create a new document type in Nuxeo to hold the information coming from Ephesoft. You can either reuse an existing document type, in which case beware of any events or automation attached to it, or create a new one to decouple the documents coming from Ephesoft from any existing content. A new type gives you complete control over what happens after the documents arrive, without affecting any existing business logic.

For security reasons you may want to create a user specifically for Ephesoft to use, with appropriate permissions so the user doesn’t have full access to the whole repository.

Integrate Nuxeo

To integrate Nuxeo and Ephesoft via CMIS you only need to complete two steps:

  • Configure the CMIS plug-in.
  • Configure the field mapping.

Configure CMIS plug-in

Use the Ephesoft Admin Client to perform these steps.

From the “Batch Class Management” tab, open your batch class.

Select the “Module” tab.

[Screenshot: Module tab in CMIS plugin]

Scroll down to “Export” and double-click it.

Double-click “CMIS-Export”.

Then click the Edit button to make the necessary changes. Here is an example:

[Screenshot: plugin configuration]

Configure the following options:

  • Cmis Root Folder Name - this is the folder you created in Nuxeo to receive the documents. The path should be relative to the repository name.
    • Do not enter a leading slash.
    • Do not enter a trailing slash.
  • Cmis Upload File Extension - can be “pdf” or “tiff”.
  • Cmis Server URL - Use the format http://server:port/nuxeo/atom/cmis.
  • Cmis Server User Name - Nuxeo username that has write access to the “Cmis Root Folder Name”.
  • Cmis Server User Password - password for the Nuxeo user.
  • Cmis Server Repository Id - the name of the Nuxeo repository, usually “default”.
  • Cmis Server Switch ON/OFF - make sure this is set to “ON”.
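
As a sanity check before typing these values into the Admin Client, the rules above can be sketched in a few lines of Python. This is only an illustration: the class and field names are my own (they are not an Ephesoft or Nuxeo API), and the folder path, host, and credentials in the usage lines are placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CmisExportSettings:
    """Values for the CMIS-Export plug-in (hypothetical field names)."""
    root_folder: str
    upload_extension: str
    server: str
    port: int
    username: str
    password: str
    repository_id: str = "default"
    switch: str = "ON"

    @property
    def server_url(self) -> str:
        # Format given in the list above: http://server:port/nuxeo/atom/cmis
        return f"http://{self.server}:{self.port}/nuxeo/atom/cmis"


def find_problems(settings: CmisExportSettings) -> List[str]:
    """Return a list of rule violations; an empty list means none were found."""
    problems = []
    if settings.root_folder.startswith("/"):
        problems.append("root folder has a leading slash")
    if settings.root_folder.endswith("/"):
        problems.append("root folder has a trailing slash")
    if settings.upload_extension not in ("pdf", "tiff"):
        problems.append("upload extension must be pdf or tiff")
    if not settings.username or not settings.password:
        problems.append("user name and password are required")
    if settings.switch != "ON":
        problems.append("switch must be ON")
    return problems


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    settings = CmisExportSettings(
        root_folder="default-domain/workspaces/Ephesoft",
        upload_extension="pdf",
        server="nuxeo.example.com",
        port=8080,
        username="ephesoft",
        password="secret",
    )
    print(settings.server_url)
    print(find_problems(settings) or "settings look consistent")
```

The check only covers the formatting rules listed here; it does not contact the Nuxeo server or verify that the user really has write access to the folder.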

Click “OK” to save the edit, and be sure to click “Apply” to permanently commit the changes. Finally, click “Validate” and then “Deploy Workflow” any time you make plug-in changes.

Configure Field Mapping

Locate the file “C:\Ephesoft\SharedFolders\BC4\cmis-plugin-mapping\DLF-Attribute-mapping.properties”. Here you must define the mapping between your Ephesoft document type and the corresponding Nuxeo document type. Ephesoft values are on the left, Nuxeo on the right. Here is an example:

InsuranceClaimForm=InsuranceClaimForm
InsuranceClaimForm.date_received=incf:date_received
InsuranceClaimForm.first_name=pein:first_name
InsuranceClaimForm.last_name=pein:last_name
InsuranceClaimForm.phone_main=pein:phone_main
InsuranceClaimForm.contract_id=incf:contract_id
InsuranceClaimForm.incident_date=incf:incident_date
InsuranceClaimForm.incident_description=incf:incident_description
InsuranceClaimForm.type_accident=incf:type_accident
InsuranceClaimForm.type_breakdown=incf:type_breakdown
InsuranceClaimForm.type_other=incf:type_other
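
Since a typo in this file is an easy way to break the export, a small Python sketch like the following can check the mapping before you run a batch. It assumes the file contains only simple “left=right” lines like the example above, and the function names are my own. The “prefix:property” check simply mirrors the form of the Nuxeo values in the example; it is not a rule stated by Ephesoft or Nuxeo.

```python
from typing import Dict, List, Tuple


def parse_mapping(text: str) -> Tuple[Dict[str, str], Dict[str, Dict[str, str]]]:
    """Split mapping lines into document type mappings and per-type field mappings."""
    doc_types: Dict[str, str] = {}
    fields: Dict[str, Dict[str, str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):  # skip blanks and comments
            continue
        left, sep, right = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {line!r}")
        left, right = left.strip(), right.strip()
        if "." in left:
            # Ephesoft "DocumentType.field" on the left, Nuxeo property on the right
            doc_type, field = left.split(".", 1)
            fields.setdefault(doc_type, {})[field] = right
        else:
            # Ephesoft document type on the left, Nuxeo document type on the right
            doc_types[left] = right
    return doc_types, fields


def find_problems(doc_types, fields) -> List[str]:
    """Report fields whose document type is unmapped or whose target lacks a prefix."""
    problems = []
    for doc_type, mapping in fields.items():
        if doc_type not in doc_types:
            problems.append(f"{doc_type}: fields mapped but document type is not")
        for field, target in mapping.items():
            if ":" not in target:
                problems.append(f"{doc_type}.{field}: expected prefix:property, got {target!r}")
    return problems


if __name__ == "__main__":
    sample = """\
InsuranceClaimForm=InsuranceClaimForm
InsuranceClaimForm.date_received=incf:date_received
InsuranceClaimForm.first_name=pein:first_name
"""
    doc_types, fields = parse_mapping(sample)
    print(doc_types)
    print(fields)
    print(find_problems(doc_types, fields) or "mapping looks consistent")
```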

Tip: If you make a mistake, you can edit this file without restarting the server.

Try it out!

When you configured your batch class, you defined a folder where Ephesoft will expect to find documents to import (the “UNC Folder” property). Drop a PDF or TIFF in this folder and Ephesoft will work its magic. After a few minutes you’ll end up with a document in Nuxeo at the path you configured. Easy peasy!

Tip: If something doesn’t work, open the “Batch Instance Management” tab in the Ephesoft Admin client, locate the failing batch, click the “>>” button and then the “Troubleshoot” button.

[Screenshot: Troubleshoot in Ephesoft]

This allows you to download a copy of all the logs and involved documents for that batch. Generally the Application Log contains the most useful information.

Tip: A failing batch can be restarted using the “Restart” button; this restarts the failing step, not the entire batch! If the CMIS export isn’t working, you can easily make changes and retry just the export.

To evaluate the Ephesoft Enterprise Edition with the Nuxeo Platform, follow the instructions here.
