Document capture can be a critically important part of any system that uses a content repository. Quite a few solutions automate the process of scanning paper documents. More important still is the ability to automatically capture metadata from the content of those documents and store it in the repository.
Our partner Ephesoft offers an excellent solution for document capture, data extraction, and (most relevant to this blog post) built-in support for exporting documents via CMIS. Given Nuxeo’s long involvement in the development of CMIS and its excellent support for the standard, Ephesoft is a perfect fit for automatically getting those pesky paper documents into your Nuxeo repository.
What is Ephesoft?
Ephesoft (in this case the Enterprise Edition) offers “intelligent document capture”. It provides the ability to scan physical and electronic documents, automatically process them for arbitrary content (ICR, OCR, images, etc.), and export/report on the results.
What is CMIS?
Content Management Interoperability Services (CMIS) is an open standard that allows different content repositories to interoperate. Specifically, CMIS defines an abstraction layer for document management using web protocols.
To be clear, the support for CMIS in Ephesoft means that it can export content to any repository that properly supports CMIS with no special tooling or integration. Nuxeo happens to have excellent CMIS support so this kind of integration is really easy.
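Because CMIS runs over plain web protocols, any HTTP client can talk to the repository. As a rough sketch, the Python below builds (but does not send) an authenticated request for the AtomPub service document that Nuxeo serves under `/nuxeo/atom/cmis`. The host, port, user name, and password are placeholders, and the helper name `build_service_document_request` is made up for this example.

```python
import base64
import urllib.request


def build_service_document_request(base_url, username, password):
    """Build a GET request for the CMIS AtomPub service document.

    Uses HTTP Basic authentication. Nothing is sent over the network here.
    """
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(base_url, method="GET")
    request.add_header("Authorization", "Basic " + token)
    request.add_header("Accept", "application/atomsvc+xml")
    return request


# Placeholder values: replace with your own server and credentials.
request = build_service_document_request(
    "http://localhost:8080/nuxeo/atom/cmis", "ephesoft", "secret"
)

# To actually send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

If the server answers with an XML service document, the endpoint and credentials should also work when entered into a CMIS client such as Ephesoft.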
If you would like to find out more about connecting content management applications using CMIS, watch our webinar where CMIS visionary Jeff Potts and I discuss CMIS and its value.
Ok, how does it work?
Here’s a short rundown on how to capture documents directly in Nuxeo using Ephesoft and CMIS. Note that in this example Ephesoft is running on Windows so any file paths are in Windows format.
Set up Ephesoft
You can find a complete tutorial on setting up document capture here:
I will just summarize the basic steps and follow with some helpful tips:
- Create a “batch class”.
- Define a “document type”.
- Define the “index fields”.
- Define the “key value extraction” for those fields.
Tip: I recommend using the Advanced Extraction in most cases, as opposed to the Key-Value Extraction, because it’s more explicit and intuitive. With Advanced Extraction you visually define the explicit capture area with the label (green) and the field (red) like so:
Here’s a helpful video about how to set up Advanced Extraction:
Tip: Ephesoft uses “confidence” scores to determine if a document matches a particular Document Type, and if the fields match or not. If something does not match, user intervention is generally required. Confidence scores are not a percentage but an index that is capped at 40. For basic testing and development it’s perfectly acceptable to set the confidence score to 0 to avoid any human intervention.
Tip: If you’re accessing the Ephesoft UI from somewhere other than the host, you may find that document images do not show up. To fix this you need to modify the file “C:\Ephesoft\Application\WEB-INF\classes\META-INF\dcma-batch\dcma-batch.properties”. Set the property “batch.base_http_url” to match the IP address or hostname of the Ephesoft server.
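If you would rather script that change than edit the file by hand, something along these lines could work. It is only a sketch: `set_property` is a hypothetical helper, the sample text stands in for the real file, and the URL value shown is a made-up placeholder, so check the value already present in your file for the expected format.

```python
def set_property(properties_text, key, value):
    """Return the properties text with `key` set to `value`.

    Existing assignments of `key` are rewritten in place; if the key is
    absent it is appended. Comment lines (starting with '#') are left alone.
    """
    lines = properties_text.splitlines()
    updated = False
    for index, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#") or "=" not in stripped:
            continue
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[index] = f"{key}={value}"
            updated = True
    if not updated:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# Stand-in for the contents of dcma-batch.properties (values are assumptions).
sample = "# batch settings\nbatch.base_http_url=http://localhost:8080/dcma-batches\n"
print(set_property(sample, "batch.base_http_url",
                   "http://ephesoft.example.com:8080/dcma-batches"))
```

Keep a backup of the original file before writing the result back.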
Set up Nuxeo
You need to create a folderish document into which Ephesoft will export the documents.
You may need to create a new document type in Nuxeo to hold the information coming from Ephesoft. You can reuse an existing document type, but in that case beware of any events or automation attached to it. Alternatively, create a new document type to decouple the documents coming from Ephesoft from any existing content. This gives you complete control over what happens after the documents arrive, without affecting any existing business logic.
For security reasons you may want to create a user specifically for Ephesoft to use, with appropriate permissions so the user doesn’t have full access to the whole repository.
To integrate Nuxeo and Ephesoft via CMIS you only need to complete two steps:
- Configure the CMIS plug-in.
- Configure the field mapping.
Configure CMIS plug-in
Use the Ephesoft Admin Client to perform these steps.
From the “Batch Class Management” tab, open your batch class.
Select the “Module” tab.
Scroll down to “Export” and double-click it.
Then click the Edit button to make the necessary changes. Here is an example:
Configure the following options:
- Cmis Root Folder Name - this is the folder you created in Nuxeo to receive the documents. The path should be relative to the repository name. Do not enter a leading slash or a trailing slash.
- Cmis Upload File Extension - can be “pdf” or “tiff”.
- Cmis Server URL - Use the format “http://server:port/nuxeo/atom/cmis”.
- Cmis Server User Name - Nuxeo username that has write access to the “Cmis Root Folder Name”.
- Cmis Server User Password - password for the Nuxeo user.
- Cmis Server Repository Id - the name of the Nuxeo repository, usually “default”.
- Cmis Server Switch ON/OFF - make sure this is set to “ON”.
Click “OK” to save the edit, and be sure to click “Apply” to permanently commit the changes. Finally, click “Validate” and then “Deploy Workflow” any time you make plug-in changes.
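The constraints on the options listed above are easy to get wrong, so it can help to sanity-check the values before typing them in. The sketch below is not part of Ephesoft: the dictionary keys and the `validate_export_settings` helper are invented for this example, the folder path is a placeholder, and accepting `https` alongside `http` is an assumption.

```python
from urllib.parse import urlparse

ALLOWED_EXTENSIONS = {"pdf", "tiff"}


def validate_export_settings(settings):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    root = settings.get("root_folder", "")
    if not root:
        problems.append("root folder name is empty")
    if root.startswith("/"):
        problems.append("root folder name must not start with a slash")
    if root.endswith("/"):
        problems.append("root folder name must not end with a slash")

    if settings.get("file_extension") not in ALLOWED_EXTENSIONS:
        problems.append("upload file extension must be 'pdf' or 'tiff'")

    url = urlparse(settings.get("server_url", ""))
    if url.scheme not in ("http", "https") or not url.netloc:
        problems.append("server URL must look like http://server:port/nuxeo/atom/cmis")
    elif not url.path.rstrip("/").endswith("/atom/cmis"):
        problems.append("server URL should end with /atom/cmis")

    if settings.get("switch") != "ON":
        problems.append("CMIS switch must be set to ON")

    return problems


# Placeholder values for illustration only.
settings = {
    "root_folder": "default-domain/workspaces/Ephesoft",
    "file_extension": "pdf",
    "server_url": "http://localhost:8080/nuxeo/atom/cmis",
    "repository_id": "default",
    "switch": "ON",
}
print(validate_export_settings(settings))
```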
Configure Field Mapping
Locate the file “C:\Ephesoft\SharedFolders\BC4\cmis-plugin-mapping\DLF-Attribute-mapping.properties”. Here you must define the mapping between your Ephesoft document type and the corresponding Nuxeo document type. Ephesoft values are on the left, Nuxeo on the right. Here is an example:
Tip: If you make a mistake, note that you can edit this file without restarting the server.
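Since the mapping file is just a list of `left=right` pairs, a small script can catch malformed lines before Ephesoft does. The following is a sketch only: the sample entries (an `Invoice` type with two fields, mapped to `invoice:` properties) are invented for illustration and are not taken from a real configuration, so consult the file shipped with your batch class for the exact key naming.

```python
def parse_mapping(text):
    """Parse 'ephesoft=nuxeo' lines into a dict, skipping blanks and comments."""
    mapping = {}
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "=" not in line:
            raise ValueError(f"line {number}: expected 'ephesoft=nuxeo', got {raw!r}")
        left, right = (part.strip() for part in line.split("=", 1))
        if not left or not right:
            raise ValueError(f"line {number}: both sides of '=' must be non-empty")
        mapping[left] = right
    return mapping


# Hypothetical mapping: Ephesoft names on the left, Nuxeo names on the right.
SAMPLE = """
# document type
Invoice=Invoice
# index fields
Invoice.InvoiceNumber=invoice:number
Invoice.InvoiceDate=invoice:date
"""

for ephesoft_name, nuxeo_name in parse_mapping(SAMPLE).items():
    print(f"{ephesoft_name} -> {nuxeo_name}")
```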
Try it out!
When you configured your batch class, you defined a folder where Ephesoft will expect to find documents to import (the “UNC Folder” property). Drop a PDF or TIFF in this folder and Ephesoft will work its magic. After a few minutes you’ll end up with a document in Nuxeo at the path you configured. Easy peasy!
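Dropping a file into the watched folder can also be scripted, for example from a scanner workstation. This is a minimal sketch under a few assumptions: the `drop_into_watch_folder` helper is made up for this post, the paths are placeholders, and accepting the `.tif` spelling alongside `.tiff` is my own guess.

```python
import shutil
from pathlib import Path

ACCEPTED_SUFFIXES = {".pdf", ".tif", ".tiff"}


def drop_into_watch_folder(source, watch_folder):
    """Copy a PDF or TIFF into the folder Ephesoft monitors; return the new path."""
    source = Path(source)
    watch_folder = Path(watch_folder)
    if source.suffix.lower() not in ACCEPTED_SUFFIXES:
        raise ValueError(f"unsupported file type: {source.suffix}")
    if not watch_folder.is_dir():
        raise FileNotFoundError(f"watch folder not found: {watch_folder}")
    destination = watch_folder / source.name
    shutil.copy2(source, destination)
    return destination


# Example call with placeholder paths:
# drop_into_watch_folder(r"C:\scans\invoice-001.pdf", r"C:\Ephesoft\SharedFolders\my-unc-folder")
```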
Tip: If something doesn’t work, open the “Batch Instance Management” tab in the Ephesoft Admin client, locate the failing batch, click the “>>” button and then the “Troubleshoot” button.
This allows you to download a copy of all the logs and involved documents for that batch. Generally the Application Log contains the most useful information.
Tip: A failing batch can be restarted using the “Restart” button; this restarts the failing step, not the entire batch! If the CMIS export isn’t working, you can easily make changes and retry just the export.