Index Text Content of Illustrator Files


Thu 01 December 2016 By Michael Vachette

One of the key features of the Nuxeo Platform is indexing the text content of files. Users expect no less for office files, but the platform can provide the same experience for media assets that typically contain a lot of text: Adobe Illustrator (.ai) and Encapsulated PostScript (.eps) files. And the best part is that it takes only a few minutes to configure this feature with Nuxeo Studio!

Indexing the text content of .ai and .eps files is done in two steps: first convert the file to a PDF using Ghostscript, then extract the text content from the PDF. Ghostscript is already one of the third-party tools used by the platform, so there are no extra installation steps here. All that needs to be done is to configure a command line and two converters with Studio.

Let’s start with the command line. The Nuxeo Platform provides a command line executor service, so we just need to register a new command with the following XML contribution:

<extension point="command" target="org.nuxeo.ecm.platform.commandline.executor.service.CommandLineExecutorComponent">
  <command enabled="true" name="ps2pdf">
    <commandLine>gs</commandLine>
    <winCommand>gswin64c</winCommand>
    <parameterString>-dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=#{targetFilePath} #{sourceFilePath}</parameterString>
    <installationDirective>You need to install GhostScript.</installationDirective>
  </command>
</extension>
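
To check the Ghostscript step outside the platform, the same command can be reproduced by hand. The following Python sketch assumes that the executor simply substitutes the #{sourceFilePath} and #{targetFilePath} placeholders with real paths; build_gs_command, convert_to_pdf and the sample file names are hypothetical helpers for illustration, not part of Nuxeo.

```python
import shutil
import subprocess
import sys

# Same options as the <parameterString> of the ps2pdf command above.
GS_OPTIONS = ["-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite"]


def build_gs_command(source_path, target_path, windows=False):
    """Return the argument list equivalent to the ps2pdf command (hypothetical helper)."""
    executable = "gswin64c" if windows else "gs"
    return [executable, *GS_OPTIONS, f"-sOutputFile={target_path}", source_path]


def convert_to_pdf(source_path, target_path):
    """Run Ghostscript on one file; raises if gs is missing or the conversion fails."""
    command = build_gs_command(
        source_path, target_path, windows=sys.platform.startswith("win")
    )
    if shutil.which(command[0]) is None:
        raise FileNotFoundError("You need to install GhostScript.")
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the command for a sample file instead of running it.
    print(" ".join(build_gs_command("logo.ai", "logo.pdf")))
```

If the printed command produces a readable PDF from one of your own .ai or .eps files, the command line contribution should behave the same way inside the platform.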

Once it’s done we can register a converter that uses the previous command.

<extension point="converter" target="org.nuxeo.ecm.core.convert.service.ConversionServiceImpl">
  <converter class="org.nuxeo.ecm.platform.convert.plugins.CommandLineConverter" name="ps2pdf">
    <sourceMimeType>application/postscript</sourceMimeType>
    <sourceMimeType>application/eps</sourceMimeType>
    <sourceMimeType>application/x-eps</sourceMimeType>
    <sourceMimeType>image/eps</sourceMimeType>
    <sourceMimeType>image/x-eps</sourceMimeType>
    <sourceMimeType>application/illustrator</sourceMimeType>
    <destinationMimeType>application/pdf</destinationMimeType>
    <parameters>
      <parameter name="CommandLineName">ps2pdf</parameter>
    </parameters>
  </converter>
</extension>

Finally, we’ll take advantage of a little-known feature of the conversion service in the Nuxeo Platform: the ability to chain sub-converters. We’ll chain the ps2pdf converter defined above with the pdf2text converter already registered in the platform.

<extension point="converter" target="org.nuxeo.ecm.core.convert.service.ConversionServiceImpl">
  <converter name="ps2pdf2text">
    <sourceMimeType>application/postscript</sourceMimeType>
    <sourceMimeType>application/eps</sourceMimeType>
    <sourceMimeType>application/x-eps</sourceMimeType>
    <sourceMimeType>image/eps</sourceMimeType>
    <sourceMimeType>image/x-eps</sourceMimeType>
    <sourceMimeType>application/illustrator</sourceMimeType>
    <destinationMimeType>text/plain</destinationMimeType>
    <conversionSteps>
      <subconverter>ps2pdf</subconverter>
      <subconverter>pdf2text</subconverter>
    </conversionSteps>
  </converter>
</extension>

How does the platform know that it must index the text content of these files? It just needs a converter that accepts the source file's MIME type and returns text/plain. The platform will use it to extract the text content and index it.
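
The lookup and the chaining can be pictured with a toy model. The Python sketch below is not Nuxeo's implementation: ToyConversionRegistry, Converter and the two stand-in conversion functions are hypothetical, and only the converter names and MIME types come from the contributions above. It shows a registry that finds a converter by source and destination MIME type and runs a chained converter step by step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Converter:
    name: str
    source_mime_types: List[str]
    destination_mime_type: str
    run: Optional[Callable[[bytes], bytes]] = None  # leaf converters only
    steps: List[str] = field(default_factory=list)  # sub-converter names, in order


class ToyConversionRegistry:
    """Toy stand-in for a conversion service (assumption, not Nuxeo code)."""

    def __init__(self) -> None:
        self._by_name: Dict[str, Converter] = {}

    def register(self, converter: Converter) -> None:
        self._by_name[converter.name] = converter

    def find(self, source_mime: str, destination_mime: str) -> Optional[str]:
        """Return the first converter name matching both MIME types, if any."""
        for converter in self._by_name.values():
            if (source_mime in converter.source_mime_types
                    and converter.destination_mime_type == destination_mime):
                return converter.name
        return None

    def convert(self, name: str, data: bytes) -> bytes:
        """Run a leaf converter, or each sub-converter of a chain in order."""
        converter = self._by_name[name]
        if converter.steps:
            for step in converter.steps:
                data = self.convert(step, data)
            return data
        return converter.run(data)


SOURCES = ["application/postscript", "application/illustrator"]

registry = ToyConversionRegistry()
# Stand-in functions only tag the data so the order of the steps is visible.
registry.register(Converter("ps2pdf", SOURCES, "application/pdf",
                            run=lambda data: data + b"|pdf"))
registry.register(Converter("pdf2text", ["application/pdf"], "text/plain",
                            run=lambda data: data + b"|text"))
registry.register(Converter("ps2pdf2text", SOURCES, "text/plain",
                            steps=["ps2pdf", "pdf2text"]))

name = registry.find("application/illustrator", "text/plain")
print(name)                           # ps2pdf2text
print(registry.convert(name, b"ai"))  # b'ai|pdf|text'
```

In this model the ps2pdf converter alone is never selected for indexing because it returns application/pdf; only the chained ps2pdf2text converter matches a request for text/plain.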


[Screenshots: text content; text content search]

That’s it for the configuration. Your application can now index the text content of Adobe Illustrator and Encapsulated PostScript files!


Tagged: Nuxeo Studio, How to