Nuxeo Vision is a wonderful plugin built on the Google Cloud Vision API. In a previous blog post, we explained how to use it to add safe search detection to your images. In this article, we will explore the OCR capabilities of the Google Vision API with an example: extracting the text from the T-shirt in the following picture. (If you haven’t guessed already, let me tell you that extracting the text was a piece of cake!)

A brief reminder in case you haven’t yet read the previous article (which would make me quite sad; I really put all my heart in it):

- We call the VisionOp operation provided by the plugin.
- It expects a single blob or a list of blobs as input and returns them unchanged.
- The operation outputs the result of the call in a Context Variable (whose name you pass as a parameter), which is an array of VisionResponse.
- Using JavaScript automation makes it easier to handle the results.
- There are limitations (from Google), especially on the size and the format of the image(s) you send to the API.
- As a best practice, calls to the operation should be asynchronous: since you call an external service, anything on the network can go wrong or take time. It is not required, but strongly recommended.

Today, we want to extract the text from an image, which is basically what OCR (Optical Character Recognition) is all about. For this purpose, we will use the "TEXT_DETECTION" feature of the API.

A little warning here: at the time this blog was written, the Google API did not handle TIFF images.

Also, on the documentation page of the plugin, you probably noticed that it already uses the "TEXT_DETECTION" feature and stores the result in the dc:source field. You may want to disable this default behavior (as explained in the plugin’s documentation).

Split Every Part

Using the text detection feature means a typical call to the operation would look like…

// blob is a variable holding the picture
blob = VisionOp(blob, {
  'features': ['TEXT_DETECTION'],
  'maxResults': 5,
  'outputVariable': 'results'
});
// . . . handle ctx.results . . .

…where we get the values returned by the API in the results Context Variable. Since we passed a single blob, we have a single result, which means we access the VisionResponse in ctx.results[0].

Google Vision actually returns a list of EntityAnnotation, one for every piece of text it could detect, and for each of them we have even more information: the raw text and a locale (which indicates the language the text is expressed in). It can happen that your image contains text in different locales, but Google Vision will try to sort this out. Nuxeo Vision encapsulates these values, and here is how it works (see the short sketch after this list):

- We have the VisionResponse as described above.
- We call its getOcrText() method, which returns a list of TextEntity. Each of them basically encapsulates an EntityAnnotation.
- We can then call its getText() method (and/or getLocale()).
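
Here is a minimal sketch of how these objects are navigated in an automation script (the variable names are ours, and results is the Context Variable we ask VisionOp to fill):

// Minimal sketch: walking through the values wrapped by Nuxeo Vision
var response = ctx.results[0];            // the VisionResponse for our single blob
var textEntities = response.getOcrText(); // a list of TextEntity
if(textEntities && textEntities.length) {
  // Each TextEntity wraps an EntityAnnotation
  Console.log(textEntities[0].getText() + " / " + textEntities[0].getLocale());
}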

So, say you want to store the text in a single string field, ocr:text in our example. You would need to concatenate each result. Something like:

function run(input, params) {
  var blob, results, textEntities, finalText, i, max, details;

  // In this example, let's get the medium view if the original size is > 4MB
  blob = input["file:content"];
  if(blob.getLength() > 4194304) {
    Console.log("Getting the medium view)");
    blob = Picture.GetView(input, {'viewName': 'Medium'});
  }
  // Ask for OCR
  blob = VisionOp(blob, {
    'features': ['TEXT_DETECTION'],
    'maxResults': 5,
    'outputVariable': 'results'
  });
  // Get the results, if any
  finalText = null;
  results = ctx.results;
  if(results && results.length) {
    // get the first one, and get all its TextEntity values
    textEntities = results[0].getOcrText();
    if(textEntities && textEntities.length) {
      // Loop for each value and concatenate
      finalText = "";
      max = textEntities.length;
      for(i = 0; i < max; ++i) {
        finalText += textEntities[i].getText() + "\n";
      }
    }
  }
  // Save in the input document
  input["ocr:text"] = finalText;
  input = Document.Save(input, {});

  return input;

}

While running this chain, you will notice you get the same text at least twice. This is because the first element of the list contains the full extracted text, while the other elements are the pieces of text as understood by the API. In the example given above, Vision found 5 texts, one per line, so the final result was:

| Element | Value |
| --- | --- |
| 0 | I CAN’T KEEP CALM I’M A RANGERS FAN |
| 1 | I CAN’T |
| 2 | KEEP CALM |
| 3 | I’M A |
| 4 | RANGERS |
| 5 | FAN |

This means we could make the script a bit simpler: no need to loop, just get the first value and call its getText() method:

. . .
if(results && results.length) {
  textEntities = ctx.results[0].getOcrText();
  finalText = textEntities[0].getText();
}
. . .

And tada!

More OCR Examples

Notice how the text was extracted from a photo of a receipt.

Get Everything

Now, maybe you want to store more information. For example, you want to store every value returned by the API, which includes the text and the locale. Say you have an ocr schema with one field, ocr:details, of type Complex and multivalued. This field has two subfields: text (String) and locale (String):

OCR Schema
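
For reference, here is the kind of value we will end up storing in ocr:details. The values below are purely illustrative (and, as noted at the end of this post, the locale is not necessarily returned for every piece of text):

// Illustrative only: the shape of the data we want to store in ocr:details
var exampleDetails = [
  { "text": "I CAN'T KEEP CALM I'M A RANGERS FAN", "locale": "en" }, // locale value is hypothetical
  { "text": "I CAN'T", "locale": null }
  // . . . one entry per TextEntity . . .
];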

To fill this field (which is, somehow, fun to pronounce - and this has nothing to do with the algorithm, just a quick fun break in the blog), you will loop on the results, building a JavaScript array of objects, each object having two properties whose names are exactly the same as the subfields of ocr:details. You then pass this array to the Document.Update operation:

. . .
var details, entity, i, max, ...
. . .
max = textEntities.length;
details = [];
for(i = 0; i < max; ++i) {
  entity = textEntities[i];
  details.push({
    "text": entity.getText(),
    "locale": entity.getLocale()
  });
}

input = Document.Update(input, {
  'properties': {"ocr:details": details},
  'save': true
});
. . .

And voilà!

Result

You probably noticed we don’t have the locale for each text. According to the Vision API, this is normal.
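
If you would rather not store null values, a small guard in the loop shown earlier does the trick. This is just a sketch, a variation on the previous script, with an empty string as an arbitrary fallback:

. . .
for(i = 0; i < max; ++i) {
  entity = textEntities[i];
  details.push({
    "text": entity.getText(),
    // Fall back to an empty string when no locale is returned
    "locale": entity.getLocale() ? entity.getLocale() : ""
  });
}
. . .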