[Monday Dev Heaven] How to search for PDF files, or any other type for that matter

Mon 19 November 2012 By Laurent Doguin

Search documents with a specific file like a PDF or an MP3

Hi, here's a common need that is not really addressed in Nuxeo's default UI: searching for documents with only PDF as attachments, or any other kind of file for that matter. So today I'm going to show you how to write a widget for the search form. This widget will let the user select any type of file.

The Query

I'm going to start with the NXQL, as I feel it's really the part people usually don't know about. The query part you need is quite simple. Let's say you want to search for audio files stored in file:content. The appropriate query would be:

SELECT * FROM Document WHERE content/mime-type LIKE 'audio%'

There's no schema prefix here, just the complex content type, which is declared as follows:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="content">
    <xs:sequence>
      <xs:element name="encoding" type="xs:string" />
      <xs:element name="mime-type" type="xs:string" />
      <xs:element name="data" type="xs:base64Binary" />
      <xs:element name="name" type="xs:string" />
      <xs:element name="length" type="xs:long" />
      <xs:element name="digest" type="xs:string" />
    </xs:sequence>
  </xs:complexType>
</xs:schema>

So of course you can also search using other metadata like size:

SELECT * FROM Document WHERE content/length > 20111213

or name:

SELECT * FROM Document WHERE content/name LIKE 'filename%'
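
If you build these queries from user input, a small helper keeps the quoting in one place. Here's a minimal Java sketch of that idea. The class and method names are made up for this post, and the escaping rule (a backslash in front of quotes and backslashes) is an assumption you should check against the NXQL documentation for your version:

```java
/** Sketch only: builds the NXQL strings shown above. All names here are hypothetical. */
final class FileQueries {

    private FileQueries() {
    }

    /**
     * Wraps a value in single quotes for an NXQL string literal.
     * Assumption: backslashes and single quotes are escaped with a backslash.
     */
    static String quote(String value) {
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }

    /** Documents whose main file (file:content) has a mime type matching the LIKE pattern. */
    static String byMimeType(String likePattern) {
        return "SELECT * FROM Document WHERE content/mime-type LIKE " + quote(likePattern);
    }

    /** Documents whose main file is larger than the given length. */
    static String largerThan(long length) {
        // content/length is declared as xs:long, so the literal is numeric
        return "SELECT * FROM Document WHERE content/length > " + length;
    }

    /** Documents whose main file name starts with the given prefix. */
    static String byFileNamePrefix(String prefix) {
        // % and _ inside the prefix are not escaped and keep their LIKE meaning
        return "SELECT * FROM Document WHERE content/name LIKE " + quote(prefix + "%");
    }

    public static void main(String[] args) {
        System.out.println(byMimeType("audio%"));
        System.out.println(largerThan(20111213L));
        System.out.println(byFileNamePrefix("filename"));
    }
}
```

The resulting strings can then be handed to whatever query API you already use.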

Now let's say you want all documents with attachments, meaning files stored in files:files. This is how you should do it:

SELECT * FROM Document WHERE files/*1/file/data IS NOT NULL

The question you might be asking yourself now is why is it content/name instead of file:content/name?
From our wiki:

A complex property is a property of a schema containing lists, or subelements or sequences of them.

For complex subproperties, like the length field of the content field of the file schema, you can refer to:

  • content/length for the value of the subproperty.

For simple lists, like dc:subjects, you can refer to:

  • dc:subjects/3 for the 4th element of the list (indexes start at 0),
  • dc:subjects/* for any element of the list,
  • dc:subjects/*1 for any element of the list, correlated with other uses of the same number after *.

Something you can do to understand this a little more is take a look at your Nuxeo SQL database. You'll see there's a table named content and one named dublincore, for instance. Take a look at the NXQL documentation for the details.
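
To make the path notation a bit more concrete, here's a small Java sketch that assembles these paths. It follows the files/.../file/... pattern used in the attachment query above and uses the correlated wildcard (a * followed by a number) so that two conditions apply to the same list entry. The class and method names are hypothetical, and the sketch refuses values containing quotes rather than trying to escape them:

```java
/** Sketch only: assembles NXQL paths for list and complex properties. Names are hypothetical. */
final class ComplexPaths {

    private ComplexPaths() {
    }

    /** A fixed element of a list, for example dc:subjects/3 (indexes start at 0). */
    static String element(String listProperty, int index) {
        return listProperty + "/" + index;
    }

    /** Any element of a list, for example dc:subjects/*. */
    static String anyElement(String listProperty) {
        return listProperty + "/*";
    }

    /** Any element, correlated with other uses of the same number n, for example files/*1/file/name. */
    static String correlated(String listProperty, int n, String subPath) {
        return listProperty + "/*" + n + "/" + subPath;
    }

    /** Documents having one attachment that is a PDF and whose name matches the LIKE pattern. */
    static String pdfAttachmentsNamedLike(String likePattern) {
        if (likePattern.indexOf('\'') >= 0 || likePattern.indexOf('\\') >= 0) {
            throw new IllegalArgumentException("quotes and backslashes are not handled in this sketch");
        }
        // Both conditions use *1, so they must hold for the same entry of files:files
        return "SELECT * FROM Document WHERE "
                + correlated("files", 1, "file/mime-type") + " = 'application/pdf' AND "
                + correlated("files", 1, "file/name") + " LIKE '" + likePattern + "'";
    }

    public static void main(String[] args) {
        System.out.println(element("dc:subjects", 3));
        System.out.println(anyElement("dc:subjects"));
        System.out.println(pdfAttachmentsNamedLike("report%"));
    }
}
```

Without the shared number, each condition could be satisfied by a different attachment of the same document.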

Modify the Advanced Search

Now that we know how to do the search, we can work on the form and its widget. It should offer the user several file types to choose from. Each file type has to match a mime-type or the beginning of a mime-type. So, for instance, PDF will match 'application/pdf' and any audio file will match 'audio%'. This looks a lot like a select widget bound to a directory. The directory will use the usual vocabulary schema with an id and a label: the id holds the mime-type pattern (application/pdf, audio%) and the label holds the name shown to the user (PDF, Audio).


Declare your directory like this:

<extension point="directories" target="org.nuxeo.ecm.directory.sql.SQLDirectoryFactory">
  <directory name="file_types">
    <schema>vocabulary</schema>
    <!-- data source, table and data file settings go here -->
  </directory>
</extension>
<!-- Don't forget this part that makes the directory available in the vocabulary tab of the admin center: -->
<extension point="directories" target="org.nuxeo.ecm.directory.ui.DirectoryUIManager">
  <directory layout="vocabulary" name="file_types" sortField="label"/>
</extension>
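
To see how the id and label pairing is meant to work, here's a minimal Java sketch that models the directory as a map and mimics what LIKE does with these ids. The three entries are illustrative only, the names are hypothetical, and the matcher handles just the two cases used in this post: an exact mime-type, or a prefix followed by a single trailing %.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: an in-memory stand-in for the file_types vocabulary. Entries are illustrative. */
final class FileTypes {

    /** id (mime-type pattern) mapped to its label, like the two vocabulary columns. */
    static final Map<String, String> ENTRIES = new LinkedHashMap<>();

    static {
        ENTRIES.put("application/pdf", "PDF");
        ENTRIES.put("audio%", "Audio");
        ENTRIES.put("video%", "Video");
    }

    private FileTypes() {
    }

    /** Mimics LIKE for patterns that are either exact or end with one trailing %. */
    static boolean matches(String pattern, String mimeType) {
        if (pattern.endsWith("%")) {
            String prefix = pattern.substring(0, pattern.length() - 1);
            return mimeType.startsWith(prefix);
        }
        return mimeType.equals(pattern);
    }

    /** Returns the label of the first entry matching the mime type, or null if none matches. */
    static String labelFor(String mimeType) {
        for (Map.Entry<String, String> entry : ENTRIES.entrySet()) {
            if (matches(entry.getKey(), mimeType)) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(labelFor("audio/mpeg"));
        System.out.println(labelFor("application/pdf"));
        System.out.println(labelFor("text/plain"));
    }
}
```

Using the pattern itself as the id means the value selected in the widget can go straight into the LIKE predicate.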

There's already a built-in widget in Nuxeo called selectOneDirectory. It displays the directory entries in a select box. Take a close look at the field tag: it says search:nature. This search schema is used by the AdvancedSearch document type, which backs the advanced search form. I could add my own metadata to the search schema, but I've decided to reuse the existing nature metadata instead.

<extension point="widgets" target="org.nuxeo.ecm.platform.forms.layout.WebLayoutManager">
  <widget name="search_file_type" type="selectOneDirectory">
    <labels><label mode="any">label.search.type</label></labels>
    <fields><field>search:nature</field></fields>
    <properties mode="any">
      <property name="directoryName">file_types</property>
      <property name="localize">true</property>
    </properties>
  </widget>
</extension>

Now that we have a widget that displays these choices, we need to use it on the advanced search form. If you are familiar with Nuxeo, you know that a form is represented by what we call a layout. So we need to modify the advanced search layout. Go to the layout extension point page on Nuxeo Explorer. One of the latest improvements we added is the ability to search across all the registered contributions. So if you type advanced in the search input, you'll find the advanced search layout contribution. It looks like this:

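<extension point="layouts" target="org.nuxeo.ecm.platform.forms.layout.WebLayoutManager">
  <layout name="advanced_search">
    <templates><template mode="any">/layouts/layout_default_template.xhtml</template></templates>
    <rows><!-- other rows unchanged -->
      <row><!-- <widget>search_nature</widget> --><widget>search_file_type</widget></row>
    </rows>
  </layout>
</extension>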

I've only replaced the search_nature widget with my new search_file_type widget. At this point, we can select the file type we want to look for on the advanced search form. The result will be stored in the search:nature field of the AdvancedSearch doc type. But we still need to change the search query so that it matches the value of search:nature against content/mime-type instead of dc:nature. To do that, we have to modify the advanced_search content view. A Content View is a very important notion in Nuxeo as it represents a list of documents.

A content view is a notion to define all the elements needed to get a list of items and perform their rendering. The most obvious use case is the listing of a folderish document content, where we would like to be able to:

  • define the NXQL query that will be used to retrieve the documents, filtering some of them (documents in the trash for instance)
  • pass on contextual parameters to the query (the current container identifier)
  • define a filtering form to refine the query
  • define what columns will be used for the rendering of the list, and how to display their content
  • handle selection of documents, and actions available when selecting them (copy, paste, delete...)
  • handle sorting and pagination
  • handle caching, and refresh of this cache when a document is created, deleted, modified...

Take a look at the actual advanced search content view definition. You can find it as easily as the previous layout, using Nuxeo Explorer.

<extension point="contentViews" target="org.nuxeo.ecm.platform.ui.web.ContentViewService">
  <contentView name="advanced_search">
    <coreQueryPageProvider>
      <property name="coreSession">#{documentManager}</property>
      <property name="maxResults">DEFAULT_NAVIGATION_RESULTS</property>
      <whereClause docType="AdvancedSearch">
        <predicate operator="FULLTEXT" parameter="ecm:fulltext">
          <field name="fulltext_all" schema="advanced_search"/>
        </predicate>
        <predicate operator="=" parameter="ecm:isCheckedInVersion">
          <field name="isCheckedInVersion" schema="advanced_search"/>
        </predicate>
        <predicate operator="STARTSWITH" parameter="ecm:path">
          <field name="searchpath" schema="advanced_search"/>
        </predicate>
        <predicate operator="FULLTEXT" parameter="dc:title">
          <field name="title" schema="advanced_search"/>
        </predicate>
        <predicate operator="FULLTEXT" parameter="dc:description">
          <field name="description" schema="advanced_search"/>
        </predicate>
        <predicate operator="LIKE" parameter="dc:rights">
          <field name="rights" schema="advanced_search"/>
        </predicate>
        <predicate operator="LIKE" parameter="dc:source">
          <field name="source" schema="advanced_search"/>
        </predicate>
        <predicate operator="IN" parameter="dc:nature">
          <field name="nature" schema="advanced_search"/>
        </predicate>
        <predicate operator="IN" parameter="dc:coverage">
          <field name="coverage" schema="advanced_search"/>
        </predicate>
        <predicate operator="IN" parameter="dc:subjects">
          <field name="subjects" schema="advanced_search"/>
        </predicate>
        <predicate operator="BETWEEN" parameter="dc:created">
          <field name="created_min" schema="advanced_search"/>
          <field name="created_max" schema="advanced_search"/>
        </predicate>
        <predicate operator="BETWEEN" parameter="dc:modified">
          <field name="modified_min" schema="advanced_search"/>
          <field name="modified_max" schema="advanced_search"/>
        </predicate>
        <predicate operator="BETWEEN" parameter="dc:issued">
          <field name="issued_min" schema="advanced_search"/>
          <field name="issued_max" schema="advanced_search"/>
        </predicate>
        <predicate operator="BETWEEN" parameter="dc:valid">
          <field name="valid_min" schema="advanced_search"/>
          <field name="valid_max" schema="advanced_search"/>
        </predicate>
        <predicate operator="BETWEEN" parameter="dc:expired">
          <field name="expired_min" schema="advanced_search"/>
          <field name="expired_max" schema="advanced_search"/>
        </predicate>
        <predicate operator="LIKE" parameter="dc:format">
          <field name="format" schema="advanced_search"/>
        </predicate>
        <predicate operator="LIKE" parameter="dc:language">
          <field name="language" schema="advanced_search"/>
        </predicate>
        <predicate operator="!=" parameter="ecm:currentLifeCycleState">
          <field name="currentLifeCycleState" schema="advanced_search"/>
        </predicate>
        <fixedPart>
          ecm:mixinType != 'HiddenInNavigation' AND
          ecm:isCheckedInVersion = 0
        </fixedPart>
      </whereClause>
      <!-- sort column="dc:title" ascending="true" / sort by fulltext relevance -->
    </coreQueryPageProvider>
    <searchLayout name="advanced_search"/>
    <resultLayouts>
      <layout iconPath="/icons/document_listing_icon.png" name="search_listing_ajax" showCSVExport="true" showPDFExport="false" showSyndicationLinks="true" title="document_listing" translateTitle="true"/>
      <layout iconPath="/icons/document_listing_compact_2_columns_icon.png" name="document_virtual_navigation_listing_ajax_compact_2_columns" showCSVExport="true" showPDFExport="false" showSyndicationLinks="true" title="document_listing_compact_2_columns" translateTitle="true"/>
      <layout iconPath="/icons/document_listing_icon_2_columns_icon.png" name="document_virtual_navigation_listing_ajax_icon_2_columns" showCSVExport="true" showPDFExport="false" showSyndicationLinks="true" title="document_listing_icon_2_columns" translateTitle="true"/>
    </resultLayouts>
    <actions category="CURRENT_SELECTION_LIST"/>
  </contentView>
</extension>


As you can see, a content view groups together a lot of information. What I want to modify here is one predicate in the whereClause of my coreQueryPageProvider. Here's the little modification I made:

<predicate operator="LIKE" parameter="content/mime-type">
  <!-- <predicate operator="IN" parameter="dc:nature"> -->
  <field name="nature" schema="advanced_search"/>
</predicate>

The predicate still reads the nature field, whose value is now set by the widget we defined earlier. This predicate used to add dc:nature IN ('selectedNature1', 'selectedNature2') to the where clause. Now it adds content/mime-type LIKE 'widgetValue'. Once you've done this, the advanced search is modified and you can search for documents containing PDF files, or any file matching a particular mime type. I won't go into detail about the content view configuration and internals. It would be too long here, and I'd be repeating the content of the awesome doc made by Anahide. Just remember that every serious Nuxeo developer has to know how it works.
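
If it helps to picture what the predicate change does to the generated query, here's a rough Java sketch of the before and after. This is not how Nuxeo builds its where clause internally. It's only a hypothetical illustration of the two fragments mentioned above, with made-up names and the same assumed backslash escaping for quotes:

```java
import java.util.List;
import java.util.stream.Collectors;

/** Sketch only: renders the two where-clause fragments discussed above. Not Nuxeo's code. */
final class PredicateSketch {

    private PredicateSketch() {
    }

    /** Assumption: backslashes and single quotes are escaped with a backslash. */
    static String quote(String value) {
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }

    /** Original predicate: the field holds a list of values, rendered with IN. */
    static String in(String parameter, List<String> values) {
        String joined = values.stream()
                .map(PredicateSketch::quote)
                .collect(Collectors.joining(", "));
        return parameter + " IN (" + joined + ")";
    }

    /** Modified predicate: the field holds one pattern, rendered with LIKE. */
    static String like(String parameter, String pattern) {
        return parameter + " LIKE " + quote(pattern);
    }

    public static void main(String[] args) {
        System.out.println(in("dc:nature", List.of("selectedNature1", "selectedNature2")));
        System.out.println(like("content/mime-type", "audio%"));
    }
}
```

Only the operator and the parameter change. The field that feeds the predicate stays the same, which is why the widget can keep writing to search:nature.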

To conclude, this approach has some limitations that are a little frustrating. The first one to me is that you cannot search for more than one mime type at once. I would like to do queries like content/mime-type IN ('audio%','application/pdf','video%'). But the IN operator only works with an exact match. Another thing is that LIKE queries with % can be very expensive. Last, I would like to search for files that are slide decks other than PPT, Keynote and maybe PDF. So the current model has limitations for that. I'll try to bring solutions to these problems in the next blog post. In the meantime, take care and happy coding!

Category: Product & Development
Tagged: Java, Monday Dev Heaven