GSA - what is it?

According to the Google product description, the "Google Search Appliance (GSA) is an integrated hardware and software search solution that brings the ease of Google search to intranets and websites".


The GSA performs three different tasks:



  1. Locating content

    This can be done by crawling, a common pull process used to index static web sites.


    For ECM repositories like Nuxeo DM, the GSA uses a connector framework to locate documents. Connectors issue queries to their repositories and feed the results to the GSA for indexing; this is a push process.



  2. Indexing content

    Once retrieved, the content is processed to extract and index the full text.



  3. Serving search results

    The search appliance returns the relevant results to the user, mixing content from many sources.




The connector framework


The central part of this framework is the connector manager. It manages the creation, instantiation, scheduling and monitoring of connectors that supply content and provide authentication and authorization services to the GSA.


The connector manager is not part of the GSA: it is an open source web application that must be deployed on a Tomcat server. It can host different connectors to access multiple sources such as file systems, databases or ECM repositories.


The Nuxeo GSA Connector comes with an embedded connector manager, so you don't need to set up another web application.


Indexation


The traversal is done chunk by chunk.
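
To illustrate what a chunk-by-chunk traversal can look like, here is a minimal sketch. It assumes the repository can be listed in a stable order and that the last submitted document id serves as a checkpoint. The function names and the checkpoint format are hypothetical and are not the connector's actual API.

```python
from typing import List, Optional, Tuple


def next_chunk(doc_ids: List[str], batch_size: int,
               checkpoint: Optional[str] = None) -> Tuple[List[str], Optional[str]]:
    """Return the batch of ids that follows `checkpoint`, plus the new checkpoint.

    Assumption: `doc_ids` is in a stable order and the checkpoint is simply
    the id of the last document handed over in the previous batch.
    """
    start = 0 if checkpoint is None else doc_ids.index(checkpoint) + 1
    batch = doc_ids[start:start + batch_size]
    # Keep the old checkpoint when there is nothing left to traverse.
    new_checkpoint = batch[-1] if batch else checkpoint
    return batch, new_checkpoint


def traverse_all(doc_ids: List[str], batch_size: int) -> List[List[str]]:
    """Walk the whole id list one chunk at a time and collect the chunks."""
    chunks: List[List[str]] = []
    checkpoint: Optional[str] = None
    while True:
        batch, checkpoint = next_chunk(doc_ids, batch_size, checkpoint)
        if not batch:
            break
        chunks.append(batch)
    return chunks
```

Because each call only needs the previous checkpoint, a traversal written this way can be interrupted and resumed without starting over.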


The security policy


By default the Nuxeo GSA Connector declares the content as private. This means that the GSA's access control integrates with Nuxeo security policies so that users only see search results if they have access to the source content.


Searching


For subsequent requests, the authentication phase is skipped.


The connector can also be configured as public. In this case there are no authentication or authorization requests: the Nuxeo content is public and unrestricted.


What goes into the index?


The Nuxeo GSA Connector is responsible for submitting documents and meta-data. This data extraction is customizable. The default process is:



  • By default it submits any visible Nuxeo Document type. A document type list defined in the connector contribution can be used as a filter.

  • A Nuxeo document can be mapped into multiple Google records (or Google documents). For instance, a File document with one attached file will be returned as two Google records: one for the Nuxeo File document with its meta-data and one for the attached file.

  • The default meta-data exposed are the following:

    • title: the dublincore title

    • description: the dublincore description

    • contributors: the dublincore contributors

    • path: the document path

    • readacl: the read access control list
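
The mapping described above can be sketched as follows. This is an illustration only: the dictionary layout, the `to_records` name and the way attachment ids are built are assumptions made for the example, not the connector's real data model. The sketch also assumes that an attachment record carries the same meta-data as its parent document.

```python
from typing import Dict, List


def to_records(doc: Dict) -> List[Dict]:
    """Map one Nuxeo-like document (a plain dict here) to a list of records.

    Hypothetical layout: one record for the document itself, then one
    record per attached file.
    """
    metadata = {
        "title": doc["title"],
        "description": doc.get("description", ""),
        "contributors": doc.get("contributors", []),
        "path": doc["path"],
        "readacl": doc.get("readacl", []),
    }
    records = [{"docid": doc["id"], "kind": "document", "metadata": metadata}]
    for index, attachment in enumerate(doc.get("files", [])):
        records.append({
            # Assumed id scheme, used only to keep the records distinct.
            "docid": f"{doc['id']}/file/{index}",
            "kind": "attachment",
            "mimetype": attachment["mimetype"],
            "metadata": metadata,
        })
    return records
```

With this sketch, a File document holding one attachment yields two records, which matches the example given in the list.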




The feed submitted to the GSA looks like this:


<record url="googleconnector://nuxeo-connector.localhost/doc?docid=0e02d8df-d95e-4754-93f4-00477151ac56" displayurl="http://localhost:8080/nuxeo/nxdoc/default/0e02d8df-d95e-4754-93f4-00477151ac56/view_documents" mimetype="text/plain" last-modified="Fri, 17 Dec 2010 09:45:41 +0100" authmethod="httpbasic">
<metadata>
<meta name="google:feedid" content="9ae252fd51734325808b6a5ce44dc556"/>
<meta name="contributors" content="user13"/>
<meta name="readacl" content="members"/>
<meta name="readacl" content="Administrator"/>
<meta name="readacl" content="members"/>
<meta name="google:ispublic" content="false"/>
<meta name="description" content="FLNXTEST Chloreus cyanos nothos bradus lutea mono pedis, arvensis ad vulgaris aquam."/>
<meta name="google:title" content="Flnxtest qzvba domesticus novaeseelandiae minor dermis domesticus ODT"/>
<meta name="path" content="#default-domain#workspaces#FLNXTEST Bench workspace#FLNXTEST Bench folder#Flnxtest qzvba domesticu#"/>
<meta name="readaclstr" content="#members#Administrator#members#"/>
<meta name="google:mimetype" content="text/plain"/>
<meta name="google:displayurl" content="http://localhost:8080/nuxeo/nxdoc/default/0e02d8df-d95e-4754-93f4-00477151ac56/view_documents"/>
<meta name="google:lastmodified" content="2010-12-17"/>
</metadata>
<content encoding="base64compressed">
eJxNjDEOwjAMRXdO4QOgXgLohGAgA6tbW22kxKZ2EgGnJ92Y/v9PX29M8i7sBbZvmxBIcx9xrg6iDdmZEwpFZMhR1IDYcvT/3/0cDuP19g
yXR4DTmtS40/mDorukrD0mQ+ow1cIIWUXhxRT9CGiNxbsQCVpNC9ret4p5+AGQlDlW
</content>
</record>
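
The content element in the record above uses the base64compressed encoding, which means the payload is zlib-compressed and then Base64-encoded. The "eJ" prefix of the sample is consistent with a zlib header. Here is a minimal round-trip sketch using only the Python standard library; the helper names are made up for this example, and it uses its own sample text rather than the payload printed above.

```python
import base64
import zlib


def encode_content(text: str) -> str:
    """Compress with zlib, then Base64-encode, as base64compressed expects."""
    compressed = zlib.compress(text.encode("utf-8"))
    return base64.b64encode(compressed).decode("ascii")


def decode_content(payload: str) -> str:
    """Reverse encode_content; whitespace and line wraps in the payload are ignored."""
    compact = "".join(payload.split())
    return zlib.decompress(base64.b64decode(compact)).decode("utf-8")
```

Stripping whitespace before decoding matters in practice because long payloads are often wrapped over several lines, as in the record shown here.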

Note that we expose a readacl field. It represents the internal read Access Control List (ACL). It can be used at a higher level in the GSA to filter results by security at search time (early stage binding).
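
To make the idea of filtering on the exposed ACL concrete, here is a small sketch that works on the readaclstr value from the feed. It assumes the "#" delimiters are there so that a principal can be matched as a whole token ("#members#") rather than as a loose substring. The function names and the result layout are hypothetical; on a real appliance this kind of filtering would be expressed in the search request, not in client code.

```python
from typing import Dict, Iterable, List


def can_read(readaclstr: str, principals: Iterable[str]) -> bool:
    """True if at least one principal appears as a whole '#'-delimited token."""
    return any(f"#{principal}#" in readaclstr for principal in principals)


def visible_results(results: List[Dict], principals: Iterable[str]) -> List[Dict]:
    """Keep only the results whose readaclstr grants access to the user."""
    allowed = list(principals)
    return [r for r in results if can_read(r["readaclstr"], allowed)]
```

The delimiters prevent a user or group named "member" from matching a document that is readable by "members".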


This is the default extraction. If it does not fit your needs, you can contribute a new extractor and decide what will be indexed.


Ready to try it for yourself?


You can install the Nuxeo GSA Connector and get started playing with it in 15 minutes or less:



  1. Download and install Nuxeo Document Management

  2. Sign up for a 30-day trial of Nuxeo Online Services, for access to all Nuxeo Marketplace packages

  3. Install the Nuxeo GSA Connector via the Update Center in Nuxeo Online Services


For the developer types, here's the source code.


As a software engineer, it has been an enriching experience to build the GSA Connector for the Nuxeo platform. This yellow box is very impressive and it was a pleasure to have a chance to play with it.


-- Benoit Delbosc