As you may know, VCS stands for Visible Content Store. It means that Nuxeo stores its documents in a SQL database, using a table for each schema or complex property type.


This enables various things:


  • Put indices on some columns to tune particular queries

  • Access the tables to query data directly, as in the sketch below.
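
For instance, here is a minimal sketch of both ideas on the dublincore schema table (the index name and the WHERE clause are made up for the example):

[sql]-- Hypothetical index to speed up queries filtering on the title column
CREATE INDEX dublincore_title_idx ON dublincore (title);

-- Reading schema data directly, bypassing the Nuxeo API
SELECT id, title, modified
FROM dublincore
WHERE creator = 'Administrator';
[/sql]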


In this post, I will show you something else I did on a project: filling a schema with data directly in SQL. I will use PostgreSQL for this example, but it should also work with other backends.

Querying nxp_logs can be difficult


The problem here is to compute some statistics from the nxp_logs table, which holds the audit data. Each row of the table holds the id of the document (log_doc_uuid), the category of the event (log_event_category) and the id of the event (log_event_id).

If we want to know how many times a given document has been modified, it's quite simple in SQL:

[sql]select count(*) from nxp_logs
where log_event_id = 'documentModified'
and log_doc_uuid = '01431fe7-9154-4522-92eb-0dc0d8caa241'
[/sql]

Now, if we want to have this type of data for all documents, a simple GROUP BY statement will do the trick.

[sql]select log_doc_uuid, count(*) from nxp_logs
where log_event_id = 'documentModified'
group by log_doc_uuid
[/sql]

The real problem comes when you want statistics for many different log_event_ids across many documents. No single simple query will give us all the stats on one line per document. What we want is a table that holds this data:






id      cnt_modified  cnt_locked  cnt_checkin  last_event_date
docid1  4             0           2            2013-05-15 13:42
docid2  3             1           0            2013-05-12 17:35

Moreover, when querying a large document repository (several million rows in nxp_logs, for example), it can make your application server suffer a lot!

The solution


We want a statistics table, so let's create it:

[sql]CREATE TABLE doc_stats
(
id character varying(36) NOT NULL,
cnt_modified bigint,
cnt_locked bigint,
cnt_checkin bigint,
last_event_date timestamp without time zone,
CONSTRAINT doc_stats_pk PRIMARY KEY (id)
)
[/sql]

Now we have to fill it with some values. As we said, it's not easy to find a single direct query to do it, so we will begin by creating a view on the nxp_logs table to get some data ready:

[sql]CREATE OR REPLACE VIEW vw_doc_stats AS
SELECT dc.id as id, last_date.lastevent as last_event_date, events.log_event_id as log_event_id, events.cnt as cnt
FROM dublincore dc,
-- Extract last_event_date
( SELECT log.log_doc_uuid, max(log.log_event_date) AS lastevent
FROM nxp_logs log
GROUP BY log.log_doc_uuid) last_date,
-- Extract count of events
( SELECT log.log_doc_uuid, log.log_event_id, count(log.log_doc_uuid) AS cnt
FROM nxp_logs log
WHERE log.log_event_id IN ('documentCheckin', 'documentModified', 'documentLocked')
GROUP BY log.log_event_id, log.log_doc_uuid) events
WHERE dc.id = last_date.log_doc_uuid AND dc.id = events.log_doc_uuid
ORDER BY last_date.lastevent DESC, dc.id;
[/sql]

The execution of this view will give you several rows per document -- one for each eventId.






id      last_event_date   log_event_id      cnt
docid1  2013-05-15 13:42  documentModified  4
docid1  2013-05-15 13:42  documentCheckin   2

The next step is to pivot that data into our doc_stats table. We will do it with a stored procedure that iterates over the vw_doc_stats view:

[sql]CREATE OR REPLACE FUNCTION nx_update_doc_stats() RETURNS integer AS $$
DECLARE
stat RECORD;
BEGIN
RAISE NOTICE 'Refreshing Doc stats';
TRUNCATE TABLE doc_stats;

-- One row per document, counters initialized to zero
INSERT INTO doc_stats (id, last_event_date, cnt_modified, cnt_locked, cnt_checkin)
SELECT DISTINCT id, last_event_date, 0, 0, 0 FROM vw_doc_stats;

-- The view only returns the three event ids handled below
FOR stat IN SELECT * FROM vw_doc_stats LOOP
    CASE stat.log_event_id
      WHEN 'documentModified' THEN
        UPDATE doc_stats SET cnt_modified = stat.cnt WHERE id = stat.id;
      WHEN 'documentCheckin' THEN
        UPDATE doc_stats SET cnt_checkin = stat.cnt WHERE id = stat.id;
      WHEN 'documentLocked' THEN
        UPDATE doc_stats SET cnt_locked = stat.cnt WHERE id = stat.id;
    END CASE;
END LOOP;

RAISE NOTICE 'Done refreshing Doc stats table.';
RETURN 1;

END;
$$ LANGUAGE plpgsql;
[/sql]

That's it! Now calling SELECT nx_update_doc_stats(); will update your doc_stats table. It takes PostgreSQL about 10 seconds to update the stats table for about 10k documents.

In order to launch this command periodically, you can use:


  • a cron task: very efficient and simple, but it adds some packaging work

  • the Nuxeo scheduler: same efficiency, and it's included in the existing packaging (see the sketch below).
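
As an illustration, here is what a Nuxeo scheduler contribution could look like. Treat it as a sketch: the schedule and event ids are hypothetical, and you would still have to write a listener bound to that event which runs SELECT nx_update_doc_stats(); through a JDBC connection.

[xml]
<extension target="org.nuxeo.ecm.core.scheduler.SchedulerService" point="schedule">
  <!-- Hypothetical schedule firing every night at 3 am -->
  <schedule id="updateDocStats">
    <username>Administrator</username>
    <eventId>updateDocStatsEvent</eventId>
    <cronExpression>0 0 3 * * ?</cronExpression>
  </schedule>
</extension>
[/xml]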

The icing on the cake


Yeah, that's very cool: we have a SQL table that holds our statistics, but… how can we use it now? One solution could be to use raw JDBC and a native query. But since we are on top of VCS, we can do something smarter.

In fact, document schemas are stored in SQL tables. They have an id column that references the id of the fragment in the hierarchy table, and other columns that map to the schema properties. The doc_stats table was created with exactly that structure, so we will be able to map a schema onto it:

[xml]
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:nxs="http://www.nuxeo.org/ecm/project/schemas/doc_stats"
  targetNamespace="http://www.nuxeo.org/ecm/project/schemas/doc_stats">

  <xs:element name="last_event_date" type="xs:date" />
  <xs:element name="cnt_modified" type="xs:integer" />
  <xs:element name="cnt_locked" type="xs:integer" />
  <xs:element name="cnt_checkin" type="xs:integer" />

</xs:schema>
[/xml]

Now let's bind this schema to a dynamic facet:

[xml]
<component name="org.nuxeo.ecm.quota.core.types">

  <extension target="org.nuxeo.ecm.core.schema.TypeService" point="schema">
    <schema name="doc_stats" prefix="dstats"
      src="schemas/doc_stats.xsd" />
  </extension>

  <extension target="org.nuxeo.ecm.core.schema.TypeService" point="doctype">
    <facet name="DocStats">
      <schema name="doc_stats" />
    </facet>
  </extension>

</component>
[/xml]

With all of this configuration in place, it's very easy to get the statistics of a given document:

[java]DocumentModel doc = session.getDocument(new PathRef("/default-domain/workspaces/myDoc"));
doc.addFacet("DocStats");
Long cntModified = doc.getProperty("dstats:cnt_modified").getValue(Long.class);
[/java]

VCS cache problem


There is still a small problem here. Since VCS has an internal cache mechanism, it will cache the values and won't see the updates we make directly in SQL. We have to find a hack to invalidate the VCS cache… at the SQL level.

When set up in cluster mode, VCS creates two tables that hold the invalidation state. The cluster_nodes table holds the list of all cluster nodes, and the cluster_invals table holds the list of fragments to invalidate for each node. When one node updates a document, it puts an entry in that table to tell the other nodes to invalidate their own data for that document.

We will set up our server in cluster mode, and our stored procedure will act as a cluster node, telling every node to invalidate its cache for the doc_stats fragment. In fact, it's quite simple with this SQL query:

[sql]
INSERT INTO cluster_invals (nodeid,id,fragments,kind)
SELECT nodeid,id, '{doc_stats}',1 from cluster_nodes,doc_stats;
[/sql]

This query makes a cross join between cluster_nodes and doc_stats and inserts the result into the invalidation table.

Adding this query at the end of our procedure will tell VCS to invalidate the cache for that data, as sketched below.
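
For clarity, here is a minimal sketch of what the end of nx_update_doc_stats() would look like with the invalidation added (this assumes the server runs in cluster mode, so the cluster tables exist):

[sql]-- ... end of nx_update_doc_stats(), after the update loop:

-- Tell every cluster node to invalidate its cache for the doc_stats fragment
INSERT INTO cluster_invals (nodeid, id, fragments, kind)
SELECT nodeid, doc_stats.id, '{doc_stats}', 1 FROM cluster_nodes, doc_stats;

RAISE NOTICE 'Done refreshing Doc stats table.';
RETURN 1;
[/sql]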

Deployment


In order to deploy our view and stored procedure, since Nuxeo 5.7 there is a way to run SQL DDL statements when the Nuxeo instance starts. It's just another repository configuration option. The documentation is here.
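
As a sketch of what such a contribution could look like (the exact element names should be checked against the documentation, and the file path is an assumption):

[xml]
<extension target="org.nuxeo.ecm.core.repository.RepositoryService" point="repository">
  <repository name="default">
    <!-- Assumption: a sqlInitFile element pointing to a file that holds our
         DDL (CREATE TABLE, CREATE VIEW, CREATE FUNCTION), run at startup -->
    <sqlInitFile>ddl/doc_stats.sql</sqlInitFile>
  </repository>
</extension>
[/xml]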

Conclusion and possible enhancements


In this post, we saw how to query the audit table and consolidate its data into a dedicated document schema. In this case the data came from inside Nuxeo, but external systems could also feed it, with an ETL tool for instance.

Here we update the data with a scheduler, which means it won't be fully up to date between two runs; for statistics, that's usually not important. To improve on that, we could imagine a trigger on the audit table doing much the same job for each inserted row, as sketched below.
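
As a closing illustration, here is a rough, untested sketch of that trigger idea, handling only the documentModified counter (the function and trigger names are made up):

[sql]CREATE OR REPLACE FUNCTION nx_doc_stats_on_insert() RETURNS trigger AS $$
BEGIN
  IF NEW.log_event_id = 'documentModified' THEN
    -- Keep the counter and the last event date incrementally up to date
    UPDATE doc_stats
       SET cnt_modified = cnt_modified + 1,
           last_event_date = NEW.log_event_date
     WHERE id = NEW.log_doc_uuid;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER doc_stats_on_insert AFTER INSERT ON nxp_logs
FOR EACH ROW EXECUTE PROCEDURE nx_doc_stats_on_insert();
[/sql]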