Advanced CMIS


Wed 02 December 2009 By Florent Guillaume

The upcoming CMIS standard is approaching its final 1.0 version, and I thought I would take the time to present some of its most advanced features.


Basics


I will not detail here the basics of the CMIS domain model, but I will mention quickly for completeness:




  • CMIS stores folders, documents and relationships (collectively called objects),

  • each object has a unique id,

  • objects have "object types" detailing the properties they're allowed to have,

  • properties have the usual basic "property types" (strings, numbers, dates, lists, etc.),

  • you can create, retrieve, update and delete objects (CRUD),

  • documents may have an associated content stream (an attachment),

  • you can search documents using a SQL-based language,

  • clients talk to CMIS servers using AtomPub or SOAP.


Below I will detail the more advanced features of CMIS.


Unfiling, Multi-filing

While most people are used to storing documents inside a navigation tree, where the intermediate tree nodes are folders, there are other ways to deal with content, which CMIS exposes through the concepts of "unfiling" and "multi-filing" (the term "filing" expresses the idea that a document is stored in a place, much like in the real world).


The first alternative way of storing a document is to not file it anywhere: the document is not held in any folder, it simply exists, and is then said to be unfiled. The document is not lost, however: given its id you can retrieve its properties and content stream, and if you don't know its id you can search for it using relevant criteria.


This model of unfiled documents is quite common in the world of records management, where what is important is the "record" (the content and metadata), and not a folder in which it may live. The record itself carries all the metadata you need to find it (dates, keywords, tags, etc.), and instead of listing "what's in a given folder", you can list records according to simple or complex search criteria.


The second alternative way provided by CMIS to store a document is to allow it to live in several folders at the same time: this is called multi-filing. It's another way of organizing content, and can be quite powerful.


Multi-filing is often used to organize documents in folders along several axes, where a folder represents a criterion and the presence of a document in a folder reflects the fact that the criterion applies to the document. Multi-filing can also be used to express "publishing" concepts (publishing a document in several categories means just filing it in different folders, each folder representing a category).


Both of these features are optional in a CMIS repository.
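
To make the two filing modes concrete, here is a minimal in-memory sketch. It is not a CMIS client: the `ToyRepository` and `Document` classes, their method names and the folder ids are all hypothetical, and only loosely echo the CMIS operations for adding an object to a folder and removing it from one. The only idea it illustrates is that a document carries a set of parent folders, which may be empty (unfiled) or hold several entries (multi-filed).

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Toy document: an id plus the ids of the folders it is filed in."""
    object_id: str
    parent_folder_ids: set = field(default_factory=set)

    def is_unfiled(self) -> bool:
        return len(self.parent_folder_ids) == 0

    def is_multi_filed(self) -> bool:
        return len(self.parent_folder_ids) > 1


class ToyRepository:
    """Hypothetical in-memory stand-in for a repository."""

    def __init__(self):
        self.documents = {}

    def create_document(self, object_id, folder_id=None):
        # Passing no folder creates the document unfiled.
        doc = Document(object_id)
        if folder_id is not None:
            doc.parent_folder_ids.add(folder_id)
        self.documents[object_id] = doc
        return doc

    def add_to_folder(self, object_id, folder_id):
        self.documents[object_id].parent_folder_ids.add(folder_id)

    def remove_from_folder(self, object_id, folder_id):
        # The document keeps existing even when its last parent is removed.
        self.documents[object_id].parent_folder_ids.discard(folder_id)

    def children(self, folder_id):
        return sorted(d.object_id for d in self.documents.values()
                      if folder_id in d.parent_folder_ids)


repo = ToyRepository()
record = repo.create_document("doc-1")                  # unfiled
report = repo.create_document("doc-2", "by-year-2009")  # filed once
repo.add_to_folder("doc-2", "by-topic-finance")         # now multi-filed
```

An unfiled document such as `doc-1` never shows up in any `children()` listing, but it is still reachable through `repo.documents["doc-1"]`, which mirrors retrieval by id.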



Renditions

In content management systems, it's quite common for a document to have different renditions. A rendition is an alternate way of viewing or representing a master document. For instance from an OpenDocument file you may derive a PDF rendition, a 100x140-pixel image rendition of the cover page, a Microsoft Word rendition, a rendition as a series of high-resolution images for each page, an HTML rendition, a pure text rendition, an MP3 rendition of the content as spoken text, etc. From a video document you may get an H.264 rendition, a Flash rendition, a 64x64-pixel image rendition, a rendition as a series of 320x200-pixel images every 10 seconds of the video, an MP3 rendition of the audio stream, a pure text rendition of the speech extracted from that audio stream, a text rendition of the extracted subtitles, etc.


CMIS doesn't expose any way to create or control these renditions (it's too complex, and up to the content management system to decide what they are), but it exposes a way to discover and retrieve them. Documents and folders can both have renditions, each rendition being seen as an alternate content stream.


Renditions have rudimentary metadata, including a MIME type, a width and height (recognizing that renditions are often visually oriented), a title, and a "kind" which is used to categorize the renditions. CMIS only defines one standard kind, the thumbnail, but more could be added in future versions of the specification. The fact that it's useful for a folder to have a thumbnail or an icon is the reason why folders are allowed to have renditions while they can't have a normal content stream.
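
The rendition metadata described above can be pictured as a small record, and discovery as a filter over a list of such records. The sketch below is only an illustration: the `Rendition` class and `select_renditions` helper are hypothetical, the `example:` kinds are made up, and the `cmis:thumbnail` identifier for the standard thumbnail kind is my assumption about how the specification spells it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rendition:
    """Toy rendition metadata: kind, MIME type, title, optional size."""
    kind: str
    mime_type: str
    title: str
    width: Optional[int] = None
    height: Optional[int] = None


def select_renditions(renditions, kind=None, mime_prefix=None):
    """Return the renditions matching an optional kind and MIME prefix."""
    selected = []
    for r in renditions:
        if kind is not None and r.kind != kind:
            continue
        if mime_prefix is not None and not r.mime_type.startswith(mime_prefix):
            continue
        selected.append(r)
    return selected


# Hypothetical renditions exposed by a repository for one document.
renditions = [
    Rendition("cmis:thumbnail", "image/png", "Cover thumbnail", 100, 140),
    Rendition("example:pdf", "application/pdf", "PDF version"),
    Rendition("example:text", "text/plain", "Plain text"),
]

thumbnails = select_renditions(renditions, kind="cmis:thumbnail")
text_only = select_renditions(renditions, mime_prefix="text/")
```

A client that only knows the standard kind can still display a thumbnail, while ignoring any repository-specific kinds it does not understand.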


Rendition support is optional (and in any case it's the repository that decides what renditions to expose for each object).



Versioning


In CMIS a document (if its type supports it) can be versioned, which means that "old" versions are retained by the system. A version can be "major" or not, but CMIS doesn't impose any semantics on this distinction; it's just a useful abstraction. To create new versions, a checkin/checkout model is used: checking out a version creates a private working copy (PWC), which can be modified and then checked back in, creating a new version.
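
As a rough sketch of the checkin/checkout cycle, the toy class below keeps a list of versions and at most one PWC. Everything in it is an assumption made for illustration: the `VersionSeries` class, its method names, and especially the "1.0", "1.1", "2.0" labels, since CMIS itself does not say what major and minor should mean. It also only checks out from the latest version, which is just one of the possible models.

```python
import copy


class VersionSeries:
    """Toy version history with a single private working copy (PWC)."""

    def __init__(self, properties):
        self.versions = [
            {"label": "1.0", "major": True, "properties": dict(properties)}
        ]
        self.pwc = None

    def check_out(self):
        if self.pwc is not None:
            raise RuntimeError("already checked out")
        # The PWC starts as a copy, so the checked-out version is untouched.
        self.pwc = copy.deepcopy(self.versions[-1]["properties"])
        return self.pwc

    def cancel_check_out(self):
        # Discard the PWC without creating a version.
        self.pwc = None

    def check_in(self, major=False):
        if self.pwc is None:
            raise RuntimeError("not checked out")
        major_n, minor_n = (int(x) for x in self.versions[-1]["label"].split("."))
        # Hypothetical labelling scheme, not mandated by CMIS.
        label = f"{major_n + 1}.0" if major else f"{major_n}.{minor_n + 1}"
        self.versions.append(
            {"label": label, "major": major, "properties": self.pwc}
        )
        self.pwc = None
        return label


series = VersionSeries({"title": "Draft"})
pwc = series.check_out()
pwc["title"] = "Final"                 # edit the private working copy
label = series.check_in(major=False)   # creates a new minor version
```

After the checkin, the first version still holds the title "Draft" and the new version holds "Final", which is the point of retaining old versions.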


Here the model gets complex because in the real world there are many ways in which versioning can be done.


In the most complete scenario, the repository allows read and write access to all versions, including the PWC, and allows all versions and the PWC to be searched. The versions can also be filed independently in the same (or different) folders, several versions being then accessible at the same time.


This model can be restricted by the CMIS repository in various ways. The repository can specify that:




  • only the latest version may be accessible or searchable, not the older versions nor the PWC,

  • a PWC may be checked out from only the latest version,

  • a PWC may not be updatable at all, only checked back in with some modifications in a single operation,

  • a checkout may not be allowed at all, in which case new versions may be created only by applying an update to an existing version; this leaves the existing version unchanged but creates a new version holding the updated data (this is called auto-versioning),

  • all the versions of a given document are held in the same folder (this is called version-independent filing, the opposite is called version-specific filing),

  • only a single version of a document (the latest version or latest major version) may be filed in a folder, the other versions being "hidden" (not filed); when new versions or new major versions are created they automatically replace the previous one filed in the folder (this is another aspect of version-independent filing).



Given this wide variation of capabilities, having a generic client that understands all the versioning models will certainly be a challenge, but this is the cost of having interoperability with many systems that have different ideas of what versioning should look like.




Security through ACLs

Being able to access documents is the basis of content management, but in existing systems this access is often restricted by various permissions that depend on the user doing the action. The permission systems implemented by content repositories are extremely varied (even more than for versioning), and even though CMIS cannot hope to model them all in an interoperable manner, it's been recognized that some minimal operations can be agreed upon.


In order to work with permissions, a basic (and optional) set of permission management operations has been defined, based on access control lists (ACLs). The ACL on a document is a list of basic assignments of permissions to users, defining what they can do on this document.


CMIS defines three basic permissions: Read, Write, and All. It's up to each repository to define the exact semantics of these permissions, but they are common enough that a client should be able to work with them easily even if the details are unknown to it: a client can easily tell a user whether they will have the right to modify a document.


If a client really needs it, however, the CMIS repository exposes exactly what individual CMIS operations are allowed for each of these permissions. A repository can also define additional non-standard permissions, and using the same mechanism tell a client what operations will be allowed for each. In this manner, a client can discover in advance the restrictions placed on a document.
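
The idea of a repository advertising which operations each permission allows can be sketched as a simple lookup table. The table below is deliberately simplified and entirely hypothetical: the permission keys, the `example:publish` permission and the choice of operations under each are made up for the example, and a real repository exposes its own mapping in its own format.

```python
# Hypothetical mapping from permission to the operations it allows.
PERMISSION_MAPPING = {
    "read": {"getProperties", "getContentStream"},
    "write": {"getProperties", "getContentStream",
              "updateProperties", "setContentStream"},
    "all": {"getProperties", "getContentStream",
            "updateProperties", "setContentStream",
            "deleteObject", "applyACL"},
    # A made-up repository-specific permission, discovered the same way.
    "example:publish": {"addObjectToFolder"},
}


def allowed_operations(granted_permissions, mapping=PERMISSION_MAPPING):
    """Union of the operations allowed by each granted permission."""
    allowed = set()
    for permission in granted_permissions:
        allowed |= mapping.get(permission, set())
    return allowed


def can(operation, granted_permissions, mapping=PERMISSION_MAPPING):
    """Tell in advance whether an operation would be allowed."""
    return operation in allowed_operations(granted_permissions, mapping)


reader_can_edit = can("updateProperties", ["read"])
writer_can_edit = can("updateProperties", ["write"])
```

With such a table a client can grey out an "Edit" button for a read-only user before ever attempting the update.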


Optionally, a repository may allow a client to not only check but also change the ACL on a document, so that for instance other users are given rights to modify it, or instead disallowed from even seeing it.


ACLs are often more complex than just a list of permissions given to users on a document. For example, many systems have inheritance of ACLs, which means that an ACL applied to a folder has an effect on the documents filed in that folder, and also on other documents further down the tree. Other systems have more complex rules. A CMIS repository can tell a client which of these three models (object-only, with inheritance, or completely repository-specific) it uses. When retrieving the ACL effective on a document, a repository can also tell a client if the ACL has really been set directly on that document, or if it has somehow been derived from inherited ACLs or through more complex policies.
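
Here is a small sketch of the "with inheritance" model, including the direct versus derived distinction. It assumes the simplest possible setup: each object has at most one parent, permissions only accumulate down the tree, and there are no deny rules. The `Ace` class, the `effective_acl` function and the sample users are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ace:
    """One access control entry, flagged as direct or inherited."""
    principal: str
    permission: str
    direct: bool


def effective_acl(object_id, parents, local_acls):
    """Collect entries set on the object, then on each ancestor in turn.

    parents maps an object id to its parent id (None at the root);
    local_acls maps an object id to (principal, permission) pairs set on it.
    """
    entries = []
    seen = set()
    current = object_id
    direct = True
    while current is not None:
        for principal, permission in local_acls.get(current, []):
            if (principal, permission) not in seen:
                seen.add((principal, permission))
                entries.append(Ace(principal, permission, direct))
        current = parents.get(current)
        direct = False  # everything found further up is inherited
    return entries


parents = {"doc": "folder", "folder": "root", "root": None}
local_acls = {
    "root": [("everyone", "read")],
    "folder": [("alice", "write")],
    "doc": [("bob", "write"), ("alice", "write")],
}
acl = effective_acl("doc", parents, local_acls)
```

In this example the entry for `alice` is reported as direct because it is also set on the document itself, while the `everyone` entry is only inherited from the root.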



Change Log

It's important for external search services, caching systems or synchronization engines to be able to know what has happened in a repository since their "last visit". To that end, CMIS has an (optional) change log service that can be queried to discover the operations performed in the repository since a given point in time.


The change log service returns a list of basic operations that have happened in the repository: object creation, modification or deletion, as well as security changes on an object. For modification operations, the repository may also include the new values of properties set on that object.


The change log can be queried starting from a given point in time, represented by an opaque "change log token", which a client should request from the repository whenever it checkpoints its state. The repository will later be capable of returning all the changes made since that time.


If the repository cannot record all its history since it was created, the change log may be "incomplete"; in that case it may not be possible to get a change log starting from very old change log tokens. However, when a repository returns changes from a supported change log token, all the changes up to the current moment must be returned: no intermediate changes can be lost.
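
The token mechanism and the "incomplete" case can be sketched with a toy log that keeps only its most recent entries. The `ChangeLog` class and its methods are hypothetical, and so is the `t<number>` token format: a real client must treat tokens as opaque and never parse them the way this toy server side does.

```python
class ChangeLog:
    """Toy change log that may forget old entries (an incomplete log)."""

    def __init__(self, max_entries=None):
        self._entries = []      # (sequence, change_type, object_id)
        self._next_seq = 1
        self._max_entries = max_entries

    def record(self, change_type, object_id):
        self._entries.append((self._next_seq, change_type, object_id))
        self._next_seq += 1
        if self._max_entries is not None and len(self._entries) > self._max_entries:
            # Drop the oldest entries; the log becomes incomplete.
            del self._entries[: len(self._entries) - self._max_entries]

    def latest_token(self):
        # Hypothetical token format, meaningful only to this toy log.
        return f"t{self._next_seq - 1}"

    def changes_since(self, token):
        seq = int(token[1:])
        oldest_kept = self._entries[0][0] if self._entries else self._next_seq
        if seq + 1 < oldest_kept:
            # Some changes after this token were forgotten: refuse rather
            # than return a list with holes in it.
            raise LookupError("change log token too old")
        changes = [(t, o) for s, t, o in self._entries if s > seq]
        return changes, self.latest_token()


log = ChangeLog(max_entries=3)
checkpoint = log.latest_token()
log.record("created", "doc-1")
log.record("updated", "doc-1")
changes, next_checkpoint = log.changes_since(checkpoint)
```

A synchronization client would store `next_checkpoint` and pass it back on its next visit; if the token has become too old, it has to fall back to a full resynchronization.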





Conclusion










I hope that this overview of its advanced features has convinced you that CMIS is a worthwhile standard, that many powerful things can be done with it, and that many vendors will soon be using it for interoperability. Nuxeo is committed to CMIS, and we'll be releasing a new version of our CMIS connector, supporting the latest 1.0cd04 draft, in a few days.


Final approval of CMIS 1.0 is expected in early-to-mid 2010. In the meantime, the Public Review of CMIS is still under way: please read the spec, implement it, and give feedback!



