April 2009

CMIS – Is XPath Just A Bit Too Tricksy?

Sara, Sara,
Whatever made you want to change your mind?
Sara, Sara,
So easy to look at, so hard to define.

So What Is CMIS?

CMIS (or the Content Management Interoperability Services) looks pretty sweet and, according to the general buzz, is going to do well. The recent demo at AIIM, and the examples created by various vendors, have made the effort feel very alive. For the uninitiated, CMIS is a lowest common denominator API which can sit on top of any Enterprise Content Management repository. It has many practical uses, including a search across multiple ECM repositories, workflows and processes that span repositories and, my favourite, an ECM Mashup. I’m not going to go into the detail, as that has been done better than I ever could in other places. For those that don’t like detail, however, I’m going to try to summarise over 150 pages of specification in a paragraph. Look at the picture below, then take a deep breath.
[Figure: Overview of CMIS architecture]
CMIS provides a logical data model to represent an ECM repository with four base types: documents (versionable content objects with an optional binary attachment), folders (which contain other folders or documents), relationships (between folders and documents) and policies (for security, retention, workflow or anything else). Content objects are strongly typed, and a repository can specify its own types by subtyping (adding properties to) one of the four base types. The folders form a directed acyclic graph and a document can live in zero (unfiled), one or many (multi-filed) folders. The API provides basic create, read, update and delete (CRUD) operations, the ability to navigate the relationships in the graph, and the ability to search the repository using a combination of structured property searches and unstructured full text searches. The query language is based on SQL. Not all implementations need to support all of the CMIS features, and the API provides a means for a client application to interrogate a repository to discover the features which it supports. CMIS aims to be independent of transport mechanisms for the interrogating clients. For the first release, the implementation must support both SOAP-based Web Services and REST/AtomPub (APP). WebDAV was not included.
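To make the filing rules in that paragraph concrete, here is a minimal Python sketch of documents and folders. It is only an illustration of the idea, not the specification's object model: the class names, the field names and the helper parents_of are all made up for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CmisObject:
    """Hypothetical base class: an id, a type name and a bag of properties."""
    object_id: str
    type_id: str
    properties: Dict[str, object] = field(default_factory=dict)


@dataclass
class Folder(CmisObject):
    # A folder contains other folders or documents.
    children: List[CmisObject] = field(default_factory=list)


@dataclass
class Document(CmisObject):
    # The optional binary attachment.
    content: Optional[bytes] = None


def parents_of(obj: CmisObject, folders: List[Folder]) -> List[Folder]:
    """Return every folder that directly contains obj (zero, one or many)."""
    return [f for f in folders if any(child is obj for child in f.children)]


# A multi-filed document lives in two folders; an unfiled one lives in none.
root = Folder("f1", "folder")
finance = Folder("f2", "folder")
memo = Document("d1", "document", {"color": "green"})
orphan = Document("d2", "document")
root.children += [finance, memo]
finance.children.append(memo)
```

Here parents_of(memo, [root, finance]) finds both folders, while the unfiled document has no parents at all.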
If you’re new to CMIS and want more than a paragraph explaining what it is, I’d recommend the following starting points:

I should also mention here that there is a fair bit of debate about whether CMIS is really RESTful. It’s hard to argue with Roy Fielding who defined the term. But a misuse of the term REST hasn’t stopped other APIs from gaining enormous popularity. This posting is only going to talk about Part I of the specification (the Domain Model), not Part II (the SOAP and APP bindings).

CMIS and the Java Content Repository

The CMIS goal has a lot in common with the Java Content Repository (JSR 170/283). The main difference that everyone cites is the fact that the JCR interface is Java-only, while CMIS allows access to any implementation. CMIS is much simpler, which may give it the kiss of life that the more functionally rich JCR seems to lack. CMIS is accessed via a remote API, while the JCR is accessed via Java methods, but I don’t think this difference is fundamental. The CMIS specification could have added a remote HTTP access protocol on top of the JCR to overcome the differences mentioned. Most of the contributors to the CMIS specification were also involved in the JCR, so the fact that this didn’t happen suggests to me that they felt something else was amiss. Get your chainsaws out ’cause I’m going to go out on a limb here and suggest that the main difference between CMIS and the JCR API lies in the query language choice – XPath versus SQL. Note that SQL is an optional extra in the JCR spec, while XPath currently isn’t an option in CMIS.

Below are some of my thoughts based on my limited understanding of the standard. I’m positive that the points I’m going to raise have been discussed in the CMIS committee meetings, and equally positive that I’m completely wrong on all of this. I’ve probably missed something obvious too. But I would be extremely grateful if those in the know could point me to any resources that answer the questions I have. My googling skills didn’t unearth anything.

The CMIS Relational View

In order to understand the choice of query language, one must first understand the virtual relational view of a CMIS repository. This consists of a collection of virtual tables. A virtual table exists for every queryable object type (content type if you prefer) in the repository. Each row in these virtual tables corresponds to an instance of the corresponding object type (or of one of its subtypes). A column exists for every property that the object type has. The figure below, taken from the specification, tries to explain this. Someone needs to draw a prettier version of this.
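A rough Python sketch of that mapping may help. The type names, properties and sample objects below are invented for illustration, and virtual_table is a hypothetical helper rather than anything defined by CMIS. The point is only that a subtype instance shows up as a row in its supertype's table, restricted to the supertype's columns.

```python
# Hypothetical type hierarchy: type -> parent type (None for a base type).
TYPE_PARENT = {"document": None, "invoice": "document", "folder": None}

# Hypothetical properties declared by each type; subtypes inherit the parent's.
TYPE_COLUMNS = {
    "document": ["name", "color"],
    "invoice": ["amount"],
    "folder": ["name"],
}

# Made-up repository content.
OBJECTS = [
    {"type": "document", "name": "memo.txt", "color": "green"},
    {"type": "invoice", "name": "inv-1.pdf", "color": "red", "amount": 120},
    {"type": "folder", "name": "Finance"},
]


def is_a(type_id, ancestor):
    """True if type_id is ancestor or one of its subtypes."""
    while type_id is not None:
        if type_id == ancestor:
            return True
        type_id = TYPE_PARENT[type_id]
    return False


def columns_of(type_id):
    """Inherited columns first, then the type's own."""
    cols = []
    while type_id is not None:
        cols = TYPE_COLUMNS[type_id] + cols
        type_id = TYPE_PARENT[type_id]
    return cols


def virtual_table(type_id):
    """One column per property, one row per instance of the type or a subtype."""
    cols = columns_of(type_id)
    rows = [
        tuple(obj.get(c) for c in cols)
        for obj in OBJECTS
        if is_a(obj["type"], type_id)
    ]
    return cols, rows
```

Calling virtual_table("document") returns both the memo and the invoice, but only their name and color columns; virtual_table("invoice") returns the invoice alone, with the extra amount column.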


Aside for another posting: One thing worth thinking about would have been the benefit of implementing an entity-attribute (or “skinny table”) relational view instead of a table per object. While the SQL queries against such a repository are dog-ugly, they would have many of the benefits of XPath I think.
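For the curious, the skinny-table idea might look something like the sketch below. It uses Python's sqlite3 module purely as a stand-in for a relational store, with a made-up props table and made-up data; none of this comes from the CMIS specification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per (object, property) pair instead of one table per object type.
conn.execute(
    "CREATE TABLE props (object_id TEXT, type_id TEXT, name TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO props VALUES (?, ?, ?, ?)",
    [
        ("d1", "document", "name", "memo.txt"),
        ("d1", "document", "color", "green"),
        ("f1", "folder", "name", "Finance"),
        ("f1", "folder", "color", "green"),
        ("d2", "invoice", "name", "inv-1.pdf"),
        ("d2", "invoice", "color", "red"),
    ],
)


def find_by_property(name, value):
    """Find objects of ANY type with the given property value, in one query."""
    rows = conn.execute(
        "SELECT DISTINCT object_id FROM props "
        "WHERE name = ? AND value = ? ORDER BY object_id",
        (name, value),
    )
    return [r[0] for r in rows]


def names_of_green_objects():
    """The ugly part: every additional property costs another self-join."""
    rows = conn.execute(
        "SELECT n.value FROM props AS c "
        "JOIN props AS n ON n.object_id = c.object_id "
        "WHERE c.name = 'color' AND c.value = 'green' AND n.name = 'name' "
        "ORDER BY n.value"
    )
    return [r[0] for r in rows]
```

A single query spans every type and keeps working when a new type is added, which is the XPath-like benefit; the price is the self-join needed for each property you want back.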

So Why not XPath?

When reading the spec, I kept asking myself why the query language chosen is based on SQL (called CMIS SQL or CSQL), not XPath like the JCR. The only reference I found to XPath and CMIS was in Russ Danner’s excellent blog. He says:

CMIS also specifies a SQL like query language. Unlike previously proposed standards that pushed XQUERY and XPATH, CMIS is adopting a well understood paradigm which I believe will only encourage its adoption.

So this is the opposite view to mine. My instinct screams XPath to me for the following reasons:

  • XPath is a more natural way to search a hierarchy. ANSI SQL doesn’t provide this functionality although extensions like Oracle CONNECT BY and SQL Server’s Common Table Expressions make it possible. Most implementations would involve creating a denormalised table which flattened the graph. The CMIS specification adds 2 extensions (IN_FOLDER and IN_TREE) which I find slightly smelly.
  • An XPath result set can naturally contain content of different types. In the SQL model, each content type needs to be added to the result set, which would usually mean a whole lot of UNIONs while ensuring that the columns selected from each virtual table are the same. Except that CSQL doesn’t do UNIONs. For example, if I want to find all objects that are green, I far prefer
    XPATH: /root//*[@color='green']
    to an equivalent CSQL query, which I’d have to issue for every object type as UNIONs are not supported. If I am querying multiple types I need to send multiple queries, and I can’t use ORDER BY across the types.
  • With the standard as it is, I can achieve content of different types in the same SQL resultset, but only if they are sub-types. I’m not a big fan of deep type inheritance trees. And most of the content repositories that I deal with don’t support content type inheritance natively.
  • I don’t need to change my XPath Queries if a new content type is added to the repository. Using SQL, I could dynamically generate the queries by using reflection (getTypes and getTypeDefinition) on the repository, but that’s an extra step.
  • The Navigation Services API calls could probably be replaced with more XPath queries: calls such as getChildren, getDescendants, getObjectParents and getFolderParents.
  • Once I retrieve a document, I’d like to get the “folder breadcrumb”. I didn’t see an obvious way to do this. I think a multi-filed document might need the concept of a “primary folder” or, at least, an ordering of the folder-document containment relationship.
  • The pagination functionality feels slightly unwieldy, sending SKIPCOUNT and MAXCOUNT as optional parameters to the query function. Not that this is solved by XPath, but I thought I’d mention it anyway.
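The "find everything green" comparison from the list above can be sketched in Python. This is only an illustration under several assumptions: the repository is faked as a small XML tree and as three unrelated SQLite tables, the type names and data are invented, and the TYPES list stands in for whatever type discovery (getTypes) would return. SQLite itself does support UNION; the loop simply mimics a query language that doesn't.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Made-up repository content, viewed as a tree.
REPO_XML = """
<root>
  <folder name="Finance" color="green">
    <invoice name="inv-1.pdf" color="red"/>
    <document name="memo.txt" color="green"/>
  </folder>
</root>
"""

# XPath-style: one expression, any element type, any depth.
root = ET.fromstring(REPO_XML)
xpath_hits = [e.get("name") for e in root.findall(".//*[@color='green']")]

# SQL-style: one virtual table per type (treated here as unrelated types).
conn = sqlite3.connect(":memory:")
TYPES = ["document", "folder", "invoice"]  # stand-in for type discovery
for type_name in TYPES:
    conn.execute(f"CREATE TABLE {type_name} (name TEXT, color TEXT)")
conn.execute("INSERT INTO folder VALUES ('Finance', 'green')")
conn.execute("INSERT INTO invoice VALUES ('inv-1.pdf', 'red')")
conn.execute("INSERT INTO document VALUES ('memo.txt', 'green')")


def green_objects():
    """Without UNION: one query per type, then merge and sort client-side."""
    hits = []
    for type_name in TYPES:
        rows = conn.execute(
            f"SELECT name FROM {type_name} WHERE color = ?", ("green",)
        )
        hits += [r[0] for r in rows]
    return sorted(hits)  # ORDER BY across types has to happen in the client
```

Both approaches find the same two objects, but the SQL-style version has to know the list of types up front and has to do its own merging and ordering.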

Maybe XPath is just too hard for the vendors?

So why didn’t they go with XPath? I think the biggest hint comes from a design goal in the CMIS specification:

However, it is an explicit goal that CMIS will NOT require major product changes or significant data model changes like other standards such as JSR 170 have required.

Wearing my developer hat, I think the API would be more useful if I could interrogate it using XPath. However, from the point of view of the ECM vendor based on a relational database, maybe implementing an XPath search on their repository is just too damn hard! One would think that the vendors that support JCR already have done most of the heavy lifting. But not many vendors have implemented it, and many of those that have use Apache Jackrabbit.

So, in summary, my theory is that an XPath based query language is very difficult for the vendors to implement. Which means developers of CMIS clients are gonna have to bite the bullet and use CSQL. Which is still going to be great, and means we’re going to get far more CMIS enabled repositories than JCR ones. Which hopefully means the Day CMIS PlugFest is going to be a very busy event. But I do so love XPath, and here’s hoping that it makes it into a later version of the specification.

Rock on, CMIS.
