CMIS – Is XPath Just A Bit Too Tricksy?
Whatever made you want to change your mind?
So easy to look at, so hard to define.
So What Is CMIS?
- Dr David Choy from EMC (who really knows his stuff!) gives a friendly overview on YouTube: Part I and Part II
- The Alfresco CMIS wiki – http://wiki.alfresco.com/wiki/CMIS
- Kas Thomas from CMS Watch on CMIS
- The EMC CMIS Page
- The CMIS and Interoperability – AIIM 2009 presentation from Alfresco on SlideShare – very useful
- Laurence Hart’s Word of Pie blog
- [13 Apr 2009 UPDATED VERSION] Read version 0.6 of the specification (it’s longer than you think but a lot of prose)
I should also mention here that there is a fair bit of debate about whether CMIS is really RESTful. It’s hard to argue with Roy Fielding who defined the term. But a misuse of the term REST hasn’t stopped other APIs from gaining enormous popularity. This posting is only going to talk about Part I of the specification (the Domain Model), not Part II (the SOAP and APP bindings).
CMIS and the Java Content Repository
The CMIS goal has a lot in common with the Java Content Repository (JSR 170/283). The main difference that everyone cites is the fact the JCR interface is Java only, while CMIS allows access to any implementation. CMIS is much simpler, which may give it the kiss of life that the more functionally rich JCR seems to lack. CMIS is accessed via a remote API, while the JCR is accessed via Java methods, but I don’t think this difference is fundamental. The CMIS specification could have added a remote HTTP access protocol on top of the JCR to overcome the differences mentioned. Most of the contributors to the CMIS specification were also involved in the JCR, so the fact that this didn’t happen suggests to me that they felt something else was amiss. Get your chainsaws out ’cause I’m going to go out on a limb here and suggest that the main difference between CMIS and the JCR API lies in the query language choice – XPath versus SQL. Note that SQL is an optional extra in the JCR spec, while XPath currently isn’t an option in CMIS.
Below are some of my thoughts based on my limited understanding of the standard. I’m positive that the points I’m going to raise have been discussed in the CMIS committee meetings, and equally positive that I’m completely wrong on all of this. I’ve probably missed something obvious too. But I would be extremely grateful if those in the know could point me to any resources that answer the questions I have. My googling skills didn’t unearth anything.
The CMIS Relational View
In order to understand the choice of query language, one must first understand the virtual relational view of a CMIS repository. This consists of a collection of virtual tables. A virtual table exists for every queryable object type (content type if you prefer) in the repository. Each row in a virtual table corresponds to an instance of that object type (or of one of its subtypes). A column exists for every property that the object type has. The figure below, taken from the specification, tries to explain this. Someone needs to draw a prettier version of this.
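To make the virtual table idea concrete, here is a minimal Python sketch of how I picture it. Everything in it is my own invention for illustration: the type names (document, invoice), the property names and the virtual_table helper are assumptions, not anything defined by the specification.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ObjectType:
    """A hypothetical object type; property names become column names."""
    name: str
    properties: List[str]
    parent: Optional["ObjectType"] = None

    def all_properties(self) -> List[str]:
        # Columns include the properties inherited from parent types.
        inherited = self.parent.all_properties() if self.parent else []
        return inherited + self.properties

    def is_a(self, other: "ObjectType") -> bool:
        # True if this type is `other` or one of its subtypes.
        current: Optional[ObjectType] = self
        while current is not None:
            if current is other:
                return True
            current = current.parent
        return False


@dataclass
class ContentObject:
    type: ObjectType
    values: Dict[str, Any]


def virtual_table(object_type: ObjectType, repository: List[ContentObject]):
    """Return (columns, rows): one row per instance of the type or a subtype."""
    columns = object_type.all_properties()
    rows = [
        tuple(obj.values.get(column) for column in columns)
        for obj in repository
        if obj.type.is_a(object_type)
    ]
    return columns, rows


# Hypothetical types and content, purely for illustration.
document = ObjectType("document", ["name"])
invoice = ObjectType("invoice", ["amount"], parent=document)
repository = [
    ContentObject(document, {"name": "readme"}),
    ContentObject(invoice, {"name": "inv-1", "amount": 100}),
]

doc_columns, doc_rows = virtual_table(document, repository)
inv_columns, inv_rows = virtual_table(invoice, repository)
```

Note how the invoice shows up in the document table too, but only with the columns that document defines. That is the subtype behaviour described above.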
Aside for another posting: One thing worth thinking about would have been the benefit of implementing an entity-attribute (or “skinny table”) relational view instead of a table per object. While the SQL queries against such a repository are dog-ugly, they would have many of the benefits of XPath I think.
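For the curious, this is roughly what I mean by a skinny table, sketched with Python's built-in sqlite3 module. The table names, column names and sample rows are all made up for the example; this is not a layout that CMIS or any vendor prescribes.

```python
import sqlite3

# Hypothetical entity-attribute layout: one row per object, one row per property value.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE object (id TEXT PRIMARY KEY, type TEXT);
    CREATE TABLE property (object_id TEXT, name TEXT, value TEXT);
    """
)
conn.executemany(
    "INSERT INTO object VALUES (?, ?)",
    [("1", "brochure"), ("2", "image"), ("3", "brochure")],
)
conn.executemany(
    "INSERT INTO property VALUES (?, ?, ?)",
    [
        ("1", "color", "green"),
        ("1", "title", "Spring"),
        ("2", "color", "green"),
        ("3", "color", "red"),
    ],
)


def find_by_property(connection, name, value):
    """Find objects of any type with the given property value, in one query."""
    cursor = connection.execute(
        """
        SELECT o.id, o.type
        FROM object o
        JOIN property p ON p.object_id = o.id
        WHERE p.name = ? AND p.value = ?
        ORDER BY o.id
        """,
        (name, value),
    )
    return cursor.fetchall()


green_objects = find_by_property(conn, "color", "green")
```

One query returns objects of different types, and it can be ordered across them. The price is a join per property you filter on, which is where the dog-ugliness comes from.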
So Why not XPath?
When reading the spec, I kept asking myself why the query language chosen is based on SQL (called CMIS SQL or CQL), not XPath like the JCR. The only reference I found to XPath and CMIS was in Russ Danner’s excellent blog. He says:
CMIS also specifies a SQL like query language. Unlike previously proposed standards that pushed XQUERY and XPATH, CMIS is adopting a well understood paradigm which I believe will only encourage its adoption.
So this is the opposite view to mine. My instinct screams XPath to me for the following reasons:
- XPath is a more natural way to search a hierarchy. ANSI SQL doesn’t provide this functionality although extensions like Oracle CONNECT BY and SQL Server’s Common Table Expressions make it possible. Most implementations would involve creating a denormalised table which flattened the graph. The CMIS specification adds 2 extensions (IN_FOLDER and IN_TREE) which I find slightly smelly.
- An XPath result set can naturally contain content of different types. In the SQL model, each content type needs to be added to the result set, which would usually mean a whole lot of UNIONs while ensuring that the columns selected from each virtual table are the same. Except that CSQL doesn’t do UNIONs. For example, if I want to find all objects that are green, I would far prefer a single XPath expression to
SQL: SELECT * FROM OBJ_TYPE_ONE WHERE ( IN_TREE( , ‘ID000XXXXXXXX’) ) AND ( ‘green’ = ANY COLOR)
which I’d have to issue for every object type, as UNIONs are not supported. If I am querying multiple types I need to send multiple queries, and I can’t use ORDER BY across the types.
- With the standard as it is, I can achieve content of different types in the same SQL resultset, but only if they are sub-types. I’m not a big fan of deep type inheritance trees. And most of the content repositories that I deal with don’t support content type inheritance natively.
- I don’t need to change my XPath Queries if a new content type is added to the repository. Using SQL, I could dynamically generate the queries by using reflection (getTypes and getTypeDefinition) on the repository, but that’s an extra step.
- The Navigation Services API calls could probably be replaced with more XPath queries: for example, calls such as getChildren, getDescendants, getObjectParents and getFolderParents.
- Once I retrieve a document, I’d like to get the “folder breadcrumb”. I didn’t see an obvious way to do this. I think a multi-filed document might need the concept of a “primary folder” or, at least, an ordering of the folder-document containment relationship.
- The pagination functionality feels slightly unwieldy, sending SKIPCOUNT and MAXCOUNT as optional parameters to the query function. Not that this is solved by XPath, but I thought I’d mention it anyway.
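Here is a rough Python sketch of the difference I’m getting at with the green query in the list above. Big assumptions apply: I’m pretending the repository is an XML tree (CMIS does not expose one like this), the type names, the color attribute and the folder id are invented, and the generated SQL just mirrors the shape of the query quoted above with the blank qualifier left out. I haven’t run it against any repository. The XPath side uses the limited XPath subset in Python’s xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# A hypothetical repository modelled as an XML tree, purely for illustration.
REPOSITORY = ET.fromstring(
    """
    <folder name="root">
      <folder name="marketing">
        <brochure name="spring" color="green"/>
        <image name="logo" color="green"/>
      </folder>
      <invoice name="inv-1" color="white"/>
    </folder>
    """
)


def find_green(root):
    """One XPath expression finds green objects of every type below root."""
    matches = root.findall(".//*[@color='green']")
    return [(node.tag, node.get("name")) for node in matches]


def per_type_queries(type_names, folder_id):
    """Without UNION, the SQL route needs one query string per object type."""
    return [
        f"SELECT * FROM {type_name} "
        f"WHERE IN_TREE('{folder_id}') AND 'green' = ANY COLOR"
        for type_name in type_names
    ]


green = find_green(REPOSITORY)
queries = per_type_queries(["brochure", "image", "invoice"], "ID000XXXXXXXX")
```

The XPath version doesn’t care how many types exist or whether a new one gets added tomorrow. The SQL version needs the list of type names up front (from getTypes, say), one round trip per type, and the merging and ordering of results is left to the client.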
Maybe XPath is just too hard for the vendors?
So why didn’t they go with XPath? I think the biggest hint comes from the design goal in the CMIS specification:
However, it is an explicit goal that CMIS will NOT require major product changes or significant data model changes like other standards such as JSR 170 have required.
Wearing my developer hat, I think the API would be more useful if I could interrogate it using XPath. However, from the point of view of an ECM vendor whose product is based on a relational database, maybe implementing an XPath search on their repository is just too damn hard! One would think that the vendors that support JCR have already done most of the heavy lifting. But not many vendors have implemented it, and many of those that have use Apache Jackrabbit.
So, in summary, my theory is that an XPath based query language is very difficult for the vendors to implement. Which means developers of CMIS clients are gonna have to bite the bullet and use CSQL. Which is still going to be great, and means we’re going to get far more CMIS enabled repositories than JCR ones. Which hopefully means the Day CMIS PlugFest is going to be a very busy event. But I do so love XPath, and here’s hoping that it makes it into a later version of the specification.
Rock on, CMIS.