April 2009

CMIS – Is XPath Just A Bit Too Tricksy?

Sara, Sara,
Whatever made you want to change your mind?
Sara, Sara,
So easy to look at, so hard to define.
- SARA

So What Is CMIS?

CMIS (or the Content Management Interoperability Services) looks pretty sweet and, according to the general buzz, is going to do well. The recent demo at AIIM and the examples created by various vendors have made the effort feel very alive. For the uninitiated, CMIS is a lowest common denominator API which can sit on top of any Enterprise Content Management repository. It has many practical uses, including a search across multiple ECM repositories, workflows and processes that span repositories and, my favourite, an ECM Mashup. I’m not going to go into the detail, as that has been done better than I ever could in other places. For those that don’t like detail, however, I’m going to try to summarise over 150 pages of specification in a paragraph. Look at the picture below, then take a deep breath.
[Figure: Overview of CMIS architecture]
CMIS provides a logical data model to represent an ECM repository with four base types: documents (versionable content objects with an optional binary attachment), folders (which contain other folders or documents), relationships (between folders and documents) and policies (for security, retention, workflow or anything else). Content objects are strongly typed, and a repository can specify its own types by subtyping (adding properties to) one of the four base types. The folders form a directed acyclic graph and a document can live in zero (unfiled), one or many (multi-filed) folders. The API provides basic create, read, update and delete (CRUD) operations, the ability to navigate the relationships in the graph, and the ability to search the repository using a combination of structured property searches and unstructured full text searches. The query language is based on SQL. Not all implementations need to support all of the CMIS features, and the API provides a means for a client application to interrogate a repository to discover the features which it supports. CMIS aims to be independent of transport mechanisms for the interrogating clients. For the first release, the implementation must support both SOAP based Web Services and REST/AtomPub (APP). WebDAV was not included.
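To make the filing rules in that paragraph concrete, here is a minimal sketch in Python. It is not the CMIS API: the class names, the `parents_of` helper and the sample ids are all invented for illustration, and relationships and policies are left out. It only shows that a document can sit in zero, one or many folders.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CmisObject:
    object_id: str
    type_id: str                                   # a base type or a repository-defined subtype
    properties: dict = field(default_factory=dict)


@dataclass
class Folder(CmisObject):
    children: list = field(default_factory=list)   # sub-folders and documents


@dataclass
class Document(CmisObject):
    content: Optional[bytes] = None                # optional binary attachment


def parents_of(target, folders):
    """Return every folder that directly contains `target` (hypothetical helper)."""
    return [f for f in folders if any(c is target for c in f.children)]


root = Folder("f0", "folder")
news = Folder("f1", "folder")
press = Folder("f2", "folder")
root.children += [news, press]

report = Document("d1", "document", {"title": "Q1 report"})
memo = Document("d2", "document", {"title": "Memo"})   # unfiled: lives in no folder
news.children.append(report)
press.children.append(report)                          # multi-filed: one document, two folders

all_folders = [root, news, press]
print([f.object_id for f in parents_of(report, all_folders)])  # ['f1', 'f2']
print(parents_of(memo, all_folders))                            # []
```

Because `report` has two parents and `memo` has none, any notion of "the path to a document" is ambiguous in this model, which matters for the query discussion later on.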
If you’re new to CMIS and want more than a paragraph explaining what it is, I’d recommend the following starting points:

I should also mention here that there is a fair bit of debate about whether CMIS is really RESTful. It’s hard to argue with Roy Fielding who defined the term. But a misuse of the term REST hasn’t stopped other APIs from gaining enormous popularity. This posting is only going to talk about Part I of the specification (the Domain Model), not Part II (the SOAP and APP bindings).

CMIS and the Java Content Repository

The CMIS goal has a lot in common with the Java Content Repository (JSR 170/283). The main difference that everyone cites is the fact the JCR interface is Java only, while CMIS allows access to any implementation. CMIS is much simpler, which may give it the kiss of life that the more functionally rich JCR seems to lack. CMIS is accessed via a remote API, while the JCR is accessed via Java methods, but I don’t think this difference is fundamental. The CMIS specification could have added a remote HTTP access protocol on top of the JCR to overcome the differences mentioned. Most of the contributors to the CMIS specification were also involved in the JCR, so the fact that this didn’t happen suggests to me that they felt something else was amiss. Get your chainsaws out ’cause I’m going to go out on a limb here and suggest that the main difference between CMIS and the JCR API lies in the query language choice – XPath versus SQL. Note that SQL is an optional extra in the JCR spec, while XPath currently isn’t an option in CMIS.

Below are some of my thoughts based on my limited understanding of the standard. I’m positive that the points I’m going to raise have been discussed in the CMIS committee meetings, and equally positive that I’m completely wrong on all of this. I’ve probably missed something obvious too. But I would be extremely grateful if those in the know could point me to any resources that answer the questions I have. My googling skills didn’t unearth anything.

The CMIS Relational View

In order to understand the choice of query language, one must first understand the virtual relational view of a CMIS repository. This consists of a collection of virtual tables. A virtual table exists for every queryable object type (content type if you prefer) in the repository. Each row in these virtual tables corresponds to an instance of the corresponding object type (or of one of its subtypes). A column exists for every property that the object type has. The figure below, taken from the specification, tries to explain this. Someone needs to draw a prettier version of this.
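The same idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the type names, the property names and the helper functions are invented, and a real repository would expose this view through its query service rather than through in-memory dictionaries.

```python
# Hypothetical type hierarchy: type id -> (parent type id, properties added by this type).
TYPES = {
    "document": (None, ["name"]),
    "invoice": ("document", ["amount"]),
    "press_release": ("document", ["color"]),
}

# Hypothetical repository content.
OBJECTS = [
    {"type": "document", "name": "readme"},
    {"type": "invoice", "name": "inv-001", "amount": 120},
    {"type": "press_release", "name": "launch", "color": "green"},
]


def columns(type_id):
    """Columns of the virtual table for a type: inherited properties first."""
    parent, own = TYPES[type_id]
    return (columns(parent) if parent else []) + own


def is_a(type_id, ancestor):
    """True if `type_id` is `ancestor` or one of its subtypes."""
    while type_id is not None:
        if type_id == ancestor:
            return True
        type_id = TYPES[type_id][0]
    return False


def virtual_table(type_id):
    """One row per instance of the type or its subtypes, one column per property of the type."""
    cols = columns(type_id)
    return [tuple(obj[c] for c in cols) for obj in OBJECTS if is_a(obj["type"], type_id)]


print(columns("invoice"))         # ['name', 'amount']
print(virtual_table("document"))  # [('readme',), ('inv-001',), ('launch',)]
print(virtual_table("invoice"))   # [('inv-001', 120)]
```

Note that the table for the parent type contains rows for subtype instances but only the parent's own columns, so a property defined only on a subtype is not visible from the parent's table.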

[Figure: CMIS relational model, from the specification]

Aside for another posting: One thing worth thinking about would have been the benefit of implementing an entity-attribute (or “skinny table”) relational view instead of a table per object. While the SQL queries against such a repository are dog-ugly, they would have many of the benefits of XPath I think.
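As a rough illustration of that aside, here is a sketch of a skinny table using Python's built-in `sqlite3` module. The table layout, column names, type names and sample rows are all hypothetical and are not taken from the CMIS specification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per (object, property) pair instead of one table per object type.
conn.execute("CREATE TABLE props (object_id TEXT, object_type TEXT, name TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO props VALUES (?, ?, ?, ?)",
    [
        ("n1", "NEWS_ARTICLE", "title", "Launch day"),
        ("n1", "NEWS_ARTICLE", "color", "green"),
        ("p1", "PRESS_RELEASE", "title", "Q1 results"),
        ("p1", "PRESS_RELEASE", "color", "red"),
        ("w1", "WHITE_PAPER", "title", "On CMIS"),
        ("w1", "WHITE_PAPER", "color", "green"),
    ],
)

# A property search across every type needs no UNION and no knowledge of the type list.
green = conn.execute(
    "SELECT object_id, object_type FROM props "
    "WHERE name = 'color' AND value = 'green' ORDER BY object_id"
).fetchall()
print(green)  # [('n1', 'NEWS_ARTICLE'), ('w1', 'WHITE_PAPER')]

# The ugly part: each extra property in the result costs a self-join.
rows = conn.execute(
    "SELECT t.object_id, t.value, c.value "
    "FROM props t JOIN props c ON c.object_id = t.object_id "
    "WHERE t.name = 'title' AND c.name = 'color' ORDER BY t.object_id"
).fetchall()
print(rows)
```

The first query stays the same when a new content type appears, which is the XPath-like benefit; the second shows why such queries get unpleasant quickly.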

So Why not XPath?

When reading the spec, I kept asking myself why the query language chosen is based on SQL (called CMIS SQL or CSQL), not XPath like the JCR. The only reference I found to XPath and CMIS was in Russ Danner’s excellent blog. He says:

CMIS also specifies a SQL like query language. Unlike previously proposed standards that pushed XQUERY and XPATH, CMIS is adopting a well understood paradigm which I believe will only encourage its adoption.

So this is the opposite view to mine. My instinct screams XPath to me for the following reasons:

  • XPath is a more natural way to search a hierarchy. ANSI SQL doesn’t provide this functionality although extensions like Oracle CONNECT BY and SQL Server’s Common Table Expressions make it possible. Most implementations would involve creating a denormalised table which flattened the graph. The CMIS specification adds 2 extensions (IN_FOLDER and IN_TREE) which I find slightly smelly.
  • An XPath result set can naturally contain content of different types. In the SQL model, each content type needs to be added to the result set, which would usually mean a whole lot of UNIONs and making sure that the columns selected from each virtual table are the same. Except that CSQL doesn’t do UNIONs. For example, if I want to find all objects that are green, I far prefer
    XPATH: /root//*[@color='green']
    to
    SQL: SELECT * FROM OBJ_TYPE_ONE WHERE ( IN_TREE( , 'ID000XXXXXXXX') ) AND ( 'green' = ANY COLOR )
    which I’d have to issue for every object type as UNIONs are not supported. If I am querying multiple types I need to send multiple queries, and I can’t use ORDER BY across the types.
  • With the standard as it is, I can achieve content of different types in the same SQL resultset, but only if they are sub-types. I’m not a big fan of deep type inheritance trees. And most of the content repositories that I deal with don’t support content type inheritance natively.
  • I don’t need to change my XPath Queries if a new content type is added to the repository. Using SQL, I could dynamically generate the queries by using reflection (getTypes and getTypeDefinition) on the repository, but that’s an extra step.
  • The Navigation Services API calls (such as getChildren, getDescendants, getObjectParents and getFolderParent) could probably be replaced with more XPath queries.
  • Once I retrieve a document, I’d like to get the “folder breadcrumb”. I didn’t see an obvious way to do this. I think a multi-filed document might need the concept of a “primary folder” or, at least, an ordering of the folder-document containment relationship.
  • The pagination functionality feels slightly unwieldy, sending SKIPCOUNT and MAXCOUNT as optional parameters to the query function. Not that this is solved by XPath, but I thought I’d mention it anyway.
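The contrast in the "green objects" bullet can be sketched with Python's standard library. This is only an illustration under loud assumptions: the XML rendering of the repository, the element and type names, and the `per_type_queries` helper are invented, and the generated statements merely imitate the shape of the SQL example above rather than guaranteed-valid CMIS SQL. `xml.etree.ElementTree` supports only a subset of XPath, so the path is written relative to the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical repository content rendered as XML; folder and type names are invented.
REPO = ET.fromstring("""
<root>
  <folder name="news">
    <news_article name="launch" color="green"/>
    <press_release name="q1" color="red"/>
  </folder>
  <folder name="papers">
    <white_paper name="cmis" color="green"/>
  </folder>
</root>
""")

# One path expression matches any element type at any depth below the root.
hits = REPO.findall(".//*[@color='green']")
print([(e.tag, e.get("name")) for e in hits])
# [('news_article', 'launch'), ('white_paper', 'cmis')]


def per_type_queries(type_ids, folder_id, color):
    """Build one SQL-style statement per type, since UNION is not available.

    The statement syntax is an assumption modelled on the example in the post.
    """
    return [
        f"SELECT * FROM {t} WHERE IN_TREE('{folder_id}') AND '{color}' = ANY COLOR"
        for t in type_ids
    ]


# In a real client the type list would come from the repository (getTypes);
# here it is hard-coded, and it must be regenerated whenever a type is added.
queries = per_type_queries(
    ["NEWS_ARTICLE", "PRESS_RELEASE", "WHITE_PAPER"], "ID000XXXXXXXX", "green"
)
for q in queries:
    print(q)
```

The path expression is unchanged when a fourth content type appears, whereas the SQL side needs one more statement and the client has to merge and sort the separate result sets itself.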

Maybe XPath is just too hard for the vendors?

So why didn’t they go with XPath? I think the biggest hint comes from the design goal in the CMIS specification:

However, it is an explicit goal that CMIS will NOT require major product changes or significant data model changes like other standards such as JSR 170 have required.

Wearing my developer hat, I think the API would be more useful if I could interrogate it using XPath. However, from the point of view of the ECM vendor based on a relational database, maybe implementing an XPath search on their repository is just too damn hard! One would think that the vendors that support JCR already have done most of the heavy lifting. But not many vendors have implemented it, and many of those that have use Apache Jackrabbit.

So, in summary, my theory is that an XPath based query language is very difficult for the vendors to implement. Which means developers of CMIS clients are gonna have to bite the bullet and use CSQL. Which is still going to be great, and means we’re going to get far more CMIS enabled repositories than JCR ones. Which hopefully means the Day CMIS PlugFest is going to be a very busy event. But I do so love XPath, and here’s hoping that it makes it into a later version of the specification.

Rock on, CMIS.

14 comments to CMIS – Is XPath Just A Bit Too Tricksy?

  • Mainly, XPath has not been used because the repository vendors aren’t too fond of it (most of them are SQL guys). Note also that in JCR 2 (JSR-283) SQL has become the primary mandatory query language, and XPath is deprecated.

    It seems to me that XPath is mostly used by XML fans, and is only applicable if you can have a natural mapping of your data into the XML basic infoset. That’s sometimes hard.

    Also, have a look at John Newton’s article from 2006 about SQL vs XPath, which is still relevant: http://newton.typepad.com/content/2006/09/sql_vs_xpath_vs.html

    Regarding some of the bulleted points you made:
    - XPath is more natural for a hierarchy but the hierarchy aspects of CMIS only apply to the folders, the documents themselves can be multi-filed or unfiled so XPath would have a harder time applying to them,
    - If you want to find all documents that are green you can do SELECT * FROM Document WHERE ‘green’ = ANY Color. It’s hard to see a use case where you would want to query at the same time Documents and Folders, or Documents and Relations for instance,
    - Folder breadcrumb is an ill-defined notion if you have multi-filing or unfiling, and that’ll be quite common in some CMIS repositories. Anyway you can do getObjectParents to get all the containing folders, then getFolderParent with returnToRoot=true on each,
    - Pagination functionality is expressed in the model but may appear differently in the actual protocols, for instance AtomPub specifies RFC 5005 for paging.

  • Thanks so much for your reply. The link you posted is exactly what I was looking for. I was Googling things like “CMIS XPath SQL” in the search, where using “iECM XPath SQL” is far cleverer. I guess the CMIS discussions didn’t want to rehash the iECM discussions too much. I’ll read this properly and add my thoughts.

    On your points:
    - I still think an XPath query makes sense for multi-filed documents. Unfiled documents make slightly less sense. We would probably need to hack in some new virtual /root/unfiled/ node.
    - I probably wasn’t clear here. My example would be searching across content types such as NEWS_ARTICLE, PRESS_RELEASE or WHITE_PAPER that may have different but overlapping attributes. In your example, is “Document” a reserved word that means all content types? Could I do this in one CSQL query as you have? Or have I misunderstood/misread the spec?
    - Agree. I probably shouldn’t have mentioned the breadcrumb notion at all. I mentioned it as it is something that always needs consideration but isn’t really relevant to the discussion at hand.
    - I guess I’ve got some more reading to do here too. You know a lot more than I do …

    Would you agree, though, that the main difference between CMIS and JCR is SQL v. XPath? Or do you think the other differences are more fundamental?

    Thanks again
    Jon

  • Al Brown

    Jon,

    This is a very interesting article. Your choice of search metaphor (XPath vs SQL) is based on where you start your search from. Most ECM systems start search from the point of type. I am looking for a document (invoice, claim, loan, etc) that meets the following criteria. The criteria could be on the properties or its location. It is not necessarily clear that those objects exist in a folder structure. They could. That’s a customer’s choice.

    With XPath, you start with a hierarchy. I want to find items under this folder that match a certain criteria. This implies that customers always use the hierarchy and have organized their content that way.

    Based on the paradigm you start with, you add in access to the rest of the vendor’s models, such as /root/unfiled hacks. It is probable companies did not implement these extensions the same way as the standard and thus the cost to support increases.

    Also, the issue with XPath was that, since many vendors did not natively support that style, adding support for XQuery became problematic when XPath limitations were encountered (and they were).

    CMIS is based on the idea of standardizing the 80% of what all the vendors do and do well. It is backward looking by design. This decreases the cost and makes it an easier decision on whether to support the effort or not. This was proven by the interoperability plugfest where most vendors created their prototypes with a couple of man-months of effort. That is impressive for a standard that provides significant capability.

    I think JCR is a good standard, and a forward looking one, and is the result of a lot of good work by very many talented people. Unfortunately, times change, and a big drawback of a Java standard is typically the lack of MSFT support. SharePoint, like it or not, has an impact on this space and it is important to include them in interoperability.

  • In CMIS, Document is the mandatory base type for all document types. Querying on Document is required to do the query on all its subtypes. Ok, actually only subtypes that have includeInSuperTypeQuery=true, but a repository that doesn’t have that for the direct subtypes of Document would be considered subpar (even though I think the spec allows it).

    I don’t agree that the difference between CMIS and JCR is SQL vs XPath, especially when you consider JCR 2. For me the difference between the two is much deeper. CMIS is higher level & simpler, it describes protocols (on top of a unique model) and not a language binding, and is supported by more vendors. See the points I made in http://asserttrue.blogspot.com/2009/04/hell-freezes-over-as-big-ecm-vendors.html?showComment=1238880000000#c3031100695560533822

  • I suspect (with far more reservations on my own ability to understand this than you have) it would be a lot harder to get XPath to work on these rather heterogeneous repositories than it is to implement CSQL. (Or at the very least, it’d be a lot harder to squeeze any kind of performance out of it).

    But while I agree, and I’d much prefer XPath to SQL, it may not just be a case of it being too hard for the vendors. I think it might also be too hard on many developers. That’s probably why it has seen some, but not great uptake (much the same as XSLT).

    So if you want to set a standard that everybody already sort of knows, you go with SQL. Not because it’s the best, or the most appropriate, but because it’ll have the most success, which in the end is pretty important for a standard :)

  • Oh, okay. I got that wrong too then. Seems the only thing I was right about in this post was being wrong about everything :-) I think I’ve also proved that it is dangerous to comment on a spec based on reading it alone. You need to play with it to understand it properly.

    Just to make sure I’ve got this right, then: I can do “SELECT * FROM Document WHERE ‘green’ = ANY Color” even if Color is not a property of the base type, but is of a few subtypes? I understood from the Search Query Scope Diagram that I could only do the query you suggested if Color was a property of the base type. I presume that “SELECT *” will only return the properties in Document (as each subtype will have different properties).

    I do really appreciate someone like yourself that knows CMIS so well answering my questions! I’ll repay with beers any time :-)

    Jon

  • Al Brown

    For CMIS, Color would have to be on Document. If Color was on Invoice, then you could do “SELECT * FROM Invoice WHERE ‘green’ = ANY Color”.

    If you have many types that have Color, but there is not a common type ancestor that has Color, then best practice would be to create such a type ancestor.

    Sometimes adhering to that best practice is problematic. Especially if a system desires multiple inheritance. One of the discussions in CMIS is around mixins/aspects. This is one of the more forward looking proposals. It would allow you to add an aspect, e.g., Colorable, to each class/object instance as appropriate. You could then treat Colorable as a type to search against.

    This proposal is being discussed in the TC and John Newton is leading it. It is unclear whether or not it will make it into 1.0.

    • Al,

      Thanks again for clarifying. And yes, I am thinking of the case where Color is a property of many types with no common type ancestor. I don’t really like the idea of creating type inheritance trees for this. And even if I did, most of the CMS products I touch don’t entertain the notion of subtypes (which I don’t really care for) or aspects/mixins (which I would love).

      Although it might sound strange to have many content types with common attributes, it seems to happen a lot. Sometimes, we might even have two or more content types with *identical* properties, for example “News” and “Press Release”. Now these should really be the same type but for the quirks of a product that mean you need a different type to apply different workflows, security or other policies. But I digress.

      I’m looking forward to seeing 1.0. After receiving comments on this posting, I’ve found so much more to read … :-) I wish I was a vendor so I could try to implement it!

      Thanks again,
      Jon

  • You might want to look at vtd-xml for the best possible XPath query performance

    vtd-xml

