Community Blogs

Custom CQ5 workflow step that integrates Twitter and Jabber

Monday, December 28, 2009 - 01:00


As part of the IKS project each CMS vendor completes a couple of benchmarks in order to establish a baseline against which future semantic improvements can be measured. For benchmark 3 "Workflow Service" Bertrand and I chose to implement the task "Create a multi-channel (email, SMS, instant messaging, Twitter,...) notification service for workflow transitions". We have created an automated workflow step that can be inserted into a custom workflow and either send an e-mail, send a direct message on Twitter or send a chat message on GTalk/Jabber. The corresponding message's payload is the path to the content node in the workflow plus an optional custom text.

Below follows a description of how this functionality was implemented in CQ5. The complete code is attached to this post as a CQ5 package. I will outline some of the considerations and gotchas regarding this particular feature, but some issues apply to CQ5 development in general as well. The environment I used for development was CRXDE Lite (the web-based IDE available at /crxde of your CQ5 installation) and a beta version of the upcoming CQ5 release 5.3. It is probably helpful to install the package (see setup section below) and read the code alongside this post.

OSGi services

A good way to hook up external services like Twitter etc. is to create a custom OSGi service that exposes only the business functionality and hides the internal classes. Moreover, it is good practice to provide a Java interface and to keep the implementation of the service separate from it (allowing the replacement of the implementation without affecting relying parties). The services will show up in the Sling configuration console at /system/console/configMgr. This allows the administrator to configure the service's private parameters at deployment time (in our case Twitter account credentials and Jabber user credentials). The service consumes the configuration like this:

// Property names shown in the configuration console
/** @scr.property */
public static final String GTALK_USER = "gtalk.service.user";
/** @scr.property */
public static final String GTALK_PASSWORD = "gtalk.service.password";

private String user;
private String password;

// Called by the OSGi runtime when the component is activated
protected void activate(ComponentContext context) {
    Dictionary config = context.getProperties();
    user = (String) config.get(GTALK_USER);
    password = (String) config.get(GTALK_PASSWORD);
}
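
The raw Dictionary may lack a key, or hold a blank value, as long as the service has not been configured in the console. A minimal sketch of a more defensive lookup, using only java.util; the helper class ConfigValues and its method names are hypothetical and not part of the attached package:

```java
import java.util.Dictionary;

// Hypothetical helper, not part of the attached package.
final class ConfigValues {
    private ConfigValues() {}

    /** Returns the trimmed String stored under key, or fallback if it is missing, blank, or not a String. */
    static String getString(Dictionary<String, ?> config, String key, String fallback) {
        Object value = config.get(key);
        if (!(value instanceof String)) {
            return fallback;
        }
        String text = ((String) value).trim();
        return text.isEmpty() ? fallback : text;
    }

    /** Like getString, but fails fast when a mandatory setting is absent. */
    static String requireString(Dictionary<String, ?> config, String key) {
        String value = getString(config, key, null);
        if (value == null) {
            throw new IllegalStateException("Missing configuration property: " + key);
        }
        return value;
    }
}
```

Failing fast in activate means a misconfigured service shows up as a component error right away rather than as a failed login when the first message is sent.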

3rd party libraries

In order to use Twitter and Jabber I utilized the open source libraries Twitter4J and Smack, respectively. With CRXDE (Lite) it is very simple to include such 3rd party jars in a custom OSGi bundle: just drop them in the bundle's /libs folder. When building the bundle CRXDE will embed them. Compilation and deployment are done by executing "Build Bundle" (right-click on the .bnd file in the bundle root).

A note on 3rd party jars' dependencies

It might well be that the bundle compiles and deploys, but does not start. Check the OSGi console at /system/console/bundles to find out if your bundle's state is "Active" (good) or just "Installed" (not good). The latter happens e.g. when the embedded jar has dependencies on other jars that are not embedded. In such a case check the bundle's details page in the Sling console to find out which dependencies are missing and either add them to /libs as well or take them out of the OSGi imports. That is achieved by editing the .bnd file's import directive, e.g.

Import-Package: !com.sun.syndication.*, !dalvik.system, *

Workflow action

The last piece needed is a workflow step that can be added to a custom workflow. For that purpose one simply needs to create a class that implements the interface JavaProcessExt. The method execute will receive the workflow's payload; from there it is trivial to obtain the services described above and pass them the content. CQ workflow actions can be customized for each particular workflow they are used in. I use this feature to customize the accounts to which a message shall be sent (the custom format is explained in the setup section below). The customization string is passed to the execute method as well: comma-separated values will arrive as a String[] array.
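
One way to keep such a process step thin is to let it look up the channel by name and delegate to a service interface. The sketch below uses plain Java only; Notifier, NotificationDispatcher and the channel keys are hypothetical stand-ins for the OSGi services in the package, and the signature of JavaProcessExt is deliberately not reproduced here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical service interface: one implementation per channel.
interface Notifier {
    void send(String recipient, String message);
}

// Hypothetical dispatcher that a workflow step could delegate to.
final class NotificationDispatcher {
    private final Map<String, Notifier> channels = new HashMap<String, Notifier>();

    void register(String channel, Notifier notifier) {
        channels.put(channel, notifier);
    }

    /**
     * args mirrors the step's customization values: channel, recipient, optional text.
     * payloadPath is the path of the content node in the workflow.
     */
    void dispatch(String payloadPath, String[] args) {
        if (args == null || args.length < 2) {
            throw new IllegalArgumentException("Expected at least channel and recipient");
        }
        Notifier notifier = channels.get(args[0].trim());
        if (notifier == null) {
            throw new IllegalArgumentException("Unknown channel: " + args[0]);
        }
        // The optional custom text comes first, followed by the content path.
        String text = args.length > 2 ? args[2].trim() : "";
        String message = text.isEmpty() ? payloadPath : text + " " + payloadPath;
        notifier.send(args[1].trim(), message);
    }
}
```

With this split, adding a further channel only requires registering another Notifier; the step itself does not change.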

Setting up the package

To get this up and running download the attached CQ5 package and install it through the package manager. In the Sling console configure the services com.day.cq.mailer.impl.MailerService, com.day.iks.service.impl.TwitterServiceImpl and com.day.iks.service.impl.GtalkServiceImpl. For Twitter and Gtalk you need to supply the credentials of the (technical) user that shall send the DMs or chat messages, respectively. In the case of e-mail you need to configure your mail server.

Next, create a custom workflow in the CQ5 workflow section and add the workflow action (name). The configuration options are:

  • for sending an e-mail: email,user@mydomain.com,some_message

  • for sending a direct message on Twitter: dm,twitter_user,some_message

  • for sending a chat message on Gtalk: gtalk,user@gmail.com,some_message

The content item's path is appended to the (optional) message.

Here is an example for GTalk:



In the cases of Twitter DM and GTalk make sure that the recipient has opted in to receive messages from the technical user you have configured as a sender.

What is a content repository

Thursday, November 19, 2009 - 01:00



Joint post of Henri Bergius and Michael Marth cross-posted here and here.

Web Content Repositories are more than just plain old relational databases. In fact, the requirements that arise when managing web content have led to a class of content repository implementations that are comparable on a conceptual level. During the IKS community workshop in Rome we got together to compare JCR (the Jackrabbit implementation) and Midgard's content repository. While in some cases the terminology might be different, many of the underlying ideas are identical. So we came up with a list of common traits and features of our content repositories. For comparison, there is also Apache CouchDB.

So, why use a Content Repository for your application instead of the old familiar RDBMS? Repositories provide several advantages:

  • Common rules for data access mean that multiple applications can work with the same content without breaking consistency of the data

  • Signals about changes let applications know when another application using the repository modifies something, enabling collaborative data management between apps

  • Objects instead of SQL mean that developers can deal with data using APIs more compatible with the rest of their desktop programming environment, and without having to fear issues like SQL injection

  • Data model is scriptable when you use a content repository, meaning that users can easily write Python or PHP scripts to perform batch operations on their data without having to learn your storage format

  • Synchronization and sharing features can be implemented on the content repository level meaning that you gain these features without having to worry about them

| feature | JCR / Jackrabbit | Midgard | CouchDB |
| --- | --- | --- | --- |
| content type system | In JCR structured or unstructured nodes are supported and can be mixed at will in a content tree. | Content types are defined in MgdSchema types. All content must be stored to an MgdSchema type, but types can be extended on content instance level using the "parameter" triplets | Type-free |
| type hierarchy | Structured node types support inheritance of types, additional cross-cutting aspects can be added with "mixins". Node types can define allowed node types for child nodes in the content hierarchy. | MgdSchemas allow inheritance, and an extended type can be instantiated either using the extended type or the base type | Type-free |
| IDs | Nodes with mixin "referenceable" have a UUID. In practice the node path is often used to reference nodes. | Every object has a GUID used for referencing. Objects located in trees that have a "name" property can also be referred to using the path | All objects can be accessed via a UUID |
| references | Nodes can reference each other with hard link (special property type) or soft link (by referring to the node path) | MgdSchema types can have properties linking to other objects of same or different type. A link of "parentfield" type places an MgdSchema type in a tree. | No reference support built-in |
| content hierarchy | All content is hierarchical / in a tree | Content can exist in tree, or independently of it depending on the MgdSchema type definition | Flat structure |
| interesting property types | Multi-valued (like an array), binary properties (e.g. for files), nodes have an implicit sort-order | Binary properties stored using the Midgard Attachment system | Support for binary properties |
| transactions | Multiple content modifications are written in transactions. | Transactions can be used optionally. | |
| events | JCR Observers can register for content changes on different paths and/or for different node types and/or CRUD, receive notification of changes as serialized node | All transactions cause both process-internal GObject signals, and interprocess DBus signals | Support for one external event notification shell script |
| workspaces | Workspaces provide separate root trees. | No workspaces support in Midgard 9.03, coming in next version | Multiple databases within one CouchDB instance |
| import and export | Nodes or parts of the repository (or the whole repo) can be imported or exported in XML. 2 formats: docview for human-friendly representation, sysview including all technical aspects | Objects can be exported and imported in XML format. There are tools supporting replication via HTTP, tarballs, XMPP, and the CouchDB replication protocol | JSON serialization is the standard way of accessing the repository. CouchDB replication protocol supports full synchronization between instances |
| versioning | Checkin/checkout model to create new versions of nodes, optionally versions complete sub-trees, supports branching of versions. | No versioning | All versions of content are stored and accessible separately, no branching |
| locking | Nodes can be locked and unlocked | Objects can be locked and unlocked | |
| object mapping | Not in standard, but implemented in Jackrabbit. Rarely used in practice. | Object mapping is the standard way of accessing the repository | All content is accessed via JSON objects |
| queries | In JCR1 SQL or XPath, in JCR2 also QueryBuilder. | Query Builder | JavaScript map/reduce |
| access control | Done on repository level, i.e. all access control is independent of application. In Jackrabbit: pluggable authentication/authorization handlers. | No access control in Midgard repository, usually implemented on application level. Midgard provides a user authentication API | No access control |
| persistence | In Jackrabbit different Persistence Managers can be plugged in (RDBMS, tar file, ...) | libgda allows storage to different RDBMS like MySQL, SQLite and Postgres | CouchDB has its own storage |
| architecture | Jackrabbit: library (jar), JEE resource, OSGi bundle or standalone server | Library | Erlang-based daemon |
| APIs | Standard: Java-based, PHP coming up. In Jackrabbit: also WebDAV and HTTP-based API | C, Objective-C, PHP, Python | HTTP+JSON |
| full-text search | Included in repository. In Jackrabbit: Lucene bundled | No (SOLR used on application level) | Plugin for using Lucene, not installed by default |
| standard metadata | All nodes have access rights, jcr:primaryType and jcr:mixinTypes properties. JCR 2.0 standardizes a set of optional metadata properties. | All objects have a set of standard metadata including creator, revisor, timestamps etc | No standard properties |

[LOTD] IKS in the press

Wednesday, November 18, 2009 - 01:00

French IT mag LeMagIT has published an article about the IKS project including quotations from Bertrand Delacretaz. Bertrand emphasizes the need for concrete results:

pour décoller, les technologies sémantiques ont besoin de cas d'utilisateur concrets

In the comments section Bertrand mentions his tag line for semantic technologies that I can very well relate to:

La sémantique "sous le capot" oui, la sémantique "dans la figure", non

This roughly translates as: "semantics under the hood yes, semantics in your face, no".

In Computerworld UK open source blogger Glyn Moody has described his first-hand impressions from the IKS workshop in Rome. He comes to a similar conclusion:

Paradoxically, semantic search will only ever really take off once it has receded so far into the fabric of computing that people aren't even aware it's there.

Update: The IKS semantic engine - a pragmatist's view

Friday, November 13, 2009 - 01:00

Update to "The IKS semantic engine - a pragmatist's view": here are the slides:

The IKS RESTful semantic engine - let's get started! (slides from bdelacretaz)

The presentation went well, and will hopefully lead to a sprint to actually implement something along these lines. The two demos that used UIMA at the workshop made me think that UIMA should be part of that picture, at least as a plugin for semantic lifting. And I did the presentation in less than 8 minutes out of the 10 that were allocated. Bonus points?

IKS Search Benchmark

Thursday, November 12, 2009 - 01:00



CQ5 search comes with some improvements over JCR's search capabilities, e.g. adapting result rankings to what users choose or faceted search. Within the IKS project Bertrand and I have experimented with another possibility: link-based ranking, i.e. adjusting search results based on the content of link tags. For example: if page A links to page B with the link text "lorem ipsum" then page B should get a higher ranking when a user searches for "lorem ipsum". This is essentially what Google does, but we wanted to apply it to internal links (within the same site) only.

To give away the results right away: for many web sites the results will probably not improve dramatically, because there are not enough internal links. However, it might help for some projects so our implementation approach is described below in case you want to give it a try in your project.

In order to extract links from a node we opted for parsing the complete rendered HTML presentation of a node rather than looking only at the Rich Text properties of one node. In that way we could also catch programmatically generated links from templates. So we ended up setting up a little spider on the publish server that retrieves HTML representations of all pages. The spider is deployed as an OSGi bundle within the server so it gets the locations of all pages from an internal repository query. For each page the HTML is retrieved and parsed. The links found are stored as child nodes below the page that is linked to. In the example from above: if page A links to page B with the link text "lorem ipsum" then page B gets a child node with properties source=A and text="lorem ipsum". Implemented in that way we could basically use the Jackrabbit indexer without further changes.
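
The extraction step can be illustrated without a repository. The following sketch pulls anchor targets and link texts out of rendered HTML with a regular expression and groups them by the page that is linked to. A real spider would more likely use a proper HTML parser; LinkExtractor, Link and the sample paths are hypothetical names and not the attached code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// One inbound link: the page it was found on and its link text.
final class Link {
    final String source;
    final String text;

    Link(String source, String text) {
        this.source = source;
        this.text = text;
    }
}

final class LinkExtractor {
    // Simplified: double-quoted href only; good enough for a sketch, not for arbitrary HTML.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s[^>]*href=\"([^\"]+)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Adds every internal link found in html to inbound, keyed by the target path. */
    static void collect(String sourcePath, String html, Map<String, List<Link>> inbound) {
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            String target = m.group(1);
            if (!target.startsWith("/") || target.startsWith("//")) {
                continue; // only same-site paths are of interest
            }
            String text = m.group(2).replaceAll("<[^>]+>", "").trim(); // drop nested tags
            if (text.isEmpty()) {
                continue; // e.g. image-only links carry no text to rank on
            }
            List<Link> links = inbound.get(target);
            if (links == null) {
                links = new ArrayList<Link>();
                inbound.put(target, links);
            }
            links.add(new Link(sourcePath, text));
        }
    }
}
```

Each entry in the resulting map corresponds to one child node below the target page, with source and text as its properties.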

We have also implemented a JCR Observer that catches changes to pages and fixes the corresponding links. Template updates are not caught yet.
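
Keeping the stored links consistent when a page changes amounts to dropping every link that originated from that page and re-adding the freshly extracted ones. A minimal in-memory sketch of that bookkeeping follows. InboundLinkStore is a hypothetical class; the actual implementation works on JCR child nodes from within an observer rather than on maps.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

final class InboundLinkStore {
    // target path -> entries of {source path, link text}
    private final Map<String, List<String[]>> inbound = new HashMap<String, List<String[]>>();

    /**
     * Replaces all links that originate from sourcePath.
     * newLinks maps target path to link text (simplification: one link per target).
     */
    void pageChanged(String sourcePath, Map<String, String> newLinks) {
        removeLinksFrom(sourcePath);
        for (Map.Entry<String, String> e : newLinks.entrySet()) {
            List<String[]> entries = inbound.get(e.getKey());
            if (entries == null) {
                entries = new ArrayList<String[]>();
                inbound.put(e.getKey(), entries);
            }
            entries.add(new String[] {sourcePath, e.getValue()});
        }
    }

    /** Drops stale entries, e.g. when the source page was edited or deleted. */
    void removeLinksFrom(String sourcePath) {
        Iterator<List<String[]>> lists = inbound.values().iterator();
        while (lists.hasNext()) {
            List<String[]> entries = lists.next();
            Iterator<String[]> it = entries.iterator();
            while (it.hasNext()) {
                if (it.next()[0].equals(sourcePath)) {
                    it.remove();
                }
            }
            if (entries.isEmpty()) {
                lists.remove();
            }
        }
    }

    /** Link texts currently pointing at targetPath. */
    List<String> textsFor(String targetPath) {
        List<String> texts = new ArrayList<String>();
        List<String[]> entries = inbound.get(targetPath);
        if (entries != null) {
            for (String[] entry : entries) {
                texts.add(entry[1]);
            }
        }
        return texts;
    }
}
```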

The sources are attached to this post. The Java program can be used as a standalone application or deployed as an OSGi bundle. The standalone program takes a couple of optional arguments for running a full upfront spidering, deleting all found link nodes etc. In case you want to give it a try please be aware:

  • The standalone program requires RMI to be enabled on the repository which is not the case by default (in the code port 1235 is used).

  • The searches must take into account the new properties of the link nodes. One possibility is to re-configure the Jackrabbit indexing,
    which in CQ5 is done in the
    crx-quickstart/server/runtime/0/_crx/WEB-INF/classes/indexing_config.xml
    file, by adding:

    
    <index-rule nodeType="...">
      <property boost="...">linkedText</property>
    </index-rule>
    
    

The boost factor in this configuration can be adjusted to give links a proper weight relative to the other properties of a node.
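
To see why the boost factor matters, consider a toy relevance score in which matches in inbound link texts count boost times as much as matches in the page's own text. This is an illustration only and not how Lucene or Jackrabbit compute scores; ToyScorer and its parameters are hypothetical.

```java
import java.util.List;
import java.util.Locale;

final class ToyScorer {
    /** Counts non-overlapping, case-insensitive occurrences of query in text. */
    static int occurrences(String text, String query) {
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = query.toLowerCase(Locale.ROOT);
        if (needle.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) >= 0) {
            count++;
            from += needle.length();
        }
        return count;
    }

    /** score = matches in page text + boost * matches in inbound link texts */
    static double score(String pageText, List<String> linkTexts, String query, double boost) {
        int linkMatches = 0;
        for (String linkText : linkTexts) {
            linkMatches += occurrences(linkText, query);
        }
        return occurrences(pageText, query) + boost * linkMatches;
    }
}
```

With a boost of 0 the link texts are ignored entirely; raising it lets a page that is frequently linked with the search phrase overtake a page that merely mentions it.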

To force a reindex, delete these directories:

crx-quickstart/repository/repository/index

crx-quickstart/repository/workspaces/crx.default/index

crx-quickstart/repository/workspaces/crx.system/index

Results

We tested the approach on the content of our corporate website (a rather small content corpus). Overall, the search results improved slightly, but not much (although we did not spend a lot of time on tweaking the boost factor). As stated above I believe that corporate websites in general will not benefit very much from link-based ranking, as the majority of their links merely reflect the navigation (i.e. the hierarchical structure of the site) and so provide little additional information. Of course, on the other hand there is no harm in using links for search relevance either.

Alternative approach

Marcel Reutegger (the MAN when it comes to JCR searches) gave a lot of great input to our experiment (thanks a lot for this). He also hinted at what an alternative implementation could look like: using an output filter, which can process HTML content as it's being generated. In CQ5 the validity of links is already checked that way, so storing them would naturally fit there. Also, he suggested storing the links not below the pages themselves, but in a separate part of the repository. In a background processing job these links could be aggregated and the most relevant keywords would eventually be written into the page nodes.
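
The aggregation job hinted at here could, for instance, count the words occurring in all link texts that point to a page and keep the most frequent ones. The sketch below shows that idea in plain Java; KeywordAggregator, the minimum word length and the tie-breaking rule are assumptions made for illustration and are not part of the attached sources.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class KeywordAggregator {
    /** Returns up to limit words from linkTexts, most frequent first, ties in alphabetical order. */
    static List<String> topKeywords(List<String> linkTexts, int limit) {
        final Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String text : linkTexts) {
            // Split on anything that is neither a letter nor a digit.
            for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (word.length() < 3) {
                    continue; // assumption: very short words carry little meaning
                }
                Integer old = counts.get(word);
                counts.put(word, old == null ? 1 : old + 1);
            }
        }
        List<String> words = new ArrayList<String>(counts.keySet());
        Collections.sort(words, new Comparator<String>() {
            public int compare(String a, String b) {
                int byCount = counts.get(b).compareTo(counts.get(a));
                return byCount != 0 ? byCount : a.compareTo(b);
            }
        });
        int end = Math.max(0, Math.min(limit, words.size()));
        return new ArrayList<String>(words.subList(0, end));
    }
}
```

The resulting keywords would then be written into a property of the page node, so that the regular index picks them up without any per-link child nodes.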
