Tuesday, August 30, 2011

UIMA, Clerezza, AO and Linked Open Data [1]

More than a year ago I started working on the integration of text mining services with DOMEO, the tool I am building. In DOMEO we try to focus, as much as possible, on annotations that refer to entities defined in controlled vocabularies, ontologies or knowledge bases. In other words, entities identified by URIs. Almost naturally, after writing a few connectors to existing text mining services, we realized how nice it would be to have a common way of exposing such services.

After some investigation, I found Apache Clerezza, a service platform based on OSGi (Open Services Gateway initiative) that provides a set of functionalities for managing semantically linked data, accessible through RESTful Web Services and in a secured way. As I am familiar with the OSGi technology, I contacted the person responsible for the UIMA integration - Tommaso Teofili - right away and we started to exchange ideas. After writing to the Clerezza mailing list, I decided to write and contribute some code that transforms UIMA results into the Annotation Ontology (AO) RDF format.

Working with Antony Scerry (Elsevier) months ago, I learned right away that the hard part was not the code itself but the fact that UIMA types are extremely flexible. Most existing NLP tools do not return URIs at all, and when text mining tools do return URIs, it is not trivial to capture the URI of the recognized entity: that information can be encoded in whatever way the service developers choose. This becomes a problem when trying to integrate multiple text mining services under a common interface.

As I did not want to propose something too complicated, Tommaso and I agreed that adopting a small convention could make our life easier. After a few discussions, Tommaso introduced a couple of basic types: ClerezzaBaseAnnotation and ClerezzaBaseEntity. Both define the property 'uri', and ClerezzaBaseEntity also defines the property 'label'. To adopt the proposed convention when creating new type system descriptors, it is simply necessary to extend these two types and use them appropriately. As a result, through a patch I wrote, it is now possible, for instance, to map the results to the Annotation Ontology (AO) RDF format without any effort and, therefore, to display your results on the analyzed document through DOMEO.
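
To make the convention more concrete, here is a minimal, language-neutral sketch in Python of why a guaranteed 'uri' and 'label' make the mapping mechanical. It is not the Clerezza patch and does not use the UIMA or Clerezza APIs: the class names BaseEntity and BaseAnnotation, the 'entity' field linking the two, the to_triples helper and the example.org URIs are all hypothetical. The AO namespace and the terms Annotation, annotatesResource and hasTopic are written from memory and should be checked against the AO specification.

```python
from dataclasses import dataclass

# Assumed vocabulary URIs; verify the AO terms against the AO specification.
AO = "http://purl.org/ao/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


@dataclass
class BaseEntity:
    """Hypothetical stand-in for ClerezzaBaseEntity: a 'uri' and a 'label'."""
    uri: str
    label: str


@dataclass
class BaseAnnotation:
    """Hypothetical stand-in for ClerezzaBaseAnnotation: a 'uri'.

    The 'entity' field is an assumption of this sketch, used to link
    the annotation to the entity it recognizes.
    """
    uri: str
    entity: BaseEntity


def to_triples(annotation, document_uri):
    """Return (subject, predicate, object) tuples describing one annotation."""
    return [
        (annotation.uri, RDF_TYPE, AO + "Annotation"),
        (annotation.uri, AO + "annotatesResource", document_uri),
        (annotation.uri, AO + "hasTopic", annotation.entity.uri),
        (annotation.entity.uri, RDFS_LABEL, annotation.entity.label),
    ]


if __name__ == "__main__":
    entity = BaseEntity("http://example.org/entity/1", "example entity")
    annotation = BaseAnnotation("http://example.org/annotation/1", entity)
    for triple in to_triples(annotation, "http://example.org/doc/1"):
        print(triple)
```

Because every type extending the two base types exposes the same properties, a single generic function like this can serve any text mining service that follows the convention, instead of one connector per service.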

Thanks to Tommaso, my patch was integrated into the Clerezza code today. The documentation will be made available soon, as well as other code that will make it easy to set up web services for publishing UIMA algorithms.