Monday, February 28, 2011

SWAN Annotation Tool: what is new in build 7

Here is a list of some of the new features that will probably be deployed this week.


The development of the SWAN Annotation Tool is managed and carried out by Dr. Paolo Ciccarese. The first build of the SWAN Annotation Tool was developed by Dr. Paolo Ciccarese and Marco Ocana. The SWAN Annotation Tool is a product of the MIND Informatics group - Mass General Hospital - directed by Tim Clark.

Sunday, February 27, 2011

ClientBundle, UiBinder and CSS (GWT)

In a previous post I discussed the possible ways of using image resources with GWT. Another thing you might want to deal with is CSS.

1) Standard use of CSS. With GWT you can certainly use CSS as you would for any other web application: simply declare CSS classes in a *.css file and import it in the webpage of interest. With GWT widgets you will then simply set the style name as follows:
widget.setStyleName("cssClassName");
This approach works; however, if the CSS declaration is missing, no exception is raised. Also, if you use multiple CSS files as I do, it is always annoying to track down the declarations when you need to.

2) Using the UiBinder. If you are already using UiBinder, the easiest way to include CSS declarations is to add them to the binder. It is easy and safe, as the Eclipse plugin helps you find missing declarations. I always use Eclipse, but if you don't, I assume you'll still find the problems at compile time.
<ui:UiBinder
  xmlns:ui='urn:ui:com.google.gwt.uibinder'
  xmlns:g='urn:import:com.google.gwt.user.client.ui'>

    <ui:style>
       .outer {
          width: 100%;
       }
    </ui:style>

    <g:VerticalPanel styleName='{style.outer}'>
    </g:VerticalPanel>
</ui:UiBinder>
The downside of this approach is redundancy: sometimes I want to use the same CSS declarations multiple times and, with this approach, I have to repeat them in each single binder.

3) Using CssResource. Another alternative consists in doing something similar to what you can do for icons with ImageResource. First, I put the stylesheet declarations in a file that I name Commons.css:
.smallIcon {
    height: 16px;
    width: 16px;
}

Then I declare the stylesheet as a resource of the application, linking the CSS file:
public class Example implements EntryPoint {

  public interface Resources extends ClientBundle {

     public static final Resources INSTANCE = GWT.create(Resources.class);

     @Source("org/example/application/client/Commons.css")
     CommonsCss commonsCss();

     ...
  }

}
Now, where the dots are in the above snippet, I can declare the stylesheet classes I want to expose to the application:
public interface CommonsCss extends CssResource {
     String smallIcon();
}
As the CSS class has the same name as the method, everything works fine. However, sometimes you might want to change the name of the method. Using Java annotations, you can address that issue as well:
public interface CommonsCss extends CssResource {
     @ClassName("smallIcon")
     String smallIconClass();
}
Now, we can write something like:
CommonsCss css = Resources.INSTANCE.commonsCss();
css.ensureInjected(); // makes sure the stylesheet is injected in the page

Image img = new Image();
img.setStyleName(css.smallIcon());
...
This approach allows you to collect in one single place the CSS declarations you need to use in multiple packages of your application. You can also leverage a good amount of validation in your Java code. You might argue the process is a bit tedious, but I can assure you that, for a big GWT application, it can save you lots of time later on, especially when refactoring the code.

There are other interesting things to know about the ClientBundles, but for now I'll stop here.

Saturday, February 26, 2011

Dublin Core and PRISM

As I was saying in one of my previous posts, distinguishing the different kinds of contributions is not trivial. However, sometimes it is necessary. This is probably the case for publishers that want to keep track of the exact role of the different contributors to a resource.

This is the case of PRISM (Publishing Requirements for Industry Standard Metadata), a metadata vocabulary for managing, post-processing, multi-purposing and aggregating publishing content for magazine and journal publishing. PRISM allows you to distinguish between different creator roles: writer, editor, composer, speaker, photographer... you can find the full list in The PRISM Controlled Vocabulary Namespace. PRISM also uses parts of the Dublin Core Element Set and Dublin Core Terms; the subset of terms is listed in the document named The PRISM Subset of the Dublin Core Namespace.

The combination of DC and PRISM, for instance for a book, would look something like this in XML:
<dc:creator prism:role="writer">John Doe</dc:creator>
<dc:creator prism:role="editor">Paolo Ciccarese</dc:creator>
<dc:creator prism:role="graphicDesigner">Michael Doe</dc:creator>
In RDF, according to the specification (paragraph 3.5.2 of the PRISM Subset of the Dublin Core Namespaces: Version 2.1), this would look like:
<dc:creator rdf:resource="contributorrole.xml#writer">
     John Doe
</dc:creator>
<dc:creator rdf:resource="contributorrole.xml#editor">
     Paolo Ciccarese
</dc:creator>
<dc:creator rdf:resource="contributorrole.xml#graphicDesigner">
     Michael Doe
</dc:creator>
However, this is not valid RDF, for a couple of reasons that you can find out yourself through the RDF Validator Service. The Dublin Core Element Set properties used by PRISM and by the PRISM Aggregator Message (PAM) are: creator, contributor, description, format (PRISM records restrict values of the dc:format element to those in the list of Internet Media Types [MIME]), identifier (for instance a DOI), publisher, subject, title, type. Other properties are listed, but not as items of the PAM format: language, relation, source.

For instance, this is how PRISM deals with identifiers in RDF:
<dc:identifier>10.1030/03054</dc:identifier>
<prism:doi>http://dx.doi.org/10.1030/03054</prism:doi>
<prism:url rdf:resource="http://dx.doi.org/10.1030/03054"/>
Basically, besides using dc:identifier, PRISM uses the properties prism:doi - which declares more explicitly than dc:identifier what the identifier is - and prism:url. Strangely enough, the property prism:doi actually takes as its value the DOI proxy URL and not the DOI string. Therefore, I see prism:doi and prism:url as redundant properties. You can find some more details in this old blog post by Tony Hammond.

Moreover, PRISM PAM also makes use of the Dublin Core Terms dcterms:hasPart and dcterms:isPartOf, for instance to declare images that are part of a document:
<dcterms:hasPart rdf:resource="http://www.myexamples.com/ExamplePhoto.jpg"/>

Thursday, February 24, 2011

AO: Annotating with one or multiple statements (triples)

A few days ago I had a phone discussion with some colleagues (Tudor Groza, Vit Novacek and Cartic Ramakrishnan) on how to use the Annotation Ontology (AO) for attaching something more complex than a single term (identified by a URI) to a document or document fragment. To make it clear, I am giving here an idea of how something like that can already be done in AO.

Let's say I am performing some text mining on some textual content. It is possible that I don't simply want to associate a term with a span of text; I want to do something more elaborate. For example, I want to say: analyzing this span of text, I obtain the triple GeneG encodes ProteinP. How can I do that in AO? For instance, I can use a Named Graph and say something like in the following picture:

Figure 1: The dashed ovals are instances of annotation items. Selectors and other details of the actual annotation have been omitted.

As you can see, we have also annotated the atomic components of my triple. By doing this, while analyzing the assertions belonging to a specific domain, I can always trace back to the original text. Also, using a graph as the object of my annotation, I am going in the direction of the Nanopublication format; however, this will be the topic of a future post.

Given this, you can imagine attaching the proper provenance to the annotation. If you are a text miner, you might be interested in recording what software or computational workflow generated such an annotation, and with what confidence.


You might have noticed the usage of the namespace tm, which stands for Text Mining. It is a set of properties I am working on for extending AO to better represent text mining results.

Principle: Traceability [2] - Provenance and Dublin Core

For people working with Semantic Web technologies, for a long time provenance has meant the Dublin Core Metadata Element Set, a vocabulary of fifteen properties for use in resource description (as the webpage currently states). Let's take, for instance, the following property:
'creator': an entity primarily responsible for making the content of the resource. Examples of a Creator include a person, an organization, or a service. Typically the name of the Creator should be used to indicate the entity.
You can find the guidelines for the usage of the creator property here. If we consider the RDF format (important as it provides a syntactical framework), we can look at the following (RDF/XML) example:
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
 xmlns:dc="http://purl.org/dc/elements/1.1/"> 
  <rdf:Description rdf:about="http://www.w3.org/TR/hcls-swan/">
   <dc:title>Semantic Web Applications in Neuromedicine (SWAN) Ontology</dc:title>
   <dc:creator>Paolo Ciccarese</dc:creator>
   <dc:date>2009-10-20</dc:date>
   <dc:format>text/html</dc:format>
   <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>
From the above example, I want to focus on the following triple (Turtle syntax):
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .

<http://www.w3.org/TR/hcls-swan/> dc:creator "Paolo Ciccarese".
Now, if you take a look at the actual document I wrote, I appear there as the Editor. What actually happened is that somebody else created the file for that note and I filled in the actual content. This situation is difficult to model using the Dublin Core Element Set alone. Probably one way to go is to distinguish between the file and the content.

Another example. Let's say I want to create a file with a quote from a book or a speech. I create the HTML file (my resource); however, the actual content has been authored by somebody else. How do I represent that with the Dublin Core Element Set? Let me give it a try:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
 xmlns:dcterms="http://purl.org/dc/terms/" 
 xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.paolociccarese.info/example/quotes/1">
   <dc:title>My favourite quote</dc:title>
   <dc:creator>Paolo Ciccarese</dc:creator>
   <dc:date>2011-02-23</dc:date>
   <dc:format>text/html</dc:format>
   <dc:language>it</dc:language>
   <dcterms:hasPart>
     <rdf:Description>
       <dc:creator>Dante Alighieri</dc:creator>
       <dc:description>Lasciate ogni speranza, voi ch'entrate</dc:description>
     </rdf:Description>
   </dcterms:hasPart>
  </rdf:Description>
</rdf:RDF>
As you probably noticed, I have not defined a URI for the quote, and therefore the generated triples will include a blank node. I could also think of making up a URI like http://www.paolociccarese.info/example/quotes/1#quote, as long as I can make it resolvable. The above snippet does more or less what I wanted. Now, one thing I don't like is that Dante Alighieri and I are both creators. As a matter of fact, there is some intellectual property involved in the quote, while in the making of the simple HTML page, not so much. However, this could lead to problems, as drawing the lines is not easy. I could also consider the use of the property contributor - see the guidelines here - however I am not sure that is appropriate in the present case.

Friday, February 18, 2011

Principle: Traceability [1]

According to Wikipedia:
Traceability refers to the completeness of the information about every step in a process chain. 
I've been working on Clinical Information Systems for quite a while, and traceability is a very well-known - even if usually poorly implemented - concept when talking about medical processes and patient data. For instance, for a blood pressure measurement, it is important to know who performed the procedure, where and when; but also, if the notes were first written on paper, who wrote down the measures and when; possibly who entered the data into the system, where and when... If the information system manages structured data, we might want to record the language of the operator who entered the data, the templates she used, and so on. The main idea is to keep track of the process details and of all the accountable health care professionals. In the previous list, to keep it simple, I voluntarily excluded the medical context - which cuff was used, was the pressure measured after a meal or after physical activity - which is crucial for reproducibility but opens up multiple other representational issues. You would be amazed at how complicated the model for a blood pressure measurement can become.

But what is traceability in Semantic Web terms? I guess one way of putting it is through the term Provenance, very popular these days.
Provenance, from the French provenir, "to come from", means the origin, or the source, of something, or the history of the ownership or location of an object. The term was originally mostly used for works of art, but is now used in similar senses in a wide range of fields, including science and computing.
A good alternative definition, more focused on computing and taking processual aspects into account, is provided by the W3C Provenance Incubator Group:
Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance.
In my mind, traceability is still a more generic concept than provenance. For instance, I would consider some of the aspects required for reproducibility part of traceability and not of provenance. This is because I believe it would be easier to standardize 'where, when and who' (what I consider provenance) than 'what, why, which, how' (the context that, added to provenance, gives traceability), which are domain dependent and can become very hard to define. However, the last definition makes me quite happy, and I would be glad if the incubator for provenance translated into an actual Working Group.

Tuesday, February 15, 2011

Principle: Adequate Documentation [1]

This is trivial to understand. The documentation of an ontology is very important for adoption and for data sharing. The same goes for software. One thing I always found awesome about Java was the quality of the API documentation: the JavaDocs are cool and easy to generate. For ontologies, the reality is a bit more complicated.

Some ontologies, including some of those I made, are just the pure RDFS/OWL file, often with no description whatsoever of the classes/properties. Understanding those ontologies relies totally on the users' ability to interpret the labels/names correctly. Other ontologies include very fuzzy, sometimes poor, descriptions - "Document: a document. The Document class represents those things which are, broadly conceived, 'documents'.". The interpretation relies mostly on the users' common sense. To be fair, in recent versions of the FOAF vocabulary, the previous definition is accompanied by a note saying: "We do not (currently) distinguish precisely between physical and electronic documents, or between copies of a work and the abstraction those copies embody. The relationship between documents and their byte-stream representation needs clarification". And let's be honest: defining what a document is nowadays, in the digital era, is not trivial. Is every file a document?

In the case of OBO Foundry, a set of principles has been defined. Here are some of them:
"The ontologies include textual definitions for all terms. Many biological and medical terms may be ambiguous, so terms should be defined so that their precise meaning within the context of a particular ontology is clear to a human reader."
"The ontology is well documented."

The thing is, definitions are a necessary step, but they are not enough. When I talk about 'Adequate Documentation' I mean many different things: definitions, examples of use cases, examples of resulting triples, motivations, related projects... In other words, a good amount of shared knowledge about the ontology and the process that generated it.

Unfortunately there are no clear rules. I keep trying different ways of translating as much tacit knowledge as I can into explicit knowledge, and I can't say I have found the right recipe yet. Definitions can certainly help; I find it valuable to include explanations of the ontology building process where the motivations behind the different choices are given, explanatory figures, plenty of examples with actual triples, and maybe a list of Frequently Asked Questions where the authors can publicly address some of the concerns of real users. All this takes time and effort and, from personal experience, can also cause collateral damage...

Sunday, February 13, 2011

Which principles drive ontology adoption?

Several weeks ago, I started to think about the next version of the Annotation Ontology (AO). After one year spent developing the Annotation Framework and discussing it with several colleagues and friends, I certainly have a little list of things I want to improve. Nothing major, mostly a clean-up.

Before proceeding with the updates, I wanted to better clarify the set of principles I want to follow in developing AO2. These are, in random order: Traceability, Orthogonality, Generality, Interoperability, Modularity, Extensibility, Adequate Documentation, Community Driven. The reason why I am listing these principles is important: I believe they influence adoption.

As you might have noticed, the number of available ontologies is constantly increasing. If you need to use an ontology, you have to go through the process of reviewing what is out there and selecting what you think is most appropriate. How many times have you done that? How many times did you succeed? How many times did you find the right ontology covering exactly what you needed? I am pretty sure that if you are involved in the development of a complex application the answer is something like: I found a few ontologies I could mix and match... I still need to add pieces... and, most importantly, I am not sure I agree with the way some of them are done. Right. Welcome to the Semantic Web, I would say.

I remember the old days - many years ago - when the Dublin Core Metadata Element Set, Version 1.1 (DC) was the answer to almost everything. When I started working on SWAN (Semantic Web Applications in Neuromedicine) in 2006, I immediately found DC to be insufficient for our needs. For days I struggled trying to understand what to do: use DC and be sloppy, or create something more appropriate, risking isolation and increasing the entropy of the Semantic Web world.

Well, at that time my answer was the Provenance, Authoring and Versioning Ontology (PAV), now available in version 2. The choice, at the time, was also dictated by practical reasons: if I used DC for annotation properties and wanted to stay within OWL DL, I could not use the same properties elsewhere. Since then, PAV has been used in our applications but also in several others developed by people/groups I barely know - sometimes I wish they would just tell me something like: "hey, I am using PAV and it's cool" or even "hey, I am using PAV and it sucks because...". PAV has also been considered as one of the starting points for the W3C Provenance Incubator Group.

PAV was not such a bad idea in the end. But it was a risky business. If you are developing an application, you always need to keep one eye on what already exists and the other on your requirements. This turns out to be even more complicated because it is hard to find appropriate ontologies and, when you find them, they often don't have adequate documentation for you to understand whether they are what you are looking for. Surprise! The lack of shared knowledge about an ontology does not help it emerge and does not help adoption... unless, of course, external factors - networking, important supporters, big institutions... - come into play. And external factors are no small thing.

Friday, February 04, 2011