Monday, September 26, 2011

Collections Ontology v.2... list of persons [1]

Back in 2008 I published a couple of posts explaining my need for an ontology for collections.
After some months of work with Silvio Peroni, we are almost done with v.2 of the Collections Ontology (CO), expressed in OWL 2. Following is a simple example that illustrates some of the features of CO v.2. Before that, here is how to set up the environment for testing the features yourself. First of all, I suggest installing Protege 4.1 and making sure the Pellet plugin for it is installed as well.

With Protege up and running, I imported the development version of the Collections Ontology v.2 from the URL: http://collections-ontology.googlecode.com/svn/trunk/collections.owl.

Figure 1 - Import of the development version of the Collections Ontology (CO) with Protege

Figure 2 - Collections Ontology (CO) is imported

After that I created a class person - note that I have not reused classes such as foaf:Person, to keep the example simple - and the instances shown in figure 3, in order to model a list of persons (you can download the file here).

Figure 3 - Collections Ontology (CO) example instances (ovals)
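For readers who prefer triples to screenshots, here is a rough Turtle sketch of instance data along the lines of figure 3. It is only an approximation: the co: property names (firstItem, itemContent, nextItem, lastItem) are guessed from the labels used in the queries below and may not match the development version exactly, and the person class and name property are simply the toy ones created for this example.

@prefix co: <http://purl.org/co/> .
@prefix ex: <http://www.example.org/co-example#> .

# property and class names are guesses based on the labels; check collections.owl for the exact IRIs
ex:persons a co:List ;
    co:firstItem ex:itemOne ;
    co:lastItem ex:itemThree .

ex:itemOne a co:ListItem ;
    co:itemContent ex:paolo ;
    co:nextItem ex:itemTwo .

ex:itemTwo a co:ListItem ;
    co:itemContent ex:marco ;
    co:nextItem ex:itemThree .

ex:itemThree a co:ListItem ;
    co:itemContent ex:silvio .

# the person class and the name data property are the toy ones created for this example
ex:paolo a ex:person ; ex:name "Paolo Ciccarese" .
ex:marco a ex:person ; ex:name "Marco Ocana" .
ex:silvio a ex:person ; ex:name "Silvio Peroni" .

The reasoner is what then produces the inferred statements discussed next.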

After running the reasoner, the inferred types and properties are immediately visible (in light yellow in figure 4).

Figure 4 - Inferred types and properties for the item instance itemOne. Since itemOne has been defined as an instance of item, the reasoner infers that itemOne (i) is a list item, (ii) is followed by both itemTwo and itemThree, and (iii) is item of and is first item of persons.

Figure 5 - Explanation of itemOne type 'list item', obtained by clicking the highlighted button

We can now proceed with some DL Queries.

Query 1
For instance, we can ask for all the items whose item content (has item content) is a person with name "Paolo Ciccarese":
item and 'has item content' some (person and (name value "Paolo Ciccarese"))
In the Protege tab named 'DL Query' we can enter the query above and, selecting the 'Individuals' option on the right side, we retrieve one item (itemOne).


Query 2
We can ask for all the lists where the first person is named 'Paolo Ciccarese' (answer 'persons'):
list and 'has first item' some (
     item and 'has item content' some (person and (name value "Paolo Ciccarese"))
)
Query 3
Similarly, we can pose more complex queries such as: find all the lists of persons where the first item points to a person named 'Paolo Ciccarese' and the last item points to a person named 'Silvio Peroni' (answer 'persons'):
list and (
    'has first item' some (item and 'has item content' some
           (person and (name value "Paolo Ciccarese")))
    and
    'has last item' some (item and 'has item content' some
           (person and (name value "Silvio Peroni")))
)
Query 4
Another query can be: give me all the lists where the first person is named 'Paolo Ciccarese' and he is followed by a person named 'Marco Ocana' (given the transitive nature of the property 'is followed by', this matches lists where Marco Ocana appears anywhere after Paolo Ciccarese, not only immediately after; the answer is 'persons'):
list and (
      'has first item' some (item and
            'has item content' some (person and (name value "Paolo Ciccarese"))
            and
            'is followed by' some (item  and
                    ('has item content' some(person and(name value "Marco Ocana"))))
      )
)
Query 5
Returns all the lists containing a person named 'Paolo Ciccarese' (answer 'persons'):
list and 'has item' some (
     item and 'has item content' some (person and (name value "Paolo Ciccarese"))
)
Query 6
Returns any list where a person named 'Paolo Ciccarese' is followed by a person named 'Silvio Peroni' (answer 'persons'):
list and (
    'has item' some (item and
            'has item content' some (person and (name value "Paolo Ciccarese"))
            and
            'is followed by' some (item  and
                    ('has item content' some(person and(name value "Silvio Peroni"))))
      )
)
Query 7
Returns all the lists where a person named 'Silvio Peroni' is preceded by a person named 'Paolo Ciccarese' (answer 'persons'):
list and (
    'has item' some (item and
            'has item content' some (person and (name value "Silvio Peroni"))
            and
            'is preceded by' some (item  and
                    ('has item content' some(person and(name value "Paolo Ciccarese"))))
      )
)

Thursday, September 22, 2011

SWAN, AlzSWAN, HyQue and Nanopublications

While developing the SWAN ontology and the SWAN platform (see AlzSWAN for Alzheimer's disease) there have always been two open issues: (i) the use of named graphs and (ii) the translation of the textual discourse elements (claims such as: Intramembranous Aβ could behave as chaperones of other membrane proteins) into a formal representation made of triples.


(i) The use of named graphs is a useful way of wrapping some content and specifying its provenance. Basically, the idea is to create an 'onion layers' model where each layer has its own provenance. At the time - back in 2006 - I investigated the usage of named graphs - and TriX - for representing SWAN content. However, we decided not to implement that approach: the technological uncertainty in the uncharted territory of developing an application like SWAN was already high enough, named graph usage was not homogeneous across the community, and their serialization was not standardized. This meant introducing de facto reification for some of the SWAN relationships in order to be able to attach the appropriate provenance. Now that graphs are the topic of one of the task forces of the RDF Working Group updating the 2004 RDF Recommendations, I have started to think about resuming the old plans.

(ii) The translation of the textual discourse elements into a formal representation made of triples is possible, for instance, through the HyBrow (now HyQue) approach. Translating narrative into triples is not an easy job though. Many have already found the SWAN manual creation process of narrative claims very labor intensive. In fact, the SWAN curators have usually been rephrasing each claim/hypothesis to make it simple and self-contained (including the minimum necessary context). Translation into triples requires, even more, starting from neat hypotheses and claims. And these are not always that easy to obtain.
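Just to illustrate the point with a toy example - the URIs and the predicate below are entirely made up - turning even a short, well-phrased claim into triples forces the curator to commit to specific entities and relationships:

@prefix ex: <http://www.example.org/claims/> .

# 'Intramembranous Abeta could behave as chaperones of other membrane proteins'
ex:IntramembranousAbeta ex:actsAsChaperoneOf ex:MembraneProtein .

Note that the hedging ('could behave') is already lost in this rendering, which is exactly the kind of modeling decision that makes the translation labor intensive.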

These two SWAN-related issues had been on my mind for a while when the Nanopublication [1] concept came out.

A Nanopublication is a "set of annotations that refers to the same statement and contains a minimum set of (community) agreed-upon annotations".
The concept itself is simple, and in the above linked slideshow you can find a first attempt based on real SWAN data. With respect to the paper, the concept of 'statement' (triple) has to be updated to 'statements' (triples), as one single statement is not always enough to satisfy the needs of real use cases.
 Statement --> statements
Starting from the above example, we are now trying to formalize a bit better what a Nanopublication architecture would look like... it is work in progress, but if you look at the slides you will get the drift.
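Just to give an idea of the direction, here is a purely illustrative sketch - made-up URIs and graph names, not the still-to-be-formalized Nanopublication schema - of how the 'statements plus agreed-upon annotations' idea could be serialized with named graphs (TriG syntax):

@prefix ex: <http://www.example.org/nanopub/1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .

# the scientific assertion: a small graph, not necessarily a single triple
ex:assertion {
    ex:GeneG ex:encodes ex:ProteinP .
    ex:ProteinP ex:isAssociatedWith ex:AlzheimerDisease .
}

# annotations about that graph: who stated it and when
ex:provenance {
    ex:assertion dc:creator "Paolo Ciccarese" ;
                 dc:date "2011-09-22" .
}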

[1] Paul Groth, Andrew Gibson, Jan Velterop. The anatomy of a nanopublication. Information Services and Use (2010). Volume: 30, Issue: 1, Publisher: IOS Press, Pages: 51-56 (on Mendeley)

Wednesday, September 21, 2011

UIBinder, CssResource and CSS (GWT)

A few months ago I blogged about ClientBundle, UIBinder and CSS (GWT). I just realized that an important use case was missing. Here it is.

4) Using CssResource for CSS expressed in the UIBinder. When creating a UI with the UIBinder, you may want to include some CSS rules directly in the ui.xml file.

<ui:UiBinder
  xmlns:ui='urn:ui:com.google.gwt.uibinder'
  xmlns:g='urn:import:com.google.gwt.user.client.ui'>

    <ui:style>
       .outer {
          width: 100%;
       }
    </ui:style>
    <g:SimplePanel ui:field='sideBar'>
    </g:SimplePanel>
</ui:UiBinder>

It is possible to refer to those CSS rules from the GWT code. To do so, we have to declare a 'type' for ui:style:

<ui:UiBinder
  xmlns:ui='urn:ui:com.google.gwt.uibinder'
  xmlns:g='urn:import:com.google.gwt.user.client.ui'>

    <ui:style type='org.example.Example.ExStyle'>
       .outer {
          width: 100%;
       }
    </ui:style>
   
    <g:SimplePanel ui:field='sideBar'>
    </g:SimplePanel>
</ui:UiBinder>


Now, in the class  org.example.Example, we can reference the CSS rules and use them:

public class Example extends Composite {

    // The binder is parameterized with the root widget type and the owner class
    interface Binder extends UiBinder<Widget, Example> { }
    private static final Binder binder = GWT.create(Binder.class);

    @UiField SimplePanel sideBar;
    @UiField ExStyle style;

    // Mirrors the type declared in <ui:style type='org.example.Example.ExStyle'>
    interface ExStyle extends CssResource {
        String outer();
    }

    public Example() {
        initWidget(binder.createAndBindUi(this)); 
        sideBar.setStyleName(style.outer()); 
    }
   ...

Wednesday, September 14, 2011

Grails 1.3.7, Spring Security and OpenID... with exception

Here is the list of steps I performed to set up OpenID authentication with Spring Security:

1) Create Project with Grails 1.3.7
grails create-project UsersManagement
The file application.properties will look something like this:
app.grails.version=1.3.7
app.name=UsersManagement
app.servlet.version=2.4
app.version=0.1
plugins.hibernate=1.3.7
plugins.tomcat=1.3.7 

2) Installing the Spring Security plugin (website and documentation)
grails install-plugin spring-security-core
The file application.properties will now look something like this:
app.grails.version=1.3.7
app.name=UsersManagement
app.servlet.version=2.4
app.version=0.1
plugins.hibernate=1.3.7
plugins.spring-security-core=1.2.1
plugins.tomcat=1.3.7

3) Creating Controller and template classes
grails s2-quickstart org.commonsemantics.scigrails.module.users.security User Role
You will notice the following new files:
controllers/LoginController.groovy
controllers/LogoutController.groovy
domain/org.commonsemantics.scigrails.module.users.security.User.groovy
domain/org.commonsemantics.scigrails.module.users.security.Role.groovy
domain/org.commonsemantics.scigrails.module.users.security.UserRole.groovy
views/login/auth.gsp
views/login/denied.gsp
The Config.groovy will be updated with the following lines:
grails.plugins.springsecurity.userLookup.userDomainClassName =
      'org.commonsemantics.scigrails.module.users.security.User'
grails.plugins.springsecurity.userLookup.authorityJoinClassName = 
      'org.commonsemantics.scigrails.module.users.security.UserRole'
grails.plugins.springsecurity.authority.className = 
      'org.commonsemantics.scigrails.module.users.security.Role'
If you are going to change the package of the above classes, just remember to update the above properties.

4) Move controllers to the desired package - org.commonsemantics.scigrails.module.users.security

5) Install the OpenID module (website and documentation)
grails install-plugin spring-security-openid
The file application.properties will now look something like this:
app.grails.version=1.3.7
app.name=UsersManagement
app.servlet.version=2.4
app.version=0.1
plugins.hibernate=1.3.7
plugins.spring-security-core=1.2.1
plugins.spring-security-openid=1.0.3
plugins.tomcat=1.3.7

6) Create the OpenID Controller and templates for it.
grails s2-init-openid
This script adds the following files:
controllers/OpenIdController.groovy
views/openId/auth.gsp
views/openId/createAccount.gsp
views/openId/linkAccount.gsp

7) Move OpenIdController to the desired package - org.commonsemantics.scigrails.module.users.security

8) Add support for the remember-me checkbox
grails s2-create-persistent-token 
     org.commonsemantics.scigrails.module.users.security.PersistentLogin
It adds the file:
domain/org.commonsemantics.scigrails.module.users.security.PersistentLogin.groovy
The Config.groovy is updated with the following lines:
grails.plugins.springsecurity.rememberMe.persistent = true
grails.plugins.springsecurity.rememberMe.persistentToken.domainClassName =
     'org.commonsemantics.scigrails.module.users.security.PersistentLogin'
Once again, if you are going to change the package of the above class, just remember to update the corresponding property.

9) Creating the OpenID domain class
grails s2-create-openid 
     org.commonsemantics.scigrails.module.users.security.OpenID 
The script creates one file:
domain/org.commonsemantics.scigrails.module.users.security.OpenID.groovy
The Config.groovy is updated with the following line:
grails.plugins.springsecurity.openid.domainClass =       
     'org.commonsemantics.scigrails.module.users.security.OpenID'
Once again, if you are going to change the package of the above class, just remember to update the corresponding property.

10) Adding OpenIDs to the User domain class

The following line of code has to be added to the existing User.groovy class:
static hasMany = [openIds: OpenID]

11) Creating some test users

As suggested by the documentation, we can now create some test users by editing the BootStrap.groovy file as follows:
import org.commonsemantics.scigrails.module.users.security.Role
import org.commonsemantics.scigrails.module.users.security.User
import org.commonsemantics.scigrails.module.users.security.UserRole

class BootStrap {

    def springSecurityService

    def init = { servletContext -> 

        String password = springSecurityService.encodePassword('password')
        
        def roleAdmin = new Role(authority: 'ROLE_ADMIN').save() 
        def roleUser = new Role(authority: 'ROLE_USER').save()

        def user = new User(username: 'user', 
            password: password, enabled: true).save() 
        def admin = new User(username: 'admin', 
            password: password, enabled: true).save()

        UserRole.create user, roleUser 
        UserRole.create admin, roleUser 
        UserRole.create admin, roleAdmin, true 
    } 
}


12) Redirect the login requests

It is now necessary to direct the authentication calls to the new controller that also manages the OpenIDs. We can add the following to the UrlMappings.groovy file (inside the static mappings closure):
"/login/auth" {
         controller = 'openId'
         action = 'auth'
      }
      "/login/openIdCreateAccount" {
         controller = 'openId'
         action = 'createAccount'
      }

13) Running Grails and testing
grails run-app 
Welcome to Grails 1.3.7 - http://grails.org/
Licensed under Apache Standard License 2.0
...
...
Configuring Spring Security ...
Configuring Spring Security OpenID ...
Server running. Browse to http://localhost:8080/UsersManagement

When accessing the page you should see something like this:
You can try it yourself, but the authentication did not work for me right away. I logged in with my Google OpenID and I got sent to the screen for creating an account.
I then selected the 'link this OpenID' option.
After entering the test credentials 'user' and 'password' I got a 'user not found' error back.

14) Turning on the logging

First, I wanted to see if anything weird was happening, so I turned on logging in the Config.groovy file:
debug 'org.springframework.security'
No exceptions emerged when repeating the process.

15) Making sure the User is stored in the DB

We can inspect the default HSQLDB by adding the following at the end of the BootStrap.groovy init closure:
org.hsqldb.util.DatabaseManager.main()
This will open the in-memory database inspector. Just remember that if you close the inspector, that will also shut down your Grails app. Read more about the inspector here, and remember to take out that line once the debugging is done.

The users turned out to be correctly stored in the database.

16) Making sure the encoding process works correctly

After editing the BootStrap.groovy file with:
String password = springSecurityService.encodePassword('password')
println 'a ' + password
def admin = new User(username: 'admin', password: password, 
     enabled: true).save(failOnError: true)
println 'b ' + User.findByUsername('admin').password
and a restart, I obtained:
a 5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8
b 113459eb7bb31bddee85ade5230d6ad5d8b2fb52879e00a84ff6ae1067a210d3

It is clear that the password is being encoded twice. After investigating, I found out that the User domain class already includes the methods:
def beforeInsert() {
    encodePassword()
}

def beforeUpdate() {
    if (isDirty('password')) {
        encodePassword()
    }
}

17) Updating the Bootstrap.groovy file

As the User domain class already performs the encoding, we have to change the following BootStrap.groovy line:
String password = springSecurityService.encodePassword('password')
into:
String password = 'password'

And now the login process works for me. We can now remove org.hsqldb.util.DatabaseManager.main() from the end of BootStrap.groovy and maybe comment out the logging (see point #14).

18) Remove the unused views and controllers?

It is not possible to remove any of the generated controllers. As far as I know the only file that can be removed is
views/login/auth.gsp
as the new auth.gsp file provided by the OpenID plugin is now in use (see point #12).


You can check out the code as follows:
svn checkout 
     https://common-semantics.googlecode.com/svn/tags/UsersManagement20110914  
     UsersManagement

Resources
OpenID website
Spring Security Plugin for Grails website
OpenID Plugin for Spring Security website

Tuesday, August 30, 2011

UIMA, Clerezza, AO and Linked Open Data [1]

More than a year ago I started working on the integration of text mining services with DOMEO, the tool I am building. In DOMEO we try to focus, as much as possible, on annotations that refer to entities defined in controlled vocabularies, ontologies or knowledge bases - in other words, entities identified by URIs. Almost naturally, after writing a few connectors to existing text mining services, we realized how nice it would be to have a common way of exposing such services.

After some investigation, I found Apache Clerezza, a service platform based on OSGi (Open Services Gateway initiative) which provides functionality for managing semantically linked data, accessible in a secured way through RESTful web services. As I am familiar with OSGi technology, I contacted the person responsible for the UIMA integration - Tommaso Teofili - right away and we started to exchange ideas. After writing to the Clerezza mailing list, I decided to write and contribute some code able to transform UIMA results into the Annotation Ontology (AO) RDF format.

Working with Antony Scerry (Elsevier) months ago, I learned right away that the hard part was not the code itself but the fact that the UIMA types are extremely flexible. Most existing NLP tools do not return URIs, and even when text mining tools do return URIs it is not trivial to capture the URI of the recognized entity: that information can be encoded in whatever way the service developers decide. That can be a problem when trying to integrate multiple text mining services under a common interface.

As I did not want to propose something too complicated, Tommaso and I agreed that adopting a little convention could make our life easier. After a few discussions, Tommaso introduced a couple of basic entities: ClerezzaBaseAnnotation and ClerezzaBaseEntity. Both implement the property 'uri', and ClerezzaBaseEntity also implements the property 'label'. When creating new type system descriptors, in order to adopt the proposed convention, it is simply necessary to extend these two types and use them appropriately. As a result, through a patch I wrote, it is now possible, for instance, to map the results to the Annotation Ontology (AO) RDF format without any effort and, therefore, to display your results on the analyzed document through DOMEO.

Thanks to Tommaso, my patch has been integrated into the Clerezza codebase today. The documentation will be made available soon, as well as other code that will make it easy to set up web services for publishing UIMA algorithms.

Thursday, May 26, 2011

DOMEO: Linking science and semantics with Annotation Ontology (AO) [1]

In the last few months I've been focusing on the development of the SWAN Annotation Tool (recently renamed DOMEO)*. DOMEO (Document Metadata Exchange Organizer), is an extensible web component enabling users to visually and efficiently create and share ontology-based stand-off annotation metadata on HTML or XML document targets, using the Annotation Ontology RDF model. The tool supports manual, fully automated, and semi-automated annotation with complete provenance records, as well as personal or community annotation with access authorization and control. DOMEO is one of the pieces of a bigger architecture that we internally call Annotation Framework.

The DOMEO interface
The idea itself is pretty simple: DOMEO is basically a little browser inside the browser. It allows the user to type a URL, open the corresponding document and annotate it. It is also possible to pass a URL as a parameter so that the tool opens with the page you want to annotate already loaded in the content frame. This option is particularly helpful when integrating the tool with other applications or other sections of the Annotation Framework.

Figure 1: A screenshot of DOMEO. You can see the address bar with the URL of the document displayed below it. The document displays the same way it would appear when opened in a new browser window.
The annotation can be performed manually by the user or automatically by text mining or entity recognition services. The two features are available through two buttons in the DOMEO toolbar, labeled 'Annotate' and 'Text Mining' respectively. When the option 'Text Mining' is selected, the tool lists all the available text mining or entity recognition services. The user can then decide which one or which ones to run on the loaded document.

The 'Save' button saves the produced annotations. DOMEO supports a complex versioning system that saves items only when necessary, keeps track of the different versions and, for each of them, records the full provenance data. I will probably explain the versioning and provenance models in another post.

* The development of DOMEO is managed and carried out by Dr. Paolo Ciccarese. DOMEO is a product of the MIND Informatics group - Mass General Hospital. The tool is developed in parallel with the Annotation Ontology (AO)

Thursday, March 03, 2011

HTML5: The new 'semantic elements'

One of the novelties in HTML5 is the so-called 'semantic elements'. The goal of these elements is to provide a better way for web authors to define the parts of a document and, potentially, to improve accessibility (e.g. for screen readers). Some of these elements are: section, nav, article, aside, hgroup, header, and footer.

Let's say I want to create the main page of a blog. My page is going to have a header and probably a footer. If you look at web pages source code, you will notice that these two elements are present in almost every page in many different variants. The blog you are reading already uses <header> and <footer>.

The elements <header> and <footer> can be used not only for defining the structure of the main page but also for defining headers and footers of sections, articles and asides.


If you look at the structure of the blog you are reading, it is pretty similar to the one depicted in the above figure. The 'nav' area corresponds to the 'Blog Archive', and the 'aside' area collects my picture, 'labels' and 'links'. Using these new elements - together with the new features brought by CSS3 - is certainly a way to avoid the well-known 'div mania'.

Once we have a <header>, it is possible to use the element <hgroup>: it represents the heading of a section. The element is used to group a set of h1-h6 elements when the heading has multiple levels, such as subheadings, alternative titles, or taglines. Also, given the following example, the specs state that the point is to mask the h2 element (which acts as a secondary title) from the outline algorithm.
<hgroup>
  <h1>The Coolest Application</h1>
  <h2>Alpha Release</h2>
</hgroup>
I have to admit it seems a bit rigid, and I feel my code will end up with one more element than usual. I have to force myself not to use <div>, <span> or <p> for things that look more like a subtitle, and to do something like:
<header>
  <hgroup>
    <h1>The Coolest Application</h1>
    <h2>Alpha Release</h2>
  </hgroup>
  <p>Description...</p>
</header>
One thought: the new elements still seem structural to me - with different levels of granularity - rather than semantic. Maybe structure can be seen as a kind of semantics, but I don't think the name is a good idea, as it intersects with the pool of Semantic Web technologies. In other words, saying that a chunk of the document is a section does not help a machine understand the content - beyond knowing where it starts and where it ends - but if I say the section is a 'http://rdfs.org/sioc/types#BlogPost' we start to attach meaning we can leverage. And we can certainly do that with RDFa, for example.
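For example, the single statement such RDFa markup would boil down to - with a made-up URI for the post - is simply (Turtle):

@prefix sioct: <http://rdfs.org/sioc/types#> .

# hypothetical URI for this blog post
<http://www.example.org/blog/2011/03/html5-semantic-elements> a sioct:BlogPost .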

Monday, February 28, 2011

SWAN Annotation Tool: what is new in build 7

Here is a list of some of the new features that I will probably deploy this week.


The development of the SWAN Annotation Tool is managed and carried out by Dr. Paolo Ciccarese. The first build of the SWAN Annotation Tool has been developed by Dr. Paolo Ciccarese and Marco Ocana. The SWAN Annotation Tool is a product of the MIND Informatics group - Mass General Hospital - directed by Tim Clark.

Sunday, February 27, 2011

ClientBundle, UIBinder and CSS (GWT)

In a previous post I discussed the possible ways of using image resources with GWT. Another thing you might want to deal with is CSS.

1) Standard use of CSS. While using GWT you can certainly use CSS the way you would for any other web application: simply declare the CSS classes in a *.css file and import it into the page of interest. With GWT widgets you will then simply set the style name as follows:
widget.setStyleName("cssClassName");
This approach works; however, if the CSS declaration is missing, no errors are raised. Also, if you use multiple CSS files as I do, it is always annoying to track down the declarations when you need them.

2) Using the UIBinder. If you are already using UIBinder, the easiest way to include CSS declarations is to add them to the binder. It is easy and it is safe, as the Eclipse plugin helps you find missing declarations. I always use Eclipse, but if you don't, I assume you'll still catch the problems at compile time.
<ui:UiBinder
  xmlns:ui='urn:ui:com.google.gwt.uibinder'
  xmlns:g='urn:import:com.google.gwt.user.client.ui'>

    <ui:style>
       .outer {
          width: 100%;
       }
    </ui:style>

    <g:VerticalPanel styleName='{style.outer}'>
    </g:VerticalPanel>
</ui:UiBinder>
The downside of this approach is redundancy. Sometimes I want to use the same CSS declarations in multiple places and, with this approach, I have to repeat them in each single binder.

3) Using CssResource. Another alternative is to do something similar to what you can do for icons with ImageResource. First I put the stylesheet declarations in a file that I name Commons.css:
.smallIcon {
    height: 16px;
    width: 16px;
}

Then I declare the stylesheet as a resource of the application, linking the CSS file:
public class Example implements EntryPoint {

  public interface Resources extends ClientBundle {

     // Generated singleton instance of the bundle
     public static final Resources INSTANCE = GWT.create(Resources.class);

     // Binds the Commons.css file to the CssResource interface declared below
     @Source("org/example/application/client/Commons.css")
     CommonsCss commonsCss();

     ...
  }

}
Now, where the dots in the above snippet are, I can declare the CssResource interface exposing the stylesheet class names I want to use in the application:
public interface CommonsCss extends CssResource {
     String smallIcon();
}
As the CSS class has the same name as the method, everything works fine. However, sometimes you might want to give the method a different name. Using the @ClassName annotation you can address that as well:
public interface CommonsCss extends CssResource {
     @ClassName("smallIcon")
     String smallIconClass();
}
Now, we can write something like:
// Make sure the stylesheet is injected into the page before using its class names
Resources.INSTANCE.commonsCss().ensureInjected();
Image img = new Image();
img.setStyleName(Resources.INSTANCE.commonsCss().smallIcon());
...
This approach allows you to collect in one single place the CSS declarations you need to use in multiple packages of your application. Also, you can leverage a good amount of validation in your Java code. You might argue the process can be a bit tedious, but I can assure you that, for a big GWT application, it can save you lots of time later on, especially when refactoring the code.

There are other interesting things to know about the ClientBundles, but for now I'll stop here.

Saturday, February 26, 2011

Dublin Core and PRISM

As I was saying in one of my previous posts, distinguishing the different kinds of contributions is not trivial. However, sometimes it is necessary. And this is probably the case for publishers that want to keep track of the exact role of the different contributors to a resource.

This is the case of PRISM (Publishing Requirements for Industry Standard Metadata), a metadata vocabulary for managing, post-processing, multi-purposing and aggregating publishing content for magazine and journal publishing. PRISM makes it possible to distinguish between different creator roles: writer, editor, composer, speaker, photographer... you can find the full list in The PRISM Controlled Vocabulary Namespace. PRISM also uses parts of the Dublin Core Element Set and Dublin Core Terms; the subset of terms is listed in the document named The PRISM Subset of the Dublin Core Namespaces.

For a book, for instance, the combination of DC and PRISM would look in XML something like:
<dc:creator prism:role="writer">John Doe</dc:creator>
<dc:creator prism:role="editor">Paolo Ciccarese</dc:creator>
<dc:creator prism:role="graphicDesigner">Micheal Doe</dc:creator>
In RDF, according to the specifications (paragraph 3.5.2 of the PRISM Subset of the Dublin Core Namespaces: Version 2.1), this would look like:
<dc:creator rdf:resource="contributorrole.xml#writer">
     John Doe
</dc:creator>
<dc:creator rdf:resource="contributorrole.xml#editor">
     Paolo Ciccarese
</dc:creator>
<dc:creator rdf:resource="contributorrole.xml#graphicDesigner">
     Micheal Doe
</dc:creator>
However, this is not valid RDF, for a couple of reasons that you can find out yourself through the RDF Validator Service (for one, a property element carrying an rdf:resource attribute cannot also have a literal value as content).

The Dublin Core Element Set properties used by PRISM and by the PRISM Aggregator Message (PAM) are: creator, contributor, description, format (PRISM records restrict values of the dc:format element to those in the list of Internet Media Types [MIME]), identifier (for instance a DOI), publisher, subject, title, and type. Other properties are listed but not as items of the PAM format: language, relation, source.

For instance, this is how PRISM can deal with identifiers in RDF:
<dc:identifier>10.1030/03054</dc:identifier>
<prism:doi>http://dx.doi.org/10.1030/03054</prism:doi>
<prism:url rdf:resource="http://dx.doi.org/10.1030/03054"/>
Basically, besides using dc:identifier, PRISM uses the properties prism:doi - which declares more explicitly than dc:identifier what kind of identifier it is - and prism:url. Strangely enough, the property prism:doi actually takes as value the DOI proxy URL and not the DOI string. Therefore, I see prism:doi and prism:url as redundant properties. You can find some more details in this old blog post by Tony Hammond.

Moreover, PRISM PAM also makes use of the Dublin Core Terms dcterms:hasPart and dcterms:isPartOf, for instance for indicating images that are part of a document:
<dcterms:hasPart rdf:resource="http://www.myexamples.com/ExamplePhoto.jpg"/>

Thursday, February 24, 2011

AO: Annotating with one or multiple statements (triples)

A few days ago I had a phone discussion with some colleagues (Tudor Groza, Vit Novacek and Cartic Ramakrishnan) on how to use the Annotation Ontology (AO) for attaching something more complex than a single term (identified by a URI) to a document or document fragment. To make it clear, I am giving here an idea of how something like that can already be done in AO.

Let's say I am performing some text mining on some textual content. It is possible that I don't simply want to associate a term with a span of text but want to do something more elaborate. For example, I want to say: analyzing this span of text, I obtain the triple GeneG encodes ProteinP. How can I do that in AO? For instance, I can use a named graph and say something like in the following picture:

Figure 1: The dashed ovals are instances of annotation items. Selectors and other details of the actual annotation have been omitted.

As you can see, we have also annotated the atomic components of my triple. In doing this, while analyzing the assertions belonging to a specific domain, I can always trace back to the original text. Also, by using a graph as the object of my annotation I am going in the direction of the Nanopublication format; however, this will be the topic of a future post.
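To make the figure a bit more concrete, here is a rough TriG-style sketch of the idea. The graph name and the document URI are made up, and the AO property names are indicative only (check the AO specification for the exact terms), so treat this as an illustration rather than canonical AO:

@prefix ao: <http://purl.org/ao/> .
@prefix ex: <http://www.example.org/annotation/1/> .

# the body of the annotation is a named graph containing the extracted triple
ex:graph1 {
    ex:GeneG ex:encodes ex:ProteinP .
}

ex:annotation1 a ao:Annotation ;
    ao:annotatesResource ex:document1 ;    # selectors omitted, as in the figure
    ao:body ex:graph1 .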

Given this, you can imagine attaching the proper provenance to the annotation. If you are a text miner, you might be interested in recording which software or computational workflow generated such an annotation, and with what confidence.


You might have noticed the usage of the namespace tm, which stands for Text Mining. It is a set of properties I am working on to extend AO to better represent text mining results.

Principle: Traceability [2] - Provenance and Dublin Core

For people working with Semantic Web technologies, provenance has, for a long time, meant the Dublin Core Metadata Element Set, "a vocabulary of fifteen properties for use in resource description" (as its webpage currently states). Let's take, for instance, the following property:
'creator': an entity primarily responsible for making the content of the resource. Examples of a Creator include a person, an organization, or a service. Typically the name of the Creator should be used to indicate the entity.
You can find the guidelines for the usage of the creator property here. If we consider the RDF format (important for providing a syntactical framework) we can look at the following (RDF/XML) example:
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
 xmlns:dc="http://purl.org/dc/elements/1.1/"> 
  <rdf:Description rdf:about="http://www.w3.org/TR/hcls-swan/">
   <dc:title>Semantic Web Applications in Neuromedicine (SWAN) Ontology</dc:title>
   <dc:creator>Paolo Ciccarese</dc:creator>
   <dc:date>2009-10-20</dc:date>
   <dc:format>text/html</dc:format>
   <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>
Of the above example I want to focus on the following triple (Turtle syntax):
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .

<http://www.w3.org/TR/hcls-swan/> dc:creator "Paolo Ciccarese".
Now, if you take a look at the actual document I wrote, I appear there as the Editor. What actually happened is that somebody else created the file for that note and I filled in the actual content. This situation is difficult to model using simply the Dublin Core Element Set. Probably one way to go is to distinguish between the file and the content.

Another example. Let's say I want to create a file with a quote from a book or a speech. I create the HTML file (my resource). However, the actual content has been authored by somebody else. How do I represent that with the Dublin Core Element Set? Let me give it a try:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
 xmlns:dcterms="http://purl.org/dc/terms/" 
 xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.paolociccarese.info/example/quotes/1">
   <dc:title>My favourite quote</dc:title>
   <dc:creator>Paolo Ciccarese</dc:creator>
   <dc:date>2011-02-23</dc:date>
   <dc:format>text/html</dc:format>
   <dc:language>it</dc:language>
   <dcterms:hasPart>
     <rdf:Description>
       <dc:creator>Dante Alighieri</dc:creator>
       <dc:description>Lasciate ogni speranza, voi ch'entrate</dc:description>
     </rdf:Description>
   </dcterms:hasPart>
  </rdf:Description>
</rdf:RDF>
As you probably noticed, I have not defined a URI for the quote, and therefore the generated triples will include a blank node. I could also think of making up a URI like http://www.paolociccarese.info/example/quotes/1#quote as long as I can make it resolvable. The above snippet does more or less what I wanted. Now, one thing I don't like is that Dante Alighieri and I are both creators. As a matter of fact, there is some intellectual property involved in the quote, while in the making of the simple HTML page, not so much. However, this could lead to problems, as drawing the lines is not easy. I could also consider using the property contributor - see the guidelines here - however, I am not sure that is appropriate in the present case.
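In Turtle, with that made-up URI for the quote, the same description would look roughly like this (dc:date, dc:format and dc:language omitted for brevity):

@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .

<http://www.paolociccarese.info/example/quotes/1>
    dc:title "My favourite quote" ;
    dc:creator "Paolo Ciccarese" ;
    dcterms:hasPart <http://www.paolociccarese.info/example/quotes/1#quote> .

# the quote now has its own URI instead of a blank node
<http://www.paolociccarese.info/example/quotes/1#quote>
    dc:creator "Dante Alighieri" ;
    dc:description "Lasciate ogni speranza, voi ch'entrate" .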

Friday, February 18, 2011

Principle: Traceability [1]

According to Wikipedia:
Traceability refers to the completeness of the information about every step in a process chain. 
I've been working on Clinical Information Systems for quite a while, and traceability is a very well-known - even if usually poorly implemented - concept when talking about medical processes and patient data. For instance, for a blood pressure measurement, it is important to know who performed the procedure, where and when, but also, if the notes have been written on paper first, who wrote the measures and when, possibly who entered the data into the system, where and when... If the information system is managing structured data, we might want to record the language of the operator who entered the data, the templates she used and so on... The main idea is to keep track of the process details and of all the accountable health care professionals. In the previous list, to keep it simple, I voluntarily excluded the medical context - which cuff has been used, was the pressure measured after a meal, after physical activity - which is crucial for reproducibility but opens up multiple other representational issues. You would be amazed at how complicated the model for a blood pressure measurement can become.

But what is traceability in Semantic Web terms? I guess one way of saying it is through the term Provenance, very popular these days.
Provenance, from the French provenir, "to come from", means the origin, or the source, of something, or the history of the ownership or location of an object. The term was originally mostly used for works of art, but is now used in similar senses in a wide range of fields, including science and computing.
A good alternative definition, more focused on computing and taking into account process aspects, is provided by the W3C Provenance Incubator Group:
Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance.
In my mind, traceability is still a more generic concept than provenance. For instance, I would consider some of the aspects required for reproducibility to be part of traceability and not of provenance. This is because I believe it would be easier to standardize 'where, when and who' (what I consider provenance) than 'what, why, which, how' (the context that, added to provenance, gives traceability), which are domain dependent and can become very hard to define. However, the last definition makes me quite happy, and I would be glad if the provenance incubator translated into an actual Working Group.

Tuesday, February 15, 2011

Principle: Adequate Documentation [1]

This is trivial to understand. The documentation of an ontology is very important for adoption and for data sharing. The same holds for software. One thing I have always found awesome about Java is the quality of the API documentation: the JavaDocs are cool and easy to generate. For ontologies the reality is a bit more complicated.

Some ontologies, including some of those I made, are just the plain RDFS/OWL file, often with no descriptions whatsoever of the classes/properties. The understanding of those ontologies relies totally on the ability of the users to interpret the labels/names correctly. Other ontologies include very fuzzy, sometimes poor, descriptions - "Document: a document. The Document class represents those things which are, broadly conceived, 'documents'." - whose interpretation relies mostly on the users' common sense. To be fair, in recent versions of the FOAF vocabulary, the previous definition is accompanied by a note saying: "We do not (currently) distinguish precisely between physical and electronic documents, or between copies of a work and the abstraction those copies embody. The relationship between documents and their byte-stream representation needs clarification". And let's be honest, defining what a document is nowadays, in the digital era, is not trivial. Is every file a document?

In the case of OBO Foundry, a set of principles has been defined. Here are some of them:
"The ontologies include textual definitions for all terms. Many biological and medical terms may be ambiguous, so terms should be defined so that their precise meaning within the context of a particular ontology is clear to a human reader."
"The ontology is well documented."

The thing is, definitions are a necessary step, but they are not enough. When I talk about 'Adequate Documentation' I mean many different things: definitions, examples of use cases, examples of resulting triples, motivations, related projects... In other words, a good amount of shared knowledge about the ontology and the process that generated it.

Unfortunately there are no clear rules. I keep trying different ways of turning as much tacit knowledge as I can into explicit knowledge, and I can't say I have found the right recipe yet. Definitions certainly help; I also find it valuable to include explanations of the ontology building process where the motivations behind the different choices are given, explanatory figures, plenty of examples with actual triples, and maybe a list of Frequently Asked Questions where the authors publicly address some of the concerns of real users. All this takes time and effort and, from personal experience, can also cause collateral damage...

Sunday, February 13, 2011

Which principles drive ontology adoption?

Several weeks ago, I started to think of the next version of the Annotation Ontology (AO). After one year spent developing the Annotation Framework and discussing with several colleagues and friends, I certainly have a little list of things I want to improve. Nothing major, mostly a clean up.

Before proceeding with the updates, I wanted to better clarify the set of principles I want to follow in developing AO2. These are, in random order: Traceability, Orthogonality, Generality, Interoperability, Modularity, Extensibility, Adequate Documentation, Community Driven. The reason why I am listing these principles is important: I believe they influence adoption.

As you might have noticed, the number of available ontologies is constantly increasing. If you need to use an ontology, you have to go through the process of reviewing what is out there and selecting what you think is most appropriate. How many times have you done that? How many times did you succeed? How many times did you find the right ontology covering exactly what you needed? I am pretty sure that if you are involved in the development of a complex application the answer is something like: I found a few ontologies I could mix and match... I still need to add pieces... and, most importantly, I am not sure I agree with the way some of them are done. Right. Welcome to the Semantic Web, I would say.

I remember the old days - many years ago - when the Dublin Core Metadata Element Set, Version 1.1 (DC) was the answer to almost everything. When I started working on SWAN (Semantic Web Applications in Neuromedicine) in 2006, I immediately found DC to be insufficient for our needs. For days I struggled trying to figure out what to do: use DC and be sloppy, or create something more appropriate, risking isolation and increasing the entropy of the Semantic Web world.

Well, at that time my answer was the Provenance, Authoring and Versioning Ontology (PAV), now available in version 2. The choice, at the time, was also dictated by practical reasons: if I used DC for annotation properties and wanted to stay within OWL DL, I could not use it for other properties as well. Since then, PAV has been used in our applications but also in several others developed by people/groups I barely know - sometimes I wish they would just tell me something like: "hey I am using PAV and it's cool" or even "hey I am using PAV and it sucks because...". PAV has also been considered as one of the starting points for the W3C Provenance Incubator Group.

PAV was not such a bad idea in the end. But it was a risky business. If you are developing an application you always need to keep one eye on what already exists and the other eye on your requirements. This turns out to be even more complicated because it is hard to find appropriate ontologies and, when you find them, they often don't have adequate documentation for you to understand whether they are what you are looking for. Surprise! The lack of shared knowledge about an ontology does not help it to emerge and does not help adoption... unless, of course, external factors - networking, important supporters, big institutions... - come into play. And external factors are no small thing.

Monday, January 31, 2011

Annotation and Content Improvement

I recently attended the workshop 'Beyond the PDF' in San Diego, and I noticed multiple times how the concept of 'Annotation' is often understood as a task performed after publication of a physical or digital document.

I consider Annotation to be more ubiquitous and important at all stages: before, during and after publication. Also, Annotation is not only about classic textual documents. Images, database records and data sets can be annotated as well. Even physical objects can be digitally annotated when we create a corresponding digital record or - speaking in terms of ontologies - when we refer to the representation of that particular instance of a certain class.

Annotation can exist as such forever, or it can be incorporated back into the original document/resource or into a new version of it. If you think of the old-fashioned paper encyclopedia, every year - or every few years - the editor collected the various annotations to come up with a new edition of the heavy volumes. This was very close to what in the digital world is called versioning.

In the modern digital world annotation is everywhere. Tags attached to a document are annotations. Leveraging crowdsourcing makes it possible to include the most popular tags as keywords for that document. Delicious users experience this any time they tag a new resource and receive suggestions of popular or appropriate tags. Reviews of catalog items on Amazon are annotations, and the statistical analysis of those reviews appears next to the selected item in the form of stars. To some extent, edits in a wiki - where a user changes the current document content - can be seen as annotations and could be exported as such. However, I understand that using the term Annotation for edits might sound like a bit of a stretch.

Maybe, in today's digital world, a better way to refer to this process is 'content refinement', as everything can potentially be 'changed'. But even the term refine might fall short, as 'to refine' means improving by making small changes. Sometimes edits are massive, and an article in Wikipedia can evolve dramatically over time. It is not simply polishing and fixing: we can add/remove big chunks of the original document - adding missing items or removing items that are redundant or no longer valid - or we can bring the original document up to date, for instance adding new evidence that was not available when the document was previously published. 'Content Improvement' is probably generic enough to cover refinements and edits.

Sure, I am talking about evolving documents, but that does not preclude taking snapshots of them as a 'traditional publication' or as a version of the resource. Take online news. I have realized more than once that the news at a specific URL was changing, with journalists incrementally adding new sections at the bottom of the page whenever new updates were available. You might argue this is not good practice, but it happens more often than you think. The reason is simple: in the digital world, it is possible and cheap. We don't have to reprint a book or add an errata page to avoid reprinting. We just create a note or directly edit the content - hopefully while keeping track of the changes.

I see many attempts to redefine what a publication is. These days, I believe a publication is a multidimensional evolving artifact including images, videos, live tables, data, metadata... and no matter what it includes or what it looks like, it has to accommodate change, or content improvement. Only snapshots of it, taken at particular times, match the 'classic' concept of publication.