HckLab: Provenance

Saturday, February 26, 2011

Dublin Core and PRISM

As I was saying in one of my previous posts, distinguishing between the different kinds of contributions is not trivial. Sometimes, however, it is necessary. This is probably the case for publishers that want to keep track of the exact role each contributor played in producing a resource.

One example is PRISM (Publishing Requirements for Industry Standard Metadata), a metadata vocabulary for managing, post-processing, multi-purposing and aggregating content for magazine and journal publishing. PRISM makes it possible to distinguish between different creator roles: writer, editor, composer, speaker, photographer... You can find the full list in The PRISM Controlled Vocabulary Namespace. PRISM also uses parts of the Dublin Core Element Set and Dublin Core Terms; the subset of terms is listed in the document named The PRISM Subset of the Dublin Core Namespace.

Combining DC and PRISM, for instance for a book, gives XML along these lines:

<dc:creator prism:role="writer">John Doe</dc:creator>
<dc:creator prism:role="editor">Paolo Ciccarese</dc:creator>
<dc:creator prism:role="graphicDesigner">Micheal Doe</dc:creator>
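
To make the role distinction concrete, here is a minimal Python sketch (standard library only) that reads creators like the ones above and groups them by role. The wrapping `record` element and the PRISM namespace URI are assumptions made for illustration; check the PRISM specification for the namespace of the version you use.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DC_NS = "http://purl.org/dc/elements/1.1/"
# Assumption: namespace URI for the PRISM basic vocabulary; verify against the spec.
PRISM_NS = "http://prismstandard.org/namespaces/basic/2.1/"

# Hypothetical wrapper element so that the three creators form one XML document.
SAMPLE = f"""<record xmlns:dc="{DC_NS}" xmlns:prism="{PRISM_NS}">
  <dc:creator prism:role="writer">John Doe</dc:creator>
  <dc:creator prism:role="editor">Paolo Ciccarese</dc:creator>
  <dc:creator prism:role="graphicDesigner">Micheal Doe</dc:creator>
</record>"""


def creators_by_role(xml_text):
    """Group the text of every dc:creator element by its prism:role attribute."""
    root = ET.fromstring(xml_text)
    roles = defaultdict(list)
    for creator in root.iter(f"{{{DC_NS}}}creator"):
        # Creators without a prism:role end up under "unspecified".
        role = creator.get(f"{{{PRISM_NS}}}role", "unspecified")
        roles[role].append((creator.text or "").strip())
    return dict(roles)


if __name__ == "__main__":
    for role, names in creators_by_role(SAMPLE).items():
        print(role, names)
```

Plain Dublin Core would flatten all three people into undifferentiated creators; the role attribute is what lets a consumer tell the writer from the editor.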

In RDF, according to the specifications (paragraph 3.5.2 of the PRISM Subset of the Dublin Core Namespaces: Version 2.1), this would look like:

<dc:creator rdf:resource="contributorrole.xml#writer">
     John Doe
</dc:creator>
<dc:creator rdf:resource="contributorrole.xml#editor">
     Paolo Ciccarese
</dc:creator>
<dc:creator rdf:resource="contributorrole.xml#graphicDesigner">
     Micheal Doe
</dc:creator>

However, this is not valid RDF, for a couple of reasons that you can find for yourself through the RDF Validator Service. One of them is that an RDF/XML property element carrying rdf:resource must be empty, so it cannot also contain the name as text.

The Dublin Core Element Set properties used by PRISM and by the PRISM Aggregator Message (PAM) are: creator, contributor, description, format (PRISM records restrict values of the dc:format element to those in the list of Internet Media Types [MIME]), identifier (for instance a DOI), publisher, subject, title, type. Other properties are listed, but not as items of the PAM format: language, relation, source.
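
Going back to the RDF snippet above: one problem a validator will report is that a property element with an rdf:resource attribute also contains text. The following Python sketch (standard library only) flags such elements. It is a narrow check written for illustration, not a replacement for a real RDF parser, and the enclosing `rdf:Description` with its `http://example.org/book` subject is a hypothetical wrapper added so that the fragment is a complete document.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical wrapper around one of the dc:creator elements from the post.
SNIPPET = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}">
  <rdf:Description rdf:about="http://example.org/book">
    <dc:creator rdf:resource="contributorrole.xml#writer">
      John Doe
    </dc:creator>
  </rdf:Description>
</rdf:RDF>"""


def resource_properties_with_text(xml_text):
    """Return (tag, text) for elements that carry rdf:resource and also contain text."""
    root = ET.fromstring(xml_text)
    offenders = []
    for elem in root.iter():
        text = (elem.text or "").strip()
        if elem.get(f"{{{RDF_NS}}}resource") is not None and text:
            offenders.append((elem.tag, text))
    return offenders


if __name__ == "__main__":
    for tag, text in resource_properties_with_text(SNIPPET):
        print(f"{tag} has rdf:resource but also contains {text!r}")
```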

For instance, this is how PRISM can deal with identifiers in RDF:

<dc:identifier>10.1030/03054</dc:identifier>
<prism:doi>http://dx.doi.org/10.1030/03054</prism:doi>
<prism:url rdf:resource="http://dx.doi.org/10.1030/03054"/>

Basically, besides dc:identifier, PRISM uses the properties prism:doi, which declares more explicitly than dc:identifier what the identifier is, and prism:url. Strangely enough, prism:doi actually takes as its value the DOI proxy URL and not the DOI string. Therefore, I see prism:doi and prism:url as redundant properties. You can find some more details in this old blog post by Tony Hammond.

Moreover, PRISM PAM also makes use of the Dublin Core Terms dcterms:hasPart and dcterms:isPartOf, for instance to identify images that are part of a document:
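
To illustrate the redundancy, here is a small Python sketch showing that the DOI string can be recovered from the proxy URL. The function name and the set of accepted hosts are my own assumptions, not something defined by PRISM.

```python
from urllib.parse import urlparse

# Assumption: hosts treated as DOI proxies for the purpose of this sketch.
DOI_PROXY_HOSTS = {"dx.doi.org", "doi.org"}


def doi_from_proxy_url(url):
    """Extract the bare DOI string from a DOI proxy URL."""
    parsed = urlparse(url)
    if parsed.netloc not in DOI_PROXY_HOSTS:
        raise ValueError(f"not a DOI proxy URL: {url}")
    # The DOI is the URL path without its leading slash.
    return parsed.path.lstrip("/")


if __name__ == "__main__":
    print(doi_from_proxy_url("http://dx.doi.org/10.1030/03054"))
```

With the values from the example, the prism:doi value yields the same string already stored in dc:identifier, and it is identical to the prism:url value.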

<dcterms:hasPart rdf:resource="http://www.myexamples.com/ExamplePhoto.jpg"/>

Thursday, February 24, 2011

Principle: Traceability [2] - Provenance and Dublin Core

For people working with Semantic Web technologies, provenance has long been synonymous with the Dublin Core Metadata Element Set, a vocabulary of fifteen properties for use in resource description (as its web page currently states). Let's take, for instance, the following property:

'creator': an entity primarily responsible for making the content of the resource. Examples of a Creator include a person, an organization, or a service. Typically the name of the Creator should be used to indicate the entity.

You can find the guidelines for the usage of the creator property here. If we consider the RDF format (important for providing a syntactical framework) we can look at the following (RDF/XML) example:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
 xmlns:dc="http://purl.org/dc/elements/1.1/"> 
  <rdf:Description rdf:about="http://www.w3.org/TR/hcls-swan/">
   <dc:title>Semantic Web Applications in Neuromedicine (SWAN) Ontology</dc:title>
   <dc:creator>Paolo Ciccarese</dc:creator>
   <dc:date>2009-10-20</dc:date>
   <dc:format>text/html</dc:format>
   <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>

From the above example, I want to focus on the following triple (Turtle syntax):

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .

<http://www.w3.org/TR/hcls-swan/> dc:creator "Paolo Ciccarese".
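
As a rough sketch of how the RDF/XML description maps to triples like the one above, the Python snippet below (standard library only) lists the literal-valued triples of each top-level rdf:Description. It only handles the flat case shown in this post and is not a general RDF/XML parser; the function name is my own.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The example description from the post, without the XML declaration.
RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.w3.org/TR/hcls-swan/">
   <dc:title>Semantic Web Applications in Neuromedicine (SWAN) Ontology</dc:title>
   <dc:creator>Paolo Ciccarese</dc:creator>
   <dc:date>2009-10-20</dc:date>
   <dc:format>text/html</dc:format>
   <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>"""


def literal_triples(rdf_xml):
    """Return (subject, predicate URI, literal) for flat rdf:Description elements."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # Keep only childless property elements with non-empty text.
            if len(prop) == 0 and prop.text and prop.text.strip():
                # ElementTree tags look like "{namespace}local".
                namespace, _, local = prop.tag[1:].partition("}")
                triples.append((subject, namespace + local, prop.text.strip()))
    return triples


if __name__ == "__main__":
    for triple in literal_triples(RDF_XML):
        print(triple)
```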

Now, if you take a look at the actual document I wrote, you will see that I appear as the Editor. What actually happened is that somebody else created the file for that note and I filled in the actual content. This situation is difficult to model using the Dublin Core Element Set alone. Probably one way to go is to distinguish between the file and the content.

Another example. Let's say I want to create a file with a quote from a book or a speech. I create the HTML file (my resource). However, the actual content has been authored by somebody else. How do I represent that with the Dublin Core Element Set? Let me give it a try:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
 xmlns:dcterms="http://purl.org/dc/terms/" 
 xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.paolociccarese.info/example/quotes/1">
   <dc:title>My favourite quote</dc:title>
   <dc:creator>Paolo Ciccarese</dc:creator>
   <dc:date>2011-02-23</dc:date>
   <dc:format>text/html</dc:format>
   <dc:language>it</dc:language>
   <dcterms:hasPart>
     <rdf:Description>
       <dc:creator>Dante Alighieri</dc:creator>
       <dc:description>Lasciate ogni speranza, voi ch'entrate</dc:description>
     </rdf:Description>
   </dcterms:hasPart>
  </rdf:Description>
</rdf:RDF>

As you have probably noticed, I have not defined a URI for the quote, and therefore the generated triples will include a blank node. I could also make up a URI like http://www.paolociccarese.info/example/quotes/1#quote, as long as I can make it resolvable. The above snippet does more or less what I wanted. Now, one thing I don't like is that Dante Alighieri and I are both creators. As a matter of fact, the quote involves some intellectual property, while the making of a simple HTML page does not involve much. However, this could lead to problems, as drawing the line is not easy. I could also consider using the contributor property (see the guideline here); however, I am not sure that is appropriate in the present case.

Sunday, February 13, 2011

Which principles drive ontology adoption?

Several weeks ago, I started thinking about the next version of the Annotation Ontology (AO). After one year spent developing the Annotation Framework and discussing it with several colleagues and friends, I certainly have a little list of things I want to improve. Nothing major, mostly a cleanup.

Before proceeding with the updates, I wanted to clarify the set of principles I want to follow in developing AO2. These are, in random order: Traceability, Orthogonality, Generality, Interoperability, Modularity, Extensibility, Adequate Documentation, Community Driven. The reason I am listing these principles matters: I believe they influence adoption.

As you might have noticed, the number of available ontologies is constantly increasing. If you need to use an ontology, you have to go through the process of reviewing what is out there and selecting what you think is most appropriate. How many times have you done that? How many times did you succeed? How many times did you find the right ontology covering exactly what you needed? I am pretty sure that if you are involved in the development of a complex application the answer is something like: I found a few ontologies I could mix and match... I still need to add pieces... and, most importantly, I am not sure I agree with the way some of them are done. Right. Welcome to the Semantic Web, I would say.

I remember the old days, many years ago, when the Dublin Core Metadata Element Set, Version 1.1 (DC) was the answer to almost everything. When I started working on SWAN (Semantic Web Applications in Neuromedicine) in 2006, I immediately found DC to be insufficient for our needs. For days I struggled to understand what to do: use DC and be sloppy, or create something more appropriate and risk isolation and an increase in the entropy of the Semantic Web world.

Well, at that time my answer was the Provenance, Authoring and Versioning Ontology (PAV), now available in version 2. The choice, at the time, was also dictated by practical reasons: if I was using DC for Annotation Properties and I wanted to be OWL DL, I could not use it also for other properties. Since then, PAV has been used in our applications but also in several others developed by people and groups I barely know. Sometimes I wish they would just tell me something like: "hey I am using PAV and it's cool" or even "hey I am using PAV and it sucks because...". PAV has also been considered as one of the starting points for the W3C Provenance Incubator Group.

PAV was not such a bad idea in the end. But it was a risky business. If you are developing an application, you always need to keep one eye on what already exists and the other on your requirements. This turns out to be even more complicated because it is hard to find appropriate ontologies and, when you find them, they often don't have adequate documentation for you to understand whether they are what you are looking for. Surprise! The lack of shared knowledge about an ontology does not help it to emerge and does not help adoption... unless, of course, external factors (networking, important supporters, big institutions...) come into play. And external factors are no small thing.