HckLab: Ontology

Monday, September 26, 2011

Collections Ontology v.2... list of persons [1]

Back in 2008 I published a couple of posts explaining why I needed to create an ontology for collections.

After some months of work with Silvio Peroni, we are almost done with v.2 of the Collections Ontology (CO), expressed in OWL 2. What follows is a simple example that illustrates some of the features of CO2. Before that, here is how to set up the environment to test the features yourself. First of all, I suggest installing Protege 4.1 and making sure the Pellet plugin is installed as well.

With Protege up and running, I imported the development version of the Collections Ontology v.2 from the URL: http://collections-ontology.googlecode.com/svn/trunk/collections.owl.

Figure 1 - Import of the development version of the Collections Ontology (CO) with Protege

Figure 2 - Collections Ontology (CO) is imported

After that I created a class person (note that I haven't reused classes such as foaf:Person, to keep the example simple) and the instances in figure 3 in order to model a list of persons (you can download the file here).

Figure 3 - Collections Ontology (CO) example instances (ovals)

After triggering the reasoner, the inferred properties are immediately visible (figure 4, in light yellow).

Figure 4 - Inferred types and properties for the item instance itemOne. For example, itemOne has been defined as an instance of item, and the reasoner infers that itemOne (i) is a list item, (ii) is followed by both itemTwo and itemThree, and (iii) is item of and is first item of persons.

Figure 5 - Explanation of the itemOne type 'list item', obtained by clicking the highlighted button
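
The inferences in figures 4 and 5 can be approximated outside Protege. The following is a minimal Python sketch, not the reasoner's actual algorithm: it assumes the only asserted ordering facts are direct "next item" links (the dictionary `next_item` is a hypothetical stand-in for the instances in figure 3) and derives the transitive 'is followed by' relation and its inverse from them.

```python
# Hypothetical stand-in for the asserted instances:
# each item points to its direct successor in the list.
next_item = {"itemOne": "itemTwo", "itemTwo": "itemThree"}


def followed_by(item, links):
    """Return every item reachable from `item` by following direct links."""
    reached = []
    current = links.get(item)
    # Stop at the end of the chain, or if a cycle brings us back to a seen item.
    while current is not None and current not in reached:
        reached.append(current)
        current = links.get(current)
    return reached


def preceded_by(item, links):
    """Inverse of followed_by: every item from which `item` can be reached."""
    return [other for other in links if item in followed_by(other, links)]


print(followed_by("itemOne", next_item))    # ['itemTwo', 'itemThree']
print(preceded_by("itemThree", next_item))  # ['itemOne', 'itemTwo']
```

The first result matches what the reasoner reports for itemOne in figure 4: it is followed by both itemTwo and itemThree, even though only the direct links were stated.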

We can now proceed with some DL Queries.

Query 1
For instance, we can ask for all the items whose content ('has item content') is a person named "Paolo Ciccarese":

item and 'has item content' some (person and (name value "Paolo Ciccarese"))

In the Protege tab named 'DL Query' we can enter the query above and, if we select the option 'Individuals' on the right side, we retrieve one item (itemOne).

Query 2
We can ask for all the lists where the first person is named 'Paolo Ciccarese' (answer 'persons'):

list and 'has first item' some (
    item and 'has item content' some (person and (name value "Paolo Ciccarese"))
)

Query 3
Similarly, we can ask more complex queries such as: find all the lists of persons where the first item points to a person named 'Paolo Ciccarese' and the last item points to a person named 'Silvio Peroni' (answer 'persons'):

list and (
    'has first item' some (item and 'has item content' some
           (person and (name value "Paolo Ciccarese")))
    and
    'has last item' some (item and 'has item content' some
           (person and (name value "Silvio Peroni")))
)

Query 4
Another query can be: give me all the lists where the first person is named 'Paolo Ciccarese' and the second is 'Marco Ocana' (given the transitive nature of the property 'is followed by' the answer is 'persons'):

list and (
      'has first item' some (item and
            'has item content' some (person and (name value "Paolo Ciccarese"))
            and
            'is followed by' some (item and
                    ('has item content' some(person and(name value "Marco Ocana"))))
      )
)

Query 5
Returns all the lists containing a person named 'Paolo Ciccarese' (answer 'persons'):

list and 'has item' some (
    item and 'has item content' some (person and (name value "Paolo Ciccarese"))
)

Query 6
Returns any list where a person named 'Paolo Ciccarese' is followed by a person named 'Silvio Peroni' (answer 'persons'):

list and (
    'has item' some (item and
            'has item content' some (person and (name value "Paolo Ciccarese"))
            and
            'is followed by' some (item and
                    ('has item content' some(person and(name value "Silvio Peroni"))))
      )
)

Query 7
Returns all the lists where a person named 'Silvio Peroni' is preceded by a person named 'Paolo Ciccarese' (answer 'persons'):

list and (
    'has item' some (item and
            'has item content' some (person and (name value "Silvio Peroni"))
            and
            'is preceded by' some (item and
                    ('has item content' some(person and(name value "Paolo Ciccarese"))))
      )
)
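
To make the semantics of Queries 4, 6 and 7 more concrete, here is a small Python sketch that evaluates the same conditions over an in-memory list. It is only an illustration: the `PERSONS` data mirrors the names and order implied by the queries, and the function names are mine, not part of CO or Protege.

```python
# Names in list order, mirroring the 'persons' list used in the queries
# (illustrative data, not read from the ontology file).
PERSONS = ["Paolo Ciccarese", "Marco Ocana", "Silvio Peroni"]


def is_followed_by(names, earlier, later):
    """Query 6: some item with content `earlier` is followed,
    not necessarily directly, by an item with content `later`."""
    return any(
        name == earlier and later in names[i + 1:]
        for i, name in enumerate(names)
    )


def is_preceded_by(names, later, earlier):
    """Query 7: the same condition seen from the later item."""
    return any(
        name == later and earlier in names[:i]
        for i, name in enumerate(names)
    )


def first_is_followed_by(names, first, later):
    """Query 4: the first item has content `first` and is followed by `later`."""
    return bool(names) and names[0] == first and later in names[1:]


print(is_followed_by(PERSONS, "Paolo Ciccarese", "Silvio Peroni"))
print(is_preceded_by(PERSONS, "Silvio Peroni", "Paolo Ciccarese"))
print(first_is_followed_by(PERSONS, "Paolo Ciccarese", "Marco Ocana"))
```

All three calls print True. Note that, because 'is followed by' is transitive, `first_is_followed_by` also succeeds for "Silvio Peroni": as written, Query 4 asks for a later person rather than strictly the second one.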

Saturday, February 26, 2011

Dublin Core and PRISM

As I was saying in one of my previous posts, distinguishing the different kinds of contributions is not trivial. However, sometimes it is necessary. This is probably the case for publishers that want to keep track of the exact role of the different contributors to a resource.

This is the case of PRISM (Publishing Requirements for Industry Standard Metadata), a metadata vocabulary for managing, post-processing, multi-purposing and aggregating publishing content for magazine and journal publishing. PRISM makes it possible to distinguish between different creator roles: writer, editor, composer, speaker, photographer... You can find the full list in The PRISM Controlled Vocabulary Namespace. PRISM also uses parts of the Dublin Core Element Set and Dublin Core Terms; the subset of terms is listed in the document named The PRISM Subset of the Dublin Core Namespace.

The combination of DC and PRISM, for instance for a book, looks something like this in XML:

<dc:creator prism:role="writer">John Doe</dc:creator>
<dc:creator prism:role="editor">Paolo Ciccarese</dc:creator>
<dc:creator prism:role="graphicDesigner">Micheal Doe</dc:creator>
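
As a rough sketch of how an application could read these roles, the Python below parses the three elements with the standard library and groups creator names by role. The `<record>` wrapper element is hypothetical (the fragment above has no root), and the PRISM namespace URI is an assumption on my part; check it against the PRISM version you actually use.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
PRISM = "http://prismstandard.org/namespaces/basic/2.1/"  # assumed namespace URI

# The <record> wrapper is hypothetical: it only supplies a root element
# and the namespace declarations the fragment needs to be well-formed.
DOC = f"""
<record xmlns:dc="{DC}" xmlns:prism="{PRISM}">
  <dc:creator prism:role="writer">John Doe</dc:creator>
  <dc:creator prism:role="editor">Paolo Ciccarese</dc:creator>
  <dc:creator prism:role="graphicDesigner">Micheal Doe</dc:creator>
</record>
"""


def creators_by_role(xml_text):
    """Map each prism:role value to the list of dc:creator names carrying it."""
    roles = {}
    root = ET.fromstring(xml_text)
    for creator in root.iter(f"{{{DC}}}creator"):
        role = creator.get(f"{{{PRISM}}}role", "unspecified")
        roles.setdefault(role, []).append((creator.text or "").strip())
    return roles


for role, names in creators_by_role(DOC).items():
    print(role, "->", names)
```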

In RDF, according to the specifications (paragraph 3.5.2 of the PRISM Subset of the Dublin Core Namespaces: Version 2.1), this would look like:

<dc:creator rdf:resource="contributorrole.xml#writer">
     John Doe
</dc:creator>
<dc:creator rdf:resource="contributorrole.xml#editor">
     Paolo Ciccarese
</dc:creator>
<dc:creator rdf:resource="contributorrole.xml#graphicDesigner">
     Micheal Doe
</dc:creator>

However, this is not valid RDF, for a couple of reasons that you can find yourself through the RDF Validator Service. The Dublin Core Element Set properties used by PRISM and by the PRISM Aggregator Message (PAM) are: creator, contributor, description, format (PRISM records restrict the values of the dc:format element to those in the list of Internet Media Types [MIME]), identifier (for instance a DOI), publisher, subject, title, type. Other properties are listed, but not as items of the PAM format: language, relation, source.

For instance, this is how PRISM can deal with identifiers in RDF:

<dc:identifier>10.1030/03054</dc:identifier>
<prism:doi>http://dx.doi.org/10.1030/03054</prism:doi>
<prism:url rdf:resource="http://dx.doi.org/10.1030/03054"/>

Basically, besides the usage of dc:identifier, PRISM is using the properties prism:doi, which declares more explicitly than dc:identifier what the identifier is, and prism:url. Strangely enough, the property prism:doi actually takes as its value the DOI proxy URL and not the DOI string. Therefore, I see prism:doi and prism:url as redundant properties. You can find some more details in this old blog post by Tony Hammond.
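
The redundancy is easy to see with a few lines of Python. This is just a sketch over the three values from the example above; the `record` dictionary and the `bare_doi` helper are hypothetical names, and only the `http://dx.doi.org/` proxy prefix shown in the example is handled.

```python
DOI_PROXY_PREFIX = "http://dx.doi.org/"  # the proxy used in the example above


def bare_doi(value):
    """Strip the DOI proxy prefix, if present, and return the DOI string itself."""
    if value.startswith(DOI_PROXY_PREFIX):
        return value[len(DOI_PROXY_PREFIX):]
    return value


# The three identifier values from the PRISM example, keyed by property name.
record = {
    "dc:identifier": "10.1030/03054",
    "prism:doi": "http://dx.doi.org/10.1030/03054",
    "prism:url": "http://dx.doi.org/10.1030/03054",
}

# prism:doi and prism:url carry exactly the same value...
print(record["prism:doi"] == record["prism:url"])
# ...while the DOI string itself is what dc:identifier already holds.
print(bare_doi(record["prism:doi"]) == record["dc:identifier"])
```

Both comparisons print True: once the proxy prefix is removed, prism:doi adds nothing that dc:identifier and prism:url do not already say.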

Moreover, PRISM PAM also makes use of the Dublin Core Terms dct:hasPart and dct:isPartOf, for instance to identify images that are part of a document:

<dcterms:hasPart rdf:resource="http://www.myexamples.com/ExamplePhoto.jpg"/>

Thursday, February 24, 2011

Principle: Traceability [2] - Provenance and Dublin Core

For people working with Semantic Web technologies, provenance has for a long time meant the Dublin Core Metadata Element Set, a vocabulary of fifteen properties for use in resource description (as currently stated on its webpage). Let's take, for instance, the following property:

'creator': an entity primarily responsible for making the content of the resource. Examples of a Creator include a person, an organization, or a service. Typically the name of the Creator should be used to indicate the entity.

You can find the guidelines for the usage of the creator property here. If we consider the RDF format (important for providing a syntactical framework) we can look at the following (RDF/XML) example:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
 xmlns:dc="http://purl.org/dc/elements/1.1/"> 
  <rdf:Description rdf:about="http://www.w3.org/TR/hcls-swan/">
   <dc:title>Semantic Web Applications in Neuromedicine (SWAN) Ontology</dc:title>
   <dc:creator>Paolo Ciccarese</dc:creator>
   <dc:date>2009-10-20</dc:date>
   <dc:format>text/html</dc:format>
   <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>

From the above example I want to focus on the following triple (Turtle syntax):

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .

<http://www.w3.org/TR/hcls-swan/> dc:creator "Paolo Ciccarese".

Now, if you take a look at the actual document, you will see that I appear as the Editor. It actually happened that somebody else created the file for that note and I filled in the actual content. This situation is difficult to model using only the Dublin Core Element Set. Probably one way to go is to distinguish between the file and the content.

Another example. Let's say I want to create a file with a quote from a book or a speech. I create the HTML file (my resource). However, the actual content has been authored by somebody else. How do I represent this with the Dublin Core Element Set? Let me give it a try:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
 xmlns:dcterms="http://purl.org/dc/terms/" 
 xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.paolociccarese.info/example/quotes/1">
   <dc:title>My favourite quote</dc:title>
   <dc:creator>Paolo Ciccarese</dc:creator>
   <dc:date>2011-02-23</dc:date>
   <dc:format>text/html</dc:format>
   <dc:language>it</dc:language>
   <dcterms:hasPart>
     <rdf:Description>
       <dc:creator>Dante Alighieri</dc:creator>
       <dc:description>Lasciate ogni speranza, voi ch'entrate</dc:description>
     </rdf:Description>
   </dcterms:hasPart>
  </rdf:Description>
</rdf:RDF>

As you have probably noticed, I have not defined a URI for the quote, and therefore the generated triples will include a blank node. I could also make up a URI like http://www.paolociccarese.info/example/quotes/1#quote, as long as I can make it resolvable. The above snippet does more or less what I wanted. Now, one thing I don't like is that Dante Alighieri and I are both creators. As a matter of fact, in the quote there is some intellectual property involved, while in the making of the simple HTML page, not so much. However, this could lead to problems, as drawing the line is not easy. I could also consider using the property contributor (see the guideline here); however, I am not sure it is appropriate in the present case.
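
To illustrate the blank node and the "two creators" issue, the sketch below lists by hand the triples I would expect an RDF parser to produce from the snippet, as plain Python tuples. No RDF library is involved, the blank node label `_:quote` is arbitrary, and the helper functions are hypothetical.

```python
PAGE = "http://www.paolociccarese.info/example/quotes/1"
QUOTE = "_:quote"  # arbitrary blank node label; a real parser generates its own

DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"

# Hand-written (subject, predicate, object) triples for the RDF/XML above.
TRIPLES = [
    (PAGE, DC + "title", "My favourite quote"),
    (PAGE, DC + "creator", "Paolo Ciccarese"),
    (PAGE, DC + "date", "2011-02-23"),
    (PAGE, DC + "format", "text/html"),
    (PAGE, DC + "language", "it"),
    (PAGE, DCTERMS + "hasPart", QUOTE),
    (QUOTE, DC + "creator", "Dante Alighieri"),
    (QUOTE, DC + "description", "Lasciate ogni speranza, voi ch'entrate"),
]


def creators_of(subject, triples):
    """Return the dc:creator values attached directly to `subject`."""
    return [o for s, p, o in triples if s == subject and p == DC + "creator"]


def part_creators(subject, triples):
    """Return the creators of every resource linked via dcterms:hasPart."""
    parts = [o for s, p, o in triples if s == subject and p == DCTERMS + "hasPart"]
    return [name for part in parts for name in creators_of(part, triples)]


print(creators_of(PAGE, TRIPLES))    # ['Paolo Ciccarese']
print(part_creators(PAGE, TRIPLES))  # ['Dante Alighieri']
```

Both names come back through the same property, dc:creator; only the subject of the triple tells the maker of the file from the author of the quote.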

Friday, February 18, 2011

Principle: Traceability [1]

According to Wikipedia:

Traceability refers to the completeness of the information about every step in a process chain.

I've been working on Clinical Information Systems for quite a while, and traceability is a very well-known, even if usually poorly implemented, concept when talking about medical processes and patient data. For instance, for a blood pressure measurement, it is important to know who performed the procedure, where and when but also, if the notes were written on paper first, who wrote down the measures and when, and possibly who entered the data in the system, where and when... If the information system is managing structured data, we might want to record the language of the operator who entered the data, the templates she used and so on... The main idea is to keep track of the process details and of all the accountable health care professionals. In the previous list, to keep it simple, I deliberately excluded the medical context (which cuff was used, whether the pressure was measured after a meal or after physical activity), which is crucial for reproducibility but opens up multiple other representational issues. You would be amazed at how complicated the model for a blood pressure measurement can become.
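
Just to give an idea of the "who, where and when" chain described above, here is a deliberately naive Python sketch. It is not taken from any real clinical system: the class names, fields and sample values are all made up for illustration, and the medical context is left out exactly as in the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One accountable step: what was done, by whom, where and when."""
    action: str
    who: str
    where: str
    when: str  # ISO 8601 timestamp, kept as a string for simplicity


@dataclass
class BloodPressureMeasurement:
    systolic_mmhg: int
    diastolic_mmhg: int
    steps: List[Step] = field(default_factory=list)

    def accountable_people(self):
        """Return everyone involved, in order of first appearance, no duplicates."""
        seen = []
        for step in self.steps:
            if step.who not in seen:
                seen.append(step.who)
        return seen


# All names, places and times below are invented for the example.
measurement = BloodPressureMeasurement(
    systolic_mmhg=120,
    diastolic_mmhg=80,
    steps=[
        Step("performed measurement", "Nurse A", "Ward 3", "2011-02-18T09:00:00"),
        Step("wrote paper note", "Nurse A", "Ward 3", "2011-02-18T09:05:00"),
        Step("entered data in system", "Clerk B", "Front desk", "2011-02-18T11:30:00"),
    ],
)

print(measurement.accountable_people())  # ['Nurse A', 'Clerk B']
```

Even this toy model needs three steps for a single pair of numbers, before any of the medical context is added.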

But what is traceability in Semantic Web terms? I guess one way of saying it is through the term Provenance, very popular these days.

Provenance, from the French provenir, "to come from", means the origin, or the source of something, or the history of the ownership or location of an object. The term was originally mostly used for works of art, but is now used in similar senses in a wide range of fields, including science and computing.

A good alternative definition, more focused on computing and taking into account processual aspects, is provided by the W3C Provenance Incubator Group:

Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance.

In my mind, traceability is still a more generic concept than provenance. For instance, I would consider some of the aspects required for reproducibility part of traceability and not of provenance. This is because I believe it would be easier to standardize 'where, when and who' (what I consider provenance) than 'what, why, which, how' (the context that, added to provenance, gives traceability), which are domain dependent and can become very hard to define. However, the last definition makes me quite happy and I would be glad if the Provenance Incubator Group turned into an actual Working Group.

Tuesday, February 15, 2011

Principle: Adequate Documentation [1]

This is trivial to understand. The documentation for an ontology is very important for adoption and for data sharing. This holds for software as well. One thing I always found awesome about Java was the quality of the API documentation: the JavaDocs are cool and easy to generate. For ontologies the reality is a bit more complicated.

Some ontologies, including some of those I made, are just the pure RDFS/OWL file, often with no descriptions whatsoever of the classes/properties. The understanding of those ontologies relies totally on the ability of the users to interpret the labels/names correctly. Other ontologies include very fuzzy, sometimes poor, descriptions, such as "Document: a document. The Document class represents those things which are, broadly conceived, 'documents'." The interpretation relies mostly on the users' common sense. To be fair, in recent versions of the FOAF vocabulary, the previous definition is accompanied by a note saying: "We do not (currently) distinguish precisely between physical and electronic documents, or between copies of a work and the abstraction those copies embody. The relationship between documents and their byte-stream representation needs clarification". And let's be honest, defining what a document is nowadays, in the digital era, is not trivial. Is every file a document?

In the case of OBO Foundry, a set of principles has been defined. Here are some of them:

"The ontologies include textual definitions for all terms. Many biological and medical terms may be ambiguous, so terms should be defined so that their precise meaning within the context of a particular ontology is clear to a human reader."

"The ontology is well documented."

The thing is, definitions are a necessary step, but they are not enough. When I talk about 'Adequate Documentation' I mean many different things: definitions, examples of use cases, examples of resulting triples, motivations, related projects... In other words, a good amount of shared knowledge about the ontology and the process that generated it.

Unfortunately, there are no clear rules. I keep trying different ways of turning as much tacit knowledge as I can into explicit knowledge, and I can't say I have found the right recipe yet. Definitions can certainly help. I also find it valuable to include explanations of the ontology building process in which the motivations behind the different choices are given, explanatory figures, plenty of examples with actual triples, and maybe a list of Frequently Asked Questions where the authors can publicly address some of the concerns of real users. All this takes time and effort and, from personal experience, can also cause collateral damage...

Sunday, February 13, 2011

Which principles drive ontology adoption?

Several weeks ago, I started to think about the next version of the Annotation Ontology (AO). After one year spent developing the Annotation Framework and discussing with several colleagues and friends, I certainly have a little list of things I want to improve. Nothing major, mostly a cleanup.

Before proceeding with the updates, I wanted to clarify the set of principles I want to follow in developing AO2. These are, in random order: Traceability, Orthogonality, Generality, Interoperability, Modularity, Extensibility, Adequate Documentation, Community Driven. I am listing these principles for an important reason: I believe they influence adoption.

As you might have noticed, the number of available ontologies is constantly increasing. If you need to use an ontology, you have to go through the process of reviewing what is out there and selecting what you think is most appropriate. How many times have you done that? How many times did you succeed? How many times did you find the right ontology covering exactly what you needed? I am pretty sure that if you are involved in the development of a complex application the answer is something like: I found a few ontologies I could mix and match... I still need to add pieces... and, most importantly, I am not sure I agree with the way some of them are done. Right. Welcome to the Semantic Web, I would say.

I remember the old days, many years ago, when the Dublin Core Metadata Element Set, Version 1.1 (DC) was the answer to almost everything. When I started working on SWAN (Semantic Web Applications in Neuromedicine) in 2006, I immediately found DC to be insufficient for our needs. For days I struggled to understand what to do: use DC and be sloppy, or create something more appropriate, risking isolation and increasing the entropy of the Semantic Web world.

Well, at that time my answer was the Provenance, Authoring and Versioning Ontology (PAV), now available in version 2. The choice, at the time, was also dictated by practical reasons: if I was using DC for Annotation Properties and I wanted to stay in OWL DL, I could not use it for other properties as well. Since then, PAV has been used in our applications but also in several others developed by people/groups I barely know - sometimes I wish they would just tell me something like: "hey I am using PAV and it's cool" or even "hey I am using PAV and it sucks because...". PAV has also been considered as one of the starting points for the W3C Provenance Incubator Group.

PAV was not such a bad idea in the end. But it was a risky business. If you are developing an application, you always need to keep one eye on what already exists and the other on your requirements. This turns out to be even more complicated because it is hard to find appropriate ontologies and, when you find them, they often don't have adequate documentation for you to understand whether they are what you are looking for. Surprise! The lack of shared knowledge about an ontology does not help it to emerge and does not help adoption... unless, of course, external factors (networking, important supporters, big institutions...) come into play. And external factors are no small thing.