Friday, October 17, 2014

When SPARQL query length is an issue

While developing Annotopia I wrote some code that dynamically creates SPARQL queries to serve a faceted search. Depending on the facet values, the queries can become extremely long... until I hit the limit:

Virtuoso 37000 Error SP031: SPARQL: Internal error: 
       The length of generated SQL text has exceeded 10000 lines of code
It seems that the SPARQL compiler stops because the SQL compiler, the next stage in the processing pipeline, would fail to compile the generated SQL in any reasonable time. After my initial surprise I started to dig deeper into the structure of my queries. Here is what I learned.

Use the FILTER + IN construct instead of multiple UNIONs


When writing code, it might seem simpler to dynamically compose a query as a list of UNIONs. Unfortunately, that translates into much longer SQL queries. So this:

        
        { ?s oa:serializedBy <urn:application:domeo> }
        UNION
        { ?s oa:serializedBy <urn:application:utopia> }

for multiple items, should become:

        { ?s oa:serializedBy ?serializer .
            FILTER ( ?serializer IN 
               (<urn:application:domeo>, <urn:application:utopia>) )
        }
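
As a minimal sketch of how such a clause could be generated from a list of facet values, here is a small Python helper. The function names (filter_in, serialized_by_pattern) are hypothetical and are not taken from the Annotopia code base; the check on the URIs is a simple safeguard I am assuming, not a complete IRI validation.

```python
def filter_in(variable, uris):
    """Build a SPARQL 'FILTER ( ?var IN (...) )' clause for a list of URIs."""
    if not uris:
        raise ValueError("at least one URI is required")
    for uri in uris:
        # Reject characters that cannot appear inside a SPARQL IRI reference.
        if any(ch in uri for ch in '<>"{}|^`\\ '):
            raise ValueError("invalid character in URI: %r" % uri)
    values = ", ".join("<%s>" % uri for uri in uris)
    return "FILTER ( ?%s IN (%s) )" % (variable, values)


def serialized_by_pattern(uris):
    """Graph pattern matching annotations serialized by any of the given systems."""
    return "{ ?s oa:serializedBy ?serializer . %s }" % filter_in("serializer", uris)


print(serialized_by_pattern(["urn:application:domeo", "urn:application:utopia"]))
```

With this shape the query text grows by one short token per facet value instead of one full graph pattern per value.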

Friday, October 10, 2014

Annotopia 101 - Basic use for document/data annotation

This post explains how to get started using Annotopia as a server for document/data annotation. It assumes Annotopia is already installed and running and that you have admin access to the instance.


Step 1. Register your system

After logging in (as admin) to Annotopia, you will see a welcome screen:
  • Click on 'Administration Dashboard' (top left of the screen)


  • Select 'Create System'

  • Fill out the form and 'Save system'

  • Take note of the 'API key': your system will use it to communicate with Annotopia when Annotopia is not set up to use a stronger authentication mechanism.

 

Step 2. Create my first annotation (POST)

Assuming that the server address is http://myserver.example.com:8090, we are going to create our first POST. Normally your application will connect to the server through Ajax or a server-side call. For the sake of this tutorial we are going to use curl, which is easy to use from the command line.

The structure of the POST for an annotation item is very simple (API documentation here):

  curl -i -X POST http://myserver.example.com:8090/s/annotation \
       -H "Content-Type: application/json" \
       -d '{"apiKey":"+SYSTEM_API_KEY", "outCmd":"frame", "item":+ANNOTATION}'
Where +SYSTEM_API_KEY is the API key from the previous step and +ANNOTATION is the actual annotation content. Notice also the parameter "outCmd":"frame": it is used to frame the JSON-LD result, which means that the result will always be returned with a precise hierarchical structure, so that the clients don't have to deal with the variability of a graph-like representation.
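
For applications that do not go through curl, the same request could be prepared along these lines. This is only a sketch using the Python standard library: the function name build_post_request and the sample API key are hypothetical, while the /s/annotation path and the apiKey, outCmd and item fields come from the curl call above.

```python
import json
from urllib.request import Request


def build_post_request(server, api_key, annotation, framed=True):
    """Prepare (but do not send) the POST that stores one annotation."""
    body = {"apiKey": api_key, "item": annotation}
    if framed:
        # Ask the server for a framed (hierarchical) JSON-LD response.
        body["outCmd"] = "frame"
    return Request(
        server.rstrip("/") + "/s/annotation",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values: replace with your own server, API key and annotation.
request = build_post_request(
    "http://myserver.example.com:8090",
    "MY_SYSTEM_API_KEY",
    {"@id": "urn:temp:7", "@type": "oa:Annotation"},
)
print(request.get_method(), request.full_url)
```

Passing the prepared request to urllib.request.urlopen would actually send it; the sketch stops before that step so it can be tried without a running server.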

A simple example of Annotation of type Highlight (conformant to the Open Annotation Model) would be:

{
 "@context": "https://raw2.github.com/Annotopia/AtSmartStorage/master/web-app/data/OAContext.json",
 "@id": "urn:temp:7",
 "@type": "oa:Annotation",
 "motivatedBy": "oa:highlighting",
 "annotatedBy": {
  "@id": "http://orcid.org/0000-0002-5156-2703",
  "@type": "foaf:Person",
  "foaf:name": "Paolo Ciccarese"
 },
 "annotatedAt": "2014-02-17T09:46:11EST",
 "serializedBy": "urn:application:utopia",
 "serializedAt": "2014-02-17T09:46:51EST",
 "hasTarget": {
  "@id": "urn:temp:8",
  "@type": "oa:SpecificResource",
  "hasSelector": {
   "@type": "oa:TextQuoteSelector",
   "exact": "senior scientist and software engineer",
   "prefix": "I am a",
   "suffix": ", working in the bio-medical informatics field since the year 2000"
  },
  "hasSource": {
   "@id": "http://paolociccarese.info",
   "@type": "dctypes:Text"
  }
 }
}


Note that:
  • the '@context' is necessary for the server to interpret the content
  • the '@id' fields contain temporary values. In fact, when posting an annotation for the first time, the server will mint URIs for the Annotation and the Target and will return the updated content to the client in the response to the POST
  • the 'motivatedBy' property declares that the intent of the annotation is highlighting.
  • the 'hasTarget' uses a quote of the annotated piece of content. 
  • the 'hasSource/@id' represents the URI of the annotated resource
  • the 'hasSelector' identifies a fragment of that resource.
  • the 'serializedBy' declares which system created the artifact. In the above case it is the Utopia for PDF application. Domeo would be urn:application:domeo**.
** Note that this aspect is not fully implemented yet: only specific systems are recognized by Annotopia and used for filtering, while all others are stored but not exploited. In other words, currently only two values are fully managed: 'urn:application:domeo' and 'urn:application:utopia'. Alternative values can be used and stored, but they will not appear in the search facets.

A simpler example of an Annotation of type Comment on an entire resource:

{
    "@context": "https://raw2.github.com/Annotopia/AtSmartStorage/master/web-app/data/OAContext.json",
    "@id": "urn:temp:001",
    "@type": "http://www.w3.org/ns/oa#Annotation",
    "motivatedBy": "oa:commenting",
    "annotatedBy": {
        "@id": "http://orcid.org/0000-0002-5156-2703",
        "@type": "foaf:Person",
        "foaf:name": "Paolo Ciccarese"
    },
    "annotatedAt": "2014-02-17T09:46:11EST",
    "serializedBy": "urn:application:domeo",
    "serializedAt": "2014-02-17T09:46:51EST",
    "hasBody": {
        "@type": [
            "cnt:ContentAsText",
            "dctypes:Text"
        ],
        "cnt:chars": "This is an interesting document",
        "dc:format": "text/plain"
    },
    "hasTarget": "http://paolociccarese.info"
}
 
Note that:
  • the 'hasBody' shows how to encode textual content
  • the 'hasTarget' is just a URI**.
** Note that as the target is a URI, anything identifiable can be annotated. In the above case we are annotating a web page, but the URI could be the identifier for a Data point as well.

Once the POST is sent, if everything is correct, the server (if "outCmd":"frame" was specified) will return a result message that has the following structure:

{"status":"saved", "result": {"duration": "1764ms","graphs":"1","item":[{
  "@context" : {
    ...
  },
  "@graph" : [ {
    "@id" : "http://myserver.example.com:8090/s/annotation/597C3DE9-8657-4FA6-ABCA-895A74B448E9",
    "@type" : "oa:Annotation",
    "http://purl.org/pav/previousVersion" : "urn:temp:7",
    "annotatedAt" : "2014-02-17T09:46:11EST",
    "annotatedBy" : {
      "@id" : "http://orcid.org/0000-0002-5156-2703",
      "@type" : "foaf:Person",
      "name" : "Paolo Ciccarese"
    },
    "hasTarget" : {
      "@id" : "http://myserver.example.com:8090/s/resource/ED20AE10-4916-485C-903D-54D6F11DF682",
      "@type" : "oa:SpecificResource",
      "http://purl.org/pav/previousVersion" : "urn:temp:8",
      "hasSelector" : {
        "@id" : "_:b0",
        "@type" : "oa:TextQuoteSelector",
        "exact" : "senior scientist and software engineer",
        "prefix" : "I am a",
        "suffix" : ", working in the bio-medical informatics field since the year 2000"
      },
      "hasSource" : {
        "@id" : "http://paolociccarese.info",
        "@type" : "dctypes:Text"
      }
    },
    "motivatedBy" : "oa:highlighting",
    "serializedAt" : "2014-02-17T09:46:51EST",
    "serializedBy" : "urn:application:utopia"
  } ]
}]}}

Note that:
  • the updated message is stored in the "item" section in a '@graph'
  • the '@id' values have been updated with resolvable URIs
  • the property "http://purl.org/pav/previousVersion" returns the original temporary '@id' for matching.
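
A client that needs to reconcile its local objects with the minted URIs could walk the framed response and collect the pairs. The following is a minimal sketch; the function name minted_ids is hypothetical, and it assumes the response has the "result"/"item" structure shown above.

```python
PREVIOUS_VERSION = "http://purl.org/pav/previousVersion"


def minted_ids(response):
    """Map each temporary '@id' sent by the client to the URI minted by the server."""
    mapping = {}

    def visit(node):
        if isinstance(node, dict):
            previous = node.get(PREVIOUS_VERSION)
            if previous is not None and "@id" in node:
                mapping[previous] = node["@id"]
            for value in node.values():
                visit(value)
        elif isinstance(node, list):
            for value in node:
                visit(value)

    visit(response["result"]["item"])
    return mapping


# Trimmed-down version of the response shown above.
sample = {"status": "saved", "result": {"item": [{"@graph": [{
    "@id": "http://myserver.example.com:8090/s/annotation/597C3DE9-8657-4FA6-ABCA-895A74B448E9",
    PREVIOUS_VERSION: "urn:temp:7",
    "hasTarget": {
        "@id": "http://myserver.example.com:8090/s/resource/ED20AE10-4916-485C-903D-54D6F11DF682",
        PREVIOUS_VERSION: "urn:temp:8",
    },
}]}]}}
print(minted_ids(sample))
```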

Step 3. How to include bibliographic metadata/identifiers

Annotopia can use identifiers (PubMed IDs, PubMed Central IDs, DOIs and PIIs) to resolve equivalent documents: for example, an HTML version of a document vs a PDF version, or multiple HTML versions of the same document.

To include bibliographic identifiers in the annotation, it is sufficient to add the data to the 'hasSource' section as follows:

"hasSource": {
 "@id": "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3102893/",
 "@type": "dctypes:Text",
 "format": "text/html",
 "http://purl.org/vocab/frbr/core#embodimentOf": {
  "http://purl.org/dc/terms/title": "An open annotation ontology for science on web 3.0",
  "http://prismstandard.org/namespaces/basic/2.0/doi": "10.1186/2041-1480-2-S2-S4",
  "http://purl.org/spar/fabio#hasPII": "2041-1480-2-S2-S4",
  "http://purl.org/spar/fabio#hasPubMedCentralId": "PMC3102893",
  "http://purl.org/spar/fabio#hasPubMedId": "21624159"
 }
}
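
A client could assemble that section from whatever identifiers it knows about. Here is a small sketch: the function name source_with_identifiers and the short keys 'doi', 'pii', 'pmcid' and 'pmid' are hypothetical conveniences, while the property URIs are the ones used in the snippet above.

```python
FABIO = "http://purl.org/spar/fabio#"
EMBODIMENT_OF = "http://purl.org/vocab/frbr/core#embodimentOf"
TITLE = "http://purl.org/dc/terms/title"

# Hypothetical short names mapped to the properties used in this post.
IDENTIFIER_PROPERTIES = {
    "doi": "http://prismstandard.org/namespaces/basic/2.0/doi",
    "pii": FABIO + "hasPII",
    "pmcid": FABIO + "hasPubMedCentralId",
    "pmid": FABIO + "hasPubMedId",
}


def source_with_identifiers(url, identifiers, title=None, media_type="text/html"):
    """Build a 'hasSource' object carrying the known bibliographic identifiers."""
    expression = {}
    if title:
        expression[TITLE] = title
    for key, value in identifiers.items():
        # An unknown identifier type raises KeyError instead of being dropped silently.
        expression[IDENTIFIER_PROPERTIES[key]] = value
    return {
        "@id": url,
        "@type": "dctypes:Text",
        "format": media_type,
        EMBODIMENT_OF: expression,
    }


source = source_with_identifiers(
    "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3102893/",
    {"doi": "10.1186/2041-1480-2-S2-S4", "pmid": "21624159"},
)
print(source[EMBODIMENT_OF])
```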

Step 4. Request for a specific annotation (GET {$id})

For requesting a specific annotation through its URI it is sufficient to execute (see API docs):

    curl -i -X GET +ANNOTATION_URI -H "Content-Type: application/json" \
       -d'{"apiKey":"+SYSTEM_API_KEY","outCmd":"frame"}'

Step 5. Request for annotations for a document (GET)

It is common to request the annotations for a particular document by URI (see API docs):

    curl -i -X GET http://myserver.example.com:8090/s/annotation \
      -H "Content-Type: application/json" \
      -d '{"apiKey":"+SYSTEM_API_KEY","tgtUrl":"http://www.jbiomedsem.com/content/2/S2/S4"}'


Or by bibliographic identifier:
    curl -i -X GET http://myserver.example.com:8090/s/annotation \
      -H "Content-Type: application/json" \
      -d "{\"apiKey\":\"+SYSTEM_API_KEY\",\"tgtIds\":\"{'pii':'2041-1480-2-S2-S4'}\"}"
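
Because the 'tgtIds' value is itself a string containing single-quoted pairs, building it by hand is error-prone. The sketch below generates both kinds of request body; the function name build_search_body is hypothetical, and I am assuming the server expects 'tgtIds' in exactly the single-quoted form shown in the curl call above.

```python
import json


def build_search_body(api_key, tgt_url=None, tgt_ids=None):
    """Build the JSON body for a search by document URL and/or by identifiers."""
    body = {"apiKey": api_key}
    if tgt_url is not None:
        body["tgtUrl"] = tgt_url
    if tgt_ids:
        # Assumed format: a string such as {'pii':'2041-1480-2-S2-S4'}
        pairs = ",".join("'%s':'%s'" % (k, v) for k, v in sorted(tgt_ids.items()))
        body["tgtIds"] = "{" + pairs + "}"
    return json.dumps(body)


print(build_search_body("MY_SYSTEM_API_KEY", tgt_ids={"pii": "2041-1480-2-S2-S4"}))
```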

Tuesday, July 08, 2014

Adding bibliographic data to Open Annotation

One of the challenges for achieving interoperability between annotation clients that deal with different formats (for example PDF and HTML, see previous post Domeo and Utopia integration through Annotopia) is to be able to identify the annotated content.

For example, let's consider the paper about Annotation Ontology: Ciccarese P, Ocana M, Castro LJG, Das S, Clark, T. An open annotation ontology for science on web 3.0. J Biomed Semantics 2011, 2(Suppl 2):S4 (17 May 2011) [doi:10.1186/2041-1480-2-S2-S4]

Besides this PDF version (there might be others) of the article:
* PDF at Journal of Biomedical Semantics

The manuscript can be found in HTML format in at least these two locations (which exhibit different layouts):
* PubMed Central
* Journal of Biomedical Semantics

We know that the same content can be identified through identifiers:
* DOI (Digital Object Identifier) 10.1186/2041-1480-2-S2-S4
* PMID (PubMed ID) 21624159
* PMCID (PubMed Central ID) PMC3102893
* PII (Publisher Item Identifier) 2041-1480-2-S2-S4

In order to take into account all the available identifiers, it is possible to include this additional information in the annotation target. So if the client is annotating the PubMed Central version of the document (identified by the URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3102893/), the source of the target will be identified by:

 ...
    "hasSource": {
        "@id": "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3102893/",
        "@type": "dctypes:Text",
        "frbr:embodimentOf" : 
        { 
            "prism:doi": "10.1186/2041-1480-2-S2-S4",
            "fabio:hasPII":"2041-1480-2-S2-S4",
            "fabio:hasPubMedCentralId":"PMC3102893",
            "fabio:hasPubMedId":"21624159"
        }
        
    }
Here I made use of the FaBiO (FRBR-aligned Bibliographic Ontology) ontology which, in turn, reuses terms from the FRBR ontology and the PRISM vocabulary. Kudos to Silvio Peroni for pointing out that the relationship between the Manifestation (HTML page) and the Expression should be frbr:embodimentOf and not fabio:manifestationOf. The latter would assume the identifiers are identifying the Work.

Domeo and Utopia integration through Annotopia

Here is a very recent demo of an annotation created on an HTML document through Domeo and then seen on the corresponding PDF with the Utopia PDF viewer. All through Open Annotation and Annotopia.

Thanks to Steve Pettifer and Dave Thorne (for the Utopia plugin development); Thomas Wilkins for OAuth implementation in Annotopia. Annotopia is currently architected and developed by me.

Friday, April 25, 2014

Annotopia: Creation/updates with Open Annotation (?)

I am currently developing the Annotopia Open Annotation server [GitHub, Living Slides, Talk 'I Annotate 2014'] and there are a few topics related to the application of the Open Annotation Model that might need further discussion within the community. I will start with one:

:: Date/agent of annotation creation and update ::

Even if we don't need to support versioning (this will be the subject of a future post), and leaving aside the unlikely case in which annotations cannot be edited, we often need to keep track of who created the annotation and when and, possibly, who updated it and when.


Open Annotation Provenance Model

The Open Annotation Model now supports the following provenance relationships/properties:


Vocabulary Item | Type | Description
oa:annotatedBy | Relationship | [subProperty of prov:wasAttributedTo] The object of the relationship is a resource that identifies the agent responsible for creating the Annotation. This may be either a human or software agent. There SHOULD be exactly 1 oa:annotatedBy relationship per Annotation, but MAY be 0 or more than 1, as the Annotation may be anonymous, or multiple agents may have worked together on it.
oa:annotatedAt | Property | The time at which the Annotation was created. There SHOULD be exactly 1 oa:annotatedAt property per Annotation, and MUST NOT be more than 1. The datetime MUST be expressed in the xsd:dateTime format, and SHOULD have a timezone specified.
oa:serializedBy | Relationship | [subProperty of prov:wasAttributedTo] The object of the relationship is the agent, likely software, responsible for generating the Annotation's serialization. There MAY be 0 or more oa:serializedBy relationships per Annotation.
oa:serializedAt | Property | The time at which the agent referenced by oa:serializedBy generated the first serialization of the Annotation, and any subsequent substantially different one. The annotation graph MUST have changed for this property to be updated, and as such represents the last modified datestamp for the Annotation. This might be used to determine if it should be re-imported into a triplestore when discovered. There MAY be exactly 1 oa:serializedAt property per Annotation, and MUST NOT be more than 1. The datetime MUST be expressed in the xsd:dateTime format, and SHOULD have a timezone specified.
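
The cardinality rules above are easy to check mechanically on the JSON-LD form of an annotation. Here is a minimal sketch; the function names are hypothetical, and it assumes the short property names of the Open Annotation context (annotatedAt, annotatedBy, serializedAt) and that multiple values appear as a JSON list. It does not validate the xsd:dateTime format.

```python
def count_values(annotation, key):
    """Number of values a property has in a JSON-LD object (0 if absent)."""
    if key not in annotation:
        return 0
    value = annotation[key]
    return len(value) if isinstance(value, list) else 1


def check_provenance(annotation):
    """Return (errors, warnings) for the MUST NOT and SHOULD rules listed above."""
    errors, warnings = [], []
    for key in ("annotatedAt", "serializedAt"):
        if count_values(annotation, key) > 1:
            errors.append("%s MUST NOT appear more than once" % key)
    for key in ("annotatedBy", "annotatedAt"):
        if count_values(annotation, key) != 1:
            warnings.append("%s SHOULD appear exactly once" % key)
    return errors, warnings


print(check_provenance({"@type": "oa:Annotation", "annotatedAt": ["a", "b"]}))
```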

So we can encode when the annotation has been created and usually that coincides with the time when the user created the annotation on the user interface of the annotation client.

Then the annotation is sent to the server to be persisted.

Do nothing approach

One possible approach, which I would rather not advocate, is to forget about the concepts of creation and update: 'every time a change is performed on an annotation, the old instance is swapped with the new one. The new one entirely replaces the previous annotation and shares the same URI.' In this case, 'annotatedAt' always refers to the latest annotation event (no matter if it was the original creation or a following update).

Use a richer provenance model

To be a little more exhaustive, in Annotopia, as I was doing in Domeo and Annotation Ontology, I could use a series of properties of the PAV (Provenance, Authoring and Versioning) ontology [paper]: pav:createdOn (when it was created), pav:createdBy (who created it), pav:lastUpdateOn (when it was last updated), pav:lastUpdateBy (who last updated the annotation).

So I could say:

Option A: Add lastUpdateOn

In this scenario we use annotatedAt/annotatedBy for the annotation creation and lastUpdateOn/lastUpdateBy for the last update.

{
    "@id" : "http://host/s/annotation/830ED7EE-BF7B-4A18-8AE1-A9AF96AC135B",
    "@type" : "oa:Annotation",
    "annotatedAt" : "2014-02-17T09:46:11EST",
    "annotatedBy" : {
      "@id" : "http://orcid.org/0000-0002-5156-2703",
      "@type" : "foaf:Person",
      "name" : "Paolo Ciccarese"
    },
    "pav:lastUpdateOn" : "2014-03-11T11:46:11EST",
    "pav:lastUpdateBy" : {
      "@id" : "http://example.org/johndoe",
      "@type" : "foaf:Person",
      "name" : "John Doe"
    }
...
}

In this case, both events would refer to when the act has been performed on the user interface (?).
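
To make Option A concrete, this is roughly what a server could do when an update arrives. It is only a sketch of the semantics discussed here, not Annotopia's implementation: the function name apply_update is hypothetical, and I am assuming that the creation data of the stored version always wins over whatever the client sends.

```python
import copy


def apply_update(stored, incoming, agent, when):
    """Option A: keep annotatedAt/annotatedBy from creation, record the last update."""
    updated = copy.deepcopy(incoming)
    # Creation provenance is taken from the stored version, never from the client.
    for key in ("annotatedAt", "annotatedBy"):
        if key in stored:
            updated[key] = copy.deepcopy(stored[key])
    updated["pav:lastUpdateOn"] = when
    updated["pav:lastUpdateBy"] = agent
    return updated


stored = {
    "@id": "http://host/s/annotation/830ED7EE-BF7B-4A18-8AE1-A9AF96AC135B",
    "annotatedAt": "2014-02-17T09:46:11EST",
    "annotatedBy": {"@id": "http://orcid.org/0000-0002-5156-2703"},
}
incoming = dict(stored, annotatedAt="2014-03-11T11:46:11EST")
result = apply_update(
    stored, incoming, {"@id": "http://example.org/johndoe"}, "2014-03-11T11:46:11EST"
)
print(result["annotatedAt"], result["pav:lastUpdateOn"])
```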

Option B: Add createdOn and lastUpdateOn

Here we make use of annotatedAt/annotatedBy, createdOn/createdBy and  lastUpdateOn/lastUpdateBy

{
    "@id" : "http://host/s/annotation/830ED7EE-BF7B-4A18-8AE1-A9AF96AC135B",
    "@type" : "oa:Annotation",
    "pav:previousVersion" : "urn:temp:001",
    "annotatedAt" : "2014-02-17T09:46:11EST",
    "annotatedBy" : {
      "@id" : "http://orcid.org/0000-0002-5156-2703",
      "@type" : "foaf:Person",
      "name" : "Paolo Ciccarese"
    },
    "pav:createdOn" : "2014-02-17T09:48:11EST",
    "pav:createdBy" : {
      "@id" : "http://orcid.org/0000-0002-5156-2703",
      "@type" : "foaf:Person",
      "name" : "Paolo Ciccarese"
    },
    "pav:lastUpdateOn" : "2014-03-11T11:46:11EST",
    "pav:lastUpdateBy" : {
      "@id" : "http://example.org/johndoe",
      "@type" : "foaf:Person",
      "name" : "John Doe"
    }
...
}

In this case it is necessary to agree on the semantics of all those properties. I could use:
(i) 'createdOn/createdBy' for the original creation on the (Annotopia) server
(ii) 'lastUpdateOn/lastUpdateBy' for the last update on the (Annotopia)  server
(iii) and what is  'annotatedAt' going to indicate? The original creation or the latest update? And how do I keep track of the agents involved?

Saturday, February 08, 2014

A W3C Workshop on Annotations (April 2) and the I Annotate 2014 Conference (April 3-5)

The beginning of April is going to be a very exciting period for Annotation.
A W3C Workshop on Annotations
On April 2nd W3C is organizing a full day workshop in San Francisco on Annotation http://www.w3.org/2014/04/annotation/ Those who are members of the Open Annotation Community Group already know that there is a concrete possibility for a W3C Working Group focused on Annotation. A first draft of the charter has been shared: http://www.w3.org/2014/01/Ann-charter.html and comments/thoughts on that can be shared on the mailing list public-annotation@w3.org

I Annotate 2014
Hypothes.is just advertised the 2014 edition of the I Annotate conference for April 3-4 followed by two days of hacking 4-5.  Registration is now open.

Both events will be held at the FORT MASON CENTER - SAN FRANCISCO, CA.

Here are some links to videos of "I Annotate 2013" presentations, more here.




From CATCH to HarvardX to Annotopia

On October 18, 2012, Philip Desenne (at the time Senior Product Manager, Academic Technology Services at Harvard), Martin Schreiner (Head of Maps, Media, Data and Government Information, Harvard College Library) and I were awarded a small grant from Harvard Library Labs called CATCH: Common Annotation, Tagging, and Citation at Harvard.

The idea was to create a federated network of servers for storing annotations created for pedagogical purposes. As we knew that many applications at Harvard create annotations, we wanted to provide a common back-end for all of them to store, retrieve and search annotations. CATCH was also meant to produce some services for translating annotations into the Open Annotation format, so that we could store all the annotations coming from different tools in a uniform way, which would have made search a lot easier.

Obviously, as I've spent the last two years developing the Domeo Annotation Tool, the idea was also to have Domeo using the same technology for storing/retrieving/searching annotation.



However, the original grant has been broken down into two phases and only the first phase has been funded so far. As a result of the first phase I produced, with the help of Justin Miranda, a back end for persisting annotations produced by an annotator client based on annotator.js technology.

Three weeks ago, both the client (thanks to the work by Daniel Cebrian Robles and Phil Desenne) and the CATCH server (developed by Justin Miranda and me) entered production in HarvardX for one class that counts about 14,000 students.

As the result of phase I was supposed to be just a prototype and not a production quality server, this has been a stressful and at the same time exciting transition.
In a few days, CATCH already counts 21,000 annotations produced by more than 800 students, and the number of annotations is increasing steadily.
The future of CATCH is named Annotopia
The original plan for CATCH has not been fully realized and the stream of funding ended. So, in agreement with Tim Clark (Director of MIND Informatics and PI of the Domeo project), we decided to create a new project called Annotopia that will consist of developing the full potential of the original CATCH idea. Annotopia will also provide additional services: text mining, term search and support for semantic annotation. These features were already available in Domeo, but they will be generalized and made available through APIs for third-party annotation clients.

The CATCH codebase will merge with the new platform and, at least for now, we will still use the name CATCH to indicate the HarvardX instance of the Annotopia annotation back-end.

The first release of Annotopia is scheduled for the end of March.

Monday, January 27, 2014

JSON-LD, Jena, Virtuoso and Named Graphs

After working for a couple of years on the Domeo Annotation Tool I am now working on a couple of projects that focus on the creation of a back-end for saving/searching annotation. I am planning to use the Open Annotation model and some other ontologies such as: PAV (Provenance, Authoring and Versioning) ontology and maybe CO (Collections Ontology).

Named Graphs and JSON-LD

Most importantly, I am going to make extensive use of Named Graphs and their serialization in JSON-LD format, which is the recommended format for Open Annotation. JSON-LD very recently became a W3C Recommendation.
A Named Graph is a collection of Statements that is identified by a URI.
JSON-LD is a lightweight Linked Data format. It is easy for humans to read and write. It is based on the already successful JSON format and provides a way to help JSON data interoperate at Web-scale.
JSON-LD provides a very slick way of representing Named Graphs. Here is an example of Named Graph used for representing a very basic annotation (with Open Annotation):
  
  {
     "@context": {
        ...
     },
     "@id": "http://example.org/graphs/1",
     "@graph":
     [
        {
          "@id": "http://www.example.org/ann/1",
          "@type": "oa:Annotation",
          "hasBody": "http://www.example.org/body/1",
          "hasTarget": "http://www.example.org/target/1"
        }
     ]
  }

  Figure 1 - JSON-LD representation of a Named Graph and Open Annotation data.
  You can find the full @context in the Open Annotation specifications.

Loading JSON-LD in memory with Jena API 

I would like to store the above Named Graph in a triple store, for instance Virtuoso Open-Source Edition. For this task I chose the Apache Jena API, which makes use of the JSON-LD implementation for Java.

I will start by loading in memory the above JSON-LD code (figure 1) that is currently in a JSON file:
  
  JenaJSONLD.init(); // Only needed once
  
  Dataset dataset = DatasetFactory.createMem();
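  InputStream inputStream;
  try {
    // FileInputStream never returns null: a missing file raises an exception
    inputStream = new FileInputStream(annotationFile);
  } catch (FileNotFoundException e) {
    throw new IllegalArgumentException("File: " + annotationFile + " not found", e);
  }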
  RDFDataMgr.read(dataset, inputStream, "http://example.com/", JenaJSONLD.JSONLD);

  Figure 2 - Jena API code for loading the JSON-LD file in an in-memory Dataset.
The reason why I used a Dataset rather than a Model is because the
Dataset is a collection of named graphs and a background graph (also called the default graph or unnamed graph)
And that fits exactly the needs we have with the code in Figure 1. And the needs of much more complex use cases related to Domeo. Also, this approach works for both the JSON-LD making and not making use of graphs. If the JSON-LD does not contain any graph, the Statements will belong to the default graph.

Note: When I tried to use the Model and not the Dataset for loading the JSON-LD files, I realized that only the files with no @graph declarations were loaded correctly. The ones with the @graph declaration were not generating any statement.
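
The rule for where the statements end up can be illustrated without Jena. This is a deliberately simplified sketch that only looks at the top level of a parsed JSON-LD document (the function name target_graph is hypothetical); a real JSON-LD processor handles nested graphs and many more cases.

```python
def target_graph(document):
    """Return the URI of the named graph for a top-level JSON-LD object,
    or None when its statements belong to the default graph."""
    if "@graph" in document and "@id" in document:
        return document["@id"]
    return None


figure_1 = {
    "@id": "http://example.org/graphs/1",
    "@graph": [{
        "@id": "http://www.example.org/ann/1",
        "@type": "oa:Annotation",
        "hasBody": "http://www.example.org/body/1",
        "hasTarget": "http://www.example.org/target/1",
    }],
}
print(target_graph(figure_1))
print(target_graph(figure_1["@graph"][0]))
```

The first call corresponds to the document of Figure 1, which needs a Dataset; the second is a plain annotation without @graph, whose statements would go to the default graph.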

Persist the Named Graphs in Virtuoso 

And these are the few lines of code I use to store the in-memory graphs in the Virtuoso store (I am sure there is a better way of doing this and combining the above step with these lines of code, however, this seems to work the way I want):
  // Default graph
  if(dataset.getDefaultModel()!=null && dataset.getDefaultModel().size()>0) {
    VirtGraph virtGraph = new VirtGraph (
      "jdbc:virtuoso://localhost:1111", "dba", "dba");
    VirtModel virtModel = new VirtModel(virtGraph);
    virtModel.add(dataset.getDefaultModel());
    // Print the triples
    println "graph: *"
    RDFDataMgr.write(System.out, dataset.getDefaultModel(), JenaJSONLD.JSONLD);
  }

  // Named graphs
  Iterator names = dataset.listNames()
  while(names.hasNext()) {
    String name = names.next();
    Model model = dataset.getNamedModel(name)
    VirtGraph virtGraph = new VirtGraph (name, 
      "jdbc:virtuoso://localhost:1111", "dba", "dba");
    VirtModel virtModel = new VirtModel(virtGraph);
    virtModel.add(model);

    // Print the triples
    println "graph: " + name
    RDFDataMgr.write(System.out, model, JenaJSONLD.JSONLD);
  }

  Figure 3 - Saving default and named graphs in Virtuoso

Software versions used in the example above

For the above examples I've used the following libraries/versions:
  • jena-core v. 2.11.0
  • jena-arq v. 2.11.0
  • jsonld-java-jena v. 0.2.99
  • virtjdbc4.jar
  • virt_jena2.jar