Friday, October 17, 2014

When SPARQL query length is an issue

While developing Annotopia I wrote some code that dynamically creates SPARQL queries for a faceted search. Depending on the facet values, the queries can become extremely long... until I hit the limit:

Virtuoso 37000 Error SP031: SPARQL: Internal error: 
       The length of generated SQL text has exceeded 10000 lines of code

It seems that the SPARQL compiler stops because the SQL compiler, the next stage in the processing pipeline, would fail to compile the generated SQL in any reasonable time. After my initial surprise, I started to dig deeper into the structure of my queries. Here is what I learned.

Use the FILTER + IN construct instead of multiple UNIONs


When writing code, it can seem simpler to compose a query dynamically as a list of UNIONs. Unfortunately, that translates into much longer SQL queries. So this:

        
        { ?s oa:serializedBy <urn:application:domeo> }
        UNION
        { ?s oa:serializedBy <urn:application:utopia> }

for multiple items, should become:

        { ?s oa:serializedBy ?serializer .
            FILTER ( ?serializer IN 
               (<urn:application:domeo>, <urn:application:utopia>) )
        }
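
As a rough illustration of how the query-building code can emit the FILTER + IN form, here is a minimal Python sketch. The function name serialized_by_filter and its parameters are hypothetical and not taken from the Annotopia code base; it assumes the oa: prefix is declared elsewhere in the query.

```python
def serialized_by_filter(subject_var, app_uris):
    """Return a SPARQL group pattern restricting oa:serializedBy to app_uris.

    Hypothetical helper: it emits one FILTER + IN clause instead of
    one UNION branch per application URI.
    """
    if not app_uris:
        raise ValueError("at least one application URI is required")
    for uri in app_uris:
        # Reject characters that are not allowed inside a SPARQL IRI reference.
        if any(ch in uri for ch in '<>"{}|^`\\ '):
            raise ValueError(f"invalid IRI: {uri!r}")
    values = ", ".join(f"<{uri}>" for uri in app_uris)
    return (
        f"{{ ?{subject_var} oa:serializedBy ?serializer .\n"
        f"    FILTER ( ?serializer IN ({values}) )\n"
        f"}}"
    )


pattern = serialized_by_filter(
    "s", ["urn:application:domeo", "urn:application:utopia"]
)
print(pattern)
```

The pattern grows by one short IRI per facet value, rather than by one full graph pattern per value.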

Friday, October 10, 2014

Annotopia 101 - Basic use for document/data annotation

This post explains how to get started in using Annotopia as a server for document/data annotation. It assumes Annotopia is already installed and running and that you have admin access to the instance.


Step 1. Register your system

After logging in (as admin) to Annotopia, you will see a welcome screen:
  • Click on 'Administration Dashboard' (top left of the screen)


  • Select 'Create System'

  • Fill out the form and 'Save system'

  • Take note of the 'API key', which your system will use to communicate with Annotopia when Annotopia is not set up to use a stronger authentication mechanism.

 

Step 2. Create my first annotation (POST)

Assuming that the server address is http://myserver.example.com:8090, we are going to create our first POST. Normally your application will connect to the server through Ajax or a server-side call. For the sake of this tutorial we are going to use curl, which is easy to use from the command line.

The structure of the POST for an annotation item is very simple (API documentation here):

  curl -i -X POST http://myserver.example.com:8090/s/annotation \
       -H "Content-Type: application/json" \
       -d '{"apiKey":"{+SYSTEM_API_KEY}", "outCmd":"frame", "item":{+ANNOTATION}}'
Where +SYSTEM_API_KEY is the API key from the previous section and +ANNOTATION is the actual annotation content. Notice also the parameter "outCmd":"frame". It is used to frame the JSON-LD result, which means that the result will always be returned with a precise hierarchical structure, so that clients don't have to deal with the variability of a graph-like representation.
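
For applications that talk to the server from code rather than from curl, the same request can be assembled along these lines. This is a minimal sketch using only the Python standard library; the function name build_annotation_post and the placeholder key "MY_API_KEY" are made up for illustration, and the annotation passed in is a stub rather than a complete one.

```python
import json
import urllib.request


def build_annotation_post(server, api_key, annotation):
    """Build (but do not send) the POST request described above."""
    body = {
        "apiKey": api_key,    # the API key noted in Step 1
        "outCmd": "frame",    # ask the server for a framed JSON-LD result
        "item": annotation,   # the annotation content as a Python dict
    }
    return urllib.request.Request(
        url=f"{server}/s/annotation",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_annotation_post(
    "http://myserver.example.com:8090",
    "MY_API_KEY",  # placeholder, not a real key
    {"@id": "urn:temp:7", "@type": "oa:Annotation"},  # stub annotation
)
# To actually send it against a running server:
# response = urllib.request.urlopen(request)
```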

A simple example of Annotation of type Highlight (conformant to the Open Annotation Model) would be:

{
 "@context": "https://raw2.github.com/Annotopia/AtSmartStorage/master/web-app/data/OAContext.json",
 "@id": "urn:temp:7",
 "@type": "oa:Annotation",
 "motivatedBy": "oa:highlighting",
 "annotatedBy": {
  "@id": "http://orcid.org/0000-0002-5156-2703",
  "@type": "foaf:Person",
  "foaf:name": "Paolo Ciccarese"
 },
 "annotatedAt": "2014-02-17T09:46:11EST",
 "serializedBy": "urn:application:utopia",
 "serializedAt": "2014-02-17T09:46:51EST",
 "hasTarget": {
  "@id": "urn:temp:8",
  "@type": "oa:SpecificResource",
  "hasSelector": {
   "@type": "oa:TextQuoteSelector",
   "exact": "senior scientist and software engineer",
   "prefix": "I am a",
   "suffix": ", working in the bio-medical informatics field since the year 2000"
  },
  "hasSource": {
   "@id": "http://paolociccarese.info",
   "@type": "dctypes:Text"
  }
 }
}


Note that:
  • the '@context' is necessary for the server to interpret the content
  • the '@id' fields contain temporary values. When an annotation is posted for the first time, the server mints URIs for the Annotation and the Target and returns the updated content to the client in the response to the POST
  • the 'motivatedBy' property declares that the intent of the annotation is highlighting.
  • the 'hasTarget' uses a quote of the annotated piece of content. 
  • the 'hasSource/@id' represents the URI of the annotated resource
  • the 'hasSelector' identifies a fragment of that resource.
  • the 'serializedBy' declares which system created the artifact. In the above case it is the Utopia for PDF application. Domeo would be urn:application:domeo**.
** Note that this aspect is not fully implemented yet, therefore only specific systems are recognized by Annotopia and used for filtering. All others are managed but not exploited. In other words, currently only two values are fully managed: 'urn:application:domeo' and 'urn:application:utopia'. Alternative values can be used and stored, but they will not appear in the facets for search.

A simpler example of an Annotation of type Comment on an entire resource:

{
    "@context": "https://raw2.github.com/Annotopia/AtSmartStorage/master/web-app/data/OAContext.json",
    "@id": "urn:temp:001",
    "@type": "http://www.w3.org/ns/oa#Annotation",
    "motivatedBy": "oa:commenting",
    "annotatedBy": {
        "@id": "http://orcid.org/0000-0002-5156-2703",
        "@type": "foaf:Person",
        "foaf:name": "Paolo Ciccarese"
    },
    "annotatedAt": "2014-02-17T09:46:11EST",
    "serializedBy": "urn:application:domeo",
    "serializedAt": "2014-02-17T09:46:51EST",
    "hasBody": {
        "@type": [
            "cnt:ContentAsText",
            "dctypes:Text"
        ],
        "cnt:chars": "This is an interesting document",
        "dc:format": "text/plain"
    },
    "hasTarget": "http://paolociccarese.info"
}
 
Note that:
  • the 'hasBody' shows how to encode textual content
  • the 'hasTarget' is just a URI**.
** Note that as the target is a URI, anything identifiable can be annotated. In the above case we are annotating a web page, but the URI could be the identifier for a Data point as well.

Once the POST is sent, if everything is correct, the server (if "outCmd":"frame" was specified) will return a result message that has the following structure:

{"status":"saved", "result": {"duration": "1764ms","graphs":"1","item":[{
  "@context" : {
    ...
  },
  "@graph" : [ {
    "@id" : "http://myserver.example.com:8090/s/annotation/597C3DE9-8657-4FA6-ABCA-895A74B448E9",
    "@type" : "oa:Annotation",
    "http://purl.org/pav/previousVersion" : "urn:temp:7",
    "annotatedAt" : "2014-02-17T09:46:11EST",
    "annotatedBy" : {
      "@id" : "http://orcid.org/0000-0002-5156-2703",
      "@type" : "foaf:Person",
      "name" : "Paolo Ciccarese"
    },
    "hasTarget" : {
      "@id" : "http://myserver.example.com:8090/s/resource/ED20AE10-4916-485C-903D-54D6F11DF682",
      "@type" : "oa:SpecificResource",
      "http://purl.org/pav/previousVersion" : "urn:temp:8",
      "hasSelector" : {
        "@id" : "_:b0",
        "@type" : "oa:TextQuoteSelector",
        "exact" : "senior scientist and software engineer",
        "prefix" : "I am a",
        "suffix" : ", working in the bio-medical informatics field since the year 2000"
      },
      "hasSource" : {
        "@id" : "http://paolociccarese.info",
        "@type" : "dctypes:Text"
      }
    },
    "motivatedBy" : "oa:highlighting",
    "serializedAt" : "2014-02-17T09:46:51EST",
    "serializedBy" : "urn:application:utopia"
  } ]
}]}}

Note that:
  • the updated message is stored in the "item" section in a '@graph'
  • the '@id' values have been updated with resolvable URIs
  • the property "http://purl.org/pav/previousVersion" returns the original temporary '@id' for matching.
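
On the client side, the matching between temporary and minted identifiers could be done roughly as follows. This is a sketch, not Annotopia client code: the function name temp_id_mapping is hypothetical, and the sample response is an abbreviated version of the one above with an empty '@context'.

```python
import json

PREVIOUS_VERSION = "http://purl.org/pav/previousVersion"


def temp_id_mapping(response_text):
    """Map each temporary '@id' to the URI minted by the server."""
    mapping = {}

    def walk(node):
        # Visit every nested object and list in the framed result.
        if isinstance(node, dict):
            if PREVIOUS_VERSION in node and "@id" in node:
                mapping[node[PREVIOUS_VERSION]] = node["@id"]
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)

    response = json.loads(response_text)
    walk(response["result"]["item"])
    return mapping


# Abbreviated sample of the response shown above.
sample = json.dumps({
    "status": "saved",
    "result": {"item": [{"@context": {}, "@graph": [{
        "@id": "http://myserver.example.com:8090/s/annotation/597C3DE9-8657-4FA6-ABCA-895A74B448E9",
        PREVIOUS_VERSION: "urn:temp:7",
        "hasTarget": {
            "@id": "http://myserver.example.com:8090/s/resource/ED20AE10-4916-485C-903D-54D6F11DF682",
            PREVIOUS_VERSION: "urn:temp:8",
        },
    }]}]},
})
ids = temp_id_mapping(sample)
```

Here ids maps 'urn:temp:7' and 'urn:temp:8' to the annotation and resource URIs returned by the server.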

Step 3. How to include bibliographic metadata/identifiers

Annotopia can use identifiers (PubMed IDs, PubMed Central IDs, DOIs and PIIs) to resolve equivalent documents, for example an HTML version of a document versus a PDF version, or multiple HTML versions of the same document.

To include bibliographic metadata identifiers in the annotation, it is sufficient to add the data to the 'hasSource' section as follows:

"hasSource": {
 "@id": "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3102893/",
 "@type": "dctypes:Text",
 "format": "text/html",
 "http://purl.org/vocab/frbr/core#embodimentOf": {
  "http://purl.org/dc/terms/title": "An open annotation ontology for science on web 3.0",
  "http://prismstandard.org/namespaces/basic/2.0/doi": "10.1186/2041-1480-2-S2-S4",
  "http://purl.org/spar/fabio#hasPII": "2041-1480-2-S2-S4",
  "http://purl.org/spar/fabio#hasPubMedCentralId": "PMC3102893",
  "http://purl.org/spar/fabio#hasPubMedId": "21624159"
 }
}
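
If the annotation is assembled in code, the identifiers can be attached with a small helper like the one below. It is only a sketch: the function name with_identifiers and the short keys doi, pii, pmcid and pmid are hypothetical conveniences, while the property URIs are the ones used in the example above.

```python
import copy

FRBR_EMBODIMENT_OF = "http://purl.org/vocab/frbr/core#embodimentOf"
DC_TITLE = "http://purl.org/dc/terms/title"

# Hypothetical short names mapped to the property URIs shown above.
IDENTIFIER_PROPERTIES = {
    "doi": "http://prismstandard.org/namespaces/basic/2.0/doi",
    "pii": "http://purl.org/spar/fabio#hasPII",
    "pmcid": "http://purl.org/spar/fabio#hasPubMedCentralId",
    "pmid": "http://purl.org/spar/fabio#hasPubMedId",
}


def with_identifiers(source, title=None, **identifiers):
    """Return a copy of a 'hasSource' dict with bibliographic data added."""
    enriched = copy.deepcopy(source)  # leave the caller's dict untouched
    expression = enriched.setdefault(FRBR_EMBODIMENT_OF, {})
    if title is not None:
        expression[DC_TITLE] = title
    for key, value in identifiers.items():
        if key not in IDENTIFIER_PROPERTIES:
            raise KeyError(f"unknown identifier type: {key}")
        expression[IDENTIFIER_PROPERTIES[key]] = value
    return enriched


source = {
    "@id": "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3102893/",
    "@type": "dctypes:Text",
    "format": "text/html",
}
has_source = with_identifiers(
    source,
    title="An open annotation ontology for science on web 3.0",
    doi="10.1186/2041-1480-2-S2-S4",
    pmid="21624159",
)
```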

Step 4. Request for a specific annotation (GET {$id})

To request a specific annotation by its URI, it is sufficient to execute (see API docs):

    curl -i -X GET +ANNOTATION_URI -H "Content-Type: application/json" \
       -d'{"apiKey":"+SYSTEM_API_KEY","outCmd":"frame"}'

Step 5. Request for annotations for a document (GET)

It is common to request the annotations for a particular document by URI (see API docs):

    curl -i -X GET http://myserver.example.com:8090/s/annotation \
      -H "Content-Type: application/json" \
      -d '{"apiKey":"+SYSTEM_API_KEY","tgtUrl":"http://www.jbiomedsem.com/content/2/S2/S4"}'


Or by bibliographic identifier:

    curl -i -X GET http://myserver.example.com:8090/s/annotation \
      -H "Content-Type: application/json" \
      -d "{\"apiKey\":\"+SYSTEM_API_KEY\",\"tgtIds\":\"{'pii':'2041-1480-2-S2-S4'}\"}"
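
The two request bodies above can also be generated in code, which avoids shell quoting issues. The sketch below is a hypothetical helper (search_payload is not part of Annotopia); it assumes that 'tgtIds' is sent as a string of single-quoted key/value pairs, mirroring the curl example rather than any documented schema.

```python
import json


def search_payload(api_key, tgt_url=None, tgt_ids=None):
    """Build the JSON body for a search by document URL or identifiers."""
    payload = {"apiKey": api_key}
    if tgt_url is not None:
        payload["tgtUrl"] = tgt_url
    if tgt_ids:
        # Assumption: identifiers travel as a single-quoted string,
        # exactly as written in the curl example above.
        pairs = ",".join(f"'{key}':'{value}'" for key, value in tgt_ids.items())
        payload["tgtIds"] = "{" + pairs + "}"
    return json.dumps(payload)


by_url = search_payload(
    "MY_API_KEY",  # placeholder, not a real key
    tgt_url="http://www.jbiomedsem.com/content/2/S2/S4",
)
by_pii = search_payload("MY_API_KEY", tgt_ids={"pii": "2041-1480-2-S2-S4"})
```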