Showing posts with label Annotation. Show all posts

Friday, October 17, 2014

When SPARQL query length is an issue

While developing Annotopia I wrote some code that dynamically creates SPARQL queries to serve a faceted search. Depending on the facet values, the queries can become extremely long... until I hit the limit:

Virtuoso 37000 Error SP031: SPARQL: Internal error: 
       The length of generated SQL text has exceeded 10000 lines of code
It seems that the SPARQL compiler stops because the SQL compiler, the next stage in the processing pipeline, would fail to compile the generated SQL in any reasonable time. After my initial surprise, I started to dig deeper into the structure of my queries. Here is what I learned.

Use the FILTER + IN construct instead of multiple UNIONs


When writing code, it might seem simpler to compose a query dynamically as a list of UNIONs. Unfortunately, that translates into much longer SQL queries. So this:

        
        { ?s oa:serializedBy <urn:application:domeo> }
        UNION
        { ?s oa:serializedBy <urn:application:utopia> }

for multiple items, should become:

        { ?s oa:serializedBy ?serializer .
            FILTER ( ?serializer IN 
               (<urn:application:domeo>, <urn:application:utopia>) )
        }
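Since the queries are composed dynamically, the FILTER + IN pattern above can be generated from the list of selected facet values. The following is only a minimal sketch in Python, not the actual Annotopia code: the function name `serialized_by_filter` and its parameters are hypothetical, and the character check is a simple guard rather than a full IRI validator.

```python
def serialized_by_filter(serializer_iris, subject="?s", var="?serializer"):
    """Build one graph pattern that uses FILTER ... IN instead of a chain of UNIONs."""
    if not serializer_iris:
        raise ValueError("at least one IRI is required")
    for iri in serializer_iris:
        # Reject characters that may not appear inside a SPARQL IRI reference.
        if any(ch in iri for ch in '<>"{}|^`\\ '):
            raise ValueError("invalid IRI: %r" % iri)
    values = ", ".join("<%s>" % iri for iri in serializer_iris)
    return ("{ %s oa:serializedBy %s .\n"
            "    FILTER ( %s IN (%s) )\n"
            "}") % (subject, var, var, values)


pattern = serialized_by_filter(["urn:application:domeo", "urn:application:utopia"])
print(pattern)
```

The pattern stays a single block however many values are selected, so the query grows by one IRI per facet value rather than by one UNION branch.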

Friday, October 10, 2014

Annotopia 101 - Basic use for document/data annotation

This post explains how to get started using Annotopia as a server for document/data annotation. It assumes Annotopia is already installed and running and that you have admin access to the instance.


Step 1. Register your system

After logging in (as admin) to Annotopia, you will see a welcome screen:
  • Click on 'Administration Dashboard' (top left of the screen)


  • Select 'Create System'

  • Fill out the form and 'Save system'

  • Take note of the 'API key' which is going to be used by your system to communicate with Annotopia when Annotopia is not set up to use a stronger Authentication mechanism.

 

Step 2. Create my first annotation (POST)

Assuming that the server address is http://myserver.example.com:8090, we are going to create our first POST. Normally your application will connect to the server through Ajax or a server call. For the sake of this tutorial we are going to use curl, which is easy to use from the command line.

The structure of the POST for an annotation item is very simple (API documentation here):

  curl -i -X POST http://myserver.example.com:8090/s/annotation \
       -H "Content-Type: application/json" \
       -d '{"apiKey":"{+SYSTEM_API_KEY}", "outCmd":"frame", "item":{+ANNOTATION}}'
Here +SYSTEM_API_KEY is the API key from the previous section and +ANNOTATION is the actual annotation content. Notice also the parameter "outCmd":"frame". It is used to frame the JSON-LD result, which means that the result will always be returned with a precise hierarchical structure, so that clients don't have to deal with the variability of a graph-like representation.

A simple example of Annotation of type Highlight (conformant to the Open Annotation Model) would be:

{
 "@context": "https://raw2.github.com/Annotopia/AtSmartStorage/master/web-app/data/OAContext.json",
 "@id": "urn:temp:7",
 "@type": "oa:Annotation",
 "motivatedBy": "oa:highlighting",
 "annotatedBy": {
  "@id": "http://orcid.org/0000-0002-5156-2703",
  "@type": "foaf:Person",
  "foaf:name": "Paolo Ciccarese"
 },
 "annotatedAt": "2014-02-17T09:46:11EST",
 "serializedBy": "urn:application:utopia",
 "serializedAt": "2014-02-17T09:46:51EST",
 "hasTarget": {
  "@id": "urn:temp:8",
  "@type": "oa:SpecificResource",
  "hasSelector": {
   "@type": "oa:TextQuoteSelector",
   "exact": "senior scientist and software engineer",
   "prefix": "I am a",
   "suffix": ", working in the bio-medical informatics field since the year 2000"
  },
  "hasSource": {
   "@id": "http://paolociccarese.info",
   "@type": "dctypes:Text"
  }
 }
}


Note that:
  • the '@context' is necessary for the server to interpret the content
  • the '@id' fields contain a temporary value. In fact, when posting an annotation for the first time, the server will mint URIs for Annotation and Target and will return the updated content to the client in the response to the POST
  • the 'motivatedBy' property declares that the intent of the annotation is highlighting.
  • the 'hasTarget' uses a quote of the annotated piece of content. 
  • the 'hasSource/@id' represents the URI of the annotated resource
  • the 'hasSelector' identifies a fragment of that resource.
  • the 'serializedBy' declares which system created the artifact. In the above case it is the Utopia for PDF application. Domeo would be urn:application:domeo**.
** Note that this aspect is not fully implemented yet, so only specific systems are recognized by Annotopia and used for filtering. All others are managed but not exploited. In other words, currently only two values are fully managed: 'urn:application:domeo' and 'urn:application:utopia'. Alternative values can be used and stored, but they will not appear in the search facets.
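For clients that are not driven from the command line, the same POST body can be assembled programmatically. This is a minimal sketch in Python, assuming only the request shape shown in the curl example above. The helper names `highlight_annotation` and `post_body` are hypothetical, and the provenance fields ('annotatedBy', 'annotatedAt', 'serializedAt') are left out for brevity even though a real client would probably send them. Nothing is sent to a server here.

```python
import json

OA_CONTEXT = ("https://raw2.github.com/Annotopia/AtSmartStorage/"
              "master/web-app/data/OAContext.json")


def highlight_annotation(source_url, exact, prefix, suffix, serializer):
    """Return a Highlight annotation shaped like the example, with temporary ids."""
    return {
        "@context": OA_CONTEXT,
        "@id": "urn:temp:7",            # temporary, replaced by the server
        "@type": "oa:Annotation",
        "motivatedBy": "oa:highlighting",
        "serializedBy": serializer,
        "hasTarget": {
            "@id": "urn:temp:8",        # temporary, replaced by the server
            "@type": "oa:SpecificResource",
            "hasSelector": {
                "@type": "oa:TextQuoteSelector",
                "exact": exact,
                "prefix": prefix,
                "suffix": suffix,
            },
            "hasSource": {"@id": source_url, "@type": "dctypes:Text"},
        },
    }


def post_body(api_key, annotation):
    """Wrap the annotation in the envelope used by the POST example."""
    return json.dumps({"apiKey": api_key, "outCmd": "frame", "item": annotation})


body = post_body(
    "+SYSTEM_API_KEY",
    highlight_annotation(
        "http://paolociccarese.info",
        exact="senior scientist and software engineer",
        prefix="I am a",
        suffix=", working in the bio-medical informatics field since the year 2000",
        serializer="urn:application:utopia",
    ),
)
print(body)
```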

A simpler example, an Annotation of type Comment on an entire resource:

{
    "@context": "https://raw2.github.com/Annotopia/AtSmartStorage/master/web-app/data/OAContext.json",
    "@id": "urn:temp:001",
    "@type": "http://www.w3.org/ns/oa#Annotation",
    "motivatedBy": "oa:commenting",
    "annotatedBy": {
        "@id": "http://orcid.org/0000-0002-5156-2703",
        "@type": "foaf:Person",
        "foaf:name": "Paolo Ciccarese"
    },
    "annotatedAt": "2014-02-17T09:46:11EST",
    "serializedBy": "urn:application:domeo",
    "serializedAt": "2014-02-17T09:46:51EST",
    "hasBody": {
        "@type": [
            "cnt:ContentAsText",
            "dctypes:Text"
        ],
        "cnt:chars": "This is an interesting document",
        "dc:format": "text/plain"
    },
    "hasTarget": "http://paolociccarese.info"
}
 
Note that:
  • the 'hasBody' shows how to encode textual content
  • the 'hasTarget' is just a URI**.
** Note that as the target is a URI, anything identifiable can be annotated. In the above case we are annotating a web page, but the URI could be the identifier for a Data point as well.

Once the POST is sent, if everything is correct, the server (if "outCmd":"frame" was specified) will return a result message that has the following structure:

{"status":"saved", "result": {"duration": "1764ms","graphs":"1","item":[{
  "@context" : {
    ...
  },
  "@graph" : [ {
    "@id" : "http://myserver.example.com:8090/s/annotation/597C3DE9-8657-4FA6-ABCA-895A74B448E9",
    "@type" : "oa:Annotation",
    "http://purl.org/pav/previousVersion" : "urn:temp:7",
    "annotatedAt" : "2014-02-17T09:46:11EST",
    "annotatedBy" : {
      "@id" : "http://orcid.org/0000-0002-5156-2703",
      "@type" : "foaf:Person",
      "name" : "Paolo Ciccarese"
    },
    "hasTarget" : {
      "@id" : "http://myserver.example.com:8090/s/resource/ED20AE10-4916-485C-903D-54D6F11DF682",
      "@type" : "oa:SpecificResource",
      "http://purl.org/pav/previousVersion" : "urn:temp:8",
      "hasSelector" : {
        "@id" : "_:b0",
        "@type" : "oa:TextQuoteSelector",
        "exact" : "senior scientist and software engineer",
        "prefix" : "I am a",
        "suffix" : ", working in the bio-medical informatics field since the year 2000"
      },
      "hasSource" : {
        "@id" : "http://paolociccarese.info",
        "@type" : "dctypes:Text"
      }
    },
    "motivatedBy" : "oa:highlighting",
    "serializedAt" : "2014-02-17T09:46:51EST",
    "serializedBy" : "urn:application:utopia"
  } ]
}]}}

Note that:
  • the updated message is stored in the "item" section in a '@graph'
  • the '@id' values have been updated with resolvable URIs
  • the property "http://purl.org/pav/previousVersion" returns the original temporary '@id' for matching.
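A client usually needs to replace its temporary identifiers with the ones minted by the server. A minimal sketch of that matching step in Python, assuming a framed response with the structure shown above: the function name `minted_ids` is hypothetical, and the sample response is a trimmed copy of the one in this post.

```python
PREVIOUS_VERSION = "http://purl.org/pav/previousVersion"


def minted_ids(node, mapping=None):
    """Walk the response and map each temporary '@id' to the minted '@id'."""
    if mapping is None:
        mapping = {}
    if isinstance(node, dict):
        previous = node.get(PREVIOUS_VERSION)
        if isinstance(previous, str) and "@id" in node:
            mapping[previous] = node["@id"]
        for value in node.values():
            minted_ids(value, mapping)
    elif isinstance(node, list):
        for value in node:
            minted_ids(value, mapping)
    return mapping


# Trimmed copy of the response shown above.
response = {"status": "saved", "result": {"item": [{"@graph": [{
    "@id": "http://myserver.example.com:8090/s/annotation/597C3DE9-8657-4FA6-ABCA-895A74B448E9",
    "@type": "oa:Annotation",
    PREVIOUS_VERSION: "urn:temp:7",
    "hasTarget": {
        "@id": "http://myserver.example.com:8090/s/resource/ED20AE10-4916-485C-903D-54D6F11DF682",
        "@type": "oa:SpecificResource",
        PREVIOUS_VERSION: "urn:temp:8",
    },
}]}]}}

ids = minted_ids(response["result"]["item"])
for temporary, minted in sorted(ids.items()):
    print(temporary, "->", minted)
```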

Step 3. How to include bibliographic metadata/identifiers

Annotopia can use identifiers (PubMed IDs, PubMed Central IDs, DOIs and PIIs) to resolve equivalent documents, for example an HTML version of a document versus its PDF version, or multiple HTML versions of the same document.

To include bibliographic metadata and identifiers in the annotation, it is sufficient to add the data to the 'hasSource' section as follows:

"hasSource": {
 "@id": "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3102893/",
 "@type": "dctypes:Text",
 "format": "text/html",
 "http://purl.org/vocab/frbr/core#embodimentOf": {
  "http://purl.org/dc/terms/title": "An open annotation ontology for science on web 3.0",
  "http://prismstandard.org/namespaces/basic/2.0/doi": "10.1186/2041-1480-2-S2-S4",
  "http://purl.org/spar/fabio#hasPII": "2041-1480-2-S2-S4",
  "http://purl.org/spar/fabio#hasPubMedCentralId": "PMC3102893",
  "http://purl.org/spar/fabio#hasPubMedId": "21624159"
 }
}
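The long property URIs are easy to mistype, so a client may want to generate this section from short keys. A minimal sketch in Python: the property URIs are the ones used in the example above, while the short keys ('title', 'doi', 'pii', 'pmcid', 'pmid') and the helper name `has_source` are hypothetical conveniences, not part of the Annotopia API.

```python
FRBR_EMBODIMENT_OF = "http://purl.org/vocab/frbr/core#embodimentOf"

# Hypothetical short keys mapped to the property URIs used in this post.
ID_PROPERTIES = {
    "title": "http://purl.org/dc/terms/title",
    "doi": "http://prismstandard.org/namespaces/basic/2.0/doi",
    "pii": "http://purl.org/spar/fabio#hasPII",
    "pmcid": "http://purl.org/spar/fabio#hasPubMedCentralId",
    "pmid": "http://purl.org/spar/fabio#hasPubMedId",
}


def has_source(url, fmt="text/html", **identifiers):
    """Build a 'hasSource' section carrying bibliographic identifiers."""
    unknown = set(identifiers) - set(ID_PROPERTIES)
    if unknown:
        raise ValueError("unsupported identifier(s): %s" % ", ".join(sorted(unknown)))
    return {
        "@id": url,
        "@type": "dctypes:Text",
        "format": fmt,
        FRBR_EMBODIMENT_OF: {ID_PROPERTIES[key]: value
                             for key, value in identifiers.items()},
    }


source = has_source(
    "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3102893/",
    doi="10.1186/2041-1480-2-S2-S4",
    pmcid="PMC3102893",
    pmid="21624159",
)
print(source)
```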

Step 4. Request for a specific annotation (GET {$id})

To request a specific annotation through its URI, it is sufficient to execute (see API docs):

    curl -i -X GET +ANNOTATION_URI -H "Content-Type: application/json" \
       -d'{"apiKey":"+SYSTEM_API_KEY","outCmd":"frame"}'

Step 5. Request for annotations for a document (GET)

It is common to request the annotations for a particular document by URI (see API docs):

    curl -i -X GET http://myserver.example.com:8090/s/annotation \
      -H "Content-Type: application/json" \
      -d '{"apiKey":"+SYSTEM_API_KEY","tgtUrl":"http://www.jbiomedsem.com/content/2/S2/S4"}'


Or by bibliographic identifier:
    curl -i -X GET http://myserver.example.com:8090/s/annotation \
      -H "Content-Type: application/json" \
      -d '{"apiKey":"+SYSTEM_API_KEY","tgtIds":"{'\''pii'\'':'\''2041-1480-2-S2-S4'\''}"}'
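The 'tgtIds' value contains single quotes, which are awkward to nest inside a single-quoted shell argument. One way to avoid quoting mistakes is to build the request body in code and let a library do the shell escaping. A minimal sketch in Python, assuming only the two request shapes shown above: the helper name `search_body` is hypothetical, and nothing is sent to a server.

```python
import json
import shlex


def search_body(api_key, tgt_url=None, tgt_ids=None):
    """Build the body for a search by document URL or by bibliographic ids."""
    if (tgt_url is None) == (tgt_ids is None):
        raise ValueError("pass exactly one of tgt_url or tgt_ids")
    body = {"apiKey": api_key}
    if tgt_url is not None:
        body["tgtUrl"] = tgt_url
    else:
        # As in the curl example, tgtIds is a string holding a single-quoted map.
        pairs = ",".join("'%s':'%s'" % (key, value)
                         for key, value in sorted(tgt_ids.items()))
        body["tgtIds"] = "{" + pairs + "}"
    return json.dumps(body)


by_url = search_body("+SYSTEM_API_KEY",
                     tgt_url="http://www.jbiomedsem.com/content/2/S2/S4")
by_pii = search_body("+SYSTEM_API_KEY", tgt_ids={"pii": "2041-1480-2-S2-S4"})

# shlex.quote produces an argument that is safe to paste into a POSIX shell.
print("curl -i -X GET http://myserver.example.com:8090/s/annotation "
      "-H 'Content-Type: application/json' -d " + shlex.quote(by_pii))
```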

Tuesday, July 08, 2014

Domeo and Utopia integration through Annotopia

Here is a very recent demo of an annotation created on an HTML document through Domeo and then seen on the corresponding PDF with the Utopia PDF viewer. All through Open Annotation and Annotopia.

Thanks to Steve Pettifer and Dave Thorne (for the Utopia plugin development) and to Thomas Wilkins for the OAuth implementation in Annotopia. Annotopia is currently architected and developed by me.

Saturday, February 08, 2014

A W3C Workshop on Annotations (April 2) and the I Annotate 2014 Conference (April 3-5)

The beginning of April is going to be a very exciting period for Annotation.
A W3C Workshop on Annotations
On April 2nd W3C is organizing a full-day workshop on Annotation in San Francisco (http://www.w3.org/2014/04/annotation/). Those who are members of the Open Annotation Community Group already know that there is a concrete possibility for a W3C Working Group focused on Annotation. A first draft of the charter has been shared (http://www.w3.org/2014/01/Ann-charter.html) and comments/thoughts on it can be sent to the mailing list public-annotation@w3.org

I Annotate 2014
Hypothes.is just advertised the 2014 edition of the I Annotate conference for April 3-4 followed by two days of hacking 4-5.  Registration is now open.

Both events will be held at the FORT MASON CENTER - SAN FRANCISCO, CA.

Here are some links to videos of "I Annotate 2013" presentations, more here.




From CATCH to HarvardX to Annotopia

On October 18, 2012, Philip Desenne (at the time Senior Product Manager, Academic Technology Services at Harvard), Martin Schreiner (Head of Maps, Media, Data and Government Information, Harvard College Library) and I were awarded a small grant from Harvard Library Labs called CATCH: Common Annotation, Tagging, and Citation at Harvard.

The idea was to create a federated network of servers for storing annotations created for pedagogical purposes. As we knew that many applications at Harvard create annotations, we wanted to provide a common back-end for all of them to store, retrieve and search annotations. CATCH was also meant to produce some services for translating annotations into the Open Annotation format, so that we could store the annotations coming from different tools in a uniform way that would make search a lot easier.

Obviously, as I've spent the last two years developing the Domeo Annotation Tool, the idea was also to have Domeo using the same technology for storing/retrieving/searching annotation.



However, the original grant has been broken down into two phases and only the first phase has been funded so far. As a result of the first phase I produced, with the help of Justin Miranda, a back end for persisting annotations produced by an annotator client based on annotator.js technology.

Three weeks ago, both the client (thanks to the work by Daniel Cebrian Robles and Phil Desenne) and the CATCH server (developed by Justin Miranda and me) entered production in HarvardX for one class that counts about 14,000 students.

As the result of phase I was supposed to be just a prototype and not a production quality server, this has been a stressful and at the same time exciting transition.
In a few days, CATCH has already collected 21,000 annotations produced by more than 800 students, and the number of annotations is increasing steadily.
The future of CATCH is named Annotopia
The original plan for CATCH has not been fully realized and the stream of funding ended. So, in agreement with Tim Clark (Director of MIND Informatics and PI of the Domeo project), we decided to create a new project called Annotopia that will develop the full potential of the original CATCH idea. Annotopia will also provide additional services: text mining, term search and support for semantic annotation. These features were already available in Domeo but they will be generalized and made available through APIs for third-party annotation clients.

The CATCH codebase will merge with the new platform and, at least for now, we will still use the name CATCH to indicate the HarvardX instance of the Annotopia annotation back-end.

The first release of Annotopia is scheduled by the end of March.

Monday, January 27, 2014

JSON-LD, Jena, Virtuoso and Named Graphs

After working for a couple of years on the Domeo Annotation Tool I am now working on a couple of projects that focus on the creation of a back-end for saving/searching annotation. I am planning to use the Open Annotation model and some other ontologies such as: PAV (Provenance, Authoring and Versioning) ontology and maybe CO (Collections Ontology).

Named Graphs and JSON-LD

Most importantly I am going to make large use of Named Graphs and their serialization in JSON-LD format, which is the recommended format for Open Annotation. JSON-LD became very recently a W3C Recommendation.
A Named Graph is a collection of Statements that is identified by a URI.
JSON-LD is a lightweight Linked Data format. It is easy for humans to read and write. It is based on the already successful JSON format and provides a way to help JSON data interoperate at Web-scale.
JSON-LD provides a very slick way of representing Named Graphs. Here is an example of Named Graph used for representing a very basic annotation (with Open Annotation):
  
  {
     "@context": {
        ...
     },
     "@id": "http://example.org/graphs/1",
     "@graph":
     [
        {
          "@id": "http://www.example.org/ann/1",
          "@type": "oa:Annotation",
          "hasBody": "http://www.example.org/body/1",
          "hasTarget": "http://www.example.org/target/1"
        }
     ]
  }

  Figure 1 - JSON-LD representation of a Named Graph and Open Annotation data.
  You can find the full @context in the Open Annotation specifications.
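The structure in Figure 1 is regular enough that a client can produce it with a small wrapper. A minimal sketch in Python: the helper name `named_graph` is hypothetical, and the context is passed in as a parameter and left empty in the usage because Figure 1 elides it. A real document needs the full Open Annotation context for terms such as 'oa:Annotation' to be interpreted.

```python
import json


def named_graph(graph_uri, nodes, context):
    """Wrap nodes in a JSON-LD Named Graph.

    The top-level '@id' names the graph and '@graph' holds its content.
    """
    return {"@context": context, "@id": graph_uri, "@graph": list(nodes)}


annotation = {
    "@id": "http://www.example.org/ann/1",
    "@type": "oa:Annotation",
    "hasBody": "http://www.example.org/body/1",
    "hasTarget": "http://www.example.org/target/1",
}

# Placeholder context: the real one is elided in Figure 1, so it stays empty here.
document = named_graph("http://example.org/graphs/1", [annotation], context={})
print(json.dumps(document, indent=2))
```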

Loading JSON-LD in memory with Jena API 

I would like to store the above Named Graph in a triple store, for instance Virtuoso Open-Source Edition. For this task I chose the Apache Jena API, which makes use of the JSON-LD implementation for Java.

I will start by loading in memory the above JSON-LD code (figure 1) that is currently in a JSON file:
  
  JenaJSONLD.init(); // Only needed once
  
  Dataset dataset = DatasetFactory.createMem();
  // FileInputStream never returns null: it throws FileNotFoundException
  // when the file is missing, so no explicit null check is needed.
  InputStream inputStream = new FileInputStream(annotationFile);
  try {
    RDFDataMgr.read(dataset, inputStream, "http://example.com/", JenaJSONLD.JSONLD);
  } finally {
    inputStream.close(); // Release the file handle even if parsing fails
  }

  Figure 2 - Jena API code for loading the JSON-LD file in an in-memory Dataset.
I used a Dataset rather than a Model because a Dataset is a collection of named graphs and a background graph (also called the default graph or unnamed graph). That fits exactly the needs we have with the code in Figure 1, as well as the needs of much more complex use cases related to Domeo. Also, this approach works for JSON-LD both with and without graphs. If the JSON-LD does not contain any graph, the Statements will belong to the default graph.

Note: When I tried to use the Model and not the Dataset for loading the JSON-LD files, I realized that only the files with no @graph declarations were loaded correctly. The ones with the @graph declaration were not generating any statement.

Persist the Named Graphs in Virtuoso 

And these are the few lines of code I use to store the in-memory graphs in the Virtuoso store (I am sure there is a better way of doing this and combining the above step with these lines of code, however, this seems to work the way I want):
  // Default graph
  if(dataset.getDefaultModel()!=null && dataset.getDefaultModel().size()>0) {
    VirtGraph virtGraph = new VirtGraph (
      "jdbc:virtuoso://localhost:1111", "dba", "dba");
    VirtModel virtModel = new VirtModel(virtGraph);
    virtModel.add(dataset.getDefaultModel());
    // Print the triples
    println "graph: *"
    RDFDataMgr.write(System.out, dataset.getDefaultModel(), JenaJSONLD.JSONLD);
  }

  // Named graphs
  Iterator names = dataset.listNames()
  while(names.hasNext()) {
    String name = names.next();
    Model model = dataset.getNamedModel(name)
    VirtGraph virtGraph = new VirtGraph (name, 
      "jdbc:virtuoso://localhost:1111", "dba", "dba");
    VirtModel virtModel = new VirtModel(virtGraph);
    virtModel.add(model);

    // Print the triples
    println "graph: " + name
    RDFDataMgr.write(System.out, model, JenaJSONLD.JSONLD);
  }

  Figure 3 - Saving default and named graphs in Virtuoso

Software versions used in the example above

For the above examples I've used the following libraries/versions:
  • jena-core v. 2.11.0
  • jena-arq v. 2.11.0
  • jsonld-java-jena v. 0.2.99
  • virtjdbc4.jar
  • virt_jena2.jar

Saturday, July 20, 2013

Domeo, Annotation Framework, Catch Annotation Hub and Grails Plugins architecture

I have always found organizing big projects into components a reassuring idea.
Component Oriented Programming? I'll let you decide if that is what I mean. I’ve read several discussions on the topic of Component Oriented Programming vs. Object Oriented Programming, and I am personally one of those who believe the two strategies are complementary and not in competition. As I am not interested in debating the theoretical differences, I will stick to what I normally do rather than what I think.
That is one of the reasons I’ve always liked - and I still like - OSGi and that is also one of the reasons I’ve been always attracted by the Grails Plugins architecture.
The component-oriented approach did not always pay off. Occasionally I just gave up when I found myself fighting with the technology of the moment, which was getting a little in the way. I am sure most of my problems were related to my limited knowledge of that particular technology... still, deadlines are deadlines and I needed to get things done.
I am certainly not the first nor the last developer celebrating the Grails Plugin Oriented Architecture. Here is a blog post that shows how a domain class defined in a plugin can be reused by other components of the architecture.

However, I have been thinking about the OSGi-based Eclipse architecture for a long time and I even tried to develop a lighter Java framework for developing applications along the same lines. Naturally, since I've been using Grails, I’ve been thinking about how to reproduce the same behavior in web applications by using Grails plugins. Basically I am talking about conveniently leveraging plugins to benefit from all the perks of the Grails platform: domain classes, services, controllers and views. I will defer some of the technical details to future posts. Meanwhile I wish to provide a little context.

I am thinking of leveraging the plugin architecture for a project called CATCH that I’ve been working on thanks to a tiny grant awarded by Harvard Library Labs. As the Domeo Annotation Tool already provided some of the features I need for CATCH, I've decided to refactor and spin off some of its components. I've created a new GitHub project called Annotation Framework, which will collect all the new, improved modules that will later be used by both Domeo and CATCH.


CATCH Annotation Hub

The goal of CATCH is to provide a hub for collecting, searching and sharing annotations produced by several clients. These include the Domeo annotation client; HighBrow, an annotation client developed at Harvard by Reinhard Engels; and annotator.js, an open-source JavaScript library and tool developed by Nick Stenning that can be added to any webpage to make it annotatable.

Both CATCH and its older sister project Domeo are meant to be installed in several instances that should be able to communicate with each other in a federated architecture. You can think of this as a series of Annotation Framework Nodes that are distributed and connected so that when a user performs a search on one of the nodes, it can also find results that have been created and stored in other authorized/linked nodes. All with access control...


Saturday, January 14, 2012

Domeo v2 working much faster on Firefox than Chrome

For a week now, I have been deeply involved in developing v.2 of the Domeo annotation tool. Domeo is a combination of GWT (Google Web Toolkit) and JavaScript. I've been mainly working on the infrastructure and, initially, as it is not much UI work, I was testing only on Chrome (16.0.912.75). Sadly, I've noticed several times that my code was not running fast enough and I started to study alternative ways of performing various tasks.

Then, out of curiosity, I ran the same tests on Firefox and surprise... what was running slow on Chrome was running almost without any latency in FF (6.0.2).


The test consists of 18 steps, each of which adds the same amount of complexity. Looking at the above figure (milliseconds on the Y-axis) you will easily see what I am talking about. The difference seems way too big.

Thursday, February 24, 2011

AO: Annotating with one or multiple statements (triples)

A few days ago I had a phone discussion with some colleagues (Tudor Groza, Vit Novacek and Cartic Ramakrishnan) on how to use the Annotation Ontology (AO) for attaching something more complex than a single term (identified by a URI) to a document or document fragment. To make it clear, I am giving here an idea of how something like that can already be done in AO.

Let's say I am performing some text mining on some textual content. It is possible that I don't simply want to associate a term with a span of text but want to do something more elaborate. For example, I want to say that by analyzing this span of text I obtain the triple GeneG encodes ProteinP. How can I do that in AO? For instance, I can use a Named Graph and say something like in the following picture:

Figure 1: The dashed ovals are instances of annotation items. Selectors and other details of the actual annotation have been omitted.

As you can see, we have also annotated the atomic components of my triple. By doing this, while analyzing the assertions belonging to a specific domain I can always trace back to the original text. Also, by using a graph as the object of my annotation I am going in the direction of the Nanopublication format; however, this will be a topic for a future post.

Given this, you can imagine you can attach the proper provenance to the annotation. If you are a text miner, you might be interested in attaching what software or computational workflow generated such annotation and with what confidence.


You might have noticed the usage of the namespace tm, which stands for Text Mining. It is a set of properties I am working on for extending AO to better represent text mining results.

Friday, February 04, 2011

Monday, January 31, 2011

Annotation and Content Improvement

I recently attended the workshop 'Beyond the PDF' in San Diego and I noticed multiple times how the concept of 'Annotation' is often understood as a task performed after publication of a physical or digital document.

I consider Annotation to be more ubiquitous and important at all stages: before, during and after publication. Also, Annotation is not only about classic textual documents. Images, database records and data-sets can be annotated as well. Even physical objects can be digitally annotated when we create a corresponding digital record or - speaking in terms of ontologies - when we refer to the representation of that particular instance of a certain class.

Annotation can exist as such forever or can be incorporated back into the original document/resource or a new version of the original document/resource. If you think of the old-fashioned paper encyclopedia, every year - or every few years - the editor collected the various annotations to come up with a new edition of the heavy volumes. This was very close to what in the digital world is called versioning.

In the modern digital world annotation is everywhere. Tags attached to a document are annotation. Leveraging crowdsourcing makes it possible to include the most popular tags as keywords for that document. Delicious users experience this any time they tag a new resource and receive suggestions of popular or appropriate tags. Reviews of catalog items in Amazon are annotations, and the statistical analysis of such results appears close to the selected item in the form of stars. To some extent, edits in a Wiki can be seen as annotation - and could be exported as such - where a user changes the current document content. However, I understand using the term Annotation for edits might sound a bit of a stretch.

Maybe, in today's digital world, a better way to refer to this process is 'Content refinement', as everything can potentially be 'changed'. But even the term refine might fall short, as 'to refine' means improving by making small changes. Sometimes edits are massive and an article in Wikipedia can evolve dramatically in time. It is not simply polishing and fixing: we can add/remove big chunks of the original documents - adding missing items or removing items that are redundant or not valid anymore - or we can make the original document more current - for instance adding new evidence that was not available when the document was previously published. 'Content Improvement' is probably generic enough to cover refinements and edits.

Sure, I am talking about evolving documents, but that does not preclude taking snapshots of them in a 'traditional publication' or in a version of that resource. Take online news. I realized more than once that the news at a specific URL was changing and journalists were incrementally adding new sections at the bottom of the page whenever new updates were available. You might argue this is not good practice, but it happens more often than you think. The reason is simple: in the digital world, it is possible and cheap. We don't have to reprint a book or add an errata page to avoid reprinting. We just create a note or directly edit the content - hopefully while keeping track of the changes.

I see many attempts to redefine what a publication is. These days, I believe publication is a multidimensional evolving artifact including images, videos, live tables, data, metadata... and no matter what it includes or what it looks like, it has to manage change or content improvement. Only snapshots of it, at particular times, would match the 'classic' concept of publication.