Wednesday, December 05, 2007

JDPF (Java Data Processing Framework)

JDPF (www.jdpf.org) is a framework for defining pipelines and nets that perform data analysis. I have often been personally involved in defining algorithms for all sorts of data processing (mainly in medical informatics). For this reason, some time ago, we decided to create something that would foster the reuse of data analysis components. I also thought it should be free and, hopefully, community driven.

The first implementation, two years ago, was written from scratch and already gave an idea of the power of such an architecture (pipelines are nothing new). Recently, thanks to the outstanding work of two students (Bruno Farina and Paolo Mauri) and the valuable help of Ezio Caffi, we decided to move to OSGi technology. Working with OSGi has been really interesting, and hard at the very beginning (at that time the documentation was really thin). JDPF now consists of a set of core bundles (which take care of loading, validating and running nets) and a set of classes that can be used to develop new calculation blocks.

Right, because JDPF is an open architecture on which you can run your own modules (or reuse the available ones). Let's say, for instance, that I need to clean some data. I can create my component (for now by editing an XML file; we are going to publish the visual builder soon) by putting together the existing modules:
  • the generator, which loads data from a file or from a location on the internet
  • the range filter, which clips or simply erases all the data outside the specified, allowed range
  • the serializer, which writes the results to a file
After creating the component (or net), we need to edit a second XML file that is used to parametrize it. For instance, we need to define the range for the filtering, how and where to read the data, and how and where to write the results.
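To make the range filter step concrete, here is a minimal sketch in plain Java of the two behaviours mentioned above (clipping values to the range, or erasing values outside it). It does not use the JDPF API: the class name RangeFilterSketch, the Mode enum and the rangeFilter method are all hypothetical names made up for illustration, and in JDPF the range would come from the parameter XML file rather than from method arguments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: NOT the JDPF API, just the idea behind a range filter.
public class RangeFilterSketch {

    // Hypothetical switch between the two behaviours described in the post.
    public enum Mode { CLIP, ERASE }

    // Returns a new list; the input list is left untouched.
    public static List<Double> rangeFilter(List<Double> data, double min, double max, Mode mode) {
        List<Double> out = new ArrayList<>();
        for (double v : data) {
            boolean inside = v >= min && v <= max;
            if (inside) {
                out.add(v);
            } else if (mode == Mode.CLIP) {
                // Out-of-range values are pulled back to the nearest bound.
                out.add(Math.max(min, Math.min(max, v)));
            }
            // In ERASE mode, out-of-range values are simply dropped.
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for the generator: in a real net the data would be loaded from a file.
        List<Double> data = Arrays.asList(-5.0, 3.0, 12.0);

        // Stand-in for the serializer: here the results are just printed.
        System.out.println(rangeFilter(data, 0.0, 10.0, Mode.CLIP));
        System.out.println(rangeFilter(data, 0.0, 10.0, Mode.ERASE));
    }
}
```

In this sketch the range and the mode play the role of the values that the second XML file would supply to the net.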

Once these two XML files are in place, JDPF is ready to run on your data... and the user has not written a single line of code.

Of course, if you need a new custom block, you need to implement the algorithm. In this case, JDPF lets you focus only on that and forget about the validation and running aspects...
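As a rough illustration of that split of responsibilities, the sketch below separates the part a block author would write (the algorithm) from a stand-in for the framework side (checking parameters, then running). Everything here is an assumption made for the example: the Block interface, the ScaleBlock class, the "factor" parameter and the run method are hypothetical and are not taken from the JDPF classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Illustrative only: hypothetical names, NOT the JDPF API.
public class CustomBlockSketch {

    // Hypothetical contract between a block and the framework.
    public interface Block {
        List<String> requiredParameters();
        List<Double> process(List<Double> input, Map<String, String> params);
    }

    // What the block author writes: just the algorithm (here, scaling by a factor).
    public static class ScaleBlock implements Block {
        @Override
        public List<String> requiredParameters() {
            return Collections.singletonList("factor");
        }

        @Override
        public List<Double> process(List<Double> input, Map<String, String> params) {
            double factor = Double.parseDouble(params.get("factor"));
            List<Double> out = new ArrayList<>();
            for (double v : input) {
                out.add(v * factor);
            }
            return out;
        }
    }

    // Stand-in for the framework side: validate the parameters, then run the block.
    public static List<Double> run(Block block, List<Double> input, Map<String, String> params) {
        for (String name : block.requiredParameters()) {
            if (!params.containsKey(name)) {
                throw new IllegalArgumentException("Missing parameter: " + name);
            }
        }
        return block.process(input, params);
    }
}
```

The point of the design is that ScaleBlock never checks its own parameters; the surrounding code does that before the algorithm is called.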
