Sunday, 26 May 2013

GPML writer

The goal of my project is to write an RDF to GPML converter. To achieve this goal I had to learn about RDF and SPARQL. The next step was to learn how GPML is built and how to write GPML using Java. For this I am using Eclipse to write my Java code.

To write GPML I will be using the PathVisio library, which contains the necessary modules to write GPML. Here is a link to the tutorial on how to add the JAR files to Eclipse: http://developers.pathvisio.org/wiki/PluginEclipseSetup

Now that the required packages are set up, I can start making my own GPML example pathway. As an example, my task was to make the following:

Each square block is called a "data node" and each line has its own MIM shape. The letters in the data nodes are the names of the genes/metabolites that the data nodes represent. The numbers in the right corner of each data node are the reference numbers, assigned to each reference in the order it was inserted. Besides what is visible, background information also has to be included: detailed information about the references, unique IDs for the data nodes and lines, and annotations.

The embedded GitHub Gist below contains the code to create this example. I will briefly go through the code step by step and explain it.

Before I start explaining the code, I would like to thank Martina Kutmon, who helped me get the code started and helped me in the places where I got stuck and could not figure out how to define or call certain things.

So let's start by making a separate class called WriteExampleGPML. Now I can start writing the code to create the GPML format for the example shown in the first figure. Before defining any elements of the pathway, I needed to create the pathway model:



// create pathway model
Pathway pathway = new Pathway();



Now that I have created the pathway model, I can start inserting the elements it requires. First I will create all the data nodes (3x) and lines (2x) shown in the example. This can be done using the following code:


// create data nodes
PathwayElement e1 = PathwayElement.createPathwayElement(ObjectType.DATANODE);
PathwayElement e2 = PathwayElement.createPathwayElement(ObjectType.DATANODE);
PathwayElement e3 = PathwayElement.createPathwayElement(ObjectType.DATANODE);

// create lines
PathwayElement l1 = PathwayElement.createPathwayElement(ObjectType.LINE);
PathwayElement l2 = PathwayElement.createPathwayElement(ObjectType.LINE);

In Eclipse, when you use a class you have not yet imported, it will ask you to import it when it is used, provided the class is available on the build path. If I had followed the tutorial mentioned above incorrectly, I would not have had access to these classes.

Now that I have the basic elements, I can start adding attributes to each of them. Data nodes require different attributes than lines. For the data nodes I added the following attributes: the coordinates, size, name, ID, type and graph ID (which distinguishes the data node from the other data nodes):


// adding attributes for data nodes
e1.setMCenterX(100);                 // x-coordinate
e1.setMCenterY(200);                 // y-coordinate
e1.setMWidth(80);                    // width of the box
e1.setMHeight(20);                   // height of the box
e1.setTextLabel("A");                // gene product name
e1.setElementID("1");                // identifier of the gene product
e1.setDataNodeType("Metabolite");    // type of the data node
pathway.add(e1);                     // add to pathway model
e1.setGeneratedGraphId();            // graph id (has to be set after it has been added to the model)

For the lines I added the following attributes: the line coordinates, what the line connects to, the line thickness and the line type. For both lines and data nodes, extra information such as color can be added if required, but for now I will focus on just these attributes. Make sure that MIMShapes is registered, or the MIM line types will not be recognized:


// register mimshapes
MIMShapes.registerShapes();



// adding attributes for lines
l1.setMStartX(140);                      // x-coordinate of the start of the arrow
l1.setMStartY(200);                      // y-coordinate of the start of the arrow
l1.setMEndX(260);                        // x-coordinate of the end of the arrow
l1.setMEndY(200);                        // y-coordinate of the end of the arrow
l1.setStartGraphRef(e1.getGraphId());    // link the tail of the line to the data node it binds to
l1.setEndGraphRef(e3.getGraphId());      // link the head of the line to the data node it binds to
l1.setLineThickness(1);                  // thickness of the line
l1.setEndLineType(LineType.fromName("mim-conversion"));    // the type of the line

Now that the main attributes have been set, we can add the extra ones that are required. As seen in the first figure, data node B links to the middle of the line between A and C; its line is attached to an anchor on that line. To create an anchor I did the following:


// create anchor
MAnchor anchor = l1.addMAnchor(0.5);              // puts the anchor in the middle of line l1
anchor.setGraphId(pathway.getUniqueGraphId());    // create a unique id for the anchor

Another point is that when we linked the lines to the data nodes, each line binds to the middle of the box instead of to its sides, which would hide the arrow head.




To correct this I need to define the MPoint bindings on the data nodes:

MPoint startl1 = l1.getMStart();    // start (tail) of the line
MPoint endl1 = l1.getMEnd();        // end (head) of the line
startl1.linkTo(e1, 1.0, 0.0);       // bind the tail to a point on the side of the data node
endl1.linkTo(e3, -1.0, 0.0);        // bind the head to a point on the side of the data node
pathway.add(l1);                    // add the line to the pathway model

Now that we have the basic structure of the example, I will add the references. First we create the IDs for each reference on each data node:


// add publications
e1.addBiopaxRef("id1");
e1.addBiopaxRef("id4");
e2.addBiopaxRef("id1");
e3.addBiopaxRef("id2");
e3.addBiopaxRef("id3");

Now that this is done, I can add the annotation for each reference. First we need to get the BioPAX reference manager and create a PublicationXref model, and then add the reference information:

// getting the BioPAX reference manager and creating the PublicationXref model
BiopaxElement refMgr = pathway.getBiopax();
PublicationXref xref = new PublicationXref();

// adding the reference information
xref.setPubmedId("1234");
xref.setTitle("Title");
xref.setYear("2013");
xref.setSource("Some source");
xref.setAuthors("Me");

// here we link the reference to the data node with the id we created previously
xref.setId(e1.getBiopaxRefs().get(0));
refMgr.addElement(xref);             // adding the reference to the pathway model

Now that I have added all the information for my task to the pathway model, I can write it to a GPML file using the following code:


// write to GPML (the second argument enables validation of the output)
pathway.writeToXml(new File("/home/cizaralmalak/Desktop/test.gpml"), true);






Tuesday, 7 May 2013

IBM tutorial on making RDF and using SPARQL

IBM tutorial

On IBM developerWorks there is a nice tutorial, "Introduction to Jena" (https://www.ibm.com/developerworks/xml/library/j-jena/), which was pointed out to me by my supervisor Andra Waagmeester. This tutorial teaches how to use Jena to create an RDF model and how to query information out of it with SPARQL. It is a very clear and helpful tutorial, but since I am not an expert in Java and quite good in Python, I will try to translate the code into Python. To do this I'll be using the Python RDFlib library.

The reason I am doing this is to give a clear overview of what my code does later on, since most people in this department do not have the Python expertise that I do. This way, they will have a template to understand my scripts better. Doing this tutorial will also give me more insight and experience to optimize my scripts.

To be able to replicate the tutorial in Python, I needed to find out which Python libraries allow me to build RDF models. As I mentioned in earlier blog posts, in Python it is the RDFlib library that allows me to do this.

In this blog post I will show the scripts I created to do exactly the same thing as the Jena tutorial on IBM, but in Python.

For listing 1 of the IBM tutorial, I wrote an equivalent Python script:
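In outline it looks like this (a minimal sketch; the family vocabulary, URIs and names are placeholders standing in for the tutorial's data, and the import path matches the RDFLib 2.4 documentation linked below):

from rdflib import Namespace
from rdflib.Graph import Graph    # RDFLib 2.4 import path; newer versions use: from rdflib import Graph

# a small family vocabulary, analogous to the relations in the IBM tutorial (placeholder URI)
FAM = Namespace("http://example.org/family#")

graph = Graph()

# join subjects, predicates and objects together as triples in one graph
graph.add((FAM["adam"], FAM["has_child"], FAM["beth"]))
graph.add((FAM["adam"], FAM["has_sibling"], FAM["carl"]))
graph.add((FAM["dora"], FAM["has_child"], FAM["adam"]))

# print every triple to check the model
for s, p, o in graph:
    print("%s %s %s" % (s, p, o))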

This script creates the separate subjects and predicates and joins them together as triples in one graph. These triples can now be used to extract information. In listings 2 and 3, they mention a few iterators to query the model from listing 1. These iterators are also available in Python's RDFlib library and are documented here: http://www.rdflib.net/rdflib-2.4.0/html/public/rdflib.Graph.Graph-class.html

The Python script for listings 2 and 3 can be seen here:
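Sketched against the same graph and FAM namespace as above:

# subject_objects: every (subject, object) pair joined by a given predicate
for parent, child in graph.subject_objects(FAM["has_child"]):
    print("%s has child %s" % (parent, child))

# triples: exact matches, with None as a wildcard for the positions you do not know
for s, p, o in graph.triples((None, FAM["has_child"], None)):
    print("%s %s %s" % (s, p, o))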


With the subject_objects method of Graph you can find every subject and object that are connected by a certain predicate. The triples method of Graph allows me to find exact matches of triples. You can also pass None for a URI or Literal you do not know, and it will find the triples based on the one or two values you did enter. Basically, I can use triples the same way I would use subject_objects.

It is also possible to store this graph in a database. In this case I followed the example and used a MySQL database. First of all I needed to install MySQL support for Python using:

easy_install mysql-python

With the MySQL server and the mysql-python library installed, I can use the RDFlib library to connect to a MySQL database or create one. In listing 4 they use a different database that comes with the Jena library; since I am not using the Jena library, I'll use listing 1 as the example.

With the following code we can open the MySQL database and put the previously created graph in it:
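A minimal sketch, assuming the MySQL store plugin that shipped with RDFLib 2.4 (the connection details are placeholders):

from rdflib import Namespace
from rdflib.Graph import ConjunctiveGraph    # RDFLib 2.4 import path

FAM = Namespace("http://example.org/family#")

# connection string for the MySQL database (placeholder credentials)
configuration = "host=localhost,user=me,password=secret,db=rdfstore"

# bind a graph to the MySQL-backed store and create the tables
graph = ConjunctiveGraph(store="MySQL", identifier="family")
graph.open(configuration, create=True)

# add the triples from the listing 1 example and persist them
graph.add((FAM["adam"], FAM["has_child"], FAM["beth"]))
graph.commit()
graph.close()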

Now that the graph is stored in the database, I can read it any time I like, just by opening the database and choosing either SPARQL or the predefined methods. These predefined methods belong to the Graph class, following the example of listings 2 and 3 from IBM.
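Reading it back is then a matter of reopening the same store (create=False) and using the same predefined methods as before, sketched here under the same assumptions:

# reopen the existing store and query it with the predefined methods
graph = ConjunctiveGraph(store="MySQL", identifier="family")
graph.open(configuration, create=False)

for parent, child in graph.subject_objects(FAM["has_child"]):
    print("%s has child %s" % (parent, child))

graph.close()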

In case there is no predefined method I can use, it is also possible to use SPARQL, which is included in the RDFlib package as well. The Python script for it can be seen here:
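In outline (a sketch against the placeholder family vocabulary above, where an uncle or aunt is simply a sibling of a parent):

# find everyone whose parent has a sibling, i.e. who has an uncle or aunt
query = """
PREFIX fam: <http://example.org/family#>
SELECT ?person ?uncle_or_aunt
WHERE {
    ?parent fam:has_child ?person .
    ?parent fam:has_sibling ?uncle_or_aunt .
}
"""
for person, relative in graph.query(query):
    print("%s has uncle or aunt %s" % (person, relative))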

This SPARQL query finds out who has an uncle and/or an aunt. I will explain more about how SPARQL works in the next section below.

SPARQL tutorial

SPARQL is the query language used to search RDF stores. Since my project requires me to use a SPARQL endpoint, it is necessary to learn how it works. SPARQL strongly resembles SQL, so a lot of the syntax is similar. This is fortunate for me, since I have a background in SQL, which makes SPARQL easier to understand.

Just like in SQL, you need a SELECT to choose which columns you want to see. Unlike SQL, you do not need to select columns that already exist, since RDF stores do not work like SQL stores: they do not use tables. If, for example, I write "SELECT ?child", it means I am creating a column called "child". The "?" is how SPARQL marks its variables. Now that I have defined what kind of information I want to see, I can try to find it by matching subjects, predicates or objects. This is done using the WHERE clause:

SELECT ?child
WHERE { ?parent <http://example.org/has_child> ?child }

This small query returns the children of every parent. As can be seen, ?parent is defined in the WHERE clause but is not displayed at all. This is no problem, since ?parent is just a variable to catch the subjects; it could have been any "?" followed by a word. In this query, the only two parts of importance are the predicate (has_child) and ?child, which catches who the children are. Of course, if I wanted the parents too, I could have added ?parent to the SELECT statement.

As for the predicates, these can be found in the vocabularies used by the store. In this case it was just an example, but existing RDF stores use predicates from established vocabularies to define their data. I explained this subject in a previous blog post.

It is also important to note that URIs have to be placed between "<>". And if I do not want to define a variable, because I am not going to use it, I can use "[]" to indicate that I am not interested in that field. For example:

SELECT ?child
WHERE { [] <http://example.org/has_child> ?child }

This would give me the same result as the first query.

Furthermore, just like in SQL, you can use keywords such as GROUP BY, ORDER BY, DESC, COUNT, etc.; a sketch of these follows the example below. Another very important keyword is FILTER, which lets me search within the value of a variable for certain words or characters to filter on. For example, finding the Finnish word for saxophone:

SELECT ?fin
WHERE
{
    <http://dbpedia.org/resource/Saxophone> rdfs:label ?fin .
    FILTER ( lang(?fin) = 'fi' )
}
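And a sketch of the aggregate keywords mentioned above, counting and ranking the children per parent (SPARQL 1.1, reusing the example predicate from the first query):

SELECT ?parent (COUNT(?child) AS ?children)
WHERE { ?parent <http://example.org/has_child> ?child }
GROUP BY ?parent
ORDER BY DESC(?children)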

Thursday, 2 May 2013

Different Triplestores

Triplestores

Before I start learning and using SPARQL, I need to understand what kinds of triplestores there are, which ones are the best, and how to use them. Triplestores are databases that can store and retrieve triples. A few points to look at when making a decision:
  • What platforms the triplestore supports
  • Whether it is open source
  • What RDF formats it supports
  • What programmatic environments it supports
  • What the maximum number of triples is that it can hold
  • How user-friendly it is
  • What language it is written in
After reading up on the different kinds of triplestores, I made a top-5 list of the most suitable triplestores to use:
  • ClioPatria
  • Jena
  • OpenLink Virtuoso
  • Sesame --> with external packages
  • Open Anzo
I made this list based on the data provided on a number of triplestore comparison websites.
First of all, I am going to kick Open Anzo off the list, because I cannot reach its main webpage and cannot find any specific information about it, so I presume the tool is dead.

As for the rest, they all seem able to do what is required of them, though some functionality in OpenLink Virtuoso is only available in the commercial version.

As for ClioPatria, its limitations lie in scalability and limited query optimization, which matter when you are working with data exceeding a billion triples. It already takes 30 minutes and 30 gigabytes of memory to read 120M triples.

Jena's strength is that it is a platform that other tools, such as Fuseki, can build on. However, Jena seems to be much slower than, for example, Sesame.

Sesame's biggest disadvantage is its storage capacity, which is quite low, and without third-party APIs it has shortcomings in certain functions.

So in the end, every one of the triplestores mentioned is good in its own way. I myself would choose Sesame, based on its simplicity and on what I require from the tool, though other people might prefer other options because of the shortcomings of a specific triplestore.