Sunday, 26 May 2013

GPML writer

GPML writer

The goal of my project is to write an RDF-to-GPML converter. To achieve this goal I had to learn about RDF and SPARQL. The next step was to find out how GPML is built and how to write GPML using Java. I am using Eclipse to write my Java code.

To write GPML I will be using the PathVisio library. This library contains the necessary modules to write GPML. Here is a link to the tutorial on how to add the JAR files to Eclipse: http://developers.pathvisio.org/wiki/PluginEclipseSetup

Now that the required packages are set up, I can start making my own GPML example pathway. My task was to make the following example:

Each square block is called a "Data Node" and each line has its own "MIM shape". The letters in the data nodes are the names of the genes/metabolites that the data nodes represent. The numbers in the right corner of each data node are the reference numbers, assigned to the references in the order in which they were inserted. Aside from what can be seen, background information also has to be included, such as the details of each reference, the unique IDs of the data nodes and lines, and the annotation.

The embedded GitHub Gist below contains the code that produces this example. I will briefly go through the code step by step and explain it.

Before I start explaining the code, I would like to thank Martina Kutmon, who helped me get started with the code and helped in places where I got stuck and couldn't find out how to define or call certain things.

So let's start by making a separate class and calling it WriteExampleGPML. Now I can start writing the code to create the GPML for the example shown in the first figure. Before defining any aspects of the pathway, I needed to create the pathway model:



// create pathway model
Pathway pathway = new Pathway();



Now that I have created the pathway model, I can start inserting the elements that it requires. First I will create all the data nodes (3x) and lines (2x) represented in the example. This can be done using the following code:


// create data nodes
PathwayElement e1 = PathwayElement.createPathwayElement(ObjectType.DATANODE);
PathwayElement e2 = PathwayElement.createPathwayElement(ObjectType.DATANODE);
PathwayElement e3 = PathwayElement.createPathwayElement(ObjectType.DATANODE);

// create lines
PathwayElement l1 = PathwayElement.createPathwayElement(ObjectType.LINE);
PathwayElement l2 = PathwayElement.createPathwayElement(ObjectType.LINE);

In Eclipse, when you forget to import a class, the editor offers to import it for you as soon as it is used. For this the class has to be available on the build path. If I had followed the tutorial mentioned above incorrectly, I would not have access to it.

Now that I have the basic elements, I can start adding attributes to each of them. The data nodes require different attributes than the lines. For the data nodes I added the following attributes: the coordinates, the size, the name, the ID, the type and the graph ID (which distinguishes the data node from other data nodes):


// adding attributes for data nodes
e1.setMCenterX(100);                // x-coordinate of the centre
e1.setMCenterY(200);                // y-coordinate of the centre
e1.setMWidth(80);                   // width of the box
e1.setMHeight(20);                  // height of the box
e1.setTextLabel("A");               // gene product name
e1.setElementID("1");               // identifier of the gene product
e1.setDataNodeType("Metabolite");   // type of the data node
pathway.add(e1);                    // add to the pathway model
e1.setGeneratedGraphId();           // graph id (has to be set after the node has been added to the model)

For the lines I added these attributes: the line coordinates, what the line connects to, the line thickness and the line type. For both lines and data nodes, extra information such as color can be added if required, but for now I will focus on these attributes. Make sure that MIMShapes is registered, or else the MIM line types will not be recognized:


// register mimshapes
MIMShapes.registerShapes();



// adding attributes for lines
l1.setMStartX(140);                                       // x-coordinate of the start of the arrow
l1.setMStartY(200);                                       // y-coordinate of the start of the arrow
l1.setMEndX(260);                                         // x-coordinate of the end of the arrow
l1.setMEndY(200);                                         // y-coordinate of the end of the arrow
l1.setStartGraphRef(e1.getGraphId());                     // link the tail of the line to the data node it binds to
l1.setEndGraphRef(e3.getGraphId());                       // link the head of the line to the data node it binds to
l1.setLineThickness(1);                                   // thickness of the line
l1.setEndLineType(LineType.fromName("mim-conversion"));   // type of the line

Now that the main attributes have been set, we can add the extra ones that are required. As seen in the first figure above, data node B links to the middle of the line between A and C. The line from B is linked to an anchor on the line between A and C. To create an anchor I did the following:


// create anchor
MAnchor anchor = l1.addMAnchor(0.5);                // puts the anchor in the middle of line l1
anchor.setGraphId(pathway.getUniqueGraphId());      // create a unique id for the anchor
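To make the 0.5 concrete: the anchor position is a fraction along the line, so a plain linear interpolation between the start and end points gives its location. The sketch below is plain Python, not PathVisio code. `anchor_point` is a hypothetical helper that assumes a straight line and uses the coordinates of l1 from above.

```python
def anchor_point(start, end, position):
    """Return the (x, y) point at the given fraction (0.0 to 1.0) along a straight line."""
    (x1, y1), (x2, y2) = start, end
    return (x1 + position * (x2 - x1), y1 + position * (y2 - y1))


# l1 runs from (140, 200) to (260, 200); 0.5 is the midpoint.
midpoint = anchor_point((140, 200), (260, 200), 0.5)
print(midpoint)  # (200.0, 200.0)
```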

Another point is that when we link the lines to the data nodes, the line binds to the middle of the box instead of to its sides, which hides the arrow head:




To correct this I need to define the MPoint bindings on the data nodes:

MPoint startl1 = l1.getMStart();    // start point (tail) of the line
MPoint endl1 = l1.getMEnd();        // end point (head) of the line
startl1.linkTo(e1, 1.0, 0.0);       // bind the tail to a point on the side of the data node
endl1.linkTo(e3, -1.0, 0.0);        // bind the head to a point on the side of the data node
pathway.add(l1);                    // add the line
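The two numbers passed to linkTo are relative coordinates on the data node. As a rough illustration of the arithmetic (plain Python, not PathVisio code), the sketch below assumes that relative coordinates are measured from the centre of the node, with -1.0 and 1.0 lying on opposite edges. `link_point` is a hypothetical helper name.

```python
def link_point(center_x, center_y, width, height, rel_x, rel_y):
    """Map relative coordinates (-1.0 to 1.0 on each axis) to absolute ones.

    Assumption: (0, 0) is the centre of the box and +/-1.0 are its edges.
    """
    return (center_x + rel_x * width / 2.0, center_y + rel_y * height / 2.0)


# e1 is centred at (100, 200) and is 80 wide and 20 high.
tail = link_point(100, 200, 80, 20, 1.0, 0.0)
print(tail)  # (140.0, 200.0)
```

Under that assumption the tail lands on the right-hand edge of e1, which is the same point that was set as the start of l1 earlier (140, 200).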

Now that we have the basic structure of the example, I will add the references. First we create the IDs for each reference in each data node:


// add publications
e1.addBiopaxRef("id1");
e1.addBiopaxRef("id4");
e2.addBiopaxRef("id1");
e3.addBiopaxRef("id2");
e3.addBiopaxRef("id3");

Now that this is done, I can add the annotation for each reference. First we need to get the Biopax reference manager and create a PublicationXref, and then add the information of the reference:

// creating the Biopax and PublicationXref model
BiopaxElement refMgr = pathway.getBiopax();
PublicationXref xref = new PublicationXref();

// adding the reference information
xref.setPubmedId("1234");
xref.setTitle("Title");
xref.setYear("2013");
xref.setSource("Some source");
xref.setAuthors("Me");

// here we link the reference to the data node with the ID we created previously
xref.setId(e1.getBiopaxRefs().get(0));
refMgr.addElement(xref);            // adding the data to the pathway model

Now that I have added all the information for my task to the pathway model, I can write it to a GPML file using the following code:


// write to GPML
pathway.writeToXml(new File("/home/cizaralmalak/Desktop/test.gpml"), true);






Tuesday, 7 May 2013

IBM tutorial making on RDF and Using SPARQL

IBM tutorial

On the IBM website there is a nice tutorial, "Introduction to Jena" (https://www.ibm.com/developerworks/xml/library/j-jena/), which was pointed out to me by my supervisor Andra Waagmeester. This tutorial is meant to teach people how to use Jena to create an RDF model and use SPARQL to query information out of it. It is a very clear and helpful tutorial, but since I am not an expert in Java and am quite good at Python, I will try to translate the code into Python. To do this I'll be using the Python rdflib library.

The reason I do this is to give a clear overview of what my code does later on, since most people in this department do not have the Python expertise that I do. This way they will have a template to understand my scripts better. Doing this tutorial will also give me more insight and experience to optimize my scripts.

To be able to replicate the tutorial in Python, I needed to find out which Python libraries allow me to build RDF models. As I have mentioned in earlier posts, in Python it is the rdflib library that allows me to do this.

In this post I will show the scripts I have created to do exactly the same thing as the Jena tutorial on IBM, but in Python.

For listing 1 of the IBM tutorial I have put up a python script that is equivalent to it:

This script creates the separate subjects and predicates and joins them together in one graph as triples. These triples can now be used to extract information. Listings 2 and 3 mention a few iterators to query the model from listing 1. These iterators are also available in the rdflib library and can be found here: http://www.rdflib.net/rdflib-2.4.0/html/public/rdflib.Graph.Graph-class.html

The script for listings 2 and 3 in Python can be seen here:


With the subject_objects method of the graph you can find every subject and object that are connected by a certain predicate. The triples method of the graph allows me to find exact matches of triples. Here you can also put "None" in a position if you don't know the URI or Literal, and it will find the triples based on the 1 or 2 values you have entered. So basically I can use "triples" in the same way as I would use "subject_objects".
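To show the idea behind the "None" wildcard without needing rdflib installed, here is a minimal pure-Python sketch. It is not rdflib's implementation: the graph is just a list of tuples, `match` and this `subject_objects` are hypothetical stand-ins for the graph methods, and the example.org URIs are made up.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Yield every triple that agrees with the pattern; None matches anything."""
    pattern = (subject, predicate, obj)
    for triple in triples:
        if all(p is None or p == t for p, t in zip(pattern, triple)):
            yield triple


def subject_objects(triples, predicate):
    """Yield (subject, object) pairs for one predicate, built on top of match()."""
    for s, _, o in match(triples, predicate=predicate):
        yield (s, o)


# Hypothetical family data, stored as plain tuples of strings.
EX = "http://example.org/"
graph = [
    (EX + "adam", EX + "has_child", EX + "edward"),
    (EX + "adam", EX + "has_child", EX + "fran"),
    (EX + "beth", EX + "has_spouse", EX + "adam"),
]

# Two of the three positions are given, the object is left open.
print(list(match(graph, subject=EX + "adam", predicate=EX + "has_child")))
# Only the predicate is given, which is what subject_objects does.
print(list(subject_objects(graph, EX + "has_child")))
```

Writing `subject_objects` as a thin wrapper around `match` mirrors the observation above: a pattern with only the predicate filled in gives the same information.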

It is also possible to store this graph in a database. Here I followed the example and used a MySQL database. First of all I needed to install MySQL for Python using:

easy_install mysql-python

Once the MySQL server and the mysql-python library are installed, I can use the rdflib library to connect to a MySQL database or create one. Listing 4 uses a different database that comes with the Jena library. Since I am not using the Jena library, I'll be using listing 1 as the example.

With the following code we can open the MySQL database and put the previously created graph in it:

Now that the graph is stored in the database, I can read it any time I like, just by opening the database and choosing to use either SPARQL or the predefined methods. These predefined methods belong to the graph module. This follows the IBM example of listings 2 and 3.

If there is no predefined method I can use, it is also possible to use SPARQL, which is included in the rdflib package as well. The Python script for it can be seen here:

This SPARQL query finds out who has an uncle and/or an aunt. I will explain more about how SPARQL works in the next section below.

SPARQL tutorial

SPARQL is the query language used to search RDF stores. Since my project requires me to use a SPARQL endpoint, it is necessary to learn how it works. The syntax of SPARQL resembles SQL, so a lot of the keywords are the same. This is very fortunate for me since I have a background in SQL, which makes it easier to understand SPARQL.

Just like in SQL, I need a "SELECT" to choose which columns I want to see. Unlike in SQL, the columns do not need to exist already, since RDF stores don't work like SQL stores: they don't make use of tables. If, for example, I write "SELECT ?child", it means that I am creating a column called "child". The "?" is how SPARQL marks its variables. Now that I have defined what kind of information I want to see, I can try to find it by looking for the "subjects", "predicates" or "objects". This is done using "WHERE":

SELECT ?child
WHERE { ?parent <http://example.org/has_child> ?child }

This small query will return the children of every parent. As can be seen, ?parent is used in the WHERE clause but is not shown in the result at all. This is no problem, since ?parent is just a variable to catch the subjects; it could have been any "?" followed by a word. In this query the only 2 parts of importance are the predicate (has_child) and the ?child variable, which catches who the children are. Of course, if I wanted the parents too, I could have added ?parent to the SELECT clause.

The predicates can be found in the vocabularies used by the store. Here it was only an example, but existing RDF stores use predicates from certain vocabularies to define their data. I explained this subject in a previous post.

It is also important to note that URIs have to be written between "<>". And if I don't want to define a variable because I'm not going to use it, I can write "[]" to indicate that I am not interested in this field. Example:

SELECT ?child
WHERE { [] <http://example.org/has_child> ?child }

This would give me the same result as the first query.
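To see why the two queries return the same rows, here is a toy evaluator for a single triple pattern in pure Python. It is only a sketch of the idea, not how a SPARQL engine works internally. `select` is a hypothetical helper, the data is made up, and URIs are written without "<>" to keep it short.

```python
def select(triples, pattern, variables):
    """Evaluate one triple pattern and return the requested variables per match.

    Terms starting with "?" are variables, "[]" matches anything without
    binding it, and every other term must match exactly.
    """
    rows = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term == "[]":
                continue
            if term.startswith("?"):
                # A variable used twice must bind to the same value both times.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            rows.append(tuple(binding[v] for v in variables))
    return rows


HAS_CHILD = "http://example.org/has_child"
data = [
    ("http://example.org/adam", HAS_CHILD, "http://example.org/edward"),
    ("http://example.org/beth", HAS_CHILD, "http://example.org/fran"),
    ("http://example.org/adam", "http://example.org/has_spouse", "http://example.org/beth"),
]

with_variable = select(data, ("?parent", HAS_CHILD, "?child"), ["?child"])
with_blank = select(data, ("[]", HAS_CHILD, "?child"), ["?child"])
print(with_variable == with_blank)  # True
```

The ?parent variable is bound during matching but never projected, so replacing it with "[]" changes nothing in the output.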

Furthermore, just like in SQL you can use keywords like GROUP BY, ORDER BY, DESC, COUNT, etc. Another very important keyword is FILTER. It allows me to restrict the results based on the value of a variable, for example on certain words or characters. Example: finding the Finnish word for Saxophone:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?fin
WHERE
{
    <http://dbpedia.org/resource/Saxophone> rdfs:label ?fin .
    FILTER ( lang(?fin) = 'fi' )
}
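What the FILTER does here can be mimicked in a few lines of Python: every label carries a language tag, and only the labels whose tag equals 'fi' are kept. This is just an illustration of the filtering step. The labels below are sample values typed in by hand, not fetched from DBpedia, and `filter_by_lang` is a hypothetical helper.

```python
def filter_by_lang(literals, lang):
    """Keep the text of every (text, language tag) pair with the requested tag."""
    return [text for text, tag in literals if tag == lang]


# Hand-written sample labels, each with a language tag.
labels = [("Saxophone", "en"), ("Saxophon", "de"), ("Saksofoni", "fi")]

print(filter_by_lang(labels, "fi"))  # ['Saksofoni']
```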

Thursday, 2 May 2013

Different Triplestores

Triplestores

Before I start to learn and use SPARQL, I need to understand what kinds of triplestores there are, which ones are the best and how to use them. Triplestores are databases that can store and retrieve triples. A few points to look at when making a decision are:
  • Which platforms the triplestore supports
  • Whether it is open source
  • Which RDF formats it supports
  • Which programming environments it supports
  • The maximum number of triples it can hold
  • How user friendly it is
  • Which language it is written in
After reading up on the different kinds of triplestores, I managed to make a top 5 list of the most suitable triplestores to use:
  • ClioPatria
  • Jena
  • OpenLink Virtuoso
  • Sesame --> with external packages
  • Open Anzo
I made this list based on the data provided on the following websites:
First of all, I am going to kick Open Anzo out of the list, because I can't reach the main webpage and can't find any specific information about it, so I presume the tool is dead.

The rest seem to be able to do what is required of them, though some functionality in OpenLink Virtuoso is not available if you don't have the commercial version.

As for ClioPatria, its limitations are scalability and limited query optimization, which would be important if you are working with more than a billion triples. It already takes 30 minutes and 30 gigabytes of memory to load 120 million triples.

Jena is a platform that other tools, like Fuseki, can build on. It also seems that Jena is much slower than, for example, Sesame.

Sesame's biggest disadvantage is its storage capacity, which is very low. And without third-party APIs it has some shortcomings in certain functions.

So in the end every one of the triplestores mentioned is good in its own way. I myself would choose Sesame, a choice based on its simplicity for what I require from the tool. Other people may prefer other tools because of certain shortcomings of a specific triplestore.

Thursday, 25 April 2013

Building an RDF and WikiPathways curation

Building an RDF

Today I added some more triples to my Saxophone RDF, using DBpedia for the "subjects" and "objects". I mainly used DBpedia because it is still not clear to me how to judge whether a website is a good link to use in RDF. But I am working on this issue and hope to come up with a solution soon.

Now that I have made the Saxophone RDF, I started to work on converting a pathway from WikiPathways into RDF. To do this I have to make use of predicates that indicate certain reactions or interactions. For finding these predicates, the website http://lov.okfn.org also seems to work really well. The following examples of how to use the "linked open vocabularies" are taken from the Saxophone RDF file I made.

When you go to the website you will notice a search box in the middle of the page:



You can enter the predicate, or part of it, and the site will try to find it. As an example I will take "produced_work" from the Music Ontology (mo). When you type this predicate in the search box, you will instantly notice that it does NOT find any hits. This is because the search engine does not handle "_", so I had to remove it first. The second problem I stumbled upon is that the search term has to be either the exact word or a part of a longer word. This means you will find the predicate if you look for "produced work" or "produced", but NOT for "produced works", so I have to be stricter with the naming. So I typed "produced work" in the search engine and luckily I found 1 exact hit:


Here I copy the URI directly as the predicate. For this example we had only 1 hit, but if I look for "birth" I get a lot of hits:



Unlike with "produced work", I find a lot of hits here, but luckily the left side panel offers filtering. The filters are based on Vocabulary, Type and Domain. To remove most of the predicates that have nothing to do with what I am searching for, I set the domain filter to "people". This reduced the number of hits drastically. By using the other filters, the output becomes narrower and more specific. But there are still several hits for "birth". So I clicked the arrow on the right side of the URI, which shows a pop-up with information about the URI:



As can be seen in the image above, the pop-up not only gives information about the URI, it also gives some statistics on how the search term relates to the hit. The important information that I mainly used is the lower table ("Element information"). If this is still not enough information about the URI, it is possible to click the URI, go directly to the vocabulary in question and read more about it.

Using this website together with DBpedia, I was able to make triples about the subject "Saxophone". Now that I have finished this, my next objective is to try to turn a pathway into RDF. To find the "subjects" and "objects" (in this case the metabolites/enzymes/proteins/genes etc.), I am using the following site, which allows me to search different vocabularies:

http://bioportal.bioontology.org/


This site was pointed out to me by my supervisor Andra Waagmeester, and with the help of Egon Willighagen I was able to understand it. On this page I have the options to search all ontologies, to look for a certain ontology or to search resources. I started by entering "adenosine triphosphate" in the "search all ontologies" field. This led to a list with many hits:



As can be seen here, I found many hits for "adenosine triphosphate". To see whether a link has anything to do with what I am looking for, I clicked on the details button just below each link. This gives me a pop-up with information about the link:



In this case it was the metabolite I was looking for, so I can put the link under the header "Full id" in my RDF. Using "Linked open vocabularies" to find predicates and the NCBO BioPortal to find metabolites/enzymes/proteins/genes etc., I will try to make an RDF from a WikiPathways pathway. To start off I took a simple pathway:

http://wikipathways.org/index.php/Pathway:WP2487

This is the NAD salvage pathway II of E. coli that I uploaded to WikiPathways last week:



This is a small and relatively simple pathway to convert to RDF. My goal for tomorrow is to finish converting the whole pathway to RDF.

Curating WikiPathways

One of the first things I did when I started my internship, to understand WikiPathways (aside from reading the website and doing the tutorial), was to try to curate existing pathways before building one myself.

Many of the pathways that I had to curate were rat pathways. As an example we will discuss the following pathway: Fatty Acid Beta Oxidation from Rattus norvegicus

http://wikipathways.org/index.php/Pathway:WP1307

The next pictures are examples of some of the problems I encountered during curation. I'll also explain how I did my curating.

First we open the pathway for editing as explained in my previous post. Now I can start looking for data nodes that are not annotated yet. This can be done in 3 ways:

1.  You double-click on the data node and see whether it is annotated.
2.  You click on the data node and open the "Backpage" tab in the right side panel. This will show whether the data node is annotated.


3.  Or you just scroll down the page until you reach the header "External references". Under this header every data node is listed, along with whether it is annotated:




As can be seen in the figure above, "Trans-Hexadecanoyl-CoA" has no reference, so we can go straight to that metabolite and try to annotate it.

To annotate the data node I can double-click on it, type the name of the gene product and push search to find and select it (explained in more detail in the previous post). But sometimes the annotation engine of WikiPathways does not find the gene product I need. So the first thing I do is find out whether removing part of the name of the gene product helps. I do this because names can be written differently. For example, if I have a gene product called "5'-ribonucleotide", the search engine might not recognize the "5'", so by removing it I might find that it is listed as "5-ribonucleotide". Another reason is that the name may be misspelled, and by removing a piece of the name the search engine may still be able to recognize the gene product I need.
If I still don't find what I need, I try to find out whether the name is a synonym for the primary name of the product. And if that doesn't help, maybe the product in question is wrong, and I can try to find the original reaction and see which gene products/metabolites are involved. This can be done in multiple ways:

  1. Search existing databases like NCBI, KEGG, HMDB, Ensembl etc.
  2. Find pathways mentioned in articles: PubMed etc.
  3. Or find a book that might have the pathway in it.
Since "Trans-Hexadecanoyl-CoA" is a metabolite, and I tried putting part of the name in the annotation engine and didn't find anything, I go to the HMDB website (http://www.hmdb.ca/) to check whether it is a synonym and try to identify its primary name. This is done by entering the name of the metabolite in the search box at the top of the site:


Make sure to choose Metabolites in the "search type" field. After I pushed search, a list of metabolites was shown in which this name is found. The name is matched not only against the name of the hit, but also against the synonyms list and the description. This means that the first hit is not necessarily the right one.


In this case the first hit is identical to the search term. Keep in mind, though, that the original name had "Trans" in it, which I left out because I couldn't find any hit with it (this is actually the first hint that "Trans-Hexadecanoyl-CoA" is not in this database). I took a closer look at the first hit, but I didn't think it was the metabolite I need. To be more sure, I searched for the complete name on Google, and the only hits I found were for "2-Trans-Hexadecanoyl-CoA", in an article with the exact same end product as the one in WikiPathways. This is a potential candidate for the correct metabolite; since the article is about pig heart mitochondria and not rat, there could be a small variation. So I will try to identify this metabolite by finding an ID I can connect to it. This ID can then be inserted manually in the data node when I double-click on it. In this case, however, I did not find enough proof or an ID to verify my findings. These are the websites where I found information about it, which I can discuss with the rest of the group to come to a definite conclusion:

http://www.genome.jp/dbget-bin/www_bget?reaction+R00385+R01278+R03776+R03856+R03989+R04753+R06985
http://www.jbc.org/content/271/30/17816.long

Another problem is mistyping, as I mentioned earlier. For example this metabolite, myrisoyl-CoA:



It is spelled wrongly and should be Myristoyl-CoA. By correcting the name and looking it up on HMDB, I found that it was not a primary name; the primary name is Tetradecanoyl-CoA. Putting that in the annotation search engine, I was able to annotate the metabolite:



I continued this process until I had annotated all the metabolites. The ones that I could not annotate I left as they were.

Wednesday, 24 April 2013

Making a Turtle RDF with rdflib

RDF with Saxophone as topic

So the first thing I did today was finalize my RDF script. It is now a fully working command-line script for making an RDF. Source:


Only a few things are hard coded: the name of the output file and the formats the script parses and writes. If you pay close attention to the parsing part, you will see that it parses n3, whereas the serializing happens in Turtle. This is no problem: the rdflib version I use has no separate Turtle parser, and since Turtle is a subset of N3, the n3 parser can read Turtle too.

Before I started writing the RDF, I needed to find out which vocabularies I can use to define my predicates. Doing research on this subject, I ran into a problem: there seemed to be no easy way of searching for predicates other than going to the individual vocabularies and looking for the predicates required. But just recently I stumbled on this website (http://lov.okfn.org/dataset/lov/), where you can type the predicate you want in the search engine and it will find the namespace for it, which is exactly what I needed. I haven't fully tested this website, but it looks like a nice start. Here is another website I found useful for finding namespaces (without a property attached): http://prefix.cc/

Another issue that I wasn't too sure about is whether reversals have to be included in the RDF too. An example of what I mean:

normal:     Saxophone    maker    Adolphe_Sax
reversed: Adolphe_Sax    made    Saxophone
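If reversals do turn out to be wanted, they could be generated rather than typed in by hand. The sketch below is one possible way, in plain Python without rdflib. `with_reversals` and the `INVERSE` table are hypothetical, and the only inverse pair in it is the maker/made pair from the example above.

```python
# predicate -> inverse predicate (only the pair from the example above)
INVERSE = {"maker": "made"}


def with_reversals(triples, inverse):
    """Return the triples plus a reversed triple for every predicate with a known inverse."""
    result = set(triples)
    for subject, predicate, obj in triples:
        if predicate in inverse:
            result.add((obj, inverse[predicate], subject))
    return result


triples = {
    ("Saxophone", "maker", "Adolphe_Sax"),
    ("Saxophone", "name", "Saxophone"),
}

for triple in sorted(with_reversals(triples, INVERSE)):
    print(triple)
```

Predicates without an entry in the table, like "name" here, are left alone, so literals never end up in the subject position.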

Aside from the predicates, the subject has to be a link to a resource, not to a webpage. For example, we take the DBpedia resource for Saxophone and not the Wikipedia page about the Saxophone. This is required because DBpedia has further links that connect to other webpages and libraries.

The object can be a link as well as a literal.

To use the Python script to create your own RDF, you have to supply the "subject-predicate-object". By entering the following on the command line, the "subject-predicate-object" will be added to the RDF in Turtle format.

python <name of script> <subject> <predicate> <object>

Below you see the beginning of the RDF in Turtle format:


@prefix : <file:///home/cizaralmalak/Desktop/Scripts/Saxophone.ttl#>.
@prefix _4: <http://dbpedia.org/resource/>.
@prefix _5: <http://purl.org/ontology/mo/>.
@prefix _6: <http://xmlns.com/foaf/0.1/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix xml: <http://www.w3.org/XML/1998/namespace>.

 _4:Saxophone "a" _5:instrument;
     _6:maker _4:Adolphe_Sax;
     _6:name "Saxophone". 

 _4:Adolphe_Sax _6:age "79";
     <http://xmlns.com/foaf/0.1/familyName> "Sax";
     _6:firstName "Adolphe";
     _6:gender "male";
     _6:name "Adolphe Sax". 

Using rdflib, this format can be changed into any other RDF format quite easily. This is done by changing the format passed to graph.serialize to the preferred one. But for the next run the parser has to be changed too, so that it can read the new format.

Tuesday, 23 April 2013

Corrections to the NAD pathway & bug report & some predicate information

Correcting some annotations in the NAD pathway:

WikiPathways is an open, public platform dedicated to the curation of biological pathways.
Some of the annotations were Gene Ontology IDs. These are not preferred in WikiPathways (the same goes for EC numbers). If the search function does not find a gene when you try to annotate it, this can have a few causes:
  1. The gene is not in the databases incorporated in WikiPathways.
  2. The gene name that is used is not the primary name of the gene (only primary names can be searched).
  3. The name was not typed correctly.
  4. Or the gene does not exist, in which case it cannot be annotated.
In the case of point 1, 2 or 3, extra research is required. I will show a small example of how I tackled these points.

First we go to the WikiPathways webpage: http://wikipathways.org/index.php/WikiPathways.
From there I entered the name of my pathway in the search engine.



After we have entered the name and pushed "Search", we can add the species to narrow down the search.



Now I choose the NAD pathway that has to be adjusted (NAD salvage pathway I). After selecting the pathway, you can edit it by pushing the "Edit pathway" button on the lower left of the pathway screen. Now I double-click on the data node that I want to adjust.



In the meantime I go to the original source from which I took the pathway. In this case it was: http://biocyc.org/ECOLI/NEW-IMAGE?type=PATHWAY&object=PYRIDNUCSAL-PWY&detail-level=2. Here I can see what the name of the gene/enzyme/protein was before I annotated it. In this case it was also "NAD+ diphosphatase". Keep in mind that the reason I am re-annotating this enzyme is that GO is not preferred in WikiPathways, because more than 1 gene can be linked to one GO term.
So the first step is to try to find this enzyme/gene product in a different database like Ensembl (Ensembl Bacteria), UniProt etc. If, as in this case, we do not find a hit, we check whether the name is a primary name. This can be done just by looking at the synonyms and seeing whether a hit is found. The synonyms are mostly listed on the same website where you found the gene in question. Otherwise you can google the current name and see whether alternative hits are found and whether the synonyms are mentioned on that website.



In this case I was able to find a hit in UniProt with "NADH pyrophosphatase". Since the gene name associated with the enzyme in question is mentioned there, I can use the gene name to annotate the enzyme in WikiPathways. With the gene name I was able to annotate the enzyme automatically. If the gene name had not been found, I could have entered the UniProt ID manually together with the specific enzyme/gene name.



Now that I have annotated one of the data nodes that needed to be adjusted I can start a new one.

Abnormality in WikiPathways

There seems to be some kind of bug in WikiPathways. The bug is that (if we take the example above) the Ensembl identifier found for nudC does not exist when you go to the Ensembl website. Since this is a bacterium, I went to "Ensembl Bacteria", a separate Ensembl website for bacteria, where I was able to find the gene but not the identifier. It seems that the identifier used in Ensembl is completely different from the one annotated in WikiPathways:

WikiPathways Ensembl identifier for nudC: EBESCG00000003522

Ensembl bacteria identifier:



I reported this bug on wikipathways-discuss; we will soon see what the cause of it is.

Interface RDF builder

I have been working on building RDF from the command line, but I seem to have come to a halt because of a certain aspect of rdflib. I need to be able to switch easily between "Literal" arguments and URIs without going into the code or writing out every possible combination of literal and URI for the three positions. Currently I have working code that does the job, but it is not command-line based yet. I will be using:

import sys
sys.argv[argument] 

This will allow me to take arguments from a command-line interface. But the problem explained above can still cause trouble if the argument types are not defined correctly.

The good thing about rdflib is that it does not seem to write duplicate triples, so I don't have to write a separate module to check for them.
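The command-line idea can be sketched without rdflib itself. What follows is only a minimal sketch under an assumption of my own: a value that starts with http:// or https:// is treated as a URI, and anything else as a literal. The helper names (to_term, build_triples, main) are made up for this illustration; in rdflib the two cases would correspond to wrapping the value in URIRef or Literal. The triples are collected in a plain Python set, which mimics the no-duplicates behaviour, and the sketch does not check that the result is valid RDF (for example, it would accept a literal as subject).

```python
import sys


def to_term(value):
    """Format one command-line value as an N-Triples term.

    Assumption for this sketch: values starting with http:// or https://
    are URIs, everything else is a plain literal.
    """
    if value.startswith(("http://", "https://")):
        return "<%s>" % value
    # Escape backslashes and double quotes so the literal stays well-formed.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"%s"' % escaped


def build_triples(args):
    """Group a flat argument list into (subject, predicate, object) triples."""
    if len(args) % 3 != 0:
        raise ValueError("expected a multiple of 3 arguments")
    triples = set()  # a set silently drops duplicate triples
    for i in range(0, len(args), 3):
        triples.add(tuple(to_term(a) for a in args[i:i + 3]))
    return triples


def main(argv=None):
    """Print one N-Triples line per distinct triple."""
    if argv is None:
        # sys.argv[0] is the script name, so the real arguments start at 1.
        argv = sys.argv[1:]
    for s, p, o in sorted(build_triples(argv)):
        print("%s %s %s ." % (s, p, o))
```

With a call to main() added at the bottom of a script (say, a hypothetical make_triples.py), running it with the three arguments http://rdflib.net/test/Saxophone http://rdflib.net/test/is_a instrument prints a single line with two URIs and the literal "instrument". The type of each argument is decided by its content, so nothing in the code has to change per combination.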

Monday 22 April 2013

Writing RDF using python rdflib & correcting NAD pathways


Correcting and adjusting NAD pathways


So, about the pathways I uploaded to Wikipathways last week: there were still some small corrections to be made. There were 2 main issues: the metabolites that go into or come out of a reaction alongside the main metabolites have to be grouped together, and the data nodes have to be aligned correctly.


So this is how to tackle both problems. First we need to select the non-grouped metabolites that belong together:

Here I press “Ctrl+G” to group the selected items together. Now that they are grouped, you need to move the arrow from the single metabolite it was bound to onto the box around the group:

Now that this is done, we can start aligning the individual metabolites. This can be done by selecting all the metabolites inside the box and going to the top menu bar of pathvisio or Wikipathways (right side).
Using these 6 alignment buttons, I made my data nodes more readable (I mainly used the 1st and 3rd ones). Keep in mind that they adjust the data nodes to the largest data node:
Now we have a nice group of metabolites that belong to or come from the same reaction and are aligned correctly with each other, so it’s more readable for everyone.

Using python rdflib

I am currently trying to understand the different RDF formats, and one way to do this is by making an RDF file myself. Since I have more practice with Python, I’ll be using rdflib, a Python module, to build my own RDF about the saxophone (the instrument I used to play).

So first of all I am trying to get the information from DBpedia and store it in a specific format (either N-Triples or Turtle). After a few tutorials:
and some trial and error, I finally managed to download the DBpedia data about saxophones and convert it into the required format. I know you can also get any RDF format from DBpedia directly.

Here you have to make sure that the DBpedia link says “resource” and not “page”, which is the link that the “resource” link redirects to. The format can also be changed to any of the required formats:

N-Triples = nt
Turtle = turtle
Notation 3 = n3
RDF/XML = rdf+xml
N-Quads = nq
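The two points above (use the “resource” link, and pick a format by its short name) can be captured in a couple of small helpers. This is only a sketch: the function names are hypothetical, the format names are simply copied from the list above, and I have not checked here whether every rdflib version accepts each of them.

```python
# Short format names copied from the list above (not verified against rdflib).
FORMATS = {
    "N-Triples": "nt",
    "Turtle": "turtle",
    "Notation 3": "n3",
    "RDF/XML": "rdf+xml",
    "N-Quads": "nq",
}


def to_resource_url(url):
    """Turn a DBpedia "page" link into the matching "resource" link."""
    return url.replace("://dbpedia.org/page/", "://dbpedia.org/resource/", 1)


def format_name(label):
    """Look up the short format name for a label, ignoring case."""
    for known, short in FORMATS.items():
        if known.lower() == label.lower():
            return short
    raise KeyError("unknown RDF format: %s" % label)
```

For example, to_resource_url("http://dbpedia.org/page/Saxophone") gives back the “resource” form of the link, and a link that already says “resource” is returned unchanged.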

So now I know how to convert one RDF format to another. Next I need to make my own RDF by inserting my own subjects, predicates and objects.
This can be achieved using “graph.add((sub,pre,obj))”. Again, here is a useful manual about rdflib: http://rdflib.readthedocs.org/en/3.2.0/gettingstarted.html. Though honestly it's not that clear in the beginning. In the next example I made 2 triples and wrote them out in the nt format:


Again, here I can change the format to any of the formats mentioned above. Below is an example of what the “nt” format looks like:

<http://rdflib.net/test/Saxophone> <http://rdflib.net/test/invented_by> "mr. Sax".
<http://rdflib.net/test/Saxophone> <http://rdflib.net/test/is_a> "instrument".
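To make the structure of these lines explicit, here is a small sketch that splits one such line back into its subject, predicate and object. It is not a full N-Triples parser: the regular expression and the function name are my own, it only handles the simple shape shown above (URI subject, URI predicate, and an object that is a URI or a plain quoted literal), and it does not decode escape sequences inside the literal.

```python
import re

# Matches only the simple shape shown above: <subject> <predicate> and then
# either <object-uri> or "plain literal", followed by the closing dot.
NT_LINE = re.compile(
    r'^\s*<([^>]*)>\s*<([^>]*)>\s*(?:<([^>]*)>|"((?:[^"\\]|\\.)*)")\s*\.\s*$'
)


def parse_nt_line(line):
    """Split one line into (subject, predicate, object, object_kind)."""
    match = NT_LINE.match(line)
    if match is None:
        raise ValueError("not a simple N-Triples line: %r" % line)
    subject, predicate, uri_object, literal_object = match.groups()
    if uri_object is not None:
        return subject, predicate, uri_object, "uri"
    return subject, predicate, literal_object, "literal"
```

Applied to the first line above, this returns the Saxophone URI as subject, the invented_by URI as predicate, and the object "mr. Sax" marked as a literal.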

Now all I have to do is make a command-based interactive script that allows me to enter the information I want manually and add it to the RDF file.
Since this was a test, I didn't pay much attention to the predicates. But I am currently trying to understand which predicates are supposed to be used when. A few websites I am currently studying:
http://xmlns.com/foaf/spec/
http://www.heppnetz.de/grprofile/
http://www.w3.org/TR/vcard-rdf/
And other related vocabularies such as rdf, rdfs, dc, dcterms, etc.

Introduction & Assignment: building a pathway in pathvisio & importing it onto Wikipathways


Hi everyone,

I am Cizar Almalak and a few weeks back I started my internship here at Maastricht University. I chose to do my internship in Maastricht because I am quite interested in the Semantic Web, which is a big topic here. Through Chris Evelo I came into contact with Andra Waagmeester, who is currently my supervisor.

My project consists of multiple steps. First, I have to understand how Wikipathways works and the concept of the Semantic Web. Second, I have to write a script that can convert RDF into GPML, and third, a script that can convert one RDF format into another.

So to understand how Wikipathways works, I have to practice by making some pathways myself, starting with the pathway tutorial on Wikipathways: http://wikipathways.org/index.php/Help:Tutorial.

After doing this tutorial, it was time to take a real pathway from a database other than Wikipathways and see whether I could reproduce it for Wikipathways. Since this would be my first time making a pathway for Wikipathways, I did not put it directly on Wikipathways but made it with pathvisio, which has the same interface as Wikipathways and can export the data as GPML, which Wikipathways can read.
So in the following example I’ll show how I made the NAD biosynthesis pathway in E. coli. I got the pathway from BioCyc (source: http://biocyc.org/ECOLI/NEW-IMAGE?type=PATHWAY&object=PYRIDNUCSYN-PWY).

First I got the latest version of pathvisio (3).

So once you have pathvisio, the first step is to download the required data about the organism you are using (gene and metabolite databases). I achieved this by going to the following link: http://www.pathvisio.org/downloads/download-bridgedbs/ or by going to www.pathvisio.org >> upper menu bar >> downloads >> download mapping databases:


Then I selected the species I wanted to use (E. coli for this example). For metabolites there is only a human database, which you can use for every species.

After I downloaded the required databases, I needed to load them into pathvisio. This can be done by going to the upper panel of pathvisio and choosing: Data >> Select gene database and Data >> Select metabolite database.

So now that pathvisio is ready, I can start with the pathway. Before starting, I went to the box that says “title” in the upper left corner and double-clicked it. A menu will pop up.



At the top of the pop-up you will see 3 options: Comments, Literature and Properties. In Properties I put the title and the organism I’m working on.
In the Comments menu I put the source from which I took the pathway.
In the Literature (reference) menu I put all the references associated with this pathway. I did this by pushing the new reference button at the bottom, which opens another pop-up menu:
By entering the PubMed ID of the article and pushing the button on the right (Query PubMed), all the required information about the publication is filled in correctly.
Now that I had done this, I started building the pathway, based on the BioCyc NAD biosynthesis pathway of E. coli.

I started by creating all the protein/enzyme/metabolite data nodes without the interactions. For proteins and enzymes I used the gene data node and for metabolites the metabolite data node. These data nodes can be found under “Objects” on the right side panel of pathvisio:
Just creating the data nodes is not enough; each protein/enzyme/metabolite has to be annotated. This is the tricky part, since some of them have synonyms, and Wikipathways, like pathvisio, only recognises the primary name. So extra research is required to find the primary name. First I double-click on one of the data nodes and a data node property list pops up:
In this pop-up menu I entered the name of the protein/enzyme/metabolite in question in the search field and pushed search. If there is no hit, it means that this protein/enzyme/metabolite is not annotated or that the name is not its primary name. But in this case it is the primary name of the metabolite, and I got another pop-up menu showing all the hits that were found:
Here I chose the metabolite in question; all the other hits are similar but not the metabolite I was searching for. After I made the selection I pressed “ok” and my metabolite was annotated based on the database I selected:
Any extra information or reference about the annotated protein/enzyme/metabolite can be added in the Comments or Literature tab at the top of the data node property pop-up.

When I had all the data nodes annotated, it was time to add the interactions between the data nodes. Like the data nodes, the interactions can be found on the right side panel, below the data nodes section, under “Basic interactions” & “MIM interactions”:
I used “Basic interactions” when the type of interaction was not clear or not in the “MIM interactions” list, and “MIM interactions” when it was clearly stated what kind of interaction is involved between the 2 products. A few problems occurred while I was building this pathway. The first is that when a product interacts with the main product, you have to use 2 different lines, for example:
The circles mark 2 separate lines binding to a single place, to illustrate the involvement of the metabolite oxygen with L-aspartic acid and the by-products hydrogen peroxide and H+. The second is that the program sometimes bugs out and all the data nodes and interactions scatter. This happens mainly when you try to move all of them at once.

Now that I have done all the interactions I am basically done; all I have to do now is put them in a nice order to make the pathway more readable.

To export the pathway, I used the top menu bar of pathvisio and selected File >> Export:
A pop-up menu will appear. In this menu you can give the file a name in the “File Name” field and choose a format in the “Files of Type:” field. I needed the gpml format in order to upload to Wikipathways.
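Since GPML is XML, an exported file can be checked with a few lines of Python before uploading, for instance to list which data nodes still lack an annotation. The following is a rough sketch rather than anything pathvisio provides: the sample fragment is hand-written and heavily trimmed, its namespace URI and database name are assumptions, and the element and attribute names (DataNode, TextLabel, Type, Xref, Database, ID) are used as I understand the format. The code ignores the namespace, so it does not depend on the GPML version.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written GPML fragment for illustration only; a real
# export contains many more attributes and elements.
SAMPLE_GPML = """
<Pathway xmlns="http://pathvisio.org/GPML/2013a" Name="NAD biosynthesis">
  <DataNode TextLabel="nudC" GraphId="a1" Type="GeneProduct">
    <Xref Database="Ensembl" ID="EBESCG00000003522" />
  </DataNode>
  <DataNode TextLabel="NAD" GraphId="a2" Type="Metabolite">
    <Xref Database="" ID="" />
  </DataNode>
</Pathway>
""".strip()


def local_name(tag):
    """Strip the "{namespace}" prefix that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def list_data_nodes(gpml_text):
    """Return (label, type, database, identifier) for every DataNode."""
    root = ET.fromstring(gpml_text)
    nodes = []
    for element in root.iter():
        if local_name(element.tag) != "DataNode":
            continue
        database, identifier = "", ""
        for child in element:
            if local_name(child.tag) == "Xref":
                database = child.get("Database", "")
                identifier = child.get("ID", "")
        nodes.append((element.get("TextLabel", ""), element.get("Type", ""),
                      database, identifier))
    return nodes


def unannotated(nodes):
    """Keep only the data nodes whose identifier is still empty."""
    return [node for node in nodes if not node[3]]
```

Calling unannotated(list_data_nodes(SAMPLE_GPML)) on the sample returns only the NAD node, which is the one that would still need to be annotated.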
After exporting the file in the gpml format, I can upload it to Wikipathways. So first I went to the Wikipathways website (http://wikipathways.org/index.php/WikiPathways) and chose create from the left side panel (circled).
In the new window I can choose to create a new pathway or upload a pathway in a certain format.
After I uploaded the pathway, everyone can see it on Wikipathways and make adjustments to it. Once the pathway was uploaded, I also added a description, so that people know what the pathway is about.
The description can be added under the header description below the pathway:
More information can also be added to the pathway under different headers, if the information is available.