Semantic Web, Wikipathways and RDF: April 2013

Thursday 25 April 2013

Building an RDF and curating Wikipathways

Building an RDF

Today I added some more triples to my Saxophone RDF, using dbpedia for the subjects and objects. I mainly used dbpedia because it is still not clear to me when a website qualifies as a good link to use in RDF. I am working on this issue and hope to come up with a solution soon.

Now that I have made the Saxophone RDF, I have started converting a pathway from Wikipathways into RDF. To do this I need predicates that indicate certain reactions or interactions. For finding predicates, the website http://lov.okfn.org also seems to work really well. The following examples of how to use the "Linked Open Vocabularies" come from the Saxophone RDF file I made.

When you go to the website you will notice a search box in the middle of the page:

You can enter the predicate, or part of it, and the site will try to find it. As an example I will take "produced_work" from the Music Ontology (mo). When you type this predicate in the search box, you will notice right away that it does NOT find any hits. This is because the search engine does not work with "_", so I had to remove it first. The second problem I stumbled upon is that the search term has to match a word exactly or be part of a longer word. This means you will find the predicate if you look for "produced work" or "produced", but NOT for "produced works", so I have to be stricter with the naming. So I typed "produced work" in the search engine and luckily I found exactly 1 hit:
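The underscore clean-up described above can be sketched in a few lines of Python. This is only an illustration: the function name `lov_search_term` is made up for this example and it only prepares the text to type into the search box; it does not talk to the website.

```python
def lov_search_term(predicate):
    """Turn a predicate such as 'mo:produced_work' into a search phrase."""
    # Drop a prefix like 'mo:' if one is present (assumes a prefixed name,
    # not a full URI).
    local = predicate.split(":")[-1]
    # The search box does not handle "_", so replace it with a space.
    spaced = local.replace("_", " ")
    # Collapse repeated whitespace and lowercase the phrase.
    return " ".join(spaced.split()).lower()


print(lov_search_term("mo:produced_work"))
```

The same helper leaves a plain word such as "birth" unchanged, so it can be applied to every predicate before searching.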

Here I copied the URI directly as the predicate. For this example there was only 1 hit, but if I look for "birth" I get a lot of hits:

Unlike "produced work", "birth" gives a lot of hits, but luckily the panel on the left side offers filters. These filters are based on Vocabulary, Type and Domain. To remove most of the predicates that have nothing to do with what I am searching for, I set the domain filter to "people". This reduced the number of hits drastically, and using the other filters narrows the output further. Still, several hits for "birth" remained. In this case I clicked the arrow on the right side of the URI, which shows a pop-up with information about the URI:

As can be seen in the image above, the pop-up does not only give information about the URI; it also gives some statistics on the match between the search term and the hit. The important information that I mainly used is the lower table ("Element information"). If this is still not enough information about the URI, it is possible to click the URI, go directly to the vocabulary in question and read more about it.

Using this website with dbpedia, I was able to make triples about the subject "Saxophone". Now that I have finished this, my next objective is to try to turn a pathway into RDF. To find the "subject" and "object" (in this case the metabolites/enzymes/proteins/genes etc.), I am using the following site that will allow me to search different vocabularies:

http://bioportal.bioontology.org/

This site was pointed out to me by my supervisor Andra Waagmeester, and with the help of Egon Willighagen I was able to understand it. On this page I have the options to search all ontologies, to look for a certain ontology or to search resources. In this case I started by searching for "adenosine triphosphate" in the "search all ontologies" field. This led to a list with many hits:

As can be seen here I found many hits with "adenosine triphosphate". To see whether the link shown has anything to do with what I am looking for, I clicked on the detail button just below each link. This will give me a pop-up with information about the link:

In this case it was the metabolite I was looking for, so I can use the link under the header "Full id" in my RDF. Using "Linked Open Vocabularies" to find predicates and "bioportal NCBO" to find metabolites/enzymes/proteins/genes etc., I will try to make an RDF from a Wikipathways pathway. To start off I took a simple pathway:

http://wikipathways.org/index.php/Pathway:WP2487

This is the NAD salvage pathway II of E.coli that I uploaded on Wikipathways last week:

This is a small and relatively simple pathway to convert to RDF, so my goal for tomorrow is to finish converting the whole pathway.

Curating Wikipathways

One of the first things I did when I started my internship, to understand Wikipathways (aside from reading the website and doing the tutorial), was to curate existing pathways before building one myself.

Many of the pathways that I had to curate were rat pathways. As an example we will discuss the following pathway: Fatty Acid Beta Oxidation from Rattus norvegicus

http://wikipathways.org/index.php/Pathway:WP1307

The next pictures are examples of some of the problems I encountered during curation. I will also explain how I did my curating.

First we open the pathway for editing as explained in my previous blog post. Now I can start looking for data nodes that are not annotated yet. This can be done in 3 ways:

1. You double click on the data node and see whether it is annotated.

2. You click on the data node and open the "Backpage" tab on the right side panel. This will show whether the data node is annotated.

3. Or just scroll down the page until you reach the header "External references". Under this header every data node is noted and whether it is annotated:

As can be seen in the figure above the "Trans-Hexadecanoyl-CoA" has no reference, so we can instantly go to that metabolite and try to annotate it.
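The same check can in principle be done on an exported pathway file, since GPML is an XML format. The sketch below is hedged: the sample is a simplified, GPML-like fragment written for this example, the namespace is a placeholder, and the element and attribute names (DataNode, Xref, TextLabel, Database, ID) are assumed to mirror what a real GPML file uses. The identifier "EXAMPLE-ID" is not a real database ID.

```python
import xml.etree.ElementTree as ET

# Simplified, GPML-like fragment (an assumption; real files carry more detail).
SAMPLE = """<Pathway xmlns="http://example.org/gpml-sketch" Name="Example">
  <DataNode TextLabel="Tetradecanoyl-CoA" Type="Metabolite">
    <Xref Database="HMDB" ID="EXAMPLE-ID"/>
  </DataNode>
  <DataNode TextLabel="Trans-Hexadecanoyl-CoA" Type="Metabolite">
    <Xref Database="" ID=""/>
  </DataNode>
</Pathway>"""


def local_name(tag):
    """Strip the '{namespace}' part that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def unannotated_nodes(gpml_text):
    """Return the labels of data nodes that lack a database and identifier."""
    root = ET.fromstring(gpml_text)
    missing = []
    for elem in root.iter():
        if local_name(elem.tag) != "DataNode":
            continue
        xrefs = [child for child in elem if local_name(child.tag) == "Xref"]
        # A node counts as annotated only if some Xref has both fields filled.
        if not any(x.get("Database") and x.get("ID") for x in xrefs):
            missing.append(elem.get("TextLabel"))
    return missing


print(unannotated_nodes(SAMPLE))
```

For the sample above this lists only "Trans-Hexadecanoyl-CoA", the node without a reference.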

To annotate the data node I can double click on it, type the name of the gene product and push search to find and select it (explained in more detail in my previous blog post). But sometimes the annotation engine of Wikipathways does not find the gene product I need. So the first thing I do is find out whether removing part of the name helps. I do this because names can be written differently. For example, if I have a gene product called "5'-ribonucleotide", the search engine might not recognize the "5'", and by removing it I might find that it is stored as "5-ribonucleotide". Another reason is that the name may be misspelled; by removing a piece of it, the search engine may still be able to recognize the gene product I need.
If I still don't find what I need, I check whether the name is a synonym of the product's primary name. And if that doesn't help, maybe the product in question is wrong, and I can try to find the original reaction and see which gene products/metabolites are involved. This can be done in multiple ways:

Search existing databases like: NCBI, KEGG, HMDB, ENSEMBL etc.
Find pathways mentioned in Articles: PUBMED etc.
Or find a book that might have the pathway noted in it.
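The name-trimming step described before this list (retrying the search with part of the name removed) could be sketched as below. It is a rough heuristic, not what the Wikipathways annotation engine does: the function name `name_variants` and the set of prefixes it strips (a leading number with optional prime, "trans-", "cis-") are assumptions chosen for this example.

```python
import re

# Leading locants or stereo prefixes such as "5'-", "2-", "trans-" or "cis-".
PREFIX = re.compile(r"^(?:\d+'?-|trans-|cis-)+", re.IGNORECASE)


def name_variants(name):
    """Return search strings to try, from the full name to a trimmed one."""
    variants = [name]
    trimmed = PREFIX.sub("", name)
    # Only add the trimmed form when something was actually removed.
    if trimmed and trimmed != name:
        variants.append(trimmed)
    return variants


for candidate in name_variants("Trans-Hexadecanoyl-CoA"):
    print(candidate)
```

Each variant would then be typed into the search field in turn, stopping at the first one that gives a plausible hit.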

Since "Trans-Hexadecanoyl-CoA" is a metabolite, and putting part of the name in the annotation engine didn't find anything, I went to the HMDB website (http://www.hmdb.ca/) to check whether it is a synonym and to identify its primary name. I did this by entering the name of the metabolite in the search engine at the top of the site:

Make sure to choose Metabolites in the "search type" field. After I pushed search, a list was shown of the metabolites where this name was found. The name is matched not only against the name of each hit but also against its synonyms and description, so the first hit is not necessarily the right one.

In this case the first hit is exactly the same as the term I searched for, but keep in mind that the original name had "Trans" in it, which I left out because I couldn't find any hit with it (this was the first hint that "Trans-Hexadecanoyl-CoA" is not in this database). I took a closer look at the first hit, but I was not convinced that it was the metabolite I needed. To be more sure I searched for the complete name on Google, and the only hits I found were for "2-Trans-Hexadecanoyl-CoA", in an article with the exact same end product as the one in Wikipathways. This is a potential candidate for the correct metabolite, although the article is about pig heart mitochondria, not rat, so there could be a small variation. I tried to identify this metabolite by finding an ID I can connect to it. Such an ID can be inserted manually in the data node when I double click on it. In this case, however, I did not find enough proof or an ID to verify my findings. These are the websites where I found information about it, which I can discuss with the rest of the group to come to a definite conclusion:

http://www.genome.jp/dbget-bin/www_bget?reaction+R00385+R01278+R03776+R03856+R03989+R04753+R06985
http://www.jbc.org/content/271/30/17816.long

Another problem is mistyping, as I mentioned earlier. For example this metabolite, myrisoyl-CoA:

It is spelled wrongly; it should be Myristoyl-CoA. After correcting the name and looking it up on HMDB, I found that it was not a primary name; the primary name is Tetradecanoyl-CoA. Putting that in the annotation search engine, I was able to annotate the metabolite:

I continued this process until I had annotated all the metabolites. The ones that I could not annotate I left as they were.

Wednesday 24 April 2013

Making a Turtle RDF with rdflib

RDF with Saxophone as topic

So the first thing I did today was finalize my RDF script. It is now a fully working command-line script to make an RDF. Source:

Only a few things are hard coded: the name of the output file and the formats the script parses and writes. Note that the script parses the file as N3 but serializes it as Turtle. This is no problem: there is no Turtle parser, and since Turtle is a subset of N3, the N3 parser can read Turtle too.

Before I started writing the RDF I needed to find out which vocabularies I can use to define my predicates. Doing research on this, I came upon a problem: there seemed to be no easier way of searching for predicates than going to the individual vocabularies and looking for the predicates required. But just recently I stumbled on this website (http://lov.okfn.org/dataset/lov/). It seems you can type the predicate you want in the search engine and it will find the namespace for it, which is exactly what I needed. I haven't fully tested this website yet, but it looks like a nice start. Here is another website I found useful for finding namespaces (no properties attached): http://prefix.cc/

Another issue that I wasn't too sure of is whether reversed statements should be included in the RDF too. An example of what I mean:

normal: Saxophone maker Adolphe_Sax
reversed: Adolphe_Sax made Saxophone
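If reversed statements are wanted, they can be generated from the normal ones instead of being typed by hand. The sketch below uses plain tuples rather than rdflib, and the mapping from "maker" to "made" is just the pair from the example above; which predicates really are inverses of each other has to be looked up in the vocabulary in question.

```python
# Hypothetical inverse pairs, taken from the example above.
INVERSE = {"maker": "made"}


def with_inverses(triples, inverse=INVERSE):
    """Return the triples plus a reversed triple for every known inverse."""
    result = list(triples)
    for subject, predicate, obj in triples:
        if predicate in inverse:
            reversed_triple = (obj, inverse[predicate], subject)
            # Avoid adding the same reversed statement twice.
            if reversed_triple not in result:
                result.append(reversed_triple)
    return result


for triple in with_inverses([("Saxophone", "maker", "Adolphe_Sax")]):
    print(triple)
```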

Aside from the predicates, the subject has to be a link that identifies the thing itself, not a web page about it. For example, we take the dbpedia resource for Saxophone and not the Wikipedia page about the saxophone. This is required because dbpedia has further links that connect to other web pages and libraries.
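As an illustration of going from a Wikipedia page to the matching dbpedia link, here is a small sketch. It assumes the usual convention that an English Wikipedia article title is reused as the last part of the dbpedia resource URI; the function name is made up for this example and other language editions are deliberately rejected.

```python
from urllib.parse import urlparse


def wikipedia_to_dbpedia(url):
    """Map an English Wikipedia article URL to a dbpedia resource URI."""
    parsed = urlparse(url)
    if parsed.netloc != "en.wikipedia.org" or not parsed.path.startswith("/wiki/"):
        raise ValueError("not an English Wikipedia article URL: %s" % url)
    # The article title follows '/wiki/' and is kept exactly as it is.
    title = parsed.path[len("/wiki/"):]
    return "http://dbpedia.org/resource/" + title


print(wikipedia_to_dbpedia("http://en.wikipedia.org/wiki/Saxophone"))
```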

The object can be either a link or a literal.

To use the Python script to create your own RDF you have to supply a "subject-predicate-object" triple. Entering the following on the command line adds the triple to the RDF in Turtle format.

python <name of script> <subject> <predicate> <object>
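The script itself is not shown here, so the following is only a sketch of how such a command line could be read, using Python's standard argparse module. The function name `parse_triple` is an assumption; the real script may well read its arguments differently.

```python
import argparse


def parse_triple(argv):
    """Read <subject> <predicate> <object> from a list of arguments."""
    parser = argparse.ArgumentParser(
        description="Add one subject-predicate-object triple to the RDF.")
    parser.add_argument("subject")
    parser.add_argument("predicate")
    parser.add_argument("object")
    args = parser.parse_args(argv)
    return (args.subject, args.predicate, args.object)


# In a real script the list would be sys.argv[1:]; a fixed list is used here
# so that the example runs on its own.
print(parse_triple([
    "http://dbpedia.org/resource/Saxophone",
    "http://xmlns.com/foaf/0.1/name",
    "Saxophone",
]))
```

A benefit of argparse over indexing the argument list by hand is that a missing argument gives a usage message instead of an IndexError.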

Below you see the beginning of the RDF in Turtle format:

@prefix : <file:///home/cizaralmalak/Desktop/Scripts/Saxophone.ttl#>.
@prefix _4: <http://dbpedia.org/resource/>.
@prefix _5: <http://purl.org/ontology/mo/>.
@prefix _6: <http://xmlns.com/foaf/0.1/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix xml: <http://www.w3.org/XML/1998/namespace>.

_4:Saxophone a _5:instrument;
_6:maker _4:Adolphe_Sax;
_6:name "Saxophone".

_4:Adolphe_Sax _6:age "79";
<http://xmlns.com/foaf/0.1/familyName> "Sax";
_6:firstName "Adolphe";
_6:gender "male";
_6:name "Adolphe Sax".

Using rdflib, this format can be changed into any other RDF format quite easily, by changing the graph.serialize format to any of the preferred formats. But for the next run the parser has to be changed too, to be able to read the new format.

Tuesday 23 April 2013

Corrections to the NAD pathway & bug report & some predicate information

Correcting some annotations in the NAD pathway:

Wikipathways is an open, public platform dedicated to the curation of biological pathways.
Some of the annotations were Gene Ontology IDs. These are not preferred in Wikipathways (and neither are EC numbers). If the search function does not find a gene when you try to annotate it, this can have a few causes:

1. The gene is not located in the databases incorporated in Wikipathways.
2. The gene name that is used is not the primary name of the gene (only primary names can be searched).
3. The name was not typed correctly.
4. Or the gene does not exist, in which case it cannot be annotated.

In the case of point 1, 2 or 3 extra research is required. I will show a small example about how I tackled these points.

First we go to the Wikipathways webpage: http://wikipathways.org/index.php/WikiPathways.
From there I entered the name of my pathway in the search engine.

After we entered the name and pushed "Search" we can add the species to narrow down the search.

Now I choose the NAD pathway that has to be adjusted (NAD salvage pathway I). After selecting the pathway you can edit it by pushing the "Edit pathway" button on the lower left of the pathway screen. Now I double click on the data node that I want to adjust.

In the meantime I go to the original source from which I took the pathway. In this case it was: http://biocyc.org/ECOLI/NEW-IMAGE?type=PATHWAY&object=PYRIDNUCSAL-PWY&detail-level=2. Here I can see what the name of the gene/enzyme/protein was before I annotated it. In this case it was also "NAD+ diphosphatase". Keep in mind that the reason I am re-annotating this enzyme is that GO is not preferred in Wikipathways, because more than 1 gene can be linked to one GO term.
So the first step is to try to find this enzyme/gene product in a different database like Ensembl (Ensembl Bacteria), UniProt etc. If, as in this case, we do not find a hit, we check whether the name is a primary name. This can be done by looking at the synonyms and seeing whether one of them gives a hit. The synonyms are mostly listed on the same website where you found the gene in question. Otherwise you can google the current name and see whether alternative hits are found and whether the synonyms are mentioned on that website.

In this case I was able to find a hit in UniProt with "NADH pyrophosphatase", and since the entry mentions the gene name associated with the enzyme, I can use that gene name to annotate the enzyme in Wikipathways. With the gene name I was able to annotate the enzyme automatically. If the gene name had not been found, I could have entered the UniProt ID manually together with the specific enzyme/gene name.

Now that I have annotated one of the data nodes that needed to be adjusted I can start a new one.

Abnormality in Wikipathways

There seems to be some kind of bug in Wikipathways. Taking the example above, the Ensembl identifier found for nudC does not exist when you go to the Ensembl website. Since this is a bacterium, I went to "Ensembl Bacteria", a separate Ensembl website for bacteria. There I was able to find the gene but not the identifier; the identifier used in Ensembl Bacteria seems to be completely different from the one annotated in Wikipathways:

Wikipathway ensembl identifier for nudC: EBESCG00000003522

Ensembl bacteria identifier:

I reported this bug on Wikipathway-discuss; we will soon see what the cause of it is.

Interface RDF builder

I have been working on building the RDF from the command line, but I seem to have come to a halt because of a certain aspect of rdflib. I need to be able to switch between "Literal" arguments and URIs easily, without going into the code or writing out every possible combination of the 3 arguments. Currently I have working code that does the job, but it is not command based yet. I will be using:

import sys
# sys.argv[0] is the script name; the arguments start at index 1
subject, predicate, obj = sys.argv[1:4]

This will allow me to take commands from a command-line interface. But the problem explained above can cause trouble if it is not handled correctly.
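One possible way around the Literal-versus-URI problem is to let the script decide per argument instead of hard coding every combination. The sketch below is a heuristic, not part of rdflib: it treats anything that parses as an http(s) URL as a URI and everything else as a literal. The function names are made up for this example. In rdflib the two kinds would then be wrapped in its `URIRef` and `Literal` classes.

```python
from urllib.parse import urlparse


def is_uri(text):
    """Heuristic: an http(s) URL with a host name counts as a URI."""
    parsed = urlparse(text)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def classify(arguments):
    """Label every command-line argument as 'uri' or 'literal'."""
    return [("uri" if is_uri(arg) else "literal", arg) for arg in arguments]


for kind, value in classify([
    "http://dbpedia.org/resource/Adolphe_Sax",
    "http://xmlns.com/foaf/0.1/name",
    "Adolphe Sax",
]):
    print(kind, value)
```

The heuristic fails for a literal that happens to be a URL (for example a homepage stored as text), so an explicit flag per argument would be the safer design if that case matters.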

The good thing about rdflib is that it does not seem to write duplicate triples, so I don't have to write a separate module to check for them.
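This matches the idea that an RDF graph is a set of triples: adding the same statement twice changes nothing. The sketch below shows the same behaviour with a plain Python set of tuples. It is an analogy only and says nothing about how rdflib stores its triples internally.

```python
def add_triple(store, subject, predicate, obj):
    """Add a triple to a set-based store; return True if it was new."""
    triple = (subject, predicate, obj)
    is_new = triple not in store
    store.add(triple)
    return is_new


store = set()
print(add_triple(store, "Saxophone", "maker", "Adolphe_Sax"))  # True
print(add_triple(store, "Saxophone", "maker", "Adolphe_Sax"))  # False
print(len(store))  # 1
```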

Monday 22 April 2013

Writing RDF using python rdflib & correcting NAD pathways

Correcting and adjusting NAD pathways

The pathways I uploaded on Wikipathways last week still needed some small corrections. There were 2 main issues: the metabolites that go into or out of a reaction with the main metabolites have to be grouped together, and the data nodes have to be aligned correctly.

So this is how to tackle both of the problems. First we need to select the non-grouped metabolites that belong to each other:

Here I push “Ctrl+g” to group the selected items together. Now that they are grouped you need to move the arrow from where it is bound to one metabolite to the box around the group:

Now that this is done we can start aligning the individual metabolites. This can be done by selecting all the metabolites inside the box and going to the top menu bar of pathvisio or Wikipathways (right side).

Using these 6 alignment buttons, I made my data nodes more readable (I mainly used the 1st and 3rd one). Keep in mind that this will adjust the data nodes to the largest data node:

Now we have a nice group of metabolites that belong or come from the same reaction and align correctly with each other so it’s more readable for everyone.

Using python rdflib

I am currently trying to understand the different formats for RDF, and one way to do this is by making an RDF myself. Since I am more practiced with Python, I’ll be using rdflib, a Python module, to build my own RDF about the saxophone (the instrument I used to play).

So first of all I am trying to get the information from dbpedia and store it in a specific format (either n-triples or turtle). After a few tutorials:

http://semanticweb.org/wiki/Getting_data_from_the_Semantic_Web; http://www.michelepasin.org/blog/2011/07/18/inspecting-an-ontology-with-rdflib/.

and some trial and error I finally managed to download the dbpedia data about the Saxophone and convert it into the required format. I know you can also get any RDF format from dbpedia directly.

Here you have to make sure that the dbpedia link says “resource”, not “page”, which is the link that the “resource” link redirects to. The format can also be changed to any of the required formats:

N-triples = nt

Turtle = turtle

Notation 3 = n3

RDF/XML = rdf+xml

N-quads = nq
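Going back to the “resource” versus “page” point above: a browser address copied from dbpedia usually contains “page”, so it has to be turned back into the “resource” form before it is used in a triple. A small sketch, assuming the two URL prefixes shown in the code (the function name is made up for this example):

```python
PAGE_PREFIX = "http://dbpedia.org/page/"
RESOURCE_PREFIX = "http://dbpedia.org/resource/"


def dbpedia_resource_uri(url):
    """Rewrite a dbpedia 'page' link to the matching 'resource' link."""
    if url.startswith(PAGE_PREFIX):
        return RESOURCE_PREFIX + url[len(PAGE_PREFIX):]
    # Anything else, including links that already say 'resource', is kept.
    return url


print(dbpedia_resource_uri("http://dbpedia.org/page/Saxophone"))
```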

So now I know how to convert one RDF format to another. Next I need to make my own RDF by inserting my own subjects, predicates and objects.

This can be achieved by using “graph.add((sub,pre,obj))”. Again, here is a useful manual about rdflib: http://rdflib.readthedocs.org/en/3.2.0/gettingstarted.html. Though honestly it is not that clear in the beginning. In the next example I made 2 triples and wrote them out in the nt format:

Here again I can change the format to any of the formats mentioned above. Below is an example of what the “nt” format looks like:

<http://rdflib.net/test/Saxophone> <http://rdflib.net/test/invented_by> "mr. Sax".

<http://rdflib.net/test/Saxophone> <http://rdflib.net/test/is_a> "instrument".
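To make the structure of these lines explicit, here is a sketch that builds such a line by hand, without rdflib: URIs go between angle brackets, literals between double quotes, and every statement ends with a dot. The helper names are made up for this example and only the most common escapes inside literals are handled.

```python
def nt_term(value, is_literal):
    """Format one term: "..." for a literal, <...> for a URI."""
    if is_literal:
        escaped = (value.replace("\\", "\\\\")
                        .replace('"', '\\"')
                        .replace("\n", "\\n"))
        return '"%s"' % escaped
    return "<%s>" % value


def nt_line(subject, predicate, obj, obj_is_literal=True):
    """Format one triple as a single N-Triples line."""
    return "%s %s %s ." % (
        nt_term(subject, False),
        nt_term(predicate, False),
        nt_term(obj, obj_is_literal),
    )


BASE = "http://rdflib.net/test/"
print(nt_line(BASE + "Saxophone", BASE + "invented_by", "mr. Sax"))
print(nt_line(BASE + "Saxophone", BASE + "is_a", "instrument"))
```

In practice the serializer of rdflib does this work; the sketch only shows what ends up on each line.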

Now all I have to do is make a command-based interactive script that allows me to enter the information I want manually and add it to the RDF file.
Since this was a test I didn't pay much attention to the predicates. But I am currently trying to understand which predicates are supposed to be used when. A few websites I am currently studying:
http://xmlns.com/foaf/spec/
http://www.heppnetz.de/grprofile/
http://www.w3.org/TR/vcard-rdf/
And other related subjects like: rdf, rdfs, dc, dcterms etc.

Introduction & Assignment: building a pathway in pathvisio & importing it onto Wikipathways

Hi everyone,

I am Cizar Almalak and a few weeks back I started my internship here at Maastricht University. I chose to do my internship here in Maastricht because I was quite interested in the Semantic Web, which is a big topic here. Through Chris Evelo I came in contact with Andra Waagmeester, who is currently my supervisor.

The goal of my project consists of multiple steps. First I have to understand how Wikipathways works and the concept of the Semantic Web. The second step is to write a script that can convert RDF into GPML, and the third is to write a script that can convert one RDF format into another RDF format.

So to understand how Wikipathways works, I have to practice by making some pathways myself, starting with the pathway tutorial on Wikipathways: http://wikipathways.org/index.php/Help:Tutorial.

After doing this tutorial, it was time to take a real pathway from a database other than Wikipathways and see whether I could reproduce it for Wikipathways. Since this was my first time making a pathway for Wikipathways, instead of putting the pathway directly on Wikipathways I made it with pathvisio, which has the same interface as Wikipathways and can export the data in GPML, which Wikipathways can read.

So in the following example I’ll be showing how I did the NAD biosynthesis pathway in E.coli. I got the pathway from Biocyc (source: http://biocyc.org/ECOLI/NEW-IMAGE?type=PATHWAY&object=PYRIDNUCSYN-PWY).

First I got the latest version of pathvisio (3).

So the first step once you have pathvisio is to download the required data from Wikipathways about the organism you are using (gene and metabolite databases). I achieved this by going to the following link: http://www.pathvisio.org/downloads/download-bridgedbs/ or by going to www.pathvisio.org >> upper menu bar >> downloads >> download mapping databases:

Then I selected the species that I would like to use (E.coli for this example). For metabolites there is only a human one, which you can use for every species.

After I downloaded the required databases I needed to upload them in pathvisio. This can be done by going to the upper panel of pathvisio and choosing: Data >> Select gene database and Data >> Select metabolite database.

So now that pathvisio is ready, I can start with the pathway. Before starting, I went to the box that says “title” in the upper left corner and double clicked it. A menu will pop up.

At the top of the pop-up you will see 3 options: Comments, Literature and Properties. In Properties I put the title and the organism I’m working on.

In the Comments menu I put the source from which I took the pathway.

In the Literature (reference) menu I put all the references that were associated with this pathway. I did this by pushing the new reference button below, which opens another pop-up menu:

By entering the pubmed id of the article and pushing the button on the right (Query PubMed), all the required information from the publication will be uploaded correctly.

Now that I have done this, I started building the pathway, based on the BioCyc NAD biosynthesis pathway of E.coli.

I started first creating all proteins/enzymes/metabolites data nodes without the interactions. For proteins and enzymes I used the gene data nodes and for the metabolites the metabolite data node. These data nodes can be found under “Objects” on the right side panel of pathvisio:

Just creating the data nodes is not enough; each protein/enzyme/metabolite has to be annotated. This is the tricky part, since some of them have synonyms and Wikipathways, like pathvisio, only recognises the primary name. So extra research is required to find the primary name. First I double click on one of the data nodes and a data node property list will pop up:

In this pop-up menu I entered the name of the protein/enzyme/metabolite in question in the search field and pushed search. If there is no hit, it means that this protein/enzyme/metabolite is not annotated or that the name is not its primary name. In this case it was the primary name of the metabolite, and I got another pop-up menu that shows all the hits that were found:

So here I chose the metabolite in question; all other hits are similar but not the metabolite I was searching for. After I made the selection I pressed “ok” and the metabolite in question was annotated based on the database I selected:

Any extra information or reference about the annotated protein/enzyme/metabolite can be added in the Comments or Literature tab on top of the data node property pop-up.

So when I had all the data nodes annotated, it was time to add the interactions between each data node. Again like the data nodes the interactions can be found on the right side panel under the data nodes section “Basic interactions” & “MIM interactions”:

I used “Basic interactions” when the type of interaction is not clear or not in the “MIM interactions” list, and “MIM interactions” when it is clearly stated what kind of interaction is involved between the 2 products. A few problems occurred while I was building this pathway. One is that when a product interacts with the main product, you have to use 2 different lines, for example:

The circles mark 2 separate lines binding to a single place, to illustrate the involvement of the metabolite oxygen with L-Aspartic acid and the side products hydrogen peroxide and H+. The second problem is that sometimes the program bugs and all data nodes and interactions scatter. This happens mainly when you try to move all of them at once.

Now that I have done all the interactions I am basically done; all I have to do now is put them in a nice order to make the pathway more readable.

Using the top menu bar of pathvisio, select File >> Export:

A pop-up menu will appear. Using this menu, you can give a name to the file in the “File Name” field and choose a format in the “Files of Type:” field. In this case I needed the gpml format, to upload to Wikipathways.

After exporting the file in gpml format I can upload it to Wikipathways. So first I went to the Wikipathways website (http://wikipathways.org/index.php/WikiPathways) and chose create from the left side panel (circled).

In the new window I can choose to create a new pathway or upload a pathway in a certain format.

After I uploaded the pathway, everyone can see it on Wikipathways and make adjustments to it. Once the pathway was uploaded, I also added a description, so that people know what the pathway is about.

The description can be added under the header description under the pathway:
Also more information can be added to the pathway under different headers, if the information is available.