Tuesday, 7 May 2013

Working through the IBM tutorial on RDF and using SPARQL

IBM tutorial

IBM hosts a nice tutorial, "Introduction to Jena" (https://www.ibm.com/developerworks/xml/library/j-jena/), which my supervisor Andra Waagmeester pointed out to me. It teaches how to use Jena to create an RDF model and how to query it with SPARQL. The tutorial is very clear and helpful, but since I am no expert in Java and quite good at Python, I will try to translate its code into Python. To do this I will be using the Python RDFlib library.

I am doing this to give a clear overview of what my code does later on, since most people in this department do not have the Python expertise that I do. The translation gives them a template for understanding my scripts. Working through the tutorial will also give me more insight and experience for optimizing my scripts.

To replicate the tutorial in Python, I needed to find out which Python libraries allow me to build RDF models. As I mentioned in earlier blog posts, RDFlib is the library that lets me do this.

In this blog post I will show the scripts I have created to do the same thing as the Jena tutorial on IBM, but in Python.

For listing 1 of the IBM tutorial I have written an equivalent Python script:

This script creates the separate subjects and predicates and joins them together in one graph as triples. Information can now be extracted from these triples. Listings 2 and 3 introduce a few iterators to query the model from listing 1. Equivalent methods are available in the RDFlib library for Python and are documented here: http://www.rdflib.net/rdflib-2.4.0/html/public/rdflib.Graph.Graph-class.html

The script for listings 2 and 3 in Python can be seen here:


With the subject_objects method of the graph I can find every subject and object connected by a certain predicate. The triples method of the graph lets me find triples that match a pattern exactly. Here I can put None in any position where I don't know the URI or Literal, and it will find the triples based on the one or two values I did enter. So by fixing only the predicate, I can use triples in the same way as subject_objects.

It is also possible to store this graph in a database. Here I followed the example and used a MySQL database. First of all I needed to install MySQL for Python using:

easy_install mysql-python

Once the MySQL server and the mysql-python library are installed, I can use the RDFlib library to connect to a MySQL database or create one. Listing 4 uses a different database that comes with the Jena library; since I am not using Jena, I will use the graph from listing 1 as the example.

With the following code we can open the MySQL database and put the previously created graph in it:

Now that the graph is stored in the database I can read it whenever I like, just by opening the database and then using either SPARQL or the predefined methods of the Graph class, following the IBM examples in listings 2 and 3.

When there is no predefined method I can use, SPARQL is also an option. It is included in the RDFlib package as well. The Python script for it can be seen here:

This SPARQL query finds out who has an uncle and/or an aunt. I explain how SPARQL works in more detail in the next section below.

SPARQL tutorial

SPARQL is the query language used to search RDF stores. Since my project requires me to use a SPARQL endpoint, I need to learn how it works. SPARQL borrows a lot of its syntax from SQL, so many keywords look the same. This is fortunate for me, since my background in SQL makes SPARQL easier to understand.

Just like in SQL, I need a "SELECT" to choose which columns I want to see. Unlike in SQL, these columns do not have to exist beforehand, because RDF stores do not make use of tables. If, for example, I write "SELECT ?child", I am creating a column called "child". The "?" is how SPARQL marks its variables. Now that I have defined what information I want to see, I can try to find it by matching on "subjects", "predicates" or "objects". This is done with "WHERE":

SELECT ?child
WHERE { ?parent <http://example.org/has_child> ?child }

This small query returns the children of every parent. As can be seen, ?parent is used in the WHERE clause but is not returned. This is no problem, since ?parent is just a variable to catch the subjects; it could have been any "?" followed by a word. The only two parts of importance in this query are the predicate (has_child) and the ?child variable that catches the children. Of course, if I wanted the parents too, I could have added ?parent to the SELECT clause.

Predicates such as this one come from the vocabularies used by the store. In this case it was a made-up example, but existing RDF stores use predicates from certain vocabularies to define their data. I explained this subject in a previous blog post.

It is also important to note that URIs have to be written between "<>". And if I am not going to use a value, I do not have to define a variable for it: I can write "[]" instead to indicate that I am not interested in this field. Example:

SELECT ?child
WHERE { [] <http://example.org/has_child> ?child }

This gives me the same result as the first query.

Furthermore, just like in SQL, I can use clauses such as GROUP BY, ORDER BY, DESC, COUNT, etc. Another very important keyword is FILTER. It allows me to restrict the values of a variable, for example to certain words or characters. Example: finding the Finnish word for saxophone:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?fin
WHERE
{
    <http://dbpedia.org/resource/Saxophone> rdfs:label ?fin .
    FILTER ( lang(?fin) = 'fi' )
}
