Blog

Some lobid.org stats

We've made some statistics reflecting the usage of lobid.org:

Web access of lobid.org

The following graphic shows web server access for these resources:

  1. titles, example
  2. organisations, example
  3. items, example
  4. SPARQL queries

Remarkable:

  • Title access in 2012-06 - more than 100k hits coming from Korea Advanced Institute of Science
  • SPARQL queries in 2012-08: more than 3M coming from our own department, "eating our own dogfood"

Downloads

The following graphic shows the downloads of the tar.bz2 dumps (of all the subdirectories) of lobid.org. The "updates" are not taken into account because they consist of many small files (OK, these could be interesting too, and will surely be provided later; but since these updates have existed for less than a month, there would not be much to see yet).

Remarkable: downloads in 2012-03 suddenly increased, with around 700 downloads coming from only one IP. The same is true for the following dates: there are around 40 to 60 different IPs, some of them downloading more than once a day (which doesn't make sense, since these dumps hardly change at all, maybe three times a year). So, yes, there is a need for a sitemap.xml in which we could state whether a file was updated (and where these files reside, because these URLs sometimes change - yes, this is not good practice, and we try to keep these URLs stable).

Our data-dump updates broke some time ago - we have just fixed them.

There are now two daily update dumps [1][2]:

1. rdfmab[3] - pseudo-LOD, but with all the catalog data (aka "raw data").
2. lod[4] - true LOD. That's the basis for lobid.org.

What's missing, though, is a sitemap for the dumps. Such a sitemap would make it easier to consume the updates automatically, because the filenames may change - have a look at b3kat.

[1]http://datahub.io/dataset/hbz_unioncatalog/resource/7168ca3e-2528-4d78-b9ec-678679611aa6
[2]http://datahub.io/en/dataset/lobid-resources/resource/1919f1ca-ed2a-427a-bceb-3cf624fd1379
[3]http://datahub.io/dataset/hbz_unioncatalog
[4]http://datahub.io/en/dataset/lobid-resources

We substituted the outdated RHEL 5.4 with a fresh Ubuntu. Thus, it was easy to install a new 4store 1.1.5, and our test server now also provides SPARQL 1.1. This was a must-have: when reindexing we switch to this server to avoid downtime, but so far we could only provide the old SPARQL 1.0 with the old 4store (which, in turn, is not enough for projects which make use of lobid, like e.g. LODUM).

Back-linked to LODUM

The LODUM project linked to our lobid-resources service. Now we could query their database with SPARQL to get the owl:sameAs links to our IDs and index them into our endpoint. Have a look at http://lobid.org/resource/HT016649061 (click on "AKA").
Thus, we gained more than 30k links, providing many new abstracts.

Some more details

This is the script, using SPARQL and further filtering to get the data we need:

curl -G --data-urlencode "query=CONSTRUCT {?a <http://www.w3.org/2002/07/owl#sameAs> ?c}
  FROM <http://data.uni-muenster.de/context/cris/publication-hbz-relations/>
  WHERE {?a <http://www.w3.org/2002/07/owl#sameAs> ?c}
  " http://data.uni-muenster.de/sparql | tr -d '\t' | grep '^<http://lobid.org' > owl_sameAs_lodum.ttl
rapper -i turtle owl_sameAs_lodum.ttl -o ntriples | grep muenster > owl_sameAs_lodum.nt

Filtering was necessary because LODUM makes some false owl:sameAs assumptions between lobid resources: these resources rather point to the same frbr:Work, not the same frbr:Manifestation.
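Once filtered, the resulting N-Triples can be loaded into the triple store. A minimal sketch, assuming 4store's command line tools and a knowledge base named "lobid" (the post doesn't say how the data was actually indexed):

# load the filtered owl:sameAs links into the 4store knowledge base "lobid"
4s-import --verbose --format ntriples lobid owl_sameAs_lodum.nt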

What would be nice: if the LODUM data were Open Data, we could index the abstracts into our search engine. I have already asked for an Open Data license, and LODUM is considering it. Let's hope and see.

Internet Explorer 8 cannot handle the XHTML content type, so we switched back to "text/html". IE8 also has some JavaScript problems and could not display the OpenStreetMap for lobid-organisations. As long as there are still people out there using this crappy IE8 browser, we put some effort into the HTML so that even IE8 users can browse lobid.org.
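To check which content type lobid.org actually serves to a browser, a quick look at the response headers is enough; a minimal sketch (the organisation URI is just an example):

# show the Content-Type header lobid.org sends when HTML is requested
curl -sI -H "Accept: text/html" "http://lobid.org/organisation/DE-605" | grep -i '^Content-Type'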

Issues-per-year diagram

Reading the blog post by Ed Summers about LCNAF Record Creation, I decided to also produce a quick statistic about our dataset using bash:

curl -O http://lobid.org/download/dumps/DE-605/hbzlod.nt.tar.bz2
tar xjf hbzlod.nt.tar.bz2
grep 'issued> "' lobid-resource_base_[0123456789]*.nt | cut -d '"' -f2 | sed 's#[^0-9]*\([0-9]*\).*#\1#g' | sort | uniq -c | sed -e 's#^\ *##g' > issues_aggregation.txt

The resources described in the hbz Open Data dump come from a subset of the German university libraries and are mostly in German.

I sorted out some obviously wrong entries and imported the resulting data into OpenOffice to render a diagram with data from 1840 to 2011:

Interesting: around 1944 there are fewer publications than at any time during World War I, and approximately as many as in 1890.

Digging through our Open Data, among other Open Data sources, findbuch has set up an app where you can ask whether there is a table of contents for a given ISBN.
For more background, read infobib.de (German).

Substantial expansion of the database

The dataset now comprises more than 15.5 million records published under a Creative Commons Zero license, that is 85% of the whole hbz union catalog. The conversion to RDF yields 664 million triples. Excluded from the open data are records that exclusively have holdings at the Universitäts- und Landesbibliothek Düsseldorf and/or the Paderborn University Library - both of which decided not to cooperate with the hbz regarding open data. Also, metadata describing resources licensed nation-wide in the Nationallizenzen project isn't included, due to the "Grundsätze für den Erwerb DFG-geförderter überregionaler Lizenzen" (pdf), which impede the publication of this metadata under an open license. (See the legal Open Data guide by lawyer Till Kreutzer (pdf), p. 22 f.)

Due to this growth and the fact that big Turtle files can't be parsed easily (see e.g. here), we switched to providing N-Triples, which are suitable for streaming processing.
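Since each N-Triples statement is a single line, the dump can be processed in a streaming fashion with standard Unix tools. A minimal sketch that counts predicate usage without ever holding the whole dump in memory (the dump URL is the one from the downloads section above):

# stream the dump straight out of the tar.bz2 archive; the predicate is the second
# whitespace-separated token of every N-Triples line
curl -s http://lobid.org/download/dumps/DE-605/hbzlod.nt.tar.bz2 \
  | tar xjOf - \
  | awk '{ print $2 }' \
  | sort | uniq -c | sort -rn | head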

Other changes

(Nearly) all authority URIs now dereferencing

In the past we had some problems with URIs for persons and corporate bodies linked to a bibliographic resource as quite a lot of these URIs didn't resolve. In the course of the nation-wide switch to the Gemeinsame Normdatei (GND) the regional authority identifiers in the hbz were replaced by nation-wide GND-identifiers. Thus, nearly all authority URIs are now GND-URIs that actually resolve.

Changes in vocabulary usage

Using the tools listed by the Pedantic Web Group at http://pedantic-web.org/tools.html we came across some problems with our data, e.g. we identified incorrect use of the dct:extent and dct:medium properties. To resolve these problems, we switched to isbd:P1053 (has extent) instead of dct:extent, as the former doesn't require an entity in object position. Regarding dct:medium, we worked on providing URIs for the medium specification instead of literals (see here for details).
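A quick way to check whether any of the old modelling is left is to grep the N-Triples dump for the deprecated property. A minimal sketch, using the dump file downloaded above:

# count statements that still use dct:extent with a literal in object position
tar xjOf hbzlod.nt.tar.bz2 | grep -c '<http://purl.org/dc/terms/extent> "'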

Provenance information added to lobid.org

With our recent update of the lobid.org data we also made a first implementation of provenance information. As a start we provide what we believe is the most important provenance information (note that the data is in the dump and queryable via SPARQL but is not yet part of the RDF representations of the dereferenceable URIs). The general approach is illustrated in this figure:

Previously we posted that we had 1,823 links to the German DBpedia in our organisation authority file. We have now taken some more heuristics into account (have a look at the source of the script) and now have only 1,305 links to the German DBpedia. Furthermore, we linked to the international DBpedia, which got us 4,018 more links.
You may download the BEACON files at thedatahub.org.

Preliminary note:
All Google links may lead to different results depending on the user, since Google personalizes its search results.

lobid.org: title records and organisation data in the Google index

Quantity

In recent weeks, Google has crawled lobid.org, the LOD service of the hbz - and part of the 10 million title records (116 thousand) and part of the 44 thousand organisation pages (13 thousand) have now actually ended up in the Google index.
The crawl was probably based on the Open Data dumps of the lobid resources and the lobid organisations.

Example searches

Since the lobid pages are in some cases not ranked highly, here are a few example searches.

Title records: As a demonstration, a domain-restricted search shows that books can be found, for example "Thomas Bernhard: Auslöschung".
Beyond that, there are more specific queries that go beyond the title, e.g. an ISBN-10 (or ISBN-13) search combined with a city, e.g. Duisburg. Here the corresponding lobid page lands in first place out of three hits.

Organisation directory: Anyone looking for the ISIL of a library or a museum can now do so conveniently via Google, e.g. "Hochschulbibliothek der Fachhochschule Aachen" isil. Here, too, there is only a single hit (admittedly, the query is very specific because of the phrasing). Some ISILs can now even be searched for directly in Google, which is very convenient.

Why does Google index the LOD catalogue?

In the past, there have been attempts to turn catalogue pages into HTML and throw them to Google "as fodder" - without success. So why does Google make use of the LOD data now? One can only speculate:

  1. Google uses the data for other things as well. Google has invested massively in Linked (Open) Data and can now easily interlink the Linked Open Data that is available on the web.
  2. With schema.org, Google, Bing, Yahoo and co. have made RDFa and microformats in general palatable to SEOs (search engine optimizers), because the search engines themselves have a taste for RDF - machines want to be fed machine-readable data. Facebook and many other websites use RDFa for this, for example. Since lobid.org provides the essentials, namely HTTP URIs among other things, anyone who publishes structured data on their website via RDFa can now integrate the lobid.org data with a simple link. For example, I embedded a short review of Thomas Bernhard's "Auslöschung - ein Zerfall" into my website: you can see the rating stars integrated into the Google result list, which could only be produced because the data is machine-readable. Incidentally, it is also interesting that I created this web page only a week ago, that the PageRank of this small page is very low, and that the page is nevertheless shown in seventh place despite more than 9,000 hits. Clearly, Google favours Linked Data.
  3. Google has a vital interest in decentralization - Google wants to remain the central entry point for searching across many websites. If all websites were on Facebook (or all books could be found there), Google would no longer be needed. With Linked Open Data, every website can become a database - and these many distributed databases can be integrated into any search engine.
  4. Google can generate nice short result listings because the data is machine-readable. That benefits the users.

... and the benefits for libraries?

The benefits follow from what was mentioned above, because Google does not only take but also gives: first and foremost, increased visibility. Beyond that, quite different side effects can now be expected. While it used to make more sense to post your own reviews on Amazon or the Open Library, this can now also be done on your own website or blog. My data stays with me and, additionally, with anyone who can make use of it. If this is used more widely, a simple bibliometry could be built on top of it: the more links point to a title record, the higher the rank of the resource. (There are surely many more ways in which the "Web of Data" can become a win for everyone...)

On April 16th we posted that we had 139 links to Wikipedia/DBpedia.
Now, 1,823 out of 43,464 lobid organisations have a link to the German Wikipedia and the German DBpedia[1] (see an example). That is an increase of more than 1000%. The computation was simple string matching on the titles of both resources, with the exception that if there was more than one organisation with the same name, those resources were omitted.
Surely there are some mismappings in there. But our belief is that it is better to have a lot of interlinking with a slight increase in wrong mappings than only a few links which are 99.5% perfect.
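A minimal sketch of that kind of matching, assuming two tab-separated label lists - lobid_labels.tsv (ISIL, name) and dbpedia_labels.tsv (name, DBpedia URI). The file names and columns are made up for illustration; the actual script is linked in the newer post above:

# names that occur exactly once among the lobid organisations; everything else is ambiguous
cut -f2 lobid_labels.tsv | sort | uniq -u > unique_names.txt

# keep only the unambiguous lobid records, then join them with the DBpedia labels on the name
awk -F'\t' 'NR==FNR { ok[$0]=1; next } ok[$2] { print $2 "\t" $1 }' unique_names.txt lobid_labels.tsv \
  | sort > lobid_by_name.tsv
sort dbpedia_labels.tsv > dbpedia_by_name.tsv
join -t$'\t' lobid_by_name.tsv dbpedia_by_name.tsv > isil_to_dbpedia.tsv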

Sure, disambiguation through crowdsourcing is the next big thing to come!

[1] You may want to download a BEACON concordance file that reflects the links between ISIL and Wikipedia/DBpedia.

Depending on the triple store, you can do string searching with regular expressions using SPARQL 1.1. But is this a good idea? Given some data (300 M triples), this log entry clearly says "no":

##### 2012-06-04T11:44:08Z Q8316
SELECT  ?s
WHERE
  { ?s <http://purl.org/dc/terms/title> ?o
    FILTER regex(str(?o), "beach", "i")
  }

#### execution time for Q8316: 108.760666s, returned 2815 rows.

So, this simple case-insensitive search on the title field took nearly 2 minutes. It consumes a lot of CPU, memory and other resources. The store in question is the current 4store, and even if 4store could be made better at handling searches, the most important question is:
why should it?
Do what you can do best. Be a nice quad store, allowing SPARQL 1.1 all right, but focus on what the Semantic Web is all about: inferencing, querying graphs. Let search engines do the string searches - we recommend Elasticsearch.

Just for comparison: how fast can we get 2800 URIs in principle?

##### 2012-06-04T12:19:10Z Q8362

prefix dct: <http://purl.org/dc/terms/>
SELECT  ?s WHERE
  { ?s <http://purl.org/dc/terms/title> ?o.
  } limit 2800

#### execution time for Q8362: 0.424803s, returned 2800 rows.

(Granted, this query is the simplest one that can be formulated, but it still says something about the relative time consumption.)

And here is the elasticsearch query:

$ curl "http://$ip:9200/_search?q=beach&from=0&size=2800" > tmp_es.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15.2M  100 15.2M    0     0  3269k      0  0:00:04  0:00:04 --:--:-- 4133k

Even though the result contains not just one URI per document but the whole data, it is much faster (25 times) than the SPARQL query with its 110 kB result.
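For completeness: getting the titles into Elasticsearch in the first place can be done with a small bulk-indexing pipeline. A rough sketch, assuming the dump files from above; the index and type names are made up, and proper JSON escaping of the title literals is glossed over:

# turn dct:title literals from the dump into Elasticsearch bulk actions and index them
grep '<http://purl.org/dc/terms/title>' lobid-resource_base_*.nt \
  | cut -d '"' -f2 \
  | awk '{ print "{\"index\":{}}"; print "{\"title\":\"" $0 "\"}" }' > bulk.json
curl -s -XPOST "http://$ip:9200/lobid/resource/_bulk" --data-binary @bulk.json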

Update 29 May 2012: While discussing my DAIA implementation with Jakob, we found some problems. Thus, I corrected the text and RDF below, aligning it with the usage in Jakob's and my presentation at the German Bibliothekartag (see slides). - Adrian

I wrote a post in Übertext: Blog about describing items, services and organisations in RDF using the DAIA ontology and other vocabularies. I'd like to provide a concrete example here of how to describe items and services and link them together using DAIA. This example is based on data in the LOD service lobid.org (some of it fed into the triple store specifically for this example).

Assume I am writing my PhD thesis somewhere in Bochum, and it is 6 pm on a Saturday. In the bibliography of some article I find a reference to a book that seems quite relevant, "System und Performanz" by Christian Stetter. Thus, I want to borrow a copy of this book.

As I don't want to travel more than 5 kilometres to obtain the item, I do a geo-based search in lobid.org (see the underlying SPARQL query here) and get back one item. The item's RDF description looks like this (in Turtle syntax):

 @prefix daia: <http://purl.org/ontology/daia/> .
 @prefix frbr: <http://purl.org/vocab/frbr/core#>.
 @prefix foaf: <http://xmlns.com/foaf/0.1/> .

 <http://lobid.org/item/HT014576567%3AHWB25011>
   daia:label "HWB25011" ;
   frbr:exemplarOf <http://lobid.org/resource/HT014576567> ;
   frbr:owner <http://lobid.org/organisation/DE-294> ;       # also possible is "daia:heldBy"
   daia:storage <http://lobid.org/service/DE-294-servicetheke> ;
   a frbr:Item ;
   foaf:isPrimaryTopicOf <https://opac.ub.ruhr-uni-bochum.de/webOPACClient/start.do?Language=De&Query=010%3D%22HT014576567%22> .

 <http://lobid.org/organisation/DE-294>
   a foaf:Organization ;
   foaf:name "Ruhr-Universität Bochum, Universitätsbibliothek" .

OK, I can see that this item is linked to the University Library of Bochum as holding institution. But there are several questions now:

  1. Is the item currently available or is it already lent by another user?
  2. Can I get access to the item right now? Where and how?

Current Availability

Question number one can currently be answered by clicking on the "Weitere Informationen" link that takes you to the library's OPAC description of the item. If the library provided a DAIA server, we could request the item's status and get information in RDF looking like this:

@prefix daia: <http://purl.org/ontology/daia/> .
@prefix daiaserv: <http://purl.org/ontology/daia/Service/> .
@prefix frbr: <http://purl.org/vocab/frbr/core#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://lobid.org/item/HT014576567%3AHWB25011>
   a frbr:Item ;
   daia:unavailableFor [
       a daiaserv:Loan ;
       daia:expected "2012-06-19"^^xsd:date ;
       daia:queue "0"^^xsd:nonNegativeInteger
   ] .

Or - if available:

   daia:availableFor [
       a daiaserv:Loan
   ] .

Related Service and Opening Hours

To get an answer to question #2, I could follow the link to the library's website and look at the opening hours. That's the usual way to do it, I think. But fortunately, in this case the item is linked (using the property daia:storage) to a specific service where it is generally available, and the service's opening hours are specified in RDF. The RDF looks like this:

 @prefix gr: <http://purl.org/goodrelations/v1#> .
 @prefix dcmitype: <http://purl.org/dc/dcmitype/> .
 @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
 @prefix daia: <http://purl.org/ontology/daia/> .
 @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .


 <http://lobid.org/service/DE-294-servicetheke>
   a dcmitype:Service ;
   rdfs:label "Servicetheke der Universitätsbibliothek der Ruhr-Universität Bochum"@de ;
   daia:providedBy <http://lobid.org/organisation/DE-294> ;
   gr:hasOpeningHoursSpecification [
   a gr:OpeningHoursSpecification ;
   gr:opens "08:00:00"^^xsd:time ;
   gr:closes "24:00:00"^^xsd:time ;
   gr:hasOpeningHoursDayOfWeek gr:Monday ;
   gr:hasOpeningHoursDayOfWeek gr:Tuesday ;
   gr:hasOpeningHoursDayOfWeek gr:Wednesday ;
   gr:hasOpeningHoursDayOfWeek gr:Thursday ;
   gr:hasOpeningHoursDayOfWeek gr:Friday
   ] ,
   [
   a gr:OpeningHoursSpecification ;
   gr:opens "11:00:00"^^xsd:time ;
   gr:closes "20:00:00"^^xsd:time ;
   gr:hasOpeningHoursDayOfWeek gr:Saturday
   ] ,
   [
   a gr:OpeningHoursSpecification ;
   gr:opens "11:00:00"^^xsd:time ;
   gr:closes "18:00:00"^^xsd:time ;
   gr:hasOpeningHoursDayOfWeek gr:Sunday
   ] .

I can now see immediately that I can obtain the item I need, and that I have four hours left to do it.

Different items, different services

This is only an example. In the same way, an exemplar of an encyclopedia volume that can't be borrowed (like this one) would be linked - using daia:storage - to the corresponding reading room where it is generally available. The item would then be indicated as available for <http://purl.org/ontology/daia/Service/Presentation>.

Problems and Perspectives

This is just a proof of concept. There are several things missing here that should be added to provide rich end-user services:

  • Obviously, you would have to improve the user interface to make this interesting for end-users.
  • It would be nice to show the current availability (a kind of green or red light) next to the item information, instead of the user having to follow the link to the OPAC. That's where a DAIA implementation in library systems would come in handy, so that I could query the local database on the fly and find out whether the item is available for lending or already checked out by another user and - if so - when it will be available again.
  • It is also important to add user-relevant information about a service, like address, geo coordinates, contact details (person, phone, email) etc., so that users can easily find the service and ask somebody questions about it.
  • Thinking even further, users would have their own library account information publicly or privately represented in RDF and would be associated with specific user groups with specific rights. As most services are only available to specific user groups, search results could be customized to user profiles.

Presentation

Have a look at the slides (in German) from Bibliothekartag 2012 where Jakob and I presented the benefits of publishing and integrating holding and availability information in RDF:

Linking lobid to Open Library works

Crowdsourcing is gaining ever more importance. E.g., look at what Open Library has achieved so far with their community-driven approach and their claim of "one web page for every book". It is pretty amazing.
So the question is: can libraries benefit from (better: partake in) that process?

The answer is of course "yes", and nothing more is necessary than to link our datasets. Using SILK and just matching ISBN-10s, we managed to link around 1.2 M lobid resources (of about 10 M) to 1 M Open Library works[1] - see an example.

We used the property rdrel:workManifested, which Open Library itself uses to link from manifestations to works (deliberately bridging the "expression" level):

<http://rdvocab.info/RDARelationshipsWEMI/workManifested>

Btw, already two years ago Oliver Flimm published a presentation on Slideshare showing how lobid resources could be linked to Open Library (German).

We could generate the links because Open Library provides their data as Open Data: the JSON dump could be transformed into RDF and imported into our triple store to give SILK the SPARQL endpoint it needs (a rough sketch of such a conversion follows below).
We potentially gained a lot of covers, excerpts, tables of contents etc. and many links to LibraryThing with lots of tags; there are also a lot of links to Goodreads with tons of reviews and ratings... and so forth.
So this brings us a lot more data which we can partly (a question of license) use to enrich our frontends, search engine and SPARQL endpoint.
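As for turning the Open Library JSON dump into triples: one possible sketch, pulling ISBN-10s out of the editions dump and emitting simple N-Triples that SILK (or the triple store) can work with. The dump file name, the bibo:isbn10 property and the edition level are assumptions for illustration - the conversion we actually used is not shown here:

# the Open Library dump is tab-separated with the JSON record in the fifth column;
# emit one triple per ISBN-10 found in an edition record
zcat ol_dump_editions_latest.txt.gz \
  | cut -f5 \
  | jq -r 'select(.isbn_10) | .key as $k | .isbn_10[]
           | "<http://openlibrary.org\($k)> <http://purl.org/ontology/bibo/isbn10> \"\(.)\" ."' \
  > openlibrary_isbn10.nt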

And even vice versa, our data may be a gain for Open Library, too. E.g., many resources seem to lack identifiers for their authors, while we use the German GND authority file, which has many links to e.g. VIAF. Since our data is also Open Data, you can just download the lobid data dump and copy the data you need into yours.

A remarkable example

Der Herr der Ringe (German for: The Lord of the Rings) links to Open Library. (There seems to be a problem with that work level, because only one edition is associated with it - there should be more[1].) The resource is linked to Goodreads. There are no less than 160,245 ratings and more than 4,000 reviews of that book :O - and have a look at the discussion about it. The LibraryThing data is also very interesting, with its tag cloud and recommendations.

Much to discover!

The next big step will be: integrating all this data into our enduser application.

[1] It seems unlikely that there are so many different works; one would expect a ratio of at least 1/3.