Linked Open Data @hbz Blog

  2014/03/10
Switch to Metafacture with many data improvements
Last modified by Adrian Pohl, Mar 14, 2014 11:27
Labels: news

Today, we roll out a major change to the data on bibliographic resources we get from the hbz union catalog. We have put a lot of work into this during the last half year. For one thing, the lobid data now complies with the DINI KIM recommendations on publishing title data as RDF (pdf, German).

At the same time, we have switched our entire data transformation workflow from a local tool to the free software Metafacture, which is developed on GitHub by the German National Library (DNB) with support from the hbz and others.

Here are the major changes the new transformation brings with it.

Links to ToC files

We have added links to scanned tables of contents (ToCs) for more than one million documents.
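Such a link might look roughly like the following sketch (the resource ID, the DigiTool URL and the choice of dcterms:tableOfContents as linking property are illustrative assumptions, not taken from the actual data):

    @prefix dcterms: <http://purl.org/dc/terms/> .

    # hypothetical resource and ToC URL, for illustration only
    <http://lobid.org/resource/HT000000000>
        dcterms:tableOfContents <http://digitool.hbz-nrw.de:1801/webclient/DeliveryManager?pid=1234567> .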

The ToC files reside in a DigiTool repository. When you resolve such a URL in a browser, you are directed to a viewer presenting a PDF of the ToC. Using curl, you are redirected to an - unfortunately unstructured - OCR text of the ToC.
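For example (reusing the hypothetical URL from the sketch above):

    # follow the redirects and fetch the OCR text of the ToC
    curl -L "http://digitool.hbz-nrw.de:1801/webclient/DeliveryManager?pid=1234567"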

(See also GitHub issue #307.)

URNs

We now have information on URNs (when a resource has a URN assigned). In the data, we express this both with the property lv:urn, which has the URN as a literal, and by linking directly to the resource using the property umbel:isLike.
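In Turtle, this might look roughly like the following sketch (the resource, the URN and the lv: namespace URI are assumptions for illustration):

    @prefix lv:    <http://purl.org/lobid/lv#> .    # assumed namespace of the lobid vocabulary
    @prefix umbel: <http://umbel.org/umbel#> .

    # hypothetical example
    <http://lobid.org/resource/HT000000000>
        lv:urn "urn:nbn:de:hbz:000-12345678" ;
        umbel:isLike <http://nbn-resolving.de/urn:nbn:de:hbz:000-12345678> .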

(See also GitHub issues #78 & #211.)

Information from the bibliographies of North Rhine-Westphalia and Rhineland-Palatinate

Data from two "Landesbibliographien" (bibliographies of German federal states) is part of the hbz union catalog: the bibliographies of North Rhine-Westphalia (NWBib) and Rhineland-Palatinate (RPB). These are very interesting datasets that focus on the literature about a specific region. As we are currently working on a project to build a web site for one of these bibliographies (NWBib) on top of the lobid API, we needed to get the bibliography-specific information into the RDF. There are three points of particular interest:

  • Identifying resources that are part of a bibliography. We did this by linking each resource to a bibliography using dct:isPartOf (see the sketch after this list).
  • Enabling bibliography-only search. We enable search restricted to NWBib or RPB using a 'set' parameter, e.g. - if you are interested in economy books about (parts of) North Rhine-Westphalia: http://lobid.org/resource?name=economy&set=NWBib
  • Including the bibliography-specific classification in the data. Each of the bibliographies uses a custom classification system for subject indexing. We converted these systems to SKOS (RPB SKOS classification, NWBib classifications in SKOS) and used URIs - instead of notation numbers - to link bibliographic resources with the SKOS concepts. Thus, we get a rich list of subjects for such a resource, at times including GND subjects and DDC classes as well as the respective bibliography's classification. For one example resource we get eight subject links; the sketch below shows the general pattern.
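The general pattern looks roughly like this (a sketch; all example.org URIs and the GND ID are placeholders, not the actual identifiers):

    @prefix dct: <http://purl.org/dc/terms/> .

    # hypothetical NWBib resource
    <http://lobid.org/resource/HT000000000>
        dct:isPartOf <http://example.org/bibliographies/NWBib> ;   # membership in the bibliography
        dct:subject  <http://d-nb.info/gnd/1234567-8> ;            # a GND subject (placeholder ID)
        dct:subject  <http://dewey.info/class/330/> ;              # a DDC class
        dct:subject  <http://example.org/nwbib#N123> .             # an NWBib classification concept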

We have implemented this analogously for the RPB data.

Several smaller improvements

We also made a number of minor improvements.

Posted on 10 Mar @ 5:01 PM by Adrian Pohl | 0 comments
  2014/01/28
lobid mailing list
Last modified by Adrian Pohl, Jan 28, 2014 10:58
Labels: listserv, communication

After going into production with the lobid API in November, we have now also set up an open German-language mailing list for questions and discussions about lobid, the hbz's Linked Open Data service. We will also regularly announce current lobid developments there. You can subscribe to the list here.

Bug reports and requests for specific new features are of course still welcome on GitHub. The lobid team also remains available at semweb@hbz-nrw.de. Mostly English-language updates will continue to be distributed via this hbz LOD blog as well as via Twitter. Complementing these communication channels, the new lobid mailing list serves the open and transparent communication with German-speaking API users who don't have a GitHub account, and about topics that don't fit into the issue tracker.

Posted on 28 Jan @ 10:42 AM by Adrian Pohl | 0 comments
  2013/11/13
Providing machine-readable application profiles with OAI-ORE
Last modified by Adrian Pohl, Nov 14, 2013 09:42
Labels: applicationprofiles, oai-ore

Since my post on application profiles and JSON-LD I have been giving some more thought to the question of how to publish application profiles ("APs") in a formal, machine-readable way. As far as I know, there is no common practice yet for publishing machine-readable documentation of application profiles. Here, I make a first attempt at publishing an AP documentation using OAI-ORE. (If you are only interested in this and want to skip the preliminaries, jump directly to the section entitled Describing an application profile with OAI-ORE.)

Benefits

I already talked about this in the last post on APs but will nonetheless first present use cases for and benefits of machine-readable application profiles. Here are three use cases I can think of:

  1. Enable people to create data/describe resources in accordance with the application profile: A machine-readable AP might enable developers to automatically create the necessary input forms for describing resources based on the AP. At best, the AP would enable an automatic input validation.
  2. Enable people (and machines) to understand the data that underlies an application independent of a specific natural language: E.g., studying an application profile makes sense if you want to build an application based on data that is created according to the AP. A machine-readable AP can on the one hand be understood by people independent of the natural languages they understand and might on the other hand enable automatic creation of an application that works with the data.
  3. Machine-Readable application profiles could play a role in generating vocabulary usage statistics. This could be interesting for services like Linked Open Vocabularies (LOV) that provide statistics on vocabulary/property usage.

An application profile is more than a list of metadata terms

As my last post might have suggested otherwise, I want to make one thing clear: a JSON-LD context document hardly meets all the characteristics of an application profile as defined by the Dublin Core community. According to the Guidelines for Dublin Core Application Profiles, an AP should meet the following criteria:

A DCAP is a document (or set of documents) that specifies and describes the metadata used in a particular application. To accomplish this, a profile:

  • describes what a community wants to accomplish with its application (Functional Requirements);
  • characterizes the types of things described by the metadata and their relationships (Domain Model);
  • enumerates the metadata terms to be used and the rules for their use (Description Set Profile and Usage Guidelines); and
  • defines the machine syntax that will be used to encode the data (Syntax Guidelines and Data Formats).

As a JSON-LD context document isn't much more than a list of metadata terms employed in an application, it doesn't possess all the characteristics of an AP.

In the following, I am definitely not concerned with fulfilling all four criteria for APs through a machine-readable document but will focus on the third one: 'enumerate the metadata terms to be used and the rules for their use'.

Side-effects of vocabulary re-use

If everybody (re)used a vocabulary in the same way, there would be no need to describe its usage for a specific service. To find out what the characteristics of an application profile might be, we first have to answer the question: what operations are involved in the re-use of RDF properties and classes? In this section, I will try to name and illustrate some of these.

First, here is a simple example RDF description that a service serving descriptions of libraries might provide. 1
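It might have looked roughly like this (a reconstruction for illustration only; the organisation and all example.org URIs are placeholders, not the original listing):

    @prefix foaf:    <http://xmlns.com/foaf/0.1/> .
    @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix dcterms: <http://purl.org/dc/terms/> .
    @prefix gn:      <http://www.geonames.org/ontology#> .
    @prefix org:     <http://www.w3.org/ns/org#> .

    # fictional library description
    <http://example.org/libraries/DE-9999> a foaf:Organization ;
        foaf:name "Example City Library" ;
        rdfs:seeAlso <http://dbpedia.org/resource/Example_City_Library> ;      # link to a related DBpedia resource
        dcterms:identifier "DE-9999"^^<http://example.org/datatypes/isil> ;    # the ISIL, with an (assumed) datatype
        gn:locatedIn <http://sws.geonames.org/2886242/> ;                      # the city's GeoNames URI (here: Cologne)
        org:classification <http://example.org/libtypes#publicLibrary> ;       # organisation type from a controlled vocabulary
        foaf:isPrimaryTopicOf <http://en.wikipedia.org/wiki/Example_City_Library> .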

We can see that the AP used for this library description draws six properties (foaf:name, rdfs:seeAlso, dcterms:identifier, gn:locatedIn, org:classification, foaf:isPrimaryTopicOf) and one class (foaf:Organization) from five different RDF vocabularies.

Generally, when vocabulary terms are re-used, this may be accompanied by a more restrictive usage in the context of an application in comparison to the existing wider usage. The example description illustrates how a service might use properties and classes. foaf:name is the only property that is used according to the wider practice. The use of the other properties is adjusted for this specific service:

  • In the context of this fictional service, the generic property rdfs:seeAlso is only used to state links to a related DBpedia resource.
  • The generic dcterms:identifier property isn't used with arbitrary identifiers but solely indicates an organisation's ISIL. A corresponding datatype is also attached to the string value.
  • The property gn:locatedIn is solely used to link an organisation to the city it resides in, using the city's GeoNames URI.
  • org:classification is used in the context of the service to indicate the organisation type using a specific controlled vocabulary.

I mentioned another example in the previous post on APs: the differing interpretation of dcterms:alternative as "uniform title" by one group and as "other title information" by another. A machine-readable application profile should somehow reflect all these side effects of reusing vocabularies.

Describing an application profile with OAI-ORE

The basic idea of this post is: application profiles are aggregations of terms from different RDF vocabularies. OAI-ORE is an RDF vocabulary for describing aggregations of things. Shouldn't it make sense, then, to describe APs using OAI-ORE?
If you want to know more about OAI-ORE, I suggest starting with the Wikipedia entry.

Here is how one could describe the underlying application profile for the fictional service described above:

  • As the aggregation itself is an abstract resource, it is described in an RDF document called a "resource map". After the prefix declarations (lines 1-9), the resource map is described (lines 11-18) with information about license, creator, creation date etc. and with the statement which aggregation is described by the resource map.
  • In lines 12-24 the aggregation - i.e. the application profile itself - is described: its name and the aggregated resources (in this case the properties and classes used in the application profile) are stated.
  • The rest of the document (lines 26-60) is concerned with making clear where the usage of the re-used properties diverges from their original definition in their home vocabulary. In a human-readable way this information is provided via rdfs:comment, but it also happens via rdfs:domain and rdfs:range declarations. For example, lines 32-40 state the regular expression to which values of dcterms:identifier are constrained in the AP.
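Compressed into a few triples, such a resource map might look like this (a sketch only, far shorter than the listing described above; all example.org URIs, the metadata values and the regular expression are assumptions):

    @prefix ore:     <http://www.openarchives.org/ore/terms/> .
    @prefix dcterms: <http://purl.org/dc/terms/> .
    @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix foaf:    <http://xmlns.com/foaf/0.1/> .
    @prefix gn:      <http://www.geonames.org/ontology#> .
    @prefix org:     <http://www.w3.org/ns/org#> .
    @prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

    # the resource map
    <http://example.org/ap/libraries/resourcemap> a ore:ResourceMap ;
        dcterms:creator <http://example.org/people/adrian> ;
        dcterms:created "2013-11-13"^^xsd:date ;
        ore:describes <http://example.org/ap/libraries> .

    # the aggregation, i.e. the application profile itself
    <http://example.org/ap/libraries> a ore:Aggregation ;
        rdfs:label "Application profile of a fictional library data service"@en ;
        ore:aggregates foaf:name, rdfs:seeAlso, dcterms:identifier, gn:locatedIn,
                       org:classification, foaf:isPrimaryTopicOf, foaf:Organization .

    # usage notes on one of the re-used properties
    dcterms:identifier rdfs:domain foaf:Organization ;
        rdfs:comment "Used only for ISILs; values must match ^[A-Z]{1,4}-[A-Za-z0-9:/-]{1,11}$."@en .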

Conclusion

OAI-ORE offers what one needs to represent application profiles - or at least the characteristics of APs we were looking for - in RDF. The question is whether it is also convenient to create and use, or whether one should look for a lightweight alternative. I'd be happy to hear opinions and to learn about other approaches.

_______________

Footnotes
1 The example was constructed for the purpose of this text but it reflects current or past practices in the lobid organisations index.


Posted on 13 Nov @ 11:58 AM by Adrian Pohl | 4 comments
  2013/10/07
Two new publications on LOD and libraries
Last modified by Adrian Pohl, Oct 15, 2013 09:29
Labels: publications, news

Edited volume on LOD in libraries

The edited volume "(Open) Linked Data in Bibliotheken" has been published!
You can read more about it on the blog of co-editor Adrian Pohl.

Article on specialized information services (Fachinformationsdienste) and how they could benefit from an LOD approach

In addition, over the last few weeks Pascal and Adrian have written an article called "Dezentral, offen, vernetzt - Überlegungen zum Aufbau eines LOD-basierten FID-Fachinformationssystems". The preprint can be found here (until the print version is published).

An automatically generated (with pandoc) PDF version of the text (with clickable URLs) can be found here. (We wrote the text in Markdown - following an example Martin Fenner had pointed to - and used GitHub as our editorial system.) Update, Oct 15, 2013: There is also a version in the wiki at https://wiki1.hbz-nrw.de/x/EYOf.

Posted on 07 Oct @ 10:46 AM by Pascal Christoph | 0 comments
  2013/09/19
OER? How? What? - A report on the OER Conference 2013
Last modified by Adrian Pohl, Sep 20, 2013 23:12
Labels: konferenz, metadata, oer

by Adrian Pohl

Last weekend my colleague Jan Neumann and I attended the OER Conference 2013 (#OERde13). OER stands for Open Educational Resources. The conference was organized by Wikimedia Deutschland under the patronage of UNESCO and took place at the Kalkscheune in Berlin Mitte, which I remembered fondly from the Open Knowledge Conference (OKCon) 2011.

The conference was superbly organized and every session was looked after by friendly Wikimedians. Each session had an Etherpad that was filled in - sometimes more, sometimes less - by the participants and session leaders. (I am convinced that Etherpads are an excellent, because easy to learn, "gateway drug" to collaborative work on the web.) It was also wonderful that - as far as I can judge - representatives of all areas relevant to OER were present at the conference: teachers, academia and university administration, educational science, management and politics, publishers, libraries, private OER initiatives (e.g. people from serlo and Lernfink), the federal states' educational servers (Landesbildungsserver), of course plenty of people from the Wikimedia community, and many more. There were three different event formats at the conference: traditional talks with discussion selected by a programme committee, a speed lab, and a barcamp.

Saturday morning started - after the first conversations over morning coffee - with a keynote by Philipp Schmidt (video recording). He highlighted the fundamental achievements of the WWW and the opportunities that OER, with the web as a platform, offers us to shape an open educational system that enables learners to participate actively. Schmidt gave plenty of examples, some from his own projects. What stuck in my memory in particular is the story of the nine-year-old boy with which Philipp Schmidt closed his talk (from 30:14): the boy is very interested in penguins and got in touch with other enthusiasts on the web, until he eventually found himself in regular exchange with penguin researchers at Johns Hopkins University. (Because: on the internet, nobody knows you're a kid.)

Philipp Schmidt giving his keynote (photo: Agnieszka Krolik, CC-BY-SA)

Methods and didactics

Three sessions followed, each with five talks selected by a programme committee. I first ventured onto the terrain of didactics, which is unfamiliar to me, and attended the talk by professors Kerstin Mayrberger and Sandra Hofhues (session Etherpad). The topic here was more the field of Open Educational Practices than Open Educational Resources. Three theses were put up for discussion (see the input slides for the session). What I took away from the discussion is that the participants of the OER conference, who presumably know their way around the web and computers quite well, are rather frustrated with the pace at which new technologies and the new methodological and didactic possibilities they bring are finding their way into the classroom. As long as the majority of teachers don't possess the necessary skills themselves, OER simply can't make it into the classroom...

OER repositories and search

I picked the next two sessions based on my library background and went to Christian Lukaschik's session on the edu-sharing network, OER metadata and classification schemes (Etherpad). The edu-sharing network seems to me to be the most thoroughly thought-out and most widely implemented approach to building a decentralized OER repository and discovery infrastructure. edu-sharing is a registered association that describes itself as follows: "As a non-profit community we operate a free network for learning and knowledge content and pool resources for the availability and quality assurance of content and free software." edu-sharing commissions metaVentis GmbH to develop an OER repository software (based on Alfresco) for edu-sharing, with edu-sharing also pooling, for example, its members' support requests. The goal is to build an OER repository network based on the edu-sharing software. I wonder to what extent this network is based on web standards and thus also allows other repositories to join. (Ideally, the web itself should be the OER network, not a software-dependent network within the network.) So that OER repositories can be embedded into the interfaces of the e-learning systems already in use at schools, metaVentis also develops plugins for Moodle and other e-learning environments. edu-sharing/metaVentis has already gathered a lot of experience with (O)ER metadata and has, for example, processed NRW curricula so that they can be used for subject indexing, enabling search entry points and categorization by the competencies of a curriculum.

Next I attended Lothar Palm's session on learn:line NRW to better understand the existing approaches to searching for educational materials. This is what the federal states' educational servers (Landesbildungsserver) are for: they allow the teachers of their state to search for educational resources (OER, non-openly licensed materials, and materials exclusively for teachers). In addition, the German Institute for International Educational Research (DIPF) offers, with the Deutscher Bildungsserver and the search engine Elixier, a search over, among other things, the aggregated data of the Landesbildungsserver. For North Rhine-Westphalia this search is offered in the form of learn:line NRW, which technically runs on the edu-sharing software and its discovery module. With around 25,000 learning objects, the learn:line NRW index contains (as yet) rather little resource metadata compared to the library world, but it comes from a large number of different providers, which doesn't exactly make building and operating the portal easier. After all, aggregating data from different providers, normalizing it and indexing it involves a lot of effort. At the end of the session I asked whether there had already been any thought about making the learn:line data available for re-use under an open license. There had been, and apparently experts were consulted on the topic. According to the learn:line NRW representatives, however, the terms of use that the various providers attach to their data deliveries would not permit such a release. For me, this experience was the reason to put the topic of an "open OER metadata ecology" front and center in my barcamp session on Saturday afternoon. (As my original session proposal shows, I had rather wanted to also cover LRMI as a standard and find out which relevant classification schemes exist in the education sector.) Because if the course is set right now, at the beginning, there will be no work later on convincing even more people of open data.

A short intermezzo followed in the form of a speed lab - possibly inspired by speed dating. It offered the chance to learn from other European countries and to take a look at their OER approaches and platforms. I had a look at the great Dutch Wikiwijs, which among other things offers an integrated remix tool, and otherwise prepared my barcamp session.


Waiting for the presentation of a barcamp session (photo: Agnieszka Krolik, CC-BY-SA)

Towards an open OER metadata ecology

On Saturday afternoon I then offered the session on "building an open OER metadata ecology". About 12 people took part, among them interested OER platform builders as well as a representative of the Landesbildungsserver Berlin-Brandenburg and one from edu-sharing. For session notes see the Etherpad at https://etherpad.wikimedia.org/p/oercamp13-1-18. I gave a short introductory talk on the components and benefits of an open metadata ecology, see the slides. The subsequent discussion mainly revolved around classification schemes and controlled vocabularies (school types and subjects as well as curriculum-based subject indexing). The participants were so interested that there was applause at the end and a follow-up session on Sunday was requested. At first I didn't want to commit to it, because I wasn't exactly in the best shape after the first conference day. (I felt as if I had experienced two conference days in one, because the programme was very dense and the breaks were filled with conversations about things I wasn't particularly familiar with yet...)
Nevertheless, in the first Sunday session round I offered a continuation, which was attended by only five or six people. In the end, everyone present agreed that networking those active in the OER metadata field beyond the conference would make sense. So I collected the e-mail addresses of the interested people and will propose to the DINI-KIM working group to set up a mailing list and a wiki area for OER metadata. (If you are interested and haven't let me know yet, leave a comment under this post or write me an e-mail (pohl[at]hbz-nrw.de).)

At the end of the conference I also attended the contributions by Stephan Suwelack ("Autoren als Marke") and by my esteemed colleague Lambert Heller. Unfortunately I can't go into them further now, because I have to leave for my (extended) weekend but still want to publish this post this week. Sadly, I missed the closing keynote by Neil Butcher. Fortunately, a recording is available online.

Many thanks to the organizers of the OER Conference 2013, to the programme committee, and to the participants for the many interesting conversations and discussions.

Posted on 19 Sep @ 1:07 PM by Adrian Pohl | 0 comments
Announcing the lobid API beta
Last modified by Adrian Pohl, Oct 14, 2013 11:18
Labels: api, lobid-general, news

We are happy to announce the public beta of the lobid API. After ten months of work by the lobid team (especially Fabian and Pascal who are doing the software development) the lobid API beta was deployed at the end of August. Our technology stack consists of Metafacture for the transformation of the raw data to N-Triples, Hadoop for enrichment and conversion to JSON-LD, Elasticsearch for indexing and data delivery, and the Play framework for the HTTP API. The software is developed on GitHub. For usage instructions and API details see the API documentation on the API homepage.

The lobid API technology stack and data flow

Providing easy access to authority data

Besides the data from the whole hbz union catalog now being available through the API 2, we think that the lobid API will be an especially useful tool for interacting with the authority data provided by lobid. Continuing the original idea of lobid, the lobid API provides access to the data from the German ISIL registry (the MARC Organization Codes Database will be added to the API index later, see below). We also decided to add functionality to search for persons from the German Integrated Authority File (GND) and will add other GND data in the future.

In effect, libraries have been creating linked data for decades now, with subject, person, and title records referencing each other. Authority data has always constituted a hub in the library data world that many local datasets point to. Library authority data is relevant for other areas besides library catalogs (e.g. open access repositories, which often do not make use of authority data yet, or a website collecting job offers by libraries) - but to be attractive linking points, authority files should be easy to use. We hope to help library authority files move in this direction by enabling easy integration of the German ISIL registry as well as GND data into web forms via the lobid API. The API homepage provides JavaScript examples of how to use the API in web forms.
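For instance, a person lookup could then be a plain HTTP GET (a sketch only - the person endpoint and its name parameter are assumed by analogy with the resource queries shown elsewhere; see the API documentation for the actual syntax):

    # sketch: querying the API for GND persons by name (endpoint and parameter assumed)
    curl "http://api.lobid.org/person?name=Heine"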

Screenshot of lobid API auto suggest use case

Some historical background

There were already numerous ways of accessing lobid data before we developed the API: you can get the data in different RDF serializations via content negotiation 3, you can get a full dump of the bibliographic data, you can query the SPARQL endpoint at http://lobid.org/sparql, and the HTML pages are enriched with RDFa. Why add another data access mechanism? Besides the additional possibilities an API provides (e.g. the described auto-suggest functionality) and making the data more accessible to web developers by serving JSON-LD, there were also performance-related reasons that made us switch to providing LOD via Elasticsearch. In November 2012 Pascal Christoph presented at SWIB12 (slides, video) on our plans to move to a JSON-LD- and Elasticsearch-based approach for publishing LOD, as the triple-store-based approach had performance issues. Soon after that we broadened that approach to not only provide content negotiation and RDFa but to also offer an easy-to-use LOD-based web API using JSON-LD, which we now make available for public testing.

Prospects

We're planning to replace the current lobid.org site with the new implementation (since February 2012, lobid.org has been based on the data in our triple store (4store) rendered by Phresnel). We're also going to add more data: e.g. we want to make all GND data available (not only persons but also subjects etc.) and integrate the open data from the Cultural Heritage Organizations vocabulary (which essentially is built from the MARC Organization Codes Database). For details and more plans, see our open issues and milestones on GitHub.

Footnotes
2 As of September 2013, the last two libraries from the hbz library network - ULB Düsseldorf and UB Paderborn - have decided to join the open data initiative that started in March 2010. Thus, after three and a half years of arguing for open data, the hbz catalog as a whole is now openly licensed under CC0.
3 e.g. curl -L -H "Accept: text/turtle" http://lobid.org/resource/HT002189125 — the same works with the API: curl -L -H "Accept: text/turtle" http://api.lobid.org/resource/HT002189125
Posted on 19 Sep @ 10:06 AM by Adrian Pohl | 0 comments
  2013/08/01
Sharing context - publishing application profiles with JSON-LD
Last modified by Adrian Pohl, Jan 16, 2014 15:44
Labels: applicationprofiles, json-ld

Since 2010, more and more library service centers and libraries in Germany have been publishing their catalog data as linked open data; see e.g. this list on the Data Hub. Regarding the RDF modeling and the related questions of which RDF properties from which vocabularies to use and how, the different data publishers have mostly oriented themselves towards prior LOD publication projects. Thus, the different LOD publications don't differ in the broader approach: e.g. all agree on using the Bibliographic Ontology (Bibo) and Dublin Core as base vocabularies, choosing needed properties from other ontologies like the RDA elements. Nonetheless, the datasets differ slightly in which RDF properties they use and how they apply them. To easily work with and combine different datasets, it would help to have some agreed-upon best practices established for representing library catalog data in RDF.

To promote such best practices by publishing a recommendation for the RDF representation of bibliographic records, the group "Titeldaten" within the KIM-DINI working group (KIM = Competence Centre Interoperable Metadata) was established in January 2012. Currently the group consists of representatives from most German-speaking library service centers, from the German and Swiss National Libraries as well as from other institutions. The group will soon publish the first stable version of the recommendations (in German) which are currently focused on bibliographic descriptions of textual resources and thus leave out descriptions of audio(-visual) media etc.

Application profiles

At the hbz, we promote the re-use of existing vocabularies instead of creating a new one for every application. For our LOD service lobid, we only create new properties or SKOS vocabularies if we can't find anything in an existing vocabulary that looks serious and is still maintained. But as we have seen, re-using vocabularies doesn't by itself guarantee interoperability on the linked data web. Even if two projects select the same RDF properties for publishing their data, their use of these properties might differ significantly. E.g., one application might use dcterms:alternative for indicating what librarians call a uniform title, while others might use it for title information that accompanies the main title. That is why documenting vocabulary usage in the form of application profiles makes sense. In principle, the goal of the "Titeldaten" group is nothing else but creating an application profile (recently also called a "community profile") for publishing library catalogs as linked data.

The concept of an application profile has its origin in the Dublin Core community. In a Dublin Core glossary published in 2001, "application profile" is explained as follows:

A set of metadata elements, policies, and guidelines defined for a particular application. The elements may be from one or more element sets, thus allowing a given application to meet its functional requirements by using metadata from several element sets including locally defined sets. For example, a given application might choose a subset of the Dublin Core that meets its needs, or may include elements from the Dublin Core, another element set, and several locally defined elements, all combined in a single schema. An Application profile is not complete without documentation that defines the policies and best practices appropriate to the application.

So, an application profile is an element set that draws together elements from other element sets. "Element set" is Dublin Core language for what in the linked data community is often called a "vocabulary". (Element sets aren't necessarily encoded in RDF, though.) So you can say that an application profile is a selection of RDF properties from different vocabularies. But it is more than that as the last sentence of the quote indicates. An important part of any application profile is a "documentation that defines the policies and best practices appropriate to the application".

Sharing an application profile

We like the concept of an application profile and we think it should play an important role in a linked data world where vocabularies for different domains are published all over the web and can be used by anybody for exposing their linked data. We believe that the LOD community would benefit from a broader practice of documenting and sharing application profiles. But how to do this properly?

Regarding the choice of language, the approach of the "Titeldaten" group probably isn't the best, as we stuck to German as the language for discussing and publishing the profile. As the recommendations are directed toward the German-speaking community, this might nonetheless make sense. For documentation, we chose a wiki, which is probably fine for anybody interested in understanding and using the application profile for their LOD publication. In line with other DINI recommendations, the "official" text will also be published as a PDF. However, what we at hbz would like to have is a simple overview of the properties used for an application profile, along with maybe some additional information like whether URIs or strings should be used in object position. At best, we would like to publish this simple list in a standard machine-readable way so that it could even be used directly by applications. Also, this would make it possible for people to fork the application profile on GitHub and extend it, while one could easily see the differences between both profiles. That is where JSON-LD comes into play...

JSON-LD and @context

What excited me about JSON-LD is that it brings something new to the linked data world: external JSON-LD context documents. Before I go there, first some explanation of JSON and JSON-LD. JSON (JavaScript Object Notation) is a lightweight format used for data interchange. It is also a subset of JavaScript's object notation (the way objects are built in JavaScript) (source). Over the last couple of years, JSON has more and more replaced XML as the standard format for data interchange on the web. Today, nearly every API on the web serves JSON.

JSON-LD is a way of encoding RDF statements (triples) in JSON. Thus, JSON-LD could be a big step forward for the linked data community as it makes it quite easy for web developers to understand the virtues of linked data. Here's an example JSON-LD document:
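The document below is a reconstruction for illustration (the resource and its metadata are invented, not the original example):

    {
      "@id": "http://example.org/book/1",
      "http://purl.org/dc/terms/title": "An example book",
      "http://purl.org/dc/terms/creator": { "@id": "http://example.org/person/jane-doe" }
    }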

In the first line, '@id' indicates what entity this JSON-LD document is about. In other cases, the '@id' keyword is used to make clear that a URI is used as a value (i.e. as the object of an RDF statement). The second and third lines make statements about the resource using the elements 'title' and 'creator' from the DC terms vocabulary. This looks straightforward and easy to understand if you already have a linked data and/or JSON background. To shorten the descriptive part of the document and to make it easier to read, JSON-LD has introduced the @context as a syntactic mechanism to map short JSON terms to property URIs.
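Applied to the sketch above, such a context could look like this (the short term names simply follow the prose):

    {
      "@context": {
        "title": "http://purl.org/dc/terms/title",
        "creator": {
          "@id": "http://purl.org/dc/terms/creator",
          "@type": "@id"
        }
      },
      "@id": "http://example.org/book/1",
      "title": "An example book",
      "creator": "http://example.org/person/jane-doe"
    }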

So the context document maps JSON terms (here: "title" and "creator") to property URIs and - as seen with dcterms:creator - declares with the '@type' keyword when the object position is to be interpreted as a URI. What makes a @context interesting in the context of application profiles is that it can be published at a different location than the descriptive part of the JSON document and may only be referenced in the document, e.g.:
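A sketch of such a document (the context URL is a placeholder, not the actual location of the "Titeldaten" context file):

    {
      "@context": "http://example.org/contexts/titeldaten.jsonld",
      "@id": "http://example.org/book/1",
      "title": "An example book",
      "creator": "http://example.org/person/jane-doe"
    }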

In this example I link to a version of an external JSON-LD context file for the DINI-KIM "Titeldaten" recommendations which, among other things, contains the mapping of the terms "title" and "creator" to the corresponding DC elements.

Example: Putting an application profile into @context

I have already pointed to a version of the DINI-KIM "Titeldaten" recommendations as a JSON-LD context document. Let's look at an excerpt.
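Reconstructed from the description below, the relevant part might look roughly like this (the short term names and the RDA element URI are assumptions):

    {
      "@context": {
        "xsd": "http://www.w3.org/2001/XMLSchema#",
        "title": {
          "@id": "http://purl.org/dc/elements/1.1/title",
          "@type": "xsd:string"
        },
        "otherTitleInformation": {
          "@id": "http://rdvocab.info/Elements/otherTitleInformation",
          "@type": "xsd:string"
        },
        "alternative": "http://purl.org/dc/terms/alternative",
        "creator": {
          "@id": "http://purl.org/dc/terms/creator",
          "@type": "@id"
        },
        "contributor": {
          "@id": "http://purl.org/dc/terms/contributor",
          "@type": "@id"
        },
        "creatorName": "http://purl.org/dc/elements/1.1/creator",
        "contributorName": "http://purl.org/dc/elements/1.1/contributor"
      }
    }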

This part of the context document captures some core information of the recommendations in a clear and concise form: that the title and other title information should be represented with dc:title and rda:otherTitleInformation, and that both must contain an xsd:string in object position; that the uniform title is specified with dcterms:alternative; and that you use different properties for indicating the URI of a creator or contributor in an authority file (dcterms:creator/dcterms:contributor) than for specifying the creator's/contributor's name (dc:creator/dc:contributor). I think this provides a good starting point for someone who wants to quickly get familiar with an application profile - and it is even comprehensible for people not at all familiar with German.

As usual, this approach also has some shortcomings. One of them is that you can't specify a list of terms that are intended to be used as values of the listed properties. E.g. the "Titeldaten" recommendations make use of rdf:type and dcterms:medium with a list of different OWL classes and SKOS concepts for indicating the media type/carrier format. (Notably, this is a temporary solution, as currently no sensible solution for indicating carrier and media type exists.) There is no way to express this in a JSON-LD context document, so this part of the recommendations can't be reflected in the @context document.

Using @context for specifying property labels?

One might have the idea to put more than this information into a JSON-LD context document; e.g., I would like to be able to use it for specifying property labels. Linked data might be actionable data for machines, but in the end you always want to present that data to humans. That is when you need human-readable labels for the properties you use. Often, you don't want to use the label that is declared in the respective vocabulary with rdfs:label, especially if you are in need of a German label. And even if the vocabulary provides a label in the language you need, you might want to choose another one for a specific application (e.g. you can present something like "title proper" to librarians but not to people who aren't familiar with cataloging rules).

However, it might make a lot of sense to specify property labels in an application profile so that, e.g., users of online services by different libraries aren't confused by differing terms. Thus, I tried to provide this kind of information in a @context document, but unfortunately this is not valid JSON-LD (see the invalid document here). As said above, a context document is only a syntactic mechanism and isn't intended to contain the kind of semantic information that you can express with a vocabulary defined with RDFS. Markus Lanthaler and Niklas Lindström helped me to better understand JSON-LD and its restrictions and proposed some options for working around the problem by including the property label information elsewhere in the document. I must say that I am still a beginner with JSON-LD and don't know which option makes the most sense for us. We will explore these options in more depth when we come to replacing our current setup for the lobid.org frontend (we are working on replacing the Fresnel implementation Phresnel with a solution based on the lobid API currently under development).

Conclusion

It is not difficult and doesn't take much time to encode the core content of an application profile as a JSON-LD context document. As it is a useful addition to a human-readable documentation, it may well be worth the effort to publish the core information of an application profile as a JSON-LD context document.

Read the follow-up post on providing application profiles with OAI-ORE


Posted on 01 Aug @ 11:57 AM by Adrian Pohl | 0 comments
  2013/07/15
Daily updates for lobid-organisations
Last modified by Pascal Christoph, Jul 15, 2013 16:30
Labels: news, lobid-organisations

Our LOD service http://lobid.org/organisation is now updated daily, at least for the data coming from the German Online-ISIL-Verzeichnis.

Help update and curate the ISIL data

As you might have discovered, some geo location entries are not correct. We use the OSM API and some heuristics to compute these geo coordinates when they are missing in the original data (only around 3% of the original data comes with geo coordinates!), but this is prone to errors.
So, if you discover wrong geo coordinates for German institutions, the best way to correct them is to update the original database (the German Online-ISIL-Verzeichnis). You can edit this online form, or you can contact the institution (find its e-mail address on the same lobid-organisations page where you found the incorrect geolocation on the map), make them aware of the importance of updating their coordinates, and lobid.org (and others) will present the updated data one day later.

Posted on 15 Jul @ 4:28 PM by Pascal Christoph | 0 comments
  2013/05/28
Changes to organisation descriptions in lobid
Last modified by Pascal Christoph, May 28, 2013 16:57
Labels: lobid-organisations, news

We have recently updated the organisations data in lobid.org using the Culturegraph Metafacture software (see the morph mapping) and - in order to represent even more data from the German ISIL registry - published some new controlled vocabularies in RDF. To be more appealing for re-users, we chose to switch to a sustainable URI namespace at purl.org for the lobid vocabularies, which are maintained on GitHub.

The changes

So what did we actually add to the organisation descriptions?

  • Added information about the organisation type (using the libtype vocabulary).
  • Added information about the stock size and the type of funding organisation (using the newly published stocksize and fundertype vocabularies).
  • Added opening hours information.
  • Added subject headings.

Some of these changes were already present in the triple store data but weren't reflected on the lobid.org frontend. Now you can get all this data in HTML and RDFa via your web browser or - using content negotiation - in other RDF formats.
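For example (assuming organisation URIs follow the pattern http://lobid.org/organisation/<ISIL>; DE-605 is the hbz's ISIL):

    # content negotiation for an organisation description (URI pattern assumed)
    curl -L -H "Accept: text/turtle" http://lobid.org/organisation/DE-605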

Monthly updates

We've automated the updating process so that from now on the organisations data will be updated on a monthly basis.

TODO

If you want to make use of the new data by querying subject headings, say: "Give me all institutions that have 'Karten' (German for 'maps')", that translates into a SPARQL query over the subject heading literals.
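A sketch of such a query, assuming the subject headings are attached via dc:subject (the actual property may differ):

    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    SELECT DISTINCT ?org WHERE {
      ?org dc:subject ?subject .
      FILTER regex(str(?subject), "Karten", "i")
    }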

You will be disappointed, because this simple query (over a small dataset of only 350k triples) took 10 minutes (the first time, without cache) and will not return the full result because it hits the complexity limit. The problem is not SPARQL per se, but that you are dealing with literals, for which a triple store is (understandably) not optimized. This leads directly to another desideratum:
Subject headings should not be literals, but URIs. That's already the case in the lobid data describing bibliographic resources, but not in the organisation descriptions.
URIs as subject headings have other positive side effects. Using e.g. the Dewey Decimal Classification, you get direct access to translations of each class into many languages, you have a hierarchy of classes, and, what's most important, you have an unambiguous identifier from a controlled vocabulary rather than a plain word that could have different meanings in different contexts.
Thus, transforming these literals into URIs is a TODO.

Of course, an API based on a search engine would also be fast and would bring some extra benefits, e.g. auto-suggestions. We are working on that!

Posted on 28 May @ 4:56 PM by Pascal Christoph | 0 comments
  2013/05/23
The GND ontology's class hierarchy - an overview
Last modified by Adrian Pohl, May 23, 2013 17:46
Labels: gnd

We do a lot of work with the German authority file GND (Gemeinsame Normdatei), specifically with its RDF version. The underlying GND ontology was published last year, and I must admit I hadn't yet taken the time to get an overview of the ontology. Today I started by getting to know the hierarchical structure of the ontology's classes. As I didn't find an overview on the web, I created my own and publish it here.

First I took the current GND ontology Turtle file from the GND namespace http://d-nb.info/standards/elementset/gnd#, extracted only the classes with their rdfs:subClassOf relations as well as the few owl:equivalentClass relations, converted the result to dot format using rapper, and created the following graph image from this:

(Link to high resolution image)
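On the command line, this pipeline could look roughly like the following sketch (the download URL and the grep-based filtering are assumptions, not the exact commands used):

    # fetch the ontology, keep only class-hierarchy statements,
    # let rapper serialize them as GraphViz dot and render an image
    curl -L -o gnd.ttl "http://d-nb.info/standards/elementset/gnd.ttl"
    rapper -i turtle -o ntriples gnd.ttl \
      | grep -E 'subClassOf|equivalentClass' > gnd-classes.nt
    rapper -i ntriples -o dot gnd-classes.nt > gnd-classes.dot
    dot -Tpng gnd-classes.dot -o gnd-class-hierarchy.png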

As one can see in the image, the GND ontology's class hierarchy extends over three levels (except for gnd:Person, which adds another level to distinguish between differentiated and undifferentiated persons, i.e. person names). The top class is gnd:AuthorityResource, which has seven subclasses that have several subclasses themselves. Additionally, there is the separate class gnd:NameOfThePerson, which isn't directly linked (with rdfs:subClassOf) to other GND classes at all. Here is a bulleted list providing another view of the GND ontology's class hierarchy. (For better readability I put spaces into the GND class names.)

  • Name of the Person
  • Authority Resource
    • Corporate Body
      • Fictive Corporate Body
      • Organ of Corporate Body
      • Project or Program
    • Conference or Event
      • Series of Conference or Event
    • Subject Heading
      • Subject Heading Senso Stricto
      • Characters or Morphemes
      • Ethnographic Name
      • Fictive Term
      • Group of Persons
      • Historic Single Event or Era
      • Language
      • Means of Transport with Individual Name
      • Nomenclature in Biology or Chemistry
      • Product Name or Brand Name
      • Software Product
    • Work
      • Manuscript
      • Musical Work
      • Provenance Characteristic
      • Version of a Musical Work
      • Collection
      • Collective Manuscript
    • Place or Geographic Name
      • Fictive Place
      • Member State
      • Name of Small Geographic Unit Lying within another Geographic Unit
      • Natural Geographic Unit
      • Religious Territory
      • Territorial Corporate Body or Administrative Unit
      • Administrative Unit
      • Way, Border or Line
      • Building or Memorial
      • Country
      • Extraterrestrial Territory
    • Person
      • Differentiated Person
        • Collective Pseudonym
        • Gods
        • Literary or Legendary Character
        • Pseudonym
        • Royal or Member of a Royal House
        • Spirits
      • Undifferentiated Person
    • Family
Posted on 23 May @ 5:36 PM by Adrian Pohl | 0 comments
  2013/03/11
Visualizing lobid as part of the LOD cloud
Last modified by Felix Ostrowski, Mar 13, 2013 11:09

Felix Ostrowski has visualized some German bibliographic datasets available on the Data Hub using Gephi. Please leave a comment if a dataset is missing.

Approach

  • Manually scraped the "links:xyz" statements from the German LLD datasets and converted them to Turtle.
  • Converted the result into a Gephi graph with the Semantic Web Plugin, using a SPARQL CONSTRUCT to load the entire graph (see the sketch after this list).
  • Removed the dcat:Dataset node because rdf:type information is not of interest in this case. This last step was necessary because dhub:oanetzwerk has neither outgoing nor incoming links and thus would not have shown up if only the dc:related relationships had been constructed.
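The CONSTRUCT used to load the entire graph can be as minimal as this sketch:

    # load every triple of the scraped dataset-link graph into Gephi
    CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }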

Nodes sized by number of datasets linking in

Nodes sized by number of datasets linked to

Posted on 11 Mar @ 3:30 PM by Pascal Christoph | 2 comments
  2013/02/27
Some lobid.org stats
Last modified by Fabian Steeg, Feb 27, 2013 17:25
Labels: lobid-organisations, news, lobid-resources

We have compiled some statistics reflecting the usage of lobid.org:

Web access to lobid.org

The following graphic shows web server access to the following resources:

  1. titles, example
  2. organisations, example
  3. items, example
  4. SPARQL queries

Remarkable:

  • Title access in 2012-06 - more than 100k hits coming from Korea Advanced Institute of Science
  • SPARQL queries in 2012-08: more than 3M coming from our own department, "eating our own dogfood"

Downloads

The following graphic shows the downloads of the tar.bz2 dumps (of all the subdirectories) of lobid.org. The "updates" are not taken into account because they consist of many small files. (OK, they could be interesting, too, and will surely be provided later. Since these updates have existed for less than a month, there would not be much to see yet.)

Remarkable: downloads in 2012-03 suddenly increased - around 700 downloads coming from only one IP. The same is true for the following dates: there are around 40 to 60 different IPs, some of them downloading more than once a day, which doesn't make sense, since these dumps hardly change at all (maybe three times a year). So, yes, there is a need for a sitemap.xml in which we could indicate whether a file was updated (and where the files reside, because these URLs sometimes change - yes, this is not good practice, and we try to make these URLs stable).

Posted on 27 Feb @ 3:24 PM by Pascal Christoph | 0 comments
  2013/01/15
Fixed Daily data updates
Last modified by Pascal Christoph, Jan 15, 2013 12:33
Labels: lobid-resources, news

Our data dump updates broke some time ago - we have just fixed them.

There are now two daily update dumps [1][2]:

1. rdfmab[3] - pseudo-LOD, but with all the catalog data (aka "raw data").
2. lod[4] - true LOD. That's the basis for lobid.org.

What's missing, though, is a sitemap for the dumps. Such a sitemap would make it easier to consume the updates automatically, because the filenames may change - have a look at b3kat.

[1]http://datahub.io/dataset/hbz_unioncatalog/resource/7168ca3e-2528-4d78-b9ec-678679611aa6
[2]http://datahub.io/en/dataset/lobid-resources/resource/1919f1ca-ed2a-427a-bceb-3cf624fd1379
[3]http://datahub.io/dataset/hbz_unioncatalog
[4]http://datahub.io/en/dataset/lobid-resources

Posted on 15 Jan @ 11:45 AM by Pascal Christoph | 0 comments
  2013/01/14
Test server now running Ubuntu and 4store 1.1.5
Last modified by Pascal Christoph, Jan 14, 2013 15:58
Labels: news

We replaced the outdated RHEL 5.4 with a fresh Ubuntu. Thus, it was easy to install a new 4store 1.1.5, and so our test server now also provides SPARQL 1.1. This was a must-have, because when reindexing we switch to the test server to avoid downtime, but so far we were only able to provide the old SPARQL 1.0 with the old 4store (which, in turn, is not enough for projects that make use of lobid, e.g. LODUM).

Posted on 14 Jan @ 3:41 PM by Pascal Christoph | 0 comments
  2012/12/07
Back-linked to LODUM
Last modified by Pascal Christoph, Dec 07, 2012 17:06
Labels: lobid-resources, news

The LODUM project has linked to our lobid-resources service. So we could query their database with SPARQL to get the owl:sameAs links to our IDs and index them into our endpoint. Have a look at http://lobid.org/resource/HT016649061 (click on "AKA").
Thus, we gained more than 30k links, providing many new abstracts.

Some more details

This is the script - using SPARQL and some further filtering - to get the data we need:
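A sketch of what such a script could look like (the LODUM endpoint URL, the result format and the duplicate filter are assumptions, not the original script):

    #!/bin/sh
    # query LODUM's SPARQL endpoint (URL assumed) for owl:sameAs links into lobid.org
    ENDPOINT="http://data.uni-muenster.de/sparql"
    QUERY='
    CONSTRUCT { ?lodum <http://www.w3.org/2002/07/owl#sameAs> ?lobid }
    WHERE {
      ?lodum <http://www.w3.org/2002/07/owl#sameAs> ?lobid .
      FILTER ( STRSTARTS(STR(?lobid), "http://lobid.org/resource/") )
    }'
    curl -s -G "$ENDPOINT" -H "Accept: text/plain" --data-urlencode "query=$QUERY" > sameas.nt

    # "further filtering": keep only lobid IDs that exactly one LODUM resource claims
    # to be sameAs - multiple claims indicate work-level, not manifestation-level matches
    awk '{ print $3 }' sameas.nt | sort | uniq -u > unique-ids.txt
    grep -F -f unique-ids.txt sameas.nt > sameas-filtered.nt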

Filtering was necessary because LODUM makes some false owl:sameAs assumptions about lobid resources: the resources in question rather point to the same frbr:Work, not the same frbr:Manifestation.

What would be nice: if the LODUM data were open data, we could index the abstracts into our search engines. I have already asked for an open data license, and LODUM is considering it. Let's hope and see.

Posted on 07 Dec @ 4:52 PM by Pascal Christoph | 0 comments