Converting the Open Data from the hbz to BIBO


The most up-to-date version of the documentation can be found under LOD Mapping 201107.

Vocabularies

We have decided to use the Bibliographic Ontology (BIBO) for our first attempt to convert our catalog data to Linked Data. The main motivation was to create comprehensible data that lines up with existing Linked Bibliographic Data such as that published by LIBRIS, the OpenLibrary and Mannheim University Library. We are also planning to release the same data using the RDA vocabularies.

Namespaces used

Note: There are several alternative FRBR vocabularies available. We are using the version by Ian Davis et al. because of a naming problem in the current IFLA version: predicate names in that version have numbers as local name parts, which makes it impossible to serialize the data as RDF/XML.
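The serialization problem mentioned in the note can be illustrated with a small check. RDF/XML writes each predicate as an XML element name, and an XML name must not start with a digit. The sketch below uses a simplified, ASCII-only approximation of that naming rule; the function name and the purely numeric example "1001" are hypothetical and only serve as illustration.

```python
import re

# Simplified, ASCII-only approximation of the XML rule for element local names:
# the first character is a letter or underscore, later characters may also be
# digits, dots or hyphens. The real XML rule admits many non-ASCII characters too.
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def usable_as_rdfxml_element(local_name: str) -> bool:
    """Return True if local_name could be written as an XML element local name."""
    return NCNAME_ASCII.fullmatch(local_name) is not None


print(usable_as_rdfxml_element("exemplar"))  # a name that starts with a letter
print(usable_as_rdfxml_element("1001"))      # hypothetical numeric local name
```

Turtle and N-Triples write predicates as full or prefixed IRIs rather than element names, so they are not affected by this particular restriction.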

Mapping of fields

We have mapped the fields from the record-centric RDF/ISO2709 format to a resource-centric BIBO description as follows. Note that the original field names used below may contain wildcards for single characters (. as used in regular expressions).

Resource-URI
The URI of the resource that is to be described is derived from the identifier of the record, found in <rdfmab:field/001__a>.
dc:title
The title of the resource, found in <rdfmab:field/331._a>.
dc:language
The language of the resource, found in <rdfmab:field/037b_a>.
dc:subject
Subject links. These are derived from several fields:
  • <rdfmab:field/9..__9> fields contain identifiers from the subject authority file of the German National Library (DNB), which have been available as Linked Data since April 2010.
  • <rdfmab:field/700b_a> fields contain DDC notations. In order to link to the Linked Data version of the classification, these numbers are truncated to the first three levels. If the full classification were available, we would be very happy to link to deeper levels.
bibo:isbn
The ISBN of the resource, found in <rdfmab:field/540._a>. The ISBN is deliberately provided as a string, not a URI, since it is the string that is the identifier, not some resource identified by <uri:ISBN:ISBN>. This conforms to the range defined in the BIBO.
bibo:issn
The ISSN of the resource, found in <rdfmab:field/542._a>. The ISSN is deliberately provided as a string, not a URI, since it is the string that is the identifier, not some resource identified by <uri:ISSN:ISSN>. This conforms to the range defined in the BIBO.
dc:extent
The extent of the resource, usually the number of pages, as found in <rdfmab:field/433__a>.
dcterms:issued
The year the resource was issued, as found in <rdfmab:field/425a_a>.
rdf:type
The type of a resource is derived from several fields, so the same resource may end up with multiple types. The current mapping is most likely over-simplified and will be analysed further for future releases:
  • if the value of <rdfmab:field/050> contains an a at the first position, the resource is typed as dc:BibliographicResource.
  • if the value of <rdfmab:field/051> contains an m at the first position, the resource is typed as bibo:Book.
  • all resources are generally typed as frbr:Manifestation.
bibo:volume
The volume number of the resource, found in <rdfmab:field/090_a>, which holds the sortable form. If this is not available, the descriptive form in field <rdfmab:field/089_a> is used.
dc:isPartOf
Fortunately, the original data already includes many links from subordinate to superordinate records which can be used to link the corresponding resources:
  • <rdfmab:field/010__a> contains the record-id of a direct superordinate
  • <rdfmab:field/453__a> contains the record-id of the first series title
  • <rdfmab:field/599__a> contains the record-id of the record describing the journal that this resource is published in.
bibo:authorlist
The <rdfmab:field/1..._9> fields contain authority numbers of the authors of the resource. To preserve the order, an RDF list is used instead of simply linking all authors directly via dc:creator. The downside is that the author lists are currently blank nodes and thus not handled ideally by generic Linked Data displays such as Pubby. Note that there are basically two types of authority numbers in the data: those maintained by the DNB (which are available as Linked Data) and local hbz numbers, which are not available as Linked Data. In the first case, the resulting link leads to the Linked Data Service of the DNB; in the latter case the link unfortunately leads nowhere.
dc:publisher
The fields <rdfmab:field/412_a> and <rdfmab:field/410_a> contain the name and place of the publisher. To conform to the range of the dc:publisher predicate as defined in the DCMI Metadata Terms, we have introduced blank nodes for the publishers, typed as foaf:Organisation. The place of the publisher is attached as another blank node via geo:location. That blank node is typed geo:SpatialThing and has the name of the place attached by geonames:name, since we lack a mapping of the place names to geonames-identifiers. We are aware that this seems overly complicated, but we are trying to identify and properly model the entities that are referenced in the original data, even if that results in blank nodes in the first run. As soon as an authority file for publishers is available, we will try to link there. We might even have a look at the resulting blank nodes and see if the information is clean enough to form the basis of such a file.
frbr:exemplar
In the current state of the raw data, holding information is only implicitly available. Since the records are segmented into packages by institution, we know that an institution is the frbr:owner of at least one frbr:Item of the described frbr:Manifestation. Since we currently do not have shelfmark (call number) information, those items are once again modelled as blank nodes.
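Three of the simpler rules in the mapping above (DDC truncation, type derivation and the volume fallback) can be sketched in Python as follows. This is a minimal illustration, not the conversion script itself: the function names and sample values are hypothetical, and reading "the first three levels" as the first three digits of the notation is an assumption.

```python
from typing import List, Optional


def truncate_ddc(notation: str) -> str:
    """Cut a DDC notation down to its first three characters.

    Assumption: 'the first three levels' means the three digits before the dot.
    """
    return notation.strip()[:3]


def resource_types(field_050: Optional[str], field_051: Optional[str]) -> List[str]:
    """Derive the rdf:type values from fields 050 and 051."""
    types = ["frbr:Manifestation"]            # every resource gets this type
    if field_050 and field_050[0] == "a":     # 'a' at the first position of 050
        types.append("dc:BibliographicResource")
    if field_051 and field_051[0] == "m":     # 'm' at the first position of 051
        types.append("bibo:Book")
    return types


def volume(sortable_form: Optional[str], descriptive_form: Optional[str]) -> Optional[str]:
    """Prefer the sortable form and fall back to the descriptive form."""
    return sortable_form or descriptive_form


# Hypothetical sample values, for illustration only.
print(truncate_ddc("943.086"))
print(resource_types("a", "m"))
print(volume(None, "Bd. 2"))
```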

Complete documentation of the fields found in the RDF/ISO2709 version of the data is available. Unfortunately, the RDF/ISO2709 fields are not completely in line with this official documentation, because our data passes through a MARC21-based interface before it is published and some fields are renamed in the process. We are working on either documenting the differences or using the proper fields.

The resulting model

Infrastructure

The conversion results in 82,471,813 triples, which we have loaded into a 4store instance that provides a SPARQL endpoint. To serve the data as Linked Data, there is a Pubby Linked Data Frontend tied to that endpoint here. You can also download the entire dump.

Conversion process

Although we have released the raw data in N-Triples format, native RDF tools such as rdflib for Python have proved far too slow to handle this amount of data. Regular expressions in Perl are much faster and are therefore used here. Because of the blank nodes explained above, the script outputs RDF in Turtle notation so that blank node identifiers do not have to be generated.
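The regex-based approach can be illustrated with a short sketch. It is written in Python rather than Perl and is not the script used for the conversion; the pattern is a simplification that does not cover the full N-Triples grammar, and the sample line with its example.org URIs is hypothetical.

```python
import re
from typing import Optional, Tuple

# One N-Triples statement per line: subject, predicate, object, final dot.
TRIPLE = re.compile(
    r"^\s*(<[^>]*>|_:\S+)"   # subject: IRI or blank node label
    r"\s+(<[^>]*>)"          # predicate: IRI
    r"\s+(.+?)"              # object: IRI, blank node label or literal
    r"\s*\.\s*$"             # terminating dot
)


def split_triple(line: str) -> Optional[Tuple[str, str, str]]:
    """Split one N-Triples line into its three terms, or return None."""
    match = TRIPLE.match(line)
    return match.groups() if match else None


# Hypothetical input line, for illustration only.
sample = '<http://example.org/r/1> <http://example.org/field/331_a> "A title. With a dot" .\n'
print(split_triple(sample))
```

Working line by line like this avoids building an in-memory graph, which is presumably where the speed advantage over a general-purpose RDF library comes from.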

Simple perl-regex based conversion script

Preliminary steps

We have released those parts of the union catalog that participating institutions have holdings in. For each institution, the corresponding subset of records was extracted from the union catalog and packaged independently of any other records. This results in duplicate records for resources held by several institutions. Thus, the first step was to generate a list of unique files. This list is then split up in order to process the data in parallel.
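The two preliminary steps can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the original tooling: it assumes that each record is a file named after its record identifier inside a directory per institution, so that equal file names mean duplicate records. The paths and function names are hypothetical.

```python
import os
from typing import Iterable, List


def unique_by_filename(paths: Iterable[str]) -> List[str]:
    """Keep the first path seen for every distinct file name."""
    seen = set()
    unique = []
    for path in paths:
        name = os.path.basename(path)
        if name not in seen:
            seen.add(name)
            unique.append(path)
    return unique


def split_into_chunks(items: List[str], n_chunks: int) -> List[List[str]]:
    """Distribute items round-robin over n_chunks lists for parallel processing."""
    return [items[i::n_chunks] for i in range(n_chunks)]


# Hypothetical layout: <institution>/<record-id>.nt
paths = ["inst-a/HT001.nt", "inst-b/HT001.nt", "inst-b/HT002.nt", "inst-c/HT003.nt"]
unique = unique_by_filename(paths)
print(unique)
print(split_into_chunks(unique, 2))
```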

Preparation: Create list of unique files

Invoking the script

Generating holdings information

To generate the holding information, we simply generate the corresponding triples based on the file names in the data packaged by institution:
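A minimal sketch of this step, assuming a layout of one directory per institution that contains one file per record. The base URIs, the path layout and the function name are hypothetical, and the frbr: prefix is assumed to be declared elsewhere in the Turtle output.

```python
import os

RESOURCE_BASE = "http://example.org/resource/"          # hypothetical base URI
ORGANISATION_BASE = "http://example.org/organisation/"  # hypothetical base URI


def holding_turtle(path: str) -> str:
    """Build one Turtle statement from a path like '<institution>/<record-id>.nt'."""
    institution, filename = path.split("/")[-2:]
    record_id = os.path.splitext(filename)[0]
    # The item is written as an anonymous blank node ([ ... ]), so no
    # blank node identifier has to be generated.
    return (
        f"<{RESOURCE_BASE}{record_id}> frbr:exemplar "
        f"[ a frbr:Item ; frbr:owner <{ORGANISATION_BASE}{institution}> ] ."
    )


print(holding_turtle("packages/inst-a/HT001.nt"))
```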

Importing into 4store

  1. Sep 22, 2010

    Owen Stephens says:

    I note that you've hung the place of publication off the publisher, and wonder if this is an issue. Clearly the 'place of publication' is linked to the publisher in some way - they'll have to have some kind of operating address I guess - but it also feels like this is a direct property of the published item as well, and having a direct link may well be beneficial. In the latest modelling from the British Library, they use the proposed isbd:hasPlaceOfPublicationProductionDistribution property. I'm not particularly keen on this - partly because it doesn't exist yet, but mainly because I'm not sure it is sensible to limit this to a 'bibliographic' type property (many things can have a place of production). Any thoughts on this?

    1. Jul 28, 2011

      Pascal Christoph says:

      Hi Owen, sorry for the "late" answer... we think it makes sense to make this a direct property of the manifestation itself, mainly because the place of the publisher may change but the place of publication cannot. If we had URIs for publishers (which would be very nice indeed!), the place of publication could therefore not be derived from the place of the publisher.
      Do you know a better property for this? http://iflastandards.info/ns/isbd/elements does not work. Also, do we really need an rdfs:label and thus a bnode?
