We have decided to use the Bibliographic Ontology for our first attempt to convert our catalog data to Linked Data. The main motivation was to create comprehensible data that lines up with existing Linked Bibliographic Data such as that published by LIBRIS, the OpenLibrary and Mannheim University Library. We are also planning to release the same data using the RDA vocabularies.
Note: There are several alternative FRBR vocabularies available. We are using the version by Ian Davis et al. because of a naming problem in the current IFLA version: predicate names in that version have numbers as local name parts, which makes it impossible to serialize the data as RDF/XML (XML element names must not begin with a digit).
We have mapped the fields from the record-centric RDF/ISO2709 format to a resource-centric BIBO description as follows. Note that the original field names used below may contain wildcards for single characters (. as used in regular expressions).
- The URI of the resource to be described is derived from the identifier of the record, found in <rdfmab:field/001__a>.
- The title of the resource, found in <rdfmab:field/331._a>.
- The language of the resource, found in <rdfmab:field/037b_a>.
- Subject links. These are derived from several fields:
- The ISBN of the resource, found in <rdfmab:field/540._a>. The ISBN is deliberately provided as a string, not a URI, since it is the string that is the identifier, not some resource identified by <uri:ISBN:ISBN>. This conforms to the range defined in BIBO.
- The ISSN of the resource, found in <rdfmab:field/542._a>. The ISSN is deliberately provided as a string, not a URI, since it is the string that is the identifier, not some resource identified by <uri:ISSN:ISSN>. This conforms to the range defined in BIBO.
- The extent of the resource, usually the number of pages, as found in <rdfmab:field/433__a>.
- dcterms:issued: The year the resource was issued, as found in <rdfmab:field/425a_a>.
- The type of a resource is derived from several fields, possibly resulting in multiple types for the same resource. The current mapping is most likely over-simplified and will be the subject of further analysis for future releases:
- The volume number of the resource, found in <rdfmab:field/090_a>, which holds the sortable form. If this is not available, the descriptive form in field <rdfmab:field/089_a> is used.
- Fortunately, the original data already includes many links from subordinate to superordinate records, which can be used to link the corresponding resources:
- The <rdfmab:field/1..._9> fields contain the authority numbers of the authors of the resource. To preserve their order, an rdf:List is used instead of simply linking all authors directly via dc:creator. The downside is that the author lists are currently blank nodes and thus not handled ideally by generic Linked Data frontends such as Pubby. Note that there are basically two types of authority numbers in the data: those maintained by the DNB, which are available as Linked Data, and local hbz numbers, which are not. In the first case, the resulting link leads to the Linked Data service of the DNB; in the latter case the link unfortunately leads nowhere.
- The fields <rdfmab:field/412_a> and <rdfmab:field/410_a> contain the name and place of the publisher. To conform to the range of the dc:publisher predicate as defined in the DCMI Metadata Terms, we have introduced blank nodes for the publishers, typed as foaf:Organization. The place of the publisher is attached as another blank node via geo:location. That blank node is typed geo:SpatialThing and has the name of the place attached via geonames:name, since we lack a mapping of the place names to GeoNames identifiers. We are aware that this seems overly complicated, but we are trying to identify and properly model the entities referenced in the original data, even if that results in blank nodes in the first run. As soon as an authority file for publishers is available, we will try to link there. We might even have a look at the resulting blank nodes and see whether the information is clean enough to form the basis of such a file.
- In the current state of the raw data, holding information is only implicitly available. Since the records are segmented into packages by institution, we know that an institution is the frbr:owner of at least one frbr:Item of the described frbr:Manifestation. Since we currently do not have shelfmark information, those items are once again modelled as blank nodes.
Complete documentation of the fields in the RDF/ISO2709 version of the data is available. Unfortunately, the RDF/ISO2709 fields are not completely in line with this official documentation: our data passes through an interface based on MARC21 before it is published, and some fields are renamed in the process. We are working on either documenting the differences or using the proper fields.
The conversion results in 82,471,813 triples, which we have loaded into a 4store instance that provides a SPARQL endpoint. To serve the data as Linked Data, a Pubby Linked Data frontend is tied to that endpoint here. You can also download the entire dump.
Although we have released the raw data in N-Triples format, native RDF tools such as rdflib for Python have proved far too slow to handle such massive amounts of data. Regular expressions in Perl are much faster and are therefore used here. Because of the blank nodes explained above, the script outputs RDF in Turtle notation, so that blank node identifiers do not have to be generated.
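The idea behind the regex approach can be sketched in a few lines of Python (the actual script is in Perl); the input line format and subject URI here are simplified assumptions, not the real record layout:

```python
import re

# Match the title field 331 (the "." is a single-character wildcard,
# as noted in the mapping above) and emit one Turtle statement.
TITLE_RE = re.compile(r'<rdfmab:field/331._a>\s*"(.*)"')

def convert(line, subject):
    """Convert one (simplified) record line to a Turtle statement, or None."""
    m = TITLE_RE.search(line)
    if m:
        return f'<{subject}> dcterms:title "{m.group(1)}" .'
    return None

print(convert('<rdfmab:field/331a_a> "Example title"',
              'http://example.org/resource/HT000000001'))
```

Blank nodes would be emitted inline as anonymous `[ ... ]` groups in Turtle, which is exactly why that notation was chosen over N-Triples.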
Simple Perl regex-based conversion script
We have released those parts of the union catalog in which participating institutions have holdings. For each institution, the corresponding subset of records was extracted from the union catalog and packaged independently of any other records. This results in duplicate records for resources held by several institutions. Thus, the first step is to generate a list of unique files, which is then split up in order to process the data in parallel.
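The deduplication step might look like this in Python (the institution names and file names are hypothetical; the actual preparation uses shell tools):

```python
# The per-institution packages contain the same record file multiple
# times; keep each file only once, preserving first-seen order.
packages = {
    "inst_A": ["HT001.nt", "HT002.nt"],
    "inst_B": ["HT002.nt", "HT003.nt"],
}

seen = set()
unique_files = []
for inst, files in sorted(packages.items()):
    for f in files:
        if f not in seen:
            seen.add(f)
            unique_files.append(f)

print(unique_files)
```

The resulting list can then be split into chunks of equal size and fed to parallel conversion processes.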
Preparation: Create list of unique files
To generate the holding information, we simply generate the corresponding triples based on the file names in the data packaged by institution:
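A minimal sketch of this step, assuming a hypothetical resource namespace and the frbr:owner/frbr:Item modelling described above (frbr:exemplar links a manifestation to its item):

```python
def holding_triples(institution_uri, record_ids):
    """Emit one Turtle statement per record: the institution owns at
    least one item (an anonymous blank node) of the manifestation."""
    lines = []
    for rid in record_ids:
        manifestation = f"http://example.org/resource/{rid}"
        lines.append(
            f"<{manifestation}> frbr:exemplar "
            f"[ a frbr:Item ; frbr:owner <{institution_uri}> ] ."
        )
    return lines

for t in holding_triples("http://example.org/org/inst_A", ["HT001", "HT002"]):
    print(t)
```

Since the item is an anonymous blank node, no item identifiers need to be generated, consistent with the lack of shelfmark information.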