Using dcat for Open Bibliographic Data

Background

In March 2010 the first libraries in Germany, following the CERN library and the Ghent University library, started to publish catalog data (raw data dumps exported from the catalog) on the internet under an open license (for more information see the announcement).

Since then, two more libraries have followed this Open Data initiative, and more are to come.

We knew that opening up one's data isn't simply done by uploading it to a server and publishing an announcement. The description of the data plays an important part in finding people who can benefit from it and thus in maximizing its use. At first we quickly developed a scheme for describing the data in a table-like manner, see http://www.hbz-nrw.de/projekte/linked_open_data/english_version/. This scheme was only developed for transitory use and should be replaced by a better vocabulary in RDF. Describing open catalog exports with such a vocabulary would mean creating Linked Open Bibliographic Data at a first, not very granular level.

From April to August we worked on a concept for an infrastructure for Open Data from library catalogs (only in German), which included a search for a fitting vocabulary. That's when we had a deeper look at the Data Catalog Vocabulary. (We use the short name "dcat" for the vocabulary here, although it seems that name is no longer used now that the project is being further developed at the W3C.)

This page shows our first experiments with using dcat and documents the problems we had with it and its limitations. It is primarily intended as our feedback to the Data Catalog Vocabulary project, which further develops the dcat vocabulary. We used this draft vocabulary reference for our efforts.

Experimental descriptions

In the following you'll find some example descriptions of a catalog for open bibliographic data as well as of individual open datasets. The descriptions are written in the RDF serialization Turtle.

Description of the COBiD (Catalogue of Open Bibliographic Data)

This description is about the Catalogue of Open Bibliographic Data itself (which hasn't been built yet), which contains individual records describing datasets.
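The full Turtle of this description isn't reproduced here. The following minimal sketch only indicates what such a catalogue description could look like; the catalogue URI http://lobid.org/cobid, the publisher URI for the hbz and the second dataset URI are assumptions made for illustration, and we assume the dcat namespace http://www.w3.org/ns/dcat# from the draft reference.

    @prefix dcat:    <http://www.w3.org/ns/dcat#> .
    @prefix dcterms: <http://purl.org/dc/terms/> .

    # Hypothetical URI for the catalogue itself (COBiD hasn't been built yet).
    <http://lobid.org/cobid> a dcat:Catalog ;
        dcterms:title "Catalogue of Open Bibliographic Data (COBiD)"@en ;
        dcterms:description "A catalogue of openly licensed bibliographic datasets."@en ;
        dcterms:publisher <http://lobid.org/organisation/DE-605> ;  # assumed URI for the hbz
        # the individual records, i.e. the datasets the catalogue lists:
        dcat:dataset <http://lobid.org/cobid/hbz_DE-38> ,
                     <http://lobid.org/cobid/zbsport> .  # second URI made up for illustration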

COBiD record for describing the Cologne University Library data from the hbz union catalog

This is an example COBiD record describing bibliographic records with holdings of Cologne University Library (USB), extracted from the hbz union catalog.
(There are two corresponding descriptions. The initial description on the hbz webpages can be found here (it's the first one). The description in CKAN is part of this entry.)
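The record itself isn't reproduced here either; the following sketch only indicates its rough shape. Size, format, license and download URL are illustrative values or assumptions, not the actual values from the record, and the dataset URI is given without a file extension (whether the record should rather be identified as a .ttl file is discussed under "Lessons Learned" below).

    @prefix dcat:    <http://www.w3.org/ns/dcat#> .
    @prefix dcterms: <http://purl.org/dc/terms/> .

    <http://lobid.org/cobid/hbz_DE-38> a dcat:Dataset ;
        dcterms:title "Bibliographic records with holdings of Cologne University Library (USB), extracted from the hbz union catalog"@en ;
        dcterms:publisher <http://lobid.org/organisation/DE-38> ;  # the USB; predicate choice for illustration only
        dcterms:license <http://creativecommons.org/publicdomain/zero/1.0/> ;  # assumed license
        dcat:keyword "bibliographic data", "open data" ;
        dcat:size "3.3 GB" ;                      # plain literal, see the Notes below
        dcterms:accrualPeriodicity "irregular" ;  # literal object, see the dcat-specific notes below
        dcat:distribution [
            a dcat:Download ;
            dcat:accessURL <http://example.org/hbz_DE-38.tar.gz> ;  # placeholder URL
            dcterms:format "MAB2 in a tar.gz archive"                # see the format discussion below
        ] .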

COBiD record for describing the local export of the Central Library for Sports Science at the German Sports University Cologne

see also http://opendata.zbsport.de/

Notes

As you can see, dereferencing the organization URIs leads to an HTML or RDF description of the respective organization, e.g. http://lobid.org/organisation/DE-Kn41. For more information about this (also experimental) service see https://wiki1.hbz-nrw.de/pages/viewpage.action?pageId=1572888.

We haven't complied with the proposed usage of the property dcat:size but have only used a plain literal as its object.

Lessons Learned & Questions

General

  • Should individual records be identified as a file in a concrete RDF serialization like above (e.g. http://lobid.org/cobid/hbz_DE-38.ttl)? Or should there be a record URI identifying the abstract record (like http://lobid.org/cobid/hbz_DE-38) which serves the record in the desired serialization?
  • We need URIs for libraries and library centers. --> We tackled this problem with a first Linked Data based index of library institutions.
  • More URIs for databases, thesauri and classifications would make sense, e.g. for the authority files for names, subject headings and corporate bodies at the German National Library (above we used the literals "SWD" and "PND" for referencing the authority files).
  • URIs for collections: It is desirable to connect the data collections with the physical collections which are described by the data. Until now we don't have any formal descriptions of collections, only of the holding institutions. E.g., there is a URI for Cologne University Library (USB) (which is http://lobid.org/organisation/DE-38), but the more than 30 collections held by the USB don't have any URIs and aren't described in a uniform way yet. Libraries might describe their collections in the future on the basis of the Dublin Core Collections Application Profile.
  • URI scheme for collections: One possibility is to apply URIs of the form http://lobid.org/collection/{} to physical collections of resources as well as to collections of data describing physical resources. For a start we chose this approach (see the sketch after this list).
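A minimal sketch of this collection URI scheme follows. The collection identifier "usb-example", the typing as dcmitype:Collection and the linking predicates (dcterms:publisher, foaf:primaryTopic) are merely illustrative choices, not settled decisions:

    @prefix dcat:     <http://www.w3.org/ns/dcat#> .
    @prefix dcterms:  <http://purl.org/dc/terms/> .
    @prefix dcmitype: <http://purl.org/dc/dcmitype/> .
    @prefix foaf:     <http://xmlns.com/foaf/0.1/> .

    # A physical collection held by the USB; "usb-example" is a made-up identifier.
    <http://lobid.org/collection/usb-example> a dcmitype:Collection ;
        dcterms:title "An example collection held by Cologne University Library"@en ;
        # How to state the holding institution is still open; dcterms:publisher is a placeholder:
        dcterms:publisher <http://lobid.org/organisation/DE-38> .

    # The open dataset describing that collection, linked by an (illustrative) predicate.
    <http://lobid.org/cobid/hbz_DE-38> a dcat:Dataset ;
        foaf:primaryTopic <http://lobid.org/collection/usb-example> .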

dcat-specific

  • It would make sense to add to the description of a dataset a link to the respective OPAC, to connect the raw data to the traditional user interface for the data. Both the OPAC and the export are without question instantiations of the same dataset and might be attached as distributions to a dataset. The problem is that OPAC and raw data are, in this case, published under different licenses. As the licensing information is assigned to a dcat:Dataset we've got a problem: you can't attach differently licensed distributions to a dataset.
  • A predicate for the checksum of an archive would make sense.
  • A predicate for the number of records in a dcat:Dataset would make sense.
  • The distinction between "file format of the archive" and "format of the contained data" can't be represented with dcat. That might be desirable. E.g., one might use dcterms:format only to indicate the format of the contained data.
  • Regarding the distinction between the classes "dcat:Distribution" and "dcat:Download" see Ed Summers' commentary and suggestion for improvement. We support Ed's proposal.
  • Like Ed Summers (see http://www.w3.org/egov/IG/track/issues/39), we aren't sure whether there is a need for the class Distribution and its subclasses at all.
  • dcat uses the DC terms predicate dcterms:accrualPeriodicity. Its defined range is individuals of the class dcterms:Frequency, but we couldn't find a definition of instances of this class anywhere. For now we are using literals as objects of dcterms:accrualPeriodicity. This obviously could be improved.
  • A general idea: Define a dcat:Dataset as an abstract resource independent from its instantiation in time and in specific formats. This would mean attaching information about formats, release and modification dates as well as licensing information at distribution level. (Of course you'd have to keep the class dcat:Distribution for this.) In our case, for example, the release dates of the dataset and of its distribution differ. (A sketch of this modelling follows after this list.)
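To make the last point more concrete, here is a minimal sketch of such a distribution-level modelling. It also indicates how the licensing problem (OPAC vs. raw data export) and the archive/data format distinction mentioned above would become easier to handle. All URLs and dates are placeholders, and the CC0 license is an assumption:

    @prefix dcat:    <http://www.w3.org/ns/dcat#> .
    @prefix dcterms: <http://purl.org/dc/terms/> .
    @prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

    # The dataset as an abstract resource, independent of time and specific formats.
    <http://lobid.org/cobid/hbz_DE-38> a dcat:Dataset ;
        dcterms:title "Bibliographic records with holdings of Cologne University Library"@en ;
        dcat:distribution <http://example.org/hbz_DE-38.tar.gz> ,  # placeholder URLs
                          <http://example.org/opac> .

    # The raw data export: format, release date and license live at distribution level.
    <http://example.org/hbz_DE-38.tar.gz> a dcat:Distribution ;
        dcterms:issued "2010-03-12"^^xsd:date ;  # illustrative date
        dcterms:license <http://creativecommons.org/publicdomain/zero/1.0/> ;  # assumed license
        dcterms:format "MAB2" ;                  # format of the contained data
        dcterms:description "tar.gz archive containing MAB2 records"@en .  # archive format, as a stopgap

    # The OPAC as a second distribution of the same dataset, under different terms.
    <http://example.org/opac> a dcat:Distribution ;
        dcterms:license <http://example.org/opac-terms> .  # placeholder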

Labels: dcat, opendata, metadata

Comments
  1. Aug 27, 2010

    Fadi Maali says:

    Hi, (As a member of dcat team) thanks a lot for the insightful feedback on dcat!

    quite interesting to see dcat used in the bibliographic domain!

    I added an issue on the dcat issue tracker based on your feedback:

    http://www.w3.org/egov/IG/track/issues/43

    I will raise the other points in the next dcat meeting  (which should be soon).

    I have a question regarding: """The distinction between "file format of the archive" and "format of the containing data" can't be represented with dcat."""

    do you mean something  like a zipped CSV file (so that file format of the archive will be ZIP and format of data will be CSV)?

    Thanks,

    Fadi 

    1. Aug 27, 2010

      Adrian Pohl says:

      I have a question regarding: """The distinction between "file format of the archive" and "format of the containing data" can't be represented with dcat."""

      do you mean something like a zipped CSV file (so that file format of the archive will be ZIP and format of data will be CSV)?

      Yes, this is exactly what I mean.
