Thursday, December 3, 2009

Protecting your Linked Data

One of the highlights of the recent ISWC2009 was a tutorial on Legal and Social Frameworks for Sharing Data on the Web. As one who during the rise of "Web 1.0" was writing and presenting frequently on topics like Copyright for Cybernauts and is now seduced by the world of linked data, I've been considering how the legal, business and technical worlds will reconcile themselves in this new world, a world where value will come from joining networks of data together. Eric Hellman puts this nicely:

"Linked Data is the idea that the merger of a database produced by one provider and another database provided by a second provider has value much larger than that of the two separate databases..." (Eric Hellman, Databases are Services, NOT Content, Dec 2009)
The question is, what legal and technical strategies are available to a linked data provider to protect themselves as they pursue such a value proposition? The following post is an effort to try to rationalise this a bit more clearly.

I'm not a lawyer. I'm a technologist who has since the early 1990s immersed himself in the sometimes delicate, more often violent dance between technology, business and public policy that has been catalysed by the rise of the digital, networked environment. In particular I've been motivated by the question of how policies can, and more often can't, be systematically "implemented" by technologies --- as well as by the question of how technical architectures often enforce ad hoc policy regimes, inadvertently or otherwise (see esp. Lawrence Lessig's Code v2, the community update of Code and Other Laws of Cyberspace).

As an early (and perhaps idiosyncratic) player in the DRM industry, I quickly concluded that the only sustainable solution to the problem of communicating rights for creative works in the digital domain was to evolve an infrastructure of identifiers and metadata, which has been realised to a great extent by the rise in prominence of the DOI, accessible templates for rights communications (due in large part to Creative Commons), the emergence of a variety of metadata standards, and a standard data model (RDF) for associating metadata with objects. The more recent emergence of standards of practice for linked data will only help to further disambiguate the rights world, as these practices make the expression and transferral of content-descriptive metadata orders of magnitude easier.

I'm interested in questions concerning the communication of intellectual property rights for data shared through linked data mechanisms: What rights can be claimed? What are the best practices for claiming and transferring rights? What technical mechanisms exist --- in this case, specific vocabularies and protocols --- for communicating rights to metadata? The four thought leaders at the ISWC2009 LSFSDW tutorial have done a fairly complete job; this post is an attempt to summarise and/or interpret their messages, along with resources found elsewhere. I'd like to highlight pioneering work by the Science Commons, an offshoot of CC which has considered these questions specifically for scientific data. Also, in preparing this post I stumbled across some works that I pored over more than a decade ago, that now seem prescient! David Lanzotti and Doug Ferguson's thorough analysis circa 2006 shows that little has changed: IP protection for databases is nebulous territory.

Copyright does not apply to datasets: Most regimes hold that copyright applies only to original creative works. This means you can only claim copyright for works that are yours and which are "creative." This second piece means you cannot claim copyright on databases unless their structure and organisation are sufficiently creative; the US Supreme Court held that "sweat of the brow" is not sufficient to cross this threshold, and that copyright protections do not extend to non-creative accumulations of facts (cf. Feist, 1991).

The individual elements of a dataset might themselves be extensive and creative enough to merit copyright protection; we'll assume for this discussion that these are handled separately. In their FAQ the Open Data Commons nicely emphasises the difference between a dataset and the individual contents of that dataset, including text and images. Note also that the European Space Agency (ESA) web site includes a nice, concise explanation of the legal reasons why copyright cannot be applied to databases.

Intellectual property protection for datasets: The fact that copyright (generally) cannot be applied to datasets means that the Creative Commons body of work can't be applied directly; indeed CC specifically discourages it. But is there an IP regime that covers accumulated data? If not copyright, patent or trademark, then what? Circa 1996, database "owners" thought that a sui generis ("of its own kind") regime for protecting databases might proliferate, and in March 1996 the EU issued a Database Directive. International IP law requires reciprocal directives from member states, however, and the lack of adoption of this model around the world, most notably in the United States, means IP protection for datasets is still nebulous.

In principle there are no "default" protections for datasets as there are with copyright; providers must be proactive and declare their terms of use up front, whether they choose to waive all restrictions, impose a limited set focused on attribution, or impose more extensive limitations based on customised licenses. It is clearly in the interests of both providers and consumers of datasets to ensure that rights are explicitly stipulated up front, especially since a key value proposition of linked data is (as we are reminded above) the merger of graphs; for certain applications graphs from different sources must be merged together within a single store so that inference can be applied. A service agency must know up front whether triples from particular sources can be "thrown in the hopper," and even if there are exclusions.
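To make the "hopper" question concrete, here is a minimal Python sketch of a service checking each source's declared license before merging its triples. Everything in it is an assumption for illustration: the provider names, the `merge_permitted` function, the `ex:` identifiers and the choice of ODbL as the only cleared license are hypothetical, and triples are modelled as plain tuples rather than with a real RDF library.

```python
# Hypothetical: the set of license URLs this service has cleared for merging.
ODBL = "http://opendatacommons.org/licenses/odbl/1.0/"
MERGEABLE_LICENSES = {ODBL}


def merge_permitted(sources, mergeable=MERGEABLE_LICENSES):
    """Merge triples only from sources whose declared license is cleared.

    `sources` maps a source name to a dict with a 'license' (a URL or None)
    and 'triples' (an iterable of (subject, predicate, object) tuples).
    Returns the merged set of triples and the sorted names of excluded sources.
    """
    merged, excluded = set(), []
    for name, source in sources.items():
        if source.get("license") in mergeable:
            merged.update(source["triples"])
        else:
            # No declared terms, or terms not yet cleared: keep it out.
            excluded.append(name)
    return merged, sorted(excluded)


# Illustrative data only; neither provider nor its triples are real.
sources = {
    "provider-a": {"license": ODBL, "triples": [("ex:a", "ex:knows", "ex:b")]},
    "provider-b": {"license": None, "triples": [("ex:b", "ex:knows", "ex:c")]},
}
merged, excluded = merge_permitted(sources)
print(len(merged), excluded)
```

The point of the sketch is that a source with no stated terms is excluded by default, which is exactly why it pays for providers to declare their terms up front.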

Templates for expressing licensing terms: The Open Data Commons provides a template Open Database License (ODbL) that specifies Attribution and Share-alike terms:

This {DATA(BASE)-NAME} is made available under the Open Database License: http://opendatacommons.org/licenses/odbl/1.0/. Any rights in individual contents of the database are licensed under the Database Contents License: http://opendatacommons.org/licenses/dbcl/1.0/
The specific text of the ODbL license is quite extensive, but the gist of it is nicely summarised in the ODbL Plain Language Summary:
You are free: To Share...To Create...To Adapt...
As long as you: Attribute...Share-alike...Keep open...
(details of each stipulation omitted for simplicity)
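As a small illustration of using the template notice quoted above, the following Python sketch fills in the {DATA(BASE)-NAME} placeholder. The function name `odbl_notice` and the example database name are hypothetical; the notice wording and the two license URLs are taken from the template itself.

```python
ODBL_URL = "http://opendatacommons.org/licenses/odbl/1.0/"
DBCL_URL = "http://opendatacommons.org/licenses/dbcl/1.0/"


def odbl_notice(database_name):
    """Fill the ODbL template notice with a concrete database name."""
    return (
        f"This {database_name} is made available under the "
        f"Open Database License: {ODBL_URL}. Any rights in individual "
        f"contents of the database are licensed under the "
        f"Database Contents License: {DBCL_URL}"
    )


# "Example Gazetteer" is a made-up name used only for illustration.
notice = odbl_notice("Example Gazetteer")
print(notice)
```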

My point in dwelling on ODbL is not to argue that commercial providers should adopt it as is, but rather that they should consider adapting it; I'm holding it up as an exemplar for the explicit expression of terms of use for a dataset.

Expressing your rights to linked data as linked data: One of the things that has impressed me about Creative Commons is that its rights expressions were intended from the start to be modelled in RDF and machine-readable; indeed CC has created ccREL: the Creative Commons Rights Expression Language, which primarily uses the idea of embedded RDF (via RDFa) in content pages to communicate rights. A recent development is Creative Commons guidance on how ccREL and RDFa might be applied to "deploy the Semantic Web." Nathan Yergler's (excellent) OpenWeb 2008 presentation explains this well, but doesn't specifically deal with the linked data question. In particular, Nathan addresses CC+, a CC licensing model that allows providers to include a way for users to request rights beyond those stated in the basic CC license. Those who know me know what I'll say next: this is another step forward as we converge on Henry Perritt's ca. 1993 vision of permissions headers!
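To sketch what rights expressed "as linked data" might look like for a dataset, the Python snippet below emits two N-Triples statements: one pointing at the license and one pointing at a page where further rights can be requested, in the spirit of CC+. This is an illustration under stated assumptions, not ccREL's prescribed usage: pairing the Dublin Core `license` property with ccREL's `morePermissions` property is my own choice, and the function name `rights_triples` and the example.org URIs are hypothetical.

```python
# Real vocabulary terms; combining them this way is an assumption.
DCT_LICENSE = "http://purl.org/dc/terms/license"
CC_MORE_PERMISSIONS = "http://creativecommons.org/ns#morePermissions"


def rights_triples(dataset_uri, license_uri, more_permissions_uri=None):
    """Return N-Triples lines stating a dataset's license and, optionally,
    where rights beyond that license can be requested."""
    lines = [f"<{dataset_uri}> <{DCT_LICENSE}> <{license_uri}> ."]
    if more_permissions_uri:
        lines.append(
            f"<{dataset_uri}> <{CC_MORE_PERMISSIONS}> <{more_permissions_uri}> ."
        )
    return lines


# Hypothetical dataset and permissions page, for illustration only.
triples = rights_triples(
    "http://example.org/dataset",
    "http://opendatacommons.org/licenses/odbl/1.0/",
    more_permissions_uri="http://example.org/dataset/commercial-terms",
)
print("\n".join(triples))
```

Because the statements are themselves triples, they can travel with the data they describe, so a consumer merging graphs can read the terms from the same store.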
