Wednesday, December 16, 2009

Recent Efforts toward Linked Multimedia Metadata

Recently I've been "having a think" on issues ranging from rights expression for datasets to realising the value of linked data, but frankly I've felt that something is missing; even with scientific and government linked datasets going online, a voice inside me wonders if the stakes are still (arguably) too low to really shake things up. I've been wondering what kind of data we haven't been hearing about --- the kind of data that, if it were published according to linked data principles, would surely lead to the emergence of outrageously cool applications, demonstrate the inherent value of the linked data approach, and perhaps even test some interesting new monetisation models. The area that immediately came to mind was multimedia metadata, especially semantic metadata for video and audio content.

Several recent venues have focused on the general topic of generating, publishing and using semantic multimedia metadata, including the Oct-Dec 2009 IEEE Multimedia Magazine special issue on Multimedia Metadata and Semantic Management, and SAMT2009: The 4th International Conference on Semantic and Digital Media Technologies (3-4 Dec 2009; Graz, Austria). Both of these are "powered" by members of the Multimedia Metadata Community, an outgrowth of the MPEG-7 and MPEG-21 worlds that "brings together experts from research and industry in the area of multimedia meta data interoperability for collaborative working environments." Finally, since 2008 the W3C has been host to its Video in the Web activity; within this the Media Annotations Working Group is developing an ontology and API to facilitate cross-community sharing and use of multimedia metadata in the Web.

IEEE Multimedia (Oct-Dec 2009): This special issue features six research articles focused on different facets of the "semantic management of multimedia and multimedia metadata," ranging from retrieval and processing to consumption and presentation. Of the six, perhaps the first two are most relevant in today's linked data environment:

  • "Managing and Querying Distributed, Multimedia Metadata." This article advocates the use of a centralized metadata résumé --- a condensed, automatically-constructed version of the larger metadata set --- for locating content on remote servers. The authors demonstrate the advantages of their approach using conventional semweb technologies to represent and query semantic metadata.
  • "Semantic MPEG Query Format Validation and Processing." The authors present their semantic validation of MPEG Query Format (MPQF) queries and their implementation of a practical MPQF query engine over an Oracle RDBMS. The article introduces methods for evaluating MPQF semantic-validation rules not expressed by syntactic means within the XML schema. The authors highlight their prototype implementation of an MPQF-capable processing engine using several query types on a set of MPEG-7 based image annotations.
  • "Diversifying Image Retrieval with Affinity-Propagation Clustering on Visual Manifolds." The authors describe a post-processing subsystem for retrieval systems that improves the diversity of results presented to users. Image retrieval systems typically focus on the similarity between the retrieval and sample images, where the relevance of the retrieval results is considered but the diversity is neglected. Ideally, retrieval results should contain a diverse array of items representing a variety of subtopics. This article presents a method for removing duplicate images from a "top 20" list, replacing them with images representing new subtopics.
  • "A Media Value Chain Ontology for MPEG-21." The authors have created a semantic representation of intellectual property derived from MPEG-21 Part 19. Their model defines the minimal set of types of intellectual property, the roles of users interacting with them, and the relevant actions regarding intellectual property law. The article is a helpful guide to the standardization efforts, with its many examples and useful insight into the multimedia value chain.
  • "Using Social Networking and Collections to Enable Video Semantics Acquisition." The authors consider media production, acquisition, and metadata gathering, the first elements of the multimedia value chain. Methods from video annotation and social networking are brought together to solve problems associated with gathering metadata that describes user interaction, usage, and opinions of video content. Individual user-interaction metadata is aggregated to provide semantic metadata for a given video. Coolness alert: The authors have successfully implemented their model in a Flex-based Facebook application!
  • "A Web-Based Music Lecture Database Framework." This article describes semantic audio authoring and presentation for Web-published music lectures. The authors propose a dynamic programming-based algorithm for MIDI-to-Wave alignment to explore the temporal relations between MIDI and the corresponding performance recording. The synchronized MIDI and wave can be attached to many kinds of teaching materials where synchronized presentations can add value.

SAMT'09: Nearly 15 years ago I had the good fortune to present my early rights metadata research at EDMEDIA'95 in Graz (Austria); visiting the conference web site this weekend, especially seeing the real-time image of the historic "Uhrturm" on the hill high above the city, brought back a flood of fond memories! The topics of the three tutorials offered at SAMT'09 demonstrate that current research has definitely taken a turn toward getting multimedia into the Web. (Unfortunately, only slides from the first are currently available):

  • "Web of Data in the Context of Multimedia (WoDMM)." How multimedia content can be integrated into the Web of Data and how users and developers can consume and benefit from linked data. (slides)
  • "MPEG Metadata for Context-Aware Multimedia Applications (MPEG)." Overview of MPEG metadata formats that enable the development and deployment of content- and context-aware multimedia applications.
  • "A Semantic Multimedia Web: Create, Annotate, Present and Share your Media (SemMMW)." How multimedia metadata can be represented and attached to the content it describes within the context of established media workflow practices, and how users can benefit from a Web of Data containing more formalized knowledge.

For much more information, see the Proceedings from the 20th International Workshop of the Multimedia Metadata Community on Semantic Multimedia Database Technologies (SeMuDaTe'09).

Metadata Standards for the Web of Data: Finally, research such as that described above has led to progress on the standards front. As the IEEE Multimedia guest editors note in their foreword, since 2008 there has been quiet but steady progress within the W3C's Video in the Web activity, which was chartered to make video a first-class citizen of the Web by creating an architectural foundation that by taking full advantage of the Web's underlying principles will enable people to create, navigate, search, link and distribute video... Of its three working groups, the editors highlight the Media Annotations Working Group as the one motivated by progress in RDF and topic maps, and as the group that appears most aligned with emerging linked data activities.

In their foreword, the IEEE Multimedia editors provide a very nice summary of the core problem with multimedia metadata, and thus the motivation for the W3C efforts:

Most of the standards are tailored to specific application domains. Examples include European Broadcasting Union P/Meta 2.0 for broadcasting; TV-Anytime and SMPTE Metadata Dictionary for TV; and MPEG-21 for the delivery chain of multimedia and technical aspects (such as EXIF). These standards exhibit a different semantic level of detail in their descriptions (from simple keywords to regulated taxonomies and ontologies). Only some of the standards are general purpose, for instance MPEG-7...

Coolness is on the Horizon: This rather lengthy posting is merely a sampling of works-in-progress aiming not only to put multimedia metadata on the Web but, more importantly, to establish such metadata as a useful and valuable part of the Web. Combined with such visionary efforts as the revamped, linked data-driven BBC web site, these make me increasingly confident that a new generation of linked data applications is around the corner, fueled this time by datasets that add video and audio to the semantic mix. Bring it on!

Friday, December 11, 2009

Scale-free Networks and the Value of Linked Data

Kingsley Idehen of OpenLink Software and others on the Business of Linked Data (BOLD) list have been debating a value proposition for linked data via Twitter (search for #linkeddata) and email. The discussion has included useful iterations on various "elevator pitches" and citations of recent successes, especially the application of GoodRelations e-commerce vocabularies at Best Buy. After some deep thought I decided to take the question of value in a different direction and to consider it from the perspective of the science of networks, especially with reference to the works of Albert-László Barabási, director of the Center for Complex Network Research and author of Linked: The New Science of Networks. I'd like to test the idea here that data sharing between organisations based on linked open data principles is the approach most consistent with the core principles of a networked economy. I believe that the linked data model best exploits "networking thinking" and maximizes the organisation's ability to respond to changes in relationships within the "global graph" of business. Using Barabási as a framework, linked data is the approach that most embodies a networked view of the economy from the macro- to the micro-economic level, and therefore best empowers the enterprise to understand and leverage the consequences of interconnectedness.

As has been noted numerous times elsewhere, the so-called Web of Data is perhaps the web in its purest form. Following Tim Berners-Lee's principles or "rules" as stated in his Linked Data Design Issues memo from 2006, we have a very elegant framework for people and especially machines to describe the relationships between entities in a network. If we are smart about how we define those links and the entities we create to aggregate those links --- the linked datasets we create --- we can build dynamic, efficiently adaptive networks embodying the two laws that govern real networks: growth and preferential attachment. Barabási illustrates these two laws with an example "algorithm" for scale-free networks in Chapter 7 of Linked. The critical lessons are (a) that networks must have a means to grow --- there must not only be links, but the ability to add links --- and (b) that networks must provide some mechanism for entities to register their preference for other nodes by creating links to the more heavily-linked nodes. Preferential attachment ensures that the converse is also true: entities will "vote with their feet" and register their displeasure with nodes by eliminating links.
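To make the two laws concrete, here is a minimal sketch in Python (my own illustration, not Barabási's code; the seed network and parameters are arbitrary choices) of the growth-plus-preferential-attachment recipe:

import random

def grow_network(total_nodes, m=2):
    """Grow a scale-free network: each new node (growth) links to m
    existing nodes chosen in proportion to their degree (preferential
    attachment)."""
    edges = [(0, 1), (0, 2), (1, 2)]  # small seed so early choices exist
    for new_node in range(3, total_nodes):
        # A node appears in this list once per link it holds, so a
        # uniform random choice is degree-proportional selection.
        endpoints = [n for edge in edges for n in edge]
        targets = set()
        while len(targets) < m:
            targets.add(random.choice(endpoints))
        edges.extend((new_node, t) for t in targets)
    return edges

degree = {}
for a, b in grow_network(1000):
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1
print(sorted(degree.values(), reverse=True)[:10])  # the emerging hubs

Run it a few times: a handful of heavily-linked hubs emerge while most nodes keep only their original few links, which is the signature of a scale-free network.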

In real networks, the rich get richer. In the Web, the value is inherent in the links. Google's PageRank merely reinforced the "physical" reality that the most valuable properties in the Web of Documents are those resources that are most heavily linked-to. Those properties add further value if they in turn provide useful links to other resources. Properties that are sensitive to demand and can adapt to the preferences of their consumers, especially by aggregating links to more resources that compound their value and distinguish them from other properties, become especially valuable and are considered hubs.
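For the curious, the idea behind PageRank can be sketched in a few lines of Python (a toy power iteration of my own, not Google's algorithm as deployed; the link structure is invented):

def pagerank(links, damping=0.85, iterations=50):
    """Each node's rank is fed by the ranks of the nodes linking to it,
    so heavily linked-to resources accumulate value."""
    nodes = set(links) | {t for ts in links.values() for t in ts}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        fresh = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for source, targets in links.items():
            for target in targets:
                fresh[target] += damping * rank[source] / len(targets)
        rank = fresh
    return rank

links = {"a": ["hub"], "b": ["hub"], "c": ["hub", "a"], "hub": ["a", "b"]}
print(sorted(pagerank(links).items(), key=lambda kv: -kv[1]))

The "hub" node, being most heavily linked-to, ends up with the highest rank.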

Openness is important. At this point it is tempting to jump to the conclusion that Tim Berners-Lee's four principles are all we need to create a thriving Web of Data, but this would be premature; Sir Tim's rules are necessary but not sufficient conditions. Within any "space" where Webs of Data are to be created, whether global or constrained within an organisation, the network must embody the open world assumption as it pertains to the web: when datasets or other information models are published, their providers must expect them to be reused and extended. In particular this means that entities within the network, whether powered by humans or machines, must be free to arbitrarily link to (make assertions about) other entities within the network. The "friction" of permission in this linking process must approximate zero.
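The zero-friction point is easy to demonstrate: in RDF, nothing stops a third party from unilaterally asserting a link to a resource minted by someone else. A tiny sketch using the rdflib Python library (the URIs here are hypothetical, and note that no permission from either party is required):

from rdflib import Graph, URIRef
from rdflib.namespace import DCTERMS

g = Graph()
mine = URIRef("http://myOrganisation.org/myDataset")  # hypothetical
theirs = URIRef("http://dbpedia.org/resource/Graz")   # someone else's entity
g.add((mine, DCTERMS.subject, theirs))  # an arbitrary, unilateral link
print(g.serialize(format="turtle"))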

Don't reinvent and don't covet! The extent of graphs that are built within organisations should not stop at their boundaries; as the BBC has shown so beautifully with their use of linked data on the revamped BBC web site, the inherent value of their property was increased radically by not only linking to datasets provided elsewhere, openly on the "global graph," but also by enabling reuse of their properties. The BBC's top-level principles for the revamped site are all about openness and long-term value:


The site has been developed against the principles of linked open data and RESTful architecture where the creation of persistent URLs is a primary objective. The initial sources of data are somewhat limited but this will be extended over time. Here's our mini-manifesto: Persistence...Linked open data...RESTful...One web

The BBC has created a valuable "ecosystem"; their use of other resources, especially MusicBrainz and DBPedia, has not only made the BBC site richer but in turn has increased the value of those properties. And those properties will continue to increase in value; by the principle of preferential attachment, every relationship "into" a dataset by valuable entities such as the BBC in turn increases the likelihood that other relationships will be established.

Links are not enough. It should be obvious that simply exposing datasets and providing value-added links to others isn't enough; as Eric Hellman notes, dataset publishers must see themselves as service providers who add value beyond simply exposing data. Some will add value to the global graph by gathering, maintaining, and publishing useful datasets and fostering a community of users and developers; others will add value by combining datasets from other services in novel ways, possibly decorated with their own. Eric has argued that the only winners in the linked open data space have indeed been those who have provided such merged datasets as a service.

Provide value-adding services and foster community. I would argue that dataset providers asking how they might realise the full value potential of publishing their datasets on the Web should examine whether, based on the principles I've outlined above, they have done everything they can to make their datasets part of the Web (rather than merely "on" the web) and have truly added value to the global graph. Do they view themselves as a service? Have they made their datasets as useful and easy-to-use as possible? Have they provided the best possible community support, including wikis and other mechanisms? Have they fully documented their vocabularies? Have they clearly defined any claimed rights, and in particular have they considered adopting open data principles?

Tuesday, December 8, 2009

Linking Rights to Aggregations of Data (Part 2)

In my background research for today's entry I discovered that the smart people at Talis, especially Ian Davis, have been working the problem I outlined in Linking Rights to Aggregations of Data (Part 1). Specifically, back in July 2009 Ian proposed WAIVER: A vocabulary for waivers of rights. In Ian's words,

(The WAIVER) vocabulary defines properties for use when describing waivers of rights over data and content. A waiver is the voluntary relinquishment or surrender of some known right or privilege. This vocabulary is designed for use with the Open Data Commons Public Domain Dedication and License and with the Creative Commons CC-0 waiver

In his July 2009 post Linked Data and the Public Domain Ian argues for providers to unambiguously declare their datasets public domain, and explains how to use the WAIVER vocabulary to do this in the context of a voID description of a dataset. (See also this email discussion thread on the issue involving several of the thought leaders in this area.) Ian provides the following example, which I repeat here to illustrate (a) use of voID to describe a dataset named "myDataset," (b) use of the wv:waiver property to link the dataset to the Open Data Commons PDDL waiver, (c) use of the wv:declaration property to include a human-readable declaration of the waiver, and (d) use of the wv:norms property to link the dataset to the community norms he suggests, ODC Attribution and Share-alike.


<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/terms/"
xmlns:wv="http://vocab.org/waiver/terms/"
xmlns:void="http://rdfs.org/ns/void#">
<void:Dataset rdf:about="http://myOrganisation.org/myDataset">
<dc:title>myDataset</dc:title>
<wv:waiver rdf:resource="http://www.opendatacommons.org/odc-public-domain-dedication-and-licence/"/>
<wv:norms rdf:resource="http://www.opendatacommons.org/norms/odc-by-sa/" />
<wv:declaration>
To the extent possible under law, myOrganisation
has waived all copyright and related or neighboring rights to
myDataset
</wv:declaration>
</void:Dataset>
</rdf:RDF>

WAIVER and OAI-ORE: As I proposed in Part 1, we should be able to combine the voID and OAI-ORE approaches. The only conceptual difference is that, under OAI-ORE guidelines, the RDF file shown above would be treated as the resource map for the aggregation URI (in this example, "http://myOrganisation.org/myDataset") and would have a URI of its own (perhaps "http://myOrganisation.org/myDataset.rdf").
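To sketch what that combination might look like in practice, here is how one might build the resource map with the rdflib Python library (my own mockup, not an official recipe; the ORE terms come from the OAI-ORE vocabulary, and the URIs follow Ian's example):

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")
WV = Namespace("http://vocab.org/waiver/terms/")
VOID = Namespace("http://rdfs.org/ns/void#")

agg = URIRef("http://myOrganisation.org/myDataset")      # the aggregation
rem = URIRef("http://myOrganisation.org/myDataset.rdf")  # its resource map

g = Graph()
g.add((rem, RDF.type, ORE.ResourceMap))
g.add((rem, ORE.describes, agg))
g.add((agg, RDF.type, ORE.Aggregation))
g.add((agg, RDF.type, VOID.Dataset))
# The rights statements live in the resource map but are about the aggregation:
g.add((agg, WV.waiver, URIRef("http://www.opendatacommons.org/odc-public-domain-dedication-and-licence/")))
g.add((agg, WV.norms, URIRef("http://www.opendatacommons.org/norms/odc-by-sa/")))
g.add((agg, WV.declaration, Literal("To the extent possible under law, myOrganisation has waived all copyright and related or neighboring rights to myDataset")))
print(g.serialize(format="xml"))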

What about other rights? It is critically important for the reader to understand that Ian's example (repeated above) only shows how to declare a waiver of rights, which by its nature is intended to promote the reuse of data based on open principles. Today, this is mostly what the linked data world has focused on, but as the NYTimes open data experiment is showing us, providers will want to assert rights where they can. In a future post I'll apply what we've learned so far to consider approaches for declaring dataset rights in legal regimes where this is actually possible.

Monday, December 7, 2009

Linking Rights to Aggregations of Data (Part 1)

In my previous post Protecting your Linked Data I considered the dual questions of what legal regimes are available to linked data providers for the protection of their published datasets, and what technical frameworks and best practices exist, especially within the realm of RDF and linked data, to make such rights assertions. In this (shorter!) post I begin to consider an attribution scheme that comes to mind on the heels of discussions on The New York Times Linked Open Data Community list: using named graphs (see also here), and specifically the OAI-ORE data model, to associate specific rights with aggregations of resources.

What's the problem? Given a set -- an aggregation -- of data assertions, how might we properly assert rights over those assertions, especially in a way that a responsible client won't lose track of the ownership context? Let's assume a file of RDF triples is read into a store. Consistent with the NYTimes LOD discussion, we'll call the file people.rdf. Since "all RDF stores support named graphs these days" (Richard Cyganiak), we'll assume a named graph URI has been created that names the aggregation of assertions imported from "people.rdf" (i.e., the assertions in the file "people.rdf" from the provider become members of the named graph "people.rdf" in the client's RDF store).
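For concreteness, here is roughly what that import might look like using the rdflib Python library (a sketch only; the provider URI is hypothetical, and your own store's API may differ):

from rdflib import Dataset, URIRef

ds = Dataset()
name = URIRef("http://provider.example.org/people.rdf")  # hypothetical
g = ds.graph(name)                # the named graph for this source
g.parse("people.rdf", format="xml")
# Every imported triple is now a member of the named graph, so rights
# metadata can later be attached to the graph URI as a whole.
print(len(g), "triples in", name)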

Recall that a named graph is "a set of triples named by an URI." [ref] The OAI-ORE data model extends this with a set of guidelines for making assertions about aggregations that "describe" the named graph. ORE's core idea is to create one URI to represent the aggregation itself, and another to represent the resource map created to describe that aggregation. Rights expressions applying to the aggregation should appear in this OAI-ORE resource map.

In my next post I'll take a stab at mocking up -- and hopefully not mucking up -- what an implementation of this might look like...

Thursday, December 3, 2009

Protecting your Linked Data

One of the highlights of the recent ISWC2009 was a tutorial on Legal and Social Frameworks for Sharing Data on the Web. As one who during the rise of "Web 1.0" was writing and presenting frequently on topics like Copyright for Cybernauts and is now seduced by the world of linked data, I've been considering how the legal, business and technical worlds will reconcile themselves in this new world, a world where value will come from joining networks of data together. Eric Hellman puts this nicely:

Linked Data is the idea that the merger of a database produced by one provider and another database provided by a second provider has value much larger than that of the two separate databases... Eric Hellman, Databases are Services, NOT Content (Dec 2009)
The question is, what legal and technical strategies are available to linked data providers to protect themselves as they pursue such a value proposition? The following post is an effort to rationalise this a bit more clearly.

I'm not a lawyer. I'm a technologist who has since the early 1990s immersed himself in the sometimes delicate, more often violent dance between technology, business and public policy that has been catalysed by the rise of the digital, networked environment. In particular I've been motivated by the question of how policies can, and more often can't, be systematically "implemented" by technologies --- as well as by the question of how technical architectures often enforce ad hoc policy regimes, inadvertently or otherwise (see esp. Lawrence Lessig's Code v2, the community update of Code and Other Laws of Cyberspace).

As an early (and perhaps idiosyncratic) player in the DRM industry, I quickly concluded that the only sustainable solution to the problem of communicating rights for creative works in the digital domain was to evolve an infrastructure of identifiers and metadata, which has been realised to a great extent by the rise in prominence of the DOI, accessible templates for rights communications (due in large part to Creative Commons), the emergence of a variety of metadata standards, and a standard data model (RDF) for associating metadata with objects. The more recent emergence of standards of practice for linked data will only help to further disambiguate the rights world, as these practices make the expression and transferral of content-descriptive metadata orders of magnitude easier.

I'm interested in questions concerning the communication of intellectual property rights for data shared through linked data mechanisms: What rights can be claimed? What are the best practices for claiming and transferring rights? What technical mechanisms exist --- in this case, specific vocabularies and protocols --- for communicating rights to metadata? The four thought leaders at the ISWC2009 LSFSDW tutorial have done a fairly complete job; this post is an attempt to summarise and/or interpret their messages and resources found elsewhere. I'd like to highlight pioneering work by the Science Commons, an offshoot of CC which has considered these questions specifically for scientific data. Also, in preparing this post I stumbled across some works that I pored over more than a decade ago that now seem prescient! David Lanzotti and Doug Ferguson's thorough analysis circa 2006 shows that little has changed: IP protection for databases is nebulous territory.

Copyright does not apply to datasets: Most regimes hold that copyright applies only to original creative works. This means you can only claim copyright for works that are yours and which are "creative." This second piece means you cannot claim copyright on databases unless their structure and organisation are sufficiently creative; the US Supreme Court held that "sweat of the brow" is not sufficient to cross this threshold, and that copyright protections do not extend to non-creative accumulations of facts (cf. Feist, 1991).

The individual elements of a dataset might themselves be extensive and creative enough to merit copyright protection; we'll assume for this discussion that these are handled separately. In their FAQ the Open Data Commons nicely emphasises the difference between a dataset and the individual contents of that dataset, including text and images. Note also that the European Space Agency (ESA) web site includes a nice, concise explanation of the legal reasons why copyright cannot be applied to databases.

Intellectual property protection for datasets: The fact that copyright (generally) cannot be applied to datasets means that the Creative Commons body of work can't be applied directly; indeed CC specifically discourages it. But is there an IP regime that covers accumulated data? If not copyright, patent or trademark, then what? Circa 1996, database "owners" thought that a sui generis ("of its own kind") regime for protecting databases might proliferate, and in March 1996 the EU issued a Database Directive. International IP law requires reciprocal directives from member states, however, and the lack of adoption of this model around the world, most notably in the United States, means IP protection for datasets is still nebulous.

In principle there are no "default" protections for datasets as there are with copyright; providers must be proactive and declare their terms of use up front, whether they choose to waive all restrictions, a limited set focused on attribution, or more extensive limitations based on customised licenses. It is clearly in the interests of both providers and consumers of datasets to ensure that rights are explicitly stipulated up front, especially since a key value proposition of linked data is (as we are reminded above) the merger of graphs; for certain applications, graphs from different sources must be merged together within a single store so that inference can be applied. A service agency must know up front whether triples from particular sources can be "thrown in the hopper," and even whether there are exclusions.

Templates for expressing licensing terms: The Open Data Commons provides a template Open Database License (ODbL) that specifies Attribution and Share-alike terms:

This {DATA(BASE)-NAME} is made available under the Open Database License: http://opendatacommons.org/licenses/odbl/1.0/. Any rights in individual contents of the database are licensed under the Database Contents License: http://opendatacommons.org/licenses/dbcl/1.0/
The specific text of the ODbL license is quite extensive, but the gist of it is nicely summarised in the ODbL Plain Language Summary:
You are free: To Share...To Create...To Adapt...
As long as you: Attribute...Share-alike...Keep open...
(details of each stipulation omitted for simplicity)

My point in dwelling on ODbL is not to argue that commercial providers should adopt it, but rather to consider adapting it; I'm holding it up as an exemplar for the explicit expression of terms of use for a dataset.

Expressing your rights to linked data as linked data: One of the things that has impressed me about Creative Commons is that its rights expressions were intended from the start to be modelled in RDF and machine-readable; indeed CC has created ccREL: the Creative Commons Rights Expression Language, which primarily uses the idea of embedded RDF (via RDFa) in content pages to communicate rights. A recent development is Creative Commons guidance on how ccREL and RDFa might be applied to "deploy the Semantic Web." Nathan Yergler's (excellent) OpenWeb 2008 presentation explains this well, but doesn't specifically deal with the linked data question. In particular, Nathan addresses CC+, a CC licensing model that allows providers to include a way for users to request rights beyond those stated in the basic CC license. Those who know me know what I'll say next: this is another step forward as we converge on Henry Perritt's ca. 1993 vision of permissions headers!
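To see what ccREL looks like at the triple level, here is a sketch with the rdflib Python library of the statements that RDFa markup on a content page would carry (the work URI and attribution values are hypothetical; the cc: terms, including the CC+ property cc:morePermissions, come from the ccREL namespace):

from rdflib import Graph, Literal, Namespace, URIRef

CC = Namespace("http://creativecommons.org/ns#")
work = URIRef("http://myOrganisation.org/myDataset")  # hypothetical

g = Graph()
g.add((work, CC.license, URIRef("http://creativecommons.org/licenses/by/3.0/")))
g.add((work, CC.attributionName, Literal("myOrganisation")))
g.add((work, CC.attributionURL, URIRef("http://myOrganisation.org/")))
# CC+: where a user can go to request rights beyond the basic license
g.add((work, CC.morePermissions, URIRef("http://myOrganisation.org/licensing")))
print(g.serialize(format="turtle"))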

For further reading: