Wednesday, December 16, 2009

Recent Efforts toward Linked Multimedia Metadata

Recently I've been "having a think" on issues ranging from rights expression for datasets to realising the value of linked data, but frankly I've felt that something is missing; even with scientific and government linked datasets going online, a voice inside me wonders if the stakes are still (arguably) too low to really shake things up. I've been wondering what kind of data we haven't been hearing about --- the kind of data that, if published according to linked data principles, would surely lead to the emergence of outrageously cool applications, demonstrate the inherent value of the linked data approach, and perhaps even test some interesting new monetisation models. The area that immediately came to mind was multimedia metadata, especially semantic metadata for video and audio content.

Several recent venues have focused on the general topic of generating, publishing and using semantic multimedia metadata, including the Oct-Dec 2009 IEEE Multimedia Magazine special issue on Multimedia Metadata and Semantic Management, and SAMT2009: The 4th International Conference on Semantic and Digital Media Technologies (3-4 Dec 2009; Graz, Austria). Both of these are "powered" by members of the Multimedia Metadata Community, an outgrowth of the MPEG-7 and MPEG-21 worlds that "brings together experts from research and industry in the area of multimedia meta data interoperability for collaborative working environments." Finally, since 2008 the W3C has been host to its Video in the Web activity; within this the Media Annotations Working Group is developing an ontology and API to facilitate cross-community sharing and use of multimedia metadata in the Web.

IEEE Multimedia (Oct-Dec 2009): This special issue features six research articles focused on different facets of the "semantic management of multimedia and multimedia metadata," ranging from retrieval and processing to consumption and presentation. Of the six, perhaps the first two are most relevant in today's linked data environment:

  • "Managing and Querying Distributed, Multimedia Metadata." This article advocates the use of a centralized metadata résumé --- a condensed, automatically-constructed version of the larger metadata set --- for locating content on remote servers. The authors demonstrate the advantages of their approach using conventional semweb technologies to represent and query semantic metadata.
  • "Semantic MPEG Query Format Validation and Processing." The authors present their semantic validation of MPEG Query Format (MPQF) queries and their implementation of a practical MPQF query engine over an Oracle RDBMS. The article introduces methods for evaluating MPQF semantic-validation rules not expressed by syntactic means within the XML schema. The authors highlight their prototype implementation of an MPQF-capable processing engine using several query types on a set of MPEG-7 based image annotations.
  • "Diversifying Image Retrieval with Affinity-Propagation Clustering on Visual Manifolds." The authors describe a post-processing subsystem for retrieval systems that improves the diversity of results presented to users. Image retrieval systems typically focus on the similarity between the retrieval and sample images, where the relevance of the retrieval results is considered but the diversity is neglected. Ideally, retrieval results should contain a diverse array of items representing a variety of subtopics. This article presents a method for removing duplicate images from a "top 20" list, replacing them with images representing new subtopics.
  • "A Media Value Chain Ontology for MPEG-21." The authors have created a semantic representation of intellectual property derived from MPEG-21 Part 19. Their model defines the minimal set of types of intellectual property, the roles of users interacting with them, and the relevant actions regarding intellectual property law. The article is a helpful guide to the standardization efforts, with its many examples and useful insight into the multimedia value chain.
  • "Using Social Networking and Collections to Enable Video Semantics Acquisition." The authors consider media production, acquisition, and metadata gathering, the first elements of the multimedia value chain. Methods from video annotation and social networking are brought together to solve problems associated with gathering metadata that describes user interaction, usage, and opinions of video content. Individual user-interaction metadata is aggregated to provide semantic metadata for a given video. Coolness alert: The authors have successfully implemented their model in a Flex-based Facebook application!
  • "A Web-Based Music Lecture Database Framework." This article describes semantic audio authoring and presentation for Web-published music lectures. The authors propose a dynamic programming-based algorithm for MIDI-to-Wave alignment to explore the temporal relations between MIDI and the corresponding performance recording. The synchronized MIDI and wave can be attached to many kinds of teaching materials where synchronized presentations can add value.

SAMT'09: Nearly 15 years ago I had the good fortune to present my early rights metadata research at EDMEDIA'95 in Graz (Austria); visiting the conference web site this weekend, especially seeing the real-time image of the historic "Uhrturm" on the hill high above the city, brought back a flood of fond memories! The topics of the three tutorials offered at SAMT'09 demonstrate that current research has definitely taken a turn toward getting multimedia into the Web. (Unfortunately, only slides from the first are currently available):

  • "Web of Data in the Context of Multimedia (WoDMM)." How multimedia content can be integrated into the Web of Data and how users and developers can consume and benefit from linked data. (slides)
  • "MPEG Metadata for Context-Aware Multimedia Applications (MPEG)." Overview of MPEG metadata formats that enable the development and deployment of content- and context-aware multimedia applications.
  • "A Semantic Multimedia Web: Create, Annotate, Present and Share your Media (SemMMW)." How multimedia metadata can be represented and attached to the content it describes within the context of established media workflow practices, and how users can benefit from a Web of Data containing more formalized knowledge.

For much more information, see the Proceedings from the 20th International Workshop of the Multimedia Metadata Community on Semantic Multimedia Database Technologies (SeMuDaTe'09).

Metadata Standards for the Web of Data: Finally, research such as that described above has led to progress on the standards front. As the IEEE Multimedia guest editors note in their foreword, since 2008 there has been quiet but steady progress within the W3C's Video in the Web activity, which was chartered to make video a first-class citizen of the Web by creating an architectural foundation that, by taking full advantage of the Web's underlying principles, will enable people to create, navigate, search, link and distribute video... Of its three working groups, the editors highlight the Media Annotations Working Group as being motivated by progress in RDF and topic maps and as appearing most aligned with emerging linked data activities.

In their foreword, the IEEE Multimedia editors provide a very nice summary of the core problem with multimedia metadata and thus the motivation for the W3C efforts:

Most of the standards are tailored to specific application domains. Examples include European Broadcasting Union P/Meta 2.0 for broadcasting; TV-Anytime and SMPTE Metadata Dictionary for TV; and MPEG-21 for the delivery chain of multimedia and technical aspects (such as EXIF). These standards exhibit a different semantic level of detail in their descriptions (from simple keywords to regulated taxonomies and ontologies). Only some of the standards are general purpose, for instance MPEG-7...

Coolness is on the Horizon: This rather lengthy posting is merely a sampling of works-in-progress, not only to put multimedia metadata on the Web but more importantly to establish such metadata as a useful and valuable part of the Web. Taken together with visionary efforts such as the revamped, linked data-driven BBC web site, this work makes me increasingly confident that a new generation of linked data applications is around the corner, fueled this time by datasets that add video and audio to the semantic mix. Bring it on!

Friday, December 11, 2009

Scale-free Networks and the Value of Linked Data

Kingsley Idehen of OpenLink Software and others on the Business of Linked Data (BOLD) list have been debating a value proposition for linked data via Twitter (search for #linkeddata) and email. The discussion has included useful iterations on various "elevator pitches" and citations of recent successes, especially the application of GoodRelations e-commerce vocabularies at Best Buy. After some deep thought I decided to take the question of value in a different direction and to consider it from the perspective of the science of networks, especially with reference to the works of Albert-László Barabási, director of the Center for Complex Network Research and author of Linked: The New Science of Networks. I'd like to test the idea here that data sharing between organisations based on linked open data principles is the approach most consistent with the core principles of a networked economy. I believe that the linked data model best exploits "network thinking" and maximizes the organisation's ability to respond to changes in relationships within the "global graph" of business. Using Barabási as a framework, linked data is the approach that most embodies a networked view of the economy from the macro- to the micro-economic level, and therefore best empowers the enterprise to understand and leverage the consequences of interconnectedness.

As has been noted numerous times elsewhere, the so-called Web of Data is perhaps the web in its purest form. Following Tim Berners-Lee's principles or "rules" as stated in his Linked Data Design Issues memo from 2006, we have a very elegant framework for people and especially machines to describe the relationships between entities in a network. If we are smart about how we define those links and the entities we create to aggregate those links --- the linked datasets we create --- we can build dynamic, efficiently adaptive networks embodying the two laws that govern real networks: growth and preferential attachment. Barabási illustrates these two laws with an example "algorithm" for scale-free networks in Chapter 7 of Linked. The critical lessons are (a) networks must have a means to grow --- there must not only be links, but the ability to add links --- and (b) networks must provide some mechanism for entities to register their preference for other nodes by creating links to the more heavily-linked nodes. Preferential attachment ensures that the converse is also true: entities will "vote with their feet" and register their displeasure with nodes by eliminating links.
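
Barabási boils preferential attachment down to a simple rule: the probability that a newly arriving node links to an existing node i is proportional to the number of links node i already has. In the standard Barabási-Albert formulation (my paraphrase, not a quote from Linked):

Π(k_i) = k_i / Σ_j k_j

where k_i is the current number of links into node i. Translated into Web-of-Data terms: every new link into a dataset makes the next link into that dataset more likely.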

In real networks, the rich get richer. In the Web, the value is inherent in the links. Google's PageRank merely reinforced the "physical" reality that the most valuable properties in the Web of Documents are those resources that are most heavily linked-to. Those properties provide added value if they in turn provide useful links to other resources. Properties that are sensitive to demand and can adapt to the preferences of their consumers --- especially by aggregating links to more resources that compound their value and distinguish them from other properties --- are especially valuable; these are the network's hubs.

Openness is important. At this point it is tempting to jump to the conclusion that Tim Berners-Lee's four principles are all we need to create a thriving Web of Data, but this would be premature; Sir Tim's rules are necessary but not sufficient conditions. Within any "space" where Webs of Data are to be created, whether global or constrained within an organisation, the network must embody the open world assumption as it pertains to the web: when datasets or other information models are published, their providers must expect them to be reused and extended. In particular this means that entities within the network, whether powered by humans or machines, must be free to arbitrarily link to (make assertions about) other entities within the network. The "friction" of permission in this linking process must approximate zero.

Don't reinvent and don't covet! The graphs built within organisations should not stop at organisational boundaries; as the BBC has shown so beautifully with their use of linked data on the revamped BBC web site, the inherent value of their property was increased radically not only by linking to datasets provided elsewhere, openly on the "global graph," but also by enabling reuse of their own properties. The BBC's top-level principles for the revamped site are all about openness and long-term value:


The site has been developed against the principles of linked open data and RESTful architecture where the creation of persistent URLs is a primary objective. The initial sources of data are somewhat limited but this will be extended over time. Here's our mini-manifesto: Persistence...Linked open data...RESTful...One web

The BBC has created a valuable "ecosystem"; their use of other resources, especially MusicBrainz and DBPedia, has not only made the BBC site richer but in turn has increased the value of those properties. And those properties will continue to increase in value; by the principle of preferential attachment, every relationship "into" a dataset by valuable entities such as the BBC in turn increases the likelihood that other relationships will be established.

Links are not enough. It should be obvious that simply exposing datasets and providing value-added links to others isn't enough; as Eric Hellman notes, dataset publishers must see themselves as service providers who add value beyond simply exposing data. Some will add value to the global graph by gathering, maintaining and publishing useful datasets and fostering a community of users and developers; others will add value by combining datasets from other services in novel ways, possibly decorated by their own. Eric has argued that the only winners in the linked open data space have indeed been those who have provided such merged datasets as a service.

Provide value-adding services and foster community. I would argue that dataset providers asking how they might realise the full value potential of publishing their datasets on the Web should examine whether, based on the principles I've outlined above, they have done everything they can to make their datasets part of the Web (rather than merely "on" the web) and have truly added value to the global graph. Do they view themselves as a service? Have they made their datasets as useful and easy-to-use as possible? Have they provided the best possible community support, including wikis and other mechanisms? Have they fully documented their vocabularies? Have they clearly defined any claimed rights, and in particular have they considered adopting open data principles?

Tuesday, December 8, 2009

Linking Rights to Aggregations of Data (Part 2)

In my background research for today's entry I discovered that the smart people at Talis, especially Ian Davis, have been working the problem I outlined in Linking Rights to Aggregations of Data (Part 1). Specifically, back in July 2009 Ian proposed WAIVER: A vocabulary for waivers of rights. In Ian's words,

(The WAIVER) vocabulary defines properties for use when describing waivers of rights over data and content. A waiver is the voluntary relinquishment or surrender of some known right or privilege. This vocabulary is designed for use with the Open Data Commons Public Domain Dedication and License and with the Creative Commons CC-0 waiver

In his July 2009 post Linked Data and the Public Domain Ian argues for providers to unambiguously declare their datasets public domain and explains how to use the WAIVER vocabulary to do this, in the context of a voID description of a dataset. (See also this email discussion thread involving several of the thought leaders in this area.) Ian provides the following example, which I repeat here to illustrate (a) use of voID to describe a dataset named "myDataset," (b) use of the wv:waiver property to link the dataset to the Open Data Commons PDDL waiver, (c) use of the wv:declaration property to include a human-readable declaration of the waiver, and (d) use of the wv:norms property to link the dataset to the community norms he suggests, ODC Attribution and Share-alike.


<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/terms/"
xmlns:wv="http://vocab.org/waiver/terms/"
xmlns:void="http://rdfs.org/ns/void#">
<void:Dataset rdf:about="http://myOrganisation.org/myDataset">
<dc:title>myDataset</dc:title>
<wv:waiver rdf:resource="http://www.opendatacommons.org/odc-public-domain-dedication-and-licence/"/>
<wv:norms rdf:resource="http://www.opendatacommons.org/norms/odc-by-sa/" />
<wv:declaration>
To the extent possible under law, myOrganisation
has waived all copyright and related or neighboring rights to
myDataset
</wv:declaration>
</void:Dataset>
</rdf:RDF>

WAIVER and OAI-ORE: As I proposed in Part 1, we should be able to combine the voID and OAI-ORE approaches. The only conceptual difference is that, under OAI-ORE guidelines, the RDF file shown above would be treated as the resource map for the aggregation URI (in this example, "http://myOrganisation.org/myDataset") and would have a URI unto itself (perhaps "http://myOrganisation.org/myDataset.rdf").
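
To make the combination concrete, here is a minimal sketch --- my own mock-up, reusing the hypothetical URIs above and assuming the standard OAI-ORE terms namespace --- of how the resource map and the aggregation would be related:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/terms/"
         xmlns:wv="http://vocab.org/waiver/terms/"
         xmlns:ore="http://www.openarchives.org/ore/terms/">
  <!-- The resource map: a document with its own URI that describes the aggregation -->
  <ore:ResourceMap rdf:about="http://myOrganisation.org/myDataset.rdf">
    <ore:describes rdf:resource="http://myOrganisation.org/myDataset"/>
  </ore:ResourceMap>
  <!-- The aggregation itself, carrying the waiver and norms from Ian's example -->
  <ore:Aggregation rdf:about="http://myOrganisation.org/myDataset">
    <dc:title>myDataset</dc:title>
    <wv:waiver rdf:resource="http://www.opendatacommons.org/odc-public-domain-dedication-and-licence/"/>
    <wv:norms rdf:resource="http://www.opendatacommons.org/norms/odc-by-sa/"/>
  </ore:Aggregation>
</rdf:RDF>

Here the waiver triples still hang off the aggregation URI, but they live in the resource map document, which is exactly where Part 1 argued rights expressions belong.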

What about other rights? It is critically important for the reader to understand that Ian's example (repeated above) only shows how to declare a waiver of rights, which by its nature is intended to promote the reuse of data based on open principles. Today, this is mostly what the linked data world has focused on, but as the NYTimes open data experiment is showing us, providers will want to assert rights where they can. In a future post I'll apply what we've learned so far to consider approaches for declaring dataset rights in legal regimes where this is actually possible.

Monday, December 7, 2009

Linking Rights to Aggregations of Data (Part 1)

In my previous post Protecting your Linked Data I considered the dual questions of what legal regimes are available to linked data providers for the protection of their published datasets, and what technical frameworks and best practices exist, especially within the realm of RDF and linked data, for making such rights assertions. In this (shorter!) post I begin to consider an attribution scheme that comes to mind on the heels of discussions on The New York Times Linked Open Data Community list: using named graphs (see also here), and specifically the OAI-ORE data model, to associate specific rights with aggregations of resources.

What's the problem? Given a set -- an aggregation -- of data assertions, how might we properly assert rights over those assertions, especially in a way that a responsible client won't lose track of the ownership context? Let's assume a file of RDF triples is read into a store. Consistent with the NYTimes LOD discussion, we'll call the file people.rdf. Since "all RDF stores support named graphs these days" (Richard Cyganiak), a named graph URI is assumed to have been created that names the aggregation of assertions imported from "people.rdf" (i.e. the assertions in the file "people.rdf" from the provider become members of the named graph "people.rdf" in the client's RDF store).

Recall that a named graph is "a set of triples named by an URI." [ref] The OAI-ORE data model extends this with a set of guidelines for making assertions that "describe" such a named graph. ORE's core idea is to create one URI to represent the aggregation itself, and another to represent the resource map created to describe that aggregation. It is in this OAI-ORE resource map that rights expressions applying to the aggregation should appear.

In my next post I'll take a stab at mocking up -- and hopefully not mucking up -- what an implementation of this might look like...

Thursday, December 3, 2009

Protecting your Linked Data

One of the highlights of the recent ISWC2009 was a tutorial on Legal and Social Frameworks for Sharing Data on the Web. As one who during the rise of "Web 1.0" was writing and presenting frequently on topics like Copyright for Cybernauts and is now seduced by the world of linked data, I've been considering how the legal, business and technical worlds will reconcile themselves in this new world, a world where value will come from joining networks of data together. Eric Hellman puts this nicely:

Linked Data is the idea that the merger of a database produced by one provider and another database provided by a second provider has value much larger than that of the two separate databases... --- Eric Hellman, Databases are Services, NOT Content (Dec 2009)

The question is, what legal and technical strategies are available to a linked data provider to protect themselves as they pursue such a value proposition? The following post is an effort to rationalise this a bit more clearly.

I'm not a lawyer. I'm a technologist who has since the early 1990s immersed himself in the sometimes delicate, more often violent dance between technology, business and public policy that has been catalysed by the rise of the digital, networked environment. In particular I've been motivated by the question of how policies can, and more often can't, be systematically "implemented" by technologies --- as well as by the question of how technical architectures often enforce ad hoc policy regimes, inadvertently or otherwise (see esp. Lawrence Lessig's Code v2, the community update of Code and Other Laws of Cyberspace).

As an early (and perhaps idiosyncratic) player in the DRM industry, I quickly concluded that the only sustainable solution to the problem of communicating rights for creative works in the digital domain was to evolve an infrastructure of identifiers and metadata, which has been realised to a great extent by the rise in prominence of the DOI, accessible templates for rights communications (due in large part to Creative Commons), the emergence of a variety of metadata standards, and a standard data model (RDF) for associating metadata with objects. The more recent emergence of standards of practice for linked data will only help to further disambiguate the rights world, as these practices make the expression and transferral of content-descriptive metadata orders of magnitude easier.

I'm interested in questions concerning the communication of intellectual property rights for data shared through linked data mechanisms: What rights can be claimed? What are the best practices for claiming and transferring rights? What technical mechanisms exist --- in this case, specific vocabularies and protocols --- for communicating rights to metadata? The four thought leaders at the ISWC2009 LSFSDW tutorial have done a fairly complete job; this post is an attempt to summarise and/or interpret their messages and resources found elsewhere. I'd like to highlight pioneering work by the Science Commons, an offshoot of CC which has considered these questions specifically for scientific data. Also, in preparing this post I stumbled across some works that I pored over more than a decade ago that now seem prescient! David Lanzotti and Doug Ferguson's thorough analysis circa 2006 shows that little has changed: IP protection for databases is nebulous territory.

Copyright does not apply to datasets: Most regimes hold that copyright applies only to original creative works. This means you can only claim copyright for works that are yours and which are "creative." This second piece means you cannot claim copyright on databases unless their structure and organisation is sufficiently creative; the US Supreme Court held that "sweat of the brow" is not sufficient to cross this threshold, and that copyright protections do not extend to non-creative accumulations of facts (cf. Feist, 1991).

The individual elements of a dataset might themselves be extensive and creative enough to merit copyright protection; we'll assume for this discussion that these are handled separately. In their FAQ the Open Data Commons nicely emphasises the difference between a dataset and the individual contents of that dataset, including text and images. Note also that the European Space Agency (ESA) web site includes a nice, concise explanation of the legal reasons why copyright cannot be applied to databases.

Intellectual property protection for datasets: The fact that copyright (generally) cannot be applied to datasets means that the Creative Commons body of work can't be applied directly; indeed CC specifically discourages it. But is there an IP regime that covers accumulated data? If not copyright, patent or trademark, then what? Circa 1996 database "owners" thought that a sui generis ("of its own kind") regime for protecting databases might proliferate, and in March 1996 the EU issued a Database Directive. International IP law requires reciprocal directives from member states, however, and the lack of adoption of this model around the world --- most notably in the United States --- means IP protection for datasets is still nebulous.

In principle there are no "default" protections for datasets as there are with copyright; providers must be proactive and declare their terms of use up front, whether they choose to waive all restrictions, impose a limited set focused on attribution, or apply more extensive limitations based on customised licenses. It is clearly in the interests of both providers and consumers of datasets to ensure that rights are explicitly stipulated up front, especially since a key value proposition of linked data is (as we are reminded above) the merger of graphs; for certain applications graphs from different sources must be merged together within a single store so that inference can be applied. A service agency must know up front whether triples from particular sources can be "thrown in the hopper," and even whether there are exclusions.

Templates for expressing licensing terms: The Open Data Commons provides a template Open Database License (ODbL) that specifies attribution and share-alike terms:

This {DATA(BASE)-NAME} is made available under the Open Database License: http://opendatacommons.org/licenses/odbl/1.0/. Any rights in individual contents of the database are licensed under the Database Contents License: http://opendatacommons.org/licenses/dbcl/1.0/

The specific text of the ODbL license is quite extensive, but the gist of it is nicely summarised in the ODbL Plain Language Summary:

You are free: To Share...To Create...To Adapt...
As long as you: Attribute...Share-alike...Keep open...
(details of each stipulation omitted for simplicity)

My point in dwelling on ODbL is not to argue that commercial providers should adopt it, but rather to consider adapting it; I'm holding it up as an exemplar for the explicit expression of terms of use for a dataset.

Expressing your rights to linked data as linked data: One of the things that has impressed me about Creative Commons is that its rights expressions were intended from the start to be modelled in RDF and machine-readable; indeed CC has created ccREL: the Creative Commons Rights Expression Language, which primarily uses the idea of embedded RDF (via RDFa) in content pages to communicate rights. A recent development is Creative Commons guidance on how ccREL and RDFa might be applied to "deploy the Semantic Web." Nathan Yergler's (excellent) OpenWeb 2008 presentation explains this well, though it doesn't specifically deal with the linked data question. In particular, Nathan addresses CC+, a CC licensing model that allows providers to include a way for users to request rights beyond those stated in the basic CC license. Those who know me know what I'll say next: this is another step forward as we converge on Henry Perritt's ca. 1993 vision of permissions headers!
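
For readers who haven't bumped into ccREL outside of RDFa, here is a minimal sketch of the kind of assertion involved, lifted out of a web page and expressed as plain RDF/XML. The work and "more permissions" URIs are hypothetical; I'm simply using the published Creative Commons namespace (cc:morePermissions is the property behind CC+):

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/terms/"
         xmlns:cc="http://creativecommons.org/ns#">
  <!-- A hypothetical photograph offered under CC BY-NC, with a CC+ pointer
       where users can request rights beyond the basic license -->
  <rdf:Description rdf:about="http://example.org/photos/1234">
    <dc:title>An example photograph</dc:title>
    <cc:attributionName>Example Creator</cc:attributionName>
    <cc:attributionURL rdf:resource="http://example.org/"/>
    <cc:license rdf:resource="http://creativecommons.org/licenses/by-nc/3.0/"/>
    <cc:morePermissions rdf:resource="http://example.org/licensing/1234"/>
  </rdf:Description>
</rdf:RDF>

In RDFa the same statements would be carried by attributes inside the page itself, which is ccREL's preferred deployment style.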


Wednesday, November 25, 2009

Long Tails and "Scaling Down" Linked Data Services

Chris Anderson's newest book FREE: The Future of a Radical Price received some attention this summer, but I've actually been meditating on principles he laid out three years ago in his blog post, Scaling up is good. Scaling down is even better. In that post he marveled at Google et al.'s ability to scale down, to run themselves efficiently enough to serve users who generate no revenue at all. Anderson's principles are guidance on approaches to conducting business such that even if only a tiny percentage of one's visitors "convert" into paying customers, by ensuring this small percentage is of a very large number one can still achieve big-time profitability.

My goal with this post is to consider how these ideas might be applied to the domain of Linked Data, and specifically how they pertain to the provision of unique data that adds real value to the greater "Web of Data."

In his blog Anderson gives us four keys to scaling down: Self-service, "Freemium" services, No-frills products and Crowdsourcing...

1. Self-service: give customers all the tools they need to manage their own accounts. It's cheap, convenient, and they'll thank you for it. Control is power, and the person who wants the work done is the one most motivated in seeing that it's done properly.

"Self-service" applies to linked data services in oh-so-many ways! Self- service in this case is not as much about support (see "Crowdsourcing," below) as it is about eliminating any and all intervention customers might need to customize or specialize how services perform for them. In principle, the goal should be to provide users with a flexible API and let them figure it out, with the support of their peers. Ensure that everything is doable from their side, and step out of the way.

The (negative) corollary is this: if you "babysit" your customers by providing specialized services that require maintenance, then you own it and must eat the cost. Make the specializations a user-side function, and they own it. But they won't be alone; they'll have the support of their community!

2. "Freemium" services: As VC Fred Wilson puts it, "give your service away for free, possibly ad supported but maybe not, acquire a lot of customers very efficiently through word of mouth, referral networks, organic search marketing, etc, then offer premium priced value added services or an enhanced version of your service to your customer base." Free scales down very nicely indeed.

There are any number of ways providers might apply this concept to the linked data world:

Free Access                          | Premium Access
Restricted vocabulary of assertions  | Full access, all assertions
Limited query rate                   | Unlimited query rate
Limited query extent                 | Unlimited query extent
Limited data size                    | Unlimited data size
Read-only                            | Term upload capability
Narrow reuse rights                  | Broad reuse rights
Community support                    | Private/dedicated support
...                                  | ...
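
As one (entirely hypothetical) way of making the table above concrete, a provider could publish a voID description that advertises a free, restricted endpoint alongside the full premium dataset; the URIs and license references below are my own sketch, not a recommendation:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/terms/"
         xmlns:void="http://rdfs.org/ns/void#">
  <!-- Premium tier: the full dataset, broad reuse rights, dedicated endpoint -->
  <void:Dataset rdf:about="http://example.org/data/premium">
    <dc:title>myDataset (premium tier)</dc:title>
    <void:sparqlEndpoint rdf:resource="http://example.org/sparql/premium"/>
    <dc:license rdf:resource="http://example.org/licenses/premium-terms"/>
    <!-- the free tier is a subset of the premium dataset -->
    <void:subset rdf:resource="http://example.org/data/free"/>
  </void:Dataset>
  <!-- Free tier: a restricted, rate-limited slice with narrower reuse rights -->
  <void:Dataset rdf:about="http://example.org/data/free">
    <dc:title>myDataset (free tier)</dc:title>
    <void:sparqlEndpoint rdf:resource="http://example.org/sparql/free"/>
    <dc:license rdf:resource="http://example.org/licenses/free-terms"/>
  </void:Dataset>
</rdf:RDF>

Rate limits and query extents live in the service layer rather than in voID itself, but describing the two tiers as linked datasets at least lets prospective users discover both offers by machine.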

3. No-frills products: Some may come for the low cost, others for the simplicity. But increasingly consumers are sophisticated enough to know that they don't need, or want to pay for premium brands and unnecessary features. It's classic market segmentation, with most of the growth coming at the bottom.

In the linked data world, achieving "no frills" would seem easy because by definition it is only about the data! For linked data a "frill" is added data complexity that serves no purpose or detracts from the utility of the service. Avoid any temptation to "add value" on behalf of customers, such as merging your core graph with others in an attempt to "make it easy" for them. Providers should also avoid "pruning" graphs, except in the case of automated filtering to differentiate between free and premium tiers of service.

4. Crowdsourcing: From Amazon reviews to eBay listings, letting the customers do the work of building the service is the best way to expand a company far beyond what employees could do on their own.

By now it is not only obvious, but imperative that providers should develop communities around their services. Usually communities are about evangelism, and this is certainly true for linked data providers, but increasingly service providers realize that well-groomed communities can radically reduce their service costs.

Linked data providers should commit themselves to a minimum of direct support and invest in fostering an active community around their service. Every provider should have a means for members of their community to support each other. Every provider should leverage this community to demonstrate to potential adopters the richness of the support and the inherent value of their dataset.

Finally: In a thought-provoking post Linked Data and the Enterprise: A Two-way Street Paul Miller reminds the skeptical enterprise community that they, not merely their user community, will ultimately benefit from the widespread use of their data, and when developing their linked data strategy they should consider how they can "enhance" the value of the Web of Data, for paying and non-paying users alike:

...[A] viable business model for the data-curating Enterprise might be to expose timely and accurate enrichments to the Linked Data ecosystem; enrichments that customers might pay a premium to access more quickly or in more convenient forms than are available for free...

I've purposely avoided considering the legal and social issues associated with publishing certain kinds of enterprise data as linked data (see also this), which I'll address in a future post...

Tuesday, November 24, 2009

DRM & Me Part III: DOIs, Metadata and Long Tails

In Part II of this retrospective I discussed the NetRights years and our novel approach to binding static and dynamic metadata to objects in the early days of the Web. In this installment I'll cover my years at Yankee Rights Management (YRM), a division of YBP, Inc., especially the development of Copyright Direct(tm) and my personal realization of the potential of content identifiers and their associated metadata. Note: It was actually during my YRM years that I coined my now-infamous expression (referenced in Part II of this series), Metadata is the lifeblood of e-commerce!

YBP, originally known as Yankee Book Peddler and now a division of Baker & Taylor, has been a leader in using information technology to provide books and other materials, including bibliographic data --- metadata! --- to university and research libraries for more than 35 years. YBP executive Glen M. Secor also happened to be a professor of law at the Franklin Pierce Law Center specializing in copyright law, with a particular interest in the unique challenges of copyright in the emerging digital, networked environment. Glen and I first met when I presented my early Ph.D. work at DAGS'95 in Boston (prior to the founding of NetRights), and from that point on he took an interest in this metadata-oriented, iconoclastic approach to copyright. Glen spearheaded YBP's investment in NetRights in 1996, and with the sale of NetRights in 1997 I joined with Glen to launch Yankee Rights Management (YRM) in mid-1997.

One of YRM's goals was to build a business solving rights management problems for stakeholders in YBP's ecosystem, especially scientific/technical/medical (STM) publishers and their university and research customers. With the help of Kelly Frey, then VP of Business Development for the Copyright Clearance Center (CCC), we conceived of Copyright Direct(tm), which soon became the first web-based, real-time, pay-as-you-go copyright permissions service for a wide variety of multimedia types. As with LicensIt(tm), the usage model for Copyright Direct(tm) would be simple:

  1. From a web page or PDF document, the user would click on a distinctive green "Copyright Direct" icon
  2. A mini-window would pop up clearly identifying the work and presenting available options for that item
  3. The user would step through a short series of menus to specify their use and, if available, transact their request (via credit card!) and receive their permissions
  4. If the usage they needed was not available, the system collected the user's plain-text request and began a managed workflow between the user and the rightsholder
  5. When all parties agreed, the agreement became a "template" and was added as an available option --- the system learned and adapted
  6. At the end of each month, rightsholders would receive royalty payments.

Glen Secor, Jennifer Goodrich and I demonstrated my Copyright Direct prototype to a variety of stakeholders and thought leaders at the Frankfurt Book Fair in October 1997 and collected critical feedback. We returned "triumphantly" in October 1998 with a booth in the main hall, a live Copyright Direct demo (now powered by the fledgling DOI standard), and a major "beta" rightsholder: the IEEE!

But throughout 1998-1999 we also came to realize a fundamental problem with the Copyright Direct model: it depended not only on a ready supply of clean descriptive metadata from rightsholders, but also upon a rich set of rightsholder-generated rights metadata, including pricing and other licensing templates, none of which existed! Our goal was to use lightweight, easily accessible permissions transactions to provide "found money" to rightsholders, but it cost too much to generate the metadata required to fuel the system! In the September 2006 issue of D-Lib magazine I explore this problem further in my article, Handle Records, Rights and Long Tail Economies.

Chris Anderson's "long tail" argument (see also his Long Tail blog) asserts that modern systems based entirely on metadata make "unlimited selection" economically viable. I argue that yes, metadata really is the lifeblood of e-commerce and is the enabler of phenomena like the seemingly-unlimited selection of products through Amazon.com ("make everything available, help anyone find it!"), but all metadata must somehow still be generated, verified and published, and the cost of creating and supporting the neccessary metadata supply chains must not exceed the anticipated value that can be redeemed. Since the demand of a given "unit" may be exceptionally low, the "per unit" cost of creating or aggregating each unit's metadata halo must be near-zero!

These principles can be extrapolated to the "Web of Data"; by coupling Linked Data principles with a low-overhead infrastructure for authenticating metadata assertions, the cost of metadata may indeed approach zero. I'll talk about that in a future blog entry...

Monday, November 23, 2009

DRM & Me Part II: "Copyright for the rest of us!"

In Part I of this retrospective I covered the raw beginnings of my interest and research in enabling copyright in the digital, networked environment. In this second part I'll discuss work my colleagues and I did to take these ideas commercial, and I'll continue to focus on core principles of my work in content identification and metadata architecture, summed up by this quote (attributed to me!): Metadata is the lifeblood of e-commerce!

As the spring of 1995 approached it became clear that there was an opportunity to make a unique contribution to improving the world of copyright in the digital, networked environment. As I prepared to present a paper at ED-MEDIA 95 in Graz, Austria, I was approached by local businessmen who had been principals in a successful software company, Corporate Microsystems, Inc., that had just been acquired by a global enterprise software company. As the story goes, they were looking for an original idea upon which to base their next start-up, and I was looking for a strategy for implementing my research ideas that would scale well beyond what I was capable of doing part-time as a researcher at IML. Over the summer of 1995 my future partners Gerry Hunt, Theo Pozzy, Henry Adams, Hal Franklin and I held numerous planning meetings, and on 1 November 1995 NetRights, LLC was born!

We started NetRights at a time when other players, in particular InterTrust (then still called EPR) and IBM InfoMarket, were starting to draw attention to their robust, encryption-based "envelope" strategies for "protecting copyright" --- quotes intentional! --- and the term digital rights management wasn't yet in standard use. Taking a clue from my prototype work at Dartmouth, the core idea behind LicensIt(tm) (later @attribute) was to "objectify" flat multimedia objects using secure wrappers whose primary objective was to provide structured metadata about the object in hand. Our goal was to provide rich static and Internet-served dynamic metadata to facilitate "conversations" between creators and users of content. Our motif for "experiencing" copyright was simple and elegant: A user sees a photo, audio clip, video, even an embedded text snippet; they "right-click" on it and a tabbed set of property pages is displayed; they use those various pages to view descriptions of the content, to start emails with the creator or other contributors, to view default terms of use, even to initiate live rights transactions, all while staying within the context of use.

From a technical standpoint we were using OLE structured storage in very much the same way as XML (and especially RDF) is used today. Our development team, including Mark Schlageter, Norm Tiedemann, Mark Markus and Dan O'Connor (our sole Mac-head!), created amazing tools that let us not only design these metadata structures, but also create "soft" property-page layout templates (think CSS!) that were packaged with the metadata, enabling customized content-specific views. Considerable infrastructure was required to make all of this work, from OLE services installed on the user machine, to the tools for design and packaging, to back-end services for object registration. Major, bet-the-company decisions also had to be made about PC vs Mac, "networked COM" (which became ActiveX), Spyglass/IE vs Mosaic/Netscape support, and so on. To a startup company, Bill Gates' commitment of Microsoft to "embracing and extending" the Internet in late 1995 was helpful!

Trade journals like Seybold took notice and wondered whether our "kinder, gentler" approach to copyright, which by that time (June 1996) we were calling "enhanced attribution," might actually be a better option than so-called "opaque packages." Publishers were torn; they liked the obvious value our approach was bringing to the user and the fact that we were actually facilitating the copyright process, but they also couldn't get over their perceived need for "strong protection."

Today we see echoes all over the Internet of infrastructure and technology that make "copyright for the rest of us" radically easier than it was at NetRights' birth in 1995. First and foremost are systems of globally unique, persistent object identifiers, in particular the Digital Object Identifier (DOI), implemented on CNRI's Handle System. (As it happens, that same 1996 issue of Seybold also carried an article about the birth of the DOI!) RDF provides a universal information model for conveying metadata assertions (local and remote) about objects; RDFa provides a way to do this within (esp.) web documents. The recent massive and growing interest in publishing linked data by organizations, including governments, has fortified distributed metadata as a means of conveying object information from a variety of sources. And special mention must be made of Creative Commons, which has applied most of these techniques not only to make the process of copyright readily accessible to creators and users all over the world, but also to make content use safe through the explicit and unambiguous communication of the terms of its use.

Providing immediate, unambiguous expression of copyright information and connections to processes for any piece of content was my mantra starting in the lab at Dartmouth, then at NetRights, and following our acquisition in 1997 by Digimarc, with the creation of Copyright Direct(tm) at Yankee Rights Management (YRM) and my subsequent involvement with the content identification and metadata communities. More on that in our next installment...

Wednesday, November 18, 2009

DRM & Me: A 15-year retrospective (Part 1)

Fifteen years ago, in November 1994, I was two years into a Ph.D. program at the Thayer School of Engineering at Dartmouth College. I had entered Dartmouth with a background in computer engineering and an interest in "special-purpose systems," a narrow field that focuses on creating computing systems that are exceptionally good at a very narrow range of operations, such as particle-in-cell simulation or gene sequence processing. This interest led me across campus to become a research assistant in Dr. Joseph V. Henderson's pioneering Interactive Media Lab at Dartmouth Medical School --- at first to consider the infrastructural problems of delivering IML's high-value multimedia training programs across the Internet, and by mid-1994 over a novel set of technologies known as the "World Wide Web."

As the story goes, the IML team was preparing a major set of demos for a visit by Dr. C. Everett Koop, a Dartmouth alumnus, area resident and recently retired after serving as one of the more influential Surgeons General the United States has ever had. My particular focus was creating an interactive web site for IML, in particular the delivery of several key video sequences via the web. Several of us worked long into the night to migrate a few select videos into tolerable QuickTime format and suitable "thumbnails," then onto the lab's server, then linked (for downloading) from web pages, and finally viewable on the demo Mac.

When Joe arrived on the morning of our demo, I greeted him with (something like), "Joe, I got the 'Binding Sequence' up on the Web!" His incredibly insightful response was:

John, that's great!...John, that's terrible!

Joe proceeded to express his concerns about two fundamental implications of my "success":

  • The copyright implications, especially as many IML programs were funded by private entities that retained certain rights to the works;
  • The implications of dis-aggregating medical and other training programs and delivering their content out-of-context, possibly doing harm to their message due to loss of design integrity.

Joe framed the challenge for me: to study the question of rights management from the perspective of multimedia production. In 24 hours, I learned that this was an important and rising issue that was not going away; that very little research had been done on the question from a practical standpoint; that the few proposed solutions at the time were overly simplistic, equating "copyright management" with "security" and in fact doing neither; and that no one appeared to be considering the issues from the perspective of the creator. In 24 hours, my Ph.D. topic was born!

This leads us to 1 November 1994, when I presented my dissertation proposal, which included as an example research artifact my Mr. Copyright(tm) prototype --- quickly re-named at the urging of my committee and others to LicensIt(tm). LicensIt demonstrated, in the form of an easy-to-use desktop "appliance," the key ideas of (a) binding actionable copyright metadata to multimedia objects, and (b) user-friendly, real-time, networked copyright registration. The LicensIt desktop icon, modeled after the famous StuffIt(tm) coffee grinder, said it all: users dragged and dropped their content (initially GIF files) onto LicensIt; a dialog popped up to collect (and display) their descriptive and other metadata and to enable them to select their "registration server" from a menu of choices; their work was registered. By way of both the static metadata and the registry, users would be able to contact the principals involved in the creation of the item. I envisioned several other options, including registering digital signatures to allow users to authenticate a work in hand, as well as enveloping the work in an encrypted envelope.

It is important to note that the focus of my work at that time was on enabling copyright by binding static and dynamic metadata to content, and especially on making it as accessible as possible within the context of use; content security was only a secondary concern. "Enablement" means that although a desktop client is interesting, plugins for creation tools like Photoshop, Acrobat and Macromedia Director, and enjoyment tools like Mosaic --- this was 1994!! --- would be infinitely more interesting and useful! I assumed that one day, creators would be mixing and matching content found around the web, and at least commercial and other highly visible producers would want/need to "do the right thing" w.r.t. copyright and thus would benefit from instantly accessible attribution, bound to the item. Note that I was heavily influenced at that time by the writings of Prof. Henry H. Perritt, Jr., whose concept of permissions headers was not only an inspiration for me, but I believe anticipated Creative Commons licensing templates.

Fifteen years later, we can at least say the world is different! The world we imagined 15 years ago of rampant "re-mixing" of content has arrived; licensing models such as Creative Commons have improved awareness; but still the infrastructure does not accommodate the discovery and transmission of rights information as readily as it should. With the rise of new data-centric models such as Linked Data (a practical outcome of Semantic Web research) and the acceptance of persistent identifier systems including the Handle System and the Digital Object Identifier, we're getting there...

Next installment: The NetRights and YRM years...

Friday, October 2, 2009

Embracing the "Groundswell"

The "social web" enables smart companies to engage in conversations at any scale with their customers, from product support to "ideation" that leads to new products, features and more efficient operations. Companies ignore the reality of social computing at their peril; new and evolving web technologies make it increasingly easy for sufficiently-motivated customers to spontaneously generate and maintain communities around a company's products and services in order to offer mutual support, accolades, but also to air grievances they feel those companies are ignoring.

Smart, proactive companies understand that committing themselves to maintaining healthy community-based relationship with their customers is not only the right thing to do, but is essential in this age of the digital native.

Forrester Research analysts Charlene Li and Josh Bernoff discussed this reality in their 2008 book groundswell: winning in a world transformed by social technologies, the culmination of their research since 2006 on the role that social computing plays in the enterprise. groundswell (Harvard Business Press) contains many rich case studies illustrating how companies have successfully applied social web strategies to foster communities internally and especially with their customers.

One thing that impresses me about groundswell is that it recognizes that companies must have a wide variety of "critical conversations" (to paraphrase Peter Block) with their customers, and that there is an equally diverse palette of technologies for implementing these conversations. Here is a sample of the social technologies --- Li and Bernoff call them "strategies" --- that they examine in detail:

Ambassador programs; Blogs; Brand monitoring; Community (ideas); Community (private); Community (public); Crowd-sourcing; Discussion forums; Q & A; Ratings and reviews; Social networking sites; User-generated videos; Voting; Wikis; Widgets;

This is a great book targeted at decision makers who need to understand the rich set of relationships they must establish with their customers, and how to apply social computing tools to maintain those relationships.

Wednesday, September 30, 2009

The Care and Feeding of (online) Communities

I love the topic of "communities": how to build them, how to maintain them, and understanding the critical factors that determine their success or failure. It's fascinating to me to discover why many ad hoc, unplanned communities succeed while so many more intentional communities fail. The answers are in the common, essential elements that successful communities embody, often organically, and which failing communities neglect.

A colleague recently asked for a short list of recommendations for resources on "community." Here are my top picks:

1. A great "pure" book on building community -- in general, not specifically online -- is Peter Block's Community: The Structure of Belonging. Block's book focuses on the sorts of critical conversations that must happen for communities to happen. It is a "bible" for community organizing, etc.

2. Perhaps the best, ready-to-apply overview I have seen is Guy Kawasaki's How to Change the World: The Art of Creating a Community which (like much of Kawasaki's material) is based on insights dating back to his days as The "Software Evangelist" on the original Macintosh project. His points are nicely mapped onto non-SW and even non-Web communities.

3. The site CommunitySpark.com regularly provides excellent advice. I've followed it for more than a year and have found it to have very good articles, discussions and podcasts on online community construction and maintenance, including very practical articles on dealing with trouble-makers, inciting conversation, etc. I like it because it puts into practice many concepts I've seen in the Block book, etc.

4. Finally, Rick Warren's The Purpose-Driven Church is full of proven advice on community building from a spiritual perspective, 100% in synch with the practical advice provided by the resources above. Warren's Saddleback Church in Orange County, CA grew from nothing into one of the largest and most successful evangelical congregations in the USA based on these principles.

Tuesday, June 16, 2009

A little about me...

"For the record," here is a little about me:

I have spent many years studying the unique social, legal, and technical problems that arise when managing and disseminating information in the digital environment. In my role as a principal investigator on several projects at HP Labs, I have focused on the policy-based management and personalization of distributed, heterogeneous digital object repositories and content processing architectures. Most recently I was co-PI on Fractal, research focused on delivering a platform for content-centered collaboration spaces "in the cloud." Before joining HP Labs in January 2000 I was the architect of Copyright Direct (tm), the first real-time, Internet-based service to fully automate the complex copyright permissions process for a variety of media types.

Since 1998 I have been awarded multiple US patents for digital rights management (DRM) and information security technologies; numerous related patents are pending. I have been an active participant in a number of international metadata and rights management standards efforts and currently serve on the OAI Object Reuse and Exchange (OAI-ORE) advisory committee, the DSpace Architectural Review committee, the Handle System Technical Review committee and the Global Handle System Advisory Committee. In early 2007 I was elected to the board of directors of the National Information Standards Organization (NISO). In the past I have served on the Industry Working Group for Digital Copyright Submissions for the U.S. Copyright Office, the OASIS Rights Language Technical Committee, and the W3C Digital Rights Management Program Committee. I was a charter editorial board member of IEEE Security and Privacy magazine.

From 1997-1999, I was VP of Technology Strategy and a co-founder of Yankee Rights Management. From 1995-1997, I was VP of Product & Technology Strategy for NetRights, LLC, a company I co-founded in 1995 to commercially deploy my research in technologies for copyright management in the digital, networked environment. NetRights was sold in 1997 to Digimarc Corporation (DMRC), a leading provider of digital image watermarking technologies. From 1984-1992 I was a systems architect and project leader for Digital Equipment Corporation.

I hold a Ph.D. in Engineering Sciences from Dartmouth College (1997), an M.Eng.(EE) from Cornell University (1989), and a BSEE from Rensselaer Polytechnic Institute (1984).

Wednesday, June 3, 2009

Open for Business!

Prompted by the changing climate in corporate research, I am looking for new opportunities and especially collaborations. "Bitwacker" is my new public home!