Friday, March 19, 2010

"This linked data went to market...wearing lipstick!?!"

This post originally appeared on the Wordpress.com version of this blog

Paraphrasing the nursery rhyme,

This linked data went to market,
This linked data stayed open,
This linked data was mashed-up,
This linked data was left alone.
And this linked data went...
Wee wee wee all the way home!

In his recent post Business models for Linked Data and Web 3.0 Scott Brinker suggests 15 business models that "offer a good representation of the different ways in which organisations can monetise — directly or indirectly — data publishing initiatives." As is our fashion, the #linkeddata thread buzzed with retweets and kudos to Scott for crafting his post, which included a very seductive diagram.

My post today considers whether commercial members of the linked data community have been sufficiently diligent in analysing markets and industries to date, and what to do moving forward to establish a sustainable, linked data-based commercial ecosystem. I use as my frame of reference John W. Mullins' The New Business Road Test: What entrepreneurs and executives should do before writing a business plan. I find Mullins' guidance to be highly consistent with my experience!

So much lipstick...
As I read Scott's post I wondered, aren't we getting ahead of ourselves? Business models are inherently functions of markets --- "micro" and "macro" [1] --- and their corresponding industries, and I believe our linked data world has precious little understanding of the commercial potential of either. Scott's 15 points are certainly tactics that providers, as the representatives of various industries, can and should weigh as they consider how to extract revenue from their markets, but these tactics will be so much lipstick on a pig if applied to linked data-based ecosystems without sufficient analysis of either the markets or the industries themselves.

Pig sporting lipstick

To be specific, consider one of the "business models" Scott lists...

3. Microtransactions: on-demand payments for individual queries or data sets.
By whom? For what? Provided by whom? Competing against whom? Having at one time presented to investment bankers, I can say that "microtransactions" is no more of a business model for linked data than "Use a cash register!" is one for Home Depot or Sainsbury's! What providers really need to develop is a deeper consideration of the specific needs they will fulfill, the benefits they will provide, and the scale and growth of the customer demand for their services.

Macro-markets: Understanding Scale

A macro-market analysis will give the provider a better understanding of how many customers are in its market and what the short- and long-term growth rates are expected to be. While it is useful for any linked data provider, whether commercial or otherwise, to understand the scale of its customer base, it is absolutely essential if the provider intends to take on investors, because they will demand credible, verifiable numbers!

Providers can quantify their macro-markets by identifying trends: demographic, socio-cultural, economic, technological, regulatory, and natural. Judging whether the macro-market is attractive depends upon whether those trends work in favour of the opportunity.

Micro-markets: Identifying Segments, Offering Benefits

Whereas macro-market analysis considers the macro-environment, micro-market analysis focuses on identifying and targeting segments where the provider will deliver specific benefits. To paraphrase John Mullins, successful linked data providers will be those who deliver great value to their specific market segments:

  • Linked data providers should be looking for segments where they can provide clear and compelling benefits to the customer; commercial providers should especially look to ease customers' pain in ways for which they will pay.
  • Linked data providers must ask whether the benefits their services provide, as seen by their customers, are sufficiently different from and better than those of their competitors, e.g. in terms of data quality, query performance, a more supportive community, better contract support services, etc.
  • Linked data providers should quantify the scale of the segment just as they do the macro-environment: how large is the segment and how fast is it growing?
  • Finally, linked data providers should ask whether the segment can be a launching point into other segments.
The danger of falling into the "me-too" trap is particularly glaring with linked data, since a provider's competition may come from open data sources as well as other commercial providers: think Encarta vs. Wikipedia!

Having helped found a start-up in the mid-1990s, I am acutely aware of the difference between perceived and actual need. The formula for long-term success and fulfillment is fairly straightforward: provide a service that people need, and solve problems that people need solved!

References

  1. John W. Mullins, The New Business Road Test (FT Prentice Hall, 2006)

DOIs, URIs and Cool Resolution

This post originally appeared on the Wordpress.com version of this blog.

The art of happiness is to serve all -- Yogi Bhajan


Once we get beyond the question of the basic HTTP URI-ness of the digital object identifier (DOI) --- since for each DOI there exist DOI-based HTTP URIs via the dx.doi.org and hdl.handle.net proxies, this issue is moot --- and old-skool questions of "coolness" based on the relative brittleness over time of creative URI encoding [1], we are left with the more substantial question of whether DOI-based HTTP URIs really "behave" themselves within the "Web-of-Objects" universe. The purpose of this post is to identify the problem and propose a potential solution, implementation of which will require certain changes to the current Handle System platform. I believe that if the proposed changes are made, lingering questions concerning the "URI-ness" of DOIs (and Handles) will disappear once and for all.

Note: It is beyond the scope of this post to present all of the gory background details regarding the Handle System, the DOI, and the 1998 and 2008 versions of "Cool URIs." If there is enough interest in a stand-alone article, I will happily consider writing a longer version in the future, perhaps as a piece for D-Lib Magazine.

With the increasing influence of semantic web technologies there has been strong interest in assigning actionable HTTP URIs to non-document things, ranging from abstract ideas to real world objects. In the case of URI-named, Web-accessible physical items --- sensors, routers and toasters --- this is sometimes referred to as The Web of Things. Until 2005 the community disagreed as to what an HTTP URI could be assumed to represent, but a June 2005 decision by the W3C TAG settled the issue: If a server responds with an HTTP response code of 200 (aka a successful retrieval), the URI indeed is for an information resource; with no such response, or with a different code, no such assumption can be made. This "compromise" was said to have resolved the issue, leaving a "consistent architecture." [3]

The result of this decision was to force consensus on how to apply the long-established principles of HTTP content negotiation in more consistent ways. In particular, "human" and "machine" requests to a given entity URI --- a top-level URI representing a "thing" --- should be treated differently; for example, there should be different responses to requests with HTTP headers specifying Accept: text/html (for an HTML-encoded page) versus Accept: application/rdf+xml (for RDF-modeled, XML-encoded data). This is most often seen in the semantic web and linked data worlds, where it is now common to have both textual and machine readable manifestations of the same URI-identified thing.

Modern web servers, including Apache, have been engineered to handle these requests through content negotiation [4]. Through standard configuration procedures, site administrators specify how their servers should respond to text/html and application/rdf+xml requests, in the same way they specify what should be returned for alternate language and encoding requests ("en", "fr", etc.). Typically, when media-specific requests are made against entity URIs representing concepts, the accepted practice is to return a 303 See Other response code with the URI of a resource containing a representation of the expected type, such as an HTML-encoded page or an XML document with RDF-encoded data.

Many readers of this post will be familiar with the basic idea of HTTP proxy-based Handle System name resolution: an HTTP resolution request for a DOI-based URI is made to a proxy --- a registration-agency-run proxy such as dx.doi.org or the "native" Handle System proxy hdl.handle.net --- the appropriate local handle server is located, the handle record for the DOI is resolved, and the default value (e.g. the URL of a document information page) is returned to the client in a 302 Found response. In a Web of Documents this might make sense, but in a universe of URI-named real-world objects and ideas, not so much.

The 2008 "Cool URIs" document [2] provides two requirements for dealing with URIs that identify real-world objects:

  1. Be on the Web: Given only a URI, machines and people should be able to retrieve a description about the resource identified by the URI from the Web. Such a look-up mechanism is important to establish shared understanding of what a URI identifies. Machines should get RDF data and humans should get a readable representation, such as HTML. The standard Web transfer protocol, HTTP, should be used.
  2. Be unambiguous: There should be no confusion between identifiers for Web documents and identifiers for other resources. URIs are meant to identify only one of them, so one URI can't stand for both a Web document and a real-world object.

In the post-2005 universe of URI usage as summarised above and detailed in [2], if DOI-based URIs are used to represent conceptual objects these rules will be broken! For example, Handle System proxies today cannot distinguish between Accept: media types in the request headers; the only possible resolution is the default (first) element of the Handle record. (For hackers or the merely curious out there, I encourage you to experiment with curl at your command line or Python's urllib2 library, hitting the DOI proxy with a DOI-based URL like http://dx.doi.org/10.1109/MIC.2009.93; see the sketch below.) This problem with how proxies resolve DOIs and Handles is a lingering manifestation of the native Handle System protocol not being HTTP-based, with the system of HTTP-based proxies being something of a work-around; yet the vast majority of DOI and Handle System resolutions occur through and rely on these proxies.
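Here is one way to run that experiment with Python's urllib2 (Python 2). The script disables redirect-following so the proxy's status code and Location header can be inspected directly; if the proxy behaves as described above, both Accept values should yield the same redirect target.

import urllib2

DOI_URL = "http://dx.doi.org/10.1109/MIC.2009.93"

class DontFollowRedirects(urllib2.HTTPRedirectHandler):
    # Returning None tells urllib2 not to follow the redirect, so the
    # proxy's 302 surfaces as an HTTPError whose headers we can inspect.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

opener = urllib2.build_opener(DontFollowRedirects())

for accept in ("text/html", "application/rdf+xml"):
    request = urllib2.Request(DOI_URL, headers={"Accept": accept})
    try:
        opener.open(request)
    except urllib2.HTTPError as e:
        # Expect the same status and Location for both Accept values,
        # since the proxy resolves only the default handle value.
        print("%-22s -> %d %s" % (accept, e.code, e.info().getheader("Location")))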

One possible solution would be to enable authorities --- Registration Agencies --- who operate within the Handle System to configure how content negotiation within their Handle prefix space is handled at the proxy. For document-based use of the DOI, an example would be to return the URI in the first element of the Handle record whenever a text/html request is made and (for example) the second element whenever an application/rdf+xml request is made. When a request is made to the proxy, a request-appropriate representation URI would be returned to the client along with a 303 See Other code. This approach treats the DOI-based URI as a conceptual or entity URI and gives the expected responses as per [2]. pax vobiscum...
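As a thought experiment, here is a minimal Python sketch of the proxy-side rule just described --- purely illustrative, not CNRI or registration-agency code, with a made-up DOI prefix and handle record --- mapping the requested media type to an index in the handle record and redirecting accordingly:

# Hypothetical handle record for an imaginary DOI, as (index, type, value) tuples.
HANDLE_RECORD = {
    "10.1234/abc": [
        (1, "URL", "http://publisher.example.org/articles/abc.html"),
        (2, "URL", "http://publisher.example.org/articles/abc.rdf"),
    ],
}

# Per-prefix configuration chosen by the registration agency:
# requested media type -> index of the handle value to redirect to.
PREFIX_CONNEG = {
    "10.1234": {
        "text/html": 1,
        "application/rdf+xml": 2,
    },
}

def resolve(doi, accept):
    """Return (status, location) for a proxy resolution request."""
    prefix = doi.split("/", 1)[0]
    index = PREFIX_CONNEG.get(prefix, {}).get(accept, 1)  # default to the first value
    for idx, _handle_type, value in HANDLE_RECORD[doi]:
        if idx == index:
            return 303, value   # 303 See Other, treating the DOI as an entity URI
    return 404, None

print(resolve("10.1234/abc", "application/rdf+xml"))
# -> (303, 'http://publisher.example.org/articles/abc.rdf')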

Readers familiar with the Handle System will appreciate that there are many potential schemes for relating HTTP content type requests to elements of the Handle record; in the example above I use position (index value), but it is also possible to use special TYPEs.

Handle servers are powerful repositories and can potentially implement many models other than the redirection described above. Sometimes, for example, the desire is to use a Handle record as the primary metadata store; in that case, the preferred application/rdf+xml response might very well be an RDF-encoded serialisation of the Handle record itself. How this is handled should be a feature of the Handle server platform and a decision by registration agencies based on their individual value propositions, not something locked in by the code.

I eagerly look forward to your comments and reactions on these ideas!

Update 1: In a comment to this post, Herbert Van de Sompel argues that the real question is, what should DOIs represent? Herbert asserts that DOI-based URIs should model OAI-ORE resource aggregations and that Handle System HTTP proxies should behave according to OAI-ORE's HTTP implementation guidelines. Herbert's suggestion doesn't conflict with what I've written above; this is a more subtle and (arguably) more robust view of how compound objects should be modeled, which I generally agree with.

Here's how OAI-ORE resolution would work following the Handle proxy solution I've described above: Assume some DOI-based HTTP URI doi.A-1 identifies an abstract resource aggregation "A-1" (In OAI-ORE nomenclature doi.A-1 is the Aggregation URI). Following the given HTTP implementation example, let there be two Resource Maps that "describe" this Aggregation, an Atom serialization and an RDF/XML serialization. Each of these Resource Maps is (indeed MUST be) available from different HTTP URIs, ReM-1 and ReM-2, but the desired behaviour is for either to be accessible through the DOI-based Aggregation URI, doi.A-1. Let these two URIs be persisted in the Handle record, preferably using TYPEs which distinguish how they should be returned to clients based on the naming authority's configuration of the HTTP proxy. By the approach I describe above, the Handle System proxy would then respond to resolution requests for doi.A-1 with 303 See Other redirects to either ReM-1 or ReM-2 depending upon MIME-type preferences expressed in the Accept: headers of the requests.
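A corresponding sketch for this Aggregation case, this time keying on illustrative handle TYPEs (as mentioned in the aside above) rather than index positions; the DOI, TYPE names and Resource Map URIs here are all hypothetical:

# Hypothetical handle record for the Aggregation doi.A-1, keyed by TYPE.
HANDLE_RECORD = {
    "10.1234/A-1": {
        "ORE_REM_ATOM": "http://example.org/rem/A-1.atom",   # ReM-1 (Atom)
        "ORE_REM_RDF":  "http://example.org/rem/A-1.rdf",    # ReM-2 (RDF/XML)
    },
}

# Naming-authority configuration: requested media type -> handle TYPE.
TYPE_FOR_MEDIA = {
    "application/atom+xml": "ORE_REM_ATOM",
    "application/rdf+xml":  "ORE_REM_RDF",
}

def resolve_aggregation(doi, accept):
    record = HANDLE_RECORD[doi]
    handle_type = TYPE_FOR_MEDIA.get(accept, "ORE_REM_ATOM")  # fall back to Atom
    return 303, record[handle_type]   # 303 See Other to the chosen Resource Map

print(resolve_aggregation("10.1234/A-1", "application/rdf+xml"))
# -> (303, 'http://example.org/rem/A-1.rdf')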

Update 2: Complete listing of MIME types for OAI-ORE Resource Map serializations. Follow-up conversations with Herbert Van de Sompel, Carl Lagoze and others have reminded me I neglected to mention how the OAI-ORE model recommends handling "HTML" (application/xhtml+xml and text/html) requests! This is not a minor issue, since the purpose of ORE is to model aggregations of resources and not resources themselves, and so it is not immediately clear what such a page request should return. My solution (for the purposes of this blog post) is for Handle System HTTP proxies to respond to these requests also with 303 See Other redirects, supplying redirect URIs that map to appropriately-coded "splash screens."


For completeness, the table below (repeated from [5]) lists the standard MIME types for Resource Map serializations. Continuing with the major theme of this post, Handle System HTTP proxies resolving requests for DOI-named ORE Resource Maps should follow these standards so the clients may request appropriate formats using HTTP Accept: headers.


Resource Map Type    MIME type
Atom                 application/atom+xml
RDF/XML              application/rdf+xml
RDFa in XHTML        application/xhtml+xml

If a client prefers RDF/XML but can also parse Atom then it might use the following HTTP header in requests:

Accept: application/rdf+xml, application/atom+xml;q=0.5
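
For illustration only (this is not part of the ORE guidelines), a content-negotiation-aware proxy might parse such a header and choose among the serializations above along these lines:

AVAILABLE = ["application/rdf+xml", "application/atom+xml"]

def parse_accept(header):
    """Return {media_type: q} parsed from an Accept header."""
    prefs = {}
    for part in header.split(","):
        bits = part.strip().split(";")
        mtype, q = bits[0].strip(), 1.0
        for param in bits[1:]:
            name, _, value = param.strip().partition("=")
            if name == "q":
                q = float(value)
        prefs[mtype] = q
    return prefs

def choose(header, available=AVAILABLE):
    """Pick the available serialization with the highest q-value."""
    prefs = parse_accept(header)
    ranked = sorted(available, key=lambda m: prefs.get(m, 0.0), reverse=True)
    return ranked[0] if prefs.get(ranked[0], 0.0) > 0 else None

print(choose("application/rdf+xml, application/atom+xml;q=0.5"))
# -> application/rdf+xml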

The table below lists the two common MIME types for HTML/XHTML Splash Pages, following the W3C XHTML Media Types recommendations.

Splash Page Type     MIME type
XHTML                application/xhtml+xml
HTML (legacy)        text/html

Thus, if a client wishes to receive a Splash Page from the Aggregation URI and prefers XHTML to HTML then it might use the following HTTP header in requests:

Accept: application/xhtml+xml, text/html;q=0.5


As noted in [5] there is no way to distinguish a plain XHTML document from an XHTML+RDFa document based on MIME type. It is thus not possible for a client to request an XHTML+RDFa Resource Map in preference to an RDF/XML or Atom Resource Map without running the risk of a server correctly returning a plain XHTML Splash Page (without included RDFa) in response.

The Handle record for a given DOI or Handle identifying an ORE aggregation would therefore contain a set of URIs reflecting the mappings in the tables above. A content-negotiation-savvy Handle System HTTP proxy would then return the appropriate URI in the 303 See Other response, based on its configuration and policies.

References:

See the ensuing comments at my Wordpress.com version of this blog...

Community as a Measure of Research Success

This post originally appeared on the Wordpress.com version of this blog

In his 02 Feb 2010 post entitled Doing the Right Thing vs. Doing Things Right, Matthias Kaiserswerth, the head of IBM Research - Zurich, sums up his year-end thinking with this question for researchers...

We have so many criteria of what defines success that one of our skills as research managers is to choose the right ones at the right time, so we work on the right things rather than only doing the work right...For the scientists that read this blog, how do you measure success at the end of the year?

Having just “graduated” after a decade with another major corporate research lab, I find this topic near and dear to my heart! My short answer was the following blog comment...

I can say with conviction that the true measure of a scientist must be their success in growing communities around their novel ideas. If you can look back over a period of time and say that you have engaged in useful discourse about your ideas, and in so doing have moved those ideas forward — in your mind and in the minds of others — then you have been successful...Publications, grad students and dollar signs are all artifacts of having grown such communities. Pursued as ends unto themselves, it is not a given that a community will grow. But if your focus is on fostering communities around your ideas, then these artifacts will by necessity follow...

My long answer is that those of us engaged in research must act as stewards of our ideas; we must measure our success by how we apply the time, skills, assets, and financial resources we have available to us to grow and develop communities around our ideas. If we can look back over a period of time — a day, a quarter, a year, or a career — and say that we have been “good stewards” by this definition, then we can say we have been successful. If on the other hand we spend time and money accumulating assets, but haven't moved our ideas forward as evidenced by a growing community discourse supporting those ideas, then we haven't been successful.

A very trendy topic over the past few years has been open innovation, as iconified by Henry Chesbrough's 2003 book of the same name. Chesbrough's "preferred" definition of OI, found in Open Innovation: Researching a New Paradigm (2006), reads as follows...

Open innovation is the use of purposive inflows and outflows of knowledge to accelerate internal innovation, and expand the markets for external use of innovation, respectively. [This paradigm] assumes that firms can and should use external ideas as well as internal ideas, and internal and external paths to market, as they look to advance their technology.

In very compact language Chesbrough (I believe) argues that innovators within organisations can best move their ideas forward through open, active engagement with internal and external participants. [1] Yes, individual engagement could be conducted through closed "tunnels," but for the ideas to truly flourish (think Java) this is best done through open communities. I believe the most important --- perhaps singular --- responsibility of the corporate research scientist is to become a "master of their domain": to know their particular area of interest and expertise better than anyone, to propose research agendas based upon that knowledge, and to leverage their company's assets to motivate communities of interest around those ideas. External communities that are successfully grown based on this view of OI can become force multipliers for the companies that invest in them!

To appreciate this, one need only consider the world of open source software and the ways in which strong communities contribute dimensions of value that no single organisation could... I'll pause while you contemplate this idea: open-source-like communities of smart people developing your ideas. Unconvinced? Then think about "Joy's Law," famously attributed to Sun Microsystems co-founder Bill Joy (1990):

No matter who you are, most of the smartest people work for someone else

Bill Joy's point was that the best path to success is to create communities [2] in which all of the "world's smartest people" are applying themselves to your problems and growing your ideas. As scientists, our measure of success must be how well we leverage the assets available to us to grow communities around our ideas.

Peter Block has given us a profound, alternative perspective on the role of leaders in the context of communities [3]. In his view, leaders provide context and produce engagement; specifically, leaders...

  • Create a context that nurtures an alternative future, one based on gifts, generosity, accountability, and commitment;
  • Initiate and convene conversations that shift peoples' experience, which occurs through the way people are brought together and the nature of the questions used to engage them;
  • Listen and pay attention.

Ultimately, I believe that successful researchers must first be successful community leaders, by this definition!

Update: In a 4 Feb 2010 editorial in the New York Times entitled Microsoft's Creative Destruction, former Microsoft VP Dick Brass examines why Microsoft, America’s most famous and prosperous technology company, no longer brings us the future. As a root cause, he suggests:

What happened? Unlike other companies, Microsoft never developed a true system for innovation. Some of my former colleagues argue that it actually developed a system to thwart innovation. Despite having one of the largest and best corporate laboratories in the world, and the luxury of not one but three chief technology officers, the company routinely manages to frustrate the efforts of its visionary thinkers.

I believe Mr. Brass' analysis is far too inwardly focused. Never in his editorial does Mr. Brass lift up the growing outreach by Microsoft Research, especially under the leadership of the likes of Tony Hey (CVP, External Research) and Lee Dirks (Director, Education & Scholarly Communications), to empower collaboration with and sponsorship of innovative researchers around the world. Through its outreach Microsoft is enabling a global community of innovators and is making an important contribution far beyond its bottom line. I think Mr. Brass would do well to focus on the multitude of possibilities Microsoft is helping to make real through its outreach, rather than focusing on what he perceives to be its problems...

Notes:

  1. One version of the open innovation model has been called distributed innovation. See e.g. Karim Lakhani and Jill Panetta, The Principles of Distributed Innovation (2007)
  2. Some authors have referred to "ecologies" or "ecosystems" when interpreting Bill Joy's quote, but I believe the more accurate and useful term is community.
  3. For more on community building, see Peter Block, esp. Community: The Structure of Belonging (2008)

Wednesday, January 20, 2010

Bitwacker has moved to Wordpress.com!

Due to Blogspot's overly-aggressive spam-blog detection, which resulted in the Bitwacker Associates blog being shut down for most of December 2009 and January 2010, I've moved to Wordpress.com. Find me there --- with a better style! --- at:



http://bitwacker.wordpress.com/



John S. Erickson, Ph.D.

Wednesday, December 16, 2009

Recent Efforts toward Linked Multimedia Metadata

Recently I've been "having a think" on issues ranging from rights expression for datasets to realising the value of linked data, but frankly I've felt that something is missing; even with scientific and government linked datasets going online, a voice inside me wonders if the stakes are still (arguably) too low to really shake things up. I've been wondering what kind of data we haven't been hearing about --- the kind of data that, if it were published according to linked data principles, would surely lead to the emergence of outrageously cool applications, demonstrate the inherent value of the linked data approach, and perhaps even test some interesting new monetisation models. The area that immediately came to mind was multimedia metadata, especially semantic metadata for video and audio content.

Several recent venues have focused on the general topic of generating, publishing and using semantic multimedia metadata, including the Oct-Dec 2009 IEEE Multimedia Magazine special issue on Multimedia Metadata and Semantic Management, and SAMT2009: The 4th International Conference on Semantic and Digital Media Technologies (3-4 Dec 2009; Graz, Austria). Both of these are "powered" by members of the Multimedia Metadata Community, an outgrowth of the MPEG-7 and MPEG-21 worlds that "brings together experts from research and industry in the area of multimedia meta data interoperability for collaborative working environments." Finally, since 2008 the W3C has been host to its Video in the Web activity; within this the Media Annotations Working Group is developing an ontology and API to facilitate cross-community sharing and use of multimedia metadata in the Web.

IEEE Multimedia (Oct-Dec 2009): This special issue features six research articles focused on different facets of the "semantic management of multimedia and multimedia metadata," ranging from retrieval and processing to consumption and presentation. Of the six, perhaps the first two are most relevant in today's linked data environment:

  • "Managing and Querying Distributed, Multimedia Metadata." This article advocates the use of a centralized metadata résumé --- a condensed, automatically-constructed version of the larger metadata set --- for locating content on remote servers. The authors demonstrate the advantages of their approach using conventional semweb technologies to represent and query semantic metadata.
  • "Semantic MPEG Query Format Validation and Processing." The authors present their semantic validation of MPEG Query Format (MPQF) queries and their implementation of a practical MPQF query engine over an Oracle RDBMS. The article introduces methods for evaluating MPQF semantic-validation rules not expressed by syntactic means within the XML schema. The authors highlight their prototype implementation of an MPQF-capable processing engine using several query types on a set of MPEG-7 based image annotations.
  • "Diversifying Image Retrieval with Affinity-Propagation Clustering on Visual Manifolds." The authors describe a post-processing subsystem for retrieval systems that improves the diversity of results presented to users. Image retrieval systems typically focus on the similarity between the retrieval and sample images, where the relevance of the retrieval results is considered but the diversity is neglected. Ideally, retrieval results should contain a diverse array of items representing a variety of subtopics. This article presents a method for removing duplicate images from a "top 20" list, replacing them with images representing new subtopics.
  • "A Media Value Chain Ontology for MPEG-21." The authors have created a semantic representation of intellectual property derived from MPEG-21 Part 19. Their model defines the minimal set of types of intellectual property, the roles of users interacting with them, and the relevant actions regarding intellectual property law. The article is a helpful guide to the standardization efforts, with its many examples and useful insight into the multimedia value chain.
  • "Using Social Networking and Collections to Enable Video Semantics Acquisition." The authors consider media production, acquisition, and metadata gathering, the first elements of the multimedia value chain. Methods from video annotation and social networking are brought together to solve problems associated with gathering metadata that describes user interaction, usage, and opinions of video content. Individual user-interaction metadata is aggregated to provide semantic metadata for a given video. Coolness alert: The authors have successfully implemented their model in a Flex-based Facebook application!
  • "A Web-Based Music Lecture Database Framework." This article describes semantic audio authoring and presentation for Web-published music lectures. The authors propose a dynamic programming-based algorithm for MIDI-to-Wave alignment to explore the temporal relations between MIDI and the corresponding performance recording. The synchronized MIDI and wave can be attached to many kinds of teaching materials where synchronized presentations can add value.

SAMT'09: Nearly 15 years ago I had the good fortune to present my early rights metadata research at EDMEDIA'95 in Graz (Austria); visiting the conference web site this weekend, especially seeing the real-time image of the historic "Uhrturm" on the hill high above the city, brought back a flood of fond memories! The topics of the three tutorials offered at SAMT'09 demonstrate that current research has definitely taken a turn toward getting multimedia into the Web. (Unfortunately, only slides from the first are currently available):

  • "Web of Data in the Context of Multimedia (WoDMM)." How multimedia content can be integrated into the Web of Data and how users and developers can consume and benefit from linked data. (slides)
  • "MPEG Metadata for Context-Aware Multimedia Applications (MPEG)." Overview of MPEG metadata formats that enable the development and deployment of content- and context-aware multimedia applications.
  • "A Semantic Multimedia Web: Create, Annotate, Present and Share your Media (SemMMW)." How multimedia metadata can be represented and attached to the content it describes within the context of established media workflow practices, and how users can benefit from a Web of Data containing more formalized knowledge.

For much more information, see the Proceedings from the 20th International Workshop of the Multimedia Metadata Community on Semantic Multimedia Database Technologies (SeMuDaTe'09)

Metadata Standards for the Web of Data: Finally, research such as that described above has led to progress on the standards front. As the IEEE Multimedia guest editors note in their foreword, since 2008 there has been quiet but steady progress within the W3C's Video in the Web activity, which was chartered to make video a first-class citizen of the Web by creating an architectural foundation that, by taking full advantage of the Web's underlying principles, will enable people to create, navigate, search, link and distribute video... Of its three working groups, the editors highlight the Media Annotations Working Group, which is motivated by progress in RDF and topic maps and appears most aligned with emerging linked data activities.

In their foreword, the IEEE Multimedia editors provide a very nice summary of the core problem with multimedia metadata, and thus the motivation for the W3C efforts:

Most of the standards are tailored to specific application domains. Examples include European Broadcasting Union P/Meta 2.0 for broadcasting; TV-Anytime and SMPTE Metadata Dictionary for TV; and MPEG-21 for the delivery chain of multimedia and technical aspects (such as EXIF). These standards exhibit a different semantic level of detail in their descriptions (from simple keywords to regulated taxonomies and ontologies). Only some of the standards are general purpose, for instance MPEG-7...

Coolness is on the Horizon: This rather lengthy posting is merely a sampling of works in progress, aiming not only to put multimedia metadata on the Web but, more importantly, to establish such metadata as a useful and valuable part of the Web. Combined with such visionary efforts as the revamped, linked data-driven BBC web site, I'm increasingly confident that a new generation of linked data applications is around the corner, fueled this time by datasets that add video and audio to the semantic mix. Bring it on!

Friday, December 11, 2009

Scale-free Networks and the Value of Linked Data

Kingsley Idehen of OpenLink Software and others on the Business of Linked Data (BOLD) list have been debating a value proposition for linked data via Twitter (search for #linkeddata) and email. The discussion has included useful iterations on various "elevator pitches" and citations of recent successes, especially the application of GoodRelations e-commerce vocabularies at Best Buy. After some deep thought I decided to take the question of value in a different direction and to consider it from the perspective of the science of networks, especially with reference to the works of Albert-László Barabási, director of the Center for Complex Network Research and author of Linked: The New Science of Networks. I'd like to test the idea here that data sharing between organisations based on linked open data principles is the approach most consistent with the core principles of a networked economy. I believe that the linked data model best exploits "networking thinking" and maximizes the organisation's ability to respond to changes in relationships within the "global graph" of business. Using Barabási as a framework, linked data is the approach that most embodies a networked view of the economy from the macro- to the micro-economic level, and therefore best empowers the enterprise to understand and leverage the consequences of interconnectedness.

As has been noted numerous times elsewhere, the so-called Web of Data is perhaps the web in its purest form. Following Tim Berners-Lee's principles or "rules" as stated in his Linked Data Design Issues memo from 2006, we have a very elegant framework for people and especially machines to describe the relationships between entities in a network. If we are smart about how we define those links and the entities we create to aggregate those links --- the linked datasets we create --- we can build dynamic, efficiently adaptive networks embodying the two laws that govern real networks: growth and preferential attachment. Barabási illustrates these two laws with an example "algorithm" for scale-free networks in Chapter 7 of Linked. The critical lessons are (a) networks must have a means to grow --- there must not only be links, but the ability to add links --- and (b) networks must provide some mechanism for entities to register their preference for other nodes by creating links to the more heavily-linked nodes. Preferential attachment ensures that the converse is also true: entities will "vote with their feet" and register their displeasure with nodes by eliminating links.
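To make the two laws concrete, here is a toy Python sketch of preferential-attachment growth --- my own illustration, not Barabási's algorithm verbatim --- in which the oldest, most heavily linked nodes quickly emerge as hubs:

import random

def grow_network(n_nodes, links_per_new_node=2):
    """Toy growth with preferential attachment: each new node links to
    existing nodes with probability proportional to their current degree."""
    edges = [(0, 1), (1, 2), (2, 0)]          # small fully connected seed
    for new in range(3, n_nodes):
        # Sampling from this flattened endpoint list is sampling
        # proportional to degree: heavily linked nodes appear more often.
        endpoints = [node for edge in edges for node in edge]
        targets = set()
        while len(targets) < links_per_new_node:
            targets.add(random.choice(endpoints))
        edges.extend((new, t) for t in targets)
    return edges

edges = grow_network(200)
degrees = {}
for a, b in edges:
    degrees[a] = degrees.get(a, 0) + 1
    degrees[b] = degrees.get(b, 0) + 1
# The earliest, most heavily linked nodes end up as hubs: the rich get richer.
print(sorted(degrees.values(), reverse=True)[:5])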

In real networks, the rich get richer. In the Web, the value is inherent in the links. Google's PageRank merely reinforced the "physical" reality that the most valuable properties in the Web of Documents are those resources that are most heavily linked-to. Those properties provide added value if they in turn provide useful links to other resources. The properties that are sensitive to demand and can adapt to the preferences of their consumers, especially to aggregate links to more resources that compound their value and distinguish them from other properties, are especially valuable and are considered hubs.

Openness is important. At this point it is tempting to jump to the conclusion that Tim Berners-Lee's four principles are all we need to create a thriving Web of Data, but this would be premature; Sir Tim's rules are necessary but not sufficient conditions. Within any "space" where Webs of Data are to be created, whether global or constrained within an organisation, the network must embody the open world assumption as it pertains to the web: when datasets or other information models are published, their providers must expect them to be reused and extended. In particular this means that entities within the network, whether powered by humans or machines, must be free to arbitrarily link to (make assertions about) other entities within the network. The "friction" of permission in this linking process must approximate zero.

Don't reinvent and don't covet! The extent of graphs that are built within organisations should not stop at their boundaries; as the BBC has shown so beautifully with their use of linked data on the revamped BBC web site, the inherent value of their property was increased radically by not only linking to datasets provided elsewhere, openly on the "global graph," but also by enabling reuse of their properties. The BBC's top-level principles for the revamped site are all about openness and long-term value:


The site has been developed against the principles of linked open data and RESTful architecture where the creation of persistent URLs is a primary objective. The initial sources of data are somewhat limited but this will be extended over time. Here's our mini-manifesto: Persistence...Linked open data...RESTful...One web

The BBC has created a valuable "ecosystem"; their use of other resources, especially MusicBrainz and DBPedia, has not only made the BBC site richer but in turn has increased the value of those properties. And those properties will continue to increase in value; by the principle of preferential attachment, every relationship "into" a dataset by valuable entities such as the BBC in turn increases the likelihood that other relationships will be established.

Links are not enough. It should be obvious that simply exposing datasets and providing value-added links to others isn't enough; as Eric Hellman notes, dataset publishers must see themselves as service providers who add value beyond simply exposing data. Some will add value to the global graph by gathering, maintaining, and publishing useful datasets and fostering a community of users and developers; others will add value by combining datasets from other services in novel ways, possibly decorated with their own. Eric has argued that the only winners in the linked open data space have indeed been those who have provided such merged datasets as a service.

Provide value-adding services and foster community. I would argue that dataset providers asking how they might realise the full value potential of publishing their datasets on the Web should examine whether, based on the principles I've outlined above, they have done everything they can to make their datasets part of the Web (rather than merely "on" the web) and have truly added value to the global graph. Do they view themselves as a service? Have they made their datasets as useful and easy-to-use as possible? Have they provided the best possible community support, including wikis and other mechanisms? Have they fully documented their vocabularies? Have they clearly defined any claimed rights, and in particular have they considered adopting open data principles?

Tuesday, December 8, 2009

Linking Rights to Aggregations of Data (Part 2)

In my background research for today's entry I discovered that the smart people at Talis, especially Ian Davis, have been working the problem I outlined in Linking Rights to Aggregations of Data (Part 1). Specifically, back in July 2009 Ian proposed WAIVER: A vocabulary for waivers of rights. In Ian's words,

(The WAIVER) vocabulary defines properties for use when describing waivers of rights over data and content. A waiver is the voluntary relinquishment or surrender of some known right or privilege. This vocabulary is designed for use with the Open Data Commons Public Domain Dedication and License and with the Creative Commons CC-0 waiver

In his July 2009 post Linked Data and the Public Domain Ian argues for providers to unambiguously declare their datasets public domain and explains how to use the WAIVER vocabulary to do this, in the context of a voID description of a dataset. (See also this email discussion thread, involving several of the thought leaders in this area.) Ian provides the following example, which I repeat here to illustrate (a) the use of voID to describe a dataset named "myDataset," (b) the use of the wv:waiver property to link the dataset to the Open Data Commons PDDL waiver, (c) the use of the wv:declaration property to include a human-readable declaration of the waiver, and (d) the use of the wv:norms property to link the dataset to the community norms he suggests, ODC Attribution and Share-alike.


<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/terms/"
         xmlns:wv="http://vocab.org/waiver/terms/"
         xmlns:void="http://rdfs.org/ns/void#">
  <void:Dataset rdf:about="http://myOrganisation.org/myDataset">
    <dc:title>myDataset</dc:title>
    <wv:waiver rdf:resource="http://www.opendatacommons.org/odc-public-domain-dedication-and-licence/"/>
    <wv:norms rdf:resource="http://www.opendatacommons.org/norms/odc-by-sa/" />
    <wv:declaration>
      To the extent possible under law, myOrganisation
      has waived all copyright and related or neighboring rights to
      myDataset
    </wv:declaration>
  </void:Dataset>
</rdf:RDF>
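
Such a description can then be consumed programmatically; here is a brief sketch using the rdflib Python library, assuming the description above has been saved locally as myDataset.rdf (a hypothetical filename), that extracts the waiver and norms it declares:

from rdflib import Graph, Namespace
from rdflib.namespace import RDF

WV = Namespace("http://vocab.org/waiver/terms/")
VOID = Namespace("http://rdfs.org/ns/void#")

g = Graph()
g.parse("myDataset.rdf", format="xml")   # the voID/WAIVER description above

for dataset in g.subjects(RDF.type, VOID.Dataset):
    print("Dataset: %s" % dataset)
    print("  waiver: %s" % g.value(dataset, WV.waiver))
    print("  norms:  %s" % g.value(dataset, WV.norms))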

WAIVER and OAI-ORE: As I proposed in Part 1, we should be able to combine the voID and OAI-ORE approaches. The only conceptual difference is that, under OAI-ORE guidelines, the RDF file shown above would be treated as the Resource Map for the aggregation URI (in this example, "http://myOrganisation.org/myDataset") and would have a URI unto itself (perhaps "http://myOrganisation.org/myDataset.rdf").

What about other rights? It is critically important for the reader to understand that Ian's example (repeated above) only shows how to declare a waiver of rights, which by its nature is intended to promote the reuse of data based on open principles. Today, this is mostly what the linked data world has focused on, but as the NYTimes open data experiment is showing us, providers will want to assert rights where they can. In a future post I'll apply what we've learned so far to consider approaches for declaring dataset rights in legal regimes where this is actually possible.