Friday, December 11, 2009

Scale-free Networks and the Value of Linked Data

Kingsley Idehen of OpenLink Software and others on the Business of Linked Data (BOLD) list have been debating a value proposition for linked data via Twitter (search for #linkeddata) and email. The discussion has included useful iterations on various "elevator pitches" and citations of recent successes, especially the application of GoodRelations e-commerce vocabularies at Best Buy. After some deep thought I decided to take the question of value in a different direction and to consider it from the perspective of the science of networks, especially with reference to the works of Albert-László Barabási, director of the Center for Complex Network Research and author of Linked: The New Science of Networks. I'd like to test the idea here that data sharing between organisations based on linked open data principles is the approach most consistent with the core principles of a networked economy. I believe that the linked data model best exploits "network thinking" and maximizes the organisation's ability to respond to changes in relationships within the "global graph" of business. Viewed through Barabási's framework, linked data is the approach that most embodies a networked view of the economy from the macro- to the micro-economic level, and therefore best empowers the enterprise to understand and leverage the consequences of interconnectedness.

As has been noted numerous times elsewhere, the so-called Web of Data is perhaps the web in its purest form. Following Tim Berners-Lee's principles or "rules" as stated in his Linked Data Design Issues memo from 2006, we have a very elegant framework for people and especially machines to describe the relationships between entities in a network. If we are smart about how we define those links and the entities we create to aggregate those links (the linked datasets we create), we can build dynamic, efficiently adaptive networks embodying the two laws that govern real networks: growth and preferential attachment. Barabási illustrates these two laws with an example "algorithm" for scale-free networks in Chapter 7 of Linked. The critical lessons are (a) networks must have a means to grow: there must not only be links, but the ability to add links, and (b) networks must provide some mechanism for entities to register their preference for other nodes by creating links to the more heavily-linked nodes. Preferential attachment ensures that the converse is also true: entities will "vote with their feet" and register their displeasure with nodes by eliminating links.
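To make the two laws concrete, here is a minimal Python sketch of a growth-plus-preferential-attachment process in the spirit of the model Barabási describes. It is an illustration under my own assumptions, not a transcription of the book's algorithm: the function names, the small fully connected starting core, and the parameter links_per_new_node are all hypothetical choices made for this example.

```python
import random
from collections import Counter


def grow_scale_free_network(total_nodes, links_per_new_node=2, seed=0):
    """Grow an undirected network one node at a time (illustrative sketch).

    Growth: every step adds one new node.
    Preferential attachment: the new node picks its targets with
    probability proportional to the links each existing node already has.
    """
    rng = random.Random(seed)
    m = links_per_new_node

    # Assumption: start from a small fully connected core of m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]

    # Each node appears here once per link it holds, so a uniform draw
    # from this list selects nodes in proportion to their degree.
    endpoints = [node for edge in edges for node in edge]

    for new_node in range(m + 1, total_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for target in sorted(targets):
            edges.append((new_node, target))
            endpoints.extend((new_node, target))
    return edges


def degrees(edges):
    """Count how many links touch each node."""
    return Counter(node for edge in edges for node in edge)


if __name__ == "__main__":
    network = grow_scale_free_network(total_nodes=1000)
    print("Most-linked nodes:", degrees(network).most_common(5))
```

Running it with different seeds should show the "rich get richer" pattern: a handful of early, heavily linked nodes accumulate far more links than the typical node, which never has fewer than links_per_new_node.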

In real networks, the rich get richer. In the Web, the value is inherent in the links. Google's PageRank merely reinforced the "physical" reality that the most valuable properties in the Web of Documents are those resources that are most heavily linked-to. Those properties provide added value if they in turn provide useful links to other resources. The properties that are sensitive to demand and can adapt to the preferences of their consumers, especially to aggregate links to more resources that compound their value and distinguish them from other properties, are especially valuable and are considered hubs.
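The idea that value follows inbound links can be sketched with a simplified PageRank-style calculation. This is a textbook power-iteration version written for illustration, not Google's production algorithm; the function name, the toy link graph, and the damping factor of 0.85 are assumptions of this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration (illustrative sketch).

    `links` maps each node to the list of nodes it links to.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node receives a small baseline share of rank.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # A node passes its rank on, split evenly across its links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no outbound links spreads its rank evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical toy web: three pages link to "hub", which links back to "a".
    toy_web = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
    for page, score in sorted(pagerank(toy_web).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))
```

In the toy graph the most heavily linked-to page ranks highest, and the page it links to ranks well above the pages that receive no links at all, which is the sense in which a hub adds value by pointing onward.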

Openness is important. At this point it is tempting to jump to the conclusion that Tim Berners-Lee's four principles are all we need to create a thriving Web of Data, but this would be premature; Sir Tim's rules are necessary but not sufficient conditions. Within any "space" where Webs of Data are to be created, whether global or constrained within an organisation, the network must embody the open world assumption as it pertains to the web: when datasets or other information models are published, their providers must expect them to be reused and extended. In particular this means that entities within the network, whether powered by humans or machines, must be free to arbitrarily link to (make assertions about) other entities within the network. The "friction" of permission in this linking process must approximate zero.

Don't reinvent and don't covet! The graphs that are built within organisations should not stop at their boundaries. As the BBC has shown so beautifully with its use of linked data on the revamped BBC web site, the inherent value of its property was increased radically not only by linking to datasets provided elsewhere, openly on the "global graph," but also by enabling reuse of its own properties. The BBC's top-level principles for the revamped site are all about openness and long-term value:

The site has been developed against the principles of linked open data and RESTful architecture where the creation of persistent URLs is a primary objective. The initial sources of data are somewhat limited but this will be extended over time. Here's our mini-manifesto: Persistence...Linked open data...RESTful...One web

The BBC has created a valuable "ecosystem"; their use of other resources, especially MusicBrainz and DBpedia, has not only made the BBC site richer but in turn has increased the value of those properties. And those properties will continue to increase in value; by the principle of preferential attachment, every relationship "into" a dataset by valuable entities such as the BBC in turn increases the likelihood that other relationships will be established.

Links are not enough. It should be obvious that simply exposing datasets and providing value-added links to others isn't enough; as Eric Hellman notes, dataset publishers must see themselves as service providers who add value beyond simply exposing data. Some will add value to the global graph by gathering, maintaining, and publishing useful datasets and fostering a community of users and developers; others will add value by combining datasets from other services in novel ways, possibly decorated by their own. Eric has argued that the only winners in the linked open data space have indeed been those who have provided such merged datasets as a service.

Provide value-adding services and foster community. I would argue that dataset providers asking how they might realise the full value potential of publishing their datasets on the Web should examine whether, based on the principles I've outlined above, they have done everything they can to make their datasets part of the Web (rather than merely "on" the web) and have truly added value to the global graph. Do they view themselves as a service? Have they made their datasets as useful and easy-to-use as possible? Have they provided the best possible community support, including wikis and other mechanisms? Have they fully documented their vocabularies? Have they clearly defined any claimed rights, and in particular have they considered adopting open data principles?
