Wednesday, November 25, 2009

Long Tails and "Scaling Down" Linked Data Services

Chris Anderson's newest book FREE: The Future of a Radical Price received some attention this summer, but I've actually been meditating on principles he laid out three years ago in his blog post, Scaling up is good. Scaling down is even better. In that post he marveled at Google et al.'s ability to scale down, to run themselves efficiently enough to serve users who generate no revenue at all. Anderson's principles offer guidance on conducting business so that even if only a tiny percentage of one's visitors "convert" into paying customers, that small percentage of a very large number is still enough to achieve big-time profitability.

My goal with this post is to consider how these ideas might be applied to the domain of Linked Data, and specifically how they pertain to the provision of unique data that adds real value to the greater "Web of Data."

In his blog Anderson gives us four keys to scaling down: Self-service, "Freemium" services, No-frills products and Crowdsourcing...

1. Self-service: give customers all the tools they need to manage their own accounts. It's cheap, convenient, and they'll thank you for it. Control is power, and the person who wants the work done is the one most motivated in seeing that it's done properly.

"Self-service" applies to linked data services in oh-so-many ways! Self- service in this case is not as much about support (see "Crowdsourcing," below) as it is about eliminating any and all intervention customers might need to customize or specialize how services perform for them. In principle, the goal should be to provide users with a flexible API and let them figure it out, with the support of their peers. Ensure that everything is doable from their side, and step out of the way.

The (negative) corollary is this: if you "babysit" your customers by providing specialized services that require maintenance, then you own those services and must eat the cost. Make the specializations a user-side function and they own it. But they won't be alone; they'll have the support of their community!
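
To make the "flexible API" idea concrete, here is a minimal sketch of what self-service access might look like from the user's side, assuming the provider simply exposes a public SPARQL endpoint. The endpoint URL and the query are invented for illustration; any real provider's endpoint and vocabulary would differ.

```python
# A minimal self-service sketch: the user shapes the query, interprets the
# results, and builds any specialization entirely on their own side.
# The endpoint URL below is hypothetical.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://data.example.org/sparql")
endpoint.setQuery("""
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?work ?title
    WHERE { ?work dcterms:title ?title . }
    LIMIT 10
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["work"]["value"], "-", binding["title"]["value"])
```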

2. "Freemium" services: As VC Fred Wilson puts it, "give your service away for free, possibly ad supported but maybe not, acquire a lot of customers very efficiently through word of mouth, referral networks, organic search marketing, etc, then offer premium priced value added services or an enhanced version of your service to your customer base." Free scales down very nicely indeed.

There are any number of ways providers might apply this concept to the linked data world (a sketch of how such tiers might be enforced follows the table):

Free Access                          | Premium Access
------------------------------------ | ----------------------------
Restricted vocabulary of assertions  | Full access, all assertions
Limited query rate                   | Unlimited query rate
Limited query extent                 | Unlimited query extent
Limited data size                    | Unlimited data size
Read-only                            | Term upload capability
Narrow reuse rights                  | Broad reuse rights
Community support                    | Private/dedicated support
...                                  | ...
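
As a rough illustration of how a provider might encode such a split, here is a sketch of tier policies and a simple query check. The tier names, limits, and predicate lists are assumptions invented for this post, not a description of any real service; the point is that differentiation becomes automated policy rather than hand-crafted, per-customer intervention.

```python
# Illustrative only: encoding the free/premium split from the table above.
FREE = {
    "allowed_predicates": {"dcterms:title", "dcterms:creator"},  # restricted vocabulary
    "max_queries_per_day": 1000,                                 # limited query rate
    "max_results_per_query": 100,                                # limited query extent
    "can_upload_terms": False,                                   # read-only
}
PREMIUM = {
    "allowed_predicates": None,      # None => all assertions visible
    "max_queries_per_day": None,     # unlimited query rate
    "max_results_per_query": None,   # unlimited query extent
    "can_upload_terms": True,        # term upload capability
}

def check_query(tier, queries_today, requested_rows):
    """Return True if a query is permitted under the given tier."""
    if tier["max_queries_per_day"] is not None and queries_today >= tier["max_queries_per_day"]:
        return False
    if tier["max_results_per_query"] is not None and requested_rows > tier["max_results_per_query"]:
        return False
    return True

print(check_query(FREE, queries_today=999, requested_rows=50))          # True
print(check_query(FREE, queries_today=1000, requested_rows=50))         # False
print(check_query(PREMIUM, queries_today=10**6, requested_rows=10**6))  # True
```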

3. No-frills products: Some may come for the low cost, others for the simplicity. But increasingly consumers are sophisticated enough to know that they don't need, or want to pay for, premium brands and unnecessary features. It's classic market segmentation, with most of the growth coming at the bottom.

In the linked data world, achieving "no frills" would seem easy because by definition it is only about the data! For linked data a "frill" is added data complexity that serves no purpose or detracts from the utility of the service. Avoid any temptation to "add value" on behalf of customers, such as merging your core graph with others in an attempt to "make it easy" for them. Providers should also avoid "pruning" graphs, except in the case of automated filtering in order to differentiate between Freemium and Premium services.
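
As a sketch of the kind of automated filtering just mentioned, the following uses rdflib to derive a free, restricted-vocabulary view from the full graph, leaving the core graph itself untouched. The sample triples, the hypothetical "premium-only" predicate, and the choice of dcterms properties are assumptions for illustration.

```python
# Derive a restricted "Freemium" view from a full graph without pruning
# or enriching the core graph itself. Data and predicates are invented.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS

full_graph = Graph()
work = URIRef("http://data.example.org/work/42")
premium_only = URIRef("http://data.example.org/vocab/acquisitionCost")  # hypothetical predicate

full_graph.add((work, DCTERMS.title, Literal("An Example Work")))
full_graph.add((work, DCTERMS.creator, Literal("J. Example")))
full_graph.add((work, premium_only, Literal("1200.00 USD")))

FREE_PREDICATES = {DCTERMS.title, DCTERMS.creator}  # the restricted vocabulary

free_graph = Graph()
for s, p, o in full_graph:          # copy only assertions in the free vocabulary
    if p in FREE_PREDICATES:
        free_graph.add((s, p, o))

print(len(full_graph), len(free_graph))  # 3 2
```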

4. Crowdsourcing: From Amazon reviews to eBay listings, letting the customers do the work of building the service is the best way to expand a company far beyond what employees could do on their own.

By now it is not only obvious but imperative that providers develop communities around their services. Usually communities are about evangelism, and this is certainly true for linked data providers, but increasingly service providers realize that well-groomed communities can radically reduce their service costs.

Linked data providers should commit themselves to a minimum of direct support and invest in fostering an active community around their service. Every provider should have a means for members of their community to support each other. Every provider should leverage this community to demonstrate to potential adopters the richness of the support and the inherent value of their dataset.

Finally: In a thought-provoking post, Linked Data and the Enterprise: A Two-way Street, Paul Miller reminds the skeptical enterprise community that they, not merely their user community, will ultimately benefit from the widespread use of their data, and that when developing their linked data strategy they should consider how they can "enhance" the value of the Web of Data for paying and non-paying users alike:

...[A] viable business model for the data-curating Enterprise might be to expose timely and accurate enrichments to the Linked Data ecosystem; enrichments that customers might pay a premium to access more quickly or in more convenient forms than are available for free...

I've purposely avoided considering the legal and social issues associated with publishing certain kinds of enterprise data as linked data (see also this), which I'll address in a future post...

Tuesday, November 24, 2009

DRM & Me Part III: DOIs, Metadata and Long Tails

In Part II of this retrospective I discussed the NetRights years and our novel approach to binding static and dynamic metadata to objects in the early days of the Web. In this installment I'll cover my years at Yankee Rights Management (YRM), a division of YBP, Inc., especially the development of Copyright Direct(tm) and my personal realization of the potential of content identifiers and their associated metadata. Note: It was actually during my YRM years that I coined my now-infamous expression (referenced in Part II of this series), Metadata is the lifeblood of e-commerce!

YBP, originally known as Yankee Book Peddler and now a division of Baker & Taylor, has been a leader in using information technology to provide books and other materials, including bibliographic data --- metadata! --- to university and research libraries for more than 35 years. YBP executive Glen M. Secor also happened to be a professor of law at the Franklin Pierce Law Center specializing in copyright law, with a particular interest in the unique challenges of copyright in the emerging digital, networked environment. Glen and I first met when I presented my early Ph.D. work at DAGS'95 in Boston (prior to the founding of NetRights), and from that point on he took an interest in this metadata-oriented, iconoclastic approach to copyright. Glen spearheaded YBP's investment in NetRights in 1996, and with the sale of NetRights in 1997 I joined with Glen to launch Yankee Rights Management (YRM) in mid-1997.

One of YRM's goals was to build a business solving rights management problems for stakeholders in YBP's ecosystem, especially scientific/technical/medical (STM) publishers and their university and research customers. With the help of Kelly Frey, then VP of Business Development for the Copyright Clearance Center (CCC), we conceived of Copyright Direct(tm), which soon became the first web-based, real-time, pay-as-you-go copyright permissions service for a wide variety of multimedia types. As with LicensIt(tm), the usage model for Copyright Direct(tm) would be simple (a rough sketch of the workflow follows the list below):

  1. From a web page or PDF document, the user would click on a distinctive green "Copyright Direct" icon
  2. A mini-window would pop up clearly identifying the work and presenting available options for that item
  3. The user would step through a short series of menus to specify their use and, if available, transact their request (via credit card!) and receive their permissions
  4. If the usage they needed was not available, the system collected the user's plain-text request and began a managed workflow between the user and the rightsholder
  5. When all parties agreed, the agreement became a "template" and was added as an available option --- the system learned and adapted
  6. At the end of each month, rightsholders would receive royalty payments.
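
For readers who think in code, here is a rough, present-day reconstruction of that workflow as a simple state machine. It is purely illustrative and bears no relation to the original Copyright Direct(tm) implementation; the class names, states, and fields are all invented for this sketch.

```python
# Illustrative reconstruction of the permissions workflow described above;
# not the original Copyright Direct code.
from dataclasses import dataclass, field
from enum import Enum, auto

class RequestState(Enum):
    IDENTIFIED = auto()    # work identified, available options presented
    TRANSACTED = auto()    # matching option found, payment taken, permission granted
    NEGOTIATING = auto()   # no option matched; managed workflow with the rightsholder
    TEMPLATED = auto()     # agreement reached and added as a reusable option

@dataclass
class PermissionRequest:
    work_id: str                 # e.g. a content identifier such as a DOI
    requested_use: str
    available_options: list = field(default_factory=list)
    state: RequestState = RequestState.IDENTIFIED

    def process(self):
        if self.requested_use in self.available_options:
            self.state = RequestState.TRANSACTED     # step 3: transact and receive permission
        else:
            self.state = RequestState.NEGOTIATING    # step 4: plain-text request to rightsholder

    def settle(self, agreed: bool):
        if self.state is RequestState.NEGOTIATING and agreed:
            self.available_options.append(self.requested_use)
            self.state = RequestState.TEMPLATED      # step 5: the system "learns"

req = PermissionRequest("10.1000/example", "coursepack, 30 copies",
                        available_options=["single photocopy"])
req.process()
req.settle(agreed=True)
print(req.state, req.available_options)
```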

Glen Secor, Jennifer Goodrich and I demonstrated my Copyright Direct prototype to a variety of stakeholders and thought leaders at the Frankfurt Book Fair in October 1997 and collected critical feedback. We returned "triumphantly" in October 1998 with a booth in the main hall, a live Copyright Direct demo (now powered by the fledgling DOI standard), and a major "beta" rightsholder: the IEEE!

But throughout 1998-1999 we also came to realize a fundamental problem with the Copyright Direct model: it depended not only on a ready supply of clean descriptive metadata from rightsholders, but also upon a rich set of rightsholder-generated rights metadata, including pricing and other licensing templates, none of which existed! Our goal was to use lightweight, easily accessible permissions transactions to provide "found money" to rightsholders, but it cost too much to generate the metadata required to fuel the system! In the September 2006 issue of D-Lib magazine I expanded on this problem in my article, Handle Records, Rights and Long Tail Economies.

Chris Anderson's "long tail" argument (see also his Long Tail blog) asserts that modern systems based entirely on metadata make "unlimited selection" economically viable. I argue that yes, metadata really is the lifeblood of e-commerce and is the enabler of phenomena like the seemingly-unlimited selection of products through Amazon.com ("make everything available, help anyone find it!"), but all metadata must somehow still be generated, verified and published, and the cost of creating and supporting the neccessary metadata supply chains must not exceed the anticipated value that can be redeemed. Since the demand of a given "unit" may be exceptionally low, the "per unit" cost of creating or aggregating each unit's metadata halo must be near-zero!

These principles can be extrapolated to the "Web of Data"; indeed, by coupling Linked Data principles with a low-overhead infrastructure for authenticating metadata assertions, the cost of metadata may indeed approach zero. I'll talk about that in a future blog entry...

Monday, November 23, 2009

DRM & Me Part II: "Copyright for the rest of us!"

In Part I of this retrospective I covered the raw beginnings of my interest and research in enabling copyright in the digital, networked environment. In this second part I'll discuss work my colleagues and I did to take these ideas commercial, and I'll continue to focus on core principles of my work in content identification and metadata architecture, summed up by this quote (attributed to me!): Metadata is the lifeblood of e-commerce!

As the spring of 1995 approached it became clear that there was an opportunity to make a unique contribution to improving the world of copyright in the digital, networked environment. As I prepared to present a paper at ED-MEDIA 95 in Graz, Austria, I was approached by local businessmen who had been principals in a successful software company, Corporate Microsystems, Inc., that had just been acquired by a global enterprise software company. As the story goes, they were looking for an original idea upon which to base their next start-up, and I was looking for a strategy for implementing my research ideas that would scale well beyond what I was capable of doing part-time as a researcher at IML. Over the summer of 1995 my future partners Gerry Hunt, Theo Pozzy, Henry Adams, Hal Franklin and I held numerous planning meetings, and on 1 November 1995 NetRights, LLC was born!

We started NetRights at a time when other players, in particular InterTrust (then still called EPR) and IBM InfoMarket, were starting to draw attention to their robust, encryption-based "envelope" strategies for "protecting copyright" --- quotes intentional! --- and the term digital rights management wasn't yet in standard use. Taking a cue from my prototype work at Dartmouth, the core idea behind LicensIt(tm) (later @attribute) was to "objectify" flat multimedia objects using secure wrappers whose primary objective was to provide structured metadata about the object in hand. Our goal was to provide rich static and Internet-served dynamic metadata to facilitate "conversations" between creators and users of content. Our motif for "experiencing" copyright was simple and elegant: a user sees a photo, audio clip, video, even an embedded text snippet; they "right-click" on it and a tabbed set of property pages is displayed; they use those various pages to view descriptions of the content, to start emails with the creator or other contributors, to view default terms of use, even to initiate live rights transactions, all while staying within the context of use.

From a technical standpoint we were using OLE structured storage in very much the same way as XML (and especially RDF) is used today. Our development team, including Mark Schlageter, Norm Tiedemann, Mark Markus and Dan O'Connor (our sole Mac-head!), created amazing tools that let us not only design these metadata structures, but actually create "soft" property-page layout templates (think CSS!) that were packaged with the metadata, enabling customized content-specific views. Considerable infrastructure was required to make all of this work, from OLE services installed on the user's machine, to the tools for design and packaging, to back-end services for object registration. We also faced major, bet-the-company decisions about PC vs. Mac, "networked COM" (which became ActiveX), Spyglass/IE vs. Mosaic/Netscape support, and more. To a startup company, Bill Gates' commitment of Microsoft to "embracing and extending" the Internet in late 1995 was helpful!
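
As a small illustration of that parallel, here is how the same kind of bound, static descriptive metadata might be expressed today with RDF (sketched here with rdflib); the object identifier and property values are invented for illustration.

```python
# Roughly what a LicensIt-style "property page" carried, expressed as RDF.
# Identifier and values are hypothetical.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS

g = Graph()
g.bind("dcterms", DCTERMS)

work = URIRef("http://example.org/objects/photo-001")   # hypothetical object identifier
g.add((work, DCTERMS.title, Literal("Harbor at Dawn")))
g.add((work, DCTERMS.creator, Literal("A. Photographer")))
g.add((work, DCTERMS.rights, Literal("Contact the creator for commercial use")))

print(g.serialize(format="turtle"))
```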

Trade journals like Seybold took notice and wondered whether our "kinder, gentler" approach to copyright, which by that time (June 1996) we were calling "enhanced attribution," might actually be a better option than so-called "opaque packages." Publishers were torn; they liked the obvious value our approach was bringing to the user and the fact that we were actually facilitating the copyright process, but they also couldn't get over their perceived need for "strong protection."

Today we see echoes all over the Internet of infrastructure and technology that make "copyright for the rest of us" radically easier than it was at NetRights' birth in 1995. First and foremost are systems of globally unique, persistent object identifiers, in particular the Digital Object Identifier (DOI), implemented on CNRI's Handle System. (As it happens, that same 1996 issue of Seybold also carried an article about the birth of the DOI!) RDF provides a universal information model for conveying metadata assertions (local and remote) about objects; RDFa provides a way to do this within (especially) web documents. The recent massive and growing interest in publishing Linked Data by organizations, including governments, has fortified distributed metadata as a means of conveying object information from a variety of sources. And special mention must be made of Creative Commons, which has applied most of these techniques not only to make the process of copyright readily accessible to creators and users all over the world, but also to make content use safe through the explicit and unambiguous communication of its terms of use.

Providing immediate, unambiguous expression of copyright information and connections to processes for any piece of content was my mantra starting in the lab at Dartmouth, then at NetRights, and following our acquisition in 1997 by Digimarc, with the creation of Copyright Direct(tm) at Yankee Rights Management (YRM) and my subsequent involvement with the content identification and metadata communities. More on that in our next installment...

Wednesday, November 18, 2009

DRM & Me: A 15-year retrospective (Part 1)

Fifteen years ago, in November 1994, I was two years into a Ph.D. program at the Thayer School of Engineering at Dartmouth College. I had entered Dartmouth with a background in computer engineering and an interest in "special-purpose systems," a narrow field that focuses on creating computing systems that are exceptionally good at a very narrow range of operations, such as particle-in-cell simulation or gene sequence processing. This interest led me across campus to become a research assistant in Dr. Joseph V. Henderson's pioneering Interactive Media Lab (IML) at Dartmouth Medical School --- at first to consider the infrastructural problems of delivering IML's high-value multimedia training programs across the Internet, and, by mid-1994, over a novel set of technologies known as the "World Wide Web."

As the story goes, the IML team was preparing a major set of demos for a visit by Dr. C. Everett Koop, a Dartmouth alumnus and area resident who had recently retired as one of the more influential Surgeons General the United States has ever had. My particular focus was creating an interactive web site for IML, especially the delivery of several key video sequences via the web. Several of us worked long into the night to migrate a few select videos into tolerable QuickTime format and suitable "thumbnails," then onto the lab's server, then linked (for downloading) from web pages, and finally viewable on the demo Mac.

When Joe arrived on the morning of our demo, I greeted him with (something like), "Joe, I got the 'Binding Sequence' up on the Web!" His incredibly insightful response was:

John, that's great!...John, that's terrible!

Joe proceeded to express his concerns about two fundamental implications of my "success":

  • The copyright implications, especially as many IML programs were funded by private entities that retained certain rights to the works;
  • The implications of dis-aggregating medical and other training programs and delivering their content out-of-context, possibly doing harm to their message due to loss of design integrity.

Joe framed the challenge for me: to study the question of rights management from the perspective of multimedia production. In 24 hours, I learned that this was an important and rising issue that was not going away; that very little research had been done on the question from a practical standpoint; that the few proposed solutions at the time were overly simplistic, equating "copyright management" with "security" and in fact doing neither; and that no one appeared to be considering the issues from the perspective of the creator. In 24 hours, my Ph.D. topic was born!

This leads us to 1 November 1994, when I presented my dissertation proposal, which included as an example research artifact my Mr. Copyright(tm) prototype --- quickly re-named, at the urging of my committee and others, to LicensIt(tm). LicensIt demonstrated, in the form of an easy-to-use desktop "appliance," the key ideas of (a) binding actionable copyright metadata to multimedia objects, and (b) user-friendly, real-time, networked copyright registration. The LicensIt desktop icon said it all: modeled after the famous Stuffit(tm) coffee grinder, users dragged and dropped their content (initially GIF files) onto LicensIt; a dialog popped up to collect (and display) their descriptive and other metadata and to enable them to select their "registration server" from a menu of choices; their work was registered. By way of both the static metadata and the registry, users would be able to contact the principals involved in the creation of the item. I envisioned several other options, including registering digital signatures to allow users to authenticate a work in hand, as well as enveloping the work in an encrypted envelope.

It is important to note that the focus of my work at that time was on enabling copyright by binding static and dynamic metadata to content, and especially on making it as accessible as possible within the context of use; content security was only a secondary concern. "Enablement" means that although a desktop client is interesting, plugins for creation tools like Photoshop, Acrobat and Macromedia Director, and enjoyment tools like Mosaic --- this was 1994!! --- would be infinitely more interesting and useful! I assumed that one day creators would be mixing and matching content found around the web, and that at least commercial and other highly visible producers would want/need to "do the right thing" w.r.t. copyright and thus would benefit from instantly accessible attribution, bound to the item. Note that I was heavily influenced at that time by the writings of Prof. Henry H. Perritt, Jr., whose concept of permissions headers was not only an inspiration for me, but I believe anticipated Creative Commons licensing templates.

Fifteen years later, we can at least say the world is different! The world we imagined 15 years ago of rampant "re-mixing" of content has arrived; licensing models such as Creative Commons have improved awareness; but still the infrastructure does not accommodate the discovery and transmission of rights information as readily as it should. With the rise of new data-centric models such as Linked Data (a practical outcome of Semantic Web research) and the acceptance of persistent identifier systems including the Handle System and the Digital Object Identifier, we're getting there...

Next installment: The NetRights and YRM years...