Friday, March 19, 2010

DOIs, URIs and Cool Resolution

This post originally appeared on the Wordpress.com version of this blog.

The art of happiness is to serve all -- Yogi Bhajan


Once we get beyond the question of the basic HTTP URI-ness of the digital object identifier (DOI) --- a moot issue, since for each DOI there exist DOI-based URIs by way of the dx.doi.org and hdl.handle.net proxies --- and old-skool questions of "coolness" based on the relative brittleness over time of creative URI encoding [1], we are left with the more substantial question of whether DOI-based HTTP URIs really "behave" themselves within the "Web-of-Objects" universe. The purpose of this post is to identify the problem and propose a potential solution, implementation of which will require certain changes to the current Handle System platform. I believe that if the proposed changes are made, lingering questions concerning the "URI-ness" of DOIs (and Handles) will disappear, once and for all.

Note: It is beyond the scope of this post to present all of the gory background details regarding the Handle System, the DOI, and the 1998 and 2008 versions of "Cool URIs." If there is enough interest in a stand-alone article, I will happily consider writing a longer version in the future, perhaps as a piece for D-Lib Magazine.

With the increasing influence of semantic web technologies there has been strong interest in assigning actionable HTTP URIs to non-document things, ranging from abstract ideas to real world objects. In the case of URI-named, Web-accessible physical items --- sensors, routers and toasters --- this is sometimes referred to as The Web of Things. Until 2005 the community disagreed as to what an HTTP URI could be assumed to represent, but a June 2005 decision by the W3C TAG settled the issue: If a server responds with an HTTP response code of 200 (aka a successful retrieval), the URI indeed is for an information resource; with no such response, or with a different code, no such assumption can be made. This "compromise" was said to have resolved the issue, leaving a "consistent architecture." [3]

The result of this decision was to force consensus on how to apply the long-established principles of HTTP content negotiation in more consistent ways. In particular, "human" and "machine" requests to a given entity URI --- a top-level URI representing a "thing" --- should be treated differently; for example, there should be different responses to requests with HTTP headers specifying Accept: text/html (for an HTML-encoded page) versus Accept: application/rdf+xml (for RDF-modeled, XML-encoded data). This is most often seen in the semantic web and linked data worlds, where it is now common to have both textual and machine readable manifestations of the same URI-identified thing.

Modern web servers including Apache have been engineered to handle these requests through content negotiation [4]. Through standard configuration procedures, site administrators specify how their servers should respond to text/html and application/rdf+xml requests in the same way they specify what should be returned for requests for alternate languages and encodings ("en," "fr," etc.). Typically, when media-specific requests are made against entity URIs representing concepts, the accepted practice is to return a 303 See Other response code with the URI of a resource containing a representation of the expected type, such as an HTML-encoded page or an XML document with RDF-encoded data.

Many readers of this post will be familiar with the basic idea of HTTP proxy-based Handle System name resolution: An HTTP resolution request for a DOI-based URI is made to a proxy --- a registration-agency run proxy such as dx.doi.org or the "native" Handle System proxy hdl.handle.net --- the appropriate local handle server is located, the handle record for the DOI is resolved, and the default value of the record (e.g. the URL of a document information page) is returned to the client in a 302 Found response. In a Web of Documents this might make sense, but in a universe of URI-named real-world objects and ideas, not so much.

The 2008 version of "Cool URIs" gives two requirements for dealing with URIs that identify real-world objects:

  1. Be on the Web: Given only a URI, machines and people should be able to retrieve a description about the resource identified by the URI from the Web. Such a look-up mechanism is important to establish shared understanding of what a URI identifies. Machines should get RDF data and humans should get a readable representation, such as HTML. The standard Web transfer protocol, HTTP, should be used.
  2. Be unambiguous: There should be no confusion between identifiers for Web documents and identifiers for other resources. URIs are meant to identify only one of them, so one URI can't stand for both a Web document and a real-world object.

In the post-2005 universe of URI usage as summarised above and detailed in [2], if DOI-based URIs are used to represent conceptual objects these rules will be broken! For example, Handle System proxies today cannot distinguish between the media types named in the Accept: headers of a request; the only possible resolution is the default (first) element of the Handle record. (For hackers or merely the curious out there, I encourage you to experiment with curl at your command line or Python's urllib2 library, hitting the DOI proxy with a DOI-based URL like http://dx.doi.org/10.1109/MIC.2009.93.) This problem with how proxies resolve DOIs and Handles is a lingering manifestation of the native Handle System protocol not being HTTP-based, with the system of HTTP-based proxies being something of a work-around. Yet the vast majority of DOI and Handle System resolutions occur through and rely on these proxies.
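For readers who want to try the experiment, here is a minimal sketch in Python 3, where the old urllib2 functionality lives in urllib.request. The helper names build_request and probe, and the NoRedirect handler, are my own inventions for illustration. Suppressing redirects is just one way of making the raw status code and Location header visible, and whatever the proxy returns when you run this may differ from what I describe in this post.

```python
import urllib.error
import urllib.request

DOI_URL = "http://dx.doi.org/10.1109/MIC.2009.93"


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Keep urllib from following redirects so the 3xx response stays visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means "do not follow"; urllib then raises HTTPError.
        return None


def build_request(url, media_type):
    """Build a GET request that asks for a single media type."""
    return urllib.request.Request(url, headers={"Accept": media_type})


def probe(url, media_type):
    """Return (status code, Location header) for one request. Needs network access."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(build_request(url, media_type), timeout=10) as resp:
            return resp.status, resp.headers.get("Location")
    except urllib.error.HTTPError as err:
        # With redirects suppressed, a 3xx response arrives here as an HTTPError.
        return err.code, err.headers.get("Location")
```

Calling probe(DOI_URL, "text/html") and probe(DOI_URL, "application/rdf+xml") and comparing the two results shows whether the proxy pays any attention to the Accept: header.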

One possible solution would be to enable authorities --- Registration Agencies --- who operate within the Handle System to configure how content negotiation within their Handle prefix space is handled at the proxy. For document-based use of the DOI, an example of this would be to return the URI in the first element of the Handle record whenever a text/html request is made and (for example) the URI in the second element whenever an application/rdf+xml request is made. When a request is made to the proxy, request-appropriate representation URIs would be returned to the client along with a 303 See Other code. This approach treats the DOI-based URI as a conceptual or entity URI and gives the expected responses as per [2]. pax vobiscum...

Readers familiar with the Handle System will appreciate that there are many potential schemes for relating HTTP content type requests to elements of the Handle record; in the example above I use position (index value), but it is also possible to use special TYPEs.
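As a rough illustration of these two schemes, the sketch below models a Handle record as a list of (index, type, data) values and looks up a redirect URI first by position and then by TYPE. Everything specific here is hypothetical: the example.org URLs, the TYPE names URL.HTML and URL.RDF (which are not registered Handle System types), and the shape of the per-prefix configuration tables.

```python
from typing import List, NamedTuple, Optional


class HandleValue(NamedTuple):
    index: int
    type: str
    data: str


# Hypothetical record for one DOI; URLs and TYPE names are illustrative only.
RECORD = [
    HandleValue(1, "URL.HTML", "http://example.org/article/landing.html"),
    HandleValue(2, "URL.RDF", "http://example.org/article/metadata.rdf"),
]

# Assumed per-prefix configuration a Registration Agency might supply.
BY_INDEX = {"text/html": 1, "application/rdf+xml": 2}
BY_TYPE = {"text/html": "URL.HTML", "application/rdf+xml": "URL.RDF"}


def select_by_index(record: List[HandleValue], media_type: str) -> Optional[str]:
    """Pick the value whose index is configured for this media type."""
    wanted = BY_INDEX.get(media_type)
    for value in record:
        if value.index == wanted:
            return value.data
    return None


def select_by_type(record: List[HandleValue], media_type: str) -> Optional[str]:
    """Pick the first value whose TYPE is configured for this media type."""
    wanted = BY_TYPE.get(media_type)
    for value in record:
        if value.type == wanted:
            return value.data
    return None
```

Selection by TYPE keeps working if values are reordered or renumbered in the record, which is one reason it may be the more attractive of the two.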

Handle servers are powerful repositories and can implement potentially many different models other than redirection as described above. Sometimes, for example, the desire is to use a Handle record as the primary metadata store. In that case, the preferred application/rdf+xml response might very well be an RDF-encoded serialisation of the Handle record itself. How this is handled should be a feature of the Handle server platform and a decision by registration agencies based on their individual value propositions, and not locked in by the code.

I eagerly look forward to your comments and reactions on these ideas!

Update 1: In a comment to this post, Herbert Van de Sompel argues that the real question is, what should DOIs represent? Herbert asserts that DOI-based URIs should model OAI-ORE resource aggregations and that Handle System HTTP proxies should behave according to OAI-ORE's HTTP implementation guidelines. Herbert's suggestion doesn't conflict with what I've written above; this is a more subtle and (arguably) more robust view of how compound objects should be modeled, which I generally agree with.

Here's how OAI-ORE resolution would work following the Handle proxy solution I've described above: Assume some DOI-based HTTP URI doi.A-1 identifies an abstract resource aggregation "A-1" (in OAI-ORE nomenclature, doi.A-1 is the Aggregation URI). Following the given HTTP implementation example, let there be two Resource Maps that "describe" this Aggregation, an Atom serialization and an RDF/XML serialization. Each of these Resource Maps is (indeed MUST be) available from a different HTTP URI, ReM-1 and ReM-2, but the desired behaviour is for either to be accessible through the DOI-based Aggregation URI, doi.A-1. Let these two URIs be persisted in the Handle record, preferably using TYPEs which distinguish how they should be returned to clients based on the naming authority's configuration of the HTTP proxy. By the approach I describe above, the Handle System proxy would then respond to resolution requests for doi.A-1 with 303 See Other redirects to either ReM-1 or ReM-2 depending upon MIME-type preferences expressed in the Accept: headers of the requests.
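The walk-through above can be sketched as a tiny WSGI application standing in for the proxy. This is an illustration under stated assumptions, not Handle System code: the example.org URLs for ReM-1 and ReM-2 are placeholders, the matching deliberately ignores q-values and wildcards, and the 406 Not Acceptable fallback is my own choice for the case where nothing matches.

```python
# Hypothetical URIs persisted in the Handle record for doi.A-1.
RESOURCE_MAPS = {
    "application/atom+xml": "http://example.org/ReM-1",
    "application/rdf+xml": "http://example.org/ReM-2",
}


def proxy_app(environ, start_response):
    """WSGI sketch: redirect a request for the Aggregation URI to a Resource Map."""
    accept = environ.get("HTTP_ACCEPT", "")
    # Naive matching: take the first listed media type the record knows about.
    for item in accept.split(","):
        media_type = item.split(";")[0].strip().lower()
        if media_type in RESOURCE_MAPS:
            start_response("303 See Other", [("Location", RESOURCE_MAPS[media_type])])
            return [b""]
    # Assumed fallback when no Resource Map serialization matches.
    start_response("406 Not Acceptable", [("Content-Type", "text/plain")])
    return [b"No matching Resource Map\n"]
```

The application can be served for local experiments with wsgiref.simple_server.make_server from the Python standard library.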

Update 2: Complete listing of MIME types for OAI-ORE Resource Map serializations. Follow-up conversations with Herbert Van de Sompel, Carl Lagoze and others have reminded me I neglected to mention how the OAI-ORE model recommends handling "HTML" (application/xhtml+xml and text/html) requests! This is not a minor issue, since the purpose of ORE is to model aggregations of resources and not resources themselves, and so it is not immediately clear what such a page request should return. My solution (for the purposes of this blog post) is for Handle System HTTP proxies to respond to these requests also with 303 See Other redirects, supplying redirect URIs that map to appropriately-coded "splash screens."


For completeness, the table below (repeated from [5]) lists the standard MIME types for Resource Map serializations. Continuing with the major theme of this post, Handle System HTTP proxies resolving requests for DOI-named ORE Resource Maps should follow these standards so that clients may request appropriate formats using HTTP Accept: headers.


Resource Map Type    MIME type
Atom                 application/atom+xml
RDF/XML              application/rdf+xml
RDFa in XHTML        application/xhtml+xml

If a client prefers RDF/XML but can also parse Atom then it might use the following HTTP header in requests:

Accept: application/rdf+xml, application/atom+xml;q=0.5
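To make the meaning of the q parameter concrete, here is a minimal sketch of how a negotiating proxy might rank the media types in such a header. The function names parse_accept and best_match are mine, and the parser is deliberately simplified: it does not handle wildcards such as */* or media-type parameters other than q.

```python
def parse_accept(header):
    """Return [(media_type, q)] sorted by descending q; ties keep header order."""
    prefs = []
    for item in header.split(","):
        parts = [part.strip() for part in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0  # a media type without an explicit q has quality 1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        prefs.append((media_type, q))
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def best_match(header, available):
    """Return the client's most preferred media type that is available, or None."""
    for media_type, q in parse_accept(header):
        if q > 0 and media_type in available:
            return media_type
    return None
```

With the header above, a server holding both serializations would pick application/rdf+xml, while a server holding only an Atom Resource Map would fall back to application/atom+xml.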

The table below lists the two common MIME types for HTML/XHTML Splash Pages following the W3C XHTML Media Types recommendations.

Splash Page Type     MIME type
XHTML                application/xhtml+xml
HTML (legacy)        text/html

Thus, if a client wishes to receive a Splash Page from the Aggregation URI and prefers XHTML to HTML then it might use the following HTTP header in requests:

Accept: application/xhtml+xml, text/html;q=0.5


As noted in [5] there is no way to distinguish a plain XHTML document from an XHTML+RDFa document based on MIME type. It is thus not possible for a client to request an XHTML+RDFa Resource Map in preference to an RDF/XML or Atom Resource Map without running the risk of a server correctly returning a plain XHTML Splash Page (without included RDFa) in response.

The Handle record for a given DOI or Handle identifying an ORE aggregation would therefore contain a set of URIs reflecting the mappings in the tables above. A content-negotiation-savvy Handle System HTTP proxy would then return the appropriate URI in the 303 See Other response, based on its configuration and policies.

References:

See the ensuing comments at my Wordpress.com version of this blog...
