Think-Dash: A .data Top-Level Internet Domain?

.data

There’s been very little change in top-level internet domains (like .com, .org, .us, etc.) for a long time. But a number of years ago I started thinking about the possibility of having a new .data top-level domain (TLD). And starting this week, there’ll finally be a period when it’s possible to apply to create such a thing.

It’s not at all clear what’s going to happen with new TLDs—or how people will end up feeling about them. Presumably there’ll be TLDs for places and communities and professions and categories of goods and events. A .data TLD would be a slightly different kind of thing. But along with some other interested parties, I’ve been exploring the possibility of creating such a thing.

With Wolfram|Alpha and Mathematica—as well as our annual Data Summit—we’ve been deeply involved with the worldwide data community, and coordinating the creation of a .data TLD would be an extension of that activity.

But what would be the point? For me, it’s about highlighting the exposure of data on the internet—and providing added impetus for organizations to expose data in a way that can efficiently be found and accessed.

In building Wolfram|Alpha, we’ve absorbed an immense amount of data, across a huge number of domains. But—perhaps surprisingly—almost none of it has come in any direct way from the visible internet. Instead, it’s mostly from a complicated patchwork of data files and feeds and database dumps.

But wouldn’t it be nice if there was some standard way to get access to whatever structured data any organization wants to expose?

Right now there are conventions for websites about exposing sitemaps that tell web crawlers how to navigate the sites. And there are plenty of loose conventions about how websites are organized. But there’s really nothing about structured data.

Now of course today’s web is primarily aimed at two audiences: human readers and search engine crawlers. But with Wolfram|Alpha and the idea of computational knowledge, it’s become clear that there’s another important audience: automated systems that can compute things.

There are product catalogs, store information, event calendars, regulatory filings, inventory data, historical reference material, contact information—lots of things that can be very usefully computed from. But even if these things are somewhere on an organization’s website, there’s no standard way to find them, let alone standard structured formats for them.

My concept for the .data domain is to use it to create the “data web”—in a sense a parallel construct to the ordinary web, but oriented toward structured data intended for computational use. The notion is that alongside a website like wolfram.com, there’d be wolfram.data.

If a human went to wolfram.data, there’d be a structured summary of what data the organization behind it wanted to expose. And if a computational system went there, it’d find just what it needs to ingest the data, and begin computing with it.

Needless to say, as we’ve learned over and over again in building Wolfram|Alpha, getting the underlying data is just the beginning of the story. The real work usually starts when one wants to compute from it—so that one can answer specific questions, generate specific reports, and so on.

For example, in our recent work on making the Best Buy product catalog computable, the original data (which came to us as a database dump) was perfectly easy to read. The real work came in the whole rest of the pipeline that was involved in making that data computable.

But the first step is to get the underlying data. And my concept for the .data domain is to provide a uniform mechanism—accessible to any organization, of any size—for exposing the underlying data.

Now of course one could just start a convention that organizations should have a “/datamap.xml” file (or somesuch) in the root of their web domains, just like a sitemap—rather than having a whole separate .data site. But I think introducing a new .data top-level domain would give much more prominence to the creation of the data web—and would provide the kind of momentum that’d be needed to get good, widespread, standards for the various kinds of data.

What is the relation of all this to the semantic web? The central notion of the semantic web is to introduce markup for human-readable web pages that makes them easier for computers to understand and process. And there’s some overlap here with the concept of the data web. But the bulk of the data web is about providing a place for large lumps of structured data that no human would ever directly want to deal with.

A decade ago I suggested to early search engine pioneers that they could get to the deep web by defining standards for how to expose data from databases. For a while there was enthusiasm about exposing “web services”, and now there are all manner of APIs made available by different organizations.

It’s been interesting for me in the past few years to be involved in the emergence of the modern data community. And from what I have seen, I think we’re now just reaching a critical point, where a wide range of organizations are ready to engage in delivering large-scale structured data in standardized forms. So it is a convenient coincidence that this is happening just when it becomes possible to create a .data top-level domain.

We’re certainly not sure what all the issues about a .data TLD will be, and we’re actively seeking input and partners in this effort. But I think there’s a potentially important opportunity, so I’m trying to do what I can to provide leadership, and further help to accelerate the birth of the data web.

Thursday, January 12, 2012

A .data Top-Level Internet Domain?

No comments:

Post a Comment