standard

Approaches to Exposing Institutional Data and Other Content

I've been thinking about and researching how an institution can share its data, documents, and other content. Obviously your data and content are already exposed via the web, but providing the data in a more structured way allows more users (both internal and external) to manipulate it in interesting ways, for example in mashups. There seem to be a few ways for an enterprise with a lot of content to share its data:

  • Straight RSS/Atom.  Although straight RSS/Atom (with no custom extensions / namespaces) may not be that interesting, it's obviously a useful way to get your content out there.  Typically straight RSS/Atom is fairly time-based and might in effect show some history (news items like "John goes to work" and then "John goes home") rather than some state (like "John is now home"). 
  • Common repositories / services such as Swivel and StrikeIron. Rather than exposing your data/content directly to the outside world from your site/servers, you can use an intermediary. Swivel allows users to create their own graphs on data from either official sources or any user-supplied data. StrikeIron is built into mashup editors like QEDwiki, and has also built an Excel extension for calling its services. You probably would want to provide data to these services through an API of your own, but you could get started with Swivel, for example, by directly uploading the data.
  • Specialized XML formats for particular types of content. Examples include OpenSearch for search results and SDMX for statistical data. These specific XML formats both allow a level of sophistication for people specializing in your type of content and allow tools built for this type of data to consume it. This fits in with the following item, which, for historical reasons, may or may not be XML-based.
  • Institution-to-institution services.  Sometimes you need to provide a point-to-point interface with another institution.  In that case, you may need to support all sorts of unusual formats and delivery mechanisms.  Hopefully you could leverage your various systems' web services to just transform the data into the formats you need. 
  • A common API that your institution follows across all types of content.   This one is the most interesting to me and one that I alluded to in my previous post on interaction publishing.  Especially if your institution has various repositories, one possible approach would be to slap up a page that has links to the different instructions for referencing each.  But, to make access as easy as possible, a common API with consistent parameters that can be queried against all systems would be preferable (for instance, queries such as "give me all your documents and data on Chad" via url requests like http://xml.example-domain.com/apis/type=docs,data&country=td).  Potentially the returned XML could be in a simple format such as RSS extended with a custom namespace (so that other tools such as Yahoo Pipes, and even feedreaders, could easily consume the data).
  • Microformats.  Probably most useful to future browsers or other tools like the Firefox Operator extension (or for services that crawl sites, such as Google), microformats allow you to just change your existing HTML a bit to expose very common types of data like addresses and calendar events.  For example, instead of your HTML having "100 Main Street, Anytown, USA" it would be marked up as "<div class="adr"> <div class="street-address">100 Main Street</div>, <span class="locality">Anytown</span>, <div class="country-name">USA</div></div>" and you would then define the CSS to show it as you wish.  For example (with sloppy CSS):
    100 Main Street

    Anytown,

    U.S.A.

    See how this page appears in Firefox Operator (also notice the tagspaces):

 Screenshot of how example adr microformat works in Firefox Operator
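
To make the microformat item a bit more concrete, here is a minimal sketch of how a consuming tool might pull the address fields back out of marked-up HTML like the adr example above. It assumes Python and its standard html.parser module; the class and function names (AdrParser, extract_adr) are made up for illustration, and it only handles the three adr properties used in this post, not the full microformat.

```python
from html.parser import HTMLParser

# The adr properties used in the example markup (a subset of the microformat).
ADR_FIELDS = {"street-address", "locality", "country-name"}
# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class AdrParser(HTMLParser):
    """Collects the text of elements whose class is an adr property."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one entry per open element: a field name or None
        self.address = {}  # field name -> collected text

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        matched = classes & ADR_FIELDS
        self.stack.append(min(matched) if matched else None)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Attribute text to the innermost open element that is an adr field.
        field = next((f for f in reversed(self.stack) if f), None)
        if field:
            self.address[field] = self.address.get(field, "") + data.strip()


def extract_adr(html):
    parser = AdrParser()
    parser.feed(html)
    return parser.address


SAMPLE_HTML = (
    '<div class="adr"> <div class="street-address">100 Main Street</div>, '
    '<span class="locality">Anytown</span>, '
    '<div class="country-name">USA</div></div>'
)

if __name__ == "__main__":
    print(extract_adr(SAMPLE_HTML))
```

The point is only that the class names carry the structure: the visible text stays the same, but a tool no longer has to guess which part is the street and which is the country.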

 

Enabling the Interaction Publisher

New sites with dynamic, interactive functionality using data from different sources and allowing the user to interact with the data are exciting to see (examples: geo.worldbank.org and carma.org). But how do we unleash this functionality so that non-programmers can create interaction like this? We have content management systems that allow more people to easily add content to sites. But I think we should be driving toward an environment where users can a) take data from a variety of sources and b) create interactive sites based on this data. Maps are the most prominent example, but interactive tables are also important. Let's review where we are now:

  • We have sites already applying Google maps and other interactive functionality to various data sources (examples above).
  • Programmers have resources/examples/documentation for creating these types of sites (see Programmable Web for example).
  • Various APIs have been exposed for interacting and using data (examples).
  • We have tools like Yahoo Pipes that allow advanced users (probably not needing full-blown programmer skills) to create mashups. That said, Yahoo Pipes is currently focused on consuming/dealing with RSS feeds (the Fetch Data Module is supposed to handle more general XML, but I had problems getting it to do so -- if you look at examples using DC crime data, you see it's RSS with some customization). In addition, this is a hosted solution, so you're at the mercy of Yahoo if you host a mashup with them (I noticed Yahoo Pipes having intermittent problems accessing feeds even in my brief testing).
  • There are probably other similar examples of specialized tools, but I know of Swivel, which allows you to create your own graphs of data.

Here are the types of interactive functionality that I think we should be allowing non-programmers (let's call these folks "Interaction Publishers", riffing off the role of "Content Publisher") to create:

  • Interactive data tables. The Interaction Publisher should be able to point at one (or multiple) data sources, and indicate which columns/attributes to display in a table. The Interaction Publisher should also indicate which attributes should be selectable (in pulldowns for example) by the end user. Of course some theming / design and annotation should be possible.
  • Interactive maps. The Interaction Publisher should be able to point at a data source, the attributes containing the locations, and what data to show for each location (along with the extent of the default map and formatting). Also, please can we get rid of the points / waypoints / circles that mark an arbitrary point to stand for data about a large area (for example, a pointer to the capital for a country), and instead highlight the whole area (for example, the whole country)? Ideally the Interaction Publisher will be able to indicate further interaction with the map (for example, displaying different layers of a map -- if not full-blown layers, then at least indicating different sets of waypoints to display).
  • Custom data. The Interaction Publisher should also be able to easily publish their own data/content, and pull their data into an interactive feature (for instance, this could even be a simple search on a little database / resource center the user has). An extension of this would be including some mechanism for overriding data points from other data sources (of course this should somehow be indicated on the map/table so it isn't misleading).
  • Wizard-like functionality. The Interaction Publisher should not have to resort to XPATH, XSL, or programming in PHP / Perl / whatever.
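
As a rough illustration of the interactive data table item, here is a minimal sketch of what a wizard might store behind the scenes: a small declarative spec naming the columns to display and the attributes the end user may select on. It assumes Python; TableSpec, render_rows, pulldown_options, and the sample records are all hypothetical names and made-up placeholder values, not anything from a real tool or data set.

```python
from dataclasses import dataclass


@dataclass
class TableSpec:
    columns: list     # attributes to display, in order
    selectable: list  # attributes the end user may filter on (e.g. via pulldowns)


def pulldown_options(records, attr):
    """Distinct values of one attribute, for populating a pulldown."""
    return sorted({r[attr] for r in records if attr in r})


def render_rows(records, spec, selections=None):
    """Return display rows for the records matching the end user's selections."""
    selections = selections or {}
    unknown = set(selections) - set(spec.selectable)
    if unknown:
        raise ValueError(f"not selectable: {sorted(unknown)}")
    rows = []
    for record in records:
        if all(record.get(attr) == value for attr, value in selections.items()):
            rows.append([record.get(col, "") for col in spec.columns])
    return rows


# Made-up placeholder records standing in for a real data source.
RECORDS = [
    {"country": "Chad", "topic": "agriculture", "documents": 12},
    {"country": "Chad", "topic": "health", "documents": 7},
    {"country": "Mali", "topic": "agriculture", "documents": 5},
]

SPEC = TableSpec(columns=["country", "documents"], selectable=["topic"])

if __name__ == "__main__":
    print(pulldown_options(RECORDS, "topic"))
    print(render_rows(RECORDS, SPEC, {"topic": "agriculture"}))
```

The Interaction Publisher would only ever fill in the spec (through a form, not code); everything else is generic and could live in the CMS or a hosted builder.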

Sounds nice -- but how would this be possible? One possible step is for institutions to expose their data in a consistent manner (at least each institution exposing its own data consistently). This would involve something of a meta-API, where you are consistent about:

  • Attributes that can be queried. Perhaps the list would be just topics and countries, for example. The topics lists should be something that the outside world will understand rather than an organization-centric list. If you have multiple topics lists, then it would be preferable if all systems were moved to a single topics list (even if that meant two topics lists per system).
  • Simplicity and consistency in APIs. Perhaps all your XML APIs are at http://xml.example-domain.com/apis/ (with an html page just listing all the APIs there) and then APIs to different systems like http://xml.example-domain.com/api/documents and http://xml.example-domain.com/api/web with example calls like http://xml.example-domain.com/api/web/api-version=1&topic=agriculture.
  • Consistent exposure of non-standard attributes. The issue of consistent query parameters was covered above -- this means that all systems are queried on the same parameters. But of course some systems will need to provide other attributes (such as, say, "Population"). This could be done in a custom namespace in RSS as the DC crime data (see xml) does in its Atom feed (which Yahoo Pipes, for example, can consume). This could be documented, and the consumer of the data could handle this.
  • Custom databases would also preferably comply. Perhaps there could be an http://xml.example-domain.com/api/core/ for institutionally, centrally supported repositories and http://xml.example-domain.com/api/special/ for one-off databases. This would still allow easy access of data by Interaction Publishers.
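
Here is a minimal sketch of what consuming such a meta-API could look like, assuming Python's standard library. It builds a query URL from the same parameters for any system, and reads RSS items that carry extra attributes in a custom namespace. Several things here are assumptions rather than part of the proposal above: the "?" query-string separator, the namespace URI in EXT_NS, the element names ext:country and ext:population, and the sample feed, which is made up and only stands in for a real response.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE = "http://xml.example-domain.com/api"
# Hypothetical namespace URI for the institution's non-standard attributes.
EXT_NS = "http://xml.example-domain.com/ns/ext"


def build_url(system, params):
    """Same parameter names for every system; only the path segment changes."""
    query = urlencode({"api-version": 1, **params})
    return f"{BASE}/{system}?{query}"


def parse_items(rss_text):
    """Return one dict per RSS item: its title plus any custom-namespace fields."""
    root = ET.fromstring(rss_text)
    prefix = "{" + EXT_NS + "}"
    items = []
    for item in root.iter("item"):
        entry = {"title": item.findtext("title", default="")}
        for child in item:
            if child.tag.startswith(prefix):
                entry[child.tag[len(prefix):]] = child.text
        items.append(entry)
    return items


# Made-up sample response: plain RSS 2.0 extended with the custom namespace.
SAMPLE_RSS = """<rss version="2.0" xmlns:ext="http://xml.example-domain.com/ns/ext">
  <channel>
    <title>Documents on Chad</title>
    <item>
      <title>Country overview</title>
      <link>http://www.example-domain.com/docs/1</link>
      <ext:country>td</ext:country>
      <ext:population>placeholder</ext:population>
    </item>
  </channel>
</rss>"""

if __name__ == "__main__":
    print(build_url("web", {"topic": "agriculture"}))
    print(build_url("documents", {"topic": "agriculture"}))
    print(parse_items(SAMPLE_RSS))
```

Because the standard RSS elements are untouched, a feedreader would still show the title and link, while a consumer that knows the namespace gets the extra attributes as well.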

Some potential ways of inching toward the goal of the non-developer Interaction Publisher easily being able to publish dynamic, interactive features would be:

  • Start by using JavaScript libraries. There are several JavaScript libraries out there (examples: Dojo, mootools, Prototype / script.aculo.us), but most seem to be too low-level (concentrating on opening/closing panels, transitions, and the like) to be useful for interactive data features. Possibly a library with higher-level features, including interactive tables, such as Ext JS could be used as a first step. It would require touching some code, but perhaps a CMS, for example, could include code snippets in its documentation indicating what needs to be replaced (for example, where to put in the url to the source XML).
  • Create some simple wizards in CMSes. So that we aren't relying on, for example, Yahoo Pipes for hosting our interaction, we may wish to start including simple wizards in our CMSes. For example, one could be for interactive tables that just had one data source and three columns.
  • Push for stronger hosted interactive feature builders. For example, Yahoo Pipes perhaps could include some of the features mentioned in this post (for example, a tool for creating interactive maps, or a tool for creating a pulldown of options to drive a Google map).

Here's a little chart displaying some of the ideas in this post (also see pdf version):

I'd really like your comments on this post. Specifically:

  • Is the role of Interaction Publisher important?
  • How could we enable this role?
  • What ideas above do you think would work and which would not work?
  • Is there a need for a generic, standard XML format separate from RSS feeds, or should an institution's RSS just be extended to include custom portions?

 

Standardization and Large Web Sites

Very large sites supporting a large number of units/stakeholders can easily turn into a hodge-podge of styles, user interface elements, and quality. One of the toughest discussions with clients, however, is why they can't do more customization (even if one of the core requirements of the system is to help enforce standardization). What are some of the reasons *not* to standardize:

  • specific business needs of different groups (not to be confused with a group just wanting to differentiate itself somehow, for instance with a different look, that does not help the web visitor at all)
  • professional development (for instance a developer might be interested in doing a mashup)
  • personal expression (liking particular colors for example)
  • experimentation (don't know in advance what's going to "stick," so try a variety of things)

In my opinion, the first and last reasons are the most compelling (and the third is not a good reason at all for an enterprise-wide system), although one of the problems with experimentation is the frequent expectation that an experiment could quickly be rolled into the normal standardized platform (that's probably a post on its own!). Here are some reasons *to* standardize:

  • consistent brand for the user ("am I still on the same site? Is this high quality content?")
  • consistent UI for the user ("do I know how to use the site?")
  • better support for new site admins or transition of support between sites
  • single sign-on. It's confusing for a user to have various accounts with the same institution.
  • standard statistics. Different statistics packages can have entirely different ways of counting something as basic as a page view. Standardizing on a statistics package can help ensure you're comparing apples to apples in your web analysis.
  • better search. If everyone does their own thing, then there may be more fragmented information which would mean search results aren't as good.
  • stability / support. As anyone who works with software/systems knows, the more functionality or special customization you put into a system, the more effort it takes to maintain it. Also, the system will probably be less stable. This one is also very tough to discuss with a client (and another probable future blog post) since they tend to only see their particular need.

Some possible methods of standardization:

  • Governance. There needs to be a group with the power and influence to say "no" to requests that undermine the quality of the user experience of the site at large. This ideally is not the technology group since there would appear to be a conflict of interest.
  • Clearly define exactly what is inside the standard and what is outside.
  • Technology. The content management system used to manage the site can be set up such that users can only make changes that comply with the standard.
  • The right level of customization. Standardization shouldn't be an excuse to totally control every aspect of everyone's sites or to not allow any innovation.
  • Hooks into core shared functionality. You may decide that a single sign-on for users of your site is desirable. If so, then perhaps the system could be set up with an API such that tools developed and commissioned by other groups could work with the core functionality.
  • Standardized access to data. Ideally, you could define a standard method of each system exposing its core data, that even people outside the institution could utilize for mashups, etc. By providing the data in a simple XML API, this could facilitate both internal and external usage of data.
  • Another potential approach is to have separate branding for the official, blessed content and for the organization-centric content. For instance, you may have multiple units in your institution all looking at the topic of taxes. Ideally you would have one official web site that makes sense of your institution's view of taxes overall, and preferably this would pull information from all the units. The various units still may want their own site, but this is less useful for the end user -- so perhaps these units could have their own sites branded differently (and perhaps all requiring a standard link back to the official site) to clearly indicate it is the view of a particular unit within your institution.

Of course, all of these are easier said than done when trying to get a large number of units into the same system, but perhaps some of these could be initiated even after a large suite of sites has been implemented in a central content management system.
