CMS

Why Data is Different

But isn't it all data?  Although web content certainly is an important type of information available on a web site, Data needs to be treated differently.  Here I'm talking about Data with a capital D -- I thought the Wikipedia description was good: "Data refers to a collection of organized information, usually the results of experience, observation or experiment, or a set of premises. This may consist of numbers, words, or images, particularly as measurements or observations of a set of variables."  Here are some of the ways that Data is different:

  1. People expect Data to be available in different Formats.
  2. Users want to manipulate the Data.
  3. You don't totally control your Data, since it is available in different Channels.

There are several implications of this including:

  • Formats.  You may wish to standardize the formats that your data is available in.  Is all your data always available in CSV (if that's what you standardize on)?  This includes both the formats themselves (Excel, Stata, etc.) and the method by which the data is requested.  For instance, is there one place where users can directly get all your data?  "Directly" doesn't mean some thin layer with links to databases each doing its own thing.  An example of a consistent format would be a web service with a published set of parameters by which the data could be requested.  Ideally, all the institution's data would be available from this one web service.
  • Manipulation.  Sometimes people just want to see your data, but usually they will want to manipulate it.  Providing your data in consistent formats makes it easier for your users to work with it.  Other users will expect that *you* provide the tools for manipulating your own data.
  • Channels.  Ideally you will work to directly feed data to primary channels.  For instance, if you feed data directly to services like Swivel then you both get more use out of your data and also can ensure your data is available in its highest quality (not watered down by other people copying and pasting your data, for example). 
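
As a rough illustration of "a web service with a published set of parameters," here is a minimal Python sketch of how a client might build a request against such a service.  The endpoint, the parameter names (dataset, format), and the list of supported formats are all made up for the example; they are assumptions, not a description of any real service.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and format list, for illustration only.
BASE_URL = "http://data.example.org/api"
SUPPORTED_FORMATS = {"csv", "xml", "json"}


def build_data_url(dataset, fmt="csv", **filters):
    """Build a request URL for one dataset in one of the standard formats."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    params = {"dataset": dataset, "format": fmt}
    # Sort the optional filters so the same request always yields the same URL.
    params.update(sorted(filters.items()))
    return f"{BASE_URL}?{urlencode(params)}"
```

The point of the sketch is that every dataset is requested the same way, and asking for a format the institution has not standardized on fails loudly instead of silently returning something else.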

Beware False Precision in Tagging

The prior post Taxonomy Mappings: Be Careful When Integrating gave some examples and described the problem of taxonomy mappings.  Related to that is false precision in your tags.  In thinking about this more, it occurs to me that there are probably two useful rules of thumb to keep in mind whenever tagging/pulling content (whether the content is automatically tagged, or mapped from another taxonomy, or mapped by hand):

  1. You can't tag in a coarse-grained taxonomy and pull based on a fine-grained taxonomy (for example, if you have a system that only tags to "Washington, DC Metro Area," then you won't be able to pull by "Washington, DC" since any content tagged in the system may only be relevant to "Alexandria, VA").
  2. You can't tag in a fine-grained taxonomy when you are only using coarse information to determine the tagging (for example, if all you know about a group of content is that they're all animals, you can't tag each piece of content to frogs, cats, dogs, etc.).

In both of these cases, when you pull by the fine-grained taxonomy there is a false sense of precision (and you can get grossly wrong results).  Another way of stating the rules of thumb above:

  1. You have to originally tag (or possibly go through the effort of retro-actively tagging, perhaps through automated concept extraction) all content to at least as fine-grained a taxonomy as you're going to pull from,
  2. without artificially tagging more precisely than you are accurate.

Of course, by far the most preferable treatment is that all content, across the various systems you want to pull from (onto the same web page, for example), is tagged to the same, fine-grained taxonomy (or at least as fine-grained as you ever expect to need to pull from).  Otherwise you'll have to resort to taxonomy mappings, or retroactively tag content.
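
To make the first rule of thumb concrete, here is a small Python sketch that classifies how a piece of tagged content relates to a pull by a given term.  The tiny place taxonomy and the function names are hypothetical; the only idea taken from the discussion above is that a tag broader than the query term cannot be treated as a precise match.

```python
# Hypothetical place taxonomy: each term maps to its parent (None for the root).
PARENT = {
    "Washington, DC Metro Area": None,
    "Washington, DC": "Washington, DC Metro Area",
    "Alexandria, VA": "Washington, DC Metro Area",
}


def ancestors(term):
    """Return the chain of broader terms above `term`, nearest first."""
    chain = []
    parent = PARENT[term]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def match(tag, query):
    """Classify how content tagged `tag` relates to a pull by `query`.

    "exact"    - the tag is the query term or narrower, so the match is safe
    "possible" - the tag is broader than the query: the content may or may
                 not be relevant, so returning it would be false precision
    "none"     - the two terms are unrelated
    """
    if tag == query or query in ancestors(tag):
        return "exact"
    if tag in ancestors(query):
        return "possible"
    return "none"
```

Content tagged only to the metro area comes back as "possible" when pulling by "Washington, DC", which is exactly the case where displaying it as a precise match would mislead.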

Interaction Publisher: Mashup Editor Comparison / Roundup

Since posting Enabling the Interaction Publisher, I've done some more research on mashup builders.   I found a lot of excellent resources (see end of this post for a list), but felt that a summary of what's out there would be helpful (this is what I would have liked to see when I researched this).  Although lists of tools exist, I wasn't getting a sense of the overall space.  Note that I believe that mashup editors are only part of enabling the interaction publisher: the other parts being standardized access to data (particularly all data from one institution) and deeply embedding these types of tools in content management systems (for instance, using topic-driven templates already in your CMS to drive mashups that are driven by those topics). 

Some of the particular variables that are probably most relevant in what tool you should use for your particular application / situation:

  • Is the expectation that you will quickly have a mashup, or that you are building out an infrastructure for your institution?  Does this need to be behind your firewall?  Do you need guaranteed uptime?  Basically, do you need this to be hosted (quick and easy, get-me-started-now) or will you build out and manage the infrastructure?
  • Do you want the end result of your efforts to be another data feed, a map, or other types of user interactions like data grids?  What should the output be?
  • What types of inputs do you need to pull: structured (if so, generic XML, or just RSS/atom specifically), unstructured (like web pages), or direct database connections?
  • How are the mashups built?  Do you need totally non-technical people to create them, or do you just need to support your power users?  What is the environment for building the feeds (for example, create-a-flowchart tools like Yahoo Pipes and Microsoft Popfly)?
  • What are the browser requirements for people to use / consume your mashups (for instance Microsoft Popfly requires both Silverlight and recent versions of just Firefox and IE)?  
  • Can a mashup be embedded into an existing web page? I didn't find enough useful information on this to fill this out meaningfully in the table below, so perhaps this is something to add in the future.

Although I was hoping to put together a fuller spreadsheet of all the tools out there, I selected a subset that was either easy to start using or had very clear documentation (I used all of them to at least some extent, except AquaLogic Pages).  At any rate, here's a brief (and admittedly incomplete) table comparing different mashup builders along the criteria listed above (please comment with any corrections/additions):

| Mashup Builder | Hosted? | What outputs? | What inputs? | Building Environment | Browser Restrictions | Can be embedded? |
|---|---|---|---|---|---|---|
| Apatar | N/A (run from desktop) | A wide range including: RSS, Text, Salesforce, File, MySQL, Amazon S3 | A wide range including: RSS, Text, Salesforce, File, MySQL, Amazon S3 | Visually create a flowchart | N/A | No |
| AquaLogic Pages | No | Interactions: Data Table, Record List, Text, Map | RSS, web services, user-created data | WYSIWYG | ? | Partially? (only within BEA environment) |
| Dapper | Yes | Feeds: XML, RSS, HTML, JSON.  Interactions: Google Gadgets, Google Maps (more) | Web pages or RSS | Pointing at the parts of the screen you want scraped and/or filling in forms | Depends on output | Yes |
| Google Mashup Editor | Yes | Hosted web page (within GME environment) | RSS, GoogleBase, or user inputs | Coding | ? | No |
| Microsoft Popfly | Yes | Interactions: Gobs, although much of the focus seems to be on playful things like whack-a-mole | Web pages, RSS | Visually create a flowchart | Silverlight + Firefox 2 or IE7 (!) | |
| QEDWiki | Either (although install option didn't work for me) | Wiki pages | XML, RSS | WYSIWYG + filling out forms | ? | |
| StrikeIron SOA Express for Excel | Desktop (extension to Excel) | Excel sheet | Web Services, especially StrikeIron services | Excel | N/A | No |
| Yahoo! Pipes | Yes | Feeds + Interactions (Maps / Lists) | Feeds / CSV / some other specific web sources / limited generic XML (reference) | Visually create a flowchart | Didn't see official reference, but seems to run on Firefox 2.0, IE6, and Opera 9.24 | |

I found the easiest to use were Apatar, StrikeIron SOA Express for Excel, and Yahoo! Pipes (if you just want to play with something to get a feel for mashup editors, I'd recommend starting with these), although each is entirely different.  Although QEDWiki has a great intro video and appears to be able to do a lot, I got the least far actually using it.  Although I'm sure that someone who knows Dapper inside and out could create a feed from scraping pages, in practice I didn't manage to get it to do the two tests I tried (for example, in trying to scrape country pages from the World Bank site, it wouldn't let me since the different pages were coming from different domains although they were driven from the same CMS).  Popfly seemed interesting, but doesn't appear to be geared toward the enterprise and has very specific browser requirements.  That said, these were just initial impressions (and initial ease of use may be irrelevant for a particular application) -- the main purpose of this post was to put together the matrix above to get a feel for the current overall state / scope of mashup editors.

Here are some excellent resources on mashup editors:

 

Approaches to Exposing Institutional Data and Other Content

I've been thinking about and researching how an institution can share its data, documents, and other content.  Obviously your data and content is already exposed via the web, but providing the data in a more structured way allows more users (both internal and external) to manipulate the data in interesting ways, for example in mashups.  There seem to be a few ways to share data from an enterprise with a lot of content:

  • Straight RSS/Atom.  Although straight RSS/Atom (with no custom extensions / namespaces) may not be that interesting, it's obviously a useful way to get your content out there.  Typically straight RSS/Atom is fairly time-based and might in effect show some history (news items like "John goes to work" and then "John goes home") rather than some state (like "John is now home"). 
  • Common repositories / services such as Swivel and StrikeIron.   Rather than exposing your data/content directly to the outside world from your site/servers, you can use an intermediary.  Swivel allows users to create their own graphs on data from either official sources or any user-supplied data.  StrikeIron is built into mashup editors like QEDWiki, and has also built an extension to Excel to call their services.  You probably would want to provide data to these services through an API of your own, but you could get started with Swivel, for example, by directly uploading the data.
  • Specialized XML formats for particular types of content.  Examples include OpenSearch for search results and SDMX for statistical data.  These specific XML formats both allow a level of sophistication for people specializing in your type of content and allow tools built for this type of data to consume it.  This fits in with the following item, which, for historical reasons, may or may not be XML-based.
  • Institution-to-institution services.  Sometimes you need to provide a point-to-point interface with another institution.  In that case, you may need to support all sorts of unusual formats and delivery mechanisms.  Hopefully you could leverage your various systems' web services to just transform the data into the formats you need. 
  • A common API that your institution follows across all types of content.   This one is the most interesting to me and one that I alluded to in my previous post on interaction publishing.  Especially if your institution has various repositories, one possible approach would be to slap up a page that has links to the different instructions for referencing each.  But, to make access as easy as possible, a common API with consistent parameters that can be queried against all systems would be preferable (for instance, queries such as "give me all your documents and data on Chad" via URL requests like http://xml.example-domain.com/apis?type=docs,data&country=td).  Potentially the returned XML could be in a simple format such as RSS extended with a custom namespace (so that other tools such as Yahoo Pipes, and even feed readers, could easily consume the data).
  • Microformats.  Probably most useful to future browsers or other tools like the Firefox Operator extension (or for services that crawl sites such as Google), microformats allow you to just change your existing HTML a bit to expose very common types of data like address and calendar events.  For example, instead of your HTML having "100 Main Street, Anytown, USA" it would be marked up as "<div class="adr"> <div class="street-address">100 Main Street</div>, <span class="locality">Anytown</span>, <div class="country-name">USA</div></div>" and then define the CSS to show it as you wish.  For example (with sloppy CSS):
    100 Main Street

    Anytown,

    U.S.A.

    See how this page appears in Firefox Operator (also notice the tagspaces):

 Screenshot of how example adr microformat works in Firefox Operator
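
Tools like Operator are not the only possible consumers of this markup.  As a hedged illustration, here is a minimal Python sketch that pulls the adr fields out of a fragment like the one in the microformats item above, using only the standard library's html.parser.  It handles only the three class names used in that example and assumes a well-formed fragment in which every opened element is closed; it is not a full microformats parser.

```python
from html.parser import HTMLParser

# Only the adr sub-properties used in the example above.
ADR_FIELDS = {"street-address", "locality", "country-name"}


class AdrParser(HTMLParser):
    """Collect the text of adr sub-properties found in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        # One entry per open element: its adr field name, or None.
        self._stack = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        field = next((c for c in classes if c in ADR_FIELDS), None)
        self._stack.append(field)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        field = self._stack[-1] if self._stack else None
        if field and data.strip():
            self.fields[field] = self.fields.get(field, "") + data.strip()


def parse_adr(html):
    """Return a dict of the adr fields found in `html`."""
    parser = AdrParser()
    parser.feed(html)
    parser.close()
    return parser.fields
```

Fed the marked-up address from the example, parse_adr returns the street address, locality, and country name as separate values, which is the structure that the plain "100 Main Street, Anytown, USA" string hides.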

 

Link Repository: Structured Link Checking

Especially as a content management system grows to have a large amount of content, it would be nice if you could do structured link checking. One of the problems with link checking in general is what to do with the reports once you get them. Of course, for a very small site you can easily scan an entire site with tools like LinkScan ($) and Xenu Linksleuth (free, but ads are put in the reports) or even monitor 404 requests and use single-page tools like the LinkChecker Firefox extension. But with large sites you can end up with reports so long that it is hard to know where to even start fixing links. This is especially true for CMS-driven sites: the same bad link may appear in only one piece of content that is displayed throughout the site. Or you could wind up linking from lots of content items to a URL (possibly outside your control) that changes.

I envision getting a report with a list of the bad links, where a user (with appropriate global rights) could indicate the correct new link, which would get reflected in all content items (or left menus or other components surrounding the content) that used that link. This list could be prioritized by the cumulative page views of the pages that contained that bad link, or by the number of pages that contained that link. Another approach might be to provide a prioritized list of content items that have bad links (preferably directly linkable to edit mode of that content item). At any rate, note that we're not talking about pages here but content items or links -- the user can quickly take action that will correct links on multiple pages. A long list of pages (specific URLs) with bad links is confusing and, more importantly, isn't as quickly actionable.
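
As a sketch of the prioritization idea, the following Python function turns a page-level list of bad-link occurrences into a link-level report ranked by cumulative page views.  The input shapes (pairs of bad link and page URL, plus a dictionary of page views) are assumptions made for the example; a real link checker and analytics package would supply these in their own formats.

```python
from collections import defaultdict


def prioritize_bad_links(occurrences, page_views):
    """Group (bad_link, page_url) pairs by link, ranked by cumulative views."""
    pages_by_link = defaultdict(set)
    for link, page in occurrences:
        pages_by_link[link].add(page)

    report = [
        {
            "link": link,
            "pages": len(pages),
            # Pages with no recorded traffic count as zero views.
            "views": sum(page_views.get(page, 0) for page in pages),
        }
        for link, pages in pages_by_link.items()
    ]
    # Most-viewed breakage first; the number of pages breaks ties.
    report.sort(key=lambda row: (row["views"], row["pages"]), reverse=True)
    return report
```

Each row of the result is one bad link rather than one page, so a single correction per row could fix every page that row represents.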

Here is how normal link checking reports look and how more useful reports might look:

Before / Existing Reports (where do you start with a report like this, where content items may drive multiple pages?):

  • http://badlinkone.com is referenced on http://example-site.com/page1, http://example-site.com/page35, and http://example-site.com/page102
  • http://badlinktwo.com is referenced on http://example-site.com/page1, http://example-site.com/page1023, http://example-site.com/page2439, http://example-site.com/page5192
  • Etc.

Report indicating bad links where the user can immediately correct them (and apply the correction everywhere):

  • Etc.

Report indicating which content items have the bad links (content items linkable to edit them directly):

  • Etc.

One possible way to implement this is to change all the URLs into logical links in your CMS. Assuming your CMS stores straight HTML rather than a more structured format, any URL the user enters could be changed to a macro. (If users could put a hard link directly into the HTML without the system changing it, most would probably skip logical linking, even if an option for creating a logical link existed.) For example, if the user put in this HTML:

<a href="http://hobbsontech.com">Hobbs On Tech</a> then the system would replace it with !link(123,"Hobbs On Tech") and put in its link repository that link 123 was http://hobbsontech.com. When the page was generated, the correct link could be substituted back into the HTML (so of course the end user's browser should never see the "123" in the HTML). If the page linked to was in your CMS, then the macro could be different and just indicate the unique key for the content item being pointed to (this would depend on whether the context that the content appeared in was relevant). For example: !cms_item(123,"Hobbs On Tech")
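
Here is a minimal Python sketch of that round trip, just to show the moving parts.  The LinkRepository class and its method names are hypothetical, the ids are assigned sequentially (so the first link gets 1 rather than 123), and the regular expressions only handle the simple anchor and macro forms shown above; a real CMS would want a proper HTML parser and would need to handle quotes inside link text.

```python
import re

# Matches only simple anchors of the form <a href="URL">TEXT</a>.
ANCHOR = re.compile(r'<a href="([^"]+)">([^<]*)</a>')
# Matches the !link(ID,"TEXT") macro form described above.
MACRO = re.compile(r'!link\((\d+),"([^"]*)"\)')


class LinkRepository:
    """Hypothetical store mapping link ids to URLs."""

    def __init__(self):
        self._id_by_url = {}
        self._url_by_id = {}

    def register(self, url):
        """Return the id for `url`, assigning a new one on first sight."""
        if url not in self._id_by_url:
            link_id = len(self._url_by_id) + 1
            self._id_by_url[url] = link_id
            self._url_by_id[link_id] = url
        return self._id_by_url[url]

    def update(self, link_id, new_url):
        """Point an existing id at a corrected URL."""
        old_url = self._url_by_id[link_id]
        self._id_by_url.pop(old_url, None)
        self._id_by_url[new_url] = link_id
        self._url_by_id[link_id] = new_url

    def store(self, html):
        """Replace hard links with macros before saving content."""
        return ANCHOR.sub(
            lambda m: f'!link({self.register(m.group(1))},"{m.group(2)}")', html
        )

    def render(self, stored):
        """Expand macros back into real anchors when generating a page."""
        return MACRO.sub(
            lambda m: f'<a href="{self._url_by_id[int(m.group(1))]}">{m.group(2)}</a>',
            stored,
        )
```

Because stored content only ever holds the id, a single update call corrects the link in every content item that uses it the next time those pages are generated.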

Related items that a link repository might help with:

  • Reporting on content use. A link repository would allow other interesting reporting, such as the most linked-to content items in your repository.
  • Easily move content. In some cases, it may be easier to move content if you had a link repository. For instance, you may sometimes need to restructure your site resulting in the links changing. With a link repository, you could automatically change all the links so that the move did not result in broken links (of course this would work best for intranet sites where there were limited links outside your control to your content).

Of course, this would add complexity (and possible failure points) to a CMS. Do you think it would be worth it?
