But isn't it all data? Although web content certainly is an important type of information available on a web site, Data needs to be treated differently. Here I'm talking about Data with a capital D -- I thought the Wikipedia description was good: "Data refers to a collection of organized information, usually the results of experience, observation or experiment, or a set of premises. This may consist of numbers, words, or images, particularly as measurements or observations of a set of variables." Here are some of the ways that Data is different:
There are several implications of this including:
Since posting Enabling the Interaction Publisher, I've done some more research on mashup builders. I found a lot of excellent resources (see end of this post for a list), but felt that a summary of what's out there would be helpful (this is what I would have liked to see when I researched this). Although lists of tools exist, I wasn't getting a sense of the overall space. Note that I believe that mashup editors are only part of enabling the interaction publisher: the other parts being standardized access to data (particularly all data from one institution) and deeply embedding these types of tools in content management systems (for instance, using topic-driven templates already in your CMS to drive mashups that are driven by those topics).
Some of the particular variables that are probably most relevant in what tool you should use for your particular application / situation:
Although I was hoping to put together a fuller spreadsheet of all the tools out there, I selected a subset that was either easy to start using or had very clear documentation (I used all of them to at least some extent, except AquaLogic Pages). At any rate, here's a brief (and admittedly incomplete) table comparing different mashup builders along the criteria listed above (please comment with any corrections/additions):
| Mashup Builder | Hosted? | What outputs? | What inputs? | Building Environment | Browser Restrictions | Can be embedded? |
| Apatar | N/A (run from desktop) | A wide range including: RSS, Text, Salesforce, File, MySQL, Amazon S3 | A wide range including: RSS, Text, Salesforce, File, MySQL, Amazon S3 | Visually create a flowchart | N/A | No |
| AquaLogic Pages | No | Interactions: Data Table, Record List, Text, Map | RSS, web services, user-created data | WYSIWYG | ? | Partially? (only within BEA environment) |
| Dapper | Yes | Feeds: XML, RSS, HTML, JSON. Interactions: Google Gadgets, Google Maps (more) | web pages or RSS | Pointing at the parts of the screen you want scraped and/or filling in forms | Depends on output | Yes |
| Google Mashup Editor | Yes | Hosted web page (within GME environment) | RSS, GoogleBase, or user inputs | coding | ? | No |
| Microsoft Popfly | Yes | Interactions: Gobs, although much of the focus seems to be on playful things like whack-a-mole | web pages, RSS | Visually create a flowchart | Silverlight + Firefox 2 or IE7 (!) | |
| QEDWiki | Either (although install option didn't work for me) | Wiki pages | XML, RSS | WYSIWYG + filling out forms | ? | |
| StrikeIron SOA Express for Excel | Desktop (extension to Excel) | Excel sheet | Web Services, especially StrikeIron services | Excel | N/A | No |
| Yahoo! Pipes | Yes | Feeds + Interactions (Maps / Lists) | Feeds / CSV / some other specific web sources / limited generic XML (reference) | Visually create a flowchart | Didn't see official reference, but seems to run on Firefox 2.0, IE6, and Opera 9.24 | |
I found the easiest to use were Apatar, StrikeIron SOA Express for Excel, and Yahoo! Pipes (if you just want to play with something to get a feel for mashup editors, I'd recommend starting with these), although each is entirely different. QEDWiki has a great intro video and appears to be able to do a lot, but I got the least far actually using it. I'm sure that someone who knows Dapper inside and out could create a feed from scraping pages, but in practice I didn't manage to get it to do the two tests I tried (for example, when I tried to scrape country pages from the World Bank site, it wouldn't let me, since the different pages were coming from different domains even though they were driven from the same CMS). Popfly seemed interesting, but doesn't appear to be geared toward the enterprise and has very specific browser requirements. That said, these were just initial impressions (and initial ease of use may be irrelevant for a particular application) -- the main purpose of this post was to put together the matrix above to get a feel for the current overall state / scope of mashup editors.
Here are some excellent resources on mashup editors:
I've been thinking about and researching how an institution can share its data, documents, and other content. Obviously your data and content are already exposed via the web, but providing the data in a more structured way allows more users (both internal and external) to manipulate the data in interesting ways, for example in mashups. There seem to be a few ways to share data from an enterprise with a lot of content:
See how this page appears in Firefox Operator (also notice the tagspaces):

Sometimes you need to pull content from multiple systems into a single page, and you want to pull from both systems based on some metadata, perhaps by topic. For instance, let's say you have a site that needs to pull data from a document repository and a news archive, and you want the user to use a pulldown to select the topic they want to filter the content by (for example, "Politics", "Entertainment", "Travel", "Europe", and other topics). Sometimes out of the box the two systems will share the same list of topics, but more often than not they will not.
One deceptively simple approach when systems do not share the same list of topics is to have some sort of mapping between the taxonomies of the two systems (for instance, "Travel" = "Vacations", "Politics" = "Domestic Politics", "Europe" = "EU", etc). I fairly frequently hear something like this when discussing integration between different systems: "We have a mapping between these topics, so there shouldn't be any problem." But just because you have a mapping doesn't mean that it will be satisfactory for combining information from multiple systems. I thought it would be helpful to think through the issues some and write out some examples.
One taxonomy's controlled vocabulary being more specific than another
Let's say you've got some content in two systems that you want to pull into one page. Perhaps you want to find out all the fathers in both systems. If the taxonomies available were the following (and you didn't have other metadata on gender, for example), then you could not do this:
| "Relationship" values on site one | "Relationship" values on site two |
| Father | Parent |
| Mother | |
| Sister | Sibling |
| Brother | |
A simple and meaningful mapping between the two would be something like this (allowing you to find all the people across systems that are a parent, for example):
Father or Mother -> Parent
Sister or Brother -> Sibling
Note that the other direction makes no sense (it's tough to be both a sister and brother, and you wouldn't know which to pick when translating between systems). So, although you may have a mapping between the systems, it does NOT necessarily enable the types of queries you want to do.
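To make the direction problem concrete, here is a minimal Python sketch of that mapping. The dictionary and function names are made up for illustration; only the values come from the table above.

```python
# Hypothetical many-to-one mapping from site one's values to site two's.
SITE_ONE_TO_SITE_TWO = {
    "Father": "Parent",
    "Mother": "Parent",
    "Sister": "Sibling",
    "Brother": "Sibling",
}


def to_site_two(value):
    """Translate a specific site-one value to site two's broader value."""
    return SITE_ONE_TO_SITE_TWO[value]


def site_one_candidates(site_two_value):
    """List every site-one value that a site-two value could stand for."""
    return sorted(
        specific
        for specific, general in SITE_ONE_TO_SITE_TWO.items()
        if general == site_two_value
    )


print(to_site_two("Father"))          # Parent
print(site_one_candidates("Parent"))  # ['Father', 'Mother']
```

Going from site one to site two always gives exactly one answer. Going the other way gives a list of candidates, which is why "all the fathers in both systems" cannot be answered from this metadata alone.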
A slightly more realistic example
Of course that was a simplified example to illustrate the point, and you usually have overlapping taxonomies, something like the following (still a forced example though):
| "Location" in system one | "Location" in system two |
| SF Bay Area | |
| Palo Alto | Silicon Valley |
| (other cities in Silicon Valley) | |
| Richmond | East Bay |
| (other cities in East Bay) | |
| Sausalito | Marin County |
| (other cities in Marin County) | |
| San Francisco | San Francisco |
| South San Francisco | |
Let's say these two systems had a selection of companies tagged to these controlled vocabularies. What kinds of queries would probably be meaningful?
Obviously, you couldn't query on cities since system two has virtually no cities. But what about San Francisco? Isn't that on both lists? Although at first blush it may seem that you could find all companies in San Francisco across both systems, looking at the list more carefully it becomes apparent that they almost certainly have different meanings: the first taxonomy only has the broad San Francisco Bay Area and then cities, and the second taxonomy is just listing areas within the San Francisco Bay Area. So San Francisco in system 2 probably includes San Francisco proper as well as, for example, South San Francisco. So you can do this query (but *not* query on all companies in San Francisco):
Part of the issue is that often you have much larger taxonomies that are more difficult to analyze (for example, for a taxonomy that includes all cities in California, or the US). It would be very difficult to go through and determine the meaning of the different values of the taxonomy.
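As a rough illustration of querying at the area level, the Python sketch below rolls system one's cities up to system two's areas. The company records are invented sample data, and the city-to-area table rests on the guess above that "San Francisco" in system two also covers South San Francisco.

```python
# Hypothetical roll-up of system one's cities to system two's areas.
# Assumes system two's "San Francisco" also covers South San Francisco.
CITY_TO_AREA = {
    "Palo Alto": "Silicon Valley",
    "Richmond": "East Bay",
    "Sausalito": "Marin County",
    "San Francisco": "San Francisco",
    "South San Francisco": "San Francisco",
}

# Invented sample records: (company name, location tag).
SYSTEM_ONE = [
    ("Acme", "San Francisco"),
    ("Birch", "South San Francisco"),
    ("Cedar", "Palo Alto"),
    ("Dune", "SF Bay Area"),  # too broad to place in any single area
]
SYSTEM_TWO = [
    ("Elm", "San Francisco"),
    ("Fir", "East Bay"),
]


def companies_in_area(area):
    """Find companies in an area across both systems."""
    from_one = [name for name, city in SYSTEM_ONE if CITY_TO_AREA.get(city) == area]
    from_two = [name for name, tagged in SYSTEM_TWO if tagged == area]
    return sorted(from_one + from_two)


print(companies_in_area("San Francisco"))  # ['Acme', 'Birch', 'Elm']
```

Records tagged only with the broad "SF Bay Area" value fall out of every area query, which is one more thing a user interface would need to make clear.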
What to do?
In practice, you probably won't be able to deeply analyze all the mappings between your systems, so you'll have a mapping but might only have a feel for how good it is (and in what direction). Perhaps the most dangerous mapping, and one that is hopefully fairly easy to identify, is from a more general taxonomy to a more specific one (the reverse direction in the first example above); it should be avoided entirely. Of course, if the systems do not even have taxonomies that are close, this will be obvious and will require changes to at least one of the source systems. The second example above (overlapping but not quite lining up) might be a type of taxonomy matching that's not that bad but just requires documentation/labeling (for example, use "San Francisco and South San Francisco" in a pulldown to select areas of the Bay Area) and careful design (obviously don't allow the user to select a city if you need data from both systems, or clearly show in the results that the information is just from System 1). But figuring out the relationships between the taxonomies might take a lot of work.
Some potential general approaches: be careful and a) where possible try to access globally well-understood and clear values (like zip codes, lat/long, ISO country codes, etc) rather than fall into the trap of trying to use two taxonomies just because they're called the same thing (this is probably easier for something like a location than for a topic), b) force all systems to tag to a neutral reference source (this could, with a lot of work in defining rules, be automated with something like Teragram for example), or c) seek out a metadata expert, since they have some best practices for mapping between flat or networked taxonomies. Also, when you are designing a system in the first place (even before being faced with a new integration), try if possible to tag your content to well-understood values (especially easy for geographic tagging).
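Here is a small Python sketch of the neutral-reference idea, using ISO 3166-1 alpha-2 country codes as the shared values. The per-system label tables, the function name, and the sample items are all hypothetical.

```python
from collections import defaultdict

# Hypothetical lookups from each system's own labels to ISO 3166-1 alpha-2 codes.
SYSTEM_ONE_TO_ISO = {"United States": "US", "Germany": "DE"}
SYSTEM_TWO_TO_ISO = {"USA": "US", "Deutschland": "DE", "France": "FR"}


def combine_by_country(system_one_items, system_two_items):
    """Group (title, label) items from both systems under a shared ISO code."""
    combined = defaultdict(list)
    for title, label in system_one_items:
        combined[SYSTEM_ONE_TO_ISO[label]].append(title)
    for title, label in system_two_items:
        combined[SYSTEM_TWO_TO_ISO[label]].append(title)
    return dict(combined)


documents = [("Doc A", "United States")]
news = [("News B", "USA"), ("News C", "France")]
print(combine_by_country(documents, news))
# {'US': ['Doc A', 'News B'], 'FR': ['News C']}
```

Neither system has to know about the other's vocabulary here; each only needs a mapping to the shared reference, and a label with no code raises a `KeyError` rather than being silently mismatched.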
Very large sites supporting a large number of units/stakeholders can easily turn into a hodge-podge of styles, user interface elements, and quality. One of the toughest discussions with clients, however, is why they can't do more customization (even if one of the core requirements of the system is to help enforce standardization). What are some of the reasons *not* to standardize:
In my opinion, the first and last reasons are the most compelling (and the third not being a good reason at all for an enterprise-wide system), although one of the problems with experimentation is the frequent expectation that an experiment could quickly be rolled into the normal standardized platform (that's probably a post on its own!). Here are some reasons *to* standardize:
Some possible methods of standardization:
Of course, all of these are easier said than done when trying to get a large number of units into the same system, but perhaps some of these could be initiated even after a large suite of sites have been implemented in a central content management system.