scraping

Interaction Publisher: Mashup Editor Comparison / Roundup

Since posting Enabling the Interaction Publisher, I've done some more research on mashup builders. I found a lot of excellent resources (see the end of this post for a list), but felt that a summary of what's out there would be helpful; it is what I would have liked to see when I started researching. Although lists of tools exist, they didn't give me a sense of the overall space. Note that I believe mashup editors are only part of enabling the interaction publisher: the other parts are standardized access to data (particularly all data from one institution) and deeply embedding these types of tools in content management systems (for instance, using topic-driven templates already in your CMS to drive mashups on those topics).

Some of the variables that are probably most relevant to which tool you should use for your particular application or situation:

  • Is the expectation that you will quickly have a mashup, or that you are building out an infrastructure for your institution?  Does this need to be behind your firewall?  Do you need guaranteed uptime?  Basically, do you need this to be hosted (quick and easy, get-me-started-now) or will you build out and manage the infrastructure?
  • Do you want the end result of your efforts to be another data feed, a map, or other types of user interactions like data grids?  What should the output be?
  • What types of inputs do you need to pull: structured (if so, generic XML, or just RSS/atom specifically), unstructured (like web pages), or direct database connections?
  • How are the mashups built?  Do you need totally non-technical people to be able to create them, or do you just need to support your power users?  What is the environment for building the mashups (for instance, create-a-flowchart, as in Yahoo! Pipes and Microsoft Popfly)?
  • What are the browser requirements for people to use / consume your mashups (for instance, Microsoft Popfly requires Silverlight and a recent version of either Firefox or IE)?
  • Can a mashup be embedded into an existing web page? I didn't find enough useful information on this to fill this out meaningfully in the table below, so perhaps this is something to add in the future.

Although I was hoping to put together a fuller spreadsheet of all the tools out there, I selected a subset of tools that were either easy to start using or had very clear documentation (I used all of them to at least some extent, except AquaLogic Pages).  At any rate, here's a brief (and admittedly incomplete) table comparing different mashup builders along the criteria listed above (please comment with any corrections/additions):

| Mashup Builder | Hosted? | What outputs? | What inputs? | Building Environment | Browser Restrictions | Can be embedded? |
|---|---|---|---|---|---|---|
| Apatar | N/A (run from desktop) | A wide range including: RSS, Text, Salesforce, File, MySQL, Amazon S3 | A wide range including: RSS, Text, Salesforce, File, MySQL, Amazon S3 | Visually create a flowchart | N/A | No |
| AquaLogic Pages | No | Interactions: Data Table, Record List, Text, Map | RSS, web services, user-created data | WYSIWYG | ? | Partially? (only within BEA environment) |
| Dapper | Yes | Feeds: XML, RSS, HTML, JSON. Interactions: Google Gadgets, Google Maps (more) | Web pages or RSS | Pointing at the parts of the screen you want scraped and/or filling in forms | Depends on output | Yes |
| Google Mashup Editor | Yes | Hosted web page (within GME environment) | RSS, GoogleBase, or user inputs | Coding | ? | No |
| Microsoft Popfly | Yes | Interactions: Gobs, although much of the focus seems to be on playful things like whack-a-mole | Web pages, RSS | Visually create a flowchart | Silverlight + Firefox 2 or IE7 (!) | |
| QEDWiki | Either (although install option didn't work for me) | Wiki pages | XML, RSS | WYSIWYG + filling out forms | ? | |
| StrikeIron SOA Express for Excel | Desktop (extension to Excel) | Excel sheet | Web Services, especially StrikeIron services | Excel | N/A | No |
| Yahoo! Pipes | Yes | Feeds + Interactions (Maps / Lists) | Feeds / CSV / some other specific web sources / limited generic XML (reference) | Visually create a flowchart | Didn't see official reference, but seems to run on Firefox 2.0, IE6, and Opera 9.24 | |

I found the easiest to use were Apatar, StrikeIron SOA Express for Excel, and Yahoo! Pipes (if you just want to play with something to get a feel for mashup editors, I'd recommend starting with these), although each is entirely different.  Although QEDWiki has a great intro video and appears to be able to do a lot, I got the least far actually using it.  I'm sure that someone who knows Dapper inside and out could create a feed from scraped pages, but in practice I didn't manage to get it to do either of the two tests I tried (for example, when I tried to scrape country pages from the World Bank site, it wouldn't let me because the pages came from different domains, even though they were driven from the same CMS).  Popfly seemed interesting, but doesn't appear to be geared toward the enterprise and has very specific browser requirements.  That said, these were just initial impressions (and initial ease of use may be irrelevant for a particular application): the main purpose of this post was to put together the matrix above, to get a feel for the current overall state / scope of mashup editors.

Here are some excellent resources on mashup editors:

 

Screen Scraping

This may not seem very Web 2.0 (O'Reilly wrote that web services are 2.0 but screen scraping is 1.0), but I think there are a variety of reasons that screen scraping is still helpful, including:

  • You need to be closer to what the user sees
  • You don't have direct access to the database or to a web service that will provide the information you need (or you won't have access soon enough)

For example:

  • Testing whether your web pages are looking the way you expect. Sometimes testing this from the back end just isn't going to cut it, and you need to analyze the HTML to see if the page looks reasonable.
  • Writing a report that doesn't already exist on top of some reporting tool (for instance, on top of a defect-tracking system that you don't have access to the code for).
  • Creating archived versions of sites. Sometimes using HTTRACK, for example, isn't enough on its own (for example, when you need to pull in full-sized videos from the source system as opposed to the streamed version on the web). Also, you can use Perl to wrap around HTTRACK so that you have a standard way of passing options to HTTRACK.
  • Seeing which of a large set of your sites are indexed in Google.
  • Testing your RSS feeds to determine if they have the right number of content items, etc (I guess this would be more "RSS scraping" than screen scraping).
  • Importing from a static site to a CMS (less and less commonly needed nowadays).
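As a rough illustration of the RSS-testing case above, here is a minimal Python sketch that checks whether a feed has the expected number of items and that each item has a title. The sample feed, the function name `check_rss_items`, and the choice of Python's standard `xml.etree.ElementTree` module are all assumptions made for this example; a real check would fetch the feed over HTTP and would probably look at more fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, inlined so the sketch runs without a network call.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
    <item><title></title><link>http://example.com/3</link></item>
  </channel>
</rss>"""


def check_rss_items(xml_text, expected_count):
    """Return a list of problems found in an RSS 2.0 feed (empty if none)."""
    root = ET.fromstring(xml_text)
    items = root.findall("./channel/item")
    problems = []
    # Check the overall number of content items first.
    if len(items) != expected_count:
        problems.append(f"expected {expected_count} items, found {len(items)}")
    # Then check that every item carries a non-empty title.
    for position, item in enumerate(items, start=1):
        title = (item.findtext("title") or "").strip()
        if not title:
            problems.append(f"item {position} has no title")
    return problems


if __name__ == "__main__":
    for problem in check_rss_items(SAMPLE_FEED, expected_count=3):
        print(problem)
```

Returning a list of problems rather than raising on the first one makes it easier to run the same check over many feeds and report everything at once.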

Often, if there's a direct DB connection or an RSS feed or some other XML interface that you can use, then it probably makes sense to use that. Even in that case, the archiving and web page testing cases would probably benefit from screen scraping. 
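For the web page testing case, a simple sketch of "analyzing the HTML to see if the page looks reasonable" might look like the following. It uses Python's standard `html.parser` module; the class name `PageSummary`, the function `page_problems`, the sample pages, and the idea that a reasonable page has a title, an `h1`, and at least one link are all assumptions for illustration, not rules from any particular testing tool.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collect the <title> text and count how often each tag is opened."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.tag_counts = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_problems(html_text, required_tags=("h1", "a")):
    """Return a list of reasons the page does not look reasonable."""
    parser = PageSummary()
    parser.feed(html_text)
    parser.close()
    problems = []
    if not parser.title.strip():
        problems.append("missing or empty <title>")
    for tag in required_tags:
        if parser.tag_counts.get(tag, 0) == 0:
            problems.append(f"no <{tag}> element found")
    return problems


# Hypothetical pages: one that looks fine, one that looks like an error page.
GOOD_PAGE = ("<html><head><title>Country page</title></head>"
             "<body><h1>Kenya</h1><a href='/data'>Data</a></body></html>")
BAD_PAGE = ("<html><head><title></title></head>"
            "<body><p>Error 500</p></body></html>")

if __name__ == "__main__":
    print(page_problems(GOOD_PAGE))
    print(page_problems(BAD_PAGE))
```

In practice the HTML would come from fetching each page you want to test, and the list of required tags would depend on your own templates.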
