Why Data is Different

But isn't it all data?  Although web content certainly is an important type of information available on a web site, Data needs to be treated differently.  Here I'm talking about Data with a capital D -- I thought the Wikipedia description was good: "Data refers to a collection of organized information, usually the results of experience, observation or experiment, or a set of premises. This may consist of numbers, words, or images, particularly as measurements or observations of a set of variables."  Here are some of the ways that Data is different:

  1. People expect Data to be available in different Formats.
  2. Users want to manipulate the Data.
  3. You don't totally control your Data, since it is available in different Channels.

There are several implications of this including:

  • Formats.  You may wish to standardize the formats that your data is available in.  Is all your data always available in CSV (if that's what you standardize on)?  This includes both the formats themselves (Excel, Stata, etc.) and the method by which the data is requested.  For instance, is there one place where users can directly get all your data?  "Directly" doesn't mean some thin layer with links to databases each doing its own thing.  An example of a consistent format would be a web service with a published set of parameters by which the data could be requested.  Ideally, all the institution's data would be available from this one web service.
  • Manipulation.  Sometimes people just want to see your data, but usually they will want to manipulate it.  Providing your data in consistent formats makes it easier for your users to work with it.  Other users will expect that *you* provide the tools for manipulating your own data.
  • Channels.  Ideally you will work to directly feed data to primary channels.  For instance, if you feed data directly to services like Swivel then you both get more use out of your data and also can ensure your data is available in its highest quality (not watered down by other people copying and pasting your data, for example). 
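
To make the "one web service with a published set of parameters" idea concrete, here is a minimal sketch in Python.  Everything in it is an assumption for illustration: the `get_data` function, the `population` dataset, and its values are hypothetical, and a real service would sit in front of the institution's actual databases rather than an in-memory dictionary.

```python
import csv
import io
import json

# Hypothetical in-memory store standing in for the institution's databases.
DATASETS = {
    "population": [
        {"country": "A", "year": 2006, "value": 10},
        {"country": "B", "year": 2006, "value": 20},
    ],
}

# The published, standardized list of formats (an assumption of this sketch).
SUPPORTED_FORMATS = ("csv", "json")


def get_data(dataset, fmt="csv", **filters):
    """Single entry point: the same parameters work for every dataset and format."""
    if dataset not in DATASETS:
        raise KeyError(f"unknown dataset: {dataset}")
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")

    # Keep only the rows that match every filter parameter.
    rows = [
        row for row in DATASETS[dataset]
        if all(row.get(key) == value for key, value in filters.items())
    ]

    if fmt == "json":
        return json.dumps(rows)

    # Otherwise serialize as CSV, header first.
    buffer = io.StringIO()
    fieldnames = list(DATASETS[dataset][0].keys())
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The point of the design is that a user asking for `get_data("population", fmt="json", country="A")` and one asking for `get_data("population", country="B")` go through the same door with the same parameter names; only the format differs.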

Beware False Precision in Tagging

The prior post Taxonomy Mappings: Be Careful When Integrating gave some examples and described the problem of taxonomy mappings.  Related to that is false precision in your tags.  In thinking about this more, it occurs to me that there are probably two useful rules of thumb to keep in mind whenever tagging/pulling content (whether the content is automatically tagged, or mapped from another taxonomy, or mapped by hand):

  1. You can't tag in a coarse-grained taxonomy and pull based on a fine-grained taxonomy (for example, if you have a system that only tags to "Washington, DC Metro Area," then you won't be able to pull by "Washington, DC" since any content tagged in the system may only be relevant to "Alexandria, VA").
  2. You can't tag in a fine-grained taxonomy when you are only using coarse information to determine the tagging (for example, if all you know about a group of content is that it's all about animals, you can't tag each piece of content to frogs, cats, dogs, etc.).

In both of these cases, when you pull by the fine-grained taxonomy there is a false sense of precision (and you can get grossly wrong results).  Another way of stating the rules of thumb above:

  1. You have to originally tag (or possibly go through the effort of retroactively tagging, perhaps through automated concept extraction) all content to at least as fine-grained a taxonomy as you're going to pull from,
  2. without artificially tagging more precisely than you are accurate.

Of course, by far the most preferable treatment is that all content, across the various systems you want to pull from (onto the same web page, for example), is tagged to the same fine-grained taxonomy (or at least as fine-grained as you ever expect to need to pull from).  Otherwise you'll have to resort to taxonomy mappings, or retroactively tag content.
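
The first rule of thumb can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the `PARENT` mapping, the `pull` function, and the story ids are hypothetical, and the three-term taxonomy is just the Washington example from above.

```python
# Hypothetical taxonomy: each term maps to its parent (None for the top level).
PARENT = {
    "Washington, DC Metro Area": None,
    "Washington, DC": "Washington, DC Metro Area",
    "Alexandria, VA": "Washington, DC Metro Area",
}


def is_same_or_descendant(term, ancestor):
    """True if `term` equals `ancestor` or sits below it in the taxonomy."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT[term]
    return False


def pull(content_tags, query):
    """Return ids of content tagged at least as finely as the query term.

    Content tagged only to a coarser term is left out: including it
    would claim a precision the tag never had.
    """
    return sorted(
        content_id
        for content_id, tag in content_tags.items()
        if is_same_or_descendant(tag, query)
    )


content = {
    "story-1": "Washington, DC Metro Area",  # coarse tag only
    "story-2": "Alexandria, VA",
    "story-3": "Washington, DC",
}

print(pull(content, "Washington, DC"))             # ['story-3']
print(pull(content, "Washington, DC Metro Area"))  # ['story-1', 'story-2', 'story-3']
```

Pulling by the coarse term safely returns everything, but pulling by "Washington, DC" cannot return story-1, because nothing in its tag says whether it is about DC or Alexandria.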

Why Short Delivery Cycles for Products

The natural temptation when planning a product's roadmap is to plan far into the future.  That temptation arises for reasons such as wanting to please clients by telling them you'll give them what they say they want, wanting to relay the internal technical teams' plans to deliver something far in the future, feeling that you already know the list of features that users want, and not wanting to feel like you're planning all the time.  In my experience, that is less successful than planning and publicizing only a short time horizon out (and not promising anything beyond that horizon).  Here are some of the reasons:

  • You're always delivering something.  In other words, the user regularly sees progress.
  • Related to the above bullet, it forces you to think of creative ways of solving your users' problems.  If you know you have to deliver something in the next month, then you have to carefully explore the requirements, prioritize, and come up with some unexpected solutions.
  • Quick responses to what you release:
    • When something needs to be tweaked that you just released, you can quickly move to the next iteration.
    • If your idea turns out to be a real stinker, you can drop it before you spend a lot of time on it.
    • If you have an unexpected hit of an idea, you can quickly continue to refine it.
  • Inevitable slips don't end up cascading into tons of other deliverables (if all you promise is delivering 10 features in the next three months, then if three slip it only affects those three -- if you had promised 10 each quarter for the next year, then if three items slip in the first quarter then probably 37 items slip for the year).
  • It somewhat combats almost everyone's natural propensity to procrastinate.  If there are always items that need to be publicly delivered, then it's harder to procrastinate than if you have some huge deliverable a year out.
  • There is a regular period in which people expect a progress report.  Some advantages: a) the potential "embarrassment" factor makes you pay close attention on a regular basis, and, moreover, b) your clients know they will be engaged in the discussion of your progress.
  • The further out you plan, the less accurate you are about the schedule.
  • You can respond more quickly to changes in your industry or competition.  If you've already publicly planned a year out, then if, for example, Web 4.0 hits the scene quickly, you've either got to upset a lot of people who are already expecting a bunch of features that year (by slipping those to make room for Web 4.0 now) or not even consider adding the Web 4.0 features until after the year has passed!

A while ago I read parts of the following, which helped convince me to push toward short delivery schedules: Agile and Iterative Development: A Manager's Guide by Craig Larman and Getting Real by 37signals.

Notes from a job search

After kicking the job search into high gear over the past three months or so, I'm happy to be joining Welchman Consulting as a Senior Consultant later this month.  As their web site states: "Welchman Consulting helps organizations develop effective Web Operations Management strategies that optimize the quality and impact of their Web sites" (see this PowerPoint presentation on Web Operations Management).

This was an interesting job search, so I thought I'd mention things that worked and didn't work, along with some of the most interesting places I interviewed with.

Some of the places I interviewed (and how I found the job in parentheses):

  • Google Switzerland (through a blind resume submission at google.com), where they've started an engineering center (which manages its own portfolio of global products).  I really enjoyed interviewing with Google.  The process was an intense game.  After thinking I bombed two phone interviews (with excellent questions like "How would you improve Gmail?  OK, you're about to pitch your idea to Larry and Sergey now: Go!"), I ended up being flown out to Zurich for an interview there.  The best part of that interview process was the technical portion, which was mostly algorithm analysis (involving the Fibonacci numbers) followed by a sequence of "How would you compute that?  How would you compute it faster?  Faster!  Faster!?"  The worst part was a session with a product manager about a foosball tournament problem -- I just was not reaching a breakthrough on how to solve the problem, and the interviewer just stared at me sweating for 20 minutes (side note: they're a little too nuts about foosball -- during a break some guy got a bit too upset with me about not playing foosball well enough on his team).  Anyway, the process was a lot of fun and quite a mental game.
  • iapps/Bridgeline (from Potomac Tech Wire, a local DC mailing list).  Iapps has been developing web sites in the DC area for years now, and was recently bought by Bridgeline.   I was most impressed by their performance/incentive metrics which seemed reasonable and well-aligned to drive a strong, profitable company.  Also, their hosted CMS is quite strong. 
  • LTU Technologies (through a colleague).  This is a visual search and filtering company, with offices in D.C. and Paris.  With more and more user-generated content, being able to identify offensive images becomes increasingly important.  Also, forensics and identifying stolen property are other applications of their technology. 
  • Welchman Consulting (met through work).  When I met folks at Welchman Consulting, I was immediately impressed by their knowledge and ability to articulate their ideas.  I kept an eye on their site and contacted them when I saw a Senior Consultant job announcement.  Their focus on Web Operations Management seems timely since so many large organizations now have moved into Content Management Systems but are facing quality/management issues.

Tools used and my subjective impression of their effectiveness in my job search (1 being totally unhelpful and 10 being a sure thing):

  • Blind inquiries to interesting people I found at various institutions: 2
  • Joining a couple of relevant associations / special interest groups (not totally fair to include these since I didn't dive into actively participating in them!): 2
  • Sending in resumes blind through job boards like dice.com, theladders.com, simplyhired, etc.: 4
  • Using linkedin.com to reach out to colleagues of colleagues at places I was interested in joining: 5
  • Sending blind resumes directly to organizations' job submission system: 6
  • Talking directly with various friends/colleagues/vendors/clients (good old networking): 8
  • Specialized mailing lists (such as Potomac Tech Wire and a non-profit CIO list): 8
  • Publishing this blog: 9

Talking about my job search with trusted friends, colleagues, vendors, and clients has found me most of my jobs.  Perhaps that's the reason that specialized mailing lists are also helpful: the employer already knows something about the person when they apply (or it would be easy to find out something about them in the small pool).  I found starting and maintaining this blog helpful in the job search for a couple of reasons: 1) it forced me to think about what it is that I know and am good at and 2) it was a very quick and easy way for potential employers to learn a little bit about my skills (although it is a surprising amount of work to maintain the blog).

Interaction Publisher: Mashup Editor Comparison / Roundup

Since posting Enabling the Interaction Publisher, I've done some more research on mashup builders.   I found a lot of excellent resources (see end of this post for a list), but felt that a summary of what's out there would be helpful (this is what I would have liked to see when I researched this).  Although lists of tools exist, I wasn't getting a sense of the overall space.  Note that I believe that mashup editors are only part of enabling the interaction publisher: the other parts being standardized access to data (particularly all data from one institution) and deeply embedding these types of tools in content management systems (for instance, using topic-driven templates already in your CMS to drive mashups that are driven by those topics). 

Some of the particular variables that are probably most relevant in what tool you should use for your particular application / situation:

  • Is the expectation that you will quickly have a mashup, or that you are building out an infrastructure for your institution?  Does this need to be behind your firewall?  Do you need guaranteed uptime?  Basically, do you need this to be hosted (quick and easy, get-me-started-now) or will you build out and manage the infrastructure?
  • Do you want the end result of your efforts to be another data feed, a map, or other types of user interactions like data grids?  What should the output be?
  • What types of inputs do you need to pull: structured (if so, generic XML, or just RSS/atom specifically), unstructured (like web pages), or direct database connections?
  • How are the mashups built?  Do you need totally non-technical people to be able to create them, or do you just need to support your power users?  What is the environment for building the feeds (for example, a create-a-flowchart interface like Yahoo Pipes and Microsoft Popfly)?
  • What are the browser requirements for people to use / consume your mashups (for instance, Microsoft Popfly requires both Silverlight and recent versions of just Firefox and IE)?
  • Can a mashup be embedded into an existing web page? I didn't find enough useful information on this to fill this out meaningfully in the table below, so perhaps this is something to add in the future.

Although I was hoping to put together a fuller spreadsheet of all the tools out there, I selected a subset that was either easy to start using or had very clear documentation (I used all of them, except AquaLogic Pages, to at least some extent).  At any rate, here's a brief (and admittedly incomplete) table comparing different mashup builders along the criteria listed above (please comment with any corrections/additions):

Mashup Builder | Hosted? | What outputs? | What inputs? | Building Environment | Browser Restrictions | Can be embedded?
Apatar | N/A (run from desktop) | A wide range including: RSS, Text, Salesforce, File, MySQL, Amazon S3 | A wide range including: RSS, Text, Salesforce, File, MySQL, Amazon S3 | Visually create a flowchart | N/A | No
AquaLogic Pages | No | Interactions: Data Table, Record List, Text, Map | RSS, web services, user-created data | WYSIWYG | ? | Partially? (only within BEA environment)
Dapper | Yes | Feeds: XML, RSS, HTML, JSON.  Interactions: Google Gadgets, Google Maps (more) | web pages or RSS | Pointing at the parts of the screen you want scraped and/or filling in forms | Depends on output | Yes
Google Mashup Editor | Yes | Hosted web page (within GME environment) | RSS, GoogleBase, or user inputs | coding | ? | No
Microsoft Popfly | Yes | Interactions: Gobs, although much of the focus seems to be on playful things like whack-a-mole | web pages, RSS | Visually create a flowchart | Silverlight + Firefox 2 or IE7 (!) |
QEDWiki | Either (although install option didn't work for me) | Wiki pages | XML, RSS | WYSIWYG + filling out forms | ? |
StrikeIron SOA Express for Excel | Desktop (extension to Excel) | Excel sheet | Web Services, especially StrikeIron services | Excel | N/A | No
Yahoo! Pipes | Yes | Feeds + Interactions (Maps / Lists) | Feeds / CSV / some other specific web sources / limited generic XML (reference) | Visually create a flowchart | Didn't see official reference, but seems to run on Firefox 2.0, IE6, and Opera 9.24 |
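
The criteria listed earlier can also be applied to the table mechanically.  Here is a minimal sketch in Python that covers just two of the columns (hosting and building environment).  The short labels such as "flowchart" and "desktop" are simplifications made for this sketch, not the vendors' terms, and the `shortlist` function is hypothetical.

```python
# Two columns transcribed from the comparison table, using simplified labels
# (the labels are an assumption of this sketch).
TOOLS = [
    {"name": "Apatar", "hosted": "desktop", "building": "flowchart"},
    {"name": "AquaLogic Pages", "hosted": "no", "building": "wysiwyg"},
    {"name": "Dapper", "hosted": "yes", "building": "point-and-fill-forms"},
    {"name": "Google Mashup Editor", "hosted": "yes", "building": "coding"},
    {"name": "Microsoft Popfly", "hosted": "yes", "building": "flowchart"},
    {"name": "QEDWiki", "hosted": "either", "building": "wysiwyg-and-forms"},
    {"name": "StrikeIron SOA Express for Excel", "hosted": "desktop", "building": "excel"},
    {"name": "Yahoo! Pipes", "hosted": "yes", "building": "flowchart"},
]


def shortlist(hosted=None, building=None):
    """Return names of tools matching the criteria; None means "don't care"."""
    result = []
    for tool in TOOLS:
        # "either" means hosted or installed, so it satisfies both "yes" and "no".
        hosted_ok = hosted is None or tool["hosted"] in (hosted, "either")
        building_ok = building is None or tool["building"] == building
        if hosted_ok and building_ok:
            result.append(tool["name"])
    return result


print(shortlist(hosted="yes", building="flowchart"))
# ['Microsoft Popfly', 'Yahoo! Pipes']
```

A fuller version would add the input, output, browser, and embedding columns, but the cells marked "?" or left blank in the table would need to be filled in first.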

I found the easiest to use were Apatar, StrikeIron SOA Express for Excel, and Yahoo! Pipes (if you just want to play with something to get a feel for mashup editors, I'd recommend starting with these), although each is entirely different.  Although QEDWiki has a great intro video and appears to be able to do a lot, I got the least far actually using it.  Although I'm sure that someone who knows Dapper inside and out could create a feed from scraping pages, in practice I didn't manage to get it to do the two tests I tried (for example, when I tried to scrape country pages from the World Bank site, it wouldn't let me, since the different pages were coming from different domains even though they were driven from the same CMS).  Popfly seemed interesting, but doesn't appear to be geared toward the enterprise and has very specific browser requirements.  That said, these were just initial impressions (and initial ease of use may be irrelevant for a particular application) -- the main purpose of this post was to put together the matrix above to get a feel for the current overall state / scope of mashup editors.

Here are some excellent resources on mashup editors:

 
