Submitted by David Hobbs on 25 April 2008 - 3:22pm
But isn't it all data? Although web content certainly is an important type of information available on a web site, Data needs to be treated differently. Here I'm talking about Data with a capital D -- I thought the Wikipedia description was good: "Data refers to a collection of organized information, usually the results of experience, observation or experiment, or a set of premises. This may consist of numbers, words, or images, particularly as measurements or observations of a set of variables." Here are some of the ways that Data is different:
- People expect Data to be available in different Formats.
- Users want to manipulate the Data.
- You don't totally control your Data, since it is available in different Channels.
There are several implications of this including:
- Formats. You may wish to standardize the formats that your data is available in. Is all your data always available in CSV (if that's what you standardize on)? This includes both the formats themselves (Excel, Stata, etc.) and the method by which the data is requested. For instance, is there one place where users can directly get all your data? "Directly" doesn't mean some thin layer with links to databases each doing its own thing. An example of a consistent format would be a web service with a published set of parameters by which the data could be requested. Ideally, all the institution's data would be available from this one web service.
- Manipulation. Sometimes people just want to see your data, but usually they will want to manipulate it. Providing your data in consistent formats makes it easier for your users to work with it. Other users will expect that *you* provide the tools for manipulating your own data.
- Channels. Ideally you will work to directly feed data to primary channels. For instance, if you feed data directly to services like Swivel then you both get more use out of your data and also can ensure your data is available in its highest quality (not watered down by other people copying and pasting your data, for example).
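To make the "one web service with a published set of parameters" idea concrete, here is a minimal sketch of the request-handling core of such a service. Everything in it is an assumption for illustration: the dataset name, the sample rows, the parameter names (`dataset`, `fmt`, `code`), and the choice of CSV and JSON as the standardized formats.

```python
import csv
import io
import json

# Hypothetical sample rows; a real service would read from the institution's databases.
DATASETS = {
    "example": [
        {"code": "AA", "year": 2007, "value": 1.5},
        {"code": "BB", "year": 2007, "value": 2.5},
    ],
}

# The formats this sketch standardizes on (an assumption, not a recommendation).
SUPPORTED_FORMATS = ("csv", "json")


def get_data(dataset, fmt="csv", code=None):
    """Return a dataset as text in the requested format.

    The arguments stand in for a published, documented set of request parameters.
    """
    if dataset not in DATASETS:
        raise KeyError(f"unknown dataset: {dataset}")
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")

    # Optional filter, so users can request only the rows they need.
    rows = [r for r in DATASETS[dataset] if code is None or r["code"] == code]

    if fmt == "json":
        return json.dumps(rows)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["code", "year", "value"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

The point of the design is that every dataset goes through the same entry point with the same parameters, so a user who has learned to request one dataset can request all of them.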
Submitted by David Hobbs on 7 February 2008 - 10:24pm
The natural temptation on planning a product's roadmap is to plan far into the future. That temptation arises for reasons such as wanting to please clients by telling them you'll give them what they say they want, wanting to relay the internal technical teams' plans to deliver something far in the future, feeling that you already know the list of features that users want, and not wanting to feel like you're planning all the time. In my experience, that is less successful than trying to only plan and publicize a short time horizon out (and not even promising anything outside that time horizon). Here are some of the reasons:
- You're always delivering something. In other words, the user regularly sees progress.
- Related to the above bullet, it forces you to think of creative ways of solving your users' problems. If you know you have to deliver something in the next month, then you have to carefully explore the requirements, prioritize, and come up with some unexpected solutions.
- Quick responses to what you release:
- When something needs to be tweaked that you just released, you can quickly move to the next iteration.
- If your idea turns out to be a real stinker, you can drop it before you spend a lot of time on it.
- If you have an unexpected hit of an idea, you can quickly continue to refine it.
- Inevitable slips don't end up cascading into tons of other deliverables (if all you promise is delivering 10 features in the next three months, then if three slip it only affects those three -- if you had promised 10 each quarter for the next year, then if three items slip in the first quarter, probably 33 items slip for the year: those three plus the 30 scheduled behind them).
- It somewhat combats almost everyone's natural propensity to procrastinate. If there are always items that need to be publicly delivered, then it's harder to procrastinate than if you have some huge deliverable a year out.
- There is a regular period when people expect a progress report. Some advantages: a) the potential "embarrassment" factor makes you pay close attention on a regular basis, and, moreover, b) your clients know they will be engaged in the discussion of your progress.
- The further out you plan, the less accurate you are about the schedule.
- You can respond more quickly to changes in your industry or competition. If you've already publicly planned a year out, then if, for example, Web 4.0 hits the scene quickly, you've either got to upset a lot of people who are already expecting a bunch of features that year (by slipping those to make room for Web 4.0 now) or not even consider adding the Web 4.0 features until after the year has passed!
A while ago I read parts of the following, which helped convince me to push toward short delivery schedules: Agile and Iterative Development: A Manager's Guide by Craig Larman and Getting Real by 37signals.
Submitted by David Hobbs on 10 January 2008 - 10:10pm
Sometimes you need to pull content from multiple systems into a single page, and you want to pull from both systems based on some metadata, perhaps by topic. For instance, let's say you have a site where you want to pull data from a document repository and a news archive, and you want the user to use a pulldown to select the topic they want to filter the content by (for example, "Politics", "Entertainment", "Travel", "Europe", and other topics). Sometimes out of the box the two systems will share the same list of topics, but more often than not they will not.
One deceptively simple approach when systems do not share the same list of topics is to have some sort of mapping between the taxonomies of the two systems (for instance, "Travel" = "Vacations", "Politics" = "Domestic Politics", "Europe" = "EU", etc). I fairly frequently hear something like this when discussing integration between different systems: "We have a mapping between these topics, so there shouldn't be any problem." But just because you have a mapping doesn't mean that it will be satisfactory for combining information from multiple systems. I thought it would be helpful to think through the issues some and write out some examples.
One taxonomy's controlled vocabulary being more specific than another
Let's say you've got some content in two systems that you want to pull into one page. Perhaps you want to find out all the fathers in both systems. If the taxonomies available were the following (and you didn't have other metadata on gender, for example), then you could not do this:
| "Relationship" values on site one | "Relationship" values on site two |
| --- | --- |
| Father | Parent |
| Mother | |
| Sister | Sibling |
| Brother | |
A simple and meaningful mapping between the two would be something like this (allowing you to find all the people across systems that are a parent, for example):
Father or Mother -> Parent
Sister or Brother -> Sibling
Note that the other direction makes no sense (it's tough to be both a sister and a brother, and you wouldn't know which to pick when translating between systems). So, although you may have a mapping between the systems, it does NOT necessarily enable the types of queries you want to do.
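The one-way nature of this mapping is easy to see in code. Below is a minimal sketch using the "Relationship" values from the example; the function names and the `(name, relationship)` record shape are assumptions made up for illustration.

```python
# Mapping from site one's specific values to site two's general values.
SPECIFIC_TO_GENERAL = {
    "Father": "Parent",
    "Mother": "Parent",
    "Sister": "Sibling",
    "Brother": "Sibling",
}


def to_general(value):
    """Translate a site-one value into site two's vocabulary (always unambiguous)."""
    return SPECIFIC_TO_GENERAL[value]


def to_specific(value):
    """Return every site-one value that a site-two value could stand for."""
    return sorted(s for s, g in SPECIFIC_TO_GENERAL.items() if g == value)


def find_across_sites(general_value, site_one, site_two):
    """Find people tagged with general_value on either site.

    site_one and site_two are hypothetical lists of (name, relationship) pairs.
    """
    from_one = [name for name, rel in site_one if to_general(rel) == general_value]
    from_two = [name for name, rel in site_two if rel == general_value]
    return from_one + from_two
```

`to_general("Father")` gives a single answer, while `to_specific("Parent")` gives two candidates. That ambiguity is why a query for all fathers cannot be answered across both sites, even though a query for all parents can.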
A slightly more realistic example
Of course that was a simplified example to illustrate the point; usually the taxonomies overlap in messier ways, something like the following (still a forced example, though):
| "Location" in system one | "Location" in system two |
| --- | --- |
| SF Bay Area | |
| Palo Alto | Silicon Valley |
| (other cities in Silicon Valley) | |
| Richmond | East Bay |
| (other cities in East Bay) | |
| Sausalito | Marin County |
| (other cities in Marin County) | |
| San Francisco | San Francisco |
| South San Francisco | |
Let's say these two systems had a selection of companies tagged to these controlled vocabularies. What kinds of queries would probably be meaningful?
- All companies in Silicon Valley
- All companies in East Bay
- All companies in Marin County
- All companies in the San Francisco Bay area
Obviously, you couldn't query on cities, since system two has virtually no cities. But what about San Francisco? Isn't that on both lists? Although at first blush it may seem that you could find all companies in San Francisco across both systems, looking at the lists more carefully it becomes apparent that the two values almost certainly have different meanings: the first taxonomy only has the broad San Francisco Bay Area and then cities, while the second taxonomy just lists areas within the San Francisco Bay Area. So San Francisco in system two probably includes San Francisco proper as well as, for example, South San Francisco. So you can do this query (but *not* a query on all companies in San Francisco):
- All companies in San Francisco and the immediate area (including South San Francisco)
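Here is a minimal sketch of how such an area-level query could work across the two systems. The city-to-area table follows the example above, including the guess that system two's "San Francisco" covers South San Francisco; the function name and the dictionary shapes for the two systems are assumptions for illustration.

```python
# City (system one) -> area (system two). The two San Francisco rows encode the
# guess that system two's "San Francisco" means the city plus its immediate area.
CITY_TO_AREA = {
    "Palo Alto": "Silicon Valley",
    "Richmond": "East Bay",
    "Sausalito": "Marin County",
    "San Francisco": "San Francisco",
    "South San Francisco": "San Francisco",
}


def companies_in_area(area, system_one, system_two):
    """Return companies in an area, combining both systems.

    system_one maps company -> city and system_two maps company -> area
    (hypothetical shapes). Values with no known area are left out.
    """
    matches = {c for c, city in system_one.items() if CITY_TO_AREA.get(city) == area}
    matches |= {c for c, a in system_two.items() if a == area}
    return sorted(matches)
```

Note that a company tagged only "SF Bay Area" in system one has no entry in the table, so it is silently left out of every area query. That kind of gap is exactly what needs to be labeled clearly for the user.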
Part of the issue is that often you have much larger taxonomies that are more difficult to analyze (for example, for a taxonomy that includes all cities in California, or the US). It would be very difficult to go through and determine the meaning of the different values of the taxonomy.
What to do?
In practice, you probably won't be able to deeply analyze all the mappings between your systems, so you'll have a mapping but might only have a feel for how good it is (and in what direction). Perhaps the most dangerous mapping, and one that is hopefully fairly easy to identify, is from a more general taxonomy to a more specific one (the first example above); it should be avoided entirely. Of course, if the systems do not even have taxonomies that are close, this will be obvious and will require changes to at least one of the source systems. The second example above (overlapping but not quite lining up) is a type of taxonomy matching that might not be that bad: it may just require documentation/labeling (for example, use "San Francisco and South San Francisco" in a pulldown for selecting areas of the Bay Area) and careful design (obviously don't allow the user to select a city if you need data from both systems, or clearly show in the results that the information is just from system one). But figuring out the relationships between the taxonomies might take a lot of work.

Some potential general approaches: a) where possible, use globally well-understood and clear values (like zip codes, lat/long, ISO country codes, etc.) rather than fall into the trap of matching two taxonomies just because they're called the same thing (this is probably easier for something like a location than for a topic), b) force all systems to tag to a neutral reference source (with a lot of work in defining rules, this could be automated with something like Teragram, for example), or c) seek out a metadata expert, since they have best practices for mapping between flat or networked taxonomies. Also, when you are designing a system in the first place (even before being faced with a new integration), try if possible to tag your content with well-understood values (this is especially easy for geographic tagging).
Submitted by David Hobbs on 8 December 2007 - 4:06pm
You know when it's time to move into a new house or apartment, when you look at the stuff you need to move and think "Why in the world do I have this bread machine? I haven't used this in years and I forgot I even had it." Or you dread moving your old clunker of a TV, thinking of the new fancy flat-panel TVs? Well, it's the same thing with migrating to a new system, for instance into a new content management system. Only it's harder. When you're moving and you're pressed for time, you may just start tossing stuff into boxes to be moved, even when you know you don't totally want all the stuff (one reason: you'll need to negotiate with a spouse about getting rid of something, and there's no time for that). This isn't that big a deal, since it's just moving more of the same stuff. Or, if you have a huge sectional couch that won't fit in your new place, then perhaps you can just sell it to the next homeowner. When you're moving content, you have all sorts of extra things to think about including:
- It's not just content. Content on a site doesn't just live in some abstract ether; it is linked into a larger site context. This includes left navigation, headers, footers, and special site behaviors. Of course the site context of a simple site like hobbsontech.com would be relatively easy to move (re-creating the menus, configuring the overall style, etc.), but the more sites you have, the more there would be to do. This is especially relevant for sites with a lot of custom dynamic functionality. For instance, if you have comments on your current site's content, then you'd have to figure out how to embed them in the new framework (or just leave them behind). Chances are you have a lot of functionality distributed throughout your site that may even be hard to inventory.
- Metadata and taxonomies. You may have to re-create taxonomies in another system, and there may be incompatibilities you have to work through.
- Internal references to other pieces of content. Your content probably refers to itself (for instance, a press release may refer to your product description page). This somehow has to be reflected in a new system.
- Structured content. You may have structured content (for instance, a document that has multiple chapters), which you'll need to figure out how to handle in the new system.
- Outside references to your content. Other sites, as well as search engines, will have links to your content. You'll need to have some strategy to deal with the links from external sites.
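For the two kinds of references in the list above (internal links inside your content and links from outside sites), one common strategy is to keep a map from old URLs to new ones and use it for both. The sketch below assumes such a map exists; the paths, the function names, and the simple `href="..."` pattern are all made up for illustration, and real HTML would deserve a proper parser.

```python
import re

# Hypothetical old-path -> new-path map produced during the migration.
URL_MAP = {
    "/press/2007/launch.html": "/news/launch",
    "/products/widget.html": "/products/widget",
}

# Deliberately simple pattern: matches double-quoted href attributes only.
HREF = re.compile(r'href="([^"]+)"')


def rewrite_internal_links(html):
    """Point internal hrefs at their new locations; leave unknown links alone."""
    return HREF.sub(
        lambda m: 'href="%s"' % URL_MAP.get(m.group(1), m.group(1)),
        html,
    )


def redirect_for(old_path):
    """Return (status, location) for a request to an old URL from an outside site."""
    new_path = URL_MAP.get(old_path)
    return (301, new_path) if new_path else (404, None)
```

For example, `redirect_for("/press/2007/launch.html")` returns `(301, "/news/launch")`, so search engines and external sites that still link to the old address land on the migrated page.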
In the end, a lot of this has to do with the web of information that's involved in the content of a web site. And this isn't counting the types of technical issues that would come up with any technical migration (differences in size limits for fields, encoding differences, etc.). Of course there's the issue of why you even have all this stuff to move in the first place (and the more stuff you have the more hassle it is to move). This blog entry has focused on why it's difficult to move all this content, but of course one of the morals of the story is to have less stuff in the first place. In the case of the web this would involve better governance of what goes on the web, and clearly defining what the focus of your web site should be. Hopefully, just like when moving houses, any discussion of moving content would also include discussing what stuff you need in the first place. Unlike houses, having extra or duplicate stuff doesn't just inconvenience you but it is a disservice to your users. I'll leave the issue of the old TV and desiring the new flat panel to a future post (on survivorship bias).
Submitted by David Hobbs on 2 December 2007 - 10:52pm
Very large sites supporting a large number of units/stakeholders can easily turn into a hodge-podge of styles, user interface elements, and quality. One of the toughest discussions with clients, however, is why they can't do more customization (even if one of the core requirements of the system is to help enforce standardization). What are some of the reasons *not* to standardize?
- specific business needs of different groups (not to be confused with a group just wanting to differentiate itself somehow, for instance with a different look, that does not help the web visitor at all)
- professional development (for instance, a developer might be interested in doing a mashup)
- personal expression (liking particular colors for example)
- experimentation (don't know in advance what's going to "stick," so try a variety of things)
In my opinion, the first and last reasons are the most compelling (and the third is not a good reason at all for an enterprise-wide system), although one of the problems with experimentation is the frequent expectation that an experiment could quickly be rolled into the normal standardized platform (that's probably a post on its own!). Here are some reasons *to* standardize:
- consistent brand for the user ("am I still on the same site? Is this high quality content?")
- consistent UI for the user ("do I know how to use the site?")
- better support for new site admins or transition of support between sites
- single sign-on. It's confusing for a user to have various accounts with the same institution.
- standard statistics. Different statistics packages can have entirely different ways of counting something as basic as a page view. Standardizing on a statistics package can help ensure you're comparing apples to apples in your web analysis.
- better search. If everyone does their own thing, then there may be more fragmented information which would mean search results aren't as good.
- stability / support. As anyone who works with software/systems knows, the more functionality or special customization you put into a system, the more effort it takes to maintain it. Also, the system will probably be less stable. This one is also very tough to discuss with a client (and another probable future blog post) since they tend to only see their particular need.
Some possible methods of standardization:
- Governance. There needs to be a group with the power and influence to say "no" to requests that undermine the quality of the user experience of the site at large. This ideally is not the technology group since there would appear to be a conflict of interest.
- Clearly define exactly what is inside the standard and what is outside.
- Technology. The content management system used to manage the site can be set up such that users can only make changes that comply with the standard.
- The right level of customization. Standardization shouldn't be an excuse to totally control every aspect of everyone's sites or to not allow any innovation.
- Hooks into core shared functionality. You may decide that a single sign-on for users of your site is desirable. If so, then perhaps the system could be set up with an API such that tools developed and commissioned by other groups could work with the core functionality.
- Standardized access to data. Ideally, you could define a standard method of each system exposing its core data, that even people outside the institution could utilize for mashups, etc. By providing the data in a simple XML API, this could facilitate both internal and external usage of data.
- Another potential approach is to have separate branding for the official, blessed content and for the organization-centric content. For instance, you may have multiple units in your institution all looking at the topic of taxes. Ideally you would have one official web site that makes sense of your institution's view of taxes overall, and preferably this would pull information from all the units. The various units still may want their own site, but this is less useful for the end user -- so perhaps these units could have their own sites branded differently (and perhaps all requiring a standard link back to the official site) to clearly indicate it is the view of a particular unit within your institution.
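As an illustration of the "standardized access to data" method in the list above, here is a minimal sketch of a shared helper that each system could use to expose its core data as simple XML. The element names (`records`, `record`) and the dictionary-based record shape are assumptions; this is one possible convention, not a prescribed one.

```python
import xml.etree.ElementTree as ET


def records_to_xml(records, root_tag="records", item_tag="record"):
    """Serialize a list of flat dictionaries into a simple XML document.

    Field names become element names, so they must be valid XML tag names.
    """
    root = ET.Element(root_tag)
    for record in records:
        item = ET.SubElement(root, item_tag)
        for field, value in record.items():
            ET.SubElement(item, field).text = str(value)
    return ET.tostring(root, encoding="unicode")
```

If every system publishes through the same helper, an internal page or an external mashup can read data from any unit with the same parsing code.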
Of course, all of these are easier said than done when trying to get a large number of units into the same system, but perhaps some of these could be initiated even after a large suite of sites have been implemented in a central content management system.