Submitted by David Hobbs on 25 April 2008 - 3:22pm
But isn't it all data? Although web content certainly is an important type of information available on a web site, Data needs to be treated differently. Here I'm talking about Data with a capital D -- I thought the Wikipedia description was good: "Data refers to a collection of organized information, usually the results of experience, observation or experiment, or a set of premises. This may consist of numbers, words, or images, particularly as measurements or observations of a set of variables." Here are some of the ways that Data is different:
- People expect Data to be available in different Formats.
- Users want to manipulate the Data.
- You don't totally control your Data, since it is available in different Channels.
This has several implications, including:
- Formats. You may wish to standardize the formats in which your data is available. Is all your data always available in CSV (if that's what you standardize on)? This covers both the formats themselves (Excel, Stata, etc.) and the method by which the data is requested. For instance, is there one place where users can directly get all your data? "Directly" doesn't mean some thin layer with links to databases, each doing its own thing. An example of a consistent method would be a web service with a published set of parameters by which the data can be requested. Ideally, all of the institution's data would be available from this one web service.
- Manipulation. Sometimes people just want to see your data, but usually they will want to manipulate it. Providing your data in consistent formats makes it easier for your users to work with it. Other users will expect that *you* provide the tools for manipulating your own data.
- Channels. Ideally you will work to directly feed data to primary channels. For instance, if you feed data directly to services like Swivel then you both get more use out of your data and also can ensure your data is available in its highest quality (not watered down by other people copying and pasting your data, for example).
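
As a rough illustration of the "one web service with a published set of parameters" idea, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the dataset name, the `fmt` and `region` parameters, and the sample rows are all invented, and a real service would sit behind HTTP and read from the institution's actual data store.

```python
import csv
import io
import json

# Made-up sample rows, for illustration only.
DATASETS = {
    "widgets": [
        {"region": "north", "year": 2007, "value": 10},
        {"region": "south", "year": 2007, "value": 20},
    ],
}

# The published list of formats every dataset is guaranteed to support.
SUPPORTED_FORMATS = ("csv", "json")


def export_dataset(name, fmt="csv", region=None):
    """Return the named dataset as text in the requested format.

    `name`, `fmt`, and `region` stand in for the published request parameters.
    """
    if name not in DATASETS:
        raise KeyError(f"unknown dataset: {name}")
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")

    rows = DATASETS[name]
    if region is not None:
        rows = [row for row in rows if row["region"] == region]

    if fmt == "json":
        return json.dumps(rows)

    # CSV: take the header from the full dataset so that an empty
    # filter result still produces a header line.
    fieldnames = list(DATASETS[name][0].keys())
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The point of the sketch is that every dataset goes through the same entry point and the same format list, so users never have to learn a different request method per database.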
Submitted by David Hobbs on 7 February 2008 - 10:24pm
The natural temptation when planning a product's roadmap is to plan far into the future. That temptation arises for several reasons: wanting to please clients by telling them you'll give them what they say they want, wanting to relay the internal technical teams' plans to deliver something far in the future, feeling that you already know the list of features that users want, and not wanting to feel like you're planning all the time. In my experience, that is less successful than planning and publicizing only a short time horizon (and not promising anything beyond that horizon). Here are some of the reasons:
- You're always delivering something. In other words, the user regularly sees progress.
- Related to the above bullet, it forces you to think of creative ways of solving your users' problems. If you know you have to deliver something in the next month, then you have to carefully explore the requirements, prioritize, and come up with some unexpected solutions.
- Quick responses to what you release:
  - When something needs to be tweaked that you just released, you can quickly move to the next iteration.
  - If your idea turns out to be a real stinker, you can drop it before you spend a lot of time on it.
  - If you have an unexpected hit of an idea, you can quickly continue to refine it.
- Inevitable slips don't end up cascading into tons of other deliverables. If all you promise is 10 features in the next three months and three slip, only those three are affected. If you had promised 10 each quarter for the next year and three items slip in the first quarter, those three push back the 30 items planned for the later quarters, so probably 33 items slip for the year.
- It combats almost everyone's natural propensity to procrastinate. If there are always items that need to be publicly delivered, then it's harder to procrastinate than if you have some huge deliverable a year out.
- There is a regular period at which people expect a progress report. Some advantages: a) the potential "embarrassment" factor makes you pay close attention on a regular basis, and, moreover, b) your clients know they will be engaged in the discussion of your progress.
- The further out you plan, the less accurate you are about the schedule.
- You can respond more quickly to changes in your industry or competition. If you've already publicly planned a year out and, for example, Web 4.0 hits the scene quickly, you've either got to upset a lot of people who are already expecting a bunch of features that year (by slipping those to make room for Web 4.0 now) or not even consider adding the Web 4.0 features until after the year has passed!
Two books that I partially read a while ago helped convince me to push toward short delivery schedules: Agile and Iterative Development: A Manager's Guide by Craig Larman and Getting Real by 37signals.
Submitted by David Hobbs on 10 January 2008 - 10:10pm
Sometimes you need to pull content from multiple systems into a single page, and you want to pull from both systems based on some metadata, perhaps by topic. For instance, let's say you have a site where you want to pull data from a document repository and a news archive, and you want the user to use a pulldown to select the topic to filter the content by (for example, "Politics", "Entertainment", "Travel", "Europe", and other topics). Sometimes the two systems will share the same list of topics out of the box, but more often than not they will not.
One deceptively simple approach when systems do not share the same list of topics is to have some sort of mapping between the taxonomies of the two systems (for instance, "Travel" = "Vacations", "Politics" = "Domestic Politics", "Europe" = "EU", etc). I fairly frequently hear something like this when discussing integration between different systems: "We have a mapping between these topics, so there shouldn't be any problem." But just because you have a mapping doesn't mean that it will be satisfactory for combining information from multiple systems. I thought it would be helpful to think through the issues some and write out some examples.
One taxonomy's controlled vocabulary being more specific than another
Let's say you've got some content in two systems that you want to pull into one page. Perhaps you want to find out all the fathers in both systems. If the taxonomies available were the following (and you didn't have other metadata on gender, for example), then you could not do this:
| "Relationship" values on site one | "Relationship" values on site two |
| --- | --- |
| Father | Parent |
| Mother | |
| Sister | Sibling |
| Brother | |
A simple and meaningful mapping between the two would be something like this (allowing you to find all the people across systems that are a parent, for example):
Father or Mother -> Parent
Sister or Brother -> Sibling
Note that the other direction makes no sense (it's tough to be both a sister and brother, and you wouldn't know which to pick when translating between systems). So, although you may have a mapping between the systems, it does NOT necessarily enable the types of queries you want to do.
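
To make the one-directional nature of this mapping concrete, here is a small Python sketch. The function names and the (name, relationship) record layout are assumptions for illustration, not part of either system.

```python
# Mapping from site one's specific values to site two's broader values.
SITE_ONE_TO_SITE_TWO = {
    "Father": "Parent",
    "Mother": "Parent",
    "Sister": "Sibling",
    "Brother": "Sibling",
}


def find_across_systems(site_one_people, site_two_people, broad_value):
    """Return names from both systems that match a broad (site two) value.

    Each input is a list of (name, relationship) pairs, a hypothetical layout.
    """
    from_one = [
        name for name, relationship in site_one_people
        if SITE_ONE_TO_SITE_TWO.get(relationship) == broad_value
    ]
    from_two = [
        name for name, relationship in site_two_people
        if relationship == broad_value
    ]
    return from_one + from_two


def reverse_candidates(broad_value):
    """List every site one value that a broad value could stand for."""
    return sorted(
        specific for specific, broad in SITE_ONE_TO_SITE_TWO.items()
        if broad == broad_value
    )
```

Querying for "Parent" works across both systems, but `reverse_candidates("Parent")` returns two values, which is exactly why a query for all fathers cannot be answered from site two's data.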
A slightly more realistic example
Of course that was a simplified example to illustrate the point. Usually the taxonomies overlap, something like the following (still a forced example, though):
| "Location" in system one | "Location" in system two |
| --- | --- |
| SF Bay Area | |
| Palo Alto | Silicon Valley |
| (other cities in Silicon Valley) | |
| Richmond | East Bay |
| (other cities in East Bay) | |
| Sausalito | Marin County |
| (other cities in Marin County) | |
| San Francisco | San Francisco |
| South San Francisco | |
Let's say these two systems had a selection of companies tagged to these controlled vocabularies. What kinds of queries would probably be meaningful?
- All companies in Silicon Valley
- All companies in East Bay
- All companies in Marin County
- All companies in the San Francisco Bay area
Obviously, you couldn't query on cities since system two has virtually no cities. But what about San Francisco? Isn't that on both lists? Although at first blush it may seem that you could find all companies in San Francisco across both systems, looking at the list more carefully it becomes apparent that they almost certainly have different meanings: the first taxonomy only has the broad San Francisco Bay Area and then cities, and the second taxonomy is just listing areas within the San Francisco Bay Area. So San Francisco in system two probably includes San Francisco proper as well as, for example, South San Francisco. So you can do this query (but *not* query on all companies in San Francisco):
- All companies in San Francisco and the immediate area (including South San Francisco)
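
The area-level queries above could be sketched in Python as follows. This is only an illustration: the (company, location) record layout and function names are assumptions, and rolling South San Francisco up into system two's "San Francisco" follows the "probably" reading above rather than anything confirmed about the real taxonomy.

```python
# Roll-up from system one's cities to system two's areas.
# ASSUMPTION: system two's "San Francisco" covers the city and its
# immediate area, including South San Francisco.
CITY_TO_AREA = {
    "Palo Alto": "Silicon Valley",
    "Richmond": "East Bay",
    "Sausalito": "Marin County",
    "San Francisco": "San Francisco",
    "South San Francisco": "San Francisco",
}

# Label shown to users, so the broader meaning is not hidden.
AREA_LABELS = {
    "San Francisco": "San Francisco and the immediate area "
                     "(including South San Francisco)",
}


def companies_in_area(area, system_one, system_two):
    """Return sorted company names tagged to `area` in either system.

    Both inputs are lists of (company, location) pairs, a hypothetical layout.
    """
    matched = {
        company for company, city in system_one
        if CITY_TO_AREA.get(city) == area
    }
    matched |= {company for company, tagged in system_two if tagged == area}
    return sorted(matched)


def unplaceable(system_one):
    """Return system one companies whose tag cannot be rolled up to an area."""
    return sorted(
        company for company, city in system_one
        if city not in CITY_TO_AREA
    )
```

A company tagged only to the broad "SF Bay Area" in system one shows up in `unplaceable`, since it cannot be assigned to any single area without more information.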
Part of the issue is that often you have much larger taxonomies that are more difficult to analyze (for example, for a taxonomy that includes all cities in California, or the US). It would be very difficult to go through and determine the meaning of the different values of the taxonomy.
What to do?
In practice, you probably won't be able to deeply analyze all the mappings between your systems, so you'll have a mapping but might only have a feel for how good it is (and in what direction). Perhaps the most dangerous mapping, and one that is hopefully fairly easy to identify, is from a more general taxonomy to a more specific one (the first example above); it should be avoided entirely. Of course, if the systems do not even have taxonomies that are close, this will be obvious and will require changes to at least one of the source systems.
The second example above (overlapping but not quite lining up) might be a type of taxonomy matching that's not that bad but just requires documentation/labeling (just use "San Francisco and South San Francisco" in a pulldown to select areas of the Bay Area) and careful design (obviously don't allow the user to select a city if you need data from both systems, or clearly show in the results that the information is just from system one). But figuring out the relationships between the taxonomies might take a lot of work.
Some potential general approaches: a) where possible, try to use globally well-understood and clear values (like zip codes, lat/long, ISO country codes, etc.) rather than fall into the trap of matching two taxonomies just because they're called the same thing (this is probably easier for something like a location than for a topic), b) force all systems to tag to a neutral reference source (this could, with a lot of work in defining rules, be automated with something like Teragram, for example), or c) seek out a metadata expert, since they have best practices for mapping between flat or networked taxonomies. Also, when you are designing a system in the first place (even before being faced with a new integration), try if possible to tag your content with well-understood values (especially easy for geographic tagging).
Submitted by David Hobbs on 19 December 2007 - 9:25pm
Especially as a content management system grows to have a large amount of content, it would be nice if you could do structured link checking. One of the problems with link checking in general is what to do with the reports once you get them. Of course, for a very small site you can easily scan the entire site with tools like LinkScan ($) and Xenu Linksleuth (free, but ads are put in the reports), or even monitor 404 requests and use single-page tools like the LinkChecker Firefox extension. But with large sites you can end up with reports where it's hard to know where to even start fixing links. This is especially true for CMS-driven sites: the same bad link may appear in only one piece of content that is displayed throughout the site. Or you could wind up linking from lots of content items to a url (possibly outside your control) that changes.
I envision getting a report with a list of the bad links, where a user (with appropriate global rights) could indicate the correct new link, which would then be reflected in all content items (or left menus or other components surrounding the content) that used that link. This list could be prioritized by the cumulative page views of the pages that contain the bad link, or by the number of pages that contain it. Another approach might be to provide a prioritized list of content items that have bad links (preferably directly linkable to the edit mode of each content item). At any rate, note that we're not talking about pages here but content items or links: the user can quickly take action that will correct links on multiple pages. A long list of pages (specific urls) with bad links is confusing and, more importantly, isn't as quickly actionable. Here is how normal link checking reports look and how more useful reports might look:
| Before / existing reports (where do you start with a report like this, where content items may drive multiple pages?) | Report indicating bad links where the user can immediately correct them (and apply the correction everywhere) | Report indicating which content items have the bad links (content items linkable to edit them directly) |
| --- | --- | --- |
| http://badlinkone.com is referenced on http://example-site.com/page1, http://example-site.com/page35, and http://example-site.com/page102 | | |
| http://badlinktwo.com is referenced on http://example-site.com/page1, http://example-site.com/page1023, http://example-site.com/page2439, http://example-site.com/page5192 | | |
| Etc. | Etc. | Etc. |
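
One way the prioritized, per-link report could be assembled from an ordinary page-by-page report is sketched below in Python. The input shapes (a list of (bad url, page url) findings and a page-views lookup) and the function name are assumptions for illustration; no particular link checker produces exactly this.

```python
from collections import defaultdict


def prioritize_bad_links(findings, page_views):
    """Group findings by bad link and sort the worst offenders first.

    findings:   list of (bad_url, page_url) pairs from a link checker
    page_views: dict of page_url -> view count (missing pages count as 0)
    """
    pages_by_link = defaultdict(set)
    for bad_url, page_url in findings:
        pages_by_link[bad_url].add(page_url)

    report = []
    for bad_url, pages in pages_by_link.items():
        report.append({
            "bad_url": bad_url,
            "page_count": len(pages),
            "page_views": sum(page_views.get(page, 0) for page in pages),
        })

    # Highest cumulative page views first, then most pages, then by url
    # so the ordering is stable.
    report.sort(key=lambda row: (-row["page_views"], -row["page_count"], row["bad_url"]))
    return report
```

Each row of the result corresponds to one fix the user can make, rather than one page the user has to visit.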
One possible way to implement this is to change all the urls into logical links in your CMS. Assuming your CMS stores straight HTML rather than a more structured format, any url the user enters could be changed to a macro. (If users could put a hard link directly into the HTML without the system changing it, most would probably just skip the logical linking, even if there was an option for creating a logical link.) For example, if the user put in this HTML:
<a href="http://hobbsontech.com">Hobbs On Tech</a>
then the system would replace it with
!link(123,"Hobbs On Tech")
and put in its link repository that link 123 was http://hobbsontech.com. When the page is generated, the correct link would be substituted back into the HTML (so of course the end user's browser should never see the "123" in the HTML). If the page linked to was in your CMS, then the macro could be different and just indicate the unique key for the content item being pointed to (this would depend on whether the context that the content appeared in was relevant). For example:
!cms_item(123,"Hobbs On Tech")
Related items that a link repository might help with:
- Reporting on content use. A link repository would allow other interesting reporting, such as the most linked-to content items in your repository.
- Easily move content. In some cases, it may be easier to move content if you had a link repository. For instance, you may sometimes need to restructure your site resulting in the links changing. With a link repository, you could automatically change all the links so that the move did not result in broken links (of course this would work best for intranet sites where there were limited links outside your control to your content).
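
The link repository idea described above can be sketched in a few lines of Python. This is a toy under stated assumptions: the `LinkRepository` class and its method names are invented, ids are simply counted up from 1, the regular expression only handles plain `<a href="...">` anchors with no other attributes, and link text containing a double quote would break the macro.

```python
import re
from itertools import count

# Matches simple anchors only: <a href="URL">TEXT</a>
ANCHOR_RE = re.compile(r'<a\s+href="([^"]+)"\s*>(.*?)</a>', re.IGNORECASE | re.DOTALL)
# Matches the stored macro form: !link(ID,"TEXT")
MACRO_RE = re.compile(r'!link\((\d+),"([^"]*)"\)')


class LinkRepository:
    """Hypothetical store mapping link ids to urls."""

    def __init__(self):
        self._next_id = count(1)
        self.url_by_id = {}
        self.id_by_url = {}

    def register(self, url):
        """Return the id for `url`, creating one on first sight."""
        if url not in self.id_by_url:
            link_id = next(self._next_id)
            self.id_by_url[url] = link_id
            self.url_by_id[link_id] = url
        return self.id_by_url[url]

    def store(self, html):
        """Replace hard-coded anchors with !link macros before saving."""
        def to_macro(match):
            link_id = self.register(match.group(1))
            return f'!link({link_id},"{match.group(2)}")'
        return ANCHOR_RE.sub(to_macro, html)

    def render(self, stored):
        """Expand !link macros back into anchors when generating a page."""
        def to_anchor(match):
            url = self.url_by_id[int(match.group(1))]
            return f'<a href="{url}">{match.group(2)}</a>'
        return MACRO_RE.sub(to_anchor, stored)

    def fix(self, old_url, new_url):
        """Point an existing link id at a corrected url."""
        link_id = self.id_by_url.pop(old_url)
        self.url_by_id[link_id] = new_url
        self.id_by_url[new_url] = link_id
```

Because stored content only holds the id, a single call to `fix` changes the url everywhere that content is rendered, which is the behavior the report-and-correct workflow depends on.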
Of course, this would add complexity (and possible failure points) to a CMS. Do you think it would be worth it?
Submitted by David Hobbs on 2 December 2007 - 9:11am
This post doesn't attempt to cover more obscure aspects of search engine optimization (SEO), but covers the basics that are really easy to overlook when you work on your site. Also, since Google is the major search player, I just refer to "Google" rather than trying to be more generic.
Step 0: Has Google indexed your site at all?
Go to google.com and do a search on site:your-site-name-here, like "site:http://bhphotovideo.com" to see if bhphotovideo.com is indexed by Google. If there are no results, you're not indexed. Some ideas to get indexed: a) put in links from sites / pages you already have (for example, your profile on linkedin.com), b) get other sites to link to you (for example, you can comment on other people's blogs, linking to your site), c) for blogs, use pingomatic to automatically notify other services when your site is updated, and d) submit your site to Google for indexing (not sure that actually does anything, though).
Step 1: What are you trying to accomplish?
This one sounds so obvious and silly, but it's very easy to overlook. It's useful to just write down the search phrases you'd like people to use to find your site. Of course, the more specific the better, since it is very difficult to rank highly on generic terms. For example, I knew I wanted people to find this site if they typed my name and a little about me (for example, "David Hobbs CMS").
Step 2: Make sure your keywords are in the title and header tags, as well as in the text users will see (and preferably in the domain and url)
You may not have control over the domain and url (if you are using certain content management systems), but you should at least make sure that the title, header, and main text contain your terms.
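
A quick way to check Step 2 for a given page is sketched below using Python's standard `html.parser` module. The function name and the three result keys are assumptions for illustration, and the check is a plain case-insensitive substring match that says nothing about how Google actually weighs these elements.

```python
from html.parser import HTMLParser

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
HIDDEN_TAGS = {"script", "style"}


class _TextBuckets(HTMLParser):
    """Sort a page's text into title, header, and body buckets."""

    def __init__(self):
        super().__init__()
        self._open_tags = []
        self.title, self.headers, self.body = [], [], []

    def handle_starttag(self, tag, attrs):
        self._open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, discarding unclosed
        # tags (such as <br> or <meta>) along the way.
        if tag in self._open_tags:
            while self._open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        open_tags = set(self._open_tags)
        if "title" in open_tags:
            self.title.append(data)
        elif open_tags & HEADER_TAGS:
            self.headers.append(data)
        elif "body" in open_tags and not open_tags & HIDDEN_TAGS:
            self.body.append(data)


def keyword_placement(html, phrase):
    """Report whether `phrase` appears in the title, headers, and body text."""
    parser = _TextBuckets()
    parser.feed(html)
    parser.close()
    needle = phrase.lower()

    def contains(parts):
        # Collapse whitespace so line breaks in the markup don't hide a match.
        return needle in " ".join(" ".join(parts).split()).lower()

    return {
        "title": contains(parser.title),
        "headers": contains(parser.headers),
        "body": contains(parser.body),
    }
```

Running this for each search phrase from Step 1 gives a simple checklist of where each phrase is still missing.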
Step 3: Track your progress.
Type your search terms into Google and see how high in the rankings you appear. If you have already gotten good results (first page of results?), it may be time to set your goals higher. For instance, for this site I'm now interested in shooting for more topic-based search phrases such as "multilingual CMS" (currently the 14th page of results). Also, you will want to look for dips in the performance of your search phrases. This is especially relevant to test before and after any changes you make to your site/system. If you're working with a client on their site, by having the metrics (and search goals) before you start you'll be able to more objectively discuss the search performance of their site. Another angle is to look at the terms that people are using to actually find your site. You may find interest in your site from unexpected angles that you may wish to further enhance (for instance, people are finding my site with phrases such as "annotate excel graph", so I may put a more generic introduction to that blog entry).
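
If you record where you rank each time you check, spotting dips can be automated with something as small as the following Python sketch. The function name, the `min_drop` threshold, and the (label, rank) history layout are assumptions for illustration; the ranks themselves would come from your own manual checks.

```python
def find_dips(history, min_drop=5):
    """Return (label, previous_rank, rank) for each check where rank worsened.

    history:  chronological list of (label, rank) pairs; rank 1 is the top result
    min_drop: how many positions worse counts as a dip
    """
    dips = []
    for (_, previous), (label, current) in zip(history, history[1:]):
        # A larger rank number means a lower position in the results.
        if current - previous >= min_drop:
            dips.append((label, previous, current))
    return dips
```

Keeping one history per search phrase makes it easy to line up a dip with a change you made to the site around the same time.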
Repeat.
The first step, to get into the Google index at all, involved getting links to your site. As you proceed, of course you also want to have higher and higher quality sites link to you. As mentioned in the previous step, your search goals will also probably change, and you'll want to add/reword/reconfigure portions of your site (per Step 2 above) to optimize for those new goals.