Submitted by David Hobbs on 11 February 2011 - 7:56am
Many sites are largely organized around topics (as opposed to, say, the org chart or products). But there are plenty of nuances to having successful topics pages, from dealing with political issues (when should a new topic page be created) to functionality options (how should topics pages be managed) to metadata concerns. If creating automated listings on topics pages, consider this:
You can only have automatically-pulled topics listings if there is a one-to-one or many-to-one relationship from your source metadata tags to your target groupings.
This probably sounds a bit too academic, so let's look at this in concrete terms. Let's consider the case where you have a bunch of articles tagged to either broccoli, apple, or orange. If you wanted to have a vegetable topic page and a fruit topic page, then this would work fine. This is because you have a one-to-one mapping from broccoli to vegetable (broccoli is always a vegetable), and a many-to-one mapping of apple and orange to fruit (both apple and orange always map to fruit and nothing else). That said, you cannot have red and green topic pages based on the existing tagging. Broccoli is always green, so if the only tags you had were to broccoli then you would be fine. But the apple tagging is the problem: an apple can be either green or red. Obviously, you could introduce a new color tag, but if you had a large number of existing pieces of content then you would have a large amount of retagging to do.
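The one-to-one / many-to-one rule can be checked mechanically before building any automated listing. Here is a minimal sketch, assuming a hypothetical dictionary that lists, for each source tag, every target grouping it could belong to (the function name and data are illustrative only):

```python
def ambiguous_tags(tag_to_groupings):
    """Return the source tags that do not map to exactly one target grouping.

    An automated pull is only safe when this comes back empty, i.e. every
    tag maps to one grouping (a one-to-one or many-to-one relationship).
    """
    return sorted(tag for tag, groupings in tag_to_groupings.items()
                  if len(groupings) != 1)


# Hypothetical mappings based on the broccoli / apple / orange example.
food_type = {
    "broccoli": {"vegetable"},
    "apple": {"fruit"},
    "orange": {"fruit"},
}
color = {
    "broccoli": {"green"},
    "apple": {"green", "red"},  # an apple can be either, so the tag is ambiguous
    "orange": {"orange"},
}

print(ambiguous_tags(food_type))  # [] -> fruit and vegetable pages can be automated
print(ambiguous_tags(color))      # ['apple'] -> color pages would need retagging
```

Running a check like this against a large repository gives a list of the tags that would need retagging before a proposed set of topic pages could be pulled automatically.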

This may be obvious when looking at three tags and a handful of possible topics pages. But when looking at larger repositories and the possibility of creating topic pages with some automated pulls, consider the mappings to ensure you end up with relevant topics pages.
--------------------------
Need help setting up or fixing the processes, functionality, metadata, or other aspects of creating topics pages on your site? Contact David Hobbs Consulting.
Submitted by David Hobbs on 26 February 2010 - 10:33am
The road to metadata quality hell is paved with good intentions. Spellbound by the possibilities of a complex taxonomy (or just treating it as an interesting mental exercise in its own right), the team gets to work on a detailed taxonomy that could cover all the future needs of sophisticated searching / browsing. You have a problem when the complexity of your taxonomy outpaces the functionality the site visitor sees. So if you have a five-level-deep taxonomy but your site visitors only ever see the first level, then chances are that only the top level is accurately tagged. Since the content contributors won't see any difference / gain, why would they spend the extra time required?
Match automated functionality to taxonomy complexity
The metadata sweet spot is where the taxonomy complexity matches the automated functionality on your site. So for example the hobbsontech.com site is all tagged to the five stages of CMS migration, and this tagging is used throughout the site (on the home page, as tags with every blog post, etc). So the complexity matches the functionality, and I am compelled to tag correctly. In addition to the functionality, the match must also account for the pain level of bad tagging. So for example I originally structured this site differently, so old posts do not have very useful metadata. Since the pain level is relatively low, I haven't yet cleaned that old metadata (note also that this is a problem with free metadata tags without a large number of contributors). So if you bury some metadata-driven functionality that none of the content contributors care about, then the quality will also be low.
So what to do about this? Carefully consider any additional metadata complexity before adding it to your system. If at all possible, consider a way of increasing the visibility of the metadata through automated functionality on your site. The most direct way would be prominent functionality that the site visitor would see/use (and that both the site visitor and content owners / site editors care about). Relying only on administrative functionality (for example, ranking groups based on how much content is tagged to their detailed topic without showing this to site visitors) would be more dangerous since it might encourage people to game the system.
As always on this blog, this post is most relevant for larger sites. Is your site large? Answer four questions to get a quick gauge.
Submitted by David Hobbs on 8 September 2009 - 10:43am
Content re-use is important for many sites, but the implementation for large sites is more subtle than it may initially appear (some issues). In order to implement content re-use successfully, you must consider:
In this blog post, I'll explore the third item: roles in content re-use for large sites. These are some of the roles relevant to content re-use:
- Content Contributor
- Metator
- Page Owner
- Template / block Designer
- Developers, including DB, Platform, and Site Developers
Why Roles Matter
For the end user's sake, you want your site to be consistent. For the ego of site owners, they may specifically want inconsistency. By consistency here, I mean look and feel as well as consistency in the way content is aggregated. For example, if one section of your site has Current Events that are two years old, another that only lists content in the future, and the RSS feed lists events that aren't even published, then it will be confusing for the user (sounds "out there", but I've seen this occur!). By enforcing what different groups (from database developer to content contributor) can do, you can control these types of issues. Ideally, you enforce as many of your site standards as far back on the left of the graph as possible.
Content Contributor
You could have the most beautiful design and CSS in the world, but then the content contributor could set text to a font that's not in your standard (can your users set fonts, or just CSS styles?). This is an easy example, but not as crucial as other standards such as the length of titles (longer titles may gum up content appearing in smaller blocks) and tagging. The implication here is to set the relevant standards, train contributors on them, and enforce them as much as possible in the content entry interface.
Metator
Who will be responsible for the overall quality of the tagging, key to content re-use? You could have someone dedicated to tagging (a metator), train the content contributors to do it, or have a librarian defining rules for automatic tagging. At any rate, who gets to do what tagging is a key decision. There is no free lunch here, so whichever way you go will require resources. For example, if you do automatic tagging then you do need someone to effectively train the tool (and then decide whether you remove the right of the content contributor to tag).
Page Owner
Since we're talking about content re-use, the difference between a page and a piece of content is important. A good example would be this article: you might be reading this in a feed reader, on the HobbsOnTech home page, or on its own dedicated page (its permalink). And you could be seeing a link to this piece of content in a variety of places (for example, the See All Articles page, which is automated). So the content is re-used on a variety of pages.
The content contributor is considered above, but there may be page owners as well. For example, there could be an owner of the home page that determines what content shows in the top block of the page; in this case, the owner would be playing the role of an editor. Again, this is the editor of a page pulling content and not of the content in the first place. You need to define what the page owner can and cannot do. For example, can the page owner choose what appears in all blocks? If they can, is it pre-filtered for them (for example, if the topic of their page is Cycling, can they even choose something that's not tagged to Cycling)?
Developers
For a large site, you will have developers of different skill levels and experience with your environment. Instead of lumping all developers into one group, I would propose considering database / platform developers separately from site developers. This distinction probably only makes sense for large sites, but in that case it's important to at least consider.
Database Developer
For starters, there usually is not a separate role for the database developer. Perhaps only one person can change the schema, but anyone can write whatever query they want to get to the data. Of course, for small sites this is probably desirable, but for large sites (especially with high turnover amongst developers), this can result in fairly major issues. An example would be a "published" flag that needs to be checked by anyone querying the database; in this case, a new developer could easily create an RSS feed that exposes draft content. A more robust solution would be to have a layer/API, implemented by a database developer, that is the only way a site developer can get at the data. So, for example, this would mean that a new site developer can't even get at draft content.
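To make the layer/API idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The class name, table, and column names are all assumptions made up for illustration; the point is only that the "published" check lives in one place rather than in every query a site developer writes:

```python
import sqlite3


class PublishedContent:
    """Hypothetical read-only access layer owned by the database developer.

    Site developers get an instance of this class rather than a raw
    connection, so every query they can run already filters out drafts.
    """

    def __init__(self, connection):
        self._connection = connection

    def latest(self, limit=10):
        rows = self._connection.execute(
            "SELECT title FROM content WHERE published = 1 "
            "ORDER BY id DESC LIMIT ?",
            (limit,),
        )
        return [title for (title,) in rows]


# Illustrative in-memory data; the table and column names are assumptions.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE content (id INTEGER PRIMARY KEY, title TEXT, published INTEGER)"
)
connection.executemany(
    "INSERT INTO content (title, published) VALUES (?, ?)",
    [("Launch announcement", 1), ("Unfinished draft", 0), ("Event recap", 1)],
)

feed = PublishedContent(connection).latest()
print(feed)  # ['Event recap', 'Launch announcement']
```

In Python the leading underscore is only a convention, so this sketch shows the shape of the layer rather than real enforcement. On a large site the actual barrier would more likely be database permissions, views, or a separate service.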
Platform Developer
Much of the platform will already be built into the CMS that you use, but inevitably you will make changes to the platform. By platform, I mean the core driving code of the site. For instance, I would consider the basic site-wide page template to be part of the platform. Ideally, much of the functionality of content re-use will be built into the platform, so that it's not done over by the page template or block developers (leading to inefficiency and possibly inconsistency).
Site Developer
The site developer implements the various page templates and re-usable blocks of pages (this is assuming you have multiple sites running off the same platform). A lot of the rubber-hitting-the-road happens here, but hopefully the platform and database access have been defined in a way that makes it easier to develop consistently. As with all the developers, the site developer needs to make sure to embed as many of the site standards as possible right into the page templates/blocks. The site developer will also be developing the components the page owner can modify.
Layers and Enforcement
One way of looking at these roles is through the lens of the layers of people involved in creating a page on a web site:
- The DB and Platform developers set up the system for the site developers and content contributors
- The site developer defines the templates that page owners then use for their particular pages
- The content contributor publishes the content, possibly with a metator to ensure the tagging is correct
- All this together renders a successful page

A key point is that ideally you would have your site standards enforced as deep in the system as possible (as far to the left of the top track in the diagram as possible). Some examples:
- The DB developer has hopefully created a layer such that no one else can even get at draft content.
- The platform developer sets up a platform-wide basic template such that key elements cannot be overridden by a particular site developer.
- The site developer creates a page such that only appropriate parameters can be set by the page owner (a topic page owner, for example, cannot include pages that are not tagged to the topic).
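The last example in the list, restricting what a topic page owner can select, might be sketched as follows. The function name, the content structure, and the sample items are hypothetical; a real CMS would more likely pre-filter the choices in the editing interface so the owner never sees untagged content at all:

```python
def select_for_block(page_topic, chosen_ids, content_by_id):
    """Accept the page owner's picks only if each is tagged to the page topic."""
    rejected = [cid for cid in chosen_ids
                if page_topic not in content_by_id[cid]["tags"]]
    if rejected:
        raise ValueError(f"Not tagged to {page_topic!r}: {rejected}")
    return [content_by_id[cid]["title"] for cid in chosen_ids]


# Hypothetical content items for a Cycling topic page.
content_by_id = {
    1: {"title": "Choosing a road bike", "tags": {"Cycling"}},
    2: {"title": "Marathon training plan", "tags": {"Running"}},
    3: {"title": "Bike commuting tips", "tags": {"Cycling", "Commuting"}},
}

print(select_for_block("Cycling", [1, 3], content_by_id))
# ['Choosing a road bike', 'Bike commuting tips']

try:
    select_for_block("Cycling", [1, 2], content_by_id)
except ValueError as error:
    print(error)  # Not tagged to 'Cycling': [2]
```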
It may be that for some sites different roles make more sense, but the general point is that for content re-use to work, you need to carefully consider the roles of who can do what. Also, you want to implement standards as deep in the system as possible.
Submitted by David Hobbs on 8 March 2008 - 11:01am
The prior post Taxonomy Mappings: Be Careful When Integrating gave some examples and described the problem of taxonomy mappings. Related to that is false precision in your tags. In thinking about this more, it occurs to me that there are probably two useful rules of thumb to keep in mind whenever tagging/pulling content (whether the content is automatically tagged, or mapped from another taxonomy, or mapped by hand):
- You can't tag in a coarse-grained taxonomy and pull based on a fine-grained taxonomy (for example, if you have a system that only tags to "Washington, DC Metro Area," then you won't be able to pull by "Washington, DC" since any content tagged in the system may only be relevant to "Alexandria, VA").
- You can't tag in a fine-grained taxonomy when you only are using coarse information to determine the tagging (for example, if all you know about a group of content is that they're all animals, you can't tag each piece of content to frogs, cats, dogs, etc).
In both of these cases, when you pull by the fine-grained taxonomy there is a false sense of precision (and you can get results that are grossly wrong).
Another way of stating the rules of thumb above:
- You have to originally tag (or possibly go through the effort of retro-actively tagging, perhaps through automated concept extraction) all content to at least as fine-grained a taxonomy as you're going to pull from,
- without artificially tagging more precisely than you are accurate.
Of course, by far the most preferable treatment is that all content, across the various systems you want to pull from (onto the same web page, for example), is tagged to the same fine-grained taxonomy (or at least as fine-grained as you ever expect to need to pull from). Otherwise you'll have to resort to taxonomy mappings, or retroactively tag content.
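The first rule of thumb can be expressed as a simple check on a hierarchical taxonomy: a pull is only trustworthy when the term the content was tagged to is the pull term itself or something narrower. This is a minimal sketch, assuming a hypothetical taxonomy stored as child-to-parent links (the structure and function name are illustrative):

```python
# Hypothetical location taxonomy: each term points to its broader parent.
PARENT = {
    "Washington, DC": "Washington, DC Metro Area",
    "Alexandria, VA": "Washington, DC Metro Area",
    "Washington, DC Metro Area": None,
}


def is_safe_pull(tagged_term, pull_term, parent=PARENT):
    """True if content tagged to tagged_term certainly belongs under pull_term.

    That holds only when tagged_term is pull_term itself or one of its
    narrower descendants; a broader tag cannot support a narrower pull.
    """
    term = tagged_term
    while term is not None:
        if term == pull_term:
            return True
        term = parent.get(term)
    return False


print(is_safe_pull("Alexandria, VA", "Washington, DC Metro Area"))  # True
print(is_safe_pull("Washington, DC Metro Area", "Washington, DC"))  # False
```

The second call is the false-precision case: content tagged only to the metro area might be about Alexandria, so pulling it onto a "Washington, DC" listing would overstate what the tagging actually knows.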
Submitted by David Hobbs on 10 January 2008 - 10:10pm
Sometimes you need to pull content from multiple systems into a single page, and you want to pull from both systems based on some metadata, perhaps by topic. For instance, let's say you have a site where you want to pull data from a document repository and a news archive, and you want the user to use a pulldown to select the topic they want to filter the content by (for example, "Politics", "Entertainment", "Travel", "Europe", and other topics). Sometimes out of the box the two systems will share the same list of topics, but more often than not they will not.
One deceptively simple approach when systems do not share the same list of topics is to have some sort of mapping between the taxonomies of the two systems (for instance, "Travel" = "Vacations", "Politics" = "Domestic Politics", "Europe" = "EU", etc). I fairly frequently hear something like this when discussing integration between different systems: "We have a mapping between these topics, so there shouldn't be any problem." But just because you have a mapping doesn't mean that it will be satisfactory for combining information from multiple systems. I thought it would be helpful to think through the issues some and write out some examples.
One taxonomy's controlled vocabulary being more specific than another
Let's say you've got some content in two systems that you want to pull into one page. Perhaps you want to find out all the fathers in both systems. If the taxonomies available were the following (and you didn't have other metadata on gender, for example), then you could not do this:
| "Relationship" values on site one | "Relationship" values on site two |
| --- | --- |
| Father | Parent |
| Mother | |
| Sister | Sibling |
| Brother | |
A simple and meaningful mapping between the two would be something like this (allowing you to find all the people across systems that are a parent, for example):
Father or Mother -> Parent
Sister or Brother -> Sibling
Note that the other direction makes no sense (it's tough to be both a sister and brother, and you wouldn't know which to pick when translating between systems). So, although you may have a mapping between the systems, it does NOT necessarily enable the types of queries you want to do.
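The asymmetry is easy to see in code. This sketch uses the Father/Mother/Sister/Brother mapping from the table above, with made-up records for each site (the names, data layout, and function are hypothetical):

```python
# Many-to-one mapping from site one's specific values to site two's
# broader ones, as in the table above.
TO_SITE_TWO = {
    "Father": "Parent",
    "Mother": "Parent",
    "Sister": "Sibling",
    "Brother": "Sibling",
}

# Hypothetical records from each system.
site_one = [("Alice", "Mother"), ("Bob", "Father"), ("Carol", "Sister")]
site_two = [("Dan", "Parent"), ("Erin", "Sibling")]


def people_with(relationship):
    """Find people across both sites using site two's coarser vocabulary."""
    from_one = [name for name, value in site_one
                if TO_SITE_TWO[value] == relationship]
    from_two = [name for name, value in site_two if value == relationship]
    return from_one + from_two


print(people_with("Parent"))  # ['Alice', 'Bob', 'Dan']

# Inverting the mapping shows why "all fathers" cannot be answered:
# "Parent" has two candidate values on site one and nothing to pick between them.
candidates = sorted(k for k, v in TO_SITE_TWO.items() if v == "Parent")
print(candidates)  # ['Father', 'Mother']
```

Queries phrased in the coarser vocabulary work across both systems; anything phrased in the finer vocabulary can only be answered from site one.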
A slightly more realistic example
Of course that was a simplified example to illustrate the point, and you usually have overlapping taxonomies, something like the following (still a forced example though):
| "Location" in system one | "Location" in system two |
| --- | --- |
| SF Bay Area | |
| Palo Alto | Silicon Valley |
| (other cities in Silicon Valley) | |
| Richmond | East Bay |
| (other cities in East Bay) | |
| Sausalito | Marin County |
| (other cities in Marin County) | |
| San Francisco | San Francisco |
| South San Francisco | |
Let's say these two systems had a selection of companies tagged to these controlled vocabularies. What kinds of queries would probably be meaningful?
- All companies in Silicon Valley
- All companies in East Bay
- All companies in Marin County
- All companies in the San Francisco Bay area
Obviously, you couldn't query on cities since system two has virtually no cities. But what about San Francisco? Isn't that on both lists? Although at first blush it may seem that you could find all companies in San Francisco across both systems, looking at the list more carefully it becomes apparent that the two values almost certainly have different meanings: the first taxonomy only has the broad San Francisco Bay Area and then cities, and the second taxonomy is just listing areas within the San Francisco Bay Area. So San Francisco in system two probably includes San Francisco proper as well as, for example, South San Francisco. So you can do this query (but *not* query on all companies in San Francisco):
- All companies in San Francisco and the immediate area (including South San Francisco)
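One way to make such a query explicit is to translate both systems into a shared, clearly labeled set of areas. The sketch below assumes, as the reasoning above suggests but cannot confirm, that system two's "San Francisco" covers South San Francisco as well; the company names and both mapping tables are hypothetical:

```python
AREA_SF = "San Francisco and the immediate area"

# Hypothetical mappings from each system's values to shared, labeled areas.
SYSTEM_ONE_TO_AREA = {
    "Palo Alto": "Silicon Valley",
    "Richmond": "East Bay",
    "Sausalito": "Marin County",
    "San Francisco": AREA_SF,
    "South San Francisco": AREA_SF,
    # "SF Bay Area" is deliberately absent: it is too broad to place in one area.
}
SYSTEM_TWO_TO_AREA = {
    "Silicon Valley": "Silicon Valley",
    "East Bay": "East Bay",
    "Marin County": "Marin County",
    "San Francisco": AREA_SF,  # assumed to include the surrounding cities
}


def companies_in(area, system_one, system_two):
    """List companies from both systems that fall in the shared area."""
    matches = [name for name, location in system_one
               if SYSTEM_ONE_TO_AREA.get(location) == area]
    matches += [name for name, location in system_two
                if SYSTEM_TWO_TO_AREA.get(location) == area]
    return matches


system_one = [("Acme", "San Francisco"), ("Bolt", "South San Francisco"),
              ("Cedar", "Palo Alto"), ("Delta", "SF Bay Area")]
system_two = [("Ember", "San Francisco"), ("Flint", "East Bay")]

print(companies_in(AREA_SF, system_one, system_two))
# ['Acme', 'Bolt', 'Ember']
```

Note that "Delta", tagged only to the broad "SF Bay Area", never appears in any area-level result, which is the coarse-to-fine problem again.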
Part of the issue is that often you have much larger taxonomies that are more difficult to analyze (for example, a taxonomy that includes all cities in California, or the US). It would be very difficult to go through and determine the meaning of each value in the taxonomy.
What to do?
In practice, you probably won't be able to deeply analyze all the mappings between your systems, so you'll have a mapping but might only have a feel for how good it is (and in what direction). Perhaps the most dangerous mapping, and one that is hopefully fairly easy to identify, is from a more general taxonomy to a more specific one (the first example above); it should be avoided entirely. Of course, if the systems do not even have taxonomies that are close, then this will be obvious and will require changes to at least one of the source systems. The second example above (overlapping but not quite lining up) might be a type of taxonomy matching that's not that bad but just requires documentation/labeling (just use "San Francisco and South San Francisco" in a pulldown to select areas of the Bay Area) and careful design (obviously don't allow the user to select a city if you need data from both systems, or clearly show in the results that the information is just from System 1). But figuring out the relationships between the taxonomies might take a lot of work. Some potential general approaches: a) where possible, try to use globally well-understood and clear values (like zip codes, lat/long, ISO country codes, etc) rather than fall into the trap of using two taxonomies just because they're called the same thing (this is probably easier for something like a location than a topic), b) force all systems to tag to a neutral reference source (this could, with a lot of work in defining rules, be automated with something like Teragram for example), or c) seek out a metadata expert, since they have best practices for mapping between flat or networked taxonomies. Also, when you are designing a system in the first place (even before being faced with a new integration), try if possible to tag your content to well-understood values (this is especially easy for geographic tagging).