
Beware False Precision in Tagging

The prior post, "Taxonomy Mappings: Be Careful When Integrating," gave some examples and described the problem of taxonomy mappings.  A related problem is false precision in your tags.  Thinking about this more, it occurs to me that there are two useful rules of thumb to keep in mind whenever tagging or pulling content (whether the content is automatically tagged, mapped from another taxonomy, or mapped by hand):

  1. You can't tag in a coarse-grained taxonomy and pull based on a fine-grained taxonomy (for example, if you have a system that only tags to "Washington, DC Metro Area," then you won't be able to pull by "Washington, DC," since any content tagged in that system may only be relevant to "Alexandria, VA").
  2. You can't tag in a fine-grained taxonomy when you are only using coarse information to determine the tagging (for example, if all you know about a group of content is that the items are all about animals, you can't tag each piece of content to frogs, cats, dogs, etc.).

In both of these cases, when you pull by the fine-grained taxonomy there is a false sense of precision (and you can get grossly wrong results).  Another way of stating the rules of thumb above:

  1. You have to originally tag (or possibly go through the effort of retroactively tagging, perhaps through automated concept extraction) all content to at least as fine-grained a taxonomy as you're going to pull from,
  2. without artificially tagging more precisely than you are accurate.
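
To make the first rule concrete, here is a minimal Python sketch.  The `PARENT` hierarchy and the `can_pull` helper are hypothetical names made up for illustration (only the place names come from the example above), and the sketch assumes every term has a single broader parent.

```python
# Hypothetical containment hierarchy: each term points to its broader parent.
PARENT = {
    "Washington, DC": "Washington, DC Metro Area",
    "Alexandria, VA": "Washington, DC Metro Area",
    "Washington, DC Metro Area": None,
}


def self_and_broader(term):
    """Return the term followed by every broader term above it."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return chain


def can_pull(tagged_as, pull_by):
    """A pull is trustworthy only if the tag is the pull term or something finer."""
    return pull_by in self_and_broader(tagged_as)


# Fine tag, coarse pull: safe.
print(can_pull("Alexandria, VA", "Washington, DC Metro Area"))  # True
# Coarse tag, fine pull: the content might really be about Alexandria.
print(can_pull("Washington, DC Metro Area", "Washington, DC"))  # False
```

A check like this only covers the first rule; the second rule (not tagging more precisely than your information allows) depends on how the tags were produced, which a lookup table can't tell you.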

Of course, by far the most preferable treatment is that all content, across the various systems you want to pull from (onto the same web page, for example), is tagged to the same fine-grained taxonomy (or at least as fine-grained as you ever expect to need to pull from).  Otherwise you'll have to resort to taxonomy mappings, or retroactively tag content.

Taxonomy Mappings: Be Careful When Integrating

Sometimes you need to pull content from multiple systems into a single page, and you want to pull from both systems based on some metadata, perhaps by topic.  For instance, let's say you have a site where you want to pull data from a document repository and a news archive, and you want the user to use a pulldown to select the topic they want to filter the content by (for example, "Politics", "Entertainment", "Travel", "Europe", and other topics).  Sometimes the two systems will share the same list of topics out of the box, but more often than not they will not.

One deceptively simple approach when systems do not share the same list of topics is to have some sort of mapping between the taxonomies of the two systems (for instance, "Travel" = "Vacations", "Politics" = "Domestic Politics", "Europe" = "EU", etc).  I fairly frequently hear something like this when discussing integration between different systems: "We have a mapping between these topics, so there shouldn't be any problem."  But just because you have a mapping doesn't mean that it will be satisfactory for combining information from multiple systems.  I thought it would be helpful to think through the issues some and write out some examples.

One taxonomy's controlled vocabulary being more specific than another

Let's say you've got some content in two systems that you want to pull into one page.  Perhaps you want to find all the fathers in both systems.  If the taxonomies available were the following (and you didn't have other metadata on gender, for example), then you could not do this:

"Relationship" values on site one    "Relationship" values on site two
Father                               Parent
Mother
Sister                               Sibling
Brother

A simple and meaningful mapping between the two would be something like this (allowing you to find all the people across systems that are a parent, for example):

Father or Mother -> Parent
Sister or Brother -> Sibling

Note that the other direction makes no sense (it's tough to be both a sister and a brother, and you wouldn't know which to pick when translating between systems).  So, although you may have a mapping between the systems, it does NOT necessarily enable the types of queries you want to do.
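
The one-way nature of this mapping can be checked mechanically.  The following is a rough Python sketch, assuming the mapping is kept as a simple dictionary; the names `TO_SITE_TWO`, `invert`, and `ambiguous_targets` are invented for this illustration.

```python
from collections import defaultdict

# Mapping from site one's values to site two's, as in the example above.
TO_SITE_TWO = {
    "Father": "Parent",
    "Mother": "Parent",
    "Sister": "Sibling",
    "Brother": "Sibling",
}


def invert(mapping):
    """Group source values by the target value they map to."""
    inverse = defaultdict(list)
    for source, target in mapping.items():
        inverse[target].append(source)
    return dict(inverse)


def ambiguous_targets(mapping):
    """Target values that cannot be translated back to one source value."""
    return {t: s for t, s in invert(mapping).items() if len(s) > 1}


print(ambiguous_targets(TO_SITE_TWO))
# {'Parent': ['Father', 'Mother'], 'Sibling': ['Sister', 'Brother']}
```

Any value that shows up in `ambiguous_targets` can only be queried at the coarser level: you can ask both systems for parents, but not for fathers.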

A slightly more realistic example

Of course, that was a simplified example to illustrate the point, and you usually have overlapping taxonomies, something like the following (still a forced example though):

"Location" in system one             "Location" in system two
SF Bay Area
Palo Alto                            Silicon Valley
(other cities in Silicon Valley)
Richmond                             East Bay
(other cities in East Bay)
Sausalito                            Marin County
(other cities in Marin County)
San Francisco                        San Francisco
South San Francisco

Let's say these two systems had a selection of companies tagged to these controlled vocabularies.  What kinds of queries would probably be meaningful?

  • All companies in Silicon Valley
  • All companies in East Bay
  • All companies in Marin County
  • All companies in the San Francisco Bay area

Obviously, you couldn't query on cities since system two has virtually no cities.  But what about San Francisco?  Isn't that on both lists?  Although at first blush it may seem that you could find all companies in San Francisco across both systems, looking at the lists more carefully it becomes apparent that the two values almost certainly have different meanings: the first taxonomy only has the broad San Francisco Bay Area and then cities, and the second taxonomy just lists areas within the San Francisco Bay Area.  So San Francisco in system two probably includes San Francisco proper as well as, for example, South San Francisco.  So you can do this query (but *not* query on all companies in San Francisco):

  • All companies in San Francisco and the immediate area (including South San Francisco)
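
One way to make this kind of comparison explicit is to write down what each value is believed to cover and compare the sets.  The sketch below is in Python; the extents are assumptions based on the reading above (that system two's "San Francisco" includes South San Francisco), and the names `SYSTEM_ONE`, `SYSTEM_TWO`, `compare`, and `covering_values` are hypothetical.

```python
# Assumed extents: the set of cities each value is taken to cover.
SYSTEM_ONE = {
    "San Francisco": {"San Francisco"},
    "South San Francisco": {"South San Francisco"},
}
SYSTEM_TWO = {
    "San Francisco": {"San Francisco", "South San Francisco"},
}


def compare(label):
    """Compare what a label shared by both systems covers in each."""
    one, two = SYSTEM_ONE[label], SYSTEM_TWO[label]
    if one == two:
        return "same meaning"
    if one < two:
        return "system one is narrower"
    if one > two:
        return "system two is narrower"
    return "partial overlap or disjoint"


def covering_values(label_two):
    """System one values that fall entirely within a system two value."""
    extent = SYSTEM_TWO[label_two]
    return sorted(v for v, cities in SYSTEM_ONE.items() if cities <= extent)


print(compare("San Francisco"))
# system one is narrower
print(covering_values("San Francisco"))
# ['San Francisco', 'South San Francisco']
```

Under these assumptions, the only query that means the same thing in both systems is the combined one: system one's "San Francisco" plus "South San Francisco" against system two's "San Francisco".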

Part of the issue is that you often have much larger taxonomies that are more difficult to analyze (for example, a taxonomy that includes all cities in California, or the US).  It would be very difficult to go through such a taxonomy and determine the meaning of each of its values.

What to do?

In practice, you probably won't be able to deeply analyze all the mappings between your systems, so you'll have a mapping but might only have a feel for how good it is (and in what direction).  Perhaps the most dangerous mapping, and one that is hopefully fairly easy to identify, is from a more general taxonomy to a more specific one (the first example above); it should be avoided entirely.  Of course, if the systems do not even have taxonomies that are close, then this will be obvious and will require changes to at least one of the source systems.

The second example above (overlapping but not quite lining up) might be a type of taxonomy matching that's not that bad and just requires documentation/labeling (for example, use "San Francisco and South San Francisco" in a pulldown to select areas of the Bay Area) and careful design (obviously, don't allow the user to select a city if you need data from both systems, or clearly show in the results that the information is just from system one).  But figuring out the relationships between the taxonomies might take a lot of work.

Some potential general approaches: be careful, and a) where possible, try to use globally well-understood and clear values (like zip codes, lat/long, ISO country codes, etc.) rather than fall into the trap of trying to use two taxonomies just because they're called the same thing (this is probably easier for something like location than for topic), b) force all systems to tag to a neutral reference source (this could, with a lot of work in defining rules, be automated with something like Teragram), or c) seek out a metadata expert, since they have some best practices for mapping between flat or networked taxonomies.  Also, when you are designing a system in the first place (even before being faced with a new integration), try if possible to tag your content with well-understood values (this is especially easy for geographic tagging).
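
As a rough illustration of approach (a), here is a Python sketch in which both systems tag items with ISO 3166-1 alpha-2 country codes, and regions such as "EU" and "Europe" are defined once in terms of those codes instead of being mapped label to label.  The region sets are deliberately tiny, incomplete samples, and the item titles, the `REGIONS` and `ITEMS` names, and the `pull` function are all hypothetical.

```python
# Regions defined once, in neutral terms.  These are small illustrative
# samples, NOT complete membership lists.
REGIONS = {
    "EU": {"FR", "DE"},
    "Europe": {"FR", "DE", "CH", "NO"},
}

# Hypothetical items from two systems, both tagged with ISO country codes.
ITEMS = [
    {"title": "Swiss rail report", "system": "repository", "country": "CH"},
    {"title": "French election story", "system": "news", "country": "FR"},
    {"title": "Norwegian ferry story", "system": "news", "country": "NO"},
]


def pull(items, region):
    """Return titles of items whose country code falls inside the region."""
    codes = REGIONS[region]
    return [item["title"] for item in items if item["country"] in codes]


print(pull(ITEMS, "EU"))
# ['French election story']
print(pull(ITEMS, "Europe"))
# ['Swiss rail report', 'French election story', 'Norwegian ferry story']
```

Because the tags are country codes, the difference between "Europe" and "EU" is visible in the region definitions rather than hidden behind a label mapping like "Europe" = "EU".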
