Submitted by David Hobbs on 8 March 2008 - 11:01am
The prior post Taxonomy Mappings: Be Careful When Integrating gave some examples and described the problem of taxonomy mappings. Related to that is false precision in your tags. In thinking about this more, it occurs to me that there are probably two useful rules of thumb to keep in mind whenever tagging/pulling content (whether the content is automatically tagged, or mapped from another taxonomy, or mapped by hand):
- You can't tag in a coarse-grained taxonomy and pull based on a fine-grained taxonomy (for example, if you have a system that only tags to "Washington, DC Metro Area," then you won't be able to pull by "Washington, DC" since any content tagged in the system may only be relevant to "Alexandria, VA").
- You can't tag in a fine-grained taxonomy when you are only using coarse information to determine the tagging (for example, if all you know about a group of content is that they're all animals, you can't tag each piece of content to frogs, cats, dogs, etc).
In both of these cases, when you pull by the fine-grained taxonomy there is a false sense of precision (and you can get things grossly wrong).
Another way of stating the rules of thumb above:
- You have to originally tag (or possibly go through the effort of retroactively tagging, perhaps through automated concept extraction) all content to at least as fine-grained a taxonomy as you're going to pull from,
- without artificially tagging more precisely than you are accurate.
Of course, by far the most preferable treatment is that all content, across the various systems you want to pull from (onto the same web page, for example), is tagged to the same fine-grained taxonomy (or at least as fine-grained as you ever expect to need to pull from). Otherwise you'll have to resort to taxonomy mappings, or retroactively tag content.
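To make the first rule of thumb concrete, here is a minimal sketch of a check a system could run before pulling content. The taxonomy, the `PARENT` mapping, and the function names are all hypothetical and only illustrate the idea: content may be pulled by a term only if it was tagged to that term or to something narrower.

```python
# Hypothetical taxonomy: each term maps to its parent (None for the root).
PARENT = {
    "World": None,
    "Washington, DC Metro Area": "World",
    "Washington, DC": "Washington, DC Metro Area",
    "Alexandria, VA": "Washington, DC Metro Area",
}


def ancestors_and_self(term):
    """Return the term followed by each of its ancestors up to the root."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return chain


def can_pull(content_tag, pull_term):
    """True only if the content's tag is the pull term or something narrower.

    Content tagged more coarsely than the pull term is rejected, because
    it may really belong to a sibling of the pull term.
    """
    return pull_term in ancestors_and_self(content_tag)


print(can_pull("Washington, DC", "Washington, DC Metro Area"))  # True
print(can_pull("Washington, DC Metro Area", "Washington, DC"))  # False
```

The second call is the false-precision case: content tagged only to the metro area might actually be about Alexandria, VA, so it should not come back when pulling by "Washington, DC".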
Submitted by David Hobbs on 19 December 2007 - 9:25pm
Especially as a content management system grows to hold a large amount of content, it would be nice if you could do structured link checking. One of the problems with link checking in general is what to do with the reports once you get them. Of course, for a very small site you can easily scan the entire site with tools like LinkScan ($) and Xenu Linksleuth (free, but ads are put in the reports), or even monitor 404 requests and use single-page tools like the LinkChecker Firefox extension. But with large sites you can end up with reports so long that it is hard to know where to even start fixing links. This is especially true for CMS-driven sites: the same bad link may appear in only one piece of content that is displayed throughout the site. Or you could wind up linking from lots of content items to a url (possibly outside your control) that changes.
I envision getting a report with a list of the bad links, where a user (with appropriate global rights) could indicate the correct new link, which would then be reflected in all content items (or left menus or other components surrounding the content) that used that link. This list could be prioritized by the cumulative page views of the pages that contained that bad link, or by the number of pages that contained that link. Another approach might be to provide a prioritized list of content items that have bad links (preferably directly linkable to edit mode of that content item).
At any rate, note that we're not talking about pages here but content items or links -- the user can quickly take action that will correct links on multiple pages. A long list of pages (specific urls) with bad links is confusing and, more importantly, isn't as quickly actionable. Here is how normal link checking reports look and how more useful reports might look:
| Before / Existing Reports (where do you start with a report like this, where content items may drive multiple pages?) | Report indicating bad links where the user can immediately correct them (and apply the correction everywhere) | Report indicating which content items have the bad links (content items linkable to edit them directly) |
| --- | --- | --- |
| http://badlinkone.com is referenced on http://example-site.com/page1, http://example-site.com/page35, and http://example-site.com/page102; http://badlinktwo.com is referenced on http://example-site.com/page1, http://example-site.com/page1023, http://example-site.com/page2439, http://example-site.com/page5192; Etc. | Etc. | Etc. |
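As a rough sketch of how the link-centric report could be built, the following groups crawl results by bad url and ranks them by cumulative page views. The input structures (`bad_links`, `pages_for_item`, `page_views`) and the item ids are made up for illustration; a real CMS would supply them from its own content store and analytics.

```python
from collections import defaultdict

# Hypothetical crawl output: (content item id, bad url) pairs, plus the
# pages each content item appears on and the view count for each page.
bad_links = [
    ("item-7", "http://badlinkone.com"),
    ("item-7", "http://badlinktwo.com"),
    ("item-9", "http://badlinktwo.com"),
]
pages_for_item = {
    "item-7": ["/page1", "/page35", "/page102"],
    "item-9": ["/page1023"],
}
page_views = {"/page1": 500, "/page35": 20, "/page102": 5, "/page1023": 300}


def report_by_link(bad_links, pages_for_item, page_views):
    """Group by bad url and rank by cumulative views of affected pages."""
    pages = defaultdict(set)
    items = defaultdict(set)
    for item, url in bad_links:
        items[url].add(item)
        # A set avoids counting a page twice when several items on it
        # carry the same bad link.
        pages[url].update(pages_for_item[item])
    rows = [
        {
            "url": url,
            "content_items": sorted(items[url]),
            "page_count": len(pages[url]),
            "views": sum(page_views[p] for p in pages[url]),
        }
        for url in pages
    ]
    return sorted(rows, key=lambda r: r["views"], reverse=True)


for row in report_by_link(bad_links, pages_for_item, page_views):
    print(row["url"], row["content_items"], row["page_count"], row["views"])
```

Each row names the content items to fix rather than the individual pages, which is what makes the report actionable.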
One possible way to implement this is to change all the urls into logical links in your CMS. Assuming your CMS stores straight HTML rather than a more structured format, any url the user enters could be changed to a macro automatically. (If the user could put a hard link directly into the HTML without the system changing it, most users would probably just skip the logical linking, even if there was an option for creating a logical link.) For example, if the user put in this HTML:
<a href="http://hobbsontech.com">Hobbs On Tech</a>
then the system would replace it with
!link(123,"Hobbs On Tech")
and put in its link repository that link 123 was http://hobbsontech.com. When the page was generated, the macro would be replaced with the correct link in the HTML (so of course the end user's browser should never see the "123" in the HTML). If the page linked to was in your CMS, then the macro could be different and just indicate the unique key for the content item being pointed to (this would depend on whether the context that the content appeared in was relevant). For example:
!cms_item(123,"Hobbs On Tech")
Related items that a link repository might help with:
- Reporting on content use. A link repository would allow other interesting reporting, such as the most linked-to content items in your repository.
- Easily move content. In some cases, it may be easier to move content if you have a link repository. For instance, you may sometimes need to restructure your site, resulting in the links changing. With a link repository, you could automatically change all the links so that the move did not result in broken links (of course this would work best for intranet sites, where few links to your content come from places outside your control).
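The macro idea described above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: the `LinkRepository` class and its method names are hypothetical, ids are assigned sequentially from 1, and the regular expressions only handle plain anchors with a double-quoted `href` and no other attributes.

```python
import re

# Matches simple anchors such as <a href="http://example.com">Label</a>.
ANCHOR = re.compile(r'<a href="([^"]+)">([^<]*)</a>')
# Matches the macro form described above, e.g. !link(123,"Label").
MACRO = re.compile(r'!link\((\d+),"([^"]*)"\)')


class LinkRepository:
    """Hypothetical store mapping numeric link ids to urls."""

    def __init__(self):
        self.url_by_id = {}
        self.id_by_url = {}

    def register(self, url):
        """Return the id for a url, assigning a new one if needed."""
        if url not in self.id_by_url:
            link_id = len(self.url_by_id) + 1
            self.id_by_url[url] = link_id
            self.url_by_id[link_id] = url
        return self.id_by_url[url]

    def store(self, html):
        """Replace hard-coded anchors with !link macros when saving."""
        return ANCHOR.sub(
            lambda m: '!link(%d,"%s")' % (self.register(m.group(1)), m.group(2)),
            html,
        )

    def render(self, stored):
        """Expand macros back into anchors when the page is generated."""
        return MACRO.sub(
            lambda m: '<a href="%s">%s</a>'
            % (self.url_by_id[int(m.group(1))], m.group(2)),
            stored,
        )

    def update(self, link_id, new_url):
        """Correct a link once; every item using the id picks it up."""
        del self.id_by_url[self.url_by_id[link_id]]
        self.url_by_id[link_id] = new_url
        self.id_by_url[new_url] = link_id


repo = LinkRepository()
stored = repo.store('<a href="http://hobbsontech.com">Hobbs On Tech</a>')
print(stored)
repo.update(1, "http://hobbsontech.com/blog")
print(repo.render(stored))
```

Because the stored content only holds the id, a single `update` call fixes the link everywhere it is used, and counting ids across stored content would give the "most linked-to" report mentioned above.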
Of course, this would add complexity (and possible failure points) to a CMS. Do you think it would be worth it?
Submitted by David Hobbs on 18 November 2007 - 1:08am
This may not seem very Web 2.0 (O'Reilly wrote that web services are 2.0 but screen scraping is 1.0), but I think there are a variety of reasons that screen scraping is still helpful, including:
- Need to be closer to what the user sees
- Don't have access directly to the database or a web service that will provide you the information you need (or you won't have access soon enough)
For example:
- Testing whether your web pages are looking the way you expect. Sometimes testing this from the back end just isn't going to cut it, and you need to analyze the HTML to see if the page looks reasonable.
- Writing a report that doesn't already exist on top of some reporting tool (for instance, on top of a defect-tracking system that you don't have access to the code for).
- Creating archived versions of sites. Sometimes using HTTrack, for example, isn't enough on its own (for example, when you need to pull in full-sized videos from the source system as opposed to the streamed version on the web). Also, you can use Perl to wrap around HTTrack so that you have a standard way of passing options to it.
- Seeing which of a large set of your sites are indexed in Google.
- Testing your RSS feeds to determine if they have the right number of content items, etc (I guess this would be more "RSS scraping" than screen scraping).
- Importing from a static site to a CMS (less and less commonly needed nowadays).
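As an illustration of the page-testing case, here is a small sketch that pulls the title and link targets out of a page's HTML using Python's standard `html.parser` module. The `PageSummary` class and `summarize` function are names invented for this example, and what counts as a "reasonable" page is left to the caller.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collect the title text and all link targets from a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(html):
    """Return (title, links) for the given HTML text."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links


sample = (
    "<html><head><title> Home </title></head>"
    '<body><a href="/about">About</a></body></html>'
)
title, links = summarize(sample)
print(title, links)
```

In practice the HTML would come from fetching the live page (for instance with `urllib.request.urlopen`), and a test would then compare the title and links against what you expect to see.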
Often, if there's a direct DB connection or an RSS feed or some other XML interface that you can use, then it probably makes sense to use that. Even in that case, the archiving and web page testing cases would probably benefit from screen scraping.
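For the RSS case, a check can work directly on the XML rather than on rendered HTML. The sketch below uses Python's standard `xml.etree.ElementTree` and assumes a plain RSS 2.0 feed without namespaces; `rss_item_titles` and `check_feed` are hypothetical helper names, and the expected item count is something you would supply.

```python
import xml.etree.ElementTree as ET


def rss_item_titles(feed_xml):
    """Return the titles of all <item> elements in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    return [item.findtext("title", default="") for item in root.iter("item")]


def check_feed(feed_xml, expected_count):
    """Return a list of problems; an empty list means the feed looks fine."""
    titles = rss_item_titles(feed_xml)
    problems = []
    if len(titles) != expected_count:
        problems.append(
            "expected %d items, found %d" % (expected_count, len(titles))
        )
    if any(not t.strip() for t in titles):
        problems.append("at least one item has an empty title")
    return problems


sample_feed = """<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>First post</title></item>
<item><title>Second post</title></item>
</channel></rss>"""

print(check_feed(sample_feed, 2))
print(check_feed(sample_feed, 3))
```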