
Client-centric metrics

Here's an example of a statistic that is practically meaningless from a user's perspective (this is from the free version of a monitoring service by host-tracker.com that has been checking hobbsontech.com, I believe for a valid HTTP response, every thirty minutes since November 21st):

Straight off, let me say that for troubleshooting this is useful. If someone tells me they couldn't get to this site at noon yesterday, I can quickly check to see if the site was totally down for long enough that the monitoring would catch this. In a more sophisticated environment, one can quickly check to see if a particular server in a cluster, for example, is bad and needs to be dealt with or taken offline. But this kind of metric shouldn't be confused with client-centric metrics, and you should try setting your sights on statistics that come closer to the user experience. Some key things to keep in mind when defining client-centric metrics: 

  • Don't use averages. Users don't think in terms of averages, but in extremes (for instance, "yesterday I saw a page with outdated content"). So, your goals should be in terms like "page loads in less than five seconds within the institution's firewall 99% of the time, and within 10 seconds 99.99% of the time." Note that all those 9's in the example above are not about simple uptime, but about the percentage of time that a metric is met. Also see Amazon's CTO's description of measuring at higher percentiles (one reason he gives is to ensure that high-value clients, who have more complicated, personalized pages, get good response times).
  • Don't use planned maintenance windows as an excuse. Service downtime experienced during maintenance windows should be counted as downtime. Planned maintenance is still important, since you can warn your users of downtime (and you can hopefully pick times that are lower-impact). But it's still downtime. On the other hand, if a server is down but the service is still available, then you shouldn't ding yourself in your client-centric metrics (and you should congratulate yourself on doing things in a way that allows downtime of server(s) without downtime of the service).
  • Try to base your metrics on the way a user experiences your system. This will usually involve more sophisticated analysis of responses from the server(s). For example, don't just check for a successful HTTP response from the server (I could create a page that returns a successful HTTP return code but says "This site is down for maintenance"), but check that the page has a valid left navigation, header, and content area (perhaps using screen scraping techniques). Also, except as a troubleshooting technique, don't measure the server time for generating pages, but the time a user would actually experience in downloading a page (including pulling in all the components of the page, and, if possible, the time to compute/render the page). If you cache your pages, don't get too hung up on the performance of just the cached pages (the end user won't know if it's a cached or dynamic page they are getting, so your metric has to consider the dynamic pages as well). Another example of a client-centric metric for a large system with a suite of sites: the length of time between content publishing and appearance in all relevant pages.
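
As a rough illustration of checking for more than a successful HTTP response, here is a minimal sketch that looks for the expected regions of a page. The element ids ("header", "left-nav", "content") and the function name are assumptions made up for this example, so substitute whatever your own templates use.

```python
from html.parser import HTMLParser

# Hypothetical element ids; substitute whatever your templates actually use.
REQUIRED_IDS = ("header", "left-nav", "content")


class IdCollector(HTMLParser):
    """Collects the id attribute of every element in a page."""

    def __init__(self):
        super().__init__()
        self.seen_ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.seen_ids.add(value)


def missing_regions(html, required_ids=REQUIRED_IDS):
    """Return the required ids that do not appear in the page, in order."""
    collector = IdCollector()
    collector.feed(html)
    return [rid for rid in required_ids if rid not in collector.seen_ids]


good_page = '<div id="header"></div><div id="left-nav"></div><div id="content">News</div>'
maintenance_page = "<html><body>This site is down for maintenance</body></html>"

print(missing_regions(good_page))         # []
print(missing_regions(maintenance_page))  # ['header', 'left-nav', 'content']
```

Both sample pages could be served with a successful HTTP return code, but only the first would count as a good page load under a check like this.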

But probably most important is to identify client-centric metrics as early as possible, and create a method of tracking these. If possible, you could install a large display outside a manager's office with the metrics in red when you aren't meeting the goals. Here's a table listing some example before/after types of client-centric metrics:

| Not client-centric | More client-centric |
|---|---|
| Excluding maintenance windows, the server uptime this last week was 99.9% | 99.7% of all pages in the last week loaded completely (good header, footer, and content area) |
| Excluding maintenance windows, 99.9% of all pages were generated within 1 second by the server | 99.1% of all pages in the last week loaded within 5 seconds inside our firewall |
| 1,000 content items were published yesterday | The 1,000 content items published yesterday appeared on all relevant pages within 30 minutes 99.9% of the time |
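
A metric of the "N% of pages loaded within X seconds" kind could be computed from measured load times along these lines. This is only a sketch: the function names and the sample numbers are invented for illustration, and real measurements would come from whatever client-side timing you collect.

```python
def fraction_within(load_times_s, threshold_s):
    """Fraction of page loads that completed within threshold_s seconds."""
    if not load_times_s:
        raise ValueError("no measurements")
    met = sum(1 for t in load_times_s if t <= threshold_s)
    return met / len(load_times_s)


def goal_met(load_times_s, threshold_s, target_fraction):
    """True if at least target_fraction of loads met the threshold."""
    return fraction_within(load_times_s, threshold_s) >= target_fraction


# Hypothetical full-page load times in seconds, as a user would experience them.
sample = [1.2, 0.8, 4.9, 6.5, 2.0, 3.1, 12.0, 1.1, 0.9, 2.4]

print(fraction_within(sample, 5.0))  # 0.8
print(goal_met(sample, 5.0, 0.99))   # False
```

In this made-up sample the average load time is roughly 3.5 seconds, comfortably under five seconds, even though two loads in ten missed the goal. That gap is the reason to state goals as a percentage of loads rather than as an average.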

Administrative Title

If you don't speak Thai and get a message like "the Thai news page is totally bonkers," you won't be able to help troubleshoot very easily, since it may be difficult to even *find* that page. If you are also dealing with multiple timezones, you could waste a lot of time getting clarification about which page they are talking about (of course a URL + screenshot + indication of exactly what's wrong would be great in the first place, but upset users often don't send this level of detail). Being able to search the Thai site in the back end on "news" may allow you to quickly find that content/page for troubleshooting.

If you have a large site with lots of languages, you may have one common language that all of the institution's staff knows. For instance, English might be the institutional working language. In that case, it might be best to always require an English title in addition to the actual title of a piece of content. Of course, if you *always* had an English version of every piece of content, then you could just use that English version for back-end administration. But chances are you will have content that exists *only* in a non-English language. In that case, having all non-English content tagged with an English title could be helpful (the English title could automatically be set to the existing title of the English version of the content, if the English version exists).

Also, you should probably decide early on whether other back-end features will be in English or in the different languages. For example, if you have a common institutional working language, then you should probably have all the names of the sites in the administrative back end in *English* in addition to each site's actual title in the site's language.
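
One way the administrative title could be modeled is sketched below. The class and field names (ContentItem, admin_title, english_version) are hypothetical and not taken from any particular CMS; the sketch only shows the idea of falling back to the English version's title and searching on the English title.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ContentItem:
    language: str                      # e.g. "en", "th"
    title: str                         # title shown to site visitors
    admin_title: Optional[str] = None  # English title for back-end use
    english_version: Optional["ContentItem"] = None

    def resolved_admin_title(self) -> str:
        """English title to show and search on in the administrative back end."""
        if self.language == "en":
            return self.title
        if self.admin_title:
            return self.admin_title
        if self.english_version is not None:
            # Default to the title of the existing English version.
            return self.english_version.title
        raise ValueError("non-English content needs an English administrative title")


def search_admin(items: List[ContentItem], term: str) -> List[ContentItem]:
    """Case-insensitive search on the English administrative title."""
    needle = term.lower()
    return [item for item in items if needle in item.resolved_admin_title().lower()]


thai_news = ContentItem(language="th", title="ข่าว", admin_title="Thai news page")
english_about = ContentItem(language="en", title="About us")
thai_about = ContentItem(language="th", title="เกี่ยวกับเรา", english_version=english_about)

items = [thai_news, english_about, thai_about]
print([item.title for item in search_admin(items, "news")])  # ['ข่าว']
print(thai_about.resolved_admin_title())                     # About us
```

Raising an error when no English title can be resolved is one possible way to enforce the "always require an English title" rule at save time. A softer warning would work too.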

Responding to urgent user issues

In any system that's actually used (!), you're going to get user reports of problems that need a fast response, such as access issues (as opposed to enhancement or bug fix requests). Unfortunately, a common and very difficult type of problem is an intermittent issue or one that you cannot reproduce. Even in that case, here are some rules of thumb for responding to a user report:

  • Identify whether or not you saw the problem (and never, ever, close a ticket just because you don't see the issue -- at a minimum contact the user first *before* declaring something couldn't be reproduced).
  • If you cannot reproduce the problem, then try to walk through the exact steps with the user on your desktop (or by sharing their desktop). Of course, the user may resist this since they're already frustrated if they contacted you. But as we all know a problem may only occur in a very specific situation (that you never do), although it may be the *only* way that particular user does things (so the user feels this always happens). Of course, if there is another way of doing the same thing, then suggest a workaround.
  • Clearly indicate if you *did* something for the problem to go away.
  • Ask for confirmation that the *user* thinks the issue is resolved (this one is important but easy to overlook).
  • Make it clear that the user can get back to you with any follow-up questions.

An example response that isn't very useful to a user: "Try now" and nothing more (the user doesn't know if you did anything, and they might think you don't believe anything was wrong in the first place). 

Screen Scraping

This may not seem very Web 2.0 (O'Reilly wrote that web services are 2.0 while screen scraping is 1.0), but I think there are a variety of reasons that screen scraping is still helpful, including:

  • Need to be closer to what the user sees
  • Don't have access directly to the database or a web service that will provide you the information you need (or you won't have access soon enough)

For example:

  • Testing whether your web pages are looking the way you expect. Sometimes testing this from the back end just isn't going to cut it, and you need to analyze the HTML to see if the page looks reasonable.
  • Writing a report that doesn't already exist on top of some reporting tool (for instance, on top of a defect-tracking system that you don't have access to the code for).
  • Creating archived versions of sites. Sometimes using HTTRACK, for example, isn't enough on its own (for example, when you need to pull in full-sized videos from the source system as opposed to the streamed version on the web). Also, you can use Perl to wrap around HTTRACK so that you have a standard way of passing options to HTTRACK.
  • Seeing which of a large set of your sites are indexed in Google.
  • Testing your RSS feeds to determine if they have the right number of content items, etc. (I guess this would be more "RSS scraping" than screen scraping).
  • Importing from a static site to a CMS (less and less commonly needed nowadays).

Often, if there's a direct DB connection or an RSS feed or some other XML interface that you can use, then it probably makes sense to use that. Even in that case, the archiving and web page testing cases would probably benefit from screen scraping. 
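
As a small example of the RSS feed test mentioned above, here is a sketch that counts the items in an RSS 2.0 feed using Python's standard library. The sample feed and the function name are made up for illustration; a real test would first fetch the feed from your site.

```python
import xml.etree.ElementTree as ET


def rss_item_count(feed_xml):
    """Number of <item> elements in the <channel> of an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    return len(root.findall("./channel/item"))


# Hypothetical feed content, standing in for a feed fetched from a site.
SAMPLE_FEED = """<rss version="2.0"><channel><title>News</title>
<item><title>One</title></item>
<item><title>Two</title></item>
</channel></rss>"""

print(rss_item_count(SAMPLE_FEED))  # 2
```

A monitoring script could compare this count with the number of items the feed is supposed to carry and flag any mismatch.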

Giving vs. Taking

A couple rules of thumb about giving vs. taking:

Rule 1. If you give something away for free that you would normally charge for, then make it clear you are waiving the normal rules (and that you may not be doing so in the future).

Rule 2. If you didn't follow Rule 1, then be very wary of changing your policy of giving that something away (even if it is in the manual or contract or whatever). Why? It will feel like you are taking something away (rather than *giving* something if Rule 1 is followed).

This is all *especially* true if it's something that would be easy for the user/client to overlook.

It's very easy to point to the contract, or to the manual, about some policy. But if you have routinely been not following the policy, or not charging for something, then people will continue to expect that they will get it for free. Popping what appears to be a new requirement or cost on them will probably not go over well.

Example One: Let's say a contract stipulates $100/month extra for priority support. If you've silently been giving that priority support for six months at no additional cost, and then call up the client to say you'll charge for it in the future, they'll be upset. It'll seem like you're suddenly charging them way more money. It's a lot better to put something like "$100/month fee waived for initial six months" on the bill in the first place, so you're *giving* something for six months. Of course, if you mistakenly did not charge that extra amount, then you may want to consider giving a grace period before charging it again (so that the user has time to wrap their head around the idea, and they will also feel that at least they are getting something for free a while longer).

Example Two: Your product's manual says your content should be 600 pixels wide. But your product never enforced that, and, although wider pages didn't look perfect, they didn't look horrible either. If you suddenly change the system (for other good reasons) so that these oversized pages inadvertently look bad, just pointing to the manual to say they should have kept their content to 600 pixels wide will annoy the users. It would have been better to enforce the rule in the first place, or at least to remind people that you are waiving that restriction but may require it in the future. Once wide pages have suddenly started looking worse, you can also offer to help your users review and resize their problem pages. Also, if you can change the system fairly easily to be a bit more lax on the requirement, then it would be better to do so.

Obviously, these types of issues may arise because of an oversight on your part (you forgot to charge the additional $100/month in the example above), but in general the main things to try to keep in mind are: a) these types of details *are* important, so try to keep an eye on them, b) try to remind people when you are temporarily waiving a fee or restriction, and c) by all means, don't just flippantly point to the manual, contract, or other document. Of course, there may be times when you do need to fall back to pointing to the document/agreement, but carefully consider the options before doing so.