Here's an example of a statistic that is practically meaningless from a user's perspective (it comes from the free version of host-tracker.com's monitoring service, which has been checking hobbsontech.com, I believe for a valid HTTP response, every thirty minutes since November 21st):
Straight off, let me say that this is useful for troubleshooting. If someone tells me they couldn't get to this site at noon yesterday, I can quickly check whether the site was totally down for long enough that the monitoring would have caught it. In a more sophisticated environment, one can quickly check whether a particular server in a cluster, for example, is bad and needs to be dealt with or taken offline. But this kind of metric shouldn't be confused with client-centric metrics, and you should set your sights on statistics that come closer to the user experience. Some key things to keep in mind when defining client-centric metrics:
- Don't use averages. Users think in extremes, not averages (for instance, "yesterday I saw a page with outdated content"). So your goals should be stated in terms like "page loads in less than five seconds within the institution's firewall 99% of the time, and within 10 seconds 99.99% of the time." Note that all those 9's are not about simple uptime, but about the percentage of the time that a metric is met. Also see the Amazon CTO's description of measuring at higher percentiles (one reason he gives is to ensure that high-value clients, who have more complicated, personalized pages, get good response times).
- Don't use planned maintenance windows as an excuse. Service downtime experienced during maintenance windows should be counted as downtime. Planned maintenance is still important, since you can warn your users ahead of time (and you can hopefully pick times that are lower-impact), but it's still downtime. On the other hand, if a server is down but the service is still available, then you shouldn't ding yourself in your client-centric metrics (and you should congratulate yourself for doing things in a way that allows downtime of servers without downtime of the service).
- Try to base your metrics on the way a user experiences your system. This will usually involve more sophisticated analysis of the responses from the server(s). For example, don't just check for a successful HTTP response (I could create a page that returns a successful HTTP status code but says "This site is down for maintenance"); check that the page has a valid left navigation, header, and content (perhaps using screen scraping techniques). Also, except as a troubleshooting technique, don't measure the server time for generating pages, but the time a user would actually experience in downloading a page (including pulling in all the components of the page and, if possible, the time to compute and render it). If you cache your pages, don't get too hung up on the performance of just the cached pages (the end user won't know whether they are getting a cached or a dynamic page, so your metric has to consider the dynamic pages as well). Another example of a client-centric metric, for a large system with a suite of sites: the length of time between publishing content and its appearance on all relevant pages.
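To make the "don't use averages" point concrete, here is a minimal sketch in Python of scoring page-load samples against goals of the "within X seconds, Y% of the time" kind. The function names and the sample data are made up for illustration; only the 5-second/99% and 10-second/99.99% goals come from the example above.

```python
from statistics import mean


def fraction_within(load_times_s, threshold_s):
    """Return the fraction of samples that loaded in under threshold_s seconds."""
    if not load_times_s:
        raise ValueError("no samples to evaluate")
    hits = sum(1 for t in load_times_s if t < threshold_s)
    return hits / len(load_times_s)


def meets_goals(load_times_s, goals):
    """Check each (threshold_s, required_fraction) goal against the samples."""
    return {
        threshold_s: fraction_within(load_times_s, threshold_s) >= required
        for threshold_s, required in goals
    }


# Hypothetical samples: 98 fast page loads and 2 very slow ones.
samples = [1.0] * 98 + [30.0] * 2

# Goals from the text: under 5 s 99% of the time, under 10 s 99.99% of the time.
goals = [(5.0, 0.99), (10.0, 0.9999)]

print("average load time:", mean(samples))               # looks healthy
print("fraction under 5 s:", fraction_within(samples, 5.0))
print("goals met:", meets_goals(samples, goals))         # both goals are missed
```

With these made-up numbers the average comes out well under five seconds, yet two users in a hundred waited half a minute and neither goal is met, which is exactly the gap an average hides.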
But probably the most important thing is to identify client-centric metrics as early as possible, and to set up a way of tracking them. If possible, you could install a large display outside a manager's office that shows the metrics in red when you aren't meeting the goals. Here's a table listing some example before/after types of client-centric metrics:
|Not client-centric|More client-centric|
|---|---|
|Excluding maintenance windows, the server uptime this last week was 99.9%|99.7% of all pages in the last week loaded completely (good header, footer, and content area)|
|Excluding maintenance windows, 99.9% of all pages were generated within 1 second by the server|99.1% of all pages in the last week loaded within 5 seconds inside our firewall|
|1,000 content items were published yesterday|The 1,000 content items published yesterday appeared on all relevant pages within 30 minutes 99.9% of the time|
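As a rough illustration of the "loaded completely" metric in the first row, here is a sketch in Python that counts a response as good only if it returns HTTP 200 and contains the expected page regions. The element ids (`header`, `footer`, `content`) and the sample responses are assumptions for the example; a real site would need its own markers, and this checks only that the regions are present, not that their contents are correct.

```python
from html.parser import HTMLParser

# Hypothetical ids for the regions a complete page must contain.
REQUIRED_IDS = {"header", "footer", "content"}


class RegionFinder(HTMLParser):
    """Collect which of the required element ids appear in a page."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id in REQUIRED_IDS:
            self.seen.add(element_id)


def page_loaded_completely(status_code, html):
    """True only for a 200 response that contains every required region."""
    if status_code != 200:
        return False
    finder = RegionFinder()
    finder.feed(html)
    return finder.seen == REQUIRED_IDS


def percent_complete(responses):
    """Percentage of (status_code, html) responses that loaded completely."""
    results = [page_loaded_completely(status, html) for status, html in responses]
    return 100.0 * sum(results) / len(results)


good_page = (
    '<div id="header">Site</div>'
    '<div id="content">Article text</div>'
    '<div id="footer">Contact</div>'
)
maintenance_page = "<p>This site is down for maintenance</p>"

# Made-up responses: two good pages, one "200 OK" maintenance page, one error.
responses = [(200, good_page), (200, good_page), (200, maintenance_page), (500, "")]
print(percent_complete(responses))
```

A plain HTTP check would have scored three of these four responses as successes; the region check counts only the two pages a user would consider loaded.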