Monitoring

Anyone running an online service would do well to read this great post on Monitoring Theory:

Monitors can be informative, actionable, or both. By informative, I mean that the alert must tell you categorically that there is a problem. By actionable, I mean that receiving the alert must prompt some kind of immediate response.

I strongly agree with the points in the post, including the thoughts on queues and consumers, which often make a complex system even more complex.

I'll add one other: put alerts in place for things you 'know' to be true. Some of the most interesting failure scenarios I've seen have been foretold by 'this'll never fire' alerts.
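
To make that last point concrete, here is a minimal sketch of what "alerts for things you know to be true" could look like. It is an illustration only: the metric names (`queue_depth`, `active_consumers`, `enqueued_total`, `processed_total`), the `InvariantAlert` type, and the `violated` helper are all hypothetical and not taken from the post or from any monitoring library. Each alert encodes an assumption as a predicate that should always hold, and fires when it does not.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class InvariantAlert:
    """An assumption about the system, expressed as a check that should always pass."""
    name: str
    holds: Callable[[Metrics], bool]  # returns True while the assumption is true


def violated(alerts: List[InvariantAlert], metrics: Metrics) -> List[str]:
    """Return the names of every invariant that the given metrics snapshot breaks."""
    return [alert.name for alert in alerts if not alert.holds(metrics)]


# Hypothetical "this'll never fire" alerts for a queue with consumers.
ALERTS = [
    InvariantAlert("queue_depth_never_negative",
                   lambda m: m["queue_depth"] >= 0),
    InvariantAlert("always_at_least_one_consumer",
                   lambda m: m["active_consumers"] > 0),
    InvariantAlert("processed_never_exceeds_enqueued",
                   lambda m: m["processed_total"] <= m["enqueued_total"]),
]

if __name__ == "__main__":
    # Made-up snapshot in which the consumers have silently disappeared.
    snapshot = {
        "queue_depth": 120,
        "active_consumers": 0,
        "enqueued_total": 5000,
        "processed_total": 4880,
    }
    for name in violated(ALERTS, snapshot):
        print(f"ALERT: invariant broken: {name}")
```

In a real setup the same predicates would more likely live as rules in whatever alerting system you already run. The point is only that each one is cheap to write and stays silent until an assumption stops being true.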