Monitoring
Anyone running an online service would do well to read this great post on Monitoring Theory:
"Monitors can be informative, actionable, or both. By informative, I mean that the alert must tell you categorically that there is a problem. By actionable I mean that receiving the alert must prompt some kind of immediate response."
I strongly agree with the points in the post, including its thoughts on queues and consumers, which often make an already complex system even more complex.
I'll add one other: put alerts in place for things you 'know' to be true. Some of the most interesting failure scenarios I've seen have been foretold by 'this'll never fire' alerts.
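As a rough sketch of what a 'this'll never fire' alert could look like, the snippet below treats each assumption as a named check over a snapshot of metrics and reports any that no longer hold. The metric names, thresholds, and the InvariantAlert and violated helpers are all hypothetical, made up for illustration; a real setup would express the same idea in whatever alerting system is already in use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class InvariantAlert:
    """An alert for a condition that is assumed to always hold."""
    name: str
    # Returns True while the assumption is still intact.
    holds: Callable[[Dict[str, float]], bool]


def violated(alerts: List[InvariantAlert], metrics: Dict[str, float]) -> List[str]:
    """Return the names of alerts whose 'known true' assumption no longer holds."""
    return [a.name for a in alerts if not a.holds(metrics)]


# Hypothetical assumptions about a hypothetical system.
ALERTS = [
    InvariantAlert("replica_count_at_least_2", lambda m: m["replicas"] >= 2),
    InvariantAlert("queue_depth_below_10k", lambda m: m["queue_depth"] < 10_000),
    InvariantAlert("clock_skew_under_1s", lambda m: abs(m["clock_skew_s"]) < 1.0),
]

if __name__ == "__main__":
    # A made-up metrics snapshot in which one 'impossible' condition has occurred.
    sample = {"replicas": 1, "queue_depth": 120, "clock_skew_s": 0.02}
    for name in violated(ALERTS, sample):
        print(f"ALERT: assumption violated: {name}")
```

The checks are deliberately cheap and boring. The value is in writing the assumption down so that something notices the day it stops being true.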