Following on from last week's post about monitoring, the next logical step is having the system create alerts when a threshold is breached.
Types of alert
Alerts come in a few shapes. Most are expressed as a threshold on some metric: once the value breaches it, by going either too high or too low, an alert notification is sent. The notification can be delivered in a number of ways: email, SMS, Slack/IRC, or a status flashing on a wall board. Each has its pros and cons.
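As a rough illustration of the threshold idea, here is a minimal Python sketch. The names (`Threshold`, `check`, `notify`) and the example metrics are made up for this post and don't come from any particular monitoring tool. A real setup would replace the `print` channel with email, SMS or a chat webhook.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Threshold:
    """Hypothetical alert rule: bounds on a single metric."""
    metric: str
    low: Optional[float] = None   # alert if the value falls below this
    high: Optional[float] = None  # alert if the value rises above this


def check(threshold: Threshold, value: float) -> Optional[str]:
    """Return an alert message if the value breaches a bound, else None."""
    if threshold.high is not None and value > threshold.high:
        return f"{threshold.metric} too high: {value} > {threshold.high}"
    if threshold.low is not None and value < threshold.low:
        return f"{threshold.metric} too low: {value} < {threshold.low}"
    return None


def notify(message: str, channels: Iterable[Callable[[str], None]]) -> None:
    """Send the message to every channel (email, SMS, chat, wall board...)."""
    for send in channels:
        send(message)


if __name__ == "__main__":
    rule = Threshold("disk_free_gb", low=10)
    alert = check(rule, 4)
    if alert is not None:
        notify(alert, [print])  # print stands in for a real channel
```

Keeping the check separate from the delivery means the same rule can feed several channels, which makes it easier to weigh the pros and cons of each.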
Automation
Can the resolution be automated in some way? For hardware, that could mean ordering a hard drive and creating a ticket that tracks the status of the order. For a service, it could mean an automatic restart. At a larger scale, in an autoscaling group, a new instance is launched to deal with the extra load when the CPU breaches a certain threshold. All of this happens automatically, without staff needing to get involved, except to review a report afterwards and check that the action was sensible.
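To make the autoscaling case concrete, here is a small Python sketch of the decision logic only. It is not how any specific cloud provider implements autoscaling. The function names, the 80% threshold and the cap of 10 instances are all assumptions chosen for the example. Every automated action is recorded so that staff can review it afterwards.

```python
from datetime import datetime, timezone

# Record of automated actions, kept for review after the fact.
audit_log = []


def desired_instances(current, cpu_percent, scale_out_at=80.0, max_instances=10):
    """Add one instance when CPU breaches the threshold, up to a cap.

    The threshold and cap are illustrative defaults, not recommendations.
    """
    if cpu_percent > scale_out_at and current < max_instances:
        return current + 1
    return current


def autoscale(current, cpu_percent):
    """Decide the new instance count and log any change for later reporting."""
    target = desired_instances(current, cpu_percent)
    if target != current:
        audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "action": "scale_out",
            "cpu_percent": cpu_percent,
            "from": current,
            "to": target,
        })
    return target


if __name__ == "__main__":
    count = autoscale(2, 95.0)   # CPU above threshold, one instance is added
    count = autoscale(count, 40.0)  # CPU is fine, nothing changes
    for entry in audit_log:
        print(entry)
```

The cap matters: without an upper bound, a runaway process could keep launching instances with nobody watching.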
Let me know your ideas to improve your own monitoring and alerting.