Talk abstract
SAAFE - A prioritized alerting model to troubleshoot your incidents
Existing taxonomies for time-series data, including the Four Golden Signals, the RED method, and the USE method, are mostly concerned with the nature of each type of series. The SAAFE alerting model (Saturation, Amend, Anomaly, Failure, and Error) helps you focus on what the series imply rather than on their type.
At Grafana Labs, we have built a scalable, fully automated alerting system that analyzes the data using its domain knowledge. The alerts it raises are categorized into the SAAFE model based on their implications for the system. The SAAFE alerts are then scored and ranked by severity level (info, warning, or critical), number of instances, and firing duration. When our on-call engineers troubleshoot incidents, they use the SAAFE categorization and ranking to prioritize, filter, and infer causality.
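As a rough illustration of the scoring and ranking idea, here is a minimal Python sketch. The abstract does not give the actual formula, so the severity weights, the multiplicative score, the `Alert` fields, and the sample alert names are all hypothetical; the framework presented in the talk is built with PromQL and Grafana, not Python.

```python
from dataclasses import dataclass

# Hypothetical weights: the abstract names the severity levels but not their values.
SEVERITY_WEIGHT = {"info": 1, "warning": 2, "critical": 3}


@dataclass
class Alert:
    name: str
    category: str        # saturation, amend, anomaly, failure, or error
    severity: str        # info, warning, or critical
    instances: int       # number of instances currently firing
    firing_minutes: int  # how long the alert has been firing


def score(alert):
    """Assumed scoring rule: severity weight x instance count x firing duration."""
    return SEVERITY_WEIGHT[alert.severity] * alert.instances * alert.firing_minutes


def rank(alerts, category=None):
    """Return alerts sorted by descending score, optionally filtered by SAAFE category."""
    selected = [a for a in alerts if category is None or a.category == category]
    return sorted(selected, key=score, reverse=True)


# Made-up sample alerts, for illustration only.
alerts = [
    Alert("ConfigRollout", "amend", "info", 1, 5),
    Alert("PodCrashLoop", "failure", "critical", 2, 10),
    Alert("HighCPU", "saturation", "warning", 4, 30),
]

for alert in rank(alerts):
    print(alert.category, alert.name, score(alert))
```

Calling `rank(alerts, category="failure")` would narrow the list to a single SAAFE category, which mirrors the filtering step described above.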
In this talk, we will introduce the SAAFE method with real-world examples of how it has helped us. We will also share the open-source framework, built purely with PromQL and Grafana, that you can adopt.