What Are the Four Golden Signals?

We recently published the IT Topic “IT System Monitoring: advanced solutions for total visibility and security”, in which we explain how advanced IT system monitoring solutions optimize performance, improve security, and reduce alert noise with AI and machine learning. We also mentioned that there are four golden signals that IT system monitoring should focus on. The term “golden signals” was introduced by Google in its book Site Reliability Engineering: How Google Runs Production Systems, published in 2016. Site Reliability Engineering (SRE) is a discipline used by IT and software engineering teams to proactively create and maintain more reliable services. The book defines the four golden signals as follows:

  • Latency: This metric is the time that elapses between a system receiving a request and sending the response. It is tempting to reduce it to a single “average” latency figure, or to an agreed “average” latency used to guide SLAs. As a golden signal, however, we want to observe latency over a period of time, which can be displayed as a frequency distribution histogram.

    For instance, picture a histogram of the latency of 1,000 requests made to a service with an expected response time of less than 80 milliseconds (ms). Each histogram bin groups requests according to the time they take to complete, from 0 ms to 150 ms in increments of five.
  • Traffic: This is the demand placed on the system. For example, a system might receive an average of 100 HTTPS requests per second, but averages can be misleading. It is more useful to watch how traffic trends over time, since a change in the trend can point to a problem. Traffic may also increase at certain times of the day (for example, when people respond to an offer for a few hours, or when stock prices are checked at market close).
  • Errors: These are API error codes that indicate something is not working properly. Tracking both the total number of errors and the percentage of failed requests allows you to compare the service with others. Google SREs extend this concept to include functional errors, such as incorrect data and slow responses.
  • Saturation: Networks, disks, and memory all have a saturation point at which demand exceeds the performance limits of a service. You can run load tests to identify the saturation point and the constraints, that is, which resource causes requests to fail first. A very common bad practice is to ignore saturation because load balancers and other automated scaling mechanisms are in place. In poorly configured systems, inconsistent scaling and other factors can prevent load balancers from doing their job properly. For that reason, monitoring saturation helps teams identify issues before they become serious problems and take proactive action to keep incidents from recurring.
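
As a rough illustration of the four definitions above, the following Python sketch derives each signal from a handful of request records. Everything in it is an assumption made for the example: the `Request` record, the function names, the choice to count only HTTP 5xx responses as errors, and the sample values. The 5 ms buckets from 0 to 150 ms mirror the histogram described under Latency.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one handled request."""
    timestamp_s: float  # arrival time, in seconds
    latency_ms: float   # time between receiving the request and sending the response
    status: int         # HTTP status code


def latency_histogram(requests, bucket_ms=5, max_ms=150):
    """Count requests per latency bucket: [0, 5), [5, 10), and so on.

    Each key is the lower bound of its bucket. Requests that take max_ms
    or longer are grouped in a single overflow bucket keyed by max_ms.
    """
    counts = Counter()
    for r in requests:
        bucket = min(int(r.latency_ms // bucket_ms) * bucket_ms, max_ms)
        counts[bucket] += 1
    return dict(sorted(counts.items()))


def traffic_per_second(requests):
    """Requests per one-second window, so spikes are not hidden by an average."""
    return dict(sorted(Counter(int(r.timestamp_s) for r in requests).items()))


def error_rate(requests):
    """Fraction of requests that failed (assumed here to mean a 5xx status)."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if r.status >= 500)
    return failed / len(requests)


def saturation(used, capacity):
    """Utilization of a resource as a fraction of its capacity."""
    return used / capacity


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        Request(0.1, 12.0, 200),
        Request(0.4, 47.5, 200),
        Request(0.9, 83.0, 500),
        Request(1.2, 14.9, 200),
        Request(1.3, 310.0, 503),
    ]
    print(latency_histogram(sample))   # {10: 2, 45: 1, 80: 1, 150: 1}
    print(traffic_per_second(sample))  # {0: 3, 1: 2}
    print(error_rate(sample))          # 0.4
    print(saturation(6.0, 8.0))        # 0.75
```

Keeping the latency distribution rather than a single mean makes the slow tail visible: in the sample, the 310 ms request appears in the overflow bucket instead of being diluted by the faster ones.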

The Importance of the Four Golden Signals in Monitoring

The relevance of the four golden signals in IT system monitoring lies in the ability to track latency, traffic, errors, and saturation for all services in real time, which gives IT teams what they need to identify potential or ongoing issues more quickly. In addition, a single view of the status of every service streamlines the work of the team that monitors in-house or third-party systems. Instead of monitoring each function or service separately, metrics and logs can be grouped in a single location. All of this helps teams manage issues better and track the whole lifecycle of an event.

How to Implement the Four Golden Signals

The four golden signals are a way to help SRE teams focus on what’s important, so they don’t rely on a plethora of metrics and alarms that might be difficult to interpret. To implement them, follow these steps:

  • Define baselines and thresholds: Set normal operating ranges or service level objectives (SLOs) for each signal. SLOs help identify anomalies and set up meaningful alerts. For example, you may set a latency threshold of 200 ms; if latency rises above it, an alert should be triggered.
  • Implement alerts: Set up alerts to receive notifications when signals exceed predefined thresholds, so that issues can be responded to promptly. Combining alerts with AI streamlines alert management, notification, and escalation.
  • Analyze trends: Review historical data periodically to understand trends and patterns, gather information for proactive capacity planning, and identify opportunities for optimization. Advanced analytics and AI are valuable tools for interpreting these analyses correctly.
  • Automate responses: Try to automate responses to common problems so that your IT team is not overwhelmed and can focus on more strategic tasks or on the incidents that really deserve attention. With AI, automatic scaling can be set up to help manage traffic spikes.
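
The first two steps can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular monitoring tool implements alerting: the `Threshold` record, the `evaluate` function, and the signal names are hypothetical, and only the 200 ms latency limit comes from the example above. The error rate and saturation limits are placeholder values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    signal: str   # name of the signal, e.g. "latency_ms"
    limit: float  # alert when the observed value is above this limit


# Hypothetical thresholds. Only the 200 ms latency limit comes from the text;
# the other two limits are placeholders chosen for the example.
THRESHOLDS = [
    Threshold("latency_ms", 200.0),
    Threshold("error_rate", 0.01),
    Threshold("saturation", 0.80),
]


def evaluate(observed, thresholds=THRESHOLDS):
    """Return one alert message per signal whose observed value exceeds its limit.

    Signals missing from `observed` are skipped rather than treated as alerts.
    """
    alerts = []
    for t in thresholds:
        value = observed.get(t.signal)
        if value is not None and value > t.limit:
            alerts.append(f"{t.signal}={value} exceeds threshold {t.limit}")
    return alerts


if __name__ == "__main__":
    current = {"latency_ms": 250.0, "error_rate": 0.002, "saturation": 0.5}
    for alert in evaluate(current):
        print(alert)  # latency_ms=250.0 exceeds threshold 200.0
```

In practice the messages returned by `evaluate` would be handed to whatever notification or escalation channel the team uses, and the limits would be derived from the baselines and SLOs defined in the first step.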

Monitoring Tools: Open Source or Commercial Solutions?

When choosing a monitoring tool, the question may arise as to which option is more suitable: an open source tool or a commercial solution. The answer should not depend only on cost (whether or not to pay for resources). Almost every IT product relies on open source components, which are in constant use, so their value is not in question. That said, if you opt for open source, choose a monitoring solution backed by professional, reliable support, including help with correct configuration.
It is also important for the open source solution to be intuitive, so that valuable time is not lost on configuration, adjustments, maintenance, and updates. Remember that agility and speed are required.

Importance of Golden Signals in Observability

Monitoring allows problems to be detected before they become critical, while observability is particularly useful for diagnosing problems and understanding the root cause. Golden signals enable site reliability engineering (SRE) to be implemented based on availability, performance, monitoring, and readiness to respond to incidents, improving overall system reliability and performance. Also, monitoring based on golden signals offers the observability elements to find out what is happening and what needs to be done about it. To achieve observability, metrics from different domains and environments must be gathered in one place, and then analyzed, compared, and interpreted.

The Golden Signals as Part of Full-Stack Observability

Full-stack observability refers to the ability to understand what is happening in a system at any time by monitoring system inputs and outputs, along with cross-domain correlations and dependency mapping. Golden signals help manage the complexities of multi-component monitoring and avoid blind spots. They also link system behavior, performance, and health to user experience and business outcomes.
Golden signals are also integrated into the principles of SRE: Risk Acceptance, Service Level Objectives, Automation, Effort Reduction, and Distributed Systems Monitoring, which combine software engineering and operations to build and run large-scale, distributed, high-availability systems. SRE practices also include defining and measuring reliability objectives, designing and implementing observability, and defining, testing, and executing incident management processes. In advanced observability platforms, the golden signals also provide the data to improve financial management (costs, capital decisions based on technology use, SLA compliance), security, and risk prevention.

Conclusion

The digital nature of business has forced IT security strategists to face the complexity of multi-component monitoring. Golden signals provide key indicators that apply to almost all types of systems. In addition, system performance must be analyzed and predicted, which is where observability becomes essential. In this regard, MELT (Metrics, Events, Logs, and Traces) is a framework that takes a comprehensive approach to observability, providing insight into the health, performance, and behavior of systems.

Pandora FMS: a Complete Solution for Monitoring the Four Golden Signals

Pandora FMS stands out as a complete solution for monitoring distributed systems and implementing the Four Golden Signals. Here we explain why.

1. Versatility and Flexibility
Pandora FMS (Flexible Monitoring System) is known for its ability to adapt to different environments and business needs. Whether you’re managing a small on-premise infrastructure or a complex, large-scale distributed system, Pandora FMS can scale and adapt seamlessly.

2. Comprehensive Latency Monitoring
Pandora FMS enables detailed latency monitoring at different levels, from application latency to network and database latency. It provides real-time alerts and intuitive dashboards that make it easy to identify bottlenecks and optimize performance.

3. Detailed Traffic Monitoring
With Pandora FMS, you can monitor traffic in real time and get a clear view of the volume of requests and transactions. This allows you to identify usage patterns, detect unexpected spikes, and plan capacity effectively.

4. Error Detection and Analysis
The Pandora FMS platform offers strong error detection in real time. It covers application errors, network errors (such as packet loss and network interface errors), device errors reported through SNMP traps, and even failures in the infrastructure. Configurable alerts and detailed reports help teams respond quickly to critical issues, reducing downtime and improving system reliability.

5. Resource Saturation Monitoring
Pandora FMS monitors key resource usage, such as CPU, memory, and storage, allowing administrators to anticipate and avoid saturation. This is vital to keep system performance and availability under control, especially during periods of high demand.

6. Integration with Existing Tools and Technologies
Pandora FMS integrates easily with a wide range of existing tools and technologies, enabling easier deployment and greater interoperability. This flexibility makes it easy to consolidate all monitoring data into a centralized platform.

7. Custom Reports and Intuitive Dashboards
The ability to generate custom reports and interactive dashboards allows IT teams to look at the status of their systems effectively. These features are essential for informed decision making and continuous service improvement.

8. Support and Active Community
Pandora FMS has strong technical support and an active community that offers ongoing resources and support. This is crucial to ensure that any issues are quickly solved and that users can get the most out of the platform.

9. Cost-Effectiveness
Unlike many commercial solutions, Pandora FMS offers excellent value for money, providing advanced features at a competitive cost. This makes it an attractive option for both small businesses and large corporations.
