Difference between revisions of "Pandora: QuickGuides EN: Alert configuration"

From Pandora FMS Wiki
Revision as of 10:11, 22 February 2016


1 Pandora FMS Alert Configuration Quick Guide

1.1 Introduction to the Current Alert System

People often complained about the complexity of defining alerts in Pandora FMS. Until version 2.0, alerts were simpler to configure: for each alert we defined the condition and how the alert reacted when the condition was or wasn't met. It was more "intuitive" (although it also had fields, such as the alert "threshold", that caused many people headaches). It was very simple, but was it worth it?

One of our best users (he had lots of agents installed and managed Pandora FMS really well) told us that creating an alert for 2000 modules was very difficult, especially when something had to be modified in all of them. Because of this and other problems, we made the alert system modular: we separated the definition of the firing condition (Alert template) from the action to execute when it fires (Alert action), and isolated both from the command that the action executes (Alert command). The combination of an alert template with a module is what makes up the alert.

This way, if I have 1000 systems with a module called "Host alive", all of them associated with an alert template called "Host down" that by default executes an action called "Call to the operator", and I want to change the minimum number of alerts that must be fired before the operator is notified, I only need to change the template definition instead of modifying 1000 instances.

Many users only manage a few dozen machines, but there are users with hundreds, even thousands, of systems monitored with Pandora FMS. This approach makes it possible for Pandora FMS to manage all kinds of environments.

1.1.1 Alert structure


Esquema-alert-structure.png


An alert is composed of:

  • Commands
  • Actions
  • Templates

A command defines the operation to perform when the alert is fired. Some examples of a command could be: write to a log, send an email or SMS, execute a script or a program, etc.

An action links a command with a template and allows you to customize the command execution using three generic parameters: Field 1, Field 2 and Field 3. These are passed to the command as input parameters when it is executed.

In the template you define the alert's generic parameters: firing conditions, firing actions and alert recovery.

  • Firing conditions: the conditions under which the alert will be fired, for example: when the data is above a threshold, when the status is critical, etc.
  • Firing actions: allows configuring the action that will be performed when the alert is fired.
  • Alert recovery: allows configuring the actions that will be performed when the system is recovered after the alert was fired.

1.1.2 Alert system information flow

When you define actions and templates, you have three generic fields: Field1, Field2 and Field3. They are passed as input parameters when the command is executed. Their values are propagated from the template to the action, and then to the command. A template value is propagated to the action only if the corresponding field in the action is empty; otherwise the action's own value is used.


Esquema-parameters-carrying.png


This is an example of how template values are overwritten by the action values.


Alertas esquema6.png


For example, we can create a template that fires an alert and sends an email with the following fields:

  • Template:
    • Field1: [email protected]
    • Field2: [Alert] The alert was fired
    • Field3: The alert was fired!!! SOS!!!

If the action leaves these fields empty, the values that will be passed to the command are:

  • Command:
    • Field1: [email protected]
    • Field2: [Alert] The alert was fired
    • Field3: The alert was fired!!! SOS!!!
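The propagation rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as this guide states it, not Pandora FMS code: the function name `resolve_fields`, the dictionary layout and the address `operator@example.com` are all hypothetical.

```python
from typing import Dict

FIELDS = ("field1", "field2", "field3")


def resolve_fields(template: Dict[str, str], action: Dict[str, str]) -> Dict[str, str]:
    """Return the Field1..Field3 values that would reach the command.

    Assumed rule (taken from the text above): a value set in the action
    wins; the template value is used only when the action field is empty.
    """
    resolved = {}
    for name in FIELDS:
        action_value = action.get(name, "")
        resolved[name] = action_value if action_value else template.get(name, "")
    return resolved


# Hypothetical data: the action only sets the recipient, the template
# provides the subject and the body.
template = {
    "field1": "",
    "field2": "[Alert] The alert was fired",
    "field3": "The alert was fired!!! SOS!!!",
}
action = {"field1": "operator@example.com", "field2": "", "field3": ""}

command_fields = resolve_fields(template, action)
```

With these inputs the command receives the recipient from the action and the subject and text from the template, which is the split used later in this guide.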

1.2 Defining a single Alert

Now suppose we have one simple need: to monitor a single module that returns numerical values. In our case it's a module that measures the system CPU; in other cases it could be a temperature sensor that reads the value in degrees Celsius. Let's first make sure that our module receives the data correctly:


Qgcpu1.png

In this screenshot, we can see that we have a module called cpu_sys with a current value of 7. We want the system to fire an alert when the value reaches 20 or more. To do this, we're going to configure the module so that it goes to CRITICAL status when it reaches 20 or more. Click on the wrench icon to configure the module's behavior:


Qgcpu2.png

We modify the value highlighted in red, as shown in the following screenshot:


Qgcpu3.png

Accept and save the changes. Now, when the CPU module value goes up to 20 or higher, its status will change to CRITICAL and it will be marked in red, as we can see here.


Qgcpu4.png

The system knows how to recognize when something is right (OK, green color) and when it is wrong (CRITICAL, red color). Now, what we want is for Pandora FMS to send us an email when the module changes to this status. To do so, we will use the Pandora FMS alert system.

The first thing we should do is make sure that there is at least one command that does what we need (send an email). This example is easy because Pandora FMS includes a default command for sending email.

1.3 Configuring the Alert

Now, we have to create an action called "Send an email to the operator". Let's do it: go to the menu -> Alerts -> Actions and click to create a new action:


Qgcpu5.png

This action uses the command "Send email" and it's really simple, so you only need to fill in one field (Field 1) and leave the other two empty. This is one of the most confusing parts of the Pandora FMS alert system: what are field1, field2 and field3?

These fields are used to "pass" information from the alert template to the action, and from there to the command, so that both the template and the action can supply different information to the command line. In this case, the action only sets field1, and we leave field2 and field3 to the template, as we'll see in the next section.

Field 1 is the one we use to define the operator's email address, in this case a fictitious one: "[email protected]".

1.4 Configuring the Template (Alert template)

Now we have to create an alert template that is as generic as possible, so that we can reuse it later. For example: "This is wrong because I have a module in Critical status", which by default sends an email to the operator. Let's go to the administration menu -> Alerts -> Templates and click on the button to create a new alert template:


Qgcpu6.png


The element that defines the condition is the "Condition" field. In this case, it is set to "Critical status", so this template, when associated with a module, will fire when that module goes to critical status. We previously configured the "cpu_sys" module to go to critical status when its value reaches 20 or more.

The priority defined here as "Critical" is the priority of the alert, which has nothing to do with the "Critical" status of the module. The criticality of alerts allows us to visualize them in other views, such as the event view, with different identifiers.


Go to step 2 by clicking on the "next" button:


Qgcpu7.png

Step 2 defines the finer configuration values of the alert template's trigger condition. The first ones are quite simple: they restrict the alert to specific days and to a specific time range.

The most important parameters here are the following:

  • Time threshold: One day by default. Suppose a module stays down for an entire day. If we set a value of 5 minutes here, we would receive an alert every 5 minutes. If we set it to one day (24 hours), the alert is sent only once, when it's triggered. If the module recovers and goes down again, the alert is sent again; but if the module then stays down after that second fall, the system won't send another alert until 24 hours have passed.
  • Min. Number of alerts: Minimum number of times that the condition must be met (in this case, that the module is in CRITICAL status) before Pandora FMS executes the actions associated with the alert template. This is a way to avoid being flooded with alerts caused by false positives or by erratic behavior (bouncing). If we put 1 here, the system won't take the condition into account until it has already happened at least once. If we put 0, the alert fires the first time the condition is met.
  • Max. Number of alerts: 1 means that the action is executed only once. If we set it to 10, the action is executed up to 10 times. It's a way to limit the number of times an alert can be executed.
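To see how the three parameters interact, here is a small Python simulation. It is a sketch built only from the descriptions above, not the actual server logic: `TemplateLimits` and `action_times` are hypothetical names, the counters are assumed to reset whenever a new time threshold window starts, and module recovery is not modelled.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class TemplateLimits:
    time_threshold: int  # window length in seconds
    min_alerts: int      # matches ignored before the action may run
    max_alerts: int      # maximum action executions per window


def action_times(critical_times: Iterable[int], limits: TemplateLimits) -> List[int]:
    """Return the timestamps at which the action would be executed.

    `critical_times` are the moments (in seconds) at which the module is
    seen in CRITICAL status. Assumption: all counters restart when a new
    time threshold window begins.
    """
    executed_at: List[int] = []
    window_start = None
    matches = 0
    executed = 0
    for t in critical_times:
        if window_start is None or t - window_start >= limits.time_threshold:
            window_start, matches, executed = t, 0, 0
        matches += 1
        if matches > limits.min_alerts and executed < limits.max_alerts:
            executed += 1
            executed_at.append(t)
    return executed_at


# A module that is CRITICAL on every 5-minute check during one day.
samples = range(0, 24 * 3600, 300)
once_a_day = action_times(samples, TemplateLimits(24 * 3600, 0, 1))
every_check = action_times(samples, TemplateLimits(300, 0, 1))
skip_first = action_times(samples, TemplateLimits(24 * 3600, 1, 1))
```

Under these assumptions, a 24-hour threshold yields a single execution, a 5-minute threshold yields one execution per check, and a minimum of 1 delays the execution until the second consecutive match.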

Now we meet the fields "field1, field2 and field3" again. Here field1 is blank, because it's exactly the one we defined when we configured the action. In the email action, field2 and field3 define the subject and the text of the message, whereas field1 defines the recipients (separated by commas). So the template, using some macros, defines the subject and the body of the message, and in our case we would receive a message like the following (supposing that the agent containing the module is called "Farscape"):

To: [email protected]
Subject: [PANDORA] Farscape cpu_sys is in CRITICAL status with value 20
Email text:
This is an automated alert generated by Pandora FMS
Please contact your Pandora FMS for more information. *DO NOT* reply to this email.
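The subject line above is produced by replacing macros in the template's field2. The following Python sketch shows the idea; the macro names `_agent_`, `_module_` and `_data_` are assumptions made for this illustration (the guide only says that "some macros" are used), and `expand_macros` is a hypothetical helper, not part of Pandora FMS.

```python
import re
from typing import Dict

# Assumed macro syntax: a lowercase name wrapped in underscores.
MACRO = re.compile(r"_([a-z0-9]+)_")


def expand_macros(text: str, values: Dict[str, str]) -> str:
    """Replace each known macro with its value; leave unknown ones as-is."""
    def substitute(match: "re.Match[str]") -> str:
        return values.get(match.group(1), match.group(0))

    return MACRO.sub(substitute, text)


subject = expand_macros(
    "[PANDORA] _agent_ _module_ is in CRITICAL status with value _data_",
    {"agent": "Farscape", "module": "cpu_sys", "data": "20"},
)
```

Doing the replacement in a single pass means a substituted value such as cpu_sys is never scanned again for macros.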

Given that the default action is the one we defined previously, all the alerts that use this template will use this predefined action by default, unless it is changed.

In step 3, we'll see that it's also possible to configure the alert system so that it notifies us when the alert has stopped.


Qgcpu8.png

It's almost the same, but field1 isn't defined here, because the one defined in the previously executed action (when the alert was fired) is reused. In this case it'll only send an email with a subject saying that the condition in the cpu_sys module has recovered.

Alert recovery is optional. It's important to note that if fields (field2 and field3) are defined in the alert recovery data, they "ignore and overwrite" the action fields; that is, they take priority over them. The only field that can't be modified is field1.

1.5 Associating the Alert to the Command

Now we have everything we need; we only have to associate the alert template with the module. To do so, go to the alert tab of the agent that contains the module:


Qgcpu9.png

It's easy. In this screenshot we can see an alert already configured for a module named "Last_Backup_Unixtime", using the same template that we defined before as "Module critical". Now, using the controls underneath, we're going to create an association between the module "cpu_sys" and the alert template "Module critical". By default it'll show the action that we defined in this template, "Send email to Sancho Lerena".

1.6 Scaling Alerts

The values in the "Number of alerts match from" fields define the alert scaling. They let us refine the alert's behavior a little more: if we've defined that an alert can fire a maximum of 5 times but we only want it to send us one email, we should put a 0 and a 1 here, which tells it to send the email only from time 0 to 1 (that is, once).

We can also add more actions to the same alert and use these "Number of alerts match from" fields to define the alert's behavior depending on how many times it has fired.

For example, we want it to send an email to XXXXX the first time it happens and, if the monitor stays down, to send an email to ZZZZ. To do this, after associating the alert, we can add more actions to the previously defined alert in the assigned alerts table, as we can see in the following screenshots:


Qgcpu9.png


Qgcpu10.png
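The scaling described in this section can be pictured as a list of actions, each with its own "from" and "to" values. The Python sketch below is one possible reading, not the server's implementation: it assumes that "from 0 to 1" means "only when the alert has not fired before", that is, a half-open range over the number of previous fires. The function name `actions_to_run` and the range 1 to 5 for the second action are illustrative.

```python
from typing import List, Tuple

# (action name, "from" value, "to" value)
Escalation = List[Tuple[str, int, int]]


def actions_to_run(previous_fires: int, escalation: Escalation) -> List[str]:
    """Return the actions whose range covers the current fire.

    Assumption: an action runs when start <= previous_fires < end.
    """
    return [name for name, start, end in escalation if start <= previous_fires < end]


escalation: Escalation = [
    ("Send email to XXXXX", 0, 1),  # first time only
    ("Send email to ZZZZ", 1, 5),   # while the monitor stays down
]

first_time = actions_to_run(0, escalation)
still_down = actions_to_run(3, escalation)
```

If your version treats the upper bound as inclusive, the ranges would need to be adjusted accordingly.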

1.7 Standby alerts

Alerts can be enabled, disabled or in standby mode. The difference between disabled and standby alerts is that disabled alerts simply do not work and therefore are not shown in the alerts view. Standby alerts are shown in the alerts view and do work, but only at display level: the view shows whether they are fired or not, but they do not execute their configured actions and do not generate events.

Standby alerts are useful when you want to see their status without affecting anything else.

1.8 Using Alert Commands different from the email

The email command is internal to Pandora FMS and can't be configured; that is, field1, field2 and field3 are predefined as the recipient, subject and text of the message. But what happens if I want a different action, one that I define myself?

We're going to define a new command, something completely defined by us. Imagine that we want to create a log file with an entry for each alert that fires. The format of this log file should be something like:

DATE_ HOUR - NAME_AGENT - NAME_MODULE - VALUE - PROBLEM DESCRIPTION

Where VALUE is the value of the module at that moment. There will be several log files, depending on the action that calls the command. The action defines the description and the file that the events are written to.

To do this, we first create a command as follows:


Qgcpu11.png

And we're going to define an action:


Qgcpu12.png

If we take a look at the log that we've created:


2010-05-25 18:17:10 - farscape - cpu_sys - 23.00 - Custom alert for LOG#1

We can see that the alert was fired at 18:17:10 in the "farscape" agent, in the "cpu_sys" module, with a value of "23.00" and with the description that we chose when we defined the action.
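The command configured in the screenshot is not reproduced in this text, so as a stand-in, here is a minimal Python sketch of a script that writes a line in the format shown above. The function names `format_log_line` and `append_alert` are hypothetical, and how the server would pass the values to such a script is left out.

```python
from datetime import datetime


def format_log_line(when: datetime, agent: str, module: str,
                    value: float, description: str) -> str:
    """Build one log entry: date and hour, agent, module, value, description."""
    stamp = when.strftime("%Y-%m-%d %H:%M:%S")
    return f"{stamp} - {agent} - {module} - {value:.2f} - {description}"


def append_alert(log_path: str, line: str) -> None:
    """Append the entry to the log file chosen by the action."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(line + "\n")


line = format_log_line(
    datetime(2010, 5, 25, 18, 17, 10),
    "farscape", "cpu_sys", 23.0, "Custom alert for LOG#1",
)
```

Keeping the formatting separate from the file write makes it easy to point different actions at different log files, as described earlier in this section.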

The command execution, the field order and other details can make it hard to understand how the command is finally executed. The easiest thing to do is to activate the debug traces of the Pandora server (verbose 10) in the server configuration file /etc/pandora/pandora_server.conf, restart the server (/etc/init.d/pandora_server restart), and then look in the file /var/log/pandora/pandora_server.log for the exact line with the execution of the alert command that we defined, to see how the Pandora FMS server is firing the command.
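Once the debug traces are enabled, a short Python helper can pull the relevant lines out of the server log. This is a generic text search, sketched under the assumption that the trace line contains some text you can recognize (for example the module name or the command you defined); the exact wording of the trace is not documented here, and `find_lines` is a hypothetical name.

```python
from typing import List


def find_lines(log_path: str, needle: str) -> List[str]:
    """Return every line of the log file that contains `needle`."""
    with open(log_path, encoding="utf-8", errors="replace") as log:
        return [line.rstrip("\n") for line in log if needle in line]


# Example call (the search text is a guess; adapt it to your command):
# find_lines("/var/log/pandora/pandora_server.log", "cpu_sys")
```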