Pandora: Documentation en: Intro Monitoring


Go back to Pandora FMS documentation index

1 Introduction to Monitoring

All user interaction with Pandora FMS takes place through the web console. Because the console is accessed from a browser, no heavy client applications need to be installed, and the system can be managed from any computer with a browser.

Monitoring is the execution of processes on all types of systems to collect and store information, take action and make decisions based on such data.

Pandora FMS is a scalable monitoring system whose multiple features extend the scope and volume of the information collected with almost no limits.

2 Agents on Pandora FMS

All monitoring done by Pandora FMS is managed through a generic entity called 'Agent' which is included in a more generic segment called 'Group'. These agents will be equivalent to each of the different monitored computers, devices, websites or applications.

The agents defined in the Pandora FMS console can present local information gathered through a software agent, remote information collected through network checks, or both. Therefore, it is worth highlighting the difference between agents as an organizational unit in the Pandora FMS console, and software agents as local data collection services.




AgentHierarchy.png



2.1 Monitoring by Software Agent vs. Remote Monitoring

Monitoring can be divided into two large groups based on how the information is collected: monitoring based on software agents and remote monitoring.

Agent-based monitoring consists of installing a small piece of software that runs continuously on the system and obtains information locally by executing commands and scripts.

Remote monitoring is the use of the network to run remote checks on systems, without the need to install any additional components on the equipment to be monitored.

As can be seen, agent-based monitoring obtains information through local checks, while remote monitoring obtains it through network checks run from the Pandora FMS server.

With Pandora FMS, monitoring can be carried out either way, or both can be combined into mixed monitoring.

Both types of agent share the same general configuration and data visualization.

2.2 Agent configuration in the console




Configuracion agente consola1.png






Configuracion agente consola2.png



  • Alias: For all Pandora FMS functions to work properly with agents and modules, it is recommended not to use characters such as /,\,|,%,#,&,$ in the name of the agent. These characters can be misinterpreted when the name is used in system paths or passed to other commands, causing errors on the server.
  • Server: Server that will execute the checks configured in the agent. This parameter is especially relevant if HA has been configured in the installation.
  • Secondary groups: Optional parameter for an agent to belong to more than one group.
  • Cascade protection: Parameter with which an avalanche of alerts can be avoided. It is possible to choose an agent or a module of an agent. In the first case, when the chosen agent is in critical status, the agent will not generate alerts. In the second case, the agent will not generate alerts only when the specified module is in critical status.
  • Module definition: Three work modes can be selected:
    • Learning mode: If an XML arrives with new modules, they will be created automatically (by default).
    • Normal mode: If an XML arrives with new modules, they will not be created unless they have already been declared in the console.
    • Auto-disable mode: Same as learning mode, but if all modules go into unknown, the agent will be disabled until information arrives again.

2.3 Agent display in the console

In this screen, plenty of information on the agent can be seen, with the possibility of forcing the remote execution and refreshing the data.

Visualizacion agente consola1.png

In the upper part a summary with the agent data can be seen:

  • Total of modules and their status.
  • Events in the last 24 hours
  • Agent Information
    • Name
    • Version
    • Agent accessibility
    • Group
  • ...
Visualizacion agente consola2.png

Next, the list of modules belonging to the agent is displayed (modules in not started status are not shown), followed by the alerts generated for those modules.

Visualizacion agente consola3.png

Finally, the events generated from the agent are displayed.

Visualizacion agente consola4.png

3 Modules

Modules are units of information stored within an agent. They are the monitoring elements with which the information is extracted from the device or server to which the agent points. Each module can store only one metric. There cannot be two modules with the same name inside the same agent. All modules have an associated status, which can be:

  • Not started: no data has been received yet.
  • Normal: data is being received with values outside the warning and critical thresholds.
  • Warning: Data is being received with values within the warning threshold.
  • Critical: Data is being received with values within the critical threshold.
  • Unknown: the module has been running and has stopped receiving information for a certain amount of time.

The modules have different types of data, such as Boolean, numeric or alphanumeric. Depending on the information collected by the module, it will be one type or another.

3.1 Types of modules

There are several types of modules inside Pandora FMS.

  • Data module: a type of local monitoring module that runs checks on the system where the agent is located, for example the CPU usage of the device or its free memory. To find out more about this type of monitoring, go to the following link.
  • Network module: it is a type of remote monitoring module with which checks are made to verify the connection with the device or server to which the agent points, for example whether it is working or whether it has a particular port open. To learn more about this type of monitoring, go to the following link.
  • Plugin module: this is a type of local or remote monitoring module with which custom checks can be made through the creation of scripts. With them, more advanced and extensive checks than the ones proposed directly through Pandora FMS console can be done. To find out more about this type of monitoring, go to the following link.
  • WMI module: this is a type of local monitoring module with which the Windows system can be checked through the WMI protocol, such as obtaining the list of installed services or the current CPU load. To find out more about this type of monitoring, go to the following link.
  • Prediction module: this is a type of predictive monitoring module with which different arithmetic operations are performed through the consultation of data from other "base" modules, such as the average CPU usage of the monitored servers or the sum of connection latency. To learn more about this type of monitoring, go to the following link.
  • Webserver module: this is a type of web monitoring that checks the status of a website and obtains data from it, for example to see whether a website is down or whether it contains a specific word. To find out more about this type of monitoring, go to the following link.
  • Web analysis module: this is a type of web monitoring that simulates a user's web browsing, such as browsing a website, entering credentials or filling in forms. To learn more about this type of monitoring, go to the following link.

3.2 Common Parameters

Within the configuration of each module, there are parameters common to all of them.




Parametros comunes modulos1.png



  • Using module component: Pandora FMS has a repertoire of default modules that can be used. Depending on the selected module, the necessary parameters will be automatically filled in to carry out the monitoring. This token appears in all types of modules except prediction ones.
  • Dynamic Threshold Interval: token for dynamic monitoring to be explained in a later section.
  • Warning/Critical Status: token for status monitoring which will be explained in a later section.
  • FF threshold: Flip-flop (FF) is a common phenomenon in monitoring in which a value fluctuates frequently between two states (RIGHT/WRONG), which makes it difficult to interpret. To deal with it, a "threshold" is usually applied: for something to be considered as having changed status, it has to stay in the new state for more than X intervals without changing. In Pandora FMS terminology this is called the "FF Threshold".



Fft.png


The FF Threshold parameter (FF = FlipFlop) is used to filter continuous status changes when events and statuses are generated. In Pandora FMS, you can specify that an element is not considered to have changed until it has reported the same new status at least X times after leaving its original status. A common example is a ping to a host where there is packet loss. In an environment like this, the following results may be received:


1
1
0
1
1
0
1
1
1

However, the host is alive in all cases. What Pandora FMS should really be told is: do not show the host as down until it has reported being down at least three times in a row. In the previous case it would never be shown as down; it would only be shown as down in a case like this:

1
1
0
1
0
0
0

From this point on, it will be shown as down - but not before that.

Flip-flop protection is therefore very useful for avoiding disruptive fluctuations. All modules implement it. Its use prevents the change of status (whether determined by the defined limits or by automatic limits, as in the case of *proc modules).
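The filtering described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Pandora FMS server code: the function name `apply_ff_threshold` is hypothetical, and it assumes that the new status must be reported in consecutive readings, as in the ping example.

```python
def apply_ff_threshold(readings, initial_status, ff_threshold):
    """Return the status shown after each reading.

    Assumption: a new status is accepted only after it has been
    reported `ff_threshold` times in a row.
    """
    shown = initial_status
    candidate = None  # status that is trying to replace `shown`
    streak = 0        # consecutive readings equal to `candidate`
    history = []
    for value in readings:
        if value == shown:
            # Back to the shown status: the fluctuation is discarded.
            candidate, streak = None, 0
        elif value == candidate:
            streak += 1
        else:
            candidate, streak = value, 1
        if candidate is not None and streak >= ff_threshold:
            shown, candidate, streak = candidate, None, 0
        history.append(shown)
    return history


# The two ping sequences from the text, with an FF threshold of 3.
noisy = [1, 1, 0, 1, 1, 0, 1, 1, 1]
failing = [1, 1, 0, 1, 0, 0, 0]
print(apply_ff_threshold(noisy, 1, 3))
print(apply_ff_threshold(failing, 1, 3))
```

In the first sequence the host never appears as down; in the second it only appears as down on the last reading.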

    • Keep counters

This is an advanced flip-flop option for controlling the status of a module. With "keep counters", a counter is kept for each status, and the move from one status to another depends on the status that corresponds to each received value rather than on the value itself.

An example of how it works is shown below:

Let us suppose there is a module with the following characteristics:


Interval: 5 min.
Threshold:
  Critical: 90 - 100;
  Warning: 80 - 90;

Flip Flop:
   Normal: 0;
   Warning: 3;
   Critical: 2;

Current Status: Normal;

And these data/Status are received:

Data  Status
81    Warning
83    Warning
95    Critical
89    Warning
98    Critical
81    Warning
86    Warning

As the example shows, the received data belong to the warning and critical statuses, but the current status is normal because the Flip Flop conditions are not met.

By setting the keep counters parameter, a status counter will be kept, resulting in the change of status as shown below:


Data  Data status  Module status
81    Warning      Normal
83    Warning      Normal
95    Critical     Normal
89    Warning      Warning
98    Critical     Warning
81    Warning      Warning
86    Warning      Warning

Let us look at a more complete case:

Let us suppose there is a module with the following characteristics:


Interval: 5 min.
Threshold:
  Critical: 90 - 100;
  Warning: 80 - 90;

Flip Flop:
   Normal: 2;
   Warning: 3;
   Critical: 2;

Current Status: Normal;

The normal and critical counters only accumulate if values with that status arrive consecutively. The warning counter, on the other hand, may accumulate even if warning values do not arrive consecutively.

The status counters are restarted in the following cases:

  • A value arrives whose status matches the current module status.
  • The module status changes because the "keep counters" conditions are met.

The normal and critical counters have a special behavior: each of them is also restarted as soon as its values stop arriving consecutively.


In this case, the following data is received:

Data  Data status  Critical counter  Warning counter  Normal counter  Module status
81    Warning      0                 1                0               Normal
83    Warning      0                 2                0               Normal
95    Critical     1                 2                0               Normal
89    Warning      0                 0                0               Warning
(When the warning counter gets to three, the status is changed to warning and all counters are restarted.)
50    Normal       0                 0                1               Warning
98    Critical     1                 0                0               Warning
(The normal counter and the critical counter must be consecutive to keep increasing. When receiving a critical value, the normal counter becomes 0.)
91    Critical     0                 0                0               Critical
(When the critical counter reaches two, the status is changed to critical and all counters are restarted.)
30    Normal       0                 0                1               Critical
31    Normal       0                 0                0               Normal
(When the normal counter reaches two, the status is changed to normal and all counters are restarted.)
81    Warning      0                 1                0               Normal
83    Warning      0                 2                0               Normal
12    Normal       0                 0                0               Normal
(When receiving data in normal status that is equal to the current status, the counters are restarted.)
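The counter rules of this example can be reproduced with a short Python simulation. It is a sketch written from the rules stated in this section, not the actual server implementation; the names `data_status` and `run_keep_counters` are hypothetical, and treating both range limits as inclusive (with critical checked first) is an assumption.

```python
def data_status(value):
    """Classify a value with the thresholds of the example."""
    # Critical is checked first because it prevails where ranges overlap.
    if 90 <= value <= 100:
        return "critical"
    if 80 <= value <= 90:
        return "warning"
    return "normal"


def run_keep_counters(values, ff, current="normal"):
    """Return (data, data status, critical, warning, normal, module status) rows."""
    counters = {"normal": 0, "warning": 0, "critical": 0}
    rows = []
    for value in values:
        status = data_status(value)
        if status == current:
            # Data matching the current status restarts every counter.
            counters = dict.fromkeys(counters, 0)
        else:
            # Normal and critical only accumulate consecutively.
            for name in ("normal", "critical"):
                if name != status:
                    counters[name] = 0
            counters[status] += 1
            if counters[status] >= ff[status]:
                current = status
                counters = dict.fromkeys(counters, 0)
        rows.append((value, status, counters["critical"],
                     counters["warning"], counters["normal"], current))
    return rows


ff_values = {"normal": 2, "warning": 3, "critical": 2}
data = [81, 83, 95, 89, 50, 98, 91, 30, 31, 81, 83, 12]
for row in run_keep_counters(data, ff_values):
    print(*row)
```

Running it with the Flip Flop values of the first example (Normal: 0, Warning: 3, Critical: 2) and its seven values yields the module statuses of the first table as well.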

Within the advanced options of the modules, the following common parameters can be observed.




Parametros comunes modulos2.png






Parametros comunes modulos3.png



  • Interval: Parameter that defines the period in which the module should return data. In the case of remote modules, this is the period in which the remote check is performed. In the case of data modules, it is a numerical value that represents X times the defined agent interval, and the local check is performed in that period. If a module spends more than two intervals without receiving data, it will go into unknown status.
  • Post process: Parameter by which the data received by the module can be converted. By default it is disabled with the value 0. The following conversions can be made:
    • Seconds to months
    • Seconds to weeks
    • Seconds to days
    • Seconds to minutes
    • Bytes to Gigabytes
    • Bytes to Megabytes
    • Bytes to Kilobytes
    • Timeticks to weeks
    • Timeticks to days
  • FF interval: If the flip-flop threshold is activated and there is a state change, the module interval will be changed for the next execution.
  • FlipFlop timeout: Parameter that can only be used in asynchronous modules. For a state change by flip-flop to be effective, equal consecutive data must be received within the specified interval.
  • Quiet: Parameter by which the module will continue to receive information, but no type of event or alert will be generated.
  • Cascade Protection Services: If this feature is enabled, the generation of events and alerts is passed on to the service to which the module belongs.
  • Cron: Parameter by which it is possible to specify periods of time in which the module will be executed with the nomenclature: Minute, Hour, Day of the Month, Month, Day of the week. There are three different possibilities:
    • Cron from: any -> No monitoring restrictions (default)
    • Cron from: specific. Cron to: any -> It will be executed only when the time matches the specified value. Ex: 15 20 * * * will run every day at 20:15.
    • Cron from: specific. Cron to: specific -> It will run during the established interval. Ex: 5-10 * * * * will run every hour, from minute 5 to minute 10.
  • Custom macros: Any number of custom module macros may be defined. The recommended format for macro names is:
   _macroname_

For example:

   _technology_
   _modulepriority_
   _contactperson_

These macros can be used in module alerts.

If the module is of the web analysis type:

Dynamic macros will have a special format starting with @ and will have these possible replacements:

   @DATE_FORMAT (current date/time with user-defined format)
   @DATE_FORMAT_nh (hours)
   @DATE_FORMAT_nm (minutes)
   @DATE_FORMAT_nd (days)
   @DATE_FORMAT_ns (seconds)
   @DATE_FORMAT_nM (month)
   @DATE_FORMAT_nY (years)

Where "n" can be a number without a sign (positive) or negative.
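Going back to the Cron parameter described earlier in this list, the three possibilities can be illustrated with a small Python check. This is only a sketch: the function `cron_allows`, the representation of "Cron from" and "Cron to" as two lists of five fields, and the inclusive handling of the interval limits are all assumptions made for the example, not the way Pandora FMS stores or evaluates the value.

```python
from datetime import datetime


def cron_allows(moment, cron_from, cron_to=None):
    """Check whether `moment` falls inside the cron restriction.

    Fields: minute, hour, day of the month, month, day of the week.
    "*" means "any". Hypothetical representation for illustration only.
    """
    actual = (moment.minute, moment.hour, moment.day, moment.month,
              moment.isoweekday() % 7)  # 0 = Sunday, as in cron
    cron_to = cron_to or ["*"] * 5
    for value, start, end in zip(actual, cron_from, cron_to):
        if start == "*":
            continue                      # no restriction on this field
        if end == "*":
            if value != int(start):       # must match the exact value
                return False
        elif not int(start) <= value <= int(end):
            return False                  # outside the from-to interval
    return True


# "15 20 * * *": only at 20:15 every day.
daily = ["15", "20", "*", "*", "*"]
print(cron_allows(datetime(2024, 1, 1, 20, 15), daily))

# "5-10 * * * *": from minute 5 to minute 10 of every hour.
start, end = ["5", "*", "*", "*", "*"], ["10", "*", "*", "*", "*"]
print(cron_allows(datetime(2024, 1, 1, 9, 7), start, end))
```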

3.3 Status Monitoring

When monitoring, values are obtained from a system, whether memory, CPU, hardware temperature, number of connected users, orders on an e-commerce website or any other numerical value. Sometimes the data alone is relevant, but in general a STATUS needs to be associated with these values, so that when they exceed a "THRESHOLD" the status changes and shows whether something is right or wrong. Therefore, when talking about monitoring, the STATUS concept must be discussed.

Pandora FMS allows you to define thresholds to determine the status that a check will have based on the data it shows. The three possible statuses are: NORMAL, WARNING and CRITICAL. A threshold is a value from which something goes from one status to another. The status of the modules will depend on these thresholds, which are specified by the following parameters present in the configuration of each module:

  • Warning status - Min. Max.: lower and upper limits for the warning status. If the numerical value of the module is within this range, the module will go into warning status. If no upper limit is specified, it will be infinite (all values above the lower limit).
  • Warning status - Str.: regular expression for alphanumeric modules (string). If any matches are found, the module will go into warning status.
  • Critical status - Min. Max.: lower and upper limits for the critical status. If the numerical value of the module is in this range, the module will go into critical status. If no upper limit is specified, it will be infinite (all values above the lower limit).
  • Critical status - Str.: regular expression for alphanumeric modules (string). If any matches are found, the module will go into critical status.
  • Inverse interval: present for both the warning and critical thresholds. If enabled, the module will change status when its values are outside the range specified in the thresholds. It also works for alphanumeric modules (string): if the text strings do NOT match the Warning/Critical Str., the module will change its status.



Threshold1.JPG



Threshold2.JPG


If the "warning" and "critical" thresholds overlap in any range, the "critical" threshold will always prevail.

 


3.3.1 Numerical thresholds - Case study 1

There is a CPU usage percentage module that will always be green in agent status, since it simply reports a value between 0% and 100%. If you want the CPU usage module to go into warning status (yellow) when it reaches 70% and into critical status (red) when it reaches 90%, the thresholds must be set as follows:

  • Warning status Min.: 70
  • Critical status Min.: 90


Threshold3.JPG


Thus, when the value reaches 90, the module will appear in red (CRITICAL); between 70 and 89.99 it will appear in yellow (WARNING), and below 70 in green (NORMAL).

Due to the way the thresholds operate, in cases like this one, it is not necessary to set upper limits. That is because if only the lower threshold is set, the upper threshold will be taken into account as "no limit", so any value above the lower limit will be taken as within the threshold. In addition, if thresholds overlap, the critical threshold will prevail over the warning, resulting in the graph of thresholds shown in the previous screenshot.
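The threshold rules used in this case study can be expressed as a short Python sketch. The functions `in_range` and `module_status` are hypothetical names used only for illustration; treating the upper limit as inclusive when it is set is an assumption, since the text only describes the case without an upper limit.

```python
import math


def in_range(value, lower, upper=None, inverse=False):
    """True if `value` falls in the threshold range.

    A missing upper limit means "no limit". `inverse` mirrors the
    Inverse interval option: the range matches values outside it.
    """
    upper = math.inf if upper is None else upper
    inside = lower <= value <= upper  # assumption: inclusive limits
    return not inside if inverse else inside


def module_status(value, warning_min, critical_min,
                  warning_max=None, critical_max=None):
    """Return the status for a numerical value."""
    # Critical is evaluated first, so it prevails if the ranges overlap.
    if in_range(value, critical_min, critical_max):
        return "CRITICAL"
    if in_range(value, warning_min, warning_max):
        return "WARNING"
    return "NORMAL"


# CPU usage module: Warning status Min. 70, Critical status Min. 90.
for cpu in (35, 70, 89.99, 90, 100):
    print(cpu, module_status(cpu, warning_min=70, critical_min=90))
```

Since no upper limits are set, the warning range also covers values above 90, and it is the evaluation order that makes critical prevail there.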

3.3.2 Text thresholds - Case study 2

If there is a string type module, the status can be configured using regular expressions in the Str fields of the Warning Status and Critical Status parameters. In this case, there is a module that can return "OK", "ERROR connection fail" or "BUSY too many devices", depending on the result of the query.

To configure the WARNING and CRITICAL states of the text module, the following regular expressions must be used:

Warning Status: .*BUSY.*
Critical Status: .*ERROR.*


Threshold4.JPG


With this configuration, the module will go into WARNING status when the data contains the string BUSY, and into CRITICAL status when the data contains the string ERROR. Be careful: regular expressions are case sensitive.
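The behavior of these two expressions can be checked with Python's `re` module. This is only a sketch to illustrate the matching, under the assumption that the server's regular expression engine treats these simple patterns the same way; the function `string_status` is a hypothetical name.

```python
import re

WARNING_RE = re.compile(r".*BUSY.*")
CRITICAL_RE = re.compile(r".*ERROR.*")


def string_status(data):
    """Return the status for the text returned by the module."""
    # Critical is checked first so that it prevails over warning.
    if CRITICAL_RE.search(data):
        return "CRITICAL"
    if WARNING_RE.search(data):
        return "WARNING"
    return "NORMAL"


for text in ("OK", "ERROR connection fail", "BUSY too many devices",
             "error connection fail"):
    print(repr(text), string_status(text))
```

The last value stays in NORMAL because the lowercase "error" does not match the case-sensitive expression.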

3.3.3 Dynamic monitoring (Automatic thresholds)

Dynamic monitoring consists of automatically and dynamically adjusting the status thresholds of the modules in an intelligent and predictive way. The procedure consists of collecting the values for a given period and calculating an average and a standard deviation, which are used to establish the corresponding thresholds.

The configuration is done at module level, and the possible parameters are:

  • Dynamic Threshold Interval: time interval to be considered for the calculation of thresholds. If 1 month is chosen, the system will take all existing data from the last month and build the thresholds based on that data.
  • Dynamic Threshold Two Tailed: if activated, the dynamic threshold system will also set thresholds below the average. If unchecked (default) only thresholds with values above the average will be set.
  • Dynamic Threshold Max.: allows you to increase the upper limit by the indicated percentage. E.g.: if the average values are around 60 and the critical threshold has been set from 80 on, setting Dynamic Threshold Max: 10 will increase the critical threshold by 10%, so it would be 88.
  • Dynamic Threshold Min.: it only applies if the Dynamic Threshold Two Tailed parameter is active. Allows the lower limit to be reduced by the percentage indicated. E.g.: if the average values are around 60 and the lower critical threshold has been set to 40, if the value Dynamic Threshold Min: 10 is set, the critical threshold will be reduced by 10%, so it would be 36.
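The two percentage adjustments in the examples above amount to simple arithmetic, shown here as a Python sketch. The function names are hypothetical and only reproduce the figures given in the text; they do not show how the thresholds themselves are calculated from the average and standard deviation.

```python
def apply_dynamic_max(upper_threshold, max_percent):
    """Raise an upper threshold by the Dynamic Threshold Max. percentage."""
    return upper_threshold * (100 + max_percent) / 100


def apply_dynamic_min(lower_threshold, min_percent):
    """Lower a lower threshold by the Dynamic Threshold Min. percentage."""
    return lower_threshold * (100 - min_percent) / 100


print(apply_dynamic_max(80, 10))  # critical threshold 80 raised by 10%
print(apply_dynamic_min(40, 10))  # lower critical threshold 40 reduced by 10%
```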

There are also several additional configuration parameters in the pandora_server.conf file.

  • dynamic_updates: this parameter determines how many times the thresholds are recalculated during the time period set in Dynamic Threshold Interval. If "Dynamic Threshold Interval" is set to a value of 1 week, the data is collected from one week backwards and the calculation is done only once by default, repeating the process again after one week. If the parameter "dynamic_updates" is modified, this frequency can be increased. For example, setting the parameter to the value 3 will cause the thresholds to be recalculated up to three times during the period of a week (or the period set in "Dynamic Threshold Interval"). Its default value is 5.
  • dynamic_warning: percentage of difference between warning and critical thresholds. Its default value is 25.
  • dynamic_constant: determines the deviation of the average that will be used to establish thresholds, higher values will take thresholds further away from the average values. Its default value is 10.


In the following example, the calculated average value is at the red line (approx. 30):


Thresh1.JPG


When the dynamic thresholds are activated, the upper threshold (approx. 45 and above) is set like this:


Thresh2.JPG


Having the parameter Dynamic Threshold Two Tailed activated means a critical threshold has also been set below the average values (approx. 15 and lower):


Thresh3.JPG


Now, once the "Dynamic Threshold Min." and "Dynamic Threshold Max." parameters are set at 20 and 30 respectively, the thresholds have been widened, so they are slightly more permissive:


Thresh4.JPG


3.3.3.1 Case study 1

The starting point is a web latency module. The basic settings take a one-week interval into account:


Dynamic1.JPG


When saving changes, after running pandora_db, the thresholds have been set in this way:


Dynamic2.JPG


The module will therefore switch to warning status when the value is higher than 0.33 seconds, and to critical when it is higher than 0.37 seconds. The graph will be shown as follows:


Dynamic3.JPG


The threshold has been considered to be somewhat permissive, so it has been decided to make use of the parameter Dynamic Threshold Min. to lower the minimum thresholds. Since in this case the threshold has no maximum values, because everything above a certain value will be considered incorrect, Dynamic Threshold Max. will not be used. The modification would look like this:


Dynamic4.JPG


After applying changes and executing the pandora_db, the thresholds are set as follows:


Dynamic5.JPG


And the graph will look like this:


Dynamic6.JPG


3.3.3.2 Case study 2

In this example, the temperature of a control room or a data center is being monitored. The graph shows values with little variation:


Dynamic7.JPG


In this situation, it is essential that the temperature remains stable and reaches neither overly high nor excessively low values, so the parameter "Dynamic Threshold Two Tailed" is used to delimit thresholds both above and below. The configuration is as follows:


Dynamic8.JPG


The automatically generated thresholds have been these:


Dynamic9.JPG


And the graph will look like this:


Dynamic10.JPG


That way, all values between 23.10 and 26 will be considered normal, since that is the acceptable temperature in the data center or control room. If needed, the "Dynamic Threshold Min." and "Dynamic Threshold Max." parameters can be used again to adjust the thresholds.
