Difference between revisions of "Pandora: Documentation en: Intro Monitoring"

From Pandora FMS Wiki
(Monitoring with Pandora FMS)
(Tags)
 
(52 intermediate revisions by 6 users not shown)
Line 1: Line 1:
 
[[Pandora:Documentation_en|Go back to Pandora FMS documentation index]]
 
 +
  
 
=Introduction to Monitoring=
 
  
== Monitoring with Pandora FMS ==
+
All user interaction with Pandora FMS is done through the web console. The console is accessed from a browser, with no need to install heavy applications, so management is possible from any computer whose browser supports HTML5.
 +
 
 +
Monitoring is the execution of processes on all types of systems to collect and store information, take action and make decisions based on such data.
 +
 
 +
Pandora FMS is a scalable monitoring system with multiple features that extend the scope and volume of the information collected almost without limit.
 +
 
 +
= Logic agents on Pandora FMS =
 +
<br>
 +
<center><br><br>
 +
[[Image:AgentHierarchy.png|center|550px]]
 +
</center><br><br>
 +
<br>
 +
All monitoring done by Pandora FMS is classified into ''Logic agents''. All ''Logic agents'' belong to a 'Group'. These agents will be equivalent to each of the different monitored computers, devices, websites or applications.
 +
 
 +
Logic agents defined in Pandora FMS console may present local information gathered through a software agent, remote information collected through network checks, or both. Therefore, it is worth highlighting the difference between agents as an organizational unit in the Pandora FMS console, and software agents as local data collection services.
 +
 
 +
 
 +
== Monitoring by Software Agent vs. Remote Monitoring ==
 +
 
 +
Monitoring can be divided into two large groups based on how the information is collected: monitoring based on software agents and remote monitoring.
 +
 
 +
*'''Agent-based monitoring''' consists of installing a small piece of software that keeps running on the system and obtains information '''locally''' by executing commands and scripts.
 +
 
 +
*'''Remote monitoring''' is the use of the network to run remote checks on systems, without the need to install any additional components on the computer to be monitored.
 +
 
 +
As can be seen, [[Pandora:Documentation_en:Operations|software agent]] based monitoring obtains information through '''local checks''', while [[Pandora:Documentation_en:Remote_Monitoring|remote monitoring]] obtains it through '''network checks''' run from the Pandora FMS server.
 +
 
 +
Both agent types share the same general configuration and data display. With Pandora FMS, monitoring can be carried out either way, or both can be combined into mixed monitoring.
 +
 
 +
==Agent setup in the console==
 +
 
 +
;Normal view editing interface:
 +
 
 +
[[Image:Configuracion agente consola1.png|center|799px]]
 +
 
 +
* '''Alias''': For all Pandora FMS functions that work with agents and modules to operate properly, it is recommended not to use characters such as <code>/</code>, <code>\</code>, <code>|</code>, <code>%</code>, <code>#</code>, <code>&</code> and <code>$</code> in the name of the agent. These characters can be misinterpreted in system paths or when running other commands, causing server errors.
 +
* '''Server:''' Server that will execute the checks configured in agent monitoring. It is a special parameter when [[Pandora:Documentation_en:HA|HA]] has been configured in the installation.
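
The alias recommendation above can be checked before an agent is created. The following is a minimal Python sketch, not part of Pandora FMS: the function name <code>discouraged_chars_in</code> is hypothetical, and the character set is only the one listed in the recommendation.

```python
from typing import List

# Characters that the recommendation above advises against in agent aliases.
DISCOURAGED_CHARS = set("/\\|%#&$")


def discouraged_chars_in(alias: str) -> List[str]:
    """Return the discouraged characters found in alias, in order of first appearance."""
    found = []
    for ch in alias:
        if ch in DISCOURAGED_CHARS and ch not in found:
            found.append(ch)
    return found


if __name__ == "__main__":
    for alias in ("web-server-01", "db/primary#1"):
        bad = discouraged_chars_in(alias)
        print(alias, "->", "ok" if not bad else "avoid: " + " ".join(bad))
```

An empty result means the alias contains none of the listed characters.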
 +
 
 +
;Advanced view editing interface:
 +
 
 +
[[Image:Configuracion agente consola2.png|center|vista avanzada|800px]]
 +
 
 +
* '''Secondary groups:''' Optional parameter for an agent to belong to more than one group (secondary groups).
 +
* '''Cascade protection services:''' Parameter that helps avoid an avalanche of alerts. Either an agent or an agent module can be chosen. In the first case, while the chosen agent is in critical status, this agent will not generate alerts. In the second case, this agent will not generate alerts only while the specified module is in critical status.
 +
* '''Module definition:''' Three work modes can be selected to define modules.
 +
** '''Learning mode:''' (Default mode) If an XML arrives with new modules, they will be created automatically.
 +
** '''Normal mode:''' If an XML arrives with new modules, they will not be created unless they have already been declared in the console.
 +
** '''Auto-disable mode:''' Same as learning mode, but if all modules go into unknown, the agent will be disabled until information arrives again.
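
The three module definition modes can be summarised in a short Python sketch. It is an illustration of the descriptions above under simplifying assumptions, not the server's implementation: the mode names are written as plain strings, and <code>accepted_modules</code> and <code>should_disable_agent</code> are hypothetical helpers.

```python
def accepted_modules(mode, declared, incoming):
    """Return the incoming module names that end up on the agent.

    mode     -- "learning", "normal" or "autodisable" (illustrative names)
    declared -- module names already declared in the console
    incoming -- module names found in the received XML
    """
    if mode in ("learning", "autodisable"):
        # New modules are created automatically.
        return list(incoming)
    if mode == "normal":
        # Modules not declared in the console are not created.
        return [name for name in incoming if name in declared]
    raise ValueError("unknown mode: %s" % mode)


def should_disable_agent(mode, module_statuses):
    """In auto-disable mode, the agent is disabled when every module is unknown."""
    return (
        mode == "autodisable"
        and len(module_statuses) > 0
        and all(status == "unknown" for status in module_statuses)
    )


if __name__ == "__main__":
    print(accepted_modules("learning", {"cpu"}, ["cpu", "disk"]))
    print(accepted_modules("normal", {"cpu"}, ["cpu", "disk"]))
    print(should_disable_agent("autodisable", ["unknown", "unknown"]))
```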
 +
 
 +
==Agent display==
 +
 
 +
This screen shows plenty of information about the agent, and allows forcing the remote execution and refreshing the data.
 +
 
 +
[[Image:Visualizacion_agente_consola1.png|center|111px]]
 +
 
 +
In the upper section, a summary with the agent data can be seen:
 +
 
 +
[[Image:Visualizacion_agente_consola2.png|center|800px]]
 +
 
 +
* Total of modules and their status.
 +
* Events in the last 24 hours
 +
* Agent Information
 +
** Name
 +
** Version
 +
** Agent accessibility
 +
** Group
 +
 
 +
[[Image:Visualizacion_agente_consola3.png|center|800px]]
 +
 
 +
List of '''initiated''' modules (''Module name'') that belong to the agent, and their corresponding status.
 +
 
 +
 
 +
Finally, the events generated from the agent are displayed.
 +
 
 +
[[Image:Visualizacion_agente_consola4.png|center|800px]]
 +
 
 +
=Modules=
 +
 
 +
Modules are units of information stored within an agent. They are the monitoring elements with which the information is extracted from the device or server to which the agent points.
 +
 
 +
{{Tip|Each module can store only one type of metric.
 +
There cannot be two modules with the same name within the same agent.}}
 +
 
 +
All modules have an associated status, which can be:
 +
 
 +
* '''Not started:''' Where no data has been received yet.
 +
* '''Normal:''' Data is being received with values outside the warning and critical thresholds.
 +
* '''Warning:''' Data is being received with values within the warning threshold.
 +
* '''Critical:''' Data is being received with values within the critical threshold.
 +
* '''Unknown:''' The module has been running and has stopped receiving information for a certain amount of time.
 +
 
 +
Modules have different data types, such as Boolean, numeric or alphanumeric, [[Pandora:QuickGuides_EN:What_Is_Pandora_FMS#Which_kinds_of_modules_are_there.3F|among others]].
 +
 
 +
== Types of modules ==
 +
 
 +
There are several types of modules inside Pandora FMS.
 +
* '''[[Pandora:Documentation_en:Operations|Data module]]:''' This is a type of local monitoring module that runs checks on the system where the agent is located, for example the CPU usage of the device or its free memory.
 +
* '''[[Pandora:Documentation_en:Operations|Network module]]:''' It is a type of remote monitoring module with which checks are made to verify the connection with the device or server to which the agent points, for example whether it is working or whether it has a particular port open.
 +
* '''[[Pandora:Documentation_en:Operations|Plugin module]]:''' This is a type of local or remote monitoring module with which custom checks can be made by creating scripts. They allow more advanced and extensive checks than the ones offered directly by the Pandora FMS console.
 +
* '''[[Pandora:Documentation_en:Operations|WMI module]]:''' This is a type of local monitoring module with which the Windows system can be checked through the WMI protocol, such as obtaining the list of installed services or the current CPU load.
 +
* '''[[Pandora:Documentation_en:Operations|Prediction module]]:''' This is a type of predictive monitoring module with which different arithmetic operations are performed through the consultation of data from other "base" modules, such as the average CPU usage of the monitored servers or the sum of connection latency.
 +
* '''[[Pandora:Documentation_en:Operations|Webserver module]]:''' This is a type of web monitoring that checks the status of a website and obtains data from it, for example whether a website is down or whether it contains a specific word.
 +
* '''[[Pandora:Documentation_en:Operations|Web analysis module]]:''' This is a type of web monitoring that simulates a user's web browsing, such as browsing a website, entering credentials or filling in forms.
 +
 
 +
== Status Monitoring ==
 +
 
 +
When monitoring, values are obtained from a system, whether memory, CPU, hardware temperature, number of connected users, orders on an e-commerce website or any other numerical value. Sometimes only the "absolute value" is relevant, but generally the "relative value" is more useful: a STATUS is associated with these values, so that when they exceed a "THRESHOLD" the status changes, letting you know whether something is right, wrong, or about to go wrong. Therefore, when talking about monitoring, the STATUS concept must be discussed.
 +
 
 +
Pandora FMS allows you to define '''thresholds''' to determine the status that a check will have based on the data it shows. The three possible statuses are: <code>NORMAL</code>, <code>WARNING</code> and <code>CRITICAL</code>. A threshold is a value from which something goes from one status to another. The status of the modules will depend on these thresholds, which are specified by the following parameters present in the configuration of each module:
 +
 
 +
* '''Warning status - Min. Max.''': Lower and upper limits for the <code>warning</code> status. If the numerical value of the module is within this range, the module will go into warning status. If no upper limit is specified, it will be infinite (all values above the lower limit).
 +
* '''Critical status - Min. Max.''': lower and upper limits for the critical status. If the numerical value of the module is in this range, the module will go into critical status. If no upper limit is specified, it will be infinite (all values above the lower limit).
 +
* '''Critical status - Str.''': The same as the previous point but for <code>critical</code> status.
 +
* '''Inverse interval''': present for both <code>warning</code> and <code>critical</code> thresholds. If enabled, the module will change status when its values are '''outside the range''' specified in the thresholds. It also works for alphanumeric modules (string), if the text strings do NOT match the Warning/Critical Str., the module will change its status.
 +
 
 +
[[image:Threshold2.JPG|center|200px]]
 +
 
 +
* '''Warning status''' - '''Str.''': Regular expression for alphanumeric modules (string). If any matches are found, the module will go into <code>warning</code> status.
 +
 
 +
* '''Critical status - Str.''': Regular expression for alphanumeric modules (string). If any matches are found, the module will go into critical status.
 +
 
 +
{{Tip|In case the "warning" and "critical" thresholds match in any range, the "critical" threshold will always prevail.}}
 +
 
 +
=== Numerical thresholds - Case study 1 ===
 +
 
 +
When a module is created, its thresholds are 0 by default. Suppose you monitor the CPU usage percentage and need the module to go into <code>warning</code> (yellow) when usage reaches 70%, and into <code>critical</code> (red) when it reaches 90%. For that, these values must be set:
 +
 
 +
[[image:Threshold3.JPG|center|700px]]
 +
 
 +
When the metric is received from that computer, a value under 70% will be shown in green (<code>NORMAL</code>), a value between 70% and 89.99% in yellow (<code>WARNING</code>) and a value of 90% or more in red (<code>CRITICAL</code>). Due to the way thresholds operate, in cases like this one it is not necessary to set upper limits: if only the lower limit is set, the upper limit is treated as "no limit", so any value above the lower limit falls within the threshold. In addition, if thresholds overlap, the <code>CRITICAL</code> threshold prevails over the <code>WARNING</code> one.
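
The rule described in this case study can be written as a small Python sketch. It is only an illustration of the behaviour described here, not Pandora FMS code: <code>in_range</code> and <code>cpu_status</code> are hypothetical names, a missing upper limit is modelled as <code>None</code>, and treating the limits as inclusive is an assumption.

```python
def in_range(value, lower, upper=None):
    """True if value falls within [lower, upper]; upper=None means "no limit"."""
    if upper is None:
        return value >= lower
    return lower <= value <= upper


def cpu_status(value, warning_min=70.0, critical_min=90.0):
    """Classify a CPU usage percentage using lower limits only."""
    # CRITICAL is checked first because it prevails when thresholds overlap:
    # with no upper limits, every value >= 90 is also >= 70.
    if in_range(value, critical_min):
        return "CRITICAL"
    if in_range(value, warning_min):
        return "WARNING"
    return "NORMAL"


if __name__ == "__main__":
    for usage in (35.0, 70.0, 89.99, 90.0, 100.0):
        print(usage, cpu_status(usage))
```

Checking the critical range before the warning range is what makes the overlapping region resolve to <code>CRITICAL</code>.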
 +
 
 +
=== Text thresholds - Case study 2 ===
 +
 
 +
A module may return some of the following character ''strings'' as collected data:
 +
 
 +
* <code>OK</code>.
 +
* <code>ERROR connection fail</code>.
 +
* <code>BUSY too many devices</code>.
 +
 
 +
By using regular expressions in '''Str.''' fields of the '''Warning Status''' and '''Critical Status''' parameters, as indicated by the picture, you may define alert thresholds.
 +
 
 +
[[image:Threshold4.JPG|center|200px]]
 +
 
 +
{{Tip|Be careful with regular expressions, since they distinguish between uppercase and lowercase: they are ''case sensitive''.}}
 +
 
 +
With this configuration, the module will go into <code>WARNING</code> status when the data contains the string "BUSY", and its status will be <code>CRITICAL</code> when the data contains the string "ERROR".
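
The same logic can be sketched with Python's <code>re</code> module. This is an illustration rather than the server's code: the patterns <code>.*BUSY.*</code> and <code>.*ERROR.*</code> are assumed to be the ones configured in the '''Str.''' fields, and <code>string_status</code> is a hypothetical helper.

```python
import re

# Assumed contents of the Warning/Critical "Str." fields for this case study.
WARNING_STR = r".*BUSY.*"
CRITICAL_STR = r".*ERROR.*"


def string_status(data, warning_str=WARNING_STR, critical_str=CRITICAL_STR):
    """Classify string data; matching is case sensitive, as the tip above warns."""
    # CRITICAL is checked first so that it prevails if both patterns match.
    if re.search(critical_str, data):
        return "CRITICAL"
    if re.search(warning_str, data):
        return "WARNING"
    return "NORMAL"


if __name__ == "__main__":
    for data in ("OK", "ERROR connection fail", "BUSY too many devices",
                 "busy too many devices"):
        print(repr(data), string_status(data))
```

The lowercase <code>busy</code> sample stays in <code>NORMAL</code> because no case-insensitive flag is passed to <code>re.search</code>.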
 +
 
 +
=== Dynamic monitoring (Automatic strings) ===
 +
 
 +
Dynamic monitoring consists of automatically and dynamically adjusting the status thresholds of the modules in an intelligent and predictive way. The procedure consists of collecting the values for a given period and calculating an average and a standard deviation, which are used to establish the corresponding thresholds at module level.
 +
 
 +
==== Possible parameters ====
 +
 
 +
[[Image:Dynamic1.JPG|center|700px]]
 +
 
 +
* '''Dynamic Threshold Interval''': Time interval considered for the threshold calculation. If 1 month is chosen, the system takes all existing data from the last month and builds the thresholds from it; thresholds with values '''over''' the average are set.
 +
* '''Dynamic Threshold Max.''': It allows you to increase the upper limit by the indicated percentage. E.g.: if the average values are around 60 and the critical threshold has been set from 80 on, setting ''Dynamic Threshold Max: 10'' increases the critical threshold by 10%, so it becomes 88.
 +
* '''Dynamic Threshold Two Tailed''': If activated, the dynamic threshold system will also set thresholds ''below'' the average. If unchecked (default) only thresholds with values ''above'' the average will be set.
 +
* '''Dynamic Threshold Min.''': It only applies if the ''Dynamic Threshold Two Tailed'' parameter is active. It allows the lower limit to be reduced by the percentage indicated. E.g.: if the average values are around 60 and the lower critical threshold has been set to 40, if the value ''Dynamic Threshold Min: 10'' is set, the critical threshold will be reduced by 10%, so it would be 36.
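
The percentage adjustments in the two examples above (80 becoming 88, and 40 becoming 36) reduce to simple arithmetic. The Python sketch below reproduces only that arithmetic: it does not show how the server derives the base thresholds from the average and standard deviation, and the function names are hypothetical.

```python
def raise_upper_threshold(threshold, max_percent):
    """Increase an upper threshold by max_percent percent (Dynamic Threshold Max.)."""
    return threshold + threshold * max_percent / 100


def lower_lower_threshold(threshold, min_percent):
    """Reduce a lower threshold by min_percent percent (Dynamic Threshold Min.)."""
    return threshold - threshold * min_percent / 100


if __name__ == "__main__":
    # Values taken from the examples in the parameter list above.
    print(raise_upper_threshold(80, 10))   # upper critical threshold, raised by 10%
    print(lower_lower_threshold(40, 10))   # lower critical threshold, reduced by 10%
```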
 +
 
 +
==== Case study 1 ====
 +
 
 +
In the following example, the calculated average value is at the height of the red line (approx. 30):
 +
 
 +
[[Image:thresh1.JPG|center|560px]]
 +
 
 +
After activating dynamic thresholds, the upper threshold is set as follows (approx. 45 and higher):
 +
 
 +
[[Image:thresh2.JPG|center|560px]]
 +
 
 +
The parameter ''Dynamic Threshold Two Tailed'' has been activated, so a critical threshold below the average values has been set too (approx. 15 and lower):
 +
 
 +
[[Image:thresh3.JPG|center|560px]]
 +
 
 +
Now the parameters ''Dynamic Threshold Min.'' and ''Dynamic Threshold Max.'' have been set to 20 and 30 respectively, so the thresholds have been broadened, making them slightly more permissive:
 +
 
 +
[[Image:thresh4.JPG|center|560px]]
 +
 
 +
==== Case study 2 ====
 +
The starting point is a web latency module. The basic settings shown use a one-week interval:
 +
 
 +
<br>
 +
[[File:dynamic1.JPG|center]]
 +
<br>
 +
 
 +
After saving the changes and running ''pandora_db'', the thresholds are set as follows:
 +
 
 +
<br>
 +
[[File:dynamic2.JPG|center]]
 +
<br>
  
All user interaction with Pandora FMS is done through the WEB console. The Pandora FMS console is a WEB console which follows the latest standards and WEB technologies. It requires an advanced browser and the optional use of Flash. It is recommended to use Firefox 2.x or higher.
+
The module will therefore switch to ''warning'' status when the latency is higher than 0.33 seconds, and to ''critical'' when it is higher than 0.37 seconds. The graph will be shown as follows:
You can also use Internet Explorer 8 or higher, although it gives an uncomfortable user experience due to its peculiar way of managing some WEB controls.
 
  
Generally speaking, monitoring consists of the execution of processes (through modules) in any system in order to send the resulting data to a server. The server processes the resulting data where the front-end (WEB console) is going to display it to the user.
+
<br>
 +
[[File:dynamic3.JPG|center]]
 +
<br>
  
Pandora FMS is a scalable monitoring tool. It would be possible to monitor about 1200 to 1500 agents with a single server, although the number of monitoring processes could grow without restrictions with the correct architecture (Meta Console).
+
The threshold has been considered somewhat permissive, so it has been decided to use the parameter ''Dynamic Threshold Min.'' to lower the minimum thresholds. Since in this case the threshold has no maximum values, because everything above a certain value will be considered incorrect, ''Dynamic Threshold Max.'' will not be used. The modification would look like this:
  
=== Monitoring by Software Agent vs. Remote Monitoring ===
+
<br>
 +
[[File:dynamic4.JPG|center]]
 +
<br>
  
There are two main monitoring procedures with Pandora FMS: The software agent based (local) and the remote one.
+
After applying changes and executing the ''pandora_db'', the thresholds are set as follows:
  
The software agent based monitoring includes a piece of software (module) in the monitored system, e.g. the measurement of the percentage of CPU usage on a certain system while the remote monitoring is done through network tests without the use of modules, e.g. checking if a certain host is active or not.
+
<br>
 +
[[File:dynamic5.JPG|center]]
 +
<br>
  
The main difference between these two types is that whereas the software agents are executed from the monitored system, the remote monitoring is executed from the Pandora FMS Server against the target system.
+
And the graph will look like this:
  
=== Agents on Pandora FMS ===
+
<br>
 +
[[File:dynamic6.JPG|center]]
 +
<br>
  
All monitoring done by Pandora FMS is managed through a generic entity called 'Agent' which is incorporated into a more generic block called 'Group'. An agent can only belong to one group.
+
==== Case study 3 ====
  
Information is logically arranged by means of a hierarchy which is based on groups, agents, module groups and modules. There are Agents which are solely based on the information given by a software agent installed on the system, and Agents with exclusive network information - information that doesn't come from a software agent where installing software is not necessary which would execute the network monitoring tasks from the Pandora FMS Network Servers.
+
In this example, the temperature of a control room or a data center (CPD) is being monitored. The graph shows values with little variation:
  
<center><br><br>
+
<br>
[[Image:AgentHierarchy.png|center|550px]]
+
[[File:dynamic7.JPG|center]]
</center><br><br>
+
<br>
 +
 
 +
In this situation, it is essential that the temperature remains stable and reaches neither overly high nor excessively low values, so the parameter "Dynamic Threshold Two Tailed" is used to set thresholds both above and below. The configuration is as follows:
  
There are also agents which have network information -and- information obtained through software agents.
+
<br>
 +
[[File:dynamic8.JPG|center]]
 +
<br>
  
The information is collected in modules which are logically assigned to Pandora FMS agents in the console. It's important to distinguish the concept of Agents (where the modules which contain the collected info are located) from Software Agents which are getting executed on remote systems.
+
The automatically generated thresholds have been these:
  
=== Status and Event Monitoring ===
+
<br>
 +
[[File:dynamic9.JPG|center]]
 +
<br>
  
With Pandora FMS 3.0, a new important functionality was added; allowing the user to fix standards to define any data in three possible states:
+
And the graph will look like this:
  
'NORMAL', 'WARNING' and 'CRITICAL'.
+
<br>
 +
[[File:dynamic10.JPG|center]]
 +
<br>
  
Automatically, all modules of the 'proc' kind are defined as 'NORMAL' if they have a value of '1' or higher. They will be defined as 'CRITICAL' if they have a value lower than '1' ('0' or a negative value).
+
That way, all values between 23.10 and 26 will be considered normal, since that is the acceptable temperature range in the data center (CPD) or control room. If needed, the "Dynamic Threshold Min." and "Dynamic Threshold Max." parameters can be used again to adjust the thresholds.
  
But what happens with a value of CPU usage? How could the system know if it's a 'NORMAL', 'CRITICAL' or 'WARNING' value? It doesn't know by default - it only gets a numeric value and if nothing has been defined for it, all the values would be 'right' in 'NORMAL' status.
+
==== Additional configuration parameters ====
  
There are two status fields in the agent configuration which haven't been mentioned before. These are:
+
In addition, the following parameters may be set in [[Pandora:Documentation_en:Configuration|''pandora_server.conf'']]:
  
* '''''Warning status'''''
+
* '''dynamic_updates''': This parameter determines how many times thresholds are recalculated during the time period set in ''Dynamic Threshold Interval''; its default value is 5. If ''Dynamic Threshold Interval'' is set to one week, data from the previous week will be collected by default and the calculation will be done just once, repeating the process when one week has gone by. By modifying the ''dynamic_updates'' parameter, you may change this frequency, e.g. a value of 3 will make the thresholds be calculated three times during the week (or the period configured in ''Dynamic Threshold Interval'').
* '''''Critical status'''''
+
* '''dynamic_warning''': It sets the difference, as a percentage, between the <code>warning</code> and <code>critical</code> thresholds. The default value is 25.
 +
* '''dynamic_constant''': It determines the average deviation that will be used to set the thresholds, 10 by default. Higher values will set thresholds farther from average values.
  
Each of those two fields can possess two values: Minimum and Maximum. By configuring them correctly, some values will show a module in 'warning' and others in a 'critical' status:
+
==Common Parameters==
  
 +
<br>
 
<center><br><br>
 
<center><br><br>
[[image:critico.jpg|800px]]
+
[[Image:Parametros_comunes_modulos1.png|center]]
 
</center><br><br>
 
</center><br><br>
 +
<br>
  
To understand these options better, it's best to see an example. The CPU module will always be on 'green' in the agent status, so it simply informs about a value between 0% and 100%. If we want the module of the CPU usage to be shown in yellow ('warning') if it has reached e.g. 70% of its use, and in red ('critical') if it e.g. reached 90%, it's recommended to configure:
+
* '''Using module component:''' Pandora FMS has a repertoire of default modules that can be used. Depending on the selected module, the necessary parameters will be automatically filled in to carry out the monitoring. This token appears in all types of modules except prediction ones.
 +
* '''Dynamic Threshold Interval:''' Token for dynamic monitoring to be explained in a later section.
 +
* '''Warning/Critical Status:''' Token for status monitoring which will be explained in a later section.
  
* Warning status:70
+
[[image:fft.png|center|700px]]
* Critical status:90
 
  
If you're going to reach the 90% value with these settings, the module will be shown in red ('CRITICAL'), if it's between 70% and 89.99%, it will be yellow ('WARNING') and under 70% in green ('NORMAL').
+
* '''Flip-Flop threshold:''' Flip-flop (FF) is a common phenomenon in monitoring: a value fluctuates frequently between alternative values (RIGHT/WRONG). When this happens, a "threshold" is usually applied, so that a status change is only accepted once the element has stayed in the new state for more than N intervals without changing. ''FF threshold'' is used to filter continuous status changes when events and statuses are generated: until an element has held the same new status at least N times after changing from its original status, Pandora FMS will not consider it changed.
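
The filtering idea behind the flip-flop threshold can be sketched in Python. This is a simplified model and not the server's implementation: <code>apply_flip_flop</code> is a hypothetical function, a change is accepted after exactly <code>ff_threshold</code> consecutive samples in the new state (an assumption about the precise count), and the sample sequences are ping results where 1 means up and 0 means down.

```python
def apply_flip_flop(samples, ff_threshold, initial_status):
    """Return the reported status after each sample.

    A new status is reported only after it has been observed
    ff_threshold times in a row.
    """
    reported = initial_status
    candidate = None   # status that is trying to replace the reported one
    streak = 0         # consecutive samples seen with the candidate status
    history = []
    for sample in samples:
        if sample == reported:
            # Back to the reported status: the pending change is discarded.
            candidate, streak = None, 0
        elif sample == candidate:
            streak += 1
        else:
            candidate, streak = sample, 1
        if candidate is not None and streak >= ff_threshold:
            reported = candidate
            candidate, streak = None, 0
        history.append(reported)
    return history


if __name__ == "__main__":
    # Isolated failures never reach three in a row, so the host stays up.
    print(apply_flip_flop([1, 1, 0, 1, 1, 0, 1, 1, 1], 3, 1))
    # Three consecutive failures at the end finally flip the status to down.
    print(apply_flip_flop([1, 1, 0, 1, 0, 0, 0], 3, 1))
```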
  
If we have a module with a string type, you're able to configure the status using a regular expression in the ''Str'' fields of 'Warning' and 'Critical' status parameters. If we have e.g. a module that returns ''OK'', ''ERROR: Connection fail'' or ''BUSY: Too many devices'' it depends on the query result.
+
=== Advanced common parameters ===
  
To configure the 'WARNING' and 'CRITICAL' module status, we will use the following regular expressions:
+
[[Image:Parametros_comunes_modulos2.png|center|700px]]
  
Warning Status: .*BUSY.*
 
Critical Status: .*ERROR.*
 
  
'''You have to be careful here, because these regular expressions are case sensitive'''. With this module configuration, the status will be 'WARNING' if the data contains the string ''BUSY'' and it's going to jump to 'CRITICAL' if the data string contains ''ERROR''.
+
* '''Interval:''' Period in which the module should return data. If a module does not receive data for more than two intervals, it will go into unknown status.
 +
** If they are remote modules: Time period during which the remote check takes place.
 +
** If they are data modules: Value that represents N times the interval of the defined agent; the local check is carried out with that period.
 +
* '''Unit''': Choosing of the unit of the data received by the module, disabled by default (''none''). Available values:
 +
** Timeticks.
 +
** Bytes.
 +
** Entries.
 +
** Files.
 +
** Hits.
 +
** Sessions.
 +
** Users.
 +
** ºC.
 +
** ºF.
 +
* '''Post process:''' Disabled by default (0). It allows a post-processing conversion to be applied to the data received by the module. Available conversions:
 +
** Seconds to months
 +
** Seconds to weeks
 +
** Seconds to days
 +
** Seconds to minutes
 +
** Bytes to Gigabytes
 +
** Bytes to Megabytes
 +
** Bytes to Kilobytes
 +
** Timeticks to weeks
 +
** Timeticks to days
 +
* '''FF interval:''' If the flip-flop threshold is activated and there is a state change, the module interval will be changed for the next execution.
 +
* '''FlipFlop timeout:''' Parameter that can only be used in asynchronous modules. For a state change by flip-flop to be effective, equal consecutive data must be received within the specified interval.
 +
* '''Silent:''' Parameter by which the module will continue to receive information, but no type of event or alert will be generated.
 +
* '''Cascade Protection Services:''' Parameter by which event and alert generation would become part of the service to which it belongs if this feature is enabled.
  
If, by any chance, '''both states are configured with the same values, the 'Critical' value will always have precedence'''. In this case, 'Warning' status is unreachable, because 'Critical' status is more important.
+
[[Image:Parametros_comunes_modulos3.png|center|700px]]
  
This is an example of the modules in each of the states:
+
You may specify time periods during which the module will be executed. The fields follow the nomenclature Minute, Hour, Month Day, Month, Week Day, and there are three different possibilities:
 +
**'''Cron from:''' It has '''Any''' set in all its fields, with no time restriction for monitoring.
 +
** ''Cron from: specific. Cron to: any'': To be executed only when it matches the specified number. E.g.: <code>15 20 * * *</code>, it will be run every day at 20:15
 +
** ''Cron from: specific. Cron to: specific'': It will be run during the established interval. E.g.: <code>5 * * * *</code> and <code>10 * * * *</code> will run it every hour, from minute 5 to minute 10.
 +
* '''Custom macros:'''  Any number of custom module macros may be defined. The recommended format for macro names is:
  
<center><br><br>
+
    _macroname_
[[image:colorin.jpg|center|800px]]
 
</center><br><br>
 
  
It's obvious these fields have no sense for modules which only return boolean values ('1' or '0').
+
For example:
  
These values are shown in the main screen of the monitor view. You're instantly able to tell by taking a quick look how many checks are in the 'Normal', 'Warning' or 'Critical' states.
+
    _technology_
 +
    _modulepriority_
 +
    _contactperson_
  
=== Other Common Monitoring Parameters===
+
These macros can be used in module alerts and are particularly useful in [[Pandora:Documentation_en:User_Monitorization#Custom_macros|WUX monitoring]] and [[Pandora:Documentation_en:User_Monitorization#Custom_macros|user monitoring]] if the module is a web analysis module:
  
==== Historical Data ====
+
Dynamic macros will have a special format starting with @ and will have these possible replacements:
  
<center><br><br>
+
    @DATE_FORMAT (current date/time with user-defined format)
[[image:historicaldata.png]]
+
    @DATE_FORMAT_nh (hours)
</center><br><br>
+
    @DATE_FORMAT_nm (minutes)
 +
    @DATE_FORMAT_nd (days)
 +
    @DATE_FORMAT_ns (seconds)
 +
    @DATE_FORMAT_nM (month)
 +
    @DATE_FORMAT_nY (years)
  
Pandora FMS optionally allows any individual data set to be saved. All modules keep a history (so they're able to generate graphs and include them in reports of the historical / evolutive kind) by default. In a very big implantation which requires a lot of data to be monitored, it's possible that you have no need to keep the history for some, thereby allowing for the possibility of occupying less resources.
+
Where "n" can be an unsigned (positive) or negative number, and FORMAT follows the [http://search.cpan.org/~dexter/POSIX-strftime-GNU-0.02/lib/POSIX/strftime/GNU.pm Perl strftime] conventions.
  
This option allows the history of the modules where you don't need to keep a history to be deactivated. Even if you deactivate the history, the alerts will continue to work in exactly the same way e.g. as event generation and the view of the current state of this monitor.
+
==== '''Tags''' ====
 +
 +
These are tags linked to each module, which are then propagated to the events generated by that module. They can be used in that module's event alerts. Tags are quite useful, since they can work as filters in reports and event views, and they even have their own specific views. Each tag's additional information (URL, email, phone number) can be used in alerts, as it is available as macros.
  
==== FF Threshold ====
+
To create a tag, click on Module tags:
  
<center><br><br>
+
[[Image:module_tags_imagen2.png|center|300px]]
[[image:fft.png]]
 
</center><br><br>
 
  
The FF Threshold Parameter (FF=FlipFlop) is used to 'filter' the continuous changes of the state in the creation of events / statuses. In Pandora FMS, you can indicate that, until an element has adopted the same status at least X times after having changed from an original status, it won't be considered as changed. Let's see an example: a ping to a host where there is packet loss. In an environment like this, it's possible to receive the following results:
+
A tag has a name and a description, and may also have a complete URL, email or phone number associated with it. One or several tags can be associated with the same module. However, they must first be created as described above; they will then be available to be assigned to each module.
  
 +
Within module advanced options, the left column shows the tags available and the right column shows the tags linked to that module:
  
1
+
[[Image:tags_1.png|center|700px]]
1
 
0
 
1
 
1
 
0
 
1
 
1
 
1
 
  
However, the host is alive in all cases. What we really want to tell Pandora FMS is: do not show the host as down until it has reported being down at least three times. In the previous case it would never be shown as down; it would only be shown as down in this case:
+
Furthermore, tags can be used to grant module-specific access permissions, so that a user can access only that agent's module without having access to the remaining modules. This can be seen in the [[Pandora:Documentation_en:Managing_and_Administration|user profiling]] section.
  
1
+
= Module library =
1
 
0
 
1
 
0
 
0
 
0
 
  
From this point it will be shown as down - but not before that.
+
{{Tip|Available from version <b>744</b>. To access the module library from the menu, ''Agent Read'' (AR) permissions are needed.}}
  
So the 'Flip_Flop' protections are pretty useful to avoid disturbing fluctuations. All modules implement it. Its use is to avoid the change of status (limited by the defined or automatic limits, as shown in the case of 'proc' modules).
+
[[Image:homelibreria.png|center|700px]]
  
From 5.1 version, the FF threshold has two modes.
+
The nine most important categories are shown, by clicking on '''See all categories''' you will find the rest of them:
  
* '''All state changing''': same value is used for all state changing, to normal, warning and critical.
+
[[Image:categorylibrary.png|center|700px]]
* '''Each state changing''': different values can be set for each change of status, to normal, warning and critical.
 
  
In async modules, the timeout (FF timeout) can also be set. It's useful if you want to fire an alert only when the data server receives several critical/warning values in a short period of time.
+
In each category all available modules will be shown with a brief description that may be enlarged when clicking on '''More details'''.
When the data arrival interval exceeds the timeout, the FF threshold counter is reset.
 
  
<center>
+
[[Image:modulecategory.png|center|700px]]
[[image:ff_timeout.png|800px]]
 
</center>
 
  
For example, if you want to fire an alert only when an agent sends critical data twice in 5 minutes (you don't want to fire an alert when data arrival interval exceeds 5 minutes.),
+
Note: <b>Pandora FMS</b> <b>Enterprise module</b> download links will only be visible in these cases:
set the FF threshold to 1 and the FF timeout to 300.
+
* The <b>user and password</b> configured in the ''setup'' must match those of <b>Integria IMS</b> support.
 +
* Pandora FMS <b>version</b> must be <b>Enterprise</b>.
 +
* Pandora FMS user has <b>AW permissions</b>.
 +
For more information on how to access the library, visit [https://pandorafms.com/docs/index.php?title=Pandora:Documentation_en:Console_Setup#Module_library Console configuration]
  
 
[[Pandora:Documentation_en|Go back to Pandora FMS documentation index]]
 

Latest revision as of 12:07, 11 January 2021

Go back to Pandora FMS documentation index


1 Introduction to Monitoring

All user interaction with Pandora FMS is done through the web console. The console is accessed through a browser, with no need to install heavy applications, so management is possible from any computer whose browser supports HTML5.

Monitoring is the execution of processes on all types of systems to collect and store information, take action and make decisions based on such data.

Pandora FMS is a scalable monitoring system with multiple features for extending the scope and volume of the information collected almost without limit.

2 Logic agents on Pandora FMS




AgentHierarchy.png



All monitoring done by Pandora FMS is classified into Logic agents. All Logic agents belong to a 'Group'. These agents will be equivalent to each of the different monitored computers, devices, websites or applications.

Logic agents defined in Pandora FMS console may present local information gathered through a software agent, remote information collected through network checks, or both. Therefore, it is worth highlighting the difference between agents as an organizational unit in the Pandora FMS console, and software agents as local data collection services.


2.1 Monitoring by Software Agent vs. Remote Monitoring

Monitoring can be divided into two large groups based on how the information is collected: monitoring based on software agents and remote monitoring.

  • Agent-based monitoring consists of installing a small piece of software that keeps running in the system and obtains information locally through command and script execution.
  • Remote monitoring is the use of the network to run remote checks on systems, without the need to install any additional components on the computer to be monitored.

As it can be seen, software agent based monitoring will obtain information through local checks while remote monitoring will obtain the information through network checks from the Pandora FMS server.

Both agent types share the same general configuration and data display. With Pandora FMS, monitoring can be carried out one way or another and also combined, producing a mixed monitoring.

2.2 Agent setup in the console

Normal view editing interface
Configuracion agente consola1.png
  • Alias: For all Pandora FMS functions to work properly with agents/modules, it is recommended not to use characters such as /, \, |, %, #, & and $ in the name of the agent. These characters can be misinterpreted when the name is used in system paths or in other commands, causing server errors.
  • Server: Server that will execute the checks configured in agent monitoring, special parameter in case of having configured HA in its installation.
Advanced view editing interface
vista avanzada
  • Secondary groups: Optional parameter for an agent to belong to more than one group (secondary groups).
  • Cascade protection services: Parameter with which an avalanche of alerts can be avoided. It is possible to choose an agent or an agent module. In the first case, when the chosen agent is in critical, the agent will not generate alerts. In the second case, only when the specified module is critical, the agent will not generate alerts.
  • Module definition: Three work modes can be selected to define modules.
    • Learning mode: (Default mode) If an XML arrives with new modules, they will be created automatically.
    • Normal mode: If an XML arrives with new modules, they will not be created unless they have already been declared in the console.
    • Auto-disable mode: Same as learning mode, but if all modules go into unknown, the agent will be disabled until information arrives again.

2.3 Agent display

In this screen, plenty of information on the agent can be seen, with the possibility of forcing the remote execution and refreshing the data.

Visualizacion agente consola1.png

In the upper section, a summary with the agent data can be seen:

Visualizacion agente consola2.png
  • Total of modules and their status.
  • Events in the last 24 hours
  • Agent Information
    • Name
    • Version
    • Agent accessibility
    • Group
Visualizacion agente consola3.png

List of initialized modules (module name) that belong to the agent, with their corresponding status.


Finally, the events generated from the agent are displayed.

Visualizacion agente consola4.png

3 Modules

Modules are units of information stored within an agent. They are the monitoring elements with which the information is extracted from the device or server to which the agent points.

Info.png

Each module can store only one type of metric. There cannot be two modules with the same name within the same agent.

 


All modules have an associated status, which can be:

  • Not started: Where no data has been received yet.
  • Normal: Data is being received with values outside the warning and critical thresholds.
  • Warning: Data is being received with values within the warning threshold.
  • Critical: Data is being received with values within the critical threshold.
  • Unknown: The module has been running and has stopped receiving information for a certain amount of time.

Modules have different types of data: such as Boolean, numeric or alphanumeric, among others.

3.1 Types of modules

There are several types of modules inside Pandora FMS.

  • Data module: It is a type of local monitoring module with which checks are made on the system in which the agent is located, such as for example the use of CPU of the device or its free memory.
  • Network module: It is a type of remote monitoring module with which checks are made to verify the connection with the device or server to which the agent points, for example whether it is working or whether it has a particular port open.
  • Plugin module: this is a type of local or remote monitoring module with which custom checks can be made through the creation of scripts. With them, more advanced and extensive checks than the ones proposed directly through Pandora FMS console can be done.
  • WMI module: This is a type of local monitoring module with which the Windows system can be checked through the WMI protocol, such as obtaining the list of installed services or the current CPU load.
  • Prediction module: This is a type of predictive monitoring module with which different arithmetic operations are performed through the consultation of data from other "base" modules, such as the average CPU usage of the monitored servers or the sum of connection latency.
  • Webserver module: This is a type of web monitoring with which checks of the status of a website are made and data is obtained from it, such as for example to see whether a website is down or if it contains a specific word.
  • Web analysis module: This is a type of web monitoring that simulates a user's web browsing, such as browsing a website, entering credentials or filling in forms.

3.2 Status Monitoring

When monitoring, values are obtained from a system, whether memory, CPU, hardware temperature, number of connected users, orders on an e-commerce website or any other numerical value. Sometimes only the "absolute value" of the data is relevant, but generally the "relative value" is more useful: associating a STATUS with these values, so that when they exceed a "THRESHOLD", the status changes to let you know whether something is right, wrong, or about to go wrong. Therefore, when talking about monitoring, the STATUS concept must be discussed.

Pandora FMS allows you to define thresholds to determine the status that a check will have based on the data it shows. The three possible statuses are: NORMAL, WARNING and CRITICAL. A threshold is a value from which something goes from one status to another. The status of the modules will depend on these thresholds, which are specified by the following parameters present in the configuration of each module:

  • Warning status - Min. Max.: Lower and upper limits for the warning status. If the numerical value of the module is within this range, the module will go into warning status. If no upper limit is specified, it will be infinite (all values above the lower limit).
  • Critical status - Min. Max.: lower and upper limits for the critical status. If the numerical value of the module is in this range, the module will go into critical status. If no upper limit is specified, it will be infinite (all values above the lower limit).
  • Critical status - Str.: The same as the previous point but for critical status.
  • Inverse interval: present for both warning and critical thresholds. If enabled, the module will change status when its values are outside the range specified in the thresholds. It also works for alphanumeric modules (string), if the text strings do NOT match the Warning/Critical Str., the module will change its status.
Threshold2.JPG
  • Warning status - Str.: Regular expression for alphanumeric modules (string). If any matches are found, the module will go into warning status.
  • Critical status - Str.: Regular expression for alphanumeric modules (string). If any matches are found, the module will go into critical status.

Info.png

In case the "warning" and "critical" thresholds match in any range, the "critical" threshold will always prevail.

 


3.2.1 Numerical thresholds - Case study 1

When a module is created, its thresholds have the value 0 by default. Suppose you monitor the CPU usage percentage and need the module to go into warning (yellow) when usage reaches 70% and into critical (red) when it reaches 90%. These values must be set as follows:

Threshold3.JPG

When the metric is received from that computer, the module will be green (NORMAL) if the data is under 70%, yellow (WARNING) between 70% and 89.99%, and red (CRITICAL) at 90% or more. Due to the way the thresholds operate, in cases like this one it is not necessary to set upper limits. That is because, if only the lower limit is set, the upper limit is treated as "no limit", so any value above the lower limit is considered to be within the threshold. In addition, if thresholds overlap, the CRITICAL threshold will prevail over the WARNING one.
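The evaluation described in this case study can be sketched in a few lines of Python. This is only an illustration of the rules stated above (a missing upper limit means "no limit", and CRITICAL prevails over WARNING), not Pandora FMS code. The function names and the choice to treat both limits as inclusive are assumptions, and the special handling of the default 0 thresholds is not modelled.

```python
import math


def in_range(value, lower, upper=None):
    """Return True if value is within [lower, upper]; no upper limit means infinity."""
    top = math.inf if upper is None else upper
    return lower <= value <= top


def module_status(value, warning_min, critical_min, warning_max=None, critical_max=None):
    """Classify a numeric value (hypothetical helper, limits assumed inclusive)."""
    # Critical is evaluated first so that it prevails when the ranges overlap.
    if in_range(value, critical_min, critical_max):
        return "CRITICAL"
    if in_range(value, warning_min, warning_max):
        return "WARNING"
    return "NORMAL"


# CPU usage example: warning from 70%, critical from 90%, no upper limits.
for cpu in (42.0, 70.0, 89.99, 90.0):
    print(cpu, module_status(cpu, warning_min=70, critical_min=90))
```

With only the lower limits set, a value of 95 matches both ranges, and the sketch reports CRITICAL because that check runs first.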

3.2.2 Text thresholds - Case study 2

A module may return any of the following character strings as collected data:

  • OK.
  • ERROR connection fail.
  • BUSY too many devices.

By using regular expressions in Str. fields of the Warning Status and Critical Status parameters, as indicated by the picture, you may define alert thresholds.

Threshold4.JPG

Info.png

Be careful with regular expressions: they are case sensitive, so they distinguish between uppercase and lowercase.

 


With this configuration, the module will go into WARNING status when the data contains the string "BUSY", and its status will be CRITICAL when the data contains the string ERROR.
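As a rough illustration of this configuration, the sketch below uses Python's `re` module as a stand-in for the regular expression engine used by Pandora FMS, which may differ in syntax details. The function name is hypothetical. The patterns `BUSY` and `ERROR` are the ones from the case study.

```python
import re


def string_status(data, warning_re, critical_re):
    """Classify a text value using the Warning/Critical Str. patterns (illustrative only)."""
    # Critical is checked first so it prevails if both patterns match.
    if critical_re and re.search(critical_re, data):
        return "CRITICAL"
    if warning_re and re.search(warning_re, data):
        return "WARNING"
    return "NORMAL"


samples = (
    "OK",
    "ERROR connection fail",
    "BUSY too many devices",
    "busy too many devices",  # lowercase: does not match the pattern "BUSY"
)
for sample in samples:
    print(repr(sample), string_status(sample, warning_re="BUSY", critical_re="ERROR"))
```

The last sample stays NORMAL because matching is case sensitive, which is the pitfall the note above warns about.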

3.2.3 Dynamic monitoring (Automatic strings)

Dynamic monitoring consists of automatically and dynamically adjusting the status thresholds of the modules in an intelligent and predictive way. The procedure consists of collecting the values for a given period and calculating an average and a standard deviation, which are used to establish the corresponding thresholds at module level.

3.2.3.1 Possible parameters

Dynamic1.JPG
  • Dynamic Threshold Interval: Time interval to be considered for threshold calculation. If 1 month is chosen, the system will take all existing data from the last month and build the thresholds based on that data and thresholds with values over the average will be set.
  • Dynamic Threshold Max.: It allows you to increase the upper limit by the indicated percentage. E.g.: if the average values are around 60 and the critical threshold has been set from 80 on, setting Dynamic Threshold Max: 10 increases the critical threshold by 10%, so it would be 88.
  • Dynamic Threshold Two Tailed: If activated, the dynamic threshold system will also set thresholds below the average. If unchecked (default) only thresholds with values above the average will be set.
  • Dynamic Threshold Min.: It only applies if the Dynamic Threshold Two Tailed parameter is active. It allows the lower limit to be reduced by the percentage indicated. E.g.: if the average values are around 60 and the lower critical threshold has been set to 40, if the value Dynamic Threshold Min: 10 is set, the critical threshold will be reduced by 10%, so it would be 36.
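The percentage adjustments in the two examples above are plain arithmetic and can be checked with a short sketch. The function names are hypothetical. How Pandora FMS derives the base thresholds from the average and standard deviation is not covered here.

```python
def raise_upper_limit(threshold, max_percent):
    """Increase an upper threshold by the Dynamic Threshold Max. percentage."""
    return threshold + threshold * max_percent / 100


def lower_lower_limit(threshold, min_percent):
    """Reduce a lower threshold by the Dynamic Threshold Min. percentage."""
    return threshold - threshold * min_percent / 100


print(raise_upper_limit(80, 10))  # upper critical threshold 80 raised by 10%
print(lower_lower_limit(40, 10))  # lower critical threshold 40 reduced by 10%
```

These calls reproduce the values 88 and 36 given in the parameter descriptions.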

3.2.3.2 Case study 1

In the following example, the calculated average value is at the height of the red line (approx. 30):

Thresh1.JPG

When dynamic thresholds are activated, the upper threshold is set as follows (approx. 45 and higher):

Thresh2.JPG

The parameter Dynamic Threshold Two Tailed has been activated, so a critical threshold below the average values has been set too (approx. 15 and lower):

Thresh3.JPG

Now the parameters Dynamic Threshold Min. and Dynamic Threshold Max. have been set to 20 and 30 respectively, so the thresholds have been broadened and are slightly more permissive:

Thresh4.JPG

3.2.3.3 Case study 2

The starting point is from a web latency module. The featured basic settings take into account a week interval:


Dynamic1.JPG


When saving changes, after running pandora_db, the thresholds have been set in this way:


Dynamic2.JPG


The module will therefore switch to warning status when the alteration is higher than 0.33 seconds, and to critical when it is higher than 0.37 seconds. The graph will be shown as follows:


Dynamic3.JPG


The threshold has been considered to be somewhat permissive, so it has been decided to make use of the parameter Dynamic Threshold Min. to lower the minimum thresholds. Since in this case the threshold has no maximum values, because everything above a certain value will be considered incorrect, Dynamic Threshold Max. will not be used. The modification would look like this:


Dynamic4.JPG


After applying changes and executing the pandora_db, the thresholds are set as follows:


Dynamic5.JPG


And the graph will look like this:


Dynamic6.JPG


3.2.3.4 Case study 3

In this example, the temperature of a control room or a data center (CPD) is being monitored. The graph shows values with little variation:


Dynamic7.JPG


In this situation, it is essential that the temperature remains stable and does not reach overly high values, neither excessively low, so the parameter "Dynamic Threshold Two Tailed" is used to set thresholds both above and below. The configuration is as follows:


Dynamic8.JPG


The automatically generated thresholds have been these:


Dynamic9.JPG


And the graph will look like this:


Dynamic10.JPG


That way, all values between 23.10 and 26 will be considered normal, since that is the acceptable temperature range in the data center or control room. If needed, the "Dynamic Threshold Min." and "Dynamic Threshold Max." parameters can be used again to adjust the thresholds.

3.2.3.5 Additional configuration parameters

In addition, the following parameters may be set in pandora_server.conf:

  • dynamic_updates: This parameter determines how many times thresholds are recalculated during the time period set in Dynamic Threshold Interval; its default value is 5. If Dynamic Threshold Interval is configured with a value of 1 week, data from the previous week will be collected by default and calculations will be done just once, repeating the process after one week goes by. By modifying the dynamic_updates parameter, you may adjust the frequency, e.g. a value of 3 will make thresholds be calculated three times during the week (or the period configured in Dynamic Threshold Interval).
  • dynamic_warning: It sets the difference, as a percentage, between the warning and critical thresholds. The default value is 25.
  • dynamic_constant: It determines the average deviation that will be used to set the thresholds, 10 by default. Higher values will set thresholds farther from average values.

3.3 Common Parameters




Parametros comunes modulos1.png



  • Using module component: Pandora FMS has a repertoire of default modules that can be used. Depending on the selected module, the necessary parameters will be automatically filled in to carry out the monitoring. This token appears in all types of modules except prediction ones.
  • Dynamic Threshold Interval: Token for dynamic monitoring to be explained in a later section.
  • Warning/Critical Status: Token for status monitoring which will be explained in a later section.
Fft.png
  • Flip-Flop threshold: Flip-flop (FF) is a common phenomenon in monitoring: a value fluctuates frequently between alternative states (RIGHT/WRONG). To handle it, a "threshold" is usually set, so that something is only considered to have changed status once it has "stayed" in the new state for more than N intervals without changing. The FF threshold is used to 'filter' continuous status changes when creating events/statuses: until an element has held the same new status at least N times after changing from its original status, Pandora FMS will not consider it as changed.
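The flip-flop filter can be illustrated with a minimal Python sketch. It assumes that a new status is accepted once it has been observed `threshold` times in a row. The exact way Pandora FMS counts repetitions (for instance, whether the first changed sample is included) may differ, and the function name is hypothetical. In the sample data, 1 means the host answered the ping and 0 means it did not.

```python
def apply_flip_flop(samples, threshold, initial):
    """Return the status reported after each sample, filtering short fluctuations."""
    reported = initial          # status currently shown to the user
    candidate, run = initial, 0  # pending new status and how often it repeated
    result = []
    for status in samples:
        if status == reported:
            # Back to the reported status: discard any pending change.
            candidate, run = reported, 0
        elif status == candidate:
            run += 1
        else:
            candidate, run = status, 1
        if candidate != reported and run >= threshold:
            reported, run = candidate, 0
        result.append(reported)
    return result


# Isolated packet losses never reach three consecutive failures.
print(apply_flip_flop([1, 1, 0, 1, 1, 0, 1, 1, 1], threshold=3, initial=1))
# Three failures in a row: the host is reported as down on the last sample.
print(apply_flip_flop([1, 1, 0, 1, 0, 0, 0], threshold=3, initial=1))
```

In the first series the reported status stays at 1 throughout, while in the second it only switches to 0 on the final sample.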

3.3.1 Advanced common parameters

Parametros comunes modulos2.png


  • Interval: Period in which the module should return data. If a module does not receive data for more than two intervals, it will go into unknown status.
    • If they are remote modules: Time period during which the remote check takes place.
    • If they are data modules: Remote module that represents N times the interval of the defined agent, doing the local check during that time.
  • Unit: Choosing of the unit of the data received by the module, disabled by default (none). Available values:
    • Timeticks.
    • Bytes.
    • Entries.
    • Files.
    • Hits.
    • Sessions.
    • Users.
    • ºC.
    • ºF.
  • Post process: Disabled by default (0). It allows post-processing to be applied, that is, a conversion of the data received by the module. Available conversions:
    • Seconds to months
    • Seconds to weeks
    • Seconds to days
    • Seconds to minutes
    • Bytes to Gigabytes
    • Bytes to Megabytes
    • Bytes to Kilobytes
    • Timeticks to weeks
    • Timeticks to days
  • FF interval: If the flip-flop threshold is activated and there is a state change, the module interval will be changed for the next execution.
  • FlipFlop timeout: Parameter that can only be used in asynchronous modules. For a state change by flip-flop to be effective, equal consecutive data must be received within the specified interval.
  • Silent: Parameter by which the module will continue to receive information, but no type of event or alert will be generated.
  • Cascade Protection Services: Parameter by which event and alert generation would become part of the service to which it belongs if this feature is enabled.
Parametros comunes modulos3.png

You may specify time periods when the module will be executed. It follows the nomenclature: Minute, Hour, Month Day, Month, Week Day, and there are three different possibilities.

    • Cron from: It has Any set in all its fields, with no time restriction for monitoring.
    • Cron from: specific. Cron to: any: To be executed only when it matches the specified number. E.g.: 15 20 * * *, it will be run every day at 20:15
    • Cron from: specific. Cron to: specific: It will be run during the established interval. E.g.: 5 * * * * and 10 * * * * will run every hour from minute 5 to minute 10.
  • Custom macros: Any number of custom module macros may be defined. The recommended format for macro names is:
   _macroname_

For example:

   _technology_
   _modulepriority_
   _contactperson_

These macros can be used in module alerts and are particularly useful in WUX monitoring and user monitoring when the module is a web analysis module.

Dynamic macros will have a special format starting with @ and will have these possible replacements:

   @DATE_FORMAT (current date/time with user-defined format)
   @DATE_FORMAT_nh (hours)
   @DATE_FORMAT_nm (minutes)
   @DATE_FORMAT_nd (days)
   @DATE_FORMAT_ns (seconds)
   @DATE_FORMAT_nM (month)
   @DATE_FORMAT_nY (years)

Where "n" can be a number without a sign (positive) or negative and FORMAT follows the perl strftime.

3.3.1.1 Tags

Tags are labels linked to each module that are later propagated to the events generated by that module. They can be used in that module's event alerts. Tags are quite useful since they can work as filters in reports and event views, and they even have their own specific views. Each tag's additional information (URL, email, phone number) can be used in alerts, as it is available as macros.

To be able to create a tag, click on Module tags:

Module tags imagen2.png

A tag lets you define a name and a description, and optionally add the complete URL, email or phone number associated with that tag. It is worth highlighting that one or several tags can be associated with the same module. However, they must first be created as previously described, and then they will be available to be assigned to each module.

Within module advanced options, the left column shows the tags available and the right column shows the tags linked to that module:

Tags 1.png

Furthermore, tags can be used to grant module-specific access permissions, so that a user can access only that agent's module without having access to the remaining modules. This can be seen in the user profiling section.

4 Module library

Info.png

Available from version 744. To access the module library from the menu, Agent Read (AR) permissions are needed.

 


Homelibreria.png

The nine most important categories are shown. Click on See all categories to find the rest of them:

Categorylibrary.png

In each category all available modules will be shown with a brief description that may be enlarged when clicking on More details.

Modulecategory.png

Note: Pandora FMS Enterprise module download links will only be visible in these cases:

  • The user and password configured in the setup must match those of Integria IMS support.
  • Pandora FMS version must be Enterprise.
  • Pandora FMS user has AW permissions.

For more information on how to access the library, visit Console configuration
