Pandora: Documentation en: Intro Monitoring



1 Introduction to Monitoring

All user interaction with Pandora FMS is done through the web console. The console is accessed through a browser, with no need to install heavy applications, so the system can be managed from any computer whose browser supports HTML5.

Monitoring is the execution of processes on all types of systems to collect and store information, take action and make decisions based on such data.

Pandora FMS is a scalable monitoring system with multiple features that extend the scope and volume of the information collected almost without limit.

2 Logic agents on Pandora FMS




AgentHierarchy.png



All monitoring done by Pandora FMS is classified into Logic agents. All Logic agents belong to a 'Group'. These agents will be equivalent to each of the different monitored computers, devices, websites or applications.

Logic agents defined in Pandora FMS console may present local information gathered through a software agent, remote information collected through network checks, or both. Therefore, it is worth highlighting the difference between agents as an organizational unit in the Pandora FMS console, and software agents as local data collection services.


2.1 Monitoring by Software Agent vs. Remote Monitoring

Monitoring can be divided into two large groups based on how the information is collected: monitoring based on software agents and remote monitoring.

  • Agent-based monitoring consists of installing a small piece of software that keeps running on the system and obtains information locally through command and script execution.
  • Remote monitoring is the use of the network to run remote checks on systems, without the need to install any additional components on the computer to be monitored.

As can be seen, software agent based monitoring obtains information through local checks, while remote monitoring obtains information through network checks run from the Pandora FMS server.

Both agent types share the same general configuration and data display. With Pandora FMS, monitoring can be carried out either way, or both ways combined, producing mixed monitoring.

2.2 Agent setup in the console

Normal view editing interface
Configuracion agente consola1.png
  • Alias: For all Pandora FMS functions to work properly with agents and modules, it is recommended not to use characters such as /, \, |, %, #, & and $ in the agent name. These characters can be misinterpreted when the name is used in system paths or in commands, causing server errors.
  • Server: Server that will execute the checks configured for the agent. This parameter is particularly relevant when HA (high availability) has been configured in the installation.
Advanced view editing interface
vista avanzada
  • Secondary groups: Optional parameter for an agent to belong to more than one group (secondary groups).
  • Cascade protection services: Parameter used to avoid an avalanche of alerts. It is possible to choose an agent or an agent module. In the first case, when the chosen agent is in critical status, this agent will not generate alerts. In the second case, this agent will not generate alerts only when the specified module is in critical status.
  • Module definition: Three work modes can be selected to define modules.
    • Learning mode: (Default mode) If an XML arrives with new modules, they will be created automatically.
    • Normal mode: If an XML arrives with new modules, they will not be created automatically; only modules that have already been declared in the console are processed.
    • Auto-disable mode: Same as learning mode, but if all modules go into unknown, the agent will be disabled until information arrives again.

2.3 Agent display

This screen shows detailed information about the agent and makes it possible to force remote execution and refresh the data.

Visualizacion agente consola1.png

In the upper section, a summary with the agent data can be seen:

Visualizacion agente consola2.png
  • Total of modules and their status.
  • Events in the last 24 hours
  • Agent Information
    • Name
    • Version
    • Agent accessibility
    • Group
Visualizacion agente consola3.png

List of initialized modules (by module name) that belong to the agent, together with their corresponding status.


Finally, the events generated from the agent are displayed.

Visualizacion agente consola4.png

3 Modules

Modules are units of information stored within an agent. They are the monitoring elements with which the information is extracted from the device or server to which the agent points.

Note: Each module can store only one type of metric. There cannot be two modules with the same name within the same agent.


All modules have an associated status, which can be:

  • Not started: No data has been received yet.
  • Normal: Data is being received with values outside the warning and critical thresholds.
  • Warning: Data is being received with values within the warning threshold.
  • Critical: Data is being received with values within the critical threshold.
  • Unknown: The module has been running and has stopped receiving information for a certain amount of time.

Modules can hold different types of data, such as Boolean, numeric or alphanumeric, among others.

3.1 Types of modules

There are several types of modules inside Pandora FMS.

  • Data module: This is a type of local monitoring module with which checks are made on the system where the agent is located, for example the CPU usage of the device or its free memory.
  • Network module: This is a type of remote monitoring module with which checks are made to verify the connection with the device or server to which the agent points, for example whether it is working or whether it has a particular port open.
  • Plugin module: This is a type of local or remote monitoring module with which custom checks can be made through scripts. They allow more advanced and extensive checks than the ones offered directly by the Pandora FMS console.
  • WMI module: This is a type of local monitoring module with which a Windows system can be checked through the WMI protocol, for example to obtain the list of installed services or the current CPU load.
  • Prediction module: This is a type of predictive monitoring module with which arithmetic operations are performed on data taken from other "base" modules, for example the average CPU usage of the monitored servers or the sum of connection latencies.
  • Webserver module: This is a type of web monitoring with which the status of a website is checked and data is obtained from it, for example to see whether a website is down or whether it contains a specific word.
  • Web analysis module: This is a type of web monitoring with which a user's web browsing is simulated, for example browsing a website, entering credentials or filling in forms.

3.2 Status Monitoring

When monitoring, values are obtained from a system: memory, CPU, hardware temperature, number of connected users, orders on an e-commerce website or any other numerical value. Sometimes only the "absolute value" of the data is relevant, but generally the "relative value" is more useful: a STATUS is associated with these values so that, when they exceed a "THRESHOLD", the status changes to let you know whether something is right, wrong, or about to go wrong. Therefore, when talking about monitoring, the STATUS concept must be discussed.

Pandora FMS allows you to define thresholds to determine the status that a check will have based on the data it shows. The three possible statuses are: NORMAL, WARNING and CRITICAL. A threshold is a value from which something goes from one status to another. The status of the modules will depend on these thresholds, which are specified by the following parameters present in the configuration of each module:

  • Warning status - Min. Max.: Lower and upper limits for the warning status. If the numerical value of the module is within this range, the module will go into warning status. If no upper limit is specified, it will be infinite (all values above the lower limit).
  • Warning status - Str.: Regular expression for alphanumeric modules (string). If any matches are found, the module will go into warning status.
  • Critical status - Min. Max.: Lower and upper limits for the critical status. If the numerical value of the module is within this range, the module will go into critical status. If no upper limit is specified, it will be infinite (all values above the lower limit).
  • Critical status - Str.: Regular expression for alphanumeric modules (string). If any matches are found, the module will go into critical status.
  • Inverse interval: Present for both warning and critical thresholds. If enabled, the module will change status when its values are outside the range specified in the thresholds. It also works for alphanumeric (string) modules: if the text strings do NOT match the Warning/Critical Str. expression, the module will change its status.
Threshold2.JPG

Note: If the "warning" and "critical" thresholds overlap in any range, the "critical" threshold will always prevail.


3.2.1 Numerical thresholds - Case study 1

When a module is created, its thresholds have the value 0 by default. Suppose you monitor the CPU usage percentage and need the module to go into warning (yellow) when usage reaches 70%, and into critical (red) when it reaches 90%. These values must be set as follows:

Threshold3.JPG

When the metric is received from that computer, the module will be green (NORMAL) if the data is under 70%, yellow (WARNING) between 70% and 89.99%, and red (CRITICAL) from 90% upwards. Due to the way thresholds operate, in cases like this one it is not necessary to set upper limits: if only the lower limit is set, the upper limit is taken as "no limit", so any value above the lower limit falls within the threshold. In addition, if thresholds overlap, the CRITICAL threshold prevails over the WARNING one.
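
The behaviour described in this case study can be sketched in a few lines of Python. This is only an illustration, not Pandora FMS code: the names module_status, in_threshold, CPU_WARNING and CPU_CRITICAL are hypothetical, None is used here to represent an upper limit that has not been set, and both limits are assumed to be inclusive.

```python
from typing import Optional, Tuple

# (min, max) pair; max=None stands for "no upper limit" (an assumption of this sketch).
Threshold = Tuple[float, Optional[float]]

CPU_WARNING: Threshold = (70.0, None)
CPU_CRITICAL: Threshold = (90.0, None)


def in_threshold(value: float, threshold: Threshold) -> bool:
    """Return True when value lies within the threshold range (limits assumed inclusive)."""
    low, high = threshold
    return value >= low and (high is None or value <= high)


def module_status(value: float, warning: Threshold, critical: Threshold) -> str:
    """Map a numeric value to NORMAL, WARNING or CRITICAL."""
    # Critical is checked first so that it prevails where both ranges overlap.
    if in_threshold(value, critical):
        return "CRITICAL"
    if in_threshold(value, warning):
        return "WARNING"
    return "NORMAL"


for cpu in (50, 70, 89.99, 90):
    print(cpu, module_status(cpu, CPU_WARNING, CPU_CRITICAL))
```

Because neither threshold has an upper limit, a value of 95 falls inside both ranges, and checking the critical range first is what makes it CRITICAL rather than WARNING.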

3.2.2 Text thresholds - Case study 2

A module may return any of the following character strings as collected data:

  • OK.
  • ERROR connection fail.
  • BUSY too many devices.

By using regular expressions in Str. fields of the Warning Status and Critical Status parameters, as indicated by the picture, you may define alert thresholds.

Threshold4.JPG

Note: Be careful with regular expressions: they are case sensitive, so they distinguish between uppercase and lowercase.


With this configuration, the module will go into WARNING status when the data contains the string "BUSY", and into CRITICAL status when the data contains the string "ERROR".
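
As a rough illustration of this case study, the matching logic could look like the Python sketch below. The function name string_module_status is hypothetical, the patterns BUSY and ERROR are taken from the text above, and Python's re module is used as a stand-in for whatever regular expression engine the server uses.

```python
import re

WARNING_STR = "BUSY"    # pattern assumed for the Warning status Str. field
CRITICAL_STR = "ERROR"  # pattern assumed for the Critical status Str. field


def string_module_status(data: str,
                         warning_str: str = WARNING_STR,
                         critical_str: str = CRITICAL_STR) -> str:
    """Return the status of an alphanumeric module for the received data."""
    # Critical is evaluated first so that it prevails if both patterns match.
    if re.search(critical_str, data):
        return "CRITICAL"
    if re.search(warning_str, data):
        return "WARNING"
    return "NORMAL"


for sample in ("OK.", "ERROR connection fail.", "BUSY too many devices."):
    print(sample, "->", string_module_status(sample))
```

Matching is case sensitive, so in this sketch a lowercase "busy" would leave the module in NORMAL status.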

3.2.3 Dynamic monitoring (Automatic strings)

Dynamic monitoring consists of automatically and dynamically adjusting the status thresholds of the modules in an intelligent and predictive way. The procedure consists of collecting the values for a given period and calculating an average and a standard deviation, which are used to establish the corresponding thresholds at module level.

3.2.3.1 Possible parameters

Dynamic1.JPG
  • Dynamic Threshold Interval: Time interval to be considered for threshold calculation. If 1 month is chosen, the system will take all existing data from the last month and build the thresholds from that data. By default, only thresholds with values above the average are set.
  • Dynamic Threshold Max.: It allows you to increase the upper limit by the indicated percentage. E.g.: if the average values are around 60 and the critical threshold has been set from 80 on, setting Dynamic Threshold Max: 10 increases the critical threshold by 10%, so it would be 88.
  • Dynamic Threshold Two Tailed: If activated, the dynamic threshold system will also set thresholds below the average. If unchecked (default), only thresholds with values above the average will be set.
  • Dynamic Threshold Min.: It only applies if the Dynamic Threshold Two Tailed parameter is active. It allows the lower limit to be reduced by the indicated percentage. E.g.: if the average values are around 60 and the lower critical threshold has been set to 40, setting Dynamic Threshold Min: 10 reduces the critical threshold by 10%, so it would be 36.
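
The arithmetic behind the two percentage parameters can be checked with a short Python sketch. The function names are hypothetical and only reproduce the two examples given in the list above; how the server derives the initial thresholds from the average and standard deviation is not covered here.

```python
def apply_dynamic_max(upper_threshold: float, max_percent: float) -> float:
    """Raise an upper threshold by the given percentage (Dynamic Threshold Max.)."""
    return upper_threshold + upper_threshold * max_percent / 100


def apply_dynamic_min(lower_threshold: float, min_percent: float) -> float:
    """Reduce a lower threshold by the given percentage (Dynamic Threshold Min.)."""
    return lower_threshold - lower_threshold * min_percent / 100


print(apply_dynamic_max(80, 10))  # upper critical threshold from the example
print(apply_dynamic_min(40, 10))  # lower critical threshold from the example
```

Both adjustments move the threshold away from the average, which makes the module more permissive.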

3.2.3.2 Case study 1

In the following example, the calculated average value is at the height of the red line (approx. 30):

Thresh1.JPG

When dynamic thresholds are activated, the upper threshold is set as follows (approx. 45 and higher):

Thresh2.JPG

The parameter Dynamic Threshold Two Tailed has been activated, so a critical threshold below the average values has been set too (approx. 15 and lower):

Thresh3.JPG

Now the parameters Dynamic Threshold Min. and Dynamic Threshold Max. have been set to 20 and 30 respectively, so the thresholds have been broadened, becoming slightly more permissive:

Thresh4.JPG

3.2.3.3 Case study 2

The starting point is a web latency module. The basic settings shown take a one-week interval into account:


Dynamic1.JPG


When saving changes, after running pandora_db, the thresholds have been set in this way:


Dynamic2.JPG


The module will therefore switch to warning status when the latency is higher than 0.33 seconds, and to critical when it is higher than 0.37 seconds. The graph will be shown as follows:


Dynamic3.JPG


The thresholds have been considered somewhat permissive, so it has been decided to use the Dynamic Threshold Min. parameter to lower the minimum thresholds. Since in this case the thresholds have no maximum values, because everything above a certain value is considered incorrect, Dynamic Threshold Max. will not be used. The modification would look like this:


Dynamic4.JPG


After applying the changes and running pandora_db, the thresholds are set as follows:


Dynamic5.JPG


And the graph will look like this:


Dynamic6.JPG


3.2.3.4 Case study 3

In this example, the temperature of a control room or a data center (CPD) is being monitored. The graph shows values with little variation:


Dynamic7.JPG


In this situation, it is essential that the temperature remains stable and reaches neither overly high nor excessively low values, so the parameter "Dynamic Threshold Two Tailed" is used to set thresholds both above and below the average. The configuration is as follows:


Dynamic8.JPG


The automatically generated thresholds are the following:


Dynamic9.JPG


And the graph will look like this:


Dynamic10.JPG


That way, all values between 23.10 and 26 will be considered normal, since that is the acceptable temperature range in the data center or control room. If needed, the "Dynamic Threshold Min." and "Dynamic Threshold Max." parameters can again be used to adjust the thresholds.

3.2.3.5 Additional configuration parameters

In addition, the following parameters may be set in pandora_server.conf:

  • dynamic_updates: This parameter determines how many times thresholds are recalculated during the time period set in Dynamic Threshold Interval. Its default value is 5. If Dynamic Threshold Interval is set to 1 week, data from the previous week will be collected by default and the calculation will be done just once, repeating the process after one week goes by. By modifying the dynamic_updates parameter you may change that frequency: e.g. a value of 3 will make thresholds be calculated three times during the week (or the period configured in Dynamic Threshold Interval).
  • dynamic_warning: Difference, in percentage, between the warning and critical thresholds. Default value: 25.
  • dynamic_constant: It determines the average deviation that will be used to set the thresholds, 10 by default. Higher values will set thresholds farther from average values.
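
As an illustration, a pandora_server.conf fragment with these three tokens set to the default values quoted above might look like the following. It assumes the usual "token value" line syntax of that file, with # for comments.

```
# Dynamic threshold settings (values shown are the defaults stated above)
dynamic_updates 5
dynamic_warning 25
dynamic_constant 10
```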

3.3 Common Parameters




Parametros comunes modulos1.png



  • Using module component: Pandora FMS has a repertoire of default modules that can be used. Depending on the selected module, the necessary parameters will be automatically filled in to carry out the monitoring. This token appears in all types of modules except prediction ones.
  • Dynamic Threshold Interval: Token for dynamic monitoring, explained in the Dynamic monitoring section (3.2.3).
  • Warning/Critical Status: Token for status monitoring, explained in the Status Monitoring section (3.2).
Fft.png
  • Flip-Flop threshold: Flip-flop (FF) is a common phenomenon in monitoring: a value fluctuates frequently between alternative states (RIGHT/WRONG). To handle it, a "threshold" is usually applied, so that something is only considered to have changed status after it has stayed in the new state for more than N intervals without changing. The FF threshold is used to filter continuous status changes when creating events/statuses: Pandora FMS will not consider an element as changed until it has reported the same new status at least N times in a row after leaving its original status.
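
The filtering effect of a flip-flop threshold can be illustrated with a minimal Python sketch. It is not the server's implementation: the function name apply_flip_flop is hypothetical, and it assumes that a status change is confirmed once the new status has been observed N consecutive times.

```python
from typing import List


def apply_flip_flop(raw_statuses: List[str], ff_threshold: int) -> List[str]:
    """Return the confirmed status after each raw observation."""
    if not raw_statuses:
        return []
    confirmed = raw_statuses[0]   # status currently considered valid
    candidate = confirmed         # new status waiting to be confirmed
    streak = 0                    # consecutive observations of the candidate
    confirmed_history = []
    for status in raw_statuses:
        if status == confirmed:
            # Back to the confirmed status: discard any pending change.
            candidate, streak = confirmed, 0
        elif status == candidate:
            streak += 1
        else:
            candidate, streak = status, 1
        if candidate != confirmed and streak >= ff_threshold:
            confirmed, streak = candidate, 0
        confirmed_history.append(confirmed)
    return confirmed_history


raw = ["NORMAL", "CRITICAL", "NORMAL", "CRITICAL", "CRITICAL", "CRITICAL", "NORMAL"]
print(apply_flip_flop(raw, ff_threshold=3))
```

In this run the isolated CRITICAL value is ignored, and the change to CRITICAL is only accepted at the third consecutive CRITICAL observation.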

3.3.1 Advanced common parameters

Parametros comunes modulos2.png


  • Interval: Period in which the module should return data. If a module does not receive data for more than two intervals, it will go into unknown status.
    • If they are remote modules: Time period after which the remote check is repeated.
    • If they are data modules: Number N of agent intervals that make up the module interval; the local check is carried out once in that period.
  • Unit: Unit of the data received by the module, disabled by default (none). Available values:
    • Timeticks.
    • Bytes.
    • Entries.
    • Files.
    • Hits.
    • Sessions.
    • Users.
    • ºC.
    • ºF.
  • Post process: Disabled by default (0). It allows you to specify a post-processing step, that is, a conversion of the data received by the module. Available conversions:
    • Seconds to months
    • Seconds to weeks
    • Seconds to days
    • Seconds to minutes
    • Bytes to Gigabytes
    • Bytes to Megabytes
    • Bytes to Kilobytes
    • Timeticks to weeks
    • Timeticks to days
  • FF interval: If the flip-flop threshold is activated and there is a state change, the module interval will be changed for the next execution.
  • FlipFlop timeout: Parameter that can only be used in asynchronous modules. For a state change by flip-flop to be effective, equal consecutive data must be received within the specified interval.
  • Silent: Parameter by which the module will continue to receive information, but no type of event or alert will be generated.
  • Cascade Protection Services: Parameter by which event and alert generation would become part of the service to which it belongs if this feature is enabled.
Parametros comunes modulos3.png

You may specify the time periods in which the module will be executed. They follow the nomenclature Minute, Hour, Month Day, Month, Week Day, and there are three different possibilities:

    • Cron from: any. All fields are set to Any, with no time restriction for monitoring.
    • Cron from: specific. Cron to: any: The module is executed only when the time matches the specified values. E.g.: 15 20 * * * will run every day at 20:15.
    • Cron from: specific. Cron to: specific: The module is executed during the established interval. E.g.: 5 * * * * and 10 * * * * will run every hour, from minute 5 to minute 10.
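
The three Cron from / Cron to combinations can be sketched in Python as follows. This is an interpretation of the rules above, not the server's code: the function name cron_allows is hypothetical, "*" stands for Any, ranges are assumed to be inclusive, week days are assumed to follow the classic cron numbering (0 = Sunday), and ranges that wrap around (for example from hour 22 to hour 2) are not handled.

```python
from datetime import datetime


def _field_allows(value: int, start: str, end: str) -> bool:
    """Check one cron field against its Cron from / Cron to pair."""
    if start == "*":            # Any: no restriction on this field
        return True
    if end == "*":              # specific / any: exact match required
        return value == int(start)
    return int(start) <= value <= int(end)  # specific / specific: inclusive range


def cron_allows(cron_from: str, cron_to: str, moment: datetime) -> bool:
    """Return True when the module would be executed at the given moment."""
    values = (
        moment.minute,
        moment.hour,
        moment.day,
        moment.month,
        (moment.weekday() + 1) % 7,  # assumption: 0 = Sunday, as in classic cron
    )
    pairs = zip(values, cron_from.split(), cron_to.split())
    return all(_field_allows(value, start, end) for value, start, end in pairs)


print(cron_allows("15 20 * * *", "* * * * *", datetime(2024, 1, 1, 20, 15)))
print(cron_allows("5 * * * *", "10 * * * *", datetime(2024, 1, 1, 13, 7)))
```

Both calls correspond to the examples in the list above: the first one checks the daily 20:15 execution, the second one a moment inside the minute 5 to minute 10 window.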
  • Custom macros: Any number of custom module macros may be defined. The recommended format for macro names is:
   _macroname_

For example:

   _technology_
   _modulepriority_
   _contactperson_

These macros can be used in module alerts and are particularly useful in WUX monitoring and user monitoring if the module is a web analysis module.

Dynamic macros will have a special format starting with @ and will have these possible replacements:

   @DATE_FORMAT (current date/time with user-defined format)
   @DATE_FORMAT_nh (hours)
   @DATE_FORMAT_nm (minutes)
   @DATE_FORMAT_nd (days)
   @DATE_FORMAT_ns (seconds)
   @DATE_FORMAT_nM (month)
   @DATE_FORMAT_nY (years)

Where "n" can be an unsigned (positive) or negative number, and FORMAT follows the Perl strftime format.
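
To make the pattern more concrete, here is a hedged Python sketch of how such a macro could be expanded. It rests on assumptions that the text above does not confirm: that FORMAT and the offset are written literally after @DATE_ (for example @DATE_%H:%M:%S_-1h), and that Python's strftime is an acceptable stand-in for Perl's. The function name expand_date_macro is hypothetical, and month (nM) and year (nY) offsets are left out for brevity.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Offset units handled by this sketch; month (M) and year (Y) offsets are omitted.
_UNITS = {"h": "hours", "m": "minutes", "d": "days", "s": "seconds"}
_MACRO = re.compile(r"@DATE_(?P<fmt>.+?)(?:_(?P<n>-?\d+)(?P<unit>[hmds]))?")


def expand_date_macro(macro: str, now: Optional[datetime] = None) -> str:
    """Expand a date macro into a formatted date/time string."""
    match = _MACRO.fullmatch(macro)
    if match is None:
        raise ValueError(f"not a date macro: {macro!r}")
    moment = now or datetime.now()
    if match.group("unit"):
        # Shift the reference time by n units (n may be negative).
        offset = {_UNITS[match.group("unit")]: int(match.group("n"))}
        moment += timedelta(**offset)
    return moment.strftime(match.group("fmt"))


reference = datetime(2024, 1, 1, 12, 0, 0)
print(expand_date_macro("@DATE_%Y-%m-%d", reference))
print(expand_date_macro("@DATE_%H:%M:%S_-1h", reference))
```

Passing an explicit reference time, as done here, keeps the result reproducible; without it the current date and time are used.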

3.3.1.1 Tags

Tags are labels linked to each module that are later propagated to the events generated by that module. They can be used in that module's event alerts. Tags are quite useful since they can work as filters in reports and event views, and they even have their own specific views. Each tag's additional information (URL, email, phone number) can be used in alerts, since it is available as macros.

To be able to create a tag, click on Module tags:

Module tags imagen2.png

A tag has a name and a description, and it is also possible to add the complete URL, email or phone number associated with that tag. One or several tags can be associated with the same module. However, they must first be created as previously described, and then they will be available to be assigned to each module.

Within module advanced options, the left column shows the tags available and the right column shows the tags linked to that module:

Tags 1.png

Furthermore, tags can be used to grant module-specific access permissions, so that a user can access only that module of the agent without having access to the remaining modules. This can be seen in the user profiling section.

4 Module library

Note: Available from version 744. To access the module library from the menu, Agent Read (AR) permissions are needed.


Homelibreria.png

The nine most important categories are shown. By clicking on See all categories you will find the rest of them:

Categorylibrary.png

In each category, all available modules are shown with a brief description that can be expanded by clicking on More details.

Modulecategory.png

Note: Pandora FMS Enterprise module download links will only be visible in these cases:

  • The user and password configured in the setup must match those of Integria IMS support.
  • Pandora FMS version must be Enterprise.
  • Pandora FMS user has AW permissions.

For more information on how to access the library, visit Console configuration.
