Pandora: Documentation en: PandoraFMS Engineering
- 1 Pandora FMS Engineering Details
- 1.1 Pandora FMS Database Design
- 1.2 Status of The Modules in Pandora FMS
- 1.3 Pandora FMS graphs
1 Pandora FMS Engineering Details
This section explains some of the design principles and particularities of Pandora FMS.
1.1 Pandora FMS Database Design
The first versions of Pandora FMS, from version 0.83 to version 1.1, were based on a very simple idea: one piece of data, one insertion in the database. This allowed the program to perform simple searches, insertions and other operations.
Although this design had some advantages, it had one big disadvantage: scalability. Such a system has a specific limit on the maximum number of modules it can support, and with a significant amount of data (more than 5 million elements) performance decreased.
On the other hand, solutions based on MySQL Cluster are not easy: even though they allow managing a higher load, they entail some problems of their own, and they do not offer a long-term solution to this performance problem with higher data loads.
The current version of Pandora FMS implements real-time data compression for each insertion. It also allows data compaction based on interpolation. In addition, the maintenance task automatically deletes data older than a certain period of time.
The new Pandora FMS processing system keeps only «new» data. If a duplicated value enters the system, it is not stored in the database. This is very useful to keep the database to a minimum, and it works for all Pandora FMS module types: numeric, incremental, boolean and string. For boolean data the compression ratio is very high, since these values rarely change. Nevertheless, «index» elements are stored every 24 hours, so there is always a minimum of information to use as a reference when compacting the data.
This system solves part of the scalability problem, reducing database usage by 40%-70%, but there are other ways to increase scalability. Pandora FMS allows its components to be split up to balance the data file processing load and the network module execution across different servers. It is possible to have several Pandora FMS servers (network, data or SNMP servers), Pandora FMS Web consoles, and also a database or a high-performance cluster (with MySQL 5) on different servers.
These adjustments imply big changes when reading or interpreting data. In the latest Pandora FMS versions, the graph engine has been redesigned from scratch to represent data quickly with the new data storage model. Compression has certain implications when reading and interpreting data graphically. Imagine an agent that cannot communicate with Pandora FMS: the Pandora FMS server does not receive data from that agent, and there is a period of time during which the server has no information from that agent's modules. If you open the graph of one of those modules, the interval with no data is represented as if nothing had changed, as a horizontal line. If Pandora FMS does not receive new values, it assumes there were no changes and everything looks as it did in the last notification.
To see a graphic example, this image shows the changes for each piece of data, received every 180 seconds.
This would be the equivalent graph for the same data, except for a connection failure from approximately 05:55 to 15:29.
Pandora FMS 1.3 added a new general graph for agents. It shows the agent's connectivity and the access rate from the module to the agent. This graph complements the other graphs that are shown when the agent has activity and receives data. This is an example of an agent that is regularly connected to the server:
If this graph shows low peaks, there could be problems or slow connections between the Pandora FMS agent and the Pandora FMS server, or connectivity problems from the network server.
From Pandora FMS version 5 onwards, a new feature makes it possible to cross the "unknown module" events with the graphs, so that the graph shows the periods during which the data was in unknown status. This complements the graph for better understanding, for example:
1.1.1 Other DB technical aspects
Throughout software updates, small improvements have been made to the relational model of the Pandora FMS database. One of the changes is indexing by module type. That way, Pandora FMS can access information more quickly, since it is broken down into different tables. Tables can also be partitioned (by timestamp) to further improve access performance for historical data.
In addition, the numerical representation of timestamps (in UNIX _timestamp_ format) speeds up date range searches, date comparisons, etc. This work has allowed a significant improvement in search and insertion times.
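As a minimal sketch of why numeric timestamps help, the following Python example converts dates to UNIX timestamps and performs a range search with plain integer comparisons over a sorted list. The function names and the in-memory list are illustrative assumptions; they are not part of Pandora FMS, which performs these searches in the database.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timezone


def to_utimestamp(year, month, day, hour=0, minute=0):
    """Convert a UTC date to an integer UNIX timestamp (hypothetical helper)."""
    moment = datetime(year, month, day, hour, minute, tzinfo=timezone.utc)
    return int(moment.timestamp())


def range_search(utimestamps, start, end):
    """Return the timestamps within [start, end] from a sorted list.

    With integers, a date range search is two binary searches,
    with no date parsing or string comparison involved.
    """
    low = bisect_left(utimestamps, start)
    high = bisect_right(utimestamps, end)
    return utimestamps[low:high]


# One sample every 6 hours during three days, starting on 1970-01-01.
samples = [hour * 3600 for hour in range(0, 72, 6)]
second_day = range_search(samples,
                          to_utimestamp(1970, 1, 2),
                          to_utimestamp(1970, 1, 2, 23, 59))
print(second_day)
```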
1.1.2 Database Main Tables
This is an ER diagram and a detailed description of the main tables of the Pandora FMS database.
- taddress: It contains agent additional addresses.
- taddress_agent: Addresses linked to an agent (rel. taddress/tagent).
- tagente: It contains the information of Pandora FMS agents.
  - id_agent: Agent unique identifier.
  - name: Agent name (case sensitive).
  - address: Agent address. Additional addresses can be assigned through taddress.
  - comentarios: Free text.
  - id_group: Identifier of the group the agent belongs to (ref. tgrupo).
  - last_contact: Date of the last agent contact, either through a software agent or through a remote module.
  - mode: Agent running mode: 0 normal, 1 training.
  - interval: Agent execution interval. Depending on this interval, the agent will be shown as out of limits.
  - id_os: Agent operating system identifier (ref. tconfig_os).
  - os_version: Operating system version (free text).
  - agent_version: Agent version (free text). Updated by software agents.
  - last_remote_contact: Date of the last contact received from the agent. For software agents, and unlike last_contact, the date is sent by the agent itself.
  - disabled: Agent status: enabled (0) or disabled (1).
  - id_parent: Identifier of the parent agent (ref. tagent).
  - custom_id: Agent custom identifier. Useful to interact with other tools.
  - server_name: Name of the server the agent is assigned to.
  - cascade_protection: Cascade protection. Disabled at 0. At 1, it prevents the alerts associated to the agent from being triggered if a critical alert of the parent agent has been triggered. For more information, see the section about Alerts.
- tagent_data: Data received from each module. If the last data received for a module is the same as the previous one, it is not added (but tagent_status is updated). Incremental and string type data are saved in different tables.
- tagent_data_inc: Incremental data type.
- tagent_data_string: String data type.
- tagent_status: Information on the current status of each module.
  - id_agent_status: Identifier.
  - id_agent_module: Module identifier (ref. tagent_module).
  - data: Value of the last received data.
  - timestamp: Date of the last data received (it could come from the agent).
  - status: Module status: 0 NORMAL, 1 CRITICAL, 2 WARNING, 3 UNKNOWN.
  - id_agent: Identifier of the agent associated to the module (ref. tagent).
  - last_try: Date of the module's last successful execution.
  - utimestamp: Date of the module's last execution in UNIX format.
  - current_interval: Module execution interval in seconds.
  - running_by: Name of the server that executed the module.
  - last_execution_try: Date of the last attempt to execute the module. The execution could have failed.
  - status_changes: Number of status changes. It is used to avoid continuous status changes. For more information, see the Operation section.
  - last_status: Previous module status.
- tagent_module: Module configuration.
  - id_agent_module: Module unique identifier.
  - id_agente: Identifier of the agent associated to the module (ref. tagent).
  - id_tipe_module: Type of module (ref. ttipo_modulo).
  - description: Free text.
  - name: Module name.
  - max: Module maximum value. Data higher than this value will not be valid.
  - min: Module minimum value. Data lower than this value will not be valid.
  - module_interval: Module execution interval in seconds.
  - tcp_port: Destination TCP port in network modules and plugins. Name of the column to read in WMI modules.
  - tcp_send: Data to send in network modules. Namespace in WMI modules.
  - tcp_rcv: Expected answer in network modules.
  - snmp_community: SNMP community in network modules. Filter in WMI modules.
  - snmp_oid: OID in network modules. WQL query in WMI modules.
  - ip_target: Destination address in network, plugin and WMI modules.
  - id_module_group: Identifier of the group the module belongs to (ref. tmodule_group).
  - flag: Forced execution flag. If set to 1, the module is executed even if its interval has not elapsed yet.
  - id_modulo: Identifier for modules that cannot be recognized by their id_module_type: 6 for WMI modules, 7 for Web modules.
  - disabled: Module status: 0 enabled, 1 disabled.
  - id_export: Identifier of the export server associated to the module (ref. tserver).
  - plugin_user: Username in plugin and WMI modules, user-agent in Web modules.
  - plugin_pass: Password in plugin and WMI modules, number of retries in Web modules.
  - plugin_parameter: Additional parameters in plugin modules, configuration of the Goliat task in Web modules.
  - id_plugin: Identifier of the plugin associated to the module in plugin modules (ref. tplugin).
  - post_process: Value the module data will be multiplied by before being saved.
  - prediction_module: 1 if it is a prediction module, 2 if it is a service module, 3 if it is synthetic and 0 in any other case.
  - max_timeout: Waiting time in seconds for plugin modules.
  - custom_id: Module custom identifier. Useful to interact with other tools.
  - history_data: If set to 0, module data will not be saved in tagent_data*; only tagent_status will be updated.
  - min_warning: Minimum value that activates the WARNING status.
  - max_warning: Maximum value that activates the WARNING status.
  - min_critical: Minimum value that activates the CRITICAL status.
  - max_critical: Maximum value that activates the CRITICAL status.
  - min_ff_event: Number of times that a status change condition must be met before the change takes place. It is related to tagent_status.status_changes.
  - delete_pending: If set to 1, the module will be deleted by the database maintenance script pandora_db.pl.
  - custom_integer_1: When prediction_module equals 1, this field is the id of the module from which the data for predictions are obtained. When prediction_module equals 2, this field is the id of the service assigned to the module.
- tagent_access: A new entry is added each time data are received from an agent by any server, but never more than one per minute, to avoid overloading the database. It can be disabled by setting agentaccess to 0 in the pandora_server.conf configuration file.
- talert_snmp: SNMP alert configuration.
- talert_commands: Commands that can be executed from actions associated to an alert (e.g. send mail).
- talert_actions: Command instance associated to any alert (e.g. send mail to administrator).
- talert_templates: Alert templates.
  - id: Template unique identifier.
  - name: Template name.
  - description: Description.
  - id_alert_action: Identifier of the default action associated to the template.
  - field1: Custom field 1 (free text).
  - field2: Custom field 2 (free text).
  - field3: Custom field 3 (free text).
  - type: Kind of alert, depending on the firing condition ('regex', 'max_min', 'max', 'min', 'equal', 'not_equal', 'warning', 'critical').
  - value: Value for alerts of kind regex (free text).
  - matches_value: If set to 1, it inverts the logic of the firing condition.
  - max_value: Maximum value for max_min and max alerts.
  - min_value: Minimum value for max_min and min alerts.
  - time_threshold: Alert interval.
  - max_alerts: Maximum number of times that an alert will be fired during an interval.
  - min_alerts: Minimum number of times that the firing condition must be met during an interval for the alert to be fired.
  - time_from: Time from which the alert will be active.
  - time_to: Time until which the alert will be active.
  - monday: If set to 1, the alert is active on Mondays.
  - tuesday: If set to 1, the alert is active on Tuesdays.
  - wednesday: If set to 1, the alert is active on Wednesdays.
  - thursday: If set to 1, the alert is active on Thursdays.
  - friday: If set to 1, the alert is active on Fridays.
  - saturday: If set to 1, the alert is active on Saturdays.
  - sunday: If set to 1, the alert is active on Sundays.
  - recovery_notify: If set to 1, alert recovery is activated.
  - field2_recovery: Custom field 2 for alert recovery (free text).
  - field3_recovery: Custom field 3 for alert recovery (free text).
  - priority: Alert priority: 0 Maintenance, 1 Informational, 2 Normal, 3 Warning, 4 Critical.
  - id_group: Identifier of the group the template belongs to (ref. tgrupo).
- talert_template_modules: Instance of an alert template associated to a module.
  - id: Alert unique identifier.
  - id_agent_module: Identifier of the module associated to the alert (ref. tagente_modulo).
  - id_alert_template: Identifier of the template associated to the alert (ref. talert_templates).
  - internal_counter: Number of times that the alert firing condition has occurred.
  - last_fired: Last time the alert was fired (Unix time).
  - last_reference: Start of the current interval (Unix time).
  - times_fired: Number of times the alert was fired (it could be different from internal_counter).
  - disabled: At 1 the alert is deactivated.
  - priority: Alert priority: 0 Maintenance, 1 Informational, 2 Normal, 3 Warning, 4 Critical.
  - force_execution: At 1 the action of the alert will be executed even though the alert has not been fired. It is used for manual alert execution.
- talert_template_module_actions: Instance of an action associated to one alert (ref. talert_template_modules).
- talert_compound: Compound alerts, the columns are similar to the talert_templates.
- talert_compound_elements: Simple alerts associated to a compound alert, each one with its corresponding logic operation (ref. talert_template_modules).
- talert_compound_actions: Actions associated with a compound alert (ref. talert_compound).
- tattachment: Attachments associated to one incident.
- tconfig: Console configuration.
- tconfig_os: Valid operating systems in Pandora FMS.
- tevento: Event entries. The severity values are the same as for the alerts.
- tgrupo: Defined groups in Pandora FMS.
- tincidencia: Incident entries
- tlanguage: Available languages in Pandora FMS.
- tlink: Links shown at the bottom of the console menu.
- tnetwork_component: Network components. They are modules associated to a network profile used by the Recon Server. They later result in an entry in tagente_modulo, so the columns of both tables are similar.
- tnetwork_component_group: Groups to classify the network components.
- tnetwork_profile: Network profile. Group of network components that will be assigned to the recon tasks of the Recon Server. The network components associated to the profile will result in modules in the created agents.
- tnetwork_profile_component: Network components associated to a network profile (rel. tnetwork_component/tnetwork_profile).
- tnota: Notes associated to an incident.
- torigen: Possible origins of an incident.
- tperfil: User profiles defined at the console.
- trecon_task: Recon tasks the Recon Server performs.
- tserver: Registered servers.
- tsesion: Information on the actions that took place during a user session, for administration and statistical logs.
- ttipo_modulo: Kinds of modules depending on their origin and kind of data.
- ttrap: SNMP traps received by the SNMP console.
- tusuario: Registered users at the console.
- tusuario_perfil: Profiles associated to a user (rel. tusuario/tperfil).
- tnews: News shown at the console.
- tgraph: Customized graphs created in the console.
- tgraph_source: Modules associated to a graph (rel. tgraph/tagente_modulo).
- treport: Customized reports created at the console.
- treport_content: Elements associated to one report.
- treport_content_sla_combined: Components of an SLA element associated to one report.
- tlayout: Customized maps created at the console.
- tlayout_data: Elements associated to a map.
- tplugin: Plugin definitions for the Plugin Server.
- tmodule: Kinds of modules (Network, Plugin, WMI...).
- tserver_export: Configured destinations for the Export Server.
- tserver_export_data: Data to export, associated to a destination.
- tplanned_downtime: Planned downtimes.
- tplanned_downtime_agents: Agents associated to a planned downtime (rel. tplanned_downtime/tagente).
1.1.3 Data Compression in Real Time
To avoid overloading the database, the server performs a simple compression at insertion time. A piece of data is not stored in the database unless it differs from the previous one or 24 hours have passed between them.
For example, assuming an interval of approximately 1 hour, the sequence 0,1,0,0,0,0,0,0,1,1,0,0 is kept in the database as 0,1,0,1,0. No further consecutive 0 is stored unless 24 hours have passed.
The following graph has been drawn from the data of the previous example. Only the data in red has been inserted in the database.
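The rule can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the server's actual code: the function name is hypothetical, and it assumes the 24-hour rule is measured against the last value that was actually stored.

```python
DAY_SECONDS = 24 * 3600


def compress(samples):
    """Return the (timestamp, value) pairs that would be stored.

    A sample is kept only if its value differs from the last stored
    one, or if at least 24 hours have passed since the last stored
    sample (assumption: the reference is the last *stored* sample).
    """
    stored = []
    for timestamp, value in samples:
        if not stored:
            stored.append((timestamp, value))
            continue
        last_timestamp, last_value = stored[-1]
        if value != last_value or timestamp - last_timestamp >= DAY_SECONDS:
            stored.append((timestamp, value))
    return stored


# The sequence from the example, received once per hour.
values = [0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
samples = [(hour * 3600, value) for hour, value in enumerate(values)]
print([value for _, value in compress(samples)])  # [0, 1, 0, 1, 0]
```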
Compression affects the data processing algorithms, both for metrics and for graphs, and it is important to keep in mind that the gaps caused by compression must be filled in.
Considering all of the above, in order to perform calculations with the data of a module for a given interval and start date, follow these steps:
- Search for the last data before the given interval and date. If it exists, place it at the beginning of the interval. If it does not exist, there was no previous data.
- Search for the next data after the given interval and date, up to a maximum equal to the module interval. If it exists, place it at the end of the interval. If not, extend the last available value until the end of the interval.
- Go through all the data, considering that each piece of data is valid until the next one is received.
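The steps above can be sketched as follows. This is a hedged illustration in Python, not the code used by Pandora FMS: the function names, the in-memory list of stored samples and the time-weighted average used as the example metric are all assumptions.

```python
def fill_range(stored, start, end, module_interval):
    """Rebuild the data of [start, end] from compressed samples.

    stored: list of (timestamp, value) sorted by timestamp.
    Returns a list of points whose first and last timestamps are the
    range boundaries whenever data is available.
    """
    points = [(ts, v) for ts, v in stored if start <= ts <= end]
    before = [(ts, v) for ts, v in stored if ts < start]
    after = [(ts, v) for ts, v in stored if end < ts <= end + module_interval]

    # Step 1: the last value before the range opens the range.
    if before and (not points or points[0][0] > start):
        points.insert(0, (start, before[-1][1]))
    if not points:
        return []  # there was no data at all

    # Step 2: close the range with the next value if it arrives within
    # one module interval; otherwise extend the last available value.
    if points[-1][0] < end:
        closing_value = after[0][1] if after else points[-1][1]
        points.append((end, closing_value))
    return points


def time_weighted_average(points):
    """Step 3: each value is valid until the next point is reached."""
    span = points[-1][0] - points[0][0]
    if span == 0:
        return float(points[0][1])
    total = sum(value * (next_ts - ts)
                for (ts, value), (next_ts, _) in zip(points, points[1:]))
    return total / span


# Compressed data of the previous example (hours 0, 1, 2, 8 and 10).
stored = [(0, 0), (3600, 1), (7200, 0), (28800, 1), (36000, 0)]
points = fill_range(stored, start=5400, end=32400, module_interval=3600)
print(points)
print(time_weighted_average(points))
```

In this example the range starts between two stored samples, so the value 1 stored at 3600 is carried to the start of the range before the average is computed.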
1.1.4 Data compaction
Pandora FMS includes a system to "compact" database information. This system is focused on small and mid-size deployments (250-500 agents, < 100,000 modules) that want to keep a long history at the cost of "losing" some resolution.
The Pandora FMS database maintenance, which is executed every day, scans for old data that can be compacted. The compaction is done using simple linear interpolation: if you have 10,000 points of information in a day, the interpolation process replaces those 10,000 points with 1,000 points.
This obviously "loses" information, because it is an interpolation, but it also saves database storage, and long-term graphs (monthly, yearly) look mostly the same. In big databases this behaviour can be "costly" in terms of database performance; in that case it should be disabled and the history database model should be used instead.
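A minimal sketch of this kind of reduction, assuming the compacted points are spread evenly over the original time span and each new value is linearly interpolated between its two nearest original neighbours. The function name and the even spacing are assumptions made for illustration; the maintenance script may choose its points differently.

```python
from bisect import bisect_right


def compact(points, target_count):
    """Replace sorted (timestamp, value) points with target_count
    evenly spaced points obtained by linear interpolation."""
    if target_count < 2 or len(points) <= target_count:
        return list(points)
    times = [ts for ts, _ in points]
    first, last = times[0], times[-1]
    step = (last - first) / (target_count - 1)
    result = []
    for i in range(target_count):
        t = first + i * step
        j = bisect_right(times, t)  # first original point later than t
        if j >= len(points):
            # At (or past) the last original point: reuse its value.
            result.append((t, points[-1][1]))
            continue
        (t0, v0), (t1, v1) = points[j - 1], points[j]
        weight = (t - t0) / (t1 - t0)
        result.append((t, v0 + weight * (v1 - v0)))
    return result


# 10,000 original points reduced to 1,000.
original = [(second, float(second % 600)) for second in range(10000)]
compacted = compact(original, 1000)
print(len(original), "->", len(compacted))
```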
1.1.5 History database
This is an Enterprise feature used to store information older than a given point in time (for example, data older than one month) in a different database. This database must be on a different physical server (not a virtualized one). When you request, for example, a one-year data graph, Pandora FMS automatically looks up the first XX days in the "realtime/main" database and the rest of the information in the history database. This avoids performance penalties when you store a huge amount of information in your system.
To configure it, you need to manually set up a history database on another server (importing the Pandora FMS DB schema into it, without data) and set up permissions to allow access to it from the main Pandora FMS server.
Go to Setup -> History database and configure the settings to access the history database there.
Some interesting settings that need to be explained:
- Days: Maximum number of days that information is stored in the main database. After that, data is moved to the history database. 30 days is a good default.
- Step: This acts like a buffer: the database maintenance script takes XX records from the main database, inserts them in the history database and deletes them from the main database. This is time-consuming, and the right size depends on your setup; 1000 is a good default value.
- Delay: After each block of step records, the script waits for delay seconds. Useful to avoid locks if your database performance is poor. Only use values between 1 and 5.
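How the three settings interact can be sketched as follows. This is an illustration only: plain Python lists stand in for the two databases, and the function name, the row layout and the injectable sleep parameter are assumptions rather than part of the maintenance script.

```python
import time


def move_to_history(main_rows, history_rows, now,
                    days=30, step=1000, delay=1, sleep=time.sleep):
    """Move rows older than `days` from main_rows to history_rows.

    Rows are (utimestamp, value) tuples. They are moved in blocks of
    `step` rows, pausing `delay` seconds between blocks.
    Returns the number of rows moved.
    """
    cutoff = now - days * 86400
    old_rows = [row for row in main_rows if row[0] < cutoff]
    moved = 0
    for offset in range(0, len(old_rows), step):
        block = old_rows[offset:offset + step]
        history_rows.extend(block)                 # insert into history
        block_set = set(block)
        main_rows[:] = [row for row in main_rows   # delete from main
                        if row not in block_set]
        moved += len(block)
        if offset + step < len(old_rows):
            sleep(delay)                           # let the database breathe
    return moved


# 100 days of data, one row per day; the sleep is skipped for the demo.
main = [(day * 86400, day) for day in range(100)]
history = []
count = move_to_history(main, history, now=100 * 86400,
                        days=30, step=25, delay=1, sleep=lambda s: None)
print(count, len(main), len(history))
```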
The default configuration of Pandora FMS does NOT transfer string type data to the history database. However, if this configuration has been modified and the history database is receiving this type of information, it is essential to configure its purging; otherwise it will end up taking too much space, causing big problems and having a negative impact on performance.
To configure this parameter, run a query directly in the database to set the number of days after which this information will be purged. The table involved is tconfig and the token is string_purge. For example, to purge this type of information after 30 days, run the following query directly on the history database:
UPDATE tconfig SET value = 30 WHERE token = 'string_purge';
A good way to test this is to run the database maintenance script manually. It should not report any error.
1.2 Status of The Modules in Pandora FMS
In Pandora FMS, modules can have different statuses: Unknown, Normal, Warning, Critical or with Fired Alerts.
1.2.1 When is Each Status Set?
Each module has Warning and Critical thresholds set in its configuration. These thresholds define the data values for which each of these statuses is activated. If the module returns data outside these thresholds, it is considered to be in Normal status.
Each module also has a time interval that sets the frequency with which it gets data. The console takes this interval into account: if the module has gone twice its interval without collecting data, it is considered to be in Unknown status.
Finally, if the module has alerts configured and any of them has been fired and has not been validated, the module will have the corresponding Fired Alert status.
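The rules of this section can be put together in a short sketch. It is a simplified illustration, not the console's code: the function and parameter names are hypothetical, thresholds are assumed to be inclusive (minimum, maximum) ranges, and the order in which the checks are made is also an assumption.

```python
def module_status(value, last_data, now, interval,
                  warning, critical, alert_fired=False):
    """Return the status of a module.

    warning and critical are (minimum, maximum) tuples; last_data and
    now are UNIX timestamps; interval is in seconds.
    """
    if alert_fired:                       # fired and not yet validated
        return "Fired Alert"
    if now - last_data >= 2 * interval:   # twice the interval without data
        return "Unknown"
    if critical[0] <= value <= critical[1]:
        return "Critical"
    if warning[0] <= value <= warning[1]:
        return "Warning"
    return "Normal"                       # outside both thresholds


# A CPU usage module with a 300 second interval.
limits = {"warning": (80, 90), "critical": (90, 100)}
print(module_status(50, last_data=1000, now=1200, interval=300, **limits))
print(module_status(85, last_data=1000, now=1200, interval=300, **limits))
print(module_status(95, last_data=1000, now=1200, interval=300, **limits))
print(module_status(95, last_data=1000, now=1600, interval=300, **limits))
```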
1.2.2 Spreading and Priority
In Pandora's organization, some elements depend on others, for example the modules of an agent or the agents of a group. This also applies to Pandora FMS Enterprise policies, which have some agents associated, as well as some modules associated to each agent.
This structure is especially useful to evaluate the status of the modules easily. This is achieved by propagating the status upwards through the organization, which gives a status to agents, groups and policies.
1.2.2.1 Which status will an Agent have?
An agent will have the worst status of its modules. Recursively, a group will have the worst status of the agents that belong to it, and the same applies to policies, which will have the worst status of their assigned agents.
This way, by seeing a group in critical status, for example, we know that at least one of its agents has the same status. Once we locate it, we can go down another level to find the module or modules that caused the critical status to spread to the upper level.
1.2.2.2 Which should be the Priority of the status?
When we say that the worst status is spread, we need to be clear about which statuses are the most important. For this there is a priority list: the first status in it has the highest priority and the last one the lowest. The last one is shown only when all elements have it.
- Fired Alerts
- Critical status
- Warning status
- Unknown status
- Normal status
We can see that when a module has fired alerts, its status has priority over the rest, and both the agent it belongs to and the group that agent belongs to will have this status. On the other hand, for a group to have normal status, for example, all its agents must have this status, which implies that all the modules of those agents have normal status.
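The propagation can be expressed as picking, at each level, the status that appears first in the priority list. The following Python sketch assumes that reading; the names and the default status for an element without children are illustrative assumptions.

```python
# Highest priority first, as in the list above.
PRIORITY = ["Fired Alerts", "Critical", "Warning", "Unknown", "Normal"]


def worst_status(statuses, default="Unknown"):
    """Return the status with the highest priority.

    `default` is returned for an empty collection (an assumption:
    the text does not say what an element without children shows).
    """
    return min(statuses, key=PRIORITY.index, default=default)


# Module statuses per agent, for the agents of one group.
group = {
    "web01": ["Normal", "Warning", "Normal"],
    "db01": ["Normal", "Critical", "Unknown"],
    "mail01": ["Normal", "Normal"],
}
agent_status = {agent: worst_status(modules)
                for agent, modules in group.items()}
group_status = worst_status(agent_status.values())
print(agent_status)
print(group_status)  # Critical
```

Normal only reaches the group level when every module of every agent is Normal, which matches the last entry of the priority list.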
1.2.3 Color Code
Each of the statuses described above has a color assigned, so that a quick look at the network maps shows when something is not working properly.
1.3 Pandora FMS graphs
Graphs are one of the most complex implementations in Pandora FMS, because they gather information in real time from the DB, and no external system (rrdtool or similar) is used.
There are several behaviors of the graphs that depend on the type of the data:
- Asynchronous modules. It is assumed that there is no data compaction: the data stored in the DB are all the real samples. This produces more "exact" graphs without possible misinterpretation.
- Text string modules. The graph shows the rate of the gathered data.
- Numerical modules. Most modules report this kind of data.
- Boolean modules. These are numerical data from *PROC modules, for instance ping checks, interface status, etc. 0 means wrong, 1 means "Normal". They raise events automatically when they change status.
Compression affects how graphs are represented. When we receive two pieces of data with the same value, Pandora does not store the second one, but assumes that the last known value is still valid at the present time as long as there is no other value. When painting a graph, if there is no reference value right where the graph starts, Pandora searches up to 48 hours back in time to find the last known value and takes it as the reference. If it does not find anything, the graph starts from 0.
In asynchronous modules, although there is no compression, the backwards search algorithm behaves in a similar way.
When composing a graph, Pandora takes 50xN samples, N being the resolution factor of the graphs (this value can be configured in the setup). A monitor that gathers data every 300 seconds (5 minutes) will have 12 samples per hour, and 12x24 = 288 samples in a day. So when we ask for a one-day graph, we are not printing 288 values; we are "compressing" or interpolating the graph using only 50x3 = 150 samples (by default, the graph resolution in Pandora is 3).
This means that we lose some resolution, and more so the more samples there are. When we have a lot of values, for instance the 2016 samples of a week or the 8640 samples of a 30-day month, we must compress them into the 150 samples of a graph. This is why we sometimes lose detail, and why graphs can be queried with different intervals and zoomed in or out.
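The arithmetic can be checked with a few lines of Python. The function names are illustrative; the constant 50 and the default resolution of 3 come from the text above.

```python
GRAPH_BASE_SAMPLES = 50  # samples per unit of graph resolution


def graph_samples(resolution=3):
    """Number of samples drawn in a graph (50 x N)."""
    return GRAPH_BASE_SAMPLES * resolution


def raw_samples(period_seconds, module_interval):
    """Number of values a module reports during the period."""
    return period_seconds // module_interval


DAY = 24 * 3600
for label, period in (("day", DAY), ("week", 7 * DAY)):
    raw = raw_samples(period, module_interval=300)
    drawn = graph_samples()
    # Average number of raw values that fall into each drawn sample.
    print(label, raw, drawn, round(raw / drawn, 2))
# day 288 150 1.92
# week 2016 150 13.44
```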
In normal graphs, the interpolation is implemented in a simple way: if within an interval we have two samples (e.g. interval B of the example), we calculate the average and draw its value.
In boolean graphs, if within a sample we have several pieces of data (which can only be 1 or 0), we take the pessimistic approach and draw 0. This helps to visualize failures within an interval, giving priority to showing the problem over the normal status.
In both cases, if within a sample we do not have any data (because it is compressed or because it is missing), we use the last known value of the previous interval to show the data, as interval E of the above example shows.
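These three rules can be sketched together. This is an illustration under stated assumptions, not the graph engine itself: the function name and arguments are hypothetical, the "last known value" is taken to be the last raw value of the previous interval, and the initial value stands for the result of the backwards search (0 when nothing is found).

```python
def build_series(points, start, end, samples, boolean=False, initial=0):
    """Reduce sorted (timestamp, value) points to `samples` drawn values."""
    width = (end - start) / samples
    grouped = [[] for _ in range(samples)]
    for timestamp, value in points:
        if start <= timestamp < end:
            index = min(int((timestamp - start) / width), samples - 1)
            grouped[index].append(value)

    series = []
    last_known = initial
    for values in grouped:
        if not values:
            series.append(last_known)       # no data: reuse last known value
            continue
        if boolean:
            series.append(min(values))      # pessimistic: any 0 draws 0
        else:
            series.append(sum(values) / len(values))  # average
        last_known = values[-1]
    return series


numeric = [(0, 10), (10, 20), (25, 30), (70, 50)]
print(build_series(numeric, start=0, end=100, samples=5))

checks = [(0, 1), (5, 0), (15, 1), (45, 1)]
print(build_series(checks, start=0, end=60, samples=3, boolean=True))
```

In the numeric example the first interval holds two values and is drawn as their average, while the third and fifth intervals hold none and repeat the last known value.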
By default, graphs show the average, maximum and minimum values. Because a sample (see interpolation) can contain several pieces of data, we show the average, the maximum or the minimum of those data. The more interpolation is needed (the longer the period we are visualizing, and therefore the more data there is), the greater the difference between the maximum and minimum values will be. The shorter the range of the graph (an hour or so), the less interpolation there will be, or none at all, so we will see the data at its real resolution and the three series will be identical.
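A small sketch of why the three series diverge on long ranges and coincide on short ones. The function name and the sample values are made up for illustration.

```python
def series_for_samples(grouped):
    """grouped: list of non-empty lists with the raw values of each
    drawn sample. Returns the (minimum, average, maximum) series."""
    minimums = [min(values) for values in grouped]
    averages = [sum(values) / len(values) for values in grouped]
    maximums = [max(values) for values in grouped]
    return minimums, averages, maximums


raw = [10, 12, 30, 8, 11, 9]

# Short range: one raw value per drawn sample, the series are identical.
short_min, short_avg, short_max = series_for_samples([[v] for v in raw])
print(short_min == short_avg == short_max)  # True

# Long range: three raw values per drawn sample, the series separate.
long_min, long_avg, long_max = series_for_samples([raw[0:3], raw[3:6]])
print(long_min, long_max)  # [10, 8] [30, 11]
```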