Difference between revisions of "Pandora: Documentation en: Services"

(Services tree view)
(Dynamic mode)
 
(113 intermediate revisions by 8 users not shown)
Line 1: Line 1:
 
[[Pandora:Documentation_en|Go back to Pandora FMS documentation index]]
 
 +
  
 
= Service Monitoring =
 
 +
 +
{{Tip|[[Image:icono-modulo-enterprise.png|Enterprise version.]]<br>Service monitoring is a Pandora FMS Enterprise-exclusive feature.}}
  
 
== Introduction ==
 
  
  
A service is a way to group your IT resources based on their functionalities.  
+
A '''service''' in Pandora FMS is a way to group IT resources based on their features.  
  
A service could be your official website, your CRM system, your support application, or even your printers. Services are logical groups which can include hosts, routers, switches, firewalls, CRMs, ERPs, websites and of course, numerous other services.
+
A service could be an official website, a CRM system, a support application, or even printers. Services are logical groups which can include hosts, routers, switches, firewalls, CRMs, ERPs, websites and, of course, other services.
  
In Pandora FMS, we represent the services as a grouping of monitored elements (modules, agents or other services) whose individual status affects in a certain way the global functionality of the service that is provided.
+
In Pandora FMS, services are represented as a group of monitored elements ([[Pandora:Documentation_en:Glossary#Module|'''Modules''']], [[Pandora:Documentation_en:Glossary#Agent|'''Agents''']] or other '''Services''') whose individual status affects in a certain way the global performance of the service provided. To learn more, watch our video tutorial [https://www.youtube.com/watch?v=9b7tbl7Sxcg "Service monitoring in Pandora FMS"].
  
 
== Services under Pandora FMS ==  
 
  
=== How Services work under Pandora FMS ===
 
The basic monitoring in Pandora FMS consists of collecting metrics from different origins, representing them as monitors (modules).
 
  
The monitoring in services allows us to group these modules, so that, playing with certain margins based on the accumulation of failures, we can monitor groups of different types of elements and their relationship in a larger and general service.
+
[[Pandora:Documentation_en:Intro_Monitoring|Basic monitoring]] in Pandora FMS consists of collecting metrics from different sources and representing them as monitors (modules). Service-based monitoring allows these modules to be grouped so that, by setting certain ranges based on the accumulation of failures, groups of different types of elements, and their relationships within a larger, more general service, can be monitored.
 +
 
 +
In short, service monitoring allows you to check the status of a global service. You will be able to know whether your service is being provided normally (green), is degraded (yellow) or is not being provided at all (red).
 +
 
 +
[[Image:PFMS_color_legend.png|center|300px|Color legend and its meaning.]]
 +
 
 +
Service monitoring is represented under three concepts: simple, by ''weight importance'' and chained by cascade events.
 +
 
 +
=== How simple mode works ===
 +
 
 +
In this mode it is only necessary to point out which elements are critical and which ones are not.
 +
 
 +
[[Image:New Service simple mode.png|center|600px|When creating a new service you may select simple mode.]]
 +
 
 +
Only elements checked as critical will be taken into account in the calculations, and only the <code>critical</code> status of said elements will have value.
 +
 
 +
* When more than 0% and up to 50% of the critical elements are in <code>critical</code> status, the service will go into <code>warning</code> status.
 +
* When more than 50% of the critical elements are in <code>critical</code> status, the service will go into <code>critical</code> status.
  
In short, service monitoring allows us to check the status of a global service. We will be able to know if our service is being provided normally (green), degraded (yellow) or if we are not providing the service (red).
+
Example:
  
To better understand what service monitoring is all about, let's give a small example.
+
* Router is a '''critical''' element.
 +
* Printer is a '''non critical''' element.
 +
* Apache Web Server is a '''critical''' element.
  
Suppose we want to monitor our web application, which we have balanced through a series of redundant elements. The infrastructure on which our application is based could consist of the following elements:
+
Situation 1:
  
* Two routers in HA.
+
* Router, <code>critical</code> status.
* Two switches in HA.
+
* Printer, <code>critical</code> status.
* 20 Apache Web Servers.
+
* Apache Server, <code>warning</code> status.
* Four Weblogic Appliance Servers.
 
* One MySQL Cluster consisting of two Storage and two SQL Processing Nodes.
 
  
Our goal is to know whether our web application is working correctly, that is, whether our customers' final assessment is that the application works.
+
'''Result''': The service is in <code>warning</code> status: the printer is not a critical element, the router is in <code>critical</code> status and represents only 50% of the critical elements, and the Apache server is ''not in critical status, so it does not add value to the evaluation''.
  
 +
Situation 2:
 +
 +
* Router, <code>critical</code> status.
 +
* Printer, <code>critical</code> status.
 +
* Apache Server, <code>critical</code> status.
 +
 +
'''Result''': Service in <code>critical</code> status (the printer still adds no value).
 +
 +
Situation 3:
 +
 +
* Router, <code>normal</code> status.
 +
* Printer, <code>critical</code> status.
 +
* Apache Server, <code>normal</code> status.
 +
 +
'''Result''': The status of the service would be '''normal''', since no key element is in ''critical'' status (again the printer does not add any value).
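The three situations above can be reproduced with a minimal Python sketch of the simple mode rule. This is not Pandora FMS code: the function name and the tuple layout are hypothetical, and returning <code>normal</code> when no element is marked as critical is an assumption.

```python
def simple_mode_status(elements):
    """Evaluate a service in simple mode.

    elements: list of (name, is_critical_element, status) tuples.
    This tuple layout is hypothetical and used only for illustration.
    """
    # Only elements flagged as critical take part in the calculation.
    key_statuses = [status for _, is_key, status in elements if is_key]
    if not key_statuses:
        return "normal"  # assumption: nothing to evaluate means normal
    failed = sum(1 for status in key_statuses if status == "critical")
    if failed == 0:
        return "normal"
    # Up to 50% of the critical elements failing -> warning; more -> critical.
    return "warning" if failed * 2 <= len(key_statuses) else "critical"


situation_1 = [("Router", True, "critical"),
               ("Printer", False, "critical"),
               ("Apache Web Server", True, "warning")]
situation_2 = [("Router", True, "critical"),
               ("Printer", False, "critical"),
               ("Apache Web Server", True, "critical")]
situation_3 = [("Router", True, "normal"),
               ("Printer", False, "critical"),
               ("Apache Web Server", True, "normal")]

for situation in (situation_1, situation_2, situation_3):
    print(simple_mode_status(situation))
```

The printer never changes the outcome because it is filtered out before any counting takes place.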
 +
 +
===How services work according to their weight===
  
 
The need to monitor services as something "abstract" arises when faced with the following question:
 
  
'''What happens to my application if a non-critical element falls down?'''
+
'''What happens to an application if a non-critical element fails?'''
  
For example, suppose one of the twenty Apache servers crashes. In theory there is no need to warn, because the redundancy exists precisely to cover such situations. But then, when should we alert, and whom? Everyone? Just some? What is the rule for the warning?
+
To answer these questions, Pandora FMS provides the service monitoring feature, which helps to:
  
We could think that Pandora should only warn us if a more critical element is dropped (for example a router) or if several Apache servers are dropped.
+
*Limit the number of received alerts. You will receive alerts about situations that compromise the reliability of the services you provide.
 +
*Track the SLA compliance level.
 +
*Simplify the monitoring display of your infrastructure.
  
The <b>services in Pandora FMS</b> monitoring feature was created to resolve all these doubts.
 
  
The services in Pandora FMS help us to:
+
To achieve this, monitor every element that could negatively affect your application.
*Limit the number of received alerts. We will receive alerts about situations that compromise the reliability of the services we provide.
 
*Track the compliance level.
 
*Simplify the visualization of the monitoring of our infrastructure.
 
  
 +
Through Pandora FMS console, define a '''service tree''' in which to indicate both the elements that affect your application, as well as their impact degree.
  
To achieve this, we will have to monitor every element that could negatively affect our application.
+
All elements added to the service trees will correspond to information that is already being monitored, either in the form of modules, specific agents or other services.
  
Through the Pandora FMS console, we will have to define a '''service tree''' in which we will indicate both the elements that affect our application, as well as the degree to which they affect.
 
  
All the elements we add to the service trees will correspond to information that is already being monitored, whether in the form of modules, specific agents or other services.
+
To indicate the degree to which the status of each element affects the overall status, a '''weight sum''' system is used: the most important elements (those with more weight) push the overall status of the whole service towards an incorrect status sooner than less important elements (those with less weight).
  
 +
====Example====
  
To indicate the degree to which the status of each element affects the overall status, a system of '''sum of weights''' will be used, so that the most important elements (with more weight) will move the overall status of the complete service to an incorrect status sooner than the less important elements (with less weight).
+
You may monitor a web application balanced through a series of redundant elements. In this example, the infrastructure the application is based on consists of the following elements:
  
 +
* Two HA routers.
 +
* Two HA switches.
 +
* Twenty Apache® web servers.
 +
* Four WebLogic® application servers.
 +
* One MySQL® cluster made up of two storage nodes and two SQL processing nodes.
  
Let's look at all these ideas through a practical example:
+
The goal is to find out whether the web application is working properly, that is, whether from the clients' point of view the application receives, processes and returns requests within an acceptable time period.
*Switches and routers: 5 points each when in critical, and 3 points if in warning.
 
*WEB servers: 1.2 points for each one in critical, we do not contemplate the warning status.
 
*WebLogic Servers: 2 points each in critical.
 
*MySQL Cluster: 5 points for each node in critical and 3 points in warning.
 
  
 +
If one of the twenty Apache servers were offline, due to so much redundancy, would it be wise to warn or alert all the employees? ''What is the rule for alerting?''
 +
 +
You may conclude Pandora FMS should only warn if a highly critical element fails (for example a ''router'') or if several Apache servers are offline at the same time... but, how many of them? To solve this, weight values must be assigned to the list of previously described components:
 +
 +
;''Switches'' and ''routers'': 5 points to each one when they are in <code>critical</code> and 3 points if they are in <code>warning</code>.
 +
;Web servers: 1.2 points to each one in <code>critical</code>, <code>warning</code> status is not contemplated.
 +
;WebLogic servers: 2 points to each one in <code>critical</code>.
 +
;MySQL cluster: 5 points to each one in <code>critical</code> and 3 points in <code>warning</code>.
  
 
<table border="0" style="width: 80%; margin: 15px auto; border-collapse: collapse;">
 
<table border="0" style="width: 80%; margin: 15px auto; border-collapse: collapse;">
 
<tr>
 
<tr>
<th rowspan="2" style="color: #fff;padding-left: 0;text-align: center;">Element type</th>
+
<th rowspan="2" style="color: #fff;padding-left: 0;text-align: center;">Element type</th>
<th colspan="4" style="color: #fff;padding-left: 0;text-align: center;">Weight assignment</th>
+
<th colspan="4" style="color: #fff;padding-left: 0;text-align: center;">Weight assignment</th>
 
</tr>
 
</tr>
 
<tr>
 
<tr>
Line 80: Line 123:
 
<tr><td>Router</td><td>0</td><td>3</td><td>5</td><td>5</td></tr>
 
<tr><td>Router</td><td>0</td><td>3</td><td>5</td><td>5</td></tr>
 
<tr><td>Switch</td><td>0</td><td>3</td><td>5</td><td>5</td></tr>
 
<tr><td>Switch</td><td>0</td><td>3</td><td>5</td><td>5</td></tr>
<tr><td>Web server</td><td>0</td><td>0</td><td>1.2</td><td>1.2</td></tr>
+
<tr><td>Apache server</td><td>0</td><td>0</td><td>1.2</td><td>1.2</td></tr>
<tr><td>Weblogic server</td><td>0</td><td>0</td><td>2</td><td>2</td></tr>
+
<tr><td>WebLogic server</td><td>0</td><td>0</td><td>2</td><td>2</td></tr>
 
<tr><td>MySQL server</td><td>0</td><td>3</td><td>5</td><td>5</td></tr>
 
<tr><td>MySQL server</td><td>0</td><td>3</td><td>5</td><td>5</td></tr>
 
</table>
 
</table>
  
 
+
In a normal situation the sum of these weights is zero. In this example, the service goes into <code>warning</code> status when the sum exceeds 4 and into <code>critical</code> status when it exceeds 6:
We set a warning threshold for service of 4, and a critical threshold of 6. In this way, and assuming that everything is going well, the service would be "OK" if all the monitored elements are OK or not important enough to cause deficiencies in the provision of our service.
 
  
 
<table border="0" style="width: 80%; margin: 15px auto; border-collapse: collapse;">
 
<table border="0" style="width: 80%; margin: 15px auto; border-collapse: collapse;">
 
<tr>
 
<tr>
<th colspan="3" style="color: #fff;padding-left: 0;text-align: center;">Service configuration</th>
+
<th colspan="3" style="color: #fff;padding-left: 0;text-align: center;">Service configuration</th>
 
</tr>
 
</tr>
 
<tr>
 
<tr>
Line 104: Line 146:
 
</table>
 
</table>
  
 +
Failure scenarios:
  
Now let's suppose that one (1) Apache web server goes down:
+
* An Apache web server is offline (<code>critical</code> status): since everything else is in normal status and adds 0, the total is 1.2. Since 1.2 < 4 (<code>warning</code> threshold), the service remains in OK (<code>normal</code>) status.
  
* 1 x Apache server in CRITICAL x 1.2 points = 1.2. Because 1.2 < 4 (Warning), the service is still in OK status.
+
* A WEB server and a WebLogic one, both in <code>critical</code> status: the first one adds 1.2 points and the second 2.0 for a total of 3.2; however it is still lower than 4 so the service is still in OK status, no alerts or actions needed.
  
The weight contribution will be:
+
* Now two WEB servers and a WebLogic one are offline: 2 x 1.2 + 1 x 2 = 4.4; this exceeds the warning threshold, so the service goes into <code>warning</code> status. It is still working and may not require any immediate technical action, but it is obvious there is a problem in the infrastructure.
  
2 x 0 (routers in OK)
+
* A router in <code>critical</code> status is added to the previous situation: it adds 5 points to the weight sum, which now exceeds the critical threshold set at 6. The service is in critical status, '''the service is not working''' and immediate technical action is required.
+ 2 x 0 (switches in OK)
 
+ 19 x 0 (apache OK)
 
+ 1 x 1.2 (apache CRIT)
 
+ 4 x 0 (weblogic OK)
 
+ 1 x 0 (mysql OK)
 
Total: 1.2 --> Our service will be in NORMAL
 
  
 +
In this last situation, '''Pandora FMS will alert''' the corresponding working team (operators, technicians, etc.).
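The failure scenarios above can be checked with a short Python sketch of the weight sum. It is only an illustration under stated assumptions: the dictionary keys and the function name are hypothetical, the weights are the ones listed for this example, and a status changes only when the sum strictly exceeds the threshold.

```python
# Weight added by one element of each type, per status (values from this example).
WEIGHTS = {
    "router":   {"warning": 3, "critical": 5},
    "switch":   {"warning": 3, "critical": 5},
    "apache":   {"warning": 0, "critical": 1.2},
    "weblogic": {"warning": 0, "critical": 2},
    "mysql":    {"warning": 3, "critical": 5},
}
WARNING_THRESHOLD = 4
CRITICAL_THRESHOLD = 6


def service_status(failures):
    """failures: list of (element_type, status) for every element not in normal.

    Elements in normal status add 0, so they can simply be left out.
    """
    total = sum(WEIGHTS[kind].get(status, 0) for kind, status in failures)
    if total > CRITICAL_THRESHOLD:
        return total, "critical"
    if total > WARNING_THRESHOLD:
        return total, "warning"
    return total, "normal"


scenarios = [
    [("apache", "critical")],
    [("apache", "critical"), ("weblogic", "critical")],
    [("apache", "critical"), ("apache", "critical"), ("weblogic", "critical")],
    [("apache", "critical"), ("apache", "critical"), ("weblogic", "critical"),
     ("router", "critical")],
]
for failures in scenarios:
    total, status = service_status(failures)
    print(f"{total:.1f} -> {status}")
```

Only the last scenario crosses the critical threshold, which is the point at which an alert would be sent.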
  
Let's see what happens if a WEB server and a Weblogic go down:  
+
{{Tip|You may get more interesting information about service monitoring in the [https://pandorafms.com/blog/service-monitoring/ Pandora FMS blog].}}
  
* 1 x Apache Server in CRITICAL x 1.2 pto = 1.2
+
===  Root services ===
* 1 x Weblogic Server in CRITICAL x 2 = 2
 
  
Total: 3.2, which is still < 4, so the service remains in OK status. It is still working, and no immediate technical action is necessary.
 
  
The weight contribution will be:
+
{{Tip|[[Image:icono-modulo-enterprise.png|Enterprise version.]]<br>Version NG 726 or higher.}}
  
2 x 0 (routers in OK)
+
A root service is one that is not part of another service. This logical concept makes monitoring smoother by reducing work queues.
+ 2 x 0 (switches in OK)
 
+ 19 x 0 (apache OK)
 
+ 1 x 1.2 (apache CRIT)
 
+ 3 x 0 (weblogic OK)
 
+ 1 x 2 (weblogic CRIT)
 
+ 1 x 0 (mysql OK)
 
Total: 3.2 --> Our service will be in NORMAL
 
  
 +
In addition and based on that, when a service defined in a Pandora FMS node appears as a [[Pandora:Metaconsole:Documentation_en:Introduction#Introduction|Metaconsole]] root service element, the Metaconsole server will be the one to evaluate it, updating the values stored in the node.
  
Let's see what happens if two WEB servers and a WebLogic go down:
+
This provides a more efficient distributed logic, and allows a [[Pandora:Documentation_en:Alerts#Service-based_cascade_protection|Cascade protection]] system based on services to be applied.
  
* 2 x Apache Server in CRITICAL x 1.2 pto = 2.4
+
Metaconsole service possibilities have also been extended: other services, modules or agents can now be added as service elements. In previous versions, only node services could be added.
* 1 x Weblogic Server in CRITICAL x 2 = 2
 
  
Total: 4.4, which is now > 4, so the service goes into WARNING status; our service has gone into a <b>degraded</b> status. It continues to work, and may not require immediate technical action, but it is clear that there has been a problem with our infrastructure.
+
==  Creating a new Service ==
 +
===Pandora FMS server===
  
2 x 0 (routers in OK)
+
{{Warning|The '''PredictionServer''' component must be enabled to be able to use these services.}}
+ 2 x 0 (switches in OK)
 
+ 18 x 0 (apache OK)
 
+ 2 x 1.2 (apache CRIT)
 
+ 3 x 0 (weblogic OK)
 
+ 1 x 2 (weblogic CRIT)
 
+ 1 x 0 (mysql OK)
 
Total: 4.4 --> Our service will be in WARNING
 
  
 +
It is necessary for the [[Pandora:Documentation_en:Architecture#The_Prediction_Server|PredictionServer]] component to be working and for Pandora FMS Enterprise server version to be installed.
  
Let's suppose that in addition to the above, a Router goes down:
+
===Introduction===
  
* 2 x Apache Server in CRITICAL x 1.2 pto = 2.4
+
The services may represent:
* 1 x Weblogic Server in CRITICAL x 2 = 2
 
* 1 x Router in CRITICAL x 5 = 5
 
  
Now we have 9.4, above the threshold of 6 set for CRITICAL, so the service is in critical status: <b>it is not working</b>. Immediate technical action is imperative.
+
* Modules.
 +
* Full agents.
 +
* Other Services.
  
1 x 0 (routers in OK)
+
Service values are calculated using the Prediction Server.
+ 1 x 5 (router in CRIT)
 
+ 2 x 0 (switches in OK)
 
+ 18 x 0 (apache OK)
 
+ 2 x 1.2 (apache CRIT)
 
+ 3 x 0 (weblogic OK)
 
+ 1 x 2 (weblogic CRIT)
 
+ 1 x 0 (mysql OK)
 
Total: 9.4 --> Our service is in CRÍTICAL
 
  
<b>Pandora FMS will alert</b> the corresponding work team (operators, technicians, etc.).  
+
Once all the devices are monitored, add to each service all the modules, agents or sub-services needed to monitor it. For example, to monitor the Online Store service, you need a module for content, a service that monitors the state of communications and so on.
  
Service monitoring is a feature only of the Enterprise version of Pandora FMS.
+
To create a new service, click on '''Services''' at the '''Topology Maps''' menu.
<br><br>
 
  
 +
<br>
 +
[[Image:menu_services.png|center]]
 +
<br>
  
==== How the simple mode works ====
+
A tree view containing all the available services will be shown.
  
The weight system as detailed above may be too complex when the monitoring needs are basic. To deal with this situation, a new simple mode is available on the service configuration since the 5.1 version.
+
<br>
 +
[[Image:Arbol_servicios.png|center|800px]]
 +
<br>
  
In this mode the only configuration needed is to select which elements are critical and which not. Only the critical elements will be taken into account when calculating the service status and only the ''critical'' status of the critical elements will have value. The service will go to ''warning'' when 50% of the critical elements reach ''critical'' status. When 50% of the critical elements in ''critical'' are surpassed, the service will go to ''critical''.
+
===Initial Configuration===
  
Let's follow an example of a simple service:
+
To create a new service, click on '''Create service''' and fill out the form.
  
* Router as '''critical''' element.
+
<br><center><br>
* Printer as '''non critical''' element.
+
[[Image:Formulario_servicios.png|center|800px]]
* Apache server as '''critical''' element.
+
</center><br><br>
  
One day the elements report this status:
+
;Name: Unique name to identify the service.
 +
;Description: Service description, a long mandatory text. Said description will appear in the service map, the service table view and the service widget (instead of the name).
 +
;Group:  Group to which the service belongs, useful to organize it and to apply [[Pandora:Documentation_en:Managing_and_Administration#Profiles.2C_users.2C_groups_and_ACL|ACL]] restrictions.
 +
;Agent to store data: The service saves the data in some special data modules (in particular the prediction modules) and it is necessary to add an agent to be the container of said modules and the alarms (see the following steps).
  
* Router on '''critical'''.
+
{{tip|'''Note''': Please bear in mind that the interval at which all the service module calculations are performed depends on the interval of the agent configured as container.}}
* Printer on '''critical'''.
 
* Apache server on '''warning'''.
 
  
The service status is '''warning''', because the printer isn't a critical element and its status is not taken into account, as well as the Apache service status, which, even though a critical element, will only be taken into account in ''critical'' status. In this situation, one critical element is on ''critical'' status, 50% of the critical elements.
+
[[Image:Formulario_servicios_detalle_1.png|center|600px|Mode for weight calculations.]]
  
Another day the elements report this status:
+
;Mode: Mode in which the element weights will be calculated. It may have 2 values:
  
* Router on '''critical'''.
+
** '''Smart''': The service's weights and elements will be calculated automatically based on established rules.
* Printer on '''critical'''.
+
** '''Manual''': The service's weights and elements will be indicated manually with fixed values.
* Apache server on '''critical'''.
 
  
The service status is '''critical''', since over 50% of the critical elements are on ''critical'' status.
+
* '''Critical''': Weight threshold to declare the service as critical. In '''smart''' mode this value will be a percentage. We will explain later how the elements contribute to this value.
 +
* '''Warning''': Weight threshold to declare the service as in warning status. In '''smart''' mode, this value will be a percentage. We will explain later how the elements contribute to this value.
 +
;Unknown elements as critical: It allows you to indicate that elements in an unknown state contribute their weight as if they were a critical element.
 +
{{warning|The ''smart'' mode is only available from Pandora FMS version ''7.0NG 748''.
  
Finally, another day the elements report this status:
+
The ''automatic'' and ''simple'' modes of previous versions will become ''manual'' by applying the ''MR 40'' in the version update.}}
 +
;Favorite: It creates a direct link in the side menu, and services can be filtered in the views based on this criterion.
  
* Router on '''normal'''.
+
[[Image:Servicios_favoritos.png|center|700px]]
* Printer on '''critical'''.
 
* Apache on '''normal'''.
 
  
The service status is '''normal''', since less than 50% of the critical elements are on ''critical'' status. Only the printer is on ''critical'' status but, as we have seen, the non critical elements aren't taken into account when calculating the service status.
+
;Quiet: It activates the silence mode of the service, so it will not generate alerts or events.
 +
;Cascade protection enabled: It activates cascade protection over the service elements. These will not generate alerts or events if they belong to a service (or sub-service) that is in a critical state.
 +
;Calculate continuous SLA: It activates the creation of SLA and SLA value modules for the current service. If disabled, the dynamically calculated SLA information will not be available, nor will the alerts on SLA compliance for this service. It is used for cases where the number of services required is so high that it can affect performance.  
  
===  Creating a New Service ===
+
{{Warning|If this option is disabled, once the service has been created, the data history of these modules will be deleted, so information will be lost.}}
  
{{Warning|You need the Enterprise version and the ''PredictionServer'' component enabled to be able to use the services.}}
+
;SLA Interval: Time period to calculate the effective SLA of the service.
 +
;SLA limit: Service status threshold in OK to be considered a positive SLA during the period of time you have configured in the previous field.
  
* Modules
+
;Alerts: In this section select templates that the service will have to launch the alert when the service goes into warning, critical, unknown status or when the service SLA is not met.
* Agents
 
* Other Services
 
  
The service values are calculated using the Prediction Server which utilizes the default interval of the prediction modules.
+
===Element Configuration===
  
Once you have all the devices monitored, you can add to each service all the modules, agents or sub-services you need to monitor it. For example, if you want to monitor the Online Store service you need a module for content, a service that monitors the status of communications and so on. Through the following steps, you can see how to create a service with Pandora FMS.
+
Once the form has been correctly filled in, you will have an empty service, which must be populated with elements as described below. In the service edition form, select the 'Configure elements' tab.
 
 
To create a new service, please click on '''Services''' at the '''Topology Maps''' menu.
 
  
 
<br>
 
<br>
[[Image:menu_services.png|center]]
+
[[Image:Elementos_servicios.png|center]]
 
<br>
 
<br>
  
A list of all the available services will be shown. The next screenshot shows an empty service list.
+
By clicking on '''Add element''', a pop-up window with a form will appear. The form will be slightly different if the service is in '''smart''' mode or in '''manual''' mode.
  
 
<br>
 
<br>
[[Image:Services empty v5.png|center|800px]]
+
[[Image:Formulario_elementos_servicios.png|center]]
 
<br>
 
<br>
  
To create a new service, just click on the 'Create' button and fill out the form as shown below.
+
;Description: Optional text that will be used to represent the element on the service map. If not indicated, the name of the module, agent or service (depending on the added element) will be used.
 +
;Type: Drop-down list to choose whether the element will be a service, module or agent. In smart mode services you can also choose the '''dynamic''' type.
 +
;Agent: Intelligent agent search engine. Only visible if the element to create or edit is an agent or module type.
 +
;Module: Drop-down list with the modules of the agent previously chosen in the intelligent search engine. This control is only visible if a module-type service element is being created or edited.
 +
;Service: Dropdown list of the services to create an element. Only visible if the element to be created or edited is a service element.
 +
 
 +
{{Tip|It should also be noted that the services that will appear in the drop-down list are those that are not the ancestors of the service. This is necessary to show a correct tree structure of dependency between services.}}
 +
 
 +
====Manual mode====
 +
 
 +
The following fields will only be available for services in manual mode:
 +
 
 +
* <code>critical</code>: Weight that the element will add to the service when in critical state.
 +
* <code>warning</code>: Weight that the element will add to the service when in warning state.
 +
* <code>unknown</code>: Weight that the element will add to the service when in unknown state.
 +
* <code>normal</code>: Weight that the element will add to the service when in normal state.
 +
 
 +
To calculate the status of a service, the weight of each of its elements will be added based on its status, and if it exceeds the thresholds established in the service for warning or critical, the status of the service will change to warning or critical accordingly.
 +
 
 +
====Smart mode====
 +
 
 +
In smart mode services, since no weights are defined for the elements, the way their status is calculated is as follows:
 +
 
 +
* Critical elements contribute their full percentage to the weight of the service. This means that if, for example, there are 4 elements in the service and only 1 of them is critical, that element will add 25% to the weight of the service. If instead of 4 elements there were 5, the critical element would add 20% to the weight of the service.
 +
* Warning elements contribute half of their percentage to the weight of the service. This means that if for example a service has 4 elements and only 1 of them is in warning status, that element will add 12.5% to the weight of the service. If instead of 4 elements there were 5, the warning element would add 10% to the weight of the service.
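The percentages above can be reproduced with a minimal Python sketch. The function name is hypothetical, and treating every status other than <code>critical</code> and <code>warning</code> as contributing nothing is an assumption (the ''Unknown elements as critical'' option is not modelled).

```python
def smart_mode_weight(statuses):
    """Return the service weight, as a percentage, for a list of element statuses."""
    if not statuses:
        return 0.0
    share = 100.0 / len(statuses)  # percentage that a single element represents
    # Critical elements add their full share, warning elements half of it.
    contribution = {"critical": share, "warning": share / 2}
    return sum(contribution.get(status, 0.0) for status in statuses)


print(smart_mode_weight(["critical", "normal", "normal", "normal"]))            # 4 elements
print(smart_mode_weight(["critical", "normal", "normal", "normal", "normal"]))  # 5 elements
print(smart_mode_weight(["warning", "normal", "normal", "normal"]))             # 4 elements
print(smart_mode_weight(["warning", "normal", "normal", "normal", "normal"]))   # 5 elements
```

The resulting percentage is what would then be compared against the percentage thresholds configured for the service in smart mode.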
 +
 
 +
===== Dynamic mode=====
  
<br><center><br>
+
[[Image:Topology_maps-services-edit_service_elements-add_element-01.png|center|500px]]
[[Image:Services creation v5.png|center|800px]]
+
 
</center><br><br>
+
The following fields will only be available for dynamic elements, in services in smart mode:
 +
 
 +
;Matching object types: Drop-down list to choose whether the elements for which the dynamic rules will be evaluated and that will be part of the service will be agents or modules.
 +
;Filter by group: Rule to indicate the group the element must belong to to be part of the service.
 +
;Having agent name: Rule to indicate the agent name that the element must have to be part of the service. Enter a text that must be contained in the name of the desired agent.
 +
;Having module name: Rule to indicate the module name that the element must have to be part of the service. Enter a text that must be contained in the desired module name.
 +
[[Image:Topology_maps-services-edit_service_elements-add_element-02.png|center|400px]]
 +
;Use regular expresions selector: If you activate this option, the search mechanism will use [https://en.wikipedia.org/wiki/Regular_expression regular expressions] ('''regex''' or '''regexp'''), following [https://dev.mysql.com/doc/refman/8.0/en/regexp.html the way MySQL handles this type of expression].
 +
;Having custom field name: Rule to indicate the name of the custom field that the element must have to be part of the service. Enter a text that must be contained in the name of the desired custom field.
 +
;Having custom field value: Rule to indicate the value of the custom field that the element must have to be part of the service. A text that must be part of the desired custom field value will be indicated.
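The way these dynamic rules narrow down the matching elements can be pictured with the following Python sketch. It is only an approximation under several assumptions: the dictionary keys and function names are hypothetical, an empty rule is assumed not to filter anything, custom field rules are left out, and Python's <code>re</code> module stands in for MySQL regular expressions, whose syntax and case sensitivity may differ.

```python
import re


def matches_rule(name, pattern, use_regex=False):
    """Return True when `name` satisfies a 'Having ... name' rule."""
    if not pattern:
        return True  # assumption: an empty rule does not filter anything
    if use_regex:
        # Python regex used as a stand-in for MySQL regular expressions.
        return re.search(pattern, name) is not None
    return pattern in name  # plain rule: the text must be part of the name


def select_modules(modules, group=None, agent_name="", module_name="", use_regex=False):
    """modules: list of dicts with hypothetical keys 'group', 'agent' and 'module'."""
    return [
        m for m in modules
        if (group is None or m["group"] == group)
        and matches_rule(m["agent"], agent_name, use_regex)
        and matches_rule(m["module"], module_name, use_regex)
    ]


inventory = [
    {"group": "Web", "agent": "apache-01", "module": "HTTP latency"},
    {"group": "Web", "agent": "apache-02", "module": "CPU load"},
    {"group": "DB", "agent": "mysql-01", "module": "CPU load"},
]
plain = select_modules(inventory, group="Web", module_name="CPU")
regex = select_modules(inventory, agent_name=r"^apache-\d+$", use_regex=True)
print([m["agent"] for m in plain])
print([m["agent"] for m in regex])
```

All rules are combined, so an element has to satisfy every rule that has been filled in to become part of the service.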
 +
 
 +
[[Image:Topology_maps-services-edit_service_elements-add_element-03.png|center|400px]]
  
The names of the form fields and their meaning are as follows:
+
{{Tip|You must place text in both fields to be considered when searching in custom fields.}}
  
* '''Name''': The name of the service.
+
[[Image:Topology_maps-services-edit_service_elements-add_element-04.png|center|400px]]
* '''Description''': The description of the service, a long text that can be optional.
 
* '''Group''': The group of the service. It's quite useful for organization purposes and to enforce the SLA ('''S'''ervice '''L'''evel '''A'''greement) restrictions.
 
* '''Mode''': the mode in which the calculation of the weight of the elements will be performed.
 
** '''Manual''': the weights should be entered manually into the service and their elements.
 
** '''Auto''': implying the 'critical' threshold for the service to be '1' and the 'warning' threshold to be '0.5'. It's also assumed that you'll automatically assign weights of '0' for the 'OK' status, '0.5' for 'warning' and '1' for 'critical' each time you're creating an element for this service.
 
** '''Simple''': there is no need to enter weights, only enable or disable a checkbox to indicate if the element is critical.
 
* '''Critical''': The weight threshold to enter the 'critical' status. This field is disabled if the auto-calculate check is enabled. The default value is '1'.
 
* '''Warning''': weight threshold for declaring service in warning status. This field is disabled when the automatic mode is selected and has the default value of '0.5'. Not visible when the simple mode is selected.
 
* '''Agent to store Data''': the service stores the data in special data modules (specifically the prediction modules) and it is necessary to introduce an agent to be the container of these modules, as well as the alarms that you will later have to configure in this form. '''Note''' Please note that the interval at which all service module calculations will be performed depends on the agent interval configured as a container.
 
* '''Quiet''': Activates the silent mode of the service. It will not generate alerts or events.
 
* '''Cascade Protection''': Activates cascade protection over the elements of the service. These will not generate alerts or events if they belong to a service (or subservice) that is in critical condition.
 
* '''Calculate continuous SLA for this service''': Activates the creation of SLA and SLA value modules for the current service. If disabled, dynamically calculated SLA information is not available, and SLA compliance alerts for this service do not work. It is intended for cases where the number of services needed is so high that it can affect performance. '''If this option is disabled once the service has been created, the data history of these modules will be deleted and information will be lost.'''
 
* '''SLA Interval''': The time range for performing the SLA constraint's calculation. The default value is '1 month'.
 
* '''SLA Limit''': The threshold of time in OK status that the service must reach for the SLA to be considered met over the period of time set in the previous field.
 
* '''Warning Service Alert''': alert template that the service will use to issue the alert when the service goes into warning status.
 
* '''Critical Service alert''': alert template that the service will use to issue the alert when the service goes into critical status.
 
* '''SLA Critical Service Alert''': alert template that the service will use to issue the alert if the SLA restrictions aren't met.
 
  
Once the form has been filled in correctly, you will have an empty service which must be filled in with elements or service items as we will see below. In the service edit form, select the 'Config Elements' tab.
+
{{Tip|Since version NG 752, it is possible to add searches in more custom fields. Elements will be selected if they match any of the keyword pairs set.}}
  
<br>
+
[[Image:Topology_maps-services-edit_service_elements-add_element-07.png|center|400px]]
[[Image:Services tab setup v5.png|center]]
 
<br>
 
  
You'll see a page like the one below where you can manage (modify, add new ones or delete) service elements.
+
;For example: If you choose to filter the Agents in the group '''Servers''' whose Agent's name ''contains'' <code>Firewall</code> and Module name ''contains'' <code>Network</code> you can obtain the following result.
  
<br>
+
[[Image:Topology_maps-services-edit_service_elements-add_element-06.png|center|600px]]
[[Image:Services elements empty v5.png|center|800px]]
 
<br>
 
  
Some important items on the services configuration page are:
+
;For example: if the configuration of a dynamic element was:
  
* '''Type''': a drop-down list that can show service, module or agent.
+
[[Image:Topology_maps-services-edit_service_elements-add_element-05.png|center|500px]]
* '''Agent''': The smart-search input control for the agent. It's only visible if the element type is either the 'agent' or the 'module' type.
 
* '''Module''': A drop-down list with the modules of the agent previously chosen via smart search. This control is only visible when editing or creating a service element of the 'module' type.
 
* '''Service''': A drop-down list of the services that can be added as an item. It's only visible when creating or editing an element of the 'service' type. Keep in mind that the services which appear in the drop-down list are '''not''' ancestors of the service, which is necessary to keep an appropriate tree-structure dependency between the services.
 
* '''Critical''': A checkbox to select if the element is critical. Not visible unless the service is in simple mode.
 
* '''Weight on ''Critical''''': The weight of the element if it's in a 'critical' status. The default value is '1'. It's disabled if the service is in 'auto calculate' mode. Not visible if the service is in simple mode.
 
* '''Weight on ''Warning''''': The weight of the 'warning' status. The default value is '0.5'. It's disabled if the service is in 'auto calculate' mode. Not visible if the service is in simple mode.
 
* '''Weight on ''Unknown''''': The weight of the element if it's in ''unknown'' status. The default value is '0'. It's disabled if the service is in 'auto calculate' mode. Not visible if the service is in simple mode.
 
* '''Weight on ''OK''''': The weight of the element if it's in perfect conditions. The default value is '0'. It's disabled if the service is in 'auto calculate' mode. Not visible if the service is in simple mode.
 
  
Once you have created the service items on this page, we're looking at a list management similar to the one shown in the picture below:
+
All the modules whose name includes "Host Alive", in an agent whose name includes "SW", inside the "Servers" group, with a custom field whose name includes "Department" and whose value includes "Systems", would be used as service elements.
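The selection rule in this example can be pictured with a minimal Python sketch. It is only an illustration, not Pandora FMS code: the <code>Module</code> class and the <code>matches_dynamic_rule</code> function are hypothetical names, matching is assumed to be a plain case-sensitive "contains" check (no regular expressions), and custom fields are assumed to match when any one name/value keyword pair matches.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """Hypothetical, simplified view of a monitored module (not a Pandora FMS type)."""
    name: str
    agent_name: str
    group: str
    custom_fields: dict = field(default_factory=dict)


def matches_dynamic_rule(module, group, agent_contains, module_contains, custom_field_pairs):
    """Return True when the module satisfies every rule of the dynamic element.

    Assumption: every rule is a case-sensitive substring check.
    """
    if module.group != group:
        return False
    if agent_contains not in module.agent_name:
        return False
    if module_contains not in module.name:
        return False
    if not custom_field_pairs:
        return True
    # Assumption: one matching (field name, field value) keyword pair is enough.
    return any(
        name_part in field_name and value_part in field_value
        for name_part, value_part in custom_field_pairs
        for field_name, field_value in module.custom_fields.items()
    )


# Hypothetical inventory used only to exercise the rule from the example.
inventory = [
    Module("Host Alive", "SW-Core-01", "Servers", {"Department": "Systems"}),
    Module("Host Alive", "Router-01", "Servers", {"Department": "Systems"}),
    Module("CPU Load", "SW-Core-01", "Servers", {"Department": "Systems"}),
]
selected = [
    m for m in inventory
    if matches_dynamic_rule(m, "Servers", "SW", "Host Alive", [("Department", "Systems")])
]
```

With this hypothetical inventory, only the first module is selected, because it is the only one that meets the group, agent name, module name and custom field rules at the same time.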
  
<center>
 
[[Image:Services list elements admin v5.png|800px|center]]
 
</center>
 
  
In which, in the last column on the right, entitled "Actions", you have icons for:
+
{{warning|Dynamic elements are not affected by service cascade protection.}}
* '''Edit''': which is the icon represented by a wrench with an orange handle. Edit the element of the row corresponding to that icon.
 
* '''Delete''': which is the icon represented by a red cross. When clicking on it, you will be asked in a modal window for confirmation to remove and delete the service element from the database.
 
  
=====Modules created when configuring a service=====
+
===Modules created when configuring a service===
  
* '''SLA Value Service:''' The percentage value of the SLA compliance. (async_data).
+
* '''SLA Value Service:''' The percentage value of SLA compliance. (<code>async_data</code>).
  
* '''Service_SLA_Service:''' This shows if the SLA is being accomplishing or not. (async_proc).
+
* '''Service_SLA_Service:''' This shows whether the SLA is met or not. (<code>async_proc</code>).
  
* '''Service_Service:''' This module shows the sum of the weights of the service. (async_proc).
+
* '''Service_Service:''' This module shows the sum of the service weights. (<code>async_data</code>).
  
 
<br><br>
 
<br><br>
  
=== Service Visualization ===
+
== Service Visualization ==
  
==== Simple list-based View of all the Services ====
+
=== Simple all-service view ===
  
It is the operation list that shows all the services created, of course, it only shows those of the groups that the user that is using the Pandora console has access to.
+
It is the operation list that shows all created services. Of course, it only shows those of the groups that the user using the Pandora FMS console has access to. To reach it, click '''Operation''' > '''Monitoring''' and then '''Services'''.
 
 
To get to this view, just go to the Operation menu, open the Monitoring entry and within this is the Services section.
 
  
 
<center>
 
<center>
Line 323: Line 345:
 
</center>
 
</center>
  
Each row represents a service, and the columns represent:
+
Each row represents a service:
  
* '''Name''': The name of the service.
+
;Group: The icon of the group the service belongs to.
* '''Description''': The service description.
+
;Critical: The threshold value for weight sums to get the service into 'critical' status.
* '''Group''': The icon of the group the service belongs to.
+
;Warning: The threshold value for weight sums to get the service into 'warning' status.
* '''Critical''': The threshold value for the sums of weights to put the service into 'critical' status.
+
;Value: The current value for weight sums for the service.
* '''Warning''': The threshold value for the sums of weights to put the service into 'warning' status.
+
;Status: An icon that represents the status of the service. Four possible status are represented:
* '''Value''': The current value for the sum of all weights for the service.
 
* '''Status''': An icon which represents the status of the service.  
 
Four possible status are represented:
 
 
** '''Red''': The service is in 'critical' status because the value exceeded the critical threshold.
 
** '''Red''': The service is in 'critical' status because the value exceeded the critical threshold.
** '''Yellow''': The service is in 'warning' status because the value exceeded the warning threshold.
+
** '''Yellow''': The service is in 'warning' status because the value equaled or exceeded the warning threshold.
** '''Green''': The service is within the 'normal' range.
+
** '''Green''': The service is within the 'normal' range because weight sum does not reach the threshold.
** '''Gray''': The service is in 'unknown' status. This usually means the service has been recently created and doesn't contain any modules or the Prediction Server is down.
+
** '''Gray''': The service is in 'unknown' status. This usually means the service has been recently created and does not contain any modules or the Pandora FMS Prediction server is down.
  
* '''SLA''': The current value of the SLA Service. The values can be:
+
;SLA: The current value of the SLA Service. The values can be:
** '''OK''': the SLA is met for the interval defined in the SLA service.
+
** '''OK''': The SLA is met for the interval defined in the SLA service.
** '''INCORRECT''': The SLA is not met for the interval currently defined in the SLA Service.
+
** '''INCORRECT''': The SLA is not met for the interval currently defined in the SLA Service.
** '''N/A''': The SLA is in 'unknown' status because there is insufficient data to perform the calculation.
+
** '''N/A''': The SLA is in 'unknown' status because there is not enough data to perform the calculation.
  
===== Table of all services =====
+
===== Table including all services =====
  
A table for quick display of all visible services and their current status.
+
A table for quick display including all visible services and their current status.
 
<br>
 
<br>
 
[[File:Servs.JPG|center|800px]]
 
[[File:Servs.JPG|center|800px]]
 
<br>
 
<br>
  
===== List-based view of a Service and its Elements =====
+
===== Simple list of a service and its elements =====
  
 
This view is accessible by clicking on the name of a service in the list of all services, or through the magnifying glass icon tab in the service title header.
 
This view is accessible by clicking on the name of a service in the list of all services, or through the magnifying glass icon tab in the service title header.
  
Pandora will show a page similar to the one shown in the following screenshot:
+
Pandora FMS will show a page similar to the one shown in the following screenshot:
 
<center>
 
<center>
 
[[Image:Services list elements operation v5.png|center|800px]]
 
[[Image:Services list elements operation v5.png|center|800px]]
 
</center>
 
</center>
  
In the screenshot, we can distinguish two zones, the service with the same columns as in the previous view at the top. And the list of the elements that compose this service at the bottom.
+
The list of the elements that make up this service is at the bottom:
  
The list of elements appears in table format, where the rows correspond to each element and the columns represent:
+
* '''Type''': The icon which represents the element type. It is a building block for modules or some stacked blocks for an agent and a Network Diagram Icon for the services.
 
 
* '''Type''': The icon which represents the type of an element. It's a Lego block for modules, some stacked Lego blocks for an agent and a Network Diagram Icon for the services.
 
 
 
* '''Name''': The text which contains the name of the module, agent or service. They're also linked to the proper section.
 
* '''Description''': A small free-text field intended for a short description.
 
* '''Weight critical''': The value if the element is in 'critical' status.
 
* '''Weight warning''': The value if the element is in 'warning' status.
 
* '''Weight normal''': The value if the element is in 'normal' status.
 
* '''Data''': The value of the element. It's able to adopt the following modes:
 
  
 +
;Type: The text which contains the name of the module, agent or service. They are also linked to the corresponding section.
 +
;Name: Text that contains the name of the agent, the name of the agent and module or the name of the service. All of them contain a link to the corresponding operation view.
 +
;Weight critical: The weight of the element when in 'critical' status. The following three columns ('''Warning weight''', '''Weight Unknown''' and '''Weight OK''') correspond to the ''warning'', ''unknown'' and ''normal'' statuses.
 +
;Data: The value of the element. It can adopt the following modes:
 
** '''Module:''' The value of the module.
 
** '''Module:''' The value of the module.
** '''Agents:''' The text which displays the agent's status.
+
** '''Agents:''' The text that displays the agent's status.
** '''Service:''' The sum of all elements' weights from the selected service.
+
** '''Services:''' The weight sum of the elements of the service that has been chosen as the element for the parent service.
 +
;Status: The icon which represents the element's status by color.
  
* '''Status:''' The icon which represents the element's status by color.
+
{{Warning|Keep in mind that service-element calculation is performed by Prediction Server. What you look at is '''not''' real-time data. There are some situations in which a module's agent is added to the service where its weight will '''not''' be updated until calculation is performed by the Prediction Server again.}}
  
{{warning|Keep in mind that the service-elements calculation is performed by the Prediction Server. It's '''not''' real-time data you're looking at. There are some situations in which a module's agent is added to the service where its weight is '''not''' going to be updated until the calculation is performed by the Prediction Server again.}}
+
==== Service map view ====
 
 
===== Service Map View =====
 
 
 
To access this view, you're required to click on the flap above the header in the service operation view, as you can see in the picture below.
 
 
 
<center>
 
[[Image:Services tab servicemap v5.png|center]]
 
</center>
 
  
This view will display the service in arborescent form as you can see in the following screenshot. In this way, it is possible to quickly see how modules, agents or sub-services influence the monitoring of the service. Even in the subservices you can see what influences them when calculating the status by the sum of the weights.
+
This view will display the service in arborescent form as you can see in the following screenshot. That way, it is possible to quickly see how modules, agents or sub-services influence service monitoring. Even in sub-services you can see what influences them when calculating the status by summing weights.
  
 
<center>
 
<center>
Line 395: Line 402:
  
 
The possible nodes can be:
 
The possible nodes can be:
* '''Module Node:''' It's represented by the 'heartbeat' icon. This node is always final (leaf).
+
;Module Node: It is represented by the 'heartbeat' icon. This node is always final (leaf).
* '''Agent Node:''' It's represented by the 'CPU box' icon. This module is also always final (leaf).
+
;Agent Node: It is represented by the 'CPU box' icon. This node is always final too (leaf).
* '''Service Node:''' It's represented by the 'crossed hammer and wrench' icon. This module is not a final node. It's required to contain additional nodes.
+
;Service Node: It is represented by the 'crossed hammer and wrench' icon. This node is not a final node. It is required to contain additional nodes.
  
The node's colors and the arrow which connects them to the service depend on the node's status.
+
The color of each node and of the arrow which connects it to the service depends on the node's status: as always, green for OK, red for critical, yellow for warning and grey for unknown.
  
 
There are the following attributes within the node:
 
There are the following attributes within the node:
  
* '''Title:''' The name of the service's / agent's or module's node.
+
* '''Title:''' The name of the service's / agent's or module's node, accompanied by the agent.
* '''Value list:''' This list refers to the possible numeric value calculated for that instance. It accepts any assigned integer value.
+
* '''Value list:'''  
** '''Critical:''' The weight if it reaches 'critical' status, except if it's the root-service node, which represents a threshold to reach the 'critical' status.
+
** '''Critical:''' The total weight it reaches in 'critical' status, except if it is the root-service node, which represents a threshold to reach the 'critical' status.
** '''Warning:''' The weight if it reaches 'warning' status, except if it's the root-service node, which represents the threshold to reach the 'warning' status.
+
** '''Warning:''' The weight if it reaches 'warning' status, except if it is the root-service node, which represents the threshold to reach the 'warning' status.
** '''Normal:''' The weight if it reaches 'normal' status, except if it's the root-service node, in which case nothing is going to be displayed here.
+
** '''Normal:''' The weight if it reaches 'normal' status, except if it is the root-service node, in which case nothing will be displayed here.
** '''Unknown:''' The 'unknown' status, except if it's the root-service node, which represents a threshold to reach the 'unknown' status.
+
** '''Unknown:''' The 'unknown' status, except if it is the root-service node, which represents a threshold to reach the 'unknown' status.
  
 
You may click on each node in the tree. The target link represents the operational view of the node itself.
 
You may click on each node in the tree. The target link represents the operational view of the node itself.
Line 415: Line 422:
 
{{tip|When the service mode is ''simple'', a red exclamation mark appears on the right side of the critical elements.}}
 
{{tip|When the service mode is ''simple'', a red exclamation mark appears on the right side of the critical elements.}}
  
===== Services within the Visual Console =====
+
==== Services within the Visual Console ====
  
From Pandora FMS versions 5 and above, you may add services in the Visual Console like any other item on the map.
+
From Pandora FMS versions 5 onwards, you may add services in the Visual Console like any other item on the map.
  
<center>
+
<br><center><br>
[[Image:Services visualmap v5.png|center|800px]]
+
[[Image:Servicios1.JPG|center|800px]]
</center>
+
</center><br><br>
  
To create a service item on a map, the process is the same as for all other visual map items, but the options palette will be the same as in the screenshot.
+
To create a service item on a map, the process is the same as for all other visual map items, but the option range will be:
  
<center>
+
<br><center><br>
[[Image:Services visualmap add item v5.png|center|800px]]
+
[[Image:Servicios2.JPG|center|800px]]
</center>
+
</center><br><br>
  
It contains the following attributes:
+
Controls:
  
* '''Label''': The title which is going to be shown within the visual console's node.
+
* '''Label''': The title shown within the visual console's node.
* '''Service''': The service that's going to be represented.
+
* '''Service''': Drop-down list showing the services the user has access to, from which to choose the one to add to the map.
  
Note that a service item, unlike other items on the visual map, cannot be linked to other visual maps, and always the clickable link in the visual console is intended for the tree service map view described above.
+
Note that a service item, unlike other items in the visual map, cannot be linked to other visual maps. Its clickable link in the visual console always leads to the tree service map view described above.
  
==== Services Tree View ====
+
=== Service tree view ===
  
This view allows you to view the services in the form of a tree.
+
This view allows you to view services in the form of a tree.
  
At each level, a count of the number of elements included in each service or agent is shown.  
+
Each level shows the total number of elements included in each service or agent.  
* Services: reports the total number of services, agents and modules that belong to that service.  
+
* Services: It reports the total number of services, agents and modules that belong to that service.  
* Agents: reports the number of modules in critical state (red color), warning (yellow color), unknown (gray color), uninitiated (blue color) and normal state (green color).
+
* Agents: It reports the number of modules in critical state (red color), warning (yellow color), unknown (gray color), uninitiated (blue color) and normal state (green color).
  
Services that do not belong to another one will always be shown in the first level. In the case of a child service, it will be shown nested inside its parent.
+
Services that do not belong to another one will always be shown on the first level. In the case of a child service, it will be shown nested inside its parent.
  
 
<center>
 
<center>
Line 450: Line 457:
 
</center>
 
</center>
  
{{Warning|ACLs permission restriction is only applied to the first level.}}
+
{{Warning|ACL permission restriction is only applied to the first level.}}
  
 
<br><br>
 
<br><br>
  
=== How to read the service values  ===
+
== How to read service values  ==
  
Planned shutdowns added before the stop date allow us to recalculate the value of the SLA reports. First, we need to activate it in the general setup. When it comes to an SLA service report, if there is a scheduled outage affecting one or more elements of the service, it is considered that the planned shutdown affects the entire service, because the system cannot evaluate the impact of a service component "inactive" in the whole service.
+
Planned shutdowns added before the stop date allow recalculating the value of SLA reports. "Backwards" recalculation with scheduled shutdowns added afterwards is also possible when that option is activated globally in the general setup. In an SLA service report, if a scheduled shutdown affects one or several service elements, it is considered to affect the service as a whole, since the impact of the shutdown on the whole service cannot be measured.
 +
 
 +
It is worth highlighting that this applies at report level only. Therefore, service trees and the information presented in the visual console are not altered by planned shutdowns added after the intended execution date. These service compliance percentages are calculated in real time, based on the history data of the service itself, and they are independent of the report.
  
It is important to remember that this is at report level; service map, and the information presented in the visual console are not altered based on planned shutdowns added after the effective execution date. These service compliance percentages are calculated in real-time, based on the history data of the same service, it is very different than a report which can be "cooked" adding a "fake" planned downtime.
+
On the other hand, it is important to know how the compliance percentage of a service is calculated:
  
On the other hand, it is important to know how the compliance of a service is calculated:
+
;Weight calculation in simple mode
 +
Weights are handled slightly differently in simple mode, since elements only have a critical weight and the service can only go into two statuses apart from normal. Each element receives weight 1 when critical and 0 in any other status, and service weights are recalculated every time the service elements change. The service warning weight can be overlooked: it always has value 0.5, because if it were 0 the service would always be at least in warning status, but the warning weight is not used in simple mode. The service critical weight is calculated as half of the sum of the element critical weights, each of which is 1. For example, if there are 3 elements, the service critical weight is 1.5. The server is then in charge of checking whether the critical weight has been matched or exceeded, to set the service to critical or warning status.
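The simple mode arithmetic described above can be sketched as follows. This is a hedged illustration rather than the actual server logic: the function name <code>simple_mode_status</code> is hypothetical, the input is just one boolean per element that takes part in the calculation, the "unknown" result for an empty service is an assumption, and the warning rule (any weight summed below the critical threshold) is one reading of the 0.5 warning value mentioned in the paragraph.

```python
def simple_mode_status(elements_in_critical):
    """Estimate a simple-mode service status.

    elements_in_critical: list of booleans, True when that element is
    currently in critical status (weight 1), False otherwise (weight 0).
    """
    if not elements_in_critical:
        # Assumption: a service without elements has nothing to evaluate.
        return "unknown"

    # Each element has critical weight 1, so the service critical
    # threshold is half the number of elements (3 elements -> 1.5).
    critical_threshold = len(elements_in_critical) / 2
    warning_threshold = 0.5  # fixed value stated in the text

    weight_sum = sum(1 for is_critical in elements_in_critical if is_critical)

    if weight_sum >= critical_threshold:
        return "critical"
    if weight_sum >= warning_threshold:
        return "warning"
    return "normal"


# With 3 elements the critical threshold is 1.5:
one_down = simple_mode_status([True, False, False])   # weight sum 1
two_down = simple_mode_status([True, True, False])    # weight sum 2
```

Under these assumptions, one failing element out of three leaves the service in warning, while two failing elements reach the 1.5 threshold and turn it critical.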
  
Let's suppose we have a service defined by a 95% compliance in an interval of 1 hour (this is very short for the real world, but good for understanding the internal algorithm). We will use a table of values, where t is time, x is the % compliance (SLAs), and s is whether or not the service complies (1 complies, 0 fails). In 1 hour we should have exactly 12 values, assuming an interval of 5 minutes.
+
;Weight calculation according to their importance
 +
Suppose there is a service defined by a 95% compliance in a 1-hour interval. A table of values will be used, where t is time, x is the compliance % (SLAs), and s indicates whether the service complies or not (1 if it complies, 0 if it fails). In 1 hour there should be exactly 12 samples (assuming the interval is 5 minutes long).
  
A similar case, where the service complies for the first 11 samples (first 55 minutes) and in the 60th minute, it fails, we would have these values:
+
Picture a similar case, where the service complies for the first 11 samples (first 55 minutes) and fails in the 60th minute. These would be the values:
  
 
<pre>
 
<pre>
Line 483: Line 494:
 
</pre>
 
</pre>
  
This case is easier to calculate. The % is calculated depending on the number of samples, for example in t3 there are a total of three samples that meet service, 100%, whereas t12, we have 12 and 11 valid samples: 11 / 12.
+
This case is easier to calculate. The % is calculated depending on the number of samples. For example, in t3 there are a total of three samples and all of them meet the service, so compliance is 100%, whereas in t12 there are 12 samples and 11 of them are valid: 11 / 12.
  
Assume you are in the middle of the series, and it is recovering slowly:
+
Suppose you are in the middle of the series, and it is recovering slowly:
  
 
<pre>
 
<pre>
Line 504: Line 515:
 
</pre>
 
</pre>
  
So far all seems similar to the previous scenario, but let's see what happens if we go over time:
+
So far all seems similar to the previous scenario, but see what happens if you go over time:
  
 
<pre>
 
<pre>
Line 519: Line 530:
 
</pre>
 
</pre>
  
Now we see unintuitive behavior, because the volume of valid samples remains 11 for a window of time to get to t18, where the only invalid value is out of the window, so in t18 compliance becomes 100%. This step between 91.6 and 100 is explained by the size of the window. The larger the window is (usually SLA calculation interval is daily, weekly or monthly), the less abrupt will be the step.
+
Now there is unintuitive behavior, because the volume of valid samples remains 11 for a time window that goes up to t18, where the only invalid value is out of the window, so in t18 compliance becomes 100%. This step between 91.6 and 100 is explained by the size of the window. The larger the window is (usually SLA calculation interval is daily, weekly or monthly), the less abrupt the step will be.
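The step described above can be reproduced with a short Python sketch of a sliding window. It is an approximation for illustration only, not the Prediction Server implementation: the function name <code>sliding_compliance</code> is hypothetical, the window is assumed to hold the last 12 samples (1 hour at 5-minute intervals), and the single failing sample is placed at t6 purely as an assumption that is consistent with compliance returning to 100% at t18.

```python
from collections import deque


def sliding_compliance(samples, window=12):
    """Return the compliance percentage after each sample.

    samples: sequence of 1 (service complies) or 0 (service fails).
    Only the most recent `window` samples count toward each percentage.
    """
    recent = deque(maxlen=window)  # old samples drop out automatically
    percentages = []
    for sample in samples:
        recent.append(sample)
        percentages.append(100.0 * sum(recent) / len(recent))
    return percentages


# 18 samples (t1..t18), with one assumed failure at t6.
samples = [1] * 18
samples[5] = 0

history = sliding_compliance(samples)
at_t17 = history[16]  # window t6..t17 still contains the failure: 11 / 12
at_t18 = history[17]  # window t7..t18 no longer contains it: 12 / 12
```

With a 12-sample window a single failure weighs 1/12 (about 8.3 points), which is the size of the jump. A longer window makes each sample weigh less, so the step is smaller.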
 +
 
 +
== Service cascade protection ==
 +
 
 +
 
 +
{{Tip|[[Image:icono-modulo-enterprise.png|Enterprise version.]]<br>Version NG 725 or higher.}}
 +
 
 +
 
 +
It is possible to mute service elements dynamically. This avoids an alert overload for each element that belongs to a certain service or sub-service.
 +
 
 +
When the 'service cascade protection' feature is enabled, the action linked to the template configured for the root service will be executed. It will report which elements have an incorrect status within the service.
 +
 
 +
It is important to take into account that this system allows the alerts of the elements within the service to be triggered when they go to critical status, even if the general service status is correct.
 +
 
 +
Service cascade protection will indicate which elements have failed regardless of the depth of the defined service.
 +
 
 +
<center>
 +
[[File:service2test.png]]
 +
</center>
 +
 
 +
In the example above, one of the elements of the service is in critical status. Even if the main service is correct, it will warn of the critical state of the elements within, triggering the alert related to the element in critical status.
 +
 
 +
== Root cause analysis ==
 +
 
 +
 
 +
You may have an endless number of sub-services (paths) within a service. In previous versions, Pandora FMS alerted indicating the service status (normal, critical, warning, etc.). From OUM725 on, there is a new macro available that will show the service status root cause.
 +
 
 +
To use it, add the following text to the template linked to the service:
 +
 
  
=== Service grouping ===
+
Alert body: Example message
 +
The series of events that have caused the service status is the following one:
 +
_rca_
  
Services are logical groups that conform part of a business structure. Due to that, it makes sense to group services, because in a lot of cases there can be dependences between them, conforming, for example, a global company service composed by some other particular services (webpage, communications, etc). To group the services, it's necessary to create the big general service and the smaller ones that will be aggregated to the global service, creating a logical tree structure.
 
  
The service groups can help us to: create visual maps, configure alerts, apply monitoring policies, etc. So we can create specific alerts when the company service is down due to the commercial department not being able to work, or the webpage being offline.
+
This will return an output similar to this one:
  
Next we have two examples to understand service grouping.
+
Alert body: Example message
 +
The series of events that have caused the service status is the following one:
 +
[Web Application -> HW -> Apache server 3]
 +
[Web Application -> HW -> Apache server 4]
 +
[Web Application -> HW -> Apache server 10]
 +
[Web Application -> DB Instances -> MySQL_base_1]
 +
[Web Application -> DB Instances -> MySQL_base_5]
 +
[Web Application -> Balanceadores -> 192.168.10.139]
  
=== Examples of services monitoring ===
 
  
==== PandoraFMS service ====
+
From this output, it can be inferred that:
  
In this case the service of PandoraFMS is being monitored. It is composed of the Apache service, MySQL, Pandora server and Tentacle server. Every one of these elements also constitutes a service with different components, creating a tree-type structure.
+
* Apache servers 3, 4 and 10 are in critical status
 +
* MySQL_base databases 1 and 5 are down
 +
* The 192.168.10.139 balancer does not respond
  
  
[[File:Arbol.JPG|800px|center]]
+
This added information makes it possible to find out the reason behind the service status, reducing the effort spent researching the cause of a failure.
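When the alert text is forwarded to another tool, the expanded macro can be post-processed. The following Python sketch groups the failing elements by their parent sub-service. It assumes that each root cause path arrives on its own line in the bracketed <code>A -> B -> C</code> form shown in the sample output above. Both function names are hypothetical, and the exact format produced by <code>_rca_</code> should be checked against a real alert before relying on this.

```python
def parse_rca_lines(text):
    """Turn bracketed 'A -> B -> C' lines into lists of path components."""
    paths = []
    for line in text.splitlines():
        line = line.strip()
        # Lines that are not wrapped in brackets (headers, blanks) are skipped.
        if line.startswith("[") and line.endswith("]"):
            paths.append([part.strip() for part in line[1:-1].split("->")])
    return paths


def failing_elements_by_subservice(paths):
    """Group the last component of each path under its direct parent."""
    grouped = {}
    for path in paths:
        if len(path) < 2:
            continue  # a lone root has no parent to group under
        parent, leaf = path[-2], path[-1]
        grouped.setdefault(parent, []).append(leaf)
    return grouped


alert_body = """
The series of events that have caused the service status is the following one:
[Web Application -> HW -> Apache server 3]
[Web Application -> HW -> Apache server 4]
[Web Application -> DB Instances -> MySQL_base_1]
"""

summary = failing_elements_by_subservice(parse_rca_lines(alert_body))
```

In this sketch, <code>summary</code> maps each sub-service name to the list of its failing elements, which is enough to build a per-team notification.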
  
 +
== Service grouping ==
  
The general Pandora service will turn into critical status if it reaches the weight of 2, and warning status with 1.
+
Services are logical groupings that make up an organization's business structure. That is why grouping services may make sense, since in many cases they depend on each other, forming for example a general service (the company) made up of more specific services (corporate web, communications, etc.). To group services, both the general service and the more specific ones must be created, and the latter must be added to the former to create the logical tree-shaped structure.
As you can see, the four components have different weights over the Pandora service:
 
* '''MySQL:''' critical for the Pandora service, individual weight of 2 if MySQL is down. It will have weight 1 if it turns into warning status, already displaying yellow status on the Pandora service.
 
* '''Pandora Server:''' critical for the Pandora service, individual weight of 2 if Pandora Server is down. Individual weight of 1 if it is on warning status, displaying the warning status on the Pandora service for example if it reaches a heavy CPU load.
 
* '''Apache:''' it means a degradation of the global Pandora service, but not a complete interruption, so it will have an individual weight of 1 if it is down, showing the warning status on the Pandora service.
 
* '''Tentacle:''' same as the Apache, it means a degradation of the service, but not a total interruption, so it gets 1 of weight if down, and will display warning status.
 
  
In the following picture we can see the setup of the different weights for the elements over the Pandora general service:
+
These groups may help you to: create visual maps, configure alerts, apply monitoring policies, etc. Therefore, it is possible to create alerts that warn you when the business goes into critical status because sales representatives cannot do their job, or any branch is not working at full capacity due to technical problems with the ERP service.
  
 +
To understand more clearly what service grouping is, take a look at these examples.
  
[[File:Pesos.JPG|800px|center]]
+
== Service monitoring examples ==
  
 +
=== Pandora FMS service ===
  
 +
Use case that monitors the status of the Pandora FMS monitoring service, made up of the Apache and MySQL services plus the Pandora FMS server and Tentacle, each with its respective weight.
  
==== Storage cluster, grouping of services ====
 
  
Services are logically arranged groups which are part of a company's business structure. Therefore, it may be necessary to create groups of services, because services alone sometimes don't have an appropriate context. To create service groups, you're required to add each service to an existing agent. In this case, a service is going to be a module of an agent. You're able to create a new logical structure (a group of services) by these groups.
+
[[Image:Pesos.JPG|center|800px|Click to zoom in]]
  
On the following example we have an HA storage cluster. For this case there are two fileserver systems working in parallel, each one controlling the percentage and status of some different disks that provide service to specific departments, creating a tree-type structure with grouped services.
+
Each of these elements is at the same time a service with different components, creating through service grouping a tree-shaped structure.
 +
 
 +
[[Image:Arbol.JPG|center|800px|Click to zoom in]]
 +
 
 +
In this case, the general Pandora FMS service will go into <code>critical</code> status when reaching weight 2 and <code>warning</code> when it reaches weight 1.
 +
As seen, the four components have different weights on Pandora FMS service:
 +
* '''MySQL:''' It is essential for Pandora FMS service. Individual weight of 2 if MySQL is down. It will get a weight of 1 if it is in warning status, showing a warning in Pandora FMS service.
 +
* '''Pandora Server:''' It is essential for Pandora FMS service. Individual weight of 2 if the Pandora FMS Server is down. Individual weight of 1 if it is in warning status, for example, due to CPU overload, scaling the warning until reaching Pandora FMS general service.
 +
* '''Apache:''' It implies a degrading of Pandora FMS service, but not a total interruption, so it gets an individual weight of 1 if it is down, showing the warning status in Pandora FMS service.
 +
* '''Tentacle:''' It entails a degrading, and certain components may fail, but it does not mean Pandora FMS stops working completely, so its individual weight in case of failure is 1, showing a warning in the general service.
 +
 
 +
=== Cluster storing service, service grouping ===
 +
 
 +
Services are logical groups that make up part of the business structure of an organization. Therefore, service grouping is reasonable since sometimes some services on their own do not have a complete meaning. To group services, they just need to be added to a greater service as elements, creating a new logical group.
 +
 
 +
In the following example, there is an HA storing cluster. This time, a system of two fileservers working at the same time has been chosen, each one controlling the percentage and the status of a series of hard drives that provide service to particular departments, creating a group service tree-shaped structure.
  
  
Line 563: Line 624:
  
  
According to this structure, the critical threshold of the storage service will be reached only if both fileservers fail, which would totally deny the service. If only one of the fileservers fails, the service would only be degraded.
+
According to this structure, the critical threshold of the company's storing service is reached when both fileservers fail, since that would turn down the service, while just one of them failing would entail a service downgrading.
In the screenshot below we can appreciate the weight configuration of the two main elements of the storage service:
+
The following image contains weight configuration granted to two storing service main elements:
  
  
Line 570: Line 631:
  
  
In the following image, we can see the content and weight configuration of the grouped service FS01. Here, the elements will have a specific weight according to their criticality, being:
+
This image shows the content and weight configuration of the FS01 grouped service. Here the elements have a specific weight according to their severity:
* '''FS01 ALIVE:''' critical to the FS01 service, since it is the virtual IP assigned to the first disk cluster. Individual weight of 2: if it is down, the other elements would automatically be inoperative. In this case there is no warning threshold, since it is a yes/no based type of information.
+
* '''FS01 ALIVE:''' Critical for the FS01 service, since it is the virtual IP allocated to the first hard drive cluster. Individual weight of 2, since if it is down, the rest of the service elements will not work. There is no <code>warning</code> threshold, since it is data that depends on the status Yes/No.
* '''DHCPserver ping:''' critical to the FS01 service, we give it an individual weight of 2. In this case there is no warning threshold either.
+
* '''DHCPserver ping:''' critical for the FS01 service. It has an individual weight of 2. In this case, there is no <code>warning</code> threshold either.
* '''Disks:''' we give them an individual weight of 1 in case they reach their own critical status, and 0.5 for their warning status. According to this, the FS01 service will only reach critical status if there are two disks in critical status or four in warning status.
+
* '''Hard drives''': They have an individual weight of 1 in case they reach their critical threshold, and 0.5 for their <code>warning</code> threshold, so this will only affect critically the FS01 service if there are at least two in critical status or the four hard drives in warning status.
  
  
 
[[File:Pesosfs01.JPG|center|800px]]
 
[[File:Pesosfs01.JPG|center|800px]]
 
== Pandora Server ==
 
 
The Prediction Service must be running properly and the Enterprise version of Pandora FMS must be installed.
 
  
 
[[Pandora:Documentation_en|Go back to Pandora FMS documentation index]]
 
[[Pandora:Documentation_en|Go back to Pandora FMS documentation index]]
  
 
[[Category:Pandora 3.0]]
 
[[Category:Pandora 3.0]]

Latest revision as of 08:23, 22 February 2021

Go back to Pandora FMS documentation index


1 Service Monitoring

Info.png

Enterprise version.
Service monitoring is a Pandora FMS Enterprise-exclusive feature.

 


1.1 Introduction

A service in Pandora FMS is a way to group IT resources based on their features.

A service could be an official website, a CRM system, a support application, or even printers. Services are logical groups which can include hosts, routers, switches, firewalls, CRMs, ERPs, websites and of course, different other services.

In Pandora FMS, services are represented as a group of monitored elements (Modules, Agents or other Services) whose individual status affects in a certain way the global performance of the service provided. To learn more, watch our video tutorial "Service monitoring in Pandora FMS"

1.2 Services under Pandora FMS

Basic monitoring in Pandora FMS consists of collecting metrics from different sources and representing them as monitors (modules). Service-based monitoring groups these modules so that, by setting ranges based on accumulated failures, you can monitor groups of different types of elements and their relationship within a larger, general service.

In short, service monitoring allows you to check the status of a global service. You will be able to know whether your service is being provided normally (green), is degraded (yellow) or is not being provided at all (red).

Color legend and its meaning.

Service monitoring is represented under three concepts: simple, by weight importance and chained by cascade events.

1.2.1 How simple mode works

In this mode it is only necessary to point out which elements are critical and which ones are not.

When creating a new service you may select simple mode.

Only elements checked as critical will be taken into account in the calculations, and only the critical status of said elements will have value.

  • When more than 0% and up to 50% of the critical elements are in critical status, the service will go into warning status.
  • When more than 50% of the critical elements go into critical status, the service will go into critical status.

Example:

  • Router is a critical element.
  • Printer is a non critical element.
  • Apache Web Server is a critical element.

Situation 1:

  • Router, critical status.
  • Printer, critical status.
  • Apache Server, warning status.

Result: The service is in warning status. The printer is not a critical element, the router is in critical status and represents only 50% of the critical elements, and the Apache server is not in critical status, so it adds no value to the evaluation.

Situation 2:

  • Router, critical status.
  • Printer, critical status.
  • Apache Server, critical status.

Result: Service in critical status (the printer still adds no value).

Situation 3:

  • Router, normal status.
  • Printer, critical status.
  • Apache Server, normal status.

Result: The status of the service would be normal, since no key element is in critical status (again the printer does not add any value).
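The three situations above can be reproduced with a minimal sketch. It assumes the rule described in this section (only elements flagged as critical count, and only their critical status matters); the function name and the tuple layout are hypothetical and are not part of Pandora FMS.

```python
def simple_mode_status(elements):
    """Evaluate a simple-mode service.

    elements: list of (name, is_critical_element, status) tuples
    (hypothetical layout used only for this illustration).
    """
    # Only elements flagged as critical take part in the calculation.
    key_statuses = [status for _, is_key, status in elements if is_key]
    if not key_statuses:
        return "normal"
    failed = sum(1 for status in key_statuses if status == "critical")
    ratio = failed / len(key_statuses)
    if ratio > 0.5:
        return "critical"
    if ratio > 0:
        return "warning"
    return "normal"


situation_1 = [("Router", True, "critical"),
               ("Printer", False, "critical"),
               ("Apache Web Server", True, "warning")]
situation_2 = [("Router", True, "critical"),
               ("Printer", False, "critical"),
               ("Apache Web Server", True, "critical")]
situation_3 = [("Router", True, "normal"),
               ("Printer", False, "critical"),
               ("Apache Web Server", True, "normal")]

for situation in (situation_1, situation_2, situation_3):
    print(simple_mode_status(situation))
```

The printer never changes the result because it is filtered out before the ratio is computed.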

1.2.2 How services work according to their weight

The need to monitor services as something "abstract" arises when faced with the following question:

What happens to an application if a non-critical element fails?

To solve all these doubts, in Pandora FMS there is the service monitoring feature that helps:

  • Limit the number of received alerts. You will receive alerts about situations that compromise the reliability of the services you provide.
  • Track the SLA compliance level.
  • Simplify the monitoring display of your infrastructure.


To achieve this, monitor every element that could negatively affect your application.

Through Pandora FMS console, define a service tree in which to indicate both the elements that affect your application, as well as their impact degree.

All elements added to the service trees will correspond to information that is already being monitored, either in the form of modules, specific agents or other services.


To indicate the degree to which the status of each element affects the overall status, a weight sum system is used. The most important elements (those with more weight) push the overall status of the whole service towards an incorrect status sooner than less important elements (those with less weight).

1.2.2.1 Example

You may monitor a web application balanced through a series of redundant elements. In this example, the infrastructure the application relies on is made up of the following elements:

  • Two HA routers.
  • Two HA switches.
  • Twenty Web Apache® servers.
  • Four WebLogic® application servers.
  • One MySQL® cluster made up of two storage nodes and two SQL processing nodes.

The goal is to find out whether the web application is working properly, meaning that from the clients' point of view the application receives, processes and returns requests within a reasonable time.

If one of the twenty Apache servers were offline, due to so much redundancy, would it be wise to warn or alert all the employees? What is the rule for alerting?

You may conclude Pandora FMS should only warn if a highly critical element fails (for example a router) or if several Apache servers are offline at the same time... but how many of them? To solve this, weight values must be assigned to the list of previously described components:

Switches and routers
5 points to each one when they are in critical and 3 points if they are in warning.
Web servers
1.2 points to each one in critical, warning status is not contemplated.
WebLogic servers
2 points to each one in critical.
MySQL cluster
5 points to each one in critical and 3 points in warning.
Weight assignment by element type
Element type      Normal  Warning  Critical  Unknown
Router            0       3        5         5
Switch            0       3        5         5
Apache server     0       0        1.2       1.2
WebLogic server   0       0        2         2
MySQL server      0       3        5         5

In a normal situation, the sum of those weights is zero. In this example, the service goes into warning status when the sum reaches 4 and into critical status when it reaches 6:

Service configuration
Normal  Warning  Critical
0       >=4      >=6

Failure scenarios:

  • An Apache web server is offline (critical status): since everything else is in normal status and adds 0, the total is 1.2. Since 1.2 < 4 (warning threshold), the service stays in OK (normal) status.
  • A web server and a WebLogic server are both in critical status: the first one adds 1.2 points and the second 2.0, for a total of 3.2. This is still lower than 4, so the service stays in OK status and no alerts or actions are needed.
  • Now two web servers and a WebLogic server are offline: 2 x 1.2 + 1 x 2 = 4.4. This exceeds the warning threshold, so the service goes into warning status. It is still working and may not require immediate technical action, but there is clearly a problem in the infrastructure.
  • A router in critical status is added to the previous situation: it adds 5 points to the weight sum (4.4 + 5 = 9.4), which exceeds the critical threshold set at 6. The service goes into critical status: it is not working and immediate technical action is required.

In this last situation, Pandora FMS will alert the corresponding working team (operators, technicians, etc.).
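The four failure scenarios can be checked with a short sketch of the weight sum. The weights and thresholds come from the tables in this example; the dictionary layout and the function name `evaluate_service` are assumptions made for illustration and do not reflect how Pandora FMS stores or computes them internally.

```python
# Weights per element type and status, copied from the example table.
WEIGHTS = {
    "router":   {"normal": 0, "warning": 3, "critical": 5,   "unknown": 5},
    "switch":   {"normal": 0, "warning": 3, "critical": 5,   "unknown": 5},
    "apache":   {"normal": 0, "warning": 0, "critical": 1.2, "unknown": 1.2},
    "weblogic": {"normal": 0, "warning": 0, "critical": 2,   "unknown": 2},
    "mysql":    {"normal": 0, "warning": 3, "critical": 5,   "unknown": 5},
}
WARNING_THRESHOLD = 4
CRITICAL_THRESHOLD = 6


def evaluate_service(affected):
    """Return (weight_sum, status) for a list of (element_type, status) pairs.

    Elements not listed are assumed to be in normal status (weight 0).
    """
    total = round(sum(WEIGHTS[kind][state] for kind, state in affected), 2)
    if total >= CRITICAL_THRESHOLD:
        status = "critical"
    elif total >= WARNING_THRESHOLD:
        status = "warning"
    else:
        status = "normal"
    return total, status


scenarios = [
    [("apache", "critical")],
    [("apache", "critical"), ("weblogic", "critical")],
    [("apache", "critical"), ("apache", "critical"), ("weblogic", "critical")],
    [("apache", "critical"), ("apache", "critical"), ("weblogic", "critical"),
     ("router", "critical")],
]
for scenario in scenarios:
    print(evaluate_service(scenario))
```

Changing a threshold or a single weight in the dictionary is enough to see how many simultaneous failures the service tolerates before its status changes.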

Info.png

You may get more interesting information about service monitoring in the Pandora FMS blog.

1.2.3 Root services

Info.png

Enterprise version.
Version NG 726 or higher.

 


A root service is one that is not part of another service. This logical concept makes monitoring smoother and reduces work queues.

In addition and based on that, when a service defined in a Pandora FMS node appears as a Metaconsole root service element, the Metaconsole server will be the one to evaluate it, updating the values stored in the node.

This provides a more efficient distributed logic and makes it possible to apply a cascade protection system based on services.

Metaconsole service possibilities have also been extended, allowing to add other services, modules or agents as service elements. In previous versions, only node services could be added.

1.3 Creating a new Service

1.3.1 Pandora FMS server

Template warning.png

The PredictionServer component must be enabled to be able to use these services.

 


It is necessary for the PredictionServer component to be working and for Pandora FMS Enterprise server version to be installed.

1.3.2 Introduction

The services may represent:

  • Modules.
  • Full agents.
  • Other Services.

Service values are calculated using the Prediction Server.

Once you have all the devices monitored, add to each service all the modules, agents or sub-services that you need to monitor the service. For example, if you want to monitor the Online Store service, you need a module for content, a service that monitors the state of communications and so on.

To create a new service, click on Services at the Topology Maps menu.


Menu services.png


A tree view containing all the available services will be shown.


Arbol servicios.png


1.3.3 Initial Configuration

To create a new service, click on Create service and fill out the form.



Formulario servicios.png


Name
Unique name to identify the service.
Description
Service description, a long mandatory text. Said description will appear in the service map, the service table view and the service widget (instead of the name).
Group
Group to which the service belongs, useful to organize it and to apply ACL restrictions.
Agent to store data
The service saves the data in some special data modules (in particular the prediction modules) and it is necessary to add an agent to be the container of said modules and the alarms (see the following steps).

Info.png

Note: Please bear in mind that the interval at which all the calculations of the service modules are performed depends on the interval of the agent configured as container.

 


Mode for weight calculations.
Mode
Mode in which the element weights will be calculated. It may have 2 values:
    • Smart: The service's weights and elements will be calculated automatically based on established rules.
    • Manual: The service's weights and elements will be indicated manually with fixed values.
  • Critical: Weight threshold to declare the service as critical. In smart mode this value will be a percentage. We will explain later how the elements contribute to this value.
  • Warning: Weight threshold to declare the service as in warning status. In smart mode, this value will be a percentage. We will explain later how the elements contribute to this value.
Unknown elements as critical
It allows you to indicate that elements in an unknown state contribute their weight as if they were a critical element.

Template warning.png

The smart mode is only available from Pandora FMS version 7.0NG 748.

The automatic and simple modes of previous versions will become manual by applying the MR 40 in the version update.

 


Favorite
It creates a direct link in the side menu, and services can be filtered in the views based on this criterion.
Servicios favoritos.png
Quiet
It activates the silence mode of the service, so it will not generate alerts or events.
Cascade protection enabled
It activates cascade protection over the service elements. These will not generate alerts or events if they belong to a service (or sub-service) that is in a critical state.
Calculate continuous SLA
It activates the creation of SLA and SLA value modules for the current service. If disabled, the dynamically calculated SLA information will not be available, nor will the alerts on SLA compliance for this service. It is used for cases where the number of services required is so high that it can affect performance.

Template warning.png

If this option is disabled, once the service has been created, the data history of these modules will be deleted, so information will be lost.

 


SLA Interval
Time period to calculate the effective SLA of the service.
SLA limit
Service status threshold in OK to be considered a positive SLA during the period of time you have configured in the previous field.
Alerts
In this section select templates that the service will have to launch the alert when the service goes into warning, critical, unknown status or when the service SLA is not met.

1.3.4 Element Configuration

Once the form has been correctly filled in, it will have an empty service which must be filled in with elements as we will see below. In the service edition form, select the 'Configure elements' tab.


Elementos servicios.png


By clicking on Add element, a pop-up window with a form will appear. The form will be slightly different if the service is in smart mode or in manual mode.


Formulario elementos servicios.png


Description
Optional text that will be used to represent the element on the service map. If not indicated, the name of the module, agent or service (depending on the added element) will be used.
Type
Drop-down list to choose whether the element will be a service, module or agent. In smart mode services you can also choose the dynamic type.
Agent
Intelligent agent search engine. Only visible if the element to create or edit is an agent or module type.
Module
Drop-down list with the modules of the agent previously chosen in the intelligent search engine. This control is only visible if the element to create or edit is a module type.
Service
Dropdown list of the services to create an element. Only visible if the element to be created or edited is a service element.

Info.png

It should also be noted that the services that will appear in the drop-down list are those that are not the ancestors of the service. This is necessary to show a correct tree structure of dependency between services.

 


1.3.4.1 Manual mode

The following fields will only be available for services in manual mode:

  • critical: Weight that the element will add to the service when in critical state.
  • warning: Weight that the element will add to the service when in warning state.
  • unknown: Weight that the element will add to the service when in unknown state.
  • normal: Weight that the element will add to the service when in normal state.

To calculate the status of a service, the weight of each of its elements will be added based on its status, and if it exceeds the thresholds established in the service for warning or critical, the status of the service will change to warning or critical accordingly.

1.3.4.2 Smart mode

In smart mode services, since no weights are defined for the elements, the way their status is calculated is as follows:

  • Critical elements contribute their full percentage to the weight of the service. This means that if, for example, there are 4 elements in the service and only 1 of them is critical, that element will add 25% to the weight of the service. If instead of 4 elements there were 5, the critical element would add 20% to the weight of the service.
  • Warning elements contribute half of their percentage to the weight of the service. This means that if for example a service has 4 elements and only 1 of them is in warning status, that element will add 12.5% to the weight of the service. If instead of 4 elements there were 5, the warning element would add 10% to the weight of the service.
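The percentages above can be sketched as follows. This is a minimal illustration that assumes every element has an equal share of 100% and that elements in any status other than critical or warning add nothing; the function name `smart_mode_weight` is hypothetical.

```python
def smart_mode_weight(statuses):
    """Return the service weight (as a percentage) for a list of element statuses."""
    if not statuses:
        return 0.0
    share = 100.0 / len(statuses)  # equal share per element
    total = 0.0
    for status in statuses:
        if status == "critical":
            total += share        # full share
        elif status == "warning":
            total += share / 2    # half share
    return total


print(smart_mode_weight(["critical", "normal", "normal", "normal"]))
print(smart_mode_weight(["warning", "normal", "normal", "normal"]))
print(smart_mode_weight(["critical", "normal", "normal", "normal", "normal"]))
print(smart_mode_weight(["warning", "normal", "normal", "normal", "normal"]))
```

The resulting percentage is what gets compared against the Critical and Warning percentages configured for the service.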
1.3.4.2.1 Dynamic mode
Topology maps-services-edit service elements-add element-01.png

The following fields will only be available for dynamic elements, in services in smart mode:

Matching object types
Drop-down list to choose whether the elements for which the dynamic rules will be evaluated and that will be part of the service will be agents or modules.
Filter by group
Rule to indicate the group the element must belong to to be part of the service.
Having agent name
Rule to indicate the name of the agent that must have the element to be part of the service. A text will be indicated that must be part of the name of the desired agent.
Having module name
Rule to indicate the module name that must have the element to be part of the service. A text that must be part of the desired module name will be indicated.
Topology maps-services-edit service elements-add element-02.png
Use regular expressions selector
If you activate this option, the search mechanism will use regular expressions (regex or regexp), following the way MySQL handles this type of expression.
Having custom field name
Rule to indicate the name of the custom field that must have the element to be part of the service. A text that must be part of the name of the desired custom field will be indicated.
Having custom field value
Rule to indicate the value of the custom field that the element must have to be part of the service. A text that must be part of the desired custom field value will be indicated.
Topology maps-services-edit service elements-add element-03.png

Info.png

You must place text in both fields to be considered when searching in custom fields.

 


Topology maps-services-edit service elements-add element-04.png

Info.png

Since version NG 752, it is possible to add searches in more custom fields. Elements will be selected if they match any of the keyword pairs set.

 


Topology maps-services-edit service elements-add element-07.png
For example
If you choose to filter the Agents in the group Servers whose Agent's name contains Firewall and Module name contains Network you can obtain the following result.
Topology maps-services-edit service elements-add element-06.png
For example
Consider the following dynamic element configuration.
Topology maps-services-edit service elements-add element-05.png

With this configuration, all the modules whose name includes "Host Alive", in an agent whose name includes "SW", inside the "Servers" group, with a custom field whose name includes "Department" and whose value includes "Systems", would be used as service elements.
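The way such a dynamic rule selects modules can be illustrated with a small sketch. It assumes plain case-sensitive substring matching (the regular expression option is not modeled), and the dictionaries, keys and agent names below are invented for the illustration; they are not Pandora FMS data structures.

```python
def matches_rule(module, rule):
    """Return True if a module satisfies every filter of a dynamic rule."""
    if module["group"] != rule["group"]:
        return False
    if rule["agent_name"] not in module["agent"]:
        return False
    if rule["module_name"] not in module["name"]:
        return False
    # At least one custom field must match both the name text and the value text.
    return any(
        rule["field_name"] in name and rule["field_value"] in value
        for name, value in module["custom_fields"].items()
    )


rule = {
    "group": "Servers",
    "agent_name": "SW",
    "module_name": "Host Alive",
    "field_name": "Department",
    "field_value": "Systems",
}

# Hypothetical inventory used only to exercise the rule.
inventory = [
    {"group": "Servers", "agent": "SW-core-01", "name": "Host Alive",
     "custom_fields": {"Department": "Systems"}},
    {"group": "Servers", "agent": "SW-core-02", "name": "CPU Load",
     "custom_fields": {"Department": "Systems"}},
    {"group": "Workstations", "agent": "SW-lab", "name": "Host Alive",
     "custom_fields": {"Department": "Systems"}},
    {"group": "Servers", "agent": "SW-edge", "name": "Host Alive",
     "custom_fields": {"Department": "Sales"}},
]

selected = [m["agent"] for m in inventory if matches_rule(m, rule)]
print(selected)
```

Only the first module passes all filters: the others fail on module name, group and custom field value respectively.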


Template warning.png

Dynamic elements are not affected by service cascade protection.

 


1.3.5 Modules created when configuring a service

  • SLA Value Service: The percentage value of SLA compliance. (async_data).
  • Service_SLA_Service: This shows whether the SLA is met or not. (async_proc).
  • Service_Service: This module shows the sum of the service weights. (async_data).



1.4 Service Visualization

1.4.1 Simple all-service view

This is the operation list that shows all created services. It only shows services in groups that the user of the Pandora FMS console has access to. Click Operation > Monitoring and then Services.

Services list services admin v5.png

Each row represents a service:

Group
The icon of the group the service belongs to.
Critical
The threshold value for weight sums to get the service into 'critical' status.
Warning
The threshold value for weight sums to get the service into 'warning' status.
Value
The current value for weight sums for the service.
Status
An icon that represents the status of the service. Four possible status are represented:
    • Red: The service is in 'critical' status because the value exceeded the critical threshold.
    • Yellow: The service is in 'warning' status because the value equaled or exceeded the warning threshold.
    • Green: The service is within the 'normal' range because weight sum does not reach the threshold.
    • Gray: The service is in 'unknown' status. This usually means the service has been recently created and does not contain any modules or the Pandora FMS Prediction server is down.
SLA
The current value of the SLA Service. The values can be:
    • OK: The SLA is met for the interval defined in the SLA service.
    • INCORRECT: The SLA is not met for the interval currently defined in the SLA Service.
    • N/A: The SLA is in 'unknown' status because there is not enough data to perform the calculation.
1.4.1.1 Table including all services

A table for quick display including all visible services and their current status.

Servs.JPG


1.4.1.2 Simple list of a service and its elements

This view is accessible by clicking on the name of a service in the list of all services, or through the magnifying glass icon tab in the service title header.

Pandora FMS will show a page similar to the one shown in the following screenshot:

Services list elements operation v5.png

The list of the elements that make up this service is at the bottom:

Type
The icon which represents the element type: a building block for modules, stacked blocks for agents and a network diagram icon for services.
Name
Text that contains the name of the agent, the name of the agent and module, or the name of the service. All of them link to the corresponding operation view.
Weight critical
The weight of the element when in 'critical' status. The following three columns (Warning weight, Weight Unknown and Weight OK) correspond to the warning, unknown and normal statuses.
Data
The value of the element. It can adopt the following modes:
    • Module: The value of the module.
    • Agents: The text that displays the agent's status.
    • Services: The weight sum of the elements of the service that has been chosen as the element for the parent service.
Status
The icon which represents the element's status by color.

Template warning.png

Keep in mind that service-element calculation is performed by Prediction Server. What you look at is not real-time data. There are some situations in which a module's agent is added to the service where its weight will not be updated until calculation is performed by the Prediction Server again.

 


1.4.1.3 Service map view

This view displays the service as a tree, as you can see in the following screenshot. That way, it is possible to quickly see how modules, agents or sub-services influence service monitoring. Even for sub-services, you can see which elements influence their status when weights are summed.

Services servicemap v5.png

The possible nodes can be:

Module Node
It is represented by the 'heartbeat' icon. This node is always final (leaf).
Agent Node
It is represented by the 'CPU box' icon. This node is always final too (leaf).
Service Node
It is represented by the 'crossed hammer and wrench' icon. This node is not a final node. It is required to contain additional nodes.

The color of each node and of the arrow that connects it to the service depends on the node's status: green for OK, red for critical, yellow for warning and grey for unknown.

There are the following attributes within the node:

  • Title: The name of the service's / agent's or module's node, accompanied by the agent.
  • Value list:
    • Critical: The weight it adds when in 'critical' status, except if it is the root-service node, where it represents the threshold to reach the 'critical' status.
    • Warning: The weight it adds when in 'warning' status, except if it is the root-service node, where it represents the threshold to reach the 'warning' status.
    • Normal: The weight it adds when in 'normal' status, except if it is the root-service node, in which case nothing will be displayed here.
    • Unknown: The weight it adds when in 'unknown' status, except if it is the root-service node, where it represents the threshold to reach the 'unknown' status.

You may click on each node in the tree. The target link represents the operational view of the node itself.


Info.png

When the service mode is simple, a red exclamation mark appears on the right side of the critical elements.

 


1.4.1.4 Services within the Visual Console

From Pandora FMS version 5 onwards, you may add services in the Visual Console like any other item on the map.



Servicios1.JPG


To create a service item on a map, the process is the same as for all other visual map items, but the available options are:



Servicios2.JPG


Controls:

  • Label: The title shown within the visual console's node.
  • Service: Drop-down list that shows the services the user has access to and can add to the map.

Note that a service item, unlike other items in the visual map, cannot be linked to other visual maps: its clickable link in the visual console always leads to the service map tree view described above.

1.4.2 Service tree view

This view allows you to view services in the form of a tree.

Each level shows the total number of elements included in each service or agent.

  • Services: It reports the total number of services, agents and modules that belong to that service.
  • Agents: It reports the number of modules in critical (red), warning (yellow), unknown (gray), not initialized (blue) and normal (green) status.

Services that do not belong to another one will always be shown on the first level. In the case of a child service, it will be shown nested inside its parent.

Services treeview.png


ACL permission restriction is only applied to the first level.

 




1.5 How to read service values

Planned shutdowns are taken into account when the values of SLA reports are calculated. Reports can also be recalculated "backwards" with scheduled shutdowns that were added after the fact (this option is activated globally in the general setup). In an SLA service report, a scheduled shutdown that affects one or several elements of the service is considered to affect the service as a whole, since the impact of the shutdown on the whole service cannot be measured.

It is worth highlighting that this applies at report level only. Service trees and the information presented in the visual console are not altered by planned shutdowns added after the intended execution date. Their service compliance percentages are calculated in real time from the history data of the service itself, and have nothing to do with the report.

On the other hand, it is important to know how the compliance percentage of a service is calculated:

Weight calculation in simple mode

Weights are handled slightly differently in simple mode, since there is only the critical weight and the service can only go into two statuses apart from normal. Each element receives a weight of 1 when it is critical and 0 in any other status, and service weights are recalculated every time the service elements change. The warning weight can be disregarded: it is always set to 0.5, because with a value of 0 the service would always be at least in warning status, but it is not used in simple mode. The service critical weight is calculated as half the sum of the element critical weights, each of which is 1. With 3 elements, for example, the service critical weight is 1.5. The server is then in charge of checking whether the critical weight has been matched or exceeded, in order to set the service to critical or warning status.
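The arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the calculation as stated in this section, not the server's actual code; the function name `simple_mode_weights` and the status strings are hypothetical.

```python
def simple_mode_weights(element_statuses):
    """Illustrative sketch of the simple-mode weight arithmetic.

    element_statuses: list of status strings, one per service element.
    Returns (service critical weight, summed element weight, reached).
    """
    element_critical_weight = 1.0  # each element weighs 1 when critical, 0 otherwise

    # The service critical weight is half the sum of the element critical weights.
    service_critical_weight = len(element_statuses) * element_critical_weight / 2

    # Only elements currently in critical status contribute weight.
    current_weight = sum(
        element_critical_weight for status in element_statuses if status == "critical"
    )

    # The server checks whether the critical weight was matched or exceeded.
    reached = current_weight >= service_critical_weight
    return service_critical_weight, current_weight, reached


print(simple_mode_weights(["critical", "normal", "normal"]))
print(simple_mode_weights(["critical", "critical", "normal"]))
```

With three elements the service critical weight is 1.5, so a single critical element (weight 1) does not reach it, while two critical elements (weight 2) do.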

Weight calculation according to their importance

Suppose there is a service defined by a 95% compliance in a 1-hour interval. The tables below use t for time, x for the compliance percentage (SLA), and s for whether the service complies or not (1 if it complies, 0 if it fails). In 1 hour there should be exactly 12 samples (assuming the interval is 5 minutes long).

Picture a case where the service complies for the first 11 samples (the first 55 minutes) and fails at the 60th minute. These would be the values:

   t    |   s   |    x
--------+-------+--------
   1    |   1   |  100
   2    |   1   |  100
   3    |   1   |  100
   4    |   1   |  100
   5    |   1   |  100
   6    |   1   |  100
   7    |   1   |  100
   8    |   1   |  100
   9    |   1   |  100
  10    |   1   |  100
  11    |   1   |  100
  12    |   0   |  91.6

This case is easy to calculate. The percentage depends on the number of samples: at t3, for example, there are three samples and all of them comply, so compliance is 100%, whereas at t12 there are 12 samples of which 11 are valid, so compliance is 11 / 12, or 91.6%.

Now suppose the failure occurs in the middle of the series, and the service recovers slowly:

   t    |   s   |    x
--------+-------+--------
   1    |   1   |  100
   2    |   1   |  100
   3    |   1   |  100
   4    |   1   |  100
   5    |   1   |  100
   6    |   0   |  83.3
   7    |   1   |  85.7
   8    |   1   |  87.5
   9    |   1   |  88.8
  10    |   1   |  90
  11    |   1   |  90.9
  12    |   1   |  91.6

So far all seems similar to the previous scenario, but see what happens if you go over time:

   t    |   s   |    x
--------+-------+--------
  13    |   1   |  91.6
  14    |   1   |  91.6
  15    |   1   |  91.6
  16    |   1   |  91.6
  17    |   1   |  91.6
  18    |   1   |  100
  19    |   1   |  100
  ...

The behavior here is less intuitive. The number of valid samples stays at 11 out of 12 for as long as the failed sample (t6) remains inside the time window. At t18 the only invalid value falls out of the window, so compliance becomes 100%. This step from 91.6 to 100 is explained by the size of the window: the larger the window (the SLA calculation interval is usually daily, weekly or monthly), the less abrupt the step.
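The sliding-window behavior described above can be reproduced with a short Python sketch. It assumes a window of 12 samples (1 hour at 5-minute intervals) and is meant only to illustrate the arithmetic of the tables, not how Pandora FMS implements it; the function name `compliance_series` is hypothetical.

```python
from collections import deque


def compliance_series(samples, window=12):
    """Return the compliance percentage after each sample.

    samples: iterable of 1 (complies) or 0 (fails), one per interval.
    window:  number of most recent samples taken into account.
    """
    recent = deque(maxlen=window)  # older samples drop out automatically
    series = []
    for s in samples:
        recent.append(s)
        series.append(100.0 * sum(recent) / len(recent))
    return series


# One failure at t6, then continuous compliance up to t19.
samples = [1] * 5 + [0] + [1] * 13
for t, x in enumerate(compliance_series(samples), start=1):
    print(f"t{t}: {x:.2f}")
```

The printed values are rounded, whereas the tables above truncate them, so 91.67 here corresponds to 91.6 there. The jump to 100 happens at t18, the first sample for which t6 is no longer inside the 12-sample window.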

1.6 Service cascade protection


Enterprise version.
Version NG 725 or higher.

 



It is possible to mute service elements dynamically. This avoids an alert overload for every element that belongs to a given service or its sub-services.

When the 'service cascade protection' feature is enabled, the action linked to the template configured for the root service is executed, reporting which elements within the service have an incorrect status.

Bear in mind that this system still allows the alerts of the elements within the service to be triggered when they go into critical status, even if the general service status is correct.

Service cascade protection indicates which elements have failed, regardless of the depth of the defined service.

Service2test.png

In the example above, one of the elements of the service is in critical status. Even though the main service is correct, the critical state of that element is reported by triggering the alert related to the element in critical status.

1.7 Root cause analysis

You may have an endless number of sub-services (paths) within a service. In previous versions, Pandora FMS alerts only indicated the service status (normal, critical, warning, etc.). From OUM725 on, there is a new macro available that shows the root cause of the service status.

To use it, add the following text to the template linked to the service:


Alert body: Example message
The series of events that have caused the service status is the following one:
_rca_


This will return an output similar to this one:

Alert body: Example message
The series of events that have caused the service status is the following one:
[Web Application -> HW -> Apache server 3]
[Web Application -> HW -> Apache server 4]
[Web Application -> HW -> Apache server 10]
[Web Application -> DB Instances -> MySQL_base_1]
[Web Application -> DB Instances -> MySQL_base_5]
[Web Application -> Balanceadores -> 192.168.10.139]


From this output, it can be inferred that:

  • Apache servers 3, 4 and 10 are in critical status
  • MySQL_base databases 1 and 5 are down
  • The 192.168.10.139 balancer does not respond


This additional information makes it possible to find out the reason behind the service status, reducing the effort spent researching the cause of the failure.
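If the alert text is processed by a script, the paths can be split into their components. The following Python sketch assumes that the macro output has exactly the format of the example above (one bracketed path per line, levels separated by "->"); the function name `parse_rca` is hypothetical and not part of Pandora FMS.

```python
def parse_rca(text):
    """Split each '[A -> B -> C]' line into a list of path components."""
    paths = []
    for line in text.splitlines():
        line = line.strip()
        # Only bracketed lines are treated as root cause paths.
        if line.startswith("[") and line.endswith("]"):
            paths.append([part.strip() for part in line[1:-1].split("->")])
    return paths


alert_body = """Alert body: Example message
The series of events that have caused the service status is the following one:
[Web Application -> HW -> Apache server 3]
[Web Application -> HW -> Apache server 4]
[Web Application -> HW -> Apache server 10]
[Web Application -> DB Instances -> MySQL_base_1]
[Web Application -> DB Instances -> MySQL_base_5]
[Web Application -> Balanceadores -> 192.168.10.139]
"""

# Group the failing elements (last component) by their sub-service.
failing = {}
for path in parse_rca(alert_body):
    failing.setdefault(path[-2], []).append(path[-1])

for sub_service, elements in failing.items():
    print(f"{sub_service}: {', '.join(elements)}")
```

Grouping by the next-to-last component lists the failing elements per sub-service, which mirrors the manual reading of the output given above.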

1.8 Service grouping

Services are logical groupings that make up an organization's business structure. That is why grouping services may make sense: in many cases they depend on each other, forming for example a general service (the company as a whole) and more specific services (corporate web, communications, etc.). To group services, both the general and the more specific services must be created, and the latter must be added to the former to build the logical tree-shaped structure.

These groups may help you to create visual maps, configure alerts, apply monitoring policies, etc. It is therefore possible to create alerts that warn you when the business goes into critical status because sales representatives cannot do their job, or when a branch is not working at full capacity due to technical problems with the ERP service.

To understand more clearly what service grouping is, take a look at these examples.

1.9 Service monitoring examples

1.9.1 Pandora FMS service

Use case that monitors the status of the Pandora FMS monitoring service, which is made up of the Apache and MySQL services, the Pandora FMS server and Tentacle, each with its respective weight.



Each of these elements is at the same time a service with different components, creating through service grouping a tree-shaped structure.


In this case, the general Pandora FMS service will go into critical status when reaching weight 2 and warning when it reaches weight 1. As seen, the four components have different weights on Pandora FMS service:

  • MySQL: It is essential for the Pandora FMS service. Individual weight of 2 if MySQL is down. It gets a weight of 1 if it is in warning status, showing a warning in the Pandora FMS service.
  • Pandora Server: It is essential for the Pandora FMS service. Individual weight of 2 if the Pandora FMS Server is down. Individual weight of 1 if it is in warning status, for example due to CPU overload, which propagates the warning up to the general Pandora FMS service.
  • Apache: Its failure implies a degradation of the Pandora FMS service, but not a total interruption, so it gets an individual weight of 1 if it is down, showing the warning status in the Pandora FMS service.
  • Tentacle: Its failure entails a degradation, and certain components may fail, but it does not mean Pandora FMS stops working completely, so its individual weight in case of failure is 1, showing a warning in the general service.
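The weight configuration above can be checked with a small Python sketch. It assumes that the service status is obtained by adding up the weight each element contributes in its current status and comparing the total with the thresholds (critical at 2, warning at 1). The warning weights of Apache and Tentacle are not given in the text, so the value 0 used here is an assumption, and the names `WEIGHTS` and `service_status` are hypothetical.

```python
# Thresholds of the general Pandora FMS service, as described above.
CRITICAL_THRESHOLD = 2
WARNING_THRESHOLD = 1

# Weight each element contributes per status.
# Apache and Tentacle warning weights are assumed to be 0 (not stated in the text).
WEIGHTS = {
    "MySQL": {"critical": 2, "warning": 1},
    "Pandora Server": {"critical": 2, "warning": 1},
    "Apache": {"critical": 1, "warning": 0},
    "Tentacle": {"critical": 1, "warning": 0},
}


def service_status(element_statuses):
    """element_statuses maps element name -> 'normal', 'warning' or 'critical'."""
    total = sum(
        WEIGHTS[name].get(status, 0) for name, status in element_statuses.items()
    )
    if total >= CRITICAL_THRESHOLD:
        return "critical"
    if total >= WARNING_THRESHOLD:
        return "warning"
    return "normal"


print(service_status({"Apache": "critical"}))          # Apache down only
print(service_status({"MySQL": "critical"}))           # MySQL down only
print(service_status({"Pandora Server": "warning"}))   # server overloaded
```

Under these assumptions, Apache being down alone yields warning, MySQL being down alone yields critical, and Apache and Tentacle failing together add up to 2, which also reaches the critical threshold.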

1.9.2 Storage cluster service, service grouping

Services are logical groups that make up part of the business structure of an organization. Grouping services is therefore reasonable, since some services are not fully meaningful on their own. To group services, they just need to be added to a larger service as elements, creating a new logical group.

In the following example, there is an HA storage cluster. This time, a system of two fileservers working at the same time has been chosen, each one controlling the usage percentage and the status of a series of hard drives that provide service to particular departments, creating a tree-shaped structure of grouped services.


Cluster.JPG


According to this structure, the critical threshold of the company's storage service is reached when both fileservers fail, since that would bring the service down, while just one of them failing would only degrade the service. The following image shows the weight configuration of the two main elements of the storage service:


Pesoscluster.JPG


The following image shows the content and weight configuration of the FS01 grouped service. Here the elements have a specific weight according to their severity:

  • FS01 ALIVE: Critical for the FS01 service, since it is the virtual IP allocated to the first hard drive cluster. Individual weight of 2, since if it is down, the rest of the service elements will not work. There is no warning threshold, since this is yes/no data.
  • DHCPserver ping: Critical for the FS01 service. It has an individual weight of 2. In this case, there is no warning threshold either.
  • Hard drives: They have an individual weight of 1 if they reach their critical threshold, and 0.5 for their warning threshold, so they will only render the FS01 service critical if at least two of them are in critical status or all four hard drives are in warning status.
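Service grouping can be pictured as a recursive evaluation: the status of each child service is calculated first and then contributes its weight to the parent. The Python sketch below illustrates this with the FS01 weights listed above. Several values are assumptions made only for the example: the FS01 warning threshold (1), the weight of each fileserver within the storage service (1 when critical), and the storage service thresholds (critical at 2, warning at 1), chosen so that both fileservers failing is critical and one failing is a degradation. All function names are hypothetical.

```python
def leaf(name, status, critical=0.0, warning=0.0):
    """A monitored element and the weight it contributes per status."""
    return {"name": name, "status": status,
            "weights": {"critical": critical, "warning": warning}}


def service(name, children, critical_at, warning_at, critical=0.0, warning=0.0):
    """A service: its own thresholds plus the weights it contributes to a parent."""
    return {"name": name, "children": children,
            "critical_at": critical_at, "warning_at": warning_at,
            "weights": {"critical": critical, "warning": warning}}


def evaluate(node):
    """Return the status of a node, evaluating child services recursively."""
    if "children" not in node:
        return node["status"]
    total = sum(child["weights"].get(evaluate(child), 0.0)
                for child in node["children"])
    if total >= node["critical_at"]:
        return "critical"
    if total >= node["warning_at"]:
        return "warning"
    return "normal"


def fileserver(name, disk_statuses, alive="normal", dhcp="normal"):
    children = [
        leaf(f"{name} ALIVE", alive, critical=2),
        leaf("DHCPserver ping", dhcp, critical=2),
    ] + [leaf(f"{name} disk {i}", status, critical=1, warning=0.5)
         for i, status in enumerate(disk_statuses, start=1)]
    # critical_at=2 follows from the text; warning_at=1 and the weight of the
    # fileserver within its parent (critical=1) are assumptions.
    return service(name, children, critical_at=2, warning_at=1, critical=1)


fs01 = fileserver("FS01", ["critical", "critical", "normal", "normal"])
fs02 = fileserver("FS02", ["normal"] * 4)
storage = service("Storage", [fs01, fs02], critical_at=2, warning_at=1)
print(evaluate(fs01), evaluate(storage))
```

With two FS01 hard drives in critical status, FS01 evaluates to critical and, under the assumed weights, the storage service as a whole only reaches warning, since FS02 is still working.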


Pesosfs01.JPG
