Pandora: Documentation en: Services

1 Service Monitoring

1.1 Introduction

A service is a way to group IT resources based on their features.

A service could be an official website, a CRM system, a support application, or even printers. Services are logical groups which can include hosts, routers, switches, firewalls, CRMs, ERPs, websites and of course, different other services.

In Pandora FMS, the services are represented as a group of monitored elements (modules, agents or other services) whose individual status affects in a certain way the global performance of the service that is provided.

1.2 Services under Pandora FMS

1.2.1 How Services work under Pandora FMS

Basic monitoring in Pandora FMS consists of collecting metrics from different sources, representing them as monitors (modules).

Service monitoring lets you group these modules so that, by setting thresholds on the accumulated failures, you can monitor groups of different element types and how they relate within a larger, general service.

In short, service monitoring lets you check the status of a global service: whether it is being provided normally (green), is degraded (yellow) or is not being provided at all (red).

To better understand what service monitoring is all about, let us give a small example.

Suppose you want to monitor a web application, which you have balanced through a series of redundant elements. The infrastructure on which the application is based could consist of the following elements:

  • Two routers in HA.
  • Two switches in HA.
  • 20 Apache Web Servers.
  • Four WebLogic Application Servers.
  • One MySQL Cluster consisting of two Storage and two SQL Processing Nodes.

The goal is to know if the web application works properly. That is the final assessment by customers, whether the application works or not.


The need to monitor services as something "abstract" arises when faced with the following question:

What happens to an application if a non-critical element fails?

For example, if one of the twenty Apache servers fails, in theory no warning is needed, because the redundancy exists precisely to cover that kind of situation. But then, which failures should trigger a warning: all of them, or only some? What is the rule for warning?

You might think that Pandora FMS should only warn you if a highly critical element fails (for example, a router) or if several Apache servers fail.

Service monitoring in Pandora FMS is designed to answer these questions.

The services in Pandora FMS help you to:

  • Limit the number of received alerts. You will receive alerts about situations that compromise the reliability of the services you provide.
  • Track the compliance level.
  • Simplify the display of the monitoring of your infrastructure.


To achieve this, monitor every element that could negatively affect your application.

Through the Pandora FMS console, define a service tree that indicates both the elements that affect your application and their degree of impact.

All the elements added to the service trees will correspond to information that is already being monitored, either in the form of modules, specific agents or other services.


To indicate how much the status of each element affects the overall status, a sum of weights is used: the most important elements (those with more weight) push the overall status of the service towards an incorrect status sooner than less important elements (those with less weight).


Let us look at all these ideas through a practical example:

  • Switches and routers: 5 points each when in critical, and 3 points if in warning.
  • Web servers: 1.2 points for each one in critical; the warning status is disregarded.
  • WebLogic Servers: 2 points each in critical.
  • MySQL Cluster: 5 points for each node in critical and 3 points in warning.


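Weight assignment per element type:

Element type    | Normal | Warning | Critical | Unknown
----------------+--------+---------+----------+--------
Router          |   0    |    3    |    5     |    5
Switch          |   0    |    3    |    5     |    5
Web server      |   0    |    0    |   1.2    |   1.2
WebLogic server |   0    |    0    |    2     |    2
MySQL server    |   0    |    3    |    5     |    5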


A warning threshold of 4 for service, and a critical threshold of 6 are set. In this way, and assuming that there are no issues, the service would be "OK" if all the monitored elements are OK or not important enough to cause deficiencies when providing the service.

Service configuration (thresholds on the sum of weights):

Normal | Warning | Critical
-------+---------+---------
   0   |  >= 4   |  >= 6


Now let us suppose that one (1) Apache web server fails:

  • 1 x Apache server in CRITICAL x 1.2 points = 1.2. Because 1.2 < 4 (the warning threshold), the service is still in OK status.

The weight contribution will be:

2 x 0 (routers in OK)
+ 2 x 0 (switches in OK)
+ 19 x 0 (apache OK)
+ 1 x 1.2 (apache CRIT)
+ 4 x 0 (weblogic OK)
+ 1 x 0 (mysql OK)
Total: 1.2 --> The service will be NORMAL


Let us see what happens if a web server and a WebLogic server fail:

  • 1 x Apache server in CRITICAL x 1.2 points = 1.2
  • 1 x WebLogic server in CRITICAL x 2 points = 2

The total, 3.2, is still < 4, so the service remains in OK status: it is still working and no immediate technical action is necessary.

The weight contribution will be:

2 x 0 (routers in OK)
+ 2 x 0 (switches in OK)
+ 19 x 0 (apache OK)
+ 1 x 1.2 (apache CRIT)
+ 3 x 0 (weblogic OK)
+ 1 x 2 (weblogic CRIT)
+ 1 x 0 (mysql OK)
Total: 3.2 --> The service will be NORMAL


Let us see what happens if two web servers and a WebLogic server fail:

  • 2 x Apache server in CRITICAL x 1.2 points = 2.4
  • 1 x WebLogic server in CRITICAL x 2 points = 2

The total, 4.4, is now > 4, so the service goes into WARNING status: it is degraded. It continues to work and may not require immediate technical action, but it is clear that there has been a problem with your infrastructure.

2 x 0 (routers in OK)
+ 2 x 0 (switches in OK)
+ 18 x 0 (apache OK)
+ 2 x 1.2 (apache CRIT)
+ 3 x 0 (weblogic OK)
+ 1 x 2 (weblogic CRIT)
+ 1 x 0 (mysql OK)
Total: 4.4 --> The service will be in WARNING


Let us suppose that, in addition to the above, a router fails:

  • 2 x Apache server in CRITICAL x 1.2 points = 2.4
  • 1 x WebLogic server in CRITICAL x 2 points = 2
  • 1 x Router in CRITICAL x 5 points = 5

The total is now 9.4, above the threshold of 6 set for CRITICAL, so the service is in critical status: it is not working, and immediate technical action is required.

1 x 0 (routers in OK)
+ 1 x 5 (router in CRIT)
+ 2 x 0 (switches in OK)
+ 18 x 0 (apache OK)
+ 2 x 1.2 (apache CRIT)
+ 3 x 0 (weblogic OK)
+ 1 x 2 (weblogic CRIT)
+ 1 x 0 (mysql OK)
Total: 9.4 --> The service is in CRITICAL

Pandora FMS will alert the corresponding work team (operators, technicians, etc.).
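
The four scenarios above can be reproduced with a short Python sketch. This is only an illustration of the arithmetic described in this section, not the Prediction Server's actual code: the names `WEIGHTS`, `INVENTORY`, `total_weight` and `service_status` are hypothetical, and the `>=` comparisons are taken from the service configuration table above.

```python
# Weights per element type and status, copied from the weight table above.
WEIGHTS = {
    "router":   {"normal": 0, "warning": 3, "critical": 5,   "unknown": 5},
    "switch":   {"normal": 0, "warning": 3, "critical": 5,   "unknown": 5},
    "web":      {"normal": 0, "warning": 0, "critical": 1.2, "unknown": 1.2},
    "weblogic": {"normal": 0, "warning": 0, "critical": 2,   "unknown": 2},
    "mysql":    {"normal": 0, "warning": 3, "critical": 5,   "unknown": 5},
}
# Number of elements of each type, as counted in the sums above.
INVENTORY = {"router": 2, "switch": 2, "web": 20, "weblogic": 4, "mysql": 1}
WARNING_THRESHOLD = 4
CRITICAL_THRESHOLD = 6


def total_weight(critical_counts):
    """Sum the weights when critical_counts[kind] elements of each kind are
    in critical status and every other element is in normal status."""
    total = 0.0
    for kind, count in INVENTORY.items():
        down = critical_counts.get(kind, 0)
        total += down * WEIGHTS[kind]["critical"]
        total += (count - down) * WEIGHTS[kind]["normal"]
    return total


def service_status(total):
    """Map a sum of weights to a service status (assumes >= comparisons)."""
    if total >= CRITICAL_THRESHOLD:
        return "CRITICAL"
    if total >= WARNING_THRESHOLD:
        return "WARNING"
    return "NORMAL"


scenarios = [
    {"web": 1},
    {"web": 1, "weblogic": 1},
    {"web": 2, "weblogic": 1},
    {"web": 2, "weblogic": 1, "router": 1},
]
for failed in scenarios:
    total = total_weight(failed)
    print(failed, round(total, 1), service_status(total))
```

The loop walks through the same four failure combinations as the text and yields NORMAL, NORMAL, WARNING and CRITICAL, in that order.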

Service monitoring is a feature only of the Enterprise version of Pandora FMS.


1.2.1.1 How the simple mode works

The weight system detailed above may be too complex when monitoring needs are quite basic. To deal with this situation, a simple mode has been available in the service configuration since version 5.1.

In this mode, the only configuration needed is to select which elements are critical and which are not. Only the critical elements are taken into account when calculating the service status, and only when they are in critical status. The service goes into warning when 50% of the critical elements are in critical status, and into critical when more than 50% of the critical elements are in critical status.

Let us follow an example of a simple service:

  • Router as critical element.
  • Printer as non critical element.
  • Apache server as critical element.

One day the elements report this status:

  • Router on critical.
  • Printer on critical.
  • Apache server on warning.

The service status is warning. The printer is not a critical element, so its status is not taken into account. The Apache server is a critical element, but critical elements only count when they are in critical status, so its warning status is ignored. That leaves one critical element (the router) in critical status out of two, which is 50% of the critical elements.

Another day the elements report this status:

  • Router on critical.
  • Printer on critical.
  • Apache server on critical.

The service status is critical, since over 50% of the critical elements are on critical status.

Finally, another day the elements report this status:

  • Router on normal.
  • Printer on critical.
  • Apache on normal.

The service status is normal, since less than 50% of the critical elements are in critical status. Only the printer is in critical status, but as shown above, non-critical elements are not taken into account when calculating the service status.
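
The simple-mode rule and the three days above can be sketched in Python as follows. The function name `simple_mode_status` and the data layout are hypothetical, and the sketch assumes that exactly 50% means warning, more than 50% means critical, and anything below 50% means normal, as the examples suggest.

```python
def simple_mode_status(elements):
    """Return the service status in simple mode.

    elements is a list of (is_critical_element, status) pairs, where
    status is "normal", "warning" or "critical".
    """
    # Only elements flagged as critical take part in the calculation.
    critical_elements = [status for flagged, status in elements if flagged]
    if not critical_elements:
        return "normal"
    # Of those, only the ones in critical status count.
    down = sum(1 for status in critical_elements if status == "critical")
    # Integer arithmetic avoids rounding issues around the 50% mark.
    if 2 * down > len(critical_elements):
        return "critical"
    if 2 * down == len(critical_elements):
        return "warning"
    return "normal"


# Element order: router (critical element), printer (non-critical element),
# Apache server (critical element).
day_one = [(True, "critical"), (False, "critical"), (True, "warning")]
day_two = [(True, "critical"), (False, "critical"), (True, "critical")]
day_three = [(True, "normal"), (False, "critical"), (True, "normal")]

for day in (day_one, day_two, day_three):
    print(simple_mode_status(day))
```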

1.2.2 Creating a New Service

1.2.2.1 Introduction

The Enterprise version is required, and the Prediction Server component must be enabled, to use these services.

 


The services can represent:

  • Modules
  • Agents
  • Other Services

Service values are calculated by the Prediction Server, which uses the default interval of the prediction modules.

Once all the devices are being monitored, add to each service all the modules, agents or sub-services that the service depends on. For example, if you want to monitor the Online Store service, you need a module for content, a service that monitors the status of communications, and so on. The following steps describe how to create a service with Pandora FMS.

To create a new service, click on Services under Topology Maps.


Menu services.png


A list of all the available services will be shown. The next screenshot shows an empty service list.


Services empty v5.png


1.2.2.2 Initial Configuration

To create a new service, click on the 'Create' button and fill out the form as shown below.



Services creation v5.png


The names of the form fields and their meaning are as follows:

  • Name: The name of the service.
  • Description: The service description, a mandatory long text. This description is shown in the service map, the service table view and the service widget (instead of the name).
  • Group: The group of the service. It is quite useful for organization purposes and to enforce SLA (Service Level Agreement) restrictions.
  • Mode: The mode in which the weights of the elements are calculated.
    • Manual: The weights must be entered manually for the service and its elements.
    • Auto: The 'critical' threshold for the service is set to '1' and the 'warning' threshold to '0.5'. Each time you create an element for this service, it is automatically assigned weights of '0' for the 'OK' status, '0.5' for 'warning' and '1' for 'critical'.
    • Simple: There is no need to enter weights, only to enable or disable a checkbox to indicate whether the element is critical.
  • Critical: The weight threshold to enter the 'critical' status. This field is disabled when the automatic mode is selected. The default value is '1'.
  • Warning: The weight threshold to enter the 'warning' status. This field is disabled when the automatic mode is selected. The default value is '0.5'. Not visible when the simple mode is selected.
  • Agent to store Data: The service stores its data in special data modules (specifically, prediction modules), so an agent must be entered to act as the container of these modules and of the alerts configured in this form. Note that the interval at which all service module calculations are performed depends on the interval of the agent configured as the container.
  • Quiet: Activates the silent mode of the service, in which it will not generate alerts or events.
  • Cascade Protection: Activates cascade protection over the elements of the service. These will not generate alerts or events if they belong to a service (or subservice) in critical condition.
  • Favourite: Token to turn the new service into favourite. If it is activated, a direct link will be provided in the lateral menu.
  • Calculate continuous SLA for this service: Activates the creation of the SLA and SLA value modules for the current service. If disabled, dynamically calculated SLA information is not available, and SLA compliance alerts for this service do not work. Disabling it is intended for cases where the number of services is so high that it can affect performance. If this option is disabled once the service has been created, the data history of these modules will be deleted and the information will be lost.
  • SLA Interval: The time range for performing the SLA constraint's calculation. The default value is '1 month'.
  • SLA Limit: The threshold of OK status for the service above which the SLA is considered met for the period set in the previous field.
  • Warning Service Alert: alert template that the service will use to issue the alert when the service goes into warning status.
  • Critical Service alert: alert template that the service will use to issue the alert when the service goes into critical status.
  • SLA Critical Service Alert: alert template that the service will use to issue the alert if the SLA restrictions are not met.
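
As an illustration of the Auto mode described above, the following Python sketch applies its fixed weights and thresholds. The names are hypothetical, the source does not state a weight for the 'unknown' status in this mode (so it is left out), and the `>=` comparison is an assumption carried over from the worked example earlier in this chapter.

```python
# Values stated for Auto mode: element weights and service thresholds.
AUTO_WEIGHTS = {"ok": 0.0, "warning": 0.5, "critical": 1.0}
AUTO_WARNING_THRESHOLD = 0.5
AUTO_CRITICAL_THRESHOLD = 1.0


def auto_mode_status(element_statuses):
    """Return the service status for a list of element statuses
    ("ok", "warning" or "critical"), assuming >= threshold comparisons."""
    total = sum(AUTO_WEIGHTS[status] for status in element_statuses)
    if total >= AUTO_CRITICAL_THRESHOLD:
        return "critical"
    if total >= AUTO_WARNING_THRESHOLD:
        return "warning"
    return "normal"


print(auto_mode_status(["ok", "ok", "ok"]))
print(auto_mode_status(["ok", "warning"]))
print(auto_mode_status(["warning", "warning"]))
print(auto_mode_status(["ok", "critical"]))
```

Under these assumptions, a single element in warning is enough to put the service in warning, and a single element in critical (or two in warning) is enough to put it in critical. Use Manual mode when elements should not all carry the same importance.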

1.2.2.3 Element Configuration

Once the form has been filled out correctly, there will be an empty service which must be filled in with elements (service items) as shown below. In the service edit form, select the 'Config Elements' tab.


Services tab setup v5.png


You will see a page like the one below where you can manage (modify, add new ones or delete) service elements.


Services elements empty v5.png


Some important items on the service configuration page are:

  • Type: a drop-down list that can show services, modules or agents.
  • Agent: The agent smart-search input control. It is only visible if the element type to create or edit is either the 'agent' or the 'module' type.
  • Module: The drop-down list of the modules of the agent previously chosen via smart search. This control is only visible when creating or editing a service element of the 'module' type.
  • Service: The drop-down list of the services that can be added as an element. It is only visible when creating or editing an element of the 'service' type. Keep in mind that only services which are not ancestors of the current service appear in the list, which is necessary to keep an appropriate tree-structure dependency between the services.
  • Critical: A checkbox to select if the element is critical. Not visible unless the service is in simple mode.
  • Weight on Critical: The weight of the element if it is in a 'critical' status. The default value is '1'. It is disabled if the service is in 'auto calculate' mode. Not visible if the service is in simple mode.
  • Weight on Warning: The weight of the 'warning' status. The default value is '0.5'. It is disabled if the service is in 'auto calculate' mode. Not visible if the service is in simple mode.
  • Weight on Unknown: The weight of the element if it is in unknown status. The default value is '0'. It is disabled if the service is in 'auto calculate' mode. Not visible if the service is in simple mode.
  • Weight on OK: The weight of the element if it is in perfect conditions. The default value is '0'. It is disabled if the service is in 'auto calculate' mode. Not visible if the service is in simple mode.
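
The rule that the Service drop-down list leaves out the ancestors of the current service can be pictured with the Python sketch below. It only illustrates why the rule keeps the structure a tree; how the console implements it is not described here. The service names, the `children` mapping and both function names are hypothetical, and excluding the service itself is an added assumption.

```python
def ancestors(service, children):
    """Return every service that contains `service`, directly or indirectly.

    children maps a parent service name to the list of its sub-services.
    """
    found = set()
    pending = [service]
    while pending:
        current = pending.pop()
        for parent, kids in children.items():
            if current in kids and parent not in found:
                found.add(parent)
                pending.append(parent)
    return found


def selectable_services(service, all_services, children):
    """List the services that could be added as elements of `service`
    without creating a loop in the service tree."""
    excluded = ancestors(service, children) | {service}
    return sorted(name for name in all_services if name not in excluded)


# Hypothetical tree: Company contains Web and Storage; Storage contains
# FS01 and FS02.
children = {"Company": ["Web", "Storage"], "Storage": ["FS01", "FS02"]}
all_services = ["Company", "Web", "Storage", "FS01", "FS02"]

print(selectable_services("FS01", all_services, children))
print(selectable_services("Company", all_services, children))
```

When editing FS01, neither Storage nor Company is offered, because adding either of them under FS01 would make a service contain its own parent.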

In the last column on the right of the element list, titled "Actions", there are icons for:

  • Edit: The icon represented by a wrench with an orange handle. It edits the element of the corresponding row.
  • Delete: The icon represented by a red cross. When you click on it, a modal window asks for confirmation before the service element is deleted from the database.

1.2.2.4 Modules created when configuring a service

  • SLA Value Service: The percentage value of the SLA compliance. (async_data).
  • Service_SLA_Service: This shows whether the SLA is being met or not. (async_proc).
  • Service_Service: This module shows the sum of the weights of the service. (async_proc).



1.2.3 Service Visualization

1.2.3.1 Simple list-based View of all the Services

This is the operation list that shows all the services created. It only shows the services in groups that the current user of the Pandora FMS console has access to.

To get to this view, go to the Operation menu, open the Monitoring entry and select the Services section.

Services list services admin v5.png

Each row represents a service, and the columns represent:

  • Name: The name of the service.
  • Description: The service description.
  • Group: The icon of the group the service belongs to.
  • Critical: The threshold value for the sums of weights to get the service into 'critical' status.
  • Warning: The threshold value for the sums of weights to get the service into 'warning' status.
  • Value: The current value for the sum of all weights for the service.
  • Status: An icon which represents the status of the service.

Four possible statuses are represented:

    • Red: The service is in 'critical' status because the value exceeded the critical threshold.
    • Yellow: The service is in 'warning' status because the value exceeded the warning threshold.
    • Green: The service is within the 'normal' range.
    • Gray: The service is in 'unknown' status. This usually means the service has been recently created and does not contain any modules or the Prediction Server is down.
  • SLA: The current value of the SLA Service. The values can be:
    • OK: The SLA is met for the interval defined in the SLA Service.
    • INCORRECT: The SLA is not met for the interval currently defined in the SLA Service.
    • N/A: The SLA is in 'unknown' status because there is not enough data to perform the calculation.

1.2.3.1.1 Table including all services

A table for quick display including all visible services and their current status.

Servs.JPG


1.2.3.1.2 List-based view of a Service and its Elements

This view is accessible by clicking on the name of a service in the list of all services, or through the magnifying glass icon tab in the service title header.

Pandora will show a page similar to the one shown in the following screenshot:

Services list elements operation v5.png

In the screenshot, two sections can be distinguished: the service, with the same columns as in the previous view, at the top, and the list of the elements that make up this service at the bottom.

The list of elements appears in table format, where the rows correspond to each element and the columns represent:

  • Type: The icon which represents the type of an element. It is a Lego block for modules, some stacked Lego blocks for an agent and a Network Diagram Icon for the services.
  • Name: The text which contains the name of the module, agent or service. They are also linked to the proper section.
  • Description: A small free-text field intended for a short description.
  • Weight critical: The value if the element is in 'critical' status.
  • Weight warning: The value if the element is in 'warning' status.
  • Weight normal: The value if the element is in 'normal' status.
  • Data: The value of the element. Its meaning depends on the element type:
    • Module: The value of the module.
    • Agent: The text which displays the agent's status.
    • Service: The sum of all elements' weights from the selected service.
  • Status: The icon which represents the element's status by color.

Keep in mind that the service-elements calculation is performed by the Prediction Server, so what you see is not real-time data. In some situations, such as when a module or agent has just been added to the service, its weight will not be updated until the Prediction Server performs the calculation again.

 


1.2.3.1.3 Service Map View

This view displays the service in tree form, as you can see in the following screenshot. That way, it is possible to quickly see how modules, agents or sub-services influence the service monitoring. For sub-services, you can also see which elements influence their status through the sum of weights.

Services servicemap v5.png

The possible nodes can be:

  • Module Node: It is represented by the 'heartbeat' icon. This node is always final (leaf).
  • Agent Node: It is represented by the 'CPU box' icon. This node is always final too (leaf).
  • Service Node: It is represented by the 'crossed hammer and wrench' icon. This node is not a final node. It is required to contain additional nodes.

The node's colors and the arrow which connects them to the service depend on the node's status.

There are the following attributes within the node:

  • Title: The name of the service's / agent's or module's node.
  • Value list: This list refers to the possible numeric value calculated for that instance. It accepts any assigned integer value.
    • Critical: The weight if it reaches 'critical' status, except if it is the root-service node, in which case it represents the threshold to reach the 'critical' status.
    • Warning: The weight if it reaches 'warning' status, except if it is the root-service node, in which case it represents the threshold to reach the 'warning' status.
    • Normal: The weight if it reaches 'normal' status, except if it is the root-service node, in which case nothing will be displayed here.
    • Unknown: The weight if it reaches 'unknown' status, except if it is the root-service node, in which case it represents the threshold to reach the 'unknown' status.

You may click on each node in the tree. The target link represents the operational view of the node itself.


When the service mode is simple, a red exclamation mark appears on the right side of the critical elements.

 


1.2.3.1.4 Services within the Visual Console

From Pandora FMS version 5 onwards, you may add services in the Visual Console like any other item on the map.

Services visualmap v5.png

To create a service item on a map, the process is the same as for all other visual map items, using the options palette shown in the following screenshot.

Services visualmap add item v5.png

It contains the following attributes:

  • Label: The title which will be shown within the visual console's node.
  • Service: The service that will be represented.

Note that a service item, unlike other items in the visual map, cannot be linked to other visual maps. Its clickable link in the visual console always leads to the service map tree view described above.

1.2.3.2 Services Tree View

This view displays the services in the form of a tree.

At each level, a count of the number of elements included in each service or agent is shown.

  • Services: reports the total number of services, agents and modules that belong to that service.
  • Agents: reports the number of modules in critical state (red color), warning (yellow color), unknown (gray color), uninitiated (blue color) and normal state (green color).

Services that do not belong to another one will always be shown in the first level. In the case of a child service, it will be shown nested inside its parent.

Services treeview.png

ACL permission restrictions are only applied to the first level.

 




1.2.4 How to read service values

Planned shutdowns added before the stop date allow the value of the SLA reports to be recalculated. This option must first be activated in the general setup. In an SLA service report, if there is a scheduled outage affecting one or more elements of the service, the planned shutdown is considered to affect the entire service, because the system cannot evaluate the impact of an "inactive" service component on the whole service.

It is important to remember that this applies at report level only. The service map and the information presented in the visual console are not altered by planned shutdowns added after the effective execution date. Those service compliance percentages are calculated in real time from the history data of the service itself, which is very different from a report, which can be "cooked" by adding a "fake" planned downtime.

It is also important to know how the compliance of a service is calculated.

Let us suppose there is a service defined with 95% compliance over a 1-hour interval (this would actually be very short in real life, but it is good for understanding the internal algorithm). The tables below use t for time (the sample number), s for whether or not the service complies (1 complies, 0 fails) and x for the % compliance (SLA). In 1 hour there should be exactly 12 values, assuming an interval of 5 minutes.

In a simple case, where the service complies for the first 11 samples (the first 55 minutes) and fails at the 60th minute, these would be the values:

   t    |   s   |    x  
--------+-------+--------
1          1      100
2          1      100
3          1      100
4          1      100
5          1      100
6          1      100
7          1      100
8          1      100
9          1      100
10         1      100
11         1      100
12         0      91.6

This case is easy to calculate. The percentage is the number of compliant samples divided by the total number of samples: at t3, all three samples comply, so the value is 100%, whereas at t12 there are 12 samples of which 11 are valid, so the value is 11 / 12 = 91.6%.

Now suppose the failure happens in the middle of the series, and compliance recovers slowly:

   t    |   s   |    x  
--------+-------+--------
1          1      100
2          1      100
3          1      100
4          1      100
5          1      100
6          0      83.3
7          1      85.7
8          1      87.5
9          1      88.8
10         1      90
11         1      90.9
12         1      91.6

So far everything looks similar to the previous scenario, but see what happens when the series continues beyond the first hour:

   t    |   s   |    x  
--------+-------+--------
13        1      91.6
14        1      91.6
15        1      91.6
16        1      91.6
17        1      91.6
18        1      100
19        1      100
....

The behavior here is unintuitive. The number of valid samples in the 12-sample window stays at 11 until t18, when the only invalid value (t6) falls out of the window, so at t18 compliance jumps to 100%. This step between 91.6 and 100 is explained by the size of the window. The larger the window is (the SLA calculation interval is usually daily, weekly or monthly), the less abrupt the step will be.
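
The sliding-window behavior in the tables above can be reproduced with a few lines of Python. This is a sketch of the calculation as described in this section, not the code Pandora FMS runs; `compliance_series` and its `window` parameter are hypothetical names. Note that the tables truncate the percentages to one decimal, whereas the code works with the unrounded values.

```python
def compliance_series(samples, window=12):
    """Return the % compliance after each sample.

    samples is a list of 1 (service complies) or 0 (service fails).
    Only the most recent `window` samples are taken into account.
    """
    series = []
    for t in range(1, len(samples) + 1):
        recent = samples[max(0, t - window):t]
        series.append(100.0 * sum(recent) / len(recent))
    return series


# Failure at t6, then 13 more compliant samples (t7 to t19).
samples = [1] * 5 + [0] + [1] * 13
for t, x in enumerate(compliance_series(samples), start=1):
    print(t, samples[t - 1], round(x, 2))
```

With a larger `window`, a single failed sample weighs less, so both the drop when it occurs and the jump when it leaves the window are smaller.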

1.2.5 Service grouping

Services are logical groups that form part of a business structure. It therefore makes sense to group services, because in many cases there are dependencies between them, forming for example a global company service made up of more specific services (web page, communications, etc.). To group these services, create one general service and the smaller ones that will be added to it, building a logical tree structure.

Service groups are useful for creating visual maps, configuring alerts, applying monitoring policies, etc. For example, you can create specific alerts for when the company service is down because the commercial department cannot work or because the web page is offline.

The following two examples illustrate service grouping.

1.2.6 Examples of service monitoring

1.2.6.1 Pandora FMS service

In this case, the Pandora FMS service itself is being monitored. It is made up of the Apache service, MySQL, Pandora Server and Tentacle Server. Each of these elements is in turn a service with different components, creating a tree-type structure.


Arbol.JPG


The general Pandora FMS service goes into critical status when the sum of weights reaches 2, and into warning status when it reaches 1. As you can see, the four components have different weights within the Pandora FMS service:

  • MySQL: critical for the Pandora FMS service, with an individual weight of 2 if MySQL is down. It has a weight of 1 if it goes into warning status, which already shows the Pandora FMS service in yellow.
  • Pandora Server: critical for the Pandora FMS service, with an individual weight of 2 if Pandora Server is down. It has an individual weight of 1 if it is in warning status (for example, under a heavy CPU load), which shows the Pandora FMS service in warning status.
  • Apache: its failure means a degradation of the global Pandora FMS service, but not a complete interruption, so it has an individual weight of 1 if it is down, which shows the Pandora FMS service in warning status.
  • Tentacle: as with Apache, its failure means a degradation of the service, but not a total interruption, so it has a weight of 1 if it is down, which shows the service in warning status.

The following picture includes the setup of the different weights for the elements over the Pandora FMS general service:


Pesos.JPG
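
The effect of these weights can be checked with a small Python sketch. The names are hypothetical, the warning weights of Apache and Tentacle are not given in the text and are assumed to be 0, and thresholds are compared with `>=` because the text says the status changes when the weight is "reached".

```python
# (warning weight, critical weight) per component, from the list above.
# The warning weights of Apache and Tentacle are an assumption (0).
COMPONENT_WEIGHTS = {
    "mysql": (1, 2),
    "pandora_server": (1, 2),
    "apache": (0, 1),
    "tentacle": (0, 1),
}
WARNING_THRESHOLD = 1
CRITICAL_THRESHOLD = 2


def pandora_service_status(component_statuses):
    """component_statuses maps a component name to "normal", "warning"
    or "critical"; components left out are treated as normal."""
    total = 0
    for name, status in component_statuses.items():
        warning_weight, critical_weight = COMPONENT_WEIGHTS[name]
        if status == "critical":
            total += critical_weight
        elif status == "warning":
            total += warning_weight
    if total >= CRITICAL_THRESHOLD:
        return "critical"
    if total >= WARNING_THRESHOLD:
        return "warning"
    return "normal"


print(pandora_service_status({"apache": "critical"}))
print(pandora_service_status({"mysql": "warning"}))
print(pandora_service_status({"mysql": "critical"}))
print(pandora_service_status({"apache": "critical", "tentacle": "critical"}))
```

One consequence of summing weights is visible in the last call: Apache and Tentacle failing at the same time add up to 2, which under these assumptions puts the whole service in critical status even though each failure alone only degrades it.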


1.2.6.2 Storage cluster, service grouping

Services are logically arranged groups which are part of a company's business structure. Therefore, it may be necessary to create service groups, because services on their own sometimes lack an appropriate context. To create service groups, add each service to an existing agent; each service then becomes a module of that agent. These groups let you create a new logical structure (a group of services).

In the following example, there is an HA storage cluster. Two fileserver systems work in parallel, each one controlling the usage percentage and status of different disks that provide a service to specific departments, creating a tree-type structure with grouped services.


Cluster.JPG


According to this structure, the critical threshold of the storage service is reached only if both fileservers fail, which would deny the service completely. If only one of the fileservers fails, the service is only degraded. The screenshot below shows the weight configuration of the two main elements of the storage service:


Pesoscluster.JPG


The following image shows the content and weight configuration of the grouped service FS01. Here, each element has a specific weight according to its criticality:

  • FS01 ALIVE: critical to the FS01 service, since it is the virtual IP assigned to the first disk cluster. It has an individual weight of 2; if it is down, the other elements automatically become inoperative. In this case there is no warning threshold, since it is a yes/no type of information.
  • DHCPserver ping: critical to the FS01 service, with an individual weight of 2. In this case, there is no warning threshold either.
  • Disks: they have an individual weight of 1 when they reach their own critical status, and 0.5 in warning status. Accordingly, the FS01 service only reaches critical status through its disks if there are two disks in critical status or four in warning status.


Pesosfs01.JPG
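
The disk arithmetic for FS01 can be verified with the following Python sketch. The critical threshold of 2 is not stated explicitly in the text; it is inferred from the statement that two disks in critical status or four in warning status are needed. The names `FS01_WEIGHTS`, `fs01_weight` and `fs01_is_critical` are hypothetical.

```python
# Weights per element kind and status, from the list above.
FS01_WEIGHTS = {
    "alive": {"critical": 2},             # FS01 ALIVE, no warning weight
    "dhcp_ping": {"critical": 2},         # DHCPserver ping, no warning weight
    "disk": {"critical": 1, "warning": 0.5},
}
FS01_CRITICAL_THRESHOLD = 2  # inferred from the text, not stated in it


def fs01_weight(elements):
    """Sum the weights of a list of (kind, status) pairs.
    Statuses without a configured weight contribute 0."""
    return sum(FS01_WEIGHTS[kind].get(status, 0) for kind, status in elements)


def fs01_is_critical(elements):
    """Return True when the sum of weights reaches the critical threshold."""
    return fs01_weight(elements) >= FS01_CRITICAL_THRESHOLD


two_disks_critical = [("disk", "critical")] * 2
four_disks_warning = [("disk", "warning")] * 4
one_of_each = [("disk", "critical"), ("disk", "warning")]

print(fs01_is_critical(two_disks_critical))
print(fs01_is_critical(four_disks_warning))
print(fs01_is_critical(one_of_each))
print(fs01_is_critical([("alive", "critical")]))
```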

1.3 Pandora Server

The Prediction Server must be running properly, and the Enterprise version of Pandora FMS must be installed.
