Pandora: Documentation en: Services

From Pandora FMS Wiki
Revision as of 11:44, 2 November 2017 by Slerena (talk | contribs) (The Concept of Service Monitoring)

Go back to Pandora FMS documentation index

1 Service Monitoring

1.1 Introduction

1.1.1 The Concept of Service Monitoring

A service is a way to group your IT resources based on their functionalities. A service could be your official website, your CRM system, your support application, or even your printers. Services are logical groups which can include hosts, routers, switches, firewalls, CRMs, ERPs, websites and numerous other services.

1.2 Services under Pandora FMS

1.2.1 How Services work under Pandora FMS

Unlike 'specific' monitoring, where specific values are collected from specific indicators, service monitoring in Pandora FMS watches 'groups' of elements of different kinds, with a margin of error that is based on failure accumulation.

To better understand what service monitoring consists of, we're going to show an example below.

In this example, we want to know whether the service we provide through a Web Cluster is working properly. This cluster consists of the following elements:

  • Two routers in HA (High Availability)
  • Two switches in HA.
  • 20 Apache Web Servers.
  • Four WebLogic Application Servers.
  • One MySQL Cluster consisting of two Storage and two SQL Processing Nodes.

It's possible to monitor each element individually. First, service monitoring has to be activated 'globally'. Each element included in the service must be a 'standard' monitor that is already being monitored by Pandora FMS, so this monitoring has to be set up PRIOR to any service monitoring.

The need to monitor services as something 'abstract' becomes clear if we ask ourselves the question: "What happens when an element that is initially considered non-critical, such as one of the twenty Apache Servers, fails?" First of all, nobody should be alerted. Such a node may well fail frequently, and with 20 nodes available the failure of a single one should not trigger a warning (imagine waking someone up in the middle of the night for it). A service with this much redundancy is installed to give us more peace of mind, not more work. It should only warn us if a critical element is down (such as a router) or if several web servers are down, e.g. four or five of them.

In this way, if we evaluate each element from our example:

  • Switches and Routers: 5 points for each one if they are in a 'critical' state and 3 points if they are in the 'warning' range.
  • Web Servers: 1.2 points for each one in a 'critical' state. We omit the 'warning' state here.
  • WebLogic Servers: 2 points for each one in a 'critical' state.
  • MySQL Cluster: 5 points for each node in a 'critical' state, 3 points in a 'warning' state.

We're setting up a 'warning' threshold of '4' for the service and a 'critical' threshold of '6' for it. In this way (and supposing that all entities are working appropriately) the service would give back an 'OK' if all the monitored elements are working as they're supposed to.

Now, assume the following:

  • One Apache Server in a 'critical' state x 1.2 points = 1.2. '1.2' is smaller than '4' (the 'warning' threshold), so the service is still considered to be within the 'OK' range.

See what happens if e.g. a Web server and a Weblogic server are down:

  • One apache server in a 'critical' state x 1.2 points = 1.2
  • One Weblogic server in a 'critical' state x 2 = 2

As you can see, the sum of '3.2' is still smaller than '4'. The service remains within the 'OK' status range. It won't wake up the operator.

See what happens if two Web Servers and one Weblogic Server are down:

  • Two Apache Servers in a 'critical' state x 1.2 points = 2.4
  • One Weblogic Server in a 'critical' state x 2 = 2

As you can see, '4.4' is bigger than '4', which meets the condition for the 'warning' status. The operator may not receive an urgent SMS yet, but at least someone will receive an email. Let's continue with the example.

Let's also suppose that (in addition to the events mentioned above) one router is down:

  • Two Apache Servers in a 'critical' state x 1.2 points = 2.4
  • One Weblogic Server in a 'critical' state x 2 = 2
  • One Router in a 'critical' state x 5 = 5

As you can see here, we've reached a value of '9.4', which is higher than the 'critical' threshold of '6' set for the service in this example. The service is definitely considered to be in a 'critical' state now - and that's the moment in which our operator has no other option but to wake up.
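
The weight arithmetic of this example can be sketched in a few lines of Python. This is only an illustration of the accumulation logic, not Pandora FMS code: the names `WEIGHTS` and `service_status` and the element-type keys are hypothetical, the thresholds are the '4' (warning) and '6' (critical) set for the service above, and it is assumed that a status is entered as soon as the sum reaches its threshold.

```python
# Weights per element type and state, taken from the example above.
# States that are not listed (e.g. 'warning' for Apache) count as 0.
WEIGHTS = {
    "router":   {"critical": 5.0, "warning": 3.0},
    "switch":   {"critical": 5.0, "warning": 3.0},
    "apache":   {"critical": 1.2},
    "weblogic": {"critical": 2.0},
    "mysql":    {"critical": 5.0, "warning": 3.0},
}

WARNING_THRESHOLD = 4.0
CRITICAL_THRESHOLD = 6.0


def service_status(failures):
    """Sum the weights of all failing elements and map the sum to a status.

    failures: list of (element_type, state) tuples for every element
    that is not in the 'OK' state.
    """
    total = sum(WEIGHTS[kind].get(state, 0.0) for kind, state in failures)
    # Assumption: reaching a threshold is enough to enter that status.
    if total >= CRITICAL_THRESHOLD:
        return total, "critical"
    if total >= WARNING_THRESHOLD:
        return total, "warning"
    return total, "ok"


# Last scenario of the example: two Apache servers, one WebLogic
# server and one router in 'critical' state.
scenario = [
    ("apache", "critical"),
    ("apache", "critical"),
    ("weblogic", "critical"),
    ("router", "critical"),
]
total, status = service_status(scenario)
print(round(total, 1), status)  # 9.4 critical
```

Removing the router from `scenario` leaves a sum of 4.4 and the 'warning' status, which matches the previous step of the example.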



The Service Monitoring feature is only available to the Enterprise Versions of Pandora FMS.

 


1.2.1.1 How the simple mode works

The weight system detailed above may be too complex when the monitoring needs are basic. To deal with this situation, a simple mode has been available in the service configuration since version 5.1.

In this mode, the only configuration needed is to select which elements are critical and which are not. Only the critical elements are taken into account when calculating the service status, and only their critical status counts. The service goes to warning when 50% of the critical elements are in critical status. When more than 50% of the critical elements are in critical status, the service goes to critical.

Let's follow an example of a simple service:

  • Router as critical element.
  • Printer as non critical element.
  • Apache server as critical element.

One day the elements report this status:

  • Router on critical.
  • Printer on critical.
  • Apache server on warning.

The service status is warning. The printer isn't a critical element, so its status is not taken into account. The Apache server is a critical element, but it only counts when it is in critical status, and here it is in warning. That leaves one critical element (the router) in critical status, which is 50% of the critical elements.

Another day the elements report this status:

  • Router on critical.
  • Printer on critical.
  • Apache server on critical.

The service status is critical, since over 50% of the critical elements are on critical status.

Finally, another day the elements report this status:

  • Router on normal.
  • Printer on critical.
  • Apache on normal.

The service status is normal, since less than 50% of the critical elements are on critical status. Only the printer is on critical status but, as we have seen, the non critical elements aren't taken into account when calculating the service status.
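
The simple-mode rule can be sketched as follows. This is a minimal illustration under stated assumptions, not Pandora FMS code: the function name `simple_mode_status` and the tuple layout are hypothetical, and the rule is taken literally from the text (warning at exactly 50%, critical above 50%). A service without any critical element is assumed to stay normal.

```python
def simple_mode_status(elements):
    """Compute a simple-mode service status.

    elements: list of (is_critical_element, state) tuples, where state
    is 'normal', 'warning' or 'critical'.
    """
    # Only elements flagged as critical take part in the calculation.
    states = [state for is_critical, state in elements if is_critical]
    if not states:
        return "normal"  # assumption: nothing to evaluate
    down = sum(1 for state in states if state == "critical")
    # Integer comparison against half of the critical elements.
    if 2 * down > len(states):
        return "critical"
    if 2 * down == len(states):
        return "warning"
    return "normal"


# The three days of the example: (router, printer, Apache server).
day1 = [(True, "critical"), (False, "critical"), (True, "warning")]
day2 = [(True, "critical"), (False, "critical"), (True, "critical")]
day3 = [(True, "normal"), (False, "critical"), (True, "normal")]

for day in (day1, day2, day3):
    print(simple_mode_status(day))
```

Taken literally, the 'warning' status can only occur when the number of critical elements is even, since an odd number can never be at exactly 50%.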

1.2.2 Creating a New Service

1.2.2.1 Pandora FMS Versions 5 and above

Services are able to represent:

  • Modules
  • Agents
  • Other Services

The service values are calculated using the Prediction Server which utilizes the default interval of the prediction modules.

Within each service, you may add all the modules, agents or sub services which are required to monitor the service you are creating. If you intend to monitor the on-line shop, you're going to need a module for the content, a sub-service that is going to monitor the communications, etc.

To create a new service, please click on 'Operation' and the 'Service' tab and click the configuration button.

On Pandora 6.0: To create a new service, please click on Services at the Topology Maps tab and then click the Create Service button.

Menu services.png

A list of all the available services will be shown. The next screen shot shows an empty service list.

Services empty v5.png

To create a new service, just click on the 'Create Service' button and fill out the form as shown below.

Services creation v5.png

The names of the form fields and their meaning are as follows:

  • Name: The name of the service.
  • Description: The description of the service.
  • Group: The group of the service. It's quite useful for organization purposes and to enforce the SLA (Service Level Agreement) constraints.
  • Mode: mode in which the calculation of the elements weights will be performed.
    • Manual: the weights should be entered manually into the service and their elements.
    • Auto: the 'critical' threshold for the service is set to '1' and the 'warning' threshold to '0.5'. Each element created for this service is automatically assigned weights of '0' for the 'OK' status, '0.5' for 'warning' and '1' for 'critical'.
    • Simple: there is no need to enter weights, only enable or disable a checkbox to indicate if the element is critical.
  • Critical: The weight threshold to enter the 'critical' status. This field is disabled if the auto-calculate check is enabled. The default value is '1'.
  • Warning: The weight threshold to enter the 'warning' status. This field is disabled when the auto-calculate check is enabled. The default value is '0.5'.
  • Agent to store Data: The agent in which the service module will be stored. The service stores the data in special modules (called 'prediction modules'). An agent is necessary to store those modules and the alerts of the service along with it.
  • SLA Interval: The time range for performing the SLA constraint's calculation. The default value is '1 month'.
  • SLA Limit: The SLA threshold for the 'OK' status.
  • Warning Service Alert: The alert template the service is going to use if it changes to 'warning' status.
  • Critical Service alert: The alert template the service is going to use if it changes to 'critical' status.
  • SLA Critical Service Alert: The service's alert template for firing an alert if the SLA constraints aren't met.


To add nodes, you're required to go to the 'Config Elements' tab.

Services tab setup v5.png

You'll see a page like the one below where you can manage (modify, add new ones or delete) service elements.

Services elements empty v5.png

Some important items on the services configuration page are:

  • Type: The type of the element: module, agent or service. An element of the 'agent' type takes all the modules of that agent into account.
  • Agent: The smart-search input control for the agent. It's only visible if the element type is either the 'agent' or the 'module' type.
  • Module: The drop-down list with the modules of the agent previously chosen in the smart search. This control is only visible when editing or creating a service element of the 'module' type.
  • Service: The drop-down list of the services that can be added as an element. It's only visible when creating or editing an element of the 'service' type. Keep in mind that only services which are not ancestors of the current service appear in the drop-down list. This is necessary to keep a proper tree-structured dependency between the services.
  • Critical: A checkbox to select if the element is critical. Not visible unless the service is in simple mode.
  • Weight on Critical: The weight of the element if it's in a 'critical' state. The default value is '1'. It's disabled if the service is in 'auto calculate' mode. Not visible if the service is in simple mode.
  • Weight on Warning: The weight of the 'warning' state. The default value is '0.5'. It's disabled if the service is in 'auto calculate' mode. Not visible if the service is in simple mode.
  • Weight on Unknown: The weight of the element if it's in unknown state. The default value is '0'. It's disabled if the service is in 'auto calculate' mode. Not visible if the service is in simple mode.
  • Weight on OK: The weight of the element if it's in perfect conditions. The default value is '0'. It's disabled if the service is in 'auto calculate' mode. Not visible if the service is in simple mode.

Once you have created the service items on this page, you'll see a management list similar to the one shown on the picture below:

Services list elements admin v5.png


1.2.2.1.1 Modules created when configuring a service
  • SLA Value Service: The percentage value of the SLA compliance. (async_data).
  • Service_SLA_Service: This shows whether the SLA is being met or not. (async_proc).
  • Service_Service: This module shows the sum of the weights of the service. (async_proc).

1.2.3 Service Visualization

1.2.3.1 Pandora FMS Versions 5 and above

From version 5 onward, several ways of visualizing services are available. You may choose to see the status of your services using a tree-based view or a list-based one.

1.2.3.1.1 List-based View of all the Services

This is an operational list that displays all the services the user is able to see (Access Control List implementation).

Please go to the Operation Menu and click on 'Monitorization' and 'Services'.

Services list services admin v5.png

Each row represents a service, and the columns represent:

  • Name: The name of the service.
  • Description: The service description.
  • Group: The icon of the group the service belongs to.
  • Critical: The threshold value for the sums of weights to put the service into 'critical' state.
  • Warning: The threshold value for the sums of weights to put the service into 'warning' state.
  • Value: The current value for the sum of all weights for the service.
  • Status: An icon which represents the status of the service. Four possible states are represented:
    • Red: The service is in 'critical' state, because the value exceeded the critical threshold.
    • Yellow: The service is in 'warning' state, because the value exceeded the warning threshold.
    • Green: The service is within the 'normal' range.
    • Gray: The service is in 'unknown' state. This usually means the service has been recently created and doesn't contain any modules or the Prediction Server is down.
  • SLA: The current value of the SLA Service. The values can be:
    • OK: The SLA is met for the interval defined in the SLA service.
    • INCORRECT: The SLA is not met for the interval currently defined in the SLA Service.
    • N/A: The SLA is in 'unknown' state, because there is insufficient data to perform the calculation.

1.2.3.1.2 List-based view of a Service and its Elements

To obtain this view, click on the magnifying glass icon next to the service name.

Services list elements operation v5.png

As you can see, there are two zones here: the service, with the same columns as in the previous view, and a list of the service's elements, where the columns are:

  • Type: The icon which represents the type of an element. It's a Lego block for modules, some stacked Lego blocks for an agent and a Network Diagram Icon for the services.
  • Name: The text which contains the name of the module, agent or service. They're also linked to the proper section.
  • Description: A small free-text field intended for a short description.
  • Weight critical: The value if the element is in 'critical' state.
  • Weight warning: The value if the element is in 'warning' state.
  • Weight normal: The value if the element is in 'normal' state.
  • Data: The value of the element. Its meaning depends on the element type:
    • Module: The value of the module.
    • Agents: The text which displays the agent's status.
    • Service: The sum of all elements' weights from the selected service.
  • Status: The icon which represents the element's status by color.


Keep in mind that the calculation of the service elements is performed by the Prediction Server, so the data you're looking at is not real-time. For example, when a module or agent is added to the service, its weight is not reflected until the Prediction Server performs the calculation again.

 


1.2.3.1.3 Service Map View

To access this view, click on the tab above the header in the service operation view, as you can see on the picture below.

Services tab servicemap v5.png

This view displays the service in a tree-structured view as shown on the picture below. You can see at a glance in which way the service is being impacted by the elements which compose the view.

Services servicemap v5.png

The possible nodes can be:

  • Module Node: It's represented by the 'heartbeat' icon. This node is always final (leaf).
  • Agent Node: It's represented by the 'CPU box' icon. This node is also always final (leaf).
  • Service Node: It's represented by the 'crossed hammer and wrench' icon. This node is not a final node. It must contain additional nodes.

The color of each node and of the arrow which connects it to the service depends on the node's status.

There are the following attributes within the node:

  • Title: The name of the service, agent or module the node represents.
  • Value list: The numeric values defined for that node. Any assigned numeric value is accepted.
    • Critical: The weight if it reaches 'critical' status, except if it's the root-service node, in which case it represents the threshold to reach the 'critical' status.
    • Warning: The weight if it reaches 'warning' status, except if it's the root-service node, in which case it represents the threshold to reach the 'warning' status.
    • Normal: The weight if it reaches 'normal' status, except if it's the root-service node, in which case nothing is displayed here.
    • Unknown: The weight if it reaches 'unknown' status, except if it's the root-service node, in which case it represents the threshold to reach the 'unknown' status.

You may click on each node in the tree. The target link represents the operational view of the node itself.



When the service mode is simple, a red exclamation point appears on the right side of the critical elements.

 


1.2.3.1.4 Services within the Visual Console

From Pandora FMS versions 5 and above, you may add services in the Visual Console like any other item on the map.

Services visualmap v5.png

To create a service on a map, the procedure is the same as for all other items on the Visual Map.

Services visualmap add item v5.png

It contains the following attributes:

  • Label: The title which is going to be shown within the visual console's node.
  • Service: The service that's going to be represented.


Keep in mind that a service node can not be linked to another Visual Map. The link is always going to represent the service-tree view.

 


1.2.4 How to read the service values

Planned shutdowns added after the date on which they took place allow the values of the SLA reports to be recalculated. This option first has to be activated in the general setup. In an SLA service report, if a scheduled outage affects one or more elements of the service, the planned shutdown is considered to affect the entire service, because the system cannot evaluate the impact that one "inactive" component has on the service as a whole.

It is important to remember that this applies at report level only: the service map and the information presented in the visual console are not altered by planned shutdowns added after their effective execution date. These service compliance percentages are calculated in real-time, based on the history data of the service itself. This is very different from a report, which can be "cooked" by adding a "fake" planned downtime.

On the other hand, it is important to know how the compliance of a service is calculated:

Let's suppose we have a service defined by a 95% compliance in an interval of 1 hour (this is very short for the real world, but good for understanding the internal algorithm). We will use a table of values, where t is time, x is the % compliance (SLAs), and s is whether or not the service complies (1 complies, 0 fails). In 1 hour we should have exactly 12 values, assuming an interval of 5 minutes.

In a case where the service complies for the first 11 samples (the first 55 minutes) and fails at the 60th minute, we would have these values:

   t    |   s   |    x
--------+-------+--------
   1    |   1   |  100
   2    |   1   |  100
   3    |   1   |  100
   4    |   1   |  100
   5    |   1   |  100
   6    |   1   |  100
   7    |   1   |  100
   8    |   1   |  100
   9    |   1   |  100
  10    |   1   |  100
  11    |   1   |  100
  12    |   0   |  91.6

This case is easy to calculate. The percentage depends on the number of samples: at t3 there are three samples in total and all of them meet the service, so the result is 100%, whereas at t12 there are 12 samples, 11 of them valid: 11 / 12.

Now assume the failure happens in the middle of the series and the compliance recovers slowly:

   t    |   s   |    x
--------+-------+--------
   1    |   1   |  100
   2    |   1   |  100
   3    |   1   |  100
   4    |   1   |  100
   5    |   1   |  100
   6    |   0   |  83.3
   7    |   1   |  85.7
   8    |   1   |  87.5
   9    |   1   |  88.8
  10    |   1   |  90
  11    |   1   |  90.9
  12    |   1   |  91.6

So far all seems similar to the previous scenario, but let's see what happens if we go over time:

   t    |   s   |    x
--------+-------+--------
  13    |   1   |  91.6
  14    |   1   |  91.6
  15    |   1   |  91.6
  16    |   1   |  91.6
  17    |   1   |  91.6
  18    |   1   |  100
  19    |   1   |  100
....

Now we see unintuitive behavior. The window always holds the last 12 samples, so the number of valid samples stays at 11 out of 12 until t17. At t18, the only invalid value (t6) falls out of the window, so compliance jumps to 100%. This step between 91.6 and 100 is explained by the size of the window. The larger the window is (the SLA calculation interval is usually daily, weekly or monthly), the less abrupt the step will be.
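
The window behavior described above can be reproduced with a short Python sketch. It only illustrates the arithmetic of the tables and is not the actual Pandora FMS algorithm: the function name `compliance_series` is hypothetical, and a window of 12 samples (1 hour at a 5-minute interval) is assumed.

```python
from collections import deque


def compliance_series(samples, window=12):
    """Return the % compliance after each sample.

    samples: iterable of 1 (service complies) or 0 (service fails).
    Only the last `window` samples are taken into account.
    """
    recent = deque(maxlen=window)  # old samples drop out automatically
    series = []
    for s in samples:
        recent.append(s)
        series.append(100.0 * sum(recent) / len(recent))
    return series


# One failure at t6, followed by recovery up to t19.
samples = [1] * 5 + [0] + [1] * 13
for t, x in enumerate(compliance_series(samples), start=1):
    print(t, round(x, 1))
```

The printed values are rounded rather than truncated, so they show 88.9 and 91.7 where the tables show 88.8 and 91.6. The jump to 100 at t18 is the same.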

1.2.5 Service grouping

Services are logical groups that form part of a business structure. Because of that, it makes sense to group services, since in many cases there are dependencies between them. For example, a global company service may be composed of several more specific services (webpage, communications, etc). To group the services, it's necessary to create the general service and the smaller ones that will be aggregated into it, creating a logical tree structure.

The service groups can help us to: create visual maps, configure alerts, apply monitoring policies, etc. So we can create specific alerts when the company service is down due to the commercial department not being able to work, or the webpage being offline.

Next we have two examples to understand service grouping.

1.2.6 Examples of services monitoring

1.2.6.1 PandoraFMS service

In this case the Pandora FMS service itself is being monitored. It is composed of the Apache service, MySQL, Pandora server and Tentacle server. Each of these elements is in turn a service with its own components, creating a tree-type structure.


Arbol.JPG


The general Pandora service will turn into critical status if it reaches the weight of 2, and warning status with 1. As you can see, the four components have different weights over the Pandora service:

  • MySQL: critical for the Pandora service, with an individual weight of 2 if MySQL is down. It has a weight of 1 if it turns into warning status, which already displays yellow status on the Pandora service.
  • Pandora Server: critical for the Pandora service, with an individual weight of 2 if Pandora Server is down. It has an individual weight of 1 if it is in warning status (for example, if it reaches a heavy CPU load), which displays the warning state on the Pandora service.
  • Apache: its failure means a degradation of the global Pandora service, but not a complete interruption, so it has an individual weight of 1 if it is down, showing the warning status on the Pandora service.
  • Tentacle: same as Apache, its failure means a degradation of the service, but not a total interruption, so it gets a weight of 1 if it is down, and displays warning status.
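
The way a tree of services is evaluated from the leaves up can be sketched as below. This is a hedged illustration, not Pandora FMS code: the `Node` class and the functions `evaluate`, `sub_service` and `pandora_service` are hypothetical names, the weights and thresholds are the ones listed above, and the internals of each component service are not described in this section, so each one is modeled here as wrapping a single made-up check whose state it mirrors. It is also assumed that a status is entered as soon as the sum reaches its threshold.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A module/agent leaf (no children) or a service (with children)."""
    name: str
    state: str = "ok"        # only meaningful for leaves
    warning: float = 1.0     # service thresholds
    critical: float = 2.0
    children: list = field(default_factory=list)  # (Node, {state: weight})


def evaluate(node):
    """Return 'ok', 'warning' or 'critical' for a node, bottom-up."""
    if not node.children:
        return node.state
    # Each child contributes the weight configured for its own status.
    total = sum(weights.get(evaluate(child), 0.0)
                for child, weights in node.children)
    if total >= node.critical:
        return "critical"
    if total >= node.warning:
        return "warning"
    return "ok"


def sub_service(name, state):
    # Hypothetical: a component service with one check that it mirrors
    # (critical weight 2, warning weight 1, thresholds 2 and 1).
    leaf = Node(f"{name} check", state=state)
    return Node(name, children=[(leaf, {"critical": 2, "warning": 1})])


def pandora_service(mysql="ok", server="ok", apache="ok", tentacle="ok"):
    full = {"critical": 2, "warning": 1}   # MySQL, Pandora Server
    degraded = {"critical": 1}             # Apache, Tentacle
    return Node("Pandora FMS", children=[
        (sub_service("MySQL", mysql), full),
        (sub_service("Pandora Server", server), full),
        (sub_service("Apache", apache), degraded),
        (sub_service("Tentacle", tentacle), degraded),
    ])


print(evaluate(pandora_service(mysql="critical")))   # critical
print(evaluate(pandora_service(apache="critical")))  # warning
```

Under these weights, Apache and Tentacle failing at the same time add up to 2, so two simultaneous degradations also put the general service into critical status.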

In the following picture we can see the setup of the different weights for the elements over the Pandora general service:


Pesos.JPG


1.2.6.2 Storage cluster, grouping of services

Services are logically arranged groups which are part of a company's business structure. Therefore, it may be necessary to create groups of services, because services alone sometimes don't have an appropriate context. To create service groups, you're required to add each service to an existing agent. In this case, a service becomes a module of an agent. With these modules you can build a new logical structure (a group of services).

In the following example we have an HA storage cluster. In this case there are two fileserver systems working in parallel, each one controlling the usage percentage and status of a set of disks that provide service to specific departments, creating a tree-type structure with grouped services.


Cluster.JPG


According to this structure, the critical threshold of the storage service is reached only if both fileservers fail, since this would deny the service completely. If only one of the fileservers fails, the service is merely degraded. The screenshot below shows the weight configuration of the two main elements of the storage service:


Pesoscluster.JPG


In the following image, we can see the content and weight configuration of the grouped service FS01. Here, each element has a specific weight according to how critical it is:

  • FS01 ALIVE: critical to the FS01 service, since it is the virtual IP assigned to the first disk cluster. It has an individual weight of 2, because if it's down, the other elements are automatically inoperative. In this case there is no warning threshold, since it is a yes/no type of information.
  • DHCPserver ping: critical to the FS01 service, so we give it an individual weight of 2. In this case there is no warning threshold either.
  • Disks: each disk gets an individual weight of 1 if it reaches its own critical status, and 0.5 for its warning status. As far as the disks alone are concerned, the FS01 service therefore only reaches critical status if two disks are in critical status or four are in warning status.


Pesosfs01.JPG

1.3 Pandora Server

The Prediction Server must be running properly, and the Enterprise Version of Pandora FMS must be installed.
