Unreachability and notifications to many
Posted by alchemyx on August 13, 2009 at 15:26
Hello!
I am new to Pandora FMS. I currently use Nagios, which I am planning to drop because of its ugly configuration files and the funny tricks (a.k.a. extensions) needed to make it better.
Currently I have two problems which I can’t solve.
The first is about unreachability of hosts. For example, Host A is the parent of B and B is the parent of C: A <- B <- C. Now if A goes down because of a power outage, B and C go down as well. Is it possible to get a notification only about A going down? In Nagios it works that way. It is important because if one of the main switches in the city fails, I will get 20 or 30 notifications, which makes them useless.
The second is about sending alerts to many people. I have four people who get notified by BlackBerry (same as e-mail), four who get notifications via Jabber, and two who get them via e-mail. So if I have 5 services for a host, does it mean that I need to create 50 (!!) alerts? Or am I missing something? Thanks
Sancho replied 15 years, 3 months ago 2 Members · 8 Replies -
-
::
All these issues are solved in the next release, v3.0; you can check it out from the current SVN code.
Version 2.1 (the current stable version) supports combined alerts. 2.1 doesn’t support multiple “alert actions”, but 3.0 does. I recommend that you try 3.0 directly; at this time it is pretty stable.
-
-
::
I tried 3.0 and I still have the same issues:
– actions and alerts are now separated – great. I add 10 actions (4 BlackBerry, 4 Jabber, 2 mail) to every alert and it works fine. But what if we hire another person, or somebody leaves us? Then I have to manually add or remove an action for each alert (in Nagios I have about 600 services monitored, so doing it by hand in Pandora seems impossible). Unfortunately the alert template doesn’t include actions. I could create an external daemon to handle alerts, but it would be a “dirty hack”.
– correlated alarms are not a solution for the unreachability problem, because I would have to correlate every alert on host B with host_alive of host A (its parent). Why can’t it use the “parent” field automatically and depend on it?
– also, I can’t find a way to create a template with correlated alerts, or even to copy one to another agent.
I wonder how people manage Pandora in large installations with these issues? Maybe there are things that I am missing?
-
::
About the first question: if you alter the ACTION (e.g. the destination address in an email action), the updated action will be used by all alert templates that use it. This is the way we manage big environments. And yes, you can add a default action to alert templates; don’t you see it? I think it’s in the first step of the alert template configuration/edition.
About the second: what solution do you think is possible? I’ve thought a lot about this and I can’t find a better one. If you rely only on the parent, you can only provide a single “parent” or a single rule, and most of the time the parent is not the machine you need to “correlate”; for example, you may have several routers or devices in front of the application you need to monitor. In your environment, is the parent relation all you need to make this kind of alert? Would it solve all your problems with this issue?
About the third: no, there are no templates for correlation yet.
The Enterprise version comes with a tool called “policy management” which allows you to propagate collections of modules and alerts. It helps with this kind of thing, but the tool itself always needs to be administered. It’s difficult to make things easy for everybody, but all suggestions are welcome!
-
::
Thank you for your reply.
About notifications: I mean that the best solution is to group actions into one group. For example, I have a group “notify-administrators” and all my notifications are attached to that group. So if we hire another person, I only have to modify one place. The current way is not flexible. About default actions for an alert template – fine, but I can’t add multiple default actions to an alert template, or am I missing something?
For example – we hired another person a few months ago. I added another “contact” item, added that contact to the “contactgroup” item named “administrator”, restarted Nagios, and that was all.
Now if I want to do it in Pandora – not for every monitored host, but only for switches and servers (I have such a setting in Nagios so administrators don’t get unimportant messages about DSL) – how do I do it?
What if that person leaves us? How do I remove him from notifications? In Nagios it is the same path – remove him from the contactgroup, remove him from contacts, then restart Nagios.
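The contact-group idea described above can be sketched in a few lines of Python. This is only an illustration of the proposal, not something Pandora FMS or Nagios provides in this form: the names `Contact`, `GROUPS`, `ALERT_GROUPS` and `recipients_for` are hypothetical, and the addresses are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    name: str
    channel: str  # "blackberry", "jabber" or "mail"
    address: str


# The single place to edit when somebody joins or leaves (hypothetical data).
GROUPS = {
    "notify-administrators": [
        Contact("alice", "mail", "alice@example.org"),
        Contact("bob", "jabber", "bob@example.org"),
    ],
}

# Alerts reference a group name, never individual people.
ALERT_GROUPS = {
    "switch-01/host_alive": "notify-administrators",
    "server-01/host_alive": "notify-administrators",
}


def recipients_for(alert: str) -> list:
    """Expand the group attached to an alert into its current members."""
    return list(GROUPS[ALERT_GROUPS[alert]])


# Hiring a new person is one change, and every alert picks it up.
GROUPS["notify-administrators"].append(
    Contact("carol", "blackberry", "carol@example.org")
)
print([c.name for c in recipients_for("switch-01/host_alive")])
```

With this shape, 600 alerts would still hold only a group name each, so adding or removing a person never touches the alerts themselves.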
===
About correlation between hosts: I have a quite simple network architecture. Switch X usually has two alternative paths to some “middle” core switches, and those middle switches are connected to the main switch. X has parents C and D. C has parent A and D has parent B. A and B both have the same parent – the core switch. Something like this:
      +------ A ------ C -----+
CORE                            X
      +------ B ------ D -----+
I hope it shows properly. So in that case if:
– everything is UP (I mean CORE, A, B, C, D) and X is down it means that it is state DOWN
– everything is UP and X is up then it is in state UP
– A and C are down and also X is down – it should be in state DOWN, because clearly something is wrong with that switch because it has alternative path via D
– A, B, C, D are down and also X is down – it should be in state UNREACHABLE (it has both its parents down so we assume it is the reason)
– A, B, C, D are down and X is up – it means that those down switches have something wrong with their administration (for example misconfigured vlan 1), but in that situation X should be in state UP
– A, B are down, C and D are up, X is UP – it should be in state UP (same situation as earlier)
– A, B are down, C and D are up, X is DOWN – it should be in state DOWN
– A, B are up, C, D and X are down – it should be in state UNREACHABLE
So to put it in simple words – if every parent of host X is in state DOWN or UNREACHABLE and X is down, then X is UNREACHABLE.
If at least one parent of host X is in state UP and host X is down, then X is DOWN.
That’s it. It always worked for me that way, and if anybody has a more complicated situation – for example a webapp that depends on a router and also on a “proxy” service on another server – then they can add manual correlation between services.
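The two rules above can be written down as a small Python function. This is a minimal sketch and not Nagios or Pandora FMS code: `host_state`, `checks` and `PARENTS` are hypothetical names, the topology assumes A is the parent of C and B is the parent of D (as in the diagram), and a down host without any parent is assumed to be plain DOWN.

```python
def host_state(host, is_up, parents):
    """Classify a host as UP, DOWN or UNREACHABLE from raw check results.

    is_up maps a host name to True if its own check succeeded.
    parents maps a host name to the list of its immediate parents.
    """
    if is_up[host]:
        return "UP"
    host_parents = parents.get(host, [])
    # A parent whose check fails is itself DOWN or UNREACHABLE, so
    # "every parent is DOWN or UNREACHABLE" reduces to "no parent is up".
    if host_parents and not any(is_up[p] for p in host_parents):
        return "UNREACHABLE"
    return "DOWN"


# Assumed topology: CORE -> A -> C -> X and CORE -> B -> D -> X.
PARENTS = {"A": ["CORE"], "B": ["CORE"], "C": ["A"], "D": ["B"], "X": ["C", "D"]}


def checks(down):
    """Build a check-result map in which only the hosts in `down` fail."""
    return {h: h not in down for h in ["CORE", "A", "B", "C", "D", "X"]}


print(host_state("X", checks({"A", "C", "X"}), PARENTS))  # DOWN
print(host_state("X", checks({"C", "D", "X"}), PARENTS))  # UNREACHABLE
```

Only the immediate parents are consulted, which is why the case with A and B down but C and D up still yields DOWN for X.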
So it would be great if there were another host state – UNREACHABLE – and the person who uses Pandora could decide how to treat it: as unreachable or as down. It would make Pandora way more flexible.
Thanks!
-
::
About notifications: I mean that the best solution is to group actions into one group. For example, I have a group “notify-administrators” and all my notifications are attached to that group. So if we hire another person, I only have to modify one place. The current way is not flexible. About default actions for an alert template – fine, but I can’t add multiple default actions to an alert template, or am I missing something?
You can always put a list of people in the field used to send an email (user@domain, user2@domain) and use that action like a group. I’m not sure that works, but try it, and if it doesn’t, open a bug so that we can send a mail to multiple users internally. I think this is a good option. About the multiple defaults: yes, having only one default action is a limitation at this time 🙁
I hope it shows properly. So in that case if:
– everything is UP (I mean CORE, A, B, C, D) and X is down it means that it is state DOWN
– everything is UP and X is up then it is in state UP
– A and C are down and also X is down – it should be in state DOWN, because clearly something is wrong with that switch because it has alternative path via D
– A, B, C, D are down and also X is down – it should be in state UNREACHABLE (it has both its parents down so we assume it is the reason)
– A, B, C, D are down and X is up – it means that those down switches have something wrong with their administration (for example misconfigured vlan 1), but in that situation X should be in state UP
– A, B are down, C and D are up, X is UP – it should be in state UP (same situation as earlier)
– A, B are down, C and D are up, X is DOWN – it should be in state DOWN
– A, B are up, C, D and X are down – it should be in state UNREACHABLE
So to put it in simple words – if every parent of host X is in state DOWN or UNREACHABLE and X is down, then X is UNREACHABLE.
If at least one parent of host X is in state UP and host X is down, then X is DOWN.
That’s it. It always worked for me that way, and if anybody has a more complicated situation – for example a webapp that depends on a router and also on a “proxy” service on another server – then they can add manual correlation between services.
So it would be great if there were another host state – UNREACHABLE – and the person who uses Pandora could decide how to treat it: as unreachable or as down. It would make Pandora way more flexible.
Thanks!
Very, very interesting. This is really complex to implement; I figure it’s not easy to set up in other tools either. In Pandora you can do it using correlation, BUT the individual alerts will still fire, even if they don’t execute any actions. You need the individual alerts to be defined before you create a correlation rule/alert.
You can map your UNREACHABLE status to our WARNING status, and treat NORMAL and CRITICAL as UP and DOWN. We have another state, UNKNOWN, for systems that you cannot contact and don’t know whether they are up or down. It sounds like your UNREACHABLE, but this state is inferred from different conditions, not from an active check, so it’s complex to manage, because you cannot know exactly when a device becomes Unknown.
Some days ago, after this conversation, I decided to implement a kind of “cascade protection” for the alerting system:
This option is designed to avoid a “storm” of alerts caused by a group of agents being unreachable. This kind of behaviour happens when an intermediate device, for example a router, is down and all devices behind it are simply not reachable. Probably those devices are not down, and they may even be working behind another router in HA mode, but if you do nothing, Pandora FMS will probably think they are down because it cannot test them remotely with a Remote ICMP Proc test (a ping).
When you enable cascade protection in an agent, it means that if its parent has a CRITICAL alert fired, then the agent’s alerts WILL NOT be fired. If the agent’s parent only has a module in CRITICAL, or several alerts with lower criticity than CRITICAL, the agent’s alerts will be fired as usual. Cascade protection checks the parent’s alerts with CRITICAL criticity, including the correlation alerts assigned to the parent.
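The rule described above can be modelled roughly as follows. This is a sketch of the stated behaviour only, not the Pandora FMS server implementation: the `Alert` and `Agent` classes and the `alerts_to_fire` function are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Alert:
    name: str
    criticity: str  # "WARNING" or "CRITICAL"
    fired: bool = False


@dataclass
class Agent:
    name: str
    parent: Optional["Agent"] = None
    cascade_protection: bool = False
    alerts: list = field(default_factory=list)


def parent_has_critical_alert_fired(agent):
    """True if the agent's parent has at least one fired CRITICAL alert."""
    parent = agent.parent
    return parent is not None and any(
        a.fired and a.criticity == "CRITICAL" for a in parent.alerts
    )


def alerts_to_fire(agent, triggered):
    """Return the triggered alerts that should actually be fired."""
    if agent.cascade_protection and parent_has_critical_alert_fired(agent):
        return []  # suppressed: the parent already reports the outage
    return list(triggered)


router = Agent("router", alerts=[Alert("host_alive", "CRITICAL")])
server = Agent("server", parent=router, cascade_protection=True)
cpu = Alert("cpu", "WARNING")

print([a.name for a in alerts_to_fire(server, [cpu])])  # ['cpu']
router.alerts[0].fired = True
print([a.name for a in alerts_to_fire(server, [cpu])])  # []
```

Note that a fired WARNING alert on the parent does not suppress anything in this model, which matches the description above.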
If you want to use an advanced cascade protection system, just use correlation between successive parents, and enable Cascade Protection in the children.
Combined with correlation alerts, this could result in a very flexible alerting system that is not too heavy to administer. I suggest trying our current 3.0 development version; these things are harder to explain in messages than to test and check personally 🙂
-
::
The idea with multiple recipients in one action sounds fine. If it doesn’t work I can always make it work with some kind of wrapper to mailx. So let’s say it is solved.
About unreachability: I am using the SVN version of Pandora and can’t find cascade protection – where is it?
BTW: how do I make this kind of correlated alarm: fire the alarm only when host A is up and host B is down? I really can’t think of a logical operator that could make it work (0 – down/alarm, 1 – up/no alarm):
A / B / alarm
0 / 0 / 1
0 / 1 / 1
1 / 0 / 0
1 / 1 / 1
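In the 0/1 encoding used in this table, the alarm column is logical implication (A implies B), which is the same as (NOT A) OR B. A quick Python check of all four rows is below; the function name `alarm` is just for illustration, and whether a given correlation editor offers a NOT operator is a separate question that this sketch does not answer.

```python
def alarm(a: int, b: int) -> int:
    """Return 0 (alarm) only when A is up (1) and B is down (0)."""
    return (1 - a) | b  # (NOT a) OR b


for a in (0, 1):
    for b in (0, 1):
        print(a, b, alarm(a, b))
```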
About parents in other software, I can speak only for Nagios. There it is simply achieved:
– I set up one or more parents for every host (a parent being the host’s immediate parent)
– I set up my contact so that I receive notifications about up and down, but not about unreachability
And that is all 🙂
-
::
[cite]Posted By: alchemyx[/cite]
The idea with multiple recipients in one action sounds fine. If it doesn’t work I can always make it work with some kind of wrapper to mailx. So let’s say it is solved.
We’re working on this now; you’re not the first to ask for it, and you convinced me.
About unreachability: I am using the SVN version of Pandora and can’t find cascade protection – where is it?
Check it out in a few hours; Ramon is finishing the server code. Beware, because you need to update your DB (check the latest lines in extras/pandoradb_migrate_v2.x_to_v3.0.sql, August 2009).
BTW: how do I make this kind of correlated alarm: fire the alarm only when host A is up and host B is down? I really can’t think of a logical operator that could make it work (0 – down/alarm, 1 – up/no alarm):
A / B / alarm
0 / 0 / 1
0 / 1 / 1
1 / 0 / 0
1 / 1 / 1
About parents in other software, I can speak only for Nagios. There it is simply achieved:
– I set up one or more parents for every host (a parent being the host’s immediate parent)
– I set up my contact so that I receive notifications about up and down, but not about unreachability
And that is all 🙂
In your example, let’s suppose that your host A is a router and your host B is a server.
If you set A as the parent agent of B and enable “Cascade protection” on B, none of B’s alerts will be fired while A is down. Of course, you need to define an alert on A that defines when A is down for you.
You may have lots of servers like B, and you only need to define the parent for them. So you have one alert for each host (B type) and one alert for router A. In a 1000-agent setup you have 1001 alerts, most of them probably the same and very easy to assign using the alert template system and the massive configuration tool (even easier in the Enterprise version with policy management).
The second option is more fun: using correlation. A simple example with two hosts is not enough to understand when this feature is most useful. Instead, imagine you have a router, a firewall, a server with a web server and a database. You want to know when your service is in bad shape (having problems), when your service is definitely out, and why.
You will have the following monitors:
– ROUTER: an ICMP check and an SNMP check using a standard OID to get the ATM port status. It may also have a latency check for your parent/provider router.
– WEB SERVER: several internal checks running with the Pandora FMS agent: CPU usage, MEM usage and a process check of your Apache. You also have a latency check for a 4-step HTTP navigation check.
– DATABASE SERVER: several internal checks running with the Pandora FMS agent: CPU usage, MEM usage and a process check of your database, plus a few database integrity checks. You also check remote connectivity to the database using a plugin-defined test that logs in, makes a query and exits, timing the answer.
Now you define several SINGLE alerts:
-ROUTER:
ICMP Check / CRITICAL -> Action, send MAIL.
SNMP Check / CRITICAL -> Action, send MAIL.
Latency > 200ms / WARNING -> Action, none, just compound.
-WEB SERVER:
CPU / WARNING -> Action, none, just compound.
MEM / WARNING -> Action, none, just compound.
PROCESS / CRITICAL -> Action, send MAIL.
HTTP LATENCY / WARNING -> Action, none, just compound.
-DATABASE SERVER:
CPU / WARNING -> Action, none, just compound.
MEM / WARNING -> Action, none, just compound.
PROCESS / CRITICAL -> Action, send MAIL.
SQL LATENCY / WARNING -> Action, send MAIL.
You define ROUTER as parent for DATABASE and WEB servers. You enable the Cascade Protection in both agents (Database and Web).
You now define one correlation alert assigned to DATABASE:
Router ICMP Check NOT Fired
AND
Router SNMP Check NOT Fired
AND
WEB Server Process NOT Fired
AND
Database Server Process Critical
THEN
Send MAIL: “Service DOWN: Database Failure”
You now define one correlation alert assigned to DATABASE:
Router ICMP Check NOT Fired
AND
Router SNMP Check NOT Fired
AND
WEB Server Process Fired
AND
Database Server Process NOT Fired
THEN
Send MAIL: “Service DOWN: WebServer Failure”
And more complex alerts like:
Router ICMP Check NOT Fired
AND
Router SNMP Check NOT Fired
AND
WEB Server HTTP Latency NOT Fired
AND
DATABASE Server SQL Latency Fired
AND
DATABASE Server CPU NOT fired
AND
DATABASE Server MEM Fired
THEN
Send MAIL: Database is getting exhausted. Please check it ASAP.
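The AND-chains above can be read as lists of (alert, expected fired state) pairs. The following Python sketch evaluates two of them against a set of currently fired alerts. It is only an illustration of the logic, not the Pandora FMS correlation engine: `rule_matches` and the alert names such as `router/icmp` are hypothetical.

```python
def rule_matches(conditions, fired):
    """Evaluate an AND-chain of conditions.

    conditions: list of (alert_name, expected_fired) pairs.
    fired: set of names of the single alerts that are currently fired.
    """
    return all((name in fired) == expected for name, expected in conditions)


# "Service DOWN: Database Failure"
DATABASE_FAILURE = [
    ("router/icmp", False),
    ("router/snmp", False),
    ("web/process", False),
    ("database/process", True),
]

# "Service DOWN: WebServer Failure"
WEBSERVER_FAILURE = [
    ("router/icmp", False),
    ("router/snmp", False),
    ("web/process", True),
    ("database/process", False),
]

fired_now = {"database/process", "database/cpu"}
print(rule_matches(DATABASE_FAILURE, fired_now))   # True
print(rule_matches(WEBSERVER_FAILURE, fired_now))  # False
```

Alerts that a rule does not mention (here `database/cpu`) are ignored, so each rule only constrains the checks it lists.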