Alert should not be fired

Posted by daniels on May 22, 2009 at 06:57

Hi guys.

I thought I had figured out how alerts work, but an alert fired today made me start over again…

I assumed that the alert should fire only if the condition occurred twice within the threshold time interval. The pics show what happened:

(Still using Pandora FMS version 2.0)

daniels replied 15 years, 8 months ago 4 Members · 13 Replies

daniels

Member
May 22, 2009 at 06:58

The alert configuration:
daniels

Member
May 22, 2009 at 06:58

Alert fired…
daniels

Member
May 22, 2009 at 07:01

Any explanation for this behaviour?
Sancho

Administrator
June 1, 2009 at 04:50

You have a minimum of 2 alerts in a time period of 15 minutes.

If you get your first alert at time = 0 and your second alert at time = 5, your first alert should be fired at time = 10 and your second alert at time = 15. But oops: if there is even 1 second of delay between packets, you fall outside that 15-minute interval.

Raise the threshold to 20 minutes and it should work.

Hope this helps.
daniels

Member
June 1, 2009 at 06:16

Hi nil.

Thanks for your answer.

But I think I did not express myself clearly: the issue here is “Why was an alert fired?” and not “Why did I get only one alert?”

With this configuration, I expected an alert to be fired only if I had more than 2 values below the “min value” (in this case, below 13).

As you can see, I have only one value below 13 (07:06:20 – 5.6). To me this is a false positive, because more low values were needed before the alert should fire:

“Min. number of alerts: Minimum number of alerts needed to start triggering an alert. Works as a filter, needed to remove false positives. ”

I noticed this strange behaviour only once, but I don’t know what to think about it.

Regards.
Sancho

Administrator
June 2, 2009 at 15:34

Oops! You’re right, I missed this!

We’ve rewritten the alert system in 3.0, so I don’t expect this kind of “erratic” problem to happen there. I can only say this is usual and we have to check whether any 2.1 alerts show this behaviour, which has not been detected until now.
getnetworks

Member
June 8, 2009 at 23:27

[cite]Posted By: nil[/cite]
We’ve rewritten the alert system in 3.0, so I don’t expect this kind of “erratic” problem to happen there. I can only say this is usual and we have to check whether any 2.1 alerts show this behaviour, which has not been detected until now.

It would be great to learn more about how you’ve rewritten the alerts functionality, as that is the one feature that has kept us from using Pandora. We used a commercial package, Sysorb, for many years and then switched to Nagios early last year. Both supported a very similar alerting mechanism: the system sets an alert trigger if the criteria are met in ‘x’ consecutive checks, notifies (or performs a selected action for the trigger) every ‘y’ time period while in that state, and retriggers each time the situation occurs. For example, if the server load is >3 on three consecutive checks, each occurring 4 minutes apart, set the alert trigger. Notification is set for every 15 minutes, so since three consecutive alerts would take 15 min., we receive notification of the situation (our desired alert function). If any check was 3 again. If over the next 15 minutes the check is still in a triggered state, we get alerted again (so alert notifications can repeat every 15 minutes [sometimes longer if resets occur, but never shorter]). The alerts stop either because the trigger situation ends or because we manually perform an action via the console to stop them (such as setting downtime).

We have never been able to get this to work in Pandora; the max/min has not worked properly for this scenario and the timing is always off (it seems to alert at 2x the desired interval whenever we did get things to trigger in some fashion [i.e., once every 30 minutes instead of 15 minutes, even though it was set for 15 minutes]).
Sancho

Administrator
June 9, 2009 at 17:47

I will have some people look into that. It should work fine on the 2.1 and 3.0-dev versions!
getnetworks

Member
June 9, 2009 at 22:44

[cite]Posted By: nil[/cite]
I will have some people look into that. It should work fine on the 2.1 and 3.0-dev versions!

Thank you; maybe we’re doing something wrong, but given our familiarity with management systems and a complete review of the documentation, we’re pretty sure we’re configuring things as well as possible for the desired effect. We’ll be more than happy to share a scenario that can be replicated very easily using simple generic_data checks. I hope what I outlined was clear (sometimes a flowchart on a whiteboard is more helpful than straight-line text when trying to describe things such as this).
manu

Member
June 10, 2009 at 09:35

So, you guys are using 2.0 and 2.1 right?
This should work exactly the same in both systems.

I want to test this and do exactly what you guys are doing. So please correct me if any part of the scenario below is incorrect:

Let’s say I have my latency module checking every 10 seconds

Min Number of alerts:5
Max: 6

Threshold: 15 minutes

I should get an alert (JUST ONE) if the module shows the wrong latency 6 times in a row within 15 minutes, alright?
Am I right?
getnetworks

Member
June 10, 2009 at 21:39

[cite]Posted By: manu[/cite]
So, you guys are using 2.0 and 2.1 right?
This should work exactly the same in both systems.

I want to test this and do exactly what you guys are doing. So please correct me if any part of the scenario below is incorrect:

Let’s say I have my latency module checking every 10 seconds

Min Number of alerts:5
Max: 6

Threshold: 15 minutes

I should get an alert (JUST ONE) if the module shows the wrong latency 6 times in a row within 15 minutes, alright?
Am I right?

Using that example, and assuming the latency value is being exceeded, we would end up getting 6 alert notifications (emails, as we have it set), one every 10 seconds (so 6 in one minute). There would then be a 14-20 minute delay (not accurate, as we can’t tell what happens at this point; see our new example below) before Pandora starts counting events over again and sends the same flood of 6 emails in one minute (assuming the threshold is still being exceeded). This honestly fails in two ways for a reasonable management tool: 1) we should only be getting one email, and that should be 40 seconds after the first event fired in that sequence (the first two events occur in the first 10 seconds, with the 5th event actually firing just 40 seconds after the first, meeting our min requirement of 5 alerts), and 2) if the event hasn’t cleared, there’s no reason to start the count over again before alerting (that just causes excessive delays in re-alerting/notifying; this is not very apparent in your example because you are using such a small timeframe for the check [10 seconds], but it is much more noticeable on longer checks).

What we expect (as would be typical for any management alerting software) is for Pandora to do nothing except keep track of the number of consecutive times the event fired, so that if, in your example, the event fired on 5 consecutive checks, a single alert/notification is triggered. Since the time threshold is 15 minutes, we should then simply receive another single alert/notification 15 minutes later (assuming the event has continued firing), and so on until the event stops firing. The min/max number of alerts method is not very usable as it is currently implemented.

An example that makes things clearer would be the following:

Latency Module checking every 3 minutes (180 seconds), firing on a value of 5 (set as max limit)
Min number of alerts: 3
Max number of alerts: 4
Time Threshold: 15 minutes
Alert: Email

In this example, assuming latency is >5 indefinitely, we receive an email the instant Pandora detects a value >5, we then receive 3 more emails at approx. 3 minute intervals (max value of 4 has then been reached). So we have now received 4 emails in 9 minutes. We then don’t receive another email until *26* minutes later; why that is the case we’re not sure, as under Pandora’s current design, we would expect the next 6 minutes to be “dead time” to satisfy the original 15 minute requirement (9+6), and then, since Pandora starts over again with counting events, we should start getting the next round of emails immediately after that 6 minute “dead time”, but instead 20 extra minutes go by before we get the next wave of emails (the same 4 in 9 minutes). The 26 minutes of silence repeats, and so on. Obviously something is wrong just in the way Pandora should be working based on its current design.
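
To make the arithmetic in this example easier to follow, here is a minimal Python sketch of the timeline. It only restates the numbers given above (3-minute checks, max of 4 alerts, 15-minute threshold, 26 minutes of observed silence); the variable and function names are made up for illustration and do not come from Pandora.

```python
# Timeline arithmetic for the example above. All names are hypothetical;
# the numbers are the ones reported in the post (times in minutes).
CHECK_INTERVAL = 3     # module check interval
MAX_ALERTS = 4         # "Max number of alerts"
THRESHOLD = 15         # "Time Threshold"
OBSERVED_SILENCE = 26  # silence observed after the last email of a burst


def burst(start):
    """Email times for one burst: one email per check, MAX_ALERTS in total."""
    return [start + i * CHECK_INTERVAL for i in range(MAX_ALERTS)]


first_burst = burst(0)                              # emails at 0, 3, 6, 9
burst_length = first_burst[-1] - first_burst[0]     # 9 minutes
dead_time = THRESHOLD - burst_length                # 6 minutes of expected silence
expected_next = first_burst[0] + THRESHOLD          # next burst expected at minute 15
observed_next = first_burst[-1] + OBSERVED_SILENCE  # next burst observed at minute 35
extra_delay = observed_next - expected_next         # unexplained extra minutes

print(first_burst, dead_time, expected_next, observed_next, extra_delay)
```

The gap between `expected_next` and `observed_next` is the "20 extra minutes" mentioned above.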

Under this example, what would be ideal is to receive the first email (and only 1, not 4) at the 6 minute mark (the first event fires, the second event fires 3 minutes later and the 3rd event 3 minutes after that, which satisfies the “min number of alerts” value of 3). A “triggered flag” is then set, and we get another single email 15 minutes later. As long as that “triggered flag” is set, we continue to get the email every 15 minutes, since that is what we have as the time threshold. There’s no need to start new counters, as our requirement has already been met once and has yet to clear. Should the latency drop under 5, per this example, at any time thereafter, the “triggered flag” is cleared and it would then take 3 new events to start the process again (it is now necessary to restart the counting process). There should also be a way to suppress/deactivate alerting via “set downtime” or something similar.
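
The behaviour described above can be sketched as a small state machine. This is only an illustration of the "triggered flag" idea under the assumptions stated in this post, not Pandora code: the class `AlertState`, its fields and the `simulate` helper are all hypothetical names, and downtime/suppression is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Hypothetical alert state; times are in minutes."""
    min_events: int      # consecutive bad checks needed before the first email
    repeat_every: float  # time threshold between emails while triggered
    consecutive: int = 0
    triggered: bool = False
    last_notified: Optional[float] = None

    def check(self, now: float, is_bad: bool) -> bool:
        """Process one check result; return True if an email should be sent."""
        if not is_bad:
            # Condition cleared: drop the flag and restart counting.
            self.consecutive = 0
            self.triggered = False
            self.last_notified = None
            return False
        self.consecutive += 1
        if not self.triggered:
            if self.consecutive < self.min_events:
                return False  # still filtering possible false positives
            self.triggered = True
            self.last_notified = now
            return True       # first (single) email
        if now - self.last_notified >= self.repeat_every:
            self.last_notified = now
            return True       # repeat email while the condition persists
        return False


def simulate(values, limit, interval, state):
    """Feed check values taken every `interval` minutes; return email times."""
    sent = []
    for i, value in enumerate(values):
        now = i * interval
        if state.check(now, value > limit):
            sent.append(now)
    return sent


# Latency stays above 5 for 13 checks, 3 minutes apart (minutes 0 to 36).
always_bad = simulate([6] * 13, limit=5, interval=3,
                      state=AlertState(min_events=3, repeat_every=15))
# Latency drops to 4 on the third check, which resets the counter.
with_reset = simulate([6, 6, 4, 6, 6, 6], limit=5, interval=3,
                      state=AlertState(min_events=3, repeat_every=15))
print(always_bad, with_reset)
```

With latency permanently above the limit, the sketch sends one email at minute 6 and then one every 15 minutes; a single good value clears the flag, and three new bad checks are needed before the next email.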

On a related note, we also notice that the “last fired” time values seem to jump around erratically throughout the system, including in the alerting module. An alert can trigger, and on the screen refresh that showed it trigger, it can show “15 seconds” for the last trigger value. One minute later we can refresh the alert screen and the “last fired” value has suddenly jumped to around “3:30 minutes”; a refresh 30 seconds later might then make it show “2:27 minutes”. Something is definitely amiss. This may be a contributing factor to the “extra” 20 minutes that occur between re-alerts in the current implementation.
getnetworks

Member
June 10, 2009 at 21:39

and yes, we are using v2.1 – Build PC090224.
daniels

Member
June 15, 2009 at 12:59

Sorry about the late answer, I was on holiday.

In my scenario, I get one message as soon as the module shows the wrong latency once, followed by a delay of 15 minutes before the next message.

The problem I see is that the minimum number is not working to filter out false positives.

I have other examples here in my environment. If you like, I could post more of them here (like the first one in this thread).

I’m using Pandora FMS v2.0 – Build PC081027

Regards.
