Problem:
The ICMP module in Pandora FMS, a network module called Host Alive, is not working correctly. I need to rule out, as far as possible, problems related to the network or other variables. In the screenshot I am sending you, the machine appears in “Next agent contact” as Out of limits, and the “Host Alive” module says it has not been contacted since one minute ago (this is odd, because the checking interval is 10 seconds and the module is green). What is peculiar about this screenshot is that the machine is the Pandora FMS server itself, so the checks are local ones; that is, network failures, overload, noise, etc. cannot be the cause. It happens on all agents, though. The modules configured remotely to check things such as CPU state, disk usage, etc., which communicate through the Pandora FMS client via Tentacle, are working OK (although, as I have already told you, the available disk space value is useful but its units are not shown correctly). The ICMP checks, on the other hand, are not performed correctly, so half of the agents I have defined are shown as down. As a result, when I check the platform status from the “Group View” menu option, most groups are shown in grey, in unknown status.
Solution:
The problem here is basically that the ping checking interval is not well sized. If you set the interval to 5 minutes, the modules won’t go red, because the server will have enough time to run the tests. To monitor at very short intervals (<30 seconds), you have to consider the following:
- Limit the number of checks to what can be completed within a reasonable time.
- Increase the number of concurrent network server threads as much as possible.
- Set the server_threshold to the minimum (1).
- Set ICMP timeouts to the minimum (1 or 2 seconds).
- Set the
Another scenario:
Have a powerful machine and a well-optimized MySQL. To give you an idea of how it works: if I have 100 ping modules with a 10-second interval, I have to run 10 pings per second. In the worst case, each ping has to wait for the full timeout before the host is treated as not responding. With a 2-second timeout, the 10 pings launched every second add up to 20 seconds of waiting, so I need at least 20 threads dedicated exclusively to pinging, and that is under ideal conditions with no delays elsewhere.
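The arithmetic above can be sketched in a few lines of Python. This is only a rough worst-case estimate, assuming every ping waits for the full timeout; the function name and parameters are made up for illustration and are not part of Pandora FMS.

```python
import math

def required_threads(modules: int, interval_s: float, timeout_s: float) -> int:
    """Worst-case thread count (hypothetical helper, not a Pandora FMS API).

    Assumes every ping blocks one thread for the full timeout and that
    nothing else (database, scheduling) adds delay.
    """
    pings_per_second = modules / interval_s
    # Each second starts `pings_per_second` pings, each holding a thread
    # for up to `timeout_s` seconds.
    return math.ceil(pings_per_second * timeout_s)

# The example from the text: 100 ping modules, 10 s interval, 2 s timeout.
print(required_threads(100, 10, 2))  # 20
```

In practice you would probably want some margin on top of this figure, since it ignores every delay outside the ping itself.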
- Solution: increase the module interval (e.g. to 60 seconds), increase the number of threads (no more than 50), decrease the network timeout (to 2 seconds or 1 second), or add another network server on a different machine.
- Basically, from the number of modules you have and their interval, you have to calculate the maximum number of tests that can be run without running out of time.
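As a rough illustration of that calculation, the sketch below estimates how many ping modules fit in a given setup, assuming the worst case where every check waits for the full timeout. The function and its parameters are hypothetical names chosen for this example, not Pandora FMS settings.

```python
import math

def max_modules(threads: int, interval_s: float, timeout_s: float) -> int:
    """Worst-case capacity estimate (hypothetical helper).

    Assumes each check occupies one thread for the full timeout, so one
    thread can run at most interval_s / timeout_s checks per interval.
    """
    checks_per_thread = interval_s / timeout_s
    return math.floor(threads * checks_per_thread)

print(max_modules(20, 10, 2))  # 100  (20 threads, 10 s interval, 2 s timeout)
print(max_modules(50, 60, 2))  # 1500 (50 threads, 60 s interval, 2 s timeout)
print(max_modules(50, 60, 1))  # 3000 (same, with a 1 s timeout)
```

These numbers are upper bounds under ideal conditions; database load and other server work will lower the real capacity, which is why a longer interval or an additional network server may still be needed.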