Improving performance of Pandora FMS

October 22, 2014

Introduction

The main goal of this article is to highlight the bottlenecks that can affect a system as resource-demanding as Pandora FMS. In order of relevance, they are:

  • CPU
  • Memory
  • Disk access
  • DB performance
  • Configuration of the Pandora FMS Server
  • Status of the DB of Pandora FMS

Now we are going to look at the techniques used to detect problems in each of these areas. Solving each problem is beyond the scope of this article, which aims only to show how to identify the problem and give some clues on how to approach its solution.

Processor and disk access

vmstat
We will execute the “vmstat 1 10” command. Usually, the first line of output should be ignored, because it shows averages since the last boot rather than a current sample.

vmstat 1 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 10892 105036 404324 2540940 0 0 1 184 2 3 8 1 76 15 0
0 0 10892 104780 404324 2540936 0 0 0 32 557 641 5 2 92 1 0
1 0 10892 103788 404324 2540936 0 0 0 120 335 475 3 0 94 2 0
0 0 10892 103756 404324 2540936 0 0 0 36 361 489 5 0 94 1 0
1 0 10892 103384 404324 2540936 0 0 0 32 378 449 6 1 92 1 0
0 0 10892 103400 404324 2540936 0 0 0 0 465 664 1 0 99 0 0
1 0 10892 103860 404324 2540940 0 0 0 32 1439 1522 8 1 90 1 0
0 1 10892 106264 404324 2540948 0 0 0 112 9086 20506 9 1 87 2 0
0 0 10892 97052 404324 2540948 0 0 0 3704 9543 21045 13 2 77 9 0
0 0 10892 106956 404324 2540948 0 0 0 32 547 752 3 1 95 2 0

The most important columns are:

  • R: Number of threads in the run queue. These are runnable threads that are running or waiting for an available CPU.
  • B: Number of blocked processes waiting for I/O access.
  • US: CPU usage in user context (applications).
  • SY: CPU usage in system context (system calls).
  • WA: Percentage of time the processor sits unused because it is forced to wait for input/output operations.
  • CS: Context switches performed by the CPU.
  • IN: Interrupts.

The number in “R” shouldn’t exceed 1-3 threads per processor. So, a system with 2 processors should never exceed a value of 6; a higher value would mean that there are a lot of threads queued for execution and a lot of pending work.

If the number in CS is higher than the number in IN, it usually indicates a problem: the kernel has to perform a lot of context switches and spends most of its time on this operation. It is usually a system scheduler overload problem. As a side effect, WA increases.

CPU usage: The right balance of CPU usage should be 70% user, 25-30% system and 0-5% idle.
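The rules of thumb above can be checked automatically against captured vmstat output. The following is only a minimal sketch: the function names are made up for illustration, the sample rows are copied from the output shown earlier, and the thresholds are simply the ones stated in this article.

```python
SAMPLE = """\
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 10892 105036 404324 2540940 0 0 1 184 2 3 8 1 76 15 0
0 0 10892 104780 404324 2540936 0 0 0 32 557 641 5 2 92 1 0
0 1 10892 106264 404324 2540948 0 0 0 112 9086 20506 9 1 87 2 0
"""


def parse_vmstat(text):
    """Parse vmstat output into a list of dicts keyed by column name."""
    rows, header = [], None
    for line in text.strip().splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r":  # the line naming the columns
            header = fields
        elif header and fields[0].isdigit() and len(fields) == len(header):
            rows.append(dict(zip(header, map(int, fields))))
    return rows


def check_cpu(rows, cpu_count):
    """Apply the article's rules of thumb, skipping the first sample."""
    warnings = []
    for number, row in enumerate(rows[1:], start=2):
        if row["r"] > 3 * cpu_count:
            warnings.append(f"sample {number}: run queue {row['r']} exceeds {3 * cpu_count}")
        if row["cs"] > row["in"]:
            warnings.append(f"sample {number}: cs {row['cs']} is higher than in {row['in']}")
    return warnings


if __name__ == "__main__":
    for warning in check_cpu(parse_vmstat(SAMPLE), cpu_count=4):
        print(warning)
```

The cpu_count argument is the number of processors in the machine; the run queue limit is taken as 3 threads per processor, the upper end of the range given above.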

mpstat
This command can be used to see the load balance between the different system CPUs. Execute the “mpstat -P ALL 1” command. The first line should be ignored.

12:17:19 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
12:17:20 all 0,75 0,00 0,00 0,00 0,00 0,00 0,00 0,00 99,25
12:17:20 0 1,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 99,00
12:17:20 1 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00
12:17:20 2 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00
12:17:20 3 1,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 99,00

12:17:20 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
12:17:21 all 6,00 0,00 0,50 0,00 0,00 0,00 0,00 0,00 93,50
12:17:21 0 7,00 0,00 1,00 0,00 0,00 0,00 0,00 0,00 92,00
12:17:21 1 9,90 0,00 0,99 0,00 0,00 0,00 0,00 0,00 89,11
12:17:21 2 8,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 92,00
12:17:21 3 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00

12:17:21 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
12:17:22 all 7,48 0,00 0,25 4,49 0,00 0,00 0,00 0,00 87,78
12:17:22 0 7,07 0,00 1,01 15,15 0,00 0,00 0,00 0,00 76,77
12:17:22 1 5,94 0,00 0,00 0,00 0,00 0,00 0,00 0,00 94,06
12:17:22 2 12,87 0,00 0,99 2,97 0,00 0,00 0,00 0,00 83,17
12:17:22 3 4,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 96,00

12:17:22 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
12:17:23 all 14,50 0,00 1,25 0,75 0,00 0,00 0,00 0,00 83,50
12:17:23 0 23,00 0,00 2,00 3,00 0,00 0,00 0,00 0,00 72,00
12:17:23 1 15,84 0,00 1,98 0,00 0,00 0,00 0,00 0,00 82,18
12:17:23 2 2,97 0,00 0,00 0,00 0,00 0,00 0,00 0,00 97,03
12:17:23 3 16,00 0,00 1,00 0,00 0,00 0,00 0,00 0,00 83,00

Normally the load is balanced between the different processors. If that isn’t the case, the system has a multiprocessing problem.
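As a rough illustration, the balance between processors can be measured from one mpstat sample by comparing the busiest and the idlest CPU. This sketch assumes the layout shown above (24-hour timestamp, CPU number in the second column, %idle in the last column, comma decimals); the function names are hypothetical, and how large a gap counts as a problem is left to the reader.

```python
SAMPLE = """\
12:17:22 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
12:17:23 all 14,50 0,00 1,25 0,75 0,00 0,00 0,00 0,00 83,50
12:17:23 0 23,00 0,00 2,00 3,00 0,00 0,00 0,00 0,00 72,00
12:17:23 1 15,84 0,00 1,98 0,00 0,00 0,00 0,00 0,00 82,18
12:17:23 2 2,97 0,00 0,00 0,00 0,00 0,00 0,00 0,00 97,03
12:17:23 3 16,00 0,00 1,00 0,00 0,00 0,00 0,00 0,00 83,00
"""


def per_cpu_busy(text):
    """Return {cpu_number: busy_percent} for the per-CPU rows of one sample."""
    busy = {}
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip the header row and the "all" summary row.
        if len(fields) < 3 or not fields[1].isdigit():
            continue
        idle = float(fields[-1].replace(",", "."))
        busy[int(fields[1])] = round(100.0 - idle, 2)
    return busy


def imbalance(busy):
    """Gap in percentage points between the busiest and the idlest CPU."""
    return max(busy.values()) - min(busy.values())


if __name__ == "__main__":
    usage = per_cpu_busy(SAMPLE)
    print(usage, imbalance(usage))
```

A single sample says little; the gap is more meaningful when it stays large over many consecutive samples.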

It is important to analyze disks from two perspectives: manufacturer information and real write speed.

To get information about the device we need to use the smartctl command:

smartctl -a /dev/sda

This will provide us with the manufacturer and model information. With that information we can get an estimate of the IOPS of the device and its average write speed.

To measure the average write speed:

dd if=/dev/urandom of=testfileR bs=8k count=10000; sync;

Optimal values are between 50 MB/s and 100 MB/s; values between 20 and 30 MB/s are typical of relatively new devices. Under 10 MB/s the system is slow, and under 5 MB/s we don’t recommend continuing with the deployment because the performance is very poor.

Write speed does not necessarily correlate with IOPS, which relate to write efficiency rather than to write speed. There is some correlation, though: disks that write quickly tend to have high IOPS.
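The thresholds above can be turned into a small helper. This is a sketch only: the function names are invented for illustration, the label "acceptable" for the 10-50 MB/s band is an assumption (the article does not name that range), and the elapsed time in the usage line is a made-up figure, not a measurement.

```python
def throughput_mb_s(bytes_written, seconds):
    """Average write speed in MB/s (1 MB = 1,000,000 bytes)."""
    return bytes_written / 1_000_000 / seconds


def classify_write_speed(mb_per_s):
    """Map a write speed to the bands described in the article."""
    if mb_per_s < 5:
        return "very poor: deployment not recommended"
    if mb_per_s < 10:
        return "slow"
    if mb_per_s >= 50:
        return "optimal"
    return "acceptable"  # assumed label for the 10-50 MB/s band


if __name__ == "__main__":
    # bs=8k count=10000 writes 8192 * 10000 bytes; 4.0 s is a hypothetical duration.
    speed = throughput_mb_s(8192 * 10000, 4.0)
    print(f"{speed:.2f} MB/s -> {classify_write_speed(speed)}")
```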

Memory

vmstat
Use the vmstat command to get information about the system memory, in particular about swap usage:

vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 10948 904160 404324 2536700 0 0 1 184 2 3 8 1 76 15 0
0 0 10948 896568 404324 2536696 0 0 0 32 2620 3553 18 6 75 1 0
0 0 10948 898332 404324 2536700 0 0 0 36 329 461 2 0 97 1 0
1 0 10948 898332 404324 2536700 0 0 0 20 440 547 4 0 96 0 0
0 0 10948 898396 404324 2536736 0 0 0 0 270 301 4 0 96 0 0
1 0 10948 898372 404324 2536736 0 0 96 88 844 1495 6 0 93 2 0
0 0 10948 898492 404324 2536736 0 0 80 3644 499 781 6 0 84 10 0
0 0 10948 902860 404324 2536736 0 0 0 24 315 405 2 0 98 0 0
0 1 10948 902724 404324 2536736 0 0 48 52 1651 2942 16 1 81 2 0
0 0 10948 902700 404324 2536736 0 0 0 20 128 172 1 0 99 1 0

SI, SO: Swap in/out. Any value other than 0 means that the system is swapping. In stable production systems swap isn’t used at all. Swap activity also means that the system is short of memory, so we need to adjust the MySQL configuration, the Pandora FMS configuration or any other elements that might interfere.
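A quick way to spot swap activity in a longer capture is to list only the samples where si or so is not zero. The sketch below uses a hypothetical function name; the first data row is taken from the output above, while the second one is invented purely to show a hit.

```python
SAMPLE = """\
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 10948 896568 404324 2536696 0 0 0 32 2620 3553 18 6 75 1 0
0 0 10948 120000 404324 2536696 12 0 0 32 2620 3553 18 6 75 1 0
"""


def swap_activity(text):
    """Return (si, so) for every sample where the system swapped."""
    header, hits = None, []
    for line in text.strip().splitlines():
        fields = line.split()
        if "si" in fields and "so" in fields:  # the line naming the columns
            header = fields
            continue
        if header and len(fields) == len(header) and fields[0].isdigit():
            row = dict(zip(header, fields))
            si, so = int(row["si"]), int(row["so"])
            if si or so:
                hits.append((si, so))
    return hits


if __name__ == "__main__":
    print(swap_activity(SAMPLE))
```

An empty list means no swapping was seen in the capture.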

Database performance

/etc/my.cnf

Some key parameters help optimize the performance of MySQL. For more information visit the Pandora FMS documentation on MySQL optimization. Let’s start with these three:

 innodb_io_capacity = 75
 innodb_flush_log_at_trx_commit = 0
 innodb_flush_method = O_DIRECT

These three parameters are crucial and should have the values shown above. The value of innodb_io_capacity depends on the type of storage:

  • 5000 RPM disks or lower ~ innodb_io_capacity 75
  • 7200 RPM disks ~ innodb_io_capacity 100
  • 15000 RPM disks ~ innodb_io_capacity 180
  • Last generation SSD disks ~ innodb_io_capacity 240
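The table above can be expressed as a small lookup that prints the three settings for a given kind of storage. This is an illustrative sketch: the storage keys and the function name are made up, and the values are only the ones recommended in this article.

```python
# Recommended innodb_io_capacity per storage type, as listed in the article.
IO_CAPACITY = {
    "5000rpm": 75,    # 5000 RPM disks or lower
    "7200rpm": 100,
    "15000rpm": 180,
    "ssd": 240,       # last generation SSD disks
}


def innodb_settings(storage):
    """Return the three my.cnf lines for the given storage type."""
    return [
        f"innodb_io_capacity = {IO_CAPACITY[storage]}",
        "innodb_flush_log_at_trx_commit = 0",
        "innodb_flush_method = O_DIRECT",
    ]


if __name__ == "__main__":
    print("\n".join(innodb_settings("7200rpm")))
```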

pandoradb_stress

This is a diagnostic tool used to verify the data insertion capacity of a Pandora FMS installation, using the Pandora FMS library (API) mechanisms to access the data. To use it, we edit the script and locate the following line:

         $target_agent = -1;
  • We replace the -1 with the ID of our agent.

And then we execute the following command:

/usr/share/pandora_server/util/pandora_dbstress.pl /etc/pandora/pandora_server.conf

Pandora DB Stress tool 5.1dev Build 140602 Copyright (c) 2004-2014 Artica ST
This program is OpenSource, licensed under the terms of GPL License version 2.
You can download latest versions and documentation at http://www.pandorafms.org
[*] Working for agent ID 52610
[*] Generating data of 90 days ago
[*] Interval for this workload is 300
[*] Processing module Host Latency
[D] ID_AgenteModulo 341281 Interval 300 ModuleName Host Latency Days 90 Agent 198.27.73.105
-> Current rate: 0.12 modules/sec
-> Current rate: 358.95 modules/sec
-> Current rate: 387.78 modules/sec
-> Current rate: 411.94 modules/sec
-> Current rate: 426.88 modules/sec
-> Current rate: 359.93 modules/sec

For more accurate data, it is recommended to create a new module on this agent and delete the previous one. The tool will start to insert data into the modules of this agent, simulating data that could later be used for graphs and reports. By default, the tool inserts a month of data into all the modules of every agent in the installation. Modifying the agent parameter forces it to work only on the specified agent.

The average value of a Pandora FMS server should be above 300 modules/second. This tool can be used to check the system optimization.
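To compare a run against the 300 modules/second figure, the "Current rate" lines can be averaged. The sketch below is an assumption-laden illustration: the function name is invented, and skipping the first reading (0.12 in the run above, which looks like a warm-up value) is a choice made here, not something the tool documents.

```python
import re

SAMPLE = """\
-> Current rate: 0.12 modules/sec
-> Current rate: 358.95 modules/sec
-> Current rate: 387.78 modules/sec
-> Current rate: 411.94 modules/sec
-> Current rate: 426.88 modules/sec
-> Current rate: 359.93 modules/sec
"""

RATE_RE = re.compile(r"Current rate:\s*([\d.]+)\s*modules/sec")


def average_rate(output, skip=1):
    """Average the reported rates, ignoring the first `skip` readings."""
    rates = [float(value) for value in RATE_RE.findall(output)]
    steady = rates[skip:]
    return sum(steady) / len(steady) if steady else 0.0


if __name__ == "__main__":
    rate = average_rate(SAMPLE)
    print(f"{rate:.2f} modules/sec, above 300: {rate > 300}")
```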

Pandora_server configuration

/etc/pandora/pandora_server.conf

Proper configuration of the Pandora FMS server can increase its performance by up to 500%. Let’s make some easy checks to verify that it is correctly parameterized:

  • verbosity 1: Higher values are used for problem diagnosis, but a value higher than 1 will impact system performance.
  • network_timeout X: The default value is 3; it’s recommended to lower it when working in local networks. A high value (e.g. 10) can easily cause a lot of modules to appear as “unknown”, because the server has to wait 10 seconds for each failed check.
  • server_threshold X: The default value is 5. In case of overload it can be raised to 10 or 20, but it should never be lowered below 3 or 4 (values for lightly loaded servers and checks with small intervals).
  • server_keepalive 45: This parameter is used in environments with several Pandora FMS servers, to detect when a server is down. It shouldn’t be modified.
  • xxxx_checks X: Number of checks that the network server performs (ICMP, SNMP, TCP). The default value is 1; in environments with many false positives it may be necessary to increase it to 2 or 3 at most, but this can hurt the performance of the network server.
  • xxxx_timeout: Similar to network_timeout. Increasing the default values can sometimes decrease performance. Lowering them can produce false positives or gaps in monitoring.
  • xxxx_threads: The total number of threads across all these options shouldn’t exceed 30-40.
  • dataserver_threads: The value should be between 1 and 5.
  • max_queue_files 500: Its value shouldn’t be changed.

/var/log/pandora

A simple glance at this directory can help to detect problems. The logs shouldn’t have large sizes:

# ls -lah /var/log/pandora/
 total 356K
 drwxr-xr-x. 2 pandora root 4,0K jul 21 03:17 .
 drwxr-xr-x. 13 root root 4,0K jul 20 03:33 ..
 -rw-r--r--. 1 root root 983 oct 22 2013 pandora_agent.log
 -rw-rw-rw-. 1 root root 32K jul 23 19:33 pandora_server.error
 -rw-rw-rw- 1 root root 2,1K jul 21 03:17 pandora_server.error-20140721.gz
 -rw-rw-rw- 1 root root 44K jul 23 19:27 pandora_server.log
 -rw-rw-rw- 1 root root 65K jun 14 18:17 pandora_server.log.old
 -rw-rw-rw- 1 root root 176K jul 23 19:33 pandora_snmptrap.log
 -rw-rw-rw- 1 root root 10 jul 23 19:34 pandora_snmptrap.log.index

A log with a size of over 50 MB should be rotated or deleted.
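The size check can be scripted so that it runs periodically. This is a minimal sketch with a hypothetical function name; it only reports the files above the 50 MB limit mentioned above and leaves rotation or deletion to the administrator.

```python
import os

LIMIT_BYTES = 50 * 1024 * 1024  # the 50 MB limit from the article


def oversized_logs(directory, limit=LIMIT_BYTES):
    """Return (name, size) for every regular file larger than the limit."""
    found = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                size = entry.stat().st_size
                if size > limit:
                    found.append((entry.name, size))
    return sorted(found)


if __name__ == "__main__":
    log_dir = "/var/log/pandora"
    if os.path.isdir(log_dir):
        for name, size in oversized_logs(log_dir):
            print(f"{name}: {size / (1024 * 1024):.1f} MB")
```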

Pandora FMS database

To check the status of the database, we are going to run the diagnostic tool of the Pandora FMS system:

Setup->Diagnostic Info

We should look at the following values:

  • Table tagent_access: It shouldn’t exceed 250,000 records.
  • Table tagente_datos: It shouldn’t exceed 5-10 million records.
  • Table tagente_datos_string: It shouldn’t exceed 2-4 million records.
  • Table tagente_estado: It shouldn’t exceed 100,000 records.
  • Table tevento: It shouldn’t exceed 250,000 records.
  • Table tsesion: It shouldn’t have more than 50,000 records.
  • PandoraDB Last run: The date shown should be no more than 24 hours before the current date.

Values outside the specified thresholds can indicate a problem, oversizing or an imbalance in the system configuration.
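Once the record counts have been read from the Diagnostic Info page, they can be compared against these limits in a few lines. The sketch below is illustrative: the function names are invented, the counts in the usage section are made-up numbers, and where the article gives a range (5-10 million, 2-4 million) the upper end is used as the limit.

```python
from datetime import datetime, timedelta

# Upper limits taken from the list in the article.
ROW_LIMITS = {
    "tagent_access": 250_000,
    "tagente_datos": 10_000_000,
    "tagente_datos_string": 4_000_000,
    "tagente_estado": 100_000,
    "tevento": 250_000,
    "tsesion": 50_000,
}


def oversized_tables(counts):
    """Return the tables whose record count is above the limit."""
    return {
        table: rows
        for table, rows in counts.items()
        if table in ROW_LIMITS and rows > ROW_LIMITS[table]
    }


def purge_overdue(last_run, now):
    """True when the last PandoraDB run is more than 24 hours old."""
    return now - last_run > timedelta(hours=24)


if __name__ == "__main__":
    counts = {"tagent_access": 310_000, "tevento": 12_000}  # hypothetical counts
    print(oversized_tables(counts))
    print(purge_overdue(datetime(2014, 7, 21, 3, 17), datetime(2014, 7, 23, 19, 33)))
```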
