High Availability (HA)
Introduction
Pandora FMS is a very stable application, thanks to the tests and improvements included in each version and to the fixing of failures discovered by users. Even so, in critical and/or high-load environments it may be necessary to distribute the load among several machines, so that if any Pandora FMS component fails, the system does not go down.
Pandora FMS has been designed to be very modular: any of its modules can work independently. It has also been designed to work together with other components and to take over the load of components that have gone down.
The Pandora FMS standard design could be this one:
Obviously, agents are not redundant. If an agent is down, it makes no sense to run another one: an agent only goes down because data cannot be obtained, either because the execution of some module is failing (which another agent running in parallel would not solve) or because the system is isolated or has failed. The best solution is to make the critical systems themselves redundant, regardless of whether they have Pandora FMS agents, and thereby make the monitoring of these systems redundant.
It is possible to use HA in several scenarios:
- Data Server Balancing and HA.
- Network, WMI, Plugin, Web and Prediction Servers Balancing and HA.
- Database Load Balancing and HA.
- Recon Servers Balancing and HA.
- Pandora FMS Console Balancing and HA.
HA of Data Server
The easiest way is to use the HA implemented in the agents, which allows them to contact an alternative server if the main one does not reply. However, since the data server listens on port 41121, a standard TCP port, it is also possible to use any commercial solution that balances or clusters an ordinary TCP service.
For the Pandora FMS data server, you will need to set up two machines, each with a configured Pandora FMS data server (with a different hostname and server name) and a Tentacle server. Each machine will have a different IP address. If you use an external balancer, it will provide a single IP address to which the agents connect to send their data.
If you are using an external balancer and one of the servers fails, the HA mechanism switches to one of the available active servers. Pandora FMS agents keep connecting to the same address as before without noticing the change, while the load balancer stops sending data to the failed server and sends it to another active server instead.
There is no need to change anything in any Pandora FMS data server, and each server can even keep its own name, which is useful for finding out in the server status view whether any of them has failed. Pandora FMS data modules can be processed by any server without prior assignment. It is designed that way precisely so that HA can be implemented more easily.
When using the agent HA mechanism, there will be a small delay when sending data, since at each agent execution it will try to connect to the primary server and, if it does not answer, to the secondary one (if one has been configured). This is described below in “Balancing in the Software Agents”.
If you wish to use two data servers and for both to manage policies, collections, and remote configurations, you will need to share the following directories so that all data server instances can read and write over these directories. Consoles must have access to these shared directories as well.
/var/spool/pandora/data_in/conf
/var/spool/pandora/data_in/collections
/var/spool/pandora/data_in/md5
/var/spool/pandora/data_in/netflow
/var/www/html/pandora_console/attachment
Balancing in the Software Agents
From the software agents it is possible to balance data servers by configuring a master and a backup data server. In the agent configuration file pandora_agent.conf, uncomment and configure the following section:
# Secondary server configuration
# ==============================
# If secondary_mode is set to on_error, data files are copied to the secondary
# server only if the primary server fails. If set to always, data files are
# always copied to the secondary server
secondary_mode on_error
secondary_server_ip localhost
secondary_server_path /var/spool/pandora/data_in
secondary_server_port 41121
secondary_transfer_mode tentacle
secondary_server_pwd mypassword
secondary_server_ssl no
secondary_server_opts
The following options are available (for more information, see the Agent configuration chapter):
- secondary_mode: Secondary server mode. It may have two values:
  - on_error: Send data to the secondary server only if it cannot be sent to the main server.
  - always: Always send data to the secondary server, regardless of whether the main server can be reached.
- secondary_server_ip: Secondary server IP address.
- secondary_server_path: Path where the XML files are copied on the secondary server, usually /var/spool/pandora/data_in
- secondary_server_port: Port through which the XML files are copied to the secondary server: 41121 for Tentacle, 22 for SSH and 21 for FTP.
- secondary_transfer_mode: Transfer mode used to copy the XML files to the secondary server (Tentacle, SSH, FTP, etc.).
- secondary_server_pwd: Password option for FTP transfer.
- secondary_server_ssl: Set to yes or no depending on whether you want to use SSL to transfer data through Tentacle.
- secondary_server_opts: Other options needed for the transfer.
Remote configuration of the agent, if enabled, only operates against the main server.
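The difference between the two secondary_mode values can be illustrated with a minimal shell sketch. It is not the agent's actual implementation: deliver, send_primary and send_secondary are hypothetical names, and the two stub senders stand in for whatever transfer (Tentacle, SSH, FTP) the agent is configured to use.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; these function names are hypothetical and are
# not part of Pandora FMS.

# Stub senders: replace the bodies with a real transfer command.
# For demonstration they succeed unless PRIMARY_RC / SECONDARY_RC is non-zero.
send_primary()   { return "${PRIMARY_RC:-0}"; }
send_secondary() { return "${SECONDARY_RC:-0}"; }

# deliver MODE   (MODE is on_error or always, as in secondary_mode)
# Prints which servers received the data and fails if none did.
deliver() {
  local mode=$1 delivered=""
  if send_primary; then
    delivered="primary"
  fi
  # on_error: use the secondary only when the primary failed.
  # always:   use the secondary regardless of the primary result.
  if [ "$mode" = "always" ] || [ -z "$delivered" ]; then
    if send_secondary; then
      delivered="${delivered:+$delivered,}secondary"
    fi
  fi
  echo "${delivered:-none}"
  [ -n "$delivered" ]
}
```

With both stubs succeeding, deliver on_error reports only the primary server, while deliver always reports both. Note that in on_error mode the secondary is contacted only after the attempt against the primary has failed, which is the source of the small delay mentioned earlier.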
HA of network, WMI, plugin, web and prediction servers, among others
You must install several servers (network, WMI, plugin, web or prediction) on several machines of the network, all with the same visibility of the systems that you want to monitor. All these machines should be in the same segment, so that network latency data are coherent.
The servers can be selected as primaries. These servers automatically take over the data collection of all modules assigned to a server that is marked as “down”. Pandora FMS servers implement a mechanism to detect that one of them is down by checking its last contact date (server threshold x 2). It is enough for a single Pandora FMS server to be active for it to detect that the others have gone down. If all Pandora FMS servers are down, there is no way to detect it or to implement HA.
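The down-detection rule (last contact older than server threshold x 2) can be sketched as a small shell check. This is only an illustration under stated assumptions: is_server_down is a hypothetical helper, times are taken as Unix timestamps in seconds, and the strict "greater than" comparison is an assumption rather than the server's confirmed behavior.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; is_server_down is a hypothetical helper.

# is_server_down LAST_CONTACT NOW SERVER_THRESHOLD
# LAST_CONTACT and NOW are Unix timestamps; SERVER_THRESHOLD is in seconds.
# Succeeds when the server has been silent for more than twice its threshold.
is_server_down() {
  local last_contact=$1 now=$2 threshold=$3
  [ $(( now - last_contact )) -gt $(( threshold * 2 )) ]
}
```

For example, with a threshold of 300 seconds, a server last seen at second 1000 is considered down at second 1700 (700 seconds of silence exceed the 600-second limit) but not at second 1500.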
The obvious way to implement HA and load balancing in a two-node system is to assign 50% of the modules to each server and select both servers as masters. If there are more than two master servers and a third server is down with modules yet to be executed, the first master server that executes a module “self-assigns” that module of the down server. When a down server recovers, the modules that had been taken over are automatically assigned back to it.
The load balancing between the different servers is done in the Agent Administration section in the Setup menu.
In the Server field, there is a drop-down list where you can choose the server that will perform the checks.
Server configuration
A Pandora FMS server can run in two different modes:
- Master mode.
- Non-master mode.
If a server fails, its modules will be executed by the master server so that no data is lost.
At any given time there can only be one master server, which is chosen from all the servers with the master configuration option in /etc/pandora/pandora_server.conf set to a value higher than 0:
master [1..7]
If the current master server fails, a new master server is chosen. If there is more than one candidate, the one with the highest master value is chosen.
Be careful about disabling servers. If a server with Network modules fails and the Network Server is disabled in the master server, those modules will not be executed.
For example, if you have three Pandora FMS Servers with master set to 1, a master server will be randomly chosen and the other two will run in non-master mode. If the master server fails, a new master will be randomly chosen.
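The election rule described above (only running servers with master greater than 0 are candidates, and the highest value wins) can be sketched as follows. This is a simplified illustration, not the server's code: elect_master and its "NAME MASTER_VALUE ALIVE" input format are assumptions, and where the text says a tie is resolved randomly, this sketch simply keeps the first candidate listed.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; elect_master and its input format are hypothetical.

# elect_master: reads lines "NAME MASTER_VALUE ALIVE" on stdin, where ALIVE is
# 1 for a running server and 0 for a failed one. Prints the name of the running
# server with the highest master value above 0, or nothing if there is none.
# Ties keep the first server listed (the real selection is described as random).
elect_master() {
  awk '
    BEGIN { best = 0 }
    $3 == 1 && $2 + 0 > best { best = $2 + 0; name = $1 }
    END { if (name != "") print name }
  '
}
```

Feeding it three servers where the one with master 7 has failed returns the running server with the next highest value, which mirrors the failover behavior described in the text.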
The following parameters have been added to pandora_server.conf:
- ha_file: HA temporary binary file address.
- ha_pid_file: HA current process.
- pandora_service_cmd: Pandora FMS service status control.
Pandora FMS Console HA
Install another console. Any of them can be used simultaneously from different locations by different users. You may use a web balancer encompassing all consoles in case you need horizontal growth to manage console load. The session system is managed by cookies and they stay stored in the browser.
If you use remote configuration and want to manage it from all consoles, both data servers and consoles must share the data input directory (/var/spool/pandora/data_in) for the remote configuration of agents, collections and directories.
You can learn how to share the key folders with NFS or GlusterFS using this guide.
It is important to share only the data_in subdirectories and not the data_in folder itself, since sharing the whole folder would negatively affect server performance.
Update
When updating Pandora FMS console in an HA environment, it is important to bear in mind the following points when updating by means of OUM through Update Manager > Update Manager offline.
Enterprise version users can download the OUM package from Pandora FMS support website.
In a balanced environment with a shared database, updating the first console applies the corresponding changes to the database. This means that when updating the secondary console, Pandora FMS shows an error message because it finds the information already entered in the database. However, the console is still updated.
High Availability Database
The main goal of this section is to offer a complete solution for HA in Pandora FMS environments. This is the only HA model with official support for Pandora FMS, and it is provided from version 770 onwards. This system replaces the cluster configuration with Corosync and Pacemaker from previous versions.
The new Pandora FMS HA solution is integrated into the product (inside the pandora_ha binary). It implements an HA that supports geographically isolated sites, with different IP ranges, which is not possible with Corosync/Pacemaker.
In the new HA model the usual setup is a pair of nodes, so the design does not implement a quorum system, which simplifies the configuration and reduces the necessary resources. The monitoring system therefore works as long as one DB node is available and, in case of a DB Split-Brain, the system works in parallel until both nodes are merged again.
The new proposal seeks to solve the current three problems:
- Complexity and maintainability of the current system (up to version NG 770).
- Possibility of having an HA environment spread over different geographical locations with non-local network segmentation.
- Data recovery procedure in case of Split-Brain and secured system operation in case of communication breakdown between the two geographically separated sites.
The new HA system for DB is implemented on Percona 8, although future versions will detail how to do so on MySQL/MariaDB 8 as well.
Pandora FMS relies on a MySQL database to configure and store data, so a failure in the database can temporarily paralyze the monitoring tool. The Pandora FMS high availability database cluster lets you easily deploy a strong and fail-safe architecture.
This is an advanced feature that requires knowledge in GNU/Linux systems. It is important that all servers have the time synchronized with an NTP server (chronyd service in Rocky Linux 8).
Binary replication MySQL cluster nodes are managed with the pandora_ha binary, starting with version 770 (Enterprise feature). Percona was chosen as the default RDBMS for its scalability, availability, security and backup features.
Active/passive replication takes place from a single master node (with writing permissions) to any number of secondaries (read-only). If the master node fails, one of the secondaries is promoted to master and pandora_ha takes care of updating the IP address of the master node.
The environment will consist of the following elements:
- MySQL 8 servers with binary replication enabled (Active - Passive).
- A server running pandora_ha, configured with all the MySQL servers, to monitor them continuously and perform the slave-to-master and master-to-slave promotions needed for the cluster to operate correctly.
Installation of Percona 8
Version 770 or later.
Percona 8 Installation for RHEL 8 and Rocky Linux 8
First of all, it is necessary to have the Percona repository installed in all the nodes of the environment in order to be able to install the Percona server packages later on.
You must open a terminal window with root rights or as the root user; you are solely responsible for this key. The instructions below state whether each command must be run on all devices, on some devices or on one device in particular, so please pay attention to these statements.
Execute on all devices involved:
yum install -y https://repo.percona.com/yum/percona-release-latest.noarch.rpm
Activate version 8 of the Percona repository on all devices:
percona-release setup ps80
Install the Percona server together with the backup tool that will be used to perform the backups for manual synchronization of both environments. Run on all devices involved:
yum install percona-server-server percona-xtrabackup-80
If you install the Percona server together with the Web Console and the PFMS server, you can use the deployment script, indicating MySQL version 8 by means of the MYVER=80 parameter:
curl -Ls https://pfms.me/deploy-pandora-el8 | env MYVER=80 bash
Installing Percona 8 on Ubuntu Server
Install Percona repository version 8 on all devices:
curl -O https://repo.percona.com/apt/percona-release_latest.generic_all.deb
apt install -y gnupg2 lsb-release ./percona-release_latest.generic_all.deb
Activate Percona repository version 8 on all devices:
percona-release setup ps80
Install the Percona server together with the backup tool that will be used to perform the backups for manual synchronization of both environments. On all devices run:
apt install -y percona-server-server percona-xtrabackup-80
Binary replication configuration
Version 770 or later.
Once you have installed MySQL server in all the cluster nodes, proceed to configure both environments to have them replicated.
First of all, edit the my.cnf configuration file to prepare it for binary replication to work correctly.
Node 1
Node 1 /etc/my.cnf (/etc/mysql/my.cnf for Ubuntu server):
[mysqld]
server_id=1 # It is important that it is different in all nodes.
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
# OPTIMIZATION FOR PANDORA FMS
innodb_buffer_pool_size = 4096
innodb_lock_wait_timeout = 90
innodb_file_per_table
innodb_flush_method = O_DIRECT
innodb_log_file_size = 64M
innodb_log_buffer_size = 16M
thread_cache_size = 8
max_connections = 200
key_buffer_size=4M
read_buffer_size=128K
read_rnd_buffer_size=128K
sort_buffer_size=128K
join_buffer_size=4M
sql_mode=""
# SPECIFIC PARAMETERS FOR BINARY REPLICATION
binlog-do-db=pandora
replicate-do-db=pandora
max_binlog_size = 100M
binlog-format=MIXED
binlog_expire_logs_seconds=172800 # 2 DAYS
sync_source_info=1
sync_binlog=1
port=3306
report-port=3306
report-host=master
gtid-mode=off
enforce-gtid-consistency=off
master-info-repository=TABLE
relay-log-info-repository=TABLE
sync_relay_log = 0
replica_compressed_protocol = 1
replica_parallel_workers = 1
innodb_flush_log_at_trx_commit = 2
innodb_flush_log_at_timeout = 1800
[client]
user=root
password=pandora
- The tokens after the OPTIMIZATION FOR PANDORA FMS comment apply the configuration optimized for Pandora FMS.
- After the SPECIFIC PARAMETERS FOR BINARY REPLICATION comment, the parameters specific to binary replication are configured.
- The binlog_expire_logs_seconds token is set to a period of two days.
- In the [client] subsection, enter the user and password used for the database; by default, when installing PFMS, they are root and pandora respectively. These values are needed to perform backups without specifying a user (automated).
It is important that the server_id token is different in all nodes; in this example, node 1 uses the value 1.
Node 2
Node 2 /etc/my.cnf (/etc/mysql/my.cnf for Ubuntu server):
[mysqld]
server_id=2 # It is important that it is different in all nodes.
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
# OPTIMIZATION FOR PANDORA FMS
innodb_buffer_pool_size = 4096
innodb_lock_wait_timeout = 90
innodb_file_per_table
innodb_flush_method = O_DIRECT
innodb_log_file_size = 64M
innodb_log_buffer_size = 16M
thread_cache_size = 8
max_connections = 200
key_buffer_size=4M
read_buffer_size=128K
read_rnd_buffer_size=128K
sort_buffer_size=128K
join_buffer_size=4M
sql_mode=""
# SPECIFIC PARAMETERS FOR BINARY REPLICATION
binlog-do-db=pandora
replicate-do-db=pandora
max_binlog_size = 100M
binlog-format=MIXED
binlog_expire_logs_seconds=172800 # 2 DAYS
sync_source_info=1
sync_binlog=1
port=3306
report-port=3306
report-host=master
gtid-mode=off
enforce-gtid-consistency=off
master-info-repository=TABLE
relay-log-info-repository=TABLE
sync_relay_log = 0
replica_compressed_protocol = 1
replica_parallel_workers = 1
innodb_flush_log_at_trx_commit = 2
innodb_flush_log_at_timeout = 1800
[client]
user=root
password=pandora
- The tokens after the OPTIMIZATION FOR PANDORA FMS comment apply the configuration optimized for Pandora FMS.
- After the SPECIFIC PARAMETERS FOR BINARY REPLICATION comment, the parameters specific to binary replication are configured.
- The binlog_expire_logs_seconds token is set to a period of two days.
- In the [client] subsection, enter the user and password used for the database; by default, when installing PFMS, they are root and pandora respectively. These values are needed to perform backups without specifying a user (automated).
It is important that the server_id token is different in all nodes; in this example, node 2 uses the value 2.
Master node configuration
Once you have the correct configuration on both nodes, start the configuration of the node that will take the role of master server.
1.- Start the mysqld service:
systemctl start mysqld
2.- Log in with the temporary root password that was generated in the log file /var/log/mysqld.log:
grep "temporary password" /var/log/mysqld.log
With the password that appears, access MySQL server:
mysql -u root -p
Password: → Enter the password observed with the grep command.
3.- Change the temporary password of the root user to pandora. Remember that the mysql > prompt corresponds to the MySQL command interpreter (MySQL CLI):
mysql > ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'Pandor4!';
mysql > UNINSTALL COMPONENT 'file://component_validate_password';
mysql > ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'pandora';
4.- Create the binary replication user and root user for remote connections and cluster administration:
mysql > CREATE USER slaveuser@'%' IDENTIFIED WITH mysql_native_password BY 'pandora';
mysql > GRANT REPLICATION CLIENT, REPLICATION SLAVE on *.* to slaveuser@'%';
mysql > CREATE USER root@'%' IDENTIFIED WITH mysql_native_password BY 'pandora';
mysql > GRANT ALL PRIVILEGES ON *.* to root@'%';
5.- Create Pandora FMS database:
mysql > create database pandora;
mysql > use pandora;
mysql > source /var/www/html/pandora_console/pandoradb.sql
mysql > source /var/www/html/pandora_console/pandoradb_data.sql
The source commands above work as long as Pandora FMS console is installed on the same server; otherwise, copy these files to the master server first.
6.- Create the pandora user and give the access privileges to this user:
mysql > CREATE USER pandora@'%' IDENTIFIED WITH mysql_native_password BY 'pandora';
mysql > grant all privileges on pandora.* to pandora@'%';
At this point you have the master server ready to start replicating Pandora FMS database.
Database cloning
The next step is to make a clone of the master database (MASTER) in the slave node (SLAVE). To do so, follow the steps below:
1.- Make a full backup (dump) of the MASTER database:
MASTER # xtrabackup --backup --target-dir=/root/pandoradb.bak/
MASTER # xtrabackup --prepare --target-dir=/root/pandoradb.bak/
2.- Get the position of the backup binary log:
MASTER # cat /root/pandoradb.bak/xtrabackup_binlog_info
It will return something like the following:
binlog.000003 157
Take note of these two values as they are needed for point number 6.
3.- Use rsync to send the backup to the SLAVE server:
MASTER # rsync -avpP -e ssh /root/pandoradb.bak/ node2:/var/lib/mysql/
4.- On the SLAVE server, configure the permissions so that the MySQL server can access the files sent without any problem:
SLAVE # chown -R mysql:mysql /var/lib/mysql
SLAVE # chcon -R system_u:object_r:mysqld_db_t:s0 /var/lib/mysql
5.- Start the mysqld service on the SLAVE server:
SLAVE # systemctl start mysqld
6.- Start the SLAVE mode on this server (use the data noted in point 2):
SLAVE # mysql -u root -ppandora
SLAVE # mysql > reset slave all;
SLAVE # mysql > CHANGE MASTER TO MASTER_HOST='node1', MASTER_USER='slaveuser', MASTER_PASSWORD='pandora', MASTER_LOG_FILE='binlog.000003', MASTER_LOG_POS=157;
SLAVE # mysql > start slave;
SLAVE # mysql > SET GLOBAL read_only=1;
Once you are done with all these steps, if you run the show slave status command inside the MySQL shell you will see that the node is set as slave. If it has been configured correctly you should see an output like the following:
*************************** 1. row ***************************
Slave_IO_State: Waiting for source to send event
Master_Host: node1
Master_User: root
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: binlog.000018
Read_Master_Log_Pos: 1135140
Relay_Log_File: relay-bin.000002
Relay_Log_Pos: 1135306
Relay_Master_Log_File: binlog.000018
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB: pandora
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 1135140
Relay_Log_Space: 1135519
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1
Master_UUID: fa99f1d6-b76a-11ed-9bc1-000c29cbc108
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
Master_public_key_path:
Get_master_public_key: 0
Network_Namespace:
1 row in set, 1 warning (0,00 sec)
At this point you can be sure that you have binary replication enabled and working correctly.
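The two fields that matter most in that output are Slave_IO_Running and Slave_SQL_Running, which should both be Yes. The following is a minimal sketch of a check that could be scripted around it; replica_healthy is a hypothetical helper, and it assumes the vertical "Field: value" layout shown above is supplied on standard input.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; replica_healthy is a hypothetical helper.

# replica_healthy: reads the vertical output of "show slave status" on stdin
# and succeeds only when both replication threads report Yes.
replica_healthy() {
  awk -F': *' '
    { gsub(/^[ \t]+/, "", $1) }                 # drop the leading alignment spaces
    $1 == "Slave_IO_Running"  { io  = $2 }
    $1 == "Slave_SQL_Running" { sql = $2 }
    END { exit !(io == "Yes" && sql == "Yes") }
  '
}
```

Assuming the credentials are available in the [client] section of my.cnf, the output of mysql -e 'SHOW SLAVE STATUS\G' on the SLAVE node could be piped into replica_healthy, and a non-zero exit status would indicate that replication needs attention.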
pandora_server configuration
Version 770 or later.
A series of parameters must be configured in the pandora_server.conf file for pandora_ha to operate correctly.
The parameters to be added are the following:
- ha_mode [pandora|pacemaker]:
Use the pandora value for the current pandora_ha configuration; if the previous mode is being used (version 769 and earlier), use the pacemaker value. Example:
ha_mode pandora
- ha_hosts <IP_ADDRESS1>,<IP_ADDRESS2>:
Configure the ha_hosts parameter followed by the IP addresses or FQDNs of the MySQL servers that make up the HA environment. The address listed first has preference to be the MASTER server, or at least takes the master role when the HA environment is first started. Example:
ha_hosts 192.168.80.170,192.168.80.172
- ha_dbuser and ha_dbpass:
User and password of the root user, or otherwise of a MySQL user with maximum privileges, that will perform all the master-slave promotion operations on the nodes. Example:
ha_dbuser root
ha_dbpass pandora
- repl_dbuser and repl_dbpass:
Replication user and password that the SLAVE will use to connect to the MASTER. Example:
repl_dbuser slaveuser
repl_dbpass pandora
- ha_sshuser and ha_sshport:
User and port used to connect by SSH to the Percona/MySQL servers to perform the recovery operations. For this option to work correctly, SSH keys must be shared between the user that runs the pandora_ha service and the user indicated in the ha_sshuser parameter. Example:
ha_sshuser root
ha_sshport 22
- ha_resync PATH_SCRIPT_RESYNC:
By default, the script that resynchronizes the nodes is located at:
/usr/share/pandora_server/util/pandora_ha_resync_slave.sh
If the script is installed in a custom location, indicate that location in this parameter so that the automatic or manual synchronization of the SLAVE node can be performed when needed. Example:
ha_resync /usr/share/pandora_server/util/pandora_ha_resync_slave.sh
- ha_resync_log :
Log path where all the information related to the executions performed by the synchronization script configured in the previous token will be stored. Example:
ha_resync_log /var/log/pandoraha_resync.log
- ha_connect_retries :
Number of attempts it will perform on each check with each of the servers in the HA environment before making any changes to the environment. Example:
ha_connect_retries 2
Once all these parameters are configured, you can start Pandora FMS server with the pandora_ha service. The server will take a snapshot of the environment and will know at that moment which server is the MASTER.
It will then create the pandora_ha_hosts.conf file in the /var/spool/pandora/data_in/conf/ folder, where the Percona/MySQL server that has the MASTER role is indicated at all times.
If the incomingdir parameter of the pandora_server.conf file contains a different path, this file will be located under that path.
This file will be used as an interchange with the Pandora FMS Console to know at any time the IP address of the Percona/MySQL server with MASTER role.
- restart:
Set it to 0, since the pandora_ha daemon is the one in charge of restarting the service in case of failure, thus avoiding possible conflicts. Example:
# Pandora FMS will restart after restart_delay seconds on critical errors.
restart 0
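Before starting the pandora_ha service it can be useful to confirm that the tokens described above are actually present and uncommented. The following is a minimal sketch of such a check; missing_ha_tokens is a hypothetical helper, and the list of tokens it treats as required is an assumption drawn from the parameters in this section (the remaining ones, such as ha_resync, are left out here on the assumption that their defaults may be acceptable).

```shell
#!/usr/bin/env bash
# Illustrative sketch only; missing_ha_tokens is a hypothetical helper and the
# token list below is an assumption, not an official requirement list.

# missing_ha_tokens CONF_FILE
# Prints, one per line, every listed token that has no uncommented
# "token value" line in the given pandora_server.conf.
missing_ha_tokens() {
  local conf=$1 token
  for token in ha_mode ha_hosts ha_dbuser ha_dbpass \
               repl_dbuser repl_dbpass ha_sshuser ha_sshport; do
    grep -Eq "^[[:space:]]*${token}[[:space:]]+[^[:space:]]" "$conf" \
      || echo "$token"
  done
}
```

Running missing_ha_tokens /etc/pandora/pandora_server.conf prints nothing when every listed token is set; any name it prints still needs to be added or uncommented.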
SSH key sharing between servers
An OpenSSH server must be installed and running on each host. Suppress the welcome message or banner that OpenSSH displays by running the following on all devices:
[ -f /etc/cron.hourly/motd_rebuild ] && rm -f /etc/cron.hourly/motd_rebuild
sed -i -e 's/^Banner.*//g' /etc/ssh/sshd_config
systemctl restart sshd
Share the SSH keys between pandora_ha and all the existing Percona/MySQL servers in the environment by running the following on Pandora FMS server:
printf "\n\n\n" | ssh-keygen -t rsa -P ''
ssh-copy-id -p22 root@node1
ssh-copy-id -p22 root@node2
- If the installation is on Ubuntu Server, enable the root user to connect via SSH. This is done by setting a password for the root user with the sudo passwd root command.
- Then enable SSH connections for the root user, at least through shared keys (“PermitRootLogin without-password”), in the configuration file of the sshd service.
Using the synchronization script
Pandora FMS server includes a script that allows you to synchronize the SLAVE database in case it is out of sync.
The manual execution of this script is the following:
./pandora_ha_resync_slave.sh "pandora_server.conf file" MASTER SLAVE
For example, to make a manual synchronization from node 1 to node 2 the execution would be the following:
/usr/share/pandora_server/util/pandora_ha_resync_slave.sh /etc/pandora/pandora_server.conf node1 node2
To configure the automatic recovery of the HA environment when there is a synchronization problem between MASTER and SLAVE, the splitbrain_autofix configuration token must be set to 1 in the server configuration file (/etc/pandora/pandora_server.conf).
Then, whenever a Split-Brain occurs (both servers have the master role) or there is any synchronization problem between the MASTER and SLAVE nodes, pandora_ha will try to launch the pandora_ha_resync_slave.sh script to synchronize the SLAVE server with the MASTER server status from that point on.
This process will generate events in the system indicating the start, the end and any error that took place.
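The Split-Brain condition itself (more than one node holding the master role) is simple to express. The sketch below is only an illustration: is_split_brain is a hypothetical helper, and it assumes the role of each node has already been collected by some other means into "HOST ROLE" lines; how pandora_ha actually determines the roles is not covered here.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; is_split_brain and its input format are hypothetical.

# is_split_brain: reads lines "HOST ROLE" on stdin (ROLE is master or slave)
# and succeeds when more than one host reports the master role.
is_split_brain() {
  awk '$2 == "master" { masters++ } END { exit !(masters > 1) }'
}
```

When this check succeeds and splitbrain_autofix is not enabled, the manual resynchronization command shown above is the step that would bring the SLAVE back in line with the MASTER.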
Pandora FMS Console Configuration
Version 770 or later.
A new parameter has been added to the config.php configuration, indicating the path of the exchange directory that Pandora FMS uses, by default /var/spool/pandora/data_in.
If it is configured, the console looks for the file /var/spool/pandora/data_in/conf/pandora_ha_hosts.conf, from which it gets the IP address to connect to.
$config["remote_config"] = "/var/spool/pandora/data_in";
In Pandora FMS console you can see the cluster status by accessing the Manage HA view.
https://PANDORA_IP/pandora_console/index.php?sec=gservers&sec2=enterprise/godmode/servers/HA_cluster
The data in this view is constantly updated by pandora_ha. No prior configuration is needed to see this section, as long as pandora_server.conf is correctly configured with the parameters mentioned in the previous section.
Among the available actions, you may configure a label for each node and synchronize the SLAVE node through the icon.
This icon can have the following states:
- Green: Normal, no operation to be performed.
- Blue: Pending resynchronization.
- Yellow: Resynchronization is in progress.
- Red: Error, resynchronization failed.
In Setup → Setup → Enterprise, the Legacy HA database management token must be deactivated for the HA view to be displayed with the configuration of this new mode.
Corosync-Pacemaker HA environment migration
The main difference between the HA environment used with MySQL/Percona Server version 5 and the current HA mode is that pandora_ha is now used to manage the cluster nodes instead of Corosync-Pacemaker, which will no longer be used.
The migration of the environment will consist of:
1.- Upgrade Percona from version 5.7 to version 8.0: “Installation and upgrade to MySQL 8”.
2.- Install xtrabackup-80 on all devices:
yum install percona-xtrabackup-80
If you use Ubuntu Server, see the section “Installing Percona 8 on Ubuntu Server”.
3.- Create all users again with the mysql_native_password token in the MASTER node:
mysql > CREATE USER slaveuser@'%' IDENTIFIED WITH mysql_native_password BY 'pandora';
mysql > GRANT REPLICATION CLIENT, REPLICATION SLAVE on *.* to slaveuser@'%';
mysql > CREATE USER pandora@'%' IDENTIFIED WITH mysql_native_password BY 'pandora';
mysql > grant all privileges on pandora.* to pandora@'%';
4.- Dump the database from the MASTER node to the SLAVE node:
4.1.- Make the full dump of the MASTER database:
MASTER # xtrabackup --backup --target-dir=/root/pandoradb.bak/
MASTER # xtrabackup --prepare --target-dir=/root/pandoradb.bak/
4.2.- Get the position of the backup binary log:
MASTER # cat /root/pandoradb.bak/xtrabackup_binlog_info
binlog.000003 157
Take note of these two values as they are needed in point 4.6.
4.3.- Empty the data directory on the SLAVE server and use rsync to send the backup to it:
SLAVE # rm -rf /var/lib/mysql/*
MASTER # rsync -avpP -e ssh /root/pandoradb.bak/ node2:/var/lib/mysql/
4.4.- On the SLAVE server, set the permissions so that the MySQL server can access the transferred files without any issues:
SLAVE # chown -R mysql:mysql /var/lib/mysql
SLAVE # chcon -R system_u:object_r:mysqld_db_t:s0 /var/lib/mysql
4.5.- Start the mysqld service on the SLAVE server:
SLAVE # systemctl start mysqld
4.6.- Start the SLAVE mode on this server (use the data from point 4.2):
SLAVE # mysql -u root -ppandora
SLAVE # mysql > reset slave all;
SLAVE # mysql > CHANGE MASTER TO MASTER_HOST='node1', MASTER_USER='slaveuser', MASTER_PASSWORD='pandora', MASTER_LOG_FILE='binlog.000003', MASTER_LOG_POS=157;
SLAVE # mysql > start slave;
SLAVE # mysql > SET GLOBAL read_only=1;
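The values noted in point 4.2 are the ones used in the CHANGE MASTER TO statement of point 4.6. As a minimal sketch, assuming the file holds the binary log name and position on its first line as in the example above, a small shell function could build the statement instead of copying the values by hand. The function name build_change_master and its argument order are hypothetical, not part of Pandora FMS or Percona, and the function does not escape quotes in the password.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pandora FMS or Percona tooling).
# Usage: build_change_master INFO_FILE MASTER_HOST REPL_USER REPL_PASSWORD
# Prints a CHANGE MASTER TO statement built from xtrabackup_binlog_info.
build_change_master() {
    info_file=$1
    master_host=$2
    repl_user=$3
    repl_pass=$4

    # The first line is expected to look like: "binlog.000003   157"
    log_file=$(awk 'NR == 1 { print $1 }' "$info_file") || return 1
    log_pos=$(awk 'NR == 1 { print $2 }' "$info_file") || return 1

    # Refuse to print a statement with missing coordinates.
    if [ -z "$log_file" ] || [ -z "$log_pos" ]; then
        echo "could not read binlog file/position from $info_file" >&2
        return 1
    fi

    printf "CHANGE MASTER TO MASTER_HOST='%s', MASTER_USER='%s', MASTER_PASSWORD='%s', MASTER_LOG_FILE='%s', MASTER_LOG_POS=%s;\n" \
        "$master_host" "$repl_user" "$repl_pass" "$log_file" "$log_pos"
}
```

For instance, running build_change_master /root/pandoradb.bak/xtrabackup_binlog_info node1 slaveuser pandora on the MASTER prints a statement that can be reviewed and then pasted into the MySQL shell of the SLAVE.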
If you prefer to install the environment from scratch on a new server, do not migrate: install the new environment following the current procedure and, in the step that creates the Pandora FMS database, import the data from a backup of the old environment's database.
You will also need to carry over to the new environment the Pandora FMS Console and Server configuration indicated in previous sections.
Split-Brain
Due to several factors (high latency, network outages, etc.), both MySQL servers may end up with the master role while the autoresync option is not enabled in pandora_ha. With autoresync enabled, the server itself would choose the node that works as master and synchronize the master node with the slave one, which means losing all the information collected by the other server in the meantime.
To solve this problem, the data can be merged by following this procedure.
This manual procedure only covers the retrieval of data and events in the interval between two dates. It assumes that data is recovered only for agents/modules that already exist in the node where the merge will be performed.
If new agents or new configuration information (alerts, policies, etc.) were created during the Split-Brain, they will not be taken into account. Only data and events will be retrieved, that is, the contents of the tagente_datos, tagente_datos_string and tevento tables.
The following commands will be executed in the node that was disconnected (the one that will become the SLAVE), where yyyy-mm-dd hh:mm:ss is the Split-Brain start date and time and yyyy2-mm2-dd2 hh2:mm2:ss2 is its end date and time.
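The two placeholders end up in the --where filter of the mysqldump commands in this procedure. As an illustrative sketch only, the filter could be built once by a shell function and reused for every table; the name build_where and the sample dates in the comment are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helpers: check the date format and build the --where filter.

# Succeeds only for strings shaped like "yyyy-mm-dd hh:mm:ss" (shape check only).
is_datetime() {
    case $1 in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]\ [0-9][0-9]:[0-9][0-9]:[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage: build_where "START" "END"
# Prints the condition that selects rows strictly between both instants.
build_where() {
    if ! is_datetime "$1" || ! is_datetime "$2"; then
        echo "expected dates as 'yyyy-mm-dd hh:mm:ss'" >&2
        return 1
    fi
    printf 'FROM_UNIXTIME(utimestamp) > "%s" AND FROM_UNIXTIME(utimestamp) < "%s"\n' "$1" "$2"
}

# Example with invented dates:
#   WHERE_CLAUSE=$(build_where "2023-03-01 10:00:00" "2023-03-01 12:30:00")
#   mysqldump ... --where="$WHERE_CLAUSE" > tagente_datos.dump.sql
```

Building the filter in one place keeps the interval identical across the three dumps, so data and events are cut at the same instants.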
Run the mysqldump command with appropriate user rights to get a data dump of each table:
mysqldump -u root -p -n -t --skip-create-options --databases pandora --tables tagente_datos --where='FROM_UNIXTIME(utimestamp)> "yyyy-mm-dd hh:mm:ss" AND FROM_UNIXTIME(utimestamp) <"yyyy2-mm2-dd2 hh2:mm2:ss2"'> tagente_datos.dump.sql
mysqldump -u root -p -n -t --skip-create-options --databases pandora --tables tagente_datos_string --where='FROM_UNIXTIME(utimestamp)> "yyyy-mm-dd hh:mm:ss" AND FROM_UNIXTIME(utimestamp) <"yyyy2-mm2-dd2 hh2:mm2:ss2"'> tagente_datos_string.dump.sql
mysqldump -u root -p -n -t --skip-create-options --databases pandora --tables tevento --where='FROM_UNIXTIME(utimestamp)> "yyyy-mm-dd hh:mm:ss" AND FROM_UNIXTIME(utimestamp) <"yyyy2-mm2-dd2 hh2:mm2:ss2"' | sed -e "s/([0-9]*,/(NULL,/gi"> tevento.dump.sql
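The sed filter in the tevento command replaces the leading numeric value of each row with NULL, presumably so that the events receive fresh identifiers when they are loaded into the MASTER node (this purpose is an inference, not stated in the procedure). The sketch below applies the same substitution to an invented, shortened INSERT line; real tevento rows have many more columns, and the i flag is left out because it has no effect on digits.

```shell
#!/bin/sh
# Same substitution as in the tevento dump command, wrapped for illustration.
null_first_column() {
    sed -e "s/([0-9]*,/(NULL,/g"
}

# Invented sample line, much shorter than a real tevento row.
printf '%s\n' "INSERT INTO \`tevento\` VALUES (123,5,'a'),(124,6,'b');" | null_first_column
# prints: INSERT INTO `tevento` VALUES (NULL,5,'a'),(NULL,6,'b');
```

Note that the pattern matches any opening parenthesis followed by digits and a comma, so event text containing something like "(5, high)" would also be rewritten; it is worth reviewing the dump before loading it.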
Once the dumps of these tables have been obtained, the data will be loaded in the MASTER node:
MASTER # cat tagente_datos.dump.sql | mysql -u root -p pandora
MASTER # cat tagente_datos_string.dump.sql | mysql -u root -p pandora
MASTER # cat tevento.dump.sql | mysql -u root -p pandora
After loading the data retrieved from the node that will become the SLAVE, synchronize that node using the following procedure:
1.- Dump the Master DB:
MASTER # xtrabackup --backup --target-dir=/root/pandoradb.bak/
MASTER # xtrabackup --prepare --target-dir=/root/pandoradb.bak/
2.- Get the position of the binary log of the backed up data:
MASTER # cat /root/pandoradb.bak/xtrabackup_binlog_info
You will get something like the following (take due note of these values):
binlog.000003 157
3.- Use the rsync command to send the backup to the slave server:
MASTER # rsync -avpP -e ssh /root/pandoradb.bak/ node2:/var/lib/mysql/
4.- On the slave server, set the permissions so that the MySQL server can access the transferred files without any issues:
SLAVE # chown -R mysql:mysql /var/lib/mysql
SLAVE # chcon -R system_u:object_r:mysqld_db_t:s0 /var/lib/mysql
5.- Start the mysqld service on the slave server:
SLAVE # systemctl start mysqld
6.- Start the slave mode on this server.
SLAVE # mysql -u root -ppandora
SLAVE # mysql > reset slave all;
SLAVE # mysql > CHANGE MASTER TO MASTER_HOST='node1', MASTER_USER='slaveuser', MASTER_PASSWORD='pandora', MASTER_LOG_FILE='binlog.000003', MASTER_LOG_POS=157;
SLAVE # mysql > start slave;
SLAVE # mysql > SET GLOBAL read_only=1;
Once all these steps have been completed, if you run the show slave status command inside the MySQL shell you will see that the node is in slave mode. If it was configured properly, you should see output like the following example:
*************************** 1. row ***************************
Slave_IO_State: Waiting for source to send event
Master_Host: node1
Master_User: root
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: binlog.000018
Read_Master_Log_Pos: 1135140
Relay_Log_File: relay-bin.000002
Relay_Log_Pos: 1135306
Relay_Master_Log_File: binlog.000018
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB: pandora
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 1135140
Relay_Log_Space: 1135519
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1
Master_UUID: fa99f1d6-b76a-11ed-9bc1-000c29cbc108
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
Master_public_key_path:
Get_master_public_key: 0
Network_Namespace:
1 row in set, 1 warning (0,00 sec)
At this point you can be sure that you have binary replication enabled and working properly again.
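In the output above, the two fields that show replication is running are Slave_IO_Running and Slave_SQL_Running, both with the value Yes. As a rough sketch, a shell function could check those two fields from the command output instead of reading the whole listing; the name replication_ok is hypothetical and the check deliberately ignores other fields such as Seconds_Behind_Master.

```shell
#!/bin/sh
# Hypothetical helper: reads the vertical output of "show slave status" on
# stdin and succeeds only if both replication threads report "Yes".
replication_ok() {
    awk -F': *' '
        {
            gsub(/^[ \t]+/, "", $1)      # drop the left padding of the field name
            sub(/[ \t\r]+$/, "", $2)     # drop trailing blanks in the value
        }
        $1 == "Slave_IO_Running"  { io = $2 }
        $1 == "Slave_SQL_Running" { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}

# Example (asks for the root password interactively):
#   mysql -u root -p -e 'show slave status\G' | replication_ok && echo "replication running"
```

Because the exit status carries the result, the function can be used directly in an if statement or in a monitoring script.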