Pandora: Documentation en: HA
Contents
- 1 High Availability
- 1.1 Introduction
- 1.2 Data Server Balancing and HA
- 1.3 Balancing in the Software Agents
- 1.4 Load Balancing in the DDBB
- 1.5 Balancing and HA of the Network Servers, WMI, Plugin, Web and Prediction
- 1.6 HA of Pandora FMS Console
- 1.7 Pandora FMS HA Database Cluster
1 High Availability
1.1 Introduction
Pandora FMS is a very stable application, thanks to the testing and improvements included in each version and to the hundreds of issues reported by users and fixed. In spite of this, in critical environments and/or environments with heavy load, it may be necessary to distribute the load across several machines and to make sure that, if any component of Pandora FMS fails, the system does not go down. Pandora FMS has been designed to be very modular: each of its components can work independently, but it has also been designed to cooperate with the other components and to take over the load of those that have gone down.
The standard Pandora FMS design could look like this:
Obviously, the agents are not redundant. If an agent goes down, it makes no sense to run another one in parallel: the only reasons an agent stops reporting are that data cannot be obtained because the execution of some module is failing (which a second agent would not solve), or that the system itself is isolated or has failed. The best solution is to make the critical systems themselves redundant (regardless of whether they run Pandora FMS agents) and, by doing so, make their monitoring redundant as well.
It is possible to use HA in several scenarios:
- Data Server Balancing and HA.
- Network Servers, WMI, Plugin, Web and Prediction Balancing and HA.
- DDBB Load Balancing.
- Recon Servers Balancing and HA.
- Pandora FMS Console Balancing and HA.
1.2 Data Server Balancing and HA
The easiest way is to use the HA implemented in the agents, which allows them to contact an alternative server if the main one does not reply. However, since the data server listens on port 41121, a standard TCP port, it is also possible to use any commercial solution that can balance or cluster an ordinary TCP service.
For the Pandora FMS data server you need to set up two machines, each with a configured Pandora FMS data server (with different hostnames and server names) and a Tentacle server. Each machine will have a different IP address. If an external balancer is used, it will provide a single IP address to which the agents connect to send their data.
If an external balancer is used and one of the servers fails, the HA mechanism "promotes" one of the available active servers, and the Pandora FMS agents keep connecting to the same address as before without noticing the change; the load balancer simply stops sending data to the failed server and directs it to another active one. There is no need to change anything on the Pandora FMS data servers; each server can even keep its own name, which is useful to see in the server status view whether any of them has gone down. Pandora FMS data modules can be processed by any server without pre-assignment. The system is designed precisely this way so that HA can be implemented more easily.
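As an illustration only (Pandora FMS does not ship a balancer of its own), a minimal sketch of such TCP balancing with HAProxy, assuming two hypothetical data servers at 192.168.50.1 and 192.168.50.2; any equivalent TCP balancer would work the same way:

# Hypothetical /etc/haproxy/haproxy.cfg fragment (illustrative addresses and names)
frontend tentacle_in
    bind 0.0.0.0:41121
    mode tcp
    default_backend tentacle_servers

backend tentacle_servers
    mode tcp
    balance roundrobin
    # Example data servers; replace with your own addresses.
    server pandora_data01 192.168.50.1:41121 check
    server pandora_data02 192.168.50.2:41121 check

Agents would then point their server_ip at the balancer's address instead of at an individual data server.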
When the agents' own HA mechanism is used, there is a small delay in sending data: on each execution, the agent first tries to connect to the primary server and, if it does not answer, falls back to the secondary one (if it has been configured that way). This is described below in "Balancing in the Software Agents".
If you want to use two data servers and have both manage policies, collections and remote configuration, you need to share the following directories over NFS so that every data server instance can read and write to them. The consoles must also have access to these shared directories (an illustrative export/mount sketch follows the list below).
- /var/spool/pandora/data_in/conf
- /var/spool/pandora/data_in/collections
- /var/spool/pandora/data_in/md5
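A minimal sketch of such sharing, assuming a hypothetical NFS server named nfsserver exporting to the 192.168.50.0/24 network (adapt paths, host and options to your environment):

# On nfsserver, illustrative /etc/exports entries:
/var/spool/pandora/data_in/conf        192.168.50.0/24(rw,sync,no_root_squash)
/var/spool/pandora/data_in/collections 192.168.50.0/24(rw,sync,no_root_squash)
/var/spool/pandora/data_in/md5         192.168.50.0/24(rw,sync,no_root_squash)

# On each data server and console, illustrative mounts:
mount -t nfs nfsserver:/var/spool/pandora/data_in/conf /var/spool/pandora/data_in/conf
mount -t nfs nfsserver:/var/spool/pandora/data_in/collections /var/spool/pandora/data_in/collections
mount -t nfs nfsserver:/var/spool/pandora/data_in/md5 /var/spool/pandora/data_in/md5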
1.3 Balancing in the Software Agents
From the software agents it is possible to balance between data servers: a master data server and a backup data server can be configured.
In the agent configuration file pandora_agent.conf, configure and uncomment the following section:
# Secondary server configuration
# ==============================
# If secondary_mode is set to on_error, data files are copied to the secondary
# server only if the primary server fails. If set to always, data files are
# always copied to the secondary server
secondary_mode on_error
secondary_server_ip localhost
secondary_server_path /var/spool/pandora/data_in
secondary_server_port 41121
secondary_transfer_mode tentacle
secondary_server_pwd mypassword
secondary_server_ssl no
secondary_server_opts
The following options are available (for more information, see the Agents Configuration chapter):
- secondary_mode: Mode for the secondary server. It can have two values:
- on_error: Send data to the secondary server only if it could not be sent to the main server.
- always: Always send data to the secondary server, regardless of whether the main server can be reached.
- secondary_server_ip: IP of the secondary server.
- secondary_server_path: Path where the XML files are copied on the secondary server, usually /var/spool/pandora/data_in.
- secondary_server_port: Port through which the XML files are copied to the secondary server: 41121 for Tentacle, 22 for SSH, 21 for FTP.
- secondary_transfer_mode: Transfer mode used to copy the XML files to the secondary server: tentacle, ssh, ftp, etc.
- secondary_server_pwd: Password option for transfers over FTP.
- secondary_server_ssl: Set to yes or no depending on whether you want to use SSL to transfer data through Tentacle.
- secondary_server_opts: This field holds any other options needed for the transfer.
1.4 Load Balancing in the DDBB
The database is the most critical component of Pandora FMS, so this issue is more complex. We currently propose a software solution based on DRBD that clusters MySQL via software; Pandora FMS can also work with hardware or virtualized solutions that can cluster any standard application such as MySQL.
We are working to offer an integrated solution that implements a native HA tool in Pandora FMS, based on MySQL.
You can consult the available documentation on DRBD and MySQL.
1.5 Balancing and HA of the Network Servers, WMI, Plugin, Web and Prediction
This is easier. You need to install several servers (Network, WMI, Plugin, Web or Prediction) on several machines in the network, all with the same visibility of the systems you want to monitor. All these machines should be in the same segment so that the network latency data is consistent.
The servers can be selected as masters. These servers will automatically collect the data from all the modules assigned to a server that is marked as «down». Pandora FMS servers implement a mechanism to detect that one of them has gone down by checking its last contact date (server threshold x 2). A single active Pandora FMS server is enough to detect the failure of the others; if all Pandora FMS servers are down, there is no way to detect anything or to provide HA.
The obvious way to implement HA and load balancing in a two-node system is to assign 50% of the modules to each server and select both servers as masters. If there are more than two master servers and a third server goes down with modules pending execution, the first master server that executes the module will "self-assign" the module of the down server. When one of the down servers recovers, the modules that had been assigned to the primary server are automatically assigned back to it.
The load balancing between the different servers is done in the Agent Administration section in the "setup" menu.
In the field "server" there is a combo where you can choose the server that will do the checking.
1.5.1 Server configuration
A Pandora FMS Server can be running in two different modes:
- Master mode.
- Non-master mode.
If a server goes down, its modules will be executed by the master server so that no data is lost.
At any given time there can only be one master server, which is chosen from all the servers with the master configuration token in /etc/pandora/pandora_server.conf set to a value greater than 0:
master [1..7]
If the current master server goes down, a new master server is chosen. If there is more than one candidate, the one with the highest master value is chosen.
Be careful about disabling servers. If a server with Network modules goes down and the Network Server is disabled in the master server, those modules will not be executed.
For example, if you have three Pandora FMS Servers with master set to 1, a master server will be randomly chosen and the other two will run in non-master mode. If the master server goes down, a new master will be randomly chosen.
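For example, a sketch of the relevant line in /etc/pandora/pandora_server.conf on two servers (server names and values are illustrative):

# On serverA, the preferred master candidate:
master 7

# On serverB, a lower-priority candidate that only becomes master if serverA goes down:
master 1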
1.6 HA of Pandora FMS Console
Just install another console. Any of them can be used simultaneously from different locations by different users. Placing a web balancer in front of the consoles makes it possible to access them without really knowing which one is being accessed, since the session system is managed by cookies stored in the browser. If remote configuration is used, the data servers and consoles must share (over NFS) the incoming data directory (/var/spool/pandora/data_in) used for remote agent configuration, collections and the other directories.
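As an illustration only (any web balancer will do), a minimal HAProxy sketch for two hypothetical consoles at 192.168.50.10 and 192.168.50.11, using an inserted cookie so each browser keeps hitting the same console:

# Hypothetical /etc/haproxy/haproxy.cfg fragment for the console (illustrative names and addresses)
frontend console_in
    bind 0.0.0.0:80
    mode http
    default_backend console_servers

backend console_servers
    mode http
    balance roundrobin
    cookie PANDORA_NODE insert indirect nocache
    server console01 192.168.50.10:80 check cookie console01
    server console02 192.168.50.11:80 check cookie console02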
1.7 Pandora FMS HA Database Cluster
This solution is provided to offer a fully featured HA solution for Pandora FMS environments. It is the only officially supported HA model for Pandora FMS, and it has been provided, preinstalled, since OUM 724. This system replaces DRBD and the other HA systems we recommended in the past.
This is the first Pandora FMS DB HA implementation, and the installation process is almost entirely manual, using the Linux console as root. In future versions we will make setup and configuration easier from the GUI.
Pandora FMS relies on a MySQL database for configuration and data storage. A database failure can temporarily bring your monitoring solution to a halt.
The Pandora FMS high-availability database cluster allows you to easily deploy a robust, fault-tolerant architecture. This is an ACTIVE/PASSIVE model with real-time replication from the MASTER node to the SLAVE node (or nodes). It uses MySQL replication to provide redundancy, and a mechanism based on Corosync and Pacemaker to maintain the virtual IP address.
A special component of Pandora FMS keeps track of any problem, performs the switchover and, of course, monitors everything related to the HA.
This is a very advanced feature of Pandora FMS; it requires Linux skills and a very good knowledge of Pandora FMS internals.
1.7.1 Preamble
We will configure a two node cluster, with hosts node1 and node2. Change hostnames, passwords, etc. as needed to match your environment.
<diagram>
Commands that should be run on one node will be preceded by that node's hostname. For example:
node1# <command>
Commands that should be run on all nodes will be preceded by the word all. For example:
all# <command>
There is an additional host, which will be referred to as pandorafms, where Pandora FMS is or will be installed.
1.7.2 Prerequisites
CentOS version 7 must be installed on all hosts, and they must be able to resolve each other's hostnames.
node1# ping node2
PING node2 (192.168.0.2) 56(84) bytes of data.

node2# ping node1
PING node1 (192.168.0.1) 56(84) bytes of data.

pandorafms# ping node1
PING node1 (192.168.0.1) 56(84) bytes of data.

pandorafms# ping node2
PING node2 (192.168.0.2) 56(84) bytes of data.
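If name resolution is not provided by DNS, one option (a sketch, using the example addresses shown above) is to add the hosts to /etc/hosts on every machine:

all# cat <<EOF >> /etc/hosts
192.168.0.1   node1
192.168.0.2   node2
EOF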
An OpenSSH server must be installed and running on every host. Remove the banner displayed by OpenSSH:
all# [ -f /etc/cron.hourly/motd_rebuild ] && rm -f /etc/cron.hourly/motd_rebuild
all# sed -i -e 's/^Banner.*//g' /etc/ssh/sshd_config
all# systemctl restart sshd
Generate new SSH authentication keys for each host and copy the public key to each of the other hosts (replace the example keys with the actual ones):
node1# echo -e "\n\n\n" | ssh-keygen -t rsa
node1# cat $HOME/.ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA0xoEf2A+in/uReenQzJniYLrSxcFOjNUOpwDJi2jIGGoUrEd8c8gn8ut1p57H73SWlI5+YQAhfSF0BaM158XD5bIZC94M05ZVFs8UCjjuATrgNJ38hboF3CNrfWYhA5m1JraKsT2EAMNCSz55OI+GrCxmeM6o4DMoQu2W6WZu6YX8F7axCh5uOBLh1W06sgMcudn6x/lwzYhPUWm9OiS58n6pd9SkC2LoDYoRetZE2GW/1M9t+8UliSwdCpPpGIW3R5avEPfXii3XVIyO43YTpPR2cqHLwGAJN1DVp61lYMXnLSvJma9LG6gnzl0MB865YysB66o9w2v8C28asYSew== [email protected]

node2# echo -e "\n\n\n" | ssh-keygen -t rsa
node2# cat $HOME/.ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA3PCvWkX4kDm1fD5FWsvAN/dl9VEZV7k7cOGLmOKcDLHUE3OkS6+7/4b9J65mssZ1yc/ocQe/dvFQNJlkxk117PK+NP5PB8s7+UI5LBZHunmLAuajnLbFYwyTDIF2qHRCxsRJfU4HXHY/DIZNoL90Enrk3Al+pTSdYr6mK5QJ4LZ3DX3mN3DpeMW8duWgWP3VMY/QhDJ+pGCJ/dOW3zYMdAQwSVqzHzgUR+hhMCmgOn8ACkeEMa2rUyzlblnGMApTbK1rim82SRupiNoaPfHjSiK/GJ5l+DpBCLp26Fj+AMO2kgRkWSAmWdJh/40T7TFj4uhTJgrnPsvrvkjpp0vppw== [email protected]

pandorafms# echo -e "\n\n\n" | ssh-keygen -t rsa
pandorafms# cat $HOME/.ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAx6eRCGw86qDcjzSENlPQ1/7ukOcD0xxi9jG/Kgf1syT1ZYz4trJHSxVG05iCpVF0YRZa1YcoWltcCNCOu3rD2jwbHl98CHmKXpq+kGnSEf02NtEiCP9366/tq9V8zknBVOJE3oND0GuhAvDUo1OqxlI35gR7bO6zXTAxXAv3o736lHqzCjmsn8wA3XfZy+CHBtTpsovCqr1SG9geIcmXRYSJpb5SmE2iIuekybYM0yWfK6c0KY0zPl4v21wX1GU6O2KE7+Q8tSDxEr/3NBeLOFZpP/0A8sYlL1U4+Xcfbi45YGZ2x7oZZ1ZMtp4OwL3v2w8fILakCsfpHcpONBZmHw== [email protected]

node1# echo 'ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA3PCvWkX4kDm1fD5FWsvAN/dl9VEZV7k7cOGLmOKcDLHUE3OkS6+7/4b9J65mssZ1yc/ocQe/dvFQNJlkxk117PK+NP5PB8s7+UI5LBZHunmLAuajnLbFYwyTDIF2qHRCxsRJfU4HXHY/DIZNoL90Enrk3Al+pTSdYr6mK5QJ4LZ3DX3mN3DpeMW8duWgWP3VMY/QhDJ+pGCJ/dOW3zYMdAQwSVqzHzgUR+hhMCmgOn8ACkeEMa2rUyzlblnGMApTbK1rim82SRupiNoaPfHjSiK/GJ5l+DpBCLp26Fj+AMO2kgRkWSAmWdJh/40T7TFj4uhTJgrnPsvrvkjpp0vppw== [email protected]' >> $HOME/.ssh/authorized_keys
node1# echo 'ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAx6eRCGw86qDcjzSENlPQ1/7ukOcD0xxi9jG/Kgf1syT1ZYz4trJHSxVG05iCpVF0YRZa1YcoWltcCNCOu3rD2jwbHl98CHmKXpq+kGnSEf02NtEiCP9366/tq9V8zknBVOJE3oND0GuhAvDUo1OqxlI35gR7bO6zXTAxXAv3o736lHqzCjmsn8wA3XfZy+CHBtTpsovCqr1SG9geIcmXRYSJpb5SmE2iIuekybYM0yWfK6c0KY0zPl4v21wX1GU6O2KE7+Q8tSDxEr/3NBeLOFZpP/0A8sYlL1U4+Xcfbi45YGZ2x7oZZ1ZMtp4OwL3v2w8fILakCsfpHcpONBZmHw== [email protected]' >> $HOME/.ssh/authorized_keys
node2# echo 'ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA0xoEf2A+in/uReenQzJniYLrSxcFOjNUOpwDJi2jIGGoUrEd8c8gn8ut1p57H73SWlI5+YQAhfSF0BaM158XD5bIZC94M05ZVFs8UCjjuATrgNJ38hboF3CNrfWYhA5m1JraKsT2EAMNCSz55OI+GrCxmeM6o4DMoQu2W6WZu6YX8F7axCh5uOBLh1W06sgMcudn6x/lwzYhPUWm9OiS58n6pd9SkC2LoDYoRetZE2GW/1M9t+8UliSwdCpPpGIW3R5avEPfXii3XVIyO43YTpPR2cqHLwGAJN1DVp61lYMXnLSvJma9LG6gnzl0MB865YysB66o9w2v8C28asYSew== [email protected]' >> $HOME/.ssh/authorized_keys
node2# echo 'ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAx6eRCGw86qDcjzSENlPQ1/7ukOcD0xxi9jG/Kgf1syT1ZYz4trJHSxVG05iCpVF0YRZa1YcoWltcCNCOu3rD2jwbHl98CHmKXpq+kGnSEf02NtEiCP9366/tq9V8zknBVOJE3oND0GuhAvDUo1OqxlI35gR7bO6zXTAxXAv3o736lHqzCjmsn8wA3XfZy+CHBtTpsovCqr1SG9geIcmXRYSJpb5SmE2iIuekybYM0yWfK6c0KY0zPl4v21wX1GU6O2KE7+Q8tSDxEr/3NBeLOFZpP/0A8sYlL1U4+Xcfbi45YGZ2x7oZZ1ZMtp4OwL3v2w8fILakCsfpHcpONBZmHw== [email protected]' >> $HOME/.ssh/authorized_keys
pandorafms# echo 'ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA0xoEf2A+in/uReenQzJniYLrSxcFOjNUOpwDJi2jIGGoUrEd8c8gn8ut1p57H73SWlI5+YQAhfSF0BaM158XD5bIZC94M05ZVFs8UCjjuATrgNJ38hboF3CNrfWYhA5m1JraKsT2EAMNCSz55OI+GrCxmeM6o4DMoQu2W6WZu6YX8F7axCh5uOBLh1W06sgMcudn6x/lwzYhPUWm9OiS58n6pd9SkC2LoDYoRetZE2GW/1M9t+8UliSwdCpPpGIW3R5avEPfXii3XVIyO43YTpPR2cqHLwGAJN1DVp61lYMXnLSvJma9LG6gnzl0MB865YysB66o9w2v8C28asYSew== [email protected]' >> $HOME/.ssh/authorized_keys
pandorafms# echo 'ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA3PCvWkX4kDm1fD5FWsvAN/dl9VEZV7k7cOGLmOKcDLHUE3OkS6+7/4b9J65mssZ1yc/ocQe/dvFQNJlkxk117PK+NP5PB8s7+UI5LBZHunmLAuajnLbFYwyTDIF2qHRCxsRJfU4HXHY/DIZNoL90Enrk3Al+pTSdYr6mK5QJ4LZ3DX3mN3DpeMW8duWgWP3VMY/QhDJ+pGCJ/dOW3zYMdAQwSVqzHzgUR+hhMCmgOn8ACkeEMa2rUyzlblnGMApTbK1rim82SRupiNoaPfHjSiK/GJ5l+DpBCLp26Fj+AMO2kgRkWSAmWdJh/40T7TFj4uhTJgrnPsvrvkjpp0vppw== [email protected]' >> $HOME/.ssh/authorized_keys

node1# ssh node2
node2# ssh node1
pandorafms# ssh node1
pandorafms# ssh node2
On the Pandora FMS node, copy the key pair to /usr/share/httpd/.ssh/:
pandorafms# cp -r /root/.ssh/ /usr/share/httpd/
pandorafms# chown -R apache:apache /usr/share/httpd/.ssh/
The following steps are only necessary if the nodes are running SSH on a non-standard port. Replace 22 with the right port number:
all# echo -e "Host node1\n Port 22" >> /root/.ssh/config all# echo -e "Host node2\n Port 22" >> /root/.ssh/config
1.7.3 Installation
1.7.3.1 Installing Percona
Install the required packages:
all# yum install -y http://www.percona.com/downloads/percona-release/redhat/0.1-4/percona-release-0.1-4.noarch.rpm
all# yum install -y Percona-Server-server-57 percona-xtrabackup-24
Make sure the Percona service is disabled, since it will be managed by the cluster:
all# systemctl disable mysqld
Configure Percona, replacing <ID> with a number that must be unique for each cluster node:
all# export SERVER_ID=<ID>
all# cat <<EOF > /etc/my.cnf
[mysqld]
server_id=$SERVER_ID
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
symbolic-links=0
log-error=/var/log/mysqld.log
show_compatibility_56=on
max_allowed_packet = 64M
innodb_buffer_pool_size = 256M
innodb_lock_wait_timeout = 90
innodb_file_per_table
innodb_flush_method = O_DIRECT
innodb_log_file_size = 64M
innodb_log_buffer_size = 16M
thread_cache_size = 8
max_connections = 100
innodb_flush_log_at_trx_commit=1
key_buffer_size=4M
read_buffer_size=128K
read_rnd_buffer_size=128K
sort_buffer_size=128K
join_buffer_size=4M
log-bin=mysql-bin
query_cache_type = 1
query_cache_size = 4M
query_cache_limit = 8M
sql_mode=""
expire_logs_days=3
binlog-format=ROW
log-slave-updates=true
sync-master-info=1
sync_binlog=1
max_binlog_size = 100M
replicate-do-db=pandora
port=3306
report-port=3306
report-host=master
gtid-mode=off
enforce-gtid-consistency=off
master-info-repository=TABLE
relay-log-info-repository=TABLE

[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

[client]
user=root
password=pandora
EOF
Start the Percona server:
all# systemctl start mysqld
A new temporary password will be generated and logged to /var/log/mysqld.log. Connect to the Percona server and change the root password:
all# mysql -uroot -p$(grep "temporary password" /var/log/mysqld.log | rev | cut -d' ' -f1 | rev)
mysql> SET PASSWORD FOR 'root'@'localhost' = PASSWORD('Pandor4!');
mysql> UNINSTALL PLUGIN validate_password;
mysql> SET PASSWORD FOR 'root'@'localhost' = PASSWORD('pandora');
mysql> GRANT REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'root'@'%' IDENTIFIED BY 'pandora';
mysql> GRANT REPLICATION CLIENT, REPLICATION SLAVE, SUPER, PROCESS, RELOAD ON *.* TO 'root'@'localhost' IDENTIFIED BY 'pandora';
mysql> GRANT select ON mysql.user TO 'root'@'%' IDENTIFIED BY 'pandora';
mysql> FLUSH PRIVILEGES;
mysql> quit
1.7.3.2 Installing Pandora FMS
1.7.3.2.1 New Pandora FMS installation
Install Pandora FMS on the newly created database. For more information see:
https://wiki.pandorafms.com/index.php?title=Pandora:Documentation_en:Installing
Stop the Pandora FMS server:
newpandorafms# /etc/init.d/pandora_server stop
1.7.3.2.2 Existing Pandora FMS installation
Stop your Pandora FMS Server:
pandorafms# /etc/init.d/pandora_server stop
Backup the Pandora FMS database:
pandorafms# mysqldump -uroot -ppandora --databases pandora > /tmp/pandoradb.sql
pandorafms# scp /tmp/pandoradb.sql node1:/tmp/
Load it into the new database:
node1# mysql -uroot -ppandora < /tmp/pandoradb.sql
node1# systemctl stop mysqld
1.7.3.3 Setting up replication
Grant the required privileges on all databases:
all# mysql -uroot -ppandora
mysql> GRANT ALL ON pandora.* TO 'root'@'%' IDENTIFIED BY 'pandora';
mysql> GRANT REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'root'@'%' IDENTIFIED BY 'pandora';
mysql> GRANT REPLICATION CLIENT, REPLICATION SLAVE, SUPER, PROCESS, RELOAD ON *.* TO 'root'@'localhost' IDENTIFIED BY 'pandora';
mysql> GRANT select ON mysql.user TO 'root'@'%' IDENTIFIED BY 'pandora';
mysql> FLUSH PRIVILEGES;
mysql> quit
Backup the database of the first node and write down the master log file name and position (in the example, mysql-bin.000001 and 785):
node1# innobackupex --no-timestamp /root/pandoradb.bak/
node1# innobackupex --apply-log /root/pandoradb.bak/
node1# cat /root/pandoradb.bak/xtrabackup_binlog_info
mysql-bin.000001        785
Load the database on the second node and configure it to replicate from the first node (set MASTER_LOG_FILE and MASTER_LOG_POS to the values found in the previous step):
node2# systemctl stop mysqld
node1# rsync -avpP -e ssh /root/pandoradb.bak/ node2:/var/lib/mysql/
node2# chown -R mysql:mysql /var/lib/mysql
node2# chcon -R system_u:object_r:mysqld_db_t:s0 /var/lib/mysql
node2# systemctl start mysqld
node2# mysql -uroot -ppandora
mysql> CHANGE MASTER TO MASTER_HOST='node1', MASTER_USER='root', MASTER_PASSWORD='pandora', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=785;
mysql> START SLAVE;
mysql> SHOW SLAVE STATUS \G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: node1
                  Master_User: root
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000002
          Read_Master_Log_Pos: 785
               Relay_Log_File: node2-relay-bin.000003
                Relay_Log_Pos: 998
        Relay_Master_Log_File: mysql-bin.000002
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: pandora
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 785
              Relay_Log_Space: 1252
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: 580d8bb0-6991-11e8-9a22-16efadb2f150
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:
1 row in set (0.00 sec)
mysql> QUIT
all# systemctl stop mysqld
1.7.3.4 Configuring the two node cluster
Install the required packages:
all# yum install -y epel-release corosync ntp pacemaker pcs
all# systemctl enable ntpd
all# systemctl enable corosync
all# systemctl enable pcsd
all# systemctl start ntpd
all# systemctl start corosync
all# systemctl start pcsd
Stop the Percona server:
node1# systemctl stop mysqld
Set the password for the hacluster user, authenticate all the nodes in the cluster, and create and start the cluster:
all# echo hapass | passwd hacluster --stdin
all# cp /etc/corosync/corosync.conf.example /etc/corosync/corosync.conf
node1# pcs cluster auth -u hacluster -p hapass --force node1 node2
node1# pcs cluster setup --force --name pandoraha node1 node2
node1# pcs cluster start --all
node1# pcs cluster enable --all
node1# pcs property set stonith-enabled=false
node1# pcs property set no-quorum-policy=ignore
Check the status of the cluster:
node1# pcs status
Cluster name: pandoraha
Stack: corosync
Current DC: node1 (version 1.1.18-11.el7_5.2-2b07d5c5a9) - partition with quorum
Last updated: Fri Jun 8 12:53:49 2018
Last change: Fri Jun 8 12:53:47 2018 by root via cibadmin on node1

2 nodes configured
0 resources configured

Online: [ node1 node2 ]

No resources

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
Install the Percona pacemaker replication agent:
all# cd /usr/lib/ocf/resource.d/
all# mkdir percona
all# cd percona
all# curl -L -o mysql https://github.com/Percona-Lab/pacemaker-replication-agents/raw/1.0.0-stable/agents/mysql_prm
all# chmod u+x mysql
Configure the cluster resources. Replace VIRT_IP with the virtual IP address of your choice:
node1# pcs resource create pandoradb ocf:percona:mysql config="/etc/my.cnf" pid="/var/run/mysqld/mysqld.pid" socket="/var/lib/mysql/mysql.sock" replication_user="root" replication_passwd="pandora" max_slave_lag="60" evict_outdated_slaves="false" binary="/usr/sbin/mysqld" test_user="root" test_passwd="pandora" op start interval="0" timeout="60s" op stop interval="0" timeout="60s" op promote timeout="120" op demote timeout="120" op monitor role="Master" timeout="30" interval="5" op monitor role="Slave" timeout="30" interval="10"
node1# pcs resource create pandoraip ocf:heartbeat:IPaddr2 ip=VIRT_IP cidr_netmask=24 op monitor interval=20s
node1# pcs resource master master_pandoradb pandoradb meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" globally-unique="false" target-role="Master" is-managed="true"
node1# pcs constraint colocation add master master_pandoradb with pandoraip
node1# pcs constraint order promote master_pandoradb then start pandoraip
Check the status of the cluster:
node1# pcs status
Cluster name: pandoraha
Stack: corosync
Current DC: node1 (version 1.1.18-11.el7_5.2-2b07d5c5a9) - partition with quorum
Last updated: Fri Jun 8 13:02:21 2018
Last change: Fri Jun 8 13:02:11 2018 by root via cibadmin on node1

2 nodes configured
3 resources configured

Online: [ node1 node2 ]

Full list of resources:

 Master/Slave Set: master_pandoradb [pandoradb]
     Masters: [ node1 ]
     Slaves: [ node2 ]
 pandoraip      (ocf::heartbeat:IPaddr2):       Started node1

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
1.7.3.5 Configuring Pandora FMS
Make sure php-pecl-ssh2 is installed:
pandorafms# yum install php-pecl-ssh2
pandorafms# systemctl restart httpd
There are two parameters in /etc/pandora/pandora_server.conf that control the behavior of the Pandora FMS Database HA Tool. Adjust them to suit your needs:
# Pandora FMS Database HA Tool execution interval in seconds (PANDORA FMS ENTERPRISE ONLY).
ha_interval 30

# Pandora FMS Database HA Tool monitoring interval in seconds. Must be a multiple of ha_interval (PANDORA FMS ENTERPRISE ONLY).
ha_monitoring_interval 60
Point your Pandora FMS to the virtual IP address you chose in the previous section:
pandorafms# sed -i -e 's/^dbhost .*/dbhost <virtual IP>/' /etc/pandora/pandora_server.conf
pandorafms# sed -i -e 's/\$config\["dbhost"\]=".*";/$config["dbhost"]="<virtual IP>";/' /var/www/html/pandora_console/include/config.php
Install and start the pandora_ha service:
pandorafms# cat > /etc/systemd/system/pandora_ha.service <<-EOF
[Unit]
Description=Pandora FMS Database HA Tool

[Service]
Type=forking
PIDFile=/var/run/pandora_ha.pid
Restart=always
ExecStart=/usr/bin/pandora_ha -d -p /var/run/pandora_ha.pid /etc/pandora/pandora_server.conf

[Install]
WantedBy=multi-user.target
EOF
pandorafms# systemctl enable pandora_ha
pandorafms# systemctl start pandora_ha
Create database entries for the two nodes:
pandorafms# mysql -uroot -ppandora
mysql> INSERT INTO pandora.tdatabase (host, os_user) VALUES ('node1', 'root'), ('node2', 'root');
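Optionally, you can verify that both entries exist (a quick check, not part of the official procedure):

pandorafms# mysql -uroot -ppandora -e 'SELECT host, os_user FROM pandora.tdatabase;'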
Log in to your Pandora FMS Console and navigate to Servers -> Manage database HA to view the status of the cluster nodes.
1.7.4 Adding a new node to the cluster
Install Percona (see Installing Percona). Backup the database of the master node (node1 in this example) and write down the master log file name and position (in the example, mysql-bin.000001 and 785):
node1# innobackupex --no-timestamp /root/pandoradb.bak/
node1# innobackupex --apply-log /root/pandoradb.bak/
node1# cat /root/pandoradb.bak/xtrabackup_binlog_info
mysql-bin.000001        785
Load the database on the new node, which we will call node3, and configure it to replicate from node1 (set MASTER_LOG_FILE and MASTER_LOG_POS to the values found in the previous step):
node3# systemctl stop mysqld
node1# rsync -avpP -e ssh /root/pandoradb.bak/ node3:/var/lib/mysql/
node3# chown -R mysql:mysql /var/lib/mysql
node3# chcon -R system_u:object_r:mysqld_db_t:s0 /var/lib/mysql
node3# systemctl start mysqld
node3# mysql -uroot -ppandora
mysql> CHANGE MASTER TO MASTER_HOST='node1', MASTER_USER='root', MASTER_PASSWORD='pandora', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=785;
mysql> START SLAVE;
mysql> SHOW SLAVE STATUS \G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: node1
                  Master_User: root
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000002
          Read_Master_Log_Pos: 785
               Relay_Log_File: node3-relay-bin.000003
                Relay_Log_Pos: 998
        Relay_Master_Log_File: mysql-bin.000002
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: pandora
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 785
              Relay_Log_Space: 1252
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: 580d8bb0-6991-11e8-9a22-16efadb2f150
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:
1 row in set (0.00 sec)
mysql> QUIT
node3# systemctl stop mysqld
Add the new node to the cluster:
node3# echo -n hapass | passwd hacluster --stdin
node3# cd /usr/lib/ocf/resource.d/
node3# mkdir percona
node3# cd percona
node3# curl -L -o mysql https://github.com/Percona-Lab/pacemaker-replication-agents/raw/master/agents/mysql_prm
node3# chmod u+x mysql
node3# pcs cluster auth -u hacluster -p hapass --force node3
node3# pcs cluster node add --enable --start node3
clone-max must be equal to the number of nodes in your cluster:
node3# pcs resource update master_pandoradb meta master-max="1" master-node-max="1" clone-max="3" clone-node-max="1" notify="true" globally-unique="false" target-role="Master" is-managed="true"
Check the status of the cluster:
node3# pcs status
Cluster name: pandoraha
Stack: corosync
Current DC: node1 (version 1.1.18-11.el7_5.2-2b07d5c5a9) - partition with quorum
Last updated: Fri Jun 1 10:55:47 2018
Last change: Fri Jun 1 10:55:09 2018 by root via crm_attribute on node3

3 nodes configured
3 resources configured

Online: [ node1 node2 node3 ]

Full list of resources:

 pandoraip      (ocf::heartbeat:IPaddr2):       Started node1
 Master/Slave Set: master_pandoradb [pandoradb]
     Masters: [ node1 ]
     Slaves: [ node2 node3 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
1.7.5 Fixing a broken node
We will use node2 as an example. Put node2 into standby mode:
node2# pcs node standby node2
node2# pcs status
Cluster name: pandoraha
Stack: corosync
Current DC: node1 (version 1.1.18-11.el7_5.2-2b07d5c5a9) - partition with quorum
Last updated: Tue Jun 12 08:20:49 2018
Last change: Tue Jun 12 08:20:34 2018 by root via cibadmin on node2

2 nodes configured
3 resources configured

Node node2: standby
Online: [ node1 ]

Full list of resources:

 Master/Slave Set: master_pandoradb [pandoradb]
     Masters: [ node1 ]
     Stopped: [ node2 ]
 pandoraip      (ocf::heartbeat:IPaddr2):       Started node1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
Backup Percona's data directory:
node2# systemctl stop mysqld
node2# mv /var/lib/mysql /var/lib/mysql.bak
Backup the database of the master node (node1 in this example) and write down the master log file name and position (in the example, mysql-bin.000001 and 785):
node1# innobackupex --no-timestamp /root/pandoradb.bak/
node1# innobackupex --apply-log /root/pandoradb.bak/
node1# cat /root/pandoradb.bak/xtrabackup_binlog_info
mysql-bin.000001        785
Load the database on the broken node and configure it to replicate from node1 (set MASTER_LOG_FILE and MASTER_LOG_POS to the values found in the previous step):
node1# rsync -avpP -e ssh /root/pandoradb.bak/ node2:/var/lib/mysql/
node2# chown -R mysql:mysql /var/lib/mysql
node2# chcon -R system_u:object_r:mysqld_db_t:s0 /var/lib/mysql
node2# systemctl start mysqld
node2# mysql -uroot -ppandora
mysql> CHANGE MASTER TO MASTER_HOST='node1', MASTER_USER='root', MASTER_PASSWORD='pandora', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=785;
mysql> START SLAVE;
mysql> SHOW SLAVE STATUS \G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: node1
                  Master_User: root
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000002
          Read_Master_Log_Pos: 785
               Relay_Log_File: node2-relay-bin.000003
                Relay_Log_Pos: 998
        Relay_Master_Log_File: mysql-bin.000002
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: pandora
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 785
              Relay_Log_Space: 1252
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: 580d8bb0-6991-11e8-9a22-16efadb2f150
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:
1 row in set (0.00 sec)
mysql> QUIT
node2# systemctl stop mysqld
Remove node2 from standby mode:
node2# pcs node unstandby node2
node2# pcs resource cleanup --node node2
Check the status of the cluster:
node3# pcs status
Cluster name: pandoraha
Stack: corosync
Current DC: node1 (version 1.1.18-11.el7_5.2-2b07d5c5a9) - partition with quorum
Last updated: Fri Jun 1 10:55:47 2018
Last change: Fri Jun 1 10:55:09 2018 by root via crm_attribute on node3

2 nodes configured
3 resources configured

Online: [ node1 node2 ]

Full list of resources:

 pandoraip      (ocf::heartbeat:IPaddr2):       Started node1
 Master/Slave Set: master_pandoradb [pandoradb]
     Masters: [ node1 ]
     Slaves: [ node2 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
1.7.6 Troubleshooting
1.7.6.1 What do I do if one of the cluster nodes is not working?
The service will not be affected as long as the master node is running. If the master node fails, a slave node will be automatically promoted to master. See Fixing a broken node.
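To confirm which node currently holds the master role and the virtual IP before intervening, a quick check using the same tools shown above could be (commands are illustrative; replace <virtual IP> with your own address):

node1# pcs status | grep -A1 'Masters'
node1# ip addr show | grep '<virtual IP>'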