====== High Availability (HA) ======
{{indexmenu_n>6}}

[[:en:documentation:start|Go back to Pandora FMS documentation index]]

<WRAP center round important 60%> We are working on the translation of the Pandora FMS documentation. Sorry for any inconvenience. </WRAP>

===== High Availability =====

==== Introduction ====

Pandora FMS is a very stable application (thanks to the tests and improvements included in each version and to the fixing of failures reported by users). Nevertheless, in critical environments and/or under heavy load, it may be necessary to distribute the load among several machines, making sure that the system will not go down if any Pandora FMS component fails.

Pandora FMS has been designed to be very modular. Each of its modules can work independently, but they are also designed to work together and to take over the load of components that have gone down.
The Pandora FMS standard design could be this one:

{{ :wiki:ha1.png?550 }}

Obviously, agents are not redundant. If an agent goes down, it makes no sense to run another one, since the only reasons for an agent to go down are that data cannot be obtained (because some module execution fails, which another agent running in parallel would not solve) or that the system itself is isolated or has failed. The best solution is to make the critical systems redundant - regardless of whether they have Pandora FMS agents or not - and thus make the monitoring of those systems redundant as well.

It is possible to use HA in several scenarios:

==== Dimensioning and HA architecture designs ====

The most important components of Pandora FMS are:

  - Server
  - Console

Each of these components can be replicated to protect the monitoring system from any catastrophe.

Depending on the monitoring needs, the different architectures will be defined.

**Note:** The tests used to define these architectures were carried out on different hardware:

Intel (R) Core (TM) i5-8600K CPU @ 3.60GHz

Instance //t2.large// from Amazon (([[https://aws.amazon.com/ec2/instance-types/t2/?nc1=h_ls|https://aws.amazon.com/ec2/instance-types/t2/?nc1=h_ls]]))

=== Sizing ===

Depending on the needs:

1. Standalone (without high availability) up to 2500 agents / 50000 modules every 5 minutes, even data, no historical data.
<code>
 Servers: 1 (shared)

 Main:
 ----------
 RAM: 8 GB
 Disk: 100GB
</code>

{{ :wiki:dim_std1.png?500 }}
  
2. Standalone (without high availability) up to 2500 agents / 50000 modules every 5 minutes, even data, with historical data (1 year).

<code>
 Servers: 2 (1 shared, 1 historical)

 Main:
 ----------
 RAM: 8 GB
 Disk: 100GB

 Historical:
 ----------
 RAM: 4 GB
 Disk: 200GB
</code>

{{ :wiki:dim_std2.png?700 }}
  
3. Standalone (without high availability) up to 5000 agents / 100000 modules every 5 minutes, even data, with historical data (1 year).

<code>
 Servers: 3 (1 server + console, 1 main database, 1 historical)

 Server + console:
 -------------------
 RAM: 8 GB
 Disk: 40GB

 Main database:
 ------------------------
 RAM: 8 GB
 Disk: 100GB

 Historical:
 ----------
 RAM: 4 GB
 Disk: 200GB
</code>

{{ :wiki:dim_std3.png?700 }}
  
=== HA Architecture designs ===

1. Database in simple HA, up to 7500 agents / 125000 modules every 5 minutes, even data, with historical data (1 year).

<code>
 Servers: 4 (1 server + console, 2 database, 1 historical)

 Server + console:
 RAM: 8 GB
 Disk: 40GB

 Database node 1:
 ---------------------
 RAM: 8 GB
 Disk: 100GB

 Database node 2:
 ---------------------
 RAM: 8 GB
 Disk: 100GB

 Historical:
 ----------
 RAM: 4 GB
 Disk: 300GB
</code>

{{ :wiki:dim_ha1.png?700 }}
  
2. Database in complete HA (with quorum), up to 7500 agents / 125000 modules every 5 minutes, even data, with historical data (1 year).

<code>
 Servers: 5 (1 server + console, 3 database, 1 historical)

 Server + console:
 ------------------
 RAM: 8 GB
 Disk: 40GB

 Database node 1:
 ---------------------
 RAM: 8 GB
 Disk: 100GB

 Database node 2:
 ---------------------
 CPU: 6 cores
 RAM: 8 GB
 Disk: 100GB

 Database node 3:
 ---------------------
 RAM: 8 GB
 Disk: 100GB

 Historical:
 ----------
 RAM: 4 GB
 Disk: 200GB
</code>

{{ :wiki:dim_ha2.png?700 }}
  
3. Database in simple HA and Pandora FMS in balanced HA, up to 7500 agents / 125000 modules every 5 minutes, even data, with historical data (1 year).

<code>
 Servers: 5 (2 server + console, 2 database, 1 historical)

 Server + console:
 -------------------
 RAM: 8 GB
 Disk: 40GB

 Server + console:
 -------------------
 RAM: 8 GB
 Disk: 40GB

 Database node 1:
 ---------------------
 RAM: 8 GB
 Disk: 100GB

 Database node 2:
 ---------------------
 RAM: 8 GB
 Disk: 100GB

 Historical:
 ----------
 RAM: 4 GB
 Disk: 200GB
</code>

{{ :wiki:dim_ha3.png?700 }}
  
4. Basic HA balanced on server, main and replica database, up to 4000 agents / 90000 modules every 5 minutes, even data, with historical data (1 year).

<code>
 Servers: 3 (2 shared, 1 historical)

 Main: (console + server + database node 1)
 ----------
 RAM: 12 GB
 Disk: 100GB

 Secondary: (console + server + database node 2)
 ----------
 RAM: 12 GB
 Disk: 100GB

 Historical:
 ----------
 RAM: 4 GB
 Disk: 200GB
</code>

In this overview, Pandora FMS database nodes are configured on each of the two available servers (main and secondary).

{{ :wiki:dim_ha4.png?700 }}

**Note:** For large environments, each of the configuration schemes described above will be defined as a computing node.
  
=== Example ===

If you need to monitor 30,000 agents with 500,000 modules, configure as many nodes as necessary to cover these requirements. Follow the example:

If you choose the HA # 1 design (1 server + console, 2 database nodes in HA, and a historical database), you must configure 30,000 / 7500 = 4 nodes.
  
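The node count is simply the total number of agents divided by the capacity of one node, rounded up; a minimal shell sketch of that check, using the figures from this example:

<code>
# Figures taken from the example above
TOTAL_AGENTS=30000
AGENTS_PER_NODE=7500

# Integer division rounded up
echo $(( (TOTAL_AGENTS + AGENTS_PER_NODE - 1) / AGENTS_PER_NODE ))   # prints 4
</code>
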
To manage the entire environment, it will be necessary to have an installed Metaconsole, from which to configure the entire monitoring infrastructure.

The Metaconsole will require:
<code>
 Servers: 1 (shared)

 Main:
 ----------
 RAM: 12 GB
 Disk: 100GB
</code>

Total servers with independent historical databases: 17

Total servers with combined historical databases: 13

**Note:** To combine all the historical databases (4) on a single machine, resize its characteristics to take on the extra load:

<code>
 RAM: 12 GB
 Disk: 1200GB
</code>
  
==== HA of Data Server ====

The easiest way is to use the HA implemented in the agents (which allows contacting an alternative server if the main one does not reply). However, since the data server listens on port 41121, a standard TCP port, it is also possible to use any commercial solution that allows balancing or clustering an ordinary TCP service.
  
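For instance, a minimal sketch of such a TCP balancer in front of two data servers using HAProxy; the tool choice, addresses and backend names are assumptions for illustration only, not part of the original documentation:

<code>
# /etc/haproxy/haproxy.cfg - hypothetical fragment
listen tentacle_data_servers
    bind *:41121
    mode tcp
    balance roundrobin
    server dataserver1 192.168.50.1:41121 check
    server dataserver2 192.168.50.2:41121 check
</code>
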
In the case of using the agent HA mechanism, there will be a small delay when sending data, since at each agent execution it will try to connect with the primary server, and if it does not answer, it will try the secondary one (if it has been configured like that). This is described below as "Balancing in Software Agents".

If you wish to use two data servers and for both to manage policies, collections, and remote configurations, you will need to [[:en:documentation:07_technical_annexes:10_share_var_spool_directory_between_several_pandora_servers|share the following directories]] so that all data server instances can read and write over these directories (a minimal sharing sketch follows the list below). Consoles must have access to these shared directories as well.
  
  * /var/spool/pandora/data_in/conf
  * /var/www/html/pandora_console/attachment
  
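These directories can be shared with any network file system. Purely as a rough, hedged NFS sketch (the hostnames are placeholders and only the directories listed above are shown):

<code>
# /etc/exports on the host that owns the files (hypothetical example)
/var/spool/pandora/data_in/conf           datasrv2(rw,sync,no_root_squash)
/var/www/html/pandora_console/attachment  datasrv2(rw,sync,no_root_squash)

# On the other data server / console, mount each export, e.g.:
# mount -t nfs datasrv1:/var/spool/pandora/data_in/conf /var/spool/pandora/data_in/conf
</code>
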
{{ :wiki:ha2.png?550 }}
  
==== Balancing in the Software Agents ====

From the software agents it is possible to balance data servers, that is, to configure a master data server and a backup one.

In the agent configuration file pandora_agent.conf, configure and uncomment the following section:

<code>
 # Secondary server configuration
 # ==============================
 secondary_server_ssl no
 secondary_server_opts
</code>
  
  * **secondary_mode**: Secondary server mode. It may have two values:
    * **on_error**: Send data to the secondary server only if they cannot be sent to the main server.
    * **always**: Always send data to the secondary server, regardless of whether the main server can be reached or not.
  * **secondary_server_ip**: Secondary server IP.
  * **secondary_server_path**: Path where the XML files are copied on the secondary server, usually /var/spool/pandora/data_in
  * **secondary_server_port**: Port through which the XML files will be copied to the secondary server: 41121 for Tentacle, 22 for SSH and 21 for FTP.
  * **secondary_transfer_mode**: Transfer mode that will be used to copy the XML files to the secondary server: Tentacle, SSH, FTP, etc.
  * **secondary_server_pwd**: Password option for FTP transfer.
  * **secondary_server_ssl**: Type yes or no depending on whether you want to use SSL to transfer data through Tentacle.
  * **secondary_server_opts**: This field is for other options needed for the transfer.

<WRAP center round important 60%> Only the remote configuration of the agent is operative in the main server, if enabled. </WRAP>
  
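A minimal filled-in sketch of this block (the IP address and mode choices are illustrative assumptions, not values taken from the original documentation), using Tentacle on its standard port and falling back to the secondary server only on error:

<code>
 # Secondary server configuration (hypothetical example values)
 # ==============================
 secondary_mode on_error
 secondary_server_ip 192.168.50.2
 secondary_server_path /var/spool/pandora/data_in
 secondary_server_port 41121
 secondary_transfer_mode tentacle
 secondary_server_ssl no
 secondary_server_opts
</code>
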
==== HA of network, WMI, plugin, web and prediction servers, among others ====

You must install several servers (network, WMI, plugin, web or prediction) on several machines of the network, all with the same visibility of the systems you want to monitor. All these machines should be in the same segment (so that network latency data are coherent).

The obvious way to implement HA and load balancing in a system of two nodes is to assign 50% of the modules to each server and select both servers as masters. If there are more than two master servers and a third server goes down with modules still to be executed, the first master server that executes the module will "self-assign" the module of the down server. When a down server recovers, the modules that had been assigned to the primary server are automatically assigned back.
  
{{ :wiki:ha3.png?550 }}

The load balancing between the different servers is done in the Agent Administration section in the "setup" menu.

{{ :wiki:ha4.png?800 }}

In the "server" field, there is a combo box where you can choose the server that will do the checking.

=== Server configuration ===

A Pandora FMS server can be running in two different modes:

If a server fails, its modules will be executed by the master server so that no data is lost.

At any given time there can only be one master server, which is chosen from all the servers with the //master// configuration option in ///etc/pandora/pandora_server.conf// set to a value higher than 0:
  
<code>
master [1..7]
</code>

If the current master server fails, a new master server is chosen. If there is more than one candidate, the one with the highest //master// value is chosen.

<WRAP center round important 60%> Be careful about disabling servers. If a server with Network modules fails and the Network Server is disabled in the master server, those modules will not be executed. </WRAP>

For example, if you have three Pandora FMS Servers with //master// set to 1, a master server will be randomly chosen and the other two will run in non-master mode. If the master server fails, a new master will be randomly chosen.
  
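As a hedged illustration only (the specific values are arbitrary examples, not taken from the original text), two servers could be prioritized like this in their respective ///etc/pandora/pandora_server.conf// files:

<code>
# pandora_server.conf on server A - preferred master (example value)
master 2

# pandora_server.conf on server B - takes over only if no higher-valued candidate is up (example value)
master 1
</code>
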
The following parameters have been entered in pandora_server.conf:

==== Pandora FMS Console HA ====

Install another console. Any of them can be used simultaneously from different locations by different users. You may use a web balancer encompassing all consoles in case you need horizontal growth to manage console load. The session system is managed by cookies and they stay stored in the browser.
  
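As an illustration only (the documentation does not prescribe any particular balancer), a minimal HAProxy sketch that spreads users across two consoles; the addresses and backend names are placeholders:

<code>
# /etc/haproxy/haproxy.cfg - hypothetical fragment
listen pandora_consoles
    bind *:80
    mode http
    balance source                     # keep each user on the same console
    server console1 192.168.50.10:80 check
    server console2 192.168.50.11:80 check
</code>
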
In the case of using remote configuration, in order to manage it from all consoles, both data servers and consoles must share the data directory (/var/spool/pandora/data_in) for remote agent configuration, collections and other directories.

<WRAP center round tip 60%> You can learn how to share the key folders with NFS or GlusterFS using [[:en:documentation:07_technical_annexes:10_share_var_spool_directory_between_several_pandora_servers|this guide]]. </WRAP>

It is important to share only data_in's subdirectories and not the data_in folder itself, since doing so would affect server performance negatively.
  
=== Update ===

When updating Pandora FMS console in an HA environment, it is important to bear in mind the following points when updating by means of OUM through Update Manager > [[:en:documentation:02_installation:02_anexo_upgrade|Update Manager offline]].

Enterprise version users can download the OUM package from Pandora FMS support website.

When in a balanced environment with a shared database, updating the first console applies the corresponding changes to the database. This means that, when updating the secondary console, Pandora FMS shows an error message on finding the information already entered in the database. However, the console is still updated.

{{ :wiki:oum1.jpg }}

{{ :wiki:oum2.jpg }}
  
==== Database HA ====

<WRAP center round tip 60%> This solution is provided to offer a fully-featured solution for HA in Pandora FMS environments. This is the only officially-supported HA model for Pandora FMS. This solution is provided -preinstalled- since OUM 724. This system replaces DRBD and other HA systems recommended in the past. </WRAP>

<WRAP center round important 60%> This is the first Pandora DB HA implementation, and the installation process is almost fully manual, using the Linux console as root. In future versions, setup from the GUI will be provided. </WRAP>

Pandora FMS relies on a MySQL database for configuration and data storage. A database failure can temporarily bring your monitoring solution to a halt. The Pandora FMS high-availability database cluster allows you to easily deploy a fault-tolerant, robust architecture.

Active/passive [[https://dev.mysql.com/doc/refman/5.7/en/replication.html|replication]] takes place from a single master node (with writing permissions) to any number of slaves (read only). A virtual IP address always points to the current master. If the master node fails, one of the slaves is promoted to master and the virtual IP address is updated accordingly.

{{ :wiki:ha_cluster_diagram.png }}

The Pandora FMS Database HA Tool, //pandora_ha//, monitors the cluster and makes sure the Pandora FMS Server is always running, restarting it when needed. //pandora_ha// itself is monitored by systemd.

<WRAP center round important 60%> This is an advanced feature that requires knowledge of Linux systems. </WRAP>
  
=== Installation ===

Configure a two-node cluster, with hosts //node1// and //node2//. Change hostnames, passwords, etc. as needed to match your environment.

Commands that should be run on one node will be preceded by that node's hostname. For example:

<code>
node1# <command>
</code>

Commands that should be run on all nodes will be preceded by the word **all**. For example:

<code>
all# <command>
</code>

There is an additional host, which will be referred to as **pandorafms**, where Pandora FMS is or will be installed.
  
When referencing **all**, it only refers to the database nodes; the additional Pandora FMS node will always be referenced as **pandorafms** and is not part of **all**.

== Prerequisites ==

CentOS version 7 must be installed on all hosts, and they must be able to resolve each other's hostnames.
  
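Name resolution can be handled by DNS or simply with /etc/hosts entries; a minimal sketch, assuming the example addresses shown in the ping output below:

<code>
# /etc/hosts on every host (example addresses)
192.168.0.1   node1
192.168.0.2   node2
</code>
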
<code>
 node1# ping node2
 PING node2 (192.168.0.2) 56(84) bytes of data.

 node2# ping node1
 PING node1 (192.168.0.1) 56(84) bytes of data.

 pandorafms# ping node1
 PING node1 (192.168.0.1) 56(84) bytes of data.

 pandorafms# ping node2
 PING node2 (192.168.0.2) 56(84) bytes of data.
</code>
  
<code>
 all# sed -i -e 's/^Banner.*//g' /etc/ssh/sshd_config
 all# systemctl restart sshd
</code>

<WRAP center round important 60%> The Pandora FMS Database HA Tool will not work properly if a banner is configured for Open SSH. </WRAP>
  
Generate new SSH authentication keys for each host and copy the public key to each of the other hosts:

<WRAP center round important 60%> Keys can also be generated for a non-root user, for a later cluster installation with that non-root user. </WRAP>
  
<code>
 node1# echo -e "\n\n\n" | ssh-keygen -t rsa
 node1# ssh-copy-id -p22 root@node2
 node1# ssh node2

 node2# echo -e "\n\n\n" | ssh-keygen -t rsa
 node2# ssh-copy-id -p22 root@node1
 node2# ssh node1

 pandorafms# echo -e "\n\n\n" | ssh-keygen -t rsa
 pandorafms# ssh-copy-id -p22 root@node1
 pandorafms# ssh-copy-id -p22 root@node2
 pandorafms# ssh node1
 pandorafms# ssh node2
</code>
  
On the Pandora FMS node, copy the key pair to ///usr/share/httpd/.ssh///. The Pandora FMS Console needs to retrieve the cluster status:

<code>
 pandorafms# cp -r /root/.ssh/ /usr/share/httpd/
 pandorafms# chown -R apache:apache /usr/share/httpd/.ssh/
</code>
  
The following steps are only necessary if the nodes are running SSH on a non-standard port. Replace //22// with the right port number:

<code>
 all# echo -e "Host node1\n    Port 22" >> /root/.ssh/config
 all# echo -e "Host node2\n    Port 22" >> /root/.ssh/config
</code>
  
== Installing Percona ==

Install the required packages:

<code>
 all# yum install https://repo.percona.com/yum/percona-release-latest.noarch.rpm

 all# yum install -y Percona-Server-server-57 percona-xtrabackup-24
</code>
  
Make sure the Percona service is disabled, since it will be managed by the cluster:

<code>
all# systemctl disable mysqld
</code>

<WRAP center round important 60%> If the system service is not disabled, the cluster's resource manager will not work properly. </WRAP>

Next, start the Percona server:

<code>
all# systemctl start mysqld
</code>
  
A new temporary password will be generated and written to /var/log/mysqld.log. Log into the Percona server and change the root password:

<code>
 mysql> SET PASSWORD FOR 'root'@'localhost' = PASSWORD('pandora');
 mysql> quit
</code>

At the time of carrying out the MySQL configuration, it can be done through the following autogenerator, which is already included in the installation package of Pandora FMS Enterprise server when it is installed with the //--ha// modifier, as well as in the default Pandora FMS ISO.

<WRAP center round important 60%> Remember that if you have installed the server package manually, instead of using the ISO, you must pass the //--ha// parameter to access the HA server tools. </WRAP>

In case you have not done so, you only need to reinstall the server package with the --ha parameter. In the example environment there are 2 database nodes (node1 and node2) and one application node (pandorafms), so the Pandora FMS server package only needs to be installed on the application node. If the application node was installed from a Pandora FMS ISO (recommended option) this step is not necessary; otherwise, reinstall the server package with the --ha flag:

<code>
pandorafms# ./pandora_server_installer --install --ha
</code>
Once the server is installed with HA tools enabled, you will find the configuration generator for database replication in the path: ///usr/share/pandora_server/util/myconfig_ha_gen.sh//

<code>
      -s --poolsize   Set innodb_buffer_pool_size static size in M (Megabytes) or G (Gigabytes). [ default value: autocalculated ] (optional)
      -h --help       Print help.
</code>
  
In the current case, where the databases are not on the same server as the application, it will be necessary to copy the script to the nodes so it can be executed locally:

<code>
 pandorafms# scp /usr/share/pandora_server/util/myconfig_ha_gen.sh root@node1:/root/
 pandorafms# scp /usr/share/pandora_server/util/myconfig_ha_gen.sh root@node2:/root/
</code>
  
As seen in the example, it will only be necessary to enter the parameter **serverid** (mandatory) in standard environments or those deployed with the ISO, plus some optional parameters for custom environments.

If the default user or the defined one cannot connect to the database, the script will end with a connection error.

<code>
 node1# /root/myconfig_ha_gen.sh -i 1
 node2# /root/myconfig_ha_gen.sh -i 2
</code>

Restart the mysqld service to check that the configuration has been correctly applied.

<code>
all# systemctl restart mysqld
</code>
  
== Installing Pandora FMS ==

<WRAP center round tip 60%> \\ From version NG 754 onwards, [[:en:documentation:05_big_environments:07_server_management#manual_startupshutdown_for_pandora_fms_servers|additional options are available for manual startup and shutdown]] of High Availability (HA) environments. \\ </WRAP>

**Existing Pandora FMS installation**

== Replication setup ==

Grant the required privileges for replication to work on all databases:

<code>
 mysql> FLUSH PRIVILEGES;
 mysql> quit
</code>
  
Back up the database of the first node and write down the master log file name and position (in the example, //mysql-bin.000001// and //785//):

<code>
node1# innobackupex --no-timestamp /root/pandoradb.bak/
node1# innobackupex --apply-log /root/pandoradb.bak/
node1# cat /root/pandoradb.bak/xtrabackup_binlog_info
mysql-bin.000001        785
</code>
  
Load the database on the second node and configure it to replicate from the first node (set MASTER_LOG_FILE and MASTER_LOG_POS to the values found in the previous step):

<WRAP center round important 60%> If the Pandora FMS installation was carried out through an **ISO**, delete the original database from node 2. Only the database from node 1 will be needed, which will be the one to store information from both machines. That way, balancing will be correctly done. </WRAP>
  
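The exact statements for this step are not reproduced above; purely as a rough, hedged illustration of what pointing the replica at the master typically looks like with standard MySQL 5.7 commands, run on node2 (the replication user and password are placeholders, and the coordinates are the ones noted in the previous step):

<code>
 mysql> CHANGE MASTER TO MASTER_HOST='node1', MASTER_USER='<replication_user>', MASTER_PASSWORD='<replication_password>', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=785;
 mysql> START SLAVE;
 mysql> SHOW SLAVE STATUS \G
</code>
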
<code>
                Slave_SQL_Running: Yes
                  Replicate_Do_DB: pandora
              Replicate_Ignore_DB:
               Replicate_Do_Table:
           Replicate_Ignore_Table:
          Replicate_Wild_Do_Table:
      Replicate_Wild_Ignore_Table:
                       Last_Errno: 0
                       Last_Error:
                     Skip_Counter: 0
              Exec_Master_Log_Pos: 785
                  Relay_Log_Space: 1252
                  Until_Condition: None
                   Until_Log_File:
                    Until_Log_Pos: 0
               Master_SSL_Allowed: No
               Master_SSL_CA_File:
               Master_SSL_CA_Path:
                  Master_SSL_Cert:
                Master_SSL_Cipher:
                   Master_SSL_Key:
            Seconds_Behind_Master: 0
    Master_SSL_Verify_Server_Cert: No
                    Last_IO_Errno: 0
                    Last_IO_Error:
                   Last_SQL_Errno: 0
                   Last_SQL_Error:
      Replicate_Ignore_Server_Ids:
                 Master_Server_Id: 1
                      Master_UUID: 580d8bb0-6991-11e8-9a22-16efadb2f150
          Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
               Master_Retry_Count: 86400
                      Master_Bind:
          Last_IO_Error_Timestamp:
         Last_SQL_Error_Timestamp:
                   Master_SSL_Crl:
               Master_SSL_Crlpath:
               Retrieved_Gtid_Set:
                Executed_Gtid_Set:
                    Auto_Position: 0
             Replicate_Rewrite_DB:
                     Channel_Name:
               Master_TLS_Version:
    1 row in set (0.00 sec)
 mysql> QUIT

 all# systemctl stop mysqld
</code>
  
<WRAP center round important 60%> **Slave_IO_Running** and **Slave_SQL_Running** should show **Yes**. Other values may differ from the example. </WRAP>

<WRAP center round important 60%> Make sure **not** to use the **root** user to perform this process. It is advised to grant permissions to another user in charge of managing the database, to avoid possible conflicts. </WRAP>
  
== Configuring the two-node cluster ==

Install the required packages:

<code>
 all# yum install -y epel-release corosync ntp pacemaker pcs

 all# systemctl enable ntpd
 all# systemctl enable corosync
 all# systemctl enable pcsd
 all# cp /etc/corosync/corosync.conf.example /etc/corosync/corosync.conf
</code>
  
<code>
 all# systemctl start corosync
 all# systemctl start pcsd
</code>

Stop the Percona server:

<code>
node1# systemctl stop mysqld
node2# systemctl stop mysqld
</code>
  
Authenticate all nodes in the cluster:

Create and start the cluster:

<code>
all# echo hapass | passwd hacluster --stdin
</code>

<code>
 node1# pcs cluster auth -u hacluster -p hapass --force node1 node2
 node1# pcs cluster setup --force --name pandoraha node1 node2
 node1# pcs property set stonith-enabled=false
 node1# pcs property set no-quorum-policy=ignore
</code>
  
<code>
   Last updated: Fri Jun  8 12:53:49 2018
   Last change: Fri Jun  8 12:53:47 2018 by root via cibadmin on node1

   2 nodes configured
   0 resources configured

   Online: [ node1 node2 ]

   No resources

   Daemon Status:
     corosync: active/disabled
     pacemaker: active/disabled
     pcsd: active/enabled
</code>

<WRAP center round important 60%> Both nodes should be online (**Online: [ node1 node2 ]**). Other values may differ from the example. </WRAP>
  
Install the Percona Pacemaker replication agent (it can be downloaded manually from our [[https://pandorafms.com/library/pacemaker-replication-agent-for-mysql/|library]]):

<code>
all# cd /usr/lib/ocf/resource.d/
all# mkdir percona
all# rm -f pacemaker_mysql_replication.zip
all# chmod u+x mysql
</code>
  
Configure cluster resources. Replace **<VIRT_IP>** with the virtual IP address of your choice:

<WRAP center round important 60%> If you have changed the default password used in this guide for the database's root user, update //replication_passwd// and //test_passwd// accordingly. </WRAP>

<WRAP center round important 60%> Cluster resource names must be exactly those indicated in this guide ("pandoraip" and "pandoradb"). </WRAP>

<code>
node1# pcs constraint colocation add master master_pandoradb with pandoraip
node1# pcs constraint order promote master_pandoradb then start pandoraip
</code>
  
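The resource creation commands themselves are not reproduced above. Purely as a rough, hedged sketch of what the virtual IP resource typically looks like with the standard IPaddr2 agent (the netmask and monitor interval are illustrative assumptions; the //pandoradb// resource additionally uses the Percona replication agent installed earlier and is not sketched here):

<code>
node1# pcs resource create pandoraip ocf:heartbeat:IPaddr2 ip=<VIRT_IP> \
  cidr_netmask=24 op monitor interval=20s
</code>
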
Check cluster status:

<code>
   Last updated: Fri Jun  8 13:02:21 2018
   Last change: Fri Jun  8 13:02:11 2018 by root via cibadmin on node1

   2 nodes configured
   3 resources configured

   Online: [ node1 node2 ]

   Full list of resources:

    Master/Slave Set: master_pandoradb [pandoradb]
        Masters: [ node1 ]
        Slaves: [ node2 ]
    pandoraip      (ocf::heartbeat:IPaddr2):       Started node1

   Daemon Status:
     corosync: active/disabled
     pacemaker: active/disabled
     pcsd: active/enabled
</code>

<WRAP center round important 60%> Both nodes should be online (**Online: [ node1 node2 ]**). Other values may differ from the example. </WRAP>
  
 == Configuring the two-node cluster with a non-root user == == Configuring the two-node cluster with a non-root user ==
 +
 It will be done similarly to the previous one. The login information must have been copied, which has already been explained, and the following steps must be carried out: It will be done similarly to the previous one. The login information must have been copied, which has already been explained, and the following steps must be carried out:
  
 +<code>
 +# All nodes:
  
-  # All nodes:+</code>
  
 <code> <code>
Line 927: Line 946:
  passwd <usuario>  passwd <usuario>
  usermod -a -G haclient <usuario>  usermod -a -G haclient <usuario>
 +
 </code> </code>
  
Line 932: Line 952:
  # Enable PCS ACL system  # Enable PCS ACL system
  pcs property set enable-acl = true --force  pcs property set enable-acl = true --force
 +
 </code> </code>
  
Line 937: Line 958:
  # Create role  # Create role
  pcs acl role create <rol> description="RW role"  write xpath /cib  pcs acl role create <rol> description="RW role"  write xpath /cib
 +
 </code> </code>
  
<code>
 # Create PCS user - Local user
 pcs acl user create <user> <role>
</code>
  
<code>
 pcs status
 Username: <user>
 Password: *****
</code>
  
<code>
# Wait for the 'Authorized' message and ignore the rest of the output. Wait a second and retry the 'pcs status' command.
</code>
  
== Pandora FMS setup ==

Make sure //php-pecl-ssh2// is installed:
  
<code>
 pandorafms# yum install php-pecl-ssh2
 pandorafms# systemctl restart httpd
</code>
  
There are two parameters in ///etc/pandora/pandora_server.conf// that control the performance of the Pandora FMS Database HA Tool. Adjust them to suit your needs:
  
<code>
# Pandora FMS Database HA Tool execution interval in seconds (PANDORA FMS ENTERPRISE ONLY).
ha_interval 30

# Pandora FMS Database HA Tool monitoring interval in seconds. Must be a multiple of ha_interval (PANDORA FMS ENTERPRISE ONLY).
ha_monitoring_interval 60
</code>
  
Point your Pandora FMS to the master's virtual IP address (replacing **<IP>** with the virtual IP address):
  
<code>
pandorafms# sed -i -e "s/\$config\[\"dbhost\"\]=\".*\";/\$config[\"dbhost\"]=\"$VIRT_IP\";/" \
/var/www/html/pandora_console/include/config.php
</code>
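The sed command above relies on a $VIRT_IP shell variable and only updates the Console; how the variable is set and how the server itself is repointed are not shown in this excerpt. A minimal sketch, assuming the standard //dbhost// directive in ///etc/pandora/pandora_server.conf// (values are examples):

<code>
# Sketch only: define the virtual IP used by the sed command above
pandorafms# export VIRT_IP=<IP>

# Point the Pandora FMS server to the same virtual IP
pandorafms# sed -i -e "s/^dbhost .*/dbhost $VIRT_IP/" /etc/pandora/pandora_server.conf
</code>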
  
Install and start the pandora_ha service:
  
<code>
pandorafms# cat > /etc/systemd/system/pandora_ha.service <<-EOF
[Unit]
Description=Pandora FMS Database HA Tool
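# Sketch only: the rest of the unit file is not visible in this excerpt.
# The [Service] and [Install] sections below, and the pandora_ha binary path, are assumptions.
[Service]
Type=simple
Restart=always
ExecStart=/usr/bin/pandora_ha /etc/pandora/pandora_server.conf

[Install]
WantedBy=multi-user.target
EOF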
pandorafms# systemctl enable pandora_ha
pandorafms# systemctl start pandora_ha
</code>
  
Log into your Pandora FMS Console and go to //Servers > Manage database HA//:
  
{{ :wiki:manage_ha_menu.png }}
  
Click on //Add new node// and create an entry for the first node:
  
{{ :wiki:manage_ha_add_node.png }}
  
Next, click on //Create slave// and add an entry for the second node. You should see something similar to this:
  
{{ :wiki:manage_ha_view.png }}
  
<WRAP center round important 60%> Check that the **functions_HA_cluster.php** file contains the correct $ssh_user paths, so that the view is displayed properly and it is possible to interact correctly with the icons. </WRAP>
  
<WRAP center round important 60%> //Seconds behind master// should be close to 0. If it keeps increasing, replication is not working. </WRAP>
  
=== Adding a new node to the cluster ===
<WRAP center round important 60%> If the system service is not disabled, the cluster's resource manager will not work properly. </WRAP>
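The installation of the database engine on the new node is not shown in this excerpt; the warning above refers to its system service, which must not start on its own because the cluster's resource manager controls it. A minimal sketch, assuming the service is called //mysqld//:

<code>
node3# systemctl stop mysqld
node3# systemctl disable mysqld
</code>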
  
Configure Percona, replacing **<ID>** with a number that must be unique for each cluster node:
  
<WRAP center round important 60%> Database replication will not work if two nodes have the same **SERVER_ID**. </WRAP>
<code>
node3# export SERVER_ID=<ID>
</code>
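The Percona configuration file itself is not shown in this excerpt; only the SERVER_ID export is visible. A minimal sketch of the replication-related part (the ///etc/my.cnf// path and the surrounding options are assumptions; the point is that //server_id// must be unique on every node):

<code>
# Sketch only (assumed file: /etc/my.cnf)
[mysqld]
server_id=<ID>
log-bin=mysql-bin
</code>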
  
<WRAP center round important 60%> Make sure **Slave_IO_Running** and **Slave_SQL_Running** show **Yes**. Other values may differ from the example. </WRAP>
  
Install the required packages:
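The package list is not shown in this excerpt. A sketch of the kind of packages the new node typically needs, assuming the same Pacemaker/Corosync/pcs stack used on the existing nodes:

<code>
node3# yum install -y pacemaker corosync pcs
</code>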
  
Register the cluster node in the Pandora console from the "Servers → Manage database HA" menu.
  
=== Fixing a broken node ===

node2 will be used as an example. Set node2 into standby mode:
  
<code>
node2# pcs node standby node2
node2# pcs status
    Last updated: Tue Jun 12 08:20:49 2018
    Last change: Tue Jun 12 08:20:34 2018 by root via cibadmin on node2

    2 nodes configured
    3 resources configured

    Node node2: standby
    Online: [ node1 ]

    Full list of resources:

     Master/Slave Set: master_pandoradb [pandoradb]
         Masters: [ node1 ]
         Stopped: [ node2 ]
     pandoraip      (ocf::heartbeat:IPaddr2):       Started node1

    Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled
</code>
  
<WRAP center round important 60%> node2 should be on standby (**Node node2: standby**). Other values may differ from the example. </WRAP>
  
Back up Percona's data directory:
<code>
node2# [ -e /var/lib/mysql.bak ] && rm -rf /var/lib/mysql.bak
node2# mv /var/lib/mysql /var/lib/mysql.bak
</code>
  
Back up the database of the master node (node1 in this example) and update the master node name, the master log file name and the position in the cluster (in this example, mysql-bin.000001 and 785):
<code>
node1# crm_attribute --type crm_config --name pandoradb_REPL_INFO -s mysql_replication \
-v "node1|$(echo $binlog_info | awk '{print $1}')|$(echo $binlog_info | awk '{print $2}')"
</code>
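The backup commands that set the $binlog_info variable are not shown in this excerpt. A minimal sketch of where such a value typically comes from, assuming a Percona XtraBackup copy in /root/pandoradb.bak (the directory used by the rsync command below); the xtrabackup_binlog_info file name is specific to that tool:

<code>
# Sketch only: binary log file name and position recorded during the master backup
node1# binlog_info=$(cat /root/pandoradb.bak/xtrabackup_binlog_info)
node1# echo $binlog_info
mysql-bin.000001        785
</code>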
  
Load the database onto the broken node:

<code>
 node1# rsync -avpP -e ssh /root/pandoradb.bak/ node2:/var/lib/mysql/

 node2# chown -R mysql:mysql /var/lib/mysql
 node2# chcon -R system_u:object_r:mysqld_db_t:s0 /var/lib/mysql
</code>
  
Take node2 out of standby mode and clean up its resources:

<code>
 node2# pcs node unstandby node2
 node2# pcs resource cleanup --node node2
</code>
  
Check the cluster status again:

<code>
node2# pcs status
 pacemaker: active/enabled
 pcsd: active/enabled
</code>
  
<WRAP center round important 60%> Both nodes should be online (**Online: [ node1 node2 ]**). Other values may differ from those of the example. </WRAP>
  
Check database replication status:
- 
<code>
node2# mysql -uroot -ppandora
 mysql> SHOW SLAVE STATUS \G
               Slave_SQL_Running: Yes
                 Replicate_Do_DB: pandora
             Replicate_Ignore_DB:
              Replicate_Do_Table:
          Replicate_Ignore_Table:
         Replicate_Wild_Do_Table:
     Replicate_Wild_Ignore_Table:
                      Last_Errno: 0
                      Last_Error:
                    Skip_Counter: 0
             Exec_Master_Log_Pos: 785
                 Relay_Log_Space: 1252
                 Until_Condition: None
                  Until_Log_File:
                   Until_Log_Pos: 0
              Master_SSL_Allowed: No
              Master_SSL_CA_File:
              Master_SSL_CA_Path:
                 Master_SSL_Cert:
               Master_SSL_Cipher:
                  Master_SSL_Key:
           Seconds_Behind_Master: 0
   Master_SSL_Verify_Server_Cert: No
                   Last_IO_Errno: 0
                   Last_IO_Error:
                  Last_SQL_Errno: 0
                  Last_SQL_Error:
     Replicate_Ignore_Server_Ids:
                Master_Server_Id: 1
                     Master_UUID: 580d8bb0-6991-11e8-9a22-16efadb2f150
         Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
              Master_Retry_Count: 86400
                     Master_Bind:
         Last_IO_Error_Timestamp:
        Last_SQL_Error_Timestamp:
                  Master_SSL_Crl:
              Master_SSL_Crlpath:
              Retrieved_Gtid_Set:
               Executed_Gtid_Set:
                   Auto_Position: 0
            Replicate_Rewrite_DB:
                    Channel_Name:
              Master_TLS_Version:
   1 row in set (0.00 sec)
</code>
  
<WRAP center round important 60%> Make sure **Slave_IO_Running** and **Slave_SQL_Running** show **Yes**. Other values may differ from the example. </WRAP>
=== Automatic node recovery in Splitbrain ===

**Scenario.**

Both servers act as main or //master//: in the HA console view both appear as main (Master), but the virtual IP is only on one node (the one that is actually acting as main or Master).

{{ :wiki:pfms-ha-view_nodes.png }}

At this point, if the //token// [[:en:documentation:02_installation:04_configuration#splitbrain_autofix|splitbrain_autofix]] is set to 1, the //splitbrain// node recovery process will be started.
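A minimal sketch of how this //token// might look in ///etc/pandora/pandora_server.conf// (see the linked configuration reference for the authoritative description):

<code>
# Start the automatic splitbrain recovery process when both nodes claim to be master.
splitbrain_autofix 1
</code>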
For this functionality to work correctly, the following components must be properly configured:

  * SSH root user keys shared between the ''pandora_ha master'' server and all database servers (see the check sketch below the list).
  * Replication user configured in the setup with the necessary rights or //grants//, granted from the server where the ''pandora_ha master'' is hosted.

{{ :wiki:pfms-servers-manage_database_ha-setup_values.png }}

  * Space available for the database backup on both servers hosting the two databases (primary and secondary, Master/Slave).

<wrap hi>If the ''datadir'' and the ''path'' where the backup must be made are on the same partition, at least 50% of the partition must be free. </wrap>
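Two quick checks for these prerequisites; a minimal sketch, assuming the node names used elsewhere in this guide and the default ///var/lib/mysql// data directory:

<code>
# From the server running pandora_ha as master: passwordless SSH to every database node
ssh root@node1 'echo ok'
ssh root@node2 'echo ok'

# On each database node: free space on the partition holding the data directory
df -h /var/lib/mysql
</code>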
If all the above points are correctly configured, **the recovery process is as follows**:

  - Delete the previous //backups//.
  - Back up the ''datadir'' of the secondary node (//Slave//).
  - Back up the main node (//Master//).
  - Send the backup of the main node to the secondary node (//Master// -> //Slave//).
  - Start the resource of the secondary (//Slave//) node with the resynchronization parameters corresponding to the time of the backup.
  - Check that the resource is active and correct, using the configuration indicated by the [[:en:documentation:02_installation:04_configuration#ha_max_resync_wait_retries|ha_max_resync_wait_retries]] and [[:en:documentation:02_installation:04_configuration#ha_resync_sleep|ha_resync_sleep]] parameters (see the sketch after this list).
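A minimal sketch of how these parameters might be set in ///etc/pandora/pandora_server.conf// (the values are examples, not recommendations):

<code>
# How many times to re-check the resynchronized resource (example value).
ha_max_resync_wait_retries 10

# Seconds to wait between those checks (example value).
ha_resync_sleep 10
</code>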
Once the process is finished, an event indicating that the process has been completed successfully will appear in the event view.

If the environment is still not recovered automatically, the secondary (Slave) node will be left in standby and **an event indicating that the recovery must be performed manually will appear in the event view**.
  
  
=== Troubleshooting ===

== What do I do if one of the cluster nodes does not work? ==
{{ :wiki:manage_ha_failed.png }}

The service will not be compromised as long as the master node is running. If the master node fails, a slave node will be automatically promoted to master. See [[:en:documentation:05_big_environments:06_ha_cluster#fixing_a_broken_node|Fixing a broken node]].

[[:en:documentation:start|Go back to Pandora FMS documentation index]]
  