Pandora: Documentation en: HA
Go back to Pandora FMS documentation index

1 High Availability

1.1 Introduction

Pandora FMS is a very stable application, thanks to the testing and improvements included in each version and to the hundreds of issues reported by users that have been fixed. Nevertheless, in critical and/or high-load environments it may be necessary to distribute the load across several machines, so that if any component of Pandora FMS fails the system does not go down. Pandora FMS has been designed to be highly modular: each of its components can work independently, but they have also been designed to work together and to take over the load of components that go down.

The standard Pandora FMS design could be the following one:

Ha1.png

Obviously, the agents are not redundant. If an agent goes down, it makes no sense to run another one in its place: the only reasons an agent stops reporting are that data cannot be obtained because the execution of some module is failing, which cannot be solved by running another agent in parallel, or that the system itself is isolated or has failed. The best solution is to make the critical systems redundant, whether they have Pandora FMS agents or not, and in doing so make the monitoring of those systems redundant as well.

HA can be used in several scenarios:

  • Data Server Balancing and HA.
  • Network Server, WMI, Plugin, Web and Prediction Server Balancing and HA.
  • DDBB Load Balancing.
  • Recon Server Balancing and HA.
  • Pandora FMS Console Balancing and HA.

1.1.1 Data Server Balancing and HA

This is the most complex setup, although at the Pandora FMS level no specific server configuration is required: HA and load balancing are implemented with an external tool instead, either commercial hardware that provides HA and balancing or open source solutions such as vrrpd, LVS or Keepalived.

For the Pandora FMS Data Server you need to install two machines, each with a configured Pandora FMS Data Server (with different hostnames and server names). A Tentacle server should also be configured on each of them and, if necessary, an SSH/FTP server. Keep in mind that with SSH you need to copy the keys of each machine to the server; with Tentacle this is easier, since you only need to replicate the configuration. Each machine will have a different IP address, and the balancer will expose (as with the MySQL cluster) a single IP address to which the agents connect to send their data. The balancer then forwards the data to the corresponding server.

If one server fails, the HA device «promotes» one of the remaining active servers, and the Pandora FMS agents keep connecting to the same address they used before without noticing the change; the load balancer simply stops sending data to the failed server and sends it to another active one. Nothing needs to be changed in the Pandora FMS Data Server. Each server can even keep its own name, which is useful to see in the server status view whether any of them has gone down. Pandora FMS data modules can be processed by any server and no pre-assignment is necessary; it is designed this way precisely to make implementing HA easier.

Another way to implement HA is to have the agents send their data to two different servers: either one of them acting as a standby (active/passive HA) in case the main one fails, or both at the same time, replicating the data in two different and independent Pandora FMS instances. This is described below as "Balancing in the Software Agents".

At the end of this chapter the mechanism to implement HA and load balancing with LVS and Keepalived on a TCP service is described; that service can be the Tentacle port (41121), the SSH port, FTP or any other. The same procedure can be used to cluster two or more systems, in which case the Pandora FMS web console would be served through Apache.

Ha2.png

1.1.1.1 Balancing in the Software Agents

From the software agents it is possible to balance between Data Servers, configuring one Data Server as the main one and another as backup.

In the agent configuration file pandora_agent.conf, uncomment and configure the following section:

# Secondary server configuration
# ==============================
# If secondary_mode is set to on_error, data files are copied to the secondary
# server only if the primary server fails. If set to always, data files are
# always copied to the secondary server
secondary_mode on_error
secondary_server_ip localhost
secondary_server_path /var/spool/pandora/data_in
secondary_server_port 41121
secondary_transfer_mode tentacle
secondary_server_pwd mypassword
secondary_server_ssl no
secondary_server_opts

The available options are the following (for more information, see the Agent Configuration chapter):

  • secondary_mode: Mode of the secondary server. It can have two values:
    • on_error: Sends data to the secondary server only if they could not be sent to the main server.
    • always: Always sends data to the secondary server, regardless of whether the main server can be reached.
  • secondary_server_ip: IP address of the secondary server.
  • secondary_server_path: Path where the XML files are copied on the secondary server, usually /var/spool/pandora/data_in.
  • secondary_server_port: Port through which the XML files are copied to the secondary server: 41121 for Tentacle, 22 for SSH and 21 for FTP.
  • secondary_transfer_mode: Transfer mode used to copy the XML files to the secondary server: tentacle, ssh, ftp, etc.
  • secondary_server_pwd: Password option for transfers through FTP.
  • secondary_server_ssl: Set to yes or no depending on whether you want to use SSL when transferring data through Tentacle.
  • secondary_server_opts: This field is for any other options needed for the transfer.
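
For instance, to replicate every XML data file into a second, fully independent Pandora FMS instance (the active/active scenario mentioned above), the secondary block could look like the following sketch; the IP address is only illustrative and the rest of the values are the ones described above:

secondary_mode always
secondary_server_ip 192.168.70.20
secondary_server_path /var/spool/pandora/data_in
secondary_server_port 41121
secondary_transfer_mode tentacle
secondary_server_ssl no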

1.1.2 Balancing and HA of the Network Servers, WMI, Plugin, Web and Prediction

This is easier. You need to install several servers (Network, WMI, Plugin, Web or Prediction) on several machines of the network, all with the same visibility of the systems you want to monitor. All these machines should be in the same network segment, so that the network latency data stays consistent.

The servers can be selected as primaries. These servers will automatically collect the data from all the modules assigned to a server that is marked as «down». Pandora FMS servers implement a mechanism to detect that one of them has gone down by checking their last contact date (server threshold x 2). A single active Pandora FMS server is enough to detect the failure of the others; if all Pandora FMS servers are down, there is no way to detect the failure or to provide HA.

The obvious way to implement HA and load balancing in a two-node system is to assign 50% of the modules to each server and select both servers as masters. If there are more than two master servers and a third server goes down with modules pending execution, the first master server that executes the module will "self-assign" the modules of the down server. When a down server recovers, the modules that had been assigned to the primary server are automatically assigned back to it.

Ha3.png

The load balancing between the different servers is done in the Agent Administration section in the "setup" menu.

Ha4.png

In the field "server" there is a combo where you can choose the server that will do the checking.

1.1.2.1 Server configuration

A Pandora FMS Server can be running in two different modes:

  • Master mode.
  • Non-master mode.

If a server goes down, its modules will be executed by the master server so that no data is lost.

At any given time there can only be one master server, which is chosen from all the servers with the master configuration token in /etc/pandora/pandora_server.conf set to a value greater than 0:

master [1..7]

If the current master server goes down, a new master server is chosen. If there is more than one candidate, the one with the highest master value is chosen.
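
For example, to make one machine the preferred master and keep the other as a lower-priority candidate, the token could be set as follows in each server's /etc/pandora/pandora_server.conf (the values are illustrative; any value from 1 to 7 is valid and the highest one wins):

# pandora_server.conf on the preferred master
master 7

# pandora_server.conf on the backup candidate
master 1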

Template warning.png

Be careful about disabling servers. If a server with Network modules goes down and the Network Server is disabled in the master server, those modules will not be executed.

 


For example, if you have three Pandora FMS Servers with master set to 1, a master server will be randomly chosen and the other two will run in non-master mode. If the master server goes down, a new master will be randomly chosen.

1.1.3 Load Balancing in the DDBB

It is possible to configure a database cluster to provide HA and load balancing at the same time. The database is the most critical component of the whole architecture, so a cluster is the best option. You only need to convert the database schema into tables compatible with MySQL Cluster. This setup has been tested and works well, but it requires advanced knowledge of cluster administration with MySQL 5, and the nodes need plenty of RAM: a minimum of 2 GiB in a two-node setup for a maximum of 5000 modules (in total).
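
Converting the schema essentially means switching the storage engine of the Pandora FMS tables to NDB. A minimal sketch of that conversion, assuming a working MySQL Cluster and the default database name pandora; the table name is only an example, and not every table or feature is guaranteed to be NDB-compatible, so check the schema of your version first:

# Hedged example: convert a single table to the NDBCLUSTER engine
mysql -u root -p pandora -e "ALTER TABLE tagente_datos ENGINE=NDBCLUSTER;"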

In this case no special Pandora FMS configuration is necessary.

Ha5.png

There are several proposals to implement MySQL HA; see more about this in our annexes (MySQL Cluster, MySQL HA Binary Replication and DRBD).

1.1.4 Balancing and HA of the Recon Servers

For the Recon Server, redundancy is very easy to apply. You only need to install two Recon Servers with alternated tasks, so that if one of them goes down the other one keeps executing the same tasks.

1.1.5 Balancing and HA of Pandora FMS console

In this case no special Pandora FMS configuration is needed either. It is very simple: you only need to install another console. Any of them can be used at the same time from different locations by different users. Using a web balancer in front of the consoles, you can access them without knowing exactly which one you are connected to, since session persistence is managed through cookies kept in the browser. The balancing procedure implementing LVS and HA using Keepalived is described below.
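
As an illustration of such a web balancer, the following is a hypothetical Apache sketch using mod_proxy_balancer with a sticky-session cookie. This is not the LVS/Keepalived method described below; the addresses, route names and the ROUTEID cookie are assumptions, and mod_proxy, mod_proxy_http, mod_proxy_balancer, mod_lbmethod_byrequests and mod_headers must be enabled:

 # Tag each client with the backend console that served it, so its requests stick to that node
 Header add Set-Cookie "ROUTEID=.%{BALANCER_WORKER_ROUTE}e; path=/" env=BALANCER_ROUTE_CHANGED
 <Proxy "balancer://pandora_consoles">
     BalancerMember "http://192.168.70.10/pandora_console" route=node1
     BalancerMember "http://192.168.70.11/pandora_console" route=node2
     ProxySet stickysession=ROUTEID
 </Proxy>
 ProxyPass "/pandora_console" "balancer://pandora_consoles"
 ProxyPassReverse "/pandora_console" "balancer://pandora_consoles"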

1.2 Annex 1: HA implementation and Load Balancing with LVS and Keepalived

For load balancing we advise using Linux Virtual Server (LVS). To manage High Availability (HA) between the services, we advise using Keepalived.

LVS

At present, the main work of the LVS project is to develop an advanced IP load balancing software system (IPVS), application-level load balancing software and components for the management of a service cluster.

IPVS

An advanced IP load balancing system implemented in software inside the Linux kernel itself, already included in kernel versions 2.4 and 2.6.

Keepalived

It is used to manage LVS. Keepalived is used in the cluster to make sure that the SSH servers on both Nodo-1 and Nodo-2 are alive; if one of them goes down, Keepalived tells LVS that that node is down and that requests should be redirected to the node that is still alive.

We have chosen Keepalived as the HA service because it allows session persistence between the servers. That is, if one of the nodes goes down, the users working on that node are redirected to the node that is still alive and find themselves exactly where they were before, making the failure fully transparent to their work and sessions (in the case of SSH this does not work because of SSH's encryption logic, but simple TCP sessions, such as Tentacle without SSL or FTP, work without problems). With Tentacle/SSH the communication should be retried, so that the information in the data packet is not lost.
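
When this kind of persistence is wanted at the balancer level, Keepalived exposes it through the persistence_timeout directive of the virtual_server block (a standard Keepalived option; the timeout value below is only illustrative):

 virtual_server 192.168.1.1 22 {
        # keep each client on the same real server for 300 seconds
        persistence_timeout 300
        # ... remaining directives as in the full configuration shown in Annex 3
 }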

The Keepalived configuration file and the commands to use it can be found in Annex 3.


Load Balancing Algorithm

The two most used algorithms nowadays are «Round Robin» and «Weighted Round Robin». They are very similar and both are based on assigning work in turns.

«Round Robin» is one of the simplest process scheduling algorithms in an operating system: it assigns each process an equal, ordered share of time, treating all processes with the same priority.

The «Weighted Round Robin» algorithm, on the other hand, allows load to be assigned to the machines inside the cluster so that a given number of requests goes to one node or another depending on its weight within the cluster.

Weighting makes no sense in the topology considered here, since both machines have exactly the same hardware. For all these reasons we have decided to use «Round Robin» as the load balancing algorithm.
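
For reference, a weighted setup would use the wrr scheduler and per-node weights. A hedged sketch, using the same illustrative addresses as Annex 2:

ipvsadm -A -t ip_cluster:22 -s wrr
ipvsadm -a -t ip_cluster:22 -r 192.168.1.10:22 -m -w 3
ipvsadm -a -t ip_cluster:22 -r 192.168.1.11:22 -m -w 1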

1.2.1 Action when a node is down

Keepalived detects when one of the services goes down. When that happens, it removes the failed node from the list of active LVS nodes, so that all requests addressed to the failed node are redirected to the active node.

Once the problem with the failed service has been solved, you should restart Keepalived:

/etc/init.d/keepalived restart

When the service is restarted, the nodes are inserted again into the list of available LVS nodes.

If one of the nodes goes down, it is not necessary to insert the nodes manually with ipvsadm; Keepalived will do it once it restarts and verifies, through its «HealthCheckers», that the services it is supposed to keep highly available are running and reachable.

1.3 Annex 2. LVS Balancer Configuration

Use of ipvsadm:

Setting up the Linux balancing director with ipvsadm:

ipvsadm -A -t ip_cluster:22 -s rr

The options are:

  • -A: Add a virtual service.
  • -t: TCP service, specified as IP:port.
  • -s: Scheduler; in this case the "rr" parameter (round robin) is used.

Add the nodes (real servers) to which the requests on port 22 will be redirected:

ipvsadm -a -t ip_cluster:22 -r 192.168.1.10:22 -m
ipvsadm -a -t ip_cluster:22 -r 192.168.1.11:22 -m
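
The resulting table can be checked by listing the configured virtual services and real servers (standard ipvsadm switches; add -n to show numeric addresses instead of resolved names):

ipvsadm -L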

Without active connections, the ipvsadm output is the following:

Prot LocalAddress:Port Scheduler Flags 
 -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP cluster:ssh rr 
 -> nodo-2:ssh                    Masq    1         0        0
 -> nodo-1:ssh                    Masq    1         0        0

Since the «Round Robin» algorithm is used, both machines have the same weight in the cluster, so the connections are shared between them. Here is an example of LVS balancing connections across the cluster:

Prot LocalAddress:Port Scheduler Flags 
 -> RemoteAddress:Port      Forward Weight ActiveConn InActConn

TCP cluster:ssh rr 
 -> nodo-2:ssh              Masq     1         12        161
 -> nodo-1:ssh              Masq     1         11        162

1.4 Annex 3. KeepAlived Configuration

Keepalived is in charge of checking that the services defined in its configuration file (/etc/keepalived/keepalived.conf) are up, and of keeping the different hosts in the balancing cluster. If any of these services goes down, it removes that host from the balancing cluster.

To start Keepalived:

/etc/init.d/keepalived start

To stop Keepalived:

/etc/init.d/keepalived stop

The configuration file used for the cluster is the following one:

 # Configuration File for keepalived
  global_defs {
      notification_email {
          [email protected]
      }
      notification_email_from [email protected]
      smtp_server 127.0.0.1
      smtp_connect_timeout 30
      lvs_id LVS_MAIN
 }
 
 virtual_server 192.168.1.1 22 {
        delay_loop 30
        lb_algo rr
        lb_kind NAT
        protocol TCP
        real_server 192.168.1.10 22 {
              weight 1
               TCP_CHECK {
                        connect_port 22
                        connect_timeout 3
                        nb_get_retry 3
                        delay_before_retry 1
                }
        }
        real_server 192.168.1.11 22 {
              weight 1
              TCP_CHECK {
                        connect_port 22
                        connect_timeout 3
                        nb_get_retry 3
                        delay_before_retry 1
              }
        }
 }

Go back to Pandora FMS documentation index