Many people have asked me about the load-balanced FusionPBX cluster I offer to the public. To answer as many of your questions as I can, I am writing this article. I hope that after reading it you will have a clear picture of what a load-balanced FusionPBX cluster is and what it is not.
The first thing I need to clarify is that a load-balanced cluster is not a high-availability cluster. Although the two approaches are not mutually exclusive (you can combine them), their pros and cons are different, and so is the way they work. I will write a comparison between them later; for now, just remember that they are not the same.
A Cluster Overview
A load-balanced cluster deployment needs at least 5 servers. The following picture shows a basic deployment.
The elements are:
Two PBX servers, which are responsible for handling the SIP and RTP flows. These servers have FusionPBX, FreeSWITCH, Memcached, the supporting Lua scripts and other components installed.
Two database servers, which store all the information that FreeSWITCH and FusionPBX use.
One arbitrator server, whose only role is to avoid the split-brain issue when communication between the two database nodes breaks.
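To illustrate why a third server helps against split-brain, here is a minimal sketch of majority voting. It assumes the database layer uses a simple majority quorum, which this article does not specify; the function names and the node labels (db1, db2, arbitrator) are hypothetical.

```python
def has_quorum(reachable_voters: int, total_voters: int) -> bool:
    """Return True when a partition holds a strict majority of all voters."""
    return reachable_voters > total_voters // 2


def writable_nodes(partitions, db_nodes, total_voters):
    """Return the database nodes that sit in a partition holding a majority."""
    writable = set()
    for partition in partitions:
        if has_quorum(len(partition), total_voters):
            writable |= set(partition) & set(db_nodes)
    return writable


DB_NODES = {"db1", "db2"}

# The link between db1 and db2 breaks, but the arbitrator still reaches db1.
with_arbitrator = writable_nodes([{"db1", "arbitrator"}, {"db2"}], DB_NODES, 3)

# Same failure with only two voters: neither side holds a majority.
without_arbitrator = writable_nodes([{"db1"}, {"db2"}], DB_NODES, 2)

print(with_arbitrator)     # {'db1'}
print(without_arbitrator)  # set()
```

With three voters, exactly one side of a broken link can hold a majority, so only one database node keeps accepting writes. With two voters, no side can prove it is the surviving one.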
Pros, Cons and Side Effects of a Load-balanced FusionPBX Cluster
Anyone who runs a cluster needs to remember that a cluster is far from being a stand-alone server. There are internal differences; the information flows are different and, as a consequence, the way it operates is not the same. First, let's list the pros:
Fault tolerance: when a node goes down, the other node will take the load.
More capacity: as both nodes are active, you can hold more extensions in your cluster.
Spreadable: the nodes in the cluster do not need to be in the same data centre.
Cross-server connection: for example, extension 100@domainA is registered in PBX 1 and extension 200@domainB in PBX 2. When extension 100 calls 200, the call will interconnect. The PBXes are aware of their peers and will route the call properly.
Now the cons:
In case of a fault event, some endpoints may need a reboot in order to reconnect. This is due to DNS effects, possibly caused by a local DNS cache or by the way the IP phone's firmware works.
Strict cross-server connection: some conferences, queues or complex ring groups may not work if the endpoints of a given tenant are connected to different nodes. The obvious workaround is to make sure all endpoints of a given tenant are registered in the same PBX.
Some side effects are neither good nor bad, but they are different from a stand-alone server:
Flushing the Memcached: in the worst case, you do not know which server you are connected to when you make dial plan modifications, and you cannot be sure where the endpoints are registered. For example, in the worst-case scenario you are editing a dial plan on PBX 1 but the extension is registered in PBX 2. You will need to flush the Memcached in PBX 2 manually. Depending on the nature of your PBX, you may opt for cron jobs that do it on a regular basis, or just do it on demand. It is up to you; you have been warned.
File synchronization: as with the Memcached behaviour, you do not know which server the call is hitting. Although the PBX cluster has a synchronization mechanism (regardless of which one you selected), the important thing here is to set a synchronization policy. Synchronizing every five seconds is very CPU expensive; you will waste valuable CPU resources, and that will impact the quality of your service. You can opt for a five-minute synchronization policy, or a midnight policy. Whatever it is, do not forget about it: a very common complaint is that an IVR is not playing the proper recording, and the cause is that the file has not been synched yet.
DNS policy: I have written an article about the relationship between DNS and VoIP. If you do not opt for a smart DNS solution, you just need to be careful. For example, if customer A has 90% of their endpoints close to PBX 1, then your SRV and A records should point first to the PBX 1 IP.
Rebooting the database servers: whatever reason makes you reboot the database nodes, never reboot them all at the same time. Reboot one, wait until it recovers, then reboot the next, and so on.
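As an illustration of the Memcached flush mentioned above, here is a minimal sketch that sends the standard flush_all command of the Memcached text protocol to every PBX node. It assumes each Memcached instance listens on the default port 11211 and is reachable from wherever the script runs, which may not be true if Memcached is bound to localhost only. The function names are my own.

```python
import socket


def flush_memcached(host: str, port: int = 11211, timeout: float = 3.0) -> bool:
    """Send flush_all to one Memcached instance; True if it answers OK."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"flush_all\r\n")
        # Read one reply line; Memcached answers "OK\r\n" on success.
        with conn.makefile("rb") as reader:
            reply = reader.readline()
    return reply.strip() == b"OK"


def flush_all_nodes(nodes, flush=flush_memcached):
    """Flush every node and report the result per host.

    A node that cannot be reached is reported as False instead of
    aborting the whole run, so one dead PBX does not block the others.
    """
    results = {}
    for host in nodes:
        try:
            results[host] = flush(host)
        except OSError:
            results[host] = False
    return results
```

Calling flush_all_nodes with the list of your PBX hostnames from a cron job gives you the regular-basis policy; running it by hand after a dial plan change gives you the on-demand one.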
A cluster's information flow is a little different from that of a stand-alone FusionPBX server. In a stand-alone deployment, almost all flows are local; the only external flows are the ones related to the endpoints or to bridging a call to the PSTN. A cluster has some extra flows that cross between the servers.
The endpoints decide which node to connect to by doing DNS requests. When an endpoint connects, the PBX records some information in the database. The database then replicates that information to the other node.
When a call happens, the receiving PBX keeps as much of the processing local as possible, but there is a point in the dial plan where it needs to bridge to an extension. The PBX then consults the database to find out where the desired endpoint is registered. In the best case, the extension is registered in the same receiving PBX and the flow stays local; in the worst case, the extension is found active in another server, and the receiving PBX connects to that other PBX. The other PBX then connects to its local endpoint and the call flows.
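The bridging decision described above can be sketched as a simple lookup. This is only an illustration of the logic, not the actual FusionPBX dial plan code: the registrations dictionary stands in for the replicated database, and the names Route and route_call are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    kind: str    # "local", "peer" or "unregistered"
    target: str  # node that holds the registration, "" if there is none


def route_call(destination: str, receiving_node: str, registrations: dict) -> Route:
    """Decide where the receiving PBX should bridge a call."""
    node = registrations.get(destination)
    if node is None:
        return Route("unregistered", "")
    if node == receiving_node:
        # Best case: the endpoint is registered here, the flow stays local.
        return Route("local", node)
    # Worst case: hand the call to the peer PBX that holds the registration.
    return Route("peer", node)


# Stand-in for the replicated registration table.
registrations = {"100@domainA": "pbx1", "200@domainB": "pbx2"}

print(route_call("200@domainB", "pbx1", registrations))
# Route(kind='peer', target='pbx2')
```

Because both PBX nodes read the same replicated registration data, either one can receive the call and still find the endpoint.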
Fault Tolerance in a Load-balanced FusionPBX Cluster
When a fault tolerance event happens, the endpoints are the ones that decide what to do. Thanks to the information from the DNS, the extensions know which PBX is their second-best option and they will connect to that active PBX. Depending on the endpoint brand, some IP phones may need a reboot; sometimes local DNS caches do not honour the DNS TTL and a router reboot may be needed as well.
The following image shows the fault tolerance event.
The IP phone that was connected to PBX 1 now connects to PBX 2. PBX 1 is the best option for that extension, but as it is not available, the phone connects to its second-best option; in this case, PBX 2.
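The way an endpoint falls back to its second-best option can be sketched with DNS SRV priorities, where the lowest priority value is tried first. This is a simplified sketch: it ignores SRV weights, the hostnames are placeholders, and real phones implement this selection in their own firmware, each in its own way.

```python
from typing import Iterable, NamedTuple, Optional, Set


class SrvRecord(NamedTuple):
    priority: int  # lower value means preferred
    weight: int    # ignored in this sketch
    port: int
    target: str


def pick_pbx(records: Iterable[SrvRecord], reachable: Set[str]) -> Optional[SrvRecord]:
    """Return the most preferred SRV target that is currently reachable."""
    for record in sorted(records, key=lambda r: r.priority):
        if record.target in reachable:
            return record
    return None


records = [
    SrvRecord(20, 0, 5060, "pbx2.example.com"),
    SrvRecord(10, 0, 5060, "pbx1.example.com"),
]

normal = pick_pbx(records, {"pbx1.example.com", "pbx2.example.com"})
failover = pick_pbx(records, {"pbx2.example.com"})

print(normal.target)    # pbx1.example.com
print(failover.target)  # pbx2.example.com
```

The same ordering is what the DNS policy for a customer comes down to: the PBX closest to most of their endpoints gets the lowest priority value.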
I hope this gives you a very clear picture of the way a load-balanced FusionPBX cluster works.