.. _resilience_architecture:

***********************
Resilience Architecture
***********************

.. contents::
   :local:

The Resilience solution in XiVO makes it possible to maintain basic telephony functions whether your main XiVO
server is running or not. When running a XiVO HA cluster, users are guaranteed to never experience a downtime
of more than 5 minutes of their basic telephony service.

The Resilience solution in XiVO is based on a two-node "main and standby" architecture. In the normal situation,
both the main and standby nodes are running in parallel, the standby acting as a "hot standby", and all the
telephony services are provided by the main node. If the main fails or must be shut down for maintenance, then
the standby node automatically takes over the telephony services. Supported telephony devices automatically
communicate with the standby node instead of the main one.

Once the main is up again, the standby node stops its telephony services and the telephony devices fail back to
the main node.

Currently, resilience is supported with:

* XiVO deployment
* CC deployment
* XDS deployment


Prerequisites
=============

* Phones must be able to reach both the main and the standby
* Main and standby nodes must be in the same subnet
* If firewalling, the main must be allowed to reach the standby on ports 22 and 5432
* If firewalling, the standby must be allowed to reach the main with an ICMP ping
* Trunk registration timeout (``expiry``) should be less than 300 seconds (5 minutes)
* The standby must have **no** provisioning plugins installed.

The Resilience solution is guaranteed to work correctly with `the following devices `_.


.. _resilience_architecture_failover_mode:

Failover Mode
=============

Resilience comes in two failover modes: automatic or manual.

Automatic Failover
------------------

.. note:: Refer to :ref:`resilience_administration_automatic_failover` for more information

When choosing to operate in **Automatic failover** mode:

* the standby XiVO actively pings the main XiVO
* if the standby XiVO cannot ping the main XiVO, it automatically activates and starts the services
* when the standby XiVO can ping the main XiVO again, it automatically deactivates and stops the services

Manual Failover
---------------

.. note:: Refer to :ref:`resilience_administration_manual_failover` for more information

When choosing to operate in **Manual failover** mode:

* there is no heartbeat between the standby and the main XiVO
* the failover must be triggered by an admin via the Web interface


Resilience in XiVO deployment
=============================

.. figure:: images/resilience_archi_xivo.png

In a simple XiVO deployment another XiVO is added in standby mode.
In standby mode the XiVO pings the main XiVO every minute.

* if the ping succeeds (main is up): the standby XiVO shuts down the telephony services
* if more than 3 pings fail (main is down): the standby XiVO starts the telephony services


Resilience in CC deployment
===========================

.. figure:: images/resilience_archi_cc.png

In a CC deployment another CC is configured with the standby XiVO.
The standby CC is up and running.


Resilience in XDS deployment
============================

.. figure:: images/resilience_archi_xds.png

In an XDS deployment no other node is added. The MDS nodes are shared between the main and the standby XiVO.
When the standby starts the telephony services, it registers the inter-MDS trunk with the other MDS
so that inter-MDS calls can work.


Resilience with Edge deployment
===============================

.. note:: Resilience can work with a single Edge deployment

.. figure:: images/resilience_archi_edge.png

Edge must know where the CC and the XiVO are located. Therefore a reconfiguration of the Edge is needed
when the standby takes over.
The standby XiVO and CC must be configured with the Edge information (mainly the TURN secret).


Replication
===========

Once the main/standby configuration is completed, the XiVO configuration (DB and files) is replicated from the
main node to the standby every hour (:00).

DB Replication
--------------

The replication does not copy the full XiVO configuration of the main. Notably, these are **excluded**:

* All the network configuration **except DHCP configuration** (i.e. everything under the
  :menuselection:`Configuration --> Network --> {Interfaces, Resolver, Mail}` sections)
* All the support configuration (i.e. everything under the :menuselection:`Configuration --> Support` section)
* Resilience settings
* Access Web Services configuration
* Provisioning configuration
* Voicemail messages

These event data are also excluded:

* Queue logs
* CELs

File Replication
----------------

The following directories will then be rsync'ed every hour:

* /etc/asterisk/extensions_extra.d
* /etc/xivo/asterisk
* /var/lib/asterisk/agi-bin
* /var/lib/asterisk/moh
* /var/lib/xivo/certificates
* /var/lib/xivo/sounds/acd
* /var/lib/xivo/sounds/playback


Limitations
===========

Architecture:

* Since DHCP parameters are replicated, the main and standby nodes MUST be on the same VoIP network.

When the main node is down, some features are not available and some behave a bit differently. This includes:

* Call history / call records are not recorded.
* Voicemail messages saved on the main node are not available.
* Custom voicemail greetings recorded on the main node are not available.
* Phone provisioning is disabled, i.e. a phone will always keep the same configuration, even after restarting it.
* Phone remote directory is not accessible, because the provisioned IP address points to the main.

Note that, on failover and on failback:

* The statuses of DND, call forwards, call filtering, etc. may be lost if they were changed recently.
* If you are connected as an agent, then you might need to reconnect as an agent when the main goes down.
  Since it's hard to know when the main goes down, if your CTI client disconnects and you can't reconnect it,
  then it's a sign the main might be down.

Additionally, only on failback:

* Voicemail messages are not copied from the standby to the main, i.e. if someone left a message on your
  voicemail when the main was down, you won't be able to consult it once the main is up again.
* More generally, custom sounds are not copied back. This includes recordings.

Here's the list of limitations that are more relevant from an administrator's standpoint:

* The main status is either up or down; there is no intermediate status. This means that if Asterisk has crashed
  but the XiVO host still answers pings, the main is considered up and the failover will NOT happen.
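
To illustrate why a crashed Asterisk does not trigger a failover, the automatic failover decision described in
this page can be sketched as a small state machine driven only by ping results. This is a minimal illustration in
Python, not XiVO's actual implementation: the ``StandbyMonitor`` class and all its names are hypothetical, and
counting *consecutive* failed pings is an assumption (the text only states "more than 3").

.. code-block:: python

   from dataclasses import dataclass

   # From the text: the standby takes over when "more than 3" pings fail.
   FAILED_PING_THRESHOLD = 3


   @dataclass
   class StandbyMonitor:
       """Hypothetical model of the standby's decision logic (illustration only)."""

       failed_pings: int = 0
       services_running: bool = False

       def on_ping_result(self, main_reachable: bool) -> bool:
           """Process one ping result and return whether the standby now
           runs the telephony services."""
           if main_reachable:
               # Main answers the ping: it is considered up, so the standby
               # stops its telephony services (failback).
               self.failed_pings = 0
               self.services_running = False
           else:
               # Assumption: failures are counted consecutively.
               self.failed_pings += 1
               if self.failed_pings > FAILED_PING_THRESHOLD:
                   # Main is considered down: the standby takes over.
                   self.services_running = True
           return self.services_running


   # One ping per minute: main up, then four failed pings, then main back up.
   monitor = StandbyMonitor()
   states = [monitor.on_ping_result(ok)
             for ok in [True, False, False, False, False, True]]
   # states == [False, False, False, False, True, False]

The only input of this sketch is whether the ping succeeded. The health of Asterisk on the main node never enters
the decision, which matches the limitation stated above.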