Disasters do not happen often, but they do happen, and we have to be prepared to handle them properly to protect your data and keep the system available to you.

There are 2 levels of disaster we are prepared for:

  1. A single server failure
  2. Entire cluster failure

Please note that disasters like these are very rare. In our entire history we have had a handful of single server failures, and we have never had an entire cluster failure.

A single server failure

For a single server failure on a cluster with 2 servers, restoration is fully automated and transparent.

For Medium and Large systems, all applications from the failed server are redeployed to the remaining server right away. Because the application data is mirrored on both servers, the applications are ready to use without delay. The service interruption usually lasts a few seconds, and normally not more than 45 seconds.

For Small systems, all application data is still mirrored on both servers, but the applications themselves do not fit into the memory of a single server. Therefore, they cannot be redeployed until a second server is added to the cluster. This usually takes about 11 minutes but may take as long as 30 minutes. The server applications are then redeployed on the new server.

In both cases, after the second server is added, the application data must be re-synced to the new server, which may temporarily affect the performance of the system. The effect is typically not significant unless the server is under very high load.

A Personal system runs on a single server only. Therefore, if this server fails, it is the same as an entire cluster failure. How we handle an entire cluster failure is explained below.
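For readers who prefer to see the rules above as logic, here is a minimal Python sketch of how a single server failure is handled for each system size. It is only an illustration: the function name, the RecoveryPlan type and its fields are hypothetical and are not part of our actual tooling. The durations are the ones quoted in this article and are not guarantees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryPlan:
    """Hypothetical summary of what happens after a single server failure."""
    action: str
    typical: str  # usual interruption, as quoted in this article
    upper: str    # longer figure quoted in this article (not a guarantee)


def plan_single_server_failure(system_size: str) -> RecoveryPlan:
    """Map a system size to the recovery path described in this article."""
    if system_size in ("Medium", "Large"):
        # Data is mirrored and the apps fit on one server: redeploy at once.
        return RecoveryPlan(
            action="redeploy applications to the remaining server",
            typical="a few seconds",
            upper="45 seconds",
        )
    if system_size == "Small":
        # Data is mirrored, but the apps do not fit into one server's memory,
        # so a replacement server has to join the cluster first.
        return RecoveryPlan(
            action="add a replacement server, then redeploy applications on it",
            typical="about 11 minutes",
            upper="30 minutes",
        )
    if system_size == "Personal":
        # Only one server exists, so losing it is an entire cluster failure.
        return RecoveryPlan(
            action="restore the entire cluster from off-site backup",
            typical="a few hours",
            upper="2 days",
        )
    raise ValueError(f"unknown system size: {system_size!r}")


if __name__ == "__main__":
    for size in ("Personal", "Small", "Medium", "Large"):
        plan = plan_single_server_failure(size)
        print(f"{size}: {plan.action} (usually {plan.typical})")
```

The sketch keeps the three paths separate on purpose, because they differ in what has to happen before applications can run again: nothing extra for Medium and Large, a new server for Small, and a full restore for Personal.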

Entire cluster failure

An entire cluster failure happens when all the servers are lost. There is no machine left to run server applications and no data mirror available on any disk, so nothing works at all.

Of course, we are prepared for this case. Every system makes automatic backups on a daily basis. Backups are stored "off-site", away from the cluster. Therefore, if the entire cluster fails, all customer data is still safe. The entire cluster, all server apps, and user data can be restored from these backups.

However, restoring a cluster from backup is a semi-automated process that requires our technical team to supervise it. Also, depending on how much user data was kept on the cluster, transferring this data from backup, restoring data volumes, and rebuilding everything may take significant time. The process usually takes a few hours to complete, but it may take up to 2 days in some cases.