Updated on November 25, 2021
You can configure ConsenSys Rollups to be highly available by having multiple instances of the same operator. If one operator goes down, then the other instance continues to process batches and operations from the accounts managed by the operator.
Multiple operators in a rollup do not offer high availability because operators can manage different accounts.
In the following high availability example, an operator has multiple instances, with each instance in a separate data center.
Incoming operations such as creating money orders, and redeeming money orders and accounts are replicated
to all operator instances. Operations are replicated using Kafka streams. The
operator’s manager saves incoming operations to a Kafka topic (
pending-operations) before forwarding
to the engine.
Each instance saves the incoming operation in its own Kafka topic, and the other instance replays the the operation from the other instance. This ensures that every instance has all the operations submitted to the operator, and the operations will eventually be executed, regardless of the instance that submits the batch to the rollup smart contract.
Each engine instance listens to the rollup smart contract events and populates the Kafka
topic, meaning all manager instances are up to date. The manager detects when the engine fails and
reports the operators as failing. This way the API load balancer or gateway can forward the request
to another instance.
Configure the Kafka high availability settings in the configuration file.
High availability modes
Highly available operator instances can run in either active-active mode, or active-passive mode.
The active-passive mode may be more efficient due to:
- Not wasting CPU cycles keeping multiple provers running.
- Not spending gas fees by submitting concurrent batches when only one batch will succeed.
A blockchain is asynchronous, meaning active-passive can still lead to multiple batch creations at the same time.
Multi-site high availability configuration allows the system to remain available in the case of site-wide failures.
To ensure zero data loss, use Kafka to replicate data across multiple sites. Each site must have
its own independent Kafka cluster to save the pending operations. The
pending-operations topics must
have a replication factor of 2 or more, and each cluster must have at least one replica located in
Kafka must be replicated with a strategy that ensures that a single site failure does not block the system.
Synchronous writes ensure that there is no data loss, but are usually very expensive, both in terms of the hardware and network requirements, and system latency.