Lossless Cluster Restart

The Lossless Cluster Restart feature allows you to gracefully shut down the cluster at any time and have the computational data of all the jobs preserved. After you restart the cluster, Jet automatically restores the data and resumes the jobs.

This allows Jet cluster to be shut down gracefully and restarted without a data loss. When the cluster restarts, the jobs will automatically be restarted, without having to submit them to the cluster again.

Jet jobs regularly snapshot their state. A state snapshot can be used as a recovery point. Snapshots are stored in memory. Lossless Cluster Restart works by persisting the RAM-based snapshot data to disk. It uses the Hot Restart Persistence feature of Hazelcast under the hood.

Enabling Lossless Restart

To enable Lossless Cluster Restart, you need to enable it in hazelcast-jet.yaml:

hazelcast-jet:
  instance:
    lossless-restart-enabled: true

By default the snapshot will be written to <JET_HOME>/recovery folder. To change the directory where the data will be stored, configure the hot-restart-persistence option in hazelcast.yaml:

hazelcast:
  hot-restart-persistence:
    enabled: true
    base-dir: /var/hazelcast/recovery

See the docs for a detailed description including fine tuning.

Safely Shutting Down a Jet Cluster

For lossless restart to work the cluster must be shut down gracefully. When members are shutdown one by one in a rapid succession, Jet triggers an automatic rebalancing process where backup partitions are promoted and new backups are created for each member. This may result in out-of-memory errors or data loss.

The entire cluster can be shut down using the jet-cluster-admin command line tool.

bin/jet-cluster-admin -a <address> -c <cluster-name> -o shutdown

For the command line tool to work, REST should be enabled in hazelcast.yaml:

hazelcast:
  network:
    rest-api:
      enabled: true
      endpoint-groups:
        CLUSTER_READ:
          enabled: true
        CLUSTER_WRITE:
          enabled: true
        HEALTH_CHECK:
          enabled: true
        HOT_RESTART:
          enabled: true

Alternatively, you can shutdown the cluster using the Jet Management Center:

Cluster Shutdown using Management Center

Since the Hot Restart data is saved locally on each member, all the members must be present after the restart for Jet to be able to reload the data. Beyond that, there’s no special action to take: as soon as the cluster re-forms, it will automatically reload the persisted snapshots and resume the jobs.

Fault Tolerance Considerations

Lossless Cluster Restart is a maintenance tool. It isn't a means of fault tolerance.

The purpose of the Lossless Cluster Restart is to provide a maintenance window for member operations and restart the cluster in a fast way. As a part of the graceful shutdown process Jet makes sure that the in-memory snapshot data were persisted safely.

The fault tolerance of Jet is entirely RAM-based. Snapshots are stored in multiple in-memory replicas across the cluster. Given the redundancy present in the cluster this is sufficient to maintain a running cluster across single-node failures (or multiple-node, depending on the backup count), but it doesn’t cover the case when the entire cluster must shut down.

This design decision allows Jet to exclude disk from the operations that affect job execution, leading to low and predictable latency.

Performance considerations

Even with the fastest solid-state storage, Lossless Cluster Restart reduces the maximum snapshotting throughput available to Jet: while DRAM can take up to 50 GB/s, a high-end storage device caps at 2 GB/s. Keep in mind that this is throughput per member, so even on a minimal cluster of 3 machines, this is actually 6 GB/s available to Jet. Also, this throughput limit does not apply to the data going through the Jet pipeline, but only to the periodic saving of the state present in the pipeline. Typically the state scales with the number of distinct keys seen within the time window.

4.3.1

Change Data Capture

Enabling Lossless Restart

Safely Shutting Down a Jet Cluster

Fault Tolerance Considerations

Performance considerations