Job Management

Once a Jet job is submitted, it has its own lifecycle on the cluster which is distinct from the submitter. Jet offers several ways to manage the job after it's been submitted to the cluster.

Submitting Jobs

You can submit jobs to a cluster using the submit command and packaging the job as a JAR:

$ bin/jet submit -n hello-world examples/hello-world.jar arg1 arg2
Submitting JAR 'examples/hello-world.jar' with arguments [arg1, arg2]
Using job name 'hello-world'

For a full guide on submitting jobs, see the relevant section in the Programming Guide.

Listing Jobs

Use the list-jobs command to get a list of all jobs running in the cluster:

$ bin/jet list-jobs
ID                  STATUS             SUBMISSION TIME         NAME
0401-9f77-b9c0-0001 RUNNING            2020-03-07T15:59:49.234 hello-world

You can also see completed jobs, by specifying the -a parameter:

$ bin/jet list-jobs -a
ID                  STATUS             SUBMISSION TIME         NAME
0402-de9d-35c0-0001 RUNNING            2020-03-08T15:14:11.439 hello-world-v2
0402-de21-7f00-0001 FAILED             2020-03-08T15:12:04.893 hello-world

Cancelling Jobs

Only a batch job may complete successfully; a streaming Jet job runs indefinitely until cancelled. You can cancel a job using the cancel <job_name_or_id> command:

$ bin/jet cancel hello-world
Cancelling job id=0402-de21-7f00-0001, name=hello-world, submissionTime=2020-03-08T15:12:04.893
Job cancelled.

Once a job is cancelled, the snapshot for the job is lost and the job can't be resumed. Cancelled jobs will have the "failed" status.

Auto-scaling

By default, Jet jobs scale up and down automatically when you add or remove a cluster node. To rescale a job, Jet must restart it. You can find an in-depth explanation of Jet's fault tolerance design in the Architecture section.

When auto-scaling is off and you add a new node to a cluster, the job will keep running on the previous nodes but not on the new one. However, if the job restarts for whatever reason, Jet will automatically scale it to the whole cluster.

The exact behavior of what happens when a node joins or leaves depends on whether a job is configured with a processing guarantee and with auto-scaling. The table below shows the behavior of a job after a cluster change depending on these two settings.

Auto-Scaling Setting	Processing Guarantee Setting	Member Added	Member Removed
enabled (default)	any setting	restart after a delay	restart immediately
disabled	none	keep job running on old members	fail job
disabled	at-least-once or exactly-once	keep job running on old members	suspend job

Suspending and Resuming

Jet supports manually suspending and resuming a streaming job in a fault-tolerant way. The job must be configured with a processing guarantee.

When a job is suspended, all the metadata about the job is still kept in the cluster. A snapshot of the job computational state is taken during a suspend operation and then on a resume the job is gracefully started from the same snapshot.

Suspending and resuming can be useful for example when you need to perform maintenance on a data source or sink without disrupting a running job.

Use the suspend <job_name_or_id> and resume <job_name_or_id> commands to suspend and resume jobs:

$ bin/jet suspend hello-world
Suspending job id=0401-9f77-b9c0-0001, name=hello-world, submissionTime=2020-03-07T15:59:49.234...
Job suspended.

$ bin/jet resume hello-world
Resuming job id=0401-9f77-b9c0-0001, name=hello-world, submissionTime=2020-03-07T15:59:49.234...
Job resumed.

You can also configure a job to be suspended automatically if its execution fails (see `JobConfig.setSuspendOnFailure).

Restarting

It's also possible to simply restart a job without suspending and resuming in one step. This can be useful when you want to have finer-grained control on when the job should be scaled. For example, if you have auto-scaling off and are adding 3 nodes to a cluster you can manually restart at the desired point to have the jobs utilizing all of the new nodes. This can be achieved with the restart <job_name_or_id> command, just as in the other examples above.

4.5.4

Change Data Capture