Package com.hazelcast.jet.pipeline

The Pipeline API is Jet's high-level API to build and execute distributed computation jobs. It models the computation using an analogy with a system of interconnected water pipes. The data flows from the pipeline's sources to its sinks. Pipes can bifurcate and merge, but there can't be any closed loops (cycles).

The basic element is a pipeline stage, which can be attached to one or more other stages, both in the upstream and the downstream direction. A stage accepts the data coming from its upstream stages, transforms it, and directs the resulting data to its downstream stages.
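
For example, the following sketch (the IList names "lines", "errors", and "other" are illustrative assumptions, and the API names follow Jet 3.x) builds one source stage that feeds two downstream stages, so the pipe bifurcates without forming a cycle:

    import com.hazelcast.jet.pipeline.BatchStage;
    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.Sources;

    // One upstream stage attached to two downstream stages: the pipe
    // bifurcates, but the overall shape stays acyclic.
    Pipeline p = Pipeline.create();
    BatchStage<String> lines = p.drawFrom(Sources.<String>list("lines"));
    lines.filter(line -> line.startsWith("ERROR")).drainTo(Sinks.list("errors"));
    lines.filter(line -> !line.startsWith("ERROR")).drainTo(Sinks.list("other"));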

Kinds of transformation performed by pipeline stages

Basic

Basic transformations have a single upstream stage and statelessly transform individual items in it. Examples are map, filter, and flatMap.
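
As a minimal sketch (Jet 3.x API names, with assumed IList names "lines" and "words"), the three basic transforms compose like this:

    import com.hazelcast.jet.Traversers;
    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.Sources;

    Pipeline p = Pipeline.create();
    p.drawFrom(Sources.<String>list("lines"))
     .map(String::toLowerCase)                                       // stateless one-to-one transform
     .filter(line -> !line.isEmpty())                                // keep only items that pass the predicate
     .flatMap(line -> Traversers.traverseArray(line.split("\\W+")))  // one-to-many transform
     .drainTo(Sinks.list("words"));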

Grouping and aggregation

The aggregate*() transformations perform an aggregate operation on a set of items. You can call stage.groupingKey() to group the items by a key, and then Jet will aggregate each group separately. For stream stages you must also specify a stage.window(), which transforms the infinite stream into a series of finite windows. If you specify more than one input stage for the aggregation (using stage.aggregate2(), stage.aggregate3(), or stage.aggregateBuilder()), the data from all the streams will be combined into the aggregation result. The AggregateOperation you supply must define a separate accumulate primitive for each contributing stream. Refer to its Javadoc for further details.
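
The following sketch shows a keyed aggregation, first over a batch stage and then over a stream stage with a tumbling window. The source/sink names, the Trade type, and its getTicker() accessor are assumptions for illustration (Jet 3.x API names):

    import java.util.Map.Entry;

    import com.hazelcast.jet.aggregate.AggregateOperations;
    import com.hazelcast.jet.pipeline.JournalInitialPosition;
    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.Sources;
    import com.hazelcast.jet.pipeline.WindowDefinition;

    Pipeline p = Pipeline.create();

    // Batch: count the occurrences of each distinct word in an IList.
    p.drawFrom(Sources.<String>list("words"))
     .groupingKey(word -> word)                        // group the items by a key
     .aggregate(AggregateOperations.counting())        // Jet aggregates each group separately
     .drainTo(Sinks.map("wordCounts"));                // one Entry<String, Long> per key

    // Stream: count trades per ticker in tumbling one-second windows,
    // reading from an IMap's event journal (assumed map name "trades").
    p.drawFrom(Sources.<String, Trade>mapJournal("trades", JournalInitialPosition.START_FROM_OLDEST))
     .withIngestionTimestamps()                        // timestamps are needed before windows can close
     .map(Entry::getValue)
     .groupingKey(Trade::getTicker)
     .window(WindowDefinition.tumbling(1_000))         // series of finite 1-second windows
     .aggregate(AggregateOperations.counting())
     .drainTo(Sinks.logger());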

Hash-join

Hash-join is a special kind of joining transform, tailored to the use case of data enrichment. It is an asymmetric join that joins one or more enriching stages to the primary stage. The source for an enriching stage is typically a key-value store (such as a Hazelcast IMap). It must be a batch stage, and each of its items must have a distinct join key. The primary stage, on the other hand, may be either a batch or a stream stage and may contain duplicate keys.

For each of the enriching stages there is a separate pair of functions to extract the joining key on both sides. For example, a Trade can be joined with both a Broker on trade.getBrokerId() == broker.getId() and a Product on trade.getProductId() == product.getId(), and all this can happen in a single hash-join transform.
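
A sketch of that enrichment with the Jet 3.x hash-join API; the Trade, Broker, and Product types, their accessors, and the assumption that the "brokers" and "products" IMaps are keyed by broker id and product id are all illustrative:

    import java.util.Map.Entry;

    import com.hazelcast.jet.datamodel.Tuple3;
    import com.hazelcast.jet.pipeline.BatchStage;
    import com.hazelcast.jet.pipeline.JoinClause;
    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.Sources;

    Pipeline p = Pipeline.create();
    BatchStage<Trade> trades = p.drawFrom(Sources.<Trade>list("trades"));
    BatchStage<Entry<Long, Broker>> brokers = p.drawFrom(Sources.<Long, Broker>map("brokers"));
    BatchStage<Entry<Long, Product>> products = p.drawFrom(Sources.<Long, Product>map("products"));

    // One hash-join transform joining the primary "trades" stage with two
    // enriching stages: trade.getBrokerId() matches the brokers map key and
    // trade.getProductId() matches the products map key.
    BatchStage<Tuple3<Trade, Broker, Product>> enriched = trades.hashJoin2(
            brokers, JoinClause.joinMapEntries(Trade::getBrokerId),
            products, JoinClause.joinMapEntries(Trade::getProductId),
            Tuple3::tuple3);
    enriched.drainTo(Sinks.list("enrichedTrades"));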

The implementation of the hash-join transform is optimized for throughput: each computing member keeps a local copy of all the enriching data, stored in hash tables (hence the name). The enriching streams are consumed in full before any data is ingested from the primary stream.

Copyright © 2019 Hazelcast, Inc. All rights reserved.