apache_beam.runners.dataflow.dataflow_runner module

A runner implementation that submits a job for remote execution.

The runner will create a JSON description of the job graph and then submit it to the Dataflow Service for remote execution by a worker.

class apache_beam.runners.dataflow.dataflow_runner.DataflowRunner(cache=None)[source]

Bases: apache_beam.runners.runner.PipelineRunner

A runner that creates job graphs and submits them for remote execution.

Every execution of the run() method will submit an independent job for remote execution that consists of the nodes reachable from the passed in node argument or entire graph if node is None. The run() method returns after the service created the job and will not wait for the job to finish if blocking is set to False.

is_fnapi_compatible()[source]
apply(transform, input, options)[source]
static poll_for_job_completion(runner, result, duration)[source]

Polls for the specified job to finish running (successfully or not).

Updates the result with the new job information before returning.

Parameters:
  • runner – DataflowRunner instance to use for polling job state.
  • result – DataflowPipelineResult instance used for job information.
  • duration (int) – The time to wait (in milliseconds) for job to finish. If it is set to None, it will wait indefinitely until the job is finished.
static side_input_visitor(use_unified_worker=False, use_fn_api=False, deterministic_key_coders=True)[source]
static flatten_input_visitor()[source]
static combinefn_visitor()[source]
run_pipeline(pipeline, options)[source]

Remotely executes entire pipeline or parts reachable from node.

get_pcoll_with_auto_sharding()[source]
add_pcoll_with_auto_sharding(applied_ptransform)[source]
run_Impulse(transform_node, options)[source]
run_Flatten(transform_node, options)[source]
apply_GroupByKey(transform, pcoll, options)[source]
run_GroupByKey(transform_node, options)[source]
run_ExternalTransform(transform_node, options)[source]
run_ParDo(transform_node, options)[source]
run_CombineValuesReplacement(transform_node, options)[source]
run_Read(transform_node, options)[source]
run__NativeWrite(transform_node, options)[source]
run_TestStream(transform_node, options)[source]
classmethod serialize_windowing_strategy(windowing, default_environment)[source]
classmethod deserialize_windowing_strategy(serialized_data)[source]
static byte_array_to_json_string(raw_bytes)[source]

Implements org.apache.beam.sdk.util.StringUtils.byteArrayToJsonString.

static json_string_to_byte_array(encoded_string)[source]

Implements org.apache.beam.sdk.util.StringUtils.jsonStringToByteArray.

get_default_gcp_region()[source]

Get a default value for Google Cloud region according to https://cloud.google.com/compute/docs/gcloud-compute/#default-properties. If no default can be found, returns None.

class CombineValuesPTransformOverride

Bases: apache_beam.pipeline.PTransformOverride

A PTransformOverride for CombineValues.

The DataflowRunner expects that the CombineValues PTransform acts as a primitive. So this override replaces the CombineValues with a primitive.

get_replacement_inputs(applied_ptransform)

Provides inputs that will be passed to the replacement PTransform.

Parameters:applied_ptransform – Original AppliedPTransform containing the PTransform to be replaced.
Returns:An iterable of PValues that will be passed to the expand() method of the replacement PTransform.
get_replacement_transform(ptransform)
get_replacement_transform_for_applied_ptransform(applied_ptransform)

Provides a runner specific override for a given AppliedPTransform.

Parameters:applied_ptransformAppliedPTransform containing the PTransform to be replaced.
Returns:A PTransform that will be the replacement for the PTransform inside the AppliedPTransform given as an argument.
matches(applied_ptransform)
class CreatePTransformOverride

Bases: apache_beam.pipeline.PTransformOverride

A PTransformOverride for Create in streaming mode.

get_replacement_inputs(applied_ptransform)

Provides inputs that will be passed to the replacement PTransform.

Parameters:applied_ptransform – Original AppliedPTransform containing the PTransform to be replaced.
Returns:An iterable of PValues that will be passed to the expand() method of the replacement PTransform.
get_replacement_transform(ptransform)

Provides a runner specific override for a given PTransform.

Parameters:ptransform – PTransform to be replaced.
Returns:A PTransform that will be the replacement for the PTransform given as an argument.
get_replacement_transform_for_applied_ptransform(applied_ptransform)
matches(applied_ptransform)
class JrhReadPTransformOverride

Bases: apache_beam.pipeline.PTransformOverride

A PTransformOverride for Read(BoundedSource)

get_replacement_inputs(applied_ptransform)

Provides inputs that will be passed to the replacement PTransform.

Parameters:applied_ptransform – Original AppliedPTransform containing the PTransform to be replaced.
Returns:An iterable of PValues that will be passed to the expand() method of the replacement PTransform.
get_replacement_transform(ptransform)

Provides a runner specific override for a given PTransform.

Parameters:ptransform – PTransform to be replaced.
Returns:A PTransform that will be the replacement for the PTransform given as an argument.
get_replacement_transform_for_applied_ptransform(applied_ptransform)
matches(applied_ptransform)
class NativeReadPTransformOverride

Bases: apache_beam.pipeline.PTransformOverride

A PTransformOverride for Read using native sources.

The DataflowRunner expects that the Read PTransform using native sources act as a primitive. So this override replaces the Read with a primitive.

get_replacement_inputs(applied_ptransform)

Provides inputs that will be passed to the replacement PTransform.

Parameters:applied_ptransform – Original AppliedPTransform containing the PTransform to be replaced.
Returns:An iterable of PValues that will be passed to the expand() method of the replacement PTransform.
get_replacement_transform(ptransform)
get_replacement_transform_for_applied_ptransform(applied_ptransform)

Provides a runner specific override for a given AppliedPTransform.

Parameters:applied_ptransformAppliedPTransform containing the PTransform to be replaced.
Returns:A PTransform that will be the replacement for the PTransform inside the AppliedPTransform given as an argument.
matches(applied_ptransform)
class ReadPTransformOverride

Bases: apache_beam.pipeline.PTransformOverride

A PTransformOverride for Read(BoundedSource)

get_replacement_inputs(applied_ptransform)

Provides inputs that will be passed to the replacement PTransform.

Parameters:applied_ptransform – Original AppliedPTransform containing the PTransform to be replaced.
Returns:An iterable of PValues that will be passed to the expand() method of the replacement PTransform.
get_replacement_transform(ptransform)

Provides a runner specific override for a given PTransform.

Parameters:ptransform – PTransform to be replaced.
Returns:A PTransform that will be the replacement for the PTransform given as an argument.
get_replacement_transform_for_applied_ptransform(applied_ptransform)
matches(applied_ptransform)
apply_PTransform(transform, input, options)
run(transform, options=None)

Run the given transform or callable with this runner.

Blocks until the pipeline is complete. See also PipelineRunner.run_async.

run_async(transform, options=None)

Run the given transform or callable with this runner.

May return immediately, executing the pipeline in the background. The returned result object can be queried for progress, and wait_until_finish may be called to block until completion.

run_transform(transform_node, options)

Runner callback for a pipeline.run call.

Parameters:transform_node – transform node for the transform to run.

A concrete implementation of the Runner class must implement run_Abc for some class Abc in the method resolution order for every non-composite transform Xyz in the pipeline.

visit_transforms(pipeline, options)