apache_beam.runners.interactive.cache_manager module¶
-
class
apache_beam.runners.interactive.cache_manager.
CacheManager
[source]¶ Bases:
object
Abstract class for caching PCollections.
A PCollection cache is identified by labels, which consist of a prefix (either ‘full’ or ‘sample’) and a cache_label which is a hash of the PCollection derivation.
-
read
(*labels, **args)[source]¶ Return the PCollection as a list as well as the version number.
Parameters: - *labels – List of labels for PCollection instance.
- **args – Dict of additional arguments. Currently only ‘tail’ as a boolean. When tail is True, will wait and read new elements until the cache is complete.
Returns: - A tuple containing an iterator for the items in the PCollection and the
version number.
It is possible that the version numbers from read() and_latest_version() are different. This usually means that the cache’s been evicted (thus unavailable => read() returns version = -1), but it had reached version n before eviction.
-
write
(value, *labels)[source]¶ Writes the value to the given cache.
Parameters: - value – An encodable (with corresponding PCoder) value
- *labels – List of labels for PCollection instance
-
clear
(*labels)[source]¶ Clears the cache entry of the given labels and returns True on success.
Parameters: - value – An encodable (with corresponding PCoder) value
- *labels – List of labels for PCollection instance
-
sink
(labels, is_capture=False)[source]¶ Returns a PTransform that writes the PCollection cache.
TODO(BEAM-10514): Make sure labels will not be converted into an arbitrarily long file path: e.g., windows has a 260 path limit.
-
save_pcoder
(pcoder, *labels)[source]¶ Saves pcoder for given PCollection.
Correct reading of PCollection from Cache requires PCoder to be known. This method saves desired PCoder for PCollection that will subsequently be used by sink(…), source(…), and, most importantly, read(…) method. The latter must be able to read a PCollection written by Beam using non-Beam IO.
Parameters: - pcoder – A PCoder to be used for reading and writing a PCollection.
- *labels – List of labels for PCollection instance.
-
-
class
apache_beam.runners.interactive.cache_manager.
FileBasedCacheManager
(cache_dir=None, cache_format='text')[source]¶ Bases:
apache_beam.runners.interactive.cache_manager.CacheManager
Maps PCollections to local temp files for materialization.
-
is_latest_version
(version, *labels)¶ Returns if the given version number is the latest.
-
-
class
apache_beam.runners.interactive.cache_manager.
ReadCache
(cache_manager, label)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
A PTransform that reads the PCollections from the cache.
-
annotations
() → Dict[str, Union[bytes, str, google.protobuf.message.Message]]¶
-
default_label
()¶
-
default_type_hints
()¶
-
display_data
()¶ Returns the display data associated to a pipeline component.
It should be reimplemented in pipeline components that wish to have static display data.
Returns: A dictionary containing key:value
pairs. The value might be an integer, float or string value; aDisplayDataItem
for values that have more data (e.g. short value, label, url); or aHasDisplayData
instance that has more display data that should be picked up. For example:{ 'key1': 'string_value', 'key2': 1234, 'key3': 3.14159265, 'key4': DisplayDataItem('apache.org', url='http://apache.org'), 'key5': subComponent }
Return type: Dict[str, Any]
-
classmethod
from_runner_api
(proto, context)¶
-
get_type_hints
()¶ Gets and/or initializes type hints for this object.
If type hints have not been set, attempts to initialize type hints in this order: - Using self.default_type_hints(). - Using self.__class__ type hints.
-
get_windowing
(inputs)¶ Returns the window function to be associated with transform’s output.
By default most transforms just return the windowing function associated with the input PCollection (or the first input if several).
-
infer_output_type
(unused_input_type)¶
-
label
¶
-
pipeline
= None¶
-
classmethod
register_urn
(urn, parameter_type, constructor=None)¶
-
runner_api_requires_keyed_input
()¶
-
side_inputs
= ()¶
-
to_runner_api
(context, has_parts=False, **extra_kwargs)¶
-
to_runner_api_parameter
(unused_context)¶
-
to_runner_api_pickled
(unused_context)¶
-
type_check_inputs
(pvalueish)¶
-
type_check_inputs_or_outputs
(pvalueish, input_or_output)¶
-
type_check_outputs
(pvalueish)¶
-
with_input_types
(input_type_hint)¶ Annotates the input type of a
PTransform
with a type-hint.Parameters: input_type_hint (type) – An instance of an allowed built-in type, a custom class, or an instance of a TypeConstraint
.Raises: TypeError
– If input_type_hint is not a valid type-hint. Seeapache_beam.typehints.typehints.validate_composite_type_param()
for further details.Returns: A reference to the instance of this particular PTransform
object. This allows chaining type-hinting related methods.Return type: PTransform
-
with_output_types
(type_hint)¶ Annotates the output type of a
PTransform
with a type-hint.Parameters: type_hint (type) – An instance of an allowed built-in type, a custom class, or a TypeConstraint
.Raises: TypeError
– If type_hint is not a valid type-hint. Seevalidate_composite_type_param()
for further details.Returns: A reference to the instance of this particular PTransform
object. This allows chaining type-hinting related methods.Return type: PTransform
-
-
class
apache_beam.runners.interactive.cache_manager.
WriteCache
(cache_manager, label, sample=False, sample_size=0, is_capture=False)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
A PTransform that writes the PCollections to the cache.
-
annotations
() → Dict[str, Union[bytes, str, google.protobuf.message.Message]]¶
-
default_label
()¶
-
default_type_hints
()¶
-
display_data
()¶ Returns the display data associated to a pipeline component.
It should be reimplemented in pipeline components that wish to have static display data.
Returns: A dictionary containing key:value
pairs. The value might be an integer, float or string value; aDisplayDataItem
for values that have more data (e.g. short value, label, url); or aHasDisplayData
instance that has more display data that should be picked up. For example:{ 'key1': 'string_value', 'key2': 1234, 'key3': 3.14159265, 'key4': DisplayDataItem('apache.org', url='http://apache.org'), 'key5': subComponent }
Return type: Dict[str, Any]
-
classmethod
from_runner_api
(proto, context)¶
-
get_type_hints
()¶ Gets and/or initializes type hints for this object.
If type hints have not been set, attempts to initialize type hints in this order: - Using self.default_type_hints(). - Using self.__class__ type hints.
-
get_windowing
(inputs)¶ Returns the window function to be associated with transform’s output.
By default most transforms just return the windowing function associated with the input PCollection (or the first input if several).
-
infer_output_type
(unused_input_type)¶
-
label
¶
-
pipeline
= None¶
-
classmethod
register_urn
(urn, parameter_type, constructor=None)¶
-
runner_api_requires_keyed_input
()¶
-
side_inputs
= ()¶
-
to_runner_api
(context, has_parts=False, **extra_kwargs)¶
-
to_runner_api_parameter
(unused_context)¶
-
to_runner_api_pickled
(unused_context)¶
-
type_check_inputs
(pvalueish)¶
-
type_check_inputs_or_outputs
(pvalueish, input_or_output)¶
-
type_check_outputs
(pvalueish)¶
-
with_input_types
(input_type_hint)¶ Annotates the input type of a
PTransform
with a type-hint.Parameters: input_type_hint (type) – An instance of an allowed built-in type, a custom class, or an instance of a TypeConstraint
.Raises: TypeError
– If input_type_hint is not a valid type-hint. Seeapache_beam.typehints.typehints.validate_composite_type_param()
for further details.Returns: A reference to the instance of this particular PTransform
object. This allows chaining type-hinting related methods.Return type: PTransform
-
with_output_types
(type_hint)¶ Annotates the output type of a
PTransform
with a type-hint.Parameters: type_hint (type) – An instance of an allowed built-in type, a custom class, or a TypeConstraint
.Raises: TypeError
– If type_hint is not a valid type-hint. Seevalidate_composite_type_param()
for further details.Returns: A reference to the instance of this particular PTransform
object. This allows chaining type-hinting related methods.Return type: PTransform
-
-
class
apache_beam.runners.interactive.cache_manager.
SafeFastPrimitivesCoder
[source]¶ Bases:
apache_beam.coders.coders.Coder
This class add an quote/unquote step to escape special characters.
-
as_cloud_object
(coders_context=None)¶ For internal use only; no backwards-compatibility guarantees.
Returns Google Cloud Dataflow API description of this coder.
-
as_deterministic_coder
(step_label, error_message=None)¶ Returns a deterministic version of self, if possible.
Otherwise raises a value error.
-
decode_nested
(encoded)¶ Uses the underlying implementation to decode in nested format.
-
encode_nested
(value)¶ Uses the underlying implementation to encode in nested format.
-
estimate_size
(value)¶ Estimates the encoded size of the given value, in bytes.
Dataflow estimates the encoded size of a PCollection processed in a pipeline step by using the estimated size of a random sample of elements in that PCollection.
The default implementation encodes the given value and returns its byte size. If a coder can provide a fast estimate of the encoded size of a value (e.g., if the encoding has a fixed size), it can provide its estimate here to improve performance.
Parameters: value – the value whose encoded size is to be estimated. Returns: The estimated encoded size of the given value.
-
classmethod
from_runner_api
(coder_proto, context)¶ Converts from an FunctionSpec to a Fn object.
Prefer registering a urn with its parameter type and constructor.
-
classmethod
from_type_hint
(unused_typehint, unused_registry)¶
-
get_impl
()¶ For internal use only; no backwards-compatibility guarantees.
Returns the CoderImpl backing this Coder.
-
is_deterministic
()¶ Whether this coder is guaranteed to encode values deterministically.
A deterministic coder is required for key coders in GroupByKey operations to produce consistent results.
For example, note that the default coder, the PickleCoder, is not deterministic: the ordering of picked entries in maps may vary across executions since there is no defined order, and such a coder is not in general suitable for usage as a key coder in GroupByKey operations, since each instance of the same key may be encoded differently.
Returns: Whether coder is deterministic.
-
is_kv_coder
()¶
-
key_coder
()¶
-
static
register_structured_urn
(urn, cls)¶ Register a coder that’s completely defined by its urn and its component(s), if any, which are passed to construct the instance.
-
classmethod
register_urn
(urn, parameter_type, fn=None)¶ Registers a urn with a constructor.
For example, if ‘beam:fn:foo’ had parameter type FooPayload, one could write RunnerApiFn.register_urn(‘bean:fn:foo’, FooPayload, foo_from_proto) where foo_from_proto took as arguments a FooPayload and a PipelineContext. This function can also be used as a decorator rather than passing the callable in as the final parameter.
A corresponding to_runner_api_parameter method would be expected that returns the tuple (‘beam:fn:foo’, FooPayload)
-
to_runner_api
(context)¶
-
to_runner_api_parameter
(context)¶
-
to_type_hint
()¶
-
value_coder
()¶
-