apache_beam.io.gcp.datastore.v1new.datastoreio module¶
A connector for reading from and writing to Google Cloud Datastore.
This module uses the newer google-cloud-datastore client package. Its API was different enough to require extensive changes to this and associated modules.
-
class
apache_beam.io.gcp.datastore.v1new.datastoreio.
ReadFromDatastore
(query, num_splits=0)[source]¶ Bases:
apache_beam.transforms.ptransform.PTransform
A
PTransform
for querying Google Cloud Datastore.To read a
PCollection[Entity]
from a Cloud DatastoreQuery
, use theReadFromDatastore
transform by providing a query to read from. The project and optional namespace are set in the query. The query will be split into multiple queries to allow for parallelism. The degree of parallelism is automatically determined, but can be overridden by setting num_splits to a value of 1 or greater.Note: Normally, a runner will read from Cloud Datastore in parallel across many workers. However, when the query is configured with a limit or if the query contains inequality filters like GREATER_THAN, LESS_THAN etc., then all the returned results will be read by a single worker in order to ensure correct data. Since data is read from a single worker, this could have significant impact on the performance of the job. Using a
Reshuffle
transform after the read in this case might be beneficial for parallelizing work across workers.- The semantics for query splitting is defined below:
1. If num_splits is equal to 0, then the number of splits will be chosen dynamically at runtime based on the query data size.
2. Any value of num_splits greater than ReadFromDatastore._NUM_QUERY_SPLITS_MAX will be capped at that value.
3. If the query has a user limit set, or contains inequality filters, then num_splits will be ignored and no split will be performed.
4. Under certain cases Cloud Datastore is unable to split query to the requested number of splits. In such cases we just use whatever Cloud Datastore returns.
See https://developers.google.com/datastore/ for more details on Google Cloud Datastore.
Initialize the ReadFromDatastore transform.
This transform outputs elements of type
Entity
.Parameters: -
annotations
() → Dict[str, Union[bytes, str, google.protobuf.message.Message]]¶
-
default_label
()¶
-
default_type_hints
()¶
-
classmethod
from_runner_api
(proto, context)¶
-
get_type_hints
()¶ Gets and/or initializes type hints for this object.
If type hints have not been set, attempts to initialize type hints in this order: - Using self.default_type_hints(). - Using self.__class__ type hints.
-
get_windowing
(inputs)¶ Returns the window function to be associated with transform’s output.
By default most transforms just return the windowing function associated with the input PCollection (or the first input if several).
-
infer_output_type
(unused_input_type)¶
-
label
¶
-
pipeline
= None¶
-
classmethod
register_urn
(urn, parameter_type, constructor=None)¶
-
runner_api_requires_keyed_input
()¶
-
side_inputs
= ()¶
-
to_runner_api
(context, has_parts=False, **extra_kwargs)¶
-
to_runner_api_parameter
(unused_context)¶
-
to_runner_api_pickled
(unused_context)¶
-
type_check_inputs
(pvalueish)¶
-
type_check_inputs_or_outputs
(pvalueish, input_or_output)¶
-
type_check_outputs
(pvalueish)¶
-
with_input_types
(input_type_hint)¶ Annotates the input type of a
PTransform
with a type-hint.Parameters: input_type_hint (type) – An instance of an allowed built-in type, a custom class, or an instance of a TypeConstraint
.Raises: TypeError
– If input_type_hint is not a valid type-hint. Seeapache_beam.typehints.typehints.validate_composite_type_param()
for further details.Returns: A reference to the instance of this particular PTransform
object. This allows chaining type-hinting related methods.Return type: PTransform
-
with_output_types
(type_hint)¶ Annotates the output type of a
PTransform
with a type-hint.Parameters: type_hint (type) – An instance of an allowed built-in type, a custom class, or a TypeConstraint
.Raises: TypeError
– If type_hint is not a valid type-hint. Seevalidate_composite_type_param()
for further details.Returns: A reference to the instance of this particular PTransform
object. This allows chaining type-hinting related methods.Return type: PTransform
-
class
apache_beam.io.gcp.datastore.v1new.datastoreio.
WriteToDatastore
(project)[source]¶ Bases:
apache_beam.io.gcp.datastore.v1new.datastoreio._Mutate
Writes elements of type
Entity
to Cloud Datastore.Entity keys must be complete. The
project
field in each key must match the project ID passed to this transform. Ifproject
field in entity or property key is empty then it is filled with the project ID passed to this transform.Initialize the WriteToDatastore transform.
Parameters: project – ( str
) The ID of the project to write entities to.-
class
DatastoreMutateFn
(project)¶ Bases:
apache_beam.transforms.core.DoFn
A
DoFn
that write mutations to Datastore.Mutations are written in batches, where the maximum batch size is util.WRITE_BATCH_SIZE.
Commits are non-transactional. If a commit fails because of a conflict over an entity group, the commit will be retried. This means that the mutation should be idempotent (upsert and delete mutations) to prevent duplicate data or errors.
Parameters: project – (str) cloud project id -
BundleFinalizerParam
¶ alias of
apache_beam.transforms.core._BundleFinalizerParam
-
DoFnProcessParams
= [ElementParam, SideInputParam, TimestampParam, WindowParam, <class 'apache_beam.transforms.core._WatermarkEstimatorParam'>, PaneInfoParam, <class 'apache_beam.transforms.core._BundleFinalizerParam'>, KeyParam, <class 'apache_beam.transforms.core._StateDoFnParam'>, <class 'apache_beam.transforms.core._TimerDoFnParam'>]¶
-
DynamicTimerTagParam
= DynamicTimerTagParam¶
-
ElementParam
= ElementParam¶
-
KeyParam
= KeyParam¶
-
PaneInfoParam
= PaneInfoParam¶
-
RestrictionParam
¶ alias of
apache_beam.transforms.core._RestrictionDoFnParam
-
SideInputParam
= SideInputParam¶
-
StateParam
¶ alias of
apache_beam.transforms.core._StateDoFnParam
-
TimerParam
¶ alias of
apache_beam.transforms.core._TimerDoFnParam
-
TimestampParam
= TimestampParam¶
-
WatermarkEstimatorParam
¶ alias of
apache_beam.transforms.core._WatermarkEstimatorParam
-
WindowParam
= WindowParam¶
-
add_to_batch
(client_batch_item)¶
-
default_label
()¶
-
default_type_hints
()¶
-
display_data
()¶ Returns the display data associated to a pipeline component.
It should be reimplemented in pipeline components that wish to have static display data.
Returns: A dictionary containing key:value
pairs. The value might be an integer, float or string value; aDisplayDataItem
for values that have more data (e.g. short value, label, url); or aHasDisplayData
instance that has more display data that should be picked up. For example:{ 'key1': 'string_value', 'key2': 1234, 'key3': 3.14159265, 'key4': DisplayDataItem('apache.org', url='http://apache.org'), 'key5': subComponent }
Return type: Dict[str, Any]
-
element_to_client_batch_item
(element)¶
-
finish_bundle
()¶
-
static
from_callable
(fn)¶
-
classmethod
from_runner_api
(fn_proto, context)¶ Converts from an FunctionSpec to a Fn object.
Prefer registering a urn with its parameter type and constructor.
-
get_function_arguments
(func)¶
-
get_type_hints
()¶ Gets and/or initializes type hints for this object.
If type hints have not been set, attempts to initialize type hints in this order: - Using self.default_type_hints(). - Using self.__class__ type hints.
-
infer_output_type
(input_type)¶
-
process
(element)¶
-
classmethod
register_pickle_urn
(pickle_urn)¶ Registers and implements the given urn via pickling.
-
classmethod
register_urn
(urn, parameter_type, fn=None)¶ Registers a urn with a constructor.
For example, if ‘beam:fn:foo’ had parameter type FooPayload, one could write RunnerApiFn.register_urn(‘bean:fn:foo’, FooPayload, foo_from_proto) where foo_from_proto took as arguments a FooPayload and a PipelineContext. This function can also be used as a decorator rather than passing the callable in as the final parameter.
A corresponding to_runner_api_parameter method would be expected that returns the tuple (‘beam:fn:foo’, FooPayload)
-
setup
()¶ Called to prepare an instance for processing bundles of elements.
This is a good place to initialize transient in-memory resources, such as network connections. The resources can then be disposed in
DoFn.teardown
.
-
start_bundle
()¶
-
teardown
()¶ Called to use to clean up this instance before it is discarded.
A runner will do its best to call this method on any given instance to prevent leaks of transient resources, however, there may be situations where this is impossible (e.g. process crash, hardware failure, etc.) or unnecessary (e.g. the pipeline is shutting down and the process is about to be killed anyway, so all transient resources will be released automatically by the OS). In these cases, the call may not happen. It will also not be retried, because in such situations the DoFn instance no longer exists, so there’s no instance to retry it on.
Thus, all work that depends on input elements, and all externally important side effects, must be performed in
DoFn.process
orDoFn.finish_bundle
.
-
to_runner_api
(context)¶ Returns an FunctionSpec encoding this Fn.
Prefer overriding self.to_runner_api_parameter.
-
to_runner_api_parameter
(context)¶
-
static
unbounded_per_element
()¶ A decorator on process fn specifying that the fn performs an unbounded amount of work per input element.
-
with_input_types
(*arg_hints, **kwarg_hints)¶
-
with_output_types
(*arg_hints, **kwarg_hints)¶
-
write_mutations
(throttler, rpc_stats_callback, throttle_delay=1)¶ Writes a batch of mutations to Cloud Datastore.
If a commit fails, it will be retried up to 5 times. All mutations in the batch will be committed again, even if the commit was partially successful. If the retry limit is exceeded, the last exception from Cloud Datastore will be raised.
Assumes that the Datastore client library does not perform any retries on commits. It has not been determined how such retries would interact with the retries and throttler used here. See
google.cloud.datastore_v1.gapic.datastore_client_config
for retry config.Parameters: - rpc_stats_callback – a function to call with arguments successes and failures and throttled_secs; this is called to record successful and failed RPCs to Datastore and time spent waiting for throttling.
- throttler – (
apache_beam.io.gcp.datastore.v1new.adaptive_throttler. AdaptiveThrottler
) Throttler instance used to select requests to be throttled. - throttle_delay – (
float
) time in seconds to sleep when throttled.
Returns: (int) The latency of the successful RPC in milliseconds.
-
-
annotations
() → Dict[str, Union[bytes, str, google.protobuf.message.Message]]¶
-
default_label
()¶
-
default_type_hints
()¶
-
display_data
()¶ Returns the display data associated to a pipeline component.
It should be reimplemented in pipeline components that wish to have static display data.
Returns: A dictionary containing key:value
pairs. The value might be an integer, float or string value; aDisplayDataItem
for values that have more data (e.g. short value, label, url); or aHasDisplayData
instance that has more display data that should be picked up. For example:{ 'key1': 'string_value', 'key2': 1234, 'key3': 3.14159265, 'key4': DisplayDataItem('apache.org', url='http://apache.org'), 'key5': subComponent }
Return type: Dict[str, Any]
-
expand
(pcoll)¶
-
classmethod
from_runner_api
(proto, context)¶
-
get_type_hints
()¶ Gets and/or initializes type hints for this object.
If type hints have not been set, attempts to initialize type hints in this order: - Using self.default_type_hints(). - Using self.__class__ type hints.
-
get_windowing
(inputs)¶ Returns the window function to be associated with transform’s output.
By default most transforms just return the windowing function associated with the input PCollection (or the first input if several).
-
infer_output_type
(unused_input_type)¶
-
label
¶
-
pipeline
= None¶
-
classmethod
register_urn
(urn, parameter_type, constructor=None)¶
-
runner_api_requires_keyed_input
()¶
-
side_inputs
= ()¶
-
to_runner_api
(context, has_parts=False, **extra_kwargs)¶
-
to_runner_api_parameter
(unused_context)¶
-
to_runner_api_pickled
(unused_context)¶
-
type_check_inputs
(pvalueish)¶
-
type_check_inputs_or_outputs
(pvalueish, input_or_output)¶
-
type_check_outputs
(pvalueish)¶
-
with_input_types
(input_type_hint)¶ Annotates the input type of a
PTransform
with a type-hint.Parameters: input_type_hint (type) – An instance of an allowed built-in type, a custom class, or an instance of a TypeConstraint
.Raises: TypeError
– If input_type_hint is not a valid type-hint. Seeapache_beam.typehints.typehints.validate_composite_type_param()
for further details.Returns: A reference to the instance of this particular PTransform
object. This allows chaining type-hinting related methods.Return type: PTransform
-
with_output_types
(type_hint)¶ Annotates the output type of a
PTransform
with a type-hint.Parameters: type_hint (type) – An instance of an allowed built-in type, a custom class, or a TypeConstraint
.Raises: TypeError
– If type_hint is not a valid type-hint. Seevalidate_composite_type_param()
for further details.Returns: A reference to the instance of this particular PTransform
object. This allows chaining type-hinting related methods.Return type: PTransform
-
class
-
class
apache_beam.io.gcp.datastore.v1new.datastoreio.
DeleteFromDatastore
(project)[source]¶ Bases:
apache_beam.io.gcp.datastore.v1new.datastoreio._Mutate
Deletes elements matching input
Key
elements from Cloud Datastore.Keys must be complete. The
project
field in each key must match the project ID passed to this transform. Ifproject
field in key is empty then it is filled with the project ID passed to this transform.Initialize the DeleteFromDatastore transform.
Parameters: project – ( str
) The ID of the project from which the entities will be deleted.-
class
DatastoreMutateFn
(project)¶ Bases:
apache_beam.transforms.core.DoFn
A
DoFn
that write mutations to Datastore.Mutations are written in batches, where the maximum batch size is util.WRITE_BATCH_SIZE.
Commits are non-transactional. If a commit fails because of a conflict over an entity group, the commit will be retried. This means that the mutation should be idempotent (upsert and delete mutations) to prevent duplicate data or errors.
Parameters: project – (str) cloud project id -
BundleFinalizerParam
¶ alias of
apache_beam.transforms.core._BundleFinalizerParam
-
DoFnProcessParams
= [ElementParam, SideInputParam, TimestampParam, WindowParam, <class 'apache_beam.transforms.core._WatermarkEstimatorParam'>, PaneInfoParam, <class 'apache_beam.transforms.core._BundleFinalizerParam'>, KeyParam, <class 'apache_beam.transforms.core._StateDoFnParam'>, <class 'apache_beam.transforms.core._TimerDoFnParam'>]¶
-
DynamicTimerTagParam
= DynamicTimerTagParam¶
-
ElementParam
= ElementParam¶
-
KeyParam
= KeyParam¶
-
PaneInfoParam
= PaneInfoParam¶
-
RestrictionParam
¶ alias of
apache_beam.transforms.core._RestrictionDoFnParam
-
SideInputParam
= SideInputParam¶
-
StateParam
¶ alias of
apache_beam.transforms.core._StateDoFnParam
-
TimerParam
¶ alias of
apache_beam.transforms.core._TimerDoFnParam
-
TimestampParam
= TimestampParam¶
-
WatermarkEstimatorParam
¶ alias of
apache_beam.transforms.core._WatermarkEstimatorParam
-
WindowParam
= WindowParam¶
-
add_to_batch
(client_batch_item)¶
-
default_label
()¶
-
default_type_hints
()¶
-
display_data
()¶ Returns the display data associated to a pipeline component.
It should be reimplemented in pipeline components that wish to have static display data.
Returns: A dictionary containing key:value
pairs. The value might be an integer, float or string value; aDisplayDataItem
for values that have more data (e.g. short value, label, url); or aHasDisplayData
instance that has more display data that should be picked up. For example:{ 'key1': 'string_value', 'key2': 1234, 'key3': 3.14159265, 'key4': DisplayDataItem('apache.org', url='http://apache.org'), 'key5': subComponent }
Return type: Dict[str, Any]
-
element_to_client_batch_item
(element)¶
-
finish_bundle
()¶
-
static
from_callable
(fn)¶
-
classmethod
from_runner_api
(fn_proto, context)¶ Converts from an FunctionSpec to a Fn object.
Prefer registering a urn with its parameter type and constructor.
-
get_function_arguments
(func)¶
-
get_type_hints
()¶ Gets and/or initializes type hints for this object.
If type hints have not been set, attempts to initialize type hints in this order: - Using self.default_type_hints(). - Using self.__class__ type hints.
-
infer_output_type
(input_type)¶
-
process
(element)¶
-
classmethod
register_pickle_urn
(pickle_urn)¶ Registers and implements the given urn via pickling.
-
classmethod
register_urn
(urn, parameter_type, fn=None)¶ Registers a urn with a constructor.
For example, if ‘beam:fn:foo’ had parameter type FooPayload, one could write RunnerApiFn.register_urn(‘bean:fn:foo’, FooPayload, foo_from_proto) where foo_from_proto took as arguments a FooPayload and a PipelineContext. This function can also be used as a decorator rather than passing the callable in as the final parameter.
A corresponding to_runner_api_parameter method would be expected that returns the tuple (‘beam:fn:foo’, FooPayload)
-
setup
()¶ Called to prepare an instance for processing bundles of elements.
This is a good place to initialize transient in-memory resources, such as network connections. The resources can then be disposed in
DoFn.teardown
.
-
start_bundle
()¶
-
teardown
()¶ Called to use to clean up this instance before it is discarded.
A runner will do its best to call this method on any given instance to prevent leaks of transient resources, however, there may be situations where this is impossible (e.g. process crash, hardware failure, etc.) or unnecessary (e.g. the pipeline is shutting down and the process is about to be killed anyway, so all transient resources will be released automatically by the OS). In these cases, the call may not happen. It will also not be retried, because in such situations the DoFn instance no longer exists, so there’s no instance to retry it on.
Thus, all work that depends on input elements, and all externally important side effects, must be performed in
DoFn.process
orDoFn.finish_bundle
.
-
to_runner_api
(context)¶ Returns an FunctionSpec encoding this Fn.
Prefer overriding self.to_runner_api_parameter.
-
to_runner_api_parameter
(context)¶
-
static
unbounded_per_element
()¶ A decorator on process fn specifying that the fn performs an unbounded amount of work per input element.
-
with_input_types
(*arg_hints, **kwarg_hints)¶
-
with_output_types
(*arg_hints, **kwarg_hints)¶
-
write_mutations
(throttler, rpc_stats_callback, throttle_delay=1)¶ Writes a batch of mutations to Cloud Datastore.
If a commit fails, it will be retried up to 5 times. All mutations in the batch will be committed again, even if the commit was partially successful. If the retry limit is exceeded, the last exception from Cloud Datastore will be raised.
Assumes that the Datastore client library does not perform any retries on commits. It has not been determined how such retries would interact with the retries and throttler used here. See
google.cloud.datastore_v1.gapic.datastore_client_config
for retry config.Parameters: - rpc_stats_callback – a function to call with arguments successes and failures and throttled_secs; this is called to record successful and failed RPCs to Datastore and time spent waiting for throttling.
- throttler – (
apache_beam.io.gcp.datastore.v1new.adaptive_throttler. AdaptiveThrottler
) Throttler instance used to select requests to be throttled. - throttle_delay – (
float
) time in seconds to sleep when throttled.
Returns: (int) The latency of the successful RPC in milliseconds.
-
-
annotations
() → Dict[str, Union[bytes, str, google.protobuf.message.Message]]¶
-
default_label
()¶
-
default_type_hints
()¶
-
display_data
()¶ Returns the display data associated to a pipeline component.
It should be reimplemented in pipeline components that wish to have static display data.
Returns: A dictionary containing key:value
pairs. The value might be an integer, float or string value; aDisplayDataItem
for values that have more data (e.g. short value, label, url); or aHasDisplayData
instance that has more display data that should be picked up. For example:{ 'key1': 'string_value', 'key2': 1234, 'key3': 3.14159265, 'key4': DisplayDataItem('apache.org', url='http://apache.org'), 'key5': subComponent }
Return type: Dict[str, Any]
-
expand
(pcoll)¶
-
classmethod
from_runner_api
(proto, context)¶
-
get_type_hints
()¶ Gets and/or initializes type hints for this object.
If type hints have not been set, attempts to initialize type hints in this order: - Using self.default_type_hints(). - Using self.__class__ type hints.
-
get_windowing
(inputs)¶ Returns the window function to be associated with transform’s output.
By default most transforms just return the windowing function associated with the input PCollection (or the first input if several).
-
infer_output_type
(unused_input_type)¶
-
label
¶
-
pipeline
= None¶
-
classmethod
register_urn
(urn, parameter_type, constructor=None)¶
-
runner_api_requires_keyed_input
()¶
-
side_inputs
= ()¶
-
to_runner_api
(context, has_parts=False, **extra_kwargs)¶
-
to_runner_api_parameter
(unused_context)¶
-
to_runner_api_pickled
(unused_context)¶
-
type_check_inputs
(pvalueish)¶
-
type_check_inputs_or_outputs
(pvalueish, input_or_output)¶
-
type_check_outputs
(pvalueish)¶
-
with_input_types
(input_type_hint)¶ Annotates the input type of a
PTransform
with a type-hint.Parameters: input_type_hint (type) – An instance of an allowed built-in type, a custom class, or an instance of a TypeConstraint
.Raises: TypeError
– If input_type_hint is not a valid type-hint. Seeapache_beam.typehints.typehints.validate_composite_type_param()
for further details.Returns: A reference to the instance of this particular PTransform
object. This allows chaining type-hinting related methods.Return type: PTransform
-
with_output_types
(type_hint)¶ Annotates the output type of a
PTransform
with a type-hint.Parameters: type_hint (type) – An instance of an allowed built-in type, a custom class, or a TypeConstraint
.Raises: TypeError
– If type_hint is not a valid type-hint. Seevalidate_composite_type_param()
for further details.Returns: A reference to the instance of this particular PTransform
object. This allows chaining type-hinting related methods.Return type: PTransform
-
class