apache_beam.transforms.stats module

This module contains all statistics-related transforms.

The ApproximateUnique class will be deprecated [1]. Please consider using HllCount from the zetasketch extension module [2] instead.

[1] https://lists.apache.org/thread.html/501605df5027567099b81f18c080469661fb4264a002615fa1510502%40%3Cdev.beam.apache.org%3E
[2] https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html

class apache_beam.transforms.stats.ApproximateUnique[source]

Bases: object

Hashes the input elements and uses a sample of the hash values to extrapolate the size of the entire set of hash values, assuming the rest of the hash values are as densely distributed as the sample space.
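The estimation idea described above can be illustrated with a k-minimum-values sketch in plain Python. This is a stand-alone illustration and not Beam's implementation: the names estimate_unique and _hash64, the use of SHA-256 as the hash, and the (k - 1) / density formula are all assumptions made for this example.

```python
import hashlib
import heapq

HASH_SPACE = 2 ** 64  # size of the 64-bit hash space assumed in this sketch


def _hash64(element):
    """Hypothetical helper: map an element to a 64-bit integer hash."""
    digest = hashlib.sha256(repr(element).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def estimate_unique(elements, sample_size=1024):
    """Estimate the number of distinct elements from a hash sample.

    Keeps only the sample_size smallest distinct hashes. If the sample
    never fills up, the count is exact; otherwise the density of the
    sample is extrapolated to the whole hash space.
    """
    heap = []        # max-heap of kept hashes (stored negated)
    members = set()  # the same hashes, for duplicate detection
    for element in elements:
        h = _hash64(element)
        if h in members:
            continue
        if len(heap) < sample_size:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # Replace the largest kept hash with the smaller new one.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < sample_size:
        return len(heap)
    kth_smallest = -heap[0]
    return (sample_size - 1) * HASH_SPACE // kth_smallest


if __name__ == "__main__":
    print(estimate_unique(range(10000), sample_size=1024))
```

A larger sample_size trades memory for accuracy, which is the same trade-off the size and error parameters below expose.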

static parse_input_params(size=None, error=None)[source]

Check if input params are valid and return sample size.

Parameters:
  • size – the sample size used to estimate the number of unique values; an int no smaller than 16.
  • error – the maximum estimation error; a float between 0.01 and 0.50. If error is given, the sample size is calculated from it by the _get_sample_size_from_est_error function.
Returns: sample size
Raises: ValueError – If both size and error are given, if neither is given, or if a value is out of range.
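The validation rules above can be sketched as a stand-alone function. This is not the library code: parse_params_sketch is a hypothetical name, and the formula ceil(4 / error²) used to derive a sample size from the error is an assumption for illustration (it does map the largest allowed error, 0.50, to the smallest allowed size, 16).

```python
import math


def parse_params_sketch(size=None, error=None):
    """Hypothetical re-statement of the documented validation rules."""
    # Exactly one of size and error must be provided.
    if (size is None) == (error is None):
        raise ValueError("Provide exactly one of size or error.")
    if size is not None:
        if not isinstance(size, int) or size < 16:
            raise ValueError("size must be an int not smaller than 16.")
        return size
    if not 0.01 <= error <= 0.50:
        raise ValueError("error must be between 0.01 and 0.50.")
    # Assumed formula: sample size grows with the inverse square of error.
    return int(math.ceil(4.0 / error ** 2))


if __name__ == "__main__":
    print(parse_params_sketch(size=32))     # 32
    print(parse_params_sketch(error=0.25))  # 64
```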

class Globally(size=None, error=None)[source]

Bases: apache_beam.transforms.ptransform.PTransform

Approximates the number of unique values globally.

expand(pcoll)[source]
annotations() → Dict[str, Union[bytes, str, google.protobuf.message.Message]]
default_label()
default_type_hints()
display_data()

Returns the display data associated to a pipeline component.

It should be reimplemented in pipeline components that wish to have static display data.

Returns: A dictionary containing key:value pairs. The value might be an integer, float or string value; a DisplayDataItem for values that have more data (e.g. short value, label, url); or a HasDisplayData instance that has more display data that should be picked up. For example:
{
  'key1': 'string_value',
  'key2': 1234,
  'key3': 3.14159265,
  'key4': DisplayDataItem('apache.org', url='http://apache.org'),
  'key5': subComponent
}
Return type: Dict[str, Any]
classmethod from_runner_api(proto, context)
get_type_hints()

Gets and/or initializes type hints for this object.

If type hints have not been set, attempts to initialize type hints in this order:
  • Using self.default_type_hints().
  • Using self.__class__ type hints.

get_windowing(inputs)

Returns the window function to be associated with transform’s output.

By default most transforms just return the windowing function associated with the input PCollection (or the first input if several).

infer_output_type(unused_input_type)
label
pipeline = None
classmethod register_urn(urn, parameter_type, constructor=None)
runner_api_requires_keyed_input()
side_inputs = ()
to_runner_api(context, has_parts=False, **extra_kwargs)
to_runner_api_parameter(unused_context)
to_runner_api_pickled(unused_context)
type_check_inputs(pvalueish)
type_check_inputs_or_outputs(pvalueish, input_or_output)
type_check_outputs(pvalueish)
with_input_types(input_type_hint)

Annotates the input type of a PTransform with a type-hint.

Parameters: input_type_hint (type) – An instance of an allowed built-in type, a custom class, or an instance of a TypeConstraint.
Raises: TypeError – If input_type_hint is not a valid type-hint. See apache_beam.typehints.typehints.validate_composite_type_param() for further details.
Returns: A reference to the instance of this particular PTransform object. This allows chaining type-hinting related methods.
Return type: PTransform
with_output_types(type_hint)

Annotates the output type of a PTransform with a type-hint.

Parameters: type_hint (type) – An instance of an allowed built-in type, a custom class, or a TypeConstraint.
Raises: TypeError – If type_hint is not a valid type-hint. See validate_composite_type_param() for further details.
Returns: A reference to the instance of this particular PTransform object. This allows chaining type-hinting related methods.
Return type: PTransform
class PerKey(size=None, error=None)[source]

Bases: apache_beam.transforms.ptransform.PTransform

Approximates the number of unique values per key.

expand(pcoll)[source]
annotations() → Dict[str, Union[bytes, str, google.protobuf.message.Message]]
default_label()
default_type_hints()
display_data()

Returns the display data associated to a pipeline component.

It should be reimplemented in pipeline components that wish to have static display data.

Returns: A dictionary containing key:value pairs. The value might be an integer, float or string value; a DisplayDataItem for values that have more data (e.g. short value, label, url); or a HasDisplayData instance that has more display data that should be picked up. For example:
{
  'key1': 'string_value',
  'key2': 1234,
  'key3': 3.14159265,
  'key4': DisplayDataItem('apache.org', url='http://apache.org'),
  'key5': subComponent
}
Return type: Dict[str, Any]
classmethod from_runner_api(proto, context)
get_type_hints()

Gets and/or initializes type hints for this object.

If type hints have not been set, attempts to initialize type hints in this order:
  • Using self.default_type_hints().
  • Using self.__class__ type hints.

get_windowing(inputs)

Returns the window function to be associated with transform’s output.

By default most transforms just return the windowing function associated with the input PCollection (or the first input if several).

infer_output_type(unused_input_type)
label
pipeline = None
classmethod register_urn(urn, parameter_type, constructor=None)
runner_api_requires_keyed_input()
side_inputs = ()
to_runner_api(context, has_parts=False, **extra_kwargs)
to_runner_api_parameter(unused_context)
to_runner_api_pickled(unused_context)
type_check_inputs(pvalueish)
type_check_inputs_or_outputs(pvalueish, input_or_output)
type_check_outputs(pvalueish)
with_input_types(input_type_hint)

Annotates the input type of a PTransform with a type-hint.

Parameters: input_type_hint (type) – An instance of an allowed built-in type, a custom class, or an instance of a TypeConstraint.
Raises: TypeError – If input_type_hint is not a valid type-hint. See apache_beam.typehints.typehints.validate_composite_type_param() for further details.
Returns: A reference to the instance of this particular PTransform object. This allows chaining type-hinting related methods.
Return type: PTransform
with_output_types(type_hint)

Annotates the output type of a PTransform with a type-hint.

Parameters: type_hint (type) – An instance of an allowed built-in type, a custom class, or a TypeConstraint.
Raises: TypeError – If type_hint is not a valid type-hint. See validate_composite_type_param() for further details.
Returns: A reference to the instance of this particular PTransform object. This allows chaining type-hinting related methods.
Return type: PTransform
class apache_beam.transforms.stats.ApproximateQuantiles[source]

Bases: object

PTransform for summarizing the distribution of data with approximate N-tiles (e.g. quartiles, percentiles), computed either globally or per key.

Examples

in: list(range(101)), num_quantiles=5

out: [0, 25, 50, 75, 100]

in: [(i, 1 if i<10 else 1e-5) for i in range(101)], num_quantiles=5, weighted=True

out: [0, 2, 5, 7, 100]

in: [list(range(10)), …, list(range(90, 101))], num_quantiles=5, input_batched=True

out: [0, 25, 50, 75, 100]

in: [(list(range(10)), [1]*10), (list(range(10)), [0]*10), …, (list(range(90, 101)), [0]*11)], num_quantiles=5, input_batched=True, weighted=True

out: [0, 2, 5, 7, 100]
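To make the first two examples above concrete, here is an exact reference computation in plain Python. It is only a sketch of what the outputs mean, not the approximate algorithm the transform uses: exact_weighted_quantiles is a hypothetical name, and the rule of picking the first value whose cumulative weight reaches each evenly spaced target is an assumption chosen for illustration.

```python
def exact_weighted_quantiles(pairs, num_quantiles):
    """Exact weighted quantiles of (value, weight) pairs.

    Assumes num_quantiles >= 2 and positive total weight. The first and
    last entries are the minimum and maximum; each interior entry is the
    first value whose cumulative weight reaches its target.
    """
    ordered = sorted(pairs)  # sort by value
    total = sum(weight for _, weight in ordered)
    result = [ordered[0][0]]
    cumulative = 0.0
    idx = 0
    for q in range(1, num_quantiles - 1):
        target = q * total / (num_quantiles - 1)
        # Skip values whose inclusive cumulative weight is below target.
        while cumulative + ordered[idx][1] < target:
            cumulative += ordered[idx][1]
            idx += 1
        result.append(ordered[idx][0])
    result.append(ordered[-1][0])
    return result


if __name__ == "__main__":
    # Unweighted input is modelled here as weight 1 for every element.
    print(exact_weighted_quantiles([(i, 1) for i in range(101)], 5))
    # [0, 25, 50, 75, 100]
    print(exact_weighted_quantiles(
        [(i, 1 if i < 10 else 1e-5) for i in range(101)], 5))
    # [0, 2, 5, 7, 100]
```

In the weighted case almost all of the weight sits on the values 0 to 9, which is why the interior quantiles fall inside that range while the last entry is still the maximum, 100.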

class Globally(num_quantiles, key=None, reverse=False, weighted=False, input_batched=False)[source]

Bases: apache_beam.transforms.ptransform.PTransform

PTransform that takes a PCollection and returns a PCollection whose single element is the list of approximate N-tiles of the whole input collection.

Parameters:
  • num_quantiles – number of elements in the resulting quantiles values list.
  • key – (optional) a function mapping each element to a comparable key, similar to the key argument of Python’s sorting methods.
  • reverse – (optional) whether to order things largest to smallest, rather than smallest to largest.
  • weighted – (optional) if set to True, the transform returns weighted quantiles. The input PCollection is then expected to contain tuples of an input value and its corresponding weight.
  • input_batched – (optional) if set to True, the transform expects each element of the input PCollection to be a batch: a list of elements in the non-weighted case, or a tuple of a list of elements and a list of weights in the weighted case. This provides a way to accumulate multiple elements at a time more efficiently.
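The element shapes expected with input_batched=True can be sketched in plain Python. The helpers to_batches and to_weighted_batches are hypothetical names used only for this illustration; they show how unbatched input would be regrouped into the batch shapes described above.

```python
def to_batches(values, batch_size):
    """Non-weighted case: each batch is a list of elements."""
    return [values[i:i + batch_size]
            for i in range(0, len(values), batch_size)]


def to_weighted_batches(pairs, batch_size):
    """Weighted case: each batch is a (elements, weights) tuple of lists."""
    batches = []
    for i in range(0, len(pairs), batch_size):
        chunk = pairs[i:i + batch_size]
        values = [value for value, _ in chunk]
        weights = [weight for _, weight in chunk]
        batches.append((values, weights))
    return batches


if __name__ == "__main__":
    print(to_batches(list(range(5)), 2))
    # [[0, 1], [2, 3], [4]]
    print(to_weighted_batches([(1, 0.5), (2, 1.0), (3, 2.0)], 2))
    # [([1, 2], [0.5, 1.0]), ([3], [2.0])]
```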
expand(pcoll)[source]
display_data()[source]
annotations() → Dict[str, Union[bytes, str, google.protobuf.message.Message]]
default_label()
default_type_hints()
classmethod from_runner_api(proto, context)
get_type_hints()

Gets and/or initializes type hints for this object.

If type hints have not been set, attempts to initialize type hints in this order:
  • Using self.default_type_hints().
  • Using self.__class__ type hints.

get_windowing(inputs)

Returns the window function to be associated with transform’s output.

By default most transforms just return the windowing function associated with the input PCollection (or the first input if several).

infer_output_type(unused_input_type)
label
pipeline = None
classmethod register_urn(urn, parameter_type, constructor=None)
runner_api_requires_keyed_input()
side_inputs = ()
to_runner_api(context, has_parts=False, **extra_kwargs)
to_runner_api_parameter(unused_context)
to_runner_api_pickled(unused_context)
type_check_inputs(pvalueish)
type_check_inputs_or_outputs(pvalueish, input_or_output)
type_check_outputs(pvalueish)
with_input_types(input_type_hint)

Annotates the input type of a PTransform with a type-hint.

Parameters: input_type_hint (type) – An instance of an allowed built-in type, a custom class, or an instance of a TypeConstraint.
Raises: TypeError – If input_type_hint is not a valid type-hint. See apache_beam.typehints.typehints.validate_composite_type_param() for further details.
Returns: A reference to the instance of this particular PTransform object. This allows chaining type-hinting related methods.
Return type: PTransform
with_output_types(type_hint)

Annotates the output type of a PTransform with a type-hint.

Parameters: type_hint (type) – An instance of an allowed built-in type, a custom class, or a TypeConstraint.
Raises: TypeError – If type_hint is not a valid type-hint. See validate_composite_type_param() for further details.
Returns: A reference to the instance of this particular PTransform object. This allows chaining type-hinting related methods.
Return type: PTransform
class PerKey(num_quantiles, key=None, reverse=False, weighted=False, input_batched=False)[source]

Bases: apache_beam.transforms.ptransform.PTransform

PTransform that takes a PCollection of key-value pairs and returns, for each key, the list of approximate N-tiles of the values associated with that key.

Parameters:
  • num_quantiles – number of elements in the resulting quantiles values list.
  • key – (optional) a function mapping each element to a comparable key, similar to the key argument of Python’s sorting methods.
  • reverse – (optional) whether to order things largest to smallest, rather than smallest to largest.
  • weighted – (optional) if set to True, the transform returns weighted quantiles. The input PCollection is then expected to contain tuples of an input value and its corresponding weight.
  • input_batched – (optional) if set to True, the transform expects each element of the input PCollection to be a batch: a list of elements in the non-weighted case, or a tuple of a list of elements and a list of weights in the weighted case. This provides a way to accumulate multiple elements at a time more efficiently.
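The per-key behaviour can be pictured as grouping values by key and computing quantiles within each group. The following is an exact, in-memory sketch under that assumption, not the transform's approximate algorithm; quantiles_per_key and exact_quantiles are hypothetical names, and the evenly spaced index rule is chosen only for illustration.

```python
from collections import defaultdict


def exact_quantiles(values, num_quantiles):
    """Pick num_quantiles evenly spaced entries from the sorted values.

    Assumes num_quantiles >= 2 and a non-empty list of values.
    """
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[(i * last) // (num_quantiles - 1)]
            for i in range(num_quantiles)]


def quantiles_per_key(kv_pairs, num_quantiles):
    """Group (key, value) pairs by key and compute quantiles per group."""
    grouped = defaultdict(list)
    for key, value in kv_pairs:
        grouped[key].append(value)
    return {key: exact_quantiles(values, num_quantiles)
            for key, values in grouped.items()}


if __name__ == "__main__":
    pairs = [("a", i) for i in range(101)]
    pairs += [("b", i) for i in range(0, 41, 10)]
    print(quantiles_per_key(pairs, 5))
    # {'a': [0, 25, 50, 75, 100], 'b': [0, 10, 20, 30, 40]}
```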
expand(pcoll)[source]
display_data()[source]
annotations() → Dict[str, Union[bytes, str, google.protobuf.message.Message]]
default_label()
default_type_hints()
classmethod from_runner_api(proto, context)
get_type_hints()

Gets and/or initializes type hints for this object.

If type hints have not been set, attempts to initialize type hints in this order:
  • Using self.default_type_hints().
  • Using self.__class__ type hints.

get_windowing(inputs)

Returns the window function to be associated with transform’s output.

By default most transforms just return the windowing function associated with the input PCollection (or the first input if several).

infer_output_type(unused_input_type)
label
pipeline = None
classmethod register_urn(urn, parameter_type, constructor=None)
runner_api_requires_keyed_input()
side_inputs = ()
to_runner_api(context, has_parts=False, **extra_kwargs)
to_runner_api_parameter(unused_context)
to_runner_api_pickled(unused_context)
type_check_inputs(pvalueish)
type_check_inputs_or_outputs(pvalueish, input_or_output)
type_check_outputs(pvalueish)
with_input_types(input_type_hint)

Annotates the input type of a PTransform with a type-hint.

Parameters: input_type_hint (type) – An instance of an allowed built-in type, a custom class, or an instance of a TypeConstraint.
Raises: TypeError – If input_type_hint is not a valid type-hint. See apache_beam.typehints.typehints.validate_composite_type_param() for further details.
Returns: A reference to the instance of this particular PTransform object. This allows chaining type-hinting related methods.
Return type: PTransform
with_output_types(type_hint)

Annotates the output type of a PTransform with a type-hint.

Parameters: type_hint (type) – An instance of an allowed built-in type, a custom class, or a TypeConstraint.
Raises: TypeError – If type_hint is not a valid type-hint. See validate_composite_type_param() for further details.
Returns: A reference to the instance of this particular PTransform object. This allows chaining type-hinting related methods.
Return type: PTransform