apache_beam.io.parquetio module

PTransforms for reading from and writing to Parquet files.

Provides two read PTransforms, ReadFromParquet and ReadAllFromParquet, that produce a PCollection of records. Each record of this PCollection will contain a single record read from a Parquet file. Records that are of simple types will be mapped into corresponding Python types. The actual Parquet file operations are performed by pyarrow. Source splitting is supported at row group granularity.

Additionally, this module provides a write PTransform WriteToParquet that can be used to write a given PCollection of Python objects to a Parquet file.
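
For example, a minimal read-then-write pipeline can be sketched as follows. The file pattern, output prefix, and two-column schema are illustrative assumptions (the input files are assumed to contain ‘name’ and ‘age’ columns); they are not defaults of this module.

import apache_beam as beam
import pyarrow

with beam.Pipeline() as p:
  _ = (
      p
      # Illustrative input pattern; each element read is a dict per record.
      | 'Read' >> beam.io.ReadFromParquet('/mypath/input*.parquet')
      # Illustrative output prefix and schema matching the assumed columns.
      | 'Write' >> beam.io.WriteToParquet(
          '/mypath/output',
          pyarrow.schema(
              [('name', pyarrow.string()), ('age', pyarrow.int64())])))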

class apache_beam.io.parquetio.ReadFromParquetBatched(file_pattern=None, min_bundle_size=0, validate=True, columns=None)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A PTransform for reading Parquet files as a PCollection of pyarrow.Table. This PTransform is currently experimental. No backward-compatibility guarantees.

Initializes ReadFromParquetBatched.

An alternative to ReadFromParquet that yields each row group from the Parquet file as a pyarrow.Table. These Table instances can be processed directly, or converted to a pandas DataFrame for processing. For more information on supported types and schema, please see the pyarrow documentation.

with beam.Pipeline() as p:
  dataframes = p \
      | 'Read' >> beam.io.ReadFromParquetBatched('/mypath/mypqfiles*') \
      | 'Convert to pandas' >> beam.Map(lambda table: table.to_pandas())

See also: ReadFromParquet.

Parameters:
  • file_pattern (str) – the file glob to read
  • min_bundle_size (int) – the minimum size in bytes to be considered when splitting the input into bundles.
  • validate (bool) – flag to verify that the files exist during pipeline creation time.
  • columns (List[str]) – list of columns that will be read from files. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’. See the sketch below.
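
Column projection can be combined with the batched read; in this sketch the file pattern and column names are illustrative:

with beam.Pipeline() as p:
  tables = p | 'Read' >> beam.io.ReadFromParquetBatched(
      '/mypath/mypqfiles*', columns=['name', 'address'])  # hypothetical columns
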
expand(pvalue)[source]
display_data()[source]
annotations() → Dict[str, Union[bytes, str, google.protobuf.message.Message]]
default_label()
default_type_hints()
classmethod from_runner_api(proto, context)
get_type_hints()

Gets and/or initializes type hints for this object.

If type hints have not been set, attempts to initialize type hints in this order:
  • Using self.default_type_hints().
  • Using self.__class__ type hints.

get_windowing(inputs)

Returns the window function to be associated with the transform’s output.

By default most transforms just return the windowing function associated with the input PCollection (or the first input if several).

infer_output_type(unused_input_type)
label
pipeline = None
classmethod register_urn(urn, parameter_type, constructor=None)
runner_api_requires_keyed_input()
side_inputs = ()
to_runner_api(context, has_parts=False, **extra_kwargs)
to_runner_api_parameter(unused_context)
to_runner_api_pickled(unused_context)
type_check_inputs(pvalueish)
type_check_inputs_or_outputs(pvalueish, input_or_output)
type_check_outputs(pvalueish)
with_input_types(input_type_hint)

Annotates the input type of a PTransform with a type-hint.

Parameters: input_type_hint (type) – An instance of an allowed built-in type, a custom class, or an instance of a TypeConstraint.
Raises: TypeError – If input_type_hint is not a valid type-hint. See apache_beam.typehints.typehints.validate_composite_type_param() for further details.
Returns: A reference to the instance of this particular PTransform object. This allows chaining type-hinting related methods.
Return type: PTransform
with_output_types(type_hint)

Annotates the output type of a PTransform with a type-hint.

Parameters: type_hint (type) – An instance of an allowed built-in type, a custom class, or a TypeConstraint.
Raises: TypeError – If type_hint is not a valid type-hint. See validate_composite_type_param() for further details.
Returns: A reference to the instance of this particular PTransform object. This allows chaining type-hinting related methods.
Return type: PTransform
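
For instance, the output of ReadFromParquetBatched could be annotated with pyarrow.Table; this is a sketch of the chaining style rather than something the transform requires:

import pyarrow

with beam.Pipeline() as p:
  tables = (
      p
      | 'Read' >> beam.io.ReadFromParquetBatched(
          '/mypath/mypqfiles*').with_output_types(pyarrow.Table))
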
class apache_beam.io.parquetio.ReadFromParquet(file_pattern=None, min_bundle_size=0, validate=True, columns=None)[source]

Bases: apache_beam.transforms.ptransform.PTransform

A PTransform for reading Parquet files as a PCollection of dictionaries. This PTransform is currently experimental. No backward-compatibility guarantees.

Initializes ReadFromParquet.

Uses source _ParquetSource to read a set of Parquet files defined by a given file pattern.

If /mypath/mypqfiles* is a file-pattern that points to a set of Parquet files, a PCollection for the records in these Parquet files can be created in the following manner.

with beam.Pipeline() as p:
  records = p | 'Read' >> beam.io.ReadFromParquet('/mypath/mypqfiles*')

Each element of this PCollection will contain a Python dictionary representing a single record. The keys will be of type str and named after their corresponding column names. The values will be of the type defined in the corresponding Parquet schema. Records that are of simple types will be mapped into corresponding Python types. Records that are of complex types like list and struct will be mapped to Python list and dictionary respectively. For more information on supported types and schema, please see the pyarrow documentation.
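
Individual fields can then be extracted from each dictionary with ordinary Beam transforms; the ‘name’ column in this sketch is an assumption about the input files:

with beam.Pipeline() as p:
  names = (
      p
      | 'Read' >> beam.io.ReadFromParquet('/mypath/mypqfiles*')
      # 'name' is a hypothetical column in the input files.
      | 'GetName' >> beam.Map(lambda record: record['name']))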

See also: ReadFromParquetBatched.

Parameters:
  • file_pattern (str) – the file glob to read
  • min_bundle_size (int) – the minimum size in bytes to be considered when splitting the input into bundles.
  • validate (bool) – flag to verify that the files exist during pipeline creation time.
  • columns (List[str]) – list of columns that will be read from files. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’
expand(pvalue)[source]
display_data()[source]
annotations() → Dict[str, Union[bytes, str, google.protobuf.message.Message]]
default_label()
default_type_hints()
classmethod from_runner_api(proto, context)
get_type_hints()

Gets and/or initializes type hints for this object.

If type hints have not been set, attempts to initialize type hints in this order:
  • Using self.default_type_hints().
  • Using self.__class__ type hints.

get_windowing(inputs)

Returns the window function to be associated with the transform’s output.

By default most transforms just return the windowing function associated with the input PCollection (or the first input if several).

infer_output_type(unused_input_type)
label
pipeline = None
classmethod register_urn(urn, parameter_type, constructor=None)
runner_api_requires_keyed_input()
side_inputs = ()
to_runner_api(context, has_parts=False, **extra_kwargs)
to_runner_api_parameter(unused_context)
to_runner_api_pickled(unused_context)
type_check_inputs(pvalueish)
type_check_inputs_or_outputs(pvalueish, input_or_output)
type_check_outputs(pvalueish)
with_input_types(input_type_hint)

Annotates the input type of a PTransform with a type-hint.

Parameters: input_type_hint (type) – An instance of an allowed built-in type, a custom class, or an instance of a TypeConstraint.
Raises: TypeError – If input_type_hint is not a valid type-hint. See apache_beam.typehints.typehints.validate_composite_type_param() for further details.
Returns: A reference to the instance of this particular PTransform object. This allows chaining type-hinting related methods.
Return type: PTransform
with_output_types(type_hint)

Annotates the output type of a PTransform with a type-hint.

Parameters: type_hint (type) – An instance of an allowed built-in type, a custom class, or a TypeConstraint.
Raises: TypeError – If type_hint is not a valid type-hint. See validate_composite_type_param() for further details.
Returns: A reference to the instance of this particular PTransform object. This allows chaining type-hinting related methods.
Return type: PTransform
class apache_beam.io.parquetio.ReadAllFromParquetBatched(min_bundle_size=0, desired_bundle_size=67108864, columns=None, label='ReadAllFiles')[source]

Bases: apache_beam.transforms.ptransform.PTransform

A PTransform for reading a PCollection of Parquet files.

Uses source _ParquetSource to read a PCollection of Parquet files or file patterns and produce a PCollection of pyarrow.Table, one for each Parquet file row group. This PTransform is currently experimental. No backward-compatibility guarantees.

Initializes ReadAllFromParquetBatched.

Parameters:
  • min_bundle_size – the minimum size in bytes to be considered when splitting the input into bundles.
  • desired_bundle_size – the desired size in bytes to be considered when splitting the input into bundles.
  • columns – list of columns that will be read from files. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’. See the sketch below.
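
As a sketch, a PCollection of file patterns can be expanded into pyarrow.Table instances, one per row group; the patterns and column names below are illustrative:

with beam.Pipeline() as p:
  tables = (
      p
      # Hypothetical file patterns provided as pipeline data.
      | 'Patterns' >> beam.Create(
          ['/mypath/setA*.parquet', '/mypath/setB*.parquet'])
      | 'ReadAll' >> beam.io.ReadAllFromParquetBatched(columns=['name', 'age']))
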
DEFAULT_DESIRED_BUNDLE_SIZE = 67108864
label
expand(pvalue)[source]
annotations() → Dict[str, Union[bytes, str, google.protobuf.message.Message]]
default_label()
default_type_hints()
display_data()

Returns the display data associated to a pipeline component.

It should be reimplemented in pipeline components that wish to have static display data.

Returns: A dictionary containing key:value pairs. The value might be an integer, float or string value; a DisplayDataItem for values that have more data (e.g. short value, label, url); or a HasDisplayData instance that has more display data that should be picked up. For example:
{
  'key1': 'string_value',
  'key2': 1234,
  'key3': 3.14159265,
  'key4': DisplayDataItem('apache.org', url='http://apache.org'),
  'key5': subComponent
}
Return type: Dict[str, Any]
classmethod from_runner_api(proto, context)
get_type_hints()

Gets and/or initializes type hints for this object.

If type hints have not been set, attempts to initialize type hints in this order:
  • Using self.default_type_hints().
  • Using self.__class__ type hints.

get_windowing(inputs)

Returns the window function to be associated with the transform’s output.

By default most transforms just return the windowing function associated with the input PCollection (or the first input if several).

infer_output_type(unused_input_type)
pipeline = None
classmethod register_urn(urn, parameter_type, constructor=None)
runner_api_requires_keyed_input()
side_inputs = ()
to_runner_api(context, has_parts=False, **extra_kwargs)
to_runner_api_parameter(unused_context)
to_runner_api_pickled(unused_context)
type_check_inputs(pvalueish)
type_check_inputs_or_outputs(pvalueish, input_or_output)
type_check_outputs(pvalueish)
with_input_types(input_type_hint)

Annotates the input type of a PTransform with a type-hint.

Parameters: input_type_hint (type) – An instance of an allowed built-in type, a custom class, or an instance of a TypeConstraint.
Raises: TypeError – If input_type_hint is not a valid type-hint. See apache_beam.typehints.typehints.validate_composite_type_param() for further details.
Returns: A reference to the instance of this particular PTransform object. This allows chaining type-hinting related methods.
Return type: PTransform
with_output_types(type_hint)

Annotates the output type of a PTransform with a type-hint.

Parameters: type_hint (type) – An instance of an allowed built-in type, a custom class, or a TypeConstraint.
Raises: TypeError – If type_hint is not a valid type-hint. See validate_composite_type_param() for further details.
Returns: A reference to the instance of this particular PTransform object. This allows chaining type-hinting related methods.
Return type: PTransform
class apache_beam.io.parquetio.ReadAllFromParquet(**kwargs)[source]

Bases: apache_beam.transforms.ptransform.PTransform
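
Per the module overview above, this transform reads a PCollection of Parquet files or file patterns and produces a PCollection of records, one dictionary per record, analogous to ReadFromParquet. A minimal sketch with illustrative file patterns:

with beam.Pipeline() as p:
  records = (
      p
      # Hypothetical file patterns provided as pipeline data.
      | 'Patterns' >> beam.Create(
          ['/mypath/setA*.parquet', '/mypath/setB*.parquet'])
      | 'ReadAll' >> beam.io.ReadAllFromParquet())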

expand(pvalue)[source]
annotations() → Dict[str, Union[bytes, str, google.protobuf.message.Message]]
default_label()
default_type_hints()
display_data()

Returns the display data associated to a pipeline component.

It should be reimplemented in pipeline components that wish to have static display data.

Returns: A dictionary containing key:value pairs. The value might be an integer, float or string value; a DisplayDataItem for values that have more data (e.g. short value, label, url); or a HasDisplayData instance that has more display data that should be picked up. For example:
{
  'key1': 'string_value',
  'key2': 1234,
  'key3': 3.14159265,
  'key4': DisplayDataItem('apache.org', url='http://apache.org'),
  'key5': subComponent
}
Return type: Dict[str, Any]
classmethod from_runner_api(proto, context)
get_type_hints()

Gets and/or initializes type hints for this object.

If type hints have not been set, attempts to initialize type hints in this order:
  • Using self.default_type_hints().
  • Using self.__class__ type hints.

get_windowing(inputs)

Returns the window function to be associated with the transform’s output.

By default most transforms just return the windowing function associated with the input PCollection (or the first input if several).

infer_output_type(unused_input_type)
label
pipeline = None
classmethod register_urn(urn, parameter_type, constructor=None)
runner_api_requires_keyed_input()
side_inputs = ()
to_runner_api(context, has_parts=False, **extra_kwargs)
to_runner_api_parameter(unused_context)
to_runner_api_pickled(unused_context)
type_check_inputs(pvalueish)
type_check_inputs_or_outputs(pvalueish, input_or_output)
type_check_outputs(pvalueish)
with_input_types(input_type_hint)

Annotates the input type of a PTransform with a type-hint.

Parameters: input_type_hint (type) – An instance of an allowed built-in type, a custom class, or an instance of a TypeConstraint.
Raises: TypeError – If input_type_hint is not a valid type-hint. See apache_beam.typehints.typehints.validate_composite_type_param() for further details.
Returns: A reference to the instance of this particular PTransform object. This allows chaining type-hinting related methods.
Return type: PTransform
with_output_types(type_hint)

Annotates the output type of a PTransform with a type-hint.

Parameters: type_hint (type) – An instance of an allowed built-in type, a custom class, or a TypeConstraint.
Raises: TypeError – If type_hint is not a valid type-hint. See validate_composite_type_param() for further details.
Returns: A reference to the instance of this particular PTransform object. This allows chaining type-hinting related methods.
Return type: PTransform
class apache_beam.io.parquetio.WriteToParquet(file_path_prefix, schema, row_group_buffer_size=67108864, record_batch_size=1000, codec='none', use_deprecated_int96_timestamps=False, file_name_suffix='', num_shards=0, shard_name_template=None, mime_type='application/x-parquet')[source]

Bases: apache_beam.transforms.ptransform.PTransform

A PTransform for writing Parquet files.

This PTransform is currently experimental. No backward-compatibility guarantees.

Initialize a WriteToParquet transform.

Writes Parquet files from a PCollection of records. Each record is a dictionary whose keys are strings representing column names. The schema must be specified as in the example below.

with beam.Pipeline() as p:
  records = p | 'Read' >> beam.Create(
      [{'name': 'foo', 'age': 10}, {'name': 'bar', 'age': 20}]
  )
  _ = records | 'Write' >> beam.io.WriteToParquet(filename,
      pyarrow.schema(
          [('name', pyarrow.string()), ('age', pyarrow.int64())]
      )
  )

For more information on supported types and schema, please see the pyarrow documentation.

Parameters:
  • file_path_prefix – The file path to write to. The files written will begin with this prefix, followed by a shard identifier (see num_shards), and end in a common extension, if given by file_name_suffix. In most cases, only this argument is specified and num_shards, shard_name_template, and file_name_suffix use default values.
  • schema – The schema to use, as type of pyarrow.Schema.
  • row_group_buffer_size – The byte size of the row group buffer. Note that this size is for uncompressed data in memory and is normally much larger than the actual row group size written to a file.
  • record_batch_size – The number of records in each record batch. A record batch is the basic unit used for storing data in the row group buffer. A larger record batch size implies coarser-grained control over the row group buffer size. To configure the row group size by record count, set row_group_buffer_size to 1 and use record_batch_size to adjust the value.
  • codec – The codec to use for block-level compression. Any string supported by the pyarrow specification is accepted. See the sketch after this parameter list.
  • use_deprecated_int96_timestamps – Write nanosecond resolution timestamps to INT96 Parquet format. Defaults to False.
  • file_name_suffix – Suffix for the files written.
  • num_shards – The number of files (shards) used for output. If not set, the service will decide on the optimal number of shards. Constraining the number of shards is likely to reduce the performance of a pipeline. Setting this value is not recommended unless you require a specific number of output files.
  • shard_name_template – A template string containing placeholders for the shard number and shard count. When constructing a filename for a particular shard number, the upper-case letters ‘S’ and ‘N’ are replaced with the 0-padded shard number and shard count respectively. This argument can be ‘’, in which case it behaves as if num_shards were set to 1 and only one file will be generated. The default pattern used is ‘-SSSSS-of-NNNNN’ if None is passed as the shard_name_template.
  • mime_type – The MIME type to use for the produced files, if the filesystem supports specifying MIME types.
Returns:

A WriteToParquet transform usable for writing.
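
As a sketch of several of the options above, the codec, shard count, and file name suffix here are illustrative choices rather than defaults (pyarrow is assumed to be imported, as in the example above):

with beam.Pipeline() as p:
  _ = (
      p
      | 'Create' >> beam.Create([{'name': 'foo', 'age': 10}])
      | 'Write' >> beam.io.WriteToParquet(
          '/mypath/output',  # illustrative output prefix
          pyarrow.schema(
              [('name', pyarrow.string()), ('age', pyarrow.int64())]),
          codec='snappy',
          num_shards=1,
          file_name_suffix='.parquet'))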

expand(pcoll)[source]
display_data()[source]
annotations() → Dict[str, Union[bytes, str, google.protobuf.message.Message]]
default_label()
default_type_hints()
classmethod from_runner_api(proto, context)
get_type_hints()

Gets and/or initializes type hints for this object.

If type hints have not been set, attempts to initialize type hints in this order:
  • Using self.default_type_hints().
  • Using self.__class__ type hints.

get_windowing(inputs)

Returns the window function to be associated with the transform’s output.

By default most transforms just return the windowing function associated with the input PCollection (or the first input if several).

infer_output_type(unused_input_type)
label
pipeline = None
classmethod register_urn(urn, parameter_type, constructor=None)
runner_api_requires_keyed_input()
side_inputs = ()
to_runner_api(context, has_parts=False, **extra_kwargs)
to_runner_api_parameter(unused_context)
to_runner_api_pickled(unused_context)
type_check_inputs(pvalueish)
type_check_inputs_or_outputs(pvalueish, input_or_output)
type_check_outputs(pvalueish)
with_input_types(input_type_hint)

Annotates the input type of a PTransform with a type-hint.

Parameters: input_type_hint (type) – An instance of an allowed built-in type, a custom class, or an instance of a TypeConstraint.
Raises: TypeError – If input_type_hint is not a valid type-hint. See apache_beam.typehints.typehints.validate_composite_type_param() for further details.
Returns: A reference to the instance of this particular PTransform object. This allows chaining type-hinting related methods.
Return type: PTransform
with_output_types(type_hint)

Annotates the output type of a PTransform with a type-hint.

Parameters: type_hint (type) – An instance of an allowed built-in type, a custom class, or a TypeConstraint.
Raises: TypeError – If type_hint is not a valid type-hint. See validate_composite_type_param() for further details.
Returns: A reference to the instance of this particular PTransform object. This allows chaining type-hinting related methods.
Return type: PTransform