apache_beam.io.gcp.bigquery_tools module

Tools used by BigQuery sources and sinks.

Classes, constants and functions in this file are experimental and have no backwards compatibility guarantees.

These tools include wrappers and clients to interact with BigQuery APIs.

NOTHING IN THIS FILE HAS BACKWARDS COMPATIBILITY GUARANTEES.

class apache_beam.io.gcp.bigquery_tools.FileFormat[source]

Bases: object

CSV = 'CSV'
JSON = 'NEWLINE_DELIMITED_JSON'
AVRO = 'AVRO'
class apache_beam.io.gcp.bigquery_tools.ExportCompression[source]

Bases: object

GZIP = 'GZIP'
DEFLATE = 'DEFLATE'
SNAPPY = 'SNAPPY'
NONE = 'NONE'
apache_beam.io.gcp.bigquery_tools.default_encoder(obj)[source]
apache_beam.io.gcp.bigquery_tools.get_hashable_destination(destination)[source]

Returns the destination as a hashable string of the form ‘PROJECT:DATASET.TABLE’.

Parameters:destination – Either a TableReference object from the bigquery API (with projectId, datasetId, and tableId attributes), or a string of the form ‘PROJECT:DATASET.TABLE’.
Returns:A string of the form ‘PROJECT:DATASET.TABLE’ identifying the destination.
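For illustration, a minimal sketch of how this helper behaves based on the description above; the project, dataset and table names are placeholders:

    from apache_beam.io.gcp.bigquery_tools import (
        get_hashable_destination, parse_table_reference)

    # A TableReference is flattened into a 'PROJECT:DATASET.TABLE' string.
    ref = parse_table_reference('my-project:my_dataset.my_table')
    key = get_hashable_destination(ref)
    # key == 'my-project:my_dataset.my_table'

    # A string destination is expected to pass through unchanged.
    assert get_hashable_destination('my-project:my_dataset.my_table') == key
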
apache_beam.io.gcp.bigquery_tools.parse_table_schema_from_json(schema_string)[source]

Parse the Table Schema provided as string.

Parameters:schema_string – The table schema serialized as a string; must be valid JSON.
Returns:A TableSchema of the BigQuery export from either the Query or the Table.
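As a sketch, a schema serialized as JSON with a top-level ‘fields’ list can be parsed into a TableSchema; the field names here are arbitrary examples:

    import json
    from apache_beam.io.gcp.bigquery_tools import parse_table_schema_from_json

    schema_json = json.dumps({
        'fields': [
            {'name': 'name', 'type': 'STRING', 'mode': 'REQUIRED'},
            {'name': 'age', 'type': 'INTEGER', 'mode': 'NULLABLE'},
        ]
    })
    table_schema = parse_table_schema_from_json(schema_json)
    # table_schema is a bigquery.TableSchema with two TableFieldSchema entries.
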
apache_beam.io.gcp.bigquery_tools.parse_table_reference(table, dataset=None, project=None)[source]

Parses a table reference into a (project, dataset, table) tuple.

Parameters:
  • table – The ID of the table. The ID must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_). If dataset argument is None then the table argument must contain the entire table reference: ‘DATASET.TABLE’ or ‘PROJECT:DATASET.TABLE’. This argument can be a bigquery.TableReference instance in which case dataset and project are ignored and the reference is returned as a result. Additionally, for date partitioned tables, appending ‘$YYYYmmdd’ to the table name is supported, e.g. ‘DATASET.TABLE$YYYYmmdd’.
  • dataset – The ID of the dataset containing this table or null if the table reference is specified entirely by the table argument.
  • project – The ID of the project containing this table or null if the table reference is specified entirely by the table (and possibly dataset) argument.
Returns:

A TableReference object from the bigquery API. The object has the following attributes: projectId, datasetId, and tableId. If the input is a TableReference object, a new object will be returned.

Raises:

ValueError – if the table reference as a string does not match the expected format.
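A brief sketch of equivalent calls, using placeholder project, dataset and table IDs:

    from apache_beam.io.gcp.bigquery_tools import parse_table_reference

    # Fully qualified single-string form.
    ref = parse_table_reference('my-project:my_dataset.my_table')
    # ref.projectId == 'my-project', ref.datasetId == 'my_dataset',
    # ref.tableId == 'my_table'

    # The same reference built from separate arguments.
    same_ref = parse_table_reference(
        'my_table', dataset='my_dataset', project='my-project')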

class apache_beam.io.gcp.bigquery_tools.BigQueryWrapper(client=None, temp_dataset_id=None)[source]

Bases: object

BigQuery client wrapper with utilities for querying.

The wrapper is used to organize all the BigQuery integration points and offer a common place where retry logic for failures can be controlled. In addition it offers various functions used both in sources and sinks (e.g., find and create tables, query a table, etc.).
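As a rough usage sketch (the project, dataset and table IDs are placeholders, and the calls assume BigQuery credentials are available in the environment):

    from apache_beam.io.gcp.bigquery_tools import BigQueryWrapper

    # client=None lets the wrapper construct a default BigQuery client.
    wrapper = BigQueryWrapper()
    table = wrapper.get_table(
        project_id='my-project',
        dataset_id='my_dataset',
        table_id='my_table')
    location = wrapper.get_table_location(
        'my-project', 'my_dataset', 'my_table')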

TEMP_TABLE = 'temp_table_'
TEMP_DATASET = 'temp_dataset_'
HISTOGRAM_METRIC_LOGGER = <apache_beam.internal.metrics.metric.MetricLogger object>
unique_row_id

Returns a unique row ID (str) used to avoid multiple insertions.

If the row ID is provided, BigQuery will make a best effort to not insert the same row multiple times for fail and retry scenarios in which the insert request may be issued several times. This comes into play for sinks executed in a local runner.

Returns:a unique row ID string
get_query_location(project_id, query, use_legacy_sql)[source]

Get the location of tables referenced in a query.

This method returns the location of the first referenced table in the query that is available to the user, and relies on the BigQuery service to provide error handling for queries that reference tables in multiple locations.

wait_for_bq_job(job_reference, sleep_duration_sec=5, max_retries=0)[source]

Poll job until it is DONE.

Parameters:
  • job_reference – bigquery.JobReference instance.
  • sleep_duration_sec – Specifies the delay in seconds between retries.
  • max_retries – The total number of times to retry. If set to 0, the function waits indefinitely.
Raises:

RuntimeError – If the job is FAILED or the number of retries has been reached.

get_table(project_id, dataset_id, table_id)[source]

Lookup a table’s metadata object.

Parameters:
  • project_id – The ID of the project containing the table.
  • dataset_id – The ID of the dataset containing the table.
  • table_id – The ID of the table to look up.
Returns:

bigquery.Table instance

Raises:

HttpError – if lookup failed.

get_or_create_dataset(project_id, dataset_id, location=None)[source]
get_table_location(project_id, dataset_id, table_id)[source]
create_temporary_dataset(project_id, location)[source]
clean_up_temporary_dataset(project_id)[source]
get_job(project, job_id, location=None)[source]
perform_load_job(destination, files, job_id, schema=None, write_disposition=None, create_disposition=None, additional_load_parameters=None, source_format=None, job_labels=None)[source]

Starts a job to load data into BigQuery.

Returns:bigquery.JobReference with the information about the job that was started.
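A hedged sketch of starting a load job and polling it with wait_for_bq_job(); the destination, GCS URI and job ID are placeholders, and the destination table is assumed to already exist (otherwise a schema would need to be supplied):

    from apache_beam.io.gcp.bigquery_tools import (
        BigQueryWrapper, FileFormat, parse_table_reference)

    wrapper = BigQueryWrapper()
    destination = parse_table_reference('my-project:my_dataset.my_table')
    job_ref = wrapper.perform_load_job(
        destination=destination,
        files=['gs://my-bucket/rows-00000.json'],
        job_id='my-load-job-001',
        source_format=FileFormat.JSON)
    # Poll until the job is DONE; max_retries=0 means wait indefinitely.
    wrapper.wait_for_bq_job(job_ref, sleep_duration_sec=5, max_retries=0)
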
perform_extract_job(destination, job_id, table_reference, destination_format, project=None, include_header=True, compression='NONE', use_avro_logical_types=False, job_labels=None)[source]

Starts a job to export data from BigQuery.

Returns:bigquery.JobReference with the information about the job that was started.
get_or_create_table(project_id, dataset_id, table_id, schema, create_disposition, write_disposition, additional_create_parameters=None)[source]

Gets or creates a table based on create and write dispositions.

The function mimics the behavior of BigQuery import jobs when using the same create and write dispositions.

Parameters:
  • project_id – The project id owning the table.
  • dataset_id – The dataset id owning the table.
  • table_id – The table id.
  • schema – A bigquery.TableSchema instance or None.
  • create_disposition – CREATE_NEVER or CREATE_IF_NEEDED.
  • write_disposition – WRITE_APPEND, WRITE_EMPTY or WRITE_TRUNCATE.
Returns:

A bigquery.Table instance if table was found or created.

Raises:

RuntimeError – For various mismatches between the state of the table and the create/write dispositions passed in. For example if the table is not empty and WRITE_EMPTY was specified then an error will be raised since the table was expected to be empty.
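A sketch under stated assumptions: the IDs are placeholders, credentials are available, and the disposition constants are taken from apache_beam.io.gcp.bigquery.BigQueryDisposition (the plain ‘CREATE_IF_NEEDED’ / ‘WRITE_APPEND’ strings should be equivalent):

    from apache_beam.io.gcp.bigquery import BigQueryDisposition
    from apache_beam.io.gcp.bigquery_tools import (
        BigQueryWrapper, get_table_schema_from_string)

    wrapper = BigQueryWrapper()
    schema = get_table_schema_from_string('name:STRING,age:INTEGER')
    table = wrapper.get_or_create_table(
        'my-project',
        'my_dataset',
        'my_table',
        schema,
        create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=BigQueryDisposition.WRITE_APPEND)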

run_query(project_id, query, use_legacy_sql, flatten_results, dry_run=False, job_labels=None)[source]
insert_rows(project_id, dataset_id, table_id, rows, insert_ids=None, skip_invalid_rows=False)[source]

Inserts rows into the specified table.

Parameters:
  • project_id – The project id owning the table.
  • dataset_id – The dataset id owning the table.
  • table_id – The table id.
  • rows – A list of plain Python dictionaries. Each dictionary is a row and each key in it is the name of a field.
  • skip_invalid_rows – Whether rows with insertion errors should be skipped so that the remaining valid rows can still be inserted.
Returns:

A tuple (bool, errors). If the first element is False then the second element will be a bigquery.InsertErrorsValueListEntry instance containing the specific errors.
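For illustration (placeholder IDs, with streaming-insert access and credentials assumed), rows are plain dicts keyed by field name:

    from apache_beam.io.gcp.bigquery_tools import BigQueryWrapper

    wrapper = BigQueryWrapper()
    rows = [
        {'name': 'Alice', 'age': 30},
        {'name': 'Bob', 'age': 25},
    ]
    passed, errors = wrapper.insert_rows(
        project_id='my-project',
        dataset_id='my_dataset',
        table_id='my_table',
        rows=rows,
        skip_invalid_rows=True)
    if not passed:
        # errors describes the rows that failed to insert.
        print(errors)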

convert_row_to_dict(row, schema)[source]

Converts a TableRow instance using the schema to a Python dict.

class apache_beam.io.gcp.bigquery_tools.BigQueryReader(source, test_bigquery_client=None, use_legacy_sql=True, flatten_results=True, kms_key=None)[source]

Bases: apache_beam.runners.dataflow.native_io.iobase.NativeSourceReader

A reader for a BigQuery source.

get_progress()

Returns a representation of how far the reader has read.

Returns:A SourceReaderProgress object that gives the current progress of the reader.
request_dynamic_split(dynamic_split_request)

Attempts to split the input in two parts.

The two parts are named the “primary” part and the “residual” part. The current ‘NativeSourceReader’ keeps processing the primary part, while the residual part will be processed elsewhere (e.g. perhaps on a different worker).

The primary and residual parts, if concatenated, must represent the same input as the current input of this ‘NativeSourceReader’ before this call.

The boundary between the primary part and the residual part is specified in a framework-specific way using ‘DynamicSplitRequest’ e.g., if the framework supports the notion of positions, it might be a position at which the input is asked to split itself (which is not necessarily the same position at which it will split itself); it might be an approximate fraction of input, or something else.

This function returns a ‘DynamicSplitResult’, which encodes, in a framework-specific way, the information sufficient to construct a description of the resulting primary and residual inputs. For example, it might, again, be a position demarcating these parts, or it might be a pair of fully-specified input descriptions, or something else.

After a successful call to ‘request_dynamic_split()’, subsequent calls should be interpreted relative to the new primary.

Parameters:dynamic_split_request – A ‘DynamicSplitRequest’ describing the split request.
Returns:‘None’ if the ‘DynamicSplitRequest’ cannot be honored (in that case the input represented by this ‘NativeSourceReader’ stays the same), or a ‘DynamicSplitResult’ describing how the input was split into a primary and residual part.
returns_windowed_values

Returns whether this reader returns windowed values.

class apache_beam.io.gcp.bigquery_tools.BigQueryWriter(sink, test_bigquery_client=None, buffer_size=None)[source]

Bases: apache_beam.runners.dataflow.native_io.iobase.NativeSinkWriter

The sink writer for a BigQuerySink.

Write(row)[source]
takes_windowed_values

Returns whether this writer takes windowed values.

class apache_beam.io.gcp.bigquery_tools.RowAsDictJsonCoder[source]

Bases: apache_beam.coders.coders.Coder

A coder for a table row (represented as a dict) to/from a JSON string.

This is the default coder for sources and sinks if the coder argument is not specified.
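A minimal round-trip sketch; the row contents are arbitrary:

    from apache_beam.io.gcp.bigquery_tools import RowAsDictJsonCoder

    coder = RowAsDictJsonCoder()
    encoded = coder.encode({'name': 'Alice', 'age': 30})
    # encoded is a JSON-encoded byte string, e.g. b'{"name": "Alice", "age": 30}'
    decoded = coder.decode(encoded)
    assert decoded == {'name': 'Alice', 'age': 30}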

encode(table_row)[source]
decode(encoded_table_row)[source]
to_type_hint()[source]
as_cloud_object(coders_context=None)

For internal use only; no backwards-compatibility guarantees.

Returns Google Cloud Dataflow API description of this coder.

as_deterministic_coder(step_label, error_message=None)

Returns a deterministic version of self, if possible.

Otherwise raises a value error.

decode_nested(encoded)

Uses the underlying implementation to decode in nested format.

encode_nested(value)

Uses the underlying implementation to encode in nested format.

estimate_size(value)

Estimates the encoded size of the given value, in bytes.

Dataflow estimates the encoded size of a PCollection processed in a pipeline step by using the estimated size of a random sample of elements in that PCollection.

The default implementation encodes the given value and returns its byte size. If a coder can provide a fast estimate of the encoded size of a value (e.g., if the encoding has a fixed size), it can provide its estimate here to improve performance.

Parameters:value – the value whose encoded size is to be estimated.
Returns:The estimated encoded size of the given value.
classmethod from_runner_api(coder_proto, context)

Converts from a FunctionSpec to a Fn object.

Prefer registering a urn with its parameter type and constructor.

classmethod from_type_hint(unused_typehint, unused_registry)
get_impl()

For internal use only; no backwards-compatibility guarantees.

Returns the CoderImpl backing this Coder.

is_deterministic()

Whether this coder is guaranteed to encode values deterministically.

A deterministic coder is required for key coders in GroupByKey operations to produce consistent results.

For example, note that the default coder, the PickleCoder, is not deterministic: the ordering of pickled entries in maps may vary across executions since there is no defined order, and such a coder is not in general suitable for usage as a key coder in GroupByKey operations, since each instance of the same key may be encoded differently.

Returns:Whether coder is deterministic.
is_kv_coder()
key_coder()
static register_structured_urn(urn, cls)

Register a coder that’s completely defined by its urn and its component(s), if any, which are passed to construct the instance.

classmethod register_urn(urn, parameter_type, fn=None)

Registers a urn with a constructor.

For example, if ‘beam:fn:foo’ had parameter type FooPayload, one could write RunnerApiFn.register_urn(‘beam:fn:foo’, FooPayload, foo_from_proto), where foo_from_proto takes a FooPayload and a PipelineContext as arguments. This function can also be used as a decorator rather than passing the callable in as the final parameter.

A corresponding to_runner_api_parameter method would be expected that returns the tuple (‘beam:fn:foo’, FooPayload).

to_runner_api(context)
to_runner_api_parameter(context)
value_coder()
class apache_beam.io.gcp.bigquery_tools.JsonRowWriter(file_handle)[source]

Bases: io.IOBase

A writer which provides an IOBase-like interface for writing table rows (represented as dicts) as newline-delimited JSON strings.

Initialize a JsonRowWriter.

Parameters:file_handle (io.IOBase) – Output stream to write to.
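A small sketch assuming the output handle accepts bytes (here an in-memory io.BytesIO stands in for a real file handle):

    import io
    from apache_beam.io.gcp.bigquery_tools import JsonRowWriter

    buffer = io.BytesIO()
    writer = JsonRowWriter(buffer)
    writer.write({'name': 'Alice', 'age': 30})
    writer.write({'name': 'Bob', 'age': 25})
    writer.flush()
    # buffer now holds two newline-delimited JSON records.
    writer.close()
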
close()[source]
closed
flush()[source]
read(size=-1)[source]
tell()[source]
writable()[source]
write(row)[source]
fileno()

Returns underlying file descriptor if one exists.

OSError is raised if the IO object does not use a file descriptor.

isatty()

Return whether this is an ‘interactive’ stream.

Return False if it can’t be determined.

readable()

Return whether object was opened for reading.

If False, read() will raise OSError.

readline()

Read and return a line from the stream.

If size is specified, at most size bytes will be read.

The line terminator is always b'\n' for binary files; for text files, the newlines argument to open can be used to select the line terminator(s) recognized.

readlines()

Return a list of lines from the stream.

hint can be specified to control the number of lines read: no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint.

seek()

Change stream position.

Change the stream position to the given byte offset. The offset is interpreted relative to the position indicated by whence. Values for whence are:

  • 0 – start of stream (the default); offset should be zero or positive
  • 1 – current stream position; offset may be negative
  • 2 – end of stream; offset is usually negative

Return the new absolute position.

seekable()

Return whether object supports random access.

If False, seek(), tell() and truncate() will raise OSError. This method may need to do a test seek().

truncate()

Truncate file to size bytes.

File pointer is left unchanged. Size defaults to the current IO position as reported by tell(). Returns the new size.

writelines()

Write a list of lines to stream.

Line separators are not added, so it is usual for each of the lines provided to have a line separator at the end.

class apache_beam.io.gcp.bigquery_tools.AvroRowWriter(file_handle, schema)[source]

Bases: io.IOBase

A writer which provides an IOBase-like interface for writing table rows (represented as dicts) as Avro records.

Initialize an AvroRowWriter.

Parameters:
  • file_handle (io.IOBase) – Output stream to write Avro records to.
  • schema (Dict[Text, Any]) – BigQuery table schema.
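A hedged sketch, assuming the schema may be given in the dictionary form produced by get_dict_table_schema and that an in-memory io.BytesIO can stand in for a real file handle:

    import io
    from apache_beam.io.gcp.bigquery_tools import AvroRowWriter

    table_schema = {
        'fields': [
            {'name': 'name', 'type': 'STRING', 'mode': 'REQUIRED'},
            {'name': 'age', 'type': 'INTEGER', 'mode': 'NULLABLE'},
        ]
    }
    buffer = io.BytesIO()
    writer = AvroRowWriter(buffer, table_schema)
    writer.write({'name': 'Alice', 'age': 30})
    writer.flush()
    # buffer now holds an Avro container file with one record.
    writer.close()
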
close()[source]
closed
flush()[source]
read(size=-1)[source]
tell()[source]
writable()[source]
write(row)[source]
fileno()

Returns underlying file descriptor if one exists.

OSError is raised if the IO object does not use a file descriptor.

isatty()

Return whether this is an ‘interactive’ stream.

Return False if it can’t be determined.

readable()

Return whether object was opened for reading.

If False, read() will raise OSError.

readline()

Read and return a line from the stream.

If size is specified, at most size bytes will be read.

The line terminator is always b'\n' for binary files; for text files, the newlines argument to open can be used to select the line terminator(s) recognized.

readlines()

Return a list of lines from the stream.

hint can be specified to control the number of lines read: no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint.

seek()

Change stream position.

Change the stream position to the given byte offset. The offset is interpreted relative to the position indicated by whence. Values for whence are:

  • 0 – start of stream (the default); offset should be zero or positive
  • 1 – current stream position; offset may be negative
  • 2 – end of stream; offset is usually negative

Return the new absolute position.

seekable()

Return whether object supports random access.

If False, seek(), tell() and truncate() will raise OSError. This method may need to do a test seek().

truncate()

Truncate file to size bytes.

File pointer is left unchanged. Size defaults to the current IO position as reported by tell(). Returns the new size.

writelines()

Write a list of lines to stream.

Line separators are not added, so it is usual for each of the lines provided to have a line separator at the end.

class apache_beam.io.gcp.bigquery_tools.RetryStrategy[source]

Bases: object

RETRY_ALWAYS = 'RETRY_ALWAYS'
RETRY_NEVER = 'RETRY_NEVER'
RETRY_ON_TRANSIENT_ERROR = 'RETRY_ON_TRANSIENT_ERROR'
static should_retry(strategy, error_message)[source]
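A short sketch of should_retry(); the error reason string 'invalid' is assumed to be classified as non-transient:

    from apache_beam.io.gcp.bigquery_tools import RetryStrategy

    # Assumed False: a permanent error is not retried under this strategy.
    RetryStrategy.should_retry(RetryStrategy.RETRY_ON_TRANSIENT_ERROR, 'invalid')
    # True: RETRY_ALWAYS retries regardless of the error reason.
    RetryStrategy.should_retry(RetryStrategy.RETRY_ALWAYS, 'invalid')
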
class apache_beam.io.gcp.bigquery_tools.AppendDestinationsFn(destination)[source]

Bases: apache_beam.transforms.core.DoFn

Adds the destination to an element, making it a KV pair.

Outputs a PCollection of KV-pairs where the key is a TableReference for the destination, and the value is the record itself.

Experimental; no backwards compatibility guarantees.
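A pipeline sketch with a placeholder static destination string passed to the constructor:

    import apache_beam as beam
    from apache_beam.io.gcp.bigquery_tools import AppendDestinationsFn

    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create([{'name': 'Alice'}, {'name': 'Bob'}])
            | beam.ParDo(
                AppendDestinationsFn('my-project:my_dataset.my_table')))
        # Each output element is a (destination, row) key-value pair.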

display_data()[source]
process(element, *side_inputs)[source]
BundleFinalizerParam

alias of apache_beam.transforms.core._BundleFinalizerParam

DoFnProcessParams = [ElementParam, SideInputParam, TimestampParam, WindowParam, <class 'apache_beam.transforms.core._WatermarkEstimatorParam'>, PaneInfoParam, <class 'apache_beam.transforms.core._BundleFinalizerParam'>, KeyParam, <class 'apache_beam.transforms.core._StateDoFnParam'>, <class 'apache_beam.transforms.core._TimerDoFnParam'>]
DynamicTimerTagParam = DynamicTimerTagParam
ElementParam = ElementParam
KeyParam = KeyParam
PaneInfoParam = PaneInfoParam
RestrictionParam

alias of apache_beam.transforms.core._RestrictionDoFnParam

SideInputParam = SideInputParam
StateParam

alias of apache_beam.transforms.core._StateDoFnParam

TimerParam

alias of apache_beam.transforms.core._TimerDoFnParam

TimestampParam = TimestampParam
WatermarkEstimatorParam

alias of apache_beam.transforms.core._WatermarkEstimatorParam

WindowParam = WindowParam
default_label()
default_type_hints()
finish_bundle()

Called after a bundle of elements is processed on a worker.

static from_callable(fn)
classmethod from_runner_api(fn_proto, context)

Converts from a FunctionSpec to a Fn object.

Prefer registering a urn with its parameter type and constructor.

get_function_arguments(func)
get_type_hints()

Gets and/or initializes type hints for this object.

If type hints have not been set, attempts to initialize type hints in this order:

  • Using self.default_type_hints().
  • Using self.__class__ type hints.

infer_output_type(input_type)
classmethod register_pickle_urn(pickle_urn)

Registers and implements the given urn via pickling.

classmethod register_urn(urn, parameter_type, fn=None)

Registers a urn with a constructor.

For example, if ‘beam:fn:foo’ had parameter type FooPayload, one could write RunnerApiFn.register_urn(‘beam:fn:foo’, FooPayload, foo_from_proto), where foo_from_proto takes a FooPayload and a PipelineContext as arguments. This function can also be used as a decorator rather than passing the callable in as the final parameter.

A corresponding to_runner_api_parameter method would be expected that returns the tuple (‘beam:fn:foo’, FooPayload).

setup()

Called to prepare an instance for processing bundles of elements.

This is a good place to initialize transient in-memory resources, such as network connections. The resources can then be disposed in DoFn.teardown.

start_bundle()

Called before a bundle of elements is processed on a worker.

Elements to be processed are split into bundles and distributed to workers. Before a worker calls process() on the first element of its bundle, it calls this method.

teardown()

Called to clean up this instance before it is discarded.

A runner will do its best to call this method on any given instance to prevent leaks of transient resources, however, there may be situations where this is impossible (e.g. process crash, hardware failure, etc.) or unnecessary (e.g. the pipeline is shutting down and the process is about to be killed anyway, so all transient resources will be released automatically by the OS). In these cases, the call may not happen. It will also not be retried, because in such situations the DoFn instance no longer exists, so there’s no instance to retry it on.

Thus, all work that depends on input elements, and all externally important side effects, must be performed in DoFn.process or DoFn.finish_bundle.

to_runner_api(context)

Returns an FunctionSpec encoding this Fn.

Prefer overriding self.to_runner_api_parameter.

to_runner_api_parameter(context)
static unbounded_per_element()

A decorator on process fn specifying that the fn performs an unbounded amount of work per input element.

with_input_types(*arg_hints, **kwarg_hints)
with_output_types(*arg_hints, **kwarg_hints)
apache_beam.io.gcp.bigquery_tools.get_table_schema_from_string(schema)[source]

Transform the string table schema into a TableSchema instance.

Parameters:schema (str) – The string schema to be used if the BigQuery table to write has to be created.
Returns:The schema to be used if the BigQuery table to write has to be created, in TableSchema format.
Return type:TableSchema
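For example (a sketch; field names are arbitrary), the single-string form uses comma-separated ‘FIELD:TYPE’ pairs:

    from apache_beam.io.gcp.bigquery_tools import get_table_schema_from_string

    schema = get_table_schema_from_string('name:STRING,age:INTEGER')
    # schema is a TableSchema with two fields.
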
apache_beam.io.gcp.bigquery_tools.table_schema_to_dict(table_schema)[source]

Create a dictionary representation of a table schema for serialization.

apache_beam.io.gcp.bigquery_tools.get_dict_table_schema(schema)[source]

Transform the table schema into a dictionary instance.

Parameters:schema (str, dict, TableSchema) – The schema to be used if the BigQuery table to write has to be created. This can be a dict, a string, or a TableSchema.
Returns:The schema to be used if the BigQuery table to write has to be created, in dictionary format.
Return type:Dict[str, Any]
apache_beam.io.gcp.bigquery_tools.get_avro_schema_from_table_schema(schema)[source]

Transform the table schema into an Avro schema.

Parameters:schema (str, dict, TableSchema) – The TableSchema to convert to an Avro schema. This can be a dict, a string, or a TableSchema.
Returns:An Avro schema, which can be used by fastavro.
Return type:Dict[str, Any]
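A short sketch chaining the schema helpers above; the field names are arbitrary:

    from apache_beam.io.gcp.bigquery_tools import (
        get_avro_schema_from_table_schema,
        get_dict_table_schema,
        get_table_schema_from_string)

    table_schema = get_table_schema_from_string('name:STRING,age:INTEGER')
    schema_dict = get_dict_table_schema(table_schema)
    # schema_dict is roughly {'fields': [{'name': 'name', 'type': 'STRING', ...}, ...]}
    avro_schema = get_avro_schema_from_table_schema(table_schema)
    # avro_schema is an Avro record schema usable with fastavro.
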
class apache_beam.io.gcp.bigquery_tools.BigQueryJobTypes[source]

Bases: object

EXPORT = 'EXPORT'
COPY = 'COPY'
LOAD = 'LOAD'
QUERY = 'QUERY'
apache_beam.io.gcp.bigquery_tools.generate_bq_job_name(job_name, step_id, job_type, random=None)[source]