apache_beam.io.filebasedsource module
A framework for developing sources for new file types.

To create a source for a new file type, a sub-class of FileBasedSource should be created. Sub-classes of FileBasedSource must implement the method FileBasedSource.read_records(). Please read the documentation of that method for more details.

For an example implementation of FileBasedSource see _AvroSource.
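A minimal sketch of such a sub-class, assuming a simple newline-delimited text format (the LineSource name and the record format are illustrative, not part of this module):

    from apache_beam.io import filebasedsource


    class LineSource(filebasedsource.FileBasedSource):
      """Hypothetical source that emits one record per newline-delimited line."""

      def read_records(self, file_name, offset_range_tracker):
        # open_file() is used so that compression and the underlying
        # filesystem are handled by FileBasedSource.
        file_handle = self.open_file(file_name)
        try:
          position = offset_range_tracker.start_position() or 0
          if position > 0:
            # Sync to the next line boundary; the partial line at the start
            # of the range belongs to the reader of the previous range.
            file_handle.seek(position - 1)
            position += len(file_handle.readline()) - 1
          else:
            file_handle.seek(position)
          # Claim the start offset of each record before emitting it, so
          # that splitting stays consistent with what has been read.
          while offset_range_tracker.try_claim(position):
            line = file_handle.readline()
            if not line:
              break
            yield line.rstrip(b'\n')
            position += len(line)
        finally:
          file_handle.close()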
class apache_beam.io.filebasedsource.FileBasedSource(file_pattern, min_bundle_size=0, compression_type='auto', splittable=True, validate=True)

Bases: apache_beam.io.iobase.BoundedSource
A BoundedSource for reading a file glob of a given type.

Initializes FileBasedSource.

Parameters:
- file_pattern (str) – the file glob to read, a string or a ValueProvider (placeholder to inject a runtime value).
- min_bundle_size (int) – minimum size of bundles that should be generated when performing initial splitting on this source.
- compression_type (str) – used to handle compressed input files. Typical value is CompressionTypes.AUTO, in which case the final file path's extension will be used to detect the compression.
- splittable (bool) – whether FileBasedSource should try to logically split a single file into data ranges so that different parts of the same file can be read in parallel. If set to False, FileBasedSource will prevent both initial and dynamic splitting of sources for single files. File patterns that represent multiple files may still get split into sources for individual files. Even if set to True by the user, FileBasedSource may choose to not split the file, for example, for compressed files where it is currently not possible to efficiently read a data range without decompressing the whole file.
- validate (bool) – Boolean flag to verify that the files exist during pipeline creation time.
Raises:
- TypeError – when compression_type is not valid or if file_pattern is not a str or a ValueProvider.
- ValueError – when compression and splittable files are specified.
- IOError – when the file pattern specified yields an empty result.
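A hedged sketch of how such a source is typically consumed in a pipeline, reusing the hypothetical LineSource from the sketch in the module introduction (the file pattern and bundle size are illustrative):

    import apache_beam as beam

    # LineSource is the hypothetical FileBasedSource sub-class sketched above.
    with beam.Pipeline() as pipeline:
      records = (
          pipeline
          | 'ReadLines' >> beam.io.Read(
              LineSource('/path/to/input-*.txt', min_bundle_size=1 << 20)))

Because validate defaults to True, the files matched by the glob are checked for existence when the pipeline is constructed.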
MIN_NUMBER_OF_FILES_TO_STAT = 100

MIN_FRACTION_OF_FILES_TO_STAT = 0.01
read_records(file_name, offset_range_tracker)

Returns a generator of records created by reading file 'file_name'.

Parameters:
- file_name – a string that gives the name of the file to be read. Method FileBasedSource.open_file() must be used to open the file and create a seekable file object.
- offset_range_tracker – an object of type OffsetRangeTracker. This defines the byte range of the file that should be read. See the documentation in iobase.BoundedSource.read() for more information on reading records while complying with the range defined by a given RangeTracker.

Returns: an iterator that gives the records read from the given file.
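For formats that cannot be meaningfully split, a sub-class can pass splittable=False to the constructor and claim only the start of the range before reading the whole file. A hedged sketch (the WholeFileSource name and the one-record-per-file format are illustrative):

    from apache_beam.io import filebasedsource


    class WholeFileSource(filebasedsource.FileBasedSource):
      """Hypothetical source that emits the full contents of each file."""

      def __init__(self, file_pattern):
        # Disable splitting so each matched file is read as a single record.
        super(WholeFileSource, self).__init__(file_pattern, splittable=False)

      def read_records(self, file_name, offset_range_tracker):
        # With splittable=False the assigned range covers the whole file, so
        # claiming the start position once is sufficient.
        if offset_range_tracker.try_claim(offset_range_tracker.start_position()):
          file_handle = self.open_file(file_name)
          try:
            yield file_handle.read()
          finally:
            file_handle.close()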
splittable
default_output_coder()

Coder that should be used for the records returned by the source.

Should be overridden by sources that produce objects that can be encoded more efficiently than pickling.
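For example, a source whose records are plain strings could return a string coder instead of the pickle-based default. A sketch (the Utf8LineSource name is illustrative; StrUtf8Coder is an existing Beam coder):

    from apache_beam import coders
    from apache_beam.io import filebasedsource


    class Utf8LineSource(filebasedsource.FileBasedSource):
      """Hypothetical source whose records are UTF-8 strings."""

      def read_records(self, file_name, offset_range_tracker):
        # Implementation omitted; see the sketch in the module introduction.
        raise NotImplementedError

      def default_output_coder(self):
        # A dedicated string coder is cheaper than the pickle-based default.
        return coders.StrUtf8Coder()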
classmethod from_runner_api(fn_proto, context)

Converts from a FunctionSpec to a Fn object.

Prefer registering a urn with its parameter type and constructor.
is_bounded()
classmethod register_pickle_urn(pickle_urn)

Registers and implements the given urn via pickling.
classmethod register_urn(urn, parameter_type, fn=None)

Registers a urn with a constructor.

For example, if 'beam:fn:foo' had parameter type FooPayload, one could write RunnerApiFn.register_urn('beam:fn:foo', FooPayload, foo_from_proto), where foo_from_proto takes as arguments a FooPayload and a PipelineContext. This function can also be used as a decorator rather than passing the callable in as the final parameter.

A corresponding to_runner_api_parameter method would be expected that returns the tuple ('beam:fn:foo', FooPayload).
to_runner_api(context)

Returns a FunctionSpec encoding this Fn.

Prefer overriding self.to_runner_api_parameter.
to_runner_api_parameter(context)