apache_beam.io.hadoopfilesystem module

FileSystem implementation for accessing Hadoop Distributed File System files.

class apache_beam.io.hadoopfilesystem.HadoopFileSystem(pipeline_options)[source]

Bases: apache_beam.io.filesystem.FileSystem

FileSystem implementation that supports HDFS.

URL arguments to methods expect strings starting with hdfs://.

Initializes a connection to HDFS.

Connection configuration is done by passing pipeline options. See HadoopFileSystemOptions.
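As a rough illustration of the configuration step, the sketch below builds command-line flags and passes the resulting PipelineOptions to the constructor. The flag names (--hdfs_host, --hdfs_port, --hdfs_user) are assumed to be the ones HadoopFileSystemOptions reads and should be checked against the installed Beam version. The helper names and the host, port, and user values are placeholders.

```python
def hdfs_flags(host, port, user):
    """Build the flag list (flag names are assumed; verify for your Beam version)."""
    return [
        "--hdfs_host=%s" % host,
        "--hdfs_port=%d" % port,
        "--hdfs_user=%s" % user,
    ]


def make_hdfs_filesystem(flags):
    """Create a HadoopFileSystem from flags; needs apache_beam and hdfs installed."""
    # Imported lazily so the flag helper above works without Beam installed.
    from apache_beam.io.hadoopfilesystem import HadoopFileSystem
    from apache_beam.options.pipeline_options import PipelineOptions

    return HadoopFileSystem(PipelineOptions(flags))


# Placeholder connection details, not taken from this documentation.
flags = hdfs_flags("namenode.example.com", 50070, "beam")
```

Calling make_hdfs_filesystem(flags) would then attempt the connection, so it is only useful where an HDFS namenode is reachable.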

classmethod scheme()[source]
join(base_url, *paths)[source]

Join two or more pathname components.

Parameters:
  • base_url – string giving the first component of the path. Must start with hdfs://.
  • paths – path components to be added
Returns: Full URL after combining all the passed components.
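The joining behaviour described here can be approximated with the standard library. This is a minimal stand-in, not Beam's implementation: join_hdfs is a hypothetical helper, and it assumes POSIX-style joining and a ValueError when the prefix is missing.

```python
import posixpath


def join_hdfs(base_url, *paths):
    """Join path components onto an hdfs:// URL (illustrative stand-in)."""
    if not base_url.startswith("hdfs://"):
        raise ValueError("base_url must start with hdfs://, got %r" % base_url)
    # posixpath.join inserts '/' between components as needed.
    return posixpath.join(base_url, *paths)


joined = join_hdfs("hdfs://data", "logs", "part-0.txt")
```

In this sketch a component that itself starts with '/' would discard everything before it, as posixpath.join always does, so components are best passed without a leading slash.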

split(url)[source]
mkdirs(url)[source]
has_dirs()[source]
create(url, mime_type='application/octet-stream', compression_type='auto')[source]
Returns: A Python file-like object.
open(url, mime_type='application/octet-stream', compression_type='auto')[source]
Returns: A Python file-like object.
copy(source_file_names, destination_file_names)[source]

It is an error if any file to copy already exists at the destination.

Raises BeamIOError if any error occurs.

Parameters:
  • source_file_names – iterable of URLs.
  • destination_file_names – iterable of URLs.
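The pairing of sources with destinations can be pictured with an in-memory stand-in. Everything in this sketch is hypothetical: a dict plays the role of HDFS, CopyError stands in for BeamIOError, and it assumes that failures are collected per pair and reported together at the end.

```python
class CopyError(Exception):
    """Hypothetical stand-in for BeamIOError; carries per-pair failure details."""

    def __init__(self, message, details):
        super().__init__(message)
        self.details = details


def copy_all(store, sources, destinations):
    """Copy each source to the destination at the same position.

    `store` is a dict mapping URL -> bytes that stands in for HDFS.
    """
    failures = {}
    for src, dst in zip(sources, destinations):
        if dst in store:
            failures[(src, dst)] = "destination already exists"
        elif src not in store:
            failures[(src, dst)] = "source not found"
        else:
            store[dst] = store[src]
    if failures:
        raise CopyError("copy operation failed", failures)


store = {"hdfs://a/x": b"1", "hdfs://a/y": b"2", "hdfs://b/y": b"old"}
try:
    copy_all(store, ["hdfs://a/x", "hdfs://a/y"], ["hdfs://b/x", "hdfs://b/y"])
except CopyError as err:
    failed = err.details
```

Here the first pair is copied, while the second is reported as a failure because its destination already exists and is left untouched.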
rename(source_file_names, destination_file_names)[source]
exists(url)[source]

Checks existence of url in HDFS.

Parameters: url – String in the form hdfs://…
Returns: True if url exists as a file or directory in HDFS.
size(url)[source]
last_updated(url)[source]
checksum(url)[source]

Fetches a checksum description for a URL.

Returns: String describing the checksum.
CHUNK_SIZE = 1
delete(urls)[source]
classmethod get_all_plugin_paths()

Get full import paths of the BeamPlugin subclass.

classmethod get_all_subclasses()

Get all the subclasses of the BeamPlugin class.

match(patterns, limits=None)

Find all matching paths to the patterns provided.

Patterns ending with ‘/’ or ‘\’ will have ‘*’ appended.

Parameters:
  • patterns – list of file path pattern strings to match against
  • limits – list of the maximum number of results to fetch for each pattern

Returns: list of MatchResult objects.

Raises: BeamIOError – if any of the pattern match operations fail
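The pattern preparation described for match() can be sketched as follows. Both helper names are hypothetical. The sketch also assumes that a trailing backslash is treated like a trailing slash, that limits lines up with patterns by position, and that a length mismatch is an error (a plain ValueError is used here rather than BeamIOError).

```python
def normalize_pattern(pattern):
    """Append '*' to patterns ending with '/' or '\\' so they match directory contents."""
    if pattern.endswith("/") or pattern.endswith("\\"):
        return pattern + "*"
    return pattern


def pair_with_limits(patterns, limits=None):
    """Pair each normalized pattern with its limit; None means no limit."""
    if limits is None:
        limits = [None] * len(patterns)
    if len(patterns) != len(limits):
        raise ValueError("patterns and limits must have the same length")
    return [(normalize_pattern(p), limit) for p, limit in zip(patterns, limits)]


pairs = pair_with_limits(["hdfs://data/logs/", "hdfs://data/*.csv"], [10, None])
```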
match_files(file_metas, pattern)

Filter FileMetadata objects by pattern.

Parameters:
  • file_metas (list of FileMetadata) – Files to consider when matching
  • pattern (str) – File pattern
Returns: Generator of matching FileMetadata
static translate_pattern(pattern)

Translate a pattern to a regular expression. There is no way to quote meta-characters.

Pattern syntax:

The pattern syntax is based on the fnmatch syntax, with the following differences:

  • * is equivalent to [^/\\]* rather than .*.
  • ** is equivalent to .*.
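The two wildcard rules above can be tried out with a simplified translator. This is a sketch, not Beam's translate_pattern: translate_simple is a hypothetical name, it handles only * and **, and every other character is matched literally.

```python
import re


def translate_simple(pattern):
    """Translate a glob pattern to a regex string (simplified sketch).

    Handles only '**' and '*'; all other characters are escaped.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")  # '**' may cross directory separators
            i += 2
        elif pattern[i] == "*":
            parts.append(r"[^/\\]*")  # '*' stops at a slash or backslash
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return "".join(parts) + r"\Z"


one_level = re.compile(translate_simple("hdfs://data/*.csv"))
any_depth = re.compile(translate_simple("hdfs://data/**.csv"))
```

With these two patterns, hdfs://data/a.csv matches both, while hdfs://data/sub/a.csv matches only any_depth, because a single * cannot cross the separator.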

See also

match() uses this method

This method is based on Python 2.7’s fnmatch.translate. The code in this method is licensed under PYTHON SOFTWARE FOUNDATION LICENSE VERSION 2.