Design¶
PFIO is an IO abstraction library for Chainer, optimized for deep learning training with batteries included. It supports
- Filesystem API abstraction with unified error semantics,
- Explicit user-land caching system,
- IO performance tracing and metrics stats, and
- Fileset container utilities to save metadata.
Rationale¶
There are a lot of non-POSIX-compliant filesystems in the industry and all of them have Pros and Cons, from cloud storages like Google Cloud Stroage and Amazon S3, to on-premise distributed filesystems like HDFS. Supporting different filesystems by developers themselves creates unnecessary burden on the developers and might reduce the portability on the code. That is the motivation of PFIO supporting filesystem abstraction API.
Also, deep learning training programs have benefited from filesystem page caching provided Linux kernel because its optmization method is based on stochastic gradient descent, where all training data is repeatedly read to iteratively train the model. But such non-POSIX-compliant filesystem usually does not provide content caching capability and I/O workload of training program would not implicitly optmized unlike Posix-based filesystems in Linux. To help with learning on non-POSIX-compliant filesystems, e.g. HDFS, PFIO implements data caching capability in userland. Moreover, developers can choose which data to cache, from the raw files to DNN input NumPy array.
Opmization in deep learning model training is also important as the model training usually takes long time and even 1% speedup is important. Modern external non-Posix filesystem is based on complex communication protocol between multiple data nodes and its performance metrics are not simply observable. Built-in tracing system, would be the first step for optimization, mitigating the complexity problem.
One of the general problems in filesystems is scalability of metadata access, caused by inbalanced ratio of number of files and total capacity of one system. PFIO supports various container file formats to aggregate many small files into single large file with metadata mapping, e.g. HDF5, ZIP and Tar (and more in future), by taking advantage of the fact that in machine learning training workload, usually SGD, is repeated read of single fixed dataset (Write-once-and-read-many). Aggregating millions of single-kilobytes files into single file would save a lot of metadata store.
Architecture¶
PFIO abstracts underlying system with three objects. See API documentation for details.
FileSystem¶
Abstraction of each filesystems. Depending on the context the term might stand for the filesystem type, or the (network) filesystem instance. It supports
- Getting basic information of the filesystem (info)
- Container creation, deletion
- Accessing containers
- Accessing raw files
- Listing all files under specific directory
- Primarily HDFS and POSIX, and AWS S3 API
FS¶
Abstraction of a directory subtree, or file containers such as ZIP. It contains a set of (key, binary object) pairs. Keys are typically path-like string and binary is typically a file content. In PFIO keys in a container are UTF-8 strings. Containers can be nested, e.g. ZIP in ZIP. It supports:
- Showing basic information of the container (info)
- Accessing raw files included (open)
- Accessing containers included (open)
- Adding and remove file (create, delete)
- Listing keys in a container
- Primarily ZIP, and possibly Hdf5?
import pfio
from PIL import Image
import io
with pfio.v2.from_url('hdfs://name-service-cluster1/some/many-files-dataset.zip') as fs:
print(container.info())
# List all keys in the container
for name in fs.list(recursive=True):
print(name)
# Obtains a file object to access binary content that
# corresponds to the key ``some/file.jpg``
with fs.open('some/file.jpg', 'rb') as fp:
binary = fp.read()
image = Image(io.BytesIO(binary))
...
File-like Objects¶
Abstraction of binary objects or files, typically returned by
fs.open
method. It is an implementation of RawIOBase
class
(See RawIOBase in Python
document). It supports
- Read to underlying file or binary in a container
- Writes supported by filesystems, but possibly not in containers
URI Expression of File Paths¶
Filesystems can be expressed as:
<filesystem> := <scheme>[://<service-idenfier>]
where scheme
represends the filesystem type. Currently hdfs
and file
are supported. hdfs
stands for HDFS and file
means local filesystem. For remote network file system like HDFS,
service-identifier
stands for service instance. It can be omitted
when the default service is defined. For example in HDFS, it is the
name of name service described in hdfs-site.xml
in the Hadoop
configuration directory like following example:
<configuration>
<property>
<name>dfs.nameservices</name>
<value>hdfs-nameservice1</value>
</property>
...
In this example is service-identifier
is hdfs-nameservice
.
Containers, files are uniquely identifiable by partial set of URI expression:
<uri> := <scheme>://[<service-idenfier>]/<path>
service-identifier
can be omitted when it can be uniquely defined
by the environment. path
is a UTF-8 string, a sequence of path
segments separated by /
and path segments are recommended to only
use [a-z][A-Z][0-9][_-]
. However, details depend on underlying filesystem
implementation or containers.
pfio.open_as_container
and pfio.fs.open
take
filesystem
, uri
or path
as an argument to identify the
file to be opened, when the context is a filesystem. If the context
is a container, they accept a key as an argument.
If the context is a file system, they also take a path
as a
relative path. The base for relative path depends on filesystems; for
HDFS it is home directory and for POSIX it is current working
directory.
For example, all these fs.open
open the same file, given that the
default name service is name-service1
and user Smith’s home
derectory is defined as /user/smith
:
import pfio
# Using full URI
pfio.open('hdfs://name-service1/user/smith/path/to/file.txt')
# Using set_root and absolute path
pfio.set_root('hdfs://name-service1/')
pfio.open('/user/smith/path/to/file.txt')
# Using set root and relative path
pfio.set_root('hdfs')
pfio.open('path/to/file.txt')
# Overwrite the global setting with full URI
# Access the posix with the global setting to hdfs
pfio.open('file://path/to/file.txt')
# Accessing with filesystem object
with pfio.create_handler('hdfs') as handler:
handler.open('file.txt')
Major Use Cases¶
With all these primitive concepts and operations PFIO supports various use cases from loading training data, taking snapshots of models in the middle of training process, and recording the final model.
In order to load training data in Chainer, developers create a dataset class which derived the DatasetMixin from the chainer.dataset package. PFIO will provide several implementation replacements for generic datasets included in Chainer and other Chainer family libraries.
According to the survey we conduct on developers’ code. I/Os can be categorized into two different classes.
- Inputs and outputs using file object: direct access via built-in APIs e.g. Image class in PIL, cv2.image.open and pandas.read_hdf. In such case, the file object (in PFIO, it is implementation of RawIOBase )
- Inputs and outputs all wrapped by 3rd party library. Some of them has functions only takes the file path string as an argument and all file operations are hidden underneath the library. Examples are cv2.VideoWriter(), cv2.imread() and cv2.VideoCapture() from OpenCV. Since we cannot change the library, we provide a monkey patch of major libraries frequently used along with Chainer.
For details see API.
V2 API history¶
PFIO v2 API tries to solve the impedance mismatch between different local filesystem, NFS, and other object storage systems, with a lot simpler and cleaner code.
It has removed several toplevel functions that seem to be less
important. It turned out that they introduced more complexity than
originally intended, due to the need of the global context. Thus,
functions that depends on the global context such as open()
,
set_root()
and etc. have been removed in v2 API.
Instead, v2 API provides only two toplevel functions that enable
direct resource access with full URL: open_url()
and
from_url()
. The former opens a file and returns FileObject. The
latter, creates a fs.FS
object that enable resource access under
the URL. The new class fs.FS
, is something close to handler object
in version 1 API. fs.FS
is intended to be as much compatible as
possible, however, it has several differences.
One notable difference is that it has the virtual concept of current
working directory, and thus provides subfs()
method. subfs()
method behaves like chroot(1)
or os.chdir()
without actually
changing current working directory of the process, but actually
returns a new fs.FS
object that has different working
directory. All resouce access through the object automatically
prepends the working directory.
V2 API does not provide lazy resouce initialization any more. Instead,
it provides simple wrapper lazify()
, which recreates the fs.FS
object every time the object experiences fork(2)
. Hdfs
and
Zip
can be wrapped with it, and will be fork-tolerant object.