In a previous post, we introduced the Versioned HDF5 library and described some of its features. In this post, we'll go into detail on how the underlying design of the library works on a technical level.
Versioned HDF5 is a library that wraps h5py and offers a versioned abstraction for HDF5 groups and datasets. Versioned HDF5 works fundamentally as a copy-on-write system. The basic idea of copy-on-write is that all data is effectively immutable in the backend. Whenever a high-level representation of data is modified, it is copied to a new location in the backend, leaving the original version intact. Any references to the original will continue to point to it.
At a high level, in a versioned system built with copy-on-write, all data in the system is stored in immutable blobs in the backend. The immutability of these blobs is often enforced by making them content-addressable, where the blobs are always referred to in the system by a cryptographic hash of their contents. A cryptographic hash maps any blob of data to a fixed-size sequence of bytes in a way that is effectively one-to-one: if two blobs have the same hash, they can be assumed to be exactly the same data. This means that it is impossible to mutate a blob of data in-place. Doing so would change its hash, which would make it a different blob, since blobs are referenced only by their hash.
Whenever data for a version is committed, its data is stored as blobs in the backend. It may be put into a single blob, or split into multiple blobs. If it is split, a way to reconstruct the data from the blobs is stored. If a later version modifies that data, any blobs that are different are stored as new blobs. If the data is the same, the blobs will also be the same, and hence will not be written to twice, because they will already exist in the backend.
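As a toy sketch of content-addressable storage (purely illustrative, not the actual Versioned HDF5 machinery), one can key blobs by the hash of their bytes, so identical blobs are only ever stored once:

```python
import hashlib

import numpy as np

store = {}  # maps content hash -> immutable blob of bytes

def write_blob(chunk: np.ndarray) -> str:
    """Store a chunk under the hash of its contents and return the key."""
    key = hashlib.sha256(chunk.tobytes()).hexdigest()
    store.setdefault(key, chunk.tobytes())  # an existing blob is never rewritten
    return key

# Committing "two versions" of identical data stores the blob only once.
key_v1 = write_blob(np.arange(1000))
key_v2 = write_blob(np.arange(1000))
assert key_v1 == key_v2 and len(store) == 1
```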
At a high level, this is how the git version control system works, for example (this is a good talk on the internals of git). It is also how versioning works in some modern filesystems, such as APFS and Btrfs.
Versioned HDF5 Implementation
In Versioned HDF5, this idea is implemented using two key HDF5 primitives: chunks and virtual datasets.
In HDF5, datasets are split into multiple chunks. All chunks have the same, configurable size, although some chunks may not be completely full. A chunk is the smallest part of a dataset that HDF5 operates on: whenever a subset of a dataset is to be read, the entire chunk containing that subset is read into memory. Picking an optimal chunk size is a nontrivial task, and depends on things such as the size of your L1 cache and the typical shape of your dataset. Furthermore, in Versioned HDF5 a chunk is the smallest amount of data that is stored only once across versions if it has not changed. If the chunk size is too small, performance suffers, as operations require reading and writing more chunks; if it is too large, the resulting versioned file becomes unnecessarily large, since changing even a single element of a chunk requires rewriting the entire chunk. Versioned HDF5 does not presently contain any logic for automatically picking a chunk size. The pytables documentation has some tips on picking an optimal chunk size.
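For reference, this is how a chunked dataset is created with plain h5py; the file name and the chunk size of 4096 below are arbitrary choices for illustration:

```python
import h5py
import numpy as np

with h5py.File("example.h5", "w") as f:
    # Each chunk holds 4096 elements; HDF5 always reads and writes whole chunks.
    dset = f.create_dataset(
        "my_dataset", data=np.arange(100_000, dtype="f8"), chunks=(4096,)
    )
    print(dset.chunks)  # (4096,)
```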
Because chunks are the most basic HDF5 primitive, Versioned HDF5 uses them as the underlying blobs for storage. This way operations on data can be as performant as possible.
Virtual datasets are a special kind of dataset that reference data from other datasets in a seamless way. The data from each part of a virtual dataset comes from another dataset. HDF5 does this seamlessly, so that a virtual dataset appears to be a normal dataset.
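A minimal h5py virtual dataset example, with made-up file and dataset names, showing two source datasets exposed seamlessly as one:

```python
import h5py
import numpy as np

# Build a file with two source datasets and a virtual dataset that stitches
# them together into one seamless 200-element dataset.
with h5py.File("vds_example.h5", "w") as f:
    a = f.create_dataset("source_a", data=np.zeros(100))
    b = f.create_dataset("source_b", data=np.ones(100))

    layout = h5py.VirtualLayout(shape=(200,), dtype="f8")
    layout[:100] = h5py.VirtualSource(a)
    layout[100:] = h5py.VirtualSource(b)
    f.create_virtual_dataset("combined", layout)

# The virtual dataset reads like any other dataset.
with h5py.File("vds_example.h5", "r") as f:
    print(f["combined"][95:105])  # last zeros of source_a, first ones of source_b
```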
The basic design of Versioned HDF5 is this: whenever a dataset is created for the first time (that is, in the first version containing the dataset), it is split into chunks. The data in each chunk is hashed and stored in a hash table. The unique chunks are then appended into a raw_data dataset corresponding to the dataset. Finally, a virtual dataset is made that references the chunks in the raw dataset to recreate the original dataset. When later versions modify this dataset, each modified chunk is appended to the raw dataset, and a new virtual dataset is created pointing to the corresponding chunks.
For example, say we start with the first version, version_1, of my_dataset, split into n chunks. The dataset chunks will be written into the raw dataset, and the final virtual dataset will point to those chunks. If we then create a version version_2 based off version_1, and modify only the data contained in CHUNK 2, that new data will be appended to the raw dataset, and the resulting virtual dataset for version_2 will point to the new chunk in place of the original CHUNK 2. Since both versions 1 and 2 of my_dataset have identical data in all chunks other than CHUNK 2, they both point to the exact same data in the raw dataset for those chunks. The underlying HDF5 file therefore stores the data in version 1 of my_dataset only once, and only the chunks modified by version 2 are stored on top of that.
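A rough sketch of this write path, with hypothetical helper names and plain Python containers standing in for the HDF5 objects (this is the shape of the algorithm, not the library's actual code):

```python
import hashlib

import numpy as np

CHUNK_SIZE = 4096  # elements per chunk along the first axis (arbitrary here)

def commit_version(data, hash_table, raw_data):
    """Hash each chunk, append only unseen chunks to raw_data, and return the
    mapping a virtual dataset for this version would use.

    hash_table: dict mapping chunk hash -> (start, stop) slice into raw_data
    raw_data:   list of values standing in for the append-only raw_data dataset
    """
    chunk_map = []
    for start in range(0, len(data), CHUNK_SIZE):
        chunk = np.asarray(data[start:start + CHUNK_SIZE])
        digest = hashlib.sha256(chunk.tobytes()).hexdigest()
        if digest not in hash_table:
            # New content: append the chunk to raw_data and remember where it lives.
            hash_table[digest] = (len(raw_data), len(raw_data) + len(chunk))
            raw_data.extend(chunk.tolist())
        # The virtual dataset maps this slice of the version onto raw_data.
        chunk_map.append((start, hash_table[digest]))
    return chunk_map
```

A later version that modifies only one chunk re-hashes all of its chunks, finds the unchanged ones already recorded in the hash table, and appends only the modified chunk to raw_data.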
All extra metadata, such as attributes, is stored on the virtual dataset. Since virtual datasets act exactly like real datasets and operate at the HDF5 level, each version is a real group in the HDF5 file that is exactly that version. However, these groups should be treated as read-only, and you should never access them outside of the Versioned HDF5 API (see below).
HDF5 File Layout
Inside of the HDF5 file, there is a special _versioned_data group that holds all the internal data for Versioned HDF5. This group contains a versions group, which in turn contains a group for each version that has been created. The _versioned_data group also contains a group for each dataset that exists in some version. These dataset groups each contain two datasets, hash_table and raw_data.
For example, consider a Versioned HDF5 file that contains two versions, version1 and version2, with datasets data1 and data2. Suppose also that data1 exists in both versions and data2 exists only in version2. The HDF5 layout would look like this:
```
/_versioned_data/
├── data1/
│   ├── hash_table
│   └── raw_data
│
├── data2/
│   ├── hash_table
│   └── raw_data
│
└── versions/
    ├── __first_version__/
    │
    ├── version1/
    │   └── data1
    │
    └── version2/
        ├── data1
        └── data2
```
__first_version__ is an empty group that exists only to serve as the parent of the first real version. The hash_table datasets store the hashes for each chunk of data, so that duplicate data will not be written twice, and the raw_data dataset stores the chunks for all versions of a given dataset. It is referenced by the virtual datasets in the corresponding version groups under versions/. For example, the chunks for the data in data1 in versions version1 and version2 are stored in /_versioned_data/data1/raw_data.
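To make this concrete, the internal layout of a versioned file (here with a hypothetical name data.h5) can be inspected with plain h5py, opened read-only since these groups should never be written to directly:

```python
import h5py

with h5py.File("data.h5", "r") as f:
    # Print every group and dataset path under the internal bookkeeping group.
    f["_versioned_data"].visit(print)
```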
Versioned HDF5 API
The biggest challenge of this design is that the virtual datasets holding the data in each version all point to the same blobs in the backend. However, in HDF5, if a virtual dataset is written to, it writes through to the location it points to. This is at odds with the immutable copy-on-write design. As a consequence, Versioned HDF5 needs to wrap all the h5py APIs that write into a dataset, both to disallow writing for versions that are already committed and to apply the proper copy-on-write semantics for new versions that are being staged. Several classes that wrap the h5py Dataset and Group objects are present in the versioned_hdf5.wrappers submodule. These work just like their h5py counterparts, but do the right thing on versioned data. The versioned_hdf5.VersionedHDF5File API returns these wrapper objects whenever a versioned dataset is accessed. They are designed to work exactly like the corresponding h5py objects.
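Here is a minimal usage sketch based on the documented API; the file and dataset names are placeholders:

```python
import h5py
import numpy as np
from versioned_hdf5 import VersionedHDF5File

with h5py.File("data.h5", "w") as f:
    vf = VersionedHDF5File(f)

    # Stage and commit a first version; the context manager commits on exit.
    with vf.stage_version("version_1") as sv:
        sv.create_dataset("my_dataset", data=np.arange(10_000), chunks=(4096,))

    # Stage a second version based on the first, modifying a small slice.
    with vf.stage_version("version_2") as sv:
        sv["my_dataset"][5000:5010] = 0

    # Committed versions are accessed through wrapper objects that act like h5py ones.
    print(vf["version_1"]["my_dataset"][5000:5010])
    print(vf["version_2"]["my_dataset"][5000:5010])
```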
In several industry and science applications, a filesystem-like storage model such as HDF5 is the more appropriate solution for manipulating large amounts of data. However, when that data changes over time, it is not obvious how to track the different versions, since HDF5 is a binary format that is not well suited to traditional version control systems and tools.
In a previous post, we introduced the Versioned HDF5 library, which implements a mechanism for storing binary datasets in a versioned way that feels natural to users of other version control systems, and described some of its features. In this post, we'll show some of the performance analysis we did while developing the library, hopefully making the case that reading and writing versioned HDF5 files can be done with a nice, intuitive API while being as efficient as possible. The tests presented here show that using the Versioned HDF5 library results in reduced disk space usage, and further reductions can be achieved with the HDF5/h5py-provided compression algorithms. This comes at the cost of file writing speeds that are slower than for unversioned files, but by less than a factor of 10.
What are we measuring?
Performance can mean different things for different operations. For the tests presented here, the main goals are:
- To evaluate the performance (size on disk and I/O speed) of reading/writing versioned HDF5 files and compare it with non-versioned files (that is, files where only the last version of the datasets is stored);
- To evaluate the performance when reading/writing data to a versioned HDF5 file in a set of different use cases;
- To evaluate the performance when different parameter options are considered for chunking and compression on versioned HDF5 files.
When different versions of a dataset are stored in a versioned HDF5 file, modified copies of the data are stored as new versions. This means that there may be duplication of data between versions, which might impact the performance of reading, writing or manipulating these files.
In order to analyze this, we will consider test files created with a variable number of versions (or transactions) of a dataset consisting of three ndarrays of variable length. One test includes a two-dimensional ndarray as part of this dataset, but all other test cases consist of three one-dimensional ndarrays per dataset.
With each new version a certain number of rows are added, removed, and modified. For these tests, all changes are made to elements chosen according to a power law which biases the modifications towards the end of the array, simulating a possible use case of modifying more recent results in a given timeseries.
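As an illustration of what such a power-law bias looks like (this is not the actual benchmark code), indices concentrated towards the end of the array can be drawn with numpy:

```python
import numpy as np

rng = np.random.default_rng(42)

def biased_indices(n_rows, n_changes, exponent=5.0):
    """Draw indices to modify, biased towards the end of the array."""
    # Generator.power samples from [0, 1) with density a * x**(a - 1), which
    # concentrates values near 1, i.e. near the end of the array after scaling.
    return np.unique((rng.power(exponent, n_changes) * n_rows).astype(int))

print(biased_indices(5000, 10))  # mostly indices close to 4999
```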
The tests are as follows:
- A large fraction of changes is made to the dataset with each new version: the dataset initially has three arrays with 5000 rows, 1000 positions are chosen at random and changed, and a small number (at most 10) of rows are added or deleted with each new version. We will refer to this test as test_large_fraction_changes_sparse.
- A small fraction of changes is made to the dataset with each new version: the dataset initially has three arrays with 5000 rows, but only 10 positions are chosen at random and changed, and a small number (at most 10) of rows are added or deleted with each new version. We will refer to this test as test_small_fraction_changes_sparse.
- A large fraction of changes is made to the dataset with each new version, with the same three arrays of 5000 rows defined initially and 1000 positions chosen at random and changed, but the size of the final array remains constant (no rows are added or deleted). We will refer to this test as test_large_fraction_constant_sparse.
In the next tests, the number of modifications is dominated by the number of appended rows. There are two such cases:
- In the first case, the dataset contains three one-dimensional arrays with 1000 rows initially, and 1000 rows are added with each new version. A small number (at most 10) of values are chosen at random, following the power law described above, and changed or deleted. We call this test test_mostly_appends_sparse.
- In the second case, the dataset contains one two-dimensional array with shape (30, 30) and two one-dimensional arrays acting as indices to the 2d array. In this case, rows are only appended along the first axis of the two-dimensional array, and a small number of positions (at most 10) is chosen at random and changed. We call this test test_mostly_appends_dense.
To test the performance of the Versioned HDF5 library, we have chosen to compare a few different chunk sizes and compression algorithms. These values have been chosen arbitrarily, and optimal values depend on different use cases and on the nature of the datasets stored in the file.
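The snippet below sketches how such options would be passed, assuming the staged-version group forwards h5py's chunks and compression keyword arguments; the specific values are examples, not recommendations:

```python
import h5py
import numpy as np
from versioned_hdf5 import VersionedHDF5File

with h5py.File("compressed.h5", "w") as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version("version_1") as sv:
        sv.create_dataset(
            "values",
            data=np.random.default_rng(0).normal(size=50_000),
            chunks=(4096,),        # chunk size under test
            compression="gzip",    # any compression filter available in h5py
            compression_opts=9,
        )
```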
As the number of versions in a file grows, its size on disk is also expected to grow. However, it is reasonable to expect that the overhead of storing metadata for versioned files doesn't cause the file sizes to explode as the number of versions increases.
We'll start by analyzing how the HDF5 file sizes grow as the number of versions grows. Using a chunk size of 4096, we can see the following results for the 4 one-dimensional test cases:
We can see from the figure that in the test_large_fraction_constant_sparse case, writing 5000 versions of a 117KB array, which would take around 572MB if each version were stored in a separate file, takes only around 252MB, less than half the storage size. Note that in the other examples the size of the arrays stored in the file also grows as the number of versions grows, since each new version changes the original arrays by adding, deleting and changing values. Keep in mind that there is redundant data, as some of it is not changed during the staging of a new version but is still being stored. It is also worth noting that for these tests we don't use compression, even though the compression algorithms available in h5py can also be used in Versioned HDF5. You can see the Versioned HDF5 documentation for more detailed tests regarding chunking and compression.
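As a quick sanity check on those numbers: 5000 versions times roughly 117 KB per version is about 585,000 KB, i.e. roughly 571 MB (taking 1 MB = 1024 KB), which matches the quoted 572 MB for storing every version as a separate file; the versioned file stores the same history in about 252 MB.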
If we look at the time spent creating the files for each example, comparing chunk sizes but not considering compression, we have something like this:
Now, we can look at the time required to stage a new version in the file, that is, to add a new transaction. The graphs below show, for each fixed number of transactions, the time required to add new versions as the file is created. Note that the scales vary for each test.
It is clear that as the number of versions stored in the file increases, the time required to create versioned HDF5 files increases significantly when compared to regular HDF5 files. However, note that the increase is linear, consistent with what is expected from adding new versions to the file. For example, looking at test_large_fraction_constant_sparse, where the size of the arrays does not increase as new versions are added, choosing (again) a chunk size of 4096 means that writing each new version to file takes about 6-8x as long as in the unversioned case, with more than 50% savings in disk storage. Note that a larger chunk size may mean faster read and write times, but it can also mean larger file sizes if no compression is used, because of how Versioned HDF5 is designed: all chunks where data has been changed from one version to the next have to be stored in full. Also, in test_mostly_appends_sparse, where the size of the arrays stored in the file grows significantly with each new version, we can see a marked increase in the times required to stage new versions.
Finally, we'll look at a two-dimensional dataset. In this case, we have chosen different chunk sizes to test, considering that larger chunk sizes increase file sizes considerably.
We can also look at the times required to create each new version and write it to file in the versioned and unversioned cases. This is shown in the image below (note the different axis scale from the previous figures).
This test case is unique for a few reasons. First, having a two-dimensional dataset introduces new considerations, such as the number of rows being added along each axis. For this test case, we have only added a few new rows to the first axis with each new version, which may explain why we don't see an increase in the time required to write new versions to file as the number of versions grows. While these are preliminary tests, and multidimensional datasets are still experimental at this point in Versioned HDF5, we can see that there are no unexpected drops in performance, and the results can generally be explained by the size of the data stored in each file.
I/O performance for versioned HDF5 files
First, we'll look at the time required to read data from all versions in a file, sequentially. To keep this test short, we'll only analyze the tests using no compression, with a chunk size of 16384 for the one-dimensional datasets and 256 for the two-dimensional dataset in test_mostly_appends_dense. Plotting with respect to the number of versions, we get the following:
As expected, read times increase for files with a larger number of versions, but the growth is close to linear in all cases except for test_mostly_appends_sparse, where the large array sizes explain the larger read times.
Next, we'll compare the times necessary to read the latest version in each file. Because of how Versioned HDF5 is designed, this is the same as reading the last array stored in the HDF5 file. For each test, we made 20 different reads of the latest version in order to eliminate fluctuations caused by background processes or the OS (solid lines). We also compare these read times with the time required to read an unversioned file, which only stores the latest version (dashed black line).
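A sketch of how such a comparison can be timed; the file and dataset names are placeholders, and current_version is assumed here to name the most recently committed version:

```python
import time

import h5py
from versioned_hdf5 import VersionedHDF5File

# Read the latest committed version from a previously written versioned file.
with h5py.File("versioned.h5", "r") as f:
    vf = VersionedHDF5File(f)
    start = time.perf_counter()
    latest = vf[vf.current_version]["values"][:]
    print("versioned read:", time.perf_counter() - start)

# Compare with reading the same data from an unversioned file.
with h5py.File("unversioned.h5", "r") as f:
    start = time.perf_counter()
    plain = f["values"][:]
    print("unversioned read:", time.perf_counter() - start)
```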
In this case, we can see that on average, reading the latest version from a VersionedHDF5File is ~5x-10x slower than reading an unversioned file. Also, the time required to read the latest version from a versioned HDF5 file increases only modestly with the number of versions stored in the file, except when the size of the array increases significantly with each new version.
These results show that the library behaves reasonably well without unexpected overhead. The reduction in file size comes at a moderate cost in file writing speed, and file reading speed for the latest version of the data is largely unaffected by the number of versions stored. The largest impact on I/O performance and storage is in fact explained by the size of the datasets stored in the file, and the Versioned HDF5 abstraction itself adds little overhead on top of that.
Overall, the worst performance was observed for tests with large array sizes. This seems to show that the file sizes and I/O performance of versioned HDF5 files are significantly affected by the size of the unique datasets stored in each file, which is to be expected. Also, choosing the right chunk size parameter can have an impact on the performance of the library.
versioned-hdf5 1.0 has recently been released, and is available on PyPI and conda-forge. You can install it with

```
conda install -c conda-forge versioned-hdf5
```

Development of versioned-hdf5 happens on GitHub. Currently, the library supports basic use cases, but there is still a lot to do. We welcome community contributions to the library, including any issues or feature requests you may have. For now, you can check out the documentation for more details on what is supported and how the library is built.