h5py
h5py is a Pythonic interface to the HDF5 binary data format. It lets you store huge amounts of numerical data and easily manipulate that data from NumPy. For example, you can slice into multi-dimensional arrays stored in HDF5 without having to copy the data first.
Overview
The h5py package is a Pythonic interface to the HDF5 binary data format. It lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.
h5py is not intended for streaming or manipulating large raw datasets; for that, other excellent packages in the SciPy stack, such as PyTables and pandas, are available. h5py is targeted at developers who need to both store arrays on disk and quickly manipulate them in memory.
Download and Installation
h5py requires Python 2.6 or 2.7, and NumPy 1.6.1 or later.
- On Windows systems, you will need Visual Studio 2008 (for Python 2.6) or 2010 (for Python 2.7). The free “Express” editions work fine.
- On OS X systems, you need to have Xcode installed.
- On Linux systems, you need the development headers for HDF5 (the package is usually called libhdf5-dev or hdf5-devel) as well as a C compiler (usually gcc).
Quick Start
h5py is a Pythonic interface to the HDF5 binary data format.
HDF5 lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-gigabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.
The h5py user manual is a great place to start; you may also want to check out the FAQ.
If you have questions that are not covered by the documentation or FAQ, please post to the h5py mailing list; we’re happy to help!
h5py.File
The h5py.File class implements the HDF5 file object. The File object is the root object for a given HDF5 file; it is from this object that you can access all other objects stored in the file. The most common way to create a File object is to call the h5py.File constructor:
file = h5py.File('mytestfile.hdf5', 'r')
Other ways to create a File instance exist, which we’ll discuss later.
The first argument to h5py.File() is always a string containing the path to the file you want to open or create: if the file doesn’t exist, it will be created (and you must have write permissions in the enclosing directory). The second argument (optional) specifies how you want to open the file:
'r' Read-only, file must exist
'w' Create file, truncating any existing content
'a' Read/write if file exists, create otherwise
'x' Create file, fail if it already exists
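The modes above can be exercised like this (the filename is hypothetical, and the file is created in the current directory):

```python
import h5py

# 'w' creates a new file, truncating any existing content.
with h5py.File("example.hdf5", "w") as f:
    f.create_dataset("data", (10,), dtype="i")

# 'r' reopens the file read-only; attempting to write would raise an error.
with h5py.File("example.hdf5", "r") as f:
    print(f["data"].shape)  # (10,)

# 'a' opens read/write, creating the file if it does not already exist.
with h5py.File("example.hdf5", "a") as f:
    f.attrs["note"] = "opened in append mode"
```

Using File objects as context managers (`with`) ensures the file is closed even if an error occurs.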
h5py.Group
h5py.Group is the class for handling groups of data. Groups work like dictionaries and support the same style of indexing. They also support attribute access and object creation.
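A brief sketch of the dictionary-like behaviour (group and dataset names here are hypothetical):

```python
import h5py

with h5py.File("groups_demo.hdf5", "w") as f:
    # Object creation: groups and datasets are made from the parent object.
    grp = f.create_group("experiment")
    grp.create_dataset("readings", data=[1, 2, 3])

    # Dictionary-style indexing and membership tests.
    print(list(f.keys()))                  # ['experiment']
    print("readings" in f["experiment"])   # True
    print(f["experiment/readings"][:])     # [1 2 3]
```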
h5py.Dataset
The h5py.Dataset class gives you a powerful way to store and manipulate numerical data in HDF5 files. A Dataset is like a NumPy array, but it lives in an HDF5 file instead of in memory. It supports NumPy-style slicing and familiar attributes such as shape and dtype:
import h5py
f = h5py.File("myfile.hdf5", "a")
dset = f["mydataset"]
dset.shape
(100, 200)
dset[...] = 42  # Fill the entire dataset with the value 42
One advantage of using Datasets is that, if created with a maxshape, they can be resized without having to be recreated from scratch. You can also create compound Datasets, which are like regular NumPy arrays except that each element is a record with multiple named fields (also called members). Each field can have a different data type:
import h5py
f = h5py.File("myfile.hdf5", "r")
dset = f["mydataset"]  # A compound dataset with two fields, named "x" and "y"
dset["x"]  # The "x" field, one value per record
array([ 0., 1., 2., ..., 97., 98., 99.])
dset["y"]  # The "y" field, the same length as "x"
array([ 100., 101., 102., ..., 197., 198., 199.])
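Resizing only works for datasets created with a maxshape; a minimal sketch (the filename and dataset name are hypothetical):

```python
import h5py

with h5py.File("resizable.hdf5", "w") as f:
    # maxshape=(None,) makes the first axis unlimited, so the dataset
    # can grow later without being recreated.
    dset = f.create_dataset("grow", shape=(100,), maxshape=(None,), dtype="f")
    dset.resize((200,))
    print(dset.shape)  # (200,)
```

Axes with a fixed entry in maxshape cannot grow past that limit; None means unlimited.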
Tutorial
Creating a File
To create a file, use the h5py.File class. This returns an HDF5 file object, which acts like a dictionary, mapping strings (the “keys”) to groups and datasets. The file object also has an attrs property, which gives you access to the attributes of the file as a whole (see below), and a keys() method, which lists the members of the root group.
Here’s a simple example:
import h5py
f = h5py.File('myfile.hdf5', 'w')  # Create a new file in write mode
dataset1 = f.create_dataset('data1', (100,), dtype='i')  # Create an empty dataset of 100 integers
root = f['/']  # Get the root group of the file
group2 = f.create_group('group2')  # Create a group in the root group
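File-level attributes and member listing follow the same pattern; a short sketch (filename and attribute names are hypothetical):

```python
import h5py

with h5py.File("attrs_demo.hdf5", "w") as f:
    f.create_dataset("data1", (100,), dtype="i")
    f.attrs["description"] = "example file"  # file-level attribute

    print(list(f.keys()))          # ['data1']
    print(f.attrs["description"])  # example file
```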
Creating a Group
In h5py, groups are created with the create_group method of a File or Group object. Groups act as containers: they can hold datasets and other groups, so you can organize a file hierarchically, much like directories on a filesystem.
Passing a path such as ‘a/b/c’ creates any intermediate groups automatically, and an existing group can be retrieved later with dictionary-style indexing, e.g. f[‘a/b’].
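A minimal sketch of group creation with a nested path (the filename and group names are hypothetical):

```python
import h5py

with h5py.File("grouped.hdf5", "w") as f:
    # Intermediate groups in the path are created automatically.
    subgrp = f.create_group("level1/level2")
    subgrp.create_dataset("values", data=[10, 20])

    print(subgrp.name)                   # /level1/level2
    print(f["level1/level2/values"][:])  # [10 20]
```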
Creating a Dataset
A Dataset is an array of data stored in an HDF5 file. Like a NumPy array, it has a shape and an element data type (dtype); unlike a NumPy array, its contents live on disk. Datasets are stored inside groups, which can be nested to organize the file hierarchically.
Creating a dataset is simple:
import h5py
f = h5py.File("mydataset.hdf5", "w")
dset = f.create_dataset("mydataset", (100,), dtype='i')
This will create a one-dimensional dataset called “mydataset” with 100 elements of type ‘i’ (signed 32-bit integer). Elements you have not written yet read back as the dataset’s fill value, which is 0 by default.
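Reading and writing then works through NumPy-style slicing; a short sketch (the filename is hypothetical), showing that unwritten elements read back as the default fill value of 0:

```python
import h5py
import numpy as np

with h5py.File("fill_demo.hdf5", "w") as f:
    dset = f.create_dataset("mydataset", (100,), dtype="i")
    dset[:50] = np.arange(50)  # write only the first half

    print(dset[0], dset[49], dset[99])  # 0 49 0
```

Slices read from disk only the region you request, so you can work with datasets far larger than memory.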