databases

Full documentation for the hippynn.databases package.

Organized datasets for training and prediction.

Note

Databases constructed from disk (i.e. anything besides the base Database class) will load floating point data in the format (float32 or float64) specified via the torch.get_default_dtype() function. Use torch.set_default_dtype() to control this behavior.
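
For example, a minimal sketch of controlling the loaded precision; set the dtype before constructing the database:

>>> import torch
>>> torch.set_default_dtype(torch.float64)
>>> # Any database loaded from disk after this point will hold float64 arrays.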

class Database(arr_dict: dict[str, numpy.ndarray], inputs: list[str], targets: list[str], seed: int | numpy.random.RandomState | tuple, test_size: float | int = None, valid_size: float | int = None, num_workers: int = 0, pin_memory: bool = True, allow_unfound: bool = False, auto_split: bool = False, device: torch.device = None, dataloader_kwargs: dict[str, object] = None, quiet=False)[source]

Bases: object

Class for holding a pytorch dataset, splitting it, generating dataloaders, etc.
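
For illustration, a minimal in-memory construction; the array names here ("species", "coordinates", "energy") are hypothetical, and any consistent set of arrays works:

>>> import numpy as np
>>> from hippynn.databases import Database
>>> arr_dict = {
...     "species": np.ones((100, 5), dtype=np.int64),
...     "coordinates": np.random.randn(100, 5, 3),
...     "energy": np.random.randn(100, 1),
... }
>>> database = Database(
...     arr_dict,
...     inputs=["species", "coordinates"],
...     targets=["energy"],
...     seed=0,
...     test_size=0.1,
...     valid_size=0.1,
... )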

add_split_masks(dict_to_add_to=None, split_prefix=None)[source]

Add split masks to the dataset. This function is used internally before writing databases.

When using the dict_to_add_to parameter, this function writes numpy arrays. When adding to self.splits, this function writes tensors.

Parameters:
  • dict_to_add_to – where to put the split masks; defaults to self.splits.

  • split_prefix – prefix for mask names

Returns:

get_device() device[source]

Determine what device the database resides on. Raises ValueError if multiple devices are encountered.

Returns:

device.

make_automatic_splits(split_prefix=None, dry_run=False)[source]

Split the database automatically, using the split masks found in its arrays. Because the user requests this routine explicitly, it fails loudly rather than silently when the masks are malformed.

Parameters:
  • split_prefix – if None, use the default prefix; otherwise, use this prefix to determine which arrays are split masks.

  • dry_run – Only validate that existing split masks are correct; don’t perform splitting.

Returns:

make_database_cache(file: str = './hippynn_db_cache.npz', overwrite: bool = False, **override_kwargs) Database[source]

Cache the database as-is, and re-open it.

Useful for creating an easy restart script if the storage space is available. By default, the new database inherits the properties of this database.

Usage:

>>> database = database.make_database_cache()

Parameters:
  • file – where to store the database

  • overwrite – whether to overwrite an existing cache file with this name.

  • override_kwargs – passed to NPZDatabase instead of the current database settings.

Returns:

The new database created from the cache.

make_explicit_split(split_name: str, split_indices: ndarray)[source]

Assign the items at the given indices to a named split.

Parameters:
  • split_name – name for split, typically ‘train’, ‘valid’, ‘test’

  • split_indices – the indices of the items for the split

Returns:

make_explicit_split_bool(split_name: str, split_mask: ndarray | tensor)[source]

Assign the items where the mask is True to a named split.

Parameters:
  • split_name – name for split, typically ‘train’, ‘valid’, ‘test’

  • split_mask – a boolean array for where to split

Returns:
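
A sketch covering both explicit-split methods; the index choices and the "energy" key are arbitrary examples:

>>> import numpy as np
>>> database.make_explicit_split("test", np.arange(100))   # first 100 items
>>> mask = np.zeros(len(database.arr_dict["energy"]), dtype=bool)
>>> mask[:100] = True
>>> database.make_explicit_split_bool("valid", mask)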

make_generator(split_name: str, evaluation_mode: str, batch_size: int | None = None, subsample: float | bool = False)[source]

Makes a dataloader for the given type of split and evaluation mode of the model.

In most cases, you do not need to call this function directly as a user.

Parameters:
  • split_name – str; “train”, “valid”, or “test” ; selects data to use

  • evaluation_mode – str; “train” or “eval”. Used for whether to shuffle.

  • batch_size – passed to pytorch

  • subsample – fraction of the data to randomly subsample, or False for no subsampling

Returns:

dataloader containing relevant data

make_random_split(split_name: str, split_size: int | float)[source]

Make a random split using self.random_state to select items.

Parameters:
  • split_name – String naming the split, can be anything, but ‘train’, ‘valid’, and ‘test’ are special.

  • split_size – int (number of items) or float < 1 (fraction of samples).

Returns:
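
A typical splitting sequence (the sizes here are arbitrary examples); split_the_rest, documented below, assigns whatever remains:

>>> database.make_random_split("valid", 0.1)   # 10% of the data
>>> database.make_random_split("test", 512)    # or an absolute count
>>> database.split_the_rest("train")           # everything left over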

make_trainvalidtest_split(*, test_size: int | float, valid_size: int | float)[source]

Make a split for train, valid, and test out of any remaining unsplit entries in the database. The size is specified in terms of test and valid splits; the train split will be the remainder.

If you wish to specify precise rows for each split, see make_explicit_split or make_explicit_split_bool.

This function takes keyword-arguments only in order to prevent confusion over which size is which.

The types of both test_size and valid_size parameters must match.

Parameters:
  • test_size – int (count) or float (fraction) of data to assign to test split

  • valid_size – int (count) or float (fraction) of data to assign to valid split

Returns:

None
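
For example (remember that both sizes must be the same type, counts or fractions):

>>> database.make_trainvalidtest_split(test_size=0.1, valid_size=0.1)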

remove_high_property(key: str, atomwise: bool, norm_per_atom: bool = False, species_key: str = None, cut: float | None = None, std_factor: float | None = 10, norm_axis: int | None = None)[source]

For removing outliers from a dataset. Use with caution; do not inadvertently remove outliers from benchmark datasets!

The parameters cut and std_factor can each be set to None to skip the corresponding step. The atomwise and norm_per_atom options are exclusive; they cannot both be True.

Parameters:
  • key – The property key in the dataset to check for high values

  • atomwise – True if the property is defined per atom in axis 1, otherwise property is treated as whole-system value

  • norm_per_atom – True if the property should be normalized by atom counts

  • species_key – which array marks atom presence; required if norm_per_atom is True

  • cut – systems with values greater than mu + cut are removed. This step is done first.

  • std_factor – systems with (value - mu)/std > std_factor are trimmed. This step is done second.

  • norm_axis – if not None, the norm of the property array is taken along this axis. Useful for vector properties such as forces.

Returns:
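
A hedged sketch; the "energy" and "species" keys are examples, not fixed names:

>>> database.remove_high_property(
...     "energy",
...     atomwise=False,
...     norm_per_atom=True,      # normalize the value by the atom count
...     species_key="species",
...     cut=None,                # skip the hard-cutoff step
...     std_factor=10.0,         # trim beyond 10 standard deviations
... )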

send_to_device(device: device = None)[source]

Move the database to an accelerator device if possible. In some circumstances this can accelerate training.

Note

If the database is moved to a GPU, pin_memory will be set to False and num_workers will be set to 0.

Parameters:

device – device to move to, if None, try to auto-detect.

Returns:
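
For example:

>>> database.send_to_device("cuda:0")   # or send_to_device() to auto-detect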

sort_by_index(index_name: str = 'indices')[source]

Sort arrays in each split of the database by an index key.

The default is ‘indices’; ‘split_indices’ or any other variable name in the database may also be used.

Parameters:

index_name – the name of the array to sort by

Returns:

None

split_the_rest(split_name: str)[source]

Assign all remaining unsplit data to the named split.

trim_by_species(species_key: str, keep_splits_same_size: bool = True)[source]

Remove any excess padding in a database.

Parameters:
  • species_key – what array to use to mark atom presence.

  • keep_splits_same_size – if True, trim by the minimum amount across splits; if False, trim each split by its own maximum amount.

Returns:

None
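
For example, assuming the species array is stored under the key "species":

>>> database.trim_by_species("species")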

write_h5(split: str | None = None, h5path: str | None = None, species_key: str = 'species', overwrite: bool = False)[source]

Write this database to the pyanitools h5 format. See hippynn.databases.h5_pyanitools.write_h5() for details.

Note: This function will error if h5py is not installed.

Parameters:
  • split – name of the split to write

  • h5path – path of the h5 file to write

  • species_key – name of the array marking atom presence

  • overwrite – whether to overwrite an existing file at h5path

Returns:
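
A sketch (requires h5py; the path is an arbitrary example):

>>> database.write_h5(split="train", h5path="train_data.h5", overwrite=True)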

write_npz(file: str, record_split_masks: bool = True, compressed: bool = True, overwrite: bool = False, split_prefix: str | None = None, return_only: bool = False)[source]
Parameters:
  • file – str, Path, or file object compatible with np.save

  • record_split_masks – whether to generate and place masks for the splits into the saved database.

  • compressed – whether to use np.savez_compressed (True) or np.savez

  • overwrite – whether to accept an existing path. Only used if file is a str or Path.

  • split_prefix – optionally override the prefix for the masks computed by the splits.

  • return_only – if True, ignore the file string and just return the resulting dictionary of numpy arrays.

Returns:
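
For example, with an arbitrary file name:

>>> database.write_npz("my_dataset.npz", record_split_masks=True, overwrite=True)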

property var_list
class DirectoryDatabase(directory, name, inputs, targets, *args, quiet=False, allow_unfound=False, **kwargs)[source]

Bases: Database, Restartable

Database stored as NPY files in a directory.

Parameters:
  • directory – directory path where the files are stored

  • name – prefix for the arrays.

This function loads arrays of the format f”{name}{db_name}.npy” for each variable db_name in inputs and targets.

Other arguments: See Database.

Note

This database loader does not support the allow_unfound setting in the base Database. The variables to load must be set explicitly in the inputs and targets.
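
A minimal sketch; with name="data-", files such as data-species.npy and data-energy.npy are expected, and the array names here are hypothetical:

>>> from hippynn.databases import DirectoryDatabase
>>> database = DirectoryDatabase(
...     directory="./my_dataset/",
...     name="data-",
...     inputs=["species", "coordinates"],
...     targets=["energy"],
...     seed=0,
...     test_size=0.1,
...     valid_size=0.1,
... )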

get_file_dict(directory, prefix)[source]
load_arrays(directory, name, inputs, targets, quiet=False, allow_unfound=False)[source]
class NPZDatabase(file, inputs, targets, *args, allow_unfound=False, quiet=False, **kwargs)[source]

Bases: Database, Restartable

load_arrays(file, inputs, targets, allow_unfound=False, quiet=False)[source]
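
A minimal sketch of loading from an .npz file; the file name and array names are hypothetical:

>>> from hippynn.databases import NPZDatabase
>>> database = NPZDatabase(
...     "my_dataset.npz",
...     inputs=["species", "coordinates"],
...     targets=["energy"],
...     seed=0,
...     test_size=0.1,
...     valid_size=0.1,
... )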

Submodules