database module

Full documentation for the hippynn.databases.database module.

Base database functionality from a dictionary of numpy arrays.

class Database(arr_dict: dict[str, numpy.ndarray], inputs: list[str], targets: list[str], seed: int | numpy.random.RandomState | tuple, test_size: float | int = None, valid_size: float | int = None, num_workers: int = 0, pin_memory: bool = True, allow_unfound: bool = False, auto_split: bool = False, device: torch.device = None, dataloader_kwargs: dict[str, object] = None, quiet=False)[source]

Bases: object

Class for holding a pytorch dataset, splitting it, generating dataloaders, etc.
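
For illustration, a minimal sketch of building a database from in-memory arrays; the array names ("Z", "R", "E") and shapes here are hypothetical, not part of the API:

>>> import numpy as np
>>> arr_dict = {
...     "Z": np.ones((100, 4), dtype=np.int64),   # species / atom presence (hypothetical key)
...     "R": np.random.rand(100, 4, 3),           # positions (hypothetical key)
...     "E": np.random.rand(100, 1),              # a target property (hypothetical key)
... }
>>> db = Database(arr_dict, inputs=["Z", "R"], targets=["E"],
...               seed=0, test_size=0.1, valid_size=0.1)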

add_split_masks(dict_to_add_to=None, split_prefix=None)[source]

Add split masks to the dataset. This function is used internally before writing databases.

When using the dict_to_add_to parameter, this function writes numpy arrays. When adding to self.splits, this function writes tensors.

Parameters:
  • dict_to_add_to – where to put the split masks; defaults to self.splits.

  • split_prefix – prefix for mask names

Returns:

get_device() device[source]

Determine what device the database resides on. Raises ValueError if multiple devices are encountered.

Returns:

device.

make_automatic_splits(split_prefix=None, dry_run=False)[source]

Split the database automatically. Because this routine is explicitly requested by the user, it fails strictly rather than guessing when the mask arrays are not usable.

Parameters:
  • split_prefix – if None, use the default prefix. Otherwise, use this prefix to determine which arrays are split masks.

  • dry_run – Only validate that existing split masks are correct; don’t perform splitting.

Returns:
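
A sketch of invoking the automatic split, assuming the database arrays already contain boolean mask arrays named with the mask prefix (the default prefix is determined by the library):

>>> db.make_automatic_splits()              # use the default prefix
>>> db.make_automatic_splits(dry_run=True)  # only validate existing masks, do not split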

make_database_cache(file: str = './hippynn_db_cache.npz', overwrite: bool = False, **override_kwargs) Database[source]

Cache the database as-is, and re-open it.

Useful for creating an easy restart script if the storage space is available. By default, the new database will inherit the properties of this database.

Usage:

>>> database = database.make_database_cache()

Parameters:
  • file – where to store the database

  • overwrite – whether to overwrite an existing cache file with this name.

  • override_kwargs – passed to NPZDictionary instead of the current database settings.

Returns:

The new database created from the cache.

make_explicit_split(split_name: str, split_indices: ndarray)[source]
Parameters:
  • split_name – name for split, typically ‘train’, ‘valid’, ‘test’

  • split_indices – the indices of the items for the split

Returns:
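
A sketch of assigning explicit rows to splits by index; the index ranges are arbitrary examples, and split_the_rest (documented below) picks up whatever remains:

>>> import numpy as np
>>> db.make_explicit_split("test", np.arange(0, 10))
>>> db.make_explicit_split("valid", np.arange(10, 20))
>>> db.split_the_rest("train")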

make_explicit_split_bool(split_name: str, split_mask: ndarray | Tensor)[source]
Parameters:
  • split_name – name for split, typically ‘train’, ‘valid’, ‘test’

  • split_mask – a boolean array marking which items belong to the split

Returns:
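
Similarly, a sketch using a boolean mask; here a hypothetical mask selects every tenth system for the test split:

>>> import numpy as np
>>> mask = np.zeros(100, dtype=bool)   # 100 = hypothetical number of systems
>>> mask[::10] = True
>>> db.make_explicit_split_bool("test", mask)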

make_generator(split_name: str, evaluation_mode: str, batch_size: int | None = None, subsample: float | bool = False)[source]

Makes a dataloader for the given type of split and evaluation mode of the model.

In most cases, you do not need to call this function directly as a user.

Parameters:
  • split_name – str; “train”, “valid”, or “test”; selects the data to use

  • evaluation_mode – str; “train” or “eval”; determines whether the data is shuffled.

  • batch_size – passed to pytorch

  • subsample – fraction to subsample

Returns:

dataloader containing relevant data
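
Although this is normally called internally, a direct call might look like this sketch:

>>> loader = db.make_generator("valid", "eval", batch_size=64)
>>> for batch in loader:
...     pass  # each batch holds the tensors for the requested split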

make_random_split(split_name: str, split_size: int | float)[source]

Make a random split using self.random_state to select items.

Parameters:
  • split_name – String naming the split, can be anything, but ‘train’, ‘valid’, and ‘test’ are special.

  • split_size – int (number of items) or float < 1 (fraction of samples).

Returns:
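
A sketch of carving out random splits; the sizes here are arbitrary:

>>> db.make_random_split("valid", 0.1)   # fraction of remaining samples
>>> db.make_random_split("test", 500)    # or an explicit number of items
>>> db.split_the_rest("train")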

make_trainvalidtest_split(*, test_size: int | float, valid_size: int | float)[source]

Make a split for train, valid, and test out of any remaining unsplit entries in the database. The size is specified in terms of test and valid splits; the train split will be the remainder.

If you wish to specify precise rows for each split, see make_explicit_split or make_explicit_split_bool.

This function takes keyword arguments only, in order to prevent confusion over which size is which.

The types of both test_size and valid_size parameters must match.

Parameters:
  • test_size – int (count) or float (fraction) of data to assign to test split

  • valid_size – int (count) or float (fraction) of data to assign to valid split

Returns:

None
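
For example, reserving 10% each for the test and valid splits, with the remaining 80% becoming the train split:

>>> db.make_trainvalidtest_split(test_size=0.1, valid_size=0.1)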

remove_high_property(key: str, atomwise: bool, norm_per_atom: bool = False, species_key: str = None, cut: float | None = None, std_factor: float | None = 10, norm_axis: int | None = None)[source]

For removing outliers from a dataset. Use with caution; do not inadvertently remove outliers from benchmarks!

The parameters cut and std_factor can each be set to None to skip the corresponding step. The atomwise and norm_per_atom options are mutually exclusive; they cannot both be True.

Parameters:
  • key – The property key in the dataset to check for high values

  • atomwise – True if the property is defined per atom in axis 1, otherwise property is treated as whole-system value

  • norm_per_atom – True if the property should be normalized by atom counts

  • species_key – which array marks atom presence; required if norm_per_atom is True

  • cut – if a value exceeds mu + cut, the system is removed. This step is performed first.

  • std_factor – if (value - mu)/std > std_factor, the system is trimmed. This step is performed second.

  • norm_axis – if not None, the property array is normed along this axis. Useful for vector properties such as forces.

Returns:
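
A sketch of screening a hypothetical per-system energy array "E", normalized by atom counts taken from a hypothetical species array "Z"; as noted above, use with caution:

>>> db.remove_high_property("E", atomwise=False, norm_per_atom=True,
...                         species_key="Z", std_factor=10)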

send_to_device(device: device = None)[source]

Move the database to an accelerator device if possible. In some circumstances this can accelerate training.

Note

If the database is moved to a GPU, pin_memory will be set to False and num_workers will be set to 0.

Parameters:

device – device to move to, if None, try to auto-detect.

Returns:
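
For example, moving the whole database onto a GPU when one is available:

>>> import torch
>>> if torch.cuda.is_available():
...     db.send_to_device(torch.device("cuda"))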

sort_by_index(index_name: str = 'indices')[source]

Sort arrays in each split of the database by an index key.

The default is ‘indices’; ‘split_indices’ or any other variable name in the database may also be used.

Parameters:

index_name – name of the array to sort by

Returns:

None

split_the_rest(split_name: str)[source]

Assign all entries in the database that do not yet belong to any split to the named split.

trim_by_species(species_key: str, keep_splits_same_size: bool = True)[source]

Remove any excess padding in a database.

Parameters:
  • species_key – what array to use to mark atom presence.

  • keep_splits_same_size – if True, trim by the minimum amount across splits; if False, trim each split by its own maximum amount.

Returns:

None
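
A sketch, assuming the species array is stored under a hypothetical key "Z":

>>> db.trim_by_species("Z")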

write_h5(split: str | None = None, h5path: str | None = None, species_key: str = 'species', overwrite: bool = False)[source]

Write this database to the pyanitools h5 format. See hippynn.databases.h5_pyanitools.write_h5() for details.

Note: This function will error if h5py is not installed.

Parameters:
  • split

  • h5path

  • species_key

  • overwrite

Returns:
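
A sketch of exporting one split, assuming h5py is installed and the species array is stored under a hypothetical key "Z"; the output path is arbitrary:

>>> db.write_h5(split="test", h5path="test_set.h5", species_key="Z", overwrite=True)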

write_npz(file: str, record_split_masks: bool = True, compressed: bool = True, overwrite: bool = False, split_prefix: str | None = None, return_only: bool = False)[source]

Write the database to a numpy npz file.

Parameters:
  • file – str, Path, or file object compatible with np.save

  • record_split_masks – whether to generate and place masks for the splits into the saved database.

  • compressed – whether to use np.savez_compressed (True) or np.savez

  • overwrite – whether to allow overwriting an existing file at the path. Only used if file is a str or Path.

  • split_prefix – optionally override the prefix for the masks computed by the splits.

  • return_only – if True, ignore the file string and just return the resulting dictionary of numpy arrays.

Returns:
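
For example, saving the database along with its split masks to a compressed file, or retrieving the arrays without writing to disk; the filename is arbitrary:

>>> db.write_npz("my_dataset.npz", record_split_masks=True, overwrite=True)
>>> arrays = db.write_npz("unused.npz", return_only=True)   # returns the dict, writes nothing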

property var_list
class NamedTensorDataset(tensor_names, *tensors)[source]

Bases: TensorDataset

compute_index_mask(indices: ndarray, index_pool: ndarray) ndarray[source]
Parameters:
  • indices

  • index_pool

Returns:
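
A sketch of the expected behavior, under the assumption that the returned mask marks which entries of index_pool appear in indices:

>>> import numpy as np
>>> mask = compute_index_mask(np.array([2, 5]), np.array([1, 2, 3, 4, 5]))
>>> # mask would then be a boolean array like [False, True, False, False, True]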

prettyprint_arrays(arr_dict: dict[str, numpy.ndarray])[source]

Pretty-print array dictionary.

Returns:

None