database module
Full Documentation for the hippynn.databases.database module.
Base database functionality from a dictionary of numpy arrays
- class Database(arr_dict: dict[str, numpy.ndarray], inputs: list[str], targets: list[str], seed: int | numpy.random.RandomState | tuple, test_size: float | int = None, valid_size: float | int = None, num_workers: int = 0, pin_memory: bool = True, allow_unfound: bool = False, auto_split: bool = False, device: torch.device = None, dataloader_kwargs: dict[str, object] = None, quiet=False)[source]
Bases:
object
Class for holding a pytorch dataset, splitting it, generating dataloaders, etc.
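A minimal construction sketch is shown below. The array names ("species", "coordinates", "energy") and shapes are illustrative placeholders, not requirements of the class.
>>> import numpy as np
>>> from hippynn.databases.database import Database
>>> arr_dict = {
...     "species": np.random.randint(1, 9, size=(100, 12)),  # placeholder atomic numbers, padded to 12 atoms
...     "coordinates": np.random.randn(100, 12, 3),          # placeholder positions
...     "energy": np.random.randn(100, 1),                    # placeholder target
... }
>>> database = Database(
...     arr_dict,
...     inputs=["species", "coordinates"],  # arrays fed to the model
...     targets=["energy"],                 # arrays used as fitting targets
...     seed=0,                             # seeds the random state used for splitting
...     test_size=0.1,
...     valid_size=0.1,
... )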
- add_split_masks(dict_to_add_to=None, split_prefix=None)[source]
Add split masks to the dataset. This function is used internally before writing databases.
When using the dict_to_add_to parameter, this function writes numpy arrays. When adding to self.splits, this function writes tensors.
- Parameters:
dict_to_add_to – where to put the split masks. Defaults to self.splits.
split_prefix – prefix for the mask names
- Returns:
- get_device() device [source]
Determine what device the database resides on. Raises ValueError if multiple devices are encountered.
- Returns:
device.
- make_automatic_splits(split_prefix=None, dry_run=False)[source]
Split the database automatically, using split-mask arrays identified by the split prefix. Because the splits are not specified explicitly by the user, this routine fails strictly if the masks are inconsistent.
- Parameters:
split_prefix – if None, use the default prefix. Otherwise, use this prefix to determine which arrays are split masks.
dry_run – Only validate that existing split masks are correct; don’t perform splitting.
- Returns:
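For example, if the arrays already carry boolean split masks (such as those recorded by write_npz with record_split_masks=True), the masks can first be validated and then applied. This is a sketch, assuming masks with the default prefix are present in the database arrays.
>>> database.make_automatic_splits(dry_run=True)  # only validate the existing masks
>>> database.make_automatic_splits()               # apply the splits defined by the masks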
- make_database_cache(file: str = './hippynn_db_cache.npz', overwrite: bool = False, **override_kwargs) Database [source]
Cache the database as-is, and re-open it.
Useful for creating an easy restart script if the storage space is available. The new database will by default inherit the properties of this database.
Usage:
>>> database = database.make_database_cache()
- Parameters:
file – where to store the database
overwrite – whether to overwrite an existing cache file with this name.
override_kwargs – passed to NPZDictionary instead of the current database settings.
- Returns:
The new database created from the cache.
- make_explicit_split(split_name: str, split_indices: ndarray)[source]
Make a split from an explicit set of item indices.
- Parameters:
split_name – name for split, typically ‘train’, ‘valid’, ‘test’
split_indices – the indices of the items for the split
- Returns:
- make_explicit_split_bool(split_name: str, split_mask: ndarray | tensor)[source]
Make a split from an explicit boolean mask over the items.
- Parameters:
split_name – name for split, typically ‘train’, ‘valid’, ‘test’
split_mask – a boolean array for where to split
- Returns:
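A sketch of both explicit-split variants follows. The index range, the use of the arr_dict attribute, and the assumption that padded atoms carry species 0 are illustrative only.
>>> import numpy as np
>>> # Index-based: assign the first 100 systems (in current array order) to a split.
>>> database.make_explicit_split("train", np.arange(100))
>>> # Mask-based: select remaining systems by a boolean condition, e.g. small systems only.
>>> n_atoms = (database.arr_dict["species"] != 0).sum(axis=1)
>>> database.make_explicit_split_bool("small_test", n_atoms <= 8)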
- make_generator(split_name: str, evaluation_mode: str, batch_size: int | None = None, subsample: float | bool = False)[source]
Makes a dataloader for the given type of split and evaluation mode of the model.
In most cases, you do not need to call this function directly as a user.
- Parameters:
split_name – str; “train”, “valid”, or “test” ; selects data to use
evaluation_mode – str; “train” or “eval”. Determines whether the data is shuffled.
batch_size – passed to pytorch
subsample – fraction of the data to randomly subsample
- Returns:
dataloader containing relevant data
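If a dataloader is needed outside of the built-in training routines, a call might look like the following sketch; the comment about batch contents is an assumption for illustration.
>>> train_loader = database.make_generator("train", "train", batch_size=64)
>>> for batch in train_loader:
...     pass  # each batch holds tensors for the database variables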
- make_random_split(split_name: str, split_size: int | float)[source]
Make a random split using self.random_state to select items.
- Parameters:
split_name – String naming the split, can be anything, but ‘train’, ‘valid’, and ‘test’ are special.
split_size – int (number of items) or float < 1 (fraction of samples).
- Returns:
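For example, the size may be given as a fraction or as an absolute count:
>>> database.make_random_split("valid", 0.1)  # 10% of the remaining unsplit data
>>> database.make_random_split("test", 500)   # 500 randomly chosen systems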
- make_trainvalidtest_split(*, test_size: int | float, valid_size: int | float)[source]
Make a split for train, valid, and test out of any remaining unsplit entries in the database. The size is specified in terms of test and valid splits; the train split will be the remainder.
If you wish to specify precise rows for each split, see make_explicit_split or make_explicit_split_bool.
This function takes keyword-arguments only in order to prevent confusion over which size is which.
The types of both test_size and valid_size parameters must match.
- Parameters:
test_size – int (count) or float (fraction) of data to assign to test split
valid_size – int (count) or float (fraction) of data to assign to valid split
- Returns:
None
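A typical call uses matching fractional sizes; the remainder becomes the train split:
>>> database.make_trainvalidtest_split(test_size=0.1, valid_size=0.1)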
- remove_high_property(key: str, atomwise: bool, norm_per_atom: bool = False, species_key: str = None, cut: float | None = None, std_factor: float | None = 10, norm_axis: int | None = None)[source]
For removing outliers from a dataset. Use with caution; do not inadvertently remove outliers from benchmarks!
The parameters cut and std_factor can be set to None to skip their respective steps. The atomwise and norm_per_atom options are exclusive; they cannot both be True.
- Parameters:
key – The property key in the dataset to check for high values
atomwise – True if the property is defined per atom along axis 1; otherwise the property is treated as a whole-system value
norm_per_atom – True if the property should be normalized by atom counts
species_key – which array represents atom presence; required if norm_per_atom is True
cut – if a value exceeds mu + cut, the system is removed. This step is performed first.
std_factor – if (value - mu)/std > std_factor, the system is removed. This step is performed second.
norm_axis – if not None, the norm of the property array is taken along this axis. Useful for vector properties such as forces.
- Returns:
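As a hedged sketch, the following removes systems whose per-atom energy is a strong statistical outlier; the array names "energy" and "species" are assumptions for illustration.
>>> database.remove_high_property(
...     "energy",
...     atomwise=False,         # one energy value per system
...     norm_per_atom=True,     # compare energies per atom
...     species_key="species",  # needed to count the real atoms
...     std_factor=10,          # drop systems more than 10 standard deviations from the mean
... )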
- send_to_device(device: device = None)[source]
Move the database to an accelerator device if possible. In some circumstances this can accelerate training.
Note
If the database is moved to a GPU, pin_memory will be set to False and num_workers will be set to 0.
- Parameters:
device – device to move to; if None, try to auto-detect.
- Returns:
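For example, to request a GPU when one is available and otherwise let the method auto-detect:
>>> import torch
>>> database.send_to_device(torch.device("cuda") if torch.cuda.is_available() else None)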
- sort_by_index(index_name: str = 'indices')[source]
Sort arrays in each split of the database by an index key.
The default is ‘indices’; ‘split_indices’ or any other array name in the database may also be used.
- Parameters:
index_name – the name of the array to sort by.
- Returns:
None
- trim_by_species(species_key: str, keep_splits_same_size: bool = True)[source]
Remove any excess padding in a database.
- Parameters:
species_key – which array to use to mark atom presence.
keep_splits_same_size – True: trim by the minimum amount across splits, so all splits keep the same padded size; False: trim each split by its own maximum amount.
- Returns:
None
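For example, assuming the atom-presence array is named "species":
>>> database.trim_by_species("species")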
- write_h5(split: str | None = None, h5path: str | None = None, species_key: str = 'species', overwrite: bool = False)[source]
Write this database to the pyanitools h5 format. See hippynn.databases.h5_pyanitools.write_h5() for details.
Note
This function will error if h5py is not installed.
- Parameters:
split
h5path
species_key
overwrite
- Returns:
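A sketch of writing a single split, assuming (not confirmed by this page) that split names a split and h5path is the output file; requires h5py.
>>> database.write_h5(split="train", h5path="train_data.h5", overwrite=True)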
- write_npz(file: str, record_split_masks: bool = True, compressed: bool = True, overwrite: bool = False, split_prefix: str | None = None, return_only: bool = False)[source]
- Parameters:
file – str, Path, or file object compatible with np.save
record_split_masks – whether to generate and place masks for the splits into the saved database.
compressed – whether to use np.savez_compressed (True) or np.savez
overwrite – whether to overwrite an existing file at this path. Only used if file is a str or Path.
split_prefix – optionally override the prefix for the masks computed by the splits.
return_only – if True, ignore the file string and just return the resulting dictionary of numpy arrays.
- Returns:
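For example, the database can be saved along with its split masks, or the dictionary of arrays can be returned without writing to disk; the file name here is a placeholder.
>>> database.write_npz("my_database.npz", record_split_masks=True, overwrite=True)
>>> arrays = database.write_npz("", return_only=True)  # the file argument is ignored when return_only=True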
- property var_list